The toolbox accesses many of the databases on the Web and other online data sources. It allows you to copy data into the MATLAB^{®} Workspace, and read and write to files with standard bioinformatic formats. It also reads many common genome file formats, so that you do not have to write and maintain your own file readers.
Web-based databases — You can directly access public databases on the Web and copy sequence and gene expression information into the MATLAB environment.
The sequence databases currently supported are GenBank^{®} (getgenbank), GenPept (getgenpept), European Molecular Biology Laboratory (EMBL) (getembl), and Protein Data Bank (PDB) (getpdb).
You can also access data from the NCBI Gene Expression Omnibus (GEO) Web site by using a single function (getgeodata).
Get multiply aligned sequences (gethmmalignment), hidden Markov model profiles (gethmmprof), and phylogenetic tree data (gethmmtree) from the PFAM database.
Gene Ontology database —
Load the database from the Web into a gene ontology object (geneont). Select sections of the ontology with methods for the geneont object (geneont.getancestors, geneont.getdescendants, geneont.getmatrix, geneont.getrelatives), and manipulate data with utility functions (goannotread, num2goid).
Read data from instruments —
Read data generated from gene sequencing instruments (scfread, joinseq, traceplot), mass spectrometers (jcampread), and Agilent^{®} microarray scanners (agferead).
Reading data formats — The toolbox provides a number of functions for reading data from common bioinformatic file formats.
Sequence data: GenBank (genbankread), GenPept (genpeptread), EMBL (emblread), PDB (pdbread), and FASTA (fastaread)
Multiply aligned sequences: ClustalW and GCG formats (multialignread)
Gene expression data from microarrays: Gene Expression Omnibus (GEO) data (geosoftread), GenePix^{®} data in GPR and GAL files (gprread, galread), SPOT data (sptread), Affymetrix^{®} GeneChip^{®} data (affyread), and ImaGene^{®} results files (imageneread)
Hidden Markov model profiles: PFAM-HMM file (pfamhmmread)
Writing data formats —
The functions for getting data from the Web include an option to save the data to a file. In addition, a dedicated function writes sequence data to a file in FASTA format (fastawrite).
BLAST searches — Request
Web-based BLAST searches (blastncbi), get the results from a search (getblast), and read results from a previously saved BLAST-formatted report file (blastread).
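As a minimal sketch of the file functions described above, the following writes a sequence to a FASTA file with fastawrite and reads it back with fastaread. It assumes the Bioinformatics Toolbox is installed; the header text, sequence, and temporary file name are hypothetical values chosen for illustration.

```matlab
% Hypothetical header and sequence used only for illustration.
header = 'example_sequence';
sequence = 'ATGGCCAAGTAA';

% Write to a new temporary file (fastawrite appends if the file exists).
fileName = [tempname '.fasta'];
fastawrite(fileName, header, sequence);

% Read the file back into a structure with Header and Sequence fields.
record = fastaread(fileName);
disp(record.Header)
disp(record.Sequence)

% Remove the temporary file.
delete(fileName);
```

The Web-based functions (for example, getgenbank) follow a similar pattern but need a network connection, so they are not shown here.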
The MATLAB environment has built-in support for other industry-standard file formats including Microsoft^{®} Excel^{®} and comma-separated-value (CSV) files. Additional functions perform ASCII and low-level binary I/O, allowing you to develop custom functions for working with any data format.
You can select from a list of analysis methods to compare nucleotide or amino acid sequences using pairwise or multiple sequence alignment functions.
Pairwise sequence alignment —
The toolbox provides efficient implementations of standard algorithms for pairwise sequence alignment, such as the Needleman-Wunsch (nwalign) and Smith-Waterman (swalign) algorithms. The toolbox also includes standard scoring matrices such as the PAM and BLOSUM families of matrices (blosum, dayhoff, gonnet, nuc44, pam). Visualize sequence similarities with seqdotplot and sequence alignment results with showalignment.
Multiple sequence alignment —
The toolbox provides functions for multiple sequence alignment (multialign, profalign) and functions that support multiple sequences (multialignread, fastaread, showalignment). There is also a graphical interface (seqalignviewer) for viewing the results of a multiple sequence alignment and adjusting it manually.
Multiple sequence profiles —
Implementations for multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).
Biological codes — Look
up the letters or numeric equivalents for commonly used biological codes (aminolookup, baselookup, geneticcode, revgeneticcode).
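The pairwise alignment functions named above can be tried with a short sketch like the following. It assumes the Bioinformatics Toolbox is installed and relies on the default scoring matrix and gap penalties; the two amino acid sequences are hypothetical test inputs.

```matlab
% Two short amino acid sequences (hypothetical test inputs).
seq1 = 'VSPAGMASGYD';
seq2 = 'IPGKASYD';

% Global alignment (Needleman-Wunsch) with default scoring options.
[globalScore, globalAlignment] = nwalign(seq1, seq2);

% Local alignment (Smith-Waterman) with default scoring options.
[localScore, localAlignment] = swalign(seq1, seq2);

% Each alignment is a 3-row character array:
% row 1 = first sequence, row 2 = match symbols, row 3 = second sequence.
disp(globalAlignment)
disp(localAlignment)
```

Passing an alignment to showalignment displays it with color coding, which is easier to read for longer sequences.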
You can manipulate and analyze your sequences to gain a deeper understanding of the physical, chemical, and biological characteristics of your data. Use a graphical user interface (GUI) with many of the sequence functions in the toolbox (seqviewer).
Sequence conversion and manipulation —
The toolbox provides routines for common operations, such as converting DNA or RNA sequences to amino acid sequences, that are basic to working with nucleic acid and protein sequences (aa2int, aa2nt, dna2rna, rna2dna, int2aa, int2nt, nt2aa, nt2int, seqcomplement, seqrcomplement, seqreverse).
You can manipulate your sequence by performing an in silico digestion with restriction endonucleases (restrict) and proteases (cleave).
Sequence statistics —
Determine various statistics about a sequence (aacount, basecount, codoncount, dimercount, nmercount, ntdensity, codonbias, cpgisland, oligoprop), search for specific patterns within a sequence (seqshowwords, seqwordcount), or search for open reading frames (seqshoworfs). In addition, you can create random sequences for test cases (randseq).
Sequence utilities —
Determine a consensus sequence from a set of multiply aligned amino acid or nucleotide sequences (seqconsensus), or a sequence profile (seqprofile). Format a sequence for display (seqdisp) or graphically show a sequence alignment with frequency data (seqlogo).
Additional MATLAB functions efficiently handle string operations with regular expressions (regexp, seq2regexp) to look for specific patterns in a sequence and search through a library for string matches (seqmatch).
Look for possible cleavage sites in a DNA/RNA sequence by searching for palindromes (palindromes).
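A few of the conversion and statistics functions listed above can be combined as in the following sketch. It assumes the Bioinformatics Toolbox is installed and uses the default (standard) genetic code; the DNA string is a hypothetical example.

```matlab
% Hypothetical 12-base coding sequence: ATG GCC AAG TAA.
dna = 'ATGGCCAAGTAA';

rna = dna2rna(dna);            % replace T with U
revComp = seqrcomplement(dna); % reverse complement of the DNA strand
protein = nt2aa(dna);          % translate with the standard genetic code
counts = basecount(dna);       % structure with fields A, C, G, and T

fprintf('RNA:                %s\n', rna);
fprintf('Reverse complement: %s\n', revComp);
fprintf('Protein:            %s\n', protein);
fprintf('A=%d C=%d G=%d T=%d\n', counts.A, counts.C, counts.G, counts.T);
```

Here the translation ends in `*`, which marks the stop codon TAA.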
You can use a collection of protein analysis methods to extract information from your data. You can determine protein characteristics and simulate enzyme cleavage reactions. The toolbox provides functions to calculate various properties of a protein sequence, such as the atomic composition (atomiccomp), molecular weight (molweight), and isoelectric point (isoelectric). You can cleave a protein with an enzyme (cleave, rebasecuts) and create distance and Ramachandran plots for PDB data (pdbdistplot, ramachandran). The toolbox contains a graphical user interface for protein analysis (proteinplot) and a viewer for plotting 3-D protein and other molecular structures with information from molecule model files, such as PDB files (molviewer).
Amino acid sequence utilities —
Calculate amino acid statistics for a sequence (aacount) and get information about character codes (aminolookup).
You can use functions for phylogenetic tree building and analysis. There is also a GUI to draw phylograms (trees).
Phylogenetic tree data —
Read Newick-formatted tree files into the MATLAB Workspace as phylogenetic tree objects (phytreeread, phytree), and write tree objects back to Newick-formatted files (phytreewrite).
Create a phylogenetic tree —
Calculate the pairwise distance between biological sequences (seqpdist), estimate the substitution rates (dnds, dndsml), build a phylogenetic tree from pairwise distances (seqlinkage, seqneighjoin, reroot), and view the tree in an interactive GUI that allows you to view, edit, and explore the data (phytreeviewer or view). This GUI also allows you to prune branches, reorder, rename, and explore distances.
Phylogenetic tree object methods —
You can access the functionality of the phytreeviewer GUI using methods for a phylogenetic tree object (phytree). Get property values (get) and node names (getbyname). Calculate the patristic distances between pairs of leaf nodes (pdist, weights) and draw a phylogenetic tree object in a MATLAB Figure window as a phylogram, cladogram, or radial treeplot (plot). Manipulate tree data by selecting branches and leaves using a specified criterion (select, subtree) and removing nodes (prune). Compare trees (getcanonical) and use Newick-formatted strings (getnewickstr).
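The tree-building steps described above (pairwise distances, then linkage, then inspection) might look like the following sketch. It assumes the Bioinformatics Toolbox is installed; the four equal-length nucleotide sequences and their names are hypothetical, and the p-distance method is chosen only because it is simple to reason about.

```matlab
% Four hypothetical, already aligned nucleotide sequences.
seqs = {'ACGTACGTAC', 'ACGTACGTTC', 'ACGAACGTTC', 'TCGAACGTTG'};
names = {'A', 'B', 'C', 'D'};

% Pairwise distances as the proportion of differing sites.
distances = seqpdist(seqs, 'Method', 'p-distance', 'Alphabet', 'NT');

% Build a tree by average linkage (UPGMA) and label the leaves.
tree = seqlinkage(distances, 'average', names);

% Inspect the result through phytree object methods.
numLeaves = get(tree, 'NumLeaves');
newick = getnewickstr(tree);
fprintf('%d leaves: %s\n', numLeaves, newick);
```

Calling plot(tree) or phytreeviewer(tree) then draws the same object in a Figure window or in the interactive GUI.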
The MATLAB environment is widely used for microarray data analysis, including reading, filtering, normalizing, and visualizing microarray data. However, the standard normalization and visualization tools that scientists use can be difficult to implement. The toolbox includes these standard functions:
Microarray data — Read Affymetrix GeneChip files
(affyread) and plot data (probesetplot), ImaGene results files (imageneread), SPOT files (sptread), and Agilent microarray scanner files (agferead). Read GenePix GPR files (gprread) and GAL files (galread). Get Gene Expression Omnibus (GEO) data from the Web (getgeodata) and read GEO data from files (geosoftread).
A utility function (magetfield) extracts data from one of the microarray reader functions (gprread, agferead, sptread, imageneread).
Microarray normalization and filtering —
The toolbox provides a number of methods for normalizing microarray data, such as lowess normalization (malowess), mean normalization (manorm), and quantile normalization across multiple arrays (quantilenorm). You can use filtering functions to clean raw data before analysis (geneentropyfilter, genelowvalfilter, generangefilter, genevarfilter), and calculate the range and variance of values (exprprofrange, exprprofvar).
Microarray visualization —
The toolbox contains routines for visualizing microarray data. These routines include spatial plots of microarray data (maimage, redgreencmap), box plots (maboxplot), log-log plots (maloglog), and intensity-ratio plots (mairplot). You can also view clustered expression profiles (clustergram, redgreencmap). You can create 2-D scatter plots of principal components from the microarray data (mapcaplot).
Microarray utility functions —
Use the following functions to work with Affymetrix GeneChip data sets. Get library information for a probe (probelibraryinfo), gene information from a probe set (probesetlookup), and probe set values from CEL and CDF information (probesetvalues). Show probe set information from NetAffx™ Analysis Center (probesetlink) and plot probe set values (probesetplot).
The toolbox accesses statistical routines to perform cluster analysis and to visualize the results. You can view your data through statistical visualizations such as dendrograms and classification and regression trees.
The toolbox includes functions, objects, and methods for creating, storing, and accessing microarray data.
The object constructor function, DataMatrix, lets you create a DataMatrix object to encapsulate data and metadata from a microarray experiment. A DataMatrix object stores experimental data in a matrix, with rows typically corresponding to gene names or probe identifiers, and columns typically corresponding to sample identifiers. A DataMatrix object also stores metadata, including the gene names or probe identifiers (as the row names) and sample identifiers (as the column names).
You can reference microarray expression values in a DataMatrix object the same way you reference data in a MATLAB array, that is, by using linear or logical indexing. Alternatively, you can reference this experimental data by gene (probe) identifiers and sample identifiers. Indexing by these identifiers lets you quickly and conveniently access subsets of the data without having to maintain additional index arrays.
Many MATLAB operators and arithmetic functions are available to DataMatrix objects by means of methods. These methods let you modify, combine, compare, analyze, plot, and access information from DataMatrix objects. Additionally, you can easily extend the functionality by using general element-wise functions, dmarrayfun and dmbsxfun, and by manually accessing the properties of a DataMatrix object.
Note: For more information on creating and using DataMatrix objects, see Representing Expression Data Values in DataMatrix Objects.
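To illustrate indexing by identifiers, here is a minimal sketch that builds a small DataMatrix object and reads values by row and column name. It assumes the Bioinformatics Toolbox is installed and uses the fully qualified constructor name, bioma.data.DataMatrix; the gene names, sample names, and values are hypothetical.

```matlab
% Hypothetical expression values: 3 genes (rows) by 2 samples (columns).
values = [1.2 3.4; 5.6 7.8; 9.0 1.1];
geneNames = {'geneA', 'geneB', 'geneC'};
sampleNames = {'sample1', 'sample2'};

% Construct the object with row and column names as metadata.
dm = bioma.data.DataMatrix(values, geneNames, sampleNames);

% Index by identifiers instead of numeric positions.
% double() converts the result to a plain numeric value.
oneValue = double(dm('geneB', 'sample2'));

% Positional indexing still works, as with a regular array.
sameValue = double(dm(2, 2));

fprintf('geneB in sample2: %g (by name), %g (by position)\n', oneValue, sameValue);
```

Indexing by name keeps analysis code readable when rows are filtered or reordered, because the identifiers travel with the data.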
The mass spectrometry functions preprocess and classify raw data from SELDI-TOF and MALDI-TOF spectrometers and use statistical learning functions to identify patterns.
Reading raw data —
Load raw mass/charge and ion intensity data from comma-separated-value (CSV) files, or read a JCAMP-DX-formatted file with mass spectrometry data (jcampread) into the MATLAB environment. If your data is in TXT files, you can use the importdata function.
Preprocessing raw data —
Resample high-resolution data to a lower resolution (msresample) where the extra data points are not needed. Correct the baseline (msbackadj). Align a spectrum to a set of reference masses (msalign) and visually verify the alignment (msheatmap). Normalize spectra so that they can be compared (msnorm), and filter out noise (mslowess and mssgolay).
Spectrum analysis —
Load spectra into a GUI (msviewer) for selecting mass peaks and further analysis.
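The preprocessing steps listed above can be chained as in the following sketch. It assumes the Bioinformatics Toolbox is installed and uses default options for each function; the spectrum is synthetic (two Gaussian peaks on a decaying baseline) and stands in for real instrument data.

```matlab
% Synthetic spectrum (hypothetical data): two peaks on a decaying baseline.
mz = linspace(2000, 10000, 8000)';
intensity = 50*exp(-((mz - 4000)/20).^2) ...
          + 30*exp(-((mz - 7500)/30).^2) ...
          + 10*exp(-mz/4000);

% Resample to 2000 points, then correct the baseline, smooth, and normalize.
[mzLow, yLow] = msresample(mz, intensity, 2000);
yBase = msbackadj(mzLow, yLow);     % subtract the estimated baseline
ySmooth = mslowess(mzLow, yBase);   % lowess smoothing to reduce noise
yNorm = msnorm(mzLow, ySmooth);     % normalize the area under the curve

plot(mzLow, yNorm)
xlabel('Mass/Charge (M/Z)')
ylabel('Relative intensity')
```

The order shown here is one common choice, not a requirement; with several spectra, alignment with msalign would typically be added before normalization.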
The following graphic illustrates the roles of the various mass spectrometry functions in the toolbox.
Graph theory functions in the toolbox apply basic graph theory algorithms to sparse matrices. A sparse matrix represents a graph: the nonzero entries in the matrix represent the edges of the graph, and the values of these entries represent the associated weight (cost, distance, length, or capacity) of each edge. Graph algorithms that use the weight information ignore an edge whose value is NaN or Inf. Graph algorithms that do not use the weight information still consider such an edge, because these algorithms look only at the connectivity described by the sparse matrix and not at the values stored in it.
Sparse matrices can represent four types of graphs:
Directed Graph — Sparse matrix, either double real or logical. Row (column) index indicates the source (target) of the edge. Self-loops (values in the diagonal) are allowed, although most of the algorithms ignore these values.
Undirected Graph — Lower triangle of a sparse matrix, either double real or logical. An algorithm expecting an undirected graph ignores values stored in the upper triangle of the sparse matrix and values in the diagonal.
Directed Acyclic Graph (DAG) — Sparse matrix, double real or logical, with zero values in the diagonal. While a zero-valued diagonal is a requirement of a DAG, it does not guarantee a DAG. An algorithm expecting a DAG does not test for cycles because the test would add unwanted complexity.
Spanning Tree — Undirected graph with no cycles and with one connected component.
There are no attributes attached to the graphs; sparse matrices representing all four types of graphs can be passed to any graph algorithm. All functions will return an error on nonsquare sparse matrices.
Graph algorithms do not pretest for graph properties because such tests can introduce a time penalty. For example, there is an efficient shortest path algorithm for DAGs; however, testing whether a graph is acyclic is expensive compared to the algorithm. Therefore, it is important to select a graph theory function and properties appropriate for the type of graph represented by your input matrix. If the algorithm receives a graph type that differs from what it expects, it will either:
Return an error when it reaches an inconsistency. For example, this happens if you pass a cyclic graph to the graphshortestpath function and specify Acyclic as the method property.
Produce an invalid result. For example, if you pass a directed graph to a function with an algorithm that expects an undirected graph, it will ignore values in the upper triangle of the sparse matrix.
The graph theory functions include graphallshortestpaths, graphconncomp, graphisdag, graphisomorphism, graphisspantree, graphmaxflow, graphminspantree, graphpred2path, graphshortestpath, graphtopoorder, and graphtraverse.
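The sparse-matrix convention described above (row index as source, column index as target, nonzero value as weight) can be sketched as follows. The example assumes a release of the Bioinformatics Toolbox in which graphisdag and graphshortestpath are available; the four-node graph and its weights are hypothetical.

```matlab
% Directed graph with four nodes (hypothetical weights):
%   1 -> 2 (4),  1 -> 3 (2),  2 -> 4 (3),  3 -> 4 (1)
sources = [1 1 2 3];
targets = [2 3 4 4];
weights = [4 2 3 1];
DG = sparse(sources, targets, weights, 4, 4);

% Confirm the graph is acyclic before choosing the Acyclic method,
% because the shortest path function does not test this itself.
isAcyclic = graphisdag(DG);

% Shortest path from node 1 to node 4 with the default method.
[dist, path] = graphshortestpath(DG, 1, 4);

% The same query with the method specialized for DAGs.
if isAcyclic
    [distDag, pathDag] = graphshortestpath(DG, 1, 4, 'Method', 'Acyclic');
end

fprintf('Shortest distance: %g via nodes %s\n', dist, mat2str(path));
```

For this graph the route through node 3 (total weight 3) beats the route through node 2 (total weight 7).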
The toolbox includes functions, objects, and methods for creating, viewing, and manipulating graphs such as interactive maps, hierarchy plots, and pathways. This allows you to view relationships between data.
The object constructor function (biograph) lets you create a biograph object to hold graph data. Methods of the biograph object let you calculate the position of nodes (dolayout), draw the graph (view), get handles to the nodes and edges (getnodesbyid and getedgesbynodeid) to further query information, and find relations between the nodes (getancestors, getdescendants, and getrelatives). There are also methods that apply basic graph theory algorithms to the biograph object.
Various properties of a biograph object let you programmatically change the properties of the rendered graph. You can customize the node representation, for example, by drawing pie charts inside every node (CustomNodeDrawFcn). You can also associate your own callback functions with nodes and edges of the graph, for example, to open a Web page with more information about the nodes (NodeCallback and EdgeCallback).
You can classify and identify features in data sets, set up cross-validation experiments, and compare different classification methods.
The toolbox provides functions that build on the classification and statistical learning tools in the Statistics and Machine Learning Toolbox™ software (classify, kmeans, and treefit).
These functions include imputation tools (knnimpute) and K-nearest neighbor classifiers (knnclassify).
Other functions set up cross-validation experiments (crossvalind) and compare the performance of different classification methods (classperf). In addition, there are tools for selecting diverse and discriminating features (rankfeatures, randfeatures).
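A cross-validation experiment of the kind described above could be set up as in the following sketch. It assumes both the Bioinformatics Toolbox and the Statistics and Machine Learning Toolbox are installed, and it uses the fisheriris sample data set that ships with the latter; the choice of 10 folds is arbitrary.

```matlab
% Sample data: meas (150-by-4 measurements) and species (150 class labels).
load fisheriris

% Assign every observation to one of 10 folds.
numFolds = 10;
indices = crossvalind('Kfold', species, numFolds);

% Track classifier performance against the true labels.
cp = classperf(species);

for fold = 1:numFolds
    testIdx = (indices == fold);
    trainIdx = ~testIdx;
    % Linear discriminant classifier trained on the remaining folds.
    predicted = classify(meas(testIdx, :), meas(trainIdx, :), species(trainIdx));
    % Accumulate results for the held-out observations.
    classperf(cp, predicted, testIdx);
end

fprintf('Cross-validated error rate: %.3f\n', cp.ErrorRate);
```

Swapping classify for another classifier inside the loop gives a like-for-like comparison, because the fold assignment in indices stays the same.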
The MATLAB environment lets you prototype and develop algorithms and easily compare alternatives.
Integrated environment — Explore biological data in an environment that integrates programming and visualization. Create reports and plots with the built-in functions for mathematics, graphics, and statistics.
Open environment — Access the source code for the toolbox functions. The toolbox includes many of the basic bioinformatics functions you will need to use, and it includes prototypes for some of the more advanced functions. Modify these functions to create your own custom solutions.
Interactive programming language — Test your ideas by typing functions that are interpreted interactively with a language whose basic data element is an array. The arrays do not require dimensioning and allow you to solve many technical computing problems. Using matrices for sequences or groups of sequences allows you to work efficiently and not worry about writing loops or other programming controls.
Programming tools — Use a visual debugger for algorithm development and refinement and an algorithm performance profiler to accelerate development.
You can visually compare pairwise sequence alignments, multiply aligned sequences, gene expression data from microarrays, and plot nucleic acid and protein characteristics. The 2-D and volume visualization features let you create custom graphical representations of multidimensional data sets. You can also create montages and overlays, and export finished graphics to an Adobe^{®} PostScript^{®} image file or copy directly into Microsoft PowerPoint^{®}.
The open MATLAB environment lets you share your analysis solutions with other users, and it includes tools to create custom software applications. With the addition of MATLAB Compiler™ and MATLAB Compiler SDK™, you can create standalone applications independent of the MATLAB environment.
Share algorithms with other users — You can share data analysis algorithms created in the MATLAB language across all supported platforms by giving files to other users. You can also create GUIs within the MATLAB environment using the Graphical User Interface Development Environment (GUIDE).
Deploy MATLAB GUIs — Create a GUI within the MATLAB environment using GUIDE, and then use MATLAB Compiler software to create a standalone GUI application that runs separately from the MATLAB environment.
Create dynamic link libraries (DLLs) — Use MATLAB Compiler software to create DLLs for your functions, and then link these libraries to other programming environments such as C and C++.
Create COM objects — Use MATLAB Compiler SDK to create COM objects, and then use a COM-compatible programming environment (for example, Visual Basic^{®}) to create a standalone application.
Create Excel add-ins — Use MATLAB Compiler to create Excel add-in functions, and then use these functions with Excel spreadsheets.
Create Java^{®} classes — Use MATLAB Compiler SDK to automatically generate Java classes from algorithms written in the MATLAB programming language. You can run these classes outside the MATLAB environment.