cuffcompare
Compare assembled transcripts across multiple experiments
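Syntax
statsFile = cuffcompare(gtfFiles)
statsFile = cuffcompare(gtfFiles,compareOptions)
statsFile = cuffcompare(gtfFiles,Name,Value)
[statsFile,combinedGTF,lociFile,trackingFile] = cuffcompare(___)
Description
statsFile = cuffcompare(gtfFiles) compares the assembled transcripts in gtfFiles and returns summary statistics in the output file statsFile [1].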
cuffcompare requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.
statsFile = cuffcompare(gtfFiles,compareOptions) uses additional options specified by compareOptions.
statsFile = cuffcompare(gtfFiles,Name,Value) uses additional options specified by one or more name-value pair arguments. For example, statsFile = cuffcompare(gtfFile,'OutputPrefix',"cuffComp") uses "cuffComp" as the prefix of the output file names.
[statsFile,combinedGTF,lociFile,trackingFile] = cuffcompare(___) returns the names of the output files using any of the input argument combinations in the previous syntaxes. By default, the function saves all files to the current directory.
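The four-output syntax can be sketched as follows. This is a minimal illustration, assuming the Cufflinks support package is installed and that the two GTF files named below (placeholders borrowed from the gtfFiles example on this page) exist in the current folder.

```matlab
% Hypothetical input files; replace with GTF files produced by cufflinks.
gtfs = ["Myco_1_1.transcripts.gtf","Myco_2_1.transcripts.gtf"];

% Request all four output file names and use a custom output prefix.
[statsFile,combinedGTF,lociFile,trackingFile] = ...
    cuffcompare(gtfs,'OutputPrefix',"cuffComp");

% Each output is a file name; the files are saved to the current
% directory by default.
disp([statsFile; combinedGTF; lociFile; trackingFile])
```

The outputs are file names, not the file contents, so further analysis requires reading the files.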
Examples
Assemble Transcriptome and Perform Differential Expression Testing
Create a CufflinksOptions object to define cufflinks options, such as the number of parallel threads and the output directory to store the results.
cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";
The SAM files provided for this example contain aligned reads for Mycoplasma pneumoniae from two samples with three replicates each. The reads are simulated 100bp-reads for two genes (gyrA and gyrB) located next to each other on the genome. All the reads are sorted by reference position, as required by cufflinks.
sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",...
    "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];
Assemble the transcriptome from the aligned reads.
[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);
gtfs is a list of GTF files that contain assembled isoforms.
Compare the assembled isoforms using cuffcompare.
stats = cuffcompare(gtfs);
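Because stats is the name of a text file rather than the statistics themselves, one way to inspect the result is to read the file and print it. This is a small sketch that assumes the previous call completed and that the file is in the current directory.

```matlab
% stats holds the name of the statistics file returned by cuffcompare.
% Read the whole file as text and print it to the Command Window.
statsText = fileread(char(stats));
disp(statsText)
```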
Merge the assembled transcripts using cuffmerge.
mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');
mergedGTF reports only one transcript. This is because the two genes of interest are located next to each other, and cuffmerge cannot distinguish two distinct genes. To guide cuffmerge, use a reference GTF (gyrAB.gtf) containing information about these two genes. If the file is not located in the same directory that you run cuffmerge from, you must also specify the file path.
gyrAB = which('gyrAB.gtf');
mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',...
    'ReferenceGTF',gyrAB);
Calculate abundances (expression levels) from aligned reads for each sample.
abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
    'OutputDirectory','./cuffquantOutput1');
abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],...
    'OutputDirectory','./cuffquantOutput2');
Assess the significance of changes in expression for genes and transcripts between conditions by performing the differential testing using cuffdiff.
The cuffdiff function operates in two distinct steps: the function first estimates abundances from aligned reads, and then performs the statistical analysis. In some cases (for example, distributing computing load across multiple workers), performing the two steps separately is desirable. After performing the first step with cuffquant, you can then use the binary CXB output file as an input to cuffdiff to perform statistical analysis. Because cuffdiff returns several files, specifying the output directory is recommended.
isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],...
    'OutputDirectory','./cuffdiffOutput');
Display a table containing the differential expression test results for the two genes gyrB and gyrA.
readtable(isoformDiff,'FileType','text')
ans = 2×14 table
        test_id           gene_id       gene             locus             sample_1   sample_2   status    value_1       value_2      log2_fold_change_   test_stat   p_value   q_value   significant
    ________________   _____________   ______   _______________________   ________   ________   ______   __________   __________   _________________   _________   _______   _______   ___________
    'TCONS_00000001'   'XLOC_000001'   'gyrB'   'NC_000912.1:2868-7340'     'q1'       'q2'      'OK'    1.0913e+05   4.2228e+05         1.9522           7.8886      5e-05     5e-05       'yes'
    'TCONS_00000002'   'XLOC_000001'   'gyrA'   'NC_000912.1:2868-7340'     'q1'       'q2'      'OK'    3.5158e+05   1.1546e+05        -1.6064          -7.3811      5e-05     5e-05       'yes'
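To work with the test results rather than only display them, you can keep the table in a variable and filter it. The sketch below assumes that isoformDiff is the file name returned by cuffdiff and that readtable produces the column names shown in the output above (for example, log2_fold_change_ and significant). Column names can differ if readtable is configured to preserve the original headers.

```matlab
% Read the cuffdiff results into a table.
diffTable = readtable(isoformDiff,'FileType','text');

% Keep rows flagged as significant and sort by fold change, largest first.
isSig = strcmp(diffTable.significant,'yes');
sigTable = sortrows(diffTable(isSig,:),'log2_fold_change_','descend');

% Show a compact view of selected columns.
sigTable(:,{'gene','log2_fold_change_','q_value'})
```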
You can use cuffnorm to generate normalized expression tables for further analyses. cuffnorm results are useful when you have many samples and you want to cluster them or plot expression levels for genes that are important in your study. Note that you cannot perform differential expression analysis using cuffnorm.
Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.
alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
    ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]}
isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,...
    'OutputDirectory', './cuffnormOutput');
Display a table containing the normalized expression levels for each transcript.
readtable(isoformNorm,'FileType','text')
ans = 2×7 table
      tracking_id         q1_0         q1_2         q1_1         q2_1         q2_0         q2_2
    ________________   __________   __________   __________   __________   __________   __________
    'TCONS_00000001'   1.0913e+05        78628   1.2132e+05   4.3639e+05   4.2228e+05   4.2814e+05
    'TCONS_00000002'   3.5158e+05   3.7458e+05   3.4238e+05   1.0483e+05   1.1546e+05   1.1105e+05
Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.
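The naming format can be unpacked with ordinary string functions. The following sketch uses the column names copied from the table above, splits each into its condition and replicate parts, and reorders the names by condition and replicate number. The variable names are illustrative only.

```matlab
% Column names copied from the cuffnorm table shown above.
colNames = ["q1_0","q1_2","q1_1","q2_1","q2_0","q2_2"];

% Split each name at the underscore into condition and replicate parts.
parts = split(colNames(:),"_");        % 6-by-2 string array
condition = parts(:,1);                % "q1" or "q2"
replicate = str2double(parts(:,2));    % replicate number N

% Sort by condition first, then by replicate number.
[~,order] = sortrows(table(condition,replicate));
sortedNames = colNames(order)
```

The same order vector can be used to rearrange the numeric columns of the table so that replicates of each condition appear next to each other.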
Input Arguments
gtfFiles — Names of GTF files
string array | cell array of character vectors
Names of GTF files, specified as a string vector or cell array of character vectors. Each GTF file corresponds to a sample produced by cufflinks.
Example: ["Myco_1_1.transcripts.gtf","Myco_2_1.transcripts.gtf"]
Data Types: string | cell
compareOptions — cuffcompare options
CuffCompareOptions object | character vector | string
cuffcompare options, specified as a CuffCompareOptions object, character vector, or string. The character vector or string must be in the original cuffcompare option syntax (prefixed by one or two dashes), such as '-d 100 -e 80' [1].
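The two ways of passing compareOptions can be sketched side by side. This example assumes that gtfs is a list of GTF file names, such as the cufflinks output from the example on this page. It also assumes that the CuffCompareOptions property names match the name-value arguments documented below, and that the native flags -d and -e correspond to MaxGroupingRange and MaxAccuracyRange. Check the CuffCompareOptions reference before relying on that mapping.

```matlab
% Option 1: native cuffcompare syntax, using the example string above.
stats1 = cuffcompare(gtfs,'-d 100 -e 80');

% Option 2: a CuffCompareOptions object. The property names are assumed
% to mirror the name-value arguments of cuffcompare.
cmpOpt = CuffCompareOptions;
cmpOpt.MaxGroupingRange = 100;   % assumed counterpart of -d
cmpOpt.MaxAccuracyRange = 80;    % assumed counterpart of -e
stats2 = cuffcompare(gtfs,cmpOpt);
```

An options object is easier to reuse and inspect across several calls, while the native string is convenient when you already have a working command line.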
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: statsFile = cuffcompare(gtfFile,'OutputPrefix',"cuffComp",'MaxGroupingRange',90)
ConsensusPrefix — Prefix for consensus transcript names
"TCONS" (default) | string | character vector
Prefix for consensus transcript names in the output combined.gtf file, specified as a string or character vector. This option must be a string or character vector with a non-zero length.
Example: 'ConsensusPrefix',"consensusTs"
Data Types: char | string
DiscardIntronRedundant — Flag to ignore intron-redundant transfrags
false (default) | true
Flag to ignore intron-redundant transfrags if they have the same 5' end but different 3' ends, specified as true or false.
Example: 'DiscardIntronRedundant',true
Data Types: logical
DiscardSingleExonAll — Flag to discard single-exon transfrags and reference transcripts
false (default) | true
Flag to discard single-exon transfrags and reference transcripts, specified as true or false.
Example: 'DiscardSingleExonAll',true
Data Types: logical
DiscardSingleExonReference — Flag to discard single-exon reference transcripts
false (default) | true
Flag to discard single-exon reference transcripts, specified as true or false.
Example: 'DiscardSingleExonReference',true
Data Types: logical
ExtraCommand — Additional commands
"" (default) | character vector | string
Additional commands, specified as a character vector or string. The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB® properties.
Example: 'ExtraCommand',"--library-type fr-secondstrand"
Data Types: char | string
GTFManifest — Name of text file containing list of GTF files to process
string | character vector
Name of the text file containing a list of GTF files to process, specified as a string or character vector. The file must contain one GTF file path per line. You can use this option as an alternative to passing an array of file names to cuffcompare.
Example: 'GTFManifest',"gtfManifestFile.txt"
Data Types: char | string
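A manifest file of the required form (one GTF file path per line) can be written with standard file I/O. The sketch below only creates the file. The GTF names are placeholders, and the manifest name is taken from the example above. How gtfFiles is supplied alongside 'GTFManifest' is not shown here, because this page does not specify it.

```matlab
% Hypothetical GTF file names; replace with your own paths.
gtfs = ["Myco_1_1.transcripts.gtf","Myco_2_1.transcripts.gtf"];

% Write one GTF file path per line, as the option requires.
manifestName = 'gtfManifestFile.txt';
fid = fopen(manifestName,'w');
assert(fid ~= -1,'Could not open %s for writing.',manifestName);
fprintf(fid,'%s\n',strjoin(gtfs,newline));
fclose(fid);

% Confirm the contents of the manifest.
disp(fileread(manifestName))
```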
GenericGFF — Flag to treat input GTF files as GFF
false (default) | true
Flag to treat input GTF files as GFF files, specified as true or false. Use this option when the input GFF or GTF files are not produced by cufflinks.
Example: 'GenericGFF',true
Data Types: logical
IncludeAll — Flag to include all available options
false (default) | true
Flag to include all available options when converting to the original (native) option syntax, specified as true or false. The original syntax is prefixed by one or two dashes. By default, the function converts only the specified options. If the value is true, the software converts all available options, with default values for unspecified options, to the original syntax.
Note
If you set IncludeAll to true, the software converts all available properties, using default values for unspecified properties. The only exception is when the default value of a property is NaN, Inf, [], '', or "". In this case, the software does not translate the corresponding property.
Example: 'IncludeAll',true
Data Types: logical
IncludeContained — Flag to include transfrags contained by other transfrags
false (default) | true
Flag to include transfrags contained by other transfrags in the same locus in the output combined.gtf, specified as true or false. By default, cuffcompare does not include these contained transfrags. If the value is true, the contained transfrags include a contained_in attribute indicating the first container transfrag found.
Example: 'IncludeContained',true
Data Types: logical
MaxAccuracyRange — Number of bases from terminal exons to use when assessing exon accuracy
100 (default) | positive integer
Number of bases from the free ends of terminal exons to use when assessing exon accuracy, specified as a positive integer.
Example: 'MaxAccuracyRange',80
Data Types: double
MaxGroupingRange — Number of bases to use for grouping transcript start sites
100 (default) | positive integer
Number of bases to use for grouping transcript start sites, specified as a positive integer.
Example: 'MaxGroupingRange',90
Data Types: double
OutputPrefix — Prefix for cuffcompare output files
"cuffcmp" (default) | string | character vector
Prefix for cuffcompare output files, specified as a string or character vector. This option must be a string or character vector with a non-zero length.
Example: 'OutputPrefix',"cuffcompareOut"
Data Types: char | string
ReferenceGTF — Name of GTF or GFF file containing reference transcripts
string | character vector
Name of the GTF or GFF file containing reference transcripts to compare to each sample, specified as a string or character vector. If you provide a file, the function compares each sample to the references in the file and marks isoforms as overlapping, matching, or novel. The function stores these tags in the .refmap and .tmap output files.
Example: 'ReferenceGTF',"references.gtf"
Data Types: char | string
SequenceDirectory — Name of directory containing FASTA sequences to classify input transcripts as repeats
string | character vector
Name of directory containing FASTA sequences to classify input transcripts as repeats, specified as a string or character vector. The directory must contain FASTA-format files with the underlying genomic sequences, one FASTA file per reference. Name each FASTA file after the chromosome with the extension .fa or .fasta.
Example: 'SequenceDirectory',"./SequenceDirectory/"
Data Types: char | string
SnCorrection — Flag to consider only reference transcripts that overlap with input transfrags
false (default) | true
Flag to consider only reference transcripts that overlap with any of the input transfrags, specified as true or false.
If the value is true, the function ignores any reference transcripts that do not overlap with any of the input transfrags. You must also specify the ReferenceGTF option.
Example: 'SnCorrection',true
Data Types: logical
SpCorrection — Flag to consider only input transcripts that overlap with reference transcripts
false (default) | true
Flag to consider only input transcripts that overlap with any of the reference transcripts, specified as true or false.
If the value is true, the function ignores any input transcripts that do not overlap with any of the reference transcripts and reports no novel loci. You must also specify the ReferenceGTF option.
Example: 'SpCorrection',true
Data Types: logical
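Because SnCorrection and SpCorrection both require ReferenceGTF, the options are typically set together. The following sketch assumes that gtfs is the list of GTF files from the example on this page and that the reference file gyrAB.gtf used there is on the MATLAB path. The output prefix is an arbitrary illustrative name.

```matlab
% Locate the reference annotation used in the example on this page.
refGTF = which('gyrAB.gtf');

% Compare each sample against the reference and ignore reference
% transcripts that no input transfrag overlaps (SnCorrection).
statsRef = cuffcompare(gtfs,'ReferenceGTF',refGTF, ...
    'SnCorrection',true,'OutputPrefix',"cuffcmpRef");
```

Using a separate prefix keeps these results from overwriting the files of an earlier run that used the default "cuffcmp" prefix.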
SuppressMapFiles — Flag to prevent creation of .tmap and .refmap files
false (default) | true
Flag to prevent the creation of .tmap and .refmap files, specified as true or false. Set the value to true to prevent the function from generating the files.
Example: 'SuppressMapFiles',true
Data Types: logical
Output Arguments
statsFile — Name of text file containing statistics
"cuffcmp.stats"
Name of the text file containing statistics related to the accuracy of the transcripts in each sample, returned as a string. The function performs the tests for sensitivity (Sn) and specificity (Sp) at various levels, including the nucleotide, exon, and intron levels, and reports the results in this file.
The default file name is "cuffcmp.stats". If you specify OutputPrefix, the function uses it instead of "cuffcmp".
combinedGTF — Name of file containing union of all transfrags in each sample
"cuffcmp.combined.gtf"
Name of the file containing the union of all transfrags in each sample, returned as a string.
The default file name is "cuffcmp.combined.gtf". If you specify OutputPrefix, the function uses it instead of "cuffcmp".
lociFile — Name of file with all processed loci
"cuffcmp.loci"
Name of the file with all processed loci across all transcripts, returned as a string.
The default file name is "cuffcmp.loci". If you specify OutputPrefix, the function uses it instead of "cuffcmp".
trackingFile — Name of file containing transcripts with identical coordinates
"cuffcmp.tracking"
Name of the file containing transcripts with identical coordinates, introns, and strands, returned as a string.
The default file name is "cuffcmp.tracking". If you specify OutputPrefix, the function uses it instead of "cuffcmp".
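The naming rule for the four output files can be expressed with string concatenation. This sketch builds the expected names for a hypothetical prefix, using the suffixes from the default names listed above, and then checks which of those files exist in the current directory. It assumes the prefix simply replaces "cuffcmp", as described for each output.

```matlab
% Hypothetical prefix; the default prefix is "cuffcmp".
prefix = "cuffComp";

% Suffixes taken from the default output file names listed above.
suffixes = [".stats",".combined.gtf",".loci",".tracking"];
expectedFiles = prefix + suffixes

% After running cuffcompare with this prefix, check which of the
% expected files exist in the current directory.
found = isfile(expectedFiles)
```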
References
[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.
Version History
Introduced in R2019a