Count Features from NGS Reads

This example shows how to count features from paired-end sequencing reads after aligning them to the whole human genome curated by the Genome Reference Consortium. This example uses Genome Reference Consortium Human Build 38 patch release 12 (GRCh38.p12) as the human genome reference.

Prerequisites and Data Set

This example works on the UNIX® and Mac platforms only. Download the Bioinformatics Toolbox™ Interface for Bowtie Aligner support package from the Add-On Explorer.

This example assumes you have:

sequence/
+--HG00096/
|  +--SRR077487_1.filt.fastq
|  +--SRR077487_2.filt.fastq
|  +--SRR081241_1.filt.fastq
|  +--SRR081241_2.filt.fastq
|
+--HG00097
   +-- SRR765989_1.filt.fastq
   +-- SRR765989_2.filt.fastq

Build Index

Construct an index for aligning reads to the reference using bowtie2build. The file GCF_000001405.38_GRCh38.p12_genomic.fna contains the human reference genome in the FASTA format. bowtieIdx is the base name of the reference index files. The '--threads 8' option specifies the number of parallel threads to build index files faster.

bowtieIdx  = 'GCF_000001405.38_GRCh38.p12_genomic.index';
buildFlag  = bowtie2build('GCF_000001405.38_GRCh38.p12_genomic.fna',...
                         bowtieIdx,'--threads 8');

Align Reads to Reference

Use the helper function alignAllReads to align each set of paired-end reads to the reference. The function produces a SAM file in the current folder for each sample in the 'sequence' folder.

addpath(fullfile(matlabroot,'examples','bioinfo','main')); % Make sure the supporting files are on the search path
type alignAllReads
function samFiles = alignAllReads(indexBaseName, inputDir, outputDir)
    %ALIGNALLREADS Helper function for CountFeaturesExample
    %
    % SAMFILES = alignAllReads(INDEXBASENAME, INPUTDIR, OUTPUTDIR) aligns
    % paired-end sequencing reads using Bowtie2 to a prebuilt index,
    % producing SAM-formatted alignment files. INDEXBASENAME
    % is a character vector specifying the prefix of the index files
    % created with BOWTIE2BUILD. INPUTDIR contains subdirectories for each
    % sample containing paired-end reads in FASTQ format:
    % 
    % INPUTDIR/
    % +-- HG00096/
    % |   +-- SRR077487_1.filt.fastq
    % |   +-- SRR077487_2.filt.fastq
    % |   +-- SRR081241_1.filt.fastq
    % |   +-- SRR081241_2.filt.fastq
    % |
    % +-- HG00097/
    %     +-- SRR765989_1.filt.fastq
    %     +-- SRR765989_2.filt.fastq
    % 
    % NOTE: each mate is distinguished by the '_1' or '_2' after the
    % accession number in the filename.
    % 
    % SAMFILES is a cell array of the resulting SAM-formatted files created
    % with Bowtie2, and are placed in OUTPUTDIR.
    
    % Copyright 2018 The MathWorks, Inc.
	
    % Use dir() to identify sample names (subfolders of inputDir).
    d = dir(inputDir);
    samples = {d(3:end).name}; % skip special '.' & '..' folders
    samFiles = strcat(samples, '.sam');
    
    for s=1:numel(samples)
       % Identify mate pairs of reads for each sample
       sampleReadsPath = fullfile(inputDir, samples{s});
       reads1 = dir([sampleReadsPath '/*_1*']);
       reads2 = dir([sampleReadsPath '/*_2*']);
       
       reads1 = fullfile(sampleReadsPath, {reads1(:).name});
       reads2 = fullfile(sampleReadsPath, {reads2(:).name});
       
       % Get full filename to SAM file
       samFiles{s} = fullfile(outputDir, samFiles{s});
       
       % Perform alignment, if file doesn't exist
       if ~exist(samFiles{s}, 'file')
           bowtie2(indexBaseName, reads1, reads2, samFiles{s}, '-p 2');
       end
    end
end
samFiles = alignAllReads(bowtieIdx,'sequence','.');

Selectively Align to Gene of Interest

SAM files can be very large. Use BioMap to select only the data for the correct reference. For this example, consider APOE, which is a gene on Chromosome 19 linked to Alzheimer's disease. Create a set of smaller BAM files for APOE to improve performance.

apoeRef  = 'NC_000019.10'; % Reference name for Chromosome 19 in HG38
bamFiles = strrep(samFiles,'.sam','.bam');
for i=1:numel(samFiles)
  if ~exist(bamFiles{i},'file')
    bm = BioMap(samFiles{i},'SelectReference',apoeRef);
    write(bm, bamFiles{i},'Format','bam');
  end
end

Summarize Read Counts

Use featurecount to compare the number of transcripts for each APOE variant using a GTF file. A full table of features is included in the GRCh38.p12 assembly in GFF format, which can be converted to GTF using cuffgffread. This example uses a simplified GTF based on APOE transcripts. APOE_gene.gtf is included with the software.

[FeatTable, Summary] = featurecount('APOE_gene.gtf',bamFiles,...
                                  'Metafeature','transcript_id');
Processing GTF file APOE_gene.gtf ...
Processing BAM file ./HG00096.bam ...
Processing reference NC_000019.10 ...
10000 reads processed ...
20000 reads processed ...
30000 reads processed ...
40000 reads processed ...
50000 reads processed ...
60000 reads processed ...
70000 reads processed ...
80000 reads processed ...
90000 reads processed ...
100000 reads processed ...
110000 reads processed ...
120000 reads processed ...
130000 reads processed ...
140000 reads processed ...
150000 reads processed ...
160000 reads processed ...
170000 reads processed ...
180000 reads processed ...
190000 reads processed ...
200000 reads processed ...
210000 reads processed ...
220000 reads processed ...
230000 reads processed ...
240000 reads processed ...
250000 reads processed ...
260000 reads processed ...
270000 reads processed ...
280000 reads processed ...
290000 reads processed ...
300000 reads processed ...
310000 reads processed ...
320000 reads processed ...
330000 reads processed ...
340000 reads processed ...
350000 reads processed ...
360000 reads processed ...
370000 reads processed ...
380000 reads processed ...
390000 reads processed ...
400000 reads processed ...
410000 reads processed ...
420000 reads processed ...
430000 reads processed ...
440000 reads processed ...
450000 reads processed ...
460000 reads processed ...
470000 reads processed ...
480000 reads processed ...
490000 reads processed ...
500000 reads processed ...
510000 reads processed ...
520000 reads processed ...
530000 reads processed ...
540000 reads processed ...
550000 reads processed ...
560000 reads processed ...
570000 reads processed ...
580000 reads processed ...
590000 reads processed ...
600000 reads processed ...
610000 reads processed ...
620000 reads processed ...
630000 reads processed ...
640000 reads processed ...
650000 reads processed ...
660000 reads processed ...
670000 reads processed ...
680000 reads processed ...
690000 reads processed ...
700000 reads processed ...
710000 reads processed ...
720000 reads processed ...
730000 reads processed ...
740000 reads processed ...
750000 reads processed ...
760000 reads processed ...
770000 reads processed ...
780000 reads processed ...
790000 reads processed ...
800000 reads processed ...
810000 reads processed ...
820000 reads processed ...
830000 reads processed ...
840000 reads processed ...
850000 reads processed ...
860000 reads processed ...
870000 reads processed ...
880000 reads processed ...
890000 reads processed ...
900000 reads processed ...
910000 reads processed ...
920000 reads processed ...
930000 reads processed ...
940000 reads processed ...
950000 reads processed ...
960000 reads processed ...
970000 reads processed ...
980000 reads processed ...
990000 reads processed ...
1000000 reads processed ...
1010000 reads processed ...
1020000 reads processed ...
1030000 reads processed ...
1040000 reads processed ...
1050000 reads processed ...
1060000 reads processed ...
1070000 reads processed ...
1080000 reads processed ...
1090000 reads processed ...
1100000 reads processed ...
1110000 reads processed ...
1120000 reads processed ...
1130000 reads processed ...
1140000 reads processed ...
1150000 reads processed ...
1160000 reads processed ...
1170000 reads processed ...
1180000 reads processed ...
1190000 reads processed ...
1200000 reads processed ...
1210000 reads processed ...
1220000 reads processed ...
1230000 reads processed ...
1240000 reads processed ...
1250000 reads processed ...
1260000 reads processed ...
1270000 reads processed ...
1280000 reads processed ...
1290000 reads processed ...
1300000 reads processed ...
1310000 reads processed ...
1320000 reads processed ...
1330000 reads processed ...
1340000 reads processed ...
1350000 reads processed ...
1360000 reads processed ...
1370000 reads processed ...
1380000 reads processed ...
1390000 reads processed ...
1400000 reads processed ...
1410000 reads processed ...
1420000 reads processed ...
1430000 reads processed ...
1440000 reads processed ...
1450000 reads processed ...
1460000 reads processed ...
1470000 reads processed ...
1480000 reads processed ...
1490000 reads processed ...
1500000 reads processed ...
1510000 reads processed ...
1520000 reads processed ...
1530000 reads processed ...
1540000 reads processed ...
1550000 reads processed ...
1560000 reads processed ...
1570000 reads processed ...
1580000 reads processed ...
1590000 reads processed ...
1600000 reads processed ...
1610000 reads processed ...
1620000 reads processed ...
1630000 reads processed ...
1640000 reads processed ...
1650000 reads processed ...
1660000 reads processed ...
1670000 reads processed ...
1680000 reads processed ...
1690000 reads processed ...
1700000 reads processed ...
1710000 reads processed ...
1720000 reads processed ...
1730000 reads processed ...
1740000 reads processed ...
1750000 reads processed ...
1760000 reads processed ...
1770000 reads processed ...
1780000 reads processed ...
1790000 reads processed ...
1800000 reads processed ...
1810000 reads processed ...
1820000 reads processed ...
1830000 reads processed ...
1840000 reads processed ...
1850000 reads processed ...
1860000 reads processed ...
1870000 reads processed ...
1880000 reads processed ...
1890000 reads processed ...
1900000 reads processed ...
1910000 reads processed ...
1920000 reads processed ...
1930000 reads processed ...
1940000 reads processed ...
1950000 reads processed ...
1960000 reads processed ...
1970000 reads processed ...
1980000 reads processed ...
1990000 reads processed ...
2000000 reads processed ...
2010000 reads processed ...
2020000 reads processed ...
2030000 reads processed ...
2040000 reads processed ...
2050000 reads processed ...
2060000 reads processed ...
2070000 reads processed ...
2080000 reads processed ...
2090000 reads processed ...
2100000 reads processed ...
2110000 reads processed ...
2120000 reads processed ...
2130000 reads processed ...
2140000 reads processed ...
2150000 reads processed ...
2160000 reads processed ...
2170000 reads processed ...
2180000 reads processed ...
2190000 reads processed ...
2200000 reads processed ...
2210000 reads processed ...
2220000 reads processed ...
2230000 reads processed ...
2240000 reads processed ...
2250000 reads processed ...
2260000 reads processed ...
2270000 reads processed ...
2280000 reads processed ...
2290000 reads processed ...
2300000 reads processed ...
2310000 reads processed ...
2320000 reads processed ...
2330000 reads processed ...
2340000 reads processed ...
2350000 reads processed ...
2360000 reads processed ...
2370000 reads processed ...
2380000 reads processed ...
2390000 reads processed ...
2400000 reads processed ...
2410000 reads processed ...
2420000 reads processed ...
2430000 reads processed ...
2440000 reads processed ...
2450000 reads processed ...
2460000 reads processed ...
2470000 reads processed ...
2480000 reads processed ...
2490000 reads processed ...
2500000 reads processed ...
2510000 reads processed ...
2520000 reads processed ...
2530000 reads processed ...
2540000 reads processed ...
2550000 reads processed ...
2560000 reads processed ...
2570000 reads processed ...
2580000 reads processed ...
2590000 reads processed ...
2600000 reads processed ...
2610000 reads processed ...
2620000 reads processed ...
2630000 reads processed ...
2640000 reads processed ...
2650000 reads processed ...
2660000 reads processed ...
2670000 reads processed ...
2680000 reads processed ...
2690000 reads processed ...
2700000 reads processed ...
2710000 reads processed ...
2720000 reads processed ...
2730000 reads processed ...
2740000 reads processed ...
2750000 reads processed ...
2760000 reads processed ...
2770000 reads processed ...
2780000 reads processed ...
2790000 reads processed ...
2800000 reads processed ...
2810000 reads processed ...
2820000 reads processed ...
2830000 reads processed ...
2840000 reads processed ...
2850000 reads processed ...
2860000 reads processed ...
2870000 reads processed ...
2880000 reads processed ...
2890000 reads processed ...
2900000 reads processed ...
2910000 reads processed ...
2920000 reads processed ...
2930000 reads processed ...
2940000 reads processed ...
2950000 reads processed ...
2960000 reads processed ...
2970000 reads processed ...
2980000 reads processed ...
2990000 reads processed ...
3000000 reads processed ...
3010000 reads processed ...
3020000 reads processed ...
3030000 reads processed ...
3040000 reads processed ...
3050000 reads processed ...
3060000 reads processed ...
3070000 reads processed ...
3080000 reads processed ...
3090000 reads processed ...
3100000 reads processed ...
3110000 reads processed ...
3120000 reads processed ...
3130000 reads processed ...
3140000 reads processed ...
3150000 reads processed ...
3160000 reads processed ...
3170000 reads processed ...
3180000 reads processed ...
3190000 reads processed ...
3200000 reads processed ...
3210000 reads processed ...
3220000 reads processed ...
3230000 reads processed ...
3240000 reads processed ...
3250000 reads processed ...
3260000 reads processed ...
3270000 reads processed ...
3280000 reads processed ...
3290000 reads processed ...
3300000 reads processed ...
3310000 reads processed ...
3320000 reads processed ...
3330000 reads processed ...
3340000 reads processed ...
3350000 reads processed ...
3360000 reads processed ...
3370000 reads processed ...
3380000 reads processed ...
3390000 reads processed ...
3400000 reads processed ...
3410000 reads processed ...
3420000 reads processed ...
3430000 reads processed ...
3440000 reads processed ...
3450000 reads processed ...
3460000 reads processed ...
3470000 reads processed ...
3480000 reads processed ...
3490000 reads processed ...
3500000 reads processed ...
3510000 reads processed ...
3520000 reads processed ...
3530000 reads processed ...
3540000 reads processed ...
3550000 reads processed ...
3560000 reads processed ...
3570000 reads processed ...
3580000 reads processed ...
3590000 reads processed ...
3600000 reads processed ...
3610000 reads processed ...
3620000 reads processed ...
3630000 reads processed ...
3640000 reads processed ...
3650000 reads processed ...
3660000 reads processed ...
3670000 reads processed ...
3680000 reads processed ...
3690000 reads processed ...
3700000 reads processed ...
3710000 reads processed ...
3720000 reads processed ...
3730000 reads processed ...
3740000 reads processed ...
3750000 reads processed ...
3760000 reads processed ...
3770000 reads processed ...
3780000 reads processed ...
3790000 reads processed ...
3800000 reads processed ...
3810000 reads processed ...
3820000 reads processed ...
3830000 reads processed ...
3840000 reads processed ...
3850000 reads processed ...
3860000 reads processed ...
3870000 reads processed ...
3880000 reads processed ...
3890000 reads processed ...
3900000 reads processed ...
3910000 reads processed ...
3920000 reads processed ...
3930000 reads processed ...
3940000 reads processed ...
3950000 reads processed ...
3960000 reads processed ...
3970000 reads processed ...
Done.

See Also

| | | | |