Main Content

Manage Sequence Read Data in Objects

Overview

High-throughput sequencing instruments produce large amounts of sequence read data that can be challenging to store and manage. Using objects to contain this data lets you easily access, manipulate, and filter the data.

Bioinformatics Toolbox™ includes two objects for working with sequence read data.

ObjectContains This Information Construct from One of These
BioRead
  • Sequence headers

  • Read sequences

  • Sequence qualities (base calling)

  • FASTQ file

  • SAM file

  • FASTQ structure (created using the fastqread function)

  • SAM structure (created using the samread function)

  • Cell arrays containing header, sequence, and quality information (created using the fastqread function)

BioMap
  • Sequence headers

  • Read sequences

  • Sequence qualities (base calling)

  • Sequence alignment and mapping information (relative to a single reference sequence), including mapping quality

  • SAM file

  • BAM file

  • SAM structure (created using the samread function)

  • BAM structure (created using the bamread function)

  • Cell arrays containing header, sequence, quality, and mapping/alignment information (created using the samread or bamread function)

Represent Sequence and Quality Data in a BioRead Object

Prerequisites

A BioRead object represents a collection of sequence reads. Each element in the object is associated with a sequence, sequence header, and sequence quality information.

Construct a BioRead object in one of two ways:

  • Indexed — The data remains in the source file. Constructing the object and accessing its contents is memory efficient. However, you cannot modify object properties, other than the Name property. This is the default method if you construct a BioRead object from a FASTQ- or SAM-formatted file.

  • In Memory — The data is read into memory. Constructing the object and accessing its contents is limited by the amount of available memory. However, you can modify object properties. When you construct a BioRead object from a FASTQ structure or cell arrays, the data is read into memory. When you construct a BioRead object from a FASTQ- or SAM-formatted file, use the InMemory name-value pair argument to read the data into memory.

Construct a BioRead Object from a FASTQ- or SAM-Formatted File

Note

This example constructs a BioRead object from a FASTQ-formatted file. Use similar steps to construct a BioRead object from a SAM-formatted file.

Use the BioRead constructor function to construct a BioRead object from a FASTQ-formatted file and set the Name property:

BRObj1 = BioRead('SRR005164_1_50.fastq', 'Name', 'MyObject')
BRObj1 = 

  BioRead with properties:

     Quality: [50x1 File indexed property]
    Sequence: [50x1 File indexed property]
      Header: [50x1 File indexed property]
       NSeqs: 50
        Name: 'MyObject'

The constructor function construct a BioRead object and, if an index file does not already exist, it also creates an index file with the same file name, but with an .IDX extension. This index file, by default, is stored in the same location as the source file.

Caution

Your source file and index file must always be in sync.

  • After constructing a BioRead object, do not modify the index file, or you can get invalid results when using the existing object or constructing new objects.

  • If you modify the source file, delete the index file, so the object constructor creates a new index file when constructing new objects.

Note

Because you constructed this BioRead object from a source file, you cannot modify the properties (except for Name) of the BioRead object.

Represent Sequence, Quality, and Alignment/Mapping Data in a BioMap Object

Prerequisites

A BioMap object represents a collection of sequence reads that map against a single reference sequence. Each element in the object is associated with a read sequence, sequence header, sequence quality information, and alignment/mapping information.

When constructing a BioMap object from a BAM file, the maximum size of the file is limited by your operating system and available memory.

Construct a BioMap object in one of two ways:

  • Indexed — The data remains in the source file. Constructing the object and accessing its contents is memory efficient. However, you cannot modify object properties, other than the Name property. This is the default method if you construct a BioMap object from a SAM- or BAM-formatted file.

  • In Memory — The data is read into memory. Constructing the object and accessing its contents is limited by the amount of available memory. However, you can modify object properties. When you construct a BioMap object from a structure, the data stays in memory. When you construct a BioMap object from a SAM- or BAM-formatted file, use the InMemory name-value pair argument to read the data into memory.

Construct a BioMap Object from a SAM- or BAM-Formatted File

Note

This example constructs a BioMap object from a SAM-formatted file. Use similar steps to construct a BioMap object from a BAM-formatted file.

  1. If you do not know the number and names of the reference sequences in your source file, determine them using the saminfo or baminfo function and the ScanDictionary name-value pair argument.

    samstruct = saminfo('ex2.sam', 'ScanDictionary', true);
    samstruct.ScannedDictionary
    ans = 
    
        'seq1'
        'seq2'

    Tip

    The previous syntax scans the entire SAM file, which is time consuming. If you are confident that the Header information of the SAM file is correct, omit the ScanDictionary name-value pair argument, and inspect the SequenceDictionary field instead.

  2. Use the BioMap constructor function to construct a BioMap object from the SAM file and set the Name property. Because the SAM-formatted file in this example, ex2.sam, contains multiple reference sequences, use the SelectRef name-value pair argument to specify one reference sequence, seq1:

    BMObj2 = BioMap('ex2.sam', 'SelectRef', 'seq1', 'Name', 'MyObject')
    BMObj2 = 
    
      BioMap with properties:
    
        SequenceDictionary: 'seq1'
                 Reference: [1501x1 File indexed property]
                 Signature: [1501x1 File indexed property]
                     Start: [1501x1 File indexed property]
            MappingQuality: [1501x1 File indexed property]
                      Flag: [1501x1 File indexed property]
              MatePosition: [1501x1 File indexed property]
                   Quality: [1501x1 File indexed property]
                  Sequence: [1501x1 File indexed property]
                    Header: [1501x1 File indexed property]
                     NSeqs: 1501
                      Name: 'MyObject'

The constructor function constructs a BioMap object and, if index files do not already exist, it also creates one or two index files:

  • If constructing from a SAM-formatted file, it creates one index file that has the same file name as the source file, but with an .IDX extension. This index file, by default, is stored in the same location as the source file.

  • If constructing from a BAM-formatted file, it creates two index files that have the same file name as the source file, but one with a .BAI extension and one with a .LINEARINDEX extension. These index files, by default, are stored in the same location as the source file.

Caution

Your source file and index files must always be in sync.

  • After constructing a BioMap object, do not modify the index files, or you can get invalid results when using the existing object or constructing new objects.

  • If you modify the source file, delete the index files, so the object constructor creates new index files when constructing new objects.

Note

Because you constructed this BioMap object from a source file, you cannot modify the properties (except for Name and Reference) of the BioMap object.

Construct a BioMap Object from a SAM or BAM Structure

Note

This example constructs a BioMap object from a SAM structure using samread. Use similar steps to construct a BioMap object from a BAM structure using bamread.

  1. Use the samread function to create a SAM structure from a SAM-formatted file:

    SAMStruct = samread('ex2.sam');
    
  2. To construct a valid BioMap object from a SAM-formatted file, the file must contain only one reference sequence. Determine the number and names of the reference sequences in your SAM-formatted file using the unique function to find unique names in the ReferenceName field of the structure:

    unique({SAMStruct.ReferenceName})
    ans = 
    
        'seq1'    'seq2'
    
  3. Use the BioMap constructor function to construct a BioMap object from a SAM structure. Because the SAM structure contains multiple reference sequences, use the SelectRef name-value pair argument to specify one reference sequence, seq1:

    BMObj1 = BioMap(SAMStruct, 'SelectRef', 'seq1')
    
    BMObj1 = 
    
      BioMap with properties:
    
        SequenceDictionary: {'seq1'}
                 Reference: {1501x1 cell}
                 Signature: {1501x1 cell}
                     Start: [1501x1 uint32]
            MappingQuality: [1501x1 uint8]
                      Flag: [1501x1 uint16]
              MatePosition: [1501x1 uint32]
                   Quality: {1501x1 cell}
                  Sequence: {1501x1 cell}
                    Header: {1501x1 cell}
                     NSeqs: 1501
                      Name: ''

Retrieve Information from a BioRead or BioMap Object

You can retrieve all or a subset of information from a BioRead or BioMap object.

Retrieve a Property from a BioRead or BioMap Object

You can retrieve a specific property from elements in a BioRead or BioMap object.

For example, to retrieve all headers from a BioRead object, use the Header property as follows:

allHeaders = BRObj1.Header;

This syntax returns a cell array containing the headers for all elements in the BioRead object.

Similarly, to retrieve all start positions of aligned read sequences from a BioMap object, use the Start property of the object:

allStarts = BMObj1.Start;

This syntax returns a vector containing the start positions of aligned read sequences with respect to the position numbers in the reference sequence in a BioMap object.

Retrieve Multiple Properties from a BioRead or BioMap Object

You can retrieve multiple properties from a BioRead or BioMap object in a single command using the get method. For example, to retrieve both start positions and headers information of a BioMap object, use the get method as follows:

multiProp = get(BMObj1, {'Start', 'Header'});

This syntax returns a cell array containing all start positions and headers information of a BioMap object.

Note

Property names are case sensitive.

For a list and description of all properties of a BioRead object, see BioRead class. For a list and description of all properties of a BioMap object, see BioMap class.

Retrieve a Subset of Information from a BioRead or BioMap Object

Use specialized get methods with a numeric vector, logical vector, or cell array of headers to retrieve a subset of information from an object. For example, to retrieve the first 10 elements from a BioRead object, use the getSubset method:

newBRObj = getSubset(BRObj1, [1:10]);

This syntax returns a new BioRead object containing the first 10 elements in the original BioRead object.

For example, to retrieve the first 12 positions of sequences with headers SRR005164.1, SRR005164.7, and SRR005164.16, use the getSubsequence method:

subSeqs = getSubsequence(BRObj1, ...
          {'SRR005164.1', 'SRR005164.7', 'SRR005164.16'}, [1:12]')
subSeqs = 

    'TGGCTTTAAAGC'
    'CCCGAAAGCTAG'
    'AATTTTGCGGCT'

For example, to retrieve information about the third element in a BioMap object, use the getInfo method:

Info_3 = getInfo(BMObj1, 3);

This syntax returns a tab-delimited character vector containing this information for the third element:

  • Sequence header

  • SAM flags for the sequence

  • Start position of the aligned read sequence with respect to the reference sequence

  • Mapping quality score for the sequence

  • Signature (CIGAR-formatted character vector) for the sequence

  • Sequence

  • Quality scores for sequence positions

Note

Method names are case sensitive.

For a complete list and description of methods of a BioRead object, see BioRead class. For a complete list and description of methods of a BioMap object, see BioMap class.

Set Information in a BioRead or BioMap Object

Prerequisites

To modify properties (other than Name and Reference) of a BioRead or BioMap object, the data must be in memory, and not indexed. To ensure the data is in memory, do one of the following:

Provide Custom Headers for Sequences

First, create an object with the data in memory:

BRObj1 = BioRead('SRR005164_1_50.fastq','InMemory',true);

To provide custom headers for sequences of interest (in this case sequences 1 to 5), do the following:

BRObj1.Header(1:5) = {'H1', 'H2', 'H3', 'H4', 'H5'};

Alternatively, you can use the setHeader method:

BRObj1 = setHeader(BRObj1, {'H1', 'H2', 'H3', 'H4', 'H5'}, [1:5]);

Several other specialized set methods let you set the properties of a subset of elements in a BioRead or BioMap object.

Note

Method names are case sensitive.

For a complete list and description of methods of a BioRead object, see BioRead class. For a complete list and description of methods of a BioMap object, see BioMap class.

Determine Coverage of a Reference Sequence

When working with a BioMap object, you can determine the number of read sequences that:

  • Align within a specific region of the reference sequence

  • Align to each position within a specific region of the reference sequence

For example, you can compute the number, indices, and start positions of the read sequences that align within the first 25 positions of the reference sequence. To do so, use the getCounts, getIndex, and getStart methods:

Cov = getCounts(BMObj1, 1, 25)
Cov =

    12
Indices = getIndex(BMObj1, 1, 25)
Indices =

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
startPos = getStart(BMObj1, Indices)
startPos =

           1
           3
           5
           6
           9
          13
          13
          15
          18
          22
          22
          24

The first two syntaxes return the number and indices of the read sequences that align within the specified region of the reference sequence. The last syntax returns a vector containing the start position of each aligned read sequence, corresponding to the position numbers of the reference sequence.

For example, you can also compute the number of the read sequences that align to each of the first 10 positions of the reference sequence. For this computation, use the getBaseCoverage method:

Cov = getBaseCoverage(BMObj1, 1, 10)
Cov =

     1     1     2     2     3     4     4     4     5     5

Construct Sequence Alignments to a Reference Sequence

It is useful to construct and view the alignment of the read sequences that align to a specific region of the reference sequence. It is also helpful to know which read sequences align to this region in a BioMap object.

For example, to retrieve the alignment of read sequences to the first 12 positions of the reference sequence in a BioMap object, use the getAlignment method:

[Alignment_1_12, Indices] = getAlignment(BMObj2, 1, 12)
Alignment_1_12 =

CACTAGTGGCTC
  CTAGTGGCTC
    AGTGGCTC
     GTGGCTC
        GCTC


Indices =

     1
     2
     3
     4
     5

Return the headers of the read sequences that align to a specific region of the reference sequence:

alignedHeaders = getHeader(BMObj2, Indices)
alignedHeaders = 

    'B7_591:4:96:693:509'
    'EAS54_65:7:152:368:113'
    'EAS51_64:8:5:734:57'
    'B7_591:1:289:587:906'
    'EAS56_59:8:38:671:758'

Filter Read Sequences Using SAM Flags

SAM- and BAM-formatted files include the status of 11 binary flags for each read sequence. These flags describe different sequencing and alignment aspects of a read sequence. For more information on the flags, see the SAM Format Specification. The filterByFlag method lets you filter the read sequences in a BioMap object by using these flags.

Filter Unmapped Read Sequences

  1. Construct a BioMap object from a SAM-formatted file.

    BMObj2 = BioMap('ex1.sam');
  2. Use the filterByFlag method to create a logical vector indicating the read sequences in a BioMap object that are mapped.

    LogicalVec_mapped = filterByFlag(BMObj2, 'unmappedQuery', false);
    
  3. Use this logical vector and the getSubset method to create a new BioMap object containing only the mapped read sequences.

    filteredBMObj_1 = getSubset(BMObj2, LogicalVec_mapped);

Filter Read Sequences That Are Not Mapped in a Pair

  1. Construct a BioMap object from a SAM-formatted file.

    BMObj2 = BioMap('ex1.sam');
  2. Use the filterByFlag method to create a logical vector indicating the read sequences in a BioMap object that are mapped in a proper pair, that is, both the read sequence and its mate are mapped to the reference sequence.

    LogicalVec_paired = filterByFlag(BMObj2, 'pairedInMap', true);
    
  3. Use this logical vector and the getSubset method to create a new BioMap object containing only the read sequences that are mapped in a proper pair.

    filteredBMObj_2 = getSubset(BMObj2, LogicalVec_paired);