Main Content

Exploring a Nucleotide Sequence Using Command Line

Overview of Example

After sequencing a piece of DNA, one of the first tasks is to investigate the nucleotide content in the sequence. Starting with a DNA sequence, this example uses sequence statistics functions to determine mono-, di-, and trinucleotide content, and to locate open reading frames.

Searching the Web for Sequence Information

The following procedure illustrates how to use the MATLAB® Help browser to search the Web for information. In this example you are interested in studying the human mitochondrial genome. While many genes that code for mitochondrial proteins are found in the cell nucleus, the mitochondrial has genes that code for proteins used to produce energy.

First research information about the human mitochondria and find the nucleotide sequence for the genome. Next, look at the nucleotide content for the entire sequence. And finally, determine open reading frames and extract specific gene sequences.

  1. Use the MATLAB Help browser to explore the Web. In the MATLAB Command Window, type

    web('http://www.ncbi.nlm.nih.gov/')
    

    A separate browser window opens with the home page for the NCBI Web site.

  2. Search the NCBI Web site for information. For example, to search for the human mitochondrion genome, from the Search list, select Genome , and in the Search list, enter mitochondrion homo sapiens.

    The NCBI Web search returns a list of links to relevant pages.

  3. Select a result page. For example, click the link labeled NC_012920.

    The MATLAB Help browser displays the NCBI page for the human mitochondrial genome.

Reading Sequence Information from the Web

The following procedure illustrates how to find a nucleotide sequence in a public database and read the sequence information into the MATLAB environment. Many public databases for nucleotide sequences are accessible from the Web. The MATLAB Command Window provides an integrated environment for bringing sequence information into the MATLAB environment.

The consensus sequence for the human mitochondrial genome has the GenBank® accession number NC_012920. Since the whole GenBank entry is quite large and you might only be interested in the sequence, you can get just the sequence information.

  1. Get sequence information from a Web database. For example, to retrieve sequence information for the human mitochondrial genome, in the MATLAB Command Window, type

    mitochondria = getgenbank('NC_012920','SequenceOnly',true)
    

    The getgenbank function retrieves the nucleotide sequence from the GenBank database and creates a character array.

    mitochondria = 
    GATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCAT
    TTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG
    GAGCCGGAGCACCCTATGTCGCAGTATCTGTCTTTGATTCCTGCCTCATT
    CTATTATTTATCGCACCTACGTTCAATATTACAGGCGAACATACCTACTA
    AAGT . . . 
    

  2. If you don't have a Web connection, you can load the data from a MAT file included with the Bioinformatics Toolbox™ software, using the command

    load mitochondria
    

    The load function loads the sequence mitochondria into the MATLAB Workspace.

  3. Get information about the sequence. Type

    whos mitochondria
    

    Information about the size of the sequence displays in the MATLAB Command Window.

     Name              Size               Bytes  Class    Attributes
    
     mitochondria      1x16569            33138  char               
    

Determining Nucleotide Composition

The following procedure illustrates how to determine the monomers and dimers, and then visualize data in graphs and bar plots. Sections of a DNA sequence with a high percent of A+T nucleotides usually indicate intergenic parts of the sequence, while low A+T and higher G+C nucleotide percentages indicate possible genes. Many times high CG dinucleotide content is located before a gene.

After you read a sequence into the MATLAB environment, you can use the sequence statistics functions to determine if your sequence has the characteristics of a protein-coding region. This procedure uses the human mitochondrial genome as an example. See Reading Sequence Information from the Web.

  1. Plot monomer densities and combined monomer densities in a graph. In the MATLAB Command Window, type

    ntdensity(mitochondria)
    

    This graph shows that the genome is A+T rich.

  2. Count the nucleotides using the basecount function.

    basecount(mitochondria)
    

    A list of nucleotide counts is shown for the 5'-3' strand.

    ans = 
        A: 5124
        C: 5181
        G: 2169
        T: 4094
    
    
  3. Count the nucleotides in the reverse complement of a sequence using the seqrcomplement function.

    basecount(seqrcomplement(mitochondria))
    

    As expected, the nucleotide counts on the reverse complement strand are complementary to the 5'-3' strand.

    ans = 
        A: 4094
        C: 2169
        G: 5181
        T: 5124
    
  4. Use the function basecount with the chart option to visualize the nucleotide distribution.

    figure
    basecount(mitochondria,'chart','pie');
    

    A pie chart displays in the MATLAB Figure window.

  5. Count the dimers in a sequence and display the information in a bar chart.

    figure
    dimercount(mitochondria,'chart','bar')
    
    ans = 
    
        AA: 1604
        AC: 1495
        AG: 795
        AT: 1230
        CA: 1534
        CC: 1771
        CG: 435
        CT: 1440
        GA: 613
        GC: 711
        GG: 425
        GT: 419
        TA: 1373
        TC: 1204
        TG: 513
        TT: 1004

Determining Codon Composition

The following procedure illustrates how to look at codons for the six reading frames. Trinucleotides (codon) code for an amino acid, and there are 64 possible codons in a nucleotide sequence. Knowing the percent of codons in your sequence can be helpful when you are comparing with tables for expected codon usage.

After you read a sequence into the MATLAB environment, you can analyze the sequence for codon composition. This procedure uses the human mitochondria genome as an example. See Reading Sequence Information from the Web.

  1. Count codons in a nucleotide sequence. In the MATLAB Command Window, type

    codoncount(mitochondria)
    

    The codon counts for the first reading frame displays.

    AAA - 167     AAC - 171     AAG -  71     AAT - 130     
    ACA - 137     ACC - 191     ACG -  42     ACT - 153     
    AGA -  59     AGC -  87     AGG -  51     AGT -  54     
    ATA - 126     ATC - 131     ATG -  55     ATT - 113     
    CAA - 146     CAC - 145     CAG -  68     CAT - 148     
    CCA - 141     CCC - 205     CCG -  49     CCT - 173     
    CGA -  40     CGC -  54     CGG -  29     CGT -  27     
    CTA - 175     CTC - 142     CTG -  74     CTT - 101     
    GAA -  67     GAC -  53     GAG -  49     GAT -  35     
    GCA -  81     GCC - 101     GCG -  16     GCT -  59     
    GGA -  36     GGC -  47     GGG -  23     GGT -  28     
    GTA -  43     GTC -  26     GTG -  18     GTT -  41     
    TAA - 157     TAC - 118     TAG -  94     TAT - 107     
    TCA - 125     TCC - 116     TCG -  37     TCT - 103     
    TGA -  64     TGC -  40     TGG -  29     TGT -  26     
    TTA -  96     TTC - 107     TTG -  47     TTT -  78

  2. Count the codons in all six reading frames and plot the results in heat maps.

    for frame = 1:3
        figure
        subplot(2,1,1);
        codoncount(mitochondria,'frame',frame,'figure',true,...
                   'geneticcode','Vertebrate Mitochondrial');
        title(sprintf('Codons for frame %d',frame));
        subplot(2,1,2);
        codoncount(mitochondria,'reverse',true,'frame',frame,...
                   'figure',true,'geneticcode','Vertebrate Mitochondrial');
        title(sprintf('Codons for reverse frame %d',frame)); 
    end
    

    Heat maps display all 64 codons in the 6 reading frames.

Open Reading Frames

The following procedure illustrates how to locate the open reading frames using a specific genetic code. Determining the protein-coding sequence for a eukaryotic gene can be a difficult task because introns (noncoding sections) are mixed with exons. However, prokaryotic genes generally do not have introns and mRNA sequences have the introns removed. Identifying the start and stop codons for translation determines the protein-coding section, or open reading frame (ORF), in a sequence. Once you know the ORF for a gene or mRNA, you can translate a nucleotide sequence to its corresponding amino acid sequence.

After you read a sequence into the MATLAB environment, you can analyze the sequence for open reading frames. This procedure uses the human mitochondria genome as an example. See Reading Sequence Information from the Web.

  1. Display open reading frames (ORFs) in a nucleotide sequence. In the MATLAB Command Window, type:

    seqshoworfs(mitochondria);
    

    If you compare this output to the genes shown on the NCBI page for NC_012920, there are fewer genes than expected. This is because vertebrate mitochondria use a genetic code slightly different from the standard genetic code. For a list of genetic codes, see the Genetic Code table in the aa2nt reference page.

  2. Display ORFs using the Vertebrate Mitochondrial code.

    orfs= seqshoworfs(mitochondria,...
                      'GeneticCode','Vertebrate Mitochondrial',...
                      'alternativestart',true);
    

    Notice that there are now two large ORFs on the third reading frame. One starts at position 4470 and the other starts at 5904. These correspond to the genes ND2 (NADH dehydrogenase subunit 2 [Homo sapiens] ) and COX1 (cytochrome c oxidase subunit I) genes.

  3. Find the corresponding stop codon. The start and stop positions for ORFs have the same indices as the start positions in the fields Start and Stop.

    ND2Start = 4470;
    StartIndex = find(orfs(3).Start == ND2Start)
    ND2Stop = orfs(3).Stop(StartIndex)
    

    The stop position displays.

    ND2Stop =
    
            5511
  4. Using the sequence indices for the start and stop of the gene, extract the subsequence from the sequence.

    ND2Seq = mitochondria(ND2Start:ND2Stop)
    

    The subsequence (protein-coding region) is stored in ND2Seq and displayed on the screen.

    attaatcccctggcccaacccgtcatctactctaccatctttgcaggcac
    actcatcacagcgctaagctcgcactgattttttacctgagtaggcctag
    aaataaacatgctagcttttattccagttctaaccaaaaaaataaaccct
    cgttccacagaagctgccatcaagtatttcctcacgcaagcaaccgcatc
    cataatccttc . . .
  5. Determine the codon distribution.

    codoncount (ND2Seq)
    

    The codon count shows a high amount of ACC, ATA, CTA, and ATC.

    AAA - 10     AAC - 14     AAG -  2     AAT -  6     
    ACA - 11     ACC - 24     ACG -  3     ACT -  5     
    AGA -  0     AGC -  4     AGG -  0     AGT -  1     
    ATA - 23     ATC - 24     ATG -  1     ATT -  8     
    CAA -  8     CAC -  3     CAG -  2     CAT -  1     
    CCA -  4     CCC - 12     CCG -  2     CCT -  5     
    CGA -  0     CGC -  3     CGG -  0     CGT -  1     
    CTA - 26     CTC - 18     CTG -  4     CTT -  7     
    GAA -  5     GAC -  0     GAG -  1     GAT -  0     
    GCA -  8     GCC -  7     GCG -  1     GCT -  4     
    GGA -  5     GGC -  7     GGG -  0     GGT -  1     
    GTA -  3     GTC -  2     GTG -  0     GTT -  3     
    TAA -  0     TAC -  8     TAG -  0     TAT -  2     
    TCA -  7     TCC - 11     TCG -  1     TCT -  4     
    TGA - 10     TGC -  0     TGG -  1     TGT -  0     
    TTA -  8     TTC -  7     TTG -  1     TTT -  8  
  6. Look up the amino acids for codons ATA, CTA, ACC, and ATC.

    aminolookup('code',nt2aa('ATA'))
    aminolookup('code',nt2aa('CTA'))
    aminolookup('code',nt2aa('ACC'))
    aminolookup('code',nt2aa('ATC'))
    

    The following displays:

    Ile	isoleucine
    Leu	leucine
    Thr	threonine
    Ile	isoleucine
    

Amino Acid Conversion and Composition

The following procedure illustrates how to extract the protein-coding sequence from a gene sequence and convert it to the amino acid sequence for the protein. Determining the relative amino acid composition of a protein will give you a characteristic profile for the protein. Often, this profile is enough information to identify a protein. Using the amino acid composition, atomic composition, and molecular weight, you can also search public databases for similar proteins.

After you locate an open reading frame (ORF) in a gene, you can convert it to an amino sequence and determine its amino acid composition. This procedure uses the human mitochondria genome as an example. See Open Reading Frames.

  1. Convert a nucleotide sequence to an amino acid sequence. In this example, only the protein-coding sequence between the start and stop codons is converted.

    ND2AASeq = nt2aa(ND2Seq,'geneticcode',...
                     'Vertebrate Mitochondrial')
    

    The sequence is converted using the Vertebrate Mitochondrial genetic code. Because the property AlternativeStartCodons is set to 'true' by default, the first codon att is converted to M instead of I.

    MNPLAQPVIYSTIFAGTLITALSSHWFFTWVGLEMNMLAFIPVLTKKMNP
    RSTEAAIKYFLTQATASMILLMAILFNNMLSGQWTMTNTTNQYSSLMIMM
    AMAMKLGMAPFHFWVPEVTQGTPLTSGLLLLTWQKLAPISIMYQISPSLN
    VSLLLTLSILSIMAGSWGGLNQTQLRKILAYSSITHMGWMMAVLPYNPNM
    TILNLTIYIILTTTAFLLLNLNSSTTTLLLSRTWNKLTWLTPLIPSTLLS
    LGGLPPLTGFLPKWAIIEEFTKNNSLIIPTIMATITLLNLYFYLRLIYST
    SITLLPMSNNVKMKWQFEHTKPTPFLPTLIALTTLLLPISPFMLMIL
    
  2. Compare your conversion with the published conversion in the GenPept database.

    ND2protein = getgenpept('YP_003024027','sequenceonly',true)
    

    The getgenpept function retrieves the published conversion from the NCBI database and reads it into the MATLAB Workspace.

  3. Count the amino acids in the protein sequence.

    aacount(ND2AASeq, 'chart','bar')
    

    A bar graph displays. Notice the high content for leucine, threonine and isoleucine, and also notice the lack of cysteine and aspartic acid.

  4. Determine the atomic composition and molecular weight of the protein.

    atomiccomp(ND2AASeq)
    molweight (ND2AASeq)
    

    The following displays in the MATLAB Workspace:

    ans = 
    
        C: 1818
        H: 2882
        N: 420
        O: 471
        S: 25
    ans =
    
      3.8960e+004
    

    If this sequence was unknown, you could use this information to identify the protein by comparing it with the atomic composition of other proteins in a database.