Main Content

seqconsensus

Calculate consensus sequence

    Description

    CSeq = seqconsensus(Seqs) returns a character vector with the consensus sequence, CSeq for a multiply aligned set of sequences. The frequency of symbols (20 amino acids, 4 nucleotides) in the set of sequences is determined with the function seqprofile. For ambiguous nucleotide or amino acid symbols, the frequency or count is added to the standard set of symbols.

    example

    [CSeq,Score] = seqconsensus(Seqs) returns the conservation score of the consensus sequence. Scores are computed with the scoring matrix "BLOSUM50" for amino acids or "NUC44" for nucleotides. Scores are the average euclidean distance between the scored symbol and the M-dimensional consensus value. M is the size of the alphabet. The consensus value is the profile weighted by the scoring matrix.

    CSeq = seqconsensus(Profile) returns a character vector with the consensus sequence CSeq from a sequence profile Profile.

    seqconsensus(Seqs,"ScoringMatrix",sMatrix) specifies the scoring matrix.

    seqconsensus(Seqs,Name=Value) specifies additional options using one or more name-value arguments.

    Examples

    collapse all

    Create an array of structures representing a multiple alignment of amino acids.

     seqs = fastaread("pf00002.fa");

    Get the consensus sequence for the above multiple alignment of amino acids, starting at position 50, and ending at position 60, with all gaps included.

    [C,S] = seqconsensus(seqs,"limits",[50 60],"gaps","all")
    C = 
    'ILRAIAFFIK-'
    
    S = 1×11
    
        6.0082    5.1850    7.9670    6.9334    7.7020    8.1416    9.3865    5.8234    4.3273    9.4917    1.4572
    
    

    Input Arguments

    collapse all

    Set of multiply aligned sequences, represented by any of the following:

    • Character array

    • Cell array of character vectors

    • String vector

    • Array of structures, containing the field Sequence

    Example: {'YNTVKTGYTIGYSLS.LASLLVAMAILSLFRKLHCTR'}

    Sequence profile. Enter a profile from the function seqprofile.

    Profile is a matrix of size [20 (or 4) x Sequence Length] with the frequency, or count of amino acids (or nucleotides) for every position. A Profile matrix can also have 21 (or 5) rows, if gaps are included in the consensus.

    Scoring matrix for alignment, specified as either a character vector, string, or numeric array with dimensions 21x21, 5x5, 20x20, or 4x4.

    Possible values for amino acid string or character vector sequences are:

    • "BLOSUM62"

    • "BLOSUM30" increasing by 5 up to "BLOSUM90"

    • "BLOSUM100"

    • "PAM10" increasing by 10 up to "PAM500"

    • "DAYHOFF"

    • "GONNET"

    Default values are:

    • "BLOSUM50" — When Alphabet equals "AA"

    • "NUC44" — When Alphabet equals "NT"

    Note

    The above scoring matrices, provided with the software, also include a structure containing a scale factor that converts the units of the output score to bits. You can also use the Scale property to specify an additional scale factor to convert the output score from bits to another unit.

    Specify a 21x21, 5x5, 20x20, or 4x4 numeric array for the gap-included cases. The gap scores (last row/column) are set to mean(diag(ScoringMatrix)) for a gap matching with another gap, and mean(nodiag(ScoringMatrix)) for a gap matching with another symbol.

    Note

    If you use a scoring matrix that you created, the matrix does not include a scale factor. The output score will be returned in the same units as the scoring matrix.

    Note

    If you need to compile seqconsensus into a stand-alone application or software component using MATLAB® Compiler™, use a matrix instead of a character vector or string for ScoringMatrixValue.

    Example: "DAYHOFF"

    Name-Value Arguments

    Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

    Example: seqconsensus(seqs,Gaps="all") returns a set of multiply aligned amino acid or nucleotide sequences with all gaps included.

    Sequence alphabet, specified as one of the following:

    • "NT" — Nucleotides

    • "AA" — Amino acids

    Example: "NT"

    Data Types: char | string

    Control for counting gaps in a sequence, specified as one of the following:

    • "none" — Counts no gaps.

    • "all" — Counts all gaps.

    • "noflanks" — Counts all gaps except those at the flanks of every sequence.

    Data Types: char | string

    Control for ambiguous symbols, specified as "ignore" or "count".

    Control for using part of a sequence, specified as a 1-by-2 vector. Enter a 1-by-2 vector with the first position and the last position to include in the profile.

    The defaults limits are [1,SeqLength], where SeqLength is the length of a multiply aligned sequence.

    Example: [50, 60]

    Output Arguments

    collapse all

    Consensus sequence, returned as a character vector or string.

    Conservation score of a consensus sequence, returned as a double. Scores are computed with the scoring matrix "BLOSUM50" for amino acids or "NUC44" for nucleotides. Scores are the average euclidean distance between the scored symbol and the M-dimensional consensus value. M is the size of the alphabet. The consensus value is the profile weighted by the scoring matrix.

    Version History

    Introduced before R2006a