seqconsensus

Calculate consensus sequence

collapse all in page

Syntax

CSeq = seqconsensus(Seqs)

[CSeq,Score] = seqconsensus(Seqs)

CSeq = seqconsensus(Profile)

seqconsensus(Seqs,"ScoringMatrix",sMatrix)

seqconsensus(Seqs,Name=Value)

Description

CSeq = seqconsensus(Seqs) returns a character vector with the consensus sequence, CSeq for a multiply aligned set of sequences. The frequency of symbols (20 amino acids, 4 nucleotides) in the set of sequences is determined with the function seqprofile. For ambiguous nucleotide or amino acid symbols, the frequency or count is added to the standard set of symbols.

example

[CSeq,Score] = seqconsensus(Seqs) returns the conservation score of the consensus sequence. Scores are computed with the scoring matrix "BLOSUM50" for amino acids or "NUC44" for nucleotides. Scores are the average euclidean distance between the scored symbol and the M-dimensional consensus value. M is the size of the alphabet. The consensus value is the profile weighted by the scoring matrix.

CSeq = seqconsensus(Profile) returns a character vector with the consensus sequence CSeq from a sequence profile Profile.

seqconsensus(Seqs,"ScoringMatrix",sMatrix) specifies the scoring matrix.

seqconsensus(Seqs,Name=Value) specifies additional options using one or more name-value arguments.

Examples

collapse all

Return Consensus Sequence for a Multiply Aligned Set of Sequences

Open Live Script

Create an array of structures representing a multiple alignment of amino acids.

 seqs = fastaread("pf00002.fa");

Get the consensus sequence for the above multiple alignment of amino acids, starting at position 50, and ending at position 60, with all gaps included.

[C,S] = seqconsensus(seqs,"limits",[50 60],"gaps","all")

C = 
'ILRAIAFFIK-'

S = 1×11

    6.0082    5.1850    7.9670    6.9334    7.7020    8.1416    9.3865    5.8234    4.3273    9.4917    1.4572

Input Arguments

collapse all

`Seqs` — Set of multiply aligned sequences
character array | string vector | array of structures | cell array of character vectors

Set of multiply aligned sequences, represented by any of the following:

Character array
Cell array of character vectors
String vector
Array of structures, containing the field Sequence

Example: {'YNTVKTGYTIGYSLS.LASLLVAMAILSLFRKLHCTR'}

`Profile` — Sequence profile
matrix of doubles

Sequence profile. Enter a profile from the function seqprofile.

Profile is a matrix of size [20 (or 4) x Sequence Length] with the frequency, or count of amino acids (or nucleotides) for every position. A Profile matrix can also have 21 (or 5) rows, if gaps are included in the consensus.

`sMatrix` — Scoring matrix for alignment
character vector | string | numeric array

Scoring matrix for alignment, specified as either a character vector, string, or numeric array with dimensions 21x21, 5x5, 20x20, or 4x4.

Possible values for amino acid string or character vector sequences are:

"BLOSUM62"
"BLOSUM30" increasing by 5 up to "BLOSUM90"
"BLOSUM100"
"PAM10" increasing by 10 up to "PAM500"
"DAYHOFF"
"GONNET"

Default values are:

"BLOSUM50" — When Alphabet equals "AA"
"NUC44" — When Alphabet equals "NT"

Note

The above scoring matrices, provided with the software, also include a structure containing a scale factor that converts the units of the output score to bits. You can also use the Scale property to specify an additional scale factor to convert the output score from bits to another unit.

Specify a 21x21, 5x5, 20x20, or 4x4 numeric array for the gap-included cases. The gap scores (last row/column) are set to mean(diag(ScoringMatrix)) for a gap matching with another gap, and mean(nodiag(ScoringMatrix)) for a gap matching with another symbol.

Note

If you use a scoring matrix that you created, the matrix does not include a scale factor. The output score will be returned in the same units as the scoring matrix.

Note

If you need to compile seqconsensus into a stand-alone application or software component using MATLAB^® Compiler™, use a matrix instead of a character vector or string for ScoringMatrixValue.

Example: "DAYHOFF"

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: seqconsensus(seqs,Gaps="all") returns a set of multiply aligned amino acid or nucleotide sequences with all gaps included.

`Alphabet` — Sequence alphabet
`"AA"` (default) | `"NT"`

Sequence alphabet, specified as one of the following:

"NT" — Nucleotides
"AA" — Amino acids

Example: "NT"

Data Types: char | string

`Gaps` — Control for counting gaps in sequence
`"none"` (default) | `"all"` | `"noflanks"`

Control for counting gaps in a sequence, specified as one of the following:

"none" — Counts no gaps.
"all" — Counts all gaps.
"noflanks" — Counts all gaps except those at the flanks of every sequence.

Data Types: char | string

`Ambiguous` — Control for ambiguous symbols
`"ignore"` (default) | "count"

Control for ambiguous symbols, specified as "ignore" or "count".

`Limits` — Control for using part of sequence
`[0,Inf]` (default) | `1`-by-`2` vector

Control for using part of a sequence, specified as a 1-by-2 vector. Enter a 1-by-2 vector with the first position and the last position to include in the profile.

The defaults limits are [1,SeqLength], where SeqLength is the length of a multiply aligned sequence.

Example: [50, 60]

Output Arguments

collapse all

`CSeq` — Consensus sequence
character vector | string

Consensus sequence, returned as a character vector or string.

`Score` — Conservation score of consensus sequence
`double`

Conservation score of a consensus sequence, returned as a double. Scores are computed with the scoring matrix "BLOSUM50" for amino acids or "NUC44" for nucleotides. Scores are the average euclidean distance between the scored symbol and the M-dimensional consensus value. M is the size of the alphabet. The consensus value is the profile weighted by the scoring matrix.

Version History

Introduced before R2006a

seqconsensus

Syntax

Description

Examples

Return Consensus Sequence for a Multiply Aligned Set of Sequences

Input Arguments

`Seqs` — Set of multiply aligned sequences
character array | string vector | array of structures | cell array of character vectors

`Profile` — Sequence profile
matrix of doubles

`sMatrix` — Scoring matrix for alignment
character vector | string | numeric array

Name-Value Arguments

`Alphabet` — Sequence alphabet
`"AA"` (default) | `"NT"`

`Gaps` — Control for counting gaps in sequence
`"none"` (default) | `"all"` | `"noflanks"`

`Ambiguous` — Control for ambiguous symbols
`"ignore"` (default) | "count"

`Limits` — Control for using part of sequence
`[0,Inf]` (default) | `1`-by-`2` vector

Output Arguments

`CSeq` — Consensus sequence
character vector | string

`Score` — Conservation score of consensus sequence
`double`

Version History

See Also

Functions

seqconsensus

Syntax

Description

Examples

Return Consensus Sequence for a Multiply Aligned Set of Sequences

Input Arguments

Seqs — Set of multiply aligned sequences character array | string vector | array of structures | cell array of character vectors

Profile — Sequence profile matrix of doubles

sMatrix — Scoring matrix for alignment character vector | string | numeric array

Name-Value Arguments

Alphabet — Sequence alphabet "AA" (default) | "NT"

Gaps — Control for counting gaps in sequence "none" (default) | "all" | "noflanks"

Ambiguous — Control for ambiguous symbols "ignore" (default) | "count"

Limits — Control for using part of sequence [0,Inf] (default) | 1-by-2 vector

Output Arguments

CSeq — Consensus sequence character vector | string

Score — Conservation score of consensus sequence double

Version History

See Also

Functions

`Seqs` — Set of multiply aligned sequences
character array | string vector | array of structures | cell array of character vectors

`Profile` — Sequence profile
matrix of doubles

`sMatrix` — Scoring matrix for alignment
character vector | string | numeric array

`Alphabet` — Sequence alphabet
`"AA"` (default) | `"NT"`

`Gaps` — Control for counting gaps in sequence
`"none"` (default) | `"all"` | `"noflanks"`

`Ambiguous` — Control for ambiguous symbols
`"ignore"` (default) | "count"

`Limits` — Control for using part of sequence
`[0,Inf]` (default) | `1`-by-`2` vector

`CSeq` — Consensus sequence
character vector | string

`Score` — Conservation score of consensus sequence
`double`