seqfilter
Filter out sequences based on specified criterion
Description
seqfilter(fastqFile) applies a filtering criterion to the sequences in fastqFile and saves the sequences that meet the criterion in a new FASTQ file. By default, the sequences that pass the criterion are saved under file names with the suffix '_filtered' appended. If you do not specify a criterion, the function uses the default criterion, 'MaxNumberLowQualityBases', with its default threshold of [0 10].
seqfilter(fastqFile,Name,Value) uses additional options specified by one or more Name,Value pair arguments.
Examples
Filter next-generation sequencing data
Filter out sequences in which more than 10% of the bases are low quality, where a base is considered low quality when its quality score is less than 20.
[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
    'Method','MaxPercentLowQualityBases',...
    'Threshold',[10 20]);
Check the number of sequences saved in the output file.
in
in = 39
Check the number of sequences filtered out.
out
out = 11
Filter out sequences with an average quality score below 20.
[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
    'Method','MeanQuality',...
    'Threshold',20);
Apply the same criterion over a sliding window of 10 bases.
[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
    'Method','MeanQuality',...
    'Threshold',20,'WindowSize',10);
Filter out sequences with fewer than 100 bases.
[outFile,in,out] = seqfilter('SRR005164_1_50.fastq',...
    'Method','MinLength',...
    'Threshold',100);
Input Arguments
fastqFile — Names of FASTQ files with sequence and quality information
character vector | string | string vector | cell array of character vectors
Names of FASTQ-formatted files with sequence and quality information, specified as a character vector, string, string vector, or cell array of character vectors.
Example: 'SRR005164_1_50.fastq'
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: 'Method','MaxNumberLowQualityBases','Threshold',[5 15] filters out sequences with more than 5 low-quality bases in total, where a base is considered low quality if its quality score is less than 15.
Method — Criterion to filter sequences
'MaxNumberLowQualityBases' (default) | 'MaxPercentLowQualityBases' | 'MeanQuality' | 'MinLength'
Criterion to filter sequences, specified as one of the following options. Specify only one filtering criterion per function call.
- 'MaxNumberLowQualityBases' – applies a maximum threshold on the number of low-quality bases allowed.
- 'MaxPercentLowQualityBases' – applies a maximum threshold on the percentage of low-quality bases allowed.
- 'MeanQuality' – applies a minimum threshold on the average base quality across each sequence.
- 'MinLength' – applies a minimum threshold on the sequence length.
Use this name-value pair argument together with 'Threshold' to specify the appropriate threshold value. Depending on the filtering criterion, the value for 'Threshold' can be a scalar or a two-element vector. See the 'Threshold' option for the default values. If you do not specify 'Threshold', the function uses the default threshold value of the specified method. For each filtering criterion, the function uses the base quality encoding format specified by the 'Encoding' name-value pair argument.
Example: 'Method','MaxNumberLowQualityBases','Threshold',[5 15]
Threshold — Threshold value for filtering criterion
scalar | vector
Threshold value for the filtering criterion, specified as a scalar or vector. Use this name-value pair to define the threshold value for the filtering criterion specified by 'Method'. Depending on the filtering criterion, the value for 'Threshold' can be a scalar or a two-element vector. If you do not specify 'Threshold', the function uses the default threshold value of the corresponding method. For each filtering criterion, the function uses the base quality encoding format specified by the 'Encoding' name-value pair argument.
'Method' | 'Threshold' | Default 'Threshold' value |
---|---|---|
'MaxNumberLowQualityBases' | Two-element vector [V1 V2]. V1 is a nonnegative integer that specifies the maximum number of low-quality bases allowed. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. Any sequence containing more than V1 low-quality bases is filtered out and not saved in the output file. | [0 10] |
'MaxPercentLowQualityBases' | Two-element vector [V1 V2]. V1 is a scalar between 0 and 100 that specifies the maximum percentage of low-quality bases allowed. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. Any sequence in which the percentage of low-quality bases is greater than V1 is filtered out and not saved in the output file. | [0 10] |
'MeanQuality' | Positive scalar that specifies the minimum threshold on the average base quality across each sequence. Any sequence with average base quality less than this value is filtered out. | 0 |
'MinLength' | Nonnegative integer that specifies the minimum threshold on the sequence length allowed. Any sequence with length less than this value is filtered out. | 1 |
Example: 'Method','MaxPercentLowQualityBases','Threshold',[10 20]
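The rules in the table above can be illustrated on a single read. The following is a minimal sketch in plain MATLAB, not the function's internal implementation: the quality vector `qual` and all variable names are hypothetical, and the scores are assumed to be already decoded to numeric values.

```matlab
% Hypothetical numeric quality scores for one 10-base read.
qual = [30 32 8 35 12 31 33 9 34 36];

% 'MaxNumberLowQualityBases' with Threshold [V1 V2] = [2 10]
V1 = 2;  V2 = 10;
nLow = sum(qual < V2);                 % bases with quality below V2
passNumber = nLow <= V1;               % kept if the count does not exceed V1

% 'MaxPercentLowQualityBases' with Threshold [V1 V2] = [10 20]
P1 = 10;  P2 = 20;
pctLow = 100 * sum(qual < P2) / numel(qual);
passPercent = pctLow <= P1;            % kept if the percentage does not exceed P1

% 'MeanQuality' with Threshold 20, and 'MinLength' with Threshold 100
passMean   = mean(qual) >= 20;         % kept if the average is not below 20
passLength = numel(qual) >= 100;       % kept if the read is not shorter than 100
```

With these values, the read passes the count-based and mean-quality checks but fails the percentage and minimum-length checks.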
WindowSize — Size of sliding window to apply filtering criterion to sequence
Inf (default) | positive integer
Size of the sliding window to apply the filtering criterion to a sequence, specified as a positive integer. The size of the window corresponds to the number of bases that the function uses at one time to apply the criterion. If any window fails the criterion, the whole sequence is discarded.
The default is Inf, that is, the filtering criterion is applied to the whole sequence.
Example: 'WindowSize',100
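To see why a windowed criterion is stricter than a whole-sequence one, consider the sketch below. It assumes the window advances one base at a time, which this page does not state, and it uses the standard `movmean` function on a hypothetical quality vector; it is an illustration, not the function's own code.

```matlab
% Hypothetical numeric quality scores with a short low-quality stretch.
qual = [30 30 30 30 5 5 5 30 30 30];
minMean = 20;                          % 'MeanQuality' threshold
w = 3;                                 % window size in bases

% Whole-sequence check (WindowSize = Inf).
passWhole = mean(qual) >= minMean;

% Windowed check: mean of every full window of w consecutive bases
% (assumed step of one base). 'Endpoints','discard' drops partial windows.
winMeans = movmean(qual, w, 'Endpoints', 'discard');
passWindowed = all(winMeans >= minMean);   % fails if any window fails
```

Here the read passes on its overall average (22.5) but is rejected once the three consecutive low-quality bases fill a window.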
Encoding — Base quality encoding format
'Illumina18' (default) | 'Sanger' | 'Solexa' | 'Illumina13' | 'Illumina15'
Base quality encoding format, specified as a character vector or string.
Example: 'Encoding','Sanger'
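The encoding matters because the same quality character maps to different scores under different formats. The sketch below assumes the commonly documented ASCII offsets (33 for Sanger and Illumina 1.8+, 64 for Illumina 1.3 and 1.5); it does not cover 'Solexa', which uses a different score scale, and the quality strings are hypothetical.

```matlab
% Hypothetical quality strings from FASTQ records.
qual33 = 'II5+#';                      % written with an ASCII offset of 33
qual64 = 'hhTJB';                      % same scores written with an offset of 64

% Decode by subtracting the offset from the character codes.
phred33 = double(qual33) - 33;         % assumed offset for 'Sanger' and 'Illumina18'
phred64 = double(qual64) - 64;         % assumed offset for 'Illumina13' and 'Illumina15'

% Decoding with the wrong offset shifts every score by 31.
wrong = double(qual64) - 33;
```

Both strings decode to the scores 40, 40, 20, 10, and 2, whereas the mismatched decoding inflates every score, so a threshold such as 20 would no longer reject anything.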
OutputDir — Relative or absolute path to output file directory
character vector | string
Relative or absolute path to the output file directory, specified as a character vector or string. The default is the current directory.
Example: 'OutputDir','F:\results'
OutputSuffix — Suffix to use in output file name
'_filtered' (default) | character vector | string
Suffix to use in the output file name, specified as a character vector or string. The suffix is inserted after the input file name and before the file extension. The default is '_filtered'.
Example: 'OutputSuffix','_WindowSize100_filtered'
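The naming rule described above can be reproduced with `fileparts` and `fullfile`, for example to predict an output path before running the filter. This is a sketch of the rule as described on this page, not the function's own code, and the directory name `results` is a hypothetical placeholder.

```matlab
inFile = 'SRR005164_1_50.fastq';       % input file name
suffix = '_filtered';                  % value of 'OutputSuffix'
outDir = 'results';                    % hypothetical value of 'OutputDir'

% Split the name, then insert the suffix before the extension.
[~, name, ext] = fileparts(inFile);
outName = [name suffix ext];
outPath = fullfile(outDir, outName);   % platform-specific separator
```

For this input, `outName` is `SRR005164_1_50_filtered.fastq`.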
PairedFiles — Whether to consider input files as pairs for paired-end sequence data
false (default) | true
Whether to consider the input files as pairs for paired-end sequence data, specified as true or false. If true, the input files are read as pairs, and the sequence data is kept in sync between the files. That is, if a sequence is filtered out in the first file, the corresponding sequence in the paired file is also filtered out.
Example: 'PairedFiles',true
WriteSingleton — Whether to save singleton sequences in a separate output file
false (default) | true
Whether to save singleton sequences in a separate output file, specified as true or false. To set this option to true, you must also set the 'PairedFiles' option to true.
A singleton sequence is a sequence that passes the filtering criterion while its corresponding sequence in the paired file does not. If true, singleton sequences are saved in a separate file with the suffix '_singleton'. The default is false, meaning that only sequences that pass the filtering criterion in both input files of a given pair are saved in the output files.
Example: 'PairedFiles',true,'WriteSingleton',true
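The relationship between paired output and singletons can be expressed with logical masks. The sketch below is illustrative only: `pass1` and `pass2` are hypothetical per-read pass/fail results for the two files of a pair, listed in the same read order.

```matlab
% Hypothetical pass/fail results for five read pairs.
pass1 = logical([1 1 0 1 0]);          % reads from the first file
pass2 = logical([1 0 0 1 1]);          % mates from the second file

% With 'PairedFiles',true only pairs in which both reads pass are kept.
keepBoth = pass1 & pass2;

% With 'WriteSingleton',true reads whose mate failed go to separate files.
singleton1 = pass1 & ~pass2;           % singletons from the first file
singleton2 = pass2 & ~pass1;           % singletons from the second file
```

In this example, pairs 1 and 4 are kept, read 2 of the first file and read 5 of the second file are singletons, and pair 3 is dropped entirely.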
UseParallel — Boolean indicating whether to perform computation in parallel
false (default) | true
Boolean indicating whether to perform computation in parallel, specified as true or false.
For parallel computing, you must have Parallel Computing Toolbox™. If a parallel pool does not exist, one is created automatically when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in serial mode.
Note
There is a cost associated with sharing large input files across workers in a distributed environment. In some cases, running in parallel may not be beneficial in terms of performance.
Example: 'UseParallel',true
Output Arguments
outFiles — Output file names
cell array of character vectors
Output file names, returned as a cell array of character vectors.
nSeqIn — Number of sequences selected from each input file
scalar | vector
Number of sequences selected from each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqIn corresponds to the order of the input files.
nSeqOut — Number of sequences excluded from each input file
scalar | vector
Number of sequences excluded from each input file, returned as a scalar or an n-by-1 vector, where n is the number of input files. If there are multiple input files, the order within nSeqOut corresponds to the order of the input files.
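The two counts can be combined to summarize a filtering run. The sketch below uses the values from the first example on this page (39 selected, 11 excluded) and assumes that every input sequence is counted in exactly one of the two outputs; the derived variable names are hypothetical.

```matlab
nSeqIn  = 39;                          % sequences selected (first example)
nSeqOut = 11;                          % sequences excluded (first example)

% Total sequences read, assuming each one is either selected or excluded.
nTotal = nSeqIn + nSeqOut;

% Percentage of sequences kept; element-wise so it also works for vectors.
pctKept = 100 * nSeqIn ./ nTotal;
```

For these values, `nTotal` is 50 and `pctKept` is 78.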
Extended Capabilities
Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.
To run in parallel, set 'UseParallel' to true. For more information, see the 'UseParallel' name-value pair argument.