detectspeechnn
Syntax
Description
roi = detectspeechnn(audioIn,fs,Name=Value) specifies options using one or more name-value arguments. For example, detectspeechnn(audioIn,fs,MergeThreshold=0.5) merges speech regions that are separated by 0.5 seconds or less.
detectspeechnn(___) with no output arguments plots the input signal and the detected speech regions.
This function requires both Audio Toolbox™ and Deep Learning Toolbox™.
Examples
Detect Speech in Audio Signal
Read in an audio signal containing speech and music and listen to the sound.
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
sound(audioIn,fs)
Call detectspeechnn on the signal to obtain the regions of interest (ROIs), in samples, containing speech.
roi = detectspeechnn(audioIn,fs)
roi = 2×2
1 63120
83600 150000
Convert the ROIs from samples to seconds.
roiSeconds = (roi-1)/fs
roiSeconds = 2×2
0 3.9449
5.2249 9.3749
Plot the audio waveform with the speech regions.
detectspeechnn(audioIn,fs)
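Because each row of roi holds the start and end sample indices of a region, the region durations and the corresponding audio segments follow directly from the indices. The following is a minimal sketch that hard-codes the roi values shown above and assumes fs is 16000 Hz, which is consistent with the converted times shown. A random vector stands in for the audio signal so that the snippet runs without the audio file or the toolboxes, and the end index is treated as inclusive.

```matlab
% Values copied from the example output above. fs = 16000 is an assumption
% that matches the sample-to-second conversion shown.
fs = 16000;
roi = [1 63120; 83600 150000];

% Placeholder signal standing in for the audio read from the file.
audioIn = randn(150000,1);

% Duration of each region in seconds (end index treated as inclusive).
durations = (roi(:,2) - roi(:,1) + 1)/fs;

% Extract the samples of each speech region into a cell array.
segments = cell(size(roi,1),1);
for k = 1:size(roi,1)
    segments{k} = audioIn(roi(k,1):roi(k,2));
end
```

Each cell of segments can then be passed to sound or to further processing on its own.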
Refine Speech Regions with Energy-Based VAD
Read in an audio signal containing a speaker repeating the phrase "volume up".
[audioIn,fs] = audioread("MaleVolumeUp-16-mono-6secs.ogg");
Compare detected speech regions by calling detectspeechnn with and without the application of an energy-based voice activity detector (VAD) in postprocessing.
tiledlayout(2,1)
nexttile()
detectspeechnn(audioIn,fs)
nexttile()
detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)
Adjust Postprocessing Parameters for Detecting Speech
Read in an audio signal.
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
Call detectspeechnn with no output arguments to display a plot of the detected speech regions.
detectspeechnn(audioIn,fs);
Modify the parameters used in the postprocessing algorithm and see how they affect the detected speech regions. For more information about the VAD postprocessing algorithm, see Postprocessing.
mergeThreshold = 1.3; % seconds
lengthThreshold = 0.25; % seconds
activationThreshold = 0.5; % probability
deactivationThreshold = 0.25; % probability
applyEnergyVAD = false;
detectspeechnn(audioIn,fs,MergeThreshold=mergeThreshold, ...
    LengthThreshold=lengthThreshold, ...
    ActivationThreshold=activationThreshold, ...
    DeactivationThreshold=deactivationThreshold, ...
    ApplyEnergyVAD=applyEnergyVAD)
Get Probability of Voice Activity Per Sample of Audio
Read in an audio signal containing speech and music.
[audioIn,fs] = audioread("MusicAndSpeech-16-mono-14secs.ogg");
Call detectspeechnn with an additional output variable to get the probabilities of speech in each sample of the signal.
[roi,probs] = detectspeechnn(audioIn,fs);
Plot the audio signal along with the voice activity probability.
t = (0:length(audioIn)-1)/fs;
plot(t,audioIn,t,probs,"r")
legend("Audio signal","Probability of speech",Location="best")
xlabel("Time (s)")
title("Voice Activity Probability")
Detect Speech in Streaming Audio
Use detectspeechnn to detect the presence of speech in a streaming audio signal.
Create a dsp.AudioFileReader object to stream an audio file for processing. Set the SamplesPerFrame property to read 100 ms nonoverlapping chunks from the signal.
afr = dsp.AudioFileReader("MaleVolumeUp-16-mono-6secs.ogg");
analysisDuration = 0.1; % seconds
afr.SamplesPerFrame = floor(analysisDuration*afr.SampleRate);
The neural network architecture of detectspeechnn does not retain state between calls, and it performs best when analyzing larger chunks of audio signals. When you use detectspeechnn in a streaming scenario, the accuracy, computational efficiency, and latency requirements of your application dictate the analysis duration and whether to overlap analysis chunks.
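To make the trade-off concrete, the following sketch computes the start and end sample indices of overlapping analysis chunks in base MATLAB. All of the values are hypothetical: a 16 kHz sample rate, a 6-second signal, 500 ms chunks, and 50% overlap. Longer chunks give the network more context per call, while the hop length sets how often a new decision becomes available.

```matlab
fs = 16000;                      % assumed sample rate
numSamples = 6*fs;               % placeholder length for a 6 s signal
chunkLength = round(0.5*fs);     % hypothetical 500 ms analysis chunk
hopLength = round(0.25*fs);      % hypothetical 250 ms hop (50% overlap)

% Start and end indices of every full chunk that fits in the signal.
starts = 1:hopLength:(numSamples - chunkLength + 1);
stops = starts + chunkLength - 1;

% Fraction of each chunk shared with the previous one.
overlapFraction = (chunkLength - hopLength)/chunkLength;
```

Each pair starts(k) and stops(k) then selects one chunk of the signal to analyze.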
Create a timescope object to plot the audio signal and the detected speech regions. Create an audioDeviceWriter to play the audio as you stream it.
scope = timescope(NumInputPorts=2, ...
    SampleRate=afr.SampleRate, ...
    TimeSpanSource="property",TimeSpan=5, ...
    YLimits=[-1.2,1.2], ...
    ShowLegend=true,ChannelNames=["Audio","Detected Speech"]);
adw = audioDeviceWriter(afr.SampleRate);
In a streaming loop:
Read in a 100 ms chunk from the audio file.
Use detectspeechnn to detect any regions of speech in the frame. Use sigroi2binmask to convert the region indices to a binary mask.
Plot the audio signal and the detected speech.
Play the audio with the device writer.
while ~isDone(afr)
    audioIn = afr();
    segments = detectspeechnn(audioIn,afr.SampleRate,LengthThreshold=0.01);
    mask = sigroi2binmask(segments,afr.SamplesPerFrame);
    scope(audioIn,mask)
    adw(audioIn);
end
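The conversion from region indices to a binary mask can be pictured with plain indexing. The following is a hand-written sketch of that idea, not the implementation of sigroi2binmask. The region indices are hypothetical values chosen for a 1600-sample frame, which corresponds to 100 ms at an assumed 16 kHz sample rate.

```matlab
frameLength = 1600;                 % assumed: 100 ms at 16 kHz
segments = [201 700; 1101 1600];    % hypothetical ROI indices within the frame

% Mark every sample that falls inside a region as true.
mask = false(frameLength,1);
for k = 1:size(segments,1)
    mask(segments(k,1):segments(k,2)) = true;
end
```

Plotted against the audio frame, the mask is 1 wherever speech was detected and 0 elsewhere.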
Input Arguments
audioIn
— Audio input
column vector
Audio input signal, specified as a column vector (single channel).
Data Types: single | double
fs
— Sample rate (Hz)
positive scalar
Sample rate in Hz, specified as a positive scalar.
Data Types: single | double
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: detectspeechnn(audioIn,fs,ApplyEnergyVAD=true)
MergeThreshold
— Merge threshold
0.25 (default) | nonnegative scalar
Merge threshold in seconds, specified as a nonnegative scalar. The function merges speech regions that are separated by a duration less than or equal to the specified threshold. Set the threshold to Inf to not merge any detected regions.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
LengthThreshold
— Length threshold
0.25 (default) | nonnegative scalar
Length threshold in seconds, specified as a nonnegative scalar. The function does not return speech regions that have a duration less than or equal to the specified threshold.
Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64
ActivationThreshold
— Probability threshold to start a speech segment
0.5 (default) | scalar in the range [0, 1]
Probability threshold to start a speech segment, specified as a scalar in the range [0, 1].
Data Types: single | double
DeactivationThreshold
— Probability threshold to end a speech segment
0.25 (default) | scalar in the range [0, 1]
Probability threshold to end a speech segment, specified as a scalar in the range [0, 1].
Data Types: single | double
ApplyEnergyVAD
— Apply energy-based voice activity detector
false (default) | true
Apply energy-based voice activity detector (VAD) to the speech regions detected by the neural network, specified as true or false.
Data Types: logical
Output Arguments
roi
— Speech regions
N-by-2 matrix
Speech regions, returned as an N-by-2 matrix of indices into the input signal, where N is the number of individual speech regions detected. The first column contains the index of the start of a speech region, and the second column contains the index of the end of a region.
probs
— Probability of speech per sample
column vector
Probability of speech per sample of the input audio signal, returned as a column vector with the same size as the input signal.
Algorithms
Preprocessing
The detectspeechnn function preprocesses the audio data using the following steps.
Resample the audio to 16 kHz.
Compute a centered short-time Fourier transform (STFT) using a 25 ms periodic Hamming window and 10 ms hop length. Pad the signal so that the first window is centered at 0 s.
Convert the STFT to a power spectrogram.
Apply a mel filter bank with 40 bands to obtain a mel spectrogram.
Convert the mel spectrogram to a log scale.
Standardize each of the mel bands to have zero mean and standard deviation of 1.
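The frame sizes and the standardization step above can be worked through in base MATLAB. This sketch only computes the window and hop lengths in samples at 16 kHz and standardizes a placeholder matrix. The random matrix stands in for a log-mel spectrogram and is not produced by the feature extraction that the function uses. Two details are assumptions: each band is standardized across time, and the standard deviation uses the MATLAB default (sample) normalization.

```matlab
fs = 16000;                        % sample rate after resampling
winLength = round(0.025*fs);       % 25 ms window -> 400 samples
hopLength = round(0.010*fs);       % 10 ms hop    -> 160 samples

% Placeholder log-mel spectrogram: 40 bands (rows) by 100 frames (columns).
rng(0)
logMel = 3*randn(40,100) + 5;

% Standardize each band across time to zero mean and unit standard deviation.
% Assumption: sample standard deviation (MATLAB default normalization).
features = (logMel - mean(logMel,2))./std(logMel,0,2);
```

With a 160-sample hop, the network sees one feature frame for every 10 ms of audio.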
Neural Network Inference
The preprocessed data is passed to a pretrained VAD neural network. The network outputs represent the probability of speech in each frame of audio in the input spectrogram.
The neural network is a ported version of the vad-crdnn-libriparty pretrained model provided by SpeechBrain [1], which combines convolutional, recurrent, and fully connected layers.
Postprocessing
The detectspeechnn function postprocesses the VAD network output using the following steps.
Apply activation and deactivation thresholds to posterior probabilities to determine candidate speech regions.
Optionally, apply energy-based VAD to refine the detected speech regions.
Merge speech regions that are close to each other according to the merge threshold.
Remove speech regions that are shorter than or equal to the length threshold.
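The thresholding, merging, and length-filtering steps above can be illustrated on a toy probability vector. This is a simplified sketch in base MATLAB, not the function's implementation. It works on frame indices rather than sample indices, skips the optional energy-based VAD step, and assumes a 10 ms frame duration. Whether the comparisons against the activation and deactivation thresholds are strict or inclusive is also an assumption. All variable names and values are hypothetical.

```matlab
% Toy per-frame speech probabilities, one frame per 10 ms (assumed).
frameDur = 0.01;
probs = [0.1 0.6 0.4 0.3 0.2 0.1 0.7 0.8 0.3 0.1 0.1 0.1 0.9 0.1];
activationThreshold = 0.5;
deactivationThreshold = 0.25;
mergeThreshold = 0.025;    % seconds
lengthThreshold = 0.015;   % seconds

% Hysteresis thresholding: a region starts when the probability rises above
% the activation threshold and ends when it falls below the deactivation
% threshold.
regions = zeros(0,2);
active = false;
for n = 1:numel(probs)
    if ~active && probs(n) > activationThreshold
        active = true;
        startIdx = n;
    elseif active && probs(n) < deactivationThreshold
        active = false;
        regions(end+1,:) = [startIdx, n-1];
    end
end
if active
    regions(end+1,:) = [startIdx, numel(probs)];
end

% Merge regions separated by a gap no longer than the merge threshold.
merged = zeros(0,2);
for k = 1:size(regions,1)
    if ~isempty(merged) && (regions(k,1) - merged(end,2) - 1)*frameDur <= mergeThreshold
        merged(end,2) = regions(k,2);
    else
        merged(end+1,:) = regions(k,:);
    end
end

% Remove regions whose duration does not exceed the length threshold.
durations = (merged(:,2) - merged(:,1) + 1)*frameDur;
roiFrames = merged(durations > lengthThreshold,:);
```

In this toy case, thresholding yields three candidate regions. The first two are merged because only two frames (20 ms) separate them, and the single-frame region at the end is removed for being too short.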
References
[1] Ravanelli, Mirco, et al. SpeechBrain: A General-Purpose Speech Toolkit. arXiv, 8 June 2021. arXiv.org, http://arxiv.org/abs/2106.04624
Extended Capabilities
C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.
Usage notes and limitations:
Variable-size input is not supported.
The sample rate fs must be constant.
GPU Arrays
Accelerate code by running on a graphics processing unit (GPU) using Parallel Computing Toolbox™.
This function fully supports GPU arrays. For more information, see Run MATLAB Functions on a GPU (Parallel Computing Toolbox).
Version History
Introduced in R2023a
R2024b: Output probabilities of voice activity
Use an additional output argument to get the per-sample probabilities of voice activity in a signal.