Train Voice Activity Detection in Noise Model Using Deep Learning
This example shows how to detect regions of speech in a low signal-to-noise ratio (SNR) environment using deep learning. You train a bidirectional long short-term memory (BiLSTM) network from scratch to perform voice activity detection (VAD) and compare that network to a pretrained deep learning-based VAD. To explore the model trained from scratch in this example, see Voice Activity Detection in Noise Using Deep Learning. To use an off-the-shelf deep learning-based VAD, see detectspeechnn.
Introduction
Voice activity detection is an essential component of many audio systems, such as automatic speech recognition, speaker recognition, and audio conferencing. Voice activity detection can be especially challenging in low-SNR situations, where speech is obstructed by noise.
For reproducibility, set the random seed to default.
rng default
In high SNR scenarios, traditional speech detection algorithms perform adequately. Read in an audio file that consists of words spoken with pauses between them, and listen to it.
fs = 16e3;
[speech,fileFs] = audioread("MaleVolumeUp-16-mono-6secs.ogg");
sound(speech,fs)
Use the detectSpeech function to locate regions of speech. The detectSpeech function correctly identifies all regions of speech.
detectSpeech(speech,fs)
Load two noise signals and resample them to the audio sample rate.
[noise200,fileFs200] = audioread("WashingMachine-16-8-mono-200secs.mp3");
[noise1000,fileFs1000] = audioread("WashingMachine-16-8-mono-1000secs.mp3");
noise200 = resample(noise200,fs,fileFs200);
noise1000 = resample(noise1000,fs,fileFs1000);
Use the supporting function mixSNR to corrupt the clean speech signal with washing machine noise at a desired SNR level in dB. Listen to the corrupted audio.
SNR = -10;
noisySpeech = mixSNR(speech,noise200,SNR);
sound(noisySpeech,fs)
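Optionally, verify the achieved SNR. This check is a minimal sketch: it assumes the second output of the mixSNR supporting function, which returns the scaled noise that was added to the signal. The helper peak-normalizes the mixture after adding the noise, but that final scaling does not change the signal-to-noise ratio.
[~,addedNoise] = mixSNR(speech,noise200,SNR); % second output: the noise that was added
achievedSNR = 20*log10(norm(speech)/norm(addedNoise)) % should equal the requested SNR (-10 dB)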
Call detectSpeech on the noisy speech signal. The function fails to detect the speech regions given the very low SNR. The remainder of the example walks through training and evaluating deep learning-based VAD networks that can perform well under low SNR.
detectSpeech(noisySpeech,fs)
Download and Prepare Data
Download and extract the Google Speech Commands Dataset [1].
downloadFolder = matlab.internal.examples.downloadSupportFile("audio","google_speech.zip");
dataFolder = tempdir;
unzip(downloadFolder,dataFolder)
dataset = fullfile(dataFolder,"google_speech");
Create audioDatastore objects to point to the training and validation data sets.
adsTrain = audioDatastore(fullfile(dataset,"train"),IncludeSubfolders=true);
adsValidation = audioDatastore(fullfile(dataset,"validation"),IncludeSubfolders=true);
Construct Train and Validation Signals
The Google dataset consists of isolated words. Use the supporting function constructSignal to construct train and validation signals that consist of isolated words and regions of silence. The constructSignal function also returns ground truth binary masks indicating the regions of speech in the train and validation signals.
[audioTrain,TTrainPerSample] = constructSignal(adsTrain,fs,1000);
[audioValidation,TValidationPerSample] = constructSignal(adsValidation,fs,200);
Listen to the first 10 seconds of the constructed signal. Use signalMask and plotsigroi to visualize the signal and ground truth binary mask.
duration = 10;
sound(audioTrain(1:duration*fs),fs)
mask = signalMask(TTrainPerSample,SampleRate=fs);
plotsigroi(mask,audioTrain,true)
xlim([0,duration])
title("Clean Signal ("+duration+" seconds)")
Add Noise to Train and Validation Signals
Use the supporting function mixSNR to corrupt the train and validation signals with noise.
audioTrain = mixSNR(audioTrain,noise1000,SNR);
audioValidation = mixSNR(audioValidation,noise200,SNR);
Listen to the first 10 seconds of the train signal and visualize the signal and mask.
sound(audioTrain(1:duration*fs),fs)
plotsigroi(mask,audioTrain,true)
xlim([0,duration])
title("Training Signal ("+duration+" seconds)")
Input Pipeline
Define an audioFeatureExtractor to extract the following spectral features: spectralCentroid, spectralCrest, spectralEntropy, spectralFlux, spectralKurtosis, spectralRolloffPoint, spectralSkewness, spectralSlope, and the periodicity feature harmonicRatio. Extract features using a 256-point Hann window with 50% overlap.
afe = audioFeatureExtractor(SampleRate=fs, ...
    Window=hann(256,"Periodic"), ...
    OverlapLength=128, ...
    ...
    spectralCentroid=true, ...
    spectralCrest=true, ...
    spectralEntropy=true, ...
    spectralFlux=true, ...
    spectralKurtosis=true, ...
    spectralRolloffPoint=true, ...
    spectralSkewness=true, ...
    spectralSlope=true, ...
    harmonicRatio=true);
featuresTrain = extract(afe,audioTrain);
Display the dimensions of the features matrix. The first dimension corresponds to the number of windows the signal was broken into (it depends on the signal length, window length, and overlap length). The second dimension is the number of features used in this example.
[numWindows,numFeatures] = size(featuresTrain)
numWindows = 124999
numFeatures = 9
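You can check this relationship directly. The following sketch assumes the default framing of audioFeatureExtractor, which drops any trailing partial window.
hopLength = numel(afe.Window) - afe.OverlapLength;
expectedNumWindows = floor((numel(audioTrain) - numel(afe.Window))/hopLength) + 1 % should match numWindows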
In classification applications, it is good practice to normalize all features to have zero mean and unit standard deviation.
Compute the mean and standard deviation of each feature, and use them to normalize the data.
M = mean(featuresTrain,1);
S = std(featuresTrain,[],1);
featuresTrain = (featuresTrain - M) ./ S;
Extract features from the validation signal using the same process.
XValidation = extract(afe,audioValidation);
XValidation = (XValidation - mean(XValidation,1)) ./ std(XValidation,[],1);
Each feature corresponds to 256 samples of data (the window length), sampled every 128 samples (the hop length). For each window, set the expected voice/no voice value to the mode of the baseline mask values corresponding to those 256 samples. Convert the voice/no voice mask to categorical.
windowLength = numel(afe.Window);
overlapLength = afe.OverlapLength;
TTrain = mode(buffer(TTrainPerSample,windowLength,overlapLength,"nodelay"),1);
TTrain = categorical(TTrain);
Do the same for the validation mask.
TValidation = mode(buffer(TValidationPerSample,windowLength,overlapLength,"nodelay"),1);
TValidation = categorical(TValidation);
Use the supporting function featureBuffer to split the training features and the mask into sequences of approximately 8 seconds each, with 75% overlap between consecutive sequences.
sequenceDuration = 8;
analysisHopLength = numel(afe.Window) - afe.OverlapLength;
sequenceLength = round(sequenceDuration*fs/analysisHopLength);
overlapPercent = 0.75;
XTrain = featureBuffer(featuresTrain',sequenceLength,overlapPercent);
TTrain = featureBuffer(TTrain,sequenceLength,overlapPercent);
Network Architecture
LSTM networks can learn long-term dependencies between time steps of sequence data. This example uses the bidirectional LSTM layer bilstmLayer (Deep Learning Toolbox) to look at the sequence in both forward and backward directions.
layers = [ ...
    sequenceInputLayer(afe.FeatureVectorLength)
    bilstmLayer(200,OutputMode="sequence")
    bilstmLayer(200,OutputMode="sequence")
    fullyConnectedLayer(2)
    softmaxLayer
    ];
Training Options
To define parameters for training, use trainingOptions (Deep Learning Toolbox). Use the Adam optimizer with a mini-batch size of 64 and a piecewise learn rate schedule.
maxEpochs = 20;
miniBatchSize = 64;
options = trainingOptions("adam", ...
    MaxEpochs=maxEpochs, ...
    MiniBatchSize=miniBatchSize, ...
    Shuffle="every-epoch", ...
    Verbose=false, ...
    ValidationFrequency=floor(numel(XTrain)/miniBatchSize), ...
    ValidationData={XValidation.',TValidation}, ...
    Plots="training-progress", ...
    LearnRateSchedule="piecewise", ...
    Metrics="Accuracy", ...
    LearnRateDropFactor=0.1, ...
    LearnRateDropPeriod=5, ...
    OutputNetwork="best-validation-loss", ...
    InputDataFormats="CTB");
Train Network
To train the network, use trainnet.
speechDetectNet = trainnet(XTrain,TTrain,layers,"crossentropy",options);
Evaluate Trained Network
Estimate voice activity in the validation signal using the trained network. Convert the estimated VAD mask from categorical to double, then replicate the window-based decisions to sample-based decisions.
YValidation = predict(speechDetectNet,XValidation);
YValidation = scores2label(YValidation,unique(TValidation));
YValidation = double(YValidation)-1;

wL = numel(afe.Window);
hL = wL - afe.OverlapLength;

YValidationPerSample = [repelem(YValidation(1),floor(wL/2 + hL/2),1); ...
    repelem(YValidation(2:end-1),hL,1); ...
    repelem(YValidation(end),ceil(wL/2 + hL/2),1)];
Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels. Save the results for later analysis.
cc = confusionchart(TValidationPerSample,YValidationPerSample, ...
    title="speechDetect - Validation Confusion Chart", ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");
speechDetectResults = cc.NormalizedValues;
Evaluate Pretrained VAD Network
The vadnet network is a pretrained network for voice activity detection. You can use it with the vadnetPreprocess and vadnetPostprocess functions for applications such as transfer learning, or you can use detectspeechnn, which encapsulates vadnetPreprocess, vadnet, and vadnetPostprocess for inference-only applications. The vadnet network performs well under everyday adverse conditions; however, it fails in cases of extreme SNR, such as the -10 dB SNR used in this example. Also, vadnet was trained to detect regions of continuous speech (several words in a row), not isolated words. In short, the pretrained vadnet fails for the validation signal in this example.
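For reference, the inference-only path is a single function call. This sketch assumes the default behavior of detectspeechnn, which returns the detected speech regions in samples; given the extreme SNR, expect it to miss most of the speech in this example.
speechRegions = detectspeechnn(noisySpeech,fs); % preprocessing, network, and postprocessing in one call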
Load in the pretrained vadnet model.
net = audioPretrainedNetwork("vadnet");
Extract features from the validation signal using the same input pipeline used to train the network.
XValidation = vadnetPreprocess(audioValidation,fs);
Predict the VAD mask.
y = predict(net,XValidation);
vadnet is a regression network and requires additional postprocessing to determine decision boundaries. Use vadnetPostprocess to determine the boundaries of voice activity regions.
boundaries = vadnetPostprocess(audioValidation,16e3,y);
The vadnetPostprocess function returns the decisions as time boundaries. To convert the boundaries to a binary mask that corresponds to the original signal samples, use sigroi2binmask.
YValidationPerSample = double(sigroi2binmask(boundaries,size(audioValidation,1)));
To create a confusion chart to analyze the error, use confusionchart (Deep Learning Toolbox).
confusionchart(TValidationPerSample,YValidationPerSample, ...
    title="vadnet - Validation Confusion Chart", ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");
Transfer Learning
Apply transfer learning to the pretrained vadnet to make use of both the pretrained weights and the network architecture.
Extract features from the training audio.
featuresTrain = vadnetPreprocess(audioTrain,fs);
Buffer the ground truth mask so that decisions correspond to the analysis windows used in vadnetPreprocess.
windowLength = 400;
overlapLength = 240;
TTrainPerSamplePadded = [zeros(floor(windowLength/2),1);TTrainPerSample;zeros(ceil(windowLength/2),1)];
TTrain = mode(buffer(TTrainPerSamplePadded,windowLength,overlapLength,"nodelay"),1);
Buffer the validation mask.
TValidationPerSamplePadded = [zeros(floor(windowLength/2),1);TValidationPerSample;zeros(ceil(windowLength/2),1)];
TValidation = mode(buffer(TValidationPerSamplePadded,windowLength,overlapLength,"nodelay"),1);
Split the long training signal into overlapped sequences for training. Do the same for the ground-truth mask.
sequenceDuration = 8;
analysisHopLength = windowLength - overlapLength;
sequenceLength = round(sequenceDuration*fs/analysisHopLength);
overlapPercent = 0.75;
XTrain = featureBuffer(featuresTrain,sequenceLength,overlapPercent);
TTrain = featureBuffer(TTrain,sequenceLength,overlapPercent);
To define parameters for training, use trainingOptions (Deep Learning Toolbox).
miniBatchSize = 12;
maxEpochs = 9;
options = trainingOptions("adam", ...
    InitialLearnRate=0.01, ...
    LearnRateSchedule="piecewise", ...
    LearnRateDropPeriod=3, ...
    MiniBatchSize=miniBatchSize, ...
    Shuffle="every-epoch", ...
    ValidationFrequency=floor(numel(XTrain)/miniBatchSize), ...
    ValidationData={XValidation,TValidation}, ...
    Verbose=false, ...
    Plots="training-progress", ...
    MaxEpochs=maxEpochs, ...
    OutputNetwork="best-validation-loss" ...
    );
To train the network, use trainnet.
noisyvadnet = trainnet(XTrain,TTrain,net,"mse",options);
Estimate voice activity in the validation signal using the trained network. Postprocess the predictions using vadnetPostprocess, then convert the boundaries in time to a sample-based mask.
y = predict(noisyvadnet,gpuArray(XValidation));
boundaries = vadnetPostprocess(audioValidation,fs,y);
YValidationPerSample = double(sigroi2binmask(boundaries,size(audioValidation,1)));
Calculate and plot the validation confusion matrix from the vectors of actual and estimated labels. Save the results for later analysis.
cc = confusionchart(TValidationPerSample,YValidationPerSample, ...
    title="noisyvadnet - Validation Confusion Chart", ...
    ColumnSummary="column-normalized",RowSummary="row-normalized");
noisyvadnetResults = cc.NormalizedValues;
Compare Networks
There are several considerations when choosing a network, such as size, inference speed, error, and streaming capabilities.
Streaming
The speechDetectNet trained from scratch in this example is well suited for streaming inference because its BiLSTM layers retain state between calls. See Voice Activity Detection in Noise Using Deep Learning for an example of using speechDetectNet for streaming voice activity detection.
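As an illustration of the streaming pattern (not the full streaming example), the following sketch processes the validation signal in 1-second chunks and carries the recurrent state forward between calls. The chunk size, the 0.5 decision threshold, and the assumption that the second output channel corresponds to the speech class are choices made for this sketch, not part of the trained example.
streamingNet = speechDetectNet;
chunkSamples = fs; % 1-second chunks (an arbitrary choice for this sketch)
for ii = 1:floor(size(audioValidation,1)/chunkSamples)
    idx = (ii-1)*chunkSamples + (1:chunkSamples);
    features = (extract(afe,audioValidation(idx)) - M)./S; % reuse training statistics
    [scores,state] = predict(streamingNet,dlarray(features.',"CT"));
    streamingNet.State = state; % retain the recurrent state for the next chunk
    isSpeech = extractdata(scores(2,:)) > 0.5; % per-window speech decisions for this chunk
end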
The vadnet architecture consists of convolutional, recurrent, and fully connected layers, and is not well suited for low-latency streaming. See the vadnet documentation for an example of streaming voice activity detection using vadnet.
Network Size
Compare the network sizes.
networks = ["speechDetect","noisyvadnet"];
b = bar(reordercats(categorical(networks),networks), ...
    [whos("speechDetectNet").bytes/1024,whos("noisyvadnet").bytes/1024]);
title("Network Size")
ylabel("Size (KB)")
grid on
b.FaceColor = "flat";
b.CData(2,:) = [0.8500 0.3250 0.0980];
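The size reported by whos includes object overhead. As an alternative comparison, you can count learnable parameters; this sketch assumes both networks are dlnetwork objects with a Learnables table.
numLearnables = @(net) sum(cellfun(@numel,net.Learnables.Value)); % total learnable parameters
speechDetectNetLearnables = numLearnables(speechDetectNet)
noisyvadnetLearnables = numLearnables(noisyvadnet)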
Network Inference Speed
Compare the network inference speeds. The simpler speechDetect architecture has faster inference on both the CPU and the GPU for short durations (chunks of approximately 8 seconds or less). For longer durations, speechDetect is faster than noisyvadnet on the GPU and slower on the CPU.
durationsToTest = [1,5,10,20,40];
environment = ["CPU","GPU"];

speechDetectSpeed = zeros(numel(durationsToTest),numel(environment));
noisyvadnetSpeed = zeros(numel(durationsToTest),numel(environment));

for jj = 1:numel(environment)
    for ii = 1:numel(durationsToTest)
        idx = 1:durationsToTest(ii)*fs;
        speechDetectFeatures = extract(afe,audioValidation(idx))';
        vadnetFeatures = vadnetPreprocess(audioValidation(idx),fs);
        switch environment(jj)
            case "CPU"
                speechDetectSpeed(ii,1) = timeit(@()predict(speechDetectNet,speechDetectFeatures.'),1);
                noisyvadnetSpeed(ii,1) = timeit(@()predict(noisyvadnet,vadnetFeatures),1);
            case "GPU"
                speechDetectSpeed(ii,2) = gputimeit(@()predict(speechDetectNet,gpuArray(speechDetectFeatures.')),1);
                noisyvadnetSpeed(ii,2) = gputimeit(@()predict(noisyvadnet,gpuArray(vadnetFeatures)),1);
        end
    end
end

tiledlayout(2,1)
for ii = 1:numel(environment)
    nexttile
    plot(durationsToTest,speechDetectSpeed(:,ii),"b-", ...
        durationsToTest,noisyvadnetSpeed(:,ii),"r-", ...
        durationsToTest,speechDetectSpeed(:,ii),"bo", ...
        durationsToTest,noisyvadnetSpeed(:,ii),"ro")
    legend(["speechDetect","noisyvadnet"],Location="best")
    grid on
    xlabel("Audio Duration (s)")
    ylabel("Computation Duration (s)")
    title("Inference Speed ("+environment(ii)+")")
end
Network Error
Use the previously calculated confusion charts to display common statistics for error analysis. Accuracy, recall, precision, and F1 score are all derived from the confusion matrices plotted previously.
Accuracy is defined as the ratio of correctly predicted observations to the total number of observations. It is the most intuitive metric but can be misleading for imbalanced data sets. For example, if speech is present in only 5% of the audio, then classifying all audio as nonspeech results in 95% accuracy.
Recall, also called sensitivity, is the ratio of correctly predicted positive observations to all observations that belong to the positive class. Recall answers the question: Of all speech regions, how many were correctly classified? A low recall indicates that regions of speech were misclassified as regions of nonspeech.
Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Precision answers the question: Of all the observations the network classified as speech, how many were actually speech? A low precision indicates that regions of nonspeech were misclassified as regions of speech.
F1 score is the harmonic mean of the precision and recall: it accounts for both false positives and false negatives.
The true measure of a network depends on your application. In real-world situations, you typically optimize a cost function that weights the costs of false positives and false negatives.
TP = speechDetectResults(2,2);
TN = speechDetectResults(1,1);
FP = speechDetectResults(1,2);
FN = speechDetectResults(2,1);
speechDetectAccuracy = (TP+TN)/(TP+TN+FP+FN);
speechDetectRecall = TP/(TP+FN);
speechDetectPrecision = TP/(TP+FP);
speechDetectF1Score = 2*(speechDetectRecall*speechDetectPrecision)/(speechDetectRecall+speechDetectPrecision);

TP = noisyvadnetResults(2,2);
TN = noisyvadnetResults(1,1);
FP = noisyvadnetResults(1,2);
FN = noisyvadnetResults(2,1);
noisyvadnetAccuracy = (TP+TN)/(TP+TN+FP+FN);
noisyvadnetRecall = TP/(TP+FN);
noisyvadnetPrecision = TP/(TP+FP);
noisyvadnetF1Score = 2*(noisyvadnetRecall*noisyvadnetPrecision)/(noisyvadnetRecall+noisyvadnetPrecision);

figure
bar(categorical(["Accuracy","Recall","Precision","F1 Score"]), ...
    [speechDetectAccuracy,noisyvadnetAccuracy; ...
    speechDetectRecall,noisyvadnetRecall; ...
    speechDetectPrecision,noisyvadnetPrecision; ...
    speechDetectF1Score,noisyvadnetF1Score]);
title("Error Analysis")
legend("speechDetect","noisyvadnet",Location="bestoutside")
ylim([0.5,1])
grid on
Supporting Functions
Convert Feature Vectors to Sequences
function sequences = featureBuffer(features,featureVectorsPerSequence,overlapPercent)
% y = featureBuffer(x,sequenceLength,overlapPercent) buffers a sequence of
% feature vectors, x, into sequences of length sequenceLength overlapped by
% overlapPercent. The output sequences are returned in a cell array for
% consumption by trainnet.

featureVectorOverlap = round(overlapPercent*featureVectorsPerSequence);
hopLength = featureVectorsPerSequence - featureVectorOverlap;

N = floor((size(features,2) - featureVectorsPerSequence)/hopLength) + 1;
sequences = cell(N,1);

idx = 1;
for jj = 1:N
    sequences{jj} = features(:,idx:idx + featureVectorsPerSequence - 1);
    idx = idx + hopLength;
end

end
Mix SNR
function [noisySignal,requestedNoise] = mixSNR(signal,noise,ratio)
% [noisySignal,requestedNoise] = mixSNR(signal,noise,ratio) returns a noisy
% version of the signal, noisySignal. The noisy signal has been mixed with
% noise at the specified ratio in dB.

numSamples = size(signal,1);

% Convert noise to mono.
noise = mean(noise,2);

% Trim or expand noise to match signal size.
if size(noise,1)>=numSamples
    % Choose a random starting index such that you still have numSamples
    % after indexing the noise.
    start = randi(size(noise,1) - numSamples + 1);
    noise = noise(start:start+numSamples-1);
else
    % Tile the noise, then choose a random starting index such that you
    % still have numSamples after indexing.
    numReps = ceil(numSamples/size(noise,1));
    temp = repmat(noise,numReps,1);
    start = randi(size(temp,1) - numSamples + 1);
    noise = temp(start:start+numSamples-1);
end

signalNorm = norm(signal);
noiseNorm = norm(noise);

goalNoiseNorm = signalNorm/(10^(ratio/20));
factor = goalNoiseNorm/noiseNorm;

requestedNoise = noise.*factor;
noisySignal = signal + requestedNoise;

noisySignal = noisySignal./max(abs(noisySignal));

end
Construct Signal
function [audio,mask] = constructSignal(ds,fs,duration)
% [audio,mask] = constructSignal(ds,fs,duration) constructs an audio signal
% of the specified duration by concatenating samples from the
% audioDatastore ds with random durations of silence between them.

win = hamming(50e-3*fs,"periodic");

% Create a signal of the requested duration by combining multiple speech
% files from the datastore. Use detectSpeech to remove unwanted portions of
% each file. Insert a random period of silence between speech segments.

% Preallocate the signal.
N = duration*fs;
audio = zeros(N,1);

% Preallocate the voice activity mask. Values of 1 in the mask correspond
% to samples located in areas with voice activity. Values of 0 correspond
% to areas with no voice activity.
mask = zeros(N,1);

% Specify a maximum silence segment duration of 2 seconds.
maxSilenceSegment = 2;

% Construct the signal by calling read on the datastore in a loop.
numSamples = 1;
while numSamples < N
    data = read(ds);
    data = data ./ max(abs(data)); % Scale amplitude

    % Determine regions of speech.
    idx = detectSpeech(data,fs,Window=win);

    % If a region of speech is detected
    if ~isempty(idx)

        % Extend the indices by five frames.
        idx(1,1) = max(1,idx(1,1) - 5*numel(win));
        idx(1,2) = min(length(data),idx(1,2) + 5*numel(win));

        % Isolate the speech.
        data = data(idx(1,1):idx(1,2));

        % Write the speech segment to the signal.
        audio(numSamples:numSamples+numel(data)-1) = data;

        % Set the VAD baseline.
        mask(numSamples:numSamples+numel(data)-1) = true;

        % Random silence period
        numSilenceSamples = randi(maxSilenceSegment*fs,1,1);
        numSamples = numSamples + numel(data) + numSilenceSamples;
    end
end
audio = audio(1:N);
mask = mask(1:N);
end
References
[1] Warden, P. "Speech Commands: A Public Dataset for Single-Word Speech Recognition." 2017. Available from https://storage.googleapis.com/download.tensorflow.org/data/speech_commands_v0.01.tar.gz. Copyright Google 2017. The Speech Commands Dataset is licensed under the Creative Commons Attribution 4.0 license.