G.729 Voice Activity Detection
This example shows how to implement the ITU-T G.729 Voice Activity Detector (VAD).
Introduction
Voice Activity Detection (VAD) is a critical component of many speech and audio applications, including speech coding, speech recognition, and speech enhancement. For instance, the ITU-T G.729 standard uses VAD modules to reduce the transmission rate during silent periods.
Algorithm
In the first stage, four parametric features are extracted from the input signal: the full-band and low-band frame energies, the set of line spectral frequencies (LSF), and the frame zero-crossing rate. If the frame number is less than 32, an initialization stage of the long-term averages takes place. During this stage the voice activity decision is forced to 1 if the frame energy from the LPC analysis is above 21 dB, and to 0 otherwise. If the frame number is equal to 32, an initialization stage for the characteristic energies of the background noise occurs.
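The per-frame feature extraction and the start-up rule can be sketched in MATLAB as follows. This is a simplified illustration, not the bit-exact Annex B computation. It makes these assumptions:
- The input is 80-sample frames of single-precision audio in [-1, 1], rescaled to the 16-bit PCM range.
- The full-band energy is computed directly from the frame samples rather than from the LPC analysis.
- The low-band energy and the LSFs are omitted.
- `frameNumber` is a hypothetical frame counter.

```matlab
% Simplified per-frame features (sketch, not bit-exact G.729 Annex B).
% Assumes 80-sample frames in [-1, 1], rescaled to 16-bit PCM amplitude.
frameLen = 80;                             % 10 ms at 8 kHz
x = 0.1 * ((-1) .^ (0:frameLen-1)).';      % test frame: alternating sign
xPcm = x * 32768;                          % rescale to 16-bit PCM range

% Full-band energy in dB (eps avoids log10(0) for an all-zero frame)
Ef = 10 * log10(mean(xPcm .^ 2) + eps);

% Zero-crossing rate with a 1/(2M) normalization: each sign change
% contributes |sign difference| = 2, so this counts crossings per sample
zcr = sum(abs(diff(sign(x)))) / (2 * frameLen);

% Start-up rule for frame numbers below 32 (frameNumber is hypothetical)
frameNumber = 5;
if frameNumber < 32
    decision = double(Ef > 21);            % 1 above 21 dB, otherwise 0
end
fprintf('Ef = %.2f dB, ZCR = %.4f, decision = %d\n', Ef, zcr, decision);
```

The 21 dB threshold only makes sense on a fixed amplitude scale, which is why the sketch rescales to the PCM range before computing the energy.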
In the next stage, a set of difference parameters is calculated, each measuring the difference between a current frame parameter and the running average of the corresponding background noise characteristic. Four difference measures are calculated:
a) A spectral distortion
b) A full-band energy difference
c) A low-band energy difference
d) A zero-crossing difference
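The four difference measures reduce to simple array operations. In this sketch:
- All numeric values are hypothetical placeholders.
- The `*_bar` variables stand for the running background-noise averages.
- The energy and zero-crossing differences are assumed to use an average-minus-current sign convention. Check this against the Recommendation before relying on it.

```matlab
% Sketch of the four difference measures (hypothetical placeholder values).
% *_bar variables stand for the running background-noise averages.
lsf     = [0.30 0.55 0.90 1.20 1.55 1.90 2.20 2.50 2.75 2.95].'; % current LSFs (rad)
lsf_bar = [0.28 0.50 0.85 1.25 1.50 1.95 2.25 2.45 2.80 2.90].'; % noise average
Ef = 60;   Ef_bar = 45;    % full-band energy (dB): current vs. noise average
El = 55;   El_bar = 42;    % low-band energy (dB): current vs. noise average
zc = 0.20; zc_bar = 0.35;  % zero-crossing rate: current vs. noise average

dS  = sum((lsf - lsf_bar) .^ 2);  % a) spectral distortion
dEf = Ef_bar - Ef;                % b) full-band energy difference (assumed sign)
dEl = El_bar - El;                % c) low-band energy difference (assumed sign)
dZc = zc_bar - zc;                % d) zero-crossing difference (assumed sign)

fprintf('dS = %.4f, dEf = %g, dEl = %g, dZc = %.2f\n', dS, dEf, dEl, dZc);
```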
The initial voice activity decision is made in the next stage, using multi-boundary decision regions in the space of the four difference measures. The active voice decision is given by the union of the decision regions, and the inactive voice decision is its logical complement. Energy considerations, together with the decisions of neighboring past frames, are used to smooth the decision. The running averages are updated only in the presence of background noise, not in the presence of speech. An adaptive threshold is tested, and the update takes place only if the threshold criterion is met.
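Decision smoothing and the noise-gated update of the running averages can be illustrated with a generic scheme. This sketch is not the Annex B logic; its parameters are hypothetical choices:
- `hangoverLen` is a hangover length that holds the speech decision for a few frames after speech ends.
- `beta` is the smoothing factor of the running average.
- `margin` is the energy test that gates the update. Because it is measured relative to the current noise average, the threshold adapts as the average changes.

```matlab
% Generic hangover smoothing plus a noise-gated running average (sketch).
% hangoverLen, beta and margin are hypothetical, not Annex B values.
rawDecision = [0 0 1 1 0 0 0 0 1 0];            % hypothetical initial decisions
frameEnergy = [40 41 65 66 50 42 41 40 63 41];  % hypothetical energies (dB)
hangoverLen = 2;     % frames to hold "speech" after speech ends
beta        = 0.9;   % smoothing factor of the running average
margin      = 6;     % dB above the noise average still treated as noise

noiseAvg = frameEnergy(1);          % initialize the running average
hangover = 0;
smoothed = zeros(size(rawDecision));
for k = 1:numel(rawDecision)
    if rawDecision(k) == 1
        hangover    = hangoverLen;  % speech: re-arm the hangover counter
        smoothed(k) = 1;
    elseif hangover > 0
        hangover    = hangover - 1; % just after speech: keep decision at 1
        smoothed(k) = 1;
    end
    % Update only on non-speech frames that pass the adaptive energy test
    if smoothed(k) == 0 && frameEnergy(k) < noiseAvg + margin
        noiseAvg = beta * noiseAvg + (1 - beta) * frameEnergy(k);
    end
end
disp(smoothed);
```

Gating the update on the smoothed decision, rather than the raw one, keeps trailing low-energy speech out of the noise estimate.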
VAD Implementation
vadG729 is the function containing the algorithm's implementation.
Initialization
Set up an audio source. This example uses an audio file reader.
audioSource = dsp.AudioFileReader(SamplesPerFrame=80, ...
    Filename='speech_dft_8kHz.wav', ...
    OutputDataType='single');

% Note: You can use a microphone as a source instead by using an audio
% device reader (NOTE: audioDeviceReader requires an Audio Toolbox
% (TM) license)
% audioSource = audioDeviceReader(OutputDataType='single', ...
%     NumChannels=1, ...
%     SamplesPerFrame=80, ...
%     SampleRate=8000);

% Create a time scope to visualize the VAD decision (channel 1) and the
% speech data (channel 2)
scope = timescope(SampleRate=[8000/80 8000], ...
    TimeSpanSource='property', ...
    TimeSpan=10, ...
    YLimits=[-0.3 1.1], ...
    Title='Decision speech and speech data', ...
    TimeSpanOverrunAction='Scroll');
Stream Processing Loop
% Initialize VAD parameters
VAD_cst_param = vadInitCstParams;
% Reset the persistent state of vadG729
clear vadG729

% Run for 10 seconds (1000 frames of 10 ms)
numTSteps = 1000;
while(numTSteps)
    % Retrieve 10 ms (80 samples) of speech data from the audio source
    speech = audioSource();
    % Call the VAD algorithm
    decision = vadG729(speech, VAD_cst_param);
    % Plot speech frame and decision: 1 for speech, 0 for silence
    scope(decision, speech);
    numTSteps = numTSteps - 1;
end
release(scope);
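The same 10 ms framing can be shown with plain array indexing, without the toolbox System objects. At 8 kHz, 80 samples last 10 ms, so `numTSteps` frames cover `numTSteps/100` seconds. The synthetic signal and the energy-threshold detector below are hypothetical stand-ins for the audio file and vadG729. They are not a G.729 VAD.

```matlab
% Toolbox-free sketch of the 10 ms frame loop with a stand-in detector.
fs        = 8000;                  % sample rate (Hz)
frameLen  = 80;                    % 80 samples = 10 ms
numTSteps = 100;                   % 100 frames = 1 s of audio
t   = (0:numTSteps*frameLen-1).' / fs;
sig = 0.01 * sin(2*pi*50*t);       % low-level background "noise"
burst = (t >= 0.4) & (t < 0.7);    % hypothetical "speech" interval
sig(burst) = 0.5 * sin(2*pi*440*t(burst));

decisions = zeros(numTSteps, 1);
for n = 1:numTSteps
    idx   = (n-1)*frameLen + (1:frameLen);          % samples of frame n
    frame = sig(idx);
    Ef    = 10*log10(mean((frame*32768).^2) + eps); % frame energy (dB)
    decisions(n) = Ef > 60;                         % hypothetical threshold
end
fprintf('%d of %d frames flagged as speech\n', sum(decisions), numTSteps);
```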
Cleanup
Close the audio input device and release resources.
release(audioSource);
Generating and Using the MEX-File
MATLAB Coder can be used to generate C code for the function vadG729. To generate a MEX-file, execute the following command.
codegen vadG729 -args {single(zeros(80,1)), coder.Constant(VAD_cst_param)}
Code generation successful.
Speed Comparison
Creating MEX-files often helps achieve faster run times for simulations. The following code first measures the time taken by the MATLAB function and then the time taken by the corresponding MEX-file. Note that the speedup factor may differ from machine to machine.
audioSource = dsp.AudioFileReader('speech_dft_8kHz.wav', ...
    SamplesPerFrame=80, ...
    OutputDataType='single');

clear vadG729
VAD_cst_param = vadInitCstParams;

% Time the MATLAB function over the whole file
tic;
while ~isDone(audioSource)
    speech = audioSource();
    decision = vadG729(speech, VAD_cst_param);
end
t1 = toc;

% Rewind the file and time the MEX-file
reset(audioSource);
tic;
while ~isDone(audioSource)
    speech = audioSource();
    decision = vadG729_mex(speech, VAD_cst_param);
end
t2 = toc;

disp('RESULTS:')
RESULTS:
disp(['Time taken to run the MATLAB code: ', num2str(t1), ' seconds']);
Time taken to run the MATLAB code: 1.1622 seconds
disp(['Time taken to run the MEX-File: ', num2str(t2), ' seconds']);
Time taken to run the MEX-File: 0.22835 seconds
disp(['Speed-up by a factor of ', num2str(t1/t2), ...
    ' is achieved by creating the MEX-File']);
Speed-up by a factor of 5.0894 is achieved by creating the MEX-File