
Audio Event Classification Using TensorFlow Lite on Raspberry Pi

This example demonstrates audio event classification on Raspberry Pi™ using YAMNet, a pretrained deep neural network from the TensorFlow™ Lite library. You load the TensorFlow Lite model and predict the class of a given audio frame on Raspberry Pi using a processor-in-the-loop (PIL) workflow. To generate code for Raspberry Pi, you use Embedded Coder®, the MATLAB® Support Package for Raspberry Pi Hardware, and the Deep Learning Toolbox Interface for TensorFlow Lite. Refer to Audio Classification and yamnet classification for more details on the YAMNet model.

Third-Party Prerequisites

  • Raspberry Pi hardware

  • TensorFlow Lite library (on the target ARM® hardware)

  • Pretrained TensorFlow Lite Model

Download YAMNet

Download and unzip the yamnet (Audio Toolbox) model.

component = "audio";
filename = "yamnet.zip";
localfile = matlab.internal.examples.downloadSupportFile(component,filename);
downloadFolder = fileparts(localfile);
if exist(fullfile(downloadFolder,"yamnet"),"dir") ~= 7
    unzip(localfile,downloadFolder)
end
addpath(fullfile(downloadFolder,"yamnet"))

Read Audio Data and Classify the Sounds

Use audioread to read the audio file data and listen to it using the sound function.

[audioIn, fs] = audioread("multipleSounds-16-16-mono-18secs.wav");
sound(audioIn,fs)

Call classifySound (Audio Toolbox) to detect the different sounds present in the given audio.

detectedSounds = classifySound(audioIn,fs)
detectedSounds = 1×5 string
    "Stream"    "Machine gun"    "Snoring"    "Bark"    "Meow"

You detected the different sounds in the prerecorded audio in offline mode. The later sections of this example demonstrate audio event classification in a real-time scenario where you process one audio frame at a time.

Load TensorFlow Lite Model and Audio Event Classes

You load the TFLite YAMNet model using loadTFLiteModel (Deep Learning Toolbox). As mentioned on the TFLiteModel (Deep Learning Toolbox) page, you set the Mean and StandardDeviation properties of the TFLite model to 0 and 1, respectively, because the input to YAMNet is not already normalized.

modelFileName = "lite-model_yamnet_classification_tflite_1.tflite";
modelFullPath = fullfile(downloadFolder,"yamnet",modelFileName);
TFLiteYAMNet = loadTFLiteModel(modelFullPath);
TFLiteYAMNet.Mean = 0;
TFLiteYAMNet.StandardDeviation = 1;

Use yamnetGraph (Audio Toolbox) to load all the audio event classes supported by YAMNet as a string array.

[~, audioEventClasses] = yamnetGraph;

Set the sample rate (in Hz), the input audio frame length, and the frame duration (in seconds) supported by YAMNet.

modelSamplingRate = 16000;
frameDimension = TFLiteYAMNet.InputSize{1};
frameLength = frameDimension(2);
frameDuration = frameLength/modelSamplingRate;

Set classificationRate, that is, the number of classifications per second. Because the number of hops per second must equal the classification rate, set hopDuration to the reciprocal of classificationRate.

classificationRate = 10;
hopDuration = 1/classificationRate;
hopLength = floor(modelSamplingRate*hopDuration);
overlapLength = frameLength - hopLength;

Read Input Audio

You use a dropdown control to list the different input audio files. Use dsp.AudioFileReader (DSP System Toolbox) to read the audio file data.

afr = dsp.AudioFileReader("multipleSounds-16-16-mono-18secs.wav");
audioInSamplingRate = afr.SampleRate;
audioFileInfo = audioinfo(afr.Filename);

Set SamplesPerFrame to the number of samples corresponding to one hop.

audioInFrameLength = floor(audioInSamplingRate*hopDuration);
afr.SamplesPerFrame = audioInFrameLength;

Set Up the FIFO Buffers

Create two dsp.AsyncBuffer (DSP System Toolbox) objects, audioBufferYamnet and audioClassBuffer, to buffer the resampled audio samples and the indices of the predicted audio classes. You set the length of audioClassBuffer to correspond to predictedAudiolassesDuration seconds. You initialize audioClassBuffer with the index corresponding to the Silence audio class.

predictedAudiolassesDuration = 1;
audioClassBufferLength = floor(predictedAudiolassesDuration*classificationRate);
audioClassBuffer = dsp.AsyncBuffer(audioClassBufferLength);
audioBufferYamnet = dsp.AsyncBuffer(2*frameLength);
indexOfSilenceAudioClass = find(audioEventClasses == "Silence");
write(audioClassBuffer,ones(audioClassBufferLength,1)*indexOfSilenceAudioClass);

Create a timescope (DSP System Toolbox) object to visualize the audio.

timeScope = timescope("SampleRate", modelSamplingRate, ...
    "YLimits",[-1 1], ...
    "Name","Audio Event Classification Using TensorFlow Lite YAMNet", ...
    "TimeSpanSource","Property", ...
    "TimeSpan",audioFileInfo.Duration);

Run TFLite YAMNet in MATLAB to Perform Audio Event Classification

Set up a dsp.SampleRateConverter (DSP System Toolbox) System object to convert the sampling rate of the input audio to 16000 Hz, because YAMNet is trained on audio signals sampled at 16000 Hz.

src = dsp.SampleRateConverter('InputSampleRate',audioInSamplingRate,...
                              'OutputSampleRate',modelSamplingRate,...
                              'Bandwidth',10000);

You feed one audio frame at a time to represent the system as it would be deployed in a real-time embedded system. In the streaming loop, you first load one hop of audio samples and feed them to the dsp.SampleRateConverter (DSP System Toolbox) object to convert the sampling rate to 16000 Hz. You write the resampled frame to the FIFO buffer audioBufferYamnet, read an overlapping frame of length frameLength from this buffer, and feed it to YAMNet. The TensorFlow Lite YAMNet model outputs a score vector that contains a score for each audio event class. You find the index of the maximum score in the score vector and write it to the FIFO buffer audioClassBuffer. The predicted index is the statistical mode of the contents of audioClassBuffer, and the predicted audio event class is the value of the audioEventClasses array at that index. You visualize the resampled audio frame in the time scope and display the predicted audio event class as the title of the time scope.

while ~isDone(afr)
    audioInFrame = afr();
    resampledAudioInFrame = src(audioInFrame);
    write(audioBufferYamnet,resampledAudioInFrame);
    audioInYamnetFrame = read(audioBufferYamnet,frameLength,overlapLength);
    scoresTFLite = TFLiteYAMNet.predict(audioInYamnetFrame');
    [~, audioClassIndex] = max(scoresTFLite);
    write(audioClassBuffer,audioClassIndex);
    preditedSoundClass = audioEventClasses(mode(audioClassBuffer.peek(audioClassBufferLength)));
    timeScope(resampledAudioInFrame);
    timeScope.Title = char(preditedSoundClass);
    drawnow
end
hide(timeScope)
reset(timeScope)
reset(afr)

Prepare MATLAB Code for Deployment

You prepare a MATLAB function, predictAudioClassUsingYAMNET, that predicts the audio class for an input audio frame. It buffers the indices of the predicted audio classes in a FIFO buffer. The predicted audio class index is the statistical mode of the contents of this FIFO buffer.

type predictAudioClassUsingYAMNET.m
function preditedAudioClassIndex = predictAudioClassUsingYAMNET(audioIn, audioClassHistoryBufferLength,indexSilenceAudioClass)
% predictAudioClassUsingYAMNET Predicts the audio class of input audio by
% using a pre-trained TensorFlow Lite YAMNET model.
%
% Input Arguments:
% audioIn                           - Audio frame of length 1x15600 with
%                                     sampling rate of 16000 samples per
%                                     second
% audioClassHistoryBufferLength     - Length of the audio class FIFO buffer
%                                     that contains the predicted audio class
%                                     indices. The index of the predicted
%                                     audio class is the statistical mode
%                                     of the contents of this buffer.
% indexSilenceAudioClass            - Index of the 'Silence' audio class,
%                                     used to initialize the FIFO buffer.
%
% Output Arguments:
% preditedAudioClassIndex           - Index of the predicted audio class.
%
%
% Copyright 2022 The MathWorks, Inc.

%#codegen

persistent TFLiteYAMNETModel AudioClassBuffer

if isempty(TFLiteYAMNETModel)
    TFLiteYAMNETModel = loadTFLiteModel("lite-model_yamnet_classification_tflite_1.tflite");
    TFLiteYAMNETModel.NumThreads = 4;
    TFLiteYAMNETModel.Mean = 0;
    TFLiteYAMNETModel.StandardDeviation = 1;

    % Create and initialize a FIFO buffer with the index of the 'Silence' audio class
    AudioClassBuffer = dsp.AsyncBuffer(audioClassHistoryBufferLength);
    write(AudioClassBuffer,ones(audioClassHistoryBufferLength,1)*indexSilenceAudioClass);
end

scores = predict(TFLiteYAMNETModel,audioIn);
[~, audioClassIndex] = max(scores);
write(AudioClassBuffer,audioClassIndex);
predictedAudioClassHistory = peek(AudioClassBuffer,audioClassHistoryBufferLength);
preditedAudioClassIndex = mode(predictedAudioClassHistory);
end

Generate Code for Audio Event Classifier on Raspberry Pi

Create Code Generation Configuration

cfg = coder.config("lib", "ecoder", true);
cfg.TargetLang = 'C++';
cfg.VerificationMode = "PIL";

Set Up Connection with Raspberry Pi

Use the Raspberry Pi Support Package function, raspi, to create a connection to your Raspberry Pi. In the following code, replace:

  • raspiname with the name of your Raspberry Pi

  • pi with your user name

  • password with your password

if ~(exist("r","var"))
  r = raspi("raspiname","pi","password");
end

Configure Code Generation Hardware Parameters for Raspberry Pi

Create a coder.hardware object for Raspberry Pi and attach it to the code generation configuration object.

hw = coder.hardware("Raspberry Pi");
cfg.Hardware = hw;

Specify the build folder on Raspberry Pi.

buildDir = "~/remoteBuildDir";
cfg.Hardware.BuildDir = buildDir;

Copy TensorFlow Lite Model to the Target Hardware and the Current Directory

Copy the TensorFlow Lite model to the Raspberry Pi board. On the hardware board, set the environment variable TFLITE_MODEL_PATH to the location of the TensorFlow Lite model. For more information on setting environment variables, see Prerequisites for Deep Learning with TensorFlow Lite Models (Deep Learning Toolbox).

Use the putFile method of the raspi object to copy the TFLite model to the Raspberry Pi.

putFile(r,char(modelFullPath),'/home/pi')
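
With the model copied, one way to set the TFLITE_MODEL_PATH environment variable from MATLAB is through the system method of the raspi object. The following is a minimal sketch that assumes the model resides in /home/pi and that your target shell reads variables exported from ~/.bashrc; depending on your setup, you might need to set the variable in a different startup file, as described in the prerequisites page linked above.

% Append the export line to ~/.bashrc only if it is not already present
envCommand = ['grep -qxF ''export TFLITE_MODEL_PATH=/home/pi'' ~/.bashrc || ' ...
    'echo ''export TFLITE_MODEL_PATH=/home/pi'' >> ~/.bashrc'];
system(r,envCommand);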

Copy the model to the current directory because codegen requires it during code generation.

copyfile(modelFullPath)

Generate PIL MEX

You use coder.Constant to make the constant input arguments compile-time constants in the generated code. Run the codegen command to generate the PIL MEX function predictAudioClassUsingYAMNET_pil.

codegen -config cfg predictAudioClassUsingYAMNET -args {ones(1,15600,"single"), coder.Constant(audioClassBufferLength), coder.Constant(indexOfSilenceAudioClass)} -silent
### Connectivity configuration for function 'predictAudioClassUsingYAMNET': 'Raspberry Pi'

Predict Audio Class on Raspberry Pi Using PIL Workflow

You call the generated PIL function predictAudioClassUsingYAMNET_pil on one audio frame at a time to represent the system as it would be deployed in a real-time embedded system.

show(timeScope)
while ~isDone(afr)
    audioInFrame = afr();
    resampledAudioInFrame = src(audioInFrame);
    write(audioBufferYamnet,resampledAudioInFrame);
    audioInYamnetFrame = read(audioBufferYamnet,frameLength,overlapLength);
    predictedSoundClassIndex = predictAudioClassUsingYAMNET_pil(single(audioInYamnetFrame'),audioClassBufferLength, indexOfSilenceAudioClass);
    preditedSoundClass = audioEventClasses(predictedSoundClassIndex);
    timeScope(resampledAudioInFrame)
    timeScope.Title = char(preditedSoundClass);
    drawnow
end
### Starting application: 'codegen\lib\predictAudioClassUsingYAMNET\pil\predictAudioClassUsingYAMNET.elf'
    To terminate execution: clear predictAudioClassUsingYAMNET_pil
### Launching application predictAudioClassUsingYAMNET.elf...
hide(timeScope)

Figure: Time scope showing the resampled audio with the predicted audio event class (gun fire) displayed as the title.

Terminate the PIL execution.

clear predictAudioClassUsingYAMNET_pil
### Host application produced the following standard output (stdout) and standard error (stderr) messages:

Evaluate Raspberry Pi Execution Time

You use the PIL workflow to profile the predictAudioClassUsingYAMNET function. You enable profiling in the code generation configuration and regenerate the PIL function so that it keeps a log of the execution profile.

cfg.CodeExecutionProfiling = true;
codegen -config cfg predictAudioClassUsingYAMNET -args {ones(1,15600,"single"), coder.Constant(audioClassBufferLength), coder.Constant(indexOfSilenceAudioClass)} -silent
### Connectivity configuration for function 'predictAudioClassUsingYAMNET': 'Raspberry Pi'

You call the generated PIL function multiple times to get the average execution time.

numCalls = 100;
for k = 1:numCalls
    x = pinknoise(1,15600,"single");
    scores = predictAudioClassUsingYAMNET_pil(x,audioClassBufferLength,indexOfSilenceAudioClass);
end
### Starting application: 'codegen\lib\predictAudioClassUsingYAMNET\pil\predictAudioClassUsingYAMNET.elf'
    To terminate execution: clear predictAudioClassUsingYAMNET_pil
### Launching application predictAudioClassUsingYAMNET.elf...
    Execution profiling data is available for viewing. Open Simulation Data Inspector.
    Execution profiling report available after termination.

Terminate the PIL execution.

clear predictAudioClassUsingYAMNET_pil 
### Host application produced the following standard output (stdout) and standard error (stderr) messages:

    Execution profiling report: coder.profile.show(getCoderExecutionProfile('predictAudioClassUsingYAMNET'))

Generate an execution profile report to evaluate execution time.

executionProfile = getCoderExecutionProfile('predictAudioClassUsingYAMNET');
report(executionProfile, ...
       'Units','Seconds', ...
       'ScaleFactor','1e-03', ...
       'NumericFormat','%0.4f');

In the code execution profiling report, you find that the average execution time of predictAudioClassUsingYAMNET is 24.29 ms, which is within the budget of 100 ms. You calculate the budget as the reciprocal of the classification rate. The performance is measured on a Raspberry Pi 3 Model B Plus Rev 1.2.
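
As a quick check, you can compute this budget directly from classificationRate. The variable timeBudgetInMs below is introduced here only for illustration.

% Per-frame time budget: the reciprocal of the classification rate,
% expressed in milliseconds (1/10 s = 100 ms)
timeBudgetInMs = 1000/classificationRate;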

Release the buffers, the time scope, and the other System objects used in this example.

release(audioBufferYamnet)
release(audioClassBuffer)
release(timeScope)
release(src)
release(afr)
