The Key Role of Data in Modern AI-Powered Systems - Spotting Voice Keywords and Beyond | Deep Learning Webinars 2020, Part 5
From the series: Deep Learning Webinars 2020
The adoption of machine learning and deep learning continues to grow in modern systems across many application areas, including computer vision, signal processing, and text analytics. While research publications are generally effective for selecting suitable deep learning models, real-world system development demands application-specific resources, tools, and expertise for curating and processing the vast amounts of data required to train and evaluate these models.
Understand these intangible requirements by reviewing the development workflow of a practical example, and then generalize those considerations to a wider range of applications. Using MATLAB® code, explore what it takes to make a device selectively wake up with trigger phrases like "Hey Siri" or "OK Google". Learn data-specific best practices, including data labeling and annotation, data ingestion, data synthesis and augmentation, feature extraction, and domain transformations.
Published: 20 Oct 2020
In today's session, basically we're going to walk through the design of an AI-powered system example. And that system is going to be a voice keyword spotting system. So the purpose is to give you guys a good overview of the deep learning workflow and the capabilities that MATLAB brings to this workflow. This particular example is an audio-focused example. But you'll find that a lot of the aspects of the workflow will be applicable to other sorts of deep learning applications, particularly those in the signal processing domain.
So let's just set a little context around the terminology that we're using here before we jump into the example. So you might hear about deep learning. You might hear about AI. AI, obviously, has been around for quite a while, going all the way back to the 1950s. So where do machine learning and deep learning fit into this?
So machine learning is a subset of AI, where you're basically training a machine to figure out a task without explicitly programming it to do that task. And then deep learning is basically a subset of machine learning. But typically, you're working with a much larger or deeper network. So there's more layers to the network. And there's typically more data involved in training it-- hence the term deep. And because of the depth of these networks, they're capable of learning tasks directly from the data. And therefore, they can be a lot more capable than a traditional machine learning model.
If we focus in a little bit more specifically on signal processing applications, we can find it in widespread use in many different areas. We've got a few different examples here, specifically of areas where it's been used by MathWorks customers and applied to applications ranging from biomedical, to oil and gas drilling, and, of course, audio, which is what we're going to be talking about today. So the links here bring you to user stories on our website. So you can go and peruse those. Again, you'll get copies of these slides and be able to click through those later on.
And it gives you a good idea of the widespread applicability of deep learning to solving a wide variety of different signal processing problems. A lot of people think of just classification. But that's really just one possible problem that can be solved with deep learning-- many different varieties of task, such as signal segmentation and signal denoising, can be solved as well.
And so of course, today we're here to talk about one practical example. And that is trigger word detection. So I'm sure most people are pretty familiar with what it is. But let's just refresh ourselves. Here's one example, my colleague Gabrielle here. I'm not hearing that audio. Let me just double check my settings here. Should be able to hear that.
Once again. Can somebody in the chat just let me know if you're able to hear that audio? I'm not hearing it right now. I tested it earlier, and I was able to hear it. OK, you guys aren't hearing it either. Of course, the audio engineer has audio problems. Let me just try-- let's see, I am unmuted. Unmuting the meeting again.
OK. So yes, he is saying, hey Siri. And then the trigger word is being detected. I wonder if something-- something must've gone wrong. But I'm going to keep plowing through here. And we'll make the most of it. All right, so that is, of course, the practical real world example. And now, let's talk through and discuss what it takes to create this with MATLAB.
So what we're seeing here is basically the end result of our example today. So we've created a keyword detector. And sorry, just getting rid of that background noise. So what we're looking at now is our same keyword example that we've created in MATLAB.
But at this point, this is the end result of it. So we developed it in MATLAB. And we've actually converted to C code using MATLAB Coder. And we've exported it out into the real world.
So this is the example deployed as a VST plug-in algorithm. For those of you that do audio processing, you might be familiar with this. But basically, it's a DLL, a compiled C DLL version of our algorithm. And that allows us to bring it into other applications outside of MATLAB and test it out.
So in this case, it's running inside of a DAW, digital audio workstation, application called REAPER. And again, I apologize that you're probably not going to hear this audio. But we can visualize it. And what you're going to see is, when the audio plays back, we have a ding or chirp sound that basically indicates that the trigger word was detected.
And in this case, the trigger word that we're training for our example is going to be the word yes. So we have a couple parameters for our algorithm that we can tune. We start recording. And we listen to Gabrielle as he recites some different phrases.
When he says the word yes, the trigger occurs. And you hear the ding. And yes, for those of you in the chat that didn't hear it in the beginning, you'll get access to the slides later on. I'm also recording this. And we'll try to get the audio sorted out for the recorded version that we put up later on. So I'll re-record it with the audio, if necessary.
All right. So this is the example we're going to walk through. I do want to emphasize that most of the code for this example is available online. So again, when you have access to these slides, you'll be able to just click this link. That'll bring you to the doc page for our Audio Toolbox, where you can find this example. And what I'll be doing today is highlighting components that exist in the published example, as well as a few other things we've done to enhance it for the purpose of this presentation.
All right, so what's involved in developing this system? So most people that are new or might have a little bit of experience with deep learning, their first inclination might be to focus on the network itself. So you're thinking of, OK, what's the right deep network design for this type of system?
It's going to have some sort of network architecture that is appropriate for a sequential input, such as an audio signal. And for those that do some deep learning, you might be familiar with the LSTM or BiLSTM network as being one of those types of networks. And then we finish our network design with a couple of layers that convert the outputs of those LSTM layers into a final classification for determining if the word yes was present in the audio recording.
A second answer to this-- and one that is in line with this presentation title-- is that a lot of data is also really important to training this type of system. And we're going to focus a lot in today's session about how important the data is to training this type of system effectively. And of course, along with that, having signal processing familiarity or expertise and having the right set of tools is also going to be extremely important.
So an interesting insight that was relayed by Andrej Karpathy, who is the director of AI at Tesla, was basically that, in the real world, a lot more man-hours are spent on collecting, refining, and preparing the data for training a deep learning network. And this is very different from where the time is spent in research or academia, where it's mostly on the models and algorithms.
And this mirrors what we see at MathWorks with the customers that we support in the commercial world trying to solve real-world applications. We always start off working closely with them, trying to help them figure out what is the right data to use for training the model, depending upon what type of problem they're trying to solve.
Let's talk a little bit about the workflow for AI-driven system design. So of course, the title of this presentation is data. And that's going to be the first part of the workflow. So the preparation and the cleansing of data-- that includes labeling data and cleaning it up. In the case of audio, it might be converting it to the right sampling rate, removing outliers, and things like that.
It could also include things like simulation-generated data. So in many cases, you might not have access to the ideal data for training your network, or any data at all. So having the ability to run models and simulate realistic data is going to be incredibly important to creating useful real-world models. And of course, we have AI modeling itself. So that's going to be choosing the right layers, choosing the right parameters for your tuning, and having the ability to accelerate that so you can train effectively with hardware such as GPUs.
Interoperability-- by that, we basically mean the ability to use different frameworks for training your AI model. So that might be MATLAB. But it might also be other tools from the open source community. So the ability to kind of interoperate with those tools and go back and forth can be very valuable.
And then once you've got a model, you need to be able to test it. And you need to be able to test it in the context of your system. So it's not going to live in a vacuum. So being able to place it into either a virtual environment or join it together with any pre- and post-processing that occurs in your system is going to be very important to ensuring that it can be verified and validated.
And then finally, if you're not just doing research and you're trying to solve a real world problem, you're going to want to take your system and bring it out somewhere. And the ability to generate code and bring it to embedded devices and enterprise systems is going to be very important.
All right, so with that kind of workflow in mind, here's the agenda for how we're going to walk through this trigger word detection example. So we're going to go a little bit out of order. We're actually going to start with the basics of training the network first. And then I'll back up to the first step and spend the bulk of the presentation in the middle three bullets, basically focused on the data because it's such an important part of training these systems. And from there, we'll wrap up with taking the trained model and how we got it out into the real world and created a prototype that works in a real-time fashion.
And again, thank you for putting the questions in the chat. I'll answer those at the end. But feel free to keep putting them in.
All right, so the layers part itself is pretty straightforward. Basically, in MATLAB and in other frameworks, it's simply a matter of constructing an array, and then inserting the layers that you want to use for your model. Of course, that's the simple way of describing it. The hard part is what layers do you choose. So we'll get into that in a minute.
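Concretely, a minimal sketch of that layer array might look like the following; the layer sizes here are illustrative assumptions rather than the published example's exact values.

```matlab
% Minimal sketch of a sequence classification network for keyword spotting.
% Sizes are illustrative assumptions, not the published example's values.
numFeatures    = 39;    % features per frame, e.g. MFCCs plus deltas (assumed)
numHiddenUnits = 128;   % LSTM state size (assumed)
numClasses     = 2;     % keyword vs. background

layers = [ ...
    sequenceInputLayer(numFeatures)
    bilstmLayer(numHiddenUnits, 'OutputMode', 'sequence')  % per-frame outputs
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];
```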
In this case, the LSTM layer is the key to this particular deep learning design. And it's pretty commonly used in any type of problem that deals with sequence-type data-- sequence, or signal, or time series, whatever terminology you want to use. And the reason is that it has pretty good properties for both long- and short-term memory. So that's why it has this somewhat contradictory name of long short-term memory.
And the way it gets those properties is by having some capability to remember previous inputs. And that's basically this feedback component, where the input at each successive step includes a little bit of the previous output as well as the new input. And there's basically gates, or coefficients, within each one of these layers that control how much of the previous history is passed through and how much of the new input is included.
So RNN, in this case, is basically saying that the LSTM layer is a type of recurrent neural network, or RNN for short. You'll find there's a few other types. But the LSTM is one of the most common and one of the most popular.
All right, so we picked our layer that's going to be doing the bulk of the work for us. And then we've sandwiched it with input and output layers. So we have a great tool in MATLAB to help us visualize this in more of a graphical fashion. And of course, that can work side by side with the programmatic approach.
So the Deep Network Designer tool allows you to use more of a drag-and-drop, GUI-type approach to network construction. And it's got some other great capabilities that we'll look into in a minute. One of the nice things is we also have analysis capabilities built into this tool. So the ability to quickly analyze the size of your network, to understand the resources required for its utilization, can be very important, particularly if you're trying to deploy to an embedded system and you're trying to choose the most effective network without going overboard.
So how did we end up with this LSTM? There's a few different approaches we could take. We could basically find research papers in our field that are applicable to the problem that we're trying to solve, such as what you see here on the left. Another approach is to import models from other frameworks. ONNX, in this case, is the Open Neural Network Exchange format. It's not a particular framework, but a file exchange format that allows us to go back and forth between these different frameworks. So this is touching upon that interoperability piece that I mentioned earlier on.
All right, so once you have the layers configured, then you have to set up some different options for your training. So these training options are sometimes referred to as hyperparameters, or the parameters of the training itself. That, again, is basically just going to be an array or a structure that contains the different settings for these options.
And then when we're ready to perform the training, both the layer structure and the options get fed into the trainNetwork function, along with the inputs and the outputs, of course. The inputs define either the features that we're using, or the signal itself in some cases. And the outputs would basically be the label, or the mask, giving the value of the output at each point in time for the input features.
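Putting those pieces together, the training call might look something like this sketch, where XTrain/YTrain and XVal/YVal are assumed to hold the feature sequences and per-frame label sequences, and the option values are illustrative:

```matlab
% Sketch of the training setup; option values are illustrative, not tuned.
options = trainingOptions('adam', ...
    'MaxEpochs', 10, ...
    'MiniBatchSize', 64, ...
    'InitialLearnRate', 1e-3, ...
    'ValidationData', {XVal, YVal}, ...   % held-out data, discussed below
    'Plots', 'training-progress');

net = trainNetwork(XTrain, YTrain, layers, options);
```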
All right, so when you fire up the trainNetwork function, it's going to give you a progress plot that looks something like this. We've sped up the animation here just to give you a good visualization in a short period of time. But basically, the blue line represents the accuracy on the training data. And the black dots represent the accuracy on the validation data.
And of course, we want both of these to increase over time. And then on the bottom, we have the loss plot. And we want that to go down. So the loss basically represents the difference between the ideal output and the predicted output that the network is giving us at that particular iteration. So the closer it gets to the ideal, the smaller the loss will be. Another important point about training is that, if you have a GPU available, you'll be able to utilize it. And it's going to speed up your training significantly, often by an order of magnitude or more, depending upon the specific application and network architecture you're using.
All right, so in this case, we end up with the final output. And then again, ideally the validation accuracy would be as close as possible to the training accuracy. But there's usually going to be a little bit of a gap. And what that represents is basically that the validation data was held out during training. So it wasn't used for the adjustment of the weights.
So it's new, fresh data that the network hadn't seen during the training. And it helps us basically guard against overfitting to that training set. If the network fits too tightly to the training data, that means your training set isn't representing a wide enough variety of real-world data. And you need to maybe go back and either get more data, or revisit the partitioning that you did between training and validation.
All right, so one of the key themes that you'll find with our tools for deep learning is we've included a lot of different apps for trying to speed up the process of exploring different network architectures and exploring different training options and data sets. So we already talked a little bit about the Deep Network Designer. But there's another great tool called Experiment Manager that will help you go through and iterate across different sets of hyperparameters or different sets of input data.
And coming soon in this tool is a really cool feature called Bayesian optimization, which will use optimization techniques to automatically pick the best hyperparameter values for training. So that's a really great feature that's coming in a future release.
So this is just a little bit more detail on that Experiment Manager. So you can see here, basically what we're doing is setting up an experiment where we're sweeping across different sets of these hyperparameters. And for each experiment, we get a different training accuracy plot. And then we can kick this off, go get a cup of coffee, come back, look at all of our different output results, and choose the best one for our application.
I do see the questions coming in. Thank you. Please keep them coming. We'll address those when we get to the end of the session.
All right, so let's move on to the data aspect. Pretty quickly here, this is pretty fundamental stuff. But we already talked a little bit about how, when we're training a network, we need to do some splitting, some partitioning, of training and validation data. What's interesting is that, as networks get deeper and data sets get larger, we can adjust the relative proportions that are most commonly used for the training process.
So traditionally, going back to maybe machine learning types of applications, it might have been this type of split, where you're dealing with input data sizes in the kilobytes-to-megabytes range. But when you crank it up to gigabytes or terabytes, in such large quantities of data, you don't need as large a validation or test percentage, because you just have such a large training set to begin with. So this is a trend that we see in the deep learning space.
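As a rough sketch, a partition along those lines is just index bookkeeping. Here, ads is assumed to be an audioDatastore over the recordings, and the 90/5/5 proportions are illustrative:

```matlab
% Random 90/5/5 train/validation/test partition (illustrative proportions).
numFiles = numel(ads.Files);   % ads: an audioDatastore (assumed)
idx      = randperm(numFiles);
nTrain   = round(0.90*numFiles);
nVal     = round(0.05*numFiles);

adsTrain = subset(ads, idx(1:nTrain));
adsVal   = subset(ads, idx(nTrain+1:nTrain+nVal));
adsTest  = subset(ads, idx(nTrain+nVal+1:end));
```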
So this is one of the slides where, unfortunately, the lack of audio sharing is going to inhibit this a bit. But I'll do my best to talk through it. Essentially, what we have right now is one sample from a speech database of some sort, so basically a bunch of recordings of different folks saying random real-world sentences. And again, the purpose of our task here is to train a keyword detector on the word yes.
All right, so this sentence here contains the word yes. But it also contains many other words. So in order to produce our training data, we're going to need the word yes. But we're also going to need the word yes in the context of other words. And we're going to need to have it separated from the other words as well. So the step that we're trying to accomplish here is to produce this isolated version of yes, separated out from the full sentence.
Unfortunately, I'm still not hearing the audio. So we'll just keep talking through it.
So how do we get there? So again, the keyword mask here is basically highlighting those regions of the word yes. So how do we get to that? We need to do some labeling, right? So we've got this non-annotated audio data. One way we could do this is we say, OK, we need some sort of intelligent system that can carry out this task. And it needs to be accurate, because the model is not going to be any good if the labeling isn't accurate.
One way to do it is we get a whole bunch of people, interns most likely. And we train them to do this labeling manually. But they're still going to need some sort of tool to do it. So we have an audio labeling tool that can work in both a manual and an automated workflow. We're going to show the manual part first.
So the idea here is basically you've got your audio data. You can listen back to it, and then identify the regions where that particular word is being spoken. Once you've identified it, then you can go ahead and apply a region of interest label to it. That's what we're doing here.
And we could do this for individual files. You can see, on the left, we've got just a couple of files loaded in. But this app will allow you to basically point it at a whole folder of hundreds or thousands of files. And you can go through and do it this way. It's not ideal. But it is feasible. And having access to this sort of tool is certainly the first necessary step.
So how do we make this process a little less painful? And that's basically by using some pre-trained models to automate the process. So remember, we're just trying to build a keyword detector. We're not trying to build something that transcribes all speech without limitations. But we do have access to such systems through APIs. And that's basically what we're showing here.
So within this app, and also via a function called speech2text, we can access existing APIs from Google, IBM, and Microsoft for doing speech-to-text conversion. And that's basically what just happened there in the first part of the video. You can see that the labels get applied.
So basically, we retrieved the transcription from the third-party service. And then we populated it into the app here. After you've done that, a human is probably still going to need to be involved in verifying the results. So you can zoom into the different regions, and then go through and manually inspect, verify, and make any changes if you need to.
And again, the really cool part about this is it can work with huge numbers of files. So we just showed it with one file. But if I wanted to point it at 1,000 files and say, OK, go speech to text convert all of these, and then apply the labels, it can go and do that. And then the human's job is just to verify. And that makes it much easier.
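Programmatically, that batch workflow might look something like the sketch below. The speech2text call and its signature are assumptions here; the exact interface depends on the third-party service you connect to.

```matlab
% Hypothetical batch transcription loop; the speech2text signature is an
% assumption based on the workflow described above.
ads = audioDatastore('unlabeledSpeech', 'IncludeSubfolders', true);
while hasdata(ads)
    [audioIn, info] = read(ads);
    txt = speech2text(audioIn, info.SampleRate);  % third-party API call (assumed)
    % ...convert returned word timings into region-of-interest labels...
end
```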
All right, so then the final output of something like that might look like this. So again, we're just trying to put a mask around the keyword, which is the word yes. And that's what we've done for this particular audio file. And now that we've got those region-of-interest labels for the word yes, we can easily isolate the keyword using the logical indexing capabilities in MATLAB and separate it out for when we're ready to do our training.
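Conceptually, that isolation step is just logical indexing. A minimal sketch, assuming mask is a logical vector aligned sample-by-sample with the audio:

```matlab
% mask: logical vector, true wherever the word "yes" was labeled (assumed).
keywordSamples    = audioIn(mask);    % just the keyword audio
backgroundSamples = audioIn(~mask);   % everything else
```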
All right, so that's the labeling aspect. What about the generation of data? So you might have--
I found this on the web
That was my keyword detector just speaking to me. You might have a recording of audio data. But most likely, it was made by some folks sitting in front of a microphone in an ideal scenario. That's not really realistic for the real world, right?
So in the real world, if you're doing a keyword detection system, you might be talking to your phone in the car, or in your house with machinery running in the background. So how do we create that and make it realistic? So having access to some sort of tool that can provide this type of augmentation would be very valuable.
So we have a feature called the Audio Data Augmenter that can do exactly that. So one thing we want to do is apply some realistic reverb. So we've got, in this case, the speaker located in a kitchen environment. So there's a lot of tiles. So we might want to have sharp reflections and that type of reverb. There might also be machinery running in the background. So we need to turn on some washing machine noise and add that to the background as well.
All right, so in addition to that, we might want to just increase the scope of our data set by applying some speech effects. So that could be things like stretching out the audio. That's one way to slow it down. But another, more realistic way to vary it might be pitch shifting. So that's going to be a little bit different than just time stretching.
So time stretching would basically be the process of lengthening the word without changing the pitch, whereas pitch shifting would keep the same duration, but change the pitch. And then, of course, a time shift would basically just be the exact same piece of audio, but moved within the window in which we're looking at the audio data.
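Audio Toolbox has functions for each of these operations; here's a quick sketch with illustrative parameter values:

```matlab
% Three basic speech effects (illustrative parameter values).
stretched = stretchAudio(audioIn, 1.2);        % time stretch: longer, pitch unchanged
shifted   = shiftPitch(audioIn, 3);            % pitch shift: up 3 semitones, same duration
moved     = circshift(audioIn, round(0.1*fs)); % time shift: same audio, moved by 100 ms
```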
When we do these sorts of effects, it's pretty important to try to do them as realistically as possible. So take the case of pitch shifting. Those of us that are audio people have probably all heard the very simple pitch shifting effect where, basically, you take some sort of audio and play it back at a higher or lower sampling rate. It doesn't necessarily sound very realistic, because you're changing what are called the formants, which are the relative locations of the peaks in your vocal tract's frequency response.
When you change that and stretch it out or shrink it, that's how you get that chipmunk-y sound, or a lower, more robotic-sounding effect on your voice. So we've taken that into account when we do our pitch shifting within this augmenter. We use a more sophisticated algorithm that includes formant preservation. And that allows you to get around this problem.
All right, so how is this used in MATLAB? So basically, it's an individual object that you can construct and preset it with different augmentations that you want to do. And then you have parameters available, such as the range of augmentation and the probability of augmentation. So in other words, you might use that probability to say, OK, I want to pitch shift everything. Or I only want to pitch shift some percentage of my data set.
And when you do this, of course, it makes a difference if you're going to do it in a sequential or series fashion, or if you're doing it in a parallel fashion because, depending upon which one you choose, you'll get different results. So we give you the flexibility to do either. And that way, you have that level of customization for the type of augmentation that you're trying to do.
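Here's a sketch of configuring such an augmenter. The parameter names follow the audioDataAugmenter documentation, but the values are illustrative:

```matlab
% Augmenter applying its enabled effects in series ('sequential' mode),
% each with a 50% probability per clip (illustrative settings).
augmenter = audioDataAugmenter( ...
    'AugmentationMode', 'sequential', ...
    'NumAugmentations', 3, ...                % variants produced per input clip
    'TimeStretchProbability', 0.5, ...
    'PitchShiftProbability', 0.5, ...
    'AddNoiseProbability', 0.5, ...
    'SNRRange', [0 15]);

data = augment(augmenter, audioIn, fs);       % table of augmented signals
```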
So that's augmentation for audio. But in other cases, let's say you might not even have a data set to begin with. So in those sorts of situations, it really becomes more of a synthesis type of approach. So there's various tools available, depending upon your specific domain application, that can help out with this.
So we talked earlier about speech to text. We also have text to speech. So that's going the other direction-- basically using some existing third-party APIs to access text-to-speech synthesis engines, and then creating more data in MATLAB. In other application domains, if you're not doing audio, you might want to use MATLAB and Simulink to synthesize radar or comms-type data. Across all of our different toolboxes, we have a number of different examples of doing this type of data synthesis, and then applying it to these real-world problems.
And what's cool is, in a lot of these examples, we then take the model trained on synthesized data and apply it to real data. So for example, in the wireless space, there's one where we train on some synthesized data-- I believe it's WLAN data. And then we set up some software-defined radios. And we transmit the data back and forth. And then we pass it through the classifier. And we're able to detect the modulation scheme successfully. So there's some good examples in different domains out there.
All right, so that brings us to creating inputs for deep networks. So we've talked a lot about the data that we're going to construct. But what are we actually feeding into these deep networks?
So some people might assume, well, OK, you've got your data. Just feed that directly into the network. And that might work in some cases. But it's usually not the most effective technique. And it's not what we see most often in the real world.
There's usually some sort of intermediate step that's often referred to as a feature extraction step. And this is basically a conversion that reduces the amount of data that gets fed into the network. And it serves a couple of purposes.
So it helps us use a smaller, more efficient network. And it helps give the network a leg up. It gives it a jumpstart. We're using some of our domain expertise to identify what are the features that are probably going to be most important to helping this network make its determination, in this case its determination being a type of classification.
So depending upon what application you're working on, you might choose a different approach to this feature extraction. In the case of speech, which we've been talking about today, there's a very common approach. So we start off with the time-domain audio data. We apply a windowing operation to it.
And then the first step might be to do just a conventional short-time Fourier transform and look at the magnitude. And if we do that over successive windows, we end up with this thing called a spectrogram, right? And that might be good enough for many applications. But we can keep going.
And we can, say, apply this thing called a Mel filter bank and do a log-domain conversion. And that gives us a Mel spectrogram. The difference there is we've now converted it to the log domain. So we've made it more in line with how our hearing works. We hear things logarithmically. And then we've weighted it via this Mel filter bank. We've weighted it to be most sensitive to the frequencies that we tune into with our ears when we're listening to speech.
Then finally, we might want to go even one step further and apply the DCT to that. And that produces the MFCC, which is the Mel-frequency cepstral coefficients, which are like an even further reduced summary of that Mel spectrogram.
So this is the particular feature extraction technique we use for our keyword-spotting example. Of course, it's a lot simpler than this in MATLAB. It just boils down to calling the mfcc function. But I just wanted to give you this nice visual of what's actually happening under the hood of that function.
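In other words, the whole pipeline above collapses to a single call (the filename is a placeholder):

```matlab
[audioIn, fs] = audioread('speech.wav');  % placeholder filename
coeffs = mfcc(audioIn, fs);               % one row of coefficients per analysis window
```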
And that's what we're using for our particular example. But it's certainly not the only possibility. So across all the different toolboxes within the signal processing domain, we've got a number of different options available to us. A lot of these are in the base Signal Processing Toolbox for MATLAB. But there's also the wavelet transform, which can be a more effective type of time-frequency transformation in some cases.
And we basically can use the best tool for the job. We've got a lot of tools in our tool chest here for frequency domain conversion and time frequency maps. And we can choose the one that's most appropriate. And if we don't know, we can easily experiment, and iterate, and figure out the one that's most appropriate.
All right, so we've talked about how we create the data. So let's put this all together into one big picture here, in terms of how we actually get the data from the labeled audio data into the network. So we're going to go through and do a feature extraction step that gives us a set of features with the same label that the audio data had. Then we're going to train our network.
And then optionally, if we chose to, we could also do the augmentation step here as well. So this is if we needed to create more data. And if we do that, then it's going to be basically a multiplier on the data size, depending upon how many different types of augmentation we do.
So here, we're doing three. So we end up with-- basically, for every original input, we ended up with three different augmentations. And then we have the feature extractions of those augmentations. So we can easily expand the size of our data set quite significantly.
And if we're doing that, we want to make sure we do it efficiently. And again, we talked about using the GPU for the actual training process. But having GPU processing available for the augmentation and feature extraction processes is going to be equally important. And the mfcc function happens to be one of those functions that has recently gained gpuArray support. And you can see that, for larger input lengths, the speed-up factor becomes pretty significant. And that can be a really huge benefit when it comes to doing this training, especially if you're working with very large data sets.
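Using it is just a matter of wrapping the input in a gpuArray, as in this sketch (requires Parallel Computing Toolbox and a supported GPU):

```matlab
coeffsGPU = mfcc(gpuArray(audioIn), fs);  % feature extraction runs on the GPU
coeffs    = gather(coeffsGPU);            % bring the result back to host memory
```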
All right, so now, we can start to bring it back to the beginning and put together the whole picture of everything we've talked about. So we've got some data. We've extracted some features from it. And we want to basically use that now on our trained model to create new predictions. So this is a process that's often referred to as inference. And the output of the model is going to just be a new label or a new mask, in this case, indicating if the keyword was detected or not.
So what does that look like programmatically? So remember back in our original example, we had my colleague Gabrielle up here talking and basically saying sentences. And when it heard him say the word yes, it would spit out a ding. All right, so what we have to do, for inference, is we have to take his audio data, and we have to do the same MFCC feature extraction step that we did during training because if the features are the input during training, then they also have to be the input during inference when the network is deployed.
So again, that just boils down to a single MATLAB function called MFCC. Once we've done that, it's pretty common to normalize feature data. So we did this during the training as well. I just didn't include that particular piece of code when we were talking about it.
But we can take our normalized feature data, and then feed it into our trained network. So net, in this case, is the output of that trainNetwork function that we saw at the beginning. And we can call the classify method, or the classify function, on that trained network and feed in the new data.
Sorry about that. Jump back. We can pass in the new data, the new feature matrix, and get a new output, which is going to be our keyword detection mask. And then we just have a little bit of post-processing that basically, depending upon whether that mask was detected or not, allows us to generate a chime when the detection event occurred.
And of course, if you're using this in a real system, you might do something more exciting than generating a chime. Maybe you'll turn on your smart speaker or whatever it is you're looking to do. But in this case, we just needed to play some sort of indication that the network was working.
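Pulled together, the inference path might look like this sketch. The normalization statistics M and S, the "keyword" category name, and the chime variables are assumptions standing in for the example's actual code:

```matlab
% End-to-end inference sketch (M, S, chime, and fsChime are assumed inputs).
features = mfcc(audioIn, fs);
features = (features - M) ./ S;        % same normalization as during training
mask = classify(net, features.');      % per-frame labels; input is features-by-time
if any(mask == "keyword")              % category name assumed
    sound(chime, fsChime);             % audible indication of a detection
end
```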
All right, so this is what the MATLAB code boils down to once we have our trained network. And we can take this, and bundle it together, and go from our model to a prototype that we can bring out into the real world. So let's just revisit the AI system design workflow.
And basically, we've covered these three steps-- the preparation, the modeling, the simulation, the simulation and test being what we just saw on the previous slide. And now, we're ready to do the deployment.
So to do the deployment, we basically have to take the code that we've produced, convert it into something that we can generate code from, and then, depending upon where we're trying to go to, we can configure the appropriate code generation settings, whether I want C code or C++ code. Or maybe I'm even going to something else, some other target, and then test and deploy that code on that actual hardware.
So what's involved in doing that? So here's the code that we just saw previously. We basically can bundle that together into an individual function, called triggerWordDetector. For those of you that have used some of our code generation tools before, you might be familiar that basically we take a subset of the MATLAB language that is supported for conversion to C and C++ code. And that subset is growing larger and larger with every release.
And what's good, particularly in the signal processing domain, is that all the new functionality that we're putting out for machine learning and deep learning applications comes out of the box supporting code generation. So take something like mfcc. In some other frameworks, you might have access to that during training. But when it comes to taking the model out into the real world, you'd have to go and reproduce what you used during the training process. In this case, we don't have to do that. We can use the exact same function. And it works for code generation as well.
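As a generic sketch of that flow, assuming the inference code has been bundled into a function named triggerWordDetector that processes one audio frame at a time (frame length and sample rate are illustrative):

```matlab
% Generate C code from the bundled function; example sizes are illustrative.
frameLen = 512;
codegen triggerWordDetector -args {zeros(frameLen,1), 16000} -config:lib -report
```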
So we can take it and bundle it up into the new function. And then in this particular case, my deployment path of choice is an audio plugin. So in the Audio Toolbox, we have the ability to create these audio plugins. An audio plugin, in this case, refers to basically a compiled library or DLL. We generate C code, put it into one of those DLLs, and then there are different plug-in formats that are compatible with different host applications.
And that allows you to take your compiled plugin and bring it out into the real world. I can run it in a streaming fashion in one of these other host tools. I can share it with colleagues that don't necessarily have MATLAB. And I can optionally even view the generated C code.
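For context, the plugin side of that path is built around an audioPlugin class. Here's a minimal, hypothetical sketch of such a wrapper; the real example's plugin runs the actual detection inside process:

```matlab
% Hypothetical plugin wrapper; class name, parameter, and pass-through
% process method are illustrative, not the example's actual plugin.
classdef TriggerWordPlugin < audioPlugin
    properties
        Threshold = 0.5;   % detection threshold exposed as a plugin parameter
    end
    properties (Constant)
        PluginInterface = audioPluginInterface( ...
            audioPluginParameter('Threshold', 'Mapping', {'lin', 0, 1}));
    end
    methods
        function out = process(plugin, in)
            % Feature extraction, the trained network, and chime mixing
            % would go here; this sketch just passes the audio through.
            out = in;
        end
    end
end
```

Running generateAudioPlugin TriggerWordPlugin then produces the compiled plugin that a host application can load.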
In this case, we're showing a tool called JUCE. I recognize that the text is probably too small for you to see. But the key thing here is, if you do have the MATLAB Coder product, you can get access to the generated code as well.
And that brings us back to the beginning. So then we're now back to where we started. We've got the compiled DLL. That's going to basically be called triggerworddetection.dll. And we can bring that into our other software here, such as REAPER digital audio workstation software. And we can run our algorithm in real time, and adjust the parameters, and verify that it's working.
All right, so to use that-- again, we used a technology from our code generation tools. We have MATLAB Coder, which generates C code for desktop applications, as well as embedded applications. We have GPU Coder, which I didn't really talk too much about today.
In the signal space, we see a little bit more interest in CPU-type deployment. But certainly for those of you that are doing image processing or more complex signal processing, you might be more interested in this. And you can follow up with some of our resources there if you'd like to learn more. And then we'll also have FPGA deployment coming in the very near future.
Coming back to the questions that we started off with at the beginning, hopefully, throughout the course of today's session, you got a good understanding of why these two answers are both correct. Neither one is more correct than the other. But they're both correct. And they're both applicable. And we covered why data is such an important part of training a model effectively.
All right, so I'll leave you just with a couple of next steps. So we just walked through one particular example today-- the first one on the left-- of trigger word detection. Again, the link is in that slide that you saw earlier, which we'll send to you. But if you go to our Doc, you'll find many, many more examples across all of the signal processing space, and image processing as well. But we're just focusing here on signals today.
And it's not just classification. There's real-time signal processing examples for things like separating speakers in a noisy environment, de-noising signals, other types of classification, and other application areas. So you'll find all kinds of good stuff. If you can't find something you're looking for, please reach out. We'd love to help you out with that sort of thing.
I didn't really touch too much on specific toolboxes. But I got a lot of questions with regards to what's used in the demo. So I did want to just throw this slide in here to indicate what particular toolboxes were used. The Audio Toolbox is basically where the feature extraction, the augmentation, the labeling-- anything specific to the audio part of this example, including mfcc-- comes from, whereas the actual deep learning training and the acceleration of that training happen in Deep Learning Toolbox and Parallel Computing Toolbox.
And then finally, for the deployment side of things, I didn't actually use GPU Coder today. But I referenced it. I did use MATLAB Coder to get the C code out of my trained network.
There's a lot of really good follow-up resources you can use if you want to dig in further. For those of you that are new, I'd recommend starting with the Onramp, which is basically a hands-on tutorial for getting started with deep learning. For those of you that are maybe a little bit more experienced, the 16-hour in-depth course is a great option.
And of course, we have a lot of resources. One of the big differences, I guess, of using MATLAB for your deep learning applications is you get access to real engineers for support. And we love to help, and work with our customers, and learn about the projects that you might be trying to solve.
So please feel free to reach out. My email is at the beginning of the slides: adamcook@mathworks.com. And we have a lot of different tiers of support that can help you if you have a specific project in mind.
And I just want to re-emphasize, because we get a lot of questions: why should I use MATLAB for deep learning when I've already started off with some of these other tools? And really, there's no one reason. But there's a lot of good, different reasons. And we talked about many of those today, so I won't repeat them. But I think the most important one, in my experience, is basically the people at MathWorks-- really smart people who like to help our customers and help you solve your deep learning challenges.