Automating Machine Learning with DevOps for MATLAB and Simulink
By Peter Webb and Gokhan Atinc, MathWorks
As more organizations rely on machine learning applications for core business functions, many are taking a closer look at the full lifecycle of those applications. The initial focus on development and deployment of machine learning models has expanded to encompass continuous monitoring and updates. Changes in the input data may decrease a model’s predictive or classification accuracy. Prompt retraining and model evaluation produce better models and more accurate decisions.
In machine learning operations, or ML Ops, the plan, design, build, and test activities of development are linked with the deploy, operate, and monitor activities of operations in a continuous feedback loop (Figure 1). Many data science teams have started to automate parts of the ML Ops cycle, such as deployment and operations.
Fully automating the cycle, however, requires additional steps: monitoring and evaluating model performance, incorporating the results of that evaluation into a better performing model, and redeploying the new model. Implementing this automation offers significant benefits, enabling data scientists to spend more time designing useful ML solutions and less time on IT administration and tedious, error-prone manual tasks.
To illustrate how Model-Based Design with MATLAB® and Simulink® can be used to automate ML Ops processes, we implemented a predictive maintenance application for a fictional metropolitan transit system. The organization needed a way to plan for the maintenance or replacement of batteries in its fleet of electric buses before the batteries were at risk of failing while on the road.
The application includes a machine learning model that uses battery state of charge (SOC), current, and other measurements to predict the battery’s state of health (SOH). Other components include an application server that runs the machine learning model at scale, a drift detection component that compares observed data with training data to determine when retraining is required, and a high-fidelity physical model of the battery that enables automated labeling of observed data.
For many organizations, this final component—the high-fidelity physical model—is the missing piece that enables full automation. Without it, a human is needed to review observed data and apply labels; with it, this fundamental step and the full ML Ops cycle can be automated.
Building Models for Battery Data Generation and Automated Labeling
Before we could begin training a machine learning model to predict a battery’s state of health, we needed data. In some cases, an organization may have lots of data collected from real-world systems in operation. In others, including for our fictional transit system, data must be generated via simulation.
To generate training data for the transit network’s battery system, we created two physics-based models with Simulink and Simscape™. The first model, incorporating dynamics from the electrical and thermal domains, generates realistic raw sensor measurements, including current, voltage, temperature, and SOC (Figure 2). The second computes SOH from the battery’s estimated capacity and the internal resistance, which are derived from the measurements produced by the first model. It is this second model that allows us to automatically label observed data and drastically reduce the need for human intervention in the retraining loop.
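As a rough illustration of how a label could be derived from capacity and internal resistance, the Python sketch below uses two commonly cited SOH indicators. It is not the Simscape model described here: the function names, the end-of-life resistance parameter, and the rule of taking the more pessimistic of the two indicators are all assumptions made for the example.

```python
def soh_from_capacity(capacity_ah, rated_capacity_ah):
    """Capacity-based SOH: remaining capacity as a fraction of rated capacity."""
    return capacity_ah / rated_capacity_ah


def soh_from_resistance(r_ohm, r_new_ohm, r_eol_ohm):
    """Resistance-based SOH: 1.0 for a new cell, 0.0 at the end-of-life resistance."""
    return (r_eol_ohm - r_ohm) / (r_eol_ohm - r_new_ohm)


def label_soh(capacity_ah, rated_capacity_ah, r_ohm, r_new_ohm, r_eol_ohm):
    """Hypothetical labeling rule: take the more pessimistic indicator, clamped to [0, 1]."""
    soh = min(
        soh_from_capacity(capacity_ah, rated_capacity_ah),
        soh_from_resistance(r_ohm, r_new_ohm, r_eol_ohm),
    )
    return max(0.0, min(1.0, soh))


# Illustrative values only: 90 Ah left of 100 Ah, resistance 12 mOhm (10 new, 20 end of life).
example_label = label_soh(90.0, 100.0, 0.012, 0.010, 0.020)
```

With these made-up values the capacity indicator is 0.9 and the resistance indicator is about 0.8, so the label is driven by the resistance growth.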
By applying independent aging profiles to individual batteries and varying the input ambient temperature to the first model, we created a historic data set for a large fleet of vehicles, suitable for training our predictive maintenance machine learning model.
Building and Deploying the ML Model
Once we had training data, we turned our attention to the ML model. We used the Diagnostic Feature Designer app to explore the raw measurements, extract multidomain features, and select the feature set with the best condition indicators. Because our objective was to automate the entire cycle, we needed to automate model selection and training as well. To this end we created a component that we refer to as AutoML. Built in MATLAB with Statistics and Machine Learning Toolbox™, this component is responsible for automatically finding the best machine learning model and optimal hyperparameters for a given set of training data. The AutoML component gets the cycle started too: It generates our initial machine learning model from the original training data and our feature set.
In addition to support vector machines, the AutoML component trains and evaluates linear regression models, Gaussian process regression models, ensembles of boosted decision trees, random forests, and fully connected feedforward neural networks (Figure 3). The AutoML component uses MATLAB Parallel Server™ to accelerate this part of the process by training and assessing multiple models concurrently.
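The selection loop at the heart of such a component can be sketched in a few lines. The Python example below is a simplified stand-in, not the MATLAB implementation: it uses two toy candidate models, a single hypothetical feature, and a thread pool in place of a compute cluster. It omits hyperparameter optimization entirely and only shows the pattern of scoring candidates concurrently by cross-validation and keeping the best one.

```python
from concurrent.futures import ThreadPoolExecutor
from math import sqrt
from statistics import mean


def fit_mean(xs, ys):
    """Baseline candidate: always predict the mean training response."""
    m = mean(ys)
    return lambda x: m


def fit_linear(xs, ys):
    """Ordinary least squares for one feature: y = a + b * x."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    a = my - b * mx
    return lambda x: a + b * x


def cv_rmse(fit, xs, ys, folds=4):
    """K-fold cross-validated root-mean-square error for one candidate."""
    sq_errors = []
    for k in range(folds):
        train = [i for i in range(len(xs)) if i % folds != k]
        test = [i for i in range(len(xs)) if i % folds == k]
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        sq_errors += [(predict(xs[i]) - ys[i]) ** 2 for i in test]
    return sqrt(mean(sq_errors))


def auto_select(candidates, xs, ys):
    """Score all candidates concurrently and return (best_name, scores)."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(cv_rmse, fit, xs, ys)
                   for name, fit in candidates.items()}
        scores = {name: f.result() for name, f in futures.items()}
    return min(scores, key=scores.get), scores


# Toy data: SOH declining linearly with a single hypothetical feature.
xs = [float(i) for i in range(20)]
ys = [1.0 - 0.01 * x for x in xs]
best, scores = auto_select({"mean": fit_mean, "linear": fit_linear}, xs, ys)
```

On this toy data the linear candidate wins because the response is exactly linear in the feature. In a real setup each candidate would also carry its own hyperparameter search, and the scoring jobs would be distributed across workers rather than threads.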
When the AutoML process was completed, we deployed the optimal model into our on-premises production environment using MATLAB Production Server™.
Identifying and Addressing Data Drift
In many machine learning problems, there’s an implicit assumption that the data used for training a model fully represents the underlying distribution of the whole feature space. In other words, it is assumed that the distribution of the data does not change. In the real world this is not always the case. For instance, in our electric bus application, we may have trained our model with the assumption that the vehicles would be operating within a certain temperature range. In production, however, we find that the buses frequently operate in temperatures higher than that range. This change in the data is referred to as drift, and as drift increases, the accuracy of predictions made by the model tends to decrease. Therefore, it is often necessary for data scientists to detect and react to changes in the data that develop over time, typically by training new models.
Here, it is important to distinguish between concept drift and data drift. In machine learning, concept drift is defined as the change in the joint probability of observed features and labels or responses over time. It can be quite difficult to monitor concept drift for machine learning models in production because both feature values and response values need to be known. As a result, many organizations focus on the next-best option: data drift, or the change in the distribution of only the observed features and not the labels. This is the approach we took.
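The distinction can be written compactly. In the notation below, which is introduced here only for illustration, X denotes the observed features, y the labels or responses, and t_0 and t_1 two points in time.

```latex
% Concept drift: the joint distribution of features and responses changes over time.
P_{t_0}(X, y) \neq P_{t_1}(X, y)

% The joint distribution factors into a conditional part and a feature part.
P_t(X, y) = P_t(y \mid X)\, P_t(X)

% Data drift: only the feature distribution is compared, so no labels are required.
P_{t_0}(X) \neq P_{t_1}(X)
```

The factorization shows why data drift is the cheaper signal to monitor: P_t(X) can be estimated from incoming observations alone, whereas P_t(y | X) cannot be checked until responses are available.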
We developed a MATLAB application for drift detection that compares values in newly observed data with values in the model’s training set.
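The article does not specify which statistical test the drift detection application uses. As one possible approach, the Python sketch below compares each feature's training distribution with its observed distribution using the two-sample Kolmogorov-Smirnov statistic. The feature names, sample values, and the 0.3 threshold are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(reference, observed):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref, obs = sorted(reference), sorted(observed)
    gap = 0.0
    for v in ref + obs:
        cdf_ref = bisect_right(ref, v) / len(ref)
        cdf_obs = bisect_right(obs, v) / len(obs)
        gap = max(gap, abs(cdf_ref - cdf_obs))
    return gap


def detect_drift(training, observed, threshold=0.3):
    """Return {feature: statistic} for features whose statistic exceeds the threshold."""
    drifted = {}
    for name in training:
        stat = ks_statistic(training[name], observed[name])
        if stat > threshold:
            drifted[name] = stat
    return drifted


# Hypothetical data: ambient temperature shifts upward in production, current does not.
training = {
    "temperature_c": [float(t) for t in range(10, 30)],
    "current_a": [float(c) for c in range(0, 20)],
}
observed = {
    "temperature_c": [float(t) for t in range(25, 45)],
    "current_a": [float(c) for c in range(0, 20)],
}
drifted = detect_drift(training, observed)
```

Here only the temperature feature is flagged, mirroring the scenario in which buses operate in hotter conditions than the training data covered. A production detector would typically use a significance test rather than a fixed threshold on the raw statistic.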
In production, it reads the observed data in near real time from an Apache® Kafka stream and makes battery health predictions with a MATLAB function that processes the observations using our machine learning model (Figure 4). We developed this MATLAB function using the Streaming Data Framework for MATLAB Production Server, which enabled us to move easily from processing historical data in files to live data in Kafka streams. The framework processes streaming data via a series of iterations because the complete stream doesn’t fit into memory. Each iteration consists of four steps: read a batch of observations from the stream, load the model, make a prediction and write it to an output stream, and, if necessary, save any data required for the next iteration. The size of each batch spans a time interval long enough to ensure the extracted features capture sufficient battery characteristics for a valid SOH prediction.
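The four steps of an iteration can be sketched generically. The Python example below does not use the Streaming Data Framework or Kafka: the stream is an ordinary iterator, the output stream is a list, and the model and feature are hypothetical placeholders. It is meant only to show the shape of the loop.

```python
from statistics import mean


def load_model():
    """Hypothetical stand-in for loading the deployed model: SOH from mean voltage."""
    return lambda features: round(features["mean_voltage"] / 4.2, 3)


def process_stream(stream, batch_size, output, state=None):
    """Process a stream batch by batch, carrying state between iterations.

    A trailing partial batch is discarded in this simplified version.
    """
    state = state or {"batches_seen": 0}
    batch = []
    for observation in stream:              # 1. read observations into a batch
        batch.append(observation)
        if len(batch) < batch_size:
            continue
        model = load_model()                # 2. load the model
        features = {"mean_voltage": mean(o["voltage"] for o in batch)}
        output.append(model(features))      # 3. predict and write to the output stream
        state["batches_seen"] += 1          # 4. save data needed by the next iteration
        batch = []
    return state


# Six synthetic observations processed in two batches of three.
output = []
stream = ({"voltage": 4.2 - 0.01 * i} for i in range(6))
state = process_stream(stream, batch_size=3, output=output)
```

Because the loop only ever holds one batch in memory, the same function works whether the iterator is backed by a file of historical data or by a live source.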
Note that even if the drift detection application determines that there has been a significant change in the observed data, that does not necessarily mean the machine learning model is outdated. The application cannot determine if the model is outdated until it obtains response values (or labels) for the newly observed data, which it does by propagating the new data through the physics-based SOH model. At that point, the application can compare the response values from the physics-based model with the response values from the machine learning model; if the values vary significantly, it is time to invoke the AutoML component with the new data and automatically create a new machine learning model optimized for the data now coming in from the fleet.
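This two-stage decision can be expressed as a small function. In the Python sketch below, the physics-based labeler and the ML model are passed in as plain callables, and the error metric (RMSE) and the 0.05 tolerance are assumptions chosen for illustration; the article does not state how "vary significantly" is measured.

```python
from math import sqrt


def rmse(a, b):
    """Root-mean-square difference between two equal-length sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def should_retrain(drift_detected, observations, physics_soh, ml_predict, tolerance=0.05):
    """Retrain only if drift was flagged AND ML predictions disagree with physics labels."""
    if not drift_detected:
        return False
    labels = [physics_soh(o) for o in observations]       # automated labeling
    predictions = [ml_predict(o) for o in observations]   # current model's output
    return rmse(labels, predictions) > tolerance


# Hypothetical observations and models.
observations = [{"capacity_ah": 80.0}, {"capacity_ah": 82.0}, {"capacity_ah": 78.0}]
physics = lambda o: o["capacity_ah"] / 100.0
stale_model = lambda o: 0.95                              # still predicts a healthy battery
good_model = lambda o: o["capacity_ah"] / 100.0 + 0.01    # tracks the physics labels closely

retrain_stale = should_retrain(True, observations, physics, stale_model)
retrain_good = should_retrain(True, observations, physics, good_model)
retrain_no_drift = should_retrain(False, observations, physics, stale_model)
```

The stale model triggers retraining, while the model that still tracks the physics-based labels does not, even though drift was flagged in both cases. Skipping the labeling step when no drift is detected avoids running the slower simulation unnecessarily.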
It's fair to ask why we need a machine learning model to predict a battery’s state of health if we can estimate it via simulation in the first place. The answer is that the ML model can generate predictions in near real time—much faster than is possible via physics-based simulations.
A Scalable, Generalizable Architecture
We designed the architecture for automating ML Ops to be horizontally scalable. Both the prediction and the monitoring components run on MATLAB Production Server and the model training is done on MATLAB Parallel Server (Figure 5). The architecture is also generalizable. Although our example focused on predictive maintenance and drift detection for electric buses, the architecture can be easily adapted to other applications and use cases. For example, the physics-based Simulink model can be replaced with a numerical model developed in MATLAB. Likewise, many of the off-the-shelf components that we used, such as Apache Kafka for data streaming and Grafana for the dashboard framework, can be replaced with other cloud-native services.
The use of off-the-shelf components enabled us to focus on the architecture instead of implementation details, much like a fully automated ML Ops cycle enables data scientists to focus on designing machine learning solutions instead of managing IT administration details.