A Unified Data Analytics Framework for Time-Series Data - MATLAB

    A Unified Data Analytics Framework for Time-Series Data

    Sundeep Tuteja, PACCAR

    In the automotive industry today, vehicle data is collected and archived from a large, diverse group of vehicles and test cells for a variety of purposes. This data is usually time-series data with varying sampling rates. Often the teams collecting this data utilize different acquisition methods and file formats and have different reporting requirements. Furthermore, the data identifiers, units, and the sampling rates themselves are often subject to change. For this reason, a need for a unified data analytics platform that provides a common foundation that satisfies the requirements of multiple departments was identified. Dashboard2 is an in-house tool that provides a strict separation between the foundational code that processes the input data and the analysis code that generates interactive reports within MATLAB®. It also provides interfaces to remote data sources and quick query capabilities that allow engineers to access specific data sets of interest in a generic and efficient manner. MATLAB and Simulink® products such as Parallel Computing Toolbox™, MATLAB Compiler™, Database Toolbox™, Vehicle Network Toolbox™, as well as MATLAB tables, timetables, and built-in interfaces to Java, were instrumental in the development of Dashboard2. The same platform is designed to support interactive and automated data analysis on both Linux® and Windows®. Being able to frequently distribute updates and deliver a compiled, distributable solution allows engineers to work more efficiently and focus on data analysis function development rather than file processing code, with the assurance that there will be no need to rewrite the analysis function even if the data acquisition sources, signal units, or file formats change.

    Published: 15 May 2024

    All right. Morning, everybody. How's everybody doing today? Just a show of hands, I would like to understand what percentage of the audience has written simple MATLAB scripts for time series data analysis to generate some reports or something like that. It looks like there's a lot of you. That's excellent. All right. So I think a lot of you would like this talk a lot. I hope so.

    In the automotive industry today, we collect large amounts of time series data. It gets collected, archived, and over the years, this amount of data is only expected to grow. That has been the narrative of many of the talks until now.

    So today we're going to talk about a framework that we have developed in-house to ease some of the pains associated with large-scale time series data analysis. First a little bit about PACCAR. It was founded over 100 years ago, so we're definitely old-timers. And initially it was founded as the Seattle Car Manufacturing Company. It has undergone several name changes over the years, and today it's known as PACCAR.

    And today we are a global manufacturer of heavy duty trucks, diesel engines, transmissions, and we manufacture trucks under the Kenworth, Peterbilt and DAF nameplates, with over 31,000 employees worldwide.

    The PACCAR Technical Center, located in Mount Vernon, Washington, is a 375-acre facility. It's a premier research and testing facility where some of our newest technologies get developed, including hydrogen fuel cell vehicles and battery electric vehicles.

    We'll go over the agenda. We'll start by describing some data collection and analysis scenarios that we've encountered in our day-to-day activities. And we'll describe the challenges that are associated with that. And finally, we'll introduce the tool that we call Dashboard2. And I'll get into the naming a little bit later. This is the tool that we've developed to alleviate some of the challenges introduced here.

    We'll talk about some selected features of the tool. We won't go into too much depth about that. We'll talk about how we get Dashboard2 to work with data sources collected from different systems with different clocks. We will talk about how we work with remote data sources, automated report generation. And towards the end, we will go into some detail about the MathWorks tools and capabilities that we've used.

    We'll also mention a future capabilities section, where we'll talk about the planned features. And we'll describe the MathWorks tool ecosystem at PACCAR a little briefly as developed within our team. And then we'll go over acknowledgments.

    Starting with data collection and analysis scenarios. Continuous data collection is paramount. These days, we have to be able to track a product's performance over many years. And this data is often collected from test cells which accurately mimic a real vehicle in a controlled environment. There is data collected related to emissions, data collected related to vehicle performance, and several other kinds of time series data analysis are done as well.

    In addition, you have a scenario where engineers go out on test trips all across the country; they are sitting in the passenger seat of a vehicle and collecting data on their computers, using off-the-shelf tools like CANape and ATI Vision, to name two examples. We also have trucks that have data loggers installed in them; they travel all across the country and stream data to our servers on a regular basis.

    And finally, we also have data that gets generated from HIL simulation setups and Model-in-the-Loop simulation setups, just to name a few types of data that get collected. What's common in all these is that these are all essentially time series data. You always have timestamps, you always have signal names, you always have some units associated with it. But the formats vary widely.

    You can see here that we have talked about MAT files, CSV files, CANalyzer files, LabVIEW files, and quite a few others. And what we want to do is ensure that we don't have to collect data any more times than is necessary. We want to make sure that large amounts of data collected can serve multiple stakeholders all at once. And that was turning out to be a real challenge over the years.

    So to that end, first, let's talk about some of the other challenges as well. You have a situation, as described earlier, where you have multiple data formats and tool vendors. Now, some departments and some tools like to generate data as CSV files for their simplicity, because they're ASCII-based files, and there are more advanced formats out there like MDF4 files, which is an ASAM industry standard.

    You also have situations where you have data at different sampling rates. You have some data that gets sampled at one hertz and other data that can only be reasonably sampled at one kilohertz, just as an example. Heterogeneous data sets are another problem. Data comes in, as Nishant mentioned, with several impurities, so to speak. You have data with enumerations, numeric data, a mix of textual data and hexadecimal data. We've encountered many different types of data sets.

    And data collection imperfections are inevitable because of transient conditions on the bus. You have signal dropouts, you have situations where the nominal sampling rate is not matching the actual realistic sampling rate.

    You also have use cases where you want interactive on-demand analysis of this data. You collect the data and you want to analyze it right away to understand how it's doing and maybe do this in an iterative process. But you also want automated regular analysis to be done ideally with the same code base.

    And another big use case for us was the identification of data sets of interest, because when we collect huge amounts of data, we want to make sure that for a specific reporting action, we are able to get to exactly the data sets that we need to get to in order to make sure that we do this in an efficient way.

    To that end, let me introduce to you Dashboard2. It has a fairly simple interface in its normal workflow where you have an analysis function section on the left, you select the analysis function that you want to run, and it operates on the data that you want it to operate on.

    It is a global tool designed for engineers. It imports 16 data formats as of today, and it's designed with the goal of reusing analysis routines as much as possible. It makes use of configurable analysis parameters, like signal alias lists and units.

    And the reason we've done this is because we often encounter data that is similar enough that we could reuse the data-analysis routines, but that comes with different identifiers and different units. So we have a very streamlined way of handling that situation.

    We also make sure that analysis functions reside in a separate namespace to separate the duties of content development from foundational development, which allows for streamlining of some of this development.

    And the tool supports both interactive operation and command-line operation as a compiled application. And it supports both Windows and Linux. And finally, it includes utilities for format-independent data visualization, merging, cleaning, querying, and time alignment. And we will describe a few of these features as we proceed with the presentation.

    First, let's talk about the normal workflow for Dashboard2 and its stages, which will let us define some terminology in the context of Dashboard2. When we talk about loading a data file, data files are skimmed for high-level information such as identifier names, metadata, and the starting timestamp, and we instantiate an object for the file.

    But at this stage we don't actually load the signals, because the skimming is a comparatively very cheap operation and we want to ensure that we only load the signals that are needed for the report. That stage comes later, which we call the processing stage. The required signals are extracted and we associate a POSIX timestamp with every data point. And we'll talk about that a little later as well.

    If applicable to the report, we go through a merging step, where multiple files can be merged even if they have overlaps in them; it's resilient to that situation. Alternatively, if the report or the data volumes require it, we can do the analysis sequentially, that is, file by file.
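    As a rough illustration of the merging idea (this is not Dashboard2's actual code, and the signal names are hypothetical), two timetables covering overlapping time ranges can be combined in MATLAB with synchronize:

        % Minimal sketch: each reader function is assumed to have already returned
        % a timetable with datetime row times.
        tt1 = timetable(datetime(2024,1,1,0,0,(0:4))', (20:24)', ...
                        'VariableNames', {'CoolantTemp'});
        tt2 = timetable(datetime(2024,1,1,0,0,(2:6))', (1200:100:1600)', ...
                        'VariableNames', {'EngineSpeed'});

        % synchronize merges onto a common time vector even though the two time
        % ranges overlap; gaps are filled here by linear interpolation.
        merged = synchronize(tt1, tt2, 'union', 'linear');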

    And finally, the analysis and reporting step, which is basically the execution of the analysis logic. This is where we generate an interactive figure file report. We have other formats as well. But I will talk about the reporting format that we've developed internally later on.

    Let's talk about the data pipeline that Nishant mentioned a little while back. It's a very simplified look at one of our pipelines where we receive data consisting of MAT files from a fleet of vehicles running across the USA. And this data is voluminous.

    So, as soon as it enters, we have it cataloged in a special SQL Server ledger. This step is important because we then have the ability to query this ledger to point to the files that we need. We extract the header information from the files into a JSON format.

    And we have selected JSON because we want to be resilient to changes in the header format and it's an unstructured format. This is an automated operation and the tool has also been developed in MATLAB to do exactly this.
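    As a small hedged sketch of that step (the header fields below are hypothetical, not the real Dashboard2 schema), MATLAB's jsonencode and jsondecode make this straightforward:

        % Hypothetical header extracted from a data file.
        header.ChassisNumber  = 'C123456';
        header.StartPosixTime = 1715731200;
        header.SignalNames    = {'EngineSpeed', 'CoolantTemp'};

        % Serialize to JSON for the ledger; the schema can evolve without breaking
        % older entries because JSON does not impose a fixed structure.
        txt = jsonencode(header);

        % When querying later, decode the stored JSON back into a MATLAB struct.
        decoded = jsondecode(txt);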

    Finally, Dashboard2, hosted as a server instance or running as a local instance, can be used to talk to this ledger, and we have an alternate path where Dashboard2 can also talk to the fleet data directly from the network shared drive.

    And finally, the reporting step happens where we generate data in a variety of formats or send the data to another database for reporting through other tools as well. We do have a future supplemental path where we want to migrate some of these capabilities to the cloud, and we should be able to use AWS S3 and Azure cloud storage services and have Dashboard2 talk directly to those. And I understand the MathWorks has started taking some steps in that area, which we're definitely going to leverage as time goes by.

    Let's talk about working with disparate data sources. Now, as Nishant mentioned, we have situations sometimes where we are forced to utilize systems with different clocks. You have one computer collecting data and another computer collecting a different set of data, but during the same time. And we want to do a unified analysis of these.

    So what do we do? We have a tool here called the data alignment tool, where we start with the Reference Signal Selection. Now, this reference signal could be any signal in the data file. It could be a pair of identical signals, or strongly correlated signals that we know vary simultaneously in a predictable way.

    We can do a shift manually, or we can attempt an automatic alignment process which uses an optimization algorithm to calculate the best shift necessary in order to time align the two files. In this example, we have seen that the automatic alignment results in a shift of 3.7 seconds. And then this data is unified, merged, and can be used for further analysis. Also, this tool can be used in the interactive mode, as shown here, or in a batch-processing mode, as well, for hundreds of files at a time.
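    To give a flavor of what the automatic alignment computes (the production tool uses an optimization algorithm; this minimal sketch instead uses cross-correlation from Signal Processing Toolbox on synthetic signals, purely for illustration):

        fs = 10;                           % common sampling rate in Hz
        t  = (0:1/fs:60)';                 % 60 seconds of samples
        sigA = sin(2*pi*0.05*t);           % reference signal from file A
        sigB = sin(2*pi*0.05*(t - 3.7));   % same signal from file B, lagging by 3.7 s

        % The lag that maximizes the cross-correlation is the estimated shift.
        [c, lags] = xcorr(sigB - mean(sigB), sigA - mean(sigA));
        [~, idx] = max(c);
        shiftSeconds = lags(idx) / fs;     % approximately 3.7, matching the example

        % That shift can then be subtracted from file B's row times before merging.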

    Now working with remote data sources is an important utility for us because, as I mentioned, we collect terabytes of data. And over the years-- I did a calculation recently where, in about 10 years, we've collected about 30 terabytes of data.

    Now, how do we handle such a huge amount of data? We use the data file ledger querying utility described previously. Every data file that gets streamed from our data loggers gets cataloged into a ledger, which is a SQL Server database. The header gets extracted as JSON.

    In this tool, we start by selecting the vehicle chassis numbers, which are unique vehicle identifiers, and primary query parameters such as a starting timestamp, an ending timestamp, and wildcard searches. But the real meat of this tool is the header query section, which allows us to really fine-tune that query.

    So in this example we're able to filter out all the files that we need for a very specific model year and a very specific route ID that the vehicle took. And it returns the results in the window on the right, which we can then import into Dashboard2 directly.

    The same ledger that was described previously also enables us to do automated report generation. As I mentioned earlier, briefly, Dashboard2 supports both interactive operation and a command line interface on both Windows and Linux.

    So here we have an example of the command string that we use for Dashboard2, specifically on Windows. We specify an action for it, which is to run an automated report. We specify exactly which automated-report function is to be run, and depending on the build of Dashboard2, it will pick out a specific subset of analysis functions to run.

    We specify a start date vector and an end date vector. We can go right down to the millisecond if we need to. And we specify a list of chassis numbers of interest to make sure we only operate on the trucks that we want to get the report for.

    If we have variants associated with the signal alias lists, that's something we can specify here as well. It's an optional field. And finally, we just send an email out to a few people who are interested in the report, and this is something that we can run on a weekly basis as a Linux cron job or a Windows scheduled task, depending on the platform.
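    The arguments themselves get parsed inside MATLAB. As a heavily hedged sketch (the entry-point name and argument layout below are hypothetical, not the real Dashboard2 interface), a compiled application receives every command-line argument as a character vector:

        function dashboard2_cli(reportName, startVec, endVec, chassisList)
        % Hypothetical compiled entry point: all inputs arrive as char from the shell.
        startTime = datetime(str2num(startVec));   %#ok<ST2NM> e.g. '[2024 5 1 0 0 0]'
        endTime   = datetime(str2num(endVec));     %#ok<ST2NM>
        chassis   = strsplit(chassisList, ',');    % e.g. 'C123456,C234567'
        fprintf('Report %s: %d chassis, %s to %s\n', reportName, ...
                numel(chassis), char(startTime), char(endTime));
        end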

    Let's talk a little bit about the back end here. I mentioned briefly that we use POSIX timestamps to unambiguously represent every data point in a time series data file. The best analogy I can think of is that POSIX timestamps are the closest thing we have to the stardate from Star Trek for any Star Trek aficionados. They're a unique number, simple to query, very portable that unambiguously identifies any instant.

    And for those who don't know, it's just the number of seconds elapsed since the Unix epoch, which is defined as January 1st, 1970, midnight in the UTC time zone. So that's the best way to represent timestamps, and we've incorporated that very heavily in the use of our tool.
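    A quick illustration of the representation in MATLAB (not Dashboard2 code): posixtime and datetime convert back and forth between readable timestamps and the POSIX number that gets attached to every data point.

        % A zoned timestamp and its POSIX equivalent (seconds since 1970-01-01 UTC).
        t  = datetime(2024, 5, 15, 9, 30, 0, 'TimeZone', 'America/Los_Angeles');
        p  = posixtime(t);                                  % plain double, easy to store and query
        t2 = datetime(p, 'ConvertFrom', 'posixtime', 'TimeZone', 'UTC');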

    In addition, we don't want to load the entire file. No matter what the file format is, we want to ensure that only the specific signal set that we need for a report gets loaded. And for that reason we have developed a specific reader function for every data format that we support.

    In this situation, MATLAB's load and matfile functions, which many of you MATLAB users might already be familiar with, let us extract just the signals that we need from the MAT files that are recorded. MATLAB's newer capabilities include the readtable function, which provides similar capabilities for ASCII-based formats as well. And we also utilize parallel computing with up to 32 workers using the Parallel Computing Toolbox. It's a feature that can be scaled using a MATLAB parallel cluster sometime in the future.
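    As a minimal sketch of that selective loading (the file and variable names here are hypothetical):

        % matfile reads only the requested variables instead of the whole MAT file.
        m = matfile('vehicle_log.mat');                 % cheap: indexes the file
        engineSpeed = m.EngineSpeed;                    % loads just this signal

        % For ASCII-based formats, readtable can be limited to selected columns.
        opts = detectImportOptions('vehicle_log.csv');
        opts.SelectedVariableNames = {'Time', 'EngineSpeed'};
        T = readtable('vehicle_log.csv', opts);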

    I mentioned that reports are done in a variety of formats. We have reporting data that gets sent to SQL servers for reporting through other tools such as Tableau or Power BI, for example. But I really wanted to talk about one internal reporting format that we've developed here. It's based on the MATLAB figure file directly, but we've set it up as a standalone figure file with no dependencies at all, and we're able to have it serve as a container for multiple slides.

    The advantage of that is that it's got no dependencies for one, and all of MATLAB's interactive plotting and annotation capabilities are made available within this window. And as a result of this, we're also able to incorporate data traceability.

    So if we have to identify exactly which files this report came from, we're able to pick that out. If we have a scatter plot and we want to identify some outliers in the scatter plot and where they came from, we can identify exactly the file and the time in the file where they came from as well.

    Now we'll talk about some special MathWorks tools and capabilities that were utilized and were definitely helpful in the development of this tool. The Parallel Computing Toolbox is definitely the most important one; it speeds up processing tremendously.

    You can see that the trend lines here indicate almost a 10-fold increase in processing and loading speed depending on the number of cores. But we do see some diminishing returns as we continue to increase the number of cores. So that's a decision to be made, but it should be a scalable system with a parallel cluster in the mix.
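    As a sketch of how a parallel loop like that looks (the file names and the processing helper are hypothetical):

        % Process a batch of data files on a local pool with Parallel Computing Toolbox.
        files = {'log1.mat', 'log2.mat', 'log3.mat'};
        results = cell(size(files));

        if isempty(gcp('nocreate'))
            parpool('local', 4);              % the team currently scales to 32 workers
        end
        parfor k = 1:numel(files)
            % Each worker skims, processes, and analyzes one file independently.
            results{k} = processOneFile(files{k});   % hypothetical helper function
        end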

    The Database Toolbox serves as a useful feature for cross-platform connectivity to SQL Server. Since we do want to support both Windows and Linux, we decided to use JDBC drivers. We also plan to possibly use the Database Toolbox for querying Parquet files in the future using other tools. Apache Drill is one product that we came across that is able to directly query Parquet files with SQL, which is nice, and we will be looking into that too.
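    As a hedged sketch of the JDBC-based connection (the data source name, credentials, table, and columns are all hypothetical):

        % Database Toolbox connection to a configured JDBC data source for SQL Server.
        conn = database('FleetLedger', 'dbuser', 'dbpassword');
        sqlquery = ['SELECT FileName, StartPosixTime FROM DataFileLedger ' ...
                    'WHERE ChassisNumber = ''C123456'''];
        ledgerRows = fetch(conn, sqlquery);   % returns a MATLAB table of matching files
        close(conn);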

    We use MATLAB's interfaces to Java. Java provides the Java Cryptography Extension, which allows you to do encryption of credentials, and what's really nice is that MATLAB can call those interfaces directly. So to encrypt the credentials for our database connections, we use MATLAB's interfaces to Java as well.
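    A minimal sketch of calling the Java Cryptography Extension from MATLAB (key management is simplified here purely for illustration, and the credential is a placeholder):

        % Generate an AES key and encrypt a credential string via Java objects.
        keyGen = javax.crypto.KeyGenerator.getInstance('AES');
        key    = keyGen.generateKey();

        cipher = javax.crypto.Cipher.getInstance('AES');
        cipher.init(javax.crypto.Cipher.ENCRYPT_MODE, key);

        plainBytes  = unicode2native('my-db-password', 'UTF-8');  % uint8 bytes
        cipherBytes = cipher.doFinal(plainBytes);                  % Java byte[] result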

    We use Vehicle Network Toolbox to add support for MDF4 files and Vector CAN DBC files, two of the formats that Dashboard2 supports. And we have MATLAB tables and timetables that really form the basis of the abstraction layer for Dashboard2.

    So any time series format that we encounter, the reader function transforms it into a timetable, which can then be operated on from Dashboard2. That allows us to reuse the analysis functions without ever having to worry that, if the data source changes, we'll have to rewrite the function. We can reuse the same analysis function no matter what happens to the data source going forward.
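    To make that contract concrete (the signal names are hypothetical): whatever the source format, the reader hands the analysis code a timetable, so the analysis code itself never changes.

        % A reader for any format ultimately produces something like this.
        rowTimes = datetime(1715731200 + (0:4)', 'ConvertFrom', 'posixtime');
        tt = timetable(rowTimes, (1200:100:1600)', 'VariableNames', {'EngineSpeed'});

        % Analysis functions are written against the timetable, not the file format.
        avgSpeed = mean(tt.EngineSpeed, 'omitnan');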

    And finally, MATLAB Compiler compiles Dashboard2 targeted towards different stakeholders. So we have builds targeted for the OBD team, we have builds targeted for the performance team, and we have builds targeted for an ADAS development team that recently came on board. And we are able to do these releases on a regular basis. Today we're at a four-week release cycle, and we're able to automate those builds as well.
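    As a sketch of such a build (the entry-point file, the analysis-function folder, and the output name below are hypothetical), a stakeholder-specific standalone executable can be produced with mcc:

        % Compile a standalone Dashboard2 build for one team with MATLAB Compiler.
        mcc -m dashboard2_main.m -a analysisFunctions -o Dashboard2_OBD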

    Going into some of the future capabilities for the tool. This has been a topic that has been touched on in previous presentations, and I really like the idea that Lauren proposed, where you shouldn't have to change the document but change the printer. So we are trying to integrate Dashboard2 directly with cloud storage systems, such as AWS or Azure cloud storage, and reuse all the code that we've developed for it right now.

    Today, MATLAB does have some capabilities to load files from S3 storage areas, for example, and those capabilities are only being improved and we're likely to incorporate them as time goes by. And an activity that would go hand-in-hand would be the archiving of this data in a queryable form.
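    As a hedged sketch of what that kind of S3 loading can look like in MATLAB today (the region, bucket, and path below are hypothetical, and credentials are assumed to be set in the environment):

        % Credentials are typically supplied through environment variables.
        setenv('AWS_DEFAULT_REGION', 'us-west-2');

        % Many MATLAB I/O functions and datastores accept s3:// locations directly.
        ds = fileDatastore('s3://example-fleet-data/2024/*.mat', 'ReadFcn', @load);
        firstFile = read(ds);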

    So today we're able to query the headers, and we're able to query based on some specific, strict relational database fields. But we would also like to be able to query on the signals themselves, which would be a big improvement in efficiency. And we're considering the use of Parquet files or time series databases for this. And as MATLAB's capabilities improve, we will probably incorporate some of those capabilities too.

    Scaling with a MATLAB parallel cluster is another item on the roadmap for us, which will allow us to use more than 32 parallel workers for even greater increases in processing speed. Interfacing with Simulink-based tools for generating simulated data is on the cards too. So we would then be able to do report generation that potentially utilizes a Simulink model as part of the analysis logic.

    And finally, making reports available through a software-as-a-service model, where Dashboard2 would have its capabilities transferred to a web interface and used in the cloud. I would also like to talk a little bit about the MathWorks tool ecosystem at PACCAR, as developed specifically by the engineering solutions team here.

    We use MATLAB in a variety of activities for the development of all of our products, and some of them are software release tools to prepare memory images for programming ECUs. We have vehicle simulation tools that optimize calibration work similar to some of the tools that have been described here.

    And we also have in-house developed rapid controls prototyping tools and frameworks that utilize MATLAB Coder or Simulink Coder and Embedded Coder for production code generation for embedded controllers.

    All right. So with that, I would like to mention that this was definitely a team effort. I would want to have a special mention for Nishant Singh and Veronica Ma for inviting us, and our supervisors and colleagues and especially our IT Department, which, contrary to some of the suggestions before, we actually like a lot. All right. Well, thank you.
