Statistics and Machine Learning with Big Data Using Tall Arrays

This example shows how to perform statistical analysis and machine learning on out-of-memory data with MATLAB® and Statistics and Machine Learning Toolbox™.

Tall arrays and tables are designed for working with out-of-memory data. This type of data consists of a very large number of rows (observations) compared to a smaller number of columns (variables). Instead of writing specialized code that takes into account the huge size of the data, such as with MapReduce, tall arrays let you work with large data sets in a manner similar to in-memory MATLAB arrays. The fundamental difference is that tall arrays typically remain unevaluated until you request that the calculations be performed.
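
As a minimal sketch of this deferred evaluation, the following lines build a tall array from a small in-memory vector. The data and variable names are illustrative and are not part of the airline example. The mean is only computed when gather is called.

```matlab
% Convert a small in-memory column vector to a tall array (illustrative data).
x = tall((1:10)');

% mean returns another tall array; nothing has been computed yet.
m = mean(x);
stillTall = istall(m);   % true: m is an unevaluated tall array

% gather triggers the calculation and returns an in-memory value.
mVal = gather(m);
```

In practice a tall array is usually backed by a datastore rather than an in-memory vector, but the deferred behavior is the same.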

This example works with a subset of data on a single computer to develop a linear regression model, and then it scales up to analyze all of the data set. You can scale up this analysis even further to:

  • Work with data that cannot be read into memory

  • Work with data distributed across clusters using MATLAB Distributed Computing Server™

  • Integrate with big data systems like Hadoop® and Spark®

Introduction to Machine Learning with Tall Arrays

Several unsupervised and supervised learning algorithms in Statistics and Machine Learning Toolbox support tall arrays, so you can perform data mining and predictive modeling with out-of-memory data. These algorithms are adapted for out-of-memory data and can differ slightly from their in-memory counterparts. Capabilities include:

  • k-Means clustering

  • Linear regression

  • Generalized linear regression

  • Logistic regression

  • Discriminant analysis

The machine learning workflow for out-of-memory data in MATLAB is similar to in-memory data:

  1. Preprocess

  2. Explore

  3. Develop model

  4. Validate model

  5. Scale up to larger data

This example follows a similar structure in developing a predictive model for airline delays. The data includes a large file of airline flight information from 1987 through 2008. The goal of the example is to predict the departure delay based on a number of variables.

Details on the fundamental aspects of tall arrays are included in the example Analyze Big Data in MATLAB Using Tall Arrays (MATLAB). This example extends the analysis to include machine learning with tall arrays.

Create Tall Table of Airline Data

A datastore is a repository for collections of data that are too large to fit in memory. You can create a datastore from a number of different file formats as the first step to create a tall array from an external data source.

Create a datastore for the sample file airlinesmall.csv. Select the variables of interest, treat 'NA' values as missing data, and generate a preview table of the data.

ds = datastore(fullfile(matlabroot,'toolbox','matlab','demos','airlinesmall.csv'));
ds.SelectedVariableNames = {'Year','Month','DayofMonth','DayOfWeek',...
    'DepTime','ArrDelay','DepDelay','Distance'};
ds.TreatAsMissing = 'NA';
pre = preview(ds)
pre =

  8x8 table

    Year    Month    DayofMonth    DayOfWeek    DepTime    ArrDelay    DepDelay    Distance
    ____    _____    __________    _________    _______    ________    ________    ________

    1987     10          21            3          642          8          12         308   
    1987     10          26            1         1021          8           1         296   
    1987     10          23            5         2055         21          20         480   
    1987     10          23            5         1332         13          12         296   
    1987     10          22            4          629          4          -1         373   
    1987     10          28            3         1446         59          63         308   
    1987     10           8            4          928          3          -2         447   
    1987     10          10            6          859         11          -1         954   

Create a tall table backed by the datastore to facilitate working with the data. The underlying data type of a tall array depends on the type of datastore. In this case, the datastore is tabular text and returns a tall table. The display includes a preview of the data, with an indication (the M in the size) that the number of rows is not yet known.

tt = tall(ds)
Starting parallel pool (parpool) using the 'local' profile ...
connected to 12 workers.

tt =

  Mx8 tall table

    Year    Month    DayofMonth    DayOfWeek    DepTime    ArrDelay    DepDelay    Distance
    ____    _____    __________    _________    _______    ________    ________    ________

    1987     10          21            3          642          8          12         308   
    1987     10          26            1         1021          8           1         296   
    1987     10          23            5         2055         21          20         480   
    1987     10          23            5         1332         13          12         296   
    1987     10          22            4          629          4          -1         373   
    1987     10          28            3         1446         59          63         308   
    1987     10           8            4          928          3          -2         447   
    1987     10          10            6          859         11          -1         954   
     :        :          :             :           :          :           :           :
     :        :          :             :           :          :           :           :

Preprocess Data

This example aims to explore the time of day and day of week in more detail. Convert the day of week to categorical data with labels and determine the hour of day from the numeric departure time variable.

tt.DayOfWeek = categorical(tt.DayOfWeek,1:7,{'Sun','Mon','Tues',...
    'Wed','Thu','Fri','Sat'});
tt.Hr = discretize(tt.DepTime,0:100:2400,0:23)
tt =

  Mx9 tall table

    Year    Month    DayofMonth    DayOfWeek    DepTime    ArrDelay    DepDelay    Distance    Hr
    ____    _____    __________    _________    _______    ________    ________    ________    __

    1987     10          21          Tues         642          8          12         308        6
    1987     10          26          Sun         1021          8           1         296       10
    1987     10          23          Thu         2055         21          20         480       20
    1987     10          23          Thu         1332         13          12         296       13
    1987     10          22          Wed          629          4          -1         373        6
    1987     10          28          Tues        1446         59          63         308       14
    1987     10           8          Wed          928          3          -2         447        9
    1987     10          10          Fri          859         11          -1         954        8
     :        :          :             :           :          :           :           :        :
     :        :          :             :           :          :           :           :        :
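
To see how these two conversions behave, the following sketch applies the same calls to a few in-memory values. The first three values are taken from the preview rows, and the last DepTime value (2400) is an added edge case. The variable names dayNum, depTime, dayName, and hr are illustrative.

```matlab
% Day numbers and departure times (HHMM format) for a few observations.
dayNum  = [3; 1; 5; 6];
depTime = [642; 1021; 2055; 2400];

% Map day numbers 1..7 to labels, with 1 corresponding to 'Sun'.
dayName = categorical(dayNum,1:7,{'Sun','Mon','Tues','Wed','Thu','Fri','Sat'});

% Bin HHMM times into 24 hourly bins and return the hour (0..23) of each bin.
% Each bin includes its left edge; the last bin also includes the right edge,
% so 2400 falls into hour 23.
hr = discretize(depTime,0:100:2400,0:23);
```

Here hr is [6; 10; 20; 23] and dayName is [Tues; Sun; Thu; Fri], which matches the first rows of the tall table display.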

Include only data from the year 2000 onward and ignore rows with missing data. Identify the rows of interest with a logical condition.

idx = tt.Year >= 2000 & ...
    ~any(ismissing(tt),2);
tt = tt(idx,:);
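
The same row filter can be tried on a small in-memory table. This is a sketch with made-up values; the table T and the index keep are illustrative names.

```matlab
% Three illustrative rows: one before 2000, one with a missing delay, one valid.
T = table([1999; 2001; 2003],[5; NaN; 7],'VariableNames',{'Year','DepDelay'});

% Keep rows from 2000 onward that have no missing value in any variable.
% ismissing(T) returns a logical matrix; any(...,2) collapses it per row.
keep = T.Year >= 2000 & ~any(ismissing(T),2);
T = T(keep,:);
```

Only the 2003 row remains: the 1999 row fails the year condition and the 2001 row contains NaN.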

Explore Data by Group

A number of exploratory functions are available for tall arrays. For a list of supported statistics functions, see Tall Array Support, Usage Notes, and Limitations.

The grpstats function calculates grouped statistics of tall arrays. Explore the data by determining the centrality and spread of the data with summary statistics grouped by day of week. Also, explore the correlation between the departure delay and arrival delay.

g = grpstats(tt(:,{'ArrDelay','DepDelay','DayOfWeek'}),'DayOfWeek',...
    {'mean','std','skewness','kurtosis'})
g =

  Mx11 tall table

    GroupLabel    DayOfWeek    GroupCount    mean_ArrDelay    std_ArrDelay    skewness_ArrDelay    kurtosis_ArrDelay    mean_DepDelay    std_DepDelay    skewness_DepDelay    kurtosis_DepDelay
    __________    _________    __________    _____________    ____________    _________________    _________________    _____________    ____________    _________________    _________________

        ?             ?            ?               ?               ?                  ?                    ?                  ?               ?                  ?                    ?        
        ?             ?            ?               ?               ?                  ?                    ?                  ?               ?                  ?                    ?        
        ?             ?            ?               ?               ?                  ?                    ?                  ?               ?                  ?                    ?        
        :             :            :               :               :                  :                    :                  :               :                  :                    :
        :             :            :               :               :                  :                    :                  :               :                  :                    :

C = corr(tt.DepDelay,tt.ArrDelay)
C =

  MxNx... tall array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

These commands produce more tall arrays. The commands are not executed until you explicitly gather the results into the workspace. The gather command triggers execution and attempts to minimize the number of passes required through the data to perform the calculations. gather requires that the resulting variables fit into memory.

[statsByDay,C] = gather(g,C)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 17 sec
Evaluation completed in 39 sec

statsByDay =

  7x11 table

    GroupLabel    DayOfWeek    GroupCount    mean_ArrDelay    std_ArrDelay    skewness_ArrDelay    kurtosis_ArrDelay    mean_DepDelay    std_DepDelay    skewness_DepDelay    kurtosis_DepDelay
    __________    _________    __________    _____________    ____________    _________________    _________________    _____________    ____________    _________________    _________________

      'Wed'         Wed           8489          9.3324           37.406            5.1638               57.479                 10           33.426            6.4336               85.426      
      'Tues'        Tues          8381          6.4786           32.322             4.374               38.694             7.6083           28.394            5.2012               46.249      
      'Fri'         Fri           7339          4.1512             32.1             7.082               120.53             7.0857           29.339            8.9387               168.37      
      'Sat'         Sat           8045           7.132           33.108            3.6457               22.991             9.1557           29.731            4.5135               31.228      
      'Mon'         Mon           8443          5.2487           32.453            4.5811               37.175             6.8319           28.573            5.6468               50.271      
      'Sun'         Sun           8570          7.7515           36.003            5.7943                80.91             9.3324           32.516            7.2146               118.25      
      'Thu'         Thu           8601          10.053            36.18            4.1381               37.051             10.923           34.708            1.1414               138.38      


C =

    0.8966

The variables containing the results are now in-memory variables in the Workspace. The summary statistics show that the delays vary by day of week, and the correlation coefficient of about 0.90 indicates a strong relationship between departure and arrival delays that you can investigate further.
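
Passing several tall results to a single gather call, as done above, lets MATLAB evaluate them together. A minimal sketch with illustrative data:

```matlab
% Small tall array built from in-memory data (illustrative values).
x = tall([2; 4; 6; 8]);

% Both results are deferred tall arrays at this point.
lo = min(x);
hi = max(x);

% One gather call evaluates both results and returns in-memory values.
[loVal,hiVal] = gather(lo,hi);
```

Calling gather separately on lo and hi would give the same values but would trigger two evaluations instead of one.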

Explore the effect of day of week and hour of day and gain additional statistical information such as the standard error of the mean and the 95% confidence interval for the mean. You can pass the entire tall table and specify which variables to perform calculations on.

byDayHr = grpstats(tt,{'Hr','DayOfWeek'},...
    {'mean','sem','meanci'},'DataVar','DepDelay');
byDayHr = gather(byDayHr);
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 6 sec
Evaluation completed in 15 sec

Due to the data partitioning of the tall array, the output might be unordered. Rearrange the data in memory for further exploration.

x = unstack(byDayHr(:,{'Hr','DayOfWeek','mean_DepDelay'}),...
    'mean_DepDelay','DayOfWeek');
x = sortrows(x)
x =

  24x8 table

    Hr      Sun        Mon         Tues        Wed        Thu        Fri        Sat  
    __    _______    ________    ________    _______    _______    _______    _______

     0     38.519      71.914      39.656     34.667         90     25.536     65.579
     1     45.846      27.875        93.6     125.23     52.765     38.091     29.182
     2        NaN          39         102        NaN      78.25       -1.5        NaN
     3        NaN         NaN         NaN        NaN     -377.5       53.5        NaN
     4         -7     -6.2857          -7    -7.3333      -10.5         -5        NaN
     5    -2.2409     -3.7099     -4.0146    -3.9565    -3.5897    -3.5766    -4.1474
     6        0.4     -1.8909     -1.9802    -1.8304    -1.3578    0.84161    -2.2537
     7     3.4173    -0.47222    -0.18893    0.71546       0.08      1.069    -1.3221
     8     2.3759      1.4054      1.6745     2.2345     2.9668     1.6727    0.88213
     9     2.5325      1.6805      2.7656      2.683     5.6138     3.4838     2.5011
    10       6.37      5.2868      3.6822     7.5773     5.3372     6.9391     4.9979
    11     6.9946      4.9165      5.5639     5.5936     7.0435     4.8989     5.2839
    12      5.673      5.1193      5.7081     7.9178     7.5269     8.0625     7.4686
    13     8.0879      7.1017      5.0857     8.8082     8.2878     8.0675     6.2107
    14     9.5164      5.8343       7.416     9.5954     8.6667     6.0677      8.444
    15     8.1257      4.8802      7.4726     9.8674     10.235      7.167     8.6219
    16     12.302      7.4968      11.406     12.413     12.874     10.962     12.908
    17      11.47      8.9495      10.658     12.961     13.487     7.9034     8.9327
    18     15.148      13.849      11.266     15.406     16.706     11.022     13.042
    19      14.77      11.618      15.053     17.561     21.032     12.644     16.404
    20     17.711      13.942      17.105     22.382     25.945     11.223     22.152
    21     23.727      17.276      23.092     25.794     28.828     14.011     22.682
    22     29.383      24.949      28.265     30.649      37.38     24.328     36.272
    23     38.296      33.966      34.904     47.592     49.523       29.5     44.122
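
The reshaping step can be easier to follow on a tiny table. The sketch below uses four made-up rows in arbitrary order; the variable names mirror the ones above, but the values are illustrative.

```matlab
% Long-format table: one row per (Hr, DayOfWeek) pair, in arbitrary order.
Hr = [1; 0; 1; 0];
DayOfWeek = categorical({'Sun'; 'Sun'; 'Mon'; 'Mon'});
mean_DepDelay = [3; 1; 4; 2];
T = table(Hr,DayOfWeek,mean_DepDelay);

% Spread mean_DepDelay into one variable per day, then sort rows by Hr.
W = unstack(T,'mean_DepDelay','DayOfWeek');
W = sortrows(W);
```

The result W has one row per hour and one variable per day (Mon and Sun), so each entry is the mean delay for that hour and day.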

Visualize Data in Tall Arrays

Currently, you can visualize tall array data using histogram, histogram2, binScatterPlot, and ksdensity. The visualizations all trigger execution, similar to calling the gather function.

Use binScatterPlot to examine the relationship between the Hr and DepDelay variables.

binScatterPlot(tt.Hr,tt.DepDelay,'Gamma',0.25)
ylim([0 500])
xlabel('Time of Day')
ylabel('Delay (Minutes)')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 4 sec
Evaluation completed in 6 sec
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 3 sec
Evaluation completed in 4 sec

As noted in the output display, the visualizations often take two passes through the data: one to perform the binning, and one to perform the binned calculation and produce the visualization.

Split Data into Training and Validation Sets

To develop a machine learning model, it is useful to reserve part of the data to train and develop the model and another part of the data to test the model. A number of ways exist for you to split the data into training and validation sets.

Use datasample to obtain a random sampling of the data. Then use cvpartition to partition the data into test and training sets. To obtain nonstratified partitions, set a uniform grouping variable by multiplying the data samples by zero.

rng(1234)
data = datasample(tt,25000,'Replace',false);
groups = 0*data.DepDelay;
y = cvpartition(groups,'HoldOut',1/3);
dataTrain = data(training(y),:);
dataTest = data(test(y),:);
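
As a small in-memory sketch of the holdout split, the lines below partition 30 hypothetical observations using an all-zero grouping variable, so every observation belongs to the same group and the split is not stratified. The names n, c, isTrain, and isTest are illustrative.

```matlab
rng(1234)                  % make the random split reproducible
n = 30;                    % hypothetical number of observations
groups = zeros(n,1);       % single group: no stratification

% Hold out roughly one third of the observations for testing.
c = cvpartition(groups,'HoldOut',1/3);
isTrain = training(c);     % logical index of training observations
isTest  = test(c);         % logical index of test observations
```

Each observation lands in exactly one of the two sets, so isTrain and isTest are complementary logical vectors.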

Fit Supervised Learning Model

Build a model to predict the departure delay based on several variables. The linear regression model function fitlm behaves similarly to the in-memory function. However, calculations with tall arrays result in a CompactLinearModel, which is more efficient for large data sets. Model fitting triggers execution because it is an iterative process.

model = fitlm(dataTrain,'ResponseVar','DepDelay')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 3 sec
- Pass 2 of 2: Completed in 7 sec
Evaluation completed in 14 sec

model = 


Compact linear regression model:
    DepDelay ~ [Linear formula with 9 terms in 8 predictors]

Estimated Coefficients:
                      Estimate         SE         tStat        pValue  
                      _________    __________    ________    __________

    (Intercept)          26.988        100.35     0.26894       0.78798
    Year              -0.014925      0.050062    -0.29814        0.7656
    Month              0.071712      0.037433      1.9158      0.055414
    DayofMonth        0.0058979      0.014534     0.40578       0.68491
    DayOfWeek_Mon      -0.34962       0.46627    -0.74983       0.45337
    DayOfWeek_Tues     -0.78282       0.46688     -1.6767      0.093617
    DayOfWeek_Wed      -0.42015       0.46721    -0.89928       0.36851
    DayOfWeek_Thu      -0.41237       0.46733    -0.88239       0.37758
    DayOfWeek_Fri       0.64029       0.48496      1.3203       0.18676
    DayOfWeek_Sat       0.12523       0.47531     0.26347       0.79219
    DepTime            0.012629     0.0071762      1.7599      0.078449
    ArrDelay            0.80052     0.0036849      217.24             0
    Distance          0.0011485    0.00022293      5.1516    2.6117e-07
    Hr                 -0.94526       0.71641     -1.3194       0.18704


Number of observations: 16667, Error degrees of freedom: 16653
Root Mean Squared Error: 16.4
R-squared: 0.751,  Adjusted R-Squared 0.75
F-statistic vs. constant model: 3.86e+03, p-value = 0

Predict and Validate the Model

The display indicates fit information, as well as coefficients and associated coefficient statistics.

The model variable contains information about the fitted model as properties, which you can access using dot notation. Alternatively, double-click the variable in the Workspace to explore the properties interactively.

model.Rsquared
ans = 

  struct with fields:

    Ordinary: 0.7506
    Adjusted: 0.7504

Predict new values based on the model, calculate the residuals, and visualize using a histogram. The predict function predicts new values for both tall and in-memory data.

pred = predict(model,dataTest);
err = pred - dataTest.DepDelay;
figure
histogram(err,'BinLimits',[-100 100],'Normalization','pdf')
title('Histogram of Residuals')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 6 sec
- Pass 2 of 2: Completed in 4 sec
Evaluation completed in 13 sec
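
Besides the histogram, a single summary number such as the root mean squared error can be computed from the residuals. The following sketch uses small illustrative vectors rather than the model output; the names actual, pred, and rmse are assumptions for this example.

```matlab
% Illustrative observed and predicted delays, stored as tall arrays.
actual = tall([10; 0; -5; 20]);
pred   = tall([12; -1; -5; 16]);

% Residuals, using the same sign convention as above (predicted minus observed).
err = pred - actual;

% Root mean squared error; the expression stays deferred until gather.
rmse = gather(sqrt(mean(err.^2)));
```

For these values the squared residuals are 4, 1, 0, and 16, so rmse equals sqrt(5.25), or about 2.29.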

Assess and Adjust Model

The p-values in the display suggest that some variables might be unnecessary in the model. You can reduce the complexity of the model by removing these variables.

Examine the significance of the variables in the model more closely using anova.

a = anova(model)
a =

  9x5 table

                    SumSq        DF        MeanSq         F          pValue  
                  __________    _____    __________    ________    __________

    Year              23.992        1        23.992    0.088888        0.7656
    Month             990.61        1        990.61      3.6701      0.055414
    DayofMonth        44.444        1        44.444     0.16466       0.68491
    DayOfWeek           2939        6        489.83      1.8148      0.091953
    DepTime           835.95        1        835.95      3.0971      0.078449
    ArrDelay      1.2738e+07        1    1.2738e+07       47194             0
    Distance          7163.3        1        7163.3      26.539    2.6117e-07
    Hr                 469.9        1         469.9      1.7409       0.18704
    Error         4.4949e+06    16653        269.91                          

Based on the p-values, the variables Year, Month, and DayofMonth are not significant to this model, so you can remove them without negatively affecting the model quality.

To explore these model parameters further, use interactive visualizations such as plotSlice, plotInteraction, and plotEffects. For example, use plotEffects to examine the estimated effect that each predictor variable has on the departure delay.

plotEffects(model)

Based on these calculations, ArrDelay is the main effect in the model (it is highly correlated with DepDelay). The other effects are observable, but have much less impact. In addition, Hr was determined from DepTime, so only one of these variables is necessary to the model.

Reduce the number of variables to exclude all date components, and then fit a new model.

model2 = fitlm(dataTrain,'DepDelay ~ DepTime + ArrDelay + Distance')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 3 sec
- Pass 2 of 2: Completed in 4 sec
Evaluation completed in 9 sec

model2 = 


Compact linear regression model:
    DepDelay ~ 1 + DepTime + ArrDelay + Distance

Estimated Coefficients:
                   Estimate         SE         tStat       pValue  
                   _________    __________    _______    __________

    (Intercept)      -2.2803       0.41773    -5.4589    4.8595e-08
    DepTime        0.0031695    0.00027026     11.727    1.2288e-31
    ArrDelay         0.80005     0.0036769     217.59             0
    Distance       0.0011678    0.00022284     5.2405    1.6211e-07


Number of observations: 16667, Error degrees of freedom: 16663
Root Mean Squared Error: 16.4
R-squared: 0.75,  Adjusted R-Squared 0.75
F-statistic vs. constant model: 1.67e+04, p-value = 0
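
The effect of dropping an uninformative predictor can be reproduced on synthetic in-memory data. In this sketch the response y depends only on x1, while x2 is pure noise; all names and values are made up for illustration.

```matlab
rng(1)                                 % reproducible synthetic data
n = 200;
T = table(randn(n,1),randn(n,1),'VariableNames',{'x1','x2'});
T.y = 1 + 2*T.x1 + 0.1*randn(n,1);     % x2 has no effect on y

% Fit a full and a reduced model using Wilkinson formula notation.
full    = fitlm(T,'y ~ x1 + x2');
reduced = fitlm(T,'y ~ x1');

% Compare adjusted R-squared values of the two fits.
adjR2 = [full.Rsquared.Adjusted, reduced.Rsquared.Adjusted];
```

The two adjusted R-squared values should be nearly identical, which mirrors the comparison between model and model2 above: removing predictors that carry no information leaves the fit quality essentially unchanged.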

Model Development

Even with the model simplified, it can be useful to further adjust the relationships between the variables and include specific interactions. To experiment further, repeat this workflow with smaller tall arrays. For performance while tuning the model, you can consider working with a small extraction of in-memory data before scaling up to the entire tall array.

In this example, you can use functionality like stepwise regression, which is suited for iterative, in-memory model development. After tuning the model, you can scale up to use tall arrays.

Gather a subset of the data into the workspace and use stepwiselm to iteratively develop the model in memory.

subset = gather(dataTest);
sModel = stepwiselm(subset,'ResponseVar','DepDelay')
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 1: Completed in 3 sec
Evaluation completed in 3 sec
1. Adding ArrDelay, FStat = 43238.1549, pValue = 0
2. Adding DepTime, FStat = 40.6008, pValue = 1.96645e-10
3. Adding Distance, FStat = 5.1848, pValue = 0.022811
4. Adding ArrDelay:Distance, FStat = 367.0233, pValue = 4.324243e-80
5. Adding DepTime:ArrDelay, FStat = 16.0308, pValue = 6.2863e-05
6. Adding Year, FStat = 5.0322, pValue = 0.024907
7. Adding Year:ArrDelay, FStat = 232.8207, pValue = 7.247845e-52
8. Adding DayOfWeek, FStat = 2.5043, pValue = 0.020135
9. Adding DayOfWeek:ArrDelay, FStat = 22.7281, pValue = 9.85142e-27

sModel = 


Linear regression model:
    DepDelay ~ [Linear formula with 10 terms in 5 predictors]

Estimated Coefficients:
                                Estimate          SE         tStat       pValue  
                               ___________    __________    _______    __________

    (Intercept)                     103.33         108.6     0.9514       0.34143
    Year                         -0.052245      0.054185    -0.9642       0.33497
    DayOfWeek_Mon                  0.66392       0.49926     1.3298       0.18361
    DayOfWeek_Tues                 0.50045       0.50619    0.98866       0.32286
    DayOfWeek_Wed                  0.52661       0.50462     1.0436       0.29671
    DayOfWeek_Thu                   0.9681       0.50556     1.9149      0.055536
    DayOfWeek_Fri                   2.1484       0.52563     4.0873     4.405e-05
    DayOfWeek_Sat                   1.0865       0.51532     2.1084      0.035025
    DepTime                      0.0017251    0.00028848       5.98    2.3241e-09
    ArrDelay                       -41.288        3.1205    -13.231    1.4546e-39
    Distance                     0.0012633    0.00024229     5.2141    1.8917e-07
    Year:ArrDelay                 0.021044     0.0015553     13.531    2.8068e-41
    DayOfWeek_Mon:ArrDelay        -0.12034       0.01326    -9.0751    1.3978e-19
    DayOfWeek_Tues:ArrDelay      -0.081098      0.013169    -6.1582     7.698e-10
    DayOfWeek_Wed:ArrDelay       -0.082887      0.013064    -6.3448    2.3435e-10
    DayOfWeek_Thu:ArrDelay        -0.13943      0.013959    -9.9884    2.3185e-23
    DayOfWeek_Fri:ArrDelay       -0.095706      0.016178    -5.9159    3.4321e-09
    DayOfWeek_Sat:ArrDelay        -0.10735      0.014277    -7.5188     6.105e-14
    DepTime:ArrDelay            5.7482e-05    6.9455e-06     8.2762    1.4695e-16
    ArrDelay:Distance          -0.00012842    7.5138e-06    -17.092    2.1405e-64


Number of observations: 8333, Error degrees of freedom: 8313
Root Mean Squared Error: 12.2
R-squared: 0.853,  Adjusted R-Squared 0.853
F-statistic vs. constant model: 2.55e+03, p-value = 0

The model that results from the stepwise fit includes interaction terms.

Now try to fit a model for the tall data by using fitlm with the formula returned by stepwiselm.

model3 = fitlm(dataTrain,sModel.Formula)
Evaluating tall expression using the Parallel Pool 'local':
- Pass 1 of 2: Completed in 2 sec
- Pass 2 of 2: Completed in 4 sec
Evaluation completed in 8 sec

model3 = 


Compact linear regression model:
    DepDelay ~ [Linear formula with 10 terms in 5 predictors]

Estimated Coefficients:
                                Estimate          SE         tStat        pValue   
                               ___________    __________    ________    ___________

    (Intercept)                     86.447        98.829     0.87472        0.38174
    Year                         -0.043631      0.049301    -0.88499        0.37617
    DayOfWeek_Mon                 -0.16864       0.45596    -0.36986        0.71149
    DayOfWeek_Tues                -0.59953       0.45856     -1.3074        0.19109
    DayOfWeek_Wed                 -0.88004       0.46063     -1.9105       0.056086
    DayOfWeek_Thu                   1.0626        0.4631      2.2946       0.021766
    DayOfWeek_Fri                  0.52349       0.47222      1.1086        0.26762
    DayOfWeek_Sat                   0.1349       0.46679     0.28899         0.7726
    DepTime                      0.0017795    0.00026504      6.7142     1.9527e-11
    ArrDelay                       -4.0773        2.5797     -1.5806          0.114
    Distance                     0.0016412    0.00021801      7.5278     5.4221e-14
    Year:ArrDelay                0.0023229     0.0012863      1.8058       0.070969
    DayOfWeek_Mon:ArrDelay       -0.027851      0.013428     -2.0741       0.038081
    DayOfWeek_Tues:ArrDelay      -0.042877      0.013492     -3.1779      0.0014863
    DayOfWeek_Wed:ArrDelay        0.027284      0.011961      2.2811       0.022552
    DayOfWeek_Thu:ArrDelay        -0.16448      0.012479      -13.18     1.8078e-39
    DayOfWeek_Fri:ArrDelay        0.092599      0.013508      6.8552     7.3731e-12
    DayOfWeek_Sat:ArrDelay       -0.037005      0.013386     -2.7643      0.0057102
    DepTime:ArrDelay            0.00018955    6.4255e-06        29.5    1.7661e-186
    ArrDelay:Distance          -6.0792e-05    5.6146e-06     -10.828     3.1395e-27


Number of observations: 16667, Error degrees of freedom: 16647
Root Mean Squared Error: 15.8
R-squared: 0.77,  Adjusted R-Squared 0.77
F-statistic vs. constant model: 2.93e+03, p-value = 0

You can repeat this process to continue to adjust the linear model. However, in this case, you should explore different types of regression that might be more appropriate for this data. For example, if you do not want to include the arrival delay, then this type of linear model is no longer appropriate. See Logistic Regression with Tall Arrays for more information.

Scale to Spark

A key capability of tall arrays in MATLAB and Statistics and Machine Learning Toolbox is the connectivity to platforms such as Hadoop and Spark. You can even compile the code and run it on Spark using MATLAB Compiler™. See Extend Tall Arrays with Other Products (MATLAB) for more information about using these products:

  • Database Toolbox™

  • Parallel Computing Toolbox™

  • MATLAB® Distributed Computing Server™

  • MATLAB Compiler™
