Matlab Table / Dataset type optimization

Victor on 27 Jun 2017
Commented: dpb on 29 Jun 2017
I am looking for an optimized data type for an "observations-variables" table in MATLAB that can be accessed quickly and easily both by column (by variable) and by row (by observation).
Here is a comparison of the existing MATLAB data types:
  1. Matrix is very fast; however, it has no built-in labels/enumerations for its dimensions, and you can't always remember which variable name goes with which column index.
  2. Table has very bad performance, especially when reading individual rows/columns in a for loop (I suppose it runs some slow conversion methods, and is designed to be more Excel-like).
  3. Scalar structure (structure of column arrays): fast column-wise access to variables as vectors, but slow row-wise conversion to observations.
  4. Nonscalar structure (array of structures): fast row-wise access to observations as vectors, but slow column-wise conversion to variables.
I wonder whether there is a simpler, optimized version of the table data type that just combines row-number and column-name indexing, either for numeric variables only or for any variable type.
See the same question on Stack Overflow.
--
Results of test script:
----
TEST1 - reading individual observations
Matrix: 0.072519 sec
Table: 18.014 sec
Array of structures: 0.49896 sec
Structure of arrays: 4.3865 sec
----
TEST2 - reading individual variables
Matrix: 0.0047834 sec
Table: 0.0017972 sec
Array of structures: 2.2715 sec
Structure of arrays: 0.0010529 sec
Test script:
Nobs = 1e5; % number of observations (rows)
varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
Nvar = numel(varNames); % number of variables (columns)
M = randn(Nobs, Nvar); % matrix
T = array2table(M, 'VariableNames', varNames); % table
NS = struct; % nonscalar structure = array of structures
for i = 1:Nobs
    for v = 1:Nvar
        NS(i).(varNames{v}) = M(i,v);
    end
end
SS = struct; % scalar structure = structure of arrays
for v = 1:Nvar
    SS.(varNames{v}) = M(:,v);
end
%% TEST 1 - reading individual observations (row-wise)
disp('----'); disp('TEST1 - reading individual observations');
tic; % matrix
for i = 1:Nobs
    x = M(i,:);
end
disp(['Matrix: ', num2str(toc()), ' sec']);
tic; % table
for i = 1:Nobs
    x = T(i,:);
end
disp(['Table: ', num2str(toc()), ' sec']);
tic; % nonscalar structure = array of structures
for i = 1:Nobs
    x = NS(i);
end
disp(['Array of structures: ', num2str(toc()), ' sec']);
tic; % scalar structure = structure of arrays
for i = 1:Nobs
    for v = 1:Nvar
        x.(varNames{v}) = SS.(varNames{v})(i);
    end
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);
%% TEST 2 - reading individual variables (column-wise)
disp('----'); disp('TEST2 - reading individual variables');
tic; % matrix
for v = 1:Nvar
    x = M(:,v);
end
disp(['Matrix: ', num2str(toc()), ' sec']);
tic; % table
for v = 1:Nvar
    x = T.(varNames{v});
end
disp(['Table: ', num2str(toc()), ' sec']);
tic; % nonscalar structure = array of structures
for v = 1:Nvar
    for i = 1:Nobs
        x(i,1) = NS(i).(varNames{v});
    end
end
disp(['Array of structures: ', num2str(toc()), ' sec']);
tic; % scalar structure = structure of arrays
for v = 1:Nvar
    x = SS.(varNames{v});
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);

Answers (2)

dpb on 27 Jun 2017
"Matrix is very fast, [but] you can't always remember variable name by column index."
If speed is the goal, you can't beat native data types. It is unfortunate that the table class carries such a high performance hit; it is extremely handy in many ways but just isn't up to handling large datasets as of yet. For the moment, we can only hope TMW can/will improve the implementation.
For arrays, the identification problem can be improved, however--define index variables for mnemonic use...
M = randn(Nobs, Nvar); % matrix
date=1; % 1st column is date/time, say
accel=2; % 2nd is acceleration
...
plot(M(:,date),M(:,accel)) % plot acceleration vs time...
By row, consider using a categorical indexing vector that correlates with the row. It's also not too difficult to write searching logic; often these can be packaged as anonymous functions for specific types of searches. Just what works best for a given instance depends on the kind of search one needs.
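As a rough illustration of the row-label idea, the sketch below pairs a small matrix with a categorical vector of per-row labels and wraps one kind of search in an anonymous function. The data, the label values, and the names grp and rowsOf are made up for the example; they are not from the original question.

```matlab
% Hypothetical data: 6 observations, 2 variables, one group label per row.
M = [1 10; 2 20; 3 30; 4 40; 5 50; 6 60];
grp = categorical({'ctl'; 'trt'; 'ctl'; 'trt'; 'ctl'; 'trt'});  % row labels

% Anonymous function packaging one kind of search: all rows of a given group.
% It captures M and grp as they exist at the time of definition.
rowsOf = @(g) M(grp == g, :);

ctlRows = rowsOf('ctl');   % rows 1, 3 and 5 of M
```

The matrix stays a plain double array, so access remains fast; only the label vector carries the higher-level type.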
The main way to minimize performance issues is to remove the loop: select the data to operate on by logical addressing and use vector operations on that subset. With a table, using addressing modes that return the underlying data type rather than another table will also help.
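A minimal sketch of both points, assuming an all-numeric table like the one in the question (the sizes and the variable names keep, meanB, sub and subT are illustrative only):

```matlab
Nobs = 1e4;                                  % illustrative size
varNames = {'A','B','C'};
M = randn(Nobs, numel(varNames));
T = array2table(M, 'VariableNames', varNames);

% Logical addressing: select the rows once, then operate on the whole subset.
keep = T.A > 0;                 % dot indexing returns a plain double column
meanB = mean(T.B(keep));        % vector operation, no loop

% Brace indexing returns the underlying numeric array, not a table.
sub = T{keep, {'B','C'}};       % double matrix with nnz(keep) rows, 2 columns

% Parenthesis indexing, by contrast, returns another table.
subT = T(keep, {'B','C'});
```

Here T.B and T{...} hand back doubles, so everything downstream runs on native arrays; T(...) is the form that produces a new table on every call.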
In short, "there is no free lunch"; the ability to handle disparate data types in a higher-level abstraction is going to cost in additional overhead.
General timings like those above illustrate that the operations differ in cost, but they don't really get to the heart of the actual problem to be solved; a specific problem would be more useful to work from.
  3 Comments
dpb on 27 Jun 2017
It would help to see a specific dataset and how it's being acquired in order to write actual specific code, but it would seem that the above mnemonic naming scheme could be made general and dynamic for any particular collection setup. All it would need is the relationship of which column is which, and you would need that to make up the table column labels anyway.
I can't help but think there would also be ways to use the table more effectively than looping over it, or perhaps to make the loop occur at a higher level while the data itself is treated at the native level.
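One possible way to make the mnemonic naming scheme dynamic is to build the name-to-column lookup from the same list of labels that would otherwise become the table's variable names. The sketch below is only an illustration; the labels and the struct name col are hypothetical.

```matlab
varNames = {'date','accel','temp'};      % hypothetical column labels
M = randn(5, numel(varNames));

% Build a scalar struct whose fields hold the column numbers:
% col.date is 1, col.accel is 2, col.temp is 3.
col = cell2struct(num2cell(1:numel(varNames)), varNames, 2);

a = M(:, col.accel);                     % same as M(:, 2)
```

The lookup is built once from varNames, so changing the column layout only means changing that list; the data stays in a plain matrix.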
dpb on 28 Jun 2017
"I can't remember each parameter name of a particular dataset by its index - I only remember their meaningful names."
You do recognize, in implementing the aforementioned idea of named variables as stand-ins for column indices, that categorical variables are, underneath, simply integers, and that those values can be used as indices, right? The value/name correspondence is set when the categorical is created, so you can make any arbitrary matching by position as needed.
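A small sketch of that correspondence, assuming the integer codes are extracted explicitly with double before being used as subscripts (the labels and the helper name idx are hypothetical):

```matlab
varNames = {'date','accel','temp'};      % hypothetical labels, in column order
M = magic(4);
M = M(:, 1:3);                           % stand-in 4-by-3 data matrix

% Passing varNames as the value set fixes the category order,
% so that category code k corresponds to column k.
name = categorical(varNames, varNames);

% Look up the integer code for a label, then use it as a column index.
idx = @(s) double(name(name == s));
k = idx('accel');                        % code of 'accel'
accelCol = M(:, k);                      % same as M(:, 2)
```

Because the name/code pairing is fixed when the categorical is created, any ordering of labels can be mapped onto column positions this way.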



Peter Perkins on 29 Jun 2017
Edited: Peter Perkins on 29 Jun 2017
Victor, you are stating your problem as if you must always use either one or the other. As already discussed on a couple of other threads ( [1] [2] ) that you've contributed to, it is very often possible to use numeric matrices in time-critical portions of the code, but still get the subscripting benefits of tables outside of tight loops.
Vectorizing the operations is, of course, the first thing you should try, but that isn't always possible.
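For a loop that cannot be vectorized, the mixed approach might look roughly like the sketch below: extract the numeric data once, run the tight loop on the matrix, and go back to the table afterwards. The loop body is a placeholder computation, and the names iB and RowSum are invented for the example.

```matlab
varNames = {'A','B','C'};
T = array2table(magic(3), 'VariableNames', varNames);

% Pull the numeric data out once, before the time-critical loop.
M = T{:, :};                        % plain double matrix
iB = find(strcmp(varNames, 'B'));   % resolve the name to a column number once

rowSum = zeros(size(M, 1), 1);
for i = 1:size(M, 1)
    % The tight loop touches only the matrix, never the table.
    rowSum(i) = M(i, iB) + sum(M(i, :));
end

% Back outside the loop, store the result under a readable name.
T.RowSum = rowSum;
```

Name-based subscripting is paid for once per variable rather than once per iteration, which is where the table overhead in TEST1 accumulates.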
