Matlab Table / Dataset type optimization

Question

Victor on 27 Jun 2017


Direct link to this question

https://se.mathworks.com/matlabcentral/answers/346434-matlab-table-dataset-type-optimization

Commented: dpb on 29 Jun 2017

I am looking for an optimized data type for an "observations-variables" table in MATLAB that can be accessed quickly and easily both by column (variables) and by row (observations).

Here is a comparison of the existing MATLAB data types:

Matrix: very fast; however, it has no built-in labels or enumerations for indexing its dimensions, and you can't always remember which variable name goes with which column index.
Table: very poor performance, especially when reading individual rows/columns in a for loop (I suppose it runs some slow conversion methods and is designed to be more Excel-like).
Scalar structure (structure of column arrays): fast column-wise access to variables as vectors, but slow row-wise conversion to observations.
Nonscalar structure (array of structures): fast row-wise access to observations as vectors, but slow column-wise conversion to variables.

I wonder whether there is a simpler, optimized version of the table data type that just combines row-number and column-variable indexing, either for numeric variables only or for any variable type.

See the same question on Stack Overflow.

--

Results of test script:

----
TEST1 - reading individual observations
Matrix: 0.072519 sec
Table: 18.014 sec
Array of structures: 0.49896 sec
Structure of arrays: 4.3865 sec
----
TEST2 - reading individual variables
Matrix: 0.0047834 sec
Table: 0.0017972 sec
Array of structures: 2.2715 sec
Structure of arrays: 0.0010529 sec

Test script:

Nobs = 1e5; % number of observations (rows)
varNames = {'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O'};
Nvar = numel(varNames); % number of variables (columns)
M = randn(Nobs, Nvar); % matrix
T = array2table(M, 'VariableNames', varNames); % table
NS = struct; % nonscalar structure = array of structures
for i=1:Nobs
    for v=1:Nvar
        NS(i).(varNames{v}) = M(i,v);
    end
end
SS = struct; % scalar structure = structure of arrays
for v=1:Nvar
    SS.(varNames{v}) = M(:,v);
end
%% TEST 1 - reading individual observations (row-wise)
disp('----'); disp('TEST1 - reading individual observations');
tic; % matrix
for i=1:Nobs
    x = M(i,:);
end
disp(['Matrix: ', num2str(toc()), ' sec']);
tic; % table
for i=1:Nobs
    x = T(i,:);
end
disp(['Table: ', num2str(toc()), ' sec']);
tic; % nonscalar structure = array of structures
for i=1:Nobs
    x = NS(i);
end
disp(['Array of structures: ', num2str(toc()), ' sec']);
tic; % scalar structure = structure of arrays
for i=1:Nobs
    for v=1:Nvar
        x.(varNames{v}) = SS.(varNames{v})(i);
    end
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);
%% TEST 2 - reading individual variables (column-wise)
disp('----'); disp('TEST2 - reading individual variables');
tic; % matrix
for v=1:Nvar
    x = M(:,v);
end
disp(['Matrix: ', num2str(toc()), ' sec']);
tic; % table
for v=1:Nvar
    x = T.(varNames{v});
end
disp(['Table: ', num2str(toc()), ' sec']);
x = zeros(Nobs,1); % preallocate so the loop does not rely on x left over from the table test
tic; % nonscalar structure = array of structures
for v=1:Nvar
    for i=1:Nobs
        x(i,1) = NS(i).(varNames{v});
    end
end
disp(['Array of structures: ', num2str(toc()), ' sec']);
tic; % scalar structure = structure of arrays
for v=1:Nvar
    x = SS.(varNames{v});
end
disp(['Structure of arrays: ', num2str(toc()), ' sec']);


Answer 1

dpb on 27 Jun 2017


Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/346434-matlab-table-dataset-type-optimization#answer_272106


"Matrix is very fast, [but] you can't always remember variable name by column index."

If speed is the goal, you can't beat native data types. It is unfortunate that the table data class carries such a high performance penalty; it is extremely handy in many ways but just isn't up to handling large datasets as of yet. For the moment, we can only hope that TMW (The MathWorks) can and will improve the implementation.

For arrays, however, the identification problem can be eased: define index variables to use as mnemonics...

M = randn(Nobs, Nvar); % matrix
date=1;                % 1st column is date/time datetime, say
accel=2;               % 2nd is acceleration
...
plot(M(:,date),M(:,accel)) % plot acceleration vs time...

By row, consider using a categorical indexing vector that correlates with the row. It's also not too difficult to write searching logic; often these can be packaged as anonymous functions for specific types of searches. Just what works best for a given instance depends on the kind of search one needs.
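A minimal sketch of the row-labelling idea above, assuming a hypothetical categorical vector `rowLabel` with one label per observation; the data, the label names and the helper `rowsOf` are invented for illustration and are not from the thread.

```matlab
% Hypothetical data: 6 observations of 3 variables.
M = magic(6);
M = M(:, 1:3);

% Assumed row labels, one per observation (names are illustrative only).
rowLabel = categorical({'run1'; 'run1'; 'run2'; 'run2'; 'run3'; 'run3'});

% Anonymous function packaging one kind of search: all rows with a given label.
% Note that it captures M and rowLabel as they are at this point.
rowsOf = @(name) M(rowLabel == name, :);

% Logical addressing selects the subset without an explicit loop.
subset = rowsOf('run2');          % rows 3 and 4 of M
colMeans = mean(subset, 1);       % vector operation on the selected subset
```

Because the anonymous function captures copies of `M` and `rowLabel` when it is created, it would need to be redefined if the data changes afterwards.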

The main thing to do to minimize performance issues is to remove the loop: select the data to operate on by logical addressing and use vector operations on that subset. With a table, it also helps to use addressing modes that return the underlying data type rather than another table.
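As an illustration of those addressing modes, here is a small sketch; the table size and the variable names `A`, `B`, `C` are assumptions chosen to mirror the question's test script, and the computation is only a placeholder.

```matlab
% Small table built the same way as in the question's test script.
M = randn(1000, 3);
T = array2table(M, 'VariableNames', {'A', 'B', 'C'});

% Parentheses return another table (the slow case measured in TEST1).
rowAsTable = T(5, :);             % 1-by-3 table

% Braces and dot indexing return the underlying numeric data.
rowAsArray = T{5, :};             % 1-by-3 double
colAsArray = T.B;                 % 1000-by-1 double

% Logical addressing plus a vector operation instead of a row loop:
% mean of C over the observations where A is positive.
sel = T.A > 0;
meanC = mean(T.C(sel));
```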

In short, "there is no free lunch"; the ability to handle disparate data types in a higher-level abstraction is going to cost in additional overhead.

General timings like those above, while they illustrate that these operations differ in cost, don't really get to the heart of the actual problem to be solved; a description of the specific problem you are trying to solve would be more useful.

3 Comments

dpb on 27 Jun 2017

It would help to see a specific dataset and how it's being acquired in order to write actual specific code, but it would seem that the above mnemonic naming scheme could be made general and dynamic for any particular collection setup. All it would need is the mapping of which column is which, and you would need that to make up the table column labels anyway.
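One way such a dynamic scheme might look is sketched below, assuming the names are already available in a cell array as in the question's `varNames`; the struct name `col` is hypothetical.

```matlab
% Variable names as in the question's test script (shortened here).
varNames = {'A', 'B', 'C', 'D'};
Nvar = numel(varNames);
M = randn(10, Nvar);

% Build a scalar struct whose fields hold the column numbers:
% col.A is 1, col.B is 2, and so on.
col = cell2struct(num2cell(1:Nvar), varNames, 2);

% Address the plain matrix by name, by column and by row.
b   = M(:, col.B);                % variable B, all observations
obs = M(3, [col.A, col.D]);       % observation 3, variables A and D
```

The struct is built once from the same list that would label the table columns, so the data itself stays a native matrix.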

I can't help but think there would also be ways to use the table more efficiently than by looping, or perhaps to make the loop occur at a higher level while the data itself is handled at the native level.

dpb on 28 Jun 2017

" I can't remember each parameter name of a particular dataset by its index - I only remember their meaningful names."

You do recognize, in implementing the aforementioned idea of named variables as stand-ins for column indices, that categorical variables are simply integers underneath and that their values can be used as indices, right? The value/name correspondence is set when the categorical is created, so any arbitrary matching by position can be made as needed.
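A short sketch of that correspondence, assuming an illustrative column layout (`date`, `accel`, `temp`); it converts explicitly with `double()` to obtain the integer code rather than relying on the categorical value being accepted directly as a subscript.

```matlab
% Assumed column layout; the names are illustrative only.
colNames = {'date', 'accel', 'temp'};
M = randn(8, numel(colNames));

% Passing the value set explicitly fixes the name-to-code correspondence
% to the column order (otherwise the categories come from the sorted
% unique values of the data).
c = categorical({'accel'}, colNames);

% double() exposes the underlying integer code, usable as a column index.
idx = double(c);                  % 2, because 'accel' is the 2nd category
accelData = M(:, idx);
```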


Answer 2

Peter Perkins on 29 Jun 2017


Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/346434-matlab-table-dataset-type-optimization#answer_272399

Edited: Peter Perkins on 29 Jun 2017

Victor, you are stating your problem as if you must always use either one or the other. As already discussed on a couple of other threads ( [1] [2] ) that you've contributed to, it is very often possible to use numeric matrices in time-critical portions of the code, but still get the subscripting benefits of tables outside of tight loops.

Vectorizing the operations is, of course, the first thing you should try, but that isn't always possible.
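A rough sketch of that mixed approach follows; the table contents, the variable names and the running-sum loop are placeholders standing in for whatever the time-critical code actually does.

```matlab
% Table with named variables, as in the question.
T = array2table(randn(1000, 3), 'VariableNames', {'A', 'B', 'C'});

% Pull the numeric data out once, before the time-critical loop.
M = T{:, :};                      % 1000-by-3 double
nObs = size(M, 1);

% The tight loop works on the plain matrix. The running sum is only a
% placeholder computation and could of course be vectorized.
acc = zeros(1, size(M, 2));
for i = 1:nObs
    acc = acc + M(i, :);
end

% Back outside the loop, return to named access.
result = array2table(acc, 'VariableNames', T.Properties.VariableNames);
totalB = result.B;
```

The conversion cost is paid once on each side of the loop instead of on every iteration.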

1 Comment

dpb on 29 Jun 2017

Good to emphasize that explicitly, Peter!
