Performance of table data type
Hello!
Is it normal that writing into a table data structure is 1000 times slower than writing into a cell array of the same size? And that reading is 50 times slower?
Try the following code:
%Test:
tic;
A = cell(10000, 50);
'Time for initializing cell array:'
toc
tic;
B = cell2table(A);
'Time for initializing table:'
toc
i = 0; % create variable
tic;
for i = 1 : 2500
A{i, 7} = 'aaa';
end
'Time for writing into cell array:'
toc
tic;
for i = 1 : 2500
B{i, 7} = {'aaa'};
end
'Time for writing into table:'
toc
x = ''; % create variable
tic;
for i = 1 : 2500
x = A{i, 7};
end
'Time for reading from cell array:'
toc
tic;
for i = 1 : 2500
x = B{i, 7};
end
'Time for reading from table:'
toc
2 Comments
Oleg Komarov
on 30 Nov 2016
Edited: Oleg Komarov
on 1 Dec 2016
While tables do have performance issues, this example is particularly pathological.
The initialization of a table with an array of empty cells is problematic. The following initialization is much faster:
tic;
A = repmat({''},1e4,50);
'Time for initializing cell array:'
toc
Also, named reference is preferred to curly brackets, i.e. B.A7(i) instead of B{i,7}.
Answers (6)
Peter Perkins
on 30 Oct 2014
Michael, table is currently not as fast as datatypes like double and cell when you are reading or writing individual values in a long loop. However, it's often possible to vectorize your code and read or write entire variables, at which point you probably won't notice a speed difference. You may also find that
B.Var7{i} = 'aaa'
is faster than
B{i, 7} = {'aaa'}
Hope this helps.
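As a rough illustration of the vectorization point above, the following sketch compares an element-by-element write with a single assignment to 2500 rows of one table variable. It reuses the sizes from the question; the variable name is looked up from the table's properties rather than assumed, and the timings will vary by release and machine.

```matlab
% Sketch only: loop write vs. one vectorized write into a table variable.
A = repmat({''}, 10000, 50);             % cell array of empty char vectors
B = cell2table(A);
name = B.Properties.VariableNames{7};    % name of the 7th variable, not hard-coded

% Element-by-element write using dot indexing
tic;
for i = 1:2500
    B.(name){i} = 'aaa';
end
tLoop = toc;

% Vectorized write: the scalar cell is expanded over all 2500 rows
tic;
B.(name)(1:2500) = {'aaa'};
tVec = toc;

fprintf('loop: %.4f s, vectorized: %.4f s\n', tLoop, tVec);
```

Both forms leave the table in the same state; only the number of calls into the table indexing code differs.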
0 Comments
Michael
on 31 Oct 2014
1 Comment
Nigel Dyer
on 28 Jun 2015
Agreed. The table type appeared to be a perfect solution for what I needed to do. I found this question, registered my profile and wrote this while waiting for writetable to complete. The previous code using dlmwrite took a couple of seconds.
Oleg Komarov
on 30 Nov 2016
Edited: Oleg Komarov
on 2 Dec 2016
I have been using table() since well before it was introduced into the core package, because de facto it is the ported version of the dataset() class from the Statistics Toolbox. I also noticed a long time ago many limitations in terms of performance and functionality, and have logged enhancement requests with TMW.
To address the limitations of table() while waiting for the official implementation of my enhancement requests, I created tableutils(). Among the problems, you would be astonished to know that the disp() of a big table can literally freeze your PC until the next ice age (and I am not talking about the movies...). This is something that I fixed with a buffered disp method.
While my tableutils() does not directly address the problems in subsref/subsasgn, anyone is welcome to contribute to this effort to make the table() class better by submitting an issue or a pull request on GitHub.
Addressing some points in the question
- It is 50x faster to initialize with {''} rather than with []
N = 500;
A = cell(N);
sprintf('cell2table() on empty cells: %.3fs', timeit(@()cell2table(A)))
A = repmat({''},(N));
sprintf('cell2table() on {''''} cells: %.3fs', timeit(@()cell2table(A)))
- It is 5x faster to use dot-indexing, i.e. subsasgnDot, than brace-indexing, i.e. subsasgnBraces
S = 1000;
[row,col] = ind2sub([N N],randsample(N^2,S,false)); % randsample requires the Statistics Toolbox
% {} assignment
B = cell2table(A);
tic
for ii = 1:S
B{row(ii),col(ii)} = {'aaa'};
end
toc
% . assignment
C = cell2table(A);
vnames = B.Properties.VariableNames;
tic
for ii = 1:S
C.(vnames{col(ii)})(row(ii)) = {'aaa'};
end
toc
0 Comments
LuisCardona
on 5 May 2016
Tables are the slowest thing I have ever used. I had to rewrite my code to use matrices, coding the names of my columns as integers, because of their poor performance.
Stay away from tables!
3 Comments
Victor
on 15 Jun 2017
I think the current table datatype is an attempt to support more sophisticated Excel-like functionality, with a trade-off in optimization.
The problem is that with matrices you can't always remember which column name goes with which index, and searching a list of strings on every access to a variable is not a good solution.
I have used two ways to keep variable/column names: a structure of vectors of the same length, and a vector of structures (a.k.a. nonscalar struct array).
Both have drawbacks: you can't get simple row-wise and column-wise access at the same time without a slow conversion to another data structure.
But I think there could be a simpler, optimized version of the table data type, if we just want to combine row-number and column-variable indexing with ordinary arrays and cell arrays. And if we have only numbers (with no cell/string/sparse functionality), it could be even faster.
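To make the two layouts described above concrete, here is a minimal sketch with made-up field names (time and value are purely illustrative). It shows which access is direct in each layout and which one needs gathering, and how a table gives both at once.

```matlab
% Layout 1: scalar struct of equal-length column vectors
s.time  = [0; 1; 2];
s.value = [10; 20; 30];
col  = s.value;                    % column access is direct
row2 = [s.time(2), s.value(2)];    % a row has to be gathered field by field

% Layout 2: nonscalar struct array, one element per row
r = struct('time', {0; 1; 2}, 'value', {10; 20; 30});
rowRec      = r(2);                % row access is direct
colFromRows = [r.value]';          % a column has to be gathered by concatenation

% A table offers both kinds of access on the same data
t    = struct2table(s);
tRow = t(2, :);                    % one-row table
tCol = t.value;                    % plain numeric column
```

The convenience of the last three lines is what the table type pays for with its indexing overhead.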
LuisCardona
on 28 Jun 2017
Hoi Wong, I wanted to clarify that I was talking about tables in MATLAB, not the concept altogether. Thanks for the comment. But I keep my position that they are terribly slow in MATLAB.
jbpritts
on 24 Nov 2016
I have Matlab 2016b. I can confirm that tables are terribly slow. Unless you really need them for heterogeneous data, avoid them in any performance-critical code. I will have to rewrite a fairly complicated section of code using legacy data structures. Matlab should address this extreme performance deficiency.
0 Comments
Peter Perkins
on 2 Dec 2016
Edited: Peter Perkins
on 2 Dec 2016
As posts on this thread have indicated, while tables are often the right data structure for the job, their performance in scalar indexing is not comparable to that of types such as double and struct. While there have been significant performance improvements since the initial release in R2013b (e.g. writetable), and those improvements will continue, tables are best when operations can be vectorized. That's often true even with plain old double matrices. It's also best to pre-allocate a table rather than growing it row by row, and again, that's true even for double matrices.
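The pre-allocation advice can be sketched as follows. The variable names Id and Value and the size n are arbitrary choices for illustration; the absolute timings depend on release and hardware.

```matlab
% Sketch only: growing a table row by row vs. filling a pre-allocated one.
n = 500;
names = {'Id', 'Value'};

% Growing: each concatenation builds a new, taller table
tic;
Tgrow = table(1, 1, 'VariableNames', names);
for k = 2:n
    Tgrow = [Tgrow; table(k, k^2, 'VariableNames', names)]; %#ok<AGROW>
end
tGrow = toc;

% Pre-allocated: create full-height variables once, then fill them in
tic;
Tpre = table(zeros(n,1), zeros(n,1), 'VariableNames', names);
for k = 1:n
    Tpre.Id(k)    = k;
    Tpre.Value(k) = k^2;
end
tPre = toc;

fprintf('grown: %.4f s, pre-allocated: %.4f s\n', tGrow, tPre);
```

Both loops produce the same table; the second avoids rebuilding it on every iteration.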
In situations where code cannot be vectorized, perhaps because the results of one iteration of a loop affect subsequent iterations, it's often possible to encapsulate the body of a loop into a function that you call by passing it a table's variables using dot subscripting, and assign back to a table's variables, rather than completely rewriting code to not use tables. It often looks something like this:
[t.X,t.Y,t.Z] = fun(t.A,t.B,t.C)
where fun is a loop that works on separate arrays. Even when it's not desirable to encapsulate the code in a function body, it's often possible to "hoist" a small number of variables out of a table and into the workspace before a loop, have the loop work on them, and then put them back in the table. In other words, if performance is an issue, consider replacing the bottlenecks with code that uses lower-level data types rather than completely avoiding tables.
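A minimal sketch of the "hoisting" idea, assuming a table with an input variable A and an output variable X (both names, and the recurrence itself, are made up for illustration). The loop cannot be vectorized naively because each iteration depends on the previous one, so it runs on plain arrays and the result is written back once.

```matlab
% Sketch only: hoist table variables into the workspace around a loop.
n = 10000;
t = table(rand(n,1), zeros(n,1), 'VariableNames', {'A', 'X'});

% Hoist: one table read before the loop
a = t.A;
x = zeros(n,1);

% The loop works on ordinary double arrays only
x(1) = a(1);
for k = 2:n
    x(k) = 0.5*x(k-1) + a(k);   % depends on the previous iteration
end

% Put back: one table write after the loop
t.X = x;
```

The table is touched twice in total, instead of twice per iteration, and the rest of the code can keep using t as before.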
2 Comments
Oleg Komarov
on 4 Dec 2016
Edited: Oleg Komarov
on 4 Dec 2016
Hi Peter, thanks for the suggestion. Is there any particular reason why the table.subsasgnBraces() transforms the RHS into a table?
A lot of overhead is incurred in that operation and subsequent table methods applied to a table-like RHS.
See, for example, line 121 of @tabular\subsasgnBraces.m, and line 191 of @tabular\subsasgnParens.m, which calls a MATLAB-coded repmat instead of the builtin repmat, since its input is the RHS already converted to a table.
Peter Perkins
on 5 Dec 2016
Your earlier observation that dot-then-parens indexing is faster than braces, for example, B.A7(i) vs B{i,7}, is true. That's one of the "significant performance improvements" I was referring to. It's an ongoing process. Table brace indexing is something we're planning to work on.