MATLAB Answers

speed up renamecats/categorical multiple columns

3 views (last 30 days)
Peng Li
Peng Li on 12 May 2020
Commented: Peng Li on 9 Oct 2020
I have a huge csv file of about 16GB which over 9k columns. Each column is initially filled with some codes (either integer or string), and I have a code book with code and meaning for each column. What I'm trying to do is to translate the table and finally have a table that has readable texts instead of codes.
I can use either categorical or renamecats to "translate" them, but the issue is that it takes substentially long time to loop through these columns. I'm thinking if there is a way to speed this up.
See below an example
tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
["123", "234", "345", "456", "567"]', ...
'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
["apple", "banana", "cat", "dog", "elephont"]', ...
'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
["East", "West", "North", "South", "Middle"]', ...
'VariableNames', {'Code', 'Meaning'});
Vars = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
Is that possible to avoid this loop, or any suggestions to speed this up (considering that I have over 500k rows and 9k columns).
Thank you!

Answers (1)

Campion Loong
Campion Loong on 9 Oct 2020
Hi Peng,
It seems you have the Dictionary code book to boot, and you already know which sets of code go wtih which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).
In this case, why not create the table with categorical to begin with:
tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"], dictionary.A.Code, dictionary.A.Meaning),...
categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
'VariableNames', {'A', 'B'});
There is no loop, faster and much more readable.
Peng Li
Peng Li on 9 Oct 2020
Hi Campion, thanks again for you attention. I've actually tried different options -- tall array, datastore, transform a datastore, mapreduce, or readall in a server (over 380G ram) a while ago. This is easily handlable.
The issue is with this de-coding part. It is simply too slow to do a loop. And ImportOptions couldn't help with the decoding of the actual data I guess, as i have to load the data first and do the decoding.
I've tried a way using transform datastore. Basically in the transform function, I do the decoding, and then write the datastore to disk. It works, but slow too.
I have several workable solutions now but just no one gives me the best speed. The single file is around 20G in cvs format, with over half a million rows and almost 10 thunsands of columns. With my server this tasks takes over 24 hours so I guess i just need to be a bit patient to let the server work while i'm doing something else.

Sign in to comment.




Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!