speed up renamecats/categorical multiple columns

Question

Peng Li on 12 May 2020

0
Link

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/525001-speed-up-renamecats-categorical-multiple-columns

Commented: Peng Li on 9 Oct 2020

I have a huge csv file of about 16GB which over 9k columns. Each column is initially filled with some codes (either integer or string), and I have a code book with code and meaning for each column. What I'm trying to do is to translate the table and finally have a table that has readable texts instead of codes.

I can use either categorical or renamecats to "translate" them, but the issue is that it takes substentially long time to loop through these columns. I'm thinking if there is a way to speed this up.

See below an example

tbl = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["123", "234", "345", "456", "567"]', ...
    'VariableNames', {'A', 'B'});
dictionary.A = table(["a1", "b2", "c3", "d4", "e5"]', ...
    ["apple", "banana", "cat", "dog", "elephont"]', ...
    'VariableNames', {'Code', 'Meaning'});
dictionary.B = table(["123", "234", "345", "456", "567"]', ...
    ["East", "West", "North", "South", "Middle"]', ...
    'VariableNames', {'Code', 'Meaning'});
Vars   = tbl.Properties.VariableNames;
for iC = 1:width(tbl)
    tbl.(iC) = categorical(tbl.(iC), dictionary.(Vars{iC}).Code, ...
        dictionary.(Vars{iC}).Meaning);
end

Is that possible to avoid this loop, or any suggestions to speed this up (considering that I have over 500k rows and 9k columns).

Thank you!

0 Comments
Show -2 older commentsHide -2 older comments

Sign in to comment.

Sign in to answer this question.

Answer 1

Campion Loong on 9 Oct 2020

0
Link

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/525001-speed-up-renamecats-categorical-multiple-columns#answer_509771

Open in MATLAB Online

Hi Peng,

It seems you have the Dictionary code book to boot, and you already know which sets of code go wtih which field/name in the Dictionary (i.e. you can designate "VariableNames" in the first table(...) call).

In this case, why not create the table with categorical to begin with:

tbl = table(categorical(["a1"; "b2"; "c3"; "d4"; "e5"],      dictionary.A.Code, dictionary.A.Meaning),...
            categorical(["123"; "234"; "345"; "456"; "567"], dictionary.B.Code, dictionary.B.Meaning),...
            'VariableNames', {'A', 'B'});

There is no loop, faster and much more readable.

3 Comments
Show 1 older commentHide 1 older comment

Campion Loong on 9 Oct 2020

If you have thousands of columns, are you actually reading it from a file or a source somewhere? I struggle to imagine that could be manageable if you're making the first table call manually on thousands of columns.

If you are reading or importing, check out ImportOptions -- it gives you much more flexibility before actually reading the data in:

Peng Li on 9 Oct 2020

Hi Campion, thanks again for you attention. I've actually tried different options -- tall array, datastore, transform a datastore, mapreduce, or readall in a server (over 380G ram) a while ago. This is easily handlable.

The issue is with this de-coding part. It is simply too slow to do a loop. And ImportOptions couldn't help with the decoding of the actual data I guess, as i have to load the data first and do the decoding.

I've tried a way using transform datastore. Basically in the transform function, I do the decoding, and then write the datastore to disk. It works, but slow too.

I have several workable solutions now but just no one gives me the best speed. The single file is around 20G in cvs format, with over half a million rows and almost 10 thunsands of columns. With my server this tasks takes over 24 hours so I guess i just need to be a bit patient to let the server work while i'm doing something else.

Sign in to comment.

speed up renamecats/categorical multiple columns

0 Comments
Show -2 older commentsHide -2 older comments

Answers (1)

3 Comments
Show 1 older commentHide 1 older comment

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

speed up renamecats/categorical multiple columns

0 Comments Show -2 older commentsHide -2 older comments

Answers (1)

3 Comments Show 1 older commentHide 1 older comment

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

0 Comments
Show -2 older commentsHide -2 older comments

3 Comments
Show 1 older commentHide 1 older comment