Quick conversion of a large (and with mixed data types) cell array into a numerical array

17 views (last 30 days)
Hi, I have a 800'000 x 20 cell array and I would like to convert it into a 800'000 x 20 double precision array.
My cell array has the following features:
  • it is a mix of data types, i.e. some columns are character vectors (which represent numbers) and some columns are numbers.
  • within a column of character vectors, the number of characters is different among the rows.
Here, an example/part of my cell array:
a =
10×3 cell array
{'1170790,889999999897555'} {'1009,697999999999979'} {[ 1]}
{'1170260,709000000031665'} {'869,886999999999944' } {[ 2]}
{'1181959,743999999947846'} {'504,125999999999976' } {[ 3]}
{'1174019,564000000013039'} {'674,142000000000053' } {[ 4]}
{'1173729,793000000063330'} {'689,635999999999967' } {[ 5]}
{'1181555,257999999914318'} {'509,391999999999996' } {[ 6]}
{'1180374,280000000027940'} {'538,126999999999953' } {[ 7]}
{'1180124,550999999977648'} {'568,719000000000051' } {[ 8]}
{'1179305,570999999996275'} {'626,528999999999996' } {[ 9]}
{'1177132,448000000091270'} {'705,443999999999960' } {[10]}
Any suggestion to quickly/efficiently convert this (mixed data type) cell array into a numerical array, possibly without using str2double() ?
Just for information: among my attempts for conversion, I found that str2double() is very slow and
>> cell2mat(a(:,2))
Error using cat
Dimensions of arrays being concatenated are not consistent.
Error in cell2mat (line 83)
m{n} = cat(1,c{:,n});
and
>> str2num(char(a(:,3)))
Error using char
Cell elements must be character arrays.
  2 Comments
Stephen23
Stephen23 on 17 Feb 2020
Your character vectors appear to use decimal commas, which are treated as thousands separators by str2double:
>> str2double('12,345')
ans =
12345
and will result in separate numbers if you use str2num:
>> str2num('12,345')
ans =
12 345
I doubt that either of those are very useful to you.
Sim
Sim on 17 Feb 2020
Yes, you are right...
I replaced the commas with dots in other parts of my code with
strrep(C,',','.')
....And I should do it for this case too... :)

Sign in to comment.

Accepted Answer

Stephen23
Stephen23 on 17 Feb 2020
Edited: Stephen23 on 17 Feb 2020
This is likely to be faster than str2double. Use a decimal point rather than a decimal comma for more speed.
>> C = {'1170790,889999999897555','1009,697999999999979',1;'1170260,709000000031665','869,886999999999944',2;'1181959,743999999947846','504,125999999999976',3;'1174019,564000000013039','674,142000000000053',4;'1173729,793000000063330','689,635999999999967',5;'1181555,257999999914318','509,391999999999996',6;'1180374,280000000027940','538,126999999999953',7;'1180124,550999999977648','568,719000000000051',8;'1179305,570999999996275','626,528999999999996',9;'1177132,448000000091270','705,443999999999960',10}
C =
'1170790,889999999897555' '1009,697999999999979' [ 1]
'1170260,709000000031665' '869,886999999999944' [ 2]
'1181959,743999999947846' '504,125999999999976' [ 3]
'1174019,564000000013039' '674,142000000000053' [ 4]
'1173729,793000000063330' '689,635999999999967' [ 5]
'1181555,257999999914318' '509,391999999999996' [ 6]
'1180374,280000000027940' '538,126999999999953' [ 7]
'1180124,550999999977648' '568,719000000000051' [ 8]
'1179305,570999999996275' '626,528999999999996' [ 9]
'1177132,448000000091270' '705,443999999999960' [10]
>> X = cellfun('isclass',C,'char'); % slower: cellfun(@ischar,C)
>> M = nan(size(C));
>> M(~X) = [C{~X}]; % numeric data
>> M(X) = sscanf(strrep(sprintf(' %s',C{X}),',','.'),'%f') % character data
M =
1170790.890 1009.698 1
1170260.709 869.887 2
1181959.744 504.126 3
1174019.564 674.142 4
1173729.793 689.636 5
1181555.258 509.392 6
1180374.280 538.127 7
1180124.551 568.719 8
1179305.571 626.529 9
1177132.448 705.444 10
If the intermediate character vector does not fit into memory then simply do the above for each column of C.
  3 Comments
Stephen23
Stephen23 on 17 Feb 2020
Edited: Stephen23 on 17 Feb 2020
"is there any way to keep all the decimals of C in M? it looks like the output is rounded, as you showed:"
That is just how they are displayed. Do not confuse how numeric data are displayed with how numeric data are stored on a computer. Those values are converted and stored with complete double precision, which means around 15-16 decimal digits. All of that precision is stored even if it is not displayed (according to the format documentation the display routine does do rounding, but of course that makes absolutely no difference to the stored values).
Judging by the input character vectors, the original numeric data were also doubles, which means that you will get back the same numeric data. or at least something very close to it (the trailing digits after that precision are just trash anyway, they have no significance once stored as a double):
>> fprintf('%.42f\n',M(:,1))
1170790.889999999897554516792297363281250000000000
1170260.709000000031664967536926269531250000000000
1181959.743999999947845935821533203125000000000000
1174019.564000000013038516044616699218750000000000
1173729.793000000063329935073852539062500000000000
1181555.257999999914318323135375976562500000000000
1180374.280000000027939677238464355468750000000000
1180124.550999999977648258209228515625000000000000
1179305.570999999996274709701538085937500000000000
1177132.448000000091269612312316894531250000000000
You can also experiment with the format options to change how those values are displayed.

Sign in to comment.

More Answers (0)

Categories

Find more on Numeric Types in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!