How can I increase the reading speed for a multi-gigabyte file?
Hi all
How do I increase the reading speed for an Excel file whose rows and columns add up to several gigabytes?
18 Comments
dpb
on 17 Jun 2019
Dunno...'pends on what the data are and how saved...getting it out of Excel and into a .mat or stream file would undoubtedly be the fastest.
farzad
on 17 Jun 2019
The data are floats and, let's say, 5 gigabytes.
Why .mat, and why a stream file? What would the code look like?
Is using a table useful?
dpb
on 17 Jun 2019
'Cuz both .mat and stream files are binary representations of the actual bytes in memory, thus eliminating the need for conversion.
You've still not said which form of file it actually is; if it is .xls(x), then xlsread is fairly slow.
A table would be one choice for internal storage in Matlab; how useful it is depends entirely on what the data are and how they need to be processed. As with the actual file itself, you're keeping us totally in the dark on that, so all we can do is guess...
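To make the "stream file" idea concrete, here is a minimal sketch of writing a matrix as raw binary with fwrite and reading it back with fread. The small matrix and the temporary file name are stand-ins chosen for illustration, not anything from the original poster's setup.

```matlab
% Minimal sketch: raw binary ("stream") round trip with fwrite/fread.
data  = rand(1000, 50);                 % small stand-in for the real data
fname = [tempname '.bin'];              % hypothetical temporary file name

fid = fopen(fname, 'w');
assert(fid ~= -1, 'Could not open file for writing');
fwrite(fid, data, 'double');            % bytes go out in column-major order
fclose(fid);

fid = fopen(fname, 'r');
assert(fid ~= -1, 'Could not open file for reading');
readBack = fread(fid, size(data), '*double');   % the size must be known to the reader
fclose(fid);

roundTripOK = isequal(data, readBack);  % no text conversion, so values are bit-identical
delete(fname);
```

A raw file like this stores no dimensions or type information, so the reader has to know them (or they have to be kept somewhere else); a .mat file records them for you.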
dpb
on 17 Jun 2019
Well, with .xlsx files you have the choice between xlsread and readtable. You'll just have to test which is faster--one presumes probably readtable. If you have R2019a, you can try the new readmatrix which is now recommended instead of xlsread.
For csv files, the historic ways are csvread, textscan, fscanf altho again with the caveat of requiring R2019a, readmatrix is the TMW-recommended alternative now.
I don't have R2019a installed yet, so I can't comment on the relative performance between it and alternatives.
Still, if speed matters and this will be done more than once, then converting once and using .mat or stream files from then on will undoubtedly beat any of the alternatives.
You could, if your application can live with single precision, cut the file size in half by saving single instead of double. Whether that is a viable alternative depends purely on what is required of the data itself.
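A rough sketch of that convert-once workflow follows, assuming R2019a or later for readmatrix and writematrix. The file names are hypothetical, and a small generated CSV stands in for the real source file so the example runs on its own; a real .xlsx path could be passed to readmatrix in the same way.

```matlab
% Sketch: pay the slow text/spreadsheet import once, then reuse a .mat cache.
srcFile = [tempname '.csv'];            % hypothetical stand-in for the real source file
writematrix(rand(200, 5), srcFile);     % generate a small source file for the demo
matFile = [tempname '.mat'];            % hypothetical cache file

if ~isfile(matFile)
    data = readmatrix(srcFile);         % slow path, taken only on the first run
    save(matFile, 'data', '-v7.3');     % -v7.3 is required for variables over 2 GB
end
S = load(matFile, 'data');              % fast path on every later run
data = S.data;

% Optional: halve the storage if single precision is acceptable.
dataSingle = single(data);              % 4 bytes per value instead of 8
infoD = whos('data');
infoS = whos('dataSingle');
fprintf('double: %d bytes, single: %d bytes\n', infoD.bytes, infoS.bytes);

delete(srcFile);
delete(matFile);
```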
Walter Roberson
on 18 Jun 2019
Edited: Walter Roberson
on 18 Jun 2019
I wrote out 1e6 by 500 doubles = 4 gigabytes in binary form, and tested how long loading took.
When saved as space-delimited text using save -ascii -double, load() of the resulting 12501000000-byte text file took 1416 seconds.
textscan() of that same file took 265 seconds.
fscanf() of the same file took 371 seconds.
When saved as a .csv file using dlmwrite() with precision 16, then using load() took 1107 seconds.
When saved as -v7.3 .mat, then using load() of the 3796914266 bytes of file took 25 seconds.
When saved as a pure binary file, then fread(fid, [1e6 500],'*double') took 14 1/4 seconds the first time, and 2.1 seconds the second time (file in operating system cache.) fread(fid, [1 inf], '*double') takes 4.6 seconds when the file is in operating system cache, which tells us that there is more memory management overhead when the size is unknown.
(I will update as I generate more times.)
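For anyone who wants to repeat this kind of comparison on their own machine, here is a scaled-down sketch using tic/toc. The matrix size and temporary file names are assumptions chosen so it runs quickly; absolute times will differ from the figures above and depend heavily on disk and operating system caching.

```matlab
% Sketch: time load() of a -v7.3 .mat against fread() of a raw binary file.
nRows = 2000; nCols = 50;               % assumed sizes, far smaller than the 4 GB test
data = rand(nRows, nCols);

matFile = [tempname '.mat'];            % hypothetical temporary files
binFile = [tempname '.bin'];
save(matFile, 'data', '-v7.3');
fid = fopen(binFile, 'w');
fwrite(fid, data, 'double');
fclose(fid);

tic; S = load(matFile, 'data'); tMat = toc;

fid = fopen(binFile, 'r');
tic; A = fread(fid, [nRows nCols], '*double'); tKnown = toc;  % size given up front
frewind(fid);
tic; B = fread(fid, [1 inf], '*double'); tInf = toc;          % size found while reading
fclose(fid);
B = reshape(B, nRows, nCols);           % restore the matrix shape

fprintf('load .mat: %.4f s | fread, known size: %.4f s | fread, [1 inf]: %.4f s\n', ...
        tMat, tKnown, tInf);
delete(matFile);
delete(binFile);
```

The files here were just written, so every read is likely served from the operating system cache; a cold-cache measurement needs a fresh session or a file larger than available memory.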
farzad
on 18 Jun 2019
Thank you very much Walter
That is very much what I was searching for. How do you save as .mat?
Walter Roberson
on 18 Jun 2019
data = rand(1e6, 50);
save testdata.mat data -v7.3
but this relies upon having the data in the first place to write out as .mat.
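Once the data are in a -v7.3 .mat file, the matfile function can read just part of a variable instead of loading all of it. This was not part of the timings above; it is only a sketch of one possible follow-up, with a small matrix and a temporary file name as stand-ins.

```matlab
% Sketch: partial loading from a -v7.3 .mat file via matfile.
data  = rand(1000, 50);                 % small stand-in for the real data
fname = [tempname '.mat'];              % hypothetical temporary file name
save(fname, 'data', '-v7.3');

m = matfile(fname);                     % creates a handle; variables are not loaded yet
[nRows, nCols] = size(m, 'data');       % dimensions are available without reading the data
firstRows = m.data(1:100, :);           % reads only rows 1 to 100 from disk

clear m                                 % release the handle before removing the file
delete(fname);
```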
Walter Roberson
on 18 Jun 2019
I am having difficulty creating an Excel file that large. I wrote the file as .csv, but my Excel complains about running out of memory when trying to import it, which does not make sense to me.
Walter Roberson
on 18 Jun 2019
I have been updating the timings; you might want to have another look, above.
dpb
on 18 Jun 2019
All of which continues to say "ditch Excel" entirely for such large files...
I do find it interesting that textscan manages to beat fscanf -- one would think it would boil down to the same C runtime library call. Just out of curiosity, what were the two specific commands used, Walter? Oh--did you include the overhead to cast the cell array from textscan to double?
Walter Roberson
on 18 Jun 2019
Edited: Walter Roberson
on 18 Jun 2019
I created a format with repmat of '%f' 50 times. I fopen and then
datacell = textscan(fid, fmt, 'collectoutput', 1);
Because this puts everything into a single cell the overhead to extract the array is trivial.
The timing with collectoutput 0, without joining the columns afterwards, was a hair higher, but the difference was not statistically significant.
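A scaled-down sketch of the two text-reading calls side by side follows. The textscan call mirrors the one above; the fscanf call and the small test file are my guess at an equivalent setup, not necessarily the exact commands that produced the timings.

```matlab
% Sketch: read the same space-delimited text file with textscan and with fscanf.
nRows = 500; nCols = 50;                % assumed small sizes for a quick run
data = rand(nRows, nCols);
txtFile = [tempname '.txt'];            % hypothetical temporary file name
save(txtFile, 'data', '-ascii', '-double');

fmt = repmat('%f', 1, nCols);           % one %f per column

fid = fopen(txtFile, 'r');
tic;
datacell = textscan(fid, fmt, 'CollectOutput', 1);
A = datacell{1};                        % single cell holding the whole numeric array
tTextscan = toc;

frewind(fid);
tic;
B = fscanf(fid, '%f', [nCols inf]).';   % fscanf fills column by column, so transpose
tFscanf = toc;
fclose(fid);

fprintf('textscan: %.4f s | fscanf: %.4f s\n', tTextscan, tFscanf);
delete(txtFile);
```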
dpb
on 18 Jun 2019
Yeah, that's kinda' what I suspected, thanks for confirming, Walter.
I still find it more than strange that there's a roughly 30% reduction over fscanf -- the question is what they are doing wrong with it, then, that leaves that much room for improvement?
These timings couldn't possibly be related to caching issues, I presume; you're too careful for that! :)
Answers (0)