MATLAB Answers

Reading large number of csv files

I have a large number of csv files to process. The files exist on AWS S3.
Currently I have a for loop like this
fds = fileDatastore(fp,'IncludeSubfolders',true,'ReadFcn',@csvread);
for i = 1:numFiles
data = read(fds); % I have tried csvread(fileName) as well
end
Each of these file reads takes 0.5 s on average. Considering that I have to process a large number of files, is there any way to speed this up?
P.S.: Parallel Computing Toolbox is currently not an option.
Thank you in advance!

  4 Comments

Sameer Gummuluru
Sameer Gummuluru on 15 Aug 2020
The file is about 1.7 MB.
Reading the same file from a local HD takes around 0.0847 s.
Walter Roberson
Walter Roberson on 16 Aug 2020
I wonder if it would be productive to use the AWS Java interface https://docs.aws.amazon.com/AmazonS3/latest/dev/RetrievingObjectUsingJava.html to download a batch of files to local storage ?
I also wonder whether it would be practical to re-organize the storage so that groups of files were stored in .zip ? This would reduce the size of most text files (because of compression) and should reduce the per-file overhead.
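Walter's zip idea might be sketched like this (hypothetical file names; assumes a batch of csv files has already been bundled into one zip on S3 and downloaded to local storage, e.g. with the AWS CLI or the Java SDK linked above):

```matlab
% Sketch only: batch001.zip is a hypothetical archive of csv files
% that has already been downloaded from S3 to the local temp folder.
zipFile   = fullfile(tempdir, 'batch001.zip');
localDir  = fullfile(tempdir, 'batch001');
fileNames = unzip(zipFile, localDir);    % unzip returns the extracted paths

data = cell(numel(fileNames), 1);
for k = 1:numel(fileNames)
    data{k} = csvread(fileNames{k});     % now reading from local disk
end
```

One download then carries many files, so the per-file network overhead is paid once per batch instead of once per file.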
Sameer Gummuluru
Sameer Gummuluru on 18 Aug 2020
Thank you Walter! I will give that a try.


Accepted Answer

per isakson
per isakson on 16 Aug 2020
Edited: per isakson on 16 Aug 2020
Caveats
  • My notion of your use case (workflow) is vague. Is your local free disk space a restriction? How many GB in total will you download in one working session? "Large number": how many is that?
  • I've never used fileDatastore and don't know how much overhead it brings.
  • It's a bit difficult to measure the elapsed time of reading files because after the first run the file will reside in the cache memory. I don't know how to "clear" the cache.
I made a small performance test
  • I use R2018b, Win10, and a spinning HD.
  • Created one local csv-file with 20,000 rows and 10 columns (1.73 MB) and a few copies of that.
  • Ran a script to compare csvread, textscan, fileread and fileDatastore:
%%
fprintf( '%-16s', 'csvread' )
tic, m = csvread( 'sameer.csv' ); toc
%%
fprintf( '%-16s', 'textscan' )
tic,
fid = fopen('sameer.csv');
cac = textscan( fid, '%f%f%f%f%f%f%f%f%f%f', 'Delimiter',',', 'CollectOutput',true );
fclose( fid );
toc
%%
fprintf( '%-16s', 'fileread' )
tic
txt = fileread('sameer.csv');
toc
%%
sad = dir('d:\m\cssm\sameer*.csv');
fp = fullfile( {sad.folder}, {sad.name} );
fprintf( '%-16s', 'fileDatastore' )
tic, fds = fileDatastore( fp, 'ReadFcn',@csvread ); toc
fprintf( '%-16s', 'read( fds )' )
tic, m1 = read( fds ); toc
fprintf( '%-16s', 'read( fds )' )
tic, m2 = read( fds ); toc
Result (with the files in memory cache)
>> cssm
csvread Elapsed time is 0.049017 seconds.
textscan Elapsed time is 0.024477 seconds.
fileread Elapsed time is 0.007118 seconds.
fileDatastore Elapsed time is 0.002117 seconds.
read( fds ) Elapsed time is 0.049034 seconds.
read( fds ) Elapsed time is 0.048331 seconds.
Discussion
  • the overhead of fileDatastore is small
  • textscan is twice as fast as csvread. csvread calls textscan to read and parse the file. Here the overhead is significant.
  • fileread is included to get a time for reading without parsing. Reading (from cache) takes a third of textscan's elapsed time.
  • "Reading the same file from local HD takes around 0.21 s." Does that refer to the first time you read this file? That is four times the elapsed time I see reading from cache. That's a large difference, first time or not.
To me it seems as if the total time, 0.7 s, is dominated by the download over the Internet. I have a free TB on my desktop and would start a download and go for lunch.

  2 Comments

Sameer Gummuluru
Sameer Gummuluru on 17 Aug 2020
Thank you for the detailed answer!
To clarify, when I said reading from the local HD, what I did is very similar to what you have done using fileDatastore.
  • I used fileDatastore with 'ReadFcn' set to @csvread. I performed this operation on a folder containing 50 files, and I never opened any of them myself.
  • The time taken to loop through 50 read(fds) calls is 0.0847 s * 50 (hence an average of 0.0847 s per file). Sorry for the 0.21 s I mentioned earlier; that was using load instead of csvread.
It is interesting to see from your analysis that textscan runs around twice as fast as csvread. So I will try replacing csvread with textscan and see the difference in performance.
Considering that my data is growing continuously, it might make sense for me to invest in the parallel computing toolbox soon.
Thank you!
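A textscan-based ReadFcn for fileDatastore might look like this (a sketch, not a tested drop-in; `readCsvFast` is a hypothetical name, and it assumes 10 comma-separated numeric columns as in per's test file):

```matlab
function m = readCsvFast(fileName)
% Hypothetical ReadFcn: reads a 10-column numeric csv with textscan,
% skipping the extra validation overhead that csvread adds on top.
fid = fopen(fileName, 'r');
cac = textscan(fid, repmat('%f', 1, 10), ...
    'Delimiter', ',', 'CollectOutput', true);
fclose(fid);
m = cac{1};
end
```

It would then be wired in with something like `fileDatastore(fp, 'IncludeSubfolders', true, 'ReadFcn', @readCsvFast)`.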
per isakson
per isakson on 18 Aug 2020
"I have never opened any of them" Yes, but fileDatastore relies on the 'ReadFcn' to do that. (In your case, csvread does it.)
"The time taken to loop through 50 read(fds) calls" Did you repeat that a number of times?
"load instead of csvread" Yes, load is indeed slow with ASCII files:
>> tic, load('sameer.csv','-ascii'); toc <<< first time after start of Matlab
Elapsed time is 0.571249 seconds.
>> tic, load('sameer.csv','-ascii'); toc
Elapsed time is 0.201560 seconds.
>> tic, load('sameer.csv','-ascii'); toc
Elapsed time is 0.200004 seconds.
"replacing csvread with textscan" If not all csv-files have the same number of columns, the format specification might pose a problem. textscan has undocumented features that are used by several Matlab functions, e.g. csvread. In csvread that problem is solved. (I've forgotten how.)
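One way around the varying-column problem might be to derive the format string from the file itself; this is a sketch under the assumption that every file is purely numeric with comma delimiters:

```matlab
% Sketch: count the columns from the first line of the file, so files
% with different column counts can share one textscan-based reader.
fid    = fopen(fileName, 'r');
header = fgetl(fid);                       % read the first line as text
nCols  = numel(strfind(header, ',')) + 1;  % delimiters + 1 = columns
frewind(fid);                              % rewind so row 1 is re-read
cac = textscan(fid, repmat('%f', 1, nCols), ...
    'Delimiter', ',', 'CollectOutput', true);
fclose(fid);
m = cac{1};
```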
"invest in the parallel computing toolbox soon" I'm not sure the main bottleneck is on your side. And I guess it's about moving around data rather than cpu-cycles. Furthermore, how will the Amazon server handle several "simultaneous" requests from the same IP address? (I haven't a clue; Google helped me figure out what AWS S3 stands for.) See Walter's comment to your question.
My questions here do not require answers.


More Answers (0)

Release

R2020a
