MATLAB Answers

Reading large number of csv files

I have a large number of csv files to process. The files exist on AWS S3.
Currently I have a for loop like this
fds = fileDatastore(fp,'IncludeSubfolders',true,'ReadFcn',@csvread);
for i = 1:numFiles
data = read(fds); % I have tried csvread(fileName) as well
end
Each of these file reads takes 0.5 s on average. Considering that I have to process a large number of files, is there any way to speed this up?
P.S.: Parallel Computing Toolbox is currently not an option.
Thank you in advance!

  4 Comments

Sameer Gummuluru
Sameer Gummuluru on 15 Aug 2020
The file is about 1.7 MB.
Reading the same file from a local HD takes around 0.0847 s.
Walter Roberson
Walter Roberson on 16 Aug 2020
I wonder if it would be productive to use the AWS Java interface https://docs.aws.amazon.com/AmazonS3/latest/dev/RetrievingObjectUsingJava.html to download a batch of files to local storage ?
I also wonder whether it would be practical to re-organize the storage so that groups of files were stored in .zip ? This would reduce the size of most text files (because of compression) and should reduce the per-file overhead.
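Walter's zip idea might be sketched like this (hypothetical file names; assumes a batch of csv files has already been bundled into one zip on S3 and downloaded to local storage, e.g. with the AWS CLI or the Java SDK linked above):

```matlab
% Sketch only: batch001.zip is a hypothetical archive of csv files
% that has already been downloaded from S3 to the local temp folder.
zipFile   = fullfile(tempdir, 'batch001.zip');
localDir  = fullfile(tempdir, 'batch001');
fileNames = unzip(zipFile, localDir);    % unzip returns the extracted paths

data = cell(numel(fileNames), 1);
for k = 1:numel(fileNames)
    data{k} = csvread(fileNames{k});     % now reading from local disk
end
```

One download then carries many files, so the per-file network overhead is paid once per batch instead of once per file.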
Sameer Gummuluru
Sameer Gummuluru on 18 Aug 2020
Thank you Walter! I will give that a try.


Accepted Answer

per isakson
per isakson on 16 Aug 2020
Edited: per isakson on 16 Aug 2020
Caveats
  • My notion of your use case (workflow) is vague. Is your local free disk space a restriction? How many GB in total will you download in one working session? "Large number": how many is that?
  • I've never used fileDatastore and don't know how much overhead it brings.
  • It's a bit difficult to measure the elapsed time of reading files because after the first run the file will reside in the cache memory. I don't know how to "clear" the cache.
I made a small performance test
  • I use R2018b, Win10, and a spinning HD.
  • Created one local csv-file with 20,000 rows and 10 columns (1.73 MB) and a few copies of that.
  • Ran a script to compare csvread, textscan, fileread and fileDatastore:
%%
fprintf( '%-16s', 'csvread' )
tic, m = csvread( 'sameer.csv' ); toc
%%
fprintf( '%-16s', 'textscan' )
tic,
fid = fopen('sameer.csv');
cac = textscan( fid, '%f%f%f%f%f%f%f%f%f%f', 'Delimiter',',', 'CollectOutput',true );
fclose( fid );
toc
%%
fprintf( '%-16s', 'fileread' )
tic
txt = fileread('sameer.csv');
toc
%%
sad = dir('d:\m\cssm\sameer*.csv');
fp = fullfile( {sad.folder}, {sad.name} );
fprintf( '%-16s', 'fileDatastore' )
tic, fds = fileDatastore( fp, 'ReadFcn',@csvread ); toc
fprintf( '%-16s', 'read( fds )' )
tic, m1 = read( fds ); toc
fprintf( '%-16s', 'read( fds )' )
tic, m2 = read( fds ); toc
Result (with the files in memory cache)
>> cssm
csvread Elapsed time is 0.049017 seconds.
textscan Elapsed time is 0.024477 seconds.
fileread Elapsed time is 0.007118 seconds.
fileDatastore Elapsed time is 0.002117 seconds.
read( fds ) Elapsed time is 0.049034 seconds.
read( fds ) Elapsed time is 0.048331 seconds.
Discussion
  • the overhead of fileDatastore is small
  • textscan is twice as fast as csvread. csvread calls textscan to read and parse the file. Here the overhead is significant.
  • fileread is included to get a time for reading without parsing. Reading (from cache) takes a third of textscan's elapsed time.
  • "Reading the same file from local HD takes around 0.21 s." Does that refer to the first time you read this file? That is four times the elapsed time I see reading from cache. That's a large difference, first time or not.
To me it seems as if the total time, 0.7 s, is dominated by the download over the Internet. I have a free TB on my desktop and would start a download and go for lunch.

  2 Comments

Sameer Gummuluru
Sameer Gummuluru on 17 Aug 2020
Thank you for the detailed answer!
To clarify, when I said reading from the local HD, what I did is very similar to what you have done using fileDatastore.
  • I used fileDatastore with 'ReadFcn' set to @csvread. I performed this operation on a folder containing 50 files, and I never opened any of them myself.
  • The time taken to loop through 50 read(fds) calls is 0.0847 s * 50 (hence an average of 0.0847 s per file). Sorry for the 0.21 s I mentioned earlier; that was using load instead of csvread.
It is interesting to see from your analysis that textscan runs around twice as fast as csvread. So I will try replacing csvread with textscan and see the difference in performance.
Considering that my data is growing continuously, it might make sense for me to invest in the parallel computing toolbox soon.
Thank you!
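A textscan-based ReadFcn for fileDatastore might look like this (a sketch, not a tested drop-in; `readCsvFast` is a hypothetical name, and it assumes 10 comma-separated numeric columns as in per's test file):

```matlab
function m = readCsvFast(fileName)
% Hypothetical ReadFcn: reads a 10-column numeric csv with textscan,
% skipping the extra validation overhead that csvread adds on top.
fid = fopen(fileName, 'r');
cac = textscan(fid, repmat('%f', 1, 10), ...
    'Delimiter', ',', 'CollectOutput', true);
fclose(fid);
m = cac{1};
end
```

It would then be wired in with something like `fileDatastore(fp, 'IncludeSubfolders', true, 'ReadFcn', @readCsvFast)`.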
per isakson
per isakson on 18 Aug 2020
"I have never opened any of them" Yes, but fileDatastore relies on the 'ReadFcn' to do that. (In your case, csvread does it.)
"The time taken to loop through 50 read(fds) calls" Did you repeat that a number of times?
"load instead of csvread" Yes, load is indeed slow with ASCII files:
>> tic, load('sameer.csv','-ascii'); toc <<< first time after start of Matlab
Elapsed time is 0.571249 seconds.
>> tic, load('sameer.csv','-ascii'); toc
Elapsed time is 0.201560 seconds.
>> tic, load('sameer.csv','-ascii'); toc
Elapsed time is 0.200004 seconds.
"replacing csvread with textscan" If not all csv-files have the same number of columns, the format specification might pose a problem. textscan has undocumented features that are used by several Matlab functions, e.g. csvread. In csvread that problem is solved. (I've forgotten how.)
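One way around the varying-column problem might be to derive the format string from the file itself; this is a sketch under the assumption that every file is purely numeric with comma delimiters:

```matlab
% Sketch: count the columns from the first line of the file, so files
% with different column counts can share one textscan-based reader.
fid    = fopen(fileName, 'r');
header = fgetl(fid);                       % read the first line as text
nCols  = numel(strfind(header, ',')) + 1;  % delimiters + 1 = columns
frewind(fid);                              % rewind so row 1 is re-read
cac = textscan(fid, repmat('%f', 1, nCols), ...
    'Delimiter', ',', 'CollectOutput', true);
fclose(fid);
m = cac{1};
```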
"invest in the parallel computing toolbox soon" I'm not sure the main bottleneck is on your side. And I guess it's about moving around data rather than cpu-cycles. Furthermore, how will the Amazon server handle several "simultaneous" requests from the same IP address? (I haven't a clue; Google helped me figure out what AWS S3 stands for.) See Walter's comment to your question.
My questions here do not require answers.


More Answers (0)

Release

R2020a
