Increase read time of large text file

I have a text file that is very large, and each line basically looks like this:
A1,1= 2.5 A1,2=1.8e-5 A1,3=4
A2,1= 1 A2,2=3 A2,3=6.4
etc...
I want to extract only the numerical data for each coordinate into a matrix, and I have code to successfully do it. But it takes too long. Here is my current technique:
Str=regexp(chunk{k},'=\s*(\S+)','tokens');
num_data=cellfun(@str2double,[Str{:}]);
I'm reading 1000 lines at a time into the cell array 'chunk', so chunk{k} just returns one particular line of the file.
Str will contain just the numerical data from the line, but as char tokens. num_data is the final output for each line, converted to double.
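To make the two lines above concrete, here is a minimal sketch run on a single sample line. The variable names txt, tokens and num_data2 are only for illustration and are not part of the original code. The last line is an assumption worth testing rather than a measured result: str2double also accepts a cell array of character vectors directly, which avoids calling it once per token through cellfun.

```matlab
% One sample line, standing in for chunk{k}.
txt = 'A1,1= 2.5 A1,2=1.8e-5 A1,3=4';

% Each match yields a 1x1 cell holding the text after '=' (leading blanks skipped).
Str = regexp(txt, '=\s*(\S+)', 'tokens');

% Flatten the nested cells into a 1xN cell array of char vectors.
tokens = [Str{:}];

% Original approach: one str2double call per token.
num_data = cellfun(@str2double, tokens);

% Possible alternative (not timed here): a single call on the whole cell array.
num_data2 = str2double(tokens);
```

Both variables should hold the same 1x3 double row for this line.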
For an example file that is 20MB, this technique takes 40 seconds. But if I remove this line:
num_data=cellfun(@str2double,[Str{:}]);
then the code only takes 13 seconds.
Any suggestions on how to speed up the conversion from cell array of strings to double array would be greatly appreciated.
Thanks,
Adam

2 Comments

Posting an example file would be useful to create and test a suggested method.
I assume you mean decrease the time for reading.


 Accepted Answer

Jan on 24 Sep 2013
Edited: Jan on 25 Sep 2013
What about this:
Data = fscanf(fid, '%*c%*d,%*d= %g ', Inf); % [EDITED, Walter's suggestion]
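A minimal sketch of how this call could be used end to end, assuming the file contains only lines in the format shown in the question. The temporary file name, the variable M, and the column count nCols = 3 are assumptions made for illustration. In the format, %*c skips the leading letter, %*d,%*d skips the two indices, and only the %g value after '=' is returned.

```matlab
% Write the two sample lines from the question to a temporary file.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'A1,1= 2.5 A1,2=1.8e-5 A1,3=4\n');
fprintf(fid, 'A2,1= 1 A2,2=3 A2,3=6.4\n');
fclose(fid);

% Read every value in one call; the skipped fields (%*c, %*d) are not stored.
fid = fopen(fname, 'r');
Data = fscanf(fid, '%*c%*d,%*d= %g ', Inf);
fclose(fid);
delete(fname);

% fscanf returns one column vector; reshape it into rows of nCols values.
nCols = 3;                        % assumption: three values per line
M = reshape(Data, nCols, []).';   % one matrix row per line of the file
```

The reshape relies on every line holding the same number of values; a file with ragged lines would need a different approach.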

4 Comments

It should be Data(4:4:end) I guess.
Just
Data = fscanf(fid, '%*c%*d,%*d= %g ', Inf);
as the other fields are not needed.
Alternately,
Data = fscanf(fid, '%*[^=]=%g', inf)
might be more efficient
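As a small sketch of this alternative, here it is applied with sscanf to an in-memory copy of the two sample lines. The variable S is an assumption for illustration. %*[^=] skips everything up to the next '=', the literal '=' is matched, and %g reads the value.

```matlab
% The two sample lines from the question as one char vector.
S = sprintf('A1,1= 2.5 A1,2=1.8e-5 A1,3=4\nA2,1= 1 A2,2=3 A2,3=6.4\n');

% Skip all text up to each '=' and keep only the number that follows it.
Data = sscanf(S, '%*[^=]=%g', Inf);
```

Because the format does not depend on how the label before '=' is laid out, it should also tolerate labels that differ from the single letter plus two indices assumed by the other format.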
@Cedric: Yes, of course :4: was meant.
@Walter: And omitting the values which are ignored later on is much more efficient. I hesitated because I thought of compatibility problems in Matlab 6.5, but I should start to say goodbye to historic versions.
I tried sscanf(s, '%[~=]s=%g') without success, obviously because "~" is not "^". But I'm surprised by '%*[^=]=', because I'd expect an additional character to specify the format, e.g. 's'. Unfortunately, Matlab's documentation is really thin on this point. Although I know that the format specifiers are not programmed by TMW but taken from a C library, clear documentation would be useful.
The "[" is the format specifier character in "%[^=]" . Just like in "%%", the second % is the format specifier character.
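A short sketch of what this means in practice, using one field from the question. The variable names s, ok and bad are only for illustration. Since "[" already acts as the conversion character, a trailing 's' is treated as a literal character that must appear in the input.

```matlab
s = 'A1,1= 2.5';

% '%*[^=]' is a complete specifier: skip characters up to (not including) '='.
ok = sscanf(s, '%*[^=]=%g');

% Here the extra 's' is literal text; the input has '=' at that position,
% so matching stops before any value is read.
bad = sscanf(s, '%*[^=]s=%g');
```

With this input, ok should hold the single value 2.5 and bad should be empty.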


More Answers (1)

Adam on 25 Sep 2013
Great! That worked perfectly. Read time went from 40 seconds to about 12 seconds.
I did not realize fscanf & sscanf could be used with such complex formatting. Thanks so much to all that replied!
I went with the "Data = sscanf(S, '%*c%*d,%*d= %g ', Inf);" option since there was little time difference between the two, and I can actually look at this option and understand what it is doing :) The other is a bit beyond my knowledge of regular expressions.

Asked: 24 Sep 2013