extract routine data from extra large (~20GB) text file
- One structure per "frame" of data, a "frame" being the data between one "timestep" string and the next.
- Would it be robust to use the string "timestep" as the delimiter between "frames"?
- Would C,O,Ca,H_spc,O_spc,... be appropriate to use as field names?
- "cssm" stands for computer-science-software-matlab. I use that name as a generic function name when answering questions on the net.
- My intention is to implement that function once you confirm that I have understood your requirements.

- Go back to the drawing board. Write a concise requirement specification for READER, including how it will be used. Will it be used a few times and then thrown away, or used to analyze many simulation results over more than half a year? What response times are needed for different queries during the analyses? Will there be reruns for publications? ...
- Several years ago I read and parsed a 20GB text file on an 8GB desktop. The first step was to split the file into two dozen sub-files with GSplit. With GSplit you can control exactly where the file is split, a bit like the MATLAB function strsplit. The next step was to loop over the sub-files, read and parse each one with something similar to cssm in my answer, and save the results to mat-files, et cetera.
- Now I might want to use the Large Files and Big Data tools and store the results in one or more HDF files. This would be an occasion to try MATLAB's "Big Data" stuff.
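The split-then-parse workflow from the comment above could look roughly like the sketch below. It is not the original code: the sub-file names (part_01.txt, ...), the tiny frame layout, and the timestep-only parsing are assumptions made up for illustration, and the sub-files are generated here instead of being produced by GSplit.

```matlab
% Create three small stand-in sub-files (hypothetical names and layout).
folder = tempname;
mkdir(folder);
for k = 1:3
    fid = fopen(fullfile(folder, sprintf('part_%02d.txt', k)), 'w');
    fprintf(fid, 'timestep %d\nO 2\n%.1f 0 0\n', k, k);
    fclose(fid);
end

% Loop over the sub-files, parse each one, and save the result to a mat-file.
parts = dir(fullfile(folder, 'part_*.txt'));
for k = 1:numel(parts)
    str = fileread(fullfile(folder, parts(k).name));
    tok = regexp(str, 'timestep\s+(\d+)', 'tokens');
    timestep = cellfun(@(c) str2double(c{1}), tok);   % one value per frame
    [~, base] = fileparts(parts(k).name);
    save(fullfile(folder, [base, '.mat']), 'timestep');
end

% Read one result back, then clean up the temporary folder.
R = load(fullfile(folder, 'part_02.mat'));
n_parts = numel(parts);
rmdir(folder, 's');
```

Each sub-file only has to fit in memory on its own, which is the point of splitting on frame boundaries first.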
Accepted Answer
More Answers (2)

- Locate the positions of all frames in the file. A frame starts with the "t" of the string "timestep" and ends with the last character before the next "timestep". The result is stored in the persistent variable, ix_frame_start, and reused in subsequent calls with the same text file.
- Calculate the positions of a series of chunks of frames. The result is stored in ix_frame_chunk_start and ix_frame_chunk_end. This calculation is cheap.
- Loop over all chunks of frames
- Read one chunk with fseek and fread. This is about three times faster than using memmapfile.
- Loop over the list of atoms, which was given in the call.
- Extract the values of "timestep" and "X,Y,Z" with regexp.
- Return the result in a struct array.
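The steps above can be sketched as follows. This is a minimal illustration, not the function from the answer: the file layout (a "timestep" line, then "name index" and "x y z" lines), the regular expression, and the chunk size are assumptions, and the frame positions are found by reading the whole (tiny) file at once, which a 20GB file would not allow.

```matlab
% Write a tiny stand-in file; the layout is assumed for illustration only.
txt = sprintf([ ...
    'timestep 1\nC 1\n0.1 0.2 0.3\nO 2\n1.1 1.2 1.3\n', ...
    'timestep 2\nC 1\n0.4 0.5 0.6\nO 2\n1.4 1.5 1.6\n', ...
    'timestep 3\nC 1\n0.7 0.8 0.9\nO 2\n1.7 1.8 1.9\n']);
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fwrite(fid, txt, 'uint8');
fclose(fid);

% Step 1: locate the start of every frame (1-based character positions).
% A real 20GB file would have to be scanned block by block instead.
fid = fopen(fname, 'r');
str = fread(fid, [1, Inf], 'uint8=>char');
ix_frame_start = strfind(str, 'timestep');
file_size = numel(str);

% Step 2: group the frames into chunks (2 frames per chunk in this toy case).
frames_per_chunk = 2;
ix_frame_chunk_start = ix_frame_start(1:frames_per_chunk:end);
ix_frame_chunk_end = [ix_frame_chunk_start(2:end) - 1, file_size];

% Step 3: read each chunk with fseek/fread and extract one atom with regexp.
atom = 'O';
xpr = ['timestep\s+(\d+).*?\n', atom, '\s+\d+\s*\n\s*(\S+)\s+(\S+)\s+(\S+)'];
S = struct('timestep', {}, 'xyz', {});
for k = 1:numel(ix_frame_chunk_start)
    fseek(fid, ix_frame_chunk_start(k) - 1, 'bof');      % offsets are 0-based
    n = ix_frame_chunk_end(k) - ix_frame_chunk_start(k) + 1;
    chunk = fread(fid, [1, n], 'uint8=>char');
    tok = regexp(chunk, xpr, 'tokens');                  % one cell per frame
    for t = 1:numel(tok)
        S(end+1).timestep = str2double(tok{t}{1});
        S(end).xyz = str2double(tok{t}(2:4));
    end
end
fclose(fid);
delete(fname);
```

The result S is a struct array with one element per frame in which the atom was found. Note that this simple expression would run into the next frame if a frame lacked the requested atom, so a real reader needs a stricter pattern or a per-frame search.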
- is two orders of magnitude slower to read position data for one atom
- is more robust to variations in the text file
- Only the regular expression is well commented
- Cheap test to catch anomalies in the text file

- replacing consecutive spaces with one space and trimming trailing spaces in the 3.6GB text file. The trimmed file is half the size of the original file.
- replacing the regular expression with a search using the function strfind and a simpler regular expression.
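The whitespace trimming mentioned above might be done along these lines. This is only a sketch on a short made-up string; how the 3.6GB file was actually trimmed is not stated, and a file of that size would have to be processed block by block.

```matlab
% Made-up sample with runs of spaces and trailing spaces on each line.
str = sprintf('timestep   1   \nO        2\n  1.1     1.2      1.3    \n');

% Collapse every run of two or more spaces into a single space.
str = regexprep(str, ' {2,}', ' ');

% Remove spaces at the end of each line ('lineanchors' makes $ match per line).
str = regexprep(str, ' +$', '', 'lineanchors');

expected = sprintf('timestep 1\nO 2\n 1.1 1.2 1.3\n');
is_trimmed = isequal(str, expected);
```

A single leading space survives on the last data line, since only consecutive and trailing spaces are targeted.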

- fread and
- scan_for_atoms_in_one_chunk_of_frames, i.e. regexp and strfind, respectively.



- The function, cssm, would work well on appropriately sized chunks of data.
- There is a reason The MathWorks has implemented special tools to work with Large Files and Big Data.
- See "Understanding and Using Regular Expressions" by Damian Conway (https://www.infoq.com/presentations/regex) and
- use regex101 for experimenting. I use the PCRE flavor; it's close enough to MATLAB's own flavor.
- is two orders of magnitude slower to read position data for one atom
- is more robust to variations in the text file...
Premature optimization: Without any supporting evidence I assumed that processing larger chunks of the file would be more efficient. Thus, I processed chunks of 50+ frames at a time. That required extra source code.
Now, I have modified the code to process one frame, or part of one frame, at a time. That made it possible to
- remove the strfind( str, 'timestep' ). 'timestep' is now on the first line.
- add the option 'once' to the regexp
Yesterday, RC2 used 12.2 seconds. Now the modified function uses 8.0 seconds when reading entire frames and 5.4 seconds when reading only part of each frame. The end of the part that is read is defined by the string, 'O_spc 14401', which is the first "_spc".
>> tic, S = scan_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 12.190721 seconds.
>> tic, S = scan_dpb_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 8.020505 seconds.
>> tic, S = scan_dpb_CaCO3nH2O( 'c:\tmp\HISTORY_trimmed.txt', {'O'}, {18,6107,11520} ); toc
Elapsed time is 5.415384 seconds.
Of the 5.4 seconds, fread uses 3.8 seconds, regexp 1.2 seconds, and sscanf together with strtok 0.2 seconds.
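To illustrate the one-frame-at-a-time variant, here is a small sketch on a made-up frame. The frame layout, the end marker 'O_spc 3', and the regular expression are assumptions for the example; only the use of regexp with 'once', sscanf, and strtok is taken from the description above.

```matlab
% One made-up frame; 'timestep' sits on the first line.
frame = sprintf('timestep 7\nC 1\n0.1 0.2 0.3\nO 2\n1.1 1.2 1.3\nO_spc 3\n9 9 9\n');

% Keep only the part of the frame before the first "_spc" atom.
ix_end = strfind(frame, 'O_spc 3');
part = frame(1:ix_end(1) - 1);

% The timestep value comes from the first line, no search needed.
first_line = strtok(part, sprintf('\n'));
step = sscanf(first_line, 'timestep %d');

% With 'once', regexp stops at the first match and returns the tokens
% as a flat cell array of char vectors.
tok = regexp(part, '\nO\s+\d+\s*\n\s*(\S+)\s+(\S+)\s+(\S+)', 'tokens', 'once');
xyz = str2double(tok);
```

Because the part ends before the "_spc" atoms, regexp has less text to scan, which is consistent with the drop from 8.0 to 5.4 seconds reported above.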
