File Reading - Skip to next line

dvd7e on 5 Jul 2016
Edited: Stephen23 on 5 Jul 2016
Hello, I have a very large text file (a few gigabytes). The file size is dominated by how much data there is per line, not by the total number of lines. I only need a small fraction of these lines (which are not evenly spaced), and I was previously getting them by calling fgetl() on every line and discarding the data I didn't need. But this has become really slow.
What I want is to be able to read in just the first few characters of a line, and if it matches my search criteria then read in the rest of the line. Otherwise, I don't really care what's in there and I don't want to waste the time of reading it in, so skip to the next line. Can this be done?
fread(fid,[1 10],'*char') seems to handle reading in the first few characters, but then how do I skip to the next line without actually reading in the rest of the data on that line?
  1 Comment
Stephen23 on 5 Jul 2016
Edited: Stephen23 on 5 Jul 2016
@Michael Epstein: you need to think about what a "line" really means: some (zero or more) non-newline characters, separated from the next line by a newline character.
  • Question: How does a program know where the newline characters are?
  • Answer: It has to read all the characters until it finds a newline character.
So what you are proposing is self-contradictory: you want to skip characters (jump to the next newline) by reading all of the characters until the next newline...
There is no simple solution to this for standard text files with arbitrary line lengths. There may be file formats that have some index of the line locations, or that use a fixed line length.
Probably a better solution would be to store the data in a better format to start with (some binary file like a .mat file), or read your data file once, filter the parts that you need, and then save this in a more efficient form (e.g. a .mat file). Designing good data structures and storage is one of the most important steps of program design, but is sadly underrated by many coders, even though it makes a huge difference to program operation and efficiency.
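The "read once, filter, save" idea could look roughly like the sketch below. It is only an illustration: the demo file, the 'KEEP' prefix, and the variable names are made up for the example, and a real file would of course be far larger.

```matlab
% Hypothetical demo file; in practice this would be the multi-gigabyte text file.
srcFile = [tempname '.txt'];
fid = fopen(srcFile, 'w');
fprintf(fid, 'KEEP 1 2 3\nSKIP 4 5 6\nKEEP 7 8 9\n');
fclose(fid);

prefix = 'KEEP';              % hypothetical search criterion
kept = {};                    % lines whose start matches the prefix

% One slow pass over the text file, keeping only the matching lines.
fid = fopen(srcFile, 'r');
line = fgetl(fid);
while ischar(line)            % fgetl returns -1 (not char) at end of file
    if strncmp(line, prefix, numel(prefix))
        kept{end+1} = line;   %#ok<AGROW>
    end
    line = fgetl(fid);
end
fclose(fid);

% Save the filtered subset once; later sessions load this small file instead.
matFile = [tempname '.mat'];
save(matFile, 'kept');
```

Later runs would then only need load(matFile) rather than another pass through the text file.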


Answers (1)

José-Luis on 5 Jul 2016
Edited: José-Luis on 5 Jul 2016
I'm afraid that is beyond MATLAB's I/O routines, and most others. In order to know where the next line begins you need to know where the current line ends, and that means scanning the entire line until you find a newline. If all lines were the same length you could use low-level I/O to skip a fixed number of bytes, but my guess is that the gain, if any, would be minimal.
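For the fixed-length case only, a minimal sketch of the byte-skipping idea is given here. It assumes every record is exactly recLen bytes including the newline; the demo file, the 'AAAA' prefix, and the variable names are hypothetical. It does not apply to files with arbitrary line lengths.

```matlab
% Hypothetical demo file: every line is 9 characters plus one LF = 10 bytes.
recFile = [tempname '.txt'];
fid = fopen(recFile, 'w');
fprintf(fid, '%s\n', 'AAAA 0001', 'BBBB 0002', 'AAAA 0003');
fclose(fid);

recLen = 10;                  % assumed fixed record length in bytes, LF included
prefix = 'AAAA';              % hypothetical search criterion
nHead  = numel(prefix);       % number of characters to inspect per record
matches = {};

fid = fopen(recFile, 'r');
head = fread(fid, [1 nHead], '*char');
while numel(head) == nHead    % a short read means end of file
    if strcmp(head, prefix)
        % Wanted record: read the remainder, then drop the trailing newline.
        rest = fread(fid, [1 recLen - nHead], '*char');
        matches{end+1} = strtrim([head rest]);   %#ok<AGROW>
    else
        % Unwanted record: jump over the remainder without reading it.
        fseek(fid, recLen - nHead, 'cof');
    end
    head = fread(fid, [1 nHead], '*char');
end
fclose(fid);
```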
You could try memmapfile(), but that would mostly help if all you have is numeric data.
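A rough sketch of what the memmapfile() route might look like for text follows. It assumes LF line endings and a file that ends with a newline; the demo file, the 'KEEP' prefix, and the variable names are hypothetical. Note that locating the newlines still scans every byte of the mapped file, so this does not avoid the underlying problem.

```matlab
% Hypothetical demo file standing in for the large text file.
mapFile = [tempname '.txt'];
fid = fopen(mapFile, 'w');
fprintf(fid, '%s\n', 'KEEP 10 20', 'SKIP 30 40', 'KEEP 50 60');
fclose(fid);

prefix = uint8('KEEP');                  % hypothetical search criterion, as bytes
m = memmapfile(mapFile, 'Format', 'uint8');

nl = find(m.Data == 10);                 % position of every LF byte (scans the whole file)
starts = [1; nl(1:end-1) + 1];           % first byte of each line (file ends with LF here)

found = {};
for k = 1:numel(starts)
    s = starts(k);
    e = nl(k) - 1;                       % last byte before the LF
    longEnough = (e - s + 1) >= numel(prefix);
    if longEnough && isequal(m.Data(s:s + numel(prefix) - 1)', prefix)
        found{end+1} = char(m.Data(s:e)');   %#ok<AGROW>
    end
end
clear m                                  % release the mapping
```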
The better approach would be to either modify the file externally in order to get only the lines of interest (e.g. grep) or if you have access to the program that generated the files, then only save what you need.
Also, text files are not the fastest format around.
  1 Comment
Walter Roberson on 5 Jul 2016
If you re-read the same file multiple times, it can be worthwhile to preprocess it once to record the ftell() position of the start of each line. After that you can fseek() to a line, check its first few characters, and fseek() to the next line if you are not interested.
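One way this could be sketched is shown below. The demo file, the 'KEEP' prefix, and the variable names are hypothetical. The first pass still reads the whole file, so the index only pays off when the file is read again later (the offsets could be saved to a .mat file between sessions).

```matlab
% Hypothetical demo file with lines of different lengths.
idxFile = [tempname '.txt'];
fid = fopen(idxFile, 'w');
fprintf(fid, '%s\n', 'KEEP alpha', 'SKIP a much longer line of data', 'KEEP beta');
fclose(fid);

% Pass 1 (done once): record the byte offset at which each line starts.
offsets = [];
fid = fopen(idxFile, 'r');
pos = ftell(fid);
line = fgetl(fid);
while ischar(line)
    offsets(end+1) = pos;     %#ok<AGROW>
    pos = ftell(fid);         % start of the next line
    line = fgetl(fid);
end
fclose(fid);

% Later passes: peek at the first few characters of each line via the index.
prefix = 'KEEP';              % hypothetical search criterion
nHead  = numel(prefix);
wanted = {};
fid = fopen(idxFile, 'r');
for k = 1:numel(offsets)
    fseek(fid, offsets(k), 'bof');
    head = fread(fid, [1 nHead], '*char');
    if strcmp(head, prefix)
        fseek(fid, offsets(k), 'bof');   % rewind to the line start
        wanted{end+1} = fgetl(fid);      %#ok<AGROW>
    end                                  % otherwise the rest of the line is never read
end
fclose(fid);
```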
