How to read data from desired lines of a large data set?

Dear all, I want to read specific lines from a large data set (>50 GB). The file is too big to load in full by simply calling textscan.
What I have come up with so far is:
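fid = fopen('data.dat');
nline = 0; % index of the line just read
wline = 1000:10^7; % the wanted lines (ascending)
i = 1; % index into wline
while ~feof(fid) && nline < max(wline) % stop at end of file or after the last wanted line
    ldata = fgets(fid);
    nline = nline + 1;
    if nline == wline(i)
        datas(i,:) = sscanf(ldata, '%f').'; % parse the 4 numeric columns of this line
        i = i + 1;
    end
end
fclose(fid);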
As you can see, this loop is very time-consuming. My questions are:
1. Is there a function that reads the file faster (on a Unix system)?
2. Is it possible to use a file pointer, so that only the desired lines are read?
Thank you,
George
The data set has 10^9 lines and 4 columns:
0 0 0 0.5
0 0.05 200.05 1 ...

Answers (1)

José-Luis on 5 Oct 2012
Edited: José-Luis on 5 Oct 2012
That is one big chunk of data. I have several suggestions:
  • Preallocate: in your code you are growing datas at each iteration. Preallocate using, e.g.
datas = zeros(numLines,4);
This might not be a viable option if you want to allocate a 10^9 x 4 matrix, which needs about 32 GB as doubles.
  • Split your data into several chunks that you can read when needed. Look at the Unix split utility.
  • Use a database.
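The preallocation and chunking suggestions above can be combined. The following is only a minimal sketch, assuming a whitespace-separated file with 4 numeric columns as in the question. The demo file name, the line range, and chunkSize are hypothetical values chosen so the example runs on its own. It skips the leading lines once, then calls textscan repeatedly with a repeat count so that only a small block is held in memory per call.

```matlab
% Build a small demo file so the sketch is self-contained (hypothetical name).
fname = fullfile(tempdir, 'demo_data.dat');
fid = fopen(fname, 'w');
fprintf(fid, '%g %g %g %g\n', 1:80);   % 20 lines, 4 columns
fclose(fid);

firstLine = 5;    % first wanted line (1-based), assumed for the demo
lastLine  = 16;   % last wanted line, assumed for the demo
chunkSize = 4;    % lines per textscan call; tune to available memory

nWanted = lastLine - firstLine + 1;
datas = zeros(nWanted, 4);             % preallocate the result once

fid = fopen(fname, 'r');
for k = 1:firstLine-1
    fgetl(fid);                        % skip unwanted leading lines
end

nRead = 0;
while nRead < nWanted
    n = min(chunkSize, nWanted - nRead);
    % Apply the format n times; the next call resumes where this one stopped.
    C = textscan(fid, '%f %f %f %f', n, 'CollectOutput', true);
    block = C{1};
    if isempty(block)
        break;                         % reached end of file early
    end
    rows = size(block, 1);
    datas(nRead+1 : nRead+rows, :) = block;
    nRead = nRead + rows;
end
fclose(fid);
datas = datas(1:nRead, :);             % trim if the file was shorter than expected
```

For the real file, chunkSize would typically be much larger (for example 10^5 or 10^6 lines), and each block could be processed or written out instead of being kept in datas.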
If you want to read just one line and know its exact position (in bytes from the beginning of the file), you can jump straight to it with fseek.
  2 Comments
George on 5 Oct 2012
Thank you for your helpful suggestions, José.
The problem is that the number of bytes varies from line to line, which makes it difficult to calculate the exact position.
Again, thank you. George
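When line lengths vary, one possible workaround is to scan the file once, record byte offsets with ftell, and use fseek for later lookups. This is a rough sketch under stated assumptions, not something proposed in the thread. The demo file, the stride value, and the variable names are hypothetical. Only every stride-th line start is stored so that the index stays small for a file with 10^9 lines.

```matlab
% Demo file with lines of different lengths (hypothetical name and content).
fname = fullfile(tempdir, 'demo_varlen.dat');
fid = fopen(fname, 'w');
fprintf(fid, '0 0 0 0.5\n0 0.05 200.05 1\n12345.678 2 3 4\n7 8 9 10\n');
fclose(fid);

stride = 2;   % store the offset of lines 1, 1+stride, 1+2*stride, ...

% Pass 1 (done once): record where the indexed lines start.
fid = fopen(fname, 'r');
offsets = [];                 % for a huge file, preallocate or grow in blocks
nline = 0;
while true
    pos = ftell(fid);         % byte offset of the line about to be read
    tline = fgetl(fid);
    if ~ischar(tline)
        break;                % fgetl returns -1 at end of file
    end
    nline = nline + 1;
    if mod(nline - 1, stride) == 0
        offsets(end+1) = pos; %#ok<AGROW>
    end
end
fclose(fid);

% Later lookups: jump to the nearest indexed line, then step forward.
wanted = 4;                                   % wanted line number (1-based)
idx = floor((wanted - 1) / stride) + 1;       % index entry at or before it
fid = fopen(fname, 'r');
fseek(fid, offsets(idx), 'bof');
nSkip = (wanted - 1) - (idx - 1) * stride;    % lines between entry and target
for k = 1:nSkip
    fgetl(fid);
end
row = sscanf(fgetl(fid), '%f').';             % parse the wanted line
fclose(fid);
```

The first pass still has to read the whole file, so this only pays off when the same file is queried more than once. In that case offsets could be saved to a MAT-file and reused.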
