read small selection of data from large file
I have several large .csv files (up to around 8 GB; the arrays have about 10^5 rows and up to 15k columns) from which I would like to read data. Most of these read operations will only pull 1000 to 10000 data points at a time (generally just a single row of data or a subset of a row). However, dlmread seems to be doing something inefficient, since each read operation takes several minutes. Is there a lower-level read function that can do this significantly faster? (It really needs to be orders of magnitude faster; even a 2x speedup won't cut it.) Should I use another format for the data? I thought about building a MySQL database for it, but I have no experience with this. Is MATLAB even the right environment for this sort of thing? Thanks in advance.
Josh
Accepted Answer
Ashish Uthama
on 25 May 2011
If you have the option to change the source, or if you plan to use the data over and over again, it might be best to change the format to plain binary, i.e., use fwrite to write the values out as doubles rather than as a text format like CSV. (Unless, of course, each line has a varying number of entries and this structure is integral to your data.)
This would probably be the fastest approach, since it is the simplest. The file size might also be smaller.
You will be able to index into the file to read subsets much more easily: you can compute the byte offset of the (i,j)th element directly, since you know exactly how much space a single double takes in a binary file.
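A minimal sketch of this idea, assuming a fixed number of columns and hypothetical file names 'data.csv' and 'data.bin' (for files too large to load at once, the one-time conversion would need to be done in chunks, e.g. with textscan):

```matlab
% One-time conversion: CSV -> flat binary of doubles.
A = dlmread('data.csv');
[nrows, ncols] = size(A);
fid = fopen('data.bin', 'w');
fwrite(fid, A', 'double');   % transpose so each row is contiguous on disk
fclose(fid);

% Later: read only row i, without touching the rest of the file.
i = 42;
fid = fopen('data.bin', 'r');
fseek(fid, (i-1)*ncols*8, 'bof');   % 8 bytes per double
row = fread(fid, ncols, 'double')';
fclose(fid);
```

Because the data is written row-major here, a single row is one contiguous fread; reading a subset of a row just means adjusting the fseek offset and the count passed to fread.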