How to import big data CSV files

Hi,
Similar questions have already been asked, but I wanted to know if there is an alternative way of importing. I have a CSV file with 3.5 million rows and 56 columns. At present, while importing, I have to select the range I need: for instance, I am selecting only 36 columns and, barring a few rows, almost all of the rows. I tried textscan(), but I was unable to achieve what I want, which is for my code to select only the rows and columns needed without importing everything. My main aim is to save importing time, and since textscan() is pretty fast, I was using that. My column values are numeric.
Please let me know if I need to add anything here. I have 8 GB of RAM, so I think it should be easily feasible.
Thanks in advance!

 Accepted Answer

Walter Roberson
Walter Roberson on 25 May 2019
Edited: per isakson on 25 May 2019
These days, the most convenient way is often to use detectImportOptions(), set its SelectedVariableNames property to choose particular columns, and then call readtable() -- or, as of R2019a, readmatrix() if your data is purely numeric.
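A minimal sketch of that approach, written against a small temporary file so it is self-contained; substitute your own file name and column indices:

```matlab
% Illustrative only: write a small numeric CSV to a temp file,
% then import just a subset of its columns.
fname = [tempname '.csv'];
writematrix(magic(5), fname);                     % 5 numeric columns (R2019a+)

opts = detectImportOptions(fname);
opts.SelectedVariableNames = opts.VariableNames([1 3 5]);  % pick columns 1, 3, 5
T = readtable(fname, opts);                       % table with only those columns

M = readmatrix(fname);                            % or the whole file as a matrix
delete(fname);
```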
textscan() is also a possibility. The easiest way to use it might be something like:
numcol = 56;
wanted_cols = [5 17:33 44:47]; %adjust to suit
fmt_cell = repmat('%*s', 1, numcol);
fmt_cell(wanted_cols) = {'%f'};
fmt = [fmt_cell{:}];
data = cell2mat( textscan(fid, fmt, 'Delimiter', ',', 'CollectOutput', 1) );

17 Comments

Hi, thank you for the answer.
It is showing me an error that conversion to char from cell is not possible. I am sorry for asking such basic questions; not only am I new to MATLAB, I am also new to coding and a bit of a slow learner. Also, where and how should I declare my CSV file? Thank you.
fmt_cell = repmat({'%*s'}, 1, numcol);
Abhishek Singh
Abhishek Singh on 25 May 2019
Edited: Abhishek Singh on 25 May 2019
Thank you, and should I call my CSV file using fid = fopen()?
I did that, but I am getting NaN in my cells.
fid = fopen('YourFileNameGoesHere.csv', 'rt');
If you are getting NaN in your cells, then you have not selected numeric columns in your list of wanted_cols.
It would help if you were to post a short extract of the file so we can see the structure of it.
Yes, I am doing the same. I cannot open the CSV file, so I am not 100 percent sure it is comma separated; it may also be colon separated. However, here is an example of the first few columns, which I looked at by importing. I am also sorry not to have told you that, along with the columns, I also have to select particular rows, like from row 36 to the end.
0.294118 -0.661765 0.735294 11.9853
0.294118 -0.78441 0.735294 11.9853
0.294118 -0.78441 0.735294 11.9853
0.294118 -0.78441 0.735294 11.8626
0.294118 -0.78441 0.735294 11.7403
0.294118 -0.906767 0.735294 11.6176
numcol = 56;
rows_to_skip = 35;
wanted_cols = [5 17:33 44:47]; %adjust to suit
fmt_cell = repmat({'%*s'}, 1, numcol);
fmt_cell(wanted_cols) = {'%f'};
fmt = [fmt_cell{:}];
data = cell2mat( textscan(fid, fmt, 'Delimiter', ',', 'HeaderLines', rows_to_skip, 'CollectOutput', 1) );
Abhishek Singh
Abhishek Singh on 25 May 2019
Edited: Abhishek Singh on 25 May 2019
I made the changes. Now the data has the required matrix length, but I am still getting NaN values. Could it be because my CSV file is not comma separated but perhaps colon separated? If that is the case, what should I do? Or could it be because the first few rows, which I am removing, contain some characters?
Edit: I replaced the comma with a colon and got the output. Many thanks, friend. One more question: is there a way to select rows and columns with some conditions, like the last-but-one row, or to select rows from where the numbers start?
"last but one row" would be data(end-1,:)
If you want it to detect where the numbers start, the easiest way is to use readtable like I described earlier.
However, you could also consider using rmmissing(data)
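For example, rmmissing() drops any row containing NaN, which is what non-numeric lines become after a numeric import (the values here are made up for illustration):

```matlab
% A NaN row is what a text line turns into after a numeric import.
A = [0.294118 -0.661765; NaN NaN; 0.294118 -0.784410];
B = rmmissing(A);   % keeps only the rows with no missing values
% B is the 2x2 matrix [0.294118 -0.661765; 0.294118 -0.784410]
```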
I was trying to extend this to exclude the first 35 rows, as you have done, and also the two end rows. rows_to_skip = 35,(end),(end-1) doesn't work because it says "end" is reserved. What would you suggest I do?
Also, do you think readtable and rmmissing (to remove the rows with characters) would be as time efficient as textscan()? If they would be slow, I may stick to the manual option.
I think they would be time efficient enough.
Do not change the rows to skip. Instead after you have read the data you can use
data(end-1:end, :) = [] ;
Abhishek Singh
Abhishek Singh on 30 May 2019
Edited: Abhishek Singh on 30 May 2019
Great, thank you so very very much.
Could you please also help me with the readtable or should I ask that as a different question?
wanted_cols = [5 17:33 44:47]; %adjust to suit
filename = 'YourFileName.csv';
opt = detectImportOptions(filename);
varnames = opt.VariableNames;
opt.SelectedVariableNames = opt.SelectedVariableNames(wanted_cols);
datatable = readtable(filename, opt);
datatable(end-1:end,:) = [];
I will show you a little bit of the data here. The first 4 rows below are rows 34:37, so I need to read column 1 to know that those rows contain characters, then start from row 37 where the numbers begin, and include only columns 2:37. The next 4 rows are the last 4 rows of the data; again I have to read the first column and delete the rows whose first entry is a character, which is why I was also deleting the last two rows. I could do all of this with the code you have helped me with, but for that I need to know the structure of the file. I somehow want to automate it, so that even if I do not know the structure, I can just run the code and get the numbers in the workspace.
ChannelAttributes 0 0 0
ChannelAttributesUseable 1
0 -0.80882 0 -8.67647
1 -0.88235 0 -8.82353
61094 -18.0882 0 -24.6324
61095 -18.0882 0 -24.6324
ChannelPhysicalInputPort 0 1 2
ChannelPhysicalInputHardware 1 1 1
And where should I define that I do not need the first 35 rows?
I tried this, and it shows me the error "Index exceeds the number of array elements" at the line
opt.SelectedVariableNames = opt.SelectedVariableNames(wanted_cols);
It would have helped if you had given this sample much earlier.
This is not a csv file, and does not have 56 columns.
I would not recommend using readtable() or similar for it.
For this kind of work, either write a custom parser using fgetl() / sscanf() / regexp(), or else read the entire file at once using fileread() and break it apart using tools such as regexp().
Sometimes it helps to use fgetl() / sscanf() to get through headers, until you hit a regular block of data, and then use textscan() in a way that will stop reading when it reaches the point after the data, and then go back to other parsing methods.
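A self-contained sketch of that pattern, using a small temporary file shaped like the extract above (the file contents and the four-column format string are illustrative only):

```matlab
% Write a small file with text lines surrounding a numeric block.
fname = [tempname '.txt'];
fid = fopen(fname, 'wt');
fprintf(fid, 'ChannelAttributes 0 0 0\n');
fprintf(fid, '0,-0.80882,0,-8.67647\n1,-0.88235,0,-8.82353\n');
fprintf(fid, 'ChannelPhysicalInputPort 0 1 2\n');
fclose(fid);

% Skip header lines with fgetl() until one starts like a number.
fid = fopen(fname, 'rt');
while true
    pos = ftell(fid);                       % remember where this line starts
    line = fgetl(fid);
    if ~ischar(line), break, end            % reached end of file
    if ~isempty(regexp(line, '^\s*[-+0-9.]', 'once'))
        fseek(fid, pos, 'bof');             % rewind to the first numeric line
        break
    end
end

% textscan() stops by itself at the first line that does not match.
blk = textscan(fid, '%f%f%f%f', 'Delimiter', ',', 'CollectOutput', 1);
data = blk{1};                              % here: a 2x4 numeric block
fclose(fid);
delete(fname);
```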
Yes, I agree that I should have shown this before. But it is a CSV file; I have just shared the first 4 columns, because the rest all contain numbers like those from column 2 onwards.
If the method you are suggesting is longer and would take more time to run, I would possibly stick with what we have.
Shrug. You can get a wrong and unusable input quickly, or you can get a correct and usable input more slowly.
The extract you showed is not a csv file. csv files are, by definition, Comma Separated Values, never whitespace separated. RFC4180 specifically says that spaces are to be considered part of a field, not a separator.
It is not difficult to extract from a file only the lines that start with numbers. It is not all that much more difficult to divide a file into blocks of numbers, so that non-number marks the end of a block and the blocks are to be imported separately. But as soon as you start wanting to extract particular information from the non-number sections and associating it with a block of numbers, the task becomes more complicated.
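As an illustration of the first, simpler case (keeping only the lines that start with a number), here is a sketch that works on an in-memory string; with a real file you would obtain the text via fileread():

```matlab
% Sample text standing in for the contents of a file.
txt = sprintf('Header a b\n1 2 3\n4 5 6\nFooter x\n');

lines = regexp(txt, '[^\n]+', 'match');                        % split into lines
isnum = ~cellfun('isempty', regexp(lines, '^\s*[-+0-9.]', 'once'));
nums  = cellfun(@(s) sscanf(s, '%f').', lines(isnum), 'UniformOutput', false);
data  = vertcat(nums{:});                                      % [1 2 3; 4 5 6]
```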
I am sorry. Importing into MATLAB showed it in columns, and since the file was in CSV format I thought it must be comma separated. But I do understand your point. I think I will stick to the first plan of using textscan(), since it was also efficient.
