Importing large .csv by READTABLE gives wrong results
Show older comments
Hello everyone !
I have a lot of .csv files (example attached). I need to extract column F and G from them. In col F some cells are empty and in col G it's reverse.
My script is as follows:
day = input('Give day ','s');
month = input('Give month ','s');
year = '2023';
filename = ['L:\XT4Comb_16_' [year month day] '.csv'] % I have a lot of these files with subsequent dates in their names
tablica = readtable(filename,'Delimiter',',','ReadVariableNames',true,'HeaderLines',0,'Format','%*s%*s%*s%*s%*s%s%s%*s%*s%*s%*s%*s%*s%*s%*s%*s%*s'); %% column F and G only
Of course it's just the most important part of the whole script
The problem is:
If a file is small, like that one I attached, there is no problem to get the same values (in F column) as in .csv.
But if a file is larger (like 35MB or larger) values are mixed and they don't match the original .csv file.
Why is that ? Can you tell me what am I doing wrong ?
Many thanks in advance !!!
7 Comments
Jan
on 17 Jul 2023
Please explain, what "values are mixed and they don't match the original .csv file" means explicitly.
Weronika Huss
on 17 Jul 2023
Walter Roberson
on 18 Jul 2023
Edited: Walter Roberson
on 18 Jul 2023
If the file exceeded 1048576 rows then it would exceed the design limit for Excel... though I am not sure if that would affect reading csv files using readtable unless UseExcel had been set true.
Is that something you might need to be concerned about? Well your existing file is 2774729 bytes for 46821 rows (plus one header row), so for 35 megabytes you would expect roughly rows / bytes * 35e6 bytes in larger file
format long g
46821 / 2774729 * 35e6
which is about half of the limit... so you probably are not hitting that limit
Walter Roberson
on 18 Jul 2023
Your code would have a problem if there are any lines that do not have exactly 17 entries. I would recommend using readtable() without a format, but using a Range 'F:G' specification. I would also suggest using detectImportOptions and using setvartype . In the sample file, F is sometimes empty but never text, so I would suggest setting it to double and allowing the empty entries to show up as NaN, but to set G to string and to use setvaropts for G to make QuoteRule 'keep' for consistency with your %s format... though a lot of the time it is easier to let QuoteRule be 'remove' the same as if you had used %q format.
Cris LaPierre
on 18 Jul 2023
Can you share some screenshots of a file where this does happen? Find the first row in MATLAB that does not match the csv file, and then take a screen shot of the table in MATLAB and the corresponding rows in the csv file (include a few rows before).
Error using readtable (line 223)
Reading failed at line 17693. All lines of a text file must have the same number of delimiters. Line 17693 has 20 delimiters, while preceding lines have 16.
There are multiple lines which have extra delimiters or different content
T = fileread(filename); % read the whole file
T = split(T);
D = cellfun(@(x) nnz(x == ','),T) % find number of delimiters
badidx = find(D~=16) % find lines with nonstandard number of delimiters
badlines = T(badidx) % print those lines
numdelim = D(badidx) % number of delimiters on those lines
badidx =
7256
11005
17693
46822
badlines =
4×1 cell array
{'1683916462929,53.53142,10.00707,0,356.7,,"TRANSPORT",,,,"[(4,114),(8,11.5),(12,15.5),(16,16.5),(20,14.5),(24,4.5),(28,3.5),(32,5),(36,2),(40,2),(44,0.5),(48,2),(52,0.5),(56,4),(60,0),(64,2.5),(68,0),(72,0),(76,0),(80,0.5),(84,1),(88,0),(92,0),(96,1),(100,0.5),(104,1),(108,0),(112,0),(116,0.5),(120,0),(124,0),(128,0),(132,0.5)]","[(4,36154.5),(8,9495),(12,3450.5),(16,2616),(20,1320),(24,1005.5),(28,552),(32,470),(36,249),(40,219),(44,136),(48,108),(52,75.5),(56,54.5),(60,30),(64,37.5),(68,18.5),(72,17),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,3.5),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000754496,2.13207e-05'}
{'1683916597744,53.53086,10.00883,0,52.7,,"TRANSPORT",,,,"[(4,30),(8,8),(12,6),(16,3.5),(20,1),(24,0.5),(28,3.5),(32,0),(36,1.5),(40,0),(44,1),(48,1.5),(52,0.5),(56,0),(60,0),(64,0.5),(68,0),(72,1),(76,0),(80,0),(84,0),(88,0),(92,0),(96,0),(100,0),(104,0),(108,0.5)]","[(4,36184.5),(8,9503),(12,3456.5),(16,2619.5),(20,1321),(24,1006),(28,555.5),(32,470),(36,250.5),(40,219),(44,137),(48,109.5),(52,76),(56,54.5),(60,30),(64,38),(68,18.5),(72,18),(76,13),(80,7.5),(84,10.5),(88,7),(92,7.5),(96,9),(100,7),(104,8),(108,4),(112,4.5),(116,4),(120,1),(124,2),(128,3.5),(132,2.5),(136,0.5),(140,0.5),(144,0.5)]",3170,401,-760,0.000759143,4.54643e-06' }
{'1683916837559,53.53370,10.00302,0,451683996200710,,,0,0,149.255,,,,,,,,,,,' }
{0×0 char }
numdelim =
152
140
20
0
I would say that line 17693 is where logging got interrupted and a new entry was appended in the middle of a partial line.
Walter Roberson
on 18 Jul 2023
%q format should handle the commas but %s format would not.
Answers (1)
Abhas
on 9 Oct 2024
One way to fix this issue is to adjust the formatting of the CSV file. Some lines have an inconsistent number of comma delimited fields. Ensuring that each line has the same number of comma delimited fields will help the import process progress successfully.
Alternatively, you can apply the import options from the version of the CSV file that works correctly to the version of the file with the additional lines. This would involve code that looks like this:
opts = detectImportOptions('<Working CSV Filename>');
originalTable = readtable('<Working CSV Filename>', opts); % This reads the original table
newTable = readtable('<Longer CSV Filename>', opts); % This reads the modified table with additional lines with the original formatting options.
You may refer to the MathWorks documentation link below to know more about "detectImportOptions" :
If the files aren't the same format exactly, they can pass a detection hint to detectImportOptions. In this case, disable header line detection by providing a value.
opts = detectImportOptions('<Longer CSV Filename>', 'NumHeaderLines',0)
More information on the "delimitedTextImportOptions" created by "detectImportOptions" on a CSV can be found by referencing the link below:
I hope this helps!
Categories
Find more on Text Files in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!