Using readtable on a csv with missing columns

30 views (last 30 days)
Hey all,
I'm writing matlab script to load a number of CSV files into matlab to process them.
The last column of these CSVs is a notes column, which is often left blank. The software we use to edit our CSVs sometimes doesn't even fill in the last comma, like so:
colA,colB,colC,Notes
1000,2000,3000
1100,2200,3300
1110,2220,3330
1111,2222,3333
My ideal output would be
colA colB colC Notes
____ ____ ____ ____
1000 2000 3000 NaN
1100 2200 3300 NaN
1110 2220 3330 NaN
1111 2222 3333 NaN
I'm currently using readtable to load this data in, and have tried a few different parameters. Unfortunately, they all seem to fall short.
tab = readtable("tt.csv", 'MissingRule', 'fill');
Var1 Var2 Var3
____ ____ ____
1000 2000 3000
1100 2200 3300
1110 2220 3330
1111 2222 3333
tab = readtable("tt.csv", 'ReadVariableNames', true)
x1000 x2000 x3000
_____ _____ _____
1100 2200 3300
1110 2220 3330
1111 2222 3333
Note how both examples are missing the variable names (they should be named with col instead of var), and they're both missing the 4th column.
Any ideas? I could just write a helper script that goes around adding extra commas, but that seems like a lot of extra work for what might just be a setting.
Thanks!
  2 Comments
Voss
Voss on 3 Sep 2024
Can you upload (using the paperclip button) at least two of these CSV files - at least one that has a Notes column and at least one that does not?
Samuel
Samuel on 3 Sep 2024
@Voss uploaded 3 files for you:
  • tt.csv: the original, failing case
  • tt_no_notes.csv: the Notes column in the header has been entirely deleted
  • tt_added_commas.csv: added commas after each line, the table is now read as intended

Sign in to comment.

Accepted Answer

dpb
dpb on 3 Sep 2024
Edited: dpb on 3 Sep 2024
f=["colA,colB,colC,colD"
"1000,2000,3000"
"1100,2200,3300"
"1110,2220,3330"
"1111,2222,3333"];
writelines(f,'tt.csv')
type tt.csv
colA,colB,colC,colD 1000,2000,3000 1100,2200,3300 1110,2220,3330 1111,2222,3333
opt=detectImportOptions('tt.csv','filetype','text','readvariablenames',1,'Range',1)
opt =
DelimitedTextImportOptions with properties: Format Properties: Delimiter: {','} Whitespace: '\b\t ' LineEnding: {'\n' '\r' '\r\n'} CommentStyle: {} ConsecutiveDelimitersRule: 'split' LeadingDelimitersRule: 'keep' TrailingDelimitersRule: 'ignore' EmptyLineRule: 'skip' Encoding: 'UTF-8' Replacement Properties: MissingRule: 'fill' ImportErrorRule: 'fill' ExtraColumnsRule: 'addvars' Variable Import Properties: Set types by name using setvartype VariableNames: {'colA', 'colB', 'colC' ... and 1 more} VariableTypes: {'double', 'double', 'double' ... and 1 more} SelectedVariableNames: {'colA', 'colB', 'colC' ... and 1 more} VariableOptions: [1-by-4 matlab.io.VariableImportOptions] Access VariableOptions sub-properties using setvaropts/getvaropts VariableNamingRule: 'modify' Location Properties: DataLines: [2 Inf] VariableNamesLine: 1 RowNamesColumn: 0 VariableUnitsLine: 0 VariableDescriptionsLine: 0 To display a preview of the table, use preview
t=readtable('tt.csv',opt)
t = 4x4 table
colA colB colC colD ____ ____ ____ __________ 1000 2000 3000 {0x0 char} 1100 2200 3300 {0x0 char} 1110 2220 3330 {0x0 char} 1111 2222 3333 {0x0 char}
NOTA BENE: If the notes column ever does contain a note, then reading will fail if try to read as numeric.
ERRATUM: I think there is a failure in that setting only 'ReadVariableNames',1 didn't read the first record as the variable names but skipped it and read the first data row instead. I would submit this as a bug/support request. Inferring import options from a general file of almost any description is a tough job; detectImportOptions continues to evolve; example cases that don't work as expected are always good fodder for enhancements.
That the file is malformed in missing the delimiter to always have the fourth column is an excuse, but it is a fairly common thing with csv files.
  2 Comments
Samuel
Samuel on 3 Sep 2024
I had never seen the Range argument used like that, so I gave it a shot:
t = readtable("tt.csv", 'Range', 1)
t =
4×4 table
colA colB colC colD
____ ____ ____ ____
1000 2000 3000 NaN
1100 2200 3300 NaN
1110 2220 3330 NaN
1111 2222 3333 NaN
So, uh... looks like that works?
I looked over the documentation and I can't figure out how it's working, but it certainly seems to be outputting the right thing.
Thanks so much!
dpb
dpb on 3 Sep 2024
Edited: dpb on 3 Sep 2024
The how/why is compare the import options structure with and without it...the comment about 'ReadVariableNames' is that without the range value for some reason the first line was skipped as it should have been for data, but it was not read as the variable names per the documented behavior, but the second line was as your example above. That's what I think is worthy of a bug report.
To work around that, I forced it to begin reading at the first row.
What I didn't try was
f=["colA,colB,colC,colD"
"1000,2000,3000"
"1100,2200,3300"
"1110,2220,3330"
"1111,2222,3333"];
writelines(f,'tt.csv')
%type tt.csv
%opt=detectImportOptions('tt.csv','filetype','text','readvariablenames',1,'Range',1)
t=readtable('tt.csv','readvariablenames',1,'Range',1)
t = 4x4 table
colA colB colC colD ____ ____ ____ ____ 1000 2000 3000 NaN 1100 2200 3300 NaN 1110 2220 3330 NaN 1111 2222 3333 NaN
That also works at least on this example without the extra step of the import options object.

Sign in to comment.

More Answers (0)

Products


Release

R2021b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!