Reading csv files ignoring header info and text in between

45 views (last 30 days)
HI,
I have two types of CSV files which i extract from a measurement instrument.
The first one(Dirty_Complex_Condensed.csv) , attached with this question has
  • Header text data which i would like to ignore(lines 1-5)
  • Some integers which i would like to not ignore(lines 6-34)
  • Complex decimal data which i would like to not ignore(lines 21-35) or really need.(measurement)
The second one(Dirty_Supplement_Condensed.csv) also attached has
  • Header text data which i would like to ignore(lines 1-5)
  • Some integers which i would like to not ignore(lines 6-34) or really need(measurement)
  • Header text data which i would like to ignore(lines 21-27)
  • Some integer which i would like to not ignore(lines 9-34) and append to the earlier integer data read(lines 6-34).(measurement)
Also note that the location(line number) of these header and measurement data is not always constant because it depends on the type of conditions of the data being mesaured. So how do i take that consideration and make my code flexible so that i am not hard-coding the 'row' or 'line number' for different csv read in?
I am attaching 'Clean_Complex_Condensed.csv' and 'Clean_Supplement_Condensed.csv' for your reference on how i would like the data be eventually as i read them in. The number of rows in the imported 'Cleaned' version of the csv's should always be the same between the two files.
Currently, i am manually going in and deleting these header text data and then using 'csvread' command , but undoubtedly, this is too time consuming.
I have checked the following previous links which are relevent but not exactly or quite close to what am trying to accomplish here.
I guess the most powerful way is to use 'regex' but before i break my head further i wanted to know if any of the clever cogs MVPs /MathWorks staff had any fast and bright suggestions for a relatively simple question i have.
Thanks,
IP

Accepted Answer

Allen
Allen on 25 May 2021
Iroquois,
The line numbers that you are referencing do not quite match up with your descriptions, and is making your needs a bit hard to decipher. However, to possibly get you started I have provided the following snippets.
1st: Start by reading in a portion of the data to enough lines to capture all of the splits in data type. This is quicker than reading the entire file, especially for large data files. The example below assumes that the last change in data type will occur before line 100 (corresponds to row number in Excel).
% Opens the *.csv file and read the first 100 lines only
file = "path\filename for your data file";
fid = fopen(file);
C = textscan(fid,"%s",100,"delimiter","\n"); % Should be a cell-array of cells. Will only need C{1}
C{1} should be a cell-array containing lines of text, which also includes numerical values as text.
2nd: Run a number of searches on C{1} to find the row indices where data changes to the type(s) you want and the type(s) you do not want. The following example looks for and returns line number that contain the text "DataLabelIdxs". Then looks for the first line containing single column values. In the Dirty_Complex_Condensed.csv file you provided this finds the first line after your integer values.
txt = "DataLabelIdxs";
index1 = find(contains(C{1},txt),1);
% Last integer line from this example dataset is also the last line to not
% contain a comma.
index2 = index2 = find(~contains(C{1}(index1+1:end),","),1,"last");
Applying additional searches will help you find other text lines below this and when you change from integer to floating doubles, but will be a bit more complicated. You may also want to program what to do if a search returns an empty value.
3rd: Rewind the FID line counter and use your indices to read the sets of values you wish to keep. The following simply reads in values up to the line as integers starting after the header lines marked by index1
frewind(fid)
% Converts the cell-arrays of numbers that are being read to numeric-arrays
intValues = cell2mat(textscan(fid,"%d",index2,"headerlines",index1));
dblValues = cell2mat(textscan(fid,"%f%f"));
% Close FID
fclose("all");
You may need to use the frewind function multiple times depending on how you generate your indices, or if you are reading all integer values first and have floating double values between sets of your integers.
Also, if it is possible that the floating double lines may contain more than two columns of data and again is a variable number of columns you can use the following to generate a variable-sized expression for textscan.
% Assuming index is a row along the data containing the floating double
% values.
expr = strcat("%f",repmat("%f",[1,sum(C{1}{index}==',')]));
% Then you can use textscan with this variable expression input similarly
% to before.
dblValues = cell2mat(textscan(fid,expr))
Hope this helps.
-Allen

More Answers (0)

Products


Release

R2020b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!