Need to open a large .dat file to read and then plot it; the problem is that MATLAB is not reading all the lines when opening and reading the file

Question

Carolina Corella Velarde on 23 Nov 2022

Answered: Walter Roberson on 29 Nov 2022

The .dat file contains 42672 lines of data that need to be read and plotted. This is the code I have so far:

fileID=fopen('F:\code\matlab\GVRs\data\met\met_backup\full_met_files\met_CR1000_met_all.dat','r');
while ~feof(fileID)
    tline = fgetl(fileID);
    S = textscan(fileID,'%s %d %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f', 'HeaderLines', 4, 'Delimiter', ','); % 'HeaderLines' skips the 4 header rows; 'Delimiter' ',' makes textscan split fields on commas rather than whitespace
    newChr = strrep(S{1},'"','' );
    disp(newChr);
    date =strrep(newChr,'''',''); %get rid of single quote
    DateString = date;
    formatIn = 'yyyy-mm-dd HH:MM:SS';
    DateNumber = datenum(DateString,formatIn);
end

I added fgetl(fileID) because originally I had only the textscan line shown below, but with it not all lines of the file get read: MATLAB reads only around 400 lines.

fileID=fopen('F:\code\matlab\GVRs\data\met\met_backup\full_met_files\met_CR1000_met_all.dat','r');
S = textscan(fileID,'%s %d %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f', 'HeaderLines', 4, 'Delimiter', ','); % 'HeaderLines' skips the 4 header rows; 'Delimiter' ',' makes textscan split fields on commas rather than whitespace

4 Comments
Walter Roberson on 29 Nov 2022

fgetl() reads one line from the file.

There is no difference in final file position or textscan output between

fgetl(fileID);
textscan(fileID, format, 'HeaderLines', 4)

compared to

textscan(fileID, format, 'HeaderLines', 5) %with no fgetl
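
The equivalence can be checked with a small sketch. The demo file, its contents, and the variable names below are made up for illustration; they are not the asker's data.

```matlab
% Hypothetical demo file: 5 header lines followed by 3 comma-separated rows.
demoFile = [tempname '.dat'];
fid = fopen(demoFile, 'w');
fprintf(fid, 'h1\nh2\nh3\nh4\nh5\n');
fprintf(fid, '%d,%d\n', [1 2 3; 10 20 30]);   % writes 1,10 / 2,20 / 3,30
fclose(fid);

fmt = '%f %f';

% Variant 1: consume one line with fgetl, then skip 4 header lines.
fid = fopen(demoFile, 'r');
fgetl(fid);
A = textscan(fid, fmt, 'HeaderLines', 4, 'Delimiter', ',');
posA = ftell(fid);
fclose(fid);

% Variant 2: skip all 5 lines with 'HeaderLines' alone.
fid = fopen(demoFile, 'r');
B = textscan(fid, fmt, 'HeaderLines', 5, 'Delimiter', ',');
posB = ftell(fid);
fclose(fid);
delete(demoFile);

% Compare both the parsed output and the final file position.
same = isequal(A, B) && (posA == posB);
disp(same);
```

So placing fgetl in front of textscan only shifts which lines count as headers; it does not make textscan read further into the file.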

Carolina Corella Velarde on 29 Nov 2022

I tried this, but it only reads the beginning portion of the file, around 400 data points, and not the rest. In total I need MATLAB to read and process 42668 data points.

Answer 1

Walter Roberson on 29 Nov 2022

    S = textscan(fileID,'%s %d %f %f %f %f %f %f %f %f %f %f %f %f %f %f %f', 'HeaderLines', 4, 'Delimiter', ','); % 'HeaderLines' skips the 4 header rows; 'Delimiter' ',' makes textscan split fields on commas rather than whitespace

For all format items except %c, the first thing that is done when a % format is encountered is to examine the current character in the file and skip any character that appears in the Whitespace list or EndOfLine list; if the 'MultipleDelimsAsOne' option is true, then any character in the Delimiter list is also skipped.

By the time the processing of the format item itself starts, except for %c, the file will be positioned at a non-Whitespace character; if MultipleDelimsAsOne is false, it might be positioned at a Delimiter character.

A %s format will read from that (non-whitespace) character until it encounters something in Delimiter or in EndOfLine, or reaches end of file, "consuming" the (first) delimiter or end-of-line character and leaving the file positioned immediately after it. Note that %s does not specifically look for non-numeric characters: %s is entirely happy to accept (for example) '1984' as text. The only way that %s can fail to match is if end of file was reached; the skipping of initial whitespace and end of line would have zipped through any empty lines and any leading blanks or tabs on such lines, looking for the first non-blank thing and reading that. Encountering something in the Delimiter list immediately is not considered a failure for %s purposes: it just results in that particular entry being recorded as an empty character vector.
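
A minimal sketch of this %s behaviour, using textscan on a character vector instead of a file. The input text is made up for illustration.

```matlab
% Made-up input: a numeric-looking field, an empty field, and a word.
txt = '1984,,abc';

% %s accepts anything up to the next delimiter, including digits.
C = textscan(txt, '%s', 'Delimiter', ',');
fields = C{1};

disp(fields);               % the entries that %s collected
disp(class(fields{1}));     % the numeric-looking field is stored as text
disp(isempty(fields{2}));   % the empty field is kept as an empty entry
```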

%d after that first triggers discarding of whitespace, as described above. Then it looks for characters that can be present in a (possibly complex-valued) floating-point number. If it finds such a number, it converts it to int32() and saves it. If it does not find such a number, then the match is considered to have failed and textscan stops processing; in the case where the %s succeeded but the %d failed, the cell for the %s will have one more entry than the cell for the %d.

Notice that the processing absolutely does not examine the whole line to determine whether the line as a whole matches. If the input was

   header1
   header2
   header3
   header4
   antaires 1134 (bunch of numbers)
   mars -803 (bunch of numbers)
        end part1
   header1
   header2
   header3
   header4
   zernith 3510 (bunch of numbers)
   lanroz 3535 (bunch of numbers)
       end part2

then the %s%d%f<etc> format would not look at the "end part1" line, say "oh, that does not match the template of the format", and stop reading when it reached that line. Instead, the %s is going to discard the leading blanks and read and store the word "end" from the line; then the %d would get control and would fail because 'part1' is not a valid number. The return values would be {'antaires'; 'mars'; 'end'} (three rows) and {[1134; -803]} (two rows), and textscan would be left positioned at the beginning of the 'part1'.
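
The example can be reproduced with the sketch below. It is an approximation under stated assumptions: the "(bunch of numbers)" placeholder is replaced by two made-up floating-point columns, the format is shortened to match, and the file is written to a temporary location.

```matlab
% Hypothetical stand-in for the example file; the two float columns are
% invented to replace the "(bunch of numbers)" placeholder.
demoFile = [tempname '.txt'];
fid = fopen(demoFile, 'w');
fprintf(fid, 'header1\nheader2\nheader3\nheader4\n');
fprintf(fid, 'antaires 1134 1.5 2.5\n');
fprintf(fid, 'mars -803 3.5 4.5\n');
fprintf(fid, '     end part1\n');
fprintf(fid, 'header1\nheader2\nheader3\nheader4\n');
fprintf(fid, 'zernith 3510 5.5 6.5\n');
fprintf(fid, 'lanroz 3535 7.5 8.5\n');
fprintf(fid, '    end part2\n');
fclose(fid);

fid = fopen(demoFile, 'r');
C = textscan(fid, '%s %d %f %f', 'HeaderLines', 4);
leftover = fgetl(fid);      % whatever remains on the line where textscan stopped
fclose(fid);
delete(demoFile);

% Unequal entry counts across columns indicate a failure in the middle of a row.
rowsPerColumn = cellfun(@numel, C);
disp(C{1});
disp(rowsPerColumn);
disp(leftover);
```

Comparing the entry counts per column is a quick way to notice that textscan stopped partway through a row instead of at end of file.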

I predict, then, that the reason you are not reading as many lines as you expect is that you have incorrect expectations about how textscan determines the end of repetitions of the format. That %s at the beginning is going to give you problems: it is happy to consume any leading keyword or difference in style that a human might look at and say "Oh, that's obviously the end of the block". textscan() %s just knows it is text and will grab it; textscan() never does look-ahead to see whether the rest of the format matches, and instead just keeps reading until one of the format items fails.
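
One way to test this prediction is to check where textscan stopped. The sketch below is only an illustration: the demo file, its column names, and the non-numeric token "missing" are all assumptions, not the contents of the asker's file. The same steps (feof, fgetl, and per-column entry counts after the textscan call) could be applied to the real file.

```matlab
% Hypothetical comma-delimited demo file: 1 header line, 3 data rows,
% with a made-up non-numeric token in the last column of row 2.
demoFile = [tempname '.dat'];
fid = fopen(demoFile, 'w');
fprintf(fid, 'TIMESTAMP,RECORD,Value\n');
fprintf(fid, '"2022-11-23 00:00:00",1,2.5\n');
fprintf(fid, '"2022-11-23 00:10:00",2,missing\n');
fprintf(fid, '"2022-11-23 00:20:00",3,4.5\n');
fclose(fid);

headerLines = 1;
fid = fopen(demoFile, 'r');
C = textscan(fid, '%s %d %f', 'HeaderLines', headerLines, 'Delimiter', ',');
stoppedEarly = ~feof(fid);      % true if textscan gave up before end of file
leftover = fgetl(fid);          % rest of the line at the stopping position
fclose(fid);
delete(demoFile);

rowsPerColumn = cellfun(@numel, C);
firstBadRow = min(rowsPerColumn) + 1;       % first incomplete data row
badFileLine = headerLines + firstBadRow;    % its line number in the file

fprintf('Stopped early: %d\n', stoppedEarly);
fprintf('Rows per column: %s\n', mat2str(rowsPerColumn));
fprintf('Suspect file line: %d, remaining text: %s\n', badFileLine, leftover);
```

If stoppedEarly is true, the suspect line is the place to look for text that one of the numeric format items could not convert.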