Textscan with different formats
Hi,
I'm not familiar with reading text files or with the format specifiers involved. Here is my problem: I have been trying to read a text file that has an unknown number of rows and columns, a '\t' delimiter, and more than 2 column headers (the second one is a unit, which I don't need; only the first one matters). I was using importdata to read the text and the data separately, and it was working fine, but yesterday I found a problem: my input text file contains '*' for missing data, which importdata treats as a character and as a row header during import.
Hundreds of questions have been asked about reading text files. I've found solutions such as readtable, importing as char and converting with str2double (which is slow), and readtext (File Exchange), but none of them is as fast as importdata.
What I want is to read only the numeric data from the text file, replacing characters with NaN during the import itself, as xlsread does. I understand this can be done with textscan, but I was unable to work out the formatSpec for these files. Alternatively, a faster str2double would help.
When i give formatspec as ('%s %f') is the first row is taken as string or the first column?
Note: the text file is about 100000 rows by 600 columns. In some files the second column (Units) may not be present, and the data starts from the second column itself. If the delimiter changes to ',' for another file, how can I auto-detect the delimiter?
4 Comments
Stephen23
on 25 Oct 2017
@surey: are the missing data always in that column, or can they occur in other columns as well?
Answers (1)
Walter Roberson
on 24 Oct 2017
"When i give formatspec as ('%s %f') is the first row is taken as string or the first column?"
Neither. textscan() loops, continuing from the current file position, which might be in the middle of a line. If your format reads only part of a line, the rest of the line is not discarded before the format is used again. Instead, the file position is left in the middle of the line, and textscan loops and applies the format again from wherever it is.
For example, in the file
abc 123 456 789 1011
def
a "%s%f" format would first read the 'abc' with the %s format, then read the 123 with the numeric format, temporarily leaving the textscan output as {{'abc'}, [123]}. Then textscan would re-apply the format from where it was, reading '456' with the %s format and 789 with the numeric format, updating the textscan output to {{'abc'; '456'}, [123; 789]}. Then the %s would grab the 1011, and the %f would choke on the def of the next line, leaving you with {{'abc'; '456'; '1011'}, [123; 789]}. Notice that the numeric column is shorter than the text column, because textscan gave up reading before that column was updated.
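The walk-through above can be reproduced with a small sketch. It assumes the two sample lines are held in a character vector rather than a file (textscan accepts either), and the variable names are only illustrative.

```matlab
% Two-line sample from the example above, kept in a character vector so
% the sketch runs without creating a file (an assumption for illustration).
txt = sprintf('abc 123 456 789 1011\ndef\n');

% The format has only two specifiers, but the first line has five fields,
% so textscan keeps re-applying '%s%f' across the line.
C = textscan(txt, '%s%f');

names  = C{1};   % text collected by %s
values = C{2};   % numbers collected by %f

% Compare the column lengths; a mismatch is a sign that the format
% drifted out of step with the file layout.
fprintf('%%s entries: %d, %%f entries: %d\n', numel(names), numel(values));
```

Checking numel() of each output cell like this is a cheap way to notice that a format did not line up with the columns.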
Now, if you happen to have the same number of format items as you have columns, then the effect is that each format item applies to a column. But if you hit a row that has a missing entry that is implied by spacing (no explicit delimiter between fields), or you have a numeric field specification but encounter a string instead and you do not have TreatAsEmpty set, or if a %s column unexpectedly has a space in it... in any of those circumstances, the nice correspondence between column and format specifier will get messed up.
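As a rough sketch of the one-specifier-per-column case, assuming tab-delimited numeric rows in which '*' marks missing data (as in the question): the sample text, the variable names, and the way the column count is taken from the first line are illustrative assumptions, not a tested recipe for the actual 600-column files.

```matlab
% Hypothetical tab-delimited sample with '*' standing in for missing data.
txt = sprintf('1\t2\t3\n4\t*\t6\n*\t8\t9\n');

% Count the columns from the first line: number of tabs plus one.
firstLine = strtok(txt, sprintf('\n'));
nCols = sum(firstLine == sprintf('\t')) + 1;

% One %f per column, so each specifier lines up with one column.
fmt = repmat('%f', 1, nCols);

% 'TreatAsEmpty' makes '*' count as an empty numeric field, which textscan
% fills with NaN by default; 'CollectOutput' merges the columns into one matrix.
C = textscan(txt, fmt, 'Delimiter', '\t', 'TreatAsEmpty', '*', ...
             'CollectOutput', true);
M = C{1};
disp(M)
```

For a real file, the same call would take a file identifier from fopen() in place of txt, and header rows could be skipped with the 'HeaderLines' option.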
One of the key things you need to know about textscan() is that, unless you have set 'WhiteSpace' to exclude the space character, blanks starting at the current position are discarded at the beginning of every format specifier, even if the format specifier is %c or %s or %[]. This makes it tricky to deal with optional fields that are replaced by blanks (unless you happen to be using a field separator such as comma or tab). The immediate thought might be to just remove space from 'WhiteSpace', but when that parameter does not include space, leading spaces are an error for numeric fields! I showed how to get around that in https://www.mathworks.com/matlabcentral/answers/361377-textscan-failing-to-read-data-in-text-file#answer_286302
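On the delimiter part of the question, textscan does not guess the delimiter for you, so one possible heuristic is to count candidate separators in the first line and pick the most frequent one. This is only a sketch: the candidate list and the sample header line are assumptions, and it would be fooled by a header that contains commas inside the names.

```matlab
% First line of a hypothetical file; for a real file it could be read
% with fopen/fgetl/fclose instead.
firstLine = sprintf('Time\tPressure\tTemp');

% Candidate delimiters to test (an illustrative set, not exhaustive).
candidates = {sprintf('\t'), ',', ';'};

% Count how often each candidate occurs in the first line.
counts = cellfun(@(d) sum(firstLine == d), candidates);

% Keep the most frequent candidate as the guessed delimiter.
[~, idx] = max(counts);
delim = candidates{idx};
```

The guessed delim can then be passed as the 'Delimiter' option of textscan.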