Reading in ascii files with white space as delimiter.

I am trying to read in a very simple ascii file that looks like the following:
PRES HGHT TEMP DWPT RELH MIXR DRCT SKNT THTA THTE THTV
hPa m C C % g/kg deg knot K K K
-----------------------------------------------------------------------------
994.0 270 7.0 6.0 93 5.93 40 10 280.6 297.1 281.6
989.0 312 6.2 5.2 93 5.64 42 12 280.2 295.9 281.2
972.0 455 4.8 4.0 95 5.27 48 18 280.2 294.9 281.1
...
There seem to be a dozen functions that I can read this in with but I'm struggling with all of them.
The simplest seems to be dlmread. I'm currently using the command:
M = dlmread('radiosonde.ascii',' ',3,1)
However this seems to register a single space as the delimiter instead of all the white space. If I use:
M = dlmread('radiosonde.ascii')
It registers the white space as the delimiter, but I cannot tell it to ignore the headers. Is there some way to specify white space as the delimiter while also ignoring the headers?
Is there a better way to do this? Why hasn't MathWorks streamlined reading text files into one universal function?

Answers (2)

Kevin Claytor on 9 Nov 2015
importdata seems to work pretty well (but doesn't directly get you the headers):
importdata('radiosonde.ascii', ' ', 3)
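For what it's worth, when header lines are specified importdata returns a struct rather than a plain matrix, so the numbers have to be pulled out of its data field. A minimal sketch, assuming the header and the three sample rows from the question are first written to a temporary file (the variable names fname, S, M and hdr are only for illustration, and how well the single-space delimiter copes with runs of spaces is as reported above, not something this sketch guarantees):

```matlab
% Write the header and the three sample rows from the question to a temp file.
fname = [tempname '.ascii'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    '    hPa      m      C      C      %   g/kg    deg   knot      K      K      K', ...
    '-----------------------------------------------------------------------------', ...
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6', ...
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2', ...
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1');
fclose(fid);

S   = importdata(fname, ' ', 3);   % space delimiter, 3 header lines
M   = S.data;                      % numeric block
hdr = S.textdata;                  % header text as importdata read it
delete(fname);                     % tidy up the temporary file
```

The header text ends up in S.textdata as raw lines, so the column names still have to be split out by hand if they are needed.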
If you know the exact format, textscan is what the code auto-generated by the Import Data tool (right click > Import Data) uses:
fileID = fopen('radiosonde.ascii','r'); % textscan needs an open file identifier
startRow = 4;
formatSpec = '%7s%7s%7s%7s%7s%7s%7s%7s%7s%7s%s%[^\n\r]';
dataArray = textscan(fileID, formatSpec, 'Delimiter', '', 'WhiteSpace', '', 'HeaderLines' ,startRow-1, 'ReturnOnError', false);
fclose(fileID);

dpb on 9 Nov 2015
Edited: dpb on 13 Nov 2015
"The better way..."
x=textread('radiosonde.ascii','','headerlines',3);
I hadn't noted before the symptom of repeated delimiters with dlmread; agreed that's a pit[proverbial]a[ppendage].
IMO, it's unfortunate TMW has chosen to deprecate the use of textread in favor of textscan; textread has the advantages of
  1. returning a "regular" double array instead of only a cell array,
  2. not needing the extra fopen/fclose step where a single file read suffices, and
  3. as shown below, "counting" the record length and returning the correct shape automagically, whereas textscan has to be told or one has to reshape the returned array.
The above equivalent in textscan would be
fid=fopen('radiosonde.ascii','r'); % textscan needs an open file identifier
x=cell2mat(textscan(fid,repmat('%f',1,11), ...
    'delimiter',' ', ...
    'headerlines',3, ...
    'multipledelimsasone',1));
fclose(fid);
textscan is the one, general function, but there are so many possibilities (as in infinite) to cover that making something that is general but also flexible is difficult; hence the specialized functions for specific cases. It does seem as though the multiple delimiters option would be a worthwhile enhancement for them; as noted, I hadn't actually noted that behavior previously as I tend to use the textread route for the above reasons. There are things it can't do that textscan can (being able to be called on the same file multiple times being a major one) but instead of deprecating it, it should be brought up to the level of textscan instead imo (or, alternatively, the option I've asked for since it was introduced, have an optional ability in textscan to return the double array directly and understand a file name as well as file handle).
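To illustrate the point about working on the same open file more than once, here is a rough sketch that picks up the column names and units with fgetl and then hands the same file identifier to textscan for the numbers. It assumes the three sample rows from the question are written to a temporary file first; fname, names, units and M are illustrative names only.

```matlab
% Write the header and the three sample rows from the question to a temp file.
fname = [tempname '.ascii'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\n', ...
    '   PRES   HGHT   TEMP   DWPT   RELH   MIXR   DRCT   SKNT   THTA   THTE   THTV', ...
    '    hPa      m      C      C      %   g/kg    deg   knot      K      K      K', ...
    '-----------------------------------------------------------------------------', ...
    '  994.0    270    7.0    6.0     93   5.93     40     10  280.6  297.1  281.6', ...
    '  989.0    312    6.2    5.2     93   5.64     42     12  280.2  295.9  281.2', ...
    '  972.0    455    4.8    4.0     95   5.27     48     18  280.2  294.9  281.1');
fclose(fid);

fid   = fopen(fname, 'r');
names = strsplit(strtrim(fgetl(fid)));   % line 1: column names, split on whitespace
units = strsplit(strtrim(fgetl(fid)));   % line 2: units
fgetl(fid);                              % line 3: dashed separator, discarded
% Same file identifier again: read the remaining lines as numeric columns.
C = textscan(fid, repmat('%f', 1, numel(names)), 'CollectOutput', true);
fclose(fid);
delete(fname);                           % tidy up the temporary file
M = C{1};                                % one double matrix, one column per name
```

Because the file position carries over between calls, no 'headerlines' option is needed for the textscan call here.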
(+) ADDENDUM/ERRATUM
Actually, on reading the source for dlmread I noticed something I hadn't seen before (and I don't think it's documented; at least not well): if one passes an empty string as the format string, textscan works out the number of fields per input record on its own and returns numeric columns to match, which can then be collected into a regular numeric array. That is a super result that should be shouted from the rooftops by TMW but seems to be a closely held secret:
>> cell2mat(textscan(fid,'','collectoutput',1,'headerlines',3))
ans =
Columns 1 through 8
994.0000 270.0000 7.0000 6.0000 93.0000 5.9300 40.0000 10.0000
989.0000 312.0000 6.2000 5.2000 93.0000 5.6400 42.0000 12.0000
972.0000 455.0000 4.8000 4.0000 95.0000 5.2700 48.0000 18.0000
Columns 9 through 11
280.6000 297.1000 281.6000
280.2000 295.9000 281.2000
280.2000 294.9000 281.1000
>>
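The empty-format call above still needs the fopen/fclose bookkeeping around it. One way to hide that is a small wrapper that takes a file name and returns the double array directly. This is only a sketch: readWhitespaceTable is a made-up name, not a MATLAB function, and it assumes a purely numeric body below the header lines.

```matlab
function M = readWhitespaceTable(filename, nHeaderLines)
% READWHITESPACETABLE  Hypothetical helper (not part of MATLAB): read a
% whitespace-delimited numeric file into a double matrix, skipping headers.
fid = fopen(filename, 'r');
if fid < 0
    error('readWhitespaceTable:fileNotOpened', 'Could not open %s.', filename);
end
closer = onCleanup(@() fclose(fid));   % close the file even if textscan errors

% An empty format string lets textscan work out the number of columns itself;
% 'CollectOutput' gathers them into a single array inside the cell.
C = textscan(fid, '', 'HeaderLines', nHeaderLines, 'CollectOutput', true);
M = C{1};
end
```

Saved as readWhitespaceTable.m, the call for the file in the question would be M = readWhitespaceTable('radiosonde.ascii', 3);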
  2 Comments
dpb on 9 Nov 2015
Edited: dpb on 10 Nov 2015
>> help dlmread
dlmread Read ASCII delimited file.
RESULT = dlmread(FILENAME) reads numeric data from the ASCII
delimited file FILENAME. The delimiter is inferred from the formatting
of the file.
RESULT = dlmread(FILENAME,DELIMITER) reads numeric data from the ASCII
delimited file FILENAME using the delimiter DELIMITER. The result is
returned in RESULT. Use '\t' to specify a tab.
When a delimiter is inferred from the formatting of the file,
consecutive whitespaces are treated as a single delimiter. By
contrast, if a delimiter is specified by the DELIMITER input, any
repeated delimiter character is treated as a separate delimiter.
...
I'd forgotten this detail; the behavior is documented. The problem is, there's no way with the interface as designed to specify the header rows and not the delimiter...it's a remnant of the original procedural interface design of the functions; quite often they weren't written to be as general as could/should have been.
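The documented distinction is easy to see on a single record. The sketch below uses strsplit purely as an analogy (it is not what dlmread calls internally): with collapsing turned off, every space counts as its own delimiter and empty fields appear between them, which is the behavior reported in the question for an explicit ' ' delimiter.

```matlab
rec = '994.0   270   7.0';   % three spaces between each pair of fields

% Each space is a separate delimiter: empty fields show up between the numbers.
separate  = strsplit(rec, ' ', 'CollapseDelimiters', false);

% Runs of spaces are treated as one delimiter: only the three numbers remain.
collapsed = strsplit(rec, ' ', 'CollapseDelimiters', true);

fprintf('separate: %d fields, collapsed: %d fields\n', ...
    numel(separate), numel(collapsed));
```

Six spaces in total give seven fields in the first case and three in the second.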
dlmread is an m-file; it wouldn't be too hard to extend it to handle the case--
The preprocessing section looks like the following:
...
% Get Delimiter
if nargin==1 % Guess default delimiter
    [fid, theMessage] = fopen(filename);
    if fid < 0
        error(message('MATLAB:dlmread:FileNotOpened', filename, theMessage));
    end
    str = fread(fid, 4096,'*char')';
    frewind(fid);
    delimiter = guessdelim(str);
    if isspace(delimiter); delimiter = ''; end
else
    delimiter = sprintf(delimiter); % Interpret \t (if necessary)
end
...
If one were to use [] placeholder for the delimiter but also provided the R,C offsets, nargin still returns the place counter in the list so it would be pretty easy to also test for the second argument being empty as well as the first case of only one argument and have it do the search. Then only if the delimiter were explicitly specified would the multiple vs single come into play.
Would take a little more effort to handle that case as well, but certainly doable (and probably should have been).
ADDENDUM
Modified the above if to
if nargin==1 || (nargin>1 && isempty(delimiter)) % Guess default delimiter
and voila! using [] as a placeholder for the DELIMITER argument lets one specify the offset row,column arguments and still get the behavior of the multiple delimiters as one and automagic determination of same.
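The reason the placeholder works is that nargin counts every argument position supplied, including an empty []. A stand-alone sketch of just that test (shouldGuessDelimiter is a hypothetical name used here for illustration; it is not part of dlmread):

```matlab
function tf = shouldGuessDelimiter(filename, delimiter, r, c)
% SHOULDGUESSDELIMITER  Hypothetical illustration of the modified test:
% guess the delimiter when only a file name is given, or when the delimiter
% argument is present but empty (used as a placeholder before R and C).
tf = nargin == 1 || (nargin > 1 && isempty(delimiter));
end
```

With this saved as shouldGuessDelimiter.m, shouldGuessDelimiter('f.txt') and shouldGuessDelimiter('f.txt',[],3,0) both return true, while shouldGuessDelimiter('f.txt',' ',3,0) returns false. The short-circuit || means delimiter is never touched in the one-argument case.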
dpb on 12 Nov 2015
BTW, the above working for the example file is sorta happenstance; the documentation also includes the caveat
All data in the input file must be numeric. dlmread does not operate
on files containing nonnumeric data, even if the specified rows and
columns for the read contain numeric data only.
The example file is an anomaly that does, in fact, work correctly when the headers are skipped; not all will. I've not pursued this part in depth, but undoubtedly it has to do with the fact that the delimiter search reads an arbitrary 4096 characters and searches within them to determine the delimiter if requested, and makes assumptions based on that which may turn out to be incorrect for a general line.
