read ASCII data with varying lengths of space as delimiters

Question

Pq Yang on 3 Apr 2020

0
Link

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/515190-read-ascii-data-with-varying-lengths-of-space-as-delimiters

Edited: dpb on 4 Apr 2020

subset.txt

I am having trouble in reading a structured dataset. The columns are separeted by spaces which have different lengths. Besides the varying lengths of space for delimiters, the column widths are varying as well.

I have tried all kinds of appraoches and functions, and searached online, but couldnot find a solution.

Can someone help me? Please.

A subset from a huge dataset is attached.

1 Comment
Show -1 older commentsHide -1 older comments

dpb on 3 Apr 2020

Need clear definition of the format -- does look like fixed-width but not absolutely positive about whether the apparent missing fields are defined or skipped.

One presumes somewhere whoever created these files also built a toolset to read them and published a definitive description.

I'd look for the first first rather than reinvent the wheel and lacking that w/o the second it's going to be a lot of detective work.

Sign in to comment.

Sign in to answer this question.

Answer 1

per isakson on 3 Apr 2020

0
Link

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/515190-read-ascii-data-with-varying-lengths-of-space-as-delimiters#answer_423872

Edited: per isakson on 4 Apr 2020

Open in MATLAB Online

subset.txt is a fixed-width text file. No doubt about that.

Since R2018b (R2017a?) it is straight forward to read fixed-width text file with readtable(). Before that it was a bit tricky to read fixed-width text files. There are a number of threads here in Answers on the topic.

The documentation includes a good description on how to read fixed-width text files. See fixedWidthImportOptions, Import options object for fixed-width text files.

IMO: The error prone part is to determine the widths of the columns. Here is a start.

copy&paste the column header line to the command window
use regexp() to determine the starting position of each column. The column headers start with a letter (the first one is special).

>> chr = '# DTG                LOCATION            NAME  ...'
>> cac = regexp( chr, '\<\w', 'start' );
>> cac
cac =
     3    22    42    90   110   130   150   170   190   210   230   250   270   290   310   330   350

1 Comment
Show -1 older commentsHide -1 older comments

dpb on 4 Apr 2020

Edited: dpb on 4 Apr 2020

Open in MATLAB Online

Is there only one header line? Maybe the browser is folding the long line that confused the issue...ah! Indeed it is. That makes the file look much more regular than with folded lines. Indeed with the comment character to skip header lines presuming don't need them (altho then the column header record will be skipped too, probably; I've never tested that).

Well, now have downloaded, let's see just how smart TMW was... :)

>> opt=detectImportOptions('subset.txt','FileType','fixedwidth','CommentStyle','#',"NumHeaderLines",20,'ReadVariableNames',1)
opt = 
  FixedWidthImportOptions with properties:
   Format Properties:
                  Whitespace: '\b\t '
                  LineEnding: {'\n'  '\r'  '\r\n'}
                CommentStyle: {'#'}
               EmptyLineRule: 'skip'
                    Encoding: 'windows-1252'
   Replacement Properties:
                 MissingRule: 'fill'
             ImportErrorRule: 'fill'
            ExtraColumnsRule: 'addvars'
            PartialFieldRule: 'keep'
   Variable Import Properties: Set types by name using setvartype
               VariableNames: {'x_DTG', 'Var2', 'LOCATIO' ... and 12 more}
               VariableTypes: {'datetime', 'duration', 'char' ... and 12 more}
              VariableWidths: [10.00 9.00 9.00 21.00 4.00 8.00 2.00 35.00 19.00 16.00 99.00 20.00 20.00 40.00 55.00]
       SelectedVariableNames: {'x_DTG', 'Var2', 'LOCATIO' ... and 12 more}
             VariableOptions: Show all 15 VariableOptions 
	Access VariableOptions sub-properties using setvaropts/getvaropts
       PreserveVariableNames: false
   Location Properties:
                   DataLines: [22.00 Inf]
           VariableNamesLine: 21.00
              RowNamesColumn: 0
           VariableUnitsLine: 0
    VariableDescriptionsLine: 0 
	To display a preview of the table, use preview
>> opt.VariableNames
ans =
  1×15 cell array
  Columns 1 through 10
    {'x_DTG'}    {'Var2'}    {'LOCATIO'}    {'NNAME'}    {'Var5'}    {'Var6'}    {'Var7'}    {'LATITUDE'}    {'LONGITUD'}    {'EALTI'}
  Columns 11 through 15
    {'TUDEU_BOOL_10T_…'}    {'EWP_SEA_10T_D'}    {'RYB_SEA_10TN_'}    {'DRYB_10T_WETB_1…'}    {'DRYB_10U_10U_SE…'}
>> 

Well, kinda' sorta' made a stab at it...will have to physically set the column widths and number variables -- it finds the time string separate from the date so inserts Var2 for it, then thinks the LOCATION column ends at the end of the shorter string data instead of the longer so it truncates the name as LOCATIO and includes the last N in the next column. But one gets an object to play with for starters.

Good trick for automagically guessing column widths, Per, will have to remember that going forward.

OP will still need to either account for the time column independently or to set the width to encompass both and read as string 'cuz can't handle the blank field on input parsing in readtable to convert to datetime.

Sign in to comment.

read ASCII data with varying lengths of space as delimiters

1 Comment
Show -1 older commentsHide -1 older comments

Answers (1)

1 Comment
Show -1 older commentsHide -1 older comments

See Also

Categories

Tags

Community Treasure Hunt

read ASCII data with varying lengths of space as delimiters

1 Comment Show -1 older commentsHide -1 older comments

Answers (1)

1 Comment Show -1 older commentsHide -1 older comments

See Also

Categories

Tags

Community Treasure Hunt

1 Comment
Show -1 older commentsHide -1 older comments

1 Comment
Show -1 older commentsHide -1 older comments