Parsing a text file in MATLAB and accessing the contents of each section

Hi, I want to split a text file, which is quite big, into different sections in MATLAB.
- Ignore first set of lines
- Then the data set is repeated
- Access its content for a particular set of conditions
For example, for a drag factor of 1.0 and fuel factor of 1.2, I want to find the corresponding alt for a particular weight.
Find attached the text file.
Thanks Yashvin
  2 Comments
per isakson on 10 Jun 2015
Edited: per isakson on 10 Jun 2015
  • "quite big": how big compared to available memory?
  • "different sections": what defines the beginning of a section? Is "V2500_A5" a fixed string that defines the beginning of a new section?
yashvin on 10 Jun 2015
It is a 60 MB text file. As an example, I am attaching one full section taken from the text file. The initial part, up to "Cruise at a given cost index", is unimportant.
Each section begins with "CLEAN CONFIGURATION" followed by a table.
For example, for drag factor = 1, fuel factor = 1.2 and ISA= =13, I want to access the table and get the corresponding weight.
I want to treat all the parameters under 'CLEAN CONFIGURATION' as fields, so that I can select sections for different conditions.


Accepted Answer

per isakson on 10 Jun 2015
Edited: per isakson on 11 Jun 2015
Here is a function, which reads question2.txt and returns a struct vector. It might serve as a starting point.
>> out = cssm()
out =
1x2 struct array with fields:
DRAG_FACTOR
FUEL_FACTOR
Table
>> out(abs([out.DRAG_FACTOR]-1)<1e-6 & abs([out.FUEL_FACTOR]-1)<1e-6).Table(1:5,1:3)
ans =
1.0e+04 *
4.0000 0.0000 0.0211
4.0500 0.0000 0.0212
4.1000 0.0000 0.0213
4.1500 0.0000 0.0214
4.2000 0.0000 0.0215
where
function out = cssm()
    str = fileread( 'question2.txt' );
    section_separator = 'CLEAN CONFIGURATION';
    cac = strsplit( str, section_separator );
    len = length( cac );
    % cac{1} is the text before the first separator; sections start at cac{2}.
    % Preallocate a 1x(len-1) struct array (the cell array argument sets the size).
    out = struct( 'DRAG_FACTOR',cell(1,len-1), 'FUEL_FACTOR',[], 'Table',[] );
    for jj = 2 : len
        out(jj-1) = handle_one_section_( cac{jj} );
    end
end
function sas = handle_one_section_( str )
    sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] );
    sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
    sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
    sas.Table = excerpt_table_( str );
end
function val = excerpt_num_( str, name )
    % Match the spaces and the number that directly follow the label "name".
    buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
    val = str2double( buf );
end
function val = excerpt_table_( str )
    % Q&D, quick and dirty: search for a numerical sequence that is at least 100
    % characters long. PROBLEM: requires that the preceding line ends with a
    % "non-numerical" character and that the following line begins with a
    % "non-numerical" character.
    buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
    val = str2num( buf );
end
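To make the lookbehind pattern in excerpt_num_ easier to follow, here is a minimal sketch that applies it to a single header line. The sample string is made up for illustration; the layout of the real file may differ.

```matlab
% Hypothetical sample line; the real file layout may differ.
sample = sprintf( 'DRAG FACTOR   1.000   FUEL FACTOR   1.200\n' );
name   = 'FUEL FACTOR';

% (?<=name) is a lookbehind: the match starts right after the label, so only
% the spaces and the number that follow the label are returned.
buf = regexp( sample, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
val = str2double( buf );        % str2double tolerates the leading spaces

% A label that does not occur gives an empty match, which converts to NaN.
missing = str2double( regexp( sample, '(?<=COST INDEX)[ ]+[\d\.]+', 'match', 'once' ) );
```

Because a missing label ends up as NaN rather than an error, it is worth checking the fields with isnan before using them in a comparison.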
Modified function based on comment
>> cssm
ans =
1x2 struct array with fields:
DRAG_FACTOR
FUEL_FACTOR
Table
COST_INDEX
ALTITUDE
ISA
where
function out = cssm()
    str = fileread( 'question2.txt' );
    section_separator = 'CLEAN CONFIGURATION';
    cac = strsplit( str, section_separator );
    len = length( cac );
    % cac{1} is the text before the first separator; sections start at cac{2}.
    % Preallocate a 1x(len-1) struct array (the cell array argument sets the size).
    out = struct( 'DRAG_FACTOR',cell(1,len-1), 'FUEL_FACTOR',[], 'Table',[] ...
                , 'COST_INDEX' ,[] , 'ALTITUDE' ,[], 'ISA' ,[] );
    for jj = 2 : len
        out(jj-1) = handle_one_section_( cac{jj} );
    end
end
function sas = handle_one_section_( str )
    sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] ...
                , 'COST_INDEX' ,[], 'ALTITUDE' ,[], 'ISA' ,[] );
    sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
    sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
    sas.COST_INDEX = excerpt_colon_separated_num_( str, 'COST INDEX' );
    sas.ALTITUDE = excerpt_colon_separated_num_( str, 'ALTITUDE' );
    sas.ISA = excerpt_colon_separated_num_( str, 'ISA' );
    sas.Table = excerpt_table_( str );
end
function val = excerpt_num_( str, name )
    % Match the spaces and the number that directly follow the label "name".
    buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
    val = str2double( buf );
end
function val = excerpt_table_( str )
    % Q&D, quick and dirty: search for a numeric sequence that is at least 100
    % characters long. PROBLEM: requires that the preceding line ends with a
    % "non-numeric" character and that the following line begins with a
    % "non-numeric" character.
    buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
    val = str2num( buf );
end
function val = excerpt_colon_separated_num_( str, name )
    % The separator class [ \:\-]+ includes '-', so a minus sign directly in
    % front of the number is consumed as a separator and the sign is lost.
    % The quantifier is inside the token so that the whole number is captured.
    buf = regexp( str, [ '(?<=', name, ')', '(?:[ \:\-]+)([\d\.]+)' ], 'tokens', 'once' );
    val = str2double( buf{:} );
end
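Once the struct array exists, the lookup asked for in the question could be sketched roughly as follows. The struct array below is a made-up stand-in for the output of cssm(), and the column meanings (weight in column 1, the value of interest in column 2) are assumptions, not something taken from the real file.

```matlab
% Stand-in for the struct array returned by cssm(); all numbers are made up.
out = struct( 'DRAG_FACTOR', {1.0, 1.0}, 'FUEL_FACTOR', {1.0, 1.2}, ...
              'Table', { [40000 210; 40500 212; 41000 214], ...
                         [40000 220; 40500 223; 41000 226] } );

% Select the section by its conditions, using a tolerance instead of ==.
tol = 1e-6;
is_wanted = abs( [out.DRAG_FACTOR] - 1.0 ) < tol ...
          & abs( [out.FUEL_FACTOR] - 1.2 ) < tol;
assert( nnz( is_wanted ) == 1, 'Expected exactly one matching section' );
tbl = out(is_wanted).Table;

% Look up a value for a given weight by linear interpolation.
weight_col    = 1;       % assumption: first column holds the weight
value_col     = 2;       % hypothetical: column of the quantity of interest
wanted_weight = 40250;
value = interp1( tbl(:,weight_col), tbl(:,value_col), wanted_weight );
```

interp1 returns NaN for a weight outside the tabulated range, which is a convenient signal that the requested point is not covered by the section.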
  9 Comments
per isakson on 11 Jun 2015
@Guillaume, yes the two text files differed. The first is a stripped down version of the second. I attach the copies I used.
yashvin on 12 Jun 2015
Hi! Do you still have the file? Yes! Now it's clearer to me! Thanks so much! Yes, both your answers were very helpful! I am getting used to it now. The first answer was at a higher level! Thank you both for your contributions!


More Answers (1)

Guillaume on 10 Jun 2015
Your text file is not really designed to be read by a computer. It's not very consistent (variable number of blank lines, variable number of spaces, inconsistent number format, etc.) which makes it difficult to parse efficiently.
So the first thing to look at is if you can get the same data in a format designed to be parsed by a computer: binary, json, xml, etc.
Failing that, the following works on the attached file, but because of the inconsistencies may not work on a larger file:
dragwanted = 1.0;
fuelwanted = 1.2;
content = fileread('question.txt'); %get whole content of file
sections = regexp(content, 'DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)\s+([A-Z .]+\r\n[A-Z() ]+\r\n\s*\r\n([0-9. ]+\r\n)+)', 'tokens');
%sections is a cell array of 1x3 cell arrays of {drag factor, fuel factor, table}
dragfactors = cellfun(@(s) str2double(s{1}), sections);
fuelfactors = cellfun(@(s) str2double(s{2}), sections);
wanted = dragfactors == dragwanted & fuelfactors == fuelwanted;
assert(sum(wanted) > 0, 'No section match criteria');
assert(sum(wanted) == 1, 'More than one section match criteria');
section = sections{wanted}{3};
%parse the section:
sectionlines = strsplit(section, {'\n', '\r'});
sectionheader = strsplit(strtrim(sectionlines{1}))
sectionunits = strtrim(regexp(sectionlines{2}, '(?<=\().*?(?=\))', 'match'))
sectiontable = str2num(strjoin(sectionlines(4:end-1), '\n'))
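A side note on the last step: str2num evaluates its input with eval, so for a 60 MB file of uncertain content a plain numeric reader may be preferable. The following is only a sketch of that alternative; the table text is made up and the column count is assumed to be known (for example from the header line).

```matlab
% Hypothetical table text with a known number of columns.
tabletext = sprintf( '40000 0.78 211.5\n40500 0.78 212.4\n41000 0.78 213.2\n' );
ncols = 3;    % assumption: the column count is known from the header line

% sscanf reuses the format until the text is exhausted and returns all
% numbers as one column vector, in reading order.
values = sscanf( tabletext, '%f' );
assert( mod( numel( values ), ncols ) == 0, 'Table is not rectangular' );

% Reshape fills column by column, so build ncols rows and then transpose.
tbl = reshape( values, ncols, [] ).';
```

Unlike str2num, this does not detect the row boundaries by itself, which is why the column count has to be supplied.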
  6 Comments
yashvin on 10 Jun 2015
Now I understand it better, thanks to you! So, in fact, the condition listed before the table can be any one of them. In fact, it can also be the CG location percentage, altitude value, ISA number (positive or negative), cost index value or % of MCR thrust.
In the file, in each section, we only care about the part from CLEAN CONFIGURATION to the last value of the table. The rest can be discarded.
The table always starts with WGHT and the header stays the same. Yes, the units should be kept.
Thanks Yashvin
Guillaume on 10 Jun 2015
Your file is a real mess: sometimes you have empty lines with just one space, sometimes with no spaces; the header line starts with three spaces, the unit line with only two; the parameter section sometimes has one parameter on a line, sometimes two. You may be better off parsing the file line by line.
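A line-by-line parse could look roughly like the sketch below. It is a small state machine run over a made-up excerpt; the labels, the layout and the rule that a table ends at the first non-numeric line after its data are all assumptions based on the description in this thread, not on the real file. For a real file the lines could come from regexp( fileread( filename ), '\r?\n', 'split' ).

```matlab
% Hypothetical excerpt; the real file layout may differ.
txtlines = { 'HEADER TO IGNORE', ...
             'CLEAN CONFIGURATION', ...
             ' DRAG FACTOR  1.000  FUEL FACTOR  1.200', ...
             '', ...
             '   WGHT   MACH   FF', ...
             '  (KG)   (-)   (KG/H)', ...
             ' ', ...
             ' 40000 0.78 2110', ...
             ' 40500 0.78 2120', ...
             'SOME FOOTER TEXT', ...
             'CLEAN CONFIGURATION', ...
             ' DRAG FACTOR  1.020  FUEL FACTOR  1.000', ...
             '   WGHT   MACH   FF', ...
             '  (KG)   (-)   (KG/H)', ...
             ' 41000 0.79 2200' };

sections = struct( 'criteria', {}, 'rows', {} );   % empty struct array
state = 'skip';                                    % 'skip', 'criteria' or 'table'
for k = 1 : numel( txtlines )
    ln = strtrim( txtlines{k} );       % hides the inconsistent indentation
    if strcmp( ln, 'CLEAN CONFIGURATION' )
        sections(end+1) = struct( 'criteria', {{}}, 'rows', [] );
        state = 'criteria';
    elseif strcmp( state, 'skip' ) || isempty( ln )
        continue                       % outside a section, or a blank line
    elseif strncmp( ln, 'WGHT', 4 )
        state = 'table';               % header line: table data follows
    elseif strcmp( state, 'criteria' )
        sections(end).criteria{end+1} = ln;   % keep raw text, parse it later
    elseif strcmp( state, 'table' )
        if ~isempty( regexp( ln, '^[\d.\s]+$', 'once' ) )
            % Purely numeric line: append as a row (errors if the number
            % of columns differs from the previous rows).
            sections(end).rows(end+1,:) = sscanf( ln, '%f' ).';
        elseif ~isempty( sections(end).rows )
            state = 'skip';            % first non-numeric line after the data
        end
    end
end
```

Trimming each line before testing it is what makes the variable number of leading spaces and the "blank" lines containing a single space harmless here.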
Otherwise, the following will get you the table and the criteria section, but will not parse the criteria:
sections = regexp(content, ...
'CLEAN CONFIGURATION\r\n((.*\r\n)+?)(\s+WGHT.*\r\n.*\r\n.*\r\n([0-9. ]+\r\n)+)', ...
'tokens', 'dotexceptnewline');
sections is a 1 x n cell array (n = number of sections) of cell arrays, whose first element is the criteria part and whose second element is the table part. You can parse the table with the same code as before. For reference, the above regular expression can be decoded as:
  1. match 'CLEAN CONFIGURATION' followed by '\r\n' (newline)
  2. start the first token (at the first '(')
  3. match any character but a newline followed by '\r' (the |(.*\r
