- "quite big"   how big compared to available memory?
- "different sections"   what defines the beginning of a section? "V2500_A5"_ is that a fixed string, which defines the beginning of a new a section?
Parsing a text file in matlab and accessing contents of each sections
7 views (last 30 days)
Show older comments
Hi I want to separate a text file into different sections in MATLAB which is quite big.
- Ignore first set of lines
- Then the data set is repeated
- Access its content for a particular set of condition
For example, for a drag factor of 1.0 and fuel factor of 1.2, I want to find the corresponding alt for a particular weight.
Find attached the text file.
Thanks Yashvin
2 Comments
per isakson
on 10 Jun 2015
Edited: per isakson
on 10 Jun 2015
Accepted Answer
per isakson
on 10 Jun 2015
Edited: per isakson
on 11 Jun 2015
Here is a function, which reads question2.txt and returns a struct vector. It might serve as a starting point.
>> out = cssm()
out =
1x2 struct array with fields:
DRAG_FACTOR
FUEL_FACTOR
Table
>> out(abs([out.DRAG_FACTOR]-1)<1e-6 & abs([out.FUEL_FACTOR]-1)<1e-6).Table(1:5,1:3)
ans =
1.0e+04 *
4.0000 0.0000 0.0211
4.0500 0.0000 0.0212
4.1000 0.0000 0.0213
4.1500 0.0000 0.0214
4.2000 0.0000 0.0215
where
function out = cssm()
str = fileread( 'question2.txt' );
section_separator = 'CLEAN CONFIGURATION';
cac = strsplit( str, section_separator );
len = length( cac );
out = struct( 'DRAG_FACTOR',nan(1,len-1), 'FUEL_FACTOR',[], 'Table',[] );
for jj = 2 : len
out(jj-1) = handle_one_section_( cac{jj} );
end
end
function sas = handle_one_section_( str )
sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] );
sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
sas.Table = excerpt_table_( str );
end
function val = excerpt_num_( str, name )
buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
val = str2double( buf );
end
function val = excerpt_table_( str )
% Q&D, quick and dirty, search a numerical sequence, which is at least 100 character
% long. PROBLEM: requires that the preceding line ends with a "non-numerical"
% character and that the following line begins with a "non-numerical" character.
buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
val = str2num( buf );
end
 
Modified function based on comment
>> cssm
ans =
1x2 struct array with fields:
DRAG_FACTOR
FUEL_FACTOR
Table
COST_INDEX
ALTITUDE
ISA
where
function out = cssm()
str = fileread( 'question2.txt' );
section_separator = 'CLEAN CONFIGURATION';
cac = strsplit( str, section_separator );
len = length( cac );
out = struct( 'DRAG_FACTOR',nan(1,len-1), 'FUEL_FACTOR',[], 'Table',[] ...
, 'COST_INDEX' ,[] , 'ALTITUDE' ,[], 'ISA' ,[] );
for jj = 2 : len
out(jj-1) = handle_one_section_( cac{jj} );
end
end
function sas = handle_one_section_( str )
sas = struct( 'DRAG_FACTOR',[], 'FUEL_FACTOR',[], 'Table',[] ...
, 'COST_INDEX' ,[], 'ALTITUDE' ,[], 'ISA' ,[] );
sas.DRAG_FACTOR = excerpt_num_( str, 'DRAG FACTOR' );
sas.FUEL_FACTOR = excerpt_num_( str, 'FUEL FACTOR' );
sas.COST_INDEX = excerpt_colon_separated_num_( str, 'COST INDEX' );
sas.ALTITUDE = excerpt_colon_separated_num_( str, 'ALTITUDE' );
sas.ISA = excerpt_colon_separated_num_( str, 'ISA' );
sas.Table = excerpt_table_( str );
end
function val = excerpt_num_( str, name )
buf = regexp( str, [ '(?<=', name, ')', '[ ]+[\d\.]+' ], 'match', 'once' );
val = str2double( buf );
end
function val = excerpt_table_( str )
% Q&D, quick and dirty, search a numeric sequecne, which is at least 100 character
% long. PROBLEM: requires that the preceeding line ends with a "non-numeric"
% character and that the following line begins with a "non-numeric" character.
buf = regexp( str, '[\d\.\s]{100,}', 'match', 'once' );
val = str2num( buf );
end
function val = excerpt_colon_separated_num_( str, name )
buf = regexp( str, [ '(?<=', name, ')', '(?:[ \:\-]+)([\d\.])+' ], 'tokens', 'once' );
val = str2double( buf{:} );
end
9 Comments
per isakson
on 11 Jun 2015
@Guillaume, yes the two text files differed. The first is a stripped down version of the second. I attach the copies I used.
More Answers (1)
Guillaume
on 10 Jun 2015
Your text file is not really designed to be read by a computer. It's not very consistent (variable number of blank lines, variable number of spaces, inconsistent number format, etc.) which makes it difficult to parse efficiently.
So the first thing to look at is if you can get the same data in a format designed to be parsed by a computer: binary, json, xml, etc.
Failing that, the following works on the attached file, but because of the inconsistencies may not work on a larger file:
dragwanted = 1.0;
fuelwanted = 1.2;
content = fileread('question.txt'); %get whole content of file
sections = regexp(content, 'DRAG FACTOR\s+([0-9.]+)\s+FUEL FACTOR\s+([0-9.]+)\s+([A-Z .]+\r\n[A-Z() ]+\r\n\s*\r\n([0-9. ]+\r\n)+)', 'tokens');
%sections is a cell array of 1x3 cell arrays of {drag factor, fuel factor, table}
dragfactors = cellfun(@(s) str2double(s{1}), sections);
fuelfactors = cellfun(@(s) str2double(s{2}), sections);
wanted = dragfactors == dragwanted & fuelfactors == fuelwanted;
assert(sum(wanted) > 0, 'No section match criteria');
assert(sum(wanted) == 1, 'More than one section match criteria');
section = sections{wanted}{3};
%parse the section:
sectionlines = strsplit(section, {'\n', '\r'});
sectionheader = strsplit(strtrim(sectionlines{1}))
sectionunits = strtrim(regexp(sectionlines{2}, '(?<=\().*?(?=\))', 'match'))
sectiontable = str2num(strjoin(sectionlines(4:end-1), '\n'))
6 Comments
Guillaume
on 10 Jun 2015
Your file is a real mess, sometimes you have empty lines with just one space, sometimes with no spaces, the header line starts with 3 spaces, the unit line only two, the parameter section sometimes has one parameter on a line, sometimes two. You may be better off parsing the file line by line.
Otherwise, the following will get you the table and the criteria section, but will not parse the criteria:
sections = regexp(content, ...
'CLEAN CONFIGURATION\r\n((.*\r\n)+?)(\s+WGHT.*\r\n.*\r\n.*\r\n([0-9. ]+\r\n)+)', ...
'tokens', 'dotexceptnewline);
sections is a 1 x n (n = number of section) cell array of cell arrays whose first elements are the criteria part and seconds elements the table part. You can parse the table with the same code as before. For reference, the above regular expression can be decoded as:
- match 'CLEAN CONFIGURATION' followed by '\r' (newline)
- starts the first token (at |(|)
- match any character but a newline followed by '\r' (the |(.*\r
See Also
Categories
Find more on Characters and Strings in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!