Extracting data using regular expressions
Shuvashish Roy
on 20 May 2021
Commented: Shuvashish Roy
on 21 May 2021
Hi,
I have the attached text file. I want to extract all the columns starting from line 1472 (as numbered in Notepad), named "Physics", "Time", "dt", "Progress", "Nonlinear Iteration", "Linear Iterations", ..., "Nodes After Adaption". I don't know how to specify the header names so that only the numeric values under those headers are extracted into a dataframe or matrix format. Thanks a lot for your help.
Input file format:
Unnecessary lines with text
Unnecessary lines with text
................................
many unnecessary lines............
adh_run_func :: tfinal = 12513600.000000
Physics Time dt Progress Nonlinear Iteration Linear Iteration Max Resid Norm ... Nodes After Adaption
HYD_1 11908800 5 0 1 ........ ...65926
HYD_1 11908800 5 0 2 ...... ...65926
............................................................................................. ................................
............................................................................................. ................................
100% COMPLETE
Output file format:
Physics Time dt Progress Nonlinear Iteration Linear Iteration Max Resid Norm ... Nodes After Adaption
HYD_1 11908800 5 0 1 ........ ...65926
HYD_1 11908800 5 0 2 ...... ...65926
............................................................................................. ................................
Accepted Answer
per isakson
on 21 May 2021
Edited: per isakson
on 21 May 2021
I read "all the columns [...] named "Physics", "Time", "dt", "Progress", "Nonlinear Iteration" "Linear Iterations"...."Nodes After Adaption"" as meaning every column, none excluded.
There is a choice: shall we use readtable() or textscan()? I don't think readtable() can handle this file without relying on the critical line numbers, which I hesitate to do. It is, however, possible to determine the required line numbers in a separate step and then use readtable(). textscan() can parse a 1D character array, which readtable() cannot. Only TMW knows why.
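For comparison, the two-step readtable() route mentioned above could look roughly like the following. This is an untested sketch, not the method used in this answer. It assumes that the header line starts with 'Physics', that the block ends at the first 'NN% COMPLETE' line after it, that the columns are tab-delimited, and that readtable() counts lines the same way as the split below. The variable names header_line, end_line and last_row are made up for illustration.

```matlab
fname = 'AR_20base_201214_adh.txt';                 % file name taken from the code below
lines = regexp(fileread(fname), '\r?\n', 'split');  % 1-by-N cell, one char row per line

% Locate the block by content instead of hard-coding line numbers
header_line = find(strncmp(lines, 'Physics', 7), 1, 'first');
is_complete = ~cellfun(@isempty, regexp(lines, '^\d+[% ]+COMPLETE', 'once'));
end_line    = find(is_complete & (1:numel(lines)) > header_line, 1, 'first');

% Fall back to "read to end of file" if no COMPLETE line was found
if isempty(end_line), last_row = Inf; else, last_row = end_line - 1; end

% Hand the detected line numbers to readtable()
opts = detectImportOptions(fname, 'FileType', 'text', 'Delimiter', '\t', ...
                           'NumHeaderLines', header_line - 1);
opts.VariableNamesLine = header_line;
opts.DataLines         = [header_line + 1, last_row];
T = readtable(fname, opts);
```

The Progress column contains a %-sign, so readtable() may import it as text rather than as a number; that would need separate handling.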
I choose textscan().
%% Read file
chr = fileread('AR_20base_201214_adh.txt');
%% Remove meta data
% Using 'adh_run_func :: tfinal' feels more robust than using the line number
pos = regexp( chr, '^adh_run_func :: tfinal', 'once', 'lineanchors' );
chr(1:pos-1) = []; % remove until the first line that begins with 'adh_run_func :: tfinal'
%% Remove the summary lines at the end
pos = regexp( chr, '^\d+[\% ]+COMPLETE', 'once', 'lineanchors' );
chr(pos:end) = [];
%% Get the column headers
txt = regexp( chr, '^Physics.+?$', 'match', 'once', 'lineanchors' );
column_headers = strsplit( txt, '\t' );
%% Parse the data rows
cac = textscan( chr, ['%s',repmat('%f',1,numel(column_headers)-1)] ...
, 'Headerlines' , 2 ... two remains after meta-data is removed
, 'Delimiter' , '\t' ...
, 'Whitespace' , ' %' ... ignore the %-sign in Progress
, 'CollectOutput' , true );
Physics = cac{1};
matrix = cac{2};
whos Physics matrix column_headers
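Since the question asks for a dataframe-like result, the three outputs can also be combined into a table. This is a minimal sketch that continues from the variables Physics, matrix and column_headers created above. The names "names" and "T" are introduced here for illustration, and the sketch assumes textscan() parsed every row, so that Physics and matrix have the same number of rows.

```matlab
% Turn the raw header strings into valid, unique table variable names
names = matlab.lang.makeValidName(strtrim(column_headers));
names = matlab.lang.makeUniqueStrings(names);

% Guard against a partial parse before combining
assert(numel(Physics) == size(matrix, 1), 'Row counts of Physics and matrix differ');

% First column holds the text labels, the remaining columns hold the numeric data
T = [ table(Physics, 'VariableNames', names(1)) ...
    , array2table(matrix, 'VariableNames', names(2:end)) ];
```

A header such as "Nonlinear Iteration" becomes NonlinearIteration after makeValidName(), so a column can then be accessed as T.NonlinearIteration.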