Using regexp for space-delimited strings in a text file

I need to extract the lines that start with a repeated string from the attached text file. For example, there are 2 lines in the data file which start with the string "P  1" (two spaces after P). I need to extract the 2nd to 4th columns of these lines, as follows:
array_P1=[ 6444.951599 -24080.372159 -8934.980576; 6645.371003 -22892.293251 -11497.619680];
I use the following code (from Stephen Cobeldick) when there are no spaces in the repeated string (for example P1):
fid = fopen('data_file.txt','rt');
str = fscanf(fid,'%c',Inf);
fclose(fid);
C = regexp(str,'^P1( +\S+)+\s+$','lineanchors','tokens');
C = regexp(vertcat(C{:}),'\S+','match');
N = str2double(vertcat(C{:}));
But this doesn't work if there are spaces in the repeated string, as in my example (P  1).

Accepted Answer

Stephen23 on 26 Jan 2016
Edited: Stephen23 on 26 Jan 2016
Try this:
% textscan options:
opt = {'MultipleDelimsAsOne',true,'CollectOutput',true};
% required arrays:
str = 'X';
dtv = [];
dat = {};
% open textfile:
fid = fopen('data.txt','rt');
while ischar(str)
    % skip lines until first char is '*' (date vector), or until end of file:
    while ischar(str) && ~strncmp(str,'*',1)
        str = fgetl(fid);
    end
    if ~ischar(str)
        break % end of file reached without another date vector
    end
    % convert date vector to numeric:
    dtv(end+1,:) = str2double(regexp(str(2:end),'\S+','match')); %#ok<SAGROW>
    % get file position:
    pos = ftell(fid);
    % read first line of matrix:
    str = fgetl(fid);
    if ischar(str)
        % calculate how many columns in the matrix:
        N = numel(regexp(str(5:end),'\S+','match'));
        fmt = repmat('%f',1,N);
        % rewind one line:
        fseek(fid,pos,'bof');
        % read entire matrix:
        dat{end+1} = textscan(fid,['%4[^*]',fmt],opt{:}); %#ok<SAGROW>
    end
end
% close textfile:
fclose(fid);
% concatenate data in cell arrays:
dat = vertcat(dat{:});
mat = vertcat(dat{:,2});
This reads the entire data matrix (between the date vectors) into a numeric matrix inside the cell array dat, and the date vectors in dtv. It automatically adjusts for the different numbers of columns in your matrices. Some important assumptions:
  • the first column consists of exactly four characters (which may be spaces).
  • the date vectors always start with asterisks, but no other lines do.
  • no empty lines between the date vectors and the data matrices.
  • the matrices contain numeric data only.
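To make the format-building step concrete, here is a minimal sketch of how the column count and the textscan format are derived from one matrix line. The sample line is an assumption: it uses the label and values from the question, and the real file may well have more columns.

```matlab
% Assumed sample line: 4-character label 'P  1' followed by numeric columns.
line1 = 'P  1   6444.951599 -24080.372159  -8934.980576';
% Count the whitespace-separated fields after the 4-character label:
N = numel(regexp(line1(5:end), '\S+', 'match'));
% One label field (up to 4 characters, none of them '*') plus N numeric fields:
fmt = ['%4[^*]', repmat('%f', 1, N)];
disp(N)    % 3
disp(fmt)  % %4[^*]%f%f%f
```

Since `[^*]` cannot match an asterisk, the label field fails on the next date-vector line, which appears to be what stops textscan at the end of each matrix.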
Have a look inside dat, and pick the data that you need:
>> cell2mat(cellfun(@(m)m(1,[1,2,3]),dat(:,2),'UniformOutput',false))
ans =
1.0e+04 *
0.6445 -2.4080 -0.8935
0.6645 -2.2892 -1.1498
I also concatenated the matrices into mat, which gives you all of the matrices in one. This might be easier to access:
>> mat([1,10],[1,2,3])
ans =
1.0e+04 *
0.6445 -2.4080 -0.8935
0.6645 -2.2892 -1.1498
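If hard-coded row numbers are inconvenient, the rows could instead be picked by their label. The following is only a sketch: lbl and the middle row of mat are invented stand-ins, and it assumes the labels read by textscan can be collected with something like vertcat(dat{:,1}) so that they line up with the rows of mat.

```matlab
% Assumed stand-ins for the real variables: in the code above the labels
% would presumably come from lbl = vertcat(dat{:,1}), alongside mat.
lbl = {'P  1'; 'P  2'; 'P  1'};
mat = [6444.951599 -24080.372159  -8934.980576; ...
       1111.000000   2222.000000   3333.000000; ...   % invented row
       6645.371003 -22892.293251 -11497.619680];
% Logical index of the rows whose label is 'P', optional spaces, then '1':
idx = ~cellfun('isempty', regexp(lbl, '^P\s*1\s*$', 'once'));
% Keep the first three numeric columns of the matching rows:
array_P1 = mat(idx, 1:3);
```

This keeps the selection tied to the label rather than to the position of the line in the file.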
I tested this code on both of the files that you have provided (this question, and your last question).

More Answers (1)

Guillaume on 26 Jan 2016
Well, I guess it's time for you to learn the regular expression language.
This regex should work for you:
'^P\s*1( +\S+)+\s+$'
It simply adds 0 or more (the *) whitespace characters (the \s) between P and 1.
  1 Comment
sermet on 26 Jan 2016
I modified the code as you explained;
C = regexp(str,'^P\s*1( +\S+)+\s+$','lineanchors','tokens')
But it produced an empty C:
{}
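For reference, here is a sketch of a regexp variant that allows spaces inside the label and does not require whitespace before the end of each line. It is not the code from either answer. The sample text is assumed (only the two "P  1" rows use values from the question, the other lines are invented), and it presumes every matching line has the same number of columns.

```matlab
% Assumed sample text; only the two 'P  1' rows use values from the question.
str = sprintf('%s\n', ...
    '*  2016  1 26  0  0  0.0', ...
    'P  1   6444.951599 -24080.372159  -8934.980576', ...
    'P  2   1111.000000   2222.000000   3333.000000', ...
    '*  2016  1 26  0 15  0.0', ...
    'P  1   6645.371003 -22892.293251 -11497.619680');
% Capture the rest of each line that starts with 'P', optional spaces, '1':
tok = regexp(str, '^P *1 +(.*)$', 'tokens', 'lineanchors', 'dotexceptnewline');
% Split each captured remainder into fields and convert them to numbers:
num = cellfun(@(t) str2double(regexp(t{1}, '\S+', 'match')), tok, ...
    'UniformOutput', false);
array_P1 = vertcat(num{:});
array_P1 = array_P1(:, 1:3);   % keep the first three numeric columns
```

Capturing the whole remainder of the line and splitting it in a second step avoids relying on how a repeated group such as ( +\S+)+ is returned as a token.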
