MATLAB Answers

regexp: Extract optional named tokens

Asked by Hau Kit Yong on 2 Jul 2019
Latest activity: edited by Akira Agata on 3 Jul 2019
I would like to extract some information from the following text:
text.PNG
There are 3 groups in the text. For each group, I want to extract the gender (enclosed in parentheses), the group name (the text following 'Name:') and the student IDs (the numbers following 'ID XX =').
My desired output is as follows:
struct.PNG
The issue is that not all groups have a header line (the lines starting with '#'), e.g. for group 3.
My code is as follows:
str = fileread('trip-data.txt');
expr = 'Student group.+?\((?<Gender>\w+?)\).*?Name:(?<Name>.+?)\nGROUP.+?=(?<IDs>.+?(,\s*\n.+?)*)(?=(\n|$))';
groups = regexp(str, expr, 'names');
The returned struct array ignores group 3:
Capture.PNG
I have also tried wrapping the header part of the expression in an optional group, i.e. '(...)?', like so:
expr = '(Student group.+?\((?<Gender>\w+?)\).*?Name:(?<Name>.+?))?\nGROUP.+?=(?<IDs>.+?(,\s*\n.+?)*)(?=(\n|$))';
The returned struct captures the 'ID' fields but not the 'Gender' and 'Name' fields for all 3 groups:
Capture1.PNG

  2 Comments

Rik
on 2 Jul 2019
Do you absolutely need to use a regexp? Because it might be easier with other tools (if maybe slightly less efficient).
I would like to, yes, because the text is a small snippet of a much larger file with varying formats that I am already parsing with other expressions.

1 Answer

Answer by Akira Agata
on 3 Jul 2019 (edited on 3 Jul 2019)

How about extracting 'Name', 'Gender' and 'ID' one-by-one?
The following is an example.
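% Read the file (the patterns below assume Windows CRLF line endings)
str = fileread('trip-data.txt');
% Join indented continuation lines of an ID list onto the previous line
str = regexprep(str,'\r\n\s+','');
% Join the line ending in 'Name: XX' with the line that follows it
str = regexprep(str,'(Name:\s+\w+)\r\n','$1, ');
% Split into lines; after the joins above, each group should sit on one line
c = strsplit(str,'\r\n')';
% Extract each field separately; regexp returns '' where Name or Gender is absent
Name = erase(regexp(c,'Name:\s(\w+)','match','once'),'Name: ');
Gender = regexp(c,'(male|female)','match','once');
ID = strtrim(extractAfter(c,'='));
% Summarize as a table
tbl = table(Name,Gender,ID);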
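
If a struct array like the one produced by the 'names' option is preferred, the same one-by-one idea can be sketched with regexp alone. This is only a sketch under assumptions: the sample text below is hypothetical (the real file is shown only as an image), it assumes a single optional header line starting with '#', indented continuation lines for the ID list, and '\n' line endings. The variable names blocks, gender, name and ids are illustrative.

```matlab
% Hypothetical sample text; the layout of the real file is an assumption.
str = sprintf(['# Student group 1 (male) Name: Alpha\n' ...
               'GROUP 1 ID 01 = 101, 102,\n' ...
               '    103\n' ...
               '# Student group 2 (female) Name: Beta\n' ...
               'GROUP 2 ID 02 = 201, 202\n' ...
               'GROUP 3 ID 03 = 301\n']);

% Normalize Windows line endings so the patterns only deal with '\n'
str = strrep(str, sprintf('\r\n'), sprintf('\n'));

% Step 1: cut the text into one block per group. Each block is an optional
% header line, the GROUP line, and any indented continuation lines.
blocks = regexp(str, '(#[^\n]*\n)?GROUP[^\n]*(\n[ \t]+[^\n]*)*', 'match');

% Step 2: search each block for the optional fields separately.
% With 'match' and 'once', a block without a header yields ''.
gender = regexp(blocks, '(?<=\()\w+(?=\))', 'match', 'once');
name   = regexp(blocks, '(?<=Name:\s)\w+', 'match', 'once');

% Step 3: keep everything after the first '=' and collapse the whitespace
% left over from continuation lines.
ids = regexprep(blocks, '^.*?=', '');
ids = strtrim(regexprep(ids, '\s+', ' '));

% Cell array inputs make struct() return one element per group
groups = struct('Gender', gender, 'Name', name, 'IDs', ids);
```

Because the optional header is only used to delimit the blocks and the fields are extracted in separate calls, no named token has to sit inside an optional group, which is the part that returned empty fields in the question.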
