MATLAB Answers

Regexp text-block matching challenges

Nate on 22 Feb 2015
Commented: Nate on 24 Feb 2015
Hi all,
Looking for some assistance in the following parsing task:
I am reading a report (.txt) file in as a string via 'fileread'. From there, I have so far been unable to find all instances of a certain aperiodic RAM dump:
----
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1

Feb 20 2015-08:11:42 **TestCompleted
----
A few questions have come up during these attempts:
  1. Can regexp take more than 1 'option' argument? Does not seem to be the case, via the regexp call itself, but would two flags work in the pattern itself? e.g. (?m-s)
  2. If I have the 'lineanchors'/(?m) flag enabled, am I able to use the anchor symbols more than once within the pattern to signify expected text on multiple lines?
  3. What is the order of operations for a combined 'lookahead'-'lookbehind' call? I am envisioning the 'lookbehind' would truncate the string in one direction, feed that remainder into the 'lookahead' which would truncate the other side?
Anyway, these questions culminate in the following failed attempt at a regexp pattern. Note that my approach has been to target the 'empty' line that follows the RAM dump; targeting a non-space character at the start of the next line might also be valid (in the posted sample, the 'F' in 'Feb'):
repatt = '(?m)(?<=^RAM_DUMP:).+(?<!^\s?$)' ;
If you know how to accomplish this in another language, please post as well.

  3 Comments

Stephen Cobeldick on 23 Feb 2015
This doesn't solve the problem directly, but you could try my interactive Regexp Helper to work on this:
It lets you try different combinations and see what effects they have on the outputs. Unlike other similar Regexp Helpers on FEX it can be treated as a simple regexp call, as all inputs are passed through internally. Although other Regexp Helpers offer support for files, fancy colors and other such luxuries: you can search for these if you need to parse large pieces of text.
Nate on 23 Feb 2015
@Stephen_Cobeldick thanks for the code! I was actually using SlickEdit's built-in regex helper and trying to translate from Perl to Matlab, which is less than ideal as I am not too familiar with Perl. Will try out your GUI.
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
per isakson on 24 Feb 2015
Replacing the look-ahead expression
'(?<!^\s?$)' // this is not according to the Matlab dialect
by
'(?=^\s*$)' // allow more than one space in empty line
and
'.+' // will match to the last empty line
by
'.+?' // lazy
will make the expression work
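A minimal sketch of the two replacements above, so the effect of the lazy quantifier can be checked directly. The sample text and the variable names (str, greedyPat, lazyPat) are made up for illustration and are not taken from the report file; the sketch assumes '\n' line endings.

```matlab
% Hypothetical sample text (not the real report): two dumps, each followed
% by an empty line and a status line.
str = sprintf(['RAM_DUMP:\nRAM_ADDRESS 0 2\n0X0001B650 0018 0005\n\n', ...
               'Feb 20 2015-08:11:42 **TestCompleted\n', ...
               'RAM_DUMP:\nRAM_ADDRESS 0 2\n0X0001CC80 FFDB FD00\n\n', ...
               'Feb 20 2015-08:12:10 **TestCompleted\n']);

greedyPat = '(?m)(?<=^RAM_DUMP:).+(?=^\s*$)';    % '.+' runs on past the nearest empty line
lazyPat   = '(?m)(?<=^RAM_DUMP:).+?(?=^\s*$)';   % '.+?' stops at the nearest empty line

greedy = regexp(str, greedyPat, 'match');
lazy   = regexp(str, lazyPat,   'match');

% Compare how many blocks each pattern finds.
fprintf('greedy: %d match(es), lazy: %d match(es)\n', numel(greedy), numel(lazy));
```

With the greedy version the second RAM_DUMP: title ends up inside the first match, which is why only the lazy version separates the dumps.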


Accepted Answer

per isakson on 23 Feb 2015
Edited: per isakson on 24 Feb 2015
With R2013b, this script returns the three blocks between RAM_DUMP: and the closest empty line
str = fileread( 'cssm.txt' );
xpr = '(?m)(?<=^RAM_DUMP:).+?(?=(^\s*$))';
cac = regexp( str, xpr, 'match' );
>> whos cac
  Name      Size            Bytes  Class    Attributes

  cac       1x3              2112  cell
where cssm.txt contains
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1

Feb 20 2015-08:11:42 **TestCompleted
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1

Feb 20 2015-08:11:42 **TestCompleted
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1

Feb 20 2015-08:11:42 **TestCompleted
Did you try to replace '(?m-s)expression' by '(?m)(?-s)expression' ?
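For comparison, here is a small sketch of the same two settings written once as separate inline flags and once as option arguments to regexp. It relies on the documented correspondence of (?m) with 'lineanchors' and (?-s) with 'dotexceptnewline'; the sample string and variable names are made up for illustration.

```matlab
% Hypothetical three-line sample string.
s = sprintf('one\ntwo\nthree');

% Inline flags, each in its own group.
viaFlags = regexp(s, '(?m)(?-s)^.+$', 'match');

% The same settings passed as two separate option arguments.
viaOptions = regexp(s, '^.+$', 'match', 'lineanchors', 'dotexceptnewline');

% Both calls should return one match per line.
disp(isequal(viaFlags, viaOptions))
```

Because '.' is not allowed to cross a newline here and ^ and $ apply to every line, each line comes back as its own match.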

  1 Comment

Nate on 24 Feb 2015
@per_isakson, thank you sir. The sheer anger this seemingly remedial task was causing me on Saturday morning nearly ruined my coffee and doughnut.
I have not tried your flag suggestion, but agree with @Guillaume that the function arguments are clearer. I did not open the mex to find that the 'options' argument was essentially 'varargin'.


More Answers (2)

Guillaume on 23 Feb 2015
Wouldn't this work:
matches = regexp(s, '(?<=^RAM_DUMP:\n).*?(?=^$)', 'match', 'dotall', 'lineanchors')
Note that your expression contains two look-behind expressions instead of a look-behind and a look-ahead, so is never going to work.
Rather than using options in the regex which I find makes it even harder to parse, I prefer to pass the options explicitly to regexp as above.

  2 Comments

Guillaume on 23 Feb 2015
Oh, I forgot that matlab is probably the only regular expression engine where . matches newline by default. So you don't need the 'dotall' (although it does not hurt).
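A quick way to see that default, as a sketch with a made-up two-line string (the variable names are only for illustration):

```matlab
% Hypothetical sample: 'a' and 'b' separated by a newline.
s = sprintf('a\nb');

% Default behaviour ('dotall'): '.' is allowed to match the newline.
mDefault = regexp(s, 'a.b', 'match');

% With 'dotexceptnewline', '.' can no longer cross the line break.
mNoNewline = regexp(s, 'a.b', 'match', 'dotexceptnewline');

fprintf('default: %d match(es), dotexceptnewline: %d match(es)\n', ...
        numel(mDefault), numel(mNoNewline));
```

This is also why a pattern ported from Perl, where '.' stops at the newline unless /s is given, can behave differently in MATLAB.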
Nate on 24 Feb 2015
Thanks! This answer works if I change the pattern to:
'(?<=^RAM_DUMP:).*?(?=^$)'
Thanks, @Guillaume. Gave the answer to @per_isakson because of order.



Image Analyst on 22 Feb 2015
Why not use strfind()?

  3 Comments

Nate on 23 Feb 2015
This is essentially the approach I use at the moment: I find the location of the title ('RAM_DUMP'), then while-loop until I have read all the RAM contents.
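For reference, a rough sketch of a strfind-based version. This is not my actual code: it swaps the while-loop for a second strfind call that locates the empty lines. The sample text and the names fstr, header, starts, blanks and blocks are made up for illustration, and it assumes '\n' line endings and empty lines that contain no spaces.

```matlab
% Hypothetical sample text standing in for the report file contents.
fstr = sprintf(['RAM_DUMP:\nRAM_ADDRESS 0 2\n0X0001B650 0018 0005\n\n', ...
                'Feb 20 2015-08:11:42 **TestCompleted\n', ...
                'RAM_DUMP:\nRAM_ADDRESS 0 2\n0X0001CC80 FFDB FD00\n\n', ...
                'Feb 20 2015-08:12:10 **TestCompleted\n']);

header = 'RAM_DUMP:';
starts = strfind(fstr, header);            % index of each title
blanks = strfind(fstr, sprintf('\n\n'));   % index where each empty line begins

blocks = cell(1, numel(starts));
for k = 1:numel(starts)
    first = starts(k) + numel(header);          % first character after the title
    stop  = blanks(find(blanks >= first, 1));   % nearest empty line after the title
    if isempty(stop)
        stop = numel(fstr);                     % no empty line left: take the rest
    end
    blocks{k} = fstr(first:stop);               % text between title and empty line
end

fprintf('%d block(s) found\n', numel(blocks));
```

Unlike the regexp versions, this sketch does not check that the title sits at the start of a line.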
I was more just curious, as I spent a decent amount of time trying to figure out how to use regex to do this and failed. Also, I am of the opinion that well-commented regex is more maintainable than custom parsing logic.
Image Analyst on 24 Feb 2015
I'm not sure I buy that last sentence. When reading code it's always easy to understand what a strfind() is looking for but looking at things like '(?<=^RAM_DUMP:\n).*?(?=^$)' , it's hard to tell exactly what it will find. If your junior colleague is to take over your code, and he is a beginner MATLAB programmer, which do you think will be easier for him to understand and maintain: strfind with two plain text strings, or that cryptic mess of special characters? I mean, you said yourself "I spent a decent amount of time trying to figure out how to use regex to do this and failed. " so my point is proven.
Nate on 24 Feb 2015
I'm not sure I buy that last sentence.
% (?<=^RAM_DUMP:).+?(?=(^\s*$))
REpatt = [...
    '(?<='...       LOOKBEHIND: match must be preceded by
    '^RAM_DUMP:'... title as first entry of line
    ')',...         END LOOKBEHIND
    '.+',...        matched pattern can be any character, one or more
    '?',...         non-greedy, stop at the first point where the lookahead succeeds
    '(?='...        LOOKAHEAD: match must be followed by
    '(^\s*$)'...    empty line
    ')',...         END LOOKAHEAD
    ] ;
matches = regexp( fstr, REpatt, 'match',...
    'lineanchors'... [^$] re-usable at each line break
    ) ;
for match = matches
    process(match)
end
I am totally biased, however, as I have already implemented this functionality with logic and have more insight into the larger problem. I don't feel my existing code has any shorter a learning curve than the commented regex above. Another reason for this path (that I did not mention) is that the report format differs depending on the utility used to generate it, meaning this regex pattern will be one element in a set, all targeting the same RAM_DUMP. I feel the debate over whether this is better done with a bunch of functions or with regular expressions is moot. The regex approach might also be easier to port to another language.
Anyway, part of posting this was for my own learning and fun. I think I will be using this regex structure again.
