Regexp text-block matching challenges
Hi all,
Looking for some assistance in the following parsing task:
I am reading a report (.txt) file in as a string via 'fileread'. From there I am trying, without success, to find all instances of a certain aperiodic RAM dump:
----
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1
Feb 20 2015-08:11:42 **TestCompleted
----
I have a few questions that have arisen from these attempts:
- Can regexp take more than 1 'option' argument? Does not seem to be the case, via the regexp call itself, but would two flags work in the pattern itself? e.g. (?m-s)
- If I have the 'lineanchors'/(?m) flag enabled, am I able to use the anchor symbols more than once within the pattern to signify expected text on multiple lines?
- What is the order of operations for a combined 'lookahead'-'lookbehind' call? I am envisioning the 'lookbehind' would truncate the string in one direction, feed that remainder into the 'lookahead' which would truncate the other side?
Anyway, these questions culminate in the following failed attempt at a successful regexp pattern. It is worth noting that my approach has been to target the 'empty' line that follows the RAM dump, however targeting the non-space character as the first line character might also be valid (in the posted sample, the 'F' in 'Feb'):
repatt = '(?m)(?<=^RAM_DUMP:).+(?<!^\s?$)' ;
If you know how to accomplish this in another language, please post as well.
3 Comments
per isakson
on 24 Feb 2015
Edited: per isakson
on 24 Feb 2015
Replacing the trailing expression
'(?<!^\s?$)' % a negative look-behind, where a look-ahead is needed
by
'(?=^\s*$)' % look-ahead; also allows more than one space in the empty line
and
'.+' % greedy: will match up to the last empty line
by
'.+?' % lazy: stops at the first empty line
will make the expression work.
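Putting both replacements together, a minimal self-contained sketch might look like this. The sample string is invented for illustration (it is not the asker's file), and it assumes LF-only line endings and a blank line after each dump.

```matlab
% Hypothetical two-dump sample; a blank line terminates each dump.
str = sprintf([ ...
    'header line\n' ...
    'RAM_DUMP:\n' ...
    'RAM_ADDRESS 0 2 4 6 8 A C E\n' ...
    '0X0001B650 0018 0005\n' ...
    '\n' ...
    'Feb 20 2015-08:11:42 **TestCompleted\n' ...
    'RAM_DUMP:\n' ...
    '0X0001CCA0 F780 FFE1\n' ...
    '\n' ...
    'trailer\n']);

% (?m)              ^ and $ anchor at line boundaries
% (?<=^RAM_DUMP:)   look-behind: match starts right after the tag
% .+?               lazy, so it stops at the first blank line
% (?=^\s*$)         look-ahead: next line is empty or whitespace only
xpr    = '(?m)(?<=^RAM_DUMP:).+?(?=^\s*$)';
blocks = regexp(str, xpr, 'match');

% Each block still carries its surrounding newlines; strtrim drops them.
blocks = cellfun(@strtrim, blocks, 'UniformOutput', false);
disp(numel(blocks))
```

The look-arounds only test the text around the match and consume nothing, so the tag and the blank line are not part of the returned blocks.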
Accepted Answer
per isakson
on 23 Feb 2015
Edited: per isakson
on 24 Feb 2015
With R2013b, this script returns the three blocks, each running from RAM_DUMP: to the closest empty line:
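str = fileread( 'cssm.txt' );                 % whole report as one char vector
xpr = '(?m)(?<=^RAM_DUMP:).+?(?=(^\s*$))';    % lazy match up to the next blank line
cac = regexp( str, xpr, 'match' );            % cell array, one element per dump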
>> whos cac
  Name      Size      Bytes  Class    Attributes
  cac       1x3        2112  cell
where cssm.txt contains
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1
Feb 20 2015-08:11:42 **TestCompleted
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1
Feb 20 2015-08:11:42 **TestCompleted
RAM_DUMP:
RAM_ADDRESS 0 2 4 6 8 A C E
0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
0X0001CCA0 F780 FFE1
Feb 20 2015-08:11:42 **TestCompleted
 
Did you try to replace '(?m-s)expression' by '(?m)(?-s)expression' ?
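As a sketch of that suggestion, the two modes can be set either with separate inline flags at the start of the pattern or with option arguments to regexp, which accepts more than one. The sample string below is made up; 'lineanchors' and 'dotexceptnewline' are the option-argument counterparts of (?m) and (?-s).

```matlab
% Hypothetical sample: two data lines surrounded by other text.
str = sprintf([ ...
    'RAM_DUMP:\n' ...
    '0X0001B650 0018 0005\n' ...
    '0X0001CCA0 F780 FFE1\n' ...
    '\n' ...
    'trailer\n']);

% Variant 1: separate inline flags, (?m) then (?-s).
viaFlags = regexp(str, '(?m)(?-s)^0X.*$', 'match');

% Variant 2: the same two modes passed as option arguments.
viaOptions = regexp(str, '^0X.*$', 'match', ...
                    'lineanchors', 'dotexceptnewline');

% Both variants should return one element per data line.
disp(isequal(viaFlags, viaOptions))
```

With the dot barred from matching newlines, `.*` cannot run past the end of a line, so each match is a single address row.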
More Answers (2)
Guillaume
on 23 Feb 2015
Wouldn't this work:
matches = regexp(s, '(?<=^RAM_DUMP:\n).*?(?=^$)', 'match', 'dotall', 'lineanchors')
Note that your expression contains two look-behind expressions instead of a look-behind and a look-ahead, so is never going to work.
Rather than putting options inside the regex, which I find makes it even harder to parse, I prefer to pass the options explicitly to regexp, as above.
2 Comments
Guillaume
on 23 Feb 2015
Oh, I forgot that MATLAB is probably the only regular expression engine where . matches newline by default. So you don't need the 'dotall' option (although it does not hurt).
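A quick way to check the default, as a small sketch with an invented three-character string:

```matlab
s = sprintf('a\nb');   % 'a', newline, 'b'

% Default mode: the dot is allowed to match the newline.
hitDefault = regexp(s, 'a.b', 'match', 'once');

% With 'dotexceptnewline' the same pattern should find nothing.
hitNoNewline = regexp(s, 'a.b', 'match', 'once', 'dotexceptnewline');

disp(isempty(hitDefault))
disp(isempty(hitNoNewline))
```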
Image Analyst
on 22 Feb 2015
Why not use strfind()?
3 Comments
Image Analyst
on 24 Feb 2015
I'm not sure I buy that last sentence. When reading code it's always easy to understand what a strfind() is looking for, but looking at things like '(?<=^RAM_DUMP:\n).*?(?=^$)', it's hard to tell exactly what it will find. If your junior colleague is to take over your code, and he is a beginner MATLAB programmer, which do you think will be easier for him to understand and maintain: strfind with two plain text strings, or that cryptic mess of special characters? I mean, you said yourself "I spent a decent amount of time trying to figure out how to use regex to do this and failed." so my point is proven.
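For comparison, here is one way the strfind approach with two plain text strings could look. It is only a sketch: the sample string is invented, and it assumes LF-only line endings and separator lines that are truly empty (no stray spaces), which the regexp versions tolerate and this one does not.

```matlab
% Hypothetical two-dump sample; a blank line terminates each dump.
str = sprintf([ ...
    'RAM_DUMP:\n' ...
    'RAM_ADDRESS 0 2 4 6 8 A C E\n' ...
    '0X0001B650 0018 0005\n' ...
    '\n' ...
    'Feb 20 2015-08:11:42 **TestCompleted\n' ...
    'RAM_DUMP:\n' ...
    '0X0001CCA0 F780 FFE1\n' ...
    '\n' ...
    'trailer\n']);

startTag = 'RAM_DUMP:';
endTag   = sprintf('\n\n');          % end of a line followed by an empty line

starts = strfind(str, startTag);     % index of each tag
stops  = strfind(str, endTag);       % index of each blank-line boundary

blocks = cell(1, numel(starts));
for k = 1:numel(starts)
    first = starts(k) + numel(startTag);      % first char after the tag
    stop  = stops(find(stops >= first, 1));   % nearest boundary after it
    if isempty(stop)
        blocks{k} = strtrim(str(first:end));  % no blank line: take the rest
    else
        blocks{k} = strtrim(str(first:stop));
    end
end
disp(numel(blocks))
```

The trade-off is the one argued above: every step is plain indexing, at the cost of a loop and stricter assumptions about the separator.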