Regexp text-block matching challenges
Hi all,
Looking for some assistance in the following parsing task:
I am reading a report (.txt) file in as a string via 'fileread'. From there I am attempting, without success, to find all instances of a certain aperiodic RAM dump:
----
RAM_DUMP:
 RAM_ADDRESS    0    2    4    6    8    A    C    E
  0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
  0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
  0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
  0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
  0X0001CCA0 F780 FFE1

Feb 20 2015-08:11:42  **TestCompleted
----
A few questions have arisen from these attempts:
- Can regexp take more than one 'option' argument? That does not seem to be the case via the regexp call itself, but would two flags work in the pattern itself, e.g. (?m-s)?
- If I have the 'lineanchors'/(?m) flag enabled, am I able to use the anchor symbols more than once within the pattern to signify expected text on multiple lines?
- What is the order of operations for a combined 'lookahead'-'lookbehind' call? I am envisioning that the 'lookbehind' would truncate the string in one direction and feed the remainder into the 'lookahead', which would truncate the other side.
Anyway, these questions culminate in the following failed attempt at a regexp pattern. It is worth noting that my approach has been to target the 'empty' line that follows the RAM dump; however, targeting the non-space character as the first line character might also be valid (in the posted sample, the 'F' in 'Feb'):
repatt = '(?m)(?<=^RAM_DUMP:).+(?<!^\s?$)' ;
If you know how to accomplish this in another language, please post as well.
3 Comments
per isakson on 24 Feb 2015
Edited: per isakson on 24 Feb 2015
Replacing the expression meant as a look-ahead
    '(?<!^\s?$)'  // this is not according to the Matlab dialect
by
    '(?=^\s*$)'  // allow more than one space in empty line
and
    '.+'         // will match to the last empty line
by
    '.+?'        // lazy
will make the expression work.
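To illustrate the two substitutions, here is a minimal sketch on a made-up two-block string (the sample text and variable names are hypothetical, not the original report). It assumes each dump is terminated by a truly empty line. The greedy `.+` should run on to the last empty line and return one merged match, while the lazy `.+?` should stop at the nearest one. In both cases the text consumed by the look-behind and look-ahead is not part of the match.

```matlab
% Hypothetical sample: two dumps, each followed by an empty line.
str = sprintf('RAM_DUMP:\n AAAA\n\nlog\nRAM_DUMP:\n BBBB\n\nend');

% Greedy: .+ backtracks from the end of the string to the LAST empty line.
greedy = regexp(str, '(?m)(?<=^RAM_DUMP:).+(?=^\s*$)', 'match');

% Lazy: .+? expands only until the NEAREST empty line.
lazy = regexp(str, '(?m)(?<=^RAM_DUMP:).+?(?=^\s*$)', 'match');

fprintf('greedy: %d match(es), lazy: %d match(es)\n', numel(greedy), numel(lazy));
```

On this sample I would expect one greedy match spanning both dumps and two separate lazy matches.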
Accepted Answer
per isakson on 23 Feb 2015
Edited: per isakson on 24 Feb 2015
With R2013b, this script returns the three blocks between RAM_DUMP: and the closest empty line:
    str = fileread( 'cssm.txt' );
    xpr = '(?m)(?<=^RAM_DUMP:).+?(?=(^\s*$))';
    cac = regexp( str, xpr, 'match' ); 
    >> whos cac
      Name      Size            Bytes  Class    Attributes
      cac       1x3              2112  cell
where cssm.txt contains
    RAM_DUMP:
     RAM_ADDRESS    0    2    4    6    8    A    C    E
      0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
      0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
      0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
      0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
      0X0001CCA0 F780 FFE1

    Feb 20 2015-08:11:42  **TestCompleted
    RAM_DUMP:
     RAM_ADDRESS    0    2    4    6    8    A    C    E
      0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
      0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
      0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
      0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
      0X0001CCA0 F780 FFE1

    Feb 20 2015-08:11:42  **TestCompleted
    RAM_DUMP:
     RAM_ADDRESS    0    2    4    6    8    A    C    E
      0X0001B650 0018 0005 001A 0042 001C 0000 0022 0A0C
      0X0001B660 0024 0030 0026 1B7F 0028 0A01 002A 0000
      0X0001CC80 FFDB FD00 F700 FFE3 4282 0400 0000 FFD5
      0X0001CC90 FBFE F720 FFE1 42C2 0400 0000 FFF8 00FA
      0X0001CCA0 F780 FFE1

    Feb 20 2015-08:11:42  **TestCompleted
 
Did you try to replace '(?m-s)expression' by '(?m)(?-s)expression' ?
More Answers (2)
Guillaume on 23 Feb 2015
Wouldn't this work:
matches = regexp(s, '(?<=^RAM_DUMP:\n).*?(?=^$)', 'match', 'dotall', 'lineanchors')
Note that your expression contains two look-behind expressions instead of a look-behind and a look-ahead, so it is never going to work.
Rather than using options in the regex, which I find makes it even harder to parse, I prefer to pass the options explicitly to regexp as above.
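As a small sketch of that preference: regexp accepts several option arguments in one call, and each has an inline-flag counterpart ('lineanchors' corresponds to (?m), 'dotexceptnewline' to (?-s)). The three-line string and the variable names below are made up for illustration. The last call also shows that, with 'lineanchors' on, ^ and $ can appear more than once in a pattern to describe consecutive lines.

```matlab
str = sprintf('a1\nb2\na3');   % hypothetical three-line input

% Same anchoring behavior, expressed as an inline flag and as an option.
viaFlag   = regexp(str, '(?m)^a\d$', 'match');
viaOption = regexp(str, '^a\d$', 'match', 'lineanchors');
disp(isequal(viaFlag, viaOption));

% Two options in a single call: one match per line.
perLine = regexp(str, '^.+$', 'match', 'lineanchors', 'dotexceptnewline');
disp(numel(perLine));

% Anchors used on two consecutive lines within one pattern.
pair = regexp(str, '^a\d$\n^b\d$', 'match', 'lineanchors');
disp(numel(pair));
```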
2 Comments
Guillaume on 23 Feb 2015
Oh, I forgot that MATLAB is probably the only regular expression engine where . matches newline by default. So you don't need the 'dotall' (although it does not hurt).
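A quick way to see that default, on a made-up two-line string: without any option the dot runs across the line break, and 'dotexceptnewline' gives the behavior most other engines start with. The variable names are only illustrative.

```matlab
str = sprintf('one\ntwo');   % hypothetical input with a single newline

% Default ('dotall'): the dot also matches the newline, so one match.
acrossLines = regexp(str, '.+', 'match');

% 'dotexceptnewline': the dot stops at the newline, so one match per line.
perLine = regexp(str, '.+', 'match', 'dotexceptnewline');

fprintf('default: %d match(es), dotexceptnewline: %d match(es)\n', ...
    numel(acrossLines), numel(perLine));
```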
Image Analyst on 22 Feb 2015
Why not use strfind()?
3 Comments
Image Analyst on 24 Feb 2015
I'm not sure I buy that last sentence. When reading code it's always easy to understand what a strfind() is looking for, but looking at things like '(?<=^RAM_DUMP:\n).*?(?=^$)', it's hard to tell exactly what it will find. If your junior colleague is to take over your code, and he is a beginner MATLAB programmer, which do you think will be easier for him to understand and maintain: strfind with two plain text strings, or that cryptic mess of special characters? I mean, you said yourself "I spent a decent amount of time trying to figure out how to use regex to do this and failed." so my point is proven.
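For comparison, a strfind-based version might look like the sketch below. It is only an illustration under two assumptions that the thread does not confirm: the file uses bare \n line endings, and the 'empty' line contains no spaces, so that it shows up as two consecutive newlines. The sample string and all variable names are hypothetical.

```matlab
% Hypothetical sample: two dumps, each followed by an empty line.
str = sprintf('RAM_DUMP:\n AAAA\n\nlog\nRAM_DUMP:\n BBBB\n\nend');

startTag = 'RAM_DUMP:';
endTag   = sprintf('\n\n');      % assumed form of an empty line

starts = strfind(str, startTag) + numel(startTag);  % first char after each tag
ends   = strfind(str, endTag);                      % first newline of each pair

blocks = cell(1, numel(starts));
for k = 1:numel(starts)
    stop = ends(find(ends >= starts(k), 1));        % nearest terminator
    if isempty(stop)
        stop = numel(str);                          % no terminator: take the rest
    end
    blocks{k} = str(starts(k):stop);
end
```

With \r\n line endings, or with stray spaces on the 'empty' line, the terminator string would have to change, which is the case the \s* in the regexp pattern absorbs.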


