How to parse text data
Hi
I have data in the formats below. I need a way to parse each format into the expected output described after it.
Input data format:
07/16 12:55:22.012 INFO | test_runner_utils:0812| Began logging to /tmp/test_that_results_hatch_deL3lZ
07/16 12:55:27.477 INFO | test_runner_utils:0259| autoserv| Processing control file
Expected Output format:
Define level of message extraction based on the marker sign ==> |
-Step 1: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|
-Step 2: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>| extract full text in a variable, option to grab variable if associated with value
-Step 3: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|
-Step 4: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|extract full text in a variable, option to grab variable if associated with value
-Step 5: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|<string>|
-Step 6: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|<string>|extract full text in a variable, option to grab variable if associated with value
Input data format:
07/16 12:55:27.620 DEBUG| utils:0287| [stdout] CHROMEOS_RELEASE_BOARD=hatch
07/16 13:28:58.330 INFO | mode_switcher:0673| -[FAFT]-[ start wait_for_client ]---
Expected Output format:
-Step 1: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|<[string]>
-Step 2: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|<[string]> extract full text in a variable, option to grab variable if associated with value
-Step 3: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|<[string]>
-Step 4: Extract Timestamp in mm/dd HH:MM:sec.millisec <string>|<string>:%1.3f|<[string]> [string] extract full text in a variable, option to grab variable if associated with value
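One possible way to pull out the bracketed tags and an optional name=value pair from the message part is sketched below. The message strings are copied from the sample lines above; the variable names (`msg1`, `msg2`, `tagPattern`, `kv`) and the patterns themselves are assumptions made for illustration, not part of the original data description.

```matlab
% Message parts taken from the two sample lines above.
msg1 = '[stdout] CHROMEOS_RELEASE_BOARD=hatch';
msg2 = '-[FAFT]-[ start wait_for_client ]---';

% Text between square brackets, with surrounding blanks trimmed.
tagPattern = '\[\s*([^\]]+?)\s*\]';

% First tag only ('once' returns the tokens of the first match directly).
firstTag = regexp(msg1, tagPattern, 'tokens', 'once');

% All tags in a message, flattened into a single cell array of char vectors.
allTags = regexp(msg2, tagPattern, 'tokens');
allTags = [allTags{:}];

% Optional name=value pair; kv is empty when the message has none.
kv = regexp(msg1, '(?<key>\w+)=(?<value>\S+)', 'names', 'once');
```

Here `firstTag{1}` holds the tag of the first message, `allTags` holds both tags of the second, and `kv.key` / `kv.value` hold the two sides of the assignment.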
Input data format:
2019-07-16 12:55:30 > string
2019-07-16 12:55:30 powerbtn: released
Expected Output format:
Note the marker >
-Step 1: Extract Timestamp in YYYY-MM-DD HH:MM:sec > <string>
-Step 2: Extract Timestamp in YYYY-MM-DD HH:MM:sec <full string>
Input data format:
2019-07-16 12:55:31 > [12074.734997 HC 0x121 err 1]
Expected Output format:
-Step 1: Extract Timestamp in YYYY-MM-DD HH:MM:sec > [<%1.3f string, extract full text in a variable, option to grab variable if associated with value>]
Thanks a lot
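For the second family of lines (year-first timestamp, optional `>` marker), a minimal sketch of the kind of extraction being asked for might look like this. The sample lines come from the question; the pattern and the names `logLines`, `rows`, `hasMarker` and `value` are assumptions chosen for the example.

```matlab
% Sample lines from the question, one per cell.
logLines = {'2019-07-16 12:55:30 > string', ...
            '2019-07-16 12:55:30 powerbtn: released', ...
            '2019-07-16 12:55:31 > [12074.734997 HC 0x121 err 1]'};

% Token 1: timestamp. Token 2: optional '>' marker. Token 3: rest of the line.
pattern = '^(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d)\s+(>?)\s*(.*)$';
tokens = regexp(logLines, pattern, 'tokens', 'once');

% One row per line, one column per token.
rows = vertcat(tokens{:});

% True where the line carried the '>' marker.
hasMarker = ~cellfun(@isempty, rows(:, 2));

% Leading number inside a bracketed payload, converted to double.
num = regexp(rows{3, 3}, '^\[(\d+\.\d+)', 'tokens', 'once');
value = str2double(num{1});
```

The timestamp column `rows(:, 1)` is left as text here; it could be passed to datetime with 'InputFormat', 'yyyy-MM-dd HH:mm:ss' if date arithmetic is needed.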
5 Comments
Bob Thompson
on 17 Jul 2019
Life is Wonderful
on 18 Jul 2019
Edited: Life is Wonderful
on 18 Jul 2019
Guillaume
on 19 Jul 2019
I've not looked at this question in details. Does the file format differ much from the one in your previous question?
If not, it should be fairly trivial to adapt the parser I wrote, which would be a lot less effort than starting again from scratch.
Life is Wonderful
on 19 Jul 2019
Edited: Life is Wonderful
on 19 Jul 2019
Life is Wonderful
on 23 Jul 2019
Accepted Answer
More Answers (2)
Bob Thompson
on 18 Jul 2019
0 votes
I need the next steps:
◾Convert the data content into cells, e.g. timestamp, message data-1, message data-2
◾Put the cells in the proper format
◾Create MATLAB variables
◾Display the MATLAB variables for analysis
1) With the 'match' or 'tokens' output options, regexp returns all results in a cell array, each cell containing a char vector.
2) You can convert those strings to date and time values using datetime. To do this 'quickly' I suggest looping through your regexp results, or using cellfun (which is really still a loop).
3) What exactly do you mean by this? I personally do not know of a way to dynamically create variables within MATLAB, and I think you would be better served keeping the information in a cell array, or making a table out of it. It is certainly possible to create new variables in a table from a string captured by regexp.
4) Displaying MATLAB variables is simply a matter of not suppressing their output with a semicolon. If you specifically want to print them, you can use fprintf without a file identifier, so the output goes to the Command Window.
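The four points above can be sketched end to end. This is a minimal illustration rather than a definitive implementation: the sample lines are taken from the question, while the pattern and the names `pattern`, `rows`, `timestamps` and `logTable` are assumptions made for the example. The log lines carry no year, so the year in the resulting values is whatever datetime defaults to.

```matlab
% Two sample lines from the question, joined with a newline.
filecontent = sprintf('%s\n%s', ...
    '07/16 12:55:22.012 INFO | test_runner_utils:0812| Began logging to /tmp/test_that_results_hatch_deL3lZ', ...
    '07/16 12:55:27.477 INFO | test_runner_utils:0259| autoserv| Processing control file');

% One token per field: timestamp, level, source:line, remaining text.
% The literal pipe is escaped as \| so it is not read as alternation.
pattern = '^(\d\d/\d\d \d\d:\d\d:\d\d\.\d{3})\s+(\w+)\s*\|\s*([^|\r\n]+)\|\s*([^\r\n]*)';
tokens = regexp(filecontent, pattern, 'tokens', 'lineanchors');

% tokens is a 1-by-N cell of 1-by-4 cells; stack them into an N-by-4 cell array.
rows = vertcat(tokens{:});

% Convert the timestamp column to datetime values.
timestamps = datetime(rows(:, 1), 'InputFormat', 'MM/dd HH:mm:ss.SSS');

% Collect everything in a table instead of creating separate variables.
logTable = table(timestamps, rows(:, 2), rows(:, 3), rows(:, 4), ...
    'VariableNames', {'Timestamp', 'Level', 'Source', 'Message'});

% Leaving off the semicolon, or calling disp, prints to the Command Window.
disp(logTable)
```

Keeping the fields in one table means later steps can index by column name (for example `logTable.Message`) rather than relying on dynamically created variables.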
5 Comments
Life is Wonderful
on 18 Jul 2019
Edited: Life is Wonderful
on 18 Jul 2019
Life is Wonderful
on 18 Jul 2019
Bob Thompson
on 18 Jul 2019
You're getting the structure class error because 'names' outputs the results as a structure, rather than a cell, as I was expecting. Personally, I prefer 'tokens' or 'match' as my output flag for regexp.
cellfun only accepts cell arrays as data inputs, hence the error.
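To make the difference concrete, here is a small sketch comparing the two output options on one of the sample lines. The pattern and the names `logLine`, `asStruct`, `asTokens`, `asCell` and `lengths` are assumptions for the example only.

```matlab
logLine = '07/16 12:55:27.620 DEBUG| utils:0287| [stdout] CHROMEOS_RELEASE_BOARD=hatch';
pattern = '^(?<stamp>\S+ \S+)\s+(?<level>\w+)\s*\|';

% 'names' returns a struct whose fields are the named tokens.
asStruct = regexp(logLine, pattern, 'names');

% 'tokens' returns a cell array with one inner cell per match.
asTokens = regexp(logLine, pattern, 'tokens');

% cellfun needs a cell array, so convert the struct before calling it.
asCell = struct2cell(asStruct);
lengths = cellfun(@numel, asCell);
```

With the struct form the fields are addressed by name (`asStruct.level`), while the token form is addressed by position (`asTokens{1}{2}`).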
I would suggest something like the following:
% Tokens: date, time, level, source:line, message. A literal | must be escaped as \|.
fileData = regexp(filecontent, '^(\S+) (\S+) (\w+)\s*\|\s*([^|\r\n]+)\|\s*([^\r\n]*)', 'tokens', 'lineanchors');
dates = datetime([fileData{1}{1},' ',fileData{1}{2}], 'InputFormat', 'MM/dd HH:mm:ss.SSS');
I only tested this on a single line, so if regexp returns matches for multiple rows you may need to loop over them.
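For several rows, one way to do that is sketched below. The two sample lines come from the question; the simplified pattern and the names `tokens`, `stamps` and `dates` are assumptions for illustration.

```matlab
% Two sample lines from the question, joined with a newline.
filecontent = sprintf('%s\n%s', ...
    '07/16 12:55:27.620 DEBUG| utils:0287| [stdout] CHROMEOS_RELEASE_BOARD=hatch', ...
    '07/16 13:28:58.330 INFO | mode_switcher:0673| -[FAFT]-[ start wait_for_client ]---');

% One match per line; each match holds the date and the time as two tokens.
tokens = regexp(filecontent, '^(\S+) (\S+) ', 'tokens', 'lineanchors');

% Join date and time for every match (cellfun is the loop here).
stamps = cellfun(@(t) [t{1} ' ' t{2}], tokens, 'UniformOutput', false);

% datetime accepts a cell array of char vectors and converts all rows at once.
dates = datetime(stamps, 'InputFormat', 'MM/dd HH:mm:ss.SSS');
```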
Bob Thompson
on 19 Jul 2019
Are you only looking to capture the timestamp? It seems like the issue is more in the initial regexp processing than in the date time conversion.
If you are only looking to capture the timestamp I would suggest doing a regexp call like this:
% Capture only the leading timestamp of each line.
filedata = regexp(filecontent, '^(\d\d/\d\d\s\d\d:\d\d:\d\d\.\d\d\d)', 'tokens', 'lineanchors');
dates = datetime([filedata{:}], 'InputFormat', 'MM/dd HH:mm:ss.SSS');
If you are looking to capture more than the timestamps then please explain more. I know you outline some more in your OP, but I'm not entirely sure what you're referring to.
Life is Wonderful
on 19 Jul 2019
Edited: Life is Wonderful
on 19 Jul 2019
Life is Wonderful
on 18 Jul 2019
0 votes