Pull out strings and their values from a text file

 Accepted Answer

Hi Sriram, sorry I was away last week.
Parsing the first part of each message (date, level, source) is trivial. It's the part after that which is difficult, due to the variations in format. I don't fully understand the algorithm you've written and I don't think you can use : indiscriminately as a delimiter. For example on line 2, it's part of https://www....
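To illustrate the delimiter problem, here is a minimal sketch using one of the sample lines posted in the question. The variable names logline and parts are only for illustration. Splitting on every colon also cuts through the timestamp and the message body:

```matlab
% Sample line copied from the question
logline = '2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16';

parts = strsplit(logline, ':');   % split on every colon in the line
disp(numel(parts))                % number of pieces produced
disp(parts{1})                    % first piece is only a fragment of the timestamp
```

The colons inside the time and the UTC offset are treated exactly like the one after the log source, so the pieces do not line up with the fields of the message.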
Here is how I would start the parsing:
filecontent = string(fileread('File.txt')); %read whole file as STRING (for easier text comparison later)
messages = regexp(filecontent, '^(?<date>[^ ]+) (?<level>[^ ]+) (?<source>[^:]+):\s+(?<content>[^\r\n]+)', 'names', 'lineanchors'); %parse all lines according to common format
dates = num2cell(datetime([messages.date], 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ss.SSSSSSZZZZZ', 'TimeZone', 'UTC')); %decode date
[messages.date] = dates{:}; %and put back into structure
%parsing of kernel messages
iskernel = [messages.source] == "kernel";
parsedkernel = regexp([messages(iskernel).content], '\[\s*(?<cputime>[^\]]+)]\s+(?<message>.*)', 'names'); %parse kernel messages. Not sure of the rule
parsedkernel = [parsedkernel{:}]; %convert into structure array
cputime = num2cell(str2double([parsedkernel.cputime])); %convert cputime to numeric
[parsedkernel.cputime] = cputime{:}; %and put back into structure
parsedkernel = num2cell(parsedkernel); %convert to cell array to put back into messages structure
[messages(iskernel).content] = parsedkernel{:};
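As a rough sketch of how the parsed result could feed a plot, the snippet below runs the same two regular expressions on two sample lines copied from the question instead of File.txt. It is written with char vectors and cell arrays so that it does not depend on the string type. The names logdates and cpuvalues and the choice of what to plot are my assumptions, since the question does not say what should be plotted:

```matlab
% Two sample lines copied from the question stand in for fileread('File.txt')
filecontent = sprintf('%s\n', ...
    '2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16', ...
    '2019-05-10T21:41:40.054785+00:00 NOTICE kernel: [ 0.101170] random: get_random_bytes called from start_kernel+0x8d/0x429 with crng_init=0');

% Parse the common part of each line, then decode the dates
messages = regexp(filecontent, '^(?<date>[^ ]+) (?<level>[^ ]+) (?<source>[^:]+):\s+(?<content>[^\r\n]+)', 'names', 'lineanchors');
dates = num2cell(datetime({messages.date}, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ss.SSSSSSZZZZZ', 'TimeZone', 'UTC'));
[messages.date] = dates{:};

% Parse the kernel messages
iskernel = strcmp({messages.source}, 'kernel');
parsedkernel = regexp({messages(iskernel).content}, '\[\s*(?<cputime>[^\]]+)]\s+(?<message>.*)', 'names');
parsedkernel = [parsedkernel{:}];

% Collect the two numeric series (hypothetical choice of what to plot)
logdates = [messages(iskernel).date];               % datetime array
cpuvalues = str2double({parsedkernel.cputime});     % value between the brackets
plot(logdates, cpuvalues, '.-');
xlabel('log time (UTC)');
ylabel('kernel value in brackets');
```

Whether this is a meaningful plot depends on what the value in brackets represents in your log.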

6 Comments

Hi
Thanks a lot. I made one change: I used char instead of string, then executed the script.
I see this error:
Error using datetime (line 598)
Unable to parse date/time string
'2019-05-10T21:41:40.053993+00:002019-05-10T21:41:40.054122+00:002019-05-10T21:41:40.054614+00:002019-05-10T21:41:40.054618+00:002019-05-10T21:41:40.054622+00:002019-05-10T21:41:40.054623+00:002019-05-10T21:41:40.054196+00:002019-05-10T21:41:40.054625+00:002019-05-10T21:41:40.054626+00:002019-05-10T21:41:40.054627+00:002019-05-10T21:41:40.054627+00:002019-05-10T21:41:40.054230+00:002019-05-10T21:41:40.054628+00:002019-05-10T21:41:40.054629+00:002019-05-10T21:41:40.054629+00:002019-05-10T21:41:40.054255+00:002019-05-10T21:41:40.054631+00:002019-05-10T21:41:40.054632+00:002019-05-10T21:41:40.054632+00:00....'
using the format 'yyyy-MM-dd'T'HH:mm:ss.SSSSSSZZZZZ'.
Please help me.
edited by Guillaume to shorten wall of text
"I made the change instead of string - I used char"
Are you using R2016a or before? If so that is important to know.
sriram shastry's "Answer" moved here:
I am using a release before R2016a.
So which version?
Same code to work with char arrays instead of strings:
filecontent = fileread('File.txt'); %read whole file as a char vector
messages = regexp(filecontent, '^(?<date>[^ ]+) (?<level>[^ ]+) (?<source>[^:]+):\s+(?<content>[^\r\n]+)', 'names', 'lineanchors'); %parse all lines according to common format
dates = num2cell(datetime({messages.date}, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ss.SSSSSSZZZZZ', 'TimeZone', 'UTC')); %decode date
[messages.date] = dates{:}; %and put back into structure
%parsing of kernel messages
iskernel = strcmp({messages.source}, 'kernel');
parsedkernel = regexp({messages(iskernel).content}, '\[\s*(?<cputime>[^\]]+)]\s+(?<message>.*)', 'names'); %parse kernel messages. Not sure of the rule
parsedkernel = [parsedkernel{:}]; %convert into structure array
cputime = num2cell(str2double({parsedkernel.cputime})); %convert cputime to numeric
[parsedkernel.cputime] = cputime{:};
parsedkernel = num2cell(parsedkernel); %convert to cell array to put back into messages structure
[messages(iskernel).content] = parsedkernel{:};
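For what it's worth, the datetime error reported earlier can be reproduced in isolation. This is a small sketch with a hypothetical two-element structure s holding two of the dates from the error message. With char vectors, [s.date] joins everything into one long char vector, whereas {s.date} keeps each date separate:

```matlab
% Hypothetical structure array with two dates taken from the error message
s = struct('date', {'2019-05-10T21:41:40.053993+00:00', '2019-05-10T21:41:40.054122+00:00'});

joined = [s.date];      % one 1x64 char vector: both dates run together
separate = {s.date};    % 1x2 cell array: one date per cell

% datetime can decode the cell array, but not the joined char vector
d = datetime(separate, 'InputFormat', 'yyyy-MM-dd''T''HH:mm:ss.SSSSSSZZZZZ', 'TimeZone', 'UTC');
disp(size(joined))
disp(numel(d))
```

With string scalars, [s.date] builds a string array instead, which is why the square brackets were fine in the string version of the code.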
Sriram's comment mistakenly posted as an answer (please use comments!):
Thanks a lot. It works.
Then consider changing your accepted answer, particularly after all the hard work that has gone in getting you there.


More Answers (1)

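data = readcell('filename.xlsx','Range','......'); %named data rather than cell, so the built-in cell function is not shadowed
stringname = '......';
variable = strcmp(stringname,data); %logical array, true where a cell matches stringname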

12 Comments

I have accepted the answer as it's a pointer on how to use the function, but it does not solve the issue.
It is NOT helping me.
I have a text file with lots of text and numbers associated with it.
Example :
2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16
2019-05-10T21:41:40.054632+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000009ffff] usable
2019-05-10T21:41:40.054632+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000a0000-0x00000000000fffff] reserved
2019-05-10T21:41:40.054633+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000089afdfff] usable
I need each string AND its associated number. Next I want to convert the cells into meaningful data for plotting as well.
I want to parse the full file using the textscan function.
  • Identify the matching pattern with string and values
cac = textscan(fid,'%s%s%s%*s[^\r\n]','Delimiter','');
[~] = fclose( fid );
n1 = cac{1}; % new
n2 = cac{2}; % new
n3 = cac{3}; % new
n4 = cac{4}; % new
  • Convert the cell using cellfun
I need help
  • Plot the data
I need help
"I want to parse the full file using textscan"
Why the insistence on using textscan? The modern readtable, readcell, etc. can usually figure out the format for you and if not usually do it after being given a few hints. And they output something more useful than textscan.
One thing that you should never do is create numbered variables. Instead of embedding an index in the variable name, use proper indexing.
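As a small sketch of what I mean, assuming cac is a cell array with one cell per column (the values below are made up for illustration, as are the field names date, level and source):

```matlab
% Hypothetical stand-in for the cell array returned by textscan
cac = {{'2019-05-10T21:41:40.054631+00:00'; '2019-05-10T21:41:40.054649+00:00'}, ...
       {'INFO'; 'DEBUG'}, ...
       {'kernel:'; 'kernel:'}};

% Instead of n1 = cac{1}; n2 = cac{2}; ... index the cell array directly
for k = 1:numel(cac)
    fprintf('column %d has %d entries\n', k, numel(cac{k}));
end

% Or give the columns meaningful names in one structure
cols = cell2struct(cac(:), {'date', 'level', 'source'}, 1);
disp(cols.level)
```

Either way the code keeps working when the number of columns changes, which is not the case with n1, n2, n3, ...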
Looking at the sample file, I would say it is too irregular for textscan or readtable to be useful. I would instead fileread() and use regexp() to pull it apart.
Oh yes, I didn't look at the file, since the accepted answer uses readcell to read an Excel file. Any file that is similar to an Excel file (i.e. tabulated) can easily be read without using textscan.
Having now looked at the file, I would agree that readxxx would be completely unsuitable and textscan would struggle. Indeed regexp or a dedicated parser would be the way to go.
Can you please suggest a sample example to work on.
My requirement is
  • Get the pattern string and values in a structure variable.
  • Convert the cell array into MATLAB variables and values
  • Use cellfun
  • Plot the data
Most of your requirements are requirements on how the code should be implemented instead of on what it should do. This is not how you design code. You first specify what result you want, then you use whichever implementation gives you these results efficiently.
Whether structures, cell arrays, cellfun, etc. are useful is unknown because you haven't specified what you want other than some sort of name/value pair (what do pattern string and value refer to?) and a plot of something (what is the data?)
Understood. My requirements are as follows. An example snippet from the attached file is:
2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16
2019-05-10T21:41:40.054649+00:00 DEBUG kernel: [ 0.000009] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
2019-05-10T21:41:40.054785+00:00 NOTICE kernel: [ 0.101170] random: get_random_bytes called from start_kernel+0x8d/0x429 with crng_init=0
s_TimeStamp(idx,1) = 21:41:40.054631+00:00
MsgLib.Kernel.String(idx,1) = INFO kernel
MsgLib.Kernel.String(idx,1) = DEBUG kernel
MsgLib.Kernel.String(idx,1) = NOTICE kernel
MsgLib.Kernel.val(idx,1) = 0.000000
Msg.Kernel.BIOS.SubStr(idx,1) = BIOS-e820
Msg.Kernel.BIOS.SubVal(idx,1) = [mem 0x0000000000000000-0x0000000000000fff]
Msg.Kernel.BIOS.Substr.type(idx,1) = type
Msg.Kernel.BIOS.SubVal.type(idx,1) = 16
figure;
subplot(3,1,1);plot(MsgLib.Kernel.String,MsgLib.Kernel.val );legend(sprintf('%s,%d','MsgLib.Kernel.String',MsgLib.Kernel.val));
subplot(3,1,2);plot(Msg.Kernel.BIOS.SubStr,Msg.Kernel.BIOS.SubVal)
subplot(3,1,3);plot(Msg.Kernel.BIOS.Substr.type,Msg.Kernel.BIOS.SubVal.type );
I want to analyse the full file like this. If you help me with some sample code, I will write the rest of the code myself.
Thanks
Ok, so you want to parse each line of the file and split the lines into various components.
Once again, you're also giving an implementation. I'm not convinced that the structure you outline is a good idea, but that's not important right now.
The first thing you have to do, before we can even think how to implement it, is define exactly the parsing rules for the lines of the file. The start of the rule is going to be:
  • extract all the characters up to the first space and decode that as time
  • then, extract the characters up to the colon. That's the log source (I assume)
After that I'm not sure. It looks like the rule may vary according to the log source. If the log source is ANYTHING kernel, then the next step is
  • Extract the number between [] as the log value
Then it gets very murky: you get different types of messages after the [xxx], with different formatting. You will have to establish the rules for how these should be decoded.
If the log source is not the kernel, you get a completely different format of message. Again, you need to specify the rules for decoding these.
So, I'm afraid, the task is back onto you. You first need to define rules (there's going to be several due to the complex formatting of the lines) on how to split a line into various components. Only once you've done that can we think about writing the code to do it.
I suggest you continue this bullet point list:
For each line:
  • extract the text up to the first space as the logtime
  • then extract the text up to the colon as the logsource
  • if logsource ends with kernel
      • extract the number between the [] as logvalue
      • ????
  • if logsource is ????
      • ????
      • ????
  • if ???
      • ????
      • ????
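Once the list is complete, the known part could be sketched along these lines. This is only a sketch of the first three rules on one sample line from the question. The variable names (logline, tok, rest, logvalue) are my own, and the branches marked as undefined are deliberately left empty because the rules for them have not been specified:

```matlab
% Sample line copied from the question
logline = '2019-05-10T21:41:40.054631+00:00 INFO kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] type 16';

% Rules 1 and 2: text up to the first space, then text up to the first colon after it
tok = regexp(logline, '^(\S+) ([^:]+):\s*(.*)$', 'tokens', 'once');
logtime = tok{1};
logsource = tok{2};
rest = tok{3};          % everything after the colon, still unparsed

% Rule 3: only if the log source ends with kernel
if ~isempty(regexp(logsource, 'kernel$', 'once'))
    val = regexp(rest, '^\[\s*([^\]]+)\]', 'tokens', 'once');
    logvalue = str2double(val{1});   % number between the []
    % further kernel rules: still undefined
else
    logvalue = NaN;                  % rules for other sources: still undefined
end

fprintf('%s | %s | %g\n', logtime, logsource, logvalue);
```

Each new rule you define would become another branch or another regexp applied to rest.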
Thanks! Yes, I agree with your algorithm style. Can you please give me some sample code?
As I wrote:
"So, I'm afraid, the task is back onto you. You first need to define rules (there's going to be several due to the complex formatting of the lines) on how to split a line into various components. Only once you've done that can we think about writing the code to do it."

