Deleting Characters from String and Repeating

I have a string of hexadecimal text collected from hardware, and the string contains multiple stretches of noise. Each data set begins with a header, but I am having trouble extracting the data sets from the noise. I need to find each 4-character header in the string and extract the 96 characters that follow it. I then need to repeat this for every subsequent header, discarding the characters between the last character of one data set and the first character of the next header.
I have used strsplit() on the string in the attached .txt file with the header characters 'AA02', but this places each piece into a cell, and I can't figure out how to delete the characters in each cell that come after the data. Ideally, I would end up with all of the data in one long string once the noise is deleted.
Any suggestions on this problem would be greatly appreciated.
Thanks
dpb on 10 Feb 2020
Edited: dpb on 10 Feb 2020
Is the 96-character data section of the message after the header noise-free and then there's "noise" until the next header record? Or is there more-or-less random noise embedded throughout?
How can you determine what is/is not noise vis a vis valid data other than by counting characters? Or is it even possible to tell?
It would be simple enough to simply truncate the strings to a given length, but will you wind up with anything at all useful if you do so?
fid=fopen('1066desk.txt','r');
msg=fread(fid,'*char').';
fid=fclose(fid);
txt=strsplit(msg,'AA02').'; % split on the header; transpose so txt (and l below) are columns
l=cellfun(@length,txt);
txt=txt(l>0);
l=cellfun(@length,txt);
returns the "messages" identified by strings after each header value. Unfortunately,
>> [min(l) max(l)]
ans =
112 661290
>>
shows that not a single piece is only 96 characters long; the shortest is 112.
u=unique(l); % how many different lengths are there?
n=histc(l,u); % count how many of each
>> [u n] % and look at the statistics
ans =
112 201
268 1
324 13
304118 1
661290 1
>>
From the above, a sizable majority (201 of 217) are 112 characters long, a few are moderately longer, and there are a couple of instances where it looks like something really interfered for quite a while.
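As a side note, histc is documented as not recommended in newer releases. A minimal sketch of the same tally using the third output of unique with accumarray is shown below; the vector l here holds made-up lengths for illustration, not the values from the file.

```matlab
% Toy lengths standing in for l = cellfun(@length,txt); values are illustrative only
l = [268; 112; 112; 324; 112];

[u, ~, idx] = unique(l);   % u: sorted distinct lengths, idx: group number of each entry
n = accumarray(idx, 1);    % n(k): how many entries have length u(k)

disp([u n])                % two columns: length, count
```

The result has the same two-column layout as the [u n] listing above, one row per distinct length.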
>> l(1:10)
ans =
268
112
112
112
112
112
112
112
112
112
>> txt(1:10)
ans =
10×1 cell array
{'287E13673A34004B7958234DFEB8FFAE080800090011FFFF00030107F7990006FFFAFFFFFEA9FFBC07D20005FFFF00060000011AF78AFFE2FFFF00073F603BD03A40393043C18000C0D0000042E80000C371800043814000C440200043008000C184000043160000C0D0000043308000C24800000000000000000000000000002B44A02100C1'}
{'2934137EAA310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200000DD4' }
{'2A34138292330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200000DC3' }
{'2B3413867A350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C18C0000430B00000DA1' }
{'2C34138A62350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C18C0000430B00000D8E' }
{'2D34138E4A310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003FC00000430F00000DA6' }
{'2E34139232320006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003FC00000430F00000D94' }
{'2F3413961A330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040F0000042E600000E87' }
{'3034139A02350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040F0000042E600000E76' }
{'3134139DEA350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F10000C1080000430700000E15' }
>>
There appears to be a repetitive pattern in those of the same length; possibly some pattern-recognition type logic could be written to find and resynch the data stream from those?
I had a similar problem almost 50 years ago now when we were returning data from the plant computer via punched paper tape...a mispunch or torn tape caused similar-looking issues in data streams. Had to go through and find recognizable and convertible floating point values and use those to skip the bum points until the next case.
I've not taken the time to try to match the above data stream having no further knowledge of what is/isn't expected.
ADDENDUM:
>> cumsum(u.*n)
ans =
22512
22780
26992
331110
992400
>> cumsum(u.*n)/sum(l)
ans =
0.0227
0.0230
0.0272
0.3336
1.0000
>>
NB: If you were to just truncate after 96 or 112 characters until the next header you would throw away 97% of the data in the file.
>> numel(msg)/112
ans =
8.8685e+03
>>
which indicates there are enough characters in the received string for almost 9,000 datasets but there are only 217 instances of the header being intact.
>> numel(strfind(msg,'AA02'))
ans =
217
>>
which matches up with the result of strsplit as it should.
You'll need more sophisticated parsing or a way to clean up the transmission channel to not lose almost all the data.
Taylor Knuth on 11 Feb 2020
Thanks for your answer, it helps give me a broader perspective of the issue. The hardware sends data in 50-byte packets, and the first 2 bytes are a consistent header. The header indicates noise-free data transmission for the 48 bytes after the header. Once the data packet closes, there is noise until the next header. So I need to delete any characters that come after the 48 bytes of data. It is not surprising that the data is repetitive in this data set, because the hardware for this set is just sitting still on a desk. I also looked at the lengths of the strings with cellfun('length',str) and saw the large inconsistencies in size, so I am currently looking at the hardware itself to see if there is a problem in the data transmission. However, this didn't solve my problem of deleting the data after the 96 characters for each substring.
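For reference, the layout described in this comment works out as follows: each byte is two hex characters, so the 2-byte header is the 4 characters 'AA02' and the 48 data bytes are the 96 characters after it. The sketch below is one possible way to pull those out with regexp rather than strsplit. It runs on a synthetic stream, and the names hdr, nData, p1, p2 and the noise characters are made up for illustration; it also assumes the data is upper- or lower-case hex and that 'AA02' does not turn up by chance inside the noise.

```matlab
% Synthetic stream: two packets surrounded by made-up noise (not the real data)
hdr   = 'AA02';                 % 2-byte header = 4 hex characters
nData = 48*2;                   % 48 data bytes = 96 hex characters
p1    = repmat('1', 1, nData);  % placeholder payload of packet 1
p2    = repmat('2', 1, nData);  % placeholder payload of packet 2
msg   = ['FFEE' hdr p1 '0BADF00D' hdr p2 '77'];

% Match the header followed by exactly nData hex characters; capture only the data
pat  = sprintf('%s([0-9A-Fa-f]{%d})', hdr, nData);
tok  = regexp(msg, pat, 'tokens');                       % 1xN cell of 1x1 cells
data = cellfun(@(c) c{1}, tok, 'UniformOutput', false);  % 1xN cell of char vectors

allData = [data{:}];            % one long string with headers and noise dropped
```

Because regexp resumes scanning after each match, an 'AA02' that happens to occur inside a 96-character payload is not treated as a new header. A chance 'AA02' in the noise would still be picked up, so the result is only as good as the assumption that the header is unique there.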


Accepted Answer

dpb on 11 Feb 2020
Edited: dpb on 11 Feb 2020
Well, I don't know how you get L=96 from 50 bytes and I don't think it's likely what you're really going to need, but..
Given the last comments above, I split() on the actual header of 'AA' instead of 'AA02' and there are then some 8K messages as one might reasonably expect from the size.
txt=strsplit(msg,'AA').'; % split the messages
%whos txt % see what we gots...seems more reasonable
% Name Size Bytes Class Attributes
% txt 8098x1 2861004 cell
L=96; % set the desired length parameter
l=cellfun(@length,txt); % length of each txt string
txt=txt(l>=L); % keep only that long or longer
strLeft=@(c) c(1:L); % define a helper function
txttrunc=cellfun(strLeft,txt,'UniformOutput',0); % trim the cellstr array to L characters
Gives
>> txttrunc(1:10) % see what we now have left...
ans =
10×1 cell array
{'02287E13673A34004B7958234DFEB8FFAE080800090011FFFF00030107F7990006FFFAFFFFFEA9FFBC07D20005FFFF00'}
{'310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200'}
{'022A34138292330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0'}
{'022B3413867A350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C1'}
{'022C34138A62350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C1'}
{'022D34138E4A310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003F'}
{'022E34139232320006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003F'}
{'022F3413961A330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040'}
{'023034139A02350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040'}
{'023134139DEA350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F10000C1'}
>>
Alternatively, having eliminated the short partial sections earlier, you could convert the cellstr array to a char() array and just use colon indexing; since every remaining string is at least L characters long, the truncated result has no padded trailing blanks:
chartrunc=char(txt); % convert to char array NB all are now padded to length of longest
%>> whos chartrunc
% Name Size Bytes Class Attributes
% chartrunc 7691x327 5029914 char
%>>
chartrunc=chartrunc(:,1:L); % truncate to L length
%>> whos chartrunc
% Name Size Bytes Class Attributes
% chartrunc 7691x96 1476672 char
%>>
Now, however, you have a 2D char() array, so to get the full character string of a given message remember that you will have to use a 2D indexing expression such as chartrunc(i,:). Or, of course, you could cast back to cellstr() or convert the whole thing to the new strings class. It all depends on what is to be done with the data going forward.
That does what was requested, but methinks you've got a lot more work ahead to make some sense of what's actually in the array.
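A small sketch of those options on a toy array follows; the three 4-character rows are invented stand-ins for the 96-character rows of chartrunc.

```matlab
% Toy stand-in for chartrunc: one message per row (values are made up)
chartrunc = ['02AB'; '02CD'; '02EF'];

row2   = chartrunc(2,:);               % 2D indexing: the full second message
asCell = cellstr(chartrunc);           % back to a 3x1 cell array of char vectors
flat   = reshape(chartrunc.', 1, []);  % all rows end to end as one char vector
```

The transpose before reshape matters: MATLAB stores arrays column by column, so without it the characters of different messages would be interleaved.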
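The original question asked for all of the data in one long string. Starting from a truncated cell array such as txttrunc, a minimal sketch of the joining step could look like the following. The cell contents are invented, and stripping the first two characters assumes every piece begins with the '02' left over from the header, which the listing above shows is not true for every row (the second entry begins with '31').

```matlab
% Toy stand-in for txttrunc: column cell array of equal-length char vectors
txttrunc = {'02AB'; '02CD'; '02EF'};

oneLong = [txttrunc{:}];   % concatenate every cell end to end

% Optionally drop the leading '02' from each piece before joining
% (assumption: each piece really starts with the remainder of the header)
noHdr        = cellfun(@(c) c(3:end), txttrunc, 'UniformOutput', false);
oneLongNoHdr = [noHdr{:}];
```

The expression [txttrunc{:}] expands the cell array into a comma-separated list and concatenates it horizontally, so it works the same for a row or a column cell array.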
Taylor Knuth on 13 Feb 2020
There is not a message trailer unfortunately, but the data is ultimately divided further for individual data packets and converted from hex to decimal. Thanks for your help on this question.
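For the hex-to-decimal step mentioned here, one possible sketch is to reshape each packet into fixed-width groups and pass them to hex2dec. The 8-character fragment, the group widths, and the reading of each group as an unsigned big-endian value are all assumptions for illustration; the actual field layout of the packet is not given in this thread.

```matlab
% Illustrative fragment only; the real field layout is not stated in the thread
packet = '0006E7FD';

% One value per byte (2 hex characters each)
bytes = hex2dec(reshape(packet, 2, []).');   % [0; 6; 231; 253]

% One value per 16-bit group (4 hex characters each), read as unsigned
words = hex2dec(reshape(packet, 4, []).');   % [6; 59389]
```

reshape(packet, 2, []) puts each pair of characters in a column, and the transpose turns those into rows, which is the form hex2dec expects for a char matrix.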
dpb on 13 Feb 2020
Glad to try to help...as said, reminds me (very painfully :) ) of the paper tape problem from years gone by...


Release

R2018b
