Deleting Characters from String and Repeating

Question

1066desk.txt

I have a string of text (hexadecimal) collected from hardware, and within the string are multiple instances of noise. The beginning of each data set begins with a header, but I am having trouble extracting the data sets from the noise. I need to find the 4 character headers in the string, then extract the next 96 characters. Then I need to repeat this process until I find the next header, but once I find the header then I need to delete/throw out the characters in-between the last character of the data and the first character of the next header.

I have used the strsplit() function on the string attached in the .txt file and with the header characters of 'AA02', but once I do this it places each instance into a cell, and I can't figure out how to delete any characters of the cells that come after the data. Ideally, I could have all of the data in one long string after the noise is deleted.

Any suggestions on this problem would be greatly appreciated.

Thanks

dpb on 10 Feb 2020

Edited: dpb on 10 Feb 2020

Is the 96-character data section of the message after the header noise-free and then there's "noise" until the next header record? Or is there more-or-less random noise embedded throughout?

How can you determine what is/is not noise vis a vis valid data other than by counting characters? Or is it even possible to tell?

It would be simple enough to simply truncate the strings to a given length, but will you wind up with anything at all useful if you do so?

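fid=fopen('1066desk.txt','r');
msg=fread(fid,'*char').';          % read the whole file as one row char vector
fid=fclose(fid);
txt=strsplit(msg,'AA02').';        % split on the header; transpose to a column cell array
l=cellfun(@length,txt);            % length of each piece
txt=txt(l>0);                      % drop empty pieces
l=cellfun(@length,txt);            % lengths of the remaining pieces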

returns the "messages" identified by strings after each header value. Unfortunately,

>> [min(l) max(l)]
ans =
         112      661290
>> 

doesn't indicate there is a single instance of only 96 characters.

u=unique(l);      % how many different lengths are there?
n=histc(l,u);     % count how many of each
>> [u n]          % and look at the statistics
ans =
         112         201
         268           1
         324          13
      304118           1
      661290           1
>> 

From the above a sizable majority are the 112, a few of moderate difference and then a couple instances where it looks like something really interfered for quite a while.
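As an aside, histc is flagged as not recommended in recent MATLAB releases. A minimal sketch of the same tally with unique and accumarray, assuming l is the vector of piece lengths from above (the values below are a small made-up stand-in, not the real file):

```matlab
% Stand-in for the vector of piece lengths (made-up values for illustration)
l = [268 112 112 324 112].';

[u, ~, idx] = unique(l);     % distinct lengths and, for each entry, which one it is
u = u(:);                    % force a column so [u n] concatenates cleanly
n = accumarray(idx, 1);      % count how many times each distinct length occurs

disp([u n])                  % lengths in column 1, counts in column 2
```

This should give the same two-column table as the histc version, just without relying on the older function.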

>> l(1:10)
ans =
   268
   112
   112
   112
   112
   112
   112
   112
   112
   112
>> txt(1:10)
ans =
  10×1 cell array
    {'287E13673A34004B7958234DFEB8FFAE080800090011FFFF00030107F7990006FFFAFFFFFEA9FFBC07D20005FFFF00060000011AF78AFFE2FFFF00073F603BD03A40393043C18000C0D0000042E80000C371800043814000C440200043008000C184000043160000C0D0000043308000C24800000000000000000000000000002B44A02100C1'}
    {'2934137EAA310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200000DD4'                                                                                                                                                            }
    {'2A34138292330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200000DC3'                                                                                                                                                            }
    {'2B3413867A350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C18C0000430B00000DA1'                                                                                                                                                            }
    {'2C34138A62350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C18C0000430B00000D8E'                                                                                                                                                            }
    {'2D34138E4A310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003FC00000430F00000DA6'                                                                                                                                                            }
    {'2E34139232320006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003FC00000430F00000D94'                                                                                                                                                            }
    {'2F3413961A330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040F0000042E600000E87'                                                                                                                                                            }
    {'3034139A02350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040F0000042E600000E76'                                                                                                                                                            }
    {'3134139DEA350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F10000C1080000430700000E15'                                                                                                                                                            }
>> 

There appears to be a repetitive pattern in those of same length; possibly some pattern recognition type logic could be written to find and resynch the data stream from those?

I had a similar problem almost 50 years ago now when we were returning data from a plant computer via punched paper tape...a mispunch or torn tape caused similar-looking issues in data streams. Had to go through and find recognizable and convertible floating point values and use those to skip the bum points until the next case.

I've not taken the time to try to match the above data stream having no further knowledge of what is/isn't expected.

ADDENDUM:

>> cumsum(u.*n)
ans =
       22512
       22780
       26992
      331110
      992400
>> cumsum(u.*n)/sum(l)
ans =
    0.0227
    0.0230
    0.0272
    0.3336
    1.0000
>> 

NB: If you were to just truncate after 96 or 112 characters until the next header you would throw away 97% of the data in the file.

>> numel(msg)/112
ans =
   8.8685e+03
>> 

which indicates there are enough characters in the received string for almost 9,000 datasets but there are only 217 instances of the header being intact.

>> numel(strfind(msg,'AA02'))
ans =
   217
>> 

which matches up with the result of strsplit as it should.

You'll need more sophisticated parsing or a way to clean up the transmission channel to not lose almost all the data.

Taylor Knuth on 11 Feb 2020

Thanks for your answer, it gives me a broader perspective on the issue. The hardware sends data packets of 50 bytes, and the first 2 bytes are a consistent header. The header indicates noise-free data transmission for the 48 bytes after the header. Once the data packet closes, there is noise until the next header, so I need to delete any characters that come after the 48 bytes of data. It is not surprising that the data is repetitive in this data set, because the hardware for this set is just sitting still on a desk. I also looked at the lengths of the strings with cellfun('length',str) and saw the large inconsistencies in size, so I am currently looking at the hardware itself to see if there is a problem in the data transmission. However, this didn't solve my problem of deleting the data after the 96 characters for each substring.
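For reference, the 96 follows from the hex encoding, assuming each byte arrives as two hexadecimal characters in the text file (the variable names below are only illustrative):

```matlab
bytesPerPacket = 50;     % packet size stated for the hardware
headerBytes    = 2;      % header size stated for the hardware
charsPerByte   = 2;      % assumption: one byte is written as two hex characters

headerChars = headerBytes*charsPerByte;                     % characters in the header
dataChars   = (bytesPerPacket - headerBytes)*charsPerByte;  % characters of data per packet

fprintf('header: %d chars, data: %d chars\n', headerChars, dataChars)
```

So a 2-byte header is 4 characters ('AA02') and the 48 data bytes are 96 characters.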

Answer 1

dpb on 11 Feb 2020

Edited: dpb on 11 Feb 2020

Well, I don't know how you get L=96 from 50 bytes and I don't think it's likely what you're really going to need, but..

Given the last comments above, I split() on the actual header of 'AA' instead of 'AA02' and there are then some 8K messages as one might reasonably expect from the size.

txt=strsplit(msg,'AA').';              % split the messages
%whos txt                              % see what we gots...seems more reasonable
%  Name         Size              Bytes  Class    Attributes
%  txt       8098x1             2861004  cell               
L=96;                                             % set the desired length parameter
l=cellfun(@length,txt);                           % length of each txt string
txt=txt(l>=L);                                    % keep only that long or longer
strLeft=@(c) c(1:L);                              % define a helper function
txttrunc=cellfun(strLeft,txt,'UniformOutput',0);  % trim the cellstr array to L characters

Gives

>> txttrunc(1:10)                                    % see what we now have left...
ans =
  10×1 cell array
    {'02287E13673A34004B7958234DFEB8FFAE080800090011FFFF00030107F7990006FFFAFFFFFEA9FFBC07D20005FFFF00'}
    {'310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0200000431200'}
    {'022A34138292330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F90000C0'}
    {'022B3413867A350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C1'}
    {'022C34138A62350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795843018000C1'}
    {'022D34138E4A310006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003F'}
    {'022E34139232320006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842FB00003F'}
    {'022F3413961A330006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040'}
    {'023034139A02350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F9000040'}
    {'023134139DEA350006E7FD0004EFF20005030DFFFEBB990000001E0000003F0000002000000021004B795842F10000C1'}
>> 

Alternatively, having eliminated the short partial sections earlier, you could then convert the cellstr array to char() w/o having embedded trailing blanks afterwards and just use colon indexing:

chartrunc=char(txt);        % convert to char array NB all are now padded to length of longest
%>> whos chartrunc
%  Name             Size               Bytes  Class    Attributes
%  chartrunc      7691x327            5029914  char               
%>>
chartrunc=chartrunc(:,1:L);   % truncate to L length
%>> whos chartrunc
%  Name             Size              Bytes  Class    Attributes
%  chartrunc      7691x96            1476672  char               
%>> 

Now, however, you have a 2D char() array, so to get the full character string of a given message remember you will have to use a 2D indexing expression, chartrunc(i,:) for example. Or, of course, you could cast back to cellstr() or convert the whole thing to the new string class. All depends on how/what is to be done with the data going forward.
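A minimal sketch of those conversions, using a tiny made-up char array in place of the real chartrunc (string() needs R2016b or later):

```matlab
% Stand-in for the truncated 2D char array (made-up rows for illustration)
chartrunc = ['0228'; '022A'; '022B'];

oneMsg  = chartrunc(2,:);               % a single message via 2D indexing
asCell  = cellstr(chartrunc);           % back to a cellstr, one message per cell
asStr   = string(asCell);               % or a string array, one message per element
oneLong = reshape(chartrunc.', 1, []);  % all messages joined into one long row
```

The transpose before reshape matters: MATLAB stores arrays column-major, so without it the characters of different messages would be interleaved.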

That does what was requested, but methinks you've got a lot more work ahead to make some sense of what's actually in the array.
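If the goal is literally "find the header, keep the next 96 characters, discard everything up to the next header, and end up with one long string", a sketch with strfind and an index matrix might look like the following. The msg here is a short synthetic stand-in, and the code assumes the full 'AA02' header never turns up by chance inside a data section, which the real stream may not guarantee:

```matlab
% Synthetic stand-in for msg: noise, two complete packets, one cut-off packet
msg = ['xx' 'AA02' repmat('1',1,96) 'junk' 'AA02' repmat('2',1,96) 'zz' 'AA02' '33'];

hdr   = 'AA02';                 % header to search for
nData = 96;                     % characters to keep after each header

pos = strfind(msg, hdr);                              % start index of every header
pos = pos(pos + numel(hdr) + nData - 1 <= numel(msg)); % drop headers too near the end

idx     = pos(:) + numel(hdr) + (0:nData-1);  % one row of indices per packet (R2016b+)
packets = msg(idx);                           % numel(pos)-by-nData char array
allData = reshape(packets.', 1, []);          % everything in one long string
```

Whatever sits between the end of one data section and the next header is simply never indexed, so there is nothing to delete afterwards.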

dpb on 11 Feb 2020

Edited: dpb on 12 Feb 2020

Let's see what happens if we take the statement " header indicates noise free data transmission for the 48 bytes after the header." literally--

>> chartrunc=chartrunc(:,1:L/2);   % keep only first 48
>> chartrunc(1:200:end,:)          % see what a sample set then looks like
ans =
  39×48 char array
    '02287E13673A34004B7958234DFEB8FFAE080800090011FF'
    '02ED34167C4A320006E7FD0004EFF20005030DFFFEBB9900'
    '03B53419898A310006E7FD0004EFF20005030DFFFEBB9900'
    '047C341C92E2330006E7FD0004EFF20005030DFFFEBB9900'
    '0544341FA022D60006E7FD0004EFF20005030DFFFEBB9900'
    '060A3422A592340006E7FD0004EFF20005030DFFFEBB9900'
    '06D23425B2D2360006E7FD0004EFF20005030DFFFEBB9900'
    '079A3428C012360006E7FD0004EFF20005030DFFFEBB9900'
    '0862342BCD52360006E7FD0004EFF20005030DFFFEBB9900'
    '092A342EDA92360006E7FD0004EFF20005030DFFFEBB9900'
    '09F23431E7D2380006E7FD0004EFF20005030DFFFEBB9900'
    '0ABA3434F512380006E7FD0004EFF20005030DFFFEBB9900'
    '0B8234380252340006E7FD0004EFF20005030DFFFEBB9900'
    '0C4A343B0F92340006E7FD0004EFF20005030DFFFEBB9900'
    '0D12343E1CD2380006E7FD0004EFF20005030DFFFEBB9900'
    '0DDA34412A12340006E7FD0004EFF20005030DFFFEBB9900'
    '0EA234443752380006E7FD0004EFF20005030DFFFEBB9900'
    '0F6A34474492360006E7FD0004EFF20005030DFFFEBB9900'
    '1032344A51D2360006E7FD0004EFF20005030DFFFEBB9900'
    'A10FA344D5F12340006E7FD0004EFF20005030DFFFEBB990'
    '11C234506C52350006E7FD0004EFF20005030DFFFEBB9900'
    '128A34537992360006E7FD0004EFF20005030DFFFEBB9900'
    '1352345686D2340006E7FD0004EFF20005030DFFFEBB9900'
    '141A34599412360006E7FD0004EFF20005030DFFFEBB9900'
    '14E2345CA152360006E7FD0004EFF20005030DFFFEBB9900'
    '345FAE92380006E7FD0004EFF20005030DFFFEBB99000000'
    'A16723462BBD2380006E7FD0004EFF20005030DFFFEBB990'
    '173A3465C912340006E7FD0004EFF20005030DFFFEBB9900'
    '18023468D652380006E7FD0004EFF20005030DFFFEBB9900'
    '18CA346BE392350006E7FD0004EFF20005030DFFFEBB9900'
    '1992346EF0D2380006E7FD0004EFF20005030DFFFEBB9900'
    '1A5A3471FE12380006E7FD0004EFF20005030DFFFEBB9900'
    '1B2234750B52340006E7FD0004EFF20005030DFFFEBB9900'
    '1BEA34781892340006E7FD0004EFF20005030DFFFEBB9900'
    '1CB2347B25D2340006E7FD0004EFF20005030DFFFEBB9900'
    '1D7A347E3312360006E7FD0004EFF20005030DFFFEBB9900'
    '1E4234814052380006E7FD0004EFF20005030DFFFEBB9900'
    '1F0A34844D92380006E7FD0004EFF20005030DFFFEBB9900'
    '1FD234875AD2380006E7FD0004EFF20005030DFFFEBB9900'
>> 

Looks like it is probably reasonably clean, altho there are a couple of cases (the first, and about 2/3rds of the way through) that appear to have dropped or extra characters that have gotten the message offset.

I gather there is not a message trailer or checksum or any such niceties... :(
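Lacking a checksum, one crude way to flag the shifted rows is to compare each message against the most common character in every column. This is only a sketch: it leans on the data being highly repetitive (hardware sitting still), the rows below are made up, and the 75% threshold is an arbitrary assumption that would need tuning on the real array:

```matlab
% Stand-in for the char array: one message per row (made-up values)
rows = ['0A12'; '0B12'; '0C12'; 'A0B1'];

ref     = char(mode(double(rows), 1));  % most common character in each column
nmatch  = sum(rows == ref, 2);          % per row, number of columns agreeing with ref
thresh  = 0.75*size(rows, 2);           % hypothetical cut-off, tune for real data
suspect = nmatch < thresh;              % true for rows that look offset

disp(find(suspect).')                   % row numbers worth a closer look
```

Fields that legitimately change from message to message (a counter, a timestamp) will lower the match count for every row, so the cut-off has to leave room for them.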

Taylor Knuth on 13 Feb 2020

There is not a message trailer unfortunately, but the data is ultimately divided further for individual data packets and converted from hex to decimal. Thanks for your help on this question.
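A sketch of that last step, assuming for illustration that a packet divides into fixed 4-character (2-byte) fields. The real field widths are not stated in this thread, so w below is hypothetical; the sample characters are taken from the repeated part of the messages shown above:

```matlab
packet = '0006E7FD0004EFF2';          % a few characters from one of the messages
w      = 4;                           % hypothetical field width in hex characters

fields = reshape(packet, w, []).';    % one field per row
vals   = hex2dec(fields);             % unsigned decimal value of each field

disp(vals.')
```

hex2dec returns unsigned values, so any field that is really a signed quantity would need a further two's-complement conversion.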

dpb on 13 Feb 2020

Glad to try to help...as said, reminds me (very painfully :) ) of the paper tape problem from years gone by...
