counting pattern frequency in a string

Question

Mark on 22 Dec 2017

0
Link

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/374238-counting-pattern-frequency-in-a-string

Commented: Mark on 22 Dec 2017

Looking for ideas on the fastest way to count the number of occurrences of letter patterns in a string. For example, for the test string 'ZXCVBNMZXCVBAS' with a pattern length of 4, it would generate the following table: ZXCV=2, XCVB=2, CVBN=1, etc... The brute force way to do it is to start with an empty list of 4-letter strings, and pull 4-letters blocks from the test string, moving over one letter at a time. For each block, check if already on the list, if so increment the count; if not, add to the end of the list.

The problem with this is that if you have very long test strings (say 10^8 and higher) and move to higher pattern 'word' lengths, say 8 or 10 letters, the number of unique patterns is huge and so the thing slows down as each 8 letter block is compared against a huge list.

Another idea would be to assign each possible pattern with an index: AAAA=1, AAAB=2, so that you can just calculate the index using base 26 conversion for any given string, and increment the value at that index of a master vector. (eg AABA=27, etc., so increment vector(27) by one). But again, with longer pattern lengths, there is not enough memory to store a vector with that many indices, covering all the words (26^8 or 26^10, etc.).

So, is there some way to do this efficiently when the strings and pattern lengths get big? Something with sparsity, or smart sorting, or indexing?

Thank you

2 Comments
Show NoneHide None

Jos (10584) on 22 Dec 2017

Basic questions first: do you really need all patterns? If so, why?

Mark on 22 Dec 2017

Edited: Mark on 22 Dec 2017

I don't need all patterns, mostly because the number of possible 10-letter sequences in a test string of 10^6 letters, is just under 10^6, while the number of 10-letter patterns is 26^10. But there is no a priori list of possible patterns for a given test string, though one could quickly be generated from the string of course. I'm not sure if that answers your question? To be clear, I'm only interested in the patterns that appear in the test string, not the ones that don't. So in that sense, I don't need all patterns.

Sign in to comment.

Sign in to answer this question.

Answer 1

Stephen23 on 22 Dec 2017

1
Link

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/374238-counting-pattern-frequency-in-a-string#answer_297455

Edited: Stephen23 on 22 Dec 2017

Open in MATLAB Online

Hummm, interesting problem. I would approach this slightly differently, taking advantage of two features to try and achieve some reasonably efficient solution:

The requested substring length num is small and fixed.
MATLAB'S inbuilt functions are highly optimized and quite efficient.

Given those points, I would simply loop over the string num times: 1st iteration start from position 1, 2nd iteration starting from position 2, etc. Within each iteration you can reshape the largest possible subset of the main string into a matrix and easily find the unique rows. This intermediate step slightly reduces memory requirements and builds up a "semi-unique" set of all substrings of the requested length. Also collect their indices and use histc to count the substring occurrences for that iteration. Then after the loop use efficient unique again to get the final list of unique substrings, together with accumarray to add the histc output counts together.

% Fake Data:
%str = char(randi([65,90],1,1e7));
str = 'ZXCVBNMZXCVBAS';
num = 4;
tot = numel(str);
% Loop over <num> starting indices:
tmp = cell(1,num);
idh = cell(1,num);
for k = 1:num
  vec = k-1:num:tot;
  [tmp{k},~,idt] = unique(reshape(str(k:vec(end)),num,[]).','rows');
  idh{k} = histc(idt,unique(idt)); % save the count
end
% Count unique rows:
[out,~,ido] = unique(vertcat(tmp{:}),'rows');
cnt = accumarray(ido,vertcat(idh{:}));

For that small sample string it returns:

>> out
out =
BNMZ
CVBA
CVBN
MZXC
NMZX
VBAS
VBNM
XCVB
ZXCV
>> cnt
cnt =
   1
   1
   1
   1
   1
   1
   1
   2
   2

which seems to be correct. For the longer string (1e7 elements) it took my laptop

Elapsed time is 9.031 seconds.

to run, and I would imagine it scales roughly linearly with num. How does that compare to the methods you are using now?

1 Comment
Show -1 older commentsHide -1 older comments

Mark on 22 Dec 2017

This is excellent, supremely fast. Running brute force for num=3 on something around a 10^7 length test string was taking upwards of 10 mins when I stopped the script, whereas num=5 or more takes 12 seconds on your version. Perfect! Thank you

Sign in to comment.

counting pattern frequency in a string

2 Comments
Show NoneHide None

Accepted Answer

1 Comment
Show -1 older commentsHide -1 older comments

More Answers (0)

See Also

Categories

Tags

Products

Community Treasure Hunt

counting pattern frequency in a string

2 Comments Show NoneHide None

Accepted Answer

1 Comment Show -1 older commentsHide -1 older comments

More Answers (0)

See Also

Categories

Tags

Products

Community Treasure Hunt

2 Comments
Show NoneHide None

1 Comment
Show -1 older commentsHide -1 older comments