counting pattern frequency in a string

8 views (last 30 days)
Mark
Mark on 22 Dec 2017
Commented: Mark on 22 Dec 2017
Looking for ideas on the fastest way to count the number of occurrences of letter patterns in a string. For example, for the test string 'ZXCVBNMZXCVBAS' with a pattern length of 4, it would generate the following table: ZXCV=2, XCVB=2, CVBN=1, etc... The brute force way to do it is to start with an empty list of 4-letter strings, and pull 4-letters blocks from the test string, moving over one letter at a time. For each block, check if already on the list, if so increment the count; if not, add to the end of the list.
The problem with this is that if you have very long test strings (say 10^8 and higher) and move to higher pattern 'word' lengths, say 8 or 10 letters, the number of unique patterns is huge and so the thing slows down as each 8 letter block is compared against a huge list.
Another idea would be to assign each possible pattern with an index: AAAA=1, AAAB=2, so that you can just calculate the index using base 26 conversion for any given string, and increment the value at that index of a master vector. (eg AABA=27, etc., so increment vector(27) by one). But again, with longer pattern lengths, there is not enough memory to store a vector with that many indices, covering all the words (26^8 or 26^10, etc.).
So, is there some way to do this efficiently when the strings and pattern lengths get big? Something with sparsity, or smart sorting, or indexing?
Thank you
  2 Comments
Jos (10584)
Jos (10584) on 22 Dec 2017
Basic questions first: do you really need all patterns? If so, why?
Mark
Mark on 22 Dec 2017
Edited: Mark on 22 Dec 2017
I don't need all patterns, mostly because the number of possible 10-letter sequences in a test string of 10^6 letters, is just under 10^6, while the number of 10-letter patterns is 26^10. But there is no a priori list of possible patterns for a given test string, though one could quickly be generated from the string of course. I'm not sure if that answers your question? To be clear, I'm only interested in the patterns that appear in the test string, not the ones that don't. So in that sense, I don't need all patterns.

Sign in to comment.

Accepted Answer

Stephen23
Stephen23 on 22 Dec 2017
Edited: Stephen23 on 22 Dec 2017
Hummm, interesting problem. I would approach this slightly differently, taking advantage of two features to try and achieve some reasonably efficient solution:
  1. The requested substring length num is small and fixed.
  2. MATLAB'S inbuilt functions are highly optimized and quite efficient.
Given those points, I would simply loop over the string num times: 1st iteration start from position 1, 2nd iteration starting from position 2, etc. Within each iteration you can reshape the largest possible subset of the main string into a matrix and easily find the unique rows. This intermediate step slightly reduces memory requirements and builds up a "semi-unique" set of all substrings of the requested length. Also collect their indices and use histc to count the substring occurrences for that iteration. Then after the loop use efficient unique again to get the final list of unique substrings, together with accumarray to add the histc output counts together.
% Fake Data:
%str = char(randi([65,90],1,1e7));
str = 'ZXCVBNMZXCVBAS';
num = 4;
tot = numel(str);
% Loop over <num> starting indices:
tmp = cell(1,num);
idh = cell(1,num);
for k = 1:num
vec = k-1:num:tot;
[tmp{k},~,idt] = unique(reshape(str(k:vec(end)),num,[]).','rows');
idh{k} = histc(idt,unique(idt)); % save the count
end
% Count unique rows:
[out,~,ido] = unique(vertcat(tmp{:}),'rows');
cnt = accumarray(ido,vertcat(idh{:}));
For that small sample string it returns:
>> out
out =
BNMZ
CVBA
CVBN
MZXC
NMZX
VBAS
VBNM
XCVB
ZXCV
>> cnt
cnt =
1
1
1
1
1
1
1
2
2
which seems to be correct. For the longer string (1e7 elements) it took my laptop
Elapsed time is 9.031 seconds.
to run, and I would imagine it scales roughly linearly with num. How does that compare to the methods you are using now?
  1 Comment
Mark
Mark on 22 Dec 2017
This is excellent, supremely fast. Running brute force for num=3 on something around a 10^7 length test string was taking upwards of 10 mins when I stopped the script, whereas num=5 or more takes 12 seconds on your version. Perfect! Thank you

Sign in to comment.

More Answers (0)

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!