Finding the repeated substrings

I have a DNA sequence, AAGTCAAGTCAATCG, which I split into overlapping substrings of length 4: AAGT, AGTC, GTCA, TCAA, CAAG, AAGT, and so on. I then need to find the repeated substrings and their frequency counts. For example, AAGT is repeated twice here, so I want to get AAGT - 2. How can I do this?

2 Comments

See Andrei Bobrov's answer for an efficient solution.
Thank you Stephen!

 Accepted Answer

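str = {'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'} ;
% For each unique string, find the positions where it occurs in str
idx = cellfun(@(x) find(strcmp(str, x)), unique(str), 'UniformOutput', false) ;
L = cellfun(@length,idx) ;   % number of occurrences of each unique string
Ridx = find(L>1) ;           % unique strings that occur more than once
for i = 1:length(Ridx)
    st = str(idx{Ridx(i)}) ; % index with Ridx(i), not the whole Ridx vector
    fprintf('%s string repeated %d times\n',st{1},length(idx{Ridx(i)}))
end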
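
The code above starts from a cell array that has already been split. A minimal sketch of going from the raw sequence to the repeat counts, assuming a window length of 4 (inferred from the four-letter examples in the question; the names seq, k and kmers are only illustrative):

```matlab
% Sketch: build every length-k window of the sequence, then count duplicates.
seq = 'AAGTCAAGTCAATCG';   % sequence from the question
k   = 4;                   % window length (assumption, see above)

n     = numel(seq) - k + 1;            % number of overlapping windows
kmers = cell(1, n);
for s = 1:n
    kmers{s} = seq(s:s+k-1);           % window starting at position s
end

[u, ~, grp] = unique(kmers, 'stable'); % u: distinct windows, grp: index into u
counts = accumarray(grp(:), 1);        % occurrences of each distinct window

rep = find(counts > 1);                % keep only the repeated windows
for i = 1:numel(rep)
    fprintf('%s - %d\n', u{rep(i)}, counts(rep(i)));
end
```

For this sequence the loop should print AAGT, AGTC, GTCA and TCAA, each with a count of 2.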

More Answers (2)

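A = 'AAGTCAAGTCAATCG';
% Each row of B is one 4-character window of A
B = hankel(A(1:end-3),A(end-3:end));
% a: distinct windows in order of first appearance; c: index of each row of B into a
[a,~,c] = unique(B,'rows','stable');
out = table(a,accumarray(c,1),'VariableNames',{'DNA','counts'});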

5 Comments

If it's alright, I had a question about the use of unique. Why not use tabulate? Just curious.
Thanks!
Maybe he didn't know about it - I didn't.
outT = tabulate(B)
out =
  8×2 table
    DNA     counts
    ____    ______
    AAGT      2
    AGTC      2
    GTCA      2
    TCAA      2
    CAAG      1
    CAAT      1
    AATC      1
    ATCG      1
outT =
  8×3 cell array
    {'AAGT'}    {[2]}    {[16.6666666666667]}
    {'AGTC'}    {[2]}    {[16.6666666666667]}
    {'GTCA'}    {[2]}    {[16.6666666666667]}
    {'TCAA'}    {[2]}    {[16.6666666666667]}
    {'CAAG'}    {[1]}    {[8.33333333333333]}
    {'CAAT'}    {[1]}    {[8.33333333333333]}
    {'AATC'}    {[1]}    {[8.33333333333333]}
    {'ATCG'}    {[1]}    {[8.33333333333333]}
Yeah, that's fair. I was just curious since I was looking at both and wondering why I might want to use one over the other. It seems to come down mainly to whether I want a table or a cell array.
Thanks!
tabulate requires the Statistics and Machine Learning Toolbox, which not everyone has.
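
If the toolbox is not available, something close to the tabulate output can be assembled from base functions. A rough sketch, reusing A and B from the answer above (vals, grp, cnt and pct are illustrative names, and the three-column cell layout is only meant to mimic the outT display):

```matlab
% Sketch: tabulate-style counts and percentages with base MATLAB only.
A = 'AAGTCAAGTCAATCG';
B = hankel(A(1:end-3), A(end-3:end));      % one 4-character window per row

[vals, ~, grp] = unique(B, 'rows', 'stable');
cnt = accumarray(grp, 1);                  % count per distinct window
pct = 100 * cnt / sum(cnt);                % share of all windows, in percent

% Value, count and percentage side by side in a cell array
outT = [cellstr(vals), num2cell(cnt), num2cell(pct)];
```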
Hi.
I have a question. Sometimes I get ladder-like results (nested sequences) like this:
AAAAAAAAA, which is counted (with frame size 3) as 6 AAAA sequences, which is not correct in some cases (the same applies to ATATATA-type sequences). Is there a solution or an algorithm to filter out nested repeats?
Thanks a lot.
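
One possible reading of "filtering nested repeats" is to count only non-overlapping occurrences of each window. A minimal sketch under that assumption (a greedy left-to-right scan; whether this is the right filter depends on the application, and seq, k, windows and counts are illustrative names):

```matlab
% Sketch: count non-overlapping occurrences of each distinct window.
seq = 'AAAAAAAAA';
k   = 4;                             % window length (assumption)

n = numel(seq) - k + 1;
windows = cell(n, 1);
for s = 1:n
    windows{s} = seq(s:s+k-1);       % overlapping windows, as before
end
u = unique(windows, 'stable');       % distinct windows

counts = zeros(numel(u), 1);
for j = 1:numel(u)
    starts  = strfind(seq, u{j});    % all start positions, overlaps included
    lastEnd = 0;                     % end position of the last accepted hit
    for p = starts
        if p > lastEnd               % accept a hit only if it starts after the previous one ends
            counts(j) = counts(j) + 1;
            lastEnd   = p + k - 1;
        end
    end
end
```

With this scan, the nine-letter run of A gives a count of 2 for AAAA (positions 1 to 4 and 5 to 8) instead of 6.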

For the original question you could convert the char data into a categorical array and call histcounts.
>> C = categorical({'AAGT','AGTC','GTCA','TCAA','CAAG','AAGT'})
C =
1×6 categorical array
AAGT AGTC GTCA TCAA CAAG AAGT
>> [counts, uniquevalues] = histcounts(C)
counts =
2 1 1 1 1
uniquevalues =
1×5 cell array
{'AAGT'} {'AGTC'} {'CAAG'} {'GTCA'} {'TCAA'}

Asked: 1 Jun 2017

Answered: 14 Aug 2019
