# How to replace characters with digits

Sharen H on 27 Feb 2013
I have to replace each character of s = 'ACGT' with the following digits:
'A' with 11
'C' with 00
'G' with 01
'T' with 10
Jan on 27 Feb 2013
You write 11 without surrounding quotes. This isn't an accident, correct? What do you want as output? How long is the input?

Azzi Abdelmalek on 27 Feb 2013
Edited: Azzi Abdelmalek on 27 Feb 2013
clear
s = 'ACGT';
e = ['11'; '00'; '01'; '10'];   % row k is the code for s(k)
in = 'AGCTAG';                  % your initial data
out = in;
for k = 1:numel(s)
    out = regexprep(out, s(k), e(k,:));   % replace every s(k) with its code
end
disp(out)
Cedric Wannaz on 28 Feb 2013
STRREP is probably the best solution. It is a little slower than my solution, but more memory friendly.
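
For readers who want to see the STRREP variant spelled out, here is a minimal sketch. It assumes the same per-letter loop as the REGEXPREP answer with only the replacement call swapped; the variable names `letters`, `codes` and `seq` are illustrative and not taken from the thread.

```matlab
% Sketch: per-letter replacement with strrep instead of regexprep.
letters = 'ACGT';
codes   = ['11'; '00'; '01'; '10'];   % row k is the code for letters(k)
seq     = 'AGCTAG';                   % sample input

out = seq;
for k = 1:numel(letters)
    % Replace every occurrence of the k-th letter with its two-digit code.
    out = strrep(out, letters(k), codes(k, :));
end
disp(out)   % 110100101101
```

Because the codes contain only digits and the inputs only letters, an earlier replacement can never be picked up by a later one, so the order of the loop does not matter here.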

Jan on 28 Feb 2013
And a lookup table:
seqIn = 'ACGTTGCA'
table = repmat('0', 2, 255);
table(1, 'AT') = '1';
table(2, 'AG') = '1';
result = reshape(table(:, seqIn), 1, []);
Does this work? I do not have access to Matlab currently.
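
One way to check the lookup-table idea is to compare its result with a string worked out by hand from the mapping in the question. The sketch below assumes uppercase input with character codes below 256; the table is called `lut` here (a hypothetical name) so that it does not shadow the `table` function found in newer MATLAB releases.

```matlab
% Sketch: verify the 2-row character lookup table on a short sequence.
seqIn = 'ACGTTGCA';

lut = repmat('0', 2, 255);   % default code is '00', which covers 'C'
lut(1, 'AT') = '1';          % first digit is 1 for A (11) and T (10)
lut(2, 'AG') = '1';          % second digit is 1 for A (11) and G (01)

% Indexing with the char vector picks one column per letter;
% reshape then reads the 2-by-N result column by column.
result = reshape(lut(:, seqIn), 1, []);

% Expected string derived by hand: A=11 C=00 G=01 T=10.
expected = '1100011010010011';
assert(isequal(result, expected))
disp(result)
```

Note that any character other than A, G or T, including typos, silently maps to '00' with this table.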
Jan on 28 Feb 2013
@Azzi: Is this a typo?! Your function needs 14 secs with REGEXPREP and 0.007 secs with STRREP? Then my minor suggestion caused a speedup by a factor of 1900? Wow, this would be the most efficient suggestion I have ever given. And it would be a strong reason to warn about the low efficiency of regexprep in this forum.
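
A rough way to reproduce such a comparison on your own machine is sketched below. The sequence length `n` and the random test data are arbitrary assumptions, not the data used for the timings quoted above, so the absolute numbers will differ.

```matlab
% Sketch: time the same replacement loop with regexprep and with strrep.
letters = 'ACGT';
codes   = ['11'; '00'; '01'; '10'];   % row k is the code for letters(k)
n       = 1e5;                        % assumed test length
seq     = letters(randi(4, 1, n));    % random sequence over A, C, G, T

tic
outRegexp = seq;
for k = 1:numel(letters)
    outRegexp = regexprep(outRegexp, letters(k), codes(k, :));
end
tRegexp = toc;

tic
outStrrep = seq;
for k = 1:numel(letters)
    outStrrep = strrep(outStrrep, letters(k), codes(k, :));
end
tStrrep = toc;

% Both variants must produce the same string before timings are compared.
assert(isequal(outRegexp, outStrrep))
fprintf('regexprep: %.4f s, strrep: %.4f s\n', tRegexp, tStrrep);
```

A single tic/toc pair is noisy; repeating the measurement a few times gives a more reliable picture.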

Cedric Wannaz on 27 Feb 2013
Edited: Cedric Wannaz on 27 Feb 2013
If you need to process long sequences, you might want to optimize the efficiency a little. A MEX-based solution would be most efficient, I guess, but here is one way to go using basic MATLAB only.
If you want to replace character 'A' with a numeric array (1,1) and so on, you can do:
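aa = 'ACGT' ;
seq = 'AAGCTCAGGTTCA' ;
rep = zeros(2, double(max(aa)), 'uint8') ; % one column per character code up to 'T'; double() makes the size explicitly numeric
rep(:,aa) = [1 0 0 1; 1 0 1 0] ; % columns hold the codes for A, C, G, T
result = reshape(rep(:,seq), 1, []) ; % look up each letter, then flatten column by column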
This outputs the numeric array:
result =
1 1 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 1 1 0 1 0 0 0 1 1
If you want to replace character 'A' with characters '11' and so on, you can do:
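aa = 'ACGT' ;
seq = 'AAGCTCAGGTTCA' ;
rep = zeros(2, double(max(aa)), 'uint8') ; % one column per character code up to 'T'; double() makes the size explicitly numeric
rep(:,aa) = ['11'; '00'; '01'; '10'].' ; % character codes of the digits, one column per letter
result = char(reshape(rep(:,seq), 1, [])) ; % flatten column by column and convert back to characters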
This outputs the string '11110100100011010110100011'.
EDIT: note that there are slightly different ways of doing this that are a little slower but more memory-friendly.
Cheers,
Cedric

Jos (10584) on 28 Feb 2013
Here is a simpler approach than looping over REGEXPREP or STRREP:
seqIn = 'ACGTTGCA' % input sequence
letters = 'ACGT' ;
symb = {'11','00','01','10'} ; % codes for A, C, G, T, stored as a cell array of strings!
[tf,idx] = ismember(seqIn,letters) ;
seqOut = [symb{idx}]
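
One caveat with the ISMEMBER approach: for a character that is not in `letters`, `idx` is 0 and `symb{idx}` raises an indexing error. A minimal sketch of a guard follows; the sample input with 'N' and the variable `bad` are assumptions for illustration, and the mapping is the one from the question (A=11, C=00, G=01, T=10).

```matlab
% Sketch: check for unknown characters before indexing into the code list.
seqIn   = 'ACGTNGCA';                 % 'N' is deliberately not in the alphabet
letters = 'ACGT';
symb    = {'11', '00', '01', '10'};   % codes for A, C, G, T

[tf, idx] = ismember(seqIn, letters);
if all(tf)
    % Every character was found, so idx holds valid positions 1..4.
    seqOut = [symb{idx}];
else
    % idx is 0 wherever tf is false; report the offending characters instead.
    bad    = unique(seqIn(~tf));
    seqOut = '';
    fprintf('Unexpected character(s): %s\n', bad);
end
```

Whether to stop, skip or substitute a placeholder for unknown characters depends on the data; the sketch only reports them.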