problem calculating nucleotide percentages

Question

John Jamison on 22 Apr 2017

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/336721-problem-calculating-nucleotide-percentages

Edited: dpb on 24 Apr 2017

Hi all,

I am writing a program to do several things. The first is to calculate the percentage of the four RNA nucleotides (A, U, C, G), given a DNA input (A, C, G, T), and write the RNA sequence and the nucleotide percentages (e.g. 50% U) to a new file. I have the following code, but I cannot get it to calculate the frequencies correctly. I don't know how to get the total number of nucleotides, since I can't get it to work with length(filename). How can I fix this? See the code below.

Thanks

%DNA Transcription
%Goal: read in a file to convert from DNA to RNA: 
% C -> G
% G -> C
% T -> A
% A -> U
dnaFile = 0;
while dnaFile < 1
   filename = input('Open file: ', 's');
   [dnaFile, message] = fopen(filename, 'r');
   if dnaFile == -1
     disp(message)
   end
end
%if input file has extension .dna, switch to .rna
if length(filename) > 4 && strcmp(filename(end-3 : end), '.dna')
    outputFile = [filename(1:end-4) '.rna']
else 
    outputFile = [filename '.rna']
end
[rnaFile, message] = fopen(outputFile, 'w');
[base, num] = fread(dnaFile, 1, 'char');
countG = 0;
countC = 0;
countA = 0;
countU = 0;
while num > 0 
  if base == 'C'
    out = 'G';
    countG = countG + 1;
  elseif base == 'G'
    out = 'C';
    countC = countC + 1;
  elseif base == 'T'
    out = 'A';
    countA = countA + 1;
  elseif base == 'A'
    out = 'U';
    countU = countU + 1;
  else
    out = '_';
  end
  freqG = countG ./ length(dnaFile);
  freqC = countC ./ length(dnaFile);
  freqA = countA ./ length(dnaFile);
  freqU = countU ./ length(dnaFile);
  %fprintf('percentG: %.5f \n percentC: %.5f \n percentA: %.5f \n percentU: %.5f \n', freqG, freqC, freqA, freqU);   
  fwrite(rnaFile, out);
  [base, num] = fread(dnaFile, 1, 'char');
end
fclose(dnaFile);
fclose(rnaFile);

1 Comment

dpb on 23 Apr 2017

"...I can't get it work with length(filename)..."

...
freqG = countG ./ length(dnaFile);
...

dnaFile is a file identifier, which is an integer stored in the default datatype of double (see the doc for fopen). Hence length(dnaFile) returns 1.

The divisor is either the count of all the characters you read, if the percentage is to be based on the total string length including the extraneous characters, or the sum of the four base counts if not. As shown in my answer, however, you can get these more efficiently than by explicit looping.
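
To make the two possible divisors concrete, here is a minimal sketch. The counts and the value of fid are made-up stand-ins (assumptions, not values from the poster's file), and the variable names nBases, nChars and nOther are hypothetical.

```matlab
% Hypothetical counts, standing in for the values the loop accumulates
countG = 10; countC = 9; countA = 9; countU = 9;
nOther = 3;                      % characters that matched none of C, G, T, A

fid = 3;                         % a file identifier is just a scalar double
assert(length(fid) == 1)         % which is why length(dnaFile) is always 1

nBases = countG + countC + countA + countU;   % divisor ignoring other characters
nChars = nBases + nOther;                     % divisor counting every character read

freqG_bases = countG / nBases;   % fraction of recognised bases
freqG_chars = countG / nChars;   % fraction of all characters
fprintf('G: %.4f of bases, %.4f of all characters\n', freqG_bases, freqG_chars);
```

Which divisor is right depends on whether the extra characters are meant to count toward the total.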


Answer 1

dpb on 22 Apr 2017

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/336721-problem-calculating-nucleotide-percentages#answer_264058

Edited: dpb on 23 Apr 2017

[rnaFile, message] = fopen(outputFile, 'w');

While not your problem, you should check that the output file opened successfully, just as you do for the input...

[base, num] = fread(dnaFile, 1, 'char');

The above reads one character, but returns it as a double, not a character.

Use

[base, num]=fread(dnaFile,'*char');

to read whole file into a character array.
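
A small sketch of the difference between the two fread calls, assuming a throwaway file created with tempname (the file name and its contents are made up for the demo):

```matlab
tmp = [tempname '.dna'];          % hypothetical scratch file for the demo
fid = fopen(tmp, 'w');
assert(fid ~= -1, 'could not create %s', tmp);
fwrite(fid, 'ACGT_ACG', 'char');
fclose(fid);

fid = fopen(tmp, 'r');
assert(fid ~= -1, 'could not open %s', tmp);
asDouble = fread(fid, 1, 'char');   % one character, returned as class double
frewind(fid);                       % go back to the start of the file
[base, num] = fread(fid, '*char');  % whole file, returned as class char
fclose(fid);
delete(tmp);

base = base.';                      % fread returns a column; make it a row
disp(class(asDouble))               % double
disp(class(base))                   % char
```

Note that fread hands back a column vector, so the transpose is only there to get the usual row-oriented string.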

while num > 0 
  if base == 'C'
    ...

At first glance the above looks like an infinite loop, since num=1 on entry...but I scanned too quickly first. The fread at the bottom of the loop updates num on every pass, and at EOF fread returns an empty result with num=0, so the loop does end.

You can loop, but Matlab is vectorized; may as well make use of it.

% rewrite the rules for convenience C -> G; G -> C; T -> A; A -> U
out=repmat('_',size(base));   % this is else case...we'll overwrite everything besides
out(base=='C')='G';           % use logical addressing to locate each and write output
out(base=='G')='C';
out(base=='T')='A';
out(base=='A')='U';
countG=sum(out=='G');         % raw counts of each output letter
countC=sum(out=='C');
countA=sum(out=='A');
countU=sum(out=='U');
totalNum=sum([countG countC countA countU]);   % ignores the '_' characters
freqG=countG/totalNum;
freqC=countC/totalNum;
freqA=countA/totalNum;
freqU=countU/totalNum;

ADDENDUM:

Couldn't see an issue off the top of my head, so did a trial with just a made-up sequence...

>> dna=['A', 'C', 'G', 'T'];           % the four letters start with
>> dna=repmat(dna,1,10);               % make longer sequence from them
>> dna=dna(randperm(length(dna)));     % and then scramble 'em up
>> dna(randperm(length(dna),3))='_';   % a few other characters for spice
>> dna                                 % what we start with is then ...
dna =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC

The algorithm above starts here:

>> rna=repmat('_',size(dna));         % replace the non-wanted letters
>> rna(dna=='C')='G';
>> rna(dna=='G')='C';
>> rna(dna=='T')='A';
>> rna(dna=='A')='U';
>> [dna;rna]            % see what we got...
ans =
CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG

Looks like what problem statement asked for...compute frequency

Did this in "more Matlab-y" way; order is alphabetic to satisfy histc

>> RNA='ACGU';   % the letters to use as bin edges must be in increasing order
>> freq=histc(rna,RNA)
freq =
   9     9    10     9
>> freq=freq/sum(freq)
freq =
  0.2432    0.2432    0.2703    0.2432
>> 
>> [RNA.' num2str(freq.','%.4f')]  % display results tabulated 
ans =
A  0.2432
C  0.2432
G  0.2703
U  0.2432
>>
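
If the letters need to stay in a particular order, one way to count without histc is to compare every character against every letter. This is only a sketch of an alternative, not what the session above ran; the variable names isLetter, letters and counts are made up here.

```matlab
rna = 'GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG';  % output from the trial above
letters = 'GCAU';                                   % any order, no sorting needed

% Compare every character (rows) against every letter (columns), then count
isLetter = bsxfun(@eq, rna(:), letters);            % numel(rna)-by-4 logical
counts = sum(isLetter, 1);                          % one count per letter
freq = counts / sum(counts);                        % '_' never matches, so it is ignored
disp([letters.' num2str(freq.', '  %.4f')])
```

bsxfun is used so the sketch does not depend on implicit expansion being available.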

If order is important revert to previous or a "cute" way would be to use the categorical datatype--

>> rnac=categorical(cellstr(rna.'),{'G';'C';'A';'U';'_'});  % transpose so each character is its own cell
>> summary(rnac)
   G      10 
   C       9 
   A       9 
   U       9 
   _       3 
>>

ADDENDUM 2:

And, the yet more Matlab-y way vectorizes the translation via lookup...

>> DNA=['C','G','T','A','_'];  % base characters in sequence 1
>> RNA=['G','C','A','U','_'];  % corresponding characters in sequence 2
>> RNA(arrayfun(@(c) find(c==DNA),dna))  % translate one to other...
ans =
GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG
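
The same lookup idea can be done with a single indexing operation by building a table indexed on character code. A sketch, assuming the input holds only single-byte characters (codes 0 to 255); the name map is made up for illustration.

```matlab
dna = 'CTTGCCTCC_GGA_ATGAATGCAACACAGTT_GGTGCATC';  % trial sequence from above

map = repmat('_', 1, 256);        % default output for any unlisted character
map(double('CGTA') + 1) = 'GCAU'; % +1 because MATLAB indices start at 1

rna = map(double(dna) + 1);       % one indexing operation translates everything
disp(rna)
```

Unlike the arrayfun version, this also copes with characters that are not in the DNA list, since they fall through to the '_' default.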

19 Comments

dpb on 23 Apr 2017

Edited: dpb on 23 Apr 2017

OK, although that's the wrong sequence letter set, isn't it?

I am/was totally aware of what your perceived problem is; you just wouldn't answer the question required to know what the proper answer really is.

If you're adamant about using your code in the loop, then the divisor would have to be computed by incrementing a counter in each of the branches that finds a letter match. That sum in the end will be the total.

Or, before computing the frequency, the total is simply the sum of the four counts you've collected--

nTotal=sum([countG,countC,countA,countU]);
freqX=countX/nTotal;

I'd still strongly urge you to read the full file at once and then use one of the ways outlined above to process it..."it's the Matlab way!" and would be a useful tutorial in your Matlab education once you grasp the techniques.

Creating multiple variables with sequential or similar naming schemes in Matlab is a sure sign the data structure isn't being done correctly.

Anyway, the rule appears to be "it is a global percentage using the four characters and ignoring any others in the total count". I wasn't positive that the nonmember characters didn't have some role to play in dividing the input into substrings over which separate totals needed to be accumulated. If that's not the case then it's pretty simple; you just have to count only the ones that are sieved out, not the entire length of the input string, which was the first assumption.

Sorry if I seemed to harp on you, but we can only know what you tell us unless we happen to work in the field, so precision in the problem description is vital to get a correct answer, and it didn't seem to make sense to give the right answer to the wrong question.

dpb on 23 Apr 2017

Edited: dpb on 24 Apr 2017

OK, I guess I'm to blame for not pointing out the "rest of the story" -- you've got to tell it the data type to write.

Oh, and did you fix the FOPEN problem I pointed out initially in that while you read a character, it is stored internally on reading in Matlab as a double as you wrote the code?

ERRATUM Oops, typo... "did you fix the FOPEN problem". It's in FREAD where the problem actually is... 'char' will read a character/byte but Matlab will store that in a double. I'm guessing that's the root cause of your problem below in getting extra bytes; you're writing more than one byte each FWRITE, just writing a double as 'char'. As noted then, the '*char' will also store that character as type CHAR in memory when read, which should fix the problem unless there are already bum characters in the input file from such a case before.

Anyways, try

fwrite(fid,[freqG,freqC,freqA,freqU],'char');

and see if joy ensues. Of course, this is still an unformatted file so if you're expecting to use it in text editor, etc., etc., you'll be better off using formatted i/o -- that's fprintf and friends, not fread/fwrite
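
For the formatted route, a minimal sketch of writing the sequence and the percentages as readable text with fprintf. The output file name comes from tempname and the sequence is the trial output from the answer; both are stand-ins, not the poster's data.

```matlab
rna = 'GAACGGAGG_CCU_UACUUACGUUGUGUCAA_CCACGUAG';   % trial output from the answer
letters = 'GCAU';
counts = arrayfun(@(c) sum(rna == c), letters);      % count each letter
pct = 100 * counts / sum(counts);                    % percentages of the four bases

outName = [tempname '.rna'];       % hypothetical output file name
fid = fopen(outName, 'w');
assert(fid ~= -1, 'could not open %s for writing', outName);
fprintf(fid, '%s\n', rna);                            % the sequence as text
fprintf(fid, '%c: %.2f%%\n', [double(letters); pct]); % one "letter: percent" line per base
fclose(fid);

type(outName)                      % show what was written
delete(outName)
```

fprintf cycles through the matrix column by column, so each column supplies one letter code for %c and one value for %.2f.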

John Jamison on 24 Apr 2017

I figured it out.

Thanks for all your help. My next question is regarding searching through the file for certain sequences..see my new post for that.

Thanks

dpb on 24 Apr 2017

Edited: dpb on 24 Apr 2017

So, what was the problem?

"Enquiring minds" and all that... :)
