Parsing a file with multiple entries consisting of strings. Each entry contains a header followed by a descriptor. The objective is to use headers to obtain a subset of entries.
The following is an example of a file to be parsed.
Each entry contains a header line beginning with ">", followed by sequences of letters (amino acid descriptors). Please note that the sequences shown in the example are truncated. The objective is to use a list of headers such as "FBtr0077276" and "FBtr0080587" to fish out both the header and the corresponding amino acid sequences, in the same format as that of the submitted file. The format is widely used in bioinformatics and is known as FASTA.
Thank you for your comments/help
Example of input file. The headers are the lines beginning with ">".
>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1
MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL
RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA
>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2
MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI
>FBtr0079925 | Cyp4e3 Cytochrome P450 4e3
MWLAVLALLVLPLITLVYFERKASQRRQLLKEFNGPTPVPILGNANRIGKNPAEILSTFF
>FBtr0080587 | Cyp28a5 Cytochrome P450 28a5
MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD
>FBtr0081077 | Cyp310a1 Cytochrome P450 310a1
MWLLLPILLYSAVFLSVRHIYSHWRRRGFPSEKAGITWSFLQKAYRREFRHVEAICEAYQ
SGKDRLLGIYCFFRPVLLVRNVELAQTILQQSNGHFSELKWDYISGYRRFNLLEKLAPMF
>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1
MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL
RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA
>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2
MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI
Answers (1)
Star Strider
on 24 Sep 2024
The Bioinformatics Toolbox has a number of functions for these files. The fastaread function appears to be appropriate. (I don’t have that Toolbox, I’m simply aware of some of its functions.)
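As a rough sketch of how the subset could be picked once the file has been read: the code below assumes the entries are in a struct array with Header and Sequence fields, which is the shape fastaread returns. The struct is built by hand here as a stand-in so the example runs without the toolbox, the sequences are shortened placeholders, and the rule that the ID is the text before the first "|" is an assumption based on the example headers.

```matlab
% Stand-in for: entries = fastaread('myfile.fasta');  % hypothetical file name
% (fastaread needs the Bioinformatics Toolbox; the struct is built by hand here.)
entries = struct( ...
    'Header',   {'FBtr0077276 | Cyp6v1 Cytochrome P450 6v1', ...
                 'FBtr0079061 | Cyp28d2 Cytochrome P450 28d2', ...
                 'FBtr0080587 | Cyp28a5 Cytochrome P450 28a5'}, ...
    'Sequence', {'MVYSTNILLAIV', 'MCPVTTFLVLVL', 'MVLITLTLVSLV'});  % placeholder sequences

% IDs to look for, without the leading ">"
wanted = {'FBtr0077276', 'FBtr0080587'};

% Assumption: the ID is the text before the first "|" in each header
ids = strtrim(strtok({entries.Header}, '|'));

% Keep only the entries whose ID is in the wanted list
keep   = ismember(ids, wanted);
subset = entries(keep);

% With the toolbox installed, the subset could be written back out, e.g.:
% fastawrite('subset.fasta', subset);   % hypothetical output file name
```

Note that this route goes through fastaread and fastawrite, so the line wrapping of the sequences in the output is decided by fastawrite rather than copied from the input file.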
2 Comments
Star Strider
on 24 Sep 2024
I am not certain what you intend by ‘the sequence format is lost’. I also don’t have your file, so I can’t run fastaread with it to test this.
Perhaps you could create a table with:
CYP = cell2table(sequence, 'RowNames',header)
That might work.
Experimenting with something like that —
header = {'FBtr0077276 | Cyp6v1 Cytochrome P450 6v1';
'FBtr0079061 | Cyp28d2 Cytochrome P450 28d2'};
sequence = {{'MVYSTNILLAIVTILTGVFIWSR.....................'}
{'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}};
CYP_Table = cell2table(sequence, 'RowNames',header)
This required a bit of manual editing because I don’t have the actual function outputs (and I don’t have actual experience with the function). It might be possible to avoid the manual edits, perhaps using cellfun, and maybe compose as well. (I can’t tell from here.)
It should be relatively straightforward to get the information from ‘CYP_Table’ after that, although I don’t know what you want to do with the data after reading it and creating the table (if that’s what you actually want to do).
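If the goal is to keep the entries exactly as they appear in the file (same header text, same line breaks in the sequences), a toolbox-free alternative might be to work on the raw lines instead of a table. This is only a minimal sketch under some assumptions: the lines are given inline here as a stand-in for reading the file, header lines are taken to be those starting with ">", and the ID is taken to be the text before the first "|".

```matlab
% Stand-in for reading the file line by line, e.g. with readlines or fgetl.
lines = { ...
    '>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1'
    'MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL'
    'RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA'
    '>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2'
    'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI'
    '>FBtr0080587 | Cyp28a5 Cytochrome P450 28a5'
    'MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD'};

% IDs to look for, without the leading ">"
wanted = {'FBtr0077276', 'FBtr0080587'};

% Header lines start with ">"; cumsum gives each line its entry number
isHeader = strncmp(lines, '>', 1);
entryIdx = cumsum(isHeader);

% Assumption: the ID is the header text between ">" and the first "|"
headerLines = lines(isHeader);
ids = strtrim(strtok(regexprep(headerLines, '^>', ''), '|'));

% One flag per entry, then spread the flag to every line of that entry
keepEntry = ismember(ids, wanted);
keepLine  = false(size(lines));
inEntry   = entryIdx > 0;               % skips any lines before the first header
keepLine(inEntry) = keepEntry(entryIdx(inEntry));

% Selected lines, unchanged from the input
selected = lines(keepLine);

% Text that could be written to a new file with fopen/fprintf/fclose
outText = sprintf('%s\n', selected{:});
```

Because the lines are copied rather than re-generated, the output keeps the input layout. If the same header occurs more than once in the file, as in the example above, every copy would be selected.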