Parsing a file with multiple entries consisting of strings. Each entry contains a header followed by a descriptor. The objective is to use headers to obtain a subset of entries.

Question

George on 24 Sep 2024

Direct link to this question

https://se.mathworks.com/matlabcentral/answers/2155215-parsing-a-file-multiple-entries-consisting-of-strings-each-entry-contains-a-header-followed-by-a-de

Commented: Star Strider on 24 Sep 2024

The following is an example of a file to be parsed.

Each entry contains a header line that begins with ">", followed by one or more lines of letters (amino acid descriptors). Please note that the sequences shown in the example are truncated. The objective is to use a list of header IDs, such as "FBtr0077276" and "FBtr0080587", to fish out both the header and the corresponding amino acid sequence, in the same format as that of the submitted file. The format is widely used in bioinformatics and is known as FASTA.

Thank you for your comments/help

Example of input file. The header lines are those beginning with ">".

>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1

MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL

RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA

>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2

MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI

>FBtr0079925 | Cyp4e3 Cytochrome P450 4e3

MWLAVLALLVLPLITLVYFERKASQRRQLLKEFNGPTPVPILGNANRIGKNPAEILSTFF

>FBtr0080587 | Cyp28a5 Cytochrome P450 28a5

MVLITLTLVSLVVGLLYAVLVWNYDYWRKRGVPGPKPKLLCGNYPNMFTMKRHAIYDLDD

>FBtr0081077 | Cyp310a1 Cytochrome P450 310a1

MWLLLPILLYSAVFLSVRHIYSHWRRRGFPSEKAGITWSFLQKAYRREFRHVEAICEAYQ

SGKDRLLGIYCFFRPVLLVRNVELAQTILQQSNGHFSELKWDYISGYRRFNLLEKLAPMF

>FBtr0077276 | Cyp6v1 Cytochrome P450 6v1

MVYSTNILLAIVTILTGVFIWSRRTYVYWQRRRVKFVQPTHLLGNLSRVLRLEESFALQL

RRFYFDERFRNEPVVGIYLFHQPALLIRDLQLVRTVLVEDFVSFSNRFAKCDGRSDKMGA

>FBtr0079061 | Cyp28d2 Cytochrome P450 28d2

MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGSFPSIFTRKRNIAYDI

Answer 1

Star Strider on 24 Sep 2024

Direct link to this answer

https://se.mathworks.com/matlabcentral/answers/2155215-parsing-a-file-multiple-entries-consisting-of-strings-each-entry-contains-a-header-followed-by-a-de#answer_1522005

The Bioinformatics Toolbox has a number of functions for these files. The fastaread function appears to be appropriate. (I don’t have that Toolbox; I’m simply aware of some of its functions.)
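As a rough illustration of how fastaread could be combined with a list of wanted IDs, here is a minimal sketch. It assumes the Bioinformatics Toolbox is installed; the file names 'cyp.fasta' and 'subset.fasta' and the variable names are hypothetical, and the ID is assumed to be the first whitespace-delimited token of each header.

```matlab
% Sketch only: requires the Bioinformatics Toolbox.
% 'cyp.fasta' and 'subset.fasta' are hypothetical file names.
wanted = ["FBtr0077276", "FBtr0080587"];     % IDs to keep, without the leading '>'

entries = fastaread('cyp.fasta');            % struct array with fields Header and Sequence
headers = string({entries.Header});          % fastaread drops the leading '>'

% Assumption: the ID is the text before the first space in each header.
ids = extractBefore(headers + " ", " ");

subset = entries(ismember(ids, wanted));     % keep only the requested entries

% Write the selected entries back out in FASTA form.
fastawrite('subset.fasta', subset);
```

The line wrapping that fastawrite produces may not match the wrapping of the input file, so this may not reproduce the original layout exactly.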

2 Comments

George on 24 Sep 2024

Moved: Star Strider on 24 Sep 2024

Thank you for the prompt reply.

I've used the fastaread function.

[header,sequence] = fastaread(___)

A truncated output is shown below.

head(header)

{'FBtr0077276 | Cyp6v1 Cytochrome P450 6v1' }

{'FBtr0079061 | Cyp28d2 Cytochrome P450 28d2' }

{'FBtr0079925 | Cyp4e3 Cytochrome P450 4e3' }

{'FBtr0080587 | Cyp28a5 Cytochrome P450 28a5' }

head(sequence)

head(a)

{'MVYSTNILLAIVTILTGVFIWSR.....................'}

{'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}

I could generate a cell array from this and parse it, but unfortunately the sequence format is lost.
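One possible reading of "the sequence format is lost" is that fastaread returns each sequence as a single character vector, so the line breaks of the original file are gone. If keeping the exact layout matters, a plain-text approach that never reassembles the sequences might work. The sketch below is one such approach and does not need the toolbox; the file names are hypothetical, and readlines and writelines require a reasonably recent MATLAB release.

```matlab
% Sketch only: filter a FASTA file as plain text so the original line
% layout is kept. 'cyp.fasta' and 'subset.fasta' are hypothetical names.
wanted = ["FBtr0077276", "FBtr0080587"];     % IDs to keep, without the leading '>'

txt      = readlines('cyp.fasta');           % one string per line of the file
isHeader = startsWith(txt, ">");             % header lines begin with '>'
entryIdx = cumsum(isHeader);                 % entry number of each line (0 = before first header)

% Assumption: the ID is the text between '>' and the first space.
ids = extractBefore(extractAfter(txt(isHeader), 1) + " ", " ");

keepEntry = ismember(ids, wanted);           % one flag per entry
keepLine  = entryIdx > 0;                    % ignore anything before the first header
keepLine(keepLine) = keepEntry(entryIdx(keepLine));   % spread entry flags to their lines

writelines(txt(keepLine), 'subset.fasta');   % headers and sequence lines, unchanged
```

Entries whose ID occurs more than once in the input (as FBtr0077276 does in the example above) would each be copied to the output.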

Star Strider on 24 Sep 2024

I am not certain what you intend by ‘the sequence format is lost’. I also don’t have your file, so I can’t run fastaread with it to test this.

Perhaps you could create a table with:

CYP = cell2table(sequence, 'RowNames',header)

That might work.

Experimenting with something like that:

header = {'FBtr0077276  | Cyp6v1 Cytochrome P450 6v1',
    'FBtr0079061  | Cyp28d2 Cytochrome P450 28d2'};
sequence = {{'MVYSTNILLAIVTILTGVFIWSR.....................'}
{'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}};
CYP_Table = cell2table(sequence, 'RowNames',header)
CYP_Table = 2x1 table
                                                                            sequence                         
                                                   __________________________________________________________

    FBtr0077276  | Cyp6v1 Cytochrome P450 6v1      {'MVYSTNILLAIVTILTGVFIWSR.....................'          }
    FBtr0079061  | Cyp28d2 Cytochrome P450 28d2    {'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '}

This required a bit of manual editing because I don’t have the actual function outputs (and I don’t have actual experience with the function). It might be possible to avoid the manual edits, perhaps using cellfun, and maybe compose as well. (I can’t tell from here.)

It should be relatively straightforward to get the information from ‘CYP_Table’ after that, although I don’t know what you want to do with the data after reading it and creating the table (if that’s what you actually want to do).
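For instance, selecting rows of such a table by ID could look roughly like the sketch below. It reuses the two truncated example entries from above; the names wanted, keep and CYP_Subset are illustrative only. Table row names must be unique, so a file that repeats a header (as the example in the question does) would need its duplicates removed before the table is built.

```matlab
% Sketch only: pick rows of the table whose row name starts with a wanted ID.
header = {'FBtr0077276  | Cyp6v1 Cytochrome P450 6v1';
          'FBtr0079061  | Cyp28d2 Cytochrome P450 28d2'};
sequence = {'MVYSTNILLAIVTILTGVFIWSR.....................';
            'MCPVTTFLVLVLTLLVLVYVFLTWNFNYWRKRGIKTAPTWPFVGS........ '};
CYP_Table = cell2table(sequence, 'RowNames', header);

wanted = ["FBtr0077276", "FBtr0080587"];     % IDs to look for

% startsWith returns true where a row name begins with any of the wanted IDs.
keep = startsWith(CYP_Table.Properties.RowNames, wanted);

CYP_Subset = CYP_Table(keep, :)              % table containing only the matching rows
```

Because this is a prefix match, an ID that is itself a prefix of a longer ID would also be selected; comparing the first token of each row name with ismember would be stricter.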
