Parsing or regexp HTML output from urlread

Question

0 votes

I need to extract the PubMed IDs from the HTML below, but I am not very fluent with regexp.

Can anyone help with how to extract the IDs from the HTML below and store them in a vector?

I'm guessing there is some way to say: whatever is between '<Id>' and '</Id>', store in...

<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult><Count>8</Count><RetMax>8</RetMax><RetStart>0</RetStart>
<IdList>
<Id>16123227</Id>
<Id>9561342</Id>
<Id>8429296</Id>
<Id>1408722</Id>
<Id>2152845</Id>
<Id>2894889</Id>
<Id>2860133</Id>
<Id>6145799</Id>
</IdList>
<TranslationSet/>
<TranslationStack>
<TermSet><Term>"ulcerative colitis"[All Fields]</Term><Field>All Fields</Field><Count>33249</Count><Explode>N</Explode></TermSet>
<TermSet><Term>"Clonidine"[All Fields]</Term><Field>All Fields</Field><Count>16458</Count><Explode>N</Explode></TermSet>
<OP>AND</OP>
</TranslationStack>
<QueryTranslation>"ulcerative colitis"[All Fields] AND "Clonidine"[All Fields]</QueryTranslation>
</eSearchResult>

Answer 1

Tom on 24 Jun 2013

1 vote

str = 'version="1.0" ? eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd" eSearchResult880<IdList> Id>16123227</Id Id>9561342</Id Id>8429296</Id Id>1408722</Id Id>2152845</Id Id>2894889</Id Id>2860133</Id Id>6145799</Id /IdList<TranslationSet/><TranslationStack> TermSet Term"ulcerative colitis"[All Fields]</Term> Fields</Field Count>33249</Count Explode>N</Explode /TermSet TermSet Term"Clonidine"[All Fields]</Term> Fields</Field Count>16458</Count Explode>N</Explode /TermSet OP>AND</OP /TranslationStack"ulcerative colitis"[All Fields] AND "Clonidine"[All Fields]</eSearchResult>';
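% Isolate the ID list: everything between 'IdList>' and '/IdList'
IDList = regexp(str,'(?<=IdList>).*(?=/IdList)','match');
disp(IDList{1})
% Read the ID numbers from that substring; textscan returns a cell array,
% and IDno{1} is an int32 column vector holding the IDs
IDno = textscan(IDList{1},'Id>%d</Id');
disp(IDno{1})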
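
If the string returned by urlread still has its angle brackets intact (that is, the IDs appear as <Id>...</Id>), a token-based regexp is another way to pull out "what is between '<Id>' and '</Id>'". The following is only a sketch under that assumption: xmlStr is an abbreviated, made-up stand-in for the real response, and the variable names are hypothetical.

```matlab
% Abbreviated sample response, assuming the <Id> tags are intact
xmlStr = ['<eSearchResult><Count>8</Count><IdList>' ...
          '<Id>16123227</Id><Id>9561342</Id><Id>8429296</Id>' ...
          '</IdList></eSearchResult>'];

% Capture only the digits between each <Id> and </Id> pair
tokens = regexp(xmlStr, '<Id>(\d+)</Id>', 'tokens');

% tokens is a cell array of 1x1 cells; flatten it and convert to doubles
ids = str2double([tokens{:}]);
disp(ids)
```

Because the pattern names the <Id> tag explicitly, other numeric fields such as <Count> are left alone. The result here is a row vector of doubles; use ids(:) if a column vector is preferred.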

1 Comment

Philip Spratt on 25 Jun 2013

Very much appreciated Tom

Thanks!

Answer 2

Sean de Wolski on 24 Jun 2013

0 votes

See xml2struct on the MATLAB File Exchange (FEX).

1 Comment

Philip Spratt on 24 Jun 2013

I must apologise: the output was HTML, so xml2struct didn't work.
