Accessing NCBI Entrez Databases with E-Utilities
This example shows how to programmatically search and retrieve data from NCBI's Entrez databases using NCBI's Entrez Utilities (E-Utilities).
Using NCBI E-Utilities to Retrieve Biological Data
E-Utilities (eUtils) are server-side programs (e.g., ESearch, ESummary, and EFetch) developed and maintained by NCBI for searching and retrieving data from most Entrez databases. You access each tool through a URL with a strict syntax: a fixed base URL, a call to the eUtil's script, and the script's associated parameters. For more details on eUtils, see E-Utilities Help.
Searching Nucleotide Database with ESearch
In this example, the starting point for the analysis is the set of genes sequenced from the H5N1 virus isolated in 1997 from a chicken in Hong Kong. This particular virus jumped from chickens to humans, killing six people before the spread of the disease was brought under control by destroying all poultry in Hong Kong [1]. You can use ESearch to find the sequence data needed for the analysis. ESearch requires a database (db) and a search term (term) as input. Optionally, you can ask ESearch to store your search results on the NCBI history server through the usehistory parameter.
baseURL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
eutil = 'esearch.fcgi?';
dbParam = 'db=nuccore';
termParam = '&term=A/chicken/Hong+Kong/915/97+OR+A/chicken/Hong+Kong/915/1997';
usehistoryParam = '&usehistory=y';
esearchURL = [baseURL, eutil, dbParam, termParam, usehistoryParam]
esearchURL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=A/chicken/Hong+Kong/915/97+OR+A/chicken/Hong+Kong/915/1997&usehistory=y'
The term parameter can be any valid Entrez query. Note that there cannot be any spaces in the URL, so parameters are separated by '&' and any spaces in a query term need to be replaced with '+' (e.g., 'Hong+Kong').
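The space-to-'+' substitution described above can be sketched as follows. This is a minimal illustration that handles spaces only; a query containing other reserved URL characters would need additional encoding. The variable names rawTerm and encodedTerm are illustrative and are not used elsewhere in this example.

```matlab
% Human-readable Entrez query, written with ordinary spaces.
rawTerm = 'A/chicken/Hong Kong/915/97 OR A/chicken/Hong Kong/915/1997';

% Replace every space with '+', because the URL cannot contain spaces.
encodedTerm = strrep(rawTerm, ' ', '+');

% Prefix with '&term=' so the result can be appended to an ESearch URL.
termParam = ['&term=', encodedTerm];

% Confirm that no whitespace remains before sending the request.
assert(~any(isspace(termParam)))
disp(termParam)
```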
You can use webread to send the URL and return the results from ESearch as a character array.
searchReport = webread(esearchURL)
searchReport = '<?xml version="1.0" encoding="UTF-8" ?> <!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "https://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd"> <eSearchResult><Count>8</Count><RetMax>8</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>MCID_63cc443030474b6cd8245856</WebEnv><IdList> <Id>6048875</Id> <Id>6048849</Id> <Id>6048770</Id> <Id>6048802</Id> <Id>6048927</Id> <Id>6048903</Id> <Id>6048829</Id> <Id>3421265</Id> </IdList><TranslationSet/><TranslationStack> <TermSet> <Term>A/chicken/Hong[All Fields]</Term> <Field>All Fields</Field> <Count>1096</Count> <Explode>N</Explode> </TermSet> <TermSet> <Term>Kong/915/97[All Fields]</Term> <Field>All Fields</Field> <Count>7</Count> <Explode>N</Explode> </TermSet> <OP>AND</OP> <OP>GROUP</OP> <TermSet> <Term>A/chicken/Hong[All Fields]</Term> <Field>All Fields</Field> <Count>1096</Count> <Explode>N</Explode> </TermSet> <TermSet> <Term>Kong[All Fields]</Term> <Field>All Fields</Field> <Count>6707134</Count> <Explode>N</Explode> </TermSet> <OP>AND</OP> <TermSet> <Term>915[All Fields]</Term> <Field>All Fields</Field> <Count>473497</Count> <Explode>N</Explode> </TermSet> <OP>AND</OP> <TermSet> <Term>1997[All Fields]</Term> <Field>All Fields</Field> <Count>1656226</Count> <Explode>N</Explode> </TermSet> <OP>AND</OP> <OP>GROUP</OP> <OP>OR</OP> </TranslationStack><QueryTranslation>(A/chicken/Hong[All Fields] AND Kong/915/97[All Fields]) OR (A/chicken/Hong[All Fields] AND Kong[All Fields] AND 915[All Fields] AND 1997[All Fields])</QueryTranslation></eSearchResult> '
ESearch returns the search results in XML. The report contains information about the query performed, the database searched, and the UIDs (unique IDs) of the records that match the query. If you use the history server, the report contains two additional IDs, WebEnv and query_key (reported in the XML as the QueryKey element), for accessing the results. WebEnv is the location of the results on the server, and query_key is a number indexing the queries performed. Because WebEnv and query_key are query dependent, they change every time the search is executed. Either the UIDs or WebEnv and query_key can be parsed out of the XML report and then passed to other eUtils. You can use regexp to do the parsing and store the tokens in a structure with the field names WebEnv and QueryKey.
ncbi = regexp(searchReport,...
    '<QueryKey>(?<QueryKey>\w+)</QueryKey>\s*<WebEnv>(?<WebEnv>\S+)</WebEnv>',...
    'names')
ncbi = struct with fields: QueryKey: '1' WebEnv: 'MCID_63cc443030474b6cd8245856'
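If you prefer to pass the UIDs themselves rather than the history-server keys, they can be parsed out of the report in a similar way. The following is a minimal sketch, not part of the original example: sampleReport is a shortened, hand-written stand-in for searchReport (three of the UIDs shown above), and the variable names are illustrative.

```matlab
% Shortened stand-in for the ESearch XML report (assumption for illustration).
sampleReport = ['<eSearchResult><Count>3</Count><IdList> ', ...
    '<Id>6048875</Id> <Id>6048849</Id> <Id>6048770</Id> ', ...
    '</IdList></eSearchResult>'];

% Isolate the IdList element first so that only record UIDs are considered.
idList = regexp(sampleReport, '<IdList>(.*?)</IdList>', 'tokens', 'once');

% Collect the digits inside each <Id> tag as a cell array of char vectors.
uids = regexp(idList{1}, '(?<=<Id>)\d+(?=</Id>)', 'match');

% Join the UIDs with commas, the form used by the id parameter.
idParam = ['&id=', strjoin(uids, ',')]
```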
Getting GenBank® File Summaries with ESummary
To get a quick overview of the sequences that matched the query, you can use ESummary. ESummary retrieves a brief summary, or Document Summary (DocSum), for each record. ESummary requires as input the database to access and the records to retrieve, identified either by a list of UIDs passed through the id parameter or by the WebEnv and query_key parameters. ESummary returns a report in XML that contains the summary information for each record. Use websave with ESummary to perform the record summary retrieval and write out the XML report to a file.
tmpDirectory = tempdir;
summaryFname = fullfile(tmpDirectory,'summaryReport.xml');
websave(summaryFname, [baseURL...
    'esummary.fcgi?db=nuccore&WebEnv=',ncbi.WebEnv,...
    '&query_key=',ncbi.QueryKey]);
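The alternative mentioned above, identifying records by a list of UIDs through the id parameter, might look like the following sketch. It only builds the URL; the three UIDs are taken from the ESearch report shown earlier, and the variable names uids and esummaryURL are illustrative.

```matlab
baseURL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

% Three of the UIDs listed in the ESearch report above.
uids = {'6048875', '6048849', '6048770'};

% The id parameter takes the UIDs as a comma-separated list.
esummaryURL = [baseURL, 'esummary.fcgi?db=nuccore&id=', strjoin(uids, ',')];
disp(esummaryURL)

% To run the query, pass the URL to webread or websave, for example:
%   summaryXML = webread(esummaryURL);
```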
You can create an XSL stylesheet to view information from the ESummary XML report in a web browser. For more information on writing XSL stylesheets, see W3C® XSL. An XSL stylesheet was created for this example to view the sequence summary information and provide links to the full GenBank® files. You can use xslt to view the XML report in a web browser from MATLAB®.
xslt(summaryFname,'genbankSummary.xsl','-web');
Retrieving Full GenBank Files with EFetch
To perform the sequence analysis, you need to get the full GenBank record for each sequence. EFetch retrieves full records from Entrez databases. EFetch requires as input a database and either a list of UIDs or WebEnv and query_key. Additionally, EFetch can return the output in different formats. You can specify the output format (e.g., GenBank (gb) or FASTA) and the file format (e.g., text, ASN.1, or XML) through the rettype and retmode parameters, respectively. For this query, rettype is gb for the GenBank file format and retmode is text. You can use genbankread directly with the EFetch URL to retrieve all the GenBank records and read them into a structure array. This structure can then be used as input to seqviewer to visualize the sequences.
ch97struct = genbankread([baseURL...
    'efetch.fcgi?db=nuccore&rettype=gb&retmode=text&WebEnv=',ncbi.WebEnv,...
    '&query_key=',ncbi.QueryKey]);
seqviewer(ch97struct)
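To request FASTA output instead of GenBank records, only the rettype value needs to change. The following sketch builds such a URL. The structure ncbiExample is a placeholder that reuses the WebEnv and QueryKey values printed earlier; because these keys change with every search, use the values from your own ESearch call in practice.

```matlab
baseURL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

% Placeholder history-server keys (assumption); substitute ncbi.WebEnv and
% ncbi.QueryKey from a fresh ESearch call when running a real query.
ncbiExample = struct('QueryKey', '1', 'WebEnv', 'MCID_63cc443030474b6cd8245856');

% rettype=fasta selects FASTA output; retmode=text returns it as plain text.
fastaURL = [baseURL, 'efetch.fcgi?db=nuccore&rettype=fasta&retmode=text', ...
    '&WebEnv=', ncbiExample.WebEnv, '&query_key=', ncbiExample.QueryKey];
disp(fastaURL)

% To run the query with current keys:
%   fastaText = webread(fastaURL);
```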
Finding Links Between Databases with ELink
It might be useful to have the PubMed articles related to these gene records. ELink provides this functionality by finding associations between records within or between databases. You can give ELink the query_key and WebEnv IDs from above and tell it to find records in the PubMed database (db parameter) associated with your records from the Nucleotide (nuccore) database (dbfrom parameter). ELink returns an XML report with the UIDs for the records in PubMed. These UIDs can be parsed out of the report and passed to other eUtils (e.g., ESummary). Use the stylesheet created for viewing ESummary reports to view the results of ELink.
elinkReport = webread([baseURL...
    'elink.fcgi?dbfrom=nuccore&db=pubmed&WebEnv=', ncbi.WebEnv,...
    '&query_key=',ncbi.QueryKey]);
Extract the PubMed UIDs from the ELink report.
pubmedIDs = regexp(elinkReport,'<Link>\s+<Id>(\w*)</Id>\s+</Link>','tokens');
NumberOfArticles = numel(pubmedIDs)
% Put PubMed UIDs into a string that can be read by EPost URL.
pubmed_str = [];
for ii = 1:NumberOfArticles
    pubmed_str = sprintf([pubmed_str '%s,'],char(pubmedIDs{ii}));
end
NumberOfArticles = 2
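As an alternative to the loop, the comma-separated list can be built in one step with strjoin, which also avoids the trailing comma. This is a sketch under the assumption that the IDs have the nested-cell shape returned by regexp with the 'tokens' option; the values in sampleIDs are made-up placeholders, not real PubMed UIDs.

```matlab
% Mimic the 'tokens' output: a cell array whose elements are 1-by-1 cells.
% The ID values below are placeholders for illustration only.
sampleIDs = {{'11111111'}, {'22222222'}};

% Flatten the nested cells into a single cell array of char vectors.
flatIDs = [sampleIDs{:}];

% Join with commas; no trailing comma needs to be trimmed afterwards.
pubmed_list = strjoin(flatIDs, ',')
```

With this form, the list can be appended to an EPost URL directly, without indexing off the final character.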
Posting UIDs to NCBI History Server with EPost
You can use EPost to post UIDs to the history server. It returns an XML report with query_key and WebEnv IDs that point to the location of the posted UIDs on the history server. Again, these can be parsed out of the report and used with other eUtils calls.
epostReport = webread([baseURL 'epost.fcgi?db=pubmed&id=',pubmed_str(1:end-1)]);
epostKeys = regexp(epostReport,...
    '<QueryKey>(?<QueryKey>\w+)</QueryKey>\s*<WebEnv>(?<WebEnv>\S+)</WebEnv>','names')
epostKeys = struct with fields: QueryKey: '1' WebEnv: 'MCID_63cc44394715f94fc51dbb19'
Using ELink to Find Associated Files Within the Same Database
ELink can do "within" database searches. For example, you can query for a nucleotide sequence within the Nucleotide (nuccore) database to find similar sequences, essentially performing a BLAST search. For "within" database searches, ELink returns an XML report containing the related records, along with a score ranking each record's relationship to the query record. From the above PubMed search, you might be interested in finding all articles related to those articles in PubMed. This is easy to do with ELink. To do a "within" database search, set both db and dbfrom to pubmed. You can use the query_key and WebEnv from the EPost call.
pm2pmReport = webread([baseURL...
    'elink.fcgi?dbfrom=pubmed&db=pubmed&query_key=',epostKeys.QueryKey,...
    '&WebEnv=',epostKeys.WebEnv]);
pubmedIDs = regexp(pm2pmReport,'(?<=<Id>)\w*(?=</Id>)','match');
% Remove duplicate UIDs so that the count and the list built below agree.
pubmedIDs = unique(pubmedIDs,'stable');
NumberOfArticles = numel(pubmedIDs)
pubmed_str = [];
for ii = 1:NumberOfArticles
    pubmed_str = sprintf([pubmed_str '%s,'],char(pubmedIDs{ii}));
end
NumberOfArticles = 388
Use websave with EFetch to retrieve full abstracts for the articles and write out the returned XML report to a file. An XSL stylesheet is provided with this example for viewing the results of the EFetch query. The XML report can be transformed using the stylesheet and opened in a web browser from MATLAB using xslt.
fullFname = fullfile(tmpDirectory,'H5N1_relatedArticles.xml');
websave(fullFname, [baseURL 'efetch.fcgi?db=pubmed&retmode=xml&id=',...
    pubmed_str(1:end-1)]);
xslt(fullFname,'pubmedFullReport.xsl','-web');
Using EGQuery to get a Global View of H5N1 Related-Records in Entrez
To see what other Entrez databases contain information about the H5N1 virus, use EGQuery. EGQuery performs a text search across all available Entrez databases and returns the number of hits in each. EGQuery accepts any valid Entrez text query as input through the term parameter.
wbo = weboptions('Timeout', 15); % allow 15 seconds before timeout
entrezSearch = webread([baseURL,'egquery.fcgi?term=H5N1+AND+virus'], wbo);
entrezResults = regexp(entrezSearch,...
    '<DbName>(?<DB>\w+\s*\w*)</DbName>.*?(<Count>)(?<Count>\d+)</Count>',...
    'names');
entrezDBs = {entrezResults(:).DB};
dbCounts = str2double({entrezResults(:).Count});
entrezDBs = entrezDBs(logical(dbCounts)); % remove databases with no records
[dbCounts,sortInd] = sort(dbCounts(logical(dbCounts)));
entrezDBs = entrezDBs(sortInd);
numDBs = numel(entrezDBs);
barh(log10(dbCounts));
ylim([.5 numDBs+.5])
ax = gca;
ax.YTick = 1:numDBs;
ax.YTickLabel = entrezDBs;
xlabel('Log(Number of Records)');
title('Number of H5N1 Related-Records Per Entrez Database');
References
[1] Cristianini, N. and Hahn, M.W. "Introduction to Computational Genomics: A Case Studies Approach", Cambridge University Press, 2007.