NCBI API – Integration into analysis code
QBRC Tech Talk
Jiwoong Kim
Outline
• Introduction
• Usage Guidelines of the E-utilities
• Sample Applications of the E-utilities
NCBI & Entrez
• The National Center for Biotechnology Information (NCBI) advances science and health by providing access to biomedical and genomic information.
• Entrez is NCBI’s primary text search and retrieval system. It integrates the PubMed database of biomedical literature with 39 other literature and molecular databases, including DNA and protein sequence, structure, gene, genome, genetic variation and gene expression.
E-utilities
• Entrez Programming Utilities
– The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the NCBI.
– The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.
– Input: URL → E-utilities → Output: XML, FASTA, Text …
Usage Guidelines and Requirements
• Use the E-utility URL
– baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ …
– Python urllib/urlopen, Perl LWP::Simple, Linux wget, …
• Frequency, Timing and Registration of E-utility URL Requests
– Make no more than 3 requests per second → sleep(0.5)
– Run large jobs on weekends or between 9 PM and 5 AM Eastern time on weekdays
– Include &tool and &email in all requests
• Minimizing the Number of Requests
– &retmax=500
• Handling Special Characters Within URLs
– Space → +, " → %22, # → %23
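The guidelines above can be sketched in Python roughly as follows. This is a minimal sketch, not NCBI reference code: `build_url` and `throttled` are hypothetical helper names, and the `tool` and `email` values are placeholders to replace with your own. It relies on `urllib.parse.urlencode`, which by default encodes a space as `+`, `"` as `%22` and `#` as `%23`.

```python
import time
from urllib.parse import urlencode

# Base URL as given on the slide.
BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(utility, **params):
    """Return an E-utility URL that always carries &tool and &email."""
    # Placeholder identification values (assumptions); use your own.
    params.setdefault("tool", "my_tool")
    params.setdefault("email", "me@example.org")
    # urlencode escapes special characters: space -> +, " -> %22, # -> %23
    return BASE_URL + utility + ".fcgi?" + urlencode(params)


def throttled(urls, delay=0.5):
    """Yield URLs one by one, sleeping after each to stay under 3 requests/second."""
    for url in urls:
        yield url
        time.sleep(delay)


url = build_url("esearch", db="pubmed", term='"copy number" #1', retmax=500)
print(url)
```

Looping over `throttled(list_of_urls)` and fetching each URL inside the loop keeps the request rate at about two per second with the default delay.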
ESearch (text searches)
• Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.
• Syntax: esearch.fcgi?db=<database>&term=<query>
– Input: Entrez database (&db); Any Entrez text query (&term)
– Output: List of UIDs matching the Entrez query
• Example: Get the PubMed IDs (PMIDs) for articles about osteosarcoma
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]
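ESearch returns XML by default, so the UID list can be extracted with a standard XML parser. The sketch below assumes the usual `eSearchResult` layout (`Count`, `IdList/Id`). The sample response is hand-written and trimmed for illustration, reusing PMIDs that appear elsewhere in these slides, and `parse_esearch` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed stand-in for an ESearch response (illustration only).
SAMPLE = """<eSearchResult>
  <Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart>
  <IdList><Id>24450072</Id><Id>24333720</Id><Id>24333432</Id></IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return (total hit count, list of UIDs) from ESearch XML."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    uids = [node.text for node in root.findall("./IdList/Id")]
    return count, uids


count, uids = parse_esearch(SAMPLE)
print(count, uids)
```

In a real script the XML text would come from reading the ESearch URL, for example with `urllib.request.urlopen(url).read()`.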
ESummary (document summary downloads)
• Responds to a list of UIDs from a given database with the corresponding document summaries.
• Syntax: esummary.fcgi?db=<database>&id=<uid_list>
– Input: List of UIDs (&id); Entrez database (&db)
– Output: XML DocSums
• Example: Download DocSums for these PubMed IDs: 24450072, 24333720, 24333432
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=24450072,24333720,24333432
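A DocSum can be flattened into a dictionary keyed by field name. The sketch below assumes the classic ESummary XML layout, in which each `DocSum` holds an `Id` and a series of `Item` elements with a `Name` attribute. The titles and dates in the sample are invented placeholders, not the real records, and `parse_docsums` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an ESummary response; field values are placeholders.
SAMPLE = """<eSummaryResult>
  <DocSum>
    <Id>24450072</Id>
    <Item Name="PubDate" Type="Date">placeholder date A</Item>
    <Item Name="Title" Type="String">Placeholder title A</Item>
  </DocSum>
  <DocSum>
    <Id>24333720</Id>
    <Item Name="PubDate" Type="Date">placeholder date B</Item>
    <Item Name="Title" Type="String">Placeholder title B</Item>
  </DocSum>
</eSummaryResult>"""


def parse_docsums(xml_text):
    """Map each UID to a {field name: text} dictionary of its top-level Items."""
    root = ET.fromstring(xml_text)
    summaries = {}
    for docsum in root.findall("DocSum"):
        uid = docsum.findtext("Id")
        summaries[uid] = {item.get("Name"): item.text for item in docsum.findall("Item")}
    return summaries


summaries = parse_docsums(SAMPLE)
for uid, fields in summaries.items():
    print(uid, fields["Title"])
```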
EFetch (data record downloads)
• Responds to a list of UIDs in a given database with the corresponding data records in a specified format.
• Syntax: efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>&retmode=<retrieval_mode>
– Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)
– Output: Formatted data records as specified
• Example: Download the abstract of PubMed ID 24333432
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24333432&rettype=abstract&retmode=text
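The EFetch syntax above maps directly onto a small URL builder. This is only a sketch: `efetch_url` and `fetch_text` are hypothetical helper names, and `fetch_text` needs network access, so the call is left commented out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_url(db, uids, rettype, retmode):
    """Build an EFetch URL for one or more UIDs."""
    params = {
        "db": db,
        "id": ",".join(str(uid) for uid in uids),
        "rettype": rettype,
        "retmode": retmode,
    }
    # safe="," keeps the comma-separated UID list readable.
    return BASE_URL + "efetch.fcgi?" + urlencode(params, safe=",")


def fetch_text(url):
    """Download the records at url and return them as text (needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


url = efetch_url("pubmed", [24333432], "abstract", "text")
print(url)
# print(fetch_text(url))  # uncomment to download the abstract
```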
ELink (Entrez links)
• Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database
• Checks for the existence of a specified link from a list of one or more UIDs
• Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.
ELink (Entrez links)
• Syntax: elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<uid_list>
– Input: List of UIDs (&id); Source Entrez database (&dbfrom); Destination Entrez database (&db)
– Output: XML containing linked UIDs from source and destination databases
• Example: Find one set/separate sets of Gene IDs linked to PubMed IDs 24333432 and 24314238
– One set (comma-separated UIDs): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432,24314238
– Separate sets (repeated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432&id=24314238
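The two URL forms in the example differ only in how the UIDs are passed. A minimal sketch that produces both, assuming a hypothetical helper named `elink_url`:

```python
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(dbfrom, db, uids, separate=False):
    """Build an ELink URL.

    separate=False: one comma-separated &id, giving one combined set of links.
    separate=True:  one &id per UID, giving a separate set of links per UID.
    """
    query = [("dbfrom", dbfrom), ("db", db)]
    if separate:
        query += [("id", str(uid)) for uid in uids]
    else:
        query.append(("id", ",".join(str(uid) for uid in uids)))
    return BASE_URL + "elink.fcgi?" + urlencode(query, safe=",")


pmids = [24333432, 24314238]
one_set = elink_url("pubmed", "gene", pmids)
separate_sets = elink_url("pubmed", "gene", pmids, separate=True)
print(one_set)
print(separate_sets)
```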
EGQuery (global query)
• Responds to a text query with the number of records matching the query in each Entrez database.
• Syntax: egquery.fcgi?term=<query>
– Input: Entrez text query (&term)
– Output: XML containing the number of hits in each database.
• Example: Determine the number of records for mouse in Entrez.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[orgn]&retmode=xml
ESpell (spelling suggestions)
• Retrieves spelling suggestions for a text query in a given database.
• Syntax: espell.fcgi?term=<query>&db=<database>
– Input: Entrez text query (&term); Entrez database (&db)
– Output: XML containing the original query and spelling suggestions.
• Example: Find spelling suggestions for the PubMed Central query "osteosacoma".
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosacoma&db=pmc
EInfo (database statistics)
• Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.
• Syntax: einfo.fcgi?db=<database>
– Input: Entrez database (&db)
– Output: XML containing database statistics
• Example: Find database statistics for Entrez Protein.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein
EPost (UID uploads)
• Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.
• Syntax: epost.fcgi?db=<database>&id=<uid_list>
– Input: List of UIDs (&id); Entrez database (&db)
– Output: Web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of uploaded UIDs
• Example: Upload five Gene IDs (7173, 22018, 54314, 403521, 525013) for later processing.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=7173,22018,54314,403521,525013
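Long UID lists are usually easier to send in an HTTP POST body than in the URL. The sketch below prepares such a request without sending it and shows how the two history values might be read from the reply. It assumes an `ePostResult` reply containing `QueryKey` and `WebEnv`; the sample reply is hand-written with a placeholder WebEnv, and `epost_request` and `parse_history` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def epost_request(db, uids):
    """Prepare (but do not send) a POST request that uploads UIDs to the History Server."""
    body = urlencode({"db": db, "id": ",".join(str(uid) for uid in uids)}, safe=",")
    # Supplying data makes urllib issue a POST instead of a GET.
    return Request(BASE_URL + "epost.fcgi", data=body.encode("ascii"))


def parse_history(xml_text):
    """Return (query_key, WebEnv) from an EPost reply."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


# Hand-written stand-in for the reply; the WebEnv value is a placeholder.
SAMPLE = "<ePostResult><QueryKey>1</QueryKey><WebEnv>NCID_PLACEHOLDER</WebEnv></ePostResult>"

request = epost_request("gene", [7173, 22018, 54314, 403521, 525013])
query_key, webenv = parse_history(SAMPLE)
print(request.get_method(), request.full_url)
print(query_key, webenv)
```

Passing `request` to `urllib.request.urlopen` would perform the upload; the returned `query_key` and `WebEnv` can then replace `&id` in later ESummary, EFetch or ELink calls.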
Application 1
• Find human genes related to articles retrieved with the MeSH major topic "Osteosarcoma", without term explosion (PubMed → Gene)
1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]&usehistory=y
2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266&term=%22homo+sapiens%22[organism]&cmd=neighbor_history
3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266
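The three steps chain together through the History Server: each reply supplies the `query_key` and `WebEnv` used by the next URL. The sketch below builds the same three URLs, using hand-written stand-ins for the two XML replies (the element names reflect the usual E-utility reply layout, which is an assumption here, and the WebEnv is a placeholder). `eutil_url` and `history_keys` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutil_url(utility, **params):
    """Build an E-utility URL, leaving [ ] : and , unescaped as in the slide URLs."""
    return BASE_URL + utility + ".fcgi?" + urlencode(params, safe="[]:,")


def history_keys(xml_text):
    """Return (QueryKey, WebEnv) found anywhere below the root of a reply."""
    root = ET.fromstring(xml_text)
    return root.findtext(".//QueryKey"), root.findtext(".//WebEnv")


# Step 1: search PubMed and store the result on the History Server.
step1 = eutil_url("esearch", db="pubmed", term='"osteosarcoma"[majr:noexp]', usehistory="y")

# Stand-in for the reply to step 1 (placeholder WebEnv).
esearch_xml = (
    "<eSearchResult><QueryKey>1</QueryKey>"
    "<WebEnv>NCID_PLACEHOLDER</WebEnv></eSearchResult>"
)
key, webenv = history_keys(esearch_xml)

# Step 2: link the stored PubMed set to human genes, again via history.
step2 = eutil_url("elink", dbfrom="pubmed", db="gene", query_key=key, WebEnv=webenv,
                  term='"homo sapiens"[organism]', cmd="neighbor_history")

# Stand-in for the reply to step 2; the key value mirrors the one on the slide.
elink_xml = (
    "<eLinkResult><LinkSet><DbFrom>pubmed</DbFrom>"
    "<LinkSetDbHistory><DbTo>gene</DbTo><QueryKey>3</QueryKey></LinkSetDbHistory>"
    "<WebEnv>NCID_PLACEHOLDER</WebEnv></LinkSet></eLinkResult>"
)
key, webenv = history_keys(elink_xml)

# Step 3: download document summaries for the linked genes.
step3 = eutil_url("esummary", db="gene", query_key=key, WebEnv=webenv)

for url in (step1, step2, step3):
    print(url)
```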
Application 1
• Find human genes related to articles retrieved with the MeSH major topic "Osteosarcoma", without term explosion (PubMed → Gene)
– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz (can be used instead of "ELink")
– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz (can be used instead of "ESummary")
Application 2
• Find nucleotide sequences of "Burkholderia cepacia complex" and download in GenBank format
1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=%22burkholderia+cepacia+complex%22[organism]&usehistory=y
2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001_1396244608_457974498&rettype=gb&retmode=text
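When the search matches many sequences, one way to keep each download manageable is to page through the stored result with the `retstart` and `retmax` parameters. The sketch below only generates the EFetch URLs. The batch size of 500 echoes the `&retmax=500` hint on the guidelines slide, the total of 1200 and the WebEnv are placeholders, and `batch_urls` is a hypothetical helper name.

```python
from urllib.parse import urlencode

BASE_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def batch_urls(query_key, webenv, total, batch_size=500):
    """Return EFetch URLs that page through `total` stored nuccore records."""
    urls = []
    for start in range(0, total, batch_size):
        params = {
            "db": "nuccore",
            "query_key": query_key,
            "WebEnv": webenv,
            "rettype": "gb",
            "retmode": "text",
            "retstart": start,      # index of the first record in this batch
            "retmax": batch_size,   # maximum number of records in this batch
        }
        urls.append(BASE_URL + "efetch.fcgi?" + urlencode(params))
    return urls


# Placeholder values; in practice they come from the ESearch reply in step 1.
urls = batch_urls("1", "NCID_PLACEHOLDER", total=1200)
for url in urls:
    print(url)
```

The total would normally be the `Count` reported by ESearch, and the downloads should be spaced out to respect the three-requests-per-second limit.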
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
• Workflow diagram:
– PubMed branch: query cancer "copy number" → esearch.fcgi?db=pubmed → WebEnv, query_key → esummary.fcgi?db=pubmed → PubMed title; elink.fcgi?dbfrom=pubmed&db=gds
– GEO branch: query Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] → esearch.fcgi?db=gds → WebEnv, query_key → esummary.fcgi?db=gds → GPL9704, GPL8226, GPL6804, GPL6801 → esearch.fcgi?db=gds
– Parsing → Common (GEO Datasets shared by the "cancer copy number" articles and the "Affymetrix Genome-Wide Human SNP Array" platform) → Result table
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
• Make custom scripts with an XML parser
EBot
• EBot is an interactive web tool that first allows users to construct an arbitrary E-utility analysis pipeline and then generates a Perl script to execute the pipeline. The Perl script can be downloaded and executed on any computer with a Perl installation. For more details, see the EBot page:
– http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi
Entrez Direct
• E-utilities on the UNIX Command Line
• Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/
• Entrez Direct Functions
– esearch performs a new Entrez search using terms in indexed fields.
– elink looks up neighbors (within a database) or links (between databases).
– efilter filters or restricts the results of a previous query.
– efetch downloads records or reports in a designated format.
– xtract converts XML into a table of data values.
– einfo obtains information on indexed fields in an Entrez database.
– epost uploads unique identifiers (UIDs) or sequence accession numbers.
– nquire sends a URL request to a web page or CGI service.
• Entering Query Commands
– esearch -db pubmed -query "opsin gene conversion" | elink -related
Links
• References
– Entrez Programming Utilities Help: http://www.ncbi.nlm.nih.gov/books/NBK25501/
– Entrez Help: http://www.ncbi.nlm.nih.gov/books/NBK3836/
• Useful Links
– Entrez Unique Identifiers (UIDs) for selected databases: http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?report=objectonly
– Valid values of &retmode and &rettype for EFetch (null = empty string): http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly
– The full list of Entrez links: http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
NCBI databases
• Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site Search
• Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA
• Organisms: Taxonomy
• Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe
• Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar, BioProject, BioSample, Clone
• Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets
• Proteins: Protein, Conserved Domains, Protein Clusters, Structure
• Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay
• Pathways: BioSystems
E-utilities
• Eight server-side programs
– ESearch : Searching a Database
– EPost : Uploading UIDs to Entrez
– ESummary : Downloading Document Summaries
– EFetch : Downloading Full Records
– ELink : Finding Related Data Through Entrez Links
– EInfo : Getting Database Statistics and Search Fields
– EGQuery : Performing a Global Entrez Search
– ESpell : Retrieving Spelling Suggestions
Sample Applications of the E-utilities
• Basic pipelines
– ESearch - ESummary/EFetch
– EPost - ESummary/EFetch
– ELink - ESummary/EFetch
– ESearch - ELink - ESummary/EFetch
– EPost - ELink - ESummary/EFetch
– EPost - ESearch
– ELink - ESearch
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
1. tr '\n' '\t' < cancer_copy_number.pubmed_result.txt | sed 's/\t\t/\n/g' | sed 's/^\t[0-9]*: //' | sed 's/\t/ /g' > cancer_copy_number.pubmed_result.oneLine.txt
2. sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/\.$//' > cancer_copy_number.pubmed_ids.txt
3. for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed "s/^/$id\t/"; done > cancer_copy_number.pubmed_gds_ids.txt
4. awk -F'\t' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed 's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt
5. for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl ~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt
6. paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl ~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 > cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt
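The last step joins the two branches: it keeps the articles whose linked GEO DataSets IDs also occur in the platform's ID list. The Perl helper scripts are not shown in the slides, so the Python sketch below is only a guess at the join logic, not a translation of them. All IDs and titles are invented placeholders, and `join_articles_with_platform` is a hypothetical function name.

```python
from collections import defaultdict


def join_articles_with_platform(pubmed_gds_pairs, platform_gds_ids, titles):
    """Return (PMID, matching GDS IDs joined by ', ', title) for articles with a hit."""
    matches = defaultdict(list)
    for pmid, gds_id in pubmed_gds_pairs:
        # Keep only PubMed -> GDS links whose GDS ID belongs to the platform.
        if gds_id in platform_gds_ids:
            matches[pmid].append(gds_id)
    return [(pmid, ", ".join(ids), titles.get(pmid, "")) for pmid, ids in matches.items()]


# Placeholder data standing in for the intermediate files of steps 1 to 5.
pubmed_gds_pairs = [("111", "9001"), ("111", "9002"), ("222", "9003")]
platform_gds_ids = {"9001", "9002", "9999"}
titles = {"111": "Example article A", "222": "Example article B"}

rows = join_articles_with_platform(pubmed_gds_pairs, platform_gds_ids, titles)
for row in rows:
    print("\t".join(row))
```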