NCBI API - Integration into analysis code

Post on 23-Aug-2014


QBRC Tech Talk on April 1st, 2014


NCBI API – Integration into analysis code

QBRC Tech Talk

Jiwoong Kim

Outline

• Introduction

• Usage Guidelines of the E-utilities

• Sample Applications of the E-utilities

NCBI & Entrez

• The National Center for Biotechnology Information (NCBI) advances science and health by providing access to biomedical and genomic information.

• Entrez is NCBI’s primary text search and retrieval system. It integrates the PubMed database of biomedical literature with 39 other literature and molecular databases, including DNA and protein sequence, structure, gene, genome, genetic variation and gene expression.

E-utilities

• Entrez Programming Utilities

– The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the NCBI.

– The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.

[Diagram: Input (URL) → E-utilities → Output (XML, FASTA, Text, …)]

Usage Guidelines and Requirements

• Use the E-utility URL

– baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ …

– Python urllib/urlopen, Perl LWP::Simple, Linux wget, …

• Frequency, Timing and Registration of E-utility URL Requests

– Make no more than 3 requests per second, e.g. sleep(0.5) between requests

– Run large jobs on weekends or between 9 PM and 5 AM Eastern time on weekdays

– Include &tool and &email in all requests

• Minimizing the Number of Requests

– Retrieve many records per request, e.g. &retmax=500

• Handling Special Characters Within URLs

– Space → +, " → %22, # → %23
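The guidelines above can be sketched in Python roughly as follows. This is a minimal illustration, not code from the talk: the names `build_url` and `Throttle`, the tool name `my_tool` and the email address are hypothetical placeholders. The base URL uses https where the slides show http, because NCBI now serves the E-utilities over HTTPS.

```python
import time
from urllib.parse import urlencode

# Base URL from the slide, with https instead of http (assumption: current NCBI setup).
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(utility, tool="my_tool", email="user@example.org", **params):
    """Return an E-utility URL that always carries &tool and &email.

    urlencode() takes care of the special characters listed above:
    space -> +, " -> %22, # -> %23.
    """
    query = dict(params, tool=tool, email=email)
    return BASE_URL + utility + ".fcgi?" + urlencode(query)


class Throttle:
    """Sleep as needed so that successive calls are at least min_interval seconds apart."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


# Hypothetical query, chosen only to show the escaping of space, " and #.
url = build_url("esearch", db="pubmed", term='"copy number" #1', retmax=500)
print(url)
```

Calling `Throttle().wait()` before each request keeps a loop at two requests per second, which stays under the three-per-second limit.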


ESearch (text searches)

• Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.

• Syntax: esearch.fcgi?db=<database>&term=<query>

– Input: Entrez database (&db); any Entrez text query (&term)

– Output: List of UIDs matching the Entrez query

• Example: Get the PubMed IDs (PMIDs) for articles about osteosarcoma

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]
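A possible Python version of the ESearch example is sketched below. The function names are hypothetical, the base URL uses https instead of the http shown on the slide, and the sample XML is a hand-written stand-in for a real response that reuses PMIDs from these slides. The network call is wrapped in `esearch()` and is not executed here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL; safe="[]:" leaves the field tag [majr:noexp] unescaped."""
    query = {"db": db, "term": term, "retmax": retmax}
    return BASE_URL + "esearch.fcgi?" + urlencode(query, safe="[]:")


def parse_id_list(xml_text):
    """Extract the UIDs from an eSearchResult XML document."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


def esearch(db, term, retmax=20):
    """Run the search over the network (defined for completeness, not called here)."""
    with urlopen(esearch_url(db, term, retmax)) as response:
        return parse_id_list(response.read())


# Hand-written sample in the shape of an ESearch response (illustrative only).
SAMPLE = """<eSearchResult>
  <Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart>
  <IdList><Id>24450072</Id><Id>24333720</Id><Id>24333432</Id></IdList>
</eSearchResult>"""

print(esearch_url("pubmed", '"osteosarcoma"[majr:noexp]'))
print(parse_id_list(SAMPLE))
```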

[Diagram labels: ESearch, UIDs, ESummary, EFetch, UID]

ESummary (document summary downloads)

• Responds to a list of UIDs from a given database with the corresponding document summaries.

• Syntax: esummary.fcgi?db=<database>&id=<uid_list>

– Input: List of UIDs (&id); Entrez database (&db)

– Output: XML DocSums

• Example: Download DocSums for these PubMed IDs: 24450072, 24333720, 24333432

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=24450072,24333720,24333432
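The DocSum XML can be turned into a lookup table with a few lines of Python. The sketch below assumes the version 1.0 DocSum layout (`<DocSum>` elements containing `<Id>` and named `<Item>` children). The function names are hypothetical and the sample document is hand-written with placeholder titles, not real ESummary output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esummary_url(db, uids):
    """Build an ESummary URL with a comma-separated UID list."""
    query = {"db": db, "id": ",".join(str(uid) for uid in uids)}
    return BASE_URL + "esummary.fcgi?" + urlencode(query, safe=",")


def parse_docsums(xml_text):
    """Map each UID to a dict of {item name: item text}."""
    summaries = {}
    for docsum in ET.fromstring(xml_text).findall("DocSum"):
        uid = docsum.findtext("Id")
        summaries[uid] = {
            item.get("Name"): item.text for item in docsum.findall("Item")
        }
    return summaries


# Hand-written sample (placeholder titles, illustrative only).
SAMPLE = """<eSummaryResult>
  <DocSum>
    <Id>24450072</Id>
    <Item Name="Title" Type="String">Placeholder title A</Item>
    <Item Name="Source" Type="String">Placeholder journal</Item>
  </DocSum>
  <DocSum>
    <Id>24333720</Id>
    <Item Name="Title" Type="String">Placeholder title B</Item>
  </DocSum>
</eSummaryResult>"""

print(esummary_url("pubmed", [24450072, 24333720, 24333432]))
for uid, fields in parse_docsums(SAMPLE).items():
    print(uid, fields.get("Title"))
```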


EFetch (data record downloads)

• Responds to a list of UIDs in a given database with the corresponding data records in a specified format.

• Syntax: efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>&retmode=<retrieval_mode>

– Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)

– Output: Formatted data records as specified

• Example: Download the abstract of PubMed ID 24333432

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24333432&rettype=abstract&retmode=text

ELink (Entrez links)

• Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database.

• Checks for the existence of a specified link from a list of one or more UIDs

• Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.

ELink (Entrez links)

• Syntax: elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<uid_list>

– Input: List of UIDs (&id); Source Entrez database (&dbfrom); Destination Entrez database (&db)

– Output: XML containing linked UIDs from source and destination databases

• Example: Find the Gene IDs linked to PubMed IDs 24333432 and 24314238, either as one combined set or as separate sets

– One set (comma-separated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432,24314238

– Separate sets (repeated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432&id=24314238
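The two ELink URL forms in the example differ only in how the &id parameter is written. A small Python sketch of that difference follows; the function name `elink_url` and its `separate` flag are hypothetical, and https is used in place of the slide's http.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(dbfrom, db, uids, separate=False):
    """Build an ELink URL.

    separate=False -> one comma-separated &id (one combined set of links)
    separate=True  -> one &id per UID (a separate link set for each UID)
    """
    ids = [str(uid) for uid in uids]
    query = [("dbfrom", dbfrom), ("db", db)]
    if separate:
        query += [("id", uid) for uid in ids]
    else:
        query.append(("id", ",".join(ids)))
    return BASE_URL + "elink.fcgi?" + urlencode(query, safe=",")


pmids = [24333432, 24314238]
print(elink_url("pubmed", "gene", pmids))
print(elink_url("pubmed", "gene", pmids, separate=True))
```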


EGQuery (global query)

• Responds to a text query with the number of records matching the query in each Entrez database.

• Syntax: egquery.fcgi?term=<query>

– Input: Entrez text query (&term)

– Output: XML containing the number of hits in each database.

• Example: Determine the number of records for mouse in Entrez.

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[orgn]&retmode=xml


ESpell (spelling suggestions)

• Retrieves spelling suggestions for a text query in a given database.

• Syntax: espell.fcgi?term=<query>&db=<database>

– Input: Entrez text query (&term); Entrez database (&db)

– Output: XML containing the original query and spelling suggestions.

• Example: Find spelling suggestions for the PubMed Central (db=pmc) query "osteosacoma".

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosacoma&db=pmc
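Reading the suggestion out of the ESpell XML might look like the sketch below. It assumes the response carries the suggestion in a `<CorrectedQuery>` element. The function names are hypothetical and the sample XML is hand-written for illustration, not captured from the server.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def espell_url(term, db):
    """Build an ESpell URL for a query term and database."""
    return BASE_URL + "espell.fcgi?" + urlencode({"term": term, "db": db})


def corrected_query(xml_text):
    """Return the suggested query, or None when no correction is offered."""
    text = ET.fromstring(xml_text).findtext("CorrectedQuery")
    return text or None


# Hand-written sample in the assumed shape of an ESpell response.
SAMPLE = """<eSpellResult>
  <Database>pmc</Database>
  <Query>osteosacoma</Query>
  <CorrectedQuery>osteosarcoma</CorrectedQuery>
</eSpellResult>"""

print(espell_url("osteosacoma", "pmc"))
print(corrected_query(SAMPLE))
```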

EInfo (database statistics)

• Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.

• Syntax: einfo.fcgi?db=<database>

– Input: Entrez database (&db)

– Output: XML containing database statistics

• Example: Find database statistics for Entrez Protein.

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein

EPost (UID uploads)

• Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.

• Syntax: epost.fcgi?db=<database>&id=<uid_list>

– Input: List of UIDs (&id); Entrez database (&db)

– Output: Web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of uploaded UIDs

• Example: Upload five Gene IDs (7173, 22018, 54314, 403521, 525013) for later processing.

– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=7173,22018,54314,403521,525013
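To reuse the uploaded set, the WebEnv and query key have to be read from the EPost response and passed to the next call. A minimal sketch follows, assuming the response has `<QueryKey>` and `<WebEnv>` elements directly under the root. The function names and the `NCID_EXAMPLE` value are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def epost_url(db, uids):
    """Build an EPost URL that uploads a comma-separated UID list."""
    query = {"db": db, "id": ",".join(str(uid) for uid in uids)}
    return BASE_URL + "epost.fcgi?" + urlencode(query, safe=",")


def parse_history(xml_text):
    """Return the history-server coordinates as URL parameters."""
    root = ET.fromstring(xml_text)
    return {"WebEnv": root.findtext("WebEnv"), "query_key": root.findtext("QueryKey")}


def esummary_from_history(db, history):
    """Build an ESummary URL that reads its UIDs from the history server."""
    return BASE_URL + "esummary.fcgi?" + urlencode({"db": db, **history})


# Hand-written sample response (the WebEnv value is a placeholder).
SAMPLE = "<ePostResult><QueryKey>1</QueryKey><WebEnv>NCID_EXAMPLE</WebEnv></ePostResult>"

print(epost_url("gene", [7173, 22018, 54314, 403521, 525013]))
print(esummary_from_history("gene", parse_history(SAMPLE)))
```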

Application 1

• Find human genes related to articles retrieved with the non-exploded MeSH major topic "Osteosarcoma" (PubMed → Gene)

1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]&usehistory=y

2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266&term=%22homo+sapiens%22[organism]&cmd=neighbor_history

3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266
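The three URLs above form a chain: each step takes the WebEnv and query key returned by the previous one. One way to script the chain in Python is sketched below. The function names are hypothetical, and the element paths used for the ELink history output (`LinkSet/WebEnv`, `LinkSet/LinkSetDbHistory/QueryKey`) are an assumption about the response layout. The `fetcher` argument exists so the chain can be tried with canned responses instead of the network.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def fetch(url):
    """Download a URL and return the response body."""
    with urlopen(url) as response:
        return response.read()


def url_for(utility, **params):
    """Build an E-utility URL from keyword parameters."""
    return BASE_URL + utility + ".fcgi?" + urlencode(params)


def pubmed_to_gene_summaries(term, organism, fetcher=fetch, pause=0.5):
    """ESearch (PubMed) -> ELink (PubMed to Gene) -> ESummary (Gene), via the history server."""
    # Step 1: search PubMed and keep the result set on the history server.
    root = ET.fromstring(fetcher(url_for("esearch", db="pubmed", term=term, usehistory="y")))
    web_env, key = root.findtext("WebEnv"), root.findtext("QueryKey")
    time.sleep(pause)

    # Step 2: link the stored PubMed set to Gene, restricted by organism.
    root = ET.fromstring(fetcher(url_for(
        "elink", dbfrom="pubmed", db="gene", query_key=key, WebEnv=web_env,
        term=organism, cmd="neighbor_history")))
    web_env = root.findtext("LinkSet/WebEnv")
    key = root.findtext("LinkSet/LinkSetDbHistory/QueryKey")
    time.sleep(pause)

    # Step 3: download the Gene document summaries for the linked set.
    return fetcher(url_for("esummary", db="gene", query_key=key, WebEnv=web_env))
```

A call such as `pubmed_to_gene_summaries('"osteosarcoma"[majr:noexp]', '"homo sapiens"[organism]')` would then return the Gene DocSum XML for parsing.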

Application 1

• Find human genes related to articles retrieved with the non-exploded MeSH major topic "Osteosarcoma" (PubMed → Gene)

– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz

• It can be used instead of "ELink".

– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz

• It can be used instead of "ESummary".
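Using the downloaded gene2pubmed file in place of ELink amounts to filtering a tab-delimited table. The sketch below assumes the file's three columns are tax_id, GeneID and PubMed_ID, with a header line starting with "#". The function name is hypothetical and the sample rows are made up for illustration.

```python
import gzip


def genes_for_pmids(lines, pmids, tax_id="9606"):
    """Collect GeneIDs per PubMed ID for one organism (9606 = human)."""
    wanted = {str(pmid) for pmid in pmids}
    genes = {}
    for line in lines:
        if line.startswith("#"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed rows
        tax, gene_id, pmid = fields[:3]
        if tax == tax_id and pmid in wanted:
            genes.setdefault(pmid, []).append(gene_id)
    return genes


def genes_from_file(path, pmids, tax_id="9606"):
    """Same lookup, reading a local copy of gene2pubmed.gz."""
    with gzip.open(path, "rt") as handle:
        return genes_for_pmids(handle, pmids, tax_id)


# Made-up rows in the assumed gene2pubmed layout (illustrative only).
SAMPLE = [
    "#tax_id\tGeneID\tPubMed_ID\n",
    "9606\t7173\t24333432\n",
    "10090\t22018\t24333432\n",
    "9606\t54314\t24314238\n",
]
print(genes_for_pmids(SAMPLE, [24333432, 24314238]))
```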

Application 2

• Find nucleotide sequences of "Burkholderia cepacia complex" and download in GenBank format

1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=%22burkholderia+cepacia+complex%22[organism]&usehistory=y

2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001_1396244608_457974498&rettype=gb&retmode=text
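When the search matches many sequences, the EFetch step is usually split into batches with &retstart and &retmax. A sketch of generating the batch URLs follows. The function name, the batch size of 500 and the `NCID_EXAMPLE` value are illustrative assumptions, and `count` stands for the `<Count>` value reported by the ESearch step.

```python
from urllib.parse import urlencode

BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def efetch_history_urls(db, web_env, query_key, count,
                        rettype="gb", retmode="text", batch_size=500):
    """Yield one EFetch URL per batch of records stored on the history server."""
    for retstart in range(0, count, batch_size):
        query = {
            "db": db,
            "query_key": query_key,
            "WebEnv": web_env,
            "rettype": rettype,
            "retmode": retmode,
            "retstart": retstart,
            "retmax": batch_size,
        }
        yield BASE_URL + "efetch.fcgi?" + urlencode(query)


# 1200 hypothetical records -> batches starting at 0, 500 and 1000.
for url in efetch_history_urls("nuccore", "NCID_EXAMPLE", 1, count=1200):
    print(url)
```

Each URL would be fetched in turn, with a pause between requests, and the GenBank text appended to one output file.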

Application 3

• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets

[Workflow diagram]

– "cancer copy number" articles: query cancer "copy number" → esearch.fcgi?db=pubmed → WebEnv, query_key → esummary.fcgi?db=pubmed (PubMed title); elink.fcgi?dbfrom=pubmed&db=gds

– "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets: query Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] → esearch.fcgi?db=gds → WebEnv, query_key → esummary.fcgi?db=gds → GPL9704, GPL8226, GPL6804, GPL6801 → esearch.fcgi?db=gds

– Parsing → Common (entries found in both branches) → Result table


Make custom scripts with an XML parser

EBot

• EBot is an interactive web tool that first allows users to construct an arbitrary E-utility analysis pipeline and then generates a Perl script to execute the pipeline. The Perl script can be downloaded and executed on any computer with a Perl installation. For more details, see the EBot page linked below.

– http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi

Entrez Direct

• E-utilities on the UNIX Command Line

• Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/

• Entrez Direct Functions

– esearch performs a new Entrez search using terms in indexed fields.

– elink looks up neighbors (within a database) or links (between databases).

– efilter filters or restricts the results of a previous query.

– efetch downloads records or reports in a designated format.

– xtract converts XML into a table of data values.

– einfo obtains information on indexed fields in an Entrez database.

– epost uploads unique identifiers (UIDs) or sequence accession numbers.

– nquire sends a URL request to a web page or CGI service.

• Entering Query Commands

– esearch -db pubmed -query "opsin gene conversion" | elink -related

Links

• References

– Entrez Programming Utilities Help

• http://www.ncbi.nlm.nih.gov/books/NBK25501/

– Entrez Help

• http://www.ncbi.nlm.nih.gov/books/NBK3836/

• Useful Links

– Entrez Unique Identifiers (UIDs) for selected databases

• http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?report=objectonly

– Valid values of &retmode and &rettype for EFetch (null = empty string)

• http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly

– The full list of Entrez links

• http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html

NCBI databases

• Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site Search

• Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA

• Organisms: Taxonomy

• Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe

• Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar, BioProject, BioSample, Clone

• Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets

• Proteins: Protein, Conserved Domains, Protein Clusters, Structure

• Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay

• Pathways: BioSystems

E-utilities

• Eight server-side programs

– ESearch : Searching a Database

– EPost : Uploading UIDs to Entrez

– ESummary : Downloading Document Summaries

– EFetch : Downloading Full Records

– ELink : Finding Related Data Through Entrez Links

– EInfo : Getting Database Statistics and Search Fields

– EGQuery : Performing a Global Entrez Search

– ESpell : Retrieving Spelling Suggestions

Sample Applications of the E-utilities

• Basic pipelines

– ESearch - ESummary/EFetch

– EPost - ESummary/EFetch

– ELink - ESummary/EFetch

– ESearch - ELink - ESummary/EFetch

– EPost - ELink - ESummary/EFetch

– EPost - ESearch

– ELink - ESearch

Application 3

• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets

1. Put each PubMed record of the saved search result on one line:

tr '\n' '\t' < cancer_copy_number.pubmed_result.txt | sed 's/\t\t/\n/g' | sed 's/^\t[0-9]*: //' | sed 's/\t/ /g' > cancer_copy_number.pubmed_result.oneLine.txt

2. Extract the PubMed IDs:

sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/\.$//' > cancer_copy_number.pubmed_ids.txt

3. Look up the GEO DataSets (gds) IDs linked to each PubMed ID:

for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed "s/^/$id\t/"; done > cancer_copy_number.pubmed_gds_ids.txt

4. Extract the platform accessions from the saved GEO DataSets result:

awk -F'\t' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed 's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt

5. Search GEO DataSets for each platform accession and collect the unique IDs:

for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl ~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt

6. Combine the PubMed records with their linked gds IDs and keep the rows whose gds ID appears in the platform list:

paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl ~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 > cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt