BioThings APIs: Linked High-performance APIs for Biological Entities Chunlei Wu, Ph.D. [email protected] @chunleiwu Associate Professor of Molecular Medicine Dept. of Molecular Experimental Medicine The Scripps Research Institute La Jolla, CA, USA BD2K-AHM 11/2016

Page 1:

BioThings APIs: Linked High-performance APIs for Biological Entities

Chunlei Wu, [email protected]

@chunleiwu

Associate Professor of Molecular Medicine
Dept. of Molecular Experimental Medicine

The Scripps Research Institute
La Jolla, CA, USA

BD2K-AHM 11/2016

Page 2:

BioThings APIs

Objective:

Building unified APIs for “Bio-Things” (biological entities)

Page 3:

Biological knowledge is a complex network

No one-size-fits-all database can capture the entire knowledge space

Page 4:

Simplify the knowledge network as entities

Extracting those central hub nodes as flat lists:

Gene

Variant

Pathway

Metabolite

Disease

∙ ∙ ∙

Page 5:

Gene and Variant annotations represented in JSON documents

{
  "_id": "chr1:g.196659237C>T",
  "cosmic": {
    "chrom": "1",
    "hg19": {
      "start": 196659237,
      "end": 196659237
    },
    "ref": "C",
    "alt": "T",
    "tumor_site": "breast",
    "mut_freq": 0.49,
    "mut_nt": "C>T",
    "cosmic_id": "COSM424915"
  }
}

{
  "_id": "1017",
  "Symbol": "CDK2",
  "Ensembl": "ENSG00000123374",
  "RefSeq": [
    "NM_001798",
    "NM_052827"
  ],
  "Reporter": {
    "U95A": [
      "1792_g_at",
      "1833_at"
    ],
    "U133A": [
      "211804_s_at",
      "2045252_at",
      "211803_at"
    ]
  }
}
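As a rough illustration of how such nested annotation documents can be consumed, the sketch below flattens the variant document into dotted field paths. The `flatten` helper is hypothetical (it is not part of any BioThings client), and the document literal is transcribed from the slide.

```python
import json

# Variant document transcribed from the slide (outer closing brace included).
VARIANT_JSON = """
{"_id": "chr1:g.196659237C>T",
 "cosmic": {"chrom": "1",
            "hg19": {"start": 196659237, "end": 196659237},
            "ref": "C", "alt": "T",
            "tumor_site": "breast", "mut_freq": 0.49,
            "mut_nt": "C>T", "cosmic_id": "COSM424915"}}
"""


def flatten(doc, prefix=""):
    """Hypothetical helper: map nested keys to dotted paths like 'cosmic.hg19.start'."""
    flat = {}
    for key, value in doc.items():
        path = prefix + "." + key if prefix else key
        if isinstance(value, dict):
            # Recurse into sub-documents, extending the dotted path.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


variant = flatten(json.loads(VARIANT_JSON))
print(variant["cosmic.hg19.start"], variant["cosmic.tumor_site"])
```

Dotted paths are only one possible convention for addressing nested fields; they are used here because they keep each annotation source (for example `cosmic`) visible in the field name.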

Page 6:

Keep data always up-to-date

Schematic view of MyVariant.info architecture

Each data source is updated individually. Colors indicate their different updating schedules.
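One way to picture per-source update schedules is a table of intervals plus a check for which sources are due. This is a minimal sketch only: the source names are borrowed from elsewhere in the deck, but the intervals, the `UPDATE_INTERVALS` table, and the `sources_due` function are assumptions for illustration, not the actual MyVariant.info scheduler.

```python
from datetime import date, timedelta

# Hypothetical schedule table: the intervals assigned to each source are
# assumptions for illustration, not the real update schedules.
UPDATE_INTERVALS = {
    "clinvar": timedelta(days=7),
    "dbsnp": timedelta(days=30),
    "dbnsfp": timedelta(days=30),
}


def sources_due(last_updated, today, intervals=UPDATE_INTERVALS):
    """Return the sorted names of sources whose update interval has elapsed."""
    return sorted(
        name
        for name, interval in intervals.items()
        if today - last_updated[name] >= interval
    )


# Illustrative bookkeeping of when each source was last refreshed.
last_updated = {
    "clinvar": date(2016, 11, 1),
    "dbsnp": date(2016, 11, 1),
    "dbnsfp": date(2016, 10, 1),
}
due = sources_due(last_updated, date(2016, 11, 10))
print(due)
```

Because each source carries its own interval, one source can be refreshed without touching the others.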

Page 7:

High-performance web service APIs

Schematic view of MyVariant.info architecture

Page 8:

MyGene.info + MyVariant.info

MyGene.info (Gene)
/v3/gene/<geneid>
/v3/query?q=<query>

MyVariant.info (Variant)
/v1/variant/<hgvsid>
/v1/query?q=<query>

single query on GET, batch query on POST
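The endpoint pattern above can be sketched with the Python standard library. The snippet only builds URLs and request objects and sends nothing. The `https` base URLs, the helper names, and the `ids` form field used for the batch POST body are assumptions not stated on the slide.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Base URLs assumed from the host names and version prefixes on the slide.
MYGENE = "https://mygene.info/v3"
MYVARIANT = "https://myvariant.info/v1"


def annotation_url(base, entity, entity_id):
    """GET URL for a single entity, e.g. /v3/gene/<geneid>."""
    return "%s/%s/%s" % (base, entity, quote(entity_id, safe=":"))


def query_url(base, query):
    """GET URL for a single query, e.g. /v3/query?q=<query>."""
    return "%s/query?%s" % (base, urlencode({"q": query}))


def batch_request(base, entity, ids):
    """POST request for a batch lookup; the 'ids' field name is an assumption."""
    body = urlencode({"ids": ",".join(ids)}).encode("ascii")
    return Request("%s/%s" % (base, entity), data=body, method="POST")


gene_url = annotation_url(MYGENE, "gene", "1017")
variant_url = annotation_url(MYVARIANT, "variant", "chr1:g.196659237C>T")
search_url = query_url(MYGENE, "CDK2")
batch = batch_request(MYGENE, "gene", ["1017", "1018"])
print(gene_url, variant_url, search_url, batch.get_method())
```

Percent-encoding the HGVS identifier keeps characters such as `>` from breaking the URL path.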

Page 9:

We focus on building APIs. Try to …

Page 10:

Make it really easy to use

Just two endpoints

No registration/sign-in

No API key

Page 11:

Developer-friendly

JSONP
CORS
https
msgpack
http compression
http caching
JSON-LD

Python/R clients (also js client for myvariant)

search “mygene” and “myvariant” in PyPI and Bioconductor

Supported!
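To make the "http compression" item concrete, here is a minimal client-side sketch using only the standard library. It builds a request that advertises gzip support and decodes a simulated compressed body, so it runs without network access. The helper names are hypothetical, and the assumption that the server answers with `Content-Encoding: gzip` is not confirmed by the slides.

```python
import gzip
import json
from urllib.request import Request


def compressed_request(url):
    """Build (but do not send) a GET request that advertises gzip support."""
    return Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(raw, content_encoding):
    """Decode a JSON response body, decompressing it first if gzip-encoded."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# Simulated server response (an assumption), standing in for a real reply.
fake_body = gzip.compress(json.dumps({"_id": "1017", "Symbol": "CDK2"}).encode("utf-8"))
doc = decode_body(fake_body, "gzip")
req = compressed_request("https://mygene.info/v3/gene/1017")
print(doc["Symbol"], req.get_header("Accept-encoding"))
```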

Page 12:

Aggregate everything about genes and variants

MyGene.info
Support >17M genes for ~18K species
~200 annotation fields

MyVariant.info
Support >340M variants
~500 annotation fields
from 14 sources: ClinVar, dbNSFP, dbSNP

Page 13:

Keep up-to-date

MyGene.info
Support >17M genes for ~18K species
~200 annotation fields

MyVariant.info
Support >340M variants
~500 annotation fields
from 14 sources: ClinVar, dbNSFP, dbSNP

Weekly Monthly

Page 14:

High-performance and scalable

>95% of queries respond in < 30 ms

Page 15:

High-performance and scalable

Page 16:

High-performance and scalable

Over 100M requests in Nov 2016

Page 17:

High availability

MyVariant.info MyGene.info

99.999% over last year

99.935% over last year

Availability tracked by
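For a sense of scale, the two availability figures can be converted into downtime per year. This is a back-of-the-envelope sketch assuming a 365-day year. The `downtime_minutes` helper is illustrative and is not tied to whichever monitoring service produced the numbers.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a non-leap year


def downtime_minutes(availability_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime implied by an availability percentage over a period."""
    return (1 - availability_percent / 100) * period_minutes


for pct in (99.999, 99.935):
    minutes = downtime_minutes(pct)
    print("%.3f%% -> %.1f minutes (%.2f hours) per year" % (pct, minutes, minutes / 60))
```

Under that assumption, 99.999% corresponds to roughly 5 minutes of downtime per year and 99.935% to roughly 5.7 hours.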

Page 18:

Who is using

Live applications:

MinePath.org

Gene Wiki

JBrowse

Page 19:

Who is using

Many users use them in their daily analysis pipelines, or simply to cache annotations locally

Page 20:

Generalized BioThings SDK

BioThings SDK

MyVariant.info

MyGene.info

JSON data aggregation mechanism

High-performance query engine

Well-designed REST API pattern

JSON-LD enabled Linked Data

Data-updating scheduler

Python/R clients

…

Page 21:

BioThings SDK

A tutorial here (more docs are coming): http://biothingsapi.readthedocs.io/en/latest/

Page 22:

BioThings SDK

g.biothings.io: gene (alias to MyGene.info)

v.biothings.io: variant (alias to MyVariant.info)

s.biothings.io: species/taxonomy

c.biothings.io: drugs/compounds

d.biothings.io: disease

∙ ∙ ∙

Page 23:

JSON-LD brings the linkage between BioThings APIs

Page 24:

Apply JSON-LD context

JSON document

{
  "_id": "chr6:g.26093141G>A",
  "clinvar": {
    "gene": {
      "id": "3077",
      "symbol": "HFE"
    }
  },
  "dbsnp": {
    "rsid": "rs1800562"
  },
  "cadd": {
    "genename": "HFE"
  }
}

N-Quads format output

_:b0 <http://schema.myvariant.info/datasource/clinvar> _:b1 .
_:b0 <http://schema.myvariant.info/datasource/dbsnp> _:b3 .
_:b0 <http://schema.myvariant.info/datasource/cadd> _:b4 .
_:b1 <http://schema.myvariant.info/datanode/gene> _:b2 .
_:b2 <http://identifiers.org/hgnc.symbol> "HFE" .
_:b3 <http://identifiers.org/dbsnp/> "rs1800562" .
_:b4 <http://identifiers.org/hgnc.symbol> "HFE" .

JSON-LD Context

{
  "root": {
    "@context": {
      "clinvar": "http://schema.myvariant.info/datasource/clinvar",
      "dbsnp": "http://schema.myvariant.info/datasource/dbsnp",
      "genename": "http://identifiers.org/hgnc.symbol",
      "cadd": "http://schema.myvariant.info/datasource/cadd",
      "rsid": "http://identifiers.org/dbsnp/",
      "gene": "http://schema.myvariant.info/datanode/gene"
    }
  },
  "clinvar/gene": {
    "@context": {
      "symbol": "http://identifiers.org/hgnc.symbol"
    }
  }
}

N-Quads Transformation
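The mapping from document plus context to statements can be approximated in a few lines. The sketch below is a deliberately simplified stand-in, not the JSON-LD processing algorithm and not the BioThings implementation: it walks the document, looks up each key in the root context merged with any path-specific context, drops unmapped keys, and emits one line per mapped key. The `to_statements` function and its blank-node numbering are assumptions made for illustration.

```python
import json

# Document and context transcribed from the slide (whitespace normalised).
DOC = {
    "_id": "chr6:g.26093141G>A",
    "clinvar": {"gene": {"id": "3077", "symbol": "HFE"}},
    "dbsnp": {"rsid": "rs1800562"},
    "cadd": {"genename": "HFE"},
}
CONTEXT = {
    "root": {"@context": {
        "clinvar": "http://schema.myvariant.info/datasource/clinvar",
        "dbsnp": "http://schema.myvariant.info/datasource/dbsnp",
        "genename": "http://identifiers.org/hgnc.symbol",
        "cadd": "http://schema.myvariant.info/datasource/cadd",
        "rsid": "http://identifiers.org/dbsnp/",
        "gene": "http://schema.myvariant.info/datanode/gene",
    }},
    "clinvar/gene": {"@context": {"symbol": "http://identifiers.org/hgnc.symbol"}},
}


def to_statements(doc, context):
    """Hypothetical helper: emit N-Quads-style lines for keys the context maps."""
    lines = []
    counter = [0]

    def new_node():
        label = "_:b%d" % counter[0]
        counter[0] += 1
        return label

    def walk(obj, subject, path):
        # Root terms apply everywhere; a path-specific context adds to them.
        terms = dict(context["root"]["@context"])
        terms.update(context.get(path, {}).get("@context", {}))
        for key, value in obj.items():
            iri = terms.get(key)
            if iri is None:
                continue  # unmapped keys such as "_id" and "id" are dropped
            if isinstance(value, dict):
                child = new_node()
                lines.append("%s <%s> %s ." % (subject, iri, child))
                walk(value, child, path + "/" + key if path else key)
            else:
                lines.append("%s <%s> %s ." % (subject, iri, json.dumps(str(value))))

    walk(doc, new_node(), "")
    return lines


statements = to_statements(DOC, CONTEXT)
print("\n".join(statements))
```

For this input the sketch yields the same seven statements as the slide, although in depth-first order rather than grouped by subject. A real deployment would more likely rely on a full JSON-LD library than on a hand-rolled walker like this one.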

Page 25:

BioThings TEAM

TSRI:

Chunlei Wu
Andrew Su
Jiwen Xin
Cyrus Afrasiabi
Sebastien Lelong
Ginger Tsueng
Julee Adesara
Mike Mayers

U. Washington:

Sean Mooney
Moritz Juchler
Nikhil Gopal
Sicheng Song

Funding and Support

U01HG008473
U54GM114833