Gene Wiki and Wikimedia Foundation SPARQL workshop

Posted on 15-Jan-2017

CURATING BIOMEDICAL KNOWLEDGE ON WIKIDATA AND WIKIPEDIA

GENE WIKI

Benjamin Good
The Scripps Research Institute, La Jolla, California
bgood@scripps.edu
Twitter: @bgood

ACKNOWLEDGEMENTS

Gene Wikidata Team
Andrew Su (Scripps)
Andra Waagmeester (Micelio)
Sebastian Burgstaller (Scripps)
Tim Putman (Scripps) – speaking next
Julia Turner (Scripps)
Elvira Mitraka (U Maryland)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Ginger Tsueng (Scripps)

“knowledge”

• A lot

• Important

• Text

More than 2 articles published/minute

Documents

Concepts

Gene Wiki: Filtering and summarizing PubMed

GENE WIKI

Protein structure

Symbols and identifiers

Tissue expression pattern

Gene Ontology annotations

Links to structured databases

Gene summary

Protein interactions

Linked references

Huss, PLoS Biol, 2008

Bot!

GENE WIKI TIMELINE

2007: Project starts

2008: ProteinBoxBot populates infoboxes for 9,000 human genes

2009: Updated bot maintaining 9,678 human genes

2011: Now at 10,369 genes; analyses show article growth and high quality

2014: Start importing gene data into Wikidata

2016a: Convert more than 11,000 gene infoboxes on Wikipedia to draw all content from Wikidata

2016b: Launch first biomedically focused web app driven by Wikidata content…

https://en.wikipedia.org/wiki/Portal:Gene_Wiki

Gene Wiki Version 1.

{{GNF_Protein_box | Name = Reelin| image = | image_source = | PDB = {{PDB2|4AD9}} | HGNCid = 18512 | MGIid = | Symbol = LACTB2 | AltSymbols =; CGI-83 | IUPHAR = | ChEMBL = | OMIM = None | ECnumber = | Homologene = 9349 | GeneAtlas_image1 = | GeneAtlas_image2 = | GeneAtlas_image3 = | Protein_domain_image = | Function = {{GNF_GO|id=GO:0005515 |text = protein binding}} {{GNF_GO|id=GO:0016787 |text = hydrolase activity}} {{GNF_GO|id=GO:0046872 |text = metal ion binding}} | Component = {{GNF_GO|id=GO:0005739 |text = mitochondrion}} | Process = {{GNF_GO|id=GO:0008152 |text = metabolic process}} | Hs_EntrezGene = 51110 | Hs_Ensembl = ENSG00000147592 | Hs_RefseqmRNA = NM_016027 | Hs_RefseqProtein = NP_057111 | Hs_GenLoc_db = hg38 | Hs_GenLoc_chr = 8 | Hs_GenLoc_start = 70635318 | Hs_GenLoc_end = 70669174 | Hs_Uniprot = Q53H82 | Mm_EntrezGene = 212442 | Mm_Ensembl = ENSMUSG00000025937 | Mm_RefseqmRNA = NM_145381 | Mm_RefseqProtein = NP_663356 | Mm_GenLoc_db = mm10 | Mm_GenLoc_chr = 1 | Mm_GenLoc_start = 13623330 | Mm_GenLoc_end = 13660546 | Mm_Uniprot = Q99KR3 | path = PBB/51110}}

Gene Wiki Version 2.

{{Infobox gene}}

• All data in Wikidata

• 1 Lua script works for all 11,000+ genes

(1 of these for every gene)

IMPACT OF WIKIDATA ON WIKIPEDIA

IMPACT BEYOND WIKIPEDIA = SPARQL

Sample of current biomedical content

• All human, mouse genes and proteins

• All Gene Ontology terms (describe function)

• All Human Disease Ontology terms

• All FDA approved drugs

• 109+ reference microbial genomes

Burgstaller-Muehlbacher et al (2016) Database

Mitraka et al (2015) Semantic Web Applications for the Life Sciences

Putman et al (2016) Database

http://tinyurl.com/biowiki-sparql

Sample queries that are currently possible:

• “Where in the cell is the Reelin protein expressed?”

• “What diseases are treated by Metformin?”

• “What diseases might be treated by Metformin?”

http://query.wikidata.org
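
As an illustration of the second sample question, here is a minimal Python sketch that builds a SPARQL query string which could be pasted into the query service. It is not the query used in the talk. The function name `treated_by_query` is hypothetical, and the item ID Q19484 is assumed to be the Wikidata item for metformin; P2175 is the Wikidata property "medical condition treated".

```python
def treated_by_query(drug_qid: str) -> str:
    """Build a SPARQL query listing conditions a drug is recorded as treating."""
    # Guard against accidental injection of arbitrary text into the query.
    if not (drug_qid.startswith("Q") and drug_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {drug_qid!r}")
    lines = [
        "SELECT ?disease ?diseaseLabel WHERE {",
        f"  wd:{drug_qid} wdt:P2175 ?disease .  # P2175: medical condition treated",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }',
        "}",
    ]
    return "\n".join(lines)


# Assumption: Q19484 is the Wikidata item for metformin.
print(treated_by_query("Q19484"))
```

The `wd:` and `wdt:` prefixes are predefined on the Wikidata query service, so the generated text needs no PREFIX declarations there.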

Example question: repurposing Metformin

http://tinyurl.com/zem3oxz

Query path: Metformin → interacts with → protein → encoded by → gene → genetic association → ?disease (might treat?)

Example result: Metformin → Solute carrier family 22 member 3 → SLC22A3 → prostate cancer
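
The drug-to-protein-to-gene-to-disease path can be sketched as a three-hop SPARQL pattern. The Python snippet below only assembles the query text. It is an approximation under stated assumptions, not a copy of the query behind the tinyurl link. The function name `repurposing_query` is hypothetical, Q19484 is assumed to be the metformin item, and the properties used are P129 (physically interacts with), P702 (encoded by) and P2293 (genetic association).

```python
def repurposing_query(drug_qid: str) -> str:
    """Build a SPARQL query following drug -> protein -> gene -> disease."""
    if not (drug_qid.startswith("Q") and drug_qid[1:].isdigit()):
        raise ValueError(f"not a Wikidata item ID: {drug_qid!r}")
    lines = [
        "SELECT ?gene_productLabel ?geneLabel ?diseaseLabel WHERE {",
        f"  wd:{drug_qid} wdt:P129 ?gene_product .  # P129: physically interacts with",
        "  ?gene_product wdt:P702 ?gene .  # P702: encoded by",
        "  ?gene wdt:P2293 ?disease .  # P2293: genetic association",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }',
        "}",
    ]
    return "\n".join(lines)


# Assumption: Q19484 is the Wikidata item for metformin.
print(repurposing_query("Q19484"))
```

Each returned row is a candidate hypothesis rather than evidence of efficacy; whether the gene-to-disease link is stored on the gene item, as assumed here, depends on how the data is modeled at query time.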

A SPARQL powered user interface for consuming and editing organism data in Wikidata

Timothy E. Putman Ph.D.
The Scripps Research Institute, La Jolla, California
tputman@scripps.edu
Twitter: @putmantime

ACKNOWLEDGEMENTS

Gene Wikidata Team
Andrew Su (Scripps)
Benjamin Good – just spoke
Andra Waagmeester (Micelio)
Sebastian Burgstaller (Scripps)
Elvira Mitraka (U Maryland)
Julia Turner (Scripps)
Justin Leong (UBC)
Lynn Schriml (U Maryland)
Paul Pavlidis (UBC)
Ginger Tsueng (Scripps)

Centralizing and Linking the Data

Bacteria (Q10876): domain

C. trachomatis (Q131065): species

C. trachomatis 434/BU (Q20800254): strain

trpA (Q21153861): gene

TRPA (Q21153984): protein

SPARQL Query

• On page load: jQuery execution of SPARQL query as AJAX GET request

• On organism select: get all gene and protein data for organism by taxid
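
The talk describes the request being issued from the browser with jQuery. As a language-neutral illustration of the same GET request, the Python sketch below builds a per-organism query and the URL that would be fetched. It is a sketch under assumptions, not the application's actual code. The function names are hypothetical, and the query shape is a guess at the data model: P685 (NCBI taxonomy ID), P703 (found in taxon), P351 (Entrez Gene ID, used here to restrict matches to gene items) and P688 (encodes).

```python
from urllib.parse import urlencode

# Public Wikidata SPARQL endpoint; accepts the query as a GET parameter.
ENDPOINT = "https://query.wikidata.org/sparql"


def genes_by_taxid_query(taxid: str) -> str:
    """Build a query for genes, and the proteins they encode, in one taxon."""
    if not taxid.isdigit():
        raise ValueError(f"NCBI taxonomy IDs are numeric, got {taxid!r}")
    lines = [
        "SELECT ?gene ?geneLabel ?protein ?proteinLabel WHERE {",
        f'  ?taxon wdt:P685 "{taxid}" .  # P685: NCBI taxonomy ID',
        "  ?gene wdt:P703 ?taxon ;  # P703: found in taxon",
        "        wdt:P351 ?entrez .  # P351: Entrez Gene ID (assumed gene marker)",
        "  OPTIONAL { ?gene wdt:P688 ?protein . }  # P688: encodes",
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }',
        "}",
    ]
    return "\n".join(lines)


def request_url(query: str) -> str:
    """Return the GET URL for a query, asking for JSON results."""
    return f"{ENDPOINT}?{urlencode({'query': query, 'format': 'json'})}"


# The taxid below is a placeholder value chosen for illustration only.
print(request_url(genes_by_taxid_query("12345")))
```

Keeping query construction separate from the HTTP call means the same query text can be tested offline or pasted into the query service directly.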

QUESTIONS?