Building a Biomedical Knowledge Garden

Building a Biomedical Knowledge Garden. Benjamin Good, Su Laboratory, Group Meeting, Dec. 2, 2016

Transcript of Building a Biomedical Knowledge Garden

Page 1: Building a Biomedical Knowledge Garden

Building a Biomedical Knowledge Garden

Benjamin Good, Su Laboratory, Group Meeting, Dec. 2, 2016

Page 2: Building a Biomedical Knowledge Garden

Unstructured data: PubMed, Clinical Trials, etc.

NLP tools: SemRep, DeepDive, Implicitome, etc.

Knowledge Graph: SemmedDB, Literome, etc.

Applications: Semantic MEDLINE, BioGraph, etc.

Microtasks: Mark2Cure, AMT

Structured data: Gene Ontology, etc.

http://tinyurl.com/jbmn8mz

The Knowledge Garden idea, circa Jan. 2015.

Page 3: Building a Biomedical Knowledge Garden

The devil is in the details…

Unstructured data: PubMed, Clinical Trials, etc.

NLP tools: SemRep, DeepDive, Implicitome, etc.

Knowledge Graph: SemmedDB, Literome, etc.

Application: Semantic MEDLINE, BioGraph, etc.

Microtasks: Mark2Cure, AMT

Structured data: Gene Ontology, etc.

Page 4: Building a Biomedical Knowledge Garden

Reality, November 2016

Knowledge Graph: SemmedDB

Application: knowledge.bio

Microtasks: Mark2Cure, AMT

Page 5: Building a Biomedical Knowledge Garden

knowledge.bio: Explore all biomedical knowledge as a graph, with edges connected back to supporting references

v2.5 demo

Page 6: Building a Biomedical Knowledge Garden

knowledge.bio – Data challenges
• V1 – V2.5
  • All content from SemmedDB or Implicitome
  • Custom schema to support these
• V3 key requirement: allow import of content from many other sources: Gene Ontology, DeepDive output, user-generated…

Page 7: Building a Biomedical Knowledge Garden

This part is important… Not nailing it down makes everything else harder

Knowledge Garden content managed as:
• CSV files
• JSON documents
• MySQL databases
• Postgres databases
• Neo4j databases

None of which had any coherent plan or structure

Page 8: Building a Biomedical Knowledge Garden

Requirements for a knowledge graph

• Syntax:
  • How to refer to nodes and edges
  • Identifiers
  • Schema (structure of graph)

• Semantics:
  • What things mean
  • How you decide on the ‘?’ in: node1 ‘?’ node2
    • Are they the same (to you)?
    • If not, what is the edge?

Mind the Gap…

(one node in the “Amino Acid” namespace, the other in the “Biologically Active Substance” namespace)

Page 9: Building a Biomedical Knowledge Garden

Options at kb3 scale (millions of concepts and relations)

• The Unified Medical Language System (UMLS)
• The Semantic Web
• Wikidata ?

Page 10: Building a Biomedical Knowledge Garden

The UMLS (CUIs, Atoms, Types)

[Diagram: atoms from source vocabularies map to a CUI; CUIs are assigned semantic types. Source: https://uts.nlm.nih.gov]

CUI C0026106 is equivalent to the atoms:
• HP:0001256 Mild mental retardation, Mild and nonprogressive mental retardation
• SNOMEDCT_US:86765009 Moron (mental age 8-12 years)
• MEDCIN:35101 Mild intellectual disabilities
• OMIM:MTHU035844 Intellectual disability, mild

CUI C0233630 is equivalent to the atom:
• SNOMEDCT_US:32386009 Logical Thinking

Types shown: Mental or Behavioral Dysfunction, Disease or Syndrome, Behavior, Activity, Event, linked by ‘isa’ and ‘affects’ edges

Types organized into a “Semantic Network”: ~133 types, 54 predicates, 13 high-level ‘groups’

Page 11: Building a Biomedical Knowledge Garden

The UMLS in 2016
• 3,200,922 CUIs
• 211 source vocabularies (e.g. MeSH, SNOMED, RxNorm, etc.)
• 12,287,973 total terms (“Atoms”)

• Every edge in the system is a manual product of NLM
  • every Atom->CUI
  • every CUI->Type
  • every Type->Type

Page 12: Building a Biomedical Knowledge Garden

The Semantic Web
• Concepts uniquely identified by resolvable URIs
• Meaning (e.g. equivalency) encoded in OWL axioms
• Concepts and mappings created and maintained by anyone who can host them
• No other structure
• No governance

Page 13: Building a Biomedical Knowledge Garden

UMLS versus Semantic Web
• UMLS
  • PROs: covers a large portion of the biomedical concept space, manually curated, we are already using it by default, the semantic types are handy
  • CONs: does not exist on the semantic web (no stable URI to associate with a CUI), license is obscure and apparently limiting, weak representation of the molecular biology domain, no control over its extension (e.g. no Human Disease Ontology)

• Semantic Web
  • PROs: universal, open, infrastructure is the Web itself
  • CONs: need for organization, curation, mapping

Page 14: Building a Biomedical Knowledge Garden

Not thrilled with my options

https://commons.wikimedia.org/wiki/File:A_frustrated_and_depressed_man_holds_his_head_in_his_hand.jpg

Page 15: Building a Biomedical Knowledge Garden

Meanwhile...
• human, mouse, rat, yeast, macaque, 120+ microbes: genes and proteins
• Gene Ontology terms
• Human Disease Ontology terms
• 120,000+ chemicals
• Cancer genome variants
• Other people adding and using data!!!

Page 16: Building a Biomedical Knowledge Garden

Maybe ?

Page 17: Building a Biomedical Knowledge Garden

Wikidata (QIDs, ids, Types)

[Diagram: external ids map to a QID; QIDs are typed and linked through a poly-ontology. Example item page: https://www.wikidata.org/wiki/Q412194]

QID Q183560, external ids:
• HP:0001256 Mild mental retardation, Mild and nonprogressive mental retardation
• SNOMEDCT_US:86765009 Moron (mental age 8-12 years)
• MEDCIN:35101 Mild intellectual disabilities
• OMIM:MTHU035844 Intellectual disability, mild

QID Q412194, external id:
• PubChem: 2477 buspirone

Poly-Ontology nodes shown: Specific Developmental Disorder, developmental disorder of mental health, mental disorder, disorder, Drug, Chemical, linked by ‘subclass of’, ‘subclass of (DO)’, ‘isa’ and ‘treated by’ edges

Page 18: Building a Biomedical Knowledge Garden

ACTIVE! Knowledge Flow for Wikidata

Unstructured data: The Internet

NLP tools: StrepHit

Knowledge Graph / Applications: Wikipedia, Wikigenomes, Wikidata.org

Microtasks: Wikidata game, MixnMatch

Structured data: Gene Ontology, etc.

Page 19: Building a Biomedical Knowledge Garden

Wikidata is a Functioning and Flourishing Knowledge Garden

Page 20: Building a Biomedical Knowledge Garden

Wikidata
• ~27,000,000 concepts identified by Qids like ‘Q183560’
• ~1350 source vocabularies (e.g. MeSH, RxNorm, IMDb, etc.)
  • (Based on properties tagged with type ‘ExternalId’)
• ? total terms integrated = labels + aliases (a lot)
• Mappings to Qids are the product of the unwashed masses
• Constantly updated

Page 21: Building a Biomedical Knowledge Garden

What concept scheme do we use?
• Wikidata
  • PROs: universal, open, infrastructure, active community, largely curated content
  • CONs: limited biomedical content so far

Page 22: Building a Biomedical Knowledge Garden

Challenge: Relevant Scientific Applications

NLP tools: SemRep, Literome, Implicitome, PubTator, DeepDive, Snorkel, ContentMine, TEES, …

Knowledge Graph

Applications: Wikigenomes, HetioNet, Knowledge.Bio, …

Structured data: Gene Expression, etc. …

A. Advancing science is the goal and this is how we can help

B. We need experts to help refine and build the knowledge graph and apps are the bait

Page 23: Building a Biomedical Knowledge Garden

On the plane, Oct. 11, 2016…

“Screw it, let’s go all in”

I got really excited…

https://www.flickr.com/photos/alexnormand/5992512756 https://www.flickr.com/photos/k6lcs/15374887957

Page 24: Building a Biomedical Knowledge Garden

knowledge.bio 3.0
• All nodes to be concepts from Wikidata
• All predicates to be properties from Wikidata
• All edges to be linked to references that could be ‘stated in’ Wikidata
• Edges (‘claims’) can come from any source

Now:
• We have one consistent format for data import
• We have a consistent pattern for gathering more data about a concept
• We have access to 27 million concepts and growing (and we can add more)
• We have the beginnings of a new tool for expert-sourcing curation of Wikidata content
• Our code is getting simpler and cleaner

Page 25: Building a Biomedical Knowledge Garden

KB3.0 – next step seeding content

• You are now basically up to date…
• Rest of talk is about mapping content from SemmedDB to the new structure
• 3.0 release will allow users to add new nodes and edges
• If you want data in there:
  1. Map it to Wikidata items and properties
  2. Make a tab-delimited file (Qid Pid Qid referenceUrl sentence)
  3. Load it (or ask me to)

• Users needed!
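
A minimal sketch of what reading such a tab-delimited import file could look like, assuming exactly the five columns named on the slide (Qid, Pid, Qid, referenceUrl, sentence). The `Claim` class, the `parse_claims` function and the sample row are hypothetical illustrations, not the actual kb3.0 loader; the identifiers and URL in the sample row are placeholders, not a vetted claim.

```python
import csv
import io
import re
from dataclasses import dataclass

# Wikidata item and property identifiers: "Q" or "P" followed by digits.
QID = re.compile(r"^Q\d+$")
PID = re.compile(r"^P\d+$")


@dataclass(frozen=True)
class Claim:
    """One edge: subject item, property, object item, plus its supporting reference."""
    subject_qid: str
    property_pid: str
    object_qid: str
    reference_url: str
    sentence: str


def parse_claims(text: str) -> list[Claim]:
    """Parse tab-delimited rows into Claim objects, rejecting malformed rows."""
    claims = []
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 5:
            raise ValueError(f"line {line_no}: expected 5 columns, got {len(row)}")
        subject, prop, obj, url, sentence = row
        if not (QID.match(subject) and PID.match(prop) and QID.match(obj)):
            raise ValueError(f"line {line_no}: malformed Qid or Pid")
        claims.append(Claim(subject, prop, obj, url, sentence))
    return claims


# Placeholder row for illustration only (hypothetical reference URL and sentence).
sample = "Q183560\tP2176\tQ412194\thttps://example.org/ref\tAn example supporting sentence.\n"
claims = parse_claims(sample)
print(claims[0])
```

Validating the identifier shapes up front keeps rows that were never mapped to Wikidata (for example, raw CUIs) from slipping into the graph.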

Page 26: Building a Biomedical Knowledge Garden

How many concepts in the UMLS are now items in Wikidata?

[Venn diagram: Wikidata (27,000,000 items) and the UMLS (3,000,000 concepts); size of the overlap: ?]

Page 27: Building a Biomedical Knowledge Garden

Direct identifier mapping

Page 28: Building a Biomedical Knowledge Garden

Direct identifier mapping (15 shared ontologies)

CUI -> Qid

UMLS_vocab Concepts     Wikidata_property                                Prop id Usage
NCBI       1014837      NCBI Taxonomy ID                                 P685    379589
MSH        359116       MeSH ID                                          P486    5979
ICD10PCS   178278       ICD-10-PCS                                       P1690   5
NCI        119620       NCI Thesaurus ID                                 P1748   5562
ICD10CM    98899        ICD-10                                           P494    8826
OMIM       86181        OMIM ID                                          P492    5835
FMA        82042        Foundational Model of Anatomy ID                 P1402   3378
GO         60412        Gene Ontology ID                                 P686    43693
MDR        51961        Medical Dictionary for Regulatory Activities ID  P3201   1
HGNC       39261        HGNC gene symbol                                 P353    63691
HGNC       Sometimes... HGNC-ID                                          P354    39758
NDFRT      38206        NDF-RT ID                                        P2115   1509
ICD9CM     20993        ICD-9-CM                                         P1692   88
ICD10      11552        ICD-10                                           P494    8826
RXNORM     205998       RxNorm CUI                                       P3345   5671

Example: C0001629 (Adrenal Medulla) -> local MySQL query -> FMA: 15633 -> build SPARQL (?qid wdt:P1402 “15633”) -> query.wikidata.org -> Q934888
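
The "build SPARQL" step in the example above could be sketched as follows. This is an illustration under assumptions, not the code used for the talk: the function names are hypothetical, the query is only constructed and never sent, and the sketch relies on the Wikidata Query Service predefining the `wdt:` prefix and accepting `query` and `format` URL parameters.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint behind query.wikidata.org.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_lookup_query(property_id: str, external_id: str) -> str:
    """Build a query for items whose external-id property equals the given value."""
    # Escape backslashes and double quotes so the value is a valid SPARQL string literal.
    escaped = external_id.replace("\\", "\\\\").replace('"', '\\"')
    return f'SELECT ?qid WHERE {{ ?qid wdt:{property_id} "{escaped}" . }}'


def build_request_url(query: str) -> str:
    """Build a GET URL that asks the endpoint for JSON results."""
    return WIKIDATA_SPARQL_ENDPOINT + "?" + urlencode({"query": query, "format": "json"})


# FMA identifier 15633 looked up through the FMA ID property (P1402), as on the slide.
query = build_lookup_query("P1402", "15633")
url = build_request_url(query)
print(query)
print(url)
```

Sending `url` with any HTTP client should return the matching items; according to the slide, the expected match for this identifier is Q934888.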

Page 29: Building a Biomedical Knowledge Garden

Strict identifier mapping

CUI -> Qid

UMLS_vocab Concepts     Wikidata_property                                Prop id Usage
NCBI       1014837      NCBI Taxonomy ID                                 P685    379589
MSH        359116       MeSH ID                                          P486    5979
ICD10PCS   178278       ICD-10-PCS                                       P1690   5
NCI        119620       NCI Thesaurus ID                                 P1748   5562
ICD10CM    98899        ICD-10                                           P494    8826
OMIM       86181        OMIM ID                                          P492    5835
FMA        82042        Foundational Model of Anatomy ID                 P1402   3378
GO         60412        Gene Ontology ID                                 P686    43693
MDR        51961        Medical Dictionary for Regulatory Activities ID  P3201   1
HGNC       39261        HGNC gene symbol                                 P353    63691
HGNC       Sometimes... HGNC-ID                                          P354    39758
NDFRT      38206        NDF-RT ID                                        P2115   1509
ICD9CM     20993        ICD-9-CM                                         P1692   88
ICD10      11552        ICD-10                                           P494    8826->8292
RXNORM     205998       RxNorm CUI                                       P3345   0->5671

(‘->’ marks usage counts that changed thanks to Sebastian’s recent work.)

Page 30: Building a Biomedical Knowledge Garden

How many concepts in the UMLS are now items in Wikidata? (according to identifiers)

[Venn diagram: Wikidata (27,000,000 items) and the UMLS (3,000,000 concepts); overlap: 463,059 concepts, 15% of the UMLS]
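
As a quick check of the 15% figure, the arithmetic can be reproduced in a few lines. The helper name is hypothetical; the inputs are the counts quoted in the talk (463,059 mapped concepts, the rounded 3,000,000 UMLS total from this slide, and the 3,200,922 CUIs from the "UMLS in 2016" slide).

```python
def coverage_percent(mapped: int, total: int) -> float:
    """Share of `total` concepts that have a mapping, as a percentage."""
    return 100.0 * mapped / total


mapped_cuis = 463_059
rounded_umls_total = 3_000_000   # figure shown in the Venn diagram
exact_umls_total = 3_200_922     # CUI count quoted for the UMLS in 2016

print(f"{coverage_percent(mapped_cuis, rounded_umls_total):.1f}%")
print(f"{coverage_percent(mapped_cuis, exact_umls_total):.1f}%")
```

With the rounded total the coverage is about 15.4%; with the exact CUI count it is about 14.5%, so "15%" holds either way.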

Page 31: Building a Biomedical Knowledge Garden

[Chart: the 463,059 mapped Wikidata items, by UMLS source id]

Page 32: Building a Biomedical Knowledge Garden

Coverage of shared identifiers by item

[Bar chart: UMLS CUIs versus Wikidata items for each shared identifier; axis cut off, since NCBI Taxonomy has > 1 million]

Good targets for Wikidata bots

Page 33: Building a Biomedical Knowledge Garden

463,059 mapped concepts, by semantic group

[Bar chart, log scale from 1 to 1,000,000: number of 1-to-1 mappings (“N 1 to 1”) for each semantic group: Occupations, Genes & Molecular Sequences, Disorders, Procedures, Activities & Behaviors, Anatomy, Devices, Phenomena, Chemicals & Drugs, Organizations, Objects, Physiology, Concepts & Ideas, Living Beings, Geographic Areas. Callouts: NCBI Taxons, Gene Ontology, Genes, Diseases, Drugs]

Page 34: Building a Biomedical Knowledge Garden

Where are the Gaps?

[Bar chart, axis from 0 to 800,000: number of concepts with no mapping (“N no Map”) for each semantic group: Occupations, Genes & Molecular Sequences, Disorders, Procedures, Activities & Behaviors, Anatomy, Devices, Phenomena, Chemicals & Drugs, Organizations, Objects, Physiology, Concepts & Ideas, Living Beings]

• 600,000 missing drugs
• 550,000 missing disorders

Page 35: Building a Biomedical Knowledge Garden

Where are(n’t) the Gaps?

[Bar chart: percent_mapped, axis from 0 to 0.6]

Page 36: Building a Biomedical Knowledge Garden

Label matching…

Page 37: Building a Biomedical Knowledge Garden

Adding label matching actually doesn’t help that much…
• Checked only 460,080 concepts (including all 288,552 from SemmedDB)
  • 21% (96,843) had an identifier match
  • 6.9% (31,645) had a match on the UMLS Preferred Label
  • 3.1% (14,319) matched one of the UMLS synonyms

• Removing anything that matched more than 1 Wikidata item, we get 129,726 concepts
• Limiting to concepts used in SemmedDB, we get 113,623
  • (43% coverage, with most matches coming from identifiers)
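
The staged matching summarized above can be sketched roughly as follows. This is a toy illustration under assumptions: the stage order (identifier, then preferred label, then synonyms), the lowercase label normalization, the index structures and every entry in the toy data except the C0001629 / FMA 15633 / Q934888 example are hypothetical, and the Qids `Q900001` and `Q900002` are invented placeholders.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so labels compare loosely."""
    return " ".join(label.lower().split())


def match_concept(concept, id_index, label_index):
    """Return (stage, qid); qid is None when the first stage with hits is ambiguous."""
    stages = [
        ("identifier", [id_index.get(key, set()) for key in concept["ids"]]),
        ("preferred_label", [label_index.get(normalize(concept["preferred"]), set())]),
        ("synonym", [label_index.get(normalize(s), set()) for s in concept["synonyms"]]),
    ]
    for stage, hits in stages:
        candidates = set().union(*hits)
        if candidates:
            if len(candidates) == 1:
                return stage, next(iter(candidates))
            return stage, None  # matched more than one Wikidata item: discard
    return None, None  # nothing matched at any stage


# Toy indexes: (property id, external id) -> Qids, and normalized label -> Qids.
id_index = {("P1402", "15633"): {"Q934888"}}
label_index = {
    "adrenal medulla": {"Q934888"},
    "buspirone": {"Q412194"},
    "cold": {"Q900001", "Q900002"},  # invented Qids standing in for an ambiguous label
}

# Toy concepts; all keys except C0001629 are made-up stand-ins for CUIs.
concepts = {
    "C0001629": {"ids": [("P1402", "15633")], "preferred": "Adrenal Medulla", "synonyms": []},
    "EXAMPLE_1": {"ids": [], "preferred": "Buspirone", "synonyms": []},
    "EXAMPLE_2": {"ids": [], "preferred": "Cold", "synonyms": []},
    "EXAMPLE_3": {"ids": [], "preferred": "Unknown term", "synonyms": ["BUSPIRONE"]},
    "EXAMPLE_4": {"ids": [], "preferred": "Unknown term", "synonyms": []},
}

results = {cui: match_concept(c, id_index, label_index) for cui, c in concepts.items()}
for cui, (stage, qid) in results.items():
    print(cui, stage, qid)
```

Trying identifiers first reflects the observation on the slide that most matches come from identifiers, with labels and synonyms adding comparatively little.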

Page 38: Building a Biomedical Knowledge Garden

SemmedDB as Wikidata, version 1
• 15,957,582 predications with 13 relation types
• All concepts are Wikidata items
• All relation types are Wikidata properties
• (Data available at http://tinyurl.com/cui2qid-1 )
• Will be accessible in kb3.0 next week or the following

Page 39: Building a Biomedical Knowledge Garden

Next steps / project opportunities
• More Wikidata bots!
• Establish a more consistent typing strategy in Wikidata (e.g. make each item an instance of some semantic group)
• Finish the mapping of the UMLS predicates to Wikidata properties
  • Add missing properties (e.g. ‘Activates’, ‘Inhibits’)
  • Use the existing subproperty property to build a property ontology inside Wikidata
• Populate kb3.0 with knowledge pertinent to your disease area
• Extend the user interface
• Use the underlying neo4j database to extend HetioNet and related (or add HetioNet to it)

Page 40: Building a Biomedical Knowledge Garden

Pick an edge or node and create or improve it

Unstructured data: PubMed, Clinical Trials, etc.

NLP tools: SemRep, DeepDive, Implicitome, etc.

Knowledge Graph: SemmedDB, Literome, etc.

Applications: Semantic MEDLINE, BioGraph, etc.

Microtasks: Mark2Cure, AMT

Structured data: Gene Ontology, etc.

Page 41: Building a Biomedical Knowledge Garden

Thanks!
• Richard Bruskiewich! and the Star Informatics team for persevering… (v1, v2.1...5, v3.0)
• Gene Wiki team! Especially bot developers Sebastian B., Andra W., Tim P., Greg S., who planted the seeds that are making this possible.
• Su laboratory!
• I hope you can find something useful here and help grow the garden…
• Especially you HetNetters!

https://www.flickr.com/photos/alexnormand/5992512756

Page 42: Building a Biomedical Knowledge Garden