Post on 18-Aug-2015
Gene Wiki and Mark2Cure update for BD2K
Benjamin Good, Ph.D. @bgood
bgood@scripps.edu
April 17, 2015
The challenge: make biomedical knowledge organized, accessible, and computable
[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]
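Our strategy taps into the Long Tail
[Figure: long-tail curve of Content produced against Contributors (sorted), divided into a Short Head and a Long Tail]
• News: Newspapers (Short Head), Blogs (Long Tail)
• Video: TV/Hollywood (Short Head), YouTube (Long Tail)
• Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
• Food reviews: Food critics (Short Head), Yelp (Long Tail)
• Talent judging: Olympics (Short Head), American Idol (Long Tail)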
Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
From crowdsourcing to structured data
The Gene Wiki
Mark2Cure
Filtering, extracting, and summarizing PubMed
[Diagram: Documents, Concepts, Review article]
Wikis depend on a positive feedback loop
[Diagram: feedback loop linking Gene Wiki page utility, Number of users, and Number of contributors]
10,000 gene “stubs” within Wikipedia
Protein structure
Symbols and identifiers
Tissue expression pattern
Gene Ontology annotations
Links to structured databases
Gene summary
Protein interactions
Linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
Good, NAR, 2011
[Figure: Editors and Edits over time; axes: Editor count, Edit count]
A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
Collaborating with the journal Gene for recruiting
• Authors write standard review article for Gene
• Also required to create or update Gene Wiki article
• 10 complete, 20 more in process
Su, Good and van Wijnen (2013)
Gene Wiki as a tool
• Mechanism for collaboration amongst teams working on gene annotations
• Don’t roll your own wiki if you can do the same job on Wikipedia!
Making the Gene Wiki more computable
[Diagram: Free text, Text-mining, Structured annotations, Analyses, Databases]
Good, BMC Genomics, 2011
http://fiehnlab.ucdavis.edu/projects/rice_metabolome/
Wikidata
“Provide a database of the world’s knowledge that anyone can edit”
- Denny Vrandečić
Centralizing key data storage
Source: http://commons.wikimedia.org/wiki/File:Wikidata_slides_Magnus_Manske,_Cambridge,_2014-02-27.pdf
287 language editions of Wikipedia
Bioinformatics community
Loading biological data into Wikidata
Entrez Gene
Ensembl
UniProt
UCSC
PDB
RefSeq
Wikidata for biology
Reelin (Q414043): http://www.wikidata.org/wiki/Q414043
• is a (Property:P31): Protein (Q8054), Glycoprotein (Q187126)
• regulates (Property:P128): Neural development (Q1345738)
• interacts with (Property:P129): VLDL receptor (Q1979313), Amyloid precursor protein (Q423510)
http://wikidata.org/w/api.php?action=wbgetentities&ids=Q414043&languages=en
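The API call above can be sketched in Python. This is a minimal illustration, not the project's own loader: the helper names `build_entity_url` and `claim_targets` are hypothetical, the `https://www.wikidata.org` host and the `format=json` parameter are assumptions added to the slide's URL, and `SAMPLE_RESPONSE` is a hand-written fragment shaped like a `wbgetentities` reply rather than live data.

```python
from urllib.parse import urlencode

# Assumption: the public endpoint; the slide's URL uses http://wikidata.org.
API_ENDPOINT = "https://www.wikidata.org/w/api.php"


def build_entity_url(entity_id, language="en"):
    """Build a wbgetentities request URL for one item (hypothetical helper)."""
    params = {
        "action": "wbgetentities",
        "ids": entity_id,
        "languages": language,
        "format": "json",  # assumption: ask for JSON explicitly
    }
    return API_ENDPOINT + "?" + urlencode(params)


def claim_targets(response, entity_id, property_id):
    """Return the item IDs that entity_id points to through property_id."""
    claims = response["entities"][entity_id].get("claims", {})
    targets = []
    for statement in claims.get(property_id, []):
        snak = statement.get("mainsnak", {})
        if snak.get("snaktype") != "value":
            continue  # skip "no value" / "unknown value" statements
        value = snak["datavalue"]["value"]
        if isinstance(value, dict) and "id" in value:
            targets.append(value["id"])
    return targets


# Hand-written fragment shaped like a wbgetentities reply (not live data).
SAMPLE_RESPONSE = {
    "entities": {
        "Q414043": {
            "claims": {
                "P31": [
                    {"mainsnak": {"snaktype": "value", "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q8054"}}}},
                    {"mainsnak": {"snaktype": "value", "datavalue": {
                        "type": "wikibase-entityid",
                        "value": {"entity-type": "item", "id": "Q187126"}}}},
                ]
            }
        }
    }
}

if __name__ == "__main__":
    print(build_entity_url("Q414043"))
    print(claim_targets(SAMPLE_RESPONSE, "Q414043", "P31"))
```

Fetching the URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `claim_targets` would list the "is a" targets of Reelin as item IDs.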
Current progress
• All human and mouse genes and proteins loaded
• All diseases (Human Disease Ontology) loaded
• Dataset of FDA-approved drugs in preparation
• Datasets for gene-disease, drug-disease, and drug-protein relationships in preparation
Gene Wiki(Data) as a tool
• Mechanism for collaboration amongst teams working on biomedical data
• Don’t roll your own open public database if you can do the same job on Wikidata!
The Long Tail of scientists is a valuable source of information on gene function
From crowdsourcing to structured data
The Gene Wiki
Mark2Cure
The biomedical literature is growing fast…
[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000]
… but it is very hard to query and compute
Imatinib
Crizotinib
Erlotinib
Gefitinib
Sorafenib
Lapatinib
Dasatinib
…
Acute myeloid leukemia
Acute lymphoblastic leukemia
Chronic myelogenous leukemia
Chronic lymphocytic leukemia
Hodgkin lymphoma
Non-Hodgkin lymphoma
Myeloma
…
AND
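One way to read the two lists above is as a single boolean literature query: any of the drugs AND any of the diseases. The sketch below assembles such a query string. It assumes PubMed-style quoting with OR/AND operators, the function names are hypothetical, and the lists stay truncated exactly where the slide elides them with "…".

```python
def or_group(terms):
    """Join terms into a parenthesized OR group with each term quoted."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_query(drugs, diseases):
    """Combine two OR groups with AND (hypothetical helper)."""
    return or_group(drugs) + " AND " + or_group(diseases)


# Terms listed on the slide; both lists are longer in reality.
DRUGS = ["Imatinib", "Crizotinib", "Erlotinib", "Gefitinib",
         "Sorafenib", "Lapatinib", "Dasatinib"]
DISEASES = ["Acute myeloid leukemia", "Acute lymphoblastic leukemia",
            "Chronic myelogenous leukemia", "Chronic lymphocytic leukemia",
            "Hodgkin lymphoma", "Non-Hodgkin lymphoma", "Myeloma"]

query = build_query(DRUGS, DISEASES)
# Number of distinct drug-disease pairs the query stands for.
pair_count = len(DRUGS) * len(DISEASES)

if __name__ == "__main__":
    print(query)
    print(pair_count, "drug-disease pairs")
```

Even this short pair of lists covers 49 drug-disease combinations, and a keyword match says nothing about how the drug and the disease are related in each article, which is the gap the following slides address.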
Extracting semantic networks from PubMed with the crowd’s help
Documents
Network of linked concepts
Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
Disease mentions in PubMed abstracts
NCBI Disease corpus
• 793 PubMed abstracts
• (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.
Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
The Mechanical Turk
http://en.wikipedia.org/wiki/The_Turk
Amazon Mechanical Turk (AMT)
Roles: Requester, Amazon, Workers
Requester: for each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
Amazon manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising
Steps:
1. Create tasks
2. Execute
3. Aggregate
Instructions to workers
• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans.
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms - physical results of having a disease
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
Qualification test
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”
Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
26 yes / no questions
Simple annotation interface
Click to see instructions
Highlight disease mentions
Experimental design
• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
– $0.06 per Human Intelligence Task (HIT)
– HIT = annotate one abstract from PubMed
– multiple workers annotate each abstract
Aggregation function based on simple voting
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
[Figure: the same abstract with highlights kept at vote thresholds K=1 (1 or more votes), K=2, K=3, and K=4]
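Voting with a threshold K can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: votes are counted over exact character spans, each worker contributes at most one vote per span, and the three workers and their highlights are invented for the example.

```python
from collections import Counter

TEXT = ("ex vivo acute myeloid leukemia (AML) and chronic lymphocytic "
        "leukemia (CLL) patient tumor samples")


def span(phrase, text=TEXT):
    """Return the (start, end) character offsets of phrase in text."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def aggregate_by_votes(worker_spans, k):
    """Keep every span highlighted by at least k distinct workers."""
    votes = Counter()
    for spans in worker_spans:
        votes.update(set(spans))  # one vote per worker per span
    return {s for s, count in votes.items() if count >= k}


# Invented highlights from three hypothetical workers.
WORKERS = [
    [span("acute myeloid leukemia"), span("AML"),
     span("chronic lymphocytic leukemia")],
    [span("acute myeloid leukemia"), span("chronic lymphocytic leukemia"),
     span("CLL")],
    [span("acute myeloid leukemia"), span("tumor")],
]

if __name__ == "__main__":
    for k in (1, 2, 3):
        kept = sorted(aggregate_by_votes(WORKERS, k))
        print(k, [TEXT[a:b] for a, b in kept])
```

Raising K trades recall for precision: K=1 keeps every highlight, including one worker's stray "tumor", while K=3 keeps only the span all three workers agreed on.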
Comparison to gold standard
F = 0.87, k = 6
• 593 documents
• 15 users / doc
• 9 days
• $630.96
[Figure: Precision and Recall]
Good, PSB, 2015
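For readers unfamiliar with the metric, the F value is the harmonic mean of precision and recall. The sketch below computes all three for a set of predicted spans against a gold-standard set. It assumes exact-match scoring, which may differ from the scoring used in the cited study, and the spans are toy values that do not reproduce the reported 0.87.

```python
def precision_recall_f(predicted, gold):
    """Return (precision, recall, F) for two sets under exact matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy (start, end) spans, invented for illustration.
GOLD = {(0, 10), (15, 30), (40, 55), (60, 70)}
PREDICTED = {(0, 10), (15, 30), (80, 90)}

if __name__ == "__main__":
    p, r, f = precision_recall_f(PREDICTED, GOLD)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```

In the toy data two of three predictions are correct and two of four gold spans are found, so precision exceeds recall and F falls between them.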
In aggregate, our worker ensemble is faster and cheaper than a single expert annotator, and just as accurate, for disease concept recognition.
Information Extraction
1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
Annotating the relationships
This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.
subject
predicate
object
GENE
DISEASE
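A relationship annotation of this kind can be held as a subject, predicate, object triple. The sketch below is one possible representation, not the project's data model: the class names are hypothetical, and the predicate label `therapeutic_target_for` is an invented placeholder for whatever relation an annotator would select.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    text: str           # the mention as it appears in the abstract
    semantic_type: str  # e.g. "GENE" or "DISEASE"


@dataclass(frozen=True)
class Relation:
    subject: Concept
    predicate: str
    obj: Concept  # named obj to avoid shadowing Python's built-in object

    def as_triple(self):
        """Return the relation as a plain (subject, predicate, object) tuple."""
        return (self.subject.text, self.predicate, self.obj.text)


# Concepts taken from the example abstract; the predicate is a placeholder.
cdk9 = Concept("CDK9", "GENE")
malignancies = Concept("hematologic malignancies", "DISEASE")
relation = Relation(cdk9, "therapeutic_target_for", malignancies)

if __name__ == "__main__":
    print(relation.as_triple())
```

Storing the semantic type on each concept lets a downstream check reject triples whose subject and object types do not fit the predicate.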
Does Mechanical Turk scale?
1,000,000 articles per year
10 annotators / article
4 tasks / doc
$0.06 / task
$ 2,400,000 / year
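The yearly figure follows from multiplying the four numbers on the slide. The short check below redoes that arithmetic in integer cents to avoid floating-point rounding; the variable names are illustrative only.

```python
ARTICLES_PER_YEAR = 1_000_000
ANNOTATORS_PER_ARTICLE = 10
TASKS_PER_DOC = 4
CENTS_PER_TASK = 6  # $0.06 per task

# Total paid tasks per year, then total cost in cents.
tasks_per_year = ARTICLES_PER_YEAR * ANNOTATORS_PER_ARTICLE * TASKS_PER_DOC
total_cents = tasks_per_year * CENTS_PER_TASK

if __name__ == "__main__":
    print(f"{tasks_per_year:,} tasks per year")
    print(f"${total_cents / 100:,.2f} per year")
```

That is 40 million paid tasks per year, or $2,400,000 at $0.06 each, before any platform fees.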
http://mark2cure.org
Key stats
• Launched Jan 19, 2015
• Stopped Feb 16, 2015
• In 4 weeks
– 10,275 documents annotated
– (589 docs, 15+ annotators per doc)
– 212 unique users
– Reproduced AMT results
– Paid zero dollars
Current work in progress
• Expanding to identify genes, drugs, and diseases.
• Targeting a new volunteer campaign about May 1.
• Ongoing experiments with relationship identification/verification.
Mark2Cure as a tool?
• Seeking specific use cases for information extraction, and text-mining collaborators interested in exploring the interplay with the crowd.
Funding and Support
BioGPS: GM83924
Gene Wiki: GM089820
BD2K COE: GM114833
Max Nanis
Ginger Tsueng
Chunlei Wu
Andrew Su
Andra Waagmeester
Elvira Mitraka
Lynn Schriml
Sebastian Burgstaller
Gang Fu
Evan Bolton
Paul Pavlidis
Peter Robinson
Many WikiDatans
John Huss
Erik Clarke
Many Wikipedians
The Prince of Crowdsourcing