Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
[email protected]
http://sulab.org
October 30, 2012
Few genes are well annotated…
[Figure: annotation counts per gene, with genes sorted by decreasing counts; series: Gene Ontology, PubMed; marked values: 38%, 59%]
Genes labeled: TP53, TNF, APOE, MTHFR, IL6, HLA-DRB1, VEGFA, EGFR, TGFB1, ACE
Data: NCBI gene2pubmed, August 2010; 23,278 protein-coding genes
… because the literature is sparsely curated?
[Figure: number of PubMed-indexed articles per year, 1979 to 2009 (y-axis: 0 to 1,000,000), compared with the average capacity of a human scientist (number of articles read by a typical scientist)]
311,696 articles (1.5% of PubMed) have been cited by GO annotations
Sooner or later, the research community will
need to be involved in the annotation effort to scale
up to the rate of data generation.
The Long Tail is a prolific source of content
[Figure: content produced vs. contributors (sorted), divided into Short Head and Long Tail]
Category: Short Head / Long Tail
News: Newspapers / Blogs
Video: TV/Hollywood / YouTube
Product reviews: Consumer Reports / Amazon reviews
Food reviews: Food critics / Yelp
Talent judging: Olympics / American Idol
Wikipedia is reasonably accurate
Wikipedia has breadth and depth
[Figure: Wikipedia vs. Britannica Online, compared by articles, words (millions), and words per article]
Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
We can harness the Long Tail of scientists to directly participate in
the gene annotation process.
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Filtering, extracting, and summarizing PubMed
[Figure labels: Documents, Concepts, Review article]
Wiki success depends on a positive feedback loop
[Figure: feedback loop linking Gene Wiki page utility, number of users, and number of contributors]
10,000 gene “stubs” within Wikipedia
• Protein structure
• Symbols and identifiers
• Tissue expression pattern
• Gene Ontology annotations
• Links to structured databases
• Gene summary
• Protein interactions
• Linked references
Huss, PLoS Biol, 2008
Gene Wiki has a critical mass of readers
Total: 5.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
Gene Wiki has a critical mass of editors
• Increase of ~10,000 words / month from >1,000 edits
• Currently 1.42 million words
• Approximately equal to 230 full-length articles
[Figure: editor count and edit count]
Good, NAR, 2011
A review article for every gene is powerful
• References to the literature
• Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
The Gene Wiki is (reasonably) reliable
[Figure: cumulative edits by date, good edits vs. vandalism]
                        Good edits    Vandalism
Per edit probability    98.9%         1.1%
Average lifetime        115.4 d       3.4 d
Probability by time     99.968%       0.032% (0.63% for WP overall)
Good, NAR, 2011
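The "probability by time" figures appear to follow from the other two rows: weight each per-edit probability by how long that kind of edit survives, then normalize. The slide does not state the formula, so the sketch below is an assumption about how the numbers combine, and the function name is illustrative.

```python
def time_weighted_share(p_good, life_good_days, p_bad, life_bad_days):
    """Share of total 'edit-days' contributed by good edits and by vandalism.

    Assumption (not stated on the slide): the chance that a reader sees a
    given kind of edit is proportional to (per-edit probability) x (lifetime).
    """
    good = p_good * life_good_days
    bad = p_bad * life_bad_days
    total = good + bad
    return good / total, bad / total


# Inputs as printed on the slide (Good, NAR, 2011)
good_share, vandal_share = time_weighted_share(0.989, 115.4, 0.011, 3.4)
print(f"good: {good_share:.3%}, vandalism: {vandal_share:.3%}")
```

With the rounded inputs shown on the slide, the result lands within rounding distance of the reported 99.968% and 0.032%.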
Making the Gene Wiki more reliable
http://www.wikitrust.net/
The company name is derived from old Greek, and means "destroyer of birds".
Novartis is a multinational pharmaceutical company based in Basel, Switzerland that manufactures drugs such as clozapine (Clozaril), diclofenac (Voltaren), …
High-trust author: 36211 total edits
Low-trust author: 36 total edits
Making the Gene Wiki more computable
[Figure labels: Structured annotations, Free text]
Filling the gaps in gene annotation
Wikilink
GO exact match
Gene Wiki mapping
NCBI Entrez Gene: 334
Candidate assertion
GO:0006897
6319 novel GO annotations
2147 novel DO annotations
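The mapping on this slide (a wikilink on a gene's page that exactly matches a GO term name becomes a candidate gene-to-term assertion) can be sketched as below. This is a minimal illustration, not the authors' pipeline: the lookup tables are tiny stand-ins, the second wikilink is invented for the example, and the function name and `known` parameter are hypothetical.

```python
# Illustrative lookup tables. A real run would load the full GO vocabulary
# and the wikilinks found on every Gene Wiki page.
GO_TERMS = {"endocytosis": "GO:0006897"}  # GO term name -> GO identifier
GENE_WIKI_PAGES = {
    # NCBI Entrez Gene id -> wikilinks on that gene's page (hypothetical list)
    334: ["Endocytosis", "Amyloid precursor protein"],
}


def candidate_annotations(pages, go_terms, known=frozenset()):
    """Yield (gene_id, go_id) for wikilinks that exactly match a GO term name.

    Pairs already present in `known` are skipped, so only novel candidate
    assertions are reported.
    """
    for gene_id, links in pages.items():
        for link in links:
            go_id = go_terms.get(link.strip().lower())
            if go_id is not None and (gene_id, go_id) not in known:
                yield gene_id, go_id


candidates = list(candidate_annotations(GENE_WIKI_PAGES, GO_TERMS))
print(candidates)
```

Exact matching on lower-cased names is the simplest possible rule; the same loop would work for Disease Ontology terms given a DO name table.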
TOP 100 GENES
Gene Wiki content improves enrichment analysis
Workflow: GO term → gene list; PubMed abstracts → concept recognition; then enrichment analysis
Example: axon guidance (GO:0007411), 264 genes, 811 articles
Linked genes through PubMed: P = 1.55 E-20
                          In GO term: Yes    In GO term: No
Linked via PubMed: Yes    13                 2
Linked via PubMed: No     251                12033
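The P value on this slide can be approximated from the 2×2 table with a one-sided hypergeometric (Fisher-style) test. The slide does not name the test, so that choice is an assumption; so is the reading of the table, in which the first column holds genes annotated with the GO term (13 + 251 = 264, matching the gene count above) and the first row holds genes linked to the term through PubMed. The function name is illustrative.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn without replacement from N genes,
    K of which carry the annotation of interest."""
    denom = comb(N, n)
    numer = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return numer / denom


# Counts from the slide's 2x2 table
both, linked_only, term_only, neither = 13, 2, 251, 12033
N = both + linked_only + term_only + neither  # all genes considered
K = both + term_only                          # genes annotated with the GO term
n = both + linked_only                        # genes linked through PubMed

p_value = hypergeom_upper_tail(both, K, n, N)
print(f"P = {p_value:.3g}")
```

Exact integer binomials keep the tail sum free of floating-point underflow; only the final division produces a float.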
Gene Wiki content improves enrichment analysis
Workflow: GO term → gene list; PubMed abstracts + Gene Wiki → concept recognition; then enrichment analysis
Example: muscle contraction (GO:0006936), 87 genes
Linked genes through PubMed: P = 1.0
Linked genes through PubMed + Gene Wiki: P = 1.22 E-09
251 articles; 87 articles
Gene Wiki content improves enrichment analysis
[Figure: p-value (PubMed only) vs. p-value (PubMed + GW), with muscle contraction highlighted; regions marked "More significant PubMed + GW" and "More significant PubMed only"]
Gene Wiki+: Crowdsourced semantic database
Q: What genes are related to hemolytic anemia?
The Long Tail of scientists is a valuable source of
information on gene function
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Gene databases are numerous and overlapping
… and hundreds more …
http://biogps.org
Community extensibility and user customizability
Utility: A simple and universal plugin interface
Total of 389 gene-centric online databases registered as BioGPS plugins
Users: BioGPS has critical mass
• > 5000 registered users
• 13,500 unique visitors per month
• 155,000 page views per week
Top 10 organizations
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: daily pageviews]
Contributors: Explicit and implicit knowledge
389 plugins registered (65% publicly shared)
by over 75 users
spanning 150+ domains
Mining structured content from HTML
Defining a data extraction template
…
TP53 TNF APOE IL6 VEGF … EGFR TGFB1
The Long Tail of bioinformaticians can collaboratively build a gene portal.
From crowdsourcing to structured data
The Gene Wiki
Biological Games
Seven million human hours
http://www.flickr.com/photos/archana3k1/4124330493/
Twenty million human hours
http://www.flickr.com/photos/ableman/2171326385/
150 billion human hours per year
http://www.flickr.com/photos/rvp-cw/6243289302/
Using games to fold proteins
Fold.it players have successfully:
• Outperformed state-of-the-art protein folding algorithms (Cooper, Nature, 2010)
• Solved a previously intractable crystal structure (Khatib, Nat Struct Mol Biol, 2011)
• Designed an improved protein folding algorithm (Khatib, PNAS, 2011)
• Improved enzyme activity of a de novo designed enzyme (Eiben, Nat Biotechnol, 2011)
No good gene-disease annotation database
Query: Apolipoprotein E
• Alzheimer's disease (AD)
• Lipoprotein glomerulopathy
• Sea-blue histiocyte disease
• Hyperlipoproteinemia, type III
• Macular degeneration, age-related
• Myocardial infarction susceptibility
• HIV
• Psoriasis
• Vascular Diseases
Query: Apolipoprotein E
Alzheimer's disease (AD); Neuropsychological Tests; Cognition Disorders; Dementia; Cognition; Disease Progression; Cardiovascular Diseases; Coronary Disease; Diabetes Mellitus, Type 2; Memory Disorders; Memory; Coronary Artery Disease; Hypertension; Mental Status Schedule; Psychiatric Status Rating Scales; Hyperlipidemias; Atrophy; Dementia, Vascular; Parkinson Disease; Brain Injuries; Myocardial Infarction; …
477 diseases!
Play Dizeez to annotate gene-disease links
1. Read the clue (gene)
2. Click the related disease (only one is “right”)
3. If it’s ‘right’, you get points
4. Then on to the next question…
5. Hurry!
6. Play to win!
Dizeez players seem pretty smart…
In total (since Dec 2011):
• 230 unique gamers
• 1045 games played
• 8525 guesses
# Occurrences Gene Disease
11 NBPF3 neuroblastoma
11 SOX8 mental retardation
9 ABL1 leukemia
9 SSX1 synovial sarcoma
8 APC colorectal cancer
8 FES sarcoma
8 RBP3 retinoblastoma
8 GAST gastrinoma
8 DCC colorectal cancer
8 MAP3K5 cancer
Gene Wiki OMIM PharmGKB PubMed
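An occurrence table like the one above can be produced by counting how often players pick each gene-disease pair and keeping the pairs that recur. The sketch below assumes guesses are logged as (gene, disease) tuples; the log shown is made up for illustration and is not Dizeez data, and the function name and threshold are hypothetical.

```python
from collections import Counter


def consensus_links(guesses, min_count=2):
    """Count repeated (gene, disease) guesses and keep the frequent ones.

    Returns (count, gene, disease) tuples, most frequent first.
    """
    counts = Counter(guesses)
    return [
        (n, gene, disease)
        for (gene, disease), n in counts.most_common()
        if n >= min_count
    ]


# Hypothetical guess log, not real game data
guesses = [
    ("ABL1", "leukemia"),
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("ABL1", "leukemia"),
    ("APC", "colorectal cancer"),
    ("FES", "leukemia"),
]
top = consensus_links(guesses)
for n, gene, disease in top:
    print(n, gene, disease)
```

Pairs that pass the threshold could then be checked against existing sources to separate known links from candidate new ones.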
Using games to predict phenotype from genotype?
http://genegames.org
Classification problems in genome biology
[Figure: data matrix of 100s of samples × 100,000s of features, labeled cancer / normal]
• Find patterns: SVM, neural networks, Naïve Bayes, KNN, …
• Classify new samples: cancer / normal
Random forests
[Figure: data matrix of 100s of samples × 100,000s of features, labeled cancer / normal]
• Sample subset of cases and features
• Train decision tree
• Classify new samples: cancer / normal
How to interject biological knowledge?
Network-guided forests
Dutkowski & Ideker (2011). PLoS Computational Biology
• Sample features by PPI network
• Train decision tree
[Figure: data matrix of 100s of samples × 100,000s of features, labeled cancer / normal]
Human-guided forests
• Sample features by human intelligence
• Train decision tree
[Figure: data matrix of 100s of samples × 100,000s of features, labeled cancer / normal]
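The sampling idea on these forest slides can be sketched in a few lines: each tree is trained on a bootstrap sample of cases and a small set of features, and the only thing that changes between variants is how features are drawn. This is a toy illustration under stated assumptions, not the method used in the game. Each "tree" is a single-split stump, the `weights` dictionary stands in for player choices, and all gene names and values are hypothetical.

```python
import random
from collections import Counter


def majority(labels, default):
    """Most common label, or `default` when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def fit_stump(rows, labels, feature):
    """One-split decision tree: threshold a single feature at its mean."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    overall = majority(labels, None)
    above = majority([l for r, l in zip(rows, labels) if r[feature] > threshold], overall)
    below = majority([l for r, l in zip(rows, labels) if r[feature] <= threshold], overall)
    return feature, threshold, above, below


def predict_stump(stump, row):
    feature, threshold, above, below = stump
    return above if row[feature] > threshold else below


def train_forest(rows, labels, weights, n_trees=25, n_features=2, seed=0):
    """Train stumps on bootstrapped cases, drawing features by weight."""
    rng = random.Random(seed)
    names = list(weights)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # bootstrap the cases
        s_rows = [rows[i] for i in idx]
        s_labels = [labels[i] for i in idx]
        drawn = rng.choices(names, weights=[weights[f] for f in names], k=n_features)

        def score(f):
            stump = fit_stump(s_rows, s_labels, f)
            hits = sum(predict_stump(stump, r) == l for r, l in zip(s_rows, s_labels))
            return hits, weights[f]  # ties go to the feature with more weight

        best = max(sorted(set(drawn)), key=score)
        forest.append(fit_stump(s_rows, s_labels, best))
    return forest


def classify(forest, row):
    """Majority vote over all stumps."""
    return Counter(predict_stump(s, row) for s in forest).most_common(1)[0][0]


# Toy expression data: G1 separates the classes, G2 does not (made-up values)
rows = [{"G1": 5.0, "G2": 1.0}, {"G1": 4.8, "G2": 3.0},
        {"G1": 1.0, "G2": 1.2}, {"G1": 0.8, "G2": 2.9}]
labels = ["cancer", "cancer", "normal", "normal"]
player_weights = {"G1": 9, "G2": 1}  # hypothetical: how often players chose each gene

forest = train_forest(rows, labels, player_weights)
print(classify(forest, {"G1": 4.9, "G2": 2.0}), classify(forest, {"G1": 0.9, "G2": 2.0}))
```

Uniform weights give an ordinary random forest; weights derived from a protein-protein interaction network would correspond to the network-guided variant.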
The Cure: Genomic predictors for disease
Human-guided forests
• Classify new samples: cancer / normal
“Critical Assessment”-style challenge
Preliminary results
• 214 registered players
– 50% declared knowledge of cancer biology
– 40% self-identified as having Ph.D.
• Prediction results
– 69% correct on survival concordance index
– Best scoring model was 72%
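For readers unfamiliar with the survival concordance index, the sketch below shows one common definition (Harrell's C): among patient pairs whose survival order is known, the fraction in which the patient predicted to be at higher risk is the one who died first. The slide does not say which variant the challenge used, so this is an assumption, and the patient data are invented.

```python
def concordance_index(times, events, risk_scores):
    """Harrell-style concordance index for right-censored survival data.

    times: follow-up time per patient
    events: 1 if death was observed, 0 if the patient was censored
    risk_scores: higher means predicted to die sooner
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when patient i died before patient j's time
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical patients: the third one is censored at time 6
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(times, events, risk)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives context for the 69% and 72% figures above.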
The Long Tail of gamers can collaboratively build an accurate disease classifier.
Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Many Wikipedia editors
WP:MCB Project
Group members
Ben Good
Salvatore Loguercio
Ian Macleod
Max Nanis
Chunlei Wu
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820)
Contact
http://sulab.org
[email protected]
@andrewsu
+Andrew Su
Recruiting graduate students in quantitative biology! See http://education.scripps.edu/