Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Microtask crowdsourcing for annotating diseases in PubMed abstracts. Andrew Su, Ph.D. (@andrewsu), [email protected], http://sulab.org. October 20, 2014, ASHG. Slides: slideshare.net/andrewsu

description

Presentation on "Microtask crowdsourcing for annotating diseases in PubMed abstracts" at ASHG14 session on "Cloudy with a chance of big data".

Transcript of Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Page 1: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Microtask crowdsourcing for annotating diseases in PubMed abstracts

Andrew Su, Ph.D.
@andrewsu
[email protected]
http://sulab.org

October 20, 2014
ASHG

Slides: slideshare.net/andrewsu

Page 2: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Potential conflicts of interest

• Novartis

• Assay Depot

• Avera Health


Page 3: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

[Figure: Genome-scale profiling (RNA-seq, exome seq, whole genome seq, proteomics, genotyping, copy-number analysis, ChIP-seq, methylation, functional genomics) of Condition A and Condition B, leading to candidate genes/proteins.]

Page 4: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

[Figure: Candidate genes/proteins linked to related diseases, related drugs, and related pathways.]

Page 5: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Databases are fragmented and incomplete

[Figure: Overlap of disease links for Apolipoprotein E across four databases: KEGG (4), OMIM (6), PharmGKB (10), HuGE Navigator (517).]

Page 6: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

6

Page 7: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]

Page 8: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

8

Page 9: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Harnessing the crowd…

Image: http://www.flickr.com/photos/portland_mike/6140660504/

Page 10: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

… to organize information

Image: http://www.flickr.com/photos/45697441@N00/6629580443

Page 11: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Information extraction for a Network of BioThings

1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts

[Figure: Network linking Genes/proteins, Diseases, Drugs, and Pathways.]
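The three steps above can be pictured with a toy pipeline. This is only a sketch under strong assumptions: the lexicon, the `EX:` term IDs, the example sentence, and the sentence-level co-occurrence rule are all hypothetical placeholders, not the ontologies or methods used in the talk.

```python
import re
from itertools import combinations
from typing import Dict, List, NamedTuple, Set, Tuple


class Mention(NamedTuple):
    text: str
    start: int
    concept_type: str
    term_id: str


# Hypothetical lexicon: surface form -> (concept type, placeholder term ID).
LEXICON: Dict[str, Tuple[str, str]] = {
    "acute myeloid leukemia": ("Disease", "EX:D001"),
    "AML": ("Disease", "EX:D001"),
    "CDK9": ("Gene/protein", "EX:G001"),
}

SENTENCE = "Inhibition of CDK9 is efficacious in acute myeloid leukemia (AML)."


def find_and_map(sentence: str) -> List[Mention]:
    """Steps 1 and 2: find lexicon entries and attach their term IDs."""
    mentions = []
    for surface, (ctype, term_id) in LEXICON.items():
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, sentence):
            mentions.append(Mention(match.group(), match.start(), ctype, term_id))
    return sorted(mentions, key=lambda m: m.start)


def cooccurrence_relations(mentions: List[Mention]) -> Set[Tuple[str, ...]]:
    """Step 3 (crude): pair terms of different types found in one sentence."""
    pairs = set()
    for a, b in combinations(mentions, 2):
        if a.concept_type != b.concept_type:
            pairs.add(tuple(sorted((a.term_id, b.term_id))))
    return pairs


if __name__ == "__main__":
    found = find_and_map(SENTENCE)
    for mention in found:
        print(mention)
    print(cooccurrence_relations(found))
```

Dictionary lookup and co-occurrence are stand-ins chosen for brevity; the talk itself addresses step 1 with human annotators rather than a lexicon.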

Page 12: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

The NCBI Disease corpus

• 793 PubMed abstracts
• 12 expert annotators (2 annotate each abstract)
• 6,900 “disease” mentions

Doğan, Rezarta, and Zhiyong Lu. Proceedings of the 2012 Workshop on Biomedical Natural Language Processing. Association for Computational Linguistics.

Page 13: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?

Page 14: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Experimental design

Task: Identify the disease mentions in the PubMed abstracts from the NCBI disease corpus

– 5 non-scientists annotate each abstract
– The details:
  • Recruit workers using Amazon Mechanical Turk
  • Pay $0.066 per Human Intelligence Task (HIT)
  • HIT = annotate one abstract from PubMed
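The design parameters above imply a simple per-abstract cost. The sketch below only does that arithmetic; the function names are illustrative, the 100-abstract batch is an arbitrary example, and whether the quoted $0.066 already includes platform fees is not stated on the slide.

```python
from decimal import Decimal

PAY_PER_HIT = Decimal("0.066")   # dollars per HIT, from the slide
WORKERS_PER_ABSTRACT = 5         # non-scientists per abstract, from the slide


def cost_per_abstract(workers: int = WORKERS_PER_ABSTRACT,
                      pay_per_hit: Decimal = PAY_PER_HIT) -> Decimal:
    """One HIT is one worker annotating one abstract."""
    return pay_per_hit * workers


def crowd_cost(n_abstracts: int,
               workers: int = WORKERS_PER_ABSTRACT,
               pay_per_hit: Decimal = PAY_PER_HIT) -> Decimal:
    """Total payment for annotating n_abstracts at the given redundancy."""
    return cost_per_abstract(workers, pay_per_hit) * n_abstracts


if __name__ == "__main__":
    print("Per abstract:", cost_per_abstract())
    print("Per 100 abstracts (arbitrary batch size):", crowd_cost(100))
```

`Decimal` is used so that the small per-HIT amount does not pick up floating-point rounding noise when multiplied.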

Page 15: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Instructions to workers

• Highlight all diseases and disease abbreviations
  – “...are associated with Huntington disease ( HD )... HD patients received...”
  – “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
  – “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans
  – “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms (physical results of having a disease)
  – “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”

Page 16: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Aggregation function based on simple voting

Example text:

“This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”

[Figure: Four panels showing the same text with the annotations retained at each vote threshold: 1 or more votes (K=1), K=2, K=3, K=4.]
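A minimal sketch of threshold voting as described on this slide: an annotation is kept if at least K workers produced it. Counting votes only for exactly matching character spans is an assumption made here for simplicity, and the five workers' highlights below are invented for illustration, not data from the study.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def span_of(text: str, phrase: str) -> Span:
    """Locate the first occurrence of phrase in text (illustrative helper)."""
    start = text.index(phrase)
    return (start, start + len(phrase))


def aggregate_by_votes(worker_spans: Iterable[Set[Span]], k: int) -> Set[Span]:
    """Keep each span highlighted by at least k workers (exact-match voting)."""
    votes = Counter(span for spans in worker_spans for span in spans)
    return {span for span, count in votes.items() if count >= k}


TEXT = ("ex vivo acute myeloid leukemia (AML) and chronic lymphocytic "
        "leukemia (CLL) patient tumor samples")

# Hypothetical highlights from five workers; not data from the study.
WORKER_PHRASES = [
    ["acute myeloid leukemia", "AML", "chronic lymphocytic leukemia", "CLL"],
    ["acute myeloid leukemia", "chronic lymphocytic leukemia", "CLL"],
    ["myeloid leukemia", "AML", "chronic lymphocytic leukemia"],
    ["acute myeloid leukemia", "tumor"],
    ["acute myeloid leukemia", "AML", "chronic lymphocytic leukemia", "CLL"],
]
WORKERS = [{span_of(TEXT, p) for p in phrases} for phrases in WORKER_PHRASES]


def kept_phrases(k: int) -> List[str]:
    """Return the surviving highlights for threshold k, in reading order."""
    return [TEXT[s:e] for s, e in sorted(aggregate_by_votes(WORKERS, k))]


if __name__ == "__main__":
    for k in range(1, 6):
        print(f"K={k}: {kept_phrases(k)}")
```

With this invented input, raising K trades recall for precision: K=1 keeps every highlight, including the stray "tumor" and the partial "myeloid leukemia", while K=4 keeps only the two full disease names.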

Page 17: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Comparison to gold standard

[Figure: Precision vs. recall plot; F score = 0.81.]
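For reference, precision, recall, and F score can be computed from a set of predicted annotations and a gold-standard set as sketched below. Two assumptions are made here that the slide does not state: a prediction counts as correct only if its span matches a gold span exactly, and "F score" means the balanced F1 (harmonic mean of precision and recall). The spans in the example are arbitrary.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def precision_recall_f(predicted: Set[Span],
                       gold: Set[Span]) -> Tuple[float, float, float]:
    """Score predicted spans against gold spans using exact matching."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Arbitrary example: 3 of 5 predictions are correct; 3 of 4 gold spans found.
    gold = {(0, 5), (10, 15), (20, 25), (30, 35)}
    predicted = {(0, 5), (10, 15), (20, 25), (40, 45), (50, 55)}
    p, r, f = precision_recall_f(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} F={f:.2f}")
```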

Page 18: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Comparison to gold standard

[Figure: Plot with x-axis ticks from 0 to 18 and y-axis ticks from 0 to 0.9; points labeled by vote threshold k (values from 1 to 8) for N = 3, 6, 9, 12, 15, 18. Max F = 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]

Page 22: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Comparison to gold standard

[Figure: Plot with x-axis ticks from 0 to 18 and y-axis ticks from 0 to 0.9; points labeled by vote threshold k (values from 1 to 8) for N = 3, 6, 9, 12, 15, 18. Max F = 0.69, 0.79, 0.82, 0.85, 0.85, 0.85.]

F = 0.76 – score of single Ph.D. annotator

F = 0.87 – agreement between multiple Ph.D. annotators

Page 23: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Crowd-based biocuration

• 7 days
• 17 workers
• $192.90

Professional biocuration

• Many months
• 12 experts
• $150,000+

In aggregate, our worker ensemble is faster, cheaper, and as accurate as a single expert annotator for disease concept recognition.

Page 24: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Information extraction for a Network of BioThings

1. Find mentions of high level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts

[Figure: Network linking Genes/proteins, Diseases, Drugs, and Pathways.]

Page 25: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Vision-based Citizen Science

• Galaxy Zoo (galaxy classification; 110M+ classifications, 300k+ volunteers)
• Foldit (protein folding; 350k+ players)
• Eterna (RNA folding; 80k players)
• Eyewire (3D neuron structure determination; 130k volunteers)
• Phylo (multiple sequence alignment; 30k+ players, 285k alignments)
• …

Page 26: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Language-based Citizen Science

http://mark2cure.org

Page 27: Microtask crowdsourcing for annotating diseases in PubMed abstracts (ASHG 2014)

Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)

The Su Lab
• Chunlei Wu
• Ben Good
• Salvatore Loguercio
• Max Nanis
• Louis Gioia
• Ramya Gamini
• Greg Stupp
• Ginger Tsueng
• Erick Scott
• Vyshakh Babji
• Karthik Gangavarapu
• Adam Mark

Key Alumni
• Katie Fisch
• Tobias Meissner

Key Collaborators
• Andra Waagmeester
• Lynn Schriml
• Peter Robinson

Contact
• http://sulab.org
• [email protected]
• @andrewsu
• +Andrew Su

We are recruiting programmers, postdocs, and awesome people of all kinds! bit.ly/SuLabJobs

We are hosting a hackathon Nov 7-9 for the Network of BioThings: bit.ly/hackNoB