The Gene Wiki, from a BioRDF-naïve perspective

The Gene Wiki, from a BioRDF-naïve perspective W3C / HCLSIG BioRDF Subgroup November 17, 2008

Transcript of The Gene Wiki, from a BioRDF-naïve perspective

Page 1: The Gene Wiki, from a BioRDF-naïve perspective

The Gene Wiki, from a BioRDF-naïve perspective

W3C / HCLSIG BioRDF Subgroup

November 17, 2008

Page 2: The Gene Wiki, from a BioRDF-naïve perspective

How do we efficiently annotate the function of the ~25,000 genes in the mammalian genome?

Goal: “Genome-wide functional genomics”

Patterns of gene annotation

$P(k) \sim k^{-a}$

[Figure: four log-log panels plotting log(# genes) against log(# references). Label recovered from the figure: Entrez Gene. Fits reported on three panels: a = -1.32, R squared = 0.963; a = -0.6, R squared = 0.894; a = -0.4, R squared = 0.562. No fit is reported for the fourth panel.]

44% of genes in Entrez Gene have zero linked references. Over 75% have five or fewer linked references.
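The fits quoted on this slide can be reproduced in outline with an ordinary least-squares line through the log-log points. The sketch below is a minimal illustration, not the analysis behind the slide: the function names and the synthetic counts are assumptions made for the example, and no real Entrez Gene data is used. The fitted slope corresponds to -a in P(k) ~ k^(-a).

```python
import math


def fit_power_law(ks, counts):
    """Least-squares fit of log10(count) = intercept + slope * log10(k).

    Returns (slope, intercept, r_squared). Points with k <= 0 or
    count <= 0 are skipped because their logarithm is undefined.
    """
    pts = [(math.log10(k), math.log10(c))
           for k, c in zip(ks, counts) if k > 0 and c > 0]
    n = len(pts)
    if n < 2:
        raise ValueError("need at least two positive points")
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    syy = sum((y - mean_y) ** 2 for _, y in pts)
    if sxx == 0:
        raise ValueError("all k values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear fit, R squared is the squared correlation.
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def share_with_at_most(ref_counts, limit):
    """Fraction of genes whose reference count is <= limit."""
    return sum(1 for r in ref_counts if r <= limit) / len(ref_counts)


# Hypothetical, synthetic data: an exact power law with exponent 1.3.
example_ks = [1, 2, 5, 10, 20, 50, 100]
example_counts = [10000 * k ** -1.3 for k in example_ks]
slope, intercept, r_squared = fit_power_law(example_ks, example_counts)
print(f"slope = {slope:.2f}, R squared = {r_squared:.3f}")
```

`share_with_at_most` is the kind of calculation behind statements such as "44% of genes have zero linked references": given one reference count per gene, call it with `limit=0` or `limit=5`.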

Page 3: The Gene Wiki, from a BioRDF-naïve perspective

The Long Tail of Knowledge

• Traditional media revolves around the Short Head – a small number of publishers putting out lots of content

• “Web 2.0” media revolves around community-generated content – a huge population of individuals each generating a (relatively) small amount of content

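[Figure: content plotted against users, with the Short Head on one side and the Long Tail on the other]

The Short Head: Newspapers, TV/Hollywood, Consumer Reports, Olympics, Encyclopedia Britannica

The Long Tail: Blogs, YouTube, Amazon reviews, American Idol, Wikipedia

“Community intelligence”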

Page 4: The Gene Wiki, from a BioRDF-naïve perspective

The Long Tail of encyclopedias

                    Articles      Words (millions)   Average words / article
Wikipedia           >2,000,000    >1,000             435
Britannica Online   120,000       55                 370

Source: http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008

An expert-led investigation carried out by Nature … revealed numerous errors in both encyclopaedias, but among 42 entries tested, the difference in accuracy was not particularly great: the average science entry in Wikipedia contained around four inaccuracies; Britannica, about three.

• Wiki: “… a website that allows the visitors themselves to easily add, remove, and otherwise edit and change available content, typically without the need for registration.”

• Wikipedia: “the free encyclopedia that anyone can edit.”

Page 5: The Gene Wiki, from a BioRDF-naïve perspective

Advantages of a Gene Wiki

1) Existing gene portals are great for structured content, but a wiki is suited for summarizing unstructured content

[Screenshots: Entrez Gene; Wikipedia]

Unstructured content allows for free-text, images, diagrams, photos, etc.

Page 6: The Gene Wiki, from a BioRDF-naïve perspective

Advantages of a Gene Wiki

2) Wiki articles enable two-way communication of information, encouraging contributions and edits from the community.

[Figure: snapshots dated Dec 18, 2002; Jan 3, 2004; Dec 11, 2004; May 6, 2006]

Wikipedia is rarely the last place you look, but is often a good first place for an overview.

Page 7: The Gene Wiki, from a BioRDF-naïve perspective

Gene “stubs”

• Active MCB community at WP had already developed ~650 gene articles

• Can we accelerate this process through stub creation?

• In total, created 7500 new articles and edited 650 previously existing articles.

Page 8: The Gene Wiki, from a BioRDF-naïve perspective

Why Wikipedia?

• Critical mass of articles to which and from which we could link gene pages

• Critical mass of editors who were experienced in wiki-related issues (fighting vandalism, copyediting, governance)

• Active group of molecular biologists at the MCB “WikiProject” (http://en.wikipedia.org/wiki/WP:MCB)

• Alternatives considered
  – Home-built wiki
  – Citizendium (citizendium.org)

Page 9: The Gene Wiki, from a BioRDF-naïve perspective

Gene wiki usage

[Figure labels: (650), (7500)]

50% of all edits to gene pages are to newly-created pages…

Gene Wiki pages are highly ranked at Google, ensuring critical mass of users and editors…

Currently have ~9000 gene pages or stubs at Wikipedia

Page 10: The Gene Wiki, from a BioRDF-naïve perspective

Positive feedback loop

[Diagram: Gene wiki page utility, number of readers, and number of editors, each feeding the next. Figure labels: 1001, 2002.]

Page 11: The Gene Wiki, from a BioRDF-naïve perspective

25k gene-specific review articles?

Hyperlinks to related concepts

Reelin: 33 editors, 221 edits since July 2002

Heparin: 175 editors, 320 edits since June 2003

AMPK: 44 editors, 84 edits since March 2004

RNAi: 232 editors, 708 edits since October 2002

Page 12: The Gene Wiki, from a BioRDF-naïve perspective

Gene Wiki activity

Steady (and growing?) edit rate over time

[Figure: Gene Wiki Daily Activity (Oct 17 - Nov 14). Number of edits per day from 10/17/08 through 11/14/08; y-axis from 0 to 160.]

[Figure: Gene Wiki Monthly Activity (May 07 - Nov 08). Number of edits per month from May-07 through Nov-08; y-axis from 0 to 12000.]

Page 13: The Gene Wiki, from a BioRDF-naïve perspective

Gene Wiki article growth

http://manyeyes.alphaworks.ibm.com/manyeyes/visualizations/gene-wiki-top-2500-20081114

Page 14: The Gene Wiki, from a BioRDF-naïve perspective

“Welcome to the semantic web…

The main concern with plaintext-on-Wikipedia is that it's not an effective way to truly exploit the long tail, since you're going to end up with this massive plaintext disaster that will require human collating (redundant work- just get it right the first time).”

- public-semweb-lifesci mailing list

Page 15: The Gene Wiki, from a BioRDF-naïve perspective

Primary emphases

• Providing useful content – scientists will not find or contribute to a wiki unless it is already useful

• Instant feedback – wikis allow changes to be effective immediately, without approval or intermediary (e.g., corrections/additions to NCBI/Ensembl?)

• Emphasis on contributors, not data miners – emphasize getting data in, not on getting it out, since complex protocols encourage nonparticipation (e.g., MIAME)

• Critical mass – What will differentiate the Gene Wiki from the many other wiki efforts that are stagnant?

Page 16: The Gene Wiki, from a BioRDF-naïve perspective

Secondary emphases

• Reliability and accuracy – do open and uncurated data models produce trustworthy content?

• Synergy with existing resources – how can the Gene Wiki make the growth of traditional annotation more efficient?

• Enabling semantic queries/structure – how can we structure unstructured content for data mining? (Semantic MediaWiki? NLP?)
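As one illustration of the Semantic MediaWiki option raised above: that extension lets editors tag a value inline as `[[property::value]]`, optionally with display text after a `|`. The sketch below pulls such annotations out of wikitext as subject-property-value triples. It is an assumption-laden example, not the Gene Wiki's method: the function name, the page title "ExampleGene", and the property names are hypothetical, and the pattern covers only the simple inline form.

```python
import re

# Matches the simple inline annotation forms [[property::value]] and
# [[property::value|display text]]. Ordinary links such as [[Wikipedia]]
# or [[Category:Genes]] contain no "::" and are ignored.
ANNOTATION = re.compile(r"\[\[([^\[\]:|]+)::([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_triples(page_title, wikitext):
    """Return (subject, property, value) triples for each annotation.

    The page title is used as the subject of every triple.
    """
    return [
        (page_title, match.group(1).strip(), match.group(2).strip())
        for match in ANNOTATION.finditer(wikitext)
    ]


# Hypothetical wikitext for a made-up gene page.
sample = (
    "ExampleGene is [[located on::chromosome 1]] and its product "
    "[[binds::ProteinY|a partner protein]]. See also [[Wikipedia]]."
)

for triple in extract_triples("ExampleGene", sample):
    print(triple)
```

The trade-off matches the one discussed on the slides: inline tags keep the barrier to contribution low for editors, while free text that carries no tags would still need some other route, such as NLP, to become structured.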

Page 17: The Gene Wiki, from a BioRDF-naïve perspective

Idealized information flow

[Diagram labels: Semantic structure; NCBI, Ensembl, …; Wikipedia; “Long tail” scientific contributions; Authoritative annotation databases]

1 Create Gene Wiki stubs

2 Unstructured content from the community

3 Semantic encoding of free text (how?)

Direct semantic annotation by scientists

Page 18: The Gene Wiki, from a BioRDF-naïve perspective

Figure to scale?

Semantic structure

NCBI, Ensembl, …

Wikipedia

“Long tail” scientific contributions

Page 19: The Gene Wiki, from a BioRDF-naïve perspective

Summary

• Goal: create a complementary resource to existing tools, not competitive.

• Primary emphasis will always be on maximizing community participation.

• How do we structure the unstructured contributions?

Page 20: The Gene Wiki, from a BioRDF-naïve perspective

Acknowledgements

Serge Batalov
Jason Boyer
Jennifer Floyd
Yue Hu
Jon Huss
Jeff Janes
Camilo Orozco
Steve Su
Julia Turner
Chunlei Wu
David Delano
James Goodale
Phil McClurg
Richard Trager
Faramarz Valafar, SDSU
Tim Vickers, Washington Univ
Michael Cooke
Pete Schultz

Funding: NIGMS, NIH; Novartis Research Foundation