2014 10-15-Nextbug edinburgh


Transcript of 2014 10-15-Nextbug edinburgh

Page 1: 2014 10-15-Nextbug edinburgh

@yannick__ http://yannick.poulet.org

Social insect evolution: genomics opportunities

& approaches

2014-10-15-NextBUG

Page 2: 2014 10-15-Nextbug edinburgh
Page 3: 2014 10-15-Nextbug edinburgh

© Alex Wild & others

Page 4: 2014 10-15-Nextbug edinburgh
Page 5: 2014 10-15-Nextbug edinburgh

© National Geographic

Atta leaf-cutter ants

Page 6: 2014 10-15-Nextbug edinburgh

© National Geographic

Atta leaf-cutter ants

Page 7: 2014 10-15-Nextbug edinburgh

© National Geographic

Atta leaf-cutter ants

Page 8: 2014 10-15-Nextbug edinburgh
Page 9: 2014 10-15-Nextbug edinburgh

Oecophylla weaver ants

© ameisenforum.de

Page 10: 2014 10-15-Nextbug edinburgh

© ameisenforum.de

Weaver ants (French: fourmis tisserandes)

Page 11: 2014 10-15-Nextbug edinburgh

© ameisenforum.de

Oecophylla weaver ants

Page 12: 2014 10-15-Nextbug edinburgh

© forestryimages.org / © wynnie@flickr

Page 13: 2014 10-15-Nextbug edinburgh

Tofilski et al 2008

Forelius pusillus

Page 14: 2014 10-15-Nextbug edinburgh

Tofilski et al 2008

Forelius pusillus hides the nest entrance at night

Page 15: 2014 10-15-Nextbug edinburgh

Tofilski et al 2008

Forelius pusillus hides the nest entrance at night

Page 16: 2014 10-15-Nextbug edinburgh

Tofilski et al 2008

Forelius pusillus hides the nest entrance at night

Page 17: 2014 10-15-Nextbug edinburgh

Tofilski et al 2008

Forelius pusillus hides the nest entrance at night

Page 18: 2014 10-15-Nextbug edinburgh

Before

Workers staying outside die: "preventive self-sacrifice"

Tofilski et al 2008

Forelius pusillus hides the nest entrance at night

Page 19: 2014 10-15-Nextbug edinburgh

Dorylus driver ants: ants with no home

© BBC

Page 20: 2014 10-15-Nextbug edinburgh

© Dirk Mezger

Ritualized fighting

© Carsten Brühl. Camponotus gigas: Pfeiffer & Linsenmair 2001

Page 21: 2014 10-15-Nextbug edinburgh

Army ant milling - “spiral of death”

Page 22: 2014 10-15-Nextbug edinburgh

Animal biomass (Brazilian rainforest), from Fittkau & Klinge 1973:

• Ants & termites: 114
• Soil fauna excluding earthworms, ants & termites: 148
• Other insects: 49.6
• Earthworms: 17.3
• Mammals: 14.5
• Birds: 5.3
• Spiders: 4.7
• Reptiles: 3.7
• Amphibians: 2.8
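Summing these figures gives roughly 360 units in total, so ants and termites alone account for about a third of the animal biomass in this dataset.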

Page 23: 2014 10-15-Nextbug edinburgh
Page 24: 2014 10-15-Nextbug edinburgh

Well-studied:

• behavior

• morphology

• evolutionary context

• ecology

Page 25: 2014 10-15-Nextbug edinburgh

This changes everything: 454, Illumina, SOLiD...

Any lab can sequence anything!

Page 26: 2014 10-15-Nextbug edinburgh

Major research areas

Genes/mechanisms for evolution of social behavior?

Page 27: 2014 10-15-Nextbug edinburgh

[Screenshot: Science Report, www.sciencemag.org, Science vol. 331, 25 February 2011, p. 1067]

Solenopsis invicta fire ants are a big problem! And very well studied!

Ascunce et al 2011

Page 28: 2014 10-15-Nextbug edinburgh

Solenopsis invicta fire ant: two social forms

Single-queen form:
• 1 large queen
• Independent founding
• Highly territorial
• Many sizes of workers

Multiple-queen form:
• 2-100 smaller queens
• Dependent founding
• No inter-colony aggression
• All workers similar size

Page 29: 2014 10-15-Nextbug edinburgh

Fire ants + population genetics: allozyme screen ("starch gel", lanes 1 2 3)

Ken Ross, L. Keller

=> "Gp-9" locus associated with social form

Page 30: 2014 10-15-Nextbug edinburgh
Page 31: 2014 10-15-Nextbug edinburgh

Single queen form Multiple queen form

Ken Ross and colleagues Laurent Keller and colleagues

Social form completely associated with Gp-9 locus

Page 32: 2014 10-15-Nextbug edinburgh

bbbbBB BB Bb bb

Ken Ross and colleagues Laurent Keller and colleagues

Single queen form Multiple queen form

Social form completely associated with Gp-9 locus

(>15%) (<5%)

Page 33: 2014 10-15-Nextbug edinburgh

bbBB BB Bb

x

Gp-9 bb females rare

Ken Ross and colleagues

Laurent Keller and colleagues

Single queen form Multiple queen form

Social form completely associated with Gp-9 locus

(>15%) (<5%)

Page 34: 2014 10-15-Nextbug edinburgh

BB BB Bb

Ken Ross and colleagues Laurent Keller and colleagues

Single queen form Multiple queen form

Social form completely associated with Gp-9 locus

(>15%) (<5%)

Page 35: 2014 10-15-Nextbug edinburgh

BB BB Bb

x

Ken Ross and colleagues

Laurent Keller and colleagues

Single queen form Multiple queen form

Social form completely associated with Gp-9 locus

(>15%) (<5%)

Page 36: 2014 10-15-Nextbug edinburgh

BB BB Bb

x x

Ken Ross and colleagues

Laurent Keller and colleagues

Social form completely associated with Gp-9 locus

Single queen form (>15%) / Multiple queen form (<5%)

Page 37: 2014 10-15-Nextbug edinburgh

BB BB Bb

x x x

Ken Ross and colleagues

Laurent Keller and colleagues

Single queen form (>15%) / Multiple queen form (<5%)

Social form completely associated with Gp-9 locus

Page 38: 2014 10-15-Nextbug edinburgh

Sex chromosomes

X Y

Gp-9 B

Gp-9 b

SB Sb

“Social chromosomes”

?

Wang et al Nature 2013

Page 39: 2014 10-15-Nextbug edinburgh

Major research areas

Genes/mechanisms for differences (e.g., lifespan)?

Genes/mechanisms for evolution of social behavior?

genome evolution <-> social evolution

Page 40: 2014 10-15-Nextbug edinburgh
Page 41: 2014 10-15-Nextbug edinburgh
Page 42: 2014 10-15-Nextbug edinburgh

This changes everything: 454, Illumina, SOLiD...

Any lab can sequence anything!

Page 43: 2014 10-15-Nextbug edinburgh

Genomics is hard.

Page 44: 2014 10-15-Nextbug edinburgh

• Biology/life is complex.
• The field is young.
• Biologists lack computational training.
• Generally, analysis tools suck:
  • badly written
  • badly tested
  • hard to install
  • output quality... often questionable
• Understanding/visualizing/massaging data is hard.
• Datasets continue to grow!

Genomics is hard.

Page 45: 2014 10-15-Nextbug edinburgh

Inspiration?

Page 46: 2014 10-15-Nextbug edinburgh
Page 47: 2014 10-15-Nextbug edinburgh

[Screenshot: first page of Greg Wilson, D.A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H.D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White & Paul Wilson, "Best Practices for Scientific Computing", arXiv:1210.0530v3 [cs.MS], 29 Nov 2012. The page's running example contrasts a hard-to-read rect_area(x1, y1, x2, y2) with an easier-to-read rect_area(point1, point2).]



1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don't repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
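Practice 1 is the one the paper's first page illustrates with its rectangle-area function; here is a minimal R translation of that example (the named-vector point representation is my assumption, not the paper's):

# Harder to read: four loose coordinates to keep in working memory.
rect_area <- function(x1, y1, x2, y2) {
  abs(x2 - x1) * abs(y2 - y1)
}

# Easier to read: two points, each assumed to be a c(x = ..., y = ...) vector.
rect_area2 <- function(point1, point2) {
  abs(point2["x"] - point1["x"]) * abs(point2["y"] - point1["y"])
}

unname(rect_area2(c(x = 0, y = 0), c(x = 3, y = 2)))  # 6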

Page 48: 2014 10-15-Nextbug edinburgh
Page 49: 2014 10-15-Nextbug edinburgh
Page 50: 2014 10-15-Nextbug edinburgh

Inspiration?

• Technologies

• Planning for mistakes

• Automated testing

• Continuous integration

• Writing for people: use style guide

Page 51: 2014 10-15-Nextbug edinburgh

Code for people: use a style guide

• For R: http://r-pkgs.had.co.nz/style.html

Page 52: 2014 10-15-Nextbug edinburgh

R style guide extract

Page 53: 2014 10-15-Nextbug edinburgh

Coding for people: Indent your code!

Programming better

• variable naming

• coding width: 100 characters

• indenting

• Follow conventions, e.g. "Google R Style"

• Versioning: Dropbox & http://github.com/

• Automated testing (see the sketch below)

• "Being able to use, understand and improve your code in 6 months & in 60 years" (paraphrasing Damian Conway)

preprocess_snps <- function(snp_table, testing = FALSE) {
  if (testing) {
    # Run a bunch of tests of extreme situations.
    # Quit if a test gives a weird result.
  }
  # Real part of function.
}
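To make the "automated testing" bullet concrete, a minimal sketch using the testthat package; the helper function and its behaviour are hypothetical, for illustration only:

library(testthat)  # install.packages("testthat") once

# Hypothetical helper: drop SNP rows with any missing genotype call.
drop_incomplete_snps <- function(snp_table) {
  snp_table[complete.cases(snp_table), , drop = FALSE]
}

test_that("rows with missing genotypes are removed", {
  snps <- data.frame(snp = c("s1", "s2"), genotype = c("AA", NA),
                     stringsAsFactors = FALSE)
  cleaned <- drop_incomplete_snps(snps)
  expect_equal(nrow(cleaned), 1)
  expect_equal(cleaned$snp, "s1")
})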


Page 54: 2014 10-15-Nextbug edinburgh

Line length Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.

R style guide extract

ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))

ant_measurements <- read.table(file      = '~/Downloads/Web/ant_measurements.txt',
                               header    = TRUE,
                               sep       = '\t',
                               col.names = c('colony', 'individual', 'headwidth', 'mass'))

Page 55: 2014 10-15-Nextbug edinburgh

Code for people: use a style guide

• For R: http://r-pkgs.had.co.nz/style.html
• For Ruby: https://github.com/bbatsov/ruby-style-guide

Automatically check your code:

install.packages("lint")  # once
library(lint)             # every time
lint("file_to_check.R")

Page 56: 2014 10-15-Nextbug edinburgh
Page 57: 2014 10-15-Nextbug edinburgh

Four tools

Page 58: 2014 10-15-Nextbug edinburgh

Four tools that suck less.

Page 59: 2014 10-15-Nextbug edinburgh

Four tools that (hopefully) suck less.

Page 60: 2014 10-15-Nextbug edinburgh

1. SequenceServer

Page 61: 2014 10-15-Nextbug edinburgh

“Can you BLAST this for me?”

Page 62: 2014 10-15-Nextbug edinburgh

• Once I wanted to set up a BLAST server.

"Sure, I can help you…" Anurag Priyam, mechanical engineering student, IIT Kharagpur

Aim: an open-source, idiot-proof web interface for custom BLAST

Page 63: 2014 10-15-Nextbug edinburgh

“Can you BLAST this for me?”

Antgenomes.org SequenceServer BLAST made easy

(well, we’re trying...)

Page 64: 2014 10-15-Nextbug edinburgh

http://www.sequenceserver.com/ (requires a BLAST+ install)

1. Install:

gem install sequenceserver

2. Configure:

# ~/.sequenceserver.conf
bin: ~/ncbi-blast-2.2.25+/bin/
database: /Users/me/blast_databases/

Do you have BLAST-formatted databases? If not: sequenceserver format-databases /path/to/fastas

3. Launch:

sequenceserver
### Launched SequenceServer at: http://0.0.0.0:4567
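Once launched, pointing a web browser at the address it prints (port 4567 above) brings up the BLAST interface.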

Page 66: 2014 10-15-Nextbug edinburgh

“Can you BLAST this for me?”

Antgenomes.org SequenceServer BLAST made easy

(well, we’re trying...)

Web server: Anurag Priyam & Git community, http://sequenceserver.com

BLAST runs on a 48-core, 512 GB RAM machine, accessed via ssh

Page 67: 2014 10-15-Nextbug edinburgh

2. Bionode

Page 68: 2014 10-15-Nextbug edinburgh

Module counts. Node = "NPM"

Page 69: 2014 10-15-Nextbug edinburgh
Page 70: 2014 10-15-Nextbug edinburgh

Reusable, small and tested modules

Page 71: 2014 10-15-Nextbug edinburgh

Examples (BASH, JavaScript, the bionode.io online shell)

BASH:

bionode-ncbi urls assembly Solenopsis invicta | grep genomic.fna
# -> http://ftp.ncbi.nlm.nih.gov/genomes/all/GCA_000188075.1_Si_gnG/GCA_000188075.1_Si_gnG_genomic.fna.gz

bionode-ncbi download sra arthropoda | bionode-sra

bionode-ncbi download gff bacteria

# Get descriptions for papers related to an SRA search:
bionode-ncbi search sra Solenopsis invicta |
  tool-stream extractProperty uid |
  bionode-ncbi link sra pubmed |
  tool-stream extractProperty destUID |
  bionode-ncbi search pubmed

JavaScript:

var ncbi = require('bionode-ncbi')
ncbi.urls('assembly', 'Solenopsis invicta', gotData)
function gotData(urls) {
  var genome = urls[0].genomic.fna
  download(genome)
}

Page 72: 2014 10-15-Nextbug edinburgh

Difficulty writing scalable, reproducible and complex bioinformatic pipelines. Solution: Node.js everywhere. Streams:

var ncbi = require('bionode-ncbi')
var tool = require('tool-stream')
var through = require('through2')
var fork1 = through.obj()
var fork2 = through.obj()

ncbi
  .search('sra', 'Solenopsis invicta')
  .pipe(fork1)
  .pipe(dat.reads)

fork1
  .pipe(tool.extractProperty('expxml.Biosample.id'))
  .pipe(ncbi.search('biosample'))
  .pipe(dat.samples)

fork1
  .pipe(tool.extractProperty('uid'))
  .pipe(ncbi.link('sra', 'pubmed'))
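Because every stage is a Node stream, records flow through the pipeline one at a time rather than via intermediate files, and a fork such as fork1 lets one upstream search feed several downstream consumers.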

Page 73: 2014 10-15-Nextbug edinburgh
Page 74: 2014 10-15-Nextbug edinburgh
Page 75: 2014 10-15-Nextbug edinburgh

Working with gene predictions

Page 76: 2014 10-15-Nextbug edinburgh

Gene prediction. Dozens of software algorithms: dozens of predictions.

20% failure rate:
• missing pieces
• extra pieces
• incorrect merging
• incorrect splitting

Visual inspection... and manual fixing required.

1 gene = 5 minutes to 3 days

Yandell & Ence 2013 NRG

[Figure: genomic DNA sequence with aligned evidence tracks and a consensus gene model]

Page 77: 2014 10-15-Nextbug edinburgh
Page 78: 2014 10-15-Nextbug edinburgh

3. GeneValidator

Page 79: 2014 10-15-Nextbug edinburgh

Monica Dragan

Ismail Moghul

https://github.com/monicadragan/GeneValidator
https://github.com/IsmailM/GeneValidatorApp

Page 80: 2014 10-15-Nextbug edinburgh

Monica Dragan
https://github.com/monicadragan/GeneValidator
https://github.com/IsmailM/GeneValidatorApp

Ismail Moghul

Page 81: 2014 10-15-Nextbug edinburgh

GeneValidator

Run on:

★whole geneset: identify most problematic predictions

★alternative models for a gene (choose best)

★individual genes (while manually curating)

Page 82: 2014 10-15-Nextbug edinburgh

Warning: Work in Progress

gem install GeneValidator
gem install GeneValidatorApp

http://afra.sbcs.qmul.ac.uk/genevalidator

Page 83: 2014 10-15-Nextbug edinburgh

4. Afra: Crowdsourcing gene model curation

Page 84: 2014 10-15-Nextbug edinburgh

Gene prediction. Dozens of software algorithms: dozens of predictions.

20% failure rate:
• missing pieces
• extra pieces
• incorrect merging
• incorrect splitting

Visual inspection... and manual fixing required.
1 gene = 20 minutes to 3 days.
15,000 genes * 20 species = impossible.

Yandell & Ence 2013 NRG

[Figure: genomic DNA sequence with aligned evidence tracks and a consensus gene model]

Page 85: 2014 10-15-Nextbug edinburgh
Page 86: 2014 10-15-Nextbug edinburgh

[Screenshot: first page of Firas Khatib, Seth Cooper, Michael D. Tyka, Kefan Xu, Ilya Makedon, Zoran Popović, David Baker & Foldit Players, "Algorithm discovery by protein folding game players", PNAS vol. 108 no. 47 (22 November 2011), pp. 18949-18953]

http://Fold.it

Page 87: 2014 10-15-Nextbug edinburgh

Crowd-sourcing the visual inspection + correction of gene models.

Challenges:

• Recruiting & retaining contributors

Page 88: 2014 10-15-Nextbug edinburgh

Recruiting & retaining contributors. Plan A: get students.

• Increase accessibility:
  • Make tasks small & simple
  • Need excellent tutorials & training
  • Need an intelligent "mothering" user interface

• Provide rewards:
  • Better grades
  • Learning experience
  • Good karma (helping science)
  • Prestige & pride (on Facebook; points, badges & a "leaderboard"; with certificates; in publications)
  • Opportunities to develop expertise & responsibilities

Page 89: 2014 10-15-Nextbug edinburgh

Crowd-sourcing the visual inspection + correction of gene models.

Challenges

• Recruiting & retaining contributors

• Ensuring quality

Page 90: 2014 10-15-Nextbug edinburgh

Ensuring quality

• Excellent tutorials/training

• Make tasks small & simple

• Redundancy

• Review of conflicts by senior users.

[Workflow: Begin -> Needs curation -> Create initial tasks -> three redundant Being curated / Curate / Submit tasks -> Auto-check -> if consistent: create next required task -> Done; if inconsistent: create "review" task]
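The auto-check step could be as simple as comparing the redundant submissions; a minimal hypothetical sketch in R (the function and the coordinate-string format are assumptions, not Afra's actual code):

# Hypothetical auto-check: compare redundant curations of one gene model.
# Returns "done" when all submissions agree, "review" otherwise.
auto_check <- function(curations) {
  if (length(unique(curations)) == 1) "done" else "review"
}

auto_check(rep("chr1:100-200,300-400", 3))                     # "done"
auto_check(c("chr1:100-200,300-400", "chr1:100-250,300-400"))  # "review"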

Page 91: 2014 10-15-Nextbug edinburgh

Crowd-sourcing the visual inspection + correction.

Challenges

http://afra.sbcs.qmul.ac.uk
Anurag Priyam, http://github.com/yeban/afra

• Recruiting & retaining contributors

• Ensuring quality

Page 92: 2014 10-15-Nextbug edinburgh

Warning: Work in Progress

Page 93: 2014 10-15-Nextbug edinburgh
Page 94: 2014 10-15-Nextbug edinburgh
Page 95: 2014 10-15-Nextbug edinburgh
Page 96: 2014 10-15-Nextbug edinburgh

Timelines

• Rolled out to:

• 8 MSc students

• 20 3rd year students

• Need to improve tutorials/guidance/documentation

• Roll out to 200 first-years (in a few months)

• Expand

Page 97: 2014 10-15-Nextbug edinburgh

Summary

• Ants are cool

• Exciting times & big challenges

• Inspiration from people working with computers more/longer

• SequenceServer - set up custom BLAST servers

• Bionode - modular streams for bioinformatics

• GeneValidator - identifying problems with gene predictions

• Afra - infrastructure to crowdsource gene curation to the masses

Page 98: 2014 10-15-Nextbug edinburgh

Recruiting: genome hacker / bioinformatics support

Page 99: 2014 10-15-Nextbug edinburgh

GitHub

Page 100: 2014 10-15-Nextbug edinburgh

Thanks!

[email protected]
@yannick__

http://yannick.poulet.org

Colleagues & collaborators @ QMUL & UNIL:
Anurag Priyam (@yeban), Monica Dragan, Ismail Moghul, Vivek Rai, Bruno Vieira (@bmpvieira)

Page 101: 2014 10-15-Nextbug edinburgh
Page 102: 2014 10-15-Nextbug edinburgh

Maybe

Page 103: 2014 10-15-Nextbug edinburgh

Generally: genome evolution <-> social evolution

Single- vs. multiple-queenness:
• in fire ants
• in similar independent species
• one or many loci?
• one or many genes?
• convergence?

Social parasitism

Strengths of selection in social evolution

Concepts & mechanisms

Medically relevant questions / candidate gene studies:
• Vitellogenin
• Sex-determination genes
• functional testing...

Tools for genomics work on emerging model organisms

Molecular response to social upheaval