2014 abic-talk
![Page 1: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/1.jpg)
BUILDING BETTER BIOINFORMATICS SOFTWARE (WHY THE HECK NOT?)
C. Titus Brown
Assistant Professor, MMG / CSE
Michigan State University
![Page 2: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/2.jpg)
BUILDING BETTER BIOINFORMATICS SOFTWARE (WHY THE HECK NOT?)
C. Titus Brown
A???????? Professor, VetMed, UC Davis
![Page 3: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/3.jpg)
Lansing, Michigan -> Davis, California
![Page 4: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/4.jpg)
Dot plots FTW!
Brown et al., 2005.
![Page 5: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/5.jpg)
So I said these things…
“this tipping point was exacerbated by the loss of about 80% of the world’s data scientists in the 2021 Great California Disruption.”
“[ Benchmarks ] have proven to be stifling of innovation, because of the tendency to do incremental improvement.”
ivory.idyll.org/blog/2014-bosc-keynote.html
![Page 7: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/7.jpg)
![Page 8: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/8.jpg)
There is a real problem.
![Page 9: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/9.jpg)
There is a massive profusion of software!
jeffvictor.deviantart.com
Mick Watson, @BioMickWatson:
biomickwatson.wordpress.com/2012/12/28/an-embargo-on-short-read-alignment-software/
![Page 10: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/10.jpg)
The players, in caricature:
1. Computer scientists
2. Software engineers
3. Data scientists
4. Statisticians
5. Biologists
![Page 11: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/11.jpg)
The Computer Scientist
Fast, sensitive, specific – pick one.
![Page 12: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/12.jpg)
The (Good) Software Engineer
Does it have any unit tests?
![Page 13: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/13.jpg)
The Data Scientist
How quickly can I run it, starting from scratch?
![Page 14: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/14.jpg)
The Statistician
What gives me the best p-value?
![Page 15: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/15.jpg)
The Biologist
What gives me the most publishable result?
![Page 16: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/16.jpg)
Problems all along the way…
1. Computer scientists: build delicate, hard-to-use, very high-performance software that solves the wrong problem.
2. Software engineers: all work for Google.
3. Data scientists: use the wrong programs -- because they’re actually usable.
4. Statisticians: only get invited into the project six months after all the data is generated.
5. Biologists: are desperate to find any one of the above who knows any biology at all.
![Page 17: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/17.jpg)
Example: de novo mRNAseq
Every one of these steps is still an open research problem, with computational
challenges and direct biological implications!
![Page 18: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/18.jpg)
So:
1. This is all still research.
2. We’re unlikely to ever find out the right answer, but will merely settle for one that’s not obviously terrible.
3. Everything is changing all the time: the data generation tech, the hardware, the software, the theory...
4. Who are any of us to judge the value of any particular approach?
(Well, sometimes me, when I’m peer reviewer #2.)
![Page 20: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/20.jpg)
All hands on deck!
We need it all!
• Fast/sensitive/specific algorithms;
• Solid software;
• Statistical robustness;
• Biological insight;
• Well-trained data scientists.
(The best bioinformaticians have multiple personality disorder, or so I tell myself.)
![Page 21: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/21.jpg)
That sort of explains why.
But this still leaves us with too many choices.
![Page 22: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/22.jpg)
Example: de novo mRNAseq
10-20 packages x 2-5 packages x 5-10 packages x 20-40 packages = 2,000-40,000 combinations
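The range on this slide is the product of the per-stage package counts. A minimal sketch of that arithmetic follows; the stage ranges are the four from the slide, while the variable and function names are made up for illustration, and the stages are left unnamed because the transcript does not label them.

```python
from math import prod

# (low, high) package counts per workflow stage, as listed on the slide.
# The stages are not named in this transcript, so they are left unlabeled.
stage_ranges = [(10, 20), (2, 5), (5, 10), (20, 40)]


def combination_bounds(ranges):
    """Return (fewest, most) possible end-to-end workflows."""
    low = prod(lo for lo, _ in ranges)
    high = prod(hi for _, hi in ranges)
    return low, high


low, high = combination_bounds(stage_ranges)
print(f"{low:,} to {high:,} possible workflows")
```

Adding one more stage with even a handful of alternatives multiplies both bounds again, which is why exhaustive comparison of whole workflows quickly becomes impractical.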
![Page 23: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/23.jpg)
What’s the solution!?
Ultimately? All of…
• Whole-workflow evaluations of tools.
• Small tools (see “small tools manifesto”).
• Automation!
• Simulations, synthetic data, mock data, real data.
• Antagonistic data set development (**).
• Tool development driven with use cases.
• Build based on solid command-line workflows.
• Those things called “controls”.
…and more
![Page 24: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/24.jpg)
Trying out a few approaches…
![Page 25: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/25.jpg)
1. Automate the hell out of everything (Ubuntu 14.04, git, make, IPython Notebook, latex)
![Page 26: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/26.jpg)
Time from publication of KAnalyze to our 100% reproducible re-evaluation? ~8 hours.
![Page 27: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/27.jpg)
2. Protocols, not pipelines.
STOP HIDING THE ANALYSIS STEPS.
BIG BLACK BOXES ARE NOT SMALL TOOLS!
![Page 28: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/28.jpg)
Write down what you’re doing…
https://khmer-protocols.readthedocs.org/
![Page 29: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/29.jpg)
…and add automated end-to-end tests.
cf. “literate ReSTing”
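One way to turn a written protocol into an end-to-end test is to pull the shell commands out of the documentation and run them in order. The sketch below illustrates that general idea only; it is not the tooling used for khmer-protocols. It assumes commands live in indented reStructuredText literal blocks introduced by a line ending in `::`, and the sample protocol text and function names are hypothetical.

```python
import subprocess


def extract_commands(rst_text):
    """Collect the indented lines of every literal block introduced by '::'."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        if not line.strip():
            continue  # blank lines neither open nor close a block
        indented = line[0] in " \t"
        if in_block and indented:
            commands.append(line.strip())
            continue
        # An unindented line closes any open block and may open a new one.
        in_block = line.rstrip().endswith("::")
    return commands


def run_commands(commands):
    """Run each command through the shell, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, shell=True, check=True)


# Hypothetical protocol text, standing in for a real tutorial page.
PROTOCOL = """\
Install the tool
================

Run the following::

   echo step-one
   echo step-two

Some prose that is not a command.

Then check the result::

   echo done
"""

if __name__ == "__main__":
    run_commands(extract_commands(PROTOCOL))
```

Because the test executes exactly what the reader is told to type, the documentation and the test cannot silently drift apart.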
![Page 30: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/30.jpg)
![Page 31: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/31.jpg)
3. Drive sustainable software development with use cases.
![Page 32: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/32.jpg)
…that are explicit…
![Page 33: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/33.jpg)
…versioned…
![Page 34: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/34.jpg)
…and automated.
![Page 35: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/35.jpg)
4. Put everything in the cloud and measure it.
~40 hours; m1.xlarge
Eel Pond mRNAseq protocol.
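Measuring a protocol can be as simple as recording the wall-clock time of each step as it runs. The following is a minimal sketch under stated assumptions: the step names and commands are placeholders chosen so the example runs anywhere, not the actual Eel Pond commands.

```python
import subprocess
import sys
import time


def time_steps(steps):
    """Run each (name, argv) step in order; return {name: seconds elapsed}."""
    timings = {}
    for name, argv in steps:
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # stop at the first failing step
        timings[name] = time.perf_counter() - start
    return timings


# Placeholder steps; a real protocol would list its own commands here.
steps = [
    ("step-1", [sys.executable, "-c", "pass"]),
    ("step-2", [sys.executable, "-c", "sum(range(100000))"]),
]

timings = time_steps(steps)
for name, seconds in timings.items():
    print(f"{name}: {seconds:.2f} s")
print(f"total: {sum(timings.values()):.2f} s")
```

Recording the machine type alongside the timings, as the slide does, is what makes the numbers comparable between runs.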
![Page 36: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/36.jpg)
5. Compare programs and workflows fairly.
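Genome Reference

| Assembler | Quality Filtered | Diginorm | Partition | Reinflation |
|---|---|---|---|---|
| Velvet | - | 80.90 | 83.64 | 84.57 |
| IDBA | 90.96 | 91.38 | 90.52 | 88.80 |
| SPAdes | 90.42 | 90.35 | 89.57 | 90.02 |

Mis-assembled Contig Length

| Assembler | Quality Filtered | Diginorm | Partition | Reinflation |
|---|---|---|---|---|
| Velvet | - | 52071358 | 44730449 | 45381867 |
| IDBA | 21777032 | 20807513 | 17159671 | 18684159 |
| SPAdes | 28238787 | 21506019 | 14247392 | 18851571 |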
Kalamazoo metagenome protocol run on mock data from Shakya et al., 2013
Also! Tip o’ the hat to Michael Barton, nucleotid.es
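Once every assembler has been run through the same treatments, the comparison itself is easy to script. The sketch below uses the mis-assembled contig length values from the slide and assumes that a lower value is better; the slide does not state units, and the function and variable names are illustrative.

```python
TREATMENTS = ["Quality Filtered", "Diginorm", "Partition", "Reinflation"]

# Values copied from the slide; None marks the "-" cell for Velvet.
misassembled_length = {
    "Velvet": [None, 52071358, 44730449, 45381867],
    "IDBA": [21777032, 20807513, 17159671, 18684159],
    "SPAdes": [28238787, 21506019, 14247392, 18851571],
}


def best_per_treatment(table, treatments, lower_is_better=True):
    """For each treatment, name the assembler with the best reported value."""
    pick = min if lower_is_better else max
    best = {}
    for i, treatment in enumerate(treatments):
        # Skip assemblers with no reported value for this treatment.
        scored = {name: row[i] for name, row in table.items() if row[i] is not None}
        best[treatment] = pick(scored, key=scored.get)
    return best


for treatment, assembler in best_per_treatment(misassembled_length, TREATMENTS).items():
    print(f"{treatment}: {assembler}")
```

No single assembler comes out ahead under every treatment, which is the case for comparing whole workflows instead of isolated programs.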
![Page 37: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/37.jpg)
A super fun way to do reviews!
• “What a nice new transcriptome assembler! Interesting how it doesn’t perform that well on my 10 test data sets.”
• “Hey, so you make these claims, but I ran your code, and…”
• “Fun fact! Your source code has a syntax error in it – even Perl has standards! You’re still sure that’s the script you used?”
• “Here – use our evaluation pipeline, since you clearly need something better.”
The Brown Lab: taking passive aggression to a whole new level!
![Page 38: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/38.jpg)
We breed our own problems.
Let’s level up the field, already.
Reward the behavior you want to see.
![Page 39: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/39.jpg)
![Page 40: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/40.jpg)
What are we working on, scientifically speaking?
![Page 41: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/41.jpg)
Streaming error correction of genomic, transcriptomic, metagenomic data via graph alignment
Jason Pell, Jordan Fish, Michael Crusoe
![Page 42: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/42.jpg)
Error correction on simulated E. coli data
1% error rate, 100x coverage.
Michael Crusoe, Jordan Fish, Jason Pell
| | TP (corrected) | FP (mistakes) | TN (OK) | FN (missed) |
|---|---|---|---|---|
| 1.2-pass | 3,494,631 99.8% | 3,865 | 460,601,171 | 5,533 2.8% |
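The four counts on this slide are enough to derive the usual summary rates. This sketch applies the standard definitions of sensitivity, precision, and specificity to the slide's numbers. Sensitivity, TP / (TP + FN), rounds to 99.8%, consistent with the figure shown beside the TP count. The transcript does not say what the 2.8% figure measures, so it is not reproduced here.

```python
# Counts from the slide: corrected, mistakes, OK, missed.
tp, fp, tn, fn = 3_494_631, 3_865, 460_601_171, 5_533


def rates(tp, fp, tn, fn):
    """Standard summary rates for a binary confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),  # share of true errors that were corrected
        "precision": tp / (tp + fp),    # share of corrections that were right
        "specificity": tn / (tn + fp),  # share of correct bases left untouched
    }


for name, value in rates(tp, fp, tn, fn).items():
    print(f"{name}: {value:.4%}")
```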
![Page 43: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/43.jpg)
Single pass, reference free, tunable, streaming online variant calling.
(Hey, look, ma – a new mapper!)
Error correction variant calling
![Page 44: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/44.jpg)
Infrastructure: distributed graph database server
ivory.idyll.org/blog/2014-moore-ddd-talk.html
![Page 45: 2014 abic-talk](https://reader033.fdocuments.us/reader033/viewer/2022051513/547e7520b47959c0508b4b67/html5/thumbnails/45.jpg)
AGTA talk on Monday
• 3:15-4pm – come see me try to convince biomedical researchers to share their data!
• 4-4:30pm – come listen to Ana Conesa talk about multi-omics data integration!
Thanks!