2014 bangkok-talk
![Page 1: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/1.jpg)
NON-INTENSIVE BIOLOGY: OPPORTUNITIES AND CHALLENGES OF NEXT-GEN SEQUENCING
C. Titus Brown
Assistant Professor
MMG / CSE
![Page 2: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/2.jpg)
Lansing, Michigan -> Davis, California
![Page 3: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/3.jpg)
We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on Lab Web site: http://ged.msu.edu/research.html
• Papers available as preprints.
• All my talks are available at slideshare.net/c.titus.brown/
![Page 4: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/4.jpg)
Sequencing!
• Sequencing of DNA and RNA.
• Single genomes
• Transcriptomes
• Natural populations (tags)
• Environmental samples/microbial populations (metagenomics)
• Cheap and massively scalable sequencing of DNA and RNA.
![Page 5: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/5.jpg)
Sequencing technology
• Major, dramatic changes in our ability to sequence DNA and RNA quickly and cheaply.
• Majority of deployed techniques depend on (variations of) a single trick: “polony” sequencing. No cloning.
• Single-molecule sequencing coming along fast, but not yet ready for prime time.
![Page 6: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/6.jpg)
![Page 7: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/7.jpg)
![Page 8: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/8.jpg)
Two specific concepts:
• First, sequencing everything at random is very much easier than sequencing a specific gene region. (For example, it will soon be easier and cheaper to shotgun-sequence all of E. coli than it is to get a single good plasmid sequence.)
• Second, if you are sequencing on a 2-D substrate (wells, or surfaces, or whatnot) then any increase in density (smaller wells, or better imaging) leads to a squared increase in the number of sequences.
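The second point is simple geometry. As a rough illustration (the square imaging region and the spacing values are made-up numbers, not figures from the talk), halving the spacing between features on a 2-D surface quadruples how many of them fit:

```python
def features_on_square(side_um: int, pitch_um: int) -> int:
    """Count features on a square grid with one feature every `pitch_um` microns."""
    per_side = side_um // pitch_um
    return per_side * per_side


# Hypothetical 1 mm x 1 mm region, imaged at two different feature spacings.
coarse = features_on_square(1000, 2)  # 2 micron spacing
fine = features_on_square(1000, 1)    # 1 micron spacing

print(coarse, fine, fine // coarse)
```

A 2x improvement in linear density gives a 4x increase in sequences per run, which is why modest improvements in well size or imaging translate into large jumps in output.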
![Page 9: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/9.jpg)
Novel genome sequencing
![Page 10: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/10.jpg)
![Page 11: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/11.jpg)
![Page 12: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/12.jpg)
![Page 13: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/13.jpg)
![Page 14: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/14.jpg)
Some numbers
• For under $1,000 per sample, the Illumina HiSeq machine will generate:
• 200,000,000 reads
• Each of length ~150 bp
• In under a week.
• x 16 samples/run.
• That’s almost 500 Gbp of sequence, or roughly 160x coverage of a human genome…
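The arithmetic behind these figures can be checked directly. The only number below that is not on the slide is the human genome size, which is assumed here to be about 3 Gbp:

```python
READS_PER_SAMPLE = 200_000_000
READ_LENGTH_BP = 150
SAMPLES_PER_RUN = 16
HUMAN_GENOME_BP = 3_000_000_000  # assumption: ~3 Gbp haploid human genome

total_bp = READS_PER_SAMPLE * READ_LENGTH_BP * SAMPLES_PER_RUN
coverage = total_bp / HUMAN_GENOME_BP

print(f"{total_bp / 1e9:.0f} Gbp per run, about {coverage:.0f}x a human genome")
```

Under that assumption a full run comes to 480 Gbp, in line with the slide's rounded figures.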
![Page 15: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/15.jpg)
Shotgun sequencing
• Collect samples;
• Extract DNA or RNA;
• Feed into sequencer;
• Computationally analyze.
Wikipedia: Environmental shotgun sequencing.png
“Sequence it all and let the bioinformaticians sort it out”
![Page 16: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/16.jpg)
The challenges of non-model sequencing
• Missing or low quality genome reference.
• Evolutionarily distant.
• Most extant computational tools focus on model organisms –
• Assume low polymorphism (internal variation)
• Assume reference genome
• Assume somewhat reliable functional annotation
• More significant compute infrastructure
…and cannot easily or directly be used on critters of interest.
![Page 17: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/17.jpg)
Shotgun sequencing analysis goals:
• Assembly (what is the text?)
• Produces new genomes & transcriptomes.
• Gene discovery for enzymes, drug targets, etc.
• Counting (how many copies of each book?)
• Measure gene expression levels, protein-DNA interactions
• Variant calling (how does each edition vary?)
• Discover genetic variation: genotyping, linkage studies…
• Allele-specific expression analysis.
![Page 18: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/18.jpg)
Shotgun sequencing & assembly
http://eofdreams.com/library.html;http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
![Page 19: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/19.jpg)
Shotgun sequencing analysis goals:
• Assembly (what is the text?)
• Produces new genomes & transcriptomes.
• Gene discovery for enzymes, drug targets, etc.
• Counting (how many copies of each book?)
• Measure gene expression levels, protein-DNA interactions
• Variant calling (how does each edition vary?)
• Discover genetic variation: genotyping, linkage studies…
• Allele-specific expression analysis.
![Page 20: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/20.jpg)
Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
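The toy example above can be reproduced with a naive greedy overlap merge. This is only a sketch for intuition, with hypothetical function names; it is not how production assemblers work, and it would misassemble real data containing repeats or sequencing errors.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing left that overlaps; return separate contigs
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


fragments = [
    "It was the best of times, it was the wor",
    ", it was the worst of times, it was the",
    "isdom, it was the age of foolishness",
    "mes, it was the age of wisdom, it was th",
]
contigs = greedy_assemble(fragments)
print(contigs[0])
```

With these four fragments the merge recovers the full sentence as a single contig. The all-against-all comparison is quadratic in the number of fragments, which hints at why assembly gets hard at sequencing scale.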
![Page 21: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/21.jpg)
Mapping: locate reads in reference
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
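As a rough sketch of what locating reads in a reference involves, the toy mapper below indexes every k-mer of a reference string, uses a read's first k-mer as a seed, and then counts mismatches over the full read. The reference, reads, and function names are invented for illustration; real mappers use much more compact indexes and also handle insertions and deletions.

```python
from collections import defaultdict


def build_index(reference: str, k: int):
    """Map each k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index, k: int, max_mismatches: int = 1):
    """Return (position, mismatches) for each seed hit within the mismatch limit."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits


reference = "ACGTTGCATGCCGATAGCTA"
index = build_index(reference, 4)
print(map_read("GCATGCCG", reference, index, 4))  # exact match
print(map_read("GCATGCAG", reference, index, 4))  # one mismatch
```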
![Page 22: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/22.jpg)
Variant detection after mapping
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
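Once reads are mapped, a very simple form of variant detection is to pile up the read bases at each reference position and report positions where most reads disagree with the reference. The sketch below assumes ungapped alignments given as (position, read) pairs; the thresholds and names are illustrative, and real variant callers use statistical models that account for base and mapping quality.

```python
from collections import Counter


def call_variants(reference: str, alignments, min_depth: int = 3,
                  min_fraction: float = 0.8):
    """Report (position, ref_base, alt_base, depth) where reads disagree with the reference."""
    pileup = [Counter() for _ in reference]
    for pos, read in alignments:
        for offset, base in enumerate(read):
            pileup[pos + offset][base] += 1

    variants = []
    for i, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to say anything
        base, n = counts.most_common(1)[0]
        if base != reference[i] and n / depth >= min_fraction:
            variants.append((i, reference[i], base, depth))
    return variants


reference = "ACGTACGTAC"
alignments = [(0, "ACGTTCGT"), (1, "CGTTCGTA"), (2, "GTTCGTAC")]
print(call_variants(reference, alignments))
```

All three reads carry a T where the reference has an A at position 4, so that position is the only one reported.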
![Page 23: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/23.jpg)
Looking forward 5 years…
Navin et al., 2011
![Page 24: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/24.jpg)
Some basic math:
• 1000 single cells from a tumor…
• …sequenced to 40x haploid coverage with Illumina…
• …yields 120 Gbp each cell…
• …or 120 Tbp of data.
• HiSeq X10 can do the sequencing in ~3 weeks.
• The variant calling will require 2,000 CPU weeks…
• …so, given ~2,000 computers, can do this all in one month.
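Worked through explicitly, the numbers look like this. The sketch assumes a haploid human genome of about 3 Gbp, perfectly parallel variant calling, and that sequencing and computation happen one after the other; none of those assumptions is stated on the slide.

```python
CELLS = 1000
COVERAGE = 40
HAPLOID_GENOME_BP = 3_000_000_000  # assumption: ~3 Gbp

bp_per_cell = COVERAGE * HAPLOID_GENOME_BP  # 120 Gbp per cell
total_bp = CELLS * bp_per_cell              # 120 Tbp in total

SEQUENCING_WEEKS = 3   # HiSeq X10 estimate from the slide
CPU_WEEKS = 2000       # variant-calling estimate from the slide
COMPUTERS = 2000

compute_weeks = CPU_WEEKS / COMPUTERS       # assumes perfect parallelism
total_weeks = SEQUENCING_WEEKS + compute_weeks

print(f"{bp_per_cell / 1e9:.0f} Gbp per cell, {total_bp / 1e12:.0f} Tbp total")
print(f"about {total_weeks:.0f} weeks end to end")
```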
![Page 25: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/25.jpg)
Similar math applies:
• Pathogen detection in blood;
• Environmental sequencing;
• Sequencing rare DNA from circulating blood.
• Two issues:
• Volume of data & compute infrastructure;
• Latency for clinical applications.
![Page 26: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/26.jpg)
The Data Deluge
(a traditional requirement for these talks)
![Page 27: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/27.jpg)
Lab approach: Lossy compression
Lossy compression can substantially reduce data size while retaining information needed for later (re)analysis.
(Reduce volume of data & compute infrastructure requirements)
![Page 28: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/28.jpg)
http://en.wikipedia.org/wiki/JPEG
Lossy compression
![Page 29: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/29.jpg)
http://en.wikipedia.org/wiki/JPEG
Lossy compression
![Page 30: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/30.jpg)
http://en.wikipedia.org/wiki/JPEG
Lossy compression
![Page 31: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/31.jpg)
http://en.wikipedia.org/wiki/JPEG
Lossy compression
![Page 32: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/32.jpg)
http://en.wikipedia.org/wiki/JPEG
Lossy compression
![Page 33: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/33.jpg)
Outline
• The Molgulid story: investigating non-model ascidians (this is the biology)
• Meditations on data analysis.
• Methods, methods, methods.
• Training, training, training.
• Concluding thoughts
![Page 34: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/34.jpg)
The Molgula Story – an int’l collaboration
Elijah Lowe (MSU; Naples?)
Billie Swalla (UW, BEACON)
Lionel Christiaen (NYU); Claudia Racioppi (Naples; NYU)
![Page 35: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/35.jpg)
…to the urochordates we go!
Putnam et al., 2008, Nature.
Modified from Swalla 2001
![Page 36: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/36.jpg)
Filter feeding adults
Molgula oculata
Molgula occulta
Molgula oculata Ciona intestinalis
Elijah Lowe; collaboration w/Billie Swalla
![Page 37: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/37.jpg)
Challenging organisms to work on!
Molgula occulta & M. oculata:
• Only spawn ~1 month out of the year
• Located off the northern coast of France
• Hybrids not found outside of lab conditions
• Species cannot be cultured
• Wet lab techniques are not fully developed for species
• No genomic resources (as of 2008).
![Page 38: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/38.jpg)
Billie Swalla, Nadine Peyrieras, Alberto Stolfi
![Page 39: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/39.jpg)
Tail loss and notochord
a) M. oculata b) hybrid (occulta egg x oculata sperm) c) M. occulta
Notochord cells in orange
Swalla, B. et al. Science, Vol 274, Issue 5290, 1205-1208, 15 November 1996
![Page 40: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/40.jpg)
Molgula clades – tail loss is derived
![Page 41: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/41.jpg)
Solitary ascidians have determinate and invariant cleavage.
Some species have colored cytoplasms.
(Boltenia villosa)
The cell lineage is very similar in Ciona, Phallusia, Halocynthia roretzi & Molgula oculata.
![Page 42: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/42.jpg)
Molgula occidentalis
Ciona intestinalis
![Page 43: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/43.jpg)
![Page 44: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/44.jpg)
Notochord formation (convergence & extension) in ascidians is highly conserved.
Jiang and Smith, 2007
Ciona savignyi
![Page 45: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/45.jpg)
Molgula oculata notochord (40 cells, converged & extended)
Molgula occulta no notochord (20 cells, not converged & extended)
Hybrid notochord (20 cells, converged & extended)
Notochord Formation in Molgulids
Swalla and Jeffery, 1996
![Page 46: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/46.jpg)
First we applied mRNAseq…
Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/
![Page 47: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/47.jpg)
…which gave us entire transcriptomes…
Lowe et al., in review (PeerJ). https://peerj.com/preprints/505/
![Page 48: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/48.jpg)
…then we sequenced their genomes...
• 3 species:
Molgula occidentalis (tailed) – “MOXI”
Molgula oculata (tailed) – “MOCU”
Molgula occulta (tail-less) – “MOCC”
• 3 lanes: 300-400 bp; 650-750 bp; 900-1000 bp
• ≥ 200X coverage each genome
De novo assembly by Elijah Lowe (MSU)
Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
![Page 49: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/49.jpg)
…which gave us most of their genes (and regulatory elements?)
Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
Genome assembly statistics:
![Page 50: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/50.jpg)
Shift in differentially expressed genes from gastrulation to neurulation
M. ocu vs. M. occ gastrula M. ocu vs. M. occ neurula
Differentially expressed during neurulation in M. ocu vs M. occ
Elijah Lowe
![Page 51: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/51.jpg)
Notochord gene expression similar to tailed species
Elijah Lowe
![Page 52: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/52.jpg)
Heterochronic Shift in Molgulidae Development
*79 genes examined across six species
![Page 53: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/53.jpg)
Transgenics of reporter constructs
(“Mutual intelligibility” across ~350 my)
Stolfi et al., eLife, 2014; http://dx.doi.org/10.7554/eLife.03728
![Page 54: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/54.jpg)
Prickle is a key part of the notochord program.
Veeman, M., et al., 2007
• Planar cell polarity (PCP) pathway
• Involved in convergence and extension
![Page 55: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/55.jpg)
Prickle expressed in notochord cells of tailless ascidians.
Mita et al Zool. Sci., 2010
M. occulta gastrulation
Ciona intestinalis
Satoh, Nature Reviews Genetics 4, 2003
FGF
Bra Pk
Elijah Lowe
![Page 56: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/56.jpg)
(Re)booting the Molgula --
• Determined conservation of cardiopharyngeal developmental program, despite shifts in cis-regulatory sequences (Stolfi et al, eLife, 2014).
• Examining heterochronic shifts in developmental timing (tail loss) (Maliska et al., in preparation).
• Connecting evolutionary shifts in developmental gene regulatory networks with conserved molecular profiles (Lowe et al, submitted; Lowe et al., in preparation).
![Page 57: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/57.jpg)
More thoughts on Molgula
• One grad student, two transcriptomes, three genomes, four years…
• Genomic resources are enabling a sprawling international collaboration (UW/BEACON, MSU/BEACON, NYU, Naples, Paris)
• Methods development is key!
![Page 58: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/58.jpg)
How Science Works
![Page 59: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/59.jpg)
Luckily, data analysis is cheap and easy!
![Page 60: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/60.jpg)
Err, well, actually…
http://www.pixelpog.com/ftpimages/GnomesAttack.jpg
![Page 61: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/61.jpg)
It is now easy to generate sequencing data sets of such a size and scale that the first round analysis cannot even be completed.
![Page 62: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/62.jpg)
My research: theoretical => applied solutions to scale.
![Page 63: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/63.jpg)
My research: three methods.
1. Adaptation of a suite of probabilistic data structures for representing set membership and counting (Bloom filters and CountMin Sketch). (Zhang et al., PLoS One, 2014.)
2. An online streaming approach to lossy compression of sequencing data. (Brown et al., arXiv, 2012; Howe et al., PNAS, 2014.)
3. Compressible de Bruijn graph representation for assembly. (Pell et al., PNAS, 2012.)
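To make the first of these methods more concrete, here is a minimal Count-Min sketch used for k-mer counting. It is an illustrative sketch only, not the lab's implementation: the class name, table dimensions, and SHA-256-based hashing are assumptions chosen for clarity. The key property is that hash collisions can inflate a count but can never make it smaller than the true count.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: may overestimate, never underestimates."""

    def __init__(self, width: int = 1000, depth: int = 4):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, item: str):
        # One independent-looking hash per row, derived by salting with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str) -> None:
        for row, col in self._slots(item):
            self.tables[row][col] += 1

    def count(self, item: str) -> int:
        # The smallest cell is the one least affected by collisions.
        return min(self.tables[row][col] for row, col in self._slots(item))


def kmers(seq: str, k: int):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


sketch = CountMinSketch()
for kmer in kmers("ACGTACGTAC", 4):
    sketch.add(kmer)
print(sketch.count("ACGT"), sketch.count("TACG"))
```

Memory use is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is the trade that makes this family of structures attractive for very large data sets.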
![Page 64: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/64.jpg)
Method #2 - Digital normalization
(a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A! Overkill!!
This 100x will consume disk space and, because of errors, memory.
We can discard it for you…
![Page 65: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/65.jpg)
Digital normalization
![Page 66: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/66.jpg)
Digital normalization
![Page 67: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/67.jpg)
Digital normalization
![Page 68: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/68.jpg)
Digital normalization
![Page 69: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/69.jpg)
Digital normalization
![Page 70: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/70.jpg)
Digital normalization
![Page 71: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/71.jpg)
Digital normalization retains information, while discarding data and errors
![Page 72: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/72.jpg)
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Streaming & single pass: looks at each read at most once;
• Does not “collect” the majority of errors;
• Keeps all low-coverage reads;
• Smooths out coverage of sequencing.
=>
Enables analyses that are otherwise completely impossible.
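The bullets above can be sketched in a few lines. This is a simplified illustration, assuming the commonly described diginorm rule: keep a read only if the median count of its k-mers seen so far is below a cutoff, and only then add its k-mers to the counts. It uses an exact Python `Counter` where a memory-efficient probabilistic counter would normally be used, and the function names and the tiny `k` and `cutoff` in the usage example are illustrative.

```python
from collections import Counter
from statistics import median


def kmers(seq: str, k: int):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k: int = 20, cutoff: int = 20):
    """Single streaming pass: keep a read only while its region is under-covered."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        # Estimated coverage of this read, based only on reads kept so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Ten copies of an abundant sequence plus one rare sequence.
reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = digital_normalization(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy run most copies of the abundant sequence are discarded once its estimated coverage passes the cutoff, while the rare sequence is kept. Discarded reads never add to the counts, which is why redundant reads (and the errors they carry) stop costing memory.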
![Page 73: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/73.jpg)
Witness the power of this fully operational set of sequence analysis methods:
1. Assembling soil metagenomes.
Howe et al., PNAS, 2014 (w/Tiedje)
2. Understanding bone-eating worm symbionts.
Goffredi et al., ISME, 2014.
3. An ultra-deep look at the lamprey transcriptome.
Scott et al., in preparation (w/Li)
4. Understanding development in Molgulid ascidians. Stolfi et al, eLife 2014; etc.
![Page 74: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/74.jpg)
Open science
Guiding principle: methods that aren’t broadly available aren’t very useful.
(=> Preprints, open source code, blog posts, Twitter, training, etc.)
Estimated ~1000 users of our software.
Diginorm now included in Trinity software from Broad Institute (~10,000 users)
Illumina TruSeq long-read technology now incorporates our approach (~100,000 users)
![Page 75: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/75.jpg)
Current research:
Compressive algorithms for sequence analysis
Can we enable and accelerate sequence-based inquiry by making all basic analysis easier and some analyses possible?
![Page 76: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/76.jpg)
The data challenge in biology
In 5-10 years, we will have nigh-infinite data. (Genomic, transcriptomic, proteomic, metabolomic, …?)
We currently have no good way of querying, exploring, investigating, or mining these data sets, especially across multiple locations.
Moreover, most data is unavailable until after publication…
…which, in practice, means it will be lost.
![Page 77: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/77.jpg)
Infrastructure: distributed graph database server
![Page 78: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/78.jpg)
“Data Intensive Biology”
• Increasingly, relevant data is out there or can be generated fairly inexpensively.
• But what does the data mean? How can we get it to yield putative answers? How can we integrate it with other people’s data?
• Virtually nobody in biology is trained to do this.
• Virtually nobody in biology is being trained in how to do this.
![Page 79: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/79.jpg)
Summer NGS workshop (2010-2017)
![Page 80: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/80.jpg)
Perspectives on training
• Prediction: The single biggest challenge facing biology over the next 20 years is the lack of data analysis training (see: NIH DIWG report)
• Data analysis is not turning the crank; it is an intellectual exercise on par with experimental design or paper writing.
• Training is systematically undervalued in academia (!?)
![Page 81: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/81.jpg)
Training - looking forward
• NIH “Big Data 2 Knowledge” (BD2K) will be investing ~$20-40m in training each year (my estimate). Biomedical science increasingly depends on data analysis.
• Moore, Sloan Foundations are investing heavily in training (see: Software Carpentry)
• NSF BIO Centers have stated that “training is the second most important problem that all of us have”.
We need to figure out solutions…
![Page 82: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/82.jpg)
Funding
![Page 83: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/83.jpg)
Students and postdocs
Former:
• Dr. Jason Pell (Google NYC)
• Asst Professor Adina Howe (Iowa State)
Current:
• Dr. Likit Preeyanon (MMG)
• Elijah Lowe (CSE)
• Qingpeng Zhang (CSE)
• Jaron Guo (MMG)
• Camille Scott (CSE)
• Michael Crusoe
• Luiz Irber (CSE)
• Dr. Sherine Awad (MMG)
![Page 84: 2014 bangkok-talk](https://reader033.fdocuments.us/reader033/viewer/2022051609/5476c3a8b4af9fed5f8b4686/html5/thumbnails/84.jpg)