2016 bergen-sars
![Page 1: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/1.jpg)
A 12-step program for biology to survive and thrive in the era of data-intensive science
C. Titus Brown
Genome Center & Data Science Initiative
Mar 18, 2016
Slides are on slideshare.net/c.titus.brown/
![Page 2: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/2.jpg)
My path:
• Marek’s disease
• Soil metagenomics
• Ascidian GRNs
• Lamprey mRNAseq
![Page 3: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/3.jpg)
My guiding question
What is going to be happening in the next 5 years with biological data generation?
(And can I make progress on some of the coming problems?)
![Page 4: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/4.jpg)
DNA sequencing rates continue to grow.
Stephens et al., 2015 - 10.1371/journal.pbio.1002195
![Page 5: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/5.jpg)
(2015 was a good year)
![Page 6: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/6.jpg)
Oxford Nanopore sequencing
Slide via Torsten Seemann
![Page 7: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/7.jpg)
Nanopore technology
Slide via Torsten Seemann
![Page 8: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/8.jpg)
Scaling up --
![Page 9: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/9.jpg)
Scaling up --
![Page 10: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/10.jpg)
Slide via Torsten Seemann
![Page 11: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/11.jpg)
http://ebola.nextflu.org/
![Page 12: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/12.jpg)
“Fighting Ebola With a Palm-Sized DNA Sequencer”
See: http://www.theatlantic.com/science/archive/2015/09/ebola-sequencer-dna-minion/405466/
![Page 13: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/13.jpg)
“DeepDOM” cruise: examination of dissolved organic matter & microbial metabolism vs physical parameters – potential collab.
Via Elizabeth Kujawinski
Another challenge beyond volume and velocity – variety.
![Page 14: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/14.jpg)
CRISPR
The challenge with genome editing is fast becoming what to edit rather than how to do it.
![Page 15: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/15.jpg)
A point for reflection…
Increasingly, the best guide to the next 10 years of biology is science fiction ...
![Page 16: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/16.jpg)
Digital normalization
Statement of problem: We can’t run de novo assembly on the transcriptome data sets we have!
![Page 17: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/17.jpg)
Shotgun sequencing and coverage
“Coverage” is simply the average number of reads that overlap each true base in the genome.
Here, the coverage is ~10 – just draw a line straight down from the top through all of the reads.
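The "draw a line straight down" picture can be sketched in a few lines of Python. This is only an illustration of the definition above, assuming fixed-length reads that all start inside the genome; the function names and the toy 20 bp genome are made up for the example.

```python
def per_base_depth(read_starts, read_length, genome_length):
    """Return the number of reads overlapping each genome position."""
    # Difference array: +1 where a read starts, -1 just past where it ends.
    delta = [0] * (genome_length + 1)
    for start in read_starts:
        end = min(start + read_length, genome_length)
        delta[start] += 1
        delta[end] -= 1
    depth = []
    running = 0
    for change in delta[:genome_length]:
        running += change
        depth.append(running)
    return depth


def average_coverage(read_starts, read_length, genome_length):
    """Average depth over all positions, i.e. the 'coverage' of the slide."""
    depth = per_base_depth(read_starts, read_length, genome_length)
    return sum(depth) / genome_length


# Toy example: a 20 bp genome sampled by 5 bp reads starting at positions 0..15.
starts = list(range(16))
depth = per_base_depth(starts, 5, 20)
coverage = average_coverage(starts, 5, 20)
print(depth)
print(coverage)
```

Note that the ends of the toy genome are covered less deeply than the middle, which is one reason the average alone understates how much sampling is needed.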
![Page 18: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/18.jpg)
Random sampling => deep sampling needed
Typically 10-100x needed for robust recovery (30-300 Gbp for human)
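The 30-300 Gbp figure follows from multiplying genome size by target coverage. A quick check, assuming a human genome of roughly 3 Gbp (the value implied by the numbers above) and, for the read count, a hypothetical read length of 100 bp:

```python
HUMAN_GENOME_BP = 3_000_000_000  # assumption: ~3 Gbp, as implied by "30-300 Gbp"


def bases_needed(genome_size_bp, target_coverage):
    """Total sequenced bases required to reach an average coverage."""
    return genome_size_bp * target_coverage


def reads_needed(genome_size_bp, target_coverage, read_length_bp):
    """Number of fixed-length reads required, rounded up."""
    total = bases_needed(genome_size_bp, target_coverage)
    return -(-total // read_length_bp)  # ceiling division on integers


for cov in (10, 100):
    gbp = bases_needed(HUMAN_GENOME_BP, cov) / 1e9
    n_reads = reads_needed(HUMAN_GENOME_BP, cov, 100)  # 100 bp reads: illustrative only
    print(f"{cov}x coverage: {gbp:.0f} Gbp, about {n_reads:,} reads of 100 bp")
```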
![Page 19: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/19.jpg)
Digital normalization
![Page 20: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/20.jpg)
Digital normalization
![Page 21: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/21.jpg)
Digital normalization
![Page 22: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/22.jpg)
Digital normalization
![Page 23: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/23.jpg)
Digital normalization
![Page 24: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/24.jpg)
Digital normalization
![Page 25: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/25.jpg)
(Digital normalization is a computational version of library normalization)
Suppose you have a dilution factor of A (10) to B (1). To get 10x of B you need to get 100x of A!
Overkill!!
This 100x will consume disk space
and, because of errors, memory.
We can discard it for you…
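The dilution arithmetic can be written out explicitly. The sketch below assumes all components have the same length, so relative abundance translates directly into relative depth; `sequencing_plan` is a hypothetical helper, not part of any tool mentioned in the talk.

```python
def sequencing_plan(relative_abundance, target_coverage):
    """Depth reached by each component once the rarest one hits the target.

    relative_abundance: dict mapping component name to relative abundance,
    e.g. {"A": 10, "B": 1}. Components are assumed to be of equal length.
    Returns (depth per component, fraction of data beyond the target).
    """
    rarest = min(relative_abundance.values())
    depth = {
        name: target_coverage * abundance / rarest
        for name, abundance in relative_abundance.items()
    }
    total = sum(depth.values())
    # What would remain if every component were capped at the target coverage.
    kept = target_coverage * len(relative_abundance)
    redundant_fraction = 1 - kept / total
    return depth, redundant_fraction


depth, redundant = sequencing_plan({"A": 10, "B": 1}, target_coverage=10)
print(depth)      # A reaches 100x by the time B reaches 10x
print(redundant)  # share of the data that exceeds 10x per component
```

Under these assumptions roughly four fifths of the reads add depth to A beyond the target, which is the portion a normalization step could discard.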
![Page 26: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/26.jpg)
Some key points --
• Digital normalization is streaming.
• Digital normalization is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass)
• Currently, primarily used for prefiltering for assembly, but relies on underlying abstraction (De Bruijn graph) that is also used in variant calling.
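A minimal single-pass sketch of the idea, assuming the median k-mer abundance rule described in the published diginorm work: a read is kept only if the median count of its k-mers, among reads kept so far, is below a cutoff. This toy version uses an exact in-memory `Counter`; the khmer software uses a memory-efficient probabilistic counter instead, and the parameter values here are illustrative only.

```python
from collections import Counter
from statistics import median


def kmers(sequence, k):
    """Yield every k-length substring of a read."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]


def digital_normalization(reads, k=20, cutoff=20):
    """Stream reads once, yielding those that still add coverage.

    A read is yielded when the median count of its k-mers (counted over
    previously kept reads only) is below `cutoff`.
    """
    counts = Counter()
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate from
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read


# Toy data set: A is ten times more abundant than B (the dilution example).
A = "ACGTACGGTTCAGGCTAAGCTTGA"
B = "TTGACCATGGCAATCGGATTCCAG"
reads = [A] * 100 + [B] * 10
kept = list(digital_normalization(reads, k=8, cutoff=5))
print(len(reads), "reads in,", len(kept), "reads kept")
```

Because the function is a generator that consumes any iterable, it never needs the whole data set in memory, only the k-mer counts of the reads it has kept.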
![Page 27: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/27.jpg)
Assembly now scales with information content, not data size.
• 10-100 fold decrease in memory requirements
• 10-100 fold speed up in analysis
![Page 28: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/28.jpg)
Diginorm is widely useful:
1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem. (Goffredi et al., 2013; pmid 24225886)
![Page 29: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/29.jpg)
Anecdata: diginorm is used in Illumina long-read sequencing (?)
![Page 30: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/30.jpg)
Computational problems now scale with information content rather than data set size.
Most samples can be reconstructed via de novo assembly on commodity computers.
![Page 31: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/31.jpg)
Applying digital normalization in a new project – the horse transcriptome
Tamer Mansour w/Bellone, Finno, Penedo, & Murray labs.
![Page 32: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/32.jpg)
Input data

| Tissue | Library | Read length | # samples | # frag (M) | # bp (Gb) |
|---|---|---|---|---|---|
| BrainStem | PE fr.firststrand | 101 | 8 | 166.73 | 33.68 |
| Cerebellum | PE fr.firststrand | 100 | 24 | 411.48 | 82.3 |
| Muscle | PE fr.firststrand | 126 | 12 | 301.94 | 76.08 |
| Retina | PE fr.unstranded | 81 | 2 | 20.3 | 3.28 |
| SpinalCord | PE fr.firststrand | 101 | 16 | 403 | 81.4 |
| Skin | PE fr.unstranded | 81 | 2 | 18.54 | 3 |
| | SE fr.unstranded | 81 | 2 | 16.57 | 1.34 |
| | SE fr.unstranded | 95 | 3 | 105.51 | 10.02 |
| Embryo ICM | PE fr.unstranded | 100 | 3 | 126.32 | 25.26 |
| | SE fr.unstranded | 100 | 3 | 115.21 | 11.52 |
| Embryo TE | PE fr.unstranded | 100 | 3 | 129.84 | 25.96 |
| | SE fr.unstranded | 100 | 3 | 102.26 | 10.23 |
| Total | | | 81 | 1917.7 | 364.07 |
![Page 33: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/33.jpg)
EquCab2 current status - NCBI Annotation
Tamer Mansour
![Page 34: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/34.jpg)
[Workflow diagram; box labels as extracted:]
Library prep; Read trimming; Mapping to ref; Merge rep.; Trans Ass.; Merge by Tiss.; Predict ORF; Variant Ana; Update dbvar; Haplotype ass; Pool/diginorm; Predict ncRNA; Filter & Compare Ass.; filter knowns; Compare to public ann.; Merge All Ass.; Mapping to ref; Trans Ass.
Tamer Mansour
![Page 35: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/35.jpg)
Digital normalization & (e.g.) horse transcriptome
The computational demands for Cufflinks
- Read binning (processing time)
- Construction of gene models: number of genes, number of splicing junctions, number of reads per locus, sequencing errors, and complexity of the locus such as gene overlap and multiple isoforms (processing time & memory utilization)
Diginorm
- Significant reduction of binning time
- Relative increase of the resources required for gene model construction with merging more samples and tissues
- ? false recombinant isoforms
Tamer Mansour
![Page 36: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/36.jpg)
Effect of digital normalization
** Should be very valuable for detection of ncRNA
Tamer Mansour
![Page 37: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/37.jpg)
The ORF problem
Hestand et al 2014: “we identified 301,829 positions with SNPs or small indels within these transcripts relative to EquCab2. Interestingly, 780 variants extend the open reading frame of the transcript and appear to be small errors in the equine reference genome”
Tamer Mansour
![Page 38: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/38.jpg)
We merged the assemblies into six tissue-specific transcription profiles for cerebellum, brainstem, spinal cord, retina, muscle and skin. The final merger of all assemblies overlaps with 63% and 73% of NCBI and Ensembl loci, respectively, capturing about 72% and 81% of their coding bases. Comparing our assembly to the most recent transcriptome annotation shows ~85% overlapping loci. In addition, at least 40% of our annotated loci represent novel transcripts.
Tamer Mansour
![Page 39: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/39.jpg)
Diginorm can also process data as it comes in – streaming
decision making.
![Page 40: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/40.jpg)
What do we do when we get new data??
• How do we efficiently process, update our existing resources?
• How do we evaluate whether or not our prior conclusions need to change or be updated?
– # of genes, & their annotations;
– Differential expression based on new isoforms;
• This is a problem everyone has…and it’s not going away…
![Page 41: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/41.jpg)
The data challenge in biology
So we can sequence everything – so what?
What does it mean?
How can we do better biology with the data?
How can we understand?
![Page 42: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/42.jpg)
A 12-step program for biology (??)
(This was a not terribly successful attempt to be entertaining.)
![Page 43: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/43.jpg)
1. Think repeatability and scaling
What works for one data set,
Doesn’t work as well for three,
And doesn’t work at all for 100.
![Page 44: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/44.jpg)
2. Think streaming / few-pass analysis
versus
![Page 45: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/45.jpg)
3. Invest in computational training
Summer NGS workshop (2010-2017)
![Page 46: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/46.jpg)
4. Move beyond PDFs
This is only part of the story!
Subramanian et al., doi: 10.1128/JVI.01163-13
![Page 47: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/47.jpg)
5. Focus on a biological question
Generating data for the sake of having data leads you into a data analysis maze – “I’m sure there’s something interesting in there… somewhere.”
![Page 48: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/48.jpg)
6. Spend more effort on the unknowns!
The problem of lopsided gene characterization is pervasive: e.g., the brain "ignorome"
"...ignorome genes do not differ from well-studied genes in terms of connectivity in coexpression networks. Nor do they differ with respect to numbers of orthologs, paralogs, or protein domains. The major distinguishing characteristic between these sets of genes is date of discovery, early discovery being associated with greater research momentum—a genomic bandwagon effect."
Ref.: Pandey et al. (2014), PLoS One 11, e88889. Via Erich Schwarz
![Page 49: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/49.jpg)
7. Invest in data integration.
Figure 2. Summary of challenges associated with the data integration in the proposed project.
Figure via E. Kujawinski
![Page 50: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/50.jpg)
8. Split your information into layers
Protein coding >> ncRNA >> ???
** Should be very valuable for detection of ncRNA
*** But what the heck do we do with ncRNA information?
Tamer Mansour
![Page 51: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/51.jpg)
9. Move to an update model.
![Page 52: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/52.jpg)
Candidates for additional steps…
• Invest in data sharing and better “reference” infrastructure.
• Build better tools for computationally exploring hypotheses.
• Invest in “unsupervised” analysis of data (machine learning)
• Learn/apply multivariate stats.
• Invest in social media & preprints & “open”
![Page 53: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/53.jpg)
My future plans?
• Protocols and (distributed) platform for data discovery & sharing.
• Data analysis and integration in marine biogeochemistry & microbial physiology
![Page 54: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/54.jpg)
Fig. 1: The cycle from data to discovery, over models back to experiment, that generates knowledge as the cycle is repeated. Parts of that cycle are standard in particular disciplines, but putting together the full cycle requires transdisciplinary expertise.
![Page 55: 2016 bergen-sars](https://reader036.fdocuments.us/reader036/viewer/2022062823/5879a6db1a28ab082c8b7177/html5/thumbnails/55.jpg)
Training program at UC Davis:
• Regular intensive workshops, half-day or longer.
• Aimed at research practitioners (grad students & more senior); open to all (including outside community).
• Novice (“zero entry”) on up.
• Low cost for students.
• Leverage global training initiatives.
(Google “dib training” for details; join the announce list!)