In the field : the bulletin of the Field Museum of Natural History
Big Data Field Museum
-
Upload
adina-chuang-howe -
Category
Science
-
view
91 -
download
0
Transcript of Big Data Field Museum
RIDING THE BIG DATA
TIDAL WAVE OF MODERN
MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
Understanding community
dynamics
Who is there?
What are they doing?
How are they doing it?
Understanding community
dynamics
Who is there?
What are they doing?
How are they doing it?
Kim Lewis, 2010
Gene / Genome Sequencing
Collect samples
Extract DNA
Sequence DNA
“Analyze” DNA to identify its content and origin
Taxonomy
(e.g., pathogenic E. Coli)
Function
(e.g., degrades cellulose)
Effects of low cost
sequencing…
First free-living bacterium sequenced
for billions of dollars and years of
analysis
Personal genome can be
mapped in a few days and
hundreds to few thousand
dollars
The experimental continuum
Single Isolate
Pure Culture
Enrichment
Mixed CulturesNatural systems
The era of big data in biology
Stein, Genome Biology, 2010
Computational Hardware
(doubling time 14 months)
Sanger Sequencing
(doubling time 19 months)
NGS (Shotgun) Sequencing
(doubling time 5 months)
1990 1992 1994 1996 1998 2000 2003 2004 2006 2008 2010 2012
Year
0
1
10
100
1,000
10,000
100,000
1,000,000
Dis
k S
tora
ge,
Mb/$
0.1
1
10
100
1,000
10,000
100,000
1,000,000
DN
A S
equencin
g, M
bp
per $
10,000,000
100,000,000
0.1
1
10
100
1,000
10,000
100,000
1,000,000
10,000,000
100,000,000
Postdoc experience with data
2003-2008 Cumulative sequencing in PhD = 2000 bp
2008-2009 Postdoc Year 1 = 50 Gbp
2009-2010 Postdoc Year 2 = 450 Gbp
2014 = 50 Tbp
2015 = 500 Tbp budgeted
“Soil Census” to “Soil Catalogs”: Who is there?
Targetting conserved regions
of known genes
Most popular:
16S ribosomal RNA gene –
conserved in bacteria and
archaea
Who is there - community
profiling based on sequence
similarity
Must have previous
knowledge of genes
Must infer function based on
phylogeny – not advised
TARGETTED SEQUENCING
STRATEGY
“Soil Census” to “Soil Catalogs”: Who is there?
Targetting conserved regions
of known genes
Most popular:
16S ribosomal RNA gene –
conserved in bacteria and
archaea
Who is there - community
profiling based on sequence
similarity
Must have previous
knowledge of genes
Must infer function based on
phylogeny – not advised
$15 / sample
TARGETTED SEQUENCING
STRATEGY
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
THE DIRT ON SOIL
Biodiversity in the dark, Wall et al., Nature Geoscience, 2010 Jeremy Burgress
MAGNIFICENT BIODIVERSITY
THE DIRT ON SOIL
SPATIAL HETEROGENEITY
http://www.fao.org/ www.cnr.uidaho.edu
THE DIRT ON SOIL
DYNAMIC
THE DIRT ON SOIL
INTERACTIONS: BIOTIC, ABIOTIC, ABOVE, BELOW, SCALES
Philippot, 2013, Nature Reviews Microbiology
Our shared challenges
Climate Change
Energy Supply
USGCRP 2009
www.alutiiq.com
http://guardianlv.com/
Human Health
An understanding
of microbial ecology
SOIL MICROBIOLOGY: CARBON
REGULATION
The anthropogenic CO2 production is only 10% of that of the soil
Sustainable agriculture permits carbon
sequestration in the range of 0.3 – 1 ton of
C/ha.yr ~ 10% of all carbon emitted by cars
(Denman et al., 2007; Climate Change 2007: The Physical Science Basis. Contribution of Working Group I to the Fourth Assessment Report of the Intergovernmental Panel on Climate Change)
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
Lesson #1: Accessing information in
data
http://siliconangle.com/files/2010/09/image_thumb69.png
de novo assembly
Compresses dataset size significantly
Improved data quality (longer sequences, gene order)
Reference not necessary (novelty)
Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes
Metagenome assembly…a scaling
problem.
Shotgun sequencing and de novo
assembly
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of
“computer
crunching” on a
super computer
Practical Challenges – Intensive
computing
Howe et al, 2014, PNAS
Months of
“computer
crunching” on a
super computerAssembly of 300 Gbp can be
done with any assembly program
in less than 14 GB RAM and less
than 24 hours.
Natural community characteristics
Diverse
Many organisms
(genomes)
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x Sample 10x
Natural community characteristics
Diverse
Many organisms
(genomes)
Variable abundance
Most abundant organisms, sampled
more often
Assembly requires a minimum amount
of sampling
More sequencing, more errors
Sample 1x Sample 10x
Overkill
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Digital normalization
Brown et al., 2012, arXiv
Howe et al., 2014, PNAS
Zhang et al., 2014, PLOS One
Scales datasets for assembly up to 95% - same assembly
outputs.
Genomes, mRNA-seq, metagenomes (soils, gut, water)
Tackling Soil Biodiversity
Source: Chuck Haney
C. Titus Brown, James Tiedje, Qingpeng Zhang, Jason Pell (MSU)
Janet Jansson, Susannah Tringe (JGI)
The reality?
More like…
Source: Chuck HaneyHowe et. al, 2014, PNAS
SOIL METAGENOME REALITY CHECK
Grand Challenge effort –
10% of soil biodiversity
sampled
Incredible soil biodiversity
(estimate required 10
Tbp/sample)
“To boldly go where no man
has gone before”: >60%
Unknown
0
100
200
300
400
am
ino a
cid
meta
bolis
m
carb
ohydra
te m
eta
bo
lism
mem
bra
ne tra
nspo
rt
sig
nal tr
ansdu
ction
transla
tion
fold
ing
, sort
ing a
nd d
egra
da
tion
meta
bolis
m o
f co
facto
rs a
nd v
itam
ins
energ
y m
eta
bolis
m
transp
ort
and
cata
bolis
m
lipid
meta
bolis
m
tra
nscri
ption
ce
ll g
row
th a
nd
dea
th
replic
ation
and
rep
air
xen
obio
tics b
iod
egra
datio
n a
nd m
eta
bo
lism
nucle
otide m
eta
bolis
m
gly
can b
iosynth
esis
and m
eta
bolis
m
meta
bolis
m o
f te
rpenoid
s a
nd
poly
ke
tides
cell
motilit
y
Tota
l C
ount
KO
corn and prairie
corn only
prairie only
Howe et al, 2014, PNAS
Managed agriculture soils exhibit less
diversity, likely from its history of
cultivation.
Frustrating, but helpful
“Low input, high throughput, no output?” (Sean Eddy / Sydney Brenner)
Evaluation of sequencing as a tool
Broad characterization
“Right” kind of data
How much should I sequence?
Data characteristics
Breadth vs. depth of sampling
Computational tool development
Tr
Lesson #2: Connecting the dots
from data to information
If 80% is
unknown…what
can one do?
The idea of co-occurrence
Co-occurrence to detect novelty
Williams et al., 2014, Frontiers of Micro
Causation vs correlation
Wikipedia.com
Success story in Human
Microbiome
#3 Is more data better?
Bottlenecks for the emerging
microbiologists
Technical challenges – many
solutions
Access to data and its value
Access to resources
Data volume and velocity “clog”
Data is very heterogeneous
Software Developers
Computer Scientists
Clinicians
PIs
Data generators
Microbiologists
Data Analyzers
Statisticians
Bioinformaticians
http://ivory.idyll.org/blog/2014-the-emerging-field-of-data-intensive-biology.html
Data intensive microbiology
Social obstacles – the main
challenge
Shift of costs do not mean shift of
expectations
http://www.deluxebattery.com/25-hilarious-expectation-vs-reality-photos/
Dear PI,
It will take longer than
the time it took you to do
your experiment to
analyze the data. Please
do not write me for
results within 24 hours of
your sequences
becoming available.
- Adina
Culture of sharing
http://www.heathershumaker.com/
Metagenomic Datasets
Training / Incentives
Emails between collaborators don’t contain as
much “science” as I’d like:
All analysis: accessible,
reproducible, and automated
RIDING THE BIG DATA
TIDAL WAVE OF MODERN
MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
“”
RIDING THE BIG DATA
TIDAL WAVE OF MODERN
MICROBIOLOGY
Adina Howe
Argonne National Laboratory / Michigan State University
Iowa State University, Ag & Biosystems Engr (January)
Acknowledgements
C. Titus Brown (MSU)
James Tiedje (MSU)
Daina Ringus (UC)
Folker Meyer (ANL)
Eugene Chang (UC)
NSF Biology Postdoc Fellowship
DOE Great Lakes Bioenergy Research Center