Job Talk Iowa State University Ag Bio Engineering

Post on 16-Jul-2015

Transcript of Job Talk Iowa State University Ag Bio Engineering

RIDING THE BIG DATA

TIDAL WAVE IN

MODERN

MICROBIOLOGY

IOWA STATE UNIVERSITY

MARCH 12, 2014

Adina Howe, PhD

Outline of talk

My multi-discipline career

Biological sequencing: a game changer

Research – computational focus:

How to handle “big data” in biology

Research – biological focus:

The gut microbiome’s role in obesity?

Future research:

A flexible toolbox in a big playground

Background

Purdue University, BSME,

Mechanical Engineering

Purdue University, MS,

Environmental Engineering

(Sustainability)

University of Iowa, PhD,

Environmental Engineering

(Microbiology/Bioremediation)

Michigan State University

NSF Postdoc Math and Biology Fellow (cross-training)

Microbial Ecology (Jim Tiedje)

Bioinformatics (Titus Brown)

Computational Biologist

Microbiology / Microbial Ecology

Our shared challenges

Climate Change

Energy Supply

USGCRP 2009

www.alutiiq.com

http://guardianlv.com/

Human Health

An understanding

of microbial ecology

Environmental continuum

MICROBES

IN

ECOSYSTEMS

NATURE

AIR

WATER

SOIL

MICROBIOMES

HUMANS/ANIMAL

ENGINEERED

BIOREACTORS

WASTEWATER

Understanding community

dynamics

Who is there?

What are they doing?

How are they doing it?

Kim Lewis, 2010

Gene / Genome Sequencing

Collect samples

Extract DNA

Sequence DNA

“Analyze” DNA to identify its content and origin

Taxonomy

(e.g., pathogenic E. coli)

Function

(e.g., degrades cellulose)

Cost of Sequencing

Stein, Genome Biology, 2010

E. coli genome 4,500,000 bp ($4.5M, 1992)

[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000), by year, 1990 to 2012]

Rapidly decreasing costs with

NGS Sequencing

Stein, Genome Biology, 2010

Next Generation Sequencing

4,500,000 bp (E. coli, $200, presently)

[Figure: DNA sequencing, Mbp per $ (log scale, 0.1 to 100,000,000), by year, 1990 to 2012]
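The two E. coli price points quoted on these slides can be turned into a cost per base pair and a fold reduction. This is a small arithmetic sketch only; the function names are illustrative, and the dollar figures are taken as stated on the slides.

```python
# Figures as quoted on the slides for a 4,500,000 bp E. coli genome.
GENOME_BP = 4_500_000
COST_1992_USD = 4_500_000  # "$4.5M, 1992"
COST_NOW_USD = 200         # "$200, presently"


def cost_per_bp(total_cost_usd, genome_bp):
    """Cost in dollars of sequencing a single base pair."""
    return total_cost_usd / genome_bp


def fold_reduction(old_cost_usd, new_cost_usd):
    """How many times cheaper the new price is than the old one."""
    return old_cost_usd / new_cost_usd


print("1992 cost per bp (USD):", cost_per_bp(COST_1992_USD, GENOME_BP))
print("Present cost per bp (USD):", cost_per_bp(COST_NOW_USD, GENOME_BP))
print("Fold reduction:", fold_reduction(COST_1992_USD, COST_NOW_USD))
```

On these numbers the 1992 price works out to one dollar per base pair.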

Effects of low cost

sequencing…

First free-living bacterium sequenced

for billions of dollars and years of

analysis

Personal genome can be mapped in a few days for hundreds to a few thousand dollars

The experimental continuum

Single Isolate

Pure Culture

Enrichment

Mixed Cultures

Natural systems

The era of big data in biology

Stein, Genome Biology, 2010

Computational Hardware

(doubling time 14 months)

Sanger Sequencing

(doubling time 19 months)

NGS (Shotgun) Sequencing

(doubling time 5 months)

[Figure: disk storage, Mb/$, and DNA sequencing, Mbp per $ (both log scale), by year, 1990 to 2012]
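To make the doubling times quoted on this slide concrete, the sketch below converts a doubling time in months into a fold increase over a chosen number of years. It assumes simple exponential growth at a constant doubling time; the five-year window and the function name are illustrative choices, not from the talk.

```python
def fold_growth(years, doubling_months):
    """Fold increase after `years`, assuming a constant doubling time in months."""
    return 2 ** (years * 12 / doubling_months)


# Doubling times as quoted on the slide.
DOUBLING_MONTHS = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}

for name, months in DOUBLING_MONTHS.items():
    print(f"{name}: about {fold_growth(5, months):,.0f}-fold in 5 years")
```

Under this assumption, a 5-month doubling time compounds to twelve doublings in five years, while a 14-month doubling time gives only a little over four, which is the gap between data generation and compute that the slide points at.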

Postdoc experience with data

2003-2008 Cumulative sequencing in PhD = 2000 bp

2008-2009 Postdoc Year 1 = 50 Gbp

2009-2010 Postdoc Year 2 = 450 Gbp

Flexibility towards embracing change.

How to survive a data

deluge?

Experiment

Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

Reducing data volume:

Assembly of Metagenomic

Sequences

MSU: C. Titus Brown and James Tiedje

de novo assembly

Compresses dataset size significantly

Improved data quality (longer sequences, gene order)

Reference not necessary (novelty)

Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes

Metagenome assembly…a scaling

problem.

Shotgun sequencing and de novo

assembly

It was the Gest of times, it was the wor

, it was the worst of timZs, it was the

isdom, it was the age of foolisXness

, it was the worVt of times, it was the

mes, it was Ahe age of wisdom, it was th

It was the best of times, it Gas the wor

mes, it was the age of witdom, it was th

isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
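The reads on this slide are fragments of the sentence with single-character sequencing errors. As a minimal illustration (not an assembler, and only possible here because the true sentence is known), the sketch below slides each read along the sentence and reports where it fits with the fewest mismatches. The function name is hypothetical.

```python
REFERENCE = (
    "It was the best of times, it was the worst of times, "
    "it was the age of wisdom, it was the age of foolishness"
)

# Two of the error-containing reads shown on the slide.
READS = [
    "It was the Gest of times, it was the wor",
    "isdom, it was tIe age of foolishness",
]


def best_placement(read, reference):
    """Return (position, mismatches) of the best ungapped fit of read in reference."""
    best = None
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best


for read in READS:
    pos, mismatches = best_placement(read, REFERENCE)
    print(f"{read!r}: position {pos}, {mismatches} mismatch(es)")
```

De novo assembly has no reference sentence to compare against, so it has to recover the sentence from overlaps between reads alone, which is why errors and repeated phrases make it hard.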

Practical Challenges – Intensive

computing

Howe et al, 2014, PNAS

Months of “computer crunching” on a supercomputer

Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.

Natural community characteristics

Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x

Sample 1x Sample 10x

Overkill

Digital normalization

Brown et al., 2012, arXiv

Howe et al., PNAS, 2014

Scales down datasets for assembly by up to 95%, with the same assembly outputs.

Genomes, mRNA-seq, metagenomes (soils, gut, water)
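The idea behind digital normalization can be sketched in a few lines: a read is kept only if the median count of its k-mers, among reads kept so far, is still below a coverage cutoff. This is a simplified illustration that assumes exact counting in a Python `Counter` rather than the memory-efficient probabilistic counting used by the khmer software; the values of `k` and `cutoff` are toy choices.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its median k-mer count is below the cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


# An over-sampled sequence (10 copies) plus one rarely sampled sequence.
reads = ["AGTCAGTT"] * 10 + ["AGAAAGTC"]
print(digital_normalization(reads))
```

In this toy input the abundant read is kept only until its k-mers reach the cutoff, and the rare read is retained, so redundant data is discarded without dropping low-coverage sequences.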

Partitioning (khmer software)

Pell et al, 2012, PNAS

Howe et al., 2014, PNAS

Separates metagenomes by species

Parallel computing possible

Largest known published soil metagenome and assembly
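Partitioning can be pictured as grouping reads that are connected through shared k-mers, so that each group can be assembled separately and in parallel. The sketch below is a toy version using a union-find structure over whole reads; it is an assumption-laden simplification, not the khmer implementation, and the function name and example reads are hypothetical.

```python
def partition_reads(reads, k=4):
    """Group reads into partitions connected by at least one shared k-mer."""
    parent = list(range(len(reads)))

    def find(i):
        # Follow parent links to the representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if word in first_seen:
                a, b = find(idx), find(first_seen[word])
                if a != b:
                    parent[a] = b  # merge the two partitions
            else:
                first_seen[word] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


reads = ["AGTCAGTT", "AGAAAGTC", "CCCCGGGG", "CGGGGTTA"]
for group in partition_reads(reads):
    print(group)
```

Reads that never share a word end up in different partitions, which is what makes independent, parallel assembly of each partition possible.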

Tackling Soil Biodiversity

Source: Chuck Haney

Grand Challenge effort –

10% of soil biodiversity

sampled

Incredible soil biodiversity

(estimated to require 10 Tbp/sample)

“To boldly go where no man

has gone before”: >60%

Unknown

[Figure: total count of KO (axis 0 to 400) by functional category: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility. Series: corn and prairie, corn only, prairie only]

Howe et al, 2014, PNAS

Big data combined with microbiology will change lives.

The health and stability of the gut

microbiome (in response to diet change)

University of Chicago: Daina Ringus, PhD & Eugene Chang, MD

Experiment

Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

We are supraorganisms

Interactions between the

microbiome and the environment

Source: Zhao, 2013

Obesity

Intestinal inflammation

IBD diseases

Diet has a greater potential to shape the structure and function of the gut than host genetics.

Direct influence on health state

How resilient is the microbiome?

In humans, microbiome rapidly and reproducibly recovers within 2 days (2013)

In mice, rapid recovery from long term shift to obesity-inducing diet (2012)

Is the gut community going viral?

Reyes et al., Nature Reviews Microbiology, 2012

Bacterial cells

Bacterial cells infected with bacteriophage

Viruses (Bacteriophage)

Vary by individual (Minot et al., 2011)

Altered by diet and co-vary with bacteria (Minot et al., 2011)

Long term stable (Minot et al., 2013)

Largely temperate (Reyes et al., 2013)

Prophage

Who is in the gut microbiome?

Research Questions

What are the impacts of different diets on gut

microbiome response?

What are the impacts of viruses in the gut microbiome (rapid alteration and resilient response)?

Multidisciplinary approach combining

novel experimental targeting of both bacterial and viral

communities

metagenomic-based sequencing to characterize

community

Novel experimental design – targeted

sampling of community fractions

I. Total DNA (bacteria + prophage + viruses): TOT

II. Virus-like particles (free-living viruses): VLP

III. Induced prophage: IND

Separation

by density

Chemically

separate

Separation

by size

Microbiome through

faecal matter (non

destructive sampling)

Two baseline diets (with a

perturbation)

Low-fat (LF) baseline diet

Milk-fat (MF) baseline diet

Age (wk)

4 5 6 7 8 9 10 11 12 13 14

Baseline

Diet Switch

Washout (Return to Baseline)

Total community function: TOT metagenomic sequencing at weeks 8, 11, 14

Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14

Weight of mice and count of VLPs with microscopy

Taxonomy analysis (only 16S rRNA gene) every week from week 8 – 14.

LF / 10% Fat / Complex Carbs

MF / 37% Fat / Simple Sugars

MF

LF MF

LF

Fecal Samples

Outcomes?

Qualitative and Quantitative Measurements:

Who is there? What are they doing?

How much?

How does the community change

over time?

[Figure: distance from baseline across the Baseline, Intervention, and Washout phases; panels: Altered-Recovery, Altered-Altered]

Measurements of gene abundance profile

(200,000+ genes) reduced to a single

distance measurement from the original

community (ordination)

[Figure panel: No Change; distance from baseline across the Baseline, Intervention, and Washout phases]
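Reducing a gene abundance profile to a single distance from the baseline community can be sketched as follows. The slides do not name the distance metric, so Bray-Curtis dissimilarity is used here purely as an assumption, and the four-gene abundance profiles are made-up toy numbers rather than data from the study.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance profiles of equal length."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def distance_from_baseline(profiles, baseline_phase="Baseline"):
    """Map each phase to its distance from the baseline profile."""
    baseline = profiles[baseline_phase]
    return {phase: bray_curtis(baseline, p) for phase, p in profiles.items()}


# Toy abundance profiles (4 genes instead of 200,000+), hypothetical values.
profiles = {
    "Baseline": [10, 20, 30, 40],
    "Intervention": [40, 30, 20, 10],
    "Washout": [12, 18, 30, 40],
}

for phase, d in distance_from_baseline(profiles).items():
    print(f"{phase}: {d:.2f}")
```

In this toy case the washout profile sits close to baseline again, which corresponds to the "Altered-Recovery" pattern; a washout distance that stays high would correspond to "Altered-Altered".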

Rapid and resilient bacterial gut

response after diet alteration

[Figure: distance from baseline across the Baseline, Intervention, and Washout phases; one comparison marked ***]

Diet-specific functional total

community recovery (mostly

bacterial)

[Figure: distance from baseline (axis 0.00 to 0.10) across Baseline, Diet Perturbed, and Washout; one comparison marked ***]

Free living viruses in MF baseline are significantly altered without recovery.

[Figure: distance from baseline (axis 0.0 to 0.3) across Baseline, Diet Perturbed, and Washout; one comparison marked ***]

Prophages in MF baseline are significantly altered without recovery.

[Figure: distance from baseline (axis 0.0 to 0.3) across Baseline, Diet Perturbed, and Washout]

“Combat Zone” as diets change

Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in

which there is not a rapid recovery of viral communities

Viral functions significantly

changed during the milk fat

baseline diet

Decreases in

Phage-related (p=0.01)

Iron acquisition (p<0.01)

Nucleotide metabolism (p=0.02)

Carbohydrate metabolism (p=0.01)

Motility and chemotaxis (p=0.03)

Virulence and defense (p=0.03)

[Figure panels: Phage, Iron, Nucleotide, Carbs, Flagella; x-axis: Baseline, Change, Washout]

Bacteroides (Bacteroidetes)

Clostridium (Firmicutes)

Eubacterium (Firmicutes)

Significant decrease in genes associated with MF baseline viruses

Ratio of Firmicutes and

Bacteroidetes associated with

obesity

Turnbaugh, 2008

Bacteroides fragilis, Nutridesk.com; C. difficile, Bioquell.ie; National Geographic

Turnbaugh, 2009

Viromes potentially critical in gut

microbiome response.

Members of gut microbiome community do not

have co-occurring responses.

Loss of viral population and diversity is diet

specific (related to a milkfat to lowfat diet

transition)

Ability to redirect structure and function of

microbiome makes them pivotal drivers of health and

disease

Reyes et al., Nature Reviews Microbiology, 2012

Virome directly causes host response

Germ Free 11 week old mice (n = 3)

Diet: Standard chow

3 week conventionalization

A “standard control”

Microbiome:

Uniform cecal content

of standard chow

mice

Experimentally

introduced viruses

Mouse Treatment 1: Lowfat baseline VLP

Mouse Treatment 2: Milkfat baseline VLP

Control: Buffer

Significant decrease of intestinal

inflammation in LF VLP treatments

Pro-inflammatory cytokines in mucosal scrapings

[Figure: pro-inflammatory cytokine levels in the proximal colon, TNF-α and IFN-γ (ng/g), for Control, LF VLPs, and MF VLPs; one bar marked *]

Conclusions

Gut microbiome has reproducible and distinct responses to diet.

Viruses have a unique response to diet perturbations and do not co-occur with bacteria.

Viruses observed to cause inflammation in infected germ free mice.

Big data workflow enabled strategic sampling design providing unparalleled access to viruses of gut microbiome

Future work

Data-discovery is a national

investment.

Data-driven biological

investigations

MICROBES

IN

ECOSYSTEMS

NATURE

WATER

SOIL

MICROBIOMES

HUMANS/ANIMAL

ENGINEERED

WASTEWATER

High Throughput Frameworks:

Metagenomic

Metatranscriptomic

Metaproteomic

More relevant model

systems

Improved biomarkers

Scaling approaches

Big data computation

Data driven discovery

Core research values

Research that matters

Developing scientific frameworks that enable

open-science initiatives (reproducible science)

Computational and experimental integration

Scale and power to multi-disciplinary

approaches

Team value

Flexibility

Going viral: The role of the human gut

phageome in inflammatory bowel disease

Objectives:

Define and compare core phageomes associated with healthy and diseased gut microbiomes

Determine impact of disease-associated gut phageomes on development of disease in knockout mouse models (predisposed to disease)

NIH, National Institute of Diabetes and Digestive and

Kidney Diseases; National Institute of Allergy and Infectious

Diseases ($3-5M)

Source: Nature.com

What is the role of host-phage

dynamics in the development of

intestinal diseases?

Integration of multiple datasets

Improved model systems and

biomarkers

Microbial drivers of carbon metabolism and

warming

DOE Biological and Environmental Research ($3M/3 years, 40% PI with ISU Kirsten Hofmockel, 2013-2016)

Source: Oak Ridge National Laboratory

Contributions:

• Omic-based characterization of carbon cycling microorganisms

in the soil

• Novel approaches to target carbon cycling subsets of

community

• Improved soil genomic databases to enable future carbon

studies

Source: Oak Ridge National Laboratory

How do microbes contribute to carbon cycling models?

Big data scaling

Integration of multiple

datasets

Improved model systems

Large-scale characterization of global dark

matter proteins in complex biological

environments

NIH – Development of Software and Analysis Methods for Biomedical

Big Data in Targeted Areas of High Need

(~$1M/3 years)

Gordon and Betty Moore – Data Driven Discovery Investigator Awards

($1.5M / 5 years)

Novel extension of current software tools:

• Integration of growing volumes of global public datasets with scalable

data-mining analysis

• Lightweight data architecture to compare abundance and co-

occurrence of sequencing patterns across multiple samples and

associated metadata to elucidate information

How do we access the novelty observed in metagenomic datasets?

Big data scaling

Integration of datasets

From field to food: The origin and

fate of our microbiomes

USDA Agriculture and Food Research Initiative ($1-2.5M)

• Identify and characterize under-

researched foodborne microbial hazards

and effective control strategies

• Elucidate fate and dissemination of

foodborne microbial hazards associated

with produce production and processing

Source: aboretum.umn.edu

Where do harmful microbes in our food come

from and how do we protect ourselves from

them?

Integration of multiple datasets

Improved model systems and

biomarkers

Acknowledgements

Funding

DOE Microbial Carbon Cycling Grant

NSF Postdoc Fellowship, Great Lakes Bioenergy Research Center

Microbiome: University of Chicago Digestive Diseases Research Core Pilot and Feasibility Grant

My Awesome INTER-DISCIPLINARY Team

C. Titus Brown (MSU) + lab (Bioinformatics)

James Tiedje (MSU) + lab (Microbial Ecology)

Daina Ringus (UC) (Microbiology / Mice)

Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU)

Eugene Chang (UC)

Folker Meyer (ANL)

Questions?

Reducing data, not information.

More efficient data storage and mining.

Big data scaling approaches

Storage of biological big data

What other sequences are connected to

Sequence X?

Data broken into words of length “k” (k-mers)

Overlap (for assembly) = shared “word”

Pell, PNAS, 2012

Howe, PNAS, 2014

AGTCAGTT

Into its 4-mers:

AGTC

GTCA

TCAG

CAGT

AGTT

AGAAAGTC

Into its 4-mers:

AGAA

GAAA

AAAG

AAGT

AGTC
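Breaking a sequence into its k-mers, and finding the shared "word" that signals an overlap, takes only a few lines. This is a minimal sketch using the two example sequences from this slide; the function name is illustrative.

```python
def kmers(seq, k):
    """All overlapping words of length k in seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


seq_a = "AGTCAGTT"
seq_b = "AGAAAGTC"

words_a = kmers(seq_a, 4)
words_b = kmers(seq_b, 4)
shared = set(words_a) & set(words_b)

print(words_a)
print(words_b)
print("Shared 4-mers:", shared)
```

A sequence of length n yields n - k + 1 words, so each 8-base example gives five 4-mers, and the only word the two sequences share is AGTC.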

How do we store “big data” words?

Bloom filter data structure

Efficient storage

Do I have mail?

Mailbox analogy

A-G H-R S-Z

Pell, PNAS, 2012

Howe, PNAS, 2014

Is Sequence A connected to Sequence B?

Mailbox analogy – Efficient storage of information

A-G H-R S-Z

A-G* H-R S-Z

No mail for Howe, 100% sure.

A-G H-R* S-Z

Possibly mail for Howe.

Pell, PNAS, 2012

Howe, PNAS, 2014

Do I have mail?

Mailbox analogy – Efficient storage of information

A-G H-R S-Z

A-G H-R* S-Z

G-N* A-F; O-T U-Z

D-H* A-C; I-O P-Z

Howe mail status:

Mail possibility higher.

Do I have mail?

Mailbox analogy – Efficient storage of information

A-G H-R S-Z

A-G H-R* S-Z

G-N* A-F; O-T U-Z

D-H A-C; I-O P-Z

Howe mail status:

No mail, 100% sure.

Do I have mail?

Bloom filter data structure

“Probabilistic” data structure

Decrease of false positive rate with multiple

Bloom filters – “More likely I have mail”

No false negatives – “No mail. 100% sure”

For the win: both detects and counts presence

of sequences (k-mers) and their connectivity

efficiently
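The mailbox behaviour described above (possible false positives, never a false negative) can be seen in a textbook Bloom filter. The sketch below is a generic illustration, not the khmer implementation: it uses one bit array with several salted SHA-256 hashes, tracks presence only (no counting), and the class name and sizes are arbitrary assumptions.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: membership tests with no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter()
for word in ["AGTC", "GTCA", "TCAG", "CAGT", "AGTT"]:
    bf.add(word)

print("AGTC" in bf)  # an added word is always reported as present
print("GGGG" in bf)  # usually absent; a "present" here would be a false positive
```

Each extra hash plays the role of another set of mailboxes: a k-mer is reported as possibly present only if every one of its positions is set, so adding hashes or bits lowers the false positive rate while an unset position remains a certain "no".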

Is sequence A connected to sequence B?

Pell, PNAS, 2012

Howe, PNAS, 2014