Job Talk Iowa State University Ag Bio Engineering

RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY IOWA STATE UNIVERSITY MARCH 12, 2014 Adina Howe, PhD

Transcript of Job Talk Iowa State University Ag Bio Engineering

Page 1: Job Talk Iowa State University Ag Bio Engineering

RIDING THE BIG DATA TIDAL WAVE IN MODERN MICROBIOLOGY

IOWA STATE UNIVERSITY

MARCH 12, 2014

Adina Howe, PhD

Page 2: Job Talk Iowa State University Ag Bio Engineering

Outline of talk

My multi-discipline career

Biological sequencing: a game changer

Research – computational focus:

How to handle “big data” in biology

Research – biological focus:

The gut microbiome’s role in obesity?

Future research:

A flexible toolbox in a big playground

Page 3: Job Talk Iowa State University Ag Bio Engineering

Background

Purdue University, BSME,

Mechanical Engineering

Purdue University, MS,

Environmental Engineering

(Sustainability)

Page 4: Job Talk Iowa State University Ag Bio Engineering

Background

Purdue University, BSME,

Mechanical Engineering

Purdue University, MS,

Environmental Engineering

(Sustainability)

University of Iowa, PhD,

Environmental Engineering

(Microbiology/Bioremediation)

Page 5: Job Talk Iowa State University Ag Bio Engineering

Background

Purdue University, BSME,

Mechanical Engineering

Purdue University, MS,

Environmental Engineering

(Sustainability)

University of Iowa, PhD,

Environmental Engineering

(Microbiology/Bioremediation)

Michigan State University

NSF Postdoc Math and Biology Fellow (cross-training)

Microbial Ecology (Jim Tiedje)

Bioinformatics (Titus Brown)

Page 6: Job Talk Iowa State University Ag Bio Engineering

Background

Purdue University, BSME,

Mechanical Engineering

Purdue University, MS,

Environmental Engineering

(Sustainability)

University of Iowa, PhD,

Environmental Engineering

(Microbiology/Bioremediation)

Michigan State University

NSF Postdoc Math and Biology Fellow (cross-training)

Microbial Ecology (Jim Tiedje)

Bioinformatics (Titus Brown)

Computational Biologist

Microbiology / Microbial Ecology

Page 7: Job Talk Iowa State University Ag Bio Engineering

Our shared challenges

Climate Change

Energy Supply

USGCRP 2009

www.alutiiq.com

http://guardianlv.com/

Human Health

An understanding

of microbial ecology

Page 8: Job Talk Iowa State University Ag Bio Engineering

Environmental continuum

MICROBES

IN

ECOSYSTEMS

NATURE

AIR

WATER

SOIL

MICROBIOMES

HUMANS/ANIMAL

ENGINEERED

BIOREACTORS

WASTEWATER

Page 9: Job Talk Iowa State University Ag Bio Engineering

Understanding community

dynamics

Who is there?

What are they doing?

How are they doing it?

Kim Lewis, 2010

Page 10: Job Talk Iowa State University Ag Bio Engineering

Gene / Genome Sequencing

Collect samples

Extract DNA

Sequence DNA

“Analyze” DNA to identify its content and origin

Taxonomy

(e.g., pathogenic E. coli)

Function

(e.g., degrades cellulose)

Page 11: Job Talk Iowa State University Ag Bio Engineering

Cost of Sequencing

Stein, Genome Biology, 2010

E. coli genome 4,500,000 bp ($4.5M, 1992)

[Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990 to 2012]

Page 12: Job Talk Iowa State University Ag Bio Engineering

Rapidly decreasing costs with

NGS Sequencing

Stein, Genome Biology, 2010

Next Generation Sequencing

4,500,000 bp (E. coli, $200, presently)

[Figure: DNA sequencing, Mbp per $ (log scale), by year, 1990 to 2012]

Page 13: Job Talk Iowa State University Ag Bio Engineering

Effects of low cost

sequencing…

First free-living bacterium sequenced

for billions of dollars and years of

analysis

Personal genome can be

mapped in a few days and

hundreds to few thousand

dollars

Page 14: Job Talk Iowa State University Ag Bio Engineering

The experimental continuum

Single Isolate

Pure Culture

Enrichment

Mixed Cultures

Natural systems

Page 15: Job Talk Iowa State University Ag Bio Engineering

The era of big data in biology

Stein, Genome Biology, 2010

Computational Hardware

(doubling time 14 months)

Sanger Sequencing

(doubling time 19 months)

NGS (Shotgun) Sequencing

(doubling time 5 months)
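The gap between these doubling times compounds quickly. The following is a minimal sketch, assuming constant doubling times taken from the legend above; the function and dictionary names are illustrative and not from the talk.

```python
# Compare growth in capacity per dollar under the three doubling times
# quoted on the slide. Constant doubling is an assumption of this sketch.

DOUBLING_TIMES_MONTHS = {
    "Computational hardware": 14,
    "Sanger sequencing": 19,
    "NGS (shotgun) sequencing": 5,
}


def fold_increase(months: float, doubling_time_months: float) -> float:
    """Growth factor after `months` for a fixed doubling time."""
    return 2.0 ** (months / doubling_time_months)


def compare(months: float) -> dict:
    """Growth factor for each technology over the same time span."""
    return {
        name: fold_increase(months, doubling)
        for name, doubling in DOUBLING_TIMES_MONTHS.items()
    }


if __name__ == "__main__":
    for name, factor in compare(60).items():
        print(f"{name}: about {factor:,.0f}x in 5 years")
```

Over five years, sequencing output per dollar goes through twelve doublings while hardware goes through roughly four, which is why data volume outruns the capacity to store and analyze it.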

[Figure: Disk storage (Mb/$) and DNA sequencing (Mbp per $), log scale, by year, 1990 to 2012]

Page 16: Job Talk Iowa State University Ag Bio Engineering

Postdoc experience with data

2003-2008 Cumulative sequencing in PhD = 2000 bp

2008-2009 Postdoc Year 1 = 50 Gbp

2009-2010 Postdoc Year 2 = 450 Gbp

Page 17: Job Talk Iowa State University Ag Bio Engineering

Flexibility towards embracing change.

How to survive a data

deluge?

Experiment Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

Page 18: Job Talk Iowa State University Ag Bio Engineering

Reducing data volume:

Assembly of Metagenomic

Sequences

MSU: C. Titus Brown and James Tiedje

Page 19: Job Talk Iowa State University Ag Bio Engineering

de novo assembly

Compresses dataset size significantly

Improved data quality (longer sequences, gene order)

Reference not necessary (novelty)

Raw sequencing data (“reads”) Computational algorithms Informative genes / genomes

Page 20: Job Talk Iowa State University Ag Bio Engineering

Metagenome assembly…a scaling

problem.

Page 21: Job Talk Iowa State University Ag Bio Engineering

Shotgun sequencing and de novo

assembly

It was the Gest of times, it was the wor

, it was the worst of timZs, it was the

isdom, it was the age of foolisXness

, it was the worVt of times, it was the

mes, it was Ahe age of wisdom, it was th

It was the best of times, it Gas the wor

mes, it was the age of witdom, it was th

isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness
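The basic operation behind this kind of assembly is finding where the end of one read matches the start of another. The sketch below is illustrative only: it uses error-free versions of two of the reads above, and the names `overlap`, `merge`, `READ_A` and `READ_B` are hypothetical. Real assemblers must also tolerate the sequencing errors shown in the reads, which this sketch does not attempt.

```python
# Suffix-prefix overlap between two reads, and merging them on that overlap.
# Exact matching only; sequencing errors are not handled in this sketch.

READ_A = "It was the best of times, it was the wor"
READ_B = ", it was the worst of times, it was the"


def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` characters exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def merge(a: str, b: str, min_len: int = 3) -> str:
    """Join `a` and `b` on their overlap; raise if they do not overlap."""
    n = overlap(a, b, min_len)
    if n == 0:
        raise ValueError("reads do not overlap")
    return a + b[n:]


if __name__ == "__main__":
    print(overlap(READ_A, READ_B))
    print(merge(READ_A, READ_B))
```

Repeating this comparison across every pair of reads is what makes assembly scale poorly as the number of reads grows.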

Page 22: Job Talk Iowa State University Ag Bio Engineering

Practical Challenges – Intensive

computing

Howe et al, 2014, PNAS

Months of

“computer

crunching” on a

super computer

Page 23: Job Talk Iowa State University Ag Bio Engineering

Practical Challenges – Intensive

computing

Howe et al, 2014, PNAS

Months of “computer crunching” on a super computer

Assembly of 300 Gbp can be done with any assembly program in less than 14 GB RAM and less than 24 hours.

Page 24: Job Talk Iowa State University Ag Bio Engineering

Natural community characteristics

Diverse

Many organisms

(genomes)

Page 25: Job Talk Iowa State University Ag Bio Engineering

Natural community characteristics

Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x

Page 26: Job Talk Iowa State University Ag Bio Engineering

Natural community characteristics

Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x Sample 10x

Page 27: Job Talk Iowa State University Ag Bio Engineering

Natural community characteristics

Diverse

Many organisms

(genomes)

Variable abundance

Most abundant organisms, sampled

more often

Assembly requires a minimum amount

of sampling

More sequencing, more errors

Sample 1x Sample 10x

Overkill

Page 28: Job Talk Iowa State University Ag Bio Engineering

Digital normalization

Brown et al., 2012, arXiv

Howe et al., PNAS, 2014

Page 29: Job Talk Iowa State University Ag Bio Engineering

Digital normalization

Brown et al., 2012, arXiv

Howe et al., PNAS, 2014

Page 30: Job Talk Iowa State University Ag Bio Engineering

Digital normalization

Brown et al., 2012, arXiv

Howe et al., PNAS, 2014

Page 31: Job Talk Iowa State University Ag Bio Engineering

Digital normalization

Brown et al., 2012, arXiv

Howe et al., PNAS, 2014

Page 32: Job Talk Iowa State University Ag Bio Engineering

Digital normalization

Brown et al., 2012, arXiv

Howe et al., PNAS, 2014

Page 33: Job Talk Iowa State University Ag Bio Engineering

Digital normalization

Brown et al., 2012, arXiv

Howe et al., 2014, PNAS

Reduces datasets for assembly by up to 95% with the same assembly outputs.

Genomes, mRNA-seq, metagenomes (soils, gut, water)
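The idea of digital normalization can be sketched in a few lines: a read is kept only while the median count of its k-mers, among the reads kept so far, is below a coverage cutoff. The version below is a simplified illustration, not the khmer implementation. It swaps in an exact `Counter` where the published approach uses a memory-efficient probabilistic counter, and the values of `k` and `cutoff` are toy assumptions.

```python
# Simplified digital normalization: discard reads whose k-mers have already
# been seen often enough. Exact counting is an assumption of this sketch.
from collections import Counter
from statistics import median


def kmers(seq: str, k: int) -> list:
    """All overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k: int = 4, cutoff: int = 2) -> list:
    """Return the reads kept after normalizing coverage to about `cutoff`."""
    counts = Counter()
    kept = []
    for read in reads:
        words = kmers(read, k)
        if not words:
            continue  # read shorter than k
        # Estimated coverage of this read given what has been kept so far.
        if median(counts[w] for w in words) < cutoff:
            kept.append(read)
            counts.update(words)
    return kept


if __name__ == "__main__":
    # One abundant sequence sampled 10 times, one rare sequence sampled once.
    reads = ["ACGTACGGTC"] * 10 + ["TTGACCAGTA"]
    print(digital_normalization(reads))
```

In this toy input the abundant sequence is cut from ten copies to two, while the rare sequence is retained, which is the behavior that removes redundant data without removing low-abundance organisms.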

Page 34: Job Talk Iowa State University Ag Bio Engineering

Partitioning (khmer software)

Pell et al, 2012, PNAS

Howe et al., 2014, PNAS

Separates metagenomes by species

Parallel computing possible

Largest known published soil metagenome and assembly
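Partitioning can be pictured as grouping reads that are connected through shared k-mers, so that each group can be assembled independently and in parallel. The sketch below illustrates that idea with a union-find structure over exact k-mers. It is an assumption-laden toy, not the khmer method: it ignores reverse complements and sequencing errors, and `partition_reads` is a hypothetical name.

```python
# Group reads into connected components, where two reads are connected
# if they share at least one k-mer. Toy illustration only.

def partition_reads(reads, k: int = 4) -> list:
    """Return lists of reads, one list per connected component."""
    parent = list(range(len(reads)))

    def find(i: int) -> int:
        # Follow parent links to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i: int, j: int) -> None:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    first_seen = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            word = read[i:i + k]
            if word in first_seen:
                union(first_seen[word], idx)
            else:
                first_seen[word] = idx

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


if __name__ == "__main__":
    reads = ["AAAACCCC", "CCCCGGGG", "TTTTAGAG", "AGAGTCTC"]
    for group in partition_reads(reads):
        print(group)
```

Because the groups share no k-mers, each one can be handed to a separate assembly job.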

Page 35: Job Talk Iowa State University Ag Bio Engineering

Tackling Soil Biodiversity

Source: Chuck Haney

Page 36: Job Talk Iowa State University Ag Bio Engineering

Tackling Soil Biodiversity

Grand Challenge effort – 10% of soil biodiversity sampled

Incredible soil biodiversity (estimate required 10 Tbp/sample)

“To boldly go where no man has gone before”: >60% Unknown

[Figure: Total Count KO (axis 0 to 400) by functional category, for genes found in corn and prairie, corn only, and prairie only. Categories: amino acid metabolism; carbohydrate metabolism; membrane transport; signal transduction; translation; folding, sorting and degradation; metabolism of cofactors and vitamins; energy metabolism; transport and catabolism; lipid metabolism; transcription; cell growth and death; replication and repair; xenobiotics biodegradation and metabolism; nucleotide metabolism; glycan biosynthesis and metabolism; metabolism of terpenoids and polyketides; cell motility]

Howe et al, 2014, PNAS

Page 37: Job Talk Iowa State University Ag Bio Engineering

Big data combined with microbiology will change lives.

Page 38: Job Talk Iowa State University Ag Bio Engineering

The health and stability of the gut

microbiome (in response to diet change)

University of Chicago: Daina Ringus, PhD & Eugene Chang, MD

Experiment Design

Data Generation

Workflow / Tools

Data analysis

Applied Solutions

Page 39: Job Talk Iowa State University Ag Bio Engineering

We are supraorganisms

Page 40: Job Talk Iowa State University Ag Bio Engineering

Interactions between the microbiome and the environment

Source: Zhao, 2013

Obesity

Intestinal inflammation

Inflammatory bowel diseases (IBD)

Diet has a greater potential to shape the structure and function of the gut microbiome than host genetics.

Direct influence on health state

Page 41: Job Talk Iowa State University Ag Bio Engineering

How resilient is the microbiome?

In mice, recovery from long term shift to obesity-inducing diet

In humans, microbiome rapidly and reproducibly recovers within 2 days (2013)

In mice, rapid recovery from long term shift to obesity-inducing diet (2012)

Page 42: Job Talk Iowa State University Ag Bio Engineering

Is the gut community going viral?

Reyes et al, Nature Reviews Microbiology, 2012

Bacterial cells

Bacterial cells infected with bacteriophage

Viruses (Bacteriophage)

Vary by individual (Minot et al., 2011)

Altered by diet and co-vary with bacteria (Minot et al., 2011)

Long term stable (Minot et al., 2013)

Largely temperate (Reyes et al., 2013)

Prophage

Who is in the gut microbiome?

Page 43: Job Talk Iowa State University Ag Bio Engineering

Is the gut community going viral?

Reyes et al, Nature Reviews Microbiology, 2012

Page 44: Job Talk Iowa State University Ag Bio Engineering

Is the gut community going viral?

Reyes et al, Nature Reviews Microbiology, 2012

Page 45: Job Talk Iowa State University Ag Bio Engineering

Is the gut community going viral?

Reyes et al, Nature Reviews Microbiology, 2012

Page 46: Job Talk Iowa State University Ag Bio Engineering

Research Questions

What are the impacts of different diets on gut microbiome response?

What are the impacts of viruses in the gut microbiome (rapid alteration and resilient response)?

Multidisciplinary approach combining

novel experimental targeting of both bacterial and viral communities

metagenomic-based sequencing to characterize community

Page 47: Job Talk Iowa State University Ag Bio Engineering

Novel experimental design – targeted

sampling of community fractions

I. Total DNA (bacteria + prophage + viruses): TOT

II. Virus-like particles (free-living viruses): VLP

III. Induced prophage: IND

Separation by density

Chemically separate

Separation by size

Microbiome through faecal matter (non-destructive sampling)

Page 48: Job Talk Iowa State University Ag Bio Engineering

Two baseline diets (with a

perturbation)

Low-fat (LF) baseline diet

Milk-fat (MF) baseline diet

Age (wk)

4 5 6 7 8 9 10 11 12 13 14

Baseline, Diet Switch, Washout (Return to Baseline)

Total community function: TOT metagenomic sequencing at weeks 8, 11, 14

Virome community function: VLP, IND metagenomic sequencing at weeks 8, 11, 14

Weight of mice and count of VLPs with microscopy

Taxonomy analysis (16S rRNA gene only) every week from week 8 to 14.

LF / 10% Fat / Complex Carbs

MF / 37% Fat / Simple Sugars

MF

LF MF

LF

Fecal Samples

Page 49: Job Talk Iowa State University Ag Bio Engineering

Outcomes?

Low-fat (LF) baseline diet

Milk-fat (MF) baseline diet

Age (wk)

4 5 6 7 8 9 10 11 12 13 14

Baseline, Diet Switch, Washout (Return to Baseline)

LF / 10% Fat / Complex Carbs

MF / 37% Fat / Simple Sugars

MF

LF MF

LF

Qualitative and Quantitative Measurements:

Who is there? What are they doing?

How much?

Page 50: Job Talk Iowa State University Ag Bio Engineering

How does the community change

over time?

[Figure: three panels (Altered-Recovery, Altered-Altered, No Change); y-axis: Distance from Baseline; x-axis: Baseline, Intervention, Washout]

Measurements of gene abundance profile (200,000+ genes) reduced to a single distance measurement from the original community (ordination)
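Reducing a whole gene abundance profile to one number can be sketched as a dissimilarity between each time point and the baseline. The talk does not name the metric used, so the Bray-Curtis dissimilarity below is an assumption chosen because it is a common input to ordination of community data. The three-gene profiles are made-up toy values, not results from the study.

```python
# Collapse gene abundance profiles into a single "distance from baseline".
# Bray-Curtis is an assumed metric; the toy profiles are illustrative only.

def bray_curtis(a, b) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator if denominator else 0.0


def distance_from_baseline(profiles: dict, baseline_key: str = "baseline") -> dict:
    """Distance of every profile from the baseline profile."""
    base = profiles[baseline_key]
    return {name: bray_curtis(base, values) for name, values in profiles.items()}


if __name__ == "__main__":
    # Hypothetical "Altered-Recovery" pattern over three genes.
    profiles = {
        "baseline": [10, 20, 30],
        "intervention": [30, 20, 10],
        "washout": [10, 20, 30],
    }
    for phase, distance in distance_from_baseline(profiles).items():
        print(f"{phase}: {distance:.2f}")
```

A profile that returns to zero distance at washout corresponds to the Altered-Recovery panel; one that stays away from zero corresponds to Altered-Altered.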

Page 51: Job Talk Iowa State University Ag Bio Engineering

Rapid and resilient bacterial gut

response after diet alteration

[Figure: Distance from Baseline across Baseline, Intervention, Washout; significance marker ***]

Page 52: Job Talk Iowa State University Ag Bio Engineering

Diet-specific functional total community recovery (mostly bacterial)

[Figure: Distance from Baseline (axis 0.00 to 0.10) across Baseline, Diet Perturbed, Washout; significance marker ***]

Page 53: Job Talk Iowa State University Ag Bio Engineering

Free living viruses in MF baseline are significantly altered without recovery.

[Figure: Distance from Baseline (axis 0.0 to 0.3) across Baseline, Diet Perturbed, Washout; significance marker ***]

Page 54: Job Talk Iowa State University Ag Bio Engineering

Prophages in MF baseline are significantly altered without recovery.

[Figure: Distance from Baseline (axis 0.0 to 0.3) across Baseline, Diet Perturbed, Washout]

Page 55: Job Talk Iowa State University Ag Bio Engineering

“Combat Zone” as diets change

Milk-fat baseline (MF) mice have contrasting bacterial and viral responses, in

which there is not a rapid recovery of viral communities

Page 56: Job Talk Iowa State University Ag Bio Engineering

Viral functions significantly changed during the milk fat baseline diet

Decreases in

Phage-related (p=0.01)

Iron acquisition (p<0.01)

Nucleotide metabolism (p=0.02)

Carbohydrate metabolism (p=0.01)

Motility and chemotaxis (p=0.03)

Virulence and defense (p=0.03)

[Figure panels: Phage, Iron, Nucleotide, Carbs, Flagella; x-axis: Baseline, Change, Washout]

Page 57: Job Talk Iowa State University Ag Bio Engineering

Bacteroides (Bacteroidetes)

Clostridium (Firmicutes)

Eubacterium (Firmicutes)

Significant decrease in genes associated with MF baseline viruses

Ratio of Firmicutes and Bacteroidetes associated with obesity

Turnbaugh, 2008

Bacteroides fragilis, Nutridesk.com; C. difficile, Bioquell.ie; National Geographic

Turnbaugh, 2009

Page 58: Job Talk Iowa State University Ag Bio Engineering

Viromes potentially critical in gut microbiome response.

Members of gut microbiome community do not have co-occurring responses.

Loss of viral population and diversity is diet specific (related to a milkfat to lowfat diet transition)

Page 59: Job Talk Iowa State University Ag Bio Engineering

Viruses' ability to redirect the structure and function of the microbiome makes them pivotal drivers of health and disease

Reyes et al, Nature Reviews Microbiology, 2012

Page 60: Job Talk Iowa State University Ag Bio Engineering

Virome directly causes host response

Germ Free 11 week old mice (n = 3)

Diet: Standard chow

3 week conventionalization


A “standard control” microbiome: uniform cecal content of standard chow mice

Experimentally introduced viruses

Mouse Treatment 1: Lowfat baseline VLP

Mouse Treatment 2: Milkfat baseline VLP

Control: Buffer

Page 61: Job Talk Iowa State University Ag Bio Engineering

Significant decrease of intestinal inflammation in LF VLP treatments

Pro-inflammatory cytokines in mucosal scrapings

[Figure, proximal colon: TNF-α and IFN-γ (ng/g) for Control, LF VLPs, MF VLPs; * marks a significant difference]

Page 62: Job Talk Iowa State University Ag Bio Engineering

Conclusions

Gut microbiome has reproducible and distinct responses to diet.

Viruses have a unique response to diet perturbations and do not co-occur with bacteria.

Viruses observed to cause inflammation in infected germ free mice.

Big data workflow enabled strategic sampling design providing unparalleled access to viruses of gut microbiome


Page 63: Job Talk Iowa State University Ag Bio Engineering

Future work

Page 64: Job Talk Iowa State University Ag Bio Engineering

Data-discovery is a national

investment.

Page 65: Job Talk Iowa State University Ag Bio Engineering

Data-driven biological

investigations

MICROBES

IN

ECOSYSTEMS

NATURE

WATER

SOIL

MICROBIOMES

HUMANS/ANIMAL

ENGINEERED

WASTEWATER

High Throughput Frameworks:

Metagenomic

Metatranscriptomic

Metaproteomic

More relevant model

systems

Improved biomarkers

Scaling approaches

Big data computation

Data driven discovery

Page 66: Job Talk Iowa State University Ag Bio Engineering

Core research values

Research that matters

Developing scientific frameworks that enable

open-science initiatives (reproducible science)

Computational and experimental integration

Scale and power to multi-disciplinary

approaches

Team value

Flexibility

Page 67: Job Talk Iowa State University Ag Bio Engineering

Going viral: The role of the human gut

phageome in inflammatory bowel disease

Objectives:

Define and compare core phageomes associated with healthy and diseased gut microbiomes

Determine impact of disease-associated gut phageomes on development of disease in knockout mouse models (predisposed to disease)

NIH, National Institute of Diabetes and Digestive and

Kidney Diseases; National Institute of Allergy and Infectious

Diseases ($3-5M)

Source: Nature.com

What is the role of host-phage

dynamics in the development of

intestinal diseases?

Integration of multiple datasets

Improved model systems and

biomarkers

Page 68: Job Talk Iowa State University Ag Bio Engineering

Microbial drivers of carbon metabolism and

warming

DOE Biological and Environmental Research ($3M/3 years, 40% PI with ISU Kirsten Hofmockel, 2013-2016)

Source: Oak Ridge National Laboratory

Contributions:

• Omic-based characterization of carbon cycling microorganisms in the soil

• Novel approaches to target carbon cycling subsets of community

• Improved soil genomic databases to enable future carbon studies

Source: Oak Ridge National Laboratory

How do microbes contribute to carbon cycling models?

Big data scaling

Integration of multiple

datasets

Improved model systems

Page 69: Job Talk Iowa State University Ag Bio Engineering

Large-scale characterization of global dark

matter proteins in complex biological

environments

NIH – Development of Software and Analysis Methods for Biomedical

Big Data in Targeted Areas of High Need

(~$1M/3 years)

Gordon and Betty Moore – Data Driven Discovery Investigator Awards

($1.5M / 5 years)

Novel extension of current software tools:

• Integration of growing volumes of global public datasets with scalable data-mining analysis

• Lightweight data architecture to compare abundance and co-occurrence of sequencing patterns across multiple samples and associated metadata to elucidate information

How do we access the novelty observed in metagenomic datasets?

Big data scaling

Integration of datasets

Page 70: Job Talk Iowa State University Ag Bio Engineering

From field to food: The origin and

fate of our microbiomes

USDA Agriculture and Food Research Initiative ($1-2.5M)

• Identify and characterize under-researched foodborne microbial hazards and effective control strategies

• Elucidate fate and dissemination of foodborne microbial hazards associated with produce production and processing

Source: aboretum.umn.edu

Where do harmful microbes in our food come

from and how do we protect ourselves from

them?

Integration of multiple datasets

Improved model systems and

biomarkers

Page 71: Job Talk Iowa State University Ag Bio Engineering

Acknowledgements

Funding

DOE Microbial Carbon Cycling Grant

NSF Postdoc Fellowship, Great Lakes Bioenergy Research Center

Microbiome: University of Chicago Digestive Diseases Research Core Pilot and Feasibility Grant

My Awesome INTER-DISCIPLINARY Team

C. Titus Brown (MSU) + lab (Bioinformatics)

James Tiedje (MSU) + lab (Microbial Ecology)

Daina Ringus (UC) (Microbiology / Mice)

Kirsten Hofmockel, Ryan Williams, Fan Yang (ISU)

Eugene Chang (UC)

Folker Meyer (ANL)

Page 72: Job Talk Iowa State University Ag Bio Engineering

Questions?

Page 73: Job Talk Iowa State University Ag Bio Engineering

Reducing data, not information.

More efficient data storage and mining.

Big data scaling approaches

Page 74: Job Talk Iowa State University Ag Bio Engineering

Storage of biological big data

What other sequences are connected to

Sequence X?

Data broken into words of length “k” (k-mers)

Overlap (for assembly) = shared “word”

Pell, PNAS, 2012

Howe, PNAS, 2014

AGTCAGTT

Into its 4-mers:

AGTC

GTCA

TCAG

CAGT

AGTT

AGAAAGTC

Into its 4-mers:

AGAA

GAAA

AAAG

AAGT

AGTC
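Breaking a sequence into k-mers and checking for a shared word takes only a few lines. This sketch uses the two example sequences above; the function names are illustrative, and it treats sequences as plain strings without reverse complements.

```python
# Split sequences into overlapping words of length k and find shared words.

def kmers(seq: str, k: int) -> list:
    """All overlapping words of length k, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_kmers(seq_a: str, seq_b: str, k: int) -> set:
    """Words of length k present in both sequences (the 'overlap')."""
    return set(kmers(seq_a, k)) & set(kmers(seq_b, k))


if __name__ == "__main__":
    print(kmers("AGTCAGTT", 4))
    print(kmers("AGAAAGTC", 4))
    print(shared_kmers("AGTCAGTT", "AGAAAGTC", 4))
```

The two sequences share the word AGTC, which is the evidence that they may be connected in an assembly.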

Page 75: Job Talk Iowa State University Ag Bio Engineering

Storage of biological big data

What other sequences are connected to Sequence X?

Data broken into words of length “k” (k-mers)

Overlap (for assembly) = shared “word”

How do we store “big data” words?

Bloom filter data structure

Efficient storage

Page 76: Job Talk Iowa State University Ag Bio Engineering

Do I have mail?

What other sequences are connected to Sequence X?

Data broken into bins of word length “k” (k-mers)

Overlap (for assembly) = shared “word”

How do we store “big data” words?

Bloom filter data structure

Mailbox analogy

A-G H-R S-Z

Pell, PNAS, 2012

Howe, PNAS, 2014

Page 77: Job Talk Iowa State University Ag Bio Engineering

Is Sequence A connected to Sequence B?

Data broken into bins of word length “k” (k-mers)

Overlap (for assembly) = shared “word”

How do we store “big data” words?

Bloom filter data structure

Mailbox analogy – Efficient storage of information

A-G H-R S-Z

A-G* H-R S-Z

No mail for Howe, 100% sure.

A-G H-R* S-Z

Possibly mail for Howe.

Pell, PNAS, 2012

Howe, PNAS, 2014

Do I have mail?

Page 78: Job Talk Iowa State University Ag Bio Engineering

Is Sequence A connected to Sequence B?

Data broken into bins of word length “k” (k-mers)

Overlap (for assembly) = shared “word”

How do we store “big data” words?

Bloom filter data structure

Mailbox analogy – Efficient storage of information

A-G H-R S-Z

A-G H-R* S-Z

G-N* A-F; O-T U-Z

D-H* A-C; I-O P-Z

Howe mail status:

Mail possibility higher.

Do I have mail?

Page 79: Job Talk Iowa State University Ag Bio Engineering

Is Sequence A connected to Sequence B?

Data broken into bins of word length “k” (k-mers)

Overlap (for assembly) = shared “word”

How do we store “big data” words?

Bloom filter data structure

Mailbox analogy – Efficient storage of information

A-G H-R S-Z

A-G H-R* S-Z

G-N* A-F; O-T U-Z

D-H A-C; I-O P-Z

Howe mail status:

No mail, 100% sure.

Do I have mail?

Page 80: Job Talk Iowa State University Ag Bio Engineering

Bloom filter data structure

“Probabilistic” data structure

Decrease of false positive rate with multiple

bloom filters – “More likely I have mail”

No false negatives – “No mail. 100% sure”

For the win: both detects and counts presence

of sequences (k-mers) and their connectivity

efficiently

Is sequence A connected to sequence B?

Pell, PNAS, 2012

Howe, PNAS, 2014
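The mailbox analogy maps onto code fairly directly: each hash function is one set of mailboxes, and an item is "possibly present" only if every hash lands on a flagged slot. The class below is a minimal sketch of a Bloom filter for k-mer presence, not the khmer implementation. The sizes, the use of SHA-256 as the hash, and the class name are assumptions, and this version only records presence; it does not count.

```python
# Minimal Bloom filter: no false negatives, tunable false positive rate.
# Sizes and hashing scheme are illustrative assumptions.
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per slot for readability; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item: str):
        """One slot per hash function, derived by seeding SHA-256."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item: str) -> bool:
        # False means "no mail, 100% sure"; True means "possibly mail".
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    for word in ["AGTC", "GTCA", "TCAG", "CAGT", "AGTT"]:
        seen.add(word)
    print("AGTC" in seen)  # an added k-mer is always reported as present
```

Adding more hash functions plays the role of the extra rows of mailboxes: each one is another chance to rule an item out, which lowers the false positive rate.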