Introduction to 16S rRNA gene multivariate analysis

Multivariate exploration of microbial communities Josh D. Neufeld Braunschweig, Germany December, 2013 Michael Lynch (PhD): Taxonomy, phylogenetics, ecology Michael Hall (co-op): mathematics, programming, user friendly! Andre Masella (MSc): Computer science Posted on Slideshare without images and unpublished data

description

Short introductory talk on multivariate statistics for 16S rRNA gene analysis given at the 2nd Soil Metagenomics conference in Braunschweig Germany, December 2013. A previous talk had discussed quality filtering, chimera detection, and clustering algorithms.

Transcript of Introduction to 16S rRNA gene multivariate analysis

Page 1: Introduction to 16S rRNA gene multivariate analysis

Multivariate exploration of microbial communities Josh D. Neufeld Braunschweig, Germany December, 2013

Michael Lynch (PhD): Taxonomy, phylogenetics, ecology

Michael Hall (co-op): mathematics, programming, user friendly!

Andre Masella (MSc): Computer science

Posted on Slideshare without images and unpublished data

Page 2: Introduction to 16S rRNA gene multivariate analysis

Alpha and Beta diversity

Pipelines

Quick history

Future prospects and problems

Species that matter

Page 3: Introduction to 16S rRNA gene multivariate analysis

Who lives with whom, and why, and where?

Data reduction is essential for:

a) summarizing large numbers of observations into manageable numbers

b) visualizing many interconnected variables in a compact manner

Alpha diversity: species richness (and evenness) within a single sample

Beta diversity: change in species composition across a collection of samples

Gamma diversity: total species richness across an environmental gradient
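As a rough illustration of the three definitions above, the sketch below counts richness for a tiny made-up OTU table. The sample names, OTU labels, and counts are hypothetical, and Whittaker's multiplicative beta (gamma divided by mean alpha) is only one of many ways to express turnover; it is an assumption of this example, not something taken from the slides.

```python
# Hypothetical OTU counts per sample (illustrative values only).
otu_table = {
    "soil_A": {"OTU1": 10, "OTU2": 5, "OTU3": 0, "OTU4": 1},
    "soil_B": {"OTU1": 0, "OTU2": 7, "OTU3": 3, "OTU4": 0},
    "soil_C": {"OTU1": 2, "OTU2": 0, "OTU3": 0, "OTU4": 0},
}


def alpha_richness(counts):
    """Number of OTUs observed (count > 0) in one sample."""
    return sum(1 for c in counts.values() if c > 0)


def gamma_richness(table):
    """Number of OTUs observed in at least one sample."""
    observed = set()
    for counts in table.values():
        observed.update(otu for otu, c in counts.items() if c > 0)
    return len(observed)


def whittaker_beta(table):
    """Whittaker's multiplicative beta: gamma / mean alpha."""
    alphas = [alpha_richness(c) for c in table.values()]
    return gamma_richness(table) / (sum(alphas) / len(alphas))


alphas = {name: alpha_richness(c) for name, c in otu_table.items()}
gamma = gamma_richness(otu_table)
beta = whittaker_beta(otu_table)
```

Here `alphas` holds one richness value per sample, while `gamma` and `beta` describe the whole collection.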

Page 4: Introduction to 16S rRNA gene multivariate analysis

An (abbreviated) history

Numerical ecology: phenetics and statistical analysis of organismal counts

macroecology

16S rRNA gene era: sequence analysis as a surrogate for counting

mapping of marker to taxonomy

NGS enabled synthesis of phenetics, phylogenetics, and numerical ecology

Page 5: Introduction to 16S rRNA gene multivariate analysis

Now generate V3-V4 bacterial amplicons (~450 bases)

Usually PE 300 (paired-end, 2 × 300 bases)

Page 6: Introduction to 16S rRNA gene multivariate analysis

Assembling paired-end reads dramatically reduces error

Corrects mismatches in region of overlap (quality threshold >0.9), set a minimum overlap.

Can compare to perfect overlap assembly: “completelymissesthepoint” (name changing soon)
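To make the overlap idea concrete, here is a minimal sketch of a perfect-overlap merge, in the spirit of the perfect-overlap comparison mentioned above. It is not PANDAseq's algorithm: it ignores quality scores and does not correct mismatches, and the function names and the default `min_overlap` are assumptions made for this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement a DNA string (upper-case A/C/G/T/N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def merge_perfect_overlap(forward, reverse, min_overlap=10):
    """Merge a read pair on the longest exact overlap, or return None.

    The reverse read is reverse-complemented first, so that its start
    should match the end of the forward read.
    """
    rc = reverse_complement(reverse)
    max_len = min(len(forward), len(rc))
    # Try the longest candidate overlap first, down to the minimum allowed.
    for overlap in range(max_len, min_overlap - 1, -1):
        if forward[-overlap:] == rc[:overlap]:
            return forward + rc[overlap:]
    return None


# Hypothetical 29-base amplicon read from both ends with an 11-base overlap.
amplicon = "ACGTTGCAAGGCTTAACCGGTTAGCATCG"
forward_read = amplicon[:20]
reverse_read = reverse_complement(amplicon[9:])
merged = merge_perfect_overlap(forward_read, reverse_read, min_overlap=10)
```

A quality-aware assembler would instead score each candidate overlap and resolve disagreements between the two reads, which is where the error reduction comes from.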

Page 7: Introduction to 16S rRNA gene multivariate analysis

PANDAseq >30x faster than next fastest alternative assembler

Page 8: Introduction to 16S rRNA gene multivariate analysis

1. p-value threshold 2. parallelizes correctly

(both are now added or fixed in PANDAseq)

Page 9: Introduction to 16S rRNA gene multivariate analysis
Page 10: Introduction to 16S rRNA gene multivariate analysis

Biological Observation Matrix

BIOM file format (McDonald et al. 2012)

Standard recognized by EMP, MG-RAST, VAMPS

Based on JSON data interchange format

Computational structure in multiple languages

“facilitates the efficient handling and storage of large, sparse biological contingency tables”

Encapsulates metadata and contingency table (e.g., OTU table) in one file
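The sketch below shows why a sparse JSON layout suits OTU tables: only non-zero cells are stored, and metadata travels in the same object. The field names (`rows`, `columns`, `shape`, `data`, `matrix_type`) are loosely modelled on the JSON flavour of BIOM as I understand it, but this is an illustration under that assumption and not a validated BIOM file; the helper names and the `soil_depth_cm` metadata key are hypothetical.

```python
import json


def to_sparse_table(otu_ids, sample_ids, dense, sample_metadata=None):
    """Pack a dense OTU-by-sample matrix into a sparse, JSON-ready dict."""
    sample_metadata = sample_metadata or {}
    # Keep only non-zero cells as [row index, column index, value] triplets.
    data = [
        [i, j, value]
        for i, row in enumerate(dense)
        for j, value in enumerate(row)
        if value != 0
    ]
    return {
        "type": "OTU table",
        "matrix_type": "sparse",
        "shape": [len(otu_ids), len(sample_ids)],
        "rows": [{"id": otu, "metadata": None} for otu in otu_ids],
        "columns": [
            {"id": s, "metadata": sample_metadata.get(s)} for s in sample_ids
        ],
        "data": data,
    }


def to_dense(table):
    """Rebuild the dense matrix from the sparse triplets."""
    n_rows, n_cols = table["shape"]
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, value in table["data"]:
        dense[i][j] = value
    return dense


# Hypothetical 3 x 3 table with six empty cells.
dense_counts = [[5, 0, 0], [0, 0, 2], [0, 1, 0]]
table = to_sparse_table(
    ["OTU1", "OTU2", "OTU3"],
    ["S1", "S2", "S3"],
    dense_counts,
    sample_metadata={"S1": {"soil_depth_cm": 10}},
)
round_tripped = json.loads(json.dumps(table))
```

For real data, a maintained BIOM library is the safer route, since it handles validation and format versions.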

Page 11: Introduction to 16S rRNA gene multivariate analysis

Alpha and Beta diversity

Pipelines

Quick history

Future prospects and problems

Species that matter

Page 12: Introduction to 16S rRNA gene multivariate analysis

Who lives with whom, and why, and where?

Data reduction is essential for:

a) summarizing large numbers of observations into manageable numbers

b) visualizing many interconnected variables in a compact manner

Alpha diversity: species richness (and evenness) within a single sample

Beta diversity: change in species composition across a collection of samples

Gamma diversity: total species richness across an environmental gradient

Page 13: Introduction to 16S rRNA gene multivariate analysis

Diversity (richness and evenness)

Page 14: Introduction to 16S rRNA gene multivariate analysis

α-diversity: Richness and Evenness

Shannon index (H’), Estimators (Chao1, ACE), Phylogenetic Diversity

Stearns et al., 2011

Hughes et al., 2001

Shannon index (H’): richness and evenness

Estimators: richness

Faith’s PD: phylogenetic richness
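The following sketch computes two of the measures listed above from a vector of OTU counts: the Shannon index and the Chao1 richness estimator. It assumes the bias-corrected form of Chao1 (one of several published variants) and adds Pielou's evenness as a convenience; ACE and Faith's PD are not covered because they need more machinery (abundance classes and a tree, respectively).

```python
import math
from collections import Counter


def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with count > 0."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """H' divided by ln(observed richness); 1.0 means perfectly even."""
    richness = sum(1 for c in counts if c > 0)
    return shannon_index(counts) / math.log(richness) if richness > 1 else 0.0


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)).

    F1 and F2 are the numbers of OTUs seen exactly once and exactly twice.
    """
    freq = Counter(c for c in counts if c > 0)
    s_obs = sum(freq.values())
    f1, f2 = freq[1], freq[2]
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))
```

Because Chao1 depends only on singletons and doubletons, anything that inflates rare OTUs (sequencing error, chimeras, clustering choices) feeds directly into the estimate.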

Page 15: Introduction to 16S rRNA gene multivariate analysis

“All biologists who sample natural communities are plagued with the problem of how well a sample reflects a community’s ‘true’ diversity.”

Page 16: Introduction to 16S rRNA gene multivariate analysis

“Nonparametric estimators show particular promise for microbial data and in some habitats may require sample sizes of only 200 to 1,000 clones to detect richness differences of only tens of species.”

Hughes et al. 2001

Page 17: Introduction to 16S rRNA gene multivariate analysis

[Figure: proportion of Google Scholar results matching "[sequencing tech] AND 16S" for Sanger, 454, and Illumina, together with “rare biosphere” citations, plotted against time (year)]

Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.

Page 18: Introduction to 16S rRNA gene multivariate analysis

GOALS

Understanding of community structure

Better alpha-diversity measures

Robust beta-diversity measures

Lynch and Neufeld. 2013. Nat. Rev. Microbiol. In preparation.

Page 19: Introduction to 16S rRNA gene multivariate analysis

Stearns et al. 2011

Page 20: Introduction to 16S rRNA gene multivariate analysis

Bartram et al. 2011

Page 21: Introduction to 16S rRNA gene multivariate analysis

Clustering algorithms (influence alpha diversity primarily)

CD-HIT (Li and Godzik, Sanford-Burnham Medical Research Institute)

‘longest-sequence-first’ removal algorithm

Fast, many implementations (nucleotide, protein, OTU-specific)

Tends to be more stringent than UCLUST

UCLUST (R. Edgar, drive5.com)

Faster than CD-HIT

Tends to generate larger number of low-abundance OTUs

Broader range of clustering thresholds

“I do not recommend using the UCLUST algorithm or CD-HIT for generating OTUs” - Robert Edgar
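A toy version of longest-sequence-first greedy clustering may help show why the choice of algorithm and threshold changes the number of OTUs. This is only a sketch: the real tools use word filters and proper alignment, whereas the `identity` function here is a simplifying assumption that compares ungapped sequences position by position from a shared start.

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (ungapped)."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / n


def greedy_cluster(seqs, threshold=0.97):
    """Longest-sequence-first greedy clustering.

    Returns a list of clusters; each cluster is a list of sequences whose
    first element is the centroid (representative).
    """
    clusters = []
    # Longest first; ties broken alphabetically so the result is reproducible.
    for seq in sorted(seqs, key=lambda s: (-len(s), s)):
        for cluster in clusters:
            if identity(cluster[0], seq) >= threshold:
                cluster.append(seq)
                break
        else:
            # No existing centroid is close enough: seed a new cluster.
            clusters.append([seq])
    return clusters


# Hypothetical sequences: a reference, a one-mismatch variant,
# a truncated copy, and an unrelated sequence.
base = "ACGT" * 10
variant = base[:5] + "T" + base[6:]
reads = [base, variant, base[:35], "TTTT" * 10]
otus_97 = greedy_cluster(reads, threshold=0.97)
otus_99 = greedy_cluster(reads, threshold=0.99)
```

With these inputs the single-mismatch variant joins the reference OTU at 97% but forms its own OTU at 99%, which is the kind of threshold sensitivity that primarily affects alpha diversity.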

Page 22: Introduction to 16S rRNA gene multivariate analysis
Page 23: Introduction to 16S rRNA gene multivariate analysis

CROP: Clustering 16S rRNA for OTU Prediction

“CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods.”

Page 24: Introduction to 16S rRNA gene multivariate analysis

Chimeras

DNA from two or more parent molecules

PCR artifact

Can easily be classified as a “novel” sequence

Increases α-diversity

Software: ChimeraSlayer, Bellerophon, UCHIME, Pintail

Reference database or de novo

Page 25: Introduction to 16S rRNA gene multivariate analysis

Classification and taxonomy

Ribosomal Database Project (RDP) classifier: Naïve Bayesian classifier (James Cole and James Tiedje) http://rdp.cme.msu.edu/

pplacer: Phylogenetic placement and visualization

BLAST: The tool we know and love

RTAX (UC Berkeley, Rob Knight involved) http://dev.davidsoergel.com/trac/rtax/

mothur (Patrick Schloss) http://www.mothur.org/

SINA (SILVA)

Page 26: Introduction to 16S rRNA gene multivariate analysis

RDP classifier

Large training sets require active memory management

Can be easily run in parallel by breaking up very large data sets

Can classify Bacteria/Archaea SSU and fungal LSU (can be re-trained)

Algorithm:

determine the probability that an unknown query sequence is a member of a known genus (training set), based on the profile of word subsets of known genera.

Confidence estimation:

the number of times in 100 trials that a genus was selected based on a random subset of words in the query

Take home:

The higher the diversity (bigger sequence space) of the training set, the better the assignment

Longer query = better and more reliable assignment

Short reads (i.e., <250 bases) will have lower confidence estimates (cutoff of 0.5 suggested)
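The algorithm and confidence estimation described above can be sketched as a small naïve Bayesian word classifier. This is a toy under stated assumptions: the smoothing terms and the one-eighth bootstrap subset follow my reading of the published RDP method, the word size is passed in as `k` (the real classifier uses longer words than this example), and the training sequences and genus names are invented.

```python
import math
import random
from collections import defaultdict


def words(seq, k):
    """Set of distinct overlapping k-mers ("words") in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def train(training, k):
    """training: list of (genus, sequence) pairs. Returns word statistics."""
    word_count = defaultdict(int)                       # sequences containing word
    genus_word = defaultdict(lambda: defaultdict(int))  # same, per genus
    genus_size = defaultdict(int)                       # sequences per genus
    for genus, seq in training:
        genus_size[genus] += 1
        for w in words(seq, k):
            word_count[w] += 1
            genus_word[genus][w] += 1
    return len(training), word_count, genus_word, genus_size


def log_score(query_words, genus, model):
    """Sum of log P(word | genus) with a word-specific prior as smoothing."""
    n_total, word_count, genus_word, genus_size = model
    score = 0.0
    for w in query_words:
        prior = (word_count.get(w, 0) + 0.5) / (n_total + 1)
        cond = (genus_word[genus].get(w, 0) + prior) / (genus_size[genus] + 1)
        score += math.log(cond)
    return score


def classify(query, model, k, trials=100, seed=0):
    """Return (best genus, number of bootstrap trials that agree with it)."""
    rng = random.Random(seed)
    genera = sorted(model[3])
    q_words = sorted(words(query, k))
    best = max(genera, key=lambda g: log_score(q_words, g, model))
    subset_size = max(1, len(q_words) // 8)
    votes = 0
    for _ in range(trials):
        # Each trial re-classifies using a random subset of the query words.
        subset = [rng.choice(q_words) for _ in range(subset_size)]
        pick = max(genera, key=lambda g: log_score(subset, g, model))
        votes += pick == best
    return best, votes


# Hypothetical training set: two invented genera with two sequences each.
K = 4
training = [
    ("GenusA", "AACCACACCAACCCAACACCAAACCCACAC"),
    ("GenusA", "ACCAACACACCCAACCAAACACCACCAACA"),
    ("GenusB", "GGTTGTGTTGGTTTGGTGTTGGGTTTGTGT"),
    ("GenusB", "GTTGGTGTGTTTGGTTGGGTGTTGTTGGTG"),
]
model = train(training, K)
```

Calling `classify(query, model, K)` returns the assigned genus and the number of agreeing trials out of 100; shorter queries yield fewer words per trial, which is one way to see why short reads tend to receive lower confidence.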

Page 27: Introduction to 16S rRNA gene multivariate analysis

Database sources

GreenGenes: latest May 2013

SILVA: latest 115 (August 2013); includes 18S, 23S, 28S, LSU

RDP Database: latest 11 (October 2013)

GenBank

Research-specific, e.g., CORE Oral

Page 28: Introduction to 16S rRNA gene multivariate analysis

Multivariate data reduction

Page 29: Introduction to 16S rRNA gene multivariate analysis

β-diversity

Visualization (ordination) versus hypothesis testing (MRPP, indicator species analysis)

Many more algorithms out there for exploration and statistical testing, mostly through widely used R packages:

vegan (Community Ecology Package)

labdsv (Ordination and Multivariate Analysis for Ecology)

ape (Analyses of Phylogenetics and Evolution)

picante (community analyses etc.)

Page 30: Introduction to 16S rRNA gene multivariate analysis

Visualization (ordination)

Complementary to data clustering, which looks for discontinuities

Ordination extracts main trends as continuous axes, through analysis of the square matrix derived from the OTU table

Non-parametric, unconstrained ordination methods most widely used (and best suited): methods that can work directly on a square matrix

An appropriate metric is required to derive this square matrix

many options...

Page 31: Introduction to 16S rRNA gene multivariate analysis

Metrics

Ordination is essentially reducing dimensionality

first requirement: accurately model differences among samples

Models are *really* important. Examples include:

OTU presence/absence: Dice, Jaccard

OTU abundance: Bray-Curtis

Phylogenetic: UniFrac
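The presence/absence and abundance metrics named above can be written in a few lines each. The sketch below takes two samples as equal-length lists of OTU counts and returns dissimilarities between 0 and 1; the function names are my own, and UniFrac is left out because it also needs a tree.

```python
def jaccard_distance(x, y):
    """1 - shared / (OTUs present in either sample); presence/absence only."""
    a = {i for i, c in enumerate(x) if c > 0}
    b = {i for i, c in enumerate(y) if c > 0}
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def dice_distance(x, y):
    """1 - 2 * shared / (richness_x + richness_y); presence/absence only."""
    a = {i for i, c in enumerate(x) if c > 0}
    b = {i for i, c in enumerate(y) if c > 0}
    total = len(a) + len(b)
    return 1 - 2 * len(a & b) / total if total else 0.0


def bray_curtis(x, y):
    """sum(|x_i - y_i|) / sum(x_i + y_i); uses abundances."""
    total = sum(x) + sum(y)
    return sum(abs(p - q) for p, q in zip(x, y)) / total if total else 0.0


def distance_matrix(samples, metric):
    """Square matrix of pairwise dissimilarities, ready for ordination."""
    return [[metric(a, b) for b in samples] for a in samples]
```

The same pair of samples can look quite different under each metric, which is why the choice of metric should be reported alongside any ordination plot.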

“You can't publish anything without a PCoA plot anymore, but METRICS used to draw plot important.” - Susan Huse

“all models are wrong, some are useful” - G.E.P. Box

Page 32: Introduction to 16S rRNA gene multivariate analysis

Metrics: UniFrac

A distance measure comparing multiple communities using phylogenetic information

Requires sequence alignment and tree-building: PyNAST, MUSCLE, Infernal

Time-consuming and susceptible to poor phylogenetic inference (does it matter?)

Weighted (abundance): ecological features related to abundance

Unweighted: ecological features related to taxonomic presence/absence
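A minimal sketch of the unweighted variant: the fraction of observed branch length that leads to only one of the two communities. The tree representation (a flat list of branches, each with its length and the set of tips below it) and the tip names are assumptions chosen to keep the example short; a real analysis would parse an actual tree.

```python
def unweighted_unifrac(branches, community_a, community_b):
    """Unique branch length divided by total observed branch length.

    branches: list of (length, tips_below), tips_below being a set of tip names.
    community_a, community_b: sets of tip names (OTUs) present in each sample.
    """
    unique = shared = 0.0
    for length, tips in branches:
        in_a = bool(tips & community_a)
        in_b = bool(tips & community_b)
        if in_a and in_b:
            shared += length
        elif in_a or in_b:
            unique += length
        # Branches leading to neither community are ignored.
    total = unique + shared
    return unique / total if total else 0.0


# Hypothetical tree ((t1:1, t2:1):2, (t3:1, t4:1):2) as a branch list.
tree = [
    (1.0, {"t1"}), (1.0, {"t2"}), (2.0, {"t1", "t2"}),
    (1.0, {"t3"}), (1.0, {"t4"}), (2.0, {"t3", "t4"}),
]
```

The weighted variant would additionally scale each branch by the difference in relative abundance below it, so it responds to shifts in dominance rather than only to membership.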

Page 33: Introduction to 16S rRNA gene multivariate analysis

Ordination example 1 (of many): Principal Coordinates Analysis

Classical Multidimensional Scaling (MDS; Gower 1966)

Procedure: based on eigenvectors; position objects in low-dimensional space while preserving distance relationships as well as possible

highly flexible: can choose among many association measures

In microbial ecology, used for visualizing phylogenetic or count-based distances

Consistent visual output for given distance matrix

Include variance explained (%) on Axis 1 and 2
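The eigenvector procedure above can be sketched without any external library: double-centre the squared distances (Gower's transformation), then extract the leading axes by power iteration. Power iteration with deflation is a simplification chosen for readability, not what production packages use, and the `pcoa` name and its parameters are assumptions. With non-Euclidean metrics some eigenvalues can be negative; this sketch simply reports zero for such axes.

```python
import math


def pcoa(dist, n_axes=2, iterations=500):
    """Classical MDS on a square distance matrix (list of lists).

    Returns (coords, explained): coords[i] holds the axis scores of object i,
    explained[k] is the eigenvalue of axis k as a fraction of the trace.
    """
    n = len(dist)
    # Gower double-centring of -0.5 * d^2.
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(r) / n for r in a]
    grand = sum(row) / n
    b = [[a[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]
    total = sum(b[i][i] for i in range(n))  # trace = sum of eigenvalues
    coords = [[] for _ in range(n)]
    explained = []
    for axis in range(n_axes):
        v = [float(i + 1) for i in range(n)]  # deterministic start vector
        lam = 0.0
        for _ in range(iterations):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                lam = 0.0
                break
            v = [x / norm for x in w]
            # Rayleigh quotient v'Bv estimates the eigenvalue for unit-length v.
            lam = sum(v[i] * b[i][j] * v[j] for i in range(n) for j in range(n))
        scale = math.sqrt(lam) if lam > 1e-12 else 0.0
        for i in range(n):
            coords[i].append(v[i] * scale)
        explained.append(lam / total if total > 0 and lam > 0 else 0.0)
        # Deflate so the next pass finds the next axis.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords, explained


# Hypothetical distances among three samples lying on a line at 0, 3 and 4.
example = [[0, 3, 4], [3, 0, 1], [4, 1, 0]]
coords, explained = pcoa(example)
```

The `explained` values are what would be reported as percentage variance on each axis label.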

Page 34: Introduction to 16S rRNA gene multivariate analysis

Ordination example 2 (of many): Non-metric Multidimensional Scaling

Ordination not based on eigenvectors

Does not preserve exact distances among objects

attempts to preserve ordering of samples (“ranks”)

Procedure:

iterative, tries to position the objects in a few (2-3) dimensions in such a way that minimizes the “stress”

how well does the new ranked distribution of points represent the original distances in the association matrix? Can express as R² on axes 1 and 2.

the adjustment goes on until the stress value reaches a local minimum (heuristic solution)

NMDS often represents distance relationships better than PCoA in the same number of dimensions

Susceptible to the “local minimum issue”, and therefore should have strong starting point (e.g., PCoA) or many permutations

You won't get the same result each time you run the analysis. Try several runs until you are comfortable with the result.
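To show what “stress” measures, the sketch below computes Kruskal's stress-1 for a given configuration: ordination distances are compared with their best monotone (rank-preserving) fit to the original dissimilarities. It evaluates one configuration only and does not perform the iterative adjustment; the function names are assumptions, and ties in the dissimilarities are handled naively by sort order.

```python
import math


def isotonic_fit(values):
    """Pool-adjacent-violators: best non-decreasing fit to a sequence."""
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while a block mean exceeds the mean that follows it.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fit = []
    for s, c in blocks:
        fit.extend([s / c] * c)
    return fit


def kruskal_stress(dissimilarities, distances):
    """Kruskal stress-1 for paired lists of original dissimilarities and
    the corresponding distances in the ordination."""
    order = sorted(range(len(dissimilarities)), key=lambda i: dissimilarities[i])
    d = [distances[i] for i in order]
    d_hat = isotonic_fit(d)
    num = sum((x - y) ** 2 for x, y in zip(d, d_hat))
    den = sum(x * x for x in d)
    return math.sqrt(num / den) if den else 0.0
```

A stress of zero means the ordination preserves the rank order of all pairwise dissimilarities exactly; an NMDS routine moves points to push this value down and stops at a local minimum.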

Page 35: Introduction to 16S rRNA gene multivariate analysis

Do my treatments separate?

Page 36: Introduction to 16S rRNA gene multivariate analysis

Beta-diversity: Hypothesis testing

Multiple methods, implemented in QIIME, mothur, AXIOME

e.g., MRPP, adonis, NP-MANOVA (perMANOVA), ANOSIM

Are treatment effects significant?

Because these are predominantly nonparametric methods, tests for significance rely on testing by permutation

Let's focus on MRPP

Page 37: Introduction to 16S rRNA gene multivariate analysis

Multiresponse Permutation Procedures

Compare intragroup average distances with the average distances that would have resulted from all the other possible combinations

T statistic: more negative with increasing group separation (T>-10 common for ecology)

A statistic: chance-corrected within-group agreement, i.e. how tightly points scatter within groups (A=1 when all points within a group fall on top of one another)

p value: likelihood of similar separation with randomized data.
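A permutation-based sketch of MRPP follows. It weights each group's mean within-group distance by group size over total size (one common weighting, assumed here), estimates the expected delta from random relabellings, and derives T from the permutation mean and standard deviation rather than from the analytical approximation that dedicated packages use. The function name and defaults are hypothetical.

```python
import random
import statistics


def mrpp(dist, groups, permutations=999, seed=0):
    """dist: square distance matrix (list of lists); groups: label per object.

    Returns (observed delta, A, T, p-value).
    """
    n = len(groups)

    def delta(labels):
        """Weighted mean of within-group average distances."""
        members = {}
        for idx, label in enumerate(labels):
            members.setdefault(label, []).append(idx)
        value = 0.0
        for idxs in members.values():
            pairs = [(i, j) for k, i in enumerate(idxs) for j in idxs[k + 1:]]
            if pairs:
                mean_within = sum(dist[i][j] for i, j in pairs) / len(pairs)
                value += (len(idxs) / n) * mean_within
        return value

    observed = delta(groups)
    rng = random.Random(seed)
    shuffled = list(groups)
    perm_deltas = []
    for _ in range(permutations):
        rng.shuffle(shuffled)  # random reassignment of objects to groups
        perm_deltas.append(delta(shuffled))
    expected = sum(perm_deltas) / len(perm_deltas)
    spread = statistics.pstdev(perm_deltas)
    a_stat = 1 - observed / expected if expected else 0.0
    t_stat = (observed - expected) / spread if spread else 0.0
    hits = sum(1 for d in perm_deltas if d <= observed + 1e-12)
    p_value = (hits + 1) / (permutations + 1)
    return observed, a_stat, t_stat, p_value


# Hypothetical one-dimensional data: two tight, well-separated groups.
points = [0, 1, 2, 3, 100, 101, 102, 103]
example_dist = [[abs(a - b) for b in points] for a in points]
example_groups = ["x"] * 4 + ["y"] * 4
observed, a_stat, t_stat, p_value = mrpp(example_dist, example_groups)
```

With only four objects per group there are few distinct relabellings, so the smallest attainable p value is limited; this is the same small-group caveat that applies to any permutation test.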

Page 38: Introduction to 16S rRNA gene multivariate analysis

Alpha and Beta diversity

Pipelines

Quick history

Future prospects and problems

Species that matter

Page 39: Introduction to 16S rRNA gene multivariate analysis

“PCoA plots are the first step of a community analysis, not the last.”

Josh Neufeld

Page 40: Introduction to 16S rRNA gene multivariate analysis

Searching for species that matter

High dimensional data often have too many features to investigate

solution: identify and study species significantly associated with categorical metadata

Indicator species (Dufrene-Legendre)

calculates indicator value (fidelity and relative abundance) of species

Permutation test for significance

Need solution for sparse data - be wary of groups with small numbers of sites (influence on permutation tests)

low abundance can artificially inflate indicator values

Page 41: Introduction to 16S rRNA gene multivariate analysis

Specificity

Fidelity

Page 42: Introduction to 16S rRNA gene multivariate analysis

IndVal (Dufrene & Legendre, 1997)

Specificity: large mean abundance within group relative to the summed mean abundances across all groups

Fidelity: presence in most or all sites of that group

Groups defined a priori by metadata or statistical clustering
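Specificity and fidelity combine into the indicator value as a simple product. The sketch below computes it for one species across groups of sites, assuming specificity is the group's mean abundance divided by the sum of mean abundances over all groups; the significance test by permutation is omitted, and the site data are invented.

```python
def indval(abundances, groups):
    """Indicator value (0 to 100) of one species for each group.

    abundances: abundance of the species at each site.
    groups: group label of each site, in the same order.
    """
    by_group = {}
    for value, label in zip(abundances, groups):
        by_group.setdefault(label, []).append(value)
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    total_mean = sum(means.values())
    result = {}
    for g, values in by_group.items():
        # Specificity: share of the summed group means held by this group.
        specificity = means[g] / total_mean if total_mean else 0.0
        # Fidelity: fraction of the group's sites where the species occurs.
        fidelity = sum(1 for v in values if v > 0) / len(values)
        result[g] = 100 * specificity * fidelity
    return result


# Hypothetical species: abundant at every "wet" site, nearly absent from "dry".
example = indval([10, 8, 12, 0, 0, 2], ["wet"] * 3 + ["dry"] * 3)
```

Because both factors are ratios, a species with a handful of reads confined to one small group can still score highly, which is the inflation risk noted for low-abundance taxa.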

Page 43: Introduction to 16S rRNA gene multivariate analysis

Metadata | Taxon | R^2 value
mbc | k__Bacteria;p__Planctomycetes;c__Planctomycetia;o__Gemmatales;f__Isosphaeraceae;g__ | 0.611368489781491
mbc | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.677209935419981
mbn | k__Bacteria;p__Proteobacteria;c__Alphaproteobacteria;o__Rhizobiales;f__Methylocystaceae;g__ | 0.64092523702996
soil_depth | k__Bacteria;p__Actinobacteria;c__Actinobacteria;o__Actinomycetales;f__Intrasporangiaceae;g__ | 0.669761188668774

Simple linear correlations
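An R² value of this kind is the squared Pearson correlation between a taxon's abundance and a continuous metadata variable across samples. The sketch below computes it directly; the input values in the usage line are invented and do not reproduce the values reported on the slide.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant variable carries no linear signal
    return (sxy * sxy) / (sxx * syy)


# Hypothetical taxon abundance versus a metadata variable in four samples.
example_r2 = r_squared([1, 2, 3, 4], [1, 3, 2, 4])
```

Screening many taxa against many variables this way produces many tests, so some form of multiple-testing control is worth considering before interpreting individual values.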

Page 44: Introduction to 16S rRNA gene multivariate analysis

mothur: cooccurrence function, measuring whether populations are co-occurring more frequently than you would expect by chance.
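One classical way to ask whether populations co-occur more often than expected by chance is the checkerboard score (C-score) compared against shuffled data. The sketch below is an illustration of that idea only: it is not mothur's implementation, and the null model (shuffling each OTU's presences across samples independently) is an assumption that real tools may not share.

```python
import random
from itertools import combinations


def c_score(presence):
    """Mean checkerboard score over all OTU pairs.

    presence: list of rows (one per OTU) of 0/1 values, one column per sample.
    Low values indicate OTUs that tend to occur together.
    """
    scores = []
    for a, b in combinations(presence, 2):
        shared = sum(1 for x, y in zip(a, b) if x and y)
        scores.append((sum(a) - shared) * (sum(b) - shared))
    return sum(scores) / len(scores) if scores else 0.0


def cooccurrence_test(presence, permutations=999, seed=0):
    """Return (observed C-score, p) where p is the share of shuffled tables
    whose C-score is at least as low as the observed one."""
    rng = random.Random(seed)
    observed = c_score(presence)
    hits = 0
    for _ in range(permutations):
        # Shuffle each OTU's occurrences across samples (row totals kept).
        shuffled = [rng.sample(row, len(row)) for row in presence]
        if c_score(shuffled) <= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Hypothetical pair of OTUs found in exactly the same four of eight samples.
together = [[1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0, 0, 0]]
observed_score, p_together = cooccurrence_test(together)
```

A small p here suggests aggregation; testing for avoidance would use the opposite tail.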

Page 45: Introduction to 16S rRNA gene multivariate analysis

Non-negative Matrix Factorization

NMF as a representation method for portraying high-dimensional data as a small number of taxonomic components. Patterns of co-occurring OTUs can be described by a smaller number of taxonomic components. Each sample represented by the collection of component taxa, helping identify relationships between taxa and the environment.

Jonathan Dushoff, McMaster University, Ontario, Canada
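For readers who want to see the factorization itself, here is a small pure-Python sketch using the multiplicative update rules of Lee and Seung. The choice of update rule, the random initialization, and the helper names are assumptions for illustration; they are not necessarily what the work described above used. V is a samples-by-OTUs table, W gives each sample's loading on each component, and H gives each component's OTU composition.

```python
import random


def matmul(a, b):
    """Matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Sum of squared differences between V and the product W H."""
    wh = matmul(w, h)
    return sum(
        (v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0]))
    )


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W'V) / (W'WH), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (VH') / (WHH'), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical table with two groups of co-occurring OTUs.
table = [[5, 4, 0, 0], [10, 8, 0, 0], [0, 0, 3, 6], [0, 0, 1, 2]]
w, h = nmf(table, rank=2)
```

Because every entry stays non-negative, each row of H can be read as a set of OTUs that rise and fall together, which is what makes the components interpretable as taxonomic assemblages.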

Page 46: Introduction to 16S rRNA gene multivariate analysis
Page 47: Introduction to 16S rRNA gene multivariate analysis

SSUnique

Page 48: Introduction to 16S rRNA gene multivariate analysis
Page 49: Introduction to 16S rRNA gene multivariate analysis
Page 50: Introduction to 16S rRNA gene multivariate analysis
Page 51: Introduction to 16S rRNA gene multivariate analysis
Page 52: Introduction to 16S rRNA gene multivariate analysis
Page 53: Introduction to 16S rRNA gene multivariate analysis
Page 54: Introduction to 16S rRNA gene multivariate analysis

SILVA

Page 55: Introduction to 16S rRNA gene multivariate analysis

SILVA

Page 56: Introduction to 16S rRNA gene multivariate analysis

SILVA

Page 57: Introduction to 16S rRNA gene multivariate analysis

SILVA

Page 58: Introduction to 16S rRNA gene multivariate analysis

[Figure: phylogenetic trees, panels (a) to (f), placing sequences labelled UL among reference taxa from Bacteroidetes, Planctomycetes, Chloroflexi, Cyanobacteria, Gemmatimonadetes, Alphaproteobacteria, and BRC1]

SILVA

Page 59: Introduction to 16S rRNA gene multivariate analysis

[Figure: phylogenetic trees, panels (a) and (b), placing sequences labelled UL and uncultured sequences relative to Gloeobacter violaceus and the remaining Cyanobacteria/chloroplasts, and relative to Rickettsia prowazekii, Escherichia coli, and eukaryotic groups (Chlorophyta/Embryophyta, Fungi, Rhodophyta, Chromalveolata), with Vibrio vulnificus among the outgroups]

Lynch et al. 2012

Nakai et al. 2012

Page 60: Introduction to 16S rRNA gene multivariate analysis

Alpha and Beta diversity

Pipelines

Quick history

Future prospects and problems

Species that matter

Page 61: Introduction to 16S rRNA gene multivariate analysis

Why pipelines?

Merge and manage (many) disparate techniques

Democratize analysis: improve accessibility

Accelerate pace of innovation, collaboration, and research

Page 62: Introduction to 16S rRNA gene multivariate analysis

Early synthesis

Early synthesis for numerical microbial ecology

Synthesis of 16S phylogenetics (Woese et al.) and Hughes (Counting the uncountable)

Numerical ecology for microorganisms

Algorithm development: libshuff, dotur (mothur)

Analysis pipelines: QIIME, mothur

Page 63: Introduction to 16S rRNA gene multivariate analysis

Knight Lab, U. Colorado at Boulder

Predominantly a collection of integrated Python/R scripts

Many dependencies

easy managed installation: qiime-deploy

MacQIIME

virtual box and Ubuntu fork: avoid for anything but small runs

Becoming the standard for marker gene studies

integrated analysis and visualization

easy access to broad computational biology toolbox (Python/R)

Page 64: Introduction to 16S rRNA gene multivariate analysis

Automation and extension

AXIOME and phyloseq

Extend existing technologies (QIIME, mothur, R, custom)

Layers of abstraction

Automation and rapid re-analysis

Promote reproducible research (IPython, XML, make)

Implement existing techniques (e.g., MRPP, Dufrene-Legendre IndVal)

numerical microbial ecology needs to better incorporate modern statistical theory

Develop and test new techniques

Page 65: Introduction to 16S rRNA gene multivariate analysis
Page 66: Introduction to 16S rRNA gene multivariate analysis
Page 67: Introduction to 16S rRNA gene multivariate analysis

Axiometic

GUI companion for AXIOME

Cross-platform

New implementation in development

Generates AXIOME file (XML)

xls template coming soon for all commands, sample metadata, and extra info… much easier for everyone.

Page 68: Introduction to 16S rRNA gene multivariate analysis

“QIIME wraps many other software packages, and these should be cited if they are used. Any time you're using tools that QIIME wraps, it is essential to cite those tools.” http://qiime.org/index.html

Page 69: Introduction to 16S rRNA gene multivariate analysis

Alpha and Beta diversity

Pipelines

Quick history

Future prospects and problems

Species that matter

Page 70: Introduction to 16S rRNA gene multivariate analysis

The future

As data get bigger, interpretation should be “hands off”

Move towards hypothesis testing of high-dimension taxonomic data

Convergence on Galaxy (e.g., QIIME in Galaxy is in development)

Further extension to cloud services (e.g., Amazon EC2)

Machine learning and data mining applications

Page 71: Introduction to 16S rRNA gene multivariate analysis

Open-source, web-based platform

Deployed locally or in the cloud

Ongoing development of 16S rRNA gene analysis

Page 72: Introduction to 16S rRNA gene multivariate analysis

Galaxy Workshed (available tools)

Page 73: Introduction to 16S rRNA gene multivariate analysis

“The advantages of having large numbers of samples at shallow coverage (~1,000 sequences per sample) clearly outweigh having a small number of samples at greater coverage for many datasets, suggesting that the focus for future studies should be on broader sampling that can reveal association with key biological parameters rather than on deeper sequencing.”
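Comparing samples at a shallow, even depth is usually done by random subsampling (rarefying) each sample to the same number of reads. The sketch below shows that step under simple assumptions: counts are held in a dictionary, sampling is without replacement, and samples shallower than the target depth are rejected. The function name and seed handling are hypothetical.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a dict of OTU counts to `depth` reads without replacement.

    Returns a new dict of counts, or None if the sample has too few reads.
    """
    # Expand counts into one entry per read, then draw `depth` of them.
    pool = [otu for otu, c in counts.items() for _ in range(c)]
    if len(pool) < depth:
        return None
    rng = random.Random(seed)
    picked = rng.sample(pool, depth)
    out = {}
    for otu in picked:
        out[otu] = out.get(otu, 0) + 1
    return out


# Hypothetical sample with 1,350 reads, rarefied to 1,000.
sample = {"OTU1": 900, "OTU2": 400, "OTU3": 45, "OTU4": 5}
shallow = rarefy(sample, 1000)
```

Rare OTUs such as `OTU4` may drop out entirely at shallow depth, which is the trade-off between broad sampling and the ability to detect subtle patterns.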

Page 74: Introduction to 16S rRNA gene multivariate analysis

“….even [phylogenetic beta-diversity] measures suited to the underlying mechanism of differentiation may require deep sequencing to reveal subtle patterns”

Dr. Donovan Parks

Page 75: Introduction to 16S rRNA gene multivariate analysis

Method standardization

Impossible.

Data storage

Sequence read output is outpacing the decline in data storage costs

Federated data?

File formats

e.g., FASTA (difficult to search, difficult to retrieve sequences, not space efficient, does not ensure data is in correct format, no space for metadata, no absolute standard)… relational databases?

Software

Free and Open Source enables an experiment to be faithfully replicated

Algorithms

Memory! Many clustering and phylogenetic inference algorithms scale as n²

Distributed, parallel, or cloud computing may not be helpful
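A quick back-of-the-envelope calculation shows why memory that grows with the square of the number of sequences becomes the bottleneck. The sketch assumes a pairwise distance matrix stored as 8-byte floats, either in full or as the condensed upper triangle; the function name and defaults are illustrative.

```python
def distance_matrix_bytes(n_sequences, bytes_per_value=8, condensed=False):
    """Memory needed to hold all pairwise distances among n sequences."""
    if condensed:
        # Upper triangle only: n * (n - 1) / 2 values.
        cells = n_sequences * (n_sequences - 1) // 2
    else:
        cells = n_sequences ** 2
    return cells * bytes_per_value


# 100,000 unique sequences: 80 GB in full, about 40 GB condensed.
full_gb = distance_matrix_bytes(100_000) / 1e9
condensed_gb = distance_matrix_bytes(100_000, condensed=True) / 1e9
```

Doubling the number of sequences quadruples the requirement, so spreading the work over more machines does not help unless the algorithm itself avoids holding every pair.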

Metadata

What to do with it? How to marry sequence and metadata sets?

We need better metadata integration, not necessarily more/better metadata

Page 76: Introduction to 16S rRNA gene multivariate analysis

What should we be doing? (take-home messages)

*Surveys are really important for spatial and temporal mapping

*Hypothesis testing follows (or implicit)

*What species account for treatment effects?

*Who tracks with who? (why=function)

*Who avoids who?

*Are all microorganisms accounted for? (no)

*How can we use this information to manipulate, manage and predict ecosystems?

Page 77: Introduction to 16S rRNA gene multivariate analysis

What should we be doing? (take-home messages)

There is no “one way” to analyze 16S rRNA

You need to build a pipeline for you.

If this seems daunting, it is.

If this is not daunting, your hands are dirty.

It’s getting better all the tii-ime.

Page 78: Introduction to 16S rRNA gene multivariate analysis

Helpful resources

Page 79: Introduction to 16S rRNA gene multivariate analysis
Page 80: Introduction to 16S rRNA gene multivariate analysis
Page 81: Introduction to 16S rRNA gene multivariate analysis
Page 82: Introduction to 16S rRNA gene multivariate analysis
Page 83: Introduction to 16S rRNA gene multivariate analysis

Thank you