CCBC tutorial beiko

Microbiome Analysis 16S AND METAGENOMICS


Page 1: CCBC tutorial beiko

Microbiome Analysis: 16S AND METAGENOMICS

Page 2: CCBC tutorial beiko

Welcome!

Your Tutorial Team:

Me (16S theory)

Mike Hall (16S practical)

Morgan Langille (metagenomics theory and practical)

Special thanks to:

Will Hsiao (CBW presentation)


Page 3: CCBC tutorial beiko

Today’s presentation

CBW “Analysis of metagenomic data”


http://bioinformatics.ca/workshops/2015/analysis-metagenomic-data-2015

Page 4: CCBC tutorial beiko

Overview

Morning session

1. A brief history of molecules and microbes

2. Why 16S?

3. How 16S analysis is usually done

4. Assumptions

5. Hands-on practical

Afternoon session

1. 16S vs Metagenomics

2. Metagenome Taxonomic Composition

3. Metagenome Functional Composition

4. PICRUSt: Functional Inference

5. Hands-on practical

Page 5: CCBC tutorial beiko

Learning objectives

At the end of the 16S tutorial, you should be able to do the following:

1. Run a simple QIIME analysis of a data set (https://www.dropbox.com/s/kpte51nm17wav9o/stool_data.zip)

2. Interpret analysis results

3. Understand the limitations of the standard 16S analysis pipeline


Page 6: CCBC tutorial beiko

Defining metagenomics

Microbiome: Attributed to Joshua Lederberg by Hooper and Gordon (2001): “the collective genome of our indigenous microbes (microflora), the idea being that a comprehensive genetic view of Homo sapiens as a life-form should include the genes in our microbiome”

Is also used to mean microbiota, the group of microorganisms found in a particular setting

(usage varies: be careful and precise!)

Metagenome: Handelsman et al. (1998) “…advances in molecular biology and eukaryotic genomics, which have laid the groundwork for cloning and functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil.”

Does not encompass marker-gene surveys (e.g., 16S)

This report says it does.

Page 7: CCBC tutorial beiko

Micro-what?

Metagenomics is often defined to encompass only Bacteria and Archaea (and often Archaea are excluded too!)

Other small things to consider:

◦ Viruses / phages

◦ Microbial eukaryotes

◦ Worms (helminths, nematodes, …)


Lukeš et al. (2015) PLoS Pathogens

Page 8: CCBC tutorial beiko

The dawn of metagenomics

3.5 BYA – the Archaean Eon

[Figure: 16S position 349 (-ish); ancestral base unknown (?), G in Archaea, A in Bacteria]

Page 9: CCBC tutorial beiko

Aaaaand more recently


Page 10: CCBC tutorial beiko

The 16S ribosomal RNA gene

THE FIRST WORD IN MICROBIAL BIODIVERSITY

Page 11: CCBC tutorial beiko

Yarza et al. (2014)

Escherichia coli ribosome (PDB 4YBB)

So much RNA!

Page 12: CCBC tutorial beiko

Why 16S?

The “universal phylogenetic marker”

(1) Present in all living organisms

(2) Single copy* (no recombination)

(3) Highly conserved + highly variable regions

(4) Huge reference databases


Page 13: CCBC tutorial beiko

Milestones


1990: “proposal for the domains Archaea, Bacteria, and Eucarya”

Page 14: CCBC tutorial beiko

Milestones


Nature (1990)

2002: “…as much as 50% of the total surface microbial community…”

Page 15: CCBC tutorial beiko

Milestones


PNAS (2006)

Many critical papers followed (error filtering, clustering approaches, …)

Page 16: CCBC tutorial beiko

Milestones


Huttenhower, Gevers et al. (2012)

+ 681 metagenomic samples

Page 17: CCBC tutorial beiko

16S analysis

HOW IT’S DONE

Page 18: CCBC tutorial beiko

Your basic workflow

Sample collection

DNA extraction

Amplification

Analysis

Page 19: CCBC tutorial beiko

Sample collection and DNA extraction

Defined protocols exist, many kits (e.g. PowerSoil®)

Need to consider barriers to DNA recovery and PCR (e.g. humic acids from soil, bile salts from feces)

Additional mechanical approaches (e.g., mechanical lysis of tissues with bead beating)

Kits and rogue lab DNA can end up in your sample – need to run negative controls!!

◦ Example from [year redacted]: shocking finding of bacterial DNA in the [location redacted]! However, [taxonomic group redacted] was a known frequent contaminant of DNA extraction kits.


Page 20: CCBC tutorial beiko


Size fractionation

http://www.jove.com/video/52685/automated-gel-size-selection-to-improve-quality-next-generation

Page 21: CCBC tutorial beiko

Choosing a PCR strategy

Need to consider:

◦ Correct melting temperature (60-65 degrees C for Illumina protocol)

◦ DNA sequencing read length (influences choice of primers)

◦ Primer specificity!

◦ Comparability with previous studies? [Good luck with that] [but that’s what the Earth Microbiome Project protocol (http://www.earthmicrobiome.org/emp-standard-protocols/16s/) is meant to achieve]

Page 22: CCBC tutorial beiko

Which variable regions to target?

V1-V3 favours Prevotella, Fusobacterium, Streptococcus, Granulicatella, Bacteroides, Porphyromonas and Treponema

V4-V6 favours Streptococcus, Treponema, Prevotella, Eubacterium, Porphyromonas, Campylobacter and Enterococcus.

◦ failed to detect Fusobacterium

V7-V9 favours Veillonella, Streptococcus, Eubacterium, Enterococcus, Treponema, Catonella and Selenomonas.

◦ failed to detect Selenomonas, TM7 and Mycoplasma


Page 23: CCBC tutorial beiko

At least there’s no shortage of options…


Detailed in silico evaluation of primers, experimental evaluation of two sets

Heavily biased recovery of Bacteria, Archaea, and missing groups depending on primer choice.

“Out of the 175 primers and 512 primer pairs checked, only 10 can be recommended as broad-range primers.”
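An in silico primer check of the kind described on this slide can be sketched in a few lines. This is only an illustration, not the evaluation pipeline of the cited study: the function name `primer_matches`, the toy primer and the toy target sequence are hypothetical, and only the forward strand is searched. The IUPAC degenerate-base codes themselves are standard.

```python
# Standard IUPAC nucleotide codes: each symbol maps to the bases it can stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_matches(primer, target, max_mismatches=0):
    """Return start positions in `target` where the (possibly degenerate)
    primer matches with at most `max_mismatches` mismatches.
    Forward strand only; hypothetical helper for illustration."""
    hits = []
    for start in range(len(target) - len(primer) + 1):
        window = target[start:start + len(primer)]
        mismatches = 0
        for p, t in zip(primer, window):
            if t not in IUPAC[p]:
                mismatches += 1
                if mismatches > max_mismatches:
                    break
        else:
            # Loop finished without exceeding the mismatch budget.
            hits.append(start)
    return hits
```

Running such a check against every sequence in a reference database, grouped by taxon, gives the fraction of each group a primer is expected to recover.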

Page 24: CCBC tutorial beiko

Amplification

Example: Illumina protocol


Page 25: CCBC tutorial beiko

Analysis (examples mostly from QIIME)

1. Quality Control

◦ Error checking

2. Sample diversity

◦ Taxonomy agnostic

◦ Taxonomy aware

3. Similarity among samples

4. Associations with metadata/groups (ANOSIM, MRPP)

5. Machine-learning classification

6. Functional prediction


Page 26: CCBC tutorial beiko

QIEME_PLACEHOLDER

Page 27: CCBC tutorial beiko

“Analysis” #1

Quality Control

Quality score filtering:

◦ Minimal length of consecutive high-quality bases (as % of total read length)

◦ Maximal number of consecutive low-quality bases

◦ Maximal number of ambiguous bases (N’s)

◦ Minimum Phred quality score

Other quality filtering tools available:

◦ Cutadapt (https://github.com/marcelm/cutadapt)

◦ Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)

◦ Sickle (https://github.com/najoshi/sickle)

Chimera checking:

◦ UCHIME
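The four quality-score criteria listed on this slide can be sketched as a single read filter. This is a minimal illustration, not QIIME's implementation: the function names and all default thresholds are hypothetical, and quality strings are assumed to use the Phred+33 encoding.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores
    (Phred+33 encoding is an assumption)."""
    return [ord(ch) - offset for ch in quality_string]


def passes_quality_filter(seq, quality_string, min_phred=20,
                          max_bad_run=3, min_good_fraction=0.75, max_n=0):
    """Apply the four criteria from the slide. All thresholds are
    illustrative defaults, not recommendations."""
    # Maximal number of ambiguous bases (N's).
    if seq.upper().count("N") > max_n:
        return False

    longest_good = longest_bad = good_run = bad_run = 0
    for q in phred_scores(quality_string):
        if q >= min_phred:          # minimum Phred quality score
            good_run += 1
            bad_run = 0
        else:
            bad_run += 1
            good_run = 0
        longest_good = max(longest_good, good_run)
        longest_bad = max(longest_bad, bad_run)

    # Maximal number of consecutive low-quality bases.
    if longest_bad > max_bad_run:
        return False
    # Minimal run of consecutive high-quality bases, as a fraction of read length.
    return longest_good >= min_good_fraction * len(seq)
```

Reads that fail any one criterion are discarded before clustering.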

Page 28: CCBC tutorial beiko

Sequence quality summary using FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Page 29: CCBC tutorial beiko

Analysis #2

Within-sample (“alpha”) diversity

To describe the diversity of a sample, you need to know what you are counting!

Individual sequences?

◦ Most precise, but vulnerable to sequencing error effects – inflation of diversity

Clusters of sequences?

◦ Operational taxonomic units (OTUs) – 97% sequence identity as the “species” level of similarity

Taxonomic groups?

◦ It’s always reassuring to put names on things, but taxonomic labels can be extremely misleading


Page 30: CCBC tutorial beiko

OTU clustering


Choose a % identity threshold

97%

Cluster centroids in some order (e.g., length, abundance) – these are reference sequences

Continue procedure until all sequences are clustered into OTUs (singletons may be excluded)

Calculate distances between sequences

6%
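The centroid-based procedure on this slide can be sketched as a greedy loop. This is a simplified illustration under stated assumptions: sequences are already aligned to equal length, identity is a plain position-by-position match fraction, and centroids are processed in order of decreasing abundance. Real clustering tools compute identity from alignments, and the names `identity` and `greedy_otu_clustering` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two pre-aligned,
    equal-length sequences (a simplifying assumption)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)


def greedy_otu_clustering(seq_counts, threshold=0.97):
    """Greedy centroid clustering.

    seq_counts: dict mapping sequence -> abundance.
    Returns a list of (centroid, members) tuples, one per OTU.
    """
    # Process the most abundant sequences first; ties broken alphabetically.
    ordered = sorted(seq_counts, key=lambda s: (-seq_counts[s], s))
    otus = []
    for seq in ordered:
        for centroid, members in otus:
            if identity(seq, centroid) >= threshold:
                members.append(seq)   # joins the first centroid within threshold
                break
        else:
            otus.append((seq, [seq]))  # no centroid close enough: new OTU
    return otus
```

Because each sequence joins the first centroid it matches, the result depends on the processing order, which is one reason different tools produce different OTUs from the same data.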

Page 31: CCBC tutorial beiko

What’s in a name?


Bacteroides

Parabacteroides

Ruminococcus

???

???

???

???

Akkermansia

Page 32: CCBC tutorial beiko

Taxonomic assignment

Many choices:

BLAST – assign taxonomic label of closest match (simple, possibly too simple)

Phylogenetic placement – e.g. Pplacer (Matsen et al. (2010) BMC Bioinformatics)

Machine-learning classification, in particular Naïve Bayes e.g. RDP Classifier, Wang et al. (2007) BMC Bioinformatics

Page 33: CCBC tutorial beiko

Example RDP Classifier output

GD6JEAT01AYGPE Root rootrank 1.0 Bacteria domain 1.0 "Planctomycetes" phylum 1.0 "Planctomycetacia" class 1.0 Planctomycetales order 1.0 Planctomycetaceae family 1.0 Schlesneria genus 0.96

GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0 Firmicutes phylum 0.32 Clostridia class 0.26 Clostridiales order 0.23 Ruminococcaceae family 0.22 Anaerotruncus genus 0.19

Includes bootstrap support
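The bootstrap values matter because a label is only as trustworthy as its support: the second read above is confidently Bacteria but little else. A small parser can truncate each lineage at the first poorly supported rank. This sketch assumes a tab-separated line made of a sequence ID followed by (name, rank, confidence) triplets; the exact layout depends on the classifier version and output option, and both the function name and the 0.8 cutoff are illustrative choices.

```python
def parse_rdp_line(line, min_confidence=0.8):
    """Parse one classifier output line and keep ranks down to the last
    one with confidence >= min_confidence.

    Assumes tab-separated fields: ID, then name/rank/confidence triplets.
    Empty fields and '-' placeholders are skipped.
    """
    fields = [f.strip().strip('"') for f in line.rstrip("\n").split("\t")]
    fields = [f for f in fields if f not in ("", "-")]
    seq_id, rest = fields[0], fields[1:]

    lineage = []
    usable = len(rest) - len(rest) % 3   # ignore a trailing partial triplet
    for i in range(0, usable, 3):
        name, rank, confidence = rest[i], rest[i + 1], float(rest[i + 2])
        if confidence < min_confidence:
            break                        # everything below this rank is unreliable
        lineage.append((rank, name, confidence))
    return seq_id, lineage
```

Applied to the two example reads, the first keeps its full lineage down to genus, while the second stops at the domain level.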

Page 34: CCBC tutorial beiko

Calculating alpha diversity

OTU counts – richness only

Simpson index – probability of sampling two individuals of the same type

Phylogenetic diversity – sum of branch lengths
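The three measures above can be written down directly. This is a minimal sketch with hypothetical function names. The Simpson index is computed as the sum of squared proportions (sampling with replacement), and the tree for phylogenetic diversity is assumed to be stored as a dict mapping each node to its parent and branch length, with branches counted all the way to the root.

```python
def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)


def simpson_index(counts):
    """Probability that two individuals drawn at random (with replacement)
    belong to the same OTU: the sum of squared proportions."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no observations")
    return sum((c / total) ** 2 for c in counts)


def phylogenetic_diversity(tree, present_tips):
    """Sum of branch lengths on the paths from each observed tip to the root.

    tree: dict mapping node -> (parent, branch_length); the root has no entry.
    Each branch is counted once, even if several tips share it.
    """
    visited = set()
    total = 0.0
    for tip in present_tips:
        node = tip
        while node in tree and node not in visited:
            visited.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total
```

Richness ignores abundance, the Simpson index ignores relatedness, and phylogenetic diversity ignores abundance but accounts for how distantly related the observed OTUs are.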


Page 35: CCBC tutorial beiko

Example: human body-site diversity


Huttenhower, Gevers et al. (2012)

Page 36: CCBC tutorial beiko

Analysis #3

Among-sample (“beta”) diversity

1. Perform pairwise comparisons between all samples to build a dissimilarity matrix

2. Summarize the matrix based on major patterns of covariance (ordination) or hierarchical similarity (clustering)

Page 37: CCBC tutorial beiko

Analysis #3

Among-sample (“beta”) diversity

Given a pair of samples (described as e.g. OTU abundance), calculate their dissimilarity

Beta-diversity measures can be:

◦ non-phylogenetic or phylogenetic

◦ weighted or unweighted

There are a lot of measures!

◦ Bray-Curtis (weighted, non-phylogenetic)

◦ Jaccard (unweighted, non-phylogenetic)

◦ Weighted UniFrac (weighted, phylogenetic)

◦ …
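The two non-phylogenetic measures are simple enough to write out, which makes the weighted/unweighted distinction concrete. This is a minimal sketch with hypothetical function names; samples are assumed to be equal-length lists of OTU counts in the same OTU order.

```python
def bray_curtis(a, b):
    """Weighted dissimilarity: sum of absolute count differences
    divided by the total count across both samples."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    return sum(abs(x - y) for x, y in zip(a, b)) / total


def jaccard(a, b):
    """Unweighted dissimilarity: 1 - (shared OTUs / OTUs present in either).
    Only presence or absence matters, not abundance."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        raise ValueError("both samples are empty")
    return 1 - len(present_a & present_b) / len(union)


def dissimilarity_matrix(samples, measure):
    """Square matrix of pairwise dissimilarities between all samples."""
    n = len(samples)
    return [[measure(samples[i], samples[j]) for j in range(n)] for i in range(n)]
```

Two samples that share the same OTUs at very different abundances have a Jaccard dissimilarity of 0 but a large Bray-Curtis dissimilarity.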


Page 38: CCBC tutorial beiko

Analysis #3

Among-sample (“beta”) diversity

How similar are the results of different measures?

CORRELATIONS between calculated values


Parks and Beiko (2013): ISME J

Page 39: CCBC tutorial beiko

Analysis #3

Among-sample (“beta”) diversity

What to do with a dissimilarity matrix?

Yatsunenko et al. (2012) Nature; Parks and Beiko (2012) Mol Biol Evol

Ordination / Clustering
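As an illustration of the clustering route, here is a small average-linkage (UPGMA) sketch that turns a dissimilarity matrix into a nested grouping of samples. It is a pure-Python illustration for small matrices, not the method used in the cited papers, and the function name and sample labels are hypothetical.

```python
def upgma(labels, dist):
    """Average-linkage hierarchical clustering.

    labels: sample names; dist: square symmetric dissimilarity matrix.
    Returns nested tuples describing the merge order.
    """
    n = len(labels)
    clusters = {i: (labels[i], 1) for i in range(n)}       # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(d.items(), key=lambda kv: kv[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            # Size-weighted average distance to the merged cluster.
            new_d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]
```

Feeding the same samples through different dissimilarity measures and comparing the resulting trees is a quick way to see how much the choice of measure matters.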

Page 40: CCBC tutorial beiko

Analysis #3

Among-sample (“beta”) diversity

Different beta-diversity measures can yield dramatically different clusters!


Parks and Beiko (2013): ISME J

Page 41: CCBC tutorial beiko

Analysis #4

Associations with metadata

PERMANOVA: Permutational multivariate analysis of variance

ANOSIM: Rank-based analysis of similarity (between-group vs within-group dissimilarities)

Mantel test: Correlation between two distance matrices

Good review: Anderson and Walsh (2013) Ecological Monographs

Example: Weighted UniFrac distance: root compartment explains 46.62% of variance (PERMANOVA p<0.001)

Unweighted UniFrac: root compartment explains only 18.07% of variance (PERMANOVA p<0.001); soil type is more important
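The "explains X% of variance" and p-values in the example come from partitioning squared dissimilarities into within-group and among-group sums of squares, then permuting group labels. The following is a minimal one-factor sketch of that idea, not the implementation used in the cited study; the function names, the permutation count and the seed are illustrative, and it assumes every group has some within-group variation.

```python
import random
from itertools import combinations


def _ss_within(dist, groups):
    """Within-group sum of squares computed from pairwise dissimilarities."""
    members = {}
    for idx, g in enumerate(groups):
        members.setdefault(g, []).append(idx)
    return sum(
        sum(dist[i][j] ** 2 for i, j in combinations(idxs, 2)) / len(idxs)
        for idxs in members.values()
    )


def permanova(dist, groups, n_perm=999, seed=0):
    """One-factor PERMANOVA sketch. Returns (pseudo_F, R2, p_value)."""
    n = len(groups)
    a = len(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n

    def pseudo_f(labels):
        ss_w = _ss_within(dist, labels)
        ss_a = ss_total - ss_w
        return (ss_a / (a - 1)) / (ss_w / (n - a))

    f_obs = pseudo_f(groups)
    r2 = (ss_total - _ss_within(dist, groups)) / ss_total  # fraction of variance explained

    rng = random.Random(seed)
    labels = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)              # break any real group structure
        if pseudo_f(labels) >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)
```

The R2 value corresponds to the "explains 46.62% of variance" style of statement, and the p-value is the fraction of label shufflings that separate the groups at least as well as the real labels do.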

Page 42: CCBC tutorial beiko

Analysis #5

Machine-learning classification

Identify aspects of community structure that are predictive of sample attributes

Advantages of machine-learning approaches:

◦ Non-linear combinations of variables

◦ Data transformations

◦ Can accommodate many different representations of the data

Disadvantages:

◦ Complex, may “overfit”

◦ Can be time consuming

◦ Obfuscation of predictive rules


Page 43: CCBC tutorial beiko

Random forests (supervised_learning.py)

“…there are only weak and, for the most part, non-significant associations of particular taxa or overall diversity with the obese human gut that hold true across different studies. However, using supervised learning with receiver operator curves to maximize sensitivity and specificity, one can categorize subjects according to lean and obese states with in some cases considerable accuracy…”

Page 44: CCBC tutorial beiko

Tree-based classifications

Nested clade analysis

and feature selection

Classification of plaque samples using support vector machines


Ning and Beiko (2015): Microbiome

Page 45: CCBC tutorial beiko

Analysis #6

Functional prediction

PICRUSt: Langille et al (2013) Nat Biotechnol


Morgan can tell you about this…

Page 46: CCBC tutorial beiko

Assumptions

THAT ARE OFTEN FALSE

Page 47: CCBC tutorial beiko

Do not assume that

#1: 16S is an effective proxy for microbial diversity.

#2: All 16S studies are created equal, with results that are comparable.

#3: Rarefaction is a good idea.

#4: 16S OTUs describe ecologically cohesive units (“species”?).

#5: The 16S tree is the “Tree of Life”.


Page 48: CCBC tutorial beiko

Assumption #1

16S is an effective proxy for microbial diversity.

rrnDB: Stoddard et al. NAR (2014)

Estimating copy number: Kembel et al. (2012) and PICRUSt (coming up later)

Variation: Coenye and Vandamme (2003)
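Copy-number variation biases abundance estimates: an organism carrying seven 16S copies contributes seven times as many reads per cell as one carrying a single copy. A simple correction divides read counts by the estimated copy number and renormalizes. This sketch is not the method of the cited tools; the function name is hypothetical, and in practice the copy numbers would be taken or inferred from a resource such as rrnDB.

```python
def copy_number_corrected_abundance(read_counts, copy_numbers):
    """Convert 16S read counts to relative cell abundances.

    read_counts:  dict taxon -> number of 16S reads
    copy_numbers: dict taxon -> estimated 16S gene copies per genome
    Returns dict taxon -> relative abundance (values sum to 1).
    """
    corrected = {}
    for taxon, reads in read_counts.items():
        copies = copy_numbers[taxon]
        if copies <= 0:
            raise ValueError(f"invalid copy number for {taxon}")
        corrected[taxon] = reads / copies   # reads per gene copy ~ cells
    total = sum(corrected.values())
    if total == 0:
        raise ValueError("no reads to normalize")
    return {taxon: value / total for taxon, value in corrected.items()}
```

For example, a taxon with 70% of the reads and seven copies per genome ends up at 25% of the corrected community when the only other taxon has 30% of the reads and a single copy.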

Page 49: CCBC tutorial beiko

Assumption #1

16S is an effective proxy for microbial diversity.

Alternative marker genes: cpn60, rpoB, …

Smaller reference databases!

Protein-coding genes!

Page 50: CCBC tutorial beiko

Assumption #2

All 16S studies are created equal.

Effects of sequencing platform, V region, amplicon vs metagenomics

Tremblay et al. (2015) Front Microbiol

Page 51: CCBC tutorial beiko

Assumption #3

Rarefaction is a good idea.

Example of statistics before and after rarefaction:

Loss of statistical power

Random subsampling can increase false-positive differences

Arbitrary minimum library size chosen for downsampling

Alternatives, e.g., negative binomial fitting (DESeq2)

McMurdie and Holmes (2014) PLoS Comp Biol
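To make the criticism concrete, this is what rarefaction does to a sample: it throws away reads at random until every library has the same size. A minimal sketch, assuming counts are stored as a dict of OTU to read count; the function name, the depth and the seed are illustrative.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Randomly subsample `depth` reads without replacement.

    counts: dict OTU -> read count. Returns a dict with the same keys.
    """
    total = sum(counts.values())
    if depth > total:
        raise ValueError("sample has fewer reads than the requested depth")
    # Expand the counts into one entry per read, then draw without replacement.
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    drawn = Counter(random.Random(seed).sample(pool, depth))
    return {otu: drawn.get(otu, 0) for otu in counts}
```

Repeating the call with different seeds gives different tables from the same sample, and rare OTUs can drop to zero, which is the source of both the lost power and the added noise noted above.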

Page 52: CCBC tutorial beiko

Assumption #4

16S OTUs describe ecologically cohesive units.

[Figure: distributions of sequence similarity (dashed line = OTU threshold) and branch lengths]

Nguyen et al. (2016) npj Biofilms and Microbiomes

Page 53: CCBC tutorial beiko

Assumption #4

16S OTUs describe ecologically cohesive units.

Hall et al., in preparation

Same OTU, different temporal patterns

Page 54: CCBC tutorial beiko

Assumption #4

16S OTUs describe ecologically cohesive units.

Many alternatives exist, including Swarm: Mahé et al. (2015) PeerJ

Page 55: CCBC tutorial beiko

Assumption #5

The 16S tree is the “Tree of Life”.

16S is limited for several reasons:

Limited resolving power

Subject to compositional bias

Subject to recombination and lateral transfer

Models typically applied to protein-coding genes do not make sense for noncoding RNA


Page 56: CCBC tutorial beiko

Moving On

ADVENTURES IN “MULTI-OMICS”

Page 57: CCBC tutorial beiko

Multi-omics??

16S can profile the biodiversity of a microbial sample…

But we need the metagenome to shine a light on function…

The metatranscriptome tells us what is expressed under specific conditions…

And the metaproteome can quantify the relative abundance of different enzymes…

While the metametabolome focuses on the products of metabolism.

What do we really need?


Page 58: CCBC tutorial beiko

Metagenomic / metatranscriptomic analysis of acid mine drainage (AMD) - Hua et al., ISME J (2015)

Draft genomes at MG-RAST

Page 59: CCBC tutorial beiko


Differences in the microbiome between arsenic-exposed and control mice

16S taxonomic analysis + metametabolomics

Taxonomy

Metabolic function

Page 60: CCBC tutorial beiko

Hands on!

LET’S MAKE SCIENCE HAPPEN

Page 61: CCBC tutorial beiko

The Dataset


Page 62: CCBC tutorial beiko

Workflow

1. Retrieve data

2. Cluster sequences

3. Taxonomic classification

4. Phylogenetic tree construction

5. OTU table creation

6. Downstream visualization / analysis


Page 63: CCBC tutorial beiko

FIN


Presentations

http://www.slideshare.net/MickWatson/studying-the-microbiome

http://bioinformatics.ca/metagenomics2015module2pptx