CCBC tutorial beiko

Microbiome Analysis

16S AND METAGENOMICS


Welcome!

Your Tutorial Team:

Me (16S theory)

Mike Hall (16S practical)

Morgan Langille (metagenomics theory and practical)

Special thanks to:

Will Hsiao (CBW presentation)


Today’s presentation

CBW “Analysis of metagenomic data”


http://bioinformatics.ca/workshops/2015/analysis-metagenomic-data-2015

Overview

Morning session

1. A brief history of molecules and microbes
2. Why 16S?
3. How 16S analysis is usually done
4. Assumptions
5. Hands-on practical

Afternoon session

1. 16S vs Metagenomics
2. Metagenome Taxonomic Composition
3. Metagenome Functional Composition
4. PICRUSt: Functional Inference
5. Hands-on practical


Learning objectives

At the end of the 16S tutorial, you should be able to do the following:

1. Run a simple QIIME analysis of a data set (https://www.dropbox.com/s/kpte51nm17wav9o/stool_data.zip)

2. Interpret analysis results

3. Understand the limitations of the standard 16S analysis pipeline


Defining metagenomics

Microbiome: Attributed to Joshua Lederberg by Hooper and Gordon (2001): “the collective genome of our indigenous microbes (microflora), the idea being that a comprehensive genetic view of Homo sapiens as a life-form should include the genes in our microbiome”

The term is also used to mean microbiota, the group of microorganisms found in a particular setting

(usage varies: be careful and precise!)

Metagenome: Handelsman et al. (1998) “…advances in molecular biology and eukaryotic genomics, which have laid the groundwork for cloning and functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil.”

Does not encompass marker-gene surveys (e.g., 16S)

This report says it does.


Micro-what?

Metagenomics is often defined to encompass only Bacteria and Archaea (and often Archaea are excluded too!)

Other small things to consider:

◦ Viruses / phages

◦ Microbial eukaryotes

◦ Worms (helminths, nematodes, …)


Lukeš et al. (2015) PLoS Pathogens

The dawn of metagenomics

3.5 BYA – the Archaean Eon

[Figure: 16S position 349 (-ish); G / A; Archaea / Bacteria; ancestor: ?]


Aaaaand more recently


The 16S ribosomal RNA gene

THE FIRST WORD IN MICROBIAL BIODIVERSITY


Yarza et al. (2014)

Escherichia coli ribosome (PDB 4YBB)

So much RNA!

Why 16S?

The “universal phylogenetic marker”

(1) Present in all living organisms

(2) Single copy* (no recombination)

(3) Highly conserved + highly variable regions

(4) Huge reference databases


Milestones


1990: “proposal for the domains Archaea, Bacteria, and Eucarya”

Milestones


Nature (1990)

2002: “…as much as 50% of the total surface microbial community…”

Milestones


PNAS (2006)

Many critical papers followed (error filtering, clustering approaches, …)

Milestones


Huttenhower, Gevers et al. (2012)

+ 681 metagenomic samples

16S analysis

HOW IT’S DONE


Your basic workflow

Sample collection

DNA extraction

Amplification

Analysis


Sample collection and DNA extraction

Defined protocols exist, many kits (e.g. PowerSoil®)

Need to consider barriers to DNA recovery and PCR (e.g. humic acids from soil, bile salts from feces)

Additional mechanical approaches (e.g., mechanical lysis of tissues with bead beating)

Kits and rogue lab DNA can end up in your sample – need to run negative controls!!

◦ Example from [year redacted]: shocking finding of bacterial DNA in the [location redacted]! However, [taxonomic group redacted] was a known frequent contaminant of DNA extraction kits.


Size fractionation

http://www.jove.com/video/52685/automated-gel-size-selection-to-improve-quality-next-generation

Choosing a PCR strategy

Need to consider:

◦ Correct melting temperature (60-65 degrees C for Illumina protocol)

◦ DNA sequencing read length (influences choice of primers)

◦ Primer specificity!

◦ Comparability with previous studies? [Good luck with that] [but that’s what the Earth Microbiome Project protocol http://www.earthmicrobiome.org/emp-standard-protocols/16s/ is meant to achieve]


Which variable regions to target?

V1-V3 favours Prevotella, Fusobacterium, Streptococcus, Granulicatella, Bacteroides, Porphyromonas and Treponema

V4-V6 favours Streptococcus, Treponema, Prevotella, Eubacterium, Porphyromonas, Campylobacter and Enterococcus.

◦ failed to detect Fusobacterium

V7-V9 favours Veillonella, Streptococcus, Eubacterium, Enterococcus, Treponema, Catonella and Selenomonas.

◦ failed to detect Selenomonas, TM7 and Mycoplasma


At least there’s no shortage of options…


Detailed in silico evaluation of primers, experimental evaluation of two sets

Heavily biased recovery of Bacteria, Archaea, and missing groups depending on primer choice.

“Out of the 175 primers and 512 primer pairs checked, only 10 can be recommended as broad-range primers.”

Amplification

Example: Illumina protocol


Analysis (examples mostly from QIIME)

1. Quality Control

◦ Error checking

2. Sample diversity

◦ Taxonomy agnostic

◦ Taxonomy aware

3. Similarity among samples

4. Associations with metadata/groups (ANOSIM, MRPP)

5. Machine-learning classification

6. Functional prediction


QIIME | Mothur
A Python interface to glue together many programs | Single program with minimal external dependencies
Wrappers for existing programs | Reimplementation of popular algorithms
Large number of dependencies / VM available | Easy to install and set up; works best on a single multi-core server with lots of memory
More scalable | Less scalable
Steeper learning curve, but more flexible workflow if you can write your own scripts | Easy to learn, but workflow works best with built-in tools

http://www.ncbi.nlm.nih.gov/pubmed/24060131

http://www.mothur.org/wiki/MiSeq_SOP

Will Hsiao

“Analysis” #1

Quality Control


Quality score filtering:

◦ Minimal length of consecutive high-quality bases (as % of total read length)

◦ Maximal number of consecutive low-quality bases

◦ Maximal number of ambiguous bases (N’s)

◦ Minimum Phred quality score
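As a rough illustration of the four quality-score criteria above, the sketch below truncates a read at the first long run of low-quality bases and then applies the length and N checks. It is a simplified stand-in, not QIIME's implementation: the function names, the default thresholds, and the Phred+33 encoding are all assumptions made for the example.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer Phred scores (assumes Phred+33)."""
    return [ord(ch) - 33 for ch in quality_string]


def quality_filter(seq, quals, min_phred=20, max_bad_run=3,
                   min_kept_fraction=0.75, max_n=0):
    """Return the (possibly truncated) read, or None if it fails a check.

    All thresholds are illustrative defaults, not recommendations.
    """
    bad_run = 0
    keep = len(seq)
    for i, q in enumerate(quals):
        if q < min_phred:
            bad_run += 1
            if bad_run > max_bad_run:
                # Too many consecutive low-quality bases: cut where the run began.
                keep = i - bad_run + 1
                break
        else:
            bad_run = 0
    trimmed = seq[:keep]
    # The retained stretch must be a minimum fraction of the original read.
    if len(trimmed) < min_kept_fraction * len(seq):
        return None
    # Reject reads with too many ambiguous bases.
    if trimmed.upper().count("N") > max_n:
        return None
    return trimmed
```

Short low-quality runs (up to max_bad_run bases) are tolerated inside the kept stretch, which is one of several reasonable ways to read the first two criteria.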

Other quality filtering tools available

◦ Cutadapt (https://github.com/marcelm/cutadapt)

◦ Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)

◦ Sickle (https://github.com/najoshi/sickle)

Chimera checking:

◦ UCHIME


Sequence quality summary using FastQC

http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Analysis #2

Within-sample (“alpha”) diversity

To describe the diversity of a sample, you need to know what you are counting!

Individual sequences?

◦ Most precise, but vulnerable to sequencing error effects – inflation of diversity

Clusters of sequences?

◦ Operational taxonomic units (OTUs) – 97% sequence identity as the “species” level of similarity

Taxonomic groups?

◦ It’s always reassuring to put names on things, but taxonomic labels can be extremely misleading


OTU clustering


Choose a % identity threshold (e.g., 97%)

Calculate distances between sequences

Select cluster centroids in some order (e.g., length, abundance) – these are reference sequences

Continue procedure until all sequences are clustered into OTUs (singletons may be excluded)

[Figure label: 6%]
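The greedy, centroid-based procedure on this slide can be sketched in a few lines. This is a toy version under strong assumptions: sequences are taken to be pre-aligned and of equal length so that identity is a simple position-by-position match, and reads are processed in order of decreasing abundance. Real OTU pickers use proper alignment and heuristics; the names `identity` and `greedy_otus` are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions (assumes pre-aligned sequences)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_otus(reads, threshold=0.97):
    """Cluster reads into OTUs around greedily chosen centroids.

    reads: dict mapping read ID -> (sequence, abundance).
    Returns a dict mapping centroid ID -> list of member read IDs.
    """
    # Most abundant reads get the first chance to become centroids.
    order = sorted(reads, key=lambda rid: -reads[rid][1])
    otus = {}
    for rid in order:
        for centroid in otus:
            if identity(reads[rid][0], reads[centroid][0]) >= threshold:
                otus[centroid].append(rid)
                break
        else:
            # No existing centroid is close enough: start a new OTU.
            otus[rid] = [rid]
    return otus
```

Changing the processing order (for example by length instead of abundance) can change which reads become centroids, and therefore the OTUs themselves.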

What’s in a name?


Bacteroides

Parabacteroides

Ruminococcus

???

???

???

???

Akkermansia

Taxonomic assignment

Many choices:

BLAST – assign taxonomic label of closest match (simple, possibly too simple)

Phylogenetic placement – e.g. Pplacer (Matsen et al., BMC Bioinformatics 2010)

Machine-learning classification, in particular Naïve Bayes e.g. RDP Classifier, Wang et al. (2007) BMC Bioinformatics


Example RDP Classifier output


GD6JEAT01AYGPE Root rootrank 1.0 Bacteria domain 1.0 "Planctomycetes" phylum 1.0 "Planctomycetacia" class 1.0 Planctomycetales order 1.0 Planctomycetaceae family 1.0 Schlesneria genus 0.96

GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0 Firmicutes phylum 0.32 Clostridia class 0.26 Clostridiales order 0.23 Ruminococcaceae family 0.22 Anaerotruncus genus 0.19

Includes bootstrap support
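One common use of the bootstrap support values is to keep only the deepest rank that clears a cutoff. The sketch below parses lines in the whitespace-separated form shown on this slide (read ID followed by name / rank / support triples) and reports the last confident rank. The 0.8 cutoff and both function names are assumptions for illustration; unquoted taxon names containing spaces would need a tab-aware parser instead.

```python
import shlex


def parse_rdp_line(line):
    """Split one classifier line into (read_id, [(name, rank, support), ...])."""
    tokens = shlex.split(line)  # shlex strips the quotes around some names
    read_id, rest = tokens[0], tokens[1:]
    lineage = [(rest[i], rest[i + 1], float(rest[i + 2]))
               for i in range(0, len(rest) - 2, 3)]
    return read_id, lineage


def confident_assignment(lineage, min_support=0.8):
    """Return (name, rank) of the deepest rank with support >= min_support."""
    best = None
    for name, rank, support in lineage:
        if support < min_support:
            break
        best = (name, rank)
    return best
```

Applied to the two reads above with a 0.8 cutoff, the first is kept at genus level (Schlesneria, 0.96), while the second is only assigned at domain level (Bacteria) because its phylum support is 0.32.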

Calculating alpha diversity

OTU counts – richness only

Simpson index – probability of sampling two individuals of the same type

Phylogenetic diversity – sum of branch lengths
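The three measures above are simple enough to sketch directly. The Simpson index is written here as the sum of squared proportions, matching the "two individuals of the same type" definition (many tools report 1 minus this value instead). Phylogenetic diversity is computed on a hypothetical tree representation, a dict mapping each node to its parent and the length of the branch leading to it, which is an assumption made for this example.

```python
def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts if c > 0)


def simpson_index(counts):
    """Probability that two draws (with replacement) hit the same OTU."""
    total = sum(counts)
    return sum((c / total) ** 2 for c in counts)


def phylogenetic_diversity(tree, observed_tips):
    """Sum of branch lengths on the paths from observed tips to the root.

    tree: dict mapping node -> (parent, branch_length); the root maps to (None, 0.0).
    """
    seen = set()
    total = 0.0
    for tip in observed_tips:
        node = tip
        # Walk towards the root, counting each branch only once.
        while node is not None and node not in seen:
            seen.add(node)
            parent, length = tree[node]
            total += length
            node = parent
    return total
```

Two samples with the same richness can differ widely in the other two measures, which is why the choice of measure matters.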


Example: human body-site diversity


Huttenhower, Gevers et al. (2012)

Analysis #3

Among-sample (“beta”) diversity

1. Perform pairwise comparisons between all samples to build a dissimilarity matrix

2. Summarize the matrix based on major patterns of covariance (ordination) or hierarchical similarity (clustering)



Given a pair of samples (described as e.g. OTU abundance), calculate their dissimilarity

Beta-diversity measures can be:

◦ non-phylogenetic or phylogenetic

◦ weighted or unweighted

There are a lot of measures!

-Bray-Curtis (weighted, non-phylogenetic)

-Jaccard (unweighted, non-phylogenetic)

-Weighted UniFrac (weighted, phylogenetic)

-…
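The two non-phylogenetic measures listed above can be sketched as follows, together with a helper that builds the full pairwise dissimilarity matrix. Samples are assumed to be equal-length lists of OTU counts in the same OTU order, with at least one non-zero count per sample; the function names are hypothetical.

```python
def bray_curtis(a, b):
    """Weighted, non-phylogenetic: uses OTU abundances."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))


def jaccard(a, b):
    """Unweighted, non-phylogenetic: uses presence/absence only."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    shared = len(present_a & present_b)
    return 1 - shared / len(present_a | present_b)


def dissimilarity_matrix(samples, measure):
    """Apply a pairwise measure to every pair of samples."""
    n = len(samples)
    return [[measure(samples[i], samples[j]) for j in range(n)]
            for i in range(n)]
```

Passing `bray_curtis` or `jaccard` as `measure` on the same samples is a quick way to see how much the weighted and unweighted views can disagree.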



How similar are the results of different measures?

CORRELATIONS between calculated values


Parks and Beiko (2013): ISME J


What to do with a dissimilarity matrix?


Yatsunenko et al. (2012) Nature

Parks and Beiko (2012) Mol Biol Evol

Ordination

Clustering


Different beta-diversity measures can yield dramatically different clusters!


Parks and Beiko (2013): ISME J

Analysis #4

Associations with metadata

PERMANOVA: Permutational multivariate analysis of variance

ANOSIM: Rank-based analysis of similarity

Mantel test: Correlation between two distance matrices (with a group-membership matrix, this compares between-group vs within-group distances)


Good review: Anderson and Walsh (2013) Ecological Monographs

Example: Weighted UniFrac distance: root compartment explains 46.62% of variance (PERMANOVA p<0.001)

Unweighted UniFrac: root compartment explains only 18.07% of variance (PERMANOVA p<0.001); soil type is more important
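The "explains X% of variance" values in this example are presumably the PERMANOVA R² (among-group sum of squares over total sum of squares). A minimal sketch of the one-factor calculation from a dissimilarity matrix is given below, assuming at least two groups and non-zero within-group variation. The function names, the permutation count, and the fixed seed are choices made for illustration, not the implementation used in the cited study.

```python
import random
from itertools import combinations


def pseudo_f(dist, labels):
    """Return (pseudo-F, R^2) for one grouping factor.

    dist: square matrix (list of lists) of pairwise dissimilarities.
    labels: group label for each sample, in matrix order.
    """
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        members = [i for i in range(n) if labels[i] == g]
        ss_within += sum(dist[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    f = (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))
    return f, ss_among / ss_total


def permanova(dist, labels, n_perm=999, seed=0):
    """Permutation p-value: how often shuffled labels give F >= observed F."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (n_perm + 1)
```

Because the input is just a dissimilarity matrix, running the same test on weighted and unweighted UniFrac matrices can give quite different R² values, as in the example above.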

Analysis #5

Machine-learning classification

Identify aspects of community structure that are predictive of sample attributes

Advantages of machine-learning approaches:

◦ Non-linear combinations of variables

◦ Data transformations

◦ Can accommodate many different representations of the data

Disadvantages:

◦ Complex, may “overfit”

◦ Can be time consuming

◦ Obfuscation of predictive rules


Random forests (supervised_learning.py)


“…there are only weak and, for the most part, non-significant associations of particular taxa or overall diversity with the obese human gut that hold true across different studies. However, using supervised learning with receiver operator curves to maximize sensitivity and specificity, one can categorize subjects according to lean and obese states with in some cases considerable accuracy…”

Tree-based classifications

Nested clade analysis and feature selection

Classification of plaque samples using support vector machines


Ning and Beiko (2015): Microbiome

Analysis #6

Functional prediction

PICRUSt: Langille et al (2013) Nat Biotechnol


Morgan can tell you about this…

Assumptions

THAT ARE OFTEN FALSE


Do not assume that

#1: 16S is an effective proxy for microbial diversity.

#2: All 16S studies are created equal, with results that are comparable.

#3: Rarefaction is a good idea.

#4: 16S OTUs describe ecologically cohesive units (“species”?).

#5: The 16S tree is the “Tree of Life”.


Assumption #1

16S is an effective proxy for microbial diversity.


rrnDB: Stoddard et al. NAR (2014)

Estimating copy number: Kembel et al. (2012) and PICRUSt (coming up later)

Variation: Coenye and Vandamme (2003)
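To see why 16S copy number matters, consider correcting read counts by the number of 16S copies per genome before computing relative abundances. The sketch below shows only the arithmetic; the taxon names and copy numbers in the usage note are invented, and in practice copy numbers would come from a resource such as rrnDB or be estimated for unsequenced taxa.

```python
def copy_number_corrected(counts, copies_per_genome):
    """Convert 16S read counts to copy-number-corrected relative abundances.

    counts: dict mapping taxon -> 16S read count.
    copies_per_genome: dict mapping taxon -> 16S copies per genome.
    """
    corrected = {t: counts[t] / copies_per_genome[t] for t in counts}
    total = sum(corrected.values())
    return {t: value / total for t, value in corrected.items()}
```

With hypothetical counts of 70 and 30 reads for two taxa carrying 7 and 1 copies respectively, the uncorrected proportions are 0.70 and 0.30, but the corrected ones are 0.25 and 0.75: the apparent dominant taxon is actually the minority.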

Assumption #1

16S is an effective proxy for microbial diversity.

Alternative marker genes: cpn60, rpoB, …

Smaller reference databases!

Protein-coding genes!


Assumption #2

All 16S studies are created equal.

Effects of sequencing platform, V region, amplicon vs metagenomics


Tremblay et al. (2015) Front Microbiol

Assumption #3

Rarefaction is a good idea.

Example of statistics before and after rarefaction:

Loss of statistical power

Random subsampling can increase false-positive differences

Arbitrary minimum library size chosen for downsampling

Alternatives include negative binomial fitting (e.g., DESeq2)


McMurdie and Holmes (2014) PLoS Comp Biol
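For reference, rarefaction itself is just random subsampling without replacement down to a common depth. The sketch below (function name, seed handling, and the choice to return None for too-shallow libraries are all assumptions) makes the cost visible: every read beyond the chosen depth is thrown away, and libraries below that depth are dropped entirely.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample a vector of OTU counts to exactly `depth` reads.

    Returns None when the library has fewer than `depth` reads,
    mirroring the usual practice of discarding such samples.
    """
    # Expand counts into one entry per read, labelled by OTU index.
    pool = [otu for otu, c in enumerate(counts) for _ in range(c)]
    if depth > len(pool):
        return None
    rng = random.Random(seed)
    rarefied = [0] * len(counts)
    for otu in rng.sample(pool, depth):
        rarefied[otu] += 1
    return rarefied
```

Re-running with different seeds gives different tables from the same data, and rare OTUs often disappear altogether.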

Assumption #4

16S OTUs describe ecologically cohesive units.


[Figure: distribution of sequence similarity (dashed line = OTU threshold) and of branch lengths]

Nguyen et al. (2016) npj Biofilms and Microbiomes



Hall et al., in preparation

Same OTU, different temporal patterns



Many alternatives exist, including Swarm: Mahé et al. (2015) PeerJ

Assumption #5

The 16S tree is the “Tree of Life”.

16S is limited for several reasons:

Limited resolving power

Subject to compositional bias

Subject to recombination and lateral transfer

Models typically applied to protein-coding genes do not make sense for noncoding RNA


Moving On

ADVENTURES IN “MULTI-OMICS”


Multi-omics??

16S can profile the biodiversity of a microbial sample…

But we need the metagenome to shine a light on function…

The metatranscriptome tells us what is expressed under specific conditions…

And the metaproteome can quantify the relative abundance of different enzymes…

While the metametabolome focuses on the products of metabolism.

What do we really need?


Metagenomic / metatranscriptomic AMD (acid mine drainage) analysis - Hua et al., ISME J (2015)

Draft genomes at MG-RAST


Differences in the microbiome between arsenic-exposed and control mice

16S taxonomic analysis + metametabolomics

Taxonomy

Metabolic function

Hands on!

LET’S MAKE SCIENCE HAPPEN


The Dataset


Workflow

1. Retrieve data

2. Cluster sequences

3. Taxonomic classification

4. Phylogenetic tree construction

5. OTU table creation

6. Downstream visualization / analysis


FIN


Presentations

http://www.slideshare.net/MickWatson/studying-the-microbiome

http://bioinformatics.ca/metagenomics2015module2pptx
