CCBC tutorial beiko
Microbiome Analysis
16S AND METAGENOMICS
Welcome!
Your Tutorial Team:
Me (16S theory)
Mike Hall (16S practical)
Morgan Langille (metagenomics theory and practical)
Special thanks to:
Will Hsiao (CBW presentation)
Today’s presentation
CBW “Analysis of metagenomic data”
http://bioinformatics.ca/workshops/2015/analysis-metagenomic-data-2015
Overview
Morning session
1. A brief history of molecules and microbes
2. Why 16S?
3. How 16S analysis is usually done
4. Assumptions
5. Hands-on practical
Afternoon session
1. 16S vs Metagenomics
2. Metagenome Taxonomic Composition
3. Metagenome Functional Composition
4. PICRUSt: Functional Inference
5. Hands-on practical
Learning objectives
At the end of the 16S tutorial, you should be able to do the following:
1. Run a simple QIIME analysis of a data set (https://www.dropbox.com/s/kpte51nm17wav9o/stool_data.zip)
2. Interpret analysis results
3. Understand the limitations of the standard 16S analysis pipeline
Defining metagenomics
Microbiome: Attributed to Joshua Lederberg by Hooper and Gordon (2001): “the collective genome of our indigenous microbes (microflora), the idea being that a comprehensive genetic view of Homo sapiens as a life-form should include the genes in our microbiome”
Is also used to mean microbiota, the group of microorganisms found in a particular setting
(usage varies: be careful and precise!)
Metagenome: Handelsman et al. (1998) “…advances in molecular biology and eukaryotic genomics, which have laid the groundwork for cloning and functional analysis of the collective genomes of soil microflora, which we term the metagenome of the soil.”
Does not encompass marker-gene surveys (e.g., 16S)
This report says it does.
Micro-what?
Metagenomics is often defined to encompass only Bacteria and Archaea (and often Archaea are excluded too!)
Other small things to consider:
◦ Viruses / phages
◦ Microbial eukaryotes
◦ Worms (helminths, nematodes, …)
Lukeš et al. (2015) PLoS Pathogens
The dawn of metagenomics
3.5 BYA – the Archaean Eon
[Figure: 16S position 349 (-ish), ancestral state “?”, with G / A shown above Archaea / Bacteria]
Aaaaand more recently
The 16S ribosomal RNA gene
THE FIRST WORD IN MICROBIAL BIODIVERSITY
Yarza et al. (2014)
Escherichia coli ribosome (PDB 4YBB)
So much RNA!
Why 16S?
The “universal phylogenetic marker”
(1) Present in all living organisms
(2) Single copy* (no recombination)
(3) Highly conserved + highly variable regions
(4) Huge reference databases
Milestones
1990: “proposal for the domains Archaea, Bacteria, and Eucarya”
Milestones
Nature (1990)
2002: “…as much as 50% of the total surface microbial community…”
Milestones
PNAS (2006)
Many critical papers followed (error filtering, clustering approaches, …)
Milestones
Huttenhower, Gevers et al. (2012)
+ 681 metagenomic samples
16S analysis
HOW IT’S DONE
Your basic workflow
Sample collection
DNA extraction
Amplification
Analysis
Sample collection and DNA extraction
Defined protocols exist, many kits (e.g. PowerSoil®)
Need to consider barriers to DNA recovery and PCR (e.g. humic acids from soil, bile salts from feces)
Additional mechanical approaches (e.g., mechanical lysis of tissues with bead beating)
Kits and rogue lab DNA can end up in your sample – need to run negative controls!!
◦ Example from [year redacted]: shocking finding of bacterial DNA in the [location redacted]! However, [taxonomic group redacted] was a known frequent contaminant of DNA extraction kits.
Size fractionation
http://www.jove.com/video/52685/automated-gel-size-selection-to-improve-quality-next-generation
Choosing a PCR strategy
Need to consider:
◦ Correct melting temperature (60-65 degrees C for Illumina protocol)
◦ DNA sequencing read length (influences choice of primers)
◦ Primer specificity!
◦ Comparability with previous studies? [Good luck with that] [but that’s what the Earth Microbiome Project protocol http://www.earthmicrobiome.org/emp-standard-protocols/16s/ is meant to achieve]
Which variable regions to target?
V1-V3 favours Prevotella, Fusobacterium, Streptococcus, Granulicatella, Bacteroides, Porphyromonas and Treponema
V4-V6 favours Streptococcus, Treponema, Prevotella, Eubacterium, Porphyromonas, Campylobacter and Enterococcus.
◦ failed to detect Fusobacterium
V7-V9 favours Veillonella, Streptococcus, Eubacterium, Enterococcus, Treponema, Catonella and Selenomonas.
◦ failed to detect Selenomonas, TM7 and Mycoplasma
At least there’s no shortage of options…
Detailed in silico evaluation of primers, experimental evaluation of two sets
Heavily biased recovery of Bacteria, Archaea, and missing groups depending on primer choice.
“Out of the 175 primers and 512 primer pairs checked, only 10 can be recommended as broad-range primers.”
Amplification
Example: Illumina protocol
Analysis
(examples mostly from QIIME)
1. Quality Control
◦ Error checking
2. Sample diversity
◦ Taxonomy agnostic
◦ Taxonomy aware
3. Similarity among samples
4. Associations with metadata/groups (ANOSIM, MRPP)
5. Machine-learning classification
6. Functional prediction
QIIME
◦ A Python interface to glue together many programs
◦ Wrappers for existing programs
◦ Large number of dependencies / VM available
◦ More scalable
◦ Steeper learning curve but more flexible workflow if you can write your own scripts
◦ http://www.ncbi.nlm.nih.gov/pubmed/24060131
Mothur
◦ Single program with minimal external dependencies
◦ Reimplementation of popular algorithms
◦ Easy to install and set up; works best on a single multi-core server with lots of memory
◦ Less scalable
◦ Easy to learn but workflow works best with built-in tools
◦ http://www.mothur.org/wiki/MiSeq_SOP
Will Hsiao
“Analysis” #1
Quality Control
Quality score filtering:
◦ Minimal length of consecutive high-quality bases (as % of total read length)
◦ Maximal number of consecutive low-quality bases
◦ Maximal number of ambiguous bases (N’s)
◦ Minimum Phred quality score
Other quality filtering tools available
◦ Cutadapt (https://github.com/marcelm/cutadapt)
◦ Trimmomatic (http://www.usadellab.org/cms/?page=trimmomatic)
◦ Sickle (https://github.com/najoshi/sickle)
Chimera checking:
◦ UCHIME
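The four quality-score criteria listed above can be sketched in a few lines of Python. This is a minimal illustration, not QIIME's implementation: the function names, the default thresholds, and the Phred+33 quality encoding are all assumptions made for the example. The read is truncated at the start of the first run of low-quality bases that is longer than max_bad_run, then rejected if too little of it survives or if it contains too many N's.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string (Phred+33 encoding is assumed)."""
    return [ord(ch) - offset for ch in quality]


def quality_filter(seq, quality, min_phred=20, max_bad_run=3,
                   min_fraction=0.75, max_n=0):
    """Return the (possibly truncated) read, or None if it fails the filter.

    All thresholds are illustrative defaults, not values from any tool.
    """
    scores = phred_scores(quality)
    if len(scores) != len(seq):
        raise ValueError("sequence and quality string differ in length")
    if not seq:
        return None

    keep = len(seq)  # number of leading bases to keep
    bad_run = 0      # length of the current run of low-quality bases
    for i, q in enumerate(scores):
        if q < min_phred:
            bad_run += 1
            if bad_run > max_bad_run:
                keep = i - bad_run + 1  # truncate where the bad run started
                break
        else:
            bad_run = 0

    trimmed = seq[:keep]
    if len(trimmed) < min_fraction * len(seq):
        return None  # too little of the read survived truncation
    if trimmed.count("N") > max_n:
        return None  # too many ambiguous bases
    return trimmed


# Hypothetical 10-base read whose last five bases have Phred score 2 ('#').
print(quality_filter("ACGTACGTAC", "IIIII#####", min_fraction=0.5))
```

Short low-quality runs (up to max_bad_run bases) are tolerated here, so the retained read can still contain a few poor bases; a stricter filter would trim those as well.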
Sequence quality summary using FastQC
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Analysis #2
Within-sample (“alpha”) diversity
To describe the diversity of a sample, you need to know what you are counting!
Individual sequences?
◦ Most precise, but vulnerable to sequencing error effects – inflation of diversity
Clusters of sequences?
◦ Operational taxonomic units (OTUs) – 97% sequence identity as the “species” level of similarity
Taxonomic groups?
◦ It’s always reassuring to put names on things, but taxonomic labels can be extremely misleading
OTU clustering
1. Choose a % identity threshold (e.g., 97%)
2. Calculate distances between sequences
3. Pick cluster centroids in some order (e.g., length, abundance) – these are reference sequences
4. Continue procedure until all sequences are clustered into OTUs (singletons may be excluded)
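A greedy, centroid-based version of this procedure might look like the sketch below. It is a toy illustration under strong assumptions: reads are taken to be pre-aligned and of equal length, identity is the plain fraction of matching positions, and centroids are visited in order of decreasing abundance. Real clustering tools use proper alignment and heuristics; the function names here are hypothetical.

```python
from collections import Counter


def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otus(reads, threshold=0.97):
    """Cluster reads into OTUs; returns {centroid: [member sequences]}."""
    abundance = Counter(reads)
    # Most abundant unique sequence first; ties broken alphabetically
    # so that the result is reproducible.
    ordered = sorted(abundance, key=lambda s: (-abundance[s], s))

    otus = {}
    for seq in ordered:
        for centroid in otus:
            if identity(seq, centroid) >= threshold:
                otus[centroid].append(seq)  # joins the first matching OTU
                break
        else:
            otus[seq] = [seq]  # no centroid is close enough: new OTU
    return otus


# Hypothetical 10-base reads, clustered at 90% identity for readability.
reads = ["AAAAAAAAAA"] * 3 + ["AAAAAAAAAT"] + ["CCCCCCCCCC"] * 2
for centroid, members in greedy_otus(reads, threshold=0.9).items():
    print(centroid, members)
```

Because each sequence joins the first centroid it matches, the result depends on the visiting order, which is one reason different clustering strategies can disagree on the same data.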
What’s in a name?
[Figure labels: Bacteroides, Parabacteroides, Ruminococcus, ???, ???, ???, ???, Akkermansia]
Taxonomic assignment
Many choices:
BLAST – assign taxonomic label of closest match (simple, possibly too simple)
Phylogenetic placement – e.g. Pplacer (Matsen et al., BMC Bioinformatics 2010)
Machine-learning classification, in particular Naïve Bayes e.g. RDP Classifier, Wang et al. (2007) BMC Bioinformatics
Example RDP Classifier output
GD6JEAT01AYGPE Root rootrank 1.0 Bacteria domain 1.0 "Planctomycetes" phylum 1.0 "Planctomycetacia" class 1.0 Planctomycetales order 1.0 Planctomycetaceae family 1.0 Schlesneria genus 0.96
GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0 Firmicutes phylum 0.32 Clostridia class 0.26 Clostridiales order 0.23 Ruminococcaceae family 0.22 Anaerotruncus genus 0.19
Includes bootstrap support
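One way to use the bootstrap support is to keep only the deepest rank whose confidence clears a cutoff. The sketch below parses lines in the whitespace-separated form shown in the example output (the classifier's actual output may be delimited differently) and assumes taxon names contain no spaces. The function names and the 0.8 cutoff are assumptions for illustration.

```python
import shlex


def parse_rdp_line(line):
    """Split one line into (read_id, [(taxon, rank, confidence), ...])."""
    tokens = shlex.split(line)  # shlex strips the quotes around some names
    read_id, rest = tokens[0], tokens[1:]
    if len(rest) % 3 != 0:
        raise ValueError("expected taxon/rank/confidence triples")
    triples = [(rest[i], rest[i + 1], float(rest[i + 2]))
               for i in range(0, len(rest), 3)]
    return read_id, triples


def deepest_confident(triples, min_conf=0.8):
    """Return (taxon, rank) of the deepest assignment with enough support."""
    best = None
    for taxon, rank, conf in triples:
        if conf < min_conf:
            break  # support only drops further down the lineage
        best = (taxon, rank)
    return best


line = ('GD6JEAT01BEUG6 Root rootrank 1.0 Bacteria domain 1.0 '
        'Firmicutes phylum 0.32 Clostridia class 0.26')
read_id, triples = parse_rdp_line(line)
print(read_id, deepest_confident(triples))
```

With this cutoff the second read in the example output would be reported only as Bacteria, while the first would keep its genus-level label.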
Calculating alpha diversity
OTU counts – richness only
Simpson index – probability of sampling two individuals of the same type
Phylogenetic diversity – sum of branch lengths
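The three measures above can be written down directly. This is a minimal sketch with hypothetical names: counts maps OTU identifiers to read counts, and the tree is given as a child-to-parent map plus a branch length for each non-root node. Phylogenetic diversity is computed here as the total length of all branches on the paths from the observed OTUs to the root, which is one common convention.

```python
def richness(counts):
    """Number of OTUs observed at least once."""
    return sum(1 for c in counts.values() if c > 0)


def simpson_index(counts):
    """Probability that two reads drawn with replacement share an OTU."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample is empty")
    return sum((c / total) ** 2 for c in counts.values())


def phylogenetic_diversity(observed, parent, branch_length):
    """Sum of branch lengths linking the observed OTUs to the root."""
    visited = set()
    for tip in observed:
        node = tip
        # Walk towards the root; stop early on branches already counted.
        while node in parent and node not in visited:
            visited.add(node)
            node = parent[node]
    return sum(branch_length[node] for node in visited)


# Hypothetical sample and tree: ((otu1:1.0, otu2:2.0)n1:0.5, otu3:4.0)root
counts = {"otu1": 6, "otu2": 3, "otu3": 1}
parent = {"otu1": "n1", "otu2": "n1", "n1": "root", "otu3": "root"}
branch_length = {"otu1": 1.0, "otu2": 2.0, "n1": 0.5, "otu3": 4.0}

print(richness(counts))
print(simpson_index(counts))
print(phylogenetic_diversity(["otu1", "otu2"], parent, branch_length))
```

Note that a higher Simpson index as defined here means lower diversity; many tools report 1 minus this value instead.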
Example: human body-site diversity
Huttenhower, Gevers et al. (2012)
Analysis #3
Among-sample (“beta”) diversity
1. Perform pairwise comparisons between all samples to build a dissimilarity matrix
2. Summarize the matrix based on major patterns of covariance or hierarchical similarity
Analysis #3
Among-sample (“beta”) diversity
Given a pair of samples (described as e.g. OTU abundance), calculate their dissimilarity
Beta-diversity measures can be:
◦ non-phylogenetic or phylogenetic
◦ weighted or unweighted
There are a lot of measures!
-Bray-Curtis (weighted, non-phylogenetic)
-Jaccard (unweighted, non-phylogenetic)
-Weighted UniFrac (weighted, phylogenetic)
-…
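The two non-phylogenetic measures listed here are simple enough to sketch, along with the pairwise loop that builds a dissimilarity matrix. This is a minimal sketch: samples are assumed to be dictionaries mapping OTU identifiers to read counts, and the function names are hypothetical. UniFrac is omitted because it also needs a phylogenetic tree.

```python
def bray_curtis(a, b):
    """Weighted, non-phylogenetic: 1 - 2 * (shared abundance) / (total)."""
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(min(a.get(otu, 0), b.get(otu, 0)) for otu in set(a) | set(b))
    return 1 - 2 * shared / total


def jaccard(a, b):
    """Unweighted, non-phylogenetic: 1 - |shared OTUs| / |all OTUs|."""
    present_a = {otu for otu, c in a.items() if c > 0}
    present_b = {otu for otu, c in b.items() if c > 0}
    union = present_a | present_b
    if not union:
        raise ValueError("both samples are empty")
    return 1 - len(present_a & present_b) / len(union)


def dissimilarity_matrix(samples, measure):
    """Return (sample names, square matrix) for all pairwise comparisons."""
    names = sorted(samples)
    matrix = [[measure(samples[x], samples[y]) for y in names] for x in names]
    return names, matrix


# Hypothetical OTU counts for three samples.
samples = {
    "s1": {"otu1": 6, "otu2": 4},
    "s2": {"otu1": 2, "otu3": 8},
    "s3": {"otu1": 5, "otu2": 5},
}
names, matrix = dissimilarity_matrix(samples, bray_curtis)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

Swapping bray_curtis for jaccard in the last call changes only the measure, which makes it easy to see how much the choice of measure alters the matrix.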
Analysis #3
Among-sample (“beta”) diversity
How similar are the results of different measures?
CORRELATIONS between calculated values
Parks and Beiko (2013): ISME J
Analysis #3
Among-sample (“beta”) diversity
What to do with a dissimilarity matrix?
Ordination / Clustering
Yatsunenko et al. (2012) Nature
Parks and Beiko (2012) Mol Biol Evol
Analysis #3
Among-sample (“beta”) diversity
Different beta-diversity measures can yield dramatically different clusters!
Parks and Beiko (2013): ISME J
Analysis #4
Associations with metadata
PERMANOVA: Permutational multivariate analysis of variance
ANOSIM: Rank-based analysis of similarity
Mantel test: Comparison of between-group vs within-group distances
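As an illustration of how a rank-based permutation test of this kind works, here is a minimal ANOSIM sketch. It ranks all pairwise dissimilarities, computes R = (mean between-group rank - mean within-group rank) / (M / 2), where M is the number of sample pairs, and estimates a p-value by shuffling the group labels. The function names, the 999 permutations, and the fixed seed are assumptions for the example; this is not the QIIME implementation.

```python
import random
from itertools import combinations


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def anosim_r(dist, groups):
    """ANOSIM R statistic for a square dissimilarity matrix."""
    pairs = list(combinations(range(len(groups)), 2))
    ranks = average_ranks([dist[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    if not within or not between:
        raise ValueError("need both within- and between-group pairs")
    mean_between = sum(between) / len(between)
    mean_within = sum(within) / len(within)
    return (mean_between - mean_within) / (len(pairs) / 2)


def anosim(dist, groups, n_perm=999, seed=0):
    """Return (R, permutation p-value)."""
    rng = random.Random(seed)
    observed = anosim_r(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and distances
        if anosim_r(dist, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical matrix: two samples per group, groups well separated.
dist = [
    [0.0, 0.1, 0.7, 0.8],
    [0.1, 0.0, 0.9, 1.0],
    [0.7, 0.9, 0.0, 0.2],
    [0.8, 1.0, 0.2, 0.0],
]
print(anosim(dist, ["a", "a", "b", "b"]))
```

With only four samples there are very few distinct label arrangements, so the p-value cannot become small however clean the separation is; the test needs reasonable sample sizes to be informative.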
Good review: Anderson and Walsh (2013) Ecological Monographs
Example: Weighted UniFrac distance: root compartment explains 46.62% of variance (PERMANOVA p<0.001)
Unweighted UniFrac: root compartment explains only 18.07% of variance (PERMANOVA p<0.001); soil type is more important
Analysis #5
Machine-learning classification
Identify aspects of community structure that are predictive of sample attributes
Advantages of machine-learning approaches:
◦ Non-linear combinations of variables
◦ Data transformations
◦ Can accommodate many different representations of the data
Disadvantages:
◦ Complex, may “overfit”
◦ Can be time consuming
◦ Obfuscation of predictive rules
Random forests (supervised_learning.py)
“…there are only weak and, for the most part, non-significant associations of particular taxa or overall diversity with the obese human gut that hold true across different studies. However, using supervised learning with receiver operator curves to maximize sensitivity and specificity, one can categorize subjects according to lean and obese states with in some cases considerable accuracy…”
Tree-based classifications
Nested clade analysis and feature selection
Classification of plaque samples using support vector machines
Ning and Beiko (2015): Microbiome
Analysis #6
Functional prediction
PICRUSt: Langille et al (2013) Nat Biotechnol
Morgan can tell you about this…
Assumptions
THAT ARE OFTEN FALSE
Do not assume that
#1: 16S is an effective proxy for microbial diversity.
#2: All 16S studies are created equal, with results that are comparable.
#3: Rarefaction is a good idea.
#4: 16S OTUs describe ecologically cohesive units (“species”?).
#5: The 16S tree is the “Tree of Life”.
Assumption #1
16S is an effective proxy for microbial diversity.
rrnDB: Stoddard et al. NAR (2014)
Estimating copy number: Kembel et al. (2012) and PICRUSt (coming up later)
Variation: Coenye and Vandamme (2003)
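To see why 16S copy number matters, the sketch below divides each OTU's read count by an estimated copy number and renormalizes to relative abundances. It is a toy illustration only: the function name and the copy numbers are made up for the example, and real estimates would come from a resource such as rrnDB or from the prediction approaches cited above.

```python
def correct_for_copy_number(counts, copy_number, default=1.0):
    """Return relative abundances after dividing counts by 16S copy number.

    OTUs missing from copy_number fall back to `default` (an assumption).
    """
    corrected = {otu: c / copy_number.get(otu, default)
                 for otu, c in counts.items()}
    total = sum(corrected.values())
    if total == 0:
        raise ValueError("sample is empty")
    return {otu: value / total for otu, value in corrected.items()}


# Hypothetical sample: otuA carries 7 copies of the gene, otuB only 1.
counts = {"otuA": 70, "otuB": 30}
copies = {"otuA": 7, "otuB": 1}
print(correct_for_copy_number(counts, copies))
```

In this made-up case the OTU that dominates the raw reads turns out to be the minority organism once copy number is taken into account.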
Assumption #1
16S is an effective proxy for microbial diversity.
Alternative marker genes: cpn60, rpoB, …
Smaller reference databases!
Protein-coding genes!
Assumption #2
All 16S studies are created equal.
Effects of sequencing platform, V region, amplicon vs metagenomics
Tremblay et al. (2015) Front Microbiol
Assumption #3
Rarefaction is a good idea.
Example of statistics before and after rarefaction:
Loss of statistical power
Random subsampling can increase false-positive differences
Arbitrary minimum library size chosen for downsampling
Alternatives e.g. Negative Binomial fitting (e.g., DESeq2)
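For reference, rarefaction itself is just random subsampling without replacement down to a chosen depth. The minimal sketch below (hypothetical function name, counts as an OTU-to-reads dictionary) shows the operation being criticized here: every read beyond the chosen depth is thrown away, and the result changes with the random seed.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=None):
    """Randomly subsample `depth` reads without replacement."""
    total = sum(counts.values())
    if depth > total:
        raise ValueError("sample has fewer reads than the requested depth")
    # Expand counts into one entry per read, then draw without replacement.
    pool = [otu for otu, c in counts.items() for _ in range(c)]
    rng = random.Random(seed)
    return dict(Counter(rng.sample(pool, depth)))


# Hypothetical library of 1000 reads rarefied to 100.
library = {"otu1": 900, "otu2": 95, "otu3": 5}
print(rarefy(library, 100, seed=1))
print(rarefy(library, 100, seed=2))
```

Rare OTUs such as otu3 may or may not survive a given draw, which is one source of the noise introduced by subsampling.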
McMurdie and Holmes (2014) PLoS Comp Biol
Assumption #4
16S OTUs describe ecologically cohesive units.
[Figure: distributions of sequence similarity (dashed line = OTU threshold) and branch lengths]
Nguyen et al. (2016) npj Biofilms and Microbiomes
Assumption #4
16S OTUs describe ecologically cohesive units.
Hall et al., in preparation
Same OTU, different temporal patterns
Assumption #4
16S OTUs describe ecologically cohesive units.
Many alternatives exist, including Swarm: Mahé et al. (2015) PeerJ
Assumption #5
The 16S tree is the “Tree of Life”.
16S is limited for several reasons:
Limited resolving power
Subject to compositional bias
Subject to recombination and lateral transfer
Models typically applied to protein-coding genes do not make sense for noncoding RNA
Moving On
ADVENTURES IN “MULTI-OMICS”
Multi-omics??
16S can profile the biodiversity of a microbial sample…
But we need the metagenome to shine a light on function…
The metatranscriptome tells us what is expressed under specific conditions…
And the metaproteome can quantify the relative abundance of different enzymes…
While the metametabolome focuses on the products of metabolism.
What do we really need?
Metagenomic / metatranscriptomic AMD analysis - Hua et al., ISME J (2015)
Draft genomes at MG-RAST
Differences in the microbiome between arsenic-exposed and control mice
16S taxonomic analysis + metametabolomics
Taxonomy
Metabolic function
Hands on!
LET’S MAKE SCIENCE HAPPEN
The Dataset
Workflow
1. Retrieve data
2. Cluster sequences
3. Taxonomic classification
4. Phylogenetic tree construction
5. OTU table creation
6. Downstream visualization / analysis
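Step 5 is mostly bookkeeping: once every sequence has been assigned to a cluster, the OTU table is a count of sequences per OTU per sample. The sketch below assumes a hypothetical input (an OTU-to-sequence-identifier map) and assumes that each sequence identifier has the form sample name, underscore, running number. Both are assumptions for illustration, and in the practical this step is handled by the analysis pipeline itself.

```python
from collections import defaultdict


def build_otu_table(otu_map):
    """Turn {otu: [sequence ids]} into {otu: {sample: count}}.

    Sequence ids are assumed to look like '<sample>_<number>'.
    """
    table = defaultdict(lambda: defaultdict(int))
    for otu, seq_ids in otu_map.items():
        for seq_id in seq_ids:
            sample = seq_id.rsplit("_", 1)[0]  # text before the last "_"
            table[otu][sample] += 1
    return {otu: dict(samples) for otu, samples in table.items()}


# Hypothetical cluster assignments for two stool samples.
otu_map = {
    "OTU_1": ["stoolA_1", "stoolA_2", "stoolB_1"],
    "OTU_2": ["stoolB_2"],
}
print(build_otu_table(otu_map))
```

Splitting on the last underscore keeps sample names that themselves contain underscores intact.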
FIN
Presentations
http://www.slideshare.net/MickWatson/studying-the-microbiome
http://bioinformatics.ca/metagenomics2015module2pptx