Marker Gene Analysis: Best Practices
Susan Huse Marine Biological Laboratory /
Brown University October 17, 2012
Cleaning Data
Filtering: Remove reads that are likely to be overall low-quality and have errors throughout the read.
Quality Trimming: trim off nucleotides from the end(s) of the read based on local quality values.
Denoising: Adjust nucleotides that are more likely to be an error in base-calling (noise) than a true low-frequency variation (signal)
Anchor Trimming: trim the end of long amplicons to a conserved location in the SSU alignment
Chimera Removal: remove hybrid sequences created during amplification
Recommended 454 Filtering
• Exact match to barcode and proximal primer
• Optional denoising (currently only 454)
• Remove sequences
– with Ns
– that are too short
– Below average or window quality threshold
• Trim to distal primer or anchor
– Remove sequences without anchor / primer
SSU rRNA Anchor Trimming
Next-gen sequences often do not reach to the distal primer, and reads may have a range of lengths.
De novo OTU clustering and other sequence comparisons
are more consistent if all tags are trimmed to the same start and stop positions in the rRNA alignment.
Anchor trimming uses a highly conserved location situated
within the read length and truncates all reads to that position. Be careful that the anchor is unique and present across all taxa.
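A minimal sketch of anchor trimming, assuming the anchor is matched as an exact substring and that the trimmed read keeps the anchor itself. The function name `anchor_trim` and the toy anchor `GACCGG` are hypothetical; a real pipeline would likely allow for variation in the anchor.

```python
def anchor_trim(read, anchor):
    """Truncate a read at the end of the first exact anchor match.

    Returns None when the anchor is absent, so the caller can discard the read.
    """
    pos = read.find(anchor)
    if pos == -1:
        return None
    return read[: pos + len(anchor)]


# Toy reads (hypothetical); the third lacks the anchor and is dropped.
reads = ["ACGTTTGACCGGAATT", "ACGTAAGACCGGTT", "ACGTACGTACGT"]
trimmed = [anchor_trim(r, "GACCGG") for r in reads]
kept = [t for t in trimmed if t is not None]
print(kept)
```

Reads without the anchor are removed rather than kept at full length, which matches the "Remove sequences without anchor / primer" rule in the filtering list.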
[Figure: Quality Scores for Error Positions, untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Cumulative Percent of Errors (0% to 100%). 80% of error bases have a quality score <=16.]
Before trimming, most errors have low Q scores
An Illumina HiSeq Error Distribution
HiSeq Reads with Ns NTAGCACCAAACATAAATCACCTCACTTAAGTGGCTGGAGACAAATAATCTCTTTAATAACCTGATTCAGCGAAACCAATCCGCGGCATTTAGTAGCGGTA!NTAATTACCCCAAAAAGAAAGGTATTAAGGATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATGAAATCGCGTAGAGGCTTTGCTATTCAGCGTTTGATG!NGCGCCAATATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTGATGAACTAAGTCAACCTCAGCACTAACCTTGCGAGTCATTTCTTTGATTTGGTCAT!NGTAAAAATGTCTACAGTAGAGTCAATAGCAAGGCCACGACGCAATGGAGAAAGACGGAGAGCGCCAACGGCGTCCATCTCGAAGGAGTCGCCAGCGATAA!NTCTATGTGGCTAAATACGTTAACAAAAAGTCAGATATGGACCTTGCTGCTAAAGGTCTAGGAGCTAAAGAATGGAACAACTCACTAAAAACCAAGCTGTC!CAGTGGAATAGTCAGGTTAAATTTAATGTGACCGTNTNNNNNAATNNNNNNNNNNNNNNNNNNNNNNNCANNNNNTNGNNNNANNNNNTTGAGTGTGAGGT!CGGATTGTTCAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTNAACANNNNNNNNNNNNNATAGTAATCCACGCTCTTNTAANATGTCAACAAGAG!TATGCGCCAAATGCTTACTCAAGCTCAAACGGCTGGTCAGAATTTTACCAATGACCANNNCAAAGAAATGACTCGCAAGGTTAGTGCTGAGGTTGACTTAG!TAGAAGTCGTCATTTGGCGAGAAAGCTCAGTCTCAGGAGGAAGCGGAGCAGTCCAAANNNTTTTGAGATGGCAGCAACGGAAACCATAACGAGCATCATCT!TGCTGTTGAGTGGTCTCATGACAATAAAGTATGTCNCTGNNTTGAAGNNTNNNNNNNNNNNNNNNCTNATACAATCACGCNCANNNNNAAAAGTGTCGTGT!CTACTGCGACTAAAGAGATTCAGTACCTTAACGCTAAAGGTGCTTTGNCTTANNNNNNNNNNNNTGGCGACCCTGTTTTGTATGGCANCTTGCCGCCGCGT!CGGCAGAAGCCTGAATGAGCTTAATAGAGGCCAAAGCGGTCTGGAAACGTACGGATTNNNNAGTAACTTGACTCATGATTTCTTACCTATTAGTGGTTGAA!GTGATTTATGTTTGGTGCTATTGCTGGCGGTATTGCTTCTGCTCTTGNTGGTNNCNNNNNNNNNAAATTGTTTGGAGGCGGTCAAAANGCCGCCTCCGGTG!ATATCAACCACACCAGAAGCAGCATCAGTGACGACATTAGAAATATCCTTTGNAGTNNNNNNNNTATGAGAAGAGCCATACCGCTGATTCTGCGTTTGCTG!!
Illumina
In this dataset:
• 68 reads contained at least 1 N; of these:
– 14 (21%) could not be mapped to PhiX
– 7 of those 14 (50%) had only 1 N
– 24 (35%) contained more than 1 N
Minoche Filtering for Illumina
Minoche A, et al. 2011. Genome Biology 12: R112, using Beta vulgaris, Arabidopsis thaliana, and PhiX
Filters evaluated:
• Illumina Chastity (ChF)
• Low-Quality (B) tails
• Ns
• <1/3 of nt Q<30 in 1st half
• avgQ < 30 in 1st 30% of nt
Table 2: Expected error rates based on Q-scores (% of bases lost), from no filter to all filters [table values not preserved in the transcript]
Remaining Errors
[Figure: Quality Scores for Error Positions (Illumina), trimmed data vs. untrimmed data. x-axis: Quality Score (0 to 40); y-axis: Pct of Errors (0% to 100%).]
PCR errors?
QIIME Illumina Pipeline
• Single mismatch to barcode
• Trim read to last position above quality threshold q
• Remove sequences less than length threshold p
• Remove sequences with more than n Ns
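The per-read rules above can be sketched as follows. This is a literal reading of the slide, not QIIME's actual code; the function name `quality_filter` and the default values for `q`, `p`, and `n` are assumptions for illustration, and barcode matching is left out.

```python
def quality_filter(seq, quals, q=20, p=75, n=0):
    """Apply the three per-read rules; return the trimmed read or None."""
    # Trim to the last position whose quality is at or above q.
    last_good = -1
    for i, score in enumerate(quals):
        if score >= q:
            last_good = i
    trimmed = seq[: last_good + 1]
    # Remove sequences shorter than the length threshold p.
    if len(trimmed) < p:
        return None
    # Remove sequences with more than n Ns.
    if trimmed.count("N") > n:
        return None
    return trimmed


# Toy read (hypothetical): the low-quality tail of three bases is trimmed.
example = quality_filter("ACGTACGT", [30, 30, 30, 30, 30, 10, 5, 2], q=20, p=5)
print(example)
```

The order matters: trimming first means a read is judged on the length and N content of what would actually be kept.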
Paired-End Filtering
Read 1 (forward)
Read 2 (reverse)
A small insert size allows for sequence overlap
Area of sequence overlap
Keep only reads that match exactly throughout the region of overlap. Amplicons designed to completely overlap (e.g., V6) ensure the highest quality sequences.
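For the completely overlapping case, the exact-match rule reduces to comparing read 1 with the reverse complement of read 2. The sketch below assumes both reads cover exactly the same region and have already had primers removed; the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet ACGTN."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_complete_overlap(read1, read2):
    """Return read 1 if read 2 mirrors it exactly across the overlap, else None."""
    if read1 == reverse_complement(read2):
        return read1
    return None


# Toy pair (hypothetical): the second pair disagrees at one base and is dropped.
good = merge_complete_overlap("AAGGCT", "AGCCTT")
bad = merge_complete_overlap("AAGGCT", "AGCCTA")
print(good, bad)
```

With partial overlap the same comparison would be applied only to the overlapping columns, which requires locating the overlap first.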
But Variation Still Exists
Is this:
1. systematic bidirectional sequencing error (unlikely),
2. PCR error, or
3. natural variation?
[Figure: sequence logo (weblogo.berkeley.edu), positions 1 to 59, 5' to 3'. E. coli K-12 V6 paired end with complete perfect overlap.]
What are Chimeras and How do we find them?
1. PCR primer anneals to complementary target
2. Extension creates double-stranded amplicon
3. Premature dissociation terminates elongation
4. But… the incomplete strand binds to a different template at a conserved region…
5. …then extends to create a chimera
The chimera can act as a template during the next PCR round.
Chimera Detection
1. Look for the best match to the left (left parent)
2. Look for the best match to the right (right parent)
3. Compare the distance between the two parents: are they really different, or multiple entries for the same organism?
[Diagram: a chimeric read aligned against Parent A and Parent B]
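The three detection steps can be sketched on toy data. This assumes the query and all candidate parents are already aligned to the same length and that the breakpoint is given; real tools such as UChime search for the breakpoint and score the model, which is not shown here. All names and the `min_parent_diffs` threshold are hypothetical.

```python
def mismatches(a, b):
    """Count differing positions between two aligned strings."""
    return sum(x != y for x, y in zip(a, b))


def best_parent(segment, refs, start):
    """Name of the reference closest to the segment over the same columns."""
    end = start + len(segment)
    return min(refs, key=lambda name: mismatches(segment, refs[name][start:end]))


def check_chimera(query, refs, breakpoint, min_parent_diffs=3):
    left = best_parent(query[:breakpoint], refs, 0)            # step 1
    right = best_parent(query[breakpoint:], refs, breakpoint)  # step 2
    parent_distance = mismatches(refs[left], refs[right])      # step 3
    is_chimera = left != right and parent_distance >= min_parent_diffs
    return left, right, is_chimera


refs = {"A": "AAAAAAAAAA", "B": "CCCCCCCCCC"}
print(check_chimera("AAAAACCCCC", refs, breakpoint=5))
print(check_chimera("AAAAAAAAAA", refs, breakpoint=5))
```

The distance check in step 3 is what keeps near-identical reference entries for the same organism from being reported as two different parents.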
Detection methods differ by source of parents
1. Reference Comparison: check against known reference sequences
2. De novo detection: check all triplets in your amplification
Reference Comparison only as good as the Ref Set
• Can only find parents if they are in the RefSet
• Any chimeras in the Ref Set are deleterious!
• Sparse RefSet may not detect chimeras from closely related organisms (intra-genera, intra-species)
• Differential density of the Ref Set can create biases
• Poor matches to the Ref Set can be mistaken for chimeras
• Hard to detect if parents are similar, but may not matter
De Novo Pros and Cons
• Can detect parents not in the RefSet: novel, close neighbors, PCR errors, unexpected amplifications
• Must be run by amplification, i.e., by tube: all your parents, but only your parents
• Abundance profile can be tricky with long tail
• Early False Positives (parent is lost to RefSet) and False Negatives (chimera added to RefSet) will affect downstream calls
We use both de novo and ref
[Figure: Percent Chimeric for Various Datasets. x-axis: Percent of Reads that are Chimeric (0% to 50%); y-axis: Percent of Datasets (0% to 70%). Series: V6V4, V3V5.]
Rates of Chimera Formation in BPC Datasets, as a function of total reads, not unique sequences
Chimera detection programs optimized for short reads
• UChime (in USearch, QIIME and VAMPS)
• Perseus (in AmpliconNoise and mothur)
Aggregating
Downstream analytical techniques that compensate for inaccuracies in the remaining sequence data. (“Aggregating” is not accepted terminology in the field.)
Taxonomic assignments will generally remain the same despite a few mismatches, more so at coarser taxonomic levels (class vs. genus).
OTU clustering can round out small percentages of errors, depending on the algorithm used. Clustering at 3% can (but does not always!) aggregate sequences with 1 – 2% errors.
Taxonomic Filtering
In addition to the knowledge base associated with taxonomic names:
• Can filter many unintended PCR amplification products.
• Reads too far from the tree can be classified as “Unknown” and examined further.
• Important to map reads to all domains, not just Bacteria: primers can amplify across domains and organelles.
Amplification of other Domains
SSU region   Total Reads   Archaea   Bacteria   Organelle   Unknown
V6           529,359       0.02%     96%        4%          0.1%
V6-V4        3,437,855     0.3%      87%        8%          4%
Samples from Little Sippewissett Marsh. Organelles include mitochondria and chloroplasts.
Non SSU rRNA Amplification
Thank you, Hilary
DNA binding transcriptional dual regulator, tyrosine-binding
Predicted antibiotic transporter
Putative transport system permease protein
Predicted major pilin subunit
16S rRNA
16S rRNA
Conserved inner membrane protein cardiolipin synthase
Taxonomy
GAST: Global Alignment of Sequence Taxonomy
Uses sequence alignment to compare against a RefSet.
Distance = alignment distance to nearest RefSet sequence
(SILVA, Greengenes, Stajich Refs, UNITE, HOMD, etc.) (VAMPS)
RDP: Ribosomal Database Project
Uses k-mer matching to find nearest genus.
Bootstrap values reflect confidence in the assignment
(RDP Training set, Greengenes, etc.) (QIIME, VAMPS)
Sources of Error in Taxonomic Analyses
• Primer bias
• Chimeras
• Discovery of novel 16S
• Unrepresented in reference database
• Low-quality references
• Taxonomy not available
• Incorrect taxonomy in RefSet
• Ambiguous hypervariable sequence (>1 hit)
• RefSets often biased toward most studied
Creating OTUs:
Operational Taxonomic Units for taxonomy independent analyses
OTUs vs Taxonomy
• Novel organisms
• Many unnamed organisms
• Some clades only defined to phyla or class
• Many species names based on phenotype rather than genotype
• Do not lump together all 16S “unknowns” or diverse partially classified sequences.
Clustering Algorithms
Different clustering algorithms can have very different effects on the size and number of OTUs created…
Clustering Methods
De novo (open)
• greedy clusters - test sequentially and incorporate sequence into first qualifying OTU. Dependent on input order.
• average linkage - the average distance from a sequence to every other sequence in the OTU is less than the width. Dependent on input order. [complete and single linkage are other methods]
Reference (closed)
• greedy - map each sequence to representative sequences defining prebuilt clusters
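A minimal sketch of de novo greedy clustering, included to make the input-order dependence concrete. It assumes equal-length, pre-aligned sequences and uses the fraction of mismatched positions as the distance; the function names and the toy width of 0.1 are hypothetical.

```python
def distance(a, b):
    """Fraction of mismatched positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def greedy_cluster(seqs, width=0.03):
    """Test sequences in order; join the first OTU whose seed is within width."""
    seeds, clusters = [], []
    for seq in seqs:
        for i, seed in enumerate(seeds):
            if distance(seq, seed) <= width:
                clusters[i].append(seq)
                break
        else:
            # No qualifying OTU: this sequence seeds a new one.
            seeds.append(seq)
            clusters.append([seq])
    return clusters


a, b, c = "AAAAAAAAAA", "AAAAAAAAAC", "AAAAAAAACC"
print(len(greedy_cluster([b, a, c], width=0.1)))  # b seeds one OTU holding all three
print(len(greedy_cluster([a, b, c], width=0.1)))  # c is too far from seed a
```

The same three sequences give one OTU or two depending only on which sequence happens to come first, which is why abundance-sorted input is a common convention.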
The Problem of OTU Inflation
De novo clustering algorithms return more OTUs than predicted for mock communities.
OTU inflation leads to:
• alpha diversity inflation
• beta diversity inflation
Where does this inflation come from?
• residual sequencing errors
• chimeras
• multiple sequence alignments
• clustering algorithms
Rarefaction, Sample Size under OTU Inflation
[Figure: M2FN Rarefaction - PML (legend: PML MS-CL). x-axis: Number of Sequences Sampled (0 to 120,000); y-axis: OTUs (0 to 7,000). Curves for 5K, 10K, 15K, 20K, 50K, and 100K.]
Rarefaction, Sample Size with minimal OTU Inflation
[Figure: rarefaction curves (legend: PML SLP-PW-AL).]
Cluster to Reference
1. Create a comprehensive set of Cluster Representatives (e.g., new Greengenes) representing the breadth of Bacteria
2. Assign each sequence to ClusterRep <= W
3. If Seq is not a member of any cluster, set aside
4. Cluster denovo the set of extra-cluster sequences
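Steps 2 and 3 can be sketched as below, assuming equal-length, pre-aligned sequences and a mismatch-fraction distance. Assigning each read to its nearest representative (rather than the first one within W) is a choice made here for illustration; the names `assign_to_reference`, `OTU_1`, and `OTU_2` are hypothetical.

```python
def distance(a, b):
    """Fraction of mismatched positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign_to_reference(seqs, cluster_reps, width=0.03):
    """Assign each read to its nearest ClusterRep within width; set the rest aside."""
    assigned, leftovers = {}, []
    for seq in seqs:
        best_id = min(cluster_reps, key=lambda cid: distance(seq, cluster_reps[cid]))
        if distance(seq, cluster_reps[best_id]) <= width:
            assigned.setdefault(best_id, []).append(seq)
        else:
            leftovers.append(seq)
    return assigned, leftovers


cluster_reps = {"OTU_1": "AAAAAAAAAA", "OTU_2": "CCCCCCCCCC"}
reads = ["AAAAAAAAAA", "AAAAAAAAAC", "CCCCCCCCCC", "GGGGGGGGGG"]
assigned, leftovers = assign_to_reference(reads, cluster_reps, width=0.1)
print(assigned, leftovers)
```

The `leftovers` list is the set of extra-cluster sequences that step 4 would pass to a de novo clustering method.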
Advantages of clustering to full-length reference
• Not as prone to OTU inflation
• Can add new data as available
• Provides static Cluster IDs
– Can be used to compare short reads from different regions (v3-v5 and v6)
– Can compare with other projects using same Ref Set
Oligotyping
• Further differentiation within closely related organisms (e.g., genus)
• Rather than blanket 3% clustering, select sequence positions with the most information (Shannon Entropy)
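A short sketch of ranking alignment positions by Shannon entropy, $H = -\sum_b p_b \log_2 p_b$ over the bases $b$ observed in a column. It assumes equal-length, aligned reads; the function names and the toy reads are hypothetical and this is not the oligotyping software itself.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the base frequencies in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def most_informative_positions(aligned_reads, top=2):
    """Return the indices of the highest-entropy columns and all entropies."""
    columns = list(zip(*aligned_reads))
    entropies = [column_entropy(col) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda i: entropies[i], reverse=True)
    return ranked[:top], entropies


reads = ["ACGT", "ACGA", "ATGT", "ATGA"]
positions, entropies = most_informative_positions(reads, top=2)
print(positions, entropies)
```

Conserved columns score zero and contribute nothing, so only the variable positions are used to define oligotypes.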
Fusobacterium oligotypes across oral sites
[Figure: oligotype distribution by oral site: tonsils, subgingival plaque, supragingival plaque, saliva, tongue dorsum, throat, buccal mucosa, hard palate, keratinized gingiva.]
“But I’m not interested in the rare biosphere,
only the major players.
Can’t I just remove the low abundance OTUs?”
Consistent community profile across samples and environments
[Figure: three rank-abundance plots, for samples 700106784, 700038978, and 700023096. x-axis: OTU Rank; y-axis: Count in OTU.]
A small number of highly abundant organisms
A large number of low abundance organisms: the Rare Biosphere
Sogin et al, 2006. Microbial diversity in the deep sea and the underexplored “rare biosphere” PNAS 103: 12115-12120
Distribution of OTU relative abundances across 210 HMP stool samples
Huse et al. (2012) PLoS ONE
Distribution of OTU Absolute Abundances in English Channel Water Samples
[Figure: x-axis: Frequency in PML Samples (Absent, Singleton, Doubleton, 3-5, 6-10, 11-50, 51-500, >500); y-axis: OTUs.]
Everything may not be everywhere,
but everything is rare somewhere!
If you feel you must remove low abundance OTUs, don’t do it until you have clustered
ALL of your samples
Alpha and Beta Diversity:
Impacts of Sampling Depth and Diversity Algorithm
[Figure: Alpha Diversity - Richness. Horizontal axis ticks 0 to 40,000; vertical axis ticks 0 to 1,800. Legend: CL - ACE, SLP - ACE, CL - Chao, SLP - Chao, 1 in 5000, 1 in 2500, 1 in 1000, 1 in 500.]
Alpha diversity metrics are sensitive to cluster method, sequencing depth and rare OTUs
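To illustrate the sensitivity to rare OTUs, here is a sketch of the bias-corrected Chao1 richness estimator, $S_{obs} + \frac{F_1 (F_1 - 1)}{2 (F_2 + 1)}$, where $F_1$ and $F_2$ are the numbers of singleton and doubleton OTUs. The OTU counts are invented for illustration and are not from the datasets in the slides.

```python
def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU read counts."""
    observed = [c for c in otu_counts if c > 0]
    s_obs = len(observed)
    f1 = sum(1 for c in observed if c == 1)  # singletons
    f2 = sum(1 for c in observed if c == 2)  # doubletons
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


with_singletons = [10, 5, 2, 2, 1, 1, 1, 1]
without_singletons = [c for c in with_singletons if c > 1]
print(chao1(with_singletons), chao1(without_singletons))
```

Because the estimate is driven by the singleton and doubleton counts, spurious rare OTUs from errors or an inflating cluster method change it directly.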
[Figure: Sampling Depth and Alpha Diversity. x-axis: Sampling Depth (0 to 45,000); y-axis: Diversity (0 to 5). Legend: SLP - NPShannon, SLP - Simpson, CL - NPShannon, Simpson.]
Robust to both singletons and depth
Comparing Different Sampling Depths
The “population” is a set of 50,000 reads from one sample.
The “samples” are randomly selected subsets of sizes 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; and 25,000.
Calculate diversity estimates across subsample depths that all represent the same population.
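The subsampling design can be sketched as follows, assuming each read is represented by the label of the OTU it belongs to and that subsamples are drawn without replacement. The toy population of three OTUs is hypothetical.

```python
import random
from collections import Counter


def subsample(read_labels, depth, seed=0):
    """Draw `depth` reads without replacement and count reads per OTU."""
    rng = random.Random(seed)
    return Counter(rng.sample(read_labels, depth))


# Hypothetical 50,000-read population made of three OTUs.
population = ["OTU_A"] * 30000 + ["OTU_B"] * 15000 + ["OTU_C"] * 5000
depths = [1000, 5000, 7500, 10000, 15000, 20000, 25000]
subsamples = {d: subsample(population, d, seed=d) for d in depths}

for d in depths:
    print(d, dict(subsamples[d]))
```

Since every subsample comes from the same population, any systematic difference in a diversity estimate across depths reflects the metric rather than the community.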
Subsample 1,000 and 5,000 reads from a sample of 50,000 reads; pairwise distances for replicates at a single depth
[Figure: Community Distance of Subsamples. x-axis: Replicates; y-axis: Community Distance (0 to 0.12). Series: Bray-Curtis (1K), Bray-Curtis (5K), Morisita-Horn (1K), Morisita-Horn (5K).]
Effect of Sample Depth - Bray Curtis
[Figure: pairwise distances between subsamples of depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; and 25,000. Scale: 0.000 to 1.000.]
Bray Curtis uses absolute counts; intra-community distances are high as depths diverge.
Nearly 100% Different
Effect of Sample Depth - Morisita Horn
[Figure: pairwise distances between subsamples of depths 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; and 25,000. Scale: 0.000 to 0.009.]
Beta diversity metric that uses relative abundances and compensates for different sample sizes.
Distances are low across depths above the minimum sampling depth.
Nearly 0.5% Different
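The contrast between the two metrics can be reproduced on toy counts. This sketch uses the standard Bray-Curtis dissimilarity on raw counts, $1 - \frac{2 \sum_i \min(x_i, y_i)}{\sum_i x_i + \sum_i y_i}$, and the Morisita-Horn dissimilarity, $1 - \frac{2 \sum_i x_i y_i}{(D_x + D_y) X Y}$ with $D_x = \sum_i x_i^2 / X^2$ and $X = \sum_i x_i$. The two count vectors are invented: the same community profile at depths of 1,000 and 25,000.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity on absolute counts."""
    shared = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * shared / (sum(x) + sum(y))


def morisita_horn(x, y):
    """Morisita-Horn dissimilarity; depends only on relative abundances."""
    nx, ny = sum(x), sum(y)
    dx = sum(a * a for a in x) / nx**2
    dy = sum(b * b for b in y) / ny**2
    similarity = 2 * sum(a * b for a, b in zip(x, y)) / ((dx + dy) * nx * ny)
    return 1 - similarity


shallow = [600, 300, 100]       # 1,000 reads
deep = [15000, 7500, 2500]      # 25,000 reads, identical proportions
print(round(bray_curtis(shallow, deep), 3), round(morisita_horn(shallow, deep), 3))
```

Even with identical proportions, Bray-Curtis on raw counts reports the two depths as almost completely different, while Morisita-Horn reports essentially no distance.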
[Figure: SLP Clustering and Bray-Curtis, PCoA. x-axis: PC 1 (-0.4 to 0.8); y-axis: PC 2 (-0.3 to 0.4). Depths: 1,000; 2,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000; 30,000; 40,000.]
Bray-Curtis PCoA clusters entirely on depth (each point represents 10 atop one another)
[Figure: SLP Clustering with Morisita Horn, PCoA. Axes PC 1 and PC 2, with tick ranges of about -0.015 to 0.01 and -0.012 to 0.01. Depths: 1,000; 5,000; 7,500; 10,000; 15,000; 20,000; 25,000.]
Minimum sample depth here of 10,000, but will be a function of the diversity of the sample
Acknowledgements
The Josephine Bay Paul Center for Comparative Molecular Biology and Evolution
Mitch Sogin
David Mark Welch
Hilary Morrison
Joe Vineis
A. Murat Eren
Anna Shipunova
Andy Voorhis
Sharon Grim
Why filter infrequent errors?
Ns          Average 454 Error Rate   Errors / 400nt   Percent of Reads
0 or more   0.40%                    1.6              100%
0           0.40%                    1.6              99.3%
If we include all reads, with or without Ns, we have an overall error rate of 0.4%.
If, however, we remove the <1% of sequences with Ns, we still have an overall error rate of 0.4%.
Why bother??
454
Why filter infrequent errors?
Ns   Average Error Rate   Errors / 400nt   Percent of Reads
0    0.40%                1.6              99.3%
1    1.11%                3.1              0.57%
2    3.81%                8.7              0.1%
3    7.26%                16.5             0.0%
4    8.40%                19.2             0.0%
5    10.46%               25.1             0.0%
It’s not just improving the overall error rate, but removing spurious data
Low-quality reads can be interpreted as unique organisms: 0.7% of 500,000 reads = 3,500 “unique organisms”
454 Error Distribution
454 Errors are not evenly distributed among reads: Many reads have only a small number of errors, and a small number of reads have many errors
Distribution of errors in short reads (<100nt)
454
Most reads contain no errors at all
A good beginning can mask a bad end
Longer reads, pushing the limits
If 450 nt read and first 350 nt average 35:
if last 100 have an average of 25, avg qual = ((350*35) + (100*25)) / 450 = 33
If 450 nt read and first 400 nt average 35:
if last 50 have an average of 0, avg qual = ((400*35) + (50*0)) / 450 = 31
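The 450 nt example, where the last 50 positions average 0, can be checked with a short sketch that compares the whole-read average with a sliding-window minimum. The window size of 50 and the function names are assumptions for illustration; the read must be at least one window long.

```python
def mean_quality(quals):
    """Average quality over the whole read."""
    return sum(quals) / len(quals)


def min_window_quality(quals, window=50):
    """Lowest mean quality over any full sliding window of the read."""
    return min(
        mean_quality(quals[i : i + window])
        for i in range(len(quals) - window + 1)
    )


# First 400 nt average 35, last 50 nt average 0.
quals = [35] * 400 + [0] * 50
print(round(mean_quality(quals), 1))   # whole-read average stays above 30
print(min_window_quality(quals))       # the window filter sees the bad end
```

An average-quality threshold of 30 would pass this read, while a window threshold would reject or trim it.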
454 Filter Summary
Filter              Percent of Reads   Average Error Rate   Average Errors / 400 nt
N=0                 99%                0.40%                1.6
N>=1                1%                 0.91%                3.6
Exact Primer        95%                0.38%                1.5
Not Exact Primer    5%                 0.84%                3.4
Average Qual >=30   98%                0.90%                3.6
Average Qual <30    2%                 1.3%                 5.2
454
454 Filter Summary (cont)
Filter                        Percent of Reads   Average Error Rate   Average Errors / 400 nt
Read Length (500 - 600 nt)    99+%               0.39%                1.6
Read Length (<500, >600 nt)   0.1%               1.8%                 7.2
Filtered                      93%                0.36%                1.4
Unfiltered                    7%                 0.64%                2.6
454
Evaluating Chimeras (USearch)
Query
Parent A
Parent B
Diffs:
A,B: Q matches expected P
a,b: Q matches other P
p: A=B!=Q
Votes: + for Model, 0 neutral, ! against Model
Model: shows extent of Parent A and Parent B; xxxx is overlap matching A&B
Initial Length: 277
Check for left and right parents: BLAST the left (1-175), BLAST the right (175 - 277)
[Figure: BLAST results for the two segments, annotated with the extent of the query sequence and the extent of the match. One segment is a 100% match to Fusobacterium, the other a 100% match to Pseudomonas.]
Taxonomic Names
• Bergey’s Taxonomic Outline – manual of taxonomic names for bacteria
• List of Prokaryotic names with Standing in the Nomenclature (vetting process)
• NCBI – similar taxonomy, but multiple “subs” (subclass, suborder, subfamily, tribe)
• Archaea – a work in progress…
• Fungi – another work in progress…
Cluster “Width”
Diameter: Sequences are never more than D apart. (CL)
Radius: Sequences are never more than R from seed. (SL, AL, Gr)
Average Linkage collapses errors
Cluster Count: 1
Clusters tend to be heavily dominated by their most abundant sequence, which strongly weights the average and smoothes the noise.
Still lose outlier sequencing errors
Multiple sequencing errors still not clustered
Inflation in Action: Multiple Sequence Alignment
and Complete Linkage clustering
1,042 is a few more than the expected 2
Example MSA
18,156 sequences and 392 positions
Regardless of clustering algorithm, an MSA cannot fully align tags whose
sequences are too divergent
Relative Inflation
Absolute number of errant OTUs will increase with sample size. Relative number of errant OTUs will decrease with sample complexity.
The Magical 3%
NOT!
3% SSU OTUs = Species and
6% SSU OTUs = Genera
Clustering Questions
• How meaningful are clusters functionally?
• When is an errare rare and when is it an error?
• Should it be included in an existing cluster or start its own?
• How to place sequences if OTUs overlap?
• What is the effect of residual low quality data or chimeras?
• How sensitive are alpha and beta diversity estimates to clustering results?