18-21 August 2009 The Biosphere. 18-21 August 2009 Secondary structure of small subunit ribosomal...
18-21 August 2009
Secondary structure of small subunit ribosomal RNA
Figure labels: 5' end, 3' end
Image adapted from R. Gutell, http://www.rna.ccbb.utexas.edu/
18-21 August 2009
The 530 Loop of E. coli
Figure labels:
• Stem with canonical Watson-Crick base pairing
• Bulge
• Non-canonical G-U base pair
• Loop
18-21 August 2009
Similarity matrices comparing the 530 loop sequences and the full rRNA sequences of the six listed taxa
A. Similarity matrix for 530 loop
B. Similarity matrix for complete 16S rRNA
18-21 August 2009
References
Acknowledgement of rRNA secondary structure image:
• Cannone J.J., Subramanian S., Schnare M.N., Collett J.R., D'Souza L.M., Du Y., Feng B., Lin N., Madabusi L.V., Müller K.M., Pande N., Shang Z., Yu N., and Gutell R.R. 2002. The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs. BioMed Central Bioinformatics, 3:2. [Correction: BioMed Central Bioinformatics, 3:15.]
• Smith T.F., Gutell R., Lee J., and Hartman H. 2008. The origin and evolution of the ribosome. Biology Direct, 3:16.
• Woese C.R. 1987. Bacterial evolution. Microbiol Rev. 51(2):221-71.
• Zuckerkandl E. and Pauling L. 1965. Molecules as documents of evolutionary history. J Theor Biol. 8(2):357-66.
• Cole J., Wang Q., Cardenas E., Fish J., Chai B., Farris R., Kulam-Syed-Mohideen A., McGarrell D., Marsh T., Garrity G., and Tiedje J. 2009. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research. In press.
18-21 August 2009
Multiple Sequence Alignment
• Pairwise dynamic programming
– Smith-Waterman, Needleman-Wunsch
– Can be transformed into probabilistic framework
• Multidimensional dynamic programming
– Not practical
• Progressive alignment
– MUSCLE, ClustalW
– Both are progressive; MUSCLE also refines iteratively
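As an illustration of the pairwise dynamic programming named on this slide, here is a minimal Needleman-Wunsch sketch for global alignment. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for the example, not values from the talk.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),  # (mis)match
                              score[i - 1][j] + gap,            # gap in b
                              score[i][j - 1] + gap)            # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


print(needleman_wunsch("ACGT", "AGT"))
```

The table has one cell per pair of prefixes, which is why extending the same idea to many sequences at once (multidimensional dynamic programming) quickly becomes impractical.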
18-21 August 2009
BLAST
• Heuristic search strategy
• Locate high-scoring short matches
– 3 aa or 5 to 11 bases
• Extend short matches
• Determine significance using extreme value distribution statistics
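The seed-and-extend idea can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it uses exact word matches as seeds and an ungapped extension that stops once the running score falls too far below the best seen. The function names, the scoring values, and the `x_drop` threshold are assumptions made for the example.

```python
def find_seeds(query, subject, word_size=5):
    """Return (query_pos, subject_pos) for every exact shared word."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    return [(i, j)
            for i in range(len(query) - word_size + 1)
            for j in index.get(query[i:i + word_size], [])]


def extend_seed(query, subject, i, j, word_size=5,
                match=1, mismatch=-3, x_drop=5):
    """Ungapped extension of one seed.

    Returns (best_score, query_start, query_end, subject_start),
    with query_end exclusive.
    """
    best = score = word_size * match
    # Extend to the right of the seed.
    best_right, k = 0, 0
    qi, sj = i + word_size, j + word_size
    while qi + k < len(query) and sj + k < len(subject) and best - score <= x_drop:
        score += match if query[qi + k] == subject[sj + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
    # Extend to the left, starting again from the best score so far.
    score = best
    best_left, k = 0, 0
    while i - k - 1 >= 0 and j - k - 1 >= 0 and best - score <= x_drop:
        score += match if query[i - k - 1] == subject[j - k - 1] else mismatch
        k += 1
        if score > best:
            best, best_left = score, k
    return best, i - best_left, i + word_size + best_right, j - best_left


query, subject = "TTACGTACGGA", "GGACGTACGCC"
for qi, sj in find_seeds(query, subject):
    print(extend_seed(query, subject, qi, sj))
```

Because only word hits are extended, most of the database is never aligned at all, which is where the speed of the heuristic comes from.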
18-21 August 2009
BLAST (cont.)
• E value
– Database dependent
• Bits
– Database independent
• % Similarity (identity)
– For aligned segments
– NOT overall % identity
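The difference between the two scores can be made concrete with the standard Karlin-Altschul relations: the bit score normalizes the raw score by the scoring-system parameters, and the E value scales with the size of the search space. The sketch below assumes the usual forms S' = (lambda*S - ln K) / ln 2 and E = m * n * 2^(-S'); the lengths used in the example are made up.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score using scoring-system parameters."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits with at least this bit score."""
    return query_len * db_len * 2.0 ** (-bits)


# Same bit score, two database sizes: the E value doubles with the database.
print(e_value(20, 100, 1_000_000))
print(e_value(20, 100, 2_000_000))
```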
18-21 August 2009
Model Based Alignment
• Profile Hidden Markov Models
– Protein and nucleic acid
– Models primary sequence
• Stochastic Context-Free Grammars
– Incorporates RNA secondary structure
18-21 August 2009
2D Structure Conserved from Domain to Family
[Figure: two small subunit rRNA secondary structure diagrams with the 5' and 3' ends and domains I, II, and III labeled. Panels: (eu)Bacterial consensus; Family Enterobacteriaceae. Bracketed ranges and position numbers (10 to 1500) annotate the diagrams.]
Diagrams from the Gutell Lab Comparative RNA Web Site (http://www.rna.icmb.utexas.edu)
18-21 August 2009
SCFG Limitations
• Model primary and secondary structure
– Can’t model pseudoknots or higher-order interactions
• Time complexity O(ML^3)
– Solved by Nawrocki et al.
• Space complexity O(ML^2)
– Estimated 16 GB memory for rRNA
– Solved by Eddy
• Partial sequences
– Disrupt internal alignment
– Solved by Nawrocki et al.
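To give a feel for where the cubic term in sequence length comes from, here is a sketch of Nussinov base-pair maximization. It is not an SCFG or the Infernal algorithm; it is a much simpler stand-in that shares the same nested-span recursion. The allowed pairs and the minimum loop size of 3 are assumptions for the example.

```python
def nussinov(seq, min_loop=3):
    """Maximum number of nested base pairs in seq."""
    pairs = {("A", "U"), ("U", "A"), ("G", "C"),
             ("C", "G"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    # Fill spans from short to long: L^2 cells, each with an O(L) split loop.
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])   # i or j unpaired
            if (seq[i], seq[j]) in pairs:
                score = max(score, best[i + 1][j - 1] + 1)  # i pairs with j
            for k in range(i + 1, j):                       # two adjacent spans
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(nussinov("GGGAAAUCC"))
```

Every pair is scored either inside another pair or beside it, never crossing it, which is the same restriction that keeps a context-free grammar from describing pseudoknots.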
18-21 August 2009
Aligner References
• MUSCLE: http://www.drive5.com/muscle/
• BLAST: http://blast.ncbi.nlm.nih.gov/
• HMMER: http://hmmer.janelia.org/
• INFERNAL: http://infernal.janelia.org/
18-21 August 2009
Distance Calculation
• Phylogenetic methods only score base substitution, not insertion or deletion.
• Score comparable positions
– Mask out unaligned regions, insertions
– Ignore positions with deletion
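A minimal sketch of scoring only comparable positions: an uncorrected distance (fraction of differing bases) computed over alignment columns that pass an optional mask and contain no gap in either sequence. The mask encoding ("1" keeps a column) and the gap characters are assumptions for the example.

```python
def masked_distance(a, b, mask=None, gap_chars="-."):
    """Fraction of differing bases over comparable alignment columns."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for col, (x, y) in enumerate(zip(a, b)):
        if mask is not None and mask[col] != "1":
            continue                      # column masked out
        if x in gap_chars or y in gap_chars:
            continue                      # deletion in one sequence: ignore
        compared += 1
        differences += x != y
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared


print(masked_distance("ACG-TACGT", "ACGATTCGA"))
print(masked_distance("ACG-TACGT", "ACGATTCGA", mask="111111110"))
```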
18-21 August 2009
Other Common Distances
• Hamming distance
– No gaps or insertions
– Original BLAST
• Edit distance
– Penalize for gaps
– RDP Probe Match
• Matching word percentage (q-gram)
– Does not require alignment
– RDP Sequence Match
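The three measures can be compared side by side with small sketches. The q-gram score below is one plausible definition (shared unique words divided by the smaller unique word count) and the default word length of 7 is an assumption; neither is taken from the slide.

```python
def hamming(a, b):
    """Number of mismatching positions; sequences must be the same length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def qgram_match(a, b, q=7):
    """Shared unique q-grams as a fraction of the smaller q-gram set."""
    words_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    words_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


print(hamming("ACGT", "ACGA"))
print(edit_distance("ACGT", "AGT"))
print(qgram_match("ACGTAC", "GTACGT", q=3))
```

The last call shows why the word-based score needs no alignment: the two sequences are offset from each other, yet they share every 3-mer.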
18-21 August 2009
Unsupervised Classification (Clustering)
• Hierarchical Agglomerative
– Single Linkage (Nearest Neighbor)
– Average Linkage (UPGMA)
– Complete Linkage (Furthest Neighbor)
• Partitional Clustering
– K-Means
– Not often used in this field
• Self Organizing Maps
– Using word frequency
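The three linkage rules differ only in how the distance between two clusters is derived from pairwise distances. The sketch below is a naive agglomerative clusterer written for clarity, not speed; the function name, the cutoff convention (merge while the closest pair of clusters is within the cutoff), and the toy distances are assumptions.

```python
def agglomerative(dist, cutoff, linkage="complete"):
    """Cluster items 0..n-1 given a symmetric distance matrix."""
    clusters = [[i] for i in range(len(dist))]

    def cluster_distance(c1, c2):
        pairwise = [dist[i][j] for i in c1 for j in c2]
        if linkage == "single":      # nearest neighbor
            return min(pairwise)
        if linkage == "complete":    # furthest neighbor
            return max(pairwise)
        if linkage == "average":     # UPGMA: mean over all pairs
            return sum(pairwise) / len(pairwise)
        raise ValueError("unknown linkage: " + linkage)

    while len(clusters) > 1:
        d, x, y = min((cluster_distance(clusters[x], clusters[y]), x, y)
                      for x in range(len(clusters))
                      for y in range(x + 1, len(clusters)))
        if d > cutoff:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


# Four items on a line at positions 0, 2, 4, 20 (think percent difference).
positions = [0, 2, 4, 20]
dist = [[abs(p - q) for q in positions] for p in positions]
print(agglomerative(dist, cutoff=3, linkage="single"))
print(agglomerative(dist, cutoff=3, linkage="complete"))
```

Single linkage chains items 0, 2, and 4 together even though the ends are 4 apart, while complete linkage keeps every pair inside a cluster within the cutoff.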
18-21 August 2009
Supervised Classification
• K-Nearest Neighbors
– SeqMatch, MEGAN, easyTaxon
– Last Common Ancestor
• Bayesian
– RDP Classifier
• Kernel methods
– Support Vector Machines
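A sketch of how a nearest-neighbor classifier might combine its hits with a last common ancestor rule: take the k closest reference sequences and report only the part of the lineage they all share. This is an illustration, not the method of any of the tools named above; the reference sequences, lineages, and the Hamming distance helper are toy assumptions.

```python
def hamming(a, b):
    """Toy distance for equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def last_common_ancestor(lineages):
    """Longest shared prefix of several root-to-leaf lineages."""
    common = []
    for ranks in zip(*lineages):
        if all(r == ranks[0] for r in ranks):
            common.append(ranks[0])
        else:
            break
    return common


def knn_classify(query, references, distance, k=3):
    """references: list of (sequence, lineage) pairs."""
    nearest = sorted(references, key=lambda ref: distance(query, ref[0]))[:k]
    return last_common_ancestor([lineage for _, lineage in nearest])


references = [
    ("ACGTACGT", ["Bacteria", "Proteobacteria", "Escherichia"]),
    ("ACGTACGA", ["Bacteria", "Proteobacteria", "Salmonella"]),
    ("TTGTACGT", ["Bacteria", "Firmicutes", "Bacillus"]),
    ("TTTTTTTT", ["Bacteria", "Firmicutes", "Clostridium"]),
]
print(knn_classify("ACGTACGG", references, hamming, k=2))
print(knn_classify("ACGTACGG", references, hamming, k=3))
```

A larger k makes the assignment more conservative: once the neighbors disagree at a rank, the answer stops at the rank above it.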
18-21 August 2009
RDP-II Screenshots
• Fast search algorithm; limit searches to sequences spanning specific regions; change depth and edit distance
• Place sequences into bacterial taxonomy; works well with partial or full-length sequences; bootstrap confidence estimate; prior alignment not required
• Finds nearest neighbor; more accurate than BLAST; uses “q-gram” matching method
Twenty-Eight Years Later
Proc. Natl. Acad. Sci. USA, Vol. 103, No. 32, pp. 12115-12120, August 2006
www.pnas.org/cgi/doi/10.1073/pnas.0605127103
18-21 August 2009
Initial Processing Steps
• Sort by barcode (key)
• Quality filter
– Forward & (optional) reverse primers
– Ambiguities
– Length
• Trim key & primer sequences
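The steps above can be sketched as a single pass over the reads. This is a simplified illustration: it requires exact barcode and forward-primer matches and skips the optional reverse primer, whereas a real pipeline would typically tolerate mismatches. The function name, thresholds, and toy reads are assumptions.

```python
def process_reads(reads, barcodes, forward_primer,
                  min_length=50, max_ambiguous=0):
    """Sort reads by barcode, filter them, and trim key and primer.

    barcodes maps a barcode (key) sequence to a sample name.
    Returns {sample_name: [trimmed_read, ...]}.
    """
    samples = {}
    for read in reads:
        # Sort by barcode: find the key this read starts with, if any.
        for key, sample in barcodes.items():
            if read.startswith(key):
                break
        else:
            continue                      # no known barcode
        rest = read[len(key):]
        # Quality filter: forward primer must follow the barcode.
        if not rest.startswith(forward_primer):
            continue
        trimmed = rest[len(forward_primer):]   # trim key and primer
        if trimmed.count("N") > max_ambiguous:
            continue                      # too many ambiguous bases
        if len(trimmed) < min_length:
            continue                      # too short
        samples.setdefault(sample, []).append(trimmed)
    return samples


reads = ["ACGTGGCCAAAATTTT", "TGCAGGCCAANA", "TGCAGGCCCCCC",
         "ACGTGGCCAA", "TTTTGGCCAAAA", "ACGTTTTTAAAA"]
barcodes = {"ACGT": "gut", "TGCA": "soil"}
print(process_reads(reads, barcodes, "GGCC", min_length=4))
```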
18-21 August 2009
Two Analysis Tracks
Taxonomy Independent
• Global Alignment
• Cluster Based OTU Assignment
• Standard Ecological Metrics
• Many 3rd Party Data Formats
Taxonomy Dependent
• RDP Classifier
• Sequence Match
• Many 3rd Party Data Formats
18-21 August 2009
Model Based Alignment
• Infernal Aligner
– (Nawrocki and Eddy. 2007, PLoS Comput Biol)
• Fast - 500/min
• Probabilistic Model
– Model describes shared features
• Incorporates 2D Structure
– Cannone et al. 2002, BioMed Central Bioinformatics
http://www.rna.icmb.utexas.edu
18-21 August 2009
Complete Linkage Clustering (Operational Taxonomic Units)
• Distance based method
• Guaranteed intra-cluster distance
• N^2 algorithm
• Current online limit 150,000 unique reads
• Memory-efficient version in testing
[Figure label: ≤0.03]
18-21 August 2009
RDP Naive Bayesian Classifier
• Fast - 3000/min
• Places sequences into bacterial taxonomy
• Works well on partial or full-length sequences
• Does not require alignment
• Easily re-trained to match new taxonomies
• Bootstrap confidence estimates
• Online GUI - SOAP service - Open source
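A toy sketch of a word-based naive Bayesian classifier with bootstrap confidence, to show why no alignment is needed and where the confidence estimate comes from. It is a simplification and not the RDP Classifier code: the word size of 4, the add-half smoothing, the uniform genus prior, and the bootstrap settings (100 trials, one eighth of the words per trial) are all assumptions for the example, as are the training sequences and genus names.

```python
import math
import random
from collections import Counter


def words(seq, size):
    """Set of overlapping words of the given size."""
    return {seq[i:i + size] for i in range(len(seq) - size + 1)}


def train(training, size=4):
    """training: list of (sequence, genus). Returns (word_counts, totals)."""
    word_counts, totals = {}, Counter()
    for seq, genus in training:
        totals[genus] += 1
        word_counts.setdefault(genus, Counter()).update(words(seq, size))
    return word_counts, totals


def log_likelihood(query_words, genus, word_counts, totals):
    # Fraction of the genus's training sequences containing each word,
    # with add-half smoothing so unseen words do not give log(0).
    m = totals[genus]
    return sum(math.log((word_counts[genus][w] + 0.5) / (m + 1))
               for w in query_words)


def classify(seq, model, size=4, trials=100, seed=0):
    """Return (genus, bootstrap confidence between 0 and 1)."""
    word_counts, totals = model
    all_words = sorted(words(seq, size))

    def best(query_words):
        return max(sorted(totals),
                   key=lambda g: log_likelihood(query_words, g,
                                                word_counts, totals))

    assignment = best(all_words)
    rng = random.Random(seed)
    subset = max(1, len(all_words) // 8)
    # Bootstrap: reclassify from random word subsets and count agreement.
    hits = sum(best(rng.choices(all_words, k=subset)) == assignment
               for _ in range(trials))
    return assignment, hits / trials


model = train([("ACGTACGTACGTACGT", "GenusA"), ("ACGTACGTACGAACGT", "GenusA"),
               ("TTGGCCAATTGGCCAA", "GenusB"), ("TTGGCCAATTGGCCTA", "GenusB")])
print(classify("ACGTACGTACGT", model))
```

Retraining for a new taxonomy only means rebuilding the word counts from relabeled training sequences.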
18-21 August 2009
RDP Classifier Bootstrap Performance (Genus Level - Short Reads)

Region                      V3                     V6                     V4
Bootstrap cutoff      0%     50%    80%      0%     50%    80%      0%     50%    80%
Human Gut
  % classified       100    92.4   82.3     100    73.5   40.4     100    97.0   87.9
  % matching        92.0    95.0   98.1    79.0    96.5   98.7    92.8    94.5   95.7
Soil
  % classified       100    71.3   48.3     100    32.7   16.7     100    74.4   56.3
  % matching        70.0    85.5   94.6    48.0    80.0   84.3    84.1    93.3   96.8