18-21 August 2009

The Biosphere

Secondary structure of small subunit ribosomal RNA

5' end, 3' end

Image adapted from R. Gutell, http://www.rna.ccbb.utexas.edu/

Unaligned rRNA sequences in a multiple alignment editor

Aligned rRNA sequences in editor

Secondary structure of small subunit ribosomal RNA

5' end, 3' end

Image adapted from R. Gutell, http://www.rna.ccbb.utexas.edu/

The 530 Loop of E. coli

Stem with canonical Watson-Crick base pairing

Bulge

Non-canonical G-U base pair

Loop

530 loop of E. coli & T. jannaschii

The 530 loop structure of six species

Six taxa showing aligned 530 loop region of the 16S rRNA

Similarity matrices comparing the 530 loop sequences and the full rRNA sequences of the six listed taxa

A. Similarity matrix for 530 loop

B. Similarity matrix for complete 16S rRNA
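A similarity matrix of this kind can be sketched as pairwise percent identity over aligned sequences. The snippet below is only an illustration: the function names, the choice to skip gapped columns, and the toy sequences are assumptions, not the data or method behind the matrices on the slide.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Percent of identical bases over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # gapped columns are not comparable
        compared += 1
        matches += x == y
    return 100.0 * matches / compared if compared else 0.0


def similarity_matrix(aligned):
    """Symmetric matrix of percent identities, keyed by (name, name)."""
    names = list(aligned)
    matrix = {(n, n): 100.0 for n in names}
    for p, q in combinations(names, 2):
        matrix[p, q] = matrix[q, p] = percent_identity(aligned[p], aligned[q])
    return matrix


# Hypothetical toy alignment (not the 530 loop data from the slides).
toy = {
    "taxon_A": "GCCAGCAGCC",
    "taxon_B": "GCCAGCCGCC",
    "taxon_C": "GCC-GCAGCU",
}
sim = similarity_matrix(toy)
for (p, q), value in sorted(sim.items()):
    print(p, q, round(value, 1))
```

Running the same function on a short region and on the full-length alignment gives the two matrices that the slide contrasts.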

The Biosphere

E. coli

Aqx

Pyrop

T. jannaschii

P. freudenreichii

M. vannielii

S. solfa

References

Acknowledgement of rRNA secondary structure image:

• Cannone J.J., Subramanian S., Schnare M.N., Collett J.R., D'Souza L.M., Du Y., Feng B., Lin N., Madabusi L.V., Müller K.M., Pande N., Shang Z., Yu N., and Gutell R.R. 2002. The Comparative RNA Web (CRW) Site: An Online Database of Comparative Sequence and Structure Information for Ribosomal, Intron, and Other RNAs. BioMed Central Bioinformatics, 3:2. [Correction: BioMed Central Bioinformatics, 3:15.]

• Smith T.F., Gutell R., Lee J., and Hartman H. 2008. The origin and evolution of the ribosome. Biology Direct, 3:16.

• Woese C.R. 1987. Bacterial evolution. Microbiol Rev. 51(2):221-71.

• Zuckerkandl E. and Pauling L. 1965. Molecules as documents of evolutionary history. J Theor Biol. 8(2):357-66.

• Cole J., Wang Q., Cardenas E., Fish J., Chai B., Farris R., Kulam-Syed-Mohideen A., McGarrell D., Marsh T., Garrity G., and Tiedje J. 2009. The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Research. In press.

Sequence Alignment

Accuracy, Time, Memory

Multiple Sequence Alignment

• Pairwise dynamic programming
  – Smith-Waterman, Needleman-Wunsch
  – Can be transformed into probabilistic framework

• Multidimensional dynamic programming
  – Not practical

• Progressive alignment
  – Muscle, ClustalW
  – Both are progressive iterative
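As a concrete instance of pairwise dynamic programming, here is a minimal Needleman-Wunsch sketch. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration; real aligners use tuned substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # prefix of a aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap  # prefix of b aligned against gaps only

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
print(nw_score, nw_a, nw_b)
```

The table has one cell per pair of positions, which is why extending the same recurrence to many sequences at once (multidimensional dynamic programming) grows exponentially with the number of sequences.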

BLAST

• Heuristic search strategy

• Locate high-scoring short matches
  – 3 aa or 5 to 11 bases

• Extend short matches

• Determine significance using extreme value distribution statistics
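The locate-then-extend idea can be sketched as follows. This is a simplified, nucleotide-style illustration and not BLAST itself: it uses exact word matches as seeds and ungapped extension only, and the word size, scores, and drop-off threshold are assumed values.

```python
def seed_and_extend(query, subject, word=11, match=1, mismatch=-3, x_drop=5):
    """Best ungapped local hit as (score, query_start, subject_start, length).

    Returns None when the two sequences share no exact word of the given size.
    """
    # Index every word of the subject by its start position.
    index = {}
    for j in range(len(subject) - word + 1):
        index.setdefault(subject[j:j + word], []).append(j)

    def extend(qi, sj, step):
        """Walk in one direction; return (best score gain, length at that gain)."""
        gain = best = best_len = n = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            gain += match if query[qi] == subject[sj] else mismatch
            n += 1
            if gain > best:
                best, best_len = gain, n
            elif best - gain > x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            sj += step
        return best, best_len

    best_hit = None
    for i in range(len(query) - word + 1):
        for j in index.get(query[i:i + word], []):
            right, r_len = extend(i + word, j + word, +1)
            left, l_len = extend(i - 1, j - 1, -1)
            hit = (word * match + left + right, i - l_len, j - l_len,
                   word + l_len + r_len)
            if best_hit is None or hit[0] > best_hit[0]:
                best_hit = hit
    return best_hit


shared = "ACGTACGTACGGATCC"  # hypothetical 16-base region common to both
hit = seed_and_extend("TTT" + shared + "AAA", "GGG" + shared + "CCC")
print(hit)
```

Only positions that share a seed word are ever examined, which is where the speed of the heuristic comes from, at the cost of missing matches that contain no exact word.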

BLAST (cont.)

• E value
  – Database dependent

• Bits
  – Database independent

• % Similarity (identity)
  – For aligned segments
  – NOT overall % identity
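The difference between the two scores follows from the standard Karlin-Altschul relations: the bit score is S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits is E = m * n * 2^(-S'), where m is the query length and n the database length. A short sketch, in which the lambda and K values and the lengths are assumed for illustration only:

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(bits, query_len, db_len):
    """Expected number of chance hits: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Illustrative parameters only; real values depend on the scoring system.
bits = bit_score(raw_score=40, lam=1.374, k=0.711)
small_db = e_value(bits, query_len=250, db_len=1_000_000)
large_db = e_value(bits, query_len=250, db_len=10_000_000)
print(round(bits, 1), small_db, large_db)
```

The bit score is computed without reference to the database, while the same hit has a tenfold larger E value against a database ten times the size.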

Model Based Alignment

• Profile Hidden Markov Models
  – Protein and nucleic acid
  – Models primary sequence

• Stochastic Context-Free Grammars
  – Incorporates RNA secondary structure
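To make "models primary sequence" concrete, the sketch below estimates per-column emission probabilities from an ungapped alignment and scores a sequence against them. It covers only the match-state emissions of a profile HMM; insert and delete states and transition probabilities are deliberately left out, and the pseudocount, uniform background, and toy alignment are assumptions.

```python
import math
from collections import Counter

ALPHABET = "ACGU"


def match_emissions(alignment, pseudocount=1.0):
    """Per-column emission probabilities from an ungapped alignment."""
    model = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        model.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return model


def log_odds(seq, model, background=0.25):
    """Sum of log2(emission / background) over the columns of the model."""
    if len(seq) != len(model):
        raise ValueError("this sketch scores only sequences of model length")
    return sum(math.log2(col[b] / background) for b, col in zip(seq, model))


# Hypothetical toy alignment of five columns.
profile = match_emissions(["GCCAG", "GCCAG", "GCUAG", "GCCAG"])
print(round(log_odds("GCCAG", profile), 2), round(log_odds("AUUGC", profile), 2))
```

A sequence that follows the conserved columns scores well above zero, while an unrelated one scores below it. A full profile HMM adds states that let the sequence insert or skip columns, and a stochastic context-free grammar additionally scores paired columns jointly.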

Profile HMM

Hidden Markov Model

2D Structure Conserved from Domain to Family

[Figure: two rRNA secondary-structure conservation diagrams, labeled "(eu)Bacterial consensus" and "Family Enterobacteriaceae", showing the 5' and 3' ends, domains I-III, position numbering from 10 to 1500, and bracketed ranges such as [0-5] and [0-157]]

Diagrams from the Gutell Lab Comparative RNA Web Site (http://www.rna.icmb.utexas.edu)

SCFG rRNA Model

SCFG Limitations

• Model primary and secondary structure
  – Can’t model pseudoknots or higher-order interactions

• Time complexity O(ML^3)
  – Solved by Nawrocki et al.

• Space complexity O(ML^2)
  – Est. 16 GB memory for rRNA
  – Solved by Eddy

• Partial sequences
  – Disrupt internal alignment
  – Solved by Nawrocki et al.
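The cubic time and quadratic space in sequence length L can be seen in a much simpler structure-aware recurrence. The sketch below is a Nussinov-style base-pair maximization, not the covariance-model algorithm used for SCFG alignment; it is shown only because it has the same L^2 table filled with an inner loop over L. The minimum loop size and the inclusion of G-U pairs are assumptions.

```python
# Watson-Crick pairs plus the non-canonical G-U pair.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested base pairs (no pseudoknots).

    best[i][j] is the optimum for seq[i..j]: an L x L table (quadratic space)
    where each cell scans up to L split points (cubic time).
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case: position j is unpaired
            for k in range(i, j - min_loop):  # case: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAAUCCC"))
```

Because every pair must nest inside or sit beside the others, this family of recurrences cannot represent pseudoknots, which matches the first limitation listed on the slide.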

Aligner References

• MUSCLE: http://www.drive5.com/muscle/

• BLAST: http://blast.ncbi.nlm.nih.gov/

• HMMER: http://hmmer.janelia.org/

• INFERNAL: http://infernal.janelia.org/

Distance Calculation

• Phylogenetic methods only score base substitution, not insertion or deletion.

• Score comparable positions
  – Mask out unaligned regions, insertions
  – Ignore positions with deletion
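Scoring only comparable positions might look like the sketch below. The mask format (a string of "1" for columns to keep and "0" for columns to drop) and the uncorrected substitution fraction are assumptions for illustration; the slides do not specify either.

```python
GAP_CHARS = "-."


def masked_distance(a, b, mask=None):
    """Fraction of substitutions over comparable alignment columns.

    A column is comparable when the mask keeps it ("1") and both sequences
    have a base there. Gapped or masked-out columns are ignored, so
    insertions and deletions contribute nothing to the distance.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if mask is None:
        mask = "1" * len(a)
    if len(mask) != len(a):
        raise ValueError("mask must cover every alignment column")
    compared = diffs = 0
    for x, y, keep in zip(a.upper(), b.upper(), mask):
        if keep != "1" or x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no comparable positions")
    return diffs / compared


# Hypothetical aligned pair; column 5 is masked out, column 4 has a deletion.
d = masked_distance("ACGU-ACGU", "ACGAAACGA", mask="111110111")
print(round(d, 3))
```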

Other Common Distances

• Hamming distance
  – No gaps or insertions
  – Original BLAST

• Edit distance
  – Penalize for gaps
  – RDP Probe Match

• Matching word percentage (q-gram)
  – Does not require alignment
  – RDP Sequence Match
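The three measures can be compared side by side in a few lines. These are textbook definitions rather than the code of any of the named tools; in particular, the word length and the choice to divide by the smaller word set in the q-gram score are assumptions.

```python
def hamming(a, b):
    """Number of mismatched positions; sequences must be the same length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions, and deletions cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def qgram_similarity(a, b, q=7):
    """Shared unique words divided by the smaller unique-word count."""
    words_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    words_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / min(len(words_a), len(words_b))


print(hamming("ACGU", "ACGA"),
      edit_distance("ACGU", "AGU"),
      qgram_similarity("ACGUAC", "ACGUAG", q=3))
```

Only the first two need the sequences to be placed against each other position by position; the word-based score works on unaligned sequences.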

Clustering

Accuracy, Time, Memory

Unsupervised Classification (Clustering)

• Hierarchical Agglomerative
  – Single Linkage (Nearest Neighbor)
  – Average Linkage (UPGMA)
  – Complete Linkage (Furthest Neighbor)

• Partitional Clustering
  – K-Means
  – Not often used in this field

• Self Organizing Maps
  – Using word frequency
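The three linkage rules differ only in how the distance between two clusters is derived from the pairwise distances of their members. A naive sketch follows; the function name, the distance-matrix input, and the toy distances are assumptions, and production tools use far more memory-efficient algorithms.

```python
def agglomerate(dist, cutoff, linkage="complete"):
    """Merge clusters until the closest pair is farther apart than the cutoff.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    """
    combine = {
        "single": min,                           # nearest neighbor
        "complete": max,                         # furthest neighbor
        "average": lambda v: sum(v) / len(v),    # UPGMA
    }[linkage]
    clusters = [[i] for i in range(len(dist))]

    def link(a, b):
        return combine([dist[i][j] for i in a for j in b])

    while len(clusters) > 1:
        d, x, y = min(
            (link(clusters[x], clusters[y]), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        if d > cutoff:
            break
        merged = clusters.pop(y)  # y > x, so index x is unaffected
        clusters[x].extend(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical distances: 0-1 and 1-2 are close, 0-2 is not, 3 is far from all.
toy_dist = [
    [0.0,   0.02,  0.045, 0.2],
    [0.02,  0.0,   0.025, 0.2],
    [0.045, 0.025, 0.0,   0.2],
    [0.2,   0.2,   0.2,   0.0],
]
single = agglomerate(toy_dist, 0.03, "single")
complete = agglomerate(toy_dist, 0.03, "complete")
print(single, complete)
```

With single linkage the chain 0-1-2 is merged even though members 0 and 2 are more than 0.03 apart; complete linkage keeps every pair inside a cluster within the cutoff.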

Hierarchical Clustering

[Figure: Complete Linkage and Single Linkage panels, each labeled ≤0.03]

FastGroupII

Supervised Classification

• K-Nearest Neighbors
  – SeqMatch, Megan, easyTaxon
  – Last Common Ancestor

• Bayesian
  – RDP Classifier

• Kernel methods
  – Support Vector Machines

RDP-II Screenshots

• Fast search algorithm, limit searches to sequences spanning specific regions, change depth and edit distance

• Place sequences into bacterial taxonomy, works well with partial or full-length sequences, bootstrap confidence estimate, prior alignment not required

• Finds nearest neighbor, more accurate than BLAST, uses “q-gram” matching method

RDP Pyrosequencing Pipeline

Tools for high-throughput analysis

Thirty-One Years of rRNA Sequencing

Twenty-Eight Years Later

Proc. Natl. Acad. Sci. USA, Vol. 103, No. 32, pp. 12115-12120, August 2006

www.pnas.org/cgi/doi/10.1073/pnas.0605127103

Multiplexed Amplicon Pyrosequencing

RDP Pyrosequencing Pipeline

Initial Processing Steps

• Sort by barcode (key)

• Quality filter
  – Forward & (optional) reverse primers
  – Ambiguities
  – Length

• Trim key & primer sequences
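These steps can be sketched as a single pass over the raw reads. Everything specific here is hypothetical: the barcodes, the primer, the thresholds, and the requirement of an exact primer match (real pipelines typically tolerate a few mismatches and may also check a reverse primer).

```python
def process_reads(reads, barcodes, forward_primer, min_length=50, max_n=0):
    """Sort reads by barcode, apply quality filters, and trim key and primer.

    barcodes maps a barcode (key) sequence to a sample name. A read is kept
    only if it starts with a known barcode followed by the forward primer,
    has at most max_n ambiguous bases (N), and is long enough after trimming.
    """
    kept = {sample: [] for sample in barcodes.values()}
    for read in reads:
        read = read.upper()
        for barcode, sample in barcodes.items():
            if read.startswith(barcode + forward_primer):
                # Trim the key and primer, leaving only the amplified region.
                insert = read[len(barcode) + len(forward_primer):]
                if insert.count("N") <= max_n and len(insert) >= min_length:
                    kept[sample].append(insert)
                break
    return kept


# Hypothetical barcodes, primer, and reads.
toy_barcodes = {"ACGT": "soil", "TGCA": "gut"}
toy_reads = [
    "ACGTGGAT" + "A" * 10,        # kept for soil
    "TGCAGGAT" + "ACNGTAAAAA",    # dropped: ambiguous base
    "ACGTGGAT" + "AC",            # dropped: too short
    "CCCCGGAT" + "A" * 10,        # dropped: unknown barcode
    "TGCACCCC" + "A" * 10,        # dropped: primer not found
    "tgcaggat" + "c" * 8,         # kept for gut (case-insensitive)
]
sorted_reads = process_reads(toy_reads, toy_barcodes, "GGAT", min_length=5)
print(sorted_reads)
```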

Two Analysis Tracks

Taxonomy Independent

• Global Alignment

• Cluster Based OTU Assignment

• Standard Ecological Metrics

• Many 3rd Party Data Formats

Taxonomy Dependent

• RDP Classifier

• Sequence Match

• Many 3rd Party Data Formats

Model Based Alignment

• Infernal Aligner
  – (Nawrocki and Eddy. 2007, PLoS Comput Biol)

• Fast - 500/min

• Probabilistic Model
  – Model describes shared features

• Incorporates 2D Structure
  – Cannone et al. 2002, BioMed Central Bioinformatics

http://www.rna.icmb.utexas.edu

Complete Linkage Clustering (Operational Taxonomic Units)

• Distance based method

• Guaranteed intra-cluster distance

• N^2 algorithm

• Current online limit 150,000 unique reads

• Memory-efficient version in testing

[Figure label: ≤0.03]

RDP Naive Bayesian Classifier

• Fast - 3000/min

• Places sequences into bacterial taxonomy

• Works well on partial or full-length sequences

• Does not require alignment

• Easily re-trained to match new taxonomies

• Bootstrap confidence estimates

• Online GUI - SOAP service - Open source
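The general approach, a naive Bayesian classifier over fixed-length words with a bootstrap confidence estimate, can be sketched in a few dozen lines. This is a toy written from the description in Wang et al. (2007) as recalled here, not the RDP Classifier's code: the class name, the tiny training set, and the short word size in the example are assumptions.

```python
import math
import random
from collections import Counter, defaultdict


def words(seq, k):
    """Set of unique overlapping words of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


class WordBayes:
    def __init__(self, training, k=8):
        """training: list of (genus, sequence) pairs; no alignment is needed."""
        self.k = k
        self.total = len(training)
        self.corpus = Counter()                  # sequences containing each word
        self.genus_size = Counter()              # training sequences per genus
        self.genus_words = defaultdict(Counter)  # per-genus word counts
        for genus, seq in training:
            ws = words(seq, k)
            self.corpus.update(ws)
            self.genus_size[genus] += 1
            self.genus_words[genus].update(ws)

    def _score(self, genus, ws):
        """Log of the product of smoothed word probabilities for one genus."""
        m_total = self.genus_size[genus]
        score = 0.0
        for w in ws:
            prior = (self.corpus[w] + 0.5) / (self.total + 1)
            score += math.log((self.genus_words[genus][w] + prior) / (m_total + 1))
        return score

    def _best(self, ws):
        return max(self.genus_size, key=lambda g: self._score(g, ws))

    def classify(self, seq, trials=100, seed=0):
        """Return (genus, fraction of bootstrap trials agreeing with it)."""
        ws = sorted(words(seq, self.k))
        if not ws:
            raise ValueError("sequence is shorter than the word size")
        best = self._best(ws)
        rng = random.Random(seed)
        n = max(1, len(ws) // 8)  # each trial resamples one eighth of the words
        hits = sum(self._best(rng.choices(ws, k=n)) == best for _ in range(trials))
        return best, hits / trials


# Hypothetical training data; a short word size suits these toy sequences.
toy_training = [
    ("Alpha", "ACGTACGTACGTTTGACCA"),
    ("Alpha", "ACGTACGAACGTTTGACCA"),
    ("Beta", "GGCCTTAAGGCCTTAAGGC"),
    ("Beta", "GGCCTTAAGGCATTAAGGC"),
]
model = WordBayes(toy_training, k=4)
genus, confidence = model.classify("ACGTACGTACGTTTGA")
print(genus, confidence)
```

Retraining for a new taxonomy amounts to rebuilding the word counts from a differently labeled training set, which is why the slide describes the classifier as easy to re-train.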

Classifier Accuracy on 200 bp Regions

From Wang et al., AEM, 2007

RDP Classifier Bootstrap Performance (Genus Level - Short Reads)

                     V3                   V6                   V4
Bootstrap cutoff     0%    50%   80%      0%    50%   80%      0%    50%   80%

Human Gut
  % classified       100   92.4  82.3     100   73.5  40.4     100   97.0  87.9
  % matching         92.0  95.0  98.1     79.0  96.5  98.7     92.8  94.5  95.7

Soil
  % classified       100   71.3  48.3     100   32.7  16.7     100   74.4  56.3
  % matching         70.0  85.5  94.6     48.0  80.0  84.3     84.1  93.3  96.8