RNA StrAT: RNA Secondary Structure Analysis Toolkitcchauve/Publications/ISMB08-D31-poster.pdf · A...

1
A STEM/STEM-LOOP DECOMPOSITION BASED HEURISTIC RNA secondary structures are decomposed into stems and stem-loops, that are then compared using an exact algorithm for the conservative edit distance [Guignon et al., 2005], a variant of the general distance of [Jiang et al., 2002]. Then, these hairpins pairwise comparison are used in a Smith-Waterman based heuristic to produce a distance and alignment between the two complete structures. SEARCH FOR STRUCTURAL HOMOLOGS For example, structured RNA gene detected by a high- throughput computational analysis. Structures database ALGORITHMIC MODEL EDIT DISTANCE The edit distance algorithm is the following problem: - given two RNA secondary structures, a set of allowed edit operations and a cost for each possible operation, - compute an alignment of minimum cost between the two structures. RNAStrAT uses the edit distance model defined in [Jiang et al. 2002] that comports: single base edit operations (substitution/insertion/deletion), base pairs operations such as insertion/deletion, creation/opening or alteration of a hydrogene bond. This model was introduced by [Jiang et al., 2002]. Computing the distance is an NP-hard problem, but several less general version of this problem can be solved exactly and are used in widely in RNA secondary strcutures comparison tools such as RNAForrester [Höchsmann et al., 2004]. ALIGNMENT OF SECONDARY STRUCTURES SUMMARY RNAStrAT is a web server dedicated to the comparison of sets of RNA secondary structures. This server offers tools to align pairs of RNA secondary structures and to search for structural homologs in a database of RNA secondary structures (based on the RFAM). The alignment and search are based on an edit distance algorithm that considers a wide range of edit operations defined in [Jiang et al., 2002]. Tools for the vizualisation of secondary structures and structures alignments are also available. Up to date RNA StrAT is the only server offering all these features (general RNA edit model, RFAM database search, rendering) together. Availability: http://www-lbit.iro.umontreal.ca/rnastrat/ Contact: [email protected] ADDITIONAL FEATURES AND DEVELOPMENT The database of RNA secondary structures Users can access to structure information including links to its RNA family (in the Rfam classification), its organism taxonomy (EMBL), its sequence (EMBL). Structures stored in our database are extracted from the Rfam seed alignments and for each RNA gene, its specific secondary structure is obtained from both its sequence and the family consensus structure. Database search improvements In order to speed-up the database search the search engines first analyze the structural characteristics of the query structure to select a group of candidates in the database that share similar characteristics close to the query ones, eliminating at the same time irrelevant structures. Then, the query structure is compared to these candidates to find which ones have the best similarity scores. The user can modify the parameters that define the candidates. Structures and alignment rendering High structural similarity score Query structure Valentin Guignon 1 , Cedric Chauve 2 and Sylvie Hamel 1 1. DIRO, Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, QC, H3C 3J7, Canada. 2. Department of Mathematics, Simon Fraser University, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada. CGL and LaCIM, Université du Québec à Montréal, Montréal, QC, Canada. RNA StrAT: RNA Secondary Structure Analysis Toolkit Figure 1. Structural alignment between two 5S rRNA. Some edit operations are displayed on the structures using arrows. Delftia acidovorans Escherichia coli 5’ 3' 50 100 10 120 U G C C U G G C G G C C G U A G C G C G G U G G U C C C A C C U G A C C C C A U G C C G A A C U C A G A A G U G A A A C G C C G U A G C G C C G A U G G U A G U G U G G G G U C U C C C C A U G C G A G A G U A G G G A A C U G C C A G G C A U 5' 3' U G C C U G A U G A C C A U A G C A A G U U G G U A C C A C U C C U U C C C A U C C C G A A C A G G A C A G U G A A A C G A C U U U G C G C C G A U G A U A G U G C G G G U U C C C G U G U G A A A G U A G G U C A U C G U C A G G C N N 10 50 100 110 >Escherichia coli, V00336 UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU ((((((((((.....((((((((....(((((((.............))))..)))...)))))).)).(((((((..((((((((...))))))))..)))))))...)))))))))). >Delftia acidovorans UGCCUGAUGACCAUAGCAAGUUGGUACCACUCCUUCCCAUCCCGAACAGGACAGUGAAACGACUUUGCGCCGAUGAUAGUGCGGG-U-U-CCCGUGUGAAAGUAGGUCAUCGUCAGGCNN .(((((((((.....((((((((....(((((((.............))))..)))...)))))).)).((.((....((((((.-.-.-.))))))....)).))...))))))))).. pair match base substitution base match completion/altering pair substitution pair half-match pair deletion/insertion pair opening/creation base deletion/insertion Figure 2. Given a query structure and a database of structures, structural homologs can be found using structural similarity scores. C A G G G C C G G G G C C G G G C C A G G G C G base substitution base deletion base insertion C G A G G U G C C G A G G C G C C A A G G U G C C A G G G C C U G G A G C G C U paired base substitution pair substitution pair deletion pair insertion C G U G G C G C G C G U G G C C A G G U G C G C C G C U G G C G pair creation pair opening pair completion pair altering Figure 3. Edit operations. A U A G G G C G G A G G G AAG C U C A U C A G U G G G G C C A C G A G C U G A G U G C G U C C U GU C ACUC C A C U C CC A U G U CCCU U G G G A A G G U C U G A G A C U A G G G C C A G A G G C G G C C C U A A C A G G G C U C U C C C U G A G C UU C G G G G A G G U G A G U U C C C A G A G A ACG G G G C U C C G C G C G A G G U C A G AC U G G G C A G G A G A U G C C G U G G A C C C C GC C C UU C G G G G A G G G G C C C G G C G G A U G C C U C C U U U G C C G G A G C U U G G A A C A G A C U C A C GGCC A G CG A A G U G A G U U C A A U G G C U G A GG U G A G G U AC C C C GC A G G G G A C C U C A U AA C C C A A U U C A G A C C A C U C U C C U C C G C C C A U U U P1 P2 P3a P3b P4 P7 P8 P9 P10/11 P12 P19 G G G G C C A C G A G C U G A G U G C G U C C U G U C A C U C C A C U C C C A U G U C C C U G G C C C U A A C A G G G C U C U C C C U G A G C U U C G G G G A G G U C A G AC U G G G C G G A G A U G C C G U G G A C C C C G C C C U U C G G G G A G G G G C C C G G C G G A U G C C U C C G U G A G U U C C C A G A G A A CG G G G C U C C G C G C G A U U G C C G G A G C U U G G A A C A G A C U C A C G G C C G G C C C U C A U G A G A U A G G G C G G A G G G U C C U C C G C C C A U G U G A G G U A C C C C GC A G G G G A C C U C A U decomposition Figure 4. Stem/stem-loop decomposition of an Rnase P structure. Figure 5. Edit distance computation between 2 Rnase P structures. Structure rendering (see figure 7) Figure 6. Database browsing features. Figure 7. Structure online rendering (base on Vienna RNA Package): the rendering form (on the left) enables the user to display structures in various ways (on the right) and browse the structure (middle). Figure 8. The rendering engine enables to compare two aligned structures (on the left) and see which bases changed from a structure to the other. The characteristics of a structure compared to a set of structures can also be rendered (on the right). Each base of the query structure is displayed with a pie chart that shows how often the base has been kept, replaced by an other one or deleted. REFERENCES Griffiths-Jones S., Bateman A., Marshall M., Khanna A., Eddy S.R. RFam: an RNA family database, Nucleic Acids Research, 2003, 31, 1, 439-441. Jiang T., Lin G., Ma B., Zhang K. A general edit distance between RNA structures. J. Comput. Biol., 9(2):371–388. 2002. Guignon V., Chauve C., Hamel S. Distance d'édition entre tige-boucles, JOBIM 2005, 2005, poster 82. Smith T.F., Waterman M.S. Identification of Common Molecular Subsequences, Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5, 1981. Höchsmann M., Voss B., Giegerich R. Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 2004, pp53-62. Vienna RNA Package, http://www.tbi.univie.ac.at/~ivo/RNA RNase P SM-A12(14) RNase P SM-A18(31) Stem/stem-loop edit distance Global edit distance RNase P SM-A12(14) RNase P SM-A18(31) RNase P SM-A18(31) RNase P SM-A12(14) g a g g a a a g u c c g g g c U C C U U C G G A C A G G G C G C C A G G U A A C G C C U G G G G G G C G U G A G C C C A C G G A A A G U G C C A C A G A A A A U A U A C C G C C A G C U UC G G C U G G UA A G G G U G A A A U G G U G C G GU A A G A G C G C A C C G C G C G A C U G G C A A C G G C U U G C G G C A C G G U A A A C C C C G C C C GGAGCAA G A C C A A AU A G G G G A G C A UG U C C G U C GU G U C C G A A C G G G C U C C C G G G U A G G U U G C U U G A G G U G G C C G G U G A C G G C U A U C C C A G A U G A A U G G U U G U CG A UG a c a g a a c c c g g c u u a c 1 20 40 60 80 100 120 14 0 160 180 200 220 240 260 28 0 g a g g a a a g u c c g g g c U C C A U G G A A G C G C G G U G C C G G A U A A C G U C C G G C G G G G G C GA C C U C A G G G A A A G U G C C A C A G A A A G C A A A C C G C C C U C G A G G C C G AA A G G C U U C G C G G A G G G UA A G GG U G A A A G G G U G C G GU A A G A G C G C A C C G C G U C U U U G G C A A C A A A G G C G G C A A G G C A A A C C C C A C C G G G A C C A A AU A G G G G C U G C A CG G A C G A G AG A U C G U C C A G G U C U G U U U C C A G A C C C G C G G C C C G G G U U G G U U G C A A G A G G C G U C U C G C A A G A G G C G U C C C A G A U G A A U G G C C A U CAC C U C G C A G C A A U G C G A G GA a c a g a a c c c g g c u u a 1 20 40 60 80 10 0 120 140 160 180 200 22 0 240 260 28 0 30 0 320 340 AA C GAG 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9

Transcript of RNA StrAT: RNA Secondary Structure Analysis Toolkitcchauve/Publications/ISMB08-D31-poster.pdf · A...

Page 1: RNA StrAT: RNA Secondary Structure Analysis Toolkitcchauve/Publications/ISMB08-D31-poster.pdf · A STEM/STEM-LOOP DECOMPOSITION BASED HEURISTIC RNA secondary structures are decomposed

A STEM/STEM-LOOP DECOMPOSITION BASED HEURISTIC

RNA secondary structures are decomposed into stems and stem-loops, that are then compared using an exact algorithm for the conservative edit distance [Guignon et al., 2005], a variant of the general distance of [Jiang et al., 2002].

Then, these hairpins pairwise comparison are used in a Smith-Waterman based heuristic to produce a distance and alignment between the two complete structures.

SEARCH FOR STRUCTURAL HOMOLOGS

For example, structured RNA gene detected by a high-throughput computational analysis.

Structures database

ALGORITHMIC MODELEDIT DISTANCE

The edit distance algorithm is the following problem: - given

● two RNA secondary structures,● a set of allowed edit operations and● a cost for each possible operation,

- compute● an alignment of minimum cost between the two

structures.RNAStrAT uses the edit distance model defined in [Jiang et al. 2002] that comports:● single base edit operations (substitution/insertion/deletion),● base pairs operations such as insertion/deletion, creation/opening or alteration of a hydrogene bond.

This model was introduced by [Jiang et al., 2002]. Computing the distance is an NP-hard problem, but several less general version of this problem can be solved exactly and are used in widely in RNA secondary strcutures comparison tools such as RNAForrester [Höchsmann et al., 2004].

ALIGNMENT OFSECONDARY STRUCTURES

SUMMARYRNAStrAT is a web server dedicated to the comparison of sets of RNA secondary structures. This server offers tools to align pairs of RNA secondary structures and to search for structural homologs in a database of RNA secondary structures (based on the RFAM). The alignment and search are based on an edit distance algorithm that considers a wide range of edit operations defined in [Jiang et al., 2002]. Tools for the vizualisation of secondary structures and structures alignments are also available. Up to date RNA StrAT is the only server offering all these features (general RNA edit model, RFAM database search, rendering) together.

Availability: http://www-lbit.iro.umontreal.ca/rnastrat/

Contact: [email protected]

ADDITIONAL FEATURES AND DEVELOPMENT

The database of RNA secondary structures

Users can access to structure information including links to its RNA family (in the Rfam classification), its organism taxonomy (EMBL), its sequence (EMBL). Structures stored in our database are extracted from the Rfam seed alignments and for each RNA gene, its specific secondary structure is obtained from both its sequence and the family consensus structure.

Database search improvementsIn order to speed-up the database search the search engines first analyze the structural characteristics of the query structure to select a group of candidates in the database that share similar characteristics close to the query ones, eliminating at the same time irrelevant structures. Then, the query structure is compared to these candidates to find which ones have the best similarity scores. The user can modify the parameters that define the candidates.

Structures and alignment rendering

High structuralsimilarity score

Query structure

Valentin Guignon1, Cedric Chauve2 and Sylvie Hamel11. DIRO, Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, QC, H3C 3J7, Canada.2. Department of Mathematics, Simon Fraser University, 8888 University Drive, Burnaby, BC, V5A 1S6, Canada. CGL and LaCIM, Université du Québec à Montréal, Montréal, QC, Canada.

RNA StrAT: RNA Secondary Structure Analysis Toolkit

Figure 1. Structural alignment between two 5S rRNA. Some edit operations are displayed on the structures using arrows.

Delftia acidovoransEscherichia coli

5 ’3 '

5 0

1 0 0

1 0

1 2 0UGCCUGGCGG C CG

UAG C G C G G U G

G U C CC A C C U G A

C C C C AUGC

CGAACUCAG

AAGUG

AAACGCCGU

AGCGC

CGAUGGUA

GUGUGGGGUCU

CCCCAUGC

GAGAGUAGG

G

A

ACUGCCAGGCAU

5 '3 '

UGCCUGAUGA C CA

UAG C A A G U U G

G U A CC A C U C C U

U C C C AUCC

CGAACAGGA

CAGUG

AAACGACUU

UGCGC

CGAUGAUA

GUGCGGG

UUCCCGUGU

GAAAGUAGG

U

C

AUCGUCAGGCNN

1 0

5 01 0 0

1 10

>Escherichia coli, V00336UGCCUGGCGGCCGUAGCGCGGUGGUCCCACCUGACCCCAUGCCGAACUCAGAAGUGAAACGCCGUAGCGCCGAUGGUAGUGUGGGGUCUCCCCAUGCGAGAGUAGGGAACUGCCAGGCAU((((((((((.....((((((((....(((((((.............))))..)))...)))))).)).(((((((..((((((((...))))))))..)))))))...)))))))))).>Delftia acidovoransUGCCUGAUGACCAUAGCAAGUUGGUACCACUCCUUCCCAUCCCGAACAGGACAGUGAAACGACUUUGCGCCGAUGAUAGUGCGGG-U-U-CCCGUGUGAAAGUAGGUCAUCGUCAGGCNN.(((((((((.....((((((((....(((((((.............))))..)))...)))))).)).((.((....((((((.-.-.-.))))))....)).))...)))))))))..

pair match

base substitution base match

completion/altering

pair substitutionpair half-match

pair deletion/insertion

pair opening/creation

base deletion/insertion

Figure 2. Given a query structure and a database of structures, structural homologs can be found using structural similarity scores.

C AG GG C

C GG GG C

CG GG C

C AGG

G CG

base substitution

base

deletion

base insertion

CG AG

G UG C

CG AG

G CG C

CA AG

G UG C

C AGGG CC

U

GG AG C

G CU

paired basesubstitution

pair substitution

pair deletion

pair insertion

CG UG

G CG C

G

CG UG

G C

CA

G

G UG CG C

CGCUG

G CG

pair creation

pair opening

pair completion

pair altering

Figure 3. Edit operations.

A UAG G G CG G AG G G A A G CUCAUC

AGUGGGGCCACGAGC

UGAGUGCGUCCUG U C A CU C

CAC U

C C C A UGUC C C UUGGGAAGGUCUGAGACUA

GGGCCA

GAGG

CGGCCC

UAACAGGGCUCUC

CCUGA

GC U UCG GGG AG GU

GAGUUC

CCAGAG A

A C G GGG CUCCG

CG CGAGGUCAG A CU

GGGCAGG AG A UG C

CGUGG

A C CCCG CCC

U UCGGGGAGGGGCCCGGCGGAUGC

CUCCUUUGCCGGAGC

UUGGAACAGACUCACG G C C A G C G A A

GUGAGUUCAAUGGCUGAG GU

GAGGUACCCCG CAGGGG

ACCUCAU A A CCCAAUUCAGACCAC

UCUCCUCCGCCCAUU

5´3´

UP1

P2

P3aP3b P4

P7

P8

P9P10/11

P12

P19

GGG G

CCAC

G AG C

U GA

GUGC

G UC C U G

UC A C U C C A C U CCCAUGUCCCU GG

CCCUA A CAG

GGCU CU

CCCU

GAG C UUCGGGG

AG GUCAG A CU

GGGC

G GA G

A UGCC

GUGG

A CCC

CGCC

CU U C G

GGG AGGGG

CCC

GGCG GA

UGCCUCCGUGAGUUCC

CAGAGAA

C G G G GCUCCGC GC

GA UUGCCGGAGCUU

GGAACAGACUC

ACGGCC

GGCCCU

CAUGAG

AU AGG GCGG

AGGGUCCU

CCGC

CCAU

GUGAGGU

A CCCCG C A

GGGGACCUCAU

decomposition

Figure 4. Stem/stem-loop decomposition of an Rnase P structure.

Figure 5. Edit distance computation between 2 Rnase P structures.

Structure rendering(see figure 7)

Figure 6. Database browsing features.

Figure 7. Structure online rendering (base on Vienna RNA Package): the rendering form (on the left) enables the user to display structures in various ways (on the right) and browse the structure (middle).

Figure 8. The rendering engine enables to compare two aligned structures (on the left) and see which bases changed from a structure to the other. The characteristics of a structure compared to a set of structures can also be rendered (on the right). Each base of the query structure is displayed with a pie chart that shows how often the base has been kept, replaced by an other one or deleted.

REFERENCESGriffiths-Jones S., Bateman A., Marshall M., Khanna A., Eddy S.R. RFam: an RNA family database, Nucleic Acids Research, 2003, 31, 1, 439-441.

Jiang T., Lin G., Ma B., Zhang K. A general edit distance between RNA structures. J. Comput. Biol., 9(2):371–388. 2002.

Guignon V., Chauve C., Hamel S. Distance d'édition entre tige-boucles, JOBIM 2005, 2005, poster 82.

Smith T.F., Waterman M.S. Identification of Common Molecular Subsequences, Journal of Molecular Biology 147: 195–197. doi:10.1016/0022-2836(81)90087-5, 1981.

Höchsmann M., Voss B., Giegerich R. Pure Multiple RNA Secondary Structure Alignments: A Progressive Profile Approach in IEEE/ACM Transactions on Computational Biology and Bioinformatics, 1(1), 2004, pp53-62.

Vienna RNA Package, http://www.tbi.univie.ac.at/~ivo/RNA

RNase P SM-A12(14)

RN

ase

P S

M-A

18(3

1)

Stem/stem-loop edit distance Global edit distance

RNase P SM-A12(14)

RN

ase

P S

M-A

18(3

1)

RNase P SM-A18(31) RNase P SM-A12(14)

gaggaaaguccgggcUCC U

UCGG

ACAGGG

CGCCAGGUAAC

G CCUGGGGG

GCGUG

A G C C C AC GGAA A G U

GCCACAGAAAA

U AU A C

CGCCAGCUU CG

GCUGGU AAG

G G U G A AAUG G U G C GG

U AAGAGCGCACC

GC G C G A C U G G

CAAC

GGCUUGC

GGCACGGUAAACCCC G C C C G G A G C A A G A C C

AAA U AG

G G G A G CAUG U

C C G UCG U G

UCCGA

ACGGGCUCCCGGGUA

GGUUGCUUGAG G U G G C C G G U

GACGGCUAUCCCAGAUGAAUGGUUGUCG A UG ac

agaacccggcuuac

1

2040

60

80

100

120

140

160

180 200

220

240

260

280

gaggaaaguccgggcUCCA

UGGA

AGCGCGGU

GCCGGAUAAC

G UCCGGCGG

GGGCG

A C C U C AG GGA

AA G UGC

CACAGAAAG

C A A A CCGCCCUCGAGGCCGA AA

GGCUUC GCGGAGGGU AAGG G U G A AA

GG G U G C GG

U AAGAGCGCACC

GC G U C U U U G G

CAACAAA

GGCGGCAAGGCAAAC

CCC A C C G G G A C C

AAAU AG

G G G C U GCACG

GACGAGAG A

UCGUCCAG G U C U G

UUUCCAGACCCGCGGCCC

GGGUUGGUU

GCAAGAG G C G U C U C G C

AAGAGGCGUCCCAGAUGAAUGGCCAUC A C

CUCGCAG C AA

UGCGAGG A a c

agaacccggcuua

1

20

40

60

80

100

120

140

160180

200

220

240260

280

300

320

340

A ACG A G1

2

3 4

5

6

7

8

9

10

11

12

1

2

3 4

5

6

7

8

9