Local Exact Pattern Matching for Non-fixed
RNA Structures
Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Mohl,
Christina Schmiedl, Sebastian Will
RNA
CPM 2012, Helsinki
RNA R is an ordered pair (S,B) where:
S is a sequence defined over Σ = {A, C, G, U}
B is a set of base pairs: C-G, G-C, A-U, or U-A
[Figure: example RNA; legend: single base, base pair, backbone connection]
S represents the primary structure of R
B represents the secondary structure of R
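The definition of R = (S, B) can be sketched in code. This is only an illustrative sketch: the names `RNA` and `make_rna`, and the choice to store base pairs as 0-based index pairs (i, j) with i < j, are assumptions and do not come from the slides.

```python
from dataclasses import dataclass

ALPHABET = set("ACGU")
# Base pairs allowed by the definition: C-G, G-C, A-U, U-A.
ALLOWED_PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


@dataclass(frozen=True)
class RNA:
    """R = (S, B): sequence S and base pairs B as index pairs (i, j), i < j (assumed encoding)."""
    S: str
    B: frozenset

    def __post_init__(self):
        # Primary structure: every symbol must come from the alphabet.
        if not set(self.S) <= ALPHABET:
            raise ValueError("S must be a string over {A, C, G, U}")
        # Secondary structure: every pair must join two complementary bases.
        for i, j in self.B:
            if not (0 <= i < j < len(self.S)):
                raise ValueError(f"bad base pair indices: {(i, j)}")
            if (self.S[i], self.S[j]) not in ALLOWED_PAIRS:
                raise ValueError(f"bases {self.S[i]}-{self.S[j]} cannot pair")


def make_rna(sequence, pairs):
    """Hypothetical helper: build an RNA from a string and an iterable of pairs."""
    return RNA(sequence, frozenset(pairs))
```

For example, `make_rna("GACUC", [(0, 4)])` is accepted because positions 0 and 4 hold G and C, while pairing G with A raises `ValueError`.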
RNA Representations
[Figure: the same RNA shown as a secondary structure drawing and in two further representations]
Arc annotated string
Tree
RNA Secondary Structure
[Figure: secondary structure drawing of an RNA]
The secondary structure of RNA is highly researched:
•Determines the activity and functionality of the RNA
•Usually more preserved during evolution
RNA Structure
•Predicting the secondary structure of an RNA molecule is a difficult task
•The structure is sometimes given in a non-fixed form, where each base pair has a probability ≤ 1 of existing in the RNA
Nested Structure
[Figure: example representations of an RNA R]
In all of these examples, the structure of R is Nested:
Each base can be connected by a bond to at most one other base, and there are no crossing arcs
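The two conditions of a Nested structure can be checked with a single left-to-right sweep. The sketch below assumes base pairs are given as index pairs; the function name `is_nested` is illustrative and not from the slides.

```python
def is_nested(pairs):
    """True if every base is in at most one pair and no two arcs cross."""
    partner = {}
    for i, j in pairs:
        # Condition 1: a base may take part in at most one base pair.
        if i in partner or j in partner or i == j:
            return False
        partner[i], partner[j] = j, i

    # Condition 2: no crossing arcs. Sweeping endpoints left to right,
    # arcs of a nested structure must close in last-opened, first-closed order.
    open_arcs = []
    for pos in sorted(partner):
        if partner[pos] > pos:          # left endpoint: open an arc
            open_arcs.append(pos)
        elif open_arcs.pop() != partner[pos]:
            return False                # closing arc is not the innermost open one
    return True
```

With this check, `[(0, 5), (1, 4)]` is nested, `[(0, 3), (1, 4)]` is rejected for crossing, and `[(0, 2), (2, 4)]` is rejected because base 2 is paired twice.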
Unlimited Structure
Arc annotated substrings can represent Unlimited structures as well
[Figure: arc-annotated string example]
Bounded-Unlimited Structure
Arc annotated substrings can represent Bounded-Unlimited structures:
Each base can be connected to a constant number of other bases, and crossing arcs are allowed
[Figure: arc-annotated string example]
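A Bounded-Unlimited structure only limits how many arcs touch each base. As a rough sketch, the check below counts arcs per base; the parameter `bound` stands for the unspecified constant and, like the function names, is an assumption made for illustration.

```python
from collections import Counter


def max_degree(pairs):
    """Largest number of arcs incident to any single base."""
    degree = Counter()
    for i, j in pairs:
        degree[i] += 1
        degree[j] += 1
    return max(degree.values(), default=0)


def is_bounded_unlimited(pairs, bound):
    """Crossing arcs are allowed; only the per-base degree is limited by `bound` (hypothetical constant)."""
    return max_degree(pairs) <= bound
```

A Nested structure is the special case with `bound = 1` plus the additional no-crossing requirement.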
RNA Similarity Algorithms
Many algorithms for finding similarity between RNA molecules use tree similarity algorithms
[Figure: example RNA secondary structures]
Tree Edit Distance:
•Tai (’79) O(n⁶)
•Zhang & Shasha (’89) O(n⁴)
•Klein (’98) O(n³ log n)
•Ma et al. (’99) O(n³ log n)
•Demaine et al. (’07) O(n³)
Tree Alignment:
•Jiang et al. (’95)
•Schirmer & Giegerich (’11)
•Backofen et al. (’07)
•Mohl et al. (’09)
Longest Arc Preserving Common Subsequence:
•Evans (’99)
•Lin et al. (’02)
•Alber et al. (’04)
•Jiang et al. (’04)
Similar Subforests:
•Jansson & Peng (’11)
Patterns in RNAs
In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules
[Figure: an RNA with a pattern highlighted]
Exact Pattern Matching Problem
Finding all maximal common structure-sequence regions between two RNAs
Solved by Backofen & Siebert in O(n²) for fixed Nested x Nested Structures
[Figure: two example RNAs; legend: single base match, left endpoint match, type mismatch]
In this work, we solve the problem for non-fixed Nested x Nested Structures
[Figure: the two example RNAs; legend: arc breaking]
Arc Breaking Operation
•We support the operation of arc-breaking, in which a base pair can be deleted with no penalty
[Figure: example RNA shown as an arc-annotated string, a structure drawing, and a tree; legend: single bases, base pair]
Patterns are now less restrictive.
Exact Pattern Matching Algorithms
We describe three algorithms for finding the local exact pattern matching between two RNAs:
•A simple O(n⁴) algorithm (using ideas from Zhang & Shasha (’89))
•An improved O(n³ log n) algorithm (using ideas from Klein (’98))
•An O(n³) algorithm (using ideas from Demaine, Weimann et al. (’07))
Exact Pattern Matching Algorithm
Input: R1=(S1,B1) and R2=(S2,B2), |R1|=n, |R2|=m, n>m
Output: Local exact pattern matching between R1 and R2
We compare each base pair from R1 with each base pair from R2, in increasing order of their sizes
For each pair of base pairs, we compute the matching inside the base pairs and the extensions to their outsides
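One way to realize the "increasing order of their sizes" schedule is sketched below. It assumes base pairs are index pairs (i, j) and that the size of a base pair is its span j - i; both the encoding and the function names are illustrative assumptions, not taken from the slides.

```python
from itertools import product


def by_increasing_size(pairs):
    """Sort base pairs by span, so inner base pairs come before the ones enclosing them."""
    return sorted(pairs, key=lambda p: (p[1] - p[0], p))


def comparison_order(B1, B2):
    """All (bp1, bp2) combinations, smaller base pairs first in both RNAs."""
    return list(product(by_increasing_size(B1), by_increasing_size(B2)))
```

The point of this ordering is that when a combination (bp1, bp2) is reached, every combination of base pairs nested inside bp1 and bp2 has already been visited, so their results can be reused.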
Matching Inside the Base Pairs
•Dynamic programming algorithm
•Similar to the LCS/Edit distance algorithms of strings
On each comparison we compute only prefixes of the substrings and select the maximal score over 4 expressions:
•Match base pairs
•Match single bases
•Delete from R1
•Delete from R2
[Figures: one per expression, showing positions i and j inside base pairs bp1 and bp2; the two match expressions test S1(i)==S2(j)]
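For orientation, here is the plain-string LCS recurrence that the slides say this computation resembles. It covers only three of the four expressions (match single bases, delete from R1, delete from R2); the "match base pairs" expression is deliberately left out because its exact form is not given here. This is a textbook LCS sketch, not the authors' algorithm, and the name `lcs_score` is illustrative.

```python
def lcs_score(s1, s2):
    """Length of a longest common subsequence of two sequences (sequence only, no structure)."""
    n, m = len(s1), len(s2)
    # D[i][j] = best score for the prefixes s1[:i] and s2[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(D[i - 1][j],      # delete from R1
                       D[i][j - 1])      # delete from R2
            if s1[i - 1] == s2[j - 1]:   # match single bases
                best = max(best, D[i - 1][j - 1] + 1)
            D[i][j] = best
    return D[n][m]
```

Each table cell is filled in O(1) time from three neighbours, which is the property the complexity argument for the base pair version relies on.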
On each comparison we compute the maximal match from left-to-right
On each comparison we compute the maximal match from right-to-left
There are two tricky parts here:
• What happens when a mismatch occurs?
• What happens when the matchings overlap?
The solution: on each comparison we compute the best score going from both right-to-left and left-to-right
[Figure: two example RNAs with positions i and j marked inside the compared base pairs]
Time Complexity
• We only compare prefixes of the base pairs
• There are O(n²) prefixes for each RNA
• Each comparison is computed in O(1) time
• The total time is O(n⁴)
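The count behind the O(n⁴) bound can be written out as follows. The sketch assumes that a prefix of a base pair (a, b) is a substring that starts just inside a and ends at or before b, so each base pair has at most b − a prefixes, and it uses the fact that a Nested structure on n bases has at most n/2 base pairs. Since m < n, both RNAs are bounded in terms of n.

```latex
\[
\#\{\text{prefixes of base pairs in } R\}
  \;\le\; \sum_{(a,b)\in B} (b-a)
  \;\le\; \frac{n}{2}\cdot n \;=\; O(n^2)
\]
\[
T(n) \;=\; \underbrace{O(n^2)}_{\text{prefixes in } R_1}\cdot
          \underbrace{O(n^2)}_{\text{prefixes in } R_2}\cdot
          \underbrace{O(1)}_{\text{per comparison}} \;=\; O(n^4)
\]
```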
Extending the Match
We compute the maximal pattern extension for all bases in R1 and all bases in R2 in one run.
The time complexity: O(n²)
[Figure: position i in R1 (length n) and position j in R2 (length m)]
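A simplified, sequence-only analogue shows how extensions for all position combinations can be filled in one pass. The sketch ignores base pairs entirely and only computes, for every i and j, how far identical bases extend to the right; it is an illustration of the one-run idea under that simplifying assumption, and `right_extensions` is a hypothetical name.

```python
def right_extensions(s1, s2):
    """ext[i][j] = length of the longest common run starting at s1[i] and s2[j]."""
    n, m = len(s1), len(s2)
    # One extra row and column of zeros act as the boundary past the sequence ends.
    ext = [[0] * (m + 1) for _ in range(n + 1)]
    # Fill from the right ends so ext[i + 1][j + 1] is ready when needed.
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if s1[i] == s2[j]:
                ext[i][j] = ext[i + 1][j + 1] + 1
    return ext
```

The table has (n + 1)(m + 1) cells and each is computed in constant time, which matches the quadratic bound stated above. Extensions to the left follow by running the same routine on the reversed sequences.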
Total Time Complexity
Computing the pattern match inside all base pairs is done in O(n⁴)
Computing the pattern match extensions to the right and to the left is done in O(n²)
The total time complexity is O(n⁴) + O(n²) = O(n⁴)
An O(n³ log n) Algorithm
We use Klein’s Tree Edit Distance (’98) ideas: we decompose the larger RNA into heavy paths:
The root base pair is marked light, and we continue recursively: select the maximal child base pair and mark it as heavy, and mark the rest of the children as light
[Figure: example RNA decomposed into heavy paths]
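The heavy/light marking can be sketched for a Nested structure as follows. The encoding of base pairs as index pairs, the use of the span j - i as the size of a base pair, the treatment of every outermost base pair as a root, and the tie-breaking rule (leftmost maximal child wins) are all assumptions made for this illustration.

```python
def heavy_light_marks(pairs):
    """Mark each base pair of a nested structure as 'heavy' or 'light'."""
    pairs = sorted(pairs)                      # by left endpoint: parents come before children
    children = {p: [] for p in pairs}
    roots, enclosing = [], []
    for p in pairs:
        # Drop arcs that close before p opens; what remains encloses p.
        while enclosing and enclosing[-1][1] < p[0]:
            enclosing.pop()
        (children[enclosing[-1]] if enclosing else roots).append(p)
        enclosing.append(p)

    marks = {}
    todo = [(root, "light") for root in roots]  # root base pairs are light
    while todo:
        p, mark = todo.pop()
        marks[p] = mark
        kids = children[p]
        if kids:
            # The maximal child continues the heavy path; all others start new paths.
            heavy = max(kids, key=lambda c: c[1] - c[0])
            todo.extend((c, "heavy" if c == heavy else "light") for c in kids)
    return marks
```

For the base pairs `[(0, 19), (1, 8), (2, 7), (10, 18), (11, 17)]`, the root (0, 19) is light, its larger child (10, 18) is heavy, and the smaller child (1, 8) is light and starts a new heavy path through (2, 7).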
Special Substrings
For each base pair we define its special substrings
[Figure: example RNA with labels bp and hp and positions a, x, y, b, followed by the list of special substrings]
The number of special substrings of a base pair is: |bp| - |hp| + 1
Lemma (Sleator & Tarjan ’83): There are O(n log n) special substrings in R of size n
We compare all O(n²) substrings of R2 with O(n log n) special substrings of R1
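The count |bp| - |hp| + 1 can be reproduced with a small enumeration. The sketch assumes that bp spans positions a..b, that hp is a base pair spanning x..y nested inside it, that |·| is the number of positions spanned, and that the special substrings are obtained by stripping one base at a time, leftmost bases first and then rightmost bases. The stripping order in particular is an assumption; the slides only fix the count.

```python
def special_substrings(bp, hp):
    """Intervals obtained by shrinking bp = (a, b) down to hp = (x, y), with a <= x <= y <= b."""
    (a, b), (x, y) = bp, hp
    intervals = [(a, b)]
    while a < x:            # strip leftmost bases until the left ends coincide
        a += 1
        intervals.append((a, b))
    while b > y:            # then strip rightmost bases until the right ends coincide
        b -= 1
        intervals.append((a, b))
    return intervals
```

For `bp = (0, 9)` and `hp = (3, 6)` this yields 7 intervals, which equals |bp| - |hp| + 1 = 10 - 4 + 1.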
The comparisons are made between the rightmost or leftmost bases, according to the special substring
The total number of compared substrings is O(n³ log n), each one computed in O(1) time, which gives a total of O(n³ log n) running time.
This algorithm also works for Nested x Bounded-Unlimited structures.
An O(n³) Algorithm
Based on the algorithm of Demaine et al. (’07), we decompose both RNAs into heavy paths. The special substrings are decided at each base pair comparison: the base pair that has the largest root light base pair is the dominant one
[Figure: R1 and R2 decomposed into heavy paths, with labeled base pairs]
The number of compared substrings is O(n³)
This algorithm works for Nested x Nested structures only
More Algorithms
•Find the local approximate pattern matching between Nested x Nested structures in O(n³k²) for k allowed mismatches
•Find the local approximate pattern matching between Nested x Bounded-Unlimited structures in O(n³k² log n) for k allowed mismatches
•Find the most similar sibling substructures between Nested x Nested structures in O(n³)
THANK YOU!