Local Exact Pattern Matching for Non-fixed RNA Structures Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Mohl, Christina Schmiedl, Sebastian Will

CPM 2012, Helsinki

An RNA R is an ordered pair (S, B), where S represents the primary structure of R and B represents its secondary structure.

Transcript

Page 1: Title

Local Exact Pattern Matching for Non-fixed RNA Structures

Mika Amit, Rolf Backofen, Steffen Heyne, Gad M. Landau, Mathias Mohl, Christina Schmiedl, Sebastian Will

Page 2: RNA

CPM 2012, Helsinki

An RNA R is an ordered pair (S, B), where:

• S is a sequence defined over the alphabet Σ = {A, C, G, U}
• B is a set of base pairs, each of type C-G, G-C, A-U, or U-A

[Figure: the example sequence CUCGUCAGUACGACUU drawn as a folded structure. Legend: base pair, single base, backbone connection.]
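The definition above maps directly onto a small data type. The following is a minimal sketch only: the class name `RNA`, the helper `make_rna`, and the 0-based `(i, j)` index convention are assumptions made for illustration and do not come from the slides.

```python
from dataclasses import dataclass

# Base pair types listed on the slide.
ALLOWED_PAIRS = {("C", "G"), ("G", "C"), ("A", "U"), ("U", "A")}


@dataclass(frozen=True)
class RNA:
    """An RNA R = (S, B): a sequence S and a set B of base pairs (i, j), i < j, 0-based."""
    S: str
    B: frozenset

    def __post_init__(self):
        # S must be a string over the alphabet {A, C, G, U}.
        if set(self.S) - set("ACGU"):
            raise ValueError("S must be over the alphabet {A, C, G, U}")
        for i, j in self.B:
            if not (0 <= i < j < len(self.S)):
                raise ValueError(f"bad base pair indices: {(i, j)}")
            if (self.S[i], self.S[j]) not in ALLOWED_PAIRS:
                raise ValueError(f"bases {self.S[i]}-{self.S[j]} cannot pair")


def make_rna(sequence, pairs):
    """Hypothetical convenience constructor: accepts any iterable of (i, j) pairs."""
    return RNA(sequence, frozenset(pairs))
```

For example, `make_rna("GAAAC", [(0, 4)])` describes a toy hairpin whose first and last bases form a G-C pair.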

Page 3: RNA

An RNA R is an ordered pair (S, B), where:

• S represents the primary structure of R
• B represents the secondary structure of R

[Figure: the same example RNA as on the previous slide]

Page 4: RNA Representations

[Figure: the example RNA shown in three forms: the folded structure, an arc-annotated string, and a tree]
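As an illustration of how the arc-annotated string and the tree views relate, the sketch below turns a nested arc-annotated string into an ordered forest: each base pair becomes an internal node and each single base becomes a leaf. The function name `to_forest` and the `(label, children)` tuple layout are assumptions for this example, and the code assumes the structure has no crossing arcs.

```python
def to_forest(S, B):
    """Convert a nested arc-annotated string (S, B) into an ordered forest.

    Each node is a tuple (label, children). A base pair (i, j) is labelled with
    its two bases; a single base is a leaf. B holds 0-based pairs (i, j), i < j.
    """
    partner = {i: j for i, j in B}  # left endpoint -> right endpoint

    def build(lo, hi):
        nodes = []
        k = lo
        while k <= hi:
            j = partner.get(k)
            if j is not None:
                # Base pair: its children are everything strictly inside the arc.
                nodes.append((S[k] + S[j], build(k + 1, j - 1)))
                k = j + 1
            else:
                nodes.append((S[k], []))
                k += 1
        return nodes

    return build(0, len(S) - 1)
```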

Page 5: RNA Secondary Structure

The secondary structure of RNA is heavily studied:

• It determines the activity and functionality of the RNA
• It is usually more conserved during evolution than the sequence

[Figure: an example RNA secondary structure]

Page 6: RNA Structure

• Predicting the secondary structure of an RNA molecule is a difficult task
• The structure is sometimes given in a non-fixed form, in which each base pair exists in the RNA with some probability ≤ 1

[Figure: an example RNA secondary structure]

Page 7: Nested Structure

[Figure: the example RNA as a folded structure, an arc-annotated string, and a tree]

In all of these examples, the structure of R is nested: each base can be bonded to at most one other base, and there are no crossing arcs.
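The two conditions in this definition can be checked mechanically. A minimal sketch, assuming base pairs are given as 0-based tuples `(i, j)` with `i < j`; the function name `is_nested` is hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
from itertools import combinations


def is_nested(B):
    """Return True if every base has at most one partner and no two arcs cross."""
    endpoints = [p for pair in B for p in pair]
    if len(endpoints) != len(set(endpoints)):
        return False  # some base takes part in more than one base pair
    # After sorting, i < k holds for every examined pair of arcs (i, j), (k, l).
    for (i, j), (k, l) in combinations(sorted(B), 2):
        if i < k < j < l:
            return False  # arc (k, l) starts inside (i, j) and ends outside it
    return True
```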

Page 8: Unlimited Structure

Arc-annotated strings can represent unlimited structures as well.

[Figure: an arc-annotated string with an unlimited structure]

Page 9: Bounded-Unlimited Structure

Arc-annotated strings can represent bounded-unlimited structures: each base can be connected to a constant number of other bases, and crossing arcs are allowed.

[Figure: an arc-annotated string with a bounded-unlimited structure]
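In contrast to the nested case, the only constraint here is on the number of partners per base. A small sketch of that check follows; the names `max_degree` and `is_bounded_unlimited` and the explicit bound `c` are assumptions for illustration, since the slide only says "a constant number".

```python
from collections import Counter


def max_degree(B):
    """Largest number of base pairs that any single base takes part in."""
    degree = Counter()
    for i, j in B:
        degree[i] += 1
        degree[j] += 1
    return max(degree.values(), default=0)


def is_bounded_unlimited(B, c):
    """True if every base has at most c partners; crossing arcs are not checked."""
    return max_degree(B) <= c
```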

Page 10: RNA Similarity Algorithms

Many algorithms for finding similarity between RNA molecules use tree similarity algorithms.

[Figure: two example RNA secondary structures]

Tree Edit Distance:

• Tai ('79): O(n^6)
• Zhang & Shasha ('89): O(n^4)
• Klein ('98): O(n^3 log n)
• Ma et al. ('99): O(n^3 log n)
• Demaine et al. ('07): O(n^3)

Page 11: RNA Similarity Algorithms

Tree Alignment:

• Jiang et al. ('95)
• Schirmer & Giegerich ('11)
• Backofen et al. ('07)
• Mohl et al. ('09)

Page 12: RNA Similarity Algorithms

Longest Arc Preserving Common Subsequence:

• Evans ('99)
• Lin et al. ('02)
• Alber et al. ('04)
• Jiang et al. ('04)

Page 13: RNA Similarity Algorithms

Similar Subforests:

• Jansson & Peng ('11)

Page 14: Exact Pattern Matching Problem

In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules.

[Figure: two RNA structures with a common region labelled "Pattern"]

Page 15: Patterns in RNAs

In this work, we search for local common sequence-structure regions (patterns) between two given RNA molecules.

Page 16: Exact Pattern Matching Problem

Finding all maximal common structure-sequence regions between two RNAs.

Solved by Backofen & Siebert in O(n^2) for fixed Nested x Nested structures.

[Figure: two arc-annotated RNAs with matched regions. Legend: single base match, left endpoint match, type mismatch.]

Page 17: Exact Pattern Matching Problem

In this work, we solve the problem for non-fixed Nested x Nested structures.

[Figure: the two arc-annotated RNAs from the previous slide, with an example of arc breaking]

Page 18: Arc Breaking Operation

• We support the operation of arc breaking, in which a base pair can be deleted with no penalty

[Figure: an arc-annotated string before arc breaking. Labels: single bases, base pair.]
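On the set-of-pairs representation, arc breaking amounts to removing one pair from B, after which its two endpoints are ordinary single bases. The sketch below illustrates only that bookkeeping; the names `break_arc` and `single_bases` are hypothetical, and nothing here models how the matching algorithm decides which arcs to break.

```python
def break_arc(B, pair):
    """Return a copy of B without `pair`; both of its endpoints become single bases."""
    if pair not in B:
        raise KeyError(f"{pair} is not a base pair of this structure")
    return frozenset(B) - {pair}


def single_bases(n, B):
    """Positions 0..n-1 that do not take part in any base pair of B."""
    paired = {p for pair in B for p in pair}
    return [k for k in range(n) if k not in paired]
```

Breaking an arc is free in the sense of the slide: no score is subtracted, and only the structure annotation changes.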

Page 19: Arc Breaking Operation

• We support the operation of arc breaking, in which a base pair can be deleted with no penalty

[Figure: the same RNA drawn as a folded structure. Labels: single bases, base pair.]

Page 20: Arc Breaking

• We support the operation of arc breaking, in which a base pair can be deleted with no penalty

[Figure: two example RNA secondary structures, with one U-A base pair highlighted]

Page 21: Arc Breaking

Patterns are now less restrictive.

[Figure: example patterns found with arc breaking]

Page 22: Exact Pattern Matching Algorithms

We describe three algorithms for finding the local exact pattern matching between two RNAs:

• A simple O(n^4) algorithm (using ideas from Zhang & Shasha ('89))
• An improved O(n^3 log n) algorithm (using ideas from Klein ('98))
• An O(n^3) algorithm (using ideas from Demaine, Weimann et al. ('07))

Page 23: Exact Pattern Matching Algorithm

Input: R1 = (S1, B1) and R2 = (S2, B2), |R1| = n, |R2| = m, n > m

Output: the local exact pattern matching between R1 and R2

[Figure: the two input RNAs R1 and R2 as arc-annotated strings]

Page 24: Exact Pattern Matching Algorithm

We compare each base pair from R1 with each base pair from R2, in increasing order of their sizes.

[Figure: R1 and R2 with one base pair highlighted in each]
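The processing order can be sketched as follows. Sorting by increasing size means that, in a nested structure, every base pair is handled after the pairs it encloses. The function names, the use of `j - i` as the size of a pair `(i, j)`, and the tie-break by left endpoint are assumptions for this illustration.

```python
def pairs_by_size(B):
    """Base pairs (i, j) sorted by increasing span j - i, ties broken by left endpoint."""
    return sorted(B, key=lambda p: (p[1] - p[0], p[0]))


def comparison_order(B1, B2):
    """Yield every (pair of R1, pair of R2) combination, small pairs first."""
    order2 = pairs_by_size(B2)
    for p in pairs_by_size(B1):
        for q in order2:
            yield p, q
```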

Page 25: Exact Pattern Matching Algorithm

For each such pair of base pairs, we compute the matching inside the base pairs and its extensions outside them.

[Figure: two base pairs with their inside regions and outward extensions]

Page 26: Matching Inside the Base Pairs

• Dynamic programming algorithm
• Similar to the LCS and edit distance algorithms for strings

Page 27: Matching Inside the Base Pairs

On each comparison we compute only prefixes of the substrings and select the maximal score over 4 expressions:

• Match base pairs

[Figure: prefixes 1..i of bp1 and 1..j of bp2, with the test S1(i) == S2(j)]

Page 28: Matching Inside the Base Pairs

• Match single bases

[Figure: prefixes 1..i of bp1 and 1..j of bp2, with the test S1(i) == S2(j)]

Page 29: Matching Inside the Base Pairs

• Delete from R1
• Delete from R2

[Figure: prefixes 1..i of bp1 and 1..j of bp2, with the shortened prefix 1..i-1]
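The four expressions named on these slides have the familiar shape of an LCS-style recurrence over prefixes. One plausible way to write that shape is sketched below. The table name D, the inner score I, the "+2" and "+1" score units, and the exact side conditions are all assumptions introduced for illustration; the slides do not state the formula, and the actual algorithm also has to keep the matched pattern local.

```latex
% Assumed notation (not from the slides):
%   D(i,j): best score for prefix 1..i inside bp_1 against prefix 1..j inside bp_2
%   I(p,q): previously computed score for the inside of base pairs p and q
D(i,j) = \max
\begin{cases}
D(k-1,\, l-1) + I\bigl((k,i),(l,j)\bigr) + 2
  & \text{if } (k,i) \in B_1,\ (l,j) \in B_2,\ S_1(k) = S_2(l),\ S_1(i) = S_2(j) \\[2pt]
D(i-1,\, j-1) + 1
  & \text{if } S_1(i) = S_2(j) \quad \text{(match single bases)} \\[2pt]
D(i-1,\, j)
  & \text{(delete from } R_1\text{)} \\[2pt]
D(i,\, j-1)
  & \text{(delete from } R_2\text{)}
\end{cases}
```

Because base pairs are processed in increasing order of size, any inner score I needed by the first case would already be available when D(i,j) is filled in.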

Page 30: Matching Inside the Base Pairs

On each comparison we compute the maximal match from left to right.

[Figure: two arc-annotated RNAs with prefixes 1..i and 1..j scanned left to right]

Page 31: Matching Inside the Base Pairs

On each comparison we compute the maximal match from right to left.

[Figure: the same two RNAs scanned right to left]

Page 32: Matching Inside the Base Pairs

There are two tricky parts here:

• What happens when a mismatch occurs?

[Figure: the same two RNAs with a C/G mismatch highlighted]

Page 33: Matching Inside the Base Pairs

There are two tricky parts here:

• What happens when the matchings overlap?

[Figure: the same two RNAs with overlapping left-to-right and right-to-left matchings]

Page 34: Matching Inside the Base Pairs

The solution: on each comparison we compute the best score going both right-to-left and left-to-right.

[Figure: the same two RNAs scanned in both directions]

Page 35: Time Complexity

• We only compare prefixes of the base pairs
• There are O(n^2) prefixes for each RNA
• Each comparison is computed in O(1) time
• The total time is O(n^4)
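Spelled out, the total follows by multiplying the number of prefix pairs by the cost per comparison. The sketch below uses the fact that the shorter RNA has length m < n, so its O(m^2) prefixes are also O(n^2):

```latex
\underbrace{O(n^2)}_{\text{prefixes in } R_1}
\cdot
\underbrace{O(m^2)}_{\text{prefixes in } R_2}
\cdot
\underbrace{O(1)}_{\text{per comparison}}
= O(n^2 m^2) \subseteq O(n^4)
\qquad (m < n).
```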

Page 36: Extending the Match

We compute the maximal pattern extension for all bases in R1 and all bases in R2 in one run.

Time complexity: O(n^2)

[Figure: R1 with positions i..n and R2 with positions j..m]
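To give a feel for why a single O(n^2) pass suffices, the sketch below computes, for every pair of positions, how far an exact match can be extended to the left. It is a deliberately simplified, sequence-only analogue: it ignores base pairs and arc breaking entirely, and the function name `left_extensions` is hypothetical.

```python
def left_extensions(s1, s2):
    """ext[i][j] = length of the longest common suffix of s1[:i] and s2[:j].

    One pass over all (i, j) fills the whole table in O(len(s1) * len(s2)) time.
    """
    n, m = len(s1), len(s2)
    ext = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the match that ended one position earlier in both strings.
                ext[i][j] = ext[i - 1][j - 1] + 1
    return ext
```

Extensions to the right can be obtained the same way by running the function on the reversed strings.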

Page 37: Total Time Complexity

Computing the pattern match inside all base pairs is done in O(n^4).

Computing the pattern match extensions to the right and to the left is done in O(n^2).

The total time complexity is O(n^4) + O(n^2) = O(n^4).

Page 38: An O(n^3 log n) Algorithm

We use ideas from Klein's tree edit distance algorithm ('98): we decompose the larger RNA into heavy paths.

The root base pair is marked light, and we continue recursively: select the maximal child base pair and mark it as heavy, and mark the rest of the children as light.

[Figure: an arc-annotated RNA with its heavy and light base pairs marked]
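The marking rule can be sketched on a nested set of base pairs as follows. This is an illustration under stated assumptions: "maximal child" is read as the child with the widest span `j - i`, every top-level base pair is treated as a root, and the function name `heavy_light` is hypothetical.

```python
def heavy_light(B):
    """Label each base pair of a nested structure as 'heavy' or 'light'."""
    pairs = sorted(B)  # by left endpoint, so enclosing pairs precede enclosed ones
    children = {p: [] for p in pairs}
    roots, stack = [], []
    for p in pairs:
        # Drop arcs that close before p opens; the remaining top encloses p.
        while stack and stack[-1][1] < p[0]:
            stack.pop()
        (children[stack[-1]] if stack else roots).append(p)
        stack.append(p)

    label = {}
    todo = [(r, "light") for r in roots]  # roots are marked light
    while todo:
        p, lab = todo.pop()
        label[p] = lab
        kids = children[p]
        if kids:
            heavy = max(kids, key=lambda q: q[1] - q[0])  # widest child is heavy
            todo.extend((q, "heavy" if q == heavy else "light") for q in kids)
    return label
```

Following heavy children from any light base pair traces out one heavy path of the decomposition.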

Page 39: Special Substrings

For each base pair we define its special substrings.

[Figure: a base pair bp spanning positions a..b and its heavy child hp spanning positions x..y, with the substrings obtained by shrinking bp one base at a time down to hp]

The number of special substrings of a base pair is |bp| - |hp| + 1.

Lemma (Sleator & Tarjan '83): there are O(n log n) special substrings in an RNA R of size n.
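The count |bp| - |hp| + 1 can be reproduced by shrinking the span of bp one base at a time until it coincides with the span of hp. The sketch below enumerates such a chain; the order used here (strip from the left first, then from the right) is an assumption made for illustration, and the function name `special_substrings` is hypothetical.

```python
def special_substrings(bp, hp):
    """Spans obtained by shrinking bp = (a, b) base by base down to hp = (x, y).

    Spans are inclusive index ranges, so |bp| = b - a + 1 and |hp| = y - x + 1.
    """
    (a, b), (x, y) = bp, hp
    if not (a <= x <= y <= b):
        raise ValueError("hp must lie inside bp")
    spans = [(a, b)]
    while a < x:  # remove bases from the left end
        a += 1
        spans.append((a, b))
    while b > y:  # then remove bases from the right end
        b -= 1
        spans.append((a, b))
    return spans
```

The chain has (x - a) + (b - y) + 1 entries, which equals |bp| - |hp| + 1.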

Page 40: An O(n^3 log n) Algorithm

We compare all O(n^2) substrings of R2 with the O(n log n) special substrings of R1.

[Figure: a base pair bp (a..b) with heavy child hp (x..y) and its special substrings]

Page 41: An O(n^3 log n) Algorithm

The comparisons are made between the rightmost or the leftmost bases, according to the special substring.

[Figure: a base pair bp (a..b) with heavy child hp (x..y) and its special substrings]

Page 42: An O(n^3 log n) Algorithm

The total number of compared substring pairs is O(n^3 log n), each computed in O(1) time, which gives a total running time of O(n^3 log n).

This algorithm also works for Nested x Bounded-Unlimited structures.

[Figure: a base pair bp (a..b) with heavy child hp (x..y) and its special substrings]

Page 43: An O(n^3) Algorithm

Based on the algorithm of Demaine et al. ('07), we decompose both RNAs into heavy paths. The special substrings are decided at each comparison of base pairs: the base pair whose root light base pair is the largest is the dominant one.

[Figure: R1 with base pairs labelled 1 to 9 and R2 with base pairs labelled A to F]

Page 44: An O(n^3) Algorithm

The number of compared substrings is O(n^3).

This algorithm works for Nested x Nested structures only.

[Figure: R1 with base pairs labelled 1 to 9 and R2 with base pairs labelled A to F]

Page 45: More Algorithms

• Find the local approximate pattern matching between Nested x Nested structures in O(n^3 k^2) for k allowed mismatches
• Find the local approximate pattern matching between Nested x Bounded-Unlimited structures in O(n^3 k^2 log n) for k allowed mismatches
• Find the most similar sibling substructures between Nested x Nested structures in O(n^3)

Page 46: Closing

Thank you!