Page 1: Implementation of Planted Motif Search Algorithms PMS1 and PMS2 Clifford Locke BioGrid REU, Summer 2008 Department of Computer Science and Engineering.

Implementation of Planted Motif Search Algorithms PMS1 and PMS2

Clifford Locke
BioGrid REU, Summer 2008
Department of Computer Science and Engineering
University of Connecticut, Storrs, CT

Introduction

  • General problem: multiple sequence comparison
  • Biological basis
      • DNA structure/function
          • Sequence of nucleotides, modeled as strings
          • Genes code for proteins
          • Structure, function
      • Evolution: result of DNA mutations and selective pressures

Image credit: www.britannica.com

Introduction

  • Goals of multiple sequence comparison
      • Deduce evolutionary relationships.
      • Study protein and gene function.
      • Find transcription factor/regulatory protein binding sites.
  • Approaches
      • Find common subpatterns and deduce a biological relationship.
      • Find common subpatterns between DNA sequences with a known biological relationship.

Planted Motif Search

  • Motifs: common functional subsequences in a set of biological sequences
  • Planted (l,d)-motif search problem: the input is n strings (S1, S2, … , Sn), each of length m, and two integers l and d. Find all strings x such that |x| = l and every input string contains at least one variant of x at a Hamming distance of at most d.
  • Primary applications: finding transcription factor binding sites; drug target identification
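
The problem statement can be checked directly for a single candidate string x. The following is a minimal Python sketch, not the code from the talk; the function names `hamming` and `is_motif` are made up for illustration.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def is_motif(x, seqs, d):
    """True if every sequence has some l-mer within Hamming distance d of x."""
    l = len(x)
    return all(
        any(hamming(x, s[i:i + l]) <= d for i in range(len(s) - l + 1))
        for s in seqs
    )


# Example: ACGT occurs exactly in the first string and with one
# mismatch (ACCT, AGGT) in the other two.
seqs = ["TTACGTTT", "ACCTGGGG", "CCCCAGGT"]
print(is_motif("ACGT", seqs, 1))
```

Checking one candidate this way costs on the order of n·m·l character comparisons; the algorithms in the following slides are about finding the candidates in the first place.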

Algorithm PMS1

  • Generate the set of all l-mers in each input sequence. Let Ci be the set of l-mers of Si.
  • For each l-mer u in Ci (1 ≤ i ≤ n), generate all l-mers v such that v is at a Hamming distance of at most d from u (v is a "neighbor" of u). Let Li be the set of all such l-mers u and v for input sequence Si.
  • Sort each set of neighbors Li alphabetically and eliminate any duplicates.
  • Merge and intersect all sets Li to find the l-mers that appear in every neighborhood. These l-mers are the motifs of the input sequences.
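
The steps above can be sketched in a few lines of Python. This is an illustration under assumptions rather than the implementation that was timed: it assumes the DNA alphabet ACGT, and it uses Python sets in place of the sort, de-duplicate, and merge steps. The names `neighbors` and `pms1` are hypothetical.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumed alphabet


def neighbors(u, d, alphabet=ALPHABET):
    """All strings within Hamming distance d of u, including u itself."""
    out = {u}
    for k in range(1, d + 1):
        for positions in combinations(range(len(u)), k):
            # At each chosen position, try every letter other than the original.
            choices = [[c for c in alphabet if c != u[p]] for p in positions]
            for letters in product(*choices):
                v = list(u)
                for p, c in zip(positions, letters):
                    v[p] = c
                out.add("".join(v))
    return out


def pms1(seqs, l, d, alphabet=ALPHABET):
    """Intersect the d-neighborhoods of all input sequences."""
    common = None
    for s in seqs:
        L_i = set()  # neighborhood of sequence s (set removes duplicates)
        for i in range(len(s) - l + 1):
            L_i |= neighbors(s[i:i + l], d, alphabet)
        common = L_i if common is None else common & L_i
    return sorted(common or [])


print(pms1(["GGACGTGG", "TTACCTAA", "CAGGTCCC"], 4, 1))
```

A set intersection gives the same result as merging sorted lists; the sorted-list formulation in the slide matters mainly for memory layout and running time, which this sketch does not try to reproduce.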

Algorithm PMS2

Algorithm PMS2 exploits these observations (here "occurs" means occurs within Hamming distance d):

  • If M occurs in each input sequence, then at least l - k + 1 length-k substrings of M occur in each input sequence.
  • In each input sequence Sj there must be at least one position ij such that a k-mer of M occurs at each of the positions ij, ij + 1, … , ij + l - k.
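
A small check of the second observation, written as a hedged Python sketch: if a variant of M is planted at some position, then every k-mer of M matches the correspondingly shifted window of the sequence within distance d. The function name `consecutive_kmer_starts` and the example strings are invented for illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def consecutive_kmer_starts(M, s, k, d):
    """Positions i in s such that, for every offset t, the k-mer of M at
    offset t is within Hamming distance d of the k-mer of s at i + t."""
    l = len(M)
    starts = []
    for i in range(len(s) - l + 1):
        if all(hamming(M[t:t + k], s[i + t:i + t + k]) <= d
               for t in range(l - k + 1)):
            starts.append(i)
    return starts


M = "ACGTACGTA"                      # l = 9
s = "GG" + "ACCTACGAA" + "TT"        # variant of M with 2 mismatches at position 2
print(consecutive_kmer_starts(M, s, k=5, d=2))
```

The converse does not hold in general: all k-mers matching within d does not by itself guarantee that the full l-mer is within d of M, which is why PMS2 still verifies its candidates.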

Algorithm PMS2

  • Use a modified PMS1 to solve the planted (d+c, d)-motif problem. Let R contain the (d+c)-motifs.
  • Find all of the occurrences of the members of R in an arbitrary input sequence Sj. Let Li contain the (d+c)-motifs of R with variants starting at position i of Sj.
  • For each position i in Sj:
      • A is the l-mer of Sj starting at position i.
      • M1 and M2 are members of Li and Li+l-(d+c), respectively.
      • If the last 2(d+c) - l characters of M1 are equal to the first 2(d+c) - l characters of M2, form an l-mer B by appending the last l - (d+c) characters of M2 to M1.
      • If dH(A,B) ≤ d, add B to a list of candidates C.
  • Once the list of candidates is complete, check whether each candidate is a motif.
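
The candidate-building loop can be sketched as follows. This is an illustrative Python sketch, not the timed implementation: it assumes the sets Li are already available as a dictionary `L` mapping a start position to a set of (d+c)-mers, and the names `join_overlapping` and `pms2_candidates` are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def join_overlapping(m1, m2, l):
    """Glue two k-mers (k = d + c) into an l-mer if the last 2k - l
    characters of m1 equal the first 2k - l characters of m2."""
    k = len(m1)
    overlap = 2 * k - l
    if overlap < 0:
        raise ValueError("need 2(d+c) >= l")
    if m1[k - overlap:] == m2[:overlap]:
        return m1 + m2[overlap:]   # appends the last l - k characters of m2
    return None


def pms2_candidates(s_j, L, l, d):
    """L: dict {position i: set of (d+c)-motifs with a variant at i in s_j}."""
    candidates = set()
    for i in range(len(s_j) - l + 1):
        A = s_j[i:i + l]
        for m1 in L.get(i, ()):
            k = len(m1)
            for m2 in L.get(i + l - k, ()):
                B = join_overlapping(m1, m2, l)
                if B is not None and hamming(A, B) <= d:
                    candidates.add(B)
    return candidates


L = {2: {"ACGTA"}, 6: {"ACGGT"}}
print(pms2_candidates("TTACGTACGGTTT", L, l=9, d=1))
```

Each candidate returned here would still need the final verification step against all input sequences.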

Results

  • n = 20 and m = 600; an arbitrary motif was inserted in each input sequence.
  • Each implementation gave the correct planted motif for each (l,d) case.
  • PMS1 was faster than PMS2 for the challenging instances (9,2) and (11,3).
  • Otherwise, PMS2 could be faster, depending on the value of c.
      • Low values of c lead to a high number of (d+c,d)-motifs, which leads to a high number of candidate strings.
  • Conclusions
      • PMS1 is better suited for challenge problems.
      • PMS2 is better suited for larger l.

(l,d)     PMS1       PMS2
(9,2)     53.343     d+c = 5: 305.672;  d+c = 6: 342.657;  d+c = 7: 394.234;  d+c = 8: 72.609
(10,2)    73.203     d+c = 7: 547.25;   d+c = 8: 72.344;   d+c = 9: 53.828
(11,2)    89.704     d+c = 7: 705.75;   d+c = 8: 70.25;    d+c = 9: 54.046
(12,2)    118.266    d+c = 8: 76.468;   d+c = 9: 54.312;   d+c = 10: 71.531
(11,3)    1076.23    d+c = 10: 1105.03
(12,3)    1552.47    d+c = 10: 1059.83

Runtimes, in seconds, of algorithms PMS1 and PMS2
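
The slides do not say how the test instances were generated. One common way to build such an instance, shown here purely as an assumption, is to draw n random strings of length m over ACGT and overwrite a random window in each with a copy of the motif mutated in exactly d positions. The function name `plant_instance` and the fixed seed are illustrative.

```python
import random


def plant_instance(n, m, l, d, seed=0, alphabet="ACGT"):
    """Return (motif, sequences) with one mutated motif copy per sequence.
    Assumption: each planted copy differs from the motif in exactly d places."""
    rng = random.Random(seed)
    motif = "".join(rng.choice(alphabet) for _ in range(l))
    seqs = []
    for _ in range(n):
        s = [rng.choice(alphabet) for _ in range(m)]
        variant = list(motif)
        for p in rng.sample(range(l), d):
            variant[p] = rng.choice([c for c in alphabet if c != motif[p]])
        start = rng.randrange(m - l + 1)
        s[start:start + l] = variant   # overwrite a window with the variant
        seqs.append("".join(s))
    return motif, seqs


motif, seqs = plant_instance(n=20, m=600, l=9, d=2)
print(motif, len(seqs), len(seqs[0]))
```

With these parameters the planted motif is guaranteed to be a valid (l,d)-motif of the instance, although other l-mers may qualify by chance.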

Future Work

Minimization of Consensus Sequences

  • Consensus sequence: an expression that can be used to describe two or more sequences
  • Two forms:
      • {c1, c2, … , cn}: one of the characters ck in the list is present
      • {i1, i2, … , in}c: character c occurs ik times, for some ik in the list
  • Examples:
      • Merging abcde, abccde, abcdee, and abccdee gives ab{1,2}cd{1,2}e
      • Merging agtgc and actgc gives a{c,g}tgc
  • Problem statement: output a minimum number of consensus sequences for a given set of input sequences.
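
To make the two forms concrete, the following Python sketch expands a consensus sequence back into the set of strings it describes. It is an illustration only and assumes the sequence alphabet contains no digits, braces, or commas, so that a brace group of digits can be read as repeat counts for the next character. The function name `expand` is made up.

```python
from itertools import product


def expand(consensus):
    """Return the set of plain strings described by a consensus sequence."""
    parts = []  # one list of alternative substrings per position
    i = 0
    while i < len(consensus):
        ch = consensus[i]
        if ch == "{":
            j = consensus.index("}", i)
            items = [t.strip() for t in consensus[i + 1:j].split(",")]
            if all(t.isdigit() for t in items):
                # Repeat-count form {i1,...,in}c: applies to the next character.
                c = consensus[j + 1]
                parts.append([c * int(t) for t in items])
                i = j + 2
            else:
                # Character-choice form {c1,...,cn}.
                parts.append(items)
                i = j + 1
        else:
            parts.append([ch])
            i += 1
    return {"".join(p) for p in product(*parts)}


print(sorted(expand("ab{1,2}cd{1,2}e")))
print(sorted(expand("a{c,g}tgc")))
```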

Minimization of Consensus Sequences

  • Algorithm
      • To start, all input sequences are "alive".
      • An arbitrary alive sequence S is chosen and compared with every other alive sequence T to check whether they can be merged.
          • Dynamic programming is used to optimally align S and T.
          • The optimal alignments of S and T will have loops corresponding to insertions, deletions, and replacements.
      • Merging may occur only if all loops can be resolved.
          • All mismatches can be resolved.
          • Insertions and deletions can be resolved only if there is a match of the inserted/deleted character to the left or right of the loop.
      • If S and T can be merged, a consensus sequence is generated and added to the list of "alive" sequences. S and T are killed.
      • This process continues until no two alive sequences can be merged. At that point, all remaining alive input and consensus sequences are output.
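
The alive/killed bookkeeping can be sketched in Python as below. This sketch deliberately replaces the alignment-based merge test with a much simpler stand-in: two sequences merge only if they have the same length, and every mismatching column becomes a {c1, c2, …} choice. Insertions and deletions are not handled, so it illustrates the control flow only. All names (`try_merge`, `render`, `minimize`) are hypothetical.

```python
def try_merge(s, t):
    """Simplified stand-in for the merge test: equal-length sequences only.
    Each sequence is a tuple of frozensets of allowed characters."""
    if len(s) != len(t):
        return None
    return tuple(a | b for a, b in zip(s, t))


def render(seq):
    """Write a tuple of character sets in {c1,c2} notation."""
    return "".join(
        next(iter(p)) if len(p) == 1 else "{" + ",".join(sorted(p)) + "}"
        for p in seq
    )


def minimize(seqs):
    alive = [tuple(frozenset(c) for c in s) for s in seqs]
    merged = True
    while merged:                      # repeat until no pair can be merged
        merged = False
        for i in range(len(alive)):
            for j in range(i + 1, len(alive)):
                m = try_merge(alive[i], alive[j])
                if m is not None:
                    # S and T are killed; the consensus becomes alive.
                    alive = [x for k, x in enumerate(alive) if k not in (i, j)]
                    alive.append(m)
                    merged = True
                    break
            if merged:
                break
    return [render(x) for x in alive]


print(minimize(["agtgc", "actgc", "ab"]))
```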

Summary

  • Planted Motif Search Problem: find an l-mer that differs in d or fewer positions from at least one l-mer in each input sequence.
  • Algorithms PMS1 and PMS2 are based on a model that generates the neighborhood of every input sequence and intersects the neighborhoods to find the motifs.
  • PMS1 is best suited for challenge problems; use PMS2 for larger l.
  • Future work will include the minimization of consensus sequences (regular expressions).

Acknowledgements

Special thanks to:

  • Sanguthevar Rajasekaran
  • National Science Foundation
  • University of Connecticut Department of Computer Science and Engineering

Levenshtein Distance

  • Formal definition: the lowest number of edit operations, consisting of insertion (I), deletion (D), and replacement (R), necessary to convert one string to another.
  • Algorithm
      • Let $D_{i,j}$ be the edit distance of $S_1(1 \ldots i)$ and $S_2(1 \ldots j)$.
      • Add a blank space to the beginning of each string and align the strings along the edges of a matrix.
      • By definition, $D_{i,0} = i$ and $D_{0,j} = j$.
      • Recurrence relation: $D_{i,j} = \min(D_{i-1,j} + 1,\ D_{i,j-1} + 1,\ D_{i-1,j-1} + t_{i,j})$, where $t_{i,j} = 0$ if $S_1[i] = S_2[j]$ and $1$ otherwise (substitution).
      • By definition of $D_{i,j}$, $D_{n,m}$, where $n = |S_1|$ and $m = |S_2|$, is the edit distance of $S_1$ and $S_2$.
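
The recurrence translates directly into a table-filling routine. A minimal Python sketch follows; the function names are illustrative, and Python strings are 0-indexed, so S1[i] in the slide corresponds to `s1[i - 1]` in the code.

```python
def edit_matrix(s1, s2):
    """Fill the (n+1) x (m+1) matrix D of edit distances between prefixes."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # delete i characters
    for j in range(m + 1):
        D[0][j] = j                      # insert j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # match or replacement
    return D


def levenshtein(s1, s2):
    """Edit distance: the bottom-right cell of the matrix."""
    return edit_matrix(s1, s2)[len(s1)][len(s2)]


print(levenshtein("vintner", "writers"))  # prints 5
```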

Example

S1 = vintner, S2 = writers

Adapted from Algorithms on Strings, Trees, and Sequences by Dan Gusfield, 1999.

Initialized matrix (row 0 and column 0 filled in):

            w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
v   1    1
i   2    2
n   3    3
t   4    4
n   5    5
e   6    6
r   7    7

Completed matrix:

            w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
v   1    1  1  2  3  4  5  6  7
i   2    2  2  2  2  3  4  5  6
n   3    3  3  3  3  3  4  5  6
t   4    4  4  4  4  3  4  5  6
n   5    5  5  5  5  4  4  5  6
e   6    6  6  6  6  5  4  5  6
r   7    7  7  6  7  6  5  4  5

The value in the bottom-right cell gives the Levenshtein distance (5).

Optimal Alignment from Levenshtein Distance

  • Working from the bottom right of the matrix, insert pointers. Set a pointer from cell (i, j) to:
      • Cell (i-1, j) if $D_{i,j} = D_{i-1,j} + 1$. Corresponds to a deletion of S1(i) from S1.
      • Cell (i, j-1) if $D_{i,j} = D_{i,j-1} + 1$. Corresponds to an insertion of S2(j) into S1.
      • Cell (i-1, j-1) if $D_{i,j} = D_{i-1,j-1} + t_{i,j}$. Corresponds to a match (t = 0) or a replacement (t = 1).
  • Follow the pointers from $D_{n,m}$ to $D_{0,0}$ to get an optimal alignment.
  • Some cells may have more than one pointer, in which case more than one optimal alignment exists.
  • 3 optimal alignments in the example:

    w r i t _ e r s
    v i n t n e r _

    w r i _ t _ e r s
    v _ i n t n e r _

    w r i _ t _ e r s
    _ v i n t n e r _
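
The pointer rules can be followed recursively to enumerate every optimal alignment. The Python sketch below is illustrative (the function names are made up); it recomputes the pointer conditions on the fly instead of storing pointers, and it prints S2 on the top row and S1 on the bottom row to match the layout used in the slide.

```python
def edit_matrix(s1, s2):
    """Edit-distance matrix D between all prefixes of s1 and s2."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)
    return D


def optimal_alignments(s1, s2):
    """All optimal alignments as (row for s2, row for s1), '_' marks a gap."""
    D = edit_matrix(s1, s2)
    results = []

    def walk(i, j, top, bottom):
        if i == 0 and j == 0:
            results.append(("".join(reversed(top)), "".join(reversed(bottom))))
            return
        if i > 0 and D[i][j] == D[i - 1][j] + 1:            # deletion of s1[i-1]
            walk(i - 1, j, top + ["_"], bottom + [s1[i - 1]])
        if j > 0 and D[i][j] == D[i][j - 1] + 1:            # insertion of s2[j-1]
            walk(i, j - 1, top + [s2[j - 1]], bottom + ["_"])
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            walk(i - 1, j - 1, top + [s2[j - 1]], bottom + [s1[i - 1]])

    walk(len(s1), len(s2), [], [])
    return results


for top, bottom in optimal_alignments("vintner", "writers"):
    print(" ".join(top))
    print(" ".join(bottom))
    print()
```

The number of optimal alignments can grow exponentially with string length, so full enumeration is practical only for short strings like this example.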

Example

            w  r  i  t  e  r  s
         0  1  2  3  4  5  6  7
    0    0  1  2  3  4  5  6  7
v   1    1  1  2  3  4  5  6  7
i   2    2  2  2  2  3  4  5  6
n   3    3  3  3  3  3  4  5  6
t   4    4  4  4  4  3  4  5  6
n   5    5  5  5  5  4  4  5  6
e   6    6  6  6  6  5  4  5  6
r   7    7  7  6  7  6  5  4  5

    w r i _ t _ e r s
    v _ i n t n e r _