Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 1

EE3J2 Data Mining

Lecture 12: Sequence Analysis

Martin Russell


Objectives

To motivate the need for sequence analysis To introduce the notions of:

– Alignment path

– Accumulated distance along a path

To introduce Dynamic Programming and the principle of optimality

To explain the use of dynamic programming to find the distance between two sequences


Sequences

Sequential data common in real applications:– DNA analysis in bioinformatics and forensic science

– Sequences of the letters A, G, C and T– Signature recognition biometrics– Words and text

– Spelling and grammar checkers, author verification,..– Speech, music and audio

– Speech/speaker recognition, speech coding and synthesis– Electronic music

– Radar signature recognition…


Retrieving sequential data

Large corpora of sequential data– E.g: DNA databases

Individual sequences may not be amenable to human interpretation

Need for automated sequential data retrieval Need to ‘mine’ sequential data Fundamental requirement is a measure of distance

between two sequences


Basic definitions

In a typical sequence analysis application we have a basic alphabet consisting of N symbols

Examples:– In conventional text A is the set of letters plus

punctuation plus ‘white space’– In bioinformatics, scientists us {A, B, D, F} to describe

DNA sequences

Nn ,...,,...,1


Distance between sequences (1)

Consider two sequences of symbols from the alphabet A = {A, B, C, D}

How similar are the sequences:– S1 = ABCD

– S2 = ABD

Intuitively S2 is obtained from S1 by deleting C

An alternative explanation is that S2 is obtained from S1 by substituting D for C and then deleting D


Distance between sequences (2)

Alternatively S2 was obtained from S1 be deleting ABCD and inserting ABC

… Why is the first explanation intuitively ‘correct’ and

the last one intuitively ‘wrong’? Because we favour the simplest explanation which

involves the minimum number of insertions, deletions and substitutions

Or do we?


Distance between sequences (3) Consider:

– S1 = AABC– S2 = SABC– S3 = PABC– S4 = ASCB

If we know that these sequences were typed then S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard

Similarly S4 is close to S2 because letter-swapping (SA AS etc) is a common typographical error


Alignments

The relationship between two sequences can be expressed as an alignment between their elements

E.G: Insertion

A B B C

A

B

C


Alignment: deletion and substitution

A X C

A

B

C

substitution

A C

A

B

C

deletion

N.B: All edits described relative to vertical string


General alignment pathA C X C C D

A

B

C

D

Which alignment is best?


The Distance Matrix

Suppose that d(A,B) denotes the distance between the alphabet symbols A and B

Examples:– d(A,B) = 0 if A = B, otherwise d(A,B) = 1

– In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B

– In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B


Notation

Suppose we have an alphabet:

A distance matrix for A is an N N matrix

where is the distance between the mth and nth alphabet symbols

Nn ,...,,...,1

NnmDD nm ,1 ,,

nmnm dD ,,


The Accumulated Distance

Consider two sequences:– S1 = ABCD

– S2 = ACXCCD

For any alignment path p between S1 and S2 we define the accumulated distance between S1 and S2, denoted by ADp(S1,S2), to be the sum over all the nodes of p of the corresponding distances between elements of S1 and S2.


Accumulated distance along pA C X C C D

A

B

C

D DDdCCdCCdXCdCCdBAdAAdSSADp ,,,,,,,),( 21

Path p


Accumulated distance (continued)A C X C C D

A

B

C

D XCdBAdSSADp ,,),( 21

XCdKKKBAdKSSAD INSINSINSDELp ,,),( 21

KDEL

KINS KINS KINS


Optimal path

The optimal path is the path with minimum accumulated distance

Formally the optimal path is

where:

The accumulated distance AD(S1,S2) between S1 and S2 is given by:

p̂

2121ˆ21 ,max,or ,,maxargˆ SSADSSADSSADp pp

ppp

21ˆ21 ,, SSADSSAD p


Calculating the optimal path

Given – the distance matrix D, – the insertion penalty KINS , and – the deletion penalty KDEL

*

How can we compute the optimal path between two (potentially long) sequences S1 and S2?

In a real application the lengths of S1 and S2 might be large

*If KDEL and KINS are not defined you should assume that they are zero


Dynamic Programming (DP)

The optimal path is calculated using Dynamic Programming (DP) DP is based on the principle of optimality

• Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of:

• (i) the best path from X to A plus the cost of going from A to Y

• (ii) the best path from X to A plus the cost of going from B to Y

X Y

A

B


DP – step 1

Step 1: draw the trellis of all possible paths

A C X C C D

A

B

C

D


DP – forward pass – initialisation

A C X C C D

A

B

C

D

(Forward) path matrix

d(A,A)

(Forward) accumulated distance matrix

ad(2,1)=ad(1,1)+d(2,1)


ad(i,j)

ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j)

It is calculated using the principle of optimality

In addition, the forward path matrix records the local optimal path at each point

jid

Kjiad

jiad

Kjiad

jiad

INS

DEL

,

1,

1,1

,1

min,

(i-1,j)

(i,j-1)

(i-1,j-1)


DP – forward pass – continued

A C X C C D

A

B

C

D



ad(1,2) = ad(1,1)+d(A,C)



A C X C C D

A

B

C

D



CCd

Kad

ad

Kad

ad

INS

DEL

,

2,3

2,2

3,2

min3,3



A C X C C D

A

B

C

D



CCd

Kad

ad

Kad

adSSAD

INS

DEL

,

5,4

5,3

6,3

min6,4, 21



A C X C C D

A

B

C

D



Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand

corner


Summary

Introduction to sequence analysis Definitions of:

– distance,

– forward accumulated distance,

– forward path matrix,

– optimal path

Dynamic programming How to computed the accumulated distance using DP How to recover the optimal path

Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

Documents

Transcript of Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.