Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

27
EE3J2 Data Mining Slide 1 EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell
  • date post

    19-Dec-2015
  • Category

    Documents

  • view

    214
  • download

    0

Transcript of Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

Page 1: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 1

EE3J2 Data Mining

Lecture 12: Sequence Analysis

Martin Russell

Page 2: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 2

Objectives

To motivate the need for sequence analysis To introduce the notions of:

– Alignment path

– Accumulated distance along a path

To introduce Dynamic Programming and the principle of optimality

To explain the use of dynamic programming to find the distance between two sequences

Page 3: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 3

Sequences

Sequential data common in real applications:– DNA analysis in bioinformatics and forensic science

– Sequences of the letters A, G, C and T– Signature recognition biometrics– Words and text

– Spelling and grammar checkers, author verification,..– Speech, music and audio

– Speech/speaker recognition, speech coding and synthesis– Electronic music

– Radar signature recognition…

Page 4: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 4

Retrieving sequential data

Large corpora of sequential data– E.g: DNA databases

Individual sequences may not be amenable to human interpretation

Need for automated sequential data retrieval Need to ‘mine’ sequential data Fundamental requirement is a measure of distance

between two sequences

Page 5: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 5

Basic definitions

In a typical sequence analysis application we have a basic alphabet consisting of N symbols

Examples:– In conventional text A is the set of letters plus

punctuation plus ‘white space’– In bioinformatics, scientists us {A, B, D, F} to describe

DNA sequences

Nn ,...,,...,1

Page 6: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 6

Distance between sequences (1)

Consider two sequences of symbols from the alphabet A = {A, B, C, D}

How similar are the sequences:– S1 = ABCD

– S2 = ABD

Intuitively S2 is obtained from S1 by deleting C

An alternative explanation is that S2 is obtained from S1 by substituting D for C and then deleting D

Page 7: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 7

Distance between sequences (2)

Alternatively S2 was obtained from S1 be deleting ABCD and inserting ABC

… Why is the first explanation intuitively ‘correct’ and

the last one intuitively ‘wrong’? Because we favour the simplest explanation which

involves the minimum number of insertions, deletions and substitutions

Or do we?

Page 8: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 8

Distance between sequences (3) Consider:

– S1 = AABC– S2 = SABC– S3 = PABC– S4 = ASCB

If we know that these sequences were typed then S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard

Similarly S4 is close to S2 because letter-swapping (SA AS etc) is a common typographical error

Page 9: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 9

Alignments

The relationship between two sequences can be expressed as an alignment between their elements

E.G: Insertion

A B B C

A

B

C

Page 10: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 10

Alignment: deletion and substitution

A X C

A

B

C

substitution

A C

A

B

C

deletion

N.B: All edits described relative to vertical string

Page 11: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 11

General alignment pathA C X C C D

A

B

C

D

Which alignment is best?

Page 12: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 12

The Distance Matrix

Suppose that d(A,B) denotes the distance between the alphabet symbols A and B

Examples:– d(A,B) = 0 if A = B, otherwise d(A,B) = 1

– In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B

– In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B

Page 13: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 13

Notation

Suppose we have an alphabet:

A distance matrix for A is an N N matrix

where is the distance between the mth and nth alphabet symbols

Nn ,...,,...,1

NnmDD nm ,1 ,,

nmnm dD ,,

Page 14: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 14

The Accumulated Distance

Consider two sequences:– S1 = ABCD

– S2 = ACXCCD

For any alignment path p between S1 and S2 we define the accumulated distance between S1 and S2, denoted by ADp(S1,S2), to be the sum over all the nodes of p of the corresponding distances between elements of S1 and S2.

Page 15: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 15

Accumulated distance along pA C X C C D

A

B

C

D DDdCCdCCdXCdCCdBAdAAdSSADp ,,,,,,,),( 21

Path p

Page 16: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 16

Accumulated distance (continued)A C X C C D

A

B

C

D XCdBAdSSADp ,,),( 21

XCdKKKBAdKSSAD INSINSINSDELp ,,),( 21

KDEL

KINS KINS KINS

Page 17: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 17

Optimal path

The optimal path is the path with minimum accumulated distance

Formally the optimal path is

where:

The accumulated distance AD(S1,S2) between S1 and S2 is given by:

2121ˆ21 ,max,or ,,maxargˆ SSADSSADSSADp pp

ppp

21ˆ21 ,, SSADSSAD p

Page 18: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 18

Calculating the optimal path

Given – the distance matrix D, – the insertion penalty KINS , and – the deletion penalty KDEL

*

How can we compute the optimal path between two (potentially long) sequences S1 and S2?

In a real application the lengths of S1 and S2 might be large

*If KDEL and KINS are not defined you should assume that they are zero

Page 19: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 19

Dynamic Programming (DP)

The optimal path is calculated using Dynamic Programming (DP) DP is based on the principle of optimality

• Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of:

• (i) the best path from X to A plus the cost of going from A to Y

• (ii) the best path from X to A plus the cost of going from B to Y

X Y

A

B

Page 20: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 20

DP – step 1

Step 1: draw the trellis of all possible paths

A C X C C D

A

B

C

D

Page 21: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 21

DP – forward pass – initialisation

A C X C C D

A

B

C

D

(Forward) path matrix

d(A,A)

(Forward) accumulated distance matrix

ad(2,1)=ad(1,1)+d(2,1)

Page 22: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 22

ad(i,j)

ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j)

It is calculated using the principle of optimality

In addition, the forward path matrix records the local optimal path at each point

jid

Kjiad

jiad

Kjiad

jiad

INS

DEL

,

1,

1,1

,1

min,

(i-1,j)

(i,j-1)

(i-1,j-1)

Page 23: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 23

DP – forward pass – continued

A C X C C D

A

B

C

D

(Forward) path matrix

(Forward) accumulated distance matrix

ad(1,2) = ad(1,1)+d(A,C)

Page 24: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 24

DP – forward pass – continued

A C X C C D

A

B

C

D

(Forward) path matrix

(Forward) accumulated distance matrix

CCd

Kad

ad

Kad

ad

INS

DEL

,

2,3

2,2

3,2

min3,3

Page 25: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 25

DP – forward pass – continued

A C X C C D

A

B

C

D

(Forward) path matrix

(Forward) accumulated distance matrix

CCd

Kad

ad

Kad

adSSAD

INS

DEL

,

5,4

5,3

6,3

min6,4, 21

Page 26: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 26

DP – forward pass – continued

A C X C C D

A

B

C

D

(Forward) path matrix

(Forward) accumulated distance matrix

Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand

corner

Page 27: Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.

EE3J2 Data MiningSlide 27

Summary

Introduction to sequence analysis Definitions of:

– distance,

– forward accumulated distance,

– forward path matrix,

– optimal path

Dynamic programming How to computed the accumulated distance using DP How to recover the optimal path