Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.
-
date post
19-Dec-2015 -
Category
Documents
-
view
214 -
download
0
Transcript of Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.
EE3J2 Data MiningSlide 1
EE3J2 Data Mining
Lecture 12: Sequence Analysis
Martin Russell
EE3J2 Data MiningSlide 2
Objectives
To motivate the need for sequence analysis To introduce the notions of:
– Alignment path
– Accumulated distance along a path
To introduce Dynamic Programming and the principle of optimality
To explain the use of dynamic programming to find the distance between two sequences
EE3J2 Data MiningSlide 3
Sequences
Sequential data common in real applications:– DNA analysis in bioinformatics and forensic science
– Sequences of the letters A, G, C and T– Signature recognition biometrics– Words and text
– Spelling and grammar checkers, author verification,..– Speech, music and audio
– Speech/speaker recognition, speech coding and synthesis– Electronic music
– Radar signature recognition…
EE3J2 Data MiningSlide 4
Retrieving sequential data
Large corpora of sequential data– E.g: DNA databases
Individual sequences may not be amenable to human interpretation
Need for automated sequential data retrieval Need to ‘mine’ sequential data Fundamental requirement is a measure of distance
between two sequences
EE3J2 Data MiningSlide 5
Basic definitions
In a typical sequence analysis application we have a basic alphabet consisting of N symbols
Examples:– In conventional text A is the set of letters plus
punctuation plus ‘white space’– In bioinformatics, scientists us {A, B, D, F} to describe
DNA sequences
Nn ,...,,...,1
EE3J2 Data MiningSlide 6
Distance between sequences (1)
Consider two sequences of symbols from the alphabet A = {A, B, C, D}
How similar are the sequences:– S1 = ABCD
– S2 = ABD
Intuitively S2 is obtained from S1 by deleting C
An alternative explanation is that S2 is obtained from S1 by substituting D for C and then deleting D
EE3J2 Data MiningSlide 7
Distance between sequences (2)
Alternatively S2 was obtained from S1 be deleting ABCD and inserting ABC
… Why is the first explanation intuitively ‘correct’ and
the last one intuitively ‘wrong’? Because we favour the simplest explanation which
involves the minimum number of insertions, deletions and substitutions
Or do we?
EE3J2 Data MiningSlide 8
Distance between sequences (3) Consider:
– S1 = AABC– S2 = SABC– S3 = PABC– S4 = ASCB
If we know that these sequences were typed then S2 is closer to S1 than S3 is, because A and S are adjacent on a keyboard
Similarly S4 is close to S2 because letter-swapping (SA AS etc) is a common typographical error
EE3J2 Data MiningSlide 9
Alignments
The relationship between two sequences can be expressed as an alignment between their elements
E.G: Insertion
A B B C
A
B
C
EE3J2 Data MiningSlide 10
Alignment: deletion and substitution
A X C
A
B
C
substitution
A C
A
B
C
deletion
N.B: All edits described relative to vertical string
EE3J2 Data MiningSlide 11
General alignment pathA C X C C D
A
B
C
D
Which alignment is best?
EE3J2 Data MiningSlide 12
The Distance Matrix
Suppose that d(A,B) denotes the distance between the alphabet symbols A and B
Examples:– d(A,B) = 0 if A = B, otherwise d(A,B) = 1
– In typing, d(A,B) might indicate how unlikely it is that A would be mistyped as B
– In bioinformatics d(A,B) might indicate how unlikely it is that symbol A in a DNA string is replaced by B
EE3J2 Data MiningSlide 13
Notation
Suppose we have an alphabet:
A distance matrix for A is an N N matrix
where is the distance between the mth and nth alphabet symbols
Nn ,...,,...,1
NnmDD nm ,1 ,,
nmnm dD ,,
EE3J2 Data MiningSlide 14
The Accumulated Distance
Consider two sequences:– S1 = ABCD
– S2 = ACXCCD
For any alignment path p between S1 and S2 we define the accumulated distance between S1 and S2, denoted by ADp(S1,S2), to be the sum over all the nodes of p of the corresponding distances between elements of S1 and S2.
EE3J2 Data MiningSlide 15
Accumulated distance along pA C X C C D
A
B
C
D DDdCCdCCdXCdCCdBAdAAdSSADp ,,,,,,,),( 21
Path p
EE3J2 Data MiningSlide 16
Accumulated distance (continued)A C X C C D
A
B
C
D XCdBAdSSADp ,,),( 21
XCdKKKBAdKSSAD INSINSINSDELp ,,),( 21
KDEL
KINS KINS KINS
EE3J2 Data MiningSlide 17
Optimal path
The optimal path is the path with minimum accumulated distance
Formally the optimal path is
where:
The accumulated distance AD(S1,S2) between S1 and S2 is given by:
p̂
2121ˆ21 ,max,or ,,maxargˆ SSADSSADSSADp pp
ppp
21ˆ21 ,, SSADSSAD p
EE3J2 Data MiningSlide 18
Calculating the optimal path
Given – the distance matrix D, – the insertion penalty KINS , and – the deletion penalty KDEL
*
How can we compute the optimal path between two (potentially long) sequences S1 and S2?
In a real application the lengths of S1 and S2 might be large
*If KDEL and KINS are not defined you should assume that they are zero
EE3J2 Data MiningSlide 19
Dynamic Programming (DP)
The optimal path is calculated using Dynamic Programming (DP) DP is based on the principle of optimality
• Suppose all paths from X to Y must pass through A or B immediately before reaching Y. The optimal path from X to Y is the best of:
• (i) the best path from X to A plus the cost of going from A to Y
• (ii) the best path from X to A plus the cost of going from B to Y
X Y
A
B
EE3J2 Data MiningSlide 20
DP – step 1
Step 1: draw the trellis of all possible paths
A C X C C D
A
B
C
D
EE3J2 Data MiningSlide 21
DP – forward pass – initialisation
A C X C C D
A
B
C
D
(Forward) path matrix
d(A,A)
(Forward) accumulated distance matrix
ad(2,1)=ad(1,1)+d(2,1)
EE3J2 Data MiningSlide 22
ad(i,j)
ad(i,j) is the sum of distances along the best (partial) path from (1,1) to (i,j)
It is calculated using the principle of optimality
In addition, the forward path matrix records the local optimal path at each point
jid
Kjiad
jiad
Kjiad
jiad
INS
DEL
,
1,
1,1
,1
min,
(i-1,j)
(i,j-1)
(i-1,j-1)
EE3J2 Data MiningSlide 23
DP – forward pass – continued
A C X C C D
A
B
C
D
(Forward) path matrix
(Forward) accumulated distance matrix
ad(1,2) = ad(1,1)+d(A,C)
EE3J2 Data MiningSlide 24
DP – forward pass – continued
A C X C C D
A
B
C
D
(Forward) path matrix
(Forward) accumulated distance matrix
CCd
Kad
ad
Kad
ad
INS
DEL
,
2,3
2,2
3,2
min3,3
EE3J2 Data MiningSlide 25
DP – forward pass – continued
A C X C C D
A
B
C
D
(Forward) path matrix
(Forward) accumulated distance matrix
CCd
Kad
ad
Kad
adSSAD
INS
DEL
,
5,4
5,3
6,3
min6,4, 21
EE3J2 Data MiningSlide 26
DP – forward pass – continued
A C X C C D
A
B
C
D
(Forward) path matrix
(Forward) accumulated distance matrix
Optimal path obtained by back-tracking through the forward path matrix, starting at the bottom right-hand
corner
EE3J2 Data MiningSlide 27
Summary
Introduction to sequence analysis Definitions of:
– distance,
– forward accumulated distance,
– forward path matrix,
– optimal path
Dynamic programming How to computed the accumulated distance using DP How to recover the optimal path