Position Weight Matrices for Representing Signals in Sequences Triinu Tasa, Koke 04.02.05.


Page 2: Definitions

• Sequence, string – ordered arrangement of letters from the alphabet {'A', 'C', 'G', 'T'}

• Pattern – simplified regular expression over the alphabet {'A', 'C', 'G', 'T', '.'}, where '.' is a wild-card of length 1 (it matches 'A', 'C', 'G' or 'T')
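A minimal sketch of the pattern definition, assuming a pattern and a sequence are compared position by position. The function name `matches` is illustrative and does not come from the slides.

```python
def matches(pattern: str, sequence: str) -> bool:
    """Return True if `sequence` is an instance of `pattern`.

    '.' matches exactly one letter of the DNA alphabet; every other
    pattern symbol must equal the sequence letter at that position.
    """
    alphabet = set("ACGT")
    if len(pattern) != len(sequence):
        return False
    return all(
        s in alphabet and (p == "." or p == s)
        for p, s in zip(pattern, sequence)
    )


print(matches("G.GATGAG.T", "GAGATGAGAT"))  # True
print(matches("G.GATGAG.T", "GAGATGAGA"))   # False (lengths differ)
```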

Page 3: What is a weight matrix?

GATGAG

GATGAT

TGATAT

GATGAT

or

[GT][AG][TA][GT]A[GT]

Page 4: What is a weight matrix?

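Better: represent the sequences

GATGAG
GATGAT
TGATAT

by an alignment matrix C (number of occurrences of each letter at each position):

A  0    2    1    0    3    0
C  0    0    0    0    0    0
G  2    1    0    2    0    1
T  1    0    2    1    0    2

or by a frequency matrix F (C divided by the number of sequences, rounded to one decimal):

A  0    0.7  0.3  0    1    0
C  0    0    0    0    0    0
G  0.7  0.3  0    0.7  0    0.3
T  0.3  0    0.7  0.3  0    0.7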
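Both matrices can be computed directly from the aligned sequences. The following is a minimal sketch; the function names and the dictionary layout (one row per letter) are illustrative choices, not taken from the slides.

```python
ALPHABET = "ACGT"


def count_matrix(seqs):
    """Alignment matrix C: C[letter][j] = occurrences of letter in column j."""
    length = len(seqs[0])
    assert all(len(s) == length for s in seqs), "sequences must be aligned"
    return {a: [sum(s[j] == a for s in seqs) for j in range(length)]
            for a in ALPHABET}


def frequency_matrix(counts, n):
    """Frequency matrix F: F[letter][j] = C[letter][j] / N."""
    return {a: [c / n for c in row] for a, row in counts.items()}


seqs = ["GATGAG", "GATGAT", "TGATAT"]
C = count_matrix(seqs)
F = frequency_matrix(C, len(seqs))
for a in ALPHABET:
    print(a, C[a], [round(f, 1) for f in F[a]])
```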

Page 5: What is a weight matrix?

Or weight matrix W:

$$w_{i,j} = \ln\frac{c_{i,j} + p_i}{(N+1)\,p_i} \approx \ln\frac{f_{i,j}}{p_i}$$

where

N – number of sequences used

p_i – a priori probability of letter i

A  1.09  1.10  0.51  1.09  1.47  1.09
C  1.09  1.09  1.09  1.09  1.09  1.09
G  1.10  0.51  1.09  1.10  1.09  0.51
T  0.51  1.09  1.10  0.51  1.09  1.10
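The formula for W can be evaluated as in the sketch below. It assumes a uniform a priori probability p_i = 0.25 for every letter, which the slides do not state, so the printed numbers need not coincide with the table above.

```python
import math

ALPHABET = "ACGT"


def weight_matrix(counts, n, prior=None):
    """w[i][j] = ln((c[i][j] + p_i) / ((N + 1) * p_i))."""
    if prior is None:
        prior = {a: 0.25 for a in ALPHABET}  # assumption: uniform background
    return {a: [math.log((c + prior[a]) / ((n + 1) * prior[a]))
                for c in counts[a]]
            for a in ALPHABET}


# Alignment matrix C of GATGAG, GATGAT, TGATAT (N = 3).
C = {
    "A": [0, 2, 1, 0, 3, 0],
    "C": [0, 0, 0, 0, 0, 0],
    "G": [2, 1, 0, 2, 0, 1],
    "T": [1, 0, 2, 1, 0, 2],
}
W = weight_matrix(C, n=3)
for a in ALPHABET:
    print(a, [round(w, 2) for w in W[a]])
```

With this choice of prior, a letter never seen in a column gets the weight ln(0.25), and the weight grows with the count.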

Page 6: What is a weight matrix?

Importance matrix I:

$$I(i, j) = c_{i,j} \cdot f_{i,j}$$

A  0    1.4  0.3  0    3    0
C  0    0    0    0    0    0
G  1.4  0.3  0    1.4  0    0.3
T  0.3  0    1.4  0.3  0    1.4
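A minimal sketch of the importance matrix for the same three sequences, assuming f is the exact ratio c / N. With exact frequencies the entries for a count of 2 and 1 come out as 1.33 and 0.33; the values 1.4 and 0.3 in the table correspond to multiplying by the rounded frequencies 0.7 and 0.3.

```python
ALPHABET = "ACGT"


def importance_matrix(counts, n):
    """I[i][j] = c[i][j] * f[i][j], with f[i][j] = c[i][j] / N."""
    return {a: [c * (c / n) for c in counts[a]] for a in ALPHABET}


# Alignment matrix C of GATGAG, GATGAT, TGATAT (N = 3).
C = {
    "A": [0, 2, 1, 0, 3, 0],
    "C": [0, 0, 0, 0, 0, 0],
    "G": [2, 1, 0, 2, 0, 1],
    "T": [1, 0, 2, 1, 0, 2],
}
I = importance_matrix(C, n=3)
for a in ALPHABET:
    print(a, [round(x, 2) for x in I[a]])
```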

Page 7: Applications

• Pattern clustering

1. G.GATGAG.T 62/75 1:39/49 2:23/26 R:17.3026 BP:1.12008e-37

2. G.GATGAG 89/110 1:45/60 2:44/50 R:10.436 BP:1.61764e-34

3. GATGAG.T 124/148 1:52/70 2:72/78 R:7.36961 BP:2.79148e-33

4. TG.AAA.TTT 132/145 1:53/61 2:79/84 R:6.84578 BP:1.83509e-32

5. AAAATTTT 200/231 1:63/77 2:137/154 R:4.69239 BP:1.19109e-30

6. TGAAAA.TTT 104/114 1:45/53 2:59/61 R:7.78277 BP:3.86086e-29

7. AAA.TTTT 343/537 1:79/145 2:264/392 R:3.05349 BP:5.66833e-29

8. G.AAA.TTTT 135/156 1:51/62 2:84/94 R:6.19534 BP:5.69933e-29

9. TG.GATGAG 49/57 1:30/35 2:19/22 R:16.1117 BP:9.35765e-28

10. TG.AAA.TTTT 86/91 1:40/43 2:46/48 R:8.87311 BP:1.1124e-27

...

Page 8: Applications - Clustering

G.GATGAG.T:

GAGATGAGAT

GTGATGAGAT

GAGATGAGGT

...

A -6.9 0.98 -6.9 1.38 -6.9 -6.9 1.38 -6.9 0.98 -6.9

C -6.9 -6.9 -6.9 -6.9 -6.9 -6.9 -6.9 -6.9 -6.9 -6.9

G 1.38 -6.9 1.38 -6.9 -6.9 1.38 -6.9 1.38 0.29 -6.9

T -6.9 0.29 -6.9 -6.9 1.38 -6.9 -6.9 -6.9 -6.9 1.38

Page 9: Applications - Clustering

Compare matrices with each other using the dynamic programming approach:

$$D(i,j) = \min \begin{cases} D(i-1,j) + \mathrm{deletion\_cost}, \\ D(i,j-1) + \mathrm{insertion\_cost}, \\ D(i-1,j-1) + d(i,j) \end{cases}$$

$$d(i,j) = d(A_i, B_j) = \sum_{k=1}^{m} (A_{k,i} - B_{k,j})^2$$

where

A, B – matrices

i, j – columns

If D(m,n) > threshold => matrices are different
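The comparison can be sketched as follows. The border values D(i,0) and D(0,j), the cost values of 1.0, and the helper `pattern_columns` are assumptions made for illustration; the slides give only the recurrence and the column distance.

```python
def column_distance(col_a, col_b):
    """d(A_i, B_j): sum of squared differences between two columns."""
    return sum((a - b) ** 2 for a, b in zip(col_a, col_b))


def matrix_distance(A, B, deletion_cost=1.0, insertion_cost=1.0):
    """D(m, n) for two matrices given as lists of columns."""
    m, n = len(A), len(B)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    # Assumption: skipping leading columns costs one deletion/insertion each.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + deletion_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + insertion_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + deletion_cost,
                D[i][j - 1] + insertion_cost,
                D[i - 1][j - 1] + column_distance(A[i - 1], B[j - 1]),
            )
    return D[m][n]


def pattern_columns(pattern):
    """Hypothetical helper: one (A, C, G, T) frequency column per symbol."""
    return [[0.25] * 4 if ch == "." else [float(ch == a) for a in "ACGT"]
            for ch in pattern]


A = pattern_columns("G.GATGAG")
B = pattern_columns("GATGAG.T")
print(matrix_distance(A, A))  # 0.0
print(matrix_distance(A, B))
```

Two matrices would then be placed in different clusters when the returned distance exceeds a chosen threshold.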

Page 10: Applications - Clustering

G.GATGAG.T    TG.AAA.TTT     AAAATTTT
G.GATGAG      TGAAAA.TTT     AAA.TTTT
GATGAG.T      TG.AAA.TTTT

We want to represent the clusters by logos:

We need to align the patterns first – position the similar parts of the patterns above each other:

G.GATGAG.T

G.GATGAG--

--GATGAG.T

or the logo will look like this:

Page 11: Applications – Multiple alignment

• Multiple alignment

Importance matrix I – represents the aligned patterns.

Example:

G.GATGAG.T

GATGAG.T

G.GATGAG

1. Insert the first pattern into I: ('.' gives 0.25 to each)

A 0 0.25 0 1 0 0 1 0 0.25 0

C 0 0.25 0 0 0 0 0 0 0.25 0

G 1 0.25 1 0 0 1 0 1 0.25 0

T 0 0.25 0 0 1 0 0 0 0.25 1

2. Align the second pattern with I using a dynamic programming approach:

$$\forall i, j:\quad v(i,0) = 0,\quad v(0,j) = 0$$

$$i > 0,\ j > 0:\quad v(i,j) = \max \begin{cases} 0, & I_{S_i,j} = 0 \\ v(i-1,j-1) + I_{S_i,j}, & I_{S_i,j} > 0 \end{cases}$$

Page 12: Applications – Multiple alignment

Dynamic programming matrix:

          G    .    G    A    T    G    A    G    .    T
G 0.00 0.10 0.01 0.10 0.00 0.00 0.10 0.00 0.10 0.01 0.00
A 0.00 0.00 0.11 0.00 0.20 0.00 0.00 0.20 0.00 0.11 0.00
T 0.00 0.00 0.01 0.00 0.00 0.30 0.00 0.00 0.00 0.01 0.21
G 0.00 0.10 0.01 0.11 0.00 0.00 0.40 0.00 0.10 0.01 0.00
A 0.00 0.00 0.11 0.00 0.21 0.00 0.00 0.50 0.00 0.11 0.00
G 0.00 0.10 0.01 0.21 0.00 0.00 0.10 0.00 0.60 0.01 0.00
. 0.00 0.00 0.10 0.01 0.21 0.00 0.00 0.10 0.00 0.60 0.01
T 0.00 0.00 0.01 0.00 0.00 0.31 0.00 0.00 0.00 0.01 0.70

G.GATGAG.T

--GATGAG.T
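The alignment of a pattern S against the matrix I can be sketched as below, where S_i is read as the i-th symbol of the pattern. Two points are assumptions: a '.' in the pattern is taken to carry the diagonal score unchanged, and the matrix entries 1 and 0.25 from step 1 are used directly as scores, so the absolute values differ from the table above, whose scaling is not given.

```python
ALPHABET = "ACGT"


def pattern_to_matrix(pattern):
    """Step 1: a letter gives 1 to its own row, '.' gives 0.25 to each row."""
    return {a: [0.25 if ch == "." else float(ch == a) for ch in pattern]
            for a in ALPHABET}


def align_pattern(I, pattern):
    """Fill v(i, j) and return (v, offset), offset = best column shift."""
    n = len(I["A"])
    m = len(pattern)
    v = [[0.0] * (n + 1) for _ in range(m + 1)]  # v(i, 0) = v(0, j) = 0
    best, best_cell = 0.0, (0, 0)
    for i in range(1, m + 1):
        symbol = pattern[i - 1]
        for j in range(1, n + 1):
            if symbol == ".":
                # Assumption: a wildcard in the pattern keeps the run going.
                v[i][j] = v[i - 1][j - 1]
            else:
                score = I[symbol][j - 1]
                v[i][j] = v[i - 1][j - 1] + score if score > 0 else 0.0
            if v[i][j] > best:
                best, best_cell = v[i][j], (i, j)
    i, j = best_cell
    return v, j - i


I = pattern_to_matrix("G.GATGAG.T")
pattern = "GATGAG.T"
v, offset = align_pattern(I, pattern)
aligned = "-" * offset + pattern
aligned += "-" * (len(I["A"]) - len(aligned))
print(aligned)  # --GATGAG.T
```

The largest cell lies at the end of the best diagonal, and the difference between its column and row index gives the number of gap symbols to put in front of the pattern. A negative offset would mean that columns have to be added on the left of I, which this sketch does not handle.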

Page 13: Applications – Multiple alignment

3. Add the pattern '--GATGAG.T' to I, if necessary add columns to the matrix.

4. Repeat the procedure for every pattern.

Output:

G.GATGAG.T
G.GATGAG--
--GATGAG.T
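Step 3 can be sketched as follows, assuming that each aligned pattern simply adds its per-column contributions (1 for a letter, 0.25 per row for '.', nothing for '-') to the matrix. How these accumulated values are turned into the final importance values is not spelled out here, so the sketch stops at accumulation, and it only adds columns on the right.

```python
ALPHABET = "ACGT"


def add_aligned_pattern(I, aligned):
    """Add one aligned pattern to I in place; '-' marks a gap."""
    missing = len(aligned) - len(I["A"])
    for a in ALPHABET:
        I[a].extend([0.0] * missing)  # no-op when no new columns are needed
    for j, ch in enumerate(aligned):
        if ch == "-":
            continue
        for a in ALPHABET:
            I[a][j] += 0.25 if ch == "." else float(ch == a)


I = {a: [0.0] * 10 for a in ALPHABET}
for aligned in ["G.GATGAG.T", "G.GATGAG--", "--GATGAG.T"]:
    add_aligned_pattern(I, aligned)
for a in ALPHABET:
    print(a, I[a])
```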

Why importance matrix?

Page 14: Applications – Multiple alignment

Example:

Pattern: GATG

So far aligned:

GATGATGTA-
---GATGTGG

We want: w(G, 4) > w(G, 1) > w(G, 9)

Solution – importance matrix
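A small check of why the product c · f gives the desired ordering. It assumes the two aligned rows are GATGATGTA- and ---GATGTGG and that the frequency in a column is taken over its non-gap letters only; both are readings of the slide rather than statements from it.

```python
rows = ["GATGATGTA-", "---GATGTGG"]


def column_stats(rows, letter, col):
    """Return (count, frequency, importance) of `letter` in 1-based column `col`."""
    column = [r[col - 1] for r in rows if r[col - 1] != "-"]  # skip gaps
    c = column.count(letter)
    f = c / len(column) if column else 0.0
    return c, f, c * f


for col in (4, 1, 9):
    print(col, column_stats(rows, "G", col))
# 4 (2, 1.0, 2.0)
# 1 (1, 1.0, 1.0)
# 9 (1, 0.5, 0.5)
```

Counts alone cannot separate columns 1 and 9, and frequencies alone cannot separate columns 4 and 1, while the product orders all three.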

Page 15: Applications – Weight matrix matching

● Weight Matrix Matching

Purpose: find the sequences that the weight matrix describes best in a given text file

...CATAGGAAATTCCACCTCTTTGGCTTTGCCCAGTCTTCCCTTGAGGATGCCTACGTTC...

1. Calculate the score for each position

2. If score > threshold => signal

Problem: finding a good threshold

● Threshold – 99.5% quantile

A  1.09  1.10  0.51  1.09  1.47  1.09
C  1.09  1.09  1.09  1.09  1.09  1.09
G  1.10  0.51  1.09  1.10  1.09  0.51
T  0.51  1.09  1.10  0.51  1.09  1.10
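The matching procedure can be sketched as below. Several choices are assumptions for illustration: the weight matrix is rebuilt from the three example sequences with a uniform prior of 0.25, the 99.5% quantile is taken over the scores of all 4^6 possible windows (a uniform background), and the quantile uses the nearest-rank definition.

```python
import itertools
import math

ALPHABET = "ACGT"


def weight_matrix(seqs, prior=0.25):
    """w[a][j] = ln((c[a][j] + p) / ((N + 1) * p)), uniform prior assumed."""
    n, width = len(seqs), len(seqs[0])
    return {a: [math.log((sum(s[j] == a for s in seqs) + prior)
                         / ((n + 1) * prior))
                for j in range(width)]
            for a in ALPHABET}


def score(W, window):
    """Score of one window: sum of the weights of its letters."""
    return sum(W[ch][j] for j, ch in enumerate(window))


def quantile(values, q):
    """Nearest-rank empirical quantile (one of several common definitions)."""
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q * len(ordered))) - 1]


def find_signals(W, text, threshold):
    """Start indices of all windows whose score exceeds the threshold."""
    width = len(W["A"])
    return [k for k in range(len(text) - width + 1)
            if score(W, text[k:k + width]) > threshold]


W = weight_matrix(["GATGAG", "GATGAT", "TGATAT"])
# Assumption: background = all 4**6 possible windows, equally likely.
background = [score(W, "".join(w))
              for w in itertools.product(ALPHABET, repeat=6)]
threshold = quantile(background, 0.995)

text = "CATAGGAAATTCCACCTCTTTGGCTTTGCCCAGTCTTCCCTTGAGGATGCCTACGTTC"
print(find_signals(W, text, threshold))
print(find_signals(W, "CCGATGATCC", threshold))
```

A stricter or looser quantile changes how many windows are reported, which is the threshold problem named above.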

Page 16: Questions?