©CMBI 2001 Alignment Most alignment programs create an alignment that represents what happened...
©CMBI 2001 Alignment Most alignment programs create an alignment that represents what happened during evolution at the DNA level. To carry over information from a well studied to a newly determined sequence, we need an alignment that represents the protein structures of today.
  • date post

    24-Jan-2016
©CMBI 2001

Alignment

Most alignment programs create an alignment that represents what happened during evolution at the DNA level.

To carry over information from a well studied to a newly determined sequence, we need an alignment that represents the protein structures of today.

The amino acids

Most information that enters the alignment procedure comes from the physicochemical properties of the amino acids. Example: which is the better alignment (left or right)?

Left:
CPISRTWASIFRCW
CPISRT---LFRCW

Right:
CPISRTWASIFRCW
CPISRTL---FRCW
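As a rough illustration of how physicochemical properties can decide between the left and right alignments, the sketch below scores both with a toy scheme. The residue groups and the score values (2 for identity, 1 for same group, 0 otherwise, -1 per gap column) are assumptions made for this example only; a real program would use a PAM-like matrix and proper gap penalties.

```python
# Toy residue groups (an assumption for illustration, not from the slides).
GROUPS = [set("AVILM"), set("FWY"), set("DE"), set("KRH"),
          set("STNQ"), set("C"), set("G"), set("P")]


def pair_score(a, b):
    """Score one alignment column with the toy scheme."""
    if a == "-" or b == "-":
        return -1          # gap column
    if a == b:
        return 2           # identical residues
    if any(a in group and b in group for group in GROUPS):
        return 1           # different residues with similar properties
    return 0               # unrelated residues


def alignment_score(top, bottom):
    """Sum the column scores of two equally long aligned strings."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(pair_score(a, b) for a, b in zip(top, bottom))


left = alignment_score("CPISRTWASIFRCW", "CPISRT---LFRCW")
right = alignment_score("CPISRTWASIFRCW", "CPISRTL---FRCW")
print("left:", left, "right:", right)
```

Under these assumed scores the left alignment wins, because it pairs L with the similar I, while the right one pairs L with the aromatic W.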

A difficult alignment problem

AYAYAYAYSY

LGLPLPLPLP

So, in an alignment of more than 2 sequences you can find more information than in just the 2 sequences you are interested in. How do we make these multi-sequence alignments?

AGAPAPAPSP

A difficult alignment problem solved

AYAYAYAYSY
AGAPAPAPSP
LGLPLPLPLP

Alignment order

MIESAYTDSW
QFEKSYVTDY

-MIESAYTDSW
QFEKSYVTDY-

Alignment order

MIESAYTDSW
QFEKSYVTDY
QWERTYASNF

-MIESAYTDSW
QFEKSYVTDY-
QWERTYASNF-

Conclusion

Align first the sequences that look very much like each other.

So you ‘build up information’ while generating those alignments that most likely are correct.

Alignment order

In order to know which sequences look most like each other, you need to do all pairwise alignments first.

This is exactly what CLUSTAL does.

CLUSTAL builds a tree while doing the build-up of the multiple sequence alignment.

MSA and trees

Take, for example, the three sequences:

1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR

and you see immediately that 2 and 3 are close, while 1 is further away. So the tree will look roughly like:

[Figure: tree with leaves 3, 2 and 1, in which 2 and 3 are joined first and 1 is attached last]

Aligning sequences; start with distances

     A    B    C    D    E
A    0    6    9   11    9
B    6    0    7    9    7
C    9    7    0    8    6
D   11    9    8    0    4
E    9    7    6    4    0

Matrix of pair-wise distances between five sequences.

D and E are the closest pair. Take them, and collapse the matrix by one row/column. The merged DE row gets the distances 10, 8 and 7 to A, B and C.

[Figure: partial tree in which D and E are joined]

Aligning sequences

     A    B    C   DE
A    0    6    9   10
B    6    0    7    8
C    9    7    0    7
DE  10    8    7    0

[Figure: partial tree in which D and E are joined, and A and B are joined]

Aligning sequences

     AB    C   DE
AB    0    8    9
C     8    0    7
DE    9    7    0

[Figure: partial tree in which D and E are joined, C is attached to DE, and A and B are joined]

Aligning sequences

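      AB   CDE
AB     0   8.5
CDE  8.5     0

[Figure: complete tree in which D and E are joined, C is attached to DE, A and B are joined, and the AB and CDE branches are joined last]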
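The matrix collapsing shown in these slides can be sketched in a few lines. The numbers in the slides (10, 8, 7, then 8 and 9, then 8.5) are consistent with simply averaging the two merged rows, which is the WPGMA flavour of clustering; that reading is an inference from the numbers, not something the slides state. The function name `wpgma_merge_order` is made up for this sketch, and real CLUSTAL builds its guide tree differently in detail.

```python
def wpgma_merge_order(labels, dist):
    """Repeatedly merge the closest pair of clusters.

    labels: list of cluster names.
    dist:   dict mapping frozenset({a, b}) to the distance between a and b.
    The distance from a merged cluster to any other cluster is the plain
    average of the two merged rows (WPGMA-style averaging).
    Returns a list of (cluster_a, cluster_b, distance) in merge order.
    """
    labels = list(labels)
    dist = dict(dist)
    merges = []
    while len(labels) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(
            ((x, y) for i, x in enumerate(labels) for y in labels[i + 1:]),
            key=lambda pair: dist[frozenset(pair)],
        )
        merges.append((a, b, dist[frozenset((a, b))]))
        merged = a + b
        rest = [c for c in labels if c not in (a, b)]
        # Collapse the matrix by one row/column.
        for c in rest:
            dist[frozenset((merged, c))] = (
                dist[frozenset((a, c))] + dist[frozenset((b, c))]
            ) / 2
        labels = rest + [merged]
    return merges


names = ["A", "B", "C", "D", "E"]
matrix = [
    [0, 6, 9, 11, 9],
    [6, 0, 7, 9, 7],
    [9, 7, 0, 8, 6],
    [11, 9, 8, 0, 4],
    [9, 7, 6, 4, 0],
]
dist = {
    frozenset((names[i], names[j])): matrix[i][j]
    for i in range(len(names))
    for j in range(i + 1, len(names))
}
merges = wpgma_merge_order(names, dist)
for a, b, d in merges:
    print(f"join {a} and {b} at distance {d}")
```

The merge order it reports (D with E, A with B, C with DE, then AB with CDE) is the order in which the tree in the slides grows.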

Back to the alignment

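1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR

Actually I cheated. 1 is closer to 3 than to 2 because of the A at position 1. How can we express this in the tree? For example:

[Figure: two drawings of the same tree with leaves 1, 2 and 3; in the second the branches of the 2-3 pair are flipped]

I will call this tree-flipping.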

Can we generalize tree-flipping?

To generalize tree flipping, sequences must be placed ‘distance-correct’ in 1 dimension:

2 3 1

And then connect them, as we did before:

[Figure: tree drawn over the leaves in the order 2, 3, 1]

So, now most info sits in the horizontal dimension. Can we use the vertical dimension usefully?

The problem is actually bigger

1 ASWTFGHK
2 GTWSFANR
3 ATWAFADR

d(i,j) is the distance between sequences i and j.

d(1,2)=6; d(1,3)=5; d(2,3)=3.
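These three values are what you get when the distance is taken to be the number of mismatching positions between two equally long, ungapped sequences. A minimal sketch, assuming that simple distance measure (the function name `mismatch_distance` is made up for the example):

```python
from itertools import combinations

sequences = {1: "ASWTFGHK", 2: "GTWSFANR", 3: "ATWAFADR"}


def mismatch_distance(a, b):
    """Count the positions at which two equally long sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


# All pair-wise distances, keyed by the pair of sequence numbers.
d = {
    (i, j): mismatch_distance(sequences[i], sequences[j])
    for i, j in combinations(sorted(sequences), 2)
}
for (i, j), value in d.items():
    print(f"d({i},{j}) = {value}")
```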

So a perfect representation would be:

[Figure: sequences 1, 3 and 2 placed so that all three pair-wise distances are drawn correctly]

But what if a 4th sequence is added with d(1,4)=4, d(2,4)=5, d(3,4)=4? Where would that sequence sit?
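One way to explore this question is to try to place the four sequences as points with exactly these distances. The sketch below puts sequence 1 at the origin and sequence 2 on the x-axis, fixes sequence 3 in the plane, and then works out where sequence 4 has to go. Treating the distances as ordinary Euclidean distances is an assumption made for the illustration.

```python
import math

# Distances from the slide, plus those of the added 4th sequence.
d12, d13, d23 = 6.0, 5.0, 3.0
d14, d24, d34 = 4.0, 5.0, 4.0

# Sequence 1 at the origin, sequence 2 on the x-axis.
p1 = (0.0, 0.0, 0.0)
p2 = (d12, 0.0, 0.0)

# Sequence 3 is fixed in the xy-plane by its distances to 1 and 2.
x3 = (d13**2 - d23**2 + d12**2) / (2 * d12)
y3 = math.sqrt(d13**2 - x3**2)
p3 = (x3, y3, 0.0)

# Sequence 4: x follows from the distances to 1 and 2, y from the distance
# to 3, and whatever squared distance is left over has to go into z.
x4 = (d14**2 - d24**2 + d12**2) / (2 * d12)
y4 = (d14**2 + d13**2 - d34**2 - 2 * x3 * x4) / (2 * y3)
z4_squared = d14**2 - x4**2 - y4**2
z4 = math.sqrt(z4_squared)
p4 = (x4, y4, z4)

# For comparison: force sequence 4 into the plane, keeping d(1,4) and d(2,4)
# exact, and see what d(3,4) becomes for the two mirror positions.
flat_y = math.sqrt(d14**2 - x4**2)
flat_d34 = [math.dist((x4, sign * flat_y), (x3, y3)) for sign in (1, -1)]

print("height of sequence 4 above the plane:", round(z4, 3))
print("d(3,4) if forced into the plane:", [round(v, 3) for v in flat_d34])
```

Sequence 4 ends up well above the plane of the other three, so four sequences already need a third dimension to keep all pair-wise distances exact.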

So, nice tree, but what did we actually do?

1) We determined a distance measure
2) We measured all pair-wise distances
3) We reduced the dimensionality of the space of the problem
4) We used an algorithm to visualize

In a way, we projected the hyperspace in which we can perfectly describe all pair-wise distances onto a 1-dimensional line.

What does this sentence mean?

Projection

[Figures: four map projections]

Fuller projection; Unfolded Dymaxion map
Gnomonic projection: Correct distances
Political projection
Mercator projection

Source: Wikipedia

Back to sequences:

ASASDFDFGHKMGHS 1
ASASDFDFRRRLRHS 2
ASASDFDFRRRLRIT 5
ASLPDFLPGHSIGHS 3
ASLPDFLPGHSIGIT 6
ASLPDFLPRRRVRIT 3

The more dimensions we retain, the less information we lose. The tree is now in 3D…

Projection to visualize clusters

We want to reduce the dimensionality with minimal distortion of the pair-wise distances. One way is Eigenvector determination, or PCA.

PCA to the rescue

Now we have made the data one-dimensional, while the second, vertical, dimension is noise. If we did this correctly, we kept as much data as possible.
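What "keeping as much data as possible" means can be made concrete with a tiny two-dimensional example. The points below are invented for the illustration and do not come from the slides; for a 2x2 covariance matrix the eigenvalues and the main axis can be written in closed form, so no linear algebra library is needed.

```python
import math

# Hypothetical 2D points lying roughly along a line (not from the slides).
points = [(1.0, 1.2), (2.0, 1.9), (3.0, 3.1), (4.0, 3.8), (5.0, 5.1)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n
centered = [(x - mean_x, y - mean_y) for x, y in points]

# Covariance matrix [[a, b], [b, c]] of the centered points.
a = sum(x * x for x, _ in centered) / (n - 1)
b = sum(x * y for x, y in centered) / (n - 1)
c = sum(y * y for _, y in centered) / (n - 1)

# Eigenvalues of a symmetric 2x2 matrix in closed form.
half_trace = (a + c) / 2
radius = math.hypot((a - c) / 2, b)
lambda_1 = half_trace + radius   # variance along the main axis
lambda_2 = half_trace - radius   # variance left in the discarded axis

# Direction of the main axis (the first eigenvector).
theta = 0.5 * math.atan2(2 * b, a - c)
axis = (math.cos(theta), math.sin(theta))

# One-dimensional coordinates: projection of every point on the main axis.
projected = [x * axis[0] + y * axis[1] for x, y in centered]
explained = lambda_1 / (lambda_1 + lambda_2)

print("1D coordinates:", [round(p, 3) for p in projected])
print("fraction of variance kept:", round(explained, 4))
```

The fraction of variance kept is the share of the spread in the data that survives the projection; the remainder is what gets treated as noise.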

Back to sequences:

If we have N sequences, we can only draw their distance matrix in an N-1 dimensional space. By the time it is a tree, how many dimensions, and how much information, have we lost?

Perhaps we should cluster in a different way?

Cluster on critical residues?

QWERTYAKDFGRGH
AWTRTYAKDFGRPM
SWTRTNMKDTHRKC
QWGRTNMKDTHRVW

Gray = conserved
Red = variable
Green = correlated

Conclusions from correlated residues

Other algorithms

Multi-sequence alignment can also be done with an iterative ‘profile’ alignment.

A) Make an alignment of few, well-aligned sequences

B) Align all sequences using this profile

1. What is a profile?

Normally, we use a PAM-like matrix to determine the score for each possible match in an alignment.

This assumes that all matches between I <-> E are the same. But they aren’t.

2. What is a profile?

QWERTYIPASEF
QWEKSFIPGSEY
NWERTMVPVSEM
QFEKTYLPSSEY
NFIKTLMPATEF
QYIRSLIPAGEM
NYIQSLIPSTEL
QFIRSLFPSSEI
  1   2   3

At 1, E and I are both OK.
At 2, I is OK, but E surely not.
At 3, E is OK, but I surely not.

3. What is a profile?

The knowledge about which residue types are good at a certain position in the multiple sequence alignment can be expressed in a profile.

A profile holds for each position 20 scores for the 20 residue types, and sometimes also two values for position specific gap open and gap elongation penalties.
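A minimal sketch of such a profile, built from the eight-sequence alignment shown earlier. As a simplification, each of the 20 scores here is just the observed frequency of that residue type in the column; real profiles usually combine the counts with a substitution matrix and add gap penalties, and the helper name `build_profile` is made up for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

alignment = [
    "QWERTYIPASEF",
    "QWEKSFIPGSEY",
    "NWERTMVPVSEM",
    "QFEKTYLPSSEY",
    "NFIKTLMPATEF",
    "QYIRSLIPAGEM",
    "NYIQSLIPSTEL",
    "QFIRSLFPSSEI",
]


def build_profile(seqs):
    """Return one dict per alignment column with 20 frequency scores."""
    profile = []
    for column in zip(*seqs):
        counts = Counter(column)
        profile.append({aa: counts[aa] / len(column) for aa in AMINO_ACIDS})
    return profile


profile = build_profile(alignment)

# The three marked positions are alignment columns 3, 7 and 11.
for marker, column in ((1, 3), (2, 7), (3, 11)):
    scores = profile[column - 1]
    print(f"position {marker}: E = {scores['E']}, I = {scores['I']}")
```

Unlike a single PAM-like matrix, the score for matching an E or an I now depends on the position: both are acceptable at the first marked position, only I at the second, and only E at the third.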

Conserved, variable, or in-between

QWERTYASDFGRGH
QWERTYASDTHRPM
QWERTNMKDFGRKC
QWERTNMKDTHRVW

Gray = conserved
Black = variable
Green = correlated mutations

Correlated mutations determine the tree shape

1 AGASDFDFGHKM
2 AGASDFDFRRRL
3 AGLPDFMNGHSI
4 AGLPDFMNRRRV

Correlation = Information

1, 2 and 5 bind calcium; 3 and 4 don’t. Which residues bind calcium?

  123456789012345
1 ASDFNTDEKLRTTFI
2 ASDFSTDEKLKTTFI
3 LSFFTTDTRLATIYI
4 LSHFLTNLRLATIYI
5 ASDFTTDEKLALTFI

Red has correct correlation, but wrong residue type.
Brown has correct type, but wrong correlation.
Green can be calcium-binders.
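The reasoning on this slide can be sketched as a two-step filter: first find the columns whose residue correlates with calcium binding, then keep only those with a plausible residue type. Two things below are assumptions made for the illustration: a column counts as correlated when all binders share one residue that no non-binder has, and only the acidic residues D and E count as plausible calcium ligands. The colours on the original slide are lost, so the output is not guaranteed to match its green positions.

```python
sequences = {
    1: "ASDFNTDEKLRTTFI",
    2: "ASDFSTDEKLKTTFI",
    3: "LSFFTTDTRLATIYI",
    4: "LSHFLTNLRLATIYI",
    5: "ASDFTTDEKLALTFI",
}
binders = {1, 2, 5}

# Assumption for this sketch: only acidic side chains are plausible ligands.
ASSUMED_LIGAND_TYPES = set("DE")


def correlated_columns(seqs, group):
    """Return 1-based columns where the group shares one residue that is
    absent from every sequence outside the group."""
    length = len(next(iter(seqs.values())))
    hits = []
    for col in range(length):
        inside = {s[col] for key, s in seqs.items() if key in group}
        outside = {s[col] for key, s in seqs.items() if key not in group}
        if len(inside) == 1 and not (inside & outside):
            hits.append(col + 1)
    return hits


correlated = correlated_columns(sequences, binders)
candidates = [
    col for col in correlated
    if sequences[1][col - 1] in ASSUMED_LIGAND_TYPES
]
print("correlated with calcium binding:", correlated)
print("correlated and of an assumed ligand type:", candidates)
```

Position 7 shows why both tests are needed: it holds a D in every binder, but also in non-binder 3, so the residue type is right while the correlation is not.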