
1

Combinatorial optimisation in protein structure prediction and recognition: Background, review, and research direction

Speaker: Vicky Mak

2

What’s in this talk?

What is protein structure prediction and recognition?

Who has done what before?

What’s interesting and hasn’t been done?

Being critical about others’ work is easy. Doing something brilliant is difficult. This talk addresses the easy problem.

3

Combining two Amino acids

[Figure: two amino acids before and after combining. Labels: amino group, carboxyl group, residue, α-carbon, N-terminal, C-terminal.]

4

Protein: polypeptide chain

A polypeptide chain: chain of amino acids linked together by peptide bonds.

Each amino acid is the same except for the residues. There are 20 such amino acids. Different combinations of these 20 amino acids make different proteins.

A protein sequence can contain from tens to thousands of amino acids.

N-terminal C-terminal

5

An example

[Figure labels: α-helix, β-sheet]

Primary structure: the sequence of individual amino acids.

Secondary structure: α-helix and β-sheet.

The green chain defines a tertiary structure. So does the blue chain.

Quaternary structure: green + blue chains.

6

Motivation

Notice: It is the 3-D structures of the proteins that are important (2 different sequences can have exactly the same structure!)

Need to know the “shape” of a protein, so as to develop antibodies that “bind” that shape - Fold prediction.

Antibodies produced against one protein may also work for another protein that “looks similar” - Structure recognition.

7

Structure prediction

8

HP models (ab initio prediction)

Given a sequence of amino acids, determine the structure from scratch.

Hydrophobic-hydrophilic (HP) model proposed by Dill (1985)

Two groups of amino acids: hydrophobic acids (H) and hydrophilic acids (P).

Self-avoiding walks on lattices

Objective: minimise global free energy

Meaning, it’s good to put as many Hydrophobic acids as close together as possible.
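To make the objective concrete, here is a minimal sketch of how a fold could be scored on a 2-D square lattice, assuming the usual convention that a contact is a pair of H acids on adjacent lattice points that are not neighbours in the chain. The function name `hh_contacts` and the coordinate representation are assumptions made for this illustration, not part of the talk.

```python
from typing import List, Tuple

Point = Tuple[int, int]


def hh_contacts(sequence: str, path: List[Point]) -> int:
    """Count H-H contacts of an HP sequence folded along `path`.

    `sequence` is a string over {'H', 'P'}; `path[i]` is the lattice point
    of amino acid i. A contact is a pair of H acids on adjacent lattice
    points that are not consecutive in the chain.
    """
    if len(sequence) != len(path):
        raise ValueError("sequence and path must have the same length")
    if len(set(path)) != len(path):
        raise ValueError("path is not self-avoiding")
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        if abs(x0 - x1) + abs(y0 - y1) != 1:
            raise ValueError("consecutive acids must sit on adjacent points")

    index_at = {point: i for i, point in enumerate(path)}
    contacts = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look only right and up so that every pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return contacts


# Folding HPPH into a unit square brings the two H acids together.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
line = [(0, 0), (1, 0), (2, 0), (3, 0)]
print(hh_contacts("HPPH", square), hh_contacts("HPPH", line))
```

Under the unit-credit model the energy of a fold would simply be minus this count, so minimising energy means maximising contacts.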

9

HP model on lattices: a 2-dimensional example

Hydrophobic acids

Hydrophilic acids

10

HP model on lattices: a 2-dimensional example

Hydrophobic acids

Hydrophilic acids

Fold with 5 hydrophobic contacts

11

Previous work on HP models

Most previous work involves complete enumeration of self-avoiding random walks on various lattices (e.g. Lau and Dill (1989), Irback and Troein (2002)).

Irback and Troein (2002) managed sequences with up to 25 amino acids.

Unger and Moult (1993): hybrid Genetic Algorithm and Simulated Annealing (2-D), sizes 20-64. Optimal for sizes 36, 48, 60. (Optimal?! How do they know?)

Shakhnovich et al. (1991) tried SA on 30 27-acid problems. (Only 1 found the global minimum. Inappropriate local search is to blame.)

Backofen (2001): constraint programming approach, tested on problems of size 27-36; time: 20 min to 1 hr 38 min (optimal).

IP models proposed recently in Greenberg, Hart and Lancia (2002). No numerical results reported as yet. (See pages 1-4 of pdf file)
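The complete-enumeration idea can be sketched in a few lines, which also shows why it stops scaling so early: the number of self-avoiding walks grows exponentially with chain length. This is an illustrative depth-first search on the 2-D square lattice, not a reconstruction of any of the cited implementations; `best_fold` and the trick of fixing the first step to remove rotations are assumptions of this sketch.

```python
from typing import Dict, List, Tuple

Point = Tuple[int, int]
MOVES = ((1, 0), (-1, 0), (0, 1), (0, -1))


def best_fold(sequence: str) -> Tuple[int, List[Point]]:
    """Enumerate all self-avoiding walks and return (max contacts, a best path)."""
    n = len(sequence)
    if n == 0:
        return 0, []
    best_contacts = -1
    best_path: List[Point] = []
    path: List[Point] = [(0, 0)]
    occupied: Dict[Point, int] = {(0, 0): 0}

    def new_contacts(i: int, point: Point) -> int:
        # Contacts gained by placing acid i at `point` (not yet in `occupied`).
        if sequence[i] != "H":
            return 0
        gained = 0
        for dx, dy in MOVES:
            j = occupied.get((point[0] + dx, point[1] + dy))
            # j < i - 1 skips the chain predecessor of acid i.
            if j is not None and j < i - 1 and sequence[j] == "H":
                gained += 1
        return gained

    def extend(i: int, contacts: int) -> None:
        nonlocal best_contacts, best_path
        if i == n:
            if contacts > best_contacts:
                best_contacts, best_path = contacts, list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            if i == 1 and (dx, dy) != (1, 0):
                continue  # fix the first step: rotated folds score the same
            point = (x + dx, y + dy)
            if point in occupied:
                continue  # keep the walk self-avoiding
            gained = new_contacts(i, point)
            occupied[point] = i
            path.append(point)
            extend(i + 1, contacts + gained)
            path.pop()
            del occupied[point]

    extend(1, 0)
    return best_contacts, best_path


print(best_fold("HPPH")[0])
```

Note that the search proves optimality only because it visits every walk; that is exactly what the heuristic approaches above cannot claim.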

12

Problems with IP models

Dealing with symmetry

Methods are suggested in Greenberg, Hart and Lancia (2002) and in Backofen’s PhD thesis.

What about other lattices?

Number of lattice points unnecessarily large.

Lau and Dill (1989) proposed maximal compact chain conformations: lattice walks in which every point is occupied by exactly one amino acid.

E.g. 3x3x3 cubic lattice for a 27-amino-acid chain.

Maybe not that tight, but definitely not n^2. Maybe a union of some of those maximal compact chain conformations.

13

Let’s be critical

Cubic lattices probably not good enough. But it’s a good start anyway.

Faulon, Rintoul and Young (2002) tried 2-D honeycomb, 2-D square, 3-D diamond and 3-D cubic lattices.

Agarwala et al.: triangular lattice (constrained SAW, no optimisation involved).

Use an energy matrix rather than a simple unit credit for each HH interaction? (Different hydrophobicity)

The energy released by putting different pairs of H-acids together differs, and depends on how far apart they are in the sequence! Dill’s HP model is too simplified.

Besides, interactions between H-acids should be defined differently from the Domain and Neighbourhood.
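One way to picture both suggestions, a pair-dependent energy and an interaction neighbourhood that is separate from the lattice neighbourhood, is the sketch below. The function `fold_energy`, the `cutoff` parameter and every energy value are hypothetical placeholders chosen for illustration; none of them come from the cited papers.

```python
import math
from typing import Dict, Sequence, Tuple


def fold_energy(
    sequence: str,
    coords: Sequence[Tuple[float, ...]],
    pair_energy: Dict[Tuple[str, str], float],
    cutoff: float = 1.0,
) -> float:
    """Sum pair energies over non-consecutive acids within `cutoff` of each other.

    `pair_energy` maps a sorted pair of acid types to an energy; missing
    pairs contribute nothing. The values are made up for this example.
    """
    total = 0.0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # skip chain neighbours
            if math.dist(coords[i], coords[j]) <= cutoff + 1e-9:
                key = tuple(sorted((sequence[i], sequence[j])))
                total += pair_energy.get(key, 0.0)
    return total


unit_credit = {("H", "H"): -1.0}  # Dill-style unit credit, for comparison
square = [(0, 0), (1, 0), (1, 1), (0, 1)]

# Lattice neighbourhood only (distance 1) versus a wider Euclidean cutoff
# that also picks up the diagonal pair.
print(fold_energy("HPHH", square, unit_credit, cutoff=1.0))
print(fold_energy("HPHH", square, unit_credit, cutoff=1.5))
```

With the wider cutoff the diagonal H-H pair is also credited, so two folds that tie under the lattice definition can be told apart.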

14

Under old definitions, suppose [acids marked in figure] are hydrophobic acids; [conformations shown in figure] are all the same.

15

But surely [conformations in figure] look better than [the other conformation in figure].

16

Research opportunities

Exact algorithms

Alternative ILP formulations (with tight LP relaxation bounds)

Difference in lattice neighbourhood and hydrophobic interaction neighbourhood (use Euclidean distance for the latter).

Development of solution methodologies

Modify Dill’s model to deal with reality

Alternative lattices (apply optimisation techniques as opposed to complete or simple constrained enumeration).

More complicated hydrophobicity (Atkins and Hart (1999) discussed fixed energy matrix and proved NP-hardness).

Previous methods use either constraint programming or integer linear programming. Why not a hybrid CP and ILP approach?

17

Research opportunities

No methods so far can manage a sequence with >100 amino acids.

Heuristics:

Meta-heuristics: still room for research, try different neighbourhood scheme

• Tailor-made search techniques that consider folding patterns

Development of problem-specific heuristic or greedy heuristic

• At least that will provide quick initial bounds for exact methods.

18

Structure recognition

19

Sequence alignment

Comparing a sequence of amino acids with known sequences in the Protein Data Bank at the primary-structure level.

Does this sequence look like that sequence?

Methods well developed: e.g. BLAST.

Fold recognition

Comparing the structure of an unknown protein with known protein structures in PDB.

Contact Map Optimisation (primary-structure comparisons)

Arthur Lesk’s model (secondary-structure comparisons)

Ip et al.’s model (secondary-structure comparisons)

20

Contact Map Optimisation

Comparing 3-D structures of two sequences of amino acids, e.g. s=(s1..sm) and t=(t1..tn). (Assuming you already know what each of them looks like, and you now want to know how much they look alike.)

Construct an undirected graph for each of s and t, with amino acids as vertices.

For each sequence, two amino acids that are within a certain Euclidean distance of each other are connected by an edge.
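Building the contact graph is the easy part and can be sketched directly from the description: vertices are amino-acid indices and an edge joins any two acids within the distance threshold. The function name `contact_map` and the choice to keep chain-consecutive pairs as edges are assumptions of this sketch; real pipelines may filter those out.

```python
import math
from itertools import combinations
from typing import Sequence, Set, Tuple


def contact_map(
    coords: Sequence[Tuple[float, float, float]], threshold: float
) -> Set[Tuple[int, int]]:
    """Return the edge set {(i, j), i < j} of acids within `threshold`."""
    return {
        (i, j)
        for i, j in combinations(range(len(coords)), 2)
        if math.dist(coords[i], coords[j]) <= threshold
    }


# Four acids on the corners of a unit square in the z = 0 plane.
coords = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]
print(sorted(contact_map(coords, 1.0)))
```

The threshold is the only modelling parameter here, and the resulting graph can change noticeably as it moves past the distances that occur in the structure.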

21

Contact Map Optimisation

[Figure: contact-map graphs of s (vertices s1, s2, ..., sm) and t (vertices t1, t2, ..., tn).]

22

Contact Map Optimisation

One way of mapping. 4 pairs of edges mapped.

23

Contact Map Optimisation

Another way of mapping. 5 edges mapped.
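The quantity being compared between the two mappings can be sketched as follows, assuming the standard statement of the problem: the mapping must preserve sequence order (no crossings), and its value is the number of edges of s whose endpoints both map onto an edge of t. The brute-force `best_alignment` is only meant for toy instances; all function names here are made up for the illustration.

```python
from itertools import combinations
from typing import List, Set, Tuple

Edge = Tuple[int, int]
Alignment = List[Tuple[int, int]]


def is_non_crossing(alignment: Alignment) -> bool:
    """True if the mapped pairs are strictly increasing in both sequences."""
    pairs = sorted(alignment)
    return all(p[0] < q[0] and p[1] < q[1] for p, q in zip(pairs, pairs[1:]))


def shared_contacts(contacts_s: Set[Edge], contacts_t: Set[Edge],
                    alignment: Alignment) -> int:
    """Count edges of s that the alignment maps onto edges of t."""
    mapping = dict(alignment)
    count = 0
    for a, b in contacts_s:
        if a in mapping and b in mapping:
            image = tuple(sorted((mapping[a], mapping[b])))
            if image in contacts_t:
                count += 1
    return count


def best_alignment(m: int, n: int, contacts_s: Set[Edge],
                   contacts_t: Set[Edge]) -> Tuple[int, Alignment]:
    """Try every order-preserving alignment; exponential, toy sizes only."""
    best: Tuple[int, Alignment] = (0, [])
    for k in range(1, min(m, n) + 1):
        for rows in combinations(range(m), k):
            for cols in combinations(range(n), k):
                alignment = list(zip(rows, cols))  # increasing in both
                value = shared_contacts(contacts_s, contacts_t, alignment)
                if value > best[0]:
                    best = (value, alignment)
    return best


s_edges = {(0, 1), (1, 2), (2, 3), (0, 3)}
t_edges = {(0, 1), (1, 2)}
print(best_alignment(4, 3, s_edges, t_edges))
```

Edges are stored with the smaller index first, which is why the mapped pair is sorted before the lookup.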

24

Wait a minute...

Remember from the HP models, amino acids are divided into two groups. What is the point of mapping a hydrophobic amino acid in one graph to a hydrophilic amino acid in another or vice versa???

Adding constraints that only amino acids of the same group are supposed to be matched might be helpful!!!
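As a rough sketch of what such a constraint would do, the snippet below filters candidate vertex pairs by HP class before any optimisation takes place. It assumes each protein comes with a string over {'H', 'P'} labelling its acids; the function names are hypothetical.

```python
from typing import List, Tuple


def compatible_pairs(hp_s: str, hp_t: str) -> List[Tuple[int, int]]:
    """Vertex pairs (i, j) allowed to be matched: same HP class only."""
    return [
        (i, j)
        for i in range(len(hp_s))
        for j in range(len(hp_t))
        if hp_s[i] == hp_t[j]
    ]


def hp_compatible(alignment: List[Tuple[int, int]], hp_s: str, hp_t: str) -> bool:
    """Check that an alignment never maps an H acid to a P acid or vice versa."""
    return all(hp_s[i] == hp_t[j] for i, j in alignment)


pairs = compatible_pairs("HP", "HHP")
print(pairs, len(pairs), "of", 2 * 3, "pairs survive")
```

In an ILP formulation this would amount to dropping the matching variables for incompatible pairs, which shrinks the model as well as tightening it.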

25

Who has done what?

No one noticed the HP issue so models aren’t 100% cool.

Lancia et al. (2001) ILP model (see pages 5-6 of pdf file)

LP-relaxation of no-crossing constraints typically weak, hence clique constraints (exponentially many) are introduced.

Problem can be converted to a maximum independent set problem, for which clique inequalities are facet-defining.

O(n^2) time separation for cliques.

Root-node LP relaxation: from 1 min to 2 hours for 62-74 acids and 80-140 contacts. (The more alike the two proteins, the faster the LP relaxation can be solved!)

26

Who has done what?

Heuristic approaches: Lancia et al. (2001)

Genetic algorithm (GA)

Steepest ascent local search

Results of Lancia et al.

Exact algorithm

Gaps: from 0 to >5% (Mostly >5%; exactly how much??)

Heuristics

Same story as above. GA much better than LS.

Work on similar topics can also be found in Havel et al. (1979), Martin et al. (1992) and so on.

27

Let’s be critical...

Even just the LP relaxation of the IP formulation without no-crossing constraints takes a long time to solve for comparing pairs of real protein sequences with 100-200 amino acids.

Tried comparing two sequences with 120+ amino acids, took more than 10 hours!!!

Really should consider the HP issue, and maybe even aggregating certain amino acids!

28

Let’s be critical...

A big problem with the model: a 3-D example

Consider the following sequence: 1 2 3 4 5 6 7

[Figure: two different 3-D structures of the sequence 1 to 7.]

Two different structures giving the same objective value by the ILP formulation of Lancia et al., assuming acids within a Euclidean distance of 3^(1/3) are connected by an edge.

29

Research opportunities

Exact methods

New ILP formulation.

Alternative solution methodologies for solving the ILPs, now that we know the ILP models are huge and solving them is hard.

Heuristics

Problem-specific heuristic.

Different neighbourhood search for meta-heuristics.

30

Arthur Lesk’s model

Compare structures of two protein sequences by inspecting relations between secondary structures

Does the blue protein look like the green protein?

31

Angle between pairs (degrees)    Symbol
0-45                             A
45-90                            B
90-135                           C
135-180                          D
180-225                          E
225-270                          F
270-315                          G
315-360                          H
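The encoding in this table is a simple binning of the angle into eight 45-degree sectors. A minimal sketch follows; the table does not say which symbol a boundary angle such as exactly 45 receives, so treating each sector as half-open (45 maps to B) is an assumption of this example, as is the function name.

```python
def angle_symbol(angle_degrees: float) -> str:
    """Map an angle between two secondary structures to a symbol A..H.

    Sectors are taken as half-open intervals [0, 45), [45, 90), ...,
    which is an assumption; angles are wrapped into [0, 360) first.
    """
    angle = angle_degrees % 360.0
    sector = min(int(angle // 45), 7)  # guard against float round-up to 360.0
    return "ABCDEFGH"[sector]


print([angle_symbol(a) for a in (10, 45, 100, 200, 359)])
```

Once every pair of secondary structures has a symbol, comparing two proteins reduces to comparing two symbol-labelled graphs.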

32

     α1 β1 β2 α2 β3 β4
α1   - B C D
β1   - A F
β2   - E F G
α2   - A
β3   -
β4   -

33

Protein sequence 1

     α1 β1 β2 α2 β3 β4
α1   - B C D
β1   - A F
β2   - E F G
α2   - A
β3   -
β4   -

Protein sequence 2

      α'1 β'1 β'2 α'2 β'3 β'4
α'1   - D C B
β'1   - A F
β'2   - E F H
α'2   - A
β'3   -
β'4   -

34

Similar to CMO...

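[Figure: secondary structures α1, β1, β2, α2, β3, β4 of the first protein as graph vertices, with edges labelled by angle symbols (B, C, D shown), next to the vertices α'1, β'1, ... of the second protein.]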

35

Useful papers and websites

Greenberg, H.J., Hart, W.E., Lancia, G. “Opportunities for Combinatorial Optimization in Computational Biology”

http://www.dkfz-heidelberg.de/tbi/bioinfo/ProteinStructure/

Christian Lemmen and Thomas Lengauer. “Computational methods for the structural alignment of molecules”, Journal of Computer-Aided Molecular Design, 14, 215-232, 2000.
