Robotics Algorithms for the Study of Protein Structure and Motion Jean-Claude Latombe Computer...

Post on 20-Dec-2015

Robotics Algorithms for the Study of Protein Structure and Motion

Jean-Claude Latombe

Computer Science Department, Stanford University

Protein

Long sequence of amino acids (dozens to thousands), drawn from a dictionary of 20 distinct amino acids

Central Dogma of Molecular Biology

Physiological conditions: aqueous solution, 37°C, pH 7, atmospheric pressure

Why Proteins?

They are the workhorses of living organisms
• They perform many vital functions, e.g.:
- catalysis of reactions
- storage of energy
- transmission of signals
- building blocks of muscles

They raise challenging computational issues
• Large molecules (100s to several 1000s of atoms)
• Made of building blocks drawn from a small “dictionary”
• Unusual kinematic structure

They are associated with many critical problems
• Folded structure determination
• Global and local structural similarities
• Prediction of folding and binding motions

Kinematic Linkage Model

peptide group

side-chain group

Molecule and Robot

Two problems

Structure determination from electron density maps
• Inverse kinematics techniques
[Itay Lotan, Henry van den Bedem, Ashley Deacon (Joint Center for Structural Genomics)]

Energy maintenance during Monte Carlo simulation
• Collision detection techniques
[Itay Lotan, Fabian Schwarzer, and Danny Halperin (Tel Aviv University)]

Structure Determination/Prediction

Experimental tools
• NMR spectrometry
• X-ray crystallography

Computational tools
• Homology, threading
• Molecular dynamics

Protein Data Bank

1990: 250 new structures
1999: 2500 new structures
2000: >20,000 structures total
2004: ~30,000 structures total

Structures have been determined for only about 10% of known protein sequences

Protein Structure Initiative (PSI)

X-Ray Crystallography

Automated Model Building

Software systems: RESOLVE, TEXTAL, ARP/wARP, MAID
• 1.0Å < d < 2.3Å: ~90% completeness
• 2.3Å ≤ d < 3.0Å: ~67% completeness (varies widely) [1]

Manually completing a model:
• Labor intensive, time consuming
• Existing tools are highly interactive

JCSG: 43% of data sets 2.3Å

Model completion is high-throughput bottleneck

[1] Badger (2003) Acta Cryst. D59

The Completion Problem

Input:
• Electron-density map
• Partial structure
• Two anchor residues
• Amino-acid sequence of missing fragment (typically 4 – 15 residues long)

Output:
• A few candidate conformations of the fragment that
- Respect the closure constraint (IK)
- Maximize match with electron-density map

[Figure: main part of protein (folded), protein fragment (fuzzy map), anchor 1 (3 atoms), and anchor 2 (3 atoms)]

IK Problem

Input:
• Closed kinematic chain with n > 6 degrees of freedom
• Relative positions/orientations X of end frames
• Target function $T(Q) \to \mathbb{R}$

Output:
• Joint angles Q that
- Achieve closure
- Optimize T

Related Work

Robotics/Computer Science

• Exact IK solvers
– Manocha & Canny ’94
– Manocha et al. ’95

• Optimization IK solvers
– Wang & Chen ’91

• Redundant manipulators
– Khatib ’87
– Burdick ’89

• Motion planning for closed loops
– Han & Amato ’00
– Yakey et al. ’01
– Cortes et al. ’02, ’04

Biology/Crystallography

• Exact IK solvers
– Wedemeyer & Scheraga ’99
– Coutsias et al. ’04

• Optimization IK solvers
– Fine et al. ’86
– Canutescu & Dunbrack Jr. ’03

• Ab-initio loop closure
– Fiser et al. ’00
– Kolodny et al. ’03

• Database search loop closure
– Jones & Thirup ’86
– van Vlijmen & Karplus ’97

• Semi-automatic tools
– Jones & Kjeldgaard ’97
– Oldfield ’01

Two-Stage IK Method

1. Candidate generation: closed fragments

2. Candidate refinement: optimize fit with EDM

Stage 1: Candidate Generation

1. Generate random conformation of fragment (only one end attached to anchor)

2. Close fragment (i.e., bring other end to second anchor) using Cyclic Coordinate Descent (CCD) (Wang & Chen ’91, Canutescu & Dunbrack ’03)

fixed end

moving end

Closure Distance

Closure distance: $S = \|N - N'\|^2 + \|C_\alpha - C_\alpha'\|^2 + \|C - C'\|^2$

Compute $q_i$ s.t. $\partial S / \partial q_i = 0$
+ bias toward EDM
+ avoid steric clashes

A.A. Canutescu and R.L. Dunbrack Jr. Cyclic coordinate descent: A robotics algorithm for protein loop closure. Prot. Sci. 12:963–972, 2003.
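
The CCD idea (adjust one joint at a time so that the closure distance is minimized with respect to that joint) can be sketched on a planar chain. This is a simplified 2-D analogue, not the protein torsion-angle version of Canutescu & Dunbrack: the function names `forward` and `ccd_close`, the single-point target, and the tolerance values are all assumptions made for illustration, and the EDM bias and steric-clash terms are left out.

```python
import numpy as np

def forward(angles, lengths):
    """Joint positions (n+1, 2) of a planar chain rooted at the origin.

    angles[i] is the rotation of link i relative to link i-1.
    """
    pts = [np.zeros(2)]
    total = 0.0
    for a, l in zip(angles, lengths):
        total += a
        pts.append(pts[-1] + l * np.array([np.cos(total), np.sin(total)]))
    return np.array(pts)

def ccd_close(angles, lengths, target, tol=1e-6, max_sweeps=1000):
    """Bring the free end of the chain to `target` by cyclic coordinate descent."""
    angles = np.array(angles, dtype=float)
    target = np.asarray(target, dtype=float)
    for _ in range(max_sweeps):
        for i in reversed(range(len(angles))):
            pts = forward(angles, lengths)
            to_end = pts[-1] - pts[i]
            to_tgt = target - pts[i]
            # Rotating joint i moves the end on a circle around pts[i];
            # the squared distance to the target is minimal when the end
            # lies on the ray from pts[i] through the target.
            delta = (np.arctan2(to_tgt[1], to_tgt[0])
                     - np.arctan2(to_end[1], to_end[0]))
            angles[i] += delta
        if np.linalg.norm(forward(angles, lengths)[-1] - target) < tol:
            break
    return angles

if __name__ == "__main__":
    lengths = [1.0] * 6
    start = [0.3, -0.5, 0.8, -0.2, 0.6, -0.4]   # arbitrary open conformation
    closed = ccd_close(start, lengths, [2.0, 1.5])
    print(np.linalg.norm(forward(closed, lengths)[-1] - np.array([2.0, 1.5])))
```

Each joint update has a closed-form solution, which is what makes a sweep cheap; the price is that many sweeps may be needed.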

Stage 2: Candidate Refinement

Target function T(Q) measuring quality of the fit with the EDM

• Minimize T while retaining closure
• Closed conformations lie on a self-motion manifold of lower dimension

[Figure: 1-D self-motion manifold in the space of three joint angles, with the null-space direction at one point]

Closure and Null Space

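$dX = J\,dQ$, where $J$ is the $6 \times n$ Jacobian matrix ($n > 6$)

• Null space $\{dQ \mid J\,dQ = 0\}$ has dim $= n - 6$
• $N$: orthonormal basis of null space
• Pseudo-inverse $J^+$ such that $JJ^+ = I$
• $dQ = J^+ dX + N N^T y$, with $y = \nabla T(Q)$

Computation of J+ and N

SVD of $J$: $dX = U_{6 \times 6}\, \Sigma_{6 \times 6}\, V^T_{6 \times n}\, dQ$, with $\Sigma = \mathrm{diag}[\sigma_1, \sigma_2, \ldots, \sigma_6]$

$J^+ = V \Sigma^+ U^T$ where $\Sigma^+ = \mathrm{diag}[1/\sigma_i]$

Gram-Schmidt orthogonalization yields the $(n-6)$-vector orthonormal basis $N$ of the null space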
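
As a minimal sketch of this computation, the pseudo-inverse and a null-space basis can both be read off a full SVD. The function name `pinv_and_null_basis` and the rank tolerance are assumptions; NumPy's full SVD already returns the rows that complete V to an orthonormal basis, so that call is substituted here for an explicit Gram-Schmidt step.

```python
import numpy as np

def pinv_and_null_basis(J, rank_tol=1e-10):
    """Return (J_pinv, N) for an m x n matrix J with n > m.

    J_pinv = V Sigma^+ U^T, and the columns of N are an orthonormal
    basis of the null space {dQ | J dQ = 0}.
    """
    U, s, Vt = np.linalg.svd(J, full_matrices=True)
    r = int(np.sum(s > rank_tol))            # numerical rank
    J_pinv = Vt[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T
    N = Vt[r:].T                             # remaining right singular vectors
    return J_pinv, N

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    J = rng.standard_normal((6, 9))          # stand-in for a 6 x n Jacobian
    J_pinv, N = pinv_and_null_basis(J)
    print(N.shape, np.abs(J @ N).max())
```

Any joint displacement of the form `N @ (N.T @ y)` leaves `J @ dQ` at zero to first order, which is why it can be added without breaking closure.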

Refinement Procedure

Repeat until minimum is reached:
• Compute J, J+ and N at current Q
• Compute $\nabla T$ at current Q (analytical expression of $\nabla T$ + linear-time recursive computation [Abe et al., Comput. Chem., 1984])
• Move along $dQ = J^+ dX + N N^T \nabla T$ until minimum is reached or closure is broken

+ Monte Carlo + simulated annealing protocol to deal with local minima
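
The refinement loop can be illustrated on a planar chain whose end point must stay fixed. This is a sketch under several assumptions: the chain is 2-D (so the Jacobian is 2 × n rather than 6 × n), the target function is a toy quadratic instead of an EDM fit, the step size `alpha` is fixed rather than chosen by line search, and the gradient term is applied with a minus sign so that T decreases. All names are hypothetical.

```python
import numpy as np

def joint_positions(q, lengths):
    """Positions (n+1, 2) of a planar chain; q[i] is relative to link i-1."""
    lengths = np.asarray(lengths, dtype=float)
    angles = np.cumsum(q)
    steps = np.column_stack((lengths * np.cos(angles), lengths * np.sin(angles)))
    return np.vstack((np.zeros(2), np.cumsum(steps, axis=0)))

def jacobian(q, lengths):
    """2 x n Jacobian of the end point with respect to the joint angles."""
    pts = joint_positions(q, lengths)
    r = pts[-1] - pts[:-1]                   # vectors joint_i -> end point
    return np.vstack((-r[:, 1], r[:, 0]))    # each column is r rotated by 90 deg

def refine(q, lengths, x_target, grad_T, alpha=0.02, steps=2000):
    """Decrease T along the null space while pulling the end back to x_target."""
    q = np.array(q, dtype=float)
    for _ in range(steps):
        J = jacobian(q, lengths)
        _, _, Vt = np.linalg.svd(J, full_matrices=True)
        N = Vt[J.shape[0]:].T                # null-space basis (J assumed full rank)
        dX = x_target - joint_positions(q, lengths)[-1]   # closure error
        q = q + np.linalg.pinv(J) @ dX - alpha * (N @ (N.T @ grad_T(q)))
    return q

if __name__ == "__main__":
    lengths = np.ones(5)
    q0 = np.array([0.9, -1.2, 1.0, -0.8, 1.1])
    x_target = joint_positions(q0, lengths)[-1]      # start from a closed chain
    q1 = refine(q0, lengths, x_target, lambda q: 2.0 * q)   # T(q) = sum(q**2)
    print(np.sum(q0**2), np.sum(q1**2))
```

The `J^+ dX` term repairs the small closure error that each finite null-space step introduces, so the two terms work together rather than in sequence.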

Monte Carlo Optimization

Repeat:
1. Perform a random move of the fragment:
– either by picking a random direction in null space
– or by using an exact IK solver over 6 dofs [Coutsias et al., 2004] (big jumps)
2. Minimize T(Q)
3. Accept move with Metropolis-criterion probability ~exp(-ΔT/Temp)

Tests #1: Artificial Gaps

TM1621 (234 residues) and TM0423 (376 residues), SCOP classification a/b

Complete structures (gold standard) resolved with EDM at 1.6Å resolution

Compute EDM at 2, 2.5, and 2.8Å resolution

Remove fragments and rebuild

103 Fragments from TM1621 at 2.5Å

Produced by H. van den Bedem

Long Fragments:
12: 96% < 1.0Å aaRMSD
15: 88% < 1.0Å aaRMSD

Short Fragments:
100% < 1.0Å aaRMSD

Comparison Across Resolutions

[Figure: panels for Resolution = 2.0Å, Resolution = 2.5Å, and Resolution = 2.8Å]

Example: TM0423
PDB: 1KQ3, 376 res.
2.0Å resolution
12 residue gap
Best: 0.3Å aaRMSD

Tests #2: True Gaps

• Structure computed by RESOLVE
• Gaps completed independently (gold standard)
• Example: TM1742 (271 residues), 2.4Å resolution; 5 gaps left by RESOLVE

Length   Top scorer   Lowest error
4        0.22Å        0.22Å
5        0.78Å        0.78Å
5        0.36Å        0.36Å
7        0.72Å        0.66Å
10       0.43Å        0.43Å

Produced by H. van den Bedem

TM0813

PDB: 1J5X, 342 res.
2.8Å resolution
12 residue gap
Best: 0.6Å aaRMSD

[Figure labels: GLU-83, GLY-96]

TM1621

Green: manually completed conformation

Cyan: conformation computed by stage 1

Magenta: conformation computed by stage 2

The aaRMSD improved by 2.4Å to 0.31Å

resolution: 2.0Å
initial model: ARP/wARP
contour: 1.0σ
PDB: 1VJG
aaRMSD: 0.33Å

Alr1529, D72-D78

TM0542

• Top-scoring fragment in cyan
• Manually completed fragment in green
• Residues A259 and A260 are flipped

Current/Future Work

Software actively being used at the JCSG

What about multi-modal loops?

TM0755: data at 1.8Å
• 8-residue fragment crystallized in 2 conformations
• Overlapping density: difficult to interpret manually
• Algorithm successfully identified and built both conformations

[Figure labels: A323 His, A316 Ser]

Fuzziness in EDM can then be exploited

Use EDM to infer probability measure over the conformation space of the loop

Amylosucrase

J. Cortés, T. Siméon, M. Remaud-Siméon, and V. Tran. J. Comp. Chemistry, 25:956-967, 2004

Energy maintenance during Monte Carlo simulation

Joint work with Itay Lotan, Fabian Schwarzer, and Dan Halperin (Computer Science Department, Tel Aviv University)

Monte Carlo Simulation (MCS)

Random walk through conformation space. At each attempted step:
• Perturb current conformation at random
• Accept step with probability: $P(\text{accept}) = \min\{1,\ e^{-\Delta E / k_b T}\}$

The conformations generated by an arbitrarily long MCS are Boltzmann distributed, i.e., #conformations in V ~ $\int_V e^{-E / k_b T}\, dV$

Used to:
• sample meaningful distributions of conformations
• generate energetically plausible motion pathways

A simulation run may consist of millions of steps → energy must be evaluated frequently

Problem: How to maintain energy efficiently?
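
The acceptance rule of an MCS step can be sketched on a toy two-state system, where the Boltzmann ratio is easy to check. The function names, the two-state energies, and the use of a seeded `random.Random` are assumptions made for illustration; a protein simulation would replace the state flip with a random perturbation of torsion angles.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng=random):
    """Accept with probability min(1, exp(-delta_e / kT))."""
    if delta_e <= 0.0:
        return True                      # downhill moves are always accepted
    return rng.random() < math.exp(-delta_e / kT)

def simulate_two_state(energies, kT, n_steps, seed=0):
    """Random walk between two states; returns how often each was occupied."""
    rng = random.Random(seed)
    state = 0
    visits = [0, 0]
    for _ in range(n_steps):
        proposal = 1 - state             # symmetric proposal: flip the state
        if metropolis_accept(energies[proposal] - energies[state], kT, rng):
            state = proposal
        visits[state] += 1               # a rejected step re-counts the old state
    return visits

if __name__ == "__main__":
    visits = simulate_two_state([0.0, 1.0], kT=1.0, n_steps=200_000)
    # The occupancy ratio should approach exp(-(E1 - E0) / kT).
    print(visits[1] / visits[0], math.exp(-1.0))
```

Every attempted step needs the energy difference, which is why the cost of maintaining the energy dominates a long run.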

Energy Function

E = bonded terms + non-bonded terms + solvation terms

Bonded terms
- O(n)

Non-bonded terms
- E.g., Van der Waals and electrostatic
- Depend on distances between pairs of atoms
- O(n²): expensive to compute

Solvation terms
- May require computing molecular surface

Non-Bonded Terms

• Energy terms go to 0 when distance increases → cutoff distance (6 - 12Å)
• vdW forces prevent atoms from bunching up → only O(n) interacting pairs [Halperin & Overmars 98]

Problem: How to find interacting pairs without enumerating all atom pairs?

Grid Method

• Subdivide 3-space into cubic cells
• Compute cell that contains each atom center
• Represent grid as hashtable

[Figure: grid with cell size labeled d_cutoff]

• Θ(n) time to build grid
• O(1) time to find interacting pairs for each atom
• Θ(n) to find all interacting pairs of atoms [Halperin & Overmars, 98]
• Asymptotically optimal in worst-case
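
A minimal sketch of the grid method, assuming cubic cells whose side equals the cutoff distance so that only the 27 neighbouring cells need to be inspected per atom. The function names are hypothetical, a Python `dict` stands in for the hashtable, and the brute-force routine is included only to check the result.

```python
from collections import defaultdict
from itertools import product

import numpy as np

def interacting_pairs_grid(centers, d_cutoff):
    """All index pairs (i, j), i < j, whose centers are within d_cutoff."""
    centers = np.asarray(centers, dtype=float)
    # Integer cell coordinates of each atom center (cell side = d_cutoff).
    keys = [tuple(k) for k in np.floor(centers / d_cutoff).astype(int).tolist()]
    cells = defaultdict(list)            # hashtable: cell -> atom indices
    for idx, key in enumerate(keys):
        cells[key].append(idx)
    offsets = list(product((-1, 0, 1), repeat=3))   # the 27 neighbouring cells
    pairs = set()
    for i, (cx, cy, cz) in enumerate(keys):
        for dx, dy, dz in offsets:
            for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                if j > i and np.linalg.norm(centers[i] - centers[j]) <= d_cutoff:
                    pairs.add((i, j))
    return pairs

def interacting_pairs_brute(centers, d_cutoff):
    """O(n^2) reference implementation used only for checking."""
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(centers[i] - centers[j]) <= d_cutoff}

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    atoms = rng.uniform(0.0, 40.0, size=(300, 3))
    grid_pairs = interacting_pairs_grid(atoms, 6.0)
    print(len(grid_pairs), grid_pairs == interacting_pairs_brute(atoms, 6.0))
```

The O(1) cost per atom relies on the bounded number of atoms per cell, which holds for proteins because atoms cannot bunch up.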

Can we do better on average?

• Few DOFs are changed at each MC step
• Proteins are long kinematic chains
• Long sub-chains stay rigid at each step
• Many partial energy sums remain constant

[Figure: plot over the number k of DOF changes (axis 0 to 30), for a simulation of 100,000 attempted steps]

Problem: How to retrieve the unchanged partial sums?

Hierarchical Collision Checking

Widely used technique in robotics/graphics to approximate distances between objects

Pre-computation of bounding-volume hierarchy

How to update this hierarchy if the objects deform?

Two New Data Structures

1. ChainTree: fast detection of interacting atom pairs

2. EnergyTree: retrieval of unchanged partial energy sums

ChainTree (Twofold Hierarchy: BVs + Transforms)

[Figure: chain of links and joints, with transforms T_AB, T_JK, and T_NO]

Updating the ChainTree

Update path to root:
– Recompute transforms that “shortcut” the DOF change
– Recompute BVs that contain the DOF change
– O(k log(n/k)) work for k changes
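
The transform half of this update can be sketched with a balanced binary tree over per-joint transforms, where each internal node stores the product of its children and therefore "shortcuts" a whole sub-chain. This is a simplified illustration, not the ChainTree itself: bounding volumes are omitted, only a single DOF change (k = 1, so O(log n) work) is shown, and the names `TransformTree` and `joint_transform` are hypothetical.

```python
import numpy as np

def joint_transform(theta, length):
    """4x4 homogeneous transform: rotate by theta about z, offset along x."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0, length],
                     [s,  c, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

class TransformTree:
    """Binary tree whose root is the ordered product of all leaf transforms."""

    def __init__(self, leaf_transforms):
        self.size = 1
        while self.size < len(leaf_transforms):
            self.size *= 2
        # Unused leaves hold the identity, which does not affect products.
        self.nodes = [np.eye(4) for _ in range(2 * self.size)]
        for i, t in enumerate(leaf_transforms):
            self.nodes[self.size + i] = np.asarray(t, dtype=float)
        for i in range(self.size - 1, 0, -1):
            self.nodes[i] = self.nodes[2 * i] @ self.nodes[2 * i + 1]

    def update(self, i, transform):
        """Change leaf i and recompute only its ancestors; return their count."""
        pos = self.size + i
        self.nodes[pos] = np.asarray(transform, dtype=float)
        pos //= 2
        recomputed = 0
        while pos >= 1:
            self.nodes[pos] = self.nodes[2 * pos] @ self.nodes[2 * pos + 1]
            recomputed += 1
            pos //= 2
        return recomputed

    def root(self):
        return self.nodes[1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    leaves = [joint_transform(a, 1.5) for a in rng.uniform(-np.pi, np.pi, 10)]
    tree = TransformTree(leaves)
    print(tree.update(3, joint_transform(0.7, 1.5)))   # ancestors recomputed
```

Nodes off the path to the root keep their cached products, which mirrors the observation that long sub-chains stay rigid at each step.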

Finding Interacting Pairs

• Do not search inside rigid sub-chains (unmarked nodes)
• Do not test two nodes with no marked node between them

[Figure: new interacting pairs]

EnergyTree

[Figure: tree of partial energy sums, with nodes such as E(N,N), E(J,L), E(K,L), E(L,L), and E(M,M)]

Complexity

n : total number of DOFs
k : number of DOF changes at each MCS step
k << n

Complexity of:
• updating ChainTree: $O(k \log(n/k))$
• finding interacting pairs: $O(n^{4/3})$, but performs much better in practice

Experimental Setup

Energy function:
• Van der Waals
• Electrostatic
• Attraction between native contacts
• Cutoff at 12Å

300,000 steps MCS with Grid and ChainTree
• Steps are the same with both methods
• Early rejection for large vdW terms

Results: 1-DOF change

[Figure: speedup vs. # amino acids (68, 144, 374, 755); bar labels as extracted: 3.5, 12.5, 5.8, 7.8]

Results: 5-DOF change

[Figure: speedup vs. # amino acids (68, 144, 374, 755); bar labels as extracted: 2.2, 3.4, 4.5, 5.9]

Two-Pass ChainTree (ChainTree+)

1st pass: small cutoff distance to detect steric clashes

2nd pass: normal cutoff distance

>5

Tests around native state

Interaction with Solvent

Explicit solvent models: 100s or 1000s of discrete solvent molecules

Implicit solvent models: solvent as continuous medium, interface is solvent-accessible surface

E. Eyal, D. Halperin. Dynamic Maintenance of Molecular Surfaces under Conformational Changes. http://www.give.nl/movie/publications/telaviv/EH04.pdf

Summary

Inverse kinematics techniques → improve structure determination from fuzzy electron density maps

Collision detection techniques → speed up energy maintenance during Monte Carlo simulation

About Computational Biology

Computational Biology is more than applying computers to biological problems or mimicking nature (e.g., performing MD simulation)

One of its goals is to achieve algorithmic efficiency by exploiting properties of molecules, e.g.:
• Proteins are long kinematic chains
• Atoms cannot bunch up together
• Forces have relatively short ranges