Genetic Algorithms and Protein Folding


Based on a lecture by Dr. Steffen Schulze-Kremer

http://www.techfak.uni-bielefeld.de/bcd/Curric/ProtEn/proten.html

Genetic Algorithm:

A heuristic method that operates on pieces of information the way nature operates on genes in the course of evolution.

•Individuals are represented by a linear string of letters of an alphabet (in nature nucleotides, in genetic algorithms bits)

•Individuals are allowed to mutate, crossover and reproduce.

•Fitness function evaluates individuals.

•Depending on the generation replacement mode a subset of parents and offspring enters the next reproduction cycle.

•After a number of iterations the population consists of individuals that are well adapted in terms of the fitness function.

•It cannot be proven that the individuals of a final generation contain an optimal solution for the objective encoded in the fitness function.

I. Initialise a population of individuals. This can be done either randomly or with domain-specific background knowledge, so that the search starts from promising seed individuals. (Where available, the latter is always recommended.)

• Individuals are represented as a string of bits.

• A fitness function must be defined that takes an individual as input and returns a number (or a vector) that can be used as a measure of the quality (fitness) of that individual.

The application should be formulated in a way that the desired solution to the problem coincides with the most successful individual according to the fitness function.

II. Evaluate all individuals of the initial population.

III. Generate new individuals. The reproduction probability for an individual is proportional to its relative fitness within the current generation.
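Fitness-proportional reproduction is often implemented as "roulette wheel" sampling. The following is a minimal sketch, not the lecture's implementation: the function name `select_parents` is hypothetical, and it assumes that higher fitness is better and that all fitness values are non-negative (an energy-based fitness would have to be transformed first).

```python
import random

def select_parents(population, fitness, k, rng=random):
    """Draw k parents with probability proportional to relative fitness.

    Assumption: fitness values are non-negative and higher is better.
    """
    scores = [fitness(ind) for ind in population]
    total = sum(scores)
    if total == 0:
        # Degenerate case: every individual has zero fitness, so pick uniformly.
        return [rng.choice(population) for _ in range(k)]
    weights = [s / total for s in scores]  # relative fitness in this generation
    return rng.choices(population, weights=weights, k=k)

if __name__ == "__main__":
    pop = ["0001", "0111", "1111"]
    print(select_parents(pop, lambda s: s.count("1"), k=4))
```

Individuals with zero relative fitness are never drawn, which is one reason practical implementations often add a small offset or use rank-based selection instead.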

Crossover

two point crossover

0101001111000011010101011110111

1010101101011100101110001010101

uniform crossover

0101001111000011010101011110111

1010101101011100101110001010101

Genetic Operators:

Mutation. Substitute one or more bits of an individual randomly by a new value (0 or 1).

Variation. Change the bits in a way that the number encoded by them is slightly incremented or decremented.

Crossover. Exchange parts (single bits or strings of bits) of one individual with the corresponding parts of another individual. Originally, only one-point crossover was performed but theoretically one can process up to L - 1 different crossover sites (with L as the length of the individual).
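The three operators can be sketched on bit strings as follows. This is an illustrative sketch under stated assumptions: the function names are hypothetical, VARIATION is taken to mean a step of exactly one unit (the slides only say "slightly"), and both parents are assumed to have the same length of at least three bits.

```python
import random

def mutate(bits, rate, rng=random):
    """Replace each bit, with probability `rate`, by a random new value (0 or 1)."""
    return "".join(rng.choice("01") if rng.random() < rate else b for b in bits)

def variate(bits, rng=random):
    """Increment or decrement the encoded number by one, clamped to the valid range."""
    value = int(bits, 2) + rng.choice((-1, 1))
    value = max(0, min(2 ** len(bits) - 1, value))
    return f"{value:0{len(bits)}b}"

def two_point_crossover(a, b, rng=random):
    """Exchange the segment between two random cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform_crossover(a, b, rng=random):
    """Decide independently for every position whether the two bits are swapped."""
    pairs = [(x, y) if rng.random() < 0.5 else (y, x) for x, y in zip(a, b)]
    return "".join(p[0] for p in pairs), "".join(p[1] for p in pairs)

if __name__ == "__main__":
    p1 = "0101001111000011010101011110111"
    p2 = "1010101101011100101110001010101"
    print(two_point_crossover(p1, p2))
    print(uniform_crossover(p1, p2))
```

Both crossover variants only rearrange existing bits between the two parents; new bit values at a position where both parents agree can enter the population only through mutation or variation.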

IV. Select individuals for the new parent generation. Schemes:

1) Complete offspring is selected while all parents are discarded (original genetic algorithm). This is motivated by the biological model and is called total generation replacement.

2) The n best individuals (from the old and the new generation) are selected. This method is called elitist generation replacement.

V. Go back to step II until either a desired fitness value has been reached or a predefined number of iterations has been performed.

Init the first generation

Evaluate

Apply Genetic Operations

Select the next generation
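Steps I to V can be put together into one loop. The sketch below is a toy illustration rather than the lecture's program: `run_ga` and all of its parameters are hypothetical names, the fitness is assumed to be non-negative with higher values being better, and one-point crossover plus bitwise mutation stand in for the full operator set.

```python
import random

def run_ga(fitness, length=16, pop_size=20, generations=50,
           target=None, mutation_rate=0.02, elitist=True, seed=0):
    rng = random.Random(seed)
    # I. Initialise the first generation randomly.
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(pop_size)]
    for _ in range(generations):
        # II. Evaluate all individuals.
        scores = [fitness(ind) for ind in population]
        if target is not None and max(scores) >= target:
            break  # V. Desired fitness value reached.
        # III. Generate offspring; reproduction probability ~ relative fitness.
        weights = [s + 1e-9 for s in scores]  # small offset avoids all-zero weights
        offspring = []
        while len(offspring) < pop_size:
            a, b = rng.choices(population, weights=weights, k=2)
            cut = rng.randrange(1, length)  # one-point crossover
            child = a[:cut] + b[cut:]
            child = "".join(rng.choice("01") if rng.random() < mutation_rate else c
                            for c in child)
            offspring.append(child)
        # IV. Select the next parent generation.
        if elitist:
            population = sorted(population + offspring, key=fitness,
                                reverse=True)[:pop_size]
        else:
            population = offspring  # total generation replacement
    return max(population, key=fitness)

if __name__ == "__main__":
    # Toy objective: maximise the number of 1 bits.
    print(run_ga(lambda s: s.count("1"), target=16))
```

With `elitist=True` the best fitness in the population can never decrease from one generation to the next; with total generation replacement it can.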

Representation Formalism

• hybrid approach - genetic algorithm is configured to operate on numbers, not bit strings as in the original genetic algorithm.

Disadvantages:

– the mathematical foundation of genetic algorithms holds only for binary representations, although some of the mathematical properties are also valid for a floating point representation.

– Binary representations run faster in many applications.

– An additional encoding/decoding process may be required to map numbers onto bit strings.
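The encoding/decoding step mentioned in the last point could look like the following for a single angle. This is only a sketch: the function names, the 8-bit width and the range of -180° to 180° are assumptions made for illustration.

```python
def encode_angle(angle_deg, n_bits=8):
    """Map an angle in degrees onto an n_bits-wide bit string (uniform quantisation)."""
    levels = 2 ** n_bits
    index = int((angle_deg + 180.0) / 360.0 * levels) % levels
    return f"{index:0{n_bits}b}"

def decode_angle(bits):
    """Map a bit string back to the lower bound of its angle interval, in degrees."""
    levels = 2 ** len(bits)
    return int(bits, 2) * 360.0 / levels - 180.0

if __name__ == "__main__":
    code = encode_angle(57.3)
    print(code, decode_angle(code))
```

The round trip loses up to one quantisation step (360°/256, about 1.4°, for 8 bits), which is the price of the binary representation that a number-based hybrid approach avoids.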

Protein Structure Prediction

Individuals - Protein Conformations

Fitness Function – Force Field

Representation

Cartesian 3D coordinates are not a good choice

-> representation by torsion angles

•The frequency of each torsion angle in intervals of 10° was determined and the ten most frequently occurring intervals are made available for substitution of individual torsion angles by the MUTATE operator.

•At the beginning of the run, individuals were initialized with either a completely extended conformation where all torsion angles are 180° or by a random selection from the ten most frequently occurring intervals of each torsion angle.

•For the ω torsion angle the constant value of 180° was used because of the rigidity of the peptide bond between the atoms C_i and N_{i+1}.
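A torsion-angle individual, its two initialisation modes and the MUTATE operator might be sketched as below. Everything specific here is an assumption for illustration: the angle names in `ANGLES`, the placeholder table `FREQUENT_INTERVALS` (not real statistics; the lecture derived the intervals from observed frequencies), and the choice to draw a value uniformly inside the selected 10° interval.

```python
import random

# Hypothetical set of torsion angles per residue.
ANGLES = ("phi", "psi", "omega", "chi1", "chi2")

# PLACEHOLDER data, not real statistics: lower bounds of ten 10-degree intervals
# standing in for the ten most frequently occurring intervals of each angle.
FREQUENT_INTERVALS = {name: list(range(-180, -80, 10)) for name in ANGLES}

def sample_angle(name, rng):
    """Pick one of the frequent intervals, then a value inside it (assumption)."""
    low = rng.choice(FREQUENT_INTERVALS[name])
    return low + rng.uniform(0.0, 10.0)

def extended_conformation(n_residues):
    """Completely extended chain: all torsion angles are 180 degrees."""
    return [{name: 180.0 for name in ANGLES} for _ in range(n_residues)]

def random_conformation(n_residues, rng=random):
    """Random selection from the frequent intervals; omega stays fixed at 180."""
    conformation = []
    for _ in range(n_residues):
        residue = {name: sample_angle(name, rng) for name in ANGLES}
        residue["omega"] = 180.0  # rigid peptide bond
        conformation.append(residue)
    return conformation

def mutate(conformation, rng=random):
    """Substitute one torsion angle (never omega) by a value from a frequent interval."""
    new = [dict(residue) for residue in conformation]
    i = rng.randrange(len(new))
    name = rng.choice([n for n in ANGLES if n != "omega"])
    new[i][name] = sample_angle(name, rng)
    return new

if __name__ == "__main__":
    start = extended_conformation(46)
    print(mutate(start)[:2])
```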

Search Space

Generally molecules with n atoms have 3n - 6 degrees of freedom ->

100 residues * approximately 20 atoms per residue = 2000 atoms, i.e. 3 * 2000 - 6 = 5994 degrees of freedom

Systems of equations with this number of variables are analytically intractable today.

Discrete approximation:

(5 torsion angles per residue * 5 likely values per torsion angle)^100 = 25^100 conformations
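The two estimates can be checked with a few lines of arithmetic. The sketch simply reproduces the slide's numbers (100 residues, about 20 atoms per residue, 25 discrete states per residue); the function name is hypothetical.

```python
def degrees_of_freedom(n_atoms):
    """Internal degrees of freedom of a non-linear molecule with n_atoms atoms."""
    return 3 * n_atoms - 6

n_atoms = 100 * 20                     # 100 residues, about 20 atoms each
print(degrees_of_freedom(n_atoms))     # 5994

conformations = (5 * 5) ** 100         # 25 discrete states per residue, 100 residues
print(len(str(conformations)))         # 140 digits, i.e. roughly 6 * 10^139
```

Even the discrete approximation is therefore far too large for exhaustive enumeration, which is the motivation for a heuristic search.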

Fitness Function - Potential Energy

CHARMM energy function:

E = E_bond + E_phi + E_tor + E_impr + E_vdW + E_el + E_H + E_cr + E_cphi

Simplified to:

E = E_tor + E_vdW + E_el

E_bond: bond length potential (set to const)
E_phi: bond angle potential (set to const)
E_tor: torsion angle potential
E_impr: improper torsion angle potential (set to const)
E_vdW: van der Waals pair interactions
E_el: electrostatic potential
E_H: hydrogen bonds (set to const)
E_cr, E_cphi: interaction with the solvent (set to const -> in vacuum)

(since there are no interactions with the solvent, there is not enough force to drive the protein to a compact folded state)

Simplified Energy Function

E = E_tor + E_vdW + E_el + E_pe

E_pe: pseudo entropic term

Empirical relation between the number of residues and the diameter:

First test protein: crambin, 46 a.a.

Table 3. Steric Energies in the Last Generation

Table 2. R.m.s. Deviations to Native Crambin

Simple summation of different components has the disadvantage that components with larger numbers would dominate the fitness function whether or not they are important or of any significance at all for a particular conformation.

In other words -> bad fitness function

The genetic algorithm favoured individuals with lowest total energy which in this case was most easily achieved by optimising electrostatic contributions.

Improvements

•Instead of using separate phi and psi value distributions, apply a phi-psi (2D) clustering procedure.

•Use secondary structure prediction algorithm (70% accuracy).

•Specialised Genetic Operators

LOCAL TWIST (local conformation changes by performing the ring closure algorithm for polymers)

The LOCAL TWIST operator led to significant improvements in prediction accuracy and also to a substantial decrease in overall computation time.

Improvements (2)

Fitness Function -> vector

r.m.s. only for verification

Vector Fitness Function

Candidate selection for the next generation:

•If there is an individual that has better (i.e. lower) values in each fitness component, then we take it. Continue until no unambiguously better individuals are found.

•Then remove the worst individuals, i.e. those with higher values in each fitness component than any other individual.

•The remaining set of individuals is heuristically reduced until the exact number of individuals for the next generation is reached. This is done by iteratively removing an individual with the worst fitness value in a randomly selected fitness component.
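One possible reading of this three-stage selection is sketched below. It is an interpretation, not the lecture's code: the function names are hypothetical, fitness vectors are plain tuples with lower values being better, "unambiguously better" is taken to mean strictly lower in every component than every other remaining individual, and the "worst" individuals are those strictly higher in every component than every other one.

```python
import random

def better_everywhere(a, b):
    """True if vector a is strictly lower (better) than b in each component."""
    return all(x < y for x, y in zip(a, b))

def find_extreme(vectors, relation):
    """Index of a vector for which relation(vector, other) holds for all others."""
    for i, a in enumerate(vectors):
        if all(relation(a, b) for j, b in enumerate(vectors) if j != i):
            return i
    return None

def select_next_generation(pool, n, rng=random):
    remaining, selected = list(pool), []
    # Stage 1: take individuals that are unambiguously better than all others.
    while remaining and len(selected) < n:
        i = find_extreme(remaining, better_everywhere)
        if i is None:
            break
        selected.append(remaining.pop(i))
    # Stage 2: drop individuals that are worse than all others in every component.
    while len(remaining) > 1 and len(selected) + len(remaining) > n:
        i = find_extreme(remaining, lambda a, b: better_everywhere(b, a))
        if i is None:
            break
        remaining.pop(i)
    # Stage 3: heuristic reduction on a randomly selected fitness component.
    while remaining and len(selected) + len(remaining) > n:
        k = rng.randrange(len(remaining[0]))
        worst = max(range(len(remaining)), key=lambda i: remaining[i][k])
        remaining.pop(worst)
    return selected + remaining

if __name__ == "__main__":
    pool = [(1, 1), (2, 3), (3, 2), (5, 5)]
    print(select_next_generation(pool, 2))
```

Because stage 3 picks its component at random, two runs on the same pool may keep different individuals among those that are not comparable, which helps preserve diversity across fitness components.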

Tests on other proteins (LOCAL TWIST and r.m.s. fitness) also gave conformations close to native (r.m.s. deviation below 3.0 Å).

Capability of Genetic Algorithm in General?

Conclusion: given an appropriate fitness function, the genetic algorithm achieves the desired results.

Test case: crambin, 46 a.a.

I. Fitness vector: polar, , , , hydro, Crippen and solvent

-> r.m.s. 6.27

, hydro, Crippen, solvent decreased with r.m.s.

polar, , misled the algorithm to non-native conformations

II. Fitness vector: Crippen, clash, hydro and scatter

+ constraints on the secondary structures

-> r.m.s. 4.36

trypsin inhibitor -> 6.65

Conclusions

• Genetic algorithms proved to be an efficient search tool for 3-D representations of proteins. For a 3-D protein model with a simple, additive force field as fitness function and a rather small population, the genetic algorithm produced several individuals (i.e. protein conformations) of dissimilar topology but each with highly optimized fitness values.

• Given an appropriate fitness function the genetic algorithm application described here finds the desired solution within only small deviations.

• The major problem lies in the fitness function. If there were one indicator, or a set of indicators, that returned 1 for "the object is a native protein conformation" and 0 for "the object is not a native protein conformation", one could expect the genetic algorithm approach to deliver reasonably accurate ab initio predictions. However, neither mathematical models nor empirical, semi-empirical or statistical force fields are yet accurate enough to reliably discriminate native from non-native conformations without additional constraints. Thus, the genetic algorithm produces (sub-)optimal conformations in a different sense than that of nativeness.

Note: the same problem (the fitness or scoring function) exists in the protein docking problem. The correct transformation (within 3-5 Å) is found in realistic time in almost all cases. However, assigning a high score to the native complex remains problematic; a proper scoring function is not yet known.

Side Chain Placement

rms 1.86
