Bioinformatics and Computer Science
Ina Koch
TFH Berlin, Master's course Bioinformatics
http://www.tfh-berlin.de/bi/
Cottbus, 8th of October 2004
Outline
Introduction
SNP analysis in the human genome
Dynamic programming as basis for sequence comparison
Summary and outlook
Bioinformatics - Computational Biology
Data collection and storage - database techniques, integrative databases
Data visualisation - computer graphics, molecule graphics
MicroArray analysis - pattern recognition, statistics
Data analysis
  sequence - string algorithms, dynamic programming
  structure - graph theory, AI, knowledge acquisition
  networks - graph theory, Petri nets, computer algebra
Drug Design, Molecular Modelling - parallel algorithms
SNP analysis in the human genome
The average human being exhibits ~100 new mutations.
The mutation of one nucleotide in the genome (point mutation) is called a Single Nucleotide Polymorphism (SNP) if it occurs with a frequency of more than 1% in a population.
non-synonymous: changes the encoded amino acid, e.g. TTT (Phe) -> TTA (Leu)
synonymous: codes for the same amino acid, e.g. TTT (Phe) -> TTC (Phe)
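The distinction between the two SNP classes can be sketched in a few lines of Python. This is only an illustration: the function name `classify_point_mutation` is made up for this example, and the codon table is the standard genetic code written in its usual compact form (bases in the order T, C, A, G).

```python
from itertools import product

# Standard genetic code in compact form; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon: str, mutant: str) -> str:
    """Classify a single-nucleotide change inside one codon.

    Hypothetical helper for illustration; a change to or from a stop
    codon is reported as non-synonymous here.
    """
    differences = sum(a != b for a, b in zip(codon, mutant))
    if len(codon) != 3 or len(mutant) != 3 or differences != 1:
        raise ValueError("expected two codons differing in exactly one base")
    same = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return "synonymous" if same else "non-synonymous"


print(classify_point_mutation("TTT", "TTA"))  # Phe -> Leu
print(classify_point_mutation("TTT", "TTC"))  # Phe -> Phe
```

The two calls reproduce the codon pairs given on the slide.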
SNPs - some numbers
Two individuals: some millions of nucleotide differences, ~100,000 amino acid differences
Within a population: 1/300 bp differences
About half of the SNPs in coding regions are non-synonymous.
Between two equivalent chromosomes: 1/1000 bp differences (nucleotide diversity)
Most frequent type: transition C <-> T (G <-> A), 2/3 of all SNPs
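A transition exchanges a base for another of the same chemical class (purine for purine, pyrimidine for pyrimidine); every other substitution is a transversion. A minimal sketch, with the function name `substitution_type` chosen only for this example:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(ref: str, alt: str) -> str:
    """Return 'transition' or 'transversion' for a single-base substitution."""
    if ref == alt or not {ref, alt} <= (PURINES | PYRIMIDINES):
        raise ValueError("expected two different bases from A, C, G, T")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


print(substitution_type("C", "T"))  # transition
print(substitution_type("A", "C"))  # transversion
```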
Why are SNPs interesting?
Medical questions
CD - CV hypothesis (Common Disease - Common Variant)
Example: ApoE*E4 allele in Alzheimer's disease
• How many SNPs are associated with diseases?
• How can we identify these SNPs?
• How many non-synonymous SNPs damage the structure or function of the protein?
• How can we identify these SNPs?
A disease-causing SNP
The human hemochromatosis protein (1A6Z)
Frequency: ~6% in northern Europe, ~14% in Ireland
Search for SNPs in databases
Databases: SWISS-PROT, OMIM, HGBASE, dbSNP, HSSP
Filter
1. Keywords: '3D STRUCTURE' and 'DISEASE MUTATION'
2. Keywords: '3D STRUCTURE' and 'POLYMORPHISM', but not 'DISEASE MUTATION': allelic variants with >1% frequency in 'normal' humans
3. BLASTX search against HSSP: search for close homologues (>95% similarity) in other species for all proteins and mutations selected so far
Results
1 551 disease-causing mutations
459 allelic variants
440 neutral mutations between species
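The two keyword filters amount to a set test on the keywords of each database entry. The sketch below assumes the entries have already been parsed into a hypothetical `Entry` record; the accession names and keyword sets are invented test data, not real SWISS-PROT entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical parsed database entry (not a real SWISS-PROT record)."""
    accession: str
    keywords: frozenset


def has_keywords(entry: Entry, required: set, excluded: set = frozenset()) -> bool:
    """True if all required keywords are present and no excluded one is."""
    return required <= entry.keywords and not (excluded & entry.keywords)


entries = [
    Entry("ENTRY_A", frozenset({"3D STRUCTURE", "DISEASE MUTATION"})),
    Entry("ENTRY_B", frozenset({"3D STRUCTURE", "POLYMORPHISM"})),
    Entry("ENTRY_C", frozenset({"3D STRUCTURE", "POLYMORPHISM", "DISEASE MUTATION"})),
    Entry("ENTRY_D", frozenset({"POLYMORPHISM"})),
]

# Filter 1: structure known and disease mutation annotated.
disease = [e.accession for e in entries
           if has_keywords(e, {"3D STRUCTURE", "DISEASE MUTATION"})]

# Filter 2: structure known, polymorphism annotated, no disease mutation.
polymorphic = [e.accession for e in entries
               if has_keywords(e, {"3D STRUCTURE", "POLYMORPHISM"},
                               excluded={"DISEASE MUTATION"})]

print(disease)      # ['ENTRY_A', 'ENTRY_C']
print(polymorphic)  # ['ENTRY_B']
```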
Prediction of function-damaging effect
Active sites, binding sites
Analysis of the multiple alignment
Disulfide bridges
Hydrophobicity in the protein core
Solvent accessibility
Interactions with hetero atoms
Prediction rules
The amino acid variant is function-damaging if
1. it is located in a region annotated in SWISS-PROT as ACTIVE_SITE, BINDING_SITE, SITE, MOD_RES, DISULFID, or METAL, or
2. it is not compatible with the amino acid substitutions at the same position of homologous proteins, or
3. it is located inside the protein core and causes a change in the electrostatic potential, or
4. it is located at the protein surface and changes the surface accessibility of the protein, or
5. it concerns a proline residue in a helix, or
6. its minimal distance to hetero atoms (except water) is < 6 Å.
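Because the six rules are joined by "or", a variant is flagged as soon as any one of them fires. The following sketch shows only this rule combination; it assumes that the structural and alignment properties of each variant have already been computed elsewhere. The `Variant` record and all of its field names are hypothetical and are not taken from the original study.

```python
from dataclasses import dataclass, field

# Feature names as listed in rule 1.
FUNCTIONAL_FEATURES = {"ACTIVE_SITE", "BINDING_SITE", "SITE",
                       "MOD_RES", "DISULFID", "METAL"}


@dataclass
class Variant:
    """Hypothetical, precomputed description of one amino acid variant."""
    site_features: set = field(default_factory=set)  # annotations at the position
    seen_in_homologues: bool = True      # substitution observed in the alignment
    in_core: bool = False
    changes_electrostatics: bool = False
    on_surface: bool = False
    changes_accessibility: bool = False
    proline_in_helix: bool = False
    min_hetero_distance: float = 99.0    # in Angstrom, water excluded


def fired_rules(v: Variant) -> list:
    """Return the numbers of all rules (1 to 6) that apply to the variant."""
    checks = [
        bool(v.site_features & FUNCTIONAL_FEATURES),  # rule 1
        not v.seen_in_homologues,                     # rule 2
        v.in_core and v.changes_electrostatics,       # rule 3
        v.on_surface and v.changes_accessibility,     # rule 4
        v.proline_in_helix,                           # rule 5
        v.min_hetero_distance < 6.0,                  # rule 6
    ]
    return [number for number, hit in enumerate(checks, start=1) if hit]


def is_function_damaging(v: Variant) -> bool:
    return bool(fired_rules(v))


metal_site = Variant(site_features={"METAL"}, min_hetero_distance=3.5)
print(fired_rules(metal_site))          # [1, 6]
print(is_function_damaging(Variant()))  # False
```

Returning the list of fired rules rather than a bare yes/no makes it easier to see why a variant was flagged.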
Results
Control predictions on proteins with known function-damaging mutations

                                                      total   predicted   predicted
                                                              (absolute)  (percent)
Disease-causing mutations                                60          54          90
Function-damaging mutations, artificially generated      54          43          80

Results

                                                      total   predicted as function-damaging
                                                              (absolute)  (percent)
All polymorphisms                                       459         156          34
Experimentally proved polymorphisms                     245          79          32
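The percentages in both tables follow directly from the absolute counts, rounded to whole numbers. A short check, with the dictionary keys chosen only for this example:

```python
# (total, predicted as function-damaging), copied from the two result tables.
counts = {
    "disease-causing mutations": (60, 54),
    "artificially generated function-damaging mutations": (54, 43),
    "all polymorphisms": (459, 156),
    "experimentally proved polymorphisms": (245, 79),
}

percent = {name: round(100 * predicted / total)
           for name, (total, predicted) in counts.items()}

for name, value in percent.items():
    print(f"{name}: {value}%")
```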
False-negative predictions
[Figures: structure views labelled "isoleucine" and "serine"]

False-negative predictions
K A L G I S P F H E   Homo sapiens
K S L G I S P F H E   Ovis aries
K G L G L S P F H E   Gallus gallus
K T F G I S P F H E   Sminthopsis macroura
K A L G V S P F H E   Petaurus breviceps
K K L G L T P F H E   Rana catesbeiana
T N Q G S T P F H E   Sparus aurata
K K Q N L E S F F P   Escherichia coli
E S K J L D T F F P   Salmonella dublin
K A K N V E S F Y P   Caenorhabditis elegans
Part of the multiple alignment of the human transthyretin
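The compatibility check behind rule 2 can be illustrated on this alignment fragment: a substitution counts as compatible if the new residue already occurs in the same column of a homologous sequence. The sketch below uses the ten rows exactly as listed; the function names are illustrative, and columns are counted from 0.

```python
# Alignment rows as listed on the slide (one string per species).
ALIGNMENT = {
    "Homo sapiens": "KALGISPFHE",
    "Ovis aries": "KSLGISPFHE",
    "Gallus gallus": "KGLGLSPFHE",
    "Sminthopsis macroura": "KTFGISPFHE",
    "Petaurus breviceps": "KALGVSPFHE",
    "Rana catesbeiana": "KKLGLTPFHE",
    "Sparus aurata": "TNQGSTPFHE",
    "Escherichia coli": "KKQNLESFFP",
    "Salmonella dublin": "ESKJLDTFFP",
    "Caenorhabditis elegans": "KAKNVESFYP",
}


def observed_residues(column: int) -> set:
    """All residues found in one alignment column (0-based)."""
    return {sequence[column] for sequence in ALIGNMENT.values()}


def is_compatible(column: int, residue: str) -> bool:
    """True if the residue occurs in that column in at least one sequence."""
    return residue in observed_residues(column)


print(sorted(observed_residues(4)))  # ['I', 'L', 'S', 'V']
print(is_compatible(4, "S"))         # True
print(is_compatible(7, "W"))         # False
```

In this fragment, serine occurs in the column where the human sequence has isoleucine, so a purely column-based test would accept that substitution.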
Sequence Alignment
Search for evolutionary or functional similarity
Input: two nucleotide or amino acid sequences
Desired output: biologically meaningful similarity
Scoring of an alignment: sum of the scores of all aligned pairs plus the gap penalties
Score for amino acid pairs: substitution matrices (PAM, BLOSUM)
Difficulty: setting the gap penalties
Search for the optimal global alignment
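The scoring scheme can be written down directly: walk over the columns of a given alignment, add the substitution score for every aligned pair and subtract the gap penalty for every gap. The sketch below assumes a linear gap penalty and uses a toy match/mismatch score in place of a PAM or BLOSUM matrix; both function names are chosen for this example only.

```python
def score_alignment(row_a: str, row_b: str, pair_score, gap_penalty: int) -> int:
    """Score a given alignment column by column ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("both alignment rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            raise ValueError("a column must not consist of two gaps")
        if a == "-" or b == "-":
            total -= gap_penalty       # one residue aligned to a gap
        else:
            total += pair_score(a, b)  # aligned pair of residues
    return total


def toy_score(a: str, b: str) -> int:
    """Stand-in for a substitution matrix: +1 for a match, -1 otherwise."""
    return 1 if a == b else -1


print(score_alignment("ACG-T", "AC-TT", toy_score, gap_penalty=2))  # -1
```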
Sequence Alignment
Human alpha globin and human beta globin: true
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL
            G+ +VK+HGKKV  A+++++AH+D++      LS+LH  KL
HBB_HUMAN   GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

Human alpha globin and leghaemoglobin from yellow lupin: true
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
            ++ ++++H+ KV + +A ++ +L+ L+++H+ K
LGB2_LUPLU  NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG

Human alpha globin and glutathione S-transferase: false
HBA_HUMAN   GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSD----LHAHKL
            GS+ + G + +D L ++ H+ D+ A +AL D ++AH+
F11G11.2    GSGYLVGDSLTFVDLL--VAQHTADLLAANAALLDEFPQFKAHQE
Dynamic Programming
Application to optimisation problems
Development of a dynamic programming algorithm
(1) characterise the structure of an optimal solution
(2) recursively define the value of an optimal solution
(3) compute the value of an optimal solution in a bottom-up fashion
(4) construct an optimal solution from computed information
Dynamic Programming - Example
M         H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A  -16   -2   -1    5    0    5   -3    0   -2   -1   -1
W  -24   -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H  -32   10    0   -2   -2   -2   -3   -2   10    0    0
E  -40    0    6   -1   -3   -1   -3   -3    0    6    6
A  -48   -2   -1    5    0    5   -3    0   -2   -1   -1
E  -56    0    6   -1   -3   -1   -3   -3    0    6    6
Initialisation with BLOSUM50: the first row and column hold multiples of the gap penalty d = 8; all other cells hold the BLOSUM50 score of the corresponding residue pair.
Dynamic Programming - Example
(1) Optimal solution: the alignment with the highest score
(2) Recursive solution: three ways an alignment can be extended up to (i, j)
(a) x_i aligned to y_j
(b) x_i aligned to a gap
(c) y_j aligned to a gap

\[
M(i, j) = \max
\begin{cases}
M(i-1, j-1) + s(x_i, y_j) \\
M(i-1, j) - d \\
M(i, j-1) - d
\end{cases}
\]

d: gap penalty
s(x_i, y_j): score of the pair (x_i, y_j)
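The recurrence translates almost line by line into a bottom-up table fill. The sketch below is a minimal illustration, not a reference implementation: the names `fill_matrix` and `SCORES` are chosen for this example, the scores are the BLOSUM50 values for the residue pairs of the slide example, and the gap penalty d = 8 is read off the first row of the matrix. Rows correspond to x = PAWHEAE, columns to y = HEAGAWGHEE.

```python
GAP = 8  # linear gap penalty d

# BLOSUM50 scores s(x_i, y_j) for the residue pairs of the example,
# keyed as (row residue, column residue).
SCORES = {
    ("P", "H"): -2, ("P", "E"): -1, ("P", "A"): -1, ("P", "G"): -2, ("P", "W"): -4,
    ("A", "H"): -2, ("A", "E"): -1, ("A", "A"): 5, ("A", "G"): 0, ("A", "W"): -3,
    ("W", "H"): -3, ("W", "E"): -3, ("W", "A"): -3, ("W", "G"): -3, ("W", "W"): 15,
    ("H", "H"): 10, ("H", "E"): 0, ("H", "A"): -2, ("H", "G"): -2, ("H", "W"): -3,
    ("E", "H"): 0, ("E", "E"): 6, ("E", "A"): -1, ("E", "G"): -3, ("E", "W"): -3,
}


def fill_matrix(x: str, y: str, scores: dict, d: int) -> list:
    """Fill the global alignment matrix M bottom-up, row by row."""
    n, m = len(x), len(y)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = -i * d              # x_1..x_i aligned to gaps only
    for j in range(1, m + 1):
        M[0][j] = -j * d              # y_1..y_j aligned to gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + scores[(x[i - 1], y[j - 1])],  # (a) x_i with y_j
                M[i - 1][j] - d,                                 # (b) x_i with a gap
                M[i][j - 1] - d,                                 # (c) y_j with a gap
            )
    return M


M = fill_matrix("PAWHEAE", "HEAGAWGHEE", SCORES, GAP)
print(M[1][1:5])  # first computed cells of row P
print(M[-1][-1])  # score of the optimal global alignment
```

The two nested loops visit every cell once, so this linear-gap variant needs time and memory proportional to the product of the two sequence lengths.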
Dynamic Programming - Example
Row P is filled from left to right with gap penalty d = 8; each cell is the maximum of M(i-1, j-1) + s(x_i, y_j), M(i-1, j) - d and M(i, j-1) - d, in this order:
Computation of M(1, 1) = max{0 - 2, -8 - 8, -8 - 8} = -2
Computation of M(1, 2) = max{-8 - 1, -16 - 8, -2 - 8} = -9
Computation of M(1, 3) = max{-16 - 1, -24 - 8, -9 - 8} = -17
Computation of M(1, 4) = max{-24 - 2, -32 - 8, -17 - 8} = -25
Dynamic Programming - Example
M         H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
The completely calculated matrix
Dynamic Programming - Example
Computation of the optimal alignment path: traceback through the completed matrix from the bottom-right cell M(7, 10) = 1 to M(0, 0)
HEAGAWGHE-E
--P-AW-HEAE
The optimal alignment
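The traceback step can be sketched as follows. The code takes the completed matrix as given data (values as obtained from the recurrence with d = 8) and walks back from the bottom-right cell, at each step choosing a predecessor that explains the current value. The function name `traceback` and the tie-breaking order (diagonal first, then up, then left) are choices made for this example; other tie-breaking orders can yield different alignments with the same score.

```python
GAP = 8
X, Y = "PAWHEAE", "HEAGAWGHEE"   # rows and columns of the matrix

# BLOSUM50 scores for the residue pairs of the example: (row, column).
SCORES = {
    ("P", "H"): -2, ("P", "E"): -1, ("P", "A"): -1, ("P", "G"): -2, ("P", "W"): -4,
    ("A", "H"): -2, ("A", "E"): -1, ("A", "A"): 5, ("A", "G"): 0, ("A", "W"): -3,
    ("W", "H"): -3, ("W", "E"): -3, ("W", "A"): -3, ("W", "G"): -3, ("W", "W"): 15,
    ("H", "H"): 10, ("H", "E"): 0, ("H", "A"): -2, ("H", "G"): -2, ("H", "W"): -3,
    ("E", "H"): 0, ("E", "E"): 6, ("E", "A"): -1, ("E", "G"): -3, ("E", "W"): -3,
}

# Completed matrix M(i, j); row 0 and column 0 hold the gap-only prefixes.
M = [
    [0, -8, -16, -24, -32, -40, -48, -56, -64, -72, -80],
    [-8, -2, -9, -17, -25, -33, -41, -49, -57, -65, -73],
    [-16, -10, -3, -4, -12, -20, -28, -36, -44, -52, -60],
    [-24, -18, -11, -6, -7, -15, -5, -13, -21, -29, -37],
    [-32, -14, -18, -13, -8, -9, -13, -7, -3, -11, -19],
    [-40, -22, -8, -16, -16, -9, -12, -15, -7, 3, -5],
    [-48, -30, -16, -3, -11, -11, -12, -12, -15, -5, 2],
    [-56, -38, -24, -11, -6, -12, -14, -15, -12, -9, 1],
]


def traceback(M, x, y, scores, d):
    """Recover one optimal alignment by walking back from M(n, m)."""
    i, j = len(x), len(y)
    row_x, row_y = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + scores[(x[i - 1], y[j - 1])]:
            row_x.append(x[i - 1]); row_y.append(y[j - 1])   # x_i aligned to y_j
            i, j = i - 1, j - 1
        elif i > 0 and M[i][j] == M[i - 1][j] - d:
            row_x.append(x[i - 1]); row_y.append("-")        # x_i aligned to a gap
            i -= 1
        else:
            row_x.append("-"); row_y.append(y[j - 1])        # y_j aligned to a gap
            j -= 1
    return "".join(reversed(row_y)), "".join(reversed(row_x))


top, bottom = traceback(M, X, Y, SCORES, GAP)
print(top)
print(bottom)
```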
Dynamic Programming - Example
Needleman/Wunsch algorithm for global alignment, O(n^3)
  Needleman & Wunsch (1970) J. Mol. Biol. 48:443-453.
Gotoh algorithm for global alignment, O(n^2)
  Gotoh (1982) J. Mol. Biol. 162:705-708.
Smith/Waterman algorithm for local alignment, O(n^2)
  Smith & Waterman (1981) J. Mol. Biol. 147:195-197.
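For comparison with the global case, local alignment changes the recurrence in one place: a cell may also take the value 0, which lets an alignment start anywhere, and the result is the largest cell in the whole matrix. The sketch below computes only the best local score. It assumes a linear gap penalty and a toy match/mismatch scoring (the default values are illustrative, not taken from the slides), and keeps two rows in memory at a time.

```python
def local_alignment_score(x: str, y: str, match: int = 2,
                          mismatch: int = -1, gap: int = 1) -> int:
    """Best local alignment score with a linear gap penalty (toy scoring)."""
    best = 0
    previous = [0] * (len(y) + 1)        # row i-1 of the matrix
    for a in x:
        current = [0]                    # column 0 of row i
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cell = max(
                0,                       # start a new local alignment here
                previous[j - 1] + s,     # extend with an aligned pair
                previous[j] - gap,       # residue of x aligned to a gap
                current[j - 1] - gap,    # residue of y aligned to a gap
            )
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# The shared segment ACGT gives four matches.
print(local_alignment_score("GGACGTCC", "TTACGTAA"))  # 8
```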
Summary
SNP analysis as a typical example for bioinformatics
  Sunyaev, Ramensky, Lathe III, Kondrashov, Bork (2001) Human Molecular Genetics 10:591-597.
  database parsing, multiple sequence alignment, rule-based system, molecular modelling
Application of dynamic programming to sequence alignment
  Gotoh algorithm for pair-wise global sequence alignment
Outlook
Application of graph theory to protein structure analysis
  PTGL - Protein Topology Graph Library, http://sanaga.tfh-berlin.de/~ptgl/ptgl.html
  May, Barthel, Koch (2004) Bioinformatics, in press.
  Koch (2001) Theoretical Computer Science 250:1-30.
Investigations of Alternative Splicing
  Boué, Vingron, Koch (2002) Bioinformatics, suppl. 2, 18:S65-S75.
  Kriventseva, Koch, Apweiler, Vingron, Bork, Gelfand, Sunyaev (2003) Trends in Genetics 19:124-128.
Outlook
Modelling, analysis, and simulation of biological molecular networks using Petri net theory, in co-operation with BTU Cottbus (Prof. M. Heiner)
  Voss, Heiner, Koch (2003) In Silico Biology 3:0031.
  Heiner, Koch, Will (2004) BioSystems, Special Issue 75(1-3):15-28.
  Heiner & Koch (2004) Proc. 25th ICAPTN, LNCS 3099:216-237.
  Koch, Junker, Heiner (2004) Bioinformatics, in press.
Ongoing projects:
1. Human glycolysis with coloured Petri nets - Thomas Runge
2. Metabolism in the human liver cell - Daniel Schrödter
3. G1/S phase in the mammalian cell cycle - Dr. Thomas Kaunath
4. Duchenne muscular dystrophy - Stepfanie Grunwald
Thank you!