Introduction to bioinformatics
Lecture 3
High-throughput Biological Data
-data deluge, bioinformatics algorithms-
and evolution
Centre for Integrative Bioinformatics VU
Last lecture:
• Many different genomics datasets:
– Genome sequencing: more than 300 species completely sequenced and data in public domain (i.e. information is freely available); a virus genome can be sequenced in a day
– Gene expression (microarray) data: many microarrays measured per day
– Proteomics: Protein Data Bank (PDB) - as of Tuesday February 07, 2006 there are 35026 Structures. http://www.rcsb.org/pdb/
– Protein-protein interaction data: many databases worldwide
– Metabolic pathway, regulation and signaling data, many databases worldwide
The data deluge
[Figure: growth in number of protein tertiary structures]
Although a lot of tertiary structural data is being produced (preceding slide), there is the
SEQUENCE-STRUCTURE-FUNCTION GAP
The gap between sequence data on the one hand, and structure or function data on the other, is widening rapidly: Sequence data grows much faster
High-throughput Biological Data
The data deluge
• Hidden in all these data classes is information that reflects
– existence, organization, activity, functionality ……
of biological machineries at different levels in living organisms
Most effectively utilising and analysing this information computationally is essential for
Bioinformatics
Data issues: from data to distributed knowledge
• Data collection: getting the data
• Data representation: data standards, data normalisation …..
• Data organisation and storage: database issues …..
• Data analysis and data mining: discovering “knowledge”, patterns/signals, from data, establishing associations among data patterns
• Data utilisation and application: from data patterns/signals to models for bio-machineries
• Data visualization: viewing complex data ……
• Data transmission: data collection, retrieval, …..
• ……
Bio-Data Analysis and Data Mining
• Analysis and mining tools exist and are developed for:
– DNA sequence assembly
– Genetic map construction
– Sequence comparison and database searching
– Gene finding
– Gene expression data analysis
– Phylogenetic tree analysis, e.g. to infer horizontally-transferred genes
– Mass spectrometry data analysis for protein complex characterization
– ……
Bio-Data Analysis and Data Mining
• As the amount and types of data and their cross connections increase rapidly
• the number of analysis tools needed will go up “exponentially” if we do not reuse techniques
– blast, blastp, blastx, blastn, … from BLAST family of tools (we will cover BLAST later)
– gene finding tools for human, mouse, fly, rice, cyanobacteria, …..
– tools for finding various signals in genomic sequences, protein-binding sites, splice junction sites, translation start sites, …..
Bio-Data Analysis and Data Mining
Many of these data analysis problems are fundamentally the same problem(s) and can be solved using the same set of tools
e.g.
•clustering or
•optimal segmentation by Dynamic Programming
We will cover both of these techniques in later lectures
Bio-data Analysis, Data Mining and Integrative Bioinformatics
To have analysis capabilities covering a wide range of problems, we need to discover the common fundamental structures of these problems;
HOWEVER in biology one size does NOT fit all…
An important goal of bioinformatics is development of a data analysis
infrastructure in support of Genomics and beyond
Protein structure hierarchical levels
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
QUATERNARY STRUCTURE (oligomers)
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
Protein complexes for photosynthesis in plants
Protein folding problem
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
PRIMARY STRUCTURE (amino acid sequence)
SECONDARY STRUCTURE (helices, strands)
TERTIARY STRUCTURE (fold)
Each protein sequence “knows” how to fold into its tertiary structure. We still do not understand exactly how and why
1-step process
2-step process
The 1-step process is based on a hydrophobic collapse; the 2-step process, more common in forming larger proteins, is called the framework model of folding
Protein folding: a step on the way is secondary structure prediction
• Long history -- first widely used algorithm was by Chou and Fasman (1974)
• Different algorithms have been developed over the years to crack the problem:
– Statistical approaches
– Neural networks (first from speech recognition)
– K-nearest neighbour algorithms
– Support Vector Machines
Algorithms in bioinformatics (recap)
• Sometimes the same basic algorithm can be re-used for different problems (1-method-multiple-problem)
• Normally, biological problems are approached by different researchers using a variety of methods (1-problem-multiple-method)
Algorithms in bioinformatics
• string algorithms
• dynamic programming
• machine learning (Neural Networks, k-Nearest Neighbour, Support Vector Machines, Genetic Algorithm, ..)
• Markov chain models, hidden Markov models, Markov Chain Monte Carlo (MCMC) algorithms
• molecular mechanics, e.g. molecular dynamics, Monte Carlo, simplified force fields
• stochastic context free grammars
• EM algorithms
• Gibbs sampling
• clustering
• tree algorithms
• text analysis
• hybrid/combinatorial techniques and more…
Sequence analysis and homology searching
Finding genes and regulatory elements
There are many different regulation signals such as start, stop and skip messages hidden in the genome for each gene, but what and where are they?
Expression data Functional genomics
• Monte Carlo
Protein translation
What is life?
• NASA astrobiology program:
“Life is a self-sustained chemical system capable of undergoing Darwinian evolution”
Evolution
Four requirements:
• Template structure providing stability (DNA)
• Copying mechanism (DNA replication, passed on through cell division, e.g. meiosis)
• Mechanism providing variation (mutations; insertions and deletions; crossing-over; etc.)
• Selection: some traits lead to greater fitness of one individual relative to another. Darwin adopted Herbert Spencer's phrase “survival of the fittest” for this
Evolution is a conservative process: the vast majority of mutations will not be selected (i.e. will not make it as they lead to worse performance or are even lethal) – this is called negative (or purifying) selection
Orthology/paralogy
Orthologous genes are homologous (corresponding) genes in different species
Paralogous genes are homologous genes within the same species (genome)
Changing molecular sequences
• Mutations: changing nucleotides (‘letters’) within DNA, also called ‘point mutations’
• A & G: purines, C & T/U: pyrimidines:
– Transition: purine -> purine or pyrimidine -> pyrimidine
– Transversion: purine -> pyrimidine or pyrimidine -> purine
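The transition/transversion rule above can be sketched in a few lines of Python. This is only an illustration: the function name `classify_substitution` and the returned labels are assumptions, not part of the lecture.

```python
# Base classes as given on the slide: A & G are purines, C & T/U are pyrimidines.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_substitution(old_base, new_base):
    """Classify a single-nucleotide change (hypothetical helper).

    Returns "transition", "transversion", or "none" when the base is unchanged.
    """
    old_base, new_base = old_base.upper(), new_base.upper()
    valid = PURINES | PYRIMIDINES
    if old_base not in valid or new_base not in valid:
        raise ValueError("bases must be one of A, C, G, T, U")
    if old_base == new_base:
        return "none"
    # Same chemical class on both sides means a transition.
    same_class = (old_base in PURINES) == (new_base in PURINES)
    return "transition" if same_class else "transversion"


print(classify_substitution("A", "G"))
print(classify_substitution("A", "C"))
```

Each base has one possible transition and two possible transversions, which is why the two classes are usually counted separately.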
Types of point mutation
• Synonymous mutation: mutation that does not lead to an amino acid change (where in the codon are these expected?)
• Non-synonymous mutation: does lead to an amino acid change
– Missense mutation: one amino acid (a.a.) replaced by another a.a.
– Nonsense mutation: a.a. replaced by stop codon (what happens with protein?)
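As a rough illustration of these categories, the sketch below classifies a single-base change in one codon using the standard genetic code. The helper name `classify_point_mutation` and the extra "stop-loss" label (a stop codon mutated into a sense codon, a case the slide does not list) are assumptions made for this example.

```python
# Standard genetic code, built from the compact 64-letter string with
# bases ordered T, C, A, G at each codon position ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_point_mutation(codon, position, new_base):
    """Classify a point mutation at codon position 0, 1 or 2 (hypothetical helper)."""
    codon = codon.upper().replace("U", "T")
    new_base = new_base.upper().replace("U", "T")
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if before == after:
        return "synonymous"
    if after == "*":
        return "nonsense"
    if before == "*":
        return "stop-loss"  # not one of the slide's categories
    return "missense"


print(classify_point_mutation("GAA", 2, "G"))  # GAA (Glu) -> GAG (Glu)
print(classify_point_mutation("GAA", 0, "T"))  # GAA (Glu) -> TAA (stop)
print(classify_point_mutation("GAA", 1, "T"))  # GAA (Glu) -> GTA (Val)
```

Trying all three positions of a few codons with this helper is one way to explore the slide's question about where in the codon synonymous changes tend to fall.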
Ka/Ks Ratios
• Ks is defined as the number of synonymous nucleotide substitutions per synonymous site
• Ka is defined as the number of nonsynonymous nucleotide substitutions per nonsynonymous site
• The Ka/Ks ratio is used to estimate the type of selection exerted on a given gene or DNA fragment
• Aligned orthologous sequences are needed to calculate Ka/Ks ratios (we will talk about alignment later).
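Written as formulas, the definitions above look roughly as follows. The symbols are assumptions introduced here, not taken from the slides: $S_d$ and $N_d$ are the observed numbers of synonymous and nonsynonymous differences between the two aligned sequences, and $S$ and $N$ are the numbers of synonymous and nonsynonymous sites.

```latex
K_s \approx \frac{S_d}{S}, \qquad
K_a \approx \frac{N_d}{N}, \qquad
\omega = \frac{K_a}{K_s}
```

These are uncorrected proportions. Counting methods such as Nei and Gojobori's typically also correct each proportion $p$ for multiple substitutions at the same site, for example with the Jukes-Cantor formula $d = -\tfrac{3}{4}\ln\!\left(1 - \tfrac{4}{3}p\right)$, before taking the ratio.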
Ka/Ks ratios
The frequency of different values of Ka/Ks for 835 mouse–rat orthologous genes. Figures on the x axis represent the middle figure of each bin; that is, the 0.05 bin collects data from 0 to 0.1
Ka/Ks ratios
Three types of selection:
1. Negative (purifying) selection -> Ka/Ks < 1
2. Neutral selection (Kimura) -> Ka/Ks ~= 1
3. Positive selection -> Ka/Ks > 1
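The three cases can be turned into a small decision rule. This is a minimal sketch: the function name `selection_type` and the `tolerance` used to decide what counts as "approximately 1" are assumptions, and real analyses would use a statistical test rather than a fixed cut-off.

```python
def selection_type(ka, ks, tolerance=0.1):
    """Label the selection regime from Ka and Ks (hypothetical helper).

    `tolerance` is an arbitrary illustrative width for "Ka/Ks ~= 1".
    """
    if ks <= 0:
        raise ValueError("Ks must be positive to form the ratio")
    ratio = ka / ks
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    return "negative (purifying)" if ratio < 1.0 else "positive"


print(selection_type(0.02, 0.20))  # ratio about 0.1
print(selection_type(0.10, 0.10))  # ratio 1
print(selection_type(0.30, 0.10))  # ratio about 3
```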
Human Evolution
Divergent Evolution
Ancestral sequence: ABCD
Descendants: ACCD (B -> C, mutation) and ABD (C -> ø, deletion)
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD
Evolution
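Ancestral sequence: ABCD
Descendants: ACCD (B -> C, mutation) and ABD (C -> ø, deletion)
Pairwise alignment:
ACCD      ACCD
AB─D  or  A─BD
True alignment: the first one, because the gap sits where the C was deleted and the mutated position (B -> C) is paired with the original B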
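The ambiguity between the two alignments can be reproduced with a small global-alignment sketch using dynamic programming (the Needleman-Wunsch scheme). The scoring values (match +1, mismatch -1, gap -1) and the function name `global_align` are assumptions chosen for illustration, not values from the lecture.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal path (ties are broken in a fixed order).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = global_align("ACCD", "ABD")
print(best)
print(row_a)
print(row_b)
```

Under this scoring both candidate alignments (AB-D and A-BD against ACCD) reach the same optimal score, so the algorithm cannot tell which one reflects the true evolutionary history; the traceback simply reports whichever it meets first.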
Consequence of evolution
• Notion of comparative analysis (Darwin)
• What you know about one species might be transferable to another, for example from mouse to human
• Provides a framework to do the multi-level large-scale analysis of the genomics data plethora
Flavodoxin-cheY Multiple Sequence Alignment
This pathway diagram shows a comparison of pathways in (left) Homo sapiens (human) and (right) Saccharomyces cerevisiae (baker’s yeast). Changes in controlling enzymes (square boxes in red) and the pathway itself have occurred (yeast has one altered (‘overtaking’) path in the graph)
We need to be able to do automatic pathway comparison (pathway alignment)
The citric-acid cycle
Human (left), Yeast (right)
http://en.wikipedia.org/wiki/Krebs_cycle
The citric-acid cycle
Fig. 1. (a) A graphical representation of the reactions of the citric-acid cycle (CAC), including the connections with pyruvate and phosphoenolpyruvate, and the glyoxylate shunt. When there are two enzymes that are not homologous to each other but that catalyse the same reaction (non-homologous gene displacement), one is marked with a solid line and the other with a dashed line. The oxidative direction is clockwise. The enzymes with their EC numbers are as follows: 1, citrate synthase (4.1.3.7); 2, aconitase (4.2.1.3); 3, isocitrate dehydrogenase (1.1.1.42); 4, 2-ketoglutarate dehydrogenase (solid line; 1.2.4.2 and 2.3.1.61) and 2-ketoglutarate ferredoxin oxidoreductase (dashed line; 1.2.7.3); 5, succinyl-CoA synthetase (solid line; 6.2.1.5) or succinyl-CoA–acetoacetate-CoA transferase (dashed line; 2.8.3.5); 6, succinate dehydrogenase or fumarate reductase (1.3.99.1); 7, fumarase (4.2.1.2) class I (dashed line) and class II (solid line); 8, bacterial-type malate dehydrogenase (solid line) or archaeal-type malate dehydrogenase (dashed line) (1.1.1.37); 9, isocitrate lyase (4.1.3.1); 10, malate synthase (4.1.3.2); 11, phosphoenolpyruvate carboxykinase (4.1.1.49) or phosphoenolpyruvate carboxylase (4.1.1.32); 12, malic enzyme (1.1.1.40 or 1.1.1.38); 13, pyruvate carboxylase or oxaloacetate decarboxylase (6.4.1.1); 14, pyruvate dehydrogenase (solid line; 1.2.4.1 and 2.3.1.12) and pyruvate ferredoxin oxidoreductase (dashed line; 1.2.7.1).
M. A. Huynen, T. Dandekar and P. Bork "Variation and evolution of the citric acid cycle: a genomic approach" Trends Microbiol, 7, 281-29 (1999)
The citric-acid cycle
(b) Individual species might not have a complete CAC. This diagram shows the genes for the CAC for each unicellular species for which a genome sequence has been published, together with the phylogeny of the species. The distance-based phylogeny was constructed using the fraction of genes shared between genomes as a similarity criterion [29]. The major kingdoms of life are indicated in red (Archaea), blue (Bacteria) and yellow (Eukarya). Question marks represent reactions for which there is biochemical evidence in the species itself or in a related species but for which no genes could be found. Genes that lie in a single operon are shown in the same color. Genes were assumed to be located in a single operon when they were transcribed in the same direction and the stretches of non-coding DNA separating them were less than 50 nucleotides in length.
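The operon rule in the caption (same transcription direction, less than 50 nucleotides of non-coding DNA in between) can be sketched as follows. The gene tuples, the gene names and the function `group_into_operons` are hypothetical, and the coordinates are assumed to be 1-based and inclusive.

```python
def group_into_operons(genes, max_gap=50):
    """Group neighbouring genes into putative operons (hypothetical helper).

    Each gene is a tuple (name, start, end, strand). Two consecutive genes are
    joined when they share a strand and fewer than `max_gap` non-coding
    nucleotides separate them.
    """
    ordered = sorted(genes, key=lambda g: g[1])
    operons = []
    for gene in ordered:
        if operons:
            previous = operons[-1][-1]
            gap = gene[1] - previous[2] - 1  # nucleotides strictly between the genes
            if gene[3] == previous[3] and gap < max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return [[g[0] for g in operon] for operon in operons]


# Made-up coordinates, for illustration only.
example = [
    ("geneA", 1, 100, "+"),
    ("geneB", 120, 300, "+"),   # 19 nt after geneA, same strand: joined
    ("geneC", 400, 500, "+"),   # 99 nt after geneB: new operon
    ("geneD", 510, 600, "-"),   # close to geneC but opposite strand: new operon
]
print(group_into_operons(example))
```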
Thinking about evolution
• Is the evolutionary model applicable to other systems?
– Story telling in old cultures
– Richard Dawkins’ book entitled The Selfish Gene talks about memes
• The Genetic Algorithm (GA) is arguably one of the best general-purpose computational optimisation strategies around, and is modelled directly on Darwinian evolution
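To make the link with the four requirements of evolution concrete, here is a minimal genetic-algorithm sketch. Everything in it is an assumption chosen for illustration: the toy "OneMax" fitness (count the 1-bits), the parameter values, and the function names. The bit string plays the role of the template, copying with one-point crossover and random bit flips provides variation, and keeping the fitter half as parents is the selection step.

```python
import random


def one_max(bits):
    """Toy fitness function: the number of 1-bits in the string."""
    return sum(bits)


def genetic_algorithm(n_bits=20, pop_size=30, generations=40,
                      mutation_rate=0.02, seed=0):
    """Evolve bit strings towards all ones; returns (best, best_fitness_history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=one_max, reverse=True)
        history.append(one_max(population[0]))
        next_population = [population[0][:]]      # elitism: keep the best unchanged
        parents = population[:pop_size // 2]      # selection: fitter half reproduces
        while len(next_population) < pop_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            next_population.append(child)
        population = next_population
    best = max(population, key=one_max)
    history.append(one_max(best))
    return best, history


best, history = genetic_algorithm()
print(one_max(best), "of", len(best), "bits set")
```

Because the best individual is always copied unchanged into the next generation, the best fitness in `history` can never decrease; whether the optimum is reached depends on the parameters and the random seed.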