Post on 03-Jan-2016
Computational Molecular Biology
Introduction and Preliminaries
My T. Thaimythai@cise.ufl.edu
2
Preliminaries in Computer Science
Strings and alphabet Basic notations in graph theory Algorithms and Complexity
My T. Thaimythai@cise.ufl.edu
3
Strings
Consist of a sequence of letters: DNA: four nucleotides A, C, G, T Proteins: 20 symbol alphabet of animo acids
Given a string s, we have the following notations: Length: |s| Substring: ACT is a substring of ATGACTG Superstring: ATGACTG is a superstring of ACT Index and interval: s[i] and s[i..j] Prefix and suffix: s[1..j] and s[i..|s|]
My T. Thaimythai@cise.ufl.edu
4
Graphs
G = (V, E) where V is a set of vertices and E is a set of edges
Undirected graph: edges are undirected Directed graph: edges are directed Weighted graph G = (V, E, w) where each edge
has some weight Some special graphs: complete graph, bipartite
graph, tree, and interval graph Subgraph, spanning tree, steiner tree
My T. Thaimythai@cise.ufl.edu
5
Interval Graphs
Intersection graph of a set of intervals on the real line
A vertex represents an interval and an edge (u, v) exists if intervals u and v intersect
My T. Thaimythai@cise.ufl.edu
6
Some Problems in Graphs
Euler circuit: Given a graph, find a cycle that passes through each edge exactly once
Hamiltonian circuit: Given a graph, find a cycle that passes through each vertex exactly once
Minimum Spanning Tree: Given a weighted undirected graph, find a spanning tree with minimum total weight
Maximum Matching: Given an undirected graph, find a maximum cardinality matching, which is a subset of edges such that no two edges in the subset share an endpoint
My T. Thaimythai@cise.ufl.edu
7
P vs. NP
Class of P: Set of problems solvable by polynomial-time algoirthms
Class of NP: Set of problems whose solutions, once found, can be verified in polynomial time
NP-complete (NP-hard) problems: cannot obtain an optimal solutions in polynomial time
My T. Thaimythai@cise.ufl.edu
8
Some approaches for NP-complete Problems
Special-case method: Work on the problem with a restricted class of inputs
Exhaustive search: Design an exponential-time algorithms that may perform well in practice
Approximation algorithms: Design a polynomial-time algorithm that is guaranteed to find near-optimal solutions (with a good approximation ratio)
Heuristics: Fast algorithms that produce satisfactory solutions most of the time but without guarantee
My T. Thaimythai@cise.ufl.edu
9
Preliminaries in Molecular Biology
My T. Thaimythai@cise.ufl.edu
10
DNA and Base Pairs
Double helix consisting of two dual strands
Has four types of nucleotides: Adenine, Thymine, Guanine, Cytosine
Base Pairs: A↔T, C↔G Two ends of a strand are
marked with 3’ and 5’ The entire DNA of a living
organism is called its genome
My T. Thaimythai@cise.ufl.edu
11
DNA Sequences
My T. Thaimythai@cise.ufl.edu
12
DNA Replication
Strands are separatedEach strand is replicated
using one of the parental strands as a template
My T. Thaimythai@cise.ufl.edu
13
Cell, Chromosome, and DNA
My T. Thaimythai@cise.ufl.edu
14
Cell Classification
My T. Thaimythai@cise.ufl.edu
15
Chromosomes
Consists of a DNA molecule associated with proteins that fold and pack the DNA thread into a more compact structure and proteins required for the process of gene expression, DNA replication and DNA repair.
Human genome is distributed over 24 chromosomes Each cell contains 46 chromosomes
22 pairs common to both males and females2 sex chromosome X and Y in males and two Xs in
female
My T. Thaimythai@cise.ufl.edu
16
Genes Segments of DNA
Functional and physical unit of heredity passed from parent to offspring
Contain the information for making a specific protein
My T. Thaimythai@cise.ufl.edu
17
Proteins
Shorts strings in the amino acid 20-letter alphabet
Human genome: about 100,000 proteins, with each protein a few hundred amino acids long
Bacteria make 500-1500 proteins Made by genes (fragments of DNA) that are
roughly three times longer than the corresponding proteins.
Why? Every 3 nucleotides in the DNA alphabet code one letter in the protein alphabet of amino acids
My T. Thaimythai@cise.ufl.edu
18
Central Dogma of Molecular Biology
My T. Thaimythai@cise.ufl.edu
19
Transcription
My T. Thaimythai@cise.ufl.edu
20
Translation Translation
mRNA (after exported out of the nucleus and reaching the cytosol) directs the synthesis of the protein by joining together amino acids in the order encoded by the mRNA
Genetic codeDefines a mapping between codons and amino acid.Codon
Triplet of nucleotides specifies a single amino acid in a corresponding protein
64 codons and 20 amino acids
Translation is carried out by ribosomes
My T. Thaimythai@cise.ufl.edu
21
Polymerase Chain Reaction (PCR)
PrimerNucleic acid
strandServes as a
starting point of DNA replication
My T. Thaimythai@cise.ufl.edu
22
Plasmid Vector
Plasmid Circular and double-stranded DNA
Antibiotic resistance
Automatic replication
Exists in bacteria
Vector an agent that can carry a DNA fragment into a host cell
My T. Thaimythai@cise.ufl.edu
23
DNA Cloning Using Plasmids as Vectors
(a) DNA recombination
(b) Transformation
My T. Thaimythai@cise.ufl.edu
24
DNA Cloning Using Plasmids as Vectors (Cont) (c) Selective amplification
(d) Isolation of desired DNA clones
My T. Thaimythai@cise.ufl.edu
25
DNA Library Screening Probe:
Labeled with radioisotope or fluorescence
Used to detect specific DNA sequences by hybridization
Hybridization: Binding of two nucleic acid
chains by base paring
DNA Library Screening To identify each clone whether it
contains a probe from a given set of probes
Positive clone: contains a probe
My T. Thaimythai@cise.ufl.edu
26
Some Computational Problems
Pooling Design Non-unique probe selection Sequence Alignment, Multi Sequence
Alignment DNA sequencing Genome Rearrangement Protein Structure Prediction and Recognition Protein-Protein Interactions Functional Groups, Modules
My T. Thaimythai@cise.ufl.edu
27
Pooling Designs
Problem Definition Given a set of n clones with at most d positive
clones Identify all positive clones with the minimum
number of tests
Pool: a subset of clones Positive pool: a pool contains at least one positive
clone
My T. Thaimythai@cise.ufl.edu
28
Pooling Designsclones
c1 c2 cj cn
p1 0 0 … 0 … 0 … 0 … 0 0 p2 0 1 … 0 … 0 … 0 … 0 1
pools . .. .
pi 0 0 … 0 … 1 … 0 … 0 1. .. .
pt 0 0 … 0 … 0 … 0 … 0 0 txn tx1
M[i, j] = 1 iff the ith pool contains the jth clone
Decoding Algorithm: Given M and V(D), identify all positive clones
Testing
V(D)
Mtxn =
My T. Thaimythai@cise.ufl.edu
29
Challenges
Challenge 1: How to construct the binary matrix M such that: Outputs of any union of d columns are distinct
Challenge 2: How to design a decoding algorithm with efficient time complexity [O(tn)]
My T. Thaimythai@cise.ufl.edu
30
Probe Selection
Problem Definition: Given a biological sample (e.g., blood) and a set of
probes Identify the presence (or absence) of some
biological objects (e.g., viruses or bacteria) with the minimum number of probes
My T. Thaimythai@cise.ufl.edu
31
Unique Probes VS. Non-unique Probes
Unique probes Gene-specific probes or signature probes. Difficult to find
Non-unique probes Hybridize to more than one target. Difficult to decode the results
My T. Thaimythai@cise.ufl.edu
32
Probe-Target Matrix 12 probe candidates. 4 targets (genes). For target set S, define P(S) as set of
probes reacting to any gene in S. P({1, 2}) = {1, 2, 3, 4, 7, 8, 9, 10, 12}. P({2, 3}) = {1, 3, 4, 5, 6, 7, 8, 9, 12}. Symmetric set difference: P({1, 2})∆P({2, 3}) = {2, 5, 6, 10}. Probes
that separate two sets.
My T. Thaimythai@cise.ufl.edu
33
Sequence Alignment
Problem Definition: Given: 2 DNA or protein sequences Find: Best match between them
What is an Alignment: Given: 2 Strings S and S’ Goal: The lengths of S and S’ are the same by inserting
spaces into these strings
A -- T C -- A
-- C T C A A
My T. Thaimythai@cise.ufl.edu
34
Matches, Mismatches and Indels
Match: two aligned, identical characters in an alignment
Mismatch: two aligned, unequal characters Indel: A character aligned with a space
A A C T A C T -- C C T A A C A C T -- ---- -- C T C C T A C C T -- -- T A C T T T
10 matches, 2 mismatches, 7 indels
My T. Thaimythai@cise.ufl.edu
35
Basic Algorithmic Problem
Find the alignment of the two strings that: Max m where m = (# matches – mismatches –
indels)
m defines the similarity of the two strings, also called Optimal Global Alignment
Biologically: a mismatch represents a mutation, whereas an indel represents a historical insertion or deletion of a single character
My T. Thaimythai@cise.ufl.edu
36
Multiple Sequence Alignment
Problem Definition: Similar to the sequence alignment problem but the
input has more than 2 strings
Challenges: NP-hard Guarantee factor: 2 – 2/k where k is the number of
the input sequences. More work to reduce the time and space complexity
My T. Thaimythai@cise.ufl.edu
37
DNA Sequencing Problem Definition:
Given a set of fragments that are contained in a DNA string S
Goal: Determine the string S NP-complete Further complicated due to the existence of
repetitive sequences in the genome Can cast this as a Hamiltonian path or Euler path
problem (was introduced by Pavel Pavzner)
My T. Thaimythai@cise.ufl.edu
38
Genome Rearrangement
Problem Definition: Given genomes of 2 different species Goal: Find a sequence of evolutionary events that
turn the first genome to the second one.
Biological reasons: How close between these species, how much evolution separate these species.
E.g.: We usually test new drugs on mice before humans. However, how close is a mouse to a human?
My T. Thaimythai@cise.ufl.edu
39
Genome Rearrangement
Can we use the solutions of sequence alignment to solve this problem?
Answer: NO, because: Genome is a very long strings (3 million letters for
a human genome Model of sequence alignment is not appropriate for
human genome comparison since the differences are not in terms of insertions/deletion/mutations of a nucleotide, but a rearrangement of a long DNA regions
The basic comparison is gene
My T. Thaimythai@cise.ufl.edu
40
An Example
• If we compare these two strings by sequence alignment, it’s impossible
• However, the second string is the first string after reverse the fragment AATGGT…CCC.
My T. Thaimythai@cise.ufl.edu
41
Main Evolutionary Events
Deletions: A fragment is removed Duplications: create many copies of a fragment
and insert into different positions Transpositions: A fragment is removed and re-
inserted into a different position Inversions: A fragment is removed, reversed,
and then reinserted into the same position Translocations: A pair of fragments are
exchanged between the ends of two chromosomes
My T. Thaimythai@cise.ufl.edu
42
My T. Thaimythai@cise.ufl.edu
43
Protein Structure Prediction
Problem Definition: Given: A sequence of amino acids Goal: Predict the 3D structure of the protein
Some approaches: Determine the position of a protein’s atoms so as to
minimize the total free energy Find the similarities to some known proteins
My T. Thaimythai@cise.ufl.edu
44
Community Structure
Problem Definition: Given a graph G = (V, E) representing a network Partition G into a set of subgraph (community structure) so
that nodes in each subgraph are highly connected
Biological reason: Genes with similar expression data may have similar functions. Identify the community structure can help us to reduce the number of tests
Others: Community structure is also studied in different fields