Sequence Alignment - Baylor ECSweb.ecs.baylor.edu/faculty/cho/3350/2_SequenceAlignment.pdf · PAM...
9/1/2015
Young-Rae Cho
Associate Professor
Department of Computer Science
Baylor University
BINF 3350, Genomics and Bioinformatics
Sequence Alignment
BINF 3350, Chapter 4, Sequence Alignment
1. Sequence Alignment
2. Dynamic Programming
3. Scoring Alignments
4. Gap Penalty
5. Global vs. Local Alignment
6. Pairwise vs. Multiple Sequence Alignment
7. Sequence Homolog Search
8. Motif Search
Sequence Homology
Homologs
Similar sequence and Common ancestor
Similar sequence and Same function (in divergent evolution)
Orthologs
Homologous sequences in different species by species divergence
Paralogs
Homologous sequences in the same species by gene duplication
Analogs
Similar sequence and No common ancestor (in convergent evolution)
Sequence Similarity
Importance of finding similar (DNA or protein) sequences
Evolutionary closeness
Relationship between sequences and evolution
Functional similarity
Relationship between sequences and functions
How to measure sequence similarity
(Method 1) Counting identical letters on each position
(Method 2) Inserting gaps to maximize the number of identical letters
Sequence alignment
A T G T T A T
    | |     |
T C G T A C T

A T - G T T A - T
  |   | |   |   |
- T C G T - A C T
Sequence Alignment
Sequence Alignment
Aligning two or more sequences to maximize their similarity including gaps
How to find sequence alignment?
(1) Measuring edit distance
Edit Distance (1)
Definition
Edit distance between two sequences x and y : the minimum number of
editing operations (insertion, deletion, substitution) to transform x into y
Example
x=“TGCATAT” (m=7), y=“ATCCGAT” (n=7)
TGCATAT
ATGCATAT    insertion of “A”
ATCCATAT    substitution of “G” with “C”
ATCCGATAT   insertion of “G”
ATCCGATT    deletion of “A”
ATCCGAT     deletion of “T”
edit distance = 5 ?
Edit Distance (2)
Example
x=“TGCATAT” (m=7), y=“ATCCGAT” (n=7)
TGCATAT
ATGCATAT    insertion of “A”
ATGCAAT     deletion of “T”
ATCCAAT     substitution of “G” with “C”
ATCCGAT     substitution of “A” with “G”
edit distance = 4 ?
Can it be done in 3 steps?
How to measure edit distance efficiently?
A T G T T A T G C A A T G T A C T T A
T C G T A C T C A G T T C A A G T C A
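One way to answer the efficiency question is dynamic programming over prefixes of x and y. The sketch below is a minimal illustration, not the course's own code; the function name `edit_distance` is an assumption, and each insertion, deletion, or substitution is assumed to cost 1, as in the definition above.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions, and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i in range(1, len(x) + 1):
        cur = [i]  # turning x[:i] into the empty string takes i deletions
        for j in range(1, len(y) + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            cur.append(min(prev[j] + 1,          # delete x[i-1]
                           cur[j - 1] + 1,       # insert y[j-1]
                           prev[j - 1] + cost))  # match or substitute
        prev = cur
    return prev[-1]


print(edit_distance("TGCATAT", "ATCCGAT"))
```

Only two rows of the table are kept at a time, so memory grows with the length of y rather than with m × n.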
Edit Distance (3)
Example in 2-Row Representation
x=“ATCTGATG” (m=8), y=“TGCATAC” (n=7)
x: A T C T G A T G
y: T G C A T A C
4 matches, 1 substitution, 3 insertions, 2 deletions
x: A T C T G A T G
y: T G C A T A C
4 matches, 4 insertions, 3 deletions
Edit distance = #insertions + #deletions + #substitutions
Hamming Distance vs. Edit Distance
Hamming Distance
Compares the letters on the same position between two sequences
Not good to measure evolutionary distance between DNA sequences
Edit Distance
Compares the letters between two sequences after inserting gaps
Allows comparison of two sequences of different lengths
Good to measure evolutionary distance between DNA sequences
Example
x=“ATATATAT” , y=“TATATATA”
Hamming distance between x and y ?
Edit distance between x and y ?
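The Hamming half of this example can be checked with a few lines of Python. This is a sketch only; the name `hamming_distance` is an assumption, and the function rejects sequences of different lengths because position-by-position comparison is undefined for them.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("ATATATAT", "TATATATA"))
```

Every position differs here, although shifting one sequence by a single letter lines up almost all of it, which is the case that edit distance handles and Hamming distance does not.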
Sequence Alignment
Sequence Alignment
Aligning two or more sequences to maximize their similarity including gaps
How to find sequence alignment?
(1) Measuring edit distance
(2) Finding longest common subsequence
Longest Common Subsequence (1)
Subsequence of x
An ordered sequence of letters from x
Not necessarily consecutive
e.g., x=“ATTGCTA”, “AGCA” ?, “TCG” ?, “ATCT” ?, “TGAT” ?
Common Subsequence of x and y
e.g., x=“ATCTGAT” and y=“TGCATA”, “TCTA” ?, “TGAT” ?, “TATA” ?
Longest Common Subsequence (LCS) of x and y ?
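The "?" items above can be checked mechanically. The sketch below assumes hypothetical helper names `is_subsequence` and `is_common_subsequence`; it scans x once from left to right and consumes letters in order.

```python
def is_subsequence(s, x):
    """True if the letters of s appear in x in the same order (not necessarily adjacent)."""
    remaining = iter(x)
    # 'ch in remaining' advances the iterator past the first occurrence of ch.
    return all(ch in remaining for ch in s)


def is_common_subsequence(s, x, y):
    """True if s is a subsequence of both x and y."""
    return is_subsequence(s, x) and is_subsequence(s, y)


for candidate in ["AGCA", "TCG", "ATCT", "TGAT"]:
    print(candidate, is_subsequence(candidate, "ATTGCTA"))
```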
Longest Common Subsequence (2)
Example
x=“ATCTGATG” (m=8), y=“TGCATAC” (n=7)
LCS of X and Y ?
2-row representation
How to find LCS efficiently?
A T G T T A T G C A A T G T A C T T A G A C T C A A G T G C C A T T T G A C
T C G T A C T C A G T T C A A G T C A G T T A C G A G T A C A T G C A A A C
A T G T T A T G C A A T G T A C T T A
T C G T A C T C A G T T C A A G T C A
A T G T T A T G C A
T C G T A C T C A G
A T G T T
T C G T A
Sequence Alignment
Sequence Alignment
Aligning two or more sequences to maximize their similarity including gaps
How to find sequence alignment?
(1) Measuring edit distance
(2) Finding longest common subsequence
Dynamic programming
Dynamic Programming
Definition
An algorithm to solve complex problems by breaking them down into simpler sub-problems
The result of a sub-problem is used to solve the next sub-problem
Features
Optimization
• Finding an optimal solution
• Saving memory space
Examples
Binary search tree
Sequence alignment
Dynamic Programming for Sequence Alignment
Edit Graph
2-D grid structure having a diagonal edge at each position where the two letters are the same
Weight the diagonal lines as 1
Weight the other lines as 0
Goal
Finding the highest-scoring path from source to sink
Algorithm
(1) Compute the max score for each node
(The max score is the maximum count of identical letters on any path from the source to that node)
(2) When reaching the sink, trace backward to find the LCS
[Figure: edit graph from source (top left) to sink (bottom right), with columns A T C G T A C and rows A T G T T A T; each diagonal edge between identical letters is labeled +1]
Sequence Alignment Example
[Figure: the edit graph from source (top left) to sink (bottom right), filled in with the max score of each node]
        A  T  C  G  T  A  C
     0  0  0  0  0  0  0  0
  A  0  1  1  1  1  1  1  1
  T  0  1  2  2  2  2  2  2
  G  0  1  2  2  3  3  3  3
  T  0  1  2  2  3  4  4  4
  T  0  1  2  2  3  4  4  4
  A  0  1  2  2  3  4  5  5
  T  0  1  2  2  3  4  5  5
A T C G T - A C -
| |   | |   |
A T - G T T A - T
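The two algorithm steps (fill in the max score of every node, then trace back from the sink) can be sketched as follows. This is an illustrative implementation under the slide's weighting (diagonal edges on identical letters count 1, all other edges count 0); the name `lcs` and the tie-breaking rule in the traceback are assumptions, and other equally long subsequences may exist.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y."""
    m, n = len(x), len(y)
    # score[i][j] = max number of identical letters on a path from the source to node (i, j)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                score[i][j] = score[i - 1][j - 1] + 1   # diagonal edge, weight 1
            else:
                score[i][j] = max(score[i - 1][j], score[i][j - 1])  # weight-0 edges
    # Trace backward from the sink (m, n) to the source (0, 0).
    letters = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i, j = i - 1, j - 1
        elif score[i - 1][j] >= score[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


print(lcs("ATGTTAT", "ATCGTAC"))
```

The table costs O(mn) time and space, compared with the exponential number of candidate subsequences.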
Sequence Alignment Quiz
Example
X = “ATGCGT”, Y = “AGACAT”
[Figure: empty edit graph from source to sink, with columns A T G C G T and rows A G A C A T]
Scoring Alignments: Percent Identity (1)
Identity
Degree of identical matches between sequences
Percent Identity
Percentage of identical matches
Dot-plot representations
Visualization method of identity
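Percent identity can be computed directly from the two rows of an alignment. The sketch below assumes one common convention, dividing the number of identical columns by the full alignment length including gap columns; other denominators (for example the shorter sequence length) are also used, and the name `percent_identity` is hypothetical.

```python
def percent_identity(row_x, row_y, gap="-"):
    """Percentage of alignment columns holding the same letter in both rows."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    identical = sum(a == b and a != gap for a, b in zip(row_x, row_y))
    return 100.0 * identical / len(row_x)


print(round(percent_identity("AT-GTTA-T", "-TCGT-ACT"), 1))
```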
Scoring Alignments: Percent Identity (2)
Dot-plot representations of self alignment
The background noise can be removed by setting a threshold on the minimum identity score in a fixed window
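A windowed dot-plot can be sketched as below. The function name `dot_plot` and its parameters `window` and `threshold` are assumptions for illustration: a dot is placed at (i, j) when the windows starting at x[i] and y[j] share at least `threshold` identical positions.

```python
def dot_plot(x, y, window=1, threshold=1):
    """Return the set of (i, j) positions that receive a dot."""
    dots = set()
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            identical = sum(x[i + k] == y[j + k] for k in range(window))
            if identical >= threshold:
                dots.add((i, j))
    return dots


seq = "ATAT"
print(len(dot_plot(seq, seq)))                          # single-letter matches, noisy
print(sorted(dot_plot(seq, seq, window=4, threshold=4)))  # only the full-length diagonal start
```

Raising the window and threshold trades sensitivity for a cleaner plot.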
Scoring Alignments: Percent Similarity
Percent Similarity
Percentage of similar amino acid pairs in biochemical structure (Protein)
Percentage of similar nucleotide pairs in biochemical structure (DNA)
Advanced Scoring Schemes
Varying scores in similarity of biochemical structures
Penalties (negative scores) for strong mismatches
Relative likelihood of evolutionary relationship
Probability of mutations
Minimum Acceptance Score
90% of sequence pairs with more than 30% sequence identity: homolog
20~30% sequence identity: twilight zone
Substitution Matrices (1)
Substitution Matrix
Score matrix among nucleotides or amino acids
4 × 4 array representation for DNA sequences
or (4+1) × (4+1) array
20 × 20 array representation for protein sequences
or (20+1) × (20+1) array
Entry of δ(i,j) has the score between i and j,
i.e., the rate at which i is substituted with j over time
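To make the (4+1) × (4+1) layout concrete, the sketch below builds a toy DNA matrix with the gap symbol as a fifth row and column and sums δ over the columns of an alignment. The values match = 1, mismatch = -1, gap = -2 and the names `make_matrix` and `alignment_score` are illustrative assumptions, not values from the course.

```python
ALPHABET = "ACGT-"  # four nucleotides plus the gap symbol


def make_matrix(match=1, mismatch=-1, gap=-2):
    """Build a toy substitution matrix delta[(i, j)] over ALPHABET."""
    delta = {}
    for a in ALPHABET:
        for b in ALPHABET:
            if a == "-" or b == "-":
                delta[a, b] = gap
            elif a == b:
                delta[a, b] = match
            else:
                delta[a, b] = mismatch
    return delta


def alignment_score(row_x, row_y, delta):
    """Sum delta over the columns of a two-row alignment."""
    return sum(delta[a, b] for a, b in zip(row_x, row_y))


delta = make_matrix()
print(alignment_score("ATGTTAT", "TCGTACT", delta))
print(alignment_score("AT-GTTA-T", "-TCGT-ACT", delta))
```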
Substitution Matrices (2)
PAM (Point Accepted Mutations)
For protein sequence alignment
Amino acid substitution frequency in mutations
Logarithmic matrix of mutation probabilities
PAM120: Results from 120 mutations per 100 residues
PAM120 vs. PAM240
BLOSUM (Block Substitution Matrix)
For protein sequence alignment
Applied for local sequence alignments
Substitution frequencies between clustered groups
BLOSUM-62: Results with a threshold (cut-off) of 62% identity
BLOSUM-62 vs. BLOSUM-50
Substitution Matrices (3)
Substitution Matrix Examples
BLOSUM-62 PAM120
Theory of Scoring Alignments
Random model
Non-random model
Odds ratio
Odds ratio for each position
Odds ratio for entire alignment
log-odds ratio (a score in a substitution matrix)
Expected score
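The formulas for this slide did not survive in the transcript. A standard formulation, given here as an assumption rather than the slide's exact notation, takes q_a as the background frequency of residue a (random model) and p_ab as the probability of seeing a aligned with b in related sequences (non-random model). The odds ratio at one position is p_ab / (q_a q_b), and taking logarithms turns the product over positions into a sum of matrix scores.

```python
import math


def log_odds(p_ab, q_a, q_b):
    """Log-odds score (in bits) for aligning residue a with residue b."""
    return math.log2(p_ab / (q_a * q_b))


def alignment_log_odds(pairs, p, q):
    """Score of an ungapped alignment: the sum of per-position log-odds."""
    return sum(log_odds(p[a, b], q[a], q[b]) for a, b in pairs)


# Toy two-letter alphabet with made-up probabilities, for illustration only.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.5, ("B", "B"): 0.25, ("A", "B"): 0.125, ("B", "A"): 0.125}
print(alignment_log_odds([("A", "A"), ("A", "B")], p, q))
```

A positive score means the pair is seen more often in related sequences than expected by chance; a negative score means less often.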
Gap Penalty (1)
Gaps
Contiguous sequence of spaces in one of the aligned sequences
Gaps inserted as the results of insertions and deletions (indels)
Gap Penalties
High penalties vs. Low penalties
Fixed penalties vs. Flexible penalties depending on residues
No penalty on start gaps and end gaps
Finding optimal number of gaps for the best score in sequence alignment
Dynamic Programming
Gap Penalty (2)
Examples of high penalties and low penalties
Affine Gap Penalty (1)
Motivation
-σ for 1 gap (insertion or deletion)
-2σ for 2 consecutive gaps (insertions or deletions)
-3σ for 3 consecutive gaps (insertions or deletions), etc.
→ too severe penalty for a series of 100 consecutive gaps
Example
x=“ATAGC”, y=“ATATTGC”
x=“ATAGGC”, y=“ATGTGC”
single event
Affine Gap Penalty (2)
Linear Gap Penalty
Score for a gap of length x : -σ x
Constant Gap Penalty
Score for a gap of length x : -ρ
Affine Gap Penalty
Score for a gap of length x : - (ρ + σ x)
ρ : gap opening penalty / σ : gap extension penalty (typically ρ > σ)
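The three penalty models can be compared side by side. The sketch below follows the formulas above with x as the gap length; the function names and the sample values of ρ and σ are assumptions chosen only to show how the scores scale for a long gap.

```python
def linear_gap(length, sigma):
    """Linear model: score -sigma * x."""
    return -sigma * length


def constant_gap(length, rho):
    """Constant model: score -rho for any gap of length >= 1."""
    return -rho if length > 0 else 0


def affine_gap(length, rho, sigma):
    """Affine model: score -(rho + sigma * x) for a gap of length >= 1."""
    return -(rho + sigma * length) if length > 0 else 0


# A single 100-letter indel under each model (sample parameters).
print(linear_gap(100, sigma=2), constant_gap(100, rho=10), affine_gap(100, rho=10, sigma=1))
```

With a large opening penalty and a small extension penalty, the affine model discourages opening many gaps but charges less than a steep linear model for one long gap.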
Global vs. Local Alignment
Global Alignment
Finding sequence alignment across the whole length of sequences
Local Alignment
Finding significant similarity in a part of sequences
Example
x = “TCAGTGTCGAAGTTA”
y = “TAGGCTAGCAGTGTA”
Global alignment: Dynamic Programming (Needleman-Wunsch algorithm)
T C A G - - T - G T C G A A G T - T A
|   | |     |   |   |   |   | |   | |
T - A G G C T A G - C - A - G T G T A
Local alignment: Dynamic Programming (Smith-Waterman algorithm)
              T C A G T G T C G A A G T T A
                | | | | | |
T A G G C T A G C A G T G T A
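The difference between the two recurrences is small in code. The sketch below computes only the optimal scores (no traceback) under an assumed scoring of match = 1, mismatch = -1, and a linear gap penalty of -1; the names `nw_score` and `sw_score` are hypothetical. The global version pays for leading gaps and reads the score at the sink; the local version resets negative scores to 0 and takes the best cell anywhere.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch: best score for aligning all of x with all of y."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def sw_score(x, y, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman: best score over all pairs of substrings of x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


x, y = "TCAGTGTCGAAGTTA", "TAGGCTAGCAGTGTA"
print(nw_score(x, y), sw_score(x, y))
```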
Local Alignment Example
Local Alignment
Applied for multi-domain protein sequences
Protein domain
• Basic functional block
• Evolutionarily conserved
Multiple Alignment (1)
Pairwise Alignment
Alignment of two sequences
Sometimes two sequences are functionally similar or have a common ancestor although they have weak sequence similarity
Multiple Alignment
Alignment of three or more sequences simultaneously
Finds similarity which is invisible in pairwise alignment
Multiple Alignment (2)
Example
Dynamic Programming ?
Computationally not acceptable
Need heuristic methods
Hierarchical Method (1)
Hierarchical Method
(1) Compares all sequences in pairwise alignments
(2) Creates a guide tree (hierarchy)
(3) Follows the guide tree for a series of pairwise alignments
     v1   v2   v3   v4  …
v1   –
v2   .17  –
v3   .87  .28  –
v4   .59  .33  .62  –
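Step (2) can be sketched with a simple average-linkage clustering. This is an illustration only: it assumes the matrix entries are pairwise distances (smaller means more similar), the function name `guide_tree` is hypothetical, and real tools such as ClustalW build their trees with their own methods.

```python
from itertools import combinations


def guide_tree(names, dist):
    """Merge the two closest clusters repeatedly; return the tree as nested tuples."""
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in names]

    def average_distance(a, b):
        pairs = [(p, q) for p in a[1] for q in b[1]]
        return sum(dist[frozenset(pair)] for pair in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda ab: average_distance(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


dist = {
    frozenset(("v1", "v2")): 0.17,
    frozenset(("v1", "v3")): 0.87, frozenset(("v2", "v3")): 0.28,
    frozenset(("v1", "v4")): 0.59, frozenset(("v2", "v4")): 0.33, frozenset(("v3", "v4")): 0.62,
}
print(guide_tree(["v1", "v2", "v3", "v4"], dist))
```

The innermost pair is aligned first, and each enclosing level adds one more sequence or group to the growing alignment.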
Hierarchical Method (2)
Features
Also called progressive alignment
More intelligent strategy on each step
Use of consensus sequence to compare groups of sequences
Gaps are permanent (“once a gap, always a gap”)
Works well for close sequences
Application Tools
ClustalW
• Comparing residues one pair at a time and imposing gap penalties
DIALIGN
• Finding pairs of equal-length gap-free segments
Divide-and-Conquer Method
Process
Features
Fast aligning of long sequences
Multiple Alignment Results
Examples
Summary of PSA & MSA Algorithms
Rigorous Algorithms
Heuristic Algorithms
Searching Databases
Sequence Homolog Search
Search similar sequences to a query sequence in a database
Computational issues
• Dynamic programming algorithms (N-W / S-W) are rigorous
• But they are inefficient for searching a huge database
• Heuristic approaches are needed
Sequence Homolog Searching Tools
FASTA
BLAST
FASTA (1)
FASTA
DNA / protein sequence alignment tool (local alignment)
Applies dynamic programming in scoring selected sequences
Heuristic method in candidate sequence search
Algorithm
(1) Finding all pairwise k-tuples (at least k contiguous matching residues)
(2) Scoring the k-tuples by a substitution matrix
(3) Selecting sequences with high scores for alignment
FASTA (2)
Indexing (or Hashing)
Indexing Process in FASTA
(1) Find all k-tuples from a query sequence and calculate ci
(2) Build an index table
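The indexing idea can be sketched with a dictionary from each k-tuple to its positions in the query. This is a simplified illustration, not the FASTA source: the names `build_index` and `diagonal_hits` are assumptions, and the dictionary key stands in for the numeric code c_i mentioned above. Counting hits per diagonal (query offset minus target offset) shows where runs of k-tuple matches line up.

```python
from collections import Counter, defaultdict


def build_index(query, k):
    """Map every k-tuple of the query to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return dict(index)


def diagonal_hits(index, target, k):
    """Count k-tuple matches on each diagonal (query position - target position)."""
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], []):
            diagonals[i - j] += 1
    return diagonals


index = build_index("ATGAT", 2)
print(index)
print(diagonal_hits(index, "GAT", 2))
```

Looking up each target k-tuple is a constant-time dictionary access, so the scan is linear in the target length plus the number of hits.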
FASTA Package
FASTA package
program           query sequence              database
fasta             protein                     protein
fasta             DNA                         DNA
fastx / fasty     DNA (all reading frames)    protein
tfastx / tfasty   protein                     DNA (all reading frames)
ssearch: applies dynamic programming (S-W algorithm)
BLAST (1)
BLAST (Basic Local Alignment Search Tool)
DNA / protein sequence alignment tool
Finds local alignments
Heuristic method in sequence search
Runs faster than FASTA
Algorithm
(1) Makes a list of words (word pairs) from the query sequence
(2) Chooses high-scoring words
(3) Searches database for matches (hits) with the high-scoring words
(4) Extends the matches in both directions to find high-scoring segment pair (HSP)
(5) Selects the sequence which has two or more HSPs for S-W alignment
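Steps (1), (3), and (4) can be sketched in a heavily simplified form. The assumptions are significant: words are matched exactly (no scoring of neighboring words as in step 2), extension is ungapped and stops at the first mismatch instead of using a score drop-off, and the names `extend_hit` and `best_hsp` are hypothetical.

```python
def extend_hit(query, subject, qi, sj, w):
    """Extend an exact word hit left and right while the letters keep matching."""
    left = 0
    while (qi - left - 1 >= 0 and sj - left - 1 >= 0
           and query[qi - left - 1] == subject[sj - left - 1]):
        left += 1
    right = 0
    while (qi + w + right < len(query) and sj + w + right < len(subject)
           and query[qi + w + right] == subject[sj + w + right]):
        right += 1
    return qi - left, sj - left, left + w + right  # (query start, subject start, length)


def best_hsp(query, subject, w=3):
    """Return the longest ungapped exact segment found from word hits, or None."""
    words = {}
    for i in range(len(query) - w + 1):          # step (1): list the query words
        words.setdefault(query[i:i + w], []).append(i)
    best = None
    for j in range(len(subject) - w + 1):        # step (3): scan the subject for hits
        for i in words.get(subject[j:j + w], []):
            hsp = extend_hit(query, subject, i, j, w)   # step (4): extend both ways
            if best is None or hsp[2] > best[2]:
                best = hsp
    return best


print(best_hsp("TCAGTGTCGAAGTTA", "TAGGCTAGCAGTGTA"))
```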
BLAST (2)
Deterministic Finite Automata (DFA)
DFA Analysis Process in BLAST
(1) Build DFA using high-scoring words
(2) Read sequences in database and trace DFA
(3) Output the positions for hits
BLAST Package
BLAST programs
program   query sequence              database
blastp    protein                     protein
blastn    DNA                         DNA
blastx    DNA (all reading frames)    protein
tblastn   protein                     DNA (all reading frames)
tblastx   DNA (all reading frames)    DNA (all reading frames)
Search Results
BLAST Search Results
FASTA Search Results
E-value
E-value
Average number of alignments with a score of at least S that would be expected by chance alone in searching a database of n sequences
Ranges of E-value: 0 ~ n
High alignment score S → Low E-value
Low alignment score S → High E-value
Factors
• Alignment score
• The number of sequences in the database
• Sequence length
Default E-value threshold: 0.01 ~ 0.001
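The slide gives no formula for the E-value. One widely used form, shown here as an assumption about what sits behind these factors, is the Karlin-Altschul estimate E = K · m · n · e^(-λS), where m is the query length, n is the total database length in residues (a different n from the sequence count above), and K and λ depend on the scoring system. The parameter values in the sketch are made up for illustration.

```python
import math


def e_value(score, query_len, db_len, K=0.1, lam=0.3):
    """Karlin-Altschul style estimate of chance alignments scoring at least `score`."""
    return K * query_len * db_len * math.exp(-lam * score)


# Higher scores give exponentially smaller E-values; larger databases give larger ones.
for s in (20, 40, 60):
    print(s, e_value(s, query_len=300, db_len=1_000_000))
```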
Filtering
Low-Complexity Region
Highly biased amino acid composition
Lowers significant hits in sequence alignment
BLAST filters the query sequence for low-complexity regions and marks them with “X”
Summary of Homolog Search Algorithms
Rigorous Algorithms
Heuristic Algorithms
Motifs
Motifs
Short sequence patterns
Functionally related sequences share similarly distributed patterns (motifs) of critical functional residues
Types of Motif Search
Search a query sequence in a motif database
Search a pattern in a sequence database
Find a pattern from a set of sequences
Motif Finding
Consensus method by global multiple alignment
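Once sequences are in a multiple alignment, a consensus can be read column by column. The sketch below assumes the rows are already aligned to equal length and takes the most frequent letter in each column; the name `consensus` is hypothetical, and ties are resolved arbitrarily by first occurrence.

```python
from collections import Counter


def consensus(aligned_rows):
    """Most frequent letter in each column of an equal-length multiple alignment."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*aligned_rows))


print(consensus(["ATGCA", "ATGTA", "ACGCA"]))
```

Columns where one letter dominates are candidates for the conserved positions of a motif.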
Motif Search Tools (1)
BLOCKS
Logos
• Size of letters: conservation levels
• Color of letters: biochemical properties
Motif Search Tools (2)
MEME
Summary motif information
• Location of motifs in sequences
Motif Databases
PROSITE
Code for patterns
• Each letter represents an amino acid residue
• All positions are separated by “-”
Code    Description                          Example
X       any amino acid                       G-X-L-M-S-A-D-F-F-F
[]      two or more possible amino acids     G-[LI]-L-M-S-A-D-F-F-F
{}      disallowed amino acids               G-[LI]-L-M-S-A-{RK}-F-F-F
(n)     repetition by n of the amino acid    G-[LI]-L-M-S-A-{RK}-F(3)
(n,m)   a range: only allowed with X         G-[LI]-L-M-S-A-{RK}-X(1,3)
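The pattern codes in the table map closely onto regular expressions. The sketch below covers only the five codes listed here (it ignores other PROSITE syntax such as anchors), and the function name `prosite_to_regex` is an assumption.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern using X, [], {}, (n), (n,m) into a regex string."""
    parts = []
    for element in pattern.split("-"):
        # Split each element into its core and an optional (n) or (n,m) suffix.
        core, low, high = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core in ("X", "x"):
            core = "."                           # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"       # disallowed amino acids
        # [..] classes and single letters are already valid regex.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("G-[LI]-L-M-S-A-{RK}-F(3)")
print(regex)
print(bool(re.fullmatch(regex, "GLLMSADFFF")), bool(re.fullmatch(regex, "GLLMSARFFF")))
```

Use `re.search` instead of `re.fullmatch` to find the motif anywhere inside a longer protein sequence.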