
Some Independent Study on Sequence Alignment

— Lan Lin

prepared for theory group meeting on July 16, 2003

Biological Background (1)

Genetic information is stored in DNA
– used to make identical copies
– transferred from DNA to RNA to protein

DNA is a linear polymer of 4 nucleotides
– (A, T, G, C)

RNA is a similar polymer
– (A, U, G, C)

Both can pair with one another, forming a “double helix”
– pairing is sequence specific (G-C, A-T/U)
– templating results in DNA replication and an RNA copy of a DNA sequence

Biological Background (2)

Proteins are variable-length, linear, mixed polymers of 20 different amino acids
– “peptides” and “polypeptides” are terms for amino acid polymers
– functional properties largely determined by the amino acid sequence

RNA → protein by translation of a code consisting of 3 nucleotides into 1 amino acid
– one amino acid encoded by 1 to 6 different triplet codes
– 3 stop codons specifying “end of peptide sequence”
– 3 reading frames for a DNA sequence, 6 for one with its (inferred) complementary strand
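The triplet code and the six reading frames can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code written over the DNA alphabet (T in place of U); the names `CODON_TABLE`, `reverse_complement`, `translate` and `six_frames` are hypothetical helpers, not taken from the slides.

```python
from collections import Counter

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
# '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Infer the complementary strand, read in its own direction."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate the complete codons of dna, starting at position 0."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three reading frames per strand, six with the complementary strand."""
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


# How many triplets encode each amino acid (and the stop signal).
codons_per_residue = Counter(CODON_TABLE.values())
```

Counting the table entries reproduces the numbers on the slide: `codons_per_residue["*"]` is 3, and every amino acid has between 1 and 6 codons.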

Sequence Analysis (1)

Some difficulties
– Where does the code for a protein start and stop?
– coding DNA is frequently scattered across separate “exons”, not continuous
– RNA extends up- and downstream of the coding region; non-coding regions can be quite large; not all RNAs encode proteins

Inferring structure and function from a protein sequence is even harder!
– 3 levels of protein structure
    primary structure: sequence of amino acids in the protein
    secondary structure: polypeptide chains folding into regular structures (e.g., alpha helix or beta sheet)
    tertiary structure: 3D structure of the protein, determining biological function
– homology-based approach used to determine the tertiary structure by primary sequence analysis of related proteins

Sequence Analysis (2)

What can be done?
– identification of protein primary sequence from DNA sequence
– searching DBs for similar sequences
    DNA sequences: DDBJ, EMBL, GenBank
    protein sequences: SwissProt, PIR
    for rapid search for a query sequence: BLAST and FASTA
– calculation of sequence alignments for evolutionary inferences and to aid in structural and functional analysis

Pairwise Sequence Alignment

Two quantitative measures
– similarity (the larger the better)
– distance (the smaller the better)

Edit operations by introducing a gap character “-”
– match, replacement, insertion/deletion (“indel”)

The unit cost model
– The cost of an alignment of two sequences s and t is the sum of the costs of all the edit operations that lead from s to t.
– An optimal alignment is one with the minimum cost.
– The edit distance of s and t is the cost of an optimal alignment of s and t under a cost function w, denoted by $d_w(s, t)$.
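To make the cost of an alignment concrete, here is a small sketch that scores a given alignment column by column. It assumes the usual reading of the unit cost model (0 for a match, 1 for a replacement or an indel); the function names and the example alignment are illustrative only.

```python
def unit_cost(a, b):
    """Unit cost w: 0 for a match, 1 for a replacement or an indel."""
    return 0 if a == b else 1


def alignment_cost(row_s, row_t, w=unit_cost):
    """Sum the cost of the edit operations, one per alignment column."""
    if len(row_s) != len(row_t):
        raise ValueError("aligned rows must have equal length")
    return sum(w(a, b) for a, b in zip(row_s, row_t))


# One deletion (G) and one insertion (T): cost 2 under unit costs.
example = alignment_cost("AGCACAC-A", "A-CACACTA")
```

This only evaluates one alignment; finding the cheapest one among all possible alignments is the job of the dynamic programming recursion.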

Pairwise Alignment via Dynamic Programming (1)

Recursion step

\[
d_w(0{:}s{:}i,\ 0{:}t{:}j) = \min
\begin{cases}
d_w(0{:}s{:}(i-1),\ 0{:}t{:}(j-1)) + w(s_i, t_j), \\
d_w(0{:}s{:}(i-1),\ 0{:}t{:}j) + w(s_i, -), \\
d_w(0{:}s{:}i,\ 0{:}t{:}(j-1)) + w(-, t_j)
\end{cases}
\quad \text{for } i, j \ge 1
\]

Base

\[
\begin{aligned}
d_w(0{:}s{:}0,\ 0{:}t{:}0) &= 0 \\
d_w(0{:}s{:}i,\ 0{:}t{:}0) &= d_w(0{:}s{:}(i-1),\ 0{:}t{:}0) + w(s_i, -) && \text{for } i = 1, \dots, m \\
d_w(0{:}s{:}0,\ 0{:}t{:}j) &= d_w(0{:}s{:}0,\ 0{:}t{:}(j-1)) + w(-, t_j) && \text{for } j = 1, \dots, n
\end{aligned}
\]
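The recursion and its base cases translate almost line for line into code. The sketch below assumes unit costs as the default for w and uses "-" as the gap character; `edit_distance_matrix` is a hypothetical name.

```python
def edit_distance_matrix(s, t, w=lambda a, b: 0 if a == b else 1, gap="-"):
    """Fill d[i][j] = d_w(0:s:i, 0:t:j) for all prefixes of s and t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    # Base: aligning a prefix with the empty sequence uses indels only.
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + w(s[i - 1], gap)
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + w(gap, t[j - 1])
    # Recursion step for i, j >= 1.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + w(s[i - 1], t[j - 1]),  # match or replacement
                d[i - 1][j] + w(s[i - 1], gap),           # deletion of s_i
                d[i][j - 1] + w(gap, t[j - 1]),           # insertion of t_j
            )
    return d
```

The edit distance itself is the last entry, for example `edit_distance_matrix("kitten", "sitting")[-1][-1]`.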

Pairwise Alignment via Dynamic Programming (2)

The edit distances of all prefixes define an $(m+1) \times (n+1)$ distance matrix $D = (d_{i,j})$ with $d_{i,j} = d_w(0{:}s{:}i,\ 0{:}t{:}j)$.

Pattern of dependencies between matrix elements

\[
\begin{array}{cc}
d_{i-1,j-1} & d_{i-1,j} \\
d_{i,j-1} & d_{i,j}
\end{array}
\]

The bottom right corner contains the desired result: $d_{m,n} = d_w(0{:}s{:}m,\ 0{:}t{:}n) = d_w(s, t)$.

A path through the distance matrix indicates how to align
– A diagonal line means replacement/match
– A vertical line means deletion
– A horizontal line means insertion

The most common order of calculation is row by row (each row from left to right), or column by column (each column from top to bottom).
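Reading an alignment off the matrix can be sketched as a walk from the bottom right corner back to the top left. The code below is one possible version under unit costs; when several optimal paths exist it simply prefers the diagonal, which is an arbitrary choice made here.

```python
def align(s, t, gap="-"):
    """Return (cost, aligned_s, aligned_t) for one optimal unit-cost alignment."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):          # row by row, each row left to right
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (s[i - 1] != t[j - 1]),
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
            )
    out_s, out_t = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            out_s.append(s[i - 1])     # diagonal: match or replacement
            out_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            out_s.append(s[i - 1])     # vertical: deletion
            out_t.append(gap)
            i -= 1
        else:
            out_s.append(gap)          # horizontal: insertion
            out_t.append(t[j - 1])
            j -= 1
    return d[m][n], "".join(reversed(out_s)), "".join(reversed(out_t))
```

Calling `align("AGCACACA", "ACACACTA")` returns the cost together with two gapped rows of equal length; stripping the gaps from each row gives back the input sequences.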

On Scoring Functions

Different words all attributing a numeric value to a pair of sequences
– “distance” values are never negative; should be minimized
– “cost” implies positive values, with the overall cost to be minimized
– “weights” and “scores” can be positive or negative; the optimal alignments maximize scores
– “similarity” implies large values are good; should be maximized

If relating sequences of different length, length-relative scores make sense.

Realistic Gap Models

No-gap alignment
– using matches/replacements only in some regions (e.g., sites of protein-protein interaction)
– DP algorithm geared to do this by setting costs for indels to infinity (or something close to it)

Block-indel
– charging a certain set-up cost for introducing the gap, whereas extending the gap is less expensive
– DP algorithm adapted without much effect on its efficiency
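One common way to adapt the DP algorithm to block indels is to keep three matrices, one per kind of last alignment column (often called an affine gap model). The sketch below is an assumption about how that could look, not the slides' own formulation: a gap of length L costs `gap_open + L * gap_extend`, and all parameter values are illustrative.

```python
INF = float("inf")


def affine_distance(s, t, mismatch=1, gap_open=2, gap_extend=1):
    """Minimum alignment cost when opening a gap costs extra (block indels)."""
    m, n = len(s), len(t)
    # M: last column pairs s_i with t_j; X: s_i over a gap; Y: a gap over t_j.
    M = [[INF] * (n + 1) for _ in range(m + 1)]
    X = [[INF] * (n + 1) for _ in range(m + 1)]
    Y = [[INF] * (n + 1) for _ in range(m + 1)]
    M[0][0] = 0
    for i in range(1, m + 1):
        X[i][0] = gap_open + gap_extend * i
    for j in range(1, n + 1):
        Y[0][j] = gap_open + gap_extend * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else mismatch
            M[i][j] = min(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]) + sub
            # Continuing an open gap is cheaper than starting a new one.
            X[i][j] = min(M[i - 1][j] + gap_open,
                          Y[i - 1][j] + gap_open,
                          X[i - 1][j]) + gap_extend
            Y[i][j] = min(M[i][j - 1] + gap_open,
                          X[i][j - 1] + gap_open,
                          Y[i][j - 1]) + gap_extend
    return min(M[m][n], X[m][n], Y[m][n])
```

The work per cell stays constant, so the running time remains proportional to m times n, which matches the remark that efficiency is hardly affected. Setting `gap_open` to a very large value approximates the no-gap model.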

Variations of Pairwise Alignment (1)

Local alignment (approximate pattern matching)
– where s is relatively short with respect to t and we seek that subunit of t with which s aligns best:
    Given $0{:}s{:}m$ and $0{:}t{:}n$, find $i{:}t{:}j$ such that $d_w(s,\ i{:}t{:}j)$ is minimal among all choices of $0 \le i \le j \le n$.

Local alignment recursion
– no cost for deletion of a prefix $0{:}t{:}i$,
– no cost for deletion of a suffix $j{:}t{:}n$,
– $d_{m,n}$ gives the cost of the optimal local alignment; $i{:}t{:}j$ is found by:
    $j = \min \{ k \mid d_{m,k} = d_{m,n} \}$
    $i$ is the point where the optimal path leading to $d_{m,j}$ starts from the 1st row
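A minimal sketch of this variant under unit costs: the first row is all zeros (a prefix of t is skipped for free), and instead of adding free moves along the last row, the code takes the minimum of the last row directly, which yields the same cost and the same j. Each cell also remembers the column in the first row where its path started, so i can be reported without storing the whole matrix. The name `best_match` is hypothetical.

```python
def best_match(s, t):
    """Return (cost, i, j) so that t[i:j] is a substring of t closest to s."""
    n = len(t)
    # Row 0: every column is a free starting point; store (cost, start column).
    prev = [(0, j) for j in range(n + 1)]
    for a in range(1, len(s) + 1):
        cur = [(prev[0][0] + 1, 0)]    # s[:a] against the empty string
        for b in range(1, n + 1):
            cur.append(min(
                (prev[b - 1][0] + (s[a - 1] != t[b - 1]), prev[b - 1][1]),
                (prev[b][0] + 1, prev[b][1]),
                (cur[b - 1][0] + 1, cur[b - 1][1]),
            ))
        prev = cur
    cost = min(c for c, _ in prev)     # skipping a suffix of t is free
    j = next(k for k, (c, _) in enumerate(prev) if c == cost)
    i = prev[j][1]
    return cost, i, j
```

For example, `best_match("GATTACA", "CCCGATTACACCC")` reports cost 0 with i = 3 and j = 10, the exact occurrence of the pattern.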

Variations of Pairwise Alignment (2)

Local similarity
– asking for those subunits of s and t that exhibit most similarity
– using a similarity rather than a distance measure
    w(a, b) > 0, if a, b are similar,
    w(a, b) < 0, if a, b are not similar
    w(a, -) < 0, and w(-, b) < 0, in particular
– score 0 as a cut-off value between subsequences with/without similarity
    long stretches of dissimilarity shown as regions of zeroes in the matrix
    stretches of local similarity rising as islands of positive values
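The cut-off at 0 is what turns the maximizing recursion into a local one (this is the idea behind what is usually called the Smith-Waterman algorithm). The sketch below assumes simple illustrative weights: `match` positive, `mismatch` and `gap` negative; none of these values come from the slides.

```python
def local_similarity(s, t, match=2, mismatch=-1, gap=-1):
    """Return (score, part_of_s, part_of_t) for one best local alignment."""
    m, n = len(s), len(t)
    H = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    best, end = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            # 0 is the cut-off: a dissimilar stretch resets the score.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + w,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, end = H[i][j], (i, j)
    # Walk back from the top of the highest "island" until the score hits 0.
    i, j = end
    while H[i][j] > 0:
        w = match if s[i - 1] == t[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + w:
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, s[i:end[0]], t[j:end[1]]
```

With two sequences that share only the stretch GATTACA, the matrix is zero almost everywhere and a single island of positive values rises along the shared stretch.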

Heuristic Methods

Edit distance calculation complexity
– for input sequences $0{:}s{:}m$ and $0{:}t{:}n$, DP calculates $m \times n$ matrix entries; time complexity is $O(m \times n)$
– to only get the edit distance, only one column (or one row) of the matrix needs to be stored; space complexity is $O(m)$ or $O(n)$
– to retrace the optimal path, the whole matrix needs to be stored; space complexity is also $O(m \times n)$
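The space saving for the distance-only case can be sketched by keeping just the previous row. The version below assumes unit costs, which are symmetric, so the two sequences may be swapped to keep the shorter one along the row.

```python
def edit_distance_linear_space(s, t):
    """Unit-cost edit distance using one stored row instead of the full matrix."""
    if len(t) > len(s):
        s, t = t, s                      # the row follows the shorter sequence
    prev = list(range(len(t) + 1))       # row 0 of the distance matrix
    for i, a in enumerate(s, start=1):
        cur = [i]                        # first column of row i
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j - 1] + (a != b),   # diagonal
                           prev[j] + 1,              # vertical
                           cur[j - 1] + 1))          # horizontal
        prev = cur
    return prev[-1]
```

The time is still proportional to m times n; only the memory drops, and the alignment path itself can no longer be retraced from what is stored.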

Heuristic methods approximate the optimal alignment with a time complexity close to $O(m+n)$
– trading precision for speed
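The slides do not name a specific heuristic at this point. As a purely illustrative stand-in, the sketch below uses a word-lookup idea in the spirit of FASTA-style tools: index all words of length k in s, scan t once, and count the word hits per diagonal. The function name and the choice of k are assumptions, and a real tool does considerably more than this.

```python
from collections import Counter, defaultdict


def best_diagonal(s, t, k=3):
    """Return (offset, hits) for the diagonal j - i with the most shared k-words."""
    index = defaultdict(list)
    for i in range(len(s) - k + 1):
        index[s[i:i + k]].append(i)      # where each word occurs in s
    hits = Counter()
    for j in range(len(t) - k + 1):
        for i in index.get(t[j:j + k], ()):
            hits[j - i] += 1             # same offset means same diagonal
    return hits.most_common(1)[0] if hits else None
```

Building the index and scanning t take time roughly linear in m + n plus the number of word hits; the result only suggests where a good alignment is likely to lie and gives no guarantee of optimality.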

Multiple Alignment

Helpful for protein structure prediction and evolutionary history inference

A multiple alignment of k sequences is a rectangular array of k rows which resemble the corresponding sequences when ignoring the gap character, with each column containing at least one character different from “-”.

Two ways to formulate a cost/weight function
– columns-first
– pairs-first

An optimal multiple alignment is one with minimum overall cost, or maximal overall similarity.
– based on SP-cost (“sum-of-pairs”)
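The two formulations can be sketched side by side for the sum-of-pairs cost. The code assumes unit costs and, as an extra assumption not stated on the slide, that two gap characters facing each other cost nothing; under that assumption both orders of summation give the same total.

```python
from itertools import combinations


def w(a, b):
    """Unit cost for a pair of characters; two facing gaps cost 0 (assumed)."""
    return 0 if a == b else 1


def sp_cost_columns_first(rows):
    """Score every column over all pairs of rows, then add up the columns."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        sum(w(a, b) for a, b in combinations(column, 2))
        for column in zip(*rows)
    )


def sp_cost_pairs_first(rows):
    """Score the pairwise alignment induced by every pair of rows, then add up."""
    return sum(
        sum(w(a, b) for a, b in zip(x, y))
        for x, y in combinations(rows, 2)
    )


rows = ["AC-GT", "ACAGT", "A--GT"]
```

For these three rows both functions return 4: the second and third columns contribute 2 each, or equivalently the three row pairs contribute 1, 1 and 2.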

MSA by Standard DP and Heuristics

DP matrix → DP hyperlattice
– taking time in $O(2^k \prod_{i=1,\dots,k} |s_i|)$ and space in $O(\prod_{i=1,\dots,k} |s_i|)$
– NP-hard with regard to the number of sequences with the SP measure

Alignment along a phylogenetic tree
– tree generation through all optimal pairwise alignments
– most similar pairs aligned first before aligning alignments
– not necessarily optimal due to error accumulation
– “sequences” → “profiles”
    “sum-of-pairs” → “scoring along a tree”

More Interesting Topics

Phylogenetic tree
Genetic algorithms and protein folding
RNA secondary structure prediction
Protein structures
Finding instances of known/unknown sites
etc.

References

Online Lectures on Bioinformatics
http://lectures.molgen.mpg.de/online_lectures.html

Biocomputing Hypertext Coursebook
http://www.techfak.uni-bielefeld.de/bcd/Curric/welcome.html

Lecture Notes on Biological Sequence Analysis
http://www.cs.uml.edu/bioinformatics/resources/Lectures/tompa00lecture.pdf
