Copyright © 2004, 2005 by Limsoon Wong Techniques & Applications of Sequence Comparison Limsoon Wong 31 August 2005

Page 1:

Techniques & Applications of Sequence Comparison

Limsoon Wong
31 August 2005

Page 2:

Lecture Plan

• Basic sequence comparison methods
– Pairwise alignment, Multiple alignment

• Popular tools
– FASTA, BLAST, PatternHunter

• Applications
– Homologs, Active sites, Key mutation sites, Looking for SNPs, Determining origin of species, …

• More advanced tools
– PHI-BLAST, ISS, PSI-BLAST, SAM, ...

Page 3:

Basic Sequence Comparison Methods

A brief refresher

Page 4:

Sequence Comparison: Motivations

• DNA is the blueprint for living organisms
⇒ Evolution is related to changes in DNA
⇒ By comparing DNA sequences we can infer evolutionary relationships between the sequences w/o knowledge of the evolutionary events themselves

• Foundation for inferring function, active site, and key mutations

Page 5:

Alignment

[Figure: alignment of Sequence U and Sequence V, with match, mismatch, and indel positions labelled]

• Key aspect of sequence comparison is sequence alignment

• A sequence alignment maximizes the number of positions that are in agreement in two sequences

Page 6:

Alignment: Poor Example

[Figure: no obvious match between Amicyanin and Ascorbate Oxidase]

• Poor seq alignment shows few matched positions
⇒ The two proteins are not likely to be homologous

Page 7:

Alignment: Good Example

[Figure: good match between Amicyanin and unknown M. loti protein]

• Good alignment usually has clusters of extensive matched positions
⇒ The two proteins are likely to be homologous

Page 8:

Alignment: A Formalization

Page 9:

Alignment: Simple-minded Probability & Score

• Define score S(A) by simple log likelihood as
S(A) = log(prob(A)) - [m log(s) + h log(s)], with log(p/s) = 1

• Then S(A) = #matches - #mismatches - #indels
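The simple score above can be computed directly from a gapped alignment. The following is a minimal sketch, assuming the alignment is given as two equal-length strings in which "-" marks an indel; the function name and this representation are illustrative choices, not taken from the slides.

```python
def alignment_score(row_u, row_v):
    """Return #matches - #mismatches - #indels for a gapped alignment.

    row_u and row_v are the two rows of the alignment; '-' marks an indel.
    """
    if len(row_u) != len(row_v):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for a, b in zip(row_u, row_v):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-" or b == "-":
            score -= 1   # indel
        elif a == b:
            score += 1   # match
        else:
            score -= 1   # mismatch
    return score


# Three matches and two indels give a score of 3 - 2 = 1.
print(alignment_score("ACG-T", "AC-TT"))
```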

Page 10:

Global Pairwise Alignment: Problem Definition

• Given sequences U and V of lengths n and m, the number of possible alignments is given by
– $f(n, m) = f(n-1, m) + f(n-1, m-1) + f(n, m-1)$
– $f(n, n) \sim (1 + \sqrt{2})^{2n+1} \, n^{-1/2}$

• The problem of finding a global pairwise alignment is to find an alignment A so that S(A) is maximal among an exponential number of possible alternatives
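To get a feel for how fast the number of alignments grows, the recurrence can be evaluated directly. This sketch assumes the boundary condition f(n, 0) = f(0, m) = 1 (an empty sequence can only be aligned against indels), which the slide does not state explicitly; the function name is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_alignments(n, m):
    """Number of alignments of sequences of lengths n and m."""
    if n == 0 or m == 0:
        return 1  # assumed boundary case: only indels remain
    return (count_alignments(n - 1, m)
            + count_alignments(n - 1, m - 1)
            + count_alignments(n, m - 1))


for n in range(1, 5):
    print(n, count_alignments(n, n))
```

Even for two sequences of length 4 there are already 321 alignments, which is why exhaustive enumeration is not an option.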

Page 11:

Global Pairwise Alignment: Dynamic Programming Solution

• Define an indel-similarity matrix s(.,.); e.g.,
– s(x,x) = 1
– s(x,y) = -μ, if x ≠ y

• Then [recurrence shown as an image on the slide]
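The recurrence itself is not preserved in this transcript. As a hedged illustration, the sketch below implements the standard global-alignment dynamic program (in the style of Needleman-Wunsch), assuming a match score of 1 and hypothetical penalties `mu` for a mismatch and `delta` for a single indel. These names and defaults are assumptions for illustration and may differ from the slide's formulation.

```python
def global_alignment_score(u, v, mu=1, delta=1):
    """Best global alignment score of u and v.

    S[i][j] holds the best score for aligning u[:i] with v[:j].
    """
    n, m = len(u), len(v)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -delta * i          # u[:i] aligned against indels only
    for j in range(1, m + 1):
        S[0][j] = -delta * j          # v[:j] aligned against indels only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 1 if u[i - 1] == v[j - 1] else -mu
            S[i][j] = max(S[i - 1][j - 1] + sub,   # match or mismatch
                          S[i - 1][j] - delta,     # indel in v
                          S[i][j - 1] - delta)     # indel in u
    return S[n][m]


# Three matches and one indel give 3 - 1 = 2.
print(global_alignment_score("ACGT", "AGT"))
```

The table has (n+1)(m+1) cells and each is filled in constant time, so the optimum is found in O(nm) steps despite the exponential number of alignments.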

Page 12:

Global Pairwise Alignment: More Realistic Handling of Indels

• In Nature, indels of several adjacent letters are not the sum of single indels, but the result of one event

• So reformulate as follows: [recurrence shown as an image on the slide]
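The reformulated recurrence is not preserved in this transcript. One common way to treat a run of k adjacent indels as a single event is to charge it a penalty g(k) that is not simply k times the single-indel penalty. The sketch below assumes that formulation; the function `gap`, the parameter `mu`, and the O(n^3) loop structure are illustrative choices, not necessarily what the slide shows.

```python
def global_alignment_gap_fn(u, v, gap, mu=1):
    """Global alignment score where a run of k indels costs gap(k)."""
    n, m = len(u), len(v)
    neg_inf = float("-inf")
    S = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    S[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                sub = 1 if u[i - 1] == v[j - 1] else -mu
                best = S[i - 1][j - 1] + sub
            for k in range(1, i + 1):      # last k letters of u[:i] face one gap
                best = max(best, S[i - k][j] - gap(k))
            for k in range(1, j + 1):      # last k letters of v[:j] face one gap
                best = max(best, S[i][j - k] - gap(k))
            S[i][j] = best
    return S[n][m]


# With gap(k) = 2 + k, deleting "TTT" as one event costs 5, so the score is 6 - 5 = 1.
print(global_alignment_gap_fn("AAATTTAAA", "AAAAAA", lambda k: 2 + k))
```

With `gap = lambda k: k` this reduces to the single-indel scoring used earlier.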

Page 13:

Variations of Pairwise Alignment

• Fitting a “short” seq to a “long” seq
– Indels at beginning and end are not penalized

• Find “local” alignment
– Find i, j, k, l so that S(A) is maximized, where A is an alignment of $u_i \ldots u_j$ and $v_k \ldots v_l$

[Figures: a short sequence fitted against a long one; a local alignment between a segment of U and a segment of V]
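The local-alignment variant can be sketched with the well-known Smith-Waterman modification of the global dynamic program: a cell is never allowed to drop below 0, and the answer is the best cell anywhere in the table. The scoring parameters `mu` and `delta` are the same illustrative assumptions as before.

```python
def local_alignment_score(u, v, mu=1, delta=1):
    """Best score over all alignments of a substring of u with a substring of v."""
    best = 0
    prev = [0] * (len(v) + 1)
    for i in range(1, len(u) + 1):
        cur = [0] * (len(v) + 1)
        for j in range(1, len(v) + 1):
            sub = 1 if u[i - 1] == v[j - 1] else -mu
            cur[j] = max(0,                      # start a fresh local alignment
                         prev[j - 1] + sub,      # match or mismatch
                         prev[j] - delta,        # indel in v
                         cur[j - 1] - delta)     # indel in u
            best = max(best, cur[j])
        prev = cur
    return best


# The shared segment "ACGT" gives a local score of 4.
print(local_alignment_score("GGGACGTCCC", "TTACGTAA"))
```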

Page 14:

Multiple Alignment

[Figure: multiple sequence alignment with conserved sites marked]

• Multiple seq alignment maximizes number of positions in agreement across several seqs

• Seqs belonging to the same “family” usually have more conserved positions in a multiple seq alignment

Page 15:

Multiple Alignment: A Formalization

Page 16:

Multiple Alignment: Naïve Approach

• Let S(A) be the score of a multiple alignment A. The optimal multiple alignment A of sequences $U_1, \ldots, U_r$ can be extracted from the following dynamic programming computation of $S_{m_1, \ldots, m_r}$: [recurrence shown as an image on the slide]

• This requires $O(2^r)$ steps

• Exercise: Propose a practical approximation

Page 17:

Phylogeny

• By looking at extent of conserved positions in the multiple seq alignment of different groups of seqs, can infer when they last shared an ancestor
⇒ Construct “family tree” or phylogeny

Page 18:

Popular Tools for Sequence Comparison: FASTA, BLAST, PatternHunter

Acknowledgements: Some slides here are “borrowed” from Bin Ma & Dong Xu

Page 19:

Scalability of Software

[Figure: nucleotides in sequence databases (billions, axis 0 to 8) by year, 1980 to 2000; annotation: 30 billion nt in year 2005]

• Increasing number of sequenced genomes: yeast, human, rice, mouse, fly, …
⇒ S/w must be “linearly” scalable to large datasets

Page 20:

Need Heuristics for Sequence Comparison

• Time complexity for optimal alignment is $O(n^2)$, where n is sequence length
⇒ Given current size of sequence databases, use of optimal algorithms is not practical for database search

• Heuristic techniques:
– BLAST
– FASTA
– PatternHunter
– MUMmer, ...

• Speed up:
– 20 min (optimal alignment)
– 2 min (FASTA)
– 20 sec (BLAST)

Page 21:

Basic Idea: Indexing & Filtering

• Good alignment includes short identical, or similar fragments
⇒ Break entire string into substrings, index the substrings
⇒ Search for matching short substrings and use as seed for further analysis
⇒ Extend to entire string to find the most significant local alignment segment
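The index-then-filter idea can be sketched as follows. This is a minimal illustration, assuming exact k-letter substrings as index keys and a single database sequence; the names `build_index` and `find_seeds` are hypothetical and do not correspond to any particular tool.

```python
from collections import defaultdict


def build_index(db_seq, k):
    """Map every length-k substring of db_seq to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(db_seq) - k + 1):
        index[db_seq[i:i + k]].append(i)
    return index


def find_seeds(query, index, k):
    """Return (query_pos, db_pos) pairs where a length-k substring matches exactly."""
    seeds = []
    for q in range(len(query) - k + 1):
        for d in index.get(query[q:q + k], []):
            seeds.append((q, d))
    return seeds


index = build_index("ACGTACGT", 4)
print(find_seeds("TTACGT", index, 4))
```

Only the seed positions returned here need to be examined further, which is what lets the search skip most of the database.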

Page 22:

FASTA in 4 Steps
Lipman & Pearson, Science 227:1435-1441, 1985

Step 1
• Find regions of seqs w/ highest density of matches
– Exact matches of a given length (by default 2 for proteins, 6 for nucleic acids) are determined
– Regions w/ a high number of matches are selected

Page 23:

FASTA in 4 Steps
Lipman & Pearson, Science 227:1435-1441, 1985

Step 2
• Rescan 10 regions w/ highest density of identities using BLOSUM50 matrix
• Trim ends of region to include only those residues contributing to the highest score
• Each region is a partial alignment w/o gaps

Page 24:

FASTA in 4 Steps
Lipman & Pearson, Science 227:1435-1441, 1985

Step 3
• If there are several regions w/ scores greater than a cut off value, try to join these regions.
• A score for the joined initial regions is calculated given a penalty for each gap

Page 25:

FASTA in 4 Steps
Lipman & Pearson, Science 227:1435-1441, 1985

Step 4
• Select seqs in db w/ highest score
• For each seq construct a Smith-Waterman optimal alignment considering only positions that lie in a band centred on the best initial region found

Page 26:

BLAST in 3 Steps
Altschul et al, JMB 215:403-410, 1990

• Word matching as FASTA
• Similarity matching of words (3 aa’s, 11 bases)
– words need not be identical
• If no words are similar, then no alignment
– won’t find matches for very short sequences
• MSP: Highest scoring pair of segments of identical length. A segment pair is locally maximal if it cannot be improved by extending or shortening the segments
• Find alignments w/ optimal max segment pair (MSP) score
• Gaps not allowed
• Homologous seqs will contain a MSP w/ a high score; others will be filtered out

Page 27:

BLAST in 3 Steps
Altschul et al, JMB 215:403-410, 1990

Step 1
• For the query, find the list of high scoring words of length w

Image credit: Barton

Page 28:

BLAST in 3 Steps
Altschul et al, JMB 215:403-410, 1990

Step 2
• Compare word list to db & find exact matches

Image credit: Barton

Page 29:

BLAST in 3 Steps
Altschul et al, JMB 215:403-410, 1990

Step 3
• For each word match, extend alignment in both directions to find alignments that score greater than a threshold s

Image credit: Barton
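Step 3 can be illustrated with a gap-free extension of a word match. The sketch below assumes an X-drop style stopping rule: extension in a direction stops once the running score falls more than `x_drop` below the best score seen. The `x_drop` parameter, the unit match/mismatch scores, and the function name are illustrative assumptions rather than details from the slides; the caller would compare the returned score with the threshold s.

```python
def extend_ungapped(query, target, q, t, k, x_drop=2, match=1, mismatch=-1):
    """Extend the word match query[q:q+k] / target[t:t+k] in both directions.

    Returns (score, q_start, q_end) for the best ungapped segment pair found,
    where query[q_start:q_end] is the extended segment in the query.
    """
    def pair(a, b):
        return match if a == b else mismatch

    def extend(step, qi, ti):
        best = cur = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= ti < len(target):
            cur += pair(query[qi], target[ti])
            length += 1
            if cur > best:
                best, best_len = cur, length
            elif best - cur > x_drop:
                break                      # score dropped too far; give up
            qi += step
            ti += step
        return best, best_len

    seed_score = sum(pair(query[q + i], target[t + i]) for i in range(k))
    right, right_len = extend(+1, q + k, t + k)
    left, left_len = extend(-1, q - 1, t - 1)
    return seed_score + left + right, q - left_len, q + k + right_len


# The 3-letter word match "GTA" at position 4 grows into the segment "ACGTACG".
print(extend_ungapped("TTACGTACGAA", "GGACGTACGCC", 4, 4, 3))
```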

Page 30:

Spaced Seeds

• 111010010100110111 is an example of a spaced seed model with
– 11 required matches (weight=11)
– 7 “don’t care” positions

GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT…
|| ||||||||| ||||| || |||||   ||||||
GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT…
       111010010100110111

• 11111111111 is the BLAST seed model for comparing DNA seqs
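To make the seed model concrete, the sketch below slides a seed along two equal-length aligned sequences and reports every offset at which all "1" positions match. The function name and the representation of a seed as a string of "1" and "0" are illustrative assumptions.

```python
def seed_hits(a, b, seed):
    """Offsets at which every '1' position of the seed is a match between a and b."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    hits = []
    for offset in range(min(len(a), len(b)) - len(seed) + 1):
        if all(a[offset + i] == b[offset + i] for i in care):
            hits.append(offset)
    return hits


a = "GAGTACTCAACACCAACATTAGTGGCAATGGAAAAT"
b = "GAATACTCAACAGCAACACTAATGGCAGCAGAAAAT"
print(seed_hits(a, b, "111010010100110111"))  # spaced seed: [7]
print(seed_hits(a, b, "11111111111"))         # contiguous 11-letter seed: []
```

On the example pair from the slide, the spaced seed finds a hit although the longest run of consecutive matches is only 9, too short for the contiguous 11-letter seed.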

Page 31:

Observations on Spaced Seeds

• Seed models w/ different shapes can detect different homologies
– the 3rd base in a codon “wobbles” so a seed like 110110110… should be more sensitive when matching coding regions

⇒ Some models detect more homologies
– More sensitive homology search
– PatternHunter I

⇒ Use >1 seed models to hit more homologies
– Approaching 100% sensitive homology search
– PatternHunter II

Page 32:

PatternHunter I
Ma et al., Bioinformatics 18:440-445, 2002

• Spaced seeds use fewer hits to detect one homology
⇒ Efficient

CAA?A??A?C??TA?TGG?
|||?|??|?|??||?|||?
CAA?A??A?C??TA?TGG?
111010010100110111
 111010010100110111

$1/4^6$ chance to have 2nd hit next to the 1st hit

• BLAST’s seed usually uses more than one hit to detect one homology
⇒ Wasteful

TTGACCTCACC?
|||||||||||?
TTGACCTCACC?
11111111111
 11111111111

1/4 chance to have 2nd hit next to the 1st hit

Page 33:

PatternHunter I
Ma et al., Bioinformatics 18:440-445, 2002

Proposition. The expected number of hits of a weight-W length-M model within a length-L region of similarity p is $(L - M + 1) \cdot p^W$

Proof. For any fixed position, the prob of a hit is $p^W$. There are L – M + 1 candidate positions. The proposition follows by linearity of expectation.

Page 34:

Implication

• For L = 1017
– BLAST seed expects $(1017 - 11 + 1) \cdot p^{11} = 1007 \cdot p^{11}$ hits
– But ~1/4 of these overlap each other. So likely to have only ~$750 \cdot p^{11}$ distinct hits
– Our example spaced seed expects $(1017 - 18 + 1) \cdot p^{11} = 1000 \cdot p^{11}$ hits
– But only $1/4^6$ of these overlap each other. So likely to have ~$1000 \cdot p^{11}$ distinct hits

⇒ Spaced seeds likely to be more sensitive & more efficient
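The counts above follow directly from the proposition's formula. A minimal sketch, with an illustrative function name:

```python
def expected_hits(L, M, W, p):
    """Expected number of hits of a weight-W, length-M seed in a length-L
    region where each position matches independently with probability p."""
    return (L - M + 1) * p ** W


# Number of candidate positions (the factor multiplying p**11) for each seed.
print(expected_hits(1017, 11, 11, 1.0))   # BLAST seed: 1007 positions
print(expected_hits(1017, 18, 11, 1.0))   # spaced seed: 1000 positions

# Expected hits at 70% similarity, for comparison.
print(expected_hits(1017, 11, 11, 0.7))
print(expected_hits(1017, 18, 11, 0.7))
```

The two seeds expect nearly the same number of hits; the difference highlighted on the slide is how many of those hits are redundant overlaps.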

Page 35:

Sensitivity of PatternHunter I

Image credit: Li

Page 36:

Speed of PatternHunter I

• Mouse Genome Consortium used PatternHunter to compare mouse genome & human genome

• PatternHunter did the job in 20 CPU-days; it would have taken BLAST 20 CPU-years!

Nature, 420:520-522, 2002

Page 37:

How to Increase Sensitivity?

• Ways to increase sensitivity:
– “Optimal” seed
– Reduce weight by 1
– Increase number of spaced seeds by 1

• For L = 1017 & p = 50%
– 1 weight-11 length-18 model expects $1000/2^{11}$ hits
– 2 weight-12 length-18 models expect $2 \cdot 1000/2^{12} = 1000/2^{11}$ hits

⇒ When comparing regions w/ >50% similarity, using 2 weight-12 spaced seeds together is more sensitive than using 1 weight-11 spaced seed!
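The 50% break-even point can be read off the expected-hit formula $(L - M + 1) \cdot p^W$. The short derivation below compares expected hit counts only; treating the two weight-12 seeds' hits as simply additive and ignoring overlaps is a simplifying assumption.

```latex
\[
\underbrace{2 \cdot 1000 \cdot p^{12}}_{\text{two weight-12 seeds}}
\;\ge\;
\underbrace{1000 \cdot p^{11}}_{\text{one weight-11 seed}}
\iff 2p \ge 1
\iff p \ge \tfrac{1}{2}
\]
```

At p = 1/2 the two sides are equal, and for any higher similarity the pair of weight-12 seeds expects strictly more hits.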

Page 38:

PatternHunter II
Li et al, GIW, 164-175, 2003

• Idea
– Select a group of spaced seed models
– For each hit of each model, conduct extension to find a homology

• Selecting optimal multiple seeds is NP-hard

• Algorithm to select multiple spaced seeds
– Let A be an empty set
– Let s be the seed such that A ∪ {s} has the highest hit probability
– A = A ∪ {s}
– Repeat until |A| = K

• Computing hit probability of multiple seeds is NP-hard
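The greedy selection loop can be sketched as follows. Because computing the hit probability is hard in general, this illustration assumes a very short region and computes the probability exactly by brute force over all 2^L match patterns; a real implementation would need something far more efficient. The function names, the candidate list, and the independence of positions are all assumptions made for illustration.

```python
from itertools import product


def hit_probability(seeds, L, p):
    """Exact probability that at least one seed hits a length-L region whose
    positions match independently with probability p (brute force, small L only)."""
    total = 0.0
    for pattern in product((0, 1), repeat=L):
        hit = any(
            all(pattern[offset + i] for i, c in enumerate(seed) if c == "1")
            for seed in seeds
            for offset in range(L - len(seed) + 1)
        )
        if hit:
            k = sum(pattern)
            total += p ** k * (1 - p) ** (L - k)
    return total


def greedy_select(candidates, K, L, p):
    """Greedily add the seed that maximizes the joint hit probability (K <= len(candidates))."""
    chosen = []
    while len(chosen) < K:
        remaining = [s for s in candidates if s not in chosen]
        best = max(remaining, key=lambda s: hit_probability(chosen + [s], L, p))
        chosen.append(best)
    return chosen


selected = greedy_select(["111", "1101", "1011"], 2, 8, 0.7)
print(selected, hit_probability(selected, 8, 0.7))
```

Greedy selection gives no optimality guarantee here; it is simply a practical way around the NP-hardness of choosing the best seed set.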

Page 39:

Sensitivity of PatternHunter II

[Figure: sensitivity curves, with curves labelled “One weight-12”, “Two weight-12”, and “One weight-11”; y-axis: sensitivity]

• Solid curves: Multiple (1, 2, 4, 8, 16) weight-12 spaced seeds

• Dashed curves: Optimal spaced seeds with weight = 11, 10, 9, 8

⇒ “Doubling the seed number” gains better sensitivity than “decreasing the weight by 1”

Image credit: Ma

Page 40:

Expts on Real Data

• 30k mouse ESTs (25Mb) vs 4k human ESTs (3Mb)
– downloaded from NCBI GenBank
– “low complexity” regions filtered out

• SSearch (Smith-Waterman method) finds “all” pairs of ESTs with significant local alignments

• Check what percentage of these pairs can be “found” by BLAST and different configurations of PatternHunter II

Page 41:

Results

In fact, at 80% similarity, 100% sensitivity can be achieved using 40 weight-9 seeds

Image credit: Ma

Page 42:

Farewell to the Supercomputer Age of Sequence Comparison!

Image credit: Bioinformatics Solutions Inc

Page 43:

Application of Sequence Comparison: Guilt-by-Association

Page 44:

Emerging Patterns

• An emerging pattern is a pattern that occurs significantly more frequently in one class of data compared to other classes of data

• Many biological sequence analysis problems can be thought of as extracting emerging patterns from sequence comparison results

Page 45:

A protein is a ...

• A protein is a large complex molecule made up of one or more chains of amino acids

• Proteins perform a wide variety of activities in the cell

Page 46:

SPSTNRKYPPLPVDKLEEEINRRMADDNKLFREEFNALPACPIQATCEAASKEENKEKNRYVNILPYDHSRVHLTPVEGVPDSDYINASFINGYQEKNKFIAAQGPKEETVNDFWRMIWEQNTATIVMVTNLKERKECKCAQYWPDQGCWTYGNVRVSVEDVTVLVDYTVRKFCIQQVGDVTNRKPQRLITQFHFTSWPDFGVPFTPIGMLKFLKKVKACNPQYAGAIVVHCSAGVGRTGTFVVIDAMLDMMHSERKVDVYGFVSRIRAQRCQMVQTDMQYVFIYQALLEHYLYGDTELEVT

Function Assignment to Protein Sequence

• How do we attempt to assign a function to a new protein sequence?

Page 47:

Guilt-by-Association

• Compare the target sequence T with sequences $S_1, \ldots, S_n$ of known function in a database

• Determine which ones amongst $S_1, \ldots, S_n$ are the most likely homologs of T

• Then assign to T the same function as these homologs

• Finally, confirm with suitable wet experiments

Page 48:

Guilt-by-Association

[Flowchart: Compare T with seqs of known function in a db → Assign to T same function as homologs → Confirm with suitable wet experiments; otherwise, discard this function as a candidate]

Page 49:

BLAST: How it works
Altschul et al., JMB, 215:403-410, 1990

• BLAST is one of the most popular tools for doing “guilt-by-association” sequence homology search

[Figure: find from db seqs with short perfect matches to query seq (Exercise: Why do we need this step?) → find seqs with good flanking alignment]

Page 50:

Homologs obtained by BLAST

• Thus our example sequence could be a protein tyrosine phosphatase (PTP)

Page 51:

Example Alignment with PTP

Page 52:

Guilt-by-Association: Caveats

• Ensure that the effect of database size has been accounted for

• Ensure that the function of the homolog is not derived via invalid “transitive assignment”

• Ensure that the target sequence has all the key features associated with the function, e.g., active site and/or domain

Page 53:

Interpretation of P-value

• Seq. comparison progs, e.g. BLAST, often associate a P-value to each hit

• P-value is interpreted as prob. that a random seq. has an equally good alignment

• Suppose the P-value of an alignment is $10^{-6}$

• If database has $10^7$ seqs, then you expect $10^7 \cdot 10^{-6} = 10$ seqs in it that give an equally good alignment

⇒ Need to correct for database size if your seq. comparison prog does not do that!
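The database-size correction is a one-line calculation. The sketch below reproduces the slide's arithmetic and, as an additional illustration under the assumption that database sequences behave like independent random sequences, the chance of seeing at least one equally good alignment. The function names are hypothetical.

```python
def expected_chance_hits(p_value, db_size):
    """Expected number of database sequences that align equally well by chance."""
    return p_value * db_size


def prob_at_least_one_chance_hit(p_value, db_size):
    """Probability of at least one chance hit, assuming independent sequences."""
    return 1.0 - (1.0 - p_value) ** db_size


print(expected_chance_hits(1e-6, 10 ** 7))           # about 10
print(prob_at_least_one_chance_hit(1e-6, 10 ** 7))   # very close to 1
```

A P-value that looks impressive for a single comparison can therefore be unremarkable once the whole database has been searched.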

Page 54:

Examples of Invalid Function Assignment: The IMP dehydrogenases (IMPDH)

[Table: a partial list of IMP dehydrogenase misnomers in complete genomes remaining in some public databases]

Page 55:

IMPDH Domain Structure

[Figure: domain structures, with panels labelled “IMPDH Misnomer in Methanococcus jannaschii” and “IMPDH Misnomers in Archaeoglobus fulgidus”]

• Typical IMPDHs have 2 IMPDH domains that form the catalytic core and 2 CBS domains.

• A less common but functional IMPDH (E70218) lacks the CBS domains.

• Misnomers show similarity to the CBS domains

Page 56:

Invalid Transitive Assignment

[Figure: sequences A, B, and C, annotated “Root of invalid transitive assignment”, “Mis-assignment of function”, and “No IMPDH domain”]

Page 57:

Emerging Pattern

• Most IMPDHs have 2 IMPDH and 2 CBS domains.
• Some IMPDH (E70218) lacks CBS domains.
⇒ IMPDH domain is the emerging pattern

[Figure: domain structures, with panels labelled “Typical IMPDH”, “Functional IMPDH w/o CBS”, “IMPDH Misnomer in Methanococcus jannaschii”, and “IMPDH Misnomers in Archaeoglobus fulgidus”]

Page 58:

Application of Sequence Comparison: Active Site/Domain Discovery

Page 59:

Discover Active Site and/or Domain

• How to discover the active site and/or domain of a function in the first place?
– Multiple alignment of homologous seqs
– Determine conserved positions
– Emerging patterns relative to background
⇒ Candidate active sites and/or domains

• Easier if sequences of distant homologs are used

Page 60:

Multiple Alignment of PTPs

• Notice the PTPs agree with each other on some positions more than other positions

• These positions are more impt wrt PTPs
• Else they wouldn’t be conserved by evolution
⇒ They are candidate active sites

Page 61:

Guilt-by-Association: What if no homolog of known function is found?

genome phylogenetic profiles

ProtFun’s feature profiles

Page 62:

Phylogenetic Profiling
Pellegrini et al., PNAS, 96:4285-4288, 1999

• Genes (and hence proteins) with identical patterns of occurrence across phyla tend to function together
⇒ Even if no homolog with known function is available, it is still possible to infer function of a protein

Page 63:

Phylogenetic Profiling: How it Works

Image credit: Pellegrini

Page 64:

Phylogenetic Profiling: P-value

[Formula shown as an image on the slide, with the following annotations]
– No. of ways to distribute z co-occurrences over N lineages
– No. of ways to distribute the remaining x – z and y – z occurrences over the remaining N – z lineages
– No. of ways of distributing X and Y over N lineages without restriction

Page 65:

Phylogenetic Profiles: Evidence
Pellegrini et al., PNAS, 96:4285-4288, 1999

• Proteins grouped based on similar keywords in SWISS-PROT have more similar phylogenetic profiles

[Table columns: no. of proteins in this group; no. of protein pairs in this group that differ by < 3 “bit”; no. of protein pairs in random group that differ by < 3 “bit”]

Page 66:

Phylogenetic Profiling: Evidence
Wu et al., Bioinformatics, 19:1524-1530, 2003

• Proteins having low hamming distance (thus highly similar phylogenetic profiles) tend to share common pathways

• Exercise: Why do proteins having high hamming distance also have this behaviour?

[Figure: two panels, KEGG and COG; x-axis: hamming distance (D); y-axis: fraction of gene pairs having hamming distance D and sharing a common pathway in KEGG/COG]

hamming distance(X, Y) = #lineages where X occurs + #lineages where Y occurs – 2 * #lineages where both X and Y occur
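The hamming distance formula can be checked on small profiles. This sketch assumes a phylogenetic profile is a list of 0/1 flags, one per lineage; the function name is illustrative.

```python
def hamming_distance(profile_x, profile_y):
    """Hamming distance computed via the occurrence-count formula."""
    x_count = sum(profile_x)
    y_count = sum(profile_y)
    both = sum(1 for a, b in zip(profile_x, profile_y) if a and b)
    return x_count + y_count - 2 * both


X = [1, 1, 0, 1, 0]
Y = [1, 0, 0, 1, 1]
print(hamming_distance(X, Y))  # the profiles differ in 2 lineages
```

The result equals the number of lineages in which exactly one of the two genes occurs.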

Page 67:

The ProtFun Approach
Jensen, JMB, 319:1257-1265, 2002

• A protein is not alone when performing its biological function

• It operates using the same cellular machinery for modification and sorting as all other proteins do, such as glycosylation, phosphorylation, signal peptide cleavage, …

• These have associated consensus motifs, patterns, etc.

• Proteins performing similar functions should share some such “features”

⇒ Perhaps we can predict protein function by comparing its “feature” profile with other proteins?

Page 68:

ProtFun: How it Works

[Figure annotations]
– Extract feature profile of protein using various prediction methods
– Average the output of the 5 component ANNs

Page 69:

ProtFun: Evidence

• Some combinations of “features” seem to characterize some functional categories

Page 70:

ProtFun: Example Output

• At the seq level, Prion, A4, & TTHY are dissimilar

• ProtFun predicts them to be cell envelope-related, transport & binding

• This is in agreement with known functionality of these proteins

Page 71:

ProtFun: Performance

Page 72:

SVM-Pairwise Framework

[Diagram: Training Data (S1, S2, S3, …) → Feature Generation → Training Features → Training → Trained SVM Model (Feature Weights); Testing Data (T1, T2, T3, …) → Feature Generation → Testing Features → Classification by Support Vector Machine (Radial Basis Function Kernel) → Discriminant Scores]

Training Features

      S1    S2    S3    …
S1    f11   f12   f13   …
S2    f21   f22   f23   …
S3    f31   f32   f33   …
…     …     …     …     …

In the training features, f31 is the local alignment score between S3 and S1

Testing Features

      S1    S2    S3    …
T1    f11   f12   f13   …
T2    f21   f22   f23   …
T3    f31   f32   f33   …
…     …     …     …     …

In the testing features, f31 is the local alignment score between T3 and S1

Image credit: Kenny Chua

Page 73:

Performance of SVM-Pairwise

• Receiver Operating Characteristic (ROC)
– The area under the curve derived from plotting true positives as a function of false positives for various thresholds.

• Rate of median False Positives (RFP)
– The fraction of negative test examples with a score better than or equal to the median of the scores of positive test examples.
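Both measures can be computed from the discriminant scores of the positive and negative test examples. The sketch below assumes that higher scores are "better" and computes the area under the ROC curve through its well-known rank interpretation (the fraction of positive/negative pairs ranked correctly, with ties counted as half); the function names are illustrative.

```python
from statistics import median


def roc_area(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def rate_of_median_false_positives(pos_scores, neg_scores):
    """Fraction of negatives scoring at least the median positive score."""
    cutoff = median(pos_scores)
    return sum(1 for n in neg_scores if n >= cutoff) / len(neg_scores)


pos = [0.9, 0.6, 0.4]
neg = [0.7, 0.3, 0.2, 0.1]
print(roc_area(pos, neg))                        # 10 of 12 pairs ranked correctly
print(rate_of_median_false_positives(pos, neg))  # 1 of 4 negatives reaches the median
```

A perfect classifier has ROC area 1 and RFP 0, so the two measures point in opposite directions.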

Page 74:

Application of Sequence Comparison: Key Mutation Site Discovery

Page 75:

Identifying Key Mutation Sites
K.L. Lim et al., JBC, 273:28986-28993, 1998

• Some PTPs have 2 PTP domains
• PTP domain D1 has much more activity than PTP domain D2
• Why? And how do you figure that out?

[Figure: sequence from a typical PTP domain D2]

Page 76:

Emerging Patterns of PTP D1 vs D2

• Collect example PTP D1 sequences
• Collect example PTP D2 sequences
• Make multiple alignment A1 of PTP D1
• Make multiple alignment A2 of PTP D2
• Are there positions conserved in A1 that are violated in A2?
• These are candidate mutations that cause PTP activity to weaken
• Confirm by wet experiments
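The comparison of A1 against A2 can be sketched as a column scan. This illustration assumes the two alignments share column coordinates (for example, both are taken from one joint alignment) and uses the strictest reading of "conserved" and "violated": every D1 sequence agrees on a residue, and no D2 sequence has it. The function names and toy sequences are hypothetical.

```python
def conserved_residue(column):
    """Return the residue if all sequences agree at this column, else None."""
    residues = set(column)
    if len(residues) == 1 and "-" not in residues:
        return residues.pop()
    return None


def candidate_mutation_sites(a1, a2):
    """Columns conserved in alignment a1 and consistently different in a2."""
    sites = []
    for pos in range(len(a1[0])):
        residue = conserved_residue([seq[pos] for seq in a1])
        if residue is None:
            continue                       # not conserved in D1
        if all(seq[pos] != residue for seq in a2):
            sites.append(pos)              # conserved in D1, absent from every D2
    return sites


d1 = ["MDKC", "MDRC", "MDKC"]
d2 = ["MEKC", "MEHA", "MEKC"]
print(candidate_mutation_sites(d1, d2))  # only column 1 (D in D1, E in D2)
```

In the toy data, column 3 is conserved in D1 but only sometimes differs in D2, so it is not reported.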

Page 77:

Emerging Patterns of PTP D1 vs D2

[Figure: sites marked “present” or “absent” across D1 and D2 sequences]

• This site is consistently conserved in D1, but is consistently missing in D2
⇒ it is an EP
⇒ possible cause of D2’s loss of function

• This site is consistently conserved in D1, but is not consistently missing in D2
⇒ it is not an EP
⇒ not a likely cause of D2’s loss of function

Key Mutation Site: PTP D1 vs D2

• Positions marked by “!” and “?” are likely places responsible for reduced PTP activity
– All PTP D1 agree on them
– All PTP D2 disagree on them

D1

D2

Key Mutation Site: PTP D1 vs D2

• Positions marked by “!” are even more likely as 3D modeling predicts they induce large distortion to structure

D1

D2

Image credit: Kolatkar

Confirmation by Mutagenesis Expt

• What wet experiments are needed to confirm the prediction?
– Mutate E → D in D2 and see if there is gain in PTP activity
– Mutate D → E in D1 and see if there is loss in PTP activity

• Exercise: Why do you need this 2-way expt?

Application of Sequence Comparison: From Looking for Similarities To Looking for Differences

Single Nucleotide Polymorphism

• SNP occurs when a single nucleotide replaces one of the other three nucleotide letters

• E.g., the alteration of the DNA segment AAGGTTA to ATGGTTA

• By convention, a single-nucleotide variant is called a SNP when it occurs in > 1% of the human population

• Most SNPs are found outside of “coding seqs” (Exercise: Why?)

⇒ SNPs found in a coding seq are of great interest as they are more likely to alter function of a protein
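The AAGGTTA / ATGGTTA example can be checked mechanically. A minimal sketch, assuming the two segments are already aligned and of equal length; the function name is hypothetical, and positions are reported 1-based.

```python
def point_differences(ref, alt):
    """List (position, ref_base, alt_base) for every single-base difference
    between two equal-length, already-aligned DNA segments."""
    if len(ref) != len(alt):
        raise ValueError("segments must have equal length")
    return [(i + 1, r, a) for i, (r, a) in enumerate(zip(ref, alt)) if r != a]

print(point_differences("AAGGTTA", "ATGGTTA"))  # [(2, 'A', 'T')]
```

Whether such a difference counts as a SNP still depends on how often it is seen across the population, which a single pair of sequences cannot tell you.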

Example SNP Report

SNP Uses

• Association studies
– Analyze DNA of group affected by disease for their SNP patterns
– Compare to patterns obtained from group unaffected by disease
– Detect diff betw SNP patterns of the two
– Find pattern most likely associated with disease-causing gene

Figure labels: normal, disease; strong assoc, weak assoc
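The association-study steps can be sketched as a ranking of SNP sites by how much allele frequencies differ between the two groups. This is an illustrative simplification: real studies use proper statistical tests and far larger samples, and the site names and alleles below are invented.

```python
def freq_gap(disease_alleles, normal_alleles):
    """Largest difference in allele frequency between the two groups at one site."""
    alleles = set(disease_alleles) | set(normal_alleles)
    return max(
        abs(disease_alleles.count(a) / len(disease_alleles)
            - normal_alleles.count(a) / len(normal_alleles))
        for a in alleles
    )

def rank_sites(disease, normal):
    """Sort sites from strongest to weakest difference between groups."""
    gaps = {site: freq_gap(disease[site], normal[site]) for site in disease}
    return sorted(gaps.items(), key=lambda kv: kv[1], reverse=True)

# Invented data: one allele per individual at each site
disease = {"site1": "TTTTA", "site2": "GGCGC"}
normal = {"site1": "AAAAT", "site2": "GCGCG"}

ranked = rank_sites(disease, normal)
for site, gap in ranked:
    print(site, round(gap, 2))
```

Here site1 shows a large frequency gap (strong association) while site2 shows none (no association).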

SNP Uses

• better evaluate role of non-genetic factors (e.g., behavior, diet, lifestyle)

• determine why people differ in abilities to absorb or clear a drug

• determine why an individual experiences side effect of a drug

• Exercise: What is the general procedure for using SNPs in these 3 types of analysis?

Application of Sequence Comparison: The 7 Daughters of Eve

Population Tree

Figure: population tree drawn from a common Root against a “Time since split” axis. Leaves: Australian, Papuan, Polynesian, Indonesian, Cherokee, Navajo, Japanese, Tibetan, English, Italian, Ethiopian, Mbuti Pygmy. Region labels: Africa, Europe, Asia, America, Oceania, Australasia.

• Estimate order in which “populations” evolved
• Based on assimilated freq of many different genes
• But …
– Is human evolution a succession of population fissions?
– Is there such a thing as a proto-Anglo-Italian population which split, never to meet again, and became inhabitants of England and Italy?

Evolution Tree

Figure: tree from the Root to African, Asian, Papuan, and European leaves; time axis marked 150,000 years ago, 100,000 years ago, 50,000 years ago, and present.

• Leaves and nodes are individual persons---real people, not a hypothetical concept like “proto-population”
• Lines drawn to reflect genetic differences between them in one special gene called mitochondrial DNA

Why Mitochondrial DNA

• Present in abundance in bone fossils
• Inherited only from mother
• Sufficient to look at the 500bp control region
• Accumulate more neutral mutations than nuclear DNA
• Accumulate mutations at the “right” rate, about 1 every 10,000 years
• No recombination, not shuffled at each generation

Mutation Rates

• All pet golden hamsters in the world descend from a single female caught in 1930 in Syria
• Golden hamsters “manage” ~4 generations a year :-)
• So >250 hamster generations since 1930
• Mitochondrial control regions of 35 (independent) golden hamsters were sequenced and compared
• No mutation was found

⇒ Mitochondrial control region mutates at the “right” rate

Contamination

• Need to know if DNA extracted from old bones is really from those bones, and not contaminated with modern human DNA

• Apply same procedure to old bones from animals, and check if you see modern human DNA. If none, then procedure is OK

Origin of Polynesians

• Do they come from Asia or America?

Figure labels (variant positions): “189, 217, 247, 261”; “189, 217”; “189, 217, 261”

Image credit: Sykes

Origin of Polynesians

• Common mitochondrial control seq from Rarotonga have variants at positions 189, 217, 247, 261. Less common ones have 189, 217, 261

• Seq from Taiwan natives have variants 189, 217

• Seq from regions in betw have variants 189, 217, 261.

• More 189, 217 closer to Taiwan. More 189, 217, 261 closer to Rarotonga

• 247 not found in America

⇒ Polynesians came from Taiwan!

• Taiwan seq sometimes have extra mutations not found in other parts

⇒ These are mutations that happened since Polynesians left Taiwan!
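The argument rests on the variant sets being nested, so that each step away from Taiwan adds a mutation. A small sketch of that check, using only the variant positions quoted on the slide (the variable names are illustrative):

```python
# Variant positions in the mitochondrial control region, as quoted on the slide
taiwan = {189, 217}
in_between = {189, 217, 261}
rarotonga_common = {189, 217, 247, 261}

# Order the sets along the proposed route from Taiwan to Rarotonga
route = [taiwan, in_between, rarotonga_common]

# Each set should be a proper subset of the next if variants only accumulate
nested = all(a < b for a, b in zip(route, route[1:]))

# Variants gained at each step along the route
gained = [sorted(b - a) for a, b in zip(route, route[1:])]

print("nested:", nested)
print("gained per step:", gained)
```

The sets are nested, with 261 gained first and 247 gained last, which is the ordering the slide's conclusion relies on.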

Neanderthal vs Cro Magnon

• Are Europeans descended purely from Cro Magnons? Pure Neanderthals? Or mixed?

Figure labels: Neanderthal, Cro Magnon

Neanderthal vs Cro Magnon

• Based on palaeontology, Neanderthal & Cro Magnon last shared an ancestor 250k years ago
• Mitochondrial control regions accumulate 1 mutation per 10k years

⇒ If Europeans have mixed ancestry, mitochondrial control regions betw 2 Europeans should have ~25 diff w/ high probability

• The number of diff betw Welsh is ~3, & at most 8
• When compared w/ other Europeans, 14 diff at most

⇒ Ancestor either 100% Neanderthal or 100% Cro Magnon

• Mitochondrial control seq from Neanderthal have 26 diff from Europeans

⇒ Ancestor must be 100% Cro Magnon
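The arithmetic behind the ~25 figure, and the comparison with the observed counts, can be written out. This sketch follows the slide's own simplified clock (one mutation per 10k years over 250k years); the labels are illustrative.

```python
YEARS_SINCE_SHARED_ANCESTOR = 250_000   # from palaeontology, per the slide
YEARS_PER_MUTATION = 10_000             # control-region clock, per the slide

# Differences expected between lineages separated since the shared ancestor
expected_if_mixed = YEARS_SINCE_SHARED_ANCESTOR / YEARS_PER_MUTATION

# Observed difference counts quoted on the slide
observed = {
    "betw Welsh (max)": 8,
    "Welsh vs other Europeans (max)": 14,
    "Neanderthal vs Europeans": 26,
}

verdicts = {
    label: ("reaches" if diff >= expected_if_mixed else "falls short of")
    for label, diff in observed.items()
}

for label, diff in observed.items():
    print(f"{label}: {diff} diff {verdicts[label]} the ~{expected_if_mixed:.0f} expected")
```

Only the Neanderthal vs European comparison reaches the expected count, so no pair of Europeans looks like it spans the Neanderthal/Cro Magnon split.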

Clan Mother

• Clan mother is the most recent maternal ancestor common to all members
• A woman w/ only sons can't be clan mother---her mitochondrial DNA can't be passed on
• A woman can't be clan mother if she has only 1 daughter---she is not most recent maternal ancestor
• Exercise: Which of the women marked in the figure is the clan mother?
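The definition of clan mother is a most-recent-common-ancestor computation along maternal lines. A minimal sketch, assuming the maternal pedigree is known as a child-to-mother map; the names and the pedigree are invented for illustration.

```python
def maternal_line(person, mother_of):
    """Return [person, mother, grandmother, ...] following mothers only."""
    line = [person]
    while line[-1] in mother_of:
        line.append(mother_of[line[-1]])
    return line

def clan_mother(members, mother_of):
    """Most recent maternal ancestor common to all members, or None."""
    first, *rest = [maternal_line(m, mother_of) for m in members]
    shared = set(first)
    for line in rest:
        shared &= set(line)
    # Walking up the first line, the first shared ancestor is the most recent
    for ancestor in first:
        if ancestor in shared:
            return ancestor
    return None

# Invented pedigree: child -> mother. Eve has one daughter (Ann) and one son (Bob).
mother_of = {
    "Ann": "Eve", "Bob": "Eve",
    "Cat": "Ann", "Dee": "Ann",
    "Fay": "Cat", "Gia": "Dee",
}

print(clan_mother(["Fay", "Gia"], mother_of))
```

Eve is a common maternal ancestor but has only one daughter, so the most recent common maternal ancestor is Ann, who has two daughters with surviving lines.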

How many clans in Europe?

• Cluster seq according to mutations

• Each cluster thus represents a major clan

• European seq cluster into 7 major clans

• The 7 clusters are aged betw 45,000 and 10,000 years (length of time taken for all mutations in a cluster to arise from a single founder seq)

• The founder seq carried by just 1 woman in each case---the clan mother

• Note that the clan mother did not need to be alone. There could be other women, it was just that their descendants eventually died out

• Exercise: How about clan father?

World Clans

More Advanced Sequence Comparison Methods

PHI-BLAST, Iterated BLAST, PSI-BLAST, SAM

PHI-BLAST: Pattern-Hit Initiated BLAST

• Input
– protein sequence and
– pattern of interest that it contains

• Output
– protein sequences that contain the pattern and have good alignment surrounding the pattern

• Impact
– able to detect statistically significant similarity between homologous proteins that are not recognizably related using traditional one-pass methods

PHI-BLAST: How it works

• find from database all seq containing given pattern
• find sequences with good flanking alignment
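The two steps can be sketched in miniature. This is not the PHI-BLAST implementation: it assumes the pattern is given as a Python regular expression, scores flanks by simple ungapped identity counts rather than alignment statistics, and uses invented sequences, names, and thresholds.

```python
import re

def flank_identity(query, q_span, subject, s_span, width=5):
    """Count identical residues in up to `width` positions on each side of
    the pattern occurrence, comparing query and subject without gaps."""
    (qs, qe), (ss, se) = q_span, s_span
    score = 0
    for k in range(1, width + 1):          # left flank, moving outward
        if qs - k >= 0 and ss - k >= 0 and query[qs - k] == subject[ss - k]:
            score += 1
    for k in range(width):                 # right flank, moving outward
        if qe + k < len(query) and se + k < len(subject) and query[qe + k] == subject[se + k]:
            score += 1
    return score

def phi_like_search(query, pattern, database, width=5, min_score=6):
    """Step 1: find database seqs containing the pattern.
    Step 2: keep those whose flanks match the query's flanks well."""
    q_hit = re.search(pattern, query)
    if q_hit is None:
        raise ValueError("query does not contain the pattern")
    results = []
    for name, seq in database.items():
        for hit in re.finditer(pattern, seq):
            score = flank_identity(query, q_hit.span(), seq, hit.span(), width)
            if score >= min_score:
                results.append((name, hit.start(), score))
    return results

# Invented toy data
query = "MKTAYCAACGHLLE"
database = {
    "s1": "MKTAYCGSCGHLLE",   # has the pattern and matching flanks
    "s2": "QQQQQCAACQQQQQ",   # has the pattern but unrelated flanks
    "s3": "MKTAYAAAAGHLLE",   # matching flanks but no pattern
}

hits = phi_like_search(query, r"C..C", database)
print(hits)
```

Only s1 passes both steps; s2 fails on the flanks and s3 never enters step 2 because it lacks the pattern.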

PHI-BLAST: IMPACT

ISS: Intermediate Sequence Search

• Two homologous seqs, which have diverged beyond the point where their homology can be recognized by a simple direct comparison, can be related through a third sequence that is suitably intermediate between the two

• High score betw A & C, and betw B & C, imply A & B are related even though their own match score is low
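The idea of relating A and B through an intermediate C can be sketched as below. It is a toy under stated assumptions: similarity is plain percent identity over equal-length ungapped sequences instead of a BLAST score, and the sequences and threshold are invented.

```python
def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def intermediates(a, b, database, threshold):
    """Names of database sequences that score well against both a and b."""
    return [
        name for name, c in database.items()
        if identity(a, c) >= threshold and identity(b, c) >= threshold
    ]

# Invented toy sequences: A and B have diverged, C sits between them
seq_a = "AAAAAAAAAA"
seq_b = "AAAACCCCCC"
database = {
    "c": "AAAAAAACCC",
    "d": "GGGGGGGGGG",
}

THRESHOLD = 0.6  # hypothetical cut-off for a "high" score

print("direct A vs B:", identity(seq_a, seq_b))
print("intermediates:", intermediates(seq_a, seq_b, database, THRESHOLD))
```

The direct A vs B score (0.4) falls below the cut-off, yet sequence c scores 0.7 against both, which is the situation the slide describes.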

ISS: Search Procedure

• Input seq A
• BLAST against db (p-value @ 0.081), giving matched seqs M1, M2, ...
• Keep regions in M1, M2, ... that match A, giving matched regions R1, R2, ...; discard rest of M1, M2, ...
• BLAST against db (p-value @ 0.0006), giving results H1, H2, ...

ISS: IMPACT

No obvious match between Amicyanin and Ascorbate Oxidase

ISS: IMPACT

Convincing homology via Plastocyanin

Figure annotation: Previously only this part was matched

PSI-BLAST: Position-Specific Iterated BLAST

• given a query seq, initial set of homologs is collected from db using Gapped BLAST

• weighted multiple alignment is made from query seq and homologs scoring better than threshold

• position-specific score matrix is constructed from this alignment

• matrix is used to search db for new homologs

• new homologs with good score are used to construct new position-specific score matrix

• iterate the search until no new homologs found, or until specified limit is reached
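The iterate-until-no-new-homologs loop can be sketched with a toy profile. This is a simplification under stated assumptions, not PSI-BLAST itself: sequences are equal-length and ungapped, the "matrix" is a table of raw residue frequencies per position (no sequence weighting, pseudocounts, or log-odds), and the data and threshold are invented.

```python
from collections import Counter

def build_profile(seqs):
    """Position-specific residue counts from equal-length, pre-aligned seqs."""
    return [Counter(column) for column in zip(*seqs)]

def profile_score(profile, seq, n_members):
    """Average, over positions, of the profile frequency of seq's residue."""
    return sum(col[res] / n_members for col, res in zip(profile, seq)) / len(seq)

def iterate_search(query, database, threshold, max_rounds=5):
    """Rebuild the profile from query + hits and search again until no new hits."""
    hits = {}
    for _ in range(max_rounds):
        members = [query] + list(hits.values())
        profile = build_profile(members)
        new = {
            name: seq for name, seq in database.items()
            if name not in hits
            and profile_score(profile, seq, len(members)) >= threshold
        }
        if not new:
            break
        hits.update(new)
    return sorted(hits)

# Invented toy data: s2 is too distant from the query alone,
# but is picked up once s1 has been folded into the profile
query = "AAAAAAAA"
database = {"s1": "AAAAAABB", "s2": "AAAABBBB", "s3": "CCCCCCCC"}

print(iterate_search(query, database, threshold=0.6, max_rounds=1))
print(iterate_search(query, database, threshold=0.6))
```

With a single round only s1 is found; allowing further rounds also recovers s2, which illustrates why iteration finds more remote homologs than a one-pass search.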

SAM-T98 HMM Method

• similar to PSI-BLAST
• but use HMM instead of position-specific score matrix

Comparisons

Iterated seq. comparisons vs pairwise seq. comparison

Suggested Readings

References

• S.E.Brenner. “Errors in genome annotation”, TIG, 15:132--133, 1999

• T.F.Smith & X.Zhang. “The challenges of genome sequence annotation or `The devil is in the details’”, Nature Biotech, 15:1222--1223, 1997

• D. Devos & A.Valencia. “Intrinsic errors in genome annotation”, TIG, 17:429--431, 2001.

• K.L.Lim et al. “Interconversion of kinetic identities of the tandem catalytic domains of receptor-like protein tyrosine phosphatase PTP-alpha by two point mutations is synergistic and substrate dependent”, JBC, 273:28986--28993, 1998.

• J. Park et al. “Sequence comparisons using multiple sequences detect three times as many remote homologs as pairwise methods”, JMB, 284(4):1201--1210, 1998

• J. Park et al. “Intermediate sequences increase the detection of homology between sequences”, JMB, 273:349--354, 1997

• Z. Zhang et al. “Protein sequence similarity searches using patterns as seeds”, NAR, 26(17):3986--3990, 1998

• M.S.Gelfand et al. “Gene recognition via spliced sequence alignment”, PNAS, 93:9061--9066, 1996

• B. Ma et al. “PatternHunter: Faster and more sensitive homology search”, Bioinformatics, 18:440--445, 2002

• M. Li et al. “PatternHunter II: Highly sensitive and fast homology search”, GIW, 164--175, 2003

• S.F.Altschul et al. “Basic local alignment search tool”, JMB, 215:403--410, 1990

• S.F.Altschul et al. “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs”, NAR, 25(17):3389--3402, 1997.

• M. Pellegrini et al. “Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles”, PNAS, 96:4285--4288, 1999

• J. Wu et al. “Identification of functional links between genes using phylogenetic profiles”, Bioinformatics, 19:1524--1530, 2003

• L.J.Jensen et al. “Prediction of human protein function from post-translational modifications and localization features”, JMB, 319:1257--1265, 2002

• B. Sykes. “The seven daughters of Eve”, Corgi Books, 2002

• P.Erdos & A. Renyi. “On a new law of large numbers”, J. Anal. Math., 22:103--111, 1970

• R. Arratia & M. S. Waterman. “Critical phenomena in sequence matching”, Ann. Prob., 13:1236--1249, 1985

• R. Arratia, P. Morris, & M. S. Waterman. “Stochastic scrabble: Large deviations for sequences with scores”, J. Appl. Prob., 25:106--119, 1988

• R. Arratia, L. Gordon. “Tutorial on large deviations for the binomial distribution”, Bull. Math. Biol., 51:125--131, 1989

• 6.30pm @ LT28, South Spine
• Bertil 96560566