Comparative Sequence Analysis in Molecular Biology
Comparative Sequence Analysis in Molecular Biology Martin Tompa Computer Science & Engineering Genome Sciences University of Washington Seattle, Washington, U.S.A.

Page 1: Comparative Sequence Analysis in Molecular Biology

Comparative Sequence Analysis in Molecular Biology

Martin Tompa

Computer Science & Engineering

Genome Sciences

University of Washington

Seattle, Washington, U.S.A.

Page 3: Comparative Sequence Analysis in Molecular Biology

3

Outline

• What genome data is available?

• What is phylogenetic footprinting?

• Phylogenetic footprinting by multiple sequence alignment

• Which parts of multiple sequence alignments are trustworthy?

• FootPrinter: phylogenetic footprinting without alignment

Page 5: Comparative Sequence Analysis in Molecular Biology

5

How Many Genomes Are Available?

• 46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)

• 1766 bacterial genomes sequenced (as of 2/12/2012)

• Insects, fungi, worms, plants, …

• Many more will be finished very soon

• Fertile ground for comparative genomics

Page 6: Comparative Sequence Analysis in Molecular Biology

6

1982-2003: number of nucleotides in GenBank doubled every 18 months

Since 2003: doubled every 3 years
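To make the two doubling rates concrete, here is a small arithmetic sketch. The 1982-2003 span is from the slide; the 2012 endpoint for the later period is an assumption, taken from the "as of 2/12/2012" date quoted elsewhere in the talk.

```python
def growth_factor(years: float, doubling_time_years: float) -> float:
    """Multiplicative growth over `years` when a quantity doubles every `doubling_time_years`."""
    return 2.0 ** (years / doubling_time_years)

# 21 years at one doubling every 18 months (figures from the slide).
early = growth_factor(2003 - 1982, 1.5)
# 9 years at one doubling every 3 years (the 2012 endpoint is an assumption).
late = growth_factor(2012 - 2003, 3.0)

print(f"1982-2003: about {early:,.0f}-fold growth")
print(f"2003-2012: about {late:.0f}-fold growth")
```

Fourteen doublings in the first period against three in the second shows how sharply the growth rate of GenBank slowed after 2003.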

Page 8: Comparative Sequence Analysis in Molecular Biology

8

Phylogenetic Footprinting (Tagle et al. 1988)

Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.

1. Consider a set of corresponding DNA sequences from related species.

2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”

Page 11: Comparative Sequence Analysis in Molecular Biology

11

How to Find Conserved Motifs

ACTAACCGGGAGATTTCAGA human

AAGTTCCGGGAGATTTCCA chimp

TAGTTATCCGGGAGATTAGA mouse

AAAACCGGTAGATTTCAGG rat

Page 12: Comparative Sequence Analysis in Molecular Biology

12

Multiple Sequence Alignment

AC--TAACCGGGAGATTTCAGA human

AAGTT--CCGGGAGATTTCC-A chimp

TAGTTATCCGGGAGATT--AGA mouse

AA---AACCGGTAGATTTCAGG rat

(Finding the optimal alignment is NP-complete.)
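As a minimal sketch of how conserved motifs can be read off an alignment, the code below scores each column of the four-row alignment above by the fraction of rows sharing the most common nucleotide, then reports the longest run of columns at or above a threshold. The function names and the scoring rule are illustrative assumptions, not the method used by any particular footprinting tool.

```python
from collections import Counter

# Toy alignment copied from the slide (rows: human, chimp, mouse, rat).
ALIGNMENT = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def column_identity(rows):
    """Per column, the fraction of rows that carry the most common nucleotide (gaps never count)."""
    scores = []
    for column in zip(*rows):
        counts = Counter(base for base in column if base != "-")
        top = max(counts.values()) if counts else 0
        scores.append(top / len(column))
    return scores

def longest_conserved_run(scores, threshold):
    """Return (start, end) of the first longest run of columns scoring >= threshold; end is exclusive."""
    best_start, best_end = 0, 0
    run_start = None
    for i, score in enumerate(scores + [-1.0]):  # the sentinel closes a run that reaches the last column
        if score >= threshold:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_end - best_start:
                best_start, best_end = run_start, i
            run_start = None
    return best_start, best_end

scores = column_identity(list(ALIGNMENT.values()))
start, end = longest_conserved_run(scores, threshold=1.0)
print(ALIGNMENT["human"][start:end])  # human bases in the longest perfectly conserved run
```

Lowering the threshold to 0.75 tolerates one disagreeing row per column, so the run extends across the single rat mismatch inside CCGGGAGATT.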

Page 13: Comparative Sequence Analysis in Molecular Biology

13

Phylogenetic Footprinting

1. Use a whole-genome multiple alignment such as the one provided by the UCSC Genome Browser.

2. Search for regions of well conserved alignment.

– Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]

– RNA elements [Pedersen; Washietl]

– General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]

Page 15: Comparative Sequence Analysis in Molecular Biology

15

Why Doubt Alignments?

• Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete)

• Aligning whole genomes adds the complications of huge sequences and genomic rearrangements

• Vertebrate alignment has 3.8 billion columns

• Automatically generated

Page 16: Comparative Sequence Analysis in Molecular Biology

16

Assessing 4 Genome-Size Alignments (with Xiaoyu Chen)

• Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003], TBA [Blanchette 2003], Pecan [Paten 2008]

• Target ENCODE regions: 30 Mbp covering 1% of the human genome (ENCODE targets)

• Total input: 554 Mbp over 28 vertebrates

• Rich resource for comparing and assessing genome-size alignments

Margulies et al. 2007, Genome Research

Page 17: Comparative Sequence Analysis in Molecular Biology

17

Coverage of each alignment

[Figure: four panels (coding, UTR, intronic, and intergenic Kbp) showing the coverage of TBA, MAVID, MLAGAN, and Pecan; vertical axes run from 0 to 600 Kbp for coding and UTR, and from 0 to 12000 Kbp for intronic and intergenic.]

Alignment coverage: number of human bases aligned to a given species
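A minimal sketch of this coverage measure, using the toy four-species alignment from the "Multiple Sequence Alignment" slide. It assumes that a human base counts as aligned to a species when the same column holds a nucleotide, not a gap, in that species' row; the function name is illustrative.

```python
def coverage(human_row: str, species_row: str) -> int:
    """Number of human bases placed in a column where the other species also has a base."""
    if len(human_row) != len(species_row):
        raise ValueError("alignment rows must have equal length")
    return sum(1 for h, s in zip(human_row, species_row) if h != "-" and s != "-")

human = "AC--TAACCGGGAGATTTCAGA"
others = {
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

for species, row in others.items():
    print(species, coverage(human, row))
```

On real genome-size alignments the same count would be accumulated separately for coding, UTR, intronic, and intergenic human bases.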

Page 18: Comparative Sequence Analysis in Molecular Biology

18

Coverage of each alignment

[Figure: four panels (coding, UTR, intronic, and intergenic Kbp) showing the coverage of TBA, MAVID, MLAGAN, and Pecan.]

In noncoding regions, as species distance from human↑, coverage↓

Page 19: Comparative Sequence Analysis in Molecular Biology

19

Coverage of each alignment

[Figure: four panels (coding, UTR, intronic, and intergenic Kbp) showing the coverage of TBA, MAVID, MLAGAN, and Pecan.]

MAVID has lowest coverage

Page 20: Comparative Sequence Analysis in Molecular Biology

20

Coverage of each alignment

[Figure: four panels (coding, UTR, intronic, and intergenic Kbp) showing the coverage of TBA, MAVID, MLAGAN, and Pecan.]

Other 3 have comparable coverage in placental mammals

Page 21: Comparative Sequence Analysis in Molecular Biology

21

Coverage of each alignment

[Figure: four panels (coding, UTR, intronic, and intergenic Kbp) showing the coverage of TBA, MAVID, MLAGAN, and Pecan.]

MLAGAN has highest coverage in distant species, intronic and intergenic

Page 22: Comparative Sequence Analysis in Molecular Biology

22

Level of agreement among alignments

[Figure: Agree%, Disagree%, and Unique% (vertical axes 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L), one T/V/L group per species, in four panels: coding, UTR, intronic, and intergenic bases.]

Page 23: Comparative Sequence Analysis in Molecular Biology

23

Level of agreement among alignments

[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L) in four panels: coding, UTR, intronic, and intergenic bases.]

Agree%: Coding > UTR > Int.

Page 24: Comparative Sequence Analysis in Molecular Biology

24

Level of agreement among alignments

[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L) in four panels: coding, UTR, intronic, and intergenic bases.]

Unique%: Coding < UTR < Int.

Page 25: Comparative Sequence Analysis in Molecular Biology

25

Level of agreement among alignments

[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L) in four panels: coding, UTR, intronic, and intergenic bases.]

As species distance from human↑, Agree%↓ Unique%↑

Page 26: Comparative Sequence Analysis in Molecular Biology

26

Level of agreement among alignments

[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L) in four panels: coding, UTR, intronic, and intergenic bases.]

Primates: high Agree%

Page 27: Comparative Sequence Analysis in Molecular Biology

27

Level of agreement among alignments

[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L) in four panels: coding, UTR, intronic, and intergenic bases.]

Placental nonprimates: Agree% > 0.5

Page 28: Comparative Sequence Analysis in Molecular Biology

28

Level of agreement among alignments

[Figure: Agree%, Disagree%, and Unique% for TBA (T), MAVID (V), and MLAGAN (L) in four panels: coding, UTR, intronic, and intergenic bases.]

Distant species, Int: low Agree%, high Unique%

Page 29: Comparative Sequence Analysis in Molecular Biology

29

Alignment agreement for mouse

[Figure: Agree%, Disagree%, and Unique% (vertical axes 0.0 to 1.0) for mouse, in two panels: intronic and intergenic bases.]

• Intronic & intergenic account for 95% of mouse bases aligned to human

• Agree% in those categories: 44% to 62%

• Much worse for more distant species

• Building reliable MSA remains challenging

Page 30: Comparative Sequence Analysis in Molecular Biology

30

Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)

Goal: label each alignment column with confidence measure of alignment correctness

– Identify sequences that do not belong

• Users forewarned about regions of interest

• Genome browser designers consider realigning

• Alignment tool designers get feedback for possible improvements

Page 31: Comparative Sequence Analysis in Molecular Biology

31

Sample Suspicious Alignment

Human -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Chimp -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Rhesus -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Mouse -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC

Rat -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC

Dog -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Cow -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Elephant -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Tenrec -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Opossum -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC

Chicken -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC

Zebrafish GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG--ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC

Page 32: Comparative Sequence Analysis in Molecular Biology

32

Scoring Function

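Pairwise:

$$\mathrm{score}(1, 2) = \log \frac{\Pr(1, 2)}{\Pr(1)\,\Pr(2)}$$

Multiple (1 = Human, 2 = Chimp, 3 = Mouse, 4 = Rat, 5 = Chicken):

$$\mathrm{sc}(12345 \mid \cdot) = \log \frac{\Pr(12345 \mid \cdot)}{\Pr(125 \mid \cdot)\,\Pr(34 \mid \cdot)}$$

[Each conditioning argument, written here as a dot, was a tree diagram on the slide.]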

Page 33: Comparative Sequence Analysis in Molecular Biology

33

Outline of Computation

Input: multiple sequence alignment A

Output: discordance: $\max_k p_k$

For each branch k of the tree {

Compute scoring function $sc_k$ (Felsenstein)

Find all maximally scoring segments of A using $sc_k$ (Ruzzo & Tompa)

Compute K, λ using $sc_k$ (Karlin & Altschul)

Compute p-value $p_k$ of each segment score using K, λ (Karlin & Altschul)

}
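Two steps of this loop can be sketched in a few lines. The code below is a simplification under stated assumptions: it finds only the single best-scoring segment (Kadane's scan), not all maximal scoring segments as in Ruzzo & Tompa, and it turns a segment score into a p-value with the Karlin & Altschul approximation P(S ≥ x) ≈ 1 − exp(−K·n·e^(−λx)). The per-column scores and the values of K and λ are hypothetical placeholders.

```python
import math

def max_scoring_segment(scores):
    """Return (score, start, end) of the best contiguous non-empty segment; end is exclusive."""
    best = (scores[0], 0, 1)
    current, current_start = 0.0, 0
    for i, x in enumerate(scores):
        if current <= 0:
            current, current_start = x, i  # start a fresh segment at column i
        else:
            current += x                   # extend the running segment
        if current > best[0]:
            best = (current, current_start, i + 1)
    return best

def karlin_altschul_pvalue(score, length, K, lam):
    """Approximate P(best segment score >= score) in a sequence of `length` columns."""
    return -math.expm1(-K * length * math.exp(-lam * score))

# Hypothetical per-column scores for one branch k of the tree.
column_scores = [1, -2, 3, 4, -1, 2, -6, 1]
score, start, end = max_scoring_segment(column_scores)
p = karlin_altschul_pvalue(score, len(column_scores), K=0.1, lam=0.5)  # K and lam are made up
print(score, (start, end), round(p, 4))
```

In the procedure on the slide, K and λ are computed from the scoring function of each branch, and the p-values are then combined across branches as the maximum over k.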

Page 34: Comparative Sequence Analysis in Molecular Biology

34

Suspicious Alignment Regions

Back to four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen)

• Identify suspicious alignment regions:

– Length 50 bp

– Discordance 0.1 at each position, all with respect to the same worst species

– Fewer than 50% gapped sites

• Suspicious%

– Percentage of aligned bases in suspicious regions

Page 35: Comparative Sequence Analysis in Molecular Biology

35

Alignment accuracy

[Figure: four panels (coding, UTR, intronic, and intergenic bases) comparing TBA, MAVID, MLAGAN, and Pecan; vertical axes run from 0% to 30%.]

Page 39: Comparative Sequence Analysis in Molecular Biology

39

Can suspicious alignments be improved?

Baboon and MLAGAN (for example): all points (x,y), where

• x = human-baboon alignment score of MLAGAN region suspicious for baboon

• y = human-baboon alignment score of alternative alignment for same human region but not suspicious for baboon

y = x

y - x = μ, where μ = average y-x over all points

y - x = μ ± σ, where σ = standard deviation of y-x over all points
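The reference lines in this plot can be computed as follows. This is a sketch with hypothetical (x, y) score pairs; it assumes that "standard deviation over all points" means the population standard deviation.

```python
from statistics import mean, pstdev

# Hypothetical (x, y) pairs: x = score of the suspicious region,
# y = score of the alternative, non-suspicious alignment of the same human region.
points = [(100.0, 130.0), (80.0, 95.0), (120.0, 118.0), (60.0, 90.0)]

diffs = [y - x for x, y in points]
mu = mean(diffs)        # offset of the line y - x = mu
sigma = pstdev(diffs)   # half-width of the band y - x = mu +/- sigma
improved = sum(d > 0 for d in diffs) / len(diffs)  # fraction of points above y = x

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, improved = {improved:.0%}")
```

A positive μ means that, on average, the alternative alignment scores higher than the suspicious one for the same human region.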

Page 40: Comparative Sequence Analysis in Molecular Biology

40

Can suspicious alignments be improved?

Page 41: Comparative Sequence Analysis in Molecular Biology

41

Summary of comparisons (all categories)

[Figure: suspicious% (0% to 30%; low is better) plotted against log(coverage) (12 to 18; high is better) for TBA, MAVID, MLAGAN, and Pecan, with points grouped into primates, other placental mammals, and distant species.]

Page 42: Comparative Sequence Analysis in Molecular Biology

42

Conclusions

1. Disturbing lack of agreement among alignments: alignment still a hard problem

2. Performance of the aligners varies significantly by species group and region type, particularly distant species and noncoding regions

Page 44: Comparative Sequence Analysis in Molecular Biology

44

DNA, Genes, and Proteins

DNA: program for cell processes

Proteins: execute cell processes

[Figure: a DNA sequence (TCCAA CGGTGC TGAGGT GCAC) with labels DNA, Gene, and Protein.]

Page 45: Comparative Sequence Analysis in Molecular Biology

45

Regulation of Genes

• What turns genes on and off?

• When is a gene turned on or off?

• Where (in which cells) is a gene turned on?

• How many copies of the gene product are produced?

Page 46: Comparative Sequence Analysis in Molecular Biology

46

Regulation of Genes

[Figure labels: DNA, Regulatory Element, Gene, Transcription Factor, RNA polymerase.]

Page 47: Comparative Sequence Analysis in Molecular Biology

47

Regulation of Genes

[Figure labels: DNA, Regulatory Element, Gene, Transcription Factor, RNA polymerase.]

Page 48: Comparative Sequence Analysis in Molecular Biology

48

Goal

Identify regulatory elements in DNA sequences. These are:

• Binding sites for proteins

• Short subsequences (5-25 nucleotides)

• Up to 1000 nucleotides (or farther) from gene

• Inexactly repeating patterns (“motifs”)

Page 49: Comparative Sequence Analysis in Molecular Biology

49

CLUSTALW multiple sequence alignment (rbcS gene)

Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

Page 50: Comparative Sequence Analysis in Molecular Biology

50

Finding Short Motifs

AGTCGTACGTGAC... (Human)

AGTAGACGTGCCG... (Chimp)

ACGTGAGATACGT... (Rabbit)

GAACGGAGTACGT... (Mouse)

TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4

Page 51: Comparative Sequence Analysis in Molecular Biology

51

Most Parsimonious Solution

“Parsimony score”: 1 mutation

AGTCGTACGTGAC...

AGTAGACGTGCCG...

ACGTGAGATACGT...

GAACGGAGTACGT...

TCGTGACGGTGAT...

[Figure: phylogenetic tree over the five sequences, with nodes labeled ACGG, ACGT, ACGT, ACGT.]

Page 52: Comparative Sequence Analysis in Molecular Biology

52

Substring Parsimony Problem

Given:

• phylogenetic tree T,

• set of orthologous sequences at leaves of T,

• length k of motif,

• threshold d

Problem:

• Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.

This problem is NP-complete.

Page 53: Comparative Sequence Analysis in Molecular Biology

53

FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)

$W_u[s]$ = best parsimony score for subtree rooted at node u, if u is labeled with string s.

Sequences at the leaves:

AGTCGTACGTG

ACGGGACGTGC

ACGTGAGATAC

GAACGGAGTAC

TCGTGACGGTG

[Figure: phylogenetic tree over these sequences, with a table of $4^k$ entries $W_u[s]$ at every node. Entries shown at internal nodes: "ACGG: 2, ACGT: 1"; "ACGG: 0, ACGT: 2"; "ACGG: 1, ACGT: 1"; "ACGG: 1, ACGT: 0". Entries shown at leaves: "ACGG: +∞, ACGT: 0" and "ACGG: 0, ACGT: +∞".]

Page 54: Comparative Sequence Analysis in Molecular Biology

54

$$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \big( W_v[t] + d(s, t) \big)$$

Running Time

n = number of species

l = average sequence length

k = motif length

Total time $O(n k (4^{2k} + l))$
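The dynamic program can be sketched directly: each node's table sums, over its children v, the minimum over all k-mers t of W_v[t] + d(s, t), and a leaf's table holds 0 for k-mers that occur in its sequence and +∞ otherwise. The sketch assumes that d is Hamming distance, uses the five example sequences from the "Finding Short Motifs" slide with the trailing "..." dropped, and assumes the tree topology ((human, chimp), (rabbit, (mouse, rat))). It is the straightforward $4^{2k}$-per-edge version, not FootPrinter's optimized implementation.

```python
from itertools import product
import math

K = 4  # motif length, as in the slide's example
KMERS = ["".join(p) for p in product("ACGT", repeat=K)]

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def leaf_table(sequence):
    """W[s] = 0 if the k-mer s occurs in the leaf sequence, +infinity otherwise."""
    present = {sequence[i:i + K] for i in range(len(sequence) - K + 1)}
    return {s: (0 if s in present else math.inf) for s in KMERS}

def node_table(tree):
    """tree is a sequence string (leaf) or a tuple of subtrees (internal node)."""
    if isinstance(tree, str):
        return leaf_table(tree)
    table = dict.fromkeys(KMERS, 0)
    for child in tree:
        child_table = node_table(child)
        # Entries equal to +infinity can never attain the minimum, so skip them.
        finite = [(t, w) for t, w in child_table.items() if w < math.inf]
        for s in KMERS:
            table[s] += min(w + hamming(s, t) for t, w in finite)
    return table

human, chimp = "AGTCGTACGTGAC", "AGTAGACGTGCCG"
rabbit, mouse, rat = "ACGTGAGATACGT", "GAACGGAGTACGT", "TCGTGACGGTGAT"
tree = ((human, chimp), (rabbit, (mouse, rat)))  # assumed topology

root = node_table(tree)
best = min(root.values())
print(best, sorted(s for s, w in root.items() if w == best))
```

Recovering the actual k-mer chosen at each leaf requires a traceback from the root, which is omitted here.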

Page 55: Comparative Sequence Analysis in Molecular Biology

55

Improvements

• Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$

• By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)

• Amenable to many useful extensions (e.g., allow insertions and deletions)

Page 56: Comparative Sequence Analysis in Molecular Biology

56

Application to -actin Gene

Gilthead sea bream (678 bp)

Medaka fish (1016 bp)

Common carp (696 bp)

Grass carp (917 bp)

Chicken (871 bp)

Human (646 bp)

Rabbit (636 bp)

Rat (966 bp)

Mouse (684 bp)

Hamster (1107 bp)

Page 57: Comparative Sequence Analysis in Molecular Biology

57

Common carp

ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG

AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA

GACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC

Chicken

ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTG

TTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA

GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA

TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT

Human

GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTT

TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT

GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC

CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG

Parsimony score over 10 vertebrates: 0 1 2

Page 58: Comparative Sequence Analysis in Molecular Biology

58

Motifs Absent from Some Species

• Find motifs

– with small parsimony score

– that span a large part of the tree

• Example: in tree of 10 species spanning 760 Myrs, find all motifs with

– score 0 spanning at least 250 Myrs

– score 1 spanning at least 350 Myrs

– score 2 spanning at least 450 Myrs

– score 3 spanning at least 550 Myrs
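The example thresholds amount to a lookup from parsimony score to minimum span. A minimal sketch follows; the function and variable names are illustrative, and motifs with a score above the largest listed value are assumed to be rejected.

```python
# Minimum span (in Myrs) required for each parsimony score, from the slide's example.
MIN_SPAN_MYRS = {0: 250, 1: 350, 2: 450, 3: 550}

def passes(score: int, span_myrs: float) -> bool:
    """True if a motif with this parsimony score spans enough of the tree to be reported."""
    threshold = MIN_SPAN_MYRS.get(score)
    return threshold is not None and span_myrs >= threshold

# Hypothetical candidate motifs as (parsimony score, span in Myrs) pairs.
candidates = [(0, 300), (1, 340), (2, 400), (3, 760), (4, 760)]
print([c for c in candidates if passes(*c)])
```

The more mutations a motif is allowed, the larger the part of the tree it must span, so a weakly conserved motif is reported only when it persists over a long stretch of evolutionary time.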

Page 59: Comparative Sequence Analysis in Molecular Biology

59

Application to c-fos Gene

Asked for motifs of length 10, with

– 0 mutations over tree of size 6

– 1 mutation over tree of size 11

– 2 mutations over tree of size 16

– 3 mutations over tree of size 21

– 4 mutations over tree of size 26

[Figure: phylogenetic tree of puffer fish, chicken, pig, mouse, hamster, and human; branch lengths shown: 10, 2, 7, 2, 2, 21, 0, 1, 1.]

Found:

– 0 mutations over tree of size 8

– 1 mutation over tree of size 16

– 3 mutations over tree of size 21

– 4 mutations over tree of size 28

Page 60: Comparative Sequence Analysis in Molecular Biology

60

Application to c-fos Gene

Motif                 Score   Conserved in            Known?
CAGGTGCGAATGTTC       0       4 mammals
TTCCCGCCTCCCCTCCCC    0       4 mammals               yes
GAGTTGGCTGcagcc       3       puffer + 4 mammals
GTTCCCGTCAATCcct      1       chicken + 4 mammals     yes
CACAGGATGTcc          4       all 6                   yes
AGGACATCTG            1       chicken + 4 mammals     yes
GTCAGCAGGTTTCCACG     0       4 mammals               yes
TACTCCAACCGC          0       4 mammals

metK in B. subtilis

Page 61: Comparative Sequence Analysis in Molecular Biology

61

Microbial Footprinting

• 1889 prokaryotes with genomes completely sequenced (as of 2/12/2012)

– For any prokaryotic gene of interest, plenty of close genes in other species available

– Relatively simple genomes

• MicroFootPrinter (with Shane Neph)

– Designed specifically for phylogenetic footprinting in microbial genomes

– Undergraduate Computational Biology Capstone project

– User specifies species and gene of interest

– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

Page 62: Comparative Sequence Analysis in Molecular Biology

62

Demo

• MicroFootPrinter home

• Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)

– chvI (two component response regulator)

– ropB (outer membrane protein)

Page 63: Comparative Sequence Analysis in Molecular Biology

63

Sample chvI motif

Parsimony score: 2
Span: 41.10
Significance score: 4.22

B. henselae        -151   GCTACAATTT
R. etli             -90   GCCACAATTT
R. leguminosarum   -106   GCCACAATTT
S. meliloti        -119   GCCACAATTT
S. medicae         -118   GCCACAATTT
A. tumefaciens     -105   GCCACAATTT
M. loti             -80   GCCACATTTT
M. sp.              -87   GCCACATTTT
O. anthropi        -158   GCCACATTTT
B. suis             -38   GCCACATTTT
B. melitensis      -156   GCCACATTTT
B. abortus         -156   GCCACATTTT
B. ovis            -156   GCCACATTTT
B. canis            -38   GCCACATTTT

Page 64: Comparative Sequence Analysis in Molecular Biology

64

Sample ropB motif

Parsimony score: 1
Span: 20.70
Significance score: 1.34

Jannaschia sp.     -151   CACATTTTGG
R. etli            -134   CACAATTTGG
R. leguminosarum   -135   CACAATTTGG
A. tumefaciens     -131   CACATTTTGG
S. meliloti        -128   CACATTTTGG
S. medicae         -128   CACATTTTGG

Page 65: Comparative Sequence Analysis in Molecular Biology

65

Combined ChvI Motif

ropB: CACATTTTGG

chvI: GCCACAATTT

Atu1221: TTGTCACAAT

ultimate: GYCACAWTTTGG

Y={C,T}

W={A,T}
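A short sketch of how the degenerate consensus relates to the three motifs: each ambiguity code expands to a character class, and each motif is matched against the part of the consensus it overlaps. The offsets used for the overlaps are an assumption inferred from the sequences themselves, and only the two ambiguity codes defined on the slide are handled.

```python
import re

# Nucleotide codes used on the slide: Y = {C, T}, W = {A, T}.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "W": "[AT]"}

def iupac_to_regex(consensus: str) -> str:
    """Expand a degenerate consensus into a regular expression."""
    return "".join(IUPAC[ch] for ch in consensus)

CONSENSUS = "GYCACAWTTTGG"

# (motif fragment, matching slice of the consensus); the offsets are assumed overlaps.
overlaps = {
    "ropB":    ("CACATTTTGG", CONSENSUS[2:]),
    "chvI":    ("GCCACAATTT", CONSENSUS[:10]),
    "Atu1221": ("GTCACAAT",   CONSENSUS[:8]),   # last 8 bases of TTGTCACAAT
}

for gene, (motif, part) in overlaps.items():
    matched = re.fullmatch(iupac_to_regex(part), motif) is not None
    print(gene, motif, part, matched)
```

Each of the three motifs agrees with the consensus over its overlap, with Y and W absorbing the positions at which the motifs differ from one another.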

Page 66: Comparative Sequence Analysis in Molecular Biology

66

References and Acknowledgments

• Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.

• Xiaoyu Chen & Martin Tompa, Comparative Assessment of Methods for Aligning Multiple Genome Sequences. Nature Biotechnology, June 2010, 567-572.

• Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002, 739-748.

• Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.

• All software available at bio.cs.washington.edu/software