Comparative Sequence Analysis in Molecular Biology

Post on 16-Jan-2016


Comparative Sequence Analysis in Molecular Biology

Martin Tompa

Computer Science & Engineering

Genome Sciences

University of Washington

Seattle, Washington, U.S.A.

2

3

Outline

• What genome data is available?

• What is phylogenetic footprinting?

• Phylogenetic footprinting by multiple sequence alignment

• Which parts of multiple sequence alignments are trustworthy?

• FootPrinter: phylogenetic footprinting without alignment

4


5

How Many Genomes Are Available?

• 46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)

• 1766 bacterial genomes sequenced (as of 2/12/2012)

• Insects, fungi, worms, plants, …

• Many more will be finished very soon

• Fertile ground for comparative genomics

6

1982-2003: number of nucleotides in GenBank doubled every 18 months

Since 2003: doubled every 3 years
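As a rough arithmetic check on these two doubling times, the sketch below converts a doubling time into an equivalent yearly growth factor. The function name `annual_growth_factor` is an illustrative choice, and constant exponential growth within each period is an assumption.

```python
def annual_growth_factor(doubling_time_years: float) -> float:
    """Yearly multiplier implied by a doubling time, assuming steady exponential growth."""
    return 2 ** (1 / doubling_time_years)


# 18 months = 1.5 years (1982-2003); 3 years (since 2003).
before_2003 = annual_growth_factor(1.5)
since_2003 = annual_growth_factor(3.0)
print(f"1982-2003: x{before_2003:.2f} per year; since 2003: x{since_2003:.2f} per year")
```

Under that assumption, GenBank grew by roughly 59% per year before 2003 and by roughly 26% per year afterwards.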

7


8

Phylogenetic Footprinting (Tagle et al. 1988)

Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.

1. Consider a set of corresponding DNA sequences from related species.

2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”

9

10


11

How to Find Conserved Motifs

ACTAACCGGGAGATTTCAGA human

AAGTTCCGGGAGATTTCCA chimp

TAGTTATCCGGGAGATTAGA mouse

AAAACCGGTAGATTTCAGG rat

12

Multiple Sequence Alignment

AC--TAACCGGGAGATTTCAGA human

AAGTT--CCGGGAGATTTCC-A chimp

TAGTTATCCGGGAGATT--AGA mouse

AA---AACCGGTAGATTTCAGG rat

(Finding the optimal alignment is NP-complete.)
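Once an alignment like the one above is given, conserved columns can be read off directly. The sketch below is a minimal illustration, not the method of any particular tool: it marks a column as conserved only if all four species carry the same non-gap base (an assumed, deliberately strict criterion) and reports the longest run of such columns. The helper names are hypothetical.

```python
alignment = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}


def conserved_columns(rows):
    """One boolean per column: True if every row has the same non-gap base."""
    return [len(set(col)) == 1 and col[0] != "-" for col in zip(*rows)]


def longest_conserved_run(flags):
    """Return (start, end) of the longest run of True values; end is exclusive."""
    best = (0, 0)
    start = None
    for i, flag in enumerate(list(flags) + [False]):  # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


rows = list(alignment.values())
flags = conserved_columns(rows)
start, end = longest_conserved_run(flags)
motif = alignment["human"][start:end]
print(start, end, motif)
```

A real footprinting pipeline would tolerate some mismatches and weight them by the phylogeny; exact identity is used here only to keep the example short.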

13

Phylogenetic Footprinting

1. Use whole-genome multiple alignment such as provided by UCSC Genome Browser.

2. Search for regions of well conserved alignment.

– Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe; Xie (2)]

– RNA elements [Pedersen; Washietl]

– General conservation & constraint [Bejerano; Boffelli; Cooper; Margulies (4); Pollard; Prabhakar; Siepel]

14


15

Why Doubt Alignments?

• Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete)

• Aligning whole genomes adds the complications of huge sequences and genomic rearrangements

• Vertebrate alignment has 3.8 billion columns

• Automatically generated

16

Assessing 4 Genome-Size Alignments (with Xiaoyu Chen)

• Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003], TBA [Blanchette 2003], Pecan [Paten 2008]

• Target ENCODE regions: 30 Mbp covering 1% of the human genome (ENCODE targets)

• Total input: 554 Mbp over 28 vertebrates

• Rich resource for comparing and assessing genome-size alignments

Margulies et al. 2007, Genome Research

17

Coverage of each alignment

[Figure: bar charts of alignment coverage in Kbp for TBA, MAVID, MLAGAN, and Pecan, in four panels: coding, UTR, intronic, and intergenic]

Alignment coverage: number of human bases aligned to a given species
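The coverage definition above can be sketched on the toy alignment from the "Multiple Sequence Alignment" slide. This is a minimal illustration assuming that "aligned" means the human base and the other species' base share a column with no gap on either side; the function name `coverage` is hypothetical.

```python
def coverage(human_row: str, species_row: str) -> int:
    """Count human bases placed in a column where the other species also has a base."""
    if len(human_row) != len(species_row):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for h, s in zip(human_row, species_row) if h != "-" and s != "-")


human = "AC--TAACCGGGAGATTTCAGA"
others = {
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}
per_species = {name: coverage(human, row) for name, row in others.items()}
print(per_species)
```

Note that this counts a base as covered whether or not the alignment is correct, which is why coverage alone says nothing about accuracy.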

18

Coverage of each alignment

[Figure: bar charts of alignment coverage in Kbp for TBA, MAVID, MLAGAN, and Pecan, in four panels: coding, UTR, intronic, and intergenic]

In noncoding regions, as species distance from human ↑, coverage ↓

19

Coverage of each alignment

[Figure: bar charts of alignment coverage in Kbp for TBA, MAVID, MLAGAN, and Pecan, in four panels: coding, UTR, intronic, and intergenic]

MAVID has lowest coverage

20

Coverage of each alignment

[Figure: bar charts of alignment coverage in Kbp for TBA, MAVID, MLAGAN, and Pecan, in four panels: coding, UTR, intronic, and intergenic]

Other 3 have comparable coverage in placental mammals

21

Coverage of each alignment

[Figure: bar charts of alignment coverage in Kbp for TBA, MAVID, MLAGAN, and Pecan, in four panels: coding, UTR, intronic, and intergenic]

MLAGAN has highest coverage in distant species, intronic and intergenic

22

Level of agreement among alignments

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L) across species, in four panels: coding, UTR, intronic, and intergenic bases]

23

Level of agreement among alignments

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L) across species, in four panels: coding, UTR, intronic, and intergenic bases]

Agree%: Coding > UTR > Int.

24

Level of agreement among alignments

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L) across species, in four panels: coding, UTR, intronic, and intergenic bases]

Unique%: Coding < UTR < Int.

25

Level of agreement among alignments

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L) across species, in four panels: coding, UTR, intronic, and intergenic bases]

As species distance from human ↑, Agree% ↓ and Unique% ↑

26

Level of agreement among alignments

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L) across species, in four panels: coding, UTR, intronic, and intergenic bases]

Primates: high Agree%

27

Level of agreement among alignments

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L) across species, in four panels: coding, UTR, intronic, and intergenic bases]

Placental nonprimates: Agree% > 0.5

28

Level of agreement among alignments

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L) across species, in four panels: coding, UTR, intronic, and intergenic bases]

Distant species, Int: low Agree%, high Unique%

29

Alignment agreement for mouse

[Figure: bar charts of Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for intronic and intergenic bases]

• Intronic & intergenic account for 95% of mouse bases aligned to human

• Agree% in those categories: 44% to 62%

• Much worse for more distant species

• Building reliable MSA remains challenging

30

Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)

Goal: label each alignment column with confidence measure of alignment correctness

– Identify sequences that do not belong

• Users forewarned about regions of interest

• Genome browser designers consider realigning

• Alignment tool designers get feedback for possible improvements

31

Sample Suspicious Alignment

Human -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Chimp -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Rhesus -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Mouse -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC

Rat -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC

Dog -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Cow -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Elephant -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Tenrec -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC

Opossum -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC

Chicken -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC

Zebrafish GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG--ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC

32

Scoring Function

Pairwise: score(1, 2) = log( Pr(1, 2) / ( Pr(1) Pr(2) ) )

Multiple, for the tree with leaves 1 = Human, 2 = Chimp, 3 = Mouse, 4 = Rat, 5 = Chicken:

sc(12345 | ) = log( Pr(12345 | ) / ( Pr(125 | ) Pr(34 | ) ) )
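The pairwise score is a log-odds ratio: the probability of seeing the two bases together, divided by the probability of seeing them independently. The sketch below illustrates only that pairwise form. The joint distribution is an invented toy model (80% of aligned pairs identical, all substitutions equally likely), not the evolutionary model used in the talk, and the names `make_joint` and `pairwise_score` are hypothetical.

```python
import math

BASES = "ACGT"


def make_joint(p_identical: float = 0.8) -> dict:
    """Toy joint distribution Pr(a, b) over aligned base pairs (an assumption)."""
    joint = {}
    for a in BASES:
        for b in BASES:
            joint[(a, b)] = p_identical / 4 if a == b else (1 - p_identical) / 12
    return joint


def pairwise_score(a: str, b: str, joint: dict) -> float:
    """score(1, 2) = log( Pr(1, 2) / (Pr(1) Pr(2)) ), with marginals taken from the joint."""
    pr_a = sum(joint[(a, x)] for x in BASES)
    pr_b = sum(joint[(x, b)] for x in BASES)
    return math.log(joint[(a, b)] / (pr_a * pr_b))


joint = make_joint()
match_score = pairwise_score("A", "A", joint)
mismatch_score = pairwise_score("A", "C", joint)
print(round(match_score, 3), round(mismatch_score, 3))
```

Matches score positive and mismatches negative, which is the property the segment-scoring step relies on.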

33

Outline of Computation

Input: multiple sequence alignment A

Output: discordance = max_k p_k

For each branch k of the tree {

Compute scoring function sc_k (Felsenstein)

Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)

Compute K, λ using sc_k (Karlin & Altschul)

Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)

}
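The loop body can be sketched for a single branch as follows. This is a simplified illustration, not the authors' implementation: it finds only the single best segment with a linear scan (the Ruzzo & Tompa algorithm finds all maximal segments), and it uses the Karlin & Altschul tail approximation p ≈ 1 - exp(-K n e^(-λs)) with K and λ supplied as assumed constants rather than computed from the scoring function.

```python
import math


def max_scoring_segment(scores):
    """Return (score, start, end) of the highest-scoring contiguous segment; end is exclusive."""
    best_score, best_start, best_end = 0.0, 0, 0
    current, start = 0.0, 0
    for i, s in enumerate(scores):
        current += s
        if current <= 0:
            # A non-positive prefix can never help a later segment: restart after it.
            current, start = 0.0, i + 1
        elif current > best_score:
            best_score, best_start, best_end = current, start, i + 1
    return best_score, best_start, best_end


def segment_p_value(score: float, n: int, K: float, lam: float) -> float:
    """Approximate probability that a random length-n score sequence has a segment scoring >= score."""
    return 1.0 - math.exp(-K * n * math.exp(-lam * score))


column_scores = [1, -2, 3, 4, -1, 2, -5, 1]  # hypothetical per-column scores for one branch
best = max_scoring_segment(column_scores)
p_k = segment_p_value(best[0], n=len(column_scores), K=0.1, lam=1.0)  # K, lam are placeholders
print(best, p_k)
```

The discordance of a column would then be the maximum of such p-values over all branches k, so a column is only as trustworthy as its worst-supported branch.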

34

Suspicious Alignment Regions

Back to four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen)

• Identify suspicious alignment regions:

– Length 50 bp

– Discordance 0.1 at each position, all with respect to the same worst species

– Fewer than 50% gapped sites

• Suspicious%

– Percentage of aligned bases in suspicious regions
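The three criteria can be sketched as a scan over per-position annotations. The comparison symbols on the length and discordance thresholds did not survive in this transcript, so the sketch assumes both mean "at least" (length of at least 50 bp, discordance of at least 0.1); that reading, the input layout, and the function name are all assumptions.

```python
def suspicious_regions(discordance, worst_species, gapped,
                       min_len=50, min_disc=0.1, max_gap_frac=0.5):
    """Return (start, end, species) for each run meeting all three criteria; end is exclusive.

    discordance[i]   discordance score at position i
    worst_species[i] species responsible for that score
    gapped[i]        True if position i is a gapped site
    """
    regions = []
    n = len(discordance)
    i = 0
    while i < n:
        if discordance[i] < min_disc:
            i += 1
            continue
        # Extend the run while the score stays high and the worst species is unchanged.
        j = i
        while j < n and discordance[j] >= min_disc and worst_species[j] == worst_species[i]:
            j += 1
        length = j - i
        gap_frac = sum(gapped[i:j]) / length
        if length >= min_len and gap_frac < max_gap_frac:
            regions.append((i, j, worst_species[i]))
        i = j
    return regions


# Tiny demonstration with min_len lowered to 4 so the example stays readable.
disc = [0.0] * 5 + [0.5] * 6 + [0.0] * 3 + [0.9] * 2
found = suspicious_regions(disc, ["mouse"] * 16, [False] * 16, min_len=4)
print(found)
```

Suspicious% for a species would then be the number of aligned bases inside such regions divided by all aligned bases, times 100.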

35

Alignment accuracy

[Figure: four panels (coding, UTR, intronic, and intergenic bases), each on a scale from 0% to 30%, comparing TBA, MAVID, MLAGAN, and Pecan]

36


37


38


39

Can suspicious alignments be improved?

Baboon and MLAGAN (for example): all points (x, y), where

• x = human-baboon alignment score of MLAGAN region suspicious for baboon

•y = human-baboon alignment score of alternative alignment for same human region but not suspicious for baboon

y = x

y - x = μ, where μ = average y-x over all points

y - x = μ ± σ, where σ = standard deviation of y-x over all points

40

Can suspicious alignments be improved?

41

Summary of comparisons (all categories)

[Figure: scatter plot of suspicious% (0% to 30%; low is better) against log(coverage) (12 to 18; high is better) for TBA, MAVID, MLAGAN, and Pecan, grouped into primates, other placental mammals, and distant species]

42

Conclusions

1. Disturbing lack of agreement among alignments: alignment is still a hard problem

2. Performance of the aligners varies significantly by species group and region type, particularly distant species and noncoding regions

43


44

DNA, Genes, and Proteins

DNA: program for cell processes

Proteins: execute cell processes

[Figure: a DNA sequence containing a gene, and the protein produced from that gene]

45

Regulation of Genes

• What turns genes on and off?

• When is a gene turned on or off?

• Where (in which cells) is a gene turned on?

• How many copies of the gene product are produced?

46

Regulation of Genes

[Figure: DNA showing a regulatory element and a gene, with a transcription factor and RNA polymerase]

47

Regulation of Genes

[Figure: DNA showing a regulatory element and a gene, with a transcription factor and RNA polymerase]

48

Goal

Identify regulatory elements in DNA sequences. These are:

• Binding sites for proteins

• Short subsequences (5-25 nucleotides)

• Up to 1000 nucleotides (or farther) from gene

• Inexactly repeating patterns (“motifs”)

49

CLUSTALW multiple sequence alignment (rbcS gene)

Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG

50

Finding Short Motifs

AGTCGTACGTGAC... (Human)

AGTAGACGTGCCG... (Chimp)

ACGTGAGATACGT... (Rabbit)

GAACGGAGTACGT... (Mouse)

TCGTGACGGTGAT... (Rat)

Size of motif sought: k = 4

51

Most Parsimonious Solution

“Parsimony score”: 1 mutation

AGTCGTACGTGAC...

AGTAGACGTGCCG...

ACGTGAGATACGT...

GAACGGAGTACGT...

TCGTGACGGTGAT...

Internal node labels in the tree figure: ACGG, ACGT, ACGT, ACGT
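The parsimony score of one chosen k-mer per species can be checked with Fitch's small-parsimony algorithm, sketched below. The tree topology is an assumption (the slide's tree figure is not preserved in this transcript), and the function names are hypothetical. Choosing ACGT in four species and ACGG in rat reproduces the score of 1 mutation.

```python
def fitch_position(node, leaf_kmers, pos):
    """Return (candidate base set, mutation count) for one motif position in the subtree at node."""
    if isinstance(node, str):  # leaf: its base is fixed
        return {leaf_kmers[node][pos]}, 0
    left_set, left_cost = fitch_position(node[0], leaf_kmers, pos)
    right_set, right_cost = fitch_position(node[1], leaf_kmers, pos)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # Disjoint candidate sets force one mutation on a branch below this node.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, leaf_kmers):
    """Total mutations over all positions, for a binary tree given as nested pairs of leaf names."""
    k = len(next(iter(leaf_kmers.values())))
    return sum(fitch_position(tree, leaf_kmers, pos)[1] for pos in range(k))


# Assumed topology: primates together, rodents together, rabbit next to the rodents.
tree = (("human", "chimp"), ("rabbit", ("mouse", "rat")))
chosen = {"human": "ACGT", "chimp": "ACGT", "rabbit": "ACGT", "mouse": "ACGT", "rat": "ACGG"}
score = parsimony_score(tree, chosen)
print(score)
```

Fitch's algorithm scores a fixed choice of k-mers; the harder part of the problem is choosing which k-mer to take from each sequence.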

52

Substring Parsimony Problem

Given:

• phylogenetic tree T

• set of orthologous sequences at leaves of T

• length k of motif

• threshold d

Problem:

• Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.

This problem is NP-complete.

53

FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)

W_u[s] = best parsimony score for the subtree rooted at node u, if u is labeled with string s.

Leaf sequences:

AGTCGTACGTG

ACGGGACGTGC

ACGTGAGATAC

GAACGGAGTAC

TCGTGACGGTG

[Figure: tree over the five leaf sequences; each node u carries a table of 4^k entries W_u[s], with sample entries shown for s = ACGG and s = ACGT]

54

W_u[s] = Σ over children v of u of min over t of ( W_v[t] + d(s, t) )

Running Time

n = number of species

l = average sequence length

k = motif length

Total time: O(n k (4^(2k) + l))
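The recurrence for W_u[s] can be sketched directly as a dynamic program over all 4^k strings. This is a minimal illustration of the unoptimized version, not FootPrinter itself: it uses Hamming distance for d(s, t), the five sequences from the "Finding Short Motifs" slide, and an assumed tree topology (the tree figure is not preserved in this transcript). All function names are hypothetical.

```python
from itertools import product

INF = float("inf")


def hamming(a: str, b: str) -> int:
    """d(s, t): number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def leaf_table(seq: str, k: int, kmers):
    """W_leaf[s] is 0 if s occurs in the leaf sequence, infinity otherwise."""
    present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return {s: (0 if s in present else INF) for s in kmers}


def node_table(node, seqs, k, kmers):
    """W_u[s] = sum over children v of u of min over t of (W_v[t] + d(s, t))."""
    if isinstance(node, str):
        return leaf_table(seqs[node], k, kmers)
    table = dict.fromkeys(kmers, 0)
    for child in node:
        child_table = node_table(child, seqs, k, kmers)
        # Infinite entries can never attain the minimum, so skip them.
        finite = [(t, w) for t, w in child_table.items() if w != INF]
        for s in kmers:
            table[s] += min(w + hamming(s, t) for t, w in finite)
    return table


seqs = {
    "human": "AGTCGTACGTGAC",
    "chimp": "AGTAGACGTGCCG",
    "rabbit": "ACGTGAGATACGT",
    "mouse": "GAACGGAGTACGT",
    "rat": "TCGTGACGGTGAT",
}
tree = (("human", "chimp"), ("rabbit", ("mouse", "rat")))  # assumed topology
k = 4
kmers = ["".join(p) for p in product("ACGT", repeat=k)]
root = node_table(tree, seqs, k, kmers)
best_score = min(root.values())
print(best_score, root["ACGT"])
```

Each internal edge costs 4^k x 4^k distance evaluations here, which is where the 4^(2k) term in the running time comes from.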

55

Improvements

• Better algorithm reduces time from O(n k (4^(2k) + l)) to O(n k (4^k + l))

• By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)

• Amenable to many useful extensions (e.g., allow insertions and deletions)

56

Application to -actin Gene

Gilthead sea bream (678 bp)

Medaka fish (1016 bp)

Common carp (696 bp)

Grass carp (917 bp)

Chicken (871 bp)

Human (646 bp)

Rabbit (636 bp)

Rat (966 bp)

Mouse (684 bp)

Hamster (1107 bp)

57

Common carp

ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG

AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA

GACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC

Chicken

ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTG

TTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA

GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA

TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT

Human

GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTT

TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT

GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC

CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG

Parsimony score over 10 vertebrates: 0 1 2

58

Motifs Absent from Some Species

• Find motifs

– with small parsimony score

– that span a large part of the tree

• Example: in tree of 10 species spanning 760 Myrs, find all motifs with

– score 0 spanning at least 250 Myrs

– score 1 spanning at least 350 Myrs

– score 2 spanning at least 450 Myrs

– score 3 spanning at least 550 Myrs
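The example thresholds amount to a small lookup: the more mutations a motif has, the more of the tree it must span to be reported. A minimal sketch, with the table and function names chosen for illustration:

```python
# Minimum span in Myrs required for each parsimony score, from the example above.
MIN_SPAN_MYRS = {0: 250, 1: 350, 2: 450, 3: 550}


def is_reported(parsimony_score: int, span_myrs: float, thresholds=MIN_SPAN_MYRS) -> bool:
    """True if a motif with this score spans enough of the tree to be reported."""
    required = thresholds.get(parsimony_score)
    # Scores with no listed threshold (here, 4 or more) are never reported.
    return required is not None and span_myrs >= required


print(is_reported(0, 250), is_reported(1, 300), is_reported(3, 600), is_reported(4, 760))
```

This lets a motif that is absent from some species still qualify, provided the species that do carry it cover enough evolutionary time.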

59

Application to c-fos Gene

Asked for motifs of length 10, with

– 0 mutations over tree of size 6

– 1 mutation over tree of size 11

– 2 mutations over tree of size 16

– 3 mutations over tree of size 21

– 4 mutations over tree of size 26

[Figure: phylogenetic tree of puffer fish, chicken, pig, mouse, hamster, and human, with branch lengths]

Found:

– 0 mutations over tree of size 8

– 1 mutation over tree of size 16

– 3 mutations over tree of size 21

– 4 mutations over tree of size 28

60

Application to c-fos Gene

Motif               Score  Conserved in         Known?
CAGGTGCGAATGTTC     0      4 mammals
TTCCCGCCTCCCCTCCCC  0      4 mammals            yes
GAGTTGGCTGcagcc     3      puffer + 4 mammals
GTTCCCGTCAATCcct    1      chicken + 4 mammals  yes
CACAGGATGTcc        4      all 6                yes
AGGACATCTG          1      chicken + 4 mammals  yes
GTCAGCAGGTTTCCACG   0      4 mammals            yes
TACTCCAACCGC        0      4 mammals

metK in B. subtilis

61

Microbial Footprinting

• 1889 prokaryotes with genomes completely sequenced (as of 2/12/2012)

– For any prokaryotic gene of interest, plenty of close genes in other species available

– Relatively simple genomes

• MicroFootPrinter (with Shane Neph)

– Designed specifically for phylogenetic footprinting in microbial genomes

– Undergraduate Computational Biology Capstone project

– User specifies species and gene of interest

– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters

62

Demo

• MicroFootPrinter home

• Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)

– chvI (two-component response regulator)

– ropB (outer membrane protein)

63

Sample chvI motif

Parsimony score: 2

Span: 41.10

Significance score: 4.22

B. henselae       -151  GCTACAATTT
R. etli            -90  GCCACAATTT
R. leguminosarum  -106  GCCACAATTT
S. meliloti       -119  GCCACAATTT
S. medicae        -118  GCCACAATTT
A. tumefaciens    -105  GCCACAATTT
M. loti            -80  GCCACATTTT
M. sp.             -87  GCCACATTTT
O. anthropi       -158  GCCACATTTT
B. suis            -38  GCCACATTTT
B. melitensis     -156  GCCACATTTT
B. abortus        -156  GCCACATTTT
B. ovis           -156  GCCACATTTT
B. canis           -38  GCCACATTTT

64

Sample ropB motif

Parsimony score: 1

Span: 20.70

Significance score: 1.34

Jannaschia sp.    -151  CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG

65

Combined ChvI Motif

ropB: CACATTTTGG

chvI: GCCACAATTT

Atu1221: TTGTCACAAT

ultimate: GYCACAWTTTGG

Y={C,T}

W={A,T}
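A consensus written with the ambiguity codes above can be turned into a regular expression for scanning other sequences. The sketch below handles only the two codes defined on this slide (Y and W) plus the four plain bases; the function name is hypothetical, and the test strings are made-up instances rather than sites reported in the talk.

```python
import re

# Subset of the nucleotide ambiguity codes, limited to those defined on the slide.
AMBIGUITY = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "[CT]", "W": "[AT]"}


def consensus_to_regex(consensus: str) -> re.Pattern:
    """Translate a consensus string into a compiled regular expression."""
    return re.compile("".join(AMBIGUITY[ch] for ch in consensus))


pattern = consensus_to_regex("GYCACAWTTTGG")
print(pattern.pattern)
print(bool(pattern.fullmatch("GCCACAATTTGG")), bool(pattern.fullmatch("GCCACACTTTGG")))
```

With `pattern.finditer(sequence)` the same expression can be used to list candidate sites in an upstream region.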

66

References and Acknowledgments

• Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.

• Xiaoyu Chen & Martin Tompa, Comparative Assessment of Methods for Aligning Multiple Genome Sequences. Nature Biotechnology, June 2010, 567-572.

• Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002, 739-748.

• Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.

• All software available at bio.cs.washington.edu/software