Comparative Sequence Analysis in Molecular Biology
Martin Tompa
Computer Science & Engineering
Genome Sciences
University of Washington
Seattle, Washington, U.S.A.
2
3
Outline
• What genome data is available?
• What is phylogenetic footprinting?
• Phylogenetic footprinting by multiple sequence alignment
• Which parts of multiple sequence alignments are trustworthy?
• FootPrinter: phylogenetic footprinting without alignment
5
How Many Genomes Are Available?
• 46 vertebrate genomes sequenced (primates to rodents to marsupials to birds to fishes)
• 1766 bacterial genomes sequenced (as of 2/12/2012)
• Insects, fungi, worms, plants, …
• Many more will be finished very soon
• Fertile ground for comparative genomics
6
1982-2003: number of nucleotides in GenBank doubled every 18 months
Since 2003: doubled every 3 years
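To make the two doubling rates concrete, here is a small arithmetic sketch. The function name and the nine-year comparison window are illustrative assumptions, not figures from the talk.

```python
def growth_factor(years, doubling_time_years):
    """Fold increase after `years` when a quantity doubles every `doubling_time_years`."""
    return 2 ** (years / doubling_time_years)

# 1982 to 2003 is 21 years; at one doubling per 1.5 years that is 14 doublings.
early = growth_factor(2003 - 1982, 1.5)

# Hypothetical comparison window: nine years at the slower post-2003 rate.
late = growth_factor(9, 3.0)

print(f"1982-2003: about {early:,.0f}-fold growth")
print(f"nine years at one doubling per 3 years: {late:.0f}-fold growth")
```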
8
Phylogenetic Footprinting (Tagle et al. 1988)
Functional regions of DNA (regions under “purifying constraint”) evolve more slowly than nonfunctional ones.
1. Consider a set of corresponding DNA sequences from related species.
2. Identify unusually well conserved subsequences (i.e., ones that have not mutated much over the course of evolution): “motifs”
9
11
How to Find Conserved Motifs
ACTAACCGGGAGATTTCAGA human
AAGTTCCGGGAGATTTCCA chimp
TAGTTATCCGGGAGATTAGA mouse
AAAACCGGTAGATTTCAGG rat
12
Multiple Sequence Alignment
AC--TAACCGGGAGATTTCAGA human
AAGTT--CCGGGAGATTTCC-A chimp
TAGTTATCCGGGAGATT--AGA mouse
AA---AACCGGTAGATTTCAGG rat
(Finding the optimal alignment is NP-complete.)
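Once an alignment is given, conserved motifs show up as runs of identical columns. The sketch below scans the four-row alignment above for maximal runs of columns in which every species has the same base; the function name and the strict "all rows identical, no gaps" criterion are simplifying assumptions, since practical methods use softer conservation scores.

```python
ALIGNMENT = {
    "human": "AC--TAACCGGGAGATTTCAGA",
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def conserved_runs(rows):
    """Return (start, end, text) for each maximal run of columns in which
    every row carries the same non-gap base; `end` is exclusive."""
    rows = list(rows)
    width = len(rows[0])
    assert all(len(r) == width for r in rows), "rows must have equal length"
    runs, start = [], None
    for col in range(width + 1):
        same = (
            col < width
            and rows[0][col] != "-"
            and all(r[col] == rows[0][col] for r in rows)
        )
        if same and start is None:
            start = col                      # a conserved run begins here
        elif not same and start is not None:
            runs.append((start, col, rows[0][start:col]))
            start = None                     # the run ended at the previous column
    return runs

runs = conserved_runs(ALIGNMENT.values())
longest = max(runs, key=lambda r: r[1] - r[0])
print(runs)
print("longest fully conserved run:", longest[2])
```

On this toy input the two runs are CCGG and AGATT, separated by the single column where rat has T instead of G.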
13
Phylogenetic Footprinting
1. Use whole-genome multiple alignment such as provided
by UCSC Genome Browser.
2. Search for regions of well conserved alignment.
– Regulatory elements [Cliften; Kellis; Kolbe; Prakash; Woolfe;
Xie (2)]
– RNA elements [Pedersen; Washietl]
– General conservation & constraint [Bejerano; Boffelli;
Cooper; Margulies (4); Pollard; Prabhakar; Siepel]
15
Why Doubt Alignments?
• Multiple sequence alignment of short sequences (proteins, promoters) is difficult (NP-complete)
• Aligning whole genomes adds the complications of huge sequences and genomic rearrangements
• Vertebrate alignment has 3.8 billion columns
• Automatically generated
16
Assessing 4 Genome-Size Alignments (with Xiaoyu Chen)
• Alignments: MLAGAN [Brudno 2003], MAVID [Bray 2003],
TBA [Blanchette 2003], Pecan [Paten 2008]
• Target ENCODE regions: 30 Mbp covering 1% of the human
genome (ENCODE targets)
• Total input: 554 Mbp over 28 vertebrates
• Rich resource for comparing and assessing genome-size
alignments
Margulies et al. 2007, Genome Research
17
Coverage of each alignment
[Figure: coverage in Kbp for TBA, MAVID, MLAGAN, and Pecan, in four panels: Coding, UTR, Intronic, Intergenic]
Alignment coverage: number of human bases aligned to a given species
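Coverage in this sense can be counted directly from two rows of an alignment. A minimal sketch, using the toy four-species alignment from the Multiple Sequence Alignment slide rather than ENCODE data; the function name is illustrative.

```python
HUMAN = "AC--TAACCGGGAGATTTCAGA"
OTHERS = {
    "chimp": "AAGTT--CCGGGAGATTTCC-A",
    "mouse": "TAGTTATCCGGGAGATT--AGA",
    "rat":   "AA---AACCGGTAGATTTCAGG",
}

def coverage(human_row, other_row):
    """Number of columns in which a human base is paired with a base
    (not a gap) of the other species."""
    return sum(h != "-" and o != "-" for h, o in zip(human_row, other_row))

for species, row in OTHERS.items():
    print(species, coverage(HUMAN, row))
```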
In noncoding regions, as species distance from human↑, coverage↓
MAVID has lowest coverage
Other 3 have comparable coverage in placental mammals
MLAGAN has highest coverage in distant species, intronic and intergenic
22
Level of agreement among alignments
[Figure: Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for TBA (T), MAVID (V), and MLAGAN (L), in four panels: Coding bases, UTR bases, Intronic bases, Intergenic bases]
Agree%: Coding > UTR > Int.
Unique%: Coding < UTR < Int.
As species distance from human↑, Agree%↓, Unique%↑
Primates: high Agree%
Placental nonprimates: Agree% > 0.5
Distant species, Int: low Agree%, high Unique%
29
Alignment agreement for mouse
[Figure: Agree%, Disagree%, and Unique% (scale 0.0 to 1.0) for mouse, in two panels: Intronic bases, Intergenic bases]
• Intronic & intergenic account for 95% of mouse bases aligned to human
• Agree% in those categories: 44% to 62%
• Much worse for more distant species
• Building reliable MSA remains challenging
30
Which Alignment Columns to Trust? (with Amol Prakash, generalizing Karlin and Altschul 1990)
Goal: label each alignment column with confidence measure of alignment correctness
– Identify sequences that do not belong
• Users forewarned about regions of interest
• Genome browser designers consider realigning
• Alignment tool designers get feedback for possible improvements
31
Sample Suspicious Alignment
Human -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Chimp -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Rhesus -----------GTTGCCATGC-AAAAATATTATGTCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Mouse -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Rat -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CGTGTCAA----------TTAACAC
Dog -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Cow -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Elephant -----------GTTGCTATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Tenrec -----------GTTGCCATAC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATGTCAA----------TTAACAC
Opossum -----------GTTGCCATGC-AAAAATATTATGGCTTTACTAAAATTTATACAAG---CATATCAA----------TTAACAC
Chicken -----------GTTGCCATGCAAAAAATAATATGGCTTTACTAAAATTTACACAAC---CCTGACAA----------TTAACAC
Zebrafish GAACATATCCGAGTGCTGTAA-AATACTACTGGGA----ACCAGAAATG--ACAAGTTCCATGACAGCTTTGCCTTTTTGGCTC
32
Scoring Function
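Pairwise: $\mathrm{score}(1, 2) = \log \frac{\Pr(1, 2)}{\Pr(1)\,\Pr(2)}$
Multiple (1 = Human, 2 = Chimp, 3 = Mouse, 4 = Rat, 5 = Chicken), for the tree branch separating {3, 4} from {1, 2, 5}:
$\mathrm{sc}(12345 \mid T) = \log \frac{\Pr(12345 \mid T)}{\Pr(125 \mid T_{125})\,\Pr(34 \mid T_{34})}$
where T is the phylogenetic tree and T_{125}, T_{34} are the two subtrees left when that branch is removed (each drawn as a tree diagram on the slide).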
Outline of Computation
Input: multiple sequence alignment A
Output: discordance: max_k p_k
For each branch k of the tree {
    Compute scoring function sc_k (Felsenstein)
    Find all maximally scoring segments of A using sc_k (Ruzzo & Tompa)
    Compute K, λ using sc_k (Karlin & Altschul)
    Compute p-value p_k of each segment score using K, λ (Karlin & Altschul)
}
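The "find all maximally scoring segments" step can be illustrated on a plain list of per-column scores. This is a sketch under assumptions: the scores are made up rather than produced by a real scoring function, and the code follows the candidate-list procedure of Ruzzo & Tompa without its linear-time bookkeeping, so it can take quadratic time in the worst case.

```python
def maximal_scoring_segments(scores):
    """Return all maximal scoring segments as (start, end) index pairs,
    end exclusive, in left-to-right order."""
    # Each candidate is (start, end, L, R): L is the cumulative score just
    # before the segment, R the cumulative score at its right end.
    candidates = []
    total = 0
    for i, s in enumerate(scores):
        before = total
        total += s
        if s <= 0:
            continue                      # non-positive scores never start a segment
        start, end, left, right = i, i + 1, before, total
        while True:
            # Rightmost earlier candidate whose L is strictly smaller.
            j = len(candidates) - 1
            while j >= 0 and candidates[j][2] >= left:
                j -= 1
            if j < 0 or candidates[j][3] >= right:
                candidates.append((start, end, left, right))
                break
            # Otherwise extend leftwards over candidates j..last and retry.
            start, left = candidates[j][0], candidates[j][2]
            del candidates[j:]
    return [(a, b) for a, b, _, _ in candidates]

example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]   # hypothetical column scores
segments = maximal_scoring_segments(example)
print([example[a:b] for a, b in segments])
```

Each reported segment would then receive a p-value in the following step of the loop.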
34
Suspicious Alignment Regions
Back to four ENCODE alignments spanning 30 Mbp of human aligned to 27 other vertebrates (with Xiaoyu Chen)
• Identify suspicious alignment regions:
– Length ≥ 50 bp
– Discordance ≥ 0.1 at each position, all with respect to the same worst species
– Fewer than 50% gapped sites
• Suspicious%
– Percentage of aligned bases in suspicious regions
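The three criteria can be read as a simple scan over per-column annotations. The sketch below is one possible reading, not the authors' implementation: it assumes the thresholds 50 and 0.1 are lower bounds, that each column comes with a discordance value and the name of its worst species, and that the gap criterion is applied to all sites in the candidate region. The tuple layout and function name are hypothetical.

```python
def suspicious_regions(columns, min_len=50, min_disc=0.1, max_gap_frac=0.5):
    """columns: list of (discordance, worst_species, gapped_sites, total_sites).
    Returns (start, end, species) for each suspicious region, end exclusive."""
    regions = []
    start, n = 0, len(columns)
    while start < n:
        disc, species = columns[start][0], columns[start][1]
        if disc < min_disc:
            start += 1
            continue
        # Extend while discordance stays high for the same worst species.
        end = start
        while end < n and columns[end][0] >= min_disc and columns[end][1] == species:
            end += 1
        gapped = sum(c[2] for c in columns[start:end])
        total = sum(c[3] for c in columns[start:end])
        if end - start >= min_len and gapped < max_gap_frac * total:
            regions.append((start, end, species))
        start = end
    return regions

# Hypothetical annotations: a 60-column zebrafish run, a quiet stretch,
# then a chicken run that is too short to qualify.
cols = (
    [(0.2, "zebrafish", 1, 12)] * 60
    + [(0.0, "zebrafish", 0, 12)] * 10
    + [(0.3, "chicken", 1, 12)] * 40
)
print(suspicious_regions(cols))
```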
35
Alignment accuracy
[Figure: four panels (Coding bases, UTR bases, Intronic bases, Intergenic bases), vertical axis 0% to 30%, comparing TBA, MAVID, MLAGAN, and Pecan]
39
Can suspicious alignments be improved?
Baboon and MLAGAN (for example): all points (x, y), where
• x = human-baboon alignment score of MLAGAN region suspicious for baboon
• y = human-baboon alignment score of alternative alignment for same human region but not suspicious for baboon
y = x
y - x = μ, where μ = average y-x over all points
y - x = μ ± σ, where σ = standard deviation of y-x over all points
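The reference lines are defined by the mean and standard deviation of the score differences. A small sketch with made-up points (not data from the study), using the population standard deviation since the statistic is taken "over all points":

```python
from statistics import mean, pstdev

# Hypothetical (x, y) pairs: x = score of the suspicious region,
# y = score of the alternative, non-suspicious alignment.
points = [(10, 14), (20, 22), (15, 21), (30, 30)]

diffs = [y - x for x, y in points]
mu = mean(diffs)        # average improvement y - x
sigma = pstdev(diffs)   # spread of the improvement

print(f"mu = {mu}, sigma = {sigma:.3f}")
print(f"band: y - x between {mu - sigma:.3f} and {mu + sigma:.3f}")
```

A positive μ means the alternative alignments score higher on average than the suspicious ones they replace.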
41
Summary of comparisons (all categories)
[Figure: suspicious% (0% to 30%; low is better) plotted against log(coverage) (12 to 18; high is better) for TBA, MAVID, MLAGAN, and Pecan, grouped into primates, other placental mammals, and distant species]
42
Conclusions
1. Disturbing lack of agreement among alignments: alignment still a hard problem
2. Performance of the aligners varies significantly by species group and region type, particularly distant species and noncoding regions
44
DNA, Genes, and Proteins
DNA: program for cell processes
Proteins: execute cell processes
[Figure: a DNA sequence containing a gene, and the corresponding protein]
45
Regulation of Genes
• What turns genes on and off?
• When is a gene turned on or off?
• Where (in which cells) is a gene turned on?
• How many copies of the gene product are produced?
46
Regulation of Genes
[Figure: DNA with a regulatory element and a gene; labels: Transcription Factor, RNA polymerase]
48
Goal
Identify regulatory elements in DNA sequences. These are:
• Binding sites for proteins
• Short subsequences (5-25 nucleotides)
• Up to 1000 nucleotides (or farther) from gene
• Inexactly repeating patterns (“motifs”)
49
CLUSTALW multiple sequence alignment (rbcS gene)
Cotton    ACGGTT-TCCATTGGATGA---AATGAGATAAGAT---CACTGTGC---TTCTTCCACGTG--GCAGGTTGCCAAAGATA-------AGGCTTTACCATT
Pea       GTTTTT-TCAGTTAGCTTA---GTGGGCATCTTA----CACGTGGC---ATTATTATCCTA--TT-GGTGGCTAATGATA-------AGG--TTAGCACA
Tobacco   TAGGAT-GAGATAAGATTA---CTGAGGTGCTTTA---CACGTGGC---ACCTCCATTGTG--GT-GACTTAAATGAAGA-------ATGGCTTAGCACC
Ice-plant TCCCAT-ACATTGACATAT---ATGGCCCGCCTGCGGCAACAAAAA---AACTAAAGGATA--GCTAGTTGCTACTACAATTC--CCATAACTCACCACC
Turnip    ATTCAT-ATAAATAGAAGG---TCCGCGAACATTG--AAATGTAGATCATGCGTCAGAATT--GTCCTCTCTTAATAGGA-------A-------GGAGC
Wheat     TATGAT-AAAATGAAATAT---TTTGCCCAGCCA-----ACTCAGTCGCATCCTCGGACAA--TTTGTTATCAAGGAACTCAC--CCAAAAACAAGCAAA
Duckweed  TCGGAT-GGGGGGGCATGAACACTTGCAATCATT-----TCATGACTCATTTCTGAACATGT-GCCCTTGGCAACGTGTAGACTGCCAACATTAATTAAA
Larch     TAACAT-ATGATATAACAC---CGGGCACACATTCCTAAACAAAGAGTGATTTCAAATATATCGTTAATTACGACTAACAAAA--TGAAAGTACAAGACC

Cotton    CAAGAAAAGTTTCCACCCTC------TTTGTGGTCATAATG-GTT-GTAATGTC-ATCTGATTT----AGGATCCAACGTCACCCTTTCTCCCA-----A
Pea       C---AAAACTTTTCAATCT-------TGTGTGGTTAATATG-ACT-GCAAAGTTTATCATTTTC----ACAATCCAACAA-ACTGGTTCT---------A
Tobacco   AAAAATAATTTTCCAACCTTT---CATGTGTGGATATTAAG-ATTTGTATAATGTATCAAGAACC-ACATAATCCAATGGTTAGCTTTATTCCAAGATGA
Ice-plant ATCACACATTCTTCCATTTCATCCCCTTTTTCTTGGATGAG-ATAAGATATGGGTTCCTGCCAC----GTGGCACCATACCATGGTTTGTTA-ACGATAA
Turnip    CAAAAGCATTGGCTCAAGTTG-----AGACGAGTAACCATACACATTCATACGTTTTCTTACAAG-ATAAGATAAGATAATGTTATTTCT---------A
Wheat     GCTAGAAAAAGGTTGTGTGGCAGCCACCTAATGACATGAAGGACT-GAAATTTCCAGCACACACA-A-TGTATCCGACGGCAATGCTTCTTC--------
Duckweed  ATATAATATTAGAAAAAAATC-----TCCCATAGTATTTAGTATTTACCAAAAGTCACACGACCA-CTAGACTCCAATTTACCCAAATCACTAACCAATT
Larch     TTCTCGTATAAGGCCACCA-------TTGGTAGACACGTAGTATGCTAAATATGCACCACACACA-CTATCAGATATGGTAGTGGGATCTG--ACGGTCA

Cotton    ACCAATCTCT---AAATGTT----GTGAGCT---TAG-GCCAAATTT-TATGACTATA--TAT----AGGGGATTGCACC----AAGGCAGTG-ACACTA
Pea       GGCAGTGGCC---AACTAC--------------------CACAATTT-TAAGACCATAA-TAT----TGGAAATAGAA------AAATCAAT--ACATTA
Tobacco   GGGGGTTGTT---GATTTTT----GTCCGTTAGATAT-GCGAAATATGTAAAACCTTAT-CAT----TATATATAGAG------TGGTGGGCA-ACGATG
Ice-plant GGCTCTTAATCAAAAGTTTTAGGTGTGAATTTAGTTT-GATGAGTTTTAAGGTCCTTAT-TATA---TATAGGAAGGGGG----TGCTATGGA-GCAAGG
Turnip    CACCTTTCTTTAATCCTGTGGCAGTTAACGACGATATCATGAAATCTTGATCCTTCGAT-CATTAGGGCTTCATACCTCT----TGCGCTTCTCACTATA
Wheat     CACTGATCCGGAGAAGATAAGGAAACGAGGCAACCAGCGAACGTGAGCCATCCCAACCA-CATCTGTACCAAAGAAACGG----GGCTATATATACCGTG
Duckweed  TTAGGTTGAATGGAAAATAG---AACGCAATAATGTCCGACATATTTCCTATATTTCCG-TTTTTCGAGAGAAGGCCTGTGTACCGATAAGGATGTAATC
Larch     CGCTTCTCCTCTGGAGTTATCCGATTGTAATCCTTGCAGTCCAATTTCTCTGGTCTGGC-CCA----ACCTTAGAGATTG----GGGCTTATA-TCTATA

Cotton    T-TAAGGGATCAGTGAGAC-TCTTTTGTATAACTGTAGCAT--ATAGTAC
Pea       TATAAAGCAAGTTTTAGTA-CAAGCTTTGCAATTCAACCAC--A-AGAAC
Tobacco   CATAGACCATCTTGGAAGT-TTAAAGGGAAAAAAGGAAAAG--GGAGAAA
Ice-plant TCCTCATCAAAAGGGAAGTGTTTTTTCTCTAACTATATTACTAAGAGTAC
Larch     TCTTCTTCACAC---AATCCATTTGTGTAGAGCCGCTGGAAGGTAAATCA
Turnip    TATAGATAACCA---AAGCAATAGACAGACAAGTAAGTTAAG-AGAAAAG
Wheat     GTGACCCGGCAATGGGGTCCTCAACTGTAGCCGGCATCCTCCTCTCCTCC
Duckweed  CATGGGGCGACG---CAGTGTGTGGAGGAGCAGGCTCAGTCTCCTTCTCG
50
Finding Short Motifs
AGTCGTACGTGAC... (Human)
AGTAGACGTGCCG... (Chimp)
ACGTGAGATACGT... (Rabbit)
GAACGGAGTACGT... (Mouse)
TCGTGACGGTGAT... (Rat)
Size of motif sought: k = 4
51
Most Parsimonious Solution
“Parsimony score”: 1 mutation
AGTCGTACGTGAC...
AGTAGACGTGCCG...
ACGTGAGATACGT...
GAACGGAGTACGT...
TCGTGACGGTGAT...
[Figure: phylogenetic tree over the five sequences, with internal nodes labeled ACGG, ACGT, ACGT, ACGT]
52
Substring Parsimony Problem
Given:
• phylogenetic tree T,
• set of orthologous sequences at leaves of T,
• length k of motif,
• threshold d
Problem:
• Find each set S of k-mers, one k-mer from each leaf, such that the parsimony score of S in T is at most d.
This problem is NP-complete.
53
FootPrinter’s Exact Algorithm (with Mathieu Blanchette, generalizing Sankoff and Rousseau 1975)
W_u[s] = best parsimony score for subtree rooted at node u, if u is labeled with string s.
AGTCGTACGTG
ACGGGACGTGC
ACGTGAGATAC
GAACGGAGTAC
TCGTGACGGTG
[Figure: phylogenetic tree with the five sequences above at its leaves. Each node u carries a table W_u with 4^k entries. At a leaf, an entry is 0 if the k-mer occurs in that sequence and +∞ otherwise. Sample entries shown at internal nodes: ACGG: 2, ACGT: 1; ACGG: 0, ACGT: 2; ACGG: 1, ACGT: 1; ACGG: 1, ACGT: 0.]
54
$W_u[s] = \sum_{v:\ \text{child of } u} \min_t \bigl( W_v[t] + d(s, t) \bigr)$
Running Time
Total time $O(n k (4^{2k} + l))$, where n = number of species, l = average sequence length, k = motif length
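The recurrence for W_u[s] (for each child v of u, take the cheapest label t and add the mutations d(s, t), then sum over children) can be sketched as a direct dynamic program. This is the unoptimized version, with a full table of 4^k entries at every node. The tree topology below is an assumption made for illustration, since the slides' tree is not recoverable from the transcript; the sequences are the five truncated ones from the Finding Short Motifs slide, and d is taken to be Hamming distance.

```python
from itertools import product

def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))

def substring_parsimony(tree, seqs, k):
    """tree: nested tuples of leaf names; seqs: leaf name -> sequence.
    Returns (best score, root labels achieving it)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    inf = float("inf")

    def table(node):
        if isinstance(node, str):                       # leaf
            seq = seqs[node]
            present = {seq[i:i + k] for i in range(len(seq) - k + 1)}
            return {s: (0 if s in present else inf) for s in kmers}
        child_tables = [table(child) for child in node]
        # W_u[s] = sum over children v of min over t of W_v[t] + d(s, t)
        return {
            s: sum(min(w[t] + hamming(s, t) for t in kmers) for w in child_tables)
            for s in kmers
        }

    root = table(tree)
    best = min(root.values())
    return best, sorted(s for s in kmers if root[s] == best)

SEQS = {
    "human":  "AGTCGTACGTGAC",
    "chimp":  "AGTAGACGTGCCG",
    "rabbit": "ACGTGAGATACGT",
    "mouse":  "GAACGGAGTACGT",
    "rat":    "TCGTGACGGTGAT",
}
TREE = (("human", "chimp"), "rabbit", ("mouse", "rat"))  # assumed topology

score, root_labels = substring_parsimony(TREE, SEQS, k=4)
print(score, root_labels)
```

With k = 4 the tables hold 256 entries each, so the quadratic work per tree edge is small; the 4^{2k} term in the running time comes from exactly this all-pairs comparison of labels.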
55
Improvements
• Better algorithm reduces time from $O(n k (4^{2k} + l))$ to $O(n k (4^k + l))$
• By restricting to motifs with parsimony score at most d, greatly reduce the number of table entries computed (exponential in d, polynomial in k)
• Amenable to many useful extensions (e.g., allow insertions and deletions)
56
Application to β-actin Gene
Gilthead sea bream (678 bp)
Medaka fish (1016 bp)
Common carp (696 bp)
Grass carp (917 bp)
Chicken (871 bp)
Human (646 bp)
Rabbit (636 bp)
Rat (966 bp)
Mouse (684 bp)
Hamster (1107 bp)
57
Common carp
ACGGACTGTTACCACTTCACGCCGACTCAACTGCGCAGAGAAAAACTTCAAACGACAACATTGGCATGGCTTTTGTTATTTTTGGCGCTTGACTCAGGATCTAAAAACTGGAACGGCGAAGGTGACGGCAATGTTTTGGCAAATAAGCATCCCCGAAGTTCTACAATGCATCTG
AGGACTCAATGTTTTTTTTTTTTTTTTTTCTTTAGTCATTCCAAATGTTTGTTAAATGCATTGTTCCGAAACTTATTTGCCTCTATGAAGGCTGCCCAGTAATTGGGAGCATACTTAACATTGTAGTATTGTATGTAAATTATGTAACAAAACAATGACTGGGTTTTTGTACTTTCAGCCTTAATCTTGGGTTTTTTTTTTTTTTTGGTTCCAAAAAACTAAGCTTTACCATTCAAGATGTAAAGGTTTCATTCCCCCTGGCATATTGAAAAAGCTGTGTGGAACGTGGCGGTGCA
GACATTTGGTGGGGCCAACCTGTACACTGACTAATTCAAATAAAAGTGCACATGTAAGACATCCTACTCTGTGTGATTTTTCTGTTTGTGCTGAGTGAACTTGCTATGAAGTCTTTTAGTGCACTCTTTAATAAAAGTAGTCTTCCCTTAAAGTGTCCCTTCCCTTATGGCCTTCACATTTCTCAACTAGCGCTTCAACTAGAAAGCACTTTAGGGACTGGGATGC
Chicken
ACCGGACTGTTACCAACACCCACACCCCTGTGATGAAACAAAACCCATAAATGCGCATAAAACAAGACGAGATTGGCATGGCTTTATTTG
TTTTTTCTTTTGGCGCTTGACTCAGGATTAAAAAACTGGAATGGTGAAGGTGTCAGCAGCAGTCTTAAAATGAAACATGTTGGA
GCGAACGCCCCCAAAGTTCTACAATGCATCTGAGGACTTTGATTGTACATTTGTTTCTTTTTTAATAGTCATTCCAAATATTGTTATAATGCATTGTTACAGGAAGTTACTCGCCTCTGTGAAGGCAACAGCCCAGCTGGGAGGAGCCGGTACCAATTACTGGTGTTAGATGATAATTGCTTGTCTGTAAATTATGTAACCCAACAAGTGTCTTTTTGTATCTTCCGCCTTAAAAACAAAACACACTTGATCCTTTTTGGTTTGTCAAGCAAGCGGGCTGTGTTCCCCAGTGA
TAGATGTGAATGAAGGCTTTACAGTCCCCCACAGTCTAGGAGTAAAGTGCCAGTATGTGGGGGAGGGAGGGGCTACCTGTACACTGACTTAAGACCAGTTCAAATAAAAGTGCACACAATAGAGGCTTGACTGGTGTTGGTTTTTATTTCTGTGCTGCGCTGCTTGGCCGTTGGTAGCTGTTCTCATCTAGCCTTGCCAGCCTGTGTGGGTCAGCTATCTGCATGGGCTGCGTGCTGGTGCTGTCTGGTGCAGAGGTTGGATAAACCGTGATGATATTTCAGCAAGTGGGAGTTGGCTCTGATTCCATCCTGAGCTGCCATCAGTGTGTTCTGAAGGAAGCTGTTGGATGAGGGTGGGCTGAGTGCTGGGGGACAGCTGGGCTCAGTGGGACTGCAGCTGTGCT
Human
GCGGACTATGACTTAGTTGCGTTACACCCTTTCTTGACAAAACCTAACTTGCGCAGAAAACAAGATGAGATTGGCATGGCTTTATTTGTTT
TTTTTGTTTTGTTTTGGTTTTTTTTTTTTTTTTGGCTTGACTCAGGATTTAAAAACTGGAACGGTGAAGGTGACAGCAGTCGGTT
GGAGCGAGCATCCCCCAAAGTTCACAATGTGGCCGAGGACTTTGATTGCATTGTTGTTTTTTTAATAGTCATTCCAAATATGAGATGCATTGTTACAGGAAGTCCCTTGCCATCCTAAAAGCCACCCCACTTCTCTCTAAGGAGAATGGCCCAGTCCTCTCCCAAGTCCACACAGGGGAGGTGATAGCATTGCTTTCGTGTAAATTATGTAATGCAAAATTTTTTTAATCTTCGCCTTAATACTTTTTTATTTTGTTTTATTTTGAATGATGAGCCTTCGTGCCCCCCCTTC
CCCCTTTTTGTCCCCCAACTTGAGATGTATGAAGGCTTTTGGTCTCCCTGGGAGTGGGTGGAGGCAGCCAGGGCTTACCTGTACACTGACTTGAGACCAGTTGAATAAAAGTGCACACCTTAAAAATGAGGCCAAGTGTGACTTTGTGGTGTGGCTGGGTTGGGGGCAGCAGAGGGTG
Parsimony score over 10 vertebrates: 0 1 2
58
Motifs Absent from Some Species
• Find motifs
– with small parsimony score
– that span a large part of the tree
• Example: in tree of 10 species spanning 760 Myrs, find all motifs with
– score 0 spanning at least 250 Myrs
– score 1 spanning at least 350 Myrs
– score 2 spanning at least 450 Myrs
– score 3 spanning at least 550 Myrs
59
Application to c-fos Gene
Asked for motifs of length 10, with
– 0 mutations over tree of size 6
– 1 mutation over tree of size 11
– 2 mutations over tree of size 16
– 3 mutations over tree of size 21
– 4 mutations over tree of size 26
[Figure: phylogenetic tree of puffer fish, chicken, pig, mouse, hamster, and human, with branch lengths]
Found:
– 0 mutations over tree of size 8
– 1 mutation over tree of size 16
– 3 mutations over tree of size 21
– 4 mutations over tree of size 28
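The request amounts to a per-score threshold on how much of the tree a motif must span. A minimal sketch, assuming (as in the Myrs example on the previous slide) that each requested tree size is a lower bound; the names are illustrative.

```python
# Requested thresholds from the slide: mutations allowed -> minimum tree size.
ASKED = {0: 6, 1: 11, 2: 16, 3: 21, 4: 26}

# (mutations, tree size) combinations reported as found.
FOUND = [(0, 8), (1, 16), (3, 21), (4, 28)]

def meets_threshold(mutations, tree_size, asked=ASKED):
    """True if a motif with this parsimony score spans enough of the tree."""
    return mutations in asked and tree_size >= asked[mutations]

for mutations, size in FOUND:
    print(mutations, size, meets_threshold(mutations, size))
```

Allowing more mutations only for motifs that span more of the tree keeps weakly conserved motifs from being reported on the strength of a few close relatives.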
60
Application to c-fos Gene
Motif                Score  Conserved in          Known?
CAGGTGCGAATGTTC      0      4 mammals
TTCCCGCCTCCCCTCCCC   0      4 mammals             yes
GAGTTGGCTGcagcc      3      puffer + 4 mammals
GTTCCCGTCAATCcct     1      chicken + 4 mammals   yes
CACAGGATGTcc         4      all 6                 yes
AGGACATCTG           1      chicken + 4 mammals   yes
GTCAGCAGGTTTCCACG    0      4 mammals             yes
TACTCCAACCGC         0      4 mammals
metK in B. subtilis
61
Microbial Footprinting
• 1889 prokaryotes with genomes completely sequenced (as of 2/12/2012)
– For any prokaryotic gene of interest, plenty of close genes in other species available
– Relatively simple genomes
• MicroFootPrinter (with Shane Neph)
– Designed specifically for phylogenetic footprinting in microbial genomes
– Undergraduate Computational Biology Capstone project
– User specifies species and gene of interest
– Automates collection of orthologous genes, cis-regulatory sequences, gene tree, parameters
62
Demo
• MicroFootPrinter home
• Examples: Agrobacterium tumefaciens genes regulated by ChvI (with Eugene Nester)
– chvI (two component response regulator)
– ropB (outer membrane protein)
63
Sample chvI motif
Parsimony score: 2
Span: 41.10
Significance score: 4.22
B. henselae       -151  GCTACAATTT
R. etli            -90  GCCACAATTT
R. leguminosarum  -106  GCCACAATTT
S. meliloti       -119  GCCACAATTT
S. medicae        -118  GCCACAATTT
A. tumefaciens    -105  GCCACAATTT
M. loti            -80  GCCACATTTT
M. sp.             -87  GCCACATTTT
O. anthropi       -158  GCCACATTTT
B. suis            -38  GCCACATTTT
B. melitensis     -156  GCCACATTTT
B. abortus        -156  GCCACATTTT
B. ovis           -156  GCCACATTTT
B. canis           -38  GCCACATTTT
64
Sample ropB motif
Parsimony score: 1
Span: 20.70
Significance score: 1.34
Jannaschia sp.    -151  CACATTTTGG
R. etli           -134  CACAATTTGG
R. leguminosarum  -135  CACAATTTGG
A. tumefaciens    -131  CACATTTTGG
S. meliloti       -128  CACATTTTGG
S. medicae        -128  CACATTTTGG
65
Combined ChvI Motif
ropB:         CACATTTTGG
chvI:       GCCACAATTT
Atu1221:  TTGTCACAAT
ultimate:   GYCACAWTTTGG
Y={C,T}
W={A,T}
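The combined motif can be checked mechanically against the three gene-specific motifs by treating Y and W as sets of allowed bases. A small sketch; the offsets at which each motif is laid over the consensus are inferred here by matching, not taken from the slide, and the function name is illustrative.

```python
# Only the IUPAC codes that appear in the consensus are listed.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "Y": "CT", "W": "AT"}

CONSENSUS = "GYCACAWTTTGG"

def compatible(consensus, motif, offset):
    """True if `motif`, placed so that motif[0] lines up with
    consensus[offset], agrees with the consensus wherever they overlap."""
    overlapped = False
    for i, base in enumerate(motif):
        j = offset + i
        if 0 <= j < len(consensus):
            overlapped = True
            if base not in IUPAC[consensus[j]]:
                return False
    return overlapped

placements = {                       # motif -> inferred offset into the consensus
    "GCCACAATTT": 0,                 # chvI
    "CACATTTTGG": 2,                 # ropB
    "TTGTCACAAT": -2,                # Atu1221 (overhangs by two bases on the left)
}
for motif, offset in placements.items():
    print(motif, offset, compatible(CONSENSUS, motif, offset))
```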
66
References and Acknowledgments
• Amol Prakash & Martin Tompa, Measuring the Accuracy of Genome-Size Multiple Alignments. Genome Biology, June 2007, R124.
• Xiaoyu Chen & Martin Tompa, Comparative Assessment of Methods for Aligning Multiple Genome Sequences. Nature Biotechnology, June 2010, 567-572.
• Mathieu Blanchette & Martin Tompa, Discovery of Regulatory Elements by a Computational Method for Phylogenetic Footprinting. Genome Research, May 2002, 739-748.
• Shane Neph & Martin Tompa, MicroFootPrinter: a Tool for Phylogenetic Footprinting in Prokaryotic Genomes. Nucleic Acids Research, July 2006, W366-W368.
• All software available at bio.cs.washington.edu/software