Word Jumble
The Search for Protein-DNA Recognition Sites in the Encode Regions of
the Human Genome
BME230 Project - Tim Dreszer
Searching for Tiny “Words”
Ultra-conserved 200 Base Pair Regions
Phylo-HMMs can recognize as few as 6 bp
Degenerate Words and Chance Occurrences
High Frequency Words
Low Frequency Words
Word Pairs
Scoring Words
N = 30,000,000 (masked 16,431,479)
4^6 = 4096 Words
Word Freq. = Word Count/Total Bases
BG Probability = 1st Order HMM, or product of the Base Freq. over the 6 Bases
E Score = log(Word Freq.) – log(BG Prob.)
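A minimal sketch of the word score above, under stated assumptions: logs are taken base 2 (the slide only says "log"), masked positions are any character outside ACGT, "Total Bases" means the unmasked bases, and the "1st Order HMM" background is approximated by a first-order Markov chain estimated from the same sequence with a pseudocount. All function names here are hypothetical and are not taken from the project code.

```python
import math
from collections import Counter

K = 6  # word length used on the slides

def count_words(seq, k=K):
    """Count overlapping k-mers, skipping windows that contain a non-ACGT base."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts

def base_freqs(seq):
    """Frequency of each base among the unmasked (ACGT) positions."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    tally = Counter(bases)
    return {b: tally[b] / len(bases) for b in "ACGT"}

def transition_probs(seq, pseudocount=1.0):
    """First-order transition probabilities P(next | current), with a pseudocount (assumption)."""
    seq = seq.upper()
    pairs = Counter(
        (a, b) for a, b in zip(seq, seq[1:]) if a in "ACGT" and b in "ACGT"
    )
    trans = {}
    for a in "ACGT":
        row_total = sum(pairs[(a, b)] for b in "ACGT") + 4 * pseudocount
        trans[a] = {b: (pairs[(a, b)] + pseudocount) / row_total for b in "ACGT"}
    return trans

def bg_independent(word, freqs):
    """Background probability as the product of base frequencies over the word."""
    return math.prod(freqs[b] for b in word)

def bg_markov(word, freqs, trans):
    """Background probability under the first-order Markov chain."""
    p = freqs[word[0]]
    for a, b in zip(word, word[1:]):
        p *= trans[a][b]
    return p

def e_score(word_count, total_bases, bg_prob):
    """E Score = log(Word Freq.) - log(BG Prob.), here in base 2 (assumption)."""
    if word_count == 0:
        return float("-inf")
    return math.log2(word_count / total_bases) - math.log2(bg_prob)
```

For example, `e_score(counts["ACGTAC"], total, bg_independent("ACGTAC", freqs))` scores one word against the base-frequency background; swapping in `bg_markov` gives the first-order variant.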
High and Low Frequency Words (No HMM Model)
Word    Count  Frequency  BG Probability  E score
CGCGTA      5    -16.924         -11.909   -5.015
CGTACG      5    -16.924         -11.909   -5.015
ATATCG      6    -16.660         -12.101   -4.559
CGAACG      7    -16.438         -11.891   -4.547
TACGCG      7    -16.438         -11.909   -4.529
CTGGGG    650     -9.901         -11.803    1.901
CCCAGG    676     -9.845         -11.810    1.965
CCTGGG    711     -9.772         -11.815    2.043
AAAAAA    637     -9.930         -12.240    2.310
TTTTTT    621     -9.967         -12.347    2.380
High and Low Frequency Words (HMM Model)
Word    Count  Frequency  BG Probability  E score
CGATTG      8    -16.245         -19.969    3.723
CGCAAT     14    -15.438         -19.951    4.513
CGATTA      7    -16.438         -21.004    4.566
CGGCTA     19    -14.998         -19.920    4.923
CGAACG      7    -16.438         -21.580    5.142
TTATTT    318    -10.933         -21.038   10.106
AAAATA    360    -10.754         -20.917   10.164
TATATA    154    -11.979         -22.151   10.172
AATATA    158    -11.942         -22.133   10.191
ATATAT    194    -11.646         -22.151   10.505
Scoring Pairs of Words
Actual Hit = Location of Word 2 within 200 bases of Word 1
Prob of Hit = Sum of Word 2 * (Sum of Word 1 * 400 / Total Bases)
E score = log2(Frequency of Actual Hits / Background Probability of Hits)
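The pair score can be sketched as follows. This is an illustration under assumptions, not the project's code: distance is measured between word start positions, the 400 in the formula is read as twice the 200-base window, every (Word 1, Word 2) occurrence pair inside the window counts as one hit, and the function names are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right

WINDOW = 200  # bases on either side of Word 1

def find_positions(seq, word):
    """Start positions of every (possibly overlapping) occurrence of word."""
    positions = []
    i = seq.find(word)
    while i != -1:
        positions.append(i)
        i = seq.find(word, i + 1)
    return positions

def count_hits(pos1, pos2, window=WINDOW):
    """Count Word 2 starts lying within `window` bases of each Word 1 start."""
    pos2 = sorted(pos2)
    hits = 0
    for p in pos1:
        hits += bisect_right(pos2, p + window) - bisect_left(pos2, p - window)
    return hits

def expected_hits(count1, count2, total_bases, window=WINDOW):
    """Background: Word 2 count * (Word 1 count * 2 * window / Total Bases)."""
    return count2 * (count1 * 2 * window / total_bases)

def pair_e_score(hits, count1, count2, total_bases, window=WINDOW):
    """E score = log2(actual hits / background hits)."""
    if hits == 0:
        return float("-inf")
    return math.log2(hits) - math.log2(expected_hits(count1, count2, total_bases, window))
```

Because hits are counted per occurrence pair, the hit count can exceed either word's own count. If Word 1 and Word 2 are the same word, each occurrence would count itself as a hit and would need to be excluded.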
Pairs of Interest (No HMM Model)
Word 1  Word 2  Word 1 Count  Word 2 Count  Hit Count  Hit Freq  BG Prob  E score
GCTACG  TACGCG            21             7         66     6.044    0.984    5.060
ACGCGT  TACGCG            10             7         13     3.700   -0.111    3.811
CGCTAC  TACGCG            19             7         19     4.248    0.490    3.758
ATCGCG  CGTACG            17             5          5     2.322   -1.265    3.587
TACGCG  TATCGG             7            11          9     3.170   -0.323    3.493
ACGACG  CGCGTA            20             5          6     2.585   -0.807    3.392
ATCGCG  TACGCG            17             7          6     2.585   -0.780    3.364
CGCGAA  CGTTCG            13             9          9     3.170   -0.134    3.304
CGAACG  CGTTCG             7             9          8     3.000   -0.235    3.235
Pairs of Interest (HMM Model)
Word 1  Word 2  Word 1 Count  Word 2 Count  Hit Count  Hit Freq  BG Prob  E score
CGGTAC  TAGCGG            20            30         97     6.600   -2.185    8.785
CGGTCA  CGTCTA            54            21        116     6.858   -1.470    8.328
CGCGAA  CGTTCG            13             9          9     3.170   -4.859    8.029
CGAACG  CGTTCG             7             9          8     3.000   -4.960    7.960
CCGCCG  CGCGTA           106             5         28     4.807   -3.137    7.944
CGCGAA  CGCGTT            13            11         10     3.322   -4.569    7.891
CCGCGC  CGCGTT            82            11         48     5.585   -2.273    7.858
CCGCGC  CGGCGT            82            18         78     6.285   -1.562    7.848
CCGCGG  CGGCGT           101            18         90     6.492   -1.345    7.837
Location Pairs of Interest
[Figure: generated using No HMM background]
[Figure: generated using HMM background]
Further Analysis
Verify Results
Different Sized Windows (N = 50, 100)
Independent Elements
Degenerate Words – Single Substitutions
Careful analysis of the words produced
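One possible reading of "Degenerate Words – Single Substitutions" is to pool each word with every word that differs from it at exactly one position. The sketch below illustrates that reading only; the pooling rule and the function names are assumptions, not a method described on the slides.

```python
from collections import Counter

def single_substitutions(word, alphabet="ACGT"):
    """All words that differ from `word` at exactly one position."""
    neighbors = []
    for i, original in enumerate(word):
        for base in alphabet:
            if base != original:
                neighbors.append(word[:i] + base + word[i + 1:])
    return neighbors

def degenerate_count(word, counts):
    """Count of `word` plus the counts of its single-substitution neighbors."""
    return counts.get(word, 0) + sum(counts.get(n, 0) for n in single_substitutions(word))

# Illustrative counts only; a real run would use the full 6-mer count table.
example_counts = Counter({"CGCGTA": 5, "CGCGTT": 11, "AAAAAA": 637})
pooled = degenerate_count("CGCGTA", example_counts)
```

A 6-base word has 6 * 3 = 18 single-substitution neighbors, so pooled counts rise quickly and the background probability would need to be pooled over the same set of words to keep the E score comparable.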