
Word Jumble

The Search for Protein-DNA Recognition Sites in the ENCODE Regions of the Human Genome

BME230 Project - Tim Dreszer

Searching for Tiny “Words”

Ultra-conserved 200 Base Pair Regions

Phylo-HMMs can recognize as few as 6 bp

Degenerate Words and Chance Occurrences

High Frequency Words

Low Frequency Words

Word Pairs

Scoring Words

N = 30,000,000 (masked 16,431,479)

4^6 = 4096 Words

Word Freq. = Word Count/Total Bases

BG Probability = 1st-Order HMM, or Product of Base Freqs. over the 6 Bases

E Score = log2(Word Freq.) - log2(BG Prob.)
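The scoring steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names (`count_words`, `base_freqs`, `transition_probs`, `markov_log_bg`, `word_scores`) are hypothetical, masked bases are assumed to appear as lowercase letters or N, and the "1st Order HMM" background is assumed here to be a simple first-order Markov chain estimated from the same sequence.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_words(seq, k=6):
    """Count every overlapping k-base word made only of unmasked A, C, G, T."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word) <= set(BASES):
            counts[word] += 1
    return counts

def base_freqs(seq):
    """Frequency of each unmasked base."""
    c = Counter(b for b in seq if b in BASES)
    total = sum(c.values())
    return {b: c[b] / total for b in BASES}

def transition_probs(seq):
    """P(next base | current base), from adjacent unmasked pairs (assumed model)."""
    pairs = Counter(a + b for a, b in zip(seq, seq[1:]) if a in BASES and b in BASES)
    trans = {}
    for a in BASES:
        row = sum(pairs[a + b] for b in BASES)
        trans[a] = {b: pairs[a + b] / row for b in BASES} if row else {}
    return trans

def markov_log_bg(word, freqs, trans):
    """log2 background probability of a word under the first-order chain."""
    lp = math.log2(freqs[word[0]])
    for a, b in zip(word, word[1:]):
        lp += math.log2(trans[a][b])
    return lp

def word_scores(seq, k=6, first_order=False):
    """Return {word: (count, log2 freq, log2 BG prob, E score)}."""
    counts = count_words(seq, k)
    total = sum(1 for b in seq if b in BASES)  # unmasked bases only
    freqs = base_freqs(seq)
    trans = transition_probs(seq) if first_order else None
    scores = {}
    for word, n in counts.items():
        log_freq = math.log2(n / total)
        if first_order:
            log_bg = markov_log_bg(word, freqs, trans)
        else:
            log_bg = sum(math.log2(freqs[b]) for b in word)
        scores[word] = (n, log_freq, log_bg, log_freq - log_bg)
    return scores
```

Sorting the result by the last tuple element gives the lowest and highest scoring words, for example `sorted(word_scores(seq).items(), key=lambda kv: kv[1][3])`.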

High and Low Frequency Words (No HMM Model)

Word    Count  Frequency  BG Probability  E score
CGCGTA      5    -16.924         -11.909   -5.015
CGTACG      5    -16.924         -11.909   -5.015
ATATCG      6    -16.660         -12.101   -4.559
CGAACG      7    -16.438         -11.891   -4.547
TACGCG      7    -16.438         -11.909   -4.529
CTGGGG    650     -9.901         -11.803    1.901
CCCAGG    676     -9.845         -11.810    1.965
CCTGGG    711     -9.772         -11.815    2.043
AAAAAA    637     -9.930         -12.240    2.310
TTTTTT    621     -9.967         -12.347    2.380

High and Low Frequency Words (HMM Model)

Word    Count  Frequency  BG Probability  E score
CGATTG      8    -16.245         -19.969    3.723
CGCAAT     14    -15.438         -19.951    4.513
CGATTA      7    -16.438         -21.004    4.566
CGGCTA     19    -14.998         -19.920    4.923
CGAACG      7    -16.438         -21.580    5.142
TTATTT    318    -10.933         -21.038   10.106
AAAATA    360    -10.754         -20.917   10.164
TATATA    154    -11.979         -22.151   10.172
AATATA    158    -11.942         -22.133   10.191
ATATAT    194    -11.646         -22.151   10.505

Scoring Pairs of Words

Actual Hit = Location of Word 2 within 200 bases of Word 1

Prob of Hit = Sum of Word 2 * (Sum of Word 1 * 400 / Total Bases)

E score = log2(Frequency of Actual Hits / Background Probability of Hits)
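The pair scoring above can be illustrated with a short Python sketch. It is an assumption-laden reading of the slide rather than the project's implementation: the helper names (`find_positions`, `count_hits`, `pair_e_score`) are hypothetical, a hit is taken to be any (Word 1, Word 2) occurrence pair whose start positions differ by at most 200 bases in either direction (hence the 400-base span), and the ratio is taken between the observed hit count and the expected count from the formula.

```python
import math
from bisect import bisect_left, bisect_right

def find_positions(seq, word):
    """Start positions of every (possibly overlapping) occurrence of word."""
    positions = []
    start = seq.find(word)
    while start != -1:
        positions.append(start)
        start = seq.find(word, start + 1)
    return positions

def count_hits(pos1, pos2, window=200):
    """Count (Word 1, Word 2) occurrence pairs within `window` bases of each other."""
    pos2 = sorted(pos2)
    hits = 0
    for p in pos1:
        # Word 2 occurrences with p - window <= position <= p + window
        hits += bisect_right(pos2, p + window) - bisect_left(pos2, p - window)
    return hits

def pair_e_score(pos1, pos2, total_bases, window=200):
    """log2(observed hits / expected hits); None when no hit is observed."""
    hits = count_hits(pos1, pos2, window)
    # Expected hits: Word 2 count * (Word 1 count * 2 * window / total bases)
    expected = len(pos2) * (len(pos1) * 2 * window / total_bases)
    if hits == 0 or expected == 0:
        return None
    return math.log2(hits / expected)
```

Passing `window=50` or `window=100` gives the smaller windows listed under further analysis without other changes.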

Pairs of Interest (No HMM Model)

Word 1  Word 2  Word 1 Count  Word 2 Count  Hit Count  Hit Freq  BG Prob  E score
GCTACG  TACGCG            21             7         66     6.044    0.984    5.060
ACGCGT  TACGCG            10             7         13     3.700   -0.111    3.811
CGCTAC  TACGCG            19             7         19     4.248    0.490    3.758
ATCGCG  CGTACG            17             5          5     2.322   -1.265    3.587
TACGCG  TATCGG             7            11          9     3.170   -0.323    3.493
ACGACG  CGCGTA            20             5          6     2.585   -0.807    3.392
ATCGCG  TACGCG            17             7          6     2.585   -0.780    3.364
CGCGAA  CGTTCG            13             9          9     3.170   -0.134    3.304
CGAACG  CGTTCG             7             9          8     3.000   -0.235    3.235

Pairs of Interest (HMM Model)

Word 1  Word 2  Word 1 Count  Word 2 Count  Hit Count  Hit Freq  BG Prob  E score
CGGTAC  TAGCGG            20            30         97     6.600   -2.185    8.785
CGGTCA  CGTCTA            54            21        116     6.858   -1.470    8.328
CGCGAA  CGTTCG            13             9          9     3.170   -4.859    8.029
CGAACG  CGTTCG             7             9          8     3.000   -4.960    7.960
CCGCCG  CGCGTA           106             5         28     4.807   -3.137    7.944
CGCGAA  CGCGTT            13            11         10     3.322   -4.569    7.891
CCGCGC  CGCGTT            82            11         48     5.585   -2.273    7.858
CCGCGC  CGGCGT            82            18         78     6.285   -1.562    7.848
CCGCGG  CGGCGT           101            18         90     6.492   -1.345    7.837

Location Pairs of Interest

[Figure: generated using No HMM background]

[Figure: generated using HMM background]

Further Analysis

Verify Results

Different Sized Windows (N = 50, 100)

Independent Elements

Degenerate Words – Single Substitutions

Careful analysis of the words produced
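As one way to start on the degenerate-word item above, the sketch below enumerates every single-substitution variant of a word and pools their counts. The function names (`single_substitutions`, `degenerate_count`) and the idea of summing counts over variants are illustrative assumptions, not a method described in the slides.

```python
def single_substitutions(word, alphabet="ACGT"):
    """All words that differ from `word` at exactly one position."""
    variants = []
    for i, base in enumerate(word):
        for alt in alphabet:
            if alt != base:
                variants.append(word[:i] + alt + word[i + 1:])
    return variants

def degenerate_count(word, counts):
    """Count of `word` plus the counts of its single-substitution variants.

    `counts` is assumed to be a dict mapping each word to its occurrence count.
    """
    return counts.get(word, 0) + sum(
        counts.get(v, 0) for v in single_substitutions(word)
    )
```

A 6-base word has 6 x 3 = 18 such variants, so the pooled count covers 19 words in total.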