Position-dependent motif characterization using non-negative matrix factorization (NMF)
Joel H Graber
Lucie N. Hutchins, Erik McCarthy, Sean Murphy, Priyam Singh
The Jackson Laboratory
In collaboration with: Thomas Blumenthal, University of Colorado; David Kulp, University of Massachusetts
Funding sources
Current: NIH GM 072706, NIH HD037102
Previous: NIH RR 16463 (INBRE-Maine), NSF 2010 Project DBI 0331497
Motifs are often constrained in positioning
AUGCACAUAGAGGCAAUUGUGUAUCAAUAUUAAAAAUAAAGUAAAACUUA AAGCAUGUGUAGACCGUGUGAUGAAUCCUUGUAUAAGCAACUGCCAAUGAAAUCGGGCUCGCUGUGGUCA UCCGUGAGUGCUUAUCAUUCUGGUAAUACCGUGGUCUAUUUAUACAAAUAUUAAAAGUGCUGUUUAUAGA GCCUGUGUCAUGUGGCAACUUCCUGUGUCAUGACCUCAGGAAAUAAAUUUCCUUGACUUUAUAAAAGCCA AAACGUUUGCCCUCUUCCUUGGAAUUUGAAAUUACUCCAAUUUAAAAUAAAUUACUGGACUGUGGAAAUA ACAUGUAGAAUUGCAGUUUUACACUGUAACAGUUGCUUCUGCCUACCUUAUAAAUAAAGAAUCACUAAGA AAAAGAGUUCUCAGGUCUCCCUGAGCUCAGACUGAGGGGAAACGGAGGCAAAUAAAGCUGAGUUUUGAGA ACUCGGUGGCCUGUGUUCCUAGCCUGUACUCACCCCUUCCCUUAAUAAUAAUAAAACAACAACUUUGUGA AUUUGAGUUUUCCUUAGAGCUCAACAGAUCAUAUUCAGUGUCUUGAAUAAAUUGCUCUAUUUUGAUAUUA GAGAACAUAGUGACUGUGUUUGGUACGAUUAUUUUUUUUAACUAAAAUGAGAUAAAAUUCUAUAUUCUUA UGUGUGUGUGGUUUUUGAUGGGUGAAACUGUCUCAAUUUGAAUAAAUAUUUUUAUUGCAAUUCUGAACCA AUUUUAAAAGAAAAGAUACAAAUGUCCUUCCAAAUAGAGCCUUUUUAUUAAUAAAGGGCCUUGUACUUCA CUUGGAACAAAGGACGUUUCAUUUCAUUGUGUUAAAUGUAUACUUGUAAAUAAAAUAGCUGCAAACCUUA AGCCUUUGAGCUACUUGGUGUAUCUCACUCGGUAUUACGUGCUCUGCAAUAGAAGUUGGUGUGAACAUUC CCAGGUGACAUGCAGUGUUACCACCACCCCUCCAUCAGUAAGCCACUAAUAAAGUGCAUCUAUGCAGCCA CAGGUCUGUCUGCCUCUUUUGGCUGGGCACCUUAAAAGAGAAGUCAAUAAACUGGGCUACACAGUACUUA AAACGCUGAACUGGCUAAGAUGUGUAUUUAUGAAUAUUAAUGAAUAAAAACUGCUUGGAUGGUUUACCUA ACUACUGCAUGAGGUUUUUUUCCUUUCUUUUCUCUCCACUCAAUAAAUACUUUAAAGCACAUUUGGAAUA AAGGAAGAGACUUUUAAGUGGUGCUUAAUGAUAAGGUUUUGACUUGUUAAAUUAAACCAUUUGGAAUAUA UUGUGUGUUUGUAGUAGUCAGUGCCUUUGUUUGUAAACCAAAAAGUAAUAAAUGAAUCCCUAUAUUUCUA UUAUAGCAUCUAUUGUAUUUAAUAUAGUAUUUUAUUUAAGAAAAUAAACUUUGCAGUUUUUGCAUUGUGA AUUCUCUCUCUUCCCGCCCACUGCCAUGAAAAAUGUUGUUUAUGGAAUAAAAAAAAUGUAACUGCCUUUA AAUUUCCUGGUGGCUGUGUU
[Figure: construction of the PWC matrix. Sequences are aligned on the functional site; rows are the M sequence words, columns are the N positions, and entries are counts.]
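The matrix described above (M sequence words by N positions, holding counts) could be built roughly as follows. This is a minimal sketch, not the authors' implementation: the function name `pwc_matrix`, the word length `k`, the RNA alphabet, and the toy sequences are all assumptions made for illustration, and the sequences are assumed to be already aligned on the functional site.

```python
from itertools import product

import numpy as np


def pwc_matrix(sequences, k=3, alphabet="ACGU"):
    """Count every k-letter word (row) at every start position (column).

    Hypothetical helper: assumes the sequences are pre-aligned so that
    column j refers to the same offset from the functional site in each.
    """
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    row_of = {w: i for i, w in enumerate(words)}
    n_pos = min(len(s) for s in sequences) - k + 1
    V = np.zeros((len(words), n_pos))
    for seq in sequences:
        for j in range(n_pos):
            i = row_of.get(seq[j:j + k])
            if i is not None:  # skip words containing letters outside the alphabet
                V[i, j] += 1
    return words, V


# Toy aligned sequences, made up for illustration only.
seqs = ["AAUAAAGC", "CAUAAAGG", "GAUAAACU"]
words, V = pwc_matrix(seqs, k=3)
print(V.shape)                  # (number of words M, number of positions N)
print(V[words.index("AUA")])    # counts of the word AUA at each position
```

A word that recurs at the same offset in many sequences shows up as a large entry in a single column, which is the positional signal the factorization is meant to pick up.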
NMF decomposes the PWC matrix into characteristic patterns (motifs)
$V = W \cdot H$, with $V$ = counts ($M \times N$), $W$ = bases ($M \times r$), $H$ = weights ($r \times N$)
$W_{ik}$ = weight of the $i$th word in the $k$th motif (content)
$H_{kj}$ = abundance of the $k$th motif at the $j$th position (positioning)
$r$ = number of basis functions (patterns)
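As an illustration of the factorization V = W · H, here is a minimal sketch using the multiplicative update rules of Lee and Seung for the squared-error objective. The slides do not say which NMF algorithm was used, so the update rule, the function name `nmf`, the iteration count, and the synthetic input are assumptions.

```python
import numpy as np


def nmf(V, r, n_iter=500, seed=0, eps=1e-9):
    """Factor a non-negative M x N matrix V into W (M x r) and H (r x N).

    Assumed algorithm: Lee-Seung multiplicative updates, which keep W and H
    non-negative and do not increase the squared reconstruction error.
    """
    rng = np.random.default_rng(seed)
    M, N = V.shape
    W = rng.random((M, r)) + eps
    H = rng.random((r, N)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic non-negative matrix built from two known patterns (illustrative).
rng = np.random.default_rng(1)
V = rng.random((16, 2)) @ rng.random((2, 30))

W, H = nmf(V, r=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.shape, H.shape, rel_err)
```

Column k of W can then be read as the word content of motif k, and row k of H as its positional profile, matching the definitions of W_ik and H_kj above.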
Synthetic data verifies NMF performance
RSS provides a robust estimate for the optimal number of vectors (r)
[Figure: residue versus basis vector count (r), for r = 3 to 13. Panel 1: residue for test matrices (Test matrix 1, Test matrix 2). Panel 2: residue for test sequences (artificial sequences) and residue for human polyA sites (human polyA sequences).]
$$\mathrm{RSS} = \sum_{i,j} \left( V_{ij} - (WH)_{ij} \right)^2$$
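The RSS defined above can be evaluated over a range of r to see where the residue stops improving. The following is a sketch under stated assumptions: the `nmf` helper uses Lee-Seung multiplicative updates (the slides do not specify the algorithm), the synthetic matrix of rank 4 is made up, and how the resulting curve is turned into a choice of r is left to the reader because the slides do not state the criterion.

```python
import numpy as np


def nmf(V, r, n_iter=300, seed=0, eps=1e-9):
    """Assumed NMF solver: Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], r)) + eps
    H = rng.random((r, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def rss(V, W, H):
    """Residual sum of squares: sum over i, j of (V_ij - (WH)_ij)^2."""
    return float(np.sum((V - W @ H) ** 2))


# Synthetic test matrix generated from 4 non-negative patterns (illustrative).
rng = np.random.default_rng(2)
V = rng.random((20, 4)) @ rng.random((4, 40))

# RSS for each candidate number of basis vectors r.
rss_by_r = {r: rss(V, *nmf(V, r)) for r in range(1, 8)}
for r, value in rss_by_r.items():
    print(r, value)
```

For a matrix generated from a known number of patterns, the residue is expected to drop as r approaches that number and to change little beyond it.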
NMF can characterize complex control sequences
[Figure panels: mouse 3’-processing sequences; human transcription start sites]