Using CTW as a language modeler in Dasher
Phil Cowans, Martijn van Veen
25-04-2007
Inference Group, Department of Physics
University of Cambridge
Language Modelling
• Goal is to produce a generative model over strings
• Typically sequential predictions: $P(x_1, \ldots, x_N) = \prod_{i=1}^{N} P(x_i \mid x_1, \ldots, x_{i-1})$
• Finite context models: $P(x_i \mid x_1, \ldots, x_{i-1}) \approx P(x_i \mid x_{i-k}, \ldots, x_{i-1})$
Dasher: Language Model
• Conditional probability for each alphabet symbol, given the previous symbols
• Similar to compression methods
• Requirements:
  – Sequential
  – Fast
  – Adaptive
• Model is trained on example text
• Better compression → faster text input
Basic Language Model
• Independent distributions for each context
• Use a Dirichlet prior (see the sketch below)
• Makes poor use of data
  – intuitively we expect similarities between similar contexts
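As an illustration of this per-context estimator, here is a minimal sketch (not Dasher's implementation; the parameter values are illustrative) of independent counts per context under a symmetric Dirichlet prior, which reduces to add-alpha smoothing:

```python
from collections import defaultdict

class BasicContextModel:
    """Independent Dirichlet (add-alpha) estimator per context.
    A minimal sketch, not Dasher's implementation."""

    def __init__(self, alphabet_size, order=5, alpha=0.5):
        self.k = alphabet_size      # number of alphabet symbols
        self.order = order          # finite context length
        self.alpha = alpha          # symmetric Dirichlet concentration
        self.counts = defaultdict(lambda: defaultdict(int))

    def prob(self, context, symbol):
        c = self.counts[tuple(context[-self.order:])]
        n = sum(c.values())
        # Posterior predictive of a symmetric Dirichlet prior:
        return (c.get(symbol, 0) + self.alpha) / (n + self.k * self.alpha)

    def update(self, context, symbol):
        self.counts[tuple(context[-self.order:])][symbol] += 1
```

Because each context is estimated independently, counts seen in one context tell the model nothing about similar contexts; that is the weakness the next slides address.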
Basic Language Model (figure)
Prediction By Partial Match
• Associate a generative distribution with each leaf in the context tree
• Share information between nodes using a hierarchical Dirichlet (or Pitman-Yor) prior
• In practice, use a fast but generally good approximation (see the sketch below)
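As a rough illustration of such an approximation, the following sketch blends the estimates of successively longer contexts using escape probabilities, PPM-style. The escape rule and the alpha/beta values (taken from the parameter slide at the end) are assumptions here, not Dasher's exact code:

```python
def ppm_prob(node_counts, context, symbol, alphabet_size,
             alpha=0.49, beta=0.77):
    """PPM-style shallow interpolation over context lengths.
    node_counts maps a context tuple (of any length) to {symbol: count}.
    Assumed escape rule: P(escape) = (alpha + beta * distinct) / (n + alpha)."""
    p = 1.0 / alphabet_size               # order -1: uniform fallback
    # Blend from shortest to longest context so deep contexts dominate.
    for d in range(len(context) + 1):
        counts = node_counts.get(tuple(context[len(context) - d:]))
        if not counts:
            continue
        n = sum(counts.values())
        distinct = len(counts)
        escape = (alpha + beta * distinct) / (n + alpha)
        p_here = max(counts.get(symbol, 0) - beta, 0.0) / (n + alpha)
        p = p_here + escape * p           # shorter contexts fill the escape mass
    return p
```

Each context length keeps its discounted counts and hands its escape mass to the shorter-context estimate, so symbols unseen in a deep context fall back gracefully.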
Hierarchical Dirichlet Model (figure)
Context Tree Weighting
• Combine nodes in the context tree
• Tree structure treated as a random variable
• Contexts associated with the same leaf share one generative distribution
• Contexts associated with different leaves are independent
• Dirichlet prior on the generative distributions
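In symbols, treating the tree structure $T$ as a random variable means the model is a Bayesian mixture over all prunings of the full-depth context tree (standard CTW; the notation here is assumed):

$$
P_w(x_{1:N}) \;=\; \sum_{T} P(T)\, P(x_{1:N} \mid T),
$$

where $P(T)$ is the prior over tree structures and $P(x_{1:N} \mid T)$ has the leaf parameters integrated out under the Dirichlet prior. The recursion on the following slides computes this sum efficiently.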
CTW: Tree Model
• The source is described by a tree model (the structure) with memoryless parameters at the leaves
Tree Partitions (figure)
Recursive Definition
• Children share one distribution
• Children distributed independently
(the two cases are weighted together in the recursion below)
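These two cases are exactly the branches of the standard CTW weighted-probability recursion at a node $s$ (notation assumed; classic CTW uses weight $w = \tfrac{1}{2}$, and the parameter slide at the end suggests a different weight was used here):

$$
P_w^s \;=\;
\begin{cases}
P_e^s, & s \text{ a leaf,}\\[4pt]
w\,P_e^s \;+\; (1-w)\displaystyle\prod_{c \in \mathcal{A}} P_w^{cs}, & \text{otherwise,}
\end{cases}
$$

where $P_e^s$ is the zero-order estimate of the data seen in context $s$: the first term covers "children share one distribution", the product covers "children distributed independently".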
Experimental Results [256] (figure)
Experimental Results [128] (figure)
Experimental Results [27] (figure)
Observations So Far
• No clear overall winner without modification
• PPM does better with small alphabets?
• PPM initially learns faster?
• CTW is more forgiving with redundant symbols?
CTW for Text
Properties of text-generating sources (Bell, Cleary & Witten, 1989):
• Large alphabet, but in any given context only a small subset is used
  – Waste of code space: many probabilities that should be zero
  – Solutions:
    • Adjust the zero-order estimator to decrease the probability of unlikely events
    • Binary decomposition (see the sketch below)
• Only locally stationary
  – Limit the counts to increase adaptivity
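A minimal sketch of the binary-decomposition idea (the balanced tree shape and the `prob` interface are illustrative assumptions): each symbol becomes a path of binary decisions, so one binary predictor per decomposition node replaces a single large-alphabet estimator:

```python
def symbol_to_bits(symbol_index, depth):
    """Path of a symbol through a balanced decomposition tree,
    most significant bit first. Real decomposition trees can be
    shaped to fit the alphabet instead."""
    return [(symbol_index >> (depth - 1 - i)) & 1 for i in range(depth)]

def symbol_prob(binary_models, context, symbol_index, depth):
    """Symbol probability = product of the binary decisions on its path.
    binary_models[node_id] is any binary predictor with a
    prob(context, bit) method, e.g. a binary CTW model (assumed interface)."""
    p, node_id = 1.0, 0
    for bit in symbol_to_bits(symbol_index, depth):
        p *= binary_models[node_id].prob(context, bit)
        node_id = 2 * node_id + 1 + bit   # heap-style child index
    return p
```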
Binary Decomposition
• Decomposition tree (figure)
Binary Decomposition
• Results found by Åberg and Shtarkov (all tests with the full ASCII alphabet; bits per symbol):

| Input file | Paper 1 | Paper 2 | Book 1 | Book 2 | News |
|---|---|---|---|---|---|
| PPM-D (byte predictions) | 2.351 | 2.322 | 2.291 | 1.969 | 2.379 |
| CTW-D (byte predictions) | 2.904 | 2.719 | 2.490 | 2.265 | 2.877 |
| CTW-KT (bit predictions) | 2.322 | 2.249 | 2.184 | 1.910 | 2.379 |
| CTW/PPM-D (byte predictions) | 2.287 | 2.235 | 2.192 | 1.896 | 2.322 |
Count Halving
• If one count reaches a maximum, divide both counts by 2 (see the sketch below)
  – Forgets older input data, increases adaptivity
• In Dasher: predict user input with a model based on training text
  – Adaptivity is even more important
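A minimal sketch of the halving rule for one node's binary counts; the cap value is an illustrative assumption. Halving both counts preserves their ratio while discounting old data:

```python
MAX_COUNT = 255   # illustrative cap; the real bound is a tuning choice

def add_observation(counts, bit):
    """counts = [zeros, ones] at one context node. When a count hits
    the cap, halve both counts (rounding up so nonzero counts stay
    nonzero): older data is forgotten and the model adapts faster."""
    counts[bit] += 1
    if counts[bit] >= MAX_COUNT:
        counts[0] = (counts[0] + 1) // 2
        counts[1] = (counts[1] + 1) // 2
    return counts
```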
Count Halving: Results (figure)
Count Halving: Results (figure, continued)
Results: Enron (figure)
Combining PPM and CTW
• Select the locally best model, or weight the models together (see the sketch below)
• More alpha parameters for PPM, learned from the data
• PPM-like sharing, with a prior over context trees, as in CTW
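For the "weight the models together" option, one natural scheme (an assumption here, not necessarily what was implemented) is a Bayesian two-model mixture whose weight tracks each model's predictive performance:

```python
def mixture_prob(p_ppm, p_ctw, w):
    """Blend the two predictors; w is PPM's current mixture weight."""
    return w * p_ppm + (1.0 - w) * p_ctw

def update_weight(w, p_ppm, p_ctw):
    """Bayesian update after observing a symbol: each model's weight is
    rescaled by the probability it assigned to what actually occurred."""
    z = w * p_ppm + (1.0 - w) * p_ctw
    return (w * p_ppm) / z
```

Run after every symbol, this shifts weight toward whichever model is predicting the current text better, approximating "select the locally best model" smoothly.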
Conclusions
• PPM and CTW have different strengths, so it makes sense to try combining them
• Decomposition and count scaling may give clues for improving PPM
• Look at performance on out-of-domain text in more detail
Experimental Parameters
• Context depth: 5
• Smoothing: 5%
• PPM – alpha: 0.49, beta: 0.77
• CTW – w: 0.05, alpha: 1/128
Comparing Language Models
• PPM
  – Quickly learns repeating strings
• CTW
  – Works on a set of all possible tree models
  – Not sensitive to the parameter D (maximum model depth)
  – Easy to increase adaptivity
  – The weight factor (escape probability) is strictly defined