stream-and-sortwcohen/10-405/stream-and-sort.pdf · 2018-01-24 · Motivation •Naïve Bayes is...
10-405
Some avuncular advice
• Pro tip: You should think of HW1a/HW1b as one assignment
  – HW1a is a “checkpoint”
    • 10% of the points, but mandatory
    • about 50% of the work
  – HW1b is the “real assignment”
    • 90% of the points
  – Don’t wait till the last minute and don’t think of HW1b’s due date as the last minute
    • We’re totally ok giving you experiments that will take many hours to run
• Another pro tip: read the lecture notes!
  – If you find a typo or error and send it to me that counts as “class participation”
Rocchio’s Algorithm
As seen in Monday’s quiz!
Motivation
• Naïve Bayes is unusual as a learner:
  – Only one pass through data
  – Order doesn’t matter
• Relevance Feedback in Information Retrieval, SMART Retrieval System Experiments in Automatic Document Processing, 1971, Prentice Hall Inc.
Rocchio’s algorithm
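DF(w) = # different docs w occurs in
TF(w,d) = # different times w occurs in doc d

$$\mathrm{IDF}(w) = \frac{|D|}{\mathrm{DF}(w)}$$

$$u(w,d) = \log(\mathrm{TF}(w,d)+1) \cdot \log(\mathrm{IDF}(w))$$

$$\mathbf{u}(d) = \langle u(w_1,d), \ldots, u(w_{|V|},d) \rangle$$

$$\mathbf{u}(y) = \alpha \frac{1}{|C_y|} \sum_{d \in C_y} \frac{\mathbf{u}(d)}{\|\mathbf{u}(d)\|_2} - \beta \frac{1}{|D - C_y|} \sum_{d' \in D - C_y} \frac{\mathbf{u}(d')}{\|\mathbf{u}(d')\|_2}$$

$$f(d) = \arg\max_y \frac{\mathbf{u}(d)}{\|\mathbf{u}(d)\|_2} \cdot \frac{\mathbf{u}(y)}{\|\mathbf{u}(y)\|_2}$$

Many variants of these formulae
…as long as u(w,d)=0 for words not in d!
Store only non-zeros in u(d), so size is O(|d|)
But size of u(y) is O(|V|)

$$\|\mathbf{u}\|_2 = \sqrt{\sum_i u_i^2}$$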
to store <0,….,0,ui1,0,….,0,ui2,0,….> just store the non-zero indices i1, i2, …. and the associated values: [(i1,ui1),(i2,ui2),….]
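The sparse representation above can be sketched in Python. This is a minimal illustration, not code from the course: the function names `doc_freq`, `tfidf_vector`, and `sparse_dot` are hypothetical, documents are assumed to be lists of word strings, and a dict keyed by word stands in for the list of (index, value) pairs.

```python
import math
from collections import Counter

def doc_freq(docs):
    """DF(w): number of different docs that contain w."""
    df = Counter()
    for words in docs:
        df.update(set(words))
    return df

def tfidf_vector(words, df, num_docs):
    """Sparse, length-normalized u(d) as {word: weight}; zero entries are not stored."""
    tf = Counter(words)
    u = {}
    for w, count in tf.items():
        weight = math.log(count + 1) * math.log(num_docs / df[w])
        if weight != 0.0:
            u[w] = weight
    norm = math.sqrt(sum(v * v for v in u.values()))
    return {w: v / norm for w, v in u.items()} if norm > 0 else {}

def sparse_dot(u, v):
    """Inner product that only touches the non-zeros of the shorter vector."""
    if len(u) > len(v):
        u, v = v, u
    return sum(val * v.get(w, 0.0) for w, val in u.items())
```

Because only non-zero entries are stored, the cost of `sparse_dot` depends on the document length rather than on |V|. A word that occurs in every document gets IDF 1, so its weight is log(1) = 0 and it drops out of the vector.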
How do you determine what are the “important” words in a document?
Most frequent words in the document are expanded – is this
weighting useful?
High-IDF (rare) words in the document are expanded
TF-IDF weighting: frequently-appearing rare terms, which are
often names and places and other things important to the document
Rocchio results… Joachims ’98, “A Probabilistic Analysis of the Rocchio Algorithm…”
Rocchio’s method (w/ two variants of TFIDF)
Rocchio results…Schapire, Singer, Singhal, “Boosting and Rocchio Applied to Text Filtering”, SIGIR 98
Reuters 21578 – all classes (not just the frequent ones)
A hidden agenda
• Part of machine learning is a good grasp of theory
• Part of ML is a good grasp of what hacks tend to work
• These are not always the same
  – Especially in big-data situations
• Catalog of useful tricks so far
  – Brute-force estimation of a joint distribution
  – Naive Bayes
  – TF-IDF weighting – especially IDF
    • it’s often useful even when we don’t understand why
One more Rocchio observation
Rennie et al, ICML 2003, “Tackling the Poor Assumptions of Naïve Bayes Text Classifiers”
NB + using TFIDF representation for documents
SCALING TO LARGE VOCABULARIES: WHY?
The Naïve Bayes classifier – v1
• Dataset: each example has
  – A unique id id
    • Why? For debugging the feature extractor
  – d attributes X1,…,Xd
    • Each Xi takes a discrete value in dom(Xi)
  – One class label Y in dom(Y)
• You have a train dataset and a test dataset
• Assume:
  – the dataset doesn’t fit in memory
  – the model doesn’t either
What’s next
• How to implement Naïve Bayes
  – Assuming the event counters do not fit in memory
• Why?
  Micro: 0.5 GB memory, $0.00652/hr
  Standard:
    S: 2 GB, $0.03/hr
    XL: 8 GB, $0.104/hr
    10xlarge: 160 GB, $2.34/hr
    x1.32xlarge: 2 TB, 128 cores, $13.33/hr
What’s next
• How to implement Naïve Bayes
  – Assuming the event counters do not fit in memory
• Possible approaches:
  – Use a database? (or at least a key-value store)
Numbers (Jeff Dean says) Everyone Should Know
[Table of latency numbers not recoverable from the slide image; ratios annotated on the slide: ~10x, ~15x, ~100,000x, 40x]
What’s next
• How to implement Naïve Bayes
  – Assuming the event counters do not fit in memory
• Possible approaches:
  – Use a database?
    • Counts are stored on disk, not in memory
    • …So, accessing a count might involve some seeks
  – Caveat: many DBs are good at caching frequently-used values, so seeks might be infrequent …
O(n*scan) → O(n*scan*seek)
What’s next
• How to implement Naïve Bayes
  – Assuming the event counters do not fit in memory
• Possible approaches:
  – Use a memory-based distributed database?
    • Counts are stored on disk, not in memory
    • …So, accessing a count might involve some seeks
  – Caveat: many DBs are good at caching frequently-used values, so seeks might be infrequent …
O(n*scan) → O(n*scan*???)
Counting
[Diagram: a stream of examples (example 1, example 2, example 3, …) feeds the counting logic, which sends “increment C[x] by D” requests to a hash table, database, etc.]
Counting
[Diagram: a stream of examples (example 1, example 2, example 3, …) feeds the counting logic, which sends “increment C[x] by D” requests to a hash table, database, etc.]
Hashtable issue: memory is too small
Database issue: seeks are slow
Distributed Counting
[Diagram: the counting logic on Machine 0 reads the examples (example 1, example 2, example 3, …) and sends “increment C[x] by D” requests to hash tables spread over Machines 1, 2, …, K]
Now we have enough memory…
Distributed Counting
[Diagram: the counting logic on Machine 0 reads the examples (example 1, example 2, example 3, …) and sends “increment C[x] by D” requests to hash tables spread over Machines 1, 2, …, K]
New issues:
• Machines and memory cost $$!
• Routing increment requests to right machine
• Sending increment requests across the network
• Communication complexity
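The routing issue can be illustrated with a small single-process sketch. Everything here is an assumption made for illustration: `K`, `route`, `increment`, and `lookup` are hypothetical names, the in-process `Counter` objects stand in for hash tables on remote machines, and hashing the event string modulo K is one common routing choice rather than something the slides prescribe.

```python
import zlib
from collections import Counter

K = 4  # hypothetical number of counter machines

def route(event, k=K):
    """Pick the machine for an event. crc32 is stable across runs, unlike the built-in hash()."""
    return zlib.crc32(event.encode("utf-8")) % k

# Stand-ins for the hash tables held on Machines 1..K.
shards = [Counter() for _ in range(K)]

def increment(event, delta=1):
    """In a real system this would be a network message to machine route(event)."""
    shards[route(event)][event] += delta

def lookup(event):
    return shards[route(event)][event]
```

Every increment for the same event lands on the same shard, so no shard ever needs to see another shard's counters. The cost is one message per increment, which is the communication complexity the slide points at.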
Numbers (Jeff Dean says) Everyone Should Know
[Table of latency numbers not recoverable from the slide image; ratios annotated on the slide: ~10x, ~15x, ~100,000x, 40x]
What’s next
• How to implement Naïve Bayes
  – Assuming the event counters do not fit in memory
• Possible approaches:
  – Use a memory-based distributed database?
    • Extra cost: Communication costs: O(n) … but that’s “ok”
    • Extra complexity: routing requests correctly
  – Note: If the increment requests were ordered seeks would not be needed!
O(n*scan) → O(n*scan+n*send)
1) Distributing data in memory across machines is not as cheap as accessing memory locally because of communication costs.
2) The problem we’re dealing with is not size. It’s the interaction between size and locality: we have a large structure that’s being accessed in a non-local way.
What’s next
• How to implement Naïve Bayes
  – Assuming the event counters do not fit in memory
• Possible approaches:
  – Use a memory-based distributed database?
    • Extra cost: Communication costs: O(n) … but that’s “ok”
    • Extra complexity: routing requests correctly
  – Compress the counter hash table?
    • Use integers as keys instead of strings?
    • Use approximate counts?
    • Discard infrequent/unhelpful words?
    (Great ideas which we’ll discuss more later)
  – Trade off time for space somehow?
    • Observation: if the counter updates were better-ordered we could avoid using disk
O(n*scan) → O(n*scan+n*send)
Large-vocabulary Naïve Bayes
• One way to trade off time for space:
  – Assume you need K times as much memory as you actually have
  – Method:
    • Construct a hash function h(event)
    • For i=0,…,K-1:
      – Scan thru the train dataset
      – Increment counters for event only if h(event) mod K == i
      – Save this counter set to disk at the end of the scan
    • After K scans you have a complete counter set
  – Comment:
    • this works for any counting task, not just naïve Bayes
    • What we’re really doing here is organizing our “messages” to get more locality…
Counting
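The K-scan method can be sketched as follows. This is an illustrative sketch under assumptions: `k_pass_count` and `scan_events` are hypothetical names, crc32 is used as one possible choice for h(event), and each partial counter set is yielded to the caller where a real implementation would save it to disk.

```python
import zlib
from collections import Counter

def h(event):
    """A stable hash of the event string (crc32 is an illustrative choice)."""
    return zlib.crc32(event.encode("utf-8"))

def k_pass_count(scan_events, k):
    """Count events in k scans, holding roughly 1/k of the counters at a time.

    scan_events() must return a fresh iterator over all events on every call,
    e.g. by re-reading the training file.
    """
    for i in range(k):
        partial = Counter()
        for event in scan_events():
            if h(event) % k == i:
                partial[event] += 1
        # In practice: write this counter set to disk, then release the memory.
        yield partial
```

Each event belongs to exactly one of the K partial counter sets, so concatenating the saved sets gives the complete counts. The price is reading the training data K times.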
HOW TO ORGANIZE DATA TO ENABLE LARGE-SCALE COUNTING
Large vocabulary counting
• Another approach:
  – Start with
    • Q: “what can we do for large sets quickly”?
    • A: sorting
      – It’s O(n log n), not much worse than linear
      – You can do it for very large datasets using a merge sort
        » sort k subsets that fit in memory,
        » merge results, which can be done in linear time
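The two merge-sort steps above can be sketched with the standard library. This is a minimal sketch, not how any particular sort utility is implemented: `external_sort` and `max_lines_in_memory` are hypothetical names, records are assumed to be single lines without embedded newlines, and sorted runs are written to temporary files.

```python
import heapq
import os
import tempfile

def external_sort(lines, max_lines_in_memory):
    """Yield lines in sorted order while buffering at most max_lines_in_memory of them."""
    run_paths = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it out as one sorted run.
        buffer.sort()
        fd, path = tempfile.mkstemp(text=True)
        with os.fdopen(fd, "w") as f:
            f.writelines(line + "\n" for line in buffer)
        run_paths.append(path)
        buffer.clear()

    for line in lines:
        buffer.append(line.rstrip("\n"))
        if len(buffer) >= max_lines_in_memory:
            spill()
    if buffer:
        spill()

    files = [open(p) for p in run_paths]
    try:
        # heapq.merge reads the runs lazily, keeping one pending line per run.
        runs = [(l.rstrip("\n") for l in f) for f in files]
        for line in heapq.merge(*runs):
            yield line
    finally:
        for f in files:
            f.close()
        for p in run_paths:
            os.remove(p)
```

For example, `list(external_sort(open("messages.txt"), 100000))` would sort a hypothetical file `messages.txt` using runs of 100,000 lines each.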
Alternative visualization
Wikipedia on Old-School Merge Sort
Use four tape drives A,B,C,D
1. merge runs from A,B and write them alternately into C,D
2. merge runs from C,D and write them alternately into A,B
3. And so on….
Requires only constant memory.
ASIDE: MORE ON SORTING
Unix Sort
• Load as much as you can [actually --buffer-size=SIZE] into memory and do an in-memory sort [usually quicksort].
• If you have more to do, then spill this sorted buffer out on to disk, and get another buffer’s worth of data.
• Finally, merge your spill buffers.
Unix Sort
Pro tip!
SORTING OUT OF MEMORY WITH PIPES
generate lines | sort | process lines
How Unix Pipes Work
• Processes are all started at the same time
• Data streaming thru the pipeline is held in a queue: writer → […queue…] → reader
• If the queue is full:
  – the writing process is blocked
• If the queue is empty:
  – the reading process is blocked
• (I think) queues are usually smallish: 64k
How stream-and-sort works
• Pipeline is stream → […queue…] → sort
• Algorithm you get:
  – sort reads --buffer-size lines in, sorts them, spills them to disk
  – sort merges spill files after stream closes
  – stream is blocked when sort falls behind
  – and sort is blocked if it gets ahead
generate lines | sort | process lines
THE STREAM-AND-SORT DESIGN PATTERN FOR NAIVE BAYES
Large-vocabulary Naïve Bayes
• Create a hashtable C
• For each example id, y, x1,…,xd in train:
  – C(“Y=ANY”) ++; C(“Y=y”) ++
  – For j in 1..d:
    • C(“Y=y ^ X=xj”) ++
Large-vocabulary Naïve Bayes
• Create a hashtable C
• For each example id, y, x1,…,xd in train:
  – C(“Y=ANY”) ++; C(“Y=y”) ++
  – Print “Y=ANY += 1”
  – Print “Y=y += 1”
  – For j in 1..d:
    • C(“Y=y ^ X=xj”) ++
    • Print “Y=y ^ X=xj += 1”
• Sort the event-counter update “messages”
• Scan the sorted messages and compute and output the final counter values
Think of these as “messages” to another component to increment the counters
python MyTrainer.py train | sort | python MyCountAdder.py > model
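The message-emitting half of this pipeline might look like the sketch below. It is not the course's MyTrainer.py: the names `training_messages` and `parse_line` are hypothetical, and the input format (id, label, and space-separated words, separated by tabs) is an assumption made for illustration.

```python
def parse_line(line):
    """Parse one training line. Assumed format: id <TAB> label <TAB> space-separated words."""
    doc_id, label, text = line.rstrip("\n").split("\t", 2)
    return doc_id, label, text.split()

def training_messages(examples):
    """Yield one counter-update message per event; no counts are held in memory."""
    for _doc_id, label, words in examples:
        yield "Y=ANY += 1"
        yield "Y=%s += 1" % label
        for w in words:
            yield "Y=%s ^ X=%s += 1" % (label, w)
```

A script built on this would print each message for the lines of its input and leave the grouping of identical events to `sort`, which is what lets the trainer run in constant memory.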
Large-vocabulary Naïve Bayes
• Create a hashtable C
• For each example id, y, x1,…,xd in train:
  – C(“Y=ANY”) ++; C(“Y=y”) ++
  – Print “Y=ANY += 1”
  – Print “Y=y += 1”
  – For j in 1..d:
    • C(“Y=y ^ X=xj”) ++
    • Print “Y=y ^ X=xj += 1”
• Sort the event-counter update “messages”
  – We’re collecting together messages about the same counter
• Scan and add the sorted messages and output the final counter values

Y=business += 1
Y=business += 1
…
Y=business ^ X=aaa += 1
…
Y=business ^ X=zynga += 1
Y=sports ^ X=hat += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
…
Y=sports ^ X=hoe += 1
…
Y=sports += 1
…
Large-vocabulary Naïve Bayes
Y=business += 1
Y=business += 1
…
Y=business ^ X=aaa += 1
…
Y=business ^ X=zynga += 1
Y=sports ^ X=hat += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
Y=sports ^ X=hockey += 1
…
Y=sports ^ X=hoe += 1
…
Y=sports += 1
…

Streaming scan-and-add:
• previousKey = Null
• sumForPreviousKey = 0
• For each (event,delta) in input:
  • If event==previousKey
    • sumForPreviousKey += delta
  • Else
    • OutputPreviousKey()
    • previousKey = event
    • sumForPreviousKey = delta
• OutputPreviousKey()

define OutputPreviousKey():
• If previousKey!=Null
  • print previousKey, sumForPreviousKey

Accumulating the event counts requires constant storage … as long as the input is sorted.
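A runnable version of the scan-and-add loop might look like this. It is a sketch, not the course's MyCountAdder.py: the name `scan_and_add` is hypothetical, and messages are assumed to have the form "event += delta" shown on the slide.

```python
def scan_and_add(sorted_messages):
    """Sum the deltas of consecutive messages that share an event.

    The input must already be sorted so that all messages for one event are
    adjacent; only one key and one running total are kept in memory.
    """
    previous_key = None
    total = 0
    for line in sorted_messages:
        event, delta = line.rstrip("\n").rsplit(" += ", 1)
        if event == previous_key:
            total += int(delta)
        else:
            if previous_key is not None:
                yield previous_key, total
            previous_key, total = event, int(delta)
    if previous_key is not None:
        yield previous_key, total
```

If the input is not sorted the function still runs, but an event whose messages are not adjacent is reported several times with partial sums, which is why the sort step cannot be skipped.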
Distributed Counting à Stream and Sort Counting
[Diagram: the distributed-counting picture again. The counting logic on Machine 0 reads the examples (example 1, example 2, example 3, …) and sends “C[x] += D” messages through message-routing logic to hash tables spread over Machines 1, 2, …, K]
Distributed Counting à Stream and Sort Counting
[Diagram: Machine A runs the counting logic over the examples (example 1, example 2, example 3, …) and emits “C[x] += D” messages; Machine B sorts them; Machine C runs the logic to combine counter updates over the sorted stream (C[x1] += D1, C[x1] += D2, …)]
Stream and Sort Counting à Distributed Counting
[Diagram: the same three-stage pipeline with each stage spread over several machines: counting logic on Machines A1,…; sort on Machines B1,…; logic to combine counter updates on Machines C1,…. Slide annotations: “Trivial to parallelize!”, “Easy to parallelize!”, “Standardized message routing logic”]
Locality is good
Micro: 0.6 GB memory
Standard:
  S: 1.7 GB
  L: 7.5 GB
  XL: 15 GB
Hi Memory:
  XXL: 34.2 GB
  XXXXL: 68.4 GB
Large-vocabulary Naïve Bayes
• For each example id, y, x1,…,xd in train:
  – Print Y=ANY += 1
  – Print Y=y += 1
  – For j in 1..d:
    • Print Y=y ^ X=xj += 1
  Complexity: O(n), n=size of train
• Sort the event-counter update “messages”
  Complexity: O(n log n)
• Scan and add the sorted messages and output the final counter values
  Complexity: O(n)
(Assuming a constant number of labels apply to each document)
Model size: min( O(n), O(|V||dom(Y)|) )
python MyTrainer.py train | sort | python MyCountAdder.py > model
STREAM-AND-SORT + LOCAL PARTIAL COUNTING
REFINEMENT 1/2
Today
• Naïve Bayes with huge feature sets
  – i.e. ones that don’t fit in memory
• Pros and cons of possible approaches
  – Traditional “DB” (actually, key-value store)
  – Memory-based distributed DB
  – Stream-and-sort counting
• Optimizations
• Other tasks for stream-and-sort
Optimizations
cat train | MyTrainer.py | sort | MyCountAdder.py > model
  MyTrainer.py: O(n), input size=n, output size=n
  sort: O(n log n), input size=n, output size=n
  MyCountAdder.py: O(n), input size=n, output size=m
  m<<n … say O(sqrt(n))
A useful optimization: decrease the size of the input to the sort
Reduces the size from O(n) to O(m)
1. Compress the output by using simpler messages (“C[event] += 1”) → “event 1”
2. Compress the output more – e.g. string → integer code
Tradeoff – ease of debugging vs efficiency – are messages meaningful or meaningful in context?
Optimization: partial local counting

Without local counting:
• For each example id, y, x1,…,xd in train:
  – Print “Y=y += 1”
  – For j in 1..d:
    • Print “Y=y ^ X=xj += 1”
• Sort the event-counter update “messages”
• Scan and add the sorted messages and output the final counter values

With partial local counting:
• Initialize hashtable C
• For each example id, y, x1,…,xd in train:
  – C[Y=y] += 1
  – For j in 1..d:
    • C[Y=y ^ X=xj] += 1
  – If memory is getting full: output all values from C as messages and re-initialize C
• After the last example: output the values remaining in C as messages
• Sort the event-counter update “messages”
• Scan and add the sorted messages
python MyTrainer.py train | sort | python MyCountAdder.py > model
Review: Large-vocab Naïve Bayes
• Create a hashtable C
• For each example id, y, x1,…,xd in train:
  – C.inc(“Y=y”)
  – For j in 1..d:
    • C.inc(“Y=y ^ X=xj”)

    BUFFER_SIZE = 100000  # illustrative value: max distinct events held in memory

    class EventCounter(object):
        def __init__(self):
            self._ctr = {}

        def inc(self, event):
            # increment the counter for 'event'
            self._ctr[event] = self._ctr.get(event, 0) + 1
            if len(self._ctr) > BUFFER_SIZE:
                self.flush()

        def flush(self):
            # emit one "event<TAB>count" message per buffered counter;
            # call this once more after the last example
            for (e, n) in self._ctr.items():
                print('%s\t%d' % (e, n))
            # clear self._ctr
            self._ctr = {}
Distributed Counting à Stream and Sort Counting
[Diagram: the stream-and-sort pipeline with a BUFFER added to the counting logic on Machine A, which emits “C[x] += D” messages; Machine B sorts them; Machine C runs the logic to combine counter updates (C[x1] += D1, C[x1] += D2, …)]
How much does buffering help?
BUFFER_SIZE   Time   Message Size
none                 1.7M words
100           47s    1.2M
1,000         42s    1.0M
10,000        30s    0.7M
100,000       16s    0.24M
1,000,000     13s    0.16M
limit                0.05M
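One way to explore this trade-off on your own data is to count how many messages a buffered counter would emit for a given buffer size. The sketch below is an assumption-laden illustration, not the code that produced the numbers above: `buffered_message_count` is a hypothetical name, and the flush rule (emit everything once more than `buffer_size` distinct events are held) mirrors the EventCounter review slide.

```python
from collections import Counter

def buffered_message_count(events, buffer_size):
    """Number of messages emitted when counts are buffered locally.

    The buffer is flushed whenever it holds more than buffer_size distinct
    events, and once more at the end of the stream.
    """
    ctr = Counter()
    emitted = 0
    for e in events:
        ctr[e] += 1
        if len(ctr) > buffer_size:
            emitted += len(ctr)
            ctr.clear()
    return emitted + len(ctr)  # final flush of whatever is left
```

With `buffer_size=0` every event is emitted immediately, which corresponds to the unbuffered case. Once the buffer can hold every distinct event, the message count equals the number of distinct events, which corresponds to the “limit” row.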
CONFESSION: THIS NAÏVE BAYES HAS A PROBLEM…
REFINEMENT 2/2
Today
• Naïve Bayes with huge feature sets
  – i.e. ones that don’t fit in memory
• Pros and cons of possible approaches
  – Traditional “DB” (actually, key-value store)
  – Memory-based distributed DB
  – Stream-and-sort counting
• Optimizations
• Other tasks for stream-and-sort
• Finally: A “detail” about large-vocabulary Naïve Bayes…
Complexity of Naïve Bayes
• You have a train dataset and a test dataset
• Initialize an “event counter” (hashtable) C
• For each example id, y, x1,…,xd in train:
  – C(“Y=y”) ++
  – For j in 1..d:
    • C(“Y=y ^ X=xj”) ++
    • …
  Complexity: O(n), n=size of train (sequential reads)
• For each example id, y, x1,…,xd in test:
  – For each y’ in dom(Y):
    • Compute log Pr(y’,x1,…,xd) =

$$\sum_j \log\left(\frac{C(X = x_j \wedge Y = y') + m q_x}{C(X = \mathrm{ANY} \wedge Y = y') + m}\right) + \log\frac{C(Y = y') + m q_y}{C(Y = \mathrm{ANY}) + m}$$

  – Return the best y’
  Complexity: O(|dom(Y)|*n’), n’=size of test (sequential reads)
where: q_x = 1/|V|, q_y = 1/|dom(Y)|, m q_x = 1
Assume hashtable holding all counts fits in memory
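The smoothed score can be computed directly from a dictionary of counters. This is a sketch under stated assumptions: `log_joint` and `classify` are hypothetical names, counter keys follow the “Y=y ^ X=x” message format used on the earlier slides, a counter named “Y=y ^ X=ANY” is assumed to hold the total number of word occurrences in class y, and m defaults to |V| so that m·q_x = 1.

```python
import math

def log_joint(label, words, C, vocab_size, num_labels, m=None):
    """Smoothed log Pr(label, words) from event counts C (a dict of str to int)."""
    if m is None:
        m = vocab_size  # with q_x = 1/|V| this makes m * q_x == 1
    q_x = 1.0 / vocab_size
    q_y = 1.0 / num_labels
    # Class prior term.
    total = math.log((C.get("Y=%s" % label, 0) + m * q_y) / (C.get("Y=ANY", 0) + m))
    # One term per word; unseen (word, label) pairs count as zero.
    denom = C.get("Y=%s ^ X=ANY" % label, 0) + m
    for w in words:
        total += math.log((C.get("Y=%s ^ X=%s" % (label, w), 0) + m * q_x) / denom)
    return total

def classify(words, labels, C, vocab_size):
    """Return the label with the highest smoothed log joint probability."""
    return max(labels, key=lambda y: log_joint(y, words, C, vocab_size, len(labels)))
```

Each test example costs one hashtable lookup per word per label, which is where the O(|dom(Y)|*n’) bound comes from.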
Using Large-vocabulary Naïve Bayes
• For each example id, y, x1,…,xd in train:
• Sort the event-counter update “messages”
• Scan and add the sorted messages and output the final counter values
• For each example id, y, x1,…,xd in test:
  – For each y’ in dom(Y):
    • Compute log Pr(y’,x1,…,xd) =

$$\sum_j \log\left(\frac{C(X = x_j \wedge Y = y') + m q_x}{C(Y = y') + m}\right) + \log\frac{C(Y = y') + m q_y}{C(Y = \mathrm{ANY}) + m}$$

Model size: min( O(n), O(|V||dom(Y)|) )
Using Large-vocabulary Naïve Bayes
• For each example id, y, x1,…,xd in train:
• Sort the event-counter update “messages”
• Scan and add the sorted messages and output the final counter values
  Model size: O(|V|)
[For assignment]
• Initialize a HashSet NEEDED and a hashtable C
• For each example id, y, x1,…,xd in test:
  – Add x1,…,xd to NEEDED
  Time: O(n2), where n2 = size of test; Memory: same
• For each event, C(event) in the summed counters
  – If event involves a NEEDED term x read it into C
  Time: O(n2); Memory: same
• For each example id, y, x1,…,xd in test:
  – For each y’ in dom(Y):
    • Compute log Pr(y’,x1,…,xd) = …
  Time: O(n2); Memory: same
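The NEEDED filter can be sketched in a few lines. The names `needed_words` and `load_needed_counts` are hypothetical, and the summed counters are assumed to be stored one per line as “event<TAB>count” with events written as “Y=y ^ X=x”, as in the earlier message format.

```python
def needed_words(test_examples):
    """Collect every word that occurs anywhere in the test set."""
    needed = set()
    for _doc_id, _label, words in test_examples:
        needed.update(words)
    return needed

def load_needed_counts(count_lines, needed):
    """Stream through the summed counters, keeping only those the test set can use."""
    C = {}
    for line in count_lines:
        event, n = line.rstrip("\n").split("\t")
        _, sep, word = event.partition(" ^ X=")
        # Keep label-only counters (no word part) and counters for needed words.
        if not sep or word in needed:
            C[event] = int(n)
    return C
```

The counter file is read sequentially exactly once, and the in-memory table grows with the test set's vocabulary rather than with the training vocabulary.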
Large-Vocabulary Naïve Bayes
Learning/Counting
• Counts on disk with a key-value store
• Counts as messages to a set of distributed processes
• Repeated scans to build up partial counts
• Counts as messages in a stream-and-sort system
• Assignment: Counts as messages but buffered in memory

Using Counts
• Assignment:
  – Scan through counts to find those needed for test set
  – Classify with counts in memory
• Put counts in a database
• Use partial counts and repeated scans of the test data?
• Re-organize the counts and test set so that you can classify in a stream