Transcript of LDA FOR BIG DATA - cs.cmu.edu
LDA for Big Data - Outline
• Quick review of the LDA model
  – clustering words-in-context
• Parallel LDA ~= IPM
• Fast sampling tricks for LDA
  – Sparsified sampler
  – Alias table
  – Fenwick trees
• LDA for text → LDA-like models for graphs
Unsupervised NB vs LDA
[Plate diagrams: unsupervised NB draws one class label Y per doc from a single class prior π, then words W from the class's word distribution γ (K of them); LDA draws one topic Zdi per word from a different per-doc topic distribution θd, then words Wdi from topic distributions γk, with priors α and β.]
LDA and (Collapsed) Gibbs Sampling
• Gibbs sampling – works for any directed model!
  - Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  - The sequence of samples comprises a Markov chain
  - The stationary distribution of the chain is the joint distribution
Key capability: estimate the distribution of one latent variable given the other latent variables and the observed variables.
Recap: Collapsed Sampling for LDA
[LDA plate diagram: α → θd → Zdi → Wdi ← γk ← β, with Nd words per doc, D docs, K topics]
Pr(zdi=t | rest) ∝ Pr(Z|E+) · Pr(E-|Z): the "fraction" of time Z=t in doc d, times the fraction of time W=w in topic t.
Only sample the Z's
This ignores a detail – the counts should not include the Zdi currently being sampled.
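The update above can be sketched in plain Python. This is a minimal sketch, not the slides' code; the helper and count names (n_td, n_wt, n_t) are my own. It includes the detail just noted: the counts must exclude the Zdi being resampled.

```python
import random

def collapsed_gibbs_pass(docs, Z, n_td, n_wt, n_t, alpha, beta, K, V, rng):
    """One sweep of collapsed Gibbs sampling for LDA (pure-Python sketch).
    docs[d] is a list of word ids; Z[d][i] is the topic of word i in doc d.
    n_td[d][t], n_wt[w][t], n_t[t] are counts kept in sync with Z."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t_old = Z[d][i]
            # Remove z_di from the counts -- the "detail" noted above.
            n_td[d][t_old] -= 1; n_wt[w][t_old] -= 1; n_t[t_old] -= 1
            # Pr(z_di = t | rest) ∝ (n_t|d + α) · (n_w|t + β) / (n_.|t + Vβ)
            weights = [(n_td[d][t] + alpha) * (n_wt[w][t] + beta) / (n_t[t] + V * beta)
                       for t in range(K)]
            t_new = rng.choices(range(K), weights=weights)[0]
            Z[d][i] = t_new
            n_td[d][t_new] += 1; n_wt[w][t_new] += 1; n_t[t_new] += 1
```

Each sweep touches every token once, which is why an iteration is linear in corpus size.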
Observation
• How much does the choice of z depend on the other z's in the same document? – quite a lot
• How much does the choice of z depend on the z's elsewhere in the corpus? – maybe not so much – it depends on Pr(w|t), but that changes slowly
• Can we parallelize Gibbs and still get good results?
Question
• Can we parallelize Gibbs sampling?
  – formally, no: every choice of z depends on all the other z's
  – Gibbs needs to be sequential
    • just like SGD
What if you try to parallelize? Split the document/term matrix randomly, distribute it to p processors .. then run "Approximate Distributed LDA".
This is iterative parameter mixing.
What if you try to parallelize?
D = #docs, W = #word types, K = #topics, N = #words in corpus
All-Reduce cost
Later work…
• Algorithms:
  – Distributed variational EM
  – Asynchronous LDA (AS-LDA)
  – Approximate Distributed LDA (AD-LDA)
  – Ensemble versions of LDA: HLDA, DCM-LDA
• Implementations:
  – GitHub Yahoo_LDA
    • not Hadoop – special-purpose communication code for synchronizing the global counts
    • Alex Smola, Yahoo → CMU
  – Mahout LDA
    • Andy Schlaikjer, CMU → Twitter
RECAP
• each iteration: linear in corpus size
• most of the time is resampling
• resample: linear in #topics
[Figure: a unit-height line segment divided into regions for z=1, z=2, z=3, …; a random draw selects a region.]
1. You spend a lot of time sampling
2. There's a loop over all topics here in the sampler
normalizer = s + r + q
• Draw random U from uniform[0,1]
• If U < s:
  • look up U on the line segment with tic-marks at α1β/(βV+n·|1), α2β/(βV+n·|2), …
z = s + r + q
• If U < s:
  • look up U on the line segment with tic-marks at α1β/(βV+n·|1), α2β/(βV+n·|2), …
• If s < U < s+r:
  • look up U on the line segment for r – only need to check t such that nt|d > 0
z = s + r + q
• If U < s:
  • look up U on the line segment with tic-marks at α1β/(βV+n·|1), α2β/(βV+n·|2), …
• If s < U < s+r:
  • look up U on the line segment for r
• If s+r < U:
  • look up U on the line segment for q – only need to check t such that nw|t > 0
z = s + r + q
• q: only need to check t such that nw|t > 0
• r: only need to check t such that nt|d > 0
• s: only needs checking occasionally (U < s less than 10% of the time)
z = s + r + q
• Need to store nw|t for each word–topic pair …???
• Only need to store nt|d for the current d
• Only need to store (and maintain) the total words per topic, plus the α's, β, and V
• Trick: count up nt|d for d when you start working on d, and update it incrementally
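The bucketed lookup above can be sketched as follows. This is my own sketch of the idea, not the slides' code, and the names (sample_sparse, n_td_d, n_wt_w) are assumptions: the mass is split into the s, r, and q buckets, and only the nonzero entries of the sparse buckets are ever walked.

```python
import random

def sample_sparse(n_td_d, n_wt_w, n_t, alpha, beta, V, K, rng):
    """SparseLDA-style draw (sketch): split the unnormalized mass into
    s (smoothing, dense but tiny), r (nonzero only where n_t|d > 0) and
    q (nonzero only where n_w|t > 0), then walk only the relevant bucket.
    n_td_d: dict t -> count for the current doc d.
    n_wt_w: dict t -> count for the current word w."""
    denom = [n_t[t] + V * beta for t in range(K)]
    s = sum(alpha * beta / denom[t] for t in range(K))
    r_terms = {t: c * beta / denom[t] for t, c in n_td_d.items() if c > 0}
    q_terms = {t: (n_td_d.get(t, 0) + alpha) * c / denom[t]
               for t, c in n_wt_w.items() if c > 0}
    r, q = sum(r_terms.values()), sum(q_terms.values())
    U = rng.random() * (s + r + q)
    if U < s:                            # rare: walk the dense smoothing bucket
        for t in range(K):
            U -= alpha * beta / denom[t]
            if U <= 0:
                return t
    U -= s
    if U < r:                            # only topics with n_t|d > 0
        for t, v in r_terms.items():
            U -= v
            if U <= 0:
                return t
    U -= r
    for t, v in q_terms.items():         # only topics with n_w|t > 0
        U -= v
        if U <= 0:
            return t
    return K - 1                         # guard against float round-off
```

The decomposition is exact because (nt|d + α)(nw|t + β) = αβ + nt|d·β + (nt|d + α)·nw|t.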
z = s + r + q
Need to store nw|t for each word–topic pair …???
1. Precompute, for each t, …
   Most (>90%) of the time and space is here…
2. Quickly find t's such that nw|t is large for w
Need to store nw|t for each word–topic pair …???
1. Precompute, for each t, …
   Most (>90%) of the time and space is here…
2. Quickly find t's such that nw|t is large for w
• associate each w with an int array
  • no larger than the frequency of w
  • no larger than #topics
• encode (t,n) as a bit vector
  • n in the high-order bits
  • t in the low-order bits
• keep the ints sorted in descending order
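A sketch of that encoding (the helper names and the 10-bit topic field are my assumptions, not from the slides): packing n into the high bits means a plain descending integer sort orders topics by count.

```python
def encode(t, n, topic_bits=10):
    """Pack (topic t, count n) into one int: n in the high-order bits,
    t in the low-order bits, so sorting ints descending sorts by count."""
    assert t < (1 << topic_bits)
    return (n << topic_bits) | t

def decode(code, topic_bits=10):
    """Recover (t, n) from a packed int."""
    return code & ((1 << topic_bits) - 1), code >> topic_bits

def topics_by_count(word_counts, topic_bits=10):
    """word_counts: dict t -> n_w|t. Returns (t, n) pairs, largest n first."""
    codes = sorted((encode(t, n, topic_bits) for t, n in word_counts.items() if n > 0),
                   reverse=True)
    return [decode(c, topic_bits) for c in codes]
```

This gives exactly the per-word sorted array described above: one int per topic the word actually appears in, largest counts first.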
Alias tables
http://www.keithschwarz.com/darts-dice-coins/
Basic problem: how can we sample from a biased die quickly?
If the distribution changes slowly, maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r ~ uniform and use a binary tree.
[Figure: binary tree over the cumulative distribution; e.g. r in (23/40, 7/10] selects one leaf. Naive sampling is O(K); with the tree, each draw is O(log2 K).]
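The proof of concept might look like this (a sketch, not the slides' code), using prefix sums plus binary search in place of an explicit tree — the same O(K) preprocessing / O(log2 K)-per-draw tradeoff:

```python
import bisect
import itertools
import random

def make_cdf(probs):
    """O(K) preprocessing: cumulative sums of the K outcome probabilities."""
    return list(itertools.accumulate(probs))

def sample(cdf, rng):
    """O(log2 K) per draw: binary-search for the interval r lands in,
    e.g. r in (23/40, 7/10] selects that interval's outcome."""
    r = rng.random() * cdf[-1]
    return bisect.bisect_left(cdf, r)
```

The catch, as the next slides show, is that rebuilding the prefix sums after a count changes costs O(K) again.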
Another idea…
Simulate the dart with two drawn values:
  rx ← int(u1*K)
  ry ← u2*pmax
Keep throwing till you hit a stripe.
An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit.
You can always do this using only two colors in each column of the final alias table, and the dart never misses!
mathematically speaking…
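This construction is Vose's alias method; a sketch (function names are mine) of the O(K) setup and O(1) draw:

```python
import random

def build_alias(probs):
    """Vose's alias method: O(K) setup. Each column holds at most two
    outcomes ("two colors"), with height scaled to the average probability,
    so the dart never misses."""
    K = len(probs)
    scaled = [p * K for p in probs]         # average probability maps to height 1
    prob, alias = [0.0] * K, [0] * K
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l    # cut-and-paste part of l's mass into s's column
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                 # leftovers are exactly full columns
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    """O(1) per sample: pick a column, then one of its two colors."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Each column i contributes prob[i]/K of outcome i and (1-prob[i])/K of outcome alias[i], which together reconstruct the original distribution exactly.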
LDA with Alias Sampling
• Sample the Z's with an alias sampler
• Don't update the sampler with each flip:
  – Correct for "staleness" with the Metropolis-Hastings algorithm
[KDD 2014]
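The staleness correction can be sketched as a generic Metropolis-Hastings step (the slides only name the algorithm; this framing and the names are mine): proposals come from the stale alias distribution q, and the acceptance ratio uses the up-to-date weights p, so the chain still targets the current distribution.

```python
import random

def mh_step(t_curr, p, q_stale, draw_from_stale, rng):
    """One Metropolis-Hastings step: propose a topic from the stale alias
    table (proposal distribution q_stale) and accept against the fresh
    target weights p, correcting for the staleness of the table.
    p, q_stale: functions t -> unnormalized weight."""
    t_prop = draw_from_stale()
    accept = min(1.0, (p(t_prop) * q_stale(t_curr)) /
                      (p(t_curr) * q_stale(t_prop)))
    return t_prop if rng.random() < accept else t_curr
```

A few such steps per token are enough in practice, since the stale proposal is usually close to the fresh target.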
Fenwick Tree (1994)
http://www.keithschwarz.com/darts-dice-coins/
Basic problem: how can we sample from a biased die quickly…
…and update quickly? Maybe we can use a binary tree…
[Figure: binary tree over cumulative counts; e.g. r in (23/40, 7/10] selects one leaf. Naive sampling is O(K); with the tree, both sampling and updating are O(log2 K).]
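A sketch of that structure (the class name and interface are mine): a Fenwick / binary indexed tree over the K topic weights supports both the count update and the draw in O(log2 K).

```python
import random

class FenwickSampler:
    """Fenwick (binary indexed) tree over K unnormalized weights:
    O(log K) to update one weight, O(log K) to draw a sample."""
    def __init__(self, K):
        self.K = K
        self.tree = [0.0] * (K + 1)     # 1-indexed partial sums
        self.w = [0.0] * K

    def update(self, t, weight):
        """Set outcome t's weight, propagating the delta up the tree."""
        delta, self.w[t] = weight - self.w[t], weight
        i = t + 1
        while i <= self.K:
            self.tree[i] += delta
            i += i & (-i)

    def total(self):
        """Prefix sum over all K outcomes."""
        s, i = 0.0, self.K
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)
        return s

    def sample(self, rng):
        """Descend the implicit tree to find where u lands."""
        u = rng.random() * self.total()
        pos, bit = 0, 1
        while bit * 2 <= self.K:
            bit *= 2
        while bit:
            nxt = pos + bit
            if nxt <= self.K and self.tree[nxt] < u:
                u -= self.tree[nxt]
                pos = nxt
            bit //= 2
        return pos                       # 0-indexed outcome containing u
```

After each token is resampled, only the two affected topic weights need `update` calls, instead of an O(K) rebuild.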
Data structures and algorithms
• q: dense, changes slowly, re-used for each word in a document – sampler: Fenwick tree
• r: sparse, a different one is needed for each unique term in a doc – sampler: binary search
Second idea: you can sample document-by-document or word-by-word …. or….
use an MF-like approach to distributing the data.
Motivation
• Social graphs seem to have
  – some aspects of randomness
    • small diameter, giant connected components, ..
  – some structure
    • homophily, scale-free degree dist?
• How do you model it?
More terms
• "Stochastic block model", aka "block-stochastic matrix":
  – Draw ni nodes in block i
  – With probability pij, connect pairs (u,v) where u is in block i, v is in block j
  – Special, simple case: pii = qi, and pij = s for all i≠j
• Question: can you fit this model to a graph?
  – find each pij and the latent node → block mapping
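Generating a graph from this model is direct; a sketch (function and variable names are mine):

```python
import random

def sample_sbm(block_sizes, p, rng):
    """Generate an undirected graph from a stochastic block model:
    block_sizes[i] nodes in block i; edge (u, v) is present with
    probability p[i][j], where u is in block i and v is in block j."""
    block_of = [i for i, n in enumerate(block_sizes) for _ in range(n)]
    N = len(block_of)
    edges = [(u, v) for u in range(N) for v in range(u + 1, N)
             if rng.random() < p[block_of[u]][block_of[v]]]
    return block_of, edges
```

The special case above corresponds to p[i][i] = qi on the diagonal and a single off-diagonal value s everywhere else. Fitting goes the other way: given only the edges, recover p and the node-to-block mapping.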
Stochastic Blockmodels: assume 1) nodes within a block z and 2) edges between blocks zp, zq are exchangeable
[Plate diagram: block assignments zp for each of N nodes; edge indicators apq over the N² node pairs, drawn given the block pair (zp, zq); parameters a, π, b.]
Another mixed membership block model
• z = (zi, zj) is a pair of block ids
• nz = #pairs z
• qz1,i = #links to i from block z1
• qz1,· = #outlinks in block z1
• δ = indicator for the diagonal
• M = #nodes