Fast sampling for LDA William Cohen. MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS.

Fast sampling for LDA

William Cohen

MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS

Called “collapsed Gibbs sampling” since you’ve marginalized away some variables

From: Parameter estimation for text analysis - Gregor Heinrich

prob this word/term assigned to topic k

prob this doc contains topic k
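
For reference, the update that the two labels above annotate can be written in its standard form from the LDA literature. This is a sketch, not copied from the slide image: it assumes a symmetric β, per-topic α_k, vocabulary size V, and the count notation used later in these slides (n_{w|k}: times word w is assigned to topic k; n_{.|k}: total words in topic k; n_{k|d}: words in doc d assigned to topic k), with all counts excluding the token being resampled.

```latex
P(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\underbrace{\frac{n_{w|k} + \beta}{n_{\cdot|k} + \beta V}}_{\text{word } w \text{ assigned to topic } k}
\;\times\;
\underbrace{\left(n_{k|d} + \alpha_k\right)}_{\text{doc } d \text{ contains topic } k}
```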

More detail

[Figure: a unit-height bar split into segments labeled z=1, z=2, z=3, with a random point marking the sampled topic.]
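
A minimal Python sketch of this basic sampling step, under stated assumptions: `gibbs_weights` computes the usual unnormalized collapsed Gibbs weights for one token (counts already exclude that token), and `sample_topic_linear` walks the segments until it passes a random point. All function and parameter names here are hypothetical, chosen for illustration.

```python
import random

def gibbs_weights(word_topic, topic_total, doc_topic, alpha, beta, vocab_size):
    """Unnormalized P(z=k | ...) for one token; counts exclude that token."""
    return [
        (word_topic[k] + beta) / (topic_total[k] + beta * vocab_size)
        * (doc_topic[k] + alpha[k])
        for k in range(len(alpha))
    ]

def sample_topic_linear(weights, u=None):
    """Pick k with probability weights[k] / sum(weights) by a linear scan."""
    total = sum(weights)
    if u is None:
        u = random.random()
    target = u * total            # random point on a line of length `total`
    running = 0.0
    for k, w in enumerate(weights):
        running += w              # right edge of the segment for topic k
        if target < running:
            return k
    return len(weights) - 1       # guard against floating-point round-off
```

Both the normalizer and the scan touch every topic, so each draw costs O(K); that loop is what the speedups below try to avoid.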

SPEEDUP 1 - SPARSITY

KDD 2008

[Figure: a unit-height bar split into segments labeled z=1, z=2, z=3, with a random point marking the sampled topic.]

Running total of P(z=k|…) or P(z<=k)

Page 13: Fast sampling for LDA William Cohen. MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS.

Discussion….

• Where do you spend your time?
  – sampling the z’s
  – each sampling step involves a loop over all topics
  – this seems wasteful

• Even with many topics, words are often only assigned to a few different topics
  – low frequency words appear < K times … and there are lots and lots of them!
  – even frequent words are not in every topic

Discussion….

• What’s the solution?

Idea: come up with approximations to Z at each stage - then you might be able to stop early… computationally like a sparser vector

Want Zi >= Z

Tricks

• How do you compute and maintain the bound?
  – see the paper
• What order do you go in?
  – want to pick large P(k)’s first
  – … so we want large P(k|d) and P(k|w)
  – … so we maintain k’s in sorted order
    • which only change a little bit after each flip, so a bubble-sort will fix up the almost-sorted array
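
The sorted-order trick can be sketched as below. This is an illustrative helper, not code from the paper: `order` is a hypothetical list of topic ids kept in descending order of `counts`, and `pos` is the position of the one topic whose count just changed by a small amount.

```python
def bubble_fix(order, counts, pos):
    """Restore descending-by-count order after counts[order[pos]] changed.

    Returns the new position of the topic that moved.
    """
    # Count went up: bubble toward the front past smaller left neighbors.
    while pos > 0 and counts[order[pos]] > counts[order[pos - 1]]:
        order[pos], order[pos - 1] = order[pos - 1], order[pos]
        pos -= 1
    # Count went down: bubble toward the back past larger right neighbors.
    while pos + 1 < len(order) and counts[order[pos]] < counts[order[pos + 1]]:
        order[pos], order[pos + 1] = order[pos + 1], order[pos]
        pos += 1
    return pos
```

Because a single flip changes a count by one, the topic usually moves zero or a few positions, so the fix-up is far cheaper than re-sorting all K topics.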

Results

SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY

KDD 2009

z=s+r+q

t = topic (k), w = word, d = doc

z=s+r+q

• If U<s:
  • lookup U on line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
• If s<U<s+r:
  • lookup U on line segment for r

Only need to check t such that nt|d>0

t = topic (k), w = word, d = doc

z=s+r+q

• If U<s:
  • lookup U on line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
• If s<U<s+r:
  • lookup U on line segment for r
• If s+r<U:
  • lookup U on line segment for q

Only need to check t such that nw|t>0

z=s+r+q

q: only need to check t such that nw|t>0

r: only need to check t such that nt|d>0

s: only need to check occasionally (< 10% of the time)
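
A Python sketch of the three-bucket lookup follows. It assumes the usual expansion of the Gibbs weight into a smoothing-only term (s), a document term (r) and a topic-word term (q), all sharing the denominator βV + n.|t; the dict-based sparse storage and every name in the code are illustrative assumptions. For clarity the sketch recomputes s on every call, whereas a real implementation would cache and update it incrementally.

```python
def sparse_lda_sample(u, word_topic, doc_topic, topic_total, alpha, beta, vocab_size):
    """Sample a topic for one token from the s + r + q decomposition.

    word_topic:  dict t -> n_{w|t}, nonzero entries only, for the current word
    doc_topic:   dict t -> n_{t|d}, nonzero entries only, for the current doc
    topic_total: list of n_{.|t};  alpha: list of alpha_t;  u: uniform in [0, 1)
    """
    denom = [beta * vocab_size + n for n in topic_total]
    s_terms = [alpha[t] * beta / denom[t] for t in range(len(alpha))]
    r_terms = {t: n * beta / denom[t] for t, n in doc_topic.items()}
    q_terms = {t: (alpha[t] + doc_topic.get(t, 0)) * n / denom[t]
               for t, n in word_topic.items()}
    s, r, q = sum(s_terms), sum(r_terms.values()), sum(q_terms.values())

    x = u * (s + r + q)
    if x < s:                     # smoothing-only bucket: scan all topics
        items = list(enumerate(s_terms))
    elif x < s + r:               # document bucket: only t with n_{t|d} > 0
        x -= s
        items = list(r_terms.items())
    else:                         # topic-word bucket: only t with n_{w|t} > 0
        x -= s + r
        items = list(q_terms.items())
    for t, mass in items:
        x -= mass
        if x < 0:
            return t
    return items[-1][0]           # guard against floating-point round-off
```

The dense scan over all topics happens only when the draw lands in s; the other two branches iterate over the nonzero counts alone.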

z=s+r+q

Need to store nw|t for each word, topic pair …???

Only need to store nt|d for current d

Only need to store (and maintain) total words per topic and α’s,β,V

Trick: count up nt|d for d when you start working on d and update incrementally

z=s+r+q

Need to store nw|t for each word, topic pair …???

1. Precompute, for each t,

Most (>90%) of the time and space is here…

2. Quickly find t’s such that nw|t is large for w

Need to store nw|t for each word, topic pair …???

1. Precompute, for each t,

Most (>90%) of the time and space is here…

2. Quickly find t’s such that nw|t is large for w

• map w to an int array
  • no larger than frequency w
  • no larger than #topics
• encode (t,n) as a bit vector
  • n in the high-order bits
  • t in the low-order bits
• keep ints sorted in descending order
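
The packed encoding can be sketched in Python as follows. The helper names and the choice of how many low-order bits to reserve for the topic id are assumptions for illustration; the slide only fixes the layout (count high, topic low, sorted descending).

```python
def encode(topic, count, topic_bits):
    """Pack (t, n) into one int: count in the high bits, topic in the low bits."""
    return (count << topic_bits) | topic

def decode(packed, topic_bits):
    """Recover (t, n) from a packed int."""
    return packed & ((1 << topic_bits) - 1), packed >> topic_bits

def pack_word_counts(topic_counts, num_topics):
    """Build the descending-sorted int array for one word from {t: n_{w|t}}."""
    topic_bits = max(1, (num_topics - 1).bit_length())
    packed = [encode(t, n, topic_bits) for t, n in topic_counts.items() if n > 0]
    packed.sort(reverse=True)     # comparing the ints compares counts first
    return packed, topic_bits
```

Since the count sits in the high-order bits, an ordinary integer sort puts the topics with the largest nw|t first, and a scan over the q bucket tends to terminate after a few entries.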

Outline

• LDA/Gibbs algorithm details
• How to speed it up by parallelizing
• How to speed it up by faster sampling
  – Why sampling is key
  – Some sampling ideas for LDA
    • The Mimno/McCallum decomposition (SparseLDA)
    • Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Basic problem: how can we sample from a biased coin quickly?

If the distribution changes slowly maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r~uniform and use a binary tree

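[Figure: binary tree over the cumulative probabilities; example leaf: r in (23/40, 7/10]. Linear scan costs O(K); binary-tree lookup costs O(log2 K).]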
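
A small Python sketch of the proof of concept, using a sorted array of running totals and the standard-library `bisect` module in place of an explicit binary tree (an assumption made for brevity; the lookup cost is the same O(log2 K)). Function names are hypothetical.

```python
import bisect
import itertools
import random

def build_cumulative(probs):
    """Preprocessing, O(K): running totals P(z <= k)."""
    return list(itertools.accumulate(probs))

def sample_binary_search(cumulative, r=None):
    """Sampling, O(log2 K): first k whose running total exceeds the draw."""
    if r is None:
        r = random.random()
    k = bisect.bisect_right(cumulative, r * cumulative[-1])
    return min(k, len(cumulative) - 1)   # guard against round-off at the top
```

The preprocessing pays off only while the distribution stays fixed: changing one probability shifts every later running total, so the array has to be rebuilt.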

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Another idea…

Simulate the dart with two drawn values:
rx = int(u1*K)
ry = u2*pmax
keep throwing till you hit a stripe
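
The dart-throwing idea is plain rejection sampling, sketched here in Python. The function name and the optional `rng` parameter are illustrative assumptions; `probs` is any list of probabilities with at least one positive entry.

```python
import random

def sample_dart(probs, rng=random):
    """Throw darts at a K-by-pmax rectangle until one lands inside a stripe."""
    k_total = len(probs)
    pmax = max(probs)
    while True:
        rx = int(rng.random() * k_total)   # column the dart lands in
        ry = rng.random() * pmax           # height the dart lands at
        if ry < probs[rx]:                 # inside the stripe for outcome rx
            return rx
```

Each throw is O(1), but the expected number of throws is K*pmax divided by the total probability mass, which grows when one outcome dominates.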

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit.

You can always do this using only two colors in each column of the final alias table and the dart never misses!
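
One way to build such a table is Vose's variant of the alias method, sketched below in Python. This is an illustrative implementation rather than the one used in the papers cited in this deck; it assumes `probs` sums to 1, and the function names are hypothetical.

```python
import random

def build_alias_table(probs):
    """O(K) preprocessing into K columns holding at most two outcomes each."""
    k_total = len(probs)
    scaled = [p * k_total for p in probs]   # height 1.0 == average probability
    prob = [1.0] * k_total                  # share of column i owned by i
    alias = list(range(k_total))            # outcome filling the rest of column i
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        lo, hi = small.pop(), large.pop()
        prob[lo] = scaled[lo]               # column lo keeps its own mass
        alias[lo] = hi                      # and is topped up from outcome hi
        scaled[hi] -= 1.0 - scaled[lo]
        (small if scaled[hi] < 1.0 else large).append(hi)
    return prob, alias                      # leftover columns stay at 1.0

def sample_alias(prob, alias, rng=random):
    """O(1) draw: pick a column, then one of its two outcomes."""
    col = int(rng.random() * len(prob))
    return col if rng.random() < prob[col] else alias[col]
```

Every draw now costs two uniform random numbers and one comparison, independent of K; the price is the O(K) rebuild whenever the distribution changes.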

mathematically speaking…

KDD 2014

Key ideas

• Use variant of Mimno/McCallum decomposition
• Use alias tables to sample from the dense parts
• Since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs

KDD 2014

• q is stale, easy-to-draw-from distribution
• p is updated distribution
• computing ratios p(i)/q(i) is cheap
• usually the ratio is close to one

else the dart missed
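
A generic independence Metropolis-Hastings step with a stale proposal can be sketched in Python as below. This is the textbook accept/reject rule, not the exact procedure of the KDD 2014 paper; `draw_from_q` is a hypothetical callback standing in for an O(1) draw from the stale table, and `p` and `q` may be unnormalized.

```python
import random

def mh_step(current, p, q, draw_from_q, rng=random):
    """One Metropolis-Hastings step targeting p with stale proposal q.

    p, q: weights indexed by outcome; p[current] and q[proposal] must be > 0.
    """
    proposal = draw_from_q()
    # Acceptance ratio built from the cheap per-outcome ratios p(i)/q(i).
    ratio = (p[proposal] * q[current]) / (p[current] * q[proposal])
    if rng.random() < min(1.0, ratio):
        return proposal           # accept the proposed outcome
    return current                # reject: the dart missed, keep the old value
```

When q is only slightly stale the ratio stays near one, so almost every proposal is accepted and the chain still has p as its stationary distribution.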

KDD 2014

SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION

What if you try and parallelize?

Split document/term matrix randomly and distribute to p processors .. then run “Approximate Distributed LDA”

Common subtask in parallel versions of: LDA, SGD, ….

Page 37: Fast sampling for LDA William Cohen. MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS.

Introduction

• Common pattern:
  – do some learning in parallel
  – aggregate local changes from each processor
    • to shared parameters
  – distribute the new shared parameters
    • back to each processor
  – and repeat…
• AllReduce implemented in MPI, recently in VW code (John Langford) in a Hadoop-compatible scheme

MAP

ALLREDUCE
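
The aggregate-then-distribute pattern can be simulated in a single Python process as a sketch. This is not the MPI or VW API: the workers are just list entries, the tree is an assumed heap-style binary tree (worker i's parent is (i - 1) // 2), and the function name is hypothetical.

```python
def allreduce_sum(local_vectors):
    """Simulate AllReduce: every worker ends up with the element-wise sum.

    Reduce partial sums up a binary tree, then broadcast the total back down.
    """
    n = len(local_vectors)
    partial = [list(v) for v in local_vectors]
    for i in range(n - 1, 0, -1):            # visit children before parents
        parent = (i - 1) // 2
        partial[parent] = [a + b for a, b in zip(partial[parent], partial[i])]
    total = partial[0]                       # the root holds the global sum
    return [list(total) for _ in range(n)]   # broadcast to every worker
```

In a parallel LDA setting, each local vector would hold one worker's changes to the shared topic counts, and the returned total is what every worker applies before the next round.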

Gory details of VW Hadoop-AllReduce

• Spanning-tree server:
  – Separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
• Worker nodes (“fake” mappers):
  – Input for worker is locally cached
  – Workers all connect to spanning-tree server
  – Workers all execute the same code, which might contain AllReduce calls:
    • Workers synchronize whenever they reach an all-reduce

Hadoop AllReduce

don’t wait for duplicate jobs

Second-order method - like Newton’s method

2^24 features

~=100 non-zeros/example

2.3B examples

example is user/page/ad and conjunctions of these, positive if there was a click-thru on the ad

50M examples

explicitly constructed kernel 11.7M features

3,300 nonzeros/example

old method: SVM, 3 days: reporting time to get to fixed test error
