Fast sampling for LDA William Cohen. MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS.

Fast sampling for LDA

William Cohen

MORE LDA SPEEDUPS FIRST - RECAP LDA DETAILS

Called “collapsed Gibbs sampling” since you’ve marginalized away some variables

Fr: Parameter estimation for text analysis - Gregor Heinrich

prob this word/term assigned to topic k

prob this doc contains topic k
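For reference, the two annotated factors can be written out as a formula. This is a sketch of the standard collapsed Gibbs update, assuming a symmetric β and the count notation used later in these slides (n with subscripts w|k, k|d, and V for vocabulary size; later slides write t for the topic k). All counts exclude the token currently being resampled.

```latex
P(z_i = k \mid z_{-i}, w, d) \;\propto\;
\underbrace{\frac{n_{w|k} + \beta}{n_{\cdot|k} + \beta V}}_{\text{prob this word/term assigned to topic } k}
\;\times\;
\underbrace{\left(n_{k|d} + \alpha_k\right)}_{\text{prob this doc contains topic } k}
```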

More detail

[Figure labels: unit height, random]

SPEEDUP 1 - SPARSITY

KDD 2008

[Figure labels: unit height, random]

Running total of P(z=k|…) or P(z<=k)
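The running-total idea can be sketched as a linear scan. This is a minimal illustration, not code from the slides: the function name and the example weights are hypothetical, and the weights stand in for the unnormalized P(z=k|…).

```python
import random

def sample_by_running_total(weights, u):
    """Return the first k whose running total of weights exceeds u * Z."""
    target = u * sum(weights)      # scale u in [0, 1) up to the unnormalized total Z
    running = 0.0
    for k, w in enumerate(weights):
        running += w               # running total, proportional to P(z <= k)
        if target < running:
            return k
    return len(weights) - 1        # guard against floating-point round-off

weights = [2.0, 1.0, 1.0]          # hypothetical unnormalized P(z=k|...)
print(sample_by_running_total(weights, random.random()))
```

Every draw costs a loop over all K topics, which is the cost the speedups below try to avoid.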

Discussion….

• Where do you spend your time?
– sampling the z’s
– each sampling step involves a loop over all topics
– this seems wasteful

• even with many topics, words are often only assigned to a few different topics
– low frequency words appear < K times … and there are lots and lots of them!
– even frequent words are not in every topic

Discussion….

• What’s the solution?

Idea: come up with approximations to Z at each stage - then you might be able to stop early…..

computationally like a sparser vector

Want Zi>=Z
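One way to stop early with bounds Zi >= Z can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: `bounded_draw` and the per-topic upper bounds `ub` are hypothetical (the paper derives its bounds differently), and the bound used here is simply the weights seen so far plus the upper bounds on the unseen ones. As the bound tightens, each earlier topic gets an extra slice of the unit interval, so the total share of every topic still comes out to p[k]/Z.

```python
def bounded_draw(u, p, ub):
    """Draw a topic from weights p using upper bounds ub[k] >= p[k].

    Topics are visited in order; returns (topic, number_of_topics_visited).
    Z_l = (weights seen so far) + (bounds on the unseen ones) >= Z.
    """
    unseen_bound = sum(ub)
    seen = 0.0          # sum of the weights already visited
    prev_z = None       # previous bound Z_{l-1}
    prev_cum = 0.0      # part of [0, 1) already assigned to topics
    for l in range(len(p)):
        unseen_bound -= ub[l]
        new_seen = seen + p[l]
        z = new_seen + unseen_bound      # Z_l, shrinks toward the true Z
        cum = new_seen / z
        if u < cum:                      # u is settled: stop early
            if prev_z is not None:
                # Tightening the bound stretches each earlier topic's share.
                stretch = 1.0 / z - 1.0 / prev_z
                offset = u - prev_cum
                if offset < seen * stretch:
                    target, acc = offset / stretch, 0.0
                    for k in range(l):
                        acc += p[k]
                        if target < acc:
                            return k, l + 1
                    return l - 1, l + 1  # round-off guard
            return l, l + 1
        seen, prev_z, prev_cum = new_seen, z, cum
    return len(p) - 1, len(p)            # round-off guard

print(bounded_draw(0.1, [0.5, 0.3, 0.15, 0.05], [0.6, 0.5, 0.4, 0.3]))
```

The tighter the bounds and the larger the first few weights, the fewer topics are visited per draw.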

Tricks

• How do you compute and maintain the bound?
– see the paper
• What order do you go in?
– want to pick large P(k)’s first
– … so we want large P(k|d) and P(k|w)
– … so we maintain k’s in sorted order
• which only change a little bit after each flip, so a bubble-sort will fix up the almost-sorted array
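The bubble-sort fix-up can be sketched like this. The names `bubble_fix`, `order`, and `counts` are hypothetical; a real sampler would also keep a topic-to-position map instead of calling `order.index`, which is linear in K.

```python
def bubble_fix(order, counts, pos):
    """Restore descending-count order after counts[order[pos]] changed slightly."""
    # Move left while the entry is larger than its left neighbour.
    while pos > 0 and counts[order[pos]] > counts[order[pos - 1]]:
        order[pos], order[pos - 1] = order[pos - 1], order[pos]
        pos -= 1
    # Move right while the entry is smaller than its right neighbour.
    while pos + 1 < len(order) and counts[order[pos]] < counts[order[pos + 1]]:
        order[pos], order[pos + 1] = order[pos + 1], order[pos]
        pos += 1
    return pos

counts = [5, 9, 7, 1]        # hypothetical per-topic counts
order = [1, 2, 0, 3]         # topic ids sorted by count, largest first
counts[0] += 3               # a flip moves mass onto topic 0
bubble_fix(order, counts, order.index(0))
print(order)
```

Because a flip changes a count by one, the entry usually moves zero or one position, so the fix-up is close to constant time.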

Results

SPEEDUP 2 - ANOTHER APPROACH FOR USING SPARSITY

KDD 09

z=s+r+q

t = topic (k), w = word, d = doc
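The three terms come from expanding the numerator of the collapsed Gibbs weight for one token. The following is a sketch of that algebra, assuming a symmetric β and writing n with subscript ·|t for the total number of words assigned to topic t; here z denotes the total unnormalized mass, not a topic assignment.

```latex
\begin{aligned}
P(z = t \mid w, d) &\propto \frac{(\alpha_t + n_{t|d})(\beta + n_{w|t})}{\beta V + n_{\cdot|t}} \\
&= \underbrace{\frac{\alpha_t \beta}{\beta V + n_{\cdot|t}}}_{s_t}
 + \underbrace{\frac{n_{t|d}\,\beta}{\beta V + n_{\cdot|t}}}_{r_t}
 + \underbrace{\frac{(\alpha_t + n_{t|d})\, n_{w|t}}{\beta V + n_{\cdot|t}}}_{q_t},
\qquad s = \sum_t s_t,\; r = \sum_t r_t,\; q = \sum_t q_t .
\end{aligned}
```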

z=s+r+q

• If U<s:
– lookup U on line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
• If s<U<s+r:
– lookup U on line segment for r

Only need to check t such that nt|d > 0

t = topic (k), w = word, d = doc

z=s+r+q

• If U<s:
– lookup U on line segment with tic-marks at α1β/(βV + n.|1), α2β/(βV + n.|2), …
• If s<U<s+r:
– lookup U on line segment for r
• If s+r<U:
– lookup U on line segment for q

Only need to check t such that nw|t > 0
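The three-way lookup can be sketched in code. This is a minimal illustration, assuming a symmetric β and the usual bucket definitions: s sums α_t·β/(βV + n.|t) over all topics, r sums nt|d·β/(βV + n.|t) over topics present in the document, and q sums (α_t + nt|d)·nw|t/(βV + n.|t) over topics in which the word occurs. The function names and the dict-based sparse counts are hypothetical. For clarity the sketch recomputes s, r, and q on every call; the speedup comes from caching them and updating them incrementally.

```python
def walk(x, terms, fallback):
    """Return the topic whose segment contains x on a line of laid-out terms."""
    for t, mass in terms:
        if x < mass:
            return t
        x -= mass
    return fallback                      # round-off guard

def sparse_lda_draw(u, alpha, beta, V, n_t, n_td, n_wt):
    """n_t: words per topic (list); n_td, n_wt: dicts holding nonzero counts only."""
    K = len(alpha)
    denom = [beta * V + n_t[t] for t in range(K)]
    s_terms = [(t, alpha[t] * beta / denom[t]) for t in range(K)]
    r_terms = [(t, c * beta / denom[t]) for t, c in n_td.items()]
    q_terms = [(t, (alpha[t] + n_td.get(t, 0)) * c / denom[t]) for t, c in n_wt.items()]
    s = sum(m for _, m in s_terms)
    r = sum(m for _, m in r_terms)
    q = sum(m for _, m in q_terms)
    x = u * (s + r + q)                  # U on the slides, for u uniform in [0, 1)
    if x < s:                            # smoothing-only bucket: all K topics
        return walk(x, s_terms, K - 1)
    if x < s + r:                        # document bucket: topics with nt|d > 0
        return walk(x - s, r_terms, r_terms[-1][0])
    # topic-word bucket: topics with nw|t > 0
    return walk(x - s - r, q_terms, q_terms[-1][0] if q_terms else K - 1)
```

Only the first branch touches every topic; the other two loop over the nonzero counts for the current document or word.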

z=s+r+q

Only need to check t such that nw|t > 0 (the q term)

Only need to check t such that nt|d > 0 (the r term)

Only need to check occasionally (< 10% of the time) (the s term)

z=s+r+q

Need to store nw|t for each word, topic pair …???

Only need to store nt|d for current d

Only need to store (and maintain) total words per topic and α’s,β,V

Trick: count up nt|d for d when you start working on d and update incrementally

z=s+r+q

1. Precompute, for each t,

Most (>90%) of the time and space is here…

2. Quickly find t’s such that nw|t is large for w

• map w to an int array
– no larger than frequency of w
– no larger than #topics
• encode (t,n) as a bit vector
– n in the high-order bits
– t in the low-order bits
• keep ints sorted in descending order
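The (t,n) encoding can be sketched as below. The bit width `TOPIC_BITS` and the helper names are assumptions made for illustration. Because the count sits in the high-order bits, sorting the plain integers in descending order puts the topics with the largest count for this word first.

```python
TOPIC_BITS = 10                      # assumption: fewer than 2**10 topics
TOPIC_MASK = (1 << TOPIC_BITS) - 1

def encode(t, n):
    return (n << TOPIC_BITS) | t     # count in high bits, topic in low bits

def decode(code):
    return code & TOPIC_MASK, code >> TOPIC_BITS   # (t, n)

def add_to_topic(codes, t, delta):
    """Change this word's count for topic t by delta; keep codes sorted descending."""
    for i, code in enumerate(codes):
        if code & TOPIC_MASK == t:
            n = (code >> TOPIC_BITS) + delta
            del codes[i]
            break
    else:
        n = delta                    # topic not present yet for this word
    if n > 0:
        codes.append(encode(t, n))
        codes.sort(reverse=True)     # a bubble pass would do; sort keeps the sketch short
    return codes

codes = sorted([encode(3, 2), encode(7, 5)], reverse=True)
add_to_topic(codes, 3, 4)
print([decode(c) for c in codes])
```

Entries whose count drops to zero are removed, so the array never grows past the number of topics the word currently appears in.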

Outline

• LDA/Gibbs algorithm details
• How to speed it up by parallelizing
• How to speed it up by faster sampling
– Why sampling is key
– Some sampling ideas for LDA
• The Mimno/McCallum decomposition (SparseLDA)
• Alias tables (Walker 1977; Li, Ahmed, Ravi, Smola KDD 2014)

Alias tables

http://www.keithschwarz.com/darts-dice-coins/

Basic problem: how can we sample from a biased coin quickly?

If the distribution changes slowly maybe we can do some preprocessing and then sample multiple times. Proof of concept: generate r~uniform and use a binary tree

r in (23/40,7/10]

O(log2 K)
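A binary search over cumulative sums does the same job as the binary tree and has the same O(log2 K) cost per draw. The sketch below uses hypothetical weights, chosen only so that one outcome owns the interval (23/40, 7/10]; the slide's actual distribution is not given in the text.

```python
from bisect import bisect_left
from fractions import Fraction
from itertools import accumulate

# Hypothetical weights, chosen so that the third outcome owns (23/40, 7/10].
probs = [Fraction(1, 4), Fraction(13, 40), Fraction(1, 8), Fraction(3, 10)]
cum = list(accumulate(probs))        # 1/4, 23/40, 7/10, 1

def draw(r):
    """Outcome k owns the interval (cum[k-1], cum[k]]; binary search finds it."""
    return bisect_left(cum, r)

print(draw(Fraction(3, 5)))
```

The preprocessing is one pass to build `cum`, so this only pays off when the same distribution is sampled many times before it changes.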

Alias tables

Another idea…

Simulate the dart with two drawn values:

x = int(u1*K)

y = u2*pmax

keep throwing till you hit a stripe
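The dart-throwing loop is plain rejection sampling. A minimal sketch, with a hypothetical function name and example distribution, and with the two uniform draws taken independently:

```python
import random

def throw_darts(probs, rng):
    """Rejection sampling: pick a column, then a height; retry on a miss."""
    K, pmax = len(probs), max(probs)
    while True:
        x = int(rng.random() * K)      # column: which outcome's stripe to aim at
        y = rng.random() * pmax        # height within the bounding rectangle
        if y < probs[x]:               # the dart landed on the stripe
            return x

rng = random.Random(0)
print(throw_darts([0.5, 0.3, 0.2], rng))
```

The expected number of throws is K·pmax, so a single dominant outcome makes most darts miss.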

Alias tables

An even more clever idea: minimize the brown space (where the dart “misses”) by sizing the rectangle’s height to the average probability, not the maximum probability, and cutting and pasting a bit.

You can always do this using only two colors in each column of the final alias table and the dart never misses!
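The cut-and-paste construction can be sketched with Vose's variant of the alias method, which is one common way to build the table; the slides do not say which variant they use. The names `build_alias`, `keep`, and `alias` are hypothetical, and `probs` is assumed to sum to 1.

```python
import random

def build_alias(probs):
    """Vose's construction: each column holds at most two outcomes."""
    K = len(probs)
    scaled = [p * K for p in probs]            # average column height becomes 1
    small = [i for i, h in enumerate(scaled) if h < 1.0]
    large = [i for i, h in enumerate(scaled) if h >= 1.0]
    keep = [1.0] * K                           # share of column i kept by outcome i
    alias = list(range(K))                     # outcome that fills the rest of column i
    while small and large:
        s, l = small.pop(), large.pop()
        keep[s], alias[s] = scaled[s], l       # top up column s with mass from l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    return keep, alias

def alias_draw(keep, alias, rng):
    i = int(rng.random() * len(keep))          # pick a column uniformly
    return i if rng.random() < keep[i] else alias[i]

keep, alias = build_alias([0.5, 0.3, 0.15, 0.05])
print(alias_draw(keep, alias, random.Random(0)))
```

Building the table is O(K); each draw afterwards is O(1), which is why the table is worth reusing until the distribution has drifted.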

mathematically speaking…

KDD 2014

Key ideas

• Use a variant of the Mimno/McCallum decomposition

• Use alias tables to sample from the dense parts

• Since the alias table gradually goes stale, use Metropolis-Hastings sampling instead of Gibbs

KDD 2014

• q is stale, easy-to-draw from distribution
• p is updated distribution
• computing ratios p(i)/q(i) is cheap
• usually the ratio is close to one

else the dart missed
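The accept/reject step can be sketched as an independence Metropolis-Hastings sampler, where the proposal is drawn from the stale q regardless of the current state. This is an illustration only: the function names and numbers are hypothetical, and `random.Random.choices` stands in for an alias-table draw.

```python
import random

def accept_prob(i, j, p, q):
    """Chance of moving from i to proposal j when j was drawn from the stale q."""
    return min(1.0, (p[j] * q[i]) / (p[i] * q[j]))

def mh_step(i, p, q, draw_from_q, rng):
    j = draw_from_q(rng)                       # cheap draw, e.g. from an alias table
    # Accept the proposal, or else stay put (the dart missed).
    return j if rng.random() < accept_prob(i, j, p, q) else i

# Hypothetical numbers: q is a slightly out-of-date copy of p.
p = [0.50, 0.30, 0.20]
q = [0.45, 0.35, 0.20]
rng = random.Random(0)
state = 0
for _ in range(5):
    state = mh_step(state, p, q, lambda r: r.choices(range(3), weights=q)[0], rng)
print(state)
```

Only the ratios p(i)/q(i) enter the acceptance test, so p does not need to be normalized, and when q is close to p almost every proposal is accepted.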

KDD 2014

SPEEDUP FOR PARALLEL LDA - USING ALLREDUCE FOR SYNCHRONIZATION

What if you try and parallelize?

Split the document/term matrix randomly and distribute it to p processors, then run “Approximate Distributed LDA”
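The synchronization step in this style of parallel LDA can be sketched as follows, assuming the usual scheme in which every worker starts a sweep from the same shared counts, resamples only its own documents, and then contributes the change it made. The function name and the numbers are hypothetical.

```python
def merge_counts(old, local_counts):
    """Merge worker results: add every worker's change to the shared counts."""
    merged = list(old)
    for local in local_counts:
        for i, (new, base) in enumerate(zip(local, old)):
            merged[i] += new - base            # this worker's delta for entry i
    return merged

old = [10, 4, 6]                     # hypothetical shared topic counts before the sweep
worker_a = [11, 3, 6]                # each worker started from `old` and resampled its own docs
worker_b = [10, 5, 5]
print(merge_counts(old, [worker_a, worker_b]))
```

The result is approximate because each worker samples against counts that ignore the other workers' changes during the sweep.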

Common subtask in parallel versions of: LDA, SGD, ….

Introduction

• Common pattern:
– do some learning in parallel
– aggregate local changes from each processor
• to shared parameters
– distribute the new shared parameters
• back to each processor
– and repeat….
• AllReduce implemented in MPI, recently in VW code (John Langford) in a Hadoop-compatible scheme

ALLREDUCE

Gory details of VW Hadoop-AllReduce

• Spanning-tree server:
– Separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
• Worker nodes (“fake” mappers):
– Input for worker is locally cached
– Workers all connect to spanning-tree server
– Workers all execute the same code, which might contain AllReduce calls:
• Workers synchronize whenever they reach an all-reduce
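What an AllReduce call computes over a tree of workers can be sketched with an in-process simulation. This is not VW's implementation, which communicates over sockets using the tree built by the spanning-tree server; the function name and the implicit binary-tree layout are assumptions made for illustration.

```python
def tree_allreduce(values):
    """Simulate AllReduce over a binary tree of workers: sum up, then broadcast down.

    values[i] is worker i's local vector; worker i's children are 2i+1 and 2i+2.
    """
    n = len(values)
    partial = [list(v) for v in values]
    for i in reversed(range(n)):               # reduce phase: leaves toward the root
        for child in (2 * i + 1, 2 * i + 2):
            if child < n:
                partial[i] = [a + b for a, b in zip(partial[i], partial[child])]
    total = partial[0]                         # the root now holds the global sum
    return [list(total) for _ in range(n)]     # broadcast phase: every worker gets a copy

print(tree_allreduce([[1, 0], [2, 1], [3, 2], [4, 3], [5, 4]]))
```

Every worker ends up with the same summed vector, which is why workers running identical code can call it as a synchronization point.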

Hadoop AllReduce

don’t wait for duplicate jobs

Second-order method - like Newton’s method

2^24 features

~=100 non-zeros/example

2.3B examples

example is user/page/ad and conjunctions of these, positive if there was a click-thru on the ad

50M examples

explicitly constructed kernel 11.7M features

3,300 nonzeros/example

old method: SVM, 3 days: reporting time to get to fixed test error
