Large Scale Parallel Supervised Topic-Modeling -implementation plan-

Keisuke Kamataki, Jun Zhu, Eric Xing

Sep 27, 2010

Implementation plan

Big picture: three separate implementations, built in stages (distributed MedLDA can already be run from the first one, Plan 1)

• Plan 1 (E-step and M-step are separate programs; the SVM in the M-step is not parallelized)

• Plan 2 (E-step and M-step are separate programs; the SVM in the M-step is parallelized)

• Plan 3 (everything is integrated and parallelized within a single program)

* Start from Plan 1, then extend it to Plan 2, and try Plan 3 last.

Plan 1 (E-step and M-step are separate; the SVM in the M-step is not parallelized)

Given: many documents

Perform the E-step (Gibbs sampling) in parallel; collect the sufficient statistics

Perform the M-step on a single computer

Repeat until convergence

[Figure: two single programs. E-step: parallel workers, each sampling z. M-step: α, β, η, μ.]
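The "collect the sufficient statistics" step of Plan 1 can be sketched as a map-and-merge: each worker counts over its own shard of documents, and the counts are summed before the M-step. This is a minimal Python sketch, not the planned C++ code. The function names, the (word_id, topic_id) token layout, and the use of threads as a stand-in for separate machines are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def shard_stats(shard):
    """Count (topic_id, word_id) pairs in one shard.

    `shard` is a list of documents; each document is a list of
    (word_id, topic_id) tokens, i.e. words with their current z.
    This layout is an assumption, not taken from the plan.
    """
    counts = Counter()
    for doc in shard:
        for word_id, topic_id in doc:
            counts[(topic_id, word_id)] += 1
    return counts


def merge_stats(per_shard):
    """Sum the per-shard counts into one global table."""
    total = Counter()
    for counts in per_shard:
        total.update(counts)  # Counter.update adds counts
    return total


def parallel_sufficient_stats(shards, max_workers=4):
    """Map shard_stats over the shards, then merge the results.

    Threads stand in for worker machines here; a real deployment
    would use separate processes or hosts.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return merge_stats(pool.map(shard_stats, shards))
```

Because the merge is a plain sum, the result does not depend on how the documents are split across workers, which is what lets the M-step stay on a single computer.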

Plan 1 in detail

• Prepare the E-step code and the M-step code separately (probably in C++), and combine them with a Shell/Perl/Ruby script

• Easy to implement and debug quickly

• Can be extended to Plan 2 and Plan 3

• May fail to scale only when K (the number of topics) and L (the number of labels) are large; this should be solved in Plan 2

A good starting point!
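The merging script described above can be sketched as a small driver that alternates the two programs until the objective stops changing. The plan names Shell/Perl/Ruby for this script; Python is used here only for illustration. The command lines are hypothetical, and the sketch assumes that the M-step program prints its objective value as the last line of standard output, which the plan does not specify.

```python
import subprocess


def run_em(e_step_cmd, m_step_cmd, max_iters=50, tol=1e-4, run=subprocess.run):
    """Alternate the E-step and M-step programs until convergence.

    e_step_cmd, m_step_cmd: argument lists for the two programs
    (hypothetical; the real executables are not named in the plan).
    run: injectable for testing; defaults to subprocess.run.
    Returns (iterations_run, last_objective).
    """
    previous = None
    for iteration in range(1, max_iters + 1):
        # E-step: parallel Gibbs sampling, writes sufficient statistics.
        run(e_step_cmd, check=True)
        # M-step: single machine; assumed to print the objective last.
        result = run(m_step_cmd, check=True, capture_output=True, text=True)
        objective = float(result.stdout.strip().splitlines()[-1])
        # Stop when the relative change falls below the tolerance.
        if previous is not None and abs(objective - previous) <= tol * max(1.0, abs(previous)):
            return iteration, objective
        previous = objective
    return max_iters, previous
```

A call might look like `run_em(["./e_step", "docs/"], ["./m_step", "stats/"])`, where both program names and their arguments are placeholders.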

Plan 2 (E-step and M-step are separate; the SVM in the M-step is parallelized)

Perform the M-step in parallel (parallelize only the SVM that estimates η and μ)

[Figure: two single programs. E-step: parallel workers, each sampling z. M-step: α, β, with η, μ shown on multiple parallel workers.]

Plan 2 in detail

• Prepare the E-step code and the M-step code separately, and combine them with a Shell/Perl/Ruby script (same as Plan 1)

• Almost a copy of Plan 1, except for the SVM part of the M-step

• The SVM is parallelized, so the estimation of η and μ should be fast and scalable (of course, we still need to figure out how to parallelize the SVM in FB's computing environment)

A practical extension (we only need to figure out how to parallelize the SVM)
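How to parallelize the SVM is left open above. One possible decomposition, shown here purely as an illustration and not as the chosen method, is one-vs-rest: with L labels, the L binary SVMs are independent and can be trained on separate workers. The sketch below assumes linear SVMs trained by full-batch subgradient descent on the regularized hinge loss, with per-document topic proportions as features. All function names and hyperparameter values are hypothetical, and MedLDA's own max-margin formulation may call for a different scheme.

```python
from concurrent.futures import ThreadPoolExecutor


def train_binary_svm(features, targets, reg=0.01, epochs=200, lr=0.1):
    """Linear SVM (no bias term) by full-batch subgradient descent.

    targets are +1.0 / -1.0. Returns the weight vector.
    """
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [reg * wi for wi in w]  # gradient of the L2 regularizer
        for x, y in zip(features, targets):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin < 1.0:  # hinge loss is active for this example
                for j in range(dim):
                    grad[j] -= y * x[j] / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def train_one_vs_rest(features, labels, n_labels, max_workers=4):
    """Train one binary SVM per label; the L problems are independent.

    Threads stand in for separate workers in this sketch.
    """
    def fit(label):
        targets = [1.0 if l == label else -1.0 for l in labels]
        return train_binary_svm(features, targets)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit, range(n_labels)))


def predict(weights, x):
    """Return the label whose SVM gives the highest score."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    return scores.index(max(scores))
```

This parallelizes over L only; it does not by itself address a large K, since each binary problem still sees all K features.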

Plan 3 (everything is integrated and parallelized within a single program)

Perform the M-step in parallel

[Figure: one single program. Parallel workers, each sampling z, with α, β, η, μ shown on each worker.]

Plan 3 in detail

• The E-step and the M-step (including the SVM) are integrated within a single code base

• In practice, the computational efficiency and the algorithmic behavior would be almost the same as in Plan 2 (but the software will be more complicated and the implementation would take a lot of time)

Could be beautiful from a research perspective (but should be built as an extension of Plan 2, since the software will be complex)

To-do for the time being

• Keisuke: prepare the core code for Gibbs-sampling-based LDA and the merging script

• Jun: derive the Gibbs-sampling-based equations for MedLDA
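As a reference point for the core LDA code in the first item, here is a minimal sketch of the standard collapsed Gibbs sampler for plain (unsupervised) LDA. It is not the MedLDA sampler, whose equations are still to be derived, and it is not the planned C++ implementation. The function name, the symmetric Dirichlet hyperparameters `doc_topic_prior` and `topic_word_prior`, and their default values are assumptions.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, n_sweeps=50,
              doc_topic_prior=0.1, topic_word_prior=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (z, doc_topic, topic_word): topic assignments and count tables.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initialization of the topic assignments z.
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(z_d)

    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dt + a) * (n_tw + b) / (n_t + V * b).
                weights = [
                    (doc_topic[d][t] + doc_topic_prior)
                    * (topic_word[t][w] + topic_word_prior)
                    / (topic_total[t] + vocab_size * topic_word_prior)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                # Record the new assignment.
                z[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return z, doc_topic, topic_word
```

The `doc_topic` and `topic_word` tables are the sufficient statistics that the parallel E-step would need to collect from each worker.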
