Large Scale Parallel Supervised Topic-Modeling -implementation plan-

Keisuke Kamataki, Jun Zhu, Eric Xing
Sep 27, 2010

Implementation plan

Big picture: implement in three separate steps (the distributed MedLDA can already be run after step 1)

• Plan 1 (E-step and M-step are separate programs. SVM in M-step is not parallelized)

• Plan 2 (E-step and M-step are separate programs. SVM in M-step is parallelized)

• Plan 3 (Everything is integrated and parallelized within a single program)

* Start from Plan 1, then extend it to Plan 2, and try Plan 3 last.

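Plan 1 (E-step and M-step are separate programs. SVM in M-step is not parallelized)

Given: many documents

Perform the E-step (Gibbs sampling) in parallel and collect the sufficient statistics

Perform the M-step on a single computer

Repeat until convergence

[Diagram: two boxes, each labeled "Single Program". The E-step box contains four parallel samplers of z; the M-step box contains one estimator of α, β, η, μ.]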
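The loop on this slide can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the planned implementation: it covers only the unsupervised LDA part (no η, μ, no SVM), keeps α fixed as a symmetric scalar, treats β as the topic-word probabilities re-estimated in the M-step, and runs a fixed number of iterations instead of testing convergence. The names `e_step_shard`, `m_step` and `run_em` are hypothetical, and threads merely stand in for cluster workers (CPU-bound Python threads do not run concurrently under the GIL).

```python
import random
from concurrent.futures import ThreadPoolExecutor


def e_step_shard(docs, alpha, beta, n_sweeps, seed):
    """Gibbs-sample topic assignments z for one shard of documents.

    beta is held fixed during the E-step, so documents are independent
    and shards never need to communicate while sampling.
    Returns K x V topic-word counts (the sufficient statistics for beta).
    """
    rng = random.Random(seed)
    K, V = len(beta), len(beta[0])
    counts = [[0] * V for _ in range(K)]
    for words in docs:
        z = [rng.randrange(K) for _ in words]   # random initial assignment
        n_dk = [0] * K                          # per-document topic counts
        for k in z:
            n_dk[k] += 1
        for _ in range(n_sweeps):
            for n, w in enumerate(words):
                n_dk[z[n]] -= 1                 # remove the current token
                weights = [(n_dk[k] + alpha) * beta[k][w] for k in range(K)]
                z[n] = rng.choices(range(K), weights=weights)[0]
                n_dk[z[n]] += 1
        for n, w in enumerate(words):           # keep the final sample only
            counts[z[n]][w] += 1
    return counts


def m_step(shard_counts, K, V, smoothing=0.01):
    """Merge the shard counts on one machine and re-estimate beta."""
    beta = []
    for k in range(K):
        row = [smoothing + sum(c[k][w] for c in shard_counts) for w in range(V)]
        total = sum(row)
        beta.append([x / total for x in row])
    return beta


def run_em(shards, K, V, alpha=0.1, n_iters=20, n_sweeps=5, seed=0):
    rng = random.Random(seed)
    beta = []
    for _ in range(K):                          # random normalized start
        row = [rng.random() + 0.1 for _ in range(V)]
        total = sum(row)
        beta.append([x / total for x in row])
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        for _ in range(n_iters):                # fixed count, no convergence test
            seeds = [rng.randrange(2 ** 30) for _ in shards]
            futures = [pool.submit(e_step_shard, docs, alpha, beta, n_sweeps, s)
                       for docs, s in zip(shards, seeds)]
            beta = m_step([f.result() for f in futures], K, V)
    return beta
```

Documents are lists of integer word ids, and a shard is a list of documents, for example `run_em([[[0, 0, 1]], [[2, 3, 3]]], K=2, V=4)`.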

Plan 1 in detail

• Prepare the E-step code and the M-step code separately (probably in C++). Merge them using a Shell/Perl/Ruby script

• Easy to implement and debug quickly

• Should be extensible to Plan 2 and Plan 3

• May fail to scale only when K (number of topics) and L (number of labels) are large; this should be solved in Plan 2

A good starting point!
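The merging script could look roughly like the sketch below. It is written in Python rather than Shell/Perl/Ruby only to keep the example testable, and everything about the two programs is an assumption: the executable names `./estep` and `./mstep`, their flags, the file naming, and the existence of an initial `model_0` file are all hypothetical. Workers are launched as local processes here; on a cluster the launch call would be replaced by whatever job submission is available.

```python
import subprocess
from pathlib import Path


def e_step_cmd(shard, model, out):
    # Hypothetical command line for the separate E-step program.
    return ["./estep", "--docs", str(shard), "--model", str(model), "--out", str(out)]


def m_step_cmd(stats_files, model_out):
    # Hypothetical command line for the separate M-step program.
    return ["./mstep", "--out", str(model_out)] + [str(s) for s in stats_files]


def run_plan1(shards, workdir, n_iters, popen=subprocess.Popen, run=subprocess.run):
    """Alternate parallel E-step workers and a single M-step process."""
    workdir = Path(workdir)
    model = workdir / "model_0"                 # assumed to exist already
    for it in range(n_iters):
        outs = [workdir / f"stats_{it}_{i}" for i in range(len(shards))]
        # Start one E-step process per shard, then wait for all of them.
        procs = [popen(e_step_cmd(s, model, o)) for s, o in zip(shards, outs)]
        for p in procs:
            if p.wait() != 0:
                raise RuntimeError("E-step worker failed")
        # Single M-step process reads every statistics file.
        new_model = workdir / f"model_{it + 1}"
        run(m_step_cmd(outs, new_model), check=True)
        model = new_model
    return model
```

The `popen` and `run` parameters exist so the control flow can be exercised without the real binaries; a convergence test would replace the fixed `n_iters` once the M-step reports an objective value.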

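Plan 2 (E-step and M-step are separate programs. SVM in M-step is parallelized)

Given: many documents

Perform the E-step (Gibbs sampling) in parallel and collect the sufficient statistics

Perform the M-step in parallel (only the SVM that estimates η and μ is parallelized)

Repeat until convergence

[Diagram: two boxes, each labeled "Single Program". The E-step box contains four parallel samplers of z; the M-step box contains one estimator of α, β and two parallel estimators of η, μ.]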

Plan 2 in detail

• Prepare the E-step code and the M-step code separately. Merge them using a Shell/Perl/Ruby script (same as Plan 1)

• Almost a copy of Plan 1 except for the SVM part of the M-step

• The SVM is parallelized, so the estimation of η and μ should be fast and scalable (of course, we need to figure out how to parallelize the SVM in FB's computing environment)

A practical extension (we only need to figure out how to parallelize the SVM)
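The slides leave open how the SVM would be parallelized. One possible decomposition, shown here purely as an assumption and not as the authors' method, is one-vs-rest: train one binary linear SVM per label, each independently of the others, so the L problems can run on separate workers. The sketch uses a Pegasos-style stochastic subgradient solver on per-document feature vectors (for instance topic proportions). It produces only weight vectors playing the role of η; the dual variables μ are not computed. All function names are hypothetical, and threads stand in for cluster workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def train_binary_svm(X, y, lam=0.01, n_steps=2000, seed=0):
    """Pegasos-style stochastic subgradient descent for a linear SVM.

    X is a list of feature vectors, y a list of +1/-1 labels.
    Returns the weight vector (no separate bias term).
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, n_steps + 1):
        i = rng.randrange(len(X))
        step = 1.0 / (lam * t)
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        # Shrink w (gradient of the L2 regularizer).
        w = [(1.0 - step * lam) * wj for wj in w]
        # Add the hinge-loss subgradient if the margin was violated.
        if margin < 1.0:
            w = [wj + step * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def train_one_vs_rest(X, labels, n_labels, max_workers=4, seed=0):
    """Train one binary SVM per label; the L problems are independent."""
    def fit(label):
        y = [1 if l == label else -1 for l in labels]
        return train_binary_svm(X, y, seed=seed + label)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit, range(n_labels)))


def predict(eta, x):
    """Return the label whose weight vector gives the highest score."""
    scores = [sum(wj * xj for wj, xj in zip(w, x)) for w in eta]
    return max(range(len(eta)), key=scores.__getitem__)
```

Because each label's problem touches the full document set but a different target vector, this split scales with L; it does not by itself help when the number of documents is the bottleneck.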

Plan 3 (Everything is integrated and parallelized within a single program)

Given: many documents

Perform the E-step (Gibbs sampling) in parallel and collect the sufficient statistics

Perform the M-step in parallel

Repeat until convergence

[Diagram: one box labeled "Single Program" containing four parallel samplers of z and three parallel estimators of α, β, η, μ.]

Plan 3 in detail

• The E-step and the M-step (including the SVM) are integrated within a single code base

• In practice, the computational efficiency and the algorithmic behavior would be almost the same as in Plan 2 (but the software will be more complicated and the implementation would take a lot of time)

Could be elegant from a research standpoint (but should be built as an extension of Plan 2, since the software will be complex)

To do for now

• Keisuke: Prepare the core code for Gibbs-sampling-based LDA and the merging script

• Jun: Derive the Gibbs-sampling equations for MedLDA
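As a starting point for the core LDA code, the sampling step for plain unsupervised LDA can be written down directly. The block below is a sketch under one assumption: β denotes the topic-word probabilities and is held fixed during the E-step (it is re-estimated in the M-step), while the per-document topic proportions are integrated out. Here n_{dk}^{-n} is the number of tokens in document d assigned to topic k, excluding token n. The supervised terms that MedLDA adds are exactly the derivation still to be done and are not shown.

```latex
\begin{align}
p(z_{dn} = k \mid \mathbf{z}_{d,-n}, \mathbf{w}_d, \alpha, \beta)
  &\;\propto\; \bigl(n_{dk}^{-n} + \alpha_k\bigr)\, \beta_{k, w_{dn}} \\
\hat{\beta}_{kw}
  &\;\propto\; \sum_{d} \sum_{n} \mathbb{1}\bigl[z_{dn} = k,\; w_{dn} = w\bigr]
\end{align}
```

The second line is the sufficient statistic the E-step hands to the M-step: topic-word counts, summed over all documents and normalized over w.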