Large Scale Parallel Supervised Topic-Modeling -implementation plan-
Keisuke Kamataki, Jun Zhu, Eric Xing
Sep 27, 2010
Implementation plan
Big picture: Separate implementation in 3 steps (the distributed MedLDA is runnable from step 1 onward)
• Plan 1 (E-step and M-step are separate programs. The SVM in the M-step is not parallelized)
• Plan 2 (E-step and M-step are separate programs. The SVM in the M-step is parallelized)
• Plan 3 (Everything is integrated and parallelized within a single program)
* Start from plan 1, then extend it to plan 2, and try plan 3 last.
Plan 1 (E-step and M-step are separate programs. The SVM in the M-step is not parallelized)
Given: many documents
• Perform the E-step (Gibbs sampling) in parallel. Get the sufficient statistics.
• Perform the M-step on a single computer.
• Repeat until convergence.
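[Diagram: the E-step is a single program whose parallel workers each sample z; the M-step is a second single program that estimates α, β, η, μ on one machine.]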
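One common way to combine the results of parallel Gibbs workers is the merge used in approximate distributed LDA: each worker starts from a copy of the global topic-word counts, resamples only its own documents, and the driver then adds every worker's change to the old global counts. The sketch below assumes that scheme. The plan itself does not say how the sufficient statistics are merged, and the function name `merge_topic_word_counts` is hypothetical.

```python
def merge_topic_word_counts(old, worker_counts):
    """Merge K x V topic-word count tables returned by parallel workers.

    old           -- global counts that every worker started from
    worker_counts -- one updated K x V table per worker
    Assumption: each worker changed only the tokens of its own documents,
    so the per-worker differences (local - old) can simply be added up.
    """
    merged = [row[:] for row in old]  # copy so `old` is left untouched
    for local in worker_counts:
        for k, row in enumerate(local):
            for w, count in enumerate(row):
                merged[k][w] += count - old[k][w]
    return merged


# Two topics, two words, two workers.
old = [[2, 1], [0, 3]]
worker_a = [[2, 0], [0, 4]]  # moved one token of word 1 from topic 0 to topic 1
worker_b = [[1, 1], [1, 3]]  # moved one token of word 0 from topic 0 to topic 1
merged = merge_topic_word_counts(old, [worker_a, worker_b])
```

The total number of tokens is unchanged by the merge, which is a cheap sanity check to run after every E-step.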
Plan 1 in detail
• Prepare the E-step code and the M-step code separately (probably in C++). Merge them using a Shell/Perl/Ruby script.
• Easy to implement and debug quickly.
• Can be extended to plan 2 and plan 3.
• Fails to scale only when K (number of topics) and L (number of labels) are large; this should be solved in plan 2.
A good starting point.
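The outer loop that the merging script has to implement can be sketched as follows. This is a minimal illustration in Python rather than Shell/Perl/Ruby, and it passes the E-step and M-step in as plain functions. In plan 1 these would instead wrap the two separate programs. All names (`run_em`, `merge_stats`, `parallel_map`, the toy steps) are hypothetical, and the convergence test on an objective value is an assumption.

```python
from collections import Counter


def merge_stats(per_shard_stats):
    """Sum the sufficient statistics produced by each E-step worker."""
    total = Counter()
    for stats in per_shard_stats:
        total.update(stats)  # Counter.update adds values key by key
    return total


def run_em(shards, e_step, m_step, params,
           max_iters=100, tol=1e-6, parallel_map=map):
    """Alternate a parallel E-step and a single-machine M-step."""
    previous = None
    iterations = 0
    for iterations in range(1, max_iters + 1):
        # E-step: one call per document shard; swap in a parallel map here.
        per_shard = parallel_map(lambda shard, p=params: e_step(shard, p), shards)
        stats = merge_stats(per_shard)
        # M-step: runs once, on the merged statistics.
        params, objective = m_step(stats, params)
        if previous is not None and abs(objective - previous) <= tol:
            break
        previous = objective
    return params, iterations


# Toy stand-ins: the "model" is a single mean, which also serves as the score.
def toy_e_step(shard, params):
    return Counter({"sum": sum(shard), "n": len(shard)})


def toy_m_step(stats, params):
    mean = stats["sum"] / stats["n"]
    return mean, mean


estimate, n_iters = run_em([[1, 2], [3, 4]], toy_e_step, toy_m_step, params=0.0)
```

Keeping the parallel map as a parameter means the same driver can run serially for debugging and with a thread pool, process pool, or cluster scheduler later.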
Plan 2 (E-step and M-step are separate programs. The SVM in the M-step is parallelized)
Given: many documents
• Perform the E-step (Gibbs sampling) in parallel. Get the sufficient statistics.
• Perform the M-step in parallel (only the SVM that estimates η and μ is parallelized).
• Repeat until convergence.
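[Diagram: the E-step is a single program whose parallel workers each sample z; the M-step is a second single program that estimates α, β, with the estimation of η, μ split across parallel workers.]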
Plan 2 in detail
• Prepare the E-step code and the M-step code separately. Merge them using a Shell/Perl/Ruby script (same as plan 1).
• Almost a copy of plan 1 except for the SVM part of the M-step.
• The SVM is parallelized, so the estimation of η, μ should be fast and scalable (we still need to figure out how to parallelize the SVM in FB's computing environment).
A practical extension (we only need to figure out how to parallelize the SVM).
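How to parallelize the SVM is still open in this plan. As one possible option only, the sketch below trains an independent binary linear SVM per label (one-vs-rest), so the L training jobs can run in parallel. This is a swapped-in technique, not the multi-class SVM formulation of MedLDA: it solves the primal problem with a cyclic Pegasos-style subgradient method, so it yields only weight vectors η and no dual variables μ. The feature vectors are assumed to be per-document topic proportions, and all function names are hypothetical.

```python
from functools import partial


def train_binary_svm(X, y, lam=0.01, epochs=100):
    """Cyclic Pegasos-style subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    t = 0
    for _ in range(epochs):
        for x, yi in zip(X, y):
            t += 1
            step = 1.0 / (lam * t)
            margin = yi * sum(wj * xj for wj, xj in zip(w, x))
            w = [(1.0 - step * lam) * wj for wj in w]  # shrink (regularizer)
            if margin < 1.0:                           # hinge loss is active
                w = [wj + step * yi * xj for wj, xj in zip(w, x)]
    return w


def train_label(label, X, labels, lam, epochs):
    """Train the binary problem 'this label vs. all the others'."""
    y = [1.0 if l == label else -1.0 for l in labels]
    return train_binary_svm(X, y, lam, epochs)


def train_one_vs_rest(X, labels, n_labels, lam=0.01, epochs=100, parallel_map=map):
    """Return one weight vector per label; jobs are independent of each other."""
    job = partial(train_label, X=X, labels=labels, lam=lam, epochs=epochs)
    return list(parallel_map(job, range(n_labels)))


# Toy data: 3 topics, 3 labels, features are topic proportions per document.
X = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1],
     [0.1, 0.8, 0.1], [0.2, 0.7, 0.1],
     [0.1, 0.1, 0.8], [0.1, 0.2, 0.7]]
labels = [0, 0, 1, 1, 2, 2]
eta = train_one_vs_rest(X, labels, n_labels=3)
```

Passing a pool's `map` method as `parallel_map` distributes the per-label jobs. Whether this decomposition is acceptable for MedLDA, which also needs μ, would have to be checked against the model derivation.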
Plan 3 (Everything is integrated and parallelized within a single program)
Given: many documents
• Perform the E-step (Gibbs sampling) in parallel. Get the sufficient statistics.
• Perform the M-step in parallel.
• Repeat until convergence.
[Diagram: one single program; parallel workers sample z and estimate α, β, η, μ.]
Plan 3 in detail
• The E-step and M-step (including the SVM) are integrated within a single program.
• In practice, the computational efficiency and the algorithmic behavior would be almost the same as in plan 2 (but the software would be more complicated and the implementation would take a lot of time).
Could be elegant from a research perspective (but should be built as an extension of plan 2, since the software will be complex).
To-do for now
• Keisuke: Prepare the core code for Gibbs-sampling-based LDA and the merging script.
• Jun: Derive the Gibbs-sampling equations for MedLDA.
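As a reference point for the first item, the core of Gibbs-sampling-based LDA can be sketched as below. This is the standard collapsed Gibbs sampler for plain unsupervised LDA with symmetric priors, written in Python for brevity. It is not the MedLDA sampler, whose equations are still to be derived, and the function name, defaults, and data layout (documents as lists of word ids) are assumptions.

```python
import random


def gibbs_lda(docs, K, V, alpha=0.1, beta=0.01, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for LDA; returns assignments and count tables."""
    rng = random.Random(seed)
    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total tokens per topic
    z = []
    for d, doc in enumerate(docs):      # random initial assignment
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[d]):
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
    for _ in range(n_sweeps):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]             # remove the token from the counts
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # p(z = t | rest) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [(n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k             # add it back under the new topic
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return z, n_dk, n_kw, n_k


docs = [[0, 1, 2, 0], [2, 3, 3, 1], [0, 0, 3]]  # word ids in a 4-word vocabulary
z, n_dk, n_kw, n_k = gibbs_lda(docs, K=2, V=4, n_sweeps=5)
```

The count tables `n_dk`, `n_kw`, and `n_k` are the natural candidates for the sufficient statistics passed from the E-step to the M-step.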