“Study on Parallel SVM Based on MapReduce” Kuei-Ti Lu 03/12/2015.
Support Vector Machine (SVM)
• Used for
  – Classification
  – Regression
• Applied in
  – Network intrusion detection
  – Image processing
  – Text classification
  – …
libSVM
• Library for support vector machines
• Integrates different types of SVMs
Types of SVMs Supported by libSVM
• For support vector classification
  – C-SVC
  – Nu-SVC
• For support vector regression
  – Epsilon-SVR
  – Nu-SVR
• For distribution estimation
  – One-class SVM
C-SVC
• Goal: Find the separating hyperplane that maximizes the margin
• Support vectors: data points closest to the separating hyperplane
C-SVC
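• Primal form

$$\min_{w,b,\xi} \left\{ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \right\}$$

$$\text{s.t.}\quad y_i \left( w^T \phi(x_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \dots, n$$

• Dual form (derived using Lagrange multipliers)

$$\min_{a} \; \frac{1}{2} \sum_{i,j} a_i a_j y_i y_j \, k(x_i, x_j) - \sum_{i} a_i$$

$$\text{s.t.}\quad y^T a = 0, \quad 0 \le a_i \le C, \quad i = 1, \dots, l$$

Here $k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ is the kernel, and $l$ is the number of training vectors (written $n$ in the primal form).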
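As a small illustration of the C-SVC dual form, the sketch below evaluates the dual objective and checks its constraints for a given multiplier vector. It assumes the standard form of the objective (with the factor 1/2) and a linear kernel; the function names `linear_kernel`, `dual_objective`, and `is_feasible` are made up for this example and are not part of libSVM.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def linear_kernel(u: Vector, v: Vector) -> float:
    """k(u, v) = u . v (assumed kernel; any other kernel function can be passed in)."""
    return sum(ui * vi for ui, vi in zip(u, v))


def dual_objective(a: Sequence[float], y: Sequence[int], xs: Sequence[Vector],
                   kernel: Callable[[Vector, Vector], float] = linear_kernel) -> float:
    """0.5 * sum_ij a_i a_j y_i y_j k(x_i, x_j) - sum_i a_i."""
    n = len(a)
    quad = sum(a[i] * a[j] * y[i] * y[j] * kernel(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return 0.5 * quad - sum(a)


def is_feasible(a: Sequence[float], y: Sequence[int], C: float,
                tol: float = 1e-9) -> bool:
    """Check y^T a = 0 and 0 <= a_i <= C, up to a small tolerance."""
    balanced = abs(sum(ai * yi for ai, yi in zip(a, y))) <= tol
    boxed = all(-tol <= ai <= C + tol for ai in a)
    return balanced and boxed


# Two 1-D points, one per class; a = (0.5, 0.5) is the optimum for this toy case.
xs = [(1.0,), (-1.0,)]
y = [1, -1]
a = [0.5, 0.5]
print(dual_objective(a, y, xs), is_feasible(a, y, C=1.0))
```

For the two-point toy problem, the multipliers (0.5, 0.5) give w = 1 and b = 0, so both points sit exactly on the margin.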
Speedup
• Computation and storage requirements increase rapidly as the number of training vectors (also called training samples or training points) increases
• Need efficient algorithms and implementations to apply SVMs to large-scale data mining
• => Parallel SVM
Parallel SVM Methods
• Message Passing Interface (MPI)
  – Efficient for computation-intensive problems
    • Ex. Simulation
• MapReduce
  – Can be used for data-intensive problems
• …
Other Speedup Techniques
• Chunking: optimize subsets of training data iteratively until the global optimum is reached
  – Ex. Sequential Minimal Optimization (SMO)
    • Uses a chunk size of 2 vectors
• Eliminate non-support vectors early
This Paper’s Approach
1. Partition & distribute data to nodes
2. Map class: Train each subSVM to find support vectors for subset of data
3. Reduce class: Combine support vectors of each 2 subSVMs
4. If more than 1 SVM, go to 2.
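The four steps can be sketched as a plain Python loop, without any MapReduce runtime. This is only an illustration of the control flow, not the paper's Twister implementation: `train_sub_svm` is a hypothetical stand-in that is exact only for 1-D separable data with the negative class to the left of the positive class (where the support vectors are the largest negative point and the smallest positive point). A real run would call an SVM trainer such as libSVM at that step.

```python
from typing import List, Tuple

Point = Tuple[float, int]  # (1-D feature value, label in {-1, +1})


def train_sub_svm(points: List[Point]) -> List[Point]:
    """Stand-in trainer (assumption): return the support vectors of a
    hard-margin SVM on 1-D separable data, i.e. the closest point per class."""
    neg = [p for p in points if p[1] == -1]
    pos = [p for p in points if p[1] == +1]
    support = []
    if neg:
        support.append(max(neg))  # negative point nearest the boundary
    if pos:
        support.append(min(pos))  # positive point nearest the boundary
    return support


def partition(data: List[Point], n_parts: int) -> List[List[Point]]:
    """Step 1: split the data into n_parts subsets (round-robin)."""
    return [data[i::n_parts] for i in range(n_parts)]


def cascade(data: List[Point], n_parts: int) -> List[Point]:
    subsets = partition(data, n_parts)
    while True:
        # Step 2 (map): train one subSVM per subset, keep only support vectors.
        sv_sets = [train_sub_svm(s) for s in subsets]
        # Step 4: stop once a single SVM remains.
        if len(sv_sets) == 1:
            return sv_sets[0]
        # Step 3 (reduce): combine the support vectors of each 2 subSVMs.
        subsets = [sv_sets[i] + (sv_sets[i + 1] if i + 1 < len(sv_sets) else [])
                   for i in range(0, len(sv_sets), 2)]


data = [(-3.0, -1), (-1.0, -1), (-2.0, -1), (2.0, 1),
        (4.0, 1), (1.5, 1), (-0.5, -1), (3.0, 1)]
print(cascade(data, n_parts=4))
```

With 4 initial partitions the loop runs three training rounds (4, then 2, then 1 subSVM), and on this toy data the final support vectors match those found by training on the whole set at once.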
Twister
• Supports iterative MapReduce
• More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce
Computation Complexity
[Equation not recoverable from the transcript: a big-O expression with a summation over i, a log term, and a term subscripted "trans".]
Evaluations
• Number of nodes
• Training time
• Accuracy = (# correctly predicted data / # total testing data) × 100 %
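The accuracy metric is a one-line computation. A minimal sketch follows; the function name `accuracy` and the example labels are illustrative only.

```python
from typing import Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Percentage of test points whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label sequences of equal length")
    correct = sum(1 for p, t in zip(predicted, actual) if p == t)
    return correct / len(actual) * 100.0


# 3 of 4 test points are predicted correctly.
print(accuracy([1, -1, 1, 1], [1, -1, -1, 1]))
```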
Adult Data Analysis
• Binary classification
• Correlation between attribute variable X and class variable Y used to select attributes

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
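A minimal sketch of correlation-based attribute selection, using the Pearson correlation between each attribute X and the class variable Y. The slides do not say how attributes are chosen from the correlation values, so keeping the `top_k` attributes with the largest absolute correlation is an assumption, as are the function names and the toy column names.

```python
import math
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """cov(X, Y) / (sigma_X * sigma_Y), with population statistics."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / n
    sigma_x = math.sqrt(sum((xi - mu_x) ** 2 for xi in x) / n)
    sigma_y = math.sqrt(sum((yi - mu_y) ** 2 for yi in y) / n)
    if sigma_x == 0 or sigma_y == 0:
        return 0.0  # assumption: treat a constant column as uncorrelated
    return cov / (sigma_x * sigma_y)


def select_attributes(columns: Dict[str, Sequence[float]],
                      y: Sequence[float], top_k: int) -> List[str]:
    """Rank attributes by |correlation with y| and keep the top_k names."""
    ranked = sorted(columns, key=lambda name: abs(pearson(columns[name], y)),
                    reverse=True)
    return ranked[:top_k]


columns = {  # hypothetical toy attributes
    "age": [1, 2, 3, 4],
    "noise": [1, -1, 1, -1],
    "const": [5, 5, 5, 5],
}
labels = [0, 0, 1, 1]
print(select_attributes(columns, labels, top_k=1))
```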
Adult Data Analysis
• Computation cost concentrates on training
• Data transfer time cost minor
• Last layer computation time depends on α and β instead of number of nodes (1 node only)
• Feature selection reduces computation greatly but does not reduce accuracy very much
Forest Cover Type Classification
• Multiclass classification
  – Use k(k - 1)/2 binary SVMs as k-class SVM
  – 1 binary SVM for each pair of classes
  – Use maximum voting to determine the class
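The pairwise scheme can be sketched as follows: one binary classifier per pair of classes, then a vote. The binary trainer here is a hypothetical stand-in (nearest of two 1-D class centers), used only so the example runs without an SVM library; the voting logic is the part being illustrated.

```python
from collections import Counter
from itertools import combinations
from typing import Callable, Dict, Hashable, Sequence, Tuple

Classifier = Callable[[float], Hashable]


def train_one_vs_one(classes: Sequence[Hashable],
                     train_binary: Callable[[Hashable, Hashable], Classifier]
                     ) -> Dict[Tuple[Hashable, Hashable], Classifier]:
    """Train 1 binary classifier for each pair of classes: k(k - 1)/2 in total."""
    return {(a, b): train_binary(a, b) for a, b in combinations(classes, 2)}


def predict(x: float, models: Dict[Tuple[Hashable, Hashable], Classifier]) -> Hashable:
    """Each pairwise classifier casts one vote; the class with most votes wins.
    Ties go to the class that was voted for first (an arbitrary choice)."""
    votes = Counter(model(x) for model in models.values())
    return votes.most_common(1)[0][0]


# Stand-in binary "SVM" (assumption): pick whichever class center is nearer.
centers = {"A": 0.0, "B": 5.0, "C": 10.0}


def train_binary(a: str, b: str) -> Classifier:
    return lambda x: a if abs(x - centers[a]) <= abs(x - centers[b]) else b


models = train_one_vs_one(list(centers), train_binary)
print(len(models), predict(4.2, models))
```

With k = 3 classes this trains 3 pairwise classifiers; for x = 4.2 class B collects two of the three votes.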
Forest Cover Type Classification
• Correlation between attribute variable X and class variable Y used to select attributes
• Attribute variables are normalized to [0, 1]
$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}}$$
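A minimal sketch of the [0, 1] normalization applied to one attribute column. Mapping a constant column to all zeros is an assumption made here to avoid division by zero; the slides do not cover that case.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """x_norm = (x - x_min) / (x_max - x_min), applied to every value."""
    x_min, x_max = min(values), max(values)
    span = x_max - x_min
    if span == 0:
        return [0.0 for _ in values]  # assumption: constant column maps to 0
    return [(x - x_min) / span for x in values]


print(min_max_normalize([2.0, 4.0, 6.0]))
```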
Forest Cover Type Classification
• Last layer computation time depends on α and β instead of number of nodes (1 node only)
• Feature selection reduces computation greatly but does not reduce accuracy very much
Heart Disease Classification
• Binary classification
• Data replicated a varying number of times to compare results for different sample sizes
Heart Disease Classification
• When the sample size is too large, the data cannot be processed on 1 node because of memory constraints
• Training time decreases little when number of nodes > 8
Conclusion
• Classical SVM impractical for large-scale data
• Need parallel SVM
• This paper proposes a model based on iterative MapReduce
• Shows the model is efficient for data-intensive problems
References
[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in PDPTA, Las Vegas, NV, 2012, pp.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in ISA, Busan, Korea, 2008, pp. 166-171.
Q & A