Spark Algorithms
Ashutosh Trivedi
Kaushik Ranjan
IIIT Bangalore
Spark-Meetup Bangalore
Outlier Detection and KNN Join
Ashutosh & Kaushik, Spark-Meetup Bangalore Jan-2015
Agenda
• Introduction to two core algorithms
– Outlier Detection on Categorical Data
– KNN-Join
• Application in graph algorithms
– Feedback Vertex Set of a Graph
– Geographical Information Systems
• Challenges we faced
• Best practices
Outliers
• It's good to be different, but not in data!
• An outlier signals that something is wrong: the point was generated by a different mechanism.
• How will my model generalize?
• Image ref: http://outskirtsbattledomewiki.com/index.php/13-general-obd-terms/96-outlier
Solutions
• Distance-based solutions
– Mahalanobis distance
• Covariance matrix solution
• Single-class SVM
• Density-based solutions
– Counting frequency
Categorical data?
MR-AVF
• Attribute Value Frequency (AVF) assigns a score to each point in the dataset using the frequency of each unique attribute value.
• Easily parallelizable.
• Shown to perform favourably compared to other competitive but more complex outlier detection strategies.
• Usages
– Anomaly detection
– Security
Outliers on Categorical Data
• Attribute Value Frequency
Col 1 Col 2
A B
A C
C B
D E (outlier)
Col 1 Col 2 Score
A B 4
A C 3
C B 3
D E 2 (low score)
AVF – Mapping
Key: <column number, attribute>
Map output, one record per cell:
(1,A) 1, (2,B) 1,
(1,A) 1, (2,C) 1,
(1,C) 1, (2,B) 1,
(1,D) 1, (2,E) 1,
Counts summed per key:
(1,A) 2, (2,B) 2,
(1,C) 1, (2,C) 1,
(1,D) 1, (2,E) 1,
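The mapping step can be sketched in plain Python, without a Spark cluster. This is a minimal illustration, not the authors' implementation: `map_cells` and `reduce_by_key` are hypothetical helper names, and `reduce_by_key` is only a local stand-in for what Spark's `reduceByKey` would do on an RDD.

```python
from collections import Counter

# Toy data set from the slides: one tuple per row, one entry per column.
rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

def map_cells(row):
    """Emit ((column number, attribute), 1) per cell; columns are 1-based as on the slide."""
    return [((col, value), 1) for col, value in enumerate(row, start=1)]

def reduce_by_key(pairs):
    """Sum the counts per key (hypothetical local stand-in for reduceByKey)."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

mapped = [pair for row in rows for pair in map_cells(row)]
freq = reduce_by_key(mapped)
print(freq)
```

Keying on the (column number, attribute) pair rather than on the attribute alone keeps an "A" in column 1 separate from an "A" in any other column.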
AVF – Frequency Calculations
Input RDD
Col 1 Col 2
A B
A C
C B
D E
freq RDD
(1,A) 2, (2,B) 2,
(1,C) 1, (2,C) 1,
(1,D) 1, (2,E) 1,
Information of line numbers: each row needs a unique identifier.
Centralized to Distributed
Imperative to functional
AVF – Line Calculations
Input RDD
Col 1 Col 2
A B
A C
C B
D E
Each value carries its column index as well as its row index (zipWithIndex).
data RDD
(1,A) X1, (2,B) X1,
(1,A) X2, (2,C) X2,
(1,C) X3, (2,B) X3,
(1,D) X4, (2,E) X4,
AVF – Join
Input RDD
Col 1 Col 2
A B
A C
C B
D E
data RDD
(1,A) X1, (2,B) X1,
(1,A) X2, (2,C) X2,
(1,C) X3, (2,B) X3,
(1,D) X4, (2,E) X4,
freq RDD
(1,A) 2, (2,B) 2,
(1,C) 1, (2,C) 1,
(1,D) 1, (2,E) 1,
AVF – Join
Joined on key, giving (frequency, row) pairs:
(1,A) (2, X1), (2, X2), (2,B) (2, X1), (2, X3),
(1,C) (1, X3), (2,C) (1, X2),
(1,D) (1, X4), (2,E) (1, X4),
Re-keyed as (row, frequency):
(X1, 2), (X2, 2), (X1, 2), (X3, 2), (X3, 1), (X4, 1), (X4, 1), (X2, 1),
Summed per row:
Row Score
X1 4
X2 3
X3 3
X4 2
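The join and scoring steps can be sketched end to end in plain Python. This is an illustration under assumptions, not the authors' Spark code: a dictionary lookup stands in for the RDD join, the names `data`, `freq`, `joined` and `scores` are chosen here for readability, and the row ids X1..X4 mimic what zipWithIndex would supply.

```python
from collections import Counter, defaultdict

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

# data: ((column, value), row id) records, as in the "data RDD" on the slides.
data = [((col, value), f"X{i}")
        for i, row in enumerate(rows, start=1)
        for col, value in enumerate(row, start=1)]

# freq: (column, value) -> number of rows holding that value in that column.
freq = Counter(key for key, _ in data)

# Join on the (column, value) key, then re-key every frequency by row id.
joined = [(row_id, freq[key]) for key, row_id in data]

# Sum the frequencies per row id to obtain the AVF score.
scores = defaultdict(int)
for row_id, frequency in joined:
    scores[row_id] += frequency

# Lowest score first: the most likely outlier leads the ranking.
ranking = sorted(scores.items(), key=lambda item: item[1])
print(ranking)
```

On the toy data this gives X4 the lowest score, which matches the row marked as the outlier on the earlier slide.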
Performance on Spark
Performance on different data-points
438MB Memory, Intel core i3 machine
Performance on Spark
Performance on 43358 data points with different partitionings of the file
438MB Memory, Intel core i3 machine
Best Practices
• Minimize the use of variables; everything should be immutable.
• Prefer more transformations and fewer actions.
• Minimize broadcasts.
• Do not update variables inside a filter.
KNN-Join
• Finds the K nearest neighbors from a data set for a given data point.
• Approximate KNN-Join produces results with on the order of log(n) page accesses.
• The idea is to use z-values to map points in a multidimensional space to a single dimension.
• This translates the KNN search for the query point into a search on that one-dimensional space.
• Usages
– Similarity search in huge datasets
– Smoothing of images
Z-Order a.k.a. Morton Code a.k.a. Space-Filling Curve
Z-Order
Z-order curve iterations extended to three dimensions
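A z-value is obtained by interleaving the bits of a point's coordinates. The sketch below is one possible way to do it for non-negative integers; the function name `z_value`, the `bits` parameter and the bit order (most significant bit first, coordinates taken in the order given) are assumptions made here. Under that order it reproduces the z-values used in the KNN-Join calculation slides, for example 1276 for the point 3, 14, 6.

```python
def z_value(point, bits=8):
    """Interleave the bits of non-negative integer coordinates into one Morton code.

    Assumed bit order: most significant bit first and, within each bit level,
    the coordinates in the order they are given.
    """
    if any(c < 0 or c >= (1 << bits) for c in point):
        raise ValueError("every coordinate must fit in the given number of bits")
    z = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            z = (z << 1) | ((coord >> bit) & 1)
    return z

print(z_value((3, 14, 6)))   # 1276
print(z_value((3, 12, 7)))   # 1261
```

Points that are close along the resulting one-dimensional order are often, though not always, close in the original space, which is why the search built on it is approximate.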
KNN-Join
Data Set
3 14 6
2 13 7
4 12 7
4 14 6
Data Point
3 12 7
Iterations: 2, Neighbors: 1
KNN-Join Calculations
Data set with row index
3, 14, 6 0
2, 13, 7 1
4, 12, 7 2
4, 14, 6 3
Row index and z-value, sorted by z-value
1 1259
0 1276
2 1481
3 1496
Z-value of the data point
1261
First Iteration Result
3, 14, 6 0
2, 13, 7 1
Second Iteration
Data Set
3, 14, 6 0
2, 13, 7 1
4, 12, 7 2
4, 14, 6 3
Data Point
3 12 7
Random Vector
17 22 34
Shifted data set
20, 36, 40 0
19, 35, 41 1
21, 34, 41 2
21, 36, 40 3
New data point
20 34 41
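Second Iteration – Result
Row index and z-value, sorted by z-value
1 115255
2 115477
0 115584
3 115588
Z-value of the new data point
115473
Second iteration result
2, 13, 7 1
4, 12, 7 2
First iteration result
3, 14, 6 0
2, 13, 7 1
Union only appends; it does not remove duplicates.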
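Both iterations follow the same pattern: compute z-values, sort, locate the query, and take its neighbours in the sorted order. The sketch below is a plain-Python approximation of that pattern, not the authors' Spark code. The names `z_value` and `candidates`, the bit order of the interleaving, and the choice of taking the k predecessors and k successors of the query are all assumptions; on the slide data they select the same rows as the two iteration results.

```python
from bisect import bisect_left

def z_value(point, bits=8):
    """Morton code, most significant bit first (assumed bit order)."""
    z = 0
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            z = (z << 1) | ((coord >> bit) & 1)
    return z

def candidates(dataset, query, k, shift=(0, 0, 0)):
    """Row indices of the k predecessors and k successors of the query in z-order."""
    def move(point):
        return tuple(c + s for c, s in zip(point, shift))

    ranked = sorted((z_value(move(p)), i) for i, p in enumerate(dataset))
    pos = bisect_left(ranked, (z_value(move(query)), -1))
    return [i for _, i in ranked[max(0, pos - k):pos + k]]

dataset = [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6)]
query = (3, 12, 7)

first = candidates(dataset, query, k=1)                         # no shift
second = candidates(dataset, query, k=1, shift=(17, 22, 34))    # random vector from the slide
union = first + second   # append only, so row 1 appears twice
print(first, second, union)
```

Shifting every point and the query by the same random vector changes where the curve's discontinuities fall, so a neighbour missed in one iteration (row 2 here) can surface in another.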
Z-KNN
Data Point
3 12 7
Union of both iterations, keyed by row index
1 2, 13, 7
2 4, 12, 7
0 3, 14, 6
1 2, 13, 7
Grouped by row index
1 [ (2, 13, 7), (2, 13, 7) ]
2 [ (4, 12, 7) ]
0 [ (3, 14, 6) ]
Data Set
2, 13, 7
4, 12, 7
3, 14, 6
Z-KNN Results
Data Set
2, 13, 7
4, 12, 7
3, 14, 6
Data Point
3 12 7
1 Nearest Neighbor
4 12 7
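The final step removes the duplicates left by the union and ranks the remaining candidates by their true distance to the data point. The sketch below illustrates this in plain Python. The slides do not name the distance metric, so Euclidean distance is an assumption here, and `k_nearest` is a hypothetical helper name.

```python
import math

query = (3, 12, 7)

# (row index, point) candidates after the union of both iterations, duplicates included.
union = [(1, (2, 13, 7)), (2, (4, 12, 7)), (0, (3, 14, 6)), (1, (2, 13, 7))]

# Keying by row index collapses the duplicate that the union kept.
unique = dict(union)

def k_nearest(candidates, query, k):
    """Rank candidates by exact Euclidean distance (assumed metric) and keep the k closest."""
    ranked = sorted(candidates.items(), key=lambda item: math.dist(item[1], query))
    return ranked[:k]

print(k_nearest(unique, query, k=1))   # [(2, (4, 12, 7))]
```

Only the small candidate set is compared exactly, so the expensive distance computation never touches the full data set.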
Performance on Spark
Performance on different Ks
438MB Memory, Intel core i3 machine
Performance on Spark
Performance on different data-points with k = 30 and 30 iterations
438MB Memory, Intel core i3 machine
Best Practices
• More code review at Codacy (www.codacy.com)
– Integrated with GitHub
Application on GraphX
• Feedback Vertex Set of a Graph
• Geographical Information Systems
Future Works
• Social Content Matching (max flow Algorithm) (alpha)
• KNN for float types (requires calculation of Morton order for floats)
• Matrix multiplication by the Strassen algorithm, using Morton order as locality search.
• Similarity between two documents, implementation of all sequence kernels.
• More outlier detection algorithms
Connect with us
LinkedIn
Ashutosh: https://www.linkedin.com/in/ashutoshtrivedi
Kaushik: https://www.linkedin.com/in/ranjankaushik
Fork our repository at https://github.com/anantasty/SparkAlgorithms
Follow us at
• https://github.com/codeAshu
• https://github.com/kaushikranjan
References
• A. Koufakou, J. Secretan, J. Reeder, K. Cardona, and M. Georgiopoulos. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks (IJCNN), pp. 3298-3304, 2008. DOI: 10.1109/IJCNN.2008.4634266
• C. Zhang, F. Li, and J. Jestes. "Efficient parallel kNN joins for large data in MapReduce." Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012. DOI: 10.1145/2247596.2247602