Spark Algorithms

Ashutosh Trivedi

Kaushik Ranjan

IIIT Bangalore

Spark-Meetup Bangalore

Outlier Detection and KNN Join

http://in.linkedin.com/in/ashutoshtrivedi

http://in.linkedin.com/pub/kaushik-ranjan/19/286/920




Ashutosh & Kaushik, Spark-Meetup Bangalore Jan-2015


Agenda

• Introduction to two core algorithms
  – Outlier Detection on Categorical Data
  – KNN-Join

• Application in graph algorithms
  – Feedback Vertex Set of a Graph
  – Geographical Information Systems

• Challenges we faced

• Best practices


Outliers

• It's good to be different, but not in data!

• An outlier signals that something is wrong: the point was generated by a different mechanism.

• How will my model generalize?

• Image ref: http://outskirtsbattledomewiki.com/index.php/13-general-obd-terms/96-outlier


Solutions

• Distance based solutions
  – Mahalanobis Distance

• Covariance matrix solution

• Single-class SVM

• Density based solutions
  – Counting frequency

Categorical data?


MR-AVF

• Attribute Value Frequency (AVF) assigns a score to each point in the dataset, based on how frequent each of its attribute values is.

• Easily parallelizable.

• Shown to perform favourably compared to other competitive but more complex outlier detection strategies.

• Usages
  – Anomaly Detection
  – Security



Algorithm


Outliers on Categorical Data

• Attribute Value Frequency

Col 1  Col 2
A      B
A      C
C      B
D      E      (outlier)

Col 1  Col 2  Score
A      B      4
A      C      3
C      B      3
D      E      2      (low score)
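A minimal sketch of the scoring above in plain Python, without Spark. It assumes the score of a row is the sum of the per-column frequencies of its values, which is what the numbers in the table imply; the function name `avf_scores` is ours, not part of the authors' repository.

```python
from collections import Counter

# Sample rows from the slide: two categorical columns.
rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

def avf_scores(rows):
    """Return one AVF score per row: the sum of the frequencies of its values."""
    # Count values per column, so "C" in column 1 and "C" in column 2 stay distinct.
    columns = list(zip(*rows))
    freq = [Counter(column) for column in columns]
    return [sum(freq[j][value] for j, value in enumerate(row)) for row in rows]

scores = avf_scores(rows)
outlier = rows[scores.index(min(scores))]
print(scores)   # [4, 3, 3, 2]
print(outlier)  # ('D', 'E')
```

The row with the lowest score is the outlier candidate, because its values are the rarest in their columns.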


AVF - Mapping

Key: <column number, attribute value>

Each cell is mapped to a (key, 1) pair:

(1,A) 1, (2,B) 1
(1,A) 1, (2,C) 1
(1,C) 1, (2,B) 1
(1,D) 1, (2,E) 1

Summing the counts per key gives:

(1,A) 2, (2,B) 2
(1,C) 1, (2,C) 1
(1,D) 1, (2,E) 1
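The mapping step can be sketched locally as a map followed by a reduce-by-key. This is plain Python standing in for the Spark operations; `map_cells` and `reduce_by_key` are illustrative names, and the real job would run the same logic on an RDD.

```python
from collections import defaultdict

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

def map_cells(rows):
    """Emit ((column_number, value), 1) for every cell; columns are numbered from 1."""
    return [((col, value), 1) for row in rows for col, value in enumerate(row, start=1)]

def reduce_by_key(pairs):
    """Sum the values of all pairs sharing a key (local stand-in for reduceByKey)."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

freq = reduce_by_key(map_cells(rows))
print(freq[(1, "A")], freq[(2, "B")], freq[(2, "E")])  # 2 2 1
```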


AVF – Frequency Calculations

Input RDD

Col 1  Col 2
A      B
A      C
C      B
D      E

freq RDD

(1,A) 2, (2,B) 2
(1,C) 1, (2,C) 1
(1,D) 1, (2,E) 1

Line-number information: a unique identifier for each row


Centralized to Distributed

Imperative to functional

AVF – Line Calculations

Col 1  Col 2
A      B
A      C
C      B
D      E

Each cell is keyed by its column index and tagged with its row index (zipWithIndex).

data RDD

(1,A) X1, (2,B) X1
(1,A) X2, (2,C) X2
(1,C) X3, (2,B) X3
(1,D) X4, (2,E) X4
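A local sketch of how the data RDD could be built. `zip_with_index` mimics the shape of Spark's `zipWithIndex` output, (item, index) with indices from 0; the X1..X4 labels are only the slide's naming for those row indices, and `key_cells_by_column` is a hypothetical helper.

```python
rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

def zip_with_index(items):
    """Pair every item with its position: (item, index), indices starting at 0."""
    return [(item, index) for index, item in enumerate(items)]

def key_cells_by_column(indexed_rows):
    """Emit ((column_number, value), row_id) for every cell of every row."""
    return [
        ((col, value), f"X{index + 1}")  # X1..X4 follow the slide's row labels
        for row, index in indexed_rows
        for col, value in enumerate(row, start=1)
    ]

data = key_cells_by_column(zip_with_index(rows))
print(data[:2])  # [((1, 'A'), 'X1'), ((2, 'B'), 'X1')]
```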



AVF – Join

Input RDD

Col 1  Col 2
A      B
A      C
C      B
D      E

data RDD

(1,A) X1, (2,B) X1
(1,A) X2, (2,C) X2
(1,C) X3, (2,B) X3
(1,D) X4, (2,E) X4

freq RDD

(1,A) 2, (2,B) 2
(1,C) 1, (2,C) 1
(1,D) 1, (2,E) 1

AVF - Join

Joining freq RDD with data RDD on the key gives (frequency, row) pairs:

(1,A) (2, X1), (2, X2)
(2,B) (2, X1), (2, X3)
(1,C) (1, X3)
(2,C) (1, X2)
(1,D) (1, X4)
(2,E) (1, X4)

Dropping the key and swapping to (row, frequency):

(X1, 2), (X2, 2), (X1, 2), (X3, 2), (X3, 1), (X4, 1), (X4, 1), (X2, 1)

Summing per row gives the score:

Row  Score
X1   4
X2   3
X3   3
X4   2
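The join and the final sum can be sketched locally as a dictionary lookup followed by a per-row total. The input literals are copied from the slides; `join_and_score` is an illustrative name, and in Spark the same result would come from a join followed by a reduce by key.

```python
from collections import defaultdict

# freq: (column_number, value) -> frequency; data: ((column_number, value), row_id).
freq = {(1, "A"): 2, (2, "B"): 2, (1, "C"): 1, (2, "C"): 1, (1, "D"): 1, (2, "E"): 1}
data = [
    ((1, "A"), "X1"), ((2, "B"), "X1"),
    ((1, "A"), "X2"), ((2, "C"), "X2"),
    ((1, "C"), "X3"), ((2, "B"), "X3"),
    ((1, "D"), "X4"), ((2, "E"), "X4"),
]

def join_and_score(freq, data):
    """Look up each cell's frequency by key, then sum the frequencies per row id."""
    scores = defaultdict(int)
    for key, row_id in data:
        scores[row_id] += freq[key]  # inner join on the key, re-keyed by row id
    return dict(scores)

scores = join_and_score(freq, data)
print(scores)                       # {'X1': 4, 'X2': 3, 'X3': 3, 'X4': 2}
print(min(scores, key=scores.get))  # X4
```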



Performance

Performance on Spark

Performance on different data-points

438 MB memory, Intel Core i3 machine

Performance on 43,358 data-points with different partitions of the file

Best Practices

• Minimize the use of variables; everything should be immutable.

• Prefer more transformations and fewer actions.

• Minimize broadcasts.

• Do not update a variable inside a filter.
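A small plain-Python illustration of the last point. In Spark, a function passed to `filter` is shipped to the executors together with copies of the variables it captures, so a counter updated inside it would not be updated on the driver. The sketch keeps the predicate pure and computes the count as a separate step; the scores and the `threshold` value are hypothetical.

```python
from functools import reduce

scores = {"X1": 4, "X2": 3, "X3": 3, "X4": 2}
threshold = 3  # hypothetical cut-off, for illustration only

def is_outlier(item):
    """Pure predicate: depends only on its argument and a constant."""
    _, score = item
    return score < threshold

# Chain transformations and finish with one reduction, instead of
# incrementing a counter from inside the filter function.
outliers = tuple(filter(is_outlier, scores.items()))
outlier_count = reduce(lambda total, _: total + 1, outliers, 0)
print(outliers, outlier_count)  # (('X4', 2),) 1
```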


KNN-Join

• Finds the K nearest neighbors in a data set for a given data point.

• Approximate KNN-Join produces results with O(log n) page accesses.

• The approach uses Z-values to map points in a multidimensional space to a single dimension.

• It translates the KNN search for the query point into a search on that one-dimensional space.

• Usages
  – Similarity search in huge datasets
  – Smoothing of images


Z-Order a.k.a. Morton Code a.k.a. Space-Filling Curve

Z-Order

Z-order curve iterations extended to three dimensions
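A minimal sketch of a Morton code for non-negative integer coordinates, built by interleaving bits. The bit ordering (first coordinate most significant within each group) is an assumption on our part, chosen because it reproduces the Z-values in the KNN-Join example later in this deck; the sample points are taken from that example.

```python
def z_value(point, bits=None):
    """Interleave the bits of non-negative integer coordinates into one integer.

    Within each group of bits the first coordinate is the most significant.
    """
    if bits is None:
        bits = max(max(point).bit_length(), 1)
    code = 0
    for bit in range(bits - 1, -1, -1):  # from most to least significant bit
        for coordinate in point:
            code = (code << 1) | ((coordinate >> bit) & 1)
    return code

print(z_value((3, 14, 6)))  # 1276
print(z_value((3, 12, 7)))  # 1261
```

Points that are close in space tend to get close Z-values, which is what lets a one-dimensional sort stand in for a spatial search.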


KNN-Join

Data Set

3 14 6
2 13 7
4 12 7
4 14 6

Data Point

3 12 7

Iterations: 2, Neighbors: 1


KNN-Join Calculations

Data set with row index:

Point        Index
3, 14, 6     0
2, 13, 7     1
4, 12, 7     2
4, 14, 6     3

Sorted by Z-value:

Index  Z-value
1      1259
0      1276
2      1481
3      1496

Z-value of the data point: 1261
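A sketch of the candidate search on the sorted Z-values. It assumes the candidates are the K points on either side of the query in Z-order, which is consistent with the first-iteration result shown on the next slide; `z_candidates` is an illustrative name, and the numbers are copied from the slide.

```python
from bisect import bisect_left

# (z_value, index) pairs sorted by Z-value, and the query's Z-value.
sorted_by_z = [(1259, 1), (1276, 0), (1481, 2), (1496, 3)]
query_z = 1261

def z_candidates(sorted_by_z, query_z, k):
    """Return the indices of up to k predecessors and k successors of query_z."""
    z_values = [z for z, _ in sorted_by_z]
    position = bisect_left(z_values, query_z)  # where the query would be inserted
    window = sorted_by_z[max(position - k, 0):position + k]
    return [index for _, index in window]

print(z_candidates(sorted_by_z, query_z, k=1))  # [1, 0]
```

The binary search is what keeps the lookup logarithmic in the number of points.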


First Iteration Result

Point        Index
3, 14, 6     0
2, 13, 7     1


Second Iteration

Data set with row index:

Point        Index
3, 14, 6     0
2, 13, 7     1
4, 12, 7     2
4, 14, 6     3

Data Point

3 12 7

Random Vector

17 22 34

Data set shifted by the random vector:

Point         Index
20, 36, 40    0
19, 35, 41    1
21, 34, 41    2
21, 36, 40    3

New Data Point

20 34 41
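A sketch of the shift step, using the vector from the slide in place of a freshly drawn random one. It also checks the property that makes the trick safe: translating every point and the query by the same vector leaves their distances unchanged, so only the Z-order, and therefore the candidate set, differs between iterations.

```python
data_set = [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6)]
query = (3, 12, 7)
random_vector = (17, 22, 34)  # the shift used on the slide; normally drawn at random

def shift(point, vector):
    """Translate a point by adding the vector coordinate-wise."""
    return tuple(p + v for p, v in zip(point, vector))

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

shifted_set = [shift(p, random_vector) for p in data_set]
shifted_query = shift(query, random_vector)
print(shifted_set)    # [(20, 36, 40), (19, 35, 41), (21, 34, 41), (21, 36, 40)]
print(shifted_query)  # (20, 34, 41)

# Translation preserves every distance to the query.
assert all(
    squared_distance(p, query) == squared_distance(s, shifted_query)
    for p, s in zip(data_set, shifted_set)
)
```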


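Second Iteration - Result

Sorted by Z-value:

Index  Z-value
1      115255
2      115477
0      115584
3      115588

Z-value of the new data point: 115473

Second iteration result:

Point        Index
2, 13, 7     1
4, 12, 7     2

First iteration result:

Point        Index
3, 14, 6     0
2, 13, 7     1

Union only appends; it does not remove duplicates.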

Ashutosh & Kaushik, Spark-Meetup Bangalore Dec-2014

Z-KNN

Data Point

3 12 7

Candidates after the union (index, point):

1   2, 13, 7
2   4, 12, 7
0   3, 14, 6
1   2, 13, 7

Grouped by index:

1   [ (2, 13, 7), (2, 13, 7) ]
2   [ (4, 12, 7) ]
0   [ (3, 14, 6) ]

Data Set

2, 13, 7
4, 12, 7
3, 14, 6


Z-KNN Results

Data Set

2, 13, 7
4, 12, 7
3, 14, 6

Data Point

3 12 7

1 Nearest Neighbor

4 12 7
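The final step can be sketched as: merge the candidates from both iterations, drop duplicates, and rank what remains by exact distance to the query. The candidate indices are copied from the slides; the function name and the use of squared Euclidean distance are our assumptions.

```python
data_set = {0: (3, 14, 6), 1: (2, 13, 7), 2: (4, 12, 7), 3: (4, 14, 6)}
query = (3, 12, 7)

# Candidate indices found in the two iterations.
first_iteration = [0, 1]
second_iteration = [1, 2]

def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_from_candidates(candidates, data_set, query, k):
    """Drop duplicate candidates, rank the rest by exact distance, keep the k closest."""
    unique = set(candidates)  # a plain union keeps duplicates, so remove them here
    ranked = sorted(unique, key=lambda index: squared_distance(data_set[index], query))
    return [data_set[index] for index in ranked[:k]]

candidates = first_iteration + second_iteration  # [0, 1, 1, 2]
print(nearest_from_candidates(candidates, data_set, query, k=1))  # [(4, 12, 7)]
```

Point (4, 12, 7) only entered the candidate set in the second iteration, which shows why more than one shifted pass is used.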


Performance on different Ks

Performance on different data-points with k = 30 and 30 iterations

Best Practices

• Do more code review with Codacy (http://www.codacy.com/)
  – Integrated with GitHub

Application on GraphX

• Feedback Vertex Set of a Graph

• Geographical Information Systems


Future Works

• Social Content Matching (max-flow algorithm) (alpha)

• KNN for float types (requires calculation of Morton order for floats)

• Matrix multiplication by the Strassen algorithm, using Morton order as locality search.

• Similarity between two documents, implementation of all sequence kernels.

• More outlier detection algorithms


Connect with us

@[email protected]

[email protected]

LinkedIn
  – Ashutosh: https://www.linkedin.com/in/ashutoshtrivedi
  – Kaushik: https://www.linkedin.com/in/ranjankaushik

Fork our repository at https://github.com/anantasty/SparkAlgorithms

References

• Follow us at
  – https://github.com/codeAshu
  – https://github.com/kaushikranjan

• A. Koufakou, J. Secretan, J. Reeder, K. Cardona, and M. Georgiopoulos. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks (IJCNN), pp. 3298-3304, 2008. DOI: 10.1109/IJCNN.2008.4634266 (http://dx.doi.org/10.1109/IJCNN.2008.4634266)

• C. Zhang, F. Li, and J. Jestes. "Efficient parallel kNN joins for large data in MapReduce." Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012. DOI: 10.1145/2247596.2247602 (http://dx.doi.org/10.1145/2247596.2247602)
