Spark algorithms

Spark Algorithms: Outlier Detection and KNN Join

Ashutosh Trivedi, Kaushik Ranjan (IIIT Bangalore)

Spark-Meetup Bangalore

Ashutosh & Kaushik, Spark-Meetup Bangalore Jan-2015

Agenda

• Introduction to two core algorithms
  – Outlier Detection on Categorical Data
  – KNN-Join

• Application in graph algorithms
  – Feedback Vertex Set of a Graph
  – Geographical Information Systems

• Challenges we faced

• Best practices

Outliers

• It's good to be different, but not in data!

• An outlier signals that something is wrong: the point was generated by a different mechanism.

• How will my model generalize?

• Image ref: http://outskirtsbattledomewiki.com/index.php/13-general-obd-terms/96-outlier

Solutions

• Distance-based solutions
  – Mahalanobis distance

• Covariance matrix solution

• Single-class SVM

• Density-based solutions
  – Counting frequency

Categorical Data?

MR-AVF

• Attribute Value Frequency (AVF) assigns a score to each point in the dataset using the frequency of each unique attribute value.

• Easily parallelizable.

• Shown to perform favourably compared to other competitive but more complex outlier detection strategies.

• Usages
  – Anomaly Detection
  – Security

Algorithm

Outliers on Categorical Data

• Attribute Value Frequency

Col 1  Col 2
A      B
A      C
C      B
D      E      (outlier)

Col 1  Col 2  Score
A      B      4
A      C      3
C      B      3
D      E      2      (low score)
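
The scoring in the table above can be reproduced with a small single-machine sketch. It assumes, following the slide's numbers, that a row's score is the plain sum of the counts of its (column, value) pairs; the function name `avf_scores` is made up for this illustration and is not taken from the authors' repository.

```python
from collections import Counter

def avf_scores(rows):
    """Return one AVF score per row: the sum of the frequencies of its attribute values."""
    # Count how often each (column number, value) pair occurs in the whole dataset.
    freq = Counter(
        (col, value) for row in rows for col, value in enumerate(row, start=1)
    )
    # A row's score is the sum of the counts of its own (column, value) pairs.
    return [
        sum(freq[(col, value)] for col, value in enumerate(row, start=1))
        for row in rows
    ]

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]
scores = avf_scores(rows)

# The row with the lowest score is the strongest outlier candidate.
outlier = rows[scores.index(min(scores))]
print(scores, outlier)
```

On the four example rows this yields the scores 4, 3, 3 and 2, so the row (D, E) is flagged.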

AVF – Mapping

Map output, keyed by <column number, attribute value>:

(1,A) 1, (2,B) 1
(1,A) 1, (2,C) 1
(1,C) 1, (2,B) 1
(1,D) 1, (2,E) 1

After reducing by key:

(1,A) 2, (2,B) 2
(1,C) 1, (2,C) 1
(1,D) 1, (2,E) 1
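
As a rough illustration of the map and reduce steps, here is a plain-Python sketch without Spark. On an RDD the two steps would typically be written with `flatMap` and `reduceByKey`; the helper name `to_pairs` is an assumption made for this example.

```python
from collections import Counter
from itertools import chain

rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

def to_pairs(row):
    # Map step: emit ((column number, attribute value), 1) for every cell of the row.
    return [((col, value), 1) for col, value in enumerate(row, start=1)]

# Flatten the per-row lists into one stream of key/value pairs.
mapped = list(chain.from_iterable(to_pairs(row) for row in rows))

# Reduce step: add up the 1s that share the same key.
freq = Counter()
for key, one in mapped:
    freq[key] += one

print(dict(freq))
```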

AVF – Frequency Calculations

Input RDD:

Col 1  Col 2
A      B
A      C
C      B
D      E

freq RDD:

(1,A) 2, (2,B) 2
(1,C) 1, (2,C) 1
(1,D) 1, (2,E) 1

Information about line numbers: a unique identifier

Centralized to Distributed

Imperative to functional

AVF – Line Calculations

Input:

Col 1  Col 2
A      B
A      C
C      B
D      E

zipWithIndex: column index as well as row index

data RDD:

(1,A) X1, (2,B) X1
(1,A) X2, (2,C) X2
(1,C) X3, (2,B) X3
(1,D) X4, (2,E) X4
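
A minimal sketch of how each cell can be tagged with its row identifier, using `enumerate` as a stand-in for Spark's `zipWithIndex`. Note that `zipWithIndex` yields zero-based `(element, index)` pairs; the `X1` to `X4` labels below are one-based only to match the slide.

```python
rows = [("A", "B"), ("A", "C"), ("C", "B"), ("D", "E")]

# enumerate pairs every row with its position, and every cell with its column number,
# so each record carries both the join key and the row it came from.
data = [
    ((col, value), f"X{row_id}")
    for row_id, row in enumerate(rows, start=1)
    for col, value in enumerate(row, start=1)
]

for record in data:
    print(record)
```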

AVF – Join

Input RDD:

Col 1  Col 2
A      B
A      C
C      B
D      E

data RDD:

(1,A) X1, (2,B) X1
(1,A) X2, (2,C) X2
(1,C) X3, (2,B) X3
(1,D) X4, (2,E) X4

freq RDD:

(1,A) 2, (2,B) 2
(1,C) 1, (2,C) 1
(1,D) 1, (2,E) 1

Joining freq RDD with data RDD on the key gives (frequency, row ID) pairs:

(1,A) (2, X1), (2, X2)
(2,B) (2, X1), (2, X3)
(1,C) (1, X3)
(2,C) (1, X2)
(1,D) (1, X4)
(2,E) (1, X4)

Re-keyed by row ID:

(X1, 2), (X2, 2), (X1, 2), (X3, 2), (X3, 1), (X4, 1), (X4, 1), (X2, 1)

Summed per row ID:

Row  Score
X1   4
X2   3
X3   3
X4   2
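
The join and the final summation can be sketched in plain Python as follows. The dictionary lookup stands in for an RDD `join` on the (column, value) key and the loop for a `reduceByKey` on the row ID; this is an illustration under those assumptions, not the authors' code.

```python
from collections import defaultdict

freq = {(1, "A"): 2, (2, "B"): 2, (1, "C"): 1, (2, "C"): 1, (1, "D"): 1, (2, "E"): 1}
data = [
    ((1, "A"), "X1"), ((2, "B"), "X1"),
    ((1, "A"), "X2"), ((2, "C"), "X2"),
    ((1, "C"), "X3"), ((2, "B"), "X3"),
    ((1, "D"), "X4"), ((2, "E"), "X4"),
]

# Join on the (column, value) key, then re-key each match by row ID.
by_row = [(row_id, freq[key]) for key, row_id in data]

# Sum the frequencies per row ID to obtain the AVF score.
scores = defaultdict(int)
for row_id, count in by_row:
    scores[row_id] += count

# Lowest score first: the head of this list is the strongest outlier candidate.
ranked = sorted(scores.items(), key=lambda item: item[1])
print(ranked)
```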

Performance

Performance on Spark

Performance on different data-points

438 MB memory, Intel Core i3 machine

Performance on Spark

Performance on 43,358 data points with different partitionings of the file

438 MB memory, Intel Core i3 machine

Best Practices

• Minimize the use of variables; everything should be immutable.

• Prefer more transformations and fewer actions.

• Minimize broadcast.

• Do not update variables inside a filter.
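
The last point can be illustrated with a small plain-Python sketch. In Spark, closures are shipped to executors, so a variable updated inside a `filter` predicate is changed on the executor's copy and the update is not reliably visible on the driver. The sample records and the name `is_non_empty` are invented for this example.

```python
records = ["A,B", "", "A,C", "C,B", "", "D,E"]

# Avoid: a predicate that also mutates outer state, for example
#
#     dropped = 0
#     def keep(line):
#         global dropped
#         if not line:
#             dropped += 1
#         return bool(line)
#
# Prefer: keep the predicate pure and derive the count from the results.
def is_non_empty(line):
    return bool(line)

kept = list(filter(is_non_empty, records))
dropped = len(records) - len(kept)
print(kept, dropped)
```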

KNN-Join

• Finds the K nearest neighbors from a data set for a given data point.

• Approximate KNN-Join produces results with on the order of log(n) page accesses.

• The idea uses Z-values to map points in a multi-dimensional space to a single dimension.

• It translates the KNN search for the query point into a search on that single-dimensional space.

• Usages
  – Similarity search in huge datasets
  – Smoothening of images

Z-Order, a.k.a. Morton Code: a Space-Filling Curve

Z-Order

Z-order curve iterations extended to three dimensions
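
A Z-value is obtained by interleaving the bits of the coordinates. The sketch below is one possible implementation for non-negative integers. The bit order (first coordinate most significant within each group of bits) is an assumption, chosen because it reproduces the Z-values used in the KNN-Join example of this deck, such as 1276 for the point (3, 14, 6). The name `morton_code` is made up for this illustration.

```python
def morton_code(point, bits=None):
    """Interleave the bits of non-negative integer coordinates into one Z-value."""
    if bits is None:
        # Use just enough bits to cover the largest coordinate.
        bits = max(max(point).bit_length(), 1)
    code = 0
    # Walk from the most significant bit down to the least significant one,
    # taking one bit from each coordinate in turn.
    for bit in range(bits - 1, -1, -1):
        for coord in point:
            code = (code << 1) | ((coord >> bit) & 1)
    return code

for p in [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6), (3, 12, 7)]:
    print(p, morton_code(p))
```

Leading zero bits do not change the result, so passing a larger `bits` value gives the same Z-value.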

KNN-Join

Data Set:

3 14 6
2 13 7
4 12 7
4 14 6

Data Point:

3 12 7

Iterations: 2, Neighbors: 1

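KNN-Join Calculations

Data set with row index:

(3, 14, 6)  0
(2, 13, 7)  1
(4, 12, 7)  2
(4, 14, 6)  3

Row index and Z-value, sorted by Z-value:

1  1259
0  1276
2  1481
3  1496

Z-value of the data point (3, 12, 7): 1261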
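
One way to pick candidates from the sorted Z-values is to take up to k rows on either side of the position where the query's Z-value would fall. The sketch below assumes that reading of the slide (with one neighbor requested, two rows are kept); `z_candidates` is a hypothetical helper name.

```python
from bisect import bisect_left

def z_candidates(z_values, query_z, k):
    """Pick up to k rows on each side of query_z along the Z-order."""
    # z_values holds (row index, Z-value) pairs.
    ordered = sorted(z_values, key=lambda pair: pair[1])
    zs = [z for _, z in ordered]
    # Position at which the query would be inserted into the sorted Z-values.
    pos = bisect_left(zs, query_z)
    window = ordered[max(pos - k, 0):pos + k]
    return [row for row, _ in window]

z_values = [(0, 1276), (1, 1259), (2, 1481), (3, 1496)]
candidates = z_candidates(z_values, query_z=1261, k=1)
print(candidates)
```

For the example data this selects rows 1 and 0, the two points whose Z-values (1259 and 1276) surround the query's 1261.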

First Iteration Result

(3, 14, 6)  0
(2, 13, 7)  1

Second Iteration

Data set with row index:

(3, 14, 6)  0
(2, 13, 7)  1
(4, 12, 7)  2
(4, 14, 6)  3

Data Point:

3 12 7

Random Vector:

17 22 34

Data set shifted by the random vector:

(20, 36, 40)  0
(19, 35, 41)  1
(21, 34, 41)  2
(21, 36, 40)  3

New data point:

20 34 41
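
The shift itself is a coordinate-wise addition. A brief sketch, using the random vector shown on the slide; shifting changes where points fall along the Z-order, so neighbors that were far apart in the first ordering may become adjacent in the second.

```python
def shift(point, vector):
    # Translate a point by adding the vector coordinate-wise.
    return tuple(p + v for p, v in zip(point, vector))

data_set = [(3, 14, 6), (2, 13, 7), (4, 12, 7), (4, 14, 6)]
query = (3, 12, 7)
random_vector = (17, 22, 34)  # the vector shown on the slide

shifted_data = [shift(p, random_vector) for p in data_set]
shifted_query = shift(query, random_vector)
print(shifted_data, shifted_query)
```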

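Second Iteration – Result

Row index and Z-value of the shifted points, sorted by Z-value:

1  115255
2  115477
0  115584
3  115588

Z-value of the shifted data point: 115473

Union of the candidates from both iterations:

(2, 13, 7)  1
(4, 12, 7)  2
(3, 14, 6)  0
(2, 13, 7)  1

Union only appends; it does not remove duplicates.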

Z-KNN

Data Point:

3 12 7

Union of candidates (row index, point):

1  (2, 13, 7)
2  (4, 12, 7)
0  (3, 14, 6)
1  (2, 13, 7)

Grouped by row index:

1  [(2, 13, 7), (2, 13, 7)]
2  [(4, 12, 7)]
0  [(3, 14, 6)]

Data Set (duplicates removed):

(2, 13, 7)
(4, 12, 7)
(3, 14, 6)

Z-KNN Results

Data Set:

(2, 13, 7)
(4, 12, 7)
(3, 14, 6)

Data Point:

3 12 7

1 Nearest Neighbor:

4 12 7
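
The final step can be sketched as follows: remove duplicate candidates, compute exact distances to the data point, and keep the k closest. Squared Euclidean distance is an assumption here, since the slides do not name the metric, and `nearest` is a hypothetical helper name.

```python
def nearest(candidates, query, k):
    """Deduplicate (row index, point) candidates and return the k closest to query."""
    # Building a dict keeps a single entry per row index.
    unique = dict(candidates)

    def sq_dist(point):
        # Squared Euclidean distance; it orders points the same way as the true distance.
        return sum((a - b) ** 2 for a, b in zip(point, query))

    ranked = sorted(unique.items(), key=lambda item: sq_dist(item[1]))
    return ranked[:k]

union = [(1, (2, 13, 7)), (2, (4, 12, 7)), (0, (3, 14, 6)), (1, (2, 13, 7))]
result = nearest(union, query=(3, 12, 7), k=1)
print(result)
```

On the example candidates the squared distances are 2, 1 and 5, so row 2, the point (4, 12, 7), is returned.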

Performance on Spark

Performance for different values of K

438 MB memory, Intel Core i3 machine

Performance on Spark

Performance on different data-points with k = 30 and 30 iterations

438 MB memory, Intel Core i3 machine

Best Practices

• More code review at Codacy (www.codacy.com)
  – Integrated with GitHub

Application on GraphX

• Feedback Vertex Set of a Graph

• Geographical Information Systems

Future Works

• Social Content Matching (max flow Algorithm) (alpha)

• KNN for float types (requires calculation of Morton order for floats)

• Matrix multiplication by the Strassen algorithm, using Morton order as locality search.

• Similarity between two documents, implementation of all sequence kernels.

• More outlier detection algorithms

Connect with us

@[email protected]

[email protected]

LinkedIn

Ashutosh: https://www.linkedin.com/in/ashutoshtrivedi

Kaushik: https://www.linkedin.com/in/ranjankaushik

Fork our repository at https://github.com/anantasty/SparkAlgorithms

• Follow us at
  – https://github.com/codeAshu
  – https://github.com/kaushikranjan

References

• A. Koufakou, J. Secretan, J. Reeder, K. Cardona, and M. Georgiopoulos. "Fast parallel outlier detection for categorical datasets using MapReduce." IEEE World Congress on Computational Intelligence, International Joint Conference on Neural Networks (IJCNN), pp. 3298-3304, 2008. DOI: 10.1109/IJCNN.2008.4634266

• C. Zhang, F. Li, and J. Jestes. "Efficient parallel kNN joins for large data in MapReduce." Proceedings of the 15th International Conference on Extending Database Technology. ACM, 2012. DOI: 10.1145/2247596.2247602