Machine Learning on Spark Shivaram Venkataraman UC Berkeley.

Post on 29-Mar-2015


Machine Learning on Spark

Shivaram Venkataraman, UC Berkeley

Machine learning: at the intersection of Computer Science and Statistics

Machine learning is everywhere:

- Spam filters

- Recommendations

- Click prediction

- Search ranking

Machine learning techniques

Classification

Regression

Clustering

Active learning

Collaborative filtering

Implementing Machine Learning

Machine learning algorithms are:

- Complex, multi-stage

- Iterative

MapReduce/Hadoop is unsuitable: these workloads need efficient primitives for data sharing

Spark RDDs provide efficient data sharing

In-memory caching accelerates performance

- Up to 20x faster than Hadoop

Easy-to-use, high-level programming interface

- Express complex algorithms in ~100 lines

Machine Learning using Spark


K-Means Clustering using Spark

Focus: Implementation and Performance

Clustering

Grouping data according to similarity

[Scatter plot: Distance North vs. Distance East, e.g. finds from an archaeological dig]


K-Means Algorithm

Benefits:

• Popular
• Fast
• Conceptually straightforward

K-Means: preliminaries

[Scatter plot: Feature 1 vs. Feature 2]

Data: Collection of values

data = lines.map(line => parseVector(line))

K-Means: preliminaries

Dissimilarity: Squared Euclidean distance

dist = p.squaredDist(q)
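Outside Spark, this dissimilarity measure is a one-liner. A minimal plain-Python sketch of what `p.squaredDist(q)` computes (the function name `squared_dist` is mine, not a Spark API):

```python
def squared_dist(p, q):
    """Squared Euclidean distance: sum of squared per-coordinate differences."""
    return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Points (0, 0) and (3, 4) are distance 5 apart, so squared distance 25.
print(squared_dist((0.0, 0.0), (3.0, 4.0)))  # 25.0
```

Using the squared distance avoids a square root per comparison; since the square root is monotonic, the nearest center is the same either way.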

K-Means: preliminaries

K = Number of clusters

Data assignments to clusters: S1, S2, …, SK


K-Means Algorithm

• Initialize K cluster centers
• Repeat until convergence:
  - Assign each data point to the cluster with the closest center.
  - Assign each cluster center to be the mean of its cluster's data points.


K-Means Algorithm

• Initialize K cluster centers
• Repeat until convergence:
  - Assign each data point to the cluster with the closest center.
  - Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)


K-Means Algorithm

• Initialize K cluster centers
• Repeat until convergence:
  - Assign each cluster center to be the mean of its cluster's data points.

centers = data.takeSample(false, K, seed)

closest = data.map(p => (closestPoint(p, centers), p))
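The `closestPoint` helper is not shown on the slides; a plausible plain-Python version (all names here are mine) simply returns the index of the nearest center under squared Euclidean distance:

```python
def closest_point(p, centers):
    """Index of the center nearest to p, by squared Euclidean distance."""
    def sqdist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(centers)), key=lambda i: sqdist(p, centers[i]))

centers = [(0.0, 0.0), (10.0, 10.0)]
print(closest_point((1.0, 2.0), centers))  # 0
print(closest_point((9.0, 8.0), centers))  # 1
```

In the Spark version, this index serves as the key of each `(clusterIndex, point)` pair produced by the `map` step.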


K-Means Algorithm

• Initialize K cluster centers
• Repeat until convergence:

centers = data.takeSample(false, K, seed)

closest = data.map(p => (closestPoint(p, centers), p))

pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues(ps => average(ps))
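Locally, the `groupByKey` / `mapValues(average)` pair corresponds to grouping points by their assigned cluster and averaging each group. A plain-Python sketch of this step (variable and function names are mine):

```python
from collections import defaultdict

def average(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

# (cluster index, point) pairs, as produced by the map step above
closest = [(0, (1.0, 1.0)), (1, (8.0, 8.0)), (0, (3.0, 1.0)), (1, (10.0, 8.0))]

groups = defaultdict(list)          # analogous to groupByKey
for k, p in closest:
    groups[k].append(p)

new_centers = {k: average(ps) for k, ps in groups.items()}  # mapValues(average)
print(new_centers)  # {0: (2.0, 1.0), 1: (9.0, 8.0)}
```

On a cluster, `groupByKey` shuffles all points assigned to the same center onto one machine so the mean can be computed there.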


K-Means Algorithm

• Initialize K cluster centers
• Repeat until convergence:

centers = data.takeSample(false, K, seed)

closest = data.map(p => (closestPoint(p, centers), p))

pointsGroup = closest.groupByKey()

newCenters = pointsGroup.mapValues(ps => average(ps))

while (dist(centers, newCenters) > ɛ)


K-Means Source

centers = data.takeSample(false, K, seed)

while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps))
  d = distance(centers, newCenters)
  centers = newCenters.map(_._2).collect()  // updated centers back on the driver
}
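For reference, the same loop can be written as a self-contained local program. This plain-Python sketch mirrors the Spark version step for step (sample initial centers, assign points, average groups, iterate until the centers stop moving) but runs on one machine with ordinary lists; all names are mine, not the Spark API:

```python
import random
from collections import defaultdict

def sqdist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def closest_point(p, centers):
    return min(range(len(centers)), key=lambda i: sqdist(p, centers[i]))

def average(points):
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def kmeans(data, k, epsilon=1e-6, seed=42):
    rng = random.Random(seed)
    centers = rng.sample(data, k)            # takeSample(false, K, seed)
    d = float("inf")
    while d > epsilon:
        # map: assign each point to its nearest center
        closest = [(closest_point(p, centers), p) for p in data]
        groups = defaultdict(list)           # groupByKey
        for i, p in closest:
            groups[i].append(p)
        # mapValues(average); keep the old center if a cluster went empty
        new_centers = [average(groups[i]) if i in groups else centers[i]
                       for i in range(k)]
        d = sum(sqdist(c, nc) for c, nc in zip(centers, new_centers))
        centers = new_centers
    return centers

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
print(sorted(kmeans(data, 2)))
```

On this toy data the loop converges to the two cluster means, roughly (1/3, 1/3) and (28/3, 28/3). The Spark version distributes exactly the `map` and grouping steps across machines, which is why `groupByKey` appears in the code above.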

Ease of use

Interactive shell: useful for featurization and pre-processing of data

Lines of code for K-Means:

- Spark: ~90 lines (part of the hands-on tutorial!)

- Hadoop/Mahout: 4 files, >300 lines

K-Means: iteration time (s) vs. number of machines

Machines | Hadoop | HadoopBinMem | Spark
25       | 274    | 197          | 143
50       | 157    | 121          | 61
100      | 106    | 87           | 33

Logistic Regression: iteration time (s) vs. number of machines

Machines | Hadoop | HadoopBinMem | Spark
25       | 184    | 116          | 15
50       | 111    | 80           | 6
100      | 76     | 62           | 3

Performance results from [Zaharia et al., NSDI'12]
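The speedups implied by the charts can be checked directly from the 100-machine figures; a quick arithmetic check in plain Python:

```python
# Iteration times in seconds at 100 machines, taken from the charts above
kmeans_times = {"Hadoop": 106, "HadoopBinMem": 87, "Spark": 33}
logreg_times = {"Hadoop": 76, "HadoopBinMem": 62, "Spark": 3}

print(f"K-Means: Spark is {kmeans_times['Hadoop'] / kmeans_times['Spark']:.1f}x faster than Hadoop")
print(f"LogReg:  Spark is {logreg_times['Hadoop'] / logreg_times['Spark']:.1f}x faster than Hadoop")
```

The gap is much larger for logistic regression because each iteration does little work besides scanning the data, so avoiding HDFS reads via in-memory caching dominates the iteration time.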

Conclusion

Spark: a framework for cluster computing that makes machine learning programs fast and easy

K-means clustering using Spark: hands-on exercise this afternoon!

Examples and more: www.spark-project.org