Sandy Ryza – Software Engineer, Cloudera at MLconf ATL

Posted on 18-Nov-2014


Description

Unsupervised Learning on Huge Data with Apache Spark

Unsupervised learning refers to a branch of algorithms that try to find structure in unlabeled data. Spark's MLlib module contains implementations of several unsupervised learning algorithms that scale to large datasets. In this talk, we'll discuss how to use and implement large-scale machine learning algorithms with the Spark programming model, diving into MLlib's K-means clustering and Principal Component Analysis (PCA).

Transcript of Sandy Ryza – Software Engineer, Cloudera at MLconf ATL

Clustering with Spark
Sandy Ryza / Data Science / Cloudera

Me

● Data scientist at Cloudera
● Recently led Apache Spark development at Cloudera
● Before that, a committer on Apache Hadoop
● Before that, studied combinatorial optimization and distributed systems at Brown

Large Scale Learning

Sometimes you find yourself with lots of stuff:

● Network Packets → Detect Network Intrusions
● Credit Card Transactions → Detect Fraud
● Movie Viewings → Recommend Movies

Unsupervised Learning

● Learn the hidden structure of your data
● Interpret new data as it relates to this structure

Two Main Problems

● Designing a system for processing huge data in parallel

● Taking advantage of it with algorithms that work well in parallel


MapReduce

[Diagram: many parallel map tasks feeding into a smaller number of reduce tasks]

Key advances by MapReduce:

•Data locality: automatic split computation and launching of mappers close to where the data lives

•Fault tolerance: writing out intermediate results and restartable tasks made it possible to run on commodity hardware

•Linear scalability: the combination of locality and a programming model that forces developers to write generally scalable solutions



Limitations of MapReduce

•Each job reads data from HDFS

•No concept of a session

•Jobs follow a rigid map-then-reduce structure


Spark is a general-purpose computation framework geared towards massive data that is more flexible than MapReduce.

Extra properties:
•Leverages distributed memory
•Full directed graph expressions for data-parallel computations
•Improved developer experience

Yet retains: linear scalability, fault tolerance, and data locality

RDDs

[Diagram: bigfile.txt stored in HDFS flows into the lines RDD and then the numbers RDD, each split across partitions; the sum is computed and returned to the driver]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toDouble)
numbers.sum()

RDDs

[Diagram: the same lineage, but numbers is cached; the second sum() is computed from the in-memory partitions instead of rereading from HDFS]

val lines = sc.textFile("bigfile.txt")
val numbers = lines.map((x) => x.toInt)
numbers.cache()
  .sum()
numbers.sum()

Spark MLlib

Supervised
● Classification (discrete): logistic regression (and regularized variants), linear SVM, naive Bayes, random decision forests (soon)
● Regression (continuous): linear regression (and regularized variants)

Unsupervised
● Clustering (discrete): K-means
● Dimensionality reduction / matrix factorization (continuous): principal component analysis / singular value decomposition, alternating least squares


Using it

val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(_.split(' ').map(_.toDouble))

// Cluster the data into two classes using KMeans
val numIterations = 20
val numClusters = 2
val clusters = KMeans.train(parsedData, numClusters, numIterations)
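In more recent MLlib releases (Spark 1.x and later), KMeans.train expects an RDD of MLlib Vector objects rather than raw arrays. A minimal sketch of the same example under that API, with computeCost used to evaluate the result:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse each line of space-separated numbers into an MLlib dense Vector
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(line => Vectors.dense(line.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Within-set sum of squared errors: distance of each point to its nearest center
val cost = clusters.computeCost(parsedData)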

K-Means

● Choose some initial centers
● Then alternate between two steps (a minimal single-machine sketch follows this list):
  ○ Assign each point to a cluster based on existing centers
  ○ Recompute cluster centers from the points in each cluster
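As a concrete illustration of the two alternating steps, here is a minimal single-machine sketch in plain Scala. It is illustrative only: the helper names are made up, and this is not MLlib's distributed implementation (which is shown on the next slides).

// Minimal single-machine sketch of K-means (Lloyd's algorithm); illustration only.
def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def closestCenter(centers: Array[Array[Double]], p: Array[Double]): Int =
  centers.indices.minBy(i => squaredDistance(centers(i), p))

def lloyd(points: Array[Array[Double]], k: Int, iterations: Int): Array[Array[Double]] = {
  // Choose some initial centers (here simply the first k points)
  var centers = points.take(k)
  for (_ <- 1 to iterations) {
    // Step 1: assign each point to the cluster of its closest center
    val clusters = points.groupBy(p => closestCenter(centers, p))
    // Step 2: recompute each center as the mean of the points assigned to it
    centers = Array.tabulate(k) { j =>
      clusters.get(j) match {
        case Some(ps) => ps.transpose.map(dim => dim.sum / ps.length)
        case None     => centers(j) // empty cluster: keep the old center
      }
    }
  }
  centers
}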

K-Means - very parallelizable

● Alternate between two steps:
  ○ Assign each point to a cluster based on existing centers
    ■ Process each data point independently
  ○ Recompute cluster centers from the points in each cluster
    ■ Average across partitions
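The snippet below shows the core of MLlib's K-means implementation for one iteration of these two steps; helpers such as KMeans.findClosest, mergeContribs, costAccum, and BreezeVectorWithNorm come from the surrounding MLlib code.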

// Find the sum and count of points mapping to each center
val totalContribs = data.mapPartitions { points =>
  val k = centers.length
  val dims = centers(0).vector.length
  val sums = Array.fill(k)(BDV.zeros[Double](dims).asInstanceOf[BV[Double]])
  val counts = Array.fill(k)(0L)
  points.foreach { point =>
    val (bestCenter, cost) = KMeans.findClosest(centers, point)
    costAccum += cost
    sums(bestCenter) += point.vector
    counts(bestCenter) += 1
  }
  val contribs = for (j <- 0 until k) yield {
    (j, (sums(j), counts(j)))
  }
  contribs.iterator
}.reduceByKey(mergeContribs).collectAsMap()

// Update the cluster centers and costs
var changed = false
var j = 0
while (j < k) {
  val (sum, count) = totalContribs(j)
  if (count != 0) {
    sum /= count.toDouble
    val newCenter = new BreezeVectorWithNorm(sum)
    if (KMeans.fastSquaredDistance(newCenter, centers(j)) > epsilon * epsilon) {
      changed = true
    }
    centers(j) = newCenter
  }
  j += 1
}

if (!changed) {
  logInfo("Run " + run + " finished in " + (iteration + 1) + " iterations")
}
cost = costAccum.value

The Problem

● K-Means is very sensitive to the initial set of centers chosen.

● The best existing algorithm for choosing centers is highly sequential.

K-Means++

● Start with a random point from the dataset
● Pick another one randomly, with probability proportional to its squared distance from the closest center already chosen (a sketch follows this list)
● Repeat until all initial centers are chosen
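A minimal single-machine sketch of this seeding procedure (illustrative only, with a made-up squaredDistance helper; this is not the MLlib code):

import scala.util.Random
import scala.collection.mutable.ArrayBuffer

def squaredDistance(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def kMeansPlusPlus(points: Array[Array[Double]], k: Int, rand: Random): Array[Array[Double]] = {
  // Start with a random point from the dataset
  val centers = ArrayBuffer(points(rand.nextInt(points.length)))
  while (centers.length < k) {
    // Weight each point by its squared distance to the closest chosen center
    val weights = points.map(p => centers.map(c => squaredDistance(c, p)).min)
    // Sample the next center with probability proportional to its weight
    var r = rand.nextDouble() * weights.sum
    var i = 0
    while (i < points.length - 1 && r > weights(i)) { r -= weights(i); i += 1 }
    centers += points(i)
  }
  centers.toArray
}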

K-Means++

● The resulting initial clustering has expected cost within an O(log k) factor of the optimal cost

K-Means++

● Requires k passes over the data

K-Means||

● Do only a few (~5) passes
● Sample m points on each pass
● Oversample
● Run K-Means++ on the sampled points to find the initial centers (sketched below)
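A rough RDD-based sketch of the sampling idea, heavily simplified relative to MLlib's actual implementation: it reuses the squaredDistance and kMeansPlusPlus helpers from the sketches above, and the real algorithm additionally weights each sampled point by how many input points map to it before running K-Means++.

import scala.util.Random
import org.apache.spark.rdd.RDD

// Rough sketch of the K-Means|| sampling passes (simplified; not MLlib's code).
def kMeansParallelInit(
    data: RDD[Array[Double]],
    k: Int,
    passes: Int = 5,
    oversampling: Double = 2.0): Array[Array[Double]] = {
  // Start from one random point
  var centers = data.takeSample(withReplacement = false, num = 1, seed = 42L)
  for (_ <- 1 to passes) {
    val bcCenters = data.sparkContext.broadcast(centers)
    // Total cost of the current centers over the whole dataset
    val totalCost = data.map(p => bcCenters.value.map(c => squaredDistance(c, p)).min).sum()
    // Keep each point with probability proportional to its cost (oversampled)
    val sampled = data.mapPartitionsWithIndex { (idx, points) =>
      val rand = new Random(idx)
      points.filter { p =>
        val cost = bcCenters.value.map(c => squaredDistance(c, p)).min
        rand.nextDouble() < oversampling * k * cost / totalCost
      }
    }.collect()
    centers = centers ++ sampled
  }
  // Run K-Means++ on the (much smaller) sampled set to get the k initial centers
  kMeansPlusPlus(centers, k, new Random(42))
}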

Then on the full data...