Data Science at Scale with Spark (pdf)

dean.wampler@lightbend.com
polyglotprogramming.com/talks

Transcript of Data Science at Scale with Spark (pdf)

Page 3: Data Science at Scale with Spark (pdf)

“Trolling the Hadoop community since 2012...”

Page 4: Data Science at Scale with Spark (pdf)
Page 5: Data Science at Scale with Spark (pdf)

Why the JVM?

Page 6: Data Science at Scale with Spark (pdf)

The JVM

Akka, Breeze, Algebird, Spire & Cats, Axle, ...

Page 7: Data Science at Scale with Spark (pdf)

Big Data Ecosystem

Page 8: Data Science at Scale with Spark (pdf)

Hadoop

Page 9: Data Science at Scale with Spark (pdf)

[Diagram: a Hadoop cluster. The master node runs the Resource Manager and the Name Node; each slave node runs a Node Manager and a Data Node over its local disks.]

Page 10: Data Science at Scale with Spark (pdf)

[Diagram: the same cluster, highlighting HDFS, which spans the Data Nodes and their disks.]

Page 11: Data Science at Scale with Spark (pdf)

[Diagram: the same cluster, now running MapReduce Jobs scheduled by the Resource Manager onto the Node Managers, reading from and writing to HDFS.]

Page 12: Data Science at Scale with Spark (pdf)

Hadoop

MapReduce

Page 13: Data Science at Scale with Spark (pdf)

Example: Inverted Index

[Diagram: three crawled pages (wikipedia.org/hadoop: "Hadoop provides MapReduce and HDFS"; wikipedia.org/hbase: "HBase stores data in HDFS"; wikipedia.org/hive: "Hive queries HDFS files and HBase tables with SQL") and the inverse index they produce, stored in blocks:
  hadoop -> (.../hadoop,1)
  hdfs   -> (.../hadoop,1), (.../hbase,1), (.../hive,1)
  hive   -> (.../hive,1)
  hbase  -> (.../hbase,1), (.../hive,1)
  and    -> (.../hadoop,1), (.../hive,1)
  ...]

Page 14: Data Science at Scale with Spark (pdf)

Example: Inverted Index

[Diagram: the same crawled pages, first stored by a Web Crawl as an index of (url, contents) records in blocks, e.g. (wikipedia.org/hadoop, "Hadoop provides..."), (wikipedia.org/hbase, "HBase stores..."), (wikipedia.org/hive, "Hive queries..."); a box labeled "Miracle!!" ("Compute Inverted Index") turns that index into the inverse index shown on the previous slide.]

Pages 15-18: Data Science at Scale with Spark (pdf)

(The same inverted-index diagram, repeated as the slide builds up.)

Page 19: Data Science at Scale with Spark (pdf)

Java MapReduce Inverted Index

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LineIndexer {

  public static void main(String[] args) {
    JobClient client = new JobClient();
    JobConf conf = new JobConf(LineIndexer.class);

    conf.setJobName("LineIndexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    conf.setMapperClass(LineIndexMapper.class);

Page 20: Data Science at Scale with Spark (pdf)

    conf.setJobName("LineIndexer");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));
    conf.setMapperClass(LineIndexMapper.class);
    conf.setReducerClass(LineIndexReducer.class);
    client.setConf(conf);

    try {
      JobClient.runJob(conf);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

Page 21: Data Science at Scale with Spark (pdf)

  public static class LineIndexMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    private final static Text word = new Text();
    private final static Text location = new Text();

    public void map(
        LongWritable key, Text val,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {

      FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
      String fileName = fileSplit.getPath().getName();
      location.set(fileName);

      String line = val.toString();
      StringTokenizer itr = new StringTokenizer(line.toLowerCase());
      while (itr.hasMoreTokens()) {

Page 22: Data Science at Scale with Spark (pdf)

    private final static Text location = new Text();

    public void map(
        LongWritable key, Text val,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {

      FileSplit fileSplit = (FileSplit)reporter.getInputSplit();
      String fileName = fileSplit.getPath().getName();
      location.set(fileName);

      String line = val.toString();
      StringTokenizer itr = new StringTokenizer(line.toLowerCase());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, location);
      }
    }
  }

Actual business logic.

Page 23: Data Science at Scale with Spark (pdf)

  public static class LineIndexReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output,
        Reporter reporter) throws IOException {
      boolean first = true;
      StringBuilder toReturn = new StringBuilder();
      while (values.hasNext()) {
        if (!first)
          toReturn.append(", ");
        first = false;
        toReturn.append(values.next().toString());
      }
      output.collect(key, new Text(toReturn.toString()));
    }
  }
}

Actual business logic.

Page 24: Data Science at Scale with Spark (pdf)

Problems

Hard to implement algorithms...


Page 25: Data Science at Scale with Spark (pdf)

Higher-Level Tools?

Page 26: Data Science at Scale with Spark (pdf)

Hive:

CREATE TABLE students (name STRING, age INT, gpa FLOAT);
LOAD DATA ...;
...
SELECT age, AVG(gpa) FROM students GROUP BY age;

Page 27: Data Science at Scale with Spark (pdf)

Pig:

A = LOAD 'students' USING PigStorage()
    AS (name:chararray, age:int, gpa:float);
B = GROUP A BY age;
C = FOREACH B GENERATE group AS age, AVG(A.gpa);
DUMP C;

Page 28: Data Science at Scale with Spark (pdf)


Cascading (Java)

MapReduce

Page 29: Data Science at Scale with Spark (pdf)

Scalding (Scala):

import com.twitter.scalding._

class InvertedIndex(args: Args) extends Job(args) {

  val texts = Tsv("texts.tsv", ('id, 'text))

  val wordToIds = texts
    .flatMap(('id, 'text) -> ('word, 'id2)) { fields: (String, String) =>
      val (id2, text) = fields
      text.split("\\s+").map { word => (word, id2) }
    }

  val invertedIndex = wordToIds
    .groupBy('word)(_.toList[String]('id2 -> 'ids))

  invertedIndex.write(Tsv("output.tsv"))
}

Page 30: Data Science at Scale with Spark (pdf)

Problems

Only “batch mode”; what about streaming?

Page 31: Data Science at Scale with Spark (pdf)

Problems

Performance needs to be better.

Page 32: Data Science at Scale with Spark (pdf)


Spark

Page 33: Data Science at Scale with Spark (pdf)

Productivity?

Very concise, elegant, functional APIs:
• Python, R
• Scala, Java
• ... and SQL!

Page 34: Data Science at Scale with Spark (pdf)

Productivity?

Interactive shell (REPL)
• Scala, Python, R, and SQL
Notebooks (iPython/Jupyter-like)

Page 35: Data Science at Scale with Spark (pdf)

Flexible for Algorithms?

Composable primitives support a wide class of algorithms: iterative machine learning & graph processing/traversal.
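As a minimal sketch of that iterative pattern (not from the deck; the input path and the update rule are invented for illustration), an RDD can be cached once and reused on every pass, which is exactly what iterative ML algorithms need:

  // Hypothetical input: one number per line.
  val points = sc.textFile("points.txt")
                 .map(_.toDouble)
                 .cache()               // computed once, reused each iteration

  var estimate = 0.0
  for (i <- 1 to 10) {
    // Toy update step: move the estimate toward the mean of the data.
    val error = points.map(_ - estimate).mean()
    estimate += 0.5 * error
  }
  println(s"estimate = $estimate")

With MapReduce, each pass would be a separate job re-reading the data from disk; Spark keeps it in memory.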

Page 36: Data Science at Scale with Spark (pdf)

Efficient?

Builds a dataflow DAG:
• Combines steps into “stages”
• Can cache intermediate data
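For example (a sketch reusing the word-count idea; the input path is hypothetical), an intermediate RDD can be cached and the lineage Spark builds for it inspected with toDebugString:

  val counts = sc.textFile("data/crawl")
                 .flatMap(_.split("""\W+"""))
                 .map((_, 1))
                 .reduceByKey(_ + _)

  counts.cache()                  // keep the intermediate result in memory
  println(counts.count())         // first action computes and caches it
  println(counts.toDebugString)   // prints the lineage: the DAG of stages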

Page 37: Data Science at Scale with Spark (pdf)

Efficient?

The new DataFrame API has the same performance for all languages.

Page 38: Data Science at Scale with Spark (pdf)

Batch + Streaming?

Streams: “mini-batch” processing
• Reuses “batch” code
• Adds “window” functions

Page 39: Data Science at Scale with Spark (pdf)

Resilient Distributed Datasets (RDDs)

[Diagram: a cluster of nodes; an RDD is split into partitions, one partition per node.]
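A quick sketch of the idea (the numbers are arbitrary): an RDD is created with a partition count, and each operation runs as a task per partition across the cluster:

  val rdd = sc.parallelize(1 to 1000, numSlices = 4)  // 4 partitions
  println(rdd.partitions.length)                      // 4
  println(rdd.map(_ * 2).reduce(_ + _))               // one task per partition, results combined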

Page 40: Data Science at Scale with Spark (pdf)

DStreams

[Diagram: a DStream (discretized stream) of events, batched into RDDs at Time 1, Time 2, Time 3, Time 4, ...; sliding windows of 3 RDD batches (#1 and #2) each span three consecutive batches.]
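To make the window idea concrete (a sketch, not from the deck; the socket source and the durations are invented), a 10-second batch interval with a 30-second window sliding every 10 seconds might look like:

  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val ssc   = new StreamingContext(sc, Seconds(10))     // 10-second mini-batches
  val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source

  val counts = lines
    .flatMap(_.split("""\W+"""))
    .map((_, 1))
    .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(10))  // 30s window, 10s slide

  counts.print()
  ssc.start()
  ssc.awaitTermination()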

Page 41: Data Science at Scale with Spark (pdf)


Scala?

I’ll use Scala for the examples, but Data Scientists can use Python or R, if they prefer.

See Vitaly Gordon's opinion on Scala for Data Science

Page 42: Data Science at Scale with Spark (pdf)


Inverted Index

Page 43: Data Science at Scale with Spark (pdf)

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]) = {

    val sc = new SparkContext("local", "Inverted Index")

    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) => text.split("""\W+""") map {

Pages 44-45: Data Science at Scale with Spark (pdf)

(The same code, shown again as the slide builds up.)

Page 46: Data Science at Scale with Spark (pdf)

    val sc = new SparkContext("local", "Inverted Index")

    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) => text.split("""\W+""") map {
          word => (word, path)
        }
      }
      .map { case (w, p) => ((w, p), 1) }
      .reduceByKey { (n1, n2) => n1 + n2

Page 47: Data Science at Scale with Spark (pdf)

(The same code; the flatMap step emits records of the form (word1, path1), (word2, path2), ....)

Page 48: Data Science at Scale with Spark (pdf)

          word => (word, path)
        }
      }
      .map { case (w, p) => ((w, p), 1) }
      .reduceByKey { (n1, n2) => n1 + n2 }
      .map { case ((word, path), n) => (word, (path, n)) }
      .groupByKey
      .mapValues { iter =>
        iter.toSeq.sortBy {
          case (path, n) => (-n, path)
        }.mkString(", ")
      }

After reduceByKey, the records have the form ((word1, path1), N1), ((word2, path2), N2), ....

Page 49: Data Science at Scale with Spark (pdf)

(The same code; the following map step turns ((word1, path1), N1) into (word1, (path1, N1)), ....)

Page 50: Data Science at Scale with Spark (pdf)

      }
      .map { case ((word, path), n) => (word, (path, n)) }
      .groupByKey
      .mapValues { iter =>
        iter.toSeq.sortBy {
          case (path, n) => (-n, path)
        }.mkString(", ")
      }
      .saveAsTextFile("output/inverted-index")

    sc.stop()
  }
}

After groupByKey, the records have the form (word, Seq((path1, n1), (path2, n2), (path3, n3), ...)), ....

Page 51: Data Science at Scale with Spark (pdf)

(The same code; after mapValues sorts and formats each sequence, the final records look like (word, “(path4,80), (path19,51), (path8,12), ...”), ....)

Page 52: Data Science at Scale with Spark (pdf)


Page 53: Data Science at Scale with Spark (pdf)

Altogether

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object InvertedIndex {
  def main(args: Array[String]) = {

    val sc = new SparkContext(
      "local", "Inverted Index")

    sc.textFile("data/crawl")
      .map { line =>
        val array = line.split("\t", 2)
        (array(0), array(1))
      }
      .flatMap {
        case (path, text) => text.split("""\W+""") map {
          word => (word, path)
        }
      }
      .map { case (w, p) => ((w, p), 1) }
      .reduceByKey { (n1, n2) => n1 + n2 }
      .map { case ((word, path), n) => (word, (path, n)) }
      .groupByKey
      .mapValues { iter =>
        iter.toSeq.sortBy {
          case (path, n) => (-n, path)
        }.mkString(", ")
      }
      .saveAsTextFile("output/inverted-index")

    sc.stop()
  }
}

Page 54: Data Science at Scale with Spark (pdf)

Powerful, composable “operators”.

Page 55: Data Science at Scale with Spark (pdf)

The larger ecosystem


Page 56: Data Science at Scale with Spark (pdf)

SQL Revisited


Page 57: Data Science at Scale with Spark (pdf)

Spark SQL

• Use HiveQL (Hive’s SQL dialect)
• Use Spark’s own SQL dialect
• Use the new DataFrame API

Page 58: Data Science at Scale with Spark (pdf)


Page 59: Data Science at Scale with Spark (pdf)

import org.apache.spark.sql.hive._

val sc   = new SparkContext(...)
val sqlc = new HiveContext(sc)

sqlc.sql(
  "CREATE TABLE wc (word STRING, count INT)").show()

sqlc.sql("""
  LOAD DATA LOCAL INPATH '/path/to/wc.txt'
  INTO TABLE wc""").show()

sqlc.sql("""
  SELECT * FROM wc
  ORDER BY count DESC""").show()

Pages 60-61: Data Science at Scale with Spark (pdf)

(The same code, shown again as the slide builds up.)

Page 62: Data Science at Scale with Spark (pdf)

• Prefer Python??
• Just replace:

    import org.apache.spark.sql.hive._

• With this:

    from pyspark.sql import HiveContext

• ... and delete the vals.

Page 63: Data Science at Scale with Spark (pdf)

• Prefer Just HiveQL??
• Use the spark-sql shell script:

    CREATE TABLE wc (word STRING, count INT);

    LOAD DATA LOCAL INPATH '/path/to/wc.txt'
    INTO TABLE wc;

    SELECT * FROM wc
    ORDER BY count DESC;
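The deck lists “Spark’s own SQL dialect” as the second option but doesn’t show it. As a minimal sketch (assuming a Spark 1.x SQLContext and a hypothetical JSON input file), you register a DataFrame as a temporary table and query it without any Hive dependency:

  import org.apache.spark.sql.SQLContext

  val sqlContext = new SQLContext(sc)                  // no Hive metastore needed
  val wc = sqlContext.read.json("/path/to/wc.json")    // hypothetical input
  wc.registerTempTable("wc")

  sqlContext.sql("SELECT * FROM wc ORDER BY count DESC").show()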

Page 64: Data Science at Scale with Spark (pdf)

Spark SQL

• Use HiveQL (Hive’s SQL dialect)
• Use Spark’s own SQL dialect
• Use the new DataFrame API

Page 65: Data Science at Scale with Spark (pdf)

import org.apache.spark.sql._
import org.apache.spark.sql.hive._
import org.apache.spark.sql.functions._

val sc   = new SparkContext(...)
val sqlc = new HiveContext(sc)
import sqlc.implicits._

val df = sqlc.read.parquet("/path/to/wc.parquet")

val ordered_df = df.orderBy($"count".desc)
ordered_df.show()
ordered_df.cache()

val long_words = ordered_df.filter(length($"word") > 20)
long_words.write.parquet("/…/long_words.parquet")

Pages 66-69: Data Science at Scale with Spark (pdf)

(The same DataFrame code, shown again as the slide builds up.)
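To round out the DataFrame example (a sketch, not from the deck; the first-letter grouping is invented, and it assumes the functions and implicits imports shown above plus a Spark version that provides the substring and sum column functions):

  // Total the word counts by first letter and list the biggest groups first.
  val by_letter = df
    .groupBy(substring($"word", 1, 1).as("letter"))
    .agg(sum($"count").as("total"))
    .orderBy($"total".desc)

  by_letter.show()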

Page 70: Data Science at Scale with Spark (pdf)

Machine Learning

MLlib

Page 71: Data Science at Scale with Spark (pdf)

Streaming KMeans Example

Page 72: Data Science at Scale with Spark (pdf)

import ...spark.mllib.clustering.StreamingKMeans
import ...spark.mllib.linalg.Vectors
import ...spark.mllib.regression.LabeledPoint
import ...spark.streaming.{Seconds, StreamingContext}

val sc  = new SparkContext(...)
val ssc = new StreamingContext(sc, Seconds(10))

val trainingData = ssc.textFileStream(...)
  .map(Vectors.parse)
val testData = ssc.textFileStream(...)
  .map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(K_CLUSTERS)

Pages 73-75: Data Science at Scale with Spark (pdf)

(The same code, shown again as the slide builds up.)

Page 76: Data Science at Scale with Spark (pdf)

val testData = ssc.textFileStream(...)
  .map(LabeledPoint.parse)

val model = new StreamingKMeans()
  .setK(K_CLUSTERS)
  .setDecayFactor(1.0)
  .setRandomCenters(N_FEATURES, 0.0)

val f: LabeledPoint => (Double, Vector) =
  lp => (lp.label, lp.features)

model.trainOn(trainingData)
model.predictOnValues(testData.map(f)).print()

ssc.start()
ssc.awaitTermination()

Pages 77-79: Data Science at Scale with Spark (pdf)

(The same code, shown again as the slide builds up.)
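As a small follow-up (not on the slides; it assumes the StreamingKMeans model defined above), the most recently updated cluster centers can be inspected while the stream runs:

  // The model is refined with every mini-batch; latestModel() returns the current fit.
  model.latestModel().clusterCenters.foreach(println)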

Page 80: Data Science at Scale with Spark (pdf)


Graph Processing

GraphX

Page 81: Data Science at Scale with Spark (pdf)

GraphX

• Social networks
• Epidemics
• Teh Interwebs
• “Page Rank”
• ...

Page 82: Data Science at Scale with Spark (pdf)

import scala.collection.mutable
import org.apache.spark._
import ...spark.storage.StorageLevel
import ...spark.graphx._
import ...spark.graphx.lib._
import ...spark.graphx.PartitionStrategy._

val numEdgePartitions = 20
val partitionStrategy = Some(PartitionStrategy.CanonicalRandomVertexCut)
val edgeStorageLevel = StorageLevel.MEMORY_ONLY
val vertexStorageLevel = StorageLevel.MEMORY_ONLY
val tolerance = 0.001F
val input = "..."

val sc = new SparkContext(...)

Pages 83-84: Data Science at Scale with Spark (pdf)

(The same code, shown again as the slide builds up.)

Page 85: Data Science at Scale with Spark (pdf)

val tolerance = 0.001F
val input = "..."

val sc = new SparkContext(...)

val unpartitionedGraph = GraphLoader.edgeListFile(
  sc, input, numEdgePartitions,
  edgeStorageLevel, vertexStorageLevel).cache

val graph = partitionStrategy.foldLeft(
  unpartitionedGraph)(_.partitionBy(_))

println("# vertices = " + graph.vertices.count)
println("# edges    = " + graph.edges.count)

val pr = PageRank.runUntilConvergence(
  graph, tolerance).vertices.cache()

Page 86: Data Science at Scale with Spark (pdf)


Page 87: Data Science at Scale with Spark (pdf)

val graph = partitionStrategy.foldLeft(
  unpartitionedGraph)(_.partitionBy(_))

println("# vertices = " + graph.vertices.count)
println("# edges    = " + graph.edges.count)

val pr = PageRank.runUntilConvergence(
  graph, tolerance).vertices.cache()

println("Top ranks: ")
pr.sortBy(tuple => -tuple._2).foreach(println)

pr.map { case (id, r) => id + "\t" + r }.saveAsTextFile(...)
sc.stop()

Page 88: Data Science at Scale with Spark (pdf)


Page 89: Data Science at Scale with Spark (pdf)

Thank You!

dean.wampler@lightbend.com
@deanwampler

polyglotprogramming.com/talks

Page 90: Data Science at Scale with Spark (pdf)

Bonus Slides: Details of the MapReduce implementation for the Inverted Index

Page 91: Data Science at Scale with Spark (pdf)


1 Map step + 1 Reduce step

Page 92: Data Science at Scale with Spark (pdf)

1 Map step + 1 Reduce step

[Diagram: the Map Task processing wikipedia.org/hadoop emits one key-value pair per word:
  (hadoop,    (wikipedia.org/hadoop, 1))
  (mapreduce, (wikipedia.org/hadoop, 1))
  (hdfs,      (wikipedia.org/hadoop, 1))
  (provides,  (wikipedia.org/hadoop, 1))
  (and,       (wikipedia.org/hadoop, 1))]

Page 93: Data Science at Scale with Spark (pdf)


1 Map step + 1 Reduce step

Page 94: Data Science at Scale with Spark (pdf)

1 Map step + 1 Reduce step

[Diagram: the Reduce Task receives each word with an iterator over its (path, count) pairs:
  (hadoop, Iterator((…/hadoop, 1)))
  (hdfs,   Iterator((…/hadoop, 1), (…/hbase, 1), (…/hive, 1)))
  (hbase,  Iterator((…/hbase, 1), (…/hive, 1)))]

Page 95: Data Science at Scale with Spark (pdf)


1 Map step + 1 Reduce step