Apache Spark - when things go wrong


Transcript of "Apache Spark - when things go wrong"

Page 1: Apache Spark - when things go wrong

Apache Spark - when things go wrong

@rabbitonweb

Page 2: Apache Spark - when things go wrong

Apache Spark - when things go wrong

val sc = new SparkContext(conf)

sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
  }
  .take(300)
  .foreach(println)

If the above seems odd, this talk is probably not for you.

tweet compliments and complaints to @rabbitonweb

Pages 3-11: Apache Spark - when things go wrong

Target for this talk

Pages 12-16: Apache Spark - when things go wrong

4 things for take-away

1. Knowledge of how Apache Spark works internally

2. Courage to look at Spark's implementation (code)

3. Notion of how to write efficient Spark programs

4. Basic ability to monitor Spark when things are starting to act weird

Pages 17-22: Apache Spark - when things go wrong

Two words about examples used

(Diagram: a Super Cool App emits events into a Journal, which is read by a spark cluster of node 1, node 2 and node 3, with master A and master B.)

Pages 23-29: Apache Spark - when things go wrong

Two words about examples used

sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
  }
  .take(300)
  .foreach(println)

Pages 30-32: Apache Spark - when things go wrong

What is a RDD?

Resilient Distributed Dataset

Pages 33-35: Apache Spark - when things go wrong

What is a RDD?

...
10 10/05/2015 10:14:01 UserInitialized Ania Nowak
10 10/05/2015 10:14:55 FirstNameChanged Anna
12 10/05/2015 10:17:03 UserLoggedIn
12 10/05/2015 10:21:31 UserLoggedOut
...
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski

(Diagram: the same journal of events, split and distributed across node 1, node 2 and node 3.)

Pages 36-41: Apache Spark - when things go wrong

What is a RDD?

RDD needs to hold 3 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
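Those three pieces map directly onto the RDD API. A minimal, illustrative skeleton (a hypothetical DoubledRDD, not from the talk), assuming Spark's RDD base class:

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// illustrative sketch only
class DoubledRDD(prev: RDD[Int]) extends RDD[Int](prev) { // 1. pointer to its parent
  // 2. how its internal data is partitioned
  override def getPartitions: Array[Partition] = prev.partitions
  // 3. how to evaluate its internal data
  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    prev.iterator(split, context).map(_ * 2)
}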

Pages 42-45: Apache Spark - when things go wrong

What is a partition?

A partition represents a subset of the data within your distributed collection.

override def getPartitions: Array[Partition] = ???

How this subset is defined depends on the type of the RDD.
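You can check the partitioning of any RDD from the shell; a quick sketch (the path is the talk's example):

val journal = sc.textFile("hdfs://journal/*")
println(journal.partitions.size) // number of partitions backing this RDD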

Pages 46-48: Apache Spark - when things go wrong

example: HadoopRDD

val journal = sc.textFile("hdfs://journal/*")

How is a HadoopRDD partitioned?

In a HadoopRDD, a partition is exactly the same as a file chunk in HDFS.

Pages 49-54: Apache Spark - when things go wrong

example: HadoopRDD

(Diagram: four HDFS file chunks, each one becoming a partition, spread across node 1, node 2 and node 3.)

chunk 1:
10 10/05/2015 10:14:01 UserInit
3 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
4 10/05/2015 10:21:31 UserLo
5 13/05/2015 21:10:11 UserIni

chunk 2:
16 10/05/2015 10:14:01 UserInit
20 10/05/2015 10:14:55 FirstNa
42 10/05/2015 10:17:03 UserLo
67 10/05/2015 10:21:31 UserLo
12 13/05/2015 21:10:11 UserIni

chunk 3:
10 10/05/2015 10:14:01 UserInit
10 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
12 10/05/2015 10:21:31 UserLo
198 13/05/2015 21:10:11 UserIni

chunk 4:
5 10/05/2015 10:14:01 UserInit
4 10/05/2015 10:14:55 FirstNa
12 10/05/2015 10:17:03 UserLo
142 10/05/2015 10:21:31 UserLo
158 13/05/2015 21:10:11 UserIni

Pages 55-57: Apache Spark - when things go wrong

example: HadoopRDD

class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
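The number of splits can be nudged from the call site; a sketch, assuming the same journal path:

// minPartitions is only a lower-bound hint handed to getSplits above
val journal = sc.textFile("hdfs://journal/*", minPartitions = 64)
println(journal.partitions.size) // at least 64, exact count decided by the input format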

Pages 58-61: Apache Spark - when things go wrong

example: MapPartitionsRDD

val journal = sc.textFile("hdfs://journal/*")

val fromMarch = journal.filter { e =>
  LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1)
}

How is a MapPartitionsRDD partitioned?

MapPartitionsRDD inherits partition information from its parent RDD:

class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
  ...
  override def getPartitions: Array[Partition] = firstParent[T].partitions
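A quick way to see that inheritance, assuming the journal and fromMarch values above:

println(journal.partitions.size)   // e.g. 6
println(fromMarch.partitions.size) // the same 6: filter repartitions nothing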

Pages 62-63: Apache Spark - when things go wrong

What is a RDD?

RDD needs to hold 3 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data

Pages 64-65: Apache Spark - when things go wrong

RDD parent

sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
  }
  .take(300)
  .foreach(println)

Page 66: Apache Spark - when things go wrong

RDD parent

sc.textFile()
  .groupBy()
  .map { }
  .filter { }
  .take()
  .foreach()

Pages 67-72: Apache Spark - when things go wrong

Directed acyclic graph

sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()

HadoopRDD -> ShuffledRDD -> MapPartRDD -> MapPartRDD

Pages 73-75: Apache Spark - when things go wrong

Directed acyclic graph

HadoopRDD -> ShuffledRDD -> MapPartRDD -> MapPartRDD

sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()

Two types of parent dependencies:

1. narrow dependency
2. wide dependency
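The dependency type is visible on the RDD itself; a small sketch, assuming the talk's extractDate helper:

val grouped = sc.textFile("hdfs://journal/*").groupBy(extractDate _)
val counted = grouped.map { case (date, events) => (date, events.size) }
println(grouped.dependencies) // List(org.apache.spark.ShuffleDependency@...)  - wide
println(counted.dependencies) // List(org.apache.spark.OneToOneDependency@...) - narrow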

Pages 76-79: Apache Spark - when things go wrong

Directed acyclic graph

HadoopRDD -> ShuffledRDD -> MapPartRDD -> MapPartRDD

sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()

Tasks

(Diagram: tasks laid over the DAG, one per partition in each stage.)

Pages 80-83: Apache Spark - when things go wrong

Directed acyclic graph

sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()

(Diagram: the DAG is cut at the shuffle - Stage 1 covers sc.textFile() and .groupBy(), Stage 2 the rest.)

Two important concepts:

1. shuffle write - tasks of Stage 1 write their grouped output to local disk
2. shuffle read - tasks of Stage 2 fetch that output over the network

Pages 84-86: Apache Spark - when things go wrong

toDebugString

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
  }

scala> events.toDebugString
res5: String =
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []

The numbers in parentheses are partition counts; the indentation marks the stage boundary at the shuffle.

Pages 87-88: Apache Spark - when things go wrong

What is a RDD?

RDD needs to hold 3 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data

Pages 89-92: Apache Spark - when things go wrong

Running Job aka materializing DAG

sc.textFile() .groupBy() .map { } .filter { } .collect()

(Diagram: the same two stages; collect() is the action.)

Actions are implemented using the sc.runJob method.

Pages 93-96: Apache Spark - when things go wrong

Running Job aka materializing DAG

/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
  rdd: RDD[T],
  partitions: Seq[Int],
  func: Iterator[T] => U,
): Array[U]

Pages 97-98: Apache Spark - when things go wrong

Running Job aka materializing DAG

/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
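Nothing stops you from calling the same primitive yourself; a sketch, assuming the journal RDD from earlier:

// one task per partition, each returning its local element count
val perPartition: Array[Long] =
  sc.runJob(journal, (iter: Iterator[String]) => iter.size.toLong)
println(perPartition.sum) // the same figure journal.count would give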

Page 99: Apache Spark - when things go wrong

Multiple jobs for a single action

/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use the
 * results from that partition to estimate the number of additional partitions needed to satisfy
 * the limit.
 */
def take(num: Int): Array[T] = {
  (...)
  val left = num - buf.size
  val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
  (...)
  res.foreach(buf ++= _.take(num - buf.size))
  partsScanned += numPartsToTry
  (...)
  buf.toArray
}

Page 100: Apache Spark - when things go wrong

Let's test what we've learned

Pages 101-103: Apache Spark - when things go wrong

Towards efficiency

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
  }

scala> events.toDebugString
(4) MapPartitionsRDD[22] at filter at <console>:50 []
 |  MapPartitionsRDD[21] at map at <console>:49 []
 |  ShuffledRDD[20] at groupBy at <console>:48 []
 +-(6) HadoopRDD[17] at textFile at <console>:47 []

events.count

Pages 104-117: Apache Spark - when things go wrong

Stage 1

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1) }

(Animation: Stage 1 tasks run in parallel on node 1, node 2 and node 3, each reading its local partitions.)

Pages 118-123: Apache Spark - when things go wrong

Everyday I'm Shuffling

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1) }

(Animation: the groupBy output is shuffled between node 1, node 2 and node 3.)

Pages 124-138: Apache Spark - when things go wrong

Stage 2

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1) }

(Animation: Stage 2 tasks on node 1, node 2 and node 3 read the shuffled groups and apply map and filter.)

Pages 139-141: Apache Spark - when things go wrong

Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter { case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1) }

Pages 142-145: Apache Spark - when things go wrong

Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1) }
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }

(Animation: filtering before the groupBy means far less data is shuffled between node 1, node 2 and node 3.)

Pages 146-149: Apache Spark - when things go wrong

Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1) }
  .groupBy(extractDate _, 6)
  .map { case (date, events) => (date, events.size) }

(The second argument to groupBy sets the number of result partitions.)

Pages 150-151: Apache Spark - when things go wrong

Let's refactor

val events = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1) }
  .map(e => (extractDate(e), e))
  .combineByKey(e => 1, (counter: Int, e: String) => counter + 1, (c1: Int, c2: Int) => c1 + c2)

The groupBy plus map pair is replaced by combineByKey, which combines on the map side and never materializes the per-date event groups.
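For plain counting, reduceByKey is shorthand for exactly this combineByKey and shuffles just as little; a sketch, assuming the talk's extractDate helper:

val counts = sc.textFile("hdfs://journal/*")
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1) }
  .map(e => (extractDate(e), 1))
  .reduceByKey(_ + _) // combines on the map side, like the combineByKey above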

Pages 152-153: Apache Spark - when things go wrong

A bit more about partitions

val events = sc.textFile("hdfs://journal/*") // here a small number of partitions, let's say 4
  .repartition(256) // note, this will cause a shuffle
  .map(e => (extractDate(e), e))

Pages 154-155: Apache Spark - when things go wrong

A bit more about partitions

val events = sc.textFile("hdfs://journal/*") // here a lot of partitions, let's say 1024
  .filter { e => LocalDate.parse(extractDate(e)) isAfter LocalDate.of(2015, 3, 1) }
  .coalesce(64) // this will NOT shuffle
  .map(e => (extractDate(e), e))
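Both behaviours can be verified cheaply from the shell; a sketch with assumed sizes:

val raw = sc.textFile("hdfs://journal/*")
println(raw.partitions.size)                  // e.g. 1024
println(raw.coalesce(64).partitions.size)     // 64, without a shuffle
println(raw.repartition(256).partitions.size) // 256, at the cost of a full shuffle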

Page 156: Apache Spark - when things go wrong

Enough theory

Pages 157-186: Apache Spark - when things go wrong

(Demo slides - Spark UI screenshots; no transcript text.)

Page 187: Apache Spark - when things go wrong

Few optimization tricks

Pages 188-192: Apache Spark - when things go wrong

Few optimization tricks

1. Serialization issues (e.g. KryoSerializer)
2. Turn on speculation if on a shared cluster
3. Experiment with compression codecs
4. Learn the API
   a. groupBy->map vs combineByKey
   b. map vs mapValues (partitioner)
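Items 1-3 are plain configuration switches; a minimal sketch (the app name and codec choice are assumptions):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("journal-stats")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") // 1
  .set("spark.speculation", "true")                                      // 2
  .set("spark.io.compression.codec", "lz4")                              // 3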

Pages 193-205: Apache Spark - when things go wrong

Spark performance - shuffle optimization

map -> groupBy -> join

Optimization: the shuffle for the join is avoided if the data is already partitioned (here, by the groupBy).

map -> groupBy -> map -> join

A map in between drops the partitioner, so the join has to shuffle again.

map -> groupBy -> mapValues -> join

mapValues preserves the partitioner, so the join still avoids the shuffle.
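The difference is visible on the partitioner field; a sketch with assumed toy data:

import org.apache.spark.HashPartitioner

val byDate = sc.parallelize(Seq(("2015-03-02", 1), ("2015-03-03", 2)))
  .partitionBy(new HashPartitioner(4))
println(byDate.mapValues(_ + 1).partitioner)                  // Some(HashPartitioner) - kept
println(byDate.map { case (k, v) => (k, v + 1) }.partitioner) // None - lost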

Pages 206-210: Apache Spark - when things go wrong

Few optimization tricks

1. Serialization issues (e.g. KryoSerializer)
2. Turn on speculation if on a shared cluster
3. Experiment with compression codecs
4. Learn the API
   a. groupBy->map vs combineByKey
   b. map vs mapValues (partitioner)
5. Understand basics & internals to avoid common mistakes

Example: usage of an RDD within another RDD. Invalid, since only the driver can call operations on RDDs (not the workers):

rdd.map((k,v) => otherRDD.get(k))  ->  rdd.join(otherRdd)
rdd.map(e => otherRdd.map {})      ->  rdd.cartesian(otherRdd)
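The corrected forms in full; a sketch with assumed toy pair RDDs:

val rdd      = sc.parallelize(Seq((1, "a"), (2, "b")))
val otherRdd = sc.parallelize(Seq((1, "x"), (3, "y")))
val joined = rdd.join(otherRdd)      // instead of looking otherRDD up inside rdd.map
val pairs  = rdd.cartesian(otherRdd) // instead of nesting otherRdd.map inside rdd.map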

Pages 211-217: Apache Spark - when things go wrong

Resources & further read

● Spark Summit Talks @ YouTube (SPARK SUMMIT EUROPE: October 27th to 29th)
● "Learning Spark" book, published by O'Reilly
● Apache Spark Documentation
  ○ https://spark.apache.org/docs/latest/monitoring.html
  ○ https://spark.apache.org/docs/latest/tuning.html
● Mailing List aka solution-to-your-problem-is-probably-already-there
● Pretty soon my blog & github :)

Pages 218-223: Apache Spark - when things go wrong

Paweł Szulc
blog: http://rabbitonweb.com
twitter: @rabbitonweb
github: https://github.com/rabbitonweb