Why your Spark Job is Failing
Kostas Sakellis
Me
• Software Engineer at Cloudera
• Contributor to Apache Spark
• Before that, worked on Cloudera Manager
com.esotericsoftware.kryo.KryoException: Unable to find class: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$4$$anonfun$apply$3
We go about our day ignoring manholes until…
Courtesy of: http://www.independent.co.uk/incoming/article9127706.ece/binary/original/maholev23.jpg
… something goes wrong.
Courtesy of: http://greenpointers.com/wp-content/uploads/2015/03/Manhole-Explosion1.jpg
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
  at java.lang.Double.parseDouble(Double.java:540)
  at scala.collection.immutable.StringLike[...]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
[...]
Job? What now?
Courtesy of: http://calvert.lib.md.us/jobs_pic.jpg
Example
sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()
Then what the heck is a stage?
Courtesy of: https://writinginadeadworld.files.wordpress.com/2014/03/rock1.jpeg
Partitions
sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)
  .sum()
[Diagram: the HDFS file is read as four partitions (Partition 1-4)]
RDDs
[Diagram, built up across four slides: textFile reads the four HDFS partitions into RDD1, one partition each; map turns each partition of RDD1 into a partition of RDD2; filter turns each partition of RDD2 into a partition of RDD3; the sum action then computes the result from RDD3's four partitions]
RDD Lineage
[Diagram: the same pipeline with the dependency chain HDFS -> RDD1 -> RDD2 -> RDD3 -> Sum highlighted; this chain is the RDD's lineage]
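To see the lineage Spark has recorded for yourself, RDD.toDebugString prints the dependency chain. A minimal sketch, assuming any text file of integers (the hdfs:// path is a placeholder):
// Build the same pipeline, then ask Spark to describe it.
val rdd = sc.textFile("hdfs://...", 4)
  .map((x) => x.toInt)
  .filter(_ > 10)

// Prints each RDD in the chain (MapPartitionsRDD, HadoopRDD, ...)
// along with its partition count.
println(rdd.toDebugString)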
RDD Dependencies
• Narrow and wide dependencies
[Diagram: each partition of RDD2 and RDD3 depends on exactly one parent partition, i.e. narrow dependencies]
Wide Dependencies
• Sometimes records need to be grouped together
• Examples:
  • join
  • groupByKey
• Stages are created at wide dependency boundaries, as in the sketch below
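A minimal sketch of where the stage boundary lands (the key function is illustrative): everything before the groupByKey is narrow and stays in one stage; the grouping forces a shuffle that begins the next.
// Narrow: each output partition depends on exactly one input partition.
val pairs = sc.textFile("hdfs://...")
  .map(line => (line.take(1), line)) // illustrative key: first character
  .filter { case (_, v) => v.nonEmpty }

// Wide: all values for a key must be brought together, so Spark
// shuffles here and a new stage begins.
val grouped = pairs.groupByKey()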
A More Interesting Spark Job
val rdd1 = sc.textFile("hdfs://...")
  .map(someFunc)
  .filter(filterFunc)

val rdd2 = sc.hadoopFile("hdfs://...")
  .groupByKey()
  .map(someOtherFunc)

val rdd3 = rdd1.join(rdd2)
  .map(someFunc)

rdd3.collect()
[Diagram: rdd1 is the fragment textFile -> map -> filter]
[Diagram: rdd2 is the fragment hadoopFile -> groupByKey -> map]
[Diagram: rdd3 is the fragment join -> map]
[Diagram: rdd3.collect() runs the full DAG; the wide dependencies at groupByKey and join split it into four stages (1-4)]
Get to the point before I stop caring!
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
  at java.lang.Double.parseDouble(Double.java:540)
  at scala.collection.immutable.StringLike[...]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
[...]
What was the failure?
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0" [...]
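Note that the "input string" in the exception is an entire CSV line, which suggests the job called toDouble on the raw line instead of splitting it first. A guess at the bug and its fix (the path and record layout are assumed from the error message):
// Likely culprit: treating the whole CSV line as a single number.
val broken = sc.textFile("hdfs://...")
  .map(_.toDouble) // java.lang.NumberFormatException on every line

// Fix: split into fields, then parse each field.
val fixed = sc.textFile("hdfs://...")
  .map(_.split(",").map(_.toDouble))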
[Diagram, built up across three slides: a stage made up of four tasks; a failed task is retried]
spark.task.maxFailures=4
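A task is retried up to spark.task.maxFailures times (4 by default) before Spark gives up on the stage and aborts the job. A sketch of tuning it, with an illustrative value:
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative: tolerate more per-task failures before aborting the job.
val conf = new SparkConf()
  .setAppName("retry-demo")
  .set("spark.task.maxFailures", "8") // default is 4
val sc = new SparkContext(conf)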
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, kostas-4.vpc.cloudera.com): java.lang.NumberFormatException: For input string: "3.9166,10.2491,-4.0926,-4.4659,0"
  at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1250)
  at java.lang.Double.parseDouble(Double.java:540)
  at scala.collection.immutable.StringLike[...]
Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
[...]
ERROR executor.Executor: Exception in task ID 2866
java.io.IOException: Filesystem closed
  at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:565)
  at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:648)
  at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:706)
  at java.io.DataInputStream.read(DataInputStream.java:100)
  at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:209)
  at org.apache.hadoop.util.LineReader.readLine(LineReader.java:173)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
  at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
  at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:164)
[...]
Spark Architecture
YARN Architecture
[Diagram: a Client submits to the Resource Manager; Node Managers host containers, one running the Application Master and the rest running application processes]
Spark on YARN Architecture
[Diagram, built up across two slides: the Client submits to the Resource Manager; the Application Master runs in one YARN container, and the Spark executors run as processes in the remaining containers on the Node Managers]
spark-submit --executor-memory 2g \
  --master yarn-client \
  --num-executors 2 \
  --executor-cores 2
Container [pid=63375,containerID=container_1388158490598_0001_01_000003] is running beyond physical memory limits. Current usage: 2.2 GB of 2.1 GB physical memory used; 2.8 GB of 4.2 GB virtual memory used. Killing container. [...]
spark-submit --executor-memory 2g \
  --master yarn-client \
  --num-executors 2 \
  --executor-cores 2
Memory allocation
[Diagram: how the pieces nest inside a YARN container]
• yarn.nodemanager.resource.memory-mb: total memory a Node Manager can hand out to containers
• Executor container = spark.executor.memory + spark.yarn.executor.memoryOverhead (7%; 10% in 1.4)
• Inside spark.executor.memory: spark.shuffle.memoryFraction (0.4) and spark.storage.memoryFraction (0.6)
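When YARN kills containers for exceeding physical memory, the usual remedy is to grow the overhead allowance rather than the heap, since the overage typically comes from off-heap usage. A sketch with illustrative values (the overhead is given in megabytes in Spark 1.x):
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative: give each executor container more non-heap headroom.
val conf = new SparkConf()
  .setAppName("overhead-demo")
  .set("spark.executor.memory", "2g")
  .set("spark.yarn.executor.memoryOverhead", "768") // MB
val sc = new SparkContext(conf)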
Sometimes jobs run slow or even…
Courtesy of: http://blog.sdrock.com/pastors/files/2013/06/time-clock.jpg
java.lang.OutOfMemoryError: GC overhead limit exceeded
  at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1986)
  at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
  at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
[...]
GC Stalls
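Long GC pauses on executors often show up only as mysteriously slow tasks. A sketch of making them visible, using standard HotSpot GC-logging flags passed through spark.executor.extraJavaOptions:
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative: write GC activity into the executor logs.
val conf = new SparkConf()
  .setAppName("gc-visibility")
  .set("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
val sc = new SparkContext(conf)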
Too much spilling!
Courtesy of: http://tgnp.me/wp-content/uploads/2014/05/spilled-starbucks.jpg
Shuffle Boundaries
[Diagram: the same DAG (textFile -> map -> filter and hadoopFile -> groupByKey -> map feeding join -> map) with shuffles marked at the wide dependency boundaries]
Most performance issues are in shuffles!
Inside a Task: Fetch & Aggregate
[Diagram: fetched shuffle blocks are deserialized into an ExternalAppendOnlyMap of key -> values; when the map fills up, its contents are sorted and spilled to disk]
Inside a Task: Specify partitions
rdd.reduceByKey(reduceFunc, numPartitions = 1000)
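Besides the numPartitions argument that most shuffle operators accept, an RDD's partitioning can also be changed directly. A sketch, assuming rdd and reduceFunc from the slide:
// Pass the partition count straight into the shuffle...
val reduced = rdd.reduceByKey(reduceFunc, 1000)

// ...or reshape an existing RDD explicitly.
val wider = reduced.repartition(2000) // full shuffle into 2000 partitions
val narrower = reduced.coalesce(100)  // merges partitions, avoiding a full shuffle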
Why not set partitions to ∞?
Excessive parallelism
• Overwhelming scheduler overhead
• More fetches -> more disk seeks
• Driver needs to track state per task
So how to choose?
• Easy answer: keep multiplying the partition count by 1.5 and see what works, as sketched below
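A sketch of that heuristic, assuming rdd and reduceFunc from the earlier slide; time each run (the Spark UI shows per-stage durations) and stop once more partitions no longer help:
// Illustrative: try geometrically growing partition counts (x1.5 each step).
Seq(100, 150, 225, 338).foreach { n =>
  rdd.reduceByKey(reduceFunc, n).count() // compare stage times in the Spark UI
}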
Is Spark bad?
Courtesy of: https://theferkel.files.wordpress.com/2015/04/250474-breaking-bad-quotes.jpg
Thank you