Top 5 mistakes when writing Spark applications
tiny.cloudera.com/spark-mistakes
Mark Grover | Software Engineer, Cloudera | @mark_grover
Ted Malaska | Technical Group Architect, Blizzard | @TedMalaska
About the book
• @hadooparchbook
• hadooparchitecturebook.com
• github.com/hadooparchitecturebook
• slideshare.com/hadooparchbook
Mistakes people make when using Spark
Mistakes we’ve made when using Spark
Mistakes people make when using Spark
Mistake # 1
Number of executors, cores, memory!?!
• 6 nodes
• 16 cores each
• 64 GB of RAM each
Decisions, decisions, decisions
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)

• 6 nodes
• 16 cores each
• 64 GB of RAM
Spark Architecture recap
Answer #1 – Most granular
• Have the smallest-sized executors possible
• 1 core each
• 64 GB/node / 16 executors/node = 4 GB/executor
• Total of 16 cores x 6 nodes = 96 cores => 96 executors
[Diagram: one worker node running Executor 1 through Executor 16]
Why is this wrong?
• It gives up the benefits of running multiple tasks in the same executor JVM (for example, sharing broadcast variables and cached data)
Answer #2 – Least granular
• 6 executors in total => 1 executor per node
• 64 GB memory each
• 16 cores each
[Diagram: one worker node running a single executor]
Why is this wrong?
• Need to leave some overhead (memory and a core) for the OS and Hadoop daemons
Answer #3 – with overhead
• 6 executors – 1 executor/node
• 63 GB memory each
• 15 cores each
[Diagram: one worker node running a single executor, plus overhead (1 GB, 1 core)]
Let’s assume…
• You are running Spark on YARN, from here on…
3 things
• 3 other things to keep in mind
#1 – Memory overhead
• --executor-memory controls the heap size
• Need some overhead (controlled by spark.yarn.executor.memoryOverhead) for off-heap memory
• Default is max(384 MB, 0.07 * spark.executor.memory); later Spark releases raise the factor to 0.10
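To make the default overhead rule concrete, here is a minimal sketch in plain Scala (no Spark dependency). The object and method names are hypothetical, and the 0.07 factor is simply the one quoted on the slide, so treat the numbers as illustrative rather than as what any given Spark version will request.

```scala
// Sketch only: models the default overhead rule quoted on the slide.
object MemoryOverheadSketch {
  val MinOverheadMb: Long = 384L
  val OverheadFactor: Double = 0.07 // factor quoted on the slide (an assumption here)

  /** Off-heap overhead in MB that is added on top of the executor heap. */
  def overheadMb(executorMemoryMb: Long): Long =
    math.max(MinOverheadMb, (OverheadFactor * executorMemoryMb).toLong)

  /** Total memory in MB that one executor container would ask YARN for. */
  def containerMb(executorMemoryMb: Long): Long =
    executorMemoryMb + overheadMb(executorMemoryMb)
}
```

For a 4 GB heap the 384 MB floor applies; for larger heaps the percentage term takes over.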
#2 – YARN AM needs a core: Client mode
#2 – YARN AM needs a core: Cluster mode
#3 – HDFS throughput
• 15 cores per executor can lead to bad HDFS I/O throughput
• Best to keep to at most 5 cores per executor
Calculations
• 5 cores per executor
  – For max HDFS throughput
• Cluster has 6 * 15 = 90 cores in total (after taking out Hadoop/YARN daemon cores)
• 90 cores / 5 cores/executor = 18 executors
• Each node has 3 executors
• 63 GB / 3 = 21 GB, 21 x (1 - 0.07) ~ 19 GB
• 1 executor for AM => 17 executors
[Diagram: one worker node running Executor 1, Executor 2 and Executor 3, plus overhead]
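The arithmetic above can be sketched as a small plain-Scala calculation. Everything named here (`ExecutorSizingSketch`, `Plan`, `plan`) is hypothetical, and the reservations (1 core and 1 GB per node, one executor slot for the AM, a 0.07 overhead factor) are just the assumptions the slides use, not fixed rules.

```scala
// Sketch only: reproduces the slide's sizing arithmetic under its stated assumptions.
object ExecutorSizingSketch {
  final case class Plan(numExecutors: Int, coresPerExecutor: Int, memoryPerExecutorGb: Int) {
    /** The three spark-submit flags discussed in the talk. */
    def submitFlags: String =
      s"--num-executors $numExecutors --executor-cores $coresPerExecutor " +
        s"--executor-memory ${memoryPerExecutorGb}G"
  }

  def plan(nodes: Int, coresPerNode: Int, memPerNodeGb: Int,
           coresPerExecutor: Int = 5, overheadFactor: Double = 0.07): Plan = {
    val usableCores = coresPerNode - 1                 // leave 1 core per node for OS/Hadoop daemons
    val usableMemGb = memPerNodeGb - 1                 // leave 1 GB per node for OS/Hadoop daemons
    val executorsPerNode = usableCores / coresPerExecutor
    val totalExecutors = nodes * executorsPerNode - 1  // give up one slot for the YARN AM
    val slotGb = usableMemGb.toDouble / executorsPerNode
    val heapGb = math.floor(slotGb * (1 - overheadFactor)).toInt // heap after off-heap overhead
    Plan(totalExecutors, coresPerExecutor, heapGb)
  }
}
```

Calling `ExecutorSizingSketch.plan(nodes = 6, coresPerNode = 16, memPerNodeGb = 64)` follows the same steps as the slide; the per-node division used here is equivalent to the slide's cluster-wide 90 / 5 for this cluster.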
Correct answer
• 17 executors in total
• 19 GB memory/executor
• 5 cores/executor
* Not etched in stone
[Diagram: one worker node running Executor 1, Executor 2 and Executor 3, plus overhead]
Dynamic allocation helps with this though, right?
• Dynamic allocation allows Spark to dynamically scale the cluster resources allocated to your application based on the workload.
• Works with Spark on YARN
Decisions with Dynamic Allocation
• Number of executors (--num-executors)
• Cores for each executor (--executor-cores)
• Memory for each executor (--executor-memory)

• 6 nodes
• 16 cores each
• 64 GB of RAM
Read more
• From a great blog post on this topic by Sandy Ryza: http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
Mistake # 2
Application failure
15/04/16 14:13:03 WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.215.149.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
  at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
  at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
  at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
  at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
  at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
  at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
  at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
  at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
Why?
• No Spark shuffle block can be greater than 2 GB
Ok, what’s a shuffle block again?
• In MapReduce terminology, a file written from one Mapper for a Reducer
• The Reducer makes a local copy of this file (reducer local copy) and then ‘reduces’ it
Defining shuffle and partition
Each yellow arrow in this diagram represents a shuffle block.
Each blue block is a partition.
Once again
• Overflow exception if shuffle block size > 2 GB
What’s going on here?
• Spark uses ByteBuffer as abstraction for blocks
  val buf = ByteBuffer.allocate(length.toInt)
• ByteBuffer is limited by Integer.MAX_VALUE (2 GB)!
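A small plain-Scala sketch of why the limit bites: a block length is a `Long`, but `ByteBuffer.allocate` takes an `Int`, so the narrowing conversion goes wrong for anything above `Int.MaxValue`. The object and helper names are hypothetical; only `java.nio.ByteBuffer.allocate` is a real API here.

```scala
import java.nio.ByteBuffer
import scala.util.Try

// Sketch only: illustrates the Int-sized capacity of a ByteBuffer.
object ByteBufferLimitSketch {
  /** True if a block of this many bytes can be held in a single ByteBuffer. */
  def fitsInByteBuffer(lengthBytes: Long): Boolean =
    lengthBytes >= 0 && lengthBytes <= Int.MaxValue

  /** Mirrors the slide's allocate(length.toInt). For a 3 GiB length the
    * narrowing to Int yields a negative number and allocate throws. */
  def tryAllocate(lengthBytes: Long): Try[ByteBuffer] =
    Try(ByteBuffer.allocate(lengthBytes.toInt))
}
```

Checking `fitsInByteBuffer` before allocating is the safer pattern, since some oversized lengths wrap around to a positive `Int` and would allocate a buffer of the wrong size instead of failing.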
Spark SQL
• Especially problematic for Spark SQL
• Default number of partitions to use when doing shuffles is 200
  – This low number of partitions leads to high shuffle block size
Umm, ok, so what can I do?
1. Increase the number of partitions
   – Thereby reducing the average partition size
2. Get rid of skew in your data
   – More on that later
Umm, how exactly?
• In Spark SQL, increase the value of spark.sql.shuffle.partitions
• In regular Spark applications, use rdd.repartition() or rdd.coalesce() (the latter to reduce #partitions, if needed)
But, how many partitions should I have?
• Rule of thumb is around 128 MB per partition
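As a rough illustration of the rule of thumb, the partition count can be derived from the data size. This is a plain-Scala sketch with hypothetical names; the 128 MB target is the slide's figure, and real jobs may need to account for compression and skew.

```scala
// Sketch only: turns "~128 MB per partition" into a partition count.
object PartitionCountSketch {
  val TargetPartitionBytes: Long = 128L * 1024 * 1024

  /** Number of partitions so that each holds at most targetBytes on average. */
  def partitionsFor(totalBytes: Long, targetBytes: Long = TargetPartitionBytes): Int = {
    require(totalBytes >= 0 && targetBytes > 0, "sizes must be non-negative / positive")
    // Ceiling division, with at least one partition.
    math.max(1L, (totalBytes + targetBytes - 1) / targetBytes).toInt
  }
}
```

The result is the kind of value one might then pass to rdd.repartition() or set as spark.sql.shuffle.partitions.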
But! There’s more!
• Spark uses a different data structure for bookkeeping during shuffles, when the number of partitions is less than 2000, vs. more than 2000.
Don’t believe me?
• In MapStatus.scala
def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
  if (uncompressedSizes.length > 2000) {
    HighlyCompressedMapStatus(loc, uncompressedSizes)
  } else {
    new CompressedMapStatus(loc, uncompressedSizes)
  }
}
Ok, so what are you saying?
If number of partitions < 2000, but not by much, bump it to be slightly higher than 2000.
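One way to encode this advice is a tiny helper. This is a sketch in plain Scala: `adjust` and its `closeness` parameter are hypothetical, and what counts as "not by much" is a judgment call (90% of the threshold is assumed here). The target is 2001 because the check in MapStatus.scala is a strict `> 2000`.

```scala
// Sketch only: bump partition counts that sit just at or below the threshold.
object PartitionBumpSketch {
  val Threshold = 2000

  /** Returns Threshold + 1 when partitions is within `closeness` of the
    * threshold (and not already above it); otherwise returns it unchanged. */
  def adjust(partitions: Int, closeness: Double = 0.9): Int = {
    val lowerBound = (Threshold * closeness).toInt
    if (partitions >= lowerBound && partitions <= Threshold) Threshold + 1
    else partitions
  }
}
```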
Can you summarize, please?
• Don’t have partitions that are too big
  – Your job will fail due to the 2 GB limit
• Don’t have too few partitions
  – Your job will be slow, not making use of parallelism
• Rule of thumb: ~128 MB per partition
• If #partitions < 2000, but close, bump to just > 2000
• Track SPARK-6235 for removing various 2 GB limits
Mistake # 3
Slow jobs on Join/Shuffle
• Your dataset takes 20 seconds to run over with a map job, but takes 4 hours when joined or shuffled. What’s wrong?
Mistake – Skew
[Diagram: "Single Thread" compared with "Normal Distributed" execution, with the work shown as several equal "Single Thread" blocks]
The Holy Grail of Distributed Systems
Mistake – Skew
[Diagram: "Single Thread" compared with "Normal Distributed" execution]
What about Skew, because that is a thing
Mistake – Skew: Answers
• Salting
• Isolated Salting
• Isolated Map Joins
Mistake – Skew: Salting
• Normal key: “Foo”
• Salted key: “Foo” + random.nextInt(saltFactor)
Managing Parallelism
Mistake – Skew: Salting
©2014 Cloudera, Inc. All rights reserved.
Add Example Slide
Mistake – Skew: Salting
• Two-stage aggregation
  – Stage one to do operations on the salted keys
  – Stage two to do operations across the unsalted key results
Data Source -> Map (convert to salted key & value tuple) -> Reduce by salted key -> Map (convert results to key & value tuple) -> Reduce by key -> Results
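The two-stage flow can be sketched with plain Scala collections standing in for RDDs. All names here are hypothetical, `sumByKey` merely plays the role of a reduce-by-key, and the salt is kept as a separate tuple field (rather than appended to the string as on the earlier slide) so that removing it is unambiguous.

```scala
import scala.util.Random

// Sketch only: two-stage aggregation over salted keys, using local collections.
object SaltedAggregationSketch {
  /** Salted key: the original key plus a random salt in [0, saltFactor). */
  def salt(key: String, saltFactor: Int, rnd: Random): (String, Int) =
    (key, rnd.nextInt(saltFactor))

  /** Sums values per key; stands in for a reduce-by-key step. */
  def sumByKey[K](pairs: Seq[(K, Long)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  def twoStageSum(records: Seq[(String, Long)], saltFactor: Int,
                  seed: Long = 42L): Map[String, Long] = {
    val rnd = new Random(seed)
    // Map: convert to salted key & value tuples.
    val salted = records.map { case (k, v) => (salt(k, saltFactor, rnd), v) }
    // Stage one: reduce by salted key, so a hot key is spread over up to saltFactor groups.
    val partial = sumByKey(salted)
    // Map: drop the salt again (toSeq keeps all partial results for the same key).
    val unsalted = partial.toSeq.map { case ((k, _), v) => (k, v) }
    // Stage two: reduce by the original key.
    sumByKey(unsalted)
  }
}
```

The final result is the same as a direct per-key sum; the point of the salt is only that stage one no longer funnels every record for a hot key through a single reducer.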
Mistake – Skew: Isolated Salting
• Second stage only required for isolated keys
Data Source -> Map (convert to key & value; isolate key and convert to salted key & value tuple) -> Reduce by key & salted key -> Filter isolated keys from salted keys -> Map (convert results to key & value tuple) -> Reduce by key -> Union to results
Mistake – Skew: Isolated Map Join
• Filter out isolated keys and use map join/aggregate on those
• And normal reduce on the rest of the data
• This can remove a large amount of data being shuffled
Data Source -> Filter normal keys from isolated keys -> Reduce by normal key (normal keys) / Map join (isolated keys) -> Union to results
Managing Parallelism: Cartesian Join
[Diagram: three map tasks, each writing shuffle files Tmp 1 to Tmp 4, feeding four reduce tasks. Amount of data: 10x, 100x, 1000x, 10000x, 100000x, 1000000x or more]
Managing Parallelism
• How to fight Cartesian join
  – Nested structures
Table X: (A, 1), (A, 2), (A, 3)
Table Y: (A, 4), (A, 5), (A, 6)
JOIN result: (A, 1, 4), (A, 2, 4), (A, 3, 4), (A, 1, 5), (A, 2, 5), (A, 3, 5), (A, 1, 6), (A, 2, 6), (A, 3, 6)
OR nested under key A: (A, 1), (A, 2), (A, 3), (A, 4), (A, 5), (A, 6)
Managing Parallelism
• How to fight Cartesian join
  – Nested structures
create table nestedTable (
  col1 string,
  col2 string,
  col3 array<struct<col3_1: string, col3_2: string>>)
=
// Requires: import org.apache.spark.sql.Row; sc is an existing SparkContext
val rddNested = sc.parallelize(Array(
  Row("a1", "b1", Seq(Row("c1_1", "c2_1"), Row("c1_2", "c2_2"), Row("c1_3", "c2_3"))),
  Row("a2", "b2", Seq(Row("c1_2", "c2_2"), Row("c1_3", "c2_3"), Row("c1_4", "c2_4")))), 2)
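To show the difference in row counts, here is a plain-Scala sketch using the Table X / Table Y values from the earlier slide. The case classes and method names are hypothetical, and local collections stand in for tables: the join emits one row per pair of matching values, while the nested form keeps one row per key.

```scala
// Sketch only: flat join versus nested structure on local collections.
object NestedStructureSketch {
  final case class Flat(key: String, value: Int)
  final case class Nested(key: String, values: Seq[Int])

  /** Join on key: one output row for every matching (x, y) pair. */
  def join(x: Seq[Flat], y: Seq[Flat]): Seq[(String, Int, Int)] =
    for { a <- x; b <- y if a.key == b.key } yield (a.key, a.value, b.value)

  /** Nested alternative: one row per key carrying all of its values. */
  def nest(x: Seq[Flat], y: Seq[Flat]): Seq[Nested] =
    (x ++ y).groupBy(_.key).toSeq.map { case (k, rows) => Nested(k, rows.map(_.value)) }
}
```

With three values per side for key A, the join yields 9 rows and the nested form yields 1 row holding 6 values; the gap widens quickly as the number of values per key grows.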
Mistake # 4
Out of luck?
• Do you ever run out of memory?
• Do you ever have more than 20 stages?
• Is your driver doing a lot of work?
Mistake – DAG Management
• Shuffles are to be avoided
• ReduceByKey over GroupByKey
• TreeReduce over Reduce
• Use Complex/Nested Types
Mistake – DAG Management: Shuffles
• Map-side reduction, where possible
• Think about partitioning/bucketing ahead of time
• Do as much as possible with a single shuffle
• Only send what you have to send
• Avoid Skew and Cartesians
ReduceByKey over GroupByKey
• ReduceByKey can do almost anything that GroupByKey can do
• Aggregations
• Windowing
• Use memory
• But you have more control
• ReduceByKey has a fixed limit on memory requirements
• GroupByKey is unbounded and dependent on data
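The memory argument can be illustrated without Spark. In this plain-Scala sketch (hypothetical names, local collections in place of RDDs), the group-style version has to hold every value for a key before aggregating, whereas the reduce-style version keeps a single running value per key while streaming through the input.

```scala
// Sketch only: contrasts the memory shape of group-then-aggregate and reduce.
object ReduceVsGroupSketch {
  /** groupByKey style: materialise all values for each key, then aggregate. */
  def sumViaGroup(pairs: Seq[(String, Long)]): Map[String, Long] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  /** reduceByKey style: one running value per key, input consumed as a stream. */
  def sumViaReduce(pairs: Iterator[(String, Long)]): Map[String, Long] =
    pairs.foldLeft(Map.empty[String, Long]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }
}
```

Both produce the same sums; the state held by the second grows with the number of distinct keys rather than with the number of records.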
TreeReduce over Reduce
• TreeReduce & Reduce return some result to the driver
• TreeReduce does more work on the executors
• While Reduce brings everything back to the driver
[Diagram: four partitions feeding the driver, labelled 100% at the driver (Reduce); four partitions feeding the driver, labelled 25% at each partition (TreeReduce)]
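The idea can be sketched in plain Scala as combining per-partition results level by level instead of all at once. The names and the `fanIn` parameter are hypothetical; Spark's own `treeReduce` is configured by a depth rather than a fan-in, so this illustrates the shape of the computation, not its implementation.

```scala
import scala.annotation.tailrec

// Sketch only: flat reduction versus level-by-level (tree) reduction of partial results.
object TreeReduceSketch {
  /** Flat reduce: every partial result is combined in one place (the driver's role). */
  def flatReduce[A](partials: Seq[A])(f: (A, A) => A): A = partials.reduce(f)

  /** Tree reduce: combine partials in groups of fanIn per level until at most
    * fanIn remain, so the final step only merges a handful of values. */
  @tailrec
  def treeReduce[A](partials: Seq[A], fanIn: Int)(f: (A, A) => A): A = {
    require(partials.nonEmpty && fanIn >= 2, "need at least one partial and fanIn >= 2")
    if (partials.size <= fanIn) partials.reduce(f)
    else treeReduce(partials.grouped(fanIn).map(_.reduce(f)).toVector, fanIn)(f)
  }
}
```

For an associative function both give the same answer; with 100 partials and a fan-in of 4, the last step merges only 2 values instead of 100.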
Complex Types
• Top N List
• Multiple types of Aggregations
• Windowing operations
• All in one pass
Complex Types
• Think outside of the box: use objects to reduce by
• (Make something simple)
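A minimal sketch of "use objects to reduce by", in plain Scala: a single hypothetical `Stats` object carries several aggregates plus a top-N list, so one reduce pass produces all of them. The field choice and names are assumptions for illustration only.

```scala
// Sketch only: one mergeable object that computes several aggregates in a single pass.
object ComplexTypeSketch {
  final case class Stats(count: Long, sum: Long, min: Long, max: Long, topN: List[Long]) {
    /** Merges two partial results; topN keeps only the n largest values seen so far. */
    def merge(other: Stats, n: Int): Stats = Stats(
      count + other.count,
      sum + other.sum,
      math.min(min, other.min),
      math.max(max, other.max),
      (topN ++ other.topN).sorted(Ordering[Long].reverse).take(n))
  }

  object Stats {
    /** Partial result for a single value. */
    def of(v: Long): Stats = Stats(1L, v, v, v, List(v))
  }

  /** One pass over a non-empty input: map each value to a Stats, then reduce. */
  def summarize(values: Seq[Long], n: Int): Stats =
    values.map(Stats.of).reduce((a, b) => a.merge(b, n))
}
```

Because `merge` is associative, the same object could serve as the value type in a reduce-by-key, giving count, sum, min, max and top N per key from one shuffle.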
Mistake # 5
Ever seen this?
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.hash.HashFunction.hashInt(I)Lcom/google/common/hash/HashCode;
  at org.apache.spark.util.collection.OpenHashSet.org$apache$spark$util$collection$OpenHashSet$$hashcode(OpenHashSet.scala:261)
  at org.apache.spark.util.collection.OpenHashSet$mcI$sp.getPos$mcI$sp(OpenHashSet.scala:165)
  at org.apache.spark.util.collection.OpenHashSet$mcI$sp.contains$mcI$sp(OpenHashSet.scala:102)
  at org.apache.spark.util.SizeEstimator$$anonfun$visitArray$2.apply$mcVI$sp(SizeEstimator.scala:214)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
  at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:210)
  at ...
But!
• I already included protobuf in my app’s Maven dependencies?
Ah!
• My protobuf version doesn’t match Spark’s protobuf version!
Shading
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  ...
  <relocations>
    <relocation>
      <pattern>com.google.protobuf</pattern>
      <shadedPattern>com.company.my.protobuf</shadedPattern>
    </relocation>
  </relocations>
  ...
</plugin>
Future of shading
• Spark 2.0 has some libraries shaded
• Guava is fully shaded
Summary
5 Mistakes
• Size up your executors right
• 2 GB limit on Spark shuffle blocks
• Evil thing about skew and cartesians
• Learn to manage your DAG, yo!
• Do shady stuff, don’t let classpath leaks mess you up
THANK YOU.
tiny.cloudera.com/spark-mistakes
Mark Grover | @mark_grover
Ted Malaska | @TedMalaska