Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc Project Tungsten Advanced Apache Spark Meetup Chris Fregly Principal Data Solutions Engineer We’re Hiring - Only Nice People! Nov 12, 2015

Transcript of Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Page 1: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Power of data. Simplicity of design. Speed of innovation. IBM Spark spark.tc

Project Tungsten Advanced Apache Spark Meetup

Chris Fregly Principal Data Solutions Engineer

We’re Hiring - Only Nice People!

Nov 12, 2015

Page 2: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Who Am I?

Streaming Data Engineer Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Founder Advanced Apache Spark Meetup

Author Advanced .

Due 2016

My Ma’s First Time in California

Page 3: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Random Slide: More Ma “First Time” Pics

In California Using Chopsticks Using “New” iPhone

Page 4: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Upcoming Meetups and Conferences
  London Spark Meetup (Oct 12th)
  Scotland Data Science Meetup (Oct 13th)
  Dublin Spark Meetup (Oct 15th)
  Barcelona Spark Meetup (Oct 20th)
  Madrid Big Data Meetup (Oct 22nd)
  Paris Spark Meetup (Oct 26th)
  Amsterdam Spark Summit (Oct 27th)
  Brussels Spark Meetup (Oct 30th)
  Zurich Big Data Meetup (Nov 2nd)
  Geneva Spark Meetup (Nov 5th)
  San Francisco Datapalooza.io (Nov 10th)
  San Francisco Advanced Spark (Nov 12th)
  Oslo Big Data Hadoop Meetup (Nov 18th)
  Helsinki Spark Meetup (Nov 20th)
  Stockholm Spark Meetup (Nov 23rd)
  Copenhagen Spark Meetup (Nov 26th)
  Budapest Spark Meetup (Nov 27th)
  Singapore Strata Conference (Dec 1st)
  San Francisco Advanced Spark (Dec 8th)
  Mountain View Advanced Spark (Dec 10th)
  Toronto Spark Meetup (Dec 14th)
  Washington DC Spark Meetup (Jan 2016)

Page 5: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Advanced Apache Spark Meetup
Meetup Metrics
  ~1600 Members in just 4 mos!
  4th Most Active Spark Meetup!!
Meetup Goals
  Dig deep into codebase of Spark and related projects
  Study integrations of Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R
  Surface and share patterns and idioms of these well-designed, distributed, big data components

THANKS TO ALL OF YOU!!

Page 6: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

All Slides and Code Are Available!

slideshare.net/cfregly github.com/fluxcapacitor

hub.docker.com/r/fluxcapacitor

Page 7: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Themes of this Talk
  Filter
  Off-Heap
  Parallelize
  Approximate
  Find Similarity
  Minimize Seeks
  Maximize Scans
  Customize for Workload
  Tune Performance At Every Layer
  Be Nice, Collaborate! Like a Mom!!

Page 8: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Outline

①  Mechanical Sympathy
②  Recap of 100TB GraySort Challenge
③  Project Tungsten Deep Dive

Page 9: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Mechanical Sympathy
  "Hardware and software working together in harmony." - Martin Thompson, http://mechanical-sympathy.blogspot.com
  "Whatever your data structure, my array will beat it." - Scott Meyers, Every C++ Book, basically
  Hair Sympathy - Bruce Jenner

Page 10: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Spark and Mechanical Sympathy
  Project Tungsten (Spark 1.4-1.6+)
    Minimize Memory and GC
    Maximize CPU Cache Locality
  GraySort Challenge (Spark 1.1-1.2)
    Saturate Network I/O
    Saturate Disk I/O

Page 11: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

AlphaSort Technique: Sort 100-Byte Records
  Naïve: List[Pointer]
    Must dereference the pointer to reach the key for every comparison
  AlphaSort: List[(Key, Pointer)]
    Key is directly available for comparison; dereference not required!
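The difference between the two layouts can be sketched in a few lines of Scala. This is an illustration only, not the AlphaSort or Spark implementation: records are assumed to be 100-byte arrays whose first 10 bytes are the key, a "pointer" is just an index into the record array, and the names `AlphaSortSketch`, `naiveSort`, and `alphaSort` are made up for this example. On the JVM the tuples are themselves heap objects, so the sketch shows the access pattern rather than the packed memory layout.

```scala
object AlphaSortSketch {
  val KeyLen = 10 // assumption: the key is the first 10 bytes of each record

  // Unsigned lexicographic comparison of the first KeyLen bytes of a and b.
  def compareKeys(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < KeyLen) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    0
  }

  // Naive: sort pointers only; every comparison dereferences two full records.
  def naiveSort(records: Array[Array[Byte]]): Array[Int] =
    records.indices.toArray.sortWith((p, q) => compareKeys(records(p), records(q)) < 0)

  // AlphaSort-style: copy each key next to its pointer once, then sort the
  // (key, pointer) pairs without touching the records again.
  def alphaSort(records: Array[Array[Byte]]): Array[Int] = {
    val entries: Array[(Array[Byte], Int)] =
      records.zipWithIndex.map { case (rec, ptr) => (rec.take(KeyLen), ptr) }
    entries.sortWith((x, y) => compareKeys(x._1, y._1) < 0).map(_._2)
  }
}
```

Both functions return the same ordering of pointers; only the memory touched during comparisons differs.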

Page 12: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU Cache Line and Memory Sympathy
  Key (10 bytes) + Pointer (4 bytes*) = 14 bytes: Not CPU Cache-line Friendly!
    *4 bytes with Compressed OOPs
  Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes: CPU Cache-line Friendly!
  Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes: 2x CPU Cache-line Friendly!
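The arithmetic behind "cache-line friendly" can be checked with a small sketch. It assumes a 64-byte cache line (typical for x86-64, but not stated on the slide) and an array of entries that starts on a cache-line boundary; `CacheLineMath` and its method names are hypothetical.

```scala
object CacheLineMath {
  val CacheLineBytes = 64 // assumption: typical x86-64 cache line size

  // How many whole entries fit in one cache line.
  def wholeEntriesPerLine(entryBytes: Int): Int = CacheLineBytes / entryBytes

  // Of the first `count` entries laid out back to back from offset 0,
  // how many have bytes in two different cache lines.
  def straddling(entryBytes: Int, count: Int): Int =
    (0 until count).count { i =>
      val first = i * entryBytes
      val last = first + entryBytes - 1
      first / CacheLineBytes != last / CacheLineBytes
    }
}
```

Under these assumptions, 14-byte entries regularly straddle two lines, 16-byte entries never do (4 per line), and 8-byte entries pack 8 per line, which is where the "2x" comes from.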

Page 13: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Performance Comparison

Page 14: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Similar Trick: Direct Cache Access (DCA)
  Pull out packet header alongside pointer to payload

Page 15: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU Cache Lines: Sequential vs. Random

Page 16: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU Cache Naïve Matrix Multiplication

    // Dot product of each row & column vector
    for (i <- 0 until numRowsA)
      for (j <- 0 until numColsB)
        for (k <- 0 until numColsA)
          res(i)(j) += matA(i)(k) * matB(k)(j)

Bad: the inner loop walks matB down a column (a different row on every step), so it does not use the full CPU cache line and pre-fetching is ineffective

Page 17: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU Cache Friendly Matrix Multiplication

    // Transpose B (matBT has numColsB rows and numRowsB columns)
    for (i <- 0 until numColsB)
      for (j <- 0 until numRowsB)
        matBT(i)(j) = matB(j)(i)

    // Modify dot product calculation for B Transpose
    for (i <- 0 until numRowsA)
      for (j <- 0 until numColsB)
        for (k <- 0 until numColsA)
          res(i)(j) += matA(i)(k) * matBT(j)(k)

Good: Full CPU cache line, effective prefetching

OLD: res(i)(j) += matA(i)(k) * matB(k)(j)

Reference j before k
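For anyone who wants to run the two variants side by side, here is a minimal self-contained sketch. The object and method names (`MatrixMultiplySketch`, `naive`, `cacheFriendly`) are made up for this example, matrices are assumed to be non-empty and rectangular, and no timing numbers are implied; the point is only that both traversals produce the same result.

```scala
object MatrixMultiplySketch {
  type Matrix = Array[Array[Double]]

  // Naive: the inner loop reads b(k)(j), jumping to a different row of b each step.
  def naive(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p; k <- 0 until m)
      res(i)(j) += a(i)(k) * b(k)(j)
    res
  }

  // Cache-friendly: transpose b once so the inner loop scans two rows sequentially.
  def cacheFriendly(a: Matrix, b: Matrix): Matrix = {
    val (n, m, p) = (a.length, b.length, b(0).length)
    val bT = Array.ofDim[Double](p, m)
    for (i <- 0 until p; j <- 0 until m) bT(i)(j) = b(j)(i)
    val res = Array.ofDim[Double](n, p)
    for (i <- 0 until n; j <- 0 until p) {
      val rowA = a(i)
      val rowB = bT(j)
      var sum = 0.0
      var k = 0
      while (k < m) { sum += rowA(k) * rowB(k); k += 1 }
      res(i)(j) = sum
    }
    res
  }
}
```

Wrapping each call in a timer on large square matrices is one way to reproduce the kind of comparison shown in the demo.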

Page 18: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Instrumenting and Monitoring CPU
  Use Linux perf command!

http://www.brendangregg.com/blog/2015-11-06/java-mixed-mode-flame-graphs.html

Page 19: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Results of Matrix Multiply Comparison
  [Figure: Naïve Matrix Multiply vs. Cache-Friendly Matrix Multiply, annotated with ratios ~72x, ~8x, ~3x, ~3x, ~2x, ~7x, ~10x]
  perf stat --event L1-dcache-load-misses,L1-dcache-prefetch-misses,LLC-load-misses,LLC-prefetch-misses,cache-misses,stalled-cycles-frontend
  JVM flags: -XX:+PreserveFramePointer -XX:-Inline
  ~10x: 55 hp vs. 550 hp

Page 20: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Demo! Compare CPU Naïve & Cache-Friendly Matrix Multiplication

Page 21: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU Cache Naïve Tuple Counters

    object CacheNaiveTupleIncrement {
      var tuple = (0, 0)
      …
      def increment(leftIncrement: Int, rightIncrement: Int): (Int, Int) = {
        this.synchronized {
          tuple = (tuple._1 + leftIncrement, tuple._2 + rightIncrement)
          tuple
        }
      }
    }

Page 22: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU Cache Naïve Case Class Counters

    case class MyTuple(left: Int, right: Int)

    object CacheNaiveCaseClassCounters {
      var tuple = new MyTuple(0, 0)
      …
      def increment(leftIncrement: Int, rightIncrement: Int): MyTuple = {
        this.synchronized {
          tuple = new MyTuple(tuple.left + leftIncrement, tuple.right + rightIncrement)
          tuple
        }
      }
    }

Page 23: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU Cache Friendly Lock-Free Counters

    import java.util.concurrent.atomic.AtomicLong

    object CacheFriendlyLockFreeCounters {
      // a single Long (8-bytes) will maintain 2 separate Ints (4-bytes each)
      val tuple = new AtomicLong()
      …
      def increment(leftIncrement: Int, rightIncrement: Int): Long = {
        var originalLong = 0L
        var updatedLong = 0L
        do {
          originalLong = tuple.get()
          val originalRightInt = originalLong.toInt                // cast originalLong to Int to get right counter
          val originalLeftInt = (originalLong >>> 32).toInt        // shift right to get left counter
          val updatedRightInt = originalRightInt + rightIncrement  // increment right counter
          val updatedLeftInt = originalLeftInt + leftIncrement     // increment left counter

          updatedLong = updatedLeftInt.toLong                      // update the new long with the left counter
          updatedLong = updatedLong << 32                          // shift the new long left
          updatedLong |= updatedRightInt & 0xFFFFFFFFL             // mask the right counter so a negative value cannot sign-extend into the left counter
        } while (tuple.compareAndSet(originalLong, updatedLong) == false)
        updatedLong
      }
    }

Quiz: Why not @volatile?
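A runnable variation on the same idea, written as a sketch rather than as the demo code: `PackedCounters`, `pack`, `left`, `right`, and `current` are names invented here. It uses a tail-recursive retry instead of `do`/`while`, and masks the right counter when packing. On the quiz: `@volatile` would make writes visible to other threads, but it would not make the read-modify-write sequence atomic, which is what `compareAndSet` provides.

```scala
import java.util.concurrent.atomic.AtomicLong

object PackedCounters {
  private val packed = new AtomicLong(0L)

  // Left counter in the high 32 bits, right counter in the low 32 bits.
  def pack(left: Int, right: Int): Long = (left.toLong << 32) | (right & 0xFFFFFFFFL)
  def left(word: Long): Int = (word >>> 32).toInt
  def right(word: Long): Int = word.toInt

  // Retry the compare-and-set until no other thread changed the word in between.
  @annotation.tailrec
  def increment(leftInc: Int, rightInc: Int): Long = {
    val old = packed.get()
    val updated = pack(left(old) + leftInc, right(old) + rightInc)
    if (packed.compareAndSet(old, updated)) updated else increment(leftInc, rightInc)
  }

  def current: (Int, Int) = {
    val w = packed.get()
    (left(w), right(w))
  }
}
```

Several threads can call `PackedCounters.increment(1, 2)` concurrently without a lock, and `current` returns both counters from a single consistent read.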

Page 24: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Demo! Compare CPU Naïve & Cache-Friendly Tuple Counter Sync

Page 25: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Results of Counters Comparison
  [Figure: Naïve Tuple Counters vs. Naïve Case Class Counters vs. Cache Friendly Lock-Free Counters, annotated with ratios ~2x, ~1.5x, ~3.5x, ~2x, ~2x, ~1.5x, ~1.5x, ~1.5x]

Page 26: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Profiling Visualizations: Flame Graphs
  With Java Stack Traces!!
  Example: Spark Word Count
  Java Stack Traces are Good!
  Plateaus are Bad!!

Page 27: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Outline

①  Mechanical Sympathy
②  Recap of 100TB GraySort Challenge
③  Project Tungsten Deep Dive

Page 28: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

100TB GraySort Challenge
  Sort 100TB of 100-Byte Records with 10-byte Keys
  Custom Data Structs & Algos for Sort & Shuffle
  Saturate Network and Disk I/O Controllers

Page 29: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

100TB GraySort Challenge Results
  [Figure: results table comparing the (2013) and (2014) entries]
  Performance Goals
    Saturate Network I/O
    Saturate Disk I/O

Page 30: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Winning Hardware Configuration
  Compute
    206 Workers, 1 Master (AWS EC2 i2.8xlarge)
    32 Intel Xeon CPU E5-2670 @ 2.5 GHz
    244 GB RAM, 8 x 800GB SSD, RAID 0 striping, ext4
    3 GBps mixed read/write disk I/O per node
  Network
    AWS Placement Groups, VPC, Enhanced Networking
    Single Root I/O Virtualization (SR-IOV)
    10 Gbps, low latency, low jitter (iperf: ~9.5 Gbps)

Page 31: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Winning Software Configuration
  Spark 1.2, OpenJDK 1.7
  Disable caching, compression, spec execution, shuffle spill
  Force NODE_LOCAL task scheduling for optimal data locality
  HDFS 2.4.1 short-circuit local reads, 2x replication
  Empirically chose between 4-6 partitions per CPU
    206 nodes * 32 cores = 6592 cores
    6592 cores * 4 = 26,368 partitions
    6592 cores * 6 = 39,552 partitions
    6592 cores * 4.25 ~= 28,000 partitions (empirical best)
  Range partitioning takes advantage of sequential keyspace
    Required ~10s of sampling 79 keys from each partition
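The partition arithmetic and the sample-based range partitioning can be sketched as follows. This is an illustration under simplifying assumptions, not Spark's RangePartitioner: keys are `Long` values instead of 10-byte keys, the sample is given as an in-memory sequence, and all names (`RangePartitionSketch`, `splitPoints`, `partitionFor`, `partitionsFor`) are made up.

```scala
object RangePartitionSketch {
  // Partition count from cores and an empirically chosen partitions-per-core factor.
  def partitionsFor(cores: Int, partitionsPerCore: Double): Int =
    math.round(cores * partitionsPerCore).toInt

  // Choose numPartitions - 1 split points from a sample of keys.
  def splitPoints(sample: Seq[Long], numPartitions: Int): Array[Long] = {
    val sorted = sample.sorted.toArray
    Array.tabulate(numPartitions - 1) { i =>
      sorted(((i + 1).toLong * sorted.length / numPartitions).toInt)
    }
  }

  // Partition id = number of split points <= key, found by binary search.
  def partitionFor(key: Long, splits: Array[Long]): Int = {
    var lo = 0
    var hi = splits.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (splits(mid) <= key) lo = mid + 1 else hi = mid
    }
    lo
  }
}
```

With 6592 cores and a factor of 4.25, `partitionsFor` gives 28,016, which the slide rounds to 28,000. Because split points are sorted, every key in partition p is smaller than every key in partition p + 1, so sorting each partition independently yields a globally sorted output.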

Page 32: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

New Sort Shuffle Manager for Spark 1.2
  Original “hash-based”, New “sort-based”
  ①  Use less OS resources (socket buffers, file descriptors)
  ②  TimSort partitions in-memory
  ③  MergeSort partitions on-disk into a single master file
  ④  Serve partitions from master file: seek once, sequential scan
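The "single master file, seek once, sequential scan" idea can be sketched with an in-memory buffer standing in for the file. This is a simplified illustration, not the shuffle writer itself; `SingleFileShuffleSketch`, `write`, and `read` are hypothetical names, and partitions are assumed to be already-sorted byte blocks.

```scala
object SingleFileShuffleSketch {
  // Concatenate per-partition byte blocks into one "master" buffer and record
  // where each partition starts. offsets has numPartitions + 1 entries.
  def write(partitions: Seq[Array[Byte]]): (Array[Byte], Array[Long]) = {
    val offsets = partitions.scanLeft(0L)(_ + _.length).toArray
    val data = Array.concat(partitions: _*)
    (data, offsets)
  }

  // Serving partition p is one seek (to offsets(p)) plus one sequential read.
  def read(data: Array[Byte], offsets: Array[Long], p: Int): Array[Byte] =
    java.util.Arrays.copyOfRange(data, offsets(p).toInt, offsets(p + 1).toInt)
}
```

One data buffer plus one small offset index replaces a separate file per partition, which is what keeps file descriptors and socket buffers low.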

Page 33: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Asynchronous Network Module
  Switch to asynchronous Netty vs. synchronous java.nio
  Switch to zero-copy epoll
    Use only kernel-space between disk and network controllers
  Custom memory management
  spark.shuffle.blockTransferService=netty
Spark-Netty Performance Tuning
  spark.shuffle.io.preferDirectBufs=true
    Reuse off-heap buffers
  spark.shuffle.io.numConnectionsPerPeer=8 (for example)
    Increase to saturate hosts with multiple disks (8 x 800GB SSD)
Details in SPARK-2468

Page 34: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Custom Algorithms and Data Structures
  Optimized for sort & shuffle workloads
  o.a.s.util.collection.TimSort[K,V]
    Based on JDK 1.7 TimSort
    Performs best with partially-sorted runs
    Optimized for elements of (K,V) pairs
    Sorts impl of SortDataFormat (e.g. KVArraySortDataFormat)
  o.a.s.util.collection.AppendOnlyMap
    Open addressing hash, quadratic probing
    Array of [(K, V), (K, V)]
    Good memory locality
    Keys never removed, values only append
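A stripped-down sketch of the AppendOnlyMap idea: keys and values interleaved in one flat array, open addressing, quadratic probing. It is not Spark's class. `AppendOnlyMapSketch` and `FlatMap` are invented names, the capacity is fixed (no growth or rehashing), null keys are not supported, and the hash is the plain `hashCode` rather than a stronger mix.

```scala
object AppendOnlyMapSketch {
  final class FlatMap[K, V](capacity: Int) {
    require(capacity > 0 && (capacity & (capacity - 1)) == 0, "capacity must be a power of two")
    private val mask = capacity - 1
    // Keys and values share one flat array: key at 2*pos, value at 2*pos + 1.
    private val data = new Array[Any](2 * capacity)
    private var used = 0

    // Insert the value, or merge it into the existing value for this key.
    def changeValue(key: K, value: V)(merge: (V, V) => V): Unit = {
      var pos = key.hashCode & mask
      var step = 1
      while (true) {
        val k = data(2 * pos)
        if (k == null) {
          require(used < capacity - 1, "sketch does not grow")
          data(2 * pos) = key
          data(2 * pos + 1) = value
          used += 1
          return
        } else if (k == key) {
          data(2 * pos + 1) = merge(data(2 * pos + 1).asInstanceOf[V], value)
          return
        }
        pos = (pos + step) & mask // probe offsets 1, 3, 6, 10, ... from the start
        step += 1
      }
    }

    def get(key: K): Option[V] = {
      var pos = key.hashCode & mask
      var step = 1
      while (data(2 * pos) != null) {
        if (data(2 * pos) == key) return Some(data(2 * pos + 1).asInstanceOf[V])
        pos = (pos + step) & mask
        step += 1
      }
      None
    }
  }
}
```

A word count then becomes `m.changeValue(word, 1)(_ + _)` per word. Because nothing is ever removed, no tombstones are needed and a lookup can stop at the first empty slot.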

Page 35: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Daytona GraySort Challenge Goal Success
  1.1 GBps/node network I/O (Reducers)
    Theoretical max = 1.25 GBps for 10 Gbps Ethernet (10 Gbps / 8 bits per byte)
    Aggregate cluster network I/O: 220 GBps / 206 nodes ~= 1.1 GBps per node
  3 GBps/node disk I/O (Mappers)

Page 36: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Shuffle Performance Tuning Tips
  Hash Shuffle Manager (Deprecated)
    spark.shuffle.consolidateFiles
    (Mapper) o.a.s.shuffle.FileShuffleBlockResolver
  Intermediate Files
    Increase spark.shuffle.file.buffer
    (Reducer) Increase spark.reducer.maxSizeInFlight if memory allows
  Use Smaller Number of Larger Executors
    Many Threads (1 per CPU)
    Minimizes intermediate files and overall shuffle
    More opportunity for PROCESS_LOCAL
  SQL: BroadcastHashJoin vs. ShuffledHashJoin
    spark.sql.autoBroadcastJoinThreshold
    Use DataFrame.explain(true) or EXPLAIN to verify

Page 37: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Outline

①  Mechanical Sympathy
②  Recap of 100TB GraySort Challenge
③  Project Tungsten Deep Dive

Page 38: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Project Tungsten
  Data Structs & Algos Operate Directly on Byte Arrays
  Maximize CPU Cache Locality, Minimize GC
  Utilize Dynamic Code Generation
  SPARK-7076 (Spark 1.4)

Page 39: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Quick Review of Project Tungsten Jiras

SPARK-7076 (Spark 1.4)

Page 40: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Why is CPU the Bottleneck?
  CPU is used for serialization, hashing, compression!
  Network and Disk I/O bandwidth are relatively high
  GraySort optimizations improved network & shuffle
  Partitioning, pruning, and predicate pushdowns
  Binary, compressed, columnar file formats (Parquet)

Page 41: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Yet Another Spark Shuffle Manager!
spark.shuffle.manager =
  hash (Deprecated)
    < 10,000 reducers
    Output partition file hashes the key of (K,V) pair
    Mapper creates an output file per partition
    Leads to M*P output files for all partitions
  sort (GraySort Challenge)
    > 10,000 reducers
    Default from Spark 1.2-1.5
    Mapper creates single output file for all partitions
    Minimizes OS resources, netty + epoll optimizes network I/O, disk I/O, and memory
    Uses custom data structures and algorithms for sort-shuffle workload
    Wins Daytona GraySort Challenge
  tungsten-sort (Project Tungsten)
    Default since 1.5
    Modification of existing sort-based shuffle
    Uses sun.misc.Unsafe for self-managed memory and garbage collection
    Maximize CPU utilization and cache locality with AlphaSort-inspired binary data structures/algorithms
    Perform joins, sorts, and other operators on both serialized and compressed byte buffers

Page 42: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

CPU & Memory Optimizations
  Custom Managed Memory
    Reduces GC overhead
    Both on and off heap
    Exact size calculations
  Direct Binary Processing
    Operate on serialized/compressed arrays
    Kryo can reorder/sort serialized records
    LZF can reorder/sort compressed records
  More CPU Cache-aware Data Structs & Algorithms
    o.a.s.sql.catalyst.expressions.UnsafeRow
    o.a.s.unsafe.map.BytesToBytesMap
  Code Generation (default in 1.5)
    Generate source code from overall query plan
    100+ UDFs converted to use code generation

UnsafeFixedWidthAggregationMap TungstenAggregationIterator

CodeGenerator GenerateUnsafeRowJoiner

UnsafeSortDataFormat UnsafeShuffleSortDataFormat

PackedRecordPointer UnsafeRow

UnsafeInMemorySorter UnsafeExternalSorter UnsafeShuffleWriter

Mostly Same Join Code, UnsafeProjection

UnsafeShuffleManager UnsafeShuffleInMemorySorter UnsafeShuffleExternalSorter

Details in SPARK-7075

Page 43: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

sun.misc.Unsafe

  Info: addressSize(), pageSize()
  Objects: allocateInstance(), objectFieldOffset()
  Classes: staticFieldOffset(), defineClass(), defineAnonymousClass(), ensureClassInitialized()
  Synchronization: monitorEnter(), tryMonitorEnter(), monitorExit(), compareAndSwapInt(), putOrderedInt()
  Arrays: arrayBaseOffset(), arrayIndexScale()
  Memory: allocateMemory(), copyMemory(), freeMemory(), getAddress() (not guaranteed after GC), getInt()/putInt(), getBoolean()/putBoolean(), getByte()/putByte(), getShort()/putShort(), getLong()/putLong(), getFloat()/putFloat(), getDouble()/putDouble(), getObjectVolatile()/putObjectVolatile()

Used by Tungsten
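For a feel of the Memory group of calls, here is a minimal sketch that allocates off-heap memory, writes two longs, reads them back, and frees the block. It is not Tungsten code: `UnsafeSketch` and `roundTrip` are invented names. It assumes a JDK where `sun.misc.Unsafe` is still reachable through the reflective `theUnsafe` field; newer JDKs deprecate these memory-access methods and may print warnings.

```scala
object UnsafeSketch {
  // sun.misc.Unsafe has no public constructor; the usual workaround is to
  // read its private static `theUnsafe` field reflectively.
  private val unsafe: sun.misc.Unsafe = {
    val field = classOf[sun.misc.Unsafe].getDeclaredField("theUnsafe")
    field.setAccessible(true)
    field.get(null).asInstanceOf[sun.misc.Unsafe]
  }

  // Write two longs into a 16-byte off-heap block, read them back, free the block.
  def roundTrip(a: Long, b: Long): (Long, Long) = {
    val address = unsafe.allocateMemory(16L)
    try {
      unsafe.putLong(address, a)
      unsafe.putLong(address + 8L, b)
      (unsafe.getLong(address), unsafe.getLong(address + 8L))
    } finally unsafe.freeMemory(address)
  }
}
```

The block is invisible to the garbage collector, so forgetting `freeMemory` is a leak; that is the trade-off behind "self-managed memory".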

Page 44: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Spark + sun.misc.Unsafe

org.apache.spark.sql.execution.: aggregate.SortBasedAggregate, aggregate.TungstenAggregate, aggregate.AggregationIterator, aggregate.udaf, aggregate.utils, SparkPlanner, rowFormatConverters, UnsafeFixedWidthAggregationMap, UnsafeExternalSorter, UnsafeExternalRowSorter, UnsafeKeyValueSorter, UnsafeKVExternalSorter, local.ConvertToUnsafeNode, local.ConvertToSafeNode, local.HashJoinNode, local.ProjectNode, local.LocalNode, local.BinaryHashJoinNode, local.NestedLoopJoinNode, joins.HashJoin, joins.HashSemiJoin, joins.HashedRelation, joins.BroadcastHashJoin, joins.ShuffledHashOuterJoin (not yet converted), joins.BroadcastHashOuterJoin, joins.BroadcastLeftSemiJoinHash, joins.BroadcastNestedLoopJoin, joins.SortMergeJoin, joins.LeftSemiJoinBNL, joins.SortMergeOuterJoin, Exchange, SparkPlan, UnsafeRowSerializer, SortPrefixUtils, sort, basicOperators, aggregate.SortBasedAggregationIterator, aggregate.TungstenAggregationIterator, datasources.WriterContainer, datasources.json.JacksonParser, datasources.jdbc.JDBCRDD, Window

org.apache.spark.: unsafe.Platform, unsafe.KVIterator, unsafe.array.LongArray, unsafe.array.ByteArrayMethods, unsafe.array.BitSet, unsafe.bitset.BitSetMethods, unsafe.hash.Murmur3_x86_32, unsafe.map.BytesToBytesMap, unsafe.map.HashMapGrowthStrategy, unsafe.memory.TaskMemoryManager, unsafe.memory.ExecutorMemoryManager, unsafe.memory.MemoryLocation, unsafe.memory.UnsafeMemoryAllocator, unsafe.memory.MemoryAllocator (trait/interface), unsafe.memory.MemoryBlock, unsafe.memory.HeapMemoryAllocator, unsafe.sort.RecordComparator, unsafe.sort.PrefixComparator, unsafe.sort.PrefixComparators, unsafe.sort.UnsafeSorterSpillWriter, serializer.DummySerializationInstance, shuffle.unsafe.UnsafeShuffleManager, shuffle.unsafe.UnsafeShuffleSortDataFormat, shuffle.unsafe.SpillInfo, shuffle.unsafe.UnsafeShuffleWriter, shuffle.unsafe.UnsafeShuffleExternalSorter, shuffle.unsafe.PackedRecordPointer, shuffle.ShuffleMemoryManager, util.collection.unsafe.sort.UnsafeSorterSpillMerger, util.collection.unsafe.sort.UnsafeSorterSpillReader, util.collection.unsafe.sort.UnsafeSorterSpillWriter, util.collection.unsafe.sort.UnsafeShuffleInMemorySorter, util.collection.unsafe.sort.UnsafeInMemorySorter, util.collection.unsafe.sort.RecordPointerAndKeyPrefix, util.collection.unsafe.sort.UnsafeSorterIterator, network.shuffle.ExternalShuffleBlockResolver, scheduler.Task, rdd.SqlNewHadoopRDD, executor.Executor

org.apache.spark.sql.catalyst.expressions.: regexpExpressions, BoundAttribute, SortOrder, SpecializedGetters, ExpressionEvalHelper, UnsafeArrayData, UnsafeReaders, UnsafeMapData, Projection, LiteralGenerator, UnsafeRow, JoinedRow, InputFileName, SpecificMutableRow, codegen.CodeGenerator, codegen.GenerateProjection, codegen.GenerateUnsafeRowJoiner, codegen.GenerateSafeProjection, codegen.GenerateUnsafeProjection, codegen.BufferHolder, codegen.UnsafeRowWriter, codegen.UnsafeArrayWriter, complexTypeCreator, rows, literals, misc, stringExpressions

Over 200 source files affected!!

Page 45: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Traditional Java Object Row Layout
  [Figures: 4-byte String; Multi-field Object]

Page 46: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Custom Data Structures for Workload
  UnsafeRow (Dense Binary Row)
    Dense, 8-bytes per field (word-aligned)
  TaskMemoryManager (Virtual Memory Address)
    OS-Style Memory Paging
  BytesToBytesMap (Dense Binary HashMap)
    AlphaSort-Style (Key + Pointer)

Page 47: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

UnsafeRow Layout Example

Pre-Tungsten

Tungsten
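To make "dense, 8 bytes per field" concrete, here is a small sketch of a fixed-width binary row held in one byte array. It is loosely modeled on the idea behind UnsafeRow but is not its actual format: `DenseRowSketch` and `DenseRow` are invented, the row is limited to 64 fields so a single 8-byte word can serve as the null bitmap, and variable-length fields are left out.

```scala
import java.nio.ByteBuffer

object DenseRowSketch {
  // Layout (sketch): one 8-byte null bitmap word, then one 8-byte slot per field.
  final class DenseRow(numFields: Int) {
    require(numFields > 0 && numFields <= 64, "sketch supports up to 64 fields")
    val bytes: Array[Byte] = new Array[Byte](8 + 8 * numFields)
    private val buf = ByteBuffer.wrap(bytes)

    private def slot(i: Int): Int = 8 + 8 * i
    private def clearNull(i: Int): Unit = buf.putLong(0, buf.getLong(0) & ~(1L << i))

    def setLong(i: Int, v: Long): Unit = { clearNull(i); buf.putLong(slot(i), v) }
    def setDouble(i: Int, v: Double): Unit = { clearNull(i); buf.putDouble(slot(i), v) }

    // Mark the field null and zero its slot so equal rows stay byte-identical.
    def setNull(i: Int): Unit = {
      buf.putLong(0, buf.getLong(0) | (1L << i))
      buf.putLong(slot(i), 0L)
    }

    def isNull(i: Int): Boolean = (buf.getLong(0) & (1L << i)) != 0L
    def getLong(i: Int): Long = buf.getLong(slot(i))
    def getDouble(i: Int): Double = buf.getDouble(slot(i))
  }
}
```

A three-field row occupies exactly 32 bytes with no per-field object headers or pointers, and its size is known exactly without estimation.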

Page 48: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Custom Memory Management
  o.a.s.memory.TaskMemoryManager & MemoryConsumer
    Memory management: virtual memory allocation, paging
    Off-heap: direct 64-bit address
    On-heap: 13-bit page num + 27-bit page offset
    2^13 pages * 2^27 page size = 1 TB RAM per Task
  o.a.s.shuffle.sort.PackedRecordPointer
    64-bit word (24-bit partition key, (13-bit page num, 27-bit page offset))
  o.a.s.unsafe.types.UTF8String
    Primitive Array[Byte]
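The 64-bit layout described for PackedRecordPointer can be sketched with plain shifts and masks. This follows the bit widths given on the slide (24 + 13 + 27 = 64) and is not Spark's source; the object and method names are made up.

```scala
object PackedPointerSketch {
  val PartitionBits = 24
  val PageBits = 13
  val OffsetBits = 27

  // Pack (partition, page number, offset within page) into one 64-bit word.
  def pack(partition: Int, page: Int, offset: Int): Long = {
    require(partition >= 0 && partition < (1 << PartitionBits))
    require(page >= 0 && page < (1 << PageBits))
    require(offset >= 0 && offset < (1 << OffsetBits))
    (partition.toLong << (PageBits + OffsetBits)) | (page.toLong << OffsetBits) | offset.toLong
  }

  def partition(word: Long): Int = (word >>> (PageBits + OffsetBits)).toInt
  def page(word: Long): Int = ((word >>> OffsetBits) & ((1L << PageBits) - 1)).toInt
  def offset(word: Long): Int = (word & ((1L << OffsetBits) - 1)).toInt

  // 2^13 pages * 2^27 bytes per page = 2^40 bytes = 1 TB addressable per task.
  val addressableBytes: Long = (1L << PageBits) * (1L << OffsetBits)
}
```

Putting the partition id in the high bits means an array of these words can be grouped by partition using plain integer comparisons, without touching the records themselves.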

Page 49: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

UnsafeFixedWidthAggregationMap

Aggregations
  o.a.s.sql.execution.UnsafeFixedWidthAggregationMap
    Uses BytesToBytesMap
    In-place updates of serialized data
    No object creation on hot-path
    Improved external agg support
    No OOM’s for large, single key aggs
  o.a.s.sql.catalyst.expressions.codegen.GenerateUnsafeRowJoiner
    Combine 2 UnsafeRows into 1
  o.a.s.sql.execution.aggregate.TungstenAggregate & TungstenAggregationIterator
    Operates directly on serialized, binary UnsafeRow
    2 Steps: hash-based agg (grouping), then sort-based agg
    Supports spilling and external merge sorting

Page 50: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Equality
  Bitwise comparison on UnsafeRow
  No need to calculate equals(), hashCode()
  [Figure: Row 1 equals Row 2]
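Byte-wise equality can be sketched in a few lines. The encoding here (8 bytes per `Long` field, via `ByteBuffer`) is a hypothetical stand-in for a fixed-width binary row, and `BinaryEqualitySketch`, `encode`, and `sameRow` are invented names.

```scala
import java.nio.ByteBuffer
import java.util.Arrays

object BinaryEqualitySketch {
  // Encode a row of longs as 8 bytes per field (hypothetical fixed-width layout).
  def encode(fields: Long*): Array[Byte] = {
    val buf = ByteBuffer.allocate(8 * fields.length)
    fields.foreach(f => buf.putLong(f))
    buf.array()
  }

  // Two rows are equal exactly when their bytes are equal: no per-field
  // equals() or hashCode() calls, just one linear scan over the bytes.
  def sameRow(a: Array[Byte], b: Array[Byte]): Boolean = Arrays.equals(a, b)
}
```

This only works when the encoding is canonical, meaning equal rows always serialize to identical bytes (for example, unused or null slots are zeroed).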

Page 51: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Joins
  Surprisingly, not many code changes
  o.a.s.sql.catalyst.expressions.UnsafeProjection
    Converts InternalRow to UnsafeRow

Page 52: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Sorting
  o.a.s.util.collection.unsafe.sort.
    UnsafeSortDataFormat
    UnsafeInMemorySorter
    UnsafeExternalSorter
    RecordPointerAndKeyPrefix
    UnsafeShuffleWriter
  AlphaSort-Style Cache Friendly
    Ptr + Key-Prefix: 2x CPU Cache-line Friendly!
  Using multiple subclasses of SortDataFormat simultaneously will prevent JIT inlining. This affects sort & shuffle performance.
  Supports merging compressed records if compression CODEC supports it (LZF)
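A sketch of sorting by (key-prefix, pointer): compare the small prefix first and fall back to the full key only when prefixes tie. This illustrates the idea and is not the Unsafe sorter. Keys are byte arrays, the prefix is assumed to be the first 4 bytes (zero-padded), pointers are array indices, and the names are invented.

```scala
object KeyPrefixSortSketch {
  // 4-byte prefix of the key, big-endian; shorter keys are zero-padded.
  def prefix(key: Array[Byte]): Int = {
    var p = 0
    var i = 0
    while (i < 4) {
      p = (p << 8) | (if (i < key.length) key(i) & 0xff else 0)
      i += 1
    }
    p
  }

  // Full unsigned lexicographic comparison, used only when prefixes tie.
  def compareFull(a: Array[Byte], b: Array[Byte]): Int = {
    val n = math.min(a.length, b.length)
    var i = 0
    while (i < n) {
      val d = (a(i) & 0xff) - (b(i) & 0xff)
      if (d != 0) return d
      i += 1
    }
    a.length - b.length
  }

  // Returns record indices ("pointers") in sorted key order.
  def sort(keys: Array[Array[Byte]]): Array[Int] = {
    val entries = keys.indices.map(i => (prefix(keys(i)), i)).toArray
    entries.sortWith { (x, y) =>
      val c = Integer.compareUnsigned(x._1, y._1)
      if (c != 0) c < 0 else compareFull(keys(x._2), keys(y._2)) < 0
    }.map(_._2)
  }
}
```

Most comparisons are decided by the prefix alone, so the full record is dereferenced only for keys that share their first four bytes.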

Page 53: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Spilling
  Efficient Spilling
    Exact data size is known
    No need to maintain heuristics & approximations
    Controls amount of spilling
  Spill merge on compressed, binary records!
    If compression CODEC supports it
  Exact Peak Memory for Spark Jobs
    UnsafeFixedWidthAggregationMap.getPeakMemoryUsedBytes()

Page 54: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Code Generation
  Problem
    Boxing causes excessive object creation
    Expensive expression tree evals per row
    JVM can’t inline polymorphic impls
  Solution
    Codegen bypasses virtual function calls
    Defer source code generation to each operator, UDF, UDAF
    Use Scala quasiquote macros for Scala AST source code gen
    Rewrite and optimize code for overall plan, 8-byte align, etc
    Use Janino to compile generated source code into bytecode
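The contrast between interpreting an expression tree and generating source for it can be sketched with a toy expression type. None of this is Catalyst code: `Expr`, `Col`, `Lit`, `Add`, `Mul`, and `genMethod` are invented for illustration, rows are plain `Array[Long]`, and the compile step (Janino in Spark) is not shown.

```scala
object CodegenSketch {
  sealed trait Expr {
    def eval(row: Array[Long]): Long // interpreted: one virtual call per node, per row
    def genCode: String              // generated: a Java expression over `row`
  }
  final case class Col(i: Int) extends Expr {
    def eval(row: Array[Long]): Long = row(i)
    def genCode: String = s"row[$i]"
  }
  final case class Lit(v: Long) extends Expr {
    def eval(row: Array[Long]): Long = v
    def genCode: String = s"${v}L"
  }
  final case class Add(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Long]): Long = l.eval(row) + r.eval(row)
    def genCode: String = s"(${l.genCode} + ${r.genCode})"
  }
  final case class Mul(l: Expr, r: Expr) extends Expr {
    def eval(row: Array[Long]): Long = l.eval(row) * r.eval(row)
    def genCode: String = s"(${l.genCode} * ${r.genCode})"
  }

  // Flatten the whole tree into one method body that a compiler can inline freely.
  def genMethod(e: Expr): String =
    s"public long eval(long[] row) { return ${e.genCode}; }"
}
```

For `Add(Mul(Col(0), Lit(2)), Col(1))`, the interpreted path makes five `eval` calls per row, while the generated method is a single straight-line expression.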

Page 55: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Spark SQL UDF Code Generation
  100+ UDFs now generating code

More to come in Spark 1.6+

Details in SPARK-8159, SPARK-9571

Each Implements Expression.genCode() !

Page 56: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Creating a Custom UDF with Codegen
  Study existing implementations
    https://github.com/apache/spark/pull/7214/files
  Extend base trait
    o.a.s.sql.catalyst.expressions.Expression.genCode()
  Register the function
    o.a.s.sql.catalyst.analysis.FunctionRegistry.registerFunction()
  Augment DataFrame with new UDF (Scala implicits)
    o.a.s.sql.functions.scala
  Don’t forget about Python!
    python.pyspark.sql.functions.py

Page 57: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Who Benefits from Project Tungsten?
  Users of DataFrames
  All Spark SQL Queries
    Catalyst
  All RDDs
    Serialization, Compression, and Aggregations

Page 58: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Performance Results
  [Figures: Query Time; Garbage Collection]
  OOM’d on Large Dataset!

Page 59: Advanced Apache Spark Meetup Project Tungsten Nov 12 2015

Thank You!!!
  Chris Fregly @cfregly
  IBM Spark Technology Center
  San Francisco, California
Relevant Links
  advancedspark.com
    Sign up for the book & global meetup!
  github.com/fluxcapacitor/pipeline
    Clone, contribute, and commit code!
  hub.docker.com/r/fluxcapacitor/pipeline/wiki
    Run all demos in your own environment with Docker!
