How Spark Beat Hadoop @ 100 TB Sort
Advanced Apache Spark Meetup
Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Power of data. Simplicity of design. Speed of innovation.
Meetup Housekeeping
Announcements
Deepak Srinivasan, Big Commerce
Steve Beier, IBM Spark Tech Center
Who am I?
Streaming Platform Engineer
Streaming Data Engineer, Netflix Open Source Committer
Data Solutions Engineer, Apache Contributor
Principal Data Solutions Engineer, IBM Spark Technology Center
Last Meetup (End-to-End Data Pipeline)
Presented `Flux Capacitor`: End-to-End Data Pipeline in a Box!
Real-time, Advanced Analytics, Machine Learning, Recommendations
Github: github.com/fluxcapacitor
Docker: hub.docker.com/r/fluxcapacitor
Since Last Meetup (End-to-End Data Pipeline)
Meetup Statistics
Total Spark Experts: ~850 (+100%)
Mean RSVPs per Meetup: 268
Mean Attendance: ~60% of RSVPs
Donations: $15 (Thank you so much, but please keep your $!)
Github Statistics (github.com/fluxcapacitor): 18 forks, 13 clones, ~1300 views
Docker Statistics (hub.docker.com/r/fluxcapacitor): ~1600 downloads
Recent Events
Replay of last SF Meetup in Mountain View @ BaseCRM: Presented Flux Capacitor End-to-End Data Pipeline
(Scala + Big Data) By The Bay Conference: Workshop and 2 talks; trained ~100 on the End-to-End Data Pipeline
Galvanize Workshop: Trained ~30 on the End-to-End Data Pipeline
Upcoming USA Events
IBM Hackathon @ Galvanize (Sept 18th – Sept 21st)
Advanced Apache Spark Meetup @ DataStax (Sept 21st): Spark-Cassandra Spark SQL + DataFrame Connector
Cassandra Summit Talk (Sept 22nd – Sept 24th): Real-time End-to-End Data Pipeline w/ Cassandra
Strata New York (Sept 29th – Oct 1st)
Upcoming European Events
Dublin Spark Meetup Talk (Oct 15th)
Barcelona Spark Meetup Talk (Oct ?)
Madrid Spark Meetup Talk (Oct ?)
Amsterdam Spark Meetup (Oct 27th)
Spark Summit Amsterdam (Oct 27th – Oct 29th)
Brussels Spark Meetup Talk (Oct 30th)
Spark and the Daytona GraySort Challenge
sortbenchmark.org
sortbenchmark.org/ApacheSpark2014.pdf
Themes of this Talk: Mechanical Sympathy
Seek Once, Scan Sequentially
CPU Cache Locality and Memory Hierarchy are Key
Go Off-Heap Whenever Possible
Customize Data Structures for your Workload
What is the Daytona GraySort Challenge?
Key Metric: Throughput of sorting 100 TB of 100-byte records with 10-byte keys
Total time includes launching the app and writing the output file
Daytona: App must be general purpose
Gray: Named after Jim Gray
Daytona GraySort Challenge: Input and Resources
Input
Records are 100 bytes in length
First 10 bytes are the random key
Input generator: `ordinal.com/gensort.html`
28,000 fixed-size partitions for the 100 TB sort
250,000 fixed-size partitions for the 1 PB sort
1 partition = 1 HDFS block = 1 node = no partial read I/O
Hardware and Runtime Resources
Commercially available and off-the-shelf
Unmodified, no over/under-clocking
Generates 500 TB of disk I/O and 200 TB of network I/O
Daytona GraySort Challenge: Rules
Must sort to/from OS files in secondary storage
No raw disk, since the I/O subsystem is being tested
File and device striping (RAID 0) are encouraged
Output file(s) must have correct key order
Daytona GraySort Challenge: Task Scheduling
Types of Data Locality (listed in increasing level of "shittiness"):
PROCESS_LOCAL
NODE_LOCAL
RACK_LOCAL
ANY
Delay Scheduling
`spark.locality.wait.node`: time to wait before falling back to the next (worse) locality level
Set it to infinite to force NODE_LOCAL and never fall back
Straggling Executor JVMs naturally fade away on each run
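A minimal configuration sketch of the delay-scheduling knob described above (property names are from the Spark docs; the values are illustrative, not from the winning run -- in the Spark 1.x era these waits are plain millisecond values):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Delay scheduling: how long to wait at each locality level before degrading to the next one.
// A huge node-level wait effectively forces NODE_LOCAL tasks (never fall back to RACK_LOCAL/ANY).
val conf = new SparkConf()
  .setAppName("locality-demo")                   // master/deploy mode supplied by spark-submit
  .set("spark.locality.wait", "3000")            // base wait in ms before degrading locality
  .set("spark.locality.wait.node", "360000000")  // "infinite" in practice: force NODE_LOCAL

val sc = new SparkContext(conf)
```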
Daytona GraySort Challenge: Winning Results
On-disk only, in-memory caching disabled!
Hardware: EC2 (i2.8xlarge)
28,000 partitions for the 100 TB sort; 250,000 partitions (!!) for the 1 PB sort
Daytona GraySort Challenge: EC2 Configuration
206 EC2 worker nodes, 1 master node
i2.8xlarge: 32 Intel Xeon CPU E5-2670 @ 2.5 GHz, 244 GB RAM, 8 x 800 GB SSD, RAID 0 striping, ext4
NOOP I/O scheduler: FIFO with request merging, no reordering
3 Gbps mixed read/write disk I/O
Deployed within a Placement Group / VPC
Enhanced Networking: Single Root I/O Virtualization (SR-IOV), an extension of PCIe
10 Gbps, low latency, low jitter (iperf showed ~9.5 Gbps)
Daytona GraySort Challenge: Winning Configuration
Spark 1.2, OpenJDK 1.7_<amazon-something>_u65-b17
Disabled in-memory caching -- all on-disk!
HDFS 2.4.1 short-circuit local reads, 2x replication
Writes flushed after every run (5 runs for 28,000 partitions)
Netty 4.0.23.Final with native epoll
Speculative Execution disabled: `spark.speculation`=false
Force NODE_LOCAL: `spark.locality.wait.node`=Infinite
Force Netty Off-Heap: `spark.shuffle.io.preferDirectBufs`=true
Spilling disabled: `spark.shuffle.spill`=false
All compression disabled
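A hedged sketch of that configuration expressed as a `SparkConf` (property names come from the bullets above; the exact values used in the record run aren't shown here, so treat these as illustrative):

```scala
import org.apache.spark.SparkConf

// Illustrative restatement of the slide's settings -- not the literal submission script.
val conf = new SparkConf()
  .set("spark.speculation", "false")                 // no speculative execution
  .set("spark.locality.wait.node", "360000000")      // effectively infinite: force NODE_LOCAL
  .set("spark.shuffle.io.preferDirectBufs", "true")  // Netty off-heap buffers
  .set("spark.shuffle.spill", "false")               // spilling disabled
  .set("spark.shuffle.compress", "false")            // all compression disabled
  .set("spark.shuffle.spill.compress", "false")
```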
Daytona GraySort Challenge: Partitioning
Range Partitioning (vs. Hash Partitioning)
Takes advantage of the sequential key space
Similar keys are grouped together within a partition
Ranges are defined by sampling 79 values per partition
The driver sorts the samples and defines the range boundaries
Sampling took ~10 seconds for 28,000 partitions
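A small sketch of range partitioning in Spark (paths and key extraction are hypothetical; `sortByKey` builds a `RangePartitioner` under the hood, which does the key sampling described above):

```scala
import org.apache.spark.RangePartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits (needed on older Spark versions)

// Hypothetical 100-byte records: first 10 characters are the key, the rest is the payload.
val records = sc.textFile("hdfs:///gensort-input")
  .map(line => (line.take(10), line.drop(10)))

// RangePartitioner samples a handful of keys per partition, sorts the sample on the driver,
// and derives boundaries so that similar keys land in the same (ordered) partition.
val partitioner = new RangePartitioner(28000, records)
val ranged = records.partitionBy(partitioner)

// sortByKey does the same sampling internally and also sorts within each partition.
val sorted = records.sortByKey(ascending = true, numPartitions = 28000)
```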
Daytona GraySort Challenge: Why Bother?
Sorting relies heavily on shuffle and the I/O subsystem
Shuffle is a major bottleneck in big data processing
A large number of partitions can exhaust OS resources
Shuffle optimization benefits all high-level libraries
The goal is to saturate the network controller on all nodes
~125 MB/s (1 Gb ethernet), ~1.25 GB/s (10 Gb ethernet)
Daytona GraySort Challenge: Per-Node Results
Mappers: 3 Gbps/node disk I/O (8 x 800 GB SSD)
Reducers: 1.1 Gbps/node network I/O (10 Gbps network)
Quick Shuffle Refresher
Shuffle Overview
All-to-All, Cartesian Product Operation
(Diagram: "the least useful example I could find")
Spark Shuffle Overview
Stages are Defined by Shuffle Boundaries
(Diagram: "the most confusing example I could find")
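A tiny sketch showing a stage boundary appearing exactly where the shuffle is (toy data; `toDebugString` prints the lineage, with the ShuffledRDD marking the boundary):

```scala
// map/filter are narrow and stay in one stage; reduceByKey needs a shuffle, so a new stage starts there.
val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

println(counts.toDebugString)   // the ShuffledRDD in the lineage is the stage boundary
counts.collect().foreach(println)
```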
Shuffle Intermediate Data: Spill to Disk
Intermediate shuffle data is stored in memory
Spill to Disk
`spark.shuffle.spill`=true
`spark.shuffle.memoryFraction`: % of heap used for shuffle buffers
Competes with `spark.storage.memoryFraction`
Bump this up from the default!! It will help Spark SQL, too.
Skipped Stages
Reuse intermediate shuffle data found on the reducer
The DAG for that partition can be truncated
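A hedged example of the fraction bump suggested above (Spark 1.x static memory management; the exact split is workload-dependent and these values are illustrative):

```scala
import org.apache.spark.SparkConf

// Shuffle buffers and the block cache compete for the same heap in Spark 1.x,
// so raising the shuffle fraction usually means lowering the storage fraction.
val conf = new SparkConf()
  .set("spark.shuffle.spill", "true")          // spill instead of failing when buffers fill up
  .set("spark.shuffle.memoryFraction", "0.4")  // bumped up from the 0.2 default (illustrative)
  .set("spark.storage.memoryFraction", "0.4")  // shrunk from the 0.6 default to make room
```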
Shuffle Intermediate Data: Compression
`spark.shuffle.compress`: compress mapper outputs
`spark.shuffle.spill.compress`: compress reducer spills
`spark.io.compression.codec`:
LZF: most workloads (new default for Spark)
Snappy: LARGE workloads (less memory required to compress)
Spark Shuffle Operations
join, distinct, cogroup, coalesce, repartition, sortByKey, groupByKey, reduceByKey, aggregateByKey
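A short contrast between two of the operations above (toy data, identical results): `groupByKey` ships every value across the shuffle, while `reduceByKey` combines on the map side first and moves far less data.

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4)))

// groupByKey: every value for a key crosses the network before being summed on the reducer.
val viaGroup = pairs.groupByKey().mapValues(_.sum)

// reduceByKey: partial sums are computed per mapper before the shuffle (map-side combine).
val viaReduce = pairs.reduceByKey(_ + _)
```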
Spark Shuffle Managers
`spark.shuffle.manager` = {
  `hash`: < 10,000 reducers
    Output file determined by hashing the key of the (K, V) pair
    Each mapper creates one output buffer/file per reducer
    Leads to M*R output buffers/files per shuffle
  `sort`: >= 10,000 reducers
    Default since Spark 1.2
    Won the Daytona GraySort Challenge w/ 250,000 reducers!!
  `tungsten-sort`: (Future Meetup!)
}
Shuffle Managers
Hash Shuffle Manager
M*R open files per shuffle (M = num mappers, R = num reducers)
Mapper opens 1 file per partition/reducer
(Diagram: HDFS (2x replication) -> mappers -> per-reducer files -> reducers -> HDFS (2x replication))
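To see why this matters at challenge scale: with 28,000 map tasks and 28,000 reduce tasks, hash shuffle would need on the order of 28,000 * 28,000 = 784,000,000 intermediate buffers/files per shuffle, whereas the sort shuffle described below needs only one master file per mapper, i.e. 28,000.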
Sort Shuffle Manager
Hold Tight!
Tungsten-Sort Shuffle Manager
Future Meetup!!
Shuffle Performance Tuning
Hash Shuffle Manager (no longer the default)
`spark.shuffle.consolidateFiles`: consolidate mapper output files
`o.a.s.shuffle.FileShuffleBlockResolver`
Intermediate Files
Increase `spark.shuffle.file.buffer`: reduces seeks & system calls
Increase `spark.reducer.maxSizeInFlight` if memory allows
Use a smaller number of larger workers to reduce the total number of files
SQL: BroadcastHashJoin vs. ShuffledHashJoin
`spark.sql.autoBroadcastJoinThreshold`
Use `DataFrame.explain(true)` or `EXPLAIN` to verify
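A hedged Spark SQL sketch of the broadcast-join check mentioned above (paths, table shapes, and the threshold value are made up; `explain(true)` reveals whether a BroadcastHashJoin or a shuffle-based join was planned):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Joins against tables estimated below this size (in bytes) are broadcast instead of shuffled.
sqlContext.setConf("spark.sql.autoBroadcastJoinThreshold", (10 * 1024 * 1024).toString)

val big   = sqlContext.read.parquet("hdfs:///warehouse/big_table")   // hypothetical paths
val small = sqlContext.read.parquet("hdfs:///warehouse/small_dim")

val joined = big.join(small, big("id") === small("id"))
joined.explain(true)   // look for BroadcastHashJoin vs. a shuffled join in the physical plan
```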
Shuffle Configuration
Documentation: spark.apache.org/docs/latest/configuration.html#shuffle-behavior
Prefix: spark.shuffle
The winning optimizations were deployed across Spark 1.1 and 1.2
Daytona GraySort Challenge: Winning Optimizations
CPU-Cache Locality: (Key, Pointer-to-Record) & Cache Alignment
Optimized Sort Algorithm: Elements of (K, V) Pairs
Reduce Network Overhead: Async Netty, epoll
Reduce OS Resource Utilization: Sort Shuffle
CPU-Cache Locality: (Key, Pointer-to-Record)
AlphaSort paper (~1995), Chris Nyberg and Jim Gray
Naïve: List of (Pointer-to-Record) -- requires the key to be dereferenced for every comparison
AlphaSort: List of (Key, Pointer-to-Record) -- the key is directly available for comparison
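A toy, plain-Scala sketch of the AlphaSort idea (not Spark's implementation): carry the key next to the record pointer so the sort's comparisons touch only the small key array instead of dereferencing the full record every time.

```scala
// Toy records: the first 10 characters act as the key, the rest is payload.
val records: Array[String] = Array("delta-0001|payload", "alpha-0007|payload", "carol-0003|payload")

// Naive (pointer-only): sort record indices, dereferencing the record on every comparison.
val naive = records.indices.toArray.sortBy(i => records(i).take(10))

// AlphaSort-style (key, pointer): extract the key once, then sort the compact pairs.
val keyed   = records.indices.map(i => (records(i).take(10), i)).toArray
val sorted  = keyed.sortBy(_._1)                       // comparisons never chase the record pointer
val ordered = sorted.map { case (_, i) => records(i) } // dereference only once, at the end
```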
CPU-Cache Locality: Cache Alignment
Key (10 bytes) + Pointer (4 bytes*) = 14 bytes
* 4 bytes when using compressed OOPs (< 32 GB heap)
14 bytes is not a power of two, so it is not CPU-cache friendly
Cache Alignment Options
① Add padding (2 bytes): Key (10 bytes) + Pad (2 bytes) + Pointer (4 bytes) = 16 bytes
② (Key-Prefix, Pointer-to-Record): Key-Prefix (4 bytes) + Pointer (4 bytes) = 8 bytes; performance is affected by the key distribution
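A hedged sketch (not Spark's code) of option ②: pack a 4-byte key prefix and a 4-byte record pointer into one 8-byte value, and let most comparisons be decided by the prefix alone -- only prefix ties need to dereference the record.

```scala
// Pack prefix (high 32 bits) and record pointer (low 32 bits) into one cache-friendly Long.
def pack(prefix: Int, pointer: Int): Long = (prefix.toLong << 32) | (pointer & 0xFFFFFFFFL)
def prefixOf(packed: Long): Int  = (packed >>> 32).toInt
def pointerOf(packed: Long): Int = packed.toInt

// First 4 bytes of the 10-byte key, read as an unsigned big-endian prefix.
def prefixOfKey(key: Array[Byte]): Int =
  ((key(0) & 0xFF) << 24) | ((key(1) & 0xFF) << 16) | ((key(2) & 0xFF) << 8) | (key(3) & 0xFF)

// Unsigned comparison of prefixes (flip the sign bit so a signed compare orders unsigned values).
// A result of 0 means the prefixes tie and the full keys must be compared via the pointers.
def comparePacked(a: Long, b: Long): Int =
  java.lang.Integer.compare(prefixOf(a) ^ Int.MinValue, prefixOf(b) ^ Int.MinValue)
```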
CPU-Cache Locality: Performance Comparison
Optimized Sort Algorithm: Elements of (K, V) Pairs
`o.a.s.util.collection.TimSort`
  Based on JDK 1.7 TimSort
  Performs best on partially-sorted datasets
  Optimized for elements of (K, V) pairs
  Sorts implementations of SortDataFormat (e.g. KVArraySortDataFormat)
`o.a.s.util.collection.AppendOnlyMap`
  Open-addressing hash map with quadratic probing
  Array layout of [(key0, value0), (key1, value1), ...] gives good memory locality
  Keys are never removed; values are only appended
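A toy sketch of the `AppendOnlyMap` idea (greatly simplified -- no growth, no spilling): keys and values sit side by side in one flat array for locality, collisions are resolved by probing, and nothing is ever removed.

```scala
// data(2*i) holds a key and data(2*i + 1) its value, so a key and its value share locality.
class ToyAppendOnlyMap(capacity: Int = 64) {            // capacity must be a power of two
  private val data = new Array[AnyRef](2 * capacity)
  private val mask = capacity - 1

  // Insert a new value or update an existing one; there is no remove().
  def changeValue(key: AnyRef, update: AnyRef => AnyRef): Unit = {
    var pos   = key.hashCode() & mask
    var delta = 1
    while (true) {
      val k = data(2 * pos)
      if (k == null) {                                  // empty slot: append the new entry
        data(2 * pos) = key
        data(2 * pos + 1) = update(null)
        return
      } else if (k == key) {                            // existing key: update the value in place
        data(2 * pos + 1) = update(data(2 * pos + 1))
        return
      } else {                                          // collision: quadratic-style probing
        pos = (pos + delta) & mask
        delta += 1
      }
    }
  }
}
```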
Reduce Network Overhead: Async Netty, epoll
New network module based on async Netty
  Replaces the old java.nio, low-level, socket-based code
  Zero-copy epoll uses kernel space between disk & network
  Custom memory management reduces GC pauses
  `spark.shuffle.blockTransferService`=netty
Spark-Netty Performance Tuning
  `spark.shuffle.io.numConnectionsPerPeer`: increase to saturate hosts with multiple disks
  `spark.shuffle.io.preferDirectBufs`: on-heap or off-heap buffers (off-heap is the default)
Apache Spark Jira: SPARK-2468
Reduce OS Resource Utilization: Sort Shuffle
M open files per shuffle (M = num of mappers)
`spark.shuffle.sort.bypassMergeThreshold`
The mapper merge-sorts its partitions (TimSort in RAM, merge sort on disk) into 1 master file, indexed by partition range offsets
Reducers seek to the range offset of the master file on the mapper and scan sequentially
SPARK-2926: Replace TimSort with merge sort (in memory)
(Diagram: HDFS (2x replication) -> mappers -> one master file per mapper -> reducers -> HDFS (2x replication))
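A hedged sketch of the "seek once, scan sequentially" read path (file names and the index layout are illustrative, not Spark's exact on-disk format): a tiny index file holds per-partition offsets into the mapper's single master file, so a reducer does one seek and then one sequential read.

```scala
import java.io.RandomAccessFile

// Illustrative layout: "<mapId>.index" stores (numPartitions + 1) Longs -- the byte offset where
// each partition's range begins in "<mapId>.data", the mapper's single sorted master file.
def readPartition(mapId: Int, reduceId: Int): Array[Byte] = {
  val index = new RandomAccessFile(s"$mapId.index", "r")
  index.seek(reduceId * 8L)
  val start = index.readLong()
  val end   = index.readLong()
  index.close()

  val data = new RandomAccessFile(s"$mapId.data", "r")
  data.seek(start)                          // seek once...
  val bytes = new Array[Byte]((end - start).toInt)
  data.readFully(bytes)                     // ...then scan sequentially
  data.close()
  bytes
}
```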
Bonus!
External Shuffle Service: Separate JVM Process
Takes over when the Spark Executor is in GC or dies
Uses the new Netty-based network module
Required for YARN dynamic allocation; the Node Manager serves the shuffle files
Apache Spark Jira: SPARK-3796
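A minimal sketch of the dynamic-allocation setup that relies on this service (property names from the Spark docs; the executor bounds are illustrative):

```scala
import org.apache.spark.SparkConf

// With dynamic allocation, executors come and go; the external shuffle service running in the
// YARN NodeManager keeps serving their shuffle files even after the executor JVM exits.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")    // illustrative bounds
  .set("spark.dynamicAllocation.maxExecutors", "50")
```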
Next Steps: Project Tungsten
Project Tungsten: CPU and Memory Optimizations
The Daytona GraySort optimizations targeted Disk and Network; Tungsten targets CPU and Memory
Tungsten Optimizations
  Custom memory management: eliminates JVM object and GC overhead
  More cache-aware data structures and algorithms: `o.a.s.unsafe.map.BytesToBytesMap` vs. `j.u.HashMap`
  Code generation (default in 1.5): generate bytecode from the overall query plan
Thank you!
Special thanks to Big Commerce!!
IBM Spark Tech Center is hiring! Nice people only, please!!
Sign up for our newsletter at spark.tc
To Be Continued…
Relevant Links
http://sortbenchmark.org/ApacheSpark2014.pdf
https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
https://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
http://0x0fff.com/spark-architecture-shuffle/
http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
Power of data. Simplicity of design. Speed of innovation.
IBM Spark