Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark


Transcript of Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark

Page 1: Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark

Yahoo Audience Expansion: Migration from Hadoop Streaming to Spark

Gavin Li, Jaebong Kim, Andy Feng (Yahoo)

Page 2

Agenda

• Audience Expansion Spark Application
• Spark scalability: problems and our solutions
• Performance tuning

Page 3

AUDIENCE EXPANSION

How we built audience expansion on Spark

Page 4

Audience Expansion

• Train a model to find users who behave similarly to sample users

• Find more potential “converters”

Page 5

System

• Large-scale machine learning system
• Logistic regression
• TBs of input data, up to TBs of intermediate data
• Hadoop pipeline uses 30,000+ mappers, 2,000 reducers, 16-hour run time
• All Hadoop Streaming, ~20 jobs
• Use Spark to reduce latency and cost
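The slides do not show the training code, so here is a minimal, hypothetical sketch of the core idea: logistic regression trained on labeled samples, then used to score users for expansion. The feature layout, function names, and toy data are illustrative assumptions, not the production system.

```python
import math, random

def train_logistic_regression(samples, dim, epochs=50, lr=0.1):
    """Minimal SGD logistic regression: samples are (features, label)
    pairs with label 1 for converters and 0 for negatives."""
    w = [0.0] * dim
    for _ in range(epochs):
        for x, y in samples:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(dim):
                w[i] += lr * (y - p) * x[i]  # gradient step on log loss
    return w

def score(w, x):
    """Expansion score: likelihood that a user behaves like the samples."""
    return 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))

random.seed(0)
# Toy data: a bias feature plus one feature clustering at ~1 for
# positives and ~0 for negatives.
data = [([1.0, random.gauss(1, 0.3)], 1) for _ in range(50)] + \
       [([1.0, random.gauss(0, 0.3)], 0) for _ in range(50)]
w = train_logistic_regression(data, dim=2)
print(score(w, [1.0, 1.0]), score(w, [1.0, 0.0]))
```

At production scale the same objective is optimized over TBs of features, which is what makes the pipeline CPU bound in the training phase.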

Page 6

Pipeline

Labeling

• Label positive/negative samples
• 6-7 hrs, IO intensive, 17 TB of intermediate IO in Hadoop

Feature Extraction

• Extract Features from raw events

Model Training

• Logistic regression phase, CPU bound

Score/Analyze models

• Validate trained models and parameter combinations; select the new model

Validation/Metrics

• Validate and publish new model

Page 7

How to adopt Spark efficiently?

• Very complicated system
• 20+ Hadoop Streaming MapReduce jobs
• 20k+ lines of code
• TBs of data; person-months to do data validation
• 6+ people, 3 quarters to rewrite the system from scratch in Scala

Page 8

Our migration solution

• Built a transition layer that automatically converts Hadoop Streaming jobs to Spark jobs
• No need to change any Hadoop Streaming code
• 2 person-quarters
• Private Spark

Page 9

(Diagram: Audience Expansion Pipeline (20+ Hadoop Streaming jobs) → Hadoop Streaming → ZIPPO: Hadoop Streaming over Spark → Spark → HDFS)

Page 10

ZIPPO

• A layer (zippo) between Spark and application

• Implemented all Hadoop Streaming interfaces

• Migrate pipeline without code rewriting

• Can focus on rewriting perf bottlenecks

• Plan to open source

(Diagram: Audience Expansion Pipeline → Hadoop Streaming → ZIPPO: Hadoop Streaming over Spark → Spark → HDFS)

Page 11

ZIPPO - Supported Features

• Partition related
  – Hadoop Partitioner class (-partitioner)
  – num.map.key.fields, num.map.partition.fields
• Distributed cache
  – -cacheArchive, -file, -cacheFile
• Independent working directory for each task instead of each executor
• Hadoop Streaming aggregation
• Input data combination (to mitigate many small files)
• Customized OutputFormat, InputFormat
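ZIPPO itself is not public, but the core trick of a streaming-over-Spark layer can be sketched: run the unmodified external mapper and reducer commands over stdin/stdout, with a hash-partitioned, sorted shuffle in between (in real Spark this would be done per-RDD-partition, e.g. via `RDD.pipe`). This is a minimal simulation, not ZIPPO's implementation; the function names are invented.

```python
import subprocess
import zlib

def run_streaming_stage(lines, cmd):
    """Pipe input lines through an external command, the way Hadoop
    Streaming invokes a mapper/reducer script over stdin/stdout."""
    proc = subprocess.run(cmd, shell=True, text=True, capture_output=True,
                          input="".join(line + "\n" for line in lines))
    return proc.stdout.splitlines()

def streaming_over_partitions(lines, mapper_cmd, reducer_cmd, num_partitions=2):
    # Map phase: run the unmodified streaming mapper.
    mapped = run_streaming_stage(lines, mapper_cmd)
    # Shuffle: partition by hash of the tab-separated key; Hadoop
    # Streaming guarantees each reducer sees its input sorted by key.
    partitions = [[] for _ in range(num_partitions)]
    for line in mapped:
        key = line.split("\t", 1)[0]
        partitions[zlib.crc32(key.encode()) % num_partitions].append(line)
    # Reduce phase: run the unmodified streaming reducer on each partition.
    out = []
    for part in partitions:
        out.extend(run_streaming_stage(sorted(part), reducer_cmd))
    return out

# Word count, with plain Unix commands standing in for streaming scripts.
mapper = r"""tr ' ' '\n' | awk '{print $1 "\t1"}'"""
counts = streaming_over_partitions(["a b", "b c"], mapper, "uniq -c")
print(counts)
```

The point of the layer is exactly this: the mapper and reducer commands are opaque executables, so the 20k+ lines of existing streaming code run unchanged.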

Page 12

Performance Comparison: 1 TB data

• ZIPPO Hadoop Streaming
  – Spark cluster: 40 hosts, 1 hard drive each
  – Perf data: 1 hr 25 min

• Original Hadoop Streaming
  – Hadoop cluster: 40 hosts, 1 hard drive each
  – Perf data: 3 hrs 5 min

Page 13

SPARK SCALABILITY

Page 14

Spark Shuffle

• The mapper side of the shuffle writes all of its output to disk (shuffle files)

• Data can be large scale, so it cannot all be held in memory

• Reducers transfer all the shuffle files for each partition, then process
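The mechanism above can be simulated in a few lines: each mapper writes one "shuffle file" per reducer partition, and each reducer gathers its partition's file from every mapper before processing. This is an illustrative in-memory stand-in (a dict plays the role of the on-disk files), not Spark's code.

```python
import zlib

NUM_MAPPERS, NUM_PARTITIONS = 3, 4
# (mapper_id, partition_id) -> list of records; stands in for shuffle files on disk.
shuffle_files = {}

def mapper_write(mapper_id, records):
    """Mapper side: append every output record to the shuffle file of the
    partition its key hashes to (one file per reducer partition)."""
    for key, value in records:
        pid = zlib.crc32(key.encode()) % NUM_PARTITIONS
        shuffle_files.setdefault((mapper_id, pid), []).append((key, value))

def reducer_fetch(partition_id):
    """Reducer side: fetch this partition's shuffle file from every mapper
    before processing starts (non-streaming, as the later slides note)."""
    fetched = []
    for mid in range(NUM_MAPPERS):
        fetched.extend(shuffle_files.get((mid, partition_id), []))
    return fetched

# Example: 3 mappers each emit 5 records.
for mid in range(NUM_MAPPERS):
    mapper_write(mid, [("user%d" % i, 1) for i in range(5)])

all_records = [r for pid in range(NUM_PARTITIONS) for r in reducer_fetch(pid)]
print(len(all_records))  # 15: every emitted record reaches exactly one reducer
```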

Page 15

Spark Shuffle

(Diagram: mappers 1…m each write shuffle files 1…n, one per reducer partition; reducer partitions 1…n each collect their shuffle file from every mapper.)

Page 16

On each Reducer

• Every partition needs to hold all the data from all the mappers

• In a hash map, in memory, uncompressed

(Diagram: reducer i with 4 cores holds partitions 1-4; each partition aggregates shuffle files from mappers 1…n.)

Page 17

How many partitions?

• Need partitions small enough that each fits entirely in memory

(Diagram: hosts 1 and 2, 4 cores each, working through partitions 1, 2, 3, … n four at a time.)

Page 18

Spark needs many Partitions

• So a common pattern when using Spark is to use a large number of partitions

Page 19

On each Reducer

• For a host with 64 GB memory and 16 CPU cores
• Assuming a 30:1 compression ratio and 2x memory overhead:
• To process 3 TB of data, needs 46,080 partitions
• To process 3 PB of data, needs ~46 million partitions
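The 46,080 figure follows directly from memory per core. A quick sketch reproducing the slide's arithmetic under its stated assumptions (each core processes one partition at a time, and compressed shuffle data expands 30x in memory plus a 2x overhead):

```python
mem_per_host_gb = 64
cores = 16
compression_ratio = 30   # in-memory expansion of compressed shuffle data
overhead = 2             # additional in-memory overhead factor

mem_per_partition_gb = mem_per_host_gb / cores          # 4 GB per concurrently running partition
data_tb = 3
in_memory_tb = data_tb * compression_ratio * overhead   # 180 TB once expanded
partitions = in_memory_tb * 1024 / mem_per_partition_gb
print(int(partitions))  # 46080
```

For 3 PB the same formula scales by another factor of 1024, giving the ~46 million partitions on the slide. Note the host count never enters the formula, which is the non-scalability point of the next slide.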

Page 20

Not Scalable

• Not linearly scalable
• No matter how many hosts we have in total, we always need 46k partitions

Page 21

Issues with a huge number of partitions

• Issue 1: OOM on the mapper side
  – Each mapper core needs to write to 46k shuffle files simultaneously
  – 1 shuffle file = OutputStream + FastBufferStream + CompressionStream
  – Memory overhead:
    • FD and related kernel overhead
    • FastBufferStream (for turning random IO into sequential IO), default 100 KB buffer per stream
    • CompressionStream, default 64 KB buffer per stream
  – So by default, total buffer size: 164 KB × 46k × 16 = 100+ GB

Page 22

Issues with a huge number of partitions

• Our solution to mapper OOM:
  – Set spark.shuffle.file.buffer.kb to 4k for FastBufferStream (the kernel block size), based on our contributed patch https://github.com/mesos/spark/pull/685
  – Set spark.storage.compression.codec to spark.storage.SnappyCompressionCodec to enable Snappy and reduce the footprint
  – Set spark.snappy.block.size to 8192 to reduce buffer size (while Snappy can still achieve a good compression ratio)
  – Total buffer size after this: 12 KB × 46k × 16 = ~10 GB
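The before/after buffer totals on these two slides are straightforward to reproduce; the slide's "100+ GB" and "~10 GB" are roundings of the exact figures below:

```python
cores = 16
files_per_core = 46080  # open shuffle files per mapper core, one per partition

# Buffers per open shuffle file stream, in KB:
default_kb = 100 + 64   # FastBufferStream (100 KB) + CompressionStream (64 KB)
tuned_kb = 4 + 8        # 4 KB file buffer + 8 KB Snappy block, after tuning

default_total_gb = default_kb * files_per_core * cores / 1024 / 1024
tuned_total_gb = tuned_kb * files_per_core * cores / 1024 / 1024
print(round(default_total_gb), round(tuned_total_gb))  # 115 8
```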

Page 23

Issues with a huge number of partitions

• Issue 2: large number of small files
  – Each input split in the mapper is broken down into at least 46k partitions
  – Large numbers of small files cause lots of random R/W IO
  – When each shuffle file is less than 4 KB (the kernel block size), the overhead becomes significant
  – Significant metadata overhead in the FS layer
  – Example: merely deleting the whole tmp directory by hand can take 2 hours, as we have too many small files
  – Especially bad when splits are not balanced
  – 5x slower than Hadoop

(Diagram: input splits 1…n each produce shuffle files 1…46,080.)

Page 24

Reduce side compression

• Currently, reducer-side shuffle data in memory is not compressed

• It can take 10-100 times more memory
• With our patch https://github.com/mesos/spark/pull/686, we reduced memory consumption by 30x, with a compression overhead of less than 3%

• Without this patch, Spark doesn't work for our case
• 5x-10x performance improvement

Page 25

Reduce side compression

• Reducer side:
  – With compression: 1.6k shuffle files
  – Without compression: 46k shuffle files

Page 26

Reducer Side Spilling

(Diagram: reduce output flows into compression buckets 1…n, which spill to disk as spill 1, spill 2, … spill n.)

Page 27

Reducer Side Spilling

• Spills over-sized data in the aggregation hash table to disk

• Spilling: more IO, but more sequential IO and fewer seeks
• All in memory: less IO, but more random IO and more seeks

• Fundamentally resolves Spark's scalability issue
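A minimal sketch of the spilling idea, assuming a sum-aggregation workload: keep a bounded hash table, write sorted runs to disk when it fills, and merge-aggregate the runs at the end. The class and parameters are hypothetical, not Spark's implementation.

```python
import heapq, pickle, tempfile

class SpillingAggregator:
    """Sum-aggregate (key, value) pairs, spilling the in-memory hash
    table to sorted on-disk runs whenever it exceeds max_entries."""
    def __init__(self, max_entries=1000):
        self.max_entries = max_entries
        self.table = {}
        self.spills = []

    def add(self, key, value):
        self.table[key] = self.table.get(key, 0) + value
        if len(self.table) >= self.max_entries:
            self._spill()

    def _spill(self):
        # Sequential write of a sorted run: more IO than all-in-memory,
        # but sequential writes rather than random access.
        f = tempfile.TemporaryFile()
        pickle.dump(sorted(self.table.items()), f)
        f.seek(0)
        self.spills.append(f)
        self.table = {}

    def results(self):
        self._spill()
        runs = [pickle.load(f) for f in self.spills]
        # heapq.merge streams the sorted runs, so the merge itself
        # never needs all the data in memory at once.
        merged = {}
        for key, value in heapq.merge(*runs):
            merged[key] = merged.get(key, 0) + value
        return merged

agg = SpillingAggregator(max_entries=4)
for i in range(100):
    agg.add("k%d" % (i % 10), 1)
print(agg.results()["k0"])  # 10
```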

Page 28

Align with previous Partition function

• Our input data come from another MapReduce job

• We use exactly the same hash function to reduce the number of shuffle files

Page 29

Previous job generating input data → Spark job

Align with previous Partition function

• New hash function, more even distribution

(Diagram: input data partitioned by mod 4 into key groups 0,4,8…; 1,5,9…; 2,6,10…; 3,7,11…; re-partitioning by mod 5 forces each group to write shuffle files 0-4, i.e. 4 × 5 = 20 shuffle files.)

Page 30

Previous job generating input data → Spark job

Align with previous Partition function

• Use the same hash function

(Diagram: input data partitioned by mod 4 into key groups 0,4,8…; 1,5,9…; 2,6,10…; 3,7,11…; re-partitioning by the same mod 4 lets each group write exactly 1 shuffle file.)

Page 31

Align with previous Hash function

• Our case:
  – 16M shuffle files, 62 KB on average (5-10x slower)
  – 8k shuffle files, 125 MB on average

• Several different input data sources
• Partition function taken from the major one
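The fan-out these slides describe can be checked with a toy count of distinct (input split, shuffle file) pairs. This is an illustrative model, not the pipeline's code: `prev_mod` is the previous job's partition count, `new_mod` the Spark job's.

```python
def shuffle_file_count(prev_mod, new_mod, keys):
    """Count distinct (input split, shuffle file) pairs when data already
    partitioned by key % prev_mod is re-partitioned by key % new_mod."""
    files = set()
    for key in keys:
        split = key % prev_mod    # which input split the key lives in
        shuffle = key % new_mod   # which shuffle file it must be written to
        files.add((split, shuffle))
    return len(files)

keys = range(1000)
print(shuffle_file_count(4, 5, keys))  # 20: every split feeds every shuffle file
print(shuffle_file_count(4, 4, keys))  # 4: each split maps to exactly one file
```

With thousands of splits and partitions instead of 4 and 5, the mismatched case explodes into the 16M tiny files above, while the aligned case keeps one large file per partition.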

Page 32

PERFORMANCE TUNING

Page 33

All About Resource Utilization

• Maximize resource utilization
• Use as much CPU, memory, disk, and network as possible
• Monitor with vmstat, iostat, sar

Page 34

Resource Utilization

(Resource utilization diagram; the slide notes it is an old diagram, to be updated.)

Page 35

Resource Utilization

• Ideally, CPU/IO should be fully utilized

• Mapper phase: IO bound
• Final reducer phase: CPU bound

Page 36

Shuffle file transfer

• Spark transfers all shuffle files into reducer memory before processing starts

• Non-streaming (very hard to change to streaming)

• To avoid poor resource utilization:
  – Make sure maxBytesInFlight is set big enough
  – Consider allocating 2x more threads than the number of physical cores
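The in-flight budget named above can be illustrated with a small concurrency sketch: fetch threads block until the total bytes being transferred drops below the limit, and running more threads than cores lets fetch waits overlap with compute. This is a toy model of the idea, not Spark's fetcher; the class name and 48 MB budget are illustrative.

```python
import threading

class FetchBudget:
    """Bound the total bytes of shuffle blocks in flight, in the spirit of
    Spark's maxBytesInFlight: a fetch blocks until budget frees up."""
    def __init__(self, max_bytes_in_flight):
        self.budget = max_bytes_in_flight
        self.cv = threading.Condition()

    def acquire(self, nbytes):
        with self.cv:
            while self.budget < nbytes:
                self.cv.wait()
            self.budget -= nbytes

    def release(self, nbytes):
        with self.cv:
            self.budget += nbytes
            self.cv.notify_all()

budget = FetchBudget(48 * 1024 * 1024)  # e.g. a 48 MB in-flight budget
fetched = []

def fetch_block(block_id, size):
    budget.acquire(size)  # blocks while too many bytes are already in flight
    try:
        fetched.append((block_id, size))  # stand-in for the network transfer
    finally:
        budget.release(size)

# Run 2x as many fetch threads as a 16-core host has cores.
threads = [threading.Thread(target=fetch_block, args=(i, 8 * 1024 * 1024))
           for i in range(32)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(fetched))  # 32
```

Too small a budget serializes the fetches; a generous budget plus extra threads keeps the network busy while other threads process already-fetched blocks.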

Page 37

Thanks.

Gavin Li [email protected]
Jaebong Kim [email protected]
Andy Feng [email protected]