A Year With Spark
Martin Goodson, VP Data Science
Skimlinks
Phase I of my big data experience
R files, Python files, Awk, Sed, a job scheduler (Sun Grid Engine), Make/bash scripts
Phase II of my big data experience
Pig + Python user-defined functions
Phase III of my big data experience
PySpark?
Skimlinks data
Automated Affiliatization Tech
● 140,000 publisher sites
● Collect 30 TB/month of user behaviour (clicks, impressions, purchases)
Data science team
● 5 data scientists
● Machine learning or statistical computing
● Varying programming experience
● Not engineers
● No devops
Reality
Spark Can Be Unpredictable
Reality
● Learning in depth how Spark works
● Trying to divide and conquer
● Learning how to configure Spark properly
Learning in depth how spark works
Read all this:
● https://spark.apache.org/docs/1.2.1/programming-guide.html
● https://spark.apache.org/docs/1.2.1/configuration.html
● https://spark.apache.org/docs/1.2.1/cluster-overview.html
And then:
● https://www.youtube.com/watch?v=49Hr5xZyTEA (Spark internals)
● https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
Try to divide and conquer
Don't throw 30 TB of data at a Spark script and expect it to just work.
Divide the work into bite-sized chunks - aggregating and projecting as you go.
Try to divide and conquer
Use reduceByKey() not groupByKey()
Use max() and add() (cf. http://www.slideshare.net/samthemonad/spark-meetup-talk-final)
Start with this:
(k1, 1), (k1, 1), (k1, 2), (k1, 1), (k1, 5), (k2, 1), (k2, 2)
Use RDD.reduceByKey(add) to get this:
(k1, 10), (k2, 3)
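A minimal runnable sketch of the above, assuming an existing SparkContext sc:

from operator import add

pairs = sc.parallelize([('k1', 1), ('k1', 1), ('k1', 2), ('k1', 1),
                        ('k1', 5), ('k2', 1), ('k2', 2)])
# add is applied within each partition first, then across partitions
print(pairs.reduceByKey(add).collect())  # [('k1', 10), ('k2', 3)]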
Key concept: reduceByKey(combineByKey)
[Diagram: input pairs (k1, 1), (k1, 1), (k1, 2), (k1, 1), (k1, 5) are first combined locally within each partition (combineLocally), giving per-partition maps {k1: 2, …}, {k1: 3, …}, {k1: 5, …}; these are then merged across partitions (_mergeCombiners) into {k1: 10, …}. The number of output partitions is controlled via reduceByKey(numPartitions).]
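Under the hood, reduceByKey(f) is combineByKey() with f used both to merge values locally and to merge the per-partition combiners; a minimal sketch of the equivalent call, reusing the pairs RDD from above:

from operator import add

totals = pairs.combineByKey(lambda v: v,  # createCombiner: first value seeds the combiner
                            add,          # mergeValue: combine locally within a partition
                            add,          # mergeCombiners: merge across partitions
                            4)            # numPartitions, as in reduceByKey(add, 4)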
PySpark Memory: worked example
● 10 x r3.4xlarge (122 GB, 16 cores)
● Use half of each machine for the executor: 60 GB
● Number of cores = 120 (12 per machine)
● Cache = 60% x 60 GB x 10 = 360 GB
● Each Java thread: 40% x 60 GB / 12 = ~2 GB
● Each Python process: ~4 GB
● OS: ~12 GB

spark.executor.memory=60g
spark.cores.max=120
spark.driver.memory=60g

~/spark/bin/pyspark --driver-memory 60g
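The same settings can also be supplied programmatically; a minimal sketch with Spark 1.x property names. Note the deck passes --driver-memory on the command line: by the time a SparkConf is read, the driver JVM has already started, so driver memory must be set at launch.

from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .set('spark.executor.memory', '60g')
        .set('spark.cores.max', '120')
        .set('spark.driver.memory', '60g'))
sc = SparkContext(conf=conf)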
PySpark: other memory configuration
spark.akka.frameSize=1000
spark.kryoserializer.buffer.max.mb=10
(spark.python.worker.memory)
PySpark: other configuration
spark.shuffle.consolidateFiles=true
spark.rdd.compress=true
spark.speculation=true
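These can also be passed at launch time with the --conf flag, in the same style as the pyspark invocation above:

~/spark/bin/pyspark \
  --conf spark.shuffle.consolidateFiles=true \
  --conf spark.rdd.compress=true \
  --conf spark.speculation=true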
Errors
java.net.SocketException: Connection reset
java.net.SocketTimeoutException: Read timed out
Lost executor, cancelled key exceptions, etc.
All of the above are caused by memory errors!
Errors
'ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue': happens when filter() leaves little data spread over many partitions - use coalesce()
collect() fails - increase driver memory and spark.akka.frameSize
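A minimal sketch of the coalesce() fix, with a hypothetical RDD big_rdd and predicate is_interesting:

# filtering leaves thousands of near-empty partitions; shrink them before further work
small = big_rdd.filter(lambda x: is_interesting(x)).coalesce(100)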
Were our assumptions correct?
● We have a very fast development process.
● Use Spark for development and for scale-up.
● Scalable data science development.
Large-scale machine learning with Spark and Python
Empowering the data scientist
by Maria Mestre
ML @ Skimlinks
● Mostly for research and prototyping
● No developer background
● Familiar with scikit-learn and Spark
● Building a data scientist toolbox
Every ML system….
➢ Scraping pages
➢ Filtering
➢ Segmenting urls
➢ Sample training instances
➢ Training a classifier
➢ Applying a classifier
Data collection: scraping lots of pages
This is how I would do it on my local machine…
● use the Scrapy package
● write a function scrape() that creates a Scrapy object

urls = open('list_urls.txt', 'r').readlines()
output = s3_bucket + 'results.json'
scrape(urls, output)
Distributing over the cluster
def distributed_scrape(urls, index, s3_bucket):
    output = s3_bucket + 'part' + str(index) + '.json'
    scrape(urls, output)
    return []  # mapPartitionsWithIndex expects an iterable back

urls = open('list_urls.txt', 'r').readlines()
urls = sc.parallelize(urls, 100)
urls.mapPartitionsWithIndex(lambda index, urls: distributed_scrape(urls, index, s3_bucket)).count()  # count() forces evaluation
Installing scrapy over the cluster
1/ Need to use Python 2.7:
echo 'export PYSPARK_PYTHON=python2.7' >> ~/spark/conf/spark-env.sh
2/ Use pssh to install packages on the slaves:
pssh -h /root/spark-ec2/slaves 'easy_install-2.7 Scrapy'
Example: filtering
● we want to find the activity of 30M users in 2 months of activity: 2 GB (user list) vs 6 TB (events)
  ○ map-side join using broadcast() ⇒ does not work with large objects!
    ■ e.g. input.filter(lambda x: x['user'] in user_list_b.value)
  ○ use mapPartitions()
    ■ e.g. input.mapPartitions(lambda x: read_file_and_filter(x))
[Diagram: from 6 TB (~11B records) of input, one approach takes 35 mins and yields 113 GB (529M matches); a bloom filter join takes 9 mins and yields 60 GB (515M matches).]
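A minimal sketch of the broadcast-variable filter above, assuming an RDD input of event dicts and a small file of user ids (the file name is hypothetical):

user_ids = set(open('user_ids.txt').read().split())
user_list_b = sc.broadcast(user_ids)  # fine for small sets; fails for very large objects
matches = input.filter(lambda x: x['user'] in user_list_b.value)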
Example: segmenting urls
● we want to convert a url 'www.iloveshoes.com' to ['i', 'love', 'shoes']
● Segmentation
  ○ wordsegment package in Python ⇒ very slow!
  ○ 300M urls take 10 hours with 120 cores!
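For reference, a minimal local sketch of the segmentation step. The load()/segment() API shown is from recent wordsegment releases and is an assumption; older versions exposed segment() directly.

from wordsegment import load, segment

load()                        # load the corpus word counts
print(segment('iloveshoes'))  # ['i', 'love', 'shoes']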
Example: getting a representative sample
Our solution in Spark!
sample = sc.parallelize([], 1)
sample_size = 1000
input.cache()
for category, proportion in stats.items():
    category_pages = input.filter(lambda x: x['category'] == category)
    category_sample = category_pages.takeSample(False, int(sample_size * proportion))
    sample = sample.union(sc.parallelize(category_sample))  # takeSample returns a list, not an RDD
MLlib offers a probabilistic solution (not an exact sample size), via sampleByKey on an RDD of (category, record) pairs:
sample = input.sampleByKey(False, stats)
Grid search for hyperparameters
Problem: we have some candidate values [1, 2, ..., 10000] for a hyperparameter - which one should we choose?
If the data is small enough that processing time is fine
➢ Do it on a single machine
If the data is too large to process on a single machine
➢ Use MLlib
If the data can be processed on a single machine but takes too long to train
➢ The next slide!
number of combinations = |{parameters}| = 2
Using cross-validation to optimise a hyperparameter
1. separate the data into k equally-sized chunks
2. for each candidate value i:
   a. use (k-1) chunks to fit the classifier parameters
   b. use the remaining chunk to get a classification score
   c. report the average score
3. at the end, select the value that achieves the best average score
number of combinations = |{parameters}| x |{folds}| = 4
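When the data fits on one machine but training is slow, one way to use Spark is to broadcast the (small) training set and score one (candidate value, fold) pair per task. A hedged sketch with scikit-learn, assuming numpy arrays X and y; the candidate list and LogisticRegression's C parameter are illustrative assumptions:

from collections import defaultdict
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def evaluate(c, fold, X, y, k=2):
    # fit on (k-1) chunks, score on the held-out chunk
    train_idx, test_idx = list(KFold(n_splits=k).split(X))[fold]
    clf = LogisticRegression(C=c).fit(X[train_idx], y[train_idx])
    return clf.score(X[test_idx], y[test_idx])

X_b, y_b = sc.broadcast(X), sc.broadcast(y)
candidates, k = [1, 2, 10000], 2
combos = sc.parallelize([(c, f) for c in candidates for f in range(k)])
results = combos.map(lambda cf: (cf[0], evaluate(cf[0], cf[1], X_b.value, y_b.value, k))).collect()

# average the per-fold scores in the driver and pick the best value
scores = defaultdict(list)
for c, s in results:
    scores[c].append(s)
best = max(scores, key=lambda c: sum(scores[c]) / len(scores[c]))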
Apply the classifier over the new_data: easy!
With scikit-learn:
classifier_b = sc.broadcast(classifier)
new_labels = new_data.map(lambda x: classifier_b.value.predict(x))
With scikit-learn, when the model cannot be broadcast:
● save the classifier models to files, ship them to S3
● use mapPartitions to read the model parameters and classify
With MLlib:
model._threshold = None  # so predict() returns raw scores
new_labels = new_data.map(lambda x: model.predict(x))
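A hedged sketch of the mapPartitions route for a model too large to broadcast; the pickle file path and per-row feature format are assumptions:

import pickle

def classify_partition(rows):
    # load the model once per partition, not once per record
    clf = pickle.load(open('/mnt/model.pkl', 'rb'))
    for row in rows:
        yield clf.predict([row])[0]

new_labels = new_data.mapPartitions(classify_partition)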
Thanks!
Apache Spark for Big Data
Spark at Scale & Performance Tuning
Sahan Bulathwela | Data Science Engineer @ Skimlinks
Outline
● Spark at scale: Big Data Example
● Tuning and Performance
Spark at Scale: Big Data Example
● Yes, we use Spark!!
● Not just for prototyping or one-time analyses
● Run automated analyses at a large scale on a daily basis
● Use-case: generating audience statistics for our customers
Before…
● We provide data products based on audience statistics to customers
● Extract event data from Datastore
● Generate Audience statistics and reports
Data
● Skimlinks records web data in terms of user events such as clicks, impressions, etc.
● Our data!!
  ○ Records 18M clicks (11 GB)
  ○ Records 203M impressions (950 GB)
  ○ These numbers are on a daily basis (Oct 01, 2014)
● About 1 TB of relevant events
A few days and data scientists later...
Statistics
Major pain points
● Most of the data is not relevant
  ○ Only 3-4 out of 30-ish fields are useful for each report
● Many duplicate steps
  ○ Reading the data
  ○ Extracting relevant fields
  ○ Transformations such as classifying events
Solution
Aggregation doing its magic
● Mostly grouping events and summarizing
● Distribute the workload in time
● "Reduce by" instead of "Group by"
● BOTS
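A hedged sketch of the "reduce by" aggregation for daily summaries; the RDD name and event fields are assumptions:

from operator import add

# count events per (user, event type) without shuffling raw events:
# partial sums are computed per partition, then merged
daily_counts = (events
                .map(lambda e: ((e['user_id'], e['event_type']), 1))
                .reduceByKey(add))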
Deep Dive
[Pipeline diagram: Datastore → Events (1 TB) → Build Daily Profiles → Daily Profiles (1.8 GB, intermediate data structure, compressed with GZIP) → Build Monthly Profiles → Monthly Aggregate (40 GB) → Generate Audience Statistics → Statistics (7 GB) → Customers]
● Takes 4 hours
● 150 statistics
● Delivered daily to clients
SO WHAT???
                                        Before        After
Computing daily event summary           1+ DAYS !!!   20 mins
Computing monthly aggregate             -             40 mins
Storing daily event summary             100s of GBs   1.8 GB
Storing monthly aggregate               -             40 GB
Total time taken for generating stats   1+ DAYS !!!   3 hrs 30 mins
Time taken per report                   1+ DAYS !!!   1.4 mins
Parquet enabled us to reduce our storage costs by 86% and increase data loading speed by 5x
[Charts: storage size before/after Parquet, and performance when parsing 31 daily profiles]
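For context, a sketch of writing and reading daily profiles as Parquet with the Spark 1.x SQL API; the paths and RDD name are assumptions, and later Spark versions replaced these calls with the DataFrame reader/writer:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame(daily_profiles)  # daily_profiles: an RDD of Rows (assumed)
df.saveAsParquetFile('s3n://bucket/daily_profiles.parquet')
profiles = sqlContext.parquetFile('s3n://bucket/daily_profiles.parquet')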
Thank You !!