Hadoop & Spark Performance tuning using Dr. Elephant
Dr. Elephant: github.com/linkedin/dr-elephant
Akshay Rai, Hadoop Dev Team
Introduction
Scaling Hadoop Infrastructure
Scale and Optimize Hardware
● More users, more jobs, more resources
● Large investment in hardware
● Can’t keep upgrading and adding machines to solve the problem forever
● Some tuning is needed to get things running
Users are more valuable than machines
What do we do?
Improve User Productivity
User Productivity
● Freedom to experiment and run jobs on the cluster
● Build tools to help developers. (Hadoop DSL, Resolvers for Pig/Hive)
○ Improve developer lifecycle
○ Also reduce unnecessary resource wastage
The Tuning Problem
How easy is it to tune a job?
● Problems are not obvious
● Critical information is scattered
● Inter-related settings
● Large parameter space
Here’s what we learned!
Expert Intervention
● Not enough support resources available
● Poor coverage
● Difficult to prioritize efforts
● Delays user development
Random Suggestions
Training is not at all easy
● Too many users
● Diverse backgrounds
● Scope is large and evolving
● Other responsibilities are more important
Scaling Productivity is Hard!
Dr. Elephant to the Rescue
What does Dr. Elephant do?
● Automated performance monitoring and tuning tool
● Helps every user get the best performance from their jobs
● Highlights common mistakes
● Indicates best practices and tuning tips
● Provides a platform for other performance-related tools
● Analyzes a hundred thousand jobs every day
Architecture
Dashboard
Search
Job Page
MapReduce Report
Failed Job
Help Page
Tuning Tips
Awesome Features
Simplified analysis of a flow’s historical executions
● Monitoring performance, resource usage and other metrics
● Comparing flows against previous executions
● Impact of tuning a specific parameter or changing a line of code
Flow History
Job History
Heuristics
How does a Heuristic work?
● Fetch Counters and Task Data
● Some logic to compute a value
● Compare the value against threshold levels (sketched below)
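A minimal sketch of that compute-and-compare step, in Scala, assuming a simplified severity model. The names here are illustrative stand-ins, not Dr. Elephant's actual classes; the four ascending thresholds mirror the style of the deviation_severity values configured in HeuristicConf.xml later in this deck.

// Sketch of the "compute a value, compare it against thresholds" pattern.
// Severity names follow the table below; the thresholds are illustrative.
object SeverityExample {

  sealed trait Severity
  case object NoneSeverity extends Severity
  case object Low          extends Severity
  case object Moderate     extends Severity
  case object Severe       extends Severity
  case object Critical     extends Severity

  // Map a computed value onto a severity using four ascending thresholds,
  // e.g. deviation_severity = 2, 4, 8, 16 in HeuristicConf.xml.
  def severityOf(value: Double, thresholds: Seq[Double]): Severity = {
    require(thresholds.size == 4, "expected LOW/MODERATE/SEVERE/CRITICAL cut-offs")
    if      (value >= thresholds(3)) Critical
    else if (value >= thresholds(2)) Severe
    else if (value >= thresholds(1)) Moderate
    else if (value >= thresholds(0)) Low
    else                             NoneSeverity
  }

  def main(args: Array[String]): Unit = {
    // Example: the largest mapper input is 10x the average mapper input.
    val deviation = 640.0 / 64.0
    println(severityOf(deviation, Seq(2.0, 4.0, 8.0, 16.0)))  // prints Severe
  }
}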
Heuristic Severity
Severity | Description
CRITICAL | The job is in a critical state and must be tuned
SEVERE | There is scope for improvement
MODERATE | There is scope for further improvement
LOW | There is scope for a few minor improvements
NONE | The job is safe; no tuning necessary
Example | Mapper Data Skew
Mapper Skew Problem
● The number of Mappers depends on the number of splits
● Varying split sizes can cause skew in the Mapper input
Solution to Mapper Skew
● Each Mapper should process the same amount of data
● Combine the small chunks and feed them to a single Mapper (see the sketch below)
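One hedged illustration of the combining approach for a plain MapReduce job, in Scala (the job wiring and the 256 MB cap are made-up examples; Pig and Hive have their own split-combination settings): CombineTextInputFormat packs many small chunks into splits capped at a configurable size, so each Mapper receives roughly the same amount of input.

// Sketch: pack small input chunks into combined splits so Mappers get even input.
// The 256 MB cap is illustrative, not a tuned recommendation.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{CombineTextInputFormat, FileInputFormat}

object CombineSmallSplits {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration(), "combine-small-splits")

    // Combine many small files/blocks into splits of at most ~256 MB each,
    // instead of one (possibly tiny) split per file.
    job.setInputFormatClass(classOf[CombineTextInputFormat])
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024)

    // ... set mapper, reducer and input/output paths as usual ...
  }
}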
Example | Spark Executor Load Balance
[Diagram: a Spark Driver distributing an RDD's Partitions 1-3 across Executors 1, 2 and 3]
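A rough Spark sketch of the balance problem (the input path and partition counts are illustrative): when an RDD's partitions are uneven, some executors sit idle while others are overloaded; repartitioning shuffles the records into evenly sized partitions spread across the executors.

// Sketch: rebalance an RDD so its partitions, and therefore executor load, are even.
import org.apache.spark.{SparkConf, SparkContext}

object ExecutorBalanceExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("executor-balance"))

    val skewed = sc.textFile("hdfs:///data/events")   // illustrative input path

    // Inspect how records are spread across the current partitions.
    val sizes = skewed.mapPartitions(it => Iterator(it.size)).collect()
    println(s"records per partition: ${sizes.mkString(", ")}")

    // Shuffle into a fixed number of evenly sized partitions, e.g. a small
    // multiple of the total executor cores, so each executor gets similar work.
    // (Use coalesce instead when only shrinking the partition count.)
    val balanced = skewed.repartition(sc.defaultParallelism * 2)

    println(balanced.count())
    sc.stop()
  }
}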
Custom Heuristics
Adding a New Heuristic
1. Create a new heuristic and test it (a skeleton sketch follows these steps).
2. Create a new view for the heuristic. For example, helpMapperSpill.scala.html
3. Add the details of the heuristic in the HeuristicConf.xml file.
<heuristic>
  <applicationtype>mapreduce</applicationtype>
  <heuristicname>Mapper GC</heuristicname>
  <classname>com.linkedin.dre.mapreduce.heuristics.MapperGC</classname>
  <viewname>views.html.help.mapreduce.helpGC</viewname>
</heuristic>
4. Run Dr. Elephant. It should now include the new heuristic.
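For step 1, a skeleton along these lines can help picture what the <classname> entry points at. Heuristic, ApplicationData and HeuristicResult below are simplified stand-ins invented for illustration, not Dr. Elephant's actual API; the real types live under com.linkedin.drelephant.* and have different signatures.

// Simplified skeleton of a heuristic like the one registered above.
// All types are stand-ins; consult the Dr. Elephant source for the real interface.
trait ApplicationData {
  def counter(group: String, name: String): Long   // counters fetched from job history
}

case class HeuristicResult(name: String, severity: String, details: Seq[String])

trait Heuristic {
  def apply(data: ApplicationData): HeuristicResult
}

class MapperGCHeuristic extends Heuristic {
  override def apply(data: ApplicationData): HeuristicResult = {
    // Fetch counters and compute a value: share of CPU time spent in garbage collection.
    val cpuMs   = data.counter("TaskCounter", "CPU_MILLISECONDS")
    val gcMs    = data.counter("TaskCounter", "GC_TIME_MILLIS")
    val gcRatio = if (cpuMs == 0) 0.0 else gcMs.toDouble / cpuMs

    // Compare against threshold levels (illustrative cut-offs).
    val severity =
      if (gcRatio >= 0.05) "SEVERE"
      else if (gcRatio >= 0.01) "MODERATE"
      else "NONE"

    HeuristicResult("Mapper GC", severity, Seq(f"GC time ratio: $gcRatio%.3f"))
  }
}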
Configuring Heuristics/Threshold levels
<heuristics>
  <heuristic>
    <applicationtype>mapreduce</applicationtype>
    <heuristicname>Mapper Data Skew</heuristicname>
    <classname>com.linkedin.dre.mapreduce.heuristics.MapperDataSkew</classname>
    <viewname>views.html.help.mapreduce.helpMapperDataSkew</viewname>
    <params>
      <num_tasks_severity>10, 50, 100, 200</num_tasks_severity>
      <deviation_severity>2, 4, 8, 16</deviation_severity>
      <files_severity>1/8, 1/4, 1/2, 1</files_severity>
    </params>
  </heuristic>
</heuristics>
Elephagent
Workflow monitoring and reports
● Performance characteristics change
○ Data Growth
○ Data distribution change
○ Hardware change
○ Incremental software change
● Monitor performance on each execution
● Compare behaviour across revisions
● Cost to Serve analysis
Production Reviews | JIRA Bot
● Separate cluster for critical workloads
● Audit before deployment
● Improved accuracy
● Faster turnaround
● Higher throughput
Future Plans
Upcoming
● Job Resource Usage and Wastage
● Job Wait time
● Real time analysis of a job
● Workflow DAG visualization
● Improved Spark heuristics
References
Engineering Blog: engineering.linkedin.com/blog/2016/04/dr-elephant-open-source-self-serve-performance-tuning-hadoop-spark
Open Source GitHub Link: github.com/linkedin/dr-elephant
Mailing List: Dr-elephant-users
Hadoop Summit 2015: https://www.youtube.com/watch?v=aL3OJ4YoxPA
Thank You
©2014 LinkedIn Corporation. All Rights Reserved.