Hado"ops" or Had"oops"
Transcript of Hado"ops" or Had"oops"
Proprietary & Confidential. Copyright © 2014
Hado'ops' or Had'oops'
We're Hiring: rocketfuel.com/careers
Kishore Kumar Yellamraju, Abhijit Pol
The Web Is Monetized By Advertising
Delivery Methods
» Display
» Video
» Mobile
» Social
Overview
[Diagram: request flow among Browser, Publishers, Ad Exchange (some exchange partners), Rocket Fuel Platform, Data Partners, and Advertisers]
1. Page Request
2. Ad Request
3. Bid Request
4. Bid & Ad
5. Rocket Fuel Winning Ad
6. Ad Served
Rocket Fuel Platform: Real-time Bidder (automated decisions), Models, Data Store
Other labels in the diagram: User Segments, User Engagements, Ads & Budget, Model Scores, Events, Refresh learning, Optimize
Real Time Bidding and Serving
Always buying the best impressions & serving the best ad
[Figure: stream of per-impression bid prices; inputs shown: Site/Page, Geo/Weather, Time of Day, Brand Affinity, User]
Real Time Bidding and Serving
Goals: Leads & sales; Coupon downloads; Brand awareness
Impression Scorecard factors: Demo, Brand Affinity, Time of Day, Geo/Weather, Site/Page, Ad Position, In-Market Behavior, Response
[Figure: three impression scorecards, each marked ✓ or X]
Per-factor scores and overall score as shown on the slide:
+100 +40 -20 +20 +15 +10 +40 +35 | +97
+40 -70 -20 +10 +15 -25 -40 -18 | +07
+10 -10 -20 +20 +10 -35 -25 +10 | +14
Overview (2)
[Same diagram as the Overview slide]
Throughput
Requests per day:
Facebook likes: 5 B
Searches on Google: 6 B
Bid requests considered by Rocket Fuel: 45 B
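To put the daily bid-request figure in per-second terms, here is a small back-of-the-envelope sketch. It takes the 45 B requests per day from the slide at face value and assumes, purely for illustration, that traffic is spread evenly across the day; real traffic has peaks, so the peak rate would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_per_second(requests_per_day: float) -> float:
    """Average request rate, assuming traffic is spread evenly over the day."""
    return requests_per_day / SECONDS_PER_DAY


bid_requests_per_day = 45e9  # "45 B" from the Throughput slide
rate = average_per_second(bid_requests_per_day)
print(f"{rate:,.0f} bid requests per second on average")
```

Under those assumptions the average works out to a little over half a million bid requests per second.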
Latency
Time (ms):
Blink of an eye: 400
SF to Tokyo network round trip: 100
One beat of a hummingbird's wing: 20
Lookup in Blackbird: 2
Architecture and Scale
» Datacenters
» Scale
» Growth
» Architecture
Data Center Expansion
Data Center Design
• Racks custom built at Rocket Fuel
• Leased space/bandwidth in colocation facilities
Hadoop Server: 20 2U servers (85kW)
Bidders: 40 2-U Twin 2 servers (17kW)
Rocket Fuel Scale
» 34474 CPU processor cores
  – 2655 servers
  – 1874 Teraflops of computing
» 188 Terabytes of memory
  – 13X the memory of IBM computer Watson that played Jeopardy
» 42 PB of storage
  – 106X the data volume of the entire Library of Congress
Hadoop at Rocket Fuel
» 1400 servers
» 15K Disks
» 15K Cores
» 90 TB
» 30K MR slots
» 12K daily MR jobs
Growth
In 1 year: 200 servers → 1400 servers
5 PB → 41 PB (8x)
Data Architecture 30
Hadoop Setup
[Diagram: QJM / ZK quorum, JT, Active NN and Standby NN (each with a ZKFC), and DN/TT worker nodes]
Node hardware shown in the diagram:
» 6x2TB disks, 2x6 core, 196 GB RAM, 2x1G NIC
» 12x3TB disks, 2x6 core, 64 GB RAM, 10G NIC
» Same as DNs, with a dedicated disk for ZK or JN
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Maintenance is Not Easy
Automation is key: Puppet + Infradb
Puppet and Infradb
» Automate as much as you can
» Adding a slave node to a Hadoop cluster: < 120 seconds
» Bringing up a new Hadoop cluster: < 500 seconds
» MR slots are automatically determined based on hardware config
Isn't it cool?
Just define once
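The slides do not say how the slot counts are derived, so the following is only a sketch of one plausible heuristic: reserve some cores and memory for the DataNode and TaskTracker daemons, cap the slot count by whichever of CPU or RAM runs out first, and split the result between map and reduce slots. The `NodeHardware` type, the `mr_slots` function, and every default value are hypothetical. In a Puppet setup the inputs would typically come from node facts.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class NodeHardware:
    cores: int
    ram_gb: int


def mr_slots(hw: NodeHardware,
             reserved_cores: int = 2,
             reserved_ram_gb: int = 8,
             ram_per_slot_gb: int = 2,
             map_fraction: float = 2 / 3) -> Tuple[int, int]:
    """Return (map_slots, reduce_slots) for one TaskTracker node.

    All defaults are illustrative assumptions, not values from the slides.
    """
    by_cpu = max(hw.cores - reserved_cores, 1)
    by_ram = max((hw.ram_gb - reserved_ram_gb) // ram_per_slot_gb, 1)
    total = min(by_cpu, by_ram)  # the scarcer resource sets the limit
    map_slots = max(round(total * map_fraction), 1)
    reduce_slots = max(total - map_slots, 1)
    return map_slots, reduce_slots


# Worker profile from the Hadoop Setup slide: 2x6 cores, 64 GB RAM.
print(mr_slots(NodeHardware(cores=12, ram_gb=64)))
```

Defining the rule once and feeding it per-node hardware is what lets a new node join with the right slot counts without hand-edited configuration.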
Performance Tuning
No issues when the cluster is small. Problems start when it grows.
Performance Tuning
HDFS:
  dfs.datanode.handler.count, dfs.namenode.handler.count
  dfs.datanode.max.transfer.threads, dfs.image.transfer.timeout
MR:
  mapred.reduce.parallel.copies (mapreduce.reduce.shuffle.parallelcopies)
  mapred.job.tracker.handler.count
  io.sort.mb, io.sort.factor
  IMP: MAPREDUCE-2026
ZK:
  maxClientCnxns
JVM:
  -XX:+UseConcMarkSweepGC
  -XX:CMSFullGCsBeforeCompaction=1
  -XX:CMSInitiatingOccupancyFraction=60
ha-timeoutms
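The slide lists the properties but not the values used. As one illustration of why handler counts need revisiting as a cluster grows, the sketch below applies a commonly cited rule of thumb for dfs.namenode.handler.count: 20 × ln(number of DataNodes). That rule is an assumption brought in here, not something stated in the talk, and the function name is hypothetical.

```python
import math


def namenode_handler_count(num_datanodes: int) -> int:
    """Rule-of-thumb starting point: 20 * ln(cluster size), never below 10."""
    if num_datanodes < 1:
        raise ValueError("cluster must have at least one DataNode")
    return max(10, int(20 * math.log(num_datanodes)))


for nodes in (200, 1400):  # cluster sizes from the Growth slide
    print(nodes, namenode_handler_count(nodes))
```

A value that was comfortable at 200 nodes ends up well short of this starting point at 1400 nodes, which fits the pattern of settings that only become a problem once the cluster grows.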
We Have an Issue
MAPREDUCE-5351
MAPREDUCE-5508
keep.failed.task.files=true
JT OOM
Instances of "JobInProgress" class = no. of users who submitted jobs × mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.retirejob.interval
mapred.jobtracker.retiredjobs.cache.size
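The relationship on this slide can be worked through numerically. The sketch below computes the upper bound on retained JobInProgress objects and a rough heap estimate. The default of 100 for mapred.jobtracker.completeuserjobs.maximum is the stock Hadoop 1.x value as far as I know. The user count and the per-job memory footprint are made-up illustration values, and both function names are hypothetical.

```python
def retained_job_objects(num_users: int,
                         completeuserjobs_maximum: int = 100) -> int:
    """Upper bound on completed JobInProgress objects kept in JobTracker memory."""
    return num_users * completeuserjobs_maximum


def estimated_heap_mb(num_users: int,
                      completeuserjobs_maximum: int,
                      mb_per_job: float) -> float:
    """Rough heap held by retained jobs; mb_per_job is an assumed average."""
    return retained_job_objects(num_users, completeuserjobs_maximum) * mb_per_job


# Hypothetical cluster: 50 users, 1 MB of JobTracker heap per retained job.
for limit in (100, 10):
    jobs = retained_job_objects(50, limit)
    heap = estimated_heap_mb(50, limit, mb_per_job=1.0)
    print(f"limit={limit}: up to {jobs} jobs retained, about {heap:.0f} MB")
```

Because the bound scales with the number of users, a per-user limit that is harmless on a small cluster can exhaust the JobTracker heap once many users share it.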
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Monitoring
Wall of Ops
Monitoring
hadoop.namenode.CallQueueLength
hadoop.jobtracker.jvm.memheapusedm
Don't fly blind; you will crash
MR Workload Monitoring
Network Monitoring
Don't blame the network; monitor it instead. A network mesh can be a mess.
Alerting
Monitoring is not enough; you need better alerting.
Alerts
>> Checking whether NN and JT are up is a no-brainer
>> Reduce alert noise by having summary/aggregate alerts
>> We rely heavily on custom scripts that query JMX for NN and JT
http://hostname:port/jmx
qry=Hadoop:service=NameNode,name=NameNodeInfo
  NameDirStatuses, DeadNodes, NumberOfMissingBlocks
qry=Hadoop:service=NameNode,name=FSNamesystemState
  FSState, CapacityRemaining, NumDeadDataNodes, UnderReplicatedBlocks
qry=hadoop:service=JobTracker,name=JobTrackerInfo
  Blacklisted TTs, jobs, slots_used, ThreadCount
qry=java.lang:type=Memory
  Used JVM, free JVM, etc.
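A custom check of the kind described on this slide might look roughly like the sketch below. It is not the authors' script. It queries the /jmx servlet, reads a few of the attributes named on the slide, and folds all failing checks into one summary alert. The function names, the thresholds, and the use of the `CapacityTotal` attribute (not listed on the slide) are assumptions.

```python
import json
from typing import List, Optional
from urllib.request import urlopen

NN_INFO = "Hadoop:service=NameNode,name=NameNodeInfo"
FS_STATE = "Hadoop:service=NameNode,name=FSNamesystemState"


def fetch_bean(host: str, port: int, query: str) -> dict:
    """Return the first bean matching `query` from a daemon's /jmx servlet."""
    url = f"http://{host}:{port}/jmx?qry={query}"
    with urlopen(url, timeout=10) as resp:
        beans = json.load(resp).get("beans", [])
    return beans[0] if beans else {}


def namenode_problems(info: dict, state: dict,
                      min_free_fraction: float = 0.10) -> List[str]:
    """Collect every failing check so they can be sent as one alert."""
    problems = []
    missing = info.get("NumberOfMissingBlocks", 0)
    if missing > 0:
        problems.append(f"{missing} missing blocks")
    dead = state.get("NumDeadDataNodes", 0)
    if dead > 0:
        problems.append(f"{dead} dead DataNodes")
    total = state.get("CapacityTotal", 0)
    if total and state.get("CapacityRemaining", 0) / total < min_free_fraction:
        problems.append("HDFS free capacity below threshold")
    return problems


def summary_alert(problems: List[str]) -> Optional[str]:
    """One aggregate message instead of one alert per check."""
    return ("NameNode: " + "; ".join(problems)) if problems else None


# Sample beans (made-up values) standing in for fetch_bean(...) results.
sample_info = {"NumberOfMissingBlocks": 3}
sample_state = {"NumDeadDataNodes": 2, "CapacityTotal": 1000, "CapacityRemaining": 50}
print(summary_alert(namenode_problems(sample_info, sample_state)))
```

Against a live cluster the two dictionaries would come from calls such as `fetch_bean(host, port, NN_INFO)` and `fetch_bean(host, port, FS_STATE)`, with the host and port of the NameNode web interface filled in.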
MR Workload Alerting
» Monitoring MR workload and alert
  – An in-house tool that uses the "houdah" Ruby gem monitors:
  – Long-running jobs, jobs with more map tasks, blacklisted TTs with more failure counts, etc.
» Collect details and auto-restart blacklisted TTs
» Parse the JT log file for rogue jobs
» Parse the JT log and collect all job-related info
» White Elephant or hRaven could help
» Parse the scheduler HTML page or use the metrics page:
  http://<JT-hostname>:50030/scheduler?advanced
  http://<JT-hostname>:50030/metrics
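The checks listed above can be sketched as simple threshold rules over job and TaskTracker data. The in-house tool is Ruby-based and is not shown in the slides, so everything here is a hypothetical stand-in: the `Job` record, the function names, and the thresholds. It assumes the job list and failure counts have already been collected from the JobTracker.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    job_id: str
    user: str
    runtime_minutes: float
    map_tasks: int


def flag_jobs(jobs: List[Job],
              max_runtime_minutes: float = 240,
              max_map_tasks: int = 50000) -> Dict[str, List[str]]:
    """Map job_id to the reasons it looks suspicious (thresholds are assumptions)."""
    flagged: Dict[str, List[str]] = {}
    for job in jobs:
        reasons = []
        if job.runtime_minutes > max_runtime_minutes:
            reasons.append("long running")
        if job.map_tasks > max_map_tasks:
            reasons.append("too many map tasks")
        if reasons:
            flagged[job.job_id] = reasons
    return flagged


def trackers_to_restart(failure_counts: Dict[str, int],
                        max_failures: int = 20) -> List[str]:
    """TaskTracker hosts whose failure count exceeds the assumed limit."""
    return sorted(host for host, n in failure_counts.items() if n > max_failures)


jobs = [Job("job_1", "etl", 30, 1200), Job("job_2", "adhoc", 600, 90000)]
print(flag_jobs(jobs))
print(trackers_to_restart({"tt-01": 3, "tt-02": 45}))
```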
Multi Tenancy
[Figure: workloads sharing the cluster: Modeling, OPS, ETL, Ad-hoc]
Scheduling
No scheduler is perfect unless you understand and tune it properly.
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
BCP
» BCP: Business Continuity Plan
» Near-real-time reporting over 15+ TB of daily data
» Freshness of models trained over petabytes of data
Data BCP Cluster
[Diagram labels: INW Data Cluster; LSV Data Cluster; US, EU, and HK Serving Clusters; Modeling; Reporting; User Queries; Research; Ad-hoc Queries; Processed Data; Amazon Backup]
YARN
JobTracker
» Resource Manager
  – Global resource scheduler
  – Hierarchical queues
  – Application management
» Node Manager
  – Per-machine agent
  – Manages life cycle of container
  – Container resource monitoring
» Application Master
  – Per-application
  – Manages application scheduling and task execution
YARN at Rocket Fuel
» YARN is in production
» 700+ nodes
» 31TB RAM, 8500 disks, 8500 cores
» Primary use case: Map-Reduce
» No more static slots
» Tez, Spark, Storm are in the race
YAY
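"No more static slots" means the number of tasks a node can run depends on what each container asks for, not on a fixed map/reduce slot count. A minimal sketch of that arithmetic follows. The node sizes are made-up illustration values corresponding to what one might set in yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores, and the function name is hypothetical. Whether vcores are actually enforced depends on how the scheduler is configured.

```python
def containers_per_node(node_memory_mb: int,
                        node_vcores: int,
                        container_memory_mb: int,
                        container_vcores: int = 1) -> int:
    """How many identical containers fit on one NodeManager."""
    by_memory = node_memory_mb // container_memory_mb
    by_vcores = node_vcores // container_vcores
    return min(by_memory, by_vcores)


# Hypothetical node offering 56 GB and 12 vcores to YARN.
node_mb, node_cores = 56 * 1024, 12
for container_mb in (2048, 8192):
    fit = containers_per_node(node_mb, node_cores, container_mb)
    print(f"{container_mb} MB containers: {fit} per node")
```

The same node fits a different number of containers for each request size, which is the flexibility that fixed MR slots could not provide.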
Obligatory "we are hiring" slide
http://rocketfuel.com/careers
THANKS
kishore@rocketfuel.com, apol@rocketfuel.com
- Hado'ops' or Had'oops'
- The Web Is Monetized By Advertising
- Delivery Methods
- Overview
- Always buying the best impressions & serving the best ad
- Real Time Bidding and Serving
- Overview (2)
- Throughput
- Latency
- Architecture and Scale
- Data Center Expansion
- Data Center Design
- Rocket Fuel Scale
- Hadoop at Rocket Fuel
- Growth
- Data Architecture 30
- Hadoop Setup
- Operations
- Maintenance is Not Easy
- Puppet and Infradb
- Performance Tuning
- Performance Tuning (2)
- We Have an Issue
- JT OOM
- Operations (2)
- Monitoring
- Monitoring (2)
- MR Workload Monitoring
- Network Monitoring
- Alerting
- Alerts
- MR Workload Alerting
- Multi Tenancy
- Scheduling
- Operations (3)
- BCP
- Data BCP Cluster
- YARN
- YARN at Rocket Fuel
- Obligatory "we are hiring" slide
- Slide 41
-
Proprietary amp Confidential Copyright copy 2014
The Web Is Monetized By Advertising
Proprietary amp Confidential Copyright copy 2014
Delivery Methods
raquoDisplayraquoVideoraquoMobileraquoSocial
Proprietary amp Confidential Copyright copy 2014
6 Ad Served
User Segments
3 Bid Reques
t
Overview
Publishers
2 Ad Request
1 Page Request
4 Bid amp Ad
User Engagemen
ts
Data Partners
Advertisers
Browser
Some Exchange Partners
Ad Exchange
Optimize
Rocket Fuel Platform
Real-time BidderAutomated Decisions
Models
Refresh learning
Data Store
Ads ampBudget
ModelScores
Events
5 RocketfuelWinning Ad
Proprietary amp Confidential Copyright copy 2014
$238965$06782$17234
$009$178964$16782$17234$0809$242125
$211$126
$2178$2056$0809$242125
$211$126$278$156
$1809$242125
$211$126$278$056$242125
$211$126$278
$0756$0809$242125
$211$126$278
$1256$1809$242125
$211$126$278
$0586$2009
125$211$126$278$156
$000
[ + ][ + ]
SitePageGeoWeatherTime of DayBrand AffinityUser
Always buying the best impressions amp serving the best ad
Real Time Bidding and Serving
Proprietary amp Confidential Copyright copy 2014
GoalLeadsamp sales
GoalCoupondownloads
GoalBrandawareness
SitePageGeoWeatherTime of DayBrand AffinityDemo
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-marketBehaviorResponse
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-MarketBehaviorResponse X
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-MarketBehaviorResponse
+100+40-20+20+15+10+40+35
+97
+40-70-20+10+15-25-40-18
+07
+10-10-20+20+10-35-25+10
+14
Real Time Bidding and Serving
Xuuml
Proprietary amp Confidential Copyright copy 2014
6 Ad Served
User Segments
3 Bid Reques
t
Overview
Publishers
2 Ad Request
1 Page Request
4 Bid amp Ad
User Engagemen
ts
Data Partners
Advertisers
Browser
Some Exchange Partners
Ad Exchange
Optimize
Rocket Fuel Platform
Real-time BidderAutomated Decisions
Models
Refresh learning
Data Store
Ads ampBudget
ModelScores
Events
5 RocketfuelWinning Ad
Proprietary amp Confidential Copyright copy 2014
Facebook likes
Searches on Google
Bid Requests Considered by Rocketfuel
5 B
6 B
45 B
Requests per day
Throughput
Proprietary amp Confidential Copyright copy 2014
Blink of an eye
SF to Tokyo network round trip
One beat of a hummindbirds wing
Look up in Blackbird
400
100
20
2
Time (ms)
Latency
Proprietary amp Confidential Copyright copy 2014
Architecture and Scale
raquoDatacentersraquoScaleraquoGrowthraquoArchitecture
Proprietary amp Confidential Copyright copy 2014
Data Center Expansion
raquoabc
Proprietary amp Confidential Copyright copy 2014
Data Center Design
bull Racks custom built at Rocket Fuelbull Leased spacebandwidth in colocation facilities
Hadoop Server20 2U servers (85kW)
Bidders40 2-U Twin 2 servers (17kW)
Proprietary amp Confidential Copyright copy 2014
Rocket Fuel Scale
raquo34474 CPU processor coresndash2655 serversndash1874 Teraflops of computing
raquo188 Terabytes of memoryndash13X the memory of IBM computer Watson that
played Jeopardy
raquo42PB Petabytes of storagendash106X the data volume of the entire Library of
Congress
Proprietary amp Confidential Copyright copy 2014
Hadoop at Rocket Fuel
raquo 1400 servers
raquo 15K Disks
raquo 15K Cores
raquo 90 TB
raquo 30K MR slots
raquo 12K daily MR jobs
Proprietary amp Confidential Copyright copy 2014
200 Servers 1400 Servers
1 Year
5 PB
41 PB8x
Growth
Proprietary amp Confidential Copyright copy 2014
Data Architecture 30
Proprietary amp Confidential Copyright copy 2014
Hadoop Setup
QJM ZK Quorum
raquo 6x2TB Disksraquo 2x6 coreraquo 196 GB RAMraquo 2x1G NIC
raquo 12x3TB Disksraquo 2x6 coreraquo 64 GB RAMraquo 10G NIC
raquo same as DNrsquosraquo Dedicated disk
to ZK or JN
JT
Standby NN
ZKFCZKFC
Active NN
DNTT
DNTT
DNTT
DNTT
DNTT
DNTT
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
Puppet+
Infradb
Automation is key
Maintenance is Not Easy
Proprietary amp Confidential Copyright copy 2014
Puppet and Infradb
raquo Automate as much as you canraquo Adding a slave node to Hadoop cluster lt 120 secondsraquo Bringing up a new Hadoop cluster lt 500 secondsraquo MR slots are automatically determined based on hardware config
Isnrsquot it cool
Just define once
Proprietary amp Confidential Copyright copy 2014
No issues when cluster is small Problems starts when it grows
Performance Tuning
Proprietary amp Confidential Copyright copy 2014
dfsdatanodehandlercount dfsnamenodehandlercount
dfsdatanodemaxtransferthreads dfsimagetransfertimeout
mapredreduceparallelcopies
mapredjobtrackerhandlercount
iosortmbiosortfactor
maxClientCnxns ZK
HDFS
MR
IMP MAPREDUCE-2026
-XX+UseConcMarkSweepGC
-XXCMSFullGCsBeforeCompaction=1
-XXCMSInitiatingOccupancyFraction=60
ha-timeoutms
JVM
Performance Tuning
mapreducereduceshuffleparallelcopies
Proprietary amp Confidential Copyright copy 2014
MAPREDUCE-5351
MAPREDUCE-5508
keepfailedtaskfiles=true
We Have an Issue
Proprietary amp Confidential Copyright copy 2014
instances of JobInProgressrdquo class = no of users submitted jobs X mapredjobtrackercompleteuserjobsmaximum
mapredjobtrackercompleteuserjobsmaximum mapredjobtrackerretirejobinterval
mapredjobtrackerretiredjobscachesize
JT OOM
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
Monitoring
Wall of Ops
Proprietary amp Confidential Copyright copy 2014
Monitoring
hadoopnamenodeCallQueueLength hadoopjobtrackerjvmmemheapusedm
Donrsquot fly blind you will crash
Proprietary amp Confidential Copyright copy 2014
MR Workload Monitoring
Proprietary amp Confidential Copyright copy 2014
Network Monitoring
Donrsquot blame network instead monitor it Network Mesh can be mess
Proprietary amp Confidential Copyright copy 2014
Alerting
Monitoring is not enough need better Alerting
Proprietary amp Confidential Copyright copy 2014
Alerts
httphostnameportjmx
qry=Hadoopservice=NameNodename=NameNodeInfo
gtgt Checking whether NN and JT are up is a no brainer gtgt Reduce alert noise by having summaryaggregate alerts gtgt We heavily rely on custom scripts that query jmx for NN and JT
qry=hadoopservice=JobTrackername=JobTrackerInfo
NameDirStatuses DeadNodes NumberOfMissingBlocks
qry=Hadoopservice=NameNodename=FSNamesystemState
FSState CapacityRemaining NumDeadDataNodes UnderReplicatedBlocks
Blacklisted TTrsquos jobs slots_used ThreadCount
qry=javalangtype=Memory
Used jvm free jvm etc
Proprietary amp Confidential Copyright copy 2014
MR Workload Alerting
raquo Monitoring MR workload and alertndash In-house tool that use ldquohoudahrdquo ruby gem monitorsndash Long running jobs jobs with more map tasks blacklisted TTrsquos
with more failure counts etchellip
raquo Collect details and auto-restart blacklisted TTrsquosraquo Parse the JT logfile for rouge jobsraquo Parse the JT log and collects all Job related inforaquo White-elephant or hraven could helpraquo Parse the scheduler html page or use metrics page httpltJT-hostnamegt50030scheduleradvanced httpltJT-hostnamegt50030metrics
Proprietary amp Confidential Copyright copy 2014
Modeling
OPS
ETL
Ad-hoc
Multi Tenancy
Proprietary amp Confidential Copyright copy 2014
No Scheduler is perfect unless you understand and tune it properly
Scheduling
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
BCP
raquo BCP Business Continuity Planraquo Near real time reporting over 15+ TB of daily dataraquo Freshness of models trained over petabytes of data
Proprietary amp Confidential Copyright copy 2014
Data BCP Cluster
INW Data
Cluster
US Serving Clusters
EU Serving Clusters
HK Serving Clusters
Modeling
Reporting
User Queries
Amazon BackupLSV Data
Cluster
USEUHK Serving Clusters
Research
Ad-hoc Queries
Processed Data
Proprietary amp Confidential Copyright copy 2014
YARN
JobTracker
raquo Resource Manager - Global resource scheduler - Hierarchical queues - Application management
raquo Node Manager - Per-machine agent - Manages life cycle of container - Container resource monitoring
raquo Application Master - Per-application - Manages application scheduling and task execution
Proprietary amp Confidential Copyright copy 2014
YARN at Rocket FueI
raquo Yarn is in production raquo 700+ nodesraquo 31TB RAM 8500 disks 8500 cores raquo Primary use case Map-Reduceraquo No more static slotsraquo Tez Spark Storm are in race
YAY
Proprietary amp Confidential Copyright copy 2014
Obligatory ldquowe are hiringrdquo slide
httprocketfuelcomcareers
Proprietary amp Confidential Copyright copy 2014
THANKS
kishorerocketfuelcomapolrocketfuelcom
- Hadorsquoopsrsquo or Hadrsquooopsrsquo 1
- The Web Is Monetized By Advertising
- Delivery Methods
- Overview
- Always buying the best impressions amp serving the best ad
- Real Time Bidding and Serving
- Overview (2)
- Throughput
- Latency
- Architecture and Scale
- Data Center Expansion
- Data Center Design
- Rocket Fuel Scale
- Hadoop at Rocket Fuel
- Growth
- Data Architecture 30
- Hadoop Setup
- Operations
- Maintenance is Not Easy
- Puppet and Infradb
- Performance Tuning
- Performance Tuning (2)
- We Have an Issue
- JT OOM
- Operations (2)
- Monitoring
- Monitoring (2)
- MR Workload Monitoring
- Network Monitoring
- Alerting
- Alerts
- MR Workload Alerting
- Multi Tenancy
- Scheduling
- Operations (3)
- BCP
- Data BCP Cluster
- YARN
- YARN at Rocket FueI
- Obligatory ldquowe are hiringrdquo slide
- Slide 41
-
Proprietary amp Confidential Copyright copy 2014
Delivery Methods
raquoDisplayraquoVideoraquoMobileraquoSocial
Proprietary amp Confidential Copyright copy 2014
6 Ad Served
User Segments
3 Bid Reques
t
Overview
Publishers
2 Ad Request
1 Page Request
4 Bid amp Ad
User Engagemen
ts
Data Partners
Advertisers
Browser
Some Exchange Partners
Ad Exchange
Optimize
Rocket Fuel Platform
Real-time BidderAutomated Decisions
Models
Refresh learning
Data Store
Ads ampBudget
ModelScores
Events
5 RocketfuelWinning Ad
Proprietary amp Confidential Copyright copy 2014
$238965$06782$17234
$009$178964$16782$17234$0809$242125
$211$126
$2178$2056$0809$242125
$211$126$278$156
$1809$242125
$211$126$278$056$242125
$211$126$278
$0756$0809$242125
$211$126$278
$1256$1809$242125
$211$126$278
$0586$2009
125$211$126$278$156
$000
[ + ][ + ]
SitePageGeoWeatherTime of DayBrand AffinityUser
Always buying the best impressions amp serving the best ad
Real Time Bidding and Serving
Proprietary amp Confidential Copyright copy 2014
GoalLeadsamp sales
GoalCoupondownloads
GoalBrandawareness
SitePageGeoWeatherTime of DayBrand AffinityDemo
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-marketBehaviorResponse
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-MarketBehaviorResponse X
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-MarketBehaviorResponse
+100+40-20+20+15+10+40+35
+97
+40-70-20+10+15-25-40-18
+07
+10-10-20+20+10-35-25+10
+14
Real Time Bidding and Serving
Xuuml
Proprietary amp Confidential Copyright copy 2014
6 Ad Served
User Segments
3 Bid Reques
t
Overview
Publishers
2 Ad Request
1 Page Request
4 Bid amp Ad
User Engagemen
ts
Data Partners
Advertisers
Browser
Some Exchange Partners
Ad Exchange
Optimize
Rocket Fuel Platform
Real-time BidderAutomated Decisions
Models
Refresh learning
Data Store
Ads ampBudget
ModelScores
Events
5 RocketfuelWinning Ad
Proprietary amp Confidential Copyright copy 2014
Facebook likes
Searches on Google
Bid Requests Considered by Rocketfuel
5 B
6 B
45 B
Requests per day
Throughput
Proprietary amp Confidential Copyright copy 2014
Blink of an eye
SF to Tokyo network round trip
One beat of a hummindbirds wing
Look up in Blackbird
400
100
20
2
Time (ms)
Latency
Proprietary amp Confidential Copyright copy 2014
Architecture and Scale
raquoDatacentersraquoScaleraquoGrowthraquoArchitecture
Proprietary amp Confidential Copyright copy 2014
Data Center Expansion
raquoabc
Proprietary amp Confidential Copyright copy 2014
Data Center Design
bull Racks custom built at Rocket Fuelbull Leased spacebandwidth in colocation facilities
Hadoop Server20 2U servers (85kW)
Bidders40 2-U Twin 2 servers (17kW)
Proprietary amp Confidential Copyright copy 2014
Rocket Fuel Scale
raquo34474 CPU processor coresndash2655 serversndash1874 Teraflops of computing
raquo188 Terabytes of memoryndash13X the memory of IBM computer Watson that
played Jeopardy
raquo42PB Petabytes of storagendash106X the data volume of the entire Library of
Congress
Proprietary amp Confidential Copyright copy 2014
Hadoop at Rocket Fuel
raquo 1400 servers
raquo 15K Disks
raquo 15K Cores
raquo 90 TB
raquo 30K MR slots
raquo 12K daily MR jobs
Proprietary amp Confidential Copyright copy 2014
200 Servers 1400 Servers
1 Year
5 PB
41 PB8x
Growth
Proprietary amp Confidential Copyright copy 2014
Data Architecture 30
Proprietary amp Confidential Copyright copy 2014
Hadoop Setup
QJM ZK Quorum
raquo 6x2TB Disksraquo 2x6 coreraquo 196 GB RAMraquo 2x1G NIC
raquo 12x3TB Disksraquo 2x6 coreraquo 64 GB RAMraquo 10G NIC
raquo same as DNrsquosraquo Dedicated disk
to ZK or JN
JT
Standby NN
ZKFCZKFC
Active NN
DNTT
DNTT
DNTT
DNTT
DNTT
DNTT
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
Puppet+
Infradb
Automation is key
Maintenance is Not Easy
Proprietary amp Confidential Copyright copy 2014
Puppet and Infradb
raquo Automate as much as you canraquo Adding a slave node to Hadoop cluster lt 120 secondsraquo Bringing up a new Hadoop cluster lt 500 secondsraquo MR slots are automatically determined based on hardware config
Isnrsquot it cool
Just define once
Proprietary amp Confidential Copyright copy 2014
No issues when cluster is small Problems starts when it grows
Performance Tuning
Proprietary amp Confidential Copyright copy 2014
dfsdatanodehandlercount dfsnamenodehandlercount
dfsdatanodemaxtransferthreads dfsimagetransfertimeout
mapredreduceparallelcopies
mapredjobtrackerhandlercount
iosortmbiosortfactor
maxClientCnxns ZK
HDFS
MR
IMP MAPREDUCE-2026
-XX+UseConcMarkSweepGC
-XXCMSFullGCsBeforeCompaction=1
-XXCMSInitiatingOccupancyFraction=60
ha-timeoutms
JVM
Performance Tuning
mapreducereduceshuffleparallelcopies
Proprietary amp Confidential Copyright copy 2014
MAPREDUCE-5351
MAPREDUCE-5508
keepfailedtaskfiles=true
We Have an Issue
Proprietary amp Confidential Copyright copy 2014
instances of JobInProgressrdquo class = no of users submitted jobs X mapredjobtrackercompleteuserjobsmaximum
mapredjobtrackercompleteuserjobsmaximum mapredjobtrackerretirejobinterval
mapredjobtrackerretiredjobscachesize
JT OOM
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
Monitoring
Wall of Ops
Proprietary amp Confidential Copyright copy 2014
Monitoring
hadoopnamenodeCallQueueLength hadoopjobtrackerjvmmemheapusedm
Donrsquot fly blind you will crash
Proprietary amp Confidential Copyright copy 2014
MR Workload Monitoring
Proprietary amp Confidential Copyright copy 2014
Network Monitoring
Donrsquot blame network instead monitor it Network Mesh can be mess
Proprietary amp Confidential Copyright copy 2014
Alerting
Monitoring is not enough need better Alerting
Proprietary amp Confidential Copyright copy 2014
Alerts
httphostnameportjmx
qry=Hadoopservice=NameNodename=NameNodeInfo
gtgt Checking whether NN and JT are up is a no brainer gtgt Reduce alert noise by having summaryaggregate alerts gtgt We heavily rely on custom scripts that query jmx for NN and JT
qry=hadoopservice=JobTrackername=JobTrackerInfo
NameDirStatuses DeadNodes NumberOfMissingBlocks
qry=Hadoopservice=NameNodename=FSNamesystemState
FSState CapacityRemaining NumDeadDataNodes UnderReplicatedBlocks
Blacklisted TTrsquos jobs slots_used ThreadCount
qry=javalangtype=Memory
Used jvm free jvm etc
Proprietary amp Confidential Copyright copy 2014
MR Workload Alerting
raquo Monitoring MR workload and alertndash In-house tool that use ldquohoudahrdquo ruby gem monitorsndash Long running jobs jobs with more map tasks blacklisted TTrsquos
with more failure counts etchellip
raquo Collect details and auto-restart blacklisted TTrsquosraquo Parse the JT logfile for rouge jobsraquo Parse the JT log and collects all Job related inforaquo White-elephant or hraven could helpraquo Parse the scheduler html page or use metrics page httpltJT-hostnamegt50030scheduleradvanced httpltJT-hostnamegt50030metrics
Proprietary amp Confidential Copyright copy 2014
Modeling
OPS
ETL
Ad-hoc
Multi Tenancy
Proprietary amp Confidential Copyright copy 2014
No Scheduler is perfect unless you understand and tune it properly
Scheduling
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
BCP
raquo BCP Business Continuity Planraquo Near real time reporting over 15+ TB of daily dataraquo Freshness of models trained over petabytes of data
Proprietary amp Confidential Copyright copy 2014
Data BCP Cluster
INW Data
Cluster
US Serving Clusters
EU Serving Clusters
HK Serving Clusters
Modeling
Reporting
User Queries
Amazon BackupLSV Data
Cluster
USEUHK Serving Clusters
Research
Ad-hoc Queries
Processed Data
Proprietary amp Confidential Copyright copy 2014
YARN
JobTracker
raquo Resource Manager - Global resource scheduler - Hierarchical queues - Application management
raquo Node Manager - Per-machine agent - Manages life cycle of container - Container resource monitoring
raquo Application Master - Per-application - Manages application scheduling and task execution
Proprietary amp Confidential Copyright copy 2014
YARN at Rocket FueI
raquo Yarn is in production raquo 700+ nodesraquo 31TB RAM 8500 disks 8500 cores raquo Primary use case Map-Reduceraquo No more static slotsraquo Tez Spark Storm are in race
YAY
Proprietary amp Confidential Copyright copy 2014
Obligatory ldquowe are hiringrdquo slide
httprocketfuelcomcareers
Proprietary amp Confidential Copyright copy 2014
THANKS
kishorerocketfuelcomapolrocketfuelcom
- Hadorsquoopsrsquo or Hadrsquooopsrsquo 1
- The Web Is Monetized By Advertising
- Delivery Methods
- Overview
- Always buying the best impressions amp serving the best ad
- Real Time Bidding and Serving
- Overview (2)
- Throughput
- Latency
- Architecture and Scale
- Data Center Expansion
- Data Center Design
- Rocket Fuel Scale
- Hadoop at Rocket Fuel
- Growth
- Data Architecture 30
- Hadoop Setup
- Operations
- Maintenance is Not Easy
- Puppet and Infradb
- Performance Tuning
- Performance Tuning (2)
- We Have an Issue
- JT OOM
- Operations (2)
- Monitoring
- Monitoring (2)
- MR Workload Monitoring
- Network Monitoring
- Alerting
- Alerts
- MR Workload Alerting
- Multi Tenancy
- Scheduling
- Operations (3)
- BCP
- Data BCP Cluster
- YARN
- YARN at Rocket FueI
- Obligatory ldquowe are hiringrdquo slide
- Slide 41
-
Proprietary amp Confidential Copyright copy 2014
6 Ad Served
User Segments
3 Bid Reques
t
Overview
Publishers
2 Ad Request
1 Page Request
4 Bid amp Ad
User Engagemen
ts
Data Partners
Advertisers
Browser
Some Exchange Partners
Ad Exchange
Optimize
Rocket Fuel Platform
Real-time BidderAutomated Decisions
Models
Refresh learning
Data Store
Ads ampBudget
ModelScores
Events
5 RocketfuelWinning Ad
Proprietary amp Confidential Copyright copy 2014
$238965$06782$17234
$009$178964$16782$17234$0809$242125
$211$126
$2178$2056$0809$242125
$211$126$278$156
$1809$242125
$211$126$278$056$242125
$211$126$278
$0756$0809$242125
$211$126$278
$1256$1809$242125
$211$126$278
$0586$2009
125$211$126$278$156
$000
[ + ][ + ]
SitePageGeoWeatherTime of DayBrand AffinityUser
Always buying the best impressions amp serving the best ad
Real Time Bidding and Serving
Proprietary amp Confidential Copyright copy 2014
GoalLeadsamp sales
GoalCoupondownloads
GoalBrandawareness
SitePageGeoWeatherTime of DayBrand AffinityDemo
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-marketBehaviorResponse
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-MarketBehaviorResponse X
Impression ScorecardDemoBrand AffinityTime of DayGeoWeatherSitePageAd PositionIn-MarketBehaviorResponse
+100+40-20+20+15+10+40+35
+97
+40-70-20+10+15-25-40-18
+07
+10-10-20+20+10-35-25+10
+14
Real Time Bidding and Serving
Xuuml
Proprietary amp Confidential Copyright copy 2014
6 Ad Served
User Segments
3 Bid Reques
t
Overview
Publishers
2 Ad Request
1 Page Request
4 Bid amp Ad
User Engagemen
ts
Data Partners
Advertisers
Browser
Some Exchange Partners
Ad Exchange
Optimize
Rocket Fuel Platform
Real-time BidderAutomated Decisions
Models
Refresh learning
Data Store
Ads ampBudget
ModelScores
Events
5 RocketfuelWinning Ad
Proprietary amp Confidential Copyright copy 2014
Facebook likes
Searches on Google
Bid Requests Considered by Rocketfuel
5 B
6 B
45 B
Requests per day
Throughput
Proprietary amp Confidential Copyright copy 2014
Blink of an eye
SF to Tokyo network round trip
One beat of a hummindbirds wing
Look up in Blackbird
400
100
20
2
Time (ms)
Latency
Proprietary amp Confidential Copyright copy 2014
Architecture and Scale
raquoDatacentersraquoScaleraquoGrowthraquoArchitecture
Proprietary amp Confidential Copyright copy 2014
Data Center Expansion
raquoabc
Proprietary amp Confidential Copyright copy 2014
Data Center Design
bull Racks custom built at Rocket Fuelbull Leased spacebandwidth in colocation facilities
Hadoop Server20 2U servers (85kW)
Bidders40 2-U Twin 2 servers (17kW)
Rocket Fuel Scale
» 34474 CPU processor cores
– 2655 servers
– 1874 Teraflops of computing
» 188 Terabytes of memory
– 13X the memory of the IBM computer Watson that played Jeopardy
» 42 PB of storage
– 106X the data volume of the entire Library of Congress
Hadoop at Rocket Fuel
» 1400 servers
» 15K disks
» 15K cores
» 90 TB
» 30K MR slots
» 12K daily MR jobs
Growth
In 1 year:
» 200 servers to 1400 servers
» 5 PB to 41 PB (8x)
Data Architecture 30
Hadoop Setup
Active NN and Standby NN (each with a ZKFC), JT:
» 6x2TB disks
» 2x6 core
» 196 GB RAM
» 2x1G NIC
DN/TT nodes:
» 12x3TB disks
» 2x6 core
» 64 GB RAM
» 10G NIC
QJM / ZK quorum:
» Same as DNs
» Dedicated disk to ZK or JN
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Maintenance is Not Easy
Puppet + Infradb
Automation is key
Puppet and Infradb
» Automate as much as you can
» Adding a slave node to a Hadoop cluster: < 120 seconds
» Bringing up a new Hadoop cluster: < 500 seconds
» MR slots are automatically determined based on hardware config
Isn't it cool?
Just define once
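The slides do not show how MR slots are derived from the hardware config, so the following is only a sketch of the idea. The heuristic, the reserved amounts, and every name here (NodeSpec, mr_slots, gb_per_task, map_fraction) are assumptions for illustration, not Rocket Fuel's actual Puppet logic. The two results would typically feed mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.

```python
from dataclasses import dataclass


@dataclass
class NodeSpec:
    """Hardware facts for one slave node (hypothetical shape)."""
    cores: int
    ram_gb: int


def mr_slots(node: NodeSpec, gb_per_task: int = 2, reserved_gb: int = 8,
             reserved_cores: int = 2, map_fraction: float = 0.67):
    """Return (map_slots, reduce_slots) from an assumed heuristic.

    Total slots are capped by whichever runs out first: cores left
    after reserving some for the DN/TT daemons, or RAM left after
    reserving some for the OS, divided by the per-task heap.
    """
    by_cpu = max(node.cores - reserved_cores, 1)
    by_ram = max((node.ram_gb - reserved_gb) // gb_per_task, 1)
    total = min(by_cpu, by_ram)
    map_slots = max(round(total * map_fraction), 1)
    reduce_slots = max(total - map_slots, 1)
    return map_slots, reduce_slots


# A DN/TT node like the one on the Hadoop Setup slide: 2x6 core, 64 GB RAM.
print(mr_slots(NodeSpec(cores=12, ram_gb=64)))
```

The point of "just define once" is that a rule like this lives in one place and every new node gets its slot counts from its own hardware facts.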
Performance Tuning
No issues when the cluster is small. Problems start when it grows.
Performance Tuning
HDFS:
» dfs.datanode.handler.count
» dfs.namenode.handler.count
» dfs.datanode.max.transfer.threads
» dfs.image.transfer.timeout
» ha-timeoutms
MR:
» mapred.reduce.parallel.copies
» mapreduce.reduce.shuffle.parallelcopies
» mapred.job.tracker.handler.count
» io.sort.mb
» io.sort.factor
» IMP: MAPREDUCE-2026
ZK:
» maxClientCnxns
JVM:
» -XX:+UseConcMarkSweepGC
» -XX:CMSFullGCsBeforeCompaction=1
» -XX:CMSInitiatingOccupancyFraction=60
We Have an Issue
» MAPREDUCE-5351
» MAPREDUCE-5508
» keep.failed.task.files=true
JT OOM
Instances of "JobInProgress" class = number of users who submitted jobs × mapred.jobtracker.completeuserjobs.maximum
» mapred.jobtracker.completeuserjobs.maximum
» mapred.jobtracker.retirejob.interval
» mapred.jobtracker.retiredjobs.cache.size
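The JobInProgress relation from this slide can be turned into a quick back-of-the-envelope check. This is a sketch only: the user count, the per-user limit, and the per-job heap footprint below are made-up inputs, not measurements from the talk.

```python
def retained_job_objects(num_users: int, completeuserjobs_maximum: int) -> int:
    """Upper bound on completed JobInProgress objects the JT keeps in memory:
    users who submitted jobs x mapred.jobtracker.completeuserjobs.maximum."""
    return num_users * completeuserjobs_maximum


def estimated_heap_mb(num_users: int, completeuserjobs_maximum: int,
                      mb_per_job: float) -> float:
    """Rough heap held by those objects, given an assumed per-job footprint."""
    return retained_job_objects(num_users, completeuserjobs_maximum) * mb_per_job


# Hypothetical inputs: 50 submitting users, 100 retained jobs per user,
# and an assumed 2 MB of heap per retained job.
print(retained_job_objects(50, 100))
print(estimated_heap_mb(50, 100, 2.0))
```

Lowering the per-user limit, or retiring jobs sooner, shrinks the first factor directly, which is why these settings matter when the JT heap is under pressure.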
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Monitoring
Wall of Ops
Monitoring
» hadoopnamenodeCallQueueLength
» hadoopjobtrackerjvmmemheapusedm
Don't fly blind; you will crash
MR Workload Monitoring
Network Monitoring
Don't blame the network; monitor it instead. A network mesh can be a mess.
Alerting
Monitoring is not enough; you need better alerting
Alerts
>> Checking whether NN and JT are up is a no-brainer
>> Reduce alert noise by having summary/aggregate alerts
>> We rely heavily on custom scripts that query JMX for NN and JT
http://hostname:port/jmx
qry=Hadoop:service=NameNode,name=NameNodeInfo
– NameDirStatuses, DeadNodes, NumberOfMissingBlocks
qry=Hadoop:service=NameNode,name=FSNamesystemState
– FSState, CapacityRemaining, NumDeadDataNodes, UnderReplicatedBlocks
qry=hadoop:service=JobTracker,name=JobTrackerInfo
– Blacklisted TTs, jobs, slots_used, ThreadCount
qry=java.lang:type=Memory
– Used JVM, free JVM, etc.
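A custom JMX check of the kind described above might look like the sketch below. It is not the authors' script: the thresholds, function names, and sample payload are assumptions, and it assumes the /jmx servlet returns JSON of the form {"beans": [...]} with FSState reading "Operational" when the NameNode is healthy. The evaluation is kept separate from the HTTP fetch so it can be exercised without a cluster.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def jmx_url(host: str, port: int, bean: str) -> str:
    """Build the /jmx query URL for one MBean name."""
    return f"http://{host}:{port}/jmx?" + urlencode({"qry": bean})


def fetch_bean(host: str, port: int, bean: str) -> dict:
    """Fetch and decode the JSON payload (needs a reachable daemon)."""
    with urlopen(jmx_url(host, port, bean), timeout=10) as resp:
        return json.load(resp)


def check_fs_state(payload: dict, max_dead: int = 0,
                   max_under_replicated: int = 1000,
                   min_remaining_bytes: int = 10 * 1024 ** 4) -> list:
    """Return alert strings for an FSNamesystemState payload.

    All thresholds are illustrative defaults, not values from the talk.
    """
    bean = payload["beans"][0]
    alerts = []
    if bean["FSState"] != "Operational":
        alerts.append(f"FSState is {bean['FSState']}")
    if bean["NumDeadDataNodes"] > max_dead:
        alerts.append(f"{bean['NumDeadDataNodes']} dead DataNodes")
    if bean["UnderReplicatedBlocks"] > max_under_replicated:
        alerts.append(f"{bean['UnderReplicatedBlocks']} under-replicated blocks")
    if bean["CapacityRemaining"] < min_remaining_bytes:
        alerts.append("HDFS capacity remaining is below threshold")
    return alerts


# Made-up sample payload standing in for a live response.
sample = {"beans": [{
    "name": "Hadoop:service=NameNode,name=FSNamesystemState",
    "FSState": "Operational",
    "CapacityRemaining": 500 * 1024 ** 4,
    "NumDeadDataNodes": 3,
    "UnderReplicatedBlocks": 12,
}]}
print(check_fs_state(sample))
```

Returning one list per check makes it easy to fold several findings into a single summary alert instead of paging once per symptom.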
MR Workload Alerting
» Monitoring MR workload and alerting
– In-house tool that uses the "houdah" Ruby gem monitors
– Long-running jobs, jobs with more map tasks, blacklisted TTs with more failure counts, etc.
» Collect details and auto-restart blacklisted TTs
» Parse the JT logfile for rogue jobs
» Parse the JT log and collect all job-related info
» White Elephant or hRaven could help
» Parse the scheduler HTML page or use the metrics page:
http://<JT-hostname>:50030/scheduler?advanced
http://<JT-hostname>:50030/metrics
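The workload checks listed above (long-running jobs, jobs with too many map tasks) reduce to simple threshold rules once the job details have been collected. The sketch below assumes the data is already in hand, whether from the JT log, the metrics page, or a client library; the JobInfo shape, the thresholds, and the sample jobs are all hypothetical, and it does not reproduce the in-house tool.

```python
from dataclasses import dataclass


@dataclass
class JobInfo:
    """Minimal job record (hypothetical; fill from JT log or metrics)."""
    job_id: str
    user: str
    runtime_s: int
    map_tasks: int


def flag_jobs(jobs, max_runtime_s: int = 6 * 3600, max_map_tasks: int = 50000):
    """Return (job_id, reason) pairs for jobs that break either limit."""
    flagged = []
    for job in jobs:
        if job.runtime_s > max_runtime_s:
            flagged.append((job.job_id, "long running"))
        if job.map_tasks > max_map_tasks:
            flagged.append((job.job_id, "too many map tasks"))
    return flagged


# Made-up sample data.
jobs = [
    JobInfo("job_0001", "etl", runtime_s=1200, map_tasks=800),
    JobInfo("job_0002", "adhoc", runtime_s=9 * 3600, map_tasks=120000),
]
for job_id, reason in flag_jobs(jobs):
    print(job_id, reason)
```

In a multi-tenant cluster the limits would likely differ per queue or per user, so the two thresholds are the first thing to make configurable.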
Multi Tenancy
» Modeling
» OPS
» ETL
» Ad-hoc
Scheduling
No scheduler is perfect unless you understand and tune it properly
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
BCP
» BCP: Business Continuity Plan
» Near-real-time reporting over 15+ TB of daily data
» Freshness of models trained over petabytes of data
Data BCP Cluster
INW Data Cluster
US Serving Clusters
EU Serving Clusters
HK Serving Clusters
Modeling
Reporting
User Queries
Amazon Backup
LSV Data Cluster
US/EU/HK Serving Clusters
Research
Ad-hoc Queries
Processed Data
YARN
JobTracker
» Resource Manager
– Global resource scheduler
– Hierarchical queues
– Application management
» Node Manager
– Per-machine agent
– Manages life cycle of container
– Container resource monitoring
» Application Master
– Per-application
– Manages application scheduling and task execution
YARN at Rocket Fuel
» YARN is in production
» 700+ nodes
» 31TB RAM, 8500 disks, 8500 cores
» Primary use case: MapReduce
» No more static slots
» Tez, Spark, Storm are in the race
YAY
Obligatory "we are hiring" slide
http://rocketfuel.com/careers
THANKS
kishore@rocketfuel.com
apol@rocketfuel.com
- Hado'ops' or Had'oops'
- The Web Is Monetized By Advertising
- Delivery Methods
- Overview
- Always buying the best impressions & serving the best ad
- Real Time Bidding and Serving
- Overview (2)
- Throughput
- Latency
- Architecture and Scale
- Data Center Expansion
- Data Center Design
- Rocket Fuel Scale
- Hadoop at Rocket Fuel
- Growth
- Data Architecture 30
- Hadoop Setup
- Operations
- Maintenance is Not Easy
- Puppet and Infradb
- Performance Tuning
- Performance Tuning (2)
- We Have an Issue
- JT OOM
- Operations (2)
- Monitoring
- Monitoring (2)
- MR Workload Monitoring
- Network Monitoring
- Alerting
- Alerts
- MR Workload Alerting
- Multi Tenancy
- Scheduling
- Operations (3)
- BCP
- Data BCP Cluster
- YARN
- YARN at Rocket Fuel
- Obligatory "we are hiring" slide
- Slide 41
- Hadorsquoopsrsquo or Hadrsquooopsrsquo 1
- The Web Is Monetized By Advertising
- Delivery Methods
- Overview
- Always buying the best impressions amp serving the best ad
- Real Time Bidding and Serving
- Overview (2)
- Throughput
- Latency
- Architecture and Scale
- Data Center Expansion
- Data Center Design
- Rocket Fuel Scale
- Hadoop at Rocket Fuel
- Growth
- Data Architecture 30
- Hadoop Setup
- Operations
- Maintenance is Not Easy
- Puppet and Infradb
- Performance Tuning
- Performance Tuning (2)
- We Have an Issue
- JT OOM
- Operations (2)
- Monitoring
- Monitoring (2)
- MR Workload Monitoring
- Network Monitoring
- Alerting
- Alerts
- MR Workload Alerting
- Multi Tenancy
- Scheduling
- Operations (3)
- BCP
- Data BCP Cluster
- YARN
- YARN at Rocket Fuel
- Obligatory “we are hiring” slide
- Slide 41
Throughput (requests per day)
» Facebook likes: 5 B
» Searches on Google: 6 B
» Bid requests considered by Rocket Fuel: 45 B
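For a rough sense of scale, the daily bid-request figure can be converted to a per-second rate. This is only back-of-the-envelope arithmetic: it assumes the 45 B requests are spread evenly over 24 hours, which real traffic is not, so peaks will be higher than this average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate_per_second(requests_per_day: float) -> float:
    """Average requests per second, assuming a perfectly flat daily profile."""
    return requests_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    bid_requests_per_day = 45e9  # figure from the slide
    rate = average_rate_per_second(bid_requests_per_day)
    print(f"{rate:,.0f} requests/second on average")
```

That works out to a little over half a million requests per second on average.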
Latency (time in ms)
» Blink of an eye: 400
» SF to Tokyo network round trip: 100
» One beat of a hummingbird's wing: 20
» Lookup in Blackbird: 2
Architecture and Scale
» Datacenters
» Scale
» Growth
» Architecture
Data Center Expansion
Data Center Design
• Racks custom built at Rocket Fuel
• Leased space/bandwidth in colocation facilities
Hadoop Server: 20 2U servers (85kW)
Bidders: 40 2-U Twin 2 servers (17kW)
Rocket Fuel Scale
» 34474 CPU processor cores
  – 2655 servers
  – 1874 Teraflops of computing
» 188 Terabytes of memory
  – 13X the memory of IBM computer Watson that played Jeopardy
» 42 PB of storage
  – 106X the data volume of the entire Library of Congress
Hadoop at Rocket Fuel
» 1400 servers
» 15K Disks
» 15K Cores
» 90 TB
» 30K MR slots
» 12K daily MR jobs
Growth (1 year)
» 200 servers to 1400 servers
» 5 PB to 41 PB (8x)
Data Architecture 30
Hadoop Setup
[Diagram: Active NN and Standby NN, each with a ZKFC; JT; QJM / ZK quorum; DN/TT worker nodes]
Hardware specs shown on the diagram:
» 6x2TB disks, 2x6 core, 196 GB RAM, 2x1G NIC
» 12x3TB disks, 2x6 core, 64 GB RAM, 10G NIC
» Same as DNs, with a dedicated disk for ZK or JN
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Maintenance is Not Easy
Puppet + Infradb
Automation is key
Puppet and Infradb
» Automate as much as you can
» Adding a slave node to a Hadoop cluster: < 120 seconds
» Bringing up a new Hadoop cluster: < 500 seconds
» MR slots are automatically determined based on hardware config
Isn't it cool?
Just define once
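The slide does not say how slots are derived from the hardware config. One common rule of thumb, shown here purely as an assumption, is to reserve some cores and memory for the DataNode and TaskTracker daemons and split the rest between map and reduce slots. The names NodeHardware and derive_slots, the reserve sizes, the memory per slot and the 2:1 map-to-reduce split are all hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NodeHardware:
    cores: int
    ram_gb: int


def derive_slots(hw: NodeHardware,
                 reserved_cores: int = 2,
                 reserved_ram_gb: int = 8,
                 ram_per_slot_gb: int = 2,
                 map_fraction: float = 2 / 3) -> Tuple[int, int]:
    """Return (map_slots, reduce_slots) for one worker node.

    The total is capped by whichever runs out first: usable cores or usable RAM.
    """
    usable_cores = max(hw.cores - reserved_cores, 1)
    usable_ram_gb = max(hw.ram_gb - reserved_ram_gb, ram_per_slot_gb)
    total_slots = max(min(usable_cores, usable_ram_gb // ram_per_slot_gb), 2)
    map_slots = max(int(total_slots * map_fraction), 1)
    reduce_slots = max(total_slots - map_slots, 1)
    return map_slots, reduce_slots


if __name__ == "__main__":
    # 2x6 cores and 64 GB RAM, one of the node specs on the Hadoop Setup slide.
    print(derive_slots(NodeHardware(cores=12, ram_gb=64)))  # (6, 4)
```

Computing the numbers from a hardware description is what makes "just define once" possible: the same template can serve every node type, and only the inventory record differs.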
Performance Tuning
No issues when the cluster is small. Problems start when it grows.
Performance Tuning
HDFS
» dfs.datanode.handler.count
» dfs.namenode.handler.count
» dfs.datanode.max.transfer.threads
» dfs.image.transfer.timeout
MR
» mapred.reduce.parallel.copies
» mapreduce.reduce.shuffle.parallelcopies
» mapred.job.tracker.handler.count
» io.sort.mb, io.sort.factor
» IMP: MAPREDUCE-2026
ZK
» maxClientCnxns
JVM
» -XX:+UseConcMarkSweepGC
» -XX:CMSFullGCsBeforeCompaction=1
» -XX:CMSInitiatingOccupancyFraction=60
ha-timeoutms
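The slide lists property names but not the values used. To illustrate where such settings live, the sketch below renders a few of the HDFS ones into the property XML form used by Hadoop's site files such as hdfs-site.xml. Every value is a placeholder chosen for the example, not a recommendation and not Rocket Fuel's setting, and render_site_xml is a hypothetical helper.

```python
from xml.etree import ElementTree as ET

# Placeholder values only: tune against your own cluster.
HDFS_SITE = {
    "dfs.namenode.handler.count": "64",
    "dfs.datanode.handler.count": "16",
    "dfs.datanode.max.transfer.threads": "4096",
}


def render_site_xml(properties: dict) -> str:
    """Render a dict of settings as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(render_site_xml(HDFS_SITE))
```

Generating the file from one dictionary fits the automation theme of the talk, since the same data can be templated by a configuration management tool.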
We Have an Issue
» MAPREDUCE-5351
» MAPREDUCE-5508
» keep.failed.task.files=true
JT OOM
Instances of the “JobInProgress” class = number of users who submitted jobs × mapred.jobtracker.completeuserjobs.maximum
» mapred.jobtracker.completeuserjobs.maximum
» mapred.jobtracker.retirejob.interval
» mapred.jobtracker.retiredjobs.cache.size
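To see why this can exhaust the JobTracker heap, the formula can be evaluated for a sample cluster. The user count and the average heap cost per job below are made-up inputs for illustration. The value 100 is the usual Hadoop 1.x default for mapred.jobtracker.completeuserjobs.maximum, but check your own mapred-site.xml.

```python
def retained_job_objects(users_with_jobs: int, completeuserjobs_maximum: int) -> int:
    """Upper bound on completed JobInProgress instances kept in JobTracker memory."""
    return users_with_jobs * completeuserjobs_maximum


def retained_heap_mb(job_objects: int, avg_mb_per_job: float) -> float:
    """Rough heap held by those instances, given an assumed average size per job."""
    return job_objects * avg_mb_per_job


if __name__ == "__main__":
    jobs = retained_job_objects(users_with_jobs=50, completeuserjobs_maximum=100)
    print(jobs)                         # 5000
    print(retained_heap_mb(jobs, 2.0))  # 10000.0 (MB, with an assumed 2 MB per job)
```

The bound grows linearly with both factors, so lowering the per-user maximum shrinks it in direct proportion.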
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Monitoring
Wall of Ops
Monitoring
» hadoop.namenode.CallQueueLength
» hadoop.jobtracker.jvm.memheapusedm
Don't fly blind: you will crash.
MR Workload Monitoring
Network Monitoring
Don't blame the network; monitor it instead. A network mesh can be a mess.
Alerting
Monitoring is not enough; you need better alerting.
Alerts
>> Checking whether NN and JT are up is a no-brainer
>> Reduce alert noise by having summary/aggregate alerts
>> We rely heavily on custom scripts that query JMX for NN and JT
http://hostname:port/jmx
» qry=Hadoop:service=NameNode,name=NameNodeInfo
  NameDirStatuses, DeadNodes, NumberOfMissingBlocks
» qry=Hadoop:service=NameNode,name=FSNamesystemState
  FSState, CapacityRemaining, NumDeadDataNodes, UnderReplicatedBlocks
» qry=hadoop:service=JobTracker,name=JobTrackerInfo
  Blacklisted TTs, jobs, slots_used, ThreadCount
» qry=java.lang:type=Memory
  Used JVM, free JVM, etc.
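The custom scripts themselves are not shown in the deck. A minimal sketch of the idea, assuming the JMX servlet's usual JSON shape (a top-level "beans" list), might look like the following. The function names and the thresholds are hypothetical. fetch_bean needs a reachable NameNode, so the demo at the bottom feeds namenode_alerts a canned sample instead.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def jmx_url(host: str, port: int, bean: str) -> str:
    """Build a JMX servlet URL that asks for a single bean."""
    return f"http://{host}:{port}/jmx?{urlencode({'qry': bean})}"


def fetch_bean(host: str, port: int, bean: str, timeout: float = 5.0) -> dict:
    """Fetch one bean; the servlet wraps results in a top-level "beans" list."""
    with urlopen(jmx_url(host, port, bean), timeout=timeout) as resp:
        beans = json.load(resp)["beans"]
    return beans[0] if beans else {}


def namenode_alerts(state: dict, max_dead_nodes: int = 0,
                    max_under_replicated: int = 1000) -> list:
    """Return one summary message per breached threshold (thresholds are made up)."""
    alerts = []
    if state.get("NumDeadDataNodes", 0) > max_dead_nodes:
        alerts.append(f"{state['NumDeadDataNodes']} dead DataNodes")
    if state.get("UnderReplicatedBlocks", 0) > max_under_replicated:
        alerts.append(f"{state['UnderReplicatedBlocks']} under-replicated blocks")
    return alerts


if __name__ == "__main__":
    # Canned FSNamesystemState sample; in production this would come from
    # fetch_bean(host, port, "Hadoop:service=NameNode,name=FSNamesystemState").
    sample = {"FSState": "Operational", "NumDeadDataNodes": 2, "UnderReplicatedBlocks": 10}
    print(namenode_alerts(sample))  # ['2 dead DataNodes']
```

Returning a list of messages per check makes it easy to roll several breaches into one summary alert, in line with the advice about reducing alert noise.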
MR Workload Alerting
» Monitor the MR workload and alert
  – An in-house tool that uses the “houdah” Ruby gem monitors:
  – Long-running jobs, jobs with more map tasks, blacklisted TTs with more failure counts, etc.
» Collect details and auto-restart blacklisted TTs
» Parse the JT log file for rogue jobs
» Parse the JT log and collect all job-related info
» White Elephant or hRaven could help
» Parse the scheduler HTML page or use the metrics page:
  http://<JT-hostname>:50030/scheduler?advanced
  http://<JT-hostname>:50030/metrics
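The in-house tool is not shown, and it is built on a Ruby gem. The sketch below only illustrates the kind of rule it might apply, written in Python over plain records rather than against any real JobTracker API. JobSnapshot, flag_jobs and both thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSnapshot:
    job_id: str
    runtime_minutes: float
    map_tasks: int


def flag_jobs(jobs, max_runtime_minutes=240, max_map_tasks=50_000):
    """Return {job_id: [reasons]} for jobs that cross either made-up threshold."""
    flagged = {}
    for job in jobs:
        reasons = []
        if job.runtime_minutes > max_runtime_minutes:
            reasons.append("long running")
        if job.map_tasks > max_map_tasks:
            reasons.append("too many map tasks")
        if reasons:
            flagged[job.job_id] = reasons
    return flagged


if __name__ == "__main__":
    sample = [
        JobSnapshot("job_a", runtime_minutes=30, map_tasks=1_200),
        JobSnapshot("job_b", runtime_minutes=600, map_tasks=80_000),
    ]
    print(flag_jobs(sample))  # {'job_b': ['long running', 'too many map tasks']}
```

In practice the snapshots would be filled from whichever source is available: the JT log, the scheduler page, or the metrics page listed above.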
Proprietary amp Confidential Copyright copy 2014
Data Center Design
bull Racks custom built at Rocket Fuelbull Leased spacebandwidth in colocation facilities
Hadoop Server20 2U servers (85kW)
Bidders40 2-U Twin 2 servers (17kW)
Proprietary amp Confidential Copyright copy 2014
Rocket Fuel Scale
raquo34474 CPU processor coresndash2655 serversndash1874 Teraflops of computing
raquo188 Terabytes of memoryndash13X the memory of IBM computer Watson that
played Jeopardy
raquo42PB Petabytes of storagendash106X the data volume of the entire Library of
Congress
Proprietary amp Confidential Copyright copy 2014
Hadoop at Rocket Fuel
raquo 1400 servers
raquo 15K Disks
raquo 15K Cores
raquo 90 TB
raquo 30K MR slots
raquo 12K daily MR jobs
Proprietary amp Confidential Copyright copy 2014
200 Servers 1400 Servers
1 Year
5 PB
41 PB8x
Growth
Proprietary amp Confidential Copyright copy 2014
Data Architecture 30
Proprietary amp Confidential Copyright copy 2014
Hadoop Setup
QJM ZK Quorum
raquo 6x2TB Disksraquo 2x6 coreraquo 196 GB RAMraquo 2x1G NIC
raquo 12x3TB Disksraquo 2x6 coreraquo 64 GB RAMraquo 10G NIC
raquo same as DNrsquosraquo Dedicated disk
to ZK or JN
JT
Standby NN
ZKFCZKFC
Active NN
DNTT
DNTT
DNTT
DNTT
DNTT
DNTT
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
Puppet+
Infradb
Automation is key
Maintenance is Not Easy
Proprietary amp Confidential Copyright copy 2014
Puppet and Infradb
raquo Automate as much as you canraquo Adding a slave node to Hadoop cluster lt 120 secondsraquo Bringing up a new Hadoop cluster lt 500 secondsraquo MR slots are automatically determined based on hardware config
Isnrsquot it cool
Just define once
Performance Tuning
No issues when the cluster is small. Problems start when it grows.
Performance Tuning
» HDFS: dfs.datanode.handler.count, dfs.namenode.handler.count, dfs.datanode.max.transfer.threads, dfs.image.transfer.timeout
» MR: mapred.reduce.parallel.copies, mapreduce.reduce.shuffle.parallelcopies, mapred.job.tracker.handler.count, io.sort.mb, io.sort.factor
» ZK: maxClientCnxns
» JVM: -XX:+UseConcMarkSweepGC, -XX:CMSFullGCsBeforeCompaction=1, -XX:CMSInitiatingOccupancyFraction=60
» ha-timeoutms
» IMP: MAPREDUCE-2026
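As a rough illustration of where the HDFS settings above end up, the sketch below renders a few of them as Hadoop `<property>` entries. The `render_site_xml` helper is hypothetical, and the values are placeholders chosen only to make the example run; the slides give no values, so these are not recommendations.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a name -> value mapping as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values (assumptions); tune against your own cluster's load.
hdfs_overrides = {
    "dfs.namenode.handler.count": 64,
    "dfs.datanode.handler.count": 16,
    "dfs.datanode.max.transfer.threads": 8192,
}

print(render_site_xml(hdfs_overrides))
```

Generating the file from one mapping keeps tuning changes reviewable and lets the same data feed a configuration management tool.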
We Have an Issue
» MAPREDUCE-5351
» MAPREDUCE-5508
» keep.failed.task.files=true
JT OOM
» Instances of the “JobInProgress” class = number of users who submitted jobs x mapred.jobtracker.completeuserjobs.maximum
» mapred.jobtracker.completeuserjobs.maximum
» mapred.jobtracker.retirejob.interval
» mapred.jobtracker.retiredjobs.cache.size
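The relationship on the JT OOM slide can be turned into a quick back-of-the-envelope estimate. This is only a sketch: the function names are made up for illustration, and the per-job memory figure is an assumption you would have to measure from a heap dump.

```python
def retained_job_objects(num_users, completeuserjobs_maximum):
    """Completed JobInProgress instances the JobTracker may keep in memory.

    Mirrors the slide: users who submitted jobs multiplied by
    mapred.jobtracker.completeuserjobs.maximum.
    """
    return num_users * completeuserjobs_maximum


def estimated_heap_mb(num_users, completeuserjobs_maximum, avg_job_mb):
    """Rough heap held by retained jobs; avg_job_mb is a measured or assumed figure."""
    return retained_job_objects(num_users, completeuserjobs_maximum) * avg_job_mb


# Example: 50 users, 100 retained jobs per user, an assumed 2 MB per retained job.
print(retained_job_objects(50, 100))
print(estimated_heap_mb(50, 100, 2))
```

Because the product grows with the number of users, lowering the per-user limit or retiring jobs sooner is the lever that shrinks the JobTracker heap.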
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Monitoring
Wall of Ops
Monitoring
» hadoop.namenode.CallQueueLength
» hadoop.jobtracker.jvm.memheapusedm
Don't fly blind, you will crash
MR Workload Monitoring
Network Monitoring
Don't blame the network; monitor it instead. A network mesh can be a mess.
Alerting
Monitoring is not enough; you need better alerting.
Alerts
» Checking whether NN and JT are up is a no-brainer
» Reduce alert noise by having summary/aggregate alerts
» We heavily rely on custom scripts that query JMX for NN and JT
http://hostname:port/jmx
» qry=Hadoop:service=NameNode,name=NameNodeInfo
- NameDirStatuses, DeadNodes, NumberOfMissingBlocks
» qry=Hadoop:service=NameNode,name=FSNamesystemState
- FSState, CapacityRemaining, NumDeadDataNodes, UnderReplicatedBlocks
» qry=hadoop:service=JobTracker,name=JobTrackerInfo
- Blacklisted TTs, jobs, slots_used, ThreadCount
» qry=java.lang:type=Memory
- Used JVM, free JVM, etc.
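A custom JMX check of the kind mentioned above might look roughly like this. It is a sketch under assumptions: the function names, the dead-node threshold, and the example host are hypothetical, and it assumes the `/jmx` endpoint returns JSON with a top-level `beans` list and that `FSState` reads `safeMode` when the NameNode is in safe mode.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def jmx_url(host, port, bean):
    """Build the /jmx query URL for one MBean name."""
    return f"http://{host}:{port}/jmx?" + urlencode({"qry": bean}, safe=":=,")


def first_bean(payload):
    """Return the first bean from a /jmx JSON response, or {} if none matched."""
    beans = json.loads(payload).get("beans", [])
    return beans[0] if beans else {}


def fetch_bean(host, port, bean, timeout=10):
    """Query a live daemon (not exercised in the offline example below)."""
    with urlopen(jmx_url(host, port, bean), timeout=timeout) as resp:
        return first_bean(resp.read().decode("utf-8"))


def summarize(info, state, max_dead_nodes=5):
    """Fold several NameNode checks into one summary string to cut alert noise."""
    problems = []
    if info.get("NumberOfMissingBlocks", 0) > 0:
        problems.append(f"{info['NumberOfMissingBlocks']} missing blocks")
    if state.get("NumDeadDataNodes", 0) > max_dead_nodes:
        problems.append(f"{state['NumDeadDataNodes']} dead DataNodes")
    if state.get("FSState") == "safeMode":
        problems.append("NameNode in safe mode")
    return "; ".join(problems)


# Offline example with canned responses instead of a live NameNode.
sample_info = '{"beans": [{"NumberOfMissingBlocks": 3}]}'
sample_state = '{"beans": [{"NumDeadDataNodes": 1, "FSState": "Operational"}]}'
print(jmx_url("nn.example.com", 50070, "Hadoop:service=NameNode,name=NameNodeInfo"))
print(summarize(first_bean(sample_info), first_bean(sample_state)))
```

Returning a single summary string per service, rather than one alert per attribute, is one simple way to get the aggregate alerts the slide recommends.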
MR Workload Alerting
» Monitor the MR workload and alert
- In-house tool that uses the “houdah” Ruby gem
- Monitors long-running jobs, jobs with more map tasks, blacklisted TTs with more failure counts, etc.
» Collect details and auto-restart blacklisted TTs
» Parse the JT log file for rogue jobs
» Parse the JT log and collect all job-related info
» White-elephant or hRaven could help
» Parse the scheduler HTML page or use the metrics page
- http://<JT-hostname>:50030/scheduler?advanced
- http://<JT-hostname>:50030/metrics
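The workload checks listed above can be sketched as a small filter over job snapshots. The deck's in-house tool is Ruby-based and not shown, so this Python version is only an illustration: the `JobSnapshot` shape, the `flag_jobs` function, and both thresholds are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobSnapshot:
    """One running job as a monitoring script might record it (hypothetical shape)."""
    job_id: str
    user: str
    runtime_minutes: int
    map_tasks: int


def flag_jobs(jobs, max_runtime_minutes=240, max_map_tasks=50000):
    """Return {job_id: [reasons]} for jobs that look long-running or oversized."""
    flagged = {}
    for job in jobs:
        reasons = []
        if job.runtime_minutes > max_runtime_minutes:
            reasons.append(f"running for {job.runtime_minutes} min")
        if job.map_tasks > max_map_tasks:
            reasons.append(f"{job.map_tasks} map tasks")
        if reasons:
            flagged[job.job_id] = reasons
    return flagged


jobs = [
    JobSnapshot("job_0001", "etl", runtime_minutes=30, map_tasks=1200),
    JobSnapshot("job_0002", "adhoc", runtime_minutes=600, map_tasks=90000),
]
print(flag_jobs(jobs))
```

Keeping the reasons alongside each job id makes the resulting alert actionable without another lookup on the JobTracker.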
Multi Tenancy
» Modeling
» OPS
» ETL
» Ad-hoc
Scheduling
No scheduler is perfect unless you understand and tune it properly
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
BCP
» BCP: Business Continuity Plan
» Near-real-time reporting over 15+ TB of daily data
» Freshness of models trained over petabytes of data
Data BCP Cluster
[Diagram: INW Data Cluster (US, EU and HK Serving Clusters; Modeling, Reporting, User Queries); LSV Data Cluster (US/EU/HK Serving Clusters; Research, Ad-hoc Queries); Amazon Backup; Processed Data]
YARN
JobTracker
» Resource Manager
- Global resource scheduler
- Hierarchical queues
- Application management
» Node Manager
- Per-machine agent
- Manages life cycle of container
- Container resource monitoring
» Application Master
- Per-application
- Manages application scheduling and task execution
YARN at Rocket Fuel
» YARN is in production
» 700+ nodes
» 31TB RAM, 8500 disks, 8500 cores
» Primary use case: MapReduce
» No more static slots
» Tez, Spark, Storm are in the race
YAY
Obligatory “we are hiring” slide
http://rocketfuel.com/careers
THANKS
kishore@rocketfuel.com, apol@rocketfuel.com
Proprietary amp Confidential Copyright copy 2014
Hadoop Setup
QJM ZK Quorum
raquo 6x2TB Disksraquo 2x6 coreraquo 196 GB RAMraquo 2x1G NIC
raquo 12x3TB Disksraquo 2x6 coreraquo 64 GB RAMraquo 10G NIC
raquo same as DNrsquosraquo Dedicated disk
to ZK or JN
JT
Standby NN
ZKFCZKFC
Active NN
DNTT
DNTT
DNTT
DNTT
DNTT
DNTT
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
Puppet+
Infradb
Automation is key
Maintenance is Not Easy
Puppet and Infradb
» Automate as much as you can
» Adding a slave node to a Hadoop cluster: < 120 seconds
» Bringing up a new Hadoop cluster: < 500 seconds
» MR slots are automatically determined based on hardware config
Isn't it cool?
Just define once
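The slide says MR slots are derived from the hardware config but does not give the rule. The sketch below is only a hypothetical heuristic to show the idea: reserve some cores and memory for the DN/TT daemons, cap the slot count by whichever of CPU or RAM runs out first, and split the result between map and reduce. The function name, the reserved amounts, the per-slot memory, and the 2:1 split are all assumptions, not Rocket Fuel's actual Puppet logic.

```python
def derive_mr_slots(cores, ram_gb, reserved_cores=2, reserved_ram_gb=8,
                    ram_per_slot_gb=2, map_fraction=2 / 3):
    """Return (map_slots, reduce_slots) for one TaskTracker node.

    Hypothetical heuristic for illustration only; every default
    here is an assumption, not a value taken from the slides.
    """
    # Leave headroom for the DataNode and TaskTracker daemons.
    usable_cores = max(cores - reserved_cores, 1)
    usable_ram_gb = max(ram_gb - reserved_ram_gb, ram_per_slot_gb)

    # The scarcer resource limits the total number of slots.
    by_cpu = usable_cores
    by_ram = usable_ram_gb // ram_per_slot_gb
    total = max(min(by_cpu, by_ram), 2)

    # Split the total between map and reduce, keeping at least one of each.
    map_slots = max(int(total * map_fraction), 1)
    reduce_slots = max(total - map_slots, 1)
    return map_slots, reduce_slots


# A node with 2x6 cores and 64 GB RAM, as in one of the hardware profiles.
print(derive_mr_slots(cores=12, ram_gb=64))
```

Keeping the rule in one function is what makes "just define once" possible: the same code can be evaluated per host from its inventory record, so a new hardware profile needs no hand-edited slot counts.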
Performance Tuning
No issues when the cluster is small. Problems start when it grows.
Performance Tuning
HDFS: dfs.datanode.handler.count, dfs.namenode.handler.count, dfs.datanode.max.transfer.threads, dfs.image.transfer.timeout
MR: mapred.reduce.parallel.copies, mapred.job.tracker.handler.count, io.sort.mb, io.sort.factor, mapreduce.reduce.shuffle.parallelcopies
ZK: maxClientCnxns
JVM: -XX:+UseConcMarkSweepGC, -XX:CMSFullGCsBeforeCompaction=1, -XX:CMSInitiatingOccupancyFraction=60
ha-timeoutms
IMP: MAPREDUCE-2026
We Have an Issue
MAPREDUCE-5351
MAPREDUCE-5508
keep.failed.task.files=true
JT OOM
Instances of the "JobInProgress" class = number of users who submitted jobs × mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.retirejob.interval
mapred.jobtracker.retiredjobs.cache.size
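The JobInProgress estimate above is simple multiplication, and a quick calculation shows why it matters for the JT heap. The sketch below is a rough worked example. The user count, the per-user limit of 100, and the 2 MB per retained job are illustrative assumptions rather than numbers from the slides.

```python
def retained_job_objects(users_with_jobs, completeuserjobs_maximum):
    """Upper bound on completed JobInProgress instances held by the JT.

    Mirrors the slide's formula: users who submitted jobs multiplied by
    mapred.jobtracker.completeuserjobs.maximum.
    """
    return users_with_jobs * completeuserjobs_maximum


def estimated_heap_mb(users_with_jobs, completeuserjobs_maximum, mb_per_job):
    """Rough heap held by retained jobs; mb_per_job is a guess you supply."""
    return retained_job_objects(users_with_jobs, completeuserjobs_maximum) * mb_per_job


# Illustrative numbers only: 50 users, 100 retained jobs each, ~2 MB per job.
print(retained_job_objects(50, 100))      # 5000
print(estimated_heap_mb(50, 100, 2.0))    # 10000.0
```

Lowering the per-user limit or retiring jobs sooner shrinks the first factor of the product, which is presumably why the slide lists the three retention properties together.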
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
Monitoring
Wall of Ops
Monitoring
hadoopnamenodeCallQueueLength, hadoopjobtrackerjvmmemheapusedm
Don't fly blind; you will crash.
MR Workload Monitoring
Network Monitoring
Don't blame the network; monitor it instead. A network mesh can be a mess.
Alerting
Monitoring is not enough; you need better alerting.
Alerts
>> Checking whether NN and JT are up is a no-brainer
>> Reduce alert noise by having summary/aggregate alerts
>> We rely heavily on custom scripts that query JMX for NN and JT
http://hostname:port/jmx
qry=Hadoop:service=NameNode,name=NameNodeInfo: NameDirStatuses, DeadNodes, NumberOfMissingBlocks
qry=Hadoop:service=NameNode,name=FSNamesystemState: FSState, CapacityRemaining, NumDeadDataNodes, UnderReplicatedBlocks
qry=hadoop:service=JobTracker,name=JobTrackerInfo: blacklisted TTs, jobs, slots_used, ThreadCount
qry=java.lang:type=Memory: used JVM, free JVM, etc.
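The JMX queries above can be scripted with very little code. The sketch below is a minimal illustration, not the custom scripts the slide refers to. It assumes the `/jmx` endpoint returns JSON with a top-level `beans` array, and the host name, port, and helper names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def jmx_url(host, port, bean):
    """Build a /jmx query URL for one MBean name."""
    # Keep ':', '=' and ',' readable; they are part of the MBean name.
    return "http://{}:{}/jmx?qry={}".format(host, port, quote(bean, safe=":=,"))


def pick_attributes(payload, names):
    """Extract the requested attributes from the first bean in a JSON reply."""
    beans = json.loads(payload).get("beans", [])
    if not beans:
        return {}
    first = beans[0]
    return {name: first[name] for name in names if name in first}


def fetch(host, port, bean, names, timeout=10):
    """Query a live daemon; needs network access to the NN or JT web port."""
    with urlopen(jmx_url(host, port, bean), timeout=timeout) as resp:
        return pick_attributes(resp.read().decode("utf-8"), names)


# Offline demonstration with a hand-written reply (not real cluster output).
sample = '{"beans": [{"name": "demo", "NumberOfMissingBlocks": 0}]}'
print(jmx_url("nn.example.com", 50070, "Hadoop:service=NameNode,name=NameNodeInfo"))
print(pick_attributes(sample, ["NumberOfMissingBlocks", "NameDirStatuses"]))
```

Separating URL construction and parsing from the network call keeps the alert logic testable without a running cluster, and one script can cover NN, JT, and JVM memory by changing only the bean name and attribute list.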
MR Workload Alerting
» Monitoring MR workload and alerting
– An in-house tool that uses the "houdah" Ruby gem monitors:
– Long-running jobs, jobs with many map tasks, blacklisted TTs with high failure counts, etc.
» Collect details and auto-restart blacklisted TTs
» Parse the JT log file for rogue jobs
» Parse the JT log and collect all job-related info
» White Elephant or hRaven could help
» Parse the scheduler HTML page or use the metrics page: http://<JT-hostname>:50030/scheduler?advanced, http://<JT-hostname>:50030/metrics
Multi Tenancy
Modeling
OPS
ETL
Ad-hoc
Scheduling
No scheduler is perfect unless you understand and tune it properly.
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
BCP
» BCP: Business Continuity Plan
» Near-real-time reporting over 15+ TB of daily data
» Freshness of models trained over petabytes of data
Data BCP Cluster
Diagram labels: INW Data Cluster; US Serving Clusters, EU Serving Clusters, HK Serving Clusters; Modeling, Reporting, User Queries; Amazon Backup; LSV Data Cluster; US/EU/HK Serving Clusters; Research, Ad-hoc Queries; Processed Data
YARN
JobTracker
» Resource Manager
- Global resource scheduler
- Hierarchical queues
- Application management
» Node Manager
- Per-machine agent
- Manages life cycle of container
- Container resource monitoring
» Application Master
- Per-application
- Manages application scheduling and task execution
- Hadorsquoopsrsquo or Hadrsquooopsrsquo 1
- The Web Is Monetized By Advertising
- Delivery Methods
- Overview
- Always buying the best impressions amp serving the best ad
- Real Time Bidding and Serving
- Overview (2)
- Throughput
- Latency
- Architecture and Scale
- Data Center Expansion
- Data Center Design
- Rocket Fuel Scale
- Hadoop at Rocket Fuel
- Growth
- Data Architecture 30
- Hadoop Setup
- Operations
- Maintenance is Not Easy
- Puppet and Infradb
- Performance Tuning
- Performance Tuning (2)
- We Have an Issue
- JT OOM
- Operations (2)
- Monitoring
- Monitoring (2)
- MR Workload Monitoring
- Network Monitoring
- Alerting
- Alerts
- MR Workload Alerting
- Multi Tenancy
- Scheduling
- Operations (3)
- BCP
- Data BCP Cluster
- YARN
- YARN at Rocket FueI
- Obligatory ldquowe are hiringrdquo slide
- Slide 41
-
Proprietary amp Confidential Copyright copy 2014
dfsdatanodehandlercount dfsnamenodehandlercount
dfsdatanodemaxtransferthreads dfsimagetransfertimeout
mapredreduceparallelcopies
mapredjobtrackerhandlercount
iosortmbiosortfactor
maxClientCnxns ZK
HDFS
MR
IMP MAPREDUCE-2026
-XX+UseConcMarkSweepGC
-XXCMSFullGCsBeforeCompaction=1
-XXCMSInitiatingOccupancyFraction=60
ha-timeoutms
JVM
Performance Tuning
mapreducereduceshuffleparallelcopies
Proprietary amp Confidential Copyright copy 2014
MAPREDUCE-5351
MAPREDUCE-5508
keepfailedtaskfiles=true
We Have an Issue
Proprietary amp Confidential Copyright copy 2014
instances of JobInProgressrdquo class = no of users submitted jobs X mapredjobtrackercompleteuserjobsmaximum
mapredjobtrackercompleteuserjobsmaximum mapredjobtrackerretirejobinterval
mapredjobtrackerretiredjobscachesize
JT OOM
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
Monitoring
Wall of Ops
Proprietary amp Confidential Copyright copy 2014
Monitoring
hadoopnamenodeCallQueueLength hadoopjobtrackerjvmmemheapusedm
Donrsquot fly blind you will crash
Proprietary amp Confidential Copyright copy 2014
MR Workload Monitoring
Proprietary amp Confidential Copyright copy 2014
Network Monitoring
Donrsquot blame network instead monitor it Network Mesh can be mess
Proprietary amp Confidential Copyright copy 2014
Alerting
Monitoring is not enough need better Alerting
Proprietary amp Confidential Copyright copy 2014
Alerts
httphostnameportjmx
qry=Hadoopservice=NameNodename=NameNodeInfo
gtgt Checking whether NN and JT are up is a no brainer gtgt Reduce alert noise by having summaryaggregate alerts gtgt We heavily rely on custom scripts that query jmx for NN and JT
qry=hadoopservice=JobTrackername=JobTrackerInfo
NameDirStatuses DeadNodes NumberOfMissingBlocks
qry=Hadoopservice=NameNodename=FSNamesystemState
FSState CapacityRemaining NumDeadDataNodes UnderReplicatedBlocks
Blacklisted TTrsquos jobs slots_used ThreadCount
qry=javalangtype=Memory
Used jvm free jvm etc
Proprietary amp Confidential Copyright copy 2014
MR Workload Alerting
raquo Monitoring MR workload and alertndash In-house tool that use ldquohoudahrdquo ruby gem monitorsndash Long running jobs jobs with more map tasks blacklisted TTrsquos
with more failure counts etchellip
raquo Collect details and auto-restart blacklisted TTrsquosraquo Parse the JT logfile for rouge jobsraquo Parse the JT log and collects all Job related inforaquo White-elephant or hraven could helpraquo Parse the scheduler html page or use metrics page httpltJT-hostnamegt50030scheduleradvanced httpltJT-hostnamegt50030metrics
Proprietary amp Confidential Copyright copy 2014
Modeling
OPS
ETL
Ad-hoc
Multi Tenancy
Proprietary amp Confidential Copyright copy 2014
No Scheduler is perfect unless you understand and tune it properly
Scheduling
Proprietary amp Confidential Copyright copy 2014
Operations
raquo Maintenanceraquo Performance Tuningraquo Monitoringraquo BCPraquo YARN
Proprietary amp Confidential Copyright copy 2014
BCP
raquo BCP Business Continuity Planraquo Near real time reporting over 15+ TB of daily dataraquo Freshness of models trained over petabytes of data
Proprietary amp Confidential Copyright copy 2014
Data BCP Cluster
INW Data
Cluster
US Serving Clusters
EU Serving Clusters
HK Serving Clusters
Modeling
Reporting
User Queries
Amazon BackupLSV Data
Cluster
USEUHK Serving Clusters
Research
Ad-hoc Queries
Processed Data
Proprietary amp Confidential Copyright copy 2014
YARN
JobTracker
raquo Resource Manager - Global resource scheduler - Hierarchical queues - Application management
raquo Node Manager - Per-machine agent - Manages life cycle of container - Container resource monitoring
raquo Application Master - Per-application - Manages application scheduling and task execution

YARN at Rocket Fuel
» YARN is in production
» 700+ nodes
» 31 TB RAM, 8500 disks, 8500 cores
» Primary use case: Map-Reduce
» No more static slots
» Tez, Spark, and Storm are in the race
YAY

Obligatory "we are hiring" slide
http://rocketfuel.com/careers

THANKS
kishore@rocketfuel.com
apol@rocketfuel.com

- Hado"ops" or Had"oops"
- The Web Is Monetized By Advertising
- Delivery Methods
- Overview
- Always buying the best impressions & serving the best ad
- Real Time Bidding and Serving
- Overview (2)
- Throughput
- Latency
- Architecture and Scale
- Data Center Expansion
- Data Center Design
- Rocket Fuel Scale
- Hadoop at Rocket Fuel
- Growth
- Data Architecture 30
- Hadoop Setup
- Operations
- Maintenance is Not Easy
- Puppet and Infradb
- Performance Tuning
- Performance Tuning (2)
- We Have an Issue
- JT OOM
- Operations (2)
- Monitoring
- Monitoring (2)
- MR Workload Monitoring
- Network Monitoring
- Alerting
- Alerts
- MR Workload Alerting
- Multi Tenancy
- Scheduling
- Operations (3)
- BCP
- Data BCP Cluster
- YARN
- YARN at Rocket Fuel
- Obligatory "we are hiring" slide
- Slide 41

We Have an Issue
MAPREDUCE-5351
MAPREDUCE-5508
keep.failed.task.files=true

JT OOM
Instances of the "JobInProgress" class = no. of users who submitted jobs × mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.completeuserjobs.maximum
mapred.jobtracker.retirejob.interval
mapred.jobtracker.retiredjobs.cache.size
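As a worked example of the JobInProgress estimate on this slide, the sketch below multiplies the two factors. The figures of 200 users and a limit of 100 are made-up inputs for illustration, not numbers from the talk.

```python
def retained_job_in_progress(users_with_jobs: int, complete_user_jobs_maximum: int) -> int:
    """Upper bound on JobInProgress objects the JobTracker keeps in memory.

    Mirrors the slide's formula: number of users who submitted jobs times
    mapred.jobtracker.completeuserjobs.maximum.
    """
    return users_with_jobs * complete_user_jobs_maximum


# Illustrative inputs only (assumptions, not values from the slides).
estimate = retained_job_in_progress(users_with_jobs=200, complete_user_jobs_maximum=100)
print(f"Up to {estimate} JobInProgress instances retained")
```

Lowering the per-user limit shrinks this bound linearly, which is why that property appears first in the list of settings on the slide.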

Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN

Monitoring
Wall of Ops

Monitoring
hadoop.namenode.CallQueueLength
hadoop.jobtracker.jvm.memheapusedm
Don't fly blind; you will crash
MR Workload Monitoring

Network Monitoring
Don't blame the network; monitor it instead. A network mesh can be a mess.

Alerting
Monitoring is not enough; you need better alerting

Alerts
http://hostname:port/jmx
>> Checking whether NN and JT are up is a no-brainer
>> Reduce alert noise by having summary/aggregate alerts
>> We rely heavily on custom scripts that query JMX for NN and JT
qry=Hadoop:service=NameNode,name=NameNodeInfo
  NameDirStatuses, DeadNodes, NumberOfMissingBlocks
qry=Hadoop:service=NameNode,name=FSNamesystemState
  FSState, CapacityRemaining, NumDeadDataNodes, UnderReplicatedBlocks
qry=hadoop:service=JobTracker,name=JobTrackerInfo
  Blacklisted TTs, jobs, slots_used, ThreadCount
qry=java.lang:type=Memory
  Used JVM memory, free JVM memory, etc.
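A custom script of the kind this slide mentions might look roughly like the sketch below. It builds a `/jmx?qry=...` URL and pulls selected attributes out of the JSON response, which wraps results in a top-level `beans` list. The host name, port, and sample values in the usage note are assumptions for illustration; this is not the scripts used at Rocket Fuel.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def jmx_url(host: str, port: int, bean: str) -> str:
    """Build a JMX servlet URL that selects one MBean via the qry parameter."""
    return f"http://{host}:{port}/jmx?{urlencode({'qry': bean})}"


def bean_attributes(payload: str, names):
    """Pick the named attributes out of the first bean in a JMX JSON response."""
    beans = json.loads(payload).get("beans", [])
    if not beans:
        return {}
    return {name: beans[0].get(name) for name in names}


def fetch_attributes(host, port, bean, names, timeout=5):
    """Query a live daemon; raises URLError if the host is unreachable."""
    with urlopen(jmx_url(host, port, bean), timeout=timeout) as response:
        return bean_attributes(response.read().decode("utf-8"), names)
```

For example, `fetch_attributes("nn.example.com", 50070, "Hadoop:service=NameNode,name=FSNamesystemState", ["NumDeadDataNodes", "UnderReplicatedBlocks"])` would return a dictionary that an alerting script can compare against thresholds; the host and port here are placeholders. Parsing is kept separate from fetching so the attribute selection can be checked against a saved response without a running cluster.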
Proprietary amp Confidential Copyright copy 2014
Alerting
Monitoring is not enough need better Alerting
Proprietary amp Confidential Copyright copy 2014
Alerts
httphostnameportjmx
qry=Hadoopservice=NameNodename=NameNodeInfo
gtgt Checking whether NN and JT are up is a no brainer gtgt Reduce alert noise by having summaryaggregate alerts gtgt We heavily rely on custom scripts that query jmx for NN and JT
qry=hadoopservice=JobTrackername=JobTrackerInfo
NameDirStatuses DeadNodes NumberOfMissingBlocks
qry=Hadoopservice=NameNodename=FSNamesystemState
FSState CapacityRemaining NumDeadDataNodes UnderReplicatedBlocks
Blacklisted TTrsquos jobs slots_used ThreadCount
qry=javalangtype=Memory
Used jvm free jvm etc
MR Workload Alerting
» Monitor the MR workload and alert
  – An in-house tool that uses the "houdah" Ruby gem monitors:
  – Long-running jobs, jobs with a large number of map tasks, blacklisted TTs with high failure counts, etc.
» Collect details and auto-restart blacklisted TTs
» Parse the JT log file for rogue jobs
» Parse the JT log and collect all job-related info
» White Elephant or hRaven could help
» Parse the scheduler HTML page or use the metrics page
  http://<JT-hostname>:50030/scheduler?advanced
  http://<JT-hostname>:50030/metrics
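The workload checks above (long-running jobs, jobs with many map tasks) reduce to a simple filter once job details have been collected. The Python sketch below is illustrative only: it is not the in-house tool, and the `JobInfo` record shape, the `flag_jobs` helper, and both default limits are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class JobInfo:
    """Hypothetical job record, e.g. assembled from JT logs or the metrics page."""
    job_id: str
    user: str
    start_ts: float  # job start time, epoch seconds
    map_tasks: int   # number of map tasks in the job


def flag_jobs(jobs, now, max_runtime_s=6 * 3600, max_map_tasks=50_000):
    """Return (job_id, user, reasons) for every job that breaches a limit."""
    alerts = []
    for job in jobs:
        reasons = []
        runtime = now - job.start_ts
        if runtime > max_runtime_s:
            reasons.append(f"running {runtime / 3600:.1f}h")
        if job.map_tasks > max_map_tasks:
            reasons.append(f"{job.map_tasks} map tasks")
        if reasons:
            alerts.append((job.job_id, job.user, reasons))
    return alerts
```

Returning all reasons per job, rather than one alert per breached limit, keeps the output in line with the summary-alert approach from the Alerts slide.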
Multi Tenancy
[Figure labels: Modeling, OPS, ETL, Ad-hoc]
Scheduling
No scheduler is perfect unless you understand and tune it properly.
Operations
» Maintenance
» Performance Tuning
» Monitoring
» BCP
» YARN
BCP
» BCP: Business Continuity Plan
» Near-real-time reporting over 15+ TB of daily data
» Freshness of models trained over petabytes of data
Data BCP Cluster
[Figure labels: INW Data Cluster; US Serving Clusters; EU Serving Clusters; HK Serving Clusters; Modeling; Reporting; User Queries; Amazon Backup; LSV Data Cluster; US/EU/HK Serving Clusters; Research; Ad-hoc Queries; Processed Data]
YARN
JobTracker
» Resource Manager
  - Global resource scheduler
  - Hierarchical queues
  - Application management
» Node Manager
  - Per-machine agent
  - Manages the life cycle of containers
  - Container resource monitoring
» Application Master
  - Per-application
  - Manages application scheduling and task execution
YARN at Rocket Fuel
» YARN is in production
» 700+ nodes
» 31 TB RAM, 8500 disks, 8500 cores
» Primary use case: MapReduce
» No more static slots
» Tez, Spark, and Storm are in the race
YAY
Obligatory "we are hiring" slide
http://rocketfuel.com/careers
THANKS
kishore@rocketfuel.com, apol@rocketfuel.com