Hadoop Monitoring best Practices
Uploaded by edward-capriolo
Transcript of Hadoop Monitoring best Practices
1. How to monitor the $H!T out of Hadoop: Developing a comprehensive open approach to monitoring Hadoop clusters
2. Relevant Hadoop Information
- From 3 to 3000 Nodes
- Hardware/Software failures are common
- Redundant Components: DataNode, TaskTracker
- Non-redundant Components: NameNode, JobTracker, SecondaryNameNode
- Fast Evolving Technology (Best Practices?)
3. Monitoring Software
- Nagios
  - Red Yellow Green Alerts, Escalations
  - De facto Standard, Widely deployed
  - Text-based configuration
  - Web Interface
  - Pluggable with shell scripts/external apps
    - Return 0 - OK
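The "Return 0 - OK" bullet refers to the Nagios plugin convention: a check prints one status line and exits with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). A minimal sketch of that convention follows; the function names and the thresholds are illustrative assumptions, not part of the presentation.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(value, warn, crit):
    """Map a measurement to a Nagios state; higher values are worse."""
    if value is None:
        return UNKNOWN  # the measurement could not be taken
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def report(name, value, warn, crit):
    """Return (exit_code, status_line) for a single check."""
    state = classify(value, warn, crit)
    return state, f"{name} {LABELS[state]} - value={value}"


def run_check(name, value, warn, crit):
    """Print the status line and exit the way a Nagios plugin would."""
    code, line = report(name, value, warn, crit)
    print(line)
    sys.exit(code)
```

A script would end by calling something like run_check("SWAP", measured_percent, 80, 85); Nagios reads only the exit code and the first line of output.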
4. Cacti
- Performance Graphing System
- RRD/RRA Front End
- Slick Web Interface
- Template System for Graph Types
- Pluggable
  - SNMP input
  - Shell script/external program
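For the "shell script/external program" input, Cacti runs the program and parses what it prints; when a script returns several fields, they are printed on one line as space-separated name:value pairs. A small sketch of that output format, assuming the metric values have already been collected (BlocksTotal is used here only as an illustrative field name):

```python
def cacti_output(values):
    """Format a mapping as the 'name:value name:value' line Cacti parses."""
    return " ".join(f"{name}:{value}" for name, value in values.items())


# Example with made-up numbers.
print(cacti_output({"FilesTotal": 1200, "BlocksTotal": 3400}))
```

The field names must match the output fields defined in the Cacti data input method.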
5.
6. hadoop-cacti-jtg
- JMX Fetching Code w/ (kick off) scripts
- Cacti templates For Hadoop
- Premade Nagios Check Scripts
- Helper/Batch/automation scripts
- Apache License
7. Hadoop JMX
8. Sample Cluster P1
- NameNode & SecNameNode
  - Hardware RAID
  - 8 GB RAM
  - 1x QUAD CORE
  - DerbyDB (Hive) on SecNameNode
- JobTracker
  - 8 GB RAM
  - 1x QUAD CORE
9. A Sample Cluster P2
- Slave (hadoopdata1-XXXX)
  - JBOD 8x 1TB SATA Disk
  - RAM 16GB
  - 2x Quad Core
10. Prerequisites
- Nagios (install) DAG RPMs
- Cacti (install) Several RPMS
- Liberal network access to the cluster
11. Alerts & Escalations
- X nodes * Y Services = less Sleep
- Define a policy
  - Wake Me Ups (SMS)
  - Don't Wake Me Ups (EMAIL)
  - Review (Daily, Weekly, Monthly)
12. Wake Me Ups
- NameNode
  - Disk Full (Big Big Headache)
  - RAID Array Issues (failed disk)
- JobTracker
- SecNameNode
  - You do not realize it is not working until too late
13. Don't Wake Me Ups
- Or Wake someone else up
- DataNode
  - Warning: currently a failed disk will down the DataNode (see Jira)
- TaskTracker
- Hardware
  - Bad Disk (Start RMA)
- Slaves are expendable (up to a point)
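The policy from the last two slides can be written down as a small routing table: non-redundant masters page by SMS, slaves go to email. The sketch below is only an illustration; the 10% cutoff for "up to a point" is an assumed value, and the function names are hypothetical.

```python
# Components from the "Wake Me Ups" and "Don't Wake Me Ups" slides.
WAKE_ME_UP = {"NameNode", "JobTracker", "SecondaryNameNode"}
DONT_WAKE_ME_UP = {"DataNode", "TaskTracker"}


def notification_channel(component):
    """Return 'sms' for non-redundant masters, 'email' for everything else."""
    if component in WAKE_ME_UP:
        return "sms"
    # Assumption: unknown components are treated like slaves.
    return "email"


def slave_channel(down, total, max_down_fraction=0.1):
    """Slaves are expendable up to a point: page once too many are down.

    The 10% default is an assumption, not a value from the talk.
    """
    if total > 0 and down / total > max_down_fraction:
        return "sms"
    return "email"
```

In Nagios the same split is usually expressed with contact groups and notification commands rather than code; the sketch only makes the decision rule explicit.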
14. Monitoring Battle Plan
- Start With the Basics
  - Ping, Disk
- Add Hadoop Specific Alarms
  - check_data_node
- Add JMX Graphing
  - NameNodeOperations
- Add JMX Based alarms
  - FilesTotal > 1,000,000 or LiveNodes < 50%
15. The Basics Nagios
- Nagios (All Nodes)
  - Host up (Ping check)
  - Disk % Full
  - SWAP > 85 %
- * Load based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
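The disk and swap checks above are normally done with stock Nagios plugins. To show what they measure, here is a rough Python equivalent, assuming a Linux host where swap figures are read from /proc/meminfo; the 85% threshold comes from the slide, everything else is illustrative.

```python
import shutil


def disk_percent_full(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def swap_percent_used(meminfo_text):
    """Compute swap usage from the text of /proc/meminfo (Linux only)."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - fields.get("SwapFree", 0)) / total


def swap_state(percent, critical=85.0):
    """Nagios-style state: 2 (CRITICAL) above the threshold, else 0 (OK)."""
    return 2 if percent > critical else 0
```

On a live node the swap check would be called as swap_state(swap_percent_used(open("/proc/meminfo").read())).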
16. The Basics Cacti
- Cacti (All Nodes)
  - CPU (full CPU)
  - RAM/SWAP
  - Network
  - Disk Usage
17. Disk Utilization
18. RAID Tools
- Hpacucli: not a Street Fighter move
  - Alerts on RAID events (NameNode)
    - Disk failed
    - Rebuilding
  - JBOD (DataNode)
    - Failed Drive
    - Drive Errors
  - Dell, SUN, Vendor Specific Tools
19. Before you jump in
- X Nodes * Y Checks = Lots of work
- About 3 Nodes into the process
  - Wait!!! I need some interns!!!
- Solution: S.I.C.C.T. (Semi-Intelligent-Configuration-cloning-tools)
  - (I made that up)
  - (for this presentation)
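Since Nagios configuration is plain text, "configuration cloning" can be as simple as stamping one service template out for every host and check. The following is a sketch of that idea, not the tooling used in the talk; the host names and check commands passed in are whatever your site uses, and the template assumes a generic-service template exists in your Nagios setup.

```python
# One Nagios service stanza; doubled braces are literal braces for str.format.
SERVICE_TEMPLATE = """define service {{
    use                 generic-service
    host_name           {host}
    service_description {description}
    check_command       {command}
}}
"""


def clone_services(hosts, checks):
    """Render X hosts * Y checks service definitions as one config string.

    `checks` is a list of (service_description, check_command) pairs.
    """
    blocks = []
    for host in hosts:
        for description, command in checks:
            blocks.append(
                SERVICE_TEMPLATE.format(
                    host=host, description=description, command=command
                )
            )
    return "\n".join(blocks)
```

Writing the result to a file under the Nagios configuration directory and reloading Nagios replaces the hand editing, so adding a node becomes a one-line change to the host list.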
20. Nagios
- Answers IS IT RUNNING?
- Text based Configuration
21. Cacti
- Answers HOW WELL IS IT RUNNING?
- Web Based configuration
  - php-cli tools
22. Monitoring Battle Plan Thus Far
- Start With the Basics
  - Ping, Disk !!!!!!Done!!!!!!
- Add Hadoop Specific Alarms
  - check_data_node
- Add JMX Graphing
  - NameNodeOperations
- Add JMX Based alarms
  - FilesTotal > 1,000,000 or LiveNodes < 50%
23. Add Hadoop Specific Alarms
- Hadoop Components with a Web Interface
  - NameNode: 50070
  - JobTracker: 50030
  - TaskTracker: 50060
  - DataNode: 50075
- check_http + regex = simple + effective
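The check_http + regex idea is: fetch the component's status page and confirm that an expected string appears in it. In practice the Nagios check_http plugin does this; the Python sketch below only illustrates the logic. The function names are hypothetical, and the fetch function is injectable so the logic can be exercised without a cluster.

```python
import re
import urllib.request

OK, CRITICAL = 0, 2


def fetch_url(url, timeout=10):
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_web_ui(host, port, path, pattern, fetch=fetch_url):
    """Return (nagios_state, message) for one web interface."""
    url = f"http://{host}:{port}{path}"
    try:
        body = fetch(url)
    except OSError as exc:  # connection refused, timeout, HTTP error status
        return CRITICAL, f"CRITICAL - {url}: {exc}"
    if re.search(pattern, body):
        return OK, f"OK - {url} matched {pattern!r}"
    return CRITICAL, f"CRITICAL - {url} did not match {pattern!r}"
```

For the NameNode this would be called as check_web_ui("hadoopname1", 50070, "/dfshealth.jsp", "NameNode"), which mirrors the check_http command line on the next slide.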
24. nagios_check_commands.cfg
- Component Failure
- (Future) Newer Hadoop will have XML status
define command {
    command_name  check_remote_namenode
    command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}
define service {
    service_description  check_remote_namenode
    use                  generic-service
    host_name            hadoopname1
    check_command        check_remote_namenode!50070
}
25. Monitoring Battle Plan
- Start With the Basics
  - Ping, Disk (Done)
- Add Hadoop Specific Alarms
  - check_data_node (Done)
- Add JMX Graphing
  - NameNodeOperations
- Add JMX Based alarms
  - FilesTotal > 1,000,000 or LiveNodes < 50%
26. JMX Graphing
- Enable JMX
- Import Templates
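"Enable JMX" usually means passing the standard JVM remote-management system properties to each Hadoop daemon through hadoop-env.sh. The sketch below just builds such a line. The port number is an arbitrary assumption, and turning off authentication and SSL is only reasonable on a trusted network; the talk does not say which settings were used.

```python
def jmx_opts(port, authenticate=False, ssl=False):
    """Build the standard JVM flags that expose JMX on a TCP port."""
    flags = [
        "-Dcom.sun.management.jmxremote",
        f"-Dcom.sun.management.jmxremote.port={port}",
        f"-Dcom.sun.management.jmxremote.authenticate={str(authenticate).lower()}",
        f"-Dcom.sun.management.jmxremote.ssl={str(ssl).lower()}",
    ]
    return " ".join(flags)


def hadoop_env_line(variable, port):
    """Render an export line for hadoop-env.sh that keeps existing options."""
    return f'export {variable}="{jmx_opts(port)} ${variable}"'


# Example: port 8004 is an assumed value, pick a free port per daemon.
print(hadoop_env_line("HADOOP_NAMENODE_OPTS", 8004))
```

Each daemon on the same host needs its own port, so the NameNode and SecondaryNameNode lines would differ only in the variable name and port.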
27. JMX Graphing
28. JMX Graphing
29. JMX Graphing
30.
31. Standard Java JMX
32. Monitoring Battle Plan Thus Far
- Start With the Basics !!!!!!Done!!!!!
  - Ping, Disk
- Add Hadoop Specific Alarms !Done!
  - check_data_node
- Add JMX Graphing !Done!
  - NameNodeOperations
- Add JMX Based alarms
  - FilesTotal > 1,000,000 or LiveNodes < 50%
33. Add JMX based Alarms
- hadoop-cacti-jtg is flexible
  - extend fetch classes
  - Don't call output()
  - Write your own check logic
34. Quick JMX Base Walkthrough
- url, user, pass, object specified from CLI
- wantedVariables, wantedOperations by inheritance
- fetch() and output() provided
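The "write your own check logic" step comes down to comparing fetched JMX values against thresholds and returning a Nagios state. hadoop-cacti-jtg itself is Java and its classes are not reproduced here; the Python sketch below only illustrates the alarm rule from the battle plan (FilesTotal > 1,000,000 or LiveNodes < 50%) and assumes the values have already been fetched.

```python
OK, CRITICAL = 0, 2


def namenode_alarm(files_total, live_nodes, total_nodes,
                   max_files=1_000_000, min_live_fraction=0.5):
    """Return (nagios_state, message) for the two NameNode alarm rules."""
    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal {files_total} > {max_files}")
    if total_nodes > 0 and live_nodes / total_nodes < min_live_fraction:
        problems.append(f"LiveNodes {live_nodes}/{total_nodes} below 50%"
                        if min_live_fraction == 0.5
                        else f"LiveNodes {live_nodes}/{total_nodes} too low")
    if problems:
        return CRITICAL, "CRITICAL - " + "; ".join(problems)
    return OK, f"OK - FilesTotal {files_total}, LiveNodes {live_nodes}/{total_nodes}"
```

A wrapper script would print the message and exit with the state, which is all Nagios needs from a plugin.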
35. Extend for NameNode
36. Extend for Nagios
37. Monitoring Battle Plan
- Start With the Basics !DONE!
  - Ping, Disk
- Add Hadoop Specific Alarms !DONE!
  - check_data_node
- Add JMX Graphing !DONE!
  - NameNodeOperations
- Add JMX Based alarms !DONE!
  - FilesTotal > 1,000,000 or LiveNodes < 50%
38. Review
- File System Growth
  - Size
  - Number of Files
  - Number of Blocks
  - Ratios
- Utilization
  - CPU/Memory
  - Disk
- Email (nightly)
  - FSCK
  - DFSADMIN
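The nightly email can be produced by a small cron job that runs "hadoop fsck /" and "hadoop dfsadmin -report" and mails the output. The sketch below builds the message; the subject line and addresses are illustrative, and it assumes the hadoop command is on the PATH of the user running the job.

```python
import subprocess
from email.message import EmailMessage

# The two reports named on the Review slide.
COMMANDS = {
    "fsck": ["hadoop", "fsck", "/"],
    "dfsadmin": ["hadoop", "dfsadmin", "-report"],
}


def run_command(argv):
    """Run a command and return its combined stdout and stderr as text."""
    result = subprocess.run(argv, capture_output=True, text=True)
    return result.stdout + result.stderr


def build_report(sender, recipient, runner=run_command):
    """Assemble one email containing the output of every command."""
    message = EmailMessage()
    message["Subject"] = "Nightly Hadoop review: fsck and dfsadmin"
    message["From"] = sender
    message["To"] = recipient
    sections = [f"== {name} ==\n{runner(argv)}" for name, argv in COMMANDS.items()]
    message.set_content("\n\n".join(sections))
    return message
```

Sending is one more step, for example smtplib.SMTP("localhost").send_message(build_report(sender, recipient)), assuming a local mail relay is available.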
39. The Future
- JMX Coming to JobTracker and TaskTracker (0.21)
  - Collect and Graph Jobs Running
  - Collect and Graph Map / Reduce per node
  - Profile Specific Jobs in Cacti?