Monitoring Systems and POWER5/6 LPARs with Ganglia Michael Perzl –...
-
Upload
junior-york -
Category
Documents
-
view
220 -
download
6
Transcript of Monitoring Systems and POWER5/6 LPARs with Ganglia Michael Perzl –...
2 Monitoring Systems and POWER5/6 LPARs with Ganglia
Agenda
Ganglia – what is it ? Ganglia components and data flow An introduction to RRDTool Ganglia metrics – what can be measured ? New POWER5/6 metrics (AIX & Linux) Extending Ganglia with gmetric Add device specific information to Ganglia Ganglia network communication Installation issues Where to get Ganglia for AIX and Linux on POWER ? Best practices Future additions / plans Discussion Links
4 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia – what is it ? (1/3)
Ganglia is an Open Source cluster performance monitoring tool and has been extended to include POWER5/6 features like shared processor LPARs, entitlement, physical CPU usage etc.
This session covers:– the technical details of Ganglia and the POWER5/6 extensions– how to set it up and use it to monitor all LPARs in a single machine and lots of
machines
5 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia – what is it ? (2/3)
Ganglia properties: scalable distributed monitoring system for high-performance computing
systems such as clusters and grids based on a hierarchical design targeted at federations of clusters relies on a multicast-based listen/announce protocol to monitor state within
clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state
leverages widely used technologies such as– XML for data representation– XDR (eXternal Data Representation) for compact, portable data transport– RRDtool for data storage and visualization
uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency
robust implementation Open Source, written in C
– Downloaded 110,000+ times, 145+ countries, 500+ clusters, 2000+ nodes
6 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia – what is it ? (3/3)
Ganglia properties (cont.): has been ported to an extensive set of operating systems and processor
architectures:– AIX– Darwin– FreeBSD– HP-UX– IRIX– Linux– OSF– NetBSD– Solaris– Windows (via Cygwin)
is currently in use on over 500+ clusters around the world has been used to link clusters across university campuses and around the
world and can scale to handle clusters with 2000+ nodes– check http://ganglia.info/ for more details
8 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia components
The ganglia system consists of: two unique daemons:
– Ganglia Monitoring Daemon (gmond)• monitoring daemon, collects the metrics• runs on each node
– Ganglia Meta Daemon (gmetad)• polls all gmond clients and stores the collected metrics in Round-Robin
Databases (RRDs)
a PHP-based web frontend a few other small utility programs
– gmetric• can be used to easily extend Ganglia with additional user-defined metrics
– gstat– gexec
9 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia – Schematic View
From: “Ganglia: Past, Present and Future” by Matt Massie:URL: http://ganglia.info/talks/lug_lbl_talk/
11 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia Monitoring Daemon (gmond)
Ganglia Monitoring Daemon (gmond) is a multi-threaded daemon which runs on each cluster node you want to monitor.
Installation is easy:– just the daemon and a configuration file (/etc/gmond.conf)
gmond has four main responsibilities:1. monitor changes in host state2. announce relevant changes3. listen to the state of all other ganglia nodes via a unicast or multicast channel4. answer requests for an XML description of the cluster state
Each gmond transmits information in two different ways:– unicasting or multicasting host state in external data representation (XDR) format
using UDP messages– sending XML over a TCP connection
12 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia Meta Daemon (gmetad) (1/2)
Ganglia Meta Daemon (gmetad) is a daemon which typically only runs on one specific cluster node – or on more when using a staged setup.
Installation is easy:– just the daemon and a configuration file (/etc/gmetad.conf)
Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters.
At each node in the tree a gmetad– periodically polls a collection of child data sources– parses the collected XML– saves all numeric volatile metrics to round-robin databases– exports the aggregated XML over a TCP socket to clients
13 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia Meta Daemon (gmetad) (2/2)
Data sources may be either– gmond daemons, representing specific clusters
or– other gmetad daemons, representing sets of clusters
Data sources use source IP addresses for access control– Multiple IP addresses can be specified for failover– The capability is natural for aggregating data from clusters since each gmond
daemon contains the entire state of its cluster
14 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia PHP web frontend (1/2)
Web frontend properties: provides a view of the gathered information via real-time dynamic web pages displays Ganglia data in a meaningful way for system administrators and users
– For example, one can view the CPU utilization over the past hour, day, week, month, or year
– The web frontend shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics
15 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia PHP web frontend (2/2)
Web frontend properties (cont.): depends on the existence of the gmetad which provides it with data from
several Ganglia sources opens the local port 8651 (by default) and expects to receive a Ganglia XML
tree the web pages themselves are highly dynamic; any change to the Ganglia data
appears immediately on the site– This behavior leads to a very responsive site, but requires that the full XML tree be
parsed on every page access– Therefore, the Ganglia web frontend should run on a fairly powerful, dedicated
machine if it presents a large amount of data
is written in the PHP scripting language and uses graphs generated by gmetad to display history information
has been tested on many flavors of Unix (primarily Linux) with the Apache web server and the PHP 4.1 module
16 Monitoring Systems and POWER5/6 LPARs with Ganglia
/etc/gmond.conf
gmond
One daemon per node/LPAR
Operating Systemperformance stats
API
Ganglia - data flow (1/4)
File accessNetworkWeb
17 Monitoring Systems and POWER5/6 LPARs with Ganglia
/etc/gmond.conf
gmond gmetad
rrdtooldatabaseof statistics
One daemon per node/LPAR Runs on web server
Browser
/etc/gmetad.conf
Operating Systemperformance stats
API
Ganglia - data flow (2/4)
File accessNetworkWeb
18 Monitoring Systems and POWER5/6 LPARs with Ganglia
/etc/gmond.conf
gmond gmetad
rrdtooldatabaseof statistics
Apache2+ PHP5
One daemon per node/LPAR Runs on web server
Browser
Ganglia FE scripts
/etc/gmetad.conf
Operating Systemperformance stats
API
Ganglia - data flow (3/4)
File accessNetworkWeb
19 Monitoring Systems and POWER5/6 LPARs with Ganglia
/etc/gmond.conf
gmond gmetad
rrdtooldatabaseof statistics
Apache2+ PHP5
gmetric
One daemon per node/LPAR Runs on web serverUser command
Browser
Ganglia FE scripts
/etc/gmetad.conf
Operating Systemperformance stats
API
Ganglia - data flow (4/4)
File accessNetworkWeb
20 Monitoring Systems and POWER5/6 LPARs with Ganglia
/etc/gmond.conf gmond gmetad
rrdtooldatabaseof statistics
Apache2+ PHP5
One daemon per node/LPAR Only one instance with the Web Server
Browser
PHP scripts
/etc/gmetad.conf
File accessNetwork
/etc/gmond.conf gmond
/etc/gmond.conf gmond
Web
Ganglia - data flow again
22 Monitoring Systems and POWER5/6 LPARs with Ganglia
RRDTool
Homepage: http://oss.oetiker.ch/rrdtool/ RRD is the Acronym for Round-Robin Database. RRD is a system to store and display time-series data (i.e., network bandwidth,
machine-room temperature, server load average). It stores the data in a very compact way that will not expand over time (fixed
size of DB), and it presents useful graphs by processing the data to enforce a certain data density.
It can be used either via simple wrapper scripts (from shell or Perl) or via frontends that poll network devices and put a friendly user interface on it.
RRDTool is the industry standard tool to store and display time-series data!
23 Monitoring Systems and POWER5/6 LPARs with Ganglia
RRDTool example graph
Graph taken from http://oss.oetiker.ch/rrdtool/gallery/index.en.html
Graph shows inbound and outbound call traffic going in and out of the switch via the 6 trunks connected to the Diamond exchange. Inbound traffic shown as positive and uses a lowest-free fill method. Outbound traffic shown as negative uses a distributed fill method. Tech details on RRDtrac.
24 Monitoring Systems and POWER5/6 LPARs with Ganglia
RRDTool example
# rrdtool create test.rrd \ --start 920804400 \ --step 300 \ DS:km:COUNTER:600:U:U \ RRA:AVERAGE:0.5:1:24
# rrdtool update test.rrd 920804700:12345 920805000:12357 920805300:12363# rrdtool update test.rrd 920805600:12363 920805900:12363 920806200:12373# rrdtool update test.rrd 920806500:12383 920806800:12393 920807100:12399# rrdtool update test.rrd 920807400:12405 920807700:12411 920808000:12415# rrdtool update test.rrd 920808300:12420 920808600:12422 920808900:12423
# rrdtool graph kilometer.png \ --start 920804400 \ --end 920808000 \ DEF:mykm=test.rrd:km:AVERAGE \ LINE2:mykm#FF0000
26 Monitoring Systems and POWER5/6 LPARs with Ganglia
Metrics
Definition of a metric: A metric is a certain observed property of the system.
Number of metrics: 34 standard metrics, i.e., available (i.e., defined) on all platforms Additional platform dependent metrics available
– Solaris• 8 additional metrics available
– HP-UX• 4 additional metrics available
– AIX• 18 additional new metrics available for POWER5/6 !!!• details later….
Remarks: One RRD database per Ganglia metric is used Database size is fixed (~ 12 kB per RRD database with default settings) Some standard metrics do not exist on all platforms, e.g., some metrics (coming from
Linux) don’t exist or don’t make sense on AIX
27 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia standard metrics (1/2)
boottime–system boot timestamp
bytes_in–number of network bytes received per second
bytes_out–number of network bytes sent out per second
cpu_aidle–percent of time since boot idle CPU–not defined on AIX, Linux yes
cpu_idle–percent CPU idle time
cpu_nice–percent CPU nice–not defined on AIX, Linux yes
cpu_num–number of CPUs
cpu_intr–number of interrupts (??)–not defined on AIX, Linux yes
cpu_sintr–number of system interrupts (??)–not defined on AIX, Linux yes
cpu_speed–speed of CPUs in MHz
cpu_system–percent CPU system
cpu_user–percent CPU user
cpu_wio–CPU time spent waiting for I/O
disk_free–total free disk space in GB
disk_total–total available disk space in GB
load_one–load average over 1 minute
load_five–load average over 5 minutes
load_fifteen–load average over 15 minutes
28 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia standard metrics (2/2)
machine_type–type of machine (e.g., POWER5)
mem_total–total available memory in kB
mem_free–amount of free memory in kB
mem_shared–amount of shared memory–not defined on AIX, Linux yes
mem_buffers–amount of memory used for buffers–not defined on AIX, Linux yes
mem_cached–amount of memory used for cache–AIX: numperm memory pages
mtu–MTU size reported in bytes
os_name–name of OS
os_release–OS release version (on AIX: level of fileset bos.mp)
part_max_used–most filled disk partition–not defined on AIX, Linux yes
pkts_in–number of network packets received
pkts_out–number of network packets sent out
proc_run–total number of running processes
proc_total–total number of processes
swap_free–free swap space in kB–AIX: paging space free
swap_total–total available swap space in kB–AIX: paging space
30 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia and POWER5/6
Current deficiences of Ganglia on POWER5/6: Ganglia does not understand Shared Processor LPAR statistics
– things like capped, weight, CPU entitlement etc.– these metrics can be added (see below)
How to fix these deficiences, i.e., add these new metrics ? Easy solution:
– Extend Ganglia with the utility program gmetric– Details in section “Extending Ganglia with gmetric” (later)
Preferred solution:– Add these new metrics to the gmond implementation on AIX and Linux on POWER– Requires significant patching of Ganglia source code– This has been completed and tested, i.e., ready to go !– Where to get it ?
• My personal web site: http://www.perzl.org/ganglia/
31 Monitoring Systems and POWER5/6 LPARs with Ganglia
Additional Ganglia POWER5/6 metrics (1/5)
Question: How are those additional metrics programmed and where do they get the
information from?
Answer: AIX:
– Only the APIs provided by libperfstat are used– As a consequence the fileset bos.perf.libperfstat must be installed
Linux:– Only entries in the /proc file system are used, e.g., /proc/cpuinfo, /proc/meminfo,
/proc/ppc64/lparcfg etc.– No additional fileset must be installed
32 Monitoring Systems and POWER5/6 LPARs with Ganglia
Additional Ganglia POWER5/6 metrics (2/5)
1) capped
2) cpu_entitlement
3) cpu_in_lpar
4) cpu_in_machine
5) cpu_in_pool
6) cpu_pool_idle
7) cpu_used
8) disk_read
9) disk_write
10) kernel64bit
11) lpar
12) lpar_name
13) lpar_num
14) oslevel
15) serial_num
16) smt
17) splpar
18) weight
List of 18 additional new metrics for POWER5 (AIX & Linux):
33 Monitoring Systems and POWER5/6 LPARs with Ganglia
Additional Ganglia POWER5 metrics (3/5)
1) capped– Type: String value– returns "yes" if the system is a POWER5 Shared Processor LPAR which is running in capped mode or "no"
otherwise
2) cpu_entitlement– Type: Float value– returns the Capacity Entitlement of the system in units of physical CPUs
3) cpu_in_lpar– Type: Integer value– returns the number of CPUs the OS sees in the system. In a POWER5 Shared Processor LPAR this returns the
number of virtual CPUs. When SMT is enabled this number is doubled.
4) cpu_in_machine– Type: Integer value– returns the number of physical CPUs in the whole system
5) cpu_in_pool– Type: Integer value– returns the number of physical CPUs in the Shared Processor Pool
6) cpu_pool_idle– Type: Float value– returns in fractional numbers of physical CPUs how much the Shared Processor Pool is idle
34 Monitoring Systems and POWER5/6 LPARs with Ganglia
Additional Ganglia POWER5 metrics (4/5)
7) cpu_used– Type: Float value– returns in fractional numbers of physical CPUs how much compute resources this shared processor has used
since the last time this metric was measured
8) disk_read– Type: Float value– returns in units of kB the total read I/O of the system
9) disk_write– Type: Float value– returns in units of kB the total write I/O of the system
10) kernel64bit– Type: String value– returns "yes" if the running kernel is a 64-bit kernel or "no" otherwise
11) lpar– Type: String value– returns "yes" if the system is a LPAR or "no" otherwise
12) lpar_name– Type: String value– returns the name of the LPAR as defined on the Hardware Management Console (HMC) or some reasonable
message otherwise
35 Monitoring Systems and POWER5/6 LPARs with Ganglia
Additional Ganglia POWER5 metrics (5/5)
13) lpar_num– Type: Integer value– returns the partition ID of the LPAR as defined on the Hardware Management Console (HMC) or some
reasonable message otherwise
14) oslevel– Type: String value– returns the version string as provided by the AIX command 'oslevel‘
15) serial_num– Type: String value– returns the serial number of the system as provided by the AIX command 'uname‘
16) smt– Type: String value– returns "yes" if SMT is enabled or "no" otherwise
17) splpar– Type: String value– returns "yes" if the system is running in a shared processor LPAR or "no" otherwise
18) weight– Type: Integer value– returns the weight of the LPAR running in uncapped mode
37 Monitoring Systems and POWER5/6 LPARs with Ganglia
Extending Ganglia
How can I easily add metrics to Ganglia ? Ganglia has a simple way to add metrics that should be monitored The utility program gmetric is used for that purpose These new metrics are then automatically added to the database and web
server data and graphs
New metric(periodically call
gmetric to providenew data)
38 Monitoring Systems and POWER5/6 LPARs with Ganglia
gmetric example – Machine firmware level
Example of a static metric: Machine firmware level
gmetric --name firmware \ --value `lsattr -El sys0 -a modelname -F value` \ --type "string"
Remarks: The above will only save the statistics once. The firmware level is unlikely to change without reboot, therefore it is sufficient
to run this command once.
39 Monitoring Systems and POWER5/6 LPARs with Ganglia
gmetric example – database transactions
Example of a variable metric: Transaction rate of your database
To add the number of transaction and assuming you have a script that will work this out called "transactions" that returns a number with a decimal point – you will have to write this script yourself !
gmetric --name tpm \ --value `/usr/local/bin/transactions` \ --type double
Remarks: This command will only save the statistics once. As the number of transactions per minute will definitely change, to get these
always up to date, it is recommended to run the command regularly, e.g., run once every 60 seconds via cron.
41 Monitoring Systems and POWER5/6 LPARs with Ganglia
General Remarks
Add device specific information toGanglia via gmetric for Network adapters Disks Disk adapters (SCSI + Fibre Channel)
Available in two variants: as a daemon (implemented in C) as a shell script
Network information: Daemons:
–g_aix_netif for AIX–g_linux_netif for Linux on POWER
Shell script:–ent_adapter.sh for AIX
Disk information: Daemons:
–g_aix_disk for AIX–g_linux_disk for Linux on POWER
Disk adapter information (AIX only) Daemons
–g_aix_adapter for AIX Shell script:
–fcs_adapter.sh for AIX
42 Monitoring Systems and POWER5/6 LPARs with Ganglia
Add device specific information on AIX and Linux (1/4)
Network adapter information: g_aix_netif or g_linux_netif
Utility daemon program periodically calls gmetric (interval configurable)
Monitored parameters per network interface:– Bytes received / second– Bytes transmitted / second– Packets received / second– Packets transmitted / second– MTU size (AIX only)
Example: g_aix_netif -s5 -b3 -p3 -m en1 Every 5 seconds get the number of bytes/sec and packets/sec transferred in
and out as well as the current MTU size for network interface en1.
43 Monitoring Systems and POWER5/6 LPARs with Ganglia
Add device specific information on AIX and Linux (2/4)
Disk information: g_aix_disk or g_linux_disk
Utility daemon program periodically calls gmetric (interval configurable)
Monitored parameters per disk:– Bytes read / second– Bytes written / second
Example: g_aix_disk -s10 -i -o hdisk3 Every 10 seconds get the number of bytes/sec transferred in and out for disk
hdisk3.
44 Monitoring Systems and POWER5/6 LPARs with Ganglia
Add device specific information on AIX and Linux (3/4)
Disk adapter information: g_aix_adapter (AIX only)
Utility daemon program periodically calls gmetric (interval configurable)
Monitored parameters per disk adapter:– Bytes read / second– Bytes written / second
Example: g_aix_adapter –s5 –i –o scsi0 Every 5 seconds get the number of bytes/sec transferred in and out for SCSI
adapter scsi0.
45 Monitoring Systems and POWER5/6 LPARs with Ganglia
Add device specific information on AIX and Linux (4/4)
./g_aix_netif: Version 1.0
g_aix_netif [OPTIONS] <network-interface>
[-?] or [-h] This help information
[-s seconds] The time between output (default is 60 seconds), seconds must be in the range [1..3600].
[-c loop_count] The number of loops (default = 20 million), loop_count must be in the range [10..20000000].
[-b 0|1|2|3] Show bytes received/sent (default = off) 0 = don't show, 1 = show incoming 2 = show outgoing, 3 = show both
[-p 0|1|2|3] Show packets received/sent (default = off) 0 = don't show, 1 = show incoming 2 = show outgoing, 3 = show both
[-m] Show MTU size for <network-interface> (default = off)
[-d] Debug mode, remain in foreground, output only to screen
Please note:If not in debug mode (-d) g_aix_netif runs the same command via a shelland assumes gmetric will be found in the PATH and gmond is already running.g_aix_netif will disconnect from your terminal to become a daemon.Use "ps -ef | grep g_aix_netif" to confirm it is running.
47 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia network communication
Ganglia by default uses Multicast Some network administrators might not like Multicast This can be changed to Unicast - requires changes to default config files
– Very simple changes to the gmond.conf files
Ganglia gmond network “chatter” The processes talk to each other quite a lot Not large in 100Mb or 1Gb or virtual networks terms Recommendation:
– Ganglia over admin network rather than user network if possible
48 Monitoring Systems and POWER5/6 LPARs with Ganglia
gmond: Multicast configuration example
gmond.conf:
/* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */globals { daemonize = yes setuid = yes on some Linux distros: no user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 3600 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */cluster { name = "System p5 Model 550" owner = "unspecified" latlong = "unspecified" url = "unspecified"}
gmond.conf continued:
.../* The host section describes attributes of the host, like the location */host { location = "System p5 Model 550"}
/* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */udp_send_channel { mcast_join = 239.2.11.71 port = 8649}
/* You can specify as many udp_recv_channels as you like as well. */udp_recv_channel { mcast_join = 239.2.11.71 port = 8649 bind = 239.2.11.71}
/* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */tcp_accept_channel { port = 8649}
49 Monitoring Systems and POWER5/6 LPARs with Ganglia
gmond: Unicast configuration example
gmond.conf:
/* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */globals { daemonize = yes setuid = yes on some Linux distros: no user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 3600 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */cluster { name = "System p5 Model 550" owner = "unspecified" latlong = "unspecified" url = "unspecified"}...
gmond.conf continued:
.../* The host section describes attributes of the host, like the location */host { location = "System p5 Model 550"}
/* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */udp_send_channel { host = p550-aix port = 8649}
/* You can specify as many udp_recv_channels as you like as well. */udp_recv_channel { port = 8649}
/* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */tcp_accept_channel { port = 8649}
50 Monitoring Systems and POWER5/6 LPARs with Ganglia
gmetad: configuration example
# Format:# data_source "my cluster" [polling interval] address1:port addreses2:port ...## The keyword 'data_source' must immediately be followed by a unique string which identifies the source,# then an optional polling interval in seconds. The source will be polled at this interval on average.# If the polling interval is omitted, 15sec is asssumed.## A list of machines which service the data source follows, in the format ip:port, or name:port.# If a port is not specified then 8649 (the default gmond port) is assumed.# default: There is no default value## data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651# data_source "another source" 1.3.4.7:8655 1.3.4.8data_source "Systen p5 Model 550" 15 p550-aix:8649 p550-nim:8649
# Round-Robin Archives# You can specify custom Round-Robin archives here (defaults are listed below)## RRAs "RRA:AVERAGE:0.5:1:240" "RRA:AVERAGE:0.5:24:240" "RRA:AVERAGE:0.5:168:240" \# "RRA:AVERAGE:0.5:672:240" "RRA:AVERAGE:0.5:5760:370"
# The name of this Grid. All the data sources above will be wrapped in a GRID tag with this name.# default: Unspecifiedgridname "Mycluster"
51 Monitoring Systems and POWER5/6 LPARs with Ganglia
Security issues
Ganglia setup can be tunneled through SSH if– certain security guidelines must be adhered to– security guidelines allow only SSH connections, nothing else
Detailed description of such a setup can be found at:– http://www.ibm.com/collaboration/wiki/display/LinuxP/Ganglia/
53 Monitoring Systems and POWER5/6 LPARs with Ganglia
Build requirements for Ganglia from scratch
RRDTool package dependencies–freetype2
–libart_lpgl
–libpng
–Perl
–zlib
Apache 2 package dependencies–httpd
• expat
• Perl
• zlib
–mod_ssl
• httpd
• openssl
PHP package dependencies–none
gmond package dependencies–none
gmetad packages dependencies–Apache 2
–PHP
–libxml2
–rrdtool
54 Monitoring Systems and POWER5/6 LPARs with Ganglia
Installation issues on POWER5/6
Linux: Installation on any recent Linux distribution is very easy!
– All Linux distributions contain the necessary RPM packages.
AIX: For AIX some (though not all) of the prerequisites can be fulfilled with RPM
packages from the AIX Toolbox for Linux Applications:– http://www.ibm.com/servers/aix/products/aixos/linux/
There was also a problem for people getting hold of Apache 2 on AIX with the latest PHP version, so Nigel Griffiths wrote a How-To to build these popular Open Source tools for AIX with GCC:
– http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
Compilation of Ganglia with the IBM C/C++ compilers is also easy and is done for the Ganglia binaries provided on my personal website:
– http://www.perzl.org/ganglia/
56 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia binaries and source code for POWER5/6 (1/2)
Where to get it ? My personal web site: http://www.perzl.org/ganglia/
What is available ? Binary and source RPMs available for:
– AIX v4.3.3 (gmond only)– AIX 5L v5.1 and v5.2– AIX 5L v5.3– Red Hat Enterprise Linux 4 and 5– SLES 9 and SLES 10– Fedora Core 4, 5, 6 and 7– openSUSE 10.0, 10.1, 10.2 and 10.3
Precompiled Apache 2 + PHP for Ganglia gmetad on AIX5L v5.1 and higher My enhanced Ganglia web interface Device specific information (network, disk, adapter) added to Ganglia via
gmetric
57 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia binaries and source code for POWER5/6 (2/2)
Required source code changes against version 3.0.5 of Ganglia to incorporate the new POWER5/6 metrics:
– configure – configure.in– gmetad/Makefile.in– gmond/gmond.c– lib/apr_net.c– lib/libgmond.c– lib/protocol.x – libmetrics/libmetrics.h – libmetrics/aix/metrics.c – libmetrics/linux/metrics.c – libmetrics/tests/test-metrics.c
58 Monitoring Systems and POWER5/6 LPARs with Ganglia
Download statistics of http://perzl.org/ganglia/ (1/2)
Data as of November 1st, 2007
Monthly access of http://www.perzl.org/ganglia/
5 525
92130
250 246
343
273
326
282 266 280256
457
0
50
100
150
200
250
300
350
400
450
500
08/20
06
09/20
06
10/20
06
11/20
06
12/20
06
01/20
07
02/20
07
03/20
07
04/20
07
05/20
07
06/20
07
07/20
07
08/20
07
09/20
07
10/20
07
Number of accesses per month
59 Monitoring Systems and POWER5/6 LPARs with Ganglia
Download statistics of http://perzl.org/ganglia/ (2/2)
Data as of November 1st, 2007
Monthly download numbers of binary packages
0
20
40
60
80
100
120
140
160
11/2006 12/2006 01/2007 02/2007 03/2007 04/2007 05/2007 06/2007 07/2007 08/2007 09/2007 10/2007
gmond RPM (all versions) gmetad RPM (all versions) Apache .tar.bz2 (AIX only)
61 Monitoring Systems and POWER5/6 LPARs with Ganglia
Some things to consider before you start…
A. Hostnames– To Ganglia a new hostname is a new machine– Has to resolve IP address so use DNS
B. IP addresses stable– Make sure you are not going to change IP addresses
C.Time and date– Make sure the timezone, time and date is consistent on all machines in a cluster– Use of NTP is recommended
So– These are normal on production machines– For prototype and test systems – get this right before starting Ganglia
Simple Ganglia How-To available for people setting up their first Ganglia system available at:– http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia
On all nodes
62 Monitoring Systems and POWER5/6 LPARs with Ganglia
/etc/gmond.conf gmond gmetad
rrdtool- database of stats
Apache2+ PHP5
Daemon one per node/LPAR Only one copy with the Web Server
Browser
PHP scripts
/etc/gmetad.conf
File accessNetwork
/etc/gmond.conf gmond
/etc/gmond.conf gmond
Web
1
2
3
1
1
Ganglia data flow and what goes where…
63 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices
Preferred setup Ganglia sampling intervals Ganglia default ports Shared Ethernet statistics Fibre Channel statistics Enhanced web interface
64 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Preferred Setup
Define each System p machine with all its LPARs as a separate cluster Use Unicast for network communication Define at least two LPARs per System p machine as gmond hosts for gmetad
– One would be sufficient, however, two is better for high availibility reasons
Define those two LPARs in /etc/gmetad.conf as the information brokers for that machine
From gmetad: Don’t poll the gmond hosts more frequently than every 15 secs Know upfront what time intervals to use for sampling (RRAs stanza in
/etc/gmetad.conf)– See next slides
Use my extensions for– Ethernet adapters– Fibre Channel adapters– Web interface
65 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Ganglia Sampling Intervals (1/6)
Important to know: The sampling interval is defined in /etc/gmetad.conf.
The "RRAs" stanza is used to defined individual settings. The sampling settings are global.
If no "RRAs" stanza is defined a default configuration is used. For historic reasons all values are specified in intervals of 15 seconds.
66 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Ganglia Sampling Intervals (2/6)
Example: Default settings in Ganglia
RRAs "RRA:AVERAGE:0.5:1:240" \ "RRA:AVERAGE:0.5:24:240" \ "RRA:AVERAGE:0.5:168:240" \ "RRA:AVERAGE:0.5:672:240" \ "RRA:AVERAGE:0.5:5760:370"
used for
Translation: display of
Take 240 samples at 1 × 15 seconds intervals hour Take 240 samples at 24 × 15 seconds (= 6 minutes) intervals day Take 240 samples at 168 × 15 seconds (= 42 minutes) intervals week Take 240 samples at 672 × 15 seconds (= 168 minutes) intervals month Take 370 samples at 5760 × 15 seconds (= 24 hours) intervals year
67 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Ganglia Sampling Intervals (3/6)
Example: 1-minute sampling for one year
RRAs "RRA:AVERAGE:0.5:4:525600"
Translation: Take 525600 samples at 4 × 15 seconds (= 1 minute) intervals 525600 = 60 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
68 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Ganglia Sampling Intervals (4/6)
Example: 1-minute sampling for 6 months, 5-minute sampling for 2 years
RRAs "RRA:AVERAGE:0.5:4:259200" \ "RRA:AVERAGE:0.5:20:210240"
Translation: Take 259200 samples at every 4 × 15 seconds (= 1 minute) intervals
– 259200 = 60 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals– 210240 = 12 (samples/hour) × 24 (hours) × 365 (days) × 2 (years)
69 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Ganglia Sampling Intervals (5/6)
Example: 15-second sampling for 1 day, 1-minute sampling for 2 months, 10-minute sampling for 1 year
RRAs "RRA:AVERAGE:0.5:1:5760" \ "RRA:AVERAGE:0.5:4:86400" \ "RRA:AVERAGE:0.5:40:52560"
Translation: Take 5760 samples at every 1 × 15 seconds intervals
– 5760 = 4 (samples/minute) 60 (samples/hour) × 24 (hours)
Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals– 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
Take 52560 samples at every 40 × 15 seconds (= 10 minutes) intervals– 52560 = 6 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
70 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Ganglia Sampling Intervals (6/6)
Example: 1-minute sampling for 2 months, 5-minute sampling for 6 months, 15-minute sampling for 3 years
RRAs "RRA:AVERAGE:0.5:4:86400" \ "RRA:AVERAGE:0.5:20:51840" \ "RRA:AVERAGE:0.5:60:105120"
Translation: Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
– 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals– 51840 = 12 (samples/hour) × 24 (hours) × 30 (days) × 6 (month)
Take 105120 samples at every 60 × 15 seconds (= 15 minutes) intervals– 105120 = 4 (samples/hour) × 24 (hours) × 365 (days) × 3 (years)
71 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Default Ports
Ganglia by default uses the following ports:
8649The port gmond uses for
– Sending to other gmonds via UDP (udp_send_channel in /etc/gmond.conf)– Receiving from other gmonds via UDP (udp_receive_channel in /etc/gmond.conf)– Sending an XML description of the state of the cluster (tcp_accept_channel in
/etc/gmond.conf)
8651The port gmetad will answer requests for XML.
8652The port gmetad will answer queries for XML.This facility allows simple subtree and summation views of the XML tree.
72 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Shared Ethernet Statistics
Question: How to monitor SEA statistics on the VIO server ? The AIX libperfstat library seems not to report any statistics about Ethernet
adapters if there are no interfaces defined on that adapter. Only seldom interfaces are defined on SEAs. The AIX command ‘entstat’ however provides these statistics.
Solution: Extension through gmetric via a shell script Korn shell script ‘ent_adapter.sh’ Get it from http://www.perzl.org/ganglia/devicespecific.html Graphs will appear immediately for that specific host
73 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Fibre Channel Statistics
Question: How to monitor Fibre Channel statistics on the VIO server ? The AIX libperfstat library seems not to report any statistics about Fibre
Channel adapters if there are no disks attached to the adapter. Tapes, for instance, would be left out. The AIX command ‘fcstat’ however provides these statistics.
Solution: Extension through gmetric via a shell script Korn shell script ‘fcs_adapter.sh’ Get it from http://www.perzl.org/ganglia/devicespecific.html Graphs will appear immediately for that specific host
74 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Enhanced Web Interface (1/2)
Get it from:http://www.perzl.org/ganglia/webinterface.html
Includes the following extensions:
Jscalendar patch of Timothy D. Witham for selecting custom intervals for graph display.
Custom Graph Patch of Alex Balk.
Zoomable graphs–Adapted from the UC Berkeley Grid live Demo website.
–They implemented some nice graph zooming, i.e., when you click onto a single node metric graph or on one of the Grid overview graphs you get a nicely scaled bigger version of that graph.
75 Monitoring Systems and POWER5/6 LPARs with Ganglia
Best Practices – Enhanced Web Interface (2/2)
Includes the following extensions (cont.)
cpu_used statistics for IBM POWER5/6 systems–A cpu_used overview graph for the cluster_view template which - when clicked onto - produces a detailed cpu_used statistics graph for the whole cluster
77 Monitoring Systems and POWER5/6 LPARs with Ganglia
Future additions / plans
Try to incorporate the POWER5/6 additions into the Ganglia mainstream source code
Adapt the AIX and Linux on POWER metrics to POWER6 Provide a custom web interface tailored specifically to POWER5/6 Future updates on my personal web site:
– Update RPMs to newest versions– Provide RPMs for Apache + PHP on my AIX Open Source web site (very soon)
79 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia advantages
Overview and major statistics available at one look with nice graphs Available on many platforms Widely used, many users and many different setups Global view (cluster/grid) and local view (single node) available Easily remote accessible through web interface Highly customizable through config files Shared Processor LPAR statistics available Monitored data is stored in Round-Robin Databases (RRDs), i.e., this information could
be easily passed on accounting too Fine granular statistics possible Different time interval views: hour, day, week, month, year or customizable Open Source (source code and binary RPMs for AIX and Linux on POWER) available Low risk and low cost Easily extendible
– through utility program gmetric
– through adapting the Ganglia source code (as shown in this presentation)
80 Monitoring Systems and POWER5/6 LPARs with Ganglia
Ganglia disadvantages
Not an official IBM tool No official support available Primarily a monitoring tool, not an accounting tool ‘‘Only‘‘ a monitoring (visualization) tool, no actions can be triggered AIX setup from scratch requires some work (building all prerequisite software),
although it is well documented– Normally not necessary by using my binary RPMs
82 Monitoring Systems and POWER5/6 LPARs with Ganglia
Links (1/2)
Main Ganglia website– http://ganglia.info/
Ganglia Documentation– http://ganglia.info/docs/
Ganglia Source Code Download– http://ganglia.sourceforge.net/downloads.php
Ganglia POWER5/6 extensions and ready-to-run binaries (RPM files) as well as source code
– http://www.perzl.org/ganglia/
My personal AIX Open Source repository– http://www.perzl.org/aix/
83 Monitoring Systems and POWER5/6 LPARs with Ganglia
Links (2/2)
Ganglia Usage at Wikipedia– http://ganglia.wikimedia.org/
RRDTool homepage– http://oss.oetiker.ch/rrdtool/
Ganglia How-To on IBM AIX wiki site (written by Nigel Griffiths)– http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia
Open Source with AIX on IBM AIX wiki site (written by Nigel Griffiths)– http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
IBM AIX wiki site:– http://www.ibm.com/collaboration/wiki/display/WikiPtype/Home
IBM Linux on POWER wiki site:– http://www.ibm.com/collaboration/wiki/display/LinuxP/Home
84 Monitoring Systems and POWER5/6 LPARs with Ganglia
Thank you for your attention !Thank you for your attention !
Questions ?
87 Monitoring Systems and POWER5/6 LPARs with Ganglia
1 Simple gmond install (1/2)
On all data nodes:
Install the gmond RPM file on each data node– rpm –Uvh ganglia-gmond-VVV.PPP.rpm– VVV.PPP is the version number and platform like:
• 3.0.5• aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64
Edit the configuration file– /etc/gmond.conf
On Linux on POWER– need access to /proc/ppc64/lparcfg = root user only– So also set “setuid = no” depending on Linux distribution
On all data nodes
cluster { name = "unspecified" owner = "unspecified" latlong = "unspecified" url = "unspecified" }
cluster { name = "mycluster" owner = "unspecified" latlong = "unspecified" url = "unspecified" }
Cluster name
88 Monitoring Systems and POWER5/6 LPARs with Ganglia
1 Simple gmond install (2/2)
Use the gmond control script located in /etc to start gmond:– /etc/init.d/gmond start
• Linux (SUSE and Red Hat)– /etc/rc.d/init.d/gmond start
• AIX
These scripts also automatically start gmond when booting the system
All options are: – start– stop– restart– status
Easy to automate the install:Just a couple of files +the gmond.conf is the same on all nodes/LPARs
On all data nodes
89 Monitoring Systems and POWER5/6 LPARs with Ganglia
2 Simple gmetad - prerequisites (1/3)
Gmetad needs rrdtool to store the data and Apache2 + PHP to serve the web pages
For Linux rrdtool, Apache2 and PHP5 is part of every Linux distribution
For AIX you can resolve the prerequisites as follows: – rrdtool provided at http://www.perzl.org/ganglia/ – libart_lgpl AIX Toolbox for Linux Applications– libpng AIX Toolbox for Linux Applications– freetype2 AIX Toolbox for Linux Applications– zlib AIX Toolbox for Linux Applications– Perl AIX Toolbox for Linux Applications
AIX Toolbox for Linux Applications– http://www.ibm.com/servers/aix/products/aixos/linux
90 Monitoring Systems and POWER5/6 LPARs with Ganglia
2 Simple gmetad install (2/3)
Install the gmetad RPM file on each node– rpm –Uvh ganglia-gmetad-VVV.PPP.rpm– VVV.PPP is the version number and platform like:
• 3.0.5 and• aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64
Edit the configuration file /etc/gmetad.conf– data_source “mycluster" localhost
Cluster nameLocal gmond supplies
Ganglia data
On the central node
91 Monitoring Systems and POWER5/6 LPARs with Ganglia
2 Start gmetad (3 of 3)
Use the gmetad control script located in /etc to start gmetad:– /etc/init.d/gmetad start
• Linux (SUSE and Red Hat)– /etc/rc.d/init.d/gmetad start
• AIX
These scripts also automatically start gmetad when booting the system
All options are: – start– stop– restart– status
On the central node
92 Monitoring Systems and POWER5/6 LPARs with Ganglia
3 Ganglia Web Server front end setup (1/2)
Could use existing web server but we need PHP support– On AIX – we have only used Apache2 and PHP5 (see next foil)– On Linux use version included with the distribution
(Red Hat EL 4 & 5, SUSE SLES 9 & SLES 10)
Simple test of PHP if it works:– Create <web-server-directory>/phptest.php– Use browser to access this file– Should print out lots of interesting data
rpm –Uvh ganglia-web-3.0.5-1.noarch.rpm– “noarch” as this is PHP scripts only– may have to use the “--ignoreos” flag for installation– may have to move these files to your web server directory tree
• depends on if you are running AIX or Linux
<h1>PHP Test</h1> <?PHP phpinfo() ?>
On the central node
93 Monitoring Systems and POWER5/6 LPARs with Ganglia
For AIX you will not be able to find Apache2 and PHP5 with the required features:
– AIX CD-ROM or update download site – nope– AIX repositories (Bull or UCLA) – nope – AIX Toolbox for Linux Applications – nope– WAS – nope– IBM HTTP Server (Apache with add-ons) – nope
Get Apache + PHP from my personal website instead!– http://www.perzl.org/ganglia/– http://www.perzl.org/aix/
If you have to recompile your own Apache + PHP– Using latest GNU GCC compiler and latest libraries - it is actually easy– Nigel Griffiths wrote the details of how to do this on the AIX 5L Wiki at
• http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
On the central node
3 Ganglia Web Server front end setup (AIX) (2/2)