Update on Farm Monitor and Control Domenico Galli, Bologna RTTC meeting Genève, 14 april 2004.
Update on Farm Monitor and Control
Domenico Galli, Bologna
RTTC meeting, Genève, 14 April 2004
Update on Farm Control. 2Domenico Galli
Outline
Test of Sub-Farm Monitor & Control software on Linux SLC3.
PVSS Boot Manager.
Changes in monitor PVSS Panels.
IPMI-DIM power manager.

Test of Sub-Farm Monitor & Control Software on Linux SLC3
Sub-farm monitor & control software (SFM 0.2) has been tested on Linux SLC3.
The lm_sensors package had to be recompiled and installed in order to monitor temperatures and fans.
The SFM 0.2 package works without recompiling.
No incompatibilities detected.

Boot Manager
A PVSS panel has been developed to configure the boot of the sub-farm nodes controlled by a control PC.
The panel allows the user to add, remove and configure the nodes of a sub-farm by specifying hostname, MAC address and IP address.
At present the panel writes a text file containing the configuration of the nodes.
The goal is to write the DHCP table for the control PC directly.
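
The node list handled by the panel (hostname, MAC address, IP address) maps naturally onto static DHCP host entries. The following is a minimal sketch of that last step, assuming the control PC runs the ISC DHCP server and uses its `host` declaration syntax. The slides do not say which DHCP server is used, and the `Node` type, the function name and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """One farm node as entered in the boot manager panel (hypothetical type)."""
    hostname: str
    mac: str
    ip: str


def dhcp_host_entries(nodes):
    """Render one ISC-dhcpd-style host declaration per node (assumed format)."""
    blocks = []
    for node in sorted(nodes, key=lambda n: n.hostname):
        blocks.append(
            f"host {node.hostname} {{\n"
            f"  hardware ethernet {node.mac.lower()};\n"
            f"  fixed-address {node.ip};\n"
            f"}}"
        )
    return "\n".join(blocks) + "\n"


# Made-up sample values, for illustration only.
SAMPLE_NODES = [
    Node("SFN-001-02", "00:11:22:33:44:02", "192.168.1.12"),
    Node("SFN-001-01", "00:11:22:33:44:01", "192.168.1.11"),
]

if __name__ == "__main__":
    print(dhcp_host_entries(SAMPLE_NODES), end="")
```

Sorting by hostname keeps the generated file stable when nodes are added or removed in a different order, which makes differences between two versions easier to read.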

PVSS Monitor Panels
A button has been added to all the monitor panels to configure the thresholds for the warning & error states of the state machine.
A PVSS script compares the monitored value with the threshold and, if it is exceeded, a state machine transition is triggered.
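
The comparison logic can be illustrated outside PVSS. This is a sketch in Python, not the actual PVSS script: the state names `OK`, `WARNING` and `ERROR`, the function names, and the assumption that a value is bad when it is too high (as for a temperature) are all illustrative.

```python
def monitor_state(value, warning_threshold, error_threshold):
    """Return the state for a monitored value that is bad when it is too high."""
    if warning_threshold > error_threshold:
        raise ValueError("warning threshold must not exceed error threshold")
    if value > error_threshold:
        return "ERROR"
    if value > warning_threshold:
        return "WARNING"
    return "OK"


def transition(previous_state, value, warning_threshold, error_threshold):
    """Return (new_state, changed); a transition is triggered only on a change."""
    new_state = monitor_state(value, warning_threshold, error_threshold)
    return new_state, new_state != previous_state


if __name__ == "__main__":
    # Hypothetical CPU temperature thresholds in degrees Celsius.
    state = "OK"
    for reading in (45.0, 65.0, 66.0, 80.0):
        state, changed = transition(state, reading, 60.0, 75.0)
        print(reading, state, "transition" if changed else "no change")
```

Returning the `changed` flag separately means a state machine transition is only raised when the state actually changes, not on every reading above a threshold.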

PVSS Monitor Panels (II)
If the button is pressed, a new panel is opened, in which an “expert user” can set the alarm thresholds.

IPMI (Intelligent Platform Management Interface)
What can IPMI be useful for?
  Switching the power supply of the farm nodes on and off without using expensive network-controlled power distributors.
  Monitoring the power status of the farm nodes (on/off).
  Monitoring temperatures, fan speeds, power supply voltages, etc. in an OS-independent way.
  Accessing the on-board event log.

IPMI Interfaces
KCS (Keyboard Controller Style) interface (AKA open interface):
  Local interface (interface to the host OS), unauthenticated.
  Can be accessed through the OpenIPMI Linux software.
  Can’t be used to switch on a PC or to power cycle a hung-up PC.
LAN interface:
  Network interface, session-based, authenticated.
  Designed to be always available (even when the system is powered down or when the OS is hung or inactive).
  Hardware implementation.
  OS independent.

IPMI LAN Interface
Server side (farm node):
  Hardware implementation.
  The NIC hardware redirects to the BMC the Ethernet frames containing datagrams destined to UDP port 623.
  Configured by means of the PC startup configuration utility.
  May use DHCP to set up network parameters.
  No need for additional software.
Client side (control PC):
  Client software, e.g. the IPMItool, freeIPMI and IPMIsh Linux software.
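[Diagram: the control PC (IPMI client) is connected over the LAN to the farm node. In the farm node, the network controller passes UDP port 623 traffic to the Baseboard Management Controller (BMC), separately from the other Ethernet frames.]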
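
To make the role of UDP port 623 concrete, the sketch below builds the RMCP/ASF "Presence Ping" datagram, the unauthenticated discovery message that a BMC answers with a "Presence Pong". The byte layout follows the RMCP description in the IPMI specification as understood here; it is not taken from the slides, and the function names and timeout value are hypothetical.

```python
import socket
import struct

RMCP_PORT = 623            # UDP port named on the slide
ASF_IANA = 4542            # IANA enterprise number of the ASF
ASF_PRESENCE_PING = 0x80   # ASF message type: Presence Ping
ASF_PRESENCE_PONG = 0x40   # ASF message type: Presence Pong


def build_presence_ping(tag=0):
    """Build the 12-byte RMCP/ASF Presence Ping datagram."""
    # RMCP header: version 1.0, reserved, sequence 0xFF (no ACK), class ASF.
    rmcp_header = struct.pack("BBBB", 0x06, 0x00, 0xFF, 0x06)
    # ASF message: enterprise number, type, tag, reserved, data length 0.
    asf_message = struct.pack(">IBBBB", ASF_IANA, ASF_PRESENCE_PING, tag, 0x00, 0x00)
    return rmcp_header + asf_message


def bmc_answers(host, timeout=2.0):
    """Return True if something at host:623 replies with a Presence Pong."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_presence_ping(), (host, RMCP_PORT))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable hosts
            return False
    return len(reply) >= 9 and reply[3] == 0x06 and reply[8] == ASF_PRESENCE_PONG
```

Because the NIC forwards these datagrams to the BMC in hardware, a check of this kind would be expected to work even when the node's OS is down, provided the BMC has standby power and a configured network address.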

IPMI Power Commands
on: power up the chassis.
off: power down the chassis (without a clean shutdown of the OS).
cycle: power down, wait 1 second, and power up again.
soft_off: initiate a soft shutdown of the OS via ACPI by emulating a fatal over-temperature condition.
hard_reset: pulse the system reset signal.
pulse_diag: pulse a version of a diagnostic interrupt that goes directly to the processor(s). This is typically used to cause the operating system to do a diagnostic dump (OS dependent).
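
For experimenting with these commands from a control PC, one option is the `ipmitool` command-line client over the LAN interface. The sketch below only builds the argument list. The mapping from the command names above to `ipmitool chassis power` subcommands is an assumption, the helper name is hypothetical, and `-E` makes `ipmitool` read the password from the `IPMI_PASSWORD` environment variable.

```python
# Slide command name -> "ipmitool chassis power" subcommand (assumed mapping).
POWER_SUBCOMMANDS = {
    "on": "on",
    "off": "off",
    "cycle": "cycle",
    "soft_off": "soft",
    "hard_reset": "reset",
    "pulse_diag": "diag",
}


def ipmitool_power_argv(host, user, command):
    """Build the ipmitool argument list for one power command on one node."""
    if command not in POWER_SUBCOMMANDS:
        raise ValueError(f"unknown power command: {command!r}")
    return [
        "ipmitool", "-I", "lan", "-H", host, "-U", user, "-E",
        "chassis", "power", POWER_SUBCOMMANDS[command],
    ]


if __name__ == "__main__":
    # Hypothetical node name and user; pass the list to subprocess.run to execute.
    print(" ".join(ipmitool_power_argv("SFN-001-01", "admin", "soft_off")))
```

Building a list rather than a shell string avoids quoting problems if the list is later handed to `subprocess.run`.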

DIM-IPMI Power Manager
A Power Manager (based on IPMI and DIM) to switch the power to the farm nodes on and off is under development.
Each control PC runs a DIM server interfaced to IPMI and publishes, for each node, a command and a service.
[Diagram: a PVSS GUI (through a PVSS-DIM client) and a command-line client talk via DIM to the IPMI-DIM server on the control PC, which talks via IPMI to the BMCs of the farm nodes SFN-001-01 to SFN-001-05.]
DIM Services:
  /SFN-001-01/power_status
  /SFN-001-02/power_status
  /SFN-001-03/power_status
DIM Commands:
  /SFN-001-01/power_switch on|off|soft_off|cycle
  /SFN-001-02/power_switch on|off|soft_off|cycle
  /SFN-001-03/power_switch on|off|soft_off|cycle
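
The naming scheme of the published services and commands can be sketched as follows. The code only manipulates names and arguments and makes no DIM calls; the function names are hypothetical.

```python
ALLOWED_SWITCH_ARGS = ("on", "off", "soft_off", "cycle")


def service_name(node):
    """Name of the DIM service publishing the power status of a node."""
    return f"/{node}/power_status"


def command_name(node):
    """Name of the DIM command used to switch the power of a node."""
    return f"/{node}/power_switch"


def published_names(nodes):
    """Map each node to its (service, command) pair."""
    return {node: (service_name(node), command_name(node)) for node in nodes}


def parse_switch_request(name, argument):
    """Validate a power_switch request and return (node, argument)."""
    prefix, _, leaf = name.rpartition("/")
    if not prefix.startswith("/") or leaf != "power_switch":
        raise ValueError(f"not a power_switch command: {name!r}")
    if argument not in ALLOWED_SWITCH_ARGS:
        raise ValueError(f"unsupported argument: {argument!r}")
    return prefix[1:], argument


if __name__ == "__main__":
    nodes = [f"SFN-001-{i:02d}" for i in range(1, 6)]
    for service, command in published_names(nodes).values():
        print(service, command)
```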

Status of DIM-IPMI Power Manager
We started by using IPMItool’s libintf_lan.so library.
Problems:
  An IPMI response takes at least 0.7 s.
  In case of a disconnected node, the timeout takes about 16 s.
  A complete sequential cycle over 200 nodes, to update the farm power status, therefore takes between 140 s (200 x 0.7 s, all nodes answering) and 3200 s (200 x 16 s, all nodes timing out).
Solution:
  Use one thread for each node to be contacted, in order to parallelize IPMI connections.
But:
  The libintf_lan.so library is not thread-safe (global variables, timeouts using signals + longjmp, etc.).
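
The arithmetic behind the 140-3200 s range, and the one-thread-per-node idea, can be sketched as follows. This is an illustration rather than the power manager's code: `query` stands for any thread-safe function that asks one node for its power status, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sequential_scan_bounds(n_nodes, response_s=0.7, timeout_s=16.0):
    """Best and worst case duration (s) of a one-node-at-a-time scan."""
    return n_nodes * response_s, n_nodes * timeout_s


def poll_all(nodes, query):
    """Call query(node) for every node, using one worker thread per node."""
    nodes = list(nodes)
    if not nodes:
        return {}
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # pool.map preserves input order, so zip pairs each node with its result.
        return dict(zip(nodes, pool.map(query, nodes)))


if __name__ == "__main__":
    best, worst = sequential_scan_bounds(200)
    print(f"sequential scan of 200 nodes: {best:.0f} s to {worst:.0f} s")
```

With one thread per node, the duration of a full scan is bounded by the slowest single node (about 16 s for a disconnected one) instead of the sum over all nodes. This only holds if the underlying query is safe to call from several threads at once.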

Status of DIM-IPMI Power Manager (II)
DONE:
  IPMItool’s libintf_lan.so has been deeply hacked in order to make it “more” thread-safe (no more global variables, no more signals and longjmps for timeouts).
  A power manager DIM server and a command-line DIM client are ready and working (tested on a Dell PowerEdge SC 1425 without OS).
TODO:
  Conflicts between commands and the status monitor on the same node must be arbitrated by the DIM-IPMI server (if the NIC BMC is processing a command, it is not able to receive other commands).
  Add a mutex to the library to protect non-thread-safe system/library calls (e.g. malloc, free, etc.).
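
One possible way to arbitrate between commands and status monitoring on the same node is a lock per node, so that a BMC is only ever sent one request at a time while different nodes are still handled in parallel. The class below is a hypothetical sketch of that idea in Python, not the planned server implementation.

```python
import threading


class NodeArbiter:
    """Serialize operations per node (hypothetical helper)."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the dictionary of locks
        self._locks = {}                # node name -> lock for that node

    def _lock_for(self, node):
        with self._guard:
            if node not in self._locks:
                self._locks[node] = threading.Lock()
            return self._locks[node]

    def run(self, node, operation):
        """Run operation() while holding the lock of the given node."""
        with self._lock_for(node):
            return operation()


if __name__ == "__main__":
    arbiter = NodeArbiter()
    # A status poll and a power command for the same node would both go
    # through arbiter.run("SFN-001-01", ...), so they cannot overlap.
    print(arbiter.run("SFN-001-01", lambda: "status: on"))
```

A status poll that finds the node busy simply waits for the command to finish. An alternative design would skip the poll and retry on the next monitoring cycle.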

Power Manager Command-Line Client
pwSwitch [-m hostname] on|off|(cycle|soft_off)
[Screenshot: terminal session annotated with "command time out" and "service time out". N.B.: node lhcbcn2 is disconnected!]

Power Manager PVSS Client
Work in progress.
Basically one PVSS panel showing:
  A list of the controlled nodes with their power status (on, off).
  Buttons for power on / off / soft_off / cycle / power_reset / pulse_diag.