Status of Farm Monitor and Control CERN, February 24, 2005 Gianluca Peco, INFN Bologna.
Summary
Already done: SubFarm Monitor and Control architecture
New features: Process Controller, PVSS SFM improvements
For the Real Time Trigger Challenge:
  Higher priority: Archiving, Boot Manager, IPMI
  Lower priority: fwTrending, FSM integration
To be done later: MS Windows SFM software porting, Oracle & PVSS
SubFarmMonitor Architecture
[Figure: block diagram. Labels: SubFarm Node (Task Manager, Logger, Monitor), repeated for each node; Control PC (TTY Client; PVSS Integration with Logger, Monitor, Process Control, Task Manager); DIM Communication Layer (Services; Command and Services).]
New Item - Process Controller
The Process Controller is a program running on the Control PC that supervises the processes running on all farm nodes and restarts them immediately if they die.
It reads from an XML file (in the future, from the Configuration DB) the list of processes to start on each node and their execution mode (arguments, environment, user, scheduler, priority, re-spawn parameters, etc.).
It works by contacting the Light Server Task Managers running on every node through DIM commands and services.
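The XML process list described above could look roughly like the following. This is only a sketch: the slides do not show the file format, so the element and attribute names, the node name and the values (`farm`, `node`, `process`, `arg`, `env`, `ctrlpc`, ...) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout of the process-list file; the real schema is not given.
EXAMPLE_XML = """
<farm>
  <node name="SFNode_01">
    <process name="monitor" user="online" scheduler="SCHED_OTHER" priority="0"
             maxStartN="5" checkPeriod="60" disPeriod="300">
      <arg>-v</arg>
      <env name="DIM_DNS_NODE">ctrlpc</env>
    </process>
  </node>
</farm>
"""


def load_process_list(xml_text):
    """Return {node name: [process description dict, ...]}."""
    root = ET.fromstring(xml_text.strip())
    farm = {}
    for node in root.findall("node"):
        processes = []
        for proc in node.findall("process"):
            processes.append({
                "name": proc.get("name"),
                "user": proc.get("user"),
                "scheduler": proc.get("scheduler"),
                "priority": int(proc.get("priority", "0")),
                # Re-spawn parameters; -1 is used here as the default.
                "maxStartN": int(proc.get("maxStartN", "-1")),
                "checkPeriod": int(proc.get("checkPeriod", "-1")),
                "disPeriod": int(proc.get("disPeriod", "-1")),
                "args": [a.text for a in proc.findall("arg")],
                "env": {e.get("name"): e.text for e in proc.findall("env")},
            })
        farm[node.get("name")] = processes
    return farm
```

Calling `load_process_list(EXAMPLE_XML)` gives one entry per node, each holding the per-process execution mode that the Process Controller would pass on to that node's Task Manager.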
Process restart is triggered by process death:
  The Task Manager handles the SIGCHLD signals from its child processes.
  On a SIGCHLD signal, the Task Manager updates the “DIM list service” (within about 0.1 s).
  The update of the “DIM list service” triggers the reaction of the Process Controller, which then schedules the process restart (within about 0.1 s).
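The chain above (child death, list update, scheduled restart) can be sketched as a small in-memory simulation. No real DIM calls or signals are used: the subscription callback stands in for the “DIM list service”, `on_child_death` stands in for the SIGCHLD handler, and all class and method names are hypothetical.

```python
from collections import deque


class TaskManager:
    """Stand-in for a node's Task Manager; publishes its process list."""

    def __init__(self, node):
        self.node = node
        self.running = set()
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def _publish_list(self):
        # Plays the role of updating the "DIM list service".
        for callback in self._subscribers:
            callback(self.node, frozenset(self.running))

    def start(self, name):
        self.running.add(name)
        self._publish_list()

    def on_child_death(self, name):
        # What the SIGCHLD handler would do after reaping the child.
        self.running.discard(name)
        self._publish_list()


class ProcessController:
    """Stand-in for the Process Controller on the Control PC."""

    def __init__(self, wanted):
        self.wanted = wanted          # node -> set of process names to keep alive
        self.managers = {}
        self.scheduled = deque()      # restarts waiting to be executed
        self.restarts = []            # log of restarts actually performed

    def attach(self, manager):
        self.managers[manager.node] = manager
        manager.subscribe(self.on_list_update)

    def on_list_update(self, node, running):
        # A list update triggers scheduling of every missing process.
        for name in sorted(self.wanted.get(node, set()) - running):
            if (node, name) not in self.scheduled:
                self.scheduled.append((node, name))

    def run_scheduled(self):
        while self.scheduled:
            node, name = self.scheduled.popleft()
            manager = self.managers[node]
            if name not in manager.running:
                self.restarts.append((node, name))
                manager.start(name)
```

For example, after starting `monitor` and `logger` on a `TaskManager("SFNode_01")` and attaching it to a controller that wants both, calling `on_child_death("logger")` schedules one restart, and `run_scheduled()` brings the process back.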
SubFarm Monitor - Process Controller
[Figure: block diagram. Labels: Control PC (PVSS Integration with Logger, Monitor, Process Control, Task Manager, Process List DP; SFController; XML File; C light Publisher); SFarm_n with SFNode_01, SFNode_02, SFNode_03, ... SFNode_n (Task Manager, Logger, Monitor); DIM Communication Layer.]
New Item - Process Controller (II)
A re-spawn control is also implemented:
  If a process is restarted more than maxStartN times in checkPeriod seconds, process restart is disabled for disPeriod seconds.
  If maxStartN=-1, the re-spawn control is switched off, i.e. the process can be restarted indefinitely.
  If disPeriod=-1, process restart, once disabled, is never re-enabled (one-time re-spawn).
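The re-spawn rule above can be sketched as follows. Only the three parameter names come from the slide; the class, its method and the injected `now` timestamp are assumptions made to keep the example self-contained and testable, and the handling of exact boundary times is one possible reading of the rule.

```python
from collections import deque


class RespawnControl:
    """Decides whether a dead process may be restarted at time `now` (seconds)."""

    def __init__(self, max_start_n, check_period, dis_period):
        self.max_start_n = max_start_n
        self.check_period = check_period
        self.dis_period = dis_period
        self._starts = deque()      # timestamps of recent restarts
        self._disabled_at = None    # time at which restart was disabled

    def may_restart(self, now):
        if self.max_start_n == -1:
            return True             # re-spawn control switched off
        if self._disabled_at is not None:
            if self.dis_period == -1:
                return False        # never re-enabled
            if now - self._disabled_at < self.dis_period:
                return False        # still inside the disabled period
            self._disabled_at = None
            self._starts.clear()
        # Forget restarts that fell out of the checkPeriod window.
        while self._starts and now - self._starts[0] >= self.check_period:
            self._starts.popleft()
        if len(self._starts) >= self.max_start_n:
            # This would be restart number maxStartN + 1 inside the window.
            self._disabled_at = now
            return False
        self._starts.append(now)
        return True
```

With `RespawnControl(2, 10, 30)`, two restarts within 10 s are allowed, the third is refused, and restarts are accepted again 30 s after the refusal.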
New Items - PVSS Improvements
HELP
  Online help is available.
  The meaning of every item can be found by right-clicking the PVSS panel objects.
Alarm DP
  A PVSS script determines the status of each monitor sensor (DU) used by the FSM.
  The parametrization is currently hard-coded inside the CTRL script.
  This will be updated soon to use configuration DPs.
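The move from hard-coded parameters to configuration data can be illustrated with a short sketch. It is written in Python rather than PVSS CTRL, and the sensor names, threshold values and status labels are all hypothetical; the slides do not list them.

```python
# Hypothetical thresholds. In the slides these are hard-coded in the CTRL
# script and are planned to move into configuration DPs.
THRESHOLDS = {
    "cpu_load": {"warning": 0.80, "error": 0.95},
    "temperature_c": {"warning": 60.0, "error": 75.0},
}


def sensor_status(sensor, value, thresholds=THRESHOLDS):
    """Classify one sensor reading against its configured limits."""
    limits = thresholds[sensor]
    if value >= limits["error"]:
        return "ERROR"
    if value >= limits["warning"]:
        return "WARNING"
    return "OK"
```

Because the limits are passed in as data, changing a threshold means editing the configuration rather than the script, which is the point of the planned update.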
To be done for the Real Time Trigger Challenge: HIGH PRIORITY
PVSS Archiving (under development)
  Implement the PVSS archive using the native RAIMA DB.
  Static selection of the data to be archived: selecting data to archive from the UI is not allowed.
Boot Manager (under investigation)
  Implement, inside the PVSS user interface, a mechanism to set the boot configuration of the nodes (DHCP, HOST, static ARP, route, etc.) using the Configuration DB and/or a graphical UI.
  Plug & Play system for adding and removing nodes.
IPMI (under investigation)
  Provides the methods needed to change the electrical power state without OS interaction, by sending messages over the LAN directly to the hardware.
  Power down, soft reboot, power up, etc.
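As an illustration of power control over the LAN, the sketch below builds a command line for the `ipmitool` client. The slides do not say which IPMI client is being evaluated, so the use of `ipmitool` is an assumption, as are the host and user names; the command is only constructed, not executed.

```python
# Power actions accepted by "ipmitool chassis power".
POWER_ACTIONS = ("status", "on", "off", "cycle", "reset", "soft")


def ipmi_power_command(host, user, action):
    """Build the argv list for an IPMI-over-LAN chassis power request.

    "-I lanplus" selects the IPMI v2.0 LAN interface; "-E" makes ipmitool
    read the password from the IPMI_PASSWORD environment variable.
    """
    if action not in POWER_ACTIONS:
        raise ValueError("unsupported power action: %r" % (action,))
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "chassis", "power", action]
```

The resulting list can be handed to `subprocess.run` on a machine that has `ipmitool` installed and network access to the node's BMC.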
IPMI v2.0 Architecture
[Figure: block diagram, from "IPMI Architecture and Initiative Update". Labels: Baseboard; System Bus; System Interface; Baseboard Mgmt. Controller (BMC); SDR, SEL, FRU NV Store; Sensors & control circuitry; I2C/SMBus; Mgmt Netwk Ctrlr ("side-band") with LAN, PCI, SMBus/PCI Mgmt. Bus; RS-232, MODEM/Serial; Remote Mgmt. Card; Aux. IPMB; IPMB (I2C); Bridge Controller, ICMB; Chassis with Satellite Mgmt. Controller, FRU SEEPROM, sensors & control circuitry; IPMI Messages.]
IPMI in modular architecture
[Figure: Typical Modular Application. Labels: chassis; compute node A (BMC, Sys I/F, BP I/F); compute node B (BMC, Sys I/F, BP I/F); i/o node (Satellite Controller, BP I/F); mgmt module (Mgmt. Module Processor, BP I/F); Satellite Controller with PS, FAN, temp; Backplane Mgmt Interconnect; IPMI Messages; LAN; Remote Mgmt Console System; CIM to IPMI.]
To be done for the Real Time Trigger Challenge: LOW PRIORITY
fwTrending integration in SFM (under development)
  Probably easy to implement using the framework features and the powerful graphical trending tool on archived DP elements (historical data). Excel export for further analysis.
FSM integration in SFM (just started)
  Using the DP alarm structure already implemented, we can trigger FSM alarms and the related commands for the node (start, stop, reboot, etc.).
To be done later
MS Windows SFM porting
  One possibility under investigation is to use Windows Management Instrumentation (WMI) and the related API (.NET platform).
  The idea is to recompile the Linux Monitor Sensor code on top of a low-level layer that takes its information from the WMI structures.
  The Task Manager and the Process Controller should be more difficult: totally new code with different signal handling.
We are interested in working on the Oracle DB for the LHCb needs.
  In particular, we are taking care of the Oracle & PVSS integration.
END
Process Controller (Backup)
If more than one process dies in a short time interval, more than one process-list update is scheduled.
If the Process Controller takes longer to start a new process than the time between two updates, it receives more than one update listing the missing process and therefore restarts the process more than once.
The problem can be solved by implementing a coalescence mechanism and by disabling list updating during process restart (this is achieved by means of a mutex which arbitrates between the update thread and the start-process thread).
Is it possible to implement these mechanisms in PVSS?
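Leaving the PVSS question aside, the coalescence-plus-mutex idea described above can be sketched in Python. The class and method names are hypothetical, and a `threading.Lock` plays the role of the mutex that arbitrates between the update thread and the start-process thread.

```python
import threading


class CoalescingController:
    """Merges overlapping list updates so each missing process starts once."""

    def __init__(self, start_process):
        self._start = start_process     # callable that actually starts a process
        self._lock = threading.Lock()   # mutex: update thread vs. start thread
        self._pending = set()           # coalesced set of missing processes

    def on_list_update(self, missing):
        # Called by the update thread. It blocks while a restart is in
        # progress, so list updating is effectively disabled during restarts.
        with self._lock:
            self._pending.update(missing)   # duplicates collapse in the set

    def restart_pending(self):
        # Called by the start-process thread.
        with self._lock:
            batch = sorted(self._pending)
            self._pending.clear()
            for name in batch:
                self._start(name)
        return batch
```

If three updates arrive listing `["a"]`, `["a", "b"]` and `["b"]` before the start thread runs, `restart_pending()` starts `a` and `b` exactly once each. In a real system, an update that was blocked during the restart should be re-read after the restart so that it reflects the new process list.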