Status of Farm Monitor and Control CERN, February 24, 2005 Gianluca Peco, INFN Bologna.

Page 1

Status of Farm Monitor and Control

CERN, February 24, 2005

Gianluca Peco, INFN Bologna

Page 2

Summary

Already done: SubFarm Monitor and Control architecture

New features: Process Controller, PVSS SFM improvement

For the Real Time Trigger Challenge:

  Higher priority: Archiving, Boot Manager, IPMI

  Lower priority: fwTrending, FSM integration

To be done later: MS Windows SFM software porting, Oracle & PVSS

Page 3

SubFarmMonitor Architecture

[Diagram. Labels: SubFarm Node (Task Manager, Logger, Monitor), repeated for several nodes; Control PC (TTY Client; PVSS Integration with Logger, Monitor, Process Control, Task Manager); DIM Communication Layer (Services; Command and Services).]

Page 4

New Item - Process Controller

It is a program executed on the Control PC that supervises the processes running on all farm nodes and restarts them (immediately) if they die.

It reads from an XML file (in the future, from the Configuration DB) the list of processes to start on each node and their execution mode (arguments, environment, user, scheduler, priority, re-spawn parameters, etc.).

It works by contacting the Light Server Task Managers running on every node through DIM commands and services.
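As an illustration of the kind of per-node process list described above, here is a minimal Python sketch that parses such a file. The slides do not show the real file format, so the element names, attribute names, paths and node names below are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the real SFM XML schema is not given in the slides.
EXAMPLE_XML = """
<subfarm name="SFarm_01">
  <node name="SFNode_01">
    <process path="/opt/sfm/bin/sensor" args="-v" user="online"
             scheduler="SCHED_OTHER" priority="0"
             maxStartN="5" checkPeriod="60" disPeriod="300"/>
  </node>
  <node name="SFNode_02">
    <process path="/opt/sfm/bin/logger" args="" user="online"
             scheduler="SCHED_OTHER" priority="0"
             maxStartN="-1" checkPeriod="60" disPeriod="300"/>
  </node>
</subfarm>
"""


def load_process_list(xml_text):
    """Return {node name: [process description, ...]} from the XML text."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for node in root.findall("node"):
        processes = []
        for proc in node.findall("process"):
            processes.append({
                "path": proc.get("path"),
                "args": proc.get("args", "").split(),
                "user": proc.get("user"),
                "scheduler": proc.get("scheduler"),
                "priority": int(proc.get("priority", "0")),
                # Re-spawn parameters named as on the re-spawn control slide.
                "maxStartN": int(proc.get("maxStartN", "-1")),
                "checkPeriod": int(proc.get("checkPeriod", "0")),
                "disPeriod": int(proc.get("disPeriod", "0")),
            })
        result[node.get("name")] = processes
    return result


config = load_process_list(EXAMPLE_XML)
```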

Process restart is triggered by process death:

The Task Manager handles the SIGCHLD signals from its child processes.

On a SIGCHLD signal, the Task Manager updates the “DIM list service” (within about 0.1 s).

The update of the “DIM list service” triggers the reaction of the Process Controller, which then schedules the process restart (within about 0.1 s).
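The chain "child dies, list is updated, controller restarts" can be sketched in Python as follows. This is only a single-machine model under stated assumptions: a plain callback stands in for the DIM list service, an explicit `reap()` call stands in for the SIGCHLD handler, and the class and method names are invented for the example.

```python
import subprocess
import sys


class TaskManager:
    """Stand-in for the per-node Task Manager (names are hypothetical)."""

    def __init__(self, on_list_update):
        self.children = {}                    # process name -> Popen object
        self.on_list_update = on_list_update  # stands in for the DIM list service

    def start(self, name, argv):
        self.children[name] = subprocess.Popen(argv)

    def reap(self):
        """What a SIGCHLD handler would do: drop dead children, publish the list."""
        dead = [n for n, p in self.children.items() if p.poll() is not None]
        for name in dead:
            del self.children[name]
        if dead:
            self.on_list_update(sorted(self.children))


class ProcessController:
    """Stand-in for the Process Controller on the Control PC."""

    def __init__(self, wanted):
        self.wanted = wanted                  # process name -> argv
        self.task_manager = TaskManager(self.handle_list_update)
        self.restarts = []                    # names restarted so far

    def start_all(self):
        for name, argv in self.wanted.items():
            self.task_manager.start(name, argv)

    def handle_list_update(self, running):
        # Any wanted process absent from the published list is restarted.
        for name, argv in self.wanted.items():
            if name not in running:
                self.restarts.append(name)
                self.task_manager.start(name, argv)


controller = ProcessController({"short": [sys.executable, "-c", "pass"]})
controller.start_all()
controller.task_manager.children["short"].wait()  # the child exits at once
controller.task_manager.reap()                    # death detected, list published
controller.task_manager.children["short"].wait()  # clean up the restarted child
```

In the real system the two classes run on different machines and the timing figures quoted on the slide apply to the DIM update and to the scheduling of the restart; none of that is modelled here.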

Page 5

SubFarm Monitor – Process Controller

[Diagram. Labels: Control PC; PVSS Integration (Logger, Monitor, Process Control, Task Manager, Process List DP); SFController; XML File; C light Publisher; DIM; DIM Communication Layer; SFarm_n; nodes SFNode_01, SFNode_02, SFNode_03, SFNode_n (Logger, Task Manager, Monitor).]

Page 6

New Item - Process Controller (II)

A re-spawn control is also implemented:

If a process is restarted more than maxStartN times in checkPeriod seconds, process restart is disabled for disPeriod seconds.

If maxStartN=-1, the re-spawn control is switched off, i.e. the process can be restarted indefinitely.

If disPeriod=-1, process restart, once disabled, is never re-enabled (one-time re-spawn).
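The re-spawn rules above can be written down as a small Python sketch. The class name and method are hypothetical; the three parameters follow the slide. One interpretation is assumed: the restart that would exceed maxStartN within checkPeriod is refused, and that refusal is the moment at which restarts become disabled.

```python
class RespawnControl:
    """Decides whether a dead process may be restarted (illustrative only)."""

    def __init__(self, max_start_n, check_period, dis_period):
        self.max_start_n = max_start_n    # -1: no limit
        self.check_period = check_period  # window length in seconds
        self.dis_period = dis_period      # -1: never re-enable
        self.starts = []                  # times of restarts inside the window
        self.disabled_at = None           # time at which restarts were disabled

    def may_restart(self, now):
        if self.max_start_n == -1:
            return True                   # re-spawn control switched off
        if self.disabled_at is not None:
            if self.dis_period == -1:
                return False              # disabled for good
            if now - self.disabled_at < self.dis_period:
                return False              # still inside the disable period
            self.disabled_at = None       # disable period over: start afresh
            self.starts = []
        # Forget restarts that fell out of the checkPeriod window.
        self.starts = [t for t in self.starts if now - t < self.check_period]
        if len(self.starts) >= self.max_start_n:
            self.disabled_at = now
            return False
        self.starts.append(now)
        return True


control = RespawnControl(max_start_n=2, check_period=10, dis_period=30)
decisions = [control.may_restart(t) for t in (0, 1, 2, 20, 32)]
```

With these values the first two restarts are allowed, the third (at t=2) disables restarting, the request at t=20 is still refused, and the one at t=32 is allowed again.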

Page 7

New Items - PVSS Improvements

HELP

Online help is available.

The meaning of every item can be looked up by right-clicking the objects in the PVSS panel.

Alarm DP

A PVSS script determines the status of each monitor sensor (DU) used by the FSM.

The parametrization is currently hard-coded inside the CTRL script.

This will be updated soon to use configuration DPs.
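To illustrate the difference between hard-coded parametrization and a configuration table, here is a minimal Python sketch (not PVSS CTRL code). The sensor names, threshold values and status names are all hypothetical; the slides do not list them.

```python
OK, WARNING, ERROR = "OK", "WARNING", "ERROR"

# Hypothetical configuration: sensor name -> (warning level, error level).
# In the planned PVSS version this table would live in configuration DPs.
THRESHOLDS = {
    "cpu_load": (0.80, 0.95),
    "disk_usage": (0.85, 0.95),
}


def sensor_status(sensor, value, thresholds=THRESHOLDS):
    """Map a sensor reading to a status using the configured thresholds."""
    warning_level, error_level = thresholds[sensor]
    if value >= error_level:
        return ERROR
    if value >= warning_level:
        return WARNING
    return OK


status = sensor_status("cpu_load", 0.90)
```

Keeping the thresholds in a table passed to the function means they can be changed without editing the script itself.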

Page 8

To be done for the Real Time Trigger Challenge: HIGH PRIORITY

PVSS Archiving (under development)

Implement PVSS archiving using the native RAIMA DB.

Static selection of the data to be archived: selecting data to archive from the UI is not allowed.

Boot Manager (under investigation)

Implement a system inside the PVSS user interface with a mechanism to set the node boot configuration (DHCP, HOST, static ARP, routes, etc.) using the Configuration DB and/or a graphical UI.

Plug & Play system for adding and removing nodes.

IPMI (under investigation)

Investigated as the method needed to change the electrical power state without OS interaction: messages are sent over the LAN directly to the hardware.

Power down, soft reboot, power up, etc.
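As an illustration of power control over the LAN, the Python sketch below builds command lines for the open-source `ipmitool` utility. The slides do not say which IPMI client is foreseen, so the use of `ipmitool` is an assumption, and the host name, user name and password file are hypothetical. The commands are only built, not executed. Note that `ipmitool` offers `soft` (a soft shutdown request) and `reset` (a hard reset) rather than a single "soft reboot" action.

```python
# Action names on the left are hypothetical; values are ipmitool
# "chassis power" subcommands.
IPMI_POWER_ACTIONS = {
    "status": "status",
    "power_up": "on",
    "power_down": "off",
    "soft_shutdown": "soft",
    "hard_reset": "reset",
    "power_cycle": "cycle",
}


def ipmi_power_command(host, user, password_file, action):
    """Build the argv for an ipmitool call that talks to a node's BMC over LAN."""
    return [
        "ipmitool",
        "-I", "lanplus",        # IPMI v2.0 LAN interface
        "-H", host,             # address of the node's management controller
        "-U", user,
        "-f", password_file,    # file holding the remote password
        "chassis", "power", IPMI_POWER_ACTIONS[action],
    ]


command = ipmi_power_command("sfnode01-bmc", "admin", "/etc/ipmi.pass", "power_down")
```

The resulting list could be passed to `subprocess.run` on a machine where `ipmitool` is installed and the node's management controller is reachable.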

Page 9

IPMI v2.0 Architecture

[Diagram, from "IPMI Architecture and Initiative Update". Labels: Baseboard; System Bus; System Interface; Baseboard Mgmt. Controller (BMC); sensors & control circuitry; NV Store (SDR, SEL, FRU); Mgmt Netwk Ctrlr; LAN; PCI; SMBus/PCI Mgmt. Bus; Remote Mgmt. Card; RS-232; MODEM/Serial; "side-band" I2C/SMBus; IPMB (I2C); Aux. IPMB; Bridge Controller; ICMB; Chassis; Satellite Mgmt. Controller; FRU SEEPROM; IPMI Messages.]

Page 10

IPMI in modular architecture

[Diagram: Typical Modular Application. Labels: chassis; compute node A; compute node B; i/o node; BMC; Sys I/F; BP I/F; Satellite Controller; PS; FAN; temp; mgmt module; Mgmt. Module Processor; Backplane Mgmt Interconnect; IPMI Messages; LAN; Remote Mgmt Console System; CIM to IPMI.]

Page 11

To be done for the Real Time Trigger Challenge: LOW PRIORITY

fwTrending integration in SFM (under development)

Probably easy to implement using the framework features and its powerful graphical trending tool on archived DP elements (historical data). Excel export for further analysis.

FSM integration in SFM (just started)

Using the DP alarm structure already implemented, we can trigger FSM alarms and the related commands for the node (start, stop, reboot, etc.).

Page 12

To be done later

MS Windows SFM porting

One possibility under investigation is to use Windows Management Instrumentation (WMI) and the related API (.NET platform).

The idea is to recompile the Linux Monitor Sensor code on top of a low-level layer that takes its information from the WMI structures.

Porting the Task Manager and the Process Controller should be more difficult: totally new code with different signal handling.

We are interested in working on the Oracle DB for the LHCb needs.

In particular, we are taking care of the Oracle & PVSS integration.

Page 13

END

Page 14

Process Controller (Backup)

If more than one process dies in a short time interval, more than one process-list update is scheduled.

If the Process Controller takes longer to start a new process than the interval between two updates, it receives more than one update in which the process is missing and therefore restarts the process more than once.

The problem can be solved by implementing a coalescence mechanism and by disabling list updating during process restart (this is achieved by means of a mutex that arbitrates between the update thread and the start-process thread).
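A minimal Python sketch of the coalescence-plus-mutex idea is given below. It is not the actual Process Controller code: the class and method names are hypothetical, and the extra `starting` set (which remembers processes already restarted but not yet seen running) is an addition made for this example to guard against stale updates.

```python
import threading


class CoalescingRestarter:
    """Restart each missing process once, however many list updates arrive."""

    def __init__(self, wanted, start_process):
        self.wanted = set(wanted)
        self.start_process = start_process  # callable taking a process name
        self.lock = threading.Lock()        # update thread vs. start thread
        self.latest = None                  # newest list; older ones are dropped
        self.starting = set()               # restarted, not yet seen running

    def on_list_update(self, running):
        """Called by the update thread for every published process list."""
        with self.lock:                     # blocks while a restart is under way
            running = set(running)
            self.starting -= running        # these restarts are now confirmed
            self.latest = running           # coalescence: keep only the newest

    def restart_missing(self):
        """Called by the start-process thread; returns the names restarted."""
        with self.lock:                     # no list update during the restart
            if self.latest is None:
                return []
            missing = sorted(self.wanted - self.latest - self.starting)
            for name in missing:
                self.start_process(name)
                self.starting.add(name)
            self.latest = None              # this list has been consumed
            return missing


started = []
restarter = CoalescingRestarter(["a", "b"], started.append)
restarter.on_list_update(["b"])   # process "a" died
restarter.on_list_update(["b"])   # a second update, "a" still missing
restarter.restart_missing()       # "a" is started once
restarter.on_list_update(["b"])   # stale update arriving after the restart
restarter.restart_missing()       # nothing is started again
```

After this sequence `started` contains `"a"` exactly once, although three updates reported it as missing.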

Is it possible to implement these mechanisms in PVSS?