ATLAS DAQ/HLT Infrastructure
H.P. Beck, M. Dobson, Y. Ermoline, D. Francis, M. Joos, B. Martin, G. Unel, F. Wickens.
Acknowledgements:
H. Burckhart, W. Iwanski, K. Korcyl, M. Leahu
11th Workshop on Electronics for LHC and future Experiments, 12-16 September 2005, Heidelberg, Germany
2/20
ATLAS HLT and DAQ physical layout
SDX1
USA15
Rack-mounted PCs and network switches ~2500 components in ~120 racks
Read-Out Subsystems underground 152 ROSes (max 12 ROLs per ROS)
HLT and EB processor farms on the surface
ROS – Read-Out System; SV – Supervisor; DFM – Data Flow Manager
3/20
HLT and DAQ operational infrastructure components
Housing of the HLT/DAQ system – a 2-floor counting room in the SDX1 building
  - Metallic structure (electronics is heavy nowadays)
  - Lighting, ventilation, mixed-water piping and cable trays
  - Power supply (from the mains network and UPS)
Housing and operation of rack-mounted PCs and network switches
  - Rack selection and component mounting
  - Power distribution and heat removal
  - Power and network cabling
Operational infrastructure monitoring
  - Standard ATLAS DCS tools for rack parameter monitoring
  - Linux and IPMI tools for internal monitoring of individual PCs
Computing farm installation, operation and maintenance
  - Rack configuration and cabling databases
  - System administration
4/20
Counting room in SDX1 for DAQ/HLT system
Design
Construction
Size of the barrack constrained by the crane, the shaft to USA15 and the existing walls of SDX
Houses 100 racks of 600 mm x 1000 mm, up to 52U (floor-to-ceiling height ~2600 mm)
Metallic structure initially designed for 500 kg/rack, re-designed for 800 kg/rack
Air-conditioning removes ~10% of the dissipated heat; the remaining 90% is removed by water cooling
Lighting and ventilation
Water piping
Cable trays
Temporary power supply
5/20
Housing of equipment
Investigated 2 options:
  - Modified “standard” 52U ATLAS rack for the ROSes in USA15
    - positions already defined
    - weight not an issue in USA15
    - uniform with the other racks in USA15
  - Industry-standard server rack (e.g. RITTAL TS8) for the PCs in SDX1
    - bayed racks with partition panels for fire protection
    - height and weight limits: lighter racks, more flexible mounting
Mounting of equipment:
  - Front-mounted PCs on supplied telescopic rails
  - Rear/front-mounted switches (on support angles if heavy)
  - All cabling at the rear, inside the rack
RITTAL TS8 47/52U rack
“Standard” ATLAS 52U rack
6/20
Cooling of equipment
Common horizontal air-flow cooling solution for ATLAS and RITTAL racks
  - Outcome of the joint “LHC PC Rack Cooling Project”
  - Requirement: ~10 kW per rack
A water-cooled heat exchanger fixed to the rear door of the rack
  - CIAT cooler mounted on the rear door (+150 mm), 1800 x 300 x 150 mm³
  - Cooling capacity: 9.5 kW
  - Water: in 15 °C, out 19.1 °C; Air: in 33 °C, out 21.5 °C
  - 3 fan rotation sensors
“Standard” ATLAS 52U rack
RITTAL 47U rack
Water in/out
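As a rough consistency check on the cooler figures above, the required water flow follows from Q = m·c_p·ΔT, taking c_p ≈ 4.186 kJ/(kg·K) for water (the flow rate itself is not quoted on the slide, so this is an inferred estimate):

```python
# Estimate the water flow needed to carry the rack heat load:
# Q = m_dot * c_p * dT  =>  m_dot = Q / (c_p * dT)
Q_kw = 9.5            # cooler capacity from the slide, kW
c_p = 4.186           # specific heat of water, kJ/(kg K)
dT = 19.1 - 15.0      # water outlet minus inlet temperature, K
m_dot = Q_kw / (c_p * dT)     # mass flow, kg/s
litres_per_min = m_dot * 60   # 1 kg of water ~ 1 litre
print(f"{m_dot:.2f} kg/s ~ {litres_per_min:.0f} l/min per rack")
```

So each rack needs on the order of 33 l/min of cooling water to carry away ~9.5 kW at a 4.1 K water temperature rise.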
7/20
Powering of equipment in SDX1 (prototype rack study)
Powering problems and (as simple as possible) solutions:
  - High inrush current – D-curve breakers and sequential powering
  - Harmonic content – reinforced neutral conductor with breaker
Ventilation
Auxiliary
Power to equipment
Power Control
11 kVA of apparent power delivered per rack on 3 phases, 16 A each
  - ~10 kW of real power available (and dissipated), with a typical PFC of 0.9
3 individually controllable breakers with D-curve for the high inrush current
  - First level of sequencing
3 sequential power distribution units inside the rack (e.g. SEDAMU from PGEP)
  - Second level of sequencing
  - 4 groups of 4(5) outlets, max 5 A per group; power-on of the groups is separated by 200 ms
Investigated possibilities for individual power control via IPMI or Ethernet-controlled power units
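The staggered power-on of outlet groups can be sketched as below. This is only an illustration of the sequencing idea, not the actual SEDAMU hardware behaviour; the function names and callback interface are made up:

```python
import time

def sequence_power_on(groups, delay_s=0.2, switch_on=print, sleep=time.sleep):
    """Power on outlet groups one after another, delay_s apart, so that
    the PC inrush currents (tens of amps each) never coincide.
    Illustrative only: the real distribution units sequence in hardware."""
    for i, group in enumerate(groups):
        for outlet in group:
            switch_on((i, outlet))   # close the relay for this outlet
        if i < len(groups) - 1:
            sleep(delay_s)           # 200 ms between successive groups

# Example: 4 groups of 4 outlets, as in the rack distribution units
groups = [[f"outlet-{g}-{o}" for o in range(4)] for g in range(4)]
events = []
sequence_power_on(groups, switch_on=events.append, sleep=lambda s: None)
```

With a 35 A inrush peak per PC, sequencing 4 outlets at a time keeps the peak well within what the 16 A D-curve breakers tolerate, instead of ~40 PCs starting simultaneously.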
8/20
Powering of equipment – proposed final implementation (1)
Transformer (in front of SDX1) powers the switchboard
Power from the switchboard is delivered to 6 distributing cupboards on 2 levels
Each cupboard controls and powers 1 row of racks (up to 18 racks)
Two 3-phase cables run under the false floor from the cupboard to each rack
Switchboard
Distributing cupboards
Distribution cupboard:
  - 400 A breaker on the input
  - 1 Twido PLC reads the ON/OFF and Electrical Fault status of all breakers in the row
  - 16 A breakers on the output, one breaker per power cable
  - Each breaker is individually turned ON by DCS via the PLC
9/20
Powering of equipment – proposed final implementation (2)
3U distribution box at the bottom of each rack (front side) provides the distribution inside the rack
  - 2 manual 3-phase switches on the front panel to cut power on the 2 power lines
  - Input and output cables on the rear panel
  - Distribution from two 3-phase power lines to 6 single-phase lines
Flat-cable distribution on the back side of the rack
  - 6 cables from the distribution box to 6 flat cables
  - 6-7 connectors on each flat cable
PLC
2 x 16 A D-type 3-phase breakers
2 x 3-phase manual switches
Rack side
Installation is designed to sustain a 35 A (peak) inrush current per PC, for ~35-40 PCs per rack
10/20
Monitoring of operational infrastructure
SDX1 TDAQ computing room environment is monitored by the CERN infrastructure services (electricity, C&V, safety) and by the ATLAS central DCS (room temperature, etc.)
Two complementary paths to monitor TDAQ rack parameters:
  - by the available standard ATLAS DCS tools (sensors, ELMB, etc.)
  - by the PC itself (e.g. lm_sensors) or by farm management tools (e.g. IPMI)
What is monitored by DCS inside the racks:
  - Air temperature – 3 sensors
  - Inlet water temperature – 1 sensor
  - Relative humidity – 1 sensor
  - Cooler’s fan operation – 3 sensors
What is NOT monitored by DCS inside the racks:
  - Status of the rear door (open/closed)
  - Water leak/condensation inside the cooler
  - Smoke detection inside the rack
Quartz TF 25 NTC 10 kOhm
Precon HS-2000V RH
11/20
Standard ATLAS DCS tools for sensor readout
The SCADA system (PVSS II) with OPC client/server
The PC (Local Control Station) with a Kvaser PCIcan-Q card (4 ports)
CAN power crate (16 branches of 32 ELMBs) and CANbus cable
The ELMB, motherboards and sensor adapters
ELMB
Kvaser
Motherboard
CAN power crate
12/20
Sensor locations and connection to the ELMB
Sensor locations on the rear door of the TDAQ rack
All sensor signals (temperature, rotation, humidity) and power lines are routed to a connector on the rear door to simplify assembly
Flat cables connect these signals to 1 of 4 ELMB motherboard connectors
  - 3 connectors receive the signals from 3 racks, 1 spare connector for upgrades
  - 1 ELMB may thus serve 3 racks
CANbus cables
To next rack To PSU
13/20
Use of the PCs’ internal monitoring
Most PCs now come with a hardware monitoring chip (e.g. LM78): onboard voltages, fan status, CPU/chassis temperatures, etc.
A program running on every TDAQ PC may use the lm_sensors package to access these parameters and send the information to the DCS PC using DIM
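As an illustration of this path, a minimal sketch that parses the kind of “label: value” lines printed by the sensors utility; the exact output format varies by chip, the sample text below is invented, and the DIM forwarding step is omitted:

```python
import re

def parse_sensors(text):
    """Extract 'label: value ...' readings from lm_sensors-style output
    into a dict of floats. (Illustrative: real output varies by chip.)"""
    readings = {}
    for line in text.splitlines():
        m = re.match(r"^([\w .]+):\s*\+?(-?\d+(?:\.\d+)?)", line.strip())
        if m:
            readings[m.group(1).strip()] = float(m.group(2))
    return readings

# Made-up sample resembling `sensors` output
sample = """\
CPU Temp:    +41.0 C  (high = +70.0 C)
Sys Temp:    +33.0 C
fan1:        2700 RPM
Vcore:       +1.32 V
"""
print(parse_sensors(sample))
```

A daemon on each node could run such a parse periodically and publish the resulting dict to the DCS PC.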
IPMI (Intelligent Platform Management Interface) specification
  - a platform management standard by Intel, Dell, HP and NEC
  - a standard methodology for accessing and controlling bare-metal hardware, even without software installed or running
  - based on a specialized micro-controller, the Baseboard Management Controller (BMC) – available even if the system is powered down and no OS is loaded
Supermicro IPMI Card
http://it-dep-fio-ds.web.cern.ch/it-dep-fio-ds/presentations.asp
14/20
Preparation for equipment installation
Design drawing
Rack numbering (Y.03-04.D1)
Rack content
6 Pre-Series racks
15/20
Computing farm – Pre-Series installation
Pre-Series components at CERN (a few % of the final size)
  - A fully functional, small-scale version of the complete HLT/DAQ
  - Installed in the SDX1 lower level and in USA15
Effort for the physical installation: highlight and solve procedural problems before we get involved in the much larger scale installation of the full implementation
Will grow in time – 2006: +14 racks, 2007: +36 racks…
16/20
Cabling of equipment
All cables are defined and labeled:
  - 608 individual optical fibers from the ROSes to patch panels
  - 12 bundles of 60 fibers from the USA15 patch panels to the SDX1 patch panels
  - Individual cables from the patch panels to the central switches and then to the PCs
Cable labeling is updated after installation
Cable installation:
  - tries to minimize cabling between racks, to keep cabling tidy and to conform to minimum bend radii
  - no cable arms are used; instead, a unit is uncabled before removal
17/20
Computing farm – system management
System management of HLT/DAQ has been considered by the SysAdmin Task Force; topics addressed:
  - Users / authentication
  - Networking in general
  - Booting / OS / images
  - Software / file systems
  - Farm monitoring
  - How to switch nodes on/off
Remote access & reset with IPMI
  - IPMI daughter card for the PC motherboard – experience with v1.5, which allows access via LAN: to reset, to turn off & on, to log in as from the console, and to monitor all sensors (fan, temperature, voltages, etc.)
Cold-start procedure tested recently for the Pre-Series:
  - IPMI used to power down, boot and monitor the machines
  - scripts for IPMI operations (booting all EF nodes, etc.) are being written
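Such scripts can be built on ipmitool’s chassis power commands over the v1.5 LAN interface; a minimal sketch, with hostnames and credentials as placeholders (not the actual Point 1 names):

```python
import subprocess

def ipmi_power(host, action, user="admin", password="secret", dry_run=True):
    """Drive a node's BMC over the LAN with ipmitool.
    action: 'status', 'on', 'off' or 'cycle'.
    Host and credentials here are placeholders."""
    cmd = ["ipmitool", "-I", "lan", "-H", host,
           "-U", user, "-P", password, "chassis", "power", action]
    if dry_run:   # just return what would be executed
        return cmd
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout

# e.g. power-cycle a few Event Filter nodes (hypothetical hostnames)
for node in [f"pc-tdq-ef-{n:03d}" for n in (1, 2, 3)]:
    print(" ".join(ipmi_power(node, "cycle")))
```

Looping such a call over a node list from the rack-content database gives exactly the cold-start behaviour described above.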
18/20
Computing farm – Point 1 system architecture
Central Server 1
Central Server 2
Local Server 1
Local Server 2
Local Server n
Client 1
Client n
Client 1
Client n
Client 1
Client n
Service path
Sync path
Sync paths / Alternative sync paths
ATCN
CPN
Gateway (login necessary)
CERN IT
Bypass
Users access
CTN
Clients are netbooted.
CERN Public Network
ATLAS Technical and Control Network
CERN Technical Network
19/20
Computing farm – file servers/clients infrastructure
Gateway & firewall services to the CERN Public Network; ssh access
Tree structure of servers/clients: Central File Server – 6 Local File Servers – ~70 clients
  - all files come from a single (later a mirrored pair of) Central File Server
All clients are net-booted and configured from the Local File Servers (PCs & SBCs)
  - a maximum of ~30 clients per boot server, to allow scaling up to 2500 nodes
A top-down configuration (machine specific / detector specific / function specific / node specific) is provided
  - when modified, the ATLAS software is (push) synced from the top down to each node with a disk
  - the software installation mechanism and responsibilities are being discussed with Offline and TDAQ
Node logs are collected on the local servers for post-mortem analysis if needed
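The top-down configuration layering can be sketched as a simple ordered merge, where more specific layers override more generic ones; all layer names, keys and values below are made-up illustrations, not the actual TDAQ configuration keys:

```python
def merged_config(*layers):
    """Merge configuration layers top-down: later (more specific) layers
    override earlier (more generic) ones, as in the
    machine / detector / function / node-specific hierarchy."""
    config = {}
    for layer in layers:
        config.update(layer)   # later layer wins on key collisions
    return config

# Hypothetical layers, from most generic to most specific
machine  = {"ntp": "ntp.cern.ch", "boot": "pxe", "log_dir": "/var/log/tdaq"}
function = {"role": "event-filter", "boot": "pxe-ef"}   # overrides 'boot'
node     = {"hostname": "pc-tdq-ef-042"}

cfg = merged_config(machine, function, node)
print(cfg)
```

The Local File Servers would serve each client the already-merged result for its position in the tree, so a node only ever sees one flat configuration.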
20/20
Computing farm – Nagios farm management
All farm-management-related functions are unified under a generic management tool
Nagios (on the LFS and clients): a single tool to view the overall status, issue commands, etc.
  - uses IPMI where available, otherwise ssh and lm_sensors
  - mail notification for alarms (e.g. temperature)
  - DCS tools will be integrated
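A Nagios check of this kind follows the standard plugin convention: exit code 0/1/2 for OK/WARNING/CRITICAL plus one status line. A minimal sketch, with illustrative thresholds rather than the actual ATLAS alarm limits:

```python
# Nagios plugin convention: exit 0 = OK, 1 = WARNING, 2 = CRITICAL
OK, WARNING, CRITICAL = 0, 1, 2

def check_temperature(temp_c, warn=40.0, crit=50.0):
    """Return a Nagios-style (exit_code, status_line) for one reading.
    Thresholds here are illustrative, not the ATLAS values."""
    if temp_c >= crit:
        return CRITICAL, f"TEMP CRITICAL - {temp_c:.1f} C"
    if temp_c >= warn:
        return WARNING, f"TEMP WARNING - {temp_c:.1f} C"
    return OK, f"TEMP OK - {temp_c:.1f} C"

code, line = check_temperature(33.0)  # e.g. rack air temperature via lm_sensors/IPMI
print(line)                           # a real plugin would also sys.exit(code)
```

The same wrapper shape works whether the reading comes from IPMI or from ssh + lm_sensors, which is what lets Nagios present one uniform view of the farm.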
21/20
Conclusions:
Good progress made in developing the final infrastructure for the DAQ/HLT system:
  - power and cooling – this has become a major challenge in computer centers
  - installation, monitoring and farm management
The Pre-Series installation has provided invaluable experience to tune and correct the infrastructure, its handling and its operation
Making good progress towards ATLAS operation in 2007 – looking forward to the start of physics running