Page 1

Grid Computing at the Large Hadron Collider: Massive Computing at the Limit of Scale, Space, Power and Budget

Dr Helge Meinhard, CERN, IT Department

02-Jul-2009

Page 2

CERN (1)

Conseil européen pour la recherche nucléaire – aka the European Laboratory for Particle Physics

Facilities for fundamental research

Between Geneva and the Jura mountains, straddling the Swiss-French border

Founded in 1954

Page 3

CERN (2)

20 member states

~3400 staff members, fellows, students, apprentices

9000 registered users (~6500 on site) from more than 550 institutes in more than 80 countries

~ 910 MCHF (~550 MEUR) annual budget

http://cern.ch/

Page 4

Physics at the LHC (1)

Matter particles: fundamental building blocks

Force particles: bind matter particles

Page 5

Physics at the LHC (2)

Four known forces: strong force, weak force, electromagnetism, gravitation

The Standard Model unifies three of them
Verified to the 0.1 percent level
Too many free parameters, e.g. particle masses

Higgs particle
The Higgs condensate fills the vacuum
Acts like 'molasses': slows other particles down, gives them mass

Page 6

Physics at the LHC (3)

Open questions in particle physics:
Why do the parameters have the values we observe?
What gives the particles their masses?
How can gravity be integrated into a unified theory?
Why is there only matter and no antimatter in the universe?
Are there more space-time dimensions than the 4 we know of?
What are dark energy and dark matter, which make up 98% of the universe?

Finding the Higgs and possible new physics with the LHC will give the answers!

Page 7

The Large Hadron Collider (1)

Accelerator for protons against protons – 14 TeV collision energy
By far the world's most powerful accelerator

Tunnel of 27 km circumference, 4 m diameter, 50…150 m below ground

Detectors at four collision points

Page 8

The Large Hadron Collider (2)

Approved 1994, first circulating beams on 10 September 2008

Protons are bent by superconducting magnets (8 Tesla, operating at 2K = –271°C) all around the tunnel

Each beam: 3000 bunches of 100 billion protons each

Up to 40 million bunch collisions per second at the centre of each of the four detectors

Page 9

The Large Hadron Collider (3)

Incident on 19 September 2008

During an attempt to ramp the beam energy up to 7 TeV, a leak occurred in the cold mass, causing a significant loss of helium

Repair work is ongoing
Instrumentation for detecting this kind of problem is being added

Schedule: beam around mid-October 2009, collisions around mid-November 2009, running until autumn 2010
Collision energy: 5 + 5 TeV
Short technical stop at Christmas 2009

Page 10

LHC Detectors (1)

ATLAS

CMS

LHCb

Page 11

LHC Detectors (2)

2’200 physicists (including 450 students) from 167 institutes of 37 countries

Page 12

LHC Data (1)

The accelerator generates 40 million bunch collisions (“events”) every second at the centre of each of the four experiments’ detectors

Page 13

LHC Data (2)

Reduced by online computers that filter out a few hundred “good” events per second …

… which are recorded on disk and magnetic tape at 100…1’000 Megabytes/sec

15 Petabytes per year for four experiments

15’000 Terabytes = 3 million DVDs

1 event = a few Megabytes
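To make these numbers concrete, here is a minimal back-of-the-envelope sketch in Python. The assumed trigger rate, event size and effective running time are illustrative values consistent with the ranges quoted above, not official figures.

    # Back-of-the-envelope estimate of LHC data rates (illustrative assumptions).
    BUNCH_COLLISION_RATE = 40e6   # bunch collisions per second at each detector
    GOOD_EVENT_RATE = 200         # assumed "good" events/s kept by the online filter
    EVENT_SIZE_MB = 1.6           # assumed average event size (a few Megabytes)
    RUN_SECONDS_PER_YEAR = 1e7    # assumed effective data-taking time per year

    recording_rate_mb_s = GOOD_EVENT_RATE * EVENT_SIZE_MB
    data_per_year_pb = recording_rate_mb_s * RUN_SECONDS_PER_YEAR / 1e9  # MB -> PB

    print(f"Rejection factor:  {BUNCH_COLLISION_RATE / GOOD_EVENT_RATE:,.0f} : 1")
    print(f"Recording rate:    {recording_rate_mb_s:.0f} MB/s per experiment")
    print(f"Data volume:       {data_per_year_pb:.1f} PB/year per experiment")

With these assumptions one experiment records roughly 3 PB per year, so the four experiments together land in the vicinity of the 15 PB quoted above.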

Page 14

LHC Data (3)

Page 15

Summary of Computing Resource Requirements – all experiments, 2008 (from the LCG TDR, June 2005):

                        CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)       25            56            61     142
Disk (Petabytes)           7            31            19      57
Tape (Petabytes)          18            35             –      53

Shares: CPU – CERN 18%, Tier-1s 39%, Tier-2s 43%; Disk – CERN 12%, Tier-1s 55%, Tier-2s 33%; Tape – CERN 34%, Tier-1s 66%.

30’000 CPU servers, 110’000 disks: far too much for CERN!
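As a cross-check of those shares, a small sketch (assuming the TDR figures above) that derives the percentages from the absolute requirements:

    # Resource requirements from the LCG TDR (June 2005), all experiments, 2008.
    # CPU in MSPECint2000s, disk and tape in Petabytes.
    requirements = {
        "CPU":  {"CERN": 25, "Tier-1s": 56, "Tier-2s": 61},
        "Disk": {"CERN": 7,  "Tier-1s": 31, "Tier-2s": 19},
        "Tape": {"CERN": 18, "Tier-1s": 35},   # no Tier-2 tape requirement
    }

    for resource, shares in requirements.items():
        total = sum(shares.values())
        split = ", ".join(f"{site} {value / total:.0%}" for site, value in shares.items())
        print(f"{resource:4s} (total {total:3d}): {split}")

This reproduces the CERN / Tier-1 / Tier-2 shares shown above up to rounding.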

Page 16

Worldwide LHC Computing Grid (1)

Tier 0: CERN
Data acquisition and initial processing
Data distribution
Long-term curation

Tier 1: 11 major centres
Managed mass storage
Data-heavy analysis
Dedicated 10 Gbps lines to CERN

Tier 2: more than 200 centres in more than 30 countries
Simulation
End-user analysis

Tier 3: physicists' desktops

[Diagram: CERN as Tier 0, connected to Tier-1 centres in Germany, the USA, the UK, France, Italy, Spain, Taiwan, the Nordic countries and the Netherlands; Tier-2 sites (labs and universities) attach to the Tier-1s and form grids for regional groups and for physics study groups; Tier-3 resources are physics-department and desktop machines.]
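As a rough feel for the dedicated 10 Gbps Tier-0 to Tier-1 lines, here is a hedged sketch of the average load they would see if the yearly data volume were spread evenly over the 11 Tier-1s and over the year (both simplifying assumptions made only for this illustration):

    # Average load on a dedicated 10 Gbps Tier-0 -> Tier-1 link (simplified).
    ANNUAL_DATA_PB = 15              # recorded data per year, all experiments
    N_TIER1 = 11
    LINK_GBPS = 10
    SECONDS_PER_YEAR = 365 * 24 * 3600

    per_tier1_bytes = ANNUAL_DATA_PB * 1e15 / N_TIER1
    avg_rate_mb_s = per_tier1_bytes / SECONDS_PER_YEAR / 1e6
    link_capacity_mb_s = LINK_GBPS / 8 * 1000      # Gbit/s -> MB/s

    print(f"Average export rate per Tier-1: {avg_rate_mb_s:.0f} MB/s")
    print(f"Link capacity:                  {link_capacity_mb_s:.0f} MB/s")
    print(f"Average utilisation:            {avg_rate_mb_s / link_capacity_mb_s:.0%}")

The average utilisation comes out at only a few percent; the large headroom is what allows bursts during data taking and catch-up transfers after downtime or reprocessing.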

Page 17

Worldwide LHC Computing Grid (2)

Grid middleware for "seamless" integration of services
Aim: looks like a single huge compute facility
Projects: EDG/EGEE, OSG

Big step from proof of concept to stable, large-scale production

Centres are autonomous, but with lots of commonalities
Commodity hardware (e.g. x86 processors)
Linux (Red Hat Enterprise Linux variant)

Page 18

CERN Computer Centre – Functions:

WLCG: Tier 0, some T1/T2

Support for smaller experiments at CERN

Infrastructure for the laboratory

Page 19

Requirements and Boundaries (1)

High Energy Physics applications require mostly integer processor performance

Large amounts of processing power and storage needed for aggregate performance
No need for parallelism / low-latency high-speed interconnects
Can use large numbers of components with performance below the optimum level ("coarse-grain parallelism", see the sketch after this list)

Infrastructure (building, electricity, cooling) is a concern
Refurbished two machine rooms (1500 + 1200 m2) for a total air-cooled power consumption of 2.5 MW
Will run out of power in about 2011…
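A minimal sketch of what coarse-grain (event-level) parallelism looks like in practice: events are statistically independent, so they can be farmed out to worker processes that never communicate with each other, which is why no low-latency interconnect is needed. The reconstruct() function and the fake event list below are placeholders, not real HEP code.

    # Coarse-grain (event-level) parallelism: independent events, no communication
    # between workers, hence no need for a low-latency interconnect.
    from multiprocessing import Pool

    def reconstruct(event):
        # Placeholder for the real, CPU-bound (mostly integer) reconstruction code.
        return sum(hit % 97 for hit in event)

    if __name__ == "__main__":
        events = [list(range(i, i + 1000)) for i in range(10000)]  # fake raw events
        with Pool() as pool:                  # one worker process per available core
            results = pool.map(reconstruct, events, chunksize=50)
        print(f"processed {len(results)} events")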

Page 20

Requirements and Boundaries (2)

Major boundary condition: cost
Getting maximum resources with a fixed budget…
… then dealing with cuts to the "fixed" budget

Only choice: commodity equipment as far as possible, minimising TCO / performance
This is not always the solution with the cheapest investment cost!

[Photo: servers purchased in 2004, now retired]

Page 21

The Bulk Resources – Event Data

[Diagram (simplified network topology): tape servers, disk servers and CPU servers connected through routers to an Ethernet backbone of multiple 10GigE links]

Permanent storage on tape

Disk as temporary buffer

Data paths: tape ↔ disk, disk ↔ CPU

Page 22

CERN CC currently (July 2009)

5’700 systems, 34’600 processing cores
CPU servers, disk servers, infrastructure servers

13’900 TB usable on 41’500 disk drives

34’000 TB on 45’000 tape cartridges (56’000 slots), 160 tape drives

Tenders in progress or planned (estimates):
3’000 systems, 20’000 processing cores
19’000 TB usable on 21’000 disk drives

Page 23

CPU Servers (1)

Simple, stripped-down, "HPC-like" boxes
No fast low-latency interconnects

EM64T or AMD64 processors (usually 2), 2 GB/core, 1 disk/processor

Open to multiple systems per enclosure

Adjudication based on total performance (SPECint2000, moving to SPECcpu2006 – all_cpp subset)

Power consumption taken into account

Page 24

CPU Servers (2)

Page 25

The Power Challenge (1)

Infrastructure limitations
E.g. CERN: 2.5 MW for IT equipment
Clearly insufficient – need to fit maximum capacity into the given power envelope
Additional creative measures required (water-cooled racks in an air-cooled room)

Electricity costs money
Electricity costs are likely to rise (steeply) over the next few years
Saving 10 W is saving 88 kWh per year
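The 88 kWh figure is simply 10 W times the number of hours in a year. A short sketch of that arithmetic and of what it means across a whole machine room; the electricity price is an assumed value used only for illustration:

    # Energy and cost impact of shaving a few Watts off servers that run 24/7.
    HOURS_PER_YEAR = 365 * 24        # 8760 h
    PRICE_EUR_PER_KWH = 0.10         # assumed electricity price, for illustration
    N_SERVERS = 5700                 # order of magnitude of the CERN CC server count

    def annual_savings(watts_saved, n_servers=1):
        kwh = watts_saved * HOURS_PER_YEAR / 1000 * n_servers
        return kwh, kwh * PRICE_EUR_PER_KWH

    kwh, eur = annual_savings(10)
    print(f"10 W on one server: {kwh:.0f} kWh/year (~{eur:.0f} EUR/year)")

    kwh, eur = annual_savings(10, N_SERVERS)
    print(f"10 W across the CC: {kwh:,.0f} kWh/year (~{eur:,.0f} EUR/year)")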

Page 26

The Power Challenge (2)

IT is responsible for a significant fraction of the world's energy consumption

Server farms: 180…280 billion kWh per year (20…31 million kW)
CERN's data centre is 0.1 per mille of this…
1…2% of the world's energy consumption, annual growth rate 16…23%

Responsibility towards mankind demands using the energy as efficiently as possible

Saving a few percent of energy consumption makes a big difference

Page 27

Server Energy Consumption

Power supply, fans, processors, chipset, memory modules, disk drives, VRMs (Voltage Regulator Modules), RAID controllers, …

What should we start looking at?

Page 28

CERN’s Approach

Measure apparent (VA) power consumption in the primary AC circuit
CPU servers: weighted 80% full load, 20% idle
Infrastructure servers: weighted 50% full load, 50% idle

Add an element reflecting power consumption to the purchase price
Currently about 6.50 EUR per VA

Adjudicate on the sum of purchase price and power adjudication element (see the sketch below)
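In other words, offers are compared on an effective price that folds the measured apparent power, weighted by the load profile, into the purchase price. A minimal sketch of that adjudication rule using the figures quoted above; the function and the two offers are illustrative, not CERN's actual procurement code:

    # Effective price = purchase price + power adjudication element.
    POWER_FACTOR_EUR_PER_VA = 6.50          # adjudication element quoted above

    def effective_price(purchase_price_eur, va_full_load, va_idle,
                        full_load_weight=0.80):
        # Weighted apparent power, e.g. 80% full load / 20% idle for CPU servers.
        weighted_va = full_load_weight * va_full_load + (1 - full_load_weight) * va_idle
        return purchase_price_eur + POWER_FACTOR_EUR_PER_VA * weighted_va

    # Two hypothetical offers: cheaper but power-hungry vs. pricier but efficient.
    offer_a = effective_price(2000, va_full_load=350, va_idle=250)   # -> 4145 EUR
    offer_b = effective_price(2200, va_full_load=280, va_idle=180)   # -> 3890 EUR
    print(f"Offer A: {offer_a:.0f} EUR   Offer B: {offer_b:.0f} EUR")

Offer B wins despite the higher purchase price, which is exactly the behaviour the adjudication element is meant to encourage.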

Page 29

Power Efficiency: Lessons Learned (1)

Power efficiency increased by a factor of 12 in just a little over 4 years
Power efficiency = performance / power consumption

Quantum steps:
Microarchitecture: Netburst to Core to Nehalem
Multi-core: 1 to 2 to 4 cores per CPU
For Core: 5000P (Blackford) to 5100 (San Clemente) chipset
Much improved PSU efficiencies

CERN retires servers aggressively after the end of the warranty period
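To put the factor of 12 in perspective, a tiny sketch of the implied yearly gain and doubling time ("a little over 4 years" is taken as 4.25 years, an assumption made only for this calculation):

    import math

    # "Power efficiency increased by a factor of 12 in just a little over 4 years"
    factor, years = 12, 4.25
    annual_gain = factor ** (1 / years)
    doubling_months = years * 12 * math.log(2) / math.log(factor)
    print(f"~{annual_gain:.1f}x per year, doubling roughly every {doubling_months:.0f} months")

That is roughly a 1.8x gain per year, i.e. performance per Watt doubling every 14 months or so over that period.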

Page 30

Power Efficiency: Lessons Learned (2)

Need to benchmark concrete servers; generic statements about a platform are void
Beginning of 2007: different servers with the same CPU, chipset and memory configuration resulted in proposals with a 50% spread in power efficiency

Fostering energy-efficient solutions makes a difference
Summer 2008: different techniques offered in response to the same call for tender differed by 60% in power efficiency

Page 31

Power Efficiency: Lessons Learned (3)

Solutions with power supplies feeding more than one system are usually more power-efficient
There are more options than just blades…

Redundant power supplies are inefficient
Summer 2008: a 1U server running on one PSU module used 8.5% less power than when running on two modules
The difference is even larger in idle mode

Page 32

Future (1)

Is IT growth sustainable?
Demands continue to rise exponentially
Even if Moore's law continues to apply, data centres will need to grow in number and size
IT is already consuming 2% of the world's energy – where do we go?

How to handle growing demands within a given data centre?
Demands evolve very rapidly, technologies less so, and infrastructure at an even slower pace – how to best match these three?

Page 33

Future (2)

IT: an ecosystem of
Hardware
OS software and tools
Applications

Evolving at different paces: hardware fastest, applications slowest

How to make sure at any given time that they match reasonably well?

Page 34

Future (3)

Example: single-core to multi-core to many-core
Most HEP applications are currently single-threaded
Consider a server with two quad-core CPUs as eight independent execution units
This model does not scale much further (see the sketch below)

Need to adapt applications to many-core machines
A large, long effort
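One way to see why the "independent execution units" model runs out of steam: if every core runs its own full application instance, memory (configured at roughly 2 GB per core, as for the CPU servers described earlier) grows linearly with the core count. The sketch below contrasts that with a hypothetical framework that shares read-only data (geometry, calibration constants) between cores; the 40% shareable fraction is purely an assumed illustration:

    # Memory per box: one independent job per core vs. a (hypothetical) framework
    # that shares read-only data across cores.
    GB_PER_JOB = 2.0          # per-core memory configuration quoted for CPU servers
    SHARED_FRACTION = 0.4     # assumed shareable fraction of a job's memory

    for cores in (8, 16, 32, 64):
        independent = cores * GB_PER_JOB
        shared = GB_PER_JOB * (SHARED_FRACTION + cores * (1 - SHARED_FRACTION))
        print(f"{cores:3d} cores: {independent:5.0f} GB independent, "
              f"{shared:5.0f} GB with shared read-only data")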

Page 35

Summary

The Large Hadron Collider (LHC) and its experiments form a very data- (and compute-) intensive project

The LHC has triggered or pushed new technologies
E.g. Grid middleware, WANs

High-end or bleeding-edge technology is not necessary everywhere
That's why we can benefit from the cost advantages of commodity hardware

Scaling computing to the requirements of the LHC is hard work

IT power consumption/efficiency is a primordial concern

We had the first circulating beams on 10-Sep-2008, and have the capacity in place for the initial needs

We are on track for further ramp-ups of the computing capacity for future requirements

Page 36

Summary of Computing Resource Requirements – all experiments, 2008 (from the LCG TDR, June 2005)

                        CERN   All Tier-1s   All Tier-2s   Total
CPU (MSPECint2000s)       25            56            61     142
Disk (Petabytes)           7            31            19      57
Tape (Petabytes)          18            35             –      53

Thank you