Computer Hardware and Procurement at CERN

Helge Meinhard (at) cern ch

HEPiX fall 2005 @ SLAC

Helge Meinhard (at) cern.ch, HEPiX@SLAC: Hardware procurement at CERN

Outline

  • Procedures
  • Hardware (being) procured
  • Power measurements
  • Observations

Procedures

Constraints (1)

  • CERN is an international organisation with strict administrative rules
  • Competitive tendering required, covering (at least) the member states
      - No way to avoid it for commodity equipment
  • Lowest compliant bid wins
      - No negotiations about added value of higher offers

Constraints (2)

  • Different procedures depending on expected volume
      - < 10’000 CHF: IT seeks 3 offers
      - < 200’000 CHF: Formal price enquiry by purchasing service. Four weeks response time
      - < 750’000 CHF: Formal call for tender preceded by market survey. Six weeks response time
      - > 750’000 CHF: As < 750’000 CHF, plus approval by CERN’s Finance Committee (5 sessions/year, papers ready two months in advance)

(1 CHF = 0.78 USD = 0.65 EUR)
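The volume thresholds above amount to a simple lookup. The following is only a sketch of that lookup: the names `Procedure` and `procedure_for` are invented for illustration, and the treatment of a volume of exactly 750’000 CHF is an assumption, since the slide only states "<" and ">".

```python
from dataclasses import dataclass
from typing import Optional

# Exchange rates as quoted on the slide (autumn 2005).
CHF_TO_USD = 0.78
CHF_TO_EUR = 0.65


@dataclass(frozen=True)
class Procedure:
    """Hypothetical record describing one procurement procedure."""
    description: str
    response_weeks: Optional[int]  # None where the slide gives no response time
    finance_committee: bool        # True if Finance Committee approval is needed


def procedure_for(expected_volume_chf: float) -> Procedure:
    """Pick the procedure for an expected purchase volume in CHF."""
    if expected_volume_chf < 10_000:
        return Procedure("IT seeks 3 offers", None, False)
    if expected_volume_chf < 200_000:
        return Procedure("Formal price enquiry by purchasing service", 4, False)
    tender = "Formal call for tender preceded by market survey"
    if expected_volume_chf < 750_000:
        return Procedure(tender, 6, False)
    # Assumption: exactly 750'000 CHF is treated like the "> 750'000 CHF" case.
    return Procedure(tender, 6, True)


if __name__ == "__main__":
    for volume in (8_000, 150_000, 600_000, 1_200_000):
        print(volume, round(volume * CHF_TO_EUR), "EUR", procedure_for(volume))
```

Note that the Finance Committee meets only five times a year and needs papers two months in advance, so the last case adds a delay that this sketch does not model.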

Our problems

Procedures badly adapted to quickly evolving computing market

Difficult to give preference to “good”, reliable equipment

Our choices (1)

  • For significant purchases (> 100 kCHF) we require (a) sample system(s)
      - with the tender for big tenders
      - on CERN’s request for small tenders
  • Tenders include 3 years on-site warranty for hardware
      - Typical requirements: 4 working hours response / 12 working hours repair for critical machines
      - 3 working days response / 5 working days repair for farm nodes
      - Supplier can subcontract on-site warranty

Our choices (2)

Payment within 30 days after provisional acceptance on receipt of bank guarantee of 5% of purchase sum valid until end of warranty period

Delivery within 6 weeks, penalty for late delivery: 2% of purchase sum per complete week, max. 10%
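The two percentages on this slide can be worked through in a few lines. This is a sketch only: the function names are invented, and counting "complete weeks" as whole 7-day blocks of calendar days is an assumption about how lateness is measured.

```python
def bank_guarantee(purchase_sum_chf: float) -> float:
    """Bank guarantee of 5% of the purchase sum."""
    return purchase_sum_chf * 5 / 100


def late_delivery_penalty(purchase_sum_chf: float, days_late: int) -> float:
    """Penalty of 2% of the purchase sum per complete week late, capped at 10%."""
    complete_weeks = max(days_late, 0) // 7    # partial weeks do not count
    percent = min(2 * complete_weeks, 10)      # cap at 10%
    return purchase_sum_chf * percent / 100


if __name__ == "__main__":
    order = 500_000  # CHF, illustrative value
    print("guarantee:", bank_guarantee(order))
    for days in (6, 20, 70):
        print(days, "days late ->", late_delivery_penalty(order, days), "CHF")
```

With these rules the cap is reached after five complete weeks of delay; any further delay costs the supplier nothing more under this clause.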

Our choices (3)

If more than 10% of the systems fail during acceptance or during the first month after: right to return the whole batch

If a system fails 3 or more times during any 6 months’ period, right to request complete replacement of system

If more than 20% of any component fail during any 6 months’ period, right to request complete replacement of this component across batch

If CERN adds third-party devices, no impact on warranty obligations for system as delivered
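The three failure clauses above can be expressed as simple checks. The sketch below is illustrative, not the contractual wording: the function names are invented, "6 months" is approximated as 183 days, and the caller is assumed to pass only failures that fall in the relevant period for the batch and component checks.

```python
from datetime import date, timedelta

SIX_MONTHS = timedelta(days=183)  # assumption: 6 months taken as 183 days


def may_return_batch(failed_systems: int, batch_size: int) -> bool:
    """More than 10% of systems failed during acceptance or the first month after."""
    return failed_systems * 10 > batch_size


def may_replace_system(failure_dates: list) -> bool:
    """A system failed 3 or more times within some 6-month period."""
    dates = sorted(failure_dates)
    # Any three consecutive failures (in time order) close enough together suffice.
    return any(dates[i + 2] - dates[i] <= SIX_MONTHS for i in range(len(dates) - 2))


def may_replace_component(failed_units: int, installed_units: int) -> bool:
    """More than 20% of one component type failed within a 6-month period."""
    return failed_units * 5 > installed_units


if __name__ == "__main__":
    print(may_return_batch(21, 200))
    print(may_replace_system([date(2005, 1, 10), date(2005, 3, 1), date(2005, 6, 20)]))
    print(may_replace_component(45, 200))
```

Integer comparisons are used instead of dividing, so that "more than 10%" and "more than 20%" are evaluated exactly at the boundary.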

Our choices (4)

  • If justified by volume, procure from two suppliers (lowest and second-lowest compliant)
      - Better protection if one delivers crap or nothing at all
      - Better chance for companies to win an order
      - Increased workload on our part

Example of a procurement

  • Procurement of equipment worth < 750 kCHF
      - Approval by Finance Committee not needed
  • Market survey already done
      - Market survey can cover different types of equipment
      - Valid for 1 year
      - If not done yet, add ~ 16 weeks

Steps (1)

Typical case:

  Step                                   Duration
  Fix scope                              2 w
  Write technical, commercial docs       3 w
  IT-internal review
  Revise technical, commercial docs      2 w
  Specification meeting
  Revise technical, commercial docs      1 w
  Tender out
  Deadline for replies                   6 w
  Opening of replies                     1 w

(Total so far: 15 weeks, at best compressible to 12 weeks)

Steps (2)

Typical case:

  Step                                   Duration
  (Total from previous slide: 15 w, min. 12 w)
  Technical analysis of replies          1 w
  Visual inspection, mounting            1 w
  Benchmarks, reports                    3 w
  Technical clarifications               1 w
  Purchase request, order                2 w
  Delivery                               7 w
  Preliminary acceptance                 6 w

Total: 36 weeks, compressible to 30 weeks
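The totals on the two "Steps" slides follow directly from the individual durations. The sketch below just adds them up; the variable and function names are invented, and the review, meeting and "tender out" milestones are left out because the slides give no duration for them.

```python
# Durations in weeks, as listed on the "Steps (1)" slide.
STEPS_PART_1 = [
    ("Fix scope", 2),
    ("Write technical, commercial docs", 3),
    ("Revise docs after IT-internal review", 2),
    ("Revise docs after specification meeting", 1),
    ("Deadline for replies", 6),
    ("Opening of replies", 1),
]

# Durations in weeks, as listed on the "Steps (2)" slide.
STEPS_PART_2 = [
    ("Technical analysis of replies", 1),
    ("Visual inspection, mounting", 1),
    ("Benchmarks, reports", 3),
    ("Technical clarifications", 1),
    ("Purchase request, order", 2),
    ("Delivery", 7),
    ("Preliminary acceptance", 6),
]

MARKET_SURVEY_WEEKS = 16  # "~ 16 weeks" if the market survey is not done yet


def total_weeks(steps, market_survey_done=True):
    """Sum the step durations, adding the market survey if it is still pending."""
    weeks = sum(duration for _, duration in steps)
    return weeks if market_survey_done else weeks + MARKET_SURVEY_WEEKS


if __name__ == "__main__":
    print("after opening of replies:", total_weeks(STEPS_PART_1))
    print("whole procedure:", total_weeks(STEPS_PART_1 + STEPS_PART_2))
    print("without prior market survey:",
          total_weeks(STEPS_PART_1 + STEPS_PART_2, market_survey_done=False))
```

The "compressible to 12 weeks" and "compressible to 30 weeks" figures are the speaker's estimates and cannot be derived from the step list alone.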

Hardware (being) procured

Objectives

Cover existing needs with as few different models and as few procurement procedures as possible

Closely follow technology and market evolution and satisfy requirements with modern hardware at low cost

(These two objectives contradict each other.)

Fabric Infrastructure and Operations (1)

  • RedHat 7.3 phased out on public services
      - Campaign on storage nodes far advanced
  • New in machine room since Karlsruhe:
      - 200 farm PCs (dual Nocona): in production
      - 116 disk servers (> 5 TB usable each, total of 900 TB gross capacity): part in production, part under acceptance test
      - 112 “midrange servers”: under acceptance test
      - 32-node Infiniband-based cluster for Theory
  • Refurbishment of machine room proceeding
      - LHS being populated, but power remains limited

Talk

From CERN site report 2005/10/11

Hardware being procured (1)

  • Large volumes – several times < 750 kCHF per year
      - “Farm PCs” – non-redundant, cheap dual-processor work horses
      - “Disk servers” – storage-in-a-box systems with many SATA disks for streaming applications

Hardware being procured (2)

  • Medium-size volumes – once < 750 kCHF per year or once or several times < 200 kCHF per year
      - “Midrange servers” – redundant building blocks for specific applications
      - “Tape servers” – midrange servers with an FC interface
      - “Disk arrays” – autonomous RAID units with FC uplinks
      - SAN infrastructure (most notably FC switches)
      - Head nodes for serial console infrastructure
      - “Small disk servers”, somewhere between disk servers and midrange servers
      - Miscellaneous

Specifications: Farm PCs (1)

  • 2 boxed Intel Noconas of 2.8 GHz
  • Mainboard:
      - BMC (IPMI 1.5 or higher)
      - PXE, USB boot
      - BBS menu
      - Console redirection
      - Configurable to stay off on AC power loss
  • 2 GB ECC memory
      - From mainboard manuf. approved list
      - Upgradable to 4 GB without removing modules

Specifications: Farm PCs (2)

  • 1 disk > 140 GB, IDE not permitted
      - Certified for 24/7, 3 y warranty by disk manuf.
  • 1 GigE providing PXE and IPMI access
  • 19” chassis max. 4 U, with rails
      - Power, reset button
      - Power, disk activity LED
  • Power supply supporting machine + 50 W
      - Active PFC
      - C13 to C14 LSZH power cord
  • Guaranteed to run under RHEL 3 (i386 and x86_64)
  • Delivery within 6 weeks from dispatch of order

Specifications: Disk server (1)

  • 1 or 2 boxed Intel Xeon with EM64T
  • Mainboard as for Farm PCs
      - Now adding support for memory mirroring
  • Memory as for Farm PCs
  • General requirements for disks etc.
      - ≥ 7200 rpm, no EIDE, 3 y warranty, certified for 24/7 by manufacturer
      - Metallic hot-swap trays certified by chassis manuf.
      - Indicators for power and activity for each tray
      - PCB backplanes for disks, multilane cabling
      - “Intelligent” RAID controllers

Specifications: Disk server (2)

  • System disks: 2 x ≥ 140 GB mirrored
  • Data disks: all identical
      - Redundant RAIDs with hot spares (min. 1/15)
      - Total usable capacity per system above 5 TB
      - Battery buffer if controller with active cache
  • 1 GigE providing required performance, PXE, IPMI access
  • 19” chassis rack-mountable with rails
      - Min. 40 TB usable in 42 U high rack
  • Power supply: N+1 redundant, active PFC
  • Guaranteed to run under RHEL 3 (i386 and x86_64)
  • Delivery within 6 weeks from dispatch of order
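Two of the requirements above are simple arithmetic constraints on an offered configuration. The sketch below checks them under stated assumptions: "min. 1/15" is read as at least one hot spare per 15 data disks (started groups count), the whole 42 U rack is assumed to be available for disk servers, and all function names are invented for illustration.

```python
RACK_HEIGHT_U = 42
MIN_USABLE_TB_PER_RACK = 40
MIN_USABLE_TB_PER_SYSTEM = 5


def min_hot_spares(data_disks: int) -> int:
    """Assumed reading of 'min. 1/15': one hot spare per started group of 15 disks."""
    return (data_disks + 14) // 15


def rack_usable_tb(system_usable_tb: float, chassis_height_u: int) -> float:
    """Usable capacity of a rack filled with identical disk servers."""
    systems_per_rack = RACK_HEIGHT_U // chassis_height_u
    return systems_per_rack * system_usable_tb


def meets_density(system_usable_tb: float, chassis_height_u: int) -> bool:
    """Check the per-system (> 5 TB) and per-rack (>= 40 TB) capacity requirements."""
    return (system_usable_tb > MIN_USABLE_TB_PER_SYSTEM
            and rack_usable_tb(system_usable_tb, chassis_height_u) >= MIN_USABLE_TB_PER_RACK)


if __name__ == "__main__":
    print(min_hot_spares(24))
    print(rack_usable_tb(5.5, 4), meets_density(5.5, 4))
    print(rack_usable_tb(5.5, 6), meets_density(5.5, 6))
```

Under these assumptions, a system just above the 5 TB minimum has to fit in at most 5 U for eight of them to reach 40 TB in one rack.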

Specifications: Disk server (3)

  • Performance: memory to disk: iozone with 16 GB files and 256 kB record size
      - Single stream: 40 MB/s write, 40 MB/s read
      - Multi-stream (at least 10): 115 MB/s write, 170 MB/s read (*)
  • Memory to network: iperf
      - Single stream: 100 MB/s write, 100 MB/s read
      - Two streams: 110 MB/s write, 110 MB/s read
      - Two streams in, two streams out: 145 MB/s

Specifications: Disk server (4)

  • Global (disk to network) performance:
      - At least 10 clients transferring 2 GB files via rfio
      - Reading from system: 95 MB/s (*)
      - Writing to system: 90 MB/s (*)

(*): Requirements scale linearly with usable capacity, numbers for 5000 GB usable
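The footnote says the starred throughput figures scale linearly with usable capacity from a 5000 GB reference. A minimal sketch of that scaling follows; the dictionary keys and function name are invented, and it assumes the (*) on the multi-stream iozone line applies to both the write and the read figure.

```python
REFERENCE_CAPACITY_GB = 5000

# Starred requirements in MB/s for 5000 GB usable, taken from the slides.
STARRED_REQUIREMENTS_MB_S = {
    "iozone multi-stream write": 115,  # assumption: (*) covers write as well as read
    "iozone multi-stream read": 170,
    "rfio read from system": 95,
    "rfio write to system": 90,
}


def scaled_requirements(usable_capacity_gb: float) -> dict:
    """Scale each starred requirement linearly with the usable capacity."""
    factor = usable_capacity_gb / REFERENCE_CAPACITY_GB
    return {name: mb_s * factor for name, mb_s in STARRED_REQUIREMENTS_MB_S.items()}


if __name__ == "__main__":
    for name, mb_s in scaled_requirements(7500).items():
        print(f"{name}: {mb_s:.1f} MB/s")
```

The unstarred single-stream and iperf figures are fixed and are therefore not part of this table.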

Power measurements

Done by Andras Horvath, CERN

Power measurements

System                                        Switched off    Booting (peak)  Idle            Full load       SpecInt2k   SI2k/watt
                                              W      VA       W      VA       W      VA       W      VA                   (full load)
Xeon 2x 2.8 GHz                               9      18       150    150      110    112      225    230      1950        8.67
Nocona 2x 3.2 GHz (IBM eServer xSeries 336)   47     70       250    283      173    199      266    292      3030        11.39
Opteron 2x 246 (IBM eServer 326)              10     33       164    180      154    170      184    201      2316        12.59
Opteron dual-core 2x2x 265                    9      26       175    190      163    180      208    228      5454        26.22

Estimated Pentium D system with 1 CPU, 2 cores: 154 / 244 W, SpecInt2k 2800, SI2k/watt 11.48

http://ahorvath.home.cern.ch/ahorvath/power
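The last column of the table is the SpecInt2k rating divided by the full-load power in watts, and the W and VA columns together give the power factor (real over apparent power). The sketch below recomputes both from the full-load figures of the four measured systems; the data structure and function names are illustrative only.

```python
# (full-load W, full-load VA, SpecInt2k) for the four measured systems.
FULL_LOAD = {
    "Xeon 2x 2.8 GHz": (225, 230, 1950),
    "Nocona 2x 3.2 GHz": (266, 292, 3030),
    "Opteron 2x 246": (184, 201, 2316),
    "Opteron dual-core 2x2x 265": (208, 228, 5454),
}


def si2k_per_watt(specint2k: float, watts: float) -> float:
    """Performance per watt at full load, rounded as in the table."""
    return round(specint2k / watts, 2)


def power_factor(watts: float, volt_amperes: float) -> float:
    """Ratio of real power (W) to apparent power (VA)."""
    return watts / volt_amperes


if __name__ == "__main__":
    for name, (w, va, si2k) in FULL_LOAD.items():
        print(f"{name}: {si2k_per_watt(si2k, w)} SI2k/W, "
              f"power factor {power_factor(w, va):.2f}")
```

The same division reproduces the estimated Pentium D figure if 244 W is taken as its full-load power (2800 / 244 rounds to 11.48).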

Observations

Observations (1)

  • Profile of winning companies
      - Tier-1 suppliers competing with large integrators
      - Small “round-the-corner” companies eliminated at Market Survey stage
  • Almost always the integrators win
      - Specially tailored solutions responding to our specifications
      - Prices of Tier-1s rather high in Europe

Observations (2)

  • Stress test as (important) part of the acceptance test
      - Introduced ~ 2 years ago (triggered by presentations from SLAC and FNAL at HEPiX)
      - Very useful
  • Based on va-ctcs
      - No longer sufficiently actively maintained
      - Large number of false positives
      - Looking for a replacement

Observations (3)

  • Pushing these procedures through requires dedicated (and knowledgeable) person power
  • Not obvious to run multiple procedures in parallel
      - In particular, if things go wrong, e.g. stress test fails

Summary

  • Computer hardware procurement is an excellent experimental confirmation of two fundamental laws of human nature
      - Murphy: “Everything that can go wrong will go wrong.”
      - Hofstadter: “Things always take longer than you think, even if you take into account Hofstadter’s law.”