Incontro con Nando, 8 February 2005, Benigno Gobbo

More than Four Years of Compute Farm
Benigno Gobbo, [email protected] (cern.ch)
Info:
  http://www.ts.infn.it/acid
  [email protected] (ts.infn.it)

Requirements

COMPASS: high statistics, medium event complexity
  ~10^10 events/year
  ~10 “good” tracks/event
  More than 200 tracking planes in a non-uniform magnetic field
  Particle identification: RICH, calorimeters, …
Non-trivial event reconstruction
  Production time: ~300 s·SpecCINT2000 (Si2k)
Data storage, production and analysis model
  Raw data stored at CERN (~300 TB/year)
  Production at CERN (e.g. for 2005 we have 200000 Si2k/quarter)
  Monte Carlo production and data analysis at home labs
► Need for compute farms at home laboratories
  Also due to the usual CERN request for computing redistribution: 33% at CERN, 67% outside
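
As a rough cross-check of the figures above, the short Python sketch below estimates the sustained CPU capacity needed to reconstruct one year of data within one year, and the average raw event size. It assumes (this is not stated in the slide) that the ~300 s·Si2k production time is a cost per event, that production runs continuously all year unless a duty cycle is given, and that TB means 10^12 bytes. The function names are illustrative only.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def required_si2k(events_per_year, si2k_seconds_per_event, duty_cycle=1.0):
    """Sustained capacity (Si2k) needed to process one year of events in one year.

    duty_cycle is the assumed fraction of the year the CPUs are actually busy.
    """
    total_work = events_per_year * si2k_seconds_per_event  # in Si2k * s
    return total_work / (SECONDS_PER_YEAR * duty_cycle)


def average_event_size_bytes(raw_bytes_per_year, events_per_year):
    """Average raw event size implied by the yearly raw data volume."""
    return raw_bytes_per_year / events_per_year


if __name__ == "__main__":
    events = 1e10  # ~10^10 events/year (from the slide)
    print(f"CPU needed: {required_si2k(events, 300):,.0f} Si2k")
    print(f"Event size: {average_event_size_bytes(300e12, events) / 1e3:.0f} kB")
```

Under these assumptions the estimate comes out at the order of 10^5 Si2k of sustained capacity and roughly 30 kB per raw event.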

A different Computing Model

1998. Definition of a Computing Model for the post-LEP era
  January 1998. A Task Force was established at CERN (1)
    To achieve: agreement with the time scale and requirements of the experiments, flexibility of the environment, constraints from the commercial software in use, realistic assessment of costs, …
  April 1998. Conclusions (recommendations): hybrid architecture
    using PCs for computation (preferred: Windows NT, “tolerated”: Linux)
    using, for the time being, RISC systems for I/O (legacy Unix)
1999. Evolution of the model
  Significant Linux improvements: now stable and better performing than Win NT
  Development of “low price + good enough quality” IDE-disk-based PC servers
COMPASS definitive choice:
  PCs for both server and computation machines
  (RedHat) Linux OS

The History

Sep. 2000. Approved (and above all “sponsored”!) by CSN I
  Financed over two years
    200M ITL in 2000
    124k € in 2001
Oct. 2000. Definition of a schema for the farm “initial setup”
  The farm has to be as compatible as possible with the CERN one
    But not CERN-dependent
  The “initial setup” must guarantee a “production environment”
    Enough disk space (for data storage and MC production)
    Enough CPU power (i.e. PC clients)
  It must be scalable to the final configuration without (major) modifications
  It must fit within the approved financing

History: first steps

Nov. 2000. “Initial setup” decided, orders submitted
  1 PC server with large EIDE disk space (14 x 75 GB EIDE disks)
    Configured as RAID1 (mirroring), it provided 0.5 TB of (cheap) disk storage
    The machine was assembled by ELONEX following CERN R&D
  1 Sun server with external SCSI disks (8 x 73 GB)
    Configured as RAID5, it gave 0.47 TB of more reliable disk storage
    Different OS (Solaris) and architecture (SPARC): allows better testing and debugging of software
  1 PC supervision server
    Nothing special: just a white-box PC with better components. Used as a supervisor or master in monitoring or client-server software
  12 PC clients
    Value white-box PCs, to stay within the available budget
  All machines are dual processor to improve the performance/cost ratio
    Well… the Sun was bought as single processor (it was so expensive…) and upgraded subsequently
  Network switch (36 100BaseT + 3 1000BaseSX ports)
  KVM switches, rack, shelves, monitor, keyboard, etc.
  UPS and cooling system provided by “Sezione di Trieste” (thanks to A. Mansutti & S. Rizzarelli)
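
The disk capacities quoted for the two servers follow from simple RAID arithmetic, which the Python sketch below reproduces. It is only an illustration and assumes things the slide does not state: RAID1 is taken as mirrored pairs (half the raw space usable), RAID5 as a single array losing one disk to parity, and disk sizes as nominal decimal GB. The function name is hypothetical.

```python
def usable_capacity_gb(n_disks, disk_gb, raid_level):
    """Nominal usable capacity of a simple RAID set, in GB.

    Assumes one array: RAID0 uses all disks, RAID1 mirrors disks in pairs,
    RAID5 gives up the equivalent of one disk to parity.
    """
    if raid_level == 0:
        return n_disks * disk_gb
    if raid_level == 1:
        if n_disks % 2 != 0:
            raise ValueError("mirrored pairs need an even number of disks")
        return (n_disks // 2) * disk_gb
    if raid_level == 5:
        if n_disks < 3:
            raise ValueError("RAID5 needs at least 3 disks")
        return (n_disks - 1) * disk_gb
    raise ValueError(f"unsupported RAID level: {raid_level}")


if __name__ == "__main__":
    # Figures from the slide: 14 x 75 GB EIDE (RAID1), 8 x 73 GB SCSI (RAID5)
    print("EIDE server:", usable_capacity_gb(14, 75, 1), "GB")
    print("Sun server: ", usable_capacity_gb(8, 73, 5), "GB")
```

The nominal results (525 GB and 511 GB) are slightly above the quoted 0.5 TB and 0.47 TB; the difference is presumably due to unit conventions and formatting overhead, which the sketch does not model.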

History. Feb. 2001: “First setup” in production

First Linux Compute Farm locally installed and completely managed by INFN personnel

History: the final setup

Sep. 2001. Start of the farm upgrade to the final setup
  1 more EIDE PC server (with 20 x 80 GB EIDE disks)
    Configured as RAID1: 0.75 TB
  Upgrade of the previous EIDE server with 6 additional 80 GB disks
    Now it provides 0.72 TB (RAID1)
  Upgrade of the Sun to dual processor
  STK tape library: 20 slots (can be upgraded to 40), 2 IBM Ultrium drives (can have 4 drives)
    It can store up to 4 TB of data. Drive transfer rate up to 30 MB/s
  1 Dell PC tape server, with 6 x 73 GB SCSI disks configured as RAID 0 (striping)
    To be used with the tape library, forming an HSM system
  19 PC clients
    White-box machines, dual 1 GHz P III
  12-port 1000BaseSX switch
  KVM switches, etc.
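
The tape library figures can be put in perspective with a little arithmetic, sketched below in Python. The sketch assumes decimal units (1 TB = 10^6 MB), that both drives write in parallel, and that the quoted “up to” transfer rate is sustained the whole time, so the fill time is a lower bound. The function names are illustrative.

```python
def cartridge_capacity_gb(library_capacity_tb, slots):
    """Capacity per cartridge implied by the library total and slot count."""
    return library_capacity_tb * 1000.0 / slots


def hours_to_fill(library_capacity_tb, n_drives, drive_rate_mb_s):
    """Minimum time to write the whole library, all drives running in parallel."""
    total_mb = library_capacity_tb * 1e6
    seconds = total_mb / (n_drives * drive_rate_mb_s)
    return seconds / 3600.0


if __name__ == "__main__":
    # Figures from the slide: 4 TB in 20 slots, 2 drives at up to 30 MB/s
    print(f"Per cartridge: {cartridge_capacity_gb(4, 20):.0f} GB")
    print(f"Time to fill:  {hours_to_fill(4, 2, 30):.1f} h")
```

With the quoted numbers this gives 200 GB per cartridge and a bit less than a day to fill the library at peak rate.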

History: the 2002 “Final Setup”

11 old clients: MSI 694D Pro, dual PIII 800 MHz, 2 x 20 GB ATA disk, 512 MB RAM
19 new clients: Abit VP6, dual PIII 1000 MHz, 2 x 40 GB ATA disk, 512 MB RAM
Switches: 3com 4900, 3com 3900
KVM switch
Server (SGE, DHCP, BB, …): Asus CUR-DLS, dual PIII 800 MHz, 2 x 36 GB SCSI disk, 512 MB RAM, GA620 G gigabit
EIDE disk server: Intel L440 GX+, dual PIII 700 MHz, 2 x 15 GB ATA disk, 14 x 75 GB ATA disk, 6 x 80 GB ATA disk, GA620 G gigabit
EIDE disk server: Intel STL2, dual PIII 866 MHz, 2 x 20 GB ATA disk, 20 x 80 GB ATA disk, GA620 G gigabit
Tape library: STK L40, 20 slots, 2 x IBM Ultrium
Tape/disk server: Dell PowerEdge 4400, dual Xeon 1 GHz, 2 x 36 GB SCSI RAID1, 6 x 73 GB SCSI RAID0
SCSI disk server: Sun Blade 1000, dual SparcIII 750 MHz, 18 GB SCSI FC disk, 8 x 73 GB SCSI RAID5
CRD-5440

History: up to now and in the near future

2002 - 2004. Upgrades
  Additional EIDE PC server with 20 x 200 GB disks
    Powerful machine (dual Xeon). 4 RAID5 partitions allowing 3 TB of disk space
  PC server for Oracle/DB with 12 x 200 GB disks
    To contain the event database
  HP PC server with 6 x 142 GB SCSI disks
  STK tape library upgrade from 20 to 40 slots
    Now it can store up to 8 TB of data
  Ultrium2 tape drive for the STK tape library
    Up to 400 GB/cartridge, up to 70 MB/s transfer rate
  8 PC clients
    Rack-mount dual Opteron processor machines
2005. Financed
  Upgrade of disk space (SATA disks rack)

EIDE disk server: Intel SE7500CW2, dual Xeon 2 GHz, 1 GB RAM, 2 x 40 GB + 20 x 200 GB ATA, Netgear GA 621
Oracle server: SuperMicro X5DP8-G2, dual Xeon 2.4 GHz, 2 GB RAM, 2 x 20 GB + 12 x 200 GB ATA, 3com 3C996-SX
HP Proliant ML530G2: dual Xeon 2.8 GHz, 2 GB RAM, 2 x 36 + 6 x 146.8 SCSI, Gigabit
Newisys 2100: dual Opteron 250, 2 GB RAM, 2 x 36 GB SCSI

ACID Farm w.r.t. CERN farm: Hardware

The choices
  Clients. No alternatives due to cost difference: use PCs. But…
    At CERN there are short hardware upgrade periods → use “old”, good quality (e.g. Intel chipsets), well Linux-tested (certified) hardware
    Here hardware lifetime is longer → use “recent” hardware (as it becomes “dated” really fast), middle quality (e.g. VIA chipset, for cost reasons), maybe not yet completely Linux certified
    What we learned. “Whitebox” PCs are quite fragile. In particular EIDE disks are very fragile, and the worst piece to replace because of the need for data recovery. High-quality disks are preferable (if possible).
  EIDE disk server shows a great performance/cost ratio
    Not completely tested at the beginning, but it looked nice and the difference in cost with SCSI-based servers (a factor of three) looked too attractive
    What we learned. See the above comment on disks.
  The Sun
    Also at CERN there is a SUNDEV cluster made available for code quality checking. In addition, some services still run on Suns for stability or commercial software requirement reasons

ACID Farm w.r.t. CERN Farm: Software

Requirements and solutions
  As compatible as possible
    Programs should run without recompilation → use the same kernel, C library, and compilers
    Users should find a similar environment → use the same Linux distribution
    Use CERN patches if they help
  As independent as possible
    Do not use too-CERN-specific tools like SUE (hard to port, not so useful)
    Use official distributions (RedHat) and not CERN “adapted” ones
    Do not use CERN patches if they do not help
  Choose something else if nothing is available or simply if there is something better around:
    CERN batch solution too expensive (LSF), nothing interesting at INFN level → use SGE: free, good, supported
    Monitoring: BigBrother is free and well done (just a little complicated to install)
    Software documentation too: found doxygen, it is so good that it was subsequently adopted by CERN and is now available in many Linux distributions

ACID w.r.t. CERN Farm: Commercial Software

We try to avoid it, if possible (it costs money and it is a source of trouble)
  CERN's attempt to go for “commercial-only software” failed dramatically!
    In general: too difficult to interface to the HEP environment
    In general: it never completely fits HEP requirements
    In general: not able to follow the fast evolution of Linux and GNU software (e.g. compilers: we are forced to use quite outdated and now unsupported gcc compilers. Objectivity/DB needed gcc 2.95.2, ORACLE needs gcc 2.95.3 or 2.96 and only recently gcc 3.2; the current gcc version is 3.4)
    Expensive or with unsatisfactory support (and, in any case, no source code available: so no way to fix problems by ourselves)
So, the current idea is to use commercial software only where there are no alternatives
  Basically only DBMS (Objectivity/DB 6 before, ORACLE 9i after): too difficult to develop a HEP-specific DBMS. Well, free DBMSs are available too (e.g. MySQL), but it is too dangerous to follow a solution different from the CERN one on this subject…

ACID w.r.t. CERN Farm: HEP Linux

Due to RedHat's change of philosophy during 2003
  Free distribution → “Fedora Project”
    Free distribution with a release period of 4-6 months (too fast for HEP needs) and just 3 months of support/patching of the previous release (too short for HEP needs)
  Commercial distribution → “Enterprise”
    Commercial distribution with 5 years of support of the previous release, but too expensive!
HEP reactions
  Mandate to the 3 big HEP labs to negotiate with RedHat, but at the end…
  FNAL
    Rebuild RHEL from source (legal if done without violating RedHat copyrights!) → Scientific Linux 3.x. Other HEP labs joined FNAL in developing and supporting SL
  CERN
    SLC3 (a local flavour of FNAL’s SL). We certified it on Nov. 1st, 2004, and now it is the official Linux distributed at CERN. But some (~200) RHEL3-WS licenses were bought too.
  SLAC
    RHEL (got “via” DOE) is the main distribution. BABAR certified SL too (SLC was certified to run binaries but not (yet?) to build the code).

ACID w.r.t. CERN Farm: software, what has now changed

Keep CERN compatibility, as it’s easier and less expensive…
Good
  The CERN port will be less specific (no more SUE, etc.)
  No more “alternative gcc” compilers
  But with additional “wanted” packages (CASTOR, patched kernels, CERN TeX styles, PINE, …) not available from RedHat distributions.
  ACID now uses the CERN SLC distribution with an adapted installation setup, instead of using the RedHat or SL distribution plus add-ons.
Something bad?
  The port will be supported for 1-2 years. And after?
  The RHEL option is still present. That could mean extra costs for software (now we use RHEL (AS2.1) just on the ORACLE server machine). In that case an I.N.F.N.-wide license would be a better solution. We just have to wait and see…

Farm management: manpower costs (SW)

Distribution upgrade
  It is a major task as a local certification is needed too
    All applications need to be tested
    All nodes need to be re-installed from scratch
  In general it requires more than a month of preparation time
  Not too frequent: once every few years (~2)
Software installation
  Complexity and test-debug period depend on the package
    It can be heavy work (e.g. CASTOR/HSM porting: many months of work)
  From time to time, upgrades/updates are needed
Patching
  In general simple but quite frequent (security patches)
    In the past it needed a lot of time (e.g. as we used a locally patched kernel, we needed a complete kernel recompilation after every official patch). Now things are easier thanks to tools like APT or YUM.
  Trouble after a patch is not frequent but not negligible either, in particular after kernel updates

Farm management: manpower costs (HW)

New hardware
  Purchase
    Product choice, offer requests, “CONSIP”, … Very time consuming and generally boring
  Installation and/or integration
    In general not complex, but in some cases it needs time
Maintenance (1)
  Many parts of the farm are no longer covered by warranty nor under outsourced maintenance
    Broken parts (disks, boards, …) need to be replaced by hand. That takes a lot of time
    An example: MicroStar 694D Pro mainboards mount bad quality electrolytic capacitors (from TAYEH). Of 11 boards, 7 had failures due to leakage of those capacitors. Intervention requires a complete PC dismount, board removal, capacitor replacement and re-mount. On two boards the capacitor failure damaged the downstream electronics: in those cases mainboard replacement was necessary.
  Power loss (HW failures were many times due to overheating).
    Quite (better: too) frequent in AREA. No cooling for long periods, with consequent overheating of the machines (in addition, the T02 room was definitely too small compared to the hardware installed inside; now its size has doubled, but so has the machinery).

The good and the bad

As said: the first Linux Compute Farm installed and managed in an INFN Lab
First COMPASS home-lab farm in production
One of the first CASTOR/HSM installations outside CERN
  and probably the first one (outside CERN) in production
First “in production” ORACLE database replica of part of the (COMPASS) events outside CERN
Heavily used by the COMPASS-Trieste group
  Data analysis, Monte Carlo production, RICH software development and analysis, …
“Borrowed” for work by other Trieste groups (LEP, …)

It is an “in production” apparatus
  Interventions have to be immediate, quick (& NOT dirty)
  It requires continuous monitoring: i.e. someone always has to be present “nearby T02”
  It always “evolves” (software updates, hardware upgrades) and that requires manpower
  It is fragile: the probability of failures is high
  Parts of the software need to be updated and checked very frequently (even every day or so)
  It is difficult to have a day without the need for interventions somewhere inside the farm

What next

A new project: the “Farm di Sezione”
  To (try to) merge all local farms into a kind of single entity.
    It is again something relatively new inside INFN sites
  It involves Gruppo Calcolo and people from several experiments, from existing farms (ALICE and COMPASS) and new ones
  Room expanded: T02 → T01+T02
  Cooling was upgraded
  Discussion started: to find common requirements and evaluate incompatibilities
  Hardware is available and is being installed
  R&D will start soon, hopefully during this month (compatibility tests between the current farm environments, etc.)
Consequences for the ACIDs: too early to say anything; anyhow the ACID team is collaborating in the “Farm di Sezione” setup…

Acknowledgements and Conclusions

Thanks to
  R. Birsa
    Sun management
    Help in software installation and debugging (e.g. CASTOR would never have been installed without his accurate work on it)
  V. Duic
    Data (DB) import, job parallelization tools
  All people from Gruppo Calcolo
    Offer requests
    Consultancy
To conclude
  This farm shows that at INFN-Trieste there is non-negligible IT knowledge (compared to other INFN sites)
  Computing is becoming more and more relevant in HEP experiments. It will probably be dominant (for good and bad) at LHC
  Unfortunately (this is an opinion of mine) INFN does NOT look as pioneering in that field as in all other aspects of HEP