Florida Tech Grid Cluster


Florida Tech Grid Cluster
P. Ford² * X. Fave¹ * M. Hohlmann¹

High Energy Physics Group
¹Department of Physics and Space Sciences
²Department of Electrical & Computer Engineering

History

Original conception in 2004 with an FIT ACITC grant.

2007 - Received over 30 more low-end systems from UF. Basic cluster software operational.

2008 - Purchased high-end servers and designed the new cluster. Established the cluster on the Open Science Grid.

2009 - Upgraded and added systems. Registered as a CMS Tier 3 site.

Current Status

OS: Rocks V (CentOS 5.0)

Job Manager: Condor 7.2.0

Grid Middleware: OSG 1.2, Berkeley Storage Manager (BeStMan) 2.2.1.2.i7.p3, Physics Experiment Data Exports (PhEDEx) 3.2.0

Contributed over 400,000 wall hours to the CMS experiment. Over 1.3M wall hours total.

Fully compliant on OSG Resource Service Validation (RSV) and CMS Site Availability Monitoring (SAM) tests.

System Architecture

[Diagram: system architecture - Compute Element (CE), Storage Element (SE), NAS server nas-0-0, and compute nodes compute-1-X / compute-2-X]

Hardware

CE/Frontend: 8 Intel Xeon E5410 cores, 16GB RAM, RAID5

NAS0: 4 CPUs, 8GB RAM, 9.6TB RAID6 array

SE: 8 CPUs, 64GB RAM, 1TB RAID5

20 compute nodes: 8 CPUs & 16GB RAM each; 160 total batch slots.

Gigabit networking, with a Cisco Express switch at the core.

2x 208V 5kVA UPS for nodes, 1x 120V 3kVA UPS for critical systems.

Hardware

[Photo: cluster hardware in the Olin Physical Science High Bay]

Rocks OS

Huge software package for clusters (e.g. 411, dev tools, Apache, autofs, Ganglia).

Allows customization through “Rolls” and appliances. Config stored in MySQL.

Customizable appliances auto-install nodes and run post-install scripts.
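
For illustration, the appliance and host configuration can be queried from the frontend's command line. A minimal sketch, assuming a Rocks 5-style frontend with the rocks CLI installed:

#!/usr/bin/env python
# Minimal sketch: ask the Rocks frontend which appliances and hosts it knows
# about by shelling out to the "rocks" command line (assumed to be installed).
import subprocess

def rocks_list(noun):
    """Return the output of 'rocks list <noun>' as a list of lines."""
    out = subprocess.check_output(["rocks", "list", noun])
    return out.decode().splitlines()

for line in rocks_list("appliance"):
    print("appliance: " + line)
for line in rocks_list("host"):
    print("host: " + line)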

Storage

XFS set up on the NAS partition, mounted on all machines.

The NAS stores all user and grid data and streams it over NFS.

The Storage Element is the gateway for Grid storage on the NAS array.
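
A quick sanity check that a node really sees the NAS export over NFS, as a sketch (the mount point path is a hypothetical placeholder, not necessarily our layout):

#!/usr/bin/env python
# Sketch: verify the NAS export is NFS-mounted on this node and report free
# space. The mount point below is a hypothetical placeholder.
import os

MOUNT_POINT = "/mnt/nas-0-0"

def is_nfs_mounted(path):
    """Return True if 'path' appears as an NFS mount in /proc/mounts."""
    with open("/proc/mounts") as mounts:
        for line in mounts:
            device, mountpoint, fstype = line.split()[:3]
            if mountpoint == path and fstype.startswith("nfs"):
                return True
    return False

if is_nfs_mounted(MOUNT_POINT):
    stat = os.statvfs(MOUNT_POINT)
    free_tb = stat.f_bavail * stat.f_frsize / 1e12
    print("NAS mounted, %.2f TB free" % free_tb)
else:
    print("NAS export not mounted at " + MOUNT_POINT)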

Condor Batch Job Manager

Batch job system that enables distribution of workflow jobs to compute nodes.

Distributed computing, NOT parallel.

Users submit jobs to a queue and the system finds places to process them.

Great for Grid computing; the most-used batch system in OSG/CMS.

Supports “Universes” - Vanilla, Standard, Grid...
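
As a concrete example of the queue model, a minimal vanilla-universe submission driven from a script; the executable and file names are placeholders, not one of our actual workflows:

#!/usr/bin/env python
# Write a minimal vanilla-universe submit description and hand it to
# condor_submit. Executable and file names are placeholders.
import subprocess

submit_text = """\
universe   = vanilla
executable = analyze.sh
arguments  = run_$(Process)
output     = analyze_$(Process).out
error      = analyze_$(Process).err
log        = analyze.log
queue 10
"""

with open("example.sub", "w") as f:
    f.write(submit_text)

# condor_submit queues 10 copies; Condor matches each to an idle batch slot.
subprocess.check_call(["condor_submit", "example.sub"])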

[Diagram: a Personal Condor / Central Manager machine running all daemons - master, collector, negotiator, schedd, and startd]

Master: Manages all daemons.

Negotiator: “Matchmaker” between idle jobs and pool nodes.

Collector: Directory service for all daemons. Daemons send ClassAd updates periodically.

Startd: Runs on each “execute” node.

Schedd: Runs on a “submit” host and creates a “shadow” process on that host. Allows manipulation of the job queue.
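
These daemons can be inspected from the command line; a small sketch that asks the collector for slot ClassAds and the schedd for idle jobs:

#!/usr/bin/env python
# Query the collector for execute-slot ClassAds (condor_status) and the
# schedd for jobs still waiting to be matched (condor_q).
import subprocess

def run(cmd):
    return subprocess.check_output(cmd).decode()

# One line per startd slot, printed as "<machine> <state>".
print(run(["condor_status",
           "-format", "%s ", "Machine",
           "-format", "%s\n", "State"]))

# Jobs with JobStatus == 1 are idle, i.e. not yet matched to a slot.
print(run(["condor_q", "-constraint", "JobStatus == 1"]))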

Typical Condor setup

[Diagram: the Central Manager runs master, collector, negotiator, and schedd; workstations run master, schedd, and startd; cluster nodes run master and startd]

Condor Priority

User priority is managed by a complex (half-life) algorithm with configurable parameters.

The system does not kick off running jobs.

A resource claim is freed as soon as the job is finished.

Enforces fair use AND allows vanilla jobs to finish. Optimized for Grid computing.
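
To make the half-life idea concrete, a rough sketch of the decay rule - the user's priority is smoothed toward current usage with a configurable half-life (the PRIORITY_HALFLIFE knob); the numbers below are made up:

#!/usr/bin/env python
# Rough sketch of half-life priority decay: the user's priority value is
# exponentially smoothed toward current slot usage. Numbers are made up.

HALFLIFE = 86400.0   # seconds; PRIORITY_HALFLIFE defaults to one day

def update_priority(priority, usage, dt):
    """Smooth 'priority' toward 'usage' (slots in use) over dt seconds."""
    beta = 0.5 ** (dt / HALFLIFE)
    return beta * priority + (1.0 - beta) * usage

p = 0.5
for hour in range(24):                      # a day of holding 40 slots
    p = update_priority(p, usage=40.0, dt=3600.0)
print("after a day of running: %.1f" % p)   # roughly halfway to 40

for hour in range(24):                      # a day after the jobs finish
    p = update_priority(p, usage=0.0, dt=3600.0)
print("a day after finishing: %.1f" % p)    # decayed back by half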

Grid Middleware

[Diagram: OSG middleware stack. Source: OSG Twiki documentation]

OSG Middleware

OSG middleware is installed and updated by the Virtual Data Toolkit (VDT).

Site configuration was complex before the 1.0 release; it is simpler now.

Provides the Globus framework & security via a Certificate Authority.

Low maintenance: Resource Service Validation (RSV) provides a snapshot of the site.

Grid User Management System (GUMS) handles mapping of grid certificates to local users.
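
From a user's point of view the chain is: get a proxy from your grid certificate, then the gatekeeper maps it to a local account via GUMS. A hedged sketch of probing our CE (the /jobmanager-condor suffix is an assumption):

#!/usr/bin/env python
# Probe the CE as a grid user: create a proxy from the grid certificate, then
# run a trivial job through the Globus gatekeeper, which GUMS maps to a local
# account. The "/jobmanager-condor" contact suffix is an assumption.
import subprocess

CE = "uscms1.fltech-grid3.fit.edu"

subprocess.check_call(["grid-proxy-init"])   # or voms-proxy-init -voms cms
out = subprocess.check_output(
    ["globus-job-run", CE + "/jobmanager-condor", "/bin/hostname"])
print("gatekeeper ran our test job on: " + out.decode().strip())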

BeStMan Storage

Berkeley Storage Manager: the SE runs a basic gateway configuration - a short config, but hard to get working.

Not nearly as difficult as dCache - BeStMan is a good replacement for small to medium sites.

Allows grid users to transfer data to and from designated storage via an LFN, e.g. srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN=/bestman/BeStMan/cms...
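
A sketch of such a transfer, assuming a valid grid proxy and the BeStMan/VDT srm-copy client on the PATH; the destination path under /bestman/ is a hypothetical placeholder:

#!/usr/bin/env python
# Copy a local file to the SE through its SRM interface. Assumes a valid grid
# proxy and the BeStMan client tool "srm-copy"; the destination path is a
# hypothetical placeholder.
import subprocess

SRM_BASE = "srm://uscms1-se.fltech-grid3.fit.edu:8443/srm/v2/server?SFN="
DEST = "/bestman/BeStMan/cms/store/user/someuser/test.root"   # placeholder

subprocess.check_call([
    "srm-copy",
    "file:///tmp/test.root",   # local source file
    SRM_BASE + DEST,           # SRM destination on the Storage Element
])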

WLCG

Large Hadron Collider - expected 15 PB/year. The Compact Muon Solenoid detector will be a large part of this.

The Worldwide LHC Computing Grid (WLCG) handles the data and interfaces with sites in OSG, EGEE (European), etc.

Tier 0 - CERN; Tier 1 - Fermilab; closest Tier 2 - UFlorida.

Tier 3 - us! Not officially part of the CMS computing group (i.e. no funding), but very important for dataset storage and analysis.

T2/T3 sites in the US

[Map: CMS Tier 2 and Tier 3 sites in the US - source: https://cmsweb.cern.ch/sitedb/sitelist/]

[Plot: cumulative hours for FIT on OSG]

[Plot: local usage trends]

Trends

Over 400,000 cumulative hours for CMS

Over 900,000 cumulative hours by local users

Total of 1.3 million CPU hours utilized

Tier-3 Sites

Not yet completely defined. Consensus: T3 sites give scientists a framework for collaboration (via transfer of datasets) and also provide compute resources.

Regular testing by RSV and Site Availability Monitoring (SAM) tests, and OSG site info is published to CMS.

FIT is one of the largest Tier 3 sites.

RSV & SAM Results

PhEDEx

Physics Experiment Data Exports: the final milestone for our site.

Physics datasets can be downloaded from other sites or exported to other sites.

All relevant datasets are catalogued in the CMS Data Bookkeeping System (DBS), which keeps track of the locations of datasets on the grid.

A central web interface allows dataset copy/deletion requests.

Demo

http://myosg.grid.iu.edu

http://uscms1.fltech-grid3.fit.edu

https://cmsweb.cern.ch/dbs_discovery/aSearch?caseSensitive=on&userMode=user&sortOrder=desc&sortName=&grid=0&method=dbsapi&dbsInst=cms_dbs_ph_analysis_02&userInput=find+dataset+where+site+like+*FLTECH*+and+dataset.status+like+VALID*
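
The last link is just an HTTP GET of a DBS Data Discovery query; a sketch of issuing the same query from a script, with the query text taken from the URL above:

#!/usr/bin/env python
# Re-issue the DBS Data Discovery query from the link above as a plain HTTP
# GET. The query text is the one shown in the URL.
from urllib.parse import urlencode
from urllib.request import urlopen

params = {
    "caseSensitive": "on",
    "userMode": "user",
    "sortOrder": "desc",
    "sortName": "",
    "grid": "0",
    "method": "dbsapi",
    "dbsInst": "cms_dbs_ph_analysis_02",
    "userInput": "find dataset where site like *FLTECH* "
                 "and dataset.status like VALID*",
}
url = "https://cmsweb.cern.ch/dbs_discovery/aSearch?" + urlencode(params)
page = urlopen(url).read().decode("utf-8", "replace")
print(page[:500])   # show just the beginning of the returned page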

CMS Remote Analysis Builder (CRAB)

Universal method for experimental data processing

Automates the analysis workflow, i.e. status tracking and resubmissions

Datasets can be exported to the Data Discovery page

Used extensively locally in our muon tomography simulations.
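
Day to day this mostly means driving the crab command line; a sketch of the create/submit/track loop (CRAB2-style commands; the status parsing is simplified and illustrative):

#!/usr/bin/env python
# Drive a CRAB task described by crab.cfg: create and submit the jobs, then
# poll status and resubmit aborted ones. The status parsing is simplified
# and illustrative, not CRAB's exact report format.
import subprocess, time

subprocess.check_call(["crab", "-create"])
subprocess.check_call(["crab", "-submit"])

while True:
    report = subprocess.check_output(["crab", "-status"]).decode()
    if "Aborted" in report:
        subprocess.check_call(["crab", "-resubmit", "all"])
    if "Done" in report and "Running" not in report and "Submitted" not in report:
        break
    time.sleep(600)   # poll again in ten minutes

subprocess.check_call(["crab", "-getoutput"])   # retrieve the job output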

Network Performance

Changed to a default 64 kB block size across NFS

RAID array change to fix write-caching

Increased kernel memory allocation for TCP

Improvements in both network and grid transfer rates

DD copy tests across the network (a sketch of the test follows the tables below):

Reads improved from 2.24 to 2.26 GB/s

Writes improved from 7.56 to 81.78 MB/s

DD on the Frontend Before (64k block size):

Write time (s)   Write (MB/s)   Read time (s)   Read (GB/s)
102.9            10.4           0.45            2.4
94.5             11.4           0.49            2.2
288.5            3.7            0.49            2.2
244.89           4.4            0.49            2.2
135              7.9            0.49            2.2
Avg: 173.158     7.56           0.482           2.24

DD on the Frontend After (64k block size):

Write time (s)   Write (MB/s)   Read time (s)   Read (GB/s)
12.7             84.4           0.42            2.5
13.1             81.7           0.45            2.3
13.1             81.5           0.48            2.2
14               76.5           0.53            2.3
12.6             84.8           0.46            2.3
Avg: 13.1        81.78          0.468           2.26

Iperf on the Frontend Before:

TCP server (Mbit/s)   TCP client (Mbit/s)   UDP jitter (ms)   UDP lost   UDP bandwidth (Mbit/s)
753                   754                   0.11              0          1.05
912                   913                   0.022             0          1.05
896                   897                   0.034             0          1.05
891                   892                   0.393             0          1.05
888                   889                   1.751             0          1.05
Avg: 868              869                   0.462             0          1.05

Iperf on the Frontend After:

TCP server (Mbit/s)   TCP client (Mbit/s)   UDP jitter (ms)   UDP lost   UDP bandwidth (Mbit/s)
941                   942                   0.048             0          1.05
939                   940                   0.025             0          1.05
935                   937                   0.022             0          1.05
930                   931                   0.023             0          1.05
941                   942                   0.025             0          1.05
Avg: 937.2            938.4                 0.0286            0          1.05
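
A hedged sketch of the dd-style measurement behind the tables above: write a 1 GB file on the NFS mount at a 64 kB block size and read it back (paths and sizes are illustrative; the sysctl names in the comment are the usual TCP buffer knobs rather than our exact settings):

#!/usr/bin/env python
# Rough reproduction of the dd throughput test: write a 1 GiB file with a
# 64 kB block size, then read it back. Paths and sizes are illustrative.
import os
import subprocess
import time

TEST_FILE = "/mnt/nas-0-0/ddtest.bin"   # hypothetical path on the NFS mount

def timed(cmd):
    start = time.time()
    subprocess.check_call(cmd)
    return time.time() - start

write_s = timed(["dd", "if=/dev/zero", "of=" + TEST_FILE,
                 "bs=64k", "count=16384", "conv=fsync"])   # 16384 * 64 kB = 1 GiB
read_s = timed(["dd", "if=" + TEST_FILE, "of=/dev/null", "bs=64k"])
os.remove(TEST_FILE)

print("write: %.1f MB/s" % (1024.0 / write_s))
print("read:  %.2f GB/s" % (1.0 / read_s))

# The TCP side was tuned through the usual kernel buffer knobs, e.g.
#   sysctl -w net.core.rmem_max=... net.core.wmem_max=...
#   sysctl -w net.ipv4.tcp_rmem="4096 87380 ..." net.ipv4.tcp_wmem="4096 65536 ..."
# and checked with "iperf -s" on one host and "iperf -c <host>" (plus -u for UDP).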