SAMGrid – A fully functional computing grid based on standard technologies
Igor Terekhov for the JIM team FNAL/CD/CCF
Igor Terekhov, FNAL
Outlook
Brief History, D0 and CDF computing, Grid Jobs and Information Management
Architecture
Job management
Information management
Summary
History
Run II CDF and D0, the two largest currently running collider experiments
Each experiment to accumulate ~1 PB/yr of raw, reconstructed, and analyzed data by 2007. Get the Higgs jointly.
Real data acquisition – 5 /wk, 25 MB/s, 1 TB/day, plus MC
Globally Distributed Computing and Grid
D0 – 80 institutions, 18 countries. CDF – 60 institutions, 12 countries.
Many institutions have computing (including storage) resources, dozens for each of D0, CDF
Some of these are actually shared, regionally or experiment-wide
Sharing is good
• A possible contribution by the institution into the collaboration while keeping it local
• Recent Grid trend (and its funding) encourages it
Goals of Globally Distributed Computing in Run II
To distribute data to processing centers – SAM is a way
To benefit from the pool of distributed resources – minimize job turnaround time, yet keep a single interface
To facilitate and automate job placement
To reliably execute jobs spread across multiple resources
To provide an aggregate view of the system and its activities and keep track of what’s happening
To maintain security
Finally, to learn and prepare for the LHC computing
Data Distribution - SAM
SAM is Sequential data Access via Meta-data.
http://{d0,cdf}db.fnal.gov/sam
Presented numerous times at CHEP conferences
Core features: meta-data cataloguing, global data replication and routing, co-allocation of compute and data resources
Global data distribution:
• MC import from remote sites
• Off-site analysis centers
• Off-site reconstruction (D0)
See Lee Lueking’s talk for more details
Now that the Data’s Distributed: JIM
Grid Jobs and Information Management
Owes to the D0 Grid funding – PPDG (the FNAL team), UK GridPP (Rod Walker, ICL)
Very young – started 2001
Actively explore, adopt, enhance, develop new Grid technologies
Collaborate with the Condor team from The University of Wisconsin on Job management
JIM with SAM is also called The SAMGrid
Job Management Strategies
We distinguish grid-level (global) job scheduling (selection of a cluster to run) from local scheduling (distribution of the job within the cluster)
We distinguish structured jobs from unstructured.
• Structured jobs have their details known to Grid middleware.
• Unstructured jobs are mapped as a whole onto a cluster.
In the first phase, we want reasonably intelligent scheduling and reliable execution of unstructured data-intensive jobs.
Job Management Highlights
We seek to provide automated resource selection (brokering) at the global level, with final scheduling done locally
Focus on data-intensive jobs:
Execution time is composed of:
• Time to retrieve any missing input data
• Time to process the data
• Time to store output data
To leading order, we rank sites by the amount of data cached at the site (minimizing missing input data)
The scheduler is interfaced with the data handling system
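The leading-order ranking described on this slide can be illustrated with a minimal Python sketch. It is not the JIM implementation: the `Site` class, the file names, and the sizes are hypothetical, and the only idea taken from the slide is that a site with less missing input data ranks higher.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    """A candidate execution site and the files its cache holds (hypothetical model)."""
    name: str
    cached_files: set = field(default_factory=set)


def missing_bytes(site, job_files):
    """Bytes of job input that the site would still have to retrieve."""
    return sum(size for name, size in job_files.items()
               if name not in site.cached_files)


def rank_sites(sites, job_files):
    """Order sites so that the one with the least missing input comes first."""
    return sorted(sites, key=lambda s: missing_bytes(s, job_files))


# Hypothetical job input: file name -> size in bytes.
job_files = {"raw_001": 500, "raw_002": 300, "raw_003": 200}

sites = [
    Site("site_a", {"raw_001"}),
    Site("site_b", {"raw_001", "raw_002"}),
    Site("site_c", set()),
]

ranking = [s.name for s in rank_sites(sites, job_files)]
print(ranking)
```

In a real deployment the cache contents would presumably come from the data handling system rather than from a static set, which is why the slide stresses that the scheduler is interfaced with it.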
Job Management – Distinct JIM Features
Decision making is based on both:
• Information existing irrespective of jobs (resource description)
• Functions of (job, resource)
Decision making is interfaced with data handling middleware rather than individual Storage Elements (SEs) or a Replica Catalog (RC) alone: this allows incorporation of data handling (DH) considerations
Decision making is entirely in the Condor framework (no Resource Broker of our own) – strong promotion of standards, interoperability
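The two kinds of input to decision making can be sketched in plain Python. This is an illustration only and does not use the Condor ClassAd language or any Condor API: the dictionary keys, the `match` helper, and the `cached_fraction` rank function are all hypothetical. Static resource attributes (`arch`, `free_slots`) stand in for information that exists irrespective of jobs, while `cached_fraction` is a function of the (job, resource) pair.

```python
def match(job, resources, rank):
    """Pick the resource that meets the job's requirements and maximizes rank(job, resource).

    Returns None when no resource satisfies the requirements.
    """
    candidates = [r for r in resources if job["requirements"](r)]
    if not candidates:
        return None
    return max(candidates, key=lambda r: rank(job, r))


def cached_fraction(job, resource):
    """Function of (job, resource): share of the job's input already cached there.

    In this sketch the cached set is a static field; in practice it would be
    reported by the data handling system.
    """
    needed = job["input_files"]
    if not needed:
        return 0.0
    return len(needed & resource["cached_files"]) / len(needed)


# Hypothetical resource descriptions.
resources = [
    {"name": "cluster_x", "arch": "linux", "free_slots": 10, "cached_files": {"f1"}},
    {"name": "cluster_y", "arch": "linux", "free_slots": 0, "cached_files": {"f1", "f2", "f3"}},
    {"name": "cluster_z", "arch": "linux", "free_slots": 4, "cached_files": {"f1", "f2"}},
]

# Hypothetical job: needs a Linux cluster with at least one free slot.
job = {
    "input_files": {"f1", "f2", "f3"},
    "requirements": lambda r: r["arch"] == "linux" and r["free_slots"] > 0,
}

best = match(job, resources, cached_fraction)
print(best["name"])
```

Here `cluster_y` holds all the input but is excluded by the requirements, so the choice falls to the eligible cluster with the largest cached share.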
[Figure: Job Management architecture. Labeled components: User Interface, Submission Client, Broker, Match Making Service, Information Collector, Job; Execution Site #1 … Execution Site #n, with Computing Element, Queuing System, Grid Sensors, Storage Element, and Data Handling System.]
Monitoring Highlights
Sites (resources) and jobs
Distributed knowledge about jobs etc.
Incremental knowledge building
GMA for current state inquiries, Logging for recent history studies
All Web based
[Figure: JIM Monitoring. Labeled components: Web Browser, Web Server 1 … Web Server N, Site 1 … Site N Information System, IP.]
Information Management – Implementation and Technology Choices
XML for representation of site configuration and (almost) all other information
XQuery and XSLT for information processing
Xindice and other native XML databases for database semantics
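As a rough illustration of keeping site configuration in XML and deriving other information from it, the sketch below parses a small configuration and builds per-cluster summaries. It swaps in Python's standard `xml.etree.ElementTree` for the XQuery and XSLT processing named on the slide, and every element and attribute name (`site`, `cluster`, `computing_element`, `storage_element`, `slots`, `cache_gb`) is hypothetical rather than taken from the JIM schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical site configuration; the real JIM schema is not shown in the slides.
SITE_CONFIG = """\
<site name="example_site">
  <cluster name="farm1">
    <computing_element queue="batch" slots="40"/>
    <storage_element cache_gb="500"/>
  </cluster>
  <cluster name="farm2">
    <computing_element queue="batch" slots="16"/>
    <storage_element cache_gb="200"/>
  </cluster>
</site>
"""


def resource_advertisements(xml_text):
    """Derive one flat summary per cluster from the site configuration."""
    root = ET.fromstring(xml_text)
    ads = []
    for cluster in root.findall("cluster"):
        ce = cluster.find("computing_element")
        se = cluster.find("storage_element")
        ads.append({
            "site": root.get("name"),
            "cluster": cluster.get("name"),
            "slots": int(ce.get("slots")),
            "cache_gb": int(se.get("cache_gb")),
        })
    return ads


ads = resource_advertisements(SITE_CONFIG)
total_slots = sum(ad["slots"] for ad in ads)
print(ads)
print(total_slots)
```

The point of the sketch is the workflow: a single XML description of the site acts as the source from which advertisements and monitoring views can be generated, rather than maintaining each separately.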
[Figure: configuration schemas. Labels: Main Site/cluster Config, Schema, Meta-Schema, Resource Advertisement, Monitoring Schema, DataHandling, HostingEnvironment, …]
SAMGrid Project Status
Core SAM in maintenance stage
JIM – Delivered prototype for D0, Oct 10, 2002. Now deploying V1.
• Remote job submission
• Brokering based on data cached
• Web-based monitoring
SC-2002 demo – 11 sites (D0, CDF), big success
Post V1 – OGSA, Web services, NG logging service
Summary
Run II experiments’ computing is highly distributed; the Grid trend is very relevant
The JIM (Jobs and Information Management) part of the SAMGrid addresses the needs for global and grid computing at Run II
We use Condor and Globus middleware to schedule jobs globally (based on data), and provide Web-based monitoring
Acks
V. White, who created SAM and (co-)led it to success
PPDG project management, for making d0grid possible
GridPP project in the UK, for its funding
Members of the Condor team for fruitful discussions