Status of DØ Computing at UTA
• Introduction
• The UTA – DØ Grid team
• DØ Monte Carlo Production
• The DØ Grid Computing
  – DØRAC
  – DØSAR
  – DØ Grid Software Development Effort
• Impact on Outreach and Education
• Conclusions
DoE Site Visit, Nov. 13, 2003
Jae Yu, University of Texas at Arlington
Nov. 13, 2003, Status of DØ Computing Effort, DoE Site Visit, Jae Yu
Introduction
• UTA has been the US leader in producing DØ MC events
• UTA led the effort to
  – Start remote computing at DØ
  – Define the remote computing architecture at DØ
  – Implement the remote computing design at DØ in the US
• Leveraged its experience as the ONLY active US DØ MC farm → no longer the only one
• UTA is the leader in the US DØ Grid effort
• The UTA DØ Grid team has been playing a leadership role in monitoring software development
The UTA-DØGrid Team
• Faculty: Jae Yu, David Levine (CSE)
• Research Associate: HyunWoo Kim
  – SAM/Grid expert
  – Development of the McFarm SAM/Grid job manager
• Software Program Consultant: Drew Meyer
  – Development, improvement, and maintenance of McFarm
• CSE Master's Degree Students:
  – Nirmal Ranganathan: Investigation of resource needs in Grid execution
• EE M.S. Student: Prashant Bhamidipati
  – MC farm operation and McPerM development
• PHY Undergraduate Student: David Jenkins
  – Taking over MC farm operation and development of the monitoring database
• Graduated:
  – Three CSE MS students → all in industry
  – One CSE undergraduate student, now in an MS program at U. of Washington
UTA DØ MC Production
• Two independent farms
  – Swift farm (HEP)
    • 36 P3 866 MHz CPUs
    • 250 MB/CPU
    • A total of 0.6 TB of disk space
  – CSE farm
    • 12 P3 866 MHz CPUs
• McFarm as our production control software
• Statistics (11/1/2002 – 11/12/2003):
  – Produced: ~10M events
  – Delivered: ~8M events
What do we want to do with the data?
Want to analyze data no matter where we are!!!
Location- and time-independent analysis
DØ Data Taking Summary
30~40M events/mo
What do we need for efficient data analyses in a HEP experiment?
• Total expected data size is ~4 PB (4 million GB = 100 km of 100 GB hard drives)!!!
• Detectors are complicated → need many people to construct and operate them
• The collaboration is large and scattered all over the world
• Allow software development at remote institutions
• Optimized resource management, job scheduling, and monitoring tools
• Efficient and transparent data delivery and sharing
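As a back-of-envelope check of the data-volume bullet above, a minimal sketch (decimal units; the 100 GB drive size is taken from the slide):

```python
# Sanity check of the ~4 PB expected data size quoted on this slide.
TOTAL_PB = 4
total_gb = TOTAL_PB * 1_000_000      # 4 PB = 4 million GB (decimal units)
drives = total_gb // 100             # number of 100 GB hard drives needed
print(total_gb, drives)              # 4000000 40000
```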
DØ Collaboration
650 collaborators, 78 institutions, 18 countries
Old Deployment Models
Started with the Fermilab-centric SAM infrastructure in place, then transitioned to a hierarchically distributed model.
DØ Remote Analysis Model (DØRAM)
[Diagram: hierarchical tiers — a Central Analysis Center (CAC) at the top, Regional Analysis Centers (RAC) below it, Institutional Analysis Centers (IAC) under each RAC, and Desktop Analysis Stations (DAS) at the bottom. Normal interaction communication paths follow the hierarchy; occasional interaction communication paths cross it.]
What is a DØRAC?
• A large concentrated computing resource hub
• An institute willing to provide storage and computing services to a few small institutes in the region
• An institute capable of providing increased infrastructure as the data from the experiment grows
• An institute willing to provide support personnel
• Complementary to the central facility
DØ Southern Analysis Region (DØSAR)
The first US region, centered around the UTA RAC
[Map: member sites in Mexico/Brazil, OU/LU, UAZ, Rice, LTU, UTA, KU, KSU, and Ole Miss]
It is a regional virtual organization (RVO) within the greater DØ VO!!
SAR Institutions
• First Generation IACs
  – Langston University
  – Louisiana Tech University
  – University of Oklahoma
  – UTA
• Second Generation IACs
  – Cinvestav, Mexico
  – Universidade Estadual Paulista, Brazil
  – University of Kansas
  – Kansas State University
• Third Generation IACs
  – Ole Miss, MS
  – Rice University, TX
  – University of Arizona, Tucson, AZ
Goals of DØ Southern Analysis Region
• Prepare institutions within the region for grid-enabled analyses using the RAC at UTA
• Enable IACs to contribute to the experiment as much as they can, including MC production and data re-processing
• Provide grid-enabled software and computing resources to the DØ collaboration
• Provide regional technical support and help new IACs
• Perform physics data analyses within the region
• Discover and draw in more computing and human resources from external sources
SAR Workshops
• Semi-annual workshops to promote healthy regional collaboration and to share expertise
• Two workshops held so far
  – April 18-19, 2003 at UTA: ~40 participants
  – Sept. 25-26, 2003 at OU: 32 participants
• Each workshop had different goals and outcomes
  – Established SAR, RAC & IAC web pages and e-mail lists
  – Identified institutional representatives
  – Enabled three additional IACs with MC production
  – Paired new institutions with existing ones
SAR Strategy
• Set up all IACs with the full DØ software setup (DØRACE Phase 0 – IV)
• Install the Condor (or PBS) batch control system on desktop farms or clusters
• Install the McFarm MC production control
• Produce MC events on IAC machines
• Install Globus for monitoring information transfer
• Install SAM-Grid and interface McFarm to it
• Submit jobs through SAM-Grid and monitor them
• Perform analysis at the individual's desk
SAR Software Status
• Up-to-date with DØ releases
• McFarm MC production control
• Condor or PBS as batch control
• Globus v2.xx for grid-enabled communication
  – Globus & DOE SG certificates obtained and installed
• SAM/Grid on two of the farms (the UTA IAC farms)
UTA Software for SAR
• McFarm job control
  – All DØSAR institutions use this product for automated MC production
• Ganglia resource monitoring
  – Covers 7 clusters (332 CPUs), including the Tata Institute, India
• McFarmGraph: MC job status monitoring system using GridFTP
  – Provides detailed information for an MC request
• McPerM: MC farm performance monitoring
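As an illustration of the kind of per-request summary a job status monitor like McFarmGraph could present, a minimal sketch; the job states and data layout here are assumed, not the tool's actual schema:

```python
# Illustrative: roll up per-job states into a per-request summary,
# the kind of view a status-monitoring page might display.
from collections import Counter

def summarize(jobs):
    """Count jobs in each state for one MC request."""
    return dict(Counter(state for _, state in jobs))

jobs = [("job001", "done"), ("job002", "running"),
        ("job003", "done"), ("job004", "queued")]
print(summarize(jobs))   # {'done': 2, 'running': 1, 'queued': 1}
```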
Ganglia Grid Resource Monitoring
[Screenshot of the Ganglia monitoring page, annotated at the 1st SAR workshop]
Job Status Monitoring: McFarmGraph
Farm Performance Monitor: McPerM
[Plot annotation: increased productivity]
UTA RAC and Its Status
• NSF MRI funded facility
  – Joint proposal of UTA HEP and CSE + UTSW Medical
  – 2 HEP, 10 CSE, and 2 UTSW Medical participants
• Core system (high-throughput research system)
  – CPU: 64 P4 Xeon 2.4 GHz (total ~154 GHz)
  – Memory & NIC: 1 GB/CPU & a 1 Gbit/sec port each (total of 64 GB)
  – Storage: 5 TB Fibre Channel served by 3 GFS servers (3 Gbit/sec throughput)
  – Network: Foundry switch w/ 52 Gbit/sec + 24 100 Mbit/sec ports
• Expansion system (high-CPU-cycle, large-storage grid system)
  – CPU: 100 P4 Xeon 2.6 GHz (total ~260 GHz)
  – Memory & NIC: 1 GB/CPU & a 1 Gbit/sec port each (total of 100 GB)
  – Storage: 60 TB IDE RAID served by 10 NFS servers
  – Network: 52 Gbit/sec
• The full facility went online on Oct. 31, 2003
• Software installation in progress
• Plan to participate in the SC2003 demo next week
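A quick arithmetic check of the aggregate CPU figures quoted for the two systems above:

```python
# Sanity check of the UTA RAC aggregate CPU figures.
core_ghz = 64 * 2.4        # core system: 64 Xeons at 2.4 GHz
expansion_ghz = 100 * 2.6  # expansion system: 100 Xeons at 2.6 GHz
print(round(core_ghz), round(expansion_ghz))   # 154 260
```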
Just to Recall Two Years Ago….
[Diagram: a disk server connected through a Gbit switch to a stack of IDE-RAID arrays]
• IDE hard drives are ~$2.5/GB
• Each IDE RAID array gives ~1.6 TB, hot swappable
• Can be configured for up to 10-16 TB in a rack
• A modest server can manage the entire system
• Gbit network switch provides high-throughput transfer to the outside world
• Flexible and scalable system
• Need an efficient monitoring and error recovery system
• Communication to resource management
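A rough cost check of the IDE-RAID figures on this slide ($2.5/GB, ~1.6 TB per array, up to 16 TB per rack; decimal units assumed):

```python
# Rough cost of a fully populated 16 TB IDE-RAID rack at ~$2.5/GB.
PRICE_PER_GB = 2.5
rack_tb = 16
cost = rack_tb * 1000 * PRICE_PER_GB   # dollars for a full 16 TB rack
arrays = round(rack_tb / 1.6)          # ~1.6 TB per IDE-RAID array
print(cost, arrays)                    # 40000.0 10
```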
UTA DØRAC
• 100 P4 Xeon 2.6 GHz CPUs = 260 GHz; 64 TB of disk space
• 84 P4 Xeon 2.4 GHz CPUs = 202 GHz; 7.5 TB of disk space
• Total CPU: 462 GHz
• Total disk: 73 TB
• Total memory: 168 GB
• Network bandwidth: 54 Gb/sec
SAR Accomplishments
• Held two workshops; a third is planned
• All first generation institutions produce MC events using McFarm on desktop PC farms
  – Generated MC events: OU: 300k, LU: 250k, LTU: 150k, UTA: ~1.3M
  – Discovered additional resources
• Significant local expertise has been accumulated in running farms and producing MC events
• Produced several documents, including two DØ notes
• Hold regular bi-weekly meetings (VRVS) to keep up progress
• Working toward data re-processing
SAR Computing Resources

Institution   CPU (GHz)         Storage (TB)      People
Cinvestav     13                1.1               1F+?
Langston      13                1                 1F+1GA
LTU           25+12             0.5+0.5           1F+1PD+2GA
KU            12                ??                1F+1PD(?)
KSU           40                1.2               1F+2GA
OU            36+27 (OSCER)     1.8+120 (tape)    4F+3PD+2GA
Sao Paulo     60+144 (future)   3                 1F+many
UTA           192               31                2F+1.4PD+0.5C+3GA
Total         430               40+120 (tape)     12F+6PD+10GA
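A cross-check of the Total CPU row in the table above (current GHz only: OU's OSCER share included, Sao Paulo's future 144 GHz excluded, matching the quoted total):

```python
# Cross-check of the "Total" CPU column in the SAR resources table.
cpu_ghz = {
    "Cinvestav": 13, "Langston": 13, "LTU": 25 + 12, "KU": 12,
    "KSU": 40, "OU": 36 + 27, "Sao Paulo": 60, "UTA": 192,
}
print(sum(cpu_ghz.values()))   # 430
```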
SAR Plans
• Four second generation IACs have been paired with four first generation institutions
  – Success is defined as:
    • Regular production and delivery of MC events to SAM using McFarm
    • Installing SAM/Grid and performing a simple SAM job
  – Add all these new IACs to Ganglia, McFarmGraph, and McPerM
• Discover and integrate more resources for DØ
  – Integrate OU's OSCER cluster
  – Integrate other institutions' large, university-wide resources
• Move toward grid-enabled regional physics analyses
  – Collaborators need to be educated to use the system
Future Software Projects
• Preparation of UTA DØRAC equipment
  – MC production (DØ is suffering from a shortage of resources)
  – Re-reconstruction
  – SAM/Grid
• McFarm
  – Integration of re-processing
  – Enhanced monitoring
  – Better error handling
• McFarm interface to SAM/Grid (job_manager)
  – Initial script successfully tested for the SC2003 demo
• Work with the SAM-Grid team on the monitoring database and integration of McFarm technology
• Improvement and maintenance of McFarmGraph and McPerM
• Universal graphical user interface to the Grid (PHY PhD student)
SAR Physics Interests
• OU/LU:
  – EWSB/Higgs searches
  – Single top search
  – CPV / rare decays in heavy flavors
  – SUSY
• LTU:
  – Higgs search
  – B-tagging
• UTA:
  – SUSY
  – Higgs searches
  – Diffractive physics
• Diverse topics, but common samples can be defined
Funding at SAR
• Hardware support
  – UTA RAC: NSF MRI
  – UTA IAC: DoE + local funds
    • Totally independent of RAC resources
    • Need more hardware to adequately support desktop analyses utilizing RAC resources
• Software support
  – Mostly UTA local funding → will run out this year!!!
  – Tried many different sources, but none has worked out
• We seriously need help to
  – Maintain the leadership in DØ remote computing
  – Maintain the leadership in grid computing
  – Realize DØRAM and expeditious physics analyses
Tevatron Grid Framework: SAM-Grid
• DØ already has the data delivery part of the Grid system (SAM)
• The project started in 2001 as part of the PPDG collaboration to handle DØ's expanded needs
• The current SAM-Grid team includes:
  – Andrew Baranovski, Gabriele Garzoglio, Lee Lueking, Dane Skow, Igor Terekhov, Rod Walker (Imperial College), Jae Yu (UTA), Drew Meyer (UTA), and HyunWoo Kim (UTA), in collaboration with the U. Wisconsin Condor team
  – http://www-d0.fnal.gov/computing/grid
• UTA is developing an interface from McFarm to SAM-Grid
• This brings all SAR institutions, plus any institution running McFarm, into the DØGrid
Fermilab Grid Framework (SAM-Grid)
[Architecture diagram, with UTA marked as a participating site]
UTA-FNAL CSE Master's Student Exchange Program
• To establish usable Grid software on the DØ time scale, the project needs highly skilled software developers
  – FNAL cannot afford computer professionals
  – The UTA CSE department has 450 MS students → many are highly trained but back in school due to the economy
  – Students can participate in cutting-edge Grid computing topics in a real-life situation
  – Students' Master's theses become a well-documented record of the work, something lacking in many HEP computing projects
• The third generation of students is at FNAL working on improvement of SAM-Grid and its implementation → two-semester rotation period
• The previous two generations made a significant impact on SAM-Grid
  – One of the four previous-generation students is in the PhD program at CSE
  – One is on the Wisconsin Condor team → possibility of moving into a PhD program
  – Two are in industry
Impact on Education and Outreach
• The UTA DØ Grid program has
  – Trained: 12 students (10 MS + 1 undergraduate)
  – Graduated: 5 CSE Masters + 1 undergrad
  – CSE Grid course: many class projects on DØ
• QuarkNet
  – UTA is one of the founding institutions of the QuarkNet program
  – Initiated the TECOS project
  – Other school roof-top cosmic-ray projects across the nation need storage and computing resources → QuarkNet Grid
  – Will work with QuarkNet on data storage & eventual use of computing resources by teachers and students
• UTA recently became a member of the Texas grid (HiPCAT)
  – HEP is leading this effort
  – Strongly supported by the university
  – Expect a significant increase in infrastructure, such as bandwidth
Conclusions
• The UTA DØ Grid team has accomplished a tremendous amount
• UTA has played a leading role in DØ remote computing
  – MC production
  – Design of the DØ Grid architecture
  – Implementation of DØRAM
• The DØ Southern Analysis Region is a great success
  – Four new institutions (3 US) are now MC production sites
  – Enabled exploitation of available talent and resources in an extremely distributed environment
  – Remote expertise is being accumulated
• The UTA DØRAC is up and running → software installation in progress
  – Soon to add significant resources to SAR and to DØ
• The SAM-Grid interface to McFarm is working → one step closer to establishing a globalized grid
• The UTA-FNAL MS student exchange program is very successful
• The UTA DØ Grid computing program has a significant impact on outreach and education
• UTA is the ONLY US DØ institution that has been playing a leading role in the DØ grid → this makes UTA unique
• The local support runs out this year!! UTA needs support to maintain its leadership in, and support for, DØ remote computing