Post on 12-Jan-2016
Grid Computing Introduction
Tuesday Morning Session
Brian Bockelman, bbockelm@cse.unl.edu
OSG Staff
University of Nebraska-Lincoln
Part I: Introduction to Grids
OSG Summer School 2010
Outline
• Grid Computing, definitions and implementation.
• How security and handling works on the OSG.
• The OSG approach to grid computing.
Grid Computing
• As usual, Wikipedia offers a decent starting point:
  • Grid computing is the combination of computer resources from multiple administrative domains for a common goal.
• Grid computing is used to perform computations which may not be feasible otherwise. Reasons may be:
  • Practical (one site can’t hold all the computers)
  • Opportunistic (an organization wants to take advantage of more computing resources)
  • Political (multiple big sites working together).
Grid Computing
• Important aspects of the definition:
  • “Combination of computing resources”:
    • Implies each resource can function separately.
    • Overall, extra layer of difficulty to handle compared to using a single resource (even if this is hidden from the end user!).
  • “Multiple administrative domains”:
    • There must be some level of trust between the user and sites.
    • These trust relationships can be very complex!
Original Idea
• The original idea behind grid computing was to make computing power as easy to access as the electrical grid.
  • You could take your job and plug it in to “the grid”.
  • Everyone can use the same interface.
  • This metaphor also implied grids would be as easy to use as the power system… (… which might have been a pipe dream)
What makes it unique?
• Food for thought:
  • What’s the difference between grid computing and cloud computing?
  • What’s the difference between grid computing and the capabilities Condor provides?
Grids in the US
• The two largest grids in the US are the Teragrid and the OSG.
  • Both are formed by taking traditional computing sites and allowing users to access resources in a somewhat uniform manner.
  • Resources include:
    • Unique supercomputers: IBM Blue Genes
    • Linux clusters: Loosely-coupled Intel/Linux
    • Data archives: Long-term tape storage
    • Large-scale data systems: Distributed or clustered file systems, providing hundreds of TB to multiple petabytes.
Teragrid, in a nutshell
• The Teragrid is formed by a small number (fewer than 10) of computing centers.
  • These are some of the largest computing centers in the world.
  • Often multiple, unique resources per center.
  • Clearly favors a few incredibly powerful resources.
  • By invite only.
  • Compute time is allocated by committee.
  • Access via both grid protocols and ssh logins.
The Open Science Grid
• The OSG is a grid formed by 80 sites across the US and the world.
  • Most sites are small-to-medium Linux clusters, with a few large clusters.
  • Primary stakeholders are the LHC and LIGO.
  • Focus is on data-intensive, high-throughput processing (not “traditional supercomputing”).
  • Compute time is allocated by individual site policy.
  • Strong emphasis on decentralization.
  • No SSH logins allowed to remote sites.
https://twiki.grid.iu.edu/bin/view/ReleaseDocumentation/WhatIsOSG
Map of OSG sites
To give you a feel for the distribution of the OSG sites…
Today’s Sites
• We’ll be using 3 different sites today.
  • OSG-EDU: A small, educational resource here at Wisconsin. <100 cores managed by Condor, only a few GB of NFS-based storage.
  • Nebraska: A medium-sized production site owned by CMS in Lincoln, NE. 1500 cores managed by Condor, about 350TB of Hadoop storage.
  • Firefly: A large production site owned by HCC in Omaha, NE. ~5000 cores managed by PBS, 150TB of Panasas storage.
OSG Compute Resources
• The OSG CE is layered on top of a traditional batch system.
[Figure: diagram showing a User, the OSG CE, the Batch System services, and the worker nodes (WN) inside the Site Cluster]
Inside the OSG CE
• The current core of the OSG CE is the Globus Toolkit. Important parts are:
  • Globus GRAM: Translates jobs in RSL format to batch system jobs; allows generic batch system commands to be translated to the site.
  • GASS server: Used to stage small files in and out of the host.
  • GridFTP server: Used to move large files in and out of the host.
Anatomy of a Grid Job
• All job creations will have the following steps:
  • User creates job and determines which OSG CE it will be sent to.
  • User submits job description and files to Globus on the OSG CE.
  • Globus converts the job to a batch system job, and submits that.
  • Job starts running on the worker node.
• Job finishing goes up in the reverse direction.
• For the user to know the job status, there are 3 different systems (user, Globus, batch system) that must be in sync.
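To see why the three systems can briefly disagree about a job, here is a minimal Python sketch of the status hand-off. It is only an illustration under simple assumptions: the `Layer` class, the status strings, and `propagate_up` are hypothetical names and are not part of Globus or Condor.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    """One system that keeps its own copy of the job status (hypothetical model)."""
    name: str
    status: str = "RUNNING"


def propagate_up(layers):
    """Move status one hop upward. `layers` is ordered user -> Globus -> batch system."""
    for upper, lower in zip(layers, layers[1:]):
        upper.status = lower.status
    return layers


user, globus, batch = Layer("user"), Layer("Globus"), Layer("batch system")
batch.status = "DONE"  # the job finishes on the worker node

propagate_up([user, globus, batch])
after_one_hop = (user.status, globus.status, batch.status)

propagate_up([user, globus, batch])
after_two_hops = (user.status, globus.status, batch.status)
```

After one hop the user's view still lags behind the batch system; only after the second hop do all three copies agree.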
OSG CE, In Summary
• The core of the CE is the Globus middleware (using the GT2 protocol).
  • Globus allows an abstract interface to your system to be exported to the world.
  • Provides the means to allow users to submit to multiple administrative domains.
• The CE interacts with your cluster’s batch system; grid jobs are converted to and run as batch system jobs.
  • The component doing the conversion is called the “Job Manager”.
  • Condor sites use “jobmanager-condor”, PBS sites use “jobmanager-pbs”, etc.
And everything else
• There are quite a few more OSG components:
  • Monitoring
  • Accounting
  • Information services
  • Storage and transfer
• Which I will not be covering in this talk.
• You’ll learn them throughout the week (assuming you don’t skip class).
Review
• Grid computing allows one to utilize multiple computing resources from multiple administrative domains.
  • This is more complex than traditional batch systems.
• OSG is one implementation of a grid; its technology is based upon the Globus Toolkit and Condor.
  • It has almost 100 computing resources (clusters) and 50 storage resources.
Part II: Trust Relationships
Trust Relationships
• What kind of trust relationships do we encounter in the airport?
  • Passports
  • Tickets
  • “Secured area” inside terminal
Trust Relationships in the Grid
• In the grid, we usually think of the users/organizations as the consumers and the sites as the producers.
  • How does the site know you are who you say you are?
  • How does the site know you are allowed to submit jobs?
Compare to the Cloud
• What kind of trust relationship is there in the Amazon EC2 cloud?
  • You trust that an SSL connection with the Amazon SSL certificate has Amazon on the other side.
  • Amazon allows you to use the compute resources if your credit card number is valid.
  • You trust Amazon gives you a certain amount of computing for your money.
  • What else?
• Note the trust is 2 way!
The one you forgot about!
• How do you trust the site?
  • After all, this is your data! How do you know they aren’t going to steal your Ph.D. thesis? Your new novel protein?
  • (How is this different from any case of using computing you don’t own?)
  • The answer is that you trust the organization running the resource.
  • Often, you trust the OSG to only allow reputable organizations to join.
  • In this case, the trust relationship is based on society, not technology.
  • Keep this in mind: the societal aspects are sometimes just as important as the technology.
  • Read “Reflections on Trusting Trust”!
Authn and Authz
• In order to establish trust relationships on the grid, two things need to happen:
  • Authentication (authn): The process of establishing an identity for your job.
  • Authorization (authz): Determining that your job is allowed to run at the site.
• Think: What authentication and authorization need to happen at an airport?
X509 and GSI Security
• In the OSG, authentication happens using a grid certificate.
  • This is simply a personal SSL certificate.
  • The grid certificate is signed by a trusted authority (the Certificate Authority), which vouches for your identity.
  • When you need to temporarily delegate your rights elsewhere, such as to a remote job, you can use your certificate to form a proxy certificate. This authenticates a grid job as belonging to you.
  • You get the blame for any grid job that carries your proxy.
Your grid certificate identifies who you are!
Authorization on the OSG
• It would be very hard for sites to authorize each user independently (think: CMS has around 2000 users).
• Instead, each site authorizes the organizations they want to partner with to run at their site.
  • And the organization securely informs the site who is in their organization.
  • Because these organizations don’t always deal with a physical entity (like a single campus or lab), they’re referred to as a “virtual organization” or a VO.
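The VO-based decision can be sketched in a few lines of Python. This is a toy model, not how OSG software is implemented: the VO names beyond osgedu and CMS, the DN strings, and the `authorize` function are all made up for illustration.

```python
# Hypothetical data: the site lists the VOs it supports,
# and each VO tells the site which certificate DNs are its members.
SITE_SUPPORTED_VOS = {"osgedu", "cms"}
VO_MEMBERS = {
    "osgedu": {"/DC=org/DC=example/CN=Alice Student"},
    "cms": {"/DC=org/DC=example/CN=Bob Physicist"},
    "othervo": {"/DC=org/DC=example/CN=Carol Visitor"},
}


def authorize(dn, vo):
    """Return True if the site supports the VO and the VO vouches for this DN."""
    return vo in SITE_SUPPORTED_VOS and dn in VO_MEMBERS.get(vo, set())


alice_ok = authorize("/DC=org/DC=example/CN=Alice Student", "osgedu")
alice_wrong_vo = authorize("/DC=org/DC=example/CN=Alice Student", "cms")
carol_unsupported_vo = authorize("/DC=org/DC=example/CN=Carol Visitor", "othervo")
```

Note that the site never keeps a per-user list of its own; it only decides which VOs it supports and relies on each VO for membership.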
Take Home Message
• You are identified by your certificate. This is your authentication.
• You must join a VO to use the OSG.
• A site makes authorization decisions based upon a VO.
• With this model, we tend to minimize the number of communications between the site, user, and organization.
• The OSG implements authorization and authentication on top of x509 certificates and PKI.
Segue: Hands-on with security
Hands-On with Security
• Bounce on over to the following URL: https://twiki.grid.iu.edu/bin/view/Education/GlobusToolsOss2010
Hands-On with Security
• You should have learned…
  • That you have a personal certificate.
  • Your DN and CN.
  • That you are a member of an OSG VO (osgedu, at least).
  • How to create a certificate.
  • The difference between a plain “grid” certificate and a “voms” certificate.
  • How to test Globus authentication.
  • And a few common error messages!
Segue: Globus Tools
Hands-On with Globus Tools
• You should have…
  • Run a single executable and a script.
  • Explored the difference between the different jobmanagers.
  • Thought about what it would take to manage HTC-type grid jobs.
Part III: Condor-G
Condor-G
• Ok, we know what the grid is…
  • … and how to authenticate …
  • … how the heck do we use it effectively?
• There is an added complexity in using multiple sites, but many things remain the same:
  • Job submission
  • Checking job status
  • Cancelling jobs
  • Managing input/output
  • Managing job dependencies
Observation
• The requirements for the grid and for the batch system look about the same!
• Key insight: We can reuse a large portion of our batch system to command and control grid jobs.
  • In fact, this means we can present familiar interfaces to the user!
• Hence, Condor-G was born.
Normal Condor
Condor-G
Condor vanilla vs Condor-G
• In Condor, the current state of the batch slot is represented by the shadow.
  • Negotiation provides matchmaking down to the best slot.
  • Schedd provides queue management and presents a “batch system” interface to the user. Works with shadow and negotiator.
• In the grid universe, a “gridmanager” process is spawned which performs the different queue actions.
Gridmanagers
• There is one gridmanager type per grid flavor:
  • Globus
  • Nordugrid
  • EC2 (Clouds)
  • PBS.
• PBS isn’t normally thought of as a grid, but Condor-G / the “grid universe” is just a way to interface Condor with external batch-like systems.
Gridmanagers
• For example, the GT2 grid manager will take a job from the schedd and use the job’s ClassAd to submit the job via a Globus C library.
  • If the job is successfully submitted, the gridmanager process will update the schedd accordingly.
• Periodically, the gridmanager will poll all the jobs at a site, get their latest status, and update the schedd queue.
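The polling step can be pictured roughly like this. It is a simplified Python sketch, not the real gridmanager code: the dictionaries standing in for the schedd queue and the remote site's answer, the status strings, and `poll_site` are all hypothetical.

```python
def poll_site(schedd_queue, remote_status):
    """Update the local queue from one bulk status query; return the job ids that changed."""
    changed = []
    for job_id, status in remote_status.items():
        # Only touch jobs we actually submitted, and only when the status moved.
        if job_id in schedd_queue and schedd_queue[job_id] != status:
            schedd_queue[job_id] = status
            changed.append(job_id)
    return sorted(changed)


# Local view (schedd queue) before the poll.
queue = {"job1": "IDLE", "job2": "RUNNING", "job3": "RUNNING"}
# What the remote site reports for all of our jobs in a single query.
reported = {"job1": "RUNNING", "job2": "RUNNING", "job3": "DONE", "stray": "DONE"}

changed_jobs = poll_site(queue, reported)
```

Asking the site once for all jobs, rather than once per job, is part of what keeps the load on the remote CE manageable.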
Condor vs Condor-G vs Globus
Action           Condor          Condor-G        Globus
Submit job       condor_submit   condor_submit   globus-job-submit
Query status     condor_q        condor_q        globus-job-status
Cancel job       condor_rm       condor_rm       globus-job-cancel
Job description  ClassAd         ClassAd         RSL
• All your queue and job management can be done by Condor.
• Familiar interface and description language.
• Tools which know how to interact with Condor can interact cleanly with grid jobs.
• There may be no “standard grid protocol”, but Condor-G is almost the de facto standard grid client.
• Allows effective use even while the server technology is evolving (Keep up with the hype cycle: Utility computing -> grid computing -> cloud computing; all can use Condor)
Workflow on the Grid
• See: Workflows on Condor, i.e. DAGMan.
  • Because DAGMan layers on top of the Condor schedd…
  • And we put our grid jobs into Condor-G…
  • All of our workflow methodology converts to the grid with no changes.
Condor-G Details
• Always use Condor-G on the OSG; Globus tools are extremely unscalable and will crash the remote site at about 100 jobs.
• Condor-G is activated when you set the Condor job’s universe to “grid”.
• You also need to specify which endpoint and grid flavor to use, via “grid_resource” in your submit file.
Condor-G Details
• To use Condor-G to submit to a Globus resource, add the following 2 lines to your submit file:
    universe = grid
    grid_resource = gt2 osg-edu.cs.wisc.edu:/jobmanager-condor
• This submits to the OSG-EDU resource. A few other sample endpoints:
  • ff-grid.unl.edu:/jobmanager-pbs (Firefly)
  • red.unl.edu:/jobmanager-condor (Nebraska)
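If you submit to several endpoints, it can help to generate the submit file instead of editing it by hand. The Python sketch below builds a complete submit description around the two lines from the slide. The helper name `grid_submit_description`, the executable name `my_job.sh`, and the output, error, and log file names are placeholders picked for this example.

```python
def grid_submit_description(host, jobmanager, executable):
    """Return the text of a Condor-G submit file for a GT2 (Globus) endpoint."""
    lines = [
        "universe = grid",
        f"grid_resource = gt2 {host}:/{jobmanager}",
        f"executable = {executable}",
        "output = job.out",   # placeholder file names
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


osg_edu = grid_submit_description("osg-edu.cs.wisc.edu", "jobmanager-condor", "my_job.sh")
firefly = grid_submit_description("ff-grid.unl.edu", "jobmanager-pbs", "my_job.sh")
print(osg_edu)
```

The resulting text would be saved to a file and passed to condor_submit as usual; only the grid_resource line differs between sites.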
Condor-G “Gotchas”
• Assume no shared file system.
  • Know your input and output files and let Condor handle the file movement.
• Never run compute jobs on “jobmanager” or “jobmanager-fork” (anyone recall why?)
• Grid is NOT uniform.
  • Different architectures, OS, execution environment.
  • Different policies, restrictions, site preferences.
  • Condor-G does not make any of these magically homogeneous.
  • On the OSG, for Condor-G, you are expected to provide a wrapper to make the execution environment what you need.
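What such a wrapper does depends entirely on your application, but the general shape might look like the Python sketch below: set up a scratch directory and the environment variables you need, then run the real payload. `run_wrapped` and the `MY_APP_HOME` variable are hypothetical, and a real wrapper would likely also unpack software and check the site's architecture.

```python
import os
import subprocess
import sys
import tempfile


def run_wrapped(payload_argv, extra_env):
    """Run the payload in a fresh scratch directory with extra environment variables."""
    env = dict(os.environ)
    env.update(extra_env)
    # Do not assume a shared file system: work in a private scratch directory.
    with tempfile.TemporaryDirectory(prefix="gridjob-") as scratch:
        completed = subprocess.run(payload_argv, cwd=scratch, env=env)
    return completed.returncode


# Demo payload: a tiny Python command that reads the variable the wrapper set.
exit_code = run_wrapped(
    [sys.executable, "-c", "import os; print(os.environ['MY_APP_HOME'])"],
    {"MY_APP_HOME": "/tmp/my_app"},
)
```

Returning the payload's exit code matters: it is how a failure on the worker node makes its way back up to the user.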
Review
• You should have learned:
  • Grids have needs similar to batch systems.
  • Condor schedd provides all the job/queue management that one needs for the grid.
  • Condor-G is activated when you use the grid universe in Condor.
  • The normal Condor components are replaced by the “gridmanager” process, which translates the Condor commands to grid commands.
  • By using Condor-G, you can reuse the (large) Condor infrastructure on many grids, including the OSG.
Questions?
• Questions? Comments?
• Feel free to ask me questions later:
  Brian Bockelman, bbockelm@cse.unl.edu
• Upcoming sessions:
  • Afternoon, after lunch: Handling large-scale data. Room 2310