WestGrid Overview Dr. Rob Simmonds Distributed Systems Architect.

  • date post

    19-Dec-2015

WestGrid Overview

Dr. Rob Simmonds, Distributed Systems Architect

Talk Overview

• The WestGrid project

• The WestGrid HPTC resources

• Grid services for HPTC and how they will be used in WestGrid

WestGrid Project

• 8 institutions

• More than 250 researchers

• Technical and operational officers

• HPTC: compute resources and storage

• Visualization and collaboration

WestGrid People

• PIs
  – Jonathan Borwein (SFU), Gren Patey (UBC), Jonathan Schaeffer (UofA), Brian Unger (UofC), Mike Vetterli (SFU/TRIUMF)

• HPC planning committee
  – Rob Balantyne, Matthew Choptuik, Corrie Kost, Harold Esche, Paul Lu, Richard Marchand, Seamus O'Shea, Mark Thachuk, Ron Senda, Martin Siegert, Rob Simmonds, Mike Vetterli

• Visualization planning committee
  – Lyn Bartram, Kelly Booth, Pierre Boulanger, Brian Corrie, Sara Diamond, Larry Katz, John MacDonald, Trever Woods

• CAO
  – Ken Hewitt

WestGrid HPTC Resources

• 140TB IBM storage server (Power4/AIX)

• 1008 processor IBM cluster (IA-32/Linux)

• 256 processor SGI Origin (MIPS/Irix)

• 144 processor HP SC45 (Alpha/Tru64)

All connected by Canada’s world class networks

Grid Computing

• “Grid” is a set of software services
  – Combines meta-computing, resource discovery and security
  – Designed to enable access to resources in different management domains
  – Grid services will enable WestGrid resources to be integrated into individual researchers’ computing environments

Grid Standardization

• Global Grid Forum (GGF) is working to provide standards

• Open Grid Services Architecture (OGSA) defines low-level Grid services

Grid toolkits

• Globus (Public domain – ANL/ISI)
  – Currently version 2.x used for production
  – Version 3 provides a reference implementation for OGSA

• Legion (Commercial – Avaki)
  – Provides more support for data handling
  – Will support OGSA

Grid Security Infrastructure

• Ability for trusted users to access remote resources without re-authentication

• Ability for trusted jobs to access remote resources without re-authentication

• Protection against stolen credentials

• Avoid requirement for dedicated, highly available security server(s)

Certificate Authority Model

• CA issues certificates to trusted users and services

• Certificates used to authenticate with remote resources that trust issuing CA

• Grid Canada CA will be trusted by WestGrid resources

GSI Proxy Certificates

• User credentials delegated from user certificate to proxy certificate
  – Proxy certificate used for authentication

• Proxy certificates have limited lifetime
  – Can also be limited to only authenticate with certain services

• Proxy certificate copied to remote resource when job is started
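
The limited-lifetime rule can be pictured with a small Python model. This is only an illustrative sketch, not GSI code: the `Credential` class, the `delegate` function, the example subject name and the 12-hour lifetime are all assumptions made for the example. Real proxy certificates are X.509 certificates signed with the user's key.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Credential:
    """Hypothetical stand-in for a certificate: a subject name and an expiry time."""
    subject: str
    not_after: datetime
    issuer: Optional["Credential"] = None  # None for the user's own certificate

    def is_valid(self, now: datetime) -> bool:
        # Usable only while this credential and everything above it are unexpired.
        if now >= self.not_after:
            return False
        return self.issuer is None or self.issuer.is_valid(now)


def delegate(parent: Credential, lifetime: timedelta, now: datetime) -> Credential:
    """Create a short-lived proxy that never outlives the credential that signed it."""
    expiry = min(now + lifetime, parent.not_after)
    return Credential(parent.subject + "/CN=proxy", expiry, issuer=parent)


# Hypothetical user certificate valid for a year, and a 12-hour proxy made from it.
now = datetime(2003, 1, 1, 9, 0)
user_cert = Credential("/O=Grid/CN=Example User", now + timedelta(days=365))
proxy = delegate(user_cert, timedelta(hours=12), now)
```

In this model a stolen proxy stops being useful once its short lifetime ends, which is the property the slide relies on.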

Globus Security Commands

• Users can request a certificate using ‘grid-cert-request’
  – This creates userkey.pem and usercert_request.pem in ~/.globus/

• Certificate request file sent to CA
  – usercert.pem is returned and placed in ~/.globus/

Aim to automate this process for WestGrid users

Globus Security – Cont.

• Proxy certificate created using ‘grid-proxy-init’

• Proxy certificate examined using ‘grid-proxy-info’

• Proxy certificate destroyed using ‘grid-proxy-destroy’

Proxy certificates could be created during login process

GSI initialization demo …

Enabling Access to Resources

• Holding certificate from trusted CA does not guarantee access to resources

• Users given access to resource by being included in resource’s grid-mapfile
  – This allows owner of resource to choose which users are allowed to use the resource

• The grid-mapfile maps Grid user to a local account
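
The mapping step can be sketched in Python. The sketch assumes the usual grid-mapfile layout of one entry per line, with a quoted certificate subject followed by one or more comma-separated local account names. The function names and the example entries are hypothetical and are not part of Globus.

```python
import shlex
from typing import Dict, List, Optional


def parse_grid_mapfile(text: str) -> Dict[str, List[str]]:
    """Map each certificate subject (DN) to the local accounts it may use."""
    mapping: Dict[str, List[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # honours the quotes around the DN
        if len(parts) != 2:
            continue  # this sketch simply ignores malformed entries
        dn, accounts = parts
        mapping[dn] = accounts.split(",")
    return mapping


def local_account(mapping: Dict[str, List[str]], dn: str) -> Optional[str]:
    """Return the first local account for a Grid user, or None if not listed."""
    accounts = mapping.get(dn)
    return accounts[0] if accounts else None


# Hypothetical entries, for illustration only.
EXAMPLE = '''
# subject of the user's certificate, then the local account
"/C=CA/O=Grid/OU=example.ca/CN=Example User" euser
"/C=CA/O=Grid/OU=example.ca/CN=Other User" ouser,shared
'''

mapping = parse_grid_mapfile(EXAMPLE)
```

A user whose subject is missing from the file gets `None` here, mirroring the point that a valid certificate alone does not grant access.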

Globus Job Starting

• Run job on remote resource using ‘globus-job-run <host> <program>’
  – <host> must trust the CA that signed the user’s certificate and user must be mentioned in grid-mapfile
  – Proxy certificate is copied to GASS cache on <host> to enable program to authenticate with other remote resources

Batch Job Starting

• ‘globus-job-submit <host> <program>’
  – This returns a URL used to query job

• ‘globus-job-status <url>’
  – Find out if the job is waiting, running or finished

• ‘globus-job-get-output <url>’
  – Get output produced by job. This is stored in the GASS cache on the host where the job is running

• ‘globus-job-clean <url>’
  – Remove the GASS cache entry for the job in question
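
The order in which these four commands are used can be sketched in Python. The helper names, the host name and the job URL below are hypothetical placeholders. The sketch only builds argument lists and does not run anything, so it makes no assumption about the text the commands print.

```python
from typing import List


def submit_command(host: str, program: str) -> List[str]:
    """Argument list for submitting a batch job; the command prints the job URL."""
    return ["globus-job-submit", host, program]


def followup_commands(job_url: str) -> List[List[str]]:
    """Commands applied to the URL returned by the submit step, in the usual order."""
    return [
        ["globus-job-status", job_url],      # waiting, running or finished?
        ["globus-job-get-output", job_url],  # fetch output held in the GASS cache
        ["globus-job-clean", job_url],       # remove the GASS cache entry
    ]


# Hypothetical host and job URL, for illustration only.
submit = submit_command("cluster.example.org", "/bin/hostname")
steps = followup_commands("https://cluster.example.org/job-url-placeholder")
```

On a machine with the Globus client tools installed, each list could be passed to `subprocess.run`, with the output of the submit step supplying the real job URL.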

GridFTP

• ‘globus-url-copy <original> <copy>’
  – Copies file from one location to another
  – file:/<filename> - a file on a local file-system
  – gsiftp://<host>/<filename> - a file on GridFTP server <host>

• Extensions to standard FTP include
  – Third party transfers
  – Parallel transfers
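
The two URL forms and the copy command can be put together as in the Python sketch below. The helper names, host names and paths are hypothetical, and the sketch only builds the argument list rather than performing a transfer.

```python
from typing import List


def local_url(path: str) -> str:
    """URL for a file on the local file-system, in the file:/<filename> form."""
    if not path.startswith("/"):
        raise ValueError("this sketch expects an absolute path")
    return "file:" + path


def gridftp_url(host: str, path: str) -> str:
    """URL for a file held on the GridFTP server <host>."""
    if not path.startswith("/"):
        raise ValueError("this sketch expects an absolute path")
    return "gsiftp://" + host + path


def copy_command(original: str, copy: str) -> List[str]:
    """Argument list for 'globus-url-copy <original> <copy>'."""
    return ["globus-url-copy", original, copy]


# Upload from the local machine to a (hypothetical) GridFTP server.
upload = copy_command(local_url("/tmp/data.dat"),
                      gridftp_url("store.example.org", "/data/run1.dat"))

# Third party transfer: both ends are GridFTP servers, neither is the local machine.
third_party = copy_command(gridftp_url("store.example.org", "/data/run1.dat"),
                           gridftp_url("cluster.example.org", "/scratch/run1.dat"))
```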

Credential Repository

• NCSA’s MyProxy server provides an on-line credential repository

• User stores proxy certificate in repository
  – This certificate can be long lived

• User can later recover a short lived certificate from the repository
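
The store-then-recover pattern can be modelled with a toy Python class. This is not the MyProxy protocol or its API: `ToyCredentialRepository`, the passphrase check and the lifetimes below are assumptions chosen only to show a long-lived stored credential yielding short-lived ones.

```python
import hashlib
import hmac
from datetime import datetime, timedelta


class ToyCredentialRepository:
    """Toy model: remembers when each user's stored credential expires."""

    def __init__(self) -> None:
        self._entries = {}  # username -> (passphrase digest, stored expiry)

    @staticmethod
    def _digest(passphrase: str) -> bytes:
        return hashlib.sha256(passphrase.encode("utf-8")).digest()

    def store(self, username: str, passphrase: str, expires_at: datetime) -> None:
        """Deposit a (possibly long-lived) credential, protected by a passphrase."""
        self._entries[username] = (self._digest(passphrase), expires_at)

    def retrieve(self, username: str, passphrase: str,
                 lifetime: timedelta, now: datetime) -> datetime:
        """Return the expiry of a short-lived credential derived from the stored one."""
        if username not in self._entries:
            raise KeyError(username)
        digest, stored_expiry = self._entries[username]
        if not hmac.compare_digest(digest, self._digest(passphrase)):
            raise PermissionError("wrong passphrase")
        if now >= stored_expiry:
            raise PermissionError("stored credential has expired")
        # The recovered credential never outlives the stored one.
        return min(now + lifetime, stored_expiry)


now = datetime(2003, 1, 1, 9, 0)
repo = ToyCredentialRepository()
repo.store("euser", "example passphrase", now + timedelta(days=7))
short_expiry = repo.retrieve("euser", "example passphrase", timedelta(hours=2), now)
```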

Credential Repository Uses

• Used to authenticate with environment when user does not have access to their certificate
  – e.g., in a Web portal

• Could be used to authenticate and get proxy certificate during login process eliminating need for Unix passwords

MyProxy Commands

• myproxy-init -s <host>
  – Put a proxy certificate into the MyProxy server on <host>
  – Can specify host using environment variable

• myproxy-info -s <host>
  – View information about user’s proxy certificate

• myproxy-get-credential
  – Get a proxy certificate

• myproxy-destroy
  – Remove proxy certificate from the MyProxy server

Inserting Credential

Recovering Credential

MyProxy Certificate Renewal

• Allows automated proxy certificate renewal

• Special proxy certificate enables trusted service to renew standard proxy certificate
  – e.g., trust a local scheduler to renew the certificate before starting a job

• Should help to prevent users resorting to insecure means for automating proxy renewal

GSI Enabled SSH Tools

• GSI enabled versions of OpenSSH tools will be used in WestGrid
  – gsi-ssh: authenticates through GSI and copies proxy certificates to remote host
  – gsi-scp: authenticates through GSI

GSI Enabled SSH

Resource Discovery

• Globus uses MDS for resource discovery
  – GRIS – provides information about individual hosts
  – GIIS – provides information about groups of hosts

• In WestGrid each of the 4 major resources will run a GRIS

• At least one GIIS will be provided to hold aggregate information
  – Probably use one per site

MDS

• Publish information to LDAP servers
  – Information used by Grid services to locate needed resources

• Publish information such as
  – Type(s) of job scheduler available
  – Parameters accepted by job scheduler
  – Number of processors
  – Amount of RAM, disk or tape
  – Software and license availability
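
How a Grid service might use such published information to locate a resource can be sketched in Python. The records below reuse the four WestGrid HPTC resources listed earlier in the talk. The dictionary keys and the `find_resources` function are hypothetical, and a real client would query the LDAP servers rather than an in-memory list.

```python
from typing import Any, Dict, List, Optional

# Hypothetical attribute names; values taken from the HPTC resources slide.
RESOURCES: List[Dict[str, Any]] = [
    {"name": "IBM storage server", "arch": "Power4", "os": "AIX", "storage_tb": 140},
    {"name": "IBM cluster", "arch": "IA-32", "os": "Linux", "processors": 1008},
    {"name": "SGI Origin", "arch": "MIPS", "os": "Irix", "processors": 256},
    {"name": "HP SC45", "arch": "Alpha", "os": "Tru64", "processors": 144},
]


def find_resources(resources: List[Dict[str, Any]],
                   min_processors: int = 0,
                   os_name: Optional[str] = None) -> List[str]:
    """Return names of resources that satisfy the requested attributes."""
    matches = []
    for entry in resources:
        # Entries that publish no processor count are treated as having none.
        if entry.get("processors", 0) < min_processors:
            continue
        if os_name is not None and entry["os"] != os_name:
            continue
        matches.append(entry["name"])
    return matches


large = find_resources(RESOURCES, min_processors=200)
linux = find_resources(RESOURCES, os_name="Linux")
```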

MDS Example

Meta-scheduling

• A meta-scheduler is used to submit jobs to other job schedulers

• WestGrid will employ meta-scheduling
  – Condor-G, Silver and Trellis are under consideration
  – Multiple meta-schedulers could be used

• Hierarchical meta-scheduling can be employed

Condor-G

• Can be used to submit jobs to specific machines

• Can use ‘glideins’ to add resources to local condor pool

• New version will include support for batch scheduler advertisements

Condor-G : Glidein Example

Movie at http://www.cpsc.ucalgary.ca/~simmonds/EdmontonTalk1/condor_demo1.avi

Result: Solar System Viz

Movie at http://www.cpsc.ucalgary.ca/~simmonds/EdmontonTalk1/solarsystem.avi

WestGrid Accounting

• Use MDS to publish accounting information from each site to LDAP

• WestGrid wide accounting calculated and also published in secure LDAP

• Users will be able to gain access to information, filtered by a policy manager

Scheduling Priorities

• Plan to use accounting information to provide fairness in scheduling priorities across WestGrid

• Feed values calculated using global accounting information back into local batch schedulers
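
One simple way to turn global accounting data into a fairness value is sketched below in Python. The talk does not say which formula WestGrid will use, so the rule here (target share minus observed share of total usage) is an assumption, as are the function name, the group names and the numbers.

```python
from typing import Dict


def fairshare_adjustments(usage: Dict[str, float],
                          target_share: Dict[str, float]) -> Dict[str, float]:
    """Positive values mean a group is under its target share, negative means over."""
    total = sum(usage.values())
    adjustments = {}
    for group, target in target_share.items():
        # Observed share of all usage recorded across the sites.
        used_share = usage.get(group, 0.0) / total if total > 0 else 0.0
        adjustments[group] = target - used_share
    return adjustments


# Hypothetical processor-hours summed across all sites, with equal target shares.
usage = {"group_a": 300.0, "group_b": 100.0}
targets = {"group_a": 0.5, "group_b": 0.5}
adjustments = fairshare_adjustments(usage, targets)
```

Values like these could be fed back to each local batch scheduler as a priority adjustment, so that a group that has consumed more than its share anywhere in WestGrid is ranked lower everywhere.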

Data Storage

• Grid enabled access to storage
  – Accessible from researcher’s desktop

• Distributed file systems currently limited
  – Security and caching issues

• Data repository systems provide much of the functionality required
  – SRB from SDSC
  – Giggle from ISI/ANL

Repository management

• Large network available file stores

• Annotation – meta-data tagging

• Data representation optimization
  – Files, collections and containers

• User level replication aided by catalogs

Look at SRB

SRB – “S commands”

Wide Area Message Passing

• MPI-G2 enables running of message passing jobs in Grid environment

• Attempts to use best MPI implementation at each site

• Provides process mapping configuration to group tightly coupled processes

Web Portals

• Enable access to Grid services via web browser

• Start a secure session then authenticate this session with GSI using credential server

• Web session now acts as you in Grid environment

WestGrid mock up

Getting a WestGrid Account

• Centralized Web based account requests

• We get certificate or you use existing certificate

• We set up accounts, install certificates and email you

WestGrid Grid Environment

• Initial Grid services use
  – Globus, MyProxy, OpenSSH, SRB

• Services include
  – Job starting, resource discovery, credential management and repository management

• Working on having meta-scheduler(s)
  – Condor-G, …

Lots of work to do …

• Distributed file systems

• Improved replica management

• Fine-grain security

• Performance measurement and analysis

• Credential based information discovery

• Enhanced meta-scheduling

• Workflow

Credits – TeleSim helpers

• Mark Fox (TeleSim programmer)
  – Web portals, demo

• Andrey Mirchovski (TeleSim research student)
  – Security and chief Globus critic

• Phil Rizk (Hons project student/TeleSim programmer)
  – MDS, accounting and Web services

Questions and Comments …