Post on 16-Apr-2017
1
About the Presenter
Jan Bigalke
SAS Architect at Allianz Managed Operations & Services SE
Greg Nelson
CEO and Founder of ThotWave Technologies
3
Five Reasons you should stay…
1. 90% of the time, an out-of-the-box grid install won’t handle your use cases
2. You will gain a better appreciation of what’s happening when using applications such as EG with SAS Grid Manager
3. You’ll better understand how queue options will affect your users based on differing workloads
4. You will learn how to estimate how many jobs your grid environment can theoretically run
5. You will get a tutorial on how to go beyond the installation and configure your system for high availability
6. Plus more…
4
Agenda
§ Introduction: Modern SAS Architectures and SAS Grid
§ Understanding how SAS Workloads can impact different resources
§ How SAS Grid Processes Work
§ Best practices for Post-Installation Configuration
§ Calculating Capacity
§ Implementing High Availability
§ Maintaining your SAS Software in a Grid Environment
5
Trends / Motivation
§ Workload management
§ Commodity hardware (x86)
§ Scalability
§ Start small, grow with customer demand
§ Efficiency (reduce cost)
7
Grid architecture
[Diagram: Default deployment: SAS WebServer, SAS WebApp, SAS Meta, SAS Compute. Grid deployment: SAS WebServer, SAS WebApp, SAS Meta, and multiple SAS Compute servers on a shared file system (Shared FS).]
Changes with GRID:
§ A shared file system for the compute servers is necessary
§ Distribution of workload between servers
§ Scalability
8
Architecture: SAS Grid @work
A SAS Grid computing environment is one in which SAS computing tasks are distributed among multiple computers on a network, all under the control of SAS Grid Manager.
9
Design considerations
§ Memory, CPU and I/O
§ Utilization, latency, throughput
§ Type of workload: different workloads have different needs
10
SAS Versions and grid capabilities
SAS Version   New capabilities (key points)
9.4 M2        Grid Manager plug-in for the Environment Manager
9.4 M1        Stored process servers and pooled workspace servers grid-launched
9.4 M0        Grid options, grid-launched workspace servers
9.3           Load balancing for stored process servers, OLAP servers and pooled workspace servers
9.2           SAS code analyzer; grid-launched batch SAS jobs; load balancing for SAS Workspace Servers
11
SAS Grid request flow
[Diagram: SAS® Enterprise Guide®, SAS Metadata Server, SAS Object Spawner, gridrun, LSF, and a Workspace Server on a SAS Grid Node, with the steps below numbered 1 to 4.]
Request flow:
1. Metadata Server
2. Object Spawner
3. Get grid options
4. Spawn task on Grid Node
12
Metadata and GRID
Object Spawner Console Log:
2015-10-07T21:09:23,801 INFO (gridrun.c:590) - commHandler: command received is >[INIT] [PROVNAME]:"Platform" [MODNAME]:"" [SRVHOST]:"sas94-app1-syst.testdomain" [SRVPORT]:"0" [USERNAME]:"" [PASSWORD]:"" [TIMEOUT]:"0" [OPTIONS]:<project=SASApp><.
2015-10-07T21:09:23,836 INFO (gridrun.c:609) - commHandler: command response is >[DONE]<.
2015-10-07T21:09:23,837 INFO (gridrun.c:590) - commHandler: command received is >[STARTJOB] [JOBNAME]:"SAS Enterprise Guide_SASApp - Workspace Server_4101A009-6340-2B46-8E08-8DB2933E8182" [RESOURCES]:"" [COMMAND]:</var/opt/data/sas/sas94/configAPP/Lev1/SASApp/WorkspaceServer/WorkspaceServer.sh> [ARGUMENTS]:<-noterminal -noxcmd -netencryptalgorithm AES -metaserver sas94-meta-syst.testdomain -metaport 8561 -metarepository Foundation -locale en_US -objectserver -objectserverparms "delayconn sph=hosta.testdomain protocol=bridge spawned spp=42449 cid=0 pb classfactory=440196D4-90F0-11D0-9F41-00A024BB830C server=OMSOBJ:SERVERCOMPONENT/A5ZI7 NU4.AY0000WN cel=everything lb recon grid "keepalive=30"" -METAUSER '"testuser@!*(generatedpassworddomain)*!"' -METAPASS 49944139d506b727d1555D7b1d8E6162 > [OPTIONS]:<> [ARMCORR]:"" [FLAGS]:"0" [INFILES]:"" [OUTFILES]:"" [HOSTS]:"sas94-app1-syst.testdomain,sas94-app2-syst.testdomain"[MPIPROCS]:"0" [PROCSHOST]:"0"<.
Job <33707> is submitted to queue <qiSASApp>.
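The log excerpt above carries its payload as `[KEY]:"value"` or `[KEY]:<value>` pairs, followed by the submission confirmation. A minimal Python sketch for pulling these apart might look like the following. The function names are hypothetical, the field grammar is inferred only from the lines shown here, and it has only been checked against the INIT line, so deeply nested quoting in other commands may need more care.

```python
import re

# [KEY]:"value" or [KEY]:<value>, as seen in the excerpt (grammar is an assumption).
FIELD_RE = re.compile(r'\[(\w+)\]:(?:"([^"]*)"|<([^>]*)>)')
# Confirmation line printed once the job has been queued.
SUBMIT_RE = re.compile(r'Job <(\d+)> is submitted to queue <([^>]+)>')


def parse_command(line):
    """Return (command, fields) for a 'command received is' log line."""
    payload = line.split("command received is >", 1)[1]
    command = re.match(r'\[(\w+)\]', payload).group(1)
    # Exactly one of the two value groups is non-empty (or both are empty).
    fields = {key: quoted or angled
              for key, quoted, angled in FIELD_RE.findall(payload)}
    return command, fields


def parse_submission(line):
    """Return (job_id, queue) from the submission confirmation, or None."""
    match = SUBMIT_RE.search(line)
    return (int(match.group(1)), match.group(2)) if match else None


INIT_LINE = (
    '2015-10-07T21:09:23,801 INFO (gridrun.c:590) - commHandler: '
    'command received is >[INIT] [PROVNAME]:"Platform" [MODNAME]:"" '
    '[SRVHOST]:"sas94-app1-syst.testdomain" [SRVPORT]:"0" [USERNAME]:"" '
    '[PASSWORD]:"" [TIMEOUT]:"0" [OPTIONS]:<project=SASApp><.'
)

if __name__ == "__main__":
    print(parse_command(INIT_LINE))
    print(parse_submission("Job <33707> is submitted to queue <qiSASApp>."))
```

Extracting the queue name this way makes it easy to confirm that a given application server context actually lands in the queue you expect.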
13
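Job Flow Processing Inside LSF
SEQUENCE:
1. Submit the job
2. Schedule the job
3. Dispatch the job
4. Run the job
5. Return output
[Diagram: gridrun submits the job to a queue on the Master Host (Job PEND, steps 1 and 2); the job is dispatched to a Compute Host (step 3) and runs (Job RUN, step 4); the job output is returned (step 5, Job DONE). Resource info flows back to the Master Host.]
Tuning parameters shown in the diagram:
§ mbatchd: JOB_SCHEDULING_INTERVAL
§ sbatchd: JOB_ACCEPT_INTERVAL, SBD_SLEEP_TIME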
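The five steps and the three job states can be lined up in a small Python sketch. Pairing each step with a state is an assumption read off the diagram (in particular, that a job counts as RUN once it has been dispatched); it is an illustration, not LSF's internal state machine.

```python
# Steps from the slide, paired with the job state once the step completes.
# The step-to-state pairing is an assumption based on the diagram.
JOB_FLOW = [
    ("submit", "PEND"),
    ("schedule", "PEND"),
    ("dispatch", "RUN"),
    ("run", "RUN"),
    ("return output", "DONE"),
]


def trace(flow=JOB_FLOW):
    """Return the distinct states a job passes through, in order."""
    states = []
    for _step, state in flow:
        if not states or states[-1] != state:
            states.append(state)
    return states


if __name__ == "__main__":
    print(" -> ".join(trace()))
```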
14
Best Practices
§ Configuration in Metadata
§ Beyond the basics with Queues
§ Multi-tenant considerations
§ SAS Grid and Hadoop
§ Theoretical number of jobs
§ High availability
§ Software Updates
15
SAS configuration
§ STP, Workspace, Batch
§ STP: load balancing only (need to check whether this helps)
§ Workspace: grid-launched
§ Batch: grid integration into the enterprise scheduler
19
GRID and workload considerations
§ GRID is load balanced (based on utilization)
§ Web requests need short latency
§ Use the grid for longer-running requests (Batch, Workspace Server)
§ Distribute online requests via the Object Spawner (shorter latency)
20
Manage Workloads with queues
Queues can manage workloads with different requirements
§ Protect access with groups
Parameters per queue:
§ Priority
§ Limits
§ Stop and start conditions
[Diagram: SAS clients, a scheduler, and SAS programs submit work to the Interactive, Batch, and default queues.]
21
Why Queues Matter
Parameter    Example (Interactive Queue)   Example (Batch Queue)   Definition
PRIORITY     PRIORITY=50                   PRIORITY=20             The relative priority compared to other queues
NICE         NICE=20                       NICE=10                 The execution priority change, based on Linux “nice” values
CPULIMIT     CPULIMIT=5                    CPULIMIT=15             A time limit applied to jobs
UJOB_LIMIT   UJOB_LIMIT=5                  UJOB_LIMIT=2            The maximum job slots per user in a queue
PJOB_LIMIT   PJOB_LIMIT=10                 PJOB_LIMIT=5            The maximum job slots per processor in a queue
QJOB_LIMIT   QJOB_LIMIT=120                QJOB_LIMIT=60           The maximum jobs in a queue
22
Why Queues Matter
Parameter        Example (Interactive Queue)   Example (Batch Queue)   Definition
HJOB_LIMIT       HJOB_LIMIT=4                  HJOB_LIMIT=4            Maximum number of job slots that this queue can use on any host
CHUNK_JOB_SIZE   CHUNK_JOB_SIZE=4              CHUNK_JOB_SIZE=4        Maximum number of jobs allowed to be dispatched together in a chunk
r1m              r1m=0.3/1.5                   r1m=0.3/1.5             1-minute CPU run queue length (alias: cpu)
ut               ut=0.2                        ut=0.2                  1-minute CPU utilization (0.0 to 1.0)
r15s             r15s=0.3/1.5                  r15s=0.3/1.5            15-second CPU run queue length
it               it=10/1                       it=10/1                 Idle time in minutes (alias: idle)
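To see how the limit parameters from the two slides fit together, the sketch below renders them as queue definitions in the `Begin Queue` / `End Queue` layout used by the LSF `lsb.queues` file. The queue names `interactive` and `batch` and the helper `render_queue` are hypothetical; the values are the example values from the slides, not recommendations.

```python
def render_queue(name, params):
    """Render one queue definition in Begin Queue / End Queue layout."""
    lines = ["Begin Queue", f"QUEUE_NAME = {name}"]
    lines += [f"{key} = {value}" for key, value in params.items()]
    lines.append("End Queue")
    return "\n".join(lines)


# Example values taken from the slides; queue names are placeholders.
INTERACTIVE = {
    "PRIORITY": 50, "NICE": 20, "CPULIMIT": 5,
    "UJOB_LIMIT": 5, "PJOB_LIMIT": 10, "QJOB_LIMIT": 120, "HJOB_LIMIT": 4,
}
BATCH = {
    "PRIORITY": 20, "NICE": 10, "CPULIMIT": 15,
    "UJOB_LIMIT": 2, "PJOB_LIMIT": 5, "QJOB_LIMIT": 60, "HJOB_LIMIT": 4,
}

if __name__ == "__main__":
    print(render_queue("interactive", INTERACTIVE))
    print()
    print(render_queue("batch", BATCH))
```

Keeping the values in one place like this makes it easier to compare queues side by side before editing the real configuration.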
23
Multi-tenancy considerations
Queues per tenant
§ Allow different settings per tenant
Queues and workload
§ stop_cond = select[ (cpuusg > 95.0) && (cguxx > 90.0) ]
§ resume_cond = select[ (cguxx < 95.0) || (cpuusg < 95.0) ]
24
GRID and interaction with other systems (databases / Hadoop)
§ Database Access interfaces (same DB client config)
§ Hadoop (Java JAR, Auth Kerberos, Grid)
[Diagram: grid nodes Node 1 through Node n, each connecting to the RDBMS.]
25
Capacity Sample
§ $\sum_{k=1}^{n} \mathrm{Jobs}(k) = \mathrm{MXJ}_1 + \mathrm{MXJ}_2 + \dots + \mathrm{MXJ}_n$
§ JOB_ACCEPT_INTERVAL is 1
§ MBD_SLEEP_TIME is 5
§ Platform LSF dispatches one job to a particular machine and waits for 5 seconds before dispatching another job to the same machine, regardless of how long each job takes
26
Capacity Sample
§ Average job duration: 5 seconds
§ JOB_ACCEPT_INTERVAL: 1
§ MBD_SLEEP_TIME: 5
§ 4 cores per host
§ 2 hosts
§ 4 job slots per core
27
Capacity Sample
§ Jobs per host = 60 / (1 * 5) = 12 jobs per host per minute
§ 12 * 2 = 24 jobs per minute in the GRID
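The arithmetic above can be checked with a few lines of Python. The function names are hypothetical, and `total_job_slots` assumes that each host's MXJ equals cores per host times job slots per core, which the slides imply but do not state.

```python
def jobs_per_host_per_minute(job_accept_interval, mbd_sleep_time):
    """Dispatch rate per host: one job every interval * sleep_time seconds."""
    return 60 / (job_accept_interval * mbd_sleep_time)


def grid_jobs_per_minute(hosts, job_accept_interval, mbd_sleep_time):
    """Dispatch rate across all hosts in the grid."""
    return hosts * jobs_per_host_per_minute(job_accept_interval, mbd_sleep_time)


def total_job_slots(hosts, cores_per_host, slots_per_core):
    """Sum of MXJ over all hosts, assuming MXJ = cores * slots per core."""
    return hosts * cores_per_host * slots_per_core


if __name__ == "__main__":
    print(jobs_per_host_per_minute(1, 5))   # per-host rate from the sample
    print(grid_jobs_per_minute(2, 1, 5))    # grid-wide rate from the sample
    print(total_job_slots(2, 4, 4))         # slot count under the MXJ assumption
```

With the sample values, the dispatch rate (24 jobs per minute) rather than the slot count is the tighter bound for 5-second jobs, which is why the two LSF timing parameters matter for short-running workloads.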
28
Failover Options
What’s Included
§ Core LSF Processes (Base and Batch)
What’s Not Included
§ SAS Management Console
§ SAS Mid-Tier
§ SAS Object Spawner
§ Platform Process Manager
§ Platform Grid Management Service
29
LSF Daemons on UNIX
[Diagram: daemons on the Master Server and Server Host]
§ LSF BASE: LIM, RES, PIM, PEM
§ LSF BATCH: SBATCHD, MBATCHD, MBSCHD
31
GRID failover EGO
§ Where can grid help: the compute cluster
[Diagram: SAS WebServer; Midtier Cluster (three SAS WebApp instances); Metadata Cluster (three SAS Meta instances); Grid (multiple SAS Compute servers) with compute services via EGO.]
32
GRID and failover
§ EGO defines the services and the number of instances
§ If one server goes down, EGO restarts the service on a failover node
[Diagram: Node 1, Node 2, Node 3, each running a service (S).]
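The failover idea can be illustrated with a toy model in Python. This is not EGO's API or its actual placement logic: the service names `S1` to `S3`, the node names, and the "least loaded surviving node" rule are all assumptions made for the sketch.

```python
def fail_over(placement, failed_node, live_nodes):
    """Return a new placement with services moved off the failed node.

    placement maps service name -> node name. Each displaced service goes
    to the live node currently hosting the fewest services (toy rule).
    """
    new_placement = dict(placement)
    for service, node in placement.items():
        if node == failed_node:
            target = min(
                live_nodes,
                key=lambda n: sum(1 for v in new_placement.values() if v == n),
            )
            new_placement[service] = target
    return new_placement


if __name__ == "__main__":
    before = {"S1": "node1", "S2": "node2", "S3": "node3"}
    after = fail_over(before, "node2", ["node1", "node3"])
    print(after)
```

Services on healthy nodes stay where they are; only the instance that lost its host is restarted elsewhere.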
34
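GRID and hot fixing process
§ Update considerations
§ Base and beyond
[Diagram: Node 1, Node 2, and Node 3 each receive the fix (FIX); a shared store sits alongside the nodes.]
Steps for applying a fix:
1. Close the node in LSF and stop the EGO services
2. Wait for the running SAS processes to end
3. Fix the binaries and open the node
4. Sync (rsync)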
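The hot fix steps can be written out as an ordered plan. The Python sketch below assumes the nodes are patched one at a time, in the order listed, so the rest of the grid keeps serving work; the slide does not say this explicitly. Node names and the `rolling_plan` helper are hypothetical, and the steps are descriptions rather than commands.

```python
# Steps per node, in the order given on the slide.
STEPS = [
    "close node in LSF",
    "stop EGO services",
    "wait for running SAS processes to end",
    "fix binaries",
    "open node in LSF",
    "sync (rsync)",
]


def rolling_plan(nodes):
    """Return (node, step) pairs, finishing each node before the next starts."""
    return [(node, step) for node in nodes for step in STEPS]


if __name__ == "__main__":
    for node, step in rolling_plan(["node1", "node2", "node3"]):
        print(f"{node}: {step}")
```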
35
Conclusion
§ GRID allows us to scale horizontally
§ Different workloads need different settings
§ We can optimize workloads with queues
§ We use EGO to manage service failover
§ GRID is not only SAS: EGO/LSF settings matter too