Practical Guidelines for Moab Stacks
© 2013 ADAPTIVE COMPUTING, INC. 1
Practical Guidelines for Highly Available Moab Stacks
Daniel Hardman, Chief Solutions Architect
@dhh1128 ~ http://codecraft.co ~ http://gplus.to/danielhardman ~ http://lnkd.in/z7PTAR
April 2013
The Goal of HA
…NOT! :-)
The real goals of HA
▪ Eliminate or reduce “downtime” for running jobs
▪ Eliminate or reduce “downtime” for new submissions
▪ Make failovers visible and manageable
▪ Satisfy regulatory requirements
▪ Preserve audit trail
HA is constrained by time, money
How much are you willing to spend to tolerate:
▪ A power outage?
▪ A software crash?
▪ A hacker from Unit 61398 in Shanghai?
▪ The Chelyabinsk meteor?
▪ The Chicxulub meteor that wiped out the dinosaurs?
What is “downtime”?
0 – hardware failure
+3 min – usable, but very slow
-30 min – last checkpoint
+10 min – full restore
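The timeline above gives different "downtime" figures depending on which milestone counts. A minimal Python sketch of that arithmetic follows; the class, field, and key names are illustrative assumptions, and only the four numbers come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Offsets in minutes relative to the hardware failure at t = 0."""
    usable_at: float        # service responds again, possibly very slowly
    restored_at: float      # full restore is complete
    last_checkpoint: float  # negative: minutes before the failure


def downtime_views(incident: Incident) -> dict:
    """Return several defensible answers to 'how long was the downtime?'."""
    lost_work = -incident.last_checkpoint
    return {
        "outage_until_usable": incident.usable_at,
        "outage_until_restored": incident.restored_at,
        "work_lost_since_checkpoint": lost_work,
        # Pessimistic view: redo the lost work and wait for the full restore.
        "worst_case_total": incident.restored_at + lost_work,
    }


views = downtime_views(Incident(usable_at=3, restored_at=10, last_checkpoint=-30))
for name, minutes in views.items():
    print(f"{name}: {minutes} min")
```

Depending on the definition chosen, the same incident counts as 3, 10, 30, or 40 minutes, so an SLA needs to say which one it means.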
4 Basic Recipes
▪ Simple built-in HA
▪ Standard pairwise HA
▪ Shared pairwise HA
▪ Advanced HA
Recipe 1: simple, built-in HA
Simple, built-in HA
▪ hot ~ warm (daemons idle on fallback svr)
▪ Moab, TORQUE
  ▪ shared file system, synced clocks, two daemons, last mod date on semaphore
▪ MAM
  ▪ DB replication, primary and fallback server
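The "last mod date on semaphore" idea can be sketched as follows. This is a minimal illustration in Python, not Moab's or TORQUE's actual implementation: the file name `primary.lock`, the 30-second threshold, and both function names are assumptions. The primary refreshes a file on the shared file system, and the fallback treats the primary as alive while that file's modification time is recent. The last two calls show why synced clocks matter.

```python
import os
import tempfile
import time
from typing import Optional


def touch(lock_path: str) -> None:
    """Primary side: create the file if needed and refresh its mtime."""
    with open(lock_path, "a"):
        pass
    os.utime(lock_path, None)


def primary_looks_alive(lock_path: str, max_age_s: float,
                        now: Optional[float] = None) -> bool:
    """Fallback side: True if the file was modified within max_age_s."""
    if now is None:
        now = time.time()
    try:
        mtime = os.stat(lock_path).st_mtime
    except FileNotFoundError:
        return False  # no file at all: treat the primary as down
    return (now - mtime) <= max_age_s


with tempfile.TemporaryDirectory() as shared_dir:
    lock = os.path.join(shared_dir, "primary.lock")  # hypothetical name
    touch(lock)
    stamped = os.stat(lock).st_mtime
    # Fallback clock in sync: 5 s after the last touch, well inside 30 s.
    in_sync = primary_looks_alive(lock, max_age_s=30, now=stamped + 5)
    # Fallback clock running 2 minutes fast: the same file looks stale.
    skewed = primary_looks_alive(lock, max_age_s=30, now=stamped + 5 + 120)

print("in sync:", in_sync, "| skewed clock:", skewed)
```

With a skewed clock the fallback would conclude that a healthy primary is dead, which is one way a false trigger can arise.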
Sample deployment (simple, built-in HA)
Pros and cons (simple, built-in HA)
▪ Pros
  ▪ Fast and easy to set up
  ▪ Minimal learning curve
▪ Cons
  ▪ Doesn’t protect the solution DB, MWS, Viewpoint
  ▪ Depends on synchronized clocks, reliable propagation of file metadata in shared fs
  ▪ Risk of false triggers
  ▪ Shared FS may be single point of failure, depending on how it’s implemented
Recipe 2: standard, pairwise HA
Standard, pairwise HA
▪ Twin headnodes (all daemons)
▪ hot ~ cold (daemons inert on fallback svr)
▪ Heartbeat, redhat clustering
▪ Replicated FS (DRBD)
Sample deployment (standard, pairwise HA)
Pros and cons (standard, pairwise HA)
▪ Pros
  ▪ All services fail over the same way
  ▪ Heartbeat is robust, well understood
  ▪ FS can’t be a single point of failure
▪ Cons
  ▪ Some vulnerability to “split brain” scenario
  ▪ More learning curve
  ▪ More complexity than simple, built-in HA
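To make the "split brain" risk concrete, here is a toy Python model. It is a sketch under simplifying assumptions, not how Heartbeat or Red Hat clustering is implemented; the node names `head1` and `head2` and the function names are hypothetical. The fallback promotes itself whenever heartbeats stop arriving, and it cannot tell a dead primary from a dead link.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    active: bool = False

    def on_heartbeat_timeout(self) -> None:
        # No heartbeat heard: assume the peer is dead and take over.
        self.active = True


def simulate(link_up: bool, primary_alive: bool) -> List[str]:
    """Return the names of the nodes that end up running the services."""
    primary = Node("head1", active=primary_alive)
    fallback = Node("head2")
    heartbeat_arrives = link_up and primary_alive
    if not heartbeat_arrives:
        fallback.on_heartbeat_timeout()
    return [node.name for node in (primary, fallback) if node.active]


normal = simulate(link_up=True, primary_alive=True)
failover = simulate(link_up=True, primary_alive=False)
split_brain = simulate(link_up=False, primary_alive=True)
print("normal:", normal)
print("failover:", failover)
print("split brain:", split_brain)
```

In the third case both headnodes are active at once, each writing to its own replica, which is the scenario this recipe remains somewhat vulnerable to.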
Recipe 3: shared, pairwise HA
Shared, pairwise HA
▪ Twin headnodes (all daemons)
▪ hot ~ warm (some daemons inert, some idle on fallback svr)
▪ Heartbeat, redhat clustering
▪ DB failover
▪ Shared FS (e.g., GFS2)
Sample deployment (shared, pairwise HA 1)
Sample deployment (shared, pairwise HA 2)
Pros and cons (shared, pairwise HA)
▪ Pros
  ▪ Solves “split brain” scenario
  ▪ May have slightly lower latency
▪ Cons
  ▪ Greater learning curve
  ▪ More complexity
Recipe 4: advanced HA
Advanced HA
▪ Each service (potentially) split onto dedicated box
▪ Daemons are paired and fail over with heartbeat, redhat clustering
▪ DB failover
▪ Replicated or shared FS
Advanced HA
This is less of a recipe, and more of a general pattern. Each unique server role has to have N-way redundancy. Complexity of config is high; we recommend involvement of professional services.
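A rough way to reason about N-way redundancy per role is the standard availability arithmetic below. This Python sketch assumes failures are independent (real deployments share power, network, and storage, so treat the result as an upper bound), and the role names and per-server availability figures are made up for illustration.

```python
from math import prod
from typing import Dict, Tuple


def role_availability(single: float, n: int) -> float:
    """A role is up if at least one of its n independent servers is up."""
    return 1.0 - (1.0 - single) ** n


def stack_availability(roles: Dict[str, Tuple[float, int]]) -> float:
    """The stack is up only if every role is up (roles are in series)."""
    return prod(role_availability(a, n) for a, n in roles.values())


# Hypothetical figures: (availability of one server, number of servers).
unpaired = {"scheduler": (0.99, 1), "resource manager": (0.99, 1), "database": (0.99, 1)}
paired = {"scheduler": (0.99, 2), "resource manager": (0.99, 2), "database": (0.99, 2)}

print(f"unpaired stack: {stack_availability(unpaired):.6f}")
print(f"paired stack:   {stack_availability(paired):.6f}")
```

Because roles multiply in series, one role left without redundancy caps the availability of the whole stack, which is why each unique role needs its own redundancy.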
Pros and cons (advanced HA)
▪ Pros
  ▪ Can meet very aggressive SLAs
  ▪ Can be tailored and fine-tuned
▪ Cons
  ▪ Major implementation effort
  ▪ Requires sophisticated learning and monitoring
General Observations
▪ Important to audit
▪ Super-fast failover not a goal in our recipes
▪ Security implications
▪ Not perf enhancer
▪ Not scalability enhancer
▪ Not DR
More Info
Whitepaper now available. Email me ([email protected]) for a copy, or download from /documents/ha-moab-cloud-hpc.pdf. Documentation for Hopper release includes a new HA task guide for simple, built-in HA configuration.