1 Getting Ready for Mainstream Parallel Computing
Burton Smith, Cray Inc.
2
Parallel computing is upon us†
Uniprocessors are nudging against performance limits
  More transistors per processor increases watts, not speed
Meanwhile, logic cost ($/gate-Hz) continues to fall
  What are we to do with all that hardware?
New “killer apps” will probably need more performance
  Semantic information processing and retrieval
  Personal robots
  Better human-computer interfaces
Newer microprocessors are multi-core and/or multithreaded
  So far, it’s just “more of the same” architecturally
The big question: what are parallel computers good for?
  This is a question for both hardware and software
  Hardware has gotten ahead of software, as usual
†And it’s about time!
3
Parallel software is almost rudderless
We don’t have a good parallel language story
  MPI? PGAS languages?
We don’t have a good debugging story
  TotalView? gdb?
We don’t have a good resource management story
  Some-weight-or-other kernels? Linux? Lustre?
We don’t have a good robustness story
  Malware resistance? User-generated checkpoints (or else)?
4
Better parallel languages
A better parallel programming language could make programmers more productive by:
  Making data races impossible
  Implementing higher-level operations
  Presenting a transparent performance model
  Exploiting architectural support for parallelism
    Global memory operations
    Light-weight synchronization
  Enabling a more abstract specification of program locality
Give the programmer the hard parts and nothing else
5
Are functional languages the answer?
Imperative languages schedule values into variables
  Parallel versions of such languages encourage data races
  The basic problem: stores don’t commute with loads
Functional languages avoid races by avoiding variables
  In a sense, they compute new constants from old
  Data races are impossible (because loads do commute)
  We can make dead constant reclamation pretty efficient
Program transformation is enabled by these semantics
  High-level operations become practical
Programmer productivity is much improved
  Caution: it is pretty easy to opacify performance
Abstracting locality is much easier in functional languages
  Copying is safer, for example
6
The bad news
There is no notion of state in functional languages
Attempts to add state while preserving commutativity:
  Applicative State Transition systems (Backus)
  Monads (Wadler et al.)
  M-structures (Arvind et al.)
A related fact: functional programs are deterministic
  Introducing state leads to non-determinism (e.g. races)
Some kinds of nondeterminism are good
  Any ordering that does not affect final results is OK
  Only the programmer knows where the opportunities are
How can we tell good non-determinism from bad?
7
A (serial) histogramming example
const double in[N];   //data to be histogrammed
const int f(double);  //f(x) < M is the bin of x
int hist[M];          //histogram, initially 0
for (i = 0; i < N; i++) {
    int bin = f(in[i]);
    hist[bin]++;
}

/* Loop invariant:  forall (int k): hist[k] = |{ j | 0 <= j < i  and  f(in[j]) == k }| */
/* Postcondition:   forall (int k): hist[k] = |{ j | 0 <= j < N  and  f(in[j]) == k }| */
Don’t try this in parallel with a functional language!
8
Histogramming in parallel
const double in[N];   //data to be histogrammed
const int f(double);  //f(x) < M is the bin of x
int hist[M];          //histogram, initially 0
forall i in 0..N-1 {
    int bin = f(in[i]);
    lock hist[bin];
    hist[bin]++;
    unlock hist[bin];
}

/* Invariant:     forall (int k): hist[k] = |{ j in S | f(in[j]) == k }|,
   where S is the set of values i processed “so far” */
/* Postcondition: forall (int k): hist[k] = |{ j | 0 <= j < N  and  f(in[j]) == k }| */

The loop instances commute with respect to the invariant
Premature reads of hist[] get non-deterministic garbage
9
What do the locks do?
The locks guarantee the integrity of the invariant
  They protect whatever makes the invariant temporarily false
As long as invariants describe all we care about in the computation and forward progress is made, all is well
  We have non-determinism “beneath the invariants”
  In the example, the set captures that non-determinism
Pretty clearly, the locks need to be lightweight
  Barriers won’t do the job
Can we automate or at least verify lock insertion?
  If we had a language for the invariants, maybe so
A constructive step is to let the language handle the locks
  Efficiency with safety is one reason
10
Atomic sections
const double in[N];   //data to be histogrammed
const int f(double);  //f(x) < M is the bin of x
int hist[M];          //histogram, initially 0
forall i in 0..N-1 do atomic {
    int bin = f(in[i]);
    hist[bin]++;
}

This abstraction permits implementation mechanisms other than locking
It works better interprocedurally
All of the proposed HPCS languages have something like it
11
Operations on multiple objects
node *m;  //a node in an undirected graph
/* Invariant: forall (node m) forall (node n): n in (m->nbr)*  <=>  m in (n->nbr)* */
atomic {  //remove *m from the mesh
    for (n = m->nbr; n != NULL; n = n->nbr) {
        //remove links between *n and *m
        for (p = n->nbr; p != NULL; ...  //etc.
}

A naive implementation would routinely deadlock
If a sequence would deadlock or fail, preservation of the invariant requires that it be “undone”, reversing its side effects
In other words, what we need is linguistic support for nestable multi-object atomic transactions
Operating system implementers really need this stuff
12
Better debugging
Don’t worry about:
  The ability to single-step any thread
Instead, worry about:
  Conditional breakpoints
    For both programs and data
    Ad-hoc conditional expressions
  Whole-program data perusal
    Declaration awareness
    Run-time data structures, e.g. queues
    Ad-hoc data visualization
    Ad-hoc data verification
  A much higher-level user interface language
    Don’t make the GUI mandatory (what generates code?)
Anyone familiar with duel?
  gdb> duel #/(root-->(left,right)->key)
13
Debugging shared memory parallelism
Data races are a perennial problem
  A more disciplined programming paradigm, e.g. using transactions systematically, would help a lot
Another big issue is verifying invariants of data structures
  Anyone else ever read a dump with this end in view?
Invariants could be checked continually, say with daemons
  Denelcor debugged Unix System 3 on the HEP that way
  Trouble is, the daemons race with the code being checked
Of course the daemons can transact along with everyone else
  If transactions are protecting each invariant’s domain, a daemon will always see a consistent state by definition
As long as we are at it, why not routinely make sure the invariant is restored whenever a transaction commits?
14
Better resource management
Parallel computers have a plethora of resources to manage
  And well-virtualized resources of a given type are fungible
Resource allocation is all about enforcing policies while satisfying constraints
  One policy, e.g. for process scheduling, never fits all users
On the other hand, policy implementations are usually so mixed up with mechanism that they are tough to tweak
  It’s certainly not something we want users doing
Why do we use imperative languages to implement policy?
  Declarative programming, e.g. CLP, may be a better fit
15
What the heck is CLP?
CLP is Constraint Logic Programming: guarded Horn clauses, say, augmented by predicates over some domain
  CLP(R) adds <, =, etc. over the real numbers
  Another flavor is CLP(FD), over a finite domain
  Unification is extended appropriately to the predicates
The cool thing about CLP is that it directly enforces the constraints and searches for a solution that satisfies them all
There is probably a way to make CLP parallel if need be
  In any event it needs to be able to transact against the current state of the resource requests and allocations
We are exploring these ideas in Cascade, our HPCS project
16
Better robustness
Our computer systems are too easy to attack
  Prevention is best, but detection is a decent runner-up
Continuous verification of a system’s properties is a direct way to detect intrusion
  Violations of resource allocation policy
  Loss of data structure integrity
The debugging discussion applies here
  Transactions are needed to make it work
Another enemy of robustness is hardware failure
  Hardware error checking is necessary but hardly sufficient
  Continually checking invariant properties of programs, with daemons or upon transaction commits, is the rest of the story
But what should we do to recover from failures?
  Checkpoint/restart is the usual answer
17
How to checkpoint
Don’t do it
  Just re-run or re-boot as necessary
Let the user do it
  Tell the OS where the restart file(s) and entry point are
Let the compiler do it
  Ditto, hopefully at a “good time”
Let the OS do it
  Maybe at times specified by the user or the compiler
  Do it directly to disk
  Use 2x the memory and RAID, and/or buffer to disk
  Use COW and transactions and do it concurrently and incrementally
Mix the above strategies
How do we decide?
18
Productivity
Productivity is utility U divided by cost C:

    Ψ(A, S) = max U(t(A, S)) / C(A, S)

  What one gets for what one spends
  We want to maximize it
Utility is generally a function of the time-to-solution
  It is positive, non-increasing, and eventually zero
Here we consider execution time t(A, S), which depends on:
  The application A
  The system S
These two parameters also strongly influence the cost
We assume the application is fixed in what follows
19
The rental execution cost model
Plausibly, the cost of a run is linear in the time used:

    C(A, S) = R(S) t(A, S)

The rate might be the life-cycle cost of the system times the fraction used, over the system’s useful life:

    R(S) = life_cost(S) × frac_used(S) / useful_life(S)

and so

    Ψ(A, S) = max U(t(A, S)) / (R(S) t(A, S))

For a given system under the rental model, productivity strictly decreases with compute time until it becomes zero when the utility reaches zero
The implication is an injective functional relationship between time and productivity under the rental model
20
Merits of the productivity approach
Utility functions quantify the value of timeliness
  One example is a hard deadline (“box”) utility function
Productivities can be computed from execution times
Given a probability distribution for productivity, we can compute other quantities readily:
  The mean and variance of the productivity
  The probability that the productivity is greater than a particular value
The impact of policies and system configuration decisions can be assessed, e.g.:
  Is a more expensive but more reliable system worthwhile?
  How frequently should we checkpoint, if at all?
  Should memory or disks be added to speed up checkpointing?
21
Estimating total running time
The distribution of total running time t is determined by:
  Failure process parameters, e.g. the failure rate θ
  The no-failure execution time Tx
  The execution interval between checkpoints Ti
  The operating system start-up time To
  The time to perform a checkpoint Tc
  The time to perform a restart Tr
Without checkpointing, job execution looks like this:
  [Timeline: To, then Tx; after each failure, another To and Tx restarted from the beginning]
With checkpointing, job execution looks like this:
  [Timeline: To, then alternating Ti and Tc intervals; after each failure, To and Tr, then re-execution of the interrupted interval]
22
A Markov process model
Here we have a nontrivial Markov process:

  [State diagram: states 0, 1, 2, …, n. From each state, event b (probability q) advances to the next state; event c (probability p) is a failure, after which event d (probability q) restores the state from its checkpoint and event a (probability p) repeats the start-up.]

where

    q = prob(no failure during Ti + Tc) = e^(−θ(Ti + Tc))
    p = prob(failure during Ti + Tc) = 1 − q

and the delays associated with the events a–d are

    a: To                b: Ti + Tc
    c: the mean time to a failure, given that one occurs
    d: To + Tr
24
Mean total time
Computing the mean total time, we get

    t̄ = To + (n p / (q θ)) e^(θ(To + Tr))

Expanding,

    t̄ = To + (n/θ) e^(θ(To + Tr)) (e^(θ(Ti + Tc)) − 1)
       = To + (n/θ) (e^(θ(Ti + Tc + To + Tr)) − e^(θ(To + Tr)))

This approaches n(Ti + Tc) when the MTBF θ^(−1) is large compared to Ti + Tc and to To + Tr
25
Optimizing the checkpoint interval
Setting n = Tx / Ti allows determination of the optimal checkpoint interval Ťi:

    0 = ∂t̄/∂Ti = (Tx/θ) e^(θ(To + Tr)) (θ Ti e^(θ(Ti + Tc)) − e^(θ(Ti + Tc)) + 1) / Ti²

whence

    e^(θ(Ťi + Tc)) (1 − θ Ťi) = 1

This is identical to Daly’s result†

† Equation (9) in Daly, J.T., “A Strategy for Running Large Scale Applications Based on a Model that Optimizes the Checkpoint Interval for Restart Dumps”, Proc. Int. Conf. Supercomputing, St. Malo, France, June 2004
26
Optimizing checkpoint bandwidth
Disks can be bought to reduce t̄ by reducing Tc (and Tr)
Whether it is worth doing depends on the utility
Assuming a rental cost model, constant utility, an optimal checkpoint interval, and a disk rental “rate” of D dollars/s per image/s:

    Ψ = U / ((R + D/Tc) t̄)

Letting Tc = Tr and basing the cost on the expected total time,

    0 = ∂/∂Tc [(R + D/Tc) t̄]

The solution looks something like

    Ťc = (−D + √(D² + 4 D R f(·,·))) / (2 R)

for some function f of the failure and checkpoint model parameters
27
Conclusions
It is time to take parallel computing seriously
We have a lot of architectural work to do in software
  Not that the hardware is all done!
  Software architecture is needed to define and refine it
New ways of thinking about programming may be useful
New ways of thinking about performance are also needed