LISP AND SYMBOLIC COMPUTATION: An International Journal, ?, ??–??, 1994
© 1994 Kluwer Academic Publishers – Manufactured in The Netherlands

A Concurrent Abstract Interpreter

STEPHEN WEEKS ([email protected])
Dept. of Computer Science, Carnegie-Mellon University, Pittsburgh, PA

SURESH JAGANNATHAN ([email protected])
NEC Research Institute, 4 Independence Way, Princeton, NJ

JAMES PHILBIN ([email protected])
NEC Research Institute, 4 Independence Way, Princeton, NJ

Keywords: Abstract Interpretation, Control-flow Analysis, Concurrency, Multi-threaded Computing

Abstract. Abstract interpretation [6] has long been regarded as a promising optimization and analysis technique for high-level languages. In this article, we describe an implementation of a concurrent abstract interpreter. The interpreter evaluates programs written in an expressive parallel language that supports dynamic process creation, first-class locations, list data structures, and higher-order procedures. Synchronization in the input language is mediated via first-class shared locations. The analysis computes intra- and inter-thread control- and dataflow information.

The interpreter is implemented on top of Sting [12], a multi-threaded dialect of Scheme that serves as a high-level operating system for modern programming languages.

1. Introduction

Abstract interpretation [6] has long been regarded as a promising optimization and analysis technique for high-level languages. However, since abstract interpretation typically involves manipulation of an approximate program state, it has been difficult to realize linear-time implementations when the approximate state is complex. On the other hand, an interpreter that manipulates a simple abstract state in which many significant details of the exact program state are collapsed may be efficient to implement, but may not be very useful as an optimization tool.

One way of making sophisticated abstract interpreters useful in practice is to transform their internal structure into a form suitable for parallelization. In such an implementation, many components of a program interpretation will occur in parallel. This implementation strategy has the potential to scale well as input program size increases; since the input to an abstract interpreter is a source program, opportunities for useful and significant concurrency increase with program size. However, parallelization of programs


such as abstract interpreters would be difficult to achieve in parallel dialects of languages such as C or Fortran. This is because abstract interpreters generate data dynamically, the data sets consist of objects of irregular type and structure, and data dependencies are difficult to detect statically.

In this article, we describe an implementation of a concurrent abstract interpreter in a parallel dialect of Scheme. The interpreter evaluates programs written in an expressive parallel language that supports dynamic process creation, first-class locations, list data structures, and higher-order procedures. Synchronization in the input language is mediated via first-class shared locations. The design of the language was motivated by our desire to have a simple framework capable of expressing many interesting parallel programming abstractions and paradigms. For example, the essential characteristics of futures [9], distributed data structures [4], concurrent objects [10], and synchronization primitives such as replace-and-op [2, 7] are all expressible using the primitives provided in this language. Thus, we consider a concurrent implementation of an abstract interpreter whose input language itself supports parallel constructs. In the description given here, the implementation of the concurrent interpreter is not expressed in the input language it evaluates, but there is no a priori reason why this could not be done.

The analysis computes intra- and inter-thread control- and dataflow information. Such information can be used to facilitate a number of important optimizations that would otherwise be realized via ad hoc analyses; lifetime and sharing analysis, static scheduling, prefetching, test for presence, and process/data mappings are a few important examples we have considered. Our discussion does not develop the formal construction of the abstract interpretation, details of which are provided in [14]. The ability to effectively analyze high-level parallel languages statically has important implications for implementing high-performance parallel systems well-suited for symbolic processing.

Broadly speaking, we can regard the abstract interpreter as a constraint-based evaluator. The abstract state is represented as a directed graph in which nodes correspond to expressions and edges represent the flow of data. Each node stores an abstract value which corresponds to the value of the expression at that node. Each edge can be viewed as a (subset) constraint on the abstract values stored at the nodes it connects. The evaluation of an expression can change the abstract value stored at a node, which in turn may violate a number of constraints (i.e., edges) emanating from that node. This may cause the evaluation of further expressions, the violation of further constraints, and so on. As a further complication, because the input language contains data structures and higher-order procedures, edges may be added dynamically. The interpreter terminates when all constraints are


satisfied.

Despite its complexity, structuring an interpreter in this way has the important benefit of making the interpreter amenable to a parallel implementation. Because the interpreter is organized in terms of a collection of local rules (constraints between pairs of nodes), many of its components may be able to evaluate concurrently.

One obvious concurrent implementation of the interpreter would create a new lightweight thread for each constraint that is violated; thus, the resulting runtime behavior would be very fine-grained, especially if little computation is required per node. We rejected this implementation, however, because it is not amenable to runtime optimizations such as lazy task creation [18] or thread absorption [12] that can be effectively used in certain instances to obviate the cost of fine-grained parallelism on stock multiprocessor platforms. These optimizations reduce the overhead of creating many execution contexts for fine-grained tasks by allowing threads that exhibit manifest data dependencies with one another to share the same execution context (i.e., stack, registers, etc.); these techniques thus effectively increase thread granularity transparently at runtime.

The threads generated by a fine-grained concurrent implementation of the interpreter, however, will often not exhibit any manifest data dependencies with any other thread. Because not all dependencies in the graph are known statically, and because constraints generated by one thread must often be propagated to many nodes in the abstract state, a fine-grained concurrent implementation of the interpreter will typically entail the creation of many more execution contexts than can be efficiently implemented on multiprocessor platforms that do not provide explicit hardware support for fine-grained concurrency [1, 11].

We therefore chose a more coarse-grained implementation strategy. In this version, the abstract state is partitioned, and a thread is assigned to manage each partition. The runtime organization of a program's abstract state is given in terms of shared data structures that are concurrently updated by these threads as new abstract values are generated and refined. This implementation places significant communication and memory management burdens on Sting [12, 13], the runtime kernel on which the interpreter is implemented, and serves to highlight a number of interesting issues in programming dynamic and irregular parallel computations.

The remainder of this article is organized as follows. Section 2 describes the target language for the interpreter. Section 3 introduces the control- and dataflow problem in this context. Section 4 gives a brief description of the exact and abstract semantics for a kernel language we develop. Section 5 describes the rules used by the interpreter to implement the abstract semantics. Section 6 discusses a sequential implementation. Section 7


describes a parallel extension to the sequential implementation. Section 8 presents some benchmark timings, and Section 9 gives conclusions.

2. The Target Language

Our abstract interpreter is concerned with computing control-flow and dataflow information for an expressive higher-order parallel language. Our language (see Fig. 1) has constants, variables, procedures, primitives, call-by-value procedure applications, conditionals, mutual recursion, and a process creation operation. The primitives include operations for creating and accessing pairs and shared locations. Constants include integers and booleans.

    e ∈ Exp
    c ∈ Const = Int + Bool
    x ∈ Var
    f ∈ Func
    p ∈ Prim = {cons, car, cdr, mk-loc, write, read, remove, ...}

    e ::= c
        | x
        | f
        | p
        | e1(e2, e3, ..., en)  |  (e)
        | if e then e1 else e2
        | letrec x1 = f1 ... xn = fn in e
        | spawn e

    f ::= λ x1 x2 ... xn. e

Figure 1: The kernel language.

Since processes can be instantiated dynamically, and since synchronization is mediated via shared locations, this language can be used to describe a number of interesting synchronization and communication paradigms. The language's sequential core defines a higher-order functional language with list data structures; its parallel extensions provide functionality found in parallel Lisps [9, 16], ML threads [19], or Linda [3]. Concurrency is introduced through a spawn operation that creates a lightweight thread to evaluate its input argument. Synchronization and communication are


realized exclusively through shared variables. There are four operations over shared locations:

1. (mk-loc) creates a new location and marks it unbound.

2. read(x) returns the value at location x, blocking if x is empty.

3. write(x, v) writes v into the location denoted by x.

4. remove(x) behaves like read, except that in addition to reading the value stored at location x, it also marks x as unbound.

To illustrate the language's utility, we show the following translation of the MultiLisp [9] expression, (future e):

    let loc = (mk-loc)
    in begin
         spawn write(loc, e)
         loc
       end

The MultiLisp synchronization operation (touch v) is equivalent to:

    if location?(v)
    then read(v)
    else v

(Note that "let" and "begin" are syntactic sugar for simple and nested application, respectively.)
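Because read and remove block until a location is bound, shared locations also suffice to encode mutual exclusion. The sketch below is ours, not the paper's; it uses Scheme-style syntax and assumes the location primitives above are bound as ordinary procedures. The location holds a token whose presence means the lock is free.

    ;; A hedged sketch: a mutex built from one shared location.
    ;; mk-loc, write, and remove are the kernel primitives described
    ;; above, assumed here to be available as Scheme procedures.
    (define lock (mk-loc))
    (write lock #t)                ; deposit a token: lock starts free

    (define (with-lock thunk)
      (remove lock)                ; acquire: blocks until the token is
                                   ; present, then marks lock unbound
      (let ((result (thunk)))
        (write lock #t)            ; release: restore the token
        result))

    ;; e.g., (spawn (with-lock (lambda () ...critical section...)))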


3. Control and Dataflow Analysis

Optimizing compilers for high-level sequential languages typically perform aggressive control- and dataflow analysis which can be used to facilitate a number of important optimizations; notable examples include lifetime and escape analysis, type recovery [23, 24], efficient closure representation [17, 22], constant folding, and copy propagation.

In languages such as Scheme or ML, however, inter-procedural analysis is complicated. The difficulties arise primarily because of higher-order procedures and polymorphic data structures. Consider the example shown in Fig. 2.

    let f = λ (x y)
              if null? x
              then λ () y
              else let z = (x)
                   in ... z ...
    in let g = cons(let v = exp in λ () f(nil, v),
                    nil)
       in f(car(g), w)

Figure 2: Higher-order programs have non-trivial control-flow and dataflow properties.

A useful control-flow analysis of this program would reveal a number of interesting properties about this program fragment:

1. v escapes outside the call in which it occurs.

2. The closure passed to cons is applied in f's body indirectly via car.

3. The application of x in the conditional's else branch returns a closure corresponding to λ () y, where the binding for y is either the value of exp or the value of w.

4. x is bound to nil or the procedure λ () f(nil, v).

To obtain this kind of information, a control-flow analysis must be capable of selectively collapsing environment information associated with multiple instantiations of the same procedure, and of recording dataflow into and out of complex data structures such as lists.

In the presence of concurrency, the control-flow analysis problem is exacerbated. In addition to recording useful closure information, an analysis is also faced with the task of computing dynamic interleaving information among threads. In order to be practical, however, the interpreter cannot afford to generate all potential interleavings; certain interleaving information must be collapsed. Furthermore, an analysis that operates over a language of the kind described in the previous section must also track the movement of shared locations, since optimizations relating to synchronization and communication depend on knowing how and where these locations are used.


4. Semantics

A detailed description of the exact and approximate semantics for our kernel language is given in [14]; we provide a brief outline here.

Both the exact and abstract semantics are defined in terms of labels. There are three kinds of labels: value, environment, and spawn, denoted by ℓv, ℓe, and ℓs, respectively. Every expression in the program is assigned a unique value label, which is used to store the value of the expression. Every lambda and spawn expression is assigned a unique environment label. Additionally, each spawn expression is assigned a unique spawn label. In the exact semantics, we say an expression is being evaluated in spawn label ℓs if the thread which is evaluating the expression was created by the spawn expression labeled with ℓs. Note that this is a dynamic notion, not a syntactic one.

In the abstract semantics, a pair ⟨ℓv, ℓs⟩ serves as an index into the abstract state. Associated with this index is an abstract value that "combines" or "joins" all values to which the expression with label ℓv may evaluate in spawn label ℓs. Similarly, a pair ⟨ℓe, ℓs⟩ is an index to an abstract environment that "combines" all environments which may exist at label ℓe while evaluating in spawn label ℓs.

In both the exact and abstract semantics, labels serve a dual role as components of values. In the exact semantics, pointers and heap entries are created for cons cells, shared locations, and closures. Similar "abstract pointers" are created in the abstract semantics. For example, an "abstract pointer" to a cons cell is represented by a pair ⟨ℓv, ℓs⟩, where ℓv is the label of the cons expression and ℓs is the label of the thread which created the cell. Intuitively, this abstraction joins the values of all cons cells which were created by the cons expression labeled with ℓv while being evaluated in ℓs. An abstract value is simply a set of abstract pointers. Abstract environments are joined componentwise; in other words, an abstract environment is a function from variables to abstract values.
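As a purely illustrative rendering of these definitions (ours, not the paper's), indices and abstract values can be represented directly; here labels are Scheme symbols, an index ⟨ℓv, ℓs⟩ is a two-element list, and an abstract value is a list-set of abstract pointers:

    ;; Minimal sketch: indices and the join of abstract values.
    (define (make-index lv ls) (list lv ls))

    ;; Join two abstract values: set union over abstract pointers.
    (define (join v1 v2)
      (cond ((null? v1) v2)
            ((member (car v1) v2) (join (cdr v1) v2))
            (else (cons (car v1) (join (cdr v1) v2)))))

    ;; e.g., joining the cons cells reaching one index from two threads:
    ;; (join '((cons3 s1)) '((cons3 s2)))  =>  ((cons3 s1) (cons3 s2))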


5. Rules

An abstract state consists of two maps, one from pairs ⟨ℓv, ℓs⟩ to abstract values and one from pairs ⟨ℓe, ℓs⟩ to abstract environments. Because an abstract environment is a function from variables to abstract values, we can generalize further and view an abstract state as a single map from indices to abstract values, where an index is either a pair ⟨ℓv, ℓs⟩ or a triple ⟨ℓe, ℓs, x⟩; x represents a program variable in the latter case.

The abstract semantics is defined by a transition function, T, that maps abstract states to abstract states. The abstract transition function for a particular program is given by a collection of rules, one for each index, where the rule for a particular index i takes an abstract state and describes how to compute the abstract value that arises at i after taking one "step" in the program. There are fairly complex rule schemas which describe how to define the rule for each index for a particular program; [14] provides details.

We use the metavariable i to range over indices and S to range over abstract states. We use S(i) to denote the abstract value at index i in abstract state S. Finally, viewing a rule as a function from abstract states to abstract values, we use Ri(S) to indicate the abstract value at index i after taking a step in S. For the purposes of this article, the details of the rules are unimportant. The rules can, however, be broadly classified into one of four categories; each rule is given as a union over some set of values in the abstract state:

1. Constant. Constant rules are used for indices whose values are syntactically apparent. Indices which correspond to numbers, cons expressions, mk-loc expressions, and lambda expressions fall in this category. Note that for the latter three, a singleton set containing an abstract pointer is the value.

2. Static. Static rules are used to compute values where the control flow is apparent syntactically. For example, the value of an if-expression is the join (union) of the values of the consequent and alternative branches, whose indices are syntactically apparent. A similar type of rule describes the value of an expression which references a variable, since that variable is stored in the enclosing abstract environment.

3. Dynamic From Pointers. These rules describe information that flows from a pointer when it arrives at a particular index. For example, the value of the index of a car expression depends on the cons pointers that constitute the value of its subexpression. A similar rule describes cdr expressions and read expressions on locations.

4. Dynamic To Pointers. These rules describe information that flows back to a pointer based on the indices whose values contain the pointer. For example, the values that flow into an abstract location depend on the set of indices corresponding to expressions that write to that location.
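To make the flavor of these categories concrete, the following schematic constraints are one way to write a static rule and a dynamic-from-pointers rule; the notation is ours, not the rule schemas of [14]. For a conditional whose consequent and alternative carry value labels ℓ1 and ℓ2, and for a car expression whose subexpression carries value label ℓa:

    R⟨ℓif, ℓs⟩(S)   =  S(⟨ℓ1, ℓs⟩) ∪ S(⟨ℓ2, ℓs⟩)

    R⟨ℓcar, ℓs⟩(S)  =  ⋃ { car-field(p) | p ∈ S(⟨ℓa, ℓs⟩), p an abstract cons pointer }

Here car-field(p) denotes the abstract value recorded for the car component of the abstract cons cell p; the dependence of ⟨ℓcar, ℓs⟩ on the car fields of whatever pointers arrive in S(⟨ℓa, ℓs⟩) is exactly the kind of dynamically discovered edge described above.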


6. Implementation

Abstract values are ordered by set inclusion; abstract states are ordered pointwise. If S0 abstracts the set of initial program states, then the least fixed point we seek is the limit of the chain:

    S0, T(S0), T²(S0), ..., Tⁿ(S0), ...

The least fixed point may be computed by the algorithm in Figure 3.

    Snew ← S0
    repeat
        Sold ← Snew
        Snew ← T(Sold)
    until Sold = Snew

Figure 3: Naive Algorithm

This algorithm has a straightforward implementation. At each iteration of the loop, this implementation (1) creates a new state, (2) computes the value at each index of the new state as given by the rule at that index, and (3) tests for equality between the new state and the previous state. Unfortunately, this implementation suffers from many inefficiencies. First, it requires the construction of a new state for each iteration of the loop. It would be more efficient to keep a single state and update it destructively. Second, there may be many indices i such that Snew(i) = Sold(i); e.g., the abstract values produced by constant and static rules are invariant across iterations in the fixpoint computation. We would like to eliminate the unnecessary recomputation of these unchanged values, and only compute the entries which might change. Third, the test for equality requires comparing each entry of the new state with the corresponding entry in the old state (a set comparison). As before, we would like to remove as much redundant computation as possible from this test, and only compare entries that may have changed.

These concerns can be addressed by an implementation based on a different view of the abstract transition function. Instead of viewing this function as defining a set of rules that describe how to compute entries of a new abstract state, we can view the rules as constraints on the current abstract state that must be satisfied. Let Ri be the rule that exists at index i in a program. By the definition of T, there exists an index i in abstract state S such that S(i) ≠ Ri(S) if and only if S ≠ T(S). The algorithm in Figure 4 is based on explicitly maintaining a set of such indices.


    S ← S0
    I ← I0
    while I ≠ ∅
        remove some index i from I
        if S(i) ≠ Ri(S)
          then S(i) ← Ri(S)
               I ← I ∪ Di

Figure 4: Constraint Algorithm

It maintains the invariant that at the beginning of each iteration, the index set I contains all indices i such that S(i) ≠ Ri(S).

For each iteration, the algorithm chooses an index i from I, computes Ri(S), and if new information is found, updates S to reflect the new information. This modification of the state may in turn violate other constraints, and new indices may need to be added to I to maintain the invariant. An efficient implementation of this algorithm requires being able to determine which indices must be recomputed when S(i) changes. The simplest solution is to associate this information with each index. Let Di denote the set of indices that depend on S(i). Initially, the Di's are given by the static rules of T. As the fixed point is computed, and abstract values arrive at certain indices, the Di's are augmented to reflect additional dependencies induced by the dynamic rules of T. To do so requires that a small amount of additional information be stored for certain abstract pointers.

When the algorithm terminates, we have for all indices i that S(i) = Ri(S), hence S = T(S). It can also be proven by induction that at each iteration, S is less than or equal to the least fixed point of T. The proof relies on the monotonicity of T, which is evident from its definition.

One inefficiency of the constraint algorithm is that for each i′ ∈ Di, the entire union as given by Ri′ will be recomputed upon changing S(i). Much of this computation is, in fact, redundant, and can also be removed. By the monotonicity of T we know that any change to an abstract state S at index i must be an increase. Hence, we only need to compute the effect of the new information on the union. To be more precise, let P be a set that is to be added to S at index i, and let Pdiff be P − S(i). The only information to be propagated to indices i′ which depend on i is Pdiff; it is not necessary to recompute Ri′. The algorithm in Fig. 5 explicitly maintains a list of pairs ⟨i′, P⟩ where P is a set containing new information that must be added to the union at S(i′). L0 contains sets of pairs associated with indices that have static or constant rules.


    S ← S0
    L ← L0
    while L ≠ ∅
        remove some pair ⟨i, P⟩ from L
        let Pdiff be P − S(i)
        in if Pdiff ≠ ∅
             then S(i) ← S(i) ∪ Pdiff
                  for each i′ ∈ Di
                      add ⟨i′, Pdiff⟩ to L
    end

Figure 5: Dependency Algorithm

Essentially, this algorithm views the abstract state and the rules as a graph, where the nodes are indices, and the adjacency list for node i is given by Di. An edge (i, i′) represents the fact that information flows from index i to i′. Thus, if there is an edge from i to i′, then in a fixed point of T, S(i) ⊆ S(i′). Whenever new information is added to a node, this new information is propagated along all outgoing edges of the node. The algorithm is similar in spirit to the "worklist" approach of Hall and Kennedy [8].
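The following sketch renders the dependency algorithm in portable R7RS Scheme; it is our illustration, not the paper's T implementation. Indices are symbols, the tables S and D are association lists, abstract values are list-sets, and the worklist L holds ⟨index, contribution⟩ pairs:

    (import (scheme base) (scheme write))

    (define S '())   ; abstract state: alist mapping index -> value set
    (define D '())   ; dependents:     alist mapping index -> index list
    (define L '())   ; worklist of (index . contribution) pairs

    (define (lookup tbl i)
      (cond ((assoc i tbl) => cdr)
            (else '())))

    (define (set-minus p s)                     ; Pdiff = P - S(i)
      (cond ((null? p) '())
            ((member (car p) s) (set-minus (cdr p) s))
            (else (cons (car p) (set-minus (cdr p) s)))))

    (define (add! i p) (set! L (cons (cons i p) L)))

    (define (run!)
      (unless (null? L)
        (let ((i (caar L)) (p (cdar L)))
          (set! L (cdr L))
          (let ((pdiff (set-minus p (lookup S i))))
            (unless (null? pdiff)
              ;; S(i) <- S(i) U Pdiff; newest binding shadows older ones
              (set! S (cons (cons i (append pdiff (lookup S i))) S))
              ;; propagate the new information along outgoing edges
              (for-each (lambda (j) (add! j pdiff)) (lookup D i)))))
        (run!)))

    ;; Tiny example: an abstract pointer flows along edges a -> b -> c.
    (set! D '((a b) (b c)))
    (add! 'a '(ptr))
    (run!)
    (display (lookup S 'c)) (newline)   ; prints (ptr)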


7. The Parallel Interpreter

In the following, we use thread to refer to a lightweight asynchronous process defined as part of the interpreter implementation.

We parallelize the dependency algorithm of the previous section by partitioning the abstract state. Thus, we define a function π : Index → {1, 2, ..., K}, where K is the number of partitions. For each partition k, a worker thread Wk is created which is responsible for computing S(i) for all i such that π(i) = k. Worker Wk is also responsible for communicating any new information it computes to other partitions, as dictated by its outgoing dependency edges. The program structure resembles a master–slave computation. The master is described in Fig. 6.

    S ← S0
    for k = 1 to K
        Lk ← Lk0
        Rk ← ∅
    for k = 1 to K
        create worker Wk

Figure 6: Master

Each worker thread has both a local queue (L) and a remote queue (R) of ⟨i, P⟩ pairs that contain additional information that must be joined to nodes in its partition. The structure of a worker is very similar to the dependency algorithm of the previous section. When new information is found, however, it must be added to the appropriate worker's queue. The code for a worker is shown in Fig. 7.

    while Lk ≠ ∅ or Rk ≠ ∅
        let ⟨i, P⟩ be if Lk ≠ ∅
                        then dequeue(Lk)
                        else dequeue(Rk)
            Pdiff be P − S(i)
        in if Pdiff ≠ ∅
             then S(i) ← S(i) ∪ Pdiff
                  for each i′ ∈ Di
                      if π(i′) = k
                        then enqueue(Lk, ⟨i′, Pdiff⟩)
                        else enqueue(Rπ(i′), ⟨i′, Pdiff⟩)
    end

Figure 7: Code for Worker k

Partitioning the input across threads reduces contention on shared data structures. Since each thread has exclusive access to the portion of the abstract state corresponding to its partition, no synchronization operations need to be executed by threads when refining or adding a local constraint. Synchronization costs occur entirely in the management of remote queues. A thread sending work to a remote queue belonging to another thread must have exclusive access to the tail of the queue in the interval in which the message transmission occurs. Work sent to remote threads is implemented as thunks that are evaluated by the receiving thread upon receipt; this


communication paradigm resembles active messages [25].

In our implementation, work found on local queues is done before work found on remote ones. A thread examines its remote queue only when all local work has been completed. Evaluating work found on a remote queue may enable further (local) updates on a partition; these are evaluated before the next item on the remote queue is examined. If there is no work on either the local or remote queues, the thread either spins or blocks; a blocked thread W is resumed by another thread W′ when W′ deposits a new piece of work on W's remote queue. There is one top-level thread T that remains blocked and is resumed only when the program reaches a fixpoint. This happens when all workers become blocked because their local and remote queues are empty. Prior to blocking, a worker increments a counter that is decremented when the thread resumes; when the counter becomes equal to the number of threads initially created, T is resumed. T then terminates all workers and returns the current abstract state.

The parallel program structure in this problem exhibits a great deal of locality: shared data is exclusively owned by a thread (data and process layout is thus SPMD-style), communication occurs exclusively through messages sent on remote queues, and it is assumed that there is a one-to-one correspondence between processors and threads. Depending on the input program and the effectiveness of the partitioning strategy, however, there may be significant communication via remote queues among different tasks. The overheads incurred in message transmission, blocking, and resuming threads are determined partially by the efficiency of the Sting implementation, and partially by the granularity exhibited by the interpreter's sequential component.
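To make the worker discipline concrete, the sketch below extends the earlier Scheme fragment with K partitions and per-partition local and remote queues. Scheduling is simulated round-robin in a single thread, so Sting's locks, blocking, and termination counter are elided; all names here are ours, and S, D, lookup, and set-minus are the definitions from the earlier sketch.

    ;; Illustrative only: K partitioned workers, local work before remote.
    (define K 2)
    (define (pi i) (if (memq i '(a b)) 0 1))       ; toy partition map
    (define local  (make-vector K '()))            ; per-worker local queues
    (define remote (make-vector K '()))            ; per-worker remote queues

    (define (enqueue! q k item)
      (vector-set! q k (append (vector-ref q k) (list item))))

    (define (step! k)                              ; one worker iteration
      (let* ((q (if (pair? (vector-ref local k)) local remote))
             (item (car (vector-ref q k))))
        (vector-set! q k (cdr (vector-ref q k)))
        (let* ((i (car item))
               (pdiff (set-minus (cdr item) (lookup S i))))
          (unless (null? pdiff)
            (set! S (cons (cons i (append pdiff (lookup S i))) S))
            (for-each
             (lambda (j)
               (if (= (pi j) k)
                   (enqueue! local k (cons j pdiff))         ; stays local
                   (enqueue! remote (pi j) (cons j pdiff)))) ; crosses partitions
             (lookup D i))))))

    (define (idle? k)
      (and (null? (vector-ref local k)) (null? (vector-ref remote k))))

    (define (run-workers!)                         ; round-robin until all idle
      (let loop ((k 0) (idle 0))
        (cond ((= idle K) 'fixpoint)
              ((idle? k) (loop (modulo (+ k 1) K) (+ idle 1)))
              (else (step! k) (loop (modulo (+ k 1) K) 0)))))

    ;; e.g., (set! D '((a b) (b c)))
    ;;       (enqueue! local (pi 'a) (cons 'a '(ptr)))
    ;;       (run-workers!)   ; afterwards, (lookup S 'c) => (ptr)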


8. Benchmarks

Our benchmarks were implemented on a 16-processor Silicon Graphics cache-coherent shared-memory multiprocessor. Each node consists of a MIPS R4400 150 MHz processor; the total memory on the system is 512 megabytes. Each processor has a 16-Kbyte data and a 16-Kbyte instruction on-chip primary cache, and a one-megabyte unified off-chip secondary cache. The operating system used is Irix V Release 5.1.1.1.

The benchmarks were executed using compiled versions of both Sting and the abstract interpreter. The definition of the interpreter follows the description of the abstract semantics given in Sections 4, 5 and 7.

8.1. Input Programs

We benchmarked two different applications on the concurrent abstract interpreter. The first is a parallel parser implemented for a significant subset of Scheme [5] written in a functional style, adapted from [20]. In this code, a parser is a function that takes a stream of tokens and a location and writes the resulting parse tree in that location. There are parser constructors for the grammar operations "or" (alt), "concatenation" (seq), and "option" (optional). The only parallelization of the parser is in the function alt, the parser constructor that implements the or grammar operation. It takes two parsers and an input stream and returns the resulting parse tree if either parser succeeds. In the parallel case, threads are spawned to evaluate both parsers. When the first thread is complete, its value is returned if it succeeds; if it fails, the value of the second parser is returned. Note that even if the first thread completes successfully, the evaluation of the second thread is not aborted; the value returned by the second thread is simply ignored.

The second benchmark, self-application, is shown in Fig. 8. Since a significant fraction of the interpreter's complexity lies in managing higher-order procedures, and in propagating new information derived by a function application to other call sites of the applied function, we constructed a benchmark to exercise these facets. While many higher-order parallel programs may not exhibit this kind of structure, we expect more complicated abstract interpreters to exhibit many of the same characteristics shown by our interpreter on this benchmark, even on demonstrably simpler applications [23].

    let thunk = λ ().
          let p1 = λ f. f(f)
              p2 = λ f. f(f)
              ...
          in let i1 = λ x. x
                 i2 = λ x. x
                 ...
             in p1(i1)
                p1(i2)
                ...
                p2(i1)
                p2(i2)
                ...
    in spawn(thunk)
       spawn(thunk)
       ...

Figure 8: A complex higher-order program. The benchmark used evaluated a program with the above structure using 20 self-application nodes, 20 identity procedures, and 20 syntactically distinct spawn expressions.

8.2. Parameters

The benchmarks shown in the following sections measure the time taken to run the abstract interpreter on the input programs described above. We parameterized the structure of the interpreter along three dimensions:

1. Partitioning Strategy. We consider two possible ways of allocating indices in the abstract state to threads. The first is random allocation; in this strategy, the set of nodes managed by a thread is determined randomly. The second is block allocation; in this strategy, a node is allocated to a given partition if it contains a known dependency with some other node in that partition, provided that its inclusion would not cause the number of elements in the partition to exceed the partition's maximum allowed size.

The random partition strategy suffers from poor locality effects: nodes adjacent in the dependency graph may be allocated to different threads running on different processors. It does have the advantage, however, of being less susceptible to poor load balancing; a more even partition of work to threads is likely to occur using random partitioning for abstract states that have irregular or non-uniform structure.

The block partition strategy exploits locality in the dependency graph, at the expense of uneven load balancing for non-uniform, irregularly structured graphs.


2. Waiting Strategy. Recall that a thread with no work in its local queue examines its remote one. In the case where the remote queue is also empty, a thread may either choose to busy-wait or explicitly block. Busy waiting is obviously the preferred alternative when the number of evaluating threads is equal to the number of processors being used, since no useful work will be performed by a processor executing a thread that chooses to block. Blocking, on the other hand, is conceptually cleaner and more general, since it decouples the choice of how many threads (or partitions) are created from the number of processors actually used.

3. Traversal Strategy. Nodes may be examined using either a breadth-first or a depth-first traversal strategy. In a breadth-first implementation, a thread deposits all new work into local and remote queues before examining the contents of these queues. As a result, the order in which information is added to the graph is unpredictable. In the depth-first case, a new piece of work is immediately evaluated before any other; thus, there is no need for local queues, and a stack is sufficient. In this case, information percolates along a given spine in the graph until it crosses a partition boundary or it encounters a node that does not generate any new information. (A small sketch of the two disciplines, in terms of the earlier worklist code, follows this list.)
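In terms of the earlier worklist sketch, the two traversals differ only in how new work enters the pending list; a minimal illustration (ours, not the paper's code):

    ;; Depth-first: treat the worklist as a stack (push at the front).
    (define (add-dfs! i p) (set! L (cons (cons i p) L)))

    ;; Breadth-first: treat the worklist as a FIFO queue (append at the
    ;; tail; linear-time here, but adequate for a sketch).
    (define (add-bfs! i p) (set! L (append L (list (cons i p)))))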


8.3. Execution Times

8.3.1. Sequential Times

As a base case, we measured the time taken by the abstract interpreter to evaluate the input programs in a purely sequential setting. We compiled the interpreter using Orbit [17]; the interpreter was written in T [21] release 3.1. The sequential benchmark uses no partitioning or waiting strategy; it is a faithful implementation of the algorithm shown in Fig. 5. We measured the times using both breadth-first and depth-first traversal. In a sequential implementation, the traversal strategy only dictates the order in which nodes in the dependency graph are examined; there are no remote queues manipulated. Since our intent was to measure only sequential running times, no part of the Sting runtime was involved in these tests. The times are shown in Fig. 9 and do not include garbage collection costs.

         Parser        Self-Apply
    T    56.5 (BFS)    100.9 (BFS)
         66.4 (DFS)     74.3 (DFS)

Figure 9: Sequential times (in seconds) for the abstract interpreter written in T.

8.3.2. Parallel Execution

For the parallel implementation, we used the Sting runtime system to handle all concerns related to thread scheduling and context switching. The Sting virtual machine used for the benchmarks implemented a single global FIFO thread queue. None of the benchmarks triggered garbage collection.

The benchmark results are shown in Fig. 10. The number of block operations executed in runs using a blocking waiting strategy is shown in Fig. 11.

Note that the single-processor times for both the parser and self-apply are slightly faster than the sequential times; memory and register optimizations performed by Sting that are not utilized by T are the primary reasons for this difference. As a related point, lock acquires and releases on the R4400 are implemented in terms of two simple machine instructions, load-linked and store-conditional [15]; the overhead of acquiring locks in the single-processor case is thus negligible.

With 16 processors, using a breadth-first traversal, block partitioning, and spin-waiting, the evaluation of the abstract interpretation on the parser took 7.1 seconds; compared with the single-processor time of 53.8 seconds, this program executes at roughly 47% efficiency; on 8 processors, the same program executes with 58% efficiency.¹ We suspect the drop in efficiency is due to lack of available parallelism in the program.

Note however that the block partition strategy clearly outperforms random partitioning, which achieved no speedup regardless of the traversal or waiting strategy used. This is because updating nodes across partition boundaries entails manipulating remote queues, which are serialized. More importantly, locality is compromised; on tightly-coupled cache-coherent shared-memory machines, failure to achieve cache locality can lead to significant performance degradation. With random partitioning, a thread will likely perform many updates to shared data structures not cached on its processor.

¹ If m is the running time of a program P on one processor, and n is its execution time on k processors, the efficiency of P on k processors is defined to be (m / (k·n)) × 100.
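As a worked instance of the footnoted formula (our arithmetic, using the figures quoted above): for the parser on 16 processors, m = 53.8 seconds, k = 16, and n = 7.1 seconds, so

    efficiency = (m / (k·n)) × 100 = (53.8 / (16 × 7.1)) × 100 ≈ 47

matching the efficiency reported in the text.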


[Figure 10: Wallclock times measuring the execution of the concurrent abstract interpreter on the input programs Parser and Self-apply described in Section 8.1. Four plots (Parser and Self-apply, each under breadth-first and depth-first search) show time in seconds against the number of processors (0–16), with curves for the sequential time and for block and random partitioning, each with and without spin waiting.]


[Figure 11: Thread block counts incurred under different partitioning and traversal strategies for the input programs described in Section 8.1 when evaluated by the concurrent abstract interpreter. Two plots (Parser and Self-Apply) show block counts (0–100,000) against the number of processors, with curves for breadth-first and depth-first search under block and random partitioning.]


Poor memory locality coupled with extra synchronization costs makes random partitioning an inferior strategy to block partitioning.

With block partitioning, the choice of spin waiting or blocking appears to have little impact on the overall execution times. Under a breadth-first traversal with automatic blocking, running the concurrent interpreter on the parser resulted in approximately 9,000 thread blocks/unblocks. The cost of executing these extra blocks is roughly the cost of simply spinning; the overhead introduced by Sting to resume blocked threads does not appear to entail any performance penalty; in fact, on 4 and 8 processors, a blocking waiting strategy using block partitioning slightly outperforms the spin-waiting version.

Under a random partition strategy, the block counts increase dramatically. On 16 processors, a breadth-first traversal interpretation of the parser resulted in roughly 56,400 thread blocks. A similar number of blocks occurs under a depth-first traversal.

Surprisingly, there is little difference between a depth-first and a breadth-first traversal of the abstract state. With 16 processors, using block partitioning and without busy waiting, interpreting the parser using a depth-first traversal took 7.06 seconds versus 7.17 seconds using a breadth-first strategy. With spin-waiting, the breadth-first version took 7.07 seconds versus 7.9 seconds under a depth-first strategy. We suspect the dependency graph for the parser has small depth but is bushy; in this case, its evaluation takes little advantage of the locality benefits afforded by a depth-first traversal.

The performance curves for the self-application benchmark are roughly the same as for the parser. As with the parser, the dominant parameter in the execution times is the partitioning strategy; self-apply on 16 processors with a block partition scheme takes between 13.4 seconds (with depth-first search and spin-waiting) and 13.6 seconds (with breadth-first search and spin-waiting). The execution time with a random partition is roughly constant regardless of the number of processors used. Given a single-processor time of 74.4 seconds with depth-first search, block partitioning, and spin-waiting, the interpreter executes self-apply with an efficiency of approximately 34% on 16 processors; on 8 processors the efficiency improves to a little over 51%. Again, the reason for the dropoff is presumably lack of available parallelism; a larger input program would presumably improve the overall efficiency, provided garbage collection remains relatively infrequent.

As with the parser, self-apply shows the importance of spatial locality on tightly-coupled parallel systems. A random partitioning strategy exhibits poor locality that results in a significantly larger number of thread blocks; under a depth-first traversal, applying the interpreter to self-apply on 16 processors resulted in over 109,000 thread blocks, roughly a factor of seven more than the number of blocks executed using block partitioning. Here


also, the traversal strategy chosen seems to be less important than the partitioning strategy. On 16 processors, a depth-first traversal under a block partition with busy waiting takes 13.41 seconds, compared with 13.59 seconds under a similar configuration with a breadth-first traversal. The non-busy-waiting numbers are roughly the same. The structure of the dependency graph for self-apply again presumably lacks sufficient height to highlight differences between the two traversal strategies.

Regardless of the underlying machine, communication costs incurred in depositing work on remote queues, lack of "parallel slackness", and uneven load balancing appear to be dominating factors as we scale the number of processors for both benchmarks. Significant queue contention and poor cache locality are important factors in making remote queue access costly relative to the cost of executing the interpreter's serial component. Moreover, the small sequential grain size defined by the local unit of work performed by each thread is small enough to exaggerate the costs of inter-thread communication. In both benchmarks, threads communicate frequently, and perform relatively little work compared to the cost of recording a piece of work on another thread's remote queue. Increasing grain size by computing a more sophisticated analysis, or by combining other analyses simultaneously, would help mask some of these overheads. Furthermore, regardless of whether a thread blocks or spins, the lack of other tasks to execute when a thread has no work in either queue undoubtedly results in some performance degradation. Unfortunately, simply making the partition sizes smaller, and thus the program more fine-grained, will not necessarily improve performance, because contention for remote queues would increase correspondingly, eliminating any benefits derived from the extra available parallelism. Allowing nodes to migrate dynamically, and thus permitting dynamic repartitioning of the abstract state, would be a more realistic solution. Since not all nodes perform equal amounts of work, using static (and even random) partitions is unlikely to lead to even load balancing among worker threads. Allocating threads to "interesting" regions of the abstract state may lead to a more efficient implementation.

9. Conclusions

An efficient concurrent abstract interpreter is a challenging and important symbolic application. Like many other applications in this category, abstract interpreters generate data dynamically, work on inputs with highly irregular structure, manipulate objects of many different types, and do not lend themselves to a trivial parallel formulation. Constructing an efficient concurrent implementation, however, has several important benefits. Most significantly, it enables the incorporation of a number of sophisticated inter-


procedural compile-time optimizations as part of a general compiler toolbox for Scheme and other higher-order symbolic programming languages.

References

1. Anant Agarwal, Beng-Hong Lim, David Kranz, and John Kubiatowicz. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th IEEE Conference on Computer Architecture, pages 104–114, 1990.

2. George Almasi and Allan Gottlieb. Highly Parallel Computing. Benjamin/Cummings, 1991.

3. Nick Carriero and David Gelernter. Linda in Context. Communications of the ACM, 32(4):444–458, April 1989.

4. Nick Carriero and David Gelernter. Tuple Analysis and Partial Evaluation Strategies in the Linda Precompiler. In Second Workshop on Languages and Compilers for Parallelism. MIT Press, August 1989.

5. William Clinger and Jonathan Rees, editors. Revised⁴ Report on the Algorithmic Language Scheme. ACM Lisp Pointers, 4(3), July 1991.

6. Patrick Cousot and Radhia Cousot. Abstract Interpretation: a Unified Lattice Model for Static Analysis of Programs by Construction of Approximation of Fixpoints. In ACM 4th Symposium on Principles of Programming Languages, pages 238–252, January 1977.

7. Allan Gottlieb, B. Lubachevsky, and Larry Rudolph. Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors. ACM Transactions on Programming Languages and Systems, 5(2):164–189, April 1983.

8. Mary Hall and Ken Kennedy. Efficient Call Graph Analysis. ACM Letters on Programming Languages and Systems, 1(3):227–242, September 1992.

9. Robert Halstead. Multilisp: A Language for Concurrent Symbolic Computation. ACM Transactions on Programming Languages and Systems, 7(4):501–538, October 1985.

10. Waldemar Horwat, Andrew Chien, and William Dally. Experience with CST: Programming and Implementation. In ACM SIGPLAN '89 Conference on Programming Language Design and Implementation, pages 101–109, June 1989.


11. William J. Dally et al. Architecture of a Message-Driven Processor. In Proceedings of the 14th IEEE Conference on Computer Architecture, pages 189–196, 1987.

12. Suresh Jagannathan and James Philbin. A Customizable Substrate for Concurrent Languages. In ACM SIGPLAN '92 Conference on Programming Language Design and Implementation, pages 55–67, June 1992.

13. Suresh Jagannathan and James Philbin. A Foundation for an Efficient Multi-Threaded Scheme System. In Proceedings of the 1992 Conference on Lisp and Functional Programming, pages 345–357, June 1992.

14. Suresh Jagannathan and Stephen Weeks. Analyzing Stores and References in a Parallel Symbolic Language. In Proceedings of the 1994 ACM Conference on Lisp and Functional Programming, June 1994.

15. G. Kane. MIPS RISC Architecture. Prentice-Hall, 1989.

16. David Kranz, Robert Halstead, and Eric Mohr. Mul-T: A High Performance Parallel Lisp. In Proceedings of the ACM Symposium on Programming Language Design and Implementation, pages 81–91, June 1989.

17. David Kranz, R. Kelsey, Jonathan Rees, Paul Hudak, J. Philbin, and N. Adams. ORBIT: An Optimizing Compiler for Scheme. ACM SIGPLAN Notices, 21(7):219–233, July 1986.

18. Eric Mohr, David Kranz, and Robert Halstead. Lazy Task Creation: A Technique for Increasing the Granularity of Parallel Programs. In Proceedings of the 1990 ACM Conference on Lisp and Functional Programming, June 1990.

19. J. Gregory Morrisett and Andrew Tolmach. Procs and Locks: A Portable Multiprocessing Platform for Standard ML of New Jersey. In Fourth ACM Symposium on Principles and Practice of Parallel Programming, pages 198–207, 1993.

20. Chris Reade. Elements of Functional Programming. Addison-Wesley, 1989.

21. Jonathan A. Rees and Norman I. Adams. T: A Dialect of Lisp or, LAMBDA: The Ultimate Software Tool. In Proceedings of the ACM Symposium on Lisp and Functional Programming, pages 114–122, 1982.


22. Zhong Shao and Andrew Appel. Space-Efficient Closure Representations. In Proceedings of the 1994 ACM Conference on Lisp and Functional Programming, June 1994.

23. Olin Shivers. Data-flow Analysis and Type Recovery in Scheme. In Topics in Advanced Language Implementation. MIT Press, 1990.

24. Olin Shivers. The Semantics of Scheme Control-Flow Analysis. In Proceedings of the ACM SIGPLAN Symposium on Partial Evaluation and Semantics-Based Program Manipulation, pages 190–198, 1991.

25. Thorsten von Eicken, David Culler, Seth Goldstein, and Klaus Erik Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In Proceedings of the 19th Annual International Symposium on Computer Architecture, pages 256–266, 1992.