Memory Models: A Case for Rethinking Parallel Languages and Hardware
COE 502/CSE 661, Fall 2011: Parallel Processing Architectures
Professor: Muhamed Mudawar, KFUPM
Presented by: Ahmad Hammad
Outline
• Abstract
• Introduction
• Sequential Consistency
• Data-Race-Free
• Industry Practice and Evolution
• Lessons Learned
• Implications for Languages and Hardware
• Conclusion
Abstract
• Writing correct parallel programs remains far more difficult than writing sequential programs.
• Most parallel programs are written using a shared-memory approach.
• The memory model specifies the meaning of shared variables.
– It has always involved a tradeoff between programmability and performance.
– It is one of the most challenging areas in both hardware architecture and programming language specification.
Abstract
• Recent broad community-scale efforts have finally led to a convergence in this debate,
– with popular languages such as Java and C++ and
– most hardware vendors publishing compatible memory model specifications.
• Although this convergence is a dramatic improvement, it has exposed fundamental shortcomings that prevent achieving the vision of structured and safe parallel programming.
Abstract
• This paper discusses the path to the above convergence, the hard lessons learned, and their implications.
• A cornerstone of this convergence has been the view that the memory model should be a contract between the programmer and the system:
– if the programmer writes disciplined (data-race-free) programs, the system will provide high programmability (sequential consistency) and performance.
Abstract
• We discuss why this view is the best we can do with current popular languages, and why it is inadequate moving forward.
• In particular, we argue that parallel languages should not only promote high-level disciplined models, but should also enforce the discipline.
• For scalable and efficient performance, hardware should be co-designed to take advantage of and support such disciplined models.
Introduction
• Most parallel programs today are written using threads and shared variables
• There are a number of reasons why threads remain popular.
– Even before the dominance of multicore, threads were already widely supported by most operating systems because they are also useful for other purposes.
Introduction
• Direct HW support for shared-memory provides a performance advantage
• The ability to pass memory references among threads makes it easier to share complex data structures.
• Shared memory makes it easier to selectively parallelize application hot spots without a complete redesign of data structures.
Memory Model
• It is at the heart of the concurrency semantics of a shared-memory program.
• It defines the set of values a read in a program can return
• It answers questions such as:
– Is there enough synchronization to ensure a thread’s write will occur before another’s read?
– Can two threads write to adjacent fields in a memory location at the same time?
Memory Model
• The memory model should be unambiguous.
• It defines an interface between a program and any hardware or software that may transform that program:
– the compiler, the virtual machine, or any dynamic optimizer.
• A complex memory model makes
– parallel programs difficult to write, and
– parallel programming difficult to teach.
Memory Model
• Since it is an interface property, the memory model
– has a long-lasting impact, affecting the portability and maintainability of programs;
– if overly constraining, may limit hardware and compiler optimization, reducing performance.
Memory Model
• The central role of the memory model has often been downplayed.
– Specifying a model that balances all desirable properties has proven surprisingly complex:
• programmability
• performance
• portability
Memory Model
• In the 1980s and 1990s, the area received attention primarily in the hardware community,
– which explored many approaches, with little consensus.
– A key challenge for hardware architects was the lack of a clear memory model at the programming language level:
– it was unclear what programmers expected hardware to do.
Memory Model
• Hardware researchers proposed approaches to bridge this gap, but there was no wide consensus from the software community.
• Before 2000:
– few programming environments addressed the issue clearly;
– widely used environments had unclear and questionable specifications;
– even when specifications were relatively clear, they were often violated to obtain sufficient performance;
– specifications tended to be misunderstood even by experts, and were difficult to teach.
Memory Model
• Since 2000, there have been efforts to cleanly specify programming-language-level memory models,
– first for Java and C++, with efforts now underway to adopt similar models for C and other languages.
– In the process, issues created by hardware that had evolved without clear guidance made it difficult to reconcile the need for a simple and usable programming model with that for adequate performance on existing hardware.
Memory Model
• Today, the above languages and most hardware vendors have published (or plan to publish) compatible memory model specifications.
• Although this convergence is a dramatic improvement over the past, it has exposed fundamental shortcomings in our parallel languages and their interplay with hardware.
Memory Model
• After decades of research, it is still unacceptably difficult to describe what value a load can return without compromising modern safety guarantees or implementation methodologies.
• To us, this experience has made it clear that solving the memory model problem will require a significantly new and cross-disciplinary research direction for parallel computing languages, hardware, and environments as a whole.
Sequential Consistency
• The execution of a multithreaded program with shared variables can be viewed as follows:
– Each step in the execution consists of choosing one of the threads to execute,
– then performing the next step in that thread’s execution according to program order.
– This process is repeated until the program as a whole terminates.
Sequential Consistency
• An execution is viewed as taking all steps executed by each thread and interleaving them.
• Possible results are determined by trying all possible interleavings of instructions.
• Instructions in a single thread must occur in the same order in the interleaving as in the thread.
• Lamport called such an execution sequentially consistent.
Fig. 1: Core of Dekker’s algorithm. Can r1 = r2 = 0?

  Initially X = Y = 0

  Red Thread       Blue Thread
  X = 1;           Y = 1;
  r1 = Y;          r2 = X;

Fig. 2: Some executions for Fig. 1

  Exec. 1           Exec. 2           Exec. 3
  X = 1;            Y = 1;            X = 1;
  r1 = Y;           r2 = X;           Y = 1;
  Y = 1;            X = 1;            r1 = Y;
  r2 = X;           r1 = Y;           r2 = X;
  r1 = 0, r2 = 1    r1 = 1, r2 = 0    r1 = 1, r2 = 1
Sequential Consistency
Fig. 2 gives three possible interleavings that together illustrate all possible final values of the non-shared variables r1 and r2.
Sequential Consistency
• Sequential consistency gives the simplest possible meaning for shared variables, but suffers from several related flaws:
– Sequential consistency can be expensive to implement.
– Almost all hardware and compilers violate SC today:
• they reorder the two independent assignments, since scheduling loads early tends to hide the load latency;
• modern processors use a store buffer to avoid waiting for stores to complete.
• Both compiler and hardware optimizations make the outcome r1 = r2 = 0 possible: a non-sequentially-consistent execution.
Data-Race-Free
• Reordering accesses to unrelated variables in a thread
– preserves the meaning of single-threaded programs,
– but can affect multithreaded programs.
• Two memory accesses form a race if
– different threads access the same location, at least one access is a write, and
– the accesses can occur one right after the other.
• A data-race-free program has no data race in any SC execution.
– This requires races to be explicitly identified as synchronization:
– locks, volatile variables in Java, atomics in C++.
Data-Race-Free
• A program that does not allow a data race is said to be data-race-free.
• The data-race-free model guarantees sequential consistency only for data-race-free programs.
• For programs that allow data races, the model does not provide any guarantees.
Data-Race-Free Approach
• Programmability
– Simply identify the shared variables as synchronization variables.
• Performance
– Specifies minimal constraints.
• Portability
– The language must provide a way to identify races.
– The hardware must provide a way to preserve ordering on races.
– The compiler must translate correctly.
– Synchronization accesses must be performed indivisibly,
– ensuring sequential consistency in spite of simultaneous accesses.
Data-Race-Free Approach
• The most important hardware and compiler optimizations are allowed for ordinary accesses.
• Care must be taken at the explicitly identified synchronization accesses.
• Data-race-free was formally proposed in 1990,
– but it did not see widespread adoption as a formal model in industry until recently.
Industry Practice and Evolution
• Hardware Memory Models
• High-Level Language Memory Models
– The Java Memory Model
– The C++ Memory Model
Hardware Memory Models
• Most hardware supports relaxed models that are weaker than sequential consistency.
– These models take an implementation- or performance-centric view:
– desirable hardware optimizations drive the model specification.
• Different vendors had different models:
– Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray.
– Various ordering guarantees, plus fences to impose other orders.
– Many ambiguities, due to complexity by design and unclear documentation.
High-Level Language Memory Models
• Ada was the first widely used high-level programming language to provide first-class support for shared-memory parallel programming.
– It was advanced in its treatment of memory semantics.
– It used a style similar to data-race-free, requiring legal programs to be well-synchronized,
– but did not fully formalize the notion of well-synchronized.
Pthreads and OpenMP
• Most shared-memory programming is enabled through libraries and APIs such as Posix threads and OpenMP.
– Incomplete, ambiguous model specifications.
– A memory model is a property of the language, not of a library:
– it is unclear what compiler transformations are legal,
– and what the programmer is allowed to assume.
Pthreads and OpenMP
• The Posix threads specification
– indicates a model similar to data-race-free,
– but has several inconsistent aspects,
– with varying interpretations even among experts.
• The OpenMP model
– is unclear and largely based on a flush operation that is similar to fence instructions in hardware models.
Java
• Java provided first-class support for threads,
– with a chapter of the specification devoted to its memory model.
• Bill Pugh publicized fatal flaws in the Java model:
– the model was hard to interpret and badly broken;
– common compiler optimizations were prohibited;
– in many cases the model gave ambiguous or unexpected behavior.
Java Memory Model Highlights
• In 2000, Sun appointed an expert group to revise the model through the Java community process,
– with coordination through an open mailing list.
– It attracted a variety of participants, representing hardware, software, researchers, and practitioners.
– It took 5 years of debates, with many competing models.
– The final consensus model was approved in 2005 for Java 5.0.
– The group quickly decided that the JMM must provide SC for data-race-free programs.
– Unfortunately, the resulting model is very complex,
– has some surprising behaviors,
– and has recently been shown to have a bug.
Java Memory Model Highlights
• Missing piece: semantics for programs with data races.
– Java is meant to be a safe and secure language:
– it cannot allow arbitrary behavior for data races;
– Java cannot have undefined semantics for ANY program;
– it must ensure safety/security guarantees to limit damage from data races in untrusted code.
– Problem: “safety/security, limited damage” in a multithreaded context were not clearly defined.
Java Memory Model Highlights
• The final model is based on consensus, but complex.
– Programmers (must) use SC for data-race-free.
– But system designers must deal with the complexity.
– Bugs were discovered as recently as 2008.
The C++ Memory Model
• The situation in C++ was significantly different from Java.
• The language itself provided no support for threads;
– most concurrency used Pthreads
– or the corresponding Microsoft Windows facilities.
• Threads were supported only by the combination of the C++ standard with the Pthreads standard,
– which made it unclear to compiler writers precisely what they needed to implement.
The C++ Memory Model
• In 2005, an effort to define a concurrency model for C++ was initiated.
• The effort expanded to include the definition of atomic operations and the threads API itself.
• Microsoft concurrently started its own internal effort.
• C++ is easier than Java because it is unsafe:
– data-race-free is a possible model.
• BUT multicore brought new hardware optimizations and renewed debate:
– the mismatch between hardware and programming views became painful;
– it was debated whether SC for data-race-free is inefficient with the new hardware models.
Lessons Learned
• Data-race-free provides a simple and consistent model for threads and shared variables.
• On current hardware it has three significant weaknesses:
– Debugging: accidental introduction of a data race results in “undefined behavior.”
– Synchronization variable performance: ensuring SC in Java or C++ on current hardware complicates the model for experts.
– Untrusted code: there is no way to ensure data-race-freedom in untrusted code (Java).
Lessons Learned
• There is a need for:
– higher-level disciplined models that enforce the discipline,
• e.g., Deterministic Parallel Java;
– hardware co-designed with high-level models,
• e.g., DeNovo hardware.
Implications for Languages
• Modern software needs to be both parallel and secure
• A forced choice between the two should not be acceptable.
• Disciplined shared-memory models should provide:
– Simplicity
– Enforceability
– Expressivity
– Performance
Implications for Languages
• Simplicity: easy enough to be teachable to undergraduates;
– minimally, provide sequential consistency to programs that obey the required discipline.
• Enforceability: enable the enforcement of the discipline.
– Violations of the discipline should not have undefined or complex semantics;
– they should be caught and reported back to the programmer as illegal.
• Expressivity: general-purpose enough to express important parallel algorithms and patterns.
• Performance: enable high and scalable performance.
Data-race-free
• A near-term discipline: continue with data-race-free.
• Focus research on its enforcement:
– ideally, the language prohibits and eliminates data races by design;
– otherwise, the runtime catches them as exceptions.
Deterministic-by-Default Parallel Programming
• Many algorithms are deterministic:
– a fixed input gives a fixed output.
– This is the standard model for sequential programs.
– It also holds for many transformative parallel programs:
• parallelism is not part of the problem specification, only a means to performance.
• Language-based approaches with such goals include Deterministic Parallel Java (DPJ).
Implications for Hardware
• Current hardware memory models are an imperfect match for even current software memory models.
• Instruction set architecture changes
– to identify individual loads and stores as synchronization
– can alleviate some short-term problems.
• An established ISA is hard to change
– when existing code works mostly adequately, and
– there is not enough experience to document the benefits of the change.
Implications for Hardware
• From a longer-term perspective, a more fundamental solution to the problem will emerge with a co-designed approach,
– where future multicore hardware research evolves in concert with the software models research.
• The DeNovo project at Illinois, in concert with DPJ:
– design hardware to exploit disciplined parallelism;
– simpler hardware;
– scalable performance;
– power/energy efficiency;
– working with DPJ as an example disciplined model.
Conclusion
• This paper gives a perspective based on work collectively spanning about thirty years.
• The current way to specify concurrency semantics is fundamentally broken.
– The best we can do is SC for data-race-free.
– Hardware and software are mismatched:
* simple optimizations give unintended consequences.
• We need:
– high-level disciplined models that enforce the discipline;
– hardware co-designed with high-level models.
Authors
• Sarita V. Adve ([email protected]) is a professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign.
• Hans-J. Boehm ([email protected]) is a member of Hewlett-Packard's Exascale Computing Lab, Palo Alto, CA.