Tomorrow's Learners, Tomorrow's Technologies: Preparing for the Predictions
Implementing Tomorrow's Programming Languages
Transcript of Implementing Tomorrow's Programming Languages
1
Implementing Tomorrow's Programming Languages
Rudi Eigenmann
Purdue University
School of ECE
Computing Research Institute
Indiana, USA
2
How to find Purdue University
3
Computing Research Institute (CRI)
CRI is the high-performance computing branch of Discovery Park.
Other DP Centers: Bioscience, Nanotechnology, E-Enterprise, Entrepreneurship, Learning, Advanced Manufacturing, Environment, Oncology
4
Compilers are the Center of the Universe
The compiler translates the programmer’s view into the machine’s view.

Today:
  programmer’s view:  DO I=1,n
                        a(I)=b(I)
                      ENDDO
  machine’s view:     Subr doit
                      Loop: Load 1,R1
                            . . .
                            Move R2, x
                            . . .
                            BNE loop

Tomorrow:
  programmer’s view:  Do Weatherforecast
  machine’s view:     Compute on machine x
                      Remote call doit
5
Why is Writing Compilers Hard? … a high-level view
• Translation passes are complex algorithms
• Not enough information at compile time
  – Input data not available
  – Insufficient knowledge of the architecture
  – Not all source code available
• Even with sufficient information, modeling performance is difficult
• Architectures are moving targets
6
Why is Writing Compilers Hard? … from an implementation angle
• Interprocedural analysis
• Alias/dependence analysis
• Pointer analysis
• Information gathering and propagation
• Link-time, load-time, run-time optimization
  – Dynamic compilation/optimization
  – Just-in-time compilation
  – Autotuning
• Parallel/distributed code generation
7
It’s Even Harder Tomorrow
Because we want:
• All our programs to work on multicore processors
• Very high-level languages
  – Do weather forecast …
• Composition: combine the weather forecast with an energy-reservation and cooling manager
• Reuse: warn me if I’m writing a module that exists “out there”.
8
How Do We Get There?
Paths towards tomorrow’s programming languages

Addressing the (new) multicore challenge:
• Automatic Parallelization
• Speculative Parallel Architectures
• SMP languages for distributed systems

Addressing the (old) general software engineering challenge:
• High-level languages
• Composition
• Symbolic analysis
• Autotuning
9
The Multicore Challenge
• We have finally reached the long-expected “speed wall” for the processor clock.
  – (this should not be news to you!)
• “… one of the biggest disruptions in the evolution of information technology.”
• “Software engineers who do not know parallel programming will be obsolete in no time.”
10
Automatic Parallelization
Can we implement standard languages on multicore?
… more specifically: with a source-to-source restructuring compiler

Research issues in such a compiler:
– Detecting parallelism
– Mapping parallelism onto the machine
– Performing compiler techniques at runtime
– Compiler infrastructure

Polaris – A Parallelizing Compiler:
Standard Fortran -> Polaris -> Fortran + directives (OpenMP) -> OpenMP backend compiler
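To make "detecting parallelism" concrete, here is a minimal sketch of one classic dependence test a parallelizer in the spirit of Polaris can apply to array subscripts: the GCD test. The function name and interface are illustrative, not Polaris APIs; the test is conservative, answering only whether a loop-carried dependence can be ruled out.

```python
from math import gcd

def gcd_test(write_coef, write_const, read_coef, read_const):
    """GCD test for a write a(w_c*i + w_k) against a read a(r_c*i + r_k):
    a dependence is possible only if gcd(w_c, r_c) divides (r_k - w_k).
    Returns True when a dependence CANNOT be ruled out (conservative)."""
    g = gcd(write_coef, read_coef)
    if g == 0:
        # Both subscripts are constants: dependence iff they are equal.
        return write_const == read_const
    return (read_const - write_const) % g == 0

# DO I=1,n : a(2*I) = a(2*I+1)  -- writes even, reads odd elements
print(gcd_test(2, 0, 2, 1))   # False: independent, loop can run parallel
# DO I=1,n : a(I) = a(I-1)     -- a classic recurrence
print(gcd_test(1, 0, 1, -1))  # True: dependence possible, keep serial
```

When the test proves independence, the restructurer can annotate the loop with an OpenMP directive and leave code generation to the backend compiler.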
11
[Chart: speedups (0–5) achieved by Polaris on the benchmarks ARC2D, FLO52Q, HYDRO2D, MDG, SWIM, TOMCATV, and TRFD.]
State of the Art in Automatic Parallelization
• Advanced optimizing compilers perform well in 50% of all science/engineering applications.
• Caveats: this is true
  – in research compilers
  – for regular applications, written in Fortran or C without pointers
• Wanted: heroic, black-belt programmers who know the “assembly language of HPC”
12
Can Speculative Parallel Architectures Help?
Basic idea:
• The compiler splits the program into sections (without considering data dependences)
• The sections are executed in parallel
• The architecture tracks data-dependence violations and takes corrective action.
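The idea above can be modeled in a few lines: execute every section on a private snapshot while recording read/write sets, then commit in program order and re-execute any section that read a location an earlier section wrote. This is only a software sketch of what the hardware does; the function and section names are invented for illustration.

```python
import copy

def run_speculative(sections, state):
    """Toy model of speculative multithreading: sections run on private
    snapshots; at commit time, a section whose reads overlap an earlier
    section's writes saw stale data and is squashed and re-executed."""
    results = []
    for sec in sections:                      # "parallel" speculative phase
        snap = copy.deepcopy(state)
        reads, writes = set(), set()
        sec(snap, reads, writes)
        results.append((sec, snap, reads, writes))

    committed_writes = set()
    for sec, snap, reads, writes in results:  # commit in program order
        if reads & committed_writes:          # dependence violation detected
            reads, writes = set(), set()
            sec(state, reads, writes)         # corrective re-execution
        else:
            for k in writes:
                state[k] = snap[k]            # safe: publish speculative writes
        committed_writes |= writes
    return state

# Two sections with a flow dependence the hardware must catch:
def s1(st, r, w): st["x"] = 1; w.add("x")
def s2(st, r, w): r.add("x"); st["y"] = st["x"] + 1; w.add("y")

print(run_speculative([s1, s2], {"x": 0, "y": 0}))  # {'x': 1, 'y': 2}
```

Without the violation check, s2 would commit y = 1, computed from the stale x = 0.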
13
Performance of Speculative Multithreading
[Chart: speedups (0–4.5) for FPPPP, APSI, TURB3D, APPLU, WAVE5, SU2COR, TOMCATV, HYDRO2D, SWIM, FLO52, ARC2D, TRFD, and MGRID under the Implicit-Only, Multiplex-Naïve, Multiplex-Selective, and Multiplex-Profile schemes. SPEC CPU2000 FP programs executed on a 4-core speculative architecture.]
14
We may need Explicit Parallel Programming
Shared-memory architectures:
• OpenMP: proven model for science/engineering programs. Suitability for non-numerical programs?
Distributed computers:
• MPI: the assembly language of parallel/distributed systems. Can we do better?
15
Beyond Science & Engineering Applications
7+ Dwarfs:
1. Structured Grids (including locally structured grids, e.g., Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
8. Search/Sort
9. Filter
10. Combinational Logic
11. Finite State Machine
16
Shared-Memory Programming for Distributed Applications?
• Idea 1: Use an underlying software distributed-shared-memory (S-DSM) system (e.g., TreadMarks).
• Idea 2: Direct translation into message-passing code
17
OpenMP for Software DSM: Challenges
• S-DSM maintains coherency at the page level
• Optimizations that reduce false sharing and increase page affinity are very important
• In S-DSMs such as TreadMarks, the stacks are not in the shared address space
• The compiler must identify shared stack variables via interprocedural analysis
[Diagram: shared memory vs. distributed memories with per-processor stacks outside the shared address space. Over time t, Processor 1 executes “A[50] = …” and Processor 2 later executes “… = A[50]”; at the barrier, P1 tells P2 “I have written page x”, and P2 requests the page “diff” from P1.]
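The barrier-time exchange in the diagram can be sketched with TreadMarks' twin/diff mechanism: on the first write to a page the S-DSM keeps a pristine copy (the "twin"), and at synchronization another processor pulls only the words that changed (the "diff"). This is a simplified model, not TreadMarks code; function names are invented.

```python
def make_twin(page):
    """On the first write to a page, keep a pristine copy (the 'twin')."""
    return list(page)

def make_diff(twin, page):
    """Encode only the words that changed since the twin was taken --
    this is the 'diff' another processor requests at a barrier."""
    return {i: v for i, (old, v) in enumerate(zip(twin, page)) if old != v}

def apply_diff(page, diff):
    for i, v in diff.items():
        page[i] = v

# Processor 1 writes word 50 of a shared page...
p1_page = [0] * 1024
twin = make_twin(p1_page)
p1_page[50] = 7
# ...at the barrier, P2 learns the page changed and pulls only the diff.
p2_page = [0] * 1024
diff = make_diff(twin, p1_page)
apply_diff(p2_page, diff)
print(diff)         # {50: 7}
print(p2_page[50])  # 7
```

Sending diffs instead of whole pages is what makes false-sharing reduction and page affinity so important: the fewer processors touch a page, the fewer diffs fly around.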
18
SPEC OMPM2001 Performance
[Chart: speedups (0–6) on 1, 2, 4, and 8 processors for wupwise, swim, mgrid, art, equake, and applu; baseline vs. optimized performance. Optimized performance of SPEC OMPM2001 benchmarks on a TreadMarks S-DSM system.]
19
Direct Translation of OpenMP into Message Passing
A question often asked: how is this different from HPF?
• HPF: the emphasis is on data distribution. OpenMP: the starting point is explicit parallel regions.
• HPF: implementations apply strict data-distribution and owner-computes schemes.
Our approach: partial replication of shared data. Partial replication leads to
– synchronization-free serial code
– communication-free data reads
– communication for data writes amenable to collective message passing
– irregular accesses (in our benchmarks) amenable to compile-time analysis
Note: partial replication is not necessarily “data scalable”.
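A minimal model of the replication scheme, with plain Python lists standing in for MPI processes: every "process" holds a full replica of the shared array, computes only its chunk of the parallel loop (so all reads are local), and a collective exchange (the role MPI_Allgatherv would play) merges the written chunks. Names and the loop body are illustrative, not the actual translator's output.

```python
def translated_parallel_loop(nprocs, n):
    """Toy model of OpenMP-to-MPI translation with replicated shared
    data: reads are communication-free; only writes are exchanged,
    collectively, after the loop."""
    replicas = [[0] * n for _ in range(nprocs)]  # full copy per process
    chunks = []
    for rank in range(nprocs):
        lo = rank * n // nprocs
        hi = (rank + 1) * n // nprocs
        for i in range(lo, hi):            # this rank's 'omp for' chunk
            replicas[rank][i] = i * i      # reads/writes purely local
        chunks.append((lo, hi))
    # Collective write exchange (stand-in for MPI_Allgatherv):
    for rank, (lo, hi) in enumerate(chunks):
        for other in range(nprocs):
            replicas[other][lo:hi] = replicas[rank][lo:hi]
    return replicas

reps = translated_parallel_loop(4, 16)
print(all(r == [i * i for i in range(16)] for r in reps))  # True
```

The "not necessarily data scalable" caveat is visible here: every process stores the whole array, so memory per process does not shrink as nprocs grows.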
20
Performance of OpenMP-to-MPI Translation
Performance comparison of our OpenMP-to-MPI translated versions versus (hand-coded) MPI versions of the same programs. “Speedup” is relative to the serial version; higher is better.
Hand-coded MPI represents a practical “upper bound”.
[Chart legend: Hand-coded MPI vs. OpenMP-to-MPI.]
21
How does the performance compare to the same programs optimized for software DSM?
[Chart: OpenMP-to-MPI versus OpenMP for S-DSM; higher is better.]
22
How Do We Get There?
Paths towards tomorrow’s programming languages
The (new) multicore challenge:
• Automatic Parallelization
• Speculative Parallel Architectures
• SMP languages for distributed systems
The (old) general software engineering challenge:
• High-level languages
• Composition
• Symbolic analysis
• Autotuning
23
(Very) High-Level Languages
Observation: “The number of programming errors is roughly proportional to the number of program lines.”
[Figure: progression of abstraction levels — Assembly -> Fortran -> object-oriented languages -> scripting, Matlab -> ?]
• Probably domain-specific
• How efficient?
  – Very efficient, because there is much flexibility in translating VHLLs
  – Inefficient, by experience
24
Composition
Can we compose software from existing modules?
• Idea: add an “abstract algorithm” (AA) construct to the programming language
  – the programmer defines the AA’s goal
  – the AA is called like a procedure
• The compiler replaces each AA call with a sequence of library calls
  – How does the compiler do this? It uses a domain-independent planner that accepts procedure specifications as operators.
25
Motivation: Programmers Often Write Sequences of Library Calls
Example: a common BioPerl call sequence — “Query a remote database and save the result to local storage:”

Query q = bio_db_query_genbank_new("nucleotide",
    "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]");
DB db = bio_db_genbank_new();
Stream stream = get_stream_by_query(db, q);
SeqIO seqio = bio_seqio_new(">sequence.fasta", "fasta");
Seq seq = next_seq(stream);
write_seq(seqio, seq);

5 data types, 6 procedure calls
Example adapted from http://www.bioperl.org/wiki/HOWTO:Beginners
26
Defining and Calling an AA
• An AA (goal) is defined using the glossary...

algorithm save_query_result_locally(db_name, query_string, filename, format)
  => { query_result(result, db_name, query_string),
       contains(filename, result),
       in_format(filename, format) }

...and called like a procedure:

Seq seq = save_query_result_locally("nucleotide",
    "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]",
    ">sequence.fasta", "fasta");

1 data type, 1 AA call
27
“Ontological Engineering”
• The library author provides a domain glossary:
  – query_result(result, db, query) – result is the outcome of sending query to the database db
  – contains(filename, data) – the file named filename contains data
  – in_format(filename, format) – the file named filename is in format format
28
Implementing the Composition Idea
Borrowing AI technology: planners
-> for details, see PLDI 2006
[Diagram: a domain-independent planner receives an initial state (the call context), a goal state (the AA definition), and operators (the library specifications), and produces a plan (realized by the compiler) whose actions (the executable) act on the world (the user’s data).]
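A toy version of the planning step can be written in a few lines: treat library procedures as operators with precondition and effect facts, and chain forward from the call context until the AA's goal facts hold. This is a greedy sketch, far simpler than the domain-independent planner of the PLDI 2006 system, and the fact names attached to the BioPerl procedures are invented for illustration.

```python
def plan(initial, goal, operators, max_steps=10):
    """Greedy forward-chaining planner: operators are (name, preconds,
    effects) triples over sets of ground facts. Returns the sequence of
    operator names that achieves the goal, or None."""
    state, steps = set(initial), []
    for _ in range(max_steps):
        if goal <= state:
            return steps
        for name, pre, eff in operators:
            if pre <= state and not eff <= state:  # applicable and useful
                state |= eff
                steps.append(name)
                break
        else:
            return None                            # no applicable operator
    return steps if goal <= state else None

# Hypothetical specifications for the BioPerl calls of the earlier slide:
ops = [
    ("bio_db_query_genbank_new", set(),                   {"have_query"}),
    ("bio_db_genbank_new",       set(),                   {"have_db"}),
    ("get_stream_by_query",  {"have_db", "have_query"},   {"query_result"}),
    ("bio_seqio_new",            set(),                   {"have_seqio"}),
    ("next_seq",             {"query_result"},            {"have_seq"}),
    ("write_seq",       {"have_seqio", "have_seq"}, {"contains", "in_format"}),
]
goal = {"query_result", "contains", "in_format"}  # from the AA definition
print(plan(set(), goal, ops))
```

The returned name sequence is exactly the six-call BioPerl chain the programmer wrote by hand, now derived from the one-line AA goal.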
29
Symbolic Program Analysis
• Today: many compiler techniques assume numerical constants.
• Needed: techniques that can reason about the program in symbolic terms:
  – differentiate: ax^2 -> 2ax
  – analyze ranges: y=exp; if (c) y+=5;  ->  y=[exp:exp+5]
  – recognize algorithms:
      c=0
      DO j=1,n
        if (t(j)<v) c+=1
      ENDDO                 ->  c = COUNT(t[1:n]<v)
30
Autotuning (dynamic compilation/adaptation)
• Moving compile-time decisions to runtime
• A key observation: compiler writers “solve” difficult decisions by creating a command-line option
  -> finding the best combination of options means making the difficult compiler decisions.
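The observation above suggests the simplest possible tuner: enumerate combinations of the exposed options and keep the fastest. The sketch below does exactly that; the option names and the deterministic cost function stand in for a real compile-and-measure cycle, and none of this is the PEAK system's actual interface.

```python
from itertools import product

def autotune(flag_choices, run_time):
    """Exhaustive option-based autotuning: search all combinations of
    command-line options, keeping the configuration with the lowest
    measured time. run_time(cfg) stands in for compile-and-run."""
    best_cfg, best_t = None, float("inf")
    for combo in product(*flag_choices.values()):
        cfg = dict(zip(flag_choices.keys(), combo))
        t = run_time(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

# Hypothetical option space and a deterministic stand-in cost model:
flags = {"unroll": [1, 2, 4, 8], "O": [1, 2, 3]}
cost = lambda c: abs(c["unroll"] - 4) + (3 - c["O"])
print(autotune(flags, cost))  # ({'unroll': 4, 'O': 3}, 0)
```

Exhaustive search is exactly what makes whole-program tuning slow; systems like PEAK cut the cost by tuning individual code sections against short performance-probe runs rather than timing the whole program per configuration.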
31
Tuning Time
PEAK is 20 times as fast as the whole-program tuning.
On average, PEAK reduces tuning time from 2.19 hours to 5.85 minutes.
Normalized tuning time (Whole-program vs. PEAK):

  Benchmark   Whole    PEAK
  ammp         62.22    2.33
  applu        50.99    7.06
  apsi        105.76   11.21
  art          69.23    4.03
  equake       63.14    1.79
  mesa         89.28    2.33
  mgrid        50.59    3.38
  sixtrack     87.32    4.22
  swim         36.96    2.59
  wupwise     102.97    1.61
  GeoMean      68.28    3.36
32
Program Performance
The performance is the same.
[Chart: relative performance improvement (%, 0–70) of Whole_Train, PEAK_Train, Whole_Ref, and PEAK_Ref for ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and their geometric mean.]
33
Conclusions
Advanced compiler capabilities are crucial for implementing tomorrow’s programming languages:
• The multicore challenge -> parallel programs
  – Automatic parallelization
  – Support for speculative multithreading
  – Shared-memory programming support
• High-level constructs
  – Composition pursues this goal
• Techniques to reason about programs in symbolic terms
• Dynamic tuning