Tomorrow's Learners, Tomorrow's Technologies: Preparing for the Predictions
Implementing Tomorrow's Programming Languages
Transcript of Implementing Tomorrow's Programming Languages
1
Implementing Tomorrow's Programming Languages
Rudi Eigenmann
Purdue University
School of ECE
Computing Research Institute
Indiana, USA
2
How to find Purdue University
3
Computing Research Institute (CRI)
CRI is the high-performance computing branch of Discovery Park.
Other DP Centers: Bioscience, Nanotechnology, E-Enterprise, Entrepreneurship, Learning, Advanced Manufacturing, Environment, Oncology
4
Compilers are the Center of the Universe
The compiler translates the programmer’s view into the machine’s view.

Today:
  programmer’s view:  DO I=1,n
                        a(I)=b(I)
                      ENDDO
  machine’s view:     Subr doit
                      Loop: Load 1,R1
                            . . .
                            Move R2, x
                            . . .
                            BNE loop

Tomorrow:
  programmer’s view:  Do Weatherforecast
  machine’s view:     Compute on machine x
                      Remote call doit
5
Why is Writing Compilers Hard? … a high-level view
• Translation passes are complex algorithms
• Not enough information at compile time
  – Input data not available
  – Insufficient knowledge of the architecture
  – Not all source code available
• Even with sufficient information, modeling performance is difficult
• Architectures are moving targets
6
Why is Writing Compilers Hard? … from an implementation angle
• Interprocedural analysis
• Alias/dependence analysis
• Pointer analysis
• Information gathering and propagation
• Link-time, load-time, run-time optimization
  – Dynamic compilation/optimization
  – Just-in-time compilation
  – Autotuning
• Parallel/distributed code generation
7
It’s Even Harder Tomorrow
Because we want:
• All our programs to work on multicore processors
• Very high-level languages
  – Do weather forecast …
• Composition: combine the weather forecast with an energy-reservation and cooling manager
• Reuse: warn me if I’m writing a module that exists “out there”.
8
How Do We Get There?
Paths towards tomorrow’s programming languages

Addressing the (new) multicore challenge:
• Automatic Parallelization
• Speculative Parallel Architectures
• SMP languages for distributed systems

Addressing the (old) general software engineering challenge:
• High-level languages
• Composition
• Symbolic analysis
• Autotuning
9
The Multicore Challenge
• We have finally reached the long-expected “speed wall” for the processor clock.
  – (this should not be news to you!)
• “… one of the biggest disruptions in the evolution of information technology.”
• “Software engineers who do not know parallel programming will be obsolete in no time.”
10
Automatic Parallelization
Can we implement standard languages on multicore?
… more specifically: with a source-to-source restructuring compiler

Research issues in such a compiler:
– Detecting parallelism
– Mapping parallelism onto the machine
– Performing compiler techniques at runtime
– Compiler infrastructure

Polaris – A Parallelizing Compiler:
Standard Fortran -> Polaris -> Fortran + directives (OpenMP) -> OpenMP backend compiler
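To make "detecting parallelism" concrete, here is a minimal sketch of one classic dependence test a parallelizer in the spirit of Polaris can apply to array subscripts: the GCD test. The function name and interface are illustrative, not Polaris APIs; the test is conservative, answering only whether a loop-carried dependence can be ruled out.

```python
from math import gcd

def gcd_test(write_coef, write_const, read_coef, read_const):
    """GCD test for a write a(w_c*i + w_k) against a read a(r_c*i + r_k):
    a dependence is possible only if gcd(w_c, r_c) divides (r_k - w_k).
    Returns True when a dependence CANNOT be ruled out (conservative)."""
    g = gcd(write_coef, read_coef)
    if g == 0:
        # Both subscripts are constants: dependence iff they are equal.
        return write_const == read_const
    return (read_const - write_const) % g == 0

# DO I=1,n : a(2*I) = a(2*I+1)  -- writes even, reads odd elements
print(gcd_test(2, 0, 2, 1))   # False: independent, loop can run parallel
# DO I=1,n : a(I) = a(I-1)     -- a classic recurrence
print(gcd_test(1, 0, 1, -1))  # True: dependence possible, keep serial
```

When the test proves independence, the restructurer can annotate the loop with an OpenMP directive and leave code generation to the backend compiler.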
11
[Chart: speedups (0–5) achieved by Polaris on the benchmarks ARC2D, FLO52Q, HYDRO2D, MDG, SWIM, TOMCATV, and TRFD.]
State of the Art in Automatic Parallelization
• Advanced optimizing compilers perform well in 50% of all science/engineering applications.
• Caveats: this is true
  – in research compilers
  – for regular applications, written in Fortran or C without pointers
• Wanted: heroic, black-belt programmers who know the “assembly language of HPC”
12
Can Speculative Parallel Architectures Help?
Basic idea:
• The compiler splits the program into sections (without considering data dependences)
• The sections are executed in parallel
• The architecture tracks data-dependence violations and takes corrective action.
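The idea above can be modeled in a few lines: execute every section on a private snapshot while recording read/write sets, then commit in program order and re-execute any section that read a location an earlier section wrote. This is only a software sketch of what the hardware does; the function and section names are invented for illustration.

```python
import copy

def run_speculative(sections, state):
    """Toy model of speculative multithreading: sections run on private
    snapshots; at commit time, a section whose reads overlap an earlier
    section's writes saw stale data and is squashed and re-executed."""
    results = []
    for sec in sections:                      # "parallel" speculative phase
        snap = copy.deepcopy(state)
        reads, writes = set(), set()
        sec(snap, reads, writes)
        results.append((sec, snap, reads, writes))

    committed_writes = set()
    for sec, snap, reads, writes in results:  # commit in program order
        if reads & committed_writes:          # dependence violation detected
            reads, writes = set(), set()
            sec(state, reads, writes)         # corrective re-execution
        else:
            for k in writes:
                state[k] = snap[k]            # safe: publish speculative writes
        committed_writes |= writes
    return state

# Two sections with a flow dependence the hardware must catch:
def s1(st, r, w): st["x"] = 1; w.add("x")
def s2(st, r, w): r.add("x"); st["y"] = st["x"] + 1; w.add("y")

print(run_speculative([s1, s2], {"x": 0, "y": 0}))  # {'x': 1, 'y': 2}
```

Without the violation check, s2 would commit y = 1, computed from the stale x = 0.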
13
Performance of Speculative Multithreading
[Chart: speedups (0–4.5) for FPPPP, APSI, TURB3D, APPLU, WAVE5, SU2COR, TOMCATV, HYDRO2D, SWIM, FLO52, ARC2D, TRFD, and MGRID under the Implicit-Only, Multiplex-Naïve, Multiplex-Selective, and Multiplex-Profile schemes. SPEC CPU2000 FP programs executed on a 4-core speculative architecture.]
14
We may need Explicit Parallel Programming
Shared-memory architectures:
• OpenMP: proven model for science/engineering programs. Suitability for non-numerical programs?
Distributed computers:
• MPI: the assembly language of parallel/distributed systems. Can we do better?
15
Beyond Science & Engineering Applications
7+ Dwarfs:
1. Structured Grids (including locally structured grids, e.g., Adaptive Mesh Refinement)
2. Unstructured Grids
3. Fast Fourier Transform
4. Dense Linear Algebra
5. Sparse Linear Algebra
6. Particles
7. Monte Carlo
8. Search/Sort
9. Filter
10. Combinational Logic
11. Finite State Machine
16
Shared-Memory Programming for Distributed Applications?
• Idea 1: Use an underlying software distributed-shared-memory (S-DSM) system (e.g., TreadMarks).
• Idea 2: Direct translation into message-passing code
17
OpenMP for Software DSM: Challenges
• S-DSM maintains coherency at the page level
• Optimizations that reduce false sharing and increase page affinity are very important
• In S-DSMs such as TreadMarks, the stacks are not in the shared address space
• The compiler must identify shared stack variables via interprocedural analysis
[Diagram: shared memory vs. distributed memories with per-processor stacks outside the shared address space. Over time t, Processor 1 executes “A[50] = …” and Processor 2 later executes “… = A[50]”; at the barrier, P1 tells P2 “I have written page x”, and P2 requests the page “diff” from P1.]
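The barrier-time exchange in the diagram can be sketched with TreadMarks' twin/diff mechanism: on the first write to a page the S-DSM keeps a pristine copy (the "twin"), and at synchronization another processor pulls only the words that changed (the "diff"). This is a simplified model, not TreadMarks code; function names are invented.

```python
def make_twin(page):
    """On the first write to a page, keep a pristine copy (the 'twin')."""
    return list(page)

def make_diff(twin, page):
    """Encode only the words that changed since the twin was taken --
    this is the 'diff' another processor requests at a barrier."""
    return {i: v for i, (old, v) in enumerate(zip(twin, page)) if old != v}

def apply_diff(page, diff):
    for i, v in diff.items():
        page[i] = v

# Processor 1 writes word 50 of a shared page...
p1_page = [0] * 1024
twin = make_twin(p1_page)
p1_page[50] = 7
# ...at the barrier, P2 learns the page changed and pulls only the diff.
p2_page = [0] * 1024
diff = make_diff(twin, p1_page)
apply_diff(p2_page, diff)
print(diff)         # {50: 7}
print(p2_page[50])  # 7
```

Sending diffs instead of whole pages is what makes false-sharing reduction and page affinity so important: the fewer processors touch a page, the fewer diffs fly around.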
18
SPEC OMPM2001 Performance
[Chart: speedups (0–6) on 1, 2, 4, and 8 processors for wupwise, swim, mgrid, art, equake, and applu; baseline vs. optimized performance. Optimized performance of SPEC OMPM2001 benchmarks on a TreadMarks S-DSM system.]
19
Direct Translation of OpenMP into Message Passing
A question often asked: how is this different from HPF?
• HPF: the emphasis is on data distribution. OpenMP: the starting point is explicit parallel regions.
• HPF: implementations apply strict data-distribution and owner-computes schemes.
Our approach: partial replication of shared data. Partial replication leads to
– synchronization-free serial code
– communication-free data reads
– communication for data writes amenable to collective message passing
– irregular accesses (in our benchmarks) amenable to compile-time analysis
Note: partial replication is not necessarily “data scalable”.
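A minimal model of the replication scheme, with plain Python lists standing in for MPI processes: every "process" holds a full replica of the shared array, computes only its chunk of the parallel loop (so all reads are local), and a collective exchange (the role MPI_Allgatherv would play) merges the written chunks. Names and the loop body are illustrative, not the actual translator's output.

```python
def translated_parallel_loop(nprocs, n):
    """Toy model of OpenMP-to-MPI translation with replicated shared
    data: reads are communication-free; only writes are exchanged,
    collectively, after the loop."""
    replicas = [[0] * n for _ in range(nprocs)]  # full copy per process
    chunks = []
    for rank in range(nprocs):
        lo = rank * n // nprocs
        hi = (rank + 1) * n // nprocs
        for i in range(lo, hi):            # this rank's 'omp for' chunk
            replicas[rank][i] = i * i      # reads/writes purely local
        chunks.append((lo, hi))
    # Collective write exchange (stand-in for MPI_Allgatherv):
    for rank, (lo, hi) in enumerate(chunks):
        for other in range(nprocs):
            replicas[other][lo:hi] = replicas[rank][lo:hi]
    return replicas

reps = translated_parallel_loop(4, 16)
print(all(r == [i * i for i in range(16)] for r in reps))  # True
```

The "not necessarily data scalable" caveat is visible here: every process stores the whole array, so memory per process does not shrink as nprocs grows.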
20
Performance of OpenMP-to-MPI Translation
Performance comparison of our OpenMP-to-MPI translated versions versus (hand-coded) MPI versions of the same programs. “Speedup” is relative to the serial version; higher is better.
Hand-coded MPI represents a practical “upper bound”.
[Chart legend: Hand-coded MPI vs. OpenMP-to-MPI.]
21
How does the performance compare to the same programs optimized for software DSM?
[Chart: OpenMP-to-MPI versus OpenMP for S-DSM; higher is better.]
22
How Do We Get There?
Paths towards tomorrow’s programming languages
The (new) multicore challenge:
• Automatic Parallelization
• Speculative Parallel Architectures
• SMP languages for distributed systems
The (old) general software engineering challenge:
• High-level languages
• Composition
• Symbolic analysis
• Autotuning
23
(Very) High-Level Languages
Observation: “The number of programming errors is roughly proportional to the number of program lines.”
[Figure: progression of abstraction levels — Assembly -> Fortran -> object-oriented languages -> scripting, Matlab -> ?]
• Probably domain-specific
• How efficient?
  – Very efficient, because there is much flexibility in translating VHLLs
  – Inefficient, by experience
24
Composition
Can we compose software from existing modules?
• Idea: add an “abstract algorithm” (AA) construct to the programming language
  – the programmer defines the AA’s goal
  – the AA is called like a procedure
• The compiler replaces each AA call with a sequence of library calls
  – How does the compiler do this? It uses a domain-independent planner that accepts procedure specifications as operators.
25
Motivation: Programmers Often Write Sequences of Library Calls
Example: a common BioPerl call sequence — “Query a remote database and save the result to local storage:”

Query q = bio_db_query_genbank_new("nucleotide",
    "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]");
DB db = bio_db_genbank_new();
Stream stream = get_stream_by_query(db, q);
SeqIO seqio = bio_seqio_new(">sequence.fasta", "fasta");
Seq seq = next_seq(stream);
write_seq(seqio, seq);

5 data types, 6 procedure calls
Example adapted from http://www.bioperl.org/wiki/HOWTO:Beginners
26
Defining and Calling an AA
• An AA (goal) is defined using the glossary...

algorithm save_query_result_locally(db_name, query_string, filename, format)
  => { query_result(result, db_name, query_string),
       contains(filename, result),
       in_format(filename, format) }

...and called like a procedure:

Seq seq = save_query_result_locally("nucleotide",
    "Arabidopsis[ORGN] AND topoisomerase[TITL] AND 0:3000[SLEN]",
    ">sequence.fasta", "fasta");

1 data type, 1 AA call
27
“Ontological Engineering”
• The library author provides a domain glossary:
  – query_result(result, db, query) – result is the outcome of sending query to the database db
  – contains(filename, data) – the file named filename contains data
  – in_format(filename, format) – the file named filename is in format format
28
Implementing the Composition Idea
Borrowing AI technology: planners
-> for details, see PLDI 2006
[Diagram: a domain-independent planner receives an initial state (the call context), a goal state (the AA definition), and operators (the library specifications), and produces a plan (realized by the compiler) whose actions (the executable) act on the world (the user’s data).]
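A toy version of the planning step can be written in a few lines: treat library procedures as operators with precondition and effect facts, and chain forward from the call context until the AA's goal facts hold. This is a greedy sketch, far simpler than the domain-independent planner of the PLDI 2006 system, and the fact names attached to the BioPerl procedures are invented for illustration.

```python
def plan(initial, goal, operators, max_steps=10):
    """Greedy forward-chaining planner: operators are (name, preconds,
    effects) triples over sets of ground facts. Returns the sequence of
    operator names that achieves the goal, or None."""
    state, steps = set(initial), []
    for _ in range(max_steps):
        if goal <= state:
            return steps
        for name, pre, eff in operators:
            if pre <= state and not eff <= state:  # applicable and useful
                state |= eff
                steps.append(name)
                break
        else:
            return None                            # no applicable operator
    return steps if goal <= state else None

# Hypothetical specifications for the BioPerl calls of the earlier slide:
ops = [
    ("bio_db_query_genbank_new", set(),                   {"have_query"}),
    ("bio_db_genbank_new",       set(),                   {"have_db"}),
    ("get_stream_by_query",  {"have_db", "have_query"},   {"query_result"}),
    ("bio_seqio_new",            set(),                   {"have_seqio"}),
    ("next_seq",             {"query_result"},            {"have_seq"}),
    ("write_seq",       {"have_seqio", "have_seq"}, {"contains", "in_format"}),
]
goal = {"query_result", "contains", "in_format"}  # from the AA definition
print(plan(set(), goal, ops))
```

The returned name sequence is exactly the six-call BioPerl chain the programmer wrote by hand, now derived from the one-line AA goal.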
29
Symbolic Program Analysis
• Today: many compiler techniques assume numerical constants.
• Needed: techniques that can reason about the program in symbolic terms:
  – differentiate: ax^2 -> 2ax
  – analyze ranges: y=exp; if (c) y+=5;  ->  y=[exp:exp+5]
  – recognize algorithms:
      c=0
      DO j=1,n
        if (t(j)<v) c+=1
      ENDDO                 ->  c = COUNT(t[1:n]<v)
30
Autotuning (dynamic compilation/adaptation)
• Moving compile-time decisions to runtime
• A key observation: compiler writers “solve” difficult decisions by creating a command-line option
  -> finding the best combination of options means making the difficult compiler decisions.
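The observation above suggests the simplest possible tuner: enumerate combinations of the exposed options and keep the fastest. The sketch below does exactly that; the option names and the deterministic cost function stand in for a real compile-and-measure cycle, and none of this is the PEAK system's actual interface.

```python
from itertools import product

def autotune(flag_choices, run_time):
    """Exhaustive option-based autotuning: search all combinations of
    command-line options, keeping the configuration with the lowest
    measured time. run_time(cfg) stands in for compile-and-run."""
    best_cfg, best_t = None, float("inf")
    for combo in product(*flag_choices.values()):
        cfg = dict(zip(flag_choices.keys(), combo))
        t = run_time(cfg)
        if t < best_t:
            best_cfg, best_t = cfg, t
    return best_cfg, best_t

# Hypothetical option space and a deterministic stand-in cost model:
flags = {"unroll": [1, 2, 4, 8], "O": [1, 2, 3]}
cost = lambda c: abs(c["unroll"] - 4) + (3 - c["O"])
print(autotune(flags, cost))  # ({'unroll': 4, 'O': 3}, 0)
```

Exhaustive search is exactly what makes whole-program tuning slow; systems like PEAK cut the cost by tuning individual code sections against short performance-probe runs rather than timing the whole program per configuration.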
31
Tuning Time
PEAK is 20 times as fast as the whole-program tuning.
On average, PEAK reduces tuning time from 2.19 hours to 5.85 minutes.
Normalized tuning time (Whole-program vs. PEAK):

  Benchmark   Whole    PEAK
  ammp         62.22    2.33
  applu        50.99    7.06
  apsi        105.76   11.21
  art          69.23    4.03
  equake       63.14    1.79
  mesa         89.28    2.33
  mgrid        50.59    3.38
  sixtrack     87.32    4.22
  swim         36.96    2.59
  wupwise     102.97    1.61
  GeoMean      68.28    3.36
32
Program Performance
The performance is the same.
[Chart: relative performance improvement (%, 0–70) of Whole_Train, PEAK_Train, Whole_Ref, and PEAK_Ref for ammp, applu, apsi, art, equake, mesa, mgrid, sixtrack, swim, wupwise, and their geometric mean.]
33
Conclusions
Advanced compiler capabilities are crucial for implementing tomorrow’s programming languages:
• The multicore challenge -> parallel programs
  – Automatic parallelization
  – Support for speculative multithreading
  – Shared-memory programming support
• High-level constructs
  – Composition pursues this goal
• Techniques to reason about programs in symbolic terms
• Dynamic tuning