From EARTH to HTMT: The Evolution of a Multithreaded Architecture Model


04/19/23 \Seminar\Spain-00-01 1

Guang R. Gao
Computer Architecture & Parallel Systems Laboratory (CAPSL)
University of Delaware

Outline
• Introduction
• The EARTH Execution and Architecture Model
• The EARTH Programming Model and Threaded-C
• Application Studies and Performance Evaluation
• Related Work and Conclusions

Main Challenges:
Scalable high-performance parallel systems for both Class A and Class B applications

Challenges: The “Killer Latency Problem”
Latency due to:
- Communication
- Synchronization
- Task spawning
- Load balancing
[Figure: two nodes, each labeled C, NI, M, and P, connected through a network]
The SP2 is hard enough; PC clusters are much worse!

Meeting High-End Application Challenges:
• Observation I: Many such applications have “bad latencies”, demanding good support of adaptive fine-grain parallelism [Petaflop-2 Conference, 99-2]

Here Comes the Surprise! [Theobald’s Ph.D. thesis, May 1999]
Observation II: It is not necessarily too hard to “generate” and “program” fine-grain threads!
However, it may be hard to statically group them into coarse-grain threads!

A Base Adaptive Fine-Grain Multithreaded Execution Model
C1 (Abundance): a very large pool of threads
C2 (Ultra-light weight): threads can be spawned as easily and as quickly as possible
C3 (Mobility): threads can be adaptively migrated as easily and as quickly as possible

Motivation of the EARTH Project
How to exploit fine-grain multithreading on a parallel system given off-the-shelf microprocessors


Two Types of Fine-Grain Threads

• A parallel function invocation

• Strand/Fiber - a function body can be divided into several “strands/fibers”

Threads and Fibers
• A fiber becomes enabled once it has received all of its input signals
• An enabled fiber may be selected for execution when the required hardware resource has been allocated
• After a fiber finishes execution, it sends a signal to each destination fiber to update the corresponding sync slot
[Figure legend: fiber within a frame; parallel function invocation; call a procedure; SYNC ops]
Note: the role of the strand!
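The firing rule in the bullets above can be illustrated with a small sequential simulation. This is a minimal sketch under stated assumptions, not the EARTH runtime or Threaded-C: the names `Fiber`, `signal`, and `run`, the single ready queue, and the four-fiber example are all hypothetical and chosen only for illustration.

```python
from collections import deque


class Fiber:
    """Hypothetical fiber: a unit of work guarded by a sync slot (a counter)."""

    def __init__(self, name, sync_count):
        self.name = name
        self.sync_count = sync_count  # signals still needed before the fiber is enabled
        self.destinations = []        # fibers to signal when this one finishes


def signal(fiber, ready):
    """Deliver one signal; the fiber becomes enabled when its count reaches zero."""
    fiber.sync_count -= 1
    if fiber.sync_count == 0:
        ready.append(fiber)


def run(fibers):
    """Run enabled fibers one at a time and return the order in which they fired."""
    ready = deque(f for f in fibers if f.sync_count == 0)
    order = []
    while ready:
        fiber = ready.popleft()       # stands in for "hardware resource allocated"
        order.append(fiber.name)      # the fiber body would run here, uninterrupted
        for dest in fiber.destinations:
            signal(dest, ready)       # update the destination's sync slot
    return order


a = Fiber("a", 0)
b = Fiber("b", 0)
c = Fiber("c", 2)  # needs one signal from a and one from b
d = Fiber("d", 1)  # needs one signal from c
a.destinations = [c]
b.destinations = [c]
c.destinations = [d]
firing_order = run([a, b, c, d])
print(firing_order)
```

In this sketch a fiber is never scheduled before all of its input signals have arrived, which is the dependence-driven behavior the slide describes; a real system would run enabled fibers on many processors rather than draining one queue.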

The Execution Model of Fibers
• Dependence-driven firing rule for fibers
• A fiber is atomic and ultra-lightweight
• Relation with the dataflow model (Dennis72)
[Figure: fibers with numbered sync slots, connected by signals and tokens]

The Threaded C Language
• Threaded C = ANSI C + extensions for multithreading
• Extensions include:
– Threaded functions
– Threaded synchronization
– Support for global addresses
– Data transfer primitives
• Threaded C is:
– The “instruction set” of the EARTH processor
– A target language for high-level compilers
[Figure labels: Users; C, FORTRAN; High-Level Language Translation; Threaded C; Threaded C Compiler; EARTH Platforms]

An Evolutionary Path for EARTH
[Figure labels: CPU / SU; CPU, SU; CPU, LINK, CPU; SEMi Simulation Platform (Theobald99); MANNA-dual/spn; SU-ext; SU-int; parallel machines, PC clusters, ...]

Platforms for EARTH
• MANNA:
– MANNA is an architecture testbed from GMD
– Benchmarking platform for fine-grain multithreading

• EARTH-SP2

• EARTH-Beowulf (Linux based)

• EARTH-SUN/SMP/Cluster


Unique Advantages of EARTH-MANNA Platform

• We can push OS completely out of the way!

• We can design the EARTH runtime system from a very low level up

• The invaluable experience/lessons learned from EARTH-MANNA are essential for the successful migration of the EARTH model to other platforms (e.g. the IBM SP-2 story, etc.)

Summary of Recent Experimental Results (Kevin99)

• Impressive speedup and scalability (scalable even with high overhead fine-grain parallel programs: e.g. fib)

• Enhanced Programmability (N-queen-p example)

• Broad applicability

Experiments
• Example 1 (assorted benchmarks): fib, nqueen, paraffin, tomcatv, matrix-multiply, etc.

• Example 2: Adaptive unstructured grids

• Example 3: Wavelet computation

Performance of N-Queens(12) [Theobald99]
• 117.8-fold speedup on a 120-node simulation!
• 1,637,099 tokens are generated!
• On average, 30+ tokens are maintained per processor
• N-Queens is a useful HTMT benchmark after all! (Phil Murkey)
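To put the N-Queens(12) figures in perspective, the short calculation below derives the parallel efficiency (speedup divided by node count) and the average number of tokens generated per node over the whole run. The function and variable names are illustrative assumptions; only the three input numbers come from the slide.

```python
def parallel_efficiency(speedup, nodes):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / nodes


NODES = 120               # simulated nodes (from the slide)
SPEEDUP = 117.8           # reported speedup (from the slide)
TOKENS = 1_637_099        # tokens generated during the run (from the slide)

nqueens_efficiency = parallel_efficiency(SPEEDUP, NODES)
tokens_per_node = TOKENS / NODES

print(f"efficiency: {nqueens_efficiency:.1%}")          # about 98.2%
print(f"tokens generated per node: {tokens_per_node:.0f}")  # about 13642
```

The gap between roughly 13,600 tokens generated per node over the run and the 30+ tokens held per processor at any moment suggests that tokens are consumed about as fast as they are created.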

Coarse-Grain Applications
• 116-fold speedup on a 120-node machine is achieved for Cannon’s matrix multiply algorithm!
• Deep software systolic-style implementation to exploit parallelism

• Fine-grain mechanisms
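For readers unfamiliar with Cannon’s algorithm, the following is a minimal sequential sketch of its data movement. It assumes an n x n torus of simulated nodes where each node holds a single matrix element; this is a simplification, not the Threaded-C implementation measured above, and the function name `cannon_multiply` is hypothetical.

```python
def cannon_multiply(A, B):
    """Multiply two n x n matrices using Cannon's shift-multiply-accumulate scheme.

    Node (i, j) of a simulated n x n torus holds one element of A, B, and C.
    """
    n = len(A)
    # Initial skew: row i of A shifts left by i, column j of B shifts up by j.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every node multiplies its local pair and accumulates the product.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Systolic step: A moves one position left, B moves one position up.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


product = cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Each node only ever exchanges data with its torus neighbors, which is what makes the algorithm a natural fit for a software systolic-style implementation; in practice each node would hold a matrix block rather than a single element.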

Example 2 --- Adaptive Unstructured Mesh Computation
Observation
• The critical part of the framework is mesh adaptation and load balancing
• The partitioning problem is in better shape; the remapping problem remains open
[Flowchart nodes: Initialization, Partitioning, Mapping, Execution, Solution, Finalization, Repartitioning, Remapping; decision nodes (Y/N): Adapt?, Balanced?, Expensive?]
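One plausible reading of the adaptation loop in this flowchart can be sketched as follows. The order of the three decisions (Adapt?, Balanced?, Expensive?) and the rule that an expensive remapping is skipped in favor of the old mapping are assumptions, since the slide's arrows are not recoverable; the function and parameter names are hypothetical.

```python
def adaptive_mesh_loop(steps, needs_adaptation, is_balanced, remap_too_expensive):
    """Return the sequence of phases executed over `steps` solver iterations.

    The three predicate arguments take the step index and return a bool;
    they stand in for the flowchart's Adapt?, Balanced?, and Expensive? tests.
    """
    phases = ["initialization", "partitioning", "mapping"]
    for step in range(steps):
        phases.append("solution")
        if not needs_adaptation(step):
            continue                      # Adapt? = N: keep solving on the same mesh
        phases.append("adaptation")
        if is_balanced(step):
            continue                      # Balanced? = Y: no load balancing needed
        phases.append("repartitioning")
        if remap_too_expensive(step):
            continue                      # Expensive? = Y: assumed to keep the old mapping
        phases.append("remapping")
    phases.append("finalization")
    return phases


trace = adaptive_mesh_loop(
    steps=3,
    needs_adaptation=lambda step: step >= 1,
    is_balanced=lambda step: False,
    remap_too_expensive=lambda step: step == 2,
)
print(trace)
```

The sketch makes the slide's observation concrete: partitioning is a well-defined step, while whether and how to remap after each adaptation depends on a cost judgment.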

[Figure: two panels showing Node 0, Node 1, ..., Node N: “The Initial Picture” and “The Mapping After a Few Iterations”]

Initial Results
• About 3000 lines of Threaded-C code
• Migration >= 70% (good)
• Unbiased variance = 3 - 5% (very good)
• A good speedup on EARTH-MANNA has been observed


Example 3 --- Adaptive Wavelet Transformation

• Load evolution pattern is dynamically changing, but is statically predictable

• Need adaptive load redistribution/grouping

• Mapping onto EARTH [IPPS99]

HTMT Facility (Perspective)

HTMT Architecture
[Figure: SPELLs together with arrays of SPIM and DPIM modules]

Extensions to Current EARTH Model
• Percolation Model
• Memory Model: Location Consistency
• Load balancing and percolation

HTMT Percolation Model
[Figure labels: Parcel Invocation & Termination; Parcel Assembly & Disassembly; Parcel Dispatcher & Dispenser; I-Pool; T-Pool; A-Pool; D-Pool; SRAM-PIM; Run Time System; DMA; Split-Phase Synchronization to SRAM; start, done; CRYOGENIC AREA; CRAM; SCP Execution Unit]

The System Software Architecture
Note:
• The Threaded-C compiler has part of its functions embedded in the RTS
• The RTS will work with the architecture and OS layers to provide the PXM interface
• The performance models are defined across all layers
[Figure labels: Applications; High-level languages (e.g. parallel C, etc.); High-level language compiler; HTMT-C/Threaded-C; Threaded-C Compiler and Tool Set; Threaded-C Compiler - RTS interface; RTS; PXM Interface; RTS-OS interface; RTS-hardware architecture interface; OS; Hardware Architectures; Performance Models]

Evolution of Multithreaded Architecture Models
Non-dataflow based:
• CDC 6600 (1964)
• MASA (Halstead, 1986)
• HEP (B. Smith, 1978)
• Cosmic Cube (Seitz, 1985)
• J-Machine (Dally, 1988-93)
• M-Machine (Dally, 1994-98)
Dataflow model inspired:
• Static Dataflow (Dennis, 1972, MIT)
• MIT TTDA (Arvind, 1980)
• Manchester (Gurd & Watson, 1982)
• *T/Start-NG (MIT/Motorola, 1991-)
• SIGMA-I (Shimada, 1988)
• Arg-Fetching Dataflow (Dennis, Gao, 1987-88)
• MDFA (Gao, 1989-93)
• MTA (Hum, Theobald, Gao, 94)
• Monsoon (Papadopoulos & Culler, 1988)
• P-RISC (Nikhil & Arvind, 1989)
• EM-5/4/X, RWC-1 (1992-97)
• EARTH (PACT95’, ISCA96, Theobald99)
• Iannucci’s (1988-92)
Others: Multiscalar (1994), SMT (1995), etc.
Also shown:
• Flynn’s Processor (1969)
• CHoPP’77, CHoPP’87
• TAM (Culler, 1990)
• Tera (B. Smith, 1990-)
• Alewife (Agarwal, 1989-96)
• Cilk (Leiserson)
• XMT (Vishkin)

Acknowledgement (Incomplete List)
• Erik Altman
• Haiying Cai
• Nasser Elmasri
• Gerd Heber
• Laurie J. Hendren
• Herbert Hum
• Alberto Jimenez
• Prasad Kakulavarapu
• Cheng Li
• Olivier Maquelin
• Andres Marquez
• Shashank Nemawarkar
• Zach Ruiz
• Sean Ryan
• V.C. Sreedhar
• Xinan Tang
• Kevin Theobald
• Ruppa Thulasiram
• Parimala Thulasiraman
• Xinmin Tian
• Yingchun Zhu
• J. Nelson Amaral

NSERC, FCAR, DARPA, NSA, NSF, NASA