Transcript of: Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters – Q3 Status Rpt. (10/15/04, 33 slides)

Page 1

Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters – Q3 Status Rpt.

High-Perf. Networking (HPN) Group, HCS Research Laboratory
ECE Department, University of Florida

Principal Investigator: Professor Alan D. George
Sr. Research Assistant: Mr. Hung-Hsun Su

Page 2

Outline

Objectives and Motivations

Background

Related Research

Approach

Results

Conclusions and Future Plans

Page 3

Objectives and Motivations

Objectives
- Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency System-Area Networks (SANs) and LANs
- Design and analysis of tools to support UPC on SAN-based systems
- Benchmarking and case studies with key UPC applications
- Analysis of tradeoffs in application, network, service, and system design

Motivations
- Increasing demand in the sponsor and scientific computing community for shared-memory parallel computing with UPC
- New and emerging technologies in system-area networking and cluster computing: Scalable Coherent Interface (SCI), Myrinet (GM), InfiniBand (VAPI), QsNet (Quadrics Elan), Gigabit Ethernet and 10 Gigabit Ethernet
- Clusters offer excellent cost-performance potential

Page 4

Background

Key sponsor applications and developments toward shared-memory parallel computing with UPC
- UPC extends the C language to exploit parallelism (see the short example below)
- Currently runs best on shared-memory multiprocessors or proprietary clusters (e.g., AlphaServer SC), notably with HP/Compaq's UPC compiler
- First-generation UPC runtime systems becoming available for clusters: MuPC, Berkeley UPC
- Significant potential advantage in cost-performance ratio with Commercial Off-The-Shelf (COTS) cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with COTS technologies
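To make the programming model concrete, here is a minimal UPC program (illustrative only, not taken from the report). It uses the standard upc.h constructs THREADS, MYTHREAD, a shared array, upc_forall, and upc_barrier; the array name and size are arbitrary.

    #include <upc.h>
    #include <stdio.h>

    #define N 1024

    shared int a[N];        /* distributed cyclically across all UPC threads */

    int main(void)
    {
        int i;

        /* Each thread initializes only the elements it has affinity to. */
        upc_forall (i = 0; i < N; i++; &a[i])
            a[i] = i;

        upc_barrier;        /* make all writes visible before remote reads */

        if (MYTHREAD == 0)
            printf("a[%d] = %d, read by thread 0 of %d threads\n",
                   N - 1, a[N - 1], THREADS);
        return 0;
    }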


Page 5

Related Research

University of California at Berkeley
- UPC runtime system
- UPC-to-C translator
- Global-Address Space Networking (GASNet) design and development
- Application benchmarks

George Washington University
- UPC specification
- UPC documentation
- UPC testing strategies, testing suites
- UPC benchmarking
- UPC collective communications
- Parallel I/O

Michigan Tech University
- Michigan Tech UPC (MuPC) design and development
- UPC collective communications
- Memory model research
- Programmability studies
- Test suite development

Ohio State University
- UPC benchmarking

HP/Compaq
- UPC compiler

Intrepid
- GCC UPC compiler

Page 6

Related Research -- MuPC & DSM

MuPC (Michigan Tech UPC)
- First open-source reference implementation of UPC for COTS clusters
- Any cluster that provides Pthreads and MPI can use it
- Built as a reference implementation, so performance is secondary
- Limitations in application size and memory model; not suitable for performance-critical applications

UPC/DSM/SCI
- SCI-VM (DSM system for SCI)
- HAMSTER interface allows multiple modules to support MPI and shared-memory models
- Created using the Dolphin SISCI API and ANSI C
- SCI-VM is not under constant development, so future upgrades are uncertain
- Not feasible given the amount of work needed versus the expected performance
- Better possibilities with GASNet

Page 7

Related Research -- GASNet

Global-Address Space Networking (GASNet) [1]: communication system created by U.C. Berkeley; target for the Berkeley UPC system
- Language-independent, low-level networking layer for high-performance communication
- Segment region for communication on each node, three types:
  - Segment-fast: sacrifices size for speed
  - Segment-large: allows a large memory area for shared space, perhaps with some loss in performance (though the firehose [2] algorithm is often employed)
  - Segment-everything: exposes the entire virtual memory space of each process for shared access
- Firehose algorithm allows memory to be managed in buckets for efficient transfers
- Interface for high-level global-address-space SPMD languages: UPC [3] and Titanium [4]
- Divided into two layers:
  - Core: Active Messages
  - Extended: high-level operations that take direct advantage of network capabilities
- A reference implementation of the Extended API is available that uses only the Core layer
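As a rough sketch of how a client uses the GASNet layers, the fragment below initializes GASNet, attaches a segment, and performs blocking one-sided transfers through the Extended API. It follows the GASNet-1 specification cited as [1], but the exact prototypes, error handling, and segment-size rules should be checked against that document; this is an illustration, not production code.

    #include <gasnet.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        gasnet_init(&argc, &argv);

        /* No AM handlers registered here; request a one-page segment. */
        gasnet_attach(NULL, 0, GASNET_PAGESIZE, 0);

        gasnet_node_t me   = gasnet_mynode();
        gasnet_node_t peer = (me + 1) % gasnet_nodes();

        /* Look up where every node's registered segment lives. */
        gasnet_seginfo_t seginfo[gasnet_nodes()];
        gasnet_getSegmentInfo(seginfo, gasnet_nodes());

        char msg[64];
        snprintf(msg, sizeof msg, "hello from node %d", (int)me);

        /* Blocking one-sided put into the peer's segment, then read it back. */
        gasnet_put(peer, seginfo[peer].addr, msg, sizeof msg);

        char echo[64];
        gasnet_get(echo, peer, seginfo[peer].addr, sizeof echo);
        printf("node %d read back: \"%s\"\n", (int)me, echo);

        gasnet_exit(0);
        return 0;
    }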

Page 8

Related Research -- Berkeley UPC

- Second open-source implementation of UPC for COTS clusters; first with a focus on performance
- Uses GASNet for all accesses to remote memory; network conduits allow high performance over many different interconnects
- Targets a variety of architectures: x86, Alpha, Itanium, PowerPC, SPARC, MIPS, PA-RISC
- Best chance as of now for high-performance UPC applications on COTS clusters
- Note: only supports strict shared-memory access and therefore only uses the blocking transfer functions in the GASNet spec

[Figure: Berkeley UPC software stack – UPC code, translator, translator-generated C code, Berkeley UPC runtime system, GASNet communication system, network hardware; the layers are variously platform-, network-, compiler-, and language-independent]

Page 9

Approach

Collaboration
- HP/Compaq UPC Compiler V2.1 running in our lab on an ES80 AlphaServer (Marvel)
- Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function performance evaluation
- Field test of the newest compiler and system

Exploiting SAN strengths for UPC
- Design and develop a new SCI conduit for GASNet in collaboration with UCB/LBNL
- Evaluate DSM for SCI as an option for executing UPC

Benchmarking
- Use and design of applications in UPC to grasp key concepts and understand performance issues
- NAS benchmarks from GWU; DES-cipher benchmark from UF

Performance analysis
- Network communication experiments
- UPC computing experiments

Emphasis on SAN options and tradeoffs
- SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.

[Figure: collaboration diagram spanning upper layers (applications, translators, documentation), middle layers (runtime systems, interfaces), and lower layers – GWU: benchmarks, documents, specification; UC Berkeley: benchmarks, UPC-to-C translator, specification, C runtime system, upper levels of GASNet, GASNet; Michigan Tech: benchmarks, modeling, specification, UPC-to-MPI translation and runtime system; Ohio State: benchmarks; HP: UPC runtime system on AlphaServer; UF HCS Lab: benchmarks, GASNet collaboration, beta testing, network performance analysis]

Page 10

GASNet SCI Conduit

Scalable Coherent Interface (SCI)
- Low-latency, high-bandwidth SAN with shared-memory capabilities
- Requires memory exporting and importing
- PIO (requires importing) + DMA (needs 8-byte alignment)
- Remote write ~10x faster than remote read

SCI conduit
- AM enabling (Core API)
  - Dedicated AM message channels (Command)
  - Request/Response pairs to prevent deadlock
  - Flags to signal arrival of a new AM (Control)
- Put/Get enabling (Extended API)
  - Global segment (Payload)

[Figure: SCI conduit memory layout per node – control segments (N total), command segments (N×N total), and payload segments (N total) exported into SCI space and imported into each node's virtual address space, with local DMA queues]
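As a rough C sketch of the per-node bookkeeping implied by the layout above (all type and field names here are hypothetical illustrations, not the conduit's actual source):

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_NODES 64   /* N: number of nodes in the job (assumed bound) */

    typedef struct {
        /* Control segments imported from each peer: flags that signal that a
         * new AM exists and what type it is. */
        volatile uint8_t *control[MAX_NODES];       /* N control segments */

        /* Command segments: one dedicated AM channel per (sender, receiver)
         * pair, N*N overall; this node imports the N addressed to it and
         * exports the N it writes into. */
        volatile uint8_t *command_in[MAX_NODES];    /* peers -> this node */
        volatile uint8_t *command_out[MAX_NODES];   /* this node -> peers */

        /* Payload segment exported for one-sided put/get (Extended API). */
        void   *payload;                            /* one of N payload segments */
        size_t  payload_size;

        /* Local DMA queues used for large, aligned transfers. */
        void   *dma_queue;
    } sci_conduit_node_state_t;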

Page 11

GASNet SCI Conduit - Core API: Active Message Transferring

1. Obtain a free slot (tracked locally using an array of flags)
2. Package the AM header
3. Transfer data
   - Short AM: PIO write (header)
   - Medium AM: PIO write (header), PIO write (medium payload)
   - Long AM: PIO write (header); PIO write for the long payload (payload sizes up to 1024 bytes and any unaligned portion of the payload); DMA write in multiples of 64 bytes
4. Wait for transfer completion
5. Signal AM arrival
   - Message Ready flag: value = type of AM
   - Message Exist flag: value = TRUE
6. Wait for the reply/control signal; free up the remote slot for reuse

[Figure: sender side – AM header, medium/long AM payloads, and flags written into node Y's control, command, and payload segments; receiver side – polling flowchart: check Message Exist flag, extract message information, process all new messages, send AM reply or ack, process reply, other processing]

Page 12

Experimental Testbed

Elan, VAPI (Xeon), MPI, and SCI conduits
- Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
- SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
- Quadrics: 528 MB/s (340 MB/s sustained) Elan3, using PCI-X in two nodes with QM-S16 16-port switch
- InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Infiniserv HCAs, using PCI-X 100, InfiniIO 2000 8-port switch from Infinicon
- RedHat 9.0 with gcc compiler V3.3.2; SCI uses MP-MPICH beta from RWTH Aachen University, Germany; Berkeley UPC runtime system 1.1

VAPI (Opteron)
- Nodes: dual AMD Opteron 240, 1 GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard
- InfiniBand: same as in VAPI (Xeon)

GM (Myrinet) conduit (c/o access to cluster at MTU)
- Nodes*: dual 2.0 GHz Intel Xeons, 2 GB DDR PC2100 (DDR266) RAM
- Myrinet*: 250 MB/s Myrinet 2000, using PCI-X, on 8 nodes connected with a 16-port M3F-SW16 switch
- RedHat 7.3 with Intel C compiler V7.1; Berkeley UPC runtime system 1.1

ES80 AlphaServer (Marvel)
- Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connections
- Tru64 5.1B Unix, HP UPC V2.1 compiler

* via testbed made available courtesy of Michigan Tech

Page 13

SCI Conduit GASNet Core Level Experiments

Experimental setup
- SCI conduit: latency/throughput (testam, 10,000 iterations)
- SCI raw: PIO latency (scipp); DMA latency and throughput (dma_bench)

Analysis
- Latency is a little high, but the overhead is constant (not exponential)
- Throughput follows the raw SCI trend

[Figures: Short/Medium AM ping-pong latency (µs) and Long AM ping-pong latency (µs) vs. payload size, and Long AM throughput (MB/s) vs. payload size, comparing SCI raw and the SCI conduit; the PIO/DMA mode shift is visible in the Long AM curves]

Page 14

SCI Conduit GASNet Extended Level Experiments

Experimental setup
- GASNet configured with segment-large
  - As fast as segment-fast for accesses inside the segment
  - Makes use of firehose for memory outside the segment (often more efficient than segment-fast)
- GASNet conduit experiments use the Berkeley GASNet test suite
  - Average of 1,000 iterations
  - Each test uses put/get operations to take advantage of the implemented Extended APIs
  - Executed with target memory falling inside and then outside the GASNet segment; only inside results reported unless the difference was significant
  - Latency results use testsmall; throughput results use testlarge

Analysis
- Elan shows the best latency for puts and gets
- VAPI has by far the best bandwidth; latency very good
- GM latencies are a little higher than all the rest
- HCS SCI conduit shows better put latency than MPI on SCI for sizes > 64 bytes, and is very close to MPI on SCI for smaller messages
- HCS SCI conduit get latency is slightly higher than MPI on SCI
- GM and SCI provide about the same throughput; the HCS SCI conduit has slightly higher bandwidth for the largest message sizes

Quick look at estimated total cost to support 8 nodes of these interconnect architectures:
- SCI: ~$8,700
- Myrinet: ~$9,200
- InfiniBand: ~$12,300
- Elan3: ~$18,000 (based on Elan4 pricing structure, which is slightly higher)

* via testbed made available courtesy of Michigan Tech

Page 15

GASNet Extended Level Latency

[Figure: round-trip latency (µsec) vs. message size (1 byte to 1 KB) for put and get on the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits]

Page 16

GASNet Extended Level Throughput

[Figure: throughput (MB/s) vs. message size (128 bytes to 256 KB) for put and get on the GM, Elan, VAPI, MPI SCI, and HCS SCI conduits]

Page 17

Matisse IP-Based Networks

- Switch-based GigE network with a DWDM backbone between switches for high scalability
- Product in alpha testing stage

Experimental setup
- Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
- Setup: 1 switch – all nodes connected to one switch; 2 switches – half of the nodes connected to each switch, with either short (1 km) or long (12.5 km) fiber between the switches
- Tests: low level – Pallas benchmark (ping-pong and send-recv); GASNet level – testsmall

[Figures: Pallas PingPong and SendRecv (2 nodes) throughput (MB/s) vs. message size (8 bytes to 4096 KB) for 1 switch, 2 switches with short fiber, and 2 switches with long fiber]

Page 18

Matisse IP-Based Networks – GASNet put/get

- Latency for 2 switches with short/long fiber is constant: short ~250 µs, long ~374 µs
- Throughput is comparable with regular GigE
- Latency is comparable with regular GigE (~255 µs for all sizes)

[Figures: testsmall latency (µs) vs. message size for put and get with 1 switch, and testsmall throughput (kb/s) vs. message size for put and get with 1 switch, 2 switches (short fiber), 2 switches (long fiber), and regular GigE]

Page 19

UPC function performance

A look at common shared-data operations
- Comparison between accesses to local data through regular and private pointers
- Block copies between shared and private memory: upc_memget, upc_memput
- Pointer conversion (shared local to private)
- Pointer addition (advancing a pointer to the next location)
- Loads and stores (to a single location, local and remote)

Block copies
- upc_memget and upc_memput translate directly into GASNet blocking get and put (even on local shared objects); see the previous graph for results
- Marvel with the HP UPC compiler shows no appreciable difference between local and remote puts and gets and regular C operations
  - Steady increase from 0.27 to 1.83 µsec for sizes 2 to 8K bytes
  - Difference of < 0.5 µsec for remote operations
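A small illustrative UPC fragment (not from the report) exercising the same primitives; the array names, block size, and function name are arbitrary:

    #include <upc.h>

    #define N 512

    /* Blocked shared array: thread t has affinity to data[t*N .. t*N + N-1]. */
    shared [N] int data[N * THREADS];
    int priv[N];                        /* private (per-thread) buffer */

    void shared_data_ops(void)
    {
        /* Block copies between shared and private memory; in Berkeley UPC
         * these translate into GASNet blocking get/put. */
        upc_memget(priv, &data[MYTHREAD * N], N * sizeof(int));  /* shared -> private */
        upc_memput(&data[MYTHREAD * N], priv, N * sizeof(int));  /* private -> shared */

        /* Pointer conversion: a shared element with affinity to this thread
         * can be accessed through an ordinary private pointer. */
        int *plocal = (int *)&data[MYTHREAD * N];
        plocal[0] = MYTHREAD;                                    /* private store */

        /* Pointer addition on a pointer-to-shared advances to the next element. */
        shared int *ps = &data[MYTHREAD * N];
        ps = ps + 1;
        *ps = 0;                         /* still local: same block as above */

        /* Single-element load and store through shared references; these are
         * remote when the element lives on another thread. */
        upc_barrier;
        int next = ((MYTHREAD + 1) % THREADS) * N;
        int v = data[next];                                      /* shared load  */
        data[next] = v + 1;                                      /* shared store */
    }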

Page 20

UPC function performance – Pointer operations

- Cast from local shared to private: all Berkeley UPC conduits ~2 ns; Marvel needed ~90 ns
- Pointer addition results below

[Figure: pointer-addition execution time (µsec) for private and shared pointers on MPI-SCI, MPI GigE, Elan, GM, VAPI, and Marvel]

Shared-pointer manipulation is about an order of magnitude more costly than private.

Page 21

UPC function performance – Loads and stores with pointers (not bulk)

- Data local to the calling node
- "Pvt Shared" denotes private pointers to the local shared space

[Figure: execution time (µsec) of private, shared, and private-to-shared stores and loads on MPI SCI, MPI GigE, Elan, GM, VAPI, and Marvel]

MPI on GigE shared stores take two orders of magnitude longer and are therefore not shown. Marvel shared loads and stores are roughly twenty times (twice an order of magnitude) greater than private.

Page 22

UPC function performance – Loads and stores with pointers (not bulk)

- Data remote to the calling node
- Note: MPI GigE showed a time of ~450 µsec for loads and ~500 µsec for stores (not shown)

[Figure: execution time (µsec) of remote stores and loads on MPI-SCI, Elan, GM, VAPI, and Marvel]

Marvel remote access through pointers is the same as with local shared, two orders of magnitude less than Elan.

Page 23

UPC Benchmark - IS from NAS Benchmarks

[Figure: IS execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel]

IS (Integer Sort, Class A): lots of fine-grain communication, low computation. Poor performance in the GASNet communication system does not necessarily indicate poor performance in a UPC application.

Page 24

UPC Benchmarks – FT from NAS Benchmarks*

FT (3-D Fast Fourier Transform, Class A): medium communication, high computation
- Used optimized version 01 (private pointers to local shared memory, illustrated below)
- SCI conduit unable to run due to a driver limitation (size constraint)
- High-bandwidth networks perform best (VAPI followed by Elan)
- VAPI conduit allows a cluster of Xeons to keep pace with Marvel's performance
- MPI on GigE not well suited for these types of problems (high-latency, low-bandwidth traits limit performance)
- MPI on SCI has lower bandwidth than VAPI but still maintains near-linear speedup for more than 2 nodes (skirts TCP/IP overhead)
- GM performance is a factor of processor speed (see 1 Thread)
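The "private pointers to local shared memory" optimization can be sketched in a few lines of UPC (hypothetical array and function names, not the GWU FT code): the portion of a shared array with affinity to the calling thread is accessed through an ordinary C pointer, so the inner loop avoids shared-pointer arithmetic and runtime calls.

    #include <upc.h>

    #define BLK 1024

    /* Blocked shared array: thread t has affinity to x[t*BLK .. t*BLK + BLK-1]. */
    shared [BLK] double x[BLK * THREADS];

    void scale_local_block(double alpha)
    {
        /* Privatize this thread's block once, outside the inner loop. */
        double *xl = (double *)&x[MYTHREAD * BLK];

        for (int i = 0; i < BLK; i++)
            xl[i] *= alpha;
    }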

[Figure: FT execution time (sec) for 1, 2, 4, and 8 threads on GM, Elan, GigE MPI, VAPI, SCI MPI, and Marvel]

* Using code developed at GWU

The high latency of MPI on GigE impedes performance.

Page 25

UPC Benchmark - DES Differential Attack Simulator

- S-DES (8-bit key) cipher (integer-based)
- Creates the basic components used in differential cryptanalysis: S-boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
- Bandwidth-intensive application
- Designed for a high cache-miss rate, so very costly in terms of memory access
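For reference, a differential distribution table (DDT) counts, for each input difference dx, how often each output difference dy occurs over all S-box inputs. A minimal C sketch follows; the 4-bit S-box values are an arbitrary permutation used purely for illustration, not S-DES's actual S-boxes.

    #include <stdio.h>

    #define SBOX_BITS 4
    #define SBOX_SIZE (1 << SBOX_BITS)

    /* Arbitrary 4-bit S-box (a permutation of 0..15) for illustration. */
    static const unsigned char sbox[SBOX_SIZE] = {
        0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7
    };

    int main(void)
    {
        /* ddt[dx][dy] counts inputs x for which S[x] ^ S[x ^ dx] == dy. */
        unsigned ddt[SBOX_SIZE][SBOX_SIZE] = {{0}};

        for (unsigned dx = 0; dx < SBOX_SIZE; dx++)
            for (unsigned x = 0; x < SBOX_SIZE; x++)
                ddt[dx][sbox[x] ^ sbox[x ^ dx]]++;

        /* Large entries off the dx == 0 row are the differentials an
         * attacker exploits. */
        for (unsigned dx = 0; dx < SBOX_SIZE; dx++) {
            for (unsigned dy = 0; dy < SBOX_SIZE; dy++)
                printf("%2u ", ddt[dx][dy]);
            printf("\n");
        }
        return 0;
    }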

[Figure: DES simulator execution time (msec) for sequential, 1-, 2-, and 4-thread runs on GM, Elan, GigE MPI, VAPI (Xeon), SCI MPI, SCI, and Marvel]

Page 26

UPC Benchmark - DES Analysis

- With an increasing number of nodes, bandwidth and NIC response time become more important
- Interconnects with high bandwidth and fast response times perform best
- Marvel shows near-perfect linear speedup, but processing time of integers is an issue
- VAPI shows constant speedup
- Elan shows near-linear speedup from 1 to 2 nodes, but more nodes are needed in the testbed for better analysis
- GM does not begin to show any speedup until 4 nodes, and then only minimal
- SCI conduit performs well for high-bandwidth programs but with the same speedup problem as GM
- MPI conduit clearly inadequate for high-bandwidth programs

Page 27

UPC Benchmark - Differential Cryptanalysis for CAMEL Cipher

- Uses 1024-bit S-boxes
- Given a key, encrypts data, then tries to guess the key based solely on the encrypted data using a differential attack
- Has three main phases:
  1. Compute the optimal difference pair based on the S-box (not very CPU-intensive)
  2. Perform the main differential attack (extremely CPU-intensive): gets a list of candidate keys and checks all candidate keys using brute force in combination with the optimal difference pair computed earlier
  3. Analyze data from the differential attack (not very CPU-intensive)
- Computationally intensive (independent processes) + several synchronization points
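The three-phase structure above (independent per-pair work separated by a few global synchronization points) maps naturally onto UPC. The skeleton below is a hypothetical illustration of that structure, not the actual CAMEL code; analyze_pair, the candidates array, and the use of NUMPAIRS are placeholders.

    #include <upc.h>

    #define NUMPAIRS 400000          /* matches the run parameters below */

    /* One slot per thread for locally collected candidate-key counts. */
    shared unsigned long candidates[THREADS];

    /* Stand-in for the real per-pair work: encrypt a chosen-plaintext pair
     * and report whether it suggests a candidate key. */
    static unsigned long analyze_pair(long pair_id) { return pair_id & 1; }

    int main(void)
    {
        unsigned long local_hits = 0;

        /* Phase 1: compute the optimal difference pair (omitted here). */
        upc_barrier;                 /* synchronization point */

        /* Phase 2: CPU-intensive differential attack; pairs are independent,
         * so they are simply divided among the threads. */
        for (long p = MYTHREAD; p < NUMPAIRS; p += THREADS)
            local_hits += analyze_pair(p);
        candidates[MYTHREAD] = local_hits;

        upc_barrier;                 /* synchronization point */

        /* Phase 3: thread 0 analyzes the combined results (a brute-force
         * check of candidate keys would follow). */
        if (MYTHREAD == 0) {
            unsigned long total = 0;
            for (int t = 0; t < THREADS; t++)
                total += candidates[t];
            (void)total;
        }
        return 0;
    }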

[Figure: CAMEL execution time (s) for 1, 2, 4, 8, and 16 threads on SCI (Xeon), VAPI (Opteron), and Marvel]

Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000, initial key: 12345

Page 28

UPC Benchmark - CAMEL Analysis

Marvel
- Attained almost perfect speedup
- Synchronization cost very low

Berkeley UPC
- Speedup decreases with an increasing number of threads
- Cost of synchronization increases with the number of threads
- Run time varied greatly as the number of threads increased; hard to get consistent timing readings
- Still decent performance for 32 threads (76.25% efficiency, VAPI)
- Performance is more sensitive to data affinity

Page 29

Architectural Performance Tests

Intel Pentium 4 Xeon features
- 32-bit processor
- Hyper-Threading technology for increased CPU utilization
- Intel NetBurst microarchitecture with a RISC processor core
- 4.3 GB/s I/O bandwidth

AMD Opteron features
- 32-bit/64-bit processor with support for 32-bit OSes
- On-chip memory controllers
- Eliminates the 4 GB memory barrier imposed by 32-bit systems
- 19.2 GB/s I/O bandwidth per processor

Intel Itanium II features
- 64-bit processor based on the EPIC architecture
- 3-level cache design
- Enhanced Machine Check Architecture (MCA) with extensive Error Correcting Code (ECC)
- 6.4 GB/s I/O bandwidth

Theme: preliminary study of tradeoffs in available processor architectures, since their performance will clearly affect computation, communication, and synchronization in UPC clusters.

Page 30

CPU Performance Results

[Figure: AIM 9 throughput (MB/s) for random reads, random writes, sequential reads, sequential writes, and disk copies on Itanium2, Opteron, and Xeon]

- AIM 9: 10 iterations using 5 MB files, testing sequential and random reads, writes, and copies
- Itanium2 is slightly above Opteron in both reads and writes, except for random writes where Opteron has a slight advantage
- Both Itanium2 and Opteron outperform Xeon by a wide margin in all cases except sequential reads
- Xeon sequential reads are comparable to Opteron, but Itanium2 is much higher than both
- Major performance gain for sequential reads compared to random, but sequential writes do not receive nearly as large a boost
- Computation benchmarks excluded due to compiler problems with Itanium2

Page 31

10 Gigabit Ethernet – Preliminary results

Testbed
- Nodes: each with dual 2.4 GHz Xeons, S2io Xframe 10GigE card in PCI-X 100, 1 GB PC2100 DDR RAM, Intel PRO/1000 1GigE, RedHat 9.0 kernel 2.4.20-8smp, LAM-MPI V7.0.3

[Figures: round-trip latency (µsec) vs. message size (0 bytes to 4 KB) and throughput (MB/s) vs. message size (64 bytes to 64 KB) for 10GigE and GigE]

- 10GigE is promising due to the expected economy-of-scale advantages of Ethernet
- S2io 10GigE shows impressive throughput, though slightly less than half of the theoretical maximum; further tuning is needed to go higher
- Results show a much-needed decrease in latency versus other Ethernet options

Page 32

Conclusions

Key insights
- HCS SCI conduit shows promise
  - Performance on par with other conduits
  - Ongoing collaboration with the vendor (Dolphin) to resolve the memory-constraint issue
- Berkeley UPC system is a promising COTS cluster tool
  - Performance on par with HP UPC (also see [6])
  - Performance of COTS clusters matches and sometimes beats that of high-end CC-NUMA
  - Various conduits allow UPC to execute on many interconnects; VAPI and Elan are initially found to be strongest
  - Some open issues with bugs and optimization; active bug reports and the development team help drive improvements
  - Very good solution for executing UPC on clusters, but may not quite be ready for production use: no debugging or performance tools available
- Xeon cluster suitable for applications with a high read/write ratio
- Opteron cluster suitable for generic applications due to comparable read/write capability
- Itanium2 excellent for sequential reads, about the same as Opteron for everything else
- 10GigE provides high bandwidth with much lower latencies than 1GigE

Key accomplishments to date
- Baselining of UPC on shared-memory multiprocessors
- Evaluation of promising tools for UPC on clusters
- Leveraging and extension of communication and UPC layers
- Conceptual design of new tools for UPC
- Preliminary network and system performance analyses for UPC systems
- Completion of optimized GASNet SCI conduit for UPC

Page 33

References

1. D. Bonachea, "GASNet Specification, v1.1," U.C. Berkeley Tech Report UCB/CSD-02-1207, October 2002.

2. C. Bell and D. Bonachea, "A New DMA Registration Strategy for Pinning-Based High Performance Networks," Workshop on Communication Architecture for Clusters (CAC'03), April 2003.

3. W. Carlson, J. Draper, D. Culler, K. Yelick, E. Brooks, and K. Warren, "Introduction to UPC and Language Specification," IDA Center for Computing Sciences, Tech. Report CCS-TR-99-157, May 1999.

4. K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, "Titanium: A High-Performance Java Dialect," Concurrency: Practice and Experience, Vol. 10, No. 11-13, September-November 1998.

5. B. Gordon, S. Oral, G. Li, H. Su, and A. George, "Performance Analysis of HP AlphaServer ES80 vs. SAN-based Clusters," 22nd IEEE International Performance, Computing, and Communications Conference (IPCCC), April 2003.

6. W. Chen, D. Bonachea, J. Duell, P. Husbands, C. Iancu, and K. Yelick, "A Performance Analysis of the Berkeley UPC Compiler," 17th Annual International Conference on Supercomputing (ICS), June 2003.