
Copyright © 2007 ClearSpeed Technology Inc. All rights reserved. www.clearspeed.com

ENVISION. ACCELERATE. ARRIVE.

Overview

ClearSpeed Technical Training

December 2007

Presenters

Ronald Langhi, Technical Marketing Manager

Brian Sumner, Senior Engineer


ClearSpeed Technology: Company Background

• Founded in 2001

– Focused on alleviating the power, heat, and density challenges of HPC systems

– 103 patents granted and pending (as of September 2007)

– Offices in San Jose, California and Bristol, UK

Agenda

• Accelerators
• ClearSpeed and HPC
• Hardware overview
• Installing hardware and software
• Thinking about performance
• Software Development Kit
• Application examples
• Help and support


What is an accelerator?

• A device to improve performance
  – Relieve main CPU of workload
  – Or to augment CPU’s capability

• An accelerator card can increase performance
  – On specific tasks
  – Without aggravating facility limits on clusters (power, size, cooling)

All accelerators are good… for their intended purpose

FPGAs
• Good for integer, bit-level ops
• Programming looks like circuit design
• Low power per chip, but 20x more power than custom VLSI
• Not for 64-bit FLOPS

Cell and GPUs
• Good for video gaming tasks
• 32-bit FLOPS, not IEEE
• Unconventional programming model
• Small local memory
• High power consumption (> 200 W)

ClearSpeed
• Good for HPC applications
• IEEE 64-bit and 32-bit FLOPS
• Custom VLSI, true coprocessor
• At least 1 GB local memory
• Very low power consumption (25 W)
• Familiar programming model

The case for accelerators

• Accelerators designed for HPC applications can improve performance as well as performance per (watt, cabinet, dollar)

• Accelerators enable:
  – Larger problems for given compute time, or
  – Higher accuracy for given compute time, or
  – Same problem in shorter time

• Host to card latency and bandwidth are not major barriers to successful use of properly-designed accelerators.


What can be accelerated?


Good application targets for acceleration

• Application needs to be both computationally intensive and contain a high degree of data parallelism.

• Computationally intensive:
  – Software depends on executing large numbers of arithmetic calculations
  – Usually 64-bit FLoating point Operations per Second (FLOPS)
  – Should also have a high ratio of FLOPS to data movement (bandwidth)
  – Computationally intensive applications may run for many hours or more even on large clusters.

• Data parallelism:
  – Software performs the same sequence of operations again and again but on a different item of data each time

• Example computationally intensive, data parallel problems include:
  – Large matrix arithmetic (linear algebra)
  – Molecular simulations
  – Monte Carlo options pricing in financial applications
  – And many, many more…
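To make "the same sequence of operations on a different item of data each time" concrete, here is a minimal host-side Python sketch of Monte Carlo pricing of a European call option. It is an illustration only, not ClearSpeed code: the function names, the option parameters, and the choice of 96 independent batches (echoing the 96 PEs of a CSX600) are all assumptions made for the example.

```python
import math
import random


def price_chunk(seed, n_paths, s0, strike, rate, sigma, years):
    """Sum of discounted call payoffs for one independent batch of paths."""
    rng = random.Random(seed)  # each batch owns its own random stream
    drift = (rate - 0.5 * sigma * sigma) * years
    vol = sigma * math.sqrt(years)
    total = 0.0
    for _ in range(n_paths):
        # Same arithmetic for every path; only the random draw differs.
        s_t = s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        total += max(s_t - strike, 0.0)
    return math.exp(-rate * years) * total


def monte_carlo_call(n_chunks=96, paths_per_chunk=500, **params):
    """Average payoff over all batches.

    The batches do not depend on each other, so each one could be handed to
    a separate processing element; here they simply run one after another.
    """
    sums = [price_chunk(seed, paths_per_chunk, **params)
            for seed in range(n_chunks)]
    return sum(sums) / (n_chunks * paths_per_chunk)


# Hypothetical option parameters, chosen only for illustration.
PARAMS = dict(s0=100.0, strike=100.0, rate=0.05, sigma=0.2, years=1.0)
estimate = monte_carlo_call(**PARAMS)
```

Each batch needs only a handful of scalars as input and returns a single number, which is the "high ratio of FLOPS to data movement" that the slide asks for.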

Example data parallel problems that can be accelerated

Structural Analysis

Electromagnetic Modeling

Radar Cross-Section

Ab initio Computational Chemistry

Global Illumination Graphics

HPC Requirements

• Accelerator boards increase compute performance on highly specific tasks, without aggravating facility limits on clusters (power, size)

• Need to consider
  – Type of application
  – Software
  – Data type and precision
  – Compatibility with host (logical and physical)
  – Memory size (local to accelerator)
  – Latency and bandwidth to host

An HPC-specific accelerator

• CSX600 coprocessor for math acceleration
  – Assists serial CPU running compute-intensive math libraries
  – Available on add-in boards, e.g. PCI-X, PCIe
  – Potentially integrated on the motherboard
  – Can also be used for embedded applications

• Significantly accelerates certain libraries and applications
  – Target libraries: Level 3 BLAS, LAPACK, ACML, Intel® MKL
  – Mathematical modeling tools: Mathematica®, MATLAB®, etc.
  – In-house code: Using the SDK to port compute-intensive kernels

• ClearSpeed Advance™ board
  – Dual CSX600 coprocessors
  – Sustains 67 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – PCI-X, PCI Express x8
  – Low power; typically 25-35 Watts

Plug-and-play Acceleration

• ClearSpeed host-side library CSXL
  – Provides some of the most commonly used and important Level 3 BLAS and LAPACK functions
  – Exploits standard shared/dynamic library mechanisms to intercept calls to L3 BLAS and LAPACK
  – Executes calls heterogeneously across both the multi-core host and the ClearSpeed accelerators simultaneously for maximum performance
  – Compatible with ACML from AMD and MKL from Intel

• User & application do not need to be aware of ClearSpeed
  – Except that the application suddenly runs faster

Programming considerations

• Is my main data type integer or floating-point?
• Is the data parallel in nature?
• What precision do I need?
• How much data needs to be local to the accelerated task?
• Does existing accelerator software meet my needs, or do I have to write my own?
• If I have to write my own code, will the existing tools (for example: compiler, debugger, and simulator) meet my needs?


Hardware Overview

CSX600: A chip designed for HPC

• Array of 96 Processor Elements; 64-bit and 32-bit floating point
• Single-Instruction, Multiple-Data (SIMD)
• 210 MHz: key to low power
• 47% logic, 53% memory
  – About 50% of the logic is FPU
  – Hence around one quarter of the chip is floating point hardware
• Embedded SRAM
• Interface to DDR2 DRAM
• Inter-processor I/O ports
• ~ 1 TB/sec internal bandwidth
• 128 million transistors
• Approximately 10 Watts

[Figure: ClearSpeed CSX600]

CSX600

• Multi-Threaded Array Processing
  – Programmed in familiar languages
  – Hardware multi-threading
  – Asynchronous, overlapped I/O
  – Run-time extensible instruction set

• Array of 96 Processor Elements (PEs)
  – Each has multiple execution units
  – Including double precision floating point and integer units

[Figure: CSX600 processor core block diagram showing the Mono Controller, Poly Controller, Instruction Cache, Data Cache, Control and Debug, PE 0 … PE 95, Programmable I/O to DRAM, Peripheral Network, and System Network]

CSX600 processing element (PE)

[Figure: block diagram of PE n showing the 32/64-bit IEEE 754 FP Mul and FP Add units, Div/Sqrt, MAC, ALU, a 128-byte register file, 6 KBytes of PE SRAM, PIO collection and distribution (programmed I/O), and links to PE n–1 and PE n+1]

• Multiple execution units
  – 4-stage floating point adder
  – 4-stage floating point multiplier
  – Divide/square root unit
  – Fixed-point MAC 16x16 → 32+64
  – Integer ALU with shifter
  – Load/store

• 5-port register file (3 reads, 2 writes)
• Closely coupled 6 KB SRAM for data
• High bandwidth per PE DMA (PIO)
• Per PE address generators (serves as hardware gather-scatter)
• Fast inter-PE communication path

Advance accelerator memory hierarchy

[Figure: memory tiers of the Advance board, from host DRAM down to the arithmetic units of PE 0 … PE 95]

Tier 3  Host DRAM: 1-32 GBytes typical
        ~1 GB/s
Tier 2  CSX DRAM: Bank 0 and Bank 1, 0.5 GBytes each (aggregate: 1.0 GBytes)
        5.4 GB/s aggregate (~0.03 GB/s per PE)
Tier 1  Poly memory: 6 KBytes per PE (aggregate: 192 PEs * 6 KB = 1.1 MB)
        161 GB/s
Tier 0  Register memory: 128 Bytes per PE (aggregate: 192 PEs * 128 Byte = 24 KB)
        725 GB/s
        Arithmetic: 0.42 GFLOPS per PE

Swazzle: 322 GB/s

Total: 80 GFLOPS, 1.1 TB/s… but only 25 Watts
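Several of the aggregate figures on this slide can be reproduced from the per-PE numbers quoted elsewhere in the deck. The short Python sketch below does that arithmetic. Two points are assumptions rather than statements from the slides: that the 0.42 GFLOPS per PE comes from counting one multiply-accumulate per cycle as two floating point operations, and that "KB" means 1,024 bytes.

```python
PES_PER_CHIP = 96        # processor elements per CSX600
CHIPS_PER_BOARD = 2      # the Advance board carries two CSX600 chips
CLOCK_HZ = 210e6         # 210 MHz
FLOPS_PER_CYCLE = 2      # assumption: one multiply-accumulate = 2 FLOPs
SWAZZLE_BYTES_PER_CYCLE = 8   # per PE, between register files

pes = PES_PER_CHIP * CHIPS_PER_BOARD                  # 192 PEs per board

gflops_per_pe = FLOPS_PER_CYCLE * CLOCK_HZ / 1e9      # about 0.42
board_gflops = gflops_per_pe * pes                    # about 80

poly_total_mb = pes * 6 / 1024                        # 6 KB per PE, about 1.1 MB
register_total_kb = pes * 128 / 1024                  # 128 bytes per PE, 24 KB

swazzle_gb_s = SWAZZLE_BYTES_PER_CYCLE * pes * CLOCK_HZ / 1e9   # about 322
dram_gb_s_per_pe = 5.4 / pes                          # about 0.03

for name, value in [
    ("GFLOPS per PE", gflops_per_pe),
    ("GFLOPS per board", board_gflops),
    ("Poly memory total (MB)", poly_total_mb),
    ("Register memory total (KB)", register_total_kb),
    ("Swazzle bandwidth (GB/s)", swazzle_gb_s),
    ("DRAM bandwidth per PE (GB/s)", dram_gb_s_per_pe),
]:
    print(f"{name}: {value:.3g}")
```

The 725 GB/s register bandwidth and the 1.1 TB/s total are left out because the slides do not give the per-PE widths needed to derive them.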

Acceleration by plug-in card

• Dual ClearSpeed CSX600 coprocessors
• R∞ > 66 GFLOPS for 64-bit matrix multiply (DGEMM) calls
  – Hardware also supports 32-bit floating point and integer calculations
• 133 MHz PCI-X two-thirds length (8″) form factor
• PCIe x8 half-length form factor
• 1 GB of memory on the board
• Drivers today for Linux (Red Hat and SLES) and Windows (XP, Server 2003)
• Low power: 25 watts typical
• Multiple boards can be used together for greater performance

[Figure: Advance X620 (PCI-X), 203 mm length, full-height; Advance e620 PCIe (x8), half length, full-height]

Both boards can sustain over 66 GFLOPS on 64-bit HPC kernels

Host to board DMA performance

• The board includes a host DMA controller which can act as a bus master.

• All DMA transfers are at least 8-byte aligned.

• The host DMA engine will attempt to use the full bandwidth of the bus.

• Note: measured bandwidth is highly system-dependent
  – Variations of up to 50% have been observed
  – Depends on system chipset, operating system, bus contention…

Slot type         Peak bandwidth   Expected DMA speed
PCI Express x8    2,000 MB/s       Up to 1,300 MB/s
PCI-X 133 MHz     1,066 MB/s       Up to 750 MB/s
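As a rough planning aid, the table can be turned into a best-case transfer-time estimate. The Python sketch below is an assumption-laden illustration: it takes "MB" to mean one million bytes, uses the "up to" DMA speeds as if they were always achieved, and ignores latency. The names `SLOTS`, `bus_efficiency`, and `best_case_transfer_seconds` are hypothetical helpers, not part of any ClearSpeed tool.

```python
MB = 1_000_000  # assumption: decimal megabytes; the slide does not say

# Figures copied from the table above.
SLOTS = {
    "PCI Express x8": {"peak_mb_s": 2000, "expected_mb_s": 1300},
    "PCI-X 133 MHz": {"peak_mb_s": 1066, "expected_mb_s": 750},
}


def bus_efficiency(slot):
    """Expected DMA speed as a fraction of the peak bus bandwidth."""
    entry = SLOTS[slot]
    return entry["expected_mb_s"] / entry["peak_mb_s"]


def best_case_transfer_seconds(n_bytes, slot):
    """Time to move n_bytes at the 'up to' DMA speed, ignoring latency."""
    return n_bytes / (SLOTS[slot]["expected_mb_s"] * MB)


for slot in SLOTS:
    seconds = best_case_transfer_seconds(1000 * MB, slot)
    print(f"{slot}: {bus_efficiency(slot):.0%} of peak, "
          f"{seconds:.2f} s to move 1,000 MB")
```

Given the slide's warning that measured bandwidth can vary by up to 50% between systems, numbers like these are better treated as an optimistic bound than as a prediction.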


Installing Hardware and Software

Configuration support

• Advance supports the following host operating systems:
  – SuSE Linux Enterprise Server 9
  – Red Hat Enterprise Linux 4
  – Windows XP SP2
  – Windows Server 2003 preview

[Table: Operating System against IA32 (x86) and AMD64/EM64T (x86-64)]

Supported compilers
• For Linux: gcc, icc, fort, pgf
• For Windows XP, 2003: Visual C++ 2005

Supported host BLAS libraries
• AMD ACML
• Intel MKL
• Goto
• ATLAS

For the latest support information go to http://support.clearspeed.com

Base software

• All ClearSpeed software on Linux is installed using the rpm command.

• The software consists of three parts:
  – Runtime and driver software
  – Diagnostics
  – ClearSpeed standard libraries, CSXL & CSFFT

• You can download the latest versions from the ClearSpeed support website: https://support.clearspeed.com/downloads

Installing base software on Linux

1. Log in to the Linux machine as root and change to the directory containing the drivers package.

2. Install the runtime software, using the command:

    rpm -i csx600_m512_le-runtime-<version>.<arch>.rpm

3. Install the kernel module. For Linux 2.6, simply install the open source CSX driver using:

    /opt/clearspeed/csx600_m512_le/drivers/csx/install-csx

4. Install the board diagnostics:

    rpm -i csx600_m512_le-board_diagnostics-<version>.<arch>.rpm

5. Install the CSXL library package:

    rpm -i csx600_m512_le-csxl_<version>.<arch>.rpm

Note: For Windows, a Jungo driver will need to be installed and configured; see the installation manual for more details.

Confirming successful installation

ClearSpeed distributes diagnostic tests to check that the board and drivers are successfully installed:

– Some tests take several minutes to complete.
– Each test will write Pass or Fail to standard output.
– A log file test.log will be written in the current directory.

1. Open a shell window and go to an appropriate directory:

    cd /tmp

2. Set up ClearSpeed environment variables, by typing:

    source /opt/clearspeed/csx600_m512_le/bin/bashrc

3. Run the diagnostic program, by typing the command:

    /opt/clearspeed/csx600_m512_le/bin/run_tests.pl

csreset

– The csreset command reinitializes an Advance board and its processors.

– It must be run after start-up or reboot of the system or simulator.

– It is also a good idea to run csreset at the start of a batch job that calls the Advance board.

– The csreset command can take argument flags to provide a finer level of control. These include:

    -A   Specifies that all boards should be reset.
    -v   Verbose output. This shows the details about each board.
    -h   Help. This shows the full list of options.

If you have problems with software installation

– Make sure you are logged in as super-user.
  • As root for Linux.
  • As administrator for Windows.

– If the configure or make install steps fail, check that you have the appropriate header files.
  • Check the preconfigured header files and, if necessary, obtain the appropriate configured header file.

– If the system cannot access the board but the driver is installed, make sure the board is seated well.
  • Try removing the board and reinstalling.


Targeting ClearSpeed Advance: Exploiting Data Parallelism

Alternative approaches

Three main approaches to acceleration:

1. Use an application which is already ported

2. Plug and play

3. Custom port using the SDK

Using an application which is already ported

• Acceleration: simply insert ClearSpeed
• Latest list of ported applications:
  – http://www.clearspeed.com/products/applicationsupport/
• Includes:
  – Amber
  – Mathematica
  – MATLAB
  – Star-P

Plug and play libraries: CSXL

• Underlying shared libraries are augmented with ClearSpeed CSXL accelerated functions

• Includes key functions from:
  – LAPACK
  – Level 3 BLAS

• As an example, BLAS is used by:
  – AMD ACML
  – Intel MKL
  – Full list on: http://www.clearspeed.com/products/compatibility/

• Application is transparently accelerated
  – No modifications to application

Acceleration using CSXL and standard libraries

[Figure: the Application calls into the CSXL Intercept Layer, which automatically selects the optimum path between the Host Library (LAPACK, BLAS etc.) and the CSXL Library (LAPACK, BLAS etc.)]

Considerations for custom port of application

• Is the task large enough to consider acceleration?
  – Takes time to ship data to the accelerator

• Accelerator can work in parallel with host
  – Overlap computation

• Performance considerations
  – Look for areas of data parallelism
  – Overlap compute with data I/O
  – Make full use of ClearSpeed I/O paths

• Analysis starts with model based on memory tiers and can be verified using performance profiling tools

Is this trip necessary? Considering I/O

• Time to move N data to/from another node or an accelerator is approximately latency + N/B seconds, where B is the bandwidth of the link.

• Because local memory bandwidth is usually higher than B, acceleration might be lost in the communication time.

• Estimate the break-even point for the task (note: offloading is different from accelerating, where host continues working).

[Figure: a node with its node memory linked to an accelerator with its accelerator memory over a connection of bandwidth B; a plot of speed against time (larger problem size) for the node and the accelerator, with the break-even point marked]
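The break-even estimate can be sketched in a few lines of Python. This models the simple offload case only, where the host waits while the accelerator works, and it uses a square DGEMM as the task. The operation and byte counts for DGEMM are standard, but every rate, bandwidth, and latency value in the usage lines is a hypothetical placeholder to be replaced with measurements from the target system.

```python
def host_seconds(flops, host_rate):
    """Time to do the work on the host alone."""
    return flops / host_rate


def offload_seconds(flops, n_bytes, acc_rate, bandwidth, latency):
    """latency + N/B for the data movement, plus compute on the accelerator."""
    return latency + n_bytes / bandwidth + flops / acc_rate


def dgemm_cost(n):
    """FLOPs and bytes moved for C = A*B + C with n x n double matrices."""
    flops = 2.0 * n ** 3
    n_bytes = 4 * 8.0 * n ** 2   # send A, B and C; receive C
    return flops, n_bytes


def break_even_n(host_rate, acc_rate, bandwidth, latency, max_n=100_000):
    """Smallest n for which offloading beats the host, or None if never."""
    for n in range(1, max_n + 1):
        flops, n_bytes = dgemm_cost(n)
        if (offload_seconds(flops, n_bytes, acc_rate, bandwidth, latency)
                < host_seconds(flops, host_rate)):
            return n
    return None


# Hypothetical numbers: 10 GFLOPS host, 66 GFLOPS board, 1 GB/s link, 10 us latency.
n_star = break_even_n(host_rate=10e9, acc_rate=66e9, bandwidth=1e9, latency=10e-6)
print("break-even matrix size:", n_star)
```

Because the work grows as n³ while the data grows as n², a faster accelerator always wins once the problem is large enough; an accelerator that is not faster than the host never breaks even in this offload model, which is why the slide separates offloading from accelerating.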

Memory bandwidth dictates performance

• Applications that can stage into local RAM can go 10x faster than current high-end Intel and AMD hosts
  – Applications residing in Accelerator DRAM do not make use of massive memory bandwidth
• GPUs face very similar issue

[Figure: multicore x86 to node memory at 17 GB/s; host to accelerator over PCI-X or PCIe at 1 to 2 GB/s; accelerator to accelerator DRAM at 5.4 GB/s; accelerator to accelerator local RAM at 192 GB/s]

Latency and bandwidth: Simple offload model

• Accelerator must be quite fast for this approach to have benefit

• This “mental picture” may stem from early days of Intel 80x87, Motorola 6888x math coprocessors

[Figure: timeline showing Host, then latency and bandwidth, then Accelerator, then latency and bandwidth, then Host]

Latency and bandwidth: Acceleration model

• Host continues working
  – Accelerator needs only be fast enough to make up for time lost to bandwidth + latency
• Easiest use model
  – Host and accelerator share the same task, like DGEMM
• More flexible
  – Host, accelerator each specialize what they do

[Figure: timeline in which the Host row runs throughout while the Accelerator row runs between the latency and bandwidth segments]

Accelerator need not wait for all data before starting

• Host can work while data is moved
  – PCI transfers might burden a single x86 core by up to 40%
  – Other cores on host continue productive work at full speed

• Accelerator can work while data is moved
  – Can be slower than the host, and still add performance!

• In practice, latency is microseconds; accelerator task takes seconds
  – Latency gaps above would be microscopic if drawn to scale

[Figure: timeline in which the Host and Accelerator rows both run while the latency and bandwidth segments are in progress]
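The claim that an accelerator "can be slower than the host, and still add performance" follows from sharing the work so that both finish together. The Python sketch below is a simplified model that ignores transfer time altogether (the slide argues latency is negligible at this scale); the function names and the example rates are hypothetical.

```python
def accelerator_share(host_rate, acc_rate):
    """Fraction of the work to give the accelerator so both finish together."""
    return acc_rate / (host_rate + acc_rate)


def shared_seconds(work, host_rate, acc_rate):
    """Elapsed time when host and accelerator work on the same task at once."""
    share = accelerator_share(host_rate, acc_rate)
    host_time = (1.0 - share) * work / host_rate
    acc_time = share * work / acc_rate
    return max(host_time, acc_time)


def speedup(host_rate, acc_rate):
    """Speed-up over the host working alone."""
    work = 1.0
    return (work / host_rate) / shared_seconds(work, host_rate, acc_rate)


# Hypothetical rates in GFLOPS: an accelerator half as fast as the host.
print(speedup(host_rate=10.0, acc_rate=5.0))
```

With these rates the accelerator takes one third of the work and the combined system finishes in two thirds of the host-only time, a result the simple offload model cannot produce.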

Performance considerations

• Look for data parallelism
  – Fine-grained: vector operations
  – Medium-grained: unrolled independent loops
  – Coarse-grained: multiple simultaneous data channels/sets

• Performance analysis for accelerator cards
  – Like analysis for message-passing parallelism but with more levels of memory and communication

• Application porting success depends heavily on attention to memory bandwidths
  – (Surprisingly) not so much on the bandwidth between host and accelerator card

PCI Bus

• ClearSpeed boards utilize either PCI-X or PCIe busses
  – PCI-X 133 MHz: 1 GB/s Peak
  – PCIe x8: 1.6 GB/s Peak

• Available memory on board
  – 1 GB of 200 MHz DDR2 SDRAM shared by 2 CSX600 processors

• Must consider both the transfer rate AND the available memory
  – If application requires more memory, then more communication to the board is necessary

• Lower bound, assuming an infinitely fast board
  – Time = Total data size transferred / Bus speed

PCI Bus

• Driver performance is very machine-specific and depends on transfer size, direction, etc.

[Figure: Transfer Size vs. Transfer Rate]

  – See Runtime User’s Guide for current driver performance

On-board Memory

• 2 level memory hierarchy
  – 1 GB “mono” shared memory
  – 6 kB “poly” memory per processing element (PE)
    • 6 kB/PE * 96 PE = 576 kB per CSX600

• Peak bandwidth between levels
  – 2.7 GB/s x 2 chips = 5.4 GB/s

• Must consider both the transfer rate AND the available memory
  – If application requires more memory, then more communication to the board is necessary

• Lower bound, assuming infinitely fast PEs
  – Time = Total data size transferred / Bus speed

• Secondary considerations
  – Burst size: 64 Bytes/PE (i.e., 8 doubles)
  – Transfers can be smaller, but at reduced efficiency
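With infinitely fast processors, the run time cannot be less than the data volume divided by the bandwidth of the link it crosses. The Python sketch below applies that bound and adds a simple view of burst efficiency. The efficiency function rests on an assumption the slides do not state: that a partial 64-byte burst costs as much as a full one. Both helper names are hypothetical.

```python
BURST_BYTES = 64  # per-PE burst size quoted on the slide (8 doubles)


def min_transfer_seconds(total_bytes, bytes_per_second):
    """Lower bound on run time if computation itself took no time at all."""
    return total_bytes / bytes_per_second


def burst_efficiency(transfer_bytes, burst_bytes=BURST_BYTES):
    """Useful fraction of the bursts needed for one per-PE transfer.

    Assumption: a partly filled burst occupies the bus like a full burst.
    """
    bursts = -(-transfer_bytes // burst_bytes)  # ceiling division
    return transfer_bytes / (bursts * burst_bytes)


# Streaming 1 GB (10**9 bytes) between mono and poly memory at the 5.4 GB/s peak.
print(min_transfer_seconds(1e9, 5.4e9))

# A single double per PE uses one eighth of a burst under the assumption above.
print(burst_efficiency(8), burst_efficiency(64), burst_efficiency(96))
```

If a kernel's measured run time is close to this bound, faster arithmetic will not help; the remedy is to move less data or to reuse what is already in poly memory.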

SIMD Computing

• What is SIMD?
  – Single Instruction, Multiple Data
    • Each PE sees the same instruction stream
    • Each PE issues “load”, “multiply”, etc., simultaneously
    • But acts on different data per PE
  – PARALLEL COMPUTATION

• ClearSpeed SIMD is enhanced by:
  – Local memory for each PE
    • data management is easier within “poly” memory
    • does not require adjacent access for all 96 elements involved in the computation from shared memory pool
  – PEs can be enabled/disabled
    • not required to use all PEs always
    • useful for handling “boundaries”
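The enable/disable idea can be modelled in plain Python. This is a conceptual sketch, not ClearSpeed SDK code: `simd_step` and the lists standing in for per-PE "poly" values are hypothetical, and a real array executes all lanes in the same cycle rather than in a loop.

```python
NUM_PES = 96


def simd_step(op, poly_a, poly_b, enabled):
    """Apply one instruction on every enabled PE.

    Each PE holds its own element of poly_a and poly_b. Disabled PEs see the
    same instruction but leave their value unchanged.
    """
    return [op(a, b) if on else a
            for a, b, on in zip(poly_a, poly_b, enabled)]


# One value per PE, as if held in each PE's local memory.
values = [float(pe) for pe in range(NUM_PES)]
scale = [2.0] * NUM_PES

# Boundary case: only 90 data items for 96 PEs, so the last six are disabled.
n_items = 90
enabled = [pe < n_items for pe in range(NUM_PES)]

result = simd_step(lambda a, b: a * b, values, scale, enabled)
print(result[89], result[90])
```

The mask is what makes ragged edges cheap: the instruction stream stays identical for every PE, and only the set of PEs that commit a result changes.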

SIMD Array

• 96 PEs per CSX600
  – 210 MHz
  – double precision multiply-accumulate per cycle
  – 4 cycle pipeline depth for multiply and accumulation

• For top performance, use operations on 4 element vectors on each PE

• Nearest neighbor communication
  – “swazzle” path topology is a line or ring
  – Bandwidth: 8 Bytes per cycle between register files
    • 8 Bytes * 96 PEs * 210 MHz = 161 GB/s
    • Useful for fine grained communication
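The advice to work on 4-element vectors follows from the 4-cycle pipeline depth: a single running sum must wait for its previous update to leave the pipeline, whereas four independent sums can take turns. The Python sketch below is a toy issue model written for this explanation; it assumes one operation can issue per cycle and is not a description of the actual CSX600 scheduler.

```python
def cycles_to_finish(chains, ops_per_chain, depth=4):
    """Cycles needed for `chains` independent accumulations.

    At most one operation issues per cycle. An operation may issue only once
    the previous operation on the same chain has left the `depth`-stage
    pipeline, because it needs that result as its input.
    """
    ready = [0] * chains   # earliest cycle at which each chain may issue again
    cycle = 0              # next free issue slot
    finish = 0
    for _ in range(ops_per_chain):
        for c in range(chains):
            issue = max(cycle, ready[c])
            ready[c] = issue + depth
            finish = max(finish, issue + depth)
            cycle = issue + 1
    return finish


one = cycles_to_finish(chains=1, ops_per_chain=8)    # 8 operations in total
four = cycles_to_finish(chains=4, ops_per_chain=8)   # 32 operations in total
print("1 accumulator :", 8 / one, "operations per cycle")
print("4 accumulators:", 32 / four, "operations per cycle")
```

In this model, four interleaved accumulations complete four times as many operations in almost the same number of cycles as a single dependent chain.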

Good Example Kernels

• Dense Linear Algebra
  – Matrix-Matrix products (DGEMM)
    • Low memory bandwidth required = high data re-use
    • Inner kernel: Matrix-multivector product
      – 96x96 matrix, x4 vectors
        » 96x96 matrix due to 96 PEs
        » 4 vectors due to multiply/accumulate pipeline depth

• Monte Carlo (computational finance)
  – “Embarrassingly parallel” task distribution
  – Very little data requirement

• Molecular Dynamics (Amber, BUDE)
  – Large numbers of identical tasks can be found
  – Requires small working data sets
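"High data re-use" can be quantified as arithmetic intensity: floating point operations per byte moved. The Python sketch below compares a square matrix-matrix product with a matrix-vector product. The matrix-vector case is not on the slide and is included only for contrast; both functions assume each operand crosses the memory interface exactly once, which is the ideal case.

```python
def dgemm_intensity(n, bytes_per_word=8):
    """FLOPs per byte for C = A * B with n x n matrices."""
    flops = 2 * n ** 3
    traffic = 3 * n * n * bytes_per_word       # read A and B, write C
    return flops / traffic


def dgemv_intensity(n, bytes_per_word=8):
    """FLOPs per byte for y = A * x with an n x n matrix (for contrast)."""
    flops = 2 * n ** 2
    traffic = (n * n + 2 * n) * bytes_per_word  # read A and x, write y
    return flops / traffic


for n in (96, 960, 9600):
    print(n, dgemm_intensity(n), round(dgemv_intensity(n), 3))
```

The matrix-matrix intensity grows in proportion to n, while the matrix-vector intensity stays below 0.25 FLOPs per byte at any size, which is why DGEMM tolerates a slow link and bandwidth-bound kernels do not.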

Possible Kernel

• Partial Differential Equations
  – Some are memory bandwidth limited, so not a good candidate for ClearSpeed acceleration
    • small stencil implies little computation per grid point
    • wide, sparse stencil implies large active data set

• But, some PDE simulations are good candidates
  – require a small grid, so can run entirely in PE memory (computational finance)
  – have large, dense stencils
    • large amounts of computation per grid point
    • sufficiently small active data set
  – implicit time stepping
    • large systems of equations solved via direct methods
    • direct solvers utilize dense linear algebra kernels (i.e., DGEMM)


Keys to Success

• Parallelism is essential
• Proper management of the “poly” memory is also critical
  – Application must accept memory bandwidth limits
    • PCIe or PCI-X
    • On-board memory hierarchy
  – SDK enables asynchronous data transfers
    • permits efficient “double buffering” to manage data streams, accommodating the size limit
  – Application must employ a small working data set
    • less than 576 kB, distributed across 96 PEs
    • also be aware of the 1 GB shared memory limit

• While developing ClearSpeed applications, use the ClearSpeed Visual Profiler to discover what is actually happening on the board!
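The working-set limit above can be checked before any code is ported. The sketch below is plain C with hypothetical helper names; it assumes 6 KB of SRAM per PE (as stated elsewhere in this deck) and ignores whatever the runtime itself reserves in PE memory.

```c
#include <stddef.h>

enum { PE_SRAM_BYTES = 6 * 1024, NUM_PES = 96 };

/* Returns 1 if `resident` bytes of per-PE state plus `nbuf` streaming
   buffers of `buf_bytes` each fit in one PE's SRAM, 0 otherwise.
   Use nbuf = 2 for double buffering. */
int fits_in_pe_sram(size_t resident, size_t buf_bytes, size_t nbuf)
{
    return resident + buf_bytes * nbuf <= (size_t)PE_SRAM_BYTES;
}

/* Total poly memory on one CSX600: 96 PEs * 6 KB = 576 KB. */
size_t total_poly_bytes(void)
{
    return (size_t)PE_SRAM_BYTES * NUM_PES;
}
```

For example, 64 bytes of resident state plus two buffers of 12 doubles fits easily, while 4 KB of resident state plus two 2 KB buffers does not.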


Remember the host processor

• Today’s multi-core hosts are very useful for managing “other tasks” that are not accelerated by ClearSpeed
• Many applications can overlap these tasks with ClearSpeed accelerated tasks
• Profile the host portion of your application as well using any of a variety of tools
  – Use ClearSpeed Visual Profiler for CSAPI utilization


General optimization techniques

• Latency hiding
  – Overlap compute with I/O
• Data reuse
  – On-chip swazzle path
• Maximize PE usage
  – Ensure all PEs are processing, not idle


Overlap data with compute

• Double-buffer
• Many levels of data I/O – compute parallelism
  – PE load/store overlaps PE compute
  – PE to board memory can also overlap
  – Board memory to host memory can also overlap
• Hence, if task is compute bound:
  – Data takes “no time” to transfer
• If task is I/O bound:
  – Compute takes “no time” to calculate
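The "no time" claim can be made concrete with a simple two-stage pipeline model. This is an idealized sketch, not a measurement: the function names are hypothetical, and it assumes every chunk has the same transfer time `t_io` and compute time `t_compute`.

```c
/* Total time for `chunks` chunks when each transfer must finish
   before its compute starts and nothing overlaps. */
double serial_time(int chunks, double t_io, double t_compute)
{
    return chunks * (t_io + t_compute);
}

/* Total time with double buffering: the first transfer and the last
   compute cannot be hidden; in between, each step costs the larger
   of the two times. Assumes chunks >= 1. */
double double_buffered_time(int chunks, double t_io, double t_compute)
{
    double step = (t_io > t_compute) ? t_io : t_compute;
    return t_io + (chunks - 1) * step + t_compute;
}
```

With 10 chunks, `t_io = 1` and `t_compute = 4`, the serial model gives 50 time units and the double-buffered model gives 41, so the run time is close to the compute time alone.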


Data reuse

• Swazzle path
  – Left or right 64 bit transfer (8 bytes)
  – 8 bytes per cycle, so ~161 GB/s per CSX processor
  – Can be complete loop or linear chain
• Parallel with other data I/O
  – Register-register move
  – On-off chip in parallel
    • Doesn’t impinge on DRAM access
  – PE local memory – register in parallel
    • Doesn’t impinge on local memory access


Maximize PE usage

• Aim for 100% efficiency
• PEs use predicated execution
  – PEs are “disabled” rather than code skipped
  – Minimize effects – extract common code from conditionals
• Mono processor can branch
  – Skip blocks of code


Detail of I/O widths for performance analysis

Each accelerator board has:
  – 161 GB/s bandwidth PE register to PE memory
    • 4 bytes per cycle
  – 322 GB/s swazzle path bandwidth
    • 8 bytes per cycle
  – 968 GB/s bandwidth PE register to PE ALU
    • 24 bytes per cycle
  – 5.4 GB/s DRAM bandwidth
    • 32 bytes per cycle
(Aggregate bandwidth for two CSX600 chips.)

[Figure: block diagram of PE n and its neighbors PE n-1 and PE n+1, showing the 128-byte register file, the 6 KByte PE SRAM, the FP Mul, FP Add, Div/Sqrt, MAC, and ALU units, programmed I/O (PIO) collection and distribution, and the 1 GByte CSX DRAM, annotated with path widths (32, 64, 128) and bandwidths (161 GB/s, 322 GB/s, 968 GB/s, 5.4 GB/s).]


Software Development Kit


ClearSpeed SDK overview

• Cn compiler – C with extensions for SIMD control
• Assembler
• Linker
• Simulator
• Debugger
• Graphical profiler
• Libraries
• Documentation
• Available for Windows XP / 2003 and Linux (Red Hat Enterprise Linux 4 and SLES 9)


Agenda

1. Introduction to Cn
2. Cn Libraries
3. Debugging Cn
4. CSAPI: Host / Board Communication


Introduction to Cn


Software Development

The CSX architecture is simpler to program:

• Single program for serial and parallel operations
• Architecture and compiler co-designed
• Instruction and data caches
• Simple, regular 32-bit instruction set
• Large, flexible register file
• Fast thread context switching
• Built-in debug support
• Same development process as traditional architectures: compile, assemble, link
• Cn is a simple parallel extension of C


Cn: C with vector extensions for CSX

• New Keywords
  – mono and poly storage qualifiers
    • mono is a serial (single) variable
    • poly is a parallel (vector) variable
• Mono variables in 1 GB DRAM
• Poly variables in 6 KB SRAM of each PE


Cn differences from C

• New data type multiplicity modifiers:
  – mono: denotes serial variable
    • resident in “mono” memory
    • mono is the default multiplicity
  – poly: denotes parallel/vector variable
    • resident in “poly” memory local to each PE
  – applies to pointers, doubly so:
    • mono int * poly foo;
      – foo is a pointer in poly memory to an int in mono memory
    • poly int * mono bar;
      – bar is a pointer in mono memory to an int in poly memory
    • int * poly *mono good_grief;
      – as you would expect….
• Pointer sizes:
  – mono int *
    • 4 bytes (32-bit addressable space, 512 MB)
  – poly int *
    • 2 bytes (16-bit addressable space, 6 kB)


Cn differences from C

• Execution context:
  – Alters branch/jump behavior
  – In mono context, jumps occur as in traditional architecture
  – In poly context, PEs are enabled/disabled
    • if (penum>32) {…} else {…}
      – disables false PEs on true branch, then re-enables the false PEs and disables the other PEs for the false branch
      – both branches executed
    • break, continue
      – selected PEs get disabled until the end of scope on all PEs
    • return
      – selected PEs get disabled until all PEs return, or end of scope


Porting C to Cn (Example 1)

C code:

  int i, j;

  for( i=0; i<96; i++ ) {
    j = 2*i;
  }

Similar Cn code:

  poly int i, j;
  i = get_penum();  // i=0 on PE0, i=1 on PE1 etc.
  j = 2*i;          // j=0 on PE0, j=2 on PE1 etc.


Porting C to Cn (Example 2)

C code:

  int i;

  for( i=0; i<N; i++ ) {
    …
  }

Similar Cn code:

  poly int me, i;
  mono int npes;
  me = get_penum();      // me=0 on PE0, me=1 on PE1 etc.
  npes = get_num_pes();  // npes = 96

  // i=0,96,192, …; 1,97,193, … etc.
  for( i=me; i<N; i+=npes ) {
    …
  }
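The strided loop in Example 2 deals the iterations out to the PEs cyclically. The plain-C sketch below emulates that distribution on the host to check that every index is handled exactly once; the helper names are hypothetical and nothing here uses the Cn runtime.

```c
#include <stdlib.h>

/* Emulates `for (i = me; i < n; i += npes)` for every PE number
   me = 0..npes-1 and counts how often each index is visited.
   Returns 1 if every index in [0, n) is visited exactly once. */
int cyclic_covers_once(int n, int npes)
{
    int ok = 1;
    int *visits;

    if (n <= 0)
        return 1;
    visits = calloc((size_t)n, sizeof *visits);
    if (visits == NULL)
        return 0;
    for (int me = 0; me < npes; me++)
        for (int i = me; i < n; i += npes)
            visits[i]++;
    for (int i = 0; i < n; i++)
        if (visits[i] != 1)
            ok = 0;
    free(visits);
    return ok;
}

/* Number of loop iterations executed with PE number `me`. */
int iterations_on_pe(int n, int npes, int me)
{
    return (me < n) ? (n - me + npes - 1) / npes : 0;
}
```

When N is not a multiple of 96, some PEs run one more iteration than others; on the real hardware the PEs that finish early would simply be disabled for the final pass.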


Simple Cn example

void foo (double *A, double *B, int n) {
  // Assume n is divisible by 24*96.
  poly double mat[4]={1.,2.,3.,4.};
  poly double a[24];
  poly double b[4]={0.,0.,0.,0.};
  int i;

  while (n) {
    // Each PE copies its own 24 doubles from mono memory.
    memcpym2p (a, A+24*get_penum(), 24*sizeof(double));
    A+=24*96;
    // Note: as written, a[i+1] indexes a[24], one past the end of a, when i == 23.
    for (i=0; i<24; i++) {
      b[0] += a[i]*mat[0] + a[i+1]*mat[1];
      b[1] += a[i+1]*mat[0] + a[i]*mat[1];
      b[2] += a[i]*mat[2] - a[i+1]*mat[3];
      b[3] += a[i+1]*mat[2] - a[i]*mat[3];
    }
    n -= 24*96;
  }

  // Each PE writes its 4 results back to mono memory.
  memcpyp2m (B+4*get_penum(), b, 4*sizeof(double));
  return;
}


Cn Libraries


Runtime libraries

• Cn supports standard C runtime, including:
  – malloc
  – printf
  – sqrt
  – memcpy
• Cn extensions include:
  – sqrtp
  – memcpym2p / memcpyp2m
  – get_penum
  – swazzle
  – any / all


Asynchronous I/O

• For most efficient use of limited PE memory, overlap data transfers between mono memory and poly:
  – async_memcpym2p/p2m
  – sem_sig / sem_wait
• For greatest efficiency, async_memcpy routines bypass the data cache, so coherency must be maintained:
  – dcache_flush / dcache_flush_address


Asynchronous I/O example

void foo(double *A, double *B, int n) {
  // Assume n is divisible by 24*96
  poly unsigned short penum=get_penum();
  poly double mat[4]={1.,2.,3.,4.};
  poly double a_front[12], a_back[12];
  poly double b[4]={0.,0.,0.,0.};
  int i;

  async_memcpym2p(19,a_front,A+12*penum,12*sizeof(double));
  A+=12*96;
  n-=24*96;
  while (n) {
    async_memcpym2p(17,a_back,A+12*penum,12*sizeof(double));
    A+=12*96;
    sem_wait(19);
    for (i=0;i<12;i++) {
      b[0] += a_front[i]*mat[0] + a_front[i+1]*mat[1];
      b[1] += a_front[i+1]*mat[0] + a_front[i]*mat[1];
      b[2] += a_front[i]*mat[2] - a_front[i+1]*mat[3];
      b[3] += a_front[i+1]*mat[2] - a_front[i]*mat[3];
    }
    n-=12*96;
    async_memcpym2p(19,a_front,A+12*penum,12*sizeof(double));
    A+=12*96;
    sem_wait(17);
    for (i=0;i<12;i++) {
      …
      // compute on a_back, then finish outside while loop


Cn Pointers


Cn: mono and poly pointers

• Using mono and poly with pointers
    mono int * mono mPmi    mono pointer to mono int
    poly int * mono mPpi    mono pointer to poly int
    mono int * poly pPmi    poly pointer to mono int
    poly int * poly pPpi    poly pointer to poly int
• Most commonly used is mono pointer to poly
    poly <type> * mono <variable_name>


Cn: mono and poly pointers

• mono pointer to mono int
    mono int * mono mPmi

[Figure: an int * in mono memory pointing to an int in mono memory.]


Cn: mono and poly pointers

• mono pointer to poly int
    poly int * mono mPpi

Note: Points to same location in each PE

[Figure: a single int * in mono memory pointing to an int at the same address in the poly memory of each PE.]


Cn: mono and poly pointers

• poly pointer to poly int
    poly int * poly pPpi

Note: Pointer stored in same location in each PE

[Figure: the poly memory of each PE holds an int * that points to an int in that PE's own poly memory.]


Cn: mono and poly pointers

• poly pointer to mono int
    mono int * poly pPmi

Note: Pointer stored in same location in each PE

[Figure: the poly memory of each PE holds an int * that points to an int in mono memory; different PEs may point to different ints.]


Conditional Expressions


Conditional Expressions: mono-if

• Conditions based on mono expressions
  – Expression has same value on all PEs
  – Code block selected according to expression and branch instruction executed

  mono int i, j;

  i = j = 1;
  if( i == j ) {
    // this block executed on all PEs
  } else {
    // this block branched over on all PEs
  }


Conditional Expressions: poly-if

• Conditions based on poly expressions
  – Expression may have different values on different PEs
  – But SIMD model implies all PEs execute same instruction simultaneously
  – All branches executed on all PEs, with PE enabled if conditional expression is true (like predicated instructions)

  poly int i;
  i = get_penum();

  if( i < 48 ) {
    // PEs 0, 1, 2, … execute instructions
    // PEs 48, 49, … instructions issued but ignored
  } else {
    // PEs 0, 1, 2, … instructions issued but ignored
    // PEs 48, 49, … execute instructions
  }
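The predicated behavior of a poly-if can be emulated in plain C with an explicit enable flag per PE. This is only a host-side sketch of the idea: the function name and the values assigned in each branch are made up for illustration, and the real hardware applies the enable state per instruction rather than per loop.

```c
enum { NUM_PES = 96 };

/* Emulates, for PE numbers 0..95:
       poly int i = get_penum(), r;
       if (i < 48) r = 2 * i; else r = -i;
   Both branch bodies are issued to the whole array; each PE only
   commits results while it is enabled. Returns the number of branch
   bodies issued (always 2, regardless of how many PEs take each). */
int poly_if(int r[NUM_PES])
{
    int enable[NUM_PES];
    int issued = 0;

    for (int pe = 0; pe < NUM_PES; pe++)    /* evaluate condition per PE */
        enable[pe] = (pe < 48);

    issued++;                               /* "then" body issued to all PEs */
    for (int pe = 0; pe < NUM_PES; pe++)
        if (enable[pe])
            r[pe] = 2 * pe;

    issued++;                               /* "else" body issued to all PEs */
    for (int pe = 0; pe < NUM_PES; pe++)
        if (!enable[pe])
            r[pe] = -pe;

    return issued;
}
```

Because both bodies are always issued, the cost of a poly-if is the sum of its branches, which is why code common to both branches is worth hoisting out of the conditional.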


Conditional Expressions: poly-while

• While loops based on poly expressions
  – Loop continues execution until condition is false on all PEs
  – PEs will be disabled one by one until while condition is false on all PEs
  – count keeps track of total number of iterations (95 in this case, since the highest-numbered PE starts with me = 95)

  mono int count = 0;
  poly int me;
  me = get_penum();

  while( me > 0 ) {
    --me;
    ++count;
  }
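The poly-while above can be traced on the host with a plain-C emulation. The sketch below is an illustration under stated assumptions, not Cn: the function name is hypothetical, PE numbers are taken to run from 0 to npes-1, and the mono counter is incremented once per pass while at least one PE is still enabled.

```c
enum { MAX_PES = 96 };

/* Emulates
       mono int count = 0;
       poly int me = get_penum();
       while (me > 0) { --me; ++count; }
   for `npes` PEs (npes <= MAX_PES) and returns the final count. */
int poly_while_count(int npes)
{
    int me[MAX_PES];
    int count = 0;

    for (int pe = 0; pe < npes; pe++)
        me[pe] = pe;

    for (;;) {
        int any_enabled = 0;
        for (int pe = 0; pe < npes; pe++)   /* is the condition true anywhere? */
            if (me[pe] > 0)
                any_enabled = 1;
        if (!any_enabled)
            break;
        for (int pe = 0; pe < npes; pe++)   /* only enabled PEs decrement */
            if (me[pe] > 0)
                --me[pe];
        ++count;                            /* mono statement runs once per pass */
    }
    return count;
}
```

In this emulation `poly_while_count(96)` returns 95: PE 0 never enters the loop, and PE 95 is the last to be disabled, after 95 passes.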


Other variations between C and Cn

• Labeled break and continue statements
• No switch statement using poly variables (use multiple if statements)
• No goto statement in poly context


Moving Data


Data flow

• Board and host communicate via Linux kernel module or Windows driver

• Create a handle and establish the connection


Data flow

• Register the intent to use the first processor on the card
• Load the code onto the enabled processor


Data flow

• Transfer data from host to board
• Semaphores synchronize transfers between host and board


Data flow

• Run the code on the enabled processor
• Host can continue with other work


Data flow

• Send results back to host
• Halt board program and clean up


Implicit broadcast from mono to poly

• Implicit broadcast from mono to poly by assignment

  mono int m = 7;
  poly int p;

  p = m; // Implicit broadcast to all PEs

• Assigning poly to mono is not permitted

  mono int m;
  poly int p = get_penum();

  m = p; // NO! m receives different value from each PE


Explicit data movement: mono to poly

memcpym2p(); async_memcpym2p()
• Memory copy of n bytes from mono to poly
  – Source is a poly pointer to mono memory, which can have a different value for each PE
  – Destination is a mono pointer to poly memory, that is, the destination address is the same for all PEs

[Figure: source data in mono memory copied to the same destination address on each of PE0, PE1, PE2, …, PE95.]
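The addressing rule for a mono-to-poly copy (a per-PE source, one shared destination address) can be emulated on the host in plain C. Everything below is a stand-in for illustration: `poly_mem`, `emulate_memcpym2p`, and `poly_read` are hypothetical names, poly memory is modeled as one array of doubles per PE, and the real `memcpym2p` takes Cn pointers rather than an array of source pointers.

```c
#include <stddef.h>
#include <string.h>

enum { NUM_PES = 96, PE_DOUBLES = 6 * 1024 / sizeof(double) };

/* One array per PE stands in for that PE's 6 KB of poly memory. */
static double poly_mem[NUM_PES][PE_DOUBLES];

/* dst_index is the same for every PE (a mono pointer to poly memory);
   src[pe] may differ per PE (a poly pointer to mono memory).
   The caller must keep dst_index + count within PE_DOUBLES. */
void emulate_memcpym2p(size_t dst_index, const double *const src[NUM_PES],
                       size_t count)
{
    for (int pe = 0; pe < NUM_PES; pe++)
        memcpy(&poly_mem[pe][dst_index], src[pe], count * sizeof(double));
}

/* Reads one double back from the emulated poly memory of a PE. */
double poly_read(int pe, size_t index)
{
    return poly_mem[pe][index];
}
```

Setting `src[pe] = A + 24 * pe` mirrors the `A+24*get_penum()` idiom: every PE receives a different 24-element slice of `A`, stored at the same offset in its own memory.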


Explicit data movement: poly to mono

memcpyp2m(); async_memcpyp2m()
• Memory copy of n bytes from poly to mono
  – Source is a mono pointer to poly memory; therefore source address is the same for every PE
  – Destination is a poly pointer to mono memory, which can have a different value for each PE

[Figure: data at the same source address on each of PE0, PE1, PE2, …, PE95 copied to destination data in mono memory.]


Explicit data movement: asynchronous

async_memcpym2p(); async_memcpyp2m()
• Asynchronous memory copy of n bytes from mono to poly or from poly to mono
  – Computation continues during data copy
  – Mono memory data cache NOT flushed
  – Restrictions on alignment of data
  – Use semaphores to wait for completion of copy
  – Much higher bandwidth than synchronous versions

  dcache_flush();
  async_memcpym2p( semaphore, … );
  // computation continues
  sem_wait( semaphore );
  // use data that has been transferred from mono memory


Explicit data movement: swazzle

• Register-to-register transfer between neighboring PEs

[Figure: PE n, containing a register file, memory, ALU, enable stack, and status flags, with swazzle links to PE n-1 and PE n+1.]


Swazzle operations

• Assembly language versions operate directly on register file

• Cn versions operate on data and include implicit data movement from memory to registers

• Variants
  – swazzle_up( poly int src ); // copy to higher numbered PE
  – swazzle_down( poly int src ); // copy to lower numbered PE
  – swazzle_up_generic( poly void *dst, poly void *src, unsigned int size );
  – swazzle_down_generic( … );
  – Similar swazzles operating on other data types
  – Functions to set data copied into ends of swazzle chain
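The neighbor shift performed by a swazzle can be pictured with a plain-C emulation in which an array index plays the role of the PE number. The functions below are hypothetical host-side stand-ins, not the Cn library calls listed above; the `edge` parameter models the value fed into the end of the chain.

```c
enum { NUM_PES = 96 };

/* Each PE receives the value held by its lower-numbered neighbor;
   PE 0 receives `edge`. dst and src must not overlap. */
void emulate_swazzle_up(int dst[NUM_PES], const int src[NUM_PES], int edge)
{
    dst[0] = edge;
    for (int pe = 1; pe < NUM_PES; pe++)
        dst[pe] = src[pe - 1];
}

/* Each PE receives the value held by its higher-numbered neighbor;
   the last PE receives `edge`. dst and src must not overlap. */
void emulate_swazzle_down(int dst[NUM_PES], const int src[NUM_PES], int edge)
{
    for (int pe = 0; pe < NUM_PES - 1; pe++)
        dst[pe] = src[pe + 1];
    dst[NUM_PES - 1] = edge;
}
```

Passing `src[NUM_PES - 1]` as the edge value of the upward shift, or `src[0]` for the downward shift, turns the line into the ring topology mentioned earlier.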


Data movement bandwidths per CSX600

• Mono memory to poly memory: 2.7 GB/s aggregate over 96 PEs
• Poly memory to registers: 840 MB/s per PE, 81 GB/s aggregate
• Swazzle path bandwidth: 1680 MB/s per PE, 161 GB/s aggregate

• Total bandwidth for Advance board (2 CSX600 processors) ~0.5 TB/s


DMA performance

• Advance board has a host DMA controller which can act as a PCI bus master

• All DMA transfers are at least 8-byte aligned
• Host DMA engine will attempt to use the entire bus bandwidth

[Figure: ClearSpeed Advance DMA Performance. Transfer rate (MB/s, axis 0 to 1200) versus transfer size (2.0 to 7.8 MB), with series e620_Read_avg, e620_Write_avg, X620_Read_avg, and X620_Write_avg.]


CSAPI: Host-Board communication


Host-Board interaction basics

• The basic model for interaction between the host and the card is very simple:

• The ClearSpeed board can signal and wait for semaphores; it cannot initiate data transactions with the host.

• The host pushes data to and pulls data from the board.

• The host can also signal and receive semaphores.


Connecting to the board

• A host application needs to perform the following sequence to launch a process on the board:
  – Create a CSAPI handle
    • CSAPI_new
  – Establish a connection with the board
    • CSAPI_connect
  – Register the host application with the driver
    • CSAPI_register_application
  – Load the CSX application on the desired chip
    • CSAPI_load
  – Run the CSX application on the desired chip
    • CSAPI_run


Interacting with the board

– Get board memory address of a known symbol
  • CSAPI_get_symbol_value
    – This must be done after the application is loaded, if the dynamic load capability is to be used.
– Write/Read data to a retrieved memory address
  • CSAPI_write_mono_memory
  • CSAPI_read_mono_memory
    – Asynchronous variants of these routines also exist
    – A process does not need to be running for these operations to succeed, but the process needs to be loaded.
    – These should not be performed DURING process termination.
– Managing semaphores
  • CSAPI_allocate_shared_semaphore
    – Declares a semaphore for use on both host and card
  • CSAPI_semaphore_wait
  • CSAPI_semaphore_signal


Cleaning up

– Process termination
  • CSAPI_wait_on_terminate
  • CSAPI_get_return_value
– Clean-up
  • CSAPI_delete
– See CSX600 Runtime Software User Guide for more details, including:
  • managing multiple processes on the board/chip at once
  • managing board control registers
  • board reset
  • managing multi-threaded CSX applications
  • board memory allocation
  • managing multiple boards/chips
  • error handling


Debugging Cn


csgdb

– csgdb is a port of the open source gdb debugger
– full symbolic debugging of mono/poly variables
– full gdb breakpoint support
– step through Cn or assembly
– views mono and poly registers
– views PE enabled state
– also accessible via DDD
  • DDD allows graphical data visualization


Debug control

– To enable debugging:
  • export CS_CSAPI_DEBUGGER=1
    – initializes the debug interface within the host application
  • export CS_CSAPI_DEBUGGER_ATTACH=1
    – host application will then write a port number to stdout and wait for <Return/Enter> to be pressed so that csgdb can be manually attached to the connected board process
– Launch the host application
  • This can be done with or without a debugger.
– Launch csgdb in a new shell*
  • csgdb <csx_file_name> <port_number>
    – No need to “connect” as the host application did this already
  • set desired breakpoints
  • run
    – Note that the host is currently blocked waiting for <Return/Enter>, so card process may also be blocked waiting for the host.
    – Press return in the host shell for the host and card applications to proceed.


csgdb Debugger (Shown with ddd Front-end)

[Figure: screenshot of csgdb running under the ddd front-end, with callouts for: real-time plot of contents of PE memory; Cn source-level break points, watch points, single step, etc.; disassembly with break points, watch points, single step, etc.; register contents; on-chip poly array contents displayed.]


csgdb Command-line example

    % cscn foo.cn -g -o foo.csx
    % csgdb ./foo.csx
    (gdb) connect
    0x80000000 in __FRAME_BEGIN_MONO__ ()
    (gdb) break 109
    Breakpoint 1 at 0x800154c0: file foo.cn, line 109.
    (gdb) run
    Starting program: /home/kris/my_app/foo.csx
    Breakpoint 1, main () at foo.cn:109
    (gdb) next
    110 y = MINY + (get_penum() * STEPY);
    (gdb) print y
    $1 = {-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1}


ClearSpeed Visual Profiler: Explaining Performance


ClearSpeed Visual Profiler (csvprof)

– Host tracing
  • Trace CSAPI function
  • User can infer overlapping host/board utilization
  • Locate hot-spots
– Board tracing
  • Trace board side functions without instrumentation
  • Locate hot-spots
– Board hardware utilization
  • Display activity of csx functional units including:
    – ld/st
    – Pi/o
    – SIMD microcode
    – Instruction cache
    – Data cache
    – Thread
  • Cycle accurate
  • View corresponding source
– Unified GUI


Detailed profiling is essential for accelerator tuning

[Diagram: host CPU(s), Advance Accelerator Board, CSX600 processors, and the CSX600 pipeline, annotated with four profiling views:]

• HOST CODE PROFILING
  – Visually inspect multiple host threads.
  – Time specific code sections.
  – Check overlap of host threads.
• HOST/BOARD INTERACTION
  – Infer cause and effect.
  – Measure transfer bandwidth.
  – Check overlap of host and board compute.
• CSX600 SYSTEM
  – Trace at system level.
  – Inspect overlap of compute and I/O.
  – View cache utilization.
  – Graph performance.
• ACCELERATOR PIPE
  – View instruction issue.
  – Visualize overlap of executing instructions.
  – Get cycle-accurate timing.
  – Remove instruction-level performance bottlenecks.


csvprof: Host tracing

• Dynamic loading of CSAPI Trace implementation

• Triggered with an environment variable:
  – export CS_CSAPI_TRACE=1
    » Recall similar enabling of debug support:
    » export CS_CSAPI_DEBUGGER=1

• Specify tracing format:
  – export CS_CSAPI_TRACE_CSVPROF=1
  – currently this is the only implementation, but in the future…

• Specify output file for trace:
  – export CS_CSAPI_TRACE_CSVPROF_FILE=mytrace.cst
  – default filename = csvprof_data.cst

• Output file written during CSAPI_delete
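The three trace settings can be collected in one place before the host application is launched. This is a minimal Python sketch under stated assumptions: `trace_env` and `expected_trace_file` are hypothetical helper names, not SDK functions. The variable names and the default file name `csvprof_data.cst` are the ones listed on the slide.

```python
import os

# Default output file name, as stated on the slide.
DEFAULT_TRACE_FILE = "csvprof_data.cst"


def trace_env(trace_file=None, base=None):
    """Return a copy of the environment with csvprof host tracing enabled."""
    env = dict(os.environ if base is None else base)
    env["CS_CSAPI_TRACE"] = "1"          # load the CSAPI trace implementation
    env["CS_CSAPI_TRACE_CSVPROF"] = "1"  # select the csvprof trace format
    if trace_file is not None:
        env["CS_CSAPI_TRACE_CSVPROF_FILE"] = trace_file
    return env


def expected_trace_file(env):
    """Name of the file the trace should be written to for this environment."""
    return env.get("CS_CSAPI_TRACE_CSVPROF_FILE", DEFAULT_TRACE_FILE)


if __name__ == "__main__":
    env = trace_env("mytrace.cst", base={})
    print(env)
    print(expected_trace_file(env))
```

Because the trace is written during CSAPI_delete, the file named by `expected_trace_file` should only be looked for after the host application has shut down its CSAPI session.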


csvprof: Host-Board interaction


csvprof: Host code profile (Linpack benchmark)


csvprof: CSX600 system profile


csvprof: Accelerator pipeline profile


csvprof: Instruction pipeline stalls


csvprof: Advance board tracing

– Enabled using the debugger, csgdb
  • Can be used interactively or through a gdb script

– Can select specific events to profile, or all events

– Requires buffer allocation on the card
  • Today, this is done statically
  • One could use CSAPI to allocate the buffer, but the developer must then pass the buffer's location and size to the user, to be entered in csgdb
  • Easy if running on only one chip: place the buffer in the other chip’s memory

– Explicit dump to generate trace file
  • Can control the type of data to be dumped


csvprof: Sample gdb script

    % cat ./csgdb_trace.gdb
    connect
    load ./foo.csx
    cstrace buffer 0x60000000 0x1000000
    cstrace event all on
    tbreak test_me
    continue
    cstrace enable
    continue
    cstrace dump foo.cst
    cstrace dump branch dgemm_test4_branch.cst
    quit

    % csgdb -command=./csgdb_trace.gdb


Tuning Tips


Pipelined arithmetic

• Four-stage floating-point pipeline
• Use vector types, vector intrinsic functions, and vector math library for high efficiency

    __DVECTOR a, b, c;
    poly double x[N];

    /* View x[0..3] and x[4..7] as vectors by casting the element addresses. */
    a = *((__DVECTOR *)&x[0]);
    b = *((__DVECTOR *)&x[4]);
    c = cs_sqrt( __cs_vadd( a, b ) );


Poly conditionals

• When possible, remove common sub-expressions from poly if-blocks to reduce the amount of replicated work.
• It may be worth computing results and throwing them away if that leads to fewer poly conditional blocks.
• A poly if-block uses predicated instructions, not a branch, so it is cheap if not many additional instructions are executed.
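The cost argument can be illustrated with a small host-side model. This Python sketch only emulates predicated execution: it is not Cn and uses no ClearSpeed API, and the function names and the `x * x + 1.0` workload are invented for illustration. Each call to `simd_square_plus_one` stands for one instruction sequence issued to the whole PE array, whether or not a given PE is enabled.

```python
def simd_square_plus_one(xs, issued):
    """One SIMD instruction sequence applied to every PE; logged once per issue."""
    issued.append("square_plus_one")
    return [x * x + 1.0 for x in xs]


def select(mask, then_vals, else_vals):
    """Predicated write-back: each PE keeps the arm its enable bit selects."""
    return [t if m else e for m, t, e in zip(mask, then_vals, else_vals)]


def naive(xs):
    """The common sub-expression appears in both arms, so it is issued twice."""
    issued = []
    mask = [x < 0 for x in xs]
    then_vals = [-v for v in simd_square_plus_one(xs, issued)]       # if-arm
    else_vals = [2.0 * v for v in simd_square_plus_one(xs, issued)]  # else-arm
    return select(mask, then_vals, else_vals), len(issued)


def hoisted(xs):
    """The common sub-expression is computed once, outside the poly if-block."""
    issued = []
    common = simd_square_plus_one(xs, issued)
    mask = [x < 0 for x in xs]
    result = select(mask, [-v for v in common], [2.0 * v for v in common])
    return result, len(issued)


if __name__ == "__main__":
    data = [-2.0, 0.0, 3.0]  # one value per (emulated) PE
    print(naive(data))
    print(hoisted(data))
```

Both versions return the same per-PE values; the hoisted version issues the shared computation once instead of once per arm, which is the saving the first bullet describes.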


Poly loop counters

• Loops with poly counters are more expensive than those with mono counters

• Use mono loop counters if possible


Arrays

• Pointer incrementing is more efficient than using array index notation

• Poly addresses require 16 bits

• Use short for poly pointer increments
  – This avoids conversion of int to short


Data transfer

• Synchronous functions are completely general
  – flush the data cache each transfer
    • memcpyp2m()
    • memcpym2p()

• Asynchronous functions maximize performance
  – do not flush cache
  – have data size and alignment restrictions
  – require use of wait semaphore
    • async_memcpyp2m(); sem_wait()
    • async_memcpym2p(); sem_wait()

• Large data blocks are more efficient than small blocks
  – Host to board
  – Board to host
  – Mono to poly
  – Poly to mono
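One common way to see why large blocks win is a fixed-overhead model: every transfer pays a start-up cost regardless of size, so the cost per byte falls as the block grows. The slide does not give this model or any figures; the Python sketch below assumes it, and the 10 microsecond latency and 1 GB/s peak rate are hypothetical placeholders, not measured ClearSpeed numbers.

```python
def transfer_time(size_bytes, latency_s, peak_bytes_per_s):
    """Assumed model: fixed per-transfer overhead plus size over peak rate."""
    return latency_s + size_bytes / peak_bytes_per_s


def effective_bandwidth(size_bytes, latency_s, peak_bytes_per_s):
    """Bytes per second actually achieved for one transfer of this size."""
    return size_bytes / transfer_time(size_bytes, latency_s, peak_bytes_per_s)


if __name__ == "__main__":
    LATENCY_S = 10e-6  # hypothetical per-transfer overhead
    PEAK = 1e9         # hypothetical peak rate, bytes per second
    for size in (1 << 10, 1 << 15, 1 << 20):
        bw = effective_bandwidth(size, LATENCY_S, PEAK)
        print(f"{size:>8} bytes: {bw / 1e6:8.1f} MB/s")
```

Under this model the effective rate approaches the peak only when the transfer time dominates the fixed overhead, which is why batching many small transfers into one large one tends to pay off.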


Application Examples


Math function speed comparison

Typical speedup of ~8X over the fastest x86 processors, because math functions stay in local memory on the card

[Bar chart: 64-bit function operations per second (billions), axis 0.0 to 2.5, by function name: Sqrt, InvSqrt, Exp, Ln, Cos, Sin, SinCos, Inv. Series: 2.6 GHz dual-core Opteron; 3 GHz dual-core Woodcrest; ClearSpeed Advance card.]


Nucleic Acid Builder (NAB)

• Newton-Raphson refinement now possible; large DGEMM calls from computed second derivatives will be in AMBER 10

• 2.5x speedup obtained for this operation in three hours of programmer effort

• Enables accurate computation of entropy and Gibbs Free Energy for the first time

• AMBER itself has cases that ClearSpeed accelerates by 3.2x to 9x, with 5x to 17x possible once we exploit symmetry of atom-atom interactions


AMBER molecular modeling with ClearSpeed

    AMBER model   Host       Advance X620   Speedup
    Gen Born 1    83.5 min   24.6 min       3.4×
    Gen Born 2    84.6 min   23.5 min       3.6×
    Gen Born 6    37.9 min    4.0 min       9.4×

[Bar chart: AMBER Generalized Born Models 1, 2, and 6. Run time in minutes (axis 0.0 to 100.0), Host vs. Advance X620: Generalized Born 1, 83.5 vs. 24.6; Generalized Born 2, 84.6 vs. 23.5; Generalized Born 6, 37.9 vs. 4.0.]


Monte Carlo methods exploit high local bandwidth

• Monte Carlo methods are ideal for ClearSpeed acceleration:
  – High regularity and locality of the algorithm
  – Very high compute to I/O ratio
  – Very good scalability to high degrees of parallelism
  – Needs 64-bit

• Excellent results for parallelization
  – Achieving 10x performance per Advance card vs. highly optimized code on the fastest x86 CPUs available today
  – Maintains high precision required by the computations
    • True 64-bit IEEE 754 floating point throughout
  – 25 W per card typical when card is computing

• ClearSpeed has a Monte Carlo example code, available in source form for evaluation


Monte Carlo applications scale very well

• No acceleration: 200M samples, 79 seconds
• 1 Advance board: 200M samples, 3.6 seconds
• 5 Advance boards: 200M samples, 0.7 seconds

[Chart: European Option Pricing Model. Speedup (axis 0 to 10) versus number of ClearSpeed Advance boards (0 to 5).]
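The three timings quoted on this slide can be turned into speedup and scaling figures with a few lines of arithmetic. This is a minimal Python sketch; the function names are hypothetical and the only data used are the 79 s, 3.6 s, and 0.7 s run times for 200M samples.

```python
# Run times in seconds for 200M samples, keyed by number of Advance boards
# (0 means no acceleration). Values are the ones quoted on the slide.
TIMINGS = {0: 79.0, 1: 3.6, 5: 0.7}


def speedup(baseline_s, accelerated_s):
    """How many times faster the accelerated run is than the baseline."""
    return baseline_s / accelerated_s


def scaling_efficiency(t_one_board_s, t_n_boards_s, n_boards):
    """Speedup from 1 to n boards, divided by n (1.0 means linear scaling)."""
    return speedup(t_one_board_s, t_n_boards_s) / n_boards


if __name__ == "__main__":
    for boards in (1, 5):
        print(boards, "board(s):", round(speedup(TIMINGS[0], TIMINGS[boards]), 1))
    print("efficiency at 5 boards:",
          round(scaling_efficiency(TIMINGS[1], TIMINGS[5], 5), 2))
```

An efficiency slightly above 1.0 here most likely reflects the one-decimal rounding of the quoted run times rather than genuinely super-linear scaling.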


Why do Monte Carlo applications need 64-bit?

• Accuracy increases as the square root of the number of trials, so five-decimal accuracy takes 10 billion trials.

• But, when you sum many similar values, you start to lose all the significant digits.

• 64-bit summation needed to get a single-precision result!

Single precision: 1.0000×10^8 + 1 = 1.0000×10^8

Double precision: 1.0000×10^8 + 1 = 1.00000001×10^8
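The single-precision loss can be reproduced on any host. This is a minimal Python sketch, assuming only the standard library: Python floats are IEEE 754 doubles, and `struct` is used to round values to 32-bit floats. The helper names are hypothetical.

```python
import struct


def to_f32(x):
    """Round a Python float (IEEE 754 double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def add_f32(a, b):
    """Single-precision addition: operands and result rounded to 32 bits."""
    return to_f32(to_f32(a) + to_f32(b))


def sum_f32(values, start=0.0):
    """Accumulate values in single precision, rounding after every addition."""
    total = to_f32(start)
    for v in values:
        total = add_f32(total, v)
    return total


def trials_for_decimal_digits(digits):
    """Error shrinks like 1/sqrt(N), so d decimal digits need about 10**(2*d) trials."""
    return 10 ** (2 * digits)


if __name__ == "__main__":
    print(add_f32(1.0e8, 1.0))                  # single precision
    print(1.0e8 + 1.0)                          # double precision
    print(sum_f32([1.0] * 1000, start=1.0e8))   # 1000 small terms, single
    print(sum([1.0] * 1000, 1.0e8))             # 1000 small terms, double
    print(trials_for_decimal_digits(5))
```

Near 10^8 adjacent single-precision values are 8 apart, so each added 1 is rounded away and the running sum never moves, however many terms are added. The double-precision accumulator keeps every term.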


Help and Support


Installed documentation

• docs directory
  – CSXL user guide
  – runtime user guide
  – csvprof Visual Profiler overview and examples
  – SDK
    • getting started
    • gdb manual
    • instruction set manual
    • Cn library manual
    • reference manual
  – release notes

• examples directory


ClearSpeed online

• General information, news, etc.
  – Company website www.clearspeed.com

• Report a problem, find answers, etc.
  – Support website support.clearspeed.com

• Support website has:
  – Documentation, user guides, reference manuals
  – Solutions knowledge base
  – Software downloads
  – Log a case


Join the ClearSpeed Developer Program!

• Designed to support the leading-edge community of developers using accelerators

• Membership is free and has the following benefits:
  – Access to the ClearSpeed Developer website
  – ClearSpeed Developer Community on-line forum
  – Invitation to participate in ClearSpeed Developer & User Community meetings and events
  – Repository to share and access demonstrations and sample codes within the ClearSpeed Developer Community
  – Technical updates, tips and tricks from the gurus at ClearSpeed and the Developer Community
  – And more, including opportunities to preview new software releases and developer discount programs.

• Leverage the expertise of developers worldwide.
• Ask a question, or share your knowledge.
• Register now at developer.clearspeed.com!
