The Cell Processor


Transcript of The Cell Processor

Page 1: The Cell Processor


Open Systems Design and Development

2007-04-12 | Heiko J Schick <[email protected]>

The Cell Processor

Computing of tomorrow or yesterday?

© 2007 IBM Corporation

Page 2: The Cell Processor

Open Systems Design and Development

The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation

Agenda

1. Introduction

2. Limiters to Processor Performance

3. Cell Architecture

4. Cell Platform

5. Cell Applications

6. Cell Programming

7. Appendix

Page 3: The Cell Processor


1. Introduction

Page 4: The Cell Processor


Cell History

IBM, SCEI / Sony and Toshiba Alliance formed in 2000

Design Center opened in March 2001 (Based in Austin, Texas)

Single Cell BE operational Spring 2004

2-way SMP operational Summer 2004

February 7, 2005: First technical disclosures

November 9, 2005: Open Source SDK Published

Page 5: The Cell Processor


The problem is…

…the view from the computer room!

Page 6: The Cell Processor


Outlook

Source: Kurzweil

“Computer performance has been increasing exponentially for 100 years!”

Page 7: The Cell Processor


But what could you do if all objects were intelligent…

…and connected?

Page 8: The Cell Processor


What could you do with

unlimited computing power… for pennies?

Could you predict the path of a storm down to the square kilometer?

Could you identify another 20% of proven oil reserves without drilling a single hole?

Page 9: The Cell Processor


2. Limiters to Processor Performance

Page 10: The Cell Processor


Power Wall / Voltage Wall

Power components:
– Active power
– Passive power
  – Gate leakage
  – Sub-threshold leakage (source-drain leakage)

Source: Tom’s Hardware Guide


Page 11: The Cell Processor


Memory Wall

Main memory is now nearly 1000 cycles from the processor
– The situation is worse with (on-chip) SMP

Memory latency penalties drive inefficiency in the design
– Expensive and sophisticated hardware tries to deal with it
– Programmers who try to gain control of cache content are hindered by the hardware mechanisms

Latency-induced bandwidth limitations
– Much of the bandwidth to memory in systems can only be used speculatively


Page 12: The Cell Processor


Frequency Wall

Increasing frequencies and deeper pipelines have reached diminishing returns on performance

Returns negative if power is taken into account

Results of studies depend on the issue width of the processor
– The wider the processor, the slower it wants to be
– Simultaneous multithreading helps to use issue slots efficiently

Results depend on the number of architected registers and the workload
– More registers tolerate a deeper pipeline
– Fewer random branches in the application tolerate deeper pipelines


Page 13: The Cell Processor


Microprocessor Efficiency

Gelsinger’s law
– 1.4x more performance for 2x more transistors

Hofstee’s corollary
– 1/1.4x efficiency loss in every generation
– Examples: cache size, out-of-order execution, super-scalar issue, etc.

Source: Tom’s Hardware Guide

Increasing performance requires increasing efficiency !!!

Page 14: The Cell Processor


Attacking the Performance Walls

Multi-core non-homogeneous architecture
– Control-plane vs. data-plane processors
– Attacks the Power Wall

3-level model of memory
– Main memory, Local Store, registers
– Attacks the Memory Wall

Large shared register file & SW-controlled branching
– Allows deeper pipelines (11 FO4 helps power)
– Attacks the Frequency Wall

Page 15: The Cell Processor


3. Cell Architecture

Page 16: The Cell Processor


Cell BE Processor

~250M transistors

~235 mm²

Top frequency >3 GHz

9 cores, 10 threads

>200 GFLOPS (SP) @ 3.2 GHz

>20 GFLOPS (DP) @ 3.2 GHz

Up to 25.6 GB/s memory bandwidth

Up to 76.8 GB/s I/O bandwidth

~$400M (US) design investment

Page 17: The Cell Processor


Key Attributes of Cell

Cell is Multi-Core
– Contains a 64-bit Power Architecture™ core
– Contains 8 Synergistic Processor Elements (SPEs)

Cell is a Flexible Architecture
– Multi-OS support (including Linux) with virtualization technology
– Path for OS, legacy apps, and software development

Cell is a Broadband Architecture
– SPE is a RISC architecture with SIMD organization and Local Store
– 128+ concurrent transactions to memory per processor

Cell is a Real-Time Architecture
– Resource allocation (for bandwidth measurement)
– Locking caches (via Replacement Management Tables)

Cell is a Security-Enabled Architecture
– SPEs dynamically reconfigurable as secure processors

Page 18: The Cell Processor


Page 19: The Cell Processor


Power Processor Element (PPE)

64-bit Power Architecture™ with VMX

In-order, 2-way hardware Multi-threading

Coherent Load/Store with 32KB I & D L1 and 512KB L2

Controls the SPEs

Page 20: The Cell Processor


Synergistic Processor Elements (SPEs)

SPE provides computational performance
– Dual issue, up to 16-way 128-bit SIMD
– Dedicated resources: 128 x 128-bit register file, 256KB Local Store
– Each can be dynamically configured to protect resources
– Dedicated DMA engine: up to 16 outstanding requests
– Memory Flow Controller for DMA
– 25 GB/s DMA data transfer
– “I/O channels” for IPC

Separate cores

Simple implementation (e.g. no branch prediction)

No caches

No protected instructions

Page 21: The Cell Processor


SPE Block Diagram

[Diagram: instruction issue unit / instruction line buffer feeding permute, load-store, floating-point, fixed-point, branch, and channel units over a shared register file with result forwarding and staging; a 256KB single-port SRAM Local Store (128B read / 128B write); and a DMA unit attached to the on-chip coherent bus, with data paths of 8, 16, 64, and 128 bytes/cycle.]

Page 22: The Cell Processor


Element Interconnect Bus

Four 16 byte data rings, supporting multiple transfers

96B/cycle peak bandwidth

Over 100 outstanding requests

300+ GByte/sec @ 3.2 GHz

Element Interconnect Bus (EIB)

Page 23: The Cell Processor


Four 16B data rings connecting 12 bus elements
– Two clockwise / two counter-clockwise

Physically overlaps all processor elements

Central arbiter supports up to three concurrent transfers per data ring
– Two-stage, dual round-robin arbiter

Each element port simultaneously supports a 16B-in and a 16B-out data path
– Ring topology is transparent to the element data interface

[Diagram: four 16B rings with a central data arbiter connecting SPE0–SPE7, MIC, PPE, BIF/IOIF0, and IOIF1; each element has 16B-in and 16B-out ports.]

Element Interconnect Bus (EIB)

Page 24: The Cell Processor


[Diagram: EIB ramps 0–11, each with a controller, and a central data arbiter connecting MIC, PPE, SPE0–SPE7, BIF/IOIF0, and IOIF1 across rings 0–3.]

Example of eight concurrent transactions

Page 25: The Cell Processor


I/O and Memory Interfaces

Provides wide bandwidth
– Dual XDR™ memory controller (25.6 GB/s @ 3.2 Gbps)
– Two configurable I/O interfaces (76.8 GB/s @ 6.4 Gbps)
– Configurable number of bytes
– Coherent or I/O protection
– Allows for multiple system configurations

Page 26: The Cell Processor


Page 27: The Cell Processor


4. Cell Platform

Page 28: The Cell Processor


The Cell processor can support many systems
– Game console systems, blades, HDTV, home media servers, supercomputers, … ?

[Diagram: example configurations — a single Cell BE processor with dual XDR™ memory and IOIF0/IOIF1; two processors coupled over the BIF; and four processors connected through a switch, each with XDR™ memory and an IOIF.]

Page 29: The Cell Processor


QS20 Hardware Description

Chassis
– Standard IBM BladeCenter with:
  – 7 blades (2 slots each) with full performance
  – 2 switches (1 Gb Ethernet) with 4 external ports each
– Updated Management Module firmware
– External InfiniBand switches with optional FC ports

Blade (400 GFLOPS)
– Game processor and support logic:
  – Dual-processor configuration, single SMP OS image
  – 1 GB XDRAM
  – Optionally a PCI Express-attached standard graphics adapter
– BladeCenter interface (based on IBM JS20):
  – New blade power system and sense logic
  – Control firmware to connect processor & support logic to the H8 service processor
  – Signal-level converters for processor & support logic
  – 2 InfiniBand (IB) host adapters with 2x IB 4X each
  – Physical link drivers (GbE PHY etc.)

[Diagram: two Cell BE processors with South Bridges and 1/2 GB Rambus XDR DRAM each, connected via level converters, GbE PHY, blade input power & sense logic, and the H8 service processor to the chassis (2x +12V, RS-485, USB, GbEn) and to two IB 4X links.]

Page 30: The Cell Processor


QS20 Blade (w/o heatsinks)

Page 31: The Cell Processor


QS20 Blade Assembly

[Photo: QS20 blade with ATA disk, service processor, South Bridges, InfiniBand cards, and blade bezel.]

Page 32: The Cell Processor


Options – InfiniBand

Up to 2 InfiniBand cards can be attached.

Standard PC InfiniBand card with special bezel
– MHEA28-1TCSB dual-port HCA
– PCI Express x8 interface
– Dual 10 Gb/s InfiniBand 4X ports
– 128 MB local memory
– IBTA v1.1 compatible design

Page 33: The Cell Processor


Cell Software Stack

[Diagram, bottom to top:]
– Hardware: Cell Broadband Engine
– Firmware: low-level FW, SLOF, RTAS, secondary boot loader
– Linux kernel: common code (scheduler, memory management, device drivers) plus powerpc- and cell-specific code (pSeries, PMac, cell) and device drivers
– User space: applications, gcc (ppc64 and SPU backends), glibc

Page 34: The Cell Processor


Cell BE Development Platform

[Diagram: the software stack spans “Cell enablement” to “Cell exploitation” — Cell BE hardware with firmware, graphics, and standard devices; the Cell Linux kernel; a lower-level programming interface; the basic Cell runtime (lib_spe, spelibc, …) and toolchain (gcc, binutils, gdb, oprofile, …); Cell-aware tooling; ppc64 Cell-optimized libraries and Cell-specialized compilers; higher-level and application-level programming interfaces; and segment-specific application frameworks — all on a developer workstation with a standard Linux development environment.]

Cell is an exotic platform and hard to program
– Challenging to exploit SPEs: limited local memory (256 KB) – need to DMA data and code fragments back and forth
– Multi-level parallelism – 8 SPEs, 128-bit wide SIMD units in each SPE
– If done right, the result is impressive performance…

Make Cell easier to program
– Hide complexity in critical libraries
– Compiler support for standard tasks, e.g. overlays, global data access, SW-managed cache, auto-vectorization, auto-parallelization, …
– Smart tooling

Make Cell a standard platform
– Middleware and frameworks provide architecture-specific components and hide Cell specifics from application developers

Page 35: The Cell Processor


SDK Roadmap

SDK 1.0 (11/2005)
– Alpha quality; SDK hosted on FC4 / x86
– OS: initial Linux Cell 2.6.14 patches
– SPE threads runtime, XLC Cell C compiler, SPE gdb debugger, Cell coding sample source
– Documentation: installation scripts, Cell hardware specs, programming docs
– GCC tools from SCEA: gcc 3.0 for Cell, binutils for Cell
– Execution platform: Cell Simulator; hosting platform: Linux/x86 (FC4)

SDK 1.0.1 Refresh (2/2006)
– Execution platform: Cell Simulator, Cell Blade 1 rev 2
– Hosting platforms: Linux/x86 (FC4), Linux/Cell (FC4)*, Linux/Power (FC4)*

SDK 1.1 (7/2006)
– Alpha quality; SDK hosted on FC5 / x86
– Critical Linux Cell performance enhancements: Cell-enhanced functions
– Critical Cell RAS functions: machine check, system error
– Performance analysis tools: OProfile – PPU cycle-only profiling (no SPU)
– GNU toolchain updates, Mambo updates, Julia set sample
– Execution platform: Cell Simulator, Cell Blade 1 rev 3
– Hosting platforms: Linux/x86 (FC5), Linux/Cell (FC5)*, Linux/Power (FC5)*

SDK 1.1.1 Refresh (9/2006)
– Documentation
– Mambo updates for CB1 and 64-bit hosting
– ISO image update

SDK 2.0 (12/2006)
– XL C/C++: Linux/x86, Linux on Power; overlay prototype; auto-SIMD enhancements
– Linux kernel updates: performance enhancements, RAS/debug support, SPE runtime extensions, interrupt controller enhancements
– GNU toolchain updates: FSF integration, GDB multi-thread support, Newlib library optimization, programming model support for overlays
– Programming model preview: overlay support, Accelerated Libraries Framework
– Library enhancements: Vector Math Library (phase 1); MASS library for PPU, MASSV library for PPU/SPU
– IDE: tool integration, remote tool support
– Performance analysis: visualization tools; bandwidth, latency, and lock analyzers; performance debug tools; OProfile (SDK 1.1 plus PPU event-based profiling)
– Mambo: performance model correlation, visualization
– Execution platform: Cell Simulator, Cell Blade 1 rev 3
– Hosting platforms: Linux/x86 (FC5), Linux/Cell (FC5)*, Linux/Power (FC5)*

* Subset of tools

Page 36: The Cell Processor


Cell library content (source): ~156k LOC
– Standard SPE C library subset: optimized SPE C functions, including stdlib, C lib, math, etc.
– Audio resample – resampling audio signals
– FFT – 1D and 2D FFT functions
– gmath – math functions optimized for gaming environments
– image – convolution functions
– intrinsics – generic intrinsic conversion functions
– large-matrix – functions performing large matrix operations
– matrix – basic matrix operations
– mpm – multi-precision math functions
– noise – noise generation functions
– oscillator – basic sound generation functions
– sim – I/O channels to simulated environments
– surface – a set of Bézier curve and surface functions
– sync – synchronization library
– vector – vector operation functions

http://www.alphaworks.ibm.com/tech/cellsw

Page 37: The Cell Processor


5. Cell Applications

Page 38: The Cell Processor


Peak GFLOPS

[Chart: peak single- and double-precision GFLOPS (0–200 scale) for Freescale DC 1.5 GHz, PPC 970 2.2 GHz, AMD DC 2.2 GHz, Intel SC 3.6 GHz, and Cell 3.0 GHz.]

Page 39: The Cell Processor


Cell Processor Example Application Areas

Cell is a processor that excels at processing of rich media content in the context of broad connectivity

– Digital content creation (games and movies)

– Game playing and game serving

– Distribution of (dynamic, media rich) content

– Imaging and image processing

– Image analysis (e.g. video surveillance)

– Next-generation physics-based visualization

– Video conferencing (3D?)

– Streaming applications (codecs etc.)

– Physical simulation & science

Page 40: The Cell Processor


Opportunities for Cell BE Blade

– Aerospace & Defense: signal & image processing; security, surveillance; simulation & training, …

– Petroleum industry: seismic computing; reservoir modeling, …

– Communications equipment: LAN/MAN routers; access; converged networks; security, …

– Medical imaging: CT scan; ultrasound, …

– Consumer / digital media: digital content creation; media platform; video surveillance, …

– Public sector / government & higher education: signal & image processing; computational chemistry, …

– Finance: trade modeling

– Industrial: semiconductor / LCD; video conference

Page 41: The Cell Processor


http://folding.stanford.edu/FAQ-PS3.html
Dr. V. S. Pande, folding@home, Distributed Computing Project, Stanford University

Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation of:
– Protein folding and related diseases, including Alzheimer’s disease, Huntington’s disease, and certain forms of cancer
– By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine

Folding@Home utilizes the new Cell processor in Sony’s PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers.
– 14,000 PlayStation 3s outperform 159,000 Windows computers by more than double!
– In fact, they outperform all the other clients combined.

Page 42: The Cell Processor


Multigrid Finite Element Solver on Cell
Ported using the free SDK

– 235,584 tetrahedra, 48,000 nodes
– 28 iterations in the NKMG solver in 3.8 seconds
– Sustained performance for large objects: 52 GFLOP/s

www.digitalmedics.de | ls7-www.cs.uni-dortmund.de

Page 43: The Cell Processor


Computational Fluid Dynamics Solver on Cell
Ported using the free SDK

– Sustained performance for large objects: not yet benchmarked (3/2007)

www.digitalmedics.de | ls7-www.cs.uni-dortmund.de

Page 44: The Cell Processor


Computational Fluid Dynamics Solver on Cell

A Lattice-Boltzmann solver developed by Fraunhofer ITWM

http://www.itwm.fraunhofer.de/

Page 45: The Cell Processor


Terrain Rendering Engine (TRE) and IBM Blades

[Diagram: aircraft data / field data are combined and rendered on a BladeCenter-1 chassis fitted with commodity QS20 Cell BE blades, adding live video, aerial information, and combat situational awareness for a next-gen GCS.]

Page 46: The Cell Processor


Example: Medical Computer Tomography (CT) Scans

Image the whole heart in one rotation

4D CT – includes time

[Chart: current CT products at 2, 4, 8, and 16 slices; future CT products at 32, 64, 128, and 256 slices.]

Page 47: The Cell Processor


“Image Registration” Using Cell

The moving image is aligned to the fixed image as the registration proceeds.

[Images: fixed image, moving image, and the registration process.]

Page 48: The Cell Processor


6. Cell Programming

Page 49: The Cell Processor


Small single-SPE models – a sample

/* spe_foo.c:
 * A C program to be compiled into an executable called "spe_foo"
 */
int main(int speid, addr64 argp, addr64 envp)
{
    char i;

    /* do something intelligent here */
    i = func_foo(argp);

    /* when the syscall is supported */
    printf("Hello world! my result is %d\n", i);

    return i;
}

Page 50: The Cell Processor


Small single-SPE models – PPE controlling program

extern spe_program_handle spe_foo; /* the SPE image handle from CESOF */

int main()
{
    int rc, status;
    speid_t spe_id;

    /* load & start the spe_foo program on an allocated SPE */
    spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);

    /* wait for the SPE program to complete and return the final status */
    rc = spe_wait(spe_id, &status, 0);

    return status;
}

Page 51: The Cell Processor


Using SPEs

[Diagram: PPE and SPE (SPU + Local Store + MFC) with system memory. In (1), the PPE puts text, static data, and parameters into the Local Store and the SPE puts back results. In (2), the SPE independently stages text and intermediate data, transferring while executing.]

(1) Simple function offload (remote-procedure-call style)
– SPE working set fits in the Local Store
– PPE initiates DMA data/code transfers
– Could easily be supported by a programming environment, e.g. an RPC-style IDL compiler, compiler directives (pragmas), libraries, or even automatic scheduling of code/data to SPEs

(2) Typical (complex) function offload
– SPE working set is larger than the Local Store
– PPE initially loads the SPE LS with small startup code
– SPE initiates DMAs (code/data staging): stream data through code, or stream code through data
– Latency hiding required in most cases
– Requires “high locality of reference” characteristics
– Can be extended to a “services offload” model

Page 52: The Cell Processor


Using SPEs

(3) Pipelining for complex functions
– Functions split up into processing stages
– Direct LS-to-LS communication possible, including LS-to-LS DMA
– Avoids PPE / system memory bottlenecks

(4) Parallel stages for very compute-intense functions
– PPE partitions and distributes work to multiple SPEs

[Diagram: a multi-stage pipeline and parallel stages of SPUs (Local Store + MFC) attached to the PPE and system memory.]

Page 53: The Cell Processor


Large single-SPE programming models

Data or code working set cannot fit completely into a local store

The PPE controlling process, the kernel, and the libspe runtime set up the system memory mapping as the SPE’s secondary memory store

The SPE program accesses the secondary memory store via its software-controlled SPE DMA engine - Memory Flow Controller (MFC)

[Diagram: the PPE controller maps system memory as the SPE’s secondary store; the SPE program moves data between the Local Store and system memory via DMA transactions.]

Page 54: The Cell Processor


Large single-SPE programming models – I/O data

System memory for large size input / output data

– e.g. Streaming model

[Diagram: large arrays int g_ip[512*1024] and int g_op[512*1024] in system memory are streamed via DMA through small Local Store buffers int ip[32] and int op[32]; the SPE program computes op = func(ip).]

Page 55: The Cell Processor


Large single-SPE programming models

System memory as secondary memory store
– Manual management of data buffers
– Automatic software-managed data cache
– Software cache framework libraries
– Compiler runtime support

[Diagram: global objects in system memory are mirrored as SW cache entries in the Local Store next to the SPE program.]

Page 56: The Cell Processor


Large single-SPE programming models

System memory as secondary memory store
– Manual loading of plug-ins into a code buffer
– Plug-in framework libraries
– Automatic software-managed code overlays
– Compiler-generated overlay code

[Diagram: SPE plug-ins a–f reside in system memory; the Local Store holds only the currently loaded plug-ins (e.g. a, b, and e).]

Page 57: The Cell Processor


Large single-SPE programming models – Job Queue

Code and data packaged together as inputs to an SPE kernel program

A multi-tasking model – more discussion later

[Diagram: a job queue in system memory holds code/data packages n, n+1, n+2, …; the SPE kernel DMAs code n and data n into the Local Store.]

Page 58: The Cell Processor


Large single-SPE programming models - DMA

DMA latency handling is critical to overall performance for SPE programs moving large data or code

Data pre-fetching is a key technique to hide DMA latency
– e.g. double-buffering

[Diagram: double-buffering timeline — while the SPE executes Func(input_n) from buffer 1, the DMA engine writes output_{n-1} and prefetches input_{n+1} into buffer 2; the buffer roles alternate each iteration.]

Page 59: The Cell Processor


Large single-SPE programming models - CESOF

Cell Embedded SPE Object Format (CESOF) and PPE/SPE toolchains support the resolution of SPE references to the global system memory objects in the effective-address space.

[Diagram: CESOF EAR symbol resolution binds the _EAR_g_foo structure in the Local Store to char g_foo[512] in the effective-address space; DMA transactions copy it into char local_foo[512] in the Local Store.]

Page 60: The Cell Processor


Parallel programming models – Job Queue

Large set of jobs fed through a group of SPE programs

Streaming is a special case of job queue with regular and sequential data

Each SPE program locks on the shared job queue to obtain next job

For uneven jobs, workloads are self-balanced among available SPEs

[Diagram: the PPE feeds a shared job queue in system memory; SPE0–SPE7 each run Kernel(), pulling inputs I0…In and writing outputs O0…On.]

Page 61: The Cell Processor


Parallel programming models – Pipeline / Streaming

Use LS to LS DMA bandwidth, not system memory bandwidth

Flexibility in connecting pipeline functions

Larger collective code size per pipeline

Load balancing is harder

[Diagram: SPE0–SPE7 run Kernel0()…Kernel7() as pipeline stages; data flows LS-to-LS via DMA, touching system memory only for the initial inputs I0…In and final outputs O0…On.]

Page 62: The Cell Processor


Multi-tasking SPEs – LS resident multi-tasking

Simplest multi-tasking programming model

No memory protection among tasks

Cooperative, non-preemptive, event-driven scheduling

[Diagram: within one SPE's local store, an event dispatcher reads an event queue and invokes the resident tasks a, b, c, d, …, x in turn.]
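The dispatcher can be sketched with a function-pointer table. This is an illustrative sketch: the task functions, queue shape, and run counters are invented for the demo.

```c
#define QLEN 16

typedef void (*task_fn)(void);

/* Event queue in "local store": each event names a resident task. */
static task_fn event_queue[QLEN];
static int q_head = 0, q_tail = 0;
static int runs[3];                  /* per-task run counters, demo only */

static void task_a(void) { runs[0]++; }
static void task_b(void) { runs[1]++; }
static void task_c(void) { runs[2]++; }

void post_event(task_fn t) { event_queue[q_tail++ % QLEN] = t; }

/* Cooperative, non-preemptive dispatcher: each queued task runs to
 * completion before the next event is taken.  There is no memory
 * protection between tasks; they all share the same local store. */
void dispatch(void)
{
    while (q_head < q_tail)
        event_queue[q_head++ % QLEN]();
}
```

Because a task only yields by returning, a misbehaving task can starve the whole SPE, which is the trade-off of this simplest model.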

Page 63: The Cell Processor


Multi-tasking SPEs – Self-managed multi-tasking

Tasks are not LS-resident

A blocked task's context is swapped out of the LS and rescheduled onto the job queue once it unblocks

[Diagram: a task queue in system memory feeds an SPE kernel; the code and data of task n are swapped into local store to run, and a blocked context (task n') is swapped back out to the queue until it unblocks.]

Page 64: The Cell Processor


libspe sample code

#include <libspe.h>

int main(int argc, char *argv[], char *envp[])
{
    spe_program_handle_t *binary;
    speid_t spe_thread;
    int status;

    binary = spe_open_image(argv[1]);
    if (!binary)
        return 1;

    spe_thread = spe_create_thread(0, binary, argv + 1, envp, -1, 0);
    if (!spe_thread)
        return 2;

    spe_wait(spe_thread, &status, 0);
    spe_close_image(binary);
    return status;
}


Page 67: The Cell Processor


Linux on Cell/B.E. kernel components

Platform abstraction: arch/powerpc/platforms/{cell,ps3,beat}

Integrated Interrupt Handling

I/O Memory Management Unit

Power Management

Hypervisor abstractions

South Bridge drivers

SPU file system

Page 68: The Cell Processor


SPU file system

Virtual File System

/spu holds SPU contexts as directories

Files are primary user interfaces

New system calls: spu_create and spu_run

SPU contexts abstracted from real SPU

Preemptive context switching (work in progress)

Page 69: The Cell Processor


PPE on Cell is a 100% compliant ppc64!

A solid base…

– Everything in a distribution, all middleware runs out of the box

– All tools available

– BUT: not optimized to exploit Cell

Toolchain needs to cover Cell aspects

Optimized, critical “middleware” for Cell needed

– Depending on workload requirements

Page 70: The Cell Processor


Using SPEs: Task-Based Abstraction

APIs provided by user space libraries

SPE programs controlled via PPE-originated thread function calls

– spe_create_thread(), ...

Calls on PPE and SPE

– Mailboxes

– DMA

– Events

Simple runtime support (local store heap management, etc.)

Lots of library extensions

– Encryption, signal processing, math operations

Page 71: The Cell Processor


spu_create

int spu_create(const char *pathname, int flags, mode_t mode);

– creates a new context at pathname

– returns an open file descriptor

– the context is destroyed when the fd is closed

Page 72: The Cell Processor


spu_run

uint32_t spu_run(int fd, uint32_t *npc, uint32_t *status);

– transfers flow of control to SPU context fd

– returns when the context has stopped for some reason, e.g.

– exit or forceful abort

– callback from SPU to PPU

– can be interrupted by signals

Page 73: The Cell Processor


PPE programming interfaces

Asynchronous SPE thread API (“libspe 1.x”)

spe_create_thread

spe_wait

spe_kill

. . .

Page 74: The Cell Processor


spe_create_thread implementation

Allocate virtual SPE (spu_create)

Load SPE application code into context

Start PPE thread using pthread_create

New thread calls spu_run
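The four steps above can be sketched with pthreads. This is a hedged host-side sketch: spe_ctx, create_spe_thread, and the function-pointer "program" are illustrative stand-ins, since the spu_create and spu_run syscalls exist only on Cell Linux.

```c
#include <pthread.h>
#include <stddef.h>

/* Stand-in for the SPE context: on Cell this would be the fd from
 * spu_create() with the application image loaded into its local store. */
struct spe_ctx {
    int (*entry)(void);   /* stand-in for the loaded SPE program */
    int status;
};

/* Thread body: on Cell this PPE thread would sit in spu_run() until the
 * SPE context stops (exit, abort, or callback). */
static void *run_ctx(void *arg)
{
    struct spe_ctx *ctx = arg;
    ctx->status = ctx->entry();    /* "spu_run(fd, &npc, &status)" */
    return NULL;
}

/* Sketch of spe_create_thread(): allocate a context, "load" the program,
 * and start a PPE thread that enters the run loop. */
pthread_t create_spe_thread(struct spe_ctx *ctx, int (*program)(void))
{
    pthread_t t;
    ctx->entry = program;          /* "load SPE application code" */
    pthread_create(&t, NULL, run_ctx, ctx);
    return t;
}

/* Hypothetical "SPE program" used for the demo. */
static int sample_program(void) { return 42; }
```

The PPE thread blocking in spu_run is why libspe's API looks asynchronous to the caller: the SPE runs on its own while the shadow PPE thread waits.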

Page 75: The Cell Processor


More libspe interfaces

Event notification

– int spe_get_event(struct spe_event *, int nevents, int timeout);

Message passing

– spe_read_out_mbox(speid_t speid);

– spe_write_in_mbox(speid_t speid);

– spe_write_signal(speid_t speid, unsigned reg, unsigned data);

Local store access

– void *spe_get_ls(speid_t speid);

Page 76: The Cell Processor


GNU tool chain

PPE support

– Just another PowerPC variant…

SPE support

– Just another embedded processor…

Cell/B.E. support

– More than just PPE + SPE!

Page 77: The Cell Processor


Object file format

PPE: regular ppc/ppc64 ELF binaries

SPE: new ELF flavour EM_SPU

– 32-bit big-endian

– No shared libraries

– Manipulated via cross-binutils

– New: Code overlay support

Cell/B.E.: combined object files

– embedspu: link into one binary

– .rodata.spuelf section in PPE object

– CESOF: SPE -> PPE symbol references

Page 78: The Cell Processor


gcc on the PPE

Handled by the “rs6000” back end

Processor-specific tuning

pipeline description

Page 79: The Cell Processor


gcc on the SPE

Merged Jan 3rd

Built as cross-compiler

Handles vector data types, intrinsics

Middle-end support: branch hints, aggressive if-conversion

GCC 4.1 port exploiting auto-vectorization

No Java

Page 80: The Cell Processor


Existing proprietary applications

Games

Volume rendering

Real-time Raytracing

Digital Video

Monte Carlo simulation

Page 81: The Cell Processor


Obviously missing

ffmpeg, mplayer, VLC

VDR, mythTV

Xorg acceleration

OpenSSL

Your project here !!!

Page 82: The Cell Processor


Questions!

Thank you very much for your attention.

Page 83: The Cell Processor


7 Appendix

Page 84: The Cell Processor


Documentation (new or recently updated)

Cell Broadband Engine

– Cell Broadband Engine Architecture V1.0

– Cell Broadband Engine Programming Handbook V1.0

– Cell Broadband Engine Registers V1.3

– SPU C/C++ Language Extensions V2.1

– Synergistic Processor Unit (SPU) Instruction Set Architecture V1.1

– SPU Application Binary Interface Specification V1.4

– SPU Assembly Language Specification V1.3

Cell Broadband Engine Programming using the SDK

– Cell Broadband Engine SDK Installation and User's Guide V1.1

– Cell Broadband Engine Programming Tutorial V1.1

– Cell Broadband Engine Linux Reference Implementation ABI V1.0

– SPE Runtime Management library documentation V1.1

– SDK Sample Library documentation V1.1

– IDL compiler documentation V1.1

New developerWorks Articles

– Maximizing the power of the Cell Broadband Engine processor

– Debugging Cell Broadband Engine systems

Page 85: The Cell Processor


Documentation (new or recently updated)

IBM Cell Broadband Engine Full-System Simulator

– IBM Full-System Simulator Users Guide

– IBM Full-System Simulator Command Reference

– Performance Analysis with the IBM Full-System Simulator

– IBM Full-System Simulator BogusNet HowTo

PowerPC Architecture Book

– Book I: PowerPC User Instruction Set Architecture Version 2.02

– Book II: PowerPC Virtual Environment Architecture Version 2.02

– Book III: PowerPC Operating Environment Architecture Version 2.02

– Vector/SIMD Multimedia Extension Technology Programming Environments Manual Version 2.06c

Page 86: The Cell Processor


Links

Cell Broadband Engine
http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine

IBM BladeCenter QS20
http://www-03.ibm.com/technology/splash/qs20/

Cell Broadband Engine resource center
http://www-128.ibm.com/developerworks/power/cell/

Cell Broadband Engine resource center - Documentation archive
http://www-128.ibm.com/developerworks/power/cell/docs_documentation.html

Cell Broadband Engine technology
http://www.alphaworks.ibm.com/topics/cell

Power.org's Cell Developers Corner
http://www.power.org/resources/devcorner/cellcorner

Barcelona Supercomputer Center - Linux on Cell
http://www.bsc.es/projects/deepcomputing/linuxoncell/

Barcelona Supercomputer Center - Documentation
http://www.bsc.es/plantillaH.php?cat_id=262

Heiko J Schick's Cell Bookmarks
http://del.icio.us/schihei/Cell

Page 87: The Cell Processor
