The Cell Processor: Technological Breakthrough or Yet Another Over-hyped Chip?
Open Systems Design and Development
2007-04-12 | Heiko J Schick <[email protected]>
The Cell Processor
Computing of tomorrow or yesterday?
© 2007 IBM Corporation
Agenda
1. Introduction
2. Limiters to Processor Performance
3. Cell Architecture
4. Cell Platform
5. Cell Applications
6. Cell Programming
7. Appendix
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation3
1. Introduction
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation4
Cell History
IBM, SCEI / Sony and Toshiba Alliance formed in 2000
Design Center opened in March 2001 (Based in Austin, Texas)
Single Cell BE operational Spring 2004
2-way SMP operational Summer 2004
February 7, 2005: First technical disclosures
November 9, 2005: Open Source SDK Published
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation5
The problem is…
…the view from the computer room!
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation6
Outlook
Source: Kurzweil
“Computer performance has been increasing exponentially for 100 years!”
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation7
But what could you do if all objects were intelligent…
…and connected?
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation8
What could you do with unlimited computing power… for pennies?
Could you predict the path of a storm down to the square kilometer?
Could you identify another 20% of proven oil reserves without drilling one hole?
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation9
2. Limiters to Processor Performance
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation10
Power Wall / Voltage Wall
Power components:
– Active Power
– Passive Power: gate leakage, sub-threshold leakage (source-drain leakage)
Source: Tom’s Hardware Guide
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation11
Memory Wall
Main memory is now nearly 1,000 cycles from the processor
– Situation is worse with (on-chip) SMP
Memory latency penalties drive inefficiency in the design
– Expensive and sophisticated hardware tries to deal with it
– Programmers who try to gain control of cache content are hindered by the hardware mechanisms
Latency-induced bandwidth limitations
– Much of the bandwidth to memory can only be used speculatively
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation12
Frequency Wall
Increasing frequencies and deeper pipelines have reached diminishing returns on performance
Returns are negative if power is taken into account
Results of studies depend on the issue width of the processor
– The wider the processor, the slower it wants to be
– Simultaneous Multithreading helps to use issue slots efficiently
Results depend on the number of architected registers and the workload
– More registers tolerate deeper pipelines
– Fewer random branches in the application tolerate deeper pipelines
Open Systems Design and Development
The Cell Processor | Computing for tomorrow or yesterday? © 2007 IBM Corporation13
Microprocessor Efficiency
Gelsinger’s law
– 1.4x more performance for 2x more power
Hofstee’s corollary
– 1/1.4x efficiency loss in every generation
– Examples: Cache size, Out-of-Order, Super-scalar, etc.
Source: Tom’s Hardware Guide
Increasing performance requires increasing efficiency !!!
Attacking the Performance Walls
Multi-Core Non-Homogeneous Architecture
– Control Plane vs. Data Plane processors
– Attacks the Power Wall
3-level Model of Memory
– Main Memory, Local Store, Registers
– Attacks the Memory Wall
Large Shared Register File & SW-Controlled Branching
– Allows deeper pipelines (11 FO4 helps power)
– Attacks the Frequency Wall
3. Cell Architecture
Cell BE Processor
~250M transistors
~235 mm²
Top frequency >3 GHz
9 cores, 10 threads
>200 GFlops (SP) @ 3.2 GHz
>20 GFlops (DP) @ 3.2 GHz
Up to 25.6 GB/s memory bandwidth
Up to 76.8 GB/s I/O bandwidth
~$400M (US) design investment
Key Attributes of Cell
Cell is Multi-Core
– Contains a 64-bit Power Architecture™ core
– Contains 8 Synergistic Processor Elements (SPEs)
Cell is a Flexible Architecture
– Multi-OS support (including Linux) with virtualization technology
– Path for OS, legacy apps, and software development
Cell is a Broadband Architecture
– SPE is a RISC architecture with SIMD organization and a Local Store
– 128+ concurrent transactions to memory per processor
Cell is a Real-Time Architecture
– Resource allocation (for bandwidth measurement)
– Locking caches (via Replacement Management Tables)
Cell is a Security-Enabled Architecture
– SPEs dynamically reconfigurable as secure processors
Power Processor Element (PPE)
64-bit Power Architecture™ with VMX
In-order, 2-way hardware Multi-threading
Coherent Load/Store with 32KB I & D L1 and 512KB L2
Controls the SPEs
Synergistic Processor Elements (SPEs)
SPE provides computational performance
– Dual issue, up to 16-way 128-bit SIMD
– Dedicated resources: 128 x 128-bit register file, 256KB Local Store
– Each can be dynamically configured to protect resources
– Dedicated DMA engine: up to 16 outstanding requests
– Memory Flow Controller for DMA
– 25 GB/s DMA data transfer
– “I/O Channels” for IPC
Separate cores
Simple implementation (e.g. no branch prediction)
No caches
No protected instructions
SPE Block Diagram
[Diagram: an Instruction Issue Unit / Instruction Line Buffer feeds the Branch, Fixed-Point, Floating-Point, Permute, Load-Store, and Channel Units through a Register File with Result Forwarding and Staging; the 256KB single-port SRAM Local Store (128B read / 128B write) and the DMA Unit connect to the On-Chip Coherent Bus; datapath widths are 8, 16, 64, and 128 Bytes/cycle]
Element Interconnect Bus
Four 16 byte data rings, supporting multiple transfers
96B/cycle peak bandwidth
Over 100 outstanding requests
300+ GByte/sec @ 3.2 GHz
Four 16B data rings connecting 12 bus elements
– Two clockwise / two counter-clockwise
Physically overlaps all processor elements
Central arbiter supports up to three concurrent transfers per data ring
– Two-stage, dual round-robin arbiter
Each element port simultaneously supports 16B-in and 16B-out data paths
– Ring topology is transparent to the element data interface
[Diagram: Element Interconnect Bus (EIB) — SPE0/2/4/6, SPE1/3/5/7, the PPE, MIC, BIF/IOIF0, and IOIF1 attach to the four 16B rings through a central Data Arb]
Example of eight concurrent transactions
[Diagram: the twelve bus elements (MIC, SPE0–SPE6, BIF/IOIF0, PPE, SPE1–SPE7, IOIF1) connect to Ring0–Ring3 through per-ramp controllers (Ramp 0–11); the central Data Arbiter controls the rings, allowing eight transfers to proceed concurrently]
I/O and Memory Interfaces
Provides wide bandwidth
– Dual XDR™ memory controller (25.6 GB/s @ 3.2 Gbps)
– Two configurable I/O interfaces (76.8 GB/s @ 6.4 Gbps)
– Configurable number of bytes
– Coherent or I/O protection
– Allows for multiple system configurations
4. Cell Platform
Cell processor can support many systems
– Game console systems, blades, HDTV, home media servers, supercomputers, … ?
[Diagram: a single Cell BE processor with dual XDR™ memory and IOIF0/IOIF1; a two-way system coupling two Cell BE processors over the BIF; and a four-way configuration joining pairs of Cell BE processors through a switch on the BIF/IOIF links]
QS20 Hardware Description
Chassis
– Standard IBM BladeCenter with:
– 7 blades (2 slots each) with full performance
– 2 switches (1Gb Ethernet) with 4 external ports each
– Updated Management Module firmware
– External InfiniBand switches with optional FC ports
Blade (400 GFLOPs)
– Game processor and support logic:
– Dual-processor configuration, single SMP OS image
– 1GB XDRAM
– Optionally PCI Express-attached standard graphics adapter
– BladeCenter interface (based on IBM JS20):
– New blade power system and sense logic
– Control firmware to connect processor & support logic to the H8 service processor
– Signal level converters for processor & support logic
– 2 InfiniBand (IB) host adapters with 2x IB 4X each
– Physical link drivers (GbE Phy etc.)
[Diagram: two Cell BE processors, each with Rambus DRAM (1/2GB) and a South Bridge, IB 4X adapters, level converters, GbE Phy, H8 SP, and blade input power & sense logic, connected to the chassis via 2x (+12V, RS-485, USB, GbEn)]
QS20 Blade (w/o heatsinks)
QS20 Blade Assembly
[Photos: ATA disk, service processor, South Bridges, InfiniBand cards, blade bezel]
Options – InfiniBand
Up to 2 InfiniBand cards can be attached.
Standard PC InfiniBand card with special bezel
– MHEA28-1TCSB Dual-Port HCA
– PCI Express x8 interface
– Dual 10 Gb/s InfiniBand 4X ports
– 128 MB local memory
– IBTA v1.1 compatible design
Cell Software Stack
User space: Applications; gcc (ppc64 and spu backends); glibc
Linux common code: memory management, scheduler, device drivers
powerpc- and cell-specific Linux code: platform support for (pSeries), (PMac), and cell; device drivers
Firmware: SLOF, RTAS, Low-level FW, Secondary Boot Loader
Hardware: Cell Broadband Engine
Cell BE Development Platform
[Stack diagram, spanning “Cell enablement” to “Cell exploitation”: the Cell BE with firmware, graphics, and standard devices; the Cell Linux kernel; a lower-level programming interface; the basic Cell runtime (lib_spe, spelibc, …); the basic Cell toolchain (gcc, binutils, gdb, oprofile, …); Cell-aware tooling; ppc64 Cell-optimized libraries and Cell-specialized compilers; higher-level and application-level programming interfaces; application frameworks (segment-specific); all within a standard Linux development environment on a developer workstation]
Cell is an exotic platform and hard to program
– Challenging to exploit the SPEs: limited local memory (256 KB) means DMAing data and code fragments back and forth
– Multi-level parallelism: 8 SPEs, each with 128-bit wide SIMD units
If done right, the result is impressive performance…
Make Cell easier to program
– Hide complexity in critical libraries
– Compiler support for standard tasks, e.g., overlays, global data access, SW-managed cache, auto-vectorization, auto-parallelization, …
– Smart tooling
Make Cell a standard platform
– Middleware and frameworks provide architecture-specific components and hide Cell specifics from the application developer
SDK 1.0 (11/2005)
– Alpha quality; SDK hosted on FC4 / x86
– OS: initial Linux Cell 2.6.14 patches
– SPE threads runtime, XLC Cell C compiler, SPE gdb debugger, Cell coding sample source, documentation
– Installation scripts, Cell hardware specs, programming docs
– GCC tools from SCEA: gcc 3.0 for Cell, binutils for Cell
– Execution platform: Cell Simulator; hosting platform: Linux/x86 (FC4)
SDK 1.0.1 Refresh (2/2006)
– Execution platform: Cell Simulator, Cell Blade 1 rev 2
– Hosting platform: Linux/x86 (FC4), Linux/Cell (FC4)*, Linux/Power (FC4)*
SDK 1.1 (7/2006)
– Alpha quality; SDK hosted on FC5 / x86
– Critical Linux Cell performance enhancements (Cell enhanced functions)
– Critical Cell RAS functions (machine check, system error)
– Performance analysis tools: Oprofile – PPU cycle-only profiling (no SPU)
– GNU toolchain updates, Mambo updates, Julia set sample
– Execution platform: Cell Simulator, Cell Blade 1 rev 3
– Hosting platform: Linux/x86 (FC5), Linux/Cell (FC5)*, Linux/Power (FC5)*
SDK 1.1.1 Refresh (9/2006)
– Documentation; Mambo updates for CB1 and 64-bit hosting; ISO image update
SDK 2.0 (12/2006)
– XL C/C++: Linux/x86, LoP; overlay prototype; auto-SIMD enhancements
– Linux kernel updates: performance enhancements; RAS/debug support; SPE runtime extensions; interrupt controller enhancements
– GNU toolchain updates: FSF integration; GDB multi-thread support; Newlib library optimization; programming model support for overlays
– Programming model preview: overlay support; Accelerated Libraries Framework
– Library enhancements: Vector Math Library phase 1; MASS library for PPU; MASSV library for PPU/SPU
– IDE: tool integration; remote tool support
– Performance analysis: visualization tools; bandwidth, latency, and lock analyzers; performance debug tools; Oprofile – SDK 1.1 plus PPU event-based profiling
– Mambo: performance model correlation; visualization
* Subset of tools
Cell library content (source) ~156k LOC
Standard SPE C library subset
– Optimized SPE C functions, including stdlib, c lib, math, etc.
audio resample – resampling audio signals
fft – 1D and 2D FFT functions
gmath – math functions optimized for gaming environments
image – convolution functions
intrinsics – generic intrinsic conversion functions
large-matrix – functions performing large matrix operations
matrix – basic matrix operations
mpm – multi-precision math functions
noise – noise generation functions
oscillator – basic sound generation functions
sim – I/O channels to simulated environments
surface – a set of Bezier curve and surface functions
sync – synchronization library
vector – vector operation functions
http://www.alphaworks.ibm.com/tech/cellsw
5. Cell Applications
Peak GFLOPs
[Chart: peak GFLOPs, single vs. double precision (0–200 scale) — FreeScale DC 1.5 GHz, PPC 970 2.2 GHz, AMD DC 2.2 GHz, Intel SC 3.6 GHz, Cell 3.0 GHz]
Cell Processor Example Application Areas
Cell is a processor that excels at processing of rich media content in the context of broad connectivity
– Digital content creation (games and movies)
– Game playing and game serving
– Distribution of (dynamic, media rich) content
– Imaging and image processing
– Image analysis (e.g. video surveillance)
– Next-generation physics-based visualization
– Video conferencing (3D?)
– Streaming applications (codecs etc.)
– Physical simulation & science
Opportunities for Cell BE Blade
Aerospace & Defense – signal & image processing; security, surveillance; simulation & training, …
Petroleum Industry – seismic computing; reservoir modeling, …
Communications Equipment – LAN/MAN routers; access; converged networks; security, …
Medical Imaging – CT scan; ultrasound, …
Consumer / Digital Media – digital content creation; media platform; video surveillance, …
Public Sector / Gov’t & Higher Educ. – signal & image processing; computational chemistry, …
Finance – trade modeling
Industrial – semiconductor / LCD; video conference
http://folding.stanford.edu/FAQ-PS3.html
Dr. V. S. Pande, folding@home, Distributed Computing Project, Stanford University
Since 2000, Folding@Home (FAH) has led to a major jump in the capabilities of molecular simulation of:
– Protein folding and related diseases, including Alzheimer’s Disease, Huntington’s Disease, and certain forms of cancer.
– By joining together hundreds of thousands of PCs throughout the world, calculations which were previously considered impossible have now become routine.
Folding@Home utilizes the new Cell processor in Sony’s PLAYSTATION 3 (PS3) to achieve performance previously only possible on supercomputers.
– 14,000 PlayStation 3s outperform 159,000 Windows computers by more than double!
– In fact, they outperform all the other clients combined.
Multigrid Finite Element Solver on Cell
Ported by www.digitalmedics.de / ls7-www.cs.uni-dortmund.de using the free SDK
235,584 tetrahedra; 48,000 nodes
28 iterations in the NKMG solver in 3.8 seconds
Sustained performance for large objects: 52 GFLOP/s
Computational Fluid Dynamics Solver on Cell
Ported by www.digitalmedics.de / ls7-www.cs.uni-dortmund.de using the free SDK
Sustained performance for large objects: not yet benchmarked (3/2007)
Computational Fluid Dynamics Solver on Cell
A Lattice-Boltzmann solver developed by Fraunhofer ITWM
http://www.itwm.fraunhofer.de/
Terrain Rendering Engine (TRE) and IBM Blades
[Diagram: aircraft/field data flow into a BladeCenter-1 chassis of commodity QS20 Cell BE blades, which combine data & render; live video, aerial information, and combat situational awareness are added for a next-gen GCS]
Example: Medical Computer Tomography (CT) Scans
Image the whole heart in 1 rotation
4D CT – includes time
[Chart: slices per rotation — current CT products: 2, 4, 8, 16; future CT products: 32, 64, 128, 256]
“Image Registration” Using Cell
The moving image is aligned to the fixed image as the registration proceeds.
[Figure: fixed image, moving image, registration process]
6. Cell Programming
Small single-SPE models – a sample
/* spe_foo.c:
 * A C program to be compiled into an executable called "spe_foo"
 */
int main(int speid, addr64 argp, addr64 envp)
{
    char i;

    /* do something intelligent here */
    i = func_foo(argp);

    /* when the syscall is supported */
    printf("Hello world! my result is %d \n", i);

    return i;
}
Small single-SPE models – PPE controlling program

extern spe_program_handle spe_foo; /* the spe image handle from CESOF */

int main()
{
    int rc, status;
    speid_t spe_id;

    /* load & start the spe_foo program on an allocated spe */
    spe_id = spe_create_thread(0, &spe_foo, 0, NULL, -1, 0);

    /* wait for spe prog. to complete and return final status */
    rc = spe_wait(spe_id, &status, 0);

    return status;
}
Using SPEs
[Diagrams: (1) the PPE puts text, static data, and parameters into the SPE Local Store through the MFC; the SPE executes and puts results back. (2) The PPE puts only the initial text, static data, and parameters; the SPE independently stages text and intermediate data from system memory, transferring while executing.]
(1) Simple Function Offload – Remote Procedure Call style
– SPE working set fits in the Local Store
– PPE initiates DMA data/code transfers
– Could be easily supported by a programming environment, e.g., an RPC-style IDL compiler, compiler directives (pragmas), libraries, or even automatic scheduling of code/data to SPEs
(2) Typical (Complex) Function Offload
– SPE working set larger than the Local Store
– PPE initially loads the SPE LS with small startup code
– SPE initiates DMAs (code/data staging): stream data through code, or stream code through data
– Latency hiding required in most cases
– Requires "high locality of reference" characteristics
– Can be extended to a "services offload" model
Using SPEs
(3) Pipelining for complex functions
– Functions are split up into processing stages
– Direct LS-to-LS communication is possible, including LS-to-LS DMA
– Avoids PPE / system memory bottlenecks
(4) Parallel stages for very compute-intense functions
– PPE partitions and distributes work to multiple SPEs
[Diagrams: a multi-stage pipeline of SPEs (SPU + Local Store + MFC) passing data LS-to-LS, and parallel stages in which the PPE distributes work from system memory across several SPEs]
Large single-SPE programming models
Data or code working set cannot fit completely into a local store
The PPE controlling process, kernel, and libspe runtime set up the system memory mapping as SPE’s secondary memory store
The SPE program accesses the secondary memory store via its software-controlled SPE DMA engine - Memory Flow Controller (MFC)
[Diagram: the SPE program in the Local Store reaches the system-memory secondary store through DMA transactions; the PPE controller maps system memory for SPE DMA transfers]
Large single-SPE programming models – I/O data
System memory for large size input / output data
– e.g. Streaming model
[Diagram: large arrays int g_ip[512*1024] and int g_op[512*1024] in system memory are streamed by DMA through small Local Store buffers int ip[32] and int op[32]; the SPE program computes op = func(ip)]
Large single-SPE programming models
System memory as secondary memory store
– Manual management of data buffers
– Automatic software-managed data cache
– Software cache framework libraries
– Compiler runtime support
[Diagram: global objects in system memory are mirrored as SW cache entries in the SPE program’s local store]
Large single-SPE programming models
System memory as secondary memory store
– Manual loading of plug-ins into a code buffer
– Plug-in framework libraries
– Automatic software-managed code overlay
– Compiler generated overlaying code
[Diagram: SPE plug-ins a–f reside in system memory; a subset (e.g. a, b, e) is loaded into the local store code buffer at a time]
Large single-SPE prog. models – Job Queue
Code and data packaged together as inputs to an SPE kernel program
A multi-tasking model – more discussion later
[Diagram: a job queue in system memory holds packaged code/data entries n, n+1, n+2, …; the SPE kernel DMAs code n and data n into the local store]
Large single-SPE programming models - DMA
DMA latency handling is critical to overall performance for SPE programs moving large data or code
Data pre-fetching is a key technique to hide DMA latency
– e.g. double-buffering
[Timing diagram: while the SPE executes Func(input_n) from buffer pair 1 (I Buf 1 / O Buf 1), the DMA engine fetches input_n+1 into buffer pair 2 and writes back output_n-1; DMA transfers and SPE execution overlap in time]
Large single-SPE programming models - CESOF
Cell Embedded SPE Object Format (CESOF) and PPE/SPE toolchains support the resolution of SPE references to the global system memory objects in the effective-address space.
[Diagram: the CESOF _EAR_g_foo structure resolves the effective-address symbol for char g_foo[512] in the effective address space, letting the SPE DMA it into char local_foo[512] in the Local Store]
Parallel programming models – Job Queue
Large set of jobs fed through a group of SPE programs
Streaming is a special case of job queue with regular and sequential data
Each SPE program locks on the shared job queue to obtain next job
For uneven jobs, workloads are self-balanced among available SPEs
[Diagram: the PPE and SPE0–SPE7, each running Kernel(), share an input queue I0…In and an output queue O0…On in system memory]
Parallel programming models – Pipeline / Streaming
Use LS-to-LS DMA bandwidth, not system memory bandwidth
Flexibility in connecting pipeline functions
Larger collective code size per pipeline
Load balance is harder
[Diagram: SPE0 Kernel0() through SPE7 Kernel7() form a pipeline passing data by LS-to-LS DMA; the PPE feeds input queue I0…In and drains output queue O0…On in system memory]
Multi-tasking SPEs – LS resident multi-tasking
Simplest multi-tasking programming model
No memory protection among tasks
Co-operative, Non-preemptive, event-driven scheduling
[Diagram: tasks a, b, c, d, and x are resident in the Local Store of SPE n; an event dispatcher pops entries from an event queue (a, c, a, d, x, a, c, d) and runs the corresponding task]
Multi-tasking SPEs – Self-managed multi-tasking
Non-LS-resident
A blocked job context is swapped out of the LS and scheduled back to the job queue later, once unblocked
[Diagram: as in the job-queue model, the SPE kernel runs task n out of the local store; a blocked context task n’ is swapped out to a task queue in system memory alongside the job queue]
libspe sample code
#include <libspe.h>

int main(int argc, char *argv[], char *envp[])
{
    spe_program_handle_t *binary;
    speid_t spe_thread;
    int status;

    binary = spe_open_image(argv[1]);
    if (!binary)
        return 1;

    spe_thread = spe_create_thread(0, binary, argv+1,
                                   envp, -1, 0);
    if (!spe_thread)
        return 2;

    spe_wait(spe_thread, &status, 0);
    spe_close_image(binary);
    return status;
}
Linux on Cell/B.E. kernel components
Platform abstraction: arch/powerpc/platforms/{cell,ps3,beat}
Integrated Interrupt Handling
I/O Memory Management Unit
Power Management
Hypervisor abstractions
South Bridge drivers
SPU file system
SPU file system
Virtual File System
/spu holds SPU contexts as directories
Files are primary user interfaces
New system calls: spu_create and spu_run
SPU contexts abstracted from real SPU
Preemptive context switching (work in progress)
PPE on Cell is a 100% compliant ppc64!
A solid base…
– Everything in a distribution, all middleware runs out of the box
– All tools available
– BUT: not optimized to exploit Cell
Toolchain needs to cover Cell aspects
Optimized, critical “middleware” for Cell needed
– Depending on workload requirements
Using SPEs: Task Based Abstraction
APIs provided by user space libraries
SPE programs controlled via PPE-originated thread function calls
– spe_create_thread(), ...
Calls on PPE and SPE
– Mailboxes
– DMA
– Events
Simple runtime support (local store heap management, etc.)
Lots of library extensions
– Encryption, signal processing, math operations
spu_create
int spu_create(const char *pathname, int flags, mode_t mode);
– creates a new context at pathname
– returns an open file descriptor
– context is destroyed when the fd is closed
spu_run
uint32_t spu_run(int fd, uint32_t *npc, uint32_t *status);
– transfers flow of control to SPU context fd
– returns when the context has stopped for some reason, e.g.
– exit or forceful abort
– callback from SPU to PPU
– can be interrupted by signals
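Together, the two system calls can drive an SPU directly, without libspe. A minimal sketch (this assumes a Cell/B.E. kernel with spufs mounted on /spu; the __NR_spu_create/__NR_spu_run syscall numbers exist only on powerpc, and loading the SPU program into the context is omitted):

```c
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

int run_spu_context(void)
{
    /* spu_create: make a new context directory under /spu and
       return an open file descriptor for it */
    int fd = syscall(__NR_spu_create, "/spu/myctx", 0, 0700);
    if (fd < 0)
        return -1;

    /* ... write the SPU program into the context's mem file here ... */

    uint32_t npc = 0;       /* entry point in SPU local store */
    uint32_t status = 0;

    /* spu_run: transfer control to the SPU context; returns when
       the context stops (exit, abort, PPU callback) or a signal
       interrupts it */
    syscall(__NR_spu_run, fd, &npc, &status);

    /* closing the descriptor destroys the context */
    close(fd);
    return (int)status;
}
```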
PPE programming interfaces
Asynchronous SPE thread API (“libspe 1.x”)
spe_create_thread
spe_wait
spe_kill
. . .
spe_create_thread implementation
Allocate virtual SPE (spu_create)
Load SPE application code into context
Start PPE thread using pthread_create
New thread calls spu_run
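These four steps can be sketched as follows (the helper names here are hypothetical, not the actual libspe internals):

```c
#include <pthread.h>

struct spe_ctx { int fd; };

/* hypothetical wrappers around spufs and the SPE ELF loader */
extern int   ctx_create(struct spe_ctx *ctx);              /* spu_create */
extern int   ctx_load_image(struct spe_ctx *ctx, const void *elf);
extern void *ctx_run(void *ctx);                           /* calls spu_run */

int sketch_spe_create_thread(struct spe_ctx *ctx, const void *elf,
                             pthread_t *thread)
{
    if (ctx_create(ctx) < 0)           /* 1. allocate virtual SPE      */
        return -1;
    if (ctx_load_image(ctx, elf) < 0)  /* 2. load SPE application code */
        return -1;
    /* 3. start a PPE thread; 4. the new thread calls spu_run */
    return pthread_create(thread, NULL, ctx_run, ctx);
}
```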
More libspe interfaces
Event notification
– int spe_get_event(struct spe_event *, int nevents, int timeout);
Message passing
– spe_read_out_mbox(speid_t speid);
– spe_write_in_mbox(speid_t speid);
– spe_write_signal(speid_t speid, unsigned reg, unsigned data);
Local store access
– void *spe_get_ls(speid_t speid);
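For example, a PPE-side mailbox round trip with these calls might look like this (a sketch against libspe 1.x, assuming an already-running SPE thread that answers on its outbound mailbox; spe_stat_out_mbox is the libspe 1.x mailbox status call, and error handling is omitted):

```c
#include <libspe.h>

unsigned int mbox_ping(speid_t spe, unsigned int value)
{
    spe_write_in_mbox(spe, value);       /* PPE -> SPE inbound mailbox  */
    while (spe_stat_out_mbox(spe) == 0)  /* poll for a reply            */
        ;
    return spe_read_out_mbox(spe);       /* SPE -> PPE outbound mailbox */
}
```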
GNU tool chain
PPE support
– Just another PowerPC variant...
SPE support
– Just another embedded processor...
Cell/B.E. support
– More than just PPE + SPE!
Object file format
PPE: regular ppc/ppc64 ELF binaries
SPE: new ELF flavour EM_SPU
– 32-bit big-endian
– No shared libraries
– Manipulated via cross-binutils
– New: Code overlay support
Cell/B.E.: combined object files
– embedspu: link into one binary
– .rodata.spuelf section in PPE object
– CESOF: SPE -> PPE symbol references
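On the PPE side, an image embedded this way is just an external symbol (a sketch; my_spu_program stands in for whatever handle name was passed to embedspu):

```c
#include <libspe.h>

/* provided by embedspu, which places the SPE ELF image in a
   .rodata.spuelf section of the PPE object */
extern spe_program_handle_t my_spu_program;

int start_embedded_spe(void)
{
    speid_t spe = spe_create_thread(0, &my_spu_program,
                                    NULL, NULL, -1, 0);
    return spe ? 0 : -1;
}
```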
gcc on the PPE
Handled by the “rs6000” back end
Processor-specific tuning
pipeline description
gcc on the SPE
Merged Jan 3rd
Built as cross-compiler
Handles vector data types, intrinsics
Middle-end support: branch hints, aggressive if-conversion
GCC 4.1 port exploiting auto-vectorization
No Java
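A small SPE-side example of the vector data types and intrinsics the port handles (a sketch; compiled with spu-gcc using the SDK's spu_intrinsics.h):

```c
#include <spu_intrinsics.h>

/* element-wise a*b+c on four floats at once, using the SPU's
   fused multiply-add intrinsic */
vector float madd4(vector float a, vector float b, vector float c)
{
    return spu_madd(a, b, c);
}
```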
Existing proprietary applications
Games
Volume rendering
Real-time Raytracing
Digital Video
Monte Carlo simulation
Obviously missing
ffmpeg, mplayer, VLC
VDR, MythTV
Xorg acceleration
OpenSSL
Your project here !!!
Questions!
Thank you very much for your attention.
7 Appendix
Documentation (new or recently updated)
Cell Broadband Engine
– Cell Broadband Engine Architecture V1.0
– Cell Broadband Engine Programming Handbook V1.0
– Cell Broadband Engine Registers V1.3
– SPU C/C++ Language Extensions V2.1
– Synergistic Processor Unit (SPU) Instruction Set Architecture V1.1
– SPU Application Binary Interface Specification V1.4
– SPU Assembly Language Specification V1.3
Cell Broadband Engine Programming using the SDK
– Cell Broadband Engine SDK Installation and User's Guide V1.1
– Cell Broadband Engine Programming Tutorial V1.1
– Cell Broadband Engine Linux Reference Implementation ABI V1.0
– SPE Runtime Management library documentation V1.1
– SDK Sample Library documentation V1.1
– IDL compiler documentation V1.1
New developerWorks Articles
– Maximizing the power of the Cell Broadband Engine processor
– Debugging Cell Broadband Engine systems
Documentation (new or recently updated)
IBM Cell Broadband Engine Full-System Simulator
– IBM Full-System Simulator Users Guide
– IBM Full-System Simulator Command Reference
– Performance Analysis with the IBM Full-System Simulator
– IBM Full-System Simulator BogusNet HowTo
PowerPC Architecture Book
– Book I: PowerPC User Instruction Set Architecture Version 2.02
– Book II: PowerPC Virtual Environment Architecture Version 2.02
– Book III: PowerPC Operating Environment Architecture Version 2.02
– Vector/SIMD Multimedia Extension Technology Programming Environments Manual Version 2.06c
Links
Cell Broadband Engine
http://www-306.ibm.com/chips/techlib/techlib.nsf/products/Cell_Broadband_Engine
IBM BladeCenter QS20
http://www-03.ibm.com/technology/splash/qs20/
Cell Broadband Engine resource center
http://www-128.ibm.com/developerworks/power/cell/
Cell Broadband Engine resource center - Documentation archive
http://www-128.ibm.com/developerworks/power/cell/docs_documentation.html
Cell Broadband Engine technology
http://www.alphaworks.ibm.com/topics/cell
Power.org's Cell Developers Corner
http://www.power.org/resources/devcorner/cellcorner
Barcelona Supercomputer Center - Linux on Cell
http://www.bsc.es/projects/deepcomputing/linuxoncell/
Barcelona Supercomputer Center - Documentation
http://www.bsc.es/plantillaH.php?cat_id=262
Heiko J Schick's Cell Bookmarks
http://del.icio.us/schihei/Cell