POD: A Heterogeneous Many-Core Platform Using a 3D-Integrated Acceleration Layer
Hsien-Hsin S. Lee, Georgia Tech
with
Dong Hyuk Woo, Georgia Tech
Josh Fryman, Intel Corporation
Allan Knies, Intel Research Berkeley
Marsha Eng, Intel Corporation
2Hsien-Hsin S. Lee: POD
‘Many-Core’? Are We There Yet?
• What is new?
  – On-die
• Performance scalability
• Area scalability
  – How many cores on a die?
• Power scalability
  – How much power per core?
Power Issue 1: Thermal Control
[Figure: cooling solutions: 3D Cooler Pro; pure copper cooler; jet; Cooligy’s channel]
[Figure: Cooking-Aware Computing (Source: Kevin Skadron, UVa); PS3 Grill]
$P = C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak}$
Power Issue 2: Supply
Google Server Farms (Oregon)
Scalable Power Consumption?

Technology (nm)     65     45     32     22     16
# of cores           4      8     16     32     64
Frequency (GHz)    3.0    3.0    3.0    3.0    3.0
Vdd (V)            1.3    1.1    0.9    0.8    0.7
Power / core (W)  37.5   26.8   18.0   14.2   10.9
Total power (W)    150    215    288    454    696

* The above pictures (>=8) do not represent any current or future Intel products.
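The total-power row can be checked from the two rows above it. The sketch below is only an illustration of that arithmetic: the struct and function names are hypothetical, and the numbers are copied from the table. The products match the table's totals to within the rounding of the per-core figures.

```c
#include <assert.h>
#include <stdio.h>

/* One column of the table: a technology node and its per-core figures. */
struct node {
    int    nm;              /* technology node (nm) */
    int    cores;           /* number of cores */
    double watts_per_core;  /* power per core (W) */
};

/* Total chip power is cores x power per core. */
double total_power(const struct node *n)
{
    assert(n->cores > 0);
    return n->cores * n->watts_per_core;
}

/* Print the computed total for every node in the table. */
void print_power_table(void)
{
    static const struct node table[] = {
        {65, 4, 37.5}, {45, 8, 26.8}, {32, 16, 18.0},
        {22, 32, 14.2}, {16, 64, 10.9},
    };
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        printf("%2d nm: %2d cores -> %.1f W\n",
               table[i].nm, table[i].cores, total_power(&table[i]));
}
```

Calling `print_power_table()` from a `main` lists the five totals. Per-core power falls by roughly 3.4x from 65 nm to 16 nm while the core count grows 16x, which is why the total keeps rising.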
A Thousand Cores, Power Consumption?
• UIUC: Programming Model for one thousand cores [DAC-44]
• Mark Horowitz (Stanford): You will never see a one-thousand core processor [C2S2/GSRC review, Dec 2007]
Some Known Facts on Interconnection
• One main energy hog! (interconnection-network-related power)
  – Alpha 21364: 20% (25W / 125W)
  – MIT RAW: 40% of individual tile power
  – UT TRIPS: 25%
• Mesh-based MIMD many-core?
  – Excessive energy in its router logic
  – Unpredictable communication pattern
    • Unfriendly to low-power techniques
Many Many-Cores . . . To Name a Few
P* : Homogeneous
c* : Homogeneous (Energy Efficient)
P+c*: Heterogeneous
Power-Equivalent Performance
[Figure: relative performance (y-axis, 0 to 10) versus single-chip power budget (x-axis, 0 to 70) for P*, c*, and P+c*]
Woo and Lee. To appear in IEEE Computer
Power-Equivalent Performance per Joule
[Figure: relative performance per Joule (y-axis, 0 to 10) versus single-chip power budget (x-axis, 0 to 70) for P*, c*, and P+c*]
Woo and Lee. To appear in IEEE Computer
POD
Parallel-On-Demand
Parallel-On-Die
Woo, Fryman, Knies, Eng, and Lee. To appear in IEEE MICRO, July 2008
Modern Constraints and Issues
• Area & Power scalability (efficiency)
• On-chip wire delay
• Backward/forward binary compatibility
• Extensibility
• Low NRE
A “Snap-On” Broad-Purpose Accelerator
Memory Controller
L2 Cache
x86 Processor
Processing Element
Row Response Queue (RRQ)
Memory ring
External TLB (xTLB)
Arbiter (ARB)
Memory Bus (MBUS)
Interleaved 2D Torus Links
Data Return Buffer (DRB)
Instruction Bus (IBUS)
ORTree
3D Die Stacking, Extending Moore’s Law
Processor Layer
Cache Layer
DRAM Layer
Analog Layer
Face to Face
Back to Back
Face to Back
Heat Sink
3D-integrated POD
Die 0
Die 1
You live and work here
You get your beer here
Host Processor
Host Processor
POD Array (with 8x8 PEs)
Host Processor
Instruction Bus (IBus)
Host Processor
ORTree: Status Monitoring
Host Processor
Flag status
Data Return Buffer (DRB)
[Figure labels: Host Processor; DRB (x8)]
2D Torus P2P Links
[Figure labels: PE order along a row: 0 7 1 6 2 5 3 4; Host Processor; DRB (x8)]
2D Torus P2P Links
[Figure labels: Host Processor; DRB (x8)]
RRQ & MBus: DMA
[Figure labels: Row Response Queue (RRQ, x8); Memory Bus; Host Processor; DRB (x8)]
On-chip MC / Ring
[Figure labels: RRQ (x8); Host Processor; DRB (x8); System memory; xTLB; LLC; ARB (Arbitrator); MC (Memory Controller, x2)]
Wire Delay Aware Design
[Figure labels: RRQ (x8); D (x21); EFLAGS! MemFence; IBUS]
Simplified PE Pipelining
[Figure labels: Local SRAM; IBUS; MBUS; DEC / RF; G; X; M; NSEW (x2); 80; 96; 96-bit VLIW instruction]
Inter-PE Communication
• Fully synchronized and orchestrated by the host processor
  – No crossbar
  – No contention / arbitration
  – No large buffers like those of conventional on-chip routers
  – Nothing unpredictable from the standpoint of programmers!
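One way to picture why there is no contention is a lockstep shift. When the host issues a transfer, every PE forwards one value over one link in the same step, so each link carries exactly one value and nothing needs routing or buffering. The sketch below is a software model under assumptions: the function name, the 8x8 shape as compile-time constants, and the convention that x + 1 is "east" are all illustrative.

```c
#include <assert.h>
#include <string.h>

#define NPE_X 8   /* PEs per row (the deck's POD array is 8x8) */
#define NPE_Y 8   /* PEs per column */

/*
 * One host-issued "transfer east" step: every PE hands its value to its
 * east neighbor in the same step, wrapping around at the row edge (torus).
 * Treating x + 1 as "east" is an assumption of this sketch.
 */
void xfer_east(int reg[NPE_Y][NPE_X])
{
    assert(reg != NULL);
    int next[NPE_Y][NPE_X];
    for (int y = 0; y < NPE_Y; y++)
        for (int x = 0; x < NPE_X; x++)
            next[y][(x + 1) % NPE_X] = reg[y][x];
    memcpy(reg, next, sizeof next);
}
```

After NPE_X such steps every value is back where it started, which is the property the row-wise reduction in the programming model relies on.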
Memory Organization
[Figure labels: x86 Host Core; PE Array; RRQ0 to RRQn-1; xTLB; LLC; ARB; MC0 to MCj-1; System memory]
• Data Ring: ~66B
• RRQ Ctrl Ring: ~2B
• Address Ring: ~8B
• LLC Ctrl Ring: ~2B
• Worst-case: ~78B
Execution Model
• Heterogeneous ISAs
  – Host Processor (e.g., x86 CISC)
    • For backward compatibility
  – RISC-type VLIW POD
    • For area- and power-efficiency
• Host orchestrates the entire execution
• Instruction Encapsulation
  – SendBits (x86 extension)
  – Followed by 96-bit POD VLIW instructions
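To make the encapsulation concrete, the sketch below packs one 96-bit bundle into the 12 bytes that would follow the host-side opcode. The three 32-bit slots (G, X, M) come from the POD ISA back-up slide. The struct name, the slot order in memory, and the little-endian byte order are assumptions for illustration only, not the actual encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: three 32-bit slots, one per pipe, 96 bits in total. */
struct pod_bundle {
    uint32_t g;   /* integer pipe slot */
    uint32_t x;   /* SSE pipe slot */
    uint32_t m;   /* memory pipe slot */
};

/*
 * Serialize a bundle into 12 bytes. Slot order (G, X, M) and little-endian
 * byte order are assumptions of this sketch.
 */
void pod_bundle_pack(const struct pod_bundle *b, uint8_t out[12])
{
    assert(b != 0 && out != 0);
    const uint32_t slots[3] = { b->g, b->x, b->m };
    for (int s = 0; s < 3; s++)
        for (int byte = 0; byte < 4; byte++)
            out[4 * s + byte] = (uint8_t)(slots[s] >> (8 * byte));
}
```

The host never interprets these bytes. It only forwards them, which is what lets the x86 side stay backward compatible while the PEs run a different ISA.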
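Programming Model
• Example code: Reduction
$S = \sum_{i=0}^{N-1} a_i$

/* Host: allocate and initialize a_i in the host memory space */
char* base_addr = (char*) malloc(npe);
for ( i=0; i < npe; i++ ) {
    *(base_addr+i) = i+1;
}
/* Each PE: load my PE ID */
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
/* Broadcast base_addr, then calculate the pointer to my a_i */
POD_movl( r14, base_addr );
$ASM add r15 = r15, r14
/* Load my a_i from the host memory space and wait until it is loaded */
$ASM ld r1 = sys [ r15 + 0 ]
POD_mfence();
/* Update S, then accumulate along the row */
$ASM add r2 = r2, r1
for ( i=0; i< (npeX-1); i++ ) {
    $ASM xfer.east r1 = r1
    $ASM add r2 = r2, r1
}
/* Copy rowsum to r1, then accumulate along the column */
$ASM add r1 = r2, 0
for ( i=0; i< (npeY-1); i++ ) {
    $ASM xfer.north r1 = r1
    $ASM add r2 = r2, r1
}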
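The reduction can be modeled on a conventional machine to check the values in the 2x2 walkthrough. The sketch below is an illustrative simulation, not POD code. The `struct pe`, the row-major PE numbering (id = y * NPE_X + x), and the direction conventions (x + 1 is east, y + 1 is north) are assumptions, and r2 is taken to start at zero.

```c
#include <assert.h>

#define NPE_X 2
#define NPE_Y 2
#define NPE   (NPE_X * NPE_Y)

/* The two per-PE registers that carry data in the slide's code. */
struct pe { int r1, r2; };

/* Grid position -> PE id; row-major numbering is an assumption. */
static int pe_at(int x, int y) { return y * NPE_X + x; }

/* xfer.<dir> r1 = r1: every PE passes r1 one hop (dx, dy) around the torus. */
static void xfer_r1(struct pe pes[NPE], int dx, int dy)
{
    int next[NPE];
    for (int y = 0; y < NPE_Y; y++)
        for (int x = 0; x < NPE_X; x++)
            next[pe_at((x + dx) % NPE_X, (y + dy) % NPE_Y)] = pes[pe_at(x, y)].r1;
    for (int id = 0; id < NPE; id++)
        pes[id].r1 = next[id];
}

/* Runs the reduction; returns PE0's r2 and leaves the sum in every PE's r2. */
int pod_reduce(const char *base_addr, struct pe pes[NPE])
{
    assert(base_addr != 0);
    for (int id = 0; id < NPE; id++) {
        pes[id].r1 = base_addr[id];   /* ld r1 = sys[r15 + 0] */
        pes[id].r2 = pes[id].r1;      /* add r2 = r2, r1 with r2 == 0 */
    }
    for (int i = 0; i < NPE_X - 1; i++) {
        xfer_r1(pes, 1, 0);           /* xfer.east r1 = r1 */
        for (int id = 0; id < NPE; id++)
            pes[id].r2 += pes[id].r1;
    }
    for (int id = 0; id < NPE; id++)
        pes[id].r1 = pes[id].r2;      /* add r1 = r2, 0 (copy rowsum) */
    for (int i = 0; i < NPE_Y - 1; i++) {
        xfer_r1(pes, 0, 1);           /* xfer.north r1 = r1 */
        for (int id = 0; id < NPE; id++)
            pes[id].r2 += pes[id].r1;
    }
    return pes[0].r2;
}
```

With a_i = 1, 2, 3, 4 every PE finishes with r2 = 10. The whole reduction takes (npeX - 1) + (npeY - 1) transfer steps rather than npe - 1.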
Programming Model (2x2 POD)
• Malloc memory at the host memory space
Programming Model (2x2 POD)
• Initialize a_i
base_addr: 16
Host memory: 16: 0, 17: 0, 18: 0, 19: 0
Programming Model (2x2 POD)
• Load the address of my PE ID
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Load my PE ID
PE2: r15 = 2
PE3: r15 = 2
PE0: r15 = 2
PE1: r15 = 2
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Broadcast base_addr
PE2: r15 = 2
PE3: r15 = 3
PE0: r15 = 0
PE1: r15 = 1
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Calculate the pointer to my a_i
PE2: r14 = 16, r15 = 2
PE3: r14 = 16, r15 = 3
PE0: r14 = 16, r15 = 0
PE1: r14 = 16, r15 = 1
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Load my a_i from the host memory space
PE2: r14 = 16, r15 = 18
PE3: r14 = 16, r15 = 19
PE0: r14 = 16, r15 = 16
PE1: r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Wait until it is loaded
PE2: r14 = 16, r15 = 18
PE3: r14 = 16, r15 = 19
PE0: r14 = 16, r15 = 16
PE1: r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Update S
PE2: r1 = 3, r14 = 16, r15 = 18
PE3: r1 = 4, r14 = 16, r15 = 19
PE0: r1 = 1, r14 = 16, r15 = 16
PE1: r1 = 2, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• 1st iteration
PE2: r1 = 3, r2 = 3, r14 = 16, r15 = 18
PE3: r1 = 4, r2 = 4, r14 = 16, r15 = 19
PE0: r1 = 1, r2 = 1, r14 = 16, r15 = 16
PE1: r1 = 2, r2 = 2, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Transfer my a_i to my east-side neighbor
PE2: r1 = 3, r2 = 3, r14 = 16, r15 = 18
PE3: r1 = 4, r2 = 4, r14 = 16, r15 = 19
PE0: r1 = 1, r2 = 1, r14 = 16, r15 = 16
PE1: r1 = 2, r2 = 2, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Update S
PE2: r2 = 3, r14 = 16, r15 = 18
PE3: r2 = 4, r14 = 16, r15 = 19
PE0: r2 = 1, r14 = 16, r15 = 16
PE1: r2 = 2, r14 = 16, r15 = 17
r1 values being transferred east: 3, 1, 4, 2
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Copy rowsum to r1
PE2: r1 = 4, r2 = 7, r14 = 16, r15 = 18
PE3: r1 = 3, r2 = 7, r14 = 16, r15 = 19
PE0: r1 = 2, r2 = 3, r14 = 16, r15 = 16
PE1: r1 = 1, r2 = 3, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• 1st iteration
PE2: r1 = 7, r2 = 7, r14 = 16, r15 = 18
PE3: r1 = 7, r2 = 7, r14 = 16, r15 = 19
PE0: r1 = 3, r2 = 3, r14 = 16, r15 = 16
PE1: r1 = 3, r2 = 3, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Transfer my rowsum to my north-side neighbor
PE2: r1 = 7, r2 = 7, r14 = 16, r15 = 18
PE3: r1 = 7, r2 = 7, r14 = 16, r15 = 19
PE0: r1 = 3, r2 = 3, r14 = 16, r15 = 16
PE1: r1 = 3, r2 = 3, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Update S
PE2: r2 = 7, r14 = 16, r15 = 18
PE3: r2 = 7, r14 = 16, r15 = 19
PE0: r2 = 3, r14 = 16, r15 = 16
PE1: r2 = 3, r14 = 16, r15 = 17
r1 values being transferred north: 7, 7, 3, 3
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Done!
PE2: r1 = 3, r2 = 10, r14 = 16, r15 = 18
PE3: r1 = 3, r2 = 10, r14 = 16, r15 = 19
PE0: r1 = 7, r2 = 10, r14 = 16, r15 = 16
PE1: r1 = 7, r2 = 10, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4
Programming Model (2x2 POD)
• Broadcast the pointer to ‘result’
PE2: r1 = 3, r2 = 10, r14 = 16, r15 = 18
PE3: r1 = 3, r2 = 10, r14 = 16, r15 = 19
PE0: r1 = 7, r2 = 10, r14 = 16, r15 = 16
PE1: r1 = 7, r2 = 10, r14 = 16, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’

int result;
POD_movl( r14, &result );
$ASM mov r15 = MY_PE_ID_POINTER
$ASM ld r15 = local [ r15 + 0 ]
$ASM sub r15 = r15, 0
$ASM pushmask.e
$ASM st sys[ r14 + 0 ] = r2
$ASM popmask
POD_mfence();
Programming Model (2x2 POD)
• Only PE0 will write-back!
PE2: r1 = 3, r2 = 10, r14 = 128, r15 = 18
PE3: r1 = 3, r2 = 10, r14 = 128, r15 = 19
PE0: r1 = 7, r2 = 10, r14 = 128, r15 = 16
PE1: r1 = 7, r2 = 10, r14 = 128, r15 = 17
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’
Programming Model (2x2 POD)
• Load PE_ID
PE2: r1 = 3, r2 = 10, r14 = 128, r15 = 2
PE3: r1 = 3, r2 = 10, r14 = 128, r15 = 2
PE0: r1 = 7, r2 = 10, r14 = 128, r15 = 2
PE1: r1 = 7, r2 = 10, r14 = 128, r15 = 2
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’
Programming Model (2x2 POD)
• Compare PE_ID
PE2: r1 = 3, r2 = 10, r14 = 128, r15 = 2
PE3: r1 = 3, r2 = 10, r14 = 128, r15 = 3
PE0: r1 = 7, r2 = 10, r14 = 128, r15 = 0
PE1: r1 = 7, r2 = 10, r14 = 128, r15 = 1
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’
Programming Model (2x2 POD)
• Enable PE only when equal to zero
PE2: r1 = 3, r2 = 10, r14 = 128, r15 = 2
PE3: r1 = 3, r2 = 10, r14 = 128, r15 = 3
PE0: r1 = 7, r2 = 10, r14 = 128, r15 = 0
PE1: r1 = 7, r2 = 10, r14 = 128, r15 = 1
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’
Programming Model (2x2 POD)
• Write-back
PE2: r1 = 3, r2 = 10, r14 = 128, r15 = 2
PE3: r1 = 3, r2 = 10, r14 = 128, r15 = 3
PE0: r1 = 7, r2 = 10, r14 = 128, r15 = 0
PE1: r1 = 7, r2 = 10, r14 = 128, r15 = 1
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’
Programming Model (2x2 POD)
• Enable ALL PEs
PE2: r1 = 3, r2 = 10, r14 = 128, r15 = 2
PE3: r1 = 3, r2 = 10, r14 = 128, r15 = 3
PE0: r1 = 7, r2 = 10, r14 = 128, r15 = 0
PE1: r1 = 7, r2 = 10, r14 = 128, r15 = 1
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’
Programming Model (2x2 POD)
• Wait until the result is written-back
PE2: r1 = 3, r2 = 10, r14 = 128, r15 = 2
PE3: r1 = 3, r2 = 10, r14 = 128, r15 = 3
PE0: r1 = 7, r2 = 10, r14 = 128, r15 = 0
PE1: r1 = 7, r2 = 10, r14 = 128, r15 = 1
base_addr: 16
Host memory: 16: 1, 17: 2, 18: 3, 19: 4; 128: <- addr of ‘result’
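The write-back steps rely on predication: every PE receives the same store, but only the PE whose comparison result is zero stays enabled. The sketch below models that behavior in software. The mask stack depth, the `zero_flag` field, and the function names are assumptions for illustration, and the real semantics of `pushmask.e` and `popmask` may differ in detail.

```c
#include <assert.h>
#include <stdbool.h>

#define NPE 4

/* Per-PE state for this sketch: two registers plus a small enable stack. */
struct pe {
    int  r2, r15;
    bool zero_flag;   /* true when the last arithmetic result was zero */
    bool mask[4];     /* mask[depth] is the current enable bit */
    int  depth;
};

/* sub r15 = r15, 0 then pushmask.e: stay enabled only if the result is 0. */
static void sub_and_pushmask_e(struct pe *p, int pe_id)
{
    p->r15 = pe_id - 0;
    p->zero_flag = (p->r15 == 0);
    assert(p->depth < 3);
    bool enabled = p->mask[p->depth];
    p->mask[++p->depth] = enabled && p->zero_flag;
}

/* popmask: restore the previous enable bit. */
static void popmask(struct pe *p)
{
    assert(p->depth > 0);
    p->depth--;
}

/*
 * Every PE sees the same store; only enabled PEs perform it.
 * Returns how many PEs wrote to *result.
 */
int masked_writeback(struct pe pes[NPE], int *result)
{
    int writers = 0;
    for (int id = 0; id < NPE; id++) {
        pes[id].depth = 0;
        pes[id].mask[0] = true;              /* all PEs start enabled */
        sub_and_pushmask_e(&pes[id], id);
        if (pes[id].mask[pes[id].depth]) {   /* st sys[r14 + 0] = r2 */
            *result = pes[id].r2;
            writers++;
        }
        popmask(&pes[id]);                   /* enable all PEs again */
    }
    return writers;
}
```

Because only PE0 passes the comparison, exactly one store reaches host memory even though the host broadcast the instruction to all PEs.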
Dimension Virtualization for POD
• Virtualization of the number of PEs
  – Six runtime variables needed in HW
    • Total number of PEs in a row
    • Total number of PEs in a column
    • Total number of PEs
    • Per-PE x-coordinate value
    • Per-PE y-coordinate value
    • PE ID of each PE
• Virtualization of the existence of a POD layer
  – Software/hardware binary translator
  – One PE on a conventional general-purpose processor layer (area overhead ~9%)
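As an illustration of the six runtime variables, the sketch below groups them in one structure and derives the per-PE values from the array shape. The field names, the function name, and the row-major mapping from PE ID to coordinates are assumptions. The deck only states which six values the hardware must expose.

```c
#include <assert.h>

/* The six runtime values listed above; field names are hypothetical. */
struct pod_dims {
    int npe_x;   /* total number of PEs in a row */
    int npe_y;   /* total number of PEs in a column */
    int npe;     /* total number of PEs */
    int x;       /* this PE's x-coordinate */
    int y;       /* this PE's y-coordinate */
    int pe_id;   /* this PE's ID */
};

/* Fill in one PE's view for a given array shape, assuming row-major IDs. */
struct pod_dims pod_dims_for(int npe_x, int npe_y, int pe_id)
{
    assert(npe_x > 0 && npe_y > 0);
    assert(pe_id >= 0 && pe_id < npe_x * npe_y);
    struct pod_dims d = {
        .npe_x = npe_x, .npe_y = npe_y, .npe = npe_x * npe_y,
        .x = pe_id % npe_x, .y = pe_id / npe_x, .pe_id = pe_id,
    };
    return d;
}
```

Code that loops to `npe_x - 1` and `npe_y - 1` instead of hard-coded sizes, as the reduction example does, then runs unchanged on a 1x1 array, which is the case the single-PE fallback covers.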
Evaluation
Physical Design Evaluation @ 45nm
• PE
  – 128KB 2R/W-1R-1W SRAM, Basic GP ALU, simplified SSE ALU
  – ~1.9 mm^2 (~1.33W)
• RRQ
  – (conservative estimation)
  – ~1.9 mm^2
• POD Total: ~137 mm^2 (~100W)
• Penryn (6MB L2): ~107 mm^2
[Figure labels: RRQ (x8); DRB (x8)]
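The POD total is consistent with the per-block figures if the 8x8 array has one RRQ per row. That RRQ count is an assumption read off the eight RRQ labels in the figure. The sketch below only reproduces this arithmetic, and the constant and function names are illustrative.

```c
#include <assert.h>

#define PE_AREA_MM2   1.9    /* per PE, from the slide */
#define RRQ_AREA_MM2  1.9    /* per RRQ, from the slide */
#define PE_POWER_W    1.33   /* per PE, from the slide */

/* Area of an n x n POD with one RRQ per row (the RRQ count is an assumption). */
double pod_area_mm2(int n)
{
    assert(n > 0);
    return n * n * PE_AREA_MM2 + n * RRQ_AREA_MM2;
}

/* Power drawn by the PEs alone; the slide gives no separate RRQ figure. */
double pod_pe_power_w(int n)
{
    assert(n > 0);
    return n * n * PE_POWER_W;
}
```

For n = 8 this gives about 136.8 mm^2, in line with the ~137 mm^2 on the slide, and about 85 W for the 64 PEs, which leaves roughly 15 W of the ~100 W total for everything else in the POD layer.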
Physical Design Evaluation
[Figure: Intel Penryn Core 2 Duo processor die photo. Labels: SIMD; INT; AGU; DL1]
Physical Design Evaluation
[Figure: Intel Penryn Core 2 Duo processor die photo. Label: POD PE]
Performance Result
[Figure: performance (y-axis ticks 4, 16, 64) of 1x1 POD, 2x2 POD, 4x4 POD, and 8x8 POD for each benchmark]
Benchmarks: DenseMMM (SP FP), FFT (SP FP), IDCT (DP FP), OptionPricing (SP FP), DownSampling (SP FP), Histogram (INT), LinearRegression (SP FP), K-means (SP FP), CollisionDetection (SP FP), RayCasting (SP FP)
Data labels on the chart: 869.3, 113.3, 134.8, 860.2, 91.6, 68.2, 504.6, 66.8, and 95.6 GFLOPS
Performance per mm^2
[Figure: performance per mm^2 (y-axis, 0 to 12) of 1x1 POD, 2x2 POD, 4x4 POD, and 8x8 POD for DenseMMM, FFT, IDCT, OptionPricing, DownSampling, Histogram, LinearRegression, K-means, CollisionDetection, and RayCasting]
Performance per Joule
[Figure: performance per Joule (y-axis, 0 to 800) of 1x1 POD, 2x2 POD, 4x4 POD, and 8x8 POD for DenseMMM, FFT, IDCT, OptionPricing, DownSampling, Histogram, LinearRegression, K-means, CollisionDetection, and RayCasting]
Interconnection Power

         Input buffer   X-bar   Arbiter   Link
RAW      31%            30%     ~0%       39%
TRIPS    35%            33%     1%        31%

Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003
2D Torus P2P Link Active Time
[Figure: active time (y-axis, 0% to 6%) of the N, E, W, and S links on 1x1 POD, 2x2 POD, 4x4 POD, and 8x8 POD for DenseMMM, FFT, IDCT, OptionPricing, DownSampling, Histogram, LinearRegression, K-means, CollisionDetection, and RayCasting]
Conclusions
• We propose POD
  – Parallel-On-Demand many-core architecture
  – Broad-purpose acceleration
  – High energy and area efficiency
  – Backward compatibility
• A single-chip POD can achieve up to 1.5 TFLOPS of IEEE SP FP operations.
• Interconnection architecture of POD is extremely power- and area-efficient.
[Figure labels: RRQ (x8); Host Processor; DRB (x8); xTLB; LLC; ARB; MC (x2)]
Let’s Use Power to Compute, Not Commute!
Georgia Tech
ECE MARS Labs
http://arch.ece.gatech.edu
Back-up Slides
x86 Modification
• 5 new instructions:
  – sendibits, getort, drainort, movl, getdrb
• 3 modified instructions:
  – lfence, sfence, mfence
• 4 possible new co-processor registers or else memory-mapped addresses:
  – ibits register, ort status, drb value, xTLB support
• IBits queue
• All other modifications external to the x86 core
  – Arbiter for LLC priority can be external to LLC
• Deferred Design Issue: LLC-POD Coherence
  – v1 Study: uncacheable
  – v2 Prototype: LLC Snoops on A/D rings
POD ISA
• “Instructions” are 3-wide VLIW bundles
  – One of each G, X, M (32 bits each)
• Integer Pipe (G):
  – and, or, not, xor, shl, shr, add, sub, mul, cmp, cmov, xfer, xferdrb (no div)
• SSE Pipe (X):
  – pfp{fma,add,sub,mul,div,max,min}, pi{add,sub,mul,and,or,not,cmp}, shuffle, cvt, xfer (no idiv)
• Memory Pipe (M):
  – ld, st, blkrd/wr, mask{set,pop,push}, mov, xfer
• Hybrid (G+X+M):
  – movl
PODSIM
• Native x86 machine + PODSIM
  – PE pipelining
  – Memory subsystem
  – Accurate modeling of on-chip, off-chip bandwidth
• Limitation:
  – Host processor overhead not modeled (I$ miss, branch misprediction)
  – LLC takeover of the MC ring by the host
Out-of-order Host Issues
• Speculative execution
  – No recovery mechanism in each PE
  – Broadcast POD instructions in a non-speculative manner
    • As long as host-side code does not depend on the results from POD, this approach does not affect the performance.
• Instruction Reordering
  – POD instructions are strongly ordered.
• IBits queue
  – Similar to the store queue
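The store-queue analogy can be sketched as a FIFO that holds bundles in program order and releases one to the PEs only after the host instruction that issued it has retired. Everything below is a hypothetical model: the queue depth, the field names, and the squash policy are assumptions, not details from the deck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IBQ_SIZE 8   /* hypothetical queue depth */

struct ibits_queue {
    uint32_t bundle[IBQ_SIZE][3];  /* G, X, M slots of each queued bundle */
    int head;      /* index of the oldest entry */
    int count;     /* entries currently queued */
    int retired;   /* how many of the oldest entries are non-speculative */
};

/* Host issues a POD bundle speculatively: append it in program order. */
bool ibq_push(struct ibits_queue *q, const uint32_t b[3])
{
    if (q->count == IBQ_SIZE)
        return false;                          /* full: the host must stall */
    int tail = (q->head + q->count) % IBQ_SIZE;
    for (int i = 0; i < 3; i++)
        q->bundle[tail][i] = b[i];
    q->count++;
    return true;
}

/* The oldest speculative bundle retires on the host. */
void ibq_retire(struct ibits_queue *q)
{
    assert(q->retired < q->count);
    q->retired++;
}

/* Misprediction: discard every bundle that has not retired yet. */
void ibq_squash(struct ibits_queue *q)
{
    q->count = q->retired;
}

/* Broadcast the oldest bundle to the PEs, but only once it has retired. */
bool ibq_broadcast(struct ibits_queue *q, uint32_t out[3])
{
    if (q->retired == 0)
        return false;
    for (int i = 0; i < 3; i++)
        out[i] = q->bundle[q->head][i];
    q->head = (q->head + 1) % IBQ_SIZE;
    q->count--;
    q->retired--;
    return true;
}
```

Because nothing leaves the queue before retirement and entries leave in order, the PEs never see a wrong-path bundle and need no recovery mechanism of their own.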
Programming Model
• Example: Reduction
$S = \sum_{i=0}^{N-1} a_i$

for ( i = 0; i < input_size; i++ )
    host_input[i] = i;

blk_size = input_size / npe
memcpy( &local_input[0], &host_input[blk_size*mypeID], blk_size )
memory_fence;

local_sum = 0
for ( i = 0; i < blk_size; i++ )
    local_sum += local_input[i]

row_sum = local_sum
for ( i=0; i< (npeX-1); i++ ) {
    xfer.east local_sum = local_sum
    row_sum += local_sum
}

global_sum = row_sum
for ( i=0; i< (npeY-1); i++ ) {
    xfer.north row_sum = row_sum
    global_sum += row_sum
}
Programming Model (2x2 POD, input_size = 8)
• host_input: 0 1 2 3 4 5 6 7
• blk_size: 2
• local_input after the memcpy: PE2: 4 5, PE3: 6 7, PE0: 0 1, PE1: 2 3
• local_sum: PE2: 9, PE3: 13, PE0: 1, PE1: 5
• row_sum = local_sum: PE2: 9, PE3: 13, PE0: 1, PE1: 5
• After xfer.east:
  – local_sum: PE2: 13, PE3: 9, PE0: 5, PE1: 1
  – row_sum: PE2: 22, PE3: 22, PE0: 6, PE1: 6
• global_sum = row_sum: PE2: 22, PE3: 22, PE0: 6, PE1: 6
• After xfer.north:
  – row_sum: PE2: 6, PE3: 6, PE0: 22, PE1: 22
  – global_sum: PE2: 28, PE3: 28, PE0: 28, PE1: 28
Design Space Exploration
• Baseline Design
POD Design Space Exploration
• 3D many-POD processor
Die 0
Die 1
Power-Equivalent Performance per Watt
[Figure: relative performance per Watt (y-axis, 0 to 2) versus single-chip power budget (x-axis, 0 to 70) for P*, c*, and P+c*]
Microfluid Channels for 3D Integration
• Challenges
  – Routing
  – Fluid Pressure
  – Cost
[Figure labels: a stack; RDL; C4 bump; ball; fluid in; fluid out; package substrate; front; back; transistor; signal TSV; microfluidic channel; metal layers; P/G nets; clock net; fluidic via (hot); fluidic via (cool); Avatrel cover; clock TSV; I/O TSV; I/O buffer; P/G TSV]
Courtesy: Young-Joon Lee, Georgia Tech
Many-Core? Do You Need It?
Well, it is hard to say in Computing World
If you were plowing a field, which would you rather use: two strong oxen or 1024 chickens?
--- Seymour Cray
What Did We Learn from Deep Blue?
DEEP BLUE
480 chess chips
Can evaluate 200,000,000 moves per second!!
Feng-Hsiung Hsu