Piranha: Designing a Scalable CMP-based System for
Commercial Workloads
Luiz André Barroso
Western Research Laboratory
April 27, 2001 Asilomar Microcomputer Workshop
What is Piranha?
• A scalable shared memory architecture based on chip multiprocessing (CMP) and targeted at commercial workloads
• A research prototype under development by Compaq Research and Compaq NonStop Hardware Development Group
• A departure from ever increasing processor complexity and system design/verification cycles
Importance of Commercial Applications
• Total server market size in 1999: ~$55-60B
  – technical applications: less than $6B
  – commercial applications: ~$40B

Worldwide Server Customer Spending (IDC 1999)
  Infrastructure             29%
  Business processing        22%
  Decision support           14%
  Software development       14%
  Collaborative              12%
  Scientific & engineering    6%
  Other                       3%
Price Structure of Servers
• IBM eServer 680 (220KtpmC; $43/tpmC)
  § 24 CPUs
  § 96GB DRAM, 18TB Disk
  § $9M price tag
• Compaq ProLiant ML370 (32KtpmC; $12/tpmC)
  § 4 CPUs
  § 8GB DRAM, 2TB Disk
  § $240K price tag
- Software maintenance/management costs even higher (up to $100M)- Storage prices dominate (50%-70% in customer installations)
- Price of expensive CPUs/memory system amortized
[Figure: Normalized breakdown of HW cost (0% to 100%) for IBM eServer 680 and Compaq ProLiant ML570; components: I/O, DRAM, CPU, Base]
Price per component
System                   $/CPU     $/MB DRAM   $/GB Disk
IBM eServer 680          $65,417   $9          $359
Compaq ProLiant ML570    $6,048    $4          $64
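The claim that storage dominates can be checked with a rough calculation from the numbers on this slide. The sketch below multiplies the unit prices in the price-per-component table by the CPU, DRAM, and disk counts listed for each server. It assumes 1GB = 1024MB and 1TB = 1024GB, assumes the two Compaq model labels on the slide refer to the same machine, and ignores the Base and I/O components, so the shares are relative to the sum of these three components only.

```python
# Unit prices (USD) from the price-per-component table.
UNIT_PRICE = {
    "IBM eServer 680": {"cpu": 65_417, "dram_mb": 9, "disk_gb": 359},
    "Compaq ProLiant": {"cpu": 6_048, "dram_mb": 4, "disk_gb": 64},
}

# Configurations from the server bullets on the same slide.
CONFIG = {
    "IBM eServer 680": {"cpus": 24, "dram_gb": 96, "disk_tb": 18},
    "Compaq ProLiant": {"cpus": 4, "dram_gb": 8, "disk_tb": 2},
}


def component_costs(system):
    """Cost of CPUs, DRAM, and disk for one system (binary units assumed)."""
    price, config = UNIT_PRICE[system], CONFIG[system]
    return {
        "cpu": config["cpus"] * price["cpu"],
        "dram": config["dram_gb"] * 1024 * price["dram_mb"],
        "disk": config["disk_tb"] * 1024 * price["disk_gb"],
    }


def shares(system):
    """Each component's fraction of the CPU + DRAM + disk total."""
    costs = component_costs(system)
    total = sum(costs.values())
    return {name: cost / total for name, cost in costs.items()}


if __name__ == "__main__":
    for system in UNIT_PRICE:
        print(system, component_costs(system), shares(system))
```

Under these assumptions, disk is the largest of the three components for both systems, and the IBM component sum lands close to the $9M price tag.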
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
• Design Methodology
• Summary
Studies of Commercial Workloads
• Collaboration with Kourosh Gharachorloo (Compaq WRL)
  – ISCA’98: Memory System Characterization of Commercial Workloads (with E. Bugnion)
  – ISCA’98: An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors (with J. Lo, S. Eggers, H. Levy, and S. Parekh)
  – ASPLOS’98: Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors (with P. Ranganathan and S. Adve)
  – HPCA’00: Impact of Chip-Level Integration on Performance of OLTP Workloads (with A. Nowatzyk and B. Verghese)
  – ISCA’01: Code Layout Optimizations for Transaction Processing Workloads (with A. Ramirez, R. Cohn, J. Larriba-Pey, G. Lowney, and M. Valero)
Studies of Commercial Workloads: summary
• Memory system is the main bottleneck
  – astronomically high CPI
  – dominated by memory stall times
  – instruction stalls as important as data stalls
  – fast/large L2 caches are critical
• Very poor Instruction Level Parallelism (ILP)
  – frequent hard-to-predict branches
  – large L1 miss ratios
  – Ld-Ld dependencies
  – disappointing gains from wide-issue out-of-order techniques!
Increasing Complexity of Processor Designs
• Pushing limits of instruction-level parallelism
  – multiple instruction issue
  – speculative out-of-order (OOO) execution
• Driven by applications such as SPEC
• Increasing design time and team size
• Yielding diminishing returns in performance
Processor    Year      Transistor Count   Design      Design Time   Verification Team Size
(SGI MIPS)   Shipped   (millions)         Team Size   (months)      (% of total)
R2000        1985      0.10               20          15            15%
R4000        1991      1.40               55          24            20%
R10000       1996      6.80               >100        36            >35%
courtesy: John Hennessy, IEEE Computer, 32(8)
Exploiting Higher Levels of Integration
• lower latency, higher bandwidth
• reuse of existing CPU core addresses complexity issues
• incrementally scalable glueless multiprocessing
[Figure: Alpha 21364 single chip with a 1GHz 21264 CPU, 64KB I$, 64KB D$, 1.5MB L2$, two MEM-CTL blocks, Coherence Engine, Network Interface, and I/O; multiprocessor built from 364 chips, each with memory (M) and I/O (IO)]
Exploiting Parallelism in Commercial Apps
Chip Multiprocessing (CMP)
• Example: IBM Power4
[Figure: CMP chip with two CPUs, each with I$ and D$, plus L2$, MEM-CTL, Coherence, Network, and I/O blocks; execution timeline of threads 1 to 4]
Simultaneous Multithreading (SMT)
• Example: Alpha 21464
[Figure: SMT execution timeline of threads 1 to 4]
• SMT superior in single-thread performance
• CMP addresses complexity by using simpler cores
Outline
• Importance of Commercial Workloads
• Commercial Workload Requirements
• Trends in Processor Design
• Piranha
  – Architecture
  – Performance
• Design Methodology
• Summary
Piranha Project
• Explore chip multiprocessing for scalable servers
• Focus on parallel commercial workloads
• Small team, modest investment, short design time
• Address complexity by using:
  – simple processor cores
  – standard ASIC methodology
Give up on ILP, embrace TLP
Piranha Team Members
Research
  – Luiz André Barroso (WRL)
  – Kourosh Gharachorloo (WRL)
  – David Lowell (WRL)
  – Joel McCormack (WRL)
  – Mosur Ravishankar (WRL)
  – Rob Stets (WRL)
  – Yuan Yu (SRC)
NonStop Hardware Development ASIC Design Center
  – Tom Heynemann
  – Dan Joyce
  – Harland Maxwell
  – Harold Miller
  – Sanjay Singh
  – Scott Smith
  – Jeff Sprouse
  – … several contractors
Former Contributors
  – Robert McNamara
  – Basem Nayfeh
  – Andreas Nowatzyk
  – Joan Pendleton
  – Shaz Qadeer
  – Brian Robinson
  – Barton Sano
  – Daniel Scales
  – Ben Verghese
Piranha Processing Node
• Alpha core: 1-issue, in-order, 500MHz
• L1 caches: I&D, 64KB, 2-way
• Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
• L2 cache: shared, 1MB, 8-way
• Memory Controller (MC): RDRAM, 12.8GB/sec
• Protocol Engines (HE & RE): µprog., 1K µinstr., even/odd interleaving
• System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth
[Figure: single chip with eight CPUs, each with I$ and D$, connected through the ICS to eight L2$ banks, each with its own MEM-CTL, plus HE, RE, and Router]
Piranha I/O Node
[Figure: I/O node with one CPU (I$, D$), L2$, MEM-CTL, FB, HE, RE, ICS, PCI-X, and Router with 2 Links @8GB/s]
• I/O node is a full-fledged member of system interconnect
  – CPU indistinguishable from Processing Node CPUs
  – participates in global coherence protocol
Example Configuration
[Figure: example system of interconnected Processing nodes (P) and I/O nodes (P-I/O)]
• Arbitrary topologies
• Match ratio of Processing to I/O nodes to application requirements
L2 Cache and Intra-Node Coherence
• No inclusion between L1s and L2 cache
  – total L1 capacity equals L2 capacity
  – L2 misses go directly to L1
  – L2 filled by L1 replacements
• L2 keeps track of all lines in the chip
  – sends Invalidates, Forwards
  – orchestrates L1-to-L2 write-backs to maximize chip-memory utilization
  – cooperates with Protocol Engines to enforce system-wide coherence
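The fill policy in the bullets above (memory fills bypass the L2, and the L2 is filled only by L1 replacements) can be illustrated with a toy model. This is a minimal sketch, not the Piranha design: it assumes a single fully associative L1 with LRU replacement instead of eight L1 pairs, it ignores coherence state, and it assumes a line is dropped from the L2 when it moves back up to the L1. The class and method names are hypothetical.

```python
from collections import OrderedDict


class NonInclusiveHierarchy:
    """Toy single-L1 model of an L2 that is filled only by L1 victims."""

    def __init__(self, l1_lines, l2_lines):
        self.l1 = OrderedDict()  # LRU order: oldest entry first
        self.l2 = OrderedDict()
        self.l1_lines = l1_lines
        self.l2_lines = l2_lines

    def access(self, addr):
        """Return where the line was found: 'L1 hit', 'L2 hit', or 'memory'."""
        if addr in self.l1:
            self.l1.move_to_end(addr)
            return "L1 hit"
        if addr in self.l2:
            del self.l2[addr]  # assumption: no duplicate copy is kept in L2
            result = "L2 hit"
        else:
            result = "memory"  # data returned from memory goes straight to L1
        self._fill_l1(addr)
        return result

    def _fill_l1(self, addr):
        if len(self.l1) >= self.l1_lines:
            victim, _ = self.l1.popitem(last=False)
            self._fill_l2(victim)  # the L2 is filled by L1 replacements only
        self.l1[addr] = True

    def _fill_l2(self, addr):
        if len(self.l2) >= self.l2_lines:
            self.l2.popitem(last=False)
        self.l2[addr] = True


if __name__ == "__main__":
    h = NonInclusiveHierarchy(l1_lines=2, l2_lines=2)
    print([h.access(a) for a in (0, 1, 2, 0, 2)])
```

Because no line lives in both levels at once in this model, the on-chip capacity is the sum of the L1 and L2 sizes, which matters when the total L1 capacity equals the L2 capacity.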
Inter-Node Coherence Protocol
• ‘Stealing’ ECC bits for memory directory
• Directory (2b state + 40b sharing info)
• Dual representation: limited pointer + coarse vector
• “Cruise Missile” Invalidations (CMI)
  – limit fan-out/fan-in serialization with CV
• Several new protocol optimizations
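The dual representation named above can be sketched as a sharer record that stores exact node pointers while they fit and falls back to a coarse bit vector once they do not. The parameters below are assumptions for illustration only: a 1024-node system (so 10-bit pointers, four of which fit in the 40 sharing bits) and a 40-bit coarse vector in which each bit stands for a group of consecutive node IDs. The class and method names are hypothetical, and the slide does not specify these details.

```python
class SharerInfo:
    """Sharing info held as limited pointers, or as a coarse vector on overflow."""

    def __init__(self, num_nodes=1024, info_bits=40):
        self.num_nodes = num_nodes
        self.info_bits = info_bits
        self.ptr_bits = max(1, (num_nodes - 1).bit_length())
        self.max_ptrs = info_bits // self.ptr_bits     # pointers that fit
        self.group = -(-num_nodes // info_bits)        # nodes per coarse bit (ceil)
        self.pointers = []   # exact sharer IDs (limited-pointer mode)
        self.vector = None   # int bitmask once in coarse-vector mode

    def add_sharer(self, node):
        if self.vector is None:
            if node in self.pointers:
                return
            if len(self.pointers) < self.max_ptrs:
                self.pointers.append(node)
                return
            # Pointer overflow: convert existing pointers to coarse bits.
            self.vector = 0
            for p in self.pointers:
                self.vector |= 1 << (p // self.group)
            self.pointers = []
        self.vector |= 1 << (node // self.group)

    def invalidation_targets(self):
        """Nodes that must receive an invalidate (a superset in coarse mode)."""
        if self.vector is None:
            return sorted(self.pointers)
        targets = []
        for bit in range(self.info_bits):
            if (self.vector >> bit) & 1:
                start = bit * self.group
                targets.extend(range(start, min(start + self.group, self.num_nodes)))
        return targets


if __name__ == "__main__":
    info = SharerInfo()
    for node in (0, 30, 60, 90, 1000):
        info.add_sharer(node)
    print(len(info.invalidation_targets()))
```

The trade-off this shows is precision against space: the coarse vector never misses a sharer, but it may send invalidations to nodes that hold no copy.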
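[Figure: directory entry fields: state (2b), info on sharers (20b), shown twice]

Layout per line (Data-bits + ECC + Directory-bits)    Total directory bits
8x(64+8)                                              0
4x(128+9+7)                                           28
2x(256+10+22)                                         44
1x(512+11+53)                                         53

[Figure: CMI example with bit pattern 010000001000]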
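The directory-bit counts above follow from computing ECC over wider words. The sketch below reproduces them under two assumptions that the slide does not state explicitly: the code is a standard SECDED code (Hamming check bits plus one overall parity bit), and the line occupies the same 576 bits of storage, 8x(64+8), in every layout. The function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for a SECDED code: Hamming bits plus one overall parity bit."""
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def directory_bits_per_line(word_bits, line_bits=512, base_word=64):
    """Spare bits per line when ECC is computed over `word_bits`-wide words."""
    # Raw storage is fixed by the baseline layout: 8 x (64 data + 8 ECC) = 576.
    raw_bits = (line_bits // base_word) * (base_word + secded_check_bits(base_word))
    words = line_bits // word_bits
    spare_per_word = raw_bits // words - word_bits - secded_check_bits(word_bits)
    return words * spare_per_word


if __name__ == "__main__":
    for width in (64, 128, 256, 512):
        print(width, secded_check_bits(width), directory_bits_per_line(width))
```

With these assumptions the check-bit counts come out as 8, 9, 10, and 11, and the freed bits as 0, 28, 44, and 53, matching the rows of the table.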
Simulated Architectures
Single-Chip Piranha Performance
[Figure: Normalized Execution Time, broken down into CPU, L2 Hit, and L2 Miss, for OLTP and DSS]

Configuration              OLTP   DSS
P1  (500MHz, 1-issue)      233    350
INO (1GHz, 1-issue)        145    191
OOO (1GHz, 4-issue)        100    100
P8  (500MHz, 1-issue)      34     44

• Piranha’s performance margin 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes memory system
Single-Chip Performance (Cont.)
• Near-linear scalability
  – low memory latencies
  – effectiveness of highly associative L2 and non-inclusive caching
[Figure: Speedup (0 to 8) versus Number of Cores (0 to 8)]
[Figure: Normalized Breakdown of L1 Misses (%), split into L2 Hit, L2 Fwd, and L2 Miss, for P1, P2, P4, and P8 (500 MHz, 1-issue)]
Potential of a Full-Custom Piranha
• 5x margin over OOO for OLTP and DSS
• Full-custom design benefits substantially from boost in core speed
[Figure: Normalized Execution Time, broken down into CPU, L2 Hit, and L2 Miss, for OLTP and DSS]

Configuration              OLTP   DSS
OOO (1GHz, 4-issue)        100    100
P8  (500MHz, 1-issue)      34     43
P8F (1.25GHz, 1-issue)     20     19
Managing Complexity in the Architecture
• Use of many simpler logic modules
  – shorter design
  – easier verification
  – only short wires*
  – faster synthesis
  – simpler chip-level layout
• Simplify intra-chip communication
  – all traffic goes through ICS (no backdoors)
• Use of microprogrammed protocol engines
• Adoption of large VM pages
• Implement sub-set of Alpha ISA
  – no VAX floating point, no multimedia instructions, etc.
Methodology Challenges
• Isolated sub-module testing
  – need to create robust bus functional models (BFM)
  – sub-modules’ behavior highly inter-dependent
  – not feasible with a small team
• System-level (integrated) testing
  – much easier to create tests
  – only one BFM at the processor interface
  – simpler to assert correct operation
  – Verilog simulation is too slow for comprehensive testing
Our Approach:
• Design in stylized C++ (synthesizable RTL level)
  – use mostly system-level, semi-random testing
  – simulations in C++ (faster & cheaper than Verilog)
    § simulation speed ~1000 clocks/second
  – employ directed tests to fill test coverage gaps
• Automatic C++ to Verilog translation
  – single design database
  – reduce translation errors
  – faster turnaround of design changes
  – risk: untested methodology
• Using industry-standard synthesis tools
• IBM ASIC process (Cu11)
Piranha Methodology: Overview
• C++ RTL Models: cycle accurate and “synthesizable”
• cxx: C++ compiler
• PS1: fast (C++) logic simulator
• CLevel: C++-to-Verilog translator
• Verilog Models: machine translated from C++ models
• PS1V: can “co-simulate” C++ and Verilog module versions and check correspondence
• Physical Design: leverages industry standard Verilog-based tools
[Figure: flow from C++ RTL Models through cxx to PS1, through CLevel to Verilog Models and PS1V, and on to Physical Design]
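The idea of co-simulating two versions of a module and checking correspondence under semi-random stimuli can be shown in miniature. The sketch below is not the PS1V tool and uses neither C++ nor Verilog: it is a Python illustration in which two independently written models of a hypothetical 4-bit counter are stepped in lockstep and their outputs compared every cycle. All names and the stimulus probabilities are assumptions.

```python
import random


class CounterA:
    """Reference model: 4-bit up-counter with enable and synchronous reset."""

    def __init__(self):
        self.value = 0

    def step(self, reset, enable):
        if reset:
            self.value = 0
        elif enable:
            self.value = (self.value + 1) % 16
        return self.value


class CounterB:
    """Second version of the same module, written with bit masking."""

    def __init__(self):
        self.bits = 0

    def step(self, reset, enable):
        self.bits = 0 if reset else (self.bits + int(enable)) & 0xF
        return self.bits


def cosimulate(model_a, model_b, cycles, seed=0):
    """Drive both models with the same semi-random inputs.

    Returns the first cycle at which the outputs differ, or None if the two
    models agree for the whole run.
    """
    rng = random.Random(seed)
    for cycle in range(cycles):
        reset = rng.random() < 0.02   # rare resets
        enable = rng.random() < 0.7   # mostly counting
        if model_a.step(reset, enable) != model_b.step(reset, enable):
            return cycle
    return None


if __name__ == "__main__":
    print(cosimulate(CounterA(), CounterB(), cycles=10_000))
```

A fixed seed keeps any mismatch reproducible, which is what makes a failing semi-random run usable as a directed test afterwards.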
Summary
• CMP architectures are inevitable in the near future
• Piranha investigates an extreme point in CMP design
  – many simple cores
• Piranha has a large architectural advantage over complex single-core designs (> 3x) for database applications
• Piranha methodology enables faster design turnaround
• Key to Piranha is application focus:
  – One-size-fits-all solutions may soon be infeasible
Reference
• Papers on commercial workload performance & Piranha
  research.compaq.com/wrl/projects/Database