CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)
Dimitris Kaseridis and Lizy K. John
The University of Texas at Austin
Laboratory for Computer Architecture
http://lca.ece.utexas.edu
10th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-10)
Outline
Brief Description of UltraSPARC T1
Objectives
SPECjbb2005 Benchmark
Results
UltraSPARC T1
A new multithreaded processor that combines CMP and SMT into CMT
8 cores, each handling 4 hardware thread contexts, for 32 active hardware threads in total
Simple in-order pipeline per core, with no branch prediction unit
Optimized for multithreaded performance (throughput)
High throughput: hides memory and pipeline stalls/latencies by scheduling other threads, with a zero-cycle thread switch penalty
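The latency-hiding idea above can be illustrated with a toy model. The sketch below is not the T1 pipeline: it assumes every thread issues one instruction and then stalls for a fixed number of cycles, that threads are picked round-robin, and that switching costs zero cycles. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(num_threads: int, stall_cycles: int, total_cycles: int = 10_000) -> float:
    """Return the fraction of cycles in which the core issues an instruction.

    Toy assumption: each thread issues one instruction, then is stalled for
    `stall_cycles` cycles. Switching between threads costs zero cycles.
    """
    ready_at = [0] * num_threads   # first cycle at which each thread may issue again
    next_thread = 0                # round-robin starting point
    issued = 0
    for cycle in range(total_cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                ready_at[t] = cycle + 1 + stall_cycles
                next_thread = (t + 1) % num_threads
                break              # at most one instruction issues per cycle
    return issued / total_cycles


for n in (1, 2, 4):
    print(n, "thread(s):", simulate(n, stall_cycles=3))
```

In this idealized model, four threads fully cover a three-cycle stall. Real threads on a core share the L1 cache, TLBs, and execution units, so measured gains are expected to be lower than the model suggests.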
UltraSPARC T1 Core Pipeline
Thread group shares L1 cache, TLBs, execution units, pipeline registers and datapath
Core area = 11 mm² (90 nm technology)
4-way MT adds ~20% area to the core
Objectives
Evaluate CMP/CMT benefits
Quantify the benefits that additional cores and/or additional hardware threads provide in a multithreaded environment
Show effectiveness of latency hiding
SPECjbb 2005 Benchmark Characteristics
Model a self contained 3-tier system: Server, Database and Clients
Every warehouse is a collection of Java objects with ~25MB of data
Each client is represented by an individual thread
No I/O effects
Reported score: Business Operations per Second (BOPS)
Targets performance of CPUs, caches, memory hierarchy and the scalability of shared memory processors
Stresses the implementations of: JVM (Java Virtual Machine), JIT (Just-In-Time) compiler, garbage collection and threads
[Figure: SPECjbb2005 3-tier architecture. Clients 1 to N, Business Logic Engine, Object Trees (Database).]
Parameters
Experimental parameters
Parameter                  Value
Operating System           SunOS 5.10 Generic_118833-17
CPU frequency              1 GHz
Main Memory Size           8 Gbytes DDR2 DRAM
JVM version                Java(TM) 2 build 1.5.0_06-b05
SPECjbb Execution Command  java -Xmx2560m -Xms2560m -Xmn1536m -Xss128k -XX:+UseParallelOldGC -XX:ParallelGCThreads=15 -XX:+AggressiveOpts -XX:LargePageSizeInBytes=256m -cp jbb.jar:check.jar spec.jbb.JBBmain -propfile SPECjbb.props
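Because the execution command is easy to mangle when copied from a slide, it may help to assemble it from its parts. The sketch below only rebuilds the argument list from the flags listed in the table and does not launch anything. The helper name `specjbb_command` and its parameters are hypothetical.

```python
import shlex

# JVM flags exactly as listed in the experimental parameters table.
JVM_FLAGS = [
    "-Xmx2560m", "-Xms2560m", "-Xmn1536m", "-Xss128k",
    "-XX:+UseParallelOldGC", "-XX:ParallelGCThreads=15",
    "-XX:+AggressiveOpts", "-XX:LargePageSizeInBytes=256m",
]


def specjbb_command(java: str = "java", props: str = "SPECjbb.props") -> list[str]:
    """Return the SPECjbb2005 launch command as an argument list."""
    return [java, *JVM_FLAGS,
            "-cp", "jbb.jar:check.jar",
            "spec.jbb.JBBmain", "-propfile", props]


command = specjbb_command()
print(shlex.join(command))  # single shell-ready line
```

Keeping the flags in a list avoids the stray spaces (for example `- Xss128k`) that make the command fail when pasted.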
Measurements Methodology
On-chip performance counters for real/accurate results
Niagara: Solaris 10 tools: cpustat, cputrack
2 counters per hardware thread, one of which is reserved for instruction count
Event Name    Description
Instr_cnt Number of completed instructions.
SB_full Number of store buffer full cycles
FP_instr_cnt Number of completed floating-point instructions
IC_miss Number of instruction cache (L1) misses
DC_miss Number of data cache (L1) misses for loads
ITLB_miss Number of instruction TLB miss traps taken.
DTLB_miss Number of data TLB miss traps taken (includes real_translation misses).
L2_imiss Number of secondary cache (L2) misses due to instruction cache requests.
L2_dmiss_ld Number of secondary cache (L2) misses due to data cache load requests.
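Since one of the two counters per hardware thread always counts instructions, each measurement pairs a single event count with `Instr_cnt`, which makes per-instruction rates the natural derived metric. The sketch below shows one way to normalize raw counts. The function names and sample numbers are hypothetical, and deriving cycles from elapsed time at the 1 GHz clock is an assumption, not something the slides state.

```python
def misses_per_kilo_instr(event_count: int, instr_count: int) -> float:
    """Events (e.g. DC_miss, L2_dmiss_ld) per 1000 completed instructions."""
    if instr_count <= 0:
        raise ValueError("instr_count must be positive")
    return 1000.0 * event_count / instr_count


def ipc(instr_count: int, elapsed_seconds: float, clock_hz: float = 1e9) -> float:
    """Instructions per cycle, assuming cycles = elapsed time x clock frequency."""
    cycles = elapsed_seconds * clock_hz
    if cycles <= 0:
        raise ValueError("elapsed_seconds and clock_hz must be positive")
    return instr_count / cycles


# Hypothetical sample counts, for illustration only.
print(misses_per_kilo_instr(event_count=50_000, instr_count=2_000_000))
print(ipc(instr_count=500_000_000, elapsed_seconds=1.0))
```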
Results – Latency hiding payoff
[Figure, two panels: single-thread execution on T1; single-core execution using 4 threads on one core. x-axis: Number of Warehouses (1 to 16); y-axis: SPECjbb score (BOPS), 500 to 3000.]
Speedup of 2x instead of 4x
CMP / CMT Scaling – CMP benefits
[Figure: 8 cores x 1 thread/core. x-axis: Number of Warehouses (0 to 18); y-axis: SPECjbb score (BOPS), 0 to 25000.]
Region 1: 2521x per additional core
Region 2: Benchmark saturation
CMP / CMT Scaling – CMT benefits
75% of the benefit of adding a single core
Significantly lower area and power requirements (remember that 4-way MT adds ~20% area to each core)
[Figure: 8 cores x 2 threads/core. x-axis: Number of Warehouses (1 to 32); y-axis: SPECjbb score (BOPS), 0 to 40000.]
Region 1: 2537 per core
Region 2: 1957 per thread
Region 3: Benchmark saturation
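The "75% of the benefit" figure can be cross-checked against the two slopes labeled in the figure for regions 1 and 2. The sketch below simply divides them, assuming both slopes are score gains in BOPS per added core or thread, which the labels do not state explicitly.

```python
# Slopes read from the figure labels (units assumed to be BOPS).
REGION1_PER_CORE = 2537    # score gained per additional core
REGION2_PER_THREAD = 1957  # score gained per additional hardware thread

thread_vs_core = REGION2_PER_THREAD / REGION1_PER_CORE
print(f"{thread_vs_core:.2f}")  # 0.77
```

The ratio of about 0.77 is roughly in line with the 75% quoted on the slide.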
CMP / CMT Scaling – SMT benefits
[Figure: 8 cores x 4 threads/core. x-axis: Number of Warehouses; y-axis: SPECjbb score (BOPS).]
Going beyond 2 hardware threads per core gives an additional benefit of 45%
Gradually diminishing returns in terms of SMT efficiency
Garbage collector significantly affects regions 4 and 5
[Figure: x-axis: Number of Warehouses; y-axis: SPECjbb score (BOPS).]
IPC of three configurations and best-case SPECjbb score speedup
[Figure: SPECjbb Score Scaling. x-axis: Number of Virtual Processors (0 to 30); y-axis: Normalized SPECjbb score (0 to 25).]
[Figure: IPC for 8 cores x 1 thread, 8 cores x 2 threads, and 8 cores x 4 threads. y-axis: IPC (0.15 to 0.6).]
Conclusions
Throughput vs. latency in multiprocessing/multithreaded environments
Latency hiding is a promising alternative to aggressive speculation
Adding SMT can give up to 75% of the benefit of CMP at significantly lower cost
Moving to higher levels of SMT shows diminishing returns
Tradeoffs between the number of cores and the number of hardware threads per core