CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)

Dimitris Kaseridis and Lizy K. John

The University of Texas at Austin
Laboratory for Computer Architecture
http://lca.ece.utexas.edu

10th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-10)


Outline

Brief Description of UltraSPARC T1

Objectives

SPECjbb2005 Benchmark

Results


UltraSPARC T1

A new multithreaded processor that combines CMP and SMT into CMT (chip multithreading)

8 cores, each handling 4 hardware thread contexts, for 32 active hardware threads in total

Simple in-order pipeline with no branch prediction unit per core

Optimized for multithreaded performance (throughput)

High throughput: memory and pipeline stalls/latencies are hidden by scheduling other threads, with a zero-cycle thread-switch penalty
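The latency-hiding idea on this slide can be illustrated with a toy model. The sketch below is not the T1 thread-select logic: it assumes a round-robin pick among ready threads, and the parameters `run_len` (instructions between stalls) and `stall` (stall length in cycles) are hypothetical values chosen only for illustration.

```python
def simulate_core(num_threads, run_len=4, stall=12, cycles=10_000):
    """Toy model of one in-order core issuing at most one instruction per cycle.

    Each thread issues `run_len` instructions, then stalls for `stall` cycles
    (standing in for a cache miss). Switching threads costs zero cycles.
    Returns the fraction of cycles in which an instruction was issued.
    """
    ready_at = [0] * num_threads      # first cycle at which each thread may issue
    since_stall = [0] * num_threads   # instructions issued since the last stall
    issued = 0
    next_thread = 0                   # round-robin pointer (an assumption)
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                issued += 1
                since_stall[t] += 1
                if since_stall[t] == run_len:
                    # The thread now waits; other threads may use the pipeline.
                    since_stall[t] = 0
                    ready_at[t] = cycle + 1 + stall
                next_thread = (t + 1) % num_threads
                break
    return issued / cycles


for n in (1, 2, 4):
    print(n, "thread(s): pipeline utilization =", simulate_core(n))
```

With these assumed parameters a single thread keeps the pipeline busy for 4 of every 16 cycles; additional threads fill part of the idle slots, so utilization rises without any speculation hardware.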


SMP vs. CMT


UltraSPARC T1 Core Pipeline

Thread group shares L1 cache, TLBs, execution units, pipeline registers and datapath

Core area = 11 mm² (90 nm technology)

4-way MT adds ~20% area to the core


Objectives

Evaluate CMP/CMT benefits

Quantify the benefits that additional cores and/or additional hardware threads provide in a multithreaded environment

Show effectiveness of latency hiding


SPECjbb2005 Benchmark Characteristics

Models a self-contained 3-tier system: server, database and clients

Every warehouse is a collection of Java objects with ~25MB of data

Each client is represented by an individual thread

No I/O effects

Reported score: Business Operations per Second (BOPS)

Targets performance of CPUs, caches, memory hierarchy and the scalability of shared memory processors

Stresses the implementations of: JVM (Java Virtual Machine), JIT (Just-In-Time) compiler, garbage collection and threads

[Figure: SPECjbb2005 3-tier architecture. Clients 1 to N, Business Logic Engine, Database (Object Trees).]

Parameters

Experimental parameters

Parameter                    Value
Operating System             SunOS 5.10 Generic_118833-17
CPU frequency                1 GHz
Main Memory Size             8 GB DDR2 DRAM
JVM version                  Java(TM) 2 build 1.5.0_06-b05
SPECjbb Execution Command    java -Xmx2560m -Xms2560m -Xmn1536m -Xss128k -XX:+UseParallelOldGC -XX:ParallelGCThreads=15 -XX:+AggressiveOpts -XX:LargePageSizeInBytes=256m -cp jbb.jar:check.jar spec.jbb.JBBmain -propfile SPECjbb.props
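As a reading aid for the memory flags in the execution command, the following sketch parses the four size flags and derives the heap layout they imply. It relies on the usual HotSpot meanings (`-Xmx` maximum heap, `-Xms` initial heap, `-Xmn` young generation, `-Xss` thread stack); the helper `parse_size` is a hypothetical name introduced here, not part of any tool used in the study.

```python
import re

# Size flags copied from the execution command in the parameters table.
JVM_FLAGS = ["-Xmx2560m", "-Xms2560m", "-Xmn1536m", "-Xss128k"]

UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_size(flag):
    """Split a flag such as '-Xmx2560m' into ('mx', size_in_bytes)."""
    match = re.fullmatch(r"-X(mx|ms|mn|ss)(\d+)([kmg])", flag)
    if match is None:
        raise ValueError("not a recognized size flag: " + flag)
    name, number, unit = match.groups()
    return name, int(number) * UNITS[unit]


sizes = dict(parse_size(flag) for flag in JVM_FLAGS)

heap_is_fixed = sizes["mx"] == sizes["ms"]   # initial heap equals maximum heap
old_gen = sizes["mx"] - sizes["mn"]          # heap left for the old generation

print("fixed-size heap:", heap_is_fixed)
print("young generation (MB):", sizes["mn"] // UNITS["m"])
print("old generation (MB):", old_gen // UNITS["m"])
```

Setting the initial and maximum heap to the same value avoids heap resizing during the run, which keeps garbage-collection behavior more comparable across configurations.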


Measurements Methodology

On-chip performance counters for real/accurate results

Niagara: Solaris 10 tools cpustat and cputrack

2 counters per hardware thread, one of which counts only instructions

Event Name      Description
Instr_cnt       Number of completed instructions
SB_full         Number of store buffer full cycles
FP_instr_cnt    Number of completed floating-point instructions
IC_miss         Number of instruction cache (L1) misses
DC_miss         Number of data cache (L1) misses for loads
ITLB_miss       Number of instruction TLB miss traps taken
DTLB_miss       Number of data TLB miss traps taken (includes real_translation misses)
L2_imiss        Number of secondary cache (L2) misses due to instruction cache requests
L2_dmiss_ld     Number of secondary cache (L2) misses due to data cache load requests
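Raw event counts like these are usually turned into rates before configurations are compared. The sketch below shows one plausible post-processing step, not the authors' scripts: it normalizes each event by `Instr_cnt` and derives IPC from a cycle count. The sample values and the function name `derived_metrics` are hypothetical, and the cycle count is assumed to come from elapsed time multiplied by the 1 GHz clock.

```python
def derived_metrics(counts, cycles):
    """Turn raw event counts into IPC and events per 1000 instructions.

    counts: dict mapping event name to count; must contain 'Instr_cnt'.
    cycles: core cycles covered by the measurement interval.
    """
    instr = counts["Instr_cnt"]
    metrics = {"IPC": instr / cycles}
    for event, value in counts.items():
        if event != "Instr_cnt":
            metrics[event + "_per_1k_instr"] = 1000.0 * value / instr
    return metrics


# Hypothetical sample: a 10 ms interval at 1 GHz is 10,000,000 cycles.
sample_counts = {"Instr_cnt": 2_000_000, "DC_miss": 50_000, "L2_dmiss_ld": 4_000}
metrics = derived_metrics(sample_counts, cycles=10_000_000)

for name, value in metrics.items():
    print(name, "=", value)
```

Because one of the two counters is tied to the instruction count, each measurement pairs a single event with `Instr_cnt`, so covering the whole event list presumably takes several runs whose results are merged into one `counts` dictionary.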

Results – Latency hiding pays off

[Figure: SPECjbb score (BOPS, 500 to 3000) vs. number of warehouses (1 to 16), two panels: single-thread execution on T1, and single-core execution using 4 threads on one core. Annotation: x2 instead of x4.]

CMP / CMT Scaling – CMP benefits

[Figure: SPECjbb score (BOPS, 0 to 25,000) vs. number of warehouses (0 to 18) for 8 cores x 1 thread/core. Region 1: 2521x per additional core. Region 2: benchmark saturation.]

CMP / CMT Scaling – CMT benefits

An additional hardware thread gives 75% of the benefit of adding a single core

Significantly lower area and power requirements (remember that 4-way MT adds ~20% area to each core)

[Figure: SPECjbb score (BOPS, 0 to 40,000) vs. number of warehouses (1 to 32) for 8 cores x 2 threads/core. Region 1: 2537 per core. Region 2: 1957 per thread. Region 3: benchmark saturation.]
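The 75% figure on this slide can be checked against the slopes annotated on the plot (2537 per core in Region 1, 1957 per thread in Region 2). The sketch below does that arithmetic and adds a rough score-per-area comparison. The area comparison is an assumption made here for illustration: it charges the full ~20% area of 4-way multithreading to a single extra thread, which is deliberately pessimistic for multithreading.

```python
PER_CORE_SLOPE = 2537.0     # Region 1: score gained per additional core
PER_THREAD_SLOPE = 1957.0   # Region 2: score gained per additional thread
MT_AREA_OVERHEAD = 0.20     # 4-way MT adds ~20% area to a core

# Benefit of one extra hardware thread relative to one extra core.
thread_vs_core = PER_THREAD_SLOPE / PER_CORE_SLOPE

# Score gained per unit of core area, taking a full core as area 1.0.
# Assumption: the whole 4-way MT overhead is billed to one extra thread.
score_per_area_core = PER_CORE_SLOPE / 1.0
score_per_area_thread = PER_THREAD_SLOPE / MT_AREA_OVERHEAD
area_efficiency_gain = score_per_area_thread / score_per_area_core

print("thread benefit / core benefit:", round(thread_vs_core, 3))
print("score per area, thread vs. core:", round(area_efficiency_gain, 2))
```

The first ratio comes out slightly above three quarters, in line with the 75% quoted on the slide; the second is only as good as the area assumption behind it.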

CMP / CMT Scaling – SMT benefits

[Figure: SPECjbb score (BOPS) vs. number of warehouses for 8 cores x 4 threads/core.]

CMP / CMT Scaling – SMT benefits

Going beyond 2 hardware threads per core gives an additional benefit of 45%

Gradually diminishing returns in terms of SMT efficiency

The garbage collector significantly affects regions 4 and 5

[Figure: SPECjbb score (BOPS) vs. number of warehouses.]

[Figure: IPC of three configurations (8 cores x 1 thread, 8 cores x 2 threads, 8 cores x 4 threads); IPC axis from 0.15 to 0.6.]

[Figure: Best-case SPECjbb score speedup ("SPECjbb Score Scaling"): normalized SPECjbb score (0 to 25) vs. number of virtual processors (0 to 30).]

Conclusions

Throughput vs. latency in multiprocessing/multithreaded environments

Latency hiding is a good/promising alternative to aggressive speculation

Adding SMT can give up to 75% of the benefit of CMP at significantly lower cost

Moving to higher levels of SMT shows diminishing returns

Trade-offs between the number of cores and the number of hardware threads per core

Thank you…

Questions??

The Laboratory for Computer Architecture

Web-site: http://lca.ece.utexas.edu