CMP/CMT Scaling of SPECjbb2005 on UltraSPARC T1 (Niagara)

Dimitris Kaseridis and Lizy K. John

The University of Texas at Austin

Laboratory for Computer Architecture

http://lca.ece.utexas.edu

10th Workshop on Computer Architecture Evaluation using Commercial Workloads (CAECW-10)

Outline

Brief Description of UltraSPARC T1

Objectives

SPECjbb2005 Benchmark

Results


UltraSPARC T1

A new multithreaded processor that combines CMP and SMT into CMT

8 cores, each handling 4 hardware thread contexts, for 32 active hardware threads

Simple in-order pipeline per core, with no branch prediction unit

Optimized for multithreaded performance (throughput)

High throughput: memory and pipeline stalls/latencies are hidden by scheduling other threads, with a zero-cycle thread switch penalty
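
The latency-hiding idea above can be illustrated with a very small analytic model. This is only a sketch under stated assumptions, not a description of the T1 scheduler: each thread is assumed to alternate between a fixed number of compute cycles and a fixed number of stall cycles, and switching threads is assumed to cost zero cycles. The function name and the cycle counts are hypothetical.

```python
def core_utilization(threads, compute_cycles, stall_cycles):
    """Fraction of cycles in which the pipeline does useful work.

    Simplified model (an assumption, not taken from the slides): every
    thread repeats `compute_cycles` of work followed by `stall_cycles`
    of waiting, and a thread switch costs zero cycles.
    """
    per_thread = compute_cycles / (compute_cycles + stall_cycles)
    # Utilization cannot exceed 1.0: the pipeline issues from one thread at a time.
    return min(1.0, threads * per_thread)


# Hypothetical workload: 10 cycles of work for every 30 cycles of stall.
for n in (1, 2, 4):
    print(n, "thread(s):", core_utilization(n, 10, 30))
```

In this idealized model the stalls are fully hidden once enough threads are available; real workloads share caches and execution units, so measured gains are expected to be smaller.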


SMP vs. CMT


UltraSPARC T1 Core Pipeline

Thread group shares L1 cache, TLBs, execution units, pipeline registers and datapath

Core area = 11 mm² (90 nm technology)

4-way MT adds ~20% area to the core


Objectives

Evaluate CMP/CMT benefits

Quantify the benefits that additional cores and/or additional hardware threads provide in a multithreaded environment

Show effectiveness of latency hiding


SPECjbb 2005 Benchmark Characteristics

Models a self-contained 3-tier system: server, database and clients

Every warehouse is a collection of Java objects with ~25MB of data

Each client is represented by an individual thread

No I/O effects

Reported score: Business Operations per Second (BOPS)

Targets performance of CPUs, caches, memory hierarchy and the scalability of shared memory processors

Stresses the implementations of: JVM (Java Virtual Machine), JIT (Just-In-Time) compiler, garbage collection and threads

[Figure: SPECjbb2005 3-tier architecture. Clients 1 to N, Business Logic Engine, Database (object trees)]


Experimental Parameters

Parameter                   Value
Operating System            SunOS 5.10 Generic_118833-17
CPU frequency               1 GHz
Main Memory Size            8 Gbytes DDR2 DRAM
JVM version                 Java(TM) 2 build 1.5.0_06-b05
SPECjbb Execution Command   java -Xmx2560m -Xms2560m -Xmn1536m -Xss128k -XX:+UseParallelOldGC -XX:ParallelGCThreads=15 -XX:+AggressiveOpts -XX:LargePageSizeInBytes=256m -cp jbb.jar:check.jar spec.jbb.JBBmain -propfile SPECjbb.props


Measurements Methodology

On-chip performance counters for real/accurate results

Niagara: Solaris 10 tools: cpustat, cputrack

2 counters per hardware thread, one of which counts only instructions

Event Name     Description
Instr_cnt      Number of completed instructions
SB_full        Number of store buffer full cycles
FP_instr_cnt   Number of completed floating-point instructions
IC_miss        Number of instruction cache (L1) misses
DC_miss        Number of data cache (L1) misses for loads
ITLB_miss      Number of instruction TLB miss traps taken
DTLB_miss      Number of data TLB miss traps taken (includes real_translation misses)
L2_imiss       Number of secondary cache (L2) misses due to instruction cache requests
L2_dmiss_ld    Number of secondary cache (L2) misses due to data cache load requests
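
As an illustration of how such raw counts might be turned into comparable metrics, the sketch below derives per-thread IPC and an events-per-1000-instructions rate. It is a minimal example under assumptions: the cycle count is estimated as elapsed time multiplied by the 1 GHz clock from the parameter table, and the function name and sample numbers are hypothetical, not measured values.

```python
CPU_HZ = 1_000_000_000  # 1 GHz clock, taken from the experimental parameters


def derived_metrics(instr_cnt, event_cnt, seconds, cpu_hz=CPU_HZ):
    """Derive IPC and an event rate from one sampling interval.

    instr_cnt : Instr_cnt value for the interval
    event_cnt : value of the second counter (e.g. DC_miss or L2_dmiss_ld)
    seconds   : length of the sampling interval
    Cycles are estimated as seconds * cpu_hz (an assumption of this sketch).
    """
    cycles = seconds * cpu_hz
    return {
        "ipc": instr_cnt / cycles,
        "events_per_kilo_instr": 1000.0 * event_cnt / instr_cnt,
    }


# Hypothetical sample: 500M instructions and 10M events in a 1-second interval.
sample = derived_metrics(instr_cnt=500_000_000, event_cnt=10_000_000, seconds=1.0)
print(sample)
```

Because only one counter besides Instr_cnt is available per hardware thread, each additional event would need its own run or sampling interval.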


Results – Latency Hiding Payoff

[Figure: two panels plotting SPECjbb score (BOPS) against number of warehouses (1 to 16): single-thread execution on T1, and single-core execution using 4 threads on one core]

Using 4 threads on one core: x2 instead of x4

CMP / CMT Scaling – CMP Benefits

[Figure: SPECjbb score (BOPS), 0 to 25000, for the 8 cores x 1 thread/core configuration]

Region 1: 2521x per additional core

Region 2: benchmark saturation


CMP / CMT Scaling – CMT benefits

An additional hardware thread gives 75% of the benefit of adding a single core

Significantly lower area and power requirements (remember that 4-way MT adds ~20% area to each core)

[Figure: SPECjbb score (BOPS), 0 to 40000, for the 8 cores x 2 threads/core configuration; x-axis 1 to 32]

Region 1: 2537 per core

Region 2: 1957 per thread

Region 3: benchmark saturation
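
The 75% figure can be checked against the slopes annotated on the plot. The short calculation below assumes the annotations read as a gain of 2537 per additional core (Region 1) and 1957 per additional thread (Region 2); the variable names are illustrative only.

```python
PER_CORE_GAIN = 2537    # Region 1: score gain per additional core
PER_THREAD_GAIN = 1957  # Region 2: score gain per additional hardware thread

# Benefit of one extra hardware thread relative to one extra core.
ratio = PER_THREAD_GAIN / PER_CORE_GAIN
print(f"thread gain / core gain = {ratio:.2f}")
```

Under that reading the ratio comes out at about 0.77, in line with the roughly 75% stated on the slide.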



CMP / CMT Scaling – SMT Benefits

[Figure: SPECjbb score (BOPS) for the 8 cores x 4 threads/core configuration]


Going beyond 2 hardware threads per core gives an additional benefit of 45%

Gradually diminishing returns in terms of SMT efficiency

The garbage collector significantly affects regions 4 and 5


CMP / CMT Scaling – SMT Benefits

[Figure: IPC of three configurations (8 cores x 1 thread, 8 cores x 2 threads, 8 cores x 4 threads); IPC axis from 0.15 to 0.6]

[Figure: SPECjbb Score Scaling, best-case SPECjbb score speedup: normalized SPECjbb score (0 to 25) vs. number of virtual processors (0 to 30)]


Conclusions

Throughput vs. latency in multiprocessing/multithreaded environments

Latency hiding is a promising alternative to aggressive speculation

Adding SMT can give up to 75% of the benefit of CMP at significantly lower cost

Moving to higher levels of SMT shows diminishing returns

Trade-offs between the number of cores and the number of hardware threads per core


Thank you…

Questions??

The Laboratory for Computer Architecture

Web-site: http://lca.ece.utexas.edu
