Leveraging Simultaneous Multithreading for Adaptive ...dbrooks/tacs05/donald_slides.pdf · Process...

Leveraging Simultaneous Multithreading for Adaptive Thermal Control

James Donald & Margaret MartonosiPrinceton University

Outline

• Background & Problem Description• Related Work• Experimental Setup & Benchmarks• Adaptive Fetch Algorithm• Adaptive Register Renaming• Future Work & Conclusions

Background

• Temperature-Aware Design– Increasingly relevant w/ greater transistor densities– Tools e.g. HotSpot

• Applied to multithreaded architectures (SMT, CMP)– Applicable to prevalent emerging architectures– Provides opportunity for smarter thermal

management

Localized Thermal Hotspots• A single hot spot

limits performance.• Global DTM (e.g.

DVFS) handles this at expense of entire core’s performance.

Hottest Spots for fma3d-twolf with Basic Fetch Throttling

84.9084.9184.9284.9384.9484.9584.9684.9784.9884.9985.0085.01

0 20 40 60 80 100 120 140 160Time (ms)

Tem

pera

ture

(°C

)

FXU registers

FPU registers

Proposed Solution

• Selectively fetch instructions to mitigate single hotspot’s severity.

• Our experiments: two hotspots monitored.

Hottest Spots for fma3d-twolf with Adaptive Fetch

84.9084.9184.9284.9384.9484.9584.9684.9784.9884.9985.0085.01

0 20 40 60 80 100 120 140Time (ms)

Tem

pera

ture

(°C

)

FXU registers

FPU registers

fetch policythermal sensors

profiling data

thread#1thread#2

IFU decision

Outline


Related Work

• M. Powell, M. Gomaa and T. N. Vijaykumar. “Heat and Run: Leveraging SMT and CMP to Manage Power Density Through the Operating System”. ASPLOS 2004.

• Y. Li, Z. Hu, D. Brooks and K. Skadron. “Performance, Energy, and Thermal Considerations for SMT and CMP Architectures”. HPCA 2005.

Outline


Simulation Setup

• Turandot PowerPC simulator w/ SMT• PowerTimer + leakage power estimation• HotSpot 2.0

Machine ParametersProcessor Configuration

Process Technology 0.18 µ

Supply Voltage 1.2 V

Clock Rate 1.4 GHz

Organization single-core, 2-context SMT

Functional Units 2 FXU, 2 FPU, 2 LSU, 1 BRU

PhysicalRegisters

120 GPR, 90 FPR

Thermal Control fetch throttling

Processor Floorplan

SPEC Benchmark Heat Behaviorsbenchmark type function FXUreg FPUregammp FP computational chemistryapplu FP CFD/physicsfma3d FP mechanical responsegalgel FP CFD

gcc int compilergzip int compressionmcf int transportation schedmesa FP 3-D graphicsparser int word processingtwolf int lithography routing

Test WorkloadsSMT workload Mix typeammp-gzip FP-integerammp-mcf FP-integerapplu-parser FP-integerapplu-twolf FP-integerfma3d-galgel FP-FPfma3d-twolf FP-integergalgel-mesa FP-integergcc-mesa integer-FPgcc-parser integer-integergzip-mcf integer-integer

• Tests a wide range of integer/floating-point and hot/hotter mixes.

Outline


Adaptive Fetch Algorithm (a)• Decide which hotspot to mitigate:

Total Int regfile accesses

Total FP regfile accesses

Total Instructions fetched

/

/

Thermal threshold °

Int regfile temp °

FP regfile temp °

-

-

/

/

comparison

Critical regfile

gapetemperaturrateaccessregfileoverheattorate =

Adaptive Fetch Algorithm (b)• Favor best thread for the identified hotspot.

– Compare which thread has lower:

Thread1 Int(or FP) regfile accesses

Thread2 Int(or FP) regfile accesses

Thread2 Instructions fetched

/

/

Thread1 Instructions fetched

comparison

Priority thread

Moderate policy: 50% of cycles default to round-robinAggressive policy: 12.5%

instraccesses

Metrics

• Performance – IPC and weighted speedup

• Power-performance – ED2 product

• Thread fairness (percentages of retired instructions)

Performance Results57

.8%

/ 42

.2%

66.7

% /

33.3

%

63.5

% /

36.5

%

60.6

% /

39.4

%

43.4

% /

56.6

%

60.4

% /

39.6

%

60.8

% /

39.2

%

53.0

% /

47.0

%

62.0

% /

38.0

%

57.1

% /

42.9

%58.9

% /

41.1

%

68.1

% /

31.9

%

61.8

% /

38.2

%

61.0

% /

39.0

%

41.4

% /

58.6

%

62.6

% /

37.4

%

60.7

% /

39.3

%

50.5

% /

49.5

%

60.4

% /

39.6

%

59.0

% /

41.0

%

70.3

% /

29.7

%

78.4

% /

21.6

%

47.5

% /

52.5

%

51.1

% /

48.9

%

40.8

% /

59.2

%

62.1

% /

37.9

%

35.6

% /

64.4

%

22.5

% /

77.5

% 32.8

% /

67.2

%

72.3

% /

27.7

%

00.10.20.30.40.50.60.70.80.9

1

ammp-gzip

ammp-mcf

applu-parser

applu-twolf

fma3d-galgel

fma3d-twolf

galgel-mesa

gcc-mesa

gcc-parser

gzip-mcf

workload mix

Wei

ghte

d sp

eedu

p

baseline (fetch throttle)moderate adaptive fetchaggressive adaptive fetch

Significant performance increases at the cost of thread fairness.

Results – ED2 Product57

.8%

/ 42

.2%

66.7

% /

33.3

%

63.5

% /

36.5

%

60.6

% /

39.4

%

43.4

% /

56.6

%

60.4

% /

39.6

%

60.8

% /

39.2

%

53.0

% /

47.0

%

62.0

% /

38.0

%

57.1

% /

42.9

%

58.9

% /

41.1

%

68.1

% /

31.9

%

61.8

% /

38.2

%

61.0

% /

39.0

%

41.4

% /

58.6

%

62.6

% /

37.4

%

60.7

% /

39.3

%

50.5

% /

49.5

%

60.4

% /

39.6

%

59.0

% /

41.0

%

70.3

% /

29.7

%

78.4

% /

21.6

%

47.5

% /

52.5

%

51.1

% /

48.9

%

40.8

% /

59.2

%

62.1

% /

37.9

%

35.6

% /

64.4

%

22.5

% /

77.5

%

32.8

% /

67.2

%

72.3

% /

27.7

%

00.10.20.30.40.50.60.70.80.9

11.11.21.31.41.5

ammp-gzip

ammp-mcf

applu-parser

applu-twolf

fma3d-galgel

fma3d-twolf

galgel-mesa

gcc-mesa

gcc-parser

gzip-mcf

workload mix

ED2 p

rodu

ct (n

orm

aliz

ed)

baseline (fetch throttle)moderate adaptive fetchaggressive adaptive fetch

Quadraticallycorrelated ED2

reduction.

Outline


Adaptive Register Renaming

• Same counter-based thread priority decision algorithm.• Difference: Decision affects thread priority for register

mapping instead of instruction fetch.• Disadvantage: Response at later pipeline stage already

allows instructions in to waste resources.• Advantage: Closer to point of interest (registers),

allows more isolated, simpler implementation.

Speedup – Adaptive Renaming57

.7%

/ 42

.3%

66.6

% /

33.4

%

63.2

% /

36.8

%

60.3

% /

39.7

%

43.1

% /

56.9

%

60.3

% /

39.7

%

60.8

% /

39.2

%

53.0

% /

47.0

%

61.8

% /

38.2

%

57.1

% /

42.9

%

58.4

% /

41.6

%

67.0

% /

33.0

%

62.7

% /

37.3

%

61.3

% /

38.7

%

40.5

% /

59.5

%

61.3

% /

38.7

%

61.8

% /

38.2

%

47.7

% /

52.3

%

56.8

% /

43.2

%

57.5

% /

42.5

%

62.5

% /

37.5

%

70.9

% /

29.1

%

46.1

% /

53.9

%

61.7

% /

38.3

%

31.9

% /

68.1

%

64.9

% /

35.1

%

45.9

% /

54.1

%

23.4

% /

76.6

%

32.7

% /

67.3

%

62.1

% /

37.9

%

00.10.20.30.40.50.60.70.80.9

1

ammp-gzip

ammp-mcf

applu-parser

applu-twolf

fma3d-galgel

fma3d-twolf

galgel-mesa

gcc-mesa

gcc-parser

gzip-mcf

workload mix

Wei

ghte

d sp

eedu

p

baseline (rename throttle)moderate adaptive renamingaggressive adaptive renaming

ED2 – Adaptive Renaming57

.8%

/ 42

.2%

66.7

% /

33.3

%

63.5

% /

36.5

%

60.6

% /

39.4

%

43.4

% /

56.6

%

60.4

% /

39.6

%

60.8

% /

39.2

%

53.0

% /

47.0

%

62.0

% /

38.0

%

57.1

% /

42.9

%

58.9

% /

41.1

%

68.1

% /

31.9

%

61.8

% /

38.2

%

61.0

% /

39.0

%

41.4

% /

58.6

%

62.6

% /

37.4

%

60.7

% /

39.3

%

50.5

% /

49.5

%

60.4

% /

39.6

%

59.0

% /

41.0

%

70.3

% /

29.7

%

78.4

% /

21.6

%

47.5

% /

52.5

%

51.1

% /

48.9

%

40.8

% /

59.2

%

62.1

% /

37.9

%

35.6

% /

64.4

%

22.5

% /

77.5

%

32.8

% /

67.2

%

72.3

% /

27.7

%

00.10.20.30.40.50.60.70.80.9

11.1

ammp-gzip

ammp-mcf

applu-parser

applu-twolf

fma3d-galgel

fma3d-twolf

galgel-mesa

gcc-mesa

gcc-parser

gzip-mcf

workload mix

ED2 p

rodu

ct (n

orm

aliz

ed)

baseline (rename throttle)moderate adaptive renamingaggressive adaptive renaming

Outline


Future Work

• Implement in context of multicore (SMT combined CMP) processors.

• Comparison to real-world fetch policies (e.g. ICOUNT).

• Tradeoffs when in conjunction with global DTM strategies.

Conclusions

• Adaptive fetch: 30% performance improvement (44% ED2 reduction)

• Adaptive rename: 23% performance improvement (35% ED2 reduction)

• Benefits even in integer-only or FP-only mixes.

• Tradeoff between power-performance benefits and thread fairness.

Acknowledgements

• Special thanks– IBM– Yingmin Li– The anonymous reviewers

Leveraging Simultaneous Multithreading for Adaptive ...dbrooks/tacs05/donald_slides.pdf · Process...

Documents

Transcript of Leveraging Simultaneous Multithreading for Adaptive ...dbrooks/tacs05/donald_slides.pdf · Process...