Pushing Performance, Efficiency and Scalability of Microprocessors

Pushing Performance, Efficiency and Scalability of Microprocessors CERCS IAB Meeting, Fall 2006 Gabriel Loh



Page 1: Pushing Performance, Efficiency and Scalability of Microprocessors

Pushing Performance, Efficiency and Scalability of Microprocessors
CERCS IAB Meeting, Fall 2006
Gabriel Loh

Page 2: Pushing Performance, Efficiency and Scalability of Microprocessors

Research Overview

• Funding from state of GA, Intel, MARCO

• Currently 2 PhD students, 2 MS
  – Active undergrad research as well

• Collaborations
  – Universities: PSU, UO, Rutgers
  – Industry: Intel, IBM

Page 3: Pushing Performance, Efficiency and Scalability of Microprocessors

Research Focus

• “Near-term” microprocessor design issues
  – ~5-year time scale
  – Power/performance/complexity
  – Traditional uniprocessor performance
  – Multi-core performance

• “Longer-term”
  – Keeping Moore’s Law alive for the longer term
  – Primarily, 3D integration for now

Page 4: Pushing Performance, Efficiency and Scalability of Microprocessors

Scaling Performance and Efficiency

• Multi-cores are here, but single-thread perf still matters
  – Intel Core 2 Duo is multi-core, but…
  – Single core is more OOO than ever
    • Larger instruction window, improved branch prediction, speculative load-store ordering, wider pipe and decoders
  – But power also really matters
    • Lower clock speeds, different channel length transistors, more uop fusion, …

Page 5: Pushing Performance, Efficiency and Scalability of Microprocessors

Research Focus

• Maximum performance within bounds
  – Bounds = power, area, TDP, …

• Single-core performance helps multi-core performance, too
  – For future multi-core systems, need to strike a good balance between single-thread (1T) and multi-thread (MT) performance

• Most of our research is at the uarch level
  – Caches, branch predictors, instruction schedulers, memory queue design, memory dependence prediction, etc.

Page 6: Pushing Performance, Efficiency and Scalability of Microprocessors

Highlight: Traditional Caching [MICRO’06]

• Well known that different apps respond differently to different replacement policies

• Previous work in the OS domain has described adaptive replacement with provable bounds on performance

• Adapted techniques for on-chip caches

Page 7: Pushing Performance, Efficiency and Scalability of Microprocessors

Idea…

Page 8: Pushing Performance, Efficiency and Scalability of Microprocessors

Adaptive Cache Implementation

• Theoretical Guarantees
  – Miss rate provably bounded to be within a factor of two of the better algorithm

In practice, it’s much better
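The adaptive idea on this slide can be illustrated with a small simulation. What follows is only a minimal sketch under stated assumptions, not the implementation from the MICRO’06 paper: the slides do not name the component policies, so LRU and MRU are used here as stand-ins, the cache is a single fully associative set, and the names `ShadowTags`, `AdaptiveCache` and `simulate` are hypothetical. The cache keeps a tag-only model of each policy, counts each model's misses, and on a real miss evicts a block that the currently better policy would not be holding. The sketch does not by itself establish the factor-of-two bound.

```python
from collections import OrderedDict


class ShadowTags:
    """Tag-only model of one replacement policy (hypothetical helper)."""

    def __init__(self, capacity, evict_most_recent):
        self.capacity = capacity
        self.evict_most_recent = evict_most_recent  # False = LRU, True = MRU
        self.order = OrderedDict()  # oldest use first, newest use last
        self.misses = 0

    def access(self, tag):
        if tag in self.order:
            self.order.move_to_end(tag)  # record the new use
            return True
        self.misses += 1
        if len(self.order) >= self.capacity:
            # last=False pops the least recently used tag, last=True the most recent
            self.order.popitem(last=self.evict_most_recent)
        self.order[tag] = None
        return False


class AdaptiveCache:
    """Single-set cache that imitates whichever shadow policy has missed less."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = set()
        self.lru = ShadowTags(capacity, evict_most_recent=False)
        self.mru = ShadowTags(capacity, evict_most_recent=True)
        self.misses = 0

    def access(self, tag):
        hit = tag in self.blocks
        # Both shadow policies observe every access, hit or miss.
        self.lru.access(tag)
        self.mru.access(tag)
        if hit:
            return True
        self.misses += 1
        if len(self.blocks) >= self.capacity:
            # Ties go to LRU because it is listed first.
            better = min((self.lru, self.mru), key=lambda p: p.misses)
            # Prefer evicting a block the better policy no longer holds.
            candidates = [b for b in self.blocks if b not in better.order]
            victim = min(candidates) if candidates else min(self.blocks)
            self.blocks.remove(victim)
        self.blocks.add(tag)
        return False


def simulate(trace, capacity):
    """Run a trace of block tags and return the miss counts of all three caches."""
    cache = AdaptiveCache(capacity)
    for tag in trace:
        cache.access(tag)
    return {"adaptive": cache.misses, "lru": cache.lru.misses, "mru": cache.mru.misses}
```

A call such as `simulate([i % 5 for i in range(100)], capacity=4)` exercises a cyclic scan one block larger than the cache, a pattern on which LRU misses on every access while the adaptive cache follows the better-behaved policy.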

Page 9: Pushing Performance, Efficiency and Scalability of Microprocessors

Current Research

• Working on multi-core generalizations of adaptive caching and other ways to manage shared resources

• Uniprocessor microarchitecture
  – Scalable memory scheduling [MICRO’06]
  – Memory dependence prediction [HPCA’06]
  – Branch prediction […]
  – And more…

Page 10: Pushing Performance, Efficiency and Scalability of Microprocessors

Longer-Term Processor Scaling

• Limitations/Obstacles
  – Wire scaling
    • Latency/performance
    • Power
  – Feature size
    • Lithography, parametric variations
  – Off-chip communication

Page 11: Pushing Performance, Efficiency and Scalability of Microprocessors

3D Integration

• Wire
  – Power/perf.
• Off-chip
• Feature size
  – Limitations, variations

[Figure: Die/Wafer Stacking. Labels: Active Layer 1, Active Layer 2, Metal Layers 1, Die-to-Die Vias, Metal Layers 2]

Less RC → faster, lower-power
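The link between shorter wires and lower RC can be made concrete with a first-order model. The sketch below assumes a uniform distributed RC wire, whose 50% delay is commonly approximated as 0.38 times total resistance times total capacitance, and a switching energy of C·V² per charge. The function names and any per-micron values passed in are illustrative assumptions, not figures from the talk.

```python
def wire_delay(r_per_um, c_per_um, length_um):
    """Approximate 50% delay of a distributed RC wire: 0.38 * R_total * C_total."""
    r_total = r_per_um * length_um
    c_total = c_per_um * length_um
    return 0.38 * r_total * c_total


def switching_energy(c_per_um, length_um, vdd):
    """Energy drawn from the supply to charge the wire capacitance once: C * V^2."""
    return c_per_um * length_um * vdd ** 2
```

Because both total resistance and total capacitance grow with length, this model makes delay scale with the square of wire length and energy scale linearly, so a wire shortened to half its length by stacking would see roughly a quarter of the delay and half the switching energy.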

Page 12: Pushing Performance, Efficiency and Scalability of Microprocessors

Example: Caches

Simplified 2D SRAM Array

3D Bitline Stacking

Wordline length halved
• In our studies, WL was critical for latency

3D Wordline Stacking

Bitline length halved
• BL reduction has greater impact on power savings
• Split decoder → no activity stacking

We’ve studied a wide variety of other CPU building blocks

Page 13: Pushing Performance, Efficiency and Scalability of Microprocessors

Uarch-level 3D design

Example: 4-die significance-partitioned datapath

Use uarch prediction mechanism for early determination of width

Smaller footprint → faster and lower-power

Width-based gating → even lower power, close to original power density

Overall: 47% performance gain at only 2 degree temperature increase
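One way to picture early width determination is a small PC-indexed predictor. This is a hedged sketch only: the slide does not describe the mechanism, so the 2-bit saturating counters, the 16-bit slice per die (a 64-bit datapath split over 4 dies), the table size, and the names `dies_needed` and `WidthPredictor` are all assumptions made for illustration. Values are treated as unsigned; a real design would also need to handle sign-extended operands.

```python
def dies_needed(value, bits_per_die=16, num_dies=4):
    """Number of 16-bit slices (dies) needed to hold an unsigned value."""
    slices = -(-value.bit_length() // bits_per_die)  # ceiling division
    return min(max(1, slices), num_dies)


class WidthPredictor:
    """PC-indexed table of 2-bit counters predicting narrow (one-die) results."""

    def __init__(self, entries=1024):
        self.entries = entries
        self.counters = [0] * entries  # start at 0: predict full width (safe default)

    def _index(self, pc):
        return (pc >> 2) % self.entries  # assumes 4-byte aligned instructions

    def predict_narrow(self, pc):
        """True if the instruction at pc is predicted to need only one die."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, value):
        """Train on the actual result once it is known."""
        i = self._index(pc)
        if dies_needed(value) == 1:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

When `predict_narrow` returns True, the upper dies could be gated for that operation, which is the kind of width-based gating the slide refers to. A misprediction in the narrow direction would need a recovery path that this sketch does not model.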

Page 14: Pushing Performance, Efficiency and Scalability of Microprocessors

3D Research Summary

• Circuit-level [ICCD’05,ISVLSI’06,ISCAS’06,GLSVLSI’06]

• Uarch-level [MICRO’06 (w/ ),HPCA’07]

• Tutorial papers [JETC’06]

• Tutorial [MICRO’06]

• Tools [DATE’06,TCAD’07] w/ GTCAD &

• Parametric Variations w/ Jim Meindl

• Funding, equip from ,

Page 15: Pushing Performance, Efficiency and Scalability of Microprocessors

Summary

• loh@cc
• http://www.cc.gatech.edu/~loh

• Lots of exciting work going on here