
A Mechanism for Online Diagnosis of Hard Faults in Microprocessors

Fred A. Bower, Daniel J. Sorin, and Sule Ozev

overview

Motivation
Current Techniques
Proposed Mechanism for Online Fault Diagnosis
Results
Challenges
Conclusion

Hard Faults

Electromigration
Gate Oxide Breakdown

background

Transient Faults

Single Event Upset

motivation

Process Scaling

current fault handling techniques

DIVA

Redundancy

DIVA

UTILIZE REDUNDANCY

error detection and correction

hybrid approach

online diagnosis

Track units
On DIVA error: error_count++
If (error_count > threshold)
• YES: deconfigure unit
• NO: no action
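The counting flow on this slide can be sketched in Python. This is a minimal software illustration, not the authors' hardware design: the class name `Diagnoser`, the unit names, and the threshold value are hypothetical, and the sketch assumes that every unit used by the erring instruction has its counter incremented.

```python
from collections import Counter


class Diagnoser:
    """Per-unit error counters with a deconfiguration threshold (illustrative)."""

    def __init__(self, threshold=3):  # threshold value is an assumption
        self.threshold = threshold
        self.error_count = Counter()
        self.deconfigured = set()

    def on_diva_error(self, units_used):
        """Called when the checker flags an instruction.

        units_used: names of the units the erring instruction passed through.
        Returns the units deconfigured as a result of this error.
        """
        newly_deconfigured = []
        for unit in units_used:
            if unit in self.deconfigured:
                continue
            self.error_count[unit] += 1           # error_count++
            if self.error_count[unit] > self.threshold:
                self.deconfigured.add(unit)       # YES: deconfigure unit
                newly_deconfigured.append(unit)
            # NO: no action
        return newly_deconfigured


# A hard fault in "alu0" shows up with a different ROB entry each time,
# so only the counter of the unit common to all errors crosses the threshold.
d = Diagnoser(threshold=3)
for i in range(4):
    d.on_diva_error(["alu0", f"rob{i}"])
print(d.deconfigured)
```

Because the faulty unit is the only one common to every erring instruction, its counter grows fastest, while the counters of units that merely shared an instruction with it stay low.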

ALU DIVA CHECKER

Reorder Buffer

Reservation Station

Units that can be turned off in case of a fault

Field Deconfigurable Units (FDU)

Deconfigure entries in circular buffer
Deconfigure entries in tabular structure

deconfiguring mechanism
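One way to picture deconfiguring an entry of a circular buffer (such as a reorder buffer) is to have the allocation pointer skip entries that have been marked as disabled. The sketch below is an assumption-laden illustration: the class `CircularBuffer` and its methods are hypothetical, and it tracks only the allocation pointer, ignoring occupancy and the head pointer.

```python
class CircularBuffer:
    """Circular buffer whose faulty entries can be skipped (illustrative)."""

    def __init__(self, size):
        self.size = size
        self.enabled = [True] * size  # one enable bit per entry
        self.tail = 0                 # next entry to consider for allocation

    def deconfigure(self, index):
        """Mark an entry as unusable after it has been diagnosed as faulty."""
        self.enabled[index] = False

    def allocate(self):
        """Return the next enabled entry, stepping over deconfigured ones."""
        if not any(self.enabled):
            raise RuntimeError("all entries are deconfigured")
        while not self.enabled[self.tail]:
            self.tail = (self.tail + 1) % self.size
        slot = self.tail
        self.tail = (self.tail + 1) % self.size
        return slot


buf = CircularBuffer(4)
buf.deconfigure(2)
print([buf.allocate() for _ in range(6)])  # entry 2 is never handed out
```

The buffer keeps working with one fewer entry, which is the performance cost of losing a component to a hard fault.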

Hard fault diagnosis latency
Performance impact of losing component to hard fault

analysis

• DIVA: 6% of an Alpha 21264 core

• Error counters (~1227 bits total)

• Instruction resource usage (19 wires in total)

• Deconfiguration logic

• Can be reduced using coarse granularity

challenges

Error count threshold
• Related to resource usage
• Heavily used resources have higher counters
• Pipeline flushes before threshold is reached

Transient faults

Independent resource usage

[Diagram: units A, B, C and D, E, F feeding the DIVA checker; desired vs. observed results; labels: ERROR, HARD FAULT, TRANSIENT FAULT]

challenges

• Certain structures cannot be protected
  • Register File
  • Issue logic
  • Common Data Bus (CDB)

• Transient fault: false deconfiguration
  • Possibly masked by error counter

• Faults in the error counter or deconfiguration logic
  • Periodically test counters
  • Permanently configure or deconfigure FDU upon error

• Window of vulnerability
  • DIVA keeps reporting errors until the counter saturates

limitations
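The point that a transient fault can be masked by the error counter can be illustrated with a small sketch. It assumes, purely for illustration, that all counters are cleared at a fixed interval so that isolated transient errors never accumulate; the function `diagnose`, the interval, and the threshold are hypothetical and are not taken from the slides.

```python
def diagnose(events, threshold, reset_interval):
    """Return the set of units deconfigured for a stream of checker errors.

    events: (cycle, unit) pairs sorted by cycle, one per error attributed to unit.
    Counters are cleared every reset_interval cycles (an assumed policy).
    """
    counts = {}
    deconfigured = set()
    window_start = 0
    for cycle, unit in events:
        if cycle - window_start >= reset_interval:
            counts.clear()  # forget old errors, including stray transients
            window_start = cycle - (cycle - window_start) % reset_interval
        if unit in deconfigured:
            continue
        counts[unit] = counts.get(unit, 0) + 1
        if counts[unit] > threshold:
            deconfigured.add(unit)
    return deconfigured


hard = [(0, "alu0"), (1, "alu0"), (2, "alu0")]                      # repeats quickly
transient = [(0, "alu1"), (1000, "alu1"), (2000, "alu1"), (3000, "alu1")]  # isolated
print(diagnose(sorted(hard + transient), threshold=2, reset_interval=500))
```

In this sketch the hard fault crosses the threshold within one interval, while the widely spaced transient errors are cleared before they can trigger a false deconfiguration. The same sketch also shows the window of vulnerability: errors keep occurring until the counter passes the threshold.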

• As transistors shrink, hard fault rate increases

• Current reliability mechanisms
  • Redundancy (TMR)
  • Thread-level redundancy
  • Pre-shipment testing and deconfiguration
  • Low-cost solutions such as DIVA

• Online diagnosis
  • Low cost and low hardware overhead
  • Use FDUs along with DIVA to diagnose faults dynamically
  • Increase yield: faulty chips can be binned to a lower performance bin

conclusion

discussion

What are the advantages of this hybrid scheme over using just a DIVA checker?

As process technology gets smaller, can this mechanism significantly extend the lifetime of the processor?

As transistors shrink, the number of cores will increase. Can this mechanism still be used, as opposed to turning off a faulty core?

How can we extend this mechanism to take care of the issue logic, singleton resources and CDB?

citations

images

• Electron Migration. Digital image. Wikimedia.org. Wikimedia, 6 Mar. 2007. Web. <http://upload.wikimedia.org/wikipedia/commons/thumb/8/8b/Leiterbahn_ausfallort_elektromigration.jpg/220px-Leiterbahn_ausfallort_elektromigration.jpg>.

• Gate Oxide Breakdown. Digital image. Attopsemi Technology. Attopsemi Technology, n.d. Web. <http://www.attopsemi.com/tec3.htm>.

• Sawant, Minal. Single Event Upset. Digital image. COTS. Microsemi, Jan. 2012. Web. <http://www.cotsjournalonline.com/articles/view/102279>.

• Sawant, Minal. Soft Error Rate. Digital image. CCCP. University of Michigan, 11 May 2012. Web. <http://cccp.eecs.umich.edu/research/reliability.php>.

• Carr, Robert. Simultaneous Multithreading. Digital image. Prezi. Prezi, 31 Oct. 2013. Web. <http://prezi.com/tegbbfk34l57/question-2/>.

• Wong, William. Out of Order Pipeline. Digital image. Electronic Design. Electronic Design, 19 Oct. 2011. Web. <http://electronicdesign.com/microcontrollers/little-core-shares-big-core-architecture>.

• Mark Brehob, EECS 470 Lecture Slides

• Fred A. Bower, Daniel J. Sorin, and Sule Ozev. A Mechanism for Online Diagnosis of Hard Faults in Microprocessors. In Proc. of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'05), 2005.

• T.M. Austin. DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design. In Proc. of the 32nd Annual IEEE/ACM Int'l Symposium on Microarchitecture, pages 196-207, Nov. 1999.

papers