Large-Scale Study of In-the-Field Flash Failures
Onur Mutlu [email protected]
(joint work with Justin Meza, Qiang Wu, Sanjeev Kumar)
August 10, 2016 Flash Memory Summit 2016, Santa Clara, CA
Original Paper (I)
Original Paper (II)
▪ Presented at the ACM SIGMETRICS Conference in June 2015.
▪ Full paper for details:
  ▪ Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "A Large-Scale Study of Flash Memory Errors in the Field," Proceedings of the ACM International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS), Portland, OR, June 2015. [Slides (pptx) (pdf)]
  ▪ [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]
  ▪ https://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf
A Large-Scale Study of Flash Memory Errors in the Field
Justin Meza, Qiang Wu, Sanjeev Kumar, Onur Mutlu
Overview
First study of flash reliability:
▪ at a large scale
▪ in the field
Overview: new reliability trends
▪ SSD lifecycle: Early detection lifecycle period distinct from hard disk drive lifecycle.
▪ Read disturbance: We do not observe the effects of read disturbance errors in the field.
▪ Temperature: Throttling SSD usage helps mitigate temperature-induced errors.
▪ Access pattern dependence: We quantify the effects of the page cache and write amplification in the field.
Outline
▪ background and motivation
▪ server SSD architecture
▪ error collection/analysis methodology
▪ SSD reliability trends
▪ summary
Background and motivation
Flash memory
▪ persistent
▪ high performance
▪ hard disk alternative
▪ used in solid-state drives (SSDs)
▪ prone to a variety of errors
  ▪ wearout, disturbance, retention
Prior Flash Error Studies (I)
1. Overall flash error analysis
- Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis, DATE 2012.
- Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, Error Analysis and Retention-Aware Error Management for NAND Flash Memory, Intel Technology Journal 2013.
2. Program and erase cycling noise analysis
- Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling, DATE 2013.
Prior Flash Error Studies (II)
3. Retention noise analysis and management
- Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime, ICCD 2012.
- Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery, HPCA 2015.
- Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu, WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management, MSST 2015.
Prior Flash Error Studies (III)
4. Cell-to-cell interference analysis and management
- Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation, ICCD 2013.
- Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories, SIGMETRICS 2014.
5. Read disturb noise study
- Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation, DSN 2015.
Some Prior Talks on Flash Errors
• Saugata Ghose, Write-hotness Aware Retention Management, FMS 2016.
• Onur Mutlu, Read Disturb Errors in MLC NAND Flash Memory, FMS 2015.
• Yixin Luo, Data Retention in MLC NAND Flash Memory, FMS 2015.
• Onur Mutlu, Error Analysis and Management for MLC NAND Flash Memory, FMS 2014.
• FMS 2016 posters:
- WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management
- Read Disturb Errors in MLC NAND Flash Memory
- Data Retention in MLC NAND Flash Memory
Prior Works on Flash Error Analysis
▪ Controlled studies that experimentally analyze many types of error
  ▪ retention, program interference, read disturb, wear
▪ Conducted on raw flash chips, not full SSD-based systems
▪ Use synthetic access patterns, not real workloads in production systems
▪ Do not account for the storage software stack
▪ Small number of chips and small amount of time
Prior Lower-Level Flash Error Studies
▪ Provide a lot of insight
▪ Lead to new reliability and performance techniques
  ▪ E.g., to manage errors in a controller
▪ But they do not provide information on
  ▪ errors that appear during real-system operation
  ▪ beyond the correction capability of the controller
In-The-Field Operation Effects
▪ Access patterns not controlled
▪ Real applications access SSDs over years
▪ Through the storage software stack (employs buffering)
▪ Through the SSD controller (employs ECC and wear leveling)
▪ Factors in platform design (e.g., number of SSDs) can affect access patterns
▪ Many SSDs and flash chips in a real data center
Our goal
Understand SSD reliability:
▪ at a large scale
  ▪ millions of device-days, across four years
▪ in the field
  ▪ realistic workloads and systems
Server SSD architecture
[Figure: server SSD, with labels for PCIe, flash chips, and the SSD controller]
SSD controller
▪ translates addresses
▪ schedules accesses
▪ performs wear leveling
[Figure: bits stored in flash, labeled as user data and ECC metadata]
Types of errors
Small errors
▪ ~10's of flipped bits per KB
▪ silently corrected by SSD controller
Large errors
▪ ~100's of flipped bits per KB
▪ corrected by host using driver
▪ referred to as SSD failure
We examine large errors (SSD failures) in this study.
Error collection/analysis methodology
SSD data measurement
▪ metrics stored on SSDs
▪ measured across SSD lifetime
SSD characteristics
▪ 6 different system configurations
  ▪ 720GB, 1.2TB, and 3.2TB SSDs
  ▪ servers have 1 or 2 SSDs
  ▪ this talk: representative systems
▪ 6 months to 4 years of operation
▪ 15TB to 50TB read and written
Platform and SSD Characteristics
▪ Six different platforms
▪ Spanning a majority of SSDs at Facebook’s production servers
Bit error rates (BER)
▪ BER = bit errors per bits transmitted
▪ 1 error per 385M bits transmitted to 1 error per 19.6B bits transmitted
  ▪ averaged across all SSDs in each system type
▪ 10x to 1000x lower than prior studies
  ▪ large errors, SSD performs wear leveling
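
As a quick check of the range quoted above, the two endpoints can be converted into conventional BER values. This is only a worked-arithmetic sketch; the function name `bit_error_rate` is illustrative and does not come from the talk.

```python
def bit_error_rate(bits_per_error: float) -> float:
    """Return the BER when one bit error is seen per `bits_per_error` bits."""
    return 1.0 / bits_per_error


# Endpoints quoted on the slide.
highest_ber = bit_error_rate(385e6)   # 1 error per 385M bits
lowest_ber = bit_error_rate(19.6e9)   # 1 error per 19.6B bits

print(f"highest BER: {highest_ber:.2e}")  # about 2.60e-09
print(f"lowest BER:  {lowest_ber:.2e}")   # about 5.10e-11
print(f"spread: {highest_ber / lowest_ber:.1f}x")  # about 50.9x
```

Under these numbers the system types differ by roughly a factor of 50 in averaged BER.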
Some Definitions
▪ Uncorrectable error
  ▪ Cannot be corrected by the SSD
  ▪ But corrected by the host CPU driver
▪ SSD failure rate
  ▪ Fraction of SSDs in a “bucket” that have had at least one uncorrectable error
Different Platforms, Different Failure Rates
Older Platforms → Higher SSD Error Rates
Platforms with Multiple SSDs
▪ Failures of SSDs in the same platform are correlated
  ▪ Multiple SSDs in one host
▪ Conclusion: Operational conditions related to platform affect SSD failure trends
A few SSDs cause most errors
▪ 10% of SSDs have >80% of errors
▪ Errors follow Weibull distribution
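
One way to measure this kind of concentration is to sort SSDs by error count and ask what share of all errors the worst 10% account for. The sketch below assumes per-SSD error counts are available as a plain list; the function name `top_share` and the sample counts are made up for illustration and are not data from the study.

```python
import math


def top_share(error_counts, fraction=0.10):
    """Share of all errors contributed by the worst `fraction` of SSDs."""
    total = sum(error_counts)
    if total == 0:
        return 0.0
    # Always look at one SSD at minimum, even for very small fleets.
    k = max(1, math.ceil(fraction * len(error_counts)))
    worst = sorted(error_counts, reverse=True)[:k]
    return sum(worst) / total


# Hypothetical fleet of ten SSDs (illustrative numbers only).
counts = [900, 40, 25, 15, 10, 5, 3, 2, 0, 0]
share = top_share(counts)
print(f"worst 10% of SSDs hold {share:.0%} of errors")
```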
What factors contribute to SSD failures in the field?
Analytical methodology
▪ not feasible to log every error
▪ instead, analyze lifetime counters
▪ snapshot-based analysis
[Figure: lifetime counters read from four example SSDs in a snapshot taken on 2014-11-1]
Errors: 54,326 | 0 | 2 | 10
Data written: 10TB | 2TB | 5TB | 6TB
[Figure: SSDs grouped into buckets by data written, plotting errors against data written]
Some Definitions
▪ Uncorrectable error
  ▪ Cannot be corrected by the SSD
  ▪ But corrected by the host CPU driver
▪ SSD failure rate
  ▪ Fraction of SSDs in a “bucket” that have had at least one uncorrectable error
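
The failure-rate definition can be sketched as a small bucketing routine over snapshot counters. This is a minimal illustration, not the study's actual tooling: it assumes each SSD contributes one (data written, uncorrectable error count) pair, pairs the four example SSDs from the snapshot slide column by column, and uses a hypothetical 4 TB bucket width.

```python
from collections import defaultdict

TB = 10**12  # bytes per terabyte (decimal)


def failure_rate_by_bucket(snapshots, bucket_size_tb=4):
    """Map each data-written bucket (lower edge, in TB) to its SSD failure rate.

    snapshots: iterable of (data_written_bytes, uncorrectable_error_count).
    An SSD counts as failed if it has had one or more uncorrectable errors.
    """
    totals = defaultdict(int)
    failed = defaultdict(int)
    for written, errors in snapshots:
        bucket = written // (bucket_size_tb * TB)
        totals[bucket] += 1
        if errors > 0:
            failed[bucket] += 1
    return {b * bucket_size_tb: failed[b] / totals[b] for b in sorted(totals)}


# The four example SSDs from the snapshot slide (pairing is an assumption).
snapshots = [(10 * TB, 54326), (2 * TB, 0), (5 * TB, 2), (6 * TB, 10)]
rates = failure_rate_by_bucket(snapshots)
print(rates)  # {0: 0.0, 4: 1.0, 8: 1.0}
```

Note that the SSD with 54,326 errors and the SSD with 2 errors each count once: the metric tracks whether an SSD has failed, not how many errors it has logged.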
SSD reliability trends
New reliability trends: access pattern dependence, SSD lifecycle, read disturbance, temperature
SSD lifecycle
Storage lifecycle background: the bathtub curve for disk drives [Schroeder+, FAST'07]
[Figure: failure rate vs. usage, showing an early failure period, a useful life period, and a wearout period]
Do SSDs display similar lifecycle periods?
Use data written to flash to examine SSD lifecycle
(time-independent utilization metric)
What We Find
[Figure: SSD failure rate (0.00 to 1.00) vs. data written (0 to 80 TB) for Platform A (720GB, 1 SSD) and Platform B (720GB, 2 SSDs), annotated with the early detection, early failure, useful life, and wearout periods]
SSD lifecycle: Early detection lifecycle period distinct from hard disk drive lifecycle.
Why Early Detection Period?
▪ Two pool model of flash blocks: weak and strong
▪ Weak ones fail early → increasing failure rate early in lifetime
  ▪ SSD takes them offline → lowers the overall failure rate
▪ Strong ones fail late → increasing failure rate late in lifetime
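
The qualitative shape implied by the two-pool model can be illustrated with a toy curve. Everything numeric here is hypothetical and not fitted to the study's data: the weak pool is drawn as a triangular bump peaking at 10 TB written, and the strong pool as a linear ramp starting at 40 TB.

```python
def weak_rate(tb, peak_tb=10.0, peak=0.6):
    """Weak blocks: failures rise as they are detected, then fall once retired."""
    if tb <= 0 or tb >= 2 * peak_tb:
        return 0.0
    return peak * (1 - abs(tb - peak_tb) / peak_tb)


def strong_rate(tb, onset_tb=40.0, slope=0.02):
    """Strong blocks: no failures until wearout begins, then a steady rise."""
    return max(0.0, slope * (tb - onset_tb))


def failure_rate(tb):
    """Toy overall failure rate as a function of data written (TB), capped at 1."""
    return min(1.0, weak_rate(tb) + strong_rate(tb))


for tb in (5, 10, 30, 80):
    print(tb, round(failure_rate(tb), 2))
```

The curve rises early (weak blocks being detected), falls once they are taken offline, and rises again late as strong blocks wear out.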
Read disturbance
▪ reading data can disturb contents
▪ failure mode identified in lab setting
▪ under adversarial workloads
Read from Flash Cell Array
▪ Page 1: Pass (Vpass = 5.0 V); cell threshold voltages 3.0V 3.8V 3.9V 4.8V
▪ Page 2: Read (Vread = 2.5 V); cell threshold voltages 3.5V 2.9V 2.4V 2.1V
▪ Page 3: Pass (Vpass = 5.0 V); cell threshold voltages 2.2V 4.3V 4.6V 1.8V
▪ Page 4: Pass (Vpass = 5.0 V); cell threshold voltages 3.5V 2.3V 1.9V 4.3V
Correct values for page 2: 0 0 1 1
![Page 68: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/68.jpg)
Read Disturb Problem: “Weak Programming” Effect
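(2) Repeatedly read page 3 (or any page other than page 2)

| Page | Word-line voltage | Cell 1 | Cell 2 | Cell 3 | Cell 4 |
|---|---|---|---|---|---|
| Page 1 | Pass (5V) | 3.0V | 3.8V | 3.9V | 4.8V |
| Page 2 | Pass (5V) | 3.5V | 2.9V | 2.4V | 2.1V |
| Page 3 | Read (2.5V) | 2.2V | 4.3V | 4.6V | 1.8V |
| Page 4 | Pass (5V) | 3.5V | 2.3V | 1.9V | 4.3V |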
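The read and read-disturb slides can be sketched in a few lines of Python. This is a toy model, not the characterization from the DSN paper: the per-read threshold-voltage shift (`DISTURB_STEP`) is an invented value that is far larger than in real devices, the shift is applied uniformly to every cell on unread pages, and the bit convention (conducting cell reads as 1) is assumed.

```python
V_READ = 2.5  # volts, from the slide; unread pages are biased at Vpass = 5.0 V

# Assumed threshold-voltage shift per pass-through event. Real shifts are
# much smaller and device dependent; this value only keeps the demo short.
DISTURB_STEP = 0.001

def read_page(block, page):
    """Return one bit per cell: 1 if the cell conducts at V_READ, else 0."""
    return [1 if vth < V_READ else 0 for vth in block[page]]

def disturb(block, page_read, reads):
    """Model `reads` reads of `page_read` weakly programming other pages."""
    for p, cells in enumerate(block):
        if p != page_read:
            block[p] = [vth + DISTURB_STEP * reads for vth in cells]

# Threshold voltages from the slide, one row per page (index 0 is page 1).
block = [
    [3.0, 3.8, 3.9, 4.8],
    [3.5, 2.9, 2.4, 2.1],
    [2.2, 4.3, 4.6, 1.8],
    [3.5, 2.3, 1.9, 4.3],
]
before = read_page(block, 1)            # read page 2
disturb(block, page_read=2, reads=200)  # repeatedly read page 3
after = read_page(block, 1)             # read page 2 again
print(before, after)
```

In this toy model the cell that started at 2.4 V is pushed above Vread, so page 2 reads back with one flipped bit even though page 2 itself was never rewritten.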
![Page 69: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/69.jpg)
More on Flash Read Disturb Errors
▪ Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, "Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation," Proceedings of the 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015.
![Page 70: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/70.jpg)
Does read disturbance affect SSDs in the field?
![Page 71: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/71.jpg)
Examine SSDs with high flash R/W ratios and most data read
to understand read effects
(isolate effects of read vs. write errors)
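A minimal sketch of this kind of analysis follows: keep only drives with a high read/write ratio, bin them by data read, and compute the fraction of failed drives per bin. The records, the field names, and the 2.0 ratio cutoff are all hypothetical, and "failure rate" is assumed here to mean the fraction of drives in a bin that reported a failure.

```python
from collections import defaultdict

def failure_rate_by_bin(drives, bin_width_tb):
    """Fraction of failed drives in each bin of data read (TB)."""
    totals = defaultdict(int)
    failed = defaultdict(int)
    for d in drives:
        b = int(d["read_tb"] // bin_width_tb)
        totals[b] += 1
        failed[b] += 1 if d["failed"] else 0
    return {b * bin_width_tb: failed[b] / totals[b] for b in sorted(totals)}

# Synthetic records, invented for illustration only.
drives = [
    {"read_tb": 10, "written_tb": 4, "failed": False},
    {"read_tb": 30, "written_tb": 10, "failed": True},
    {"read_tb": 60, "written_tb": 20, "failed": False},
    {"read_tb": 90, "written_tb": 30, "failed": True},
    {"read_tb": 120, "written_tb": 100, "failed": True},  # low R/W, excluded
    {"read_tb": 130, "written_tb": 50, "failed": False},
    {"read_tb": 140, "written_tb": 40, "failed": True},
]
high_rw = [d for d in drives if d["read_tb"] / d["written_tb"] >= 2.0]
rates = failure_rate_by_bin(high_rw, bin_width_tb=50)
print(rates)
```

A flat result across bins, as in this synthetic data, is what one would expect if reads alone do not drive failures.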
![Page 72: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/72.jpg)
[Figure: SSD failure rate (0.00 to 1.00) vs. data read (0 to 200 TB) for the 3.2TB, 1 SSD platform (average R/W = 2.14).]
![Page 73: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/73.jpg)
[Figure: SSD failure rate (0.00 to 1.00) vs. data read (0 to 200 TB) for the 1.2TB, 1 SSD platform (average R/W = 1.15).]
![Page 74: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/74.jpg)
Read disturbance: We do not observe the effects of read disturbance errors in the field.
![Page 75: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/75.jpg)
![Page 76: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/76.jpg)
![Page 77: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/77.jpg)
Temperature sensor
![Page 78: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/78.jpg)
Three Failure Rate Trends with Temperature
▪ Increasing
  ▪ SSD not throttled
▪ Decreasing after some temperature
  ▪ SSD could be throttled
▪ Not sensitive
  ▪ SSD could be throttled
![Page 79: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/79.jpg)
[Figure: SSD failure rate (0.00 to 1.00) vs. average temperature (30 to 60 °C). Platform A (720GB, 1 SSD) and Platform B (720GB, 2 SSDs).]
![Page 80: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/80.jpg)
No Throttling on A & B SSDs
![Page 81: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/81.jpg)
High temperature: may throttle or shut down
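A controller policy of this kind might look like the sketch below. It is not the behavior of any specific SSD: the 80 °C starting point follows the paper excerpt reproduced in this deck, while the shutdown threshold and the linear ramp between the two are assumptions made for illustration.

```python
THROTTLE_C = 80.0  # throttling starts around 80 °C per the paper excerpt
SHUTDOWN_C = 95.0  # hypothetical shutdown threshold, not from the source

def thermal_action(temp_c):
    """Map a temperature reading to a controller action."""
    if temp_c >= SHUTDOWN_C:
        return "shutdown"
    if temp_c >= THROTTLE_C:
        return "throttle"
    return "normal"

def allowed_iops(temp_c, max_iops):
    """Linearly reduce the permitted access rate between the two thresholds."""
    action = thermal_action(temp_c)
    if action == "shutdown":
        return 0
    if action == "throttle":
        frac = (SHUTDOWN_C - temp_c) / (SHUTDOWN_C - THROTTLE_C)
        return int(max_iops * frac)
    return max_iops

for t in (60.0, 80.0, 87.5, 95.0):
    print(t, thermal_action(t), allowed_iops(t, max_iops=1000))
```

Reducing the access rate as temperature rises is one plausible reason a throttled platform could show a flat or falling failure rate at high temperature.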
![Page 82: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/82.jpg)
[Figure: SSD failure rate (0.00 to 1.00) vs. average temperature (30 to 70 °C). Platform C (1.2TB, 1 SSD) and Platform E (3.2TB, 1 SSD). Annotation: Throttling.]
![Page 83: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/83.jpg)
Heavy Throttling on C & E SSDs
![Page 84: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/84.jpg)
Temperature: Throttling SSD usage helps mitigate temperature-induced errors.
![Page 85: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/85.jpg)
PCIe Bus Power Consumption
▪ Trends for Bus Power Consumption vs. Failure Rate
  ▪ Similar to Temperature vs. Failure Rate
▪ Temperature might be correlated with Bus Power
![Page 86: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/86.jpg)
![Page 87: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/87.jpg)
Access pattern effects
System buffering
▪ data served from OS caches
▪ decreases SSD usage
Write amplification
▪ updates to small amounts of data
▪ increases erasing and copying
![Page 88: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/88.jpg)
![Page 89: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/89.jpg)
![Page 90: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/90.jpg)
OS
![Page 91: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/91.jpg)
OS
Page cache
![Page 92: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/92.jpg)
![Page 93: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/93.jpg)
![Page 94: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/94.jpg)
![Page 95: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/95.jpg)
![Page 96: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/96.jpg)
![Page 97: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/97.jpg)
![Page 98: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/98.jpg)
OS
Page cache
System caching reduces the impact of SSD writes
![Page 99: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/99.jpg)
[Figure: SSD failure rate (0.00 to 1.00) vs. data written to OS (0 to 30 TB; system data written, 0 to 6e+10 sectors). Platform A (720GB, 1 SSD) and Platform B (720GB, 2 SSDs).]
![Page 100: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/100.jpg)
[Figure: data written to flash cells (B) vs. system data written (sectors), one panel per platform (Platforms A to F). Highlighted panel: 720GB, 2 SSDs, with axes labeled data written to OS (0 to 30 TB) and data written to flash cells (tick labels 20 and 60 TB).]
![Page 101: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/101.jpg)
System-Level Writes vs. Chip-Level Writes
▪ More data written at the software level does not imply more data written into flash chips
  ▪ Due to system-level buffering
▪ More system-level writes can enable more opportunities for coalescing in the system buffers
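The coalescing effect can be shown with a toy buffer. This is a simplified sketch, not how any particular OS page cache works: the function name, the count-based flush policy, and the workloads are all assumptions.

```python
def flushed_pages(writes, flush_every):
    """Count page writes that reach the device when dirty pages are buffered.

    `writes` is a sequence of page numbers written by software. Repeated
    writes to the same page between flushes coalesce into one device write.
    """
    dirty = set()
    device_writes = 0
    for i, page in enumerate(writes, start=1):
        dirty.add(page)
        if i % flush_every == 0:
            device_writes += len(dirty)
            dirty.clear()
    device_writes += len(dirty)  # final flush of whatever is still dirty
    return device_writes

hot = [0, 1, 0, 1, 0, 1, 0, 1]  # 8 system-level writes to 2 pages
cold = [0, 1, 2, 3]             # 4 system-level writes to 4 pages
print(len(hot), flushed_pages(hot, flush_every=8))
print(len(cold), flushed_pages(cold, flush_every=8))
```

Here the workload with more system-level writes sends fewer pages to the device, because its writes keep hitting the same buffered pages.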
![Page 102: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/102.jpg)
![Page 103: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/103.jpg)
OS
Flash devices use a translation layer to locate data
![Page 104: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/104.jpg)
OS
Logical address space
Translation layer
<offset1, size1>
<offset2, size2>
...
Physical address space
![Page 105: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/105.jpg)
Sparse data layout
more translation metadata
potential for higher write amplification
e.g., many small file updates
![Page 106: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/106.jpg)
Dense data layout
less translation metadata
potential for lower write amplification
e.g., one huge file update
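To see why layout changes the amount of translation metadata, assume the translation layer keeps one `<offset, size>` entry per contiguous run of pages, as the diagram suggests. The helper below is hypothetical and only counts such entries. It does not model a real flash translation layer.

```python
def extents(pages):
    """Merge logical page numbers into (offset, size) extents."""
    result = []
    for p in sorted(set(pages)):
        if result and result[-1][0] + result[-1][1] == p:
            offset, size = result[-1]
            result[-1] = (offset, size + 1)  # extend the current run
        else:
            result.append((p, 1))            # start a new run
    return result

dense = range(100, 108)                       # one contiguous 8-page update
sparse = [3, 40, 41, 97, 250, 512, 513, 900]  # eight scattered pages
print(len(extents(dense)), extents(dense))
print(len(extents(sparse)), extents(sparse))
```

Both updates touch eight pages, but the dense one needs a single entry while the scattered one needs six.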
![Page 107: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/107.jpg)
Use translation data size to examine effects of data layout
(relates to application access patterns)
![Page 108: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/108.jpg)
[Figure: SSD failure rate (0.00 to 1.00) vs. translation data (0 to 2 GB; DRAM buffer usage, 5.0e+08 to 1.5e+09 B) for a 720GB, 1 SSD platform. Horizontal axis annotated from Denser to Sparser.]
![Page 109: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/109.jpg)
[Figure 8 panels: SSD failure rate (0.00 to 1.00) vs. DRAM buffer usage (B) for Platforms A to F.]

Figure 8: SSD failure rate vs. DRAM buffer usage. Sparse data mappings (e.g., non-contiguous data, indicated by high DRAM buffer usage to store flash translation layer metadata) negatively affect SSD reliability the most (Platforms A, B, and D). Additionally, some dense data mappings (e.g., contiguous data in Platforms E and F) also negatively affect SSD reliability, likely due to the effect of small, sparse writes.

[Figure 9 panels: SSD failure rate (0.00 to 1.00) vs. DRAM buffer usage (B) for Graph Search, Batch Processor, Key-Value Store, Load Balancer, Distributed Key-Value Store, and Flash Cache.]

Figure 9: SSD failure rate vs. DRAM buffer usage across six applications that run on Platform B. We observe similar DRAM buffer effects to Figure 8, even among SSDs running the same application.

5. THE ROLE OF EXTERNAL FACTORS

We next examine how factors external to the SSD influence the errors observed over an SSD's lifetime. We examine the effects of temperature, PCIe bus power, and system-level writes reported by the OS.

5.1 Temperature

It is commonly assumed that higher temperature negatively affects the operation of flash-based SSDs. In flash cells, higher temperatures have been shown to cause cells to age more quickly due to the temperature-activated Arrhenius effect [39]. Temperature-dependent effects are especially important to understand for flash-based SSDs in order to make adequate data center provisioning and cooling decisions. To examine the effects of temperature, we used temperature measurements from temperature sensors embedded on the SSD cards, which provide a more accurate portrayal of the temperature of flash cells than temperature sensors at the server or rack level.

Figure 10 plots the failure rate for SSDs that have various average operating temperatures. We find that at an operating temperature range of 30 to 40 °C, SSDs across server platforms see similar failure rates or slight increases in failure rates as temperature increases.

Outside of this range (at temperatures of 40 °C and higher), we find that SSDs fall into one of three categories with respect to their reliability trends vs. temperature: (1) temperature-sensitive with increasing failure rate (Platforms A and B), (2) less temperature-sensitive (Platforms C and E), and (3) temperature-sensitive with decreasing failure rate (Platforms D and F). There are two factors that may affect the trends we observe with respect to SSD temperature.

One potential factor when analyzing the effects of temperature is the operation of the SSD controller in response to changes in temperature. The SSD controllers in some of the SSDs we examine attempt to ensure that SSDs do not exceed certain temperature thresholds (starting around 80 °C). Similar to techniques employed in processors to reduce the amount of processor activity in order to keep the processor within a certain temperature range, our SSDs attempt to change their behavior (e.g., reduce the frequency of SSD access or, in the extreme case, shut down the SSD) in order not to exceed temperature thresholds.

A second potential factor is the thermal characteristics of the machines in each platform. The existence of two SSDs in a machine (in Platforms B, D, and F) compared to one SSD in a machine may (1) increase the thermal capacity of the machine (causing its SSDs to reach higher temperatures more quickly and increase the work required to cool the SSDs) and (2) potentially reduce airflow to the components, prolonging the effects of high temperatures when they occur.

One hypothesis is that temperature-sensitive SSDs with increasing error rates, such as Platforms A and B, may not employ as aggressive temperature reduction techniques as other platforms. While we cannot directly measure the actions the SSD controllers take in response to temperature events, we examined an event that can be correlated with temperature
Write amplification in the field
[Figure: SSD failure rate vs. translation data (0.25 to 0.45 GB) for two applications: Graph search and Key-value store.]
![Page 110: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/110.jpg)
Why Does Sparsity of Data Matter?
▪ More translation data correlates with higher failure rates
▪ Sparse data updates, i.e., updates to less contiguous data, lead to more translation data
▪ Higher failure rates likely due to more frequent erase and copying caused by non-contiguous updates
  ▪ Write amplification
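The erase-and-copy cost of small updates can be made concrete with a back-of-the-envelope model. It assumes a naive scheme in which updating any part of an erase block rewrites the whole block (new data plus copied valid pages). Real devices use garbage collection and over-provisioning, so actual write amplification differs. The 128-page block size is a hypothetical value.

```python
def write_amplification(pages_per_block, updated_pages):
    """Flash pages written per host page written, for one erase block.

    Assumes every update rewrites the full block: the updated pages plus
    all remaining valid pages copied alongside them.
    """
    if not 0 < updated_pages <= pages_per_block:
        raise ValueError("updated_pages must be in 1..pages_per_block")
    flash_pages_written = pages_per_block  # new data plus copied pages
    return flash_pages_written / updated_pages

for updated in (1, 32, 128):
    print(updated, write_amplification(128, updated))
```

Under this model a one-page update costs 128 page writes, while rewriting the whole block at once costs one flash write per host write.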
![Page 111: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/111.jpg)
Access pattern dependence: We quantify the effects of the page cache and write amplification in the field.
![Page 112: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/112.jpg)
![Page 113: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/113.jpg)
More results in paper
▪ Block erasures and discards
▪ Page copies
▪ Bus power consumption
![Page 114: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/114.jpg)
Summary
![Page 115: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/115.jpg)
▪ Large scale
▪ In the field
![Page 116: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/116.jpg)
![Page 117: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/117.jpg)
Summary
SSD lifecycle: Early detection lifecycle period distinct from hard disk drive lifecycle.
![Page 118: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/118.jpg)
Summary
Read disturbance: We do not observe the effects of read disturbance errors in the field.
![Page 119: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/119.jpg)
Summary
Temperature: Throttling SSD usage helps mitigate temperature-induced errors.
![Page 120: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/120.jpg)
Summary
Access pattern dependence: We quantify the effects of the page cache and write amplification in the field.
![Page 121: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/121.jpg)
A Large-Scale Study of Flash Memory Errors in the Field
Justin Meza, Qiang Wu, Sanjeev Kumar, Onur Mutlu
![Page 122: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/122.jpg)
Our Remaining FMS 2016 Talks
n At 4:20pm Today
n Practical Threshold Voltage Distribution Modeling
q Yixin Luo (CMU PhD Student) August 10 @ 4:20pm
q Forum E-22: Controllers and Flash Technology
n At 5:45pm Today
n "WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management"
q Saugata Ghose (CMU Researcher) August 10 @ 5:45pm
q Forum C-22: SSD Concepts (SSDs Track)
![Page 123: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/123.jpg)
Referenced Papers and Talks
n All are available at http://users.ece.cmu.edu/~omutlu/projects.htm http://users.ece.cmu.edu/~omutlu/talks.htm
n And, many other previous works on
q NVM & Persistent Memory
q DRAM
q Hybrid memories
q NAND flash memory
![Page 124: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/124.jpg)
Thank you.
Feel free to email me with any questions & feedback
[email protected] http://users.ece.cmu.edu/~omutlu/
![Page 125: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/125.jpg)
Large-Scale Study of In-the-Field Flash Failures
Onur Mutlu [email protected]
(joint work with Justin Meza, Qiang Wu, Sanjeev Kumar)
August 10, 2016 Flash Memory Summit 2016, Santa Clara, CA
![Page 126: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/126.jpg)
References to Papers and Talks
![Page 127: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/127.jpg)
Challenges and Opportunities in Memory
n Onur Mutlu, "Rethinking Memory System Design" Keynote talk at 2016 ACM SIGPLAN International Symposium on Memory Management (ISMM), Santa Barbara, CA, USA, June 2016. [Slides (pptx) (pdf)] [Abstract]
n Onur Mutlu and Lavanya Subramanian, "Research Problems and Opportunities in Memory Systems" Invited Article in Supercomputing Frontiers and Innovations (SUPERFRI), 2015.
![Page 128: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/128.jpg)
Our FMS Talks and Posters
• Onur Mutlu, ThyNVM: Software-Transparent Crash Consistency for Persistent Memory, FMS 2016.
• Onur Mutlu, Large-Scale Study of In-the-Field Flash Failures, FMS 2016.
• Yixin Luo, Practical Threshold Voltage Distribution Modeling, FMS 2016.
• Saugata Ghose, Write-hotness Aware Retention Management, FMS 2016.
• Onur Mutlu, Read Disturb Errors in MLC NAND Flash Memory, FMS 2015.
• Yixin Luo, Data Retention in MLC NAND Flash Memory, FMS 2015.
• Onur Mutlu, Error Analysis and Management for MLC NAND Flash Memory, FMS 2014.
• FMS 2016 posters:
- WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management
- Read Disturb Errors in MLC NAND Flash Memory
- Data Retention in MLC NAND Flash Memory
![Page 129: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/129.jpg)
Our Flash Memory Works (I)
1. Retention noise study and management
1) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime, ICCD 2012.
2) Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery, HPCA 2015.
3) Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu, WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management, MSST 2015.
2. Flash-based SSD prototyping and testing platform
4) Yu Cai, Erich F. Haratsch, Mark McCartney, Ken Mai, FPGA-based solid-state drive prototyping platform, FCCM 2011.
![Page 130: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/130.jpg)
Our Flash Memory Works (II)
3. Overall flash error analysis
5) Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis, DATE 2012.
6) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai, Error Analysis and Retention-Aware Error Management for NAND Flash Memory, ITJ 2013.
4. Program and erase noise study
7) Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling, DATE 2013.
![Page 131: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/131.jpg)
Our Flash Memory Works (III)
5. Cell-to-cell interference characterization and tolerance
8) Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai, Program Interference in MLC NAND Flash Memory: Characterization, Modeling, and Mitigation, ICCD 2013.
9) Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories, SIGMETRICS 2014.
6. Read disturb noise study
10) Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation, DSN 2015.
![Page 132: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/132.jpg)
Our Flash Memory Works (IV)
7. Flash errors in the field
11) Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, A Large-Scale Study of Flash Memory Errors in the Field, SIGMETRICS 2015.
8. Persistent memory
12) Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu, ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems, MICRO 2015.
![Page 133: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/133.jpg)
Phase Change Memory As DRAM Replacement
n Benjamin C. Lee, Engin Ipek, Onur Mutlu, and Doug Burger, "Architecting Phase Change Memory as a Scalable DRAM Alternative" Proceedings of the 36th International Symposium on Computer Architecture (ISCA), pages 2-13, Austin, TX, June 2009. Slides (pdf)
n Benjamin C. Lee, Ping Zhou, Jun Yang, Youtao Zhang, Bo Zhao, Engin Ipek, Onur Mutlu, and Doug Burger, "Phase Change Technology and the Future of Main Memory" IEEE Micro, Special Issue: Micro's Top Picks from 2009 Computer Architecture Conferences (MICRO TOP PICKS), Vol. 30, No. 1, pages 60-70, January/February 2010.
![Page 134: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/134.jpg)
STT-MRAM As DRAM Replacement
n Emre Kultursay, Mahmut Kandemir, Anand Sivasubramaniam, and Onur Mutlu, "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative" Proceedings of the 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2013. Slides (pptx) (pdf)
![Page 135: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/135.jpg)
Taking Advantage of Persistence in Memory
n Justin Meza, Yixin Luo, Samira Khan, Jishen Zhao, Yuan Xie, and Onur Mutlu, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory" Proceedings of the 5th Workshop on Energy-Efficient Design (WEED), Tel-Aviv, Israel, June 2013. Slides (pptx) Slides (pdf)
n Jinglei Ren, Jishen Zhao, Samira Khan, Jongmoo Choi, Yongwei Wu, and Onur Mutlu, "ThyNVM: Enabling Software-Transparent Crash Consistency in Persistent Memory Systems" Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]
![Page 136: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/136.jpg)
Hybrid DRAM + NVM Systems (I)
n HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, and Onur Mutlu, "Row Buffer Locality Aware Caching Policies for Hybrid Memories" Proceedings of the 30th IEEE International Conference on Computer Design (ICCD), Montreal, Quebec, Canada, September 2012. Slides (pptx) (pdf) Best paper award (in Computer Systems and Applications track).
n Justin Meza, Jichuan Chang, HanBin Yoon, Onur Mutlu, and Parthasarathy Ranganathan, "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management" IEEE Computer Architecture Letters (CAL), February 2012.
![Page 137: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/137.jpg)
Hybrid DRAM + NVM Systems (II)
n Dongwoo Kang, Seungjae Baek, Jongmoo Choi, Donghee Lee, Sam H. Noh, and Onur Mutlu, "Amnesic Cache Management for Non-Volatile Memory" Proceedings of the 31st International Conference on Massive Storage Systems and Technologies (MSST), Santa Clara, CA, June 2015. [Slides (pdf)]
![Page 138: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/138.jpg)
NVM Design and Architecture
n HanBin Yoon, Justin Meza, Naveen Muralimanohar, Norman P. Jouppi, and Onur Mutlu, "Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories" ACM Transactions on Architecture and Code Optimization (TACO), Vol. 11, No. 4, December 2014. [Slides (ppt) (pdf)] Presented at the 10th HiPEAC Conference, Amsterdam, Netherlands, January 2015. [Slides (ppt) (pdf)]
n Justin Meza, Jing Li, and Onur Mutlu, "Evaluating Row Buffer Locality in Future Non-Volatile Main Memories" SAFARI Technical Report, TR-SAFARI-2012-002, Carnegie Mellon University, December 2012.
![Page 139: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/139.jpg)
Referenced Papers and Talks
n All are available at http://users.ece.cmu.edu/~omutlu/projects.htm http://users.ece.cmu.edu/~omutlu/talks.htm
n And, many other previous works on NAND flash memory errors and management
![Page 140: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/140.jpg)
Related Videos and Course Materials
n Undergraduate Computer Architecture Course Lecture Videos (2013, 2014, 2015)
n Undergraduate Computer Architecture Course Materials (2013, 2014, 2015)
n Graduate Computer Architecture Lecture Videos (2013, 2015)
n Parallel Computer Architecture Course Materials (Lecture Videos)
n Memory Systems Short Course Materials (Lecture Video on Main Memory and DRAM Basics)
![Page 141: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/141.jpg)
Additional Slides
![Page 142: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/142.jpg)
Backup slides
![Page 143: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/143.jpg)
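System characteristics

| SSD capacity | PCIe | Average age (years) | SSDs per server | Average written (TB) | Average read (TB) |
|---|---|---|---|---|---|
| 720GB | v1, x4 | 2.4 | 1 | 27.2 | 23.8 |
| 720GB | v1, x4 | 2.4 | 2 | 48.5 | 45.1 |
| 1.2TB | v2, x4 | 1.6 | 1 | 37.8 | 43.4 |
| 1.2TB | v2, x4 | 1.6 | 2 | 18.9 | 30.6 |
| 3.2TB | v2, x4 | 0.5 | 1 | 23.9 | 51.1 |
| 3.2TB | v2, x4 | 0.5 | 2 | 14.8 | 18.2 |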
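To compare platforms of different ages, the totals in the system characteristics slide can be normalized by average age. The following is a minimal sketch, not part of the original talk: the values are read off the slide (assuming each capacity row lists written/read totals first for 1 SSD per server and then for 2), the names `platforms` and `yearly_rate` are hypothetical, and dividing an average total by an average age only approximates the true per-device yearly rate.

```python
# Values read from the system characteristics slide (an interpretation of
# the flattened table): key is (capacity, SSDs per server), value is
# (average age in years, average TB written, average TB read).
platforms = {
    ("720GB", 1): (2.4, 27.2, 23.8),
    ("720GB", 2): (2.4, 48.5, 45.1),
    ("1.2TB", 1): (1.6, 37.8, 43.4),
    ("1.2TB", 2): (1.6, 18.9, 30.6),
    ("3.2TB", 1): (0.5, 23.9, 51.1),
    ("3.2TB", 2): (0.5, 14.8, 18.2),
}


def yearly_rate(total_tb, age_years):
    """Approximate TB per year as (average total) / (average age)."""
    return total_tb / age_years


# Approximate yearly write and read volume for each platform.
write_rates = {
    key: yearly_rate(written, age)
    for key, (age, written, _read) in platforms.items()
}
read_rates = {
    key: yearly_rate(read, age)
    for key, (age, _written, read) in platforms.items()
}

for (capacity, ssds), rate in write_rates.items():
    print(f"{capacity}, {ssds} SSD(s): ~{rate:.1f} TB written per year")
```

Under these assumptions the younger 3.2TB platforms show the highest yearly write volume even though their lifetime totals are the smallest, which is why age-normalized rates are easier to compare than raw totals.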
![Page 144: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/144.jpg)
[Figure: SSD failure rate (y-axis, 0.0 to 1.0) for platforms A to F. Capacities: 720GB (A, B), 1.2TB (C, D), 3.2TB (E, F). Devices per server: 1, 2, 1, 2, 1, 2.]
![Page 145: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/145.jpg)
[Figure: Yearly uncorrectable errors per SSD (y-axis, 0e+00 to 8e+05) for platforms A to F. Capacities: 720GB (A, B), 1.2TB (C, D), 3.2TB (E, F). Devices per server: 1, 2, 1, 2, 1, 2.]
![Page 146: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/146.jpg)
Channels operate in parallel
![Page 147: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/147.jpg)
DRAM buffer
▪ stores address translations
▪ may buffer writes
![Page 148: Large-Scale Study of In-the-Field Flash Failures · Factors in platform design (e.g., number of SSDs) can affect ... 45th Annual IEEE/IFIP International Conference on Dependable Systems](https://reader033.fdocuments.us/reader033/viewer/2022052010/602055414648ee34dd71d32b/html5/thumbnails/148.jpg)
[Figure: SSD failure rate (y-axis, 0.00 to 1.00) versus average temperature (x-axis, 35 to 65 °C) for Platform D (1.2TB, 2 SSDs) and Platform F (3.2TB, 2 SSDs). Annotation: early detection + throttling.]