

### Large-Scale Study of In-the-Field Flash Failures

Onur Mutlu omutlu@ethz.ch

(joint work with Justin Meza, Qiang Wu, Sanjeev Kumar)

August 10, 2016



Flash Memory Summit 2016, Santa Clara, CA









#### A Large-Scale Study of Flash Memory Errors in the Field

- Justin Meza, Carnegie Mellon University, meza@cmu.edu
- Qiang Wu, Facebook, Inc., qwu@fb.com
- Sanjeev Kumar, Facebook, Inc., skumar@fb.com
- Onur Mutlu, Carnegie Mellon University, onur@cmu.edu

### Original Paper (II)



- Presented at the ACM SIGMETRICS Conference in June 2015.
- Full paper for details:
  - Justin Meza, Qiang Wu, Sanjeev Kumar, and Onur Mutlu, "A Large-Scale Study of Flash Memory Errors in the Field" Proceedings of the <u>ACM International Conference on Measurement and Modeling of</u> <u>Computer Systems</u> (SIGMETRICS), Portland, OR, June 2015. [Slides (pptx) (pdf)]
  - [Coverage at ZDNet] [Coverage on The Register] [Coverage on TechSpot] [Coverage on The Tech Report]
  - https://users.ece.cmu.edu/~omutlu/pub/flash-memory-failures-in-the-field-at-facebook_sigmetrics15.pdf

### A Large-Scale Study of Flash Memory Errors in the Field

**Justin Meza** Qiang Wu Sanjeev Kumar Onur Mutlu

> **facebook** Carnegie Mellon University

## Overview

## First study of flash reliability:

- at a large scale
- in the field





## Overview

## SSD lifecycle

### Read disturbance

We **do not** observe the effects of *read disturbance* errors in the field.

### Temperature



## Overview

## SSD lifecycle

### Access pattern dependence

We quantify the effects of the *page cache* and *write amplification* in the field.

### Temperature

## Outline

- background and motivation
- server SSD architecture
- error collection/analysis methodology
- SSD reliability trends
- summary

# Background and motivation

# Flash memory

- persistent
- high performance
- hard disk alternative
- used in solid-state drives (SSDs)
- prone to a variety of errors
  - wearout, disturbance, retention



### Prior Flash Error Studies (I)

#### 1. Overall flash error analysis

- Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai,
   Error Patterns in MLC NAND Flash Memory: Measurement, Characterization, and Analysis, DATE 2012.
- Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai,
   <u>Error Analysis and Retention-Aware Error Management for NAND Flash</u> <u>Memory</u>, Intel Technology Journal 2013.

#### 2. Program and erase cycling noise analysis

- Yu Cai, Erich F. Haratsch, Onur Mutlu, and Ken Mai, <u>Threshold Voltage Distribution in MLC NAND Flash Memory: Characterization, Analysis and Modeling</u>, DATE 2013.



### Prior Flash Error Studies (II)

#### 3. <u>Retention noise analysis and management</u>

- Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Adrian Cristal, Osman Unsal, and Ken Mai,
   Flash Correct-and-Refresh: Retention-Aware Error Management for Increased Flash Memory Lifetime, ICCD 2012.
- Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu,
   <u>Data Retention in MLC NAND Flash Memory: Characterization, Optimization</u> and Recovery, HPCA 2015.
- Yixin Luo, Yu Cai, Saugata Ghose, Jongmoo Choi, and Onur Mutlu, WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management, MSST 2015.



### Prior Flash Error Studies (III)

#### 4. <u>Cell-to-cell interference analysis and management</u>

- Yu Cai, Onur Mutlu, Erich F. Haratsch, and Ken Mai,
   Program Interference in MLC NAND Flash Memory: Characterization,
   Modeling, and Mitigation, ICCD 2013.
- Yu Cai, Gulay Yalcin, Onur Mutlu, Erich F. Haratsch, Osman Unsal, Adrian Cristal, and Ken Mai, <u>Neighbor-Cell Assisted Error Correction for MLC NAND Flash Memories</u>, SIGMETRICS 2014.

#### 5. Read disturb noise study

- Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu, <u>Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation</u>, DSN 2015.



### Some Prior Talks on Flash Errors

- Saugata Ghose, <u>Write-hotness Aware Retention Management</u>, FMS 2016.
- Onur Mutlu, *Read Disturb Errors in MLC NAND Flash Memory*, FMS 2015.
- Yixin Luo, *Data Retention in MLC NAND Flash Memory*, FMS 2015.
- Onur Mutlu, Error Analysis and Management for MLC NAND Flash Memory, FMS 2014.

- FMS 2016 posters:
  - <u>WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware</u> <u>Retention Management</u>
  - Read Disturb Errors in MLC NAND Flash Memory
  - Data Retention in MLC NAND Flash Memory



### Prior Works on Flash Error Analysis



- retention, program interference, read disturb, wear
- Conducted on raw flash chips, not full SSD-based systems
- Use synthetic access patterns, not real workloads in production systems
- Do not account for the storage software stack
- Small number of chips and small amount of time

#### Prior Lower-Level Flash Error Studies

- Provide a lot of insight
- Lead to new reliability and performance techniques
   E.g., to manage errors in a controller

- But they do not provide information on
  - errors that appear during real-system operation
  - beyond the correction capability of the controller

### In-The-Field Operation Effects



- Access patterns **not** controlled
- Real applications access SSDs over years
- Through the storage software stack (employs buffering)
- Through the SSD controller (employs ECC and wear leveling)
- Factors in platform design (e.g., number of SSDs) can affect access patterns
- Many SSDs and flash chips in a real data center

# Our goal

## **Understand SSD reliability:**

- at a large scale
  - millions of device-days, across four years
- in the field
  - realistic workloads and systems

# Server SSD architecture







### SSD controller

- translates addresses
- schedules accesses
- performs wear leveling

[Figure: flash chips store user data alongside ECC metadata.]

## Types of errors

## Small errors

- 10's of flipped bits per KB
- silently corrected by SSD controller

## Large errors

- 100's of flipped bits per KB
- corrected by host using driver
- referred to as SSD failure

## We examine *large errors* (SSD failures) in this study.

# Error collection/analysis methodology

## **SSD** data measurement

- metrics stored on SSDs measured across SSD lifetime

## **SSD characteristics**

- 6 different system configurations
  - 720GB, 1.2TB, and 3.2TB SSDs
  - servers have 1 or 2 SSDs
  - this talk: representative systems
- 6 months to 4 years of operation
- 15TB to 50TB read and written

### Platform and SSD Characteristics



- Six different platforms
- Spanning a majority of SSDs at Facebook's production servers

| Platform | SSDs | PCIe   | Per-SSD capacity | Per-SSD age (years) | Per-SSD data written | Per-SSD data read | Per-SSD UBER          |
|----------|------|--------|------------------|---------------------|----------------------|-------------------|-----------------------|
| A        | 1    | v1, ×4 | $720\mathrm{GB}$ | $2.4 \pm 1.0$       | $27.2\mathrm{TB}$    | $23.8\mathrm{TB}$ | $5.2 \times 10^{-10}$ |
| B        | 2    | v1, ×4 | $720\mathrm{GB}$ | $2.4 \pm 1.0$       | $48.5\mathrm{TB}$    | $45.1\mathrm{TB}$ | $2.6 \times 10^{-9}$  |
| C        | 1    | v2, ×4 | $1.2\mathrm{TB}$ | $1.6 \pm 0.9$       | $37.8\mathrm{TB}$    | $43.4\mathrm{TB}$ | $1.5 \times 10^{-10}$ |
| D        | 2    | v2, ×4 | $1.2\mathrm{TB}$ | $1.6 \pm 0.9$       | $18.9\mathrm{TB}$    | $30.6\mathrm{TB}$ | $5.7 \times 10^{-11}$ |
| E        | 1    | v2, ×4 | $3.2\mathrm{TB}$ | $0.5 \pm 0.5$       | $23.9\mathrm{TB}$    | $51.1\mathrm{TB}$ | $5.1 \times 10^{-11}$ |
| F        | 2    | v2, ×4 | $3.2\mathrm{TB}$ | $0.5 \pm 0.5$       | $14.8\mathrm{TB}$    | $18.2\mathrm{TB}$ | $1.8 \times 10^{-10}$ |

Table 1: The platforms examined in our study. PCIe technology is denoted by vX,  $\times$ Y where X = version and Y = number of lanes. Data was collected over the entire age of the SSDs. Data written and data read are to/from the physical storage over an SSD's lifetime. UBER = uncorrectable bit error rate (Section 3.2).

# Bit error rates (BER)

- BER = bit errors per bits transmitted
- 1 error per 385M bits transmitted to 1 error per 19.6B bits transmitted
  - averaged across all SSDs in each system type
- 10x to 1000x lower than prior studies
  - large errors, SSD performs wear leveling
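The "1 error per N bits" figures are simply the reciprocal of the error rate. A minimal sketch, assuming the UBER column of Table 1 as input (the helper name `bits_per_error` is made up for illustration):

```python
# UBER values per platform, copied from Table 1.
UBER = {
    "A": 5.2e-10,
    "B": 2.6e-9,
    "C": 1.5e-10,
    "D": 5.7e-11,
    "E": 5.1e-11,
    "F": 1.8e-10,
}


def bits_per_error(ber: float) -> float:
    """Return the average number of bits transmitted per bit error."""
    if ber <= 0:
        raise ValueError("bit error rate must be positive")
    return 1.0 / ber


worst = max(UBER, key=UBER.get)  # platform with the highest error rate
best = min(UBER, key=UBER.get)   # platform with the lowest error rate
for name in (worst, best):
    print(f"Platform {name}: 1 error per {bits_per_error(UBER[name]):.3g} bits")
```

The two extremes correspond to the roughly 385M and 19.6B figures quoted above.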



#### Uncorrectable error

- Cannot be corrected by the SSD
- But corrected by the host CPU driver

#### SSD failure rate

- Fraction of SSDs in a "bucket" that have had at least one uncorrectable error

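A minimal sketch of this metric, assuming SSDs are bucketed by lifetime data written and that each record is a `(data_written_tb, uncorrectable_errors)` pair. The function name, the bucket size, and most of the records are hypothetical; the first four records reuse the example counters from the snapshot slide.

```python
from collections import defaultdict


def failure_rate_by_bucket(ssds, bucket_size_tb):
    """Return {bucket start in TB: fraction of SSDs with >= 1 uncorrectable error}.

    ssds: iterable of (data_written_tb, uncorrectable_error_count) pairs.
    """
    total = defaultdict(int)
    failed = defaultdict(int)
    for written_tb, errors in ssds:
        bucket = int(written_tb // bucket_size_tb)
        total[bucket] += 1
        if errors > 0:  # an SSD counts once, no matter how many errors it had
            failed[bucket] += 1
    return {b * bucket_size_tb: failed[b] / total[b] for b in sorted(total)}


# Lifetime counters per SSD; the last four entries are made up for illustration.
snapshot = [(10, 54326), (2, 0), (5, 2), (6, 10), (1, 0), (7, 0), (3, 1), (11, 0)]
rates = failure_rate_by_bucket(snapshot, bucket_size_tb=4)
for start, rate in rates.items():
    print(f"{start}-{start + 4} TB written: failure rate {rate:.2f}")
```

Note that the SSD with 54,326 errors contributes exactly as much to the failure rate as the SSD with 1 error: the metric counts failed devices, not errors.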
### Different Platforms, Different Failure Rates




### Older Platforms -> Higher SSD Error Rates




### Platforms with Multiple SSDs

- Failures of SSDs in the same platform are correlated
  - Multiple SSDs in one host
- Conclusion: Operational conditions related to platform affect SSD failure trends

## A few SSDs cause most errors

[Figure: x-axis is normalized SSD number.]
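One way to check this kind of skew on your own fleet is to compute the share of all errors that comes from the most error-prone SSDs. A minimal sketch, assuming a plain list of per-SSD lifetime error counts; the function name and the sample counts are hypothetical, not data from the study.

```python
def share_from_top(error_counts, top_fraction):
    """Fraction of all errors contributed by the top `top_fraction` of SSDs."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    counts = sorted(error_counts, reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    k = max(1, round(len(counts) * top_fraction))  # at least one SSD
    return sum(counts[:k]) / total


# Hypothetical per-SSD error counts for ten SSDs.
counts = [54326, 10, 2, 0, 0, 0, 0, 0, 0, 0]
print(f"top 10% of SSDs: {share_from_top(counts, 0.10):.1%} of errors")
```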

# Analytical methodology

- not feasible to log every error
- instead, analyze lifetime counters
- snapshot-based analysis





| Errors       | 54,326 | 0   | 2   | 10  |
|--------------|--------|-----|-----|-----|
| Data written | 10TB   | 2TB | 5TB | 6TB |

[Figure: snapshot of per-SSD lifetime counters, dated 2014-11-1.]



















# SSD reliability trends





### Storage lifecycle background: the **bathtub curve** for disk drives






# Use **data written to flash** *to examine SSD lifecycle*

(time-independent utilization metric)





Figure 4: SSD lifecycle failure pattern. SSDs fail at different rates during several distinct periods throughout their lifetime (measured by usage)










### SSD lifecycle

## Distinct from hard disk drive lifecycle.

### Temperature



- Two-pool model of flash blocks: weak and strong
- Weak ones fail early → increasing failure rate early in lifetime
  - SSD takes them offline → lowers the overall failure rate
- Strong ones fail late → increasing failure rate late in lifetime
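A toy illustration of why two pools can produce a rate that rises, falls, and rises again as data is written. This is not the model from the paper: the bell-shaped curves, the pool shares, and every constant below are assumptions chosen only to make the shape visible.

```python
import math


def bell(x, center, width):
    """Unnormalized Gaussian bump, used here as a toy wear-out curve."""
    return math.exp(-((x - center) ** 2) / (2 * width ** 2))


def toy_failure_rate(written_tb, weak_share=0.05):
    """Combined toy failure rate of a weak pool and a strong pool of blocks.

    Assumption: weak blocks wear out around 5 TB written, strong blocks
    around 60 TB written. Both numbers are invented for illustration.
    """
    weak = weak_share * bell(written_tb, center=5.0, width=2.0)
    strong = (1 - weak_share) * bell(written_tb, center=60.0, width=10.0)
    return weak + strong


for tb in (1, 5, 20, 50):
    print(f"{tb:>2} TB written: toy rate {toy_failure_rate(tb):.4f}")
```

The rate climbs while weak blocks fail, drops once they are gone (taken offline), and climbs again as strong blocks approach wear-out.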



# **Read disturbance**

- reading data can disturb contents
- failure mode identified in *lab setting*
- under adversarial workloads



#### Read Disturb Problem: "Weak Programming" Effect



### More on Flash Read Disturb Errors



 Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch, Ken Mai, and Onur Mutlu,
 "Read Disturb Errors in MLC NAND Flash Memory: Characterization and Mitigation"
 Proceedings of the
 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), Rio de Janeiro, Brazil, June 2015.

#### Read Disturb Errors in MLC NAND Flash Memory: Characterization, Mitigation, and Recovery

Yu Cai, Yixin Luo, Saugata Ghose, Erich F. Haratsch\*, Ken Mai, Onur Mutlu Carnegie Mellon University, \*Seagate Technology yucaicai@gmail.com, {yixinluo, ghose, kenmai, onur}@cmu.edu

# Read disturbance

# Does read disturbance affect SSDs in the field?


# Examine SSDs with high **flash R/W** ratios and **most data read** to understand read effects

(isolate effects of read vs. write errors)

### 3.2TB, 1 SSD (average R/W = 2.14)


#### 1.2TB, 1 SSD (average R/W = 1.15)



## SSD lifecycle

# We **do not** observe the effects of **read disturbance** errors in the field.



## Temperature







#### Three Failure Rate Trends with Temperature

#### Increasing

SSD not throttled

#### Decreasing after some temperature

SSD could be throttled

#### Not sensitive

SSD could be throttled

[Figure: x-axis is average temperature (°C).]

# High temperature: may throttle or shut down

[Figure: x-axis is average temperature (°C).]



#### Temperature

#### PCIe Bus Power Consumption

- Trends for bus power consumption vs. failure rate are similar to those for temperature vs. failure rate
- Temperature might be correlated with bus power



# Access pattern effects

## System buffering

- data served from OS caches
- decreases SSD usage

## Write amplification

- updates to small amounts of data
- increases erasing and copying




















## **System caching reduces** the impact of SSD writes









#### System-Level Writes vs. Chip-Level Writes

- More data written at the software level does not imply more data written into flash chips
  - Due to system-level buffering
- More system-level writes can enable more opportunities for coalescing in the system buffers
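A minimal sketch of the coalescing effect, assuming a simple write-back buffer that absorbs repeated writes to the same logical page and flushes all dirty pages when it fills. The buffer policy, the function name, and the write trace are hypothetical; real OS page caches behave in more complex ways.

```python
def flash_page_writes(writes, buffer_pages):
    """Count pages that reach flash for a sequence of software page writes.

    writes: logical page numbers in the order software writes them.
    buffer_pages: dirty pages the buffer can hold; 0 means no buffering.
    """
    if buffer_pages <= 0:
        return len(writes)  # every software write goes straight to flash
    dirty = set()
    flushed = 0
    for page in writes:
        dirty.add(page)  # rewriting an already-dirty page is absorbed
        if len(dirty) == buffer_pages:
            flushed += len(dirty)  # buffer full: write all dirty pages out
            dirty.clear()
    return flushed + len(dirty)  # final flush of whatever is left


# Hypothetical trace: eight software writes that keep hitting the same pages.
trace = [1, 2, 1, 2, 1, 2, 3, 1]
print("no buffer:    ", flash_page_writes(trace, 0), "flash page writes")
print("4-page buffer:", flash_page_writes(trace, 4), "flash page writes")
```

With this trace, the same eight software writes turn into fewer flash writes once the buffer is large enough to hold the working set.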

## Write amplification

- updates to small amounts of data
- increases erasing and copying

## Flash devices use a translation layer to locate data

[Figure: the OS accesses a logical address space; the translation layer maps extents <offset<sub>1</sub>, size<sub>1</sub>>, <offset<sub>2</sub>, size<sub>2</sub>> onto the physical address space.]

## **Sparse data layout**

- more translation metadata
- potential for *higher* write amplification
- e.g., many small file updates

## **Dense data layout**

- less translation metadata
- potential for *lower* write amplification
- e.g., one huge file update

## Use **translation data size** to examine effects of data layout

(relates to application access patterns)
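A minimal sketch of why layout drives translation data size, assuming the translation layer records contiguous runs of logical pages as `(offset, size)` extents as in the diagram above. The function name and both access patterns are hypothetical, and real translation layers are considerably more involved.

```python
def to_extents(pages):
    """Coalesce logical page numbers into sorted (offset, size) extents."""
    extents = []
    for page in sorted(set(pages)):
        if extents and extents[-1][0] + extents[-1][1] == page:
            offset, size = extents[-1]
            extents[-1] = (offset, size + 1)  # page extends the last extent
        else:
            extents.append((page, 1))  # gap: start a new extent
    return extents


dense = range(0, 8)       # one contiguous update of 8 pages
sparse = range(0, 64, 8)  # 8 single-page updates scattered across 64 pages

print("dense: ", len(to_extents(dense)), "extent(s)")
print("sparse:", len(to_extents(sparse)), "extent(s)")
```

Both patterns write the same number of pages, but the sparse one needs one extent per page, which is the growth in translation data the study uses as a proxy for data layout.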


### Write amplification in the field





- More translation data correlates with higher failure rates
- Sparse data updates, i.e., updates to less contiguous data, lead to more translation data
- Higher failure rates likely due to more frequent erase and copying caused by non-contiguous updates
  - Write amplification

### SSD lifecycle

### Access pattern dependence

We quantify the effects of the *page cache* and *write amplification* in the field.

### Temperature



# More results in paper

- Block erasures and discards
- Page copies
- Bus power consumption

# Summary

- Large scale
- In the field

### SSD lifecycle

### Access pattern dependence

We quantify the effects of the *page cache* and *write amplification* in the field.

### Temperature




#### At 4:20pm Today

- "Practical Threshold Voltage Distribution Modeling"
  - Yixin Luo (CMU PhD Student) August 10 @ 4:20pm
  - Forum E-22: Controllers and Flash Technology

#### At 5:45pm Today

- "WARM: Improving NAND Flash Memory Lifetime with Write-hotness Aware Retention Management"
  - Saugata Ghose (CMU Researcher) August 10 @ 5:45pm
  - Forum C-22: SSD Concepts (SSDs Track)



All are available at

- http://users.ece.cmu.edu/~omutlu/projects.htm
- http://users.ece.cmu.edu/~omutlu/talks.htm

- And, many other previous works on
  - NVM & Persistent Memory
  - DRAM
  - Hybrid memories
  - NAND flash memory



### Thank you.

#### Feel free to email me with any questions & feedback

omutlu@ethz.ch

http://users.ece.cmu.edu/~omutlu/




### Additional Slides

# Backup slides

# System characteristics

| SSD capacity | PCIe   | Average age (years) | SSDs per server | Average written (TB) | Average read (TB) |
|--------------|--------|---------------------|-----------------|----------------------|-------------------|
| 720GB        | v1, ×4 | 2.4                 | 1               | 27.2                 | 23.8              |
| 720GB        | v1, ×4 | 2.4                 | 2               | 48.5                 | 45.1              |
| 1.2TB        | v2, ×4 | 1.6                 | 1               | 37.8                 | 43.4              |
| 1.2TB        | v2, ×4 | 1.6                 | 2               | 18.9                 | 30.6              |
| 3.2TB        | v2, ×4 | 0.5                 | 1               | 23.9                 | 51.1              |
| 3.2TB        | v2, ×4 | 0.5                 | 2               | 14.8                 | 18.2              |






## DRAM buffer

- stores address translations
- may buffer writes