

### NAND Structure Aware Controller Framework

#### mengxin@derastorage.com



- The Challenges of NAND Flash
- Adaptive Error Mitigation by means of NAND Structure Aware
  - Noise Cells Repair
  - Dynamic Cell Levels
  - Retirement
  - Adaptive RAID





# The Challenges of NAND Flash

- Trends: smaller process nodes, multi-level cells, and vertically stacked cells.
- Factors that hurt reliability: fewer electrons per cell, and larger inter-cell interference and disturbance.
- The controller plays a key role in overcoming these challenges.



# Error characterization, mitigation, and data recovery techniques

| Methods / Error Type                                           | P/E Cycling | Data Retention | Read/Program<br>Disturb | Cell to Cell<br>Interference | Media Defects |
|----------------------------------------------------------------|-------------|----------------|-------------------------|------------------------------|---------------|
| 2-pass Programming<br>Shadow Program Sequence<br>Randomization |             |                |                         | x                            |               |
| Read-Retry                                                     | x           | x              | x                       |                              |               |
| Auto Read Calibration                                          | x           | x              | x                       |                              |               |
| Vth Optimization                                               | x           | x              | x                       |                              |               |
| RAID                                                           |             |                |                         |                              | x             |
| Refresh                                                        |             | x              | x                       |                              |               |
| Adaptive Error Mitigation                                      | X           | X              | X                       | X                            | X             |

- Some ways to improve data recovery:
  - Advanced error correction algorithms and more reliable soft information.
  - Find a better Vref through read-retry, calibration, and Vth scan.
  - Flash management mechanisms.
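
The read-retry idea above can be sketched as a loop that re-reads a page at different Vref offsets until ECC decoding succeeds. This is a minimal sketch, not a real controller interface: `read_page`, `ecc_decode`, the offset list, and the toy page and ECC models are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence, Tuple


def read_with_retry(
    read_page: Callable[[int], bytes],
    ecc_decode: Callable[[bytes], Optional[bytes]],
    vref_offsets: Sequence[int],
) -> Tuple[Optional[bytes], Optional[int]]:
    """Try each Vref offset in order; return (data, offset) on the first success."""
    for offset in vref_offsets:
        raw = read_page(offset)          # hypothetical NAND read at a shifted Vref
        decoded = ecc_decode(raw)        # hypothetical decoder: None means failure
        if decoded is not None:
            return decoded, offset
    return None, None                    # every offset failed: uncorrectable


def make_toy_page(data: bytes, best_offset: int) -> Callable[[int], bytes]:
    """Toy model: the further the offset is from the best one, the more bit errors."""
    def read_page(offset: int) -> bytes:
        flipped = bytearray(data)
        for i in range(min(abs(offset - best_offset), len(flipped))):
            flipped[i] ^= 0x01
        return bytes(flipped)
    return read_page


def make_toy_ecc(data: bytes, max_errors: int) -> Callable[[bytes], Optional[bytes]]:
    """Toy ECC that knows the true data and corrects up to max_errors bit errors."""
    def ecc_decode(raw: bytes) -> Optional[bytes]:
        errors = sum(bin(a ^ b).count("1") for a, b in zip(raw, data))
        return data if errors <= max_errors else None
    return ecc_decode
```

A Vth scan or calibration step would choose the offset order from measured distributions instead of a fixed list; the loop itself stays the same.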



# **NAND Structure Aware**



- Physical structure  $\rightarrow$  determines throughput and parallelism.
- Logical structure  $\rightarrow$  enables efficient flash management and the I/O pipeline.
- Form parity protection groups across multiple channels, chips, and dies, combined into a RAID group.
- Ensure reliability at every structure level.

Santa Clara, CA



- Noise Cells: cells whose Vth distributions overlap after retention. Cells in the overlap region  $\rightarrow$  RBE. If the number of Noise Cells exceeds the decoding capability  $\rightarrow$  UBE.
- Identify these cells during the correctable period: obtain the bit-flip vector after ECC decoding succeeds.
- Decide whether each bit flip comes from a Noise Cell by using the key formula.
- Record and cache the locations of the significant Noise Cells.
- When Noise Cells cause an uncorrectable error, the ECC manager repairs them and decodes again.
- Highly effective at reducing the bit error rate of failed pages.
- Use more soft-bit data and bin areas to classify Noise Cells accurately.
- The same method can also be applied to disturb issues.
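
The identification steps above can be sketched as follows. The slides do not give the "key formula", so this sketch swaps in a simple repeat-count threshold as a stand-in, and it treats each bit position as one cell, which is a simplification. `NoiseCellTracker` and `min_flips` are hypothetical names.

```python
from collections import Counter
from typing import List


def bit_flip_positions(raw: bytes, corrected: bytes) -> List[int]:
    """Return the bit positions where the raw read differs from the ECC-corrected data."""
    positions = []
    for byte_index, (r, c) in enumerate(zip(raw, corrected)):
        diff = r ^ c                      # set bits mark flipped positions
        for bit in range(8):
            if diff & (1 << bit):
                positions.append(byte_index * 8 + bit)
    return positions


class NoiseCellTracker:
    """Cache positions that flip repeatedly while the page is still correctable."""

    def __init__(self, min_flips: int = 2):
        # Assumption: a position counts as a Noise Cell after min_flips flips.
        # This stands in for the unspecified key formula.
        self.min_flips = min_flips
        self.flip_counts: Counter = Counter()

    def record_decode(self, raw: bytes, corrected: bytes) -> None:
        """Call after each successful ECC decode of the same page."""
        self.flip_counts.update(bit_flip_positions(raw, corrected))

    def noise_cells(self) -> List[int]:
        """Positions seen flipping at least min_flips times, in ascending order."""
        return sorted(p for p, n in self.flip_counts.items() if n >= self.min_flips)
```

A real classifier would also use soft-bit and bin information, as the slide notes, rather than flip counts alone.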

Flash Memory Summit 2018


# Noise Cells Repair: Repair It





- It is transparent to the firmware's data processing path.
- Skip Noise Cells: fill the Noise Cells with dummy data before transferring to NAND, and strip the dummy data from the Noise Cells before decoding, so that they do not affect data decoding.
- Predict Noise Cells: when LDPC soft decoding starts, the Noise Cells Repairer Engine looks up the locations of the Noise Cells and replaces their LLR values with predicted ones.
- Keeping the Noise Cells Location Search Table efficient and compact is an implementation challenge.
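
Both repair modes above can be sketched on plain Python lists. This is an illustration under stated assumptions, not the engine's actual design: the function names are hypothetical, a page is modelled as a list of bits, and the predicted LLR defaults to 0.0 (treating the cell as an erasure), which is one possible prediction rather than the one the slides use.

```python
from typing import Iterable, List


def fill_with_dummy(payload_bits: List[int], noise_positions: Iterable[int],
                    page_len: int, dummy_bit: int = 0) -> List[int]:
    """Skip mode, write side: place dummy bits on Noise Cells, payload elsewhere."""
    noise = {p for p in noise_positions if 0 <= p < page_len}
    if len(payload_bits) != page_len - len(noise):
        raise ValueError("payload does not match the number of usable cells")
    payload = iter(payload_bits)
    return [dummy_bit if pos in noise else next(payload) for pos in range(page_len)]


def strip_dummy(page_bits: List[int], noise_positions: Iterable[int]) -> List[int]:
    """Skip mode, read side: drop the Noise Cell positions before decoding."""
    noise = set(noise_positions)
    return [bit for pos, bit in enumerate(page_bits) if pos not in noise]


def repair_llrs(llrs: List[float], noise_positions: Iterable[int],
                predicted_llr: float = 0.0) -> List[float]:
    """Predict mode: overwrite the LLRs of known Noise Cells before soft decoding."""
    repaired = list(llrs)                 # leave the caller's list untouched
    for pos in noise_positions:
        if 0 <= pos < len(repaired):
            repaired[pos] = predicted_llr
    return repaired
```

Here the Noise Cell locations are passed in directly; in the slides they come from the Noise Cells Location Search Table, whose lookup cost is the stated implementation challenge.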

# **Dynamic Cell Levels**



The more states are encoded within the same voltage range, the more likely the Vth distributions are to overlap, and the higher the RBER.

- Increase the margins between the states' Vth distributions.
- The states drawn with solid lines are easier to identify than the others.
- Reading XP is more difficult than reading LP/UP, especially with disturb or retention issues.
- Methods of Dynamic Cell Levels:
  - Track each block's P/E count and cell level.
  - Downgrade a weak block to store useful data in LP/UP only.
  - Downgrade a weak block to SLC mode.
  - OP is reduced, but endurance and reliability are extended.
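
The downgrade policy above can be sketched as a small state machine per block. This is a sketch under assumptions: the mode names, the per-mode P/E limits, and the RBER limit are hypothetical, and how a block is judged "weak" is not specified in the slides.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class CellMode(Enum):
    TLC = 3          # LP, UP and XP all store user data
    LP_UP_ONLY = 2   # XP is no longer used for user data
    SLC = 1          # one bit per cell


@dataclass
class BlockState:
    pe_cycles: int = 0
    mode: CellMode = CellMode.TLC


def next_mode(block: BlockState, rber: float,
              pe_limit: Dict[CellMode, int], rber_limit: float) -> CellMode:
    """Return the mode the block should use next, downgrading one step if it is weak."""
    # Assumption: a block is weak once it exceeds the P/E limit of its
    # current mode or its measured RBER reaches rber_limit.
    weak = block.pe_cycles >= pe_limit[block.mode] or rber >= rber_limit
    if not weak:
        return block.mode
    if block.mode is CellMode.TLC:
        return CellMode.LP_UP_ONLY
    return CellMode.SLC   # already-SLC blocks stay SLC; retirement handles them
```

Each downgrade gives up capacity in that block, which is the over-provisioning cost the last bullet refers to.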




#### Retirement

- Monitor and collect statistics on page/block errors at runtime.
- Eliminate potential trouble in advance.
- Reduce the overhead of error handling and the risk of data loss.

| Retirement<br>Scale | Trigger Condition | Operation |
|---------------------|-------------------|-----------|
| Page                | <ul> <li>Grade each page by the bit error count observed on it.</li> <li>Track error grades and their frequency.</li> <li>Define thresholds per page type.</li> <li>Threshold reached → retire the page.</li> <li>tPROG &gt; MAX → retire this block.</li> </ul> | <ul> <li>Write dummy data to the retired page to avoid the risk of damaging user data.</li> <li>Reduce the cost of data recovery.</li> </ul> |
| Block               | <ul> <li>Track the count and type of retired pages.</li> <li>When the retired page count reaches a threshold, either retire the block or switch it to SLC mode.</li> <li>tBERS &gt; MAX → retire the block.</li> </ul> | <ul> <li>Replace the retired block with a reserved good block.</li> <li>Reduce the GC overhead on blocks that hold too many invalid dummy pages.</li> </ul> |
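
The trigger conditions in the table can be sketched as a small runtime monitor. All thresholds and names here are hypothetical, and the grading is reduced to a single per-page-type error threshold for brevity; the table implies several grades.

```python
from collections import Counter

# Hypothetical thresholds; real values depend on the NAND part and ECC strength.
ERROR_THRESHOLD = {"LP": 60, "UP": 50, "XP": 40}   # bit errors per read, by page type
HITS_TO_RETIRE_PAGE = 3             # high-error reads before a page is retired
RETIRED_PAGES_TO_RETIRE_BLOCK = 2   # retired pages before the block is retired
T_PROG_MAX_US = 3000
T_BERS_MAX_US = 10000


class RetirementMonitor:
    def __init__(self):
        self.hits = Counter()        # (block, page) -> number of high-error reads
        self.retired_pages = set()
        self.retired_blocks = set()

    def record_read(self, block: int, page: int, page_type: str, bit_errors: int) -> None:
        """Count high-error reads and retire the page, then the block, at the thresholds."""
        if bit_errors < ERROR_THRESHOLD[page_type]:
            return
        self.hits[(block, page)] += 1
        if self.hits[(block, page)] >= HITS_TO_RETIRE_PAGE:
            self.retired_pages.add((block, page))
            retired_in_block = sum(1 for b, _ in self.retired_pages if b == block)
            if retired_in_block >= RETIRED_PAGES_TO_RETIRE_BLOCK:
                self.retired_blocks.add(block)

    def record_timing(self, block: int, t_prog_us: int = 0, t_bers_us: int = 0) -> None:
        """Retire the block when program or erase time exceeds its maximum."""
        if t_prog_us > T_PROG_MAX_US or t_bers_us > T_BERS_MAX_US:
            self.retired_blocks.add(block)
```

The sketch always retires a block at the threshold; the table's alternative of switching the block to SLC mode would replace that final step.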



# Adaptive RAID



- ECC and E2E data path protection  $\rightarrow$  bit/byte-level data protection.
- RAIN  $\rightarrow$  page/block/die-level protection.
- The RAID group size N+1 is a parameter  $\rightarrow$  balances performance, failure rate, and capacity.
- The RAID stripe size is adjusted dynamically once a bad element appears  $\rightarrow$  enhances fault tolerance.
- The parity element rotates across multiple channels  $\rightarrow$  reduces the impact on reads.
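
The N+1 stripe with rotating parity can be sketched with plain XOR parity. This is an illustration under assumptions: the slides do not state the parity code, so XOR is assumed, and `build_stripe`, `rebuild`, and the channel numbering are hypothetical. A stripe shrinks simply by passing a shorter list of good channels.

```python
from functools import reduce
from typing import Dict, List, Tuple


def xor_parity(pages: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized pages."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*pages))


def build_stripe(data_pages: List[bytes], good_channels: List[int],
                 stripe_index: int) -> Tuple[Dict[int, bytes], int]:
    """Lay out N data pages plus one parity page over the currently good channels."""
    if len(data_pages) != len(good_channels) - 1:
        raise ValueError("need exactly one data page per channel except the parity one")
    # Rotate the parity position so no single channel holds all the parity.
    parity_slot = stripe_index % len(good_channels)
    parity = xor_parity(data_pages)
    data = iter(data_pages)
    layout = {
        channel: parity if slot == parity_slot else next(data)
        for slot, channel in enumerate(good_channels)
    }
    return layout, good_channels[parity_slot]


def rebuild(layout: Dict[int, bytes], lost_channel: int) -> bytes:
    """Recover the page on one lost channel from the surviving pages."""
    return xor_parity([page for ch, page in layout.items() if ch != lost_channel])
```

For example, with four good channels the stripe is 3+1; after one channel goes bad, the next stripe is built with three channels as 2+1, trading capacity for continued protection.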



# Welcome to you!



# **Memory the Future**

visit us at booth #523

#### mengxin@derastorage.com