

## The Challenges of PCIe SSD Robustness in Cross Temperature Applications

Presenter: Nicolas Leng, ATP Electronics Inc.



©2024 Conference Concepts, Inc. All Rights Reserved



# Agenda

#### Understanding the Challenges of PCIe SSD in Cross-Temp Environment

#### How to Achieve PCIe SSD Robustness in Cross-Temp Environment

- Mechanical Challenges & Considerations
- Environment/Testing Challenges & Considerations
- Firmware Challenges & Considerations

#### **Conclusions**





## Understanding the Challenges of PCIe SSD in Cross-Temp Environment

Cross-temp applications, such as Edge/IoT/Automotive applications, may encounter thermal challenges, which are a critical issue impacting performance and reliability.



| Case        | Temp. | Airflow    | Customer Criteria                                                 |
|-------------|-------|------------|-------------------------------------------------------------------|
| Box PC      | High  | No Airflow | Needs to stay operational without shutting down                   |
| Data Logger | High  | Strong     | Sustained Read/Write performance                                  |
| IIoT Server | High  | Strong     | Fan actions triggered by temp need to stay within a certain range |

## Impact on the Trifecta for Industrial SSD Robustness: Endurance, Temperature Resilience & Data Integrity

#### Endurance

- 5K+ P/E cycles in Native TLC
- 100K+ P/E cycles in pSLC mode

#### Cross Temp. Error Handling

- 125°C operating cross-temp range with robust FW error-handling

#### Errors in NAND Flash

- Errors in NAND flash begin to rise as it approaches the end of its operational life.
- As NAND flash nears the end of its life, implementing a robust error-handling mechanism is vital for minimizing errors and preserving data integrity.

(Figure: quality of NAND die, shown as error bits based on 1KB for fresh NAND (1 P/E cycle) and end-of-life NAND (100% P/E cycle), without and with error handling.)











# Mechanical Challenges & Considerations



## Mechanical Challenges & Considerations: PCB Design and Thermal Assessment

#### IR Drop analysis and Power Drop simulation in PCB design

- Power drop simulation identifies the amount of electric power produced or consumed when electric current flows through a voltage drop.
- Locate current and temperature hot spots to avoid the risk of failure

#### **Optimizing PCB Layout and Component Placement**

- Engineers adjust the circuit layout, wire thickness, and the quantity/position of through holes.
- Goals: minimize IR drop and improve performance, signal integrity, and power/heat distribution efficiency
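The IR drop and power figures that such a simulation reports follow from Ohm's law. The sketch below is only a hand calculation for a single straight copper trace, not a substitute for a PCB simulation tool. The function names and the example trace dimensions are hypothetical; the copper resistivity and temperature coefficient are standard textbook approximations.

```python
# Approximate copper properties (textbook values, not from the slides).
COPPER_RESISTIVITY_OHM_M = 1.68e-8   # at 20 °C
COPPER_TEMP_COEFF_PER_C = 0.0039     # linear approximation around 20 °C


def trace_resistance_ohm(length_mm, width_mm, thickness_um, temp_c=20.0):
    """Resistance of a rectangular copper trace: R = rho * L / (W * t)."""
    area_m2 = (width_mm * 1e-3) * (thickness_um * 1e-6)
    r_at_20c = COPPER_RESISTIVITY_OHM_M * (length_mm * 1e-3) / area_m2
    # Copper resistance rises roughly linearly with temperature.
    return r_at_20c * (1.0 + COPPER_TEMP_COEFF_PER_C * (temp_c - 20.0))


def ir_drop_and_power(current_a, resistance_ohm):
    """Return (voltage drop in V, dissipated power in W) for a given current."""
    v_drop = current_a * resistance_ohm
    return v_drop, current_a * v_drop


if __name__ == "__main__":
    # Hypothetical trace: 30 mm long, 1 mm wide, 35 um thick, carrying 2 A.
    for temp_c in (20.0, 85.0):
        r = trace_resistance_ohm(30.0, 1.0, 35.0, temp_c)
        v, p = ir_drop_and_power(2.0, r)
        print(f"{temp_c:5.1f} C: R={r * 1e3:.2f} mOhm, "
              f"drop={v * 1e3:.1f} mV, power={p * 1e3:.1f} mW")
```

Widening or thickening the trace lowers the resistance and therefore both the voltage drop and the local heating, which is the reasoning behind the layout adjustments listed above.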





#### Cadence Power DC

### Useful Thermal Management System for Mechanical design at assigned thermal/flow environment

- Preventing overheating issues
- Understanding the environmental effect and the mechanical design influence of heat dissipation







## Environment Testing & Thermal Enhancement Options




## Engineering Validation: Managing Heat While Keeping Performance



#### **Dynamic Thermal Throttling**

Verifies the balance between performance and temperature: a firmware mechanism continuously monitors the device temperature and adjusts the operating pace accordingly.
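A minimal sketch of how such a throttling loop could behave is shown below. It is an illustration under stated assumptions, not the firmware described in this deck: the `ThrottlePolicy` name, the 80 °C and 75 °C thresholds, the step size, and the minimum performance level are all hypothetical. The gap between the two thresholds (hysteresis) keeps the drive from oscillating between throttled and unthrottled states.

```python
from dataclasses import dataclass


@dataclass
class ThrottlePolicy:
    # All values are hypothetical placeholders, not vendor settings.
    throttle_on_c: float = 80.0    # start slowing down at or above this temperature
    throttle_off_c: float = 75.0   # start speeding up at or below this temperature
    step: float = 0.1              # change in performance level per reading
    min_level: float = 0.3         # never drop below 30% of full speed


def next_performance_level(temp_c, level, policy):
    """Return the new performance level (0..1) for one temperature reading."""
    if temp_c >= policy.throttle_on_c:
        return max(policy.min_level, round(level - policy.step, 2))
    if temp_c <= policy.throttle_off_c:
        return min(1.0, round(level + policy.step, 2))
    return level  # inside the hysteresis band: hold the current pace


def simulate(readings_c, policy=None):
    """Apply the policy to a sequence of temperature readings."""
    policy = policy or ThrottlePolicy()
    level, trace = 1.0, []
    for temp_c in readings_c:
        level = next_performance_level(temp_c, level, policy)
        trace.append(level)
    return trace


if __name__ == "__main__":
    print(simulate([70, 78, 82, 84, 81, 77, 74, 72]))
```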



#### ATP auto power measurement

Automatically captures power data and plots it together with other key metrics in one diagram:

- Avg./Max. Current (mA)
- Read/Write Performance (MB/s)
- Avg. Response time (ms)
- Device Temp. (°C)
- Controller Temp. (°C)



(Chart legend: Read MBps, Max Current (mA), Avg Current (mA), Avg Response Time (ms) ×10, Device Temp (°C) ×10, Controller Temp (°C) ×10)

## Staying **COOL**





(Ta: 55°C & Airflow: 600 LFM)

- When the composite temperature increases due to activity, the SSD's NAND flash controller slows performance through thermal throttling.
- With an optimized FW algorithm, the max. composite temp. of the NVMe SSD is reduced and performance stays steady.



# Environment/Testing Challenges & Considerations





## Environment/Testing Challenges & Considerations

#### Environment Dependent Adversity

- Temperature extremes/heat generated by SSDs
- Mechanical shock & vibration

#### Endurance/Reliability Assessments

- User and market models may influence endurance requirements
- Endurance is operating temperature dependent



#### JESD312 Endurance Requirements

| Characteristic                            | Personal Auto | Professional Auto    |
|-------------------------------------------|---------------|----------------------|
| Years of Operation                        | 15            | 8                    |
| Days per year of use                      | 344           | 365                  |
| Average Hours per Day of Use              | 3             | 12                   |
| Nominal temperature, power on, active use | 55°C          | 55°C                 |
| Nominal temperature, power off            | 30°C          | 30°C                 |

#### Use Automotive Applications as an Example

- Assessing the life span of an automotive SSD is difficult
  - TBW rating decreases in the higher temperature range
  - TBW/DWPD requirements over the life span of a drive depend heavily on the temperature rating and on the personal or professional auto scenario
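To make the difference between the two usage models concrete, the sketch below turns the table rows into power-on and power-off hours and shows the standard relation between TBW and DWPD. The `UsageProfile` class is a hypothetical helper, the power-off figure assumes 365-day years (leap days ignored), and the capacity and TBW passed to `dwpd` in the example are made-up placeholders rather than ratings from any product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageProfile:
    name: str
    years: int
    days_per_year: int
    hours_per_day: int

    def days_of_use(self):
        return self.years * self.days_per_year

    def power_on_hours(self):
        return self.days_of_use() * self.hours_per_day

    def power_off_hours(self):
        # Assumes 365-day years; leap days are ignored in this sketch.
        return self.years * 365 * 24 - self.power_on_hours()


def dwpd(tbw_tb, capacity_tb, days_of_use):
    """Drive writes per day implied by a TBW budget: TBW / (capacity * days)."""
    return tbw_tb / (capacity_tb * days_of_use)


# Values taken from the endurance requirements table above.
PERSONAL = UsageProfile("Personal Auto", years=15, days_per_year=344, hours_per_day=3)
PROFESSIONAL = UsageProfile("Professional Auto", years=8, days_per_year=365, hours_per_day=12)

if __name__ == "__main__":
    for profile in (PERSONAL, PROFESSIONAL):
        print(profile.name, profile.power_on_hours(), profile.power_off_hours())
    # Hypothetical drive: 0.5 TB capacity with a 2580 TB write budget.
    print(dwpd(2580.0, 0.5, PERSONAL.days_of_use()))
```

The professional profile accumulates more than twice the power-on hours in roughly half the calendar time, so the same TBW rating translates into very different daily write budgets.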





## PCIe SSD Robustness in Cross-Temp Applications

Design Validation & Testing for Mission-Critical Applications

#### **Fields with Mission-Critical Applications require**

- Thermal Design/Product Characterization and Specification Validation
- Achieve Design Reliability with Extensive Testing





## Cross-Temp Reliability Assessments



#### **Environment Dependent Adversity**

- Qualification tests to validate product robustness
- Multiple factors combined with cross-temp applications
  - Accelerated environment stress tests: THB (temperature humidity bias), TC (temperature cycle), HTSL (high temp storage life)
  - Accelerated lifetime simulation tests: HTOL (high temp operating life), ELFR (early life failure rate), EDR (endurance data retention)...

#### **Temp Cycles to Ensure Solderability**

- Thermal Cycling Test: temp cycles between 0°C and 100°C for 1000 cycles with designed ramp rate/dwell time
- Mechanical Shock & Vibration: with various shock patterns & waves
- Dye & Pry and Cross-Section Check: to examine potential damages

<u>X-Ray</u>



#### Cross-Section Check





## Temperature Reliability Assessments



#### **Achieve Design Reliability with Extensive Testing**

#### Actual drive-level testing to validate the rated MTBF value

A reliability demonstration test with an adequate sample size should be conducted to obtain the MTBF, rather than relying only on reliability prediction software (such as Telcordia).



ATP Proprietary Coach-Gym Testing System for I-Temp testing (-40°C to 85°C)

#### End-of-Life Testing with Proven UBER Value and Beyond

- ✓ Drives went through P/E cycle testing until end of life and beyond without UECC
- ✓ Retention test at 10%, 100% and 120% of EOL P/E cycles
- ✓ The cumulative TBW for all SSDs demonstrates a UBER of less than 1 uncorrectable read error in 10<sup>17</sup> bits read at the SSD drive level
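As a rough sense of scale for the UBER claim, the sketch below applies the plain definition (uncorrectable errors divided by bits read). It is a simplified illustration: the function names are hypothetical, and it ignores the statistical confidence margin that a formal demonstration would add.

```python
def uber(uncorrectable_errors, bytes_read):
    """Uncorrectable bit error rate: errors per bit read."""
    return uncorrectable_errors / (bytes_read * 8)


def bytes_to_read_for_target(target_uber):
    """Bytes that must be read for one error to correspond to the target UBER."""
    return 1.0 / target_uber / 8


if __name__ == "__main__":
    needed = bytes_to_read_for_target(1e-17)
    print(f"{needed / 1e15:.1f} PB read per allowed error")
```

Reading 10<sup>17</sup> bits corresponds to 12.5 PB, which is why the result is demonstrated over the cumulative traffic of all drives under test rather than on a single unit.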

| Parameter               | Value  |
|-------------------------|--------|
| Ea                      | 0.6 eV |
| T<sub>STRESS</sub> High | 72°C   |
| T<sub>STRESS</sub> Low  |        |
| Confidence level        | 60%    |

| Sample Size | T<sub>STRESS</sub> (hours) | AF   | T<sub>USE</sub> (hours) |
|-------------|----------------------------|------|-------------------------|
| 432         | 1555                       | 2.84 | 1,910,181               |

T<sub>STRESS</sub> Hours: 671,760. T<sub>USE</sub> Hours: 1,910,181.

| Failures | χ²    | MTBF @ 55°C (hours) |
|----------|-------|---------------------|
| 0        | 1.83  | 2,084,689           |
| 1        | 4.04  | 944,553             |
| 2        | 6.21  | 615,120             |
| 3        | 8.35  | 457,500             |
| 4        | 10.47 | 364,774             |
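The figures in this MTBF table are consistent with the usual Arrhenius acceleration factor combined with a chi-square lower bound on MTBF. The sketch below reproduces that arithmetic under those assumptions; the slides do not state the formulas, so treat this as a plausible reconstruction. The function names are hypothetical, and the chi-square quantile is computed with a closed form that is valid only for even degrees of freedom, which is all this bound needs.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5


def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Acceleration factor between stress and use temperature (Arrhenius model)."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


def chi2_cdf_even_dof(x, dof):
    """Chi-square CDF for even dof: 1 - exp(-x/2) * sum_{i<dof/2} (x/2)^i / i!."""
    half, term, total = x / 2.0, 1.0, 1.0
    for i in range(1, dof // 2):
        term *= half / i
        total += term
    return 1.0 - math.exp(-half) * total


def chi2_quantile_even_dof(p, dof):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even_dof(hi, dof) < p:
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if chi2_cdf_even_dof(mid, dof) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def mtbf_lower_bound(device_hours_at_use, failures, confidence):
    """MTBF bound for a time-terminated test: 2T / chi2(confidence, 2r + 2)."""
    chi2 = chi2_quantile_even_dof(confidence, 2 * failures + 2)
    return 2.0 * device_hours_at_use / chi2


if __name__ == "__main__":
    af = arrhenius_af(ea_ev=0.6, t_use_c=55.0, t_stress_c=72.0)
    use_hours = 432 * 1555 * af  # sample size * stress hours per drive * AF
    print(f"AF = {af:.2f}, equivalent use hours = {use_hours:,.0f}")
    for failures in range(5):
        print(failures, f"{mtbf_lower_bound(use_hours, failures, 0.60):,.0f}")
```

Small differences from the table in the last digits come from rounding of the acceleration factor.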





## Cross-Temp Reliability Assessments

#### 4-Corner Testing

Based on temperature cycling and high/low voltage combined with temperature extremes to better simulate the actual environment.





Power cycling test/sudden power off recovery test should be conducted in temperature extremes
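A 4-corner plan can be enumerated as every combination of the temperature and voltage extremes. The sketch below is illustrative only: the temperature limits reuse the I-Temp range quoted earlier in this deck (-40°C to 85°C), while the supply limits (3.3 V ± 5%) and the function name are assumptions, not values from the slides.

```python
from itertools import product


def four_corner_matrix(temp_extremes_c, voltage_extremes_v):
    """Return one test condition per (temperature, voltage) corner."""
    return [
        {"temp_c": temp_c, "voltage_v": voltage_v}
        for temp_c, voltage_v in product(temp_extremes_c, voltage_extremes_v)
    ]


if __name__ == "__main__":
    # Voltage limits are an assumed 3.3 V +/- 5% rail.
    corners = four_corner_matrix((-40.0, 85.0), (3.135, 3.465))
    for corner in corners:
        # Each corner would host the power cycling / sudden power-off tests.
        print(corner)
```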

## Component Level Reliability

#### **Comprehensive NAND Flash Testing at Temp Extremes**

NAND Flash IC Screening



Illustration of Blocks/ICs in a module device

Good Blocks / Qualified Blocks identified for your application

Weak Blocks / Blocks that are not qualified and screened out

#### **Comprehensive NAND Flash Testing at Temp Extremes**

- Prevents products from failing before specified end of life across the industrial temperature range and across various embedded/industrial usage cases
- Direct and complete NAND flash quality control (Typically masked under the NAND flash controller error correction engine)
- Identifies qualified/unqualified blocks intended for your application using stress accelerants such as temperature, power/voltage and other factors
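The screening idea can be sketched as a simple classification: measure each block under several stress conditions and keep only the blocks whose worst case stays under a limit. This is a generic illustration, not the screening flow used by the presenter; the function name, the condition labels, the measurements, and the 40-bit limit are all hypothetical.

```python
def screen_blocks(error_bits_by_block, limit_bits):
    """Split blocks into (good, weak) by worst-case error bits across conditions."""
    good, weak = [], []
    for block_id, per_condition in sorted(error_bits_by_block.items()):
        worst = max(per_condition.values())
        (good if worst <= limit_bits else weak).append(block_id)
    return good, weak


if __name__ == "__main__":
    # Hypothetical error bits per 1KB measured at temperature extremes.
    measurements = {
        0: {"-40C": 12, "85C": 18},
        1: {"-40C": 15, "85C": 55},   # degrades at high temperature
        2: {"-40C": 48, "85C": 20},   # degrades at low temperature
        3: {"-40C": 9, "85C": 11},
    }
    good, weak = screen_blocks(measurements, limit_bits=40)
    print("good:", good, "weak:", weak)
```

Using the worst case across conditions, rather than an average, reflects the point that a block must hold up over the whole temperature range of the application.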

#### Other Considerations

Component derating for all components used on an SSD should be assessed to ensure enough design guard band



# Firmware Challenges & Considerations



# Cross Temperature Challenges: Program and Read data @ Cross Temp.



#### Increasing Errors: Program @ low temp, Read @ high temp

(Pre-cycle: 100 P/E cycles)

| Program @ 0°C                    | Read @ 0°C | Read @ 20°C (1 hour) | Read @ 40°C (2 hours) | Read @ 60°C (3 hours) | Read @ 70°C (4 hours) | Keep constant 70°C for 16 hrs |
|----------------------------------|------------|----------------------|-----------------------|-----------------------|-----------------------|-------------------------------|
| UECC (ECC threshold 72 bits/1KB) | N/A        | N/A                  | 747                   | 2259                  | 5320                  | 10869                         |

| Program @ 70°C                   | Read @ 70°C | Read @ 60°C (1 hour) | Read @ 40°C (2 hours) | Read @ 20°C (3 hours) | Read @ 0°C (4 hours) | Keep constant 0°C for 16 hrs |
|----------------------------------|-------------|----------------------|-----------------------|-----------------------|----------------------|------------------------------|
| UECC (ECC threshold 72 bits/1KB) | N/A         | N/A                  | N/A                   | N/A                   | N/A                  | <b>1</b><br>17               |



## Cross Temperature Challenges: Significant Vth shift when flash is close to end-of-life

Require Robust Error-Handling Algorithm for Cross-Temperature Environments



## Error Handling Mechanism – Auto Read Calibration



Optimized reference voltage

1) **Read Retry** is not enough.

2) We need sophisticated Auto Read Calibration.

#### **Example:**

Radio Channel @ 93.6kHz without noise

No Noise = No UECC = Data Integrity
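The tuning analogy can be illustrated with a toy model: two threshold-voltage (Vth) states modeled as Gaussian distributions that have drifted with temperature, and a sweep that looks for the read reference voltage with the fewest misreads. This is a generic sketch of read-reference calibration, not the ARC algorithm presented here; all voltages, sigmas, and function names are hypothetical.

```python
import math


def normal_cdf(x, mean, sigma):
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def raw_bit_error_rate(v_ref, low_state, high_state):
    """Fraction of cells misread at v_ref, assuming both states are equally likely."""
    low_mean, low_sigma = low_state
    high_mean, high_sigma = high_state
    low_misread = 1.0 - normal_cdf(v_ref, low_mean, low_sigma)   # low state read as high
    high_misread = normal_cdf(v_ref, high_mean, high_sigma)      # high state read as low
    return 0.5 * (low_misread + high_misread)


def calibrate_read_reference(low_state, high_state, v_min, v_max, steps=300):
    """Sweep v_ref over [v_min, v_max] and return (best v_ref, its error rate)."""
    best_v, best_rber = v_min, float("inf")
    for i in range(steps + 1):
        v_ref = v_min + (v_max - v_min) * i / steps
        rber = raw_bit_error_rate(v_ref, low_state, high_state)
        if rber < best_rber:
            best_v, best_rber = v_ref, rber
    return best_v, best_rber


if __name__ == "__main__":
    default_v_ref = 2.0               # hypothetical factory setting
    # Hypothetical (mean, sigma) of each state after a cross-temp shift.
    shifted_low, shifted_high = (0.6, 0.25), (2.4, 0.25)
    before = raw_bit_error_rate(default_v_ref, shifted_low, shifted_high)
    v_cal, after = calibrate_read_reference(shifted_low, shifted_high, 0.0, 3.0)
    print(f"default: {before:.2e}, calibrated at {v_cal:.2f} V: {after:.2e}")
```

Read retry typically steps through a fixed table of offsets, whereas a calibration of this kind searches for the valley between the shifted distributions, which is why it can recover data that a limited retry table cannot.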





|         | w/o ARC<br>@high temp               | w/ ARC @high temp.                                                            | w/ ARC @low temp.                                                             |
|---------|-------------------------------------|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------|
| Card 1  | 383                                 | 0                                                                             | 0                                                                             |
| Card 2  | 77                                  | 0                                                                             | 0                                                                             |
| Card 3  | 5                                   | 0                                                                             | 0                                                                             |
| Remarks | RTBB =<br>UECC = Data<br>Corruption | <b>No RTBB:</b> error bits are recovered by ARC at high temp. =Data Integrity | <b>No RTBB:</b> error bits are recovered by ARC at low temp. = Data Integrity |

NOTE: Read Retry is implemented by default as the baseline for comparison with ARC.

\*RTBB: Run time bad block

\*\*UECC: Uncorrectable ECC








## Conclusions

- The cross-temp impact on endurance, temperature resilience, and data integrity for industrial SSD robustness is intricate; all aspects, from HW design and engineering validation to FW error handling, need to be considered.
- PCIe SSD robustness in cross-temp applications may be achieved through:
  - Mechanical: start right from PCB design simulation to pinpoint potential areas of excessive heat in the circuit design, followed by thermal simulation to understand how the mechanical design influences heat dissipation. For application-specific scenarios, a customized solution may be formed by evaluating heat sink (HS) solutions and thermal throttling in chamber testing, so as to choose the heat sink solution that maximizes performance.
  - Temp-related reliability testing: achieve design reliability with extensive testing, which includes actual drive-level testing in a chamber to validate the rated MTBF values, end-of-life (EOL) testing, solderability under temp cycles, 4-corner testing for mission-critical applications, etc.
  - Firmware: through the FW auto read calibration mechanism, error bits are recovered at high temp, which enhances data integrity; this error-handling method is proven effective at high temp.



## For More Information on ATP Electronics



Visit our website: www.atpinc.com








## Thank you!