#### Western Digital.

# A Journey into NVMe-oF™: Options, Trade-offs and Challenges

Ihab Hamadi, Fellow, Western Digital

August 8, 2019






Flash Memory Summit 2019, Santa Clara, CA © 2019 Western Digital Corporation or its affiliates. All rights reserved.

#### Background: Why NVMe? Why NVMe-oF?

- Parallelism fits multi-core CPUs
  - Also reduces/spreads host CPU load
- Removes some cost components
  - Some common HW blocks
  - One driver
- Storage System Benefits
  - Lower latency (average & tail)
  - Higher BW
- NVMe-oF Motivation:
  - Extend benefits end-to-end





| NVMe Controller FE |
|--------------------|
| IO Scheduler       |
| Back End           |



#### **NVMe Transport Model**

| NVMe Transport (Fabric Type) | Locality                 | Model: Cmd/Rsp | Model: Data        |
|------------------------------|--------------------------|----------------|--------------------|
| PCIe                         | Local bus                | Memory         | Memory             |
| FC                           | Fabric message transport | Capsule        | Capsule/Msg        |
| TCP                          | Fabric message transport | Capsule        | Capsule/Msg        |
| IB                           | Fabric message transport | Capsule        | Capsule/Shared Mem |
| RoCE                         | Fabric message transport | Capsule        | Capsule/Shared Mem |
| iWARP                        | Fabric message transport | Capsule        | Capsule/Shared Mem |
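To make the capsule model concrete, here is a minimal sketch of how a command capsule bundles a fixed-size submission queue entry with optional in-capsule data. The class name, the `DATA_MODEL` lookup, and the `wire_size` helper are illustrative assumptions rather than any driver's API; the 64-byte entry size is the standard NVMe submission queue entry size, and transport headers are deliberately ignored.

```python
from dataclasses import dataclass

SQE_SIZE = 64  # bytes in an NVMe submission queue entry

# Data-transfer model per fabric, as listed in the transport table
# (hypothetical lookup, for illustration only).
DATA_MODEL = {
    "PCIe": "memory",
    "FC": "capsule/message",
    "TCP": "capsule/message",
    "IB": "capsule/shared memory",
    "RoCE": "capsule/shared memory",
    "iWARP": "capsule/shared memory",
}


@dataclass
class CommandCapsule:
    """A command capsule: one SQE plus optional in-capsule data."""
    sqe: bytes
    data: bytes = b""

    def __post_init__(self):
        if len(self.sqe) != SQE_SIZE:
            raise ValueError(f"SQE must be exactly {SQE_SIZE} bytes")

    def wire_size(self) -> int:
        # Payload bytes only; transport framing is not modeled here.
        return len(self.sqe) + len(self.data)


# A 4 KiB write carried entirely in-capsule.
capsule = CommandCapsule(sqe=bytes(SQE_SIZE), data=bytes(4096))
print(capsule.wire_size(), DATA_MODEL["TCP"])
```

On the shared-memory fabrics the data would typically be moved by RDMA rather than carried in the capsule, which is why the table lists both options for IB, RoCE, and iWARP.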



#### Fabric 101: Lossy vs. Lossless Fabrics



### Data Center Bridging (DCB)




#### **DCB: PFC**




#### **DCB: ECN**

- ECN is an end-to-end congestion management mechanism
- Three roles: sender (reaction point, RP), switch (congestion point, CP), receiver (notification point, NP)
- Goal is to slow down the sender before packets are dropped
- Related schemes: QCN, DC-QCN, DC-TCP
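The three roles can be sketched as three small functions. This is a toy illustration that borrows the DCTCP window-reduction rule (a running estimate `alpha` of the marked fraction, then `cwnd * (1 - alpha / 2)`); it is not what the slides measured, and RoCE deployments generally use a rate-based scheme instead. The function names, the queue threshold, and the gain `g = 1/16` are assumptions chosen for the example.

```python
def switch_mark(queue_depth, threshold):
    """Congestion point (CP): mark the packet instead of dropping it
    once the egress queue grows past a threshold."""
    return queue_depth > threshold


def receiver_feedback(marks):
    """Notification point (NP): report the fraction of packets that
    arrived with a congestion mark."""
    return sum(marks) / len(marks) if marks else 0.0


def sender_react(alpha, cwnd, marked_fraction, g=1 / 16):
    """Reaction point (RP): update the congestion estimate and shrink
    the window in proportion to it (DCTCP-style rule)."""
    alpha = (1 - g) * alpha + g * marked_fraction
    if marked_fraction > 0:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))
    return alpha, cwnd


# One round: hypothetical queue depths seen by four packets at the switch.
depths = [10, 40, 80, 120]
marks = [switch_mark(d, threshold=64) for d in depths]
alpha, cwnd = sender_react(0.0, 100.0, receiver_feedback(marks))
print(marks, alpha, cwnd)
```

The point of the sketch is the ordering: the switch signals early, the receiver echoes the signal, and the sender backs off while the queue still has room, so no packet has to be dropped.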






#### **Do I Really Need DCB (Lossless Net) with RoCE?**

*BW vs. IO Size*



Source: Western Digital Performance Tests



#### **Fabric Selection Criteria**

| Environment | Metrics | Scale | Operations | Future |
|-------------|---------|-------|------------|--------|
| Target Loc<br>Accessibility<br>Distance<br>Existing Fabrics<br>Consumer Loc<br>Regulatory<br>Multi-tenancy | Perf: Latency<br>Perf: Predictability<br>Perf: Consistency<br>Perf: Bandwidth<br>Cost: \$/Port<br>Cost: CPU/BW<br>Cost: CPU/IOPS | Single Rack<br>Multi-rack<br>Clos architecture<br>Oversubscription<br>Link aggregation<br>Redundancy | Onboarding<br>Configuration<br>Automation<br>Adv Telemetry<br>Intent Based<br>SW Defined \<x\> | Future Roadmap<br>Scale-up<br>Upgrade |

### Case Study: Fabrics Comparison (partial sample)

|                                                                 | NVMe/RoCE          | NVMe/TCP           |
|-----------------------------------------------------------------|--------------------|--------------------|
| Max Speed (current->next gen)                                   | 200G → 400G        | 200G → 400G        |
| Link Aggregation                                                | Yes. HW based      | Yes. HW based      |
| 1/2 Round Trip Transport Latency                                | 1.4us              | 8-30us             |
| 4k Write Latency<br>(50<sup>th</sup> percentile)                | 14us               | 31us               |
| 4k Write Latency – Tail/QoS<br>(99.99<sup>th</sup> percentile)  | 25us               | 272us              |
| Encapsulation                                                   | UDP                | TCP                |
| Routability                                                     | Routable UDP based | Routable TCP based |
| Scale                                                           | Multi Rack         | Multi Rack         |
| Convergence with other traffic                                  | Yes                | Yes                |
| Switch ASIC (Merchant Silicon)                                  | Yes                | Yes                |
| Disaggregated Switches                                          | Yes                | Yes                |
| SDN                                                             | Yes                | Yes                |
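One way to read the two latency rows together is the ratio of tail to median latency, which gives a rough sense of consistency. The sketch below uses only the four 4k write figures from the table; the dictionary layout and the `tail_ratio` helper are assumptions made for the example.

```python
# 4k write latency in microseconds, copied from the comparison table.
latency_us = {
    "NVMe/RoCE": {"p50": 14, "p99.99": 25},
    "NVMe/TCP": {"p50": 31, "p99.99": 272},
}


def tail_ratio(fabric):
    """Ratio of 99.99th-percentile latency to median latency."""
    row = latency_us[fabric]
    return row["p99.99"] / row["p50"]


for fabric in latency_us:
    print(f"{fabric}: tail/median = {tail_ratio(fabric):.1f}x")
# NVMe/RoCE: tail/median = 1.8x
# NVMe/TCP: tail/median = 8.8x
```

In this sample the RoCE tail stays within about 2x of its median, while the TCP tail is close to 9x its median.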








#### **Latency Comparison**

#### Latency (us) Percentiles



Source: Western Digital Performance Tests
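For readers who want to produce the same kind of percentile view from their own measurements, here is a minimal sketch using the nearest-rank definition. The slides do not say which percentile method the test tooling used, so the method, the function name, and the sample values are all assumptions.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # Round before ceil so floating-point noise cannot bump the rank.
    rank = math.ceil(round(p * len(ordered) / 100, 9))
    return ordered[rank - 1]


# Made-up latency samples in microseconds, for illustration only.
samples_us = [14, 13, 15, 14, 16, 14, 90, 15, 13, 14]
for p in (50, 90, 100):
    print(f"p{p}: {percentile(samples_us, p)} us")
```

Very high percentiles such as the 99.99th only become meaningful with a large sample count: with fewer than 10,000 samples the nearest-rank result is simply the maximum.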


#### Latency vs. IOPS




#### **Test Setup**

- Linux kernel 5.0
- Mellanox ConnectX-5 NIC
- Mellanox 2700 32x100G switch
- Intel<sup>®</sup> Xeon<sup>®</sup> Gold 6150 CPU @ 2.70GHz
- 100G RAM disk



#### Summary

- NVMe/RoCE and NVMe/TCP are complementary technologies
- RoCE has lower and more consistent latency
- RoCE needs DCB
- RoCE uses fewer CPU cycles
- TCP does not need DCB
- TCP appears less optimized for performance and efficiency
- No "One Size Fits All"


Western Digital and the Western Digital logo are registered trademarks or trademarks of Western Digital Corporation or its affiliates in the US and/or other countries. Intel and Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. The NVMe and NVMe-oF word marks are trademarks of NVM Express, Inc. All other marks are the property of their respective owners.