

# Important new NVMe features for optimizing the data pipeline

Dr. Stephen Bates, CTO, Eideticom






- Intro to NVMe Controller Memory Buffers (CMBs)
- Use cases for CMBs
  - Submission Queue Support (SQS) only
  - RDS (Read Data Support) and WDS (Write Data Support) for NVMe p2p copies
  - SQS, RDS and WDS for optimized NVMe over Fabrics
- Software for NVMe CMBs
  - SPDK (Storage Performance Development Kit) work for NVMe copies.
  - Linux kernel work for p2pdma and for offload.
- Roadmap for the future



## Intro to Controller Memory Buffers

- CMBs were introduced to the NVMe standard in 2014 in version 1.2.
- An NVMe CMB is a PCIe BAR (or part thereof) that can be used for certain NVMe-specific data types.
- The main purpose of the CMB is to provide an alternative to:
  - Placing queues in host memory.
  - Placing data for DMA in host memory.
- As well as a BAR, two optional NVMe registers are needed:
  - CMBLOC: location
  - CMBSZ: size and supported types
- Multiple vendors support CMBs today (Intel, Eideticom, Everspin) or will soon (Toshiba, Samsung, WDC, etc.).

Flash Memory Summit 2018 Santa Clara, CA

#### 3.1.11 Offset 38h: CMBLOC – Controller Memory Buffer Location

This optional register defines the location of the Controller Memory Buffer (refer to section 4.7). If CMBSZ is 0, this register is reserved.

| Bit   | Type | Reset     | Description |
|-------|------|-----------|-------------|
| 31:12 | RO   | Impl Spec | Offset (OFST): Indicates the offset of the Controller Memory Buffer in multiples of the Size Unit specified in CMBSZ. This value shall be 4KB aligned. |
| 11:03 | RO   | 0h        | Reserved |
| 02:00 | RO   | Impl Spec | Base Indicator Register (BIR): Indicates the Base Address Register (BAR) that contains the Controller Memory Buffer. For a 64-bit BAR, the BAR for the lower 32-bits of the address is specified. Values 0h, 2h, 3h, 4h, and 5h are valid. |

#### 3.1.12 Offset 3Ch: CMBSZ – Controller Memory Buffer Size

This optional register defines the size of the Controller Memory Buffer (refer to section 4.7). If the controller does not support the Controller Memory Buffer feature then this register shall be cleared to 0h.

| Bit   | Type | Reset     | Description |
|-------|------|-----------|-------------|
| 31:12 | RO   | Impl Spec | Size (SZ): Indicates the size of the Controller Memory Buffer available for use by the host. The size is in multiples of the Size Unit. If the Offset + Size exceeds the length of the indicated BAR, the size available to the host is limited by the length of the BAR. |
| 11:08 | RO   | Impl Spec | Size Units (SZU): Indicates the granularity of the Size field.<br>0h: 4 KB<br>1h: 64 KB<br>2h: 1 MB<br>3h: 16 MB<br>4h: 256 MB<br>5h: 4 GB<br>6h: 64 GB<br>7h – Fh: Reserved |
| 07:05 | RO   | 0h        | Reserved |
| 04    | RO   | Impl Spec | Write Data Support (WDS): If this bit is set to '1', then the controller supports data and metadata in the Controller Memory Buffer for commands that transfer data from the host to the controller (e.g., Write). If this bit is cleared to '0', then all data and metadata for commands that transfer data from the host to the controller shall be transferred from host memory. |



## Intro to Controller Memory Buffers




- **A** This device's manufacturer has registered its vendor ID and device IDs with the PCIe database, so you get a human-readable description of it.
- **B** This device has three PCIe BARs:
  - BAR0 is 16KB and is the standard NVMe BAR that any legitimate NVMe device must have.
- **C** The third BAR is the Controller Memory Buffer (CMB), which can be used for both NVMe queues and NVMe data.
- **F** Since this device is an NVMe device, it is bound to the standard Linux kernel NVMe driver.
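As a rough illustration of how the BAR sizes in the callouts above can be checked without the screenshot, the sketch below parses the text of a Linux sysfs `resource` file, which lists one `start end flags` line per PCI resource. The function name `bar_sizes` and the sample addresses are made up for this example; only the 16 KiB size of BAR0 comes from the slide.

```python
def bar_sizes(resource_text):
    """Return {bar_index: size_in_bytes} for the populated BARs (0-5)."""
    sizes = {}
    for index, line in enumerate(resource_text.strip().splitlines()[:6]):
        start, end, _flags = (int(field, 16) for field in line.split())
        if end > start:  # unused slots are reported as all zeros
            sizes[index] = end - start + 1
    return sizes


# Hypothetical contents of /sys/bus/pci/devices/<bdf>/resource.
# The addresses and flags are invented; they are not taken from the slide.
SAMPLE = """\
0x00000000fb000000 0x00000000fb003fff 0x0000000000140204
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000fa000000 0x00000000fa7fffff 0x0000000000040200
0x00000000f9000000 0x00000000f94fffff 0x0000000000040200
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x0000000000000000 0x0000000000000000 0x0000000000000000
"""

sizes = bar_sizes(SAMPLE)
for index, size in sizes.items():
    print(f"BAR{index}: {size // 1024} KiB")
```

On a real system the same function could be fed the contents of the device's own `resource` file instead of `SAMPLE`.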

#### Below are CMBLOC and CMBSZ for a CMB-enabled NVMe device

| Register       | Field                          | Value |
|----------------|--------------------------------|-------|
| cmbloc : 3     | Offset (OFST)                  | 0 (See cmbsz.szu for granularity) |
|                | Base Indicator Register (BIR)  | 3 |
| cmbsz : 500003 | Size (SZ)                      | 1280 |
|                | Size Units (SZU)               | 4 KB |
|                | Write Data Support (WDS)       | Write Data and metadata transfer in Controller Memory Buffer is Not supported |
|                | Read Data Support (RDS)        | Read Data and metadata transfer in Controller Memory Buffer is Not supported |
|                | PRP SGL List Support (LISTS)   | PRP/SG Lists in Controller Memory Buffer is Not supported |
|                | Completion Queue Support (CQS) | Admin and I/O Completion Queues in Controller Memory Buffer is Supported |
|                | Submission Queue Support (SQS) | Admin and I/O Submission Queues in Controller Memory Buffer is Supported |
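The register values above can be decoded by hand from the bit layouts in the CMBLOC and CMBSZ tables. The sketch below does that for `cmbloc = 3` and `cmbsz = 500003` (read as hexadecimal). The function names are illustrative only. The bit positions for RDS, LISTS, CQS and SQS (bits 3, 2, 1 and 0) are not in the excerpt shown earlier, which stops at WDS; they are taken from the NVMe 1.2 register definition.

```python
SZU_BASE = 4 * 1024  # SZU value 0h means 4 KB; each step multiplies by 16


def decode_cmbloc(value):
    """Split CMBLOC into its BIR and OFST fields."""
    return {
        "bir": value & 0x7,               # bits 02:00
        "ofst": (value >> 12) & 0xFFFFF,  # bits 31:12, in Size Units
    }


def decode_cmbsz(value):
    """Split CMBSZ into its size and capability fields."""
    szu = (value >> 8) & 0xF              # bits 11:08
    if szu > 6:
        raise ValueError("SZU values 7h-Fh are reserved")
    unit = SZU_BASE * 16 ** szu
    sz = (value >> 12) & 0xFFFFF          # bits 31:12, in Size Units
    return {
        "sqs": bool(value & 0x01),        # bit 00 (from the NVMe 1.2 spec)
        "cqs": bool(value & 0x02),        # bit 01 (from the NVMe 1.2 spec)
        "lists": bool(value & 0x04),      # bit 02 (from the NVMe 1.2 spec)
        "rds": bool(value & 0x08),        # bit 03 (from the NVMe 1.2 spec)
        "wds": bool(value & 0x10),        # bit 04
        "sz": sz,
        "unit_bytes": unit,
        "size_bytes": sz * unit,
    }


loc = decode_cmbloc(0x3)
size = decode_cmbsz(0x500003)
print(loc)
print(size)
```

For this device the decode gives a CMB of 1280 x 4 KB = 5 MiB at offset 0 of the BAR indicated by BIR = 3, with only the queue bits (SQS, CQS) set.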




## Some Fun Use Cases for CMBs

- Placing some (or all) of your NVMe queues in the CMB rather than host memory. Reduces latency [Linux Kernel<sup>1</sup> and SPDK<sup>1</sup>].
- Using the CMB as a DMA buffer allows for offloaded NVMe copies. This can improve performance and offload the host CPU [SPDK<sup>1</sup>].
- Using the CMB as a DMA buffer allows RDMA NICs to place NVMe-oF data directly into the NVMe SSD. Reduces latency and CPU load [Linux Kernel<sup>2</sup>].



#### Traditional DMAs (left) load the CPU. P2P DMAs (right) do not load the CPU.

<sup>1</sup> Upstream in the relevant tree. <sup>2</sup> Proposed patches (see last slide for git repo).



## Software for CMBs - SPDK

- Storage Performance Development Kit (SPDK) is a Free and Open Source (FOSS) user-space framework for high performance storage.
- Focus on NVMe and NVMe-oF.
- Code added in Feb 2018 to enable P2P NVMe copies when CMBs allow it.
- A simple example of an application using this new API is also in the SPDK examples (cmb\_copy).




cmb\_copy is an example application using SPDK's APIs to copy data between NVMe SSDs using P2P DMAs. This bypasses the CPU's memory and PCIe subsystems.



#### Software for CMBs - SPDK

Terminal pane of the screencast:

<pre>
sbates@dionysus:~/spdk$ # OK, so here we show the switch ports. Note the USP is at the top.
sbates@dionysus:~/spdk$ # At the bottom the uio0 and uio1 DSP are the two we care about.
sbates@dionysus:~/spdk$ # Let's reset the counters
sbates@dionysus:~/spdk$ # Great now let's do a copy
sbates@dionysus:~/spdk$ sudo examples/nvme/cmb_copy/cmb_copy -r 0000:68:00.0-1-100-16000 -w 0000:69:00.0-1-100-16000 -c 0000:68:00.0
Starting DPDK 17.11.0 initialization
[ DPDK EAL parameters: cmb_copy -c 0x1 --file-prefix=spdk0 --base-virtaddr=0x1000000000 --proc-type=auto ]
EAL: Detected 16 lcore(s)
EAL: Auto-detected process type: PRIMARY
EAL: No free hugepages reported in hugepages-1048576kB
EAL: Probing VFIO support
EAL: PCI device 0000:68:00.0 on NUMA socket 0
EAL:   probe driver: 1de5:2000 spdk_nvme
probe_cb - probed 0000:68:00.0!
EAL: PCI device 0000:69:00.0 on NUMA socket 0
EAL:   probe driver: 8086:f1a5 spdk_nvme
probe_cb - probed 0000:69:00.0!
nvme_qpair.c: 112:nvme_admin_qpair_print_command: *NOTICE*: GET LOG PAGE (02) sqid:0 cid:87 nsid:ffffffff cdw10:007f00c0 cdw11:00000000
nvme_qpair.c: 283:nvme_qpair_print_completion: *NOTICE*: INVALID LOG PAGE (01/09) sqid:0 cid:87 cdw0:0 sqhd:000e p:1 m:0 dnr:0
nvme_ctrlr.c: 401:nvme_ctrlr_set_intel_support_log_pages: *ERROR*: nvme_ctrlr_cmd_get_log_page failed!
attach_cb - attached 0000:69:00.0!
attach_cb - attached 0000:68:00.0!
nvme_pcie.c: 602:nvme_pcie_ctrlr_free_cmb_io_buffer: *ERROR*: nvme_pcie_ctrlr_free_cmb_io_buffer: no deallocation for CMB buffers yet!
sbates@dionysus:~/spdk$ #
</pre>

PCIe switch port counters pane of the screencast, after the copy:

| Port           | Link              | Device ID | Driver | Ingress | Egress  | Ingress rate | Egress rate |
|----------------|-------------------|-----------|--------|---------|---------|--------------|-------------|
| USP (32-0-4-0) | x16-Gen3 - 8 GT/s |           |        | 541 kB  | 483 kB  | 16 kB/s      | 15.6 kB/s   |
| DSP (8-0-1-0)  | x8-Gen3 - 8 GT/s  | 1de5:1000 |        | 0 B     | 0 B     | 0 B/s        | 0 B/s       |
| DSP (12-0-1-4) | x8-Gen3 - 8 GT/s  | 1de5:2000 | uio0   | 9 MB    | 338 kB  | 0 B/s        | 0 B/s       |
| DSP (24-0-3-0) | x4-Gen3 - 8 GT/s  | 8086:f1a5 | uio1   | 300 kB  | 9.01 MB | 0 B/s        | 0 B/s       |

- cmb\_copy moves data directly from SSD A to SSD B using NVMe CMB(s).
- 99.99% of Upstream Port (USP) traffic on the PCIe switch is eliminated.
- The OS is still in complete control of the IO and handles any status/error messages.
- NVMe SQEs can be placed in the CMB or not, as desired. SGLs or PRPs can be supported.


See https://asciinema.org/a/bkd32zDLyKvIq7F8M5BBvdX42
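The `-r` and `-w` arguments in the cmb\_copy command line above pack four values into one string. The sketch below splits them, assuming the layout is `<pci id>-<namespace>-<start LBA>-<number of LBAs>` as described in the example's usage text. The `CopyTarget` type and `parse_target` helper are illustrative names and are not part of SPDK.

```python
from typing import NamedTuple


class CopyTarget(NamedTuple):
    bdf: str        # PCI address, e.g. 0000:68:00.0
    namespace: int  # NVMe namespace ID
    start_lba: int  # first logical block
    num_lbas: int   # number of logical blocks


def parse_target(arg):
    """Split '<pci id>-<namespace>-<start LBA>-<number of LBAs>'."""
    # Split from the right so the PCI address is kept in one piece.
    bdf, namespace, start_lba, num_lbas = arg.rsplit("-", 3)
    return CopyTarget(bdf, int(namespace), int(start_lba), int(num_lbas))


read_side = parse_target("0000:68:00.0-1-100-16000")
write_side = parse_target("0000:69:00.0-1-100-16000")
print(read_side)
print(write_side)
```

Read this way, the command in the screencast reads 16000 LBAs starting at LBA 100 from namespace 1 of the device at 0000:68:00.0, writes them to the same range on 0000:69:00.0, and uses the CMB of 0000:68:00.0 (the `-c` argument) as the data buffer.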



#### Software for CMBs – The Linux Kernel

- A P2P framework called p2pdma is being proposed for the Linux kernel.
- Much more general than NVMe CMBs. Any PCIe device can utilize it (NICs, GPGPUs, etc.).
- PCIe drivers can register device memory (e.g. CMBs or BARs) or request access to P2P memory for DMA.
- Initial patches use p2pdma to optimize the NVMe-oF target code.


| Mode of Operation                             | Latency (read/write, µs) | CPU Utilization | CPU Memory Bandwidth | CPU PCIe Bandwidth | NVMe Bandwidth | Ethernet Bandwidth |
|-----------------------------------------------|--------------------------|-----------------|----------------------|--------------------|----------------|--------------------|
| Vanilla NVMe-oF                               | 188/227                  | 1.00            | 1.00                 | 1.00               | 1.00           | 1.00               |
| ConnectX-5 Offload                            | 128/138                  | 0.02            | 2.40                 | 1.03               | 1.00           | 1.00               |
| Eideticom NoLoad p2pmem                       | 167/212                  | 0.55            | 0.09                 | 0.01               | 1.00           | 1.00               |
| ConnectX-5 Offload + Eideticom NoLoad p2pmem  | 142/154                  | 0.02            | 0.02                 | 0.04               | 1.00           | 1.00               |

The p2pdma framework can be used to improve NVMe-oF targets. Here we show results from an NVMe-oF target system. Values other than latency are relative to vanilla NVMe-oF (1.00).

p2pdma can reduce CPU memory load by 50x and CPU PCIe load by 25x. NVMe offload can also be employed to reduce CPU core load by 50x.
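The reduction factors quoted above are the reciprocals of the relative values in the table. The short sketch below recomputes them; the dictionary keys and the `reduction` helper are names chosen for this example, and the numbers are copied from the table.

```python
# Relative load per mode of operation, copied from the table (vanilla = 1.00).
RESULTS = {
    "Vanilla NVMe-oF": {"cpu": 1.00, "mem": 1.00, "pcie": 1.00},
    "ConnectX-5 Offload": {"cpu": 0.02, "mem": 2.40, "pcie": 1.03},
    "Eideticom NoLoad p2pmem": {"cpu": 0.55, "mem": 0.09, "pcie": 0.01},
    "ConnectX-5 Offload + Eideticom NoLoad p2pmem": {
        "cpu": 0.02, "mem": 0.02, "pcie": 0.04,
    },
}


def reduction(relative_load):
    """Convert a load relative to vanilla into an 'N times lower' factor."""
    return 1.0 / relative_load


combined = RESULTS["ConnectX-5 Offload + Eideticom NoLoad p2pmem"]
factors = {name: round(reduction(value)) for name, value in combined.items()}
print(factors)
```

The 50x and 25x figures correspond to the combined ConnectX-5 Offload + NoLoad p2pmem row. The p2pmem-only row works out to roughly 11x for CPU memory bandwidth and 100x for CPU PCIe bandwidth.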



#### Software for CMBs – The Linux Kernel

- The hardware setup for the NVMe-oF p2pdma testing is as shown on the right.
- The software setup consisted of a modified Linux kernel and standard NVMe-oF configuration tools (mostly nvme-cli and nvmet).
- The modified kernel added support for NVMe offload and peer-to-peer DMAs using an NVMe CMB provided by the Eideticom NVMe device.




Green: legacy data path. Red: p2pdma data path.

This is the NVMe-oF target configuration used. Note that the RDMA NIC is connected to the PCIe switch and not to a CPU Root Port.



#### Software for CMBs – The Linux Kernel



- Eideticom puts accelerators behind NVMe namespaces.
- We can combine this with CMBs to add compute to p2pdma.
- Demo with AMD and Xilinx in the Xilinx booth.
- >3GB/s compression (input) via U.2 NVMe accelerator (NoLoad) with p2pdma offloading CPU by >99%.
- Leverages p2pdma to move data from input NVMe SSD to NoLoad and from NoLoad to output NVMe SSD.



## **Persistent Memory Regions**

- PMRs add persistence to CMBs.
- PMR features are being discussed and the standard will be updated. Get involved!
- Small PMRs (10MB-1GB) are interesting:
  - Write cache
  - Journal for SQEs
  - Persistent scratchpad for metadata
- Big PMRs (>>1GB) are *really* interesting.



#### **Persistent Memory Regions**

See https://www.youtube.com/watch?v=olEem6hHAss&t=1310s

#### Building a PMoF Target Today: Hardware (v2-pcie)

- Fabric I/F
  - Requires CPU utilization on the client side
  - Not true load/store on client
  - Challenging to scale
  - Non-coherent (client and target)
- Control Plane
  - Uh, why is the CPU in the way?
  - Very CPU/ISA dependent (DDIO, cache effects)
- PM Media
  - Expensive
  - Not hot-swappable
  - Capacity/Scale issues
  - MoBo support (ADR) required
- Also
  - Decouples target-side CPU DDR from performance
  - Decouples target-side CPU PCIe from performance
  - Fabric I/F can be upgraded in time (Star Wars!)


Large NVMe PMRs would enable PMoF aware filesystems!



## Roadmap for CMBs and PMRs and the Software

- NVMe CMBs have been in the standard for a while. However, it is only now that devices are becoming available and software is starting to use them.
- SPDK and the Linux kernel are the two main locations for CMB software enablement today.
- SPDK: NVMe P2P copies. NVMe-oF updates coming.
- Linux kernel: p2pdma framework upstream soon. Will be expanded to support other NVMe/PCIe resources (e.g. doorbells).
- Persistent Memory Regions add non-volatile CMBs and will require (lots of) software enablement too. They will enable a path to persistent memory storage on the PCIe bus.