# Use Cases for CXL-based Active Memory Tiering and Near Memory Accelerators

Presenter: Divya Vijayaraghavan (Altera)

Co-authors: Tom Schulte and Pekon Gupta (Altera)



### Two Prominent CXL Use Cases

1. Active Memory Tiering
   - Local and remote memory tiers, with migration of hot and cold pages between tiers
2. Near Memory Compute Acceleration
   - Remote memory tiers accelerate or process data close to the memory elements

Application areas labeled in the accompanying diagram: High Performance Compute; SmartNIC, IPU; Financial Services.

| Architecture Attributes | Memory Expansion | Memory Disaggregation | Acceleration |
|-------------------------|------------------|-----------------------|--------------|
| Use Cases | Capacity expansion<br>Bandwidth expansion<br>Software-assisted tiering | Hardware-assisted tiering<br>Differentiated memory pooling<br>Multi-host management | Inline acceleration<br>Look-aside acceleration (QAT) |
| Cost Sensitivity | High | Moderate | Moderate |
| Bandwidth | 80% of line-rate | ~80% of line-rate | TBD |
| Form-factors | EDSFF (E3.S, E1.S) | PCIe CEM, Blade, Custom | PCIe CEM, Blade, OCP, Custom |
| Latency (round trip) | <100ns | ~200ns to 350ns | ~300ns to 500ns |
| Media | DRAM | DRAM DDR4/5, NAND,<br>emerging persistent memory | DDR4/5 DRAM, NAND |
| Power | Low: 50% ~ 90% of DDR5 | TBD | TBD |

## Active Memory Tiering Considerations

#### **Approaches**

- S/W driven: Kernel scans memory allocation, identifies local vs. remote memory references <sup>(1)</sup>
- H/W-based hot page detection: Identifies most frequently accessed physical pages in Tier 2 memory
- Hardware-assisted application-transparent memory tiering management <sup>(2)</sup>
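The hot page detection idea above can be illustrated with a minimal software sketch. It is not taken from the presentation: the `HotPageTracker` class, the 4 KiB page size, the halving decay, and the address trace are all hypothetical, and a real device would keep such counters in hardware.

```python
from collections import Counter

PAGE_SHIFT = 12  # assumption: 4 KiB pages


class HotPageTracker:
    """Toy per-page access counter with epoch-based decay (hypothetical)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, physical_address):
        # Count the access against the page frame that contains the address.
        self.counts[physical_address >> PAGE_SHIFT] += 1

    def end_epoch(self):
        # Halve every counter so that pages no longer accessed cool down;
        # counters that reach zero are dropped.
        self.counts = Counter(
            {pfn: c // 2 for pfn, c in self.counts.items() if c // 2 > 0}
        )

    def hottest(self, n):
        # Page frame numbers of the n most frequently accessed pages.
        return [pfn for pfn, _ in self.counts.most_common(n)]


tracker = HotPageTracker()
# Hypothetical trace of physical addresses observed in Tier 2 memory.
for address in [0x1000, 0x1008, 0x1FF8, 0x5000, 0x5010, 0x9000]:
    tracker.record(address)
promotion_candidates = tracker.hottest(2)
```

The decay step is one simple way to keep the "hotness" ranking from being dominated by old accesses; other aging policies are equally plausible.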

#### **Challenges**

- Accuracy of "hotness" classification
- Page migration latency
- Understanding of workload characteristics
- Hardware vs. software partitioning

#### **Mitigation**

- Increased offload of hot page detection to hardware
- Provide enhanced memory access monitoring/reporting capabilities on the CXL.mem HDM interface
- Identify frequently used Host Physical Addresses
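The page migration latency challenge can be framed as a break-even calculation: promoting a page only pays off if the latency saved on future accesses exceeds the one-time cost of moving it. The sketch below is illustrative only. The function names and all three latency values are assumptions, not figures from the presentation.

```python
def break_even_accesses(remote_ns, local_ns, migration_ns):
    """Number of future accesses at which promotion pays for itself."""
    return migration_ns / (remote_ns - local_ns)


def worth_promoting(expected_accesses, remote_ns, local_ns, migration_ns):
    """True if the latency saved by promotion exceeds the migration cost."""
    return expected_accesses * (remote_ns - local_ns) > migration_ns


# Hypothetical values chosen only to make the arithmetic easy to follow.
REMOTE_NS = 300         # assumed access latency in the remote tier
LOCAL_NS = 100          # assumed access latency in the local tier
MIGRATION_NS = 20_000   # assumed one-time cost of migrating one page

threshold = break_even_accesses(REMOTE_NS, LOCAL_NS, MIGRATION_NS)
promote_busy_page = worth_promoting(150, REMOTE_NS, LOCAL_NS, MIGRATION_NS)
promote_idle_page = worth_promoting(50, REMOTE_NS, LOCAL_NS, MIGRATION_NS)
```

Under these assumed numbers a page must be accessed more than 100 times after promotion to justify the move, which is why the accuracy of the hotness estimate matters.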





## Near Memory Compute Acceleration Considerations

#### **Approaches**

- Near memory processing engine implemented in proximity to EMIF controller on CXL EP device <sup>(3)</sup>
- Memory tiering with computational memory devices and standard memory devices <sup>(4)</sup>
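One way to see the appeal of processing near memory is to count the bytes that must cross the host link. The toy model below compares a host-side filter, where every record is transferred, with a near-memory filter, where only matching records are returned. The function names, record size, and predicate are hypothetical, and command and protocol overheads are ignored.

```python
def bytes_moved_host_filter(records, record_size):
    # The host reads every record across the link and filters locally.
    return len(records) * record_size


def bytes_moved_near_memory_filter(records, record_size, predicate):
    # The device filters in place and returns only the matching records.
    matches = sum(1 for record in records if predicate(record))
    return matches * record_size


records = list(range(1000))   # hypothetical data set
RECORD_SIZE = 64              # assumption: one 64-byte record per entry


def is_match(record):
    # Hypothetical selective predicate: 1 record in 100 matches.
    return record % 100 == 0


host_bytes = bytes_moved_host_filter(records, RECORD_SIZE)
near_memory_bytes = bytes_moved_near_memory_filter(records, RECORD_SIZE, is_match)
```

The benefit in this model scales with the selectivity of the predicate; a filter that matches most records would move nearly as much data either way.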



## Performance Metrics

Publicly disclosed data points are emerging that show a latency advantage for CXL Type 2 near memory acceleration over traditional PCIe <sup>(5)</sup>.

#### STAC Report: LMS ÜberNIC CXL with 10GbE and 25GbE under STAC-N1

- Posted June 25, 2024
- New records from the first tests of a pure FPGA-based or CXL-based UDP stack

#### Intel-UIUC ksm offload to CXL Type 2 device

- Kernel features such as ksm increase application tail latency and consume CPU cycles
- Offloading these kernel features to a CXL Type 2 device yields:
  - 83% lower application tail latency
  - 61% fewer CPU cycles consumed by ksm
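For context, ksm (Kernel Samepage Merging) scans memory for pages with identical content and merges them into a single shared copy, which is the kind of scan-and-compare work that consumes CPU cycles. The sketch below shows only the core idea. It is not the Linux implementation and not the Intel-UIUC offload design: the `merge_identical_pages` function is hypothetical and uses a hash lookup followed by a byte comparison purely for brevity.

```python
import hashlib


def merge_identical_pages(pages):
    """Map each page index to the index of the first page with equal content."""
    canonical = {}  # digest -> indices of distinct pages with that digest
    mapping = {}
    for i, page in enumerate(pages):
        digest = hashlib.sha256(page).digest()
        for j in canonical.get(digest, []):
            # Confirm with a full byte comparison before sharing the page.
            if pages[j] == page:
                mapping[i] = j
                break
        else:
            # No identical page seen so far: this page becomes canonical.
            canonical.setdefault(digest, []).append(i)
            mapping[i] = i
    return mapping


PAGE_SIZE = 4096  # assumption: 4 KiB pages
pages = [
    bytes(PAGE_SIZE),        # all zeros
    b"\x01" * PAGE_SIZE,
    bytes(PAGE_SIZE),        # duplicate of page 0
]
mapping = merge_identical_pages(pages)
pages_saved = len(pages) - len(set(mapping.values()))
```

Every page must be read and compared at least once, so the cost grows with the amount of memory scanned, which is the motivation for moving the work off the CPU.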

#### Measured CXL.cache latency 68% lower than PCIe

| Protocol  | Tool                    | Initiator | Target | Design Used / Command  |
|-----------|-------------------------|-----------|--------|------------------------|
| CXL.cache | Intel CXL Stress Tester | Device    | Host   | CXL Type 2 ED / RdCurr |
| PCIe      | MCDMA Driver            | Device    | Host   | MCDMA ED / Mem Read    |



©2024 Conference Concepts, Inc. All Rights Reserved

### Takeaways

- CXL-based Memory Tiering and Near Memory Acceleration provide advantages
  - Reduction in system TCO
  - Offloading of processing from CPU
  - Reduction in processing latency for specific workloads
- Some challenges can be mitigated by FPGA-based solutions
  - CXL IP and design example with configurable pre-built accelerator functions
  - Dynamic reconfigurability



### Contributors from Altera

- Bhushan Chitlur
- Sung San Choe
- Shawn Slockers
- Navneet Rao
- Zhongqian Yu
- Lingyan Li
- Xuan Zhao
- Jiwei He



### References

(1) Meta, UMichigan - TPP: Transparent Page Placement for CXL-enabled Tiered-Memory: https://arxiv.org/abs/2206.02878

(2) Google - https://doi.org/10.1145/3582016.3582031

(3) SKHynix - https://www.youtube.com/watch?v=pbnTlY41h08

(4) SmartModular - https://www.youtube.com/watch?v=A_PML20fk-Y

(5) STAC - STAC Report: LMS ÜberNIC CXL with 10GbE and 25GbE under STAC-N1 (stacresearch.com)

(6) Unifabrix - https://www.unifabrix.com/



# Backup

