

## CXL Native Memory<sup>™</sup>

# Do We Really Need DDR?



**Bill Gervasi, Principal Systems Architect** 

Wolley Inc.

bilge@wolleytech.com



#### 1. CXL Native Memory: Do We Really Need DDR? (15 minutes)

CXL memory modules expand memory capacity to support emerging applications such as large language models (LLMs), which demand 140GB or more of local capacity. HBM cannot reach these capacities, so CXL is a logical way to expand memory, but it comes at a significant cost in power consumed and bandwidth wasted. Is DDR doing us any favors, and can we imagine a memory world without DDR? CXL Native Memory proposes to replace the inefficient DDR interface with a direct CXL physical interface that drives memory cores from the CXL FLIT without protocol retranslation. CXL Native Memory reduces memory latency overhead while saving power.





# Good news!

### **CXL ended the fabric wars**

(sort of)






We'll address the impact of NVLink/UALink in my other talk...

#### MULTIPLE DEVICES OF ALL TYPES PER ROOT PORT

[Figure: System, Switch, Accelerator Enclosure, Memory Enclosure]

#### Just as DDR5 goes to one DIMM per channel...





...cutting server memory capacity in half...

CXL comes along to save the day with DRAM expansion!

...but data centers are stuck with all these old DDR4 and DDR5 DIMMs they already paid for...



...so the initial introduction of an otherwise awesome technology is in the form of a chimera...











Fortunately, once the DIMM inventory is exhausted, the REAL CXL memory modules will take over...



Eliminating redundant voltage regulation, sockets, etc...

More cost effective than a module populated with new DIMMs







CXL modules assume a very long PCIe bus requiring high current drivers

Each DDR PHY drives external circuits with heavy loading and complex calibration

Redundant voltage regulation burns additional power







- L1: 96% hit rate, 1 cycle access
- L2: 95% hit rate, 25 cycles access
- L3: 98% hit rate, 80 cycles access

The good news: near-CPU caches do have high hit rates (reduces waste from unnecessary accesses)
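To put the cache figures in perspective, here is a rough Python sketch. It assumes the quoted hit rates are local (each level's rate applies only to accesses that missed the level above) and that the levels are probed one after another; both are assumptions rather than statements from the slide. The names `LEVELS`, `fraction_reaching_dram`, and `average_cache_cycles` are illustrative.

```python
# Hit rates and access cycles as quoted on the slide.
# Treating them as *local* (per-level) rates is an assumption.
LEVELS = [
    ("L1", 0.96, 1),
    ("L2", 0.95, 25),
    ("L3", 0.98, 80),
]


def fraction_reaching_dram(levels):
    """Fraction of accesses that miss every cache level."""
    miss = 1.0
    for _name, hit_rate, _cycles in levels:
        miss *= 1.0 - hit_rate
    return miss


def average_cache_cycles(levels):
    """Expected cycles spent in the caches per access (DRAM time excluded),
    assuming a level is probed only after the previous one misses."""
    reach = 1.0  # probability that an access gets as far as this level
    total = 0.0
    for _name, hit_rate, cycles in levels:
        total += reach * cycles
        reach *= 1.0 - hit_rate
    return total


if __name__ == "__main__":
    print(f"accesses reaching DRAM: {fraction_reaching_dram(LEVELS):.4%}")
    print(f"average cache cycles:   {average_cache_cycles(LEVELS):.2f}")
```

Under these assumptions, roughly 0.004% of accesses fall through all three cache levels, so the caches themselves are not where the waste is.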

By the time an access gets to the local DRAM, though, hit rates start to drop dramatically

- Read hit: ~82%
- Write hit: ~62%



A question I have posed that CPU guys refuse to answer:

How much performance gain are we getting for each watt expended?

ESPECIALLY when it comes to speculative DRAM page



Hit rates for accesses to remote memory drop even further, especially with increased thread count: ~65%

...and this is before memory pooling...

https://www.futureplus.com/blog/critical-memory-performance-metrics-for-ddr4-systems-page-hit-analysis

https://arxiv.org/pdf/2303.15375#:~:text=Meanwhile%2C%20as%20the%20block%20size%20increases%20beyond,latency%20begins%20to%20dominate%20the%20p99%20latency.





1KB block × 10 DRAMs × 2 (ACT + PRE)

64-byte cache line

100 bytes used on average



#### Waste > 99.97%

4KB block (plus DRAM accesses at SSD and Host)

### Waste > 97.5%
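The waste figure is one minus the ratio of bytes used to bytes moved. A minimal sketch in Python, assuming that the "100 bytes used on average" figure pairs with the 4KB block (the slide text does not state the pairing explicitly); `waste_fraction` is an illustrative helper name.

```python
def waste_fraction(bytes_used, bytes_moved):
    """Share of the moved data that is never used."""
    if bytes_moved <= 0:
        raise ValueError("bytes_moved must be positive")
    return 1.0 - bytes_used / bytes_moved


if __name__ == "__main__":
    # Assumed pairing: 100 bytes used out of each 4KB (4096-byte) block.
    print(f"waste: {waste_fraction(100, 4096):.2%}")
```

The same helper can be pointed at any other stage of the data path once its bytes-used and bytes-moved figures are known.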

Adding up the ratio of data used to data moved, we can generously estimate that data centers are

### 0.00004% efficient

(We suck at using data)










### Who cares about data usage efficiency?

For starters, the US Department of Energy cares about avoiding a time when we can no longer power the internet



Fortunately, large data center owners are finally catching on to the idea that total cost of ownership matters






### **CXL Native Memory<sup>™</sup> Imagines a World Without DDR**



And we see the power and latency improvement





### **CXL Native Memory Uses the CXL FLIT Directly**



CXL FLIT has everything a memory needs

- Address
- Command
- Data + metadata

Translate to core functions and timing (banks, rows, columns, etc.)
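As an illustration of that translation step, the Python sketch below splits a byte address taken from a FLIT into bank, row, and column indices. The field widths and their ordering are hypothetical, chosen only for the example; a real device would define its own mapping, and nothing here reflects the actual CXL FLIT layout.

```python
from dataclasses import dataclass

# Hypothetical device geometry (assumptions for illustration only).
OFFSET_BITS = 6   # 64-byte access granule
COLUMN_BITS = 7
BANK_BITS = 4
ROW_BITS = 16
TOTAL_BITS = OFFSET_BITS + COLUMN_BITS + BANK_BITS + ROW_BITS


@dataclass(frozen=True)
class CoreAddress:
    bank: int
    row: int
    column: int


def _take(value, bits):
    """Return (low `bits` bits of value, remaining high bits)."""
    return value & ((1 << bits) - 1), value >> bits


def decode(address):
    """Map a byte address to core coordinates under the assumed layout,
    most to least significant: [ row | bank | column | offset ]."""
    if address < 0 or address >> TOTAL_BITS:
        raise ValueError("address outside the assumed device range")
    _offset, rest = _take(address, OFFSET_BITS)
    column, rest = _take(rest, COLUMN_BITS)
    bank, rest = _take(rest, BANK_BITS)
    row, _ = _take(rest, ROW_BITS)
    return CoreAddress(bank=bank, row=row, column=column)


if __name__ == "__main__":
    print(decode(0x12345678))
```

Placing the column bits lowest keeps consecutive 64-byte lines in the same row, which is one common design choice; other interleavings are equally plausible.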

#### No DDR interface is needed

CXL.io provides for interesting enhancements to strict memory protocol





### **Bringing CXL to the Motherboard**





Naked memory die are just memory arrays, row drivers, and sense amps







#### ONE SMALL FLIT IN EVERY TRANSFER

#### LARGE FLITS ARE MULTIPLES OF SMALL FLITS

#### Data Usage Efficiency = 2000X that of CXL-DDR







### Proposal for FleX (M.28)

FleX (M.28) has

- PCIe Gen 6 x8 support + CXL
- Diff pairs on same side of module for Gen6 support
  - Tx, Rx calibrated independently
- On-module regulation
- Power ~ 11W
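For a sense of scale, the raw link rate of a PCIe Gen 6 x8 connection can be worked out as below. This is a back-of-the-envelope sketch in Python: it uses the PCIe 6.0 signaling rate of 64 GT/s per lane and ignores FLIT, FEC, and CRC overhead, so delivered bandwidth will be lower. The function name is illustrative.

```python
PCIE_GEN6_GT_PER_S = 64  # per lane, per direction (PCIe 6.0 signaling rate)


def raw_gb_per_s(lanes, gt_per_s=PCIE_GEN6_GT_PER_S):
    """Raw unidirectional link bandwidth in GB/s, before protocol overhead.

    Each transfer carries one bit per lane; 8 bits make a byte.
    """
    if lanes <= 0:
        raise ValueError("lanes must be positive")
    return lanes * gt_per_s / 8


if __name__ == "__main__":
    print(f"x8 raw, per direction: {raw_gb_per_s(8):.0f} GB/s")
```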

#### M.4 lengths TBD; starting estimates:

- 30 mm
- 60 mm
- 80 mm










DDR5 has a "slot problem"

To run DDR5 with two DIMMs per channel, the channel maxes out at 5600

To run DDR5 at 6400+, the layout is restricted to one DIMM per channel

This means end users must choose between speed and capacity
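The trade-off can be put in numbers with a short Python sketch. It assumes 64 data bits per DDR5 channel and a hypothetical 64 GB DIMM; the DIMM size is only a placeholder to show the capacity ratio, and `ChannelConfig` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelConfig:
    dimms_per_channel: int
    mt_per_s: int  # transfer rate in mega-transfers per second

    def peak_gb_per_s(self, data_bits=64):
        """Peak channel bandwidth in GB/s (assumes 64 data bits by default)."""
        return self.mt_per_s * data_bits / 8 / 1000

    def capacity_gb(self, dimm_gb):
        """Channel capacity for a given (hypothetical) DIMM size."""
        return self.dimms_per_channel * dimm_gb


TWO_DPC = ChannelConfig(dimms_per_channel=2, mt_per_s=5600)
ONE_DPC = ChannelConfig(dimms_per_channel=1, mt_per_s=6400)

if __name__ == "__main__":
    DIMM_GB = 64  # hypothetical DIMM size, for illustration only
    for name, cfg in (("2DPC @ 5600", TWO_DPC), ("1DPC @ 6400", ONE_DPC)):
        print(f"{name}: {cfg.peak_gb_per_s():.1f} GB/s peak, "
              f"{cfg.capacity_gb(DIMM_GB)} GB per channel")
```

Whatever the DIMM size, the faster configuration halves per-channel capacity in exchange for the higher transfer rate.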

Sad user

CXL Native Memory in a FleX module allows DDR slots to run at 6400+ without sacrificing memory capacity

Courtesy of Tom Schnell, Distinguished Scientist, Dell Computer











Thank you for your time

Any more questions?



Bill Gervasi, Principal Systems Architect Wolley Inc. bilge@wolleytech.com



