# Benefits of CXL for Server Memory Infrastructure

August 6, 2025

#### Growth of the CXL

### **CXL Integrators List**

#### OEMs offering CXL-capable servers:

- US / EMEA
  - Dell
  - HPE
  - Lenovo
  - Supermicro
- APAC
  - Advantech
  - Giga
  - Quanta
  - AIC

Scan the QR code to view the Integrators List

## CXL 3.0: Pooling & Sharing

1. Expanded use case showing memory sharing and pooling
2. CXL Fabric Manager is available to set up, deploy, and modify the environment

#### CXL Benefits for AI & Database Infrastructure

- In-Memory DB w/local + CXL memory
- Increases memory capacity
- Minimal DB performance impact
- Lower TCO, less power, better thermal mgmt

## Leo Smart Memory Controller Portfolio

#### **Features**

Architected to Accelerate AI and Database Infrastructure

- Memory expansion, pooling and sharing
- Performance optimized for memory intensive workloads

#### Customizable RAS

- Memory testing and repair with vendor defined stress patterns
- Per channel leaky bucket counter for logging DRAM errors
- Enhanced ECC scheme

#### Datacenter-grade Security

- TEE features compliant w/ leading security standards
- End-to-end security: RoT, Secure boot/update/debug/recovery, anti-rollback

#### COSMOS framework for Hyperscalers and Enterprise

- Tools and infrastructure to assist in Design, Debugging, RAS & Fleet Mgmt
- COSMOS UI, CLI & SDK APIs
- COSMOS Security Manager

#### Cloud-Scale Interop Lab support

CXL/PCIe Electrical Testing, Compliance Tests, Reset and Initialization Tests etc.


| Parameter | Leo 1 |
|---|---|
| CXL | CXL 1.1/2.0 |
| PCIe | PCIe 5.0 **32GT/s** |
| Lane Configuration | 1x16, 2x8 |
| DDR Configuration | **2ch DDR5**, up to 5600 (1 DPC), up to 4800 (2 DPC) |
| Capacity | Up to **2TB** |
| Package | 27x27mm |

#### **Product Offering**

**Leo Smart Memory Controllers**

Leo 1 P-Series/E-Series: PCIe 5, CXL 2.0, In Production

#### Aurora A-Series Hardware Solutions

#### Leo Accelerates AI & Database Infrastructure

#### In-Memory Databases | Transaction Processing

200% more queries per second, 1.5x memory capacity increase

Based on TPC-H benchmark with 2x Leo CXL Memory Controllers (512GB) and 1x Intel CPU (1TB)

#### **HPC | Computer Aided Engineering**

50% more iterations per second, w/ 50% added Memory

Based on CFD (CPU2017 FP) benchmark with 2x Leo CXL Memory Controllers (256GB) and 1x Intel CPU (512GB)

#### AI Inferencing | Chatbot Services

40% Faster time to insights with LLMs, w/ 30% added Memory

FlexGen (OPT-66B) with 2x Leo CXL Memory Controllers (256GB), 2x NVIDIA GPUs, 1x AMD CPU (768GB)

#### AI Inferencing | Recommendation System

73% More recommendations per second, 2x memory capacity increase

Based on DLRM benchmark with 4x Leo CXL Memory Controllers (1TB) and 1x AMD CPU (1.1TB)

Leo CXL Smart Memory Controller

## What is a RAG Pipeline?
#### Core components

- Data
- Model
- Embeddings
- Query

# **Memory Demand**

#### Phased Approach

- Generate Embeddings
  - Memory demand spikes
- Running the pipeline
  - Based upon the app

Figure based on<sup>[1]</sup>

### Advantages with CMM-D in RAG Cluster

Up to 19% higher performance with CMM-D in VectorDB search compared to the DRAM case in a Milvus RAG cluster

- Performance gain with bandwidth expansion through the CMM-D in Milvus RAG Cluster
- Using SW interleaving (between DRAM and CMM-D) to achieve optimal CXL bandwidth performance

#### \*\*Weighted Interleaving

- Linux kernel SW weighted interleaving lets you define an interleave ratio that best utilizes DRAM and CXL memory for optimal performance in a workload
- Included in Kernel Mainline (v6.9)

#### Comparison of QPS by Number of Servers

#### B/W monitoring results in one data server

Memory System DRAM Bandwidth

### Advantages with CMM-D in RAG Cluster

Both a TCO reduction and a memory expansion effect can be secured

- Equivalent QPS/\$ and 40% reduction in \$/GB cost
- Operating power reduction through application can reduce operating cost

Dataset: MSMARCO-V2

| Raw Size | Indexing Size (HNSW) | Entity Count | Dimension | Precision | Vector Size |
|---|---|---|---|---|---|
| 290GB | 673GB | 138 Million | 1024 (cohere) | FP32 | 4096B |

Reference TCO Calculator: https://v0-cxl-tco-2-nvdatd.vercel.app/

#### Related FMS Activities

- Join the panel discussion on Thursday
  - Title: Driving Interconnects: Memory and storage fabrics for new AI/ML workloads
  - Thursday, August 7, from 1:25 to 2:30 pm PT
  - AI/ML Track (AIML-304-1)
  - Panelists from: Meta, Microsoft, Texas A&M, Cal Poly University, and Samsung
  - Location: Ballroom A

Stop by Samsung's booth (#407) to learn more about our CXL solutions.
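
To make the weighted interleaving idea from the CMM-D slides concrete, here is a minimal Python sketch of how an interleave ratio spreads page allocations across a DRAM node and a CXL node. It is a toy model of the policy, not kernel code. The node names, the 3:1 ratio, and the function names `weighted_interleave` and `placement_share` are illustrative assumptions and do not come from the slides.

```python
from collections import Counter
from itertools import islice


def weighted_interleave(weights):
    """Yield node names forever, placing `weight` consecutive pages on each node per round.

    `weights` maps a node name to a positive integer weight (hypothetical values).
    """
    if not weights or any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive integers")
    while True:
        for node, weight in weights.items():
            for _ in range(weight):
                yield node


def placement_share(weights, pages):
    """Return the fraction of `pages` allocations that land on each node."""
    counts = Counter(islice(weighted_interleave(weights), pages))
    return {node: counts[node] / pages for node in weights}


# Assumed example ratio: 3 pages on DRAM for every 1 page on CXL memory.
ratio = {"dram": 3, "cxl": 1}
first_pages = list(islice(weighted_interleave(ratio), 8))
share = placement_share(ratio, 4000)
print(first_pages)
print(share)
```

In this model a 3:1 ratio sends three quarters of the pages to DRAM. In practice the weights would presumably be tuned per workload, for example roughly in proportion to the usable bandwidth of each memory tier.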
# Montage Technology

Presented by: Geof Findley

# Montage Technology and CXL Memory Expansion Controller – MXC

- Montage has more than 20 years in memory products, leading the industry. Second-largest PCIe Gen5 and Gen4 Retimer supplier. First to ship CXL controllers: Gen 1, Gen 2, and now Gen 3
- MXC is the newest product, based on a deep understanding of both DDR and PCIe technologies
- MXC GEN1 supports CXL 2.0 and DDR4-3200/DDR5-4800 (in mass production)
- MXC GEN2 supports CXL 2.x and DDR5-6400 (in mass production)
- MXC GEN3: Shipping M88MX6852 Type3 CXL® Memory eXpander Controller (industry first)
  - CXL 3.1 compatible
  - PCIe Gen6 speed up to 64GT/s
  - CXL x8 port with bifurcation to 2x4 ports
  - Up to DDR5-8000, with two independent memory controllers
  - Enhanced RAS capability
  - Security with IDE/TSP/DICE
  - Rich management features

### CXL Adoption and Mix...has a home!

Two thirds of servers today can support CXL products; by end of '26, well over 90%

### CXL adoption in the datacenter...It's Started

- 32Gb monolithic die (and corresponding 128GB RDIMM) and MRDIMM (with higher bandwidth) are alternatives to CXL expansion. Expansion is being considered when high DRAM content is desired
- Memory pooling is being developed and deployed with CXL 2.0 today, and will explode when CXL 3.x is available
# TCO Savings Examples with CXL Memory

Avoid High-Cost DIMMs

- 128GB and 256GB DIMMs have high price premiums
- CXL add-in cards with DIMM slots provide more total channels per socket → same system capacity with lower priced DIMMs
- Same concept applies regardless of mode: Intel Flat Memory Mode & SW-based tiering

| Memory per socket | Socket-attached DIMMs only | With CXL | Memory TCO Savings* |
|---|---|---|---|
| 1TB | 8x 128GB (1DPC) | 8x 64GB + 8x 64GB on CXL | 24% |
| 2TB | 16x 128GB | 16x 64GB + 16x 64GB on CXL | 27% |
| 4TB | 16x 256GB | 16x 128GB + 16x 128GB on CXL | 16% |

<sup>\*</sup>TCO savings based on Intel modeling using projected DIMM and CXL pricing for 2025

Use CXL-attached DIMMs to achieve high system memory capacity and avoid expensive high-capacity DIMMs

### Memory Tiering: H/W Based Example, Intel FMM

- Both DRAM and far memory (FM) are exposed to the OS as combined physical memory: one memory tier
- Data is 'Tiered': it resides in either DRAM or FM, with no replication
- Hot data is swapped into DRAM one cacheline at a time, not a whole 4KB page
- Performance is very good due to 1:1 Near/Far memory ratios
- No software modification needed

## Summary and Conclusion

- Throughput Analysis
  - CXL FMM Setup shows performance almost equivalent to Native Setup in terms of throughput
  - Across all tested workloads, the CXL FMM setup consistently delivers ~95-100% of Native performance
- Latency Analysis
  - Read Latency: CXL FMM Setup tends to be 5–10 µs higher than Native, a relative increase of 3–5%
  - Update Latency: Generally slightly higher on CXL FMM (10–20 µs)
  - Latency results are consistent across repetitions
  - Latency with CXL FMM Setup is slightly higher, especially for update operations, but the increase is small
- Stability
  - Each workload was repeated 3 times, and results were highly consistent, indicating stable system behavior.
  - No performance degradation or instability was observed due to CXL usage.

Conclusion: CXL FMM Setup demonstrates excellent usability and stability in MongoDB performance testing

Note: The CXL memory module itself adds less than 100 ns of hardware latency, yet the overall latency added by the CXL FMM setup is 5–10 µs. This hints that much of the latency savings could come from software-side improvements.

# XConn Technologies

Presented by: JP Jiang

### XConn's CXL 2.0 Switch

- World's First CXL 2.0 (XC50256) & PCIe 5.0 (XC51256) switch IC
- 2,048 GB/s total BW with 256 lanes
- Lowest port-to-port latency
- Lowest power consumption/port
- Reduced PCB area, Lower TCO
- Compatible with CXL 1.1 and CXL 2.0
- Supports memory expansion/pooling/sharing
- Works in <u>hybrid</u> mode (CXL/PCIe mixed)
- In production and shipping now

# CXL Memory Sharing/Pooling Chassis

# Composable Memory Pooling & Sharing For Cloud Database

ACM 2025 SIGMOD Industrial best paper award

Unlocking the Potential of CXL for Disaggregated Memory in Cloud-Native Databases

https://lnkd.in/gwwB\_4Ph

### CXL Performance Improvement over RDMA

#### Compare with DRAM

| | DRAM (Local) | DRAM (Remote) | CXL w/o switch (Local) | CXL w/o switch (Remote) | CXL w. switch (Local) | CXL w. switch (Remote) |
|---|---|---|---|---|---|---|
| Latency (ns) | 146 | 231 | 265.2 | 345.9 | 549 | 651 |

- CXL with switch, local: 3.76× the latency of local DRAM (549 ns vs. 146 ns)
- CXL with switch, remote: 2.82× the latency of remote DRAM (651 ns vs. 231 ns)
- Switch introduces additional latency

#### Compare with RDMA

| Size | RDMA write latency (µs) | CXL write latency (µs) | RDMA read latency (µs) | CXL read latency (µs) |
|---|---|---|---|---|
| 64B | 4.48 | 0.78 | 4.55 | 0.75 |
| 512B | 4.69 | 0.84 | 4.79 | 0.85 |
| 1KB | 4.77 | 0.88 | 4.91 | 1.07 |
| 4KB | 5.06 | 1.02 | 5.58 | 1.86 |
| 16KB | 6.12 | 1.68 | 7.13 | 2.46 |

- CXL reduces latency by 5.74× for writes and 6.07× for reads at 64B
- CXL latency is more sensitive to data size
- Avoiding page-level copy is beneficial

# CXL-based Memory Pool in PolarDB

- Servers send control messages via Ethernet
- CXL switch is connected via CXL x16 lanes
- Up to 16 TB memory
- Avoid tiered memory by deploying the buffer pool (BP) directly on CXL memory
- A metadata server is dedicated to CXL memory pool management
- Compute node allocates CXL memory via RPC

#### Database Performance on CXL

- CXL-BP shows comparable performance with DRAM-BP
- Database buffer pool is bandwidth-sensitive
- Memory tiering is not necessary, saving bandwidth and simplifying design

# Performance in Pooling Scenarios

- Lower bandwidth usage: 75%
- Higher performance: 3.2x
- Higher resource utilization: 4x

## Performance in Sharing Scenarios

- Over 70% improvement in point-update workload
- Over 160% improvement in a 12-node cluster
- Larger cluster, greater improvement

#### CXL For AI Workloads

#### Break the Memory Wall

- The majority of GPUs/Accelerators do not support CXL, but PCIe is available
- AI "Memory Wall": Large AI models require multi-TBs or more memory (Tokens, KV caching, etc.)
- XConn's "Ultra IO Transformer" enables GPUs/DPUs (PCIe devices) to directly access the CXL memory pool

# Thank You

www.ComputeExpressLink.org
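
The ratios quoted in the XConn latency comparison slides can be rechecked directly from the table values. The short Python sketch below does that arithmetic. The numbers are copied from the "Compare with DRAM" and "Compare with RDMA" tables, while the function and variable names are illustrative.

```python
def ratio(slower, faster):
    """Return how many times larger `slower` is than `faster`, rounded to 2 decimals."""
    return round(slower / faster, 2)


# Latency in ns, from the "Compare with DRAM" table.
dram_ns = {"local": 146, "remote": 231}
cxl_switch_ns = {"local": 549, "remote": 651}

# 64B access latency in microseconds, from the "Compare with RDMA" table.
rdma_us = {"write": 4.48, "read": 4.55}
cxl_us = {"write": 0.78, "read": 0.75}

switch_vs_dram = {k: ratio(cxl_switch_ns[k], dram_ns[k]) for k in dram_ns}
rdma_vs_cxl = {k: ratio(rdma_us[k], cxl_us[k]) for k in rdma_us}

print(switch_vs_dram)  # {'local': 3.76, 'remote': 2.82}
print(rdma_vs_cxl)     # {'write': 5.74, 'read': 6.07}
```

The results match the figures stated on the slides: 3.76× and 2.82× for switched CXL against local and remote DRAM, and 5.74× and 6.07× for RDMA against CXL at 64B.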