

# How CXL Computational Memory Will Revolutionize Data Processing

**Harry Juhyun Kim** 

**CPO and Co-founder** 









# MX1: CXL MEMORY + DATA PROCESSING

# CXL COMPUTATIONAL MEMORY FOR LARGE-SCALE DATA

CHECK OUT MX1 AT BOOTH #734

## **Novel CXL Hardware**

- 1000s of Custom RISC-V Cores
- DDR5 x 4Ch, ~1TB
- **TFLOPS Vector Engine**
- CXL 3.0 HDM-DB with Back-invalidation
- SSD-backed **CXL** Expansion
- Cache Coherence

## **Rich Software Framework**





## **AI+DATA WORKFLOWS**



AI/ML IS THE BIGGEST CONSUMER OF DATA TODAY
THE LARGEST TABLES AND LAKEHOUSES ARE BUILT FOR AI



#### Vector Database

- Vector Indexing and Searching
- RAG Pipelines
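
As a rough illustration of why vector search is dominated by memory traffic, here is a minimal brute-force top-k search in C++. It is a sketch under simple assumptions (flat row-major storage, dot-product similarity, no index structure); the function names `dot` and `topK` are made up for this example and are not part of any XCENA library. Each stored vector is read once and used for only a multiply-add per element, so the scan does very little compute per byte loaded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Dot product of two vectors of length dim.
float dot(const float* a, const float* b, std::size_t dim)
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < dim; ++i)
    {
        sum += a[i] * b[i];
    }
    return sum;
}

// Returns the indices of the k stored vectors with the largest dot product
// against the query, best match first. db holds the vectors back to back
// (row-major), each of length dim.
std::vector<std::size_t> topK(const std::vector<float>& db, std::size_t dim,
                              const std::vector<float>& query, std::size_t k)
{
    assert(dim > 0 && db.size() % dim == 0 && query.size() == dim);
    const std::size_t n = db.size() / dim;

    // Score every stored vector: one full pass over the database.
    std::vector<std::pair<float, std::size_t>> scored(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        scored[i] = {dot(&db[i * dim], query.data(), dim), i};
    }

    // Keep only the k best scores, in descending order.
    k = std::min(k, n);
    std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                      [](const auto& l, const auto& r) { return l.first > r.first; });

    std::vector<std::size_t> result;
    result.reserve(k);
    for (std::size_t i = 0; i < k; ++i)
    {
        result.push_back(scored[i].second);
    }
    return result;
}
```

Every query touches the whole database, and the per-vector scores are independent, which is what makes this kind of scan easy to split across many small cores.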

### Data Preparation

- Preparing tables
- Embeddings, long sequences of features
- Batch ETL pipelines similar to traditional data transformation

### Data Loading

- Feeding GPUs
- Latency sensitive
- Order sensitive (Training)
  - → random/chronological/...





## CHALLENGES IN DATA ANALYTICS @ SCALE

MEMORY INEFFICIENCIES AND DIMINISHING RETURNS IN SCALING CALL FOR A NEW MEMORY SOLUTION





# SOLUTION?



# WHAT IF A SINGLE DENSE MEMORY AND COMPUTE NODE COULD REPLACE A CLUSTER OF 10?

Dense Memory and Compute Node





Cluster of Nodes







## DATA PROCESSING QUADRANT

Figure: workloads placed by computational intensity per memory access and operation diversity per memory access (each axis runs from low to high).

- **Big Data Processing** (vector databases for Gen AI, DNA analysis): embarrassingly parallel, low compute intensity, memory bound. Memory approach: **Disaggregation, Computational Memory** (DDR, CXL memory).
- **General Computing**: **Bandwidth/Capacity Expansion** (DDR, LPDDR, CXL memory).
- **Artificial Intelligence**: **HBM**, Explicit Memory (HBM, GDDR).
- Remaining quadrant: No Interest.




### XCENA

# XCENA CXL COMPUTATIONAL MEMORY



# BEYOND JUST ANOTHER CXL MEMORY EXPANDER

DATA PROCESSING LIBRARY

**DATA ANALYTICS ACCELERATION** 

TCO SAVINGS
SEAMLESS INTEGRATION

**COMPUTING HW-SW SOLUTION** 

**NEAR-MEMORY PROCESSING** 

- LESS DATA MOVEMENT
- LOWER CPU UTILIZATION
- LOWER POWER
- BETTER PERFORMANCE

CXL MEMORY

HIGH PERFORMANCE CXL DRAM EXPANSION

*INFINITE MEMORY* 

PB-SCALE MEMORY
WITH SSD EXPANSION

MORE MEMORY
CHEAPER MEMORY



# NEAR-MEMORY PROCESSING



More Capacity, Extra Bandwidth, +Latency
Reducing data movement by Near-Memory Processing
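
To make the data-movement argument concrete, here is a back-of-the-envelope sketch in C++. It assumes a scan-and-filter workload in which only a fraction of the rows survives the filter. The function names and any numbers plugged into them are illustrative assumptions, not measurements of MX1.

```cpp
#include <cassert>
#include <cstdint>

// Bytes that cross the CXL link when the host CPU runs the filter itself:
// the host has to load every byte of the table.
std::uint64_t bytesMovedHostSide(std::uint64_t tableBytes)
{
    return tableBytes;
}

// Bytes that cross the link when the filter runs next to the memory:
// only the rows that pass the filter travel to the host.
// selectivityPercent is the share of rows that pass (0..100).
// Integer arithmetic: the result is rounded down to a multiple of 1%.
std::uint64_t bytesMovedNearMemory(std::uint64_t tableBytes, unsigned selectivityPercent)
{
    assert(selectivityPercent <= 100);
    return tableBytes / 100 * selectivityPercent;
}
```

For a hypothetical 1,000,000,000-byte table and a filter that keeps 5% of the rows, the host-side path moves 1,000,000,000 bytes over the link, while the near-memory path moves 50,000,000.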











1000s OF RISC-V CORES FOR EFFICIENT DATA PROCESSING







LOWER CPU UTILIZATION


## PARALLEL PROGRAMMING

Parallel Xceleration Library (PXL)

Map API encapsulates all the complexities of device management and provides a FRAMEWORK of thought for developers.

```
// ---- Host program ----
#include "pxl/pxl.hpp"

// setup the device
const char* filename = "mu_kernel/mu_kernel.mubin";
const char* muFuncName = "sort_with_ptr";
auto context = pxl::runtime::createContext(0);
auto module = pxl::createModule(filename);
auto job = context->createJob();
job->load(module);
auto muFunction = module->createFunction(muFuncName);

// allocate cxl memory
// (N, the number of 128-element arrays, is defined elsewhere)
int* a = reinterpret_cast<int*>(context->memAlloc(N * 128 * sizeof(int)));

// setup data
for (size_t i = 0; i < N; i++)
{
    for (size_t j = 0; j < 128; j++)
    {
        a[i * 128 + j] = 128 - j;
    }
}

// run one task per array, then wait for all of them
auto map = job->buildMap(muFunction, N);
auto ret = map->execute(a, 128);
map->synchronize();

// ---- Device kernel (mu_kernel) ----
#include <algorithm>
#include "mu/mu.hpp"

void sort_with_ptr(int* a, int size)
{
    auto taskIdx = mu::getTaskIdx();
    auto curArray = &a[taskIdx * size];
    std::sort(curArray, curArray + size);
}

MU_KERNEL_ADD(sort_with_ptr)
```

Figure: the PXL Runtime connects the host to the device compute resources (thread 0 to thread N-1). Host and device share a coherent, unified view of a[] in CXL memory, and each thread works on its own 128-element slice (a[0], a[128], a[256], ..., a[128 x (N-1)]).
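
The map pattern above can be tried out without a device. The following is a host-only sketch in standard C++ that mimics what each task does; it does not use PXL, and the names `sortSlice` and `emulateMap` are invented for this illustration. Tasks run one after another here, which is safe because every task touches a disjoint slice of `a[]`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Stand-in for the device kernel: task taskIdx sorts its own slice of a[].
// On the device the index comes from mu::getTaskIdx(); here it is passed in.
void sortSlice(int* a, std::size_t taskIdx, std::size_t size)
{
    int* curArray = a + taskIdx * size;
    std::sort(curArray, curArray + size);
}

// Host-only emulation of "build a map of numTasks tasks and execute it".
// The slices do not overlap, so the tasks could run in any order or
// concurrently; this sketch simply runs them sequentially.
void emulateMap(int* a, std::size_t numTasks, std::size_t size)
{
    assert(a != nullptr || numTasks == 0);
    for (std::size_t t = 0; t < numTasks; ++t)
    {
        sortSlice(a, t, size);
    }
}
```

Filling `a` the same way as the host program (each slice holds 128, 127, ..., 1) and calling `emulateMap(a, N, 128)` leaves every slice sorted in ascending order.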



## DATA SPECIFIC LIBRARY



A Spark SQL query is compiled and planned by Spark, translated into Velox plans by Gluten, and executed by XFLARE on MX1.

```
spark.read.parquet("persons.parquet").createOrReplaceTempView("persons")
spark.sql("""
    SELECT *
    FROM persons
    ORDER BY age DESC
""").show()
```

Figure: offloading control flow from the host (Spark, Gluten, Velox, XFLARE) to the MX1 device (XFLARE operator scheduler, XFLARE Parquet and OrderBy operators running on MUs). Buffers in CXL DRAM: Parquet file buffer, Arrow column buffer, Arrow sorted buffer.
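
To show what the `ORDER BY age DESC` step amounts to on column-oriented data, here is a small sketch in plain C++. It is not XFLARE, Velox, or Arrow code: the `PersonsColumns` struct, its `name` column, and both function names are assumptions made for this example. The idea is to sort row indices by the key column and then gather every column through those indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Hypothetical column-oriented table: one vector per column, equal lengths.
struct PersonsColumns
{
    std::vector<std::string> name;
    std::vector<int> age;
};

// Returns row indices ordered by age, descending. Rows with equal age keep
// their input order because the sort is stable.
std::vector<std::size_t> orderByAgeDesc(const PersonsColumns& t)
{
    assert(t.name.size() == t.age.size());
    std::vector<std::size_t> idx(t.age.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::stable_sort(idx.begin(), idx.end(),
                     [&t](std::size_t l, std::size_t r) { return t.age[l] > t.age[r]; });
    return idx;
}

// Builds the output table by reading every column through the sorted indices.
PersonsColumns gather(const PersonsColumns& t, const std::vector<std::size_t>& idx)
{
    PersonsColumns out;
    out.name.reserve(idx.size());
    out.age.reserve(idx.size());
    for (std::size_t i : idx)
    {
        out.name.push_back(t.name[i]);
        out.age.push_back(t.age[i]);
    }
    return out;
}
```

Only the key column is read during the sort; the other columns are touched once, in the gather. Both steps are dominated by memory access rather than arithmetic.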



# DON'T MOVE DATA, MOVE COMPUTATIONS!

## CXL LOAD/STORE

Use CXL memory as just another part of host memory.
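
A minimal sketch of what load/store access looks like from an application, in C++ on Linux. The deck does not say how MX1 memory is exposed to the OS, so this assumes a common arrangement for CXL memory in general: a mappable device node (for example a device-DAX node) or memory that the kernel has already added to the system as a NUMA node. The function name `mapForLoadStore` and any device path passed to it are hypothetical; without a device, the sketch falls back to ordinary anonymous memory so it runs anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Maps `bytes` of memory for plain load/store access.
// If devicePath is non-null and can be opened, the mapping is backed by that
// device (assumption: the CXL memory is exposed as a mappable device node).
// Otherwise it falls back to anonymous memory. Returns nullptr on failure.
int* mapForLoadStore(const char* devicePath, std::size_t bytes)
{
    assert(bytes > 0);
    void* p = MAP_FAILED;
    if (devicePath != nullptr)
    {
        int fd = open(devicePath, O_RDWR);
        if (fd >= 0)
        {
            p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            close(fd); // the mapping stays valid after the descriptor is closed
        }
    }
    if (p == MAP_FAILED)
    {
        p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    }
    return p == MAP_FAILED ? nullptr : static_cast<int*>(p);
}
```

Once mapped, the region is used through ordinary pointers (`a[i] = ...`, `sum += a[i]`) and released with `munmap`. Real device mappings typically impose size and alignment requirements that this sketch does not handle.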

#### **PXL API**

PXL allows you to manage resources and execute user kernels in parallel.

#### **XFLARE**

Common formats (Substrait, Arrow) and high-level APIs simplify query offloading.










## XCENA SDK

xcena.com/SDK

XCENA Software Development Kit (SDK) is a comprehensive software framework designed to ...

Seamless integration with computational CXL memory, specialized for data-intensive applications

 Support for emulation and simulation with general tools like QEMU and our proprietary simulators, ensuring smooth development and testing environments

## Software Ecosystem

Applications: Main applications running on the host system that leverage XCENA hardware.



# THANK YOU

HARRY KIM, CPO

harry.kim@xcena.com

http://xcena.com

https://www.linkedin.com/company/xcena/

