

# Scaling GPU Clusters & Low Latency Memory Fabrics With Active PCIe / CXL Cabling

Chris Blackburn

System Architect & Director of Field Applications Engineering





# **AI Infrastructure Scale Challenges**

Proprietary & Confidential <AWS>






- Model sizes have doubled after 6 months<sup>\*</sup>
- Scale-up fabrics connect hundreds of GPUs
- AI servers consume 8X more power than CPU servers\*\*
- GPUs are transitioning from air to liquid cooling\*\*\*

### AI infrastructure is under heavy pressure to scale clusters across several racks



# **Emerging Application: Multi Rack AI Fabric**






### Memory Bottlenecks Due to AI / ML Workloads


# **Emerging Application: Heterogeneous Infrastructure**


### **External Cabling Reach Considerations**


# **PCIe/CXL AECs: Handling PCIe Side-Band Signals**




#### Three "required" side-band signals defined in PCI-SIG's Card Electromechanical (CEM) Specification:

| PCIe Side-Band Signal | Description | Option for handling within an AEC | Alternative |
|-----------------------|-------------|-----------------------------------|-------------|
| REFCLK | 100 MHz HCSL clock with or without spread-spectrum modulation | Dedicated differential pair to carry REFCLK from one side to the other.<br><b>Pros</b>: Allows for common clock (CC) topologies<br><b>Cons</b>: Extra cable cost, "asymmetric" cable design | No REFCLK transport in cable: SRNS/SRIS.<br><b>Pros</b>: Lower cost, "symmetric" cable; scalable to multi-link AECs<br><b>Cons</b>: CC topology requires dedicated side-band cable between systems |
| PERST# | PCIe protocol reset | Dedicated single-ended line to carry PERST#.<br><b>Pros</b>: Allows PERST# synchronization on a per-link basis<br><b>Cons</b>: Extra cable cost, "asymmetric" cable design | No PERST# transport in cable. PCIe reset events are handled through in-band Hot Reset, host-coordinated local reset, side-band management, and/or Hot Plug support.<br><b>Pros</b>: Lower cost, "symmetric" cable; scalable to multi-link AECs<br><b>Cons</b>: No dedicated per-link PERST# |
| PRSNT# | Card (cable) present indicator | Pluggable cable MSAs (OSFP, OSFP-XD, etc.) already include ModPrsL functionality | N/A |




# **AECs: PCIe vs. Ethernet**

Two main differences:




**Protocol complexity**: PCIe's backwards compatibility and link training requirements make AECs more complex for PCIe than for Ethernet

**Interoperability**: The variety of device types and ecosystem players is significantly greater for PCIe than for Ethernet








[Figure: Device A / Device B link partners for PCI Express (and CXL) versus Ethernet. Device types shown: CPU, Switch, NIC, Accelerator/FPGA/GPU, Memory Controller, Storage.]




# **PCIe Cabling Form Factor Comparison**


|                                      | OSFP-XD<br>Under consideration<br>for Optical | CDFP (x16)<br>CopprLink     | OSFP<br>Under consideration<br>for Optical | QSFP-DD             | QSFP              |
|--------------------------------------|-----------------------------------------------|-----------------------------|--------------------------------------------|---------------------|-------------------|
| High-speed lane count (full duplex)  | 16                                            | 16                          | 8                                          | 8                   | 4                 |
| X-Y PCB Size (normalized to x16)     | 2292 mm²                                      | 1460 mm²                    | 3989 mm²                                   | 2472 mm²            | 3933 mm²          |
| Connector Contact Pitch              | 0.60 mm                                       | 0.75 mm                     | 0.60 mm                                    | 0.80 mm             | 0.80 mm           |
| Cable Gauge Supported                | 26-32 AWG                                     | 28-32 AWG                   | 26-32 AWG                                  | 27-32 AWG           | 26-32 AWG         |
| 32 GT/s Max DAC reach (at max gauge) | 4 m                                           | 3.0 m                       | 4 m                                        | 3.5 m               | 4 m               |
| 32 GT/s Max AEC reach (at max gauge) | 7 m                                           | 5.5 m                       | 7 m                                        | 6 m                 | 7 m               |
| 64 GT/s Max DAC reach (at max gauge) | 3 m                                           | 2.5 m                       | 3 m                                        | 2.5 m               | 3 m               |
| 64 GT/s Max AEC reach (at max gauge) | 6 m                                           | 5 m                         | 6 m                                        | 5 m                 | 6 m               |
| Active Copper cable                  | Yes                                           | No                          | Yes                                        | Yes                 | Yes               |
| Active Optical cable                 | Yes                                           | No                          | Yes                                        | Yes                 | Yes               |
| Power Supply Pins                    | 8 x 2.5 A @ 3.3 V                             | 1 x 1.5 A @ 12 V +<br>1 x 1.5 A @ 3.3 V | 4 x 2.5 A @ 3.3 V              | 6 x 1.5 A @ 3.3 V   | 3 x 1.0 A @ 3.3 V |
| Power Capability per Lane            | 66 W / 16 =<br>4.125 W                        | 23 W / 16 =<br>1.44 W       | 33 W / 8 =<br>4.125 W                      | 30 W / 8 =<br>3.75 W | 10 W / 4 =<br>2.5 W |
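The per-lane power figures follow directly from the supply-pin ratings in the table. A minimal Python sketch of that arithmetic is shown here; the pin counts, currents, and voltages are copied from the table, while the names `POWER_PINS`, `LANES`, `total_power_w`, and `power_per_lane_w` are illustrative and do not come from any specification.

```python
# Supply-pin ratings per connector as (pin_count, amps_per_pin, volts),
# copied from the table. The dictionary and function names are illustrative.
POWER_PINS = {
    "OSFP-XD": [(8, 2.5, 3.3)],
    "CDFP": [(1, 1.5, 12.0), (1, 1.5, 3.3)],
    "OSFP": [(4, 2.5, 3.3)],
    "QSFP-DD": [(6, 1.5, 3.3)],
    "QSFP": [(3, 1.0, 3.3)],
}

# High-speed lane count (full duplex) per connector, from the table.
LANES = {"OSFP-XD": 16, "CDFP": 16, "OSFP": 8, "QSFP-DD": 8, "QSFP": 4}


def total_power_w(form_factor):
    """Sum pin_count * current * voltage over every supply rail."""
    return sum(n * amps * volts for n, amps, volts in POWER_PINS[form_factor])


def power_per_lane_w(form_factor):
    """Total connector power divided by the number of high-speed lanes."""
    return total_power_w(form_factor) / LANES[form_factor]


if __name__ == "__main__":
    for name in POWER_PINS:
        print(f"{name}: {total_power_w(name):.2f} W total, "
              f"{power_per_lane_w(name):.3f} W per lane")
```

The table appears to round each total to whole watts before dividing, so the unrounded per-lane values for CDFP, QSFP-DD, and QSFP come out slightly lower than the tabulated ones.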


#### Assumptions:

- Twinax losses: 28/27/26 AWG = 4.3/4.0/3.6 dB/m + 10% at 16 GHz
- AEC: Retimer silicon to cable pads: 4 dB @ 16 GHz
- DAC: Retimer silicon (behind cage) to passive DAC cable pads: 9.5 dB @ 16 GHz

Reference: https://drive.google.com/file/d/12Z5TklgkzESbf4fZzj7WQBi3y4oto-gB/view
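The reach figures can be sanity-checked against these loss assumptions. The Python sketch below is a rough estimate and not necessarily the method used to build the table. It assumes that the "+ 10%" is a margin applied to the per-meter twinax loss, that the silicon-to-cable-pad loss is incurred once at each end of the cable, and that the total is compared against a 36 dB end-to-end budget at 16 GHz. The budget value and all function names are assumptions introduced here for illustration.

```python
# Twinax loss in dB/m at 16 GHz by wire gauge, from the assumptions list.
TWINAX_DB_PER_M = {28: 4.3, 27: 4.0, 26: 3.6}

# The "+ 10%" term, assumed here to scale the per-meter loss.
MARGIN = 0.10

# Loss from retimer silicon to the cable pads in dB at 16 GHz, from the list.
PAD_LOSS_DB = {"AEC": 4.0, "DAC": 9.5}


def channel_loss_db(kind, awg, length_m):
    """Estimated loss of one cable segment; pad loss is assumed at both ends."""
    per_m = TWINAX_DB_PER_M[awg] * (1.0 + MARGIN)
    return per_m * length_m + 2 * PAD_LOSS_DB[kind]


def max_reach_m(kind, awg, budget_db):
    """Longest cable whose estimated loss stays within budget_db."""
    per_m = TWINAX_DB_PER_M[awg] * (1.0 + MARGIN)
    return max(0.0, (budget_db - 2 * PAD_LOSS_DB[kind]) / per_m)


if __name__ == "__main__":
    BUDGET_DB = 36.0  # assumed loss budget; not stated in the source
    for kind in ("DAC", "AEC"):
        for awg in (26, 27, 28):
            reach = max_reach_m(kind, awg, BUDGET_DB)
            print(f"{kind} {awg} AWG: {reach:.2f} m")
```

Under these assumptions, 26 AWG works out to about 4.3 m for a DAC and about 7.1 m for an AEC, in line with the 4 m and 7 m entries listed for 32 GT/s at maximum gauge.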


# **CopprLink and Active External Cables**

- SFF-TA-1032 (CDFP) uses two physical paddle cards inside a cable assembly
- Rx and Tx both terminating in the Retimer is necessary for:
  - Equalization Phase 2/3 training
  - In-band lane margining
- This presents a significant challenge: **How can you connect Tx and Rx signals from separate paddle cards into a Retimer component?**

[Figure: Upper paddle card pinout and lower paddle card pinout]






# **Wrap Up**


- Evolving AI and disaggregated compute system topologies require more external cabling
  - Reach requirements vary from 2 m (within the rack) to 7 m (rack to rack) and beyond (larger clusters)
- Retimer-based AEC and optical solutions enable reach extension while presenting an easy-to-design-to PCIe compliance point to the host/device
- Implementing PCIe AEC and optical solutions involves higher design complexity in terms of protocol and interoperability than Ethernet
  - OSFP-XD/OSFP represent an attractive option for PCIe/CXL x16/x8 applications, allowing for passive DAC, AEC, and optical solutions
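To make the reach tiers concrete, here is a small Python sketch that picks a cable type for a given link distance. It uses only the OSFP-XD/OSFP reach values from the form factor comparison table (4 m DAC and 7 m AEC at 32 GT/s, 3 m DAC and 6 m AEC at 64 GT/s). The function name, the return strings, and the rule of preferring the passive option whenever it reaches are assumptions made for illustration, not a recommendation from the deck.

```python
# Maximum reach in meters at max gauge for OSFP-XD/OSFP, copied from the
# form factor comparison table: data rate in GT/s -> (DAC reach, AEC reach).
OSFP_REACH_M = {
    32: (4.0, 7.0),
    64: (3.0, 6.0),
}


def pick_cable(rate_gts, distance_m):
    """Return the simplest cable type whose listed reach covers distance_m.

    Preferring passive DAC over AEC when both reach is an assumption
    of this sketch.
    """
    dac_reach, aec_reach = OSFP_REACH_M[rate_gts]
    if distance_m <= dac_reach:
        return "DAC"
    if distance_m <= aec_reach:
        return "AEC"
    return "beyond copper reach"


if __name__ == "__main__":
    # 2 m within the rack, 7 m rack to rack, 10 m for a larger cluster.
    for rate in (32, 64):
        for distance in (2.0, 7.0, 10.0):
            print(f"{rate} GT/s, {distance} m: {pick_cable(rate, distance)}")
```

With these table values, a 2 m in-rack link fits a passive DAC at either data rate, while a 7 m rack-to-rack link fits an AEC at 32 GT/s but exceeds the listed copper reach at 64 GT/s.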



# FMS

# Thank You



Check us out on

www.asteralabs.com