# PDSC2: Introduction to High Bandwidth Memory

Marc Greenberg, Principal/CEO Marc Greenberg Consulting LLC

marc@marcgreenberg.com



#### Abstract

In this Professional Development Series session, you'll learn about key aspects of High Bandwidth Memory (HBM): What is HBM, a short history of HBM, why is HBM important right now, how Large Language Models (LLMs) and Generative AI are driving demand for HBM technology, comparison of HBM with other popular memory types (DDR, LPDDR and GDDR), a high level view of HBM architecture, PCB and package requirements to implement chips deploying HBM, a view of the market for HBM and the chips that use it, and a review of public information on the future development of high bandwidth memories.



#### Brief Bio and Disclosures

- Past Endeavors:
  - Group Director, Product Marketing, Cadence, 2017-2023
  - Director, Product Marketing, Synopsys, 2012-2017
  - Director, Product Management, Cadence, 2010-2012
  - Director, Technical Marketing, Denali Software, Inc., 2003-2010
  - Various roles including IP Procurement Manager, Motorola Semiconductor (SPS), 1993-2003
- Current Endeavors:
  - VP Product, Cassia.ai
  - Director, Strategic Alliances, Blue Cheetah
  - Vice-Chair, non-public JEDEC Task Group
  - Consultant, The Six Semiconductor
  - Leading an undisclosed AI project as part of Marc Greenberg Consulting LLC
  - Occasional advisor to investment firms etc
  - I own stock in Cadence and Cassia.ai, and I own Marc Greenberg Consulting LLC

All material presented here is my own opinion and does not necessarily represent the position or opinion of any of my clients or any third party





#### Why are we here?





# AI



©2024 Conference Concepts, Inc. All Rights Reserved

#### Where we were vs where we're going

Please confirm your identity

Photo 3 of 5

2011 Facebook facial recognition attempt







2024 OpenAI Sora show reel

https://youtu.be/2fAPgOCjToA?si=YtHCU1bytGEKhdwe


#### What is HBM used for?

- L4 cache (picture at right)
- Streaming buffer in networking applications



AI Accelerator






Source: Bill Gervasi, an hour or two ago



#### Introduction to AI chips (more to come later)

- The chips that operate most of the big server-based AI hardware are giant math machines
- Parallelism, specialization and a shift in the types of AI that they run has enabled very large computing machines to be developed

https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

https://resources.nvidia.com/en-us-blackwell-architecture?ncid=no-ncid

https://www.flickr.com/photos/130561288@N04/albums/72177720295479734/with/51867067870





NVIDIA P100 (2016): 16nm, 60 SM units, quad HBM2, 21 TFLOPs (FP16), 700 W TDP

NVIDIA Blackwell (2024): 4nm, 144 SM units, 576 Tensor Cores, dual-quad HBM3E, 2500 TFLOPs (FP16), 700 W TDP



#### The compute demand for AI is insatiable





https://epochai.org/blog/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year https://epochai.org/blog/trends-in-machine-learning-hardware

#### And it's expensive



Figure (Epoch AI): Amortized hardware and energy cost to train frontier AI models over time. Cost (2023 USD, log scale) versus publication date, 2016 to 2024, with a regression mean of 2.4x/year and its 95% CI. Models plotted include GNMT, AlphaGo Master, AlphaGo Zero, AlphaZero, GPT-3 175B (davinci), DALL-E, PaLM (540B), GPT-4, Inflection-2 and Gemini 1.0 Ultra.

Sam Altman (@sama):

> we will have to monetize it somehow at some point; the compute costs are eye-watering

11:38 PM · Dec 4, 2022



Tony Chan Carusone, Alphawave, Chiplet Summit Keynote 2024, https://epochai.org/blog/how-much-does-it-cost-to-train-frontier-ai-models x.com

> ©2024 Marc Greenberg Consulting, LLC All Rights Reserved


# But memory bandwidth and capacity are not keeping up

| Category                        | Specification and unit                        | Growth rate (doubling time)    | Datapoint of highest performance           | N  |
|---------------------------------|-----------------------------------------------|--------------------------------|--------------------------------------------|----|
| Computational performance       | FLOP/s (FP32)                                 | 2x every 2.3 [2.1; 2.6] years  | ~90 TFLOP/s (NVIDIA L40)                   | 45 |
| Computational performance       | FLOP/s (tensor-FP32)                          | NA<sup>1</sup>                 | ~495 TFLOP/s (NVIDIA H100 SXM)             | 7  |
| Computational performance       | FLOP/s (tensor-FP16)                          | NA                             | ~990 TFLOP/s (NVIDIA H100 SXM)             | 8  |
| Computational performance       | OP/s (INT8)                                   | NA                             | ~1980 TOP/s (NVIDIA H100 SXM)              | 10 |
| Computational price-performance | FLOP per \$ (FP32)                            | 2x every 2.1 [1.6; 2.91] years | ~4.2 exaFLOP per \$ (AMD Radeon RX 7900 XTX) | 33 |
| Computational energy-efficiency | FLOP/s per Watt (FP32)                        | 2x every 3.0 [2.7; 3.3] years  | ~302 GFLOP/s per W (NVIDIA L40)            | 43 |
| Memory capacity                 | DRAM capacity (Byte)                          | 2x every 4 [3; 6] years        | ~128 GB (AMD Radeon Instinct MI250X)       | 47 |
| Memory bandwidth                | DRAM bandwidth in Byte/s                      | 2x every 4 [3; 5] years        | ~3.3 TB/s (NVIDIA H100 SXM)                | 47 |
| Interconnect bandwidth          | Chip-to-chip communication bandwidth (Byte/s) | NA                             | ~900 GB/s (NVIDIA H100)                    | 45 |

Slide annotations: "Note this is a GDDR6-based card." and "GPGPU vs Tensor is important."

Table 1: Key performance trends. All estimates are computed only for ML hardware. Numbers in brackets refer to the [5; 95]-th percentile estimate from bootstrapping with 1000 samples. OOM refers to order of magnitude, and N refers to the number of observations in our dataset. Note that performance figures are for dense matrix multiplication performance.

https://epochai.org/blog/trends-in-machine-learning-hardware
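
The gap between the compute and memory rows of Table 1 can be made concrete by converting doubling times into growth over a fixed period. This is a minimal sketch that assumes smooth exponential growth; the function name and the ten-year horizon are illustrative choices, not taken from the source.

```python
def growth_factor(doubling_time_years, horizon_years):
    """Growth multiple over `horizon_years` for a quantity that doubles
    every `doubling_time_years` (assumes smooth exponential growth)."""
    return 2.0 ** (horizon_years / doubling_time_years)


# Central estimates of the doubling times listed in Table 1.
COMPUTE_FP32_DOUBLING = 2.3  # years, FLOP/s (FP32)
MEMORY_BW_DOUBLING = 4.0     # years, DRAM bandwidth

HORIZON = 10.0  # years; an arbitrary horizon chosen for illustration

compute_growth = growth_factor(COMPUTE_FP32_DOUBLING, HORIZON)
bandwidth_growth = growth_factor(MEMORY_BW_DOUBLING, HORIZON)
gap = compute_growth / bandwidth_growth

print(f"compute grows   {compute_growth:.1f}x in {HORIZON:.0f} years")
print(f"bandwidth grows {bandwidth_growth:.1f}x in {HORIZON:.0f} years")
print(f"compute outgrows bandwidth by {gap:.1f}x")
```

Under these assumptions, compute grows by roughly 20x over ten years while DRAM bandwidth grows by less than 6x.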

#### Amdahl's law

#### Definition

Amdahl's law can be formulated in the following way:<sup>[3]</sup>

$$S_\text{latency}(s) = \frac{1}{(1-p)+\frac{p}{s}}$$

where

- $S_\text{latency}$ is the theoretical speedup of the execution of the whole task;
- $s$ is the speedup of the part of the task that benefits from improved system resources;
- $p$ is the proportion of execution time that the part benefiting from improved resources originally occupied.

Furthermore,

$$\begin{cases} S_\text{latency}(s) \le \dfrac{1}{1-p} \\[1ex] \lim\limits_{s \to \infty} S_\text{latency}(s) = \dfrac{1}{1-p} \end{cases}$$

shows that the theoretical speedup of the execution of the whole task increases with the improvement of the resources of the system and that regardless of the magnitude of the improvement, the theoretical speedup is always limited by the part of the task that cannot benefit from the improvement.

Amdahl's law applies only to the cases where the problem size is fixed. In practice, as more computing resources become available, they tend to get used on larger problems (larger datasets), and the time spent in the parallelizable part often grows much faster than the inherently serial work. In this case, Gustafson's law gives a less pessimistic and more realistic assessment of the parallel performance.<sup>[4]</sup>



https://en.wikipedia.org/wiki/Amdahl%27s\_law

Translation: If you increase the compute without increasing the memory bandwidth, then the theoretical speedup will be limited by the memory bandwidth.
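
A short numeric illustration of that translation, as a sketch only. The 50/50 split between compute time and memory-bound time is a hypothetical workload, not a figure from the source; the function names are illustrative.

```python
import math


def amdahl_speedup(p, s):
    """Overall speedup when a fraction `p` of run time is sped up by `s`."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


def amdahl_limit(p):
    """Upper bound on the speedup as s grows without limit."""
    return math.inf if p == 1.0 else 1.0 / (1.0 - p)


# Hypothetical workload: half the time is math, half is waiting on memory.
COMPUTE_FRACTION = 0.5

for compute_speedup in (2, 10, 100):
    overall = amdahl_speedup(COMPUTE_FRACTION, compute_speedup)
    print(f"{compute_speedup:>3}x faster math -> {overall:.2f}x faster overall")

print(f"limit with infinitely fast math: {amdahl_limit(COMPUTE_FRACTION):.1f}x")
```

With this assumed split, even 100x faster math yields less than a 2x overall gain, because the memory-bound half of the run time is untouched.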

#### Generalized Neuron Behavior



Neural Network (section of larger network)





# Systolic Arrays: The efficient heart of a TPU

- "Systolic array" is the general term for the technique used in almost all tensor processing units
  - Google "TPU"
  - NVIDIA "Tensor Core"
    - contained within "Streaming Multiprocessors" (SMs)
  - AMD CDNA "Matrix Cores"
    - contained within "Accelerator Complex Dies" (XCDs)
  - Tenstorrent "Tensix" Cores

FCCM'96 -- IEEE Symposium on FPGAs for Custom Computing Machines April 17-19, 1996, Napa, CA





Systolic arrays are not new... here's one I worked on 30 years ago.

The name "Systolic Array" was coined in 1979 but a WWII code-breaking machine used the same technique



#### Generalized Neural Network Behavior

- Arrange all the inputs and weights into a matrix, then multiply and accumulate the results using a systolic array
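
The arrangement described above can be sketched in code. This is a schedule-level simulation that assumes an output-stationary array, where the processing element at position (i, j) accumulates one output and sees operand pair k at cycle i + j + k. It does not model any particular vendor's hardware, and the names `systolic_matmul`, `inputs` and `weights` are illustrative.

```python
def systolic_matmul(a, b):
    """Multiply a (rows x inner) by b (inner x cols) using the timing of an
    output-stationary systolic array. Returns (result, cycles)."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")

    acc = [[0] * cols for _ in range(rows)]  # one accumulator per PE
    cycles = rows + cols + inner - 2         # time for the last operands to arrive

    for t in range(cycles):
        for i in range(rows):
            for j in range(cols):
                k = t - i - j  # operand pair reaching PE (i, j) this cycle
                if 0 <= k < inner:
                    acc[i][j] += a[i][k] * b[k][j]  # multiply-accumulate
    return acc, cycles


# Two input vectors (one per row) feeding a layer of two neurons.
inputs = [[1, 2, 3],
          [4, 5, 6]]
weights = [[1, 0],
           [0, 1],
           [1, 1]]

outputs, cycles = systolic_matmul(inputs, weights)
print(outputs, "in", cycles, "cycles")
```

Each processing element only ever performs a multiply-accumulate on operands handed to it by its neighbours, which is why the structure maps so efficiently onto silicon.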



#### Scalar, Vector, Matrix, Tensor









- Scalar: 0-way tensor (Character)
- Vector: 1-way tensor (Word)
- Matrix: 2-way tensor (Page)
- 5-way tensor (Bookcase)
- 6-way tensor (Library)
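
In code, the "n-way" count on this slide corresponds to how deeply the data is nested. A minimal sketch using plain Python lists; the helper name `tensor_order` is illustrative, and it assumes regular, non-empty nesting.

```python
def tensor_order(value):
    """Count the nesting depth of a regularly nested list (0 for a scalar)."""
    order = 0
    while isinstance(value, list):
        if not value:
            raise ValueError("empty dimension")
        value = value[0]  # descend one level along the first element
        order += 1
    return order


scalar = 7                          # 0-way
vector = [1, 2, 3]                  # 1-way
matrix = [[1, 2], [3, 4]]           # 2-way
cube = [[[1, 2]], [[3, 4]]]         # 3-way

for name, value in [("scalar", scalar), ("vector", vector),
                    ("matrix", matrix), ("cube", cube)]:
    print(f"{name}: {tensor_order(value)}-way tensor")
```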

#### How matrix/tensor math is done by CPU





vector

tensor

#### Tensor math



#### Matrix multiplication: Tiled, B transposed







Totals: mem:2307 cache hits:2080≅90%



https://youtu.be/aMvCEEBIBto?si=2oIAEufVXcVh8Kc1
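
The loop structure behind a tiled multiplication with B transposed can be sketched as follows. This shows only the access order: Python lists do not model a hardware cache, so the cache-hit figure on the slide is not reproduced here. The function names and the tile size are illustrative assumptions.

```python
def naive_matmul(a, b):
    """Reference triple loop; walks b column-wise (poor locality)."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def tiled_matmul_bt(a, b, tile=2):
    """Tiled multiply using a transposed copy of b, so that both operands
    are read along rows inside the innermost loop."""
    n, k, m = len(a), len(b), len(b[0])
    bt = [list(col) for col in zip(*b)]  # bt[j][p] == b[p][j]
    c = [[0] * m for _ in range(n)]

    for i0 in range(0, n, tile):              # tile of output rows
        for j0 in range(0, m, tile):          # tile of output columns
            for p0 in range(0, k, tile):      # tile of the shared dimension
                for i in range(i0, min(i0 + tile, n)):
                    row_a = a[i]
                    for j in range(j0, min(j0 + tile, m)):
                        row_bt = bt[j]
                        partial = 0
                        for p in range(p0, min(p0 + tile, k)):
                            partial += row_a[p] * row_bt[p]
                        c[i][j] += partial
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(tiled_matmul_bt(a, b) == naive_matmul(a, b))
```

Working on small tiles keeps the same rows of A and of the transposed B in use for several consecutive multiply-accumulates, which is the property that raises the cache hit rate on real hardware.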



#### Vector math by GPU, Tensor math by TPU

Pascal = GPGPU doing vector math

Volta = GPGPU+Tensor unit, tensor unit doing math

Later animation is effect of multi-precision (for inference)





#### Tensor Math (again)





https://www.youtube.com/watch?v=yyR0ZoCeB08

#### More TPU every generation





Figure 5. H100 FP16 Tensor Core has 3x throughput compared to A100 FP16 Tensor Core https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/

### The magic of the TPU

Table 4 compares key features of TPU v3 and TPU v4. Manufactured in 7 nm instead of 16 nm, TPU v4 has twice the matrix multipliers (enabled by the increased process density) and an 11% faster clock—this drives the 2.2X gain in peak performance. About 40% of the performance/Watt improvement was from technology and the rest was from design improvements (e.g., balancing the pipeline, implementing clock gating). The HBM memory bandwidth is 1.3x higher.





https://arxiv.org/pdf/2304.01433 source: Google

 Table 4: TPU v4 and TPU v3 [26] features. Measured power is for the ASIC and HBM running production applications.

|                                | Google TPUv4                                      | TPUv3                            |
|--------------------------------|---------------------------------------------------|----------------------------------|
| Production deployment          | 2020                                              | 2018                             |
| Peak TFLOPS                    | 275 (bf16 or int8)                                | 123 (bf16)                       |
| Clock Rate                     | 1050 MHz                                          | 940 MHz                          |
| Tech. node, Die size           | 7 nm, <600 mm2                                    | 16 nm, < 700 mm2                 |
| Transistor count               | 22 billion                                        | 10 billion                       |
| Chips per CPU host             | 4                                                 | 8                                |
| TDP                            | N.A.                                              | N.A.                             |
| Idle, min/mean/max<br>power    | 90, 121/170/192 W                                 | 123, 175/220/262 W               |
| Inter Chip Interconnect        | 6 links @ 50 GB/s                                 | 4 links @ 70 GB/s                |
| Largest scale<br>configuration | 4096 chips                                        | 1024 chips                       |
| Processor Style                | Single Instruction<br>2D Data                     | Single Instruction<br>2D Data    |
| Processors / Chip              | 2                                                 | 2                                |
| Threads / Core                 | 1                                                 | 1                                |
| SparseCores / Chip             | 4                                                 | 2                                |
| On Chip Memory                 | 128 (CMEM) +<br>32 MiB (VMEM) +<br>10 MiB (spMEM) | 32 MiB (VMEM) +<br>5 MiB (spMEM) |
| Register File Size             | 0.25 MiB                                          | 0.25 MiB                         |
| HBM2 capacity, BW              | 32 GiB, 1200 GB/s                                 | 32 GiB, 900 GB/s                 |

Figure 9 below shows performance of an internal production recommendation model (DLRM0, see Sections 7.8 and 7.9) across the two TPU generations for 128 chips. The standalone CPU configuration has 576 Skylake sockets (400 for learners and 176 for variable servers). The bottom two bars show TPU v4 without SC, where the embeddings are placed in CPU memory. The "Emb on CPU" bar places embeddings in CPU host memory and the "Emb on Variable Server" bar places embeddings on 64 external variable servers. TPU v3 is faster than CPUs by 9.8x. TPU v4 beats TPU v3 by 3.1x and CPUs by 30.1x. When embeddings are placed in CPU memory for TPU v4, performance drops by 5x-7x, with bottlenecks due to CPU memory bandwidth.
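
The 2.2X peak-performance claim can be checked against the Table 4 values with simple arithmetic. A minimal sketch: the numbers are copied from Table 4 and the quoted text, while the variable and function names are illustrative.

```python
# Values copied from Table 4 (TPU v4 vs TPU v3).
V3 = {"clock_mhz": 940, "peak_tflops": 123, "hbm_gb_s": 900}
V4 = {"clock_mhz": 1050, "peak_tflops": 275, "hbm_gb_s": 1200}
MXU_RATIO = 2.0  # "twice the matrix multipliers", per the quoted text

clock_ratio = V4["clock_mhz"] / V3["clock_mhz"]
predicted_peak_ratio = MXU_RATIO * clock_ratio
reported_peak_ratio = V4["peak_tflops"] / V3["peak_tflops"]
hbm_ratio = V4["hbm_gb_s"] / V3["hbm_gb_s"]


def flops_per_byte(chip):
    """Peak FLOPs available per byte of HBM bandwidth."""
    return chip["peak_tflops"] * 1e12 / (chip["hbm_gb_s"] * 1e9)


print(f"clock ratio:          {clock_ratio:.3f}")
print(f"predicted peak ratio: {predicted_peak_ratio:.2f}")
print(f"reported peak ratio:  {reported_peak_ratio:.2f}")
print(f"HBM bandwidth ratio:  {hbm_ratio:.2f}")
print(f"FLOPs per HBM byte:   v3 {flops_per_byte(V3):.0f}, v4 {flops_per_byte(V4):.0f}")
```

Peak compute grew by about 2.2x while HBM bandwidth grew by about 1.3x, so each byte fetched from HBM has to feed more arithmetic in the newer chip.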

#### A rapid shift in capability (NVIDIA 2022)



| Accelerator Model                             | K80                                      | P100      | P100              | P100             | V100       | V100     | A100      | A100     | A100     | H100      | H100       |                                                |
|-----------------------------------------------|------------------------------------------|-----------|-------------------|------------------|------------|----------|-----------|----------|----------|-----------|------------|------------------------------------------------|
| GPU                                           | 2 * GK210B                               | GP100     | GP100             | GP100            | GV100      | GV100    | GA100     | GA100    | GA100    | GH100     | GH100      |                                                |
| Bus                                           | PCI-E 3.0                                | PCI-E 3.0 | PCI-E 3.0         | SXM              | PCI-E 3.0  | SXM2/3   | PCI-E 4.0 | SXM4     | SXM4     | PCI-E 5.0 | SXM5       |                                                |
| GDDR5 or GDDR6/HBM2 Memory                    | 24 GB                                    | 12 GB     | 16 GB             | 16 GB            | 16/32 GB   | 16/32 GB | 40 GB     | 40 GB    | 80 GB    | 80 GB     | 80 GB      |                                                |
| Performance / Watt                            |                                          |           |                   |                  |            |          |           |          |          |           |            |                                                |
| FP8 Efficiency (Gigaops/Watt)                 | -                                        | -         | -                 | -                | -          | -        | -         | -        | -        | 9,142.9   | 5,714.3    |                                                |
| INT8 Efficiency (Gigaops/Watt)                | -                                        | -         | -                 | -                | 224.0      | 209.3    | 3,120.0   | 3,120.0  | 3,120.0  | -         | -          |                                                |
| FP16 TC, FP32 ACC Efficiency (Gigaflops/Watt) | -                                        | -         | -                 | -                | 448.0      | 416.7    | 1,560.0   | 1,560.0  | 1,560.0  | 4,571.4   | 2,857.1    | Multiple X every generation                    |
| FP16 Efficiency (Gigaflops/Watt)              | -                                        | 74.8      | 74.8              | 70.7             | 100.5      | 104.7    | 195.0     | 195.0    | 195.0    | 274.3     | 171.4      |                                                |
| FP32/TF32 Efficiency (Gigaflops/Watt)         | 29.1                                     | 37.2      | 37.2              | 35.3             | 50.2       | 52.3     | 48.8      | 48.8     | 48.8     | 137.1     | 85.7       |                                                |
| FP64 Efficiency (Gigaflops/Watt)              | 9.7                                      | 18.8      | 18.8              | 17.7             | 25.0       | 26.0     | 48.8      | 48.8     | 48.8     | 55.7      | 27.9       |                                                |
| \$ / Performance                              |                                          |           |                   |                  |            |          |           |          |          |           |            |                                                |
| Street Price, Single Unit                     | \$400                                    | \$600     | \$2,200           | \$1,100          | \$7,500    | \$2,500  | \$6,000   | \$12,500 | \$15,000 | \$17,500  | \$19,500   |                                                |
| \$ / FP8 Teraflops                            | -                                        | -         | -                 | -                | -          | -        | -         | -        | -        | \$5.47    | \$4.88     |                                                |
| \$ / INT8 Teraops                             | -                                        | -         | -                 | -                | \$133.93   | \$39.81  | \$4.81    | \$10.02  | \$12.02  | -         | -          |                                                |
| \$ / FP16 TC, FP32 ACC Teraflops              | -                                        | -         | -                 | -                | \$66.96    | \$20.00  | \$9.62    | \$20.03  | \$24.04  | \$10.94   | \$9.75     | TPU with >10x cost efficiency compared to GPU operation |
| \$ / FP16 Teraflops                           | -                                        | \$32.09   | \$117.65          | \$51.89          | \$267.86   | \$79.62  | \$76.92   | \$160.26 | \$192.31 | \$182.29  | \$162.50   |                                                |
| \$ / FP32/TF32 Teraflops                      | \$45.77                                  | \$64.52   | \$236.56          | \$103.77         | \$597.13   | \$159.24 | \$19.23   | \$40.06  | \$48.08  | \$56.09   | \$62.50    |                                                |
| \$ / FP64 Teraflops                           | \$137.46                                 | \$127.66  | \$468.09          | \$207.55         | \$1,201.92 | \$320.51 | \$307.69  | \$641.03 | \$769.23 | \$897.44  | \$1,000.00 |                                                |
| \$ / Performance / Watt                       |                                          |           |                   |                  |            |          |           |          |          |           |            |                                                |
| \$ / FP8 Teraops / Watt                       | -                                        | -         | -                 | -                | -          | -        | -         | -        | -        | \$1.91    | \$3.41     |                                                |
| \$ / INT8 Teraops / Watt                      | -                                        | -         | -                 | -                | \$33.48    | \$11.94  | \$1.92    | \$4.01   | \$4.81   | -         | -          |                                                |
| \$ / FP16 TC, FP32 ACC Teraflops / Watt       | -                                        | -         | -                 | -                | \$16.74    | \$6.00   | \$3.85    | \$8.01   | \$9.62   | \$3.83    | \$6.83     | TPU with ~20X \$/FLOPs/Watt compared to GPU operation |
| \$ / FP16 Teraflops / Watt                    | -                                        | \$8.02    | \$29.41           | \$15.57          | \$74.64    | \$23.89  | \$30.77   | \$64.10  | \$76.92  | \$63.80   | \$113.75   |                                                |
| \$ / FP32/TF32 Teraflops / Watt               | \$13.73                                  | \$16.13   | \$59.14           | \$31.13          | \$149.28   | \$47.77  | \$123.08  | \$256.41 | \$307.69 | \$127.60  | \$227.50   |                                                |
| \$ / FP64 Teraflops / Watt                    | \$41.24                                  | \$31.91   | \$117.02          | \$62.26          | \$300.48   | \$96.15  | \$123.08  | \$256.41 | \$307.69 | \$314.10  | \$700.00   |                                                |



As usual items in **bold** red italics are estimations by *The Next Platform*.

https://www.nextplatform.com/2022/05/09/how-much-of-a-premium-will-nvidia-charge-for-hopper-gpus/

#### And growing fast (Google TPU example)



Figure: block diagrams of TPU v3 and TPU v4, each showing TensorCores (scalar unit, vector unit, matrix multiplication units) attached to High Bandwidth Memory. Slide labels: 32K; 128K Trillium (Next Platform prediction).



https://www.nextplatform.com/wpcontent/uploads/2022/10/google-tpuv4-v3v2-chip-block-diagrams.jpg

| Google TPU Compute Engines        | TPU v1              | TPU v2              | TPU v3              | TPU v4              | TPU v5p             | TPU v6 "Trillium"   | Over TPU v2 | Over TPU v3 | Over TPU v4 | Over TPU v5e |
|-----------------------------------|---------------------|---------------------|---------------------|---------------------|---------------------|---------------------|-------------|-------------|-------------|--------------|
| First Deployed                    | Q2 2015             | Q3 2017             | Q4 2018             | Q4 2021             | Q4 2023             | Q4 2024             |             |             |             |              |
| ML Inference                      | Yes                 | Yes                 | Yes                 | Yes                 | Yes                 | Yes                 |             |             |             |              |
| ML Training                       | No                  | Yes                 | Yes                 | Yes                 | Yes                 | Yes                 |             |             |             |              |
| Chip Process                      | 28 nm               | 16 nm               | 16 nm               | 7 nm                | 5 nm                | 4 nm                |             |             |             |              |
| Transistors                       | 3 B                 | 9 B                 | 10 B                | 31 B                | ???                 | ???                 | 1.11        | 3.10        | ???         | ???          |
| Die Size                          | 330 mm<sup>2</sup>  | 625 mm<sup>2</sup>  | 700 mm<sup>2</sup>  | 780 mm<sup>2</sup>  | 700 mm<sup>2</sup>  | 790 mm<sup>2</sup>  | 1.12        | 1.11        | 0.90        | 2.26         |
| Clock Speed                       | 700 MHz             | 700 MHz             | 940 MHz             | 1,050 MHz           | 2,040 MHz           | 2,060 MHz           | 1.34        | 1.12        | 1.94        | 1.18         |
| TensorCores Per Chip              | 1                   | 2                   | 2                   | 2                   | 2                   | 2                   |             |             |             |              |
| MXU Matrix Size/Core              | 1 * 256x256         | 1 * 128x128         | 2 * 128x128         | 4 * 128x128         | 4 * 128x128         | 4 * 256x256         | 1.00        | 1.00        | 1.00        | 2.00         |
| Dataflow SparseCores              | -                   | -                   | 2                   | 4                   | 4                   | 4                   |             |             |             |              |
| On Chip Cache Memory              | 28 MB               | 32 MB               | 32 MB               | 32 MB               | 48 MB               | ???                 | 1.00        | 1.00        | 1.50        | ???          |
| Off Chip HBM Memory               | 8 GB                | 16 GB               | 32 GB               | 32 GB               | 95 GB               | 32 GB               | 2.00        | 1.00        | 2.97        | 2.00         |
| HBM Memory Bandwidth              | 300 GB/sec          | 700 GB/sec          | 900 GB/sec          | 1,228 GB/sec        | 2,765 GB/sec        | 1,640 GB/sec        | 1.29        | 1.36        | 2.25        | 2.00         |
| INT8 Peak Teraflops               | 92                  | -                   | -                   | 275                 | 918                 | 1,852               | -           | -           | 3.34        | 4.70         |
| BF16 Peak Teraflops               | -                   | 46                  | 123                 | 137.5               | 459                 | 926                 | 2.67        | 1.12        | 3.34        | 4.70         |
| Precision                         | INT8                | BF16                | BF16                | BF16/INT8           | BF16/INT8           | BF16/INT8           |             |             |             |              |
| ICI Links * Speed Gb/sec          | -                   | 4 * 496             | 4 * 656             | 6 * 448             | 6 * 800             | 4 * 800             | 1.32        | 1.02        | 1.79        | 2.00         |
| Interconnect Topology             | -                   | 2D Torus            | 2D Torus            | 3D Torus            | 3D Torus            | 2D Torus            |             |             |             |              |
| Chip Idle Watts                   | 28                  | 53                  | 84                  | 170                 | ???                 | ???                 |             |             |             |              |
| Max Measured Watts                | ???                 | ???                 | 262                 | 192                 | ???                 | ???                 |             |             |             |              |
| Chip TDP Watts                    | 75                  | 280                 | 450                 | 300                 | ???                 | ???                 |             |             |             |              |
| Chips Per CPU Host                | 4                   | 4                   | 4                   | 4                   | 8                   | ???                 |             |             |             |              |
| Max Chips Per Pod                 | -                   | 256                 | 1,024               | 4,096               | 8,960               | 256                 | 4.00        | 4.00        | 2.19        | 1.00         |
| Peak Petaflops Per Pod            | -                   | 12                  | 126                 | 1,126               | 8,225               | 474                 | 10.70       | 8.94        | 7.30        | 4.70         |
| All-Reduce Bandwidth Per Pod      | -                   | 120 TB/sec          | 340 TB/sec          | 1,100 TB/sec        | ???                 | ???                 | 2.83        | 3.24        | ???         | ???          |
| Bisection Bandwidth Per Pod       | -                   | 2 TB/sec            | 6.4 TB/sec          | 24 TB/sec           | ???                 | ???                 | 3.20        | 3.75        | ???         | ???          |
| Pricing Per TPU Chip, Lowest US Region Pricing |        |                     |                     |                     |                     |                     |             |             |             |              |
| Preemptible Spot                  | -                   | \$0.45              | \$0.60              | \$0.97              | \$2.10              | \$1.25              | 1.33        | 1.61        | 2.17        | 2.08         |
| Preemptible Spot For Three Months | -                   | \$972.00            | \$1,296.00          | \$2,086.56          | \$4,536.00          | \$2,700.00          | 1.33        | 1.61        | 2.17        | 2.08         |
| Preemptible Spot Price/Peak Perf  | -                   | \$21.13             | \$10.54             | \$7.59              | \$4.94              | \$1.46              | 0.50        | 0.72        | 0.65        | 0.44         |
| On Demand Per Hour                | -                   | \$1.50              | \$2.00              | \$3.22              | \$4.20              | \$2.50              | 1.33        | 1.61        | 1.30        | 2.08         |
| On Demand For Three Months        | -                   | \$3,240.00          | \$4,320.00          | \$6,955.20          | \$9,072.00          | \$5,400.00          | 1.33        | 1.61        | 1.30        | 2.08         |
| On Demand Price/Peak Perf         | -                   | \$70.43             | \$35.12             | \$25.29             | \$9.88              | \$2.92              | 0.50        | 0.72        | 0.39        | 0.44         |
| 1 Year CUD Per Hour               | -                   | \$0.95              | \$1.26              | \$2.03              | \$2.94              | \$1.75              | 1.33        | 1.61        | 1.45        | 2.08         |
| 1 Year CUD Cost                   | -                   | \$8,283.87          | \$11,045.16         | \$17,782.71         |                     | \$15,340.50         | 1.33        | 1.61        | 1.45        | 2.08         |
| 1 Year CUD Price/Peak Perf        | -                   | \$180.08            | \$89.80             | \$64.66             | \$28.07             | \$8.28              | 0.50        | 0.72        | 0.43        | 0.44         |
| 3 Year CUD Per Hour               | -                   | \$0.68              | \$0.90              | \$1.45              | \$1.89              | \$1.13              | 1.33        | 1.61        | 1.30        | 2.08         |
| 3 Year CUD Cost                   | -                   | \$5,917.05          | \$7,889.40          | \$12,701.93         | \$16,567.74         | \$9,861.75          | 1.33        | 1.61        | 1.30        | 2.08         |
| 3 Year CUD Price/Peak Perf        | -                   | \$128.63            | \$64.14             | \$46.19             | \$18.05             | \$5.32              | 0.50        | 0.72        | 0.39        | 0.44         |

As usual items in bold red italics are estimations by *The Next Platform*. Inference-oriented TPUs removed for clarity

https://www.nextplatform.com/2024/06/10/lots-of-questions-on-googles-trillium-tpu-v6-a-few-answers/

#### Generative AI is iterative

- To make things extra fun, LLMs are iterative
- Demonstrative example:
  - We're going to make a children's story about (subject 1) and (subject 2)
  - Each of you is an LLM tasked with adding one word to the end of the story
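
The word-at-a-time game mirrors how a generative model builds its output: one pass over the model per added word, with the growing story fed back in each time. A minimal Python sketch of that loop follows. The `NEXT_WORD` table and the `next_word` and `generate` functions are hypothetical stand-ins for illustration, not a real LLM.

```python
# Toy stand-in for an LLM: maps the last word of the story to the next word.
# The table and function names are hypothetical, for illustration only.
NEXT_WORD = {
    "<start>": "Once",
    "Once": "upon",
    "upon": "a",
    "a": "time",
    "time": "<end>",
}


def next_word(story):
    """Return one more word given the story so far (one 'model pass')."""
    last = story[-1] if story else "<start>"
    return NEXT_WORD.get(last, "<end>")


def generate(max_words=20):
    """Build the story one word at a time, feeding it back in on every step."""
    story = []
    passes = 0
    while len(story) < max_words:
        word = next_word(story)  # each call stands for a full pass over the model
        passes += 1
        if word == "<end>":
            break
        story.append(word)
    return story, passes


story, passes = generate()
print(" ".join(story), "| model passes:", passes)
```

The number of passes grows with the length of the output, which is why iteration multiplies the memory traffic of a single pass.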





#### Section Summary

- The processors used in AI are powerful math machines
- Matrix Multiply-Accumulate (MAC) is the fastest growing part of most AI / ML Chip Architectures
  - often doubling from generation to generation
- Dramatic increase in math capability drives increase in memory bandwidth demand
- Size of generative AI models and iteration drives increase in memory capacity
- HBM is today's solution for solving memory bandwidth challenges (at a cost, which we'll talk about later)
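
A back-of-envelope sketch of why bandwidth matters for iteration is shown below. It assumes, as a simplification, that generating each token requires streaming all model weights from memory once and that nothing else limits throughput. The model size, bytes per parameter, and bandwidth values are illustrative assumptions, not figures from this session.

```python
def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on tokens/s for one sequence if every token needs
    one full read of the weights (simplifying assumption)."""
    model_size_gb = params_billion * bytes_per_param  # GB, since params are in billions
    return bandwidth_gb_s / model_size_gb


# Hypothetical 70-billion-parameter model stored at 2 bytes per parameter.
for label, bandwidth in [("67 GB/s", 67.0), ("819 GB/s", 819.0)]:
    bound = max_tokens_per_second(70, 2, bandwidth)
    print(f"{label}: at most {bound:.2f} tokens/s")
```

Under these assumptions the bound scales linearly with memory bandwidth, independent of how much math capability the processor has.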

#### Shameless Plug

- I'm also VP Product for Cassia.ai
- We've constructed a TPU that is 33% smaller and improves TOPS/W by 2.5x compared with traditional techniques by using our technology
- This could also mean 2.5x less power or 2.5x more TPU operations within the same power envelope
- Ask me later



# Chiplets




### The problems with really big monolithic chips

- Yield
- Reticle limit
- Thermal
- Scaling of different parts of the chip
- Cost per transistor
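
The yield point can be illustrated with the simple Poisson yield model, Y = exp(-A × D0). This is a rough sketch only: the defect density and die sizes are assumed values, and the comparison ignores packaging yield and the area cost of die-to-die interfaces.

```python
import math


def poisson_yield(area_mm2, defects_per_cm2):
    """Simple Poisson yield model: Y = exp(-area * defect_density)."""
    return math.exp(-(area_mm2 / 100.0) * defects_per_cm2)


def silicon_per_good_product(die_area_mm2, dies_per_product, defects_per_cm2):
    """Wafer area (mm^2) consumed per good product, assuming known-good-die
    testing and ignoring packaging yield and die-to-die interface overhead."""
    return dies_per_product * die_area_mm2 / poisson_yield(die_area_mm2, defects_per_cm2)


D0 = 0.1  # defects per cm^2: an assumed, illustrative value

mono = silicon_per_good_product(800, 1, D0)      # one 800 mm^2 monolithic die
chiplets = silicon_per_good_product(200, 4, D0)  # four 200 mm^2 chiplets

print(f"monolithic yield: {poisson_yield(800, D0):.1%}, area per good product: {mono:.0f} mm^2")
print(f"chiplet yield:    {poisson_yield(200, D0):.1%}, area per good product: {chiplets:.0f} mm^2")
```

With these assumed inputs the four smaller dies consume noticeably less wafer area per good product than the single large die, because a defect scraps only 200 mm² rather than 800 mm².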



https://medium.com/@marcussl.chan/chiplets-why-it-can-solve-the-slowing-of-moores-law-651ed53f413d



#### Chiplet Economy



https://www.club386.com/amd-radeon-rx-7900-xtx-review-rise-of-the-chiplets/

#### Chiplet enables extreme integration





TSMC slide from IEDM conference foresees advancements in packaging technologies. (Image credit: TSMC) Source: https://www.tomshardware.com/tech-industry/manufacturing/tsmc-charts-a-course-to-trillion-transistor-chips-eyes-monolithic-chips-with-200billion-transistors-built-on-1nm-node

#### What is a chiplet?





ODSA Business Overview white paper - https://drive.google.com/file/d/1UmNyyciEF\_OJZZ35HOeL5X3g-gmP33YZ/view

#### What is a chiplet?

#### **Packaging Choices**



- Packaging and D2D interface are connected
  - Bump pitch and escape pitch → via pad, L/S
  - Depths of signal bumps → # of layers
  - Needed insertion loss → dielectric choice
  - Maximum distance between chiplets





ODSA Business Overview white paper - https://drive.google.com/file/d/1UmNyyciEF\_OJZZ35HOeL5X3g-gmP33YZ/view

#### What is a chiplet?

#### **Three Financial Benefits of Chiplets**





ODSA Business Overview white paper - https://drive.google.com/file/d/1UmNyyciEF\_OJZZ35HOeL5X3g-gmP33YZ/view


#### There are many ways of doing chiplets

The Wild West of Semiconductor Packaging: 100s of ways to package devices





John Park, Cadence, EETimes Chiplet Conference 2024

## History of 2.5D/3D Stacked DRAMs

- First standards-based DRAM on Logic Chip: Wioming using 1<sup>st</sup> gen WideIO
- WideIO Goal: Reduce power, increase performance, reduce PCB area
- What actually happened: Package-on-package of LPDDRx



https://www.ieee-edps.com/archives/2012/c/1800greenberg.pdf



Enough Talk! Practical Approaches to 3-D IC – TSV/Silicon Interposer and Wide I/O Implementation from People Who Have Been There and Done That, presented by Frank Lee of TSMC and Marc Greenberg of Cadence Design Systems – at 49<sup>th</sup> DAC, June 2012


#### Wide IO SME architecture overview

- Wide IO Memory Controller (Cadence DENALI)
  - Compliant with DRAM specification for Wide IO from JEDEC (<u>http://www.jedec.org/</u>)
  - High performance, and advanced low-power features
  - First deliveries to 3D-IC Wioming ST-Ericsson/LETI project
- Wide IO PHY Interface
  - 200 MHz, 128-bit, SDR
  - ~1200 TSVs, µbuffers and µbumps
  - Also integrates ESD protections for DRAM
- Specific Design for Wide IO Testability Integration
  - Boundary scan, direct access, stuck-at, memory BIST, PLL test
- Smart Memory Engine
  - Data transfer handling between Wide IO, SRAM and ANoC
  - Integration within ANoC
  - Up to 3.2GB/s data bandwidth

https://www.ieee-edps.com/archives/2012/c/1800greenberg.pdf

RTI Conference - 13th Dec 2011

A Three-Layers 3D-IC Stack including WideIO and 3D NoC







# Why not just put the memory on top?

3D vs. 2D/2.5D Interconnects to Solve AI Bottlenecks



#### **3D Interconnects**

- Highest D2D bandwidth
- XPU uses most advanced/expensive node at highest compute density
- Heat limits compute density under DRAM
  - Wastes valuable XPU silicon area (e.g. 2nm)

### 2D/2.5D Interconnects

- High enough D2D bandwidth for most applications
- Base die uses cheaper N-1/N-2 FinFET node, but advanced enough for most logic functions
- Easier thermal management
  - Best use of valuable XPU silicon area

### 3D & 2D/2.5D all necessary to develop the optimal solutions





# First HBM Chip: AMD Fiji (2014 production)

- 28nm
- 596 mm²
  - 1011 mm² interposer
- 4x HBM (1)
- 512GB/s





### ... and what it took to build it

- ">15 Prototypes over 8.5 years" starting ~2008





https://www.ectc.net/files/66/5/66thECTC\_Panel\_BlackAMD.pdf

## AMD continues in the chiplet direction





https://www.amd.com/en/technologies/cdna.html

# Mix and match, partitioning



https://www.club386.com/amd-instinct-mi300-architecture-speaks-to-massive-ai-performance/



# HBM Isn't easy




(provider) GPUs and HBM3 memory caused half of failures during (user) training — one failure every three hours for (user) 16,384 GPU training cluster

News By Anton Shilov published July 27, 2024

But (user) knows how to mitigate the issues.

https://www.tomshardware.com/

Assumptions:

- 16,384 GPUs
- 6 HBM per GPU
- 8Gbps per pin
- 1024 pins per HBM

Aggregate rate ≈ 8\*10^17 bps; over 10,800 s (three hours) that is 1 error per 8.6\*10^21 bits
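
The arithmetic behind that estimate can be reproduced in a few lines. This sketch uses only the slide's stated assumptions, and like the slide it treats every pin as running at full rate continuously and every failure as a single memory error, so it is an order-of-magnitude figure rather than a measured error rate.

```python
gpus = 16_384
hbm_per_gpu = 6
pins_per_hbm = 1024
bits_per_second_per_pin = 8e9         # 8 Gbps per pin
seconds_between_failures = 3 * 3600   # one failure every three hours = 10,800 s

# Total bits per second crossing all HBM interfaces in the cluster.
aggregate_bps = gpus * hbm_per_gpu * pins_per_hbm * bits_per_second_per_pin

# Bits transferred, on average, between two failures.
bits_per_failure = aggregate_bps * seconds_between_failures

print(f"aggregate HBM rate: {aggregate_bps:.2e} bit/s")
print(f"bits transferred per failure: {bits_per_failure:.2e}")
```

Without intermediate rounding the result is about 8.7\*10^21 bits per failure; rounding the aggregate rate to 8\*10^17 bps first gives the 8.6\*10^21 figure quoted above.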

### Gratuitous Die Photos



Nvidia Blackwell



Intel Ponte Vecchio



AMD EPYC



Intel Sapphire Rapids







# HBM




## HBM Cross-section



### Some disassembled HBM photos









https://www.flickr.com/photos/130561288@N04/albums/72177720295479734/with/52207241684

# AMD MI300 Chiplet stackup





https://www.servethehome.com/amd-instinct-mi300x-gpu-and-mi300a-apus-launched-for-ai-era/amd-instinct-mi300-family-architecture-chip-stack/

## Fundamentals of HBM Operation



Fig. 1: Organization of DRAM (left) and HBM devices (right).

https://www.ece.mcmaster.ca/faculty/hassan/assets/publications/hbm\_iccad2021.pdf



# HBM3 Banks and Channels/Pseudochannels

- 8 DWORD channels for data
  - 128 bits wide, divided into 2 64-bit Pseudochannels
  - Total 1024 Data pins
- AWORD for Command/Address
- BL2
- 64 banks per channel

https://www.2cm.com.tw/2cm/zh-tw/tech/7C42130C5D1645C8884E53E62E27533E

Massively Parallel Memory Architecture



## HBM3 RAS Features

- Parity
- Redundancy and remapping
  - LANE\_REPAIR
  - SOFT\_LANE\_REPAIR
  - HARD\_LANE\_REPAIR
- Loopback
- ECC / On-die ECC (Symbol based SEV signal gives ECC status)
  - Auto ECS
- Test (IEEE 1500)

# Summary Section




# Financial Math

- 2022 cost analysis
- HPC Data center cost HBM vs DDR5
- A word on HBM Economics:
  - If you assemble a module with HBM, the HBM inventory might be yours

Technical Research Study | High-Bandwidth Memory (HBM) Architecture CPU Revolutionizes High-Performance Computing (HPC)

| Node Component                             | 2S Intel® Xeon® Max Series Processor with Zero DIMMs | 2S AMD EPYC <sup>™</sup> 7773X Processor with 32 GB DIMM |
|--------------------------------------------|------------------------------------------------------|----------------------------------------------------------|
| A. Total server (node) cost (from Table 2) | \$31,400                                             | \$24,136                                                 |
| B. Equivalent nodes (from Table 6)         | 33                                                   | 100                                                      |
| C. Equivalent node cost (C = A x B)        | \$1,036,200                                          | \$2,413,600                                              |
|                                            | 57% less                                             | -                                                        |

 Table 7 | Equivalent node costs (rounded to nearest dollar)

This comparison tells us that the cost of a 100-node server cluster powered by Intel Xeon Max Series processors will be 57 percent less than the cost of a 100-node server cluster run by AMD EPYC 7773X processors. However, it will still deliver the same performance. The performance of the Intel Xeon Max Series processors helps significantly lower TCO at scale because so many fewer servers are needed.

The biggest assumption in our analysis was the equivalent nodes of 100 to 33. Even if the equivalent nodes were 100 to 50 or 100 to 75, it would still cost less to run a server cluster powered by Intel Xeon Max Series processors, as compared to a server cluster run by AMD EPYC 7773X processors.
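
The study's Table 7 arithmetic and its sensitivity claim can be checked directly. The sketch below uses only the per-node costs and node counts quoted above; the function and constant names are illustrative.

```python
XEON_MAX_NODE_COST = 31_400  # Table 7, row A
EPYC_NODE_COST = 24_136      # Table 7, row A
EPYC_NODES = 100             # Table 7, row B


def cluster_costs(xeon_max_nodes):
    """Return (Xeon Max cluster cost, EPYC cluster cost, fractional saving)."""
    xeon = XEON_MAX_NODE_COST * xeon_max_nodes
    epyc = EPYC_NODE_COST * EPYC_NODES
    return xeon, epyc, 1 - xeon / epyc


for nodes in (33, 50, 75):
    xeon, epyc, saving = cluster_costs(nodes)
    print(f"{nodes} vs {EPYC_NODES} nodes: ${xeon:,} vs ${epyc:,} ({saving:.0%} less)")
```

At these prices the saving shrinks quickly as the node ratio worsens, and the break-even point is just under 77 Xeon Max nodes per 100 EPYC nodes.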

https://www.prowesscorp.com/wp-content/uploads/2022/12/220126-Intel-HBM-Architecture-CPU-Revolutionizes-HPC-technical-research-study.pdf

## GDDR vs LPDDR vs HBM

| Property                               | HBM4          | HBM3E    | HBM3    | HBM2E   | GDDR7   | GDDR6     | LPDDR5X | DDR5    |
|----------------------------------------|---------------|----------|---------|---------|---------|-----------|---------|---------|
| Max capacity per stack, chip or module | 36-64GB       | 36GB     | 24GB    | 16GB    | 3GB     | 2GB       | 16GB    | 2048GB  |
| Data Transfer Rate                     | 6.4GT/s       | 8.8GT/s  | 6.4GT/s | 3.6GT/s | 32GT/s  | 24GT/s    | 9.6GT/s | 8.4GT/s |
| Max stack                              | 16            | 12       | 12      | 8       | 1*      | 1*        | 8       | 8-16    |
| Interface Width                        | 2048          | 1024     | 1024    | 1024    | 32      | 32        | 32-64   | 64      |
| Signaling                              | Undisclosed   | NRZ      | NRZ     | NRZ     | PAM3    | NRZ       | NRZ     | NRZ     |
| I/O Voltage                            |               | 1.1V     | 1.1V    | 1.2V    | 1.2V    | 1.2-1.35V | 0.4V    | 1.1V    |
| Bandwidth per stack, chip or module    | 1500-2000GB/s | 1200GB/s | 819GB/s | 406GB/s | 128GB/s | 96GB/s    | 77GB/s  | 67GB/s  |



\* Designed for clamshell implementation on PCB generating a virtual 2-stack

Mostly derived from https://www.embedded.com/wp-content/uploads/sites/2/2024/01/memory-bandwidth-table-2023-002.jpg
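
The bandwidth row follows from the interface width and data transfer rate rows: peak bandwidth is width times transfer rate, divided by 8 bits per byte. The sketch below checks that relationship for five of the columns, using the table's own values. The function name is illustrative, and the listed figures for the remaining columns may not match this simple product exactly.

```python
def peak_bandwidth_gb_s(interface_width_bits, transfer_rate_gt_s):
    """Peak bandwidth in GB/s = width (bits) * transfers per second / 8 bits per byte."""
    return interface_width_bits * transfer_rate_gt_s / 8


# (interface width in bits, data rate in GT/s) taken from the table above;
# LPDDR5X uses the 64-bit end of its 32-64 range.
rows = {
    "HBM3": (1024, 6.4),
    "GDDR7": (32, 32),
    "GDDR6": (32, 24),
    "LPDDR5X": (64, 9.6),
    "DDR5": (64, 8.4),
}

for name, (width, rate) in rows.items():
    print(f"{name:8s} {peak_bandwidth_gb_s(width, rate):7.1f} GB/s")
```

The comparison makes the design trade visible: GDDR7 runs each pin five times faster than HBM3, but HBM3's 32x wider interface gives it more than six times the bandwidth per device.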

# GPU memory mix

- Forecast rapid growth of HBM4
- Bit split evenly between GDDR and HBM



(Source: DRAM Market Monitor Q2 2024, Yole Intelligence, June 2024)



© Yole Intelligence 2024

https://www.yolegroup.com/product/monitor/dram-market-monitor/



# Supply Challenges



### SK hynix's high bandwidth memory buffet fully booked till 2025

Micron also riding the AI wave with 128 GB DDR5 RDIMMs

 Dan Robinson Thu 2 May 2024 // 13:31 UTC

Memory chipmaker SK hynix has already sold all the high bandwidth memory (HBM) it will manufacture this year and most of its expected 2025 production, citing increased demand driven by the AI craze. Micron is also getting in on the act with availability of 128 GB DDR5 RDIMMs for servers.

SK hynix told a news conference at its Icheon headquarters in South Korea that it is set to expand its output of memory chips, predicting that global demand is set to increase over the long term thanks to applications such as AI.



### TSMC's Entire CoWoS Supply Reportedly Reserved By NVIDIA & AMD Until 2025



https://wccftech.com/tsmc-entire-cowos-supply-reserved-by-nvidia-amd-until-2025/



### HBM market size

#### 2022-2025 High Bandwidth Memory (HBM) market evolution

(Source: Next-Generation DRAM 2024 – Focus on HBM and 3D DRAM, Yole Intelligence, January 2024)



5% of bits shipped

15-20% of revenue



© Yole Intelligence 2024

# IP Availability

- IP generally available in nodes from 7nm to ≤3nm
- Multiple suppliers
  - IP Vendors
  - ASIC vendors

HBM3E PHY at 8.4Gbps Write Eye Diagram Industry's fastest HBM3E





Example: Cadence, used with permission

### HBM4

- Announced standard changes:
  - 2x channels
  - 24Gb and 32Gb die configurations
  - 4, 8, 12, and 16 high TSV stacks
  - Speed bins up to 6.4Gbps with discussion ongoing for higher frequencies
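
The announced die densities and stack heights imply a range of per-stack capacities. The sketch below simply enumerates the combinations, assuming every density can be paired with every stack height, which the announcement does not state. The names are illustrative.

```python
DIE_DENSITIES_GBIT = (24, 32)   # per-layer densities named in the announcement
STACK_HEIGHTS = (4, 8, 12, 16)  # TSV stack heights named in the announcement


def stack_capacity_gbyte(die_gbit, layers):
    """Capacity of one stack in GB: layers * Gbit per layer / 8 bits per byte."""
    return die_gbit * layers / 8


for die in DIE_DENSITIES_GBIT:
    for layers in STACK_HEIGHTS:
        print(f"{die} Gb x {layers}-high = {stack_capacity_gbyte(die, layers):.0f} GB")
```

The largest combination, 32 Gb layers in a 16-high stack, gives 64 GB per stack.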

### JEDEC Approaches Finalization of HBM4 Standard, Eyes Future Innovations

ARLINGTON, Va., USA – July 10, 2024 – JEDEC Solid State Technology Association, the global leader in the development of standards for the microelectronics industry, today announced it is nearing completion of the next version of its highly anticipated High Bandwidth Memory (HBM) DRAM standard: HBM4. Designed as an evolutionary step beyond the currently published HBM3 standard, HBM4 aims to further enhance data processing rates while maintaining essential features such as higher bandwidth, lower power consumption, and increased capacity per die and/or stack. These advancements are vital for applications that require efficient handling of large datasets and complex calculations, including generative artificial intelligence (AI), high-performance computing, high-end graphics cards, and servers.

HBM4 is set to introduce a doubled channel count per stack compared to HBM3, with a larger physical footprint. To support device compatibility, the standard ensures that a single controller can work with both HBM3 and HBM4 if needed. Different configurations will require various interposers to accommodate the differing footprints. HBM4 will specify 24 Gb and 32 Gb layers, with options for supporting 4-high, 8-high, 12-high and 16-high TSV stacks. The committee has initial agreement on speeds bins up to 6.4 Gbps with discussion ongoing for higher frequencies.

JEDEC encourages companies to join and help shape the future of JEDEC standards. Membership grants access to pre-publication proposals and provides early insights into active projects like HBM4. Discover the benefits of membership and join today.

JEDEC standards are subject to change during and after the development process, including disapproval by the JEDEC Board of Directors.

https://www.jedec.org/news/pressreleases/jedec-approaches-finalization-hbm4-standard-eyes-future-innovations

# What could be in the future: different I/O



https://www.nextplatform.com/2024/03/28/how-to-build-a-better-blackwell-gpu-than-nvidia-did/



## What could be in the future: PIM



https://hc34.hotchips.org/assets/program/posters/hc34.SKhynix.YongkeeKwon.v03.pdf



Figure 8. Quantization on (a) CPU vs. (b) PIM.

https://www.pdl.cmu.edu/PDL-FTP/associated/asplos18-pim-final.pdf



## Summary

- HBM, chiplet, and AI technologies are intrinsically linked
- A robust ecosystem for all the components is available
- A roadmap for higher levels of memory bandwidth is assured
- Questions? marc@marcgreenberg.com

