## **Nvidia Hopper Architecture**



### Manuel Ujaldón

Full Professor in Computer Architecture @ University of Málaga

DLI Ambassador @ Nvidia Corporation





## **Contents**

I. Introduction. [6 slides]
II. Hardware design. [7]
III. Major features. [6]
IV. Performance, scalability, connectivity. [6]
V. Products, market segments, roadmap. [12]
VI. Nvidia AI Platform. [6]



# I. Introduction



#### Explore the latest technologies and business breakthroughs.

Learn from experts how AI and the evolution of the 3D Internet are profoundly impacting industries—and society as a whole.

Join us for the online conference September 19-22, 2022 and be part of what comes next.

## **GPUs are everywhere**



Manuel Ujaldón - Univ. of Málaga



# The top layer consists of domain-specific libraries

- RTX: Ray-tracing.
- HPC: High Performance Computing.
- RAPIDS: Data analytics.
- AI: Artificial Intelligence.
- CLARA: Health care and life sciences.
- METROPOLIS: Video analytics and streaming signal AI platform.
- DRIVE: Autonomous vehicles.
- ISAAC: Robotics.
- AERIAL 5G: 5G virtual RAN processing.

# Overview of CUDA hardware generations

[Figure: timeline of CUDA hardware generations, with the years 2008 to 2022 on the horizontal axis; the vertical axis label is not recoverable. Labels recoverable from the chart: CUDA; Fermi; Kepler (dynamic parallelism); Maxwell (unified memory, DX12); Pascal (memory, NVLink, fp16); Volta (Tensor cores); Turing (RT cores, int8, int4); Ampere.]

# Comparing the GPU and the CPU: Two methods for building supercomputers



#### Manuel Ujaldón - DLI University Ambassador

# The CUDA hardware: SIMD processors structured, a tale of hardware scalability

## A GPU consists of:

- N multiprocessors (or SMs), each containing M cores (or stream processors).

## Heterogeneous computing:

- GPU: Data intensive. Fine-grain parallelism.
- CPU: Control/management. Coarse grain parallelism.

|                        | G80<br>(Tesla) | GF100<br>(Fermi) | GK110<br>(Kepler) | GM200<br>(Maxwell) | GP100<br>(Pascal) | GV100<br>(Volta) | TU102<br>(Turing) | A100<br>(Ampere) | H100<br>(Hopper) |
|------------------------|----------------|------------------|-------------------|--------------------|-------------------|------------------|-------------------|------------------|------------------|
| Time frame             | 2006-09        | 2010-11          | 2012-13           | 2014-15            | 2016-17           | 2018-20          | 2019-20           | 2020-22          | 2022-?           |
| N (multiprocessors)    | 16-30          | 14-16            | 13-15             | 4-24               | 56                | 80               | 72                | 108              | 132              |
| M (fp32 cores/multip.) | 8              | 32               | 192               | 128                | 64                | 64               | 64                | 64               | 128              |
| # cores                | 128-240        | 448-512          | 2496-2880         | 512-3072           | 3584              | 5120             | 4608              | 6912             | 16896            |
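
As a quick check on the table above, its last row is the product of the two rows before it (N multiprocessors times M cores each). The sketch below recomputes that product for the single-valued columns; the dictionary and function names are illustrative only.

```python
# (N multiprocessors, M fp32 cores per multiprocessor), copied from the table.
GPUS = {
    "GP100 (Pascal)": (56, 64),
    "GV100 (Volta)": (80, 64),
    "TU102 (Turing)": (72, 64),
    "A100 (Ampere)": (108, 64),
    "H100 (Hopper)": (132, 128),
}


def total_cores(n_multiprocessors, cores_per_multiprocessor):
    """Total fp32 cores of a GPU: N * M."""
    return n_multiprocessors * cores_per_multiprocessor


# Recompute the "# cores" row for every GPU listed above.
CORES = {name: total_cores(n, m) for name, (n, m) in GPUS.items()}

for name, cores in CORES.items():
    print(f"{name}: {cores} cores")
```

Hopper's jump in core count comes from both factors at once: more multiprocessors than Ampere and twice as many fp32 cores in each.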







## The new generations (2016-2022)

| Generation                 | Pascal             | Pascal                           | Pascal               | Volta                 | Turing               | Ampere | Ampere | Hopper | Hopper |
|----------------------------|--------------------|----------------------------------|----------------------|-----------------------|----------------------|--------|--------|--------|--------|
| Architecture               | GP104<br>(GTX1080) | GP100<br>(Titan X, Tesla P100)   | GP102<br>(Tesla P40) | GV100<br>(Tesla V100) | TU102<br>(Titan RTX) | A100   | GA100  | H100   | GH100  |
| Time frame                 | 2016               | 2017                             | 2017                 | 2018                  | 2019                 | 2020   | 2020   | 2022   | 2022   |
| CUDA Compute<br>Capability | 6.1                | 6.0                              | 6.1                  | 7.0                   | 7.5                  | 8.0    | 8.x    | 9.0    | 9.x    |
| N (multiprocs.)            | 40                 | 56                               | 60                   | 80                    | 72                   | 108    | 128    | 114    | 132    |
| M (cores/multip.)          | 64                 | 64                               | 64                   | 64                    | 64                   | 64     | 64     | 128    | 128    |
| Number of cores            | 2560               | 3584                             | 3840                 | 5120                  | 4608                 | 6912   | 8192   | 14592  | 16896  |



## II. Hardware design



# Next wave of AI requires performance and scalability



- MEGAMOLBART: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/clara/models/megamolbart
- SegFormer: https://arxiv.org/abs/2105.15203
- Decision Transformer: https://arxiv.org/pdf/2106.01345.pdf
- SuperGLUE: https://super.gluebenchmark.com/leaderboard

Exploding Computational Requirements, source: NVIDIA Analysis and https://github.com/amirgholami/ai\_and\_memory\_wall




## The printed circuit board for Hopper





## The GH100 GPU with 144 SMs and 6 HBM3 stacks





## GH100 streaming multiprocessor

[Figure: block diagram of the GH100 streaming multiprocessor. Structure recoverable from the diagram:]

- One L1 instruction cache shared by the whole SM.
- Four processing blocks, each with:
  - An L0 instruction cache.
  - A warp scheduler (32 thread/clk).
  - A dispatch unit (32 thread/clk).
  - A register file (16,384 x 32-bit).
  - Rows of INT32, FP32 and FP64 units (one INT32, two FP32 and one FP64 unit per row).
  - One 4<sup>th</sup> generation Tensor Core.
  - LD/ST units and an SFU.
- A Tensor Memory Accelerator.
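
The register file label in the diagram, "16,384 x 32-bit", can be turned into a capacity with simple arithmetic. The sketch below does so, assuming the four processing blocks visible in the diagram each hold one such register file; the constant names are illustrative.

```python
REGISTERS_PER_BLOCK = 16_384  # from the label "Register File (16,384 x 32-bit)"
BITS_PER_REGISTER = 32
BLOCKS_PER_SM = 4  # assumption: the four blocks drawn in the SM diagram


def register_file_bytes(registers, bits_per_register):
    """Capacity in bytes of a register file with the given geometry."""
    return registers * bits_per_register // 8


PER_BLOCK_KIB = register_file_bytes(REGISTERS_PER_BLOCK, BITS_PER_REGISTER) // 1024
PER_SM_KIB = PER_BLOCK_KIB * BLOCKS_PER_SM

print(f"Register file per block: {PER_BLOCK_KIB} KiB")
print(f"Register file per SM:    {PER_SM_KIB} KiB")
```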



# Computational and memory resources in the last four flagship GPUs

|                           | Tesla V100<br>(Volta) | Titan RTX<br>(Turing) | A100 (Ampere) | H100 *<br>(Hopper) |
|---------------------------|-----------------------|-----------------------|---------------|--------------------|
| GPU (chip)                | GV100                 | TU102                 | GA100         | GH100              |
| fp32 cores                | 5120                  | 4608                  | 6912          | 16896              |
| fp64 cores                | 2560                  | 144                   | 4096          | 8448               |
| Frequency (base-boost)    | 1370-1455 MHz         | 1440-1770 MHz         | 1410 MHz      | n/a                |
| TFLOPS (fp16, fp32, fp64) | 30, 15, 7.5           | 32.6, 16.3, 0.51      | 78, 19.5, 9.7 | 120, 60, 30        |
| Memory interface          | HBM2 4096 bits        | GDDR6 384 bits        | HBM2 5 stacks | HBM3 5 stacks      |
| Memory bandwidth          | 900 GB/s              | 672 GB/s              | 1555 GB/s     | 3000 GB/s          |
| Video memory              | 16 or 32 GB           | 24 GB                 | 48 GB         | 80 GB              |
| L2 cache                  | 6 MB                  | 6 MB                  | 40 MB         | 50 MB              |
| Shared memory per multip. | Up to 96 KB           | Up to 64 KB           | Up to 164 KB  | Up to 228 KB       |

(\*) Preliminary specifications for H100 are based on current expectations and are subject to change in the shipping product.
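
The fp32 TFLOPS figures in the table are consistent with the usual peak-rate estimate: cores times clock times two, where the factor of two assumes one fused multiply-add per core per cycle. The sketch below applies that estimate to the boost clocks listed above. The function names are illustrative, and the H100 clock it prints is only what the preliminary 60 TFLOPS figure would imply, not a published specification.

```python
def peak_tflops(cores, clock_mhz, flops_per_cycle=2):
    """Peak TFLOPS, assuming one fused multiply-add (2 FLOPs) per core per cycle."""
    return cores * flops_per_cycle * clock_mhz / 1e6


def implied_clock_mhz(tflops, cores, flops_per_cycle=2):
    """Clock (MHz) that a given peak TFLOPS figure would imply under the same assumption."""
    return tflops * 1e6 / (cores * flops_per_cycle)


# fp32 cores and boost clocks taken from the table.
V100_FP32 = peak_tflops(5120, 1455)
TITAN_RTX_FP32 = peak_tflops(4608, 1770)
A100_FP32 = peak_tflops(6912, 1410)

# The table lists no H100 frequency; this is an inference from 60 TFLOPS fp32.
H100_IMPLIED_MHZ = implied_clock_mhz(60, 16896)

print(f"V100:      {V100_FP32:.1f} TFLOPS fp32")
print(f"Titan RTX: {TITAN_RTX_FP32:.1f} TFLOPS fp32")
print(f"A100:      {A100_FP32:.1f} TFLOPS fp32")
print(f"H100 implied clock: about {H100_IMPLIED_MHZ:.0f} MHz")
```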




## Matrix operations implemented in hardware





# Thread block clusters: A new layer in the memory hierarchy











# III. Major features

# Hopper: The new engine for AI infrastructure.



Custom 4N TSMC process, 80 billion transistors



Transformer engine



4<sup>th</sup> generation NVLink



Confidential computing







**DPX** instructions



# Transformer engine: Tensor cores optimized for transformer models

- Nvidia-tuned adaptive range optimization across 16-bit and 8-bit math.
- Configurable macro blocks deliver performance without accuracy loss.
- 6x faster training and inference of transformer models.
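
One common way to fit values into a narrow 8-bit floating-point range is to rescale each tensor by its largest magnitude before casting. The sketch below illustrates only that scaling step, assuming the FP8 E4M3 format, whose largest finite value is 448. It is not Nvidia's Transformer Engine implementation, it does not perform the actual 8-bit rounding, and the function name is hypothetical.

```python
E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def scale_to_fp8_range(values):
    """Rescale values so the largest magnitude maps to E4M3_MAX.

    Returns the scaled values and the scale factor; dividing by the
    factor undoes the scaling after the low-precision computation.
    """
    amax = max((abs(v) for v in values), default=0.0)
    scale = 1.0 if amax == 0.0 else E4M3_MAX / amax
    return [v * scale for v in values], scale


scaled, scale = scale_to_fp8_range([0.5, -2.0, 1.0])
print(scale)   # factor applied before casting to 8 bits
print(scaled)  # values now span the full representable range
```

Recomputing the factor as the data changes is what keeps small activations from underflowing and large ones from saturating when the format has so few exponent bits.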





# Confidential computing: Secure data and AI models in-use



# Multi-Instance GPU (MIG): 7 secure tenants within a single GPU

### 7 fully isolated and secured instances, QoS guaranteed.











# DPX: New instructions for accelerating dynamic programming algorithms
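
Smith-Waterman, the genome sequencing benchmark cited later in this deck, is a typical dynamic programming algorithm: every cell of a score table is the maximum of a few sums built from neighbouring cells. The sketch below is a plain Python version of that recurrence, meant only to show the kind of inner loop involved. It does not use DPX instructions, and the scoring parameters are illustrative defaults.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the score table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Three candidate sums, then a max clamped at zero.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap
            left = curr[j - 1] + gap
            curr[j] = max(0, diag, up, left)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "GATTACA"))
print(smith_waterman_score("ACGT", "ACT"))
```

The add-then-max pattern in the inner loop runs once per table cell, so its cost dominates the whole algorithm.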






Optimization







Graph analytics



### **Real-time performance**





## IV. Performance, scalability, connectivity



## Substantial acceleration in all areas

## H100 speed-up vs. A100 on multiple GPUs:



Projected performance subject to change

# H100 brings order-of-magnitude leap in performance

### Performance and scalability for next-generation breakthroughs:





- Input sequence length = 128, output sequence length = 20.
- A100 cluster: HDR IB network.
- H100 cluster: NDR IB network for the 16 H100 configuration.
- 16 A100 vs. 8 H100 for 2 sec.
- 32 A100 vs. 16 H100 for 1 and 1.5 sec.

3D FFT (4K^3):

- A100 cluster: HDR IB network.
- H100 cluster: NVLink Switch System, NDR IB.

Genome sequencing (Smith-Waterman):

- 1 A100.
- 1 H100.

Projected performance subject to change



## Speed-up breakout vs. A100




# Memory bandwidth improvement since the adoption of High Bandwidth Memory





# Unprecedented AI and HPC performance, scalability and connectivity

### Peak performance:

| Data type     | H100 (NVLink) | H100 (PCI-e) | Speed-up vs. A100, including sparsity (NVLink) | Speed-up vs. A100, including sparsity (PCI-e) |
|---------------|---------------|--------------|------------------------------------------------|-----------------------------------------------|
| fp8           | 4 PFLOPS      | 3.2 PFLOPS   | 6x                                             | 5x                                            |
| fp16 (half)   | 2 PFLOPS      | 1.6 PFLOPS   | 3x                                             | 2.5x                                          |
| fp32 (float)  | 1 PFLOPS      | 0.8 PFLOPS   | 3x                                             | 2.5x                                          |
| fp64 (double) | 60 TFLOPS     | 48 TFLOPS    | 3x                                             | 2.5x                                          |

|           | HBM3 memory<br>(NVLink) | HBM2e memory<br>(PCI-e) |
|-----------|-------------------------|-------------------------|
| Size      | 80 GB                   | 80 GB                   |
| Bandwidth | 3 TB/s (1.5x vs. A100)  | 2 TB/s                  |

### Scalability:

- NVLink Switch: Up to 256 GPUs (from NVSwitch@DGX).
- NVLink Bridge: Up to 2 GPUs (for PCI-e).

### Connectivity:

- GPU to GPU:
  - 900 GB/s (4<sup>th</sup> gen. NVLink).
  - 600 GB/s (5<sup>th</sup> gen. PCI-e).
- GPU to CPU:
  - 128 GB/s (5<sup>th</sup> gen. PCI-e).
- Form factors:
  - NVLink.
  - PCI-express.





## NVLink switch system

## High-performance 4<sup>th</sup> generation NVLink network for up to 256 GPUs.



#### 4th GEN NVLink

- 900 GB/s from 18 bidirectional ports at 25 GB/s per direction each.
- GPU-2-GPU connectivity across nodes.
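
The 900 GB/s total above only adds up if the 25 GB/s per port is counted in each direction. The sketch below spells out that arithmetic and compares the result with the 128 GB/s GPU-to-CPU PCI-e figure quoted earlier; the names are illustrative.

```python
NVLINK_PORTS = 18
GB_PER_S_PER_DIRECTION = 25  # assumption: the per-port rate is per direction
DIRECTIONS = 2               # each port is bidirectional
PCIE_GEN5_GB_PER_S = 128     # GPU-to-CPU figure quoted in this deck


def nvlink_total_gb_per_s(ports, per_direction, directions=DIRECTIONS):
    """Aggregate NVLink bandwidth of one GPU across all its ports."""
    return ports * per_direction * directions


TOTAL = nvlink_total_gb_per_s(NVLINK_PORTS, GB_PER_S_PER_DIRECTION)
RATIO_VS_PCIE = TOTAL / PCIE_GEN5_GB_PER_S

print(f"NVLink total: {TOTAL} GB/s")
print(f"Ratio vs. PCI-e Gen5: {RATIO_VS_PCIE:.1f}x")
```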

#### 3rd GEN NVSwitch (from DGXs)

- All-to-all NVLink switching for 8-256 GPUs.
- Accelerates collectives with multicast and SHARP.

#### **NVLink Switch**

- 128-port cross-connect based on NVSwitch.

### Representative hardware: H100 cluster (1 scalable unit)

- Servers: 32.
- NVLink switches: 18.
- NVLink optical cables: 1152.
- All-to-all bandwidth: 57.6 TB/s.
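
A few ratios follow directly from the figures listed for one scalable unit, taking the 256-GPU maximum quoted for the NVLink Switch System. The sketch below computes them; they are plain averages over the listed totals and say nothing about the actual cabling topology.

```python
GPUS = 256                     # maximum quoted for the NVLink Switch System
SERVERS = 32
NVLINK_SWITCHES = 18
OPTICAL_CABLES = 1152
ALL_TO_ALL_GB_PER_S = 57_600   # 57.6 TB/s expressed in GB/s

GPUS_PER_SERVER = GPUS // SERVERS
CABLES_PER_SERVER = OPTICAL_CABLES // SERVERS
CABLES_PER_SWITCH = OPTICAL_CABLES // NVLINK_SWITCHES
ALL_TO_ALL_PER_GPU = ALL_TO_ALL_GB_PER_S / GPUS

print(f"GPUs per server:            {GPUS_PER_SERVER}")
print(f"Optical cables per server:  {CABLES_PER_SERVER}")
print(f"Optical cables per switch:  {CABLES_PER_SWITCH}")
print(f"All-to-all share per GPU:   {ALL_TO_ALL_PER_GPU} GB/s")
```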



## V. Products, market segments, roadmap



## HGX-H100

#### HIGHEST PERFORMANCE FOR AI AND HPC

- 4-way / 8-way H100 GPUs.
- In-network SHARP compute.
- With sparsity: 32 PetaFLOPS (FP8), 3.6 TFLOPS (FP16).

NVIDIA Certified HPC Offering from All Makers

#### FASTEST, SCALABLE INTERCONNECT

- 4th Gen NVLINK with 3X faster All-Reduce communications vs. previous generation.
- 3.6 TB/s bisection bandwidth.
- NVLINK Switch System option scales up to 256 GPUs.

#### SECURE COMPUTING

First HGX system with Confidential Computing.







## H100 PCI-express

#### HIGHEST AI AND HPC MAINSTREAM PERFORMANCE

- 3.2 PF fp8 (5x).
- 1.6 PF fp16 (2.5x).
- 800 TF TF32 (2.5x).
- 48 TF fp64 (2.5x).
- x-factors vs. A100 and including sparsity.
- 2 TB/s, 80 GB HBM2e memory.

#### HIGHEST COMPUTE ENERGY EFFICIENCY

- Configurable TDP: 150 W to 350 W.
- 2-slot FHFL mainstream form factor.

#### HIGHEST PERFORMING SERVER CONNECTIVITY

- 128 GB/s PCIe Gen5.
- 600 GB/s GPU-2-GPU connectivity (5X PCIe Gen5) for up to 2 GPUs with NVLink Bridge.





## H100 CNX converged accelerator

# Delivering high-speed GPU-network I/O to mainstream servers



350W | 80GB | 400 Gb/s Ethernet or InfiniBand | PCIe Gen 5 | 2-Slot FHFL | NVLink

#### MULTI-NODE TRAINING



High performance and scalability

#### 5G AI / PROCESSING



5G processing and AI services on a single commodity server



## H100 CNX

## The mainstream choice for Multi-GPU



#### PCIe Gen4 Mainstream Server

- Throughput limited by Gen4 and CPU processing bottlenecks
- Reduced CPU performance from managing data transfers

System configuration: 2U, 2S 64C CPU, 1024GB RAM, 2TB SSD, ConnectX-7 Dx NIC on H100 PCIe config



#### PCIe Gen4 Mainstream Server with H100 CNX

- Gen5 GPUDirect between network and H100 delivers 2X higher throughput
- CPU performance increases
- Scalable multi-node GPU processing



## Highest performance training with H100

#### 4x higher performance over A100



Server cost shown for representation purposes only. Please contact your OEM/ODM for actual costs.

System configuration (training): HGX A100 8-way | HGX H100 8-way, excluding the NVLink Switch System.



#### Optimal compute for large inference deployments



Server cost shown for representation purposes only. Please contact your OEM/ODM for actual costs.

System configuration: 2U, 2S 64C CPU, 1024GB RAM, 2TB SSD, ConnectX-7 Dx NIC.

3-year hosting cost: \$150/kW/month.
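
The hosting rate in the footnote can be turned into a cost once a power draw is fixed. The sketch below reads the rate as 150 dollars per kW per month over 36 months. The 350 W input is only an illustration taken from the maximum TDP quoted for the H100 PCIe card, not a measured server power, and the function name is hypothetical.

```python
RATE_USD_PER_KW_MONTH = 150  # from the footnote, reading "m" as month
MONTHS = 36                  # 3 years


def hosting_cost_usd(power_watts, rate=RATE_USD_PER_KW_MONTH, months=MONTHS):
    """Hosting cost in dollars for a constant power draw given in watts."""
    return power_watts * rate * months / 1000


# Illustrative inputs only: one 350 W device and a hypothetical 1 kW server.
print(hosting_cost_usd(350))
print(hosting_cost_usd(1000))
```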



## **Portfolio availability**

#### Hopper coming soon





## Choose the right H100 GPU

Workloads: Training, Inference, HPC, Data Analytics.

| GPU       | Availability | Highest performance | Mainstream servers: multi-node jobs | Mainstream servers: single-node jobs |
|-----------|--------------|---------------------|-------------------------------------|--------------------------------------|
| H100 SXM  | Q3 '22       | DGX HGX 4-Way/8-Way |                                     |                                      |
| H100 CNX  | Q4 '22       |                     | HGX CNX                             | HGX CNX                              |
| H100 PCIe | Q3 '22       |                     |                                     | HGX PCIE                             |

**Price-performance** comparison relative within each column



## Data center GPU comparison

1. Coming soon.
2. Supported on <u>Azure NVIDIA A100</u> with reduced performance compared to A100 without Confidential Computing or H100 with Confidential Computing.
3. All Tensor Core numbers with sparsity.

|                                              | H100 SXM | H100 PCIe | A100 SXM | A100 PCIe | A30 | A2 | T4 | A40 | A10 | A16 |
|----------------------------------------------|----------|-----------|----------|-----------|-----|----|----|-----|-----|-----|
| Design                                       | Highest<br>Big NLP, | Highest<br>Big NLP, | High Perf<br>Compute | High Perf<br>Compute | Mainstream<br>Compute | Entry-Level<br>Small Footprint | Small Footprint<br>Datacenter Inference | High Perf<br>Graphics | Mainstream Graphics &<br>Video with AI | High Density<br>Virtual Desktop |
| Form Factor                                  | SXM5 | x16 PCIe Gen5<br>2 Slot FHFL<br>3 NVLink Bridge | SXM4 | x16 PCIe Gen4<br>2 Slot FHFL<br>3 NVLink Bridge | x16 PCIe Gen4<br>2 Slot FHFL<br>1 NVLink Bridge | x8 PCIe Gen4<br>1 Slot LP | x16 PCIe Gen3<br>1 Slot LP | x16 PCIe Gen4<br>2 Slot FHFL<br>1 NVLink Bridge | x16 PCIe Gen4<br>1 Slot LP | x16 PCIe Gen4<br>2 Slot FHFL |
| Max Power                                    | 700W | 350W | 500W | 300W | 165W | 40-60W | 70W | 300W | 150W | 250W |
| FP64 TC / FP32 TFLOPS<sup>3</sup>            | 60 / 60 | 48 / 48 | 19.5 / 19.5 | 19.5 / 19.5 | 10 / 10 | NA / 4.5 | NA / 8 | NA / 37 | NA / 31 | NA / 4x4.5 |
| TF32 TC / FP16 TC TFLOPS<sup>3</sup>         | 1000 / 2000 | 800 / 1600 | 312 / 624 | 312 / 624 | 165 / 330 | 18 / 36 | NA / 65 | 150 / 300 | 125 / 250 | 4x18 / 4x36 |
| FP8 TC / INT8 TC TFLOPS/TOPS<sup>3</sup>     | 4000 / 4000 | 4000 / 4000 | NA / 1248 | NA / 1248 | NA / 661 | NA / 72 | NA / 130 | NA / 600 | NA / 500 | NA / 4x72 |
| GPU Memory / Speed                           | 80GB HBM3 | 80GB HBM2e | 80GB HBM2e | 80GB HBM2e | 24GB HBM2 | 16GB GDDR6 | 16GB GDDR6 | 48GB GDDR6 | 24GB GDDR6 | 4x 16GB GDDR6 |
| Multi-Instance GPU (MIG)                     | Up to 7 | Up to 7 | Up to 7 | Up to 7 | Up to 4 | - | - | - | - | - |
| NVLink Connectivity                          | Up to 256 | 2 | Up to 8 | 2 | 2 | - | - | 2 | - | - |
| Media Acceleration                           | 7 JPEG Decoder<br>7 Video Decoder | 7 JPEG Decoder<br>7 Video Decoder | JPEG Decoder<br>Video Decoder | JPEG Decoder<br>Video Decoder | 1 JPEG Decoder<br>4 Video Decoder | 1 Video Encoder<br>2 Video Decoder<br>(+AV1 decode) | 1 Video Encoder<br>2 Video Decoder | Video Encoder<br>2 Video Decoder<br>(+AV1 decode) | Video Encoder<br>2 Video Decoder<br>(+AV1 decode) | 4 Video Encoder<br>8 Video Decoder<br>(+AV1 decode) |
| Ray Tracing                                  | - | - | - | - | - | Yes | Yes | Yes | Yes | Yes |
| Transformer Engine                           | Yes | Yes | - | - | - | - | - | - | - | - |
| DPX Instructions                             | Yes | Yes | - | - | - | - | - | - | - | - |
| Graphics                                     | For in-situ visualization<br>(no NVIDIA vPC or RTX vWS) | For in-situ visualization<br>(no NVIDIA vPC or RTX vWS) | For in-situ visualization<br>(no NVIDIA vPC or RTX vWS) | For in-situ visualization<br>(no NVIDIA vPC or RTX vWS) | For in-situ visualization<br>(no NVIDIA vPC or RTX vWS) | Good | Good | Best | Better | Good |
| vGPU                                         | Yes<sup>1</sup> | Yes<sup>1</sup> | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Hardware Root of Trust                       | Yes | Yes | Optional | Optional | Optional | Optional | - | Optional | Optional | Optional |
| Confidential Computing                       | Yes | Yes | (2) | (2) | - | - | - | - | - | - |
| Server Availability                          | Q3'22 | Q3'22 | In Production | In Production | In Production | In Production | In Production | In Production | In Production | In Production |
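The Max Power and FP16 Tensor Core rows of the comparison table can be combined into a rough efficiency figure. The sketch below copies a few values from the table (Tensor Core numbers include sparsity); using peak TFLOPS per watt of maximum board power as the metric is an assumption of this example, not something the slide defines.

```python
# (FP16 Tensor Core TFLOPS with sparsity, max power in W), copied from the table.
GPUS = {
    "H100 SXM": (2000, 700),
    "H100 PCIe": (1600, 350),
    "A100 SXM": (624, 500),
    "A100 PCIe": (624, 300),
    "A30": (330, 165),
}

def tflops_per_watt(name):
    """Peak FP16 TC TFLOPS divided by maximum board power (illustrative metric)."""
    tflops, watts = GPUS[name]
    return tflops / watts

# List the GPUs from most to least efficient under this metric.
for name in sorted(GPUS, key=tflops_per_watt, reverse=True):
    print(f"{name:10s} {tflops_per_watt(name):.2f} TFLOPS/W")
```

Under this metric the H100 PCIe card ranks first among the listed parts, in line with the "highest compute energy efficiency" claim made for it earlier in the deck.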



## Choose the right data center GPU

| Group                                 | GPU  | Availability | DL Training & DA | DL Inference | HPC/AI         | Render Farms | Virtual Workstation | Virtual Desktop (VDI) | Mainstream Acceleration | Far Edge Acceleration | AI-on-5G |
|---------------------------------------|------|--------------|------------------|--------------|----------------|--------------|---------------------|-----------------------|-------------------------|-----------------------|----------|
| Compute                               | H100 | Q3 '22       | SXM PCIE CNX     | SXM PCIE CNX | SXM PCIE CNX   |              |                     |                       | PCIE CNX                |                       | CNX      |
| Compute                               | A100 | Now          | SXM PCIE A100X   | SXM PCIE     | SXM PCIE A100X |              |                     |                       | PCIE A100X              |                       | A100X    |
| Compute                               | A30  | Now          |                  | PCIE         | PCIE           |              |                     |                       | PCIE                    |                       | A30X     |
| Graphics / Compute                    | A40  | Now          |                  |              |                |              |                     |                       |                         |                       |          |
| Graphics / Compute                    | A10  | Now          |                  | •            |                | •            |                     |                       |                         |                       |          |
| Graphics / Compute                    | A16  | Now          |                  |              |                |              |                     |                       |                         |                       |          |
| Small Form Factor Compute/Graphics    | A2   | Now          |                  |              |                |              |                     |                       |                         |                       |          |
| Small Form Factor Compute/Graphics    | T4   | Now          |                  | •            |                |              | •                   |                       |                         | •                     |          |

- Good / Better: performance comparison within each product group (Compute, Graphics - Compute, SFF Compute & Graphics) and workload column.
- SXM: SXM form factor.
- CNX: H100 + ConnectX7 Converged PCIe card.
- A100X / A30X: A100 or A30 + BlueField2 Converged PCIe card.


## Delivering the AI center of excellence for enterprise

#### Best of breed infrastructure for AI development built on DGX

#### NVIDIA DGX H100



The World's First AI System with NVIDIA H100

8x NVIDIA H100 | 32 PFLOPS FP8 (6X) | 0.5 PFLOPS FP64 (3X) | 640 GB HBM3 | 3.6 TB/s (1.5X) BISECTION B/W
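The DGX H100 headline figures follow from multiplying the per-GPU H100 SXM values in the data center GPU comparison table by eight. A minimal sketch; the dictionary and variable names are illustrative only:

```python
GPUS_PER_DGX = 8

# Per-GPU H100 SXM figures from the comparison table (Tensor Core, with sparsity).
H100_SXM = {"fp8_tflops": 4000, "fp64_tc_tflops": 60, "hbm3_gb": 80}

dgx_fp8_pflops = GPUS_PER_DGX * H100_SXM["fp8_tflops"] / 1000
dgx_fp64_pflops = GPUS_PER_DGX * H100_SXM["fp64_tc_tflops"] / 1000
dgx_hbm3_gb = GPUS_PER_DGX * H100_SXM["hbm3_gb"]

print(f"FP8:  {dgx_fp8_pflops:.0f} PFLOPS")    # slide quotes 32 PFLOPS
print(f"FP64: {dgx_fp64_pflops:.2f} PFLOPS")   # slide rounds this to 0.5 PFLOPS
print(f"HBM3: {dgx_hbm3_gb} GB")               # slide quotes 640 GB
```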

4<sup>th</sup> Generation of the World's Most Successful Platform Purpose-Built for Enterprise AI

#### COMING LATE 2022

### DGX SuperPOD with DGX H100



32 DGX H100 | 1 EFLOPS AI | NVLINK SWITCH SYSTEM | QUANTUM-2 IB | 20TB HBM3 | 70 TB/s BISECTION B/W (11X)

1 ExaFLOPS of AI Performance in 32 Nodes. Scale as large as needed in 32 node increments.

X-Factors compare performance over DGX SuperPOD with DGX A100 supercomputer configuration with same number of nodes
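Scaling in 32-node increments can be sketched as a small helper. It assumes that the "1 EFLOPS AI" figure refers to FP8 throughput with sparsity (32 PFLOPS per DGX H100) and that capacity adds up linearly; the function name `superpod` is hypothetical.

```python
DGX_PER_INCREMENT = 32   # nodes per SuperPOD increment, from the slide
GPUS_PER_DGX = 8
DGX_FP8_PFLOPS = 32      # per DGX H100, from the DGX H100 slide
DGX_HBM3_GB = 640        # per DGX H100, from the DGX H100 slide

def superpod(increments):
    """Aggregate capacity of `increments` 32-node blocks, assuming linear scaling."""
    nodes = increments * DGX_PER_INCREMENT
    return {
        "nodes": nodes,
        "gpus": nodes * GPUS_PER_DGX,
        "fp8_eflops": nodes * DGX_FP8_PFLOPS / 1000,
        "hbm3_tb": nodes * DGX_HBM3_GB / 1000,
    }

print(superpod(1))
```

One increment gives 256 GPUs, roughly 1 EFLOPS and about 20 TB of HBM3, which matches the figures on the slide.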



## Announcing Nvidia EOS Supercomputer

#### The world's most advanced AI infrastructure

| NVIDIA Eos                                   |                                                   |     |
|----------------------------------------------|---------------------------------------------------|-----|
| DGX SuperPOD Powered by 576 DGX H100 Systems | 500 Quantum-2 IB Switches, 360 NVLink Switches    |     |
| FP8                                          | 18 EFLOPS                                         | 6X  |
| FP16                                         | 9 EFLOPS                                          | 3X  |
| FP64                                         | 275 PFLOPS                                        | 3X  |
| In-Network Compute                           | 3.7 PFLOPS                                        | 36X |
| Bisection Bandwidth                          | 230 TB/s                                          | 2X  |
| NVLINK Domain                                | 256 GPUs                                          | 32X |

Blueprint for OEM and Cloud Partner Offerings

Cloud Native | Performance Isolation | Multi-Tenant

X-Factors compare performance over DGX A100 SuperPOD based supercomputer configuration with same number of Nodes
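The Eos totals can be approximated from its 576 DGX H100 systems and the per-GPU H100 SXM figures in the data center GPU comparison table. This is a back-of-the-envelope sketch that assumes peak throughput adds up linearly across GPUs; variable names are illustrative.

```python
EOS_DGX_SYSTEMS = 576
GPUS_PER_DGX = 8

# Per-GPU H100 SXM Tensor Core peaks in TFLOPS (with sparsity), from the comparison table.
FP8_TFLOPS, FP16_TFLOPS, FP64_TFLOPS = 4000, 2000, 60

eos_gpus = EOS_DGX_SYSTEMS * GPUS_PER_DGX
eos_fp8_eflops = eos_gpus * FP8_TFLOPS / 1_000_000
eos_fp16_eflops = eos_gpus * FP16_TFLOPS / 1_000_000
eos_fp64_pflops = eos_gpus * FP64_TFLOPS / 1000

# The NVLink domain of 256 GPUs corresponds to 32 DGX nodes.
nvlink_domain_nodes = 256 // GPUS_PER_DGX

print(f"{eos_gpus} GPUs")
print(f"FP8  {eos_fp8_eflops:.1f} EFLOPS")
print(f"FP16 {eos_fp16_eflops:.1f} EFLOPS")
print(f"FP64 {eos_fp64_pflops:.0f} PFLOPS")
print(f"NVLink domain spans {nvlink_domain_nodes} nodes")
```

The results land close to the slide's 18 EFLOPS, 9 EFLOPS and 275 PFLOPS; the small differences are consistent with rounding on the slide.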



## VI. Nvidia AI Platform



## Accelerating the Next Wave of AI: AI Platform Updates





### TAO

#### Framework for creating custom, production-ready models to power speech and vision AI applications.



#### https://developer.nvidia.com/tao



## Triton

# Open-source inference serving software for fast, easy inference deployment.





## **Nemo Megatron**

#### Accelerated framework for training large language models.





### **Riva 2.0**

#### World-class speech AI.

- Fully customizable.
- Supported with Riva Enterprise.





#### The hub of GPU-optimized software



#### **1.5M+ users, millions of downloads**

\*8x NVIDIA A100 40GB. NVIDIA DGX. ResNet-50. Mixed Precision. 256 batch size.



## Thanks for your attention!

You can always reach me in Spain at the Computer Architecture Department of the University of Malaga:

- E-mail: <u>ujaldon@uma.es</u>
- Web page: <u>http://manuel.ujaldon.es</u> (English/Spanish versions available).

