#### **GPGPU:** Challenges ahead

#### PPAM'15 conference Krakow (Poland). September, 6<sup>th</sup>-9<sup>th</sup>, 2015

#### Manuel Ujaldón

A/Prof. @ Univ. of Malaga (Spain) Conjoint Senior Lecturer @ Univ. of Newcastle (Australia) CUDA Fellow @ Nvidia





#### Talk contents [37 slides]

- 1. Past, present and future [5]
- 2. Transistors and memory improvements [19]
  - 1. New manufacturing processes [2]
  - 2. New memories [15]
- 3. Stacked DRAM [10]

וסועם

- 1. HMC (Hybrid Memory Cube) [6]
- 2. HBM (High Bandwidth Memory) [3]
- 4. Impact on GPUs and concluding remarks [3]





#### I. Past, present and future



#### Past: The GPU market share

©

#### GF 8800GTX. 8800GTS 640M / GTS 11 GF 6800/GT/GTO/Ultra. GF 7800 GTX Core Discrete Graphics Market Share 2003-14 X700 SE, X300SE HM, X1650 XT, X1950 PRO/XT/ Crossfire X550 X300/LE/SE, X600 Pro/XT. ion. GF 9600GT, 9800GTX, 9800 GX2, 8800GS. HD 4350,4550,4570 HD 5750, 5770, 5970, 5 HD 3450, 3470, 3650, 3730, 3750, 3870 X2 HD 4830, 4850 X2 69 15670, 5830, 583 X800 Pro/XT/XT PE GT 610, 620, 630, 640, GTX 670, 690 HD 7970 GHz Ed. T 730, 740. 5 230, R9 295X2 GTX 760, 770, 780 GTX 650 Ti GF 6200, 6500, HD 5750, 5770, 5970, 5450, 5550, 5570, 69 5670, 5830, 5870 Eyefinity Ed FX 5700U, 5950U, FX 5600XT,5900ZT GF 7800GT HD 7730, 7990 6800XT. GF 7800G1 X800 GT/GTO, X700, AIW X800GT/XL X1900 GT X700 Pro/LE, X800. X800XL/SE X850 Pro/XT/ GF 7100GS GF 7950GT X850 XT PE GF 9400GT, 9500GT, 9600GS GTX 560 Ti 448 Core HD 4580, 6930 X800 XL X1650 PRO GTX 465 X1300 XT R7 240, 250, 260, 260X, 270, 270X 9800GT/GTX+, GTX 260 C216. HD 4650,4670,3550,3570, 4870 X2 ATI 9600XT 9800XT GTX 650, 660 GF 7800GTX 512M. X1300/Pro, AIW 2006 GTS 450, GTX 460 HD 5850, 5870 R9 280X, 290, 290X X1950 Cro HD 2900 PRO AIW 9600 Pro GTX 970, 980 X1600PRO/XT, X1800XL/XT/Crossfire Ed HD 4750, 4860 HD 7950 Boost GTX 680 GF 8800GTS 320M X1550, X1650, X19 GF 8800GTS 512M. GTX 275, 285, 295 GT 430, GTX 460SE, 570, 580 R9 285 AIW X1800XL FX 5700LE, PCX5750/5950, HD 3610 HD 7750, 7770 GHz GT 720, GTX 750. 50 Ti ID 3850, 3870, 2900GT GTS 250 HD 6850 6870 6950 6970 GT 630 r.2, 640 r.2, AIW 9200/9600/9600XT GT GTX Titan Black, itan-Z HD 7850, 7870 GHz HD 7950, 7970 GF 7200GS, 7300SE/LE, 7600GS/GT GTX 650 Ti Boost, 660 Ti GTX Titan GTX 960 R7 250X GF 9300GS/GE, GTX 280, 260. 7900GT/GTO/GTX/GX2, 7800GS. GF 8400GS Rev.3 GT 440, 545, 550 Ti, GTX 560 Ti, 590 GF 6600/GT, 6800LE. AIW X600 Pro, GF 8300GS, 8400G 8500GT 8600GS R9 280 GF 9600 GSO HD 4850, 4870, 3830, 3850 X2, X1900XT/XTX/Crossfire Ed. HD 4730,4550, 4570, 4770, HD 6990 HD 7790 AIW X800XT, X850 XT Crossfire Ed. 8600GT/GTS, 8800 X1800 GTO, AIE X1900 HD 4890 GT 520, GTX 560 ID 2350 2400 PD( HD 2600PRO/XT ID 2900 XT. X16 HD 3410 HD 5610, 6350, 6450, 6570, 6670, 6750, 6770, 6790 02 '13 03 '13 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 1Q15 2Q15 Nvidia 50% 53% 51% 55% 63% 63% 64% 61% 64% 66% 65% 76% 77% 81% 37% 46% 45% 46% 37% 36% 39% 36% 33% 35% 22% 18% AMD 45% 24% 3-1 Pre-CUDA era: 1-1 Stable period of 7 years: 2-1 4-]

Manuel Ujaldon - Nvidia CUDA Fellow

Source: Jon Peddie Research consulting



#### Present: Two hibernating movers wake up

| ССС | Code names             | Commercial<br>series | Year<br>range | Manufacturing<br>process @ TSMC | Graphics<br>memory |
|-----|------------------------|----------------------|---------------|---------------------------------|--------------------|
| 1.0 | G80                    | 8xxx                 | 2006-07       | 90 nm.                          | DDR3               |
| 1.1 | G84,6 G92,4,6,8        | 8xxx/9xxx            | 2007-09       | 80, 65, 55 nm.                  | DDR2/DDR3          |
| 1.2 | GT215,6,8              | 2xx                  | 2009-10       | 40 nm.                          | DDR2/DDR3          |
| 1.3 | GT200                  | 2xx                  | 2008-09       | 65, 55 nm.                      | DDR3               |
| 2.0 | GF100, GF110           | 4xx/5xx              | 2010-11       | 40 nm.                          | DDR://DDR5         |
| 2.1 | GF104,6,8, GF114,6,8,9 | 4xx/5xx/7xx          | 2010-13       | 40 nm.                          | DDR: /DDR5         |
| 3.0 | GK104,6,7              | 6xx/7xx              | 2012-14       | 28 nm.                          | DDR://DDR5         |
| 3.5 | GK110, GK208           | 6xx/7xx/Titan        | 2013-14       | 28 nm.                          | DDR: /DDR5         |
| 3.7 | GK210 (2xGK110)        | Titan                | 2014          | 28 nm.                          | DDR://DDR5         |
| 5.0 | GM107,8                | 7xx                  | 2014-15       | 28 nm.                          | DDR://DDR5         |
| 5.2 | GM200,4,6              | 9xx/Titan            | 2014-15       | 28 nm.                          | DDR5               |

#### Future: GTC'15 official announcements

©.





## United States to build two flagship supercomputers





Lawrence Livermore National Laboratory

SUMMIT 150-300 PFLOPS Peak Performance SIERRA > 100 PFLOPS Peak Performance

IBM POWER9 CPU + NVIDIA Volta GPU NVLink High Speed Interconnect >40 TFLOPS/Node >3,400 Nodes 2017

Major Step Forward on the Path to Exascale





# Past, present and future in numerical accuracy: Trade-off vs. performance

- [2010] Fermi: float (fp32) 2x faster than double (fp64).
- [2012] Kepler: fp32 3x fp64.
- [2014] Maxwell: fp32 32x fp64.
- [2016] Pascal: Introducing half-precision (fp16) 2x fp32.
- Half precision widely used in video-games and deep learning applications, so expect good scalability in future GPU generations.







#### II. Transistors and memory improvements



#### Benefits

# When you shrink the transistor gate, you get:

- Faster switching:
   Higher frequency.
- Smaller units:
  - More transistors per chip.
  - Bigger designs.
- Lower power:
  - Less heat.
  - Wider autonomy.

#### When you adopt Stacked-DRAM, you get:

- Faster response:
  - Higher frequency and bandwidth.
- High density packaging:
  - More bytes per chip.
  - Bigger sizes.
- Low power:
  - Less heat.
  - Wider autonomy.

#### More GFLOPS/W

#### More bandwidth



K80

K40

K20

#### GPU peak performance vs. CPU Peak GFLOPS (fp64) Peak Memory Bandwidth



- GPU 6x faster on "double": GPU 6x more bandwidth:
  - GPU: 3000 GFLOPS
  - CPU: 500 GFLOPS

2011

7 GHz x 48 bytes = 336 GB/s. 2 GHz x 32 bytes = 64 GB/s.

2012

x86 CPU

2013

2014





#### II.1. New manufacturing processes



#### Manufacturing process for a fabless company

- A loyal partner for more than 15 years has been TSMC.
- After many speculations, NVIDIA announced in Nov'14 to use TSMC's next-generation 16nm FinFET process.
- They skip the 20nm node. Intel & Samsung now in 14nm.
- Roadmap (already announced by TSMC):
  - Past: [4Q'11] They introduced 28nm.
  - Present: 16nm FinFET.
    - [4Q'15] Volume production.
    - [1Q'16] Commercial chips. Pascal will arrive shortly after this starting point.
  - Future: 10nm 3D FinFET.
    - [4Q'16] Available to customers.
    - [1Q'17] Volume production.
  - Beyond: [4Q'17] 7nm 3D FinFET.





# Benefits of moving from the last 28nm node to the first 16nm node

- 40% more performance at the same power draw.
- 50% less power at the same speed.

Source: Cadence (TSMC's partner)





#### II.2. New memories









## A 2014/15 graphics card: Kepler/Maxwell GPU with GDDR5 memory



Ò







#### A 2016 graphics card: Pascal GPU with Stacked DRAM



Manuel Ujaldon - Nvidia CUDA Fellow



## A Pascal GPU prototype







## The Pascal GPU prototype: SXM 2.0 Form Factor



#### (\* Marketing Code Name. Name is not final).





#### Pascal Stacked DRAM Memory



3D chip-on-wafer integration.
3x bandwidth vs. GDDR5.
2.7x capacity vs. GDDR5.
4x energy efficient per bit.





# How to break the 1 TB/s bandwidth barrier with a 2x 500 MHz clock

- $\bigcirc$  BW = frequency\*width => 1 TB/s = 2x500MHz \* width =>
- width = 8000 Gbits/s / 1 GHz = 8000 bits







## Unified memory: Encourage the programmer NOW to see the FUTURE memory



performances and address spaces.



a single global address space. Performance is sensitive to data proximity.



#### CUDA memory types

|                     | Zero-Copy<br>(pinned memory) | Unified Virtual<br>Addressing | Unified Memory              |
|---------------------|------------------------------|-------------------------------|-----------------------------|
| CUDA call           | cudaMallocHost(&A, 4);       | cudaMalloc(&A, 4);            | cudaMallocManaged(&A, 4);   |
| Allocation fixed in | Main memory (DDR3)           | Video memory (GDDR5)          | Both                        |
| Local access for    | CPU                          | Home GPU                      | CPU and home GPU            |
| PCI-e access for    | All GPUs                     | Other GPUs                    | Other GPUs                  |
| Other features      | Avoid swapping to disk       | No CPU access                 | On access CPU/GPU migration |
| Coherency           | At all times                 | Between GPUs                  | Only at launch & sync.      |
| Full support in     | CUDA 2.2                     | CUDA 1.0                      | CUDA 6.0                    |







# Example 1: Sorting elements from a file. The programming style converges with C

| CPU code in C                                                                               | GPU code in CUDA (v. 6.0 on)                                                                      |
|---------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------|
| <pre>void sortfile (FILE *fp, int N) {     char *data;     data = (char *) malloc(N);</pre> | <pre>void sortfile (FILE *fp, int N) {     char *data;     cudaMallocManaged(&amp;data, N);</pre> |
| <pre>fread(data, 1, N, fp);</pre>                                                           | <pre>fread(data, 1, N, fp);</pre>                                                                 |
| <pre>qsort(data, N, 1, compare);</pre>                                                      | <pre>qsort&lt;&lt;&lt;&gt;&gt;&gt;(data, N, 1, compare); cudaDeviceSynchronize();</pre>           |
| use_data(data);                                                                             | use_data(data);                                                                                   |
| <pre>free(data); }</pre>                                                                    | <pre>cudaFree(data); }</pre>                                                                      |

Ò



#### Example 2: Linked lists



- Almost impossible to manage in the original CUDA API.
- The best you can do is use pinned memory:
  - Pointers are global: Just as unified memory pointers.
  - Performance is low: GPU suffers from PCI-e bandwidth.
  - GPU latency is very high, which is critical for linked lists because of the intrinsic pointer chasing.





## Linked lists with unified memory



Can pass list elements between CPU & GPU.

No need to move data back and forth between CPU and GPU.

Can insert and delete elements from CPU or GPU.

But program must still ensure no race conditions (data is coherent between CPU & GPU at kernel launch only).





#### Unified memory: Summary

Drop-in replacement for cudaMalloc() using cudaMallocManaged().

□cudaMemcpy() now optional.

Greatly simplifies code porting.

Less Host-side memory management.

Enables shared data structures between CPU & GPU

Single pointer to data = no change to data structures.

Powerful for high-level languages like C++.





## Unified memory: The roadmap. Contributions on every abstraction level

| Abstraction<br>level | Past:<br>Consolidated<br>in 2014                          | Present:<br>On the way<br>during 2015                             | Future:<br>Available<br>in coming years |
|----------------------|-----------------------------------------------------------|-------------------------------------------------------------------|-----------------------------------------|
| High                 | Single pointer to data.<br>No cudaMemcpy()<br>is required | Prefetching mechanisms<br>to anticipate data arrival<br>in copies | System allocator unified                |
| Medium               | Coherence @<br>launch & synchronize                       | Migration hints                                                   | Stack memory unified                    |
| Low                  | Shared C/C++ data<br>structures                           | Additional<br>OS support                                          | Hardware-accelerated<br>coherence       |



#### III. Stacked DRAM (3D RAM)





#### Stacked DRAM: A tale of two consortiums

#### HMCC (Hybrid Memory Cube Consortium).

Mentors: Micron and Samsung.

<u>http://www.hybridmemorycube.org</u> (HMC 1.0, 1.1, 2.0 already available)

#### HBM (High Bandwidth Memory).

Mentors: AMD and SK Hynix.

<u>https://www.jedec.org/standards-documents/docs/jesd235 (access via JEDEC).</u>

Keep an eye on what the gurus predict at the end of this year (incoming report by the ITRS):

<u>http://www.itrs.net</u>



# III.1 HMC (Hybrid Memory Cube)

**©** NVIDIA,



### Hybrid Memory Cube Consortium (HMCC)

| HMCC achievements and milestones                                               | Date                |
|--------------------------------------------------------------------------------|---------------------|
| First papers published about Stacked DRAM<br>(based on research projects)      | 2003-2006           |
| First commercial announcement of the technology,<br>by Tezzaron Semiconductors | January, 2005       |
| HMC Consortium is launched by Micron Technologies<br>and Samsung Electronics   | October, 2011       |
| Specification HMC 1.0 available                                                | April, 2013         |
| Production samples based on the standard                                       | Second half of 2014 |
| 2.5 configuration available                                                    | End of 2014         |
| Specification HMC 2.0 available                                                | 2015                |



# Developer members of HMCC (at the time HMC 1.0 was available)





### Hybrid Memory Cube at a glance

**Revolutionary Approach to Break Through the "Memory Wall"** 

- Evolutionary DRAM roadmaps hit limitations of bandwidth and power efficiency.
- Micron introduces a new class of memory: Hybrid Memory Cube.
- Unique combination of DRAMs on Logic.

#### **Key Features**

- Micron-designed logic controller.
- High speed link to CPU.
- Massively parallel "Through Silicon Via" connection to DRAM.

#### **Unparalleled performance**

- Up to 15x the bandwidth of a DDR3 module [but just 2x vs. GDDR5].
- 70% less energy usage per bit than existing technologies [measured in number of active signals involved, power savings are 50% only].
- Occupying nearly 90% less space than today's RDIMMs [95% savings].

Targeting high performance computing and networking, eventually migrating into computing and consumer



#### Details on silicon integration

DRAM cells are organized in vaults, which take borrowed the interleaved memory arrays from already existing DRAM chips.

A logic controller is placed at the base of the DRAM layers, with data matrices on top.

The assembly is connected with through-silicon vias, **TSVs**, which traverse vertically the stack using pitches between 4 and 50 microns with a vertical latency of 12 picoseconds for a Stacked DRAM endowed with 20 layers.











# What it takes to each technology to reach 640 GB/s.

O

| Circuitry required                | DDR3L-1600      | DDR4-3200       | Stacked DRAM HMC 1.0   |
|-----------------------------------|-----------------|-----------------|------------------------|
| Data bandwidth (GB/s.)            | 12.8 per module | 25.6 per module | 20 per link of 16 bits |
| Items required to reach 640 GB/s. | 50 modules      | 25 modules      | 32 links (8 3D chips)  |

| Active signals                   | DDR3L-1600     | DDR4-3200      | Stacked DRAM HMC 1.0      |
|----------------------------------|----------------|----------------|---------------------------|
| Active pinout required           | 143 per module | 148 per module | 270 per chip              |
| Total number of electrical lines | 7150           | 3700           | 2160 <b>(70% savings)</b> |

| Energy consumed              | DDR3L-1600     | DDR4-3200      | Stacked DRAM HMC 1.0        |
|------------------------------|----------------|----------------|-----------------------------|
| Watts (W.)                   | 6.2 per module | 8.4 per module | 5 per link                  |
| Power consumed for 640 GB/s. | 310 W.         | 210 W.         | 160 W. <b>(50% savings)</b> |

| Physical space on motherboard     | DDR3L-1600          | DDR4-3200                | Stacked DRAM HMC 1.0          |
|-----------------------------------|---------------------|--------------------------|-------------------------------|
| Module area (width x height)      | 165 mm. x 10 m      | $m. = 1650 \text{ mm}^2$ | 1089 mm <sup>2</sup> per chip |
| Total area occupied for 640 GB/s. | 825 cm <sup>2</sup> | 412.5 cm <sup>2</sup>    | 43.5 cm² <b>(95% savings)</b> |



# III.2. HBM (High Bandwidth Memory)





#### Why GDDR5 is not enough

Performance: Scaling has slowed down dramatically and grown exponentially more expensive in the last few years.

Power:

Already in the non-efficient region of power/performance chart.

It requires much more energy to increase the BW that it used to.

| Case study         | Video<br>memory | Bandwidth | Bandwidth<br>per watt | Total power<br>consumed |
|--------------------|-----------------|-----------|-----------------------|-------------------------|
| AMD Radeon R9 290X | GDDR5           | 320 GB/s  | 10 GB/s               | 32 W.                   |
| AMD Fiji           | HBM             | 512 GB/s  | 35 GB/s               | 15 W.                   |

#### Space:

 $\bigcirc$  4 chips of 256 MB occupy 672 mm<sup>2</sup>.

• Using HBM, 1 GB occupies only 35 mm<sup>2</sup> (5%).

Silicon interposer is required to benefit from wire density.



## The bandwidth battle: HBM vs. DDR3 and GDDR5

|                                                         | DDR3                                                                     | GDDR5                                                                       | HBM1                                                                               | HBM2                                                            |
|---------------------------------------------------------|--------------------------------------------------------------------------|-----------------------------------------------------------------------------|------------------------------------------------------------------------------------|-----------------------------------------------------------------|
| Pins for data                                           | 8 per chip                                                               | 32 per chip                                                                 | 2 x 128 per layer                                                                  | 2 x 128 per layer                                               |
| Prefetching (per pin)                                   | 8                                                                        | 8                                                                           | 2                                                                                  | 2                                                               |
| Access granularity<br>(product of the<br>last two rows) | 8 bytes<br>per chip                                                      | 32 bytes<br>per chip                                                        | 64 bytes<br>per layer                                                              | 64 bytes<br>per layer                                           |
| Bandwidth<br>(per chip or layer)                        | 2 GB/s<br>(2 Gbps/pin)                                                   | 28 GB/s<br>(7 Gbps/pin)                                                     | 32 GB/s<br>(1 Gbps/pin)                                                            | 64 GB/s<br>(2 Gbps/pin)                                         |
| Chips or layers                                         | 8 chips/module                                                           | 12 chips/card                                                               | 4 layers/cube                                                                      | 4 or 8 layers/cube                                              |
| Cubes per GPU                                           | -                                                                        | -                                                                           | 4                                                                                  | 4                                                               |
| Total GPU bandwidth                                     | Typical CPU:<br>2 GB/s.<br>* 8 chips<br>* 4 channels<br>= <b>64 GB/s</b> | Maxwell Titan X:<br>28 GB/s<br>* 12 chips<br>= <b>336 GB/s</b><br>(the end) | AMD's Fiji:<br>32 GB/s<br>* 4 layers<br>* 4 cubes =<br>512 GB/s<br>(the beginning) | 64 GB/s<br>* 4 or 8 layers<br>* 4 cubes =<br><b>1 or 2 TB/s</b> |





#### Pending challenges

Competitive cost (hopefully solved on massive sellings).

Power density: One watt for every 35 GB/s is too much when your goal is to exceed the TB/s barrier.

Capacity (hopefully solved when 16nm, 10nm and 7nm manufacturing processes contribute).

|                    | HBM1    | HBM2     |
|--------------------|---------|----------|
| Capacity per layer | 2 Gbits | 8 Gbits  |
| Layers per cube    | 4       | 4-8      |
| Capacity per cube  | 1 GB    | 4-8 GB   |
| Cubes per GPU      | 4       | 4        |
| Total capacity     | 4 GB    | 16-32 GB |





IV. Impact on GPUs and concluding remarks

**©** NVIDIA.



#### The Roofline model: Hardware vs. Software







#### **Concluding remarks**

We are facing the heterogeneous era in chips, with better integration of computing and capacity plus an emphasis on buses:

- TSVs for communicating memory cells faster.
- Silicon interposers for higher data volume and better scalability.

GPU programmers can benefit from this technology by adopting unified memory and providing hints to compilers about the way they actually use data.

• HMC and HBM emerge to break the memory wall and promote more hierarchy on interconnections and less hierarchy on memory types.



#### Acknowledgments & Disclaimer

- To the people at Nvidia, for sharing ideas and slides.
  And to the company for the sponsorhip to bring me here.
- To Scott Stevens and Susan Platt (Micron) for providing me technical info from the HMC consortium, incorporated to this presentation under explicit permission.
- To Lorena Barba (CUDA Fellow), for her contribution to the FMM example using the roofline model.
- This talk shows my view of emerging technologies as a scientist. It is not an attempt to reflect future plans of Nvidia nor developments on the way (unless explicitly mentioned).