## Exploring Emerging Memory Technologies in Extreme Scale High Performance Computing

#### Jeffrey S. Vetter

Many contributions from FTG Group and Friends

Presented to PPAM 2017: 12<sup>th</sup> International Conference on Parallel Processing and Applied Mathematics

Lublin, Poland

17 Sep 2017





Oak Ridge National Laboratory

### Highlights

- Recent trends in extreme-scale HPC paint an ambiguous future
  - Contemporary systems provide evidence that power constraints are driving architectures to change rapidly (e.g., the end of Dennard scaling, the slowing of Moore's law)
  - Markets and business strategies impact our goals
  - Multiple architectural dimensions are being (dramatically) redesigned: Processors, node design, memory systems, I/O
- Memory systems are leading the charge!
  - New devices and materials
  - New system organizations
  - New configurations
  - Vast (local) capacities
- Programming systems must support these new memory systems (and portability)!!
  - We need new programming systems to effectively use these architectures
  - Dragon: transparent access from GPUs to vast amounts of NVM
  - NVL-C: programming a hybrid DRAM-NVM main memory
  - Papyrus: aggregating NVM to provide distributed data structures
- These changes in underlying memory system technologies will have substantial impact on both architecture and application design



### Oak Ridge National Laboratory is the DOE Office of Science's Largest Lab




#### Today, ORNL is a leading science and energy laboratory





#### Our Science requires that we continue to advance our computational capability over the next decade on the roadmap to exascale.

Since clock-rate scaling ended in 2003, HPC performance has been achieved through increased parallelism. Jaguar scaled to 300,000 cores.

Titan and beyond deliver hierarchical parallelism with very powerful nodes: MPI plus thread-level parallelism through **OpenACC or OpenMP**, plus vectors.

[Figure: OLCF roadmap timeline. 2022: OLCF5, 5-10x Summit, ~20 MW]



#### 2018 OLCF leadership system: hybrid CPU/GPU architecture

Vendor: IBM (Prime) / NVIDIA™ / Mellanox Technologies®

| FEATURE                                              | TITAN                              | SUMMIT                              |
|------------------------------------------------------|------------------------------------|-------------------------------------|
| Application<br>Performance                           | Baseline                           | 5-10x Titan                         |
| Number of Nodes                                      | 18,688                             | ~4,600                              |
| Node performance                                     | 1.4 TF                             | > 40 TF                             |
| Memory per Node                                      | 32 GB DDR3 + 6 GB<br>GDDR5         | 512 GB DDR4 + HBM                   |
| NV memory per Node                                   | 0                                  | 1600 GB                             |
| Total System Memory                                  | 710 TB                             | >10 PB DDR4 + HBM<br>+ Non-volatile |
| System Interconnect<br>(node injection<br>bandwidth) | Gemini (6.4 GB/s)                  | Dual Rail EDR-IB (23<br>GB/s)       |
| Interconnect Topology                                | 3D Torus                           | Non-blocking Fat Tree               |
| Processors                                           | 1 AMD Opteron™<br>1 NVIDIA Kepler™ | 2 IBM POWER9™<br>6 NVIDIA Volta™    |
| File System                                          | 32 PB, 1 TB/s, Lustre®             | 250 PB, 2.5 TB/s,<br>GPFS™          |
| Peak power consumption                               | 9 MW                               | 15 MW                               |
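
As a rough consistency check on the Summit column, the per-node capacities can be multiplied by the approximate node count. This is a back-of-the-envelope sketch that assumes ~4,600 nodes, reads the memory entry as 512 GB of DDR4 per node, and uses decimal units (1 PB = 10<sup>6</sup> GB). The HBM capacity per node is not given in the table, so it is left out.

$$
\begin{aligned}
\text{DDR4:}\quad & 4{,}600 \times 512\ \text{GB} \approx 2.36\ \text{PB} \\
\text{NVM:}\quad & 4{,}600 \times 1{,}600\ \text{GB} = 7.36\ \text{PB} \\
\text{DDR4 + NVM:}\quad & \approx 9.72\ \text{PB}
\end{aligned}
$$

Under these assumptions HBM would supply the remainder of the ">10 PB" total, and non-volatile memory accounts for about three quarters of the DDR4 + NVM capacity.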







# ECP has formulated a holistic approach that uses co-design and integration to achieve capable exascale

| Application Development          | Software Technology     | Hardware Technology          | Exascale Systems                   |
|----------------------------------|-------------------------|------------------------------|------------------------------------|
| Science and mission applications | Scalable software stack | Hardware technology elements | Integrated exascale supercomputers |

Software stack components shown in the figure:

- Correctness, visualization, data analysis
- Applications, co-design
- Programming models, development environment, and runtimes
- Math libraries and frameworks
- Tools
- System software: resource management, threading, scheduling, monitoring, and control
- Memory and burst buffer
- Data management, I/O, and file system
- Node OS, runtimes
- Hardware interface





#### Future Technologies Group (FTG)

Jeffrey S. Vetter, Group Leader

The Future Technologies Group performs research in core technologies for future generations of high-end computing architectures, including prototype computer architectures and experimental software systems. We investigate these technologies with the goal of improving the performance, energy efficiency, reliability, and productivity of these architectures for our sponsors and applications teams. See http://ft.ornl.gov.





#### **Key Technical Areas**

- Heterogeneous architectures
- Deep memory hierarchies, including non-volatile memory
- Performance measurement, analysis, simulation, and modeling of emerging architectures
- Programming systems to address emerging architectures
- Beyond Moore's Computing

#### **Software Artifacts**

- LLVM Parallel IR
- mpiP
- NV-Scavenger
- Scalable Heterogeneous Computing Benchmarks (SHOC)
- DESTINY
- Aspen
- OpenARC
- Papyrus
- NVL-C
- Oxbow

#### **Publications**

- SC, ICS, HPDC, TPDS, DATE, PLDI, IPDPS, Trans VLSI, etc.

#### **Impact**

- Two Gordon Bell awards
- NSF Keeneland
- DOE Titan
- IEEE TCHPC Early Career
- IEEE Fellow
- ~60 interns
- ~120 FTG seminars




## **Emerging Memory Systems**



#### Memory Systems Started Diversifying Several Years Ago

- Architectures
  - HMC, HBM/2/3, LPDDR4, GDDR5X, WIDEIO2, etc.
  - 2.5D, 3D Stacking
- Configurations
  - Unified memory
  - Scratchpads
  - Write-through, write-back, etc.
  - Consistency and coherence protocols
  - Virtual vs. physical, paging strategies
- New devices
  - ReRAM, PCRAM, STT-MRAM, 3D XPoint





Copyright (c) 2014 Hiroshige Goto All rights reserved.

|                             | SRAM             | DRAM             | eDRAM            | 2D NAND<br>Flash                 | 3D NAND<br>Flash                 | PCRAM                             | STTRAM           | 2D ReRAM                          | 3D ReRAM                          |
|-----------------------------|------------------|------------------|------------------|----------------------------------|----------------------------------|-----------------------------------|------------------|-----------------------------------|-----------------------------------|
| Data Retention              | N                | N                | N                | Y                                | Y                                | Y                                 | Y                | Y                                 | Y                                 |
| Cell Size (F<sup>2</sup>)   | 50-200           | 4-6              | 19-26            | 2-5                              | <1                               | 4-10                              | 8-40             | 4                                 | <1                                |
| Minimum F demonstrated (nm) | 14               | 25               | 22               | 16                               | 64                               | 20                                | 28               | 27                                | 24                                |
| Read Time (ns)              | <1               | 30               | 5                | 10<sup>4</sup>                   | 10<sup>4</sup>                   | 10-50                             | 3-10             | 10-50                             | 10-50                             |
| Write Time (ns)             | <1               | 50               | 5                | 10<sup>5</sup>                   | 10<sup>5</sup>                   | 100-300                           | 3-10             | 10-50                             | 10-50                             |
| Number of Rewrites          | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>8</sup>-10<sup>10</sup>    | 10<sup>15</sup>  | 10<sup>8</sup>-10<sup>12</sup>    | 10<sup>8</sup>-10<sup>12</sup>    |
| Read Power                  | Low              | Low              | Low              | High                             | High                             | Low                               | Medium           | Medium                            | Medium                            |
| Write Power                 | Low              | Low              | Low              | High                             | High                             | High                              | Medium           | Medium                            | Medium                            |
| Power (other than R/W)      | Leakage          | Refresh          | Refresh          | None                             | None                             | None                              | None             | Sneak                             | Sneak                             |
| Maturity                    |                  |                  |                  |                                  |                                  |                                   |                  |                                   |                                   |

J.S. Vetter and S. Mittal, "Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing," CiSE, 17(2):73-82, 2015.



Fig. 4. (a) A typical 1T1R structure of RRAM with HfO<sub>x</sub>; (b) HR-TEM image of the TiN/Ti/HfO<sub>x</sub>/TiN stacked layer; the thickness of the HfO<sub>2</sub> is 20 nm.

H.S.P. Wong, H.Y. Lee, S. Yu et al., "Metal-oxide RRAM," Proceedings of the IEEE, 100(6):1951-1970, 2012.

#### **Current ASCR Computing At a Glance**

| System attributes     | NERSC<br>Now                               | OLCF<br>Now                             | ALCF<br>Now                | NERSC Upgrade                                                                     | OLCF Upgrade                                                    | ALCF Upgrades                                       | ALCF Upgrades                                                                       |  |
|-----------------------|--------------------------------------------|-----------------------------------------|----------------------------|-----------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------|-------------------------------------------------------------------------------------|--|
| Planned Installation  | Edison                                     | TITAN                                   | MIRA                       | Cori<br>2016                                                                      | Summit<br>2017-2018                                             | Theta<br>2016                                       | Aurora<br>2018-2019                                                                 |  |
| System peak (PF)      | 2.6                                        | 27                                      | 10                         | > 30                                                                              | 150                                                             | >8.5                                                | 180                                                                                 |  |
| Peak Power (MW)       | 2                                          | 9                                       | 4.8                        | < 3.7                                                                             | 10                                                              | 1.7                                                 | 13                                                                                  |  |
| Total system memory   | 357 TB                                     | 710TB                                   | 768TB                      | ~1 PB DDR4 + High<br>Bandwidth Memory<br>(HBM)+1.5PB<br>persistent memory         | > 1.74 PB DDR4 +<br>HBM + 2.8 PB<br>persistent memory           | >480 TB DDR4 +<br>High Bandwidth<br>Memory (HBM)    | > 7 PB High Bandwidth<br>On-Package Memory<br>Local Memory and<br>Persistent Memory |  |
| Node performance (TF) | 0.460                                      | 1.452                                   | 0.204                      | > 3                                                                               | > 40                                                            | > 3                                                 | > 17 times Mira                                                                     |  |
| Node processors       | Intel Ivy<br>Bridge                        | AMD<br>Opteron<br>Nvidia<br>Kepler      | 64-bit<br>PowerPC<br>A2    | Intel Knights Landing<br>many core CPUs<br>Intel Haswell CPU in<br>data partition | Multiple IBM<br>Power9 CPUs &<br>multiple Nvidia<br>Volta GPUs  | Intel Knights Landing<br>Xeon Phi many core<br>CPUs | Knights Hill Xeon Phi<br>many core CPUs                                             |  |
| System size (nodes)   | 5,600<br>nodes                             | 18,688<br>nodes                         | 49,152                     | 9,300 nodes<br>1,900 nodes in data<br>partition                                   | ~3,500 nodes                                                    | >2,500 nodes                                        | >50,000 nodes                                                                       |  |
| System Interconnect   | Aries                                      | Gemini                                  | 5D Torus                   | Aries                                                                             | Dual Rail<br>EDR-IB                                             | Aries                                               | 2 <sup>nd</sup> Generation Intel<br>Omni-Path Architecture                          |  |
| File System           | 7.6 PB<br>168 GB/s,<br>Lustre <sup>®</sup> | 32 PB<br>1 TB/s,<br>Lustre <sup>®</sup> | 26 PB<br>300 GB/s<br>GPFS™ | 28 PB<br>744 GB/s<br>Lustre <sup>®</sup>                                          | 120 PB<br>1 TB/s<br>GPFS™                                       | 10PB, 210 GB/s<br>Lustre initial                    | 150 PB<br>1 TB/s<br>Lustre <sup>®</sup>                                             |  |

Binkley, ASCAC, April 2016
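
The power argument can be made concrete by dividing peak performance by peak power for the systems in the table. The sketch below is a minimal illustration, not part of the original slides; the struct and function names are hypothetical, and only rows with exact (non-bounded) values are included.

```c
#include <assert.h>
#include <stddef.h>

/* One row of the table above: peak performance and peak power. */
struct system_peak {
    const char *name;
    double peak_pf;   /* system peak, petaflop/s */
    double power_mw;  /* peak power, megawatts */
};

/* Peak energy efficiency in gigaflop/s per watt.
 * 1 PF = 1e6 GF and 1 MW = 1e6 W, so PF/MW equals GF/W. */
double peak_gflops_per_watt(const struct system_peak *s)
{
    assert(s != NULL && s->power_mw > 0.0);
    return s->peak_pf / s->power_mw;
}

/* Values copied from the table; entries given only as bounds
 * (such as "> 30" or "< 3.7") are omitted. */
const struct system_peak ascr_systems[] = {
    { "Edison", 2.6, 2.0 },
    { "Titan", 27.0, 9.0 },
    { "Mira", 10.0, 4.8 },
    { "Summit", 150.0, 10.0 },
    { "Aurora", 180.0, 13.0 },
};
```

Calling `peak_gflops_per_watt(&ascr_systems[1])` gives 3 GF/W for Titan, and the Summit entry gives 15 GF/W: a fivefold gain in peak efficiency for roughly the same power budget, using the planning figures quoted in the table.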



Complexity ∝ T

#### NVRAM Technology Continues to Improve – Driven by Broad Market Forces



http://www.eetasia.com/STATIC/ARTICLE_IMAGES/201212/EEOL_2012DEC28_STOR_MFG_NT_01.jpg

#### Comparison of Emerging Memory Technologies

|                             | SRAM             | DRAM             | eDRAM            | 2D NAND<br>Flash                 | 3D NAND<br>Flash                 | PCRAM                             | STTRAM           | 2D ReRAM                          | 3D ReRAM                          |
|-----------------------------|------------------|------------------|------------------|----------------------------------|----------------------------------|-----------------------------------|------------------|-----------------------------------|-----------------------------------|
| Status                      | Deployed         | Deployed         | Deployed         | Deployed                         | Deployed                         | Experimental                      | Experimental     | Experimental                      | Experimental                      |
| Data Retention              | N                | N                | N                | Y                                | Y                                | Y                                 | Y                | Y                                 | Y                                 |
| Cell Size (F<sup>2</sup>)   | 50-200           | 4-6              | 19-26            | 2-5                              | <1                               | 4-10                              | 8-40             | 4                                 | <1                                |
| Minimum F demonstrated (nm) | 14               | 25               | 22               | 16                               | 64                               | 20                                | 28               | 27                                | 24                                |
| Read Time (ns)              | <1               | 30               | 5                | 10<sup>4</sup>                   | 10<sup>4</sup>                   | 10-50                             | 3-10             | 10-50                             | 10-50                             |
| Write Time (ns)             | <1               | 50               | 5                | 10<sup>5</sup>                   | 10<sup>5</sup>                   | 100-300                           | 3-10             | 10-50                             | 10-50                             |
| Number of Rewrites          | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>8</sup>-10<sup>10</sup>    | 10<sup>15</sup>  | 10<sup>8</sup>-10<sup>12</sup>    | 10<sup>8</sup>-10<sup>12</sup>    |
| Read Power                  | Low              | Low              | Low              | High                             | High                             | Low                               | Medium           | Medium                            | Medium                            |
| Write Power                 | Low              | Low              | Low              | High                             | High                             | High                              | Medium           | Medium                            | Medium                            |
| Power (other than R/W)      | Leakage          | Refresh          | Refresh          | None                             | None                             | None                              | None             | Sneak                             | Sneak                             |
| Maturity                    |                  |                  |                  |                                  |                                  |                                   |                  |                                   |                                   |

Intel/Micron 3D XPoint? Samsung Z-NAND?
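
The "Number of Rewrites" row matters most when a device is used like memory rather than like storage. The following sketch estimates an idealized device lifetime from endurance, capacity, and write rate. It assumes perfect wear leveling and a constant write rate, both simplifications introduced here for illustration; the function names are hypothetical.

```c
#include <assert.h>

/* Idealized device lifetime in seconds, assuming perfect wear leveling:
 * each cell tolerates `rewrites` writes, so the device as a whole absorbs
 * rewrites * capacity_bytes bytes before wearing out. */
double wearout_seconds(double rewrites, double capacity_bytes,
                       double write_bytes_per_sec)
{
    assert(rewrites > 0.0 && capacity_bytes > 0.0 && write_bytes_per_sec > 0.0);
    return rewrites * capacity_bytes / write_bytes_per_sec;
}

/* Convert seconds to 365-day years. */
double seconds_to_years(double seconds)
{
    return seconds / (365.0 * 24.0 * 3600.0);
}
```

For example, a 1,600 GB device written continuously at a hypothetical 1 GB/s lasts about 185 days at 10<sup>4</sup> rewrites (the low end for NAND flash) and roughly 5,000 years at 10<sup>8</sup> rewrites (the low end for PCRAM and ReRAM). Real devices fall short of this bound because wear leveling is imperfect and writes are amplified.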



http://ft.ornl.gov/trac/blackcomb

#### Investigating Solutions to Memory and Storage Challenges



# Migration up the hierarchy





## **HPC Application Scenarios for NVM**



- Burst buffers, checkpoint/restart (C/R) [Liu, et al., MSST 2012]
- In situ visualization
- In-memory tables

Figure 3: Read/write ratios, memory reference rates and memory object sizes for memory objects in Nek5000



#### Empirical results show many reasons...

- Lookup, index, and permutation tables
- Inverted and 'element-lagged' mass matrices
- Geometry arrays for grids
- Thermal conductivity for soils
- Strain and conductivity rates
- Boundary condition data
- Constants for transforms, interpolation
- MC Tally tables, cross-section materials tables...

J.S. Vetter and S. Mittal, "Opportunities for Nonvolatile Memory Systems in Extreme-Scale High-Performance Computing," Computing in Science & Engineering, 17(2):73-82, 2015.

## **Runtime support for NVM**



#### DRAGON provides NVM transparently to GPU through OS, drivers

Provide vast NVM (FusionIO 1-5TB) to GPU (Pascal) transparently





Markthub, Belviranli, et al. DRAGON: Direct Resource Access for GPUs over NVM, submitted
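
A familiar CPU-side analogue of this idea is POSIX `mmap`: a file on NVM or SSD is mapped into the address space and then read like an ordinary array, with the OS paging data in on demand. The sketch below shows only that analogue. It is not DRAGON's API, and the function names and file layout (a flat array of floats) are assumptions made for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write `count` floats to `path`, standing in for a data set placed on NVM. */
int write_floats(const char *path, const float *values, size_t count)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    size_t bytes = count * sizeof(float);
    ssize_t written = write(fd, values, bytes);
    close(fd);
    return (written == (ssize_t)bytes) ? 0 : -1;
}

/* Sum `count` floats stored in the file at `path` by mapping the file and
 * reading it like an in-memory array. Returns 0 on success, -1 on error. */
int sum_mapped_floats(const char *path, size_t count, double *sum_out)
{
    assert(path != NULL && sum_out != NULL);
    if (count == 0) {          /* mmap rejects a zero-length mapping */
        *sum_out = 0.0;
        return 0;
    }

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    size_t bytes = count * sizeof(float);
    if (fstat(fd, &st) != 0 || (size_t)st.st_size < bytes) {
        close(fd);
        return -1;
    }

    const float *data = mmap(NULL, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (data == MAP_FAILED)
        return -1;

    double sum = 0.0;
    for (size_t i = 0; i < count; ++i)
        sum += data[i];        /* plain loads; no explicit read() calls */

    munmap((void *)data, bytes);
    *sum_out = sum;
    return 0;
}
```

The compute loop never issues explicit I/O calls, which is the property the slide describes for GPU kernels: the data set can exceed device memory while the code keeps its array-style accesses.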

#### **Results with Caffe**



Figure 6: Comparison of ResNet execution times on Caffe.



Figure 7: Comparison of C3D execution times on Caffe.



## Language support for NVM: NVL-C - extending C to support NVM

J. Denny, S. Lee, and J.S. Vetter, "NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems," in ACM High Performance Distributed Computing (HPDC). Kyoto: ACM, 2016



#### **NVL-C: Programming Model**

- Minimal, familiar, programming interface:
  - Minimal C language extensions.
  - App can still use DRAM
- Pointer safety:
  - Persistence creates new categories of pointer bugs
  - Best to enforce pointer safety constraints at compile time rather than run time
- Transactions:
  - Prevent corruption of persistent memory in case of application or system failure
- Language extensions enable:
  - Compile-time safety constraints
  - NVM-related compiler analyses and optimizations
- LLVM-based:
  - Core of compiler can be reused for other front ends and languages
  - Can take advantage of LLVM ecosystem

```
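#include <nvl.h>

/* Persistent list node: `nvl` marks data and pointers stored in NVM. */
struct list {
  int value;
  nvl struct list *next;
};

/* Unlink every node after the root whose value equals k. */
void remove(int k) {
  nvl_heap_t *heap
    = nvl_open("foo.nvl");
  nvl struct list *a
    = nvl_get_root(heap, struct list);
  /* The whole loop runs as one transaction. */
  #pragma nvl atomic
  while (a->next != NULL) {
    if (a->next->value == k)
      a->next = a->next->next;
    else
      a = a->next;
  }
  nvl_close(heap);
}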
```



J.E. Denny, S. Lee, and J.S. Vetter, "NVL-C: Static Analysis Techniques for Efficient, Correct Programming of Non-Volatile Main Memory Systems," in Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC), Kyoto, Japan, 2016.

#### **Design Goals: Modular implementation**



- Core is common compiler middle-end
- Multiple compiler front ends for multiple high-level languages:
  - For now, just OpenARC for NVL-C
- Multiple runtime implementations:
  - For now, just Intel's pmem (pmemobj)



#### Programming Model: Pointer types (like Coburn et al.)
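
One reason persistent pointers need their own types is that a raw address stored in NVM is only meaningful while the region stays mapped at the same location. The sketch below illustrates the hazard in plain C by storing offsets from the region base instead of raw addresses, then simulating a restart that maps the region elsewhere. It is an illustration of the general problem, not NVL-C's implementation; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* A node that lives inside a persistent region. `next` is a byte offset
 * from the start of the region (0 means "no next node"), not an address. */
struct pnode {
    int value;
    uint64_t next;
};

/* Translate an offset into a pointer for the current mapping. */
static struct pnode *resolve(void *base, uint64_t offset)
{
    return offset ? (struct pnode *)((char *)base + offset) : NULL;
}

/* Build a list holding values[0..n-1] inside `region`; return the head
 * offset. Offset 0 is reserved, so nodes start one node-size in. */
uint64_t build_list(void *region, size_t region_bytes,
                    const int *values, size_t n)
{
    assert(region_bytes >= (n + 1) * sizeof(struct pnode));
    for (size_t i = 0; i < n; ++i) {
        uint64_t off = (i + 1) * sizeof(struct pnode);
        struct pnode *node = resolve(region, off);
        node->value = values[i];
        node->next = (i + 1 < n) ? off + sizeof(struct pnode) : 0;
    }
    return n ? sizeof(struct pnode) : 0;
}

/* Sum the list reachable from `head` in the region mapped at `base`. */
long sum_list(void *base, uint64_t head)
{
    long sum = 0;
    for (struct pnode *p = resolve(base, head); p != NULL;
         p = resolve(base, p->next))
        sum += p->value;
    return sum;
}

/* Simulate a restart that maps the region at a different address: copy the
 * bytes elsewhere, scrub the original, and walk the list in the copy. */
long sum_after_remap(const int *values, size_t n)
{
    size_t bytes = (n + 1) * sizeof(struct pnode);
    void *old_map = malloc(bytes);
    void *new_map = malloc(bytes);
    assert(old_map != NULL && new_map != NULL);
    uint64_t head = build_list(old_map, bytes, values, n);
    memcpy(new_map, old_map, bytes);
    memset(old_map, 0, bytes);
    long sum = sum_list(new_map, head);
    free(old_map);
    free(new_map);
    return sum;
}
```

Had `next` held a raw address, the traversal in `sum_after_remap` would follow pointers into the scrubbed original region. A pointer from NVM into volatile memory is worse still, since its target does not survive a restart at all, which is why such cases are best rejected at compile time.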



#### Programming Model: Transactions: Undo logs

```
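#include <nvl.h>

/* Computes a = b * c one row at a time. The row counter *i lives in NVM,
   so an interrupted run resumes at the first unfinished row.
   I, J, K, and heap are assumed to be defined elsewhere. */
void matmul(nvl float a[I][J],
            nvl float b[I][K],
            nvl float c[K][J],
            nvl int *i)
{
  while (*i<I) {
    /* One transaction per row of a, including the counter update. */
    #pragma nvl atomic heap(heap)
    {
      for (int j=0; j<J; ++j) {
        float sum = 0.0;
        for (int k=0; k<K; ++k)
          sum += b[*i][k] * c[k][j];
        a[*i][j] = sum;
      }
      ++*i;
    }
  }
}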
```

- Before every NVM store, transaction creates undo log to back up old data
- Undo log contains metadata plus old data being overwritten
- Problem: large overhead, because an undo log is created for every element of `a` (every iteration of the `j` loop)
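
The undo-log mechanism described in these bullets can be sketched in plain C. This is a minimal in-memory model, assuming a fixed-size log and omitting the flushes to persistent media that a real NVM runtime needs; it is not the NVL-C runtime, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_CAPACITY 1024

/* One undo record: where the store went and what it overwrote. */
struct undo_entry {
    float *addr;
    float old_value;
};

struct undo_log {
    struct undo_entry entries[UNDO_CAPACITY];
    size_t count;
};

/* Begin a transaction with an empty log. */
void tx_begin(struct undo_log *log)
{
    log->count = 0;
}

/* Transactional store: record the old value first, then overwrite it. */
void tx_store(struct undo_log *log, float *addr, float value)
{
    assert(log->count < UNDO_CAPACITY);
    log->entries[log->count].addr = addr;
    log->entries[log->count].old_value = *addr;
    log->count++;
    *addr = value;
}

/* Commit: the new data stands, so the log is discarded. */
void tx_commit(struct undo_log *log)
{
    log->count = 0;
}

/* Abort, or recovery after a failure: restore old values newest-first. */
void tx_abort(struct undo_log *log)
{
    while (log->count > 0) {
        log->count--;
        *log->entries[log->count].addr = log->entries[log->count].old_value;
    }
}
```

Writing a row of J elements through `tx_store` appends J records, each carrying metadata as well as the old value. That per-store cost is the overhead the last bullet refers to.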



#### **Evaluation: LULESH**



- ExM = use SSD as extended DRAM
- T1 = BSR + transactions
- T2 = T1 + backup clauses
- T3 = T1 + clobber clauses
- BlockNVM = msync included
- ByteNVM = msync suppressed

- The `backup` clause is important for performance
- The `clobber` clause cannot be applied because the old data is needed



# Programming Scalable NVM with Papyrus



#### **Papyrus Overview**



- Papyrus
  - A user-level library using MPI
    - MPI-interoperable
    - No daemon, no server
- Virtual File System (VFS)
  - Uniform aggregate NVM storage image
- Template Container Library (TCL)
  - High-level programming interface built on top of VFS
    - Data elements are distributed to multiple NVM nodes



#### **Papyrus VFS Directory Structure**



[Figure: Papyrus VFS on two layouts, with MPI ranks 0-3 running on Compute Nodes 0 and 1. Private NVM architecture: each compute node has its own NVM. Shared NVM architecture: the compute nodes share NVM on a burst buffer node.]

- Uniform aggregate file directory structure across private and shared NVM architectures
- Papyrus root directory
  - Entry point to the aggregate NVM storage image
- Rank directories
  - One *rank directory* per running MPI rank

- A file in rank directory **N** will be stored on:

| Private NVM Architecture                                      | Shared NVM Architecture                                                                   |
|---------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| An NVM in the node that runs MPI rank **N** (locality-aware)  | A single NVM, or striped over multiple NVMs, on the burst buffer (locality-independent)   |
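
The placement rule in the table can be written as a small function. This is a sketch under two assumptions that are not stated on the slide: ranks are packed consecutively onto nodes (ranks 0 and 1 on node 0, and so on, as the figure suggests), and striping on the shared architecture is round-robin by block. The names are hypothetical and do not come from the Papyrus API.

```c
#include <assert.h>

enum nvm_arch { NVM_PRIVATE, NVM_SHARED };

/* Index of the NVM device that holds block `block` of a file in rank
 * directory `rank`.
 *  - Private: the device in the node running that rank (locality-aware).
 *  - Shared: blocks are striped round-robin over the burst-buffer devices
 *    (locality-independent). */
int target_nvm(enum nvm_arch arch, int rank, int ranks_per_node,
               long block, int num_shared_nvms)
{
    assert(rank >= 0 && block >= 0);
    if (arch == NVM_PRIVATE) {
        assert(ranks_per_node > 0);
        return rank / ranks_per_node;  /* node index == local NVM index */
    }
    assert(num_shared_nvms > 0);
    return (int)(block % num_shared_nvms);
}
```

With two ranks per node, `target_nvm(NVM_PRIVATE, 3, 2, 0, 1)` returns 1: every block of rank 3's files stays on node 1's NVM. On the shared layout the rank no longer matters and only the block index selects the device.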



### Papyrus Template Container Library (TCL)



- A high-level programming interface on top of VFS
- Three C++ template containers
  - papyrus::map<Key, T>
    - hashmap
  - papyrus::vector<T>
    - mutable 1D array
  - papyrus::matrix<T>
    - mutable 2D array
- Data elements are
  - Distributed to multiple NVM nodes
  - Globally accessed by all MPI ranks



#### PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures

- Leverage emerging NVM technologies
  - High performance
  - High capacity
  - Persistence property
- Designed for the next-generation DOE systems
  - Portable across local NVM and dedicated NVM architectures
  - An embedded key-value store (no system-level daemons and servers)
- Designed for HPC applications
  - MPI/UPC-interoperable
  - Application customizability
    - Memory consistency models (sequential and relaxed)
    - Protection attributes (read-only, write-only, read-write)
    - Load balancing
  - Zero-copy workflow, asynchronous checkpoint/restart




J. Kim, S. Lee, and J. S. Vetter, "PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures," In Proc. of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2017 (to appear)

#### PapyrusKV Example: Get operations





The present design allows remote caching only for read-only (RO) data.

#### **Evaluation**

- Evaluation results on OLCF's SummitDev, TACC's Stampede (KNL), and NERSC's Cori













Figure 8: Get operation performance. SG and B refer to Storage Group and SSTable Binary search, respectively.



Figure 10: Checkpoint, restart, and restart with redistribution (RD) performance.



Figure 11: Performance comparisons with MDHIM on SummitDev. NVMe (N) and Lustre (L) are used for data storage.

#### ECP Application Case Study 1: Meraculous (UPC)

- A parallel de Bruijn graph construction and traversal for de novo genome assembly
  - ExaBiome, Exascale Solutions for Microbiome Analysis, LBNL



Graphic from ExaBiome: Exascale Solutions to Microbiome Analysis (LBNL, LANL, JGI), 2017

#### Table 1: Source lines of code.

| Source file          | UPC  | UPC+PapyrusKV |
|----------------------|------|---------------|
| meraculous.c         | 469  | 475 (+6)      |
| buildUFXhashBinary.h | 315  | 173 (-142)    |
| kmer_hash.h          | 457  | 129 (-328)    |
| UU_traversal_final.h | 1754 | 1724 (-30)    |
| Modified Total       | 2995 | 2501 (-494)   |
| Grand Total          | 5971 | 5477 (-494)   |



Figure 5: Distributed hash table implementations in UPC and PapyrusKV. \*The same user *hash* function in the UPC application can be used in PapyrusKV.



## ECP Application Case Study 2: HACC (MPI-IO)

- An N-body cosmology code framework
  - ExaSky, Computing the Sky at Extreme Scales, ANL



Graphic from HACCing the Universe on the BG/Q (ANL), 2014



Figure 7: Two-phase checkpointing. PapyrusKV reduces the I/O overhead thanks to fast NVM access. The asynchronous checkpoint hides the I/O overhead between NVM and the parallel file system from the application.

WIP: Initial results show about a 10% improvement in application performance.
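
The benefit of two-phase checkpointing can be reasoned about with a simple cost model: the application is blocked only while writing to NVM, provided the background drain to the parallel file system finishes before the next checkpoint. The sketch below encodes that model. The bandwidths are parameters rather than measured values, and the function names are hypothetical.

```c
#include <assert.h>

/* Time the application is blocked by one synchronous checkpoint written
 * directly to the parallel file system (PFS), in seconds. */
double blocked_seconds_sync(double bytes, double pfs_bytes_per_sec)
{
    assert(bytes >= 0.0 && pfs_bytes_per_sec > 0.0);
    return bytes / pfs_bytes_per_sec;
}

/* Time the application is blocked by a two-phase checkpoint: it waits only
 * for the write to NVM; the NVM-to-PFS copy runs in the background. */
double blocked_seconds_two_phase(double bytes, double nvm_bytes_per_sec)
{
    assert(bytes >= 0.0 && nvm_bytes_per_sec > 0.0);
    return bytes / nvm_bytes_per_sec;
}

/* The background drain must finish before the next checkpoint starts,
 * otherwise drains pile up and the NVM staging space fills.
 * Returns 1 if the checkpoint interval is long enough, 0 otherwise. */
int drain_keeps_up(double bytes, double pfs_bytes_per_sec, double interval_sec)
{
    return blocked_seconds_sync(bytes, pfs_bytes_per_sec) <= interval_sec;
}
```

In this model the visible overhead shrinks by the ratio of NVM bandwidth to PFS bandwidth as seen by the application, as long as `drain_keeps_up` holds for the chosen checkpoint interval.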



# Implications





1. Device and architecture trends will have major impacts on HPC in the coming decade
   1. NVM in HPC systems is real!
2. Performance trends of system components will create new opportunities and challenges
   1. Winners and losers
3. A sea of NVM allows/requires applications to operate differently
   1. A sea of NVM will permit applications to run for weeks without doing I/O to an external storage system
   2. Applications will simply access local/remote NVM
   3. Longer-term, productive I/O will be 'occasionally' written to Lustre, GPFS
   4. Checkpointing (as we know it) will disappear
4. Requirements for system design will change
   1. Increase in byte-addressable, memory-like message sizes and frequencies
   2. Reduced traditional I/O demands
   3. KV traffic could have considerable impact; more application evidence is needed
   4. Need changes to the operational mode of the system



#### Summary

- Recent trends in extreme-scale HPC paint an ambiguous future
  - Contemporary systems provide evidence that power constraints are driving architectures to change rapidly (e.g., the end of Dennard scaling, the slowing of Moore's law)
  - Multiple architectural dimensions are being (dramatically) redesigned: Processors, node design, memory systems, I/O
- Memory systems are leading the charge in BMC now!
  - New devices
  - New integration
  - New configurations
  - Vast (local) capacities
- Programming systems must support these new memory systems (and portability)!!
  - We need new programming systems to effectively use these architectures
  - NVL-C
  - Papyrus
- Changes in memory systems will dramatically impact systems and applications





#### Acknowledgements

- Contributors and Sponsors
  - Future Technologies Group: http://ft.ornl.gov
  - US Department of Energy Office of Science
    - DOE Vancouver Project: https://ft.ornl.gov/trac/vancouver
    - DOE Blackcomb Project: https://ft.ornl.gov/trac/blackcomb
    - DOE ExMatEx Codesign Center: http://codesign.lanl.gov
    - DOE Cesar Codesign Center: http://cesar.mcs.anl.gov/
    - DOE Exascale Efforts: http://science.energy.gov/ascr/research/computer-science/
  - Scalable Heterogeneous Computing Benchmark team: http://bit.ly/shocmarx
  - US National Science Foundation Keeneland Project: http://keeneland.gatech.edu
  - US DARPA
  - NVIDIA CUDA Center of Excellence



