A Quick HPC Perspective on Accelerators
Vincent Betro Computational Scientist
University of Tennessee National Institute for Computational Sciences
Copyright 2013
Things started looking harder around 2004…
Source: published SPECInt data
Moore’s Law is not at all dead…
Intel process technology capabilities (Source: Intel)

High-volume manufacturing year                 | 2004 | 2006 | 2008 | 2010 | 2012 | 2014 | 2016 | 2018
Feature size                                   | 90nm | 65nm | 45nm | 32nm | 22nm | 16nm | 11nm | 8nm
Integration capacity (billions of transistors) | 2    | 4    | 8    | 16   | 32   | 64   | 128  | 256

[Image: a transistor from the 90nm process, roughly 50nm across, shown at the scale of an influenza virus. Sources: Intel; CDC.]
The problem in 2004…didn’t get better.
[Figure: power density (watts per square cm), 1975-2020, for historical single-core and multi-core processors and the ITRS high-performance projection, with reference lines for a 100 W lightbulb, a hot plate, and a nuclear reactor.]
[Figure: clock rate (MHz), log scale, 1992-2016, for node processor clocks and accelerator core clocks, each with high, low, and median series.]
Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL.
Not a new problem, just a new scale…
[Figure: CPU power (W) over time; photo of a Cray-2 with its cooling tower in the foreground, circa 1985.]
How do we get the same number of transistors to give us more performance without cranking up the power?
[Diagram: one big core with its cache versus four small cores (C1-C4) sharing a cache, with bar charts of power and performance. Rule of thumb from the slide: a small core draws about 1/4 the power of a big core while delivering about 1/2 of its single-thread performance.]
Many core is more power efficient
• Power ~ area
• Single-thread performance ~ area^0.5

Example: dual core with voltage scaling
  Single core: Area = 1, Voltage = 1, Freq = 1 → Power = 1, Perf = 1
  Dual core:   Area = 2, Voltage = 0.85, Freq = 0.85 → Power = 1, Perf ≈ 1.8

Rule of thumb: a 15% reduction in voltage yields a 15% frequency reduction, a 45% power reduction, and a 10% performance reduction (see the worked numbers below).
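A quick check of those numbers (a minimal worked example; the dynamic-power relation below is the standard approximation and is not stated on the slide):

\[
P \propto C\,V^{2}f, \qquad (0.85)^{2} \times 0.85 \approx 0.61,
\]

so cutting voltage and frequency by 15% each reduces a core's power by roughly 40-45%. Applying the per-core rule of thumb to two cores:

\[
P_{\text{dual}} \approx 2 \times 0.55 \approx 1.1 \approx 1, \qquad
\text{Perf}_{\text{dual}} \approx 2 \times 0.90 = 1.8,
\]

which matches the slide's "Power = 1, Perf ≈ 1.8" for the dual-core case.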
GPU Architecture – Kepler: Streaming Multiprocessor (SMX)
• 192 SP CUDA cores per SMX: 192 fp32 ops/clock and 192 int32 ops/clock
• 64 DP CUDA cores per SMX: 64 fp64 ops/clock
• 4 warp schedulers; up to 2,048 threads resident concurrently
• 32 special-function units
• 64 KB shared memory + L1 cache, 48 KB read-only data cache, 64K 32-bit registers
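To make those numbers concrete, here is a minimal launch sketch (not from the slides; the kernel name, array size, and block size are illustrative assumptions) showing how a 256-thread block maps onto warps and SMX residency:

```
// Minimal sketch: how a CUDA launch configuration maps onto a Kepler SMX.
// Kernel, array size, and block size are illustrative choices.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float a, int n)
{
    // Each thread handles one element; threads are issued in warps of 32.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= a;
}

int main()
{
    const int n = 1 << 20;                            // 1M elements
    float *d_data = NULL;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    // 256 threads per block = 8 warps per block.
    // With up to 2,048 resident threads per SMX, up to
    // 2048 / 256 = 8 such blocks can be resident on one SMX at once.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();

    printf("launched %d blocks of %d threads\n", blocks, threadsPerBlock);
    cudaFree(d_data);
    return 0;
}
```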
Intel’s MIC Approach
Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology and asking, "Why can't we do this in x86?" The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism, in the x86 line.
Courtesy Dan Stanzione, TACC
MIC Architecture
• Many cores on the die
• L1 and L2 cache
• Bidirectional ring network connecting the L2 caches, memory, and the PCIe connection
Courtesy Dan Stanzione, TACC
What makes an Accelerator?
[Diagram: a CPU with its own memory connected over the PCI bus to an accelerator with its own memory; the accelerator may be an NVIDIA GPU or an Intel Xeon Phi.]
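In the GPU case, that separation shows up directly in the programming model: data must be staged across the PCI bus explicitly. A minimal sketch (buffer names and sizes are illustrative; the Xeon Phi uses different offload mechanisms):

```
// Minimal sketch of the host/device split in the diagram above:
// CPU and GPU memories are separate, so data crosses the PCI bus explicitly.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void addOne(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    std::vector<float> h_x(n, 41.0f);              // lives in CPU memory

    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));  // lives in GPU memory

    // Host -> device copy: crosses the PCI bus.
    cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    addOne<<<(n + 255) / 256, 256>>>(d_x, n);      // compute runs on the GPU

    // Device -> host copy: crosses the PCI bus again.
    cudaMemcpy(h_x.data(), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_x[0] = %.1f\n", h_x[0]);             // expect 42.0
    cudaFree(d_x);
    return 0;
}
```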
Top 10 Systems in November 2012 (Just got updated last month)
#  | Site                                                | Manufacturer | Computer                                                        | Country | Cores     | Rmax [Pflops] | Power [MW]
1  | Oak Ridge National Laboratory                       | Cray         | Titan: Cray XK7, Opteron 16C 2.2GHz, Gemini, NVIDIA K20x        | USA     | 560,640   | 17.6          | 8.21
2  | Lawrence Livermore National Laboratory              | IBM          | Sequoia: BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | USA     | 1,572,864 | 16.3          | 7.89
3  | RIKEN Advanced Institute for Computational Science  | Fujitsu      | K Computer: SPARC64 VIIIfx 2.0GHz, Tofu interconnect            | Japan   | 795,024   | 10.5          | 12.66
4  | Argonne National Laboratory                         | IBM          | Mira: BlueGene/Q, Power BQC 16C 1.6GHz, Custom                  | USA     | 786,432   | 8.16          | 3.95
5  | Forschungszentrum Juelich (FZJ)                     | IBM          | JuQUEEN: BlueGene/Q, Power BQC 16C 1.6GHz, Custom               | Germany | 393,216   | 4.14          | 1.97
6  | Leibniz Rechenzentrum                               | IBM          | SuperMUC: iDataPlex DX360M4, Xeon E5 8C 2.7GHz, Infiniband FDR  | Germany | 147,456   | 2.90          | 3.52
7  | Texas Advanced Computing Center/UT                  | Dell         | Stampede: PowerEdge C8220, Xeon E5 8C 2.7GHz, Intel Xeon Phi    | USA     | 204,900   | 2.66          | —
8  | National SuperComputer Center in Tianjin            | NUDT         | Tianhe-1A: NUDT TH MPP, Xeon 6C, NVIDIA, FT-1000 8C             | China   | 186,368   | 2.57          | 4.04
9  | CINECA                                              | IBM          | Fermi: BlueGene/Q, Power BQC 16C 1.6GHz, Custom                 | Italy   | 163,840   | 1.73          | 0.82
10 | IBM                                                 | IBM          | DARPA Trial Subset: Power 775, Power7 8C 3.84GHz, Custom        | USA     | 63,360    | 1.52          | 3.57
Titan System at Oak Ridge National Laboratory (just displaced by a Chinese system using Xeon Phi)
• Upgrade of Jaguar from Cray XT5 to XK6
• Cray Linux Environment operating system
• Gemini interconnect: 3-D torus, globally addressable memory, advanced synchronization features
• AMD Opteron 6274 (Interlagos) processors
• New accelerated node design using NVIDIA multi-core accelerators: 2011: 960 NVIDIA X2090 "Fermi" GPUs; 2012: 14,592 NVIDIA "Kepler" GPUs
• 20+ PFlops peak system performance
• 600 TB DDR3 memory + 88 TB GDDR5 memory
Power Efficiency over Time
[Figure: Linpack performance per power (Gflops/kW), 2008-2012, for the TOP10, TOP50, and full TOP500, with accelerator-based and BlueGene systems plotted separately from conventional multicore systems.]
Data from: TOP500 November 2012
Courtesy Horst Simon, LBNL
Projected Performance Development
Data from: TOP500 November 2012
[Figure: projected TOP500 performance development, 1994-2020, on a log scale from 100 Mflop/s to 1 Eflop/s, showing the Sum, N=1, and N=500 trend lines.]
Courtesy Horst Simon, LBNL
Trends with ends.
[Figure: compound annual growth rate (CAGR), 1996-2012, for Rmax (Gflop/s), total cores, average clock rate per core (MHz), and memory per core (GB).]
Source: Kogge and Shalf, IEEE CISE. Courtesy Horst Simon, LBNL.
Summary
Heterogeneous computing is an important piece of the computing future. GPU accelerators are here to stay. Now, let’s see how you can use them…