
Advanced Course in High Performance Computing
(Department of Computer Science)
also offered as a code-shared class of High Performance Computing Technology (Human Biology Program)

“Case study on supercomputers”
Lecture Note #9, 2015/12/09
(lectured by Taisuke Boku) [email protected]

TOP500 List

[Figure: TOP500 performance development chart (http://www.top500.org), plotting the #1 system, the #500 system, and the sum of #1–#500 over time]

• About 1 million times faster in 20 years
• The #1 system drops off the TOP500 list in about 10 years
• The #1 system reaches the current sum of the list in about 5 years
• 1 EXAFLOPS in 2020??

Advance of supercomputers

• Improvement of the silicon process (Moore's law): x2 in 18 months
  – Transistor speed no longer scales
• Innovations in architecture
  – 1976: Vector processor architecture (Cray-1)
  – 1980: Rise of the vector computer
  – 1990: Microprocessors and parallel architecture
  – 1990s: Rise of the parallel computer
    (vector parallel, massively parallel, SMP)
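A quick sanity check of these figures (a minimal Python sketch, using only the numbers quoted above: a doubling every 18 months and the roughly million-fold TOP500 gain over 20 years) shows that process scaling alone explains only about a factor of 10^4; the remaining factor of roughly 100x has to come from architectural innovation and growing parallelism.

```python
# Rough estimate: how much of a ~1,000,000x gain in 20 years can
# process scaling (x2 every 18 months, per the slide) account for?

years = 20
doubling_period_months = 18

process_gain = 2 ** (years * 12 / doubling_period_months)  # ~1.0e4
total_gain = 1_000_000                                      # from the TOP500 trend
architecture_gain = total_gain / process_gain               # parallelism, architecture

print(f"process scaling alone : ~{process_gain:,.0f}x")
print(f"observed TOP500 gain  : ~{total_gain:,}x")
print(f"remaining factor      : ~{architecture_gain:,.0f}x")
```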

[Figure: Peak performance (FLOPS) versus year, 1970–2020, spanning 1 GFLOPS to 1 EFLOPS: roughly 10 million times faster in 30 years. Architectural eras: Vector → MPP → Cluster → Accelerator. Systems plotted include ILLIAC IV, CRAY-1, 75APU, S810/20, SX-2, X-MP, Y-MP, C90, CM-5, Numerical Wind Tunnel, SX-4, CP-PACS, ASCI Red, SR8000, ASCI White, Earth Simulator, SX-8, BlueGene/L, PACS-CS, T2K-Tsukuba, Roadrunner, Tianhe-1A, TSUBAME2.0, K computer, HA-PACS, Sequoia, and Titan, with Japanese supercomputers and Univ. of Tsukuba systems highlighted.]

Overview of HPC System (1)

• 100 MFLOPS → 1 GFLOPS (late '70s – early '80s)
  – Vector computers like the Cray-1
  – High performance from vector registers and high-bandwidth memory
  – World's first GFLOPS machine: NEC SX-2
• 10 GFLOPS → 100 GFLOPS (late '80s – early '90s)
  – Parallel vector pipelines
  – Vector computers with shared memory (several to several tens of nodes)
  – Beginning of MPP
  – 1996: #1–#3 in the TOP500 were Japanese machines
• 1 TFLOPS (late '90s)
  – Massively parallel machines built from scalar microprocessors: the ASCI machines
  – World's first TFLOPS machine: SNL ASCI Red


NWT (Numerical Wind Tunnel), 1993

• National Aerospace Laboratory of Japan (now JAXA), developed in partnership with Fujitsu
• Vector-parallel architecture with distributed memory connected by a crossbar network (166 nodes)
• #1 in the TOP500 from 1993/11 to 1995/11 (280 GFLOPS)
• The VPP500 was developed based on the NWT and continued the vector-parallel architecture

CP-PACS

• Built at CCS in U-Tsukuba by U-Tsukuba / Hitachi in 1996
• Became the world's fastest computer (TOP500 #1, 1996/11), developed by a university
• A computer for computational physics
• Improved processor with pseudo-vector processing
• Base for the Hitachi SR-2201
• 2048 CPUs, 614 GFLOPS


Overview of HPC System (2)

• 10 TFLOPS (early '00s)
  – ASCI machines
  – Earth Simulator (40 TFLOPS)
• 100 TFLOPS (mid '00s)
  – Massively parallel, energy-saving design: IBM BlueGene/L
• 1 PFLOPS ('08)
  – World's first PFLOPS machine: LANL Roadrunner
  – ORNL Jaguar

Earth Simulator

• JAMSTEC, built by NEC in 2002
• Vector-parallel computer
• TOP500 #1 from 2002/6 to 2004/6
• Large-scale weather simulation, etc.
• 8 vector processors per node share memory; nodes are connected by a single crossbar network
• Base for the SX-6
• 5120 CPUs (640 nodes), 40 TFLOPS

Blue Gene/L

• Lawrence Livermore National Lab., built by IBM in 2005
• TOP500 #1 from 2004/11 to 2007/11
• Embedded low-power processors, used in very large numbers
• Particle simulation, fluid dynamics simulation, etc.
• 65536 CPUs, 360 TFLOPS

Roadrunner

• At Los Alamos National Lab., built by IBM in 2008
• World's first PFLOPS computer: TOP500 #1 from 2008/6 to 2009/6
• Hybrid cluster: each node combines Opteron processors and IBM Cell Broadband Engines
• 129600 cores, 1.46 PFLOPS (LINPACK: 1.11 PFLOPS)


TOP4 in Nov. 2014

Rank  Name        Country  Rmax (PFLOPS)  Node architecture                # of nodes  Power (MW)
1     Tianhe-2    China    33.9           2 Xeon (12C) + 3 Xeon Phi (57C)  16000       17.8
2     Titan       U.S.A.   17.6           1 Opteron (16C) + 1 K20X (14C)   18688       8.2
3     Sequoia     U.S.A.   17.2           1 PowerBQC (16C)                 98304       7.9
4     K computer  Japan    10.5           1 SPARC64 (8C)                   82944       12.7

The top 4 have remained unchanged for two years.

Tianhe-2 (天河-2)

• National University of Defense Technology, China
• TOP500 #1 since 2013/6: 33.8 PFLOPS (efficiency 62%), 16000 nodes (2 Xeon (12 core) + 3 Xeon Phi (57 core)), 17.8 MW, 1.9 GFLOPS/W
• TH Express-2 interconnect (performance comparable to 2 x IB QDR (40 Gbps))
• CPU: Intel IvyBridge, 12 cores/chip, 2.2 GHz
• Accelerator: Intel Xeon Phi, 57 cores/chip, 1.1 GHz
• 162 racks (125 racks for compute nodes, 128 nodes/rack)

Compute node of Tianhe-2

CPU x 2: 2.2 GHz x 8 flop/cycle x 12 cores x 2 = 422.4 GFLOPS
MIC x 3: 1.1 GHz x 16 flop/cycle x 57 cores x 3 = 3.01 TFLOPS

System totals (16000 nodes):
CPU: 422.4 GFLOPS x 16000 = 6.75 PFLOPS, memory 64 GB x 16000 = 1.0 PB
MIC: 3010 GFLOPS x 16000 = 48.16 PFLOPS, memory 8 GB x 3 x 16000 = 0.384 PB
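The node and system peaks above are simple products of clock rate, flops per cycle, core count, and device count; a minimal Python sketch of that arithmetic, using only the figures on this slide, looks like this:

```python
def peak_gflops(ghz, flops_per_cycle, cores, count=1):
    """Peak performance in GFLOPS: clock x flops/cycle x cores x number of devices."""
    return ghz * flops_per_cycle * cores * count

# One Tianhe-2 compute node (figures from the slide)
cpu_node = peak_gflops(2.2, 8, 12, count=2)   # 2 x Xeon (IvyBridge)  -> 422.4 GFLOPS
mic_node = peak_gflops(1.1, 16, 57, count=3)  # 3 x Xeon Phi (KNC)    -> 3009.6 GFLOPS

nodes = 16000
print(f"node peak : {(cpu_node + mic_node) / 1e3:.2f} TFLOPS")
print(f"CPU total : {cpu_node * nodes / 1e6:.2f} PFLOPS")  # ~6.76 PFLOPS
print(f"MIC total : {mic_node * nodes / 1e6:.2f} PFLOPS")  # ~48.15 PFLOPS
```

The same formula reproduces the XK7 and K computer peaks on the following slides.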

ORNL Titan

• Oak Ridge National Laboratory
• Cray XK7
• TOP500 #1 in 2012/11: 17.6 PFLOPS (efficiency 65%), 18688 CPUs + 18688 GPUs, 8.2 MW, 2.14 GFLOPS/W
• Gemini interconnect (3D torus)
• CPU: AMD Opteron 6274, 16 cores/chip, 2.2 GHz
• GPU: NVIDIA K20X, 2688 CUDA cores (896 DP units)
• 200 racks


Node of XK7

CPU: 2.2 GHz x 4 flop/cycle x 16 cores = 140.8 GFLOPS
GPU: 1.46 GHz x 64 flop/cycle x 14 SMX = 1.31 TFLOPS

System totals (18688 nodes):
CPU: 140.8 GFLOPS x 18688 = 2.63 PFLOPS, memory 32 GB x 18688 = 0.598 PB
GPU: 1310 GFLOPS x 18688 = 24.48 PFLOPS, memory 6 GB x 18688 = 0.112 PB

LLNL Sequoia

• Lawrence Livermore National Laboratory (LLNL)
• IBM BlueGene/Q
• TOP500 #1 in 2012/6: 16.3 PFLOPS (efficiency 81%), 1.57 M cores, 7.89 MW, 2.07 GFLOPS/W
• 4 BlueGene/Q systems were listed in the top 10
• 18 cores/chip (16 used for computation), 1.6 GHz, 4-way SMT, 204.8 GFLOPS / 55 W, L2: 32 MB eDRAM, memory: 16 GB at 42.5 GB/s
• 32 chips/node card, 32 node cards/rack, 96 racks

Sequoia organization

Core: 1.6 GHz x 8 flop/cycle = 12.8 GFLOPS
Chip (node): 12.8 GFLOPS x 16 cores = 204.8 GFLOPS, 16 GB memory
Node card: 204.8 GFLOPS x 32 = 6.5 TFLOPS, 16 GB x 32 = 512 GB
Rack: 6.5 TFLOPS x 32 = 210 TFLOPS, 512 GB x 32 = 16 TB
System: 210 TFLOPS x 96 = 20.1 PFLOPS, 16 TB x 96 = 1.6 PB
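The same chain of multiplications can be folded level by level; a small Python sketch, using only the figures from the two Sequoia slides above:

```python
# Sequoia (BlueGene/Q) peak performance and memory, aggregated level by level.
# Figures from the slides: 1.6 GHz x 8 flop/cycle per core, 16 compute cores
# and 16 GB of memory per chip, 32 chips/node card, 32 node cards/rack, 96 racks.

core_gflops = 1.6 * 8            # 12.8 GFLOPS
chip_gflops = core_gflops * 16   # 204.8 GFLOPS

levels = [("node card", 32), ("rack", 32), ("system", 96)]

gflops, mem_gb = chip_gflops, 16.0
for name, factor in levels:
    gflops *= factor
    mem_gb *= factor
    print(f"{name:9s}: {gflops / 1e3:9.1f} TFLOPS, {mem_gb / 1e3:8.1f} TB")
# system: ~20132.7 TFLOPS (~20.1 PFLOPS), ~1572.9 TB (~1.6 PB)
```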

K computer

• RIKEN, built by Fujitsu in 2012
• Each system board carries 4 nodes, each with a SPARC64 VIIIfx (8 cores) and a network chip (Tofu interconnect)
• TOP500 #1 in 2011/6 and 2011/11
• 705k cores, 10.5 PFLOPS (efficiency 93%)
• LINPACK power consumption: 12.7 MW
• Green500 #6 in 2011/6 (824 MFLOPS/W); Green500 #1 was the BlueGene/Q Proto2 at 2 GFLOPS/W (40 kW)
• 864 racks


Compute nodes and network

• Compute nodes (CPUs): 88,128 SPARC64 VIIIfx
• Number of cores: 705,024 (8 cores/CPU)
• Peak performance: 11.2 PFLOPS (2 GHz x 8 flop/cycle x 8 cores x 88,128)
• Memory: 1.3 PB (16 GB/node)
• Logical 3-dimensional torus network (for programming)
  – Peak bandwidth: 5 GB/s x 2 in each direction of the logical 3D torus
  – Bisection bandwidth: > 30 TB/s
[Figure: compute node with CPU and ICC (interconnect controller); courtesy of FUJITSU Ltd.]

Cluster computers

• Clusters for HPC
  – Old clusters: the poor man's supercomputer
• From small to large scale
  – Cost effective (peak performance / cost)
  – Commodity processors and networks
• Platform for both general-purpose and application-specific use
  – 64-bit IA-32 (x86) with Linux
  – Accelerators attached via I/O
• Massively parallel
  – Scalar processors became fast, but the requirements grow even faster
  – Large-scale networks built from commodity networks
  – Accelerators are easy to add

TOP500 List (2014/11)

Architecture  Count  System share (%)  Rmax (PFlops)  Rpeak (PFlops)  Cores (million)
Cluster       429    85.8              207.0          320.8           15.5
MPP           71     14.2              101.9          132.7           7.6
Total         500    100.0             308.9          453.5           23.1

Cores per socket

Cores  Count  System share (%)  Rmax (PFlops)  Cores (million)
1-4    17     3.4               4.0            0.42
6      56     11.2              20.0           1.77
8      232    46.4              96.3           7.01
10     87     17.4              34.4           2.10
12     57     11.4              71.8           5.63
14     7      1.4               4.8            0.15
16     44     8.8               77.5           6.07

Commodity CPU

• COTS (commodity off-the-shelf) parts with high cost performance changed supercomputers from special purpose to general purpose
  – Commodity: development cost is paid by consumers all over the world
  – Vector computers / MPP: the users pay the development cost
• CPU performance is determined by
  – Clock frequency
  – Multicore
  – SIMD


Commodity network

• Classical commodity network
  – Ethernet: 10Base, 100Base, 1000Base, 10GBase
  – High cost performance in bandwidth, but latency is not good enough
  – Basically tree topology, low scalability
• SAN (System Area Network)
  – Myrinet, InfiniBand, Quadrics, …
  – Both high bandwidth and low latency, but expensive
  – Clos network, fat-tree network
  – Cost has dropped greatly; SAN has become a commodity
  – On-board InfiniBand NICs instead of on-board Ethernet NICs

Scalability in clusters

• Advances in general-purpose I/O buses
  – PCI → PCI-X → PCI-Express (Gen2 → Gen3)
  – Parallel link → multiple high-speed serial links
  – Direct links from the CPU: HyperTransport, QuickPath
• Hardware accelerators
  – ClearSpeed: TITECH TSUBAME1.2
  – Cell Broadband Engine: LANL Roadrunner
  – GRAPE: U-Tsukuba FIRST cluster
  – GPGPU: TSUBAME2.0, HA-PACS, …

Supercomputers in Japan (2014/11)

Rank  Name         Site                                                             Vendor           Rmax (GFlops)  Rpeak (GFlops)
4     K computer   RIKEN Advanced Institute for Computational Science (AICS)        Fujitsu          10510000       11280384
15    TSUBAME2.5   GSIC Center, Tokyo Institute of Technology                       NEC/HP           2785000        5735685
38    Helios       International Fusion Energy Research Centre (IFERC)              Bull SA          1237000        1524096
48    Oakleaf-FX   The University of Tokyo                                          Fujitsu          1043000        1135411
49    QUARTETTO    Kyushu University                                                Hitachi/Fujitsu  1018000        1502236
63    Aterui       National Astronomical Observatory of Japan                       Cray Inc.        801400         1058304
70    COMA         CCS, University of Tsukuba                                       Cray Inc.        745997         998502
86    —            Central Research Institute of Electric Power Industry (CRIEPI)   SGI              582100         670925
91    SAKURA       KEK                                                              IBM              536663         629146
92    HIMAWARI     KEK                                                              IBM              536663         629146
117   HA-PACS      CCS, University of Tsukuba                                       Cray Inc.        421600         778128

Trend of large-scale HPC clusters

• CPU performance
  – Multi-core (8-16 cores/CPU)
  – SIMD instructions (AVX: 8 flops/clock)
• Interconnect
  – InfiniBand (or other SAN) has become popular in HPC, and its cost has decreased
  – Large switches (648 ports/rack)
  – Optical cables
  – InfiniBand SDR → DDR → QDR (40 Gbps) → FDR (56 Gbps) → EDR (100 Gbps)
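The quoted rates are 4x-link signalling rates; the usable data rate also depends on the line encoding (8b/10b for SDR/DDR/QDR, 64b/66b for FDR/EDR, which are standard values not stated on the slide). A short Python sketch of the conversion:

```python
# Per-direction data bandwidth of a 4x InfiniBand link, derived from its
# signalling rate and line-encoding overhead. Encoding factors are the
# standard 8b/10b and 64b/66b values; only QDR/FDR/EDR rates appear on the slide.

generations = {
    "SDR": (10,  8 / 10),
    "DDR": (20,  8 / 10),
    "QDR": (40,  8 / 10),
    "FDR": (56, 64 / 66),
    "EDR": (100, 64 / 66),
}

for name, (signal_gbps, encoding) in generations.items():
    data_gbps = signal_gbps * encoding
    print(f"{name}: {signal_gbps:3d} Gbps signalling -> "
          f"{data_gbps:5.1f} Gbps = {data_gbps / 8:4.1f} GB/s per direction")
```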


Heterogeneous Computing Platform

• Parallel systems whose nodes have both a CPU and an accelerator
  – Basic style: cluster nodes with accelerators
    • ClearSpeed
    • GRAPE
    • Cell Broadband Engine
    • GPGPU (General-Purpose computing on Graphics Processing Units)
    • Xeon Phi (many-core processor)
  – Accelerators used to be limited to special-purpose applications
  – 2008/06: Roadrunner at LANL exceeded 1 PFLOPS in LINPACK with an Opteron + Cell BE heterogeneous configuration (peak: 1.3 PFLOPS)
  – “Hybrid computing” usually means combining shared memory and distributed memory; here, “heterogeneous computing” means combining CPUs and accelerators

TSUBAME2.0

• TITECH, built by NEC/HP in 2010
• Each node has 2 x Xeon X5670 (6 cores) and 3 x NVIDIA M2050 GPUs, connected by InfiniBand QDR
• TOP500 #4 in 2010/11
• 1400 nodes, 1.2 PFLOPS
• LINPACK power consumption: 1.4 MW
• Green500 #2 in 2010/11 (958 MFLOPS/W)
• TSUBAME2.5: GPUs upgraded to K20X, TOP500 #11 in 2013/11
• TSUBAME-KFC: 4.5 GFLOPS/W, Green500 #1 in 2013/11

Problems of commodity CPUs

• Computation performance
  – CPU frequency is limited, so sequential performance will not improve much
  – Process technology still advances, so the number of gates keeps increasing (until when?)
  – Peak performance grows by increasing the number of cores
• Memory bandwidth (rich vector vs. poor scalar)
  – The gap between CPU FLOPS and memory bandwidth keeps widening
  – So does the gap between peak performance (LINPACK etc.) and effective performance (non-cache-aware programs)

Balance of CPU : Memory : Network

System           C : M : N (GFLOPS : GB/s : GB/s)   C : M : N (M = 1.0)
CP-PACS          0.3 : 1.2 : 0.3                     0.25 : 1 : 0.25
Earth Simulator  64 : 256 : 12.5                     0.25 : 1 : 0.05
PACS-CS          5.6 : 6.4 : 0.75                    0.90 : 1 : 0.1
T2K-Tsukuba      147.2 : 42.7 : 8                    3.50 : 1 : 0.2

Small C and large N relative to M means a bandwidth-rich system.
Earlier vector computers provided about “4 Byte/FLOP” (C : M = 0.25 : 1 in the table).
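The normalized ratios in the right-hand column are just each system's memory and network bandwidth divided by its compute rate. A small Python sketch of that computation, using only the C, M, N values from the table:

```python
# Byte/FLOP balance: bandwidth (GB/s) per unit of compute (GFLOPS),
# using the C : M : N figures from the table above.

systems = {
    "CP-PACS":         (0.3,   1.2,   0.3),
    "Earth Simulator": (64.0,  256.0, 12.5),
    "PACS-CS":         (5.6,   6.4,   0.75),
    "T2K-Tsukuba":     (147.2, 42.7,  8.0),
}

for name, (c_gflops, m_gbs, n_gbs) in systems.items():
    mem_bpf = m_gbs / c_gflops  # memory bytes per flop
    net_bpf = n_gbs / c_gflops  # network bytes per flop
    print(f"{name:16s}: memory {mem_bpf:5.2f} B/FLOP, network {net_bpf:5.3f} B/FLOP")
# CP-PACS and the Earth Simulator provide ~4 B/FLOP of memory bandwidth
# (the classic vector-machine balance); T2K-Tsukuba provides less than 0.3 B/FLOP.
```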


Bottleneck of memory bandwidth

• HPC applications require memory bandwidth
  – Fluid dynamics, weather forecasting
  – QCD (Quantum Chromodynamics)
  – FFT
  – Particle simulation with long-range interactions
  – …
• These applications do not fit the current multi-core architecture
  – The required Byte/FLOP ratio is not provided
  – On-chip storage such as caches and registers is not large enough for the applications' working sets
• The key is data localization, but it is not always applicable
  – Cache tuning (blocking, cache-awareness)

Future of system architecture

• Cluster systems
  – Multi-core / multi-socket / multi-NIC
    • Programming and performance tuning become more complicated because of the memory hierarchy and multiple network interfaces
    • Programming targets a hybrid architecture of shared memory (multi-core + multi-socket) and distributed memory (interconnect)
      → currently the user must program explicitly for shared memory (e.g. OpenMP) and distributed memory (e.g. MPI)
      → new programming paradigms and compiler technology are required
  – Ever larger clusters are possible, but there are several limitations
    • Space: especially in Japan
    • Power: power per core decreases, but power per node stays roughly fixed
    • Cooling: cheap clusters become a bottleneck, …
→ Is there a limit to performance improvement by large-scale clusters?

Expectations for accelerators

• From special purpose to (semi-)general purpose
  – While GRAPE has special-purpose arithmetic units, SIMD instructions drive a general arithmetic pipeline
  – For GPGPU, standard programming tools are available (NVIDIA CUDA, PGI compiler for GPGPU)
  – Better effective performance per watt than CPUs (roughly 4x):
    • Intel Xeon E5645: 57.6 GFLOPS / 80 W = 0.72 GFLOPS/W
    • NVIDIA M2090: 665 GFLOPS / 225 W = 3.0 GFLOPS/W
    • Intel SandyBridge: 160 GFLOPS / 115 W = 1.4 GFLOPS/W
• Memory access bandwidth
  – GPGPUs have high memory bandwidth
    • NVIDIA M2090: 177 GB/s (ECC off)
    • Intel Xeon E5645: 32 GB/s, Intel SandyBridge: 51.2 GB/s
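The GFLOPS/W numbers above are simply peak performance divided by power; a short Python sketch reproducing them and the roughly 4x GPU-versus-CPU gap marked on the slide:

```python
# Peak performance per watt, from the peak GFLOPS and power figures on the slide.

devices = {
    "Intel Xeon E5645":  (57.6,  80.0),
    "NVIDIA M2090":      (665.0, 225.0),
    "Intel SandyBridge": (160.0, 115.0),
}

efficiency = {name: gflops / watts for name, (gflops, watts) in devices.items()}
for name, gflops_per_w in efficiency.items():
    print(f"{name:17s}: {gflops_per_w:4.2f} GFLOPS/W")

ratio = efficiency["NVIDIA M2090"] / efficiency["Intel Xeon E5645"]
print(f"GPU vs. Xeon E5645: about {ratio:.1f}x")  # ~4.1x
```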

Problems of current accelerators

• Bandwidth between CPU and device
  – Connected via an external bus, usually PCI-Express
    → PCI-E Gen3 x16: 8 GB/s, less than one channel of DDR3
  – Example: a GPGPU can exceed 70 GFLOPS in the Himeno benchmark
    → but only when running entirely on a single GPGPU
    → performance degrades greatly when memory transfers are required between CPU and GPU, or between GPUs when using multiple GPUs in a node
• Bandwidth between GPUs
  – Data transfer between GPUs on different nodes requires three hops of memory copies
• Registers, on-chip memory, and SIMD instructions also somewhat limit applications


Future of parallel computer systems

• Largest problem: power consumption
  – To realize Exa-FLOPS, power consumption per core has to be greatly reduced
    • The limits of the semiconductor process
    • Leakage current becomes relatively larger, and DVFS (Dynamic Voltage and Frequency Scaling) is limited
    • Breakthroughs are needed at the semiconductor process and circuit levels
• Large-scale interconnection networks
  – The fat-tree used in current clusters will reach its limit
    • Direct networks: the merit of neighbor communication is lower power
    • Applications must improve their algorithms to avoid long-distance communication
• Reliability
  – Dependability: MTBF with millions of processors gets worse → dynamic fault tolerance is required
  – Failure recovery technology for massive parallelism, beyond the classical checkpoint/restart method

Green500 (Nov. 2014)

[Figure: Green500 list as of Nov. 2014]

Supercomputers in CCS

• PAX series
  – CP-PACS
  – PACS-CS
• FIRST
• T2K Tsukuba
• HA-PACS
  – base cluster
  – TCA

History of the parallel computer PAX (PACS) series at U-Tsukuba

Started by Prof. Hoshino and Prof. Kawai

Year  Name     Performance
1978  PACS-9   7 KFLOPS
1980  PACS-32  500 KFLOPS
1983  PAX-128  4 MFLOPS
1989  QCDPAX   14 GFLOPS
1996  CP-PACS  614 GFLOPS
2006  PACS-CS  14.3 TFLOPS
2012  HA-PACS  800 TFLOPS
2014  COMA     1.001 PFLOPS

• Cooperation between computational scientists and computer engineers
• Target performance driven by applications
• Continuous development with accumulated experience


CP-PACS (1996, Univ. of Tsukuba)

• First large-scale massively parallel supercomputer developed in Japan
  – Scalar processors with pseudo-vector processing
  – Flexible, high-performance network
• Collaboration between physics and computer science
• Collaboration between the university and a vendor (Hitachi); Hitachi developed the SR-2201 based on CP-PACS
• Scientific breakthroughs in particle physics and astrophysics
  – First-principles calculations for QCD
  – General simulation model for fields (fluid, electromagnetic field, wave function, etc.)

Architecture of CP-PACS

[Figure: pseudo-vector processing and 3D hyper-crossbar network]

PACS-CS

• CCS in U-Tsukuba, built by U-Tsukuba / Hitachi / Fujitsu in 2006
• PC cluster + multi-link Ethernet (3D hyper-crossbar)
• For computational sciences
• 2560 CPUs, 14.4 TFLOPS
• Service ended in September 2011

Unit chassis (19 inch x 1U) [photo]

3D-HXB (16 x 16 x 10 = 2560 nodes) [figure]

FIRST

• CCS in U-Tsukuba, built by U-Tsukuba / HP / Hamamatsu Metrics in 2005
• Hybrid cluster: each node has a dual-socket Xeon and a GRAPE-6 board (Blade-GRAPE)
• For computational astrophysics
• 256 nodes / 512 cores + 1024 GRAPE-6 chips
• Host: 3.1 TFLOPS; Blade-GRAPE: 33 TFLOPS

Blade-GRAPE
Accelerator for gravity calculation in FIRST (embedded in each node)

• Full-size PCI card occupying 2 PCI slots
• 10-layer board
• 4 GRAPE-6 chips = 136.8 GFLOPS
• Electric power: 54 W
• Memory: 16 MB (260K particles)
• Implemented by Hamamatsu Metrics Co.

Interconnection Network [photo]


T2K Tsukuba

• CCS in U-Tsukuba, built by Appro International + Cray Japan in 2008
• Commodity-based PC cluster with high node performance and network performance
• For computational sciences
• 648 nodes = 10368 CPU cores, 95 TFLOPS
• Service ended in February 2014
• T2K Open Supercomputer Alliance: University of Tsukuba, University of Tokyo, Kyoto University

Computation nodes and file server

Computation nodes (70 racks):
• 648 nodes (quad-core x 4 sockets / node), Opteron Barcelona 8000-series CPUs
• 2.3 GHz x 4 flop/cycle x 4 cores x 4 sockets = 147.2 GFLOPS / node = 95.3 TFLOPS / system
• 20.8 TB memory / system

File server (disk array):
• 800 TB (physical 1 PB), RAID-6
• Lustre cluster file system, InfiniBand x 2
• Dual MDS and OSS configuration → high reliability

Block diagram of a node [figure]

InfiniBand 4x DDR x 4-rail fat-tree [figure]


HA-PACS

• CCS in U-Tsukuba, built by Appro in 2012
• Commodity-based PC cluster with multiple GPU accelerators per node
• For computational sciences
• 268 nodes = 4288 CPU cores + 1072 GPUs, 802 (= 89 + 713) TFLOPS
• 40 TB memory

Rack configuration

• Computation nodes: Appro Green Blade 8204 (8U enclosure, 4 nodes); 268 nodes (67 enclosures / 23 racks), 800 TFLOPS
• Interconnect: Mellanox IS5300 (QDR IB, 288 ports) x 2
• Login/management nodes: Appro Green Blade 8203 x 8, 10 GbE interface
• Storage: DDN SFA10000, connected via QDR IB, Lustre file system, user area: 504 TB
• Total: 26 racks

Computation node (HA-PACS base cluster)

• CPU: 2 x SandyBridge Xeon (2.6 GHz x 8 flop/clock with AVX = 20.8 GFLOPS/core): 20.8 GFLOPS x 16 cores = 332.8 GFLOPS
• GPU: 4 x NVIDIA M2090: 665 GFLOPS x 4 = 2660 GFLOPS
• Node total: about 3 TFLOPS
• Host memory: (16 GB, 12.8 GB/s) x 8 = 128 GB, 102.4 GB/s
• GPU memory: (6 GB, 177 GB/s) x 4 = 24 GB, 708 GB/s
• CPU-GPU connection: PCI-Express, 8 GB/s

HA-PACS Project

• HA-PACS (Highly Accelerated Parallel Advanced system for Computational Sciences)
  – 8th generation of the PAX/PACS series of supercomputers
• Promotion of computational science applications in key areas of our Center
  – Target fields: QCD, astrophysics, QM/MM (quantum mechanics / molecular mechanics, bioscience)
• HA-PACS is not only a commodity GPU-accelerated PC cluster but also an experimental platform for direct communication among accelerators
• Two parts:
  – HA-PACS base cluster: for development of GPU-accelerated code for the target fields and for production runs; in operation since Feb. 2012
  – HA-PACS/TCA (TCA = Tightly Coupled Accelerators): for elementary research on new technology for accelerated computing, using our original communication chip named “PEACH2”; 64 nodes installed in Oct. 2013


HA-PACS/TCA (Tightly Coupled Accelerators)

• PEACH2
  – 4 ports of PCI Express Gen2 x8 lanes
  – Direct connection between accelerators (GPUs) across nodes
  – Hardwired main data path and PCIe interface fabric
• True GPU-direct communication
  – Current GPU clusters require 3-hop communication (3-5 memory copies)
  – For strong scaling, an inter-GPU direct communication protocol is needed for lower latency and higher throughput
[Figure: two nodes, each with CPUs, GPUs, memory, an IB HCA connected to an IB switch, and a PEACH2 chip; PEACH2 links the nodes' PCIe fabrics directly so GPUs can communicate without going through host memory and InfiniBand]
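To see why the hop count matters for strong scaling, here is a toy Python model of a store-and-forward transfer, t = hops x (per-hop latency + message size / per-hop bandwidth). The latency and bandwidth values are illustrative placeholders, not measured HA-PACS/TCA figures; only the hop counts (3 copies via host memory versus a single direct transfer) come from the slide.

```python
# Toy store-and-forward model of inter-node GPU-to-GPU transfer time.
# latency_us and bandwidth_gbs are illustrative assumptions, not measurements;
# the hop counts (3 copies via host memory vs. 1 direct hop) are from the slide.

def transfer_time_us(size_bytes, hops, latency_us=2.0, bandwidth_gbs=4.0):
    per_hop_us = latency_us + size_bytes / (bandwidth_gbs * 1e3)  # GB/s -> bytes/us
    return hops * per_hop_us

for size in (1 << 10, 1 << 16, 1 << 22):  # 1 KiB, 64 KiB, 4 MiB
    via_host = transfer_time_us(size, hops=3)
    direct = transfer_time_us(size, hops=1)
    print(f"{size:>8} B: via host {via_host:8.1f} us, direct {direct:8.1f} us "
          f"({via_host / direct:.1f}x)")
# In this simple model the via-host path is ~3x slower at every message size;
# in practice the latency penalty matters most for the small messages of strong scaling.
```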

PEACH2 board (production version for HA-PACS/TCA)

• Main board + sub board
• Most of the logic operates at 250 MHz (the PCIe Gen2 logic runs at 250 MHz)
• FPGA: Altera Stratix IV 530GX
• DDR3 SDRAM
• PCI Express x8 card edge, PCIe x16 cable connector, PCIe x8 cable connector
• Power supply for various voltages

TCA node structure

• The CPUs can uniformly access all GPUs (single PCIe address space)
• PEACH2 can access every GPU
  – Kepler architecture + CUDA 5.0 “GPUDirect Support for RDMA”
  – Performance across QPI is quite bad → supported only for GPU0 and GPU1
• Connection among 3 nodes (via the PEACH2 ports)
• This configuration is the same as the HA-PACS base cluster except for PEACH2
  – All 80 PCIe lanes provided by the two CPUs are used
• GPU: NVIDIA K20X (Kepler architecture)
[Figure: node block diagram with two Xeon E5 CPUs connected by QPI, four GPUs on PCIe Gen2 x16, the PEACH2 board on PCIe Gen2 x8 links, and an IB HCA on PCIe Gen3 x8]

HA-PACS system [photo]

• Base cluster
• TCA: 5 racks x 2 rows


HA-PACS/TCA (computation node)

• CPU: 2 x Ivy Bridge (2.8 GHz x 8 flop/clock with AVX = 22.4 GFLOPS/core): 22.4 GFLOPS x 20 cores = 448.0 GFLOPS
• GPU: 4 x NVIDIA K20X: 1.31 TFLOPS x 4 = 5.24 TFLOPS
• Node total: 5.688 TFLOPS
• Host memory: (16 GB, 14.9 GB/s) x 8 = 128 GB, 119.4 GB/s (4 channels at 1,866 MHz = 59.7 GB/s per CPU)
• GPU memory: (6 GB, 250 GB/s) x 4 = 24 GB, 1 TB/s
• Each GPU attached via PCIe Gen2 x16 (8 GB/s); PEACH2 board (TCA interconnect) via PCIe Gen2 x8; legacy devices on the remaining lanes

HA-PACS total system

• InfiniBand QDR, 40 ports x 2 channels, links the base cluster and TCA
• HA-PACS base cluster: 268 nodes, 2 x InfiniBand QDR 324-port switches
  – 421 TFLOPS, efficiency 54%, TOP500 #41 in 2012/6 and #73 in 2013/11, 1.15 GFLOPS/W
• HA-PACS/TCA: 64 nodes, 2 x InfiniBand QDR 108-port switches
  – 277 TFLOPS, efficiency 76%, TOP500 #134 in 2013/11, 3.52 GFLOPS/W (Green500 #3 in 2013/11)
• Shared Lustre file system

COMA (PACS-IX)

• Installed 2014/4
• Cray CS300
• Intel Xeon Phi (KNC: Knights Corner)
• 393 nodes (2 x Xeon E5-2670v2 + 2 x Xeon Phi 7110P per node)
• Mellanox InfiniBand FDR, fat tree
• File server: DDN, 1.5 PB (RAID6 + Lustre)
• 1.001 PFLOPS (HPL: 746 TFLOPS), TOP500 #51 in June '14
• HPL efficiency: 74.7%

What is COMA?

• Cluster Of Many-core Architecture processors
• COMA = Coma Cluster
  – a large cluster of galaxies containing over 1,000 identified galaxies
  – a galaxy is a cluster of stars (= many cores)
  – a cluster of galaxies (= the cluster)
• COMA is also the 9th machine of the PACS series, so it is also called “PACS-IX”


COMA (PACS-IX) compute node (Cray 1U 1027GR)

[Photo callouts: Intel Xeon Phi 7110P, Intel Xeon E5-2670v2 (IvyBridge core), Mellanox Connect-X3 IB FDR, SATA HDD (3.5 inch, 1 TB x 2), power supply]

COMA node configuration

[Figure: two Xeon E5 CPUs connected by QPI; each hosts one Xeon Phi (MIC0 / MIC1, 1054 GFLOPS each) over PCIe Gen2 x16; the IB HCA is attached via PCIe Gen3 x8]

Power consumption

• In LINPACK (HPL) power efficiency, COMA (2014) is about 20 times more efficient than T2K-Tsukuba (2008), a span of 6 years
• What is the efficiency for real applications?

System             LINPACK (TF)  LINPACK power (kW)  Average power (kW)
T2K-Tsukuba        76.5          671.8               420
HA-PACS (GPU)      421           407.3               250
HA-PACS/TCA (GPU)  277           93.0                34
COMA (MIC)         746           264.8               215 (at avg. usage 60%)