euretile hw platforms piero vicini - infn roma castness’11 - january 2011 - rome call...

EURETILE HW platforms

Piero VICINI - INFN Roma

CASTNESS’11 - January 2011 - Rome

Call FP7-ICT-2009-4, Objective ICT-2009.8.1, FET (Future and Emerging Technologies): Concurrent Tera-device Computing

INFN contribution to EURETILE project

• Medium/Long term INFN Objectives

• HPC systems dedicated to scientific applications (more than 20 years of “history” and several generations of APE machines): design and use

• Optimize well-established application kernels (LQCD) and explore new challenging applications (neural network modeling, Bio-Computing, Gravitational wave analysis, Complex systems…)

• Scaling to (multi-)petaflops parallel systems requires a scalable interconnection network with 10^3-10^5 network routers

• DNP (Distributed Network Processor) refinement and evolution

• Analyze collective behavior of a “huge” number of interconnected DNP-based computing nodes (deadlock, starvation, throughput efficiency,…)

• Add fault tolerance capabilities to limit the impact of link failures on the network

• Add “brain inspired” features to explore new programming models and to boost computing performance

SHAPES legacy

• A suitable INFN APE computing engine

• (Sh)ApOtto multi-tile (8+) processor, 40(+) GFlops, 10W

• 8(+) RISC+VLIW_FP Core + DNP based network

• SoC based on “single tile replica”, allowing the number of tiles to increase with newer silicon processes

• Multi-chip high density system:

• 1K (Sh)ApOtto, 40 TFlops, 20 kW, 200 kEuro per rack

• Enhanced/New programming model, semi-automated application mapping software, HW-dependent Light OS

• But…

• We need 3-5 MEuro for NRE (chip, mechanics, manpower…)

• We need strong partnership with silicon foundry

• Risky investment, with mass production 3-5 years from now

• …and last but not least…

• technology is growing fast and people learned the lesson…

• Pump up flops/W, flops/Euro, flops/m^3

• The race is still open but the current situation doesn’t allow us to start NOW and successfully compete with emerging “commodity” hardware

[Figure: TeraMotherBoard layout. A 4 x 8 grid of M8+ modules, labeled (0,0) to (3,7), with DC/DC converters; back connectors area (Power Supply); 3DT connectors area for TeraMotherBoard stacking; front connectors area (I/O).]

HPC Emerging “commodity”: (GP)GPU

                   Xeon X5670  Opteron 8439  ATI HD5870  Tesla C1060  Tesla C2070
# of cores                  6             6        1600          240          448
GFlops (SP)               140           134        2720          933         1030
GFlops (DP)                70            67         544           78          515
TDP (Watt)                 95           105         188          188          247
Price (Euro)             1600          2000         400         1500       < 2000
GFlops (DP)/Euro         0.04          0.03        1.36         0.05       > 0.26
GFlops (DP)/Watt         0.74          0.64        2.89         0.41         2.09
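The two derived rows of the table follow from the double-precision peak, the TDP and the price. A minimal sketch that recomputes them is given here; the dictionary layout and the function name `efficiency` are illustrative only, and the Tesla C2070 price is taken at its quoted upper bound (2000 Euro), so its GFlops/Euro value is a lower bound.

```python
# Figures copied from the comparison table (double-precision peak, TDP, price).
# The C2070 price is the "< 2000" upper bound, so its GFlops/Euro is a lower bound.
devices = {
    "Xeon X5670":   {"dp_gflops": 70,  "tdp_w": 95,  "price_eur": 1600},
    "Opteron 8439": {"dp_gflops": 67,  "tdp_w": 105, "price_eur": 2000},
    "ATI HD5870":   {"dp_gflops": 544, "tdp_w": 188, "price_eur": 400},
    "Tesla C1060":  {"dp_gflops": 78,  "tdp_w": 188, "price_eur": 1500},
    "Tesla C2070":  {"dp_gflops": 515, "tdp_w": 247, "price_eur": 2000},
}


def efficiency(spec):
    """Return (GFlops per Euro, GFlops per Watt) from the DP peak."""
    per_euro = spec["dp_gflops"] / spec["price_eur"]
    per_watt = spec["dp_gflops"] / spec["tdp_w"]
    return per_euro, per_watt


ratios = {name: efficiency(spec) for name, spec in devices.items()}

for name, (per_euro, per_watt) in ratios.items():
    print(f"{name:13s} {per_euro:5.2f} GFlops/Euro  {per_watt:5.2f} GFlops/W")
```

Running the same calculation with the single-precision peaks gives different ratios, which suggests the table's efficiency rows refer to double precision.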

Nvidia Fermi (Tesla 20xx)

• 3 × 10^9 transistors

• ~500 cores, 1 TF SP, 0.5 TF DP

• 6 GB external memory (150 GB/s)

• ~250W, <2K Euro

… it’s always a matter of “brute force”…

• General Purpose Graphics Processing Unit

• Impressive peak performance

• TFlops per chip

• Videogames market i.e. 10 G$/yr

• Two main competitors (Nvidia, ATI)

• Architecture and characteristics fit very well with LQCD requirements

• Many-Core (>>100) SIMD-like architecture

• Single core specialized for data parallel floating point computation

• High local memory bandwidth

• “Green”: high Flops/W ratio

• Cost effective: high Flops/$ ratio

LQCD & GPU

• Story begins with video games… (Egri, Fodor et al. 2006)

• Wilson-Dirac operator at 120 Gflops (K. Ogawa 2009)

• Domain Wall fermions (Tsukuba/Taiwan 2009)

• Definitive work: QUDA lib (M.A. Clark et al. 2009)

• Double, Single, Half-precision

• Half-prec solver with reliable updates > 100 Gflops

• MIT/X11 Open Source License

• INFN codes development

• 2D Spin models (Di Renzo et al, 2008)

• LQCD Stag. fermions on Chroma (Cossu, D'Elia et al, 2009) with impressive results for single GPU: 1 cpu+C1060 = 1.5 apeNEXT crate (!)

• But one GPU is not enough, we need to scale up to 100-1000

• Many levels of parallelism are needed:

• Intra-GPU, i.e. efficient single-GPU codes

• Intra-node, i.e. efficient hardware to support GPU-to-GPU communication in the same host

• Inter-node, i.e. low latency, high bandwidth network optimized to support RDMA, first neighbors comms,…

Embedded system emerging (…) architectures: ARM + accelerators

Have you ever heard of similar architectures? ;-)

NVidia Tegra: multi-ARM + specialized audio/video/graphics accelerators

Freescale i.MX6 platform

TI DaVinci: (multi) ARM + DSP

“An ARM processor coupled with an NVIDIA GPU represents the computing platform of the future.” W. Dally, Nvidia Chief Scientist

Project Denver, Jan 5 announcement: NVIDIA CPU running the ARM instruction set, integrated on the same chip as the NVIDIA GPU

Next generation FPGA

• Latest FPGA-based systems are the ideal hardware to prototype significant components of the EURETILE reference platform

• Two main FPGA families: ALTERA STRATIX V and XILINX VIRTEX 7

• 28nm, introduction during 2011

• TFlops performance, (multi-)Terabit I/O bandwidth, hardwired µP cores

Altera Embedded Initiative

Xilinx Extensible Processing Platform

DNP: Distributed Network Processor

• DNP: 3D Torus network controller

• packet-based direct network with 2D/3D torus topology.

• fixed-size packet envelope (header + footer)

• auto-routing using dimension-order static routing, with deadlock avoidance.

• Error detection via EDAC/CRC at packet level.

• RDMA capabilities, PUT and GET, are implemented at the firmware level.

• SystemC models, VHDL (synthesizable) code, AMBA Interface (SHAPES), PCI Express Interface

• Implementation on FPGA and “almost” tape-out on ASIC

• DNP enhancements in EURETILE

• Introduce fault tolerance hardware capabilities

• link self-diagnostic

• new (fault tolerant) routing algorithms

• Exploration of “on-chip” DNP-ASIP integration and optimization of “off-chip” DNP-GPU

• Explore brain-inspired features (multicast,…)
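The dimension-order static routing mentioned above can be illustrated with a minimal sketch on a 3D torus: the packet first corrects its X coordinate, then Y, then Z, taking the shorter way around each ring. The function names, the X-Y-Z ordering and the tie-breaking rule (positive direction when both ways are equally long) are assumptions made for illustration, not details of the DNP implementation, and the sketch does not model the deadlock-avoidance mechanism.

```python
def torus_step(src, dst, size):
    """Single hop (+1, -1 or 0) along one ring of `size` nodes, shortest way.

    Ties (distance exactly size/2) go in the positive direction; this
    tie-break is an assumption of the sketch.
    """
    delta = (dst - src) % size
    if delta == 0:
        return 0
    return 1 if delta <= size // 2 else -1


def dimension_order_route(src, dst, dims):
    """Nodes visited from src to dst, resolving one axis at a time (X, Y, Z)."""
    cur = list(src)
    path = [tuple(cur)]
    for axis, size in enumerate(dims):
        while cur[axis] != dst[axis]:
            step = torus_step(cur[axis], dst[axis], size)
            cur[axis] = (cur[axis] + step) % size  # wrap around the ring
            path.append(tuple(cur))
    return path


if __name__ == "__main__":
    # 4 x 4 x 4 torus: the X hop uses the wrap-around link 0 -> 3.
    for node in dimension_order_route((0, 0, 0), (3, 1, 2), (4, 4, 4)):
        print(node)
```

Because the path depends only on source and destination, every node can compute the next hop locally from the packet header, which is what makes the scheme "static".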

[Figure: DNP/APEnet+ block diagram. Router with 7x7 ports switch, routing logic and arbiter; six torus links (X+, X-, Y+, Y-, Z+, Z-); TX/RX FIFOs & logic; PCIe X8 Gen2 core (8 lanes @ 5 Gbps); NIOS II processor; collective communication block; memory controller with DDR3 module; 128 bit @ 250 MHz bus; 100/1000 Eth port.]

Custom PC Cluster Network: APEnet+

• APEnet+: DNP on an FPGA-based PCI Express card providing a 3D torus network for PC clusters

• FPGA-based (Altera Stratix IV) card with PCIe form factor

• Single slot width, 4 torus links, 2D torus topology.

• Secondary piggy-back card, resulting in a double slot width, 6 links, 3D torus topology.

• Embedded NIOS processor to support RDMA operations

• FPGA (Stratix IV EP4SGX2xx) synthesis results:

• PCIe x8 Gen2 host interface (peak 4+4 GB/s)

• 6 fully bidirectional torus links, 34 Gb/s in each direction (~400 Gb/s aggregate bandwidth), each on 4 lanes using QSFP+ interconnect mechanics

• Internal Clk up to 210 MHz

• 128 bit word size crossbar switch

• Resource usage on Stratix IV EP4SGX290:

• 15% Logic Elements, 20% registers, 50% internal memory

• For next generation FPGA (28nm) these numbers become negligible!!

• Preliminary estimation gives 3% LE, 4% regs, 15% mems

• Deliverables

• 3-channel prototype board for links electrical characterization and firmware development completed and tested in 2010

• APEnet+ board (6 channels) design completed in 2010

• 4 APEnet+ boards in 1Q 2011
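The APEnet+ bandwidth figures quoted in the synthesis results above can be cross-checked with a few lines of arithmetic. This is a rough sketch: the variable names are illustrative, and the only number not taken from the slide is the 8b/10b line encoding of PCI Express Gen2 (5 GT/s per lane, 80% payload efficiency).

```python
# PCIe x8 Gen2 host interface: 8 lanes at 5 GT/s, 8b/10b line encoding.
PCIE_LANES = 8
PCIE_GEN2_GT_PER_S = 5.0
PCIE_ENCODING_EFFICIENCY = 8 / 10

pcie_gbits_per_dir = PCIE_LANES * PCIE_GEN2_GT_PER_S * PCIE_ENCODING_EFFICIENCY
pcie_gbytes_per_dir = pcie_gbits_per_dir / 8  # bits -> bytes

# Torus side: 6 links, 34 Gb/s in each direction, 4 lanes per link.
TORUS_LINKS = 6
LINK_GBPS_PER_DIR = 34
LANES_PER_LINK = 4

torus_aggregate_gbps = TORUS_LINKS * LINK_GBPS_PER_DIR * 2  # both directions
lane_gbps = LINK_GBPS_PER_DIR / LANES_PER_LINK

print(f"PCIe x8 Gen2: {pcie_gbytes_per_dir:.1f} GB/s per direction")
print(f"Torus aggregate: {torus_aggregate_gbps} Gb/s")
print(f"Per-lane rate: {lane_gbps} Gb/s")
```

The results agree with the quoted "4+4 GB/s" host interface and the "~400 Gb/s" aggregate torus bandwidth.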

QUOnG as the EURETILE HPC platform demonstrator

• QUantum chromodynamics ON Gpu

• PC clusters accelerated with high-end GPU and interconnected via 3D torus network APEnet+

• Added value: tight integration between accelerators (GPU) and custom/reconfigurable network (DNP on FPGA) allows computing efficiency gain

• Production and deployment of green, cost-effective medium/large systems in 2011

• Elementary unit:

• multi-core INTEL (packed in 2 1U rackable systems)

• S2070 FERMI GPU system (4 TFlops)

• 2 APEnet+ boards

• 42U rack system:

• 60 TFlops/rack peak

• 25 kW/rack (i.e. 0.4 kW/TFlops)

• 300 k€/rack (i.e. 5 k€/TFlops)

• We will leverage the QUOnG system to demonstrate the EURETILE HPC platform.

• Similar mechanics/topology but enhanced DNP (fault tolerant, brain inspired,…) on network board

• HDS + DOL/DAL programming model
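The per-rack ratios quoted for QUOnG follow directly from the three rack-level figures. The short sketch below recomputes them; the variable names are illustrative, and the implied number of S2070 units per rack assumes that the whole 60 TFlops peak comes from the GPU systems, which the slide does not state.

```python
# Rack-level figures quoted for the 42U QUOnG rack.
rack_tflops = 60.0
rack_kw = 25.0
rack_keur = 300.0

kw_per_tflops = rack_kw / rack_tflops      # quoted as 0.4 kW/TFlops
keur_per_tflops = rack_keur / rack_tflops  # quoted as 5 k€/TFlops

# Same data expressed in the units used in the GPU comparison table.
gflops_per_watt = (rack_tflops * 1e3) / (rack_kw * 1e3)
gflops_per_euro = (rack_tflops * 1e3) / (rack_keur * 1e3)

# Assumption: all of the rack peak is delivered by 4 TFlops S2070 systems.
S2070_TFLOPS = 4.0
implied_units = rack_tflops / S2070_TFLOPS

print(f"{kw_per_tflops:.2f} kW/TFlops, {keur_per_tflops:.1f} kEuro/TFlops")
print(f"{gflops_per_watt:.1f} GFlops/W, {gflops_per_euro:.2f} GFlops/Euro")
print(f"implied S2070 units per rack: {implied_units:.0f}")
```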

Embedded reference platform demonstrator

• APEnet+ is the “ready to use” elementary component of the EURETILE embedded platform.

• Interconnected APEnet+ boards realize the prototype of the array of DNP-interconnected FPGAs, useful to test fault tolerant capabilities and brain inspired enhancements of the network.

• On availability of 28nm FPGA components

• evaluate the design of a new enhanced APEnet+ board (EURETNet…)

• integrated hardwired ARM µP.

• many resources to explore coupling with ASIP accelerator

• investigate the integration of 16/32 small modules equipped with 28nm FPGA (+ local memory banks) on a “backbone” board (leveraging on APE/SHAPES mechanics design)

• to accelerate dedicated task (via ASIP) in bio-computing application currently under evaluation

[Figure: backbone board carrying 16 modules, numbered 0-15; labels: BackPlane, FrontPlanePB.]

Backup Slides

Next years INFN computing requirements

V. Lubicz, CSN4 talk, September 2009

• Compute intensive Physics (excluding LHC stuff)

• ~ 0.01-1 Pflops for single research group

• ~ 0.1-10 Pflops nationwide

• and beyond LQCD…

• 2D Spin models, Bio-Computing, Gravitational wave analysis, Complex systems, 2D/3D Fluid Dynamics, Monte Carlo for medical and space sciences, …

GPU activities: (inter)national scenario

A random list (not exhaustive):

• “Keeneland” (GeorgiaTech, Nvidia, HP, Oak Ridge): HP + Nvidia, 2 PFlops peak in 2012

• “Tianhe” (NUDT China): Intel CPU + AMD GPUs, 1.2 Pflops peak

• “Nebulae” (“China’s Dawning Information Industry Co”, Shenzhen China): AMD Opteron + Nvidia GPUs, 1.2 Pflops peak

• GPU Supercomputer at CQSE (Taiwan), 16 Nvidia S1070

• SGI: next servers (UltraViolet) CPU+GPU TOP500 systems

• INFN activities: GPU Computing Interest Group (see Bosi talk)

The question is: are GPUs good enough for LQCD computation?

• FP vs local memory bandwidth

• LQCD requirements are: 1 word I/O vs 8 flop -> 4 Bytes / 8 flop

• GPU memory bandwidth equal to 150 GB/s, peak performance 2.7 TFlops

• LQCD on GPU theoretical peak performance is (150/4)*8 = 300 Gflops, i.e. 11% of peak (similar to measured efficiency…)

• Remote access vs local memory bandwidth, i.e. scaling capabilities

• LQCD requirement (R): 16 (8) (words) local access / 1 (word) remote access -> Rqcd = 16 (8)

• GPU I/O interface is PCI Express Gen2 x16 -> 16*5 Gb/s -> 10 GB/s

• Ratio Local/Remote is: 150/10 = 15 ≈ Rqcd

• GFlops/Watt

• GPU (LQCD peak) 300 GFlops / 180 W = 1.7 vs 2 (SHAPES platform)

• Gflops/Euro

• GPU (LQCD peak) 300 Gflops / 350 Euro ≈ 0.86 vs 1.3 (SHAPES platform)

So the answer is: Yes!! GPUs show system parameter values similar (perhaps better?) to SHAPES, and it’s also real hardware…
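The estimate above is a bandwidth-bound calculation and can be reproduced in a few lines. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative, and the PCI Express figure is the raw 16 x 5 Gb/s rate used on the slide, without encoding overhead.

```python
# Local memory bandwidth bound: 1 word (4 bytes) of I/O per 8 flops.
mem_bw_gbytes = 150.0
bytes_per_word = 4
flops_per_word = 8
gpu_peak_gflops = 2700.0

lqcd_gflops = mem_bw_gbytes / bytes_per_word * flops_per_word
fraction_of_peak = lqcd_gflops / gpu_peak_gflops

# Remote vs local access: PCIe Gen2 x16 at the raw 5 Gb/s per lane.
pcie_gbytes = 16 * 5 / 8
local_remote_ratio = mem_bw_gbytes / pcie_gbytes  # compare with Rqcd = 16 (8)

# Efficiency at the LQCD-bound performance (180 W, 350 Euro per GPU).
gflops_per_watt = lqcd_gflops / 180
gflops_per_euro = lqcd_gflops / 350

print(f"LQCD bound: {lqcd_gflops:.0f} GFlops ({fraction_of_peak:.0%} of peak)")
print(f"local/remote bandwidth ratio: {local_remote_ratio:.0f}")
print(f"{gflops_per_watt:.2f} GFlops/W, {gflops_per_euro:.2f} GFlops/Euro")
```

The 150 GB/s, 2.7 TFlops, 180 W and 350 Euro inputs appear to describe a single consumer-class GPU, so the resulting ratios are per device rather than per system.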

GPU for LQCD computing
