euretile hw platforms piero vicini - infn roma castness’11 - january 2011 - rome call...
EURETILE HW platforms
Piero VICINI - INFN Roma
CASTNESS’11 - January 2011 - Rome
Call FP7-ICT-2009-4, Objective ICT-2009.8.1, FET (Future and Emerging Technologies): Concurrent Tera-device Computing
INFN contribution to EURETILE project
• Medium/Long term INFN Objectives
• HPC systems dedicated to scientific applications (more than 20 years of “history” and several generations of APE machines): design and use
• Optimize well-established application kernels (LQCD) and explore new challenging applications (neural network modeling, Bio-Computing, Gravitational wave analysis, Complex systems…)
• Scaling to (multi-)petaflops parallel systems requires a scalable interconnection network with 10^3 to 10^5 network routers
• DNP (Distributed Network Processor) refinement and evolution
• Analyze collective behavior of a “huge” number of interconnected DNP-based computing nodes (deadlock, starvation, throughput efficiency,…)
• Add fault tolerance capabilities to limit the impact of link failures on the network
• Add “brain inspired” features to explore new programming models and to boost computing performance
SHAPES legacy
• A suitable INFN APE computing engine
• (Sh)ApOtto multi-tile (8+) processor, 40(+) GFlops, 10 W
• 8(+) RISC+VLIW_FP Core + DNP based network
• SoC based on “single tile replica”, allowing the number of tiles to grow with newer silicon processes
• Multi-chip high density system:
• 1K (Sh)ApOtto, 40 TFlops, 20 kW, 200 kEuro per rack
• Enhanced/new programming model, semi-automated application mapping software, HW-dependent light OS
• But…
• We need 3-5 MEuro for NRE (chip, mechanics, manpower…)
• We need strong partnership with silicon foundry
• Risky investment, with mass production 3 to 5 years from now
• …and last but not least…
• technology is growing fast and people learned the lesson…
• Pump up flops/W, flops/Euro, flops/m^3
• The race is still open but the current situation doesn’t allow us to start NOW and successfully compete with emerging “commodity” hardware
[Figure: TeraMotherBoard layout. A 4 x 8 array of M8+ modules, labelled (0,0) to (3,7), interleaved with DC/DC converters. Board edges: back connectors area (power supply), 3DT connectors area for TeraMotherBoard stacking, front connectors area (I/O).]
HPC Emerging “commodity”: (GP)GPU
                   Xeon X5670  Opteron 8439  ATI HD5870  Tesla C1060  Tesla C2070
# of cores                  6             6        1600          240          448
GFlops (SP)               140           134        2720          933         1030
GFlops (DP)                70            67         544           78          515
TDP (Watt)                 95           105         188          188          247
Price (Euro)             1600          2000         400         1500       < 2000
GFlops (DP)/Euro         0.04          0.03        1.36         0.05       > 0.26
GFlops (DP)/Watt         0.74          0.64        2.89         0.41         2.09
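The GFlops/Euro and GFlops/Watt rows appear to be derived from the double-precision figures, which a short sketch can check. The dictionary layout and the function name `ratios` are illustrative only; the C2070 price is entered as its quoted upper bound (2000 Euro), so its GFlops/Euro result is a lower bound.

```python
# Figures copied from the table above: (DP GFlops, TDP in Watt, price in Euro).
# Assumption: the C2070 price is taken at its quoted upper bound (< 2000 Euro),
# so the resulting GFlops/Euro value is a lower bound.
DEVICES = {
    "Xeon X5670": (70, 95, 1600),
    "Opteron 8439": (67, 105, 2000),
    "ATI HD5870": (544, 188, 400),
    "Tesla C1060": (78, 188, 1500),
    "Tesla C2070": (515, 247, 2000),
}


def ratios(dp_gflops, tdp_watt, price_euro):
    """Return (GFlops per Euro, GFlops per Watt) from the double-precision peak."""
    return dp_gflops / price_euro, dp_gflops / tdp_watt


if __name__ == "__main__":
    for name, spec in DEVICES.items():
        per_euro, per_watt = ratios(*spec)
        print(f"{name:13s} {per_euro:5.2f} GFlops/Euro  {per_watt:5.2f} GFlops/Watt")
```

Using the single-precision row instead would give noticeably different ratios (for example 140/1600 for the Xeon), which is why the double-precision reading is assumed here.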
Nvidia Fermi (Tesla 20xx)
• 3 × 10^9 transistors
• ~500 cores, 1 TF SP, 0.5 TF DP
• 6 GB external memory (150 GB/s)
• ~250W, <2K Euro
… it’s always a matter of “brute force”…
• General Purpose Graphics Processing Unit
• Impressive peak performance
• TFlops per chip
• Videogame market, i.e. 10 G$/yr
• Two main competitors (Nvidia, ATI)
• Architecture and characteristics fit very well with LQCD requirements
• Many-core (>>100) SIMD-like architecture
• Single core specialized for data parallel floating point computation
• High local memory bandwidth
• “Green”: high Flops/W ratio
• Cost effective: high Flops/$ ratio
LQCD & GPU
• Story begins with video games… (Egri, Fodor et al. 2006)
• Wilson-Dirac operator at 120 Gflops (K. Ogawa 2009)
• Domain Wall fermions (Tsukuba/Taiwan 2009)
• Definitive work: Quda lib (M.A. Clark et al. 2009):
• Double, Single, Half-precision
• Half-prec solver with reliable updates > 100 Gflops
• MIT/X11 Open Source License
• INFN codes development
• 2D Spin models (Di Renzo et al, 2008)
• LQCD Stag. fermions on Chroma (Cossu, D'Elia et al, 2009) with impressive results for single GPU: 1 cpu+C1060 = 1.5 apeNEXT crate (!)
But one GPU is not enough, we need to scale up to 100-1000 GPUs
Many levels of parallelism are needed:
• Intra-GPU, i.e. efficient single-GPU codes
• Intra-node, i.e. efficient hardware to support GPU-to-GPU communication in the same host
• Inter-node, i.e. low latency, high bandwidth network optimized to support RDMA, first-neighbor comms,…
Embedded system emerging (…) architectures: ARM + accelerators
Have you ever heard of similar architectures? ;-)
NVidia Tegra: multi-ARM + specialized audio/video/graphics accelerators
FreeScale i.MX6 platform
TI DaVinci: (multi) ARM + DSP
“An ARM processor coupled with an NVIDIA GPU represents the computing platform of the future.” W. Dally, Nvidia Chief Scientist
Project Denver, Jan 5 announcement: NVIDIA CPU running the ARM instruction set, integrated on the same chip as the NVIDIA GPU
Next generation FPGA
• Latest FPGA-based systems are the ideal hardware to prototype significant components of the EURETILE reference platform
• Two main FPGA families: ALTERA STRATIX V and XILINX VIRTEX 7
• 28nm, introduction during 2011
• TFlops performance, (multi-)terabit I/O bandwidth, hardwired uP cores
Altera Embedded Initiative
Xilinx Extensible Processing Platform
DNP: Distributed Network Processor
• DNP: 3D Torus network controller
• packet-based direct network with 2D/3D torus topology.
• fixed-size packet envelope (header + footer)
• auto-routing using dimension-order static routing, with deadlock avoidance.
• Error detection via EDAC/CRC at packet level.
• RDMA capabilities, PUT and GET, are implemented at the firmware level.
• SystemC models, VHDL (synthesizable) code, AMBA Interface (SHAPES), PCI Express Interface
• Implementation on FPGA and “almost” tape-out on ASIC
• DNP enhancements in EURETILE
• Introduce fault tolerance hardware capabilities
• link self-diagnostic
• new (fault tolerant) routing algorithms
• Exploration of “on-chip” DNP-ASIP integration and optimization of “off-chip” DNP-GPU
• Explore brain-inspired features (multicast,…)
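The dimension-order static routing mentioned above can be illustrated with a minimal sketch: a packet corrects its X coordinate first, then Y, then Z, taking the wrap-around link when that is shorter. This is not the DNP implementation. The function names, the choice of the shorter direction, and the tie-break toward the "+" port are assumptions made for illustration, and the deadlock-avoidance mechanism itself is not modelled.

```python
def torus_step(cur, dst, size):
    """Return (next_node, port) for one dimension-order hop on a 3D torus.

    cur, dst and size are (x, y, z) tuples. Returns (cur, None) on arrival.
    Assumption: the shorter way around each ring is taken, ties go to "+".
    """
    axes = ("X", "Y", "Z")
    for axis in range(3):
        n = size[axis]
        forward = (dst[axis] - cur[axis]) % n  # hops needed in the "+" direction
        if forward == 0:
            continue  # this dimension is already resolved, try the next one
        step, sign = (1, "+") if forward <= n - forward else (-1, "-")
        nxt = list(cur)
        nxt[axis] = (cur[axis] + step) % n  # modulo gives the torus wrap-around
        return tuple(nxt), axes[axis] + sign
    return cur, None


def route(src, dst, size):
    """Return (visited nodes, output ports used) from src to dst."""
    nodes, ports = [src], []
    cur = src
    while True:
        cur, port = torus_step(cur, dst, size)
        if port is None:
            return nodes, ports
        nodes.append(cur)
        ports.append(port)


if __name__ == "__main__":
    nodes, ports = route((0, 0, 0), (3, 1, 0), (4, 4, 4))
    print(nodes)  # [(0, 0, 0), (3, 0, 0), (3, 1, 0)]
    print(ports)  # ['X-', 'Y+']
```

Because every packet resolves the dimensions in the same fixed order, the path depends only on source and destination, which is what makes the routing static.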
[Figure: DNP block diagram on APEnet+. Router with 7x7 ports switch, routing logic and arbiter; six torus links (X+, X-, Y+, Y-, Z+, Z-) with TX/RX FIFOs & logic; PCIe X8 Gen2 core (8@5 Gbps); NIOS II processor; collective communication block; memory controller with DDR3 module; 128@250MHz bus; 100/1000 Eth port.]
Custom PC Cluster Network: APEnet+
• APEnet+: DNP on FPGA-based PCI-Express card to get a 3D torus network for PC clusters
• FPGA-based (Altera Stratix IV) card with PCIe form factor
• Single slot width, 4 torus links, 2D torus topology.
• Secondary piggy-back card, resulting in a double slot width, 6 links, 3D torus topology.
• Embedded NIOS processor to support RDMA operations
• FPGA (Stratix IV EP4SGX2xx) synthesis results:
• PCIe x8 Gen2 host interface (peak 4+4 GB/s)
• 6 fully bidirectional torus links, 34 Gb/s in each direction (~400 Gb/s aggregate bandwidth), each on 4 lanes using QSFP+ interconnect mechanics
• Internal Clk up to 210 MHz
• 128 bit word size crossbar switch
• Resource usage on Stratix IV EP4SGX290:
• 15% Logic Elements, 20% registers, 50% internal memory
• For next generation FPGA (28nm) these numbers become negligible!!
• Preliminary estimation gives 3% LE, 4% regs, 15% mems
• Deliverables
• 3-channel prototype board for link electrical characterization and firmware development, completed and tested in 2010
• APEnet+ board (6 channels) design completed in 2010
• 4 APEnet+ boards in 1Q 2011
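The quoted peak bandwidths can be reproduced with simple arithmetic. The sketch below assumes the standard PCIe Gen2 figures (5 GT/s per lane with 8b/10b encoding) and takes the torus link numbers from the slide; the function names are illustrative.

```python
PCIE_GEN2_GT_PER_S = 5.0   # raw signalling rate per lane (PCIe Gen2)
PCIE_GEN2_ENCODING = 8 / 10  # 8b/10b line code: 8 payload bits per 10 line bits


def pcie_gen2_gbytes_per_s(lanes):
    """Peak payload bandwidth per direction in GB/s for a Gen2 link."""
    return lanes * PCIE_GEN2_GT_PER_S * PCIE_GEN2_ENCODING / 8


def torus_aggregate_gbps(links=6, gbps_per_direction=34.0):
    """Bandwidth summed over all torus links and both directions, in Gb/s."""
    return links * gbps_per_direction * 2


def torus_lane_gbps(gbps_per_direction=34.0, lanes=4):
    """Rate carried by each of the lanes of one link direction, in Gb/s."""
    return gbps_per_direction / lanes


if __name__ == "__main__":
    print(pcie_gen2_gbytes_per_s(8))  # host interface, per direction
    print(torus_aggregate_gbps())     # all six links, both directions
    print(torus_lane_gbps())          # per-lane rate on a QSFP+ link
```

With these inputs the x8 host interface gives 4 GB/s per direction, matching the "4+4 GB/s" peak, and the six links sum to 408 Gb/s, consistent with the "~400 Gb/s" figure.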
QUOnG as the EURETILE HPC platform demonstrator
• QUantum chromodynamics ON Gpu
• PC clusters accelerated with high-end GPUs and interconnected via the APEnet+ 3D torus network
• Added value: tight integration between accelerators (GPU) and custom/reconfigurable network (DNP on FPGA) yields a gain in computing efficiency
• Production and deployment of green, cost-effective medium/large systems in 2011
• Elementary unit:
• multi-core INTEL (packed in two 1U rackable systems)
• S2070 FERMI GPU system (4 TFlops)
• 2 APEnet+ boards
• 42U rack system:
• 60 TFlops/rack peak
• 25 kW/rack (i.e. 0.4 kW/TFlops)
• 300 k€/rack (i.e. 5 K€/TFlops)
• We will leverage the QUOnG system to demonstrate the EURETILE HPC platform.
• Similar mechanics/topology but enhanced DNP (fault tolerant, brain inspired,…) on the network board
• HDS + DOL/DAL programming model
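The per-TFlops rack figures above follow directly from the rack totals. The sketch below recomputes them; the number of elementary units per rack is not stated on the slide and is only an inference, assuming the 60 TFlops peak counts GPU performance alone at 4 TFlops per unit.

```python
RACK_PEAK_TFLOPS = 60.0  # quoted peak per 42U rack
RACK_POWER_KW = 25.0     # quoted power per rack
RACK_COST_KEURO = 300.0  # quoted cost per rack
UNIT_GPU_TFLOPS = 4.0    # quoted S2070 peak per elementary unit


def rack_summary():
    """Derive the per-TFlops figures from the quoted rack totals."""
    return {
        "kW_per_TFlops": RACK_POWER_KW / RACK_PEAK_TFLOPS,
        "kEuro_per_TFlops": RACK_COST_KEURO / RACK_PEAK_TFLOPS,
        # Assumption: rack peak is the sum of GPU peaks only.
        "units_per_rack": RACK_PEAK_TFLOPS / UNIT_GPU_TFLOPS,
    }


if __name__ == "__main__":
    for key, value in rack_summary().items():
        print(f"{key}: {value:.2f}")
```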
Embedded reference platform demonstrator
• APEnet+ is the “ready to use” elementary component of the EURETILE embedded platform.
• Interconnected APEnet+ boards realize the prototype of the array of DNP-interconnected FPGAs, useful to test fault tolerance capabilities and brain-inspired enhancements of the network.
• On availability of 28nm FPGA components
• evaluate the design of a new enhanced APEnet+ board (EURETNet…)
• integrated hardwired ARM uP.
• many resources to explore coupling with ASIP accelerator
• investigate the integration of 16/32 small modules equipped with 28nm FPGA (+ local memory banks) on a “backbone” board (leveraging on APE/SHAPES mechanics design)
• to accelerate dedicated tasks (via ASIP) in a bio-computing application currently under evaluation
[Figure: “backbone” board with 16 module positions numbered 0 to 15; labels: BackPlane, FrontPlanePB.]
Backup Slides
Next years' INFN computing requirements
V. Lubicz, CSN4 talk, September 2009
• Compute intensive Physics (excluding LHC stuff)
• ~ 0.01-1 Pflops for a single research group
• ~ 0.1-10 Pflops nationwide
• and beyond LQCD…
• 2D Spin models, Bio-Computing, Gravitational wave analysis, Complex systems, 2D/3D Fluid Dynamics, Monte Carlo for medical and space sciences, …
GPU activities: (inter)national scenario
A random list (not exhaustive):
• “Keeneland” (GeorgiaTech, Nvidia, HP, Oak Ridge): HP + Nvidia, in 2012 2 PFlops peak
• “Tianhe” (NUDT China): Intel CPU + AMD GPUs, 1.2 Pflops peak
• “Nebulae” (“China’s Dawning Information Industry Co”, Shenzhen China): AMD Opteron + Nvidia GPUs, 1.2 Pflops peak
• GPU Supercomputer at CQSE (Taiwan), 16 Nvidia S1070
• SGI: next servers (UltraViolet) CPU+GPU TOP500 systems
INFN activities: GPU Computing Interest Group (see Bosi talk)
The question is: Are GPUs good enough for LQCD computation?
• FP vs local memory bandwidth
• LQCD requirements are: 1 word I/O vs 8 flop -> 4 Bytes / 8 flop
• GPU memory bandwidth equal to 150 GB/s, peak performance 2.7 TFlops
• LQCD on GPU theoretical peak performance is (150/4)*8 = 300 Gflops, i.e. 11% of peak (similar to measured efficiency…)
• Remote access vs local memory bandwidth, i.e. scaling capabilities
• LQCD requirement (R): 16 (8) (words) local access / 1 (word) remote access -> Rqcd = 16 (8)
• GPU I/O interface is PCI Express Gen2 x16 -> 16*5 Gb/s -> 10 GB/s
• Ratio Local/Remote is: 150/10 = 15 ≈ Rqcd
• GFlops/Watt
• GPU (LQCD peak) 300 GFlops / 180 W = 1.7 vs 2 (SHAPES platform)
• GFlops/Euro
• GPU (LQCD peak) 300 GFlops / 350 Euro = 0.85 vs 1.3 (SHAPES platform)
So the answer is: Yes!! GPUs show values of the system parameters similar (perhaps better?) to SHAPES, and they are also real hardware…
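The estimate above is a bandwidth-bound calculation: sustained performance is capped by how many words the memory can deliver, times the flops done per word. The sketch below repeats the arithmetic with the slide's own inputs; the function and parameter names are illustrative, and the 10 GB/s PCIe figure is taken as quoted (raw rate, without encoding overhead).

```python
def bandwidth_bound_gflops(mem_bw_gbytes_s, bytes_per_word=4, flops_per_word=8):
    """Peak GFlops sustainable when every word loaded feeds flops_per_word flops."""
    return mem_bw_gbytes_s / bytes_per_word * flops_per_word


def lqcd_gpu_estimate(mem_bw=150.0, peak_gflops=2700.0, remote_bw=10.0,
                      power_w=180.0, price_euro=350.0):
    """Recompute the slide's figures from its quoted inputs."""
    lqcd_peak = bandwidth_bound_gflops(mem_bw)
    return {
        "lqcd_peak_gflops": lqcd_peak,
        "fraction_of_peak": lqcd_peak / peak_gflops,
        "local_over_remote": mem_bw / remote_bw,  # to compare with Rqcd = 16 (8)
        "gflops_per_watt": lqcd_peak / power_w,
        "gflops_per_euro": lqcd_peak / price_euro,
    }


if __name__ == "__main__":
    for key, value in lqcd_gpu_estimate().items():
        print(f"{key}: {value:.3f}")
```

The local/remote ratio of 15 sits between the two quoted requirements (8 and 16), which is the basis for the "≈ Rqcd" statement.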
GPU for LQCD computing