nec sx-aurora tsubasa - hpc and quantum summit 2019 · 2019-10-10 · nec sx-aurora tsubasa vector...

NEC SX-Aurora TSUBASA New Generation Vector Computing


Page 1: NEC SX-Aurora TSUBASA - HPC and Quantum Summit 2019 · 2019-10-10 · NEC SX-Aurora TSUBASA Vector Computing: an Old and New Paradigm The time is now for some breakthrough in the

NEC SX-Aurora TSUBASA

New Generation Vector Computing


Future Technology for Today

We live in an ever-changing world. What was once considered impossible has not only become reality, it sometimes even belongs to our everyday experience. Who would have thought 200 years ago that human beings would ever set foot on the Moon? 100 years ago, who would have thought that quick and easy communication, by phone or by email, between people all around the world, at any time, would become as common and as normal as reading a book? Or – to pick a more recent example – who would have thought 20 years ago that in the near future cars would be able to drive on their own, without human intervention? And just try to imagine what achievements future technologies will make possible!

Supercomputing comes along in different shapes: high-resolution simulations of scientific or technical processes, large-scale analyses of huge amounts of unstructured data and the search for information, or the artificial intelligence of machines to support human beings in their decisions and provide relief when tedious but highly challenging tasks need to be performed.

Artificial intelligence was once a popular theme of science fiction only, but it now belongs to modern technology. Machine Learning, and Deep Learning in particular, shapes technology in an unprecedented way and will change the interaction between human beings and machines forever. Deep Learning is driven by a spectacular technology leap, made possible by the interplay between the growing computing capabilities of modern processor architectures and breakthrough improvements in algorithm development.

NEC SX-Aurora TSUBASA takes processor architecture to a new level. As a powerful vector engine with the largest memory bandwidth available, it provides unparalleled sustained performance to mission critical applications that need to deliver results in the shortest amount of time, or to environments that require information derived from huge amounts of data as reliably and quickly as possible.

NEC provides a feature-rich suite of compilers and libraries that meet modern-day requirements and standards, and it goes further: with the NEC Frovedis framework on the NEC SX-Aurora TSUBASA vector engine, data analytics is brought to a level of performance never seen before. Real-time medical image processing and analysis helps medical staff to diagnose diseases more quickly and more reliably. For effective fraud detection, hundreds of millions of financial transactions all over the world must be analysed in seconds.

It is NEC’s mission to not only innovate technology, but to utilize this technology as a contribution to a better society, enabling people to live brighter lives. We summarize this approach in our business message: Orchestrating a brighter world.

Shigeyuki Aino
Deputy General Manager
AI Platform Division
NEC Corporation


Vector Computing: an Old and New Paradigm

The time is now for a breakthrough in the HPC market. NEC’s legacy in vector computing has been successful in the past, and NEC’s vector technology is fully capable of addressing the issues that previous computing technology suffers from. In the mathematical formulation of typical scientific problems there is always a sufficient amount of data parallelism, simply because it results from the underlying characteristics of the problem and from how the problem is expressed mathematically, for example as a partial differential equation. Nature is local: things can be described by partial differential equations in some way, so they are inherently parallel and provide a notion of “neighbourhood”.

This mathematical or physical fact needs to be reflected in the way applications are written. The code should be structured so that the compiler can clearly identify such underlying structures, and the variables need to be organized in memory accordingly. The vector computing paradigm is simple: “I have to execute a mathematical operation. On which elements, grids, variables, particles, equations, structures, can I apply it simultaneously?” If this is your way of thinking, you will inevitably write vector code. For example, it is one of the best practices in C to organize variables as a structure of arrays, not as an array of structures. It is also this understanding that leads to the idea of “domain specific languages” (DSL) – a framework for describing actions to be applied to a whole field.

NEC provides the hardware and software necessary to apply this vector paradigm to the solution of real-world problems.
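The structure-of-arrays advice can be illustrated with a minimal C sketch. The type and function names (particle_aos, particles_soa, shift_x_aos, shift_x_soa) and the array size N are hypothetical and chosen only for this illustration. Both functions apply the same operation to every particle; the difference lies in the memory layout the loop has to walk through.

```c
#include <assert.h>
#include <stddef.h>

#define N 1024  /* illustrative capacity, not taken from the brochure */

/* Array of structures: x, y, z of one particle are adjacent in memory,
   so a loop over all x values steps through memory with a stride of
   three doubles. */
struct particle_aos { double x, y, z; };

/* Structure of arrays: all x values are contiguous, then all y, then all z. */
struct particles_soa { double x[N], y[N], z[N]; };

/* Shift every particle along x, array-of-structures layout (strided access). */
static void shift_x_aos(struct particle_aos *p, size_t n, double dx)
{
    for (size_t i = 0; i < n; ++i)
        p[i].x += dx;
}

/* Same operation, structure-of-arrays layout (unit-stride access). */
static void shift_x_soa(struct particles_soa *p, size_t n, double dx)
{
    assert(n <= N);  /* stay inside the fixed-size arrays */
    for (size_t i = 0; i < n; ++i)
        p->x[i] += dx;
}
```

Both loops answer the question "on which elements can I apply this operation simultaneously?" in the same way, but only the second one reads and writes one contiguous array, which is the access pattern vector loads and stores are generally best at.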

[Figure: Input, Pipeline, and Result compared for Scalar, SIMD, and Vector processing (NEC Vector)]


NEC SX-Aurora TSUBASA – The New Generation Vector Architecture

The heart of the new SX architecture is the vector engine (VE) contained in the vector host (VH). The VE executes complete applications while the VH mainly provides OS functions for connected VEs. NEC SX-Aurora TSUBASA – the vector engine (VE) – offers hitherto unparalleled performance, especially for memory-bound applications. As a standard PCIe card, it fits effortlessly in a standard x86 server host environment.

Each vector CPU has access to six extremely high-bandwidth HBM2 memory modules, which allows for unprecedented performance for memory-intensive applications. The world’s first implementation of a CPU design with six HBM2 memory modules using “chip-on-wafer-on-substrate” technology (CoWoS) leads to the world-record memory bandwidth of 1.2 TB/s.

The vector core on the VE is the most powerful single core available in HPC today, thus keeping the design philosophy of the previous SX series. It achieves the industry-leading performance of 307 GFlops per core, and a memory bandwidth of 150 GB/s per core. The vector processor is based upon a 16 nm FinFET process technology for extremely high performance and low power consumption.

The SX Vector Architecture from Past to Future

Type                 Year   Technology   CPU Frequency   CPU Performance        CPU Memory Bandwidth
SX-2                 1983   Bipolar      166 MHz         1.3 GFlops             10.7 GB/sec
SX-3                 1989   Bipolar      340 MHz         5.5 GFlops             12.8 GB/sec
SX-4                 1994   350 nm       125 MHz         2.0 GFlops             16.0 GB/sec
SX-5                 1998   250 nm       250 MHz         8.0 GFlops             64.0 GB/sec
SX-6                 2001   150 nm       500 MHz         8.0 GFlops             32.0 GB/sec
SX-7                 2002   150 nm       552 MHz         8.8 GFlops             35.3 GB/sec
SX-8                 2004   90 nm        1.0 GHz         16.0 GFlops            64.0 GB/sec
SX-9                 2007   65 nm        3.2 GHz         102.4 GFlops           256.0 GB/sec
SX-ACE               2013   28 nm        1.0 GHz         256.0 GFlops           256.0 GB/sec
SX-Aurora TSUBASA    2018   16 nm        1.4-1.6 GHz     2,150 - 2,457 GFlops   1,228.8 GB/sec


A vector engine has eight independent vector cores, each of which has three FMA units (“fused multiply-add”). 64 fully functional vector registers per core – with 256 entries of 8 bytes width each – can feed the functional units with data or receive results from those, thus being able to handle double-precision data at full speed.
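The per-core and per-processor figures quoted in this brochure can be cross-checked with a little arithmetic, sketched here in C. One input is an assumption that the brochure does not state: each FMA unit is taken to process 32 vector elements per clock cycle (LANES_ASSUMED below), a value chosen because it reproduces the published 268.8 and 307.2 GFlops per core. The function names are hypothetical.

```c
#include <assert.h>

/* From the text: 3 FMA units per core, 8 cores per vector engine.
   A fused multiply-add counts as 2 floating-point operations.
   LANES_ASSUMED is an assumption, not a figure from the brochure. */
enum { FMA_UNITS = 3, FLOPS_PER_FMA = 2, LANES_ASSUMED = 32, CORES = 8 };

/* Peak double-precision GFlops of one core at the given clock in GHz. */
static double core_peak_gflops(double clock_ghz)
{
    assert(clock_ghz > 0.0);
    return FMA_UNITS * FLOPS_PER_FMA * LANES_ASSUMED * clock_ghz;
}

/* Peak GFlops of one vector engine (all cores). */
static double processor_peak_gflops(double clock_ghz)
{
    return CORES * core_peak_gflops(clock_ghz);
}

/* Vector register capacity per core:
   64 registers x 256 entries x 8 bytes per entry. */
static unsigned long vector_register_bytes(void)
{
    return 64UL * 256UL * 8UL;
}

/* Memory bytes available per floating-point operation at peak. */
static double bytes_per_flop(double bandwidth_gb_s, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return bandwidth_gb_s / peak_gflops;
}
```

Under that assumption, a 1.6 GHz core yields 3 x 2 x 32 x 1.6 = 307.2 GFlops and eight cores yield 2,457.6 GFlops, matching the upper end of the range in the table above. The 64 vector registers hold 131,072 bytes (128 KiB) per core, and 1,228.8 GB/s against 2,457.6 GFlops corresponds to 0.5 bytes per floating-point operation.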

Currently, three vector host series are available, each suited to different functionality and capacity requirements. They range from the A100 tower series for smaller development environments, through the rack-mountable A300 series with flexible configuration options, to the A500 supercomputer series for large-scale configurations with direct liquid cooling (DLC).


The A100 series is a tower model that can be used on a desktop. This model is aimed at personal use by developers and programmers, and mainly consists of one Xeon processor and one VE card of Type 10C*.

The A300 series is a standard rack-mountable model with air cooling. There are three products in this series: A300-2, A300-4, and A300-8. The number following “A300” is the maximum number of VE cards per A300 product. Due to the standard rack-mount implementation and the air cooling, this series offers the same high configuration flexibility as a de facto standard x86 server. The supported VE cards are Type 10B and Type 10C*.

A block diagram of the A300-2 is shown in the following figure. This model consists of one Xeon processor, up to two VE cards, and one IB HCA in a 1U implementation.

A100-1: 1 VE Tower

A300-2: 2 VE Server

The A300-4 model consists of two Xeon processors and can be deployed with up to four VE cards per server of Type 10B or Type 10C*. Moreover, up to two IB HCAs can be deployed with this server.

A300-4: 4 VE Server


* See table Technical Specification at the end


High vector processing power and high memory bandwidth are provided by the A300-8 series. This server is designed for large-scale vector processing systems, with up to eight VE cards per two Xeon processors. The main difference from the A300-2 and A300-4 is the pair of PCIe switches connecting the VE cards with the Xeon processors. This way, direct MPI communication between VE/VE or VE/IB HCA is possible. VE cards of Type 10B* and Type 10C* are supported by this product.

A300-8: 8 VE Server


The high-end product of SX-Aurora TSUBASA is the A500-64. Up to 64 VE cards are implemented in one dedicated DLC rack.

Basically, eight units of A300-8 are used with DLC units (“direct liquid cooling”), thereby offering optimized full-rack cooling for a large-scale supercomputer solution. The Type 10B* VE card is also supported. In order to reduce the cooling costs, an inlet water temperature of up to 40 °C is supported (“hot water cooling”).

* See table Technical Specification at the end


Vector Programming With World-Class Development Tools

A supercomputer is a tool to increase the productivity of researchers and developers. Software developers will find a plethora of world-class development tools and a programming environment for writing superb vector code. Software development is done on the vector host, and a vector cross-compiler translates the code into an executable SX-Aurora TSUBASA binary. At runtime, the VE operating system (VEOS) loads the binary into the vector engine, where it then executes.


C                   C11 (ISO/IEC 9899:2011)
C++                 C++14 (ISO/IEC 14882:2014)
Fortran             Fortran 2003 (ISO/IEC 1539-1:2004)
                    Fortran 2008 (ISO/IEC 1539-1:2010)
OpenMP              Version 4.5
MPI                 Version 3.1
Numeric libraries   BLAS, FFT, LAPACK
Further tools       GNU profiler (gprof)
                    GNU debugger (gdb)
                    ECLIPSE Parallel Tools Platform (PTP)
                    FTrace Viewer / PROGINF

The vector cross-compiler supports advanced automatic vectorization and parallelization for industry-leading sustained performance, and comes with highly optimized MPI libraries in a GNU/Linux environment. The NEC SX-Aurora TSUBASA programming environment comprises a feature-rich compiler supporting several modern programming language standards, as well as a set of precompiled mathematical libraries for easy development of scientific code, including BLAS, FFT, LAPACK, and ScaLAPACK.
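As an illustration of code that lends itself to automatic vectorization and parallelization, the following sketch uses only standard C11 and an OpenMP 4.5 directive, both listed among the supported standards above. The function name daxpy follows the usual BLAS naming but is defined here from scratch; how any particular compiler treats the loop is not claimed by this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i] for i in [0, n).
   The restrict qualifiers promise that x and y do not overlap, so no
   iteration depends on another. The OpenMP directive asks for the
   iterations to be distributed across threads and vectorized; a compiler
   built without OpenMP support simply ignores it and the result is the
   same. */
static void daxpy(size_t n, double a,
                  const double *restrict x, double *restrict y)
{
    assert(x != NULL && y != NULL);
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The loop body has unit-stride accesses and no cross-iteration dependence, which are the two properties a vectorizing compiler needs to establish before it can process many elements at once.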


Accelerating Machine Learning with the NEC Frovedis Library

NEC Frovedis is an A.I. framework for vectorized and distributed data analytics and is available as an Open Source software package (https://github.com/frovedis). NEC Frovedis on SX-Aurora TSUBASA provides much higher machine learning performance than Apache Spark on commodity x86 systems. Typical applications for NEC Frovedis are:

• web advertisement (logistic regression)

• document clustering (k-means)

• recommendation (singular value decomposition)


The NEC vector technology, combined with NEC software technology, accelerates HPC and Big Data Analytics applications, to invent new social values.

[Figure: algorithms in NEC Frovedis, each marked as “Supported” or “Planned”. Panels: Machine Learning, Preprocess, Graph. Algorithms shown: Linear Regression, Ridge Regression, Lasso Regression, Logistic Regression, Logistic Regression (Multiple Class), Linear SVM, SVD, EVD, K-means, word2vec, Factorization Machines, Collaborative filtering (ALS), Decision Tree, Naive Bayes, Art2, LDA, Association rule mining, Page Rank, Triangle Counting, Connected Components, Power Iteration Clustering, Gaussian Mixture, Random Forest, GBDT, Isotonic Regression]

With the advent of the era of A.I. and deep learning, the computational capabilities of HPC are used for these emerging and growing fields of application. The NEC Frovedis framework accelerates machine learning on NEC SX-Aurora TSUBASA by utilizing its large memory bandwidth and vector architecture.


NEC Deep Learning Acceleration

Artificial Intelligence (A.I.), and in particular Deep Learning, has opened up a new class of previously unthinkable applications like self-driving cars, conversational interfaces, and autonomous robots, to mention just a few. Deep Learning is a highly compute-intensive method that has become possible because of the massive increase in available computing power. Still, the learning phase of deep neural networks has immense requirements that can only be addressed with specialized accelerators, like Graphics Processing Units (GPUs) or NEC SX-Aurora TSUBASA.

But powerful accelerators only unleash their full computational potential if they are used efficiently. Machine Learning is a relatively new discipline, and many existing Deep Learning frameworks are still at a very early stage with a lot of room for performance improvements. As application developers and users of Deep Learning frameworks are interested in much higher levels of abstraction than HPC developers, Deep Learning programs are in general less performance-optimized, and optimized frameworks are required to fully exploit the capabilities of the underlying hardware. NEC has long-standing experience in code development and performance optimization in High-Performance Computing. Our own AI application developers and tools therefore benefit from decades of experience in tuning applications to achieve the maximum level of efficiency.

Building on this extensive expertise in optimizing compilers and A.I. technologies, NEC Deep Learning Acceleration improves Deep Learning performance by acting transparently between commonly used software frameworks and the underlying hardware. It will allow developers to freely select their favorite Deep Learning framework, such as TensorFlow or PyTorch, and to benefit from a significant speed-up without modifying their source code. NEC Deep Learning Acceleration will achieve great performance improvements by automatically optimizing the order of operations in user programs while keeping them 100% functionally equivalent. The resulting code will run optimally on the underlying hardware, be it a CPU, GPU, SX-Aurora TSUBASA, or a combination of different processors and accelerators.
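What "optimizing the order of operations while keeping the result equivalent" can mean is sketched below in C with a generic textbook example. This is not NEC's implementation, whose internals the brochure does not describe. The product of two matrices and a vector is computed once as (A·B)·v and once as A·(B·v); all names (DIM, matvec, matmat, product_then_vector, vector_first) are hypothetical, and each function returns the number of multiply-add operations it performed.

```c
#include <assert.h>
#include <stddef.h>

#define DIM 3  /* illustrative size; matrices are stored row-major in flat arrays */

/* out = m * v; counts DIM*DIM multiply-adds in *ops. */
static void matvec(const double *m, const double *v, double *out,
                   unsigned long *ops)
{
    for (size_t i = 0; i < DIM; ++i) {
        double sum = 0.0;
        for (size_t j = 0; j < DIM; ++j) {
            sum += m[i * DIM + j] * v[j];
            ++*ops;
        }
        out[i] = sum;
    }
}

/* out = a * b; counts DIM*DIM*DIM multiply-adds in *ops. */
static void matmat(const double *a, const double *b, double *out,
                   unsigned long *ops)
{
    for (size_t i = 0; i < DIM; ++i)
        for (size_t j = 0; j < DIM; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < DIM; ++k) {
                sum += a[i * DIM + k] * b[k * DIM + j];
                ++*ops;
            }
            out[i * DIM + j] = sum;
        }
}

/* Order as written by the user: (A * B) * v. */
static unsigned long product_then_vector(const double *a, const double *b,
                                         const double *v, double *out)
{
    double ab[DIM * DIM];
    unsigned long ops = 0;
    assert(out != NULL);
    matmat(a, b, ab, &ops);
    matvec(ab, v, out, &ops);
    return ops;
}

/* Reordered: A * (B * v). Same mathematical result, fewer operations. */
static unsigned long vector_first(const double *a, const double *b,
                                  const double *v, double *out)
{
    double bv[DIM];
    unsigned long ops = 0;
    assert(out != NULL);
    matvec(b, v, bv, &ops);
    matvec(a, bv, out, &ops);
    return ops;
}
```

For DIM = 3 the first order needs 27 + 9 = 36 multiply-adds and the second 9 + 9 = 18, and the gap grows with the matrix size (DIM³ + DIM² versus 2·DIM²). With general floating-point data, reordering can change rounding in the last bits, so a tool that promises exact equivalence has to restrict itself to transformations that preserve it.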

[Figure: software stack, from top to bottom: User Programs; Deep Learning Frameworks (TensorFlow, PyTorch, CNTK, ...); Deep Learning Acceleration; Hardware (CPU, GPU, SX-Aurora TSUBASA)]


SX-Aurora TSUBASA specifications

                                   Tower      Rack Mount                                   Supercomputer
Models
Model name                         A100-1     A300-2         A300-4         A300-8         A500-64
Max. Vector Engines (VEs)          1          2              4              8              64
# of Vector Hosts (VHs)            1          1              1              1              8
Form factor                        Tower      1U rackmount   1U rackmount   4U rackmount   proprietary rack

Vector Engine (VE)
# of VEs                           1          1, 2           1, 2, 4        6, 8           32, 48, 64
VE type                            Type 10C   Type 10B/C     Type 10B/C     Type 10B/C     Type 10A/B
Max. VE performance (TF)           2.15       4.30           8.60           17.20          157.28
Max. VE memory bandwidth (TB/s)    0.75       2.45           4.91           9.83           78.64
Max. VE memory capacity (GB)       24         96             192            384            3072

Vector Host (VH)
Xeon® processors/VH                1          1              2              2              2
Xeon® processor                    Intel® Xeon® Gold 6100 family, Silver 4100 family
Max. memory configuration          2666 MHz DDR4 DIMM x 6 / Xeon® processor
Max. memory capacity (GB)          192        192            384            384            384
OS                                 CentOS/Red Hat Enterprise Linux 7.3/7.4/7.5

Interconnect
Max. HCAs (InfiniBand EDR)         –          1              2              4              32
Bidirectional bandwidth (GB/s)     –          25             50             100            800

Power and Cooling
Power consumption (HPL)            0.6 kW     1.0 kW         1.8 kW         3.2 kW         28 kW
Cooling                            Air        Air            Air            Air            Air + Water

Software
Bundled software                   VE controlling software, VE driver
Software developers kit            Vector compiler/debugger/libraries/profiler for VE
MPI                                MPI library for VE

Vector Engine (VE) Specifications

                                   Type 10C   Type 10B   Type 10A
Core Specifications
Clock speed (GHz)                  1.4        1.4        1.6
Peak performance (GF)              268.8      268.8      307.2
Average memory bandwidth (GB/s)    93         153        153

Processor Specifications
# of cores / processor             8          8          8
Peak performance (TF)              2.15       2.15       2.45
Memory bandwidth (TB/s)            0.75       1.22       1.22
Cache capacity (MB)                16         16         16
Memory capacity (GB)               24         48         48

Safety Notice: Before using this product, please read carefully and comply with the cautions and warnings in manuals such as the Installation Guide and Safety Precautions. Incorrect use may cause a fire, electrical shock, or injury.


Technical Specifications


NEC Corporation
7-1, Shiba 5-chome
Minato-ku, Tokyo 108-8001
Japan

www.nec.com

NEC Deutschland GmbH
HPC EMEA Headquarter
Fritz-Vomfelde-Straße 14-16
D-40547 Düsseldorf
Tel.: +49 (0) 211 5369 0

HPC Division
Raiffeisenstraße 14
D-70771 Leinfelden-Echterdingen
Tel.: +49 (0) 711 78 055 0

HPC Division
3 Parc Ariane
F-78284 Guyancourt
Tel.: +33 (0) 139 30 66 00

NEC Laboratories Europe GmbH
Kurfürsten-Anlage 36
69115 Heidelberg
Germany

Tel: +49 (0) 6221 4342-0
Fax: +49 (0) 6221 4342-155
email: [email protected]

HPCE UK Division
Unit 24, 29-30 Horse Fair
Banbury, Oxon, OX16 0BW
DDI: +44 (0) 1295 814500