
Page 1: FPGA-accelerated High-Performance Computing: Close to Breakthrough or Pipedream?

Christian Plessl
Paderborn Center for Parallel Computing & Dept. Computer Science
Paderborn University, Germany
24 January 2018

Page 2: Outline

• HPC and Computational Science
• Status of using FPGAs in HPC
• FPGA-accelerated HPC in Paderborn
– plans
– lessons learned
• Conclusions and call to action

Page 3: From Science to Computational Science

Page 4: Experiment

Page 5: Theory

Page 6: High-Performance Computing (HPC)

Page 7: What is Computational Science?

• Use computer simulation to obtain scientific results
• The “third paradigm”, following experiment and theory
• Advantages of computer experiments:
– make predictions about what will happen
– perform experiments that would otherwise be impossible, too difficult, or too dangerous
– perfect reproducibility
– can offer explanations of why something happens

Page 8: Computational Science Drives HPC Demand

• Computational science penetrates all fields
– engineering
– natural sciences
– humanities
• Growing processing demand
– simulation
– optimization
– data-intensive analytics
• Computers are virtual instruments
– microscopes, telescopes, chemistry labs, ...
– improve exponentially in capability, in contrast to their physical counterparts

images: UCLA, MPG

Page 9: Which Sciences Are Using HPC?

Paderborn Center for Parallel Computing (ca. 700M core-hours/year):
Physics 33%, Chemistry 27%, Engineering 21%, Computer Science 12%, Economics 2%, other 5%

2016 INCITE by domain (3.57 billion core-hours):
Materials Science 30%, Physics 24%, Chemistry 15%, Engineering 13%, Earth Science 10%, Biological Sciences 6%, Computer Science 2%

2016 ALCC by domain (1.74 billion core-hours):
Physics 30%, Engineering 22%, Materials Science 20%, Computer Science 10%, Chemistry 9%, Earth Science 8%, Biological Sciences 1%

Note: ALCC data is from calendar year 2016. Source: 2016 ALCF Science Highlights, Argonne Leadership Computing Facility.

Page 10: HPC: Massive Scale and Challenges

• Massively parallel computation across all levels
– instruction, core, socket, rack
• Power consumption has become a first-class concern
– operating cost and power supply
– cooling infrastructure

[Figure: Number of CPU cores in Top500 supercomputers, 06/1993–06/2017, log scale from 4 to 16,777,216; series: cores of the rank-1 and rank-100 systems, power (kW) of the rank-1 and rank-100 systems, with trend lines for ranks 1 and 100. Data: Top500]

Page 11: Quest for Energy-Efficient Computing

• Ambitious roadmaps for HPC
– launch of Exascale computing projects in the US, Europe, and Japan around 2010
– objective: 1 ExaFLOP at less than 20 MW by 2020
• Requires substantial improvements across the whole stack
– processor architecture, network, programming models
– system design, cooling
• Efficient computing resources are more important than ever
– a wealth of new and re-invented architectures
– accelerators
– heterogeneous computing

→ a huge opportunity for reconfigurable computing
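
For scale, spelling out the arithmetic behind the stated objective gives the efficiency target:

\[
\frac{1\ \mathrm{EFLOPS}}{20\ \mathrm{MW}} = \frac{10^{18}\ \mathrm{FLOP/s}}{2 \times 10^{7}\ \mathrm{W}} = 50\ \mathrm{GFLOPS/W}
\]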

[Figure: accelerator landscape – Cell, FPGA, GPU, manycore, vector processor]

Page 12: Accelerators on the Rise

• Accelerators entered HPC a decade ago
• Performance (Top 500, 11/2017)
– 20% of systems use accelerators
– 25%–35% of accumulated performance
• Efficiency (Green 500, 11/2017)
– the most efficient systems use PEZY-SC or GPU accelerators

statistics: Top500.org

Page 13: A Different Take on the Same Data

• Breakdown of the Top 500 (11/2017) by accelerator type
• Interesting observations
1. 80% of the systems do not use any accelerators
2. Only NVidia GPUs and Intel Xeon Phi gained traction
3. FPGAs are absent from the Top500
• Quick rise ≠ universal adoption
– Why don’t we see much broader adoption of accelerators?
– Stagnation, or a matter of time?

statistics: Top500.org

Page 14: 2018-01-24 FPGA Accelerated HPC · 1/24/2018  · Economics 2% other 5% ca. 700M core-hours/year ... • Efficiency (Green 500, 11/2017) – most efficient systems use PEZY-SC or

FPGA

14

Overarching Questions and Motivation for this Talk1. If accelerators – in particular

FPGAs – are so great, why aren’t they in much wider use?

2. What can we do to change this situation?

Page 15: Maybe Top500 Is Too Narrow. Perform a Broader Search

• Currently operational, larger-scale general-purpose FPGA installations
– CHREC, U. Florida: Novo-G#
– Hartree Center, UK: Maxeler MPC-X cluster
– TACC, Texas: Catapult 1 and Intel HARP v2 cluster
– Paderborn University: XCL + HARP v2 cluster
• HPC applications with FPGA support
– no generally available, production-ready HPC codes
– some proof-of-concept codes (e.g. Maxeler Application Gallery)
– probably some integrated solutions/appliances (bioinformatics, cryptography)
• HPC libraries with FPGA support
– nothing usable/maintained (not even FFT, BLAS, LAPACK)
– announced: Intel and Xilinx acceleration libraries (mainly deep learning)

Page 16: Are FPGAs Not Promising for HPC? I Don’t Think So

• Numerous publications show the potential of FPGAs for relevant HPC problems
• Some examples
– Linear algebra: CG solver for sparse systems of linear equations [1]: 20–40x faster than CPU
– Geophysics: 3D convolution [1]: 70x faster than CPU, 14x faster than GPU
– Molecular dynamics [2]: 80x faster than NAMD on a single CPU core
– Bioinformatics (BLAST) [3]: 5x faster than an optimized, parallel CPU implementation
– Climate modeling [4]: 4 FPGAs 19x faster than a two-socket CPU, 7x faster than GPU

[1] O. Lindtjorn, R. G. Clapp, O. Pell, O. Mencer, M. J. Flynn, and H. Fu. Beyond traditional microprocessors for geoscience high-performance computing applications. IEEE Micro, Mar.–Apr. 2011.
[2] M. Chiu and M. C. Herbordt. Molecular dynamics simulations on high-performance reconfigurable computing systems. ACM TRETS, Nov. 2010.
[3] A. Mahram and M. C. Herbordt. NCBI BLASTP on high-performance reconfigurable computing systems. ACM TRETS, Jan. 2015.
[4] L. Gan, H. Fu, W. Luk et al. Solving the global atmospheric equations through heterogeneous reconfigurable platforms. ACM TRETS, Mar. 2015.

Page 17: Some Areas Where FPGAs Are Successfully Used

• Hypothesis: a clear value proposition is mandatory
– no other affordable technology can satisfy the requirements
– NRE: avoidance of the cost of an ASIC design
– CAPEX, OPEX: reduction in investment and operating cost
• Areas where FPGAs seem to have some commercial relevance
– networking equipment (latency, NRE)
– high-frequency trading (latency, NRE)
– bioinformatics (NRE, CAPEX)
– cryptanalysis (CAPEX and OPEX)
– defense / medical signal processing (space, power consumption)
– deep learning inference (CAPEX and OPEX)
• Viability of general-purpose use (e.g. Amazon F1) so far unproven

Page 18: So, where is the problem?

Page 19: Pitching FPGA Acceleration to an HPC Audience (1)

• Since joining Paderborn University in 2007, I started to connect more intensely with the HPC community
• The timing was good
– the HPC community was increasingly interested in computer architecture and accelerators
– FPGAs were already known as a hot technology for the future of HPC
– some computational scientists were actually interested in collaboration
• My naïve assumptions
– HPC folks will be convinced of FPGAs once they see case studies
– publishing the results at mainstream HPC conferences is within reach
– FPGAs will soon be mainstream enough to open up chances for an HPC faculty position

Page 20: Pitching FPGA Acceleration to an HPC Audience (2)

collaboration with a physicist ✓
problem that sounds somewhat important ✓
state-of-the-art CPUs and FPGAs ✓

Page 21: Pitching FPGA Acceleration to an HPC Audience (3)

algorithm simple enough to understand ✓

Page 22: Pitching FPGA Acceleration to an HPC Audience (4)

high-level hardware synthesis, no HDL (any computational scientist can do this) ✓
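
The transcript does not reproduce the code shown on the slide. As a rough illustration of the style, here is a minimal OpenCL C kernel for one Jacobi-style update of a 1D 3-point stencil; the kernel name and coefficients are hypothetical, not taken from the talk. An HLS flow (e.g. the Intel or Xilinx OpenCL compilers mentioned later) turns such C-level code into FPGA hardware without any HDL:

    /* Hypothetical sketch, not the code from the talk: one update step of a
       1D 3-point stencil in OpenCL C. The HLS tool pipelines this across
       work-items; no Verilog or VHDL is involved. */
    #pragma OPENCL EXTENSION cl_khr_fp64 : enable

    __kernel void stencil_step(__global const double *restrict in,
                               __global double *restrict out,
                               const double c0, const double c1,
                               const int n)
    {
        int i = (int)get_global_id(0);
        if (i > 0 && i < n - 1)
            out[i] = c1 * in[i - 1] + c0 * in[i] + c1 * in[i + 1];
    }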

Page 23: Pitching FPGA Acceleration to an HPC Audience (5)

the synthesis result is not something weird that can only be understood by electrical engineers ✓

Page 24: Pitching FPGA Acceleration to an HPC Audience (6)

CPU and FPGA use double-precision arithmetic ✓
CPU implementation appears to be reasonably optimized (see the sketch below):
• multi-threaded ✓
• cache blocking ✓
• NUMA-aware memory allocation ✓
speedup is not stellar, but OK considering the strong CPU baseline ✓
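
For context, a sketch of the kind of CPU baseline these checkmarks describe: a multi-threaded, cache-blocked Jacobi sweep in C with OpenMP. This is an assumed illustration (the 5-point stencil, names, and tile sizes are mine, not from the talk); NUMA-aware allocation would additionally first-touch each tile on the thread that later processes it.

    #include <omp.h>

    #define TI 64   /* tile height: tiles sized to stay cache-resident */
    #define TJ 64   /* tile width */

    static inline int min_int(int a, int b) { return a < b ? a : b; }

    /* One Jacobi sweep of a 5-point stencil on an n x n grid. */
    void stencil_sweep(const double *in, double *out, int n,
                       double c0, double c1)
    {
        #pragma omp parallel for collapse(2) schedule(static)  /* multi-threaded */
        for (int ii = 1; ii < n - 1; ii += TI)                  /* cache blocking */
            for (int jj = 1; jj < n - 1; jj += TJ)
                for (int i = ii; i < min_int(ii + TI, n - 1); i++)
                    for (int j = jj; j < min_int(jj + TJ, n - 1); j++)
                        out[i * n + j] =
                            c0 * in[i * n + j]
                          + c1 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                                + in[i * n + j - 1]  + in[i * n + j + 1]);
    }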

Page 25: My Pitch Was Not Received as Well as Expected

• The CPU performance baseline is too low
– stencil codes can be optimized much further
– the code was probably not vectorized
• Optimization for FPGA insufficiently understood
– what are the theoretical performance limit and the bottlenecks (computation, memory, dependencies)?
– how can FPGAs ever win if their DRAM is slower than the CPU’s? (lack of understanding of pipelining, streaming, ...)
• Fear, uncertainty, doubt
– is this work actually relevant for computational scientists?
– can you train HPC developers to use FPGAs?
– will the required investment in expensive FPGA hardware and software pay off?

Performance of FPGA: ~16 DP operations per cell update at 1000 MCell/s → 16 GFLOPS
Peak performance of CPU: 2 sockets × 4 cores × 2.5 GHz × 1–8 FLOPS/cycle = 20–160 GFLOPS
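
Spelling out the back-of-the-envelope numbers on the slide:

\[
\mathrm{FPGA:}\quad 1000\ \mathrm{MCell/s} \times 16\ \mathrm{FLOP/update} = 16\ \mathrm{GFLOPS}
\]
\[
\mathrm{CPU\ peak:}\quad 2\ \mathrm{sockets} \times 4\ \mathrm{cores} \times 2.5\ \mathrm{GHz} \times (1 \ldots 8)\ \mathrm{FLOP/cycle} = 20 \ldots 160\ \mathrm{GFLOPS}
\]

The measured FPGA throughput thus sits below even the most pessimistic CPU peak, so the reported speedup hinges entirely on how far the CPU code is from that peak – which is exactly where the audience attacked the baseline.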

Page 26: What Was Going Wrong?

• HPC developers are constantly told exciting stories
– “this technology is the future”: Itanium, Cell, BlueGene, Xeon Phi
– “the compiler will handle the complexity for you”
• No user cares about energy efficiency
– only infrastructure providers do
• Users care about ease of use and protection of their investments
– many codes are gigantic, representing countless person-years of investment
– there are plenty of free computing resources available to academics
• Benefits of the new technology are not convincingly presented
– proof-of-concept case studies, no real state-of-the-art problems
– improvements in metrics not relevant for the target users (method- vs. insight-driven research)

TRUST US ✓

Page 27: Pitfalls of FPGA Acceleration in HPC

• The pitfalls of FPGA acceleration in HPC are currently not widely acknowledged, discussed, or understood
• An interesting position paper was published in 2009 in ACM TRETS
– premise: FPGAs show lots of promise but lack acceptance in general-purpose HPC installations
– proposed 12 areas where researchers need to make contributions to increase the acceptance of FPGAs in HPC
• Many observations and conclusions still apply today

Page 28: Critical Areas Identified by Underwood et al.

Table I. The State of FPGA Research Toward HPC (from Underwood et al., “The Long Road to Production Reconfigurable Supercomputing”, ACM TRETS, Vol. 2, No. 4, Article 26, September 2009)

Area | Status | Activity | Difficulty
Step 1: Standardization | poor | moderate | low
Step 2: High Performance Forward Portability | poor | low | high
Step 3: Enhanced Device Performance | good | low | high
Step 4: Enhanced System Architecture | fair | none | moderate
Step 5: Simplified Library Usage | fair | low | low
Step 6: Concurrent APIs | poor | low | low
Step 7: Better Performance Studies | fair | moderate | moderate
Step 8: Improved Programming Environment | good | high | high
Step 9: Improved Infrastructure | poor | low | moderate
Step 10: Enhanced Communications | good | moderate | moderate
Step 11: Enhanced Reliability | poor | low | high
Step 12: Provide OS Support | poor | low | low

From the accompanying discussion (same source):

“[...] objective viewpoint that only it can. Step 2 (high performance forward portability) is in similarly bad shape, and is a much harder problem. There is relatively little activity in this area beyond work on high level language compilers, and the compilers still require far too much target specific tuning.

Step 3 (enhanced device performance) and Step 4 (enhanced system architecture) are in much better shape, though there is little academic activity attempting to improve either for HPC applications. Truly excellent architectures will be difficult to define due to the diversity of applications. Step 5 (simplified library usage) is progressing much better than the related Step 6 (concurrent APIs). Although vendors ship BLAS and FFT libraries, they have not extended the APIs to expose the concurrency. Furthermore, they do not provide higher level libraries (e.g. solvers like Trilinos [Heroux et al. 2005] and PETSc [Balay et al. 1997]). This leads into Step 7 (better performance studies). Accelerator research stands in striking contrast to high performance computing and general microprocessor optimization work. In the latter, optimization work often goes into widely available libraries (e.g. ATLAS [Whaley et al. 2001] and FFTW [Frigo and Johnson 1998]). In contrast, accelerator research tends to be a single proof of concept effort that never makes it outside the lab—despite the fact that it targets widely used core algorithms [Zhuo and Prasanna 2005; deLorimier and DeHon 2005]—and the authors of this work are no different [Underwood and Hemmert 2004; Underwood et al. 2007]. It is time for accelerator researchers to invest the extra effort and make their work applicable to Step 5 and Step 6.

Step 8 (improved programming environment) and Step 9 (improved infrastructure) go hand in hand from the perspective of an application developer, but all of the research community’s attention has been focused on compilers. Thus, compilers are in relatively good shape (though much remains to be done), but research into other key components is extremely rare. The problem of communications with an FPGA (Step 10) has improved dramatically with recent parts that include hard cores for PCI Express and 8+ Gb/s SERDES [Alfke 2008; Mansur 2008]. Thus, the hardest part of the problem is largely solved; however, these blocks remain difficult to use, because the community still needs to define semantically useful, but generally applicable, [...]”

Page 29: 2018-01-24 FPGA Accelerated HPC · 1/24/2018  · Economics 2% other 5% ca. 700M core-hours/year ... • Efficiency (Green 500, 11/2017) – most efficient systems use PEZY-SC or

Getting to the Core of the Problem

29

Accelerator research stands in striking contrast to high performance computing and general microprocessor optimization work. In the latter, optimization work often goes into widely available libraries (e.g. ATLAS and FFTW).

In contrast, accelerator research tends to be a single proof of concept effort that never makes it outside the lab – despite the fact that it targets widely used core algorithms. [..]

I t is time for accelerator researchers to invest the extra effort and make their work applicable.

[Underwood et. al 2009]

Page 30: 2018-01-24 FPGA Accelerated HPC · 1/24/2018  · Economics 2% other 5% ca. 700M core-hours/year ... • Efficiency (Green 500, 11/2017) – most efficient systems use PEZY-SC or

Intuitive Assessment of the Progress We Made Since 2009

30

The Long Road to Production Reconfigurable Supercomputing · 26: 11

Table I. The State of FPGA Research Toward HPCArea Status Activity Difficulty

Step 1: Standardization poor moderate lowStep 2: High Performance Forward Portability poor low highStep 3: Enhanced Device Performance good low highStep 4: Enhanced System Architecture fair none moderateStep 5: Simplified Library Usage fair low lowStep 6: Concurrent APIs poor low lowStep 7: Better Performance Studies fair moderate moderateStep 8: Improved Programming Environment good high highStep 9: Improved Infrastructure poor low moderateStep 10: Enhanced Communications good moderate moderateStep 11: Enhanced Reliability poor low highStep 12: Provide OS Support poor low low

objective viewpoint that only it can. Step 2 (high performance forward porta-bility) is in similarly bad shape, and is a much harder problem. There is rela-tively little activity in this area beyond work on high level language compilers,and the compilers still require far too much target specific tuning.

Step 3 (enhanced device performance) and Step 4 (enhanced system archi-tecture) are in much better shape, though there is little academic activity at-tempting to improve either for HPC applications. Truly excellent architectureswill be difficult to define due to the diversity of applications. Step 5 (simplifiedlibrary usage) is progressing much better than the related Step 6 (concurrentAPIs). Although vendors ship BLAS and FFT libraries, they have not extendedthe APIs to expose the concurrency. Furthermore, they do not provide higherlevel libraries (e.g. solvers like Trilinos [Heroux et al. 2005] and PETSc [Balayet al. 1997]). This leads into Step 7 (better performance studies). Accelera-tor research stands in striking contrast to high performance computing andgeneral microprocessor optimization work. In the latter, optimization work of-ten goes into widely available libraries (e.g. ATLAS [Whaley et al. 2001] andFFTW [Frigo and Johnson 1998]). In contrast, accelerator research tends tobe a single proof of concept effort that never makes it outside the lab—despitethe fact that it targets widely used core algorithms [Zhuo and Prasanna 2005;deLorimier and DeHon 2005]—and the authors of this work are no different[Underwood and Hemmert 2004; Underwood et al. 2007]. It is time for accel-erator researchers to invest the extra effort and make their work applicable toStep 5 and Step 6.

Step 8 (improved programming environment) and Step 9 (improved in-frastructure) go hand in hand from the perspective of an application de-veloper, but all of the research community’s attention has been focused oncompilers. Thus, compilers are in relatively good shape (though much remainsto be done), but research into other key components is extremely rare. Theproblem of communications with an FPGA (Step 10) has improved dramati-cally with recent parts that include hard cores for PCIExpress and 8+ Gb/sSERDES [Alfke 2008; Mansur 2008]. Thus, the hardest part of the problemis largely solved; however, these blocks remain difficult to use, because thecommunity still needs to define semantically useful, but generally applicable,ACM Transactions on Reconfigurable Technology and Systems, Vol. 2, No. 4, Article 26, Pub. date: September 2009.

Page 31: Changes in the Ecosystem Since Underwood’s Assessment

• The time of the free lunch for performance is over
– GPUs have paved the way for application modifications
– previously, the code was assumed to be sacred and untouchable
• Energy efficiency has become a pressing issue
– opens up another dimension for competition
• There is finally a “killer app”
– inference for deep neural networks
– FPGAs ride the AI hype wave
• Cloud and data center players are making massive investments in FPGAs
– Altera acquisition by Intel, IBM/Xilinx partnership
– use of FPGAs in the clouds of Microsoft, Amazon, Baidu, IBM, Huawei, etc.
– the overall ecosystem will profit from this

Page 32: Technological Progress Since Underwood’s Assessment

• OpenCL high-level synthesis design flows
– the language can specify many aspects relevant for FPGAs
– standardized and used in other contexts too
– supports easier design-space exploration
– abstracts from the FPGA board, memory channels, and PCIe interfaces (see the host-side sketch below)
• Highly capable FPGA devices
– vast amounts of DSP blocks
– suitable bit widths for implementing floating-point arithmetic
• Steps towards better system integration
– shared and coherent global memory access

HPC-relevant Intel Stratix 10 features:
• 5.5 M logic elements
• 28 MB block RAM
• 10 TFLOPS single-precision floating-point performance
• 80 GFLOPS/W (the best Green500 system achieves 17 DP GFLOPS/W)
• hardened PCIe x16
• hardened memory controllers for DDR4
• up to 96 transceivers

Page 33: HPC with FPGAs at Paderborn University

• Longstanding experience with FPGAs for HPC
• Current FPGA infrastructure
– two testbed clusters for public use
– additional FPGA systems from most major vendors

System | Installed | CPU | FPGA | Toolflow | Properties
Convey HC-1 | 2010 | Xeon 5138 | 4x Xilinx Virtex-5 LX330 | HDL + vector processor overlay | CPU and FPGA connected via FSB, cache-coherent NUMA architecture
Maxeler MPC-C | 2012 | Xeon X5660 | 4x Xilinx Virtex-6 SX475T | MaxJ data flow language | 4 PCIe boards, MaxRing interconnect
Nallatech 385A | 2016 | Xeon E5-1260v2 | Intel/Altera Arria 10 GX1150 | Intel OpenCL | Nallatech 385A FPGA card
IBM S812L | 2016 | POWER8 10-core | Xilinx Virtex-7 VX690T | Xilinx OpenCL | AlphaData PCIe FPGA board (ADM-PCIE-7V3)
Micron Workstation | 2016 | Intel i7-5930K | Xilinx Kintex UltraScale KU115 | Xilinx OpenCL | Pico AC-510 FPGA board with Hybrid Memory Cube
XCL cluster | 2017 | Xeon E5-1630v4 | Xilinx Virtex-7 VX690T + Xilinx Kintex UltraScale KU115 | Xilinx OpenCL | 8-node cluster with 2 FPGA cards per node (AlphaData ADM-PCIE-7V3 and ADM-PCIE-8K5)
HARP cluster | 2017 | Xeon E5-v4 | Intel BDW+FPGA hybrid CPU/FPGA | Intel OpenCL, HDL | 10-node cluster with 1 BDW+FPGA processor per node

Page 34: HPC with FPGAs at Paderborn University (2)

• Recently acquired funding for a next-generation HPC system
– 10 M€ HPC system + 15 M€ data center building
• FPGAs play a strategic role in our HPC investment
– exploration of FPGAs in HPC
– port libraries and real scientific applications to FPGAs
– work on parallel FPGA implementations (MPI, PGAS)
– study performance and energy trade-offs
• Investment complemented by research, development, and support efforts
– infrastructure accessible free of charge for researchers in Germany
– international collaborations possible and desired, negotiated on a case-by-case basis

Page 35: War Stories and Challenges (1)

• Idea: build experience for the production system with FPGA testbed clusters
– building a cluster from components proved far more difficult than ever expected
– lots of effort from technicians, admins, and researchers
• FPGA hardware and software stacks are not ready for primetime yet
• Main difficulties
– poor onboarding experience
– fragility of firmware, drivers, and software stack
– available management tools not suitable for a multi-user HPC environment
– security implications poorly understood
• Conclusion: we will procure the production FPGA systems as validated solutions from major HPC vendors

Page 36: War Stories and Challenges (2)

• Poor onboarding experience
– hardly anything works out of the gate when installing an FPGA card in a server
– outdated and incorrect administrator guides
– typical admins are not able to cope with the technology; lack of good self-diagnostics
• Fragility of firmware, driver, and software stack
– reliance on very specific (sometimes patched) OS versions
– intermingling of HLS flows, backend tools, and BSPs
– unstable drivers (crashes, deadlocks, corruption of data/configuration)
– in-field firmware upgrades not always possible, take too long, or cannot be automated

Page 37: War Stories and Challenges (3)

• Available management tools not suitable for a multi-user HPC environment
– no best practices to support applications relying on specific BSP variants, driver versions, etc.
– no best practices/capabilities for automated firmware provisioning in cluster and workload management systems
– static partitioning of FPGAs into subsets per firmware (OpenCL / HDL and different tool releases) leads to unacceptable resource fragmentation
• Security implications poorly understood
– the ecosystem does not systematically consider the multi-user scenario
– FPGA and board vendors are not confident asserting the security properties of BSPs
– shared memory without memory protection opens the gates for evil (cache-coherent CPU+FPGA, PCIe bus master)
– OpenCL BSPs are delivered by vendors; no possibility to verify correctness and security
– too many ways to crash or lock up a machine (denial of service)

Page 38: Conclusions

• The future is bright
– FPGAs can deliver attractive solutions for HPC and data center workloads
– we have the most capable FPGA silicon we ever had
– HLS tools deliver not only increased productivity but also competitive results for a growing number of domains
– there finally is a “killer application” for FPGAs
– serious investments in and commitment to FPGAs from suppliers and hyperscale data centers
• There is still substantial groundwork to do
– improve the stability of the software and hardware stack
– address the needs of multi-user environments (security, backward compatibility, automated provisioning of BSPs)
– better support for HPC languages and libraries (Fortran, OpenMP, OpenACC, MPI – see the sketch below)
• The needs of data center applications will hopefully move the whole field along
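
To make the language-support gap concrete: this is the kind of directive-based offload that HPC developers take for granted on CPUs and GPUs, and that the slide calls for on FPGAs. A hypothetical OpenACC example in C, not taken from the talk:

    /* A single pragma offloads and parallelizes the loop; data movement is
       declared rather than coded against explicit buffer APIs. */
    void saxpy(int n, float a, const float *restrict x, float *restrict y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }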

Page 39: Call to Action

• Perform fair comparisons
– no overblown claims; use strong and optimized baselines
– compare equivalent hardware generations
• Break out of the case-study dilemma
– target actual scientific codes rather than extracted kernels
– use relevant problem sizes and test data
– aim for generic designs that can handle a broad range of problems
– target multi-FPGA implementations
• Spread the word
– connect with the HPC community and present your results
– release the results as open source
• Join us in this effort!

[email protected]
https://pc2.uni-paderborn.de
Twitter: @plessl // @pc2_upb