a heterogeneous gasnet implementation for fpga-accelerated ...willenbe/... · high-performance...

36
High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation for FPGA-accelerated Computing Ruediger Willenberg and Paul Chow October 9, 2014

Upload: others

Post on 08-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

High-Performance Reconfigurable Computing Group

University of Toronto

A Heterogeneous GASNet Implementation

for FPGA-accelerated Computing

Ruediger Willenberg and Paul Chow

October 9, 2014

Page 2: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

FIELD-PROGRAMMABLE GATE ARRAYS

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

2

Page 3: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

FIELD-PROGRAMMABLE GATE ARRAYS

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

3 © H.Roesner

© flickr / MadPhysicist

Page 4: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Hardwired FPGA functions

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

4

SRAM

SRAM

DSP

SerDes

1GEMAC

ARMA9

ARMA9

Page 5: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Computing with FPGAs

• Fully customized dataflow and buffering

• Tightly coupled pipelining of computations

• Very low energy / computation ratio

• Computation cores can be switched out

in msecs with partial reconfiguration

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

5

Page 6: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Example: Smith-Waterman

DNA sequencing (Dynamic Programming)

49x – 980x speedup

(I/O dependent) on

Xilinx V4-LX160 FPGA vs.

2.2GHz AMD Opteron

(Storaasli/Cray 2009)

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

6

Page 7: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

FPGA programmability drawbacks

• Very different thinking than software programming

• Established Hardware Description Languages

(Verilog HDL, VHDL) are very low-level

• Interfaces are often specific to vendors,

platforms, FPGA boards

• Implementation (compile-equivalent) takes hours

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

7

Page 8: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Programmability improvements

• Higher-level HDLs

– SystemC, SystemVerilog, Bluespec, Chisel (UC Berkeley)

• High-Level-Synthesis (“C-to-gates”)

– Very active area in research and industry, but:

– Only useful if programmer understands hardware

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

8

Page 9: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Programmability improvements (2)

OpenCL as an interface to FPGAs

– Abstracts hardware interface as it does for GPUs

– Sets a programming model (memory, core hierarchy)

– Implies High-Level-Synthesis for actual code kernels

– But GPU model not a perfect fit for FPGA

• Overconstrained dataflow

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

9

Page 10: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Classic accelerator model: Master-Slave

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

10

Page 11: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Our programming model philosophy

• Use a common API for Software and Hardware

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

11

FPG

A

CPU (x86)

Application

Common API

Interconnect

Drivers

Interconnect

Kernel migration

FPG

A

CPU (x86)

Application

Common API

Embedded CPU

Application

Common API

Common API

Hardware

Interconnect

Drivers

Interconnect

Kernel migration

FPG

A

CPU (x86)

Application

Common API

Embedded CPU

Application

Common API

Common API

Hardware

Custom

Processing

Element

Common API

Hardware

Interconnect

Drivers

Interconnect

Kernel migration

FPG

A

CPU (x86)

Application

Common API

Embedded CPU

Application

Common API

Common API

Hardware

Custom

Processing

Element

Common API

Hardware

Interconnect

Drivers

Interconnect

Kernel migration

Page 12: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Common SW/HW API

• CPU and FPGA components can initiate data

transfers

• SW and HW components use similar call formats

• For distributed memory and message-passing,

this was implemented by TMD-MPI

(TMD: Toronto Molecular Dynamics)

• Our contribution:

Building a heterogeneous infrastructure for PGAS

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

12

Page 13: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Why again a common API?

• Easier development: SW Prototyping Migration

• Model makes no distinction between CPUs and FPGAs

(in terms of data communication, synchronization)

• FPGA-initiated communication relieves CPU

(even more so for 1-sided comm.)

• FPGA-only systems (or 1 CPU + many FPGAs) can

work efficiently

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

13

Page 14: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

THeGASNet

• Toronto Heterogeneous GASNet

• Compatible software and hardware

implementations of the GASNet Core API

• GASNet is widely used, well-defined PGAS API

• Core API’s “Active Message” types (S/M/L/LA) and

handlers cover essential hardware needs

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

14

Page 15: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

THeGASNet Hardware: GAScore

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

15

Memory

GAScore

RDMA

GA

SNet.h

App.c

CPU

FPGA

On-ChipNetwork

Page 16: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

GAScore

• Remote memory communication engine (RDMA)

• Controlled through FIFOs (FSLs)

• Configuration parameters are the same as for

GASNet Active Message function calls

– Simple software layer for embedded CPUs

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

16

Page 17: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

THeGASNet Hardware: GAScore

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

17

Memory

GAScore

RDMA

GA

SNet.h

App.c

CPU

FPGA

On-ChipNetwork

Page 18: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

THeGASNet Hardware: GAScore

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

18

Memory

GAScore

RDMA

GA

SNet.h

App.c

CPU

FPGA

On-ChipNetwork

Custom

Hardware

Page 19: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

THeGASNet Hardware: PAMS

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

19

Memory

GAScore

RDMA

GA

SNet.h

App.c

CPU

FPGA

On-ChipNetwork

Custom

Hardware

PAMS

Custom

Core

Page 20: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Programmable Active Message Sequencer

• Controls custom hardware operations

• Handles reception/transmission of Active Messages

• Custom hardware ops and Active Messages are

initiated based on:

– Custom hardware state

– Number of received messages with a specific code

– Amount of received data

• (Re-)programmable through Active Messages

• Future: Reconfigure Computation Core by AM

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

20

Page 21: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

THeGASNet Software (PC/ARM)

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

21

m

a

i

m

()

W

r

a

p

e

r

Application Thread 0

Handler Thread 0

Application Thread 1

Handler Thread 1

PCIe Thread

TCP/IP Thread

Page 22: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

THeGASNet packet format

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

22

Src Dst Size

Handler Args

AM Params

Payload

32 bits

0-4

kB

yte

Identical for:

• Thread-to-thread

• TCP/IP

• PCIe

• On-chip network

Page 23: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

MapleHoney - Heterogeneous Cluster

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

23

BEE4

FPGA FPGA

FPGA FPGA

PCRAM RAM

RAMRAM

RA

MR

AM

RA

MR

AM

PC

PC PC

PCIe

DDR200GbE

Page 24: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

MapleHoney - Single FPGA

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

24

FPG

A r

ing

up

FPGA ring down

PCIeDRAM Controller

DR

AM

Co

ntr

olle

r

FPGA

Node0

Node4

Node1

Node5

Node7

Node2

Node3

Node6 NetIf / FSL

Network

Page 25: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Microbenchmarks: FPGA

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

25

FPGA hops 1-way 2-way

0 160 330

1 240 490

2 320 550

FPGA hops 4 Bytes 4 kBytes

0 540 22555

1 620 22635

2 700 22715

Long message latency (ns)

Short message latency (ns)

Page 26: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

26

Latency to node 0 870

Latency to node 0 and back 1960

Latency to node 0 620

Latency to node 0 and back 1480

Simple barrier latency (ns)

“Staggered” barrier latency (ns)

Microbenchmarks: FPGA

Page 27: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Microbenchmarks: PC,TCP/IP,1GEth

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

27

Full barrier latency (4x 4 Threads) 21.6 ms

2-way Short message latency 9.3 ms

1-way 4 kByte Long message latency 6.2 ms

Average Bandwidth 64MByte transfer 0.84 Gbit/s

Average Bandwidth 1GByte transfer 0.87 Gbit/s

! High variation on latency measurements except

barrier => Thread scheduling issues?

Page 28: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Microbenchmarks: PC <-> FPGA (PCIe)

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

28

2-way Short message latency 15.7 us

1-way 4 kB Long latency PC -> FPGA 32.4 us

1-way 4 kB Long latency FPGA -> PC 23.6 us

Sustained PCIe bandwidth PC -> FPGA 175.1 Mbyte/s

Sustained PCIe bandwidth FPGA -> PC 172.3 Mbyte/s

Page 29: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Application example: Jacobi Heat Transfer

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

29

Iterative PDE solver with stencil computation pattern

Cell iteration Node division Communication

pattern

Page 30: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Step I: Design PC application

• Design SPMD application using THeGASNet

• Problem size: 45760x45760 cells on 16 nodes/threads

• 11880x11880 cells per node/thread: ~1GB storage

• Debug and profile application on PC platform

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

30

Page 31: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Step II: Migrate to MicroBlaze

• Using the same THeGASNet headers and compiler

family (gcc), differences should be minimal

• Static instead of dynamic memory allocation; shared

segment pre-defined.

• OS functionality like timers may need to be changed

(abstracted in platform header)

• Command-line arguments to application converted to

compile-time arguments (application unchanged)

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

31

Page 32: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Step III: Replace MicroBlaze with HW

• External connections (Memory bus and GAScore

interface) stay the same

• Custom core only does computation

• PAMS needs to be configured for communication in-

between computation phases

• Keeping one SW GASNet thread on PC to

– Move data into FPGA memory

– Program PAMS

– Output benchmark time

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

32

Page 33: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Performance comparison

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

33

4 PCs, 16 threads 4.54s

16 MicroBlazes 222.86s

16 Custom FPGA cores 5.69s

Time for one complete iteration (averaged)

Memory-bound: No FPGA gain!

Page 34: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Conclusion

GASNet SW API + GAScore (+PAMS) enable

• Universal abstraction for FPGA interfaces

• Simple heterogeneous communication and

synchronization

• Simplified kernel migration to accelerators

• Still have to pick the right problems for

FPGA speedup!

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

34

Page 35: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Outlook: THePaC++

Heterogeneous C++ PGAS library

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

35

CPU-based

Host

CPU-based

Host THeGASNet

CPU-based

Host

GASNet Library

C++ PGAS Library

C++ PGAS Application

Compile Static

generation

Dynamic generation

P A M S

Custom

FPGA

Hardware

Manual

or

HLS

Page 36: A Heterogeneous GASNet Implementation for FPGA-accelerated ...willenbe/... · High-Performance Reconfigurable Computing Group University of Toronto A Heterogeneous GASNet Implementation

Thank you for your attention!

Questions?

October 9, 2014 High-Performance Reconfigurable Computing Group ∙ University of Toronto

36