
A Survey of System Architectures and Techniques for FPGA Virtualization

Masudul Hassan Quraishi, Erfan Bank Tavakoli, Fengbo Ren

M. Quraishi, E. Bank Tavakoli, and F. Ren are with the School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ 85281. E-mail: [email protected]; [email protected]; [email protected].

Abstract—FPGA accelerators are gaining increasing attention in both cloud and edge computing because of their hardware flexibility, high computational throughput, and low power consumption. However, the design flow of FPGAs often requires specific knowledge of the underlying hardware, which hinders the wide adoption of FPGAs by application developers. Therefore, the virtualization of FPGAs becomes extremely important to create a useful abstraction of the hardware suitable for application developers. Such abstraction also enables the sharing of FPGA resources among multiple users and accelerator applications, which is important because, traditionally, FPGAs have been mostly used in single-user, single-embedded-application scenarios. There are many works in the field of FPGA virtualization covering different aspects and targeting different application areas. In this survey, we review the system architectures used in the literature for FPGA virtualization. In addition, we identify the primary objectives of FPGA virtualization, based on which we summarize the techniques for realizing FPGA virtualization. This survey helps researchers to efficiently learn about FPGA virtualization research by providing a comprehensive review of the existing literature.

Index Terms—FPGA, Virtualization, Architecture, Accelerator, Reconfiguration


1 INTRODUCTION

FIELD programmable gate arrays (FPGAs) are gaining increasing attention in both cloud and edge computing because of their hardware flexibility, superior computational throughput, and low energy consumption [1]. Recently, commercial cloud services, including Amazon [2] and Microsoft [3], have been employing FPGAs. In contrast with CPUs and GPUs that are widely deployed in the cloud, FPGAs have several unique features rendering them synergistic accelerators for both cloud and edge computing. First, unlike CPUs and GPUs that are optimized for the batch processing of memory data, FPGAs are inherently efficient for processing streaming data from inputs/outputs (I/Os) at the network edge. With abundant register and configurable I/O resources, a streaming architecture can be implemented on an FPGA to process data streams directly from I/Os in a pipelined fashion. The pipeline registers allow efficient data movement among processing elements (PEs) without involving memory access, resulting in significantly improved throughput and reduced latency [4], [5]. Second, unlike CPUs and GPUs that have a fixed architecture, FPGAs can adapt their architecture to best fit any algorithm characteristics due to their hardware flexibility. Specifically, the hardware resources on an FPGA can be dynamically reconfigured to compose both spatial and temporal (pipeline) parallelism at a fine granularity and on a massive scale [1], [6]. As a result, FPGAs can provide consistently high computational throughput for accelerating both high-concurrency and high-dependency algorithms, serving a much broader range of cloud and edge applications. Third, FPGA devices consume an order of magnitude lower power than CPUs and GPUs and are up to two orders of magnitude more energy-efficient, especially for processing streaming data or executing high-dependency tasks [7], [8], [9]. Such merits lead to improved thermal stability as well as reduced cooling and energy costs, which is critically needed for both cloud and edge computing.

Even though FPGAs offer great benefits over CPUs and GPUs, these benefits come with design and usability trade-offs. Conventionally, FPGA application development requires the use of a hardware description language (HDL) and knowledge about the low-level details of FPGA hardware. This is a deal-breaker for most software application developers, who are familiar with neither HDLs nor hardware specifics. Even though high-level synthesis (HLS) has enabled the development of FPGA kernels in C-like high-level languages (e.g., C++/OpenCL) [10], one still needs basic knowledge about FPGA hardware specifics in order to develop performance-optimized FPGA kernels. As a result, FPGA application development based on HLS remains esoteric. Moreover, the existing flows of FPGA kernel design are highly hardware-specific (each FPGA kernel binary is specific to one FPGA model only), and the vendor-provided tools for deploying and managing FPGA kernels lack support for sharing FPGA resources across multiple users and applications. These limitations make FPGAs insufficient for supporting multi-tenancy cloud and edge computing. One solution to these limitations is decoupling the application (i.e., hardware-agnostic) and the kernel design (i.e., hardware-specific) development regions [11].

FPGA virtualization, which aims to address the challenges mentioned above, is the key to enabling the wide adoption of FPGAs by software application developers in multi-tenancy cloud and edge computing. Specifically, the primary objectives of FPGA virtualization are to: 1) create abstractions of the physical hardware to hide the hardware specifics from application developers and the operating system (OS) as well as provide applications with simple and familiar interfaces for accessing the virtual resources; 2) enable the efficient space- and time-sharing of FPGA resources across multiple users and applications to achieve high resource utilization; 3) facilitate the transparent provisioning and management of FPGA resources; and 4) provide strict performance and data isolation among different users and applications to ensure data security and system resilience.

An early survey paper on virtualization of reconfigurable hardware [12] covers a narrow perspective of FPGA virtualization by discussing three virtualization approaches, i.e., temporal partitioning, virtualized execution, and virtual machine. Recent surveys on FPGA virtualization [13], [14], [15] mostly discuss FPGA virtualization in the context of cloud and edge computing. Vaishnav et al. [13] proposed three categories of FPGA virtualization techniques and approaches based on the level of abstraction and the scale of computing resource deployment: resource level, node level, and multi-node level. The resource-level abstraction refers to the virtualization of reconfigurable hardware and I/O resources on an FPGA, while the node-level and multi-node-level abstractions refer to the virtualization of a computing system with one and multiple FPGAs, respectively. Such categorization can create ambiguity since many approaches and techniques used in one abstraction level of the system can often be applied to other levels as well. For example, the paper considers scheduling as a node-level virtualization technique. The scheduling of FPGA tasks is, in fact, a common concept that can also be applied to the multi-node and resource levels. Furthermore, the survey lists several FPGA virtualization objectives, but a discussion on how these objectives are linked to the virtualization techniques in the existing literature is missing.

The survey paper in [14] revisits some of the existing survey papers to suggest selection criteria of appropriate communication architectures and virtualization techniques for deploying FPGAs to data centers. It selects three different areas to review: previously used nomenclature, Network-on-Chip (NoC) evaluation on FPGAs, and FPGA virtualization. Its discussion of FPGA virtualization is similar to [13], and therefore has the same limitations mentioned above.

Skhiri et al. [15] identified the drawbacks of using FPGAs with a local machine and presented a survey of papers that use FPGAs in the cloud to solve the identified drawbacks. They discussed cloud FPGA services in three different groups: software tools, platforms, and resources. As part of their discussion on FPGA-resources-as-a-service, the paper revisits the classification of virtualization approaches in [13] and, hence, has the same limitations.

Overall, the existing survey papers on FPGA virtualization review a limited set of existing work; for example, literature discussing the isolation and security perspectives of FPGA virtualization is not reviewed. There is an existing survey on the security and trust of general FPGA-based systems [16], but the perspective of virtualization is not covered in that work. One of the surveys has overlapping categorization, which creates ambiguity about the scope in which each virtualization technique can apply. The existing surveys also fall short in linking the virtualization techniques to the core objectives of FPGA virtualization. In addition, the existing surveys fail to discuss the system architecture perspective of FPGA virtualization techniques. As FPGAs are reconfigurable hardware, FPGA computing systems can be built with a variety of system architecture choices, and the virtualization techniques can be highly dependent on the architecture choices. Thus, reviewing the system architectures of FPGA computing systems is critical to understanding the corresponding FPGA virtualization techniques.

In this survey paper, we present a comprehensive review of both the system architectures, covering the hardware, software, and overlay stacks, and the techniques for FPGA virtualization, as well as their associations. Such organization of the survey sets a clear boundary among different system stacks, which helps readers better understand how different virtualization techniques apply to different system architectures. Furthermore, our work elaborates the four key objectives of FPGA virtualization by discussing how each objective is addressed by different virtualization techniques in the literature. The system architectures used for FPGA virtualization are summarized in Table 1, and the key techniques for realizing FPGA virtualization are summarized in Table 2, categorized by the four primary objectives of FPGA virtualization. In addition, this survey paper also reviews the existing papers on the isolation and security issues of multi-tenant FPGAs that are overlooked by the previous surveys.

This survey paper is organized as follows. Section 2 provides background on the general concepts of virtualization and the challenges of FPGA virtualization, clarifies the important definitions used throughout this paper, and reviews the available programming models adopted in FPGA virtualization. Section 3 presents the system architecture design for FPGA virtualization in detail. In Section 4, we discuss the four objectives of FPGA virtualization and how they are implemented in different system stacks. Section 5 summarizes the overall survey and draws the conclusion.

2 BACKGROUND

2.1 Virtualization: General Concepts

Virtualization, in general, is creating a virtual abstraction of computing resources to hide the low-level hardware details from users. Virtualization is often software-based, i.e., the virtual abstraction is implemented on the software level to hide the complexity of the underlying hardware. In virtualization, the resources are transparently presented to users so that each user has an illusion of having unlimited and exclusive access. The most common example of virtualization is running different OSs on a single processor using virtual machines, where a virtual machine creates the illusion of a standalone machine with an OS to the users. In this way, the users get the flexibility to easily switch to a different system environment or even a different OS without changing the computing hardware.

For GPU virtualization, gVirtuS [17] utilizes a split driver approach with a frontend component (a guest-side software component) deployed on virtual machine images and a backend component that manages device requests and accesses and multiplexes devices. In [18], rCuda is proposed to remotely access GPUs on HPC clusters from within virtual machines. The rCuda framework has two components: 1) the client middleware, consisting of a collection of wrappers in charge of forwarding the API calls from the applications requesting acceleration services to the server middleware and retrieving the results back, and 2) the server middleware, which receives, interprets, and executes the API calls from the clients and performs GPU multiplexing.

Similar to the general concept of virtualization, FPGA virtualization is creating a virtual abstraction of FPGA resources to hide the intricate low-level hardware details from users or application developers. The definition of FPGA virtualization has changed significantly over time. The early work on FPGA virtualization in the 90s [19] introduced OS principles, such as partitioning, segmentation, and overlaying, for FPGAs and termed these techniques as the virtualization techniques of FPGAs. Later on, the work in the 00s [20] defined FPGA virtualization as an abstract hardware layer on top of physical FPGA fabrics. This layer is now commonly known as the overlay architecture [21]. In [12], three approaches for FPGA virtualization are presented: temporal partitioning, virtualized execution, and mapping to an abstract virtual machine. As FPGA technology has evolved dramatically over the past few decades, the goal of FPGA virtualization has also changed over time. Today, from the perspective of cloud and edge computing, the primary objectives of FPGA virtualization are to create a useful abstraction of the underlying hardware to facilitate application developers and users and enable the efficient sharing and management of FPGA resources across multiple users and applications with strict performance and data isolation.

In a recent survey on FPGA virtualization [13], the objectives of FPGA virtualization are listed as: multi-tenancy, resource management, flexibility, isolation, scalability, performance, security, resilience, and programmer's productivity. However, we believe that scalability, performance, and resilience are the primary considerations of cloud and edge computing in general but not specific to FPGA virtualization, even though virtualization may facilitate improvements on some of those perspectives. Also, flexibility and programmer's productivity are the primary objectives of HLS techniques rather than FPGA virtualization. Therefore, we define the primary objectives of FPGA virtualization in the context of cloud and edge computing as the following:

• Abstraction: To hide the hardware specifics from application developers and the OS as well as provide applications with simple and familiar interfaces to access the virtual FPGA resources.

• Multi-tenancy: To enable the efficient sharing of FPGA resources among multiple users and applications.

• Resource Management: To facilitate the transparent provisioning and management of FPGA resources for workload balancing and fault tolerance.

• Isolation: To provide strict performance and data isolation among different users and applications to ensure data security and system resilience.

Our discussion on the FPGA virtualization techniques in this paper will focus on these four objectives. While the previous survey paper only listed the objectives of FPGA virtualization, our discussion in Section 4 elaborates on how each of the four key objectives is addressed by different virtualization techniques in the existing literature as well as their association with different stacks of the system architecture.

2.2 Challenges of FPGA Virtualization

It should be noted that virtualization is not hardware-agnostic. One must know about the hardware specifics of a computing system in order to properly virtualize it. In the case of FPGAs, the underlying hardware is reconfigurable, and thus the hardware architecture can vary according to the application implemented. Such hardware flexibility makes the virtualization of FPGA resources much more challenging than the virtualization of CPUs and GPUs with a fixed hardware architecture.

Traditionally, FPGAs are used primarily for embedded applications, where a dedicated application running on an FPGA is only accessible to a dedicated user. Although modern commercial FPGAs have the potential to support multiple applications via partial reconfiguration (PR), the existing FPGA architectures and design flows are still not optimized for sharing hardware resources among multiple users and applications, making the multi-tenancy computing objective of FPGA virtualization a challenging task to accomplish.

In addition, FPGAs have limited performance portability. Since FPGA architectures and design flows are both vendor- and hardware-specific, it is impossible to apply the same FPGA kernel (bitfile) to a different FPGA device offered by the same or a different vendor. Although HDL and HLS codes are portable across devices and vendors (other than vendor-provided IP cores and language directives), the re-compilation needed for mapping the same application codes onto different FPGA devices can be extremely time-consuming, creating difficulties for the run-time deployment and provisioning of FPGA applications in a cloud or edge computing environment. Due to the tight coupling between FPGA hardware and vendor-specific design flows, the development of a generalized framework for FPGA virtualization becomes a challenge. This makes the transparent FPGA resource management objective of FPGA virtualization a challenging task to accomplish.

2.3 Definitions

We found that the use of terminologies for FPGA virtualization in the existing literature is not always consistent. Sometimes, different terminologies are used to refer to the same or similar system architecture components. To avoid confusion and ambiguity, we uniformly define the following important terminologies for FPGA virtualization.

• Shell refers to the static region of FPGA hardware, which is pre-designed and pre-implemented prior to application deployment and fixed at run time. Common glue logic is packaged into a reusable shell framework to enhance hardware re-usability. No shell framework is able to fulfill the requirements of all applications. Introducing more functionalities into a shell framework to cover more application requirements also results in increased resource usage, higher power consumption, and possibly lower clock frequency of the static region of FPGA hardware. In some of the literature, a shell is also referred to as a hardware OS [22], [23], a static region/part [24], [25], or a hull [26].

• Role refers to the dynamic region of FPGA hardware, which can be independently reconfigured to map different applications at run time. A role is commonly comprised of several partially reconfigurable regions (PRRs). Each of the PRRs can be reconfigured on-the-fly without interrupting the operation of other PRRs using the technique known as dynamic partial reconfiguration (DPR) [27]. In some of the literature, a role is referred to as an application region [28], a reconfiguration region [29], or a global zone [26].

• vFPGA or virtual FPGA is a virtual abstraction of the physical FPGA. One or multiple PRRs can be mapped to a single vFPGA.

• Accelerator refers to a PRR or vFPGA programmed with an application.

• Hypervisor creates an abstraction layer between software applications and the underlying FPGA hardware and manages the vFPGA resources. In some of the literature, a hypervisor is referred to as an OS, a resource management system/framework [30], [31], a virtual machine monitor (VMM) [32], a run-time manager [31], a hardware task manager [33], a context manager [34], or a virtualization manager [35]. The sketch following this list illustrates how these terms relate to one another.
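To make the relationships among these terms concrete, the following C sketch models a hypothetical FPGA virtualization system as plain data structures. All type and field names (prr_t, shell_t, vfpga_t, hypervisor_t, etc.) are illustrative assumptions, not taken from any specific framework surveyed in this paper.

  /* A minimal, illustrative data model of the terms defined above. */
  #include <stddef.h>

  #define MAX_PRRS_PER_VFPGA 4

  /* One partially reconfigurable region (PRR) inside the role. */
  typedef struct {
      int id;              /* index of the PRR within the role        */
      int in_use;          /* 1 if an accelerator is currently loaded */
  } prr_t;

  /* The shell: static logic fixed at run time. */
  typedef struct {
      int has_memory_ctrl;    /* system memory controller present     */
      int has_host_if_ctrl;   /* host interface controller present    */
      int has_network_ctrl;   /* network interface controller present */
  } shell_t;

  /* A vFPGA: a virtual FPGA backed by one or more PRRs. */
  typedef struct {
      prr_t *prrs[MAX_PRRS_PER_VFPGA];
      size_t num_prrs;
      const char *accelerator;  /* application mapped onto it, or NULL */
  } vfpga_t;

  /* The hypervisor: owns the shell, the role (its PRRs), and the vFPGAs. */
  typedef struct {
      shell_t  shell;
      prr_t   *role_prrs;       /* dynamic region = array of PRRs */
      size_t   num_role_prrs;
      vfpga_t *vfpgas;
      size_t   num_vfpgas;
  } hypervisor_t;

  int main(void)
  {
      prr_t role[2] = { { .id = 0, .in_use = 0 }, { .id = 1, .in_use = 0 } };
      vfpga_t v = { .prrs = { &role[0], &role[1] }, .num_prrs = 2,
                    .accelerator = "edge_detect" };   /* illustrative accelerator */
      hypervisor_t hv = { .role_prrs = role, .num_role_prrs = 2,
                          .vfpgas = &v, .num_vfpgas = 1 };
      (void)hv;   /* a real hypervisor would schedule and reconfigure here */
      return 0;
  }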

2.4 Programming Model

There are three programming models for application development supported by the existing FPGA virtualization frameworks: domain-specific language (DSL) programming, HDL programming at the register transfer level (RTL), and HLS programming.

In DSL programming, the applications are written in a high-level language (e.g., C). Selected portions of the code (e.g., loops) are separated and then implemented as FPGA kernels through the conventional RTL or HLS design flows. For example, in [36], loops are identified, and a data flow graph is created. The data flow graph is then converted into an FPGA kernel through the conventional RTL flow.

In HDL programming, FPGA kernels are designed using an HDL. Common HDLs include Verilog, SystemVerilog, and VHDL. HDLs are the most commonly used programming languages in FPGA virtualization, as they are the traditional languages for embedded FPGA design. Writing HDL code that achieves optimized resource utilization, power consumption, and performance requires significant design effort as well as expert knowledge about hardware design, which makes it difficult for software application developers to master. From Table 1, we observe that most of the prior work uses HDL programming as the programming model, which unfortunately is a deal-breaker for the adoption of FPGAs by software application developers in cloud and edge computing.

In HLS programming, a high-level programming language (e.g., OpenCL [10], Java [37], [38]) is used to develop FPGA kernels [39]. Then, the high-level language codes are translated into HDL codes. Different from HDL programming, HLS programming requires less knowledge about the underlying hardware and provides a significant improvement in design productivity. Some HLS-based design tools provide not only compilation and simulation utilities but also run-time utilities for FPGA management. Take the Intel FPGA SDK for OpenCL [10] as an example. It provides an offline compiler for generating FPGA kernel hardware with interface logic, a software-based emulator for symbolic debugging and functional verification, and a run-time environment including firmware, software, and the device driver to control and monitor kernel deployment and execution and to transfer data between the system memories of the host and the FPGA computing device.

Fig. 1: Host interfaces for FPGAs: (a) on-chip, (b) local host, (c) remote host.
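As a concrete illustration of the HLS programming model discussed above, the following OpenCL C kernel (OpenCL C is a C dialect) is a minimal, generic vector-addition example of the kind that HLS flows such as the Intel FPGA SDK for OpenCL compile offline into FPGA kernel hardware. The kernel name and arguments are illustrative and not taken from any of the surveyed works.

  /* Minimal OpenCL C kernel: element-wise vector addition. */
  __kernel void vector_add(__global const float *a,
                           __global const float *b,
                           __global float *c,
                           const unsigned int n)
  {
      size_t i = get_global_id(0);   /* one work-item per output element */
      if (i < n) {
          c[i] = a[i] + b[i];
      }
  }

In such a flow, the offline compiler replaces the usual just-in-time compilation step, and the generated kernel hardware is loaded onto the device by the run-time environment described above.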

3 SYSTEM ARCHITECTURE FOR FPGA VIRTUALIZATION

3.1 Hardware Stack

The hardware stack of FPGA virtualization varies in the design of three components: 1) the host interface that defines how an FPGA computing device is connected to a host CPU; 2) the shell that defines the storage, communication, and management capability of an FPGA computing device; and 3) the role that defines the run-time application mapping capability of an FPGA computing device.

3.1.1 Host Interface

There are three different types of host interfaces available for the existing FPGA devices (Fig. 1), described as follows.

• On-Chip Host Interface connects the FPGA computing hardware to a dedicated soft/hard CPU implemented on the same chip (e.g., a Xilinx MicroBlaze soft processor or an ARM-based hard processor on Xilinx Zynq FPGAs) (see Fig. 1a) [40]. The on-chip host interface has the lowest communication latency, thus has the highest performance among all the interface types. However, a hard CPU implementation occupies the FPGA chip area, and a soft CPU implementation consumes the reconfigurable resources available on an FPGA device, which reduces the capacity of application mapping.


TABLE 1: Summary of the System Architectures, Programming Models, Applications, and FPGA Platforms Adopted by the Existing FPGA Virtualization Work

Work | Host | Shell | Role | Software | Overlay | PM | Application | Platform
Virtual [19], [41] | N/A | N/A | N/A | N/A | N/A | N/A | N/A | N/A
Firm-core JIT [42] | N/A | N/A | N/A | N/A | CGRA | HDL | MCNC Benchmark | Xilinx Spartan-IIE
Cray HPRC [43] | LH | None | PRR | N/M | N/A | HDL | Image Feature Extraction | Cray XD1
OS4RS [40] | OC | Communication Component | PRR | OS4RS | N/A | HDL | Multimedia | Xilinx Virtex-II
NoC [44] | LH | N/A | N/A | N/M | HRCs with NoC | N/A | 3-DES Crypto Algorithm | Xilinx Virtex-5
VirtualRC [45] | LH | N/A | N/A | Software-Middleware API | N/A | HLS, HDL | Bioinformatics | GiDEL, Nallatech, Pico Computing
HPRC [35] | LH | N/A | N/A | N/M | N/A | HDL | IDEA Cipher/Euler CFD Solver | Altera Stratix II
pvFPGA [32] | LH | PCIe Controller, DMA | N/A | Xen VMM | N/A | HDL | FFT | Xilinx Virtex-6
VDI [33] | OC | Memory/Configuration/Network Controller | PRR | N/M | N/A | N/A | Multimedia | Xilinx Virtex-II
ACCL [46] | LH | PCIe Controller, DMA | PRR | OpenStack | N/A | N/A | Cryptography | Xilinx Kintex-7
VFR [29] | LH | NetFPGA BSP | PRR | OpenStack, SAVI Testbed | N/A | HLS | Load Balancer | Xilinx Virtex-5
Catapult [3] | LH, RH | Memory Controller, Host Comm., Network Access | N/A | N/M | N/A | HDL | Ranking Portion of the Bing Web Search Engine | Intel Stratix V
FFHLS [47] | LH | N/A | N/A | OpenCL | Intermediate Fabrics | HLS | Computer Vision and Image Processing | Xilinx Virtex-6
HSDC [48] | OC, RH | Management Layer, Network Service Layer | PRR | OpenStack | N/A | HDL | N/M | Xilinx Zynq-7100
VFA Cloud [49] | LH | PCIe Controller, DMA | PRR | DyRACT Test Platform | N/A | N/A | MapReduce | Xilinx Virtex-7
RC3E [30] | LH, RH | Communication/Virtualization Infrastructure | PRR | N/A | N/A | HDL | Black-Scholes Monte Carlo Simulation | Xilinx Virtex-7
RRaaS [50] | LH | PCIe Controller | PRR | N/A | Soft Processor | HDL | Image Processing | N/M
Virt. Runtime [31] | LH | PCIe Controller, Run-time Manager | PRR | Multi-Threaded OS | N/A | DSL, HLS, HDL | Graph Algorithms: PageRank, Triangle Count and Outlier Detector | Xilinx Virtex-7
Ker-ONE [25] | OC | N/A | PRR | RTOS | N/A | N/A | DSP | Xilinx ZedBoard
RC2F [24] | LH, RH | Hypervisor, I/O Components | DPRR | N/A | N/A | HDL | N/M | Xilinx Virtex-7
VERCloud [51] | OC, LH | PCIe Controller, Run-time Manager | PRR | Multi-Threaded OS | N/A | DSL, HLS, HDL | Graph Algorithms: PageRank, Triangle Count and Outlier Detector | Xilinx Virtex-7
Feniks [22] | RH | Host Comm., Network/Storage Stack, Memory Controller | PRR | N/M | N/A | HDL | Data Compression Engine, Network Firewall | Intel Stratix V
Cost of Virt. [52] | LH | DDR3/PCIe Controller, Ethernet Core | N/A | N/A | NoC Emulation | HDL | DSP | Intel Arria 10
NFV [34] | OC, LH | PCIe Controller, DMA | PRR | N/M | MCU on Soft Processor | HDL | Context Switch | Xilinx Virtex-7
FPGAs-Cloud [28] | LH | Memory Controller | PRR | OpenCL | N/A | HLS | Network Functions | Xilinx Virtex-7
RACOS [53] | LH | PCIe Controller, DMA, ICAP | DPRR | User API, Kernel Driver | N/A | HDL, HLS | Edge and Motion Detection | Xilinx Virtex-6
AmorphOS [26] | LH | PCIe Controller, DMA, MMIO | DPRR | lib-AmorphOS | N/A | HDL | CNN, Memory Streaming, Bitcoin | Altera Stratix V GS, Xilinx UltraScale+
Elastic [23] | OC | AXI | DPRR | OpenCL | N/A | HLS | Scheduling Algorithm | TE8080 Board
FPGAVirt [54], [55] | LH | PCIe Controller | N/A | Virtio-Vsock | CGRA | HDL | NoC Emulation | Intel Stratix V
hCODE [56] | LH | PCIe Controller, Clock Generator | PRR | N/M | N/A | HDL | Sorter, AES, KMeans | Xilinx Virtex-7/Kintex-7
ViTAL [57] | RH | Latency-insensitive Interface, Address Translation | DPRR | N/A | N/A | HLS | Machine Learning | Xilinx UltraScale+

(Host, Shell, and Role form the hardware stack of the system architecture; Software and Overlay are the software and overlay stacks. N/A = Not Applicable; N/M = Not Mentioned; PM = Programming Model; LH = Local Host; RH = Remote Host; OC = On Chip)


• Local Host Interface is the most popular approach, which connects an FPGA computing device implemented as a daughter card to a local CPU host through a high-speed serial communication bus, such as PCI Express (PCIe) (see Fig. 1b) [26], [56].

• Remote Host Interface connects an FPGA computing device to a remote CPU host through the network, which allows the FPGA computing device to communicate with the CPU host node remotely (see Fig. 1c) [22], [58]. A remote host interface has the highest communication latency among all the interfaces. However, a remote host interface decouples an FPGA from a local host CPU, thus enabling the integration of standalone FPGA computing devices, and greatly improves the scalability of FPGA computing.

3.1.2 Shell

As mentioned in Section 2.3, a shell is the static region of FPGA hardware that contains the necessary glue logic and controllers for handling system memory access and the communications to a CPU host or other network devices. Thus, a shell should include, but is not limited to, a system memory controller, a host interface controller, and a network interface controller.

• System Memory Controller provides a user-friendly interface to the FPGA kernel for accessing the system memory of the FPGA computing device (e.g., a DDR SDRAM or high-bandwidth memory), which can be implemented either off-chip or on-chip (i.e., SoC design). In the existing literature, a system memory controller is also referred to as a DMA Engine [22], DMA Controller [49], DRAM Controller [29], DRAM Adapter [49], DDR3 Controller [52], or Memory Manager [51].

• Host Interface Controller enables the communication of an FPGA computing device with a CPU host through either an on-chip, local, or remote host interface. In the existing literature, a host interface controller is also referred to as a PCIe Module [28], PCIe Controller [34], or DMA Controller [32].

• Network Interface Controller provides a communication interface to a network (e.g., a cloud network), which facilitates the communication of an FPGA computing device with other network devices without the intervention of the host CPU. In the existing literature, a network interface controller is also referred to as a Network Stack [55], Network Service Layer [48], or Ethernet Core [52].

3.1.3 Role

As mentioned in Section 2, a role is the dynamic region of FPGA hardware reserved for application mapping at run time. There are two approaches to reconfigure a role, namely, flat compilation and partial reconfiguration. In flat compilation, the whole role is reconfigured as a single region and is exclusive to single-accelerator applications. In partial reconfiguration, a role can contain multiple DPR regions (DPRRs); a single DPRR can be reconfigured independently, or multiple DPRRs can be combined into one bigger DPRR and be reconfigured independently from the rest of the PRRs at run time. The use of multiple DPRRs enables application deployment with a flexible amount of FPGA fabric, and thus can achieve higher resource utilization with improved overall system performance [23]. Considering DPR, it is of great importance to reduce the reconfiguration time as much as possible, using techniques such as Intermediate Fabrics [59], to meet the applications' timing requirements.
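To make the idea of combining DPRRs concrete, the following C sketch shows one simple way a role manager could allocate either a single DPRR or a pair of adjacent free DPRRs to an accelerator that needs more fabric. It is a minimal illustration under assumed data structures (dprr_free, NUM_DPRRS) and a first-fit policy, not the allocation scheme of any surveyed framework.

  #include <stdio.h>

  #define NUM_DPRRS 4

  /* 1 = region is free, 0 = region is occupied by an accelerator. */
  static int dprr_free[NUM_DPRRS] = {1, 1, 1, 1};

  /* Allocate 'regions_needed' (1 or 2) adjacent DPRRs for one accelerator.
   * Returns the index of the first allocated region, or -1 on failure. */
  static int allocate_dprrs(int regions_needed)
  {
      for (int i = 0; i + regions_needed <= NUM_DPRRS; i++) {
          int all_free = 1;
          for (int j = 0; j < regions_needed; j++) {
              if (!dprr_free[i + j]) { all_free = 0; break; }
          }
          if (all_free) {
              for (int j = 0; j < regions_needed; j++) {
                  dprr_free[i + j] = 0;    /* mark as occupied */
              }
              return i;   /* caller now triggers DPR on these regions */
          }
      }
      return -1;  /* no suitable (combined) region: defer or time-multiplex instead */
  }

  int main(void)
  {
      int small = allocate_dprrs(1);   /* small accelerator: one DPRR          */
      int large = allocate_dprrs(2);   /* larger accelerator: two merged DPRRs */
      printf("small accelerator -> DPRR %d, large accelerator -> DPRRs %d-%d\n",
             small, large, large + 1);
      return 0;
  }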

3.2 Software Stack

The software stack refers to an OS, software applications, and frameworks running on a host CPU for the purpose of FPGA virtualization. A host CPU can be either on-chip, local, or remote, depending on the hardware stack. The abstraction of vFPGA is generally made available to users in the software stack. The software stack allows users to develop and deploy applications onto vFPGAs and manage vFPGA resources easily without specific knowledge of the underlying hardware. Specifically, the software stack provides users with software libraries and APIs to perform the communication between a host and an FPGA computing device and the provisioning and management of applications and physical FPGA resources. Table 1 shows a list of the software stacks used in the existing FPGA virtualization work in three categories: OS, host application, and software framework.

OS: Since the idea of FPGA virtualization is primarily adopted from OSs, the most common software stack of FPGA virtualization is an OS (e.g., Windows and Linux), often supporting multiple processes and threads [31], [51] for the management of multiple users and FPGA devices. In FPGA virtualization, both compile- and run-time management of FPGA resources are important factors to be considered. If the virtualization system is designed for a cloud or edge computing environment where real-time streaming data is used, OSs with real-time processing capability (e.g., a real-time operating system (RTOS) [25]) are preferable. There are specialized OSs particularly designed for reconfigurable computing and FPGAs. For example, ReconOS [60] is designed to support both software and hardware threads, as well as allow interaction between the threads in a reconfigurable computing system. Leap FPGA OS [61] and Operating System for Reconfigurable Systems (OS4RS) [40] are FPGA virtualization OSs for compile-time management of FPGA resources. RACOS [53] provides a simple and efficient interface for multiple user applications to access single- and multi-threaded accelerators transparently. AmorphOS [26], integrated as a user-mode library on a host CPU, provides system calls to manage Morphlets (i.e., vFPGAs) and enables the communication between host processes and Morphlets.

Host Application: Applications running on host CPUs are commonly written in C/C++ or OpenCL. OpenCL provides the development language for device kernels as well as the programming API for host applications, which makes it a good choice for FPGA virtualization [23], [28]. The programming API can be used to deploy, manage, and communicate with FPGA kernels. Vendors may provide SDKs for OpenCL (e.g., the Intel FPGA SDK for OpenCL [62]) that provide the API for a host OpenCL application to manage the execution of FPGA kernels. One important task of the host application is the management of communication between the host and the FPGA device. In [45], a software-middleware API, written in C++, is provided to enable the portability of application software code (i.e., host code) on various physical platforms. The software middleware is a virtualization layer between the application software and the platform API. The middleware API translates the virtual platform communication routines for the virtual components into native API calls to the physical platform. In [32], the concept of the virtual machine monitor (VMM) [63] used in OS virtualization is adopted in FPGA virtualization for the communication between a user and an FPGA device. The authors have modified the existing inter-domain (i.e., driver domain, Dom0, and unprivileged domain, DomU) communication in the Xen VMM using shared memory. In this way, multiple user processes can simultaneously access a shared FPGA with reduced overhead.
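As a hedged sketch of the host-side role described above, the following C fragment uses the standard OpenCL host API to load a pre-compiled FPGA kernel binary and launch the vector_add kernel shown earlier. The file name, kernel name, and omitted error handling are simplifying assumptions rather than the interface of any particular virtualization framework.

  #include <stdio.h>
  #include <stdlib.h>
  #include <CL/cl.h>

  /* Minimal host-side flow: select a device, load an offline-compiled
   * FPGA kernel binary, set arguments, and enqueue the kernel. */
  int main(void)
  {
      cl_platform_id platform;
      cl_device_id device;
      cl_int err;

      clGetPlatformIDs(1, &platform, NULL);
      clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

      cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
      cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

      /* FPGA flows typically use pre-compiled binaries rather than
       * just-in-time compilation from source. */
      FILE *f = fopen("vector_add.aocx", "rb");      /* illustrative file name */
      fseek(f, 0, SEEK_END);
      size_t size = (size_t)ftell(f);
      rewind(f);
      unsigned char *binary = malloc(size);
      fread(binary, 1, size, f);
      fclose(f);

      cl_program program = clCreateProgramWithBinary(
          context, 1, &device, &size, (const unsigned char **)&binary, NULL, &err);
      clBuildProgram(program, 1, &device, NULL, NULL, NULL);
      cl_kernel kernel = clCreateKernel(program, "vector_add", &err);

      /* Create device buffers and set kernel arguments (host data omitted). */
      const unsigned int n = 1024;
      cl_mem a = clCreateBuffer(context, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
      cl_mem b = clCreateBuffer(context, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
      cl_mem c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);
      clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
      clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
      clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);
      clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);

      size_t global_size = n;
      clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);
      clFinish(queue);
      return 0;
  }

In a virtualized setting, calls like these are what the software stack intercepts or wraps so that the same host code can target a vFPGA instead of a specific physical device.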

Software Framework: OpenStack [64] is an open-source platform for controlling compute, storage, and networking resources in cloud computing. Some of the FPGA virtualization systems developed for clouds and data centers utilize the OpenStack framework [29], [46], [48]. In [46], new modules are added to OpenStack compute nodes (i.e., physical machines composed of CPUs, memory, disks, and networking) in different layers (e.g., the hardware, hypervisor, and library layers) to support FPGA resource abstraction and sharing. In [48], an accelerator service is introduced in OpenStack to support network-attached standalone FPGAs. The FPGAs have software-defined-network-enabled modules that can connect to FPGA plugins in the accelerator service. Byma et al. [29] use a Smart Application on Virtual Infrastructure (SAVI) test framework with OpenStack to integrate FPGAs into the cloud and manage them like conventional VMs. In this work, the authors proposed to partition the FPGA into multiple PRRs and abstract each PRR as a Virtualized FPGA Resource (VFR). An agent application with a device driver is developed to manage the VFRs. The SAVI testbed controller in OpenStack provides the APIs for connecting the VFR agent with the driver, enabling the remote management of FPGA resources from a cloud server. Fahmy et al. [49] extend an existing FPGA test platform called DyRACT [65] to support multiple independent FPGA accelerators. A software framework and a hypervisor are implemented in DyRACT to connect with the communication interface in the static region of an FPGA. FPGAVirt [54], [55] leverages a communication framework named Virtio-vsock [66] to develop its software virtualization framework, in which Virtio-vsock provides an I/O virtualization framework and a client for the communication between VMs and FPGAs.

3.3 Overlay Architecture

An FPGA overlay, also known as an intermediate fabric [59], is a virtual reconfigurable architecture implemented on top of a physical FPGA fabric for providing a higher abstraction level [21] of FPGA resources. An overlay architecture serves as an intermediate layer between a user application and a physical FPGA. The physical architecture of an FPGA can vary significantly across different device families or vendors. Overlay architectures can bridge that gap by providing a uniform architecture on top of the physical FPGA. The overlay's application software is portable on any device that supports the targeted overlay architecture. This concept is analogous to the Java Virtual Machine [67], which allows the same byte-code to be executed on any Java-supported machine. Overlay architectures provide a much higher level of hardware abstraction for application mapping, which significantly reduces the compilation time and provides improved portability of application codes across different FPGA device families and vendors. Overlay architectures also provide application developers with better software design flexibility, as they can target an abstract computing architecture with a known instruction set. Furthermore, portability in FPGA resource management can be achieved using overlay architectures. Resource management techniques can vary significantly depending on the FPGA architecture and the implementation stack where the resource manager is implemented (Section 4.3). If the resource manager is implemented in the overlay stack, the same overlay architecture, and therefore the same techniques, can be applied to manage the PRRs of different FPGA devices. However, these benefits come at the cost of reduced performance and less efficient utilization of FPGA resources.

Configuration: An FPGA overlay can be either spatially configured (SC) or time-multiplexed (TM), depending on its run-time configurability. If an FPGA overlay has functional units with fixed assigned tasks, it is referred to as an SC overlay. If an FPGA overlay can change the operation of its functional units on a cycle-by-cycle basis, the overlay is referred to as a TM overlay. The interconnection between functional units in SC and TM overlays can be fixed or reconfigurable at run time, which can be structured as a Network-on-Chip (NoC). A previous survey paper [68] provides a comprehensive survey of TM overlays.
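The distinction between SC and TM overlays can be illustrated with a small behavioral model. The C sketch below models a single time-multiplexed functional unit whose operation is selected each cycle by a configuration word; the opcode encoding and the per-cycle schedule are purely illustrative assumptions, not the design of any surveyed overlay.

  #include <stdio.h>

  /* Per-cycle configuration word of a hypothetical TM functional unit. */
  typedef enum { OP_ADD, OP_SUB, OP_MUL, OP_PASS } fu_op_t;

  /* One functional unit evaluated once per cycle; in an SC overlay the
   * opcode would be fixed at configuration time instead of per cycle. */
  static int fu_eval(fu_op_t op, int a, int b)
  {
      switch (op) {
      case OP_ADD:  return a + b;
      case OP_SUB:  return a - b;
      case OP_MUL:  return a * b;
      default:      return a;      /* OP_PASS: forward operand a */
      }
  }

  int main(void)
  {
      /* The TM overlay switches the operation of the same physical
       * unit from cycle to cycle according to a stored schedule. */
      const fu_op_t schedule[4] = { OP_ADD, OP_MUL, OP_SUB, OP_PASS };
      int a = 6, b = 3;
      for (int cycle = 0; cycle < 4; cycle++) {
          printf("cycle %d: result = %d\n", cycle, fu_eval(schedule[cycle], a, b));
      }
      return 0;
  }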

Granularity: According to the granularity (at which hardware can be programmed) of an overlay architecture, SC and TM overlays can be further categorized into fine-grained and coarse-grained overlays [69]. Fine-grained overlays can operate at the bit level just like physical FPGAs but provide programmability and configurability at a higher level of abstraction than physical FPGAs. Firm-core virtual FPGA [42] is an early work in the area of fine-grained overlays, where a synthesizable VHDL model of a target FPGA fabric is developed. Two types of switch matrices developed using tri-state buffers and multiplexers provide the programmable interconnect between the custom CLB interfaces. Even though the overlay has huge hardware overhead, it can be used in applications where portability is more important than resource utilization. ZUMA [70] also provides bitstream portability across FPGAs of different vendors with a more optimized implementation than Firm-core virtual FPGA. ZUMA configures LUTRAMs to use them as the building blocks for configurable LUTs and routing multiplexers. A configuration controller is implemented to reprogram the LUTRAMs connected in a crossbar network. A previous survey paper in [71] presents a comprehensive survey of coarse-grained overlays. Coarse-grained overlays are implemented as coarse-grained reconfigurable arrays (CGRAs) [72] or processors. In a CGRA, arrays of PEs are connected via a 2D network. CGRAs can adopt different interconnect topologies (e.g., island style, nearest neighbor, and NoC).


TABLE 2: Summary of FPGA Virtualization Techniques for Abstraction, Multi-tenancy, Resource Management, and Isolation

Work | Abstraction Technique | IS | User Management | MT | Resource Management Technique | IS | Isolation Technique | IS
Virtual [19], [41] | N/A | N/A | N/A | N/A | Partitioning, Segmentation, Overlaying | SW, HW | N/A | N/A
Firm-core JIT [42] | N/A | N/A | N/A | N/A | Firm Core intermediate fabric | OL | N/A | N/A
Cray HPRC [43] | N/A | N/A | N/A | SM | Virtualization Manager, Virtual Memory Space, Message Queue | SW | N/A | N/A
OS4RS [40] | Unified communication interface between HW and SW | SW, HW | N/A | SM, TM | Virtual hardware in OS and unified communication mechanism with HW | SW, HW | N/A | N/A
NoC [44] | N/A | N/A | N/A | SM, TM | OS with run-time reconfiguration and virtual FPGA page table management | SW | N/A | N/A
Fabric [59] | Virtual abstract architecture using intermediate fabric | OL | N/A | TM | N/A | N/A | N/A | N/A
VirtualRC [45] | N/A | N/A | N/A | N/A | Virtual FPGA platform and software-middleware API | SW | N/A | N/A
HPRC [35] | N/A | N/A | N/A | TM | Virtualization Manager, Virtual Memory Space, Message Queue | SW | N/A | N/A
pvFPGA [32] | N/A | N/A | N/A | TM | Frontend and backend driver, device driver for FPGA accelerator with Xen VMM | SW | N/A | N/A
VDI [33] | N/A | N/A | N/A | SM | HW Task Manager, HW Control Library | SW | N/A | N/A
ReconZUMA [73] | ReconOS with ZUMA overlay | SW, OL | N/A | N/A | N/A | N/A | N/A | N/A
ACCL [46] | N/A | N/A | OpenStack controller | SM, TM | Hypervisor layer in SW and service layer on FPGA | SW, HW | N/A | N/A
VFR [29] | N/A | N/A | OpenStack controller | SM, TM | PRRs managed by OpenStack | SW, HW | N/A | N/A
HSDC [48] | N/A | N/A | OpenStack controller | SM | Network Manager and Accelerator Service Extension for OpenStack | SW, HW | N/A | N/A
VFA Cloud [49] | N/A | N/A | Software running on client machine | SM, TM | Hypervisor on cloud virtualization layer | SW | N/A | N/A
RC3E [30] | N/A | N/A | Reconfig. Common Cloud Computing Environment | SM, TM | Hypervisor on cloud for integration and management of virtual hardware | SW, HW | N/A | N/A
RRaaS [50] | N/A | N/A | Application for user request management | SM, TM | Controller/Hypervisor, NoC architecture of reconfigurable regions | SW, HW | N/A | N/A
Virt. Runtime [31] | N/A | N/A | User threads and thread manager | SM, TM | Run-time manager on on-board processor, FPGA threads | HW | N/A | N/A
Ker-ONE [25] | N/A | N/A | N/A | SM | Hypervisor on ARM processor and PRR monitor on FPGA | SW, HW | FI of VMs and DPRs using hypervisor on ARM processor | SW
RC2F [24] | N/A | N/A | N/M | SM, TM | Hypervisor on cloud for integration and management of virtual hardware | SW, HW | N/A | N/A
VERCloud [51] | N/A | N/A | Multi-threading on host CPU | SM, TM | Run-time manager on soft processor | OL | N/A | N/A
Feniks [22] | Identical virtual I/O interface | SW, HW | N/M | SM | OS on FPGA, host agent connected to centralized controllers in cloud | SW, HW | PI by separation of application and OS | HW
NFV [34] | N/A | N/A | Network requests on host OS | SM, TM | Context Manager implemented as MCU on soft/SoC processor | HW, MW | N/A | N/A
FPGAs-Cloud [28] | N/A | N/A | VM and hypervisor on host | SM | Hypervisor in static region of FPGA | HW | N/A | N/A
Virt. Security [74] | N/A | N/A | N/A | N/A | N/A | N/A | FI using unique overlay | OL
Cost of Virt. [52] | N/A | N/A | N/A | N/A | Avalon interconnect for managing PRRs | HW | N/A | N/A
FPGAVirt [54], [55] | NoC overlay architecture | OL | Virtio-vsock client running on guest OS | SM, TM | FPGA Management Service, Mapping Table, NoC overlay architecture | SW, OL | FI using hardware sandbox | HW
hCODE [56] | N/A | N/A | Scheduler on host | SM | Multi-channel PCIe module to manage PRRs | HW | N/A | N/A
Elastic [23] | N/A | N/A | N/A | SM, TM | Run-time resource manager for virtualizing resource footprint of OpenCL kernels | SW | N/A | N/A
AmorphOS [26] | Morphlet, which encapsulates user FPGA logic | SW, HW | N/A | SM, TM | Zone manager on host CPU | SW | FI and PI using resource allocation policy and HW arbiter | SW, HW
ViTAL [57] | Layer between physical FPGAs and compilation layer | SW, HW | N/A | SM, TM | Hypervisor and system controller | SW | FI using runtime management policy | SW
SharedMem [75] | N/A | N/A | VM and hypervisor on host | SM, TM | Hypervisor and hardware monitor for managing accelerators | SW, HW | FI between accelerators using page table slicing | SW, HW

(N/A = Not Applicable; N/M = Not Mentioned; IS = Implementation Stack; MT = Multiplexing Technique; SM = Spatial Multiplexing; TM = Time Multiplexing; FI = Functional Isolation; PI = Performance Isolation; SW = Software; HW = Hardware; OL = Overlay)


Fig. 2: System architecture for an exemplar FPGA virtualization system. A host CPU running a host OS with multiple guest OSs (each with a communication client) and a resource manager is connected via PCIe to an FPGA whose hardware comprises a shell, a role, and an overlay.

TABLE 3: Description of the system architecture stacks of the example FPGA virtualization system in Figure 2

Stack | Description
HW | The host interface is a local host interface where the CPU is connected to the FPGA via PCIe. The shell is the static region that contains the board support package (BSP). The role is the reconfigurable region of the FPGA; here, an overlay architecture is implemented on the reconfigurable region.
SW | A host OS on the CPU hosts multiple virtual machines with guest OSs; users in a guest OS can make requests to use the FPGA. A resource management software receives requests from the guest OSs and communicates with the FPGA to fulfill the requests; it also keeps track of the available FPGA resources. A communication client manages communication between a guest OS and the resource manager.
OL | A CGRA-based overlay that creates an abstraction of the reconfigurable region in the physical FPGA.

The 2D mesh structures in the island-style and nearest-neighbor topologies have similarities to the interconnect on FPGAs. These interconnects support direct communication between adjacent PEs. However, the communication flexibility comes at the cost of additional FPGA resource usage. In the NoC topology, the PEs are connected in a network and communicate via routers. NoC-based CGRA architectures are becoming popular in FPGA virtualization due to the flexible communication between PEs at run time.

Overlay architectures have been discussed in the literature for a long time, even before the concept of FPGA virtualization. The primary purpose of overlay architectures has been to improve design productivity and reduce the compilation time of FPGA designs. FPGA overlays, especially CGRA-based overlays, have recently caught a lot of attention for being used for FPGA virtualization. In a recent work [55], the authors present an NoC-based overlay architecture [76] that offers flexible placement of hardware tasks with high-throughput communication between PEs. The architecture has a torus network equipped with data packet communication and a high-level C++ library for accessing overlay resources. Adjacent PEs form a sub-network, in which they communicate directly with each other, and the inter-subnetwork communication takes place via routers.

Figure 2 shows a block diagram of an example FPGA virtualization system, generalized from the system in [55]. A description of the different stacks of the system architecture is shown in Table 3.

4 OBJECTIVES OF FPGA VIRTUALIZATION

As discussed in Section 2, the primary objectives of FPGA virtualization are abstraction, multi-tenancy, resource management, and isolation. In this section, we summarize how these four key objectives are achieved in the literature.

4.1 Abstraction

The first objective of FPGA virtualization aims to make FPGAs more usable by creating abstractions of the physical hardware to hide the hardware specifics from application developers and the OS as well as provide applications with simple and familiar interfaces to access the virtual resources. Table 2 lists a summary of FPGA virtualization techniques for creating abstraction, which can be categorized into overlay-architecture and OS-based approaches.

As discussed in Section 3.3, an overlay architecture provides a known instruction-set architecture that can be targeted by software developers to create and run applications for FPGAs. The usability benefit comes with a penalty of performance drop and less efficient FPGA resource usage. This issue can be mitigated by using a database of overlay architectures and selecting the optimized one for each application [36]. Overlay architecture also benefits usability by significantly reducing the compilation time of application code and by providing homogeneity in abstraction across FPGA devices from different vendors. For example, the same overlay architecture implemented on Intel and Xilinx FPGAs provides exactly the same abstraction even though the underlying hardware architecture is different.

OSs used in FPGA virtualization provide software interfaces for easier deployment and access to FPGA applications. Some examples of work in this area are Feniks OS [22], ReconOS [60], RACOS [53], OS4RS [40], and AmorphOS [26]. Feniks OS extends the shell region of an FPGA to implement the OS, which provides an abstracted interface to developers for application management. In OS4RS, the concept of a device node is proposed, where multiple applications on an FPGA can be accessed using a single node. RACOS provides users with a simple interface to load/unload accelerators and presents the I/O transparently to the users. AmorphOS's OS interface exposes APIs to load and deplete Morphlets (i.e., vFPGAs), and to read and write data over the transport layer to the physical FPGAs configured with Morphlets. The abstraction created for FPGA virtualization not only facilitates application development but also enables better application access. In [46], an accelerator marketplace concept is presented where users can buy and use accelerator applications. The marketplace is implemented on top of an FPGA virtualization layer integrated into OpenStack. The underlying virtualization layer hides the complexity of the hardware to make it extremely easy to develop, manage, and use FPGA applications.
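As an illustration of the kind of interface such OS layers expose, the sketch below defines a hypothetical load/unload/read/write API for vFPGAs. All names and behaviors here are invented for illustration; they do not correspond to the actual APIs of Feniks OS, ReconOS, RACOS, OS4RS, or AmorphOS.

```python
# Hypothetical OS-style vFPGA interface (names invented for illustration).
class VirtualFPGA:
    def __init__(self, vfpga_id):
        self.vfpga_id = vfpga_id
        self.accelerator = None  # currently loaded accelerator image


class FPGAOSInterface:
    """Minimal abstraction layer: applications see vFPGA handles,
    not PRRs, bitstream formats, or the host interface."""

    def __init__(self, num_vfpgas):
        self._free = [VirtualFPGA(i) for i in range(num_vfpgas)]
        self._in_use = {}

    def load(self, accelerator_image):
        """Allocate a vFPGA and configure it with the given accelerator."""
        if not self._free:
            raise RuntimeError("no vFPGA available")
        vf = self._free.pop()
        vf.accelerator = accelerator_image
        self._in_use[vf.vfpga_id] = vf
        return vf.vfpga_id  # handle returned to the application

    def unload(self, handle):
        """Release the vFPGA so it can be reused by other applications."""
        vf = self._in_use.pop(handle)
        vf.accelerator = None
        self._free.append(vf)

    def write(self, handle, data):
        # In a real system this would go over PCIe/DMA to the accelerator.
        print(f"vFPGA {handle} <- {len(data)} bytes")

    def read(self, handle, nbytes):
        print(f"vFPGA {handle} -> {nbytes} bytes")
        return bytes(nbytes)


# Usage: the application never touches hardware details directly.
os_if = FPGAOSInterface(num_vfpgas=2)
h = os_if.load("matmul_accelerator.bit")
os_if.write(h, b"\x00" * 64)
result = os_if.read(h, 64)
os_if.unload(h)
```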

4.2 Multi-Tenancy

The second objective of FPGA virtualization aims to share the same FPGA resources among multiple users and applications. Since an FPGA virtualization infrastructure necessarily supports multiple applications, we narrow our discussion in this subsection to multiple users. Therefore, single/multiple tenancy refers to whether or not an FPGA virtualization system supports resource sharing across multiple users.

For multi-tenancy support, it is required to have either spatial or temporal multiplexing. In spatial multiplexing, by utilizing PR techniques, an FPGA device is partitioned into multiple PRRs such that each PRR can be reconfigured independently for application mapping without affecting other regions. The reconfiguration can be done either statically or dynamically at run time. In temporal multiplexing, there is only one PRR, which is reconfigured over time, and different applications are allocated in different time intervals. The reconfiguration time of the PRR should be reduced as much as possible to maintain efficient resource sharing.
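The contrast between the two multiplexing modes can be sketched as a simple allocator, shown below. The data structures and policies are illustrative assumptions only: spatial multiplexing hands out independent PRRs to concurrent applications, whereas temporal multiplexing time-shares a single PRR and pays a reconfiguration penalty on every application switch.

```python
# Illustrative sketch of the two multiplexing modes for multi-tenancy.

class SpatialMultiplexer:
    """Each application gets its own PRR; PRRs are reconfigured independently."""

    def __init__(self, num_prrs):
        self.free_prrs = list(range(num_prrs))
        self.assignment = {}  # app -> PRR

    def allocate(self, app, bitstream):
        if not self.free_prrs:
            return None  # no PRR left; the caller may queue or fall back to time-sharing
        prr = self.free_prrs.pop()
        self.assignment[app] = prr
        print(f"[spatial] partial-reconfigure PRR {prr} with {bitstream} for {app}")
        return prr

    def release(self, app):
        self.free_prrs.append(self.assignment.pop(app))


class TemporalMultiplexer:
    """A single PRR is reconfigured over time; apps run in successive time slots."""

    def __init__(self, reconfig_ms):
        self.reconfig_ms = reconfig_ms  # assumed fixed reconfiguration cost
        self.current = None

    def run_slot(self, app, bitstream, slot_ms):
        overhead = 0
        if self.current != app:
            overhead = self.reconfig_ms  # reconfiguration cost paid on every switch
            print(f"[temporal] reconfigure PRR with {bitstream} ({overhead} ms)")
            self.current = app
        print(f"[temporal] {app} runs for {slot_ms} ms")
        return overhead + slot_ms


spatial = SpatialMultiplexer(num_prrs=2)
spatial.allocate("app_a", "a.bit")
spatial.allocate("app_b", "b.bit")

temporal = TemporalMultiplexer(reconfig_ms=4)
total = sum(temporal.run_slot(a, f"{a}.bit", 10) for a in ("app_a", "app_b", "app_a"))
print("temporal makespan (ms):", total)  # reconfiguration overhead grows with frequent switching
```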

The user management components are generally implemented in the software stack. Table 2 shows a list of the user management techniques used in the literature. The most common methods in the literature to support multiple users include utilizing an OpenStack control node [29], [46], [48], a Virtio-vsock client running on a guest OS [54], [55], and multi-threading on a host CPU [31], [51].

In [29], the OpenStack scheduler selects a resource and contacts its associated agent, which is a separate process beside the hypervisor, upon a request for a vFPGA from a user. When OpenStack requests the PRR from the agent, the agent commands the hypervisor to boot a VM with the user-defined OS image and parameters. In [46], a control node manages requests from multiple users and launches VMs on physical machines. Subsequently, different users access their associated VMs and deploy their applications. In [48], when a user requests a vFPGA, the accelerator scheduler searches a resource pool to find a PRR matching the user request. If successful, a user ID and an IP address will be configured for the vFPGA. Subsequently, the vFPGA ID, the IP address, and the files required to generate a bitstream for the user application will be returned to the user.

In [54] and [55], utilizing an overlay architecture enables assigning multiple vFPGAs to a single FPGA or to several FPGAs. Each user accesses a guest OS installed on a VM. Upon a user request, the corresponding VM requests FPGA resources, and the hypervisor looks for an available PRR. If one is available, memory space is allocated to the vFPGA for storing data.

In [31] and [51], a run-time manager that provides hardware-assisted memory virtualization and memory protection allows multiple users to simultaneously execute their applications on an FPGA device. A user thread on a host CPU sends the specifications of the vFPGAs and the code to be run on a local processor to the manager thread, which deploys them to an FPGA. On the FPGA side, a run-time manager allocates the resources required by the FPGA application, creates an FPGA user thread with the code for the local processor, instantiates a vFPGA, and notifies the host when the FPGA application terminates. The host user thread can then retrieve the output data from the FPGA and either send new data to be processed or deallocate the FPGA application.
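A minimal sketch of this host-side flow is given below, using Python queues and threads to stand in for the host-to-FPGA transport. The message names and fields are assumptions made for illustration and are not the actual protocol of [31], [51].

```python
import queue
import threading

# Channels standing in for the host<->FPGA transport (PCIe in the real system).
to_fpga = queue.Queue()
to_host = queue.Queue()


def fpga_runtime_manager():
    """Toy stand-in for the on-FPGA run-time manager: allocate resources,
    'run' the application, and notify the host on termination."""
    req = to_fpga.get()
    print(f"[fpga] allocating resources for vFPGA spec {req['vfpga_spec']}")
    output = [x * 2 for x in req["input_data"]]  # pretend accelerator computation
    to_host.put({"status": "done", "output": output})


def host_user_thread():
    """Host user thread: send the vFPGA spec, code, and data, then retrieve results."""
    to_fpga.put({
        "vfpga_spec": {"prr_size": "medium"},   # assumed fields, for illustration only
        "local_processor_code": "kernel.bin",
        "input_data": [1, 2, 3],
    })
    reply = to_host.get()  # blocks until the FPGA side notifies completion
    print(f"[host] accelerator finished, output = {reply['output']}")
    # At this point the host could send new data or deallocate the FPGA application.


t = threading.Thread(target=fpga_runtime_manager)
t.start()
host_user_thread()
t.join()
```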

4.3 Resource Management

The third objective of FPGA virtualization aims to facilitate the transparent provisioning and management of FPGA resources for workload balancing and fault tolerance. The management of FPGA resources refers to the management of the PRRs. For a virtualization system with a single FPGA, resource management primarily involves configuring the PRRs by allocating and deallocating bitstreams. In a virtualization system where multiple FPGAs are connected via a network, routing information among multiple FPGAs, configuring the PRRs of multiple FPGAs with bitstreams, and scheduling accelerators can also be labeled as resource management tasks. Table 2 lists a summary of the FPGA virtualization techniques for resource management, which are discussed below according to which stack (software, hardware, overlay) of the system architecture they are implemented in.

4.3.1 Resource Management in the Software Stack

The primary ideas of the resource management approaches implemented in the software stack are borrowed from OS, CPU, and memory virtualization. Partitioning methods in memory systems (e.g., paging and segmentation) are used to divide and keep track of the FPGA resource utilization. The OS in [44] uses virtual FPGA page tables and a reconfiguration manager to manage the configurations of identical PRRs connected with NoC-based interconnects. While the virtual page table keeps track of the resource utilization of the PRRs, the run-time reconfiguration manager swaps configurations in and out of the PRRs. Other than partitioning and page tables, the concept of a hypervisor in a virtual machine has also been widely used in FPGA virtualization for resource management. A hypervisor is generally part of the host machine and is used for the management of and communication with hardware resources on an FPGA. Fahmy et al. [49] implement a hypervisor as a part of the server-side software stack in a cloud-based architecture. The hypervisor communicates with the PRRs via a communication interface implemented in the FPGA static region. The hypervisor is responsible for maintaining a list of PRRs, configuring vFPGAs by selecting optimal PRRs from that list, as well as allocating vFPGAs to users. In [75], the authors propose a hypervisor named Optimus for a shared-memory FPGA platform. The hypervisor provides scheduling of VMs on a pre-configured FPGA with temporal multiplexing to share a single accelerator with multiple VMs as well as spatial multiplexing to share the FPGA among multiple accelerators. Optimus receives requests from host applications as interrupts and communicates with accelerators using a hardware monitor implemented in the shell. Knodel et al. present a cloud-based hypervisor called RC3E in [30], which integrates and manages FPGA-based hardware accelerators in the cloud. The resource manager, termed the FPGA device manager in RC3E, configures vFPGAs using bitfiles from a database and monitors the accelerators accessed by the user virtual machines. The Reconfigurable Common Computing Framework (RC2F) used in RC3E was extended by the authors in [24]. In this work, the authors present one physical FPGA as multiple vFPGAs and manage the vFPGAs using their custom FPGA hypervisor. The FPGA hypervisor manages the states of the vFPGAs, as well as reconfigures them using the internal configuration access port (ICAP). In SoC-based FPGAs, the hypervisor can run in an OS running on an on-chip CPU. For example, in [25], a hypervisor implemented on an ARM processor detects the requests from different VMs and programs the PRRs dynamically. Some of the prior work utilizes existing software frameworks such as OpenStack or the OpenCL run-time to manage FPGA resources, which is discussed in Section 3.2.

4.3.2 Resource Management in the Overlay Stack

Similar to the SW stack, the resource managers in the overlay stack are also implemented in software. However, they are implemented inside the FPGA and are, therefore, tightly coupled with the PRRs in the FPGA. In contrast, the resource managers in the SW stack discussed in Section 4.3.1 are implemented in the host outside of the FPGA and are loosely coupled with the PRRs they manage. In [51], a soft-processor-based overlay runs a run-time manager or hypervisor for FPGA resource management in a similar way to how an OS on a host machine manages the hardware resources. The run-time manager can handle service requests as interrupts from the accelerators as well as from the host. The interrupts from accelerators are received using a custom-designed message box module, while the interrupts from the host are received using PCIe. In [34], the resource manager is implemented as a microcontroller unit (MCU) running on a soft-processor-based overlay, which acts as a scheduler and context manager of accelerators. Context management within the same accelerator is handled using job queues, while multiple accelerators are managed using a job scheduler. Other than processor-based overlays, some of the CGRA-based overlays are specifically designed for resource management. For example, NoC-structured CGRA-based overlay architectures [76] are designed for better placement of applications into PRRs. The overlay abstracts the PRRs as connected PEs in a 2D torus topology with high-throughput communication between the PEs using routers. Placement of applications in adjacent PEs can be done directly, whereas placement to distant PEs is done using the routers.
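The job-queue and job-scheduler arrangement described for the MCU-based manager in [34] can be sketched as follows; the round-robin policy and the data structures are illustrative assumptions rather than the published design.

```python
from collections import deque

# Illustrative sketch: per-accelerator job queues plus a round-robin job scheduler,
# loosely modeled on the MCU-based context manager described in [34].

class AcceleratorContext:
    def __init__(self, name):
        self.name = name
        self.jobs = deque()  # context management within one accelerator: FIFO job queue

    def submit(self, job):
        self.jobs.append(job)


class OverlayScheduler:
    """Round-robin across accelerators; one job per accelerator per scheduling pass."""

    def __init__(self, accelerators):
        self.accelerators = accelerators

    def run(self):
        while any(acc.jobs for acc in self.accelerators):
            for acc in self.accelerators:
                if acc.jobs:
                    job = acc.jobs.popleft()
                    print(f"dispatch {job} on accelerator {acc.name}")


fft = AcceleratorContext("fft")
conv = AcceleratorContext("conv")
for i in range(2):
    fft.submit(f"fft-job-{i}")
    conv.submit(f"conv-job-{i}")

OverlayScheduler([fft, conv]).run()
```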

4.3.3 Resource Management in the Hardware Stack

In the hardware stack, a resource manager is generally implemented as part of the static region. In this case, the resource manager is in charge of the communication with a host machine using PCIe and configuration access ports (e.g., ICAP and PCAP in Xilinx FPGAs [77]). In [56], the authors build a multi-channel shell with multiple independent PCIe channels for managing multiple accelerators. A command-line tool is used to load bitstreams from a repository and send them over the PCIe channels to program the vFPGAs. In [25], a virtual device manager on a host soft processor sends configuration requests to a separate resource manager in the static region. The resource manager downloads bitstreams using the PCAP interface and programs the PRRs via interconnects between the resource manager and the PRRs. Similar to the software stack, a hypervisor can be implemented in the hardware stack. Kidane et al. [50] propose a hypervisor implemented in the static region of an FPGA to manage vFPGA resources. The hypervisor is responsible for the selection of a bitstream from a library and the placement of the bitstream to an appropriate vFPGA based on the size of the design. In [46], the resource manager is implemented in the static region and provides a communication interface with a hypervisor in the software layer as well as an interface for managing PRRs. The resource manager fetches requests from the software stack from a job queue and programs the PRRs. It also uses a direct memory access (DMA) engine to context-switch and to manage data to/from accelerators. Tarafdar et al. [28] use the Xilinx SDAccel platform [78] as a hypervisor implemented on the static region, which provides the interfaces to memory, PCIe, and Ethernet. The hypervisor can program the PRRs once it receives configuration requests from the OpenCL API via PCIe. Custom state machines or hardware controllers can also be used for resource management; custom-designed hardware controllers can leverage the DMA transfer of pre-stored bitstreams in off-chip memory for faster configuration. In DyRACT [65], resource management is done by two custom state machines, namely the configuration state machine and the ICAP state machine. The configuration state machine receives configuration requests from PCIe and writes the bitstream to memory, while the ICAP state machine reads the bitstream from memory and programs the FPGA.
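The division of labor between the two DyRACT state machines can be modeled as the producer/consumer pair sketched below; the states, the memory buffer, and the hand-off are simplified assumptions and not the actual hardware of [65].

```python
# Simplified software model of the two-state-machine arrangement in DyRACT [65]:
# one state machine receives a bitstream over the host link and stages it in memory,
# the other reads it back and feeds the configuration port (ICAP).

memory = {}           # stands in for on-board memory used to stage bitstreams
icap_programmed = []  # stands in for regions programmed through ICAP


def configuration_state_machine(request):
    """IDLE -> RECEIVE -> WRITE_MEM: stage the incoming bitstream in memory."""
    addr = request["mem_addr"]
    memory[addr] = request["bitstream"]
    print(f"[config FSM] wrote {len(request['bitstream'])} bytes to memory @ {addr:#x}")
    return addr  # hand-off: tell the ICAP FSM where the bitstream was staged


def icap_state_machine(addr, target_prr):
    """IDLE -> READ_MEM -> PROGRAM: stream the staged bitstream into the ICAP."""
    bitstream = memory[addr]
    icap_programmed.append((target_prr, len(bitstream)))
    print(f"[ICAP FSM] programmed PRR {target_prr} with {len(bitstream)} bytes")


# A configuration request arriving over PCIe (fields assumed for illustration).
req = {"mem_addr": 0x1000, "bitstream": b"\xAA" * 256, "target_prr": 1}
staged_at = configuration_state_machine(req)
icap_state_machine(staged_at, req["target_prr"])
```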

FPGA resource management approaches implemented in the software stack benefit from the convenience of software development and the reuse of existing software frameworks. However, the communication between a host CPU and an FPGA can become a bottleneck and hurt the management performance. As different FPGA vendors have different techniques for managing the PRRs in their FPGA devices, handling the management of PRRs with overlay architectures has the benefit of allowing the resource management approaches to be compatible with a variety of FPGA devices across different vendors. However, such a benefit comes with penalties on performance and resource utilization efficiency. In contrast, implementing resource management utilities in the hardware stack increases the management performance but loses this compatibility and reduces the available FPGA resources for application mapping.

4.4 Isolation

The fourth objective of FPGA virtualization aims to provide strict performance and data isolation among different users and applications to ensure data security and system resilience. Table 2 summarizes the papers that address the isolation objective in FPGA virtualization.

If an FPGA virtualization system supports multi-tenancy, each tenant in the system will have shared access to the available hardware resources on the FPGA, e.g., one or multiple PRRs, a set of I/O resources, and on-chip or off-chip memory resources. Therefore, it is essential to make sure that each tenant only has access to its own resources at any specific time and is not able to access or manipulate the data of other tenants nor interrupt the execution of their accelerators.

From the perspective of the hardware stack, at least three types of isolation are required for FPGA virtualization systems: functional isolation, performance isolation, and fault isolation.

Functional isolation ensures that different vFPGAs can be reconfigured and managed independently without affecting the functionality and operation of the others. The DPR capability of modern FPGAs supports the independent reconfiguration of a PRR at run time while the other PRRs are active, providing the fundamental support for realizing functional isolation. Functional isolation also ensures that each user or application only has access to its assigned hardware resources. In FPGA virtualization systems, memory and I/O resources are often shared among multiple users and/or applications in a time-multiplexed fashion. Therefore, safe memory and I/O access needs to be ensured to guarantee the integrity of functionality and avoid data conflicts. Existing locking mechanisms such as mutexes and semaphores [51] can be used to ensure safe memory access. Moreover, when multiple applications share an FPGA, strict isolation among their I/O channels is vital to ensure the correct transmission of data. Some FPGA vendors (e.g., Xilinx) have proposed the isolation design flow [79] to ensure the independent execution of multiple applications on the same FPGA device. In [75], the hypervisor designed by the authors provides memory and I/O isolation using a technique called page table slicing. In page table slicing, each guest application and its accelerators use their guest virtual addresses to interact with the I/O address space of the DRAM on the FPGA device. The hypervisor partitions the I/O address space among accelerators so that each accelerator has a unique memory and I/O address space that guest applications can access. The hypervisor maintains an offset table for address translation between the guest virtual addresses and the I/O address space. The FPGA virtualization frameworks proposed in [54], [55] utilize a Hardware Sandbox (HWSB) for the isolation of vFPGAs based on their overlay architecture [76]. The HWSB adds the identifier of a target vFPGA to a data packet during communication over the overlay's interconnection routers among different vFPGAs belonging to the same VM, which prevents unauthorized access to data [54]. The work in [74] utilizes virtual architectures (i.e., overlays) to safeguard FPGAs from bitstream tampering. In this work, an application-specialized overlay is created using an overlay generator. Then, for each vFPGA that deploys the same application as other vFPGAs, the generated overlay bitfile is modified to create a unique overlay. This process is called uniquification and improves bitstream security. In [25], the functional isolation between VMs and PRRs is ensured by implementing a custom VMM on an ARM processor. Each VM has its dedicated software tasks and isolated virtual resources managed by the VMM.
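The offset-table translation at the heart of page table slicing can be sketched as below; the slice size, the table layout, and the bounds check are illustrative assumptions rather than the actual Optimus implementation [75].

```python
# Illustrative sketch of page-table-slicing-style address translation:
# each accelerator owns a disjoint slice of the device I/O address space,
# and an offset table maps guest virtual addresses into that slice.

SLICE_SIZE = 0x10000  # assumed size of each accelerator's I/O slice

# Offset table maintained by the hypervisor: accelerator id -> base of its slice.
offset_table = {
    0: 0x0000_0000,
    1: 0x0001_0000,
    2: 0x0002_0000,
}


def translate(accel_id, guest_vaddr):
    """Translate a guest virtual address into the accelerator's private I/O range."""
    if guest_vaddr >= SLICE_SIZE:
        raise PermissionError("access outside the accelerator's slice")  # isolation check
    return offset_table[accel_id] + guest_vaddr


# Two guests use the same guest virtual address but land in disjoint I/O regions.
print(hex(translate(1, 0x40)))  # 0x10040
print(hex(translate(2, 0x40)))  # 0x20040
```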

Performance isolation refers to the capability of limiting the temporal interference among multiple vFPGAs running on the same physical FPGA, so that events on one vFPGA have little impact on the performance of the other vFPGAs. For example, in cloud-based FPGA virtualization, when different user requests are handled dynamically by different vFPGAs at run time, the performance of each vFPGA should stay stable as long as the peak processing capability of the physical FPGA is not reached. Feniks OS [22] provides performance isolation in the role by using PR techniques, as PR disables the interconnections and logic elements on the boundary of the PRRs; therefore, adjacent PRRs are physically unable to affect each other. In addition, this work separates the OS from the application region by implementing the OS in the shell.

Fault isolation refers to the capability of limiting the negative impacts of software faults or hardware failures on the operation of vFPGAs for system resilience. The hardwired PCIe controllers and the DPR capability of modern FPGAs inherently provide hardware-level support for fault isolation. In addition, system-level fault recovery and fallback mechanisms are required to quickly recover vFPGA services and further improve the system resilience. In [3], the authors introduced a protocol to ensure fault isolation. The protocol can reconfigure groups of FPGAs or remap services robustly, remap the FPGAs to recover from failures, and report a vector of errors to the management software to diagnose problems.

Integrating FPGAs into a multi-tenant environment makes them vulnerable to remote software-based power side-channel attacks [80], [81]. In such an attack, an on-chip power monitor built from ring oscillators [82] can be programmed on a region dedicated to an attacker on a shared FPGA, and the power monitor can observe the power consumption of the other regions on the FPGA. The observed power consumption could reveal sensitive information, such as bit values in an RSA crypto engine. Moreover, a malicious application could cause delay faults in a victim region by disturbing the power network through aggressive power consumption. Unfortunately, there has not been much work aiming to address such security issues. To adopt the multi-tenancy techniques in FPGA virtualization systems discussed in Section 4.2, these security issues need to be addressed carefully in future research.

Although isolation is an important topic for the practical application of FPGA virtualization systems, especially in the context of multi-tenant cloud and edge computing, there has not been much work focusing on this topic in the existing literature. Future research in this area, especially on fault isolation in FPGA virtualization, is critically needed.

5 CONCLUSION AND FUTURE TRENDS

This paper provides a survey of the system architectures and various techniques for FPGA virtualization in the context of cloud and edge computing, which is intended to help future researchers efficiently learn about FPGA virtualization research.

From a careful review of the existing literature, we identified two research topics in FPGA virtualization that need extra attention in future research. First, we find that the hardware boundary of a vFPGA is limited to a PRR in a single FPGA in most of the existing work, which has limited the scope of FPGA virtualization. Ideally, FPGA virtualization should completely break the hardware boundary of FPGAs such that a physical FPGA can be used as multiple vFPGAs, and multiple physical FPGAs (connected on-package, on-board, or via a network) can also be used as a single vFPGA. More general FPGA virtualization approaches for leveraging multi-FPGA systems or networked FPGA clusters shall be explored across different system stacks in future research. Additionally, although the isolation aspect of FPGA virtualization is of great importance for practical applications, this topic of research is currently under-explored. More functional, performance, and fault isolation approaches for FPGA virtualization shall be explored in future research to ensure data security and system resilience.

Moreover, there has been a trend of designing specialized FPGA virtualization frameworks for commonly used artificial intelligence and deep learning applications such as deep neural networks (DNNs) [58]. Most of the existing work only focuses on the performance and usability perspectives of the applications, such as enabling fast mapping from application code to FPGA executables and creating a useful abstraction of FPGAs for compilers. For example, the solutions in [58], [83], [84] focus on enabling the acceleration of a large variety of DNN models using instruction-set-architecture-based methods to avoid the overhead of the traditional full compilation of FPGA designs. The existing works are either mainly focused on the performance optimization [83], [84] of static DNN workload execution, or they are not scalable since they do not support multi-FPGA scenarios in cloud environments [57]. Moreover, the issue of isolation while sharing resources is ignored in the existing work. The work in [58] provides physical and performance isolation of FPGAs and tries to address functional isolation by assigning separate hardware resource pools to different users. Future research needs to address the scalability issue by supporting multi-FPGA virtualization. Furthermore, in-depth analysis and research are needed to provide better functional isolation while sharing the physical resources in multi-tenant FPGAs.

REFERENCES

[1] S. Biookaghazadeh, M. Zhao, and F. Ren, "Are FPGAs suitable for edge computing?" in USENIX Workshop on Hot Topics in Edge Computing (HotEdge 18), 2018.

[2] Amazon.com Inc. (2017) F1 instances: Run custom FPGAs in theAWS cloud. [Online]. Available: https://aws.amazon.com/ec2/instance-types/f1

[3] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray et al., "A reconfigurable fabric for accelerating large-scale datacenter services," in 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA). IEEE, 2014, pp. 13–24.

[4] R. Tessier and W. Burleson, “Reconfigurable computing for digitalsignal processing: A survey,” Journal of VLSI signal processingsystems for signal, image and video technology, vol. 28, no. 1-2, pp.7–27, 2001.

[5] R. Woods, J. McAllister, G. Lightbody, and Y. Yi, FPGA-basedimplementation of signal processing systems. John Wiley & Sons,2008.

[6] D. Markovic and R. W. Brodersen, DSP architecture design essentials.Springer Science & Business Media, 2012.

[7] M. Vestias and H. Neto, “Trends of CPU, GPU and FPGA for high-performance computing,” in 2014 24th International Conference onField Programmable Logic and Applications (FPL). IEEE, 2014, pp.1–6.

[8] E. Bank-Tavakoli, S. A. Ghasemzadeh, M. Kamal, A. Afzali-Kusha,and M. Pedram, “Polar: A pipelined/overlapped fpga-based lstmaccelerator,” IEEE Transactions on Very Large Scale Integration (VLSI)Systems, vol. 28, no. 3, pp. 838–842, 2019.

[9] R. Dorrance, F. Ren, and D. Markovic, “A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas onFPGAs,” in Proceedings of the 2014 ACM/SIGDA international sym-posium on Field-programmable gate arrays, 2014, pp. 161–170.

[10] A. Munshi, “The opencl specification,” in 2009 IEEE Hot Chips 21Symposium (HCS). IEEE, 2009, pp. 1–314.

[11] M. Riera, E. B. Tavakoli, M. H. Quraishi, and F. Ren, “Halo1.0: A hardware-agnostic accelerator orchestration frameworkfor enabling hardware-agnostic programming with true per-formance portability for heterogeneous hpc,” arXiv preprintarXiv:2011.10896, 2020.

[12] C. Plessl and M. Platzner, “Virtualization of Hardware - Introduc-tion and Survey,” in ERSA, 2004.

[13] A. Vaishnav, K. D. Pham, and D. Koch, “A Survey on FPGA Virtu-alization,” 2018 28th International Conference on Field ProgrammableLogic and Applications (FPL), pp. 131–1317, 2018.

[14] Q. Ijaz, E.-B. Bourennane, A. K. Bashir, and H. Asghar, “Revisitingthe high-performance reconfigurable computing for future data-centers,” Future Internet, vol. 12, no. 4, p. 64, 2020.

[15] R. Skhiri, V. Fresse, J. P. Jamont, B. Suffran, and J. Malek, “Fromfpga to support cloud to cloud of fpga: State of the art,” Interna-tional Journal of Reconfigurable Computing, vol. 2019, 2019.

[16] J. Zhang and G. Qu, “A survey on security and trust of fpga-basedsystems,” in 2014 International Conference on Field-ProgrammableTechnology (FPT). IEEE, 2014, pp. 147–152.

[17] R. Di Lauro, F. Giannone, L. Ambrosio, and R. Montella, “Virtu-alizing general purpose gpus for high performance cloud com-puting: an application to a fluid simulator,” in 2012 IEEE 10thInternational Symposium on Parallel and Distributed Processing withApplications. IEEE, 2012, pp. 863–864.

[18] J. Duato, A. J. Pena, F. Silla, J. C. Fernandez, R. Mayo, andE. S. Quintana-Orti, “Enabling cuda acceleration within virtualmachines using rcuda,” in 2011 18th International Conference onHigh Performance Computing. IEEE, 2011, pp. 1–10.

[19] W. Fornaciari and V. Piuri, “Virtual FPGAs: Some steps behind thephysical barriers,” in Parallel and Distributed Processing, J. Rolim,Ed. Berlin, Heidelberg: Springer Berlin Heidelberg, 1998, pp. 7–12.

[20] L. Lagadec, D. Lavenier, E. Fabiani, and B. Pottier, “Placing, Rout-ing, and Editing Virtual FPGAs,” in Field-Programmable Logic andApplications, G. Brebner and R. Woods, Eds. Berlin, Heidelberg:Springer Berlin Heidelberg, 2001, pp. 357–366.

[21] H. K.-H. So and C. Liu, “FPGA Overlays,” in FPGAs for SoftwareProgrammers, D. Koch, F. Hannig, and D. Ziener, Eds. Cham:Springer International Publishing, 2016, pp. 285–305.

[22] J. Zhang, Y. Xiong, N. Xu, R. Shu, B. Li, P. Cheng, G. Chen, andT. Moscibroda, “The feniks fpga operating system for cloud com-puting,” in Proceedings of the 8th Asia-Pacific Workshop on Systems,2017, pp. 1–7.

[23] A. Vaishnav, K. D. Pham, D. Koch, and J. Garside, “Resourceelastic virtualization for fpgas using opencl,” in 2018 28th Interna-tional Conference on Field Programmable Logic and Applications (FPL).IEEE, 2018, pp. 111–1117.

[24] O. Knodel, P. Genssler, and R. Spallek, “Virtualizing Reconfig-urable Hardware to Provide Scalability in Cloud Architectures,”in 2017 The Tenth International Conference on Advances in Circuits,Electronics and Micro-electronics (CENICS), 2017.

[25] T. Xia, J.-C. Prevotet, and F. Nouvel, “Hypervisor mechanismsto manage fpga reconfigurable accelerators,” in 2016 InternationalConference on Field-Programmable Technology (FPT). IEEE, 2016, pp.44–52.

[26] A. Khawaja, J. Landgraf, R. Prakash, M. Wei, E. Schkufza, andC. J. Rossbach, “Sharing, protection, and compatibility for recon-figurable fabric with amorphos,” in 13th {USENIX} Symposium onOperating Systems Design and Implementation ({OSDI} 18), 2018, pp.107–127.

[27] W. Lie and W. Feng-Yan, “Dynamic partial reconfiguration in fp-gas,” in 2009 Third International Symposium on Intelligent InformationTechnology Application, vol. 2. IEEE, 2009, pp. 445–448.

[28] N. Tarafdar, N. Eskandari, T. Lin, and P. Chow, “Designing forfpgas in the cloud,” IEEE Design & Test, vol. 35, no. 1, pp. 23–29,2017.

[29] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow,“Fpgas in the cloud: Booting virtualized hardware acceleratorswith openstack,” in 2014 IEEE 22nd Annual International Symposiumon Field-Programmable Custom Computing Machines. IEEE, 2014, pp.109–116.

[30] O. Knodel, P. Lehmann, and R. G. Spallek, “Rc3e: Reconfigurableaccelerators in data centres and their provision by adapted ser-vice models,” in 2016 IEEE 9th International Conference on CloudComputing (CLOUD). IEEE, 2016, pp. 19–26.

[31] M. Asiatici, N. George, K. Vipin, S. A. Fahmy, and P. Ienne,“Designing a virtual runtime for FPGA accelerators in the cloud,”2016 26th International Conference on Field Programmable Logic andApplications (FPL), pp. 1–2, 2016.

[32] W. Wang, M. Bolic, and J. Parri, “pvfpga: accessing an fpga-basedhardware accelerator in a paravirtualized environment,” in 2013International Conference on Hardware/Software Codesign and SystemSynthesis (CODES+ ISSS). IEEE, 2013, pp. 1–9.

[33] C.-H. Huang and P.-A. Hsiung, “Virtualizable hardware/softwaredesign infrastructure for dynamically partially reconfigurable sys-tems,” ACM Transactions on Reconfigurable Technology and Systems(TRETS), vol. 6, no. 2, pp. 1–18, 2013.


[34] M. Paolino, S. Pinneterre, and D. Raho, “Fpga virtualization withaccelerators overcommitment for network function virtualiza-tion,” in 2017 International Conference on ReConFigurable Computingand FPGAs (ReConFig). IEEE, 2017, pp. 1–6.

[35] I. Gonzalez, S. Lopez-Buedo, G. Sutter, D. Sanchez-Roman, F. J.Gomez-Arribas, and J. Aracil, “Virtualization of ReconfigurableCoprocessors in HPRC Systems with Multicore Architecture,” J.Syst. Archit., vol. 58, no. 6-7, pp. 247–256, Jun. 2012. [Online].Available: http://dx.doi.org/10.1016/j.sysarc.2012.03.002

[36] C. Liu, H.-C. Ng, and H. K.-H. So, “Quickdough: A rapid fpga loopaccelerator design framework using soft cgra overlay,” in 2015International Conference on Field Programmable Technology (FPT).IEEE, 2015, pp. 56–63.

[37] O. Pell, O. Mencer, K. H. Tsoi, and W. Luk, “Maximum perfor-mance computing with dataflow engines,” in High-performancecomputing using FPGAs. Springer, 2013, pp. 747–774.

[38] O. Pell and V. Averbukh, “Maximum performance computing withdataflow engines,” Computing in Science and Engineering, vol. 14,no. 4, pp. 98–103, 2012.

[39] Y. Li, Z. Liu, K. Xu, H. Yu, and F. Ren, “A GPU-outperformingFPGA accelerator architecture for binary convolutional neuralnetworks,” ACM Journal on Emerging Technologies in ComputingSystems (JETC), vol. 14, no. 2, p. 18, 2018.

[40] C.-H. Huang and P.-A. Hsiung, “Hardware resource virtualizationfor dynamically partially reconfigurable systems,” IEEE EmbeddedSystems Letters, vol. 1, no. 1, pp. 19–23, 2009.

[41] W. Fornaciari and V. Piuri, “General methodologies to virtualizefpgas in hw/sw systems,” in 1998 Midwest Symposium on Circuitsand Systems (Cat. No. 98CB36268). ieee, 1998, pp. 90–93.

[42] R. L. Lysecky, K. Miller, F. Vahid, and K. A. Vissers, “Firm-corevirtual fpga for just-in-time fpga compilation,” in FPGA, 2005, p.271.

[43] E. El-Araby, I. Gonzalez, and T. El-Ghazawi, “Virtualizing andsharing reconfigurable resources in high-performance reconfig-urable computing systems,” in 2008 Second International Workshopon High-Performance Reconfigurable Computing Technology and Appli-cations. IEEE, 2008, pp. 1–8.

[44] J. Yang, L. Yan, L. Ju, Y. Wen, S. Zhang, and T. Chen, “Homo-geneous noc-based fpga: The foundation for virtual fpga,” in2010 10th IEEE International Conference on Computer and InformationTechnology. IEEE, 2010, pp. 62–67.

[45] R. Kirchgessner, G. Stitt, A. George, and H. Lam, “VirtualRC: AVirtual FPGA Platform for Applications and Tools Portability,”in Proceedings of the ACM/SIGDA International Symposium on FieldProgrammable Gate Arrays, ser. FPGA ’12. New York, NY, USA:ACM, 2012, pp. 205–208, event-place: Monterey, California, USA.[Online]. Available: http://doi.acm.org/10.1145/2145694.2145728

[46] F. Chen, Y. Shan, Y. Zhang, Y. Wang, H. Franke, X. Chang, andK. Wang, “Enabling FPGAs in the Cloud,” in Proceedings of the 11thACM Conference on Computing Frontiers, ser. CF ’14. New York,NY, USA: ACM, 2014, pp. 3:1–3:10, event-place: Cagliari, Italy.[Online]. Available: http://doi.acm.org/10.1145/2597917.2597929

[47] J. Coole and G. Stitt, “Fast, flexible high-level synthesis fromopencl using reconfiguration contexts,” IEEE Micro, vol. 34, no. 1,pp. 42–53, 2013.

[48] J. Weerasinghe, F. Abel, C. Hagleitner, and A. Herkersdorf, “En-abling fpgas in hyperscale data centers,” in 2015 IEEE 12th IntlConf on Ubiquitous Intelligence and Computing and 2015 IEEE 12thIntl Conf on Autonomic and Trusted Computing and 2015 IEEE 15thIntl Conf on Scalable Computing and Communications and Its Associ-ated Workshops (UIC-ATC-ScalCom). IEEE, 2015, pp. 1078–1086.

[49] S. A. Fahmy, K. Vipin, and S. Shreejith, “Virtualized fpga acceler-ators for efficient cloud computing,” in 2015 IEEE 7th InternationalConference on Cloud Computing Technology and Science (CloudCom).IEEE, 2015, pp. 430–435.

[50] H. L. Kidane, E.-B. Bourennane, and G. Ochoa-Ruiz, “Noc basedvirtualized accelerators for cloud computing,” in 2016 IEEE 10thInternational Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC). IEEE, 2016, pp. 133–137.

[51] M. Asiatici, N. George, K. Vipin, S. A. Fahmy, and P. Ienne,“Virtualized execution runtime for fpga accelerators in the cloud,”Ieee Access, vol. 5, pp. 1900–1910, 2017.

[52] S. Yazdanshenas and V. Betz, “Quantifying and mitigating thecosts of fpga virtualization,” in 2017 27th International Conferenceon Field Programmable Logic and Applications (FPL). IEEE, 2017, pp.1–7.

[53] C. Vatsolakis and D. Pnevmatikatos, “Racos: Transparent accessand virtualization of reconfigurable hardware accelerators,” in2017 International Conference on Embedded Computer Systems: Ar-chitectures, Modeling, and Simulation (SAMOS). IEEE, 2017, pp.11–19.

[54] J. M. Mbongue, F. Hategekimana, D. T. Kwadjo, and C. Bobda,“FPGA Virtualization in Cloud-Based Infrastructures Over Virtio,”2018 IEEE 36th International Conference on Computer Design (ICCD),pp. 242–245, 2018.

[55] J. Mbongue, F. Hategekimana, D. T. Kwadjo, D. Andrews, andC. Bobda, “Fpgavirt: A novel virtualization framework for fpgasin the cloud,” in 2018 IEEE 11th International Conference on CloudComputing (CLOUD). IEEE, 2018, pp. 862–865.

[56] Q. Zhao, M. Amagasaki, M. Iida, M. Kuga, and T. Sueyoshi, “En-abling FPGA-as-a-Service in the Cloud with hCODE Platform,”IEICE Transactions on Information and Systems, vol. E101.D, no. 2,pp. 335–343, 2018.

[57] Y. Zha and J. Li, “Virtualizing fpgas in the cloud,” in Proceedings ofthe Twenty-Fifth International Conference on Architectural Support forProgramming Languages and Operating Systems, 2020, pp. 845–858.

[58] S. Zeng, G. Dai, H. Sun, K. Zhong, G. Ge, K. Guo, Y. Wang,and H. Yang, “Enabling efficient and flexible fpga virtualiza-tion for deep learning in the cloud,” in 2020 IEEE 28th AnnualInternational Symposium on Field-Programmable Custom ComputingMachines (FCCM). IEEE, 2020, pp. 102–110.

[59] J. Coole and G. Stitt, “Intermediate fabrics: Virtual architectures forcircuit portability and fast placement and routing,” in Proceedingsof the eighth IEEE/ACM/IFIP international conference on Hardware/-software codesign and system synthesis, 2010, pp. 13–22.

[60] A. Agne, M. Happe, A. Keller, E. Lubbers, B. Plattner, M. Platzner,and C. Plessl, “Reconos: An operating system approach for recon-figurable computing,” IEEE Micro, vol. 34, no. 1, pp. 60–71, 2013.

[61] K. Fleming and M. Adler, “The leap fpga operating system,” inFPGAs for Software Programmers. Springer, 2016, pp. 245–258.

[62] Intel Corporation. (2020) Intel® SDK for OpenCL™Applications. [Online]. Available: https://software.intel.com/en-us/opencl-sdk

[63] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho,R. Neugebauer, I. Pratt, and A. Warfield, “Xen and the art ofvirtualization,” ACM SIGOPS operating systems review, vol. 37,no. 5, pp. 164–177, 2003.

[64] OpenStack, “OpenStack - Open source software for buildingprivate and public clouds,” 2020. [Online]. Available: http://www.openstack.org/.

[65] K. Vipin and S. A. Fahmy, “Dyract: A partial reconfigurationenabled accelerator and test platform,” in 2014 24th internationalconference on field programmable logic and applications (FPL). IEEE,2014, pp. 1–7.

[66] S. Hajnoczi. (2015) Virtio-vsock: zero-configuration host/guest communication. [Online]. Available: https://www.linux-kvm.org/page/KVMForum2015

[67] T. Lindholm, F. Yellin, G. Bracha, and A. Buckley, The Java VirtualMachine Specification, Java SE 8 Edition, 1st ed. Addison-WesleyProfessional, 2014.

[68] X. Li and D. L. Maskell, “Time-multiplexed fpga overlay archi-tectures: A survey,” ACM Transactions on Design Automation ofElectronic Systems (TODAES), vol. 24, no. 5, pp. 1–19, 2019.

[69] R. Ferreira, J. G. Vendramini, L. Mucida, M. M. Pereira, andL. Carro, “An fpga-based heterogeneous coarse-grained dynami-cally reconfigurable architecture,” in Proceedings of the 14th interna-tional conference on Compilers, architectures and synthesis for embeddedsystems, 2011, pp. 195–204.

[70] A. Brant and G. Lemieux, “ZUMA: An Open FPGA OverlayArchitecture,” 2012 IEEE 20th International Symposium on Field-Programmable Custom Computing Machines, pp. 93–96, 2012.

[71] A. K. Jain, “Architecture centric coarse-grained fpga overlays,”Ph.D. dissertation, 2017.

[72] R. Hartenstein, “Coarse Grain Reconfigurable Architecture (Em-bedded Tutorial),” in Proceedings of the 2001 Asia and South PacificDesign Automation Conference, ser. ASP-DAC ’01. New York, NY,USA: ACM, 2001, pp. 564–570, event-place: Yokohama, Japan.

[73] T. Wiersema, A. Bockhorn, and M. Platzner, “Embeddingfpga overlays into configurable systems-on-chip: Reconos meetszuma,” in 2014 International Conference on ReConFigurable Comput-ing and FPGAs (ReConFig14). IEEE, 2014, pp. 1–6.


[74] G. Stitt, R. Karam, K. Yang, and S. Bhunia, “A uniquified virtu-alization approach to hardware security,” IEEE Embedded SystemsLetters, vol. 9, no. 3, pp. 53–56, 2017.

[75] J. Ma, G. Zuo, K. Loughlin, X. Cheng, Y. Liu, A. M. Eneyew, Z. Qi,and B. Kasikci, “A hypervisor for shared-memory fpga platforms,”in Proceedings of the Twenty-Fifth International Conference on Archi-tectural Support for Programming Languages and Operating Systems,2020, pp. 827–844.

[76] J. Mandebi Mbongue, D. Tchuinkou Kwadjo, and C. Bobda, “FLex-iTASK: A Flexible FPGA Overlay for Efficient Multitasking,” inProceedings of the 2018 on Great Lakes Symposium on VLSI, ser.GLSVLSI ’18. New York, NY, USA: ACM, 2018, pp. 483–486,event-place: Chicago, IL, USA.

[77] Xilinx Inc. (2018) Vivado Design Suite Partial Reconfiguration User Guide. [Online]. Available: https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_1/ug909-vivado-partial-reconfiguration.pdf

[78] Xilinx Inc. (2016) SDAccel Development Environment,2016. [Online]. Available: https://www.xilinx.com/products/design-tools/software-zone.html

[79] Xilinx Inc. (2019) Xilinx Isolation Design Flow.[Online]. Available: https://www.xilinx.com/applications/isolation-design-flow.html

[80] J. Krautter, D. R. Gnad, F. Schellenberg, A. Moradi, and M. B.Tahoori, “Active fences against voltage-based side channels inmulti-tenant fpgas.” IACR Cryptol. ePrint Arch., vol. 2019, p. 1152,2019.

[81] S. Yazdanshenas and V. Betz, “The costs of confidentiality invirtualized fpgas,” IEEE Transactions on Very Large Scale Integration(VLSI) Systems, vol. 27, no. 10, pp. 2272–2283, 2019.

[82] M. Zhao and G. E. Suh, “Fpga-based remote power side-channelattacks,” in 2018 IEEE Symposium on Security and Privacy (SP).IEEE, 2018, pp. 229–244.

[83] Xilinx Inc. Accelerating DNNs with Xilinx Alveo Accelerator Cards. [Online]. Available: https://www.xilinx.com/support/documentation/white_papers/wp504-accel-dnns.pdf

[84] J. Fowers, K. Ovtcharov, M. Papamichael, T. Massengill, M. Liu,D. Lo, S. Alkalay, M. Haselman, L. Adams, M. Ghandi et al.,“A configurable cloud-scale dnn processor for real-time ai,” in2018 ACM/IEEE 45th Annual International Symposium on ComputerArchitecture (ISCA). IEEE, 2018, pp. 1–14.

Masudul Hassan Quraishi received the B.Sc. degree in Electrical Engineering from Bangladesh University of Engineering and Technology, Bangladesh, in 2013, and the M.S. degree in Computer Engineering from Arizona State University, USA, in 2020. He is currently a Computer Engineering Ph.D. student at Arizona State University, USA. His current research involves the virtualization of FPGAs for high-performance computing.

Erfan Bank Tavakoli received the B.Sc. degree from Iran University of Science and Technology, Tehran, Iran, in 2017, and the M.Sc. degree from the University of Tehran, Tehran, Iran, in 2019, both in electrical engineering. He is currently a Computer Engineering Ph.D. student at Arizona State University, AZ, USA. His current research interests include hardware acceleration and deep learning.

Fengbo Ren (S'10-M'15-SM'20) received the B.Eng. degree from Zhejiang University, Hangzhou, China, in 2008, and the M.S. and Ph.D. degrees from the University of California, Los Angeles, in 2010 and 2014, respectively, all in electrical engineering.

In 2015, he joined the faculty of the School of Computing, Informatics, and Decision Systems Engineering at Arizona State University (ASU). His Ph.D. research involved designing energy-efficient VLSI systems, accelerating compressive sensing signal reconstruction, and developing emerging memory technology. His current research interests are focused on algorithm, hardware, and system innovations for data analytics and information processing, with emphasis on bringing energy efficiency and data intelligence into a broad spectrum of today's computing infrastructures, from data center server systems to wearable and Internet-of-Things devices. He is a member of the Digital Signal Processing Technical Committee and the VLSI Systems & Applications Technical Committee of the IEEE Circuits and Systems Society.

Dr. Ren received the Broadcom Fellowship in 2012, the prestigious National Science Foundation (NSF) Faculty Early Career Development (CAREER) Award in 2017, and the Google Faculty Research Award in 2018. He also received the Top 5 percent Best Teacher Awards from the Fulton Schools of Engineering at ASU in 2017, 2018, and 2019.