This paper is included in the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI '18).
October 8-10, 2018 • Carlsbad, CA, USA
ISBN 978-1-939133-08-3
Open access to the Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation is sponsored by USENIX.

Sharing, Protection and Compatibility for Reconfigurable Fabric with AMORPHOS
Ahmed Khawaja, Joshua Landgraf, and Rohith Prakash, UT Austin; Michael Wei and Eric Schkufza, VMware Research Group; Christopher J. Rossbach, UT Austin and VMware Research Group
https://www.usenix.org/conference/osdi18/presentation/khawaja



Sharing, Protection, and Compatibility for Reconfigurable Fabric with AMORPHOS

Ahmed Khawaja1, Joshua Landgraf1, Rohith Prakash1, Michael Wei2, Eric Schkufza2, Christopher J. Rossbach3

1The University of Texas at Austin  2VMware Research Group  3The University of Texas at Austin and VMware Research Group

Abstract

Cloud providers such as Amazon and Microsoft have begun to support on-demand FPGA acceleration in the cloud, and hardware vendors will support FPGAs in future processors. At the same time, technology advancements such as 3D stacking, through-silicon vias (TSVs), and FinFETs have greatly increased FPGA density. The massive parallelism of current FPGAs can support not only extremely large applications, but multiple applications simultaneously as well.

System support for FPGAs, however, is in its infancy. Unlike software, where resource configurations are limited to simple dimensions of compute, memory, and I/O, FPGAs provide a multi-dimensional sea of resources known as the FPGA fabric: logic cells, floating point units, memories, and I/O can all be wired together, leading to spatial constraints on FPGA resources. Current stacks either support only a single application or statically partition the FPGA fabric into fixed-size slots. These designs cannot efficiently support diverse workloads: the size of the largest slot places an artificial limit on application size, and oversized slots result in wasted FPGA resources and reduced concurrency.

This paper presents AMORPHOS, which encapsulates user FPGA logic in morphable tasks, or Morphlets. Morphlets provide isolation and protection across mutually distrustful protection domains, extending the guarantees of software processes. Morphlets can morph, dynamically altering their deployed form based on resource requirements and availability. To build Morphlets, developers provide a parameterized hardware design that interfaces with AMORPHOS, along with a mesh, which specifies external resource requirements. AMORPHOS explores the parameter space, generating deployable Morphlets of varying size and resource requirements. AMORPHOS multiplexes Morphlets on the FPGA in both space and

time to maximize FPGA utilization.

We implement AMORPHOS on Amazon F1 [1] and Microsoft Catapult [92]. We show that protected sharing and dynamic scalability support on workloads such as DNN inference and blockchain mining improves aggregate throughput up to 4× and 23× on Catapult and F1 respectively.

Figure 1: Cost per logic cell and relative density of memory and logic cells over time for FPGAs at each process node. Left and right axes show logic cells and memory density in log-scale relative to 250nm. The dotted line shows the cost per logic cell for the highest density FPGA at that node (in cents) where historical pricing was available [84]. The 14-16nm node introduced FinFETs, which greatly increase performance/W, so that the same application may use fewer logic cells. * Data for 7-10nm projected from [22].

1 Introduction

FPGAs offer compelling hardware acceleration in application domains including databases [28, 59, 74], finance [54, 70], neural networks [115, 104], graph processing [36, 85], communication [57, 107, 27], and networking [53, 92, 27]. Over the last few decades, FPGA compute density has grown dramatically, cost per logic cell has dropped precipitously (Figure 1), and higher-level programming abstractions [60, 19, 32, 20, 65, 90] have emerged to improve programmer productivity. Cloud providers such as Amazon [1] are offering compute resources with FPGAs. However, system software has not


kept up. The body of research effort on FPGA OS support [77, 100, 99, 86, 45, 52, 29] and sharing [30] has yielded no first-class commodity OS support, and on-demand FPGAs from AWS and Microsoft support a single-application model.

Current proposals for FPGA sharing [30, 26, 42, 110, 63] partition a physical FPGA into a small number of fixed-size slots and demand-share them across user logic using hardware support for partial reconfiguration (PR). PR changes the configuration of FPGA fabric within a slot without perturbing the state of the rest of the FPGA. User logic is pre-compiled to a bitstream that targets the pre-defined slots, enabling a system to deploy user logic with low latency. A reserved partition of the fabric, or shell, implements library support. Fixed-slot designs have significant drawbacks in practice. Forcing applications to target fixed partitions unnecessarily constrains them: the size of the largest partition places an artificial limit on application size, and oversized partitions result in wasted FPGA resources and reduced concurrency.

We present a design and prototype of protected sharing and cross-platform compatibility for FPGAs called AMORPHOS. AMORPHOS enables applications to scale dynamically in response to load and availability, and enables the system to transparently change mappings between user logic and physical fabric to increase utilization. AMORPHOS avoids fixed-size slots and manages physical fabric in dynamically sized zones. Zones are demand-shared across morphable tasks, or Morphlets. A Morphlet is a new abstraction which forms a protection boundary and encapsulates user FPGA logic in a way that enables it to be dynamically scaled and remapped to the physical fabric. Morphlets express scalability dimensions and resource constraints using a mesh. A mesh is a map from feasible resource combinations to abstract descriptions of the logic. Meshes act as an intermediate representation (IR) that can be re-targeted at runtime to different hardware allocations, allowing the AMORPHOS scheduler to re-target Morphlets to available FPGA fabric. AMORPHOS caches dynamically generated bitstreams in a shared registry to hide the latency of re-targeting. AMORPHOS mediates Morphlet access to OS-managed resources through a hull, which hardens and extends a traditional shell design with access control and support for isolation. The hull also forms a canonical interface that enables Morphlets to be portable.

We prototype AMORPHOS on both Amazon F1 and Microsoft Catapult. Measurements show that AMORPHOS's abstractions provide both compatibility and protected sharing while dramatically improving utilization and throughput. We make the following contributions:

• A minimal set of OS-level abstractions and interfaces to enable mutually distrustful FPGA sharing and protected access to OS-managed resources.
• A compatibility layer that enables portability of FPGA code across Amazon F1 and Microsoft Catapult FPGA systems.
• Techniques that transparently transition between scheduling modes based on fixed and variable zones to increase utilization and throughput.
• Evaluation of a prototype showing AMORPHOS sharing support increases fabric utilization and system throughput up to 4× (Catapult) and 23× (F1) relative to fixed-slot sharing and non-sharing designs.

2 Background

Field Programmable Gate Arrays (FPGAs) are circuits that can be configured post-manufacture to implement custom logic. FPGAs may be deployed in a system in several ways:

Discrete. An FPGA can be used on its own without a processor. Network switches, for example [17], can be implemented this way to provide a programmable data plane.

System-on-chip. FPGAs may include one or more hard (in-silicon) processors [35, 16] tightly integrated with the FPGA. Logic in the FPGA can manipulate the processor and vice versa (e.g., FPGA logic may directly write into processor caches or manipulate memory controllers).

Bump-in-the-wire. FPGAs can be placed on an I/O pipeline to "transparently" manipulate data. For example, an FPGA may be integrated into a network card or memory and storage controller to provide line-rate encryption [8].

Co-processor/Offload. FPGAs can be I/O-attached (e.g., via PCIe) to offload compute. An application configures the FPGA to implement a hardware accelerator and sends data and requests to it like a co-processor. Many workloads targeting on-demand cloud FPGAs [1, 79], such as DNNs [83, 116], media transcoding [9], genomics [6], real-time risk modeling [87], and blockchain [105, 49], fall in this category. AMORPHOS is designed for FPGAs deployed in the co-processor/offload configuration.

2.1 Software versus Hardware

Writing Hardware. Hardware description languages (HDLs), such as Verilog [106] and VHDL [21], enable developers to configure the various resources on the FPGA fabric: interconnect, look-up tables (LUTs), flip-flops, on-chip memory (block RAM), and "hard resources" (adders, DSPs, memory controllers, etc.). Unlike software, where


resource arrangement is abstracted away by the ISA, hardware gives developers explicit control over arranging and connecting resources in a flexible manner.

Building and deploying hardware. To be deployed on an FPGA, a design must be converted into a bitstream, a binary which configures the FPGA fabric. The bitstream is built from the HDL in two stages. First, synthesis converts and maps the HDL into a netlist, which describes how resources on the FPGA should be connected to implement design logic. Synthesis is similar to software compilation and usually takes on the order of minutes. Second, the place-and-route (PAR) step takes the netlist and attempts to route the design on the FPGA fabric. PAR is a constraint-solving problem which can take hours for a complex design. A bitstream takes 10s-100s of milliseconds to be loaded.
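To make the latency structure of this pipeline concrete, the sketch below sums the stage costs described above depending on which artifacts are already in hand. The constants are rough order-of-magnitude figures taken from the text (minutes for synthesis, hours for PAR, 10s-100s of milliseconds for bitstream load), not measurements; the function name is illustrative.

```c
#include <stdbool.h>

/* Illustrative-only latency model (milliseconds) for the FPGA build flow
 * described above. Constants are order-of-magnitude stand-ins. */
enum {
    SYNTHESIS_MS = 2 * 60 * 1000,  /* synthesis: on the order of minutes  */
    PAR_MS       = 60 * 60 * 1000, /* place-and-route: can take hours     */
    LOAD_MS      = 100             /* bitstream load: 10s-100s of ms      */
};

/* Time from "user submits HDL" to "design is live on the fabric",
 * given which build artifacts are already cached. */
long deploy_latency_ms(bool have_netlist, bool have_bitstream)
{
    long ms = 0;
    if (!have_bitstream) {
        if (!have_netlist)
            ms += SYNTHESIS_MS;    /* HDL -> netlist */
        ms += PAR_MS;              /* netlist -> bitstream */
    }
    ms += LOAD_MS;                 /* bitstream -> configured fabric */
    return ms;
}
```

The asymmetry this model captures (PAR dominates everything unless a bitstream already exists) is what motivates the caching and pre-compilation strategies discussed later in the paper.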

Sharing and reconfiguring hardware. Unlike software, which can be context switched by saving and restoring architectural state, context switching FPGA hardware at arbitrary points requires capturing the current state of the logic, as loading a new bitstream will reset that state. While mechanisms do exist, they are not universally supported [47] or are in their early stages [23], and are not supported in all AMORPHOS's target environments. Therefore, time-sharing must either be non-preemptive, or must forcibly revoke access to the FPGA, potentially at the cost of losing application state.

Partial Reconfiguration. Hardware support for partial reconfiguration [76] (PR) enables parts of an FPGA to be reconfigured in situ without impacting the live configuration or circuit state of other parts of the FPGA fabric. Use of the feature necessitates including partial reconfiguration logic along with the netlist during the place-and-route build phase, but does not otherwise impact the process in a fundamental way: the output is a bitstream that targets a specific set of physical FPGA resources. Partial reconfiguration can be faster because partial bitstreams are smaller. Because PR can allow an application to change without impacting the state of other applications, it is an attractive primitive for implementing context switching.

Scaling Hardware. Unlike software, which is scaled by increasing the number of cores or the number of operations executed per instruction (SIMD), hardware can scale by implementing what can be thought of as entirely new specialized instructions or algorithms. This enables FPGAs to provide energy efficiency and evolvability that are difficult to achieve with fixed-function hardware like GPUs or TPUs [117, 46, 11]. For example, a deep neural network (DNN) can be implemented as thousands of independent 2-bit bitwise processors, rather than consuming the pipeline of a general-purpose 64-bit processor.

2.2 FPGA OS and Sharing Support

On-demand FPGAs in the cloud, such as Amazon F1 [1], only enable coarse-grain sharing of an FPGA. F1 provides developers with SDKs for developing, simulating, debugging, and compiling hardware accelerators on demand. FPGA designs are saved as Amazon FPGA Images (AFIs) and deployed to an F1 instance. The AWS Marketplace functions as a library of pre-built common AFIs. At deployment, an AFI is assigned the fabric of the entire FPGA: there is no support for sharing across protection domains. The lack of fine-grained sharing means that both the cloud provider and the user are locked out of the flexibility of the FPGA: once a user loads an AFI, Amazon must assume that the entire FPGA is being used by that AFI, even though the FPGA may be idle. Other than decommissioning the instance, the user has no way to release FPGA resources back to the cloud provider. As a result, workloads which need to conditionally or occasionally offload compute [97], or which cannot fully utilize the FPGA, may be unable to cost-effectively use cloud FPGAs.

Previous proposals have touched on OS-level concerns such as cross-application sharing [31, 109, 52], hardware abstraction layers [111, 61, 78, 62, 50, 80], shared runtime support [45, 103, 37], and access from a virtual machine [88]. Theoretical aspects of spatial scheduling on FPGAs [43, 102, 108, 31], task scheduling in heterogeneous CPU-FPGA platforms [25, 102, 108, 44, 18], and mechanisms for preemption [73], relocation [55], and context switch [72, 93] are well explored. Access from an FPGA to OS-managed resources such as virtual memory [33, 15, 114, 77], file systems [100], and system calls [77, 100] has enjoyed the research community's attention as well. However, no first-class OS support for FPGAs is present in modern commodity OSes, and cloud FPGA platforms support a single-application model.

Recent designs for FPGA sharing in datacenters [30, 26, 42, 110, 63] leverage partial reconfiguration to demand-share fixed, pre-reserved partitions of FPGA fabric among applications with bitstreams pre-compiled to target those partitions. AMORPHOS begins with a design of this form, extends it to enable cross-domain protection, and replaces the fixed-slot restriction with elastic resource management to increase utilization and throughput.

3 Goals

AMORPHOS supports demand-sharing of FPGAs by mutually distrustful processes. AMORPHOS multiplexes fabric spatially by default, co-scheduling user logic from different processes, and falling back to time-sharing when


space-sharing is infeasible due to resource constraints. A critical design objective for AMORPHOS is avoiding the artificial constraints on inter- and intra-Morphlet scalability induced by a fixed-slot design. AMORPHOS enables individual applications to utilize additional fabric if available, and enables multiple applications to share the fabric to achieve higher aggregate utilization.

Figure 2: AMORPHOS managing a number of DNNWeaver (see §6) Morphlets. The top row depicts the host and FPGA state while the bottom row shows the corresponding chip layout on Catapult. At T0, a single DNNWeaver Morphlet is placed on the FPGA. At T1, AMORPHOS detects underutilization and transitions to high-throughput mode, giving the Morphlet more area. At T2, another Morphlet is instantiated and AMORPHOS returns to low-latency mode. Finally, at T3, 2 more DNNWeavers have been scheduled and AMORPHOS transitions to high-throughput mode to fit them all on the FPGA.

3.1 Programming Model

We target a programming model of HDL (hardware description language) over an abstract FPGA fabric. The primary tangible change from current HDL-FPGA programming models is the requirement for the developer to use virtual interfaces for communication with the host and access to on-board resources such as DRAM, network I/O, etc. Collectively, these interfaces form a mediation and compatibility layer called the hull, which encapsulates, hardens, and extends current vendor-specific shells [92, 1].

3.2 Isolation

AMORPHOS provides protection guarantees similar to those provided to processes in a software OS. Memory and I/O protection is enforced between Morphlets. Best-effort performance isolation is provided based on resource allocation policy and scheduler hints. When FPGA resources are constrained, AMORPHOS dedicates an even share of I/O and memory bandwidth to each Morphlet, enforced by a hardware arbiter. AMORPHOS makes a best effort to allocate fabric fairly under contention by preferring spatial assignments that balance the resources allocated to each application, and time-slicing fairly when spatial sharing is infeasible (see §4.2 for details). Extending these mechanisms to provide priority-proportional fairness is straightforward, but our prototype currently does not provide flexible software-exposed policies, which we leave as future work. Our current design avoids co-scheduling Morphlets which will interfere with each other through contention on the hardware, based on scheduler hints.
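The even-share bandwidth policy enforced by a hardware arbiter can be illustrated with a round-robin grant loop: each cycle, the grant rotates to the next Morphlet with an outstanding request, so contenders receive equal shares of memory/I-O cycles over time. This is a software sketch of one plausible arbitration policy, not the AMORPHOS hardware implementation; names are illustrative.

```c
/* Sketch of an even-share round-robin arbiter over Morphlet
 * memory/I-O requests. Illustrative only. */
#define MAX_MORPHLETS 8

typedef struct {
    int last_grant;   /* index of the most recently granted Morphlet */
} arbiter_t;

/* req[i] is nonzero if Morphlet i has a pending request this cycle.
 * Returns the index granted, or -1 if no Morphlet is requesting. */
int arbiter_grant(arbiter_t *a, const int req[MAX_MORPHLETS])
{
    for (int off = 1; off <= MAX_MORPHLETS; off++) {
        int i = (a->last_grant + off) % MAX_MORPHLETS;
        if (req[i]) {
            a->last_grant = i;   /* rotate priority past the winner */
            return i;
        }
    }
    return -1;
}
```

With two Morphlets requesting continuously, grants alternate between them, yielding the 50/50 bandwidth split the text describes.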

AMORPHOS does not provide explicit protection against side channels. Side channels exist and are an active area of research where some mitigations now exist [94]. However, the attack surface for Morphlets is considerably smaller, as Morphlets enjoy exclusive access to all the FPGA hardware resources they use except interfaces to AMORPHOS itself, which are implemented with cross-domain isolation in mind. For example, special care is taken to zero out all signals on a Morphlet's interface if it is not the intended recipient of a transaction, which ensures the Morphlet cannot monitor the address/data signals of other Morphlets.

3.3 Dynamic Scalability

A key goal of AMORPHOS is increasing utilization. When only a single application is on the FPGA, it should enjoy exclusive access to all resources it can actually use. When multiple Morphlets contend, if a feasible partitioning of the fabric accommodating them all exists, applications are mapped to shares of the fabric concurrently. If no feasible partitioning exists, the system falls back to time-sharing at coarse granularity. A key challenge to realizing this vision is the very high latency (potentially hours or more) of place-and-route (PAR), which maps user logic to physical fabric. Using partial reconfiguration (PR) to deploy applications avoids that latency, but constrains applications to fixed slots, giving up elasticity. Avoiding or hiding PAR latency without constraining logic to fixed slots is a primary design goal for Morphlets and the AMORPHOS scheduler. Furthermore, for Morphlets to take advantage of different size partitions, the programming model must provide a way for the developer to express scalability dimensions, valid configurations, and hints to the system to inform the scheduler.

While AMORPHOS's primary sharing strategy is spatial sharing, support for time-sharing is a de facto requirement to avoid starvation when the FPGA is contended. Preemptive time-slicing requires mechanisms for capturing, evacuating, and restoring state on the FPGA, and while some applicable mechanisms do exist (e.g., ICAP [47]), they are not universally supported, and state capture remains an active research area [55, 72, 93, 73]. We opt for a non-preemptive context switch based on extensions to the programming model that include a quiescence interface.
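The non-preemptive handshake can be sketched as follows: the hull signals an impending context switch, polls for the Morphlet's quiescence signal, and falls back to forcible evacuation after a configurable timeout (as §4.3 describes for unresponsive Morphlets). This is a host-side model of the protocol under assumed names; it is not the AMORPHOS API.

```c
#include <stdbool.h>

/* Hypothetical model of the quiescence handshake. */
typedef struct {
    bool quiesce_requested;   /* hull -> Morphlet */
    bool quiescent;           /* Morphlet -> hull */
} morphlet_iface_t;

typedef enum { EVICT_CLEAN, EVICT_FORCED } evict_t;

/* A cooperative Morphlet asserts quiescence once asked. */
bool poll_cooperative(morphlet_iface_t *m)
{
    m->quiescent = m->quiesce_requested;
    return m->quiescent;
}

/* An unresponsive Morphlet never quiesces. */
bool poll_hung(morphlet_iface_t *m) { (void)m; return false; }

/* Request quiescence, poll up to timeout_polls times, then give up
 * and evacuate forcibly to avoid denial of service. */
evict_t context_switch(morphlet_iface_t *m,
                       bool (*poll)(morphlet_iface_t *),
                       int timeout_polls)
{
    m->quiesce_requested = true;
    for (int i = 0; i < timeout_polls; i++)
        if (poll(m))
            return EVICT_CLEAN;   /* Morphlet reached a stable state */
    return EVICT_FORCED;          /* timed out: evacuate anyway */
}
```

The timeout is the key design point: it bounds how long a misbehaving Morphlet can hold the fabric while still giving well-behaved logic a chance to save its progress.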

3.4 Motivating Example

Figure 2 shows a series of scheduling decisions taken by our system in response to applications requesting use of the FPGA. The top row depicts the state of the host and FPGA while the bottom row shows the corresponding chip layout on Catapult V1 FPGAs [92] (Altera Stratix V 5SGSMD5H2F35I3L). At time T0, process A instantiates a Morphlet on the FPGA. To provide on-demand access at the lowest latency, it initially deploys A on fixed-size zone 1 using partial reconfiguration. At time T1, AMORPHOS notices the resulting under-utilization and morphs. A's mesh is used to select a more performant netlist that uses as much of the FPGA as it can profitably consume, and full reconfiguration is used to give A all the resources not consumed by AMORPHOS itself. At time T2, process B requests FPGA fabric. To serve that demand quickly, AMORPHOS morphs again, reinstating A in zone 1, and mapping B to zone 2. At some future time T3, which represents the state after potentially many intervening events, four processes have requested FPGA access, and AMORPHOS has morphed by selecting netlists from each Morphlet's mesh to produce a single combined bitstream that co-schedules all. Utilization and throughput are improved by 2× compared to a fixed-slot design.

4 Design

AMORPHOS introduces a number of new abstractions and interconnected components. A system overview is shown in Figure 3. User logic is encapsulated in Morphlets, a zone manager tracks allocatable area of physical FPGA fabric, and a scheduler manages the mapping between Morphlets and zones. To enable flexible mapping of Morphlets to zones, Morphlets encapsulate information to enable the scheduler to generate new bitstreams on demand, in the form of meshes. To hide the latency of PAR for dynamic re-targeting of Morphlets, the scheduler maintains a registry that caches (potentially combined) bitstreams that can be instantiated on a zone with low latency. AMORPHOS mediates Morphlet access to memory and I/O with a compatibility and protection layer called the hull.

4.1 Hull

The primary job of the hull is to provide cross-domain protection by mediating access to memory and I/O, and to enable compatibility by presenting Morphlets with canonical interfaces to interact with the rest of the platform. The hull coordinates with the scheduler by sending and monitoring quiescence signals (§4.3), disabling connections to zones of the FPGA currently being reprogrammed (§4.2), and connecting and initializing Morphlets after reprogramming is complete. The hull provides memory protection for on-board DRAM using segment-based address translation and manages peripheral I/O devices by implementing shared logic to interface with them, along with simple access mediation logic (e.g., rate-limiting for contended I/O). Finally, the hull exports interfaces to the host OS to configure access control and protection mechanisms, e.g., base and bounds registers for segments.

We expect that future FPGA platforms will provide some of this functionality, address translation in particular, in "hard IP," meaning it will be supported directly in silicon. Our current prototypes are forced to synthesize these functions from FPGA fabric.
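Segment-based translation with base and bounds registers is simple enough to sketch directly: each Morphlet's accesses are offset by a base and checked against a bound before reaching the memory controller. This is a minimal software model of the mechanism, with illustrative names, not the hull's hardware logic.

```c
#include <stdint.h>
#include <stdbool.h>

/* Per-Morphlet segment registers, configured by the host OS. */
typedef struct {
    uint64_t base;    /* physical start of the Morphlet's DRAM segment */
    uint64_t bound;   /* segment length in bytes */
} segment_t;

/* Translate a Morphlet-relative address. Returns true and writes the
 * physical address on success; false on a protection violation, in
 * which case the access never reaches the memory controller. */
bool segment_translate(const segment_t *s, uint64_t vaddr, uint64_t *paddr)
{
    if (vaddr >= s->bound)
        return false;             /* out of bounds: fault */
    *paddr = s->base + vaddr;
    return true;
}
```

Because translation is a single add and compare, it fits naturally in hardware on the DRAM access path, which is why the text anticipates it migrating into hard IP.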

4.2 Zones and Scheduling

The zone manager allocates physical FPGA fabric to Morphlets. Fabric not consumed by the hull forms a global zone, which can be recursively subdivided into smaller reconfigurable zones that can be allocated to different Morphlets. Our Catapult prototype supports two smaller zones within the global zone, each of which can be further subdivided into two. F1 hardware has considerably more resources, and could support a considerably larger number of zones with more levels of subdivision. However, F1 does not expose the partial reconfiguration feature, so our F1 prototype is forced to manage only a single global zone. Zones may be allocated to individual Morphlets or may accommodate multiple Morphlets simultaneously. When it is time to schedule a Morphlet, the job of the zone manager is to find (or create) a free reconfigurable zone matching the Morphlet's default bitstream. If a match is found, the Morphlet can be deployed on that zone directly. If one is not found, the zone manager must coalesce free (or reclaimed) zones to form a larger one, and inform AMORPHOS that it must re-target the Morphlet, along with any other currently running Morphlets, to be deployed on the coalesced zone. In the limit, all Morphlets are deployed together on the global zone, maximizing aggregate utilization and individual application performance.

Figure 3: AMORPHOS design overview. FPGA Morphlets (applications) are synthesized by the user and given to AMORPHOS to be converted into bitstreams capable of being placed on the FPGA. The FPGA is split into a hull and multiple zones, in which Morphlets can be scheduled from cocoons. Access to memory and I/O from Morphlets is virtualized by the hull, which implements the logic to interface with the resources directly and to ensure proper access control. On the host side, communication to the Morphlet is virtualized through the lib-AMORPHOS interface.

Figure 4: AMORPHOS Morphlet life cycle. An application's HDL and parameters (RVector) pass through synthesis to a netlist; place-and-route then produces either a zone bitstream with PR logic (low-latency scheduling mode) or a full-chip bitstream combining multiple application netlists with the shell (high-throughput scheduling mode) for deployment on the FPGA.
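The recursive zone structure (a global zone split into two reconfigurable zones, each divisible into two more, matching the Catapult prototype's layout) behaves like a small buddy allocator: a zone is allocatable only if neither it, an ancestor, nor a descendant is in use, and freeing children implicitly coalesces the parent. The sketch below stores the seven zones as a binary heap (node 0 is the global zone; children of node i are 2i+1 and 2i+2). It is an illustrative model, not the AMORPHOS zone manager's data structure.

```c
/* Seven zones: 0 = global, 1-2 = halves, 3-6 = quarters. */
#define NZONES 7
typedef struct { int used[NZONES]; } zone_mgr_t;

/* Is zone i, or any zone nested inside it, occupied? */
static int busy_below(const zone_mgr_t *z, int i)
{
    if (i >= NZONES) return 0;
    return z->used[i] || busy_below(z, 2*i + 1) || busy_below(z, 2*i + 2);
}

/* Is any zone enclosing zone i occupied? */
static int busy_above(const zone_mgr_t *z, int i)
{
    for (; i > 0; i = (i - 1) / 2)
        if (z->used[(i - 1) / 2]) return 1;
    return 0;
}

/* Try to allocate zone i; fails if it overlaps a live Morphlet. */
int zone_alloc(zone_mgr_t *z, int i)
{
    if (busy_below(z, i) || busy_above(z, i))
        return -1;
    z->used[i] = 1;
    return i;
}

void zone_free(zone_mgr_t *z, int i) { z->used[i] = 0; }
```

Allocating a quarter zone blocks its parent half and the global zone; once all sub-zones are freed, the global zone becomes allocatable again, which models the coalescing step described above.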

Zones play a key role in balancing scheduling latency against aggregate throughput, because fixed zones and PR are better for fast deployment, while a larger zone with multiple Morphlets is better for utilization and throughput. AMORPHOS's scheduler supports two modes reflecting this tradeoff, low-latency mode and high-throughput mode, and transitions between those modes transparently based on demand.

In low-latency mode, reconfigurable zones enable Morphlets to be deployed almost instantly through partial reconfiguration with the Morphlet's default bitstream. The Morphlet's default bitstream targets one or more of the smaller zones and includes the partial reconfiguration logic required to enable it to use PR. PR-based scheduling also allows other Morphlets to continue uninterrupted. However, reconfigurable zones incur significant area overhead for the additional PR logic required and increase fragmentation of the FPGA fabric.

When the reconfigurable regions cannot accommodate the Morphlets of all applications concurrently, a morph operation occurs. The zone manager coalesces zones to form

larger ones, eventually converging to the single global zone, and the scheduler enters high-throughput mode. To do so, it re-targets running Morphlets by running place-and-route to create a bitstream that includes logic for all of them and subsequently maps that bitstream to the target zone. When the global zone is the target, this requires reconfiguring the whole FPGA. However, the global zone can accommodate significantly more Morphlets because PR support fabric is freed, and fragmentation is eliminated by not restricting Morphlets to exclusive partitions of the FPGA. AMORPHOS hides the latency of place-and-route for morph operations by caching or pre-computing combined bitstreams targeting the global zone in a Morphlet registry. The registry's entries are bitstreams for "co-Morphlets" representing co-compiled combinations of Morphlets.

    struct {
        bool   optLatency;
        size_t minMem;
        size_t optMem;
        size_t memBw;
        size_t PCIeBw;
    } RVector;

    struct {
        bitstream default_bitstream;
        map<RVector, netlist> mesh;
    } Cocoon;

Figure 5: Object model for Cocoons.

AMORPHOS also uses the morph operation for single Morphlets when the FPGA fabric is underutilized. Moving a Morphlet to a larger zone or the global zone puts significantly more resources at its disposal. The application can then use these resources to run faster. AMORPHOS targets applications in which Morphlets will likely run for an extended time, so the overhead of moving Morphlets to larger zones is amortized by the gains in aggregate throughput. The ability of a Morphlet to benefit from increasing resource shares is visible to AMORPHOS through the Morphlet's mesh, enabling AMORPHOS to avoid morphing when it is not performance-profitable to do so.
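The Morphlet registry described above can be modeled as a cache of co-Morphlet bitstreams keyed by the set of Morphlets they contain; a hit deploys in bitstream-load time, while a miss forces a full place-and-route run. The sketch below uses a bitmask of Morphlet ids as the set key for simplicity; the structure and names are illustrative, not the AMORPHOS implementation.

```c
#include <stdint.h>
#include <stddef.h>

/* Cache of pre-computed co-Morphlet bitstreams. */
#define REG_SLOTS 16

typedef struct {
    uint32_t    key[REG_SLOTS];   /* set of Morphlet ids, one bit each */
    const void *bits[REG_SLOTS];  /* stand-in for the cached bitstream */
    size_t      n;
} registry_t;

void registry_put(registry_t *r, uint32_t morphlet_set, const void *bitstream)
{
    if (r->n < REG_SLOTS) {
        r->key[r->n]  = morphlet_set;
        r->bits[r->n] = bitstream;
        r->n++;
    }
}

/* Returns the cached co-Morphlet bitstream for exactly this set of
 * Morphlets, or NULL (caller must fall back to place-and-route). */
const void *registry_get(const registry_t *r, uint32_t morphlet_set)
{
    for (size_t i = 0; i < r->n; i++)
        if (r->key[i] == morphlet_set)
            return r->bits[i];
    return NULL;
}
```

Keying on the exact set of co-scheduled Morphlets is what lets a morph to the global zone reuse a previously compiled combination instead of paying hours of PAR latency on the critical path.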


4.3 Morphlets and Cocoons

While Morphlets are analogous to and extend the process abstraction, the AMORPHOS build toolchain produces Cocoons from HDL specifications targeting AMORPHOS, which are analogous to an application binary. In addition to the deployable bitstream produced by current FPGA build tools, Cocoons encapsulate abstract information about the Morphlet to enable stages of the build toolchain to be re-invoked dynamically to produce different bitstreams on demand. Dynamic re-targeting enables co-scheduling of multiple Morphlets on a zone or dynamic scaling of the fabric resources allocated to an individual Morphlet.

Figure 5 shows the contents of a Cocoon, and Figure 4 shows how the various stages in the build and deployment process interact with Cocoons to enable dynamic targeting. A Cocoon's default bitstream targets a default zone on the device and can be deployed using PR. Its mesh encapsulates a constrained set of strategies for re-targeting the Morphlet's user logic. Concretely, a mesh is a map of abstract descriptions of the logic, or netlists, keyed by RVectors. An RVector describes a feasible combination of resources and scheduler hints for the corresponding netlist. The netlist acts as an intermediate representation (IR) which can, potentially in combination with netlists from other Morphlets, be used as input to place-and-route tools to produce new deployable bitstreams. The default bitstream is always used when the scheduler is in low-latency mode. When the scheduler is in high-throughput mode, it may compare current system state against RVectors in the mesh to select an appropriate netlist. To deploy the dynamically chosen configuration, the scheduler can then produce the required bitstream or look it up in the Morphlet registry (§4.5) to hide place-and-route latency.

RVectors. An RVector (Resource Vector) describes Morphlet resource constraints and utilization hints that cannot be derived from the netlist in the mesh. Important entries include Boolean-valued hints for memory and PCIe usage, which simplify connection to AMORPHOS FPGA-side interfaces, as well as optimal and minimal memory footprint and bandwidth estimates. Our experience implementing AMORPHOS is that hints regarding an application's bottleneck resources and access patterns are essential to guide co-scheduling. For example, this allows the hull to be optimized for lower memory access latency at the cost of some bandwidth. Note that low-level FPGA-specific resources (e.g., the number of LUTs, BRAMs, etc.) can be derived from a netlist and are not included in an RVector.

Quiescence Interface. Evacuating Morphlets from the FPGA is necessary when the enclosing process terminates or when the scheduler needs to reallocate a zone to another Morphlet. Rather than immediately removing the Morphlet (at the risk of losing work) or attempting to capture and save a Morphlet's state (difficult with current hardware [98, 56]), the hull provides a quiescence interface to inform the Morphlet of the impending context switch. The Morphlet is then given an opportunity to enter a stable state and/or save its progress. A Morphlet informs AMORPHOS that it can be safely switched by asserting a quiescence signal through the hull. Unresponsive Morphlets are forcibly evacuated after a configurable timeout to avoid DoS. Our current design allows Morphlets to leave data in on-board memory in the absence of memory pressure from incoming Morphlets. Transparent swap in/out of a Morphlet's FPGA DRAM state is a straightforward operation; our current prototypes do not yet support it.

System calls:

    int    mph_fork(Cocoon_t);
    void   mph_exit(int fd);
    void*  mph_rcntrl(int, addr_t);
    void   mph_wcntrl(int, addr_t, void*);
    size_t mph_read(int, addr_t, void*);
    void   mph_write(int, addr_t, void*, size_t);

Transport functions:

    // Control
    void*  peek(addr_t);
    void   poke(addr_t, void*);
    // Data transfer
    size_t read(addr_t, void*);
    void   write(addr_t, void*, size_t);

Figure 6: AMORPHOS host stack interfaces between user space and FPGA Morphlets. The userspace stack comprises the system call interface, scheduler, Morphlet manager, zone manager, registry, and transport layer, which communicate with the FPGA through the driver over the bus or network.

4.4 Host Stack/OS Interface

AMORPHOS integrates with the OS in the Catapult stack and acts as a user-mode library for F1. The entire host stack is depicted in Figure 6. AMORPHOS's OS interface exposes system calls to manage Morphlets and enables communication between host processes and Morphlets. The interface provides APIs to load and evacuate Morphlets as well as to read and write data over the transport layer to FPGA-resident Morphlets.

4.5 Morphlet Registry

AMORPHOS dynamically transitions between low-latency and high-throughput scheduling modes, reflecting a fundamental latency/density tradeoff. To hide the latency of dynamic bitstream generation, AMORPHOS maintains a registry, a cache of precomputed bitstreams that contain deployable spatial-sharing combinations of multiple Morphlets. For a large number of Morphlets, precomputing bitstreams for all possible combinations is impractical, particularly when combinations include duplicate Morphlets. We argue that a number of factors enable us to reduce the search space to a practical level. First, building combined Morphlets can occur in parallel. Second, reducing the latency of place-and-route is an active area, and recent research has produced order-of-magnitude reductions (20-70×), e.g., based on GPUs [40] or other parallel resources [39, 38]. Third, Morphlets can be grouped by popularity or according to hints encoded in RVectors to bound the number of choices, and sharing densities need not be maximized to achieve multiplicative improvements in throughput and utilization.

Figure 7: Cost of pre-compiling all possible combinations of Morphlets given varying numbers of deployable Morphlets and varying levels of concurrency. Cost is in dollars and reflects the cost of renting on-demand infrastructure from Amazon AWS to run the build toolchains. "Current" data are based on measurements with our present toolchain, while projected data are scaled to assume a (conservative) 20× improvement in place-and-route performance based on [40, 39, 38].

Figure 7 shows the cost in dollars for AWS infrastructure to pre-compile all possible combinations of Morphlets for varying numbers of Morphlets and concurrency levels, using current tools and using future tools whose performance is projected based on [40, 38, 39]. Compile times are derived from our own benchmark builds.1 The dotted lines correspond to a day and a week of compute time on 20 VMs. The AWS marketplace, at the time of this writing, offers only 18 FPGA applications [4]. From this pool, all possible co-schedules of 4 Morphlets can be computed in under a day for $100 in computation time. Faster future build tools and careful grouping to reduce the search space can increase utilization further. For example, if co-locatable Morphlets are partitioned in groups of 20, all densities of up to 8 can be precomputed in a handful of days for $1,000. The registry need not eliminate lookups or maximize density to significantly improve utilization.

5 Implementation

We implement AMORPHOS on Amazon F1 FPGA cloud instances [1, 2] and the Microsoft Catapult open research platform [92], available at TACC [5].

1 Concretely, a single instance of DNNWeaver can be compiled for F1 in 103 minutes. The second and third instances bring that to 118 minutes, while 8 instances can be co-compiled in 157 minutes.

Catapult and F1 both support shells that provide three basic forms of platform library support: 1) a bulk host-FPGA data transfer interface, 2) a control interface to manage FPGA applications, and 3) interfaces to on-board DRAM. Catapult and F1 expose these functions with different levels of abstraction. Catapult supports packetized bulk data transfers, a register interface for control signals, and a simple FPGA-side memory read/write interface with independent ports. F1's shell exports AXI4 [3] interfaces to encapsulate these three functional areas. AMORPHOS's interface must encapsulate both Catapult and F1 interfaces, as well as implement address translation for memory protection and I/O access mediation.

The AMORPHOS hull exposes 1) a Control Register interface (CntrlReg) for Morphlet management, 2) Simple PCIe for bulk data transfer, and 3) an AMORPHOS Memory Interface (AMI) supporting 64-byte read/write transactions. Morphlets written to these interfaces are portable across Catapult and F1. AMORPHOS transparently manipulates address bits so Morphlets believe they have full control of memory. OS-programmable BARs (base-address registers) are used to control and protect which regions of memory are accessible to different Morphlets. In addition to memory protection, AMORPHOS provides each Morphlet with a virtual address space and abstracts away the 1-to-1 port-to-channel mappings imposed by the F1 and Catapult shells. Virtual address spaces are striped across all memory channels. The number of co-resident Morphlets, memory access ports per Morphlet, and number of memory channels are parameters of the hull. Furthermore, the hull is modular and incurs no overhead for unused interfaces on a target FPGA platform.

Logic structures, such as FIFOs, are fundamental building blocks for FPGA application designers. AMORPHOS provides an FPGA-agnostic wrapper, HullFIFO, that exposes a high-level interface to efficiently map to low-level primitives on both F1 and Catapult.

5.1 Catapult

Catapult divides FPGA fabric into a shell and user logic called a role. The Catapult shell interface to memory is two 64-byte-wide read/write ports over disjoint address spaces. AMORPHOS adopts the 64-byte transaction size but virtualizes the interfaces for multiple co-resident Morphlets using segment-based address translation and buffering to support application-level read-modify-write operations.

To enable AMORPHOS to use partial reconfiguration to manage zones, we add a PR controller and a PR wrapper. The PR controller streams in PR bitstream data from the PCIe bus and transfers it to a PR IP module (a vendor-provided Intellectual Property logic block), which uses it to reconfigure the zone fabric. I/O to each Morphlet is routed through the PR wrapper, which handles driving the Morphlet inputs and disconnecting the Morphlet outputs during PR. This safeguards the application and prevents spurious I/O during the programming process.

5.2 F1

F1 features a shell and a user application as Custom Logic (CL). F1 features twice as many memory channels as Catapult and requires the CL to instantiate additional memory controllers if more than one memory channel is needed. AMORPHOS handles instantiating the memory controllers and is parameterized to scale itself to handle additional memory channels. The F1 shell features many different PCIe interfaces, some for DMA-type transfers between the host and the CL, and some lower-throughput interfaces for management/control of the CL. PCIe and memory on F1 are exposed over AXI4 interfaces, which are more complex than the interfaces on Catapult. This complexity is abstracted away from the Morphlet and implemented in our hull. The hull sits on top of an unmodified F1 shell.

5.3 Multiplexing AMORPHOS Interfaces

Large numbers of concurrent Morphlets can stress AMORPHOS's internal FPGA-side subsystems. Each Morphlet requires the same set of interfaces (CntrlReg, Memory, and PCIe). Routing connections to all of them is complicated by the fact that the I/O pads for each can be (and on F1/Catapult hardware, are) on different edges of the physical FPGA, which stresses place-and-route tools by complicating the routing problem and increasing congestion. Designing AMORPHOS's multiplexing logic to anticipate scale can mitigate some, but not all, of the problem. An initial design used multiple flat multiplexers to distribute interface signals to each Morphlet, but we found that, despite plenty of available fabric, they could not scale past 4 concurrent Morphlets in most cases.

Our current design implements a pipelined binary tree to route the CntrlReg signals. The tree-distribution network enables us to add pipeline stages, making it easier to meet timing while reducing the fanout of large data buses. The benefit is a substantial improvement in the scale at which AMORPHOS can route interfaces to concurrent Morphlets. The trade-off is minimal additional latency: 1 additional cycle for each layer, easily tolerable for CntrlReg, which is a low-bandwidth control interface.

Our current implementation takes a different approach with memory. Rather than scale the memory subsystem to provide N Morphlets with access to M memory channels for an arbitrary number of Morphlets, AMORPHOS uses flat multiplexing with up to 8 Morphlets and statically partitions the memory channels across groups of Morphlets at sharing densities above 8. This policy enables us to use a single level of multiplexing and provide access to all channels for all Morphlets at lower densities, but avoids the complexity and latency of an additional tree network at high densities. The trade-off is that Morphlets are restricted to using a subset of DRAM channels, which does not alter the capacity of their memory share but does reduce the bandwidth available to them. Memory systems perform better when they manage fewer access streams (assuming sequential access) because back-to-back operations from a single stream enable optimizations that are not feasible between operations from different streams. This design decision enables much higher densities as it improves routability: a group of Morphlets only needs to route to a subset of the memory channels. Our experience is that memory bandwidth contention determines the upper bound on scalability for Morphlets that share DRAM. Contention occurs at lower levels of concurrency than the levels that require strict group-based DRAM channel partitioning, so optimizing DRAM access for high sharing density is unlikely to provide substantial benefits.

5.4 Host Stack

AMORPHOS provides a host stack that interfaces with userspace applications, implemented as an OS extension in our Catapult prototype and as a user-mode library for F1. The host stack comprises a system call interface, an FPGA Morphlet manager and scheduler, a zone manager, and a transport layer that encapsulates the control and bulk transfer interfaces described above (§5). The interface and stack structure are illustrated in Figure 6. Control signals and data reads/writes pass through the syscall interface to the transport layer. Morphlet allocation, scheduling hints, and teardown are redirected to the Morphlet scheduler and zone manager.

The host system call interface for Catapult is implemented as a service that supports the transport layer by wrapping the Catapult driver and library stack. The service associates Morphlets with file descriptors, exporting read and write operations on them, and communicates with the scheduler to monitor the active state of executing Morphlets or to request quiescence.

6 Evaluation

AMORPHOS runs on both a Mt Granite FPGA board in the Catapult V1 cloud platform [92], containing an Altera Stratix V GS running at 125 MHz with two 4 GB DDR3 channels, and an Amazon F1 cloud instance [1], using a Xilinx UltraScale+ VU9P running at 125 MHz with four 16 GB DDR4 channels. Both platforms are connected over a PCIe bus and support build tools that we adapt to build AMORPHOS and our benchmarks, summarized in Table 1.


Program    Description
DNNWeaver  Convolutional neural network
MemDrive   Memory streaming
Bitcoin    Bitcoin hashing accelerator
DFADD      Double-precision addition
DFMUL      Double-precision multiplication
DFSIN      Double-precision sine function
MIPS       Simplified MIPS processor
ADPCM      Adaptive differential pulse codec
GSM        Linear predictive coding analysis
JPEG       JPEG image decompression
MOTION     Motion vector decoding
AES        Advanced encryption standard
BLOWFISH   Data encryption standard
SHA        Secure hash algorithm

Table 1: Benchmarks used to evaluate AMORPHOS

Benchmarks. We evaluate benchmarks that cover three important categories of FPGA applications, defined by whether they are memory-bound, compute-bound, or dynamic-resource-bound. Morphlets are compute-bound when low-level FPGA resources such as LUTs, BRAMs, etc. are limited. Morphlets are memory-bound when off-chip memory bandwidth or latency constrains their performance. Morphlets are dynamic-resource-bound when they can be mapped to the fabric in ways that represent different points along their roofline model [82], meaning they can be either memory- or compute-bound. Our Bitcoin Morphlet (based on [12]) is compute-bound. It is parameterized to replicate hashing units and can scale to consume most of the on-board FPGA fabric. Additional instances of functional units increase logic utilization, limiting the maximum size/throughput of the Morphlet. Applications that are memory-bound usually have a low compute-to-memory ratio and directly benefit from additional off-chip memory bandwidth. Streaming applications (e.g., in databases [75] or search [112]) access large amounts of data, often discarding much of it or doing minimal compute per datum. To represent a range of such applications, we wrote a custom Morphlet called MemDrive (MemD) that can be configured on the host side post-synthesis to generate different memory traffic patterns and read/write ratios, along with operations such as fills, reductions, and ECC checks.

Many applications can be configured to take advantage of either additional logic or additional memory bandwidth, corresponding to different points along their roofline model. To represent this class, we evaluate DNNWeaver [96], an open-source DNN design framework that can be used to synthesize models from a description of a specific network topology. The user controls the number of functional units and data buffer sizes, translating to variable demand for on- and off-chip resources. We instantiate DNNWeaver with an 8-layer LeNet [71] topology.

Catapult Benchmark  Logic Cells  Registers  BRAM Bits
DNNWeaver           39,994       108,640    387,840
MemDrive            2,449        1,488      570,496
Bitcoin             42,171       60,257     21,408
blowfish            20,581       24,082     810,850
gsm                 20,910       24,716     5,552
mips                17,672       19,981     657,574
dfmul               17,759       20,586     0
aes                 23,900       28,366     689,630
motion              25,178       26,734     687,366
dfadd               18,043       21,014     662,694
sha                 17,772       21,380     788,806
adpcm               22,840       29,837     663,654
jpeg                42,243       40,327     1,116,312
dfsin               26,742       32,572     663,805

F1 Benchmark        LUTs         Flip Flops BRAM Bits
DNNWeaver           4,924        4,773      339,968
MemDrive            1,136        930        0
Bitcoin             40,106       46,191     0

Table 2: FPGA resource utilization by Morphlet type, broken down by resource type as reported by each platform.

To increase benchmark diversity, we include a number of benchmarks that perform useful, non-trivial functions but do not fully utilize the fabric or memory bandwidth. We use the LegUp [7] high-level synthesis (HLS) environment to generate 11 Morphlets (a subset of CHStone [48]). LegUp applications compose memory from BRAMs when needed, rather than using off-chip DRAM, so they do not contend for DRAM bandwidth. However, as many FPGA applications (DNNs included) are optimized to ensure their working set fits in on-chip BRAMs to minimize off-chip memory access, we believe they are representative.

Metrics. We report resource utilization and performance measured by throughput. The build tools for each platform break down resource utilization into logic, registers/flip-flops, and BlockRAMs. Morphlets are instrumented with cycle counters to measure runtime on the FPGA while each is running. End-to-end execution time is measured from the host side. Performance for MemDrive is reported as memory throughput (bytes/cycle). Bitcoin performance is reported as normalized hash throughput, with the baseline being a fully unrolled and pipelined instance of the application (to the maximum the open-source code permitted), producing a full block hash per cycle. DNNWeaver performance is reported as normalized throughput, where the baseline is the number of cycles required for input data to run through all network layers and complete inference.

We evaluate AMORPHOS with 14 different benchmarks, listed in Table 1. The logic, register, and memory utilization of these benchmarks is listed for Catapult and (partially) for F1 in Table 2.

Table 3 shows the increases in utilization and system throughput that are made possible by co-scheduling Morphlets using AMORPHOS. Utilization is measured as ALM (adaptive logic module) or LUT (lookup table) usage relative to the smallest configuration on each platform: 1 Bitcoin for Catapult and 1 MemDrive for F1. System throughput is reported as the sum of each Morphlet's normalized throughput, relative to a single instance of that Morphlet. In only two cases does co-scheduling Morphlets result in reduced system throughput; both involve multiple MemDrive Morphlets, which interfere significantly with other memory-dependent Morphlets. In the best cases, co-scheduling Morphlets results in 7.03x increased utilization and 23.22x increased throughput.

Catapult Configuration  # ALMs   Utilization  Sys. Throughput
1 Bitcoin               63,973   1.00x        1.00x
2 Bitcoin               93,908   1.47x        2.00x
4 Bitcoin               141,139  2.21x        4.00x
1 DNNWeaver             92,619   1.45x        1.00x
2 DNNWeaver             134,972  2.11x        2.00x
4 DNNWeaver             154,956  2.42x        3.31x
1 DNN, 1 MemD           92,135   1.44x        1.41x
2 DNN, 2 MemD           148,249  2.32x        0.80x
1 DNN, 1 BTC            112,010  1.75x        2.00x
2 DNN, 2 BTC            140,635  2.20x        3.68x
1 DNN, 1 BTC, 2 MemD    96,994   1.52x        1.86x
2 BTC, 2 MemD           95,936   1.50x        2.77x

F1 Configuration        # LUTs   Utilization  Sys. Throughput
1 MemD                  68,885   1.00x        1.00x
2 MemD                  89,161   1.29x        1.67x
4 MemD                  100,773  1.46x        1.37x
8 MemD                  127,530  1.85x        0.78x
1 Bitcoin               104,851  1.52x        1.00x
4 Bitcoin               229,482  3.33x        4.00x
8 Bitcoin               484,879  7.03x        8.00x
1 DNNWeaver             90,118   1.31x        1.00x
4 DNNWeaver             129,925  1.89x        3.94x
8 DNNWeaver             187,839  2.73x        7.80x
16 DNNWeaver            294,290  4.28x        14.80x
32 DNNWeaver            397,580  5.78x        23.22x

Table 3: Morphlet configurations run in AMORPHOS with corresponding ALM/LUT (logic) usage, relative system utilization improvement, and relative system throughput.

6.1 CHStone

We evaluate the CHStone benchmarks to illustrate generality and to demonstrate that useful accelerators can be co-scheduled at high density with AMORPHOS to increase throughput. We find that the upper bound on density for all of them is determined by AMORPHOS's ability to route control interfaces to them, which translates to an upper bound of 8 on our Catapult prototype. Because the LegUp compiler implements memory with BRAM, rather than by connecting to on-board DRAM, the CHStone workloads' only shared resource is the CntrlReg interface. Absent any source of contention, they scale linearly to the upper bound when co-scheduled as Morphlets on AMORPHOS.

Figure 8: Total and per-MemDrive memory bandwidth for different numbers of Morphlets running in AMORPHOS on Catapult.

We do not report measurements on our F1 prototype asthey illustrate the same phenomenon.

6.2 MemDrive

We study contention between memory-bound Morphlets using MemDrive, which stresses memory bandwidth. AMORPHOS's 64-byte read/write interface maps well to Catapult, but does not support burst transactions (one transaction returning multiple data payloads), which are necessary to achieve high read throughput on F1's AXI interface to memory. While we were able to achieve peak write bandwidth on F1 and observe contention due to multiple applications running concurrently, we were unable to saturate read bandwidth. In future work, we intend to introduce burst detection and dynamically coalesce memory requests.

Catapult’s memory system has a theoretical bandwidthof of 128 bytes/cycle. Experiments on our Catapult pro-totype show that the total achievable memory bandwidthis roughly 100 bytes/cycle for writes and 90 bytes/cyclefor reads. We ran MemDrive in AMORPHOS and directlyon the Catapult system to confirm that our virtualizationlayer incurs no bandwidth loss. Figure 8 shows the per-Morphlet and total system read/write bandwidth when run-ning 1–8 Morphlets of MemDrive in AMORPHOS. Totalsystem bandwidth decreases as the number of co-residentMorphlets rises from 1 up to 4, and saturates from 4 to 8.On F1, we observed similar contention when scaling from1 to 8 MemDrive Morphlets. The RVector of each Mor-phlet provides hints to AMORPHOS’s on-FPGA memoryscheduler, enabling it to manage contention fairly, andimprove effective memory bandwidth (e.g. by batchingmemory requests) or minimize latency for Morphlets thatare latency sensitive. MemDrive is not latency sensitive,but its ability to saturate memory has implications for thememory scheduler, which must take care to ensure thatlatency sensitive Morphlets such as DNNWeaver are notimpacted by that saturation.


6.3 DNNWeaver

Table 3 shows how DNNWeaver scales when instantiating multiple Morphlets. We see that aggregate throughput increases with more Morphlets, but contention causes the deviation from perfect linear scaling to grow with the number of co-resident Morphlets. Catapult and F1 theoretically have enough bandwidth to support up to 4 and 32 DNNWeaver Morphlets, respectively. Contention for the memory system manifests as an increase in memory latency for DNNWeaver. We further show this contention in Table 3 by pairing DNNWeaver with MemDrive. Since DNNWeaver performance can suffer if it is paired with a memory-bound Morphlet, encoding a Morphlet's sensitivity to memory bandwidth/latency in the RVector is useful to the AMORPHOS scheduler.

6.4 Bitcoin

Up to 4 and 8 Bitcoin Morphlets can be co-resident on Catapult and F1, respectively. Table 3 shows that scaling is linear, as Bitcoin contends only for on-chip resources, which are assigned during bitstream generation. The RVector for a Bitcoin-type application specifies that there is no runtime overhead beyond fabric resources. This enables AMORPHOS to intelligently co-schedule Bitcoin with other Morphlets that make heavy use of memory but require far fewer fabric resources, such as MemDrive. Compute-bound Morphlets are well suited to utilizing otherwise unused fabric, as they can scale with available logic resources without hurting the performance of memory-bound Morphlets. We show this in Table 3 by pairing Bitcoin with DNNWeaver and MemDrive.

6.5 Density Limits

To determine the limits on sharing density, we co-schedule as many concurrent Morphlets as possible, manually manipulating the build process where necessary to achieve higher density. While AMORPHOS can achieve high levels of concurrency this way, practically attainable and performance-profitable levels are lower. High-density co-scheduling of Morphlets stresses build tools because interfaces must be routed to each Morphlet. Avoiding routing congestion at higher densities requires manipulation of the build tools. For example, configuring the build tools to focus on congestion rather than logic minimization spreads out the design and replicates logic, increasing area overheads. Routing is heuristic, so successfully meeting timing can depend on trying multiple random seeds. Such interventions are impractical to automate in an OS scheduler, and a production deployment of AMORPHOS would necessarily tolerate sharing densities below the maximum possible.

Morphlet    MaxPerf  MaxTools  Max
DNNWeaver   32       8         32
Bitcoin     8        4         8
MemDrive    2        8         32

Table 4: Limits on AMORPHOS F1 sharing density for DNNWeaver, MemDrive, and Bitcoin. The MaxPerf column indicates the level of Morphlet concurrency at which throughput is maximal. The MaxTools column indicates the maximum concurrency achievable without manual intervention in the build process. The Max column indicates the maximum level we attained with manual intervention in the build process. For example, DNNWeaver's maximal performance is achieved at 32 Morphlets, which is only achievable with manual effort; the build toolchain defaults achieve a maximum density of 8.

Limits on sharing density differ across workloads. Table 4 shows maximum densities on F1 when the upper bound is determined by best throughput, build transparency, or physical limits of the FPGA.

6.6 End-to-End Performance

To compare AMORPHOS against other FPGA sharing designs, we measure the time required to run 1-8 Bitcoin instances on Catapult using AMORPHOS in high-throughput mode, several slot-based approaches, and a no-sharing baseline. The performance of slot-based approaches is emulated by running AMORPHOS in low-latency mode, which uses PR to switch between zones of equal size. The performance of not sharing is emulated by running AMORPHOS with a single Morphlet. Since programming the whole FPGA using the Catapult tools takes a significant amount of time, we also emulate optimal full-FPGA reconfiguration by adding a delay of 200ms, which is comparable to programming the whole FPGA via PR. The overhead of using AMORPHOS to emulate these approaches is negligible compared to application runtime, so we expect our results to be accurate for all approaches.

In high-throughput mode, AMORPHOS can fit 4 full-sized Bitcoin Morphlets on the FPGA; we assume that the registry is pre-populated with the required bitstream (see §4.5). When using fixed slots, only two Bitcoin instances can be co-resident. Since slots may not always be able to fit the largest version of Bitcoin, we emulate three different sizes of slots, which we refer to as small, medium, and large. The small slot can fit a quarter-speed variant of Bitcoin, the medium slot a half-speed variant, and the large slot the full-speed variant. In the no-sharing approach, a single full-sized Bitcoin Morphlet is instantiated.

Figure 9 reports the full system runtime of each approach. When running only a single Bitcoin Morphlet, AMORPHOS is comparable to both the no-sharing and two-large-slots approaches. The smaller slot-based approaches limit the size of the application and already perform worse than AMORPHOS. With two Bitcoin Morphlets, only AMORPHOS and two large slots are comparable. Finally, with 3 or more Bitcoin Morphlets, AMORPHOS is consistently able to attain higher logic densities, and therefore better throughput, than all competing approaches. While AMORPHOS cannot always run in high-throughput mode as it does here, we expect it to maintain the same comparative advantage in the long run, as it only has to operate in low-latency mode until a high-throughput bitstream has been generated.

Figure 9: End-to-end runtime of Bitcoin executing under several different sharing schemes. Runtime is plotted logarithmically; lower runtimes are better.

6.7 Hierarchical Zone Management

AMORPHOS can manage a zone in three ways. It can allocate the zone for exclusive use by a single Morphlet, co-schedule multiple Morphlets on it, or recursively subdivide it into two smaller zones. Subdividing top-level zones may be attractive if Morphlets do not fully utilize those zones or if Morphlet response time is more important than end-to-end run time. This flexibility gives rise to a policy space that trades off between density, performance, and registry overhead.

To characterize these trade-offs, we run three Bitcoin Morphlets on our Catapult prototype, in which AMORPHOS uses a single global zone or two top-level reconfigurable zones, each of which may be subdivided in two. We measure end-to-end execution time to completion for all three Morphlets, using a lower-bound baseline that does not share (non-sharing) and an upper-bound baseline that co-schedules all Morphlets on the global zone (global). We evaluate three different policies for managing the top-level zones. The first implements only a single level of zone partitioning (single-level) with no co-scheduling within the zones. The second policy schedules combined Morphlets on zones without subdividing the zones (co-schedule). The third policy (subdivide) can subdivide the top-level zones. Registry entries for all combined bitstreams are pre-populated. For the co-schedule and subdivide cases, we morph the second two Morphlets by scaling them down to fit concurrently in a top-level reconfigurable zone, which reduces their throughput by a factor of 4 but allows us to run all three Morphlets in parallel.

Figure 10: End-to-end performance of various zone-sharing schemes when executing 3 Bitcoin Morphlets.

Figure 10 shows end-to-end speedup relative to the non-sharing case for all policies. In this scenario, a single level of zone partitioning is the best option when co-scheduling on the global zone is not possible. This enables the first two Morphlets to run concurrently, providing additional concurrency that results in a performance gain relative to the no-sharing strategy. Both strategies for subdividing a top-level reconfigurable zone perform worse than the sequential case, for two reasons. First, performance is reduced by scaling Morphlets down to fit a subdivided zone. Second, subdividing zones does not make all the underlying resources available to each subdivision: additional PR logic is required for each, which consumes additional area and reduces routability.

Measurements of overhead for PR on Catapult FPGAs show that it grows linearly with density. Interconnect is the bottleneck resource, with 4% of global interconnect and 2% of global logic consumed by PR per Morphlet. While the 4% interconnect overhead can become a significant fraction of the allocatable fabric, density is primarily limited by fragmentation (we observe an average utilization loss of 16% in our workloads), which makes multiple levels of subdivision unprofitable for all but very small Morphlets on Catapult.

However, multiple levels of subdivision may be useful on F1 FPGAs, where the fraction of resources allocatable through PR zones is larger. F1 does not expose PR, so to predict sharing densities for PR-based subdivision on F1, we extrapolate assuming the same average 16% fragmentation per PR zone and 2% per-Morphlet overhead for PR logic. We assume that all fabric not consumed by AMORPHOS and PR logic can be divided evenly and allocated to Morphlets, but we impose a 90% upper bound on utilization per resource type, which Xilinx suggests is the likely upper bound on UltraScale FPGAs [14]. This over-estimates utilizable fabric: other vendor guidance is more conservative [10] and our measured utilization does not exceed 70% for any workload.
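The extrapolation described above is essentially arithmetic. A minimal sketch of one way to compute such a bound follows; the exact model the paper uses is not fully specified, so the formula below, the function name, and the resource counts in the example are our assumptions (only the 90% cap, 16% fragmentation, and 2% PR-overhead parameters come from the text).

```python
def max_density(total_clbs, os_clbs, morphlet_clbs,
                util_cap=0.90, fragmentation=0.16, pr_overhead=0.02):
    """Back-of-the-envelope sharing-density bound in the spirit of the
    paper's F1 extrapolation. All structure here is an assumption:
      - fabric usable = utilization cap minus AmorphOS's fixed footprint;
      - each Morphlet pays its own logic plus per-Morphlet PR logic,
        inflated by the fragmentation loss of its PR zone.
    """
    usable = util_cap * total_clbs - os_clbs
    per_morphlet = (morphlet_clbs + pr_overhead * total_clbs) / (1 - fragmentation)
    return int(usable // per_morphlet)

# Hypothetical CLB counts, for illustration only (not F1's real figures):
density = max_density(total_clbs=100_000, os_clbs=5_000, morphlet_clbs=4_000)
```

With these made-up inputs the bound comes out to 11 Morphlets; plugging in real per-Morphlet CLB counts is what yields the paper's predictions of 16 and 4 below.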

USENIX Association 13th USENIX Symposium on Operating Systems Design and Implementation 119


Derived upper bounds on density for F1 hardware show that CLBs (Configurable Logic Blocks, which encapsulate multiple LUTs) are the limiting resource. We predict maximum sharing density with PR to be 16 and 4 for DNNWeaver and Bitcoin, respectively.² This suggests that zone subdivision will likely be possible and effective on F1. Our experience building AMORPHOS, however, is that subdividing zones increases the design complexity of hardware components and limits density unnecessarily by increasing fragmentation. In contrast, co-scheduling Morphlets on a global zone can provide much higher densities with potentially higher effective deployment latency, but shifts much of the complexity to a software registry.

7 Related Work

FPGA programmability. Improving FPGA programmability is an active area largely characterized by efforts to enable programming with higher-level languages, including C/C++ and other imperative languages [60, 19, 34, 69, 13, 18, 51, 68, 67], DSLs [32, 69, 95, 20, 81, 101, 66, 91], and even managed sequential languages such as C# [32] and map-reduce [95]. Progress in this area motivates our work, but is also orthogonal to it.

FPGA access to OS-managed resources. Prior work has explored exposing file systems [100] and the syscall interface [77, 100] to FPGAs. Much of this work has similar goals to our own, but we decided to focus on the exploration of cross-domain sharing and basic memory virtualization. A more mature AMORPHOS could clearly benefit from the rich body of work on memory virtualization for FPGAs [33, 15, 114, 77].

FPGA OSes. Previous work on FPGA OSes has focused on theoretical foundations for spatial sharing [43, 102, 108, 31], mechanisms for task preemption [73], relocation [55], context switch [72, 93], and scheduling of hardware and software tasks [25, 102, 108, 44]. While these explore ideas pertinent to OS primitives, end-to-end OS system-building was not their goal.

Extending current OS abstractions to FPGAs is another area of active research. ReconOS [77] extends a multithreaded programming model to configurable SoCs, enabling programmers to use "hardware threads" to transparently access OS-managed objects in the eCos [41] embedded OS. Hthreads [86] implements a similar hardware thread abstraction. Borph [100, 99] uses a hardware process abstraction to encapsulate FPGA logic in a process-like protection domain. Multi-application sharing for FPGAs is explored in [31, 109, 52], but some works

²We do not predict density for MemDrive, as it is bottlenecked by memory bandwidth at low densities.

restrict the programming model or design space [111], or do not tackle isolation and protection [31]. AMORPHOS differs by proposing new OS abstractions that depart from existing CPU-oriented programming models.

MURAC [45] is the most closely related work to AMORPHOS. In MURAC, a process's logical address space encompasses all on-device resources that logically "belong to it", enabling the scheduler to support context switch using an ICAP (Internal Configuration Access Port). AMORPHOS takes a similar position on protection domains, but focuses on spatial scheduling and does not rely on hardware support for state capture.

FPGA Virtualization. Systems have been proposed that virtualize FPGAs with regions [88], tasks [89], processing elements [37], IPC-like communication primitives [80], and abstraction layers/overlays over diverse FPGA hardware [62, 50, 24, 61, 103]. Works virtualizing FPGAs in the cloud [30, 1, 79] share many of our core goals and tackle similar challenges. While these platforms use similar primitives to those of AMORPHOS, they typically restrict the programming and/or deployment model and do not support cross-domain sharing of FPGA fabric.

Overlays. FPGA overlays provide a virtualization layer that makes a design independent of specific FPGA hardware [24, 113], enabling fast compilation times and low deployment latency [58, 64] at the cost of reduced hardware utilization and throughput. Like AMORPHOS, many overlays support some time-sharing and/or spatial sharing. Because overlays implement the same virtual architecture on different devices, they form a compatibility layer at the hardware interface. In contrast, AMORPHOS provides compatibility at the application-OS interface. Unlike Morphlets, overlays run on a virtual architecture, introducing overheads that limit utilization and performance.

8 Conclusion

This paper has described AMORPHOS, a design for protected sharing and compatibility on FPGAs based on abstractions that preserve existing programming models. AMORPHOS modulates between space- and time-sharing policies and isolates logic from different applications, enabling cross-cloud compatibility and dramatically improved throughput and utilization.

9 Acknowledgements

We thank the anonymous reviewers and our shepherd Miguel Castro for their insights and comments. We acknowledge funding from NSF grant CNS-1618563.


References

[1] Amazon EC2 F1 Instances. https://aws.amazon.com/ec2/instance-types/f1/. (Accessed on 09/27/2018).

[2] Amazon Web Services (AWS) - Cloud Computing Services. https://aws.amazon.com/. (Accessed on 04/30/2018).

[3] AMBA Specifications - Arm. https://www.arm.com/products/system-ip/amba-specifications. (Accessed on 05/03/2018).

[4] AWS FPGA Marketplace. https://aws.amazon.com/marketplace/search/results?searchTerms=fpga. (Accessed on 09/14/2018).

[5] Catapult - Texas Advanced Computing Center. https://www.tacc.utexas.edu/systems/catapult/. (Accessed on 9/27/2018).

[6] Edico Genome's DRAGEN Platform. http://edicogenome.com/dragen-bioit-platform/. (Accessed on 5/2/2018).

[7] High-Level Synthesis with LegUp. http://legup.eecg.utoronto.ca/. (Accessed on 10/24/2017).

[8] Innova-2 Flex Programmable Network Adapter. http://www.mellanox.com/related-docs/prod_adapter_cards/PB_Innova-2_Flex.pdf. (Accessed on 5/2/2018).

[9] Live Video Encoding Using New AWS F1 Acceleration - NGCodec. https://ngcodec.com/news/2017/3/31/live-video-encoding-using-new-aws-f1-acceleration. (Accessed on 09/27/2018).

[10] Measuring Device Performance and Utilization: A Competitive Overview (WP496). https://www.xilinx.com/support/documentation/white_papers/wp496-comp-perf-util.pdf. (Accessed on 09/27/2018).

[11] Microsoft unveils Project Brainwave for real-time AI - Microsoft Research. https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/. (Accessed on 10/21/2017).

[12] progranism/Open-Source-FPGA-Bitcoin-Miner. https://github.com/progranism/Open-Source-FPGA-Bitcoin-Miner. (Accessed on 10/24/2017).

[13] SDAccel Development Environment. https://www.xilinx.com/products/design-tools/software-zone/sdaccel.html. (Accessed on 09/27/2018).

[14] UltraScale Architecture: Highest Device Utilization, Performance, and Scalability (WP455). https://www.xilinx.com/support/documentation/white_papers/wp455-utilization.pdf. (Accessed on 09/27/2018).

[15] ADLER, M., FLEMING, K. E., PARASHAR, A., PELLAUER, M., AND EMER, J. LEAP scratchpads: Automatic memory and cache management for reconfigurable logic. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2011), FPGA '11, ACM, pp. 25–28.

[16] ALTERA. Cyclone V SoC Development Board Reference Manual. https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/manual/rm_cv_soc_dev_board.pdf. (Accessed on 5/2/2018).

[17] ALTERA. Integrating 100-GbE Switching Solutions on 28nm FPGAs. https://www.altera.com/en_US/pdfs/literature/wp/wp-01127-stxv-100gbe-switching.pdf. (Accessed on 5/2/2018).

[18] ANDERSON, E., AGRON, J., PECK, W., STEVENS, J., BAIJOT, F., SASS, R., AND ANDREWS, D. Enabling a uniform programming model across the software/hardware boundary. FCCM '06, 2006.

[19] AUERBACH, J., BACON, D. F., CHENG, P., AND RABBAH, R. Lime: A Java-compatible and synthesizable language for heterogeneous architectures. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications (New York, NY, USA, 2010), OOPSLA '10, ACM, pp. 89–108.


[20] BACHRACH, J., VO, H., RICHARDS, B. C., LEE, Y., WATERMAN, A., AVIZIENIS, R., WAWRZYNEK, J., AND ASANOVIC, K. Chisel: Constructing hardware in a Scala embedded language. In The 49th Annual Design Automation Conference 2012, DAC '12, San Francisco, CA, USA, June 3-7, 2012 (2012), pp. 1216–1225.

[21] BHASKER, J. A VHDL Primer. Prentice-Hall, 1999.

[22] BOHR, M. Moore's Law Leadership. https://newsroom.intel.com/newsroom/wp-content/uploads/sites/11/2017/03/Mark-Bohr-2017-Moores-Law.pdf. (Accessed on 09/27/2018).

[23] BOURGE, A., MULLER, O., AND ROUSSEAU, F. Automatic high-level hardware checkpoint selection for reconfigurable systems. In Field-Programmable Custom Computing Machines (FCCM), 2015 IEEE 23rd Annual International Symposium on (2015), IEEE, pp. 155–158.

[24] BRANT, A., AND LEMIEUX, G. G. ZUMA: An open FPGA overlay architecture. In Field-Programmable Custom Computing Machines (FCCM), 2012 IEEE 20th Annual International Symposium on (2012), IEEE, pp. 93–96.

[25] BREBNER, G. J. A virtual hardware operating system for the Xilinx XC6200. In Proceedings of the 6th International Workshop on Field-Programmable Logic, Smart Applications, New Paradigms and Compilers (London, UK, UK, 1996), FPL '96, Springer-Verlag, pp. 327–336.

[26] BYMA, S., STEFFAN, J. G., BANNAZADEH, H., GARCIA, A. L., AND CHOW, P. FPGAs in the cloud: Booting virtualized hardware accelerators with OpenStack. In Proceedings of the 2014 IEEE 22nd International Symposium on Field-Programmable Custom Computing Machines (Washington, DC, USA, 2014), FCCM '14, IEEE Computer Society, pp. 109–116.

[27] BYMA, S., TARAFDAR, N., XU, T., BANNAZADEH, H., LEON-GARCIA, A., AND CHOW, P. Expanding OpenFlow capabilities with virtualized reconfigurable hardware. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2015), FPGA '15, ACM, pp. 94–97.

[28] CASPER, J., AND OLUKOTUN, K. Hardware acceleration of database operations. In Proceedings of the 2014 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2014), FPGA '14, ACM, pp. 151–160.

[29] CHAI, Z., YU, J., WANG, Z., ZHANG, J., AND ZHOU, H. An embedded FPGA operating system optimized for vision computing (abstract only). In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2015), FPGA '15, ACM, pp. 271–271.

[30] CHEN, F., SHAN, Y., ZHANG, Y., WANG, Y., FRANKE, H., CHANG, X., AND WANG, K. Enabling FPGAs in the cloud. In Proceedings of the 11th ACM Conference on Computing Frontiers (New York, NY, USA, 2014), CF '14, ACM, pp. 3:1–3:10.

[31] CHEN, L., MARCONI, T., AND MITRA, T. Online scheduling for multi-core shared reconfigurable fabric. In Proceedings of the Conference on Design, Automation and Test in Europe (San Jose, CA, USA, 2012), DATE '12, EDA Consortium, pp. 582–585.

[32] CHUNG, E. S., DAVIS, J. D., AND LEE, J. LINQits: Big data on little clients. In 40th International Symposium on Computer Architecture (June 2013), ACM.

[33] CHUNG, E. S., HOE, J. C., AND MAI, K. CoRAM: An in-fabric memory architecture for FPGA-based computing. In Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2011), FPGA '11, ACM, pp. 97–106.

[34] COUSSY, P., AND MORAWIEC, A. High-Level Synthesis: From Algorithm to Digital Circuit. Springer Science & Business Media, 2008.

[35] CROCKETT, L. H., ELLIOT, R. A., ENDERWITZ, M. A., AND STEWART, R. W. The Zynq Book: Embedded Processing with the ARM Cortex-A9 on the Xilinx Zynq-7000 All Programmable SoC. Strathclyde Academic Media, 2014.

[36] DAI, G., CHI, Y., WANG, Y., AND YANG, H. FPGP: Graph processing framework on FPGA: A case study of breadth-first search. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2016), FPGA '16, ACM, pp. 105–110.


[37] DEHON, A., MARKOVSKY, Y., CASPI, E., CHU, M., HUANG, R., PERISSAKIS, S., POZZI, L., YEH, J., AND WAWRZYNEK, J. Stream computations organized for reconfigurable execution. Microprocessors and Microsystems 30, 6 (2006), 334–354.

[38] DHAR, S., ADYA, S. N., SINGHAL, L., IYER, M. A., AND PAN, D. Z. Detailed placement for modern FPGAs using 2D dynamic programming. In Proceedings of the 35th International Conference on Computer-Aided Design, ICCAD 2016, Austin, TX, USA, November 7-10, 2016 (2016), p. 9.

[39] DHAR, S., IYER, M. A., ADYA, S. N., SINGHAL, L., RUBANOV, N., AND PAN, D. Z. An effective timing-driven detailed placement algorithm for FPGAs. In Proceedings of the 2017 ACM on International Symposium on Physical Design, ISPD 2017, Portland, OR, USA, March 19-22, 2017 (2017), pp. 151–157.

[40] DHAR, S., AND PAN, D. Z. GDP: GPU accelerated detailed placement. In HPEC (2018).

[41] DOMAHIDI, A., CHU, E., AND BOYD, S. ECOS: An SOCP solver for embedded systems. In Control Conference (ECC), 2013 European (2013), IEEE, pp. 3071–3076.

[42] FAHMY, S. A., VIPIN, K., AND SHREEJITH, S. Virtualized FPGA accelerators for efficient cloud computing. In Proceedings of the 2015 IEEE 7th International Conference on Cloud Computing Technology and Science (CloudCom) (Washington, DC, USA, 2015), CLOUDCOM '15, IEEE Computer Society, pp. 430–435.

[43] FU, W., AND COMPTON, K. Scheduling intervals for reconfigurable computing. In Field-Programmable Custom Computing Machines, 2008. FCCM '08. 16th International Symposium on (April 2008), pp. 87–96.

[44] GONZALEZ, I., LOPEZ-BUEDO, S., SUTTER, G., SANCHEZ-ROMAN, D., GOMEZ-ARRIBAS, F. J., AND ARACIL, J. Virtualization of reconfigurable coprocessors in HPRC systems with multicore architecture. J. Syst. Archit. 58, 6-7 (June 2012), 247–256.

[45] HAMILTON, B. K., INGGS, M., AND SO, H. K. H. Scheduling mixed-architecture processes in tightly coupled FPGA-CPU reconfigurable computers. In Field-Programmable Custom Computing Machines (FCCM), 2014 IEEE 22nd Annual International Symposium on (May 2014), pp. 240–240.

[46] HAN, S., MAO, H., AND DALLY, W. J. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. arXiv preprint arXiv:1510.00149 (2015).

[47] HANSEN, S. G., KOCH, D., AND TORRESEN, J. High speed partial run-time reconfiguration using enhanced ICAP hard macro. In Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011 IEEE International Symposium on (2011), IEEE, pp. 174–180.

[48] HARA, Y., TOMIYAMA, H., HONDA, S., TAKADA, H., AND ISHII, K. CHStone: A benchmark program suite for practical C-based high-level synthesis. In Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on (2008), IEEE, pp. 1192–1195.

[49] HARI KRISHNAN, R., AND SAI SAKETH, Y. Cryptocurrency mining: Transition to cloud.

[50] HUANG, C.-H., AND HSIUNG, P.-A. Hardware resource virtualization for dynamically partially reconfigurable systems. IEEE Embed. Syst. Lett. 1, 1 (May 2009), 19–23.

[51] SRC COMPUTERS, INC. Carte programming environment, 2006.

[52] ISMAIL, A., AND SHANNON, L. FUSE: Front-end user framework for O/S abstraction of hardware accelerators. In Proceedings of the 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (Washington, DC, USA, 2011), FCCM '11, IEEE Computer Society, pp. 170–177.

[53] ISTVAN, Z., SIDLER, D., ALONSO, G., AND VUKOLIC, M. Consensus in a box: Inexpensive coordination in hardware. In Proceedings of the 13th USENIX Conference on Networked Systems Design and Implementation (Berkeley, CA, USA, 2016), NSDI '16, USENIX Association, pp. 425–438.

[54] KAGANOV, A., LAKHANY, A., AND CHOW, P. FPGA acceleration of multifactor CDO pricing. ACM Trans. Reconfigurable Technol. Syst. 4, 2 (May 2011), 20:1–20:17.


[55] KALTE, H., AND PORRMANN, M. Context saving and restoring for multitasking in reconfigurable systems. In Field Programmable Logic and Applications, 2005. International Conference on (Aug 2005), pp. 223–228.

[56] KALTE, H., AND PORRMANN, M. Context saving and restoring for multitasking in reconfigurable systems. In Field Programmable Logic and Applications, 2005. International Conference on (2005), IEEE, pp. 223–228.

[57] KAPITZA, R., BEHL, J., CACHIN, C., DISTLER, T., KUHNLE, S., MOHAMMADI, S. V., SCHRODER-PREIKSCHAT, W., AND STENGEL, K. CheapBFT: Resource-efficient Byzantine fault tolerance. In Proceedings of the 7th ACM European Conference on Computer Systems (New York, NY, USA, 2012), EuroSys '12, ACM, pp. 295–308.

[58] KAPRE, N., AND GRAY, J. Hoplite: Building austere overlay NoCs for FPGAs. In FPL (2015), IEEE, pp. 1–8.

[59] KARA, K., AND ALONSO, G. Fast and robust hashing for database operators. In 26th International Conference on Field Programmable Logic and Applications, FPL 2016, Lausanne, Switzerland, August 29 - September 2, 2016 (2016), pp. 1–4.

[60] KHRONOS GROUP. The OpenCL Specification, Version 1.0, 2009.

[61] KIRCHGESSNER, R., GEORGE, A. D., AND STITT, G. Low-overhead FPGA middleware for application portability and productivity. ACM Trans. Reconfigurable Technol. Syst. 8, 4 (Sept. 2015), 21:1–21:22.

[62] KIRCHGESSNER, R., STITT, G., GEORGE, A., AND LAM, H. VirtualRC: A virtual FPGA platform for applications and tools portability. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (New York, NY, USA, 2012), FPGA '12, ACM, pp. 205–208.

[63] KNODEL, O., AND SPALLEK, R. G. RC3E: Provision and management of reconfigurable hardware accelerators in a cloud environment. CoRR abs/1508.06843 (2015).

[64] KOCH, D., BECKHOFF, C., AND LEMIEUX, G. G. F. An efficient FPGA overlay for portable custom instruction set extensions. In FPL (2013), IEEE, pp. 1–8.

[65] KOEPLINGER, D., DELIMITROU, C., PRABHAKAR, R., KOZYRAKIS, C., ZHANG, Y., AND OLUKOTUN, K. Automatic generation of efficient accelerators for reconfigurable hardware. In Proceedings of the 43rd International Symposium on Computer Architecture (Piscataway, NJ, USA, 2016), ISCA '16, IEEE Press, pp. 115–127.

[66] KOEPLINGER, D., DELIMITROU, C., PRABHAKAR, R., KOZYRAKIS, C., ZHANG, Y., AND OLUKOTUN, K. Automatic generation of efficient accelerators for reconfigurable hardware. In Proceedings of the 43rd International Symposium on Computer Architecture (Piscataway, NJ, USA, 2016), ISCA '16, IEEE Press, pp. 115–127.

[67] KOEPLINGER, D., FELDMAN, M., PRABHAKAR, R., ZHANG, Y., HADJIS, S., FISZEL, R., ZHAO, T., NARDI, L., PEDRAM, A., KOZYRAKIS, C., AND OLUKOTUN, K. Spatial: A language and compiler for application accelerators. In Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (New York, NY, USA, 2018), PLDI 2018, ACM, pp. 296–311.

[68] LEBAK, J., KEPNER, J., HOFFMANN, H., AND RUTLEDGE, E. Parallel VSIPL++: An open standard software library for high-performance parallel signal processing. Proceedings of the IEEE 93, 2 (2005), 313–330.

[69] LEBEDEV, I. A., FLETCHER, C. W., CHENG, S., MARTIN, J., DOUPNIK, A., BURKE, D., LIN, M., AND WAWRZYNEK, J. Exploring many-core design templates for FPGAs and ASICs. Int. J. Reconfig. Comp. 2012 (2012), 439141:1–439141:15.

[70] LEBER, C., GEIB, B., AND LITZ, H. High frequency trading acceleration using FPGAs. In Proceedings of the 2011 21st International Conference on Field Programmable Logic and Applications (Washington, DC, USA, 2011), FPL '11, IEEE Computer Society, pp. 317–322.

[71] LECUN, Y., BOTTOU, L., BENGIO, Y., AND HAFFNER, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 11 (1998), 2278–2324.

[72] LEE, T.-Y., HU, C.-C., LAI, L.-W., AND TSAI, C.-C. Hardware context-switch methodology for dynamically partially reconfigurable systems. J. Inf. Sci. Eng. 26 (2010), 1289–1305.


[73] LEVINSON, L., MANNER, R., SESSLER, M., AND SIMMLER, H. Preemptive multitasking on FPGAs. In Field-Programmable Custom Computing Machines, 2000 IEEE Symposium on (2000), pp. 301–302.

[74] LI, S., LIM, H., LEE, V. W., AHN, J. H., KALIA, A., KAMINSKY, M., ANDERSEN, D. G., SEONGIL, O., LEE, S., AND DUBEY, P. Architecting to achieve a billion requests per second throughput on a single key-value store server platform. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (New York, NY, USA, 2015), ISCA '15, ACM, pp. 476–488.

[75] LIN, E. C., AND RUTENBAR, R. A. A multi-FPGA 10x-real-time high-speed search engine for a 5000-word vocabulary speech recognizer. In Proceedings of the ACM/SIGDA International Symposium on Field Programmable Gate Arrays (2009), ACM, pp. 83–92.

[76] LIU, M., KUEHN, W., LU, Z., AND JANTSCH, A. Run-time partial reconfiguration speed investigation and architectural design space exploration. In Field Programmable Logic and Applications, 2009. FPL 2009. International Conference on (2009), IEEE, pp. 498–502.

[77] LUBBERS, E., AND PLATZNER, M. ReconOS: Multithreaded programming for reconfigurable computers. ACM Trans. Embed. Comput. Syst. 9, 1 (Oct. 2009), 8:1–8:33.

[78] LYSECKY, R., MILLER, K., VAHID, F., AND VISSERS, K. Firm-core virtual FPGA for just-in-time FPGA compilation (abstract only). In Proceedings of the 2005 ACM/SIGDA 13th International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2005), FPGA '05, ACM, pp. 271–271.

[79] MICROSOFT. Microsoft Azure goes back to rack servers with Project Olympus, 2017.

[80] MISHRA, M., CALLAHAN, T. J., CHELCEA, T., VENKATARAMANI, G., GOLDSTEIN, S. C., AND BUDIU, M. Tartan: Evaluating spatial computation for whole program execution. SIGOPS Oper. Syst. Rev. 40, 5 (Oct. 2006), 163–174.

[81] MOORE, N., CONTI, A., LEESER, M., CORDES, B., AND KING, L. S. An extensible framework for application portability between reconfigurable supercomputing architectures, 2007.

[82] MURALIDHARAN, S., O'BRIEN, K., AND LALANNE, C. A semi-automated tool flow for roofline analysis of OpenCL kernels on accelerators. In First International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC'15) (2015).

[83] NURVITADHI, E., VENKATESH, G., SIM, J., MARR, D., HUANG, R., HOCK, J. O. G., LIEW, Y. T., SRIVATSAN, K., MOSS, D., SUBHASCHANDRA, S., ET AL. Can FPGAs beat GPUs in accelerating next-generation deep neural networks? In FPGA (2017), pp. 5–14.

[84] OCTOPART. Octopart historical pricing, 2017.

[85] OGUNTEBI, T., AND OLUKOTUN, K. GraphOps: A dataflow library for graph analytics acceleration. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2016), FPGA '16, ACM, pp. 111–117.

[86] PECK, W., ANDERSON, E. K., AGRON, J., STEVENS, J., BAIJOT, F., AND ANDREWS, D. L. Hthreads: A computational model for reconfigurable devices. In FPL (2006), IEEE, pp. 1–4.

[87] PELLERIN, D. Accelerated Computing on AWS. http://asapconference.org/slides/amazon.pdf, July 2017. (Accessed on 5/2/2018).

[88] PHAM, K. D., JAIN, A. K., CUI, J., FAHMY, S. A., AND MASKELL, D. L. Microkernel hypervisor for a hybrid ARM-FPGA platform. In Application-Specific Systems, Architectures and Processors (ASAP), 2013 IEEE 24th International Conference on (June 2013), pp. 219–226.

[89] PLESSL, C., AND PLATZNER, M. Zippy: A coarse-grained reconfigurable array with support for hardware virtualization. In Application-Specific Systems, Architecture Processors, 2005. ASAP 2005. 16th IEEE International Conference on (2005), IEEE, pp. 213–218.

[90] PRABHAKAR, R., KOEPLINGER, D., BROWN, K. J., LEE, H., DE SA, C., KOZYRAKIS, C., AND OLUKOTUN, K. Generating configurable hardware from parallel patterns. SIGOPS Oper. Syst. Rev. 50, 2 (Mar. 2016), 651–665.


[91] PRABHAKAR, R., ZHANG, Y., KOEPLINGER, D., FELDMAN, M., ZHAO, T., HADJIS, S., PEDRAM, A., KOZYRAKIS, C., AND OLUKOTUN, K. Plasticine: A reconfigurable architecture for parallel patterns. In Proceedings of the 44th Annual International Symposium on Computer Architecture (New York, NY, USA, 2017), ISCA '17, ACM, pp. 389–402.

[92] PUTNAM, A., CAULFIELD, A., CHUNG, E., CHIOU, D., CONSTANTINIDES, K., DEMME, J., ESMAEILZADEH, H., FOWERS, J., GOPAL, G. P., GRAY, J., HASELMAN, M., HAUCK, S., HEIL, S., HORMATI, A., KIM, J.-Y., LANKA, S., LARUS, J., PETERSON, E., POPE, S., SMITH, A., THONG, J., XIAO, P. Y., AND BURGER, D. A reconfigurable fabric for accelerating large-scale datacenter services. In 41st Annual International Symposium on Computer Architecture (ISCA) (June 2014).

[93] RUPNOW, K., FU, W., AND COMPTON, K. Block, drop or roll(back): Alternative preemption methods for RH multi-tasking. In FCCM 2009, 17th IEEE Symposium on Field Programmable Custom Computing Machines, Napa, California, USA, 5-7 April 2009, Proceedings (2009), pp. 63–70.

[94] SHAFIEE, A., GUNDU, A., SHEVGOOR, M., BALASUBRAMONIAN, R., AND TIWARI, M. Avoiding information leakage in the memory controller with fixed service policies. In Proceedings of the 48th International Symposium on Microarchitecture (New York, NY, USA, 2015), MICRO-48, ACM, pp. 89–101.

[95] SHAN, Y., WANG, B., YAN, J., WANG, Y., XU, N.-Y., AND YANG, H. FPMR: MapReduce framework on FPGA. In FPGA (2010), P. Y. K. Cheung and J. Wawrzynek, Eds., ACM, pp. 93–102.

[96] SHARMA, H., PARK, J., AMARO, E., THWAITES, B., KOTHA, P., GUPTA, A., KIM, J. K., MISHRA, A., AND ESMAEILZADEH, H. DNNWeaver: From high-level deep network models to FPGA acceleration. In the Workshop on Cognitive Architectures (2016).

[97] SIDLER, D., ISTVAN, Z., OWAIDA, M., AND ALONSO, G. Accelerating pattern matching queries in hybrid CPU-FPGA architectures. In Proceedings of the 2017 ACM International Conference on Management of Data (2017), ACM, pp. 403–415.

[98] SIMMLER, H., LEVINSON, L., AND MANNER, R. Multitasking on FPGA coprocessors. Field-Programmable Logic and Applications: The Roadmap to Reconfigurable Computing (2000), 121–130.

[99] SO, H. K.-H., AND BRODERSEN, R. A unified hardware/software runtime environment for FPGA-based reconfigurable computers using BORPH. ACM Trans. Embed. Comput. Syst. 7, 2 (Jan. 2008), 14:1–14:28.

[100] SO, H. K.-H., AND BRODERSEN, R. W. BORPH: An Operating System for FPGA-Based Reconfigurable Computers. PhD thesis, EECS Department, University of California, Berkeley, Jul 2007.

[101] SO, H. K.-H., AND WAWRZYNEK, J. OLAF'16: Second international workshop on overlay architectures for FPGAs. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2016), FPGA '16, ACM, pp. 1–1.

[102] STEIGER, C., WALDER, H., AND PLATZNER, M. Operating systems for reconfigurable embedded platforms: Online scheduling of real-time tasks. IEEE Transactions on Computers 53, 11 (Nov 2004), 1393–1407.

[103] STITT, G., AND COOLE, J. Intermediate fabrics: Virtual architectures for near-instant FPGA compilation. IEEE Embedded Systems Letters 3, 3 (Sept 2011), 81–84.

[104] SUDA, N., CHANDRA, V., DASIKA, G., MOHANTY, A., MA, Y., VRUDHULA, S., SEO, J.-S., AND CAO, Y. Throughput-optimized OpenCL-based FPGA accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2016), FPGA '16, ACM, pp. 16–25.

[105] TAYLOR, M. B. Bitcoin and the age of bespoke silicon. In Proceedings of the 2013 International Conference on Compilers, Architectures and Synthesis for Embedded Systems (2013), IEEE Press, p. 16.

[106] THOMAS, D., AND MOORBY, P. The Verilog® Hardware Description Language. Springer Science & Business Media, 2008.


[107] TSUTSUI, A., MIYAZAKI, T., YAMADA, K., ANDOHTA, N. Special purpose fpga for high-speeddigital telecommunication systems. In Proceed-ings of the 1995 International Conference on Com-puter Design: VLSI in Computers and Processors(Washington, DC, USA, 1995), ICCD ’95, IEEEComputer Society, pp. 486–491.

[108] WASSI, G., BENKHELIFA, M. E. A., LAWDAY, G., VERDIER, F., AND GARCIA, S. Multi-shape tasks scheduling for online multitasking on FPGAs. In Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), 2014 9th International Symposium on (May 2014), pp. 1–7.

[109] WATKINS, M. A., AND ALBONESI, D. H. ReMAP: A reconfigurable heterogeneous multicore architecture. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (Washington, DC, USA, 2010), MICRO '43, IEEE Computer Society, pp. 497–508.

[110] WEERASINGHE, J., ABEL, F., HAGLEITNER, C., AND HERKERSDORF, A. Enabling FPGAs in hyperscale data centers. In 2015 IEEE 12th Intl Conf on Ubiquitous Intelligence and Computing and 2015 IEEE 12th Intl Conf on Autonomic and Trusted Computing and 2015 IEEE 15th Intl Conf on Scalable Computing and Communications and Its Associated Workshops (UIC-ATC-ScalCom), Beijing, China, August 10-14, 2015 (2015), pp. 1078–1086.

[111] WERNER, S., OEY, O., GOHRINGER, D., HUBNER, M., AND BECKER, J. Virtualized on-chip distributed computing for heterogeneous reconfigurable multi-core systems. In Proceedings of the Conference on Design, Automation and Test in Europe (San Jose, CA, USA, 2012), DATE '12, EDA Consortium, pp. 280–283.

[112] WEST, B., CHAMBERLAIN, R. D., INDECK, R. S., AND ZHANG, Q. An FPGA-based search engine for unstructured database. In Proc. of 2nd Workshop on Application Specific Processors (2003), vol. 12, pp. 25–32.

[113] WIERSEMA, T., BOCKHORN, A., AND PLATZNER, M. Embedding FPGA overlays into configurable systems-on-chip: ReconOS meets ZUMA. In ReConFig (2014), IEEE, pp. 1–6.

[114] WINTERSTEIN, F., FLEMING, K., YANG, H.-J., BAYLISS, S., AND CONSTANTINIDES, G. MATCHUP: Memory abstractions for heap manipulating programs. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2015), FPGA '15, ACM, pp. 136–145.

[115] ZHANG, C., LI, P., SUN, G., GUAN, Y., XIAO, B., AND CONG, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (New York, NY, USA, 2015), FPGA '15, ACM, pp. 161–170.

[116] ZHANG, C., LI, P., SUN, G., GUAN, Y., XIAO, B., AND CONG, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (2015), ACM, pp. 161–170.

[117] ZHAO, R., SONG, W., ZHANG, W., XING, T., LIN, J.-H., SRIVASTAVA, M. B., GUPTA, R., AND ZHANG, Z. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In FPGA (2017), pp. 15–24.