Memory and Network Interface Virtualization for Multi-Tenant Reconfigurable Compute Devices
by
Daniel Rozhko
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical & Computer Engineering, University of Toronto
© Copyright 2018 by Daniel Rozhko
Abstract
Memory and Network Interface Virtualization for
Multi-Tenant Reconfigurable Compute Devices
Daniel Rozhko
Master of Applied Science
The Edward S. Rogers Sr. Department of Electrical & Computer Engineering
University of Toronto
2018
Field Programmable Gate Arrays (FPGAs) have increasingly been deployed in datacenters. These
devices have proven effective at performing certain compute tasks better (faster, with lower latency, higher
throughput, and/or greater efficiency) than traditional compute devices. In this work, we explore the
application of the virtualization concept to multi-tenant FPGAs, with a specific focus on enabling
multiple applications to securely share an FPGA and preventing snooping and/or errant behaviour.
Conceptually, the traditional shell in FPGA design is extended into hard shell and soft shell
components. The hard shell focuses on the logical isolation of hardware applications on the same FPGA.
For memory and networking interfaces on the FPGA, this work introduces hardware components aimed
at enabling this isolation. In particular, the design of a new component that is termed the Network
Management Unit (NMU) is presented in detail. The NMU enables the isolation of network traffic.
Acknowledgements
The work presented here would not be possible without the generous and always insightful support of my
advisor Professor Paul Chow. His wealth of knowledge and vast experience, with both the technologies
presented herein and indeed academic endeavor in general, aided the development and execution of this
research greatly. Thank you for your invaluable advice and support.
In addition, I owe thanks to the many peers and colleagues who provided support to me in the course
of this Masters program. First, to my fellow members of Professor Paul Chow’s research group; Dan
Ly-Ma, Roberto DiCecco, Naif Tarafdar, Eric Fukuda, Justin Tai, Charles Lo, Fernando Martin Del
Campo, Vincent Mirian, Jasmina Capalija Ex Vasiljevic, Nariman Eskandari, and Varun Sharma; your
help, advice, and most notably your willingness to listen, was appreciated and I thank you for it.
Next, to the colleagues with whom I shared an office, and indeed a significant portion of my time;
Xander Chin, Karthik Ganesan, Mario Badr, Jin Hee Kim, Joy Chen, Julie Hsaio, Shehab Elsayed, Josh
San Miguel, Jose Antonio Munoz Cepillo, Patrick Judd, and again the colleagues listed above; your
help, technical and otherwise, was a boon to the development of this research and my life as a graduate
student. Thank you. I count you all amongst my friends.
Contents
Acknowledgments iii
Table of Contents iv
List of Tables viii
List of Figures ix
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background 4
2.1 Reconfigurable Compute Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Field Programmable Gate Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Coarse-Grained Reconfigurable Architectures . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 Computer Aided Design Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 Desktop Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Containerization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Operating Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 FPGAs in the Cloud and FPGA Virtualization . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Hardware OS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3 Virtualization Model 16
3.1 Deployment Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.1 Physical Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 Service Provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.1.3 Deployment Model Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2 Required Functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.1 Performance Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.2.2 Data Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.2.3 Domain Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.4 Channels Targeted for Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.5 Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.6 Interface Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Virtualization Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Shell Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 Xilinx Specific Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4 Memory Interfaces 31
4.1 AXI4 Protocol Verification and Decoupling . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1.1 The AXI4 Protocol Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.2 AXI4 Master Protocol Verification Requirements . . . . . . . . . . . . . . . . . . . 34
4.1.3 Memory Transaction Decoupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.4 Memory Protocol Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Performance Isolation for AXI4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Traditional Credit-Based Rate Throttling . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.2 AXI4-Specific Credit Mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.3 Bandwidth Conserving System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.4 Limitations for SDRAM Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.2.5 Bandwidth Limiting Performance Evaluation . . . . . . . . . . . . . . . . . . . . . 52
4.3 Memory Management Unit Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.3.1 Base-and-Bounds MMU Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.3.2 Coarse-Grained Paged MMU Design . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.4 Memory Virtualizing Shell Overhead Evaluation . . . . . . . . . . . . . . . . . . . . . . . 56
4.4.1 Building up the Secure Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4.2 Latency Impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
4.4.3 Paged-NMU Size Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5 Multi-Channel Memory Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.1 Separately Managed Channels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.2 Single Shared MMU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.5.3 Parallel MMUs with a Single Port . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.5.4 Parallel MMUs with Multiple Ports . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.5.5 Multi-Memory Channel Implementations in Previous Works . . . . . . . . . . . . . 66
5 Network Interfaces 68
5.1 Network Interface Performance Isolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.1 AXI-Stream Decoupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.1.2 AXI-Stream Protocol Verification . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1.3 Network Interface Bandwidth Throttling . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Network Security Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Software Analogues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.2.2 OpenFlow Switching Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 The Network Management Unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.1 Access Control Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.3.2 Internal Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.3.3 VLAN Networking Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.4 Layer of Network Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.3.5 NMU Nomenclature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Network Management Unit Hardware Design . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4.1 Reusable Sub-Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.2 Destination Rules Enforced . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.3 Universal NMU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4.4 Limited Functionality NMUs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5 Network Virtualizing Shell Overhead Evaluation . . . . . . . . . . . . . . . . . . . . . . . 89
5.5.1 Shell Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5.2 NMU Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.6 Multi-Channel Networking Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.6.1 Separately Managed . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
5.6.2 One General and One Exclusive Connection . . . . . . . . . . . . . . . . . . . . . . 96
5.6.3 Individual Connection Per Application . . . . . . . . . . . . . . . . . . . . . . . . . 96
6 Conclusion 98
6.1 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.2 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.1 Further Shell Explorations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.3.2 Additional Security Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.3.3 Hardening Shell Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Bibliography 104
A AXI4 Protocol Assertions 111
List of Tables
4.1 Bandwidth Throttling Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2 Shared Memory Secured Shell Utilization . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3 Shared Memory Secured Shell Utilization (Percentage) . . . . . . . . . . . . . . . . . . . . 58
4.4 Latency Increase Per AXI Channel for Shared Memory Secured Shell . . . . . . . . . . . . 59
4.5 Shell Utilization as a Function of Page Size in MMU . . . . . . . . . . . . . . . . . . . . . 60
5.1 NMU Nomenclature Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.2 Shared Network Connectivity Secured Shell Overhead . . . . . . . . . . . . . . . . . . . . 89
5.3 Shared Network Connectivity Secured Shell Overhead (Percentage) . . . . . . . . . . . . . 89
5.4 NMU Area and Latency Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
A.1 AXI4 Protocol Write Address Channel Assertion . . . . . . . . . . . . . . . . . . . . . . . 112
A.2 AXI4 Protocol Write Data Channel Assertions . . . . . . . . . . . . . . . . . . . . . . . . 113
A.3 AXI4 Protocol Write Response Channel Assertions . . . . . . . . . . . . . . . . . . . . . . 114
A.4 AXI4 Protocol Read Address Channel Assertions . . . . . . . . . . . . . . . . . . . . . . . 115
A.5 AXI4 Protocol Read Data Channel Assertions . . . . . . . . . . . . . . . . . . . . . . . . . 116
A.6 AXI4 Protocol Exclusive Access Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . 117
List of Figures
2.1 FPGA CAD Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Types of Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Memory Address Translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Microsoft Catapult v2 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.1 Soft Shell Inspired Virtualization Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Soft Shell Including Management Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1 Adding Decoupling for Memory to the Shell . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2 AXI4 Read Channel Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.3 AXI4 Write Channel Interfaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.4 Xilinx AXI Decoupler Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.5 AXI4 Memory Decoupler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.6 AXI4 Memory Protocol Verifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.7 Adding Bandwidth Throttling for Memory to the Shell . . . . . . . . . . . . . . . . . . . . 42
4.8 AXI4 Memory Bandwidth Throttler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.9 AXI4 Memory Utilization Monitor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.10 Adding MMU to the Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.11 Base and Bounds MMU Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.12 On-Chip Coarse Grained MMU Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.13 Multi-Channel Organization with Separately Managed Channels . . . . . . . . . . . . . . 61
4.14 Multi-Channel Organization with Single Shared MMU . . . . . . . . . . . . . . . . . . . . 61
4.15 Multi-Channel Organization with Parallel Shared MMUs . . . . . . . . . . . . . . . . . . . 63
4.16 Multi-Channel Organization with Parallel Shared MMUs (modified) . . . . . . . . . . . . 65
4.17 Multi-Channel Organization with Parallel NMUs and Multiple Ports . . . . . . . . . . . . 66
5.1 Adding Performance Isolation for Networking to the Shell . . . . . . . . . . . . . . . . . . 69
5.2 Network Interface Decoupler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Network Interface Protocol Verifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Network Interface Bandwidth Throttler . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Example Implementation of an OpenFlow Capable Switch . . . . . . . . . . . . . . . . . . 76
5.6 Adding NMU to the Shell . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.7 Packet Parser Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.8 Tagger & Encapsulation Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.9 De-Tagger & De-Encapsulation Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.10 Universal NMU System Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.11 NMU Varieties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.12 Multi-Application Test Setup for Networking . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.13 Universal NMU utilization vs Number of Logical Connections . . . . . . . . . . . . . . . . 94
5.14 Multiple Network Interfaces Managed Separately . . . . . . . . . . . . . . . . . . . . . . . 95
5.15 Multiple Network Interfaces With an Exclusive Connection . . . . . . . . . . . . . . . . . 96
Chapter 1
Introduction
The use of reconfigurable computing devices in mainstream datacentre applications is growing, with
more companies and institutions using such devices to accelerate compute workloads. Specific examples
of particular note include the work done to integrate Field Programmable Gate Array (FPGA) devices
into the Microsoft Bing search engine [1] [2] [3], the introduction of FPGA devices in Amazon’s AWS
cloud offering [4], and the work done at IBM Research to integrate FPGA devices into the cloud [5].
Many academic works have also explored the deployment of FPGA devices in datacentre environments;
clearly, the datacentre deployment of FPGA devices is an emerging and popular way to expand the
compute capabilities of datacentres.
For CPU-based compute nodes in datacentres, it is common to use virtualization technologies to
enable multi-tenant use, i.e. multiple resident virtual compute nodes on a single physical system. This
increases the efficiency of the datacentres by increasing the effective utilization of those compute nodes.
In addition, this enables cloud-based service models as a single physical system can be shared by the
multiple customers of a cloud provider. This thesis explores the tradeoffs in designing analogous virtual-
ization technologies for reconfigurable compute devices, particularly the shared memory and networking
interfaces of FPGA devices.
1.1 Motivation
Reconfigurable compute devices, and in particular FPGA devices, have the potential to accelerate many
compute operations. In fact, FPGA devices have been shown to accelerate encryption [6], compression [7],
packet processing [8], and even machine learning applications such as neural networks [9]. This has no
doubt motivated their continued deployment in datacentre applications.
Microsoft has successfully demonstrated their Catapult platform (both v1 [1] and v2 [2] versions),
which utilizes FPGAs to accelerate Bing search. Microsoft has since expanded their deployment of
FPGAs to their Azure offering [10], using FPGAs to compress network traffic for their cloud customers.
Amazon has deployed FPGA devices to their own cloud offering, AWS [4]. A single AWS instance can
be created with up to eight FPGA devices, completely programmable by the cloud users. The interest
in using FPGAs in the datacentre will likely only increase in the future, making the efficient design of
the platform that enables their deployment of key importance.
The benefits of virtualization, and thus the motivation of exploring virtualization technologies and
potential benefits for FPGAs, are well established. Virtualization as a technology dates back to the
1960s [11], and its use in datacentres today is widespread. Simply put, virtualization allows for what
would have been multiple physical servers to run co-located as virtual servers on a single physical server
node. As long as the single physical server has enough resources to run the workloads of all of the
virtual servers along with the overhead of the virtualization itself, the virtualized deployment saves
on the need for more physical server nodes. The same argument motivates virtualization for FPGA
devices: if multiple FPGA bound applications can be co-located on a single physical FPGA, and that
physical FPGA has enough resources (i.e. area and interface bandwidth) to accommodate the original
applications and the overhead of the virtualization platform itself, the FPGA deployment can be made
with fewer total compute nodes.
In this thesis we specifically address the implementation of virtualization from the perspective of
performance, data, and network domain isolation, while also considering that FPGAs have a limited
amount of area. We present the implementation of various components to achieve these domains of
isolation for an FPGA targeting a multi-tenant deployment.
1.2 Contributions
The contribution of this thesis is a detailed analysis and exploration of the design tradeoffs involved
in virtualizing FPGA devices, and the design of a virtualization platform that considers these tradeoffs. The
major components of the contribution are as follows:
• The introduction and formalization of the concepts of a “hard shell” and “soft shell”, easing the
design and analysis of virtualization technologies targeting reconfigurable compute devices
• A functioning HDL implementation of memory virtualization hardware cores, targeting single and
multi-channel memory platforms
• A functioning HDL implementation of network virtualization hardware cores, exploring multiple
deployment scenarios depending on the network infrastructure
• An analysis of the area overheads incurred by the virtualization technologies considering the design
tradeoffs discussed
1.3 Overview
The remainder of this thesis is organized as follows: Chapter 2 will provide background information
and will review previous work on the virtualization of reconfigurable compute devices. Chapter 3 will
provide an analysis of various deployment models for FPGAs in the datacentre and introduce concepts
used in the design of virtualization technologies for reconfigurable compute devices. Chapter 4 and
Chapter 5 will present the design and analysis of memory interface virtualization and network interface
virtualization, respectively, considering the analyses of Chapter 3. Finally, Chapter 6 will examine
avenues for future work and conclude the thesis.
Chapter 2
Background
This chapter introduces some concepts, technologies, and definitions used throughout the thesis. It also
examines some related work, providing context for the thesis.
2.1 Reconfigurable Compute Devices
Reconfigurable computing describes a general class of devices in which the primary circuit elements can
be reconfigured into a user-defined hardware configuration; the array of these reconfigurable elements
is often referred to as a reconfigurable fabric. The device as a whole performs some computation, often
interfacing with some external network or memory. This section describes such devices, particularly the
FPGA, which is the focus of the implementations presented in Chapters 4 and 5, as well as technologies
that enable the use of reconfigurable computing devices.
2.1.1 Field Programmable Gate Arrays
The most common reconfigurable fabric used in reconfigurable compute applications is the FPGA. An
FPGA is an integrated circuit designed such that its functionality can be reprogrammed after its manu-
facture, according to some specification of the user. Users specify some hardware configuration using a
Hardware Description Language (HDL), a class of programming languages that can be used to describe
digital hardware. The user’s HDL application can be synthesized into a bitstream (used to program the
FPGA) using a set of Computer Aided Design (CAD) tools. For example, Xilinx FPGA devices can be
targeted for synthesis using the Vivado CAD tool set [12]. The bitstream generated describes the desired
configuration of the FPGA’s various building blocks, which can be configured to implement most digital
circuit designs.
The main building block of an FPGA is the Look-up Table (LUT), which can implement a four-,
five-, or six-input combinational logic function. The LUTs are grouped with flip flops inside logic blocks,
which are connected to a series of switches and connection blocks implementing a programmable routing
fabric. In addition to the LUTs, modern FPGAs include dedicated Digital Signal Processing (DSP)
blocks to perform multiplication, Block Random Access Memory (BRAM) for local memory storage,
and often even PCIe [13] or Network controllers [14]. Note that hardware components implemented as
dedicated silicon on an FPGA chip, rather than through a configuration and interconnection of LUTs
to implement the logic, are often referred to as “hardened”, in contrast to digital circuits synthesized
on the programmable fabric, which are referred to as “soft”; the PCIe and Network controllers included
on FPGA chips are examples of such hardened components. See [15] for more information on FPGA
architecture.
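As a concrete illustration of the flow described above, the short Verilog sketch below (the module name and logic function are arbitrary examples, not taken from this thesis) shows the kind of HDL description a user might write: a four-input combinational function that a synthesis tool would map onto a LUT, and a registered output that would map onto a flip-flop within a logic block.

```verilog
// Illustrative example only: a small user-defined circuit of the kind a
// CAD tool would synthesize onto the FPGA fabric. The module name and the
// particular logic function are arbitrary, chosen purely for illustration.
module quorum_vote (
    input  wire       clk,
    input  wire [3:0] votes,      // four inputs to a combinational function
    output reg        quorum_met  // registered result
);
    // Asserted when at least three of the four inputs are high; a four-input
    // combinational function like this fits in a single LUT.
    wire quorum = (votes[0] & votes[1] & votes[2]) | (votes[0] & votes[1] & votes[3]) |
                  (votes[0] & votes[2] & votes[3]) | (votes[1] & votes[2] & votes[3]);

    // The registered output maps onto a flip-flop grouped with the LUT
    // inside a logic block.
    always @(posedge clk)
        quorum_met <= quorum;
endmodule
```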
2.1.2 Coarse-Grained Reconfigurable Architectures
Another relatively new reconfigurable compute device is the Coarse Grained Reconfigurable Architecture
(CGRA) device, which as the name suggests, includes much coarser building blocks than the LUTs of the
FPGA. These blocks can be configured to perform some larger arithmetic functions, such as additions
or shifts, and are often connected using some routing fabric [16]. As there are no general purpose LUTs
to implement arbitrary hardware, these devices are often programmed using a different method (i.e., in
contrast to the HDLs used for FPGAs). Also, any communication to the external world (e.g., memory
controllers or network controllers) would need to be implemented as hard elements. The content of this
thesis focuses on the FPGA, though similar memory and network interface designs as presented for the
FPGA class of devices could be implemented as hard elements if virtualization were to be considered for
CGRAs.
2.1.3 Computer Aided Design Tools
As mentioned in Section 2.1.1, CAD tools are used to take a user’s description of the hardware they
intend to implement on the device and create a description of how the elements of the device should be
organized and configured to achieve the specified implementation. This final configuration descriptor file
is generally referred to as a bitstream, in reference to the fact that the descriptor represents the states to
be programmed to the bits of Static Random Access Memory (SRAM) cells that drive the configurable
portions of the LUTs and routing fabric. Figure 2.1 shows the steps in a typical FPGA design flow;
further details of these design flow steps, and their particular functions, can be found in [17].
Figure 2.1: A Typical FPGA CAD Design Flow, adapted from [17]
For FPGA devices, the user’s description of the hardware to be implemented is often in the form
of HDL, but other descriptions are available. High Level Synthesis (HLS) has increased in popularity
in recent years because complex digital hardware circuits can be described using simpler
software-based programming languages, such as C, C++, or OpenCL [18]. Some examples of HLS-based
CAD tools include LegUp [19], Xilinx’s Vivado HLS [20], and Intel’s FPGA SDK for OpenCL [21].
HLS lowers the barrier to adopting reconfigurable computing platforms, thereby increasing the viability
(financial or otherwise) of datacentre deployments of FPGAs. This motivates further research into such
datacentre deployments and this virtualization work more specifically.
One CAD-based innovation for FPGA devices that is particularly important in enabling datacentre
deployments of FPGAs is Partial Reconfiguration (PR). PR allows for the FPGA device to be partitioned
and for these partitions to be reconfigured independently, such that one portion of the FPGA can be
actively running some circuit while another portion is reconfigured without its operation being stalled
or affected. These techniques are described in works by both major FPGA vendors [22] [23]. In PR-
based FPGA CAD design flows, the portion of the FPGA which is not reconfigured after the initial
configuration is termed the static region, while the portions of the FPGA that are reconfigured live are
called the dynamic or PR regions (note, an FPGA can typically have multiple PR regions). These PR-
based FPGA CAD design flows generate PR bitstreams that can be programmed through the traditional
Joint Test Action Group (JTAG) boundary scan methods, or often using internal connections driven by
the FPGA fabric directly, such as the Xilinx-based ICAP connection [24].
2.2 Virtualization
Virtualization is a widely used term in many sub-fields of computer architecture, computer science, and
digital hardware. In regards to the term virtualization, the focus of this thesis is desktop virtualization
(often also termed server virtualization), which is the virtualization of a server compute node. This
section describes this type of virtualization in more detail.
2.2.1 Desktop Virtualization
Desktop virtualization is the set of technologies used to enable the deployment of multiple virtual servers
on a single physical server. In other terms, virtualization enables a single physical server to be seen
by its multiple tenant virtual servers as multiple unique and wholly independent hardware instances.
It was originally envisioned by IBM to partition their mainframe computers and allow multiple virtual
workloads to run on a single physical mainframe [11]. The main benefit of virtualization was the increase
in the efficiency of the mainframe computers, since single workloads would not use 100% of the physical
server’s resources at all times.
Essentially, virtualization software seeks to emulate multiple instances of the physical server, such that
each tenant can run on the emulated physical server without modification. Virtualization software is
commonly referred to as a Virtual Machine Monitor (VMM), or as a hypervisor. Without virtualization,
modern servers already support context switching between independent processes. Processes cannot
access the environment of other processes without sufficiently elevated privileges. For the VMM to
emulate multiple physical servers, it need only intercept these privileged calls from the virtual systems,
most often termed Virtual Machines (VMs), and ensure they only access parts of the system assigned to
them. For example, memory is allocated on a page basis to the VM and a translation action is needed
every time a VM attempts to access memory. Similarly, I/O devices are either emulated, or assigned
wholly to a single VM and attempts to query the system about the I/O devices are intercepted by the
VMM. Only emulated and physical I/O devices assigned to a VM will be discoverable.
Multiple different types of virtualization software are available today. Two main categories of VMM
software are Type I and Type II virtualization [25]. For Type I, the VMM software runs directly on the
physical server itself, becoming the main operating system of the physical machine. For Type II, the VMM
software runs atop a traditional operating system, and the guest VMs run on the presented virtualization
layer. In addition, paravirtualized VMMs of both types exist, which decrease the overhead
of virtualization by allowing the VMs to install drivers that are virtualization-aware, bypassing some of
the overheads associated with virtualization [11]. Type I non-paravirtualized VMMs are the main focus
of this work, as it is not evident that there is an analogue to Type II software or paravirtualization for
FPGA devices.
Two key goals to consider for virtualization solutions are data isolation and performance isolation.
Data isolation ensures that the data of one VM cannot be accessed, modified, or otherwise interfered with
by other VMs on the same VMM. Performance isolation refers to the idea that the performance of a
VM should not be impacted by the transient activity of other VMs running on the same VMM. While
processor time scheduled to a VM and memory allocated to a VM can be strictly controlled, the memory
access patterns and cache usage patterns of other VMs can affect the performance of a VM [26].
2.2.2 Containerization
Containerization is a virtualization technology that aims to reduce the overhead imposed on a physical
server running a traditional VMM. In a containerized environment, each virtualized server (termed
containers in the containerization context rather than VMs) shares an operating system kernel, but has
its own execution environment and middleware setup [27]. For example, Linux-based containers can be
created using control groups (cgroups) and Linux namespaces.
By sharing a kernel, and thereby a single application scheduling environment and memory allocation
scheme, overhead is reduced and resources can be more effectively distributed. The system’s process
scheduler is fully aware of not only the VMs running on the system, but all the processes of the VMs.
The system’s memory allocation scheme is similarly fully aware of the memory requirements of each
process, and can allocate memory more efficiently. For traditional virtualization solutions, there would
be two layers of process scheduling and memory allocation, first at the VMM level and then again at the
VM’s guest operating system level. Figure 2.2 shows a visual comparison of the virtualization techniques
discussed.
2.2.3 Operating Systems
Operating Systems are not generally considered virtualization technologies, though previous work on
FPGAs creating both Hardware Hypervisors and Hardware Operating Systems are very similar in that
they present an abstracted environment for multiple hardware tasks to run on a single FPGA. From
Figure 2.2: (a) Type I virtualization, (b) Type II virtualization, (c) Containerization
this hardware analogue, it is also easy to see how traditional software operating systems are similar
to VMMs. While VMMs allow for multiple guest operating systems to run in environments that seem
completely independent, operating systems allow for multiple user applications to run in environments
that seem completely independent. This is mainly accomplished through context switching and virtual
memory, i.e., memory accesses from applications are translated before being serviced.
Memory virtualization works by dividing the memory region into pages of some pre-determined size
and then allocating the pages to the applications as they are needed. Each application sees a zero-based
address for its memory space, and accesses to the virtual memory space are intercepted and handled by a
translation mechanism at the operating system level. The least significant bits of the virtual address,
those that index memory within a page, remain unchanged, while the most significant bits are remapped
to the actual physical memory location assigned to that application. Page mappings are stored in a
map table in memory and cached in a structure known as a Translation Lookaside Buffer (TLB) [28].
Figure 2.3 depicts the memory translation scheme. Note, VMM environments must have two levels of
translation: at the guest operating system level, and then again at the VMM level.
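As a concrete (and deliberately simplified) sketch of the single-level translation described above, the Verilog fragment below splits a 32-bit virtual address into a within-page offset that passes through unchanged and a virtual page number that is remapped through a lookup table. It is illustrative only: it is not the MMU design presented later in this thesis, and TLBs, permissions, fault handling, and the second level of translation used under a VMM are all omitted.

```verilog
// Simplified, single-level page translation sketch (illustrative only; not
// the MMU of Chapter 4). The least significant OFFSET_BITS index within a
// page and pass through unchanged; the upper bits are remapped via a table.
module simple_page_translate #(
    parameter OFFSET_BITS = 12                 // 4 KB pages, as in Figure 2.3
) (
    input  wire [31:0] virt_addr,
    output wire [31:0] phys_addr
);
    localparam VPN_BITS = 32 - OFFSET_BITS;

    // Page-mapping table, filled in by the operating system or management
    // logic. A flat 2^20-entry table is costly to hold on chip, which is one
    // reason coarser page granularities can be attractive for an on-chip MMU.
    reg [VPN_BITS-1:0] page_table [0:(1 << VPN_BITS) - 1];

    wire [VPN_BITS-1:0]    vpn    = virt_addr[31:OFFSET_BITS];   // remapped bits
    wire [OFFSET_BITS-1:0] offset = virt_addr[OFFSET_BITS-1:0];  // unchanged bits

    assign phys_addr = {page_table[vpn], offset};
endmodule
```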
2.3 FPGAs in the Cloud and FPGA Virtualization
Considerable effort has gone into investigating the deployment of FPGA devices in datacentres and cloud
environments. This is an important area of consideration for this work since virtualization technologies
are often used in cloud settings, and most of these FPGA cloud and datacentre deployment works include
some version of virtualization.
Figure 2.3: Virtual to physical memory address translation. (a) Translation in a standard operating environment, (b) Translation in a virtualized environment
2.3.1 Related Work
This related work is important to establish the context in which this thesis is presented.
Microsoft Catapult
Microsoft introduced FPGAs into their Bing Search datacentres to accelerate their search algorithms
using specialized hardware implementations of those algorithms. The original implementation, dubbed
Catapult v1, was published in 2014 [1]. The Catapult implementation included FPGAs installed as
PCIe add-on cards within Processor-based servers. These FPGAs are controlled and receive their data
from the Processor system, essentially set up in a master-slave configuration. Multiple FPGAs are
connected together using a dedicated interconnection network, configured in a torus arrangement. This
allows for the FPGAs to communicate with each other and enables multi-FPGA applications. The work
characterizes the hardware application in their platform as the “Role” and the surrounding abstraction
layer as the “Shell”. While this shell does not enable sharing of the FPGA between multiple tenant
applications, it does provide an abstraction layer and might be considered analogous to a Hardware
Operating System [1].
This is a good place to discuss the nomenclature used to describe the various components of FPGA
platforms. What the Microsoft authors (Putnam et al.) term the Role is often called the “Hardware
Application”, “Kernel”, “PR Region” (for implementations using PR), “vFPGA” (i.e. virtual FPGA),
and many other names. For simplicity, the term Hardware Application will be used in this thesis except
when specifically referring to the nomenclature used by other works. What the Microsoft authors term
the shell is often called the “Hardware Hypervisor”, “Hardware Operating System”, “Static Region”
(for implementations using PR), “Service Logic”, and many other names. Because of the popularity of
Figure 2.4: Microsoft Catapult v2 bump-in-the-wire configuration, adapted from [3]
the Catapult work, the term Shell will be used in this thesis except when specifically referring to the
nomenclature used by other works.
The second version of the Catapult platform changed the deployment model. The FPGAs are still
connected as add-on cards to a dedicated Processor-based server, but the dedicated interconnection
network is augmented with a direct connection to the datacentre Ethernet network. In addition, the
network connection from the Processor-based server is connected directly to the FPGA, presenting a
bump-in-the-wire configuration, as shown in Figure 2.4. This enables applications that can act
directly on the incoming and outgoing network traffic of the traditional server, such as encryption and
compression. In addition, since the FPGAs are connected directly to the network, applications can be
built that require no intervention from the Processor-based server. The FPGA can receive data and
instructions from the network and push results back out to the network itself. Data received and sent
by the Hardware Application can also be configured to use the introduced Lightweight Transport Layer
(LTL) protocol [3], virtualizing network access by encapsulating sent data in a Layer 4 network packet.
This version of Catapult also considered the deployment of multiple Hardware Applications (i.e.,
Roles) to a single FPGA, through the use of what is termed an Elastic Router (an on-chip arbiter and
router that routes traffic to different roles depending on the packet’s incoming Media Access Control
(MAC) address). In this way, Catapult v2 does implement virtualization. Catapult v2 was briefly
introduced by Chiou in 2016 [2], and later elaborated by Caulfield et al. later that year [3].
Microsoft has also recently made these Catapult v2 FPGA devices available to Azure users, though
not directly: users cannot program the FPGAs themselves. Instead, the FPGAs are used automatically
to offload some simple virtual networking tasks from the Processor and compress traffic [10]. There is
no indication from Microsoft whether the FPGAs themselves are virtualized in this environment.
Amazon AWS F1 Instances
Amazon’s deployment of FPGAs in their AWS cloud offering differs from Microsoft’s in that Amazon
has made the FPGA devices available to program to its AWS customers. An AWS F1 instance (as the
FPGA-containing VM instances are called) with up to eight FPGA cards can be provisioned. These
FPGA cards are connected to a Processor-based server and a dedicated FPGA-only interconnection
network, similar to the Catapult v1 work. The notable difference is that all eight FPGAs are connected
to a single physical server [4]. More information about the deployment can be obtained from the
Hardware and Software development kits provided on their Github platform [29]. Amazon refers to a
Hardware Application as CL (Custom Logic) and also uses the Shell terminology. From the hardware
development kit, we note that there is no sense of abstraction or multi-tenancy in Amazon’s shell, and
thus no consideration of virtualization.
Amazon also provides the Xilinx SDAccel [30] programming environment for its F1 instances, which
is an implementation of the OpenCL [18] standard for Xilinx FPGA devices. The SDAccel shell, simply
termed the static region in the Xilinx documentation, adds some abstraction as up to 16 Hardware
Applications, termed kernels from the OpenCL standard, can be instantiated. Each of the 16 Hardware
Applications can access isolated (and therefore virtualized) regions of the shared off-chip memory.
IBM Research
The work presented by IBM Research considers virtualization directly, as a goal of the implementation [5].
The IBM work, by Chen et al., divides the FPGA spatially into distinct application regions. It is in
these application regions that Hardware Applications are to be programmed, termed accelerators in the
IBM work. The Shell, termed Service Logic by IBM, secures access to shared off-chip memory and a
dedicated host Processor-based server [5]. Similar to all of the previously presented FPGA deployment
models, IBM has adopted an add-on card model with the Processor-based system acting as the master
(i.e., controlling and configuring the FPGA). This solution provides no communication between
FPGAs other than through the host, so it is the most limited in terms of multi-FPGA applications.
Chen et al. specifically decouple the roles of accelerator developers (i.e., developers of the Hardware
Application) and software developers. In the presented model, software developers would offload some
computation from their software application to a dedicated Hardware accelerator. Accelerator devel-
opers would create the set of standardized hardware accelerators that the software developers could
instantiate from their software code [5]. The shell thus includes scheduling hardware in addition to the
memory virtualization, to switch between invocations from multiple software threads of VMs. Hardware
accelerators are programmed into the spatially divided regions of the FPGA device using PR-based
techniques.
Academic Works
Multiple academic works have also explored the deployment of FPGAs in cloud environments. The
work presented by Fahmy et al. virtualizes FPGAs in a similar manner to the work developed by
IBM Research. The FPGA is partitioned spatially into multiple regions and programmed using the PR
methodology. The Hardware Applications are termed Partially Reconfigurable Regions and the Shell is
termed the static logic. The Shell arbitrates access to off-chip memory and a host Processor through
a PCIe interface. No networking functionality is provided. This work considers performance isolation
explicitly by implementing a round-robin scheduler for access to the host communication interface,
preventing starvation. Similar to the IBM model, this work imagines a deployment model wherein
multiple Hardware Applications are available to be programmed and application developers can invoke
them to offload some compute [31].
The implementation by Byma et al. introduces a different deployment model for FPGAs in cloud
deployments. The virtualized FPGA resources are directly connected to the datacentre’s Ethernet
network, similar to the Catapult v2 implementation, but in contrast to that work, there is no connection
of these FPGA devices to some host Processor-based system. Rather than being scheduled by and
receiving data from a dedicated Processor, the FPGA includes a small microcontroller (implemented
as soft logic on the FPGA) that communicates with the network. The cloud management software
(OpenStack in this case [32]) communicates with the FPGA over the network to program and schedule
tasks. Like the previous implementation and the IBM Research implementation, this work uses PR to
allow for seamless multi-tenancy. Hardware Applications can be programmed by sending bitstreams over
the network to the dedicated microcontroller on the FPGA. Note, this work refers to the Hardware
Application Regions as Virtualized FPGA Resources (VFRs), and refers to the Shell as the Static
Hardware.
The Byma et al. work virtualizes access to memory by partitioning off an equal chunk of memory
to each Hardware Application Region, effectively isolating each region’s data. The network connection
is virtualized by using an arbiter that redirects incoming network traffic to the appropriate Hardware
Application based on the OpenStack assigned MAC address. Outgoing traffic is policed by replacing
the source MAC address supplied by the user with the OpenStack assigned address, preventing MAC
address spoofing [33].
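To illustrate the general source-MAC policing technique described above, the Verilog fragment below rewrites the source MAC field of outgoing Ethernet frames streamed one byte per cycle. It is a generic sketch, not Byma et al.'s implementation (nor the NMU of Chapter 5), and ready/backpressure handshaking on the streaming interface is omitted for brevity.

```verilog
// Generic sketch of outgoing source-MAC rewriting (illustrative only).
// Frames are streamed one byte per cycle; bytes 6..11 of the Ethernet
// header carry the source MAC and are replaced with the address assigned
// by the cloud manager, preventing MAC address spoofing by the tenant.
module mac_rewrite (
    input  wire        clk,
    input  wire        rst,
    input  wire [47:0] assigned_mac,  // provider-assigned source MAC
    input  wire [7:0]  in_data,       // frame byte from the application
    input  wire        in_valid,
    input  wire        in_last,       // marks the final byte of a frame
    output reg  [7:0]  out_data,
    output reg         out_valid,
    output reg         out_last
);
    reg [3:0] byte_cnt;               // byte position within the header

    always @(posedge clk) begin
        if (rst) begin
            byte_cnt  <= 4'd0;
            out_valid <= 1'b0;
            out_last  <= 1'b0;
        end else begin
            out_valid <= in_valid;
            out_last  <= in_last;
            if (in_valid) begin
                // Substitute the assigned MAC over header bytes 6..11
                // (transmitted most significant byte first).
                if (byte_cnt >= 4'd6 && byte_cnt <= 4'd11)
                    out_data <= assigned_mac[8*(11 - byte_cnt) +: 8];
                else
                    out_data <= in_data;

                // Count header bytes, saturating once the header has passed;
                // restart the count at the end of each frame.
                if (in_last)
                    byte_cnt <= 4'd0;
                else if (byte_cnt != 4'd15)
                    byte_cnt <= byte_cnt + 4'd1;
            end
        end
    end
endmodule
```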
An implementation by Tarafdar et al. modifies an SDAccel-like Shell (as described in the Amazon
F1 section) to add networking capabilities. From a memory and host-connectivity point of view, the
implementation is the same as the SDAccel Shell. The memory can be virtualized and shared between up
to 16 Hardware Applications (called kernels in the OpenCL nomenclature), and the system is connected
and scheduled by a dedicated host Processor-based system. The work adds virtualized networking capa-
bilities, allowing the Hardware Applications to communicate to each other and other network connected
devices. Rather than police and arbitrate sent and received data like the system proposed by Byma
et al., this implementation virtualizes network access by encapsulating sent data in a User Datagram
Protocol (UDP) like Layer 4 network packet, similar to the methodology employed by Catapult’s LTL
protocol [34].
Finally, work by Yazdanshenas and Betz implemented a multi-tenant FPGA shell with the express
focus of measuring the overhead of the shell implementation [35]. The shell was implemented with up
to four applications (termed Roles) implemented on the FPGA. The shell design included connectivity
for two memory channels, four network interfaces, and a connection to a CPU-based host system. No
explicit isolation of the memory or network interfaces is attempted in that work. The conclusions of the
work indicate that virtualizing FPGAs decreases the maximum operating frequency of the applications
by up to about 40% and increases routing congestion in the Place and Route by up to 2.6x.
2.4 Hardware OS
Hardware OS works focus on creating platforms for FPGA application development, analogous to soft-
ware Operating Systems. They are similar to virtualization solutions in that they often also permit
multiple Hardware Applications to share a single FPGA device. They differ in that the focus of these
works is generally on abstracting access to external resources for Hardware Applications, rather than
data and performance isolation. Because of their similarity to virtualization solutions, they can provide
important insights in the design of virtualization technologies.
2.4.1 Related Work
One popular Hardware OS work is R3TOS, which is actually quite similar to the virtualization solutions
already presented. The R3TOS Hardware OS uses PR to allow multiple Hardware Applications to share
the same FPGA. The R3TOS work targets embedded system environments rather than the datacentre
environments considered in this thesis. Nonetheless, the R3TOS system manages the shared memory
so that the memory of each Hardware Application is protected, similar to the virtualization solutions
presented in Section 2.2.3. In addition, R3TOS adds interprocess communication, allowing Hardware
Applications to communicate to each other on the same FPGA [36].
Other Hardware OS implementations focus on developing abstractions that make it easier to pro-
gram for Reconfigurable Devices. Both ReconOS [37] and BORPH [38] implement Hardware OSes that
introduce hardware processes as analogues to software threads. In fact, both systems are integrated with
Unix based operating systems and allow Hardware threads to access the Unix filesystem. The ReconOS
system also introduces a mechanism by which Hardware Applications can initiate system calls, using an
Operating System Synchronization State Machine that the Hardware Applications must interact with.
Finally, the LEAP system provides perhaps the deepest set of abstractions. The LEAP OS provides
two main categories of abstractions: Scratchpads and Latency-Insensitive Channels. Scratchpads are
essentially automatically generated caches that allow Hardware Applications to access off-chip memory.
Latency-Insensitive Channels can allow different parts of a Hardware Application (or indeed different
Hardware Applications) to communicate to each other. In addition, Latency-Insensitive Channels allow
Hardware Applications to communicate to a host Processor-based system. It provides Operating System
services over the host connections such as standard I/O, barriers, locks, and debugging [39].
Chapter 3
Virtualization Model
The focus of this thesis work is to examine the tradeoffs in the design of a virtualization solution for
FPGA devices. In particular, solutions analogous to software based VMMs are sought. In the software
realm, VMMs provide multi-tenant support with data isolation and performance isolation, such that
multiple VMs can share a single physical resource securely and with some level of quality of service.
These too are the goals for our analogous hardware virtualization model.
3.1 Deployment Model
From the discussed related works, it is clear that there are a number of deployment models for FPGA
devices in datacentres. Virtualization solutions for FPGAs vary in their physical connectivity, in the way
the FPGA services are provisioned for users, and in the abstraction models provided to the Hardware
Applications to access external resources. In this section, the various deployment methodologies are
analyzed.
3.1.1 Physical Implementation
Three main physical deployment models were used in previous work for FPGA virtualization. The most
common FPGA deployment model includes the FPGA device as an accelerator add-on to a traditional
Processor-based compute node. The processor-based system schedules tasks on the FPGA, sends data to
the device, and then retrieves results as they become available. The advantage of such a deployment lies
in the great availability and familiarity of existing software, which can be accelerated without changing
the entire codebase. Only those components of the software that need to be accelerated on the FPGA
need to be implemented in Hardware. Also, existing software frameworks such as OpenCL [18] make
the task of programming FPGAs for software developers less daunting. The downside of this physical
deployment model is that all communication from the FPGA device must be made through the Processor,
which can introduce a significant latency. For the remainder of this thesis, this deployment model will
be referred to as the master-slave model.
Direct-connected FPGA deployments, in which FPGA devices are connected directly to the datacentre's
primary network infrastructure (e.g., an Ethernet network), allow for multi-FPGA applications to
see lower latencies. As this interconnection network is the existing datacentre’s Ethernet network, this
deployment opens up a lot of application possibilities, as the FPGAs can communicate quickly with each
other and other devices in the datacentre. This is the deployment model of the Byma et al. work [33].
If the FPGA is also connected to a host Processor-based system, then the FPGA deployment can take
advantage of both models, as demonstrated in the Tarafdar et al. [34] and the Microsoft Catapult v2
works [2]. This deployment model will be referred to as the direct-connected deployment model. Note,
the Catapult v1 work by Microsoft includes an interconnection network for the FPGAs, though this
interconnection network is reserved for communication by the FPGAs alone, so we do not consider this
a direct-connected FPGA deployment by our definition.
Finally, the Catapult v2 [2] work introduces a bump-in-the-wire physical deployment model, which
is particularly useful for offloading network processing from the Processor-based system to the FPGA.
This exact use case is demonstrated in the deployment of the Catapult v2 system in Microsoft’s Azure
product [10]. This offloading reduces the latency required for complex packet processing tasks that would
otherwise need to be done in the software host. It may also reduce deployment costs, as each Processor-
based system and hardware FPGA pair need only one network interface connection on the downstream
network switch. However, the bump-in-the-wire compute model utilizes at least one networking interface
port on the FPGA for the host’s network connection, which could be instead used for increasing the
potential outgoing bandwidth of an FPGA device.
The direct-connected FPGA deployment model is the most flexible in terms of the types of applica-
tions that can be targeted to it, given that direct-connected FPGAs can communicate with each other
and potentially distant servers, as well as local servers in the datacentre in which they are deployed,
implicitly enabling the kind of communication that would have been possible with a master-slave model
(albeit through a higher-latency Ethernet network rather than a PCIe connection). This model is specifi-
cally advantageous, however, for those applications that require low-latency inter-FPGA communication.
By treating FPGAs as equal peers to traditional Processor-based compute nodes, it also enables more
possibilities; applications targeted at the Reconfigurable Compute enabled datacentres are no longer con-
strained by the need to initiate and terminate computation at software nodes. This direct-connected
FPGA deployment model is adopted for the exploration in this thesis.
3.1.2 Service Provisioning
The existing FPGA virtualization solutions also vary in the ways that these FPGA resources are made
available to the datacentre’s users. There is a particular contrast between the deployments presented
in the IBM and Microsoft works, and the deployments in the Amazon and Academic works. IBM
and Microsoft restrict the FPGA to trusted Hardware applications, generated in-house or by trusted
developers. The users simply invoke instances of the existing Hardware Applications. This deployment
model is simple to use for end users, but can be quite limiting as custom Hardware accelerators cannot be
developed. For our virtualization solution, it is more general to consider the case where non-trusted users
share the same physical FPGA, which is also closer to the software VMM analogue. Note, data isolation
and performance isolation become much more important focuses of the design; mutually distrusting
users must be sufficiently assured of the security and quality of service of their virtual session for the
virtualization solution to be viable.
In addition, to truly enable seamless multi-tenancy on these FPGAs, a PR methodology must be
used. The physical FPGA can be shared spatially by dividing the FPGA into separate virtual regions
(i.e., Hardware Application Regions) and each can be programmed independently without affecting the
others’ operation. One consequence of this approach is that portability of the developed applications is
more limited, as each Hardware Application must be synthesized by FPGA CAD tools targeted for each
different PR region available. While it is technically possible to make multiple PR regions that share the
same layout and connection profile so as to eliminate the need to synthesize bitstreams for each region
individually, such an implementation is technically challenging to achieve. Datacentres are also likely to
have multiple types of FPGAs deployed as new devices are released with better capabilities. Datacentres
may even deploy devices from different vendors. Bitstream portability cannot easily be provisioned for
end users, and the lowest level of portability for Hardware Applications is the HDL source.
As an extension to the above discussion, whether the end-user or the datacentre managers are re-
sponsible for synthesizing the Hardware Application into a bitstream must be considered. The work by
Byma et al. used the latter approach; end-users would simply upload the Hardware Application source
HDL to the cloud management software, and the cloud would automatically synthesize the application
into the appropriate bitstreams. This does introduce a level of abstraction that eases development as the
user does not have to be aware of the existence of different PR regions, but it removes from end-users key
information needed in the design iteration process for Hardware Applications. Namely, the Synthesis
and Place and Route processes provided by FPGA CAD tools generate reports that are vital in fixing
any timing violations in the Hardware Application. Also, Synthesis and Place and Route settings can
often be set for different levels of the design hierarchy, which is often required to strike a balance between
the solution exploration effort and the total runtime of the CAD tools. While a cloud system that passes
this information (reports and settings) from the user to the CAD tools and vice versa could be designed,
these issues introduce quite a bit of complexity into the datacentre management system design.
The advantage of separating bitstream generation from the end-users is that the source HDL is easier
to examine and parse than the final bitstream (the bitstream format is often proprietary and specific
to the device vendor). Such parsing can be used to ensure that the end-user’s Hardware Application
does not perform any malicious activity. The deployment by Amazon [4] does the bitstream generation
for this reason, though that deployment does not include multiple Hardware Applications on the same
FPGA and as such does not use HDL parsing to determine whether malicious interaction with co-resident
Hardware Applications is attempted. It is unclear whether one can determine for certain (at least in
an automated manner) by examining source HDL that no malicious activity is attempted (such an
examination is left for future work). We contend that the most general assumption is that not all malicious
activity can be automatically deduced from source HDL.
Whether or not the source HDL is parsed by the cloud management system, some amount of security
considerations must be made in the design of the Shell. Thus, for simplicity, the Deployment model we
explore in this thesis does not consider source HDL parsing and instead has the end-users synthesizing
the Hardware Applications into bitstreams themselves. A more thorough exploration of the level of
security that can be guaranteed through source parsing is left for future work.
Finally, the way that external resources are connected to Hardware Applications must be considered.
As an example, consider the OpenCL [18] programming model, used with the SDAccel Shell platform
[30]. It decouples memory buffers from the Hardware Applications accessing the memory. A particular
memory location can be used as the output for one Hardware Application, and once that Hardware
Application has finished execution, that memory buffer can be reassigned as the input to another Hard-
ware Application, perhaps even in a different PR Region. This is mostly provisioned at higher levels of
the compute paradigm (as in the OpenCL example), though it must also be considered in the design
of the Shell. In particular, the Shell must ensure that memory locations can be attached to and detached
from Hardware Applications completely, which, for example, is not possible with the work presented by
Byma et al. (the memory is statically assigned to each Hardware Application Region). Similar lifetime
decoupling could be considered for all external resources provisioned by the Shell. We note that such a
methodology results in no loss of generality, as it can easily be used to provision programming models
in which the memory buffer lifetime is either coupled or decoupled from the lifetime of the Hardware
Application.
3.1.3 Deployment Model Summary
The most flexible and powerful deployment model for FPGA virtualization is one in which direct-
connected FPGAs are made available to the end-users of datacentres to be programmed directly. The
FPGAs are treated as peers to software compute nodes, rather than slaves, and can be programmed
directly by the users, rather than be restricted to some available Hardware Applications developed by
trusted developers. The key goal of such a virtualization model must be to secure the operation of one
user from any interference or impact from other users on the same physical resource. The FPGA is to be
shared using a PR-based CAD flow, and portability of Hardware Applications can only be guaranteed
at the HDL level. We posit that this model is general, since any applications that target other FPGA
deployment models could be ported to a direct-connected FPGA deployment; communications to CPU-
based compute nodes, as explicitly enabled in master-slave models, can still be achieved by sending
communications over the shared network infrastructure.
3.2 Required Functionality
The deployment model discussed above introduces a number of requirements of the virtualization solu-
tion. Note, since a PR-based implementation is to be targeted, these requirements are to be implemented
in the Shell (or static regions) of the platform.
3.2.1 Performance Isolation
First, the Shell must adequately decouple the actions of the virtualized Hardware Applications from each
other, such that some reasonable level of performance can be assured. Hardware Applications access
external resources using some standardized interface. For example, memory resources are often accessed
using the Advanced eXtensible Interface (AXI) protocol. Access to the shared resource is usually pro-
vided through some sort of arbitrated interconnection network. As a first step in performance isolation,
some adherence to the interface protocol needs to be assured such that the interconnect providing the
arbitration can service all requests. Illegal or spurious requests to the shared interconnect need to be
blocked to prevent the interconnect from entering an error state or stalling operations.
A protocol verifier and decoupler can be included to assure adherence to the interface protocol. Note,
the protocol verifier and decoupler need only block those protocol violations that may force the intercon-
nect into an error state, and decouple the interface on such an occurrence. Any protocol violations that
do not affect the interconnect or resource functionality (i.e., the resource continues to function correctly)
can be ignored. The protocol verifiers implemented in our work block errant transactions by modifying
the transactions to be protocol conformant, which eliminates the protocol violation, though the original
intent of the transaction may not be preserved. This may affect the operation of the application, but we
are not concerned with the correct operation of errant applications as long as they cannot affect the other
applications on the FPGA. Previous works that focus on blocking errant transactions generate Verilog
for a full list of protocol assertions [40]; our work is more methodical in selecting only those protocol
assertions that may cause errors in downstream devices (e.g., interconnects, memory controllers).
Shared resources often have a limited bandwidth available to the multiple Hardware Applications. To
ensure quality of service, we need to be able to regulate the resource bandwidth used by each application.
Thus, the next step in performance isolation is the inclusion of bandwidth shaping functionality into the
shared interconnect. Bandwidth shaping is a mechanism by which the accesses to the shared resource
by the Hardware Applications can be limited to within some allotted bandwidth budget. This ensures
that one application cannot spam the memory interface with requests that cause other applications to
be denied access. This specific example illustrates how bandwidth shaping can eliminate intentional
starvation forced by bad actors, though bandwidth shaping can also add value beyond these security
benefits; bandwidth shaping provides more deterministic performance to the applications, often resulting
in more reliable system performance. Also, for cloud-based deployments, the inclusion of bandwidth
shaping allows the cloud provider to provision virtual resources and charge customers based on the
required resource access bandwidth.
3.2.2 Data Isolation
From a security perspective, the virtualization solution must ensure that Hardware Applications are
limited in terms of what parts of external resources they are granted access to. Applications should not
be permitted to view, access, or modify the execution environment of the other virtualized Hardware
Applications running on the same physical system. Considering off-chip memory, locations in memory
assigned to one specific Hardware Application should not be readable or writeable by other Hardware
Applications. In processor-based systems, this is often accomplished through the use of a Memory
Management Unit (MMU), which includes the TLB and dedicated hardware to do virtual memory
translation lookups in the Operating System’s page tables. The Virtualization solution should include a
Memory Management Unit to effectively decouple the memory assigned to each Hardware Application.
Note, this decoupling also enables the memory model from the OpenCL programming model described
in Section 3.1.2.
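As a point of reference only, the following Verilog fragment sketches the simplest form such a unit could take: a single base/limit segment per virtual interface, configured by the management layer, which remaps an application's addresses into its assigned physical window and flags out-of-range accesses. The module and signal names are illustrative assumptions; the MMU actually designed for this work is described in Section 4.3.

// Illustrative sketch only: one base/limit segment per virtual interface.
// The MMU presented in this thesis (Section 4.3) is more involved than this.
module simple_segment_check #(
    parameter ADDR_WIDTH = 32
)(
    input  wire [ADDR_WIDTH-1:0] virt_addr,    // address issued by the Hardware Application
    input  wire [ADDR_WIDTH-1:0] seg_base,     // physical base assigned by the management layer
    input  wire [ADDR_WIDTH-1:0] seg_size,     // size of the assigned region, in bytes
    output wire [ADDR_WIDTH-1:0] phys_addr,    // translated address forwarded to the interconnect
    output wire                  out_of_range  // asserted when the access must be blocked
);
    assign out_of_range = (virt_addr >= seg_size);
    assign phys_addr    = seg_base + virt_addr;
endmodule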
Another external resource considered for this work is access to the network, which follows from our
direct-connected FPGA model. As with the memory, data meant for a single Hardware Application
should be excluded from all other Hardware Applications. In other words, incoming network packets
should only be forwarded to the Hardware Application for which they are destined. Some form of network
routing must be implemented. Another less obvious need for the isolation of networking resources is
in the prevention of MAC address spoofing, i.e., one Hardware Application should not be allowed to
send packets with the MAC address assigned to another Hardware Application. The example described
here considers Layer 2 spoofing, though one could imagine a system that implements this exclusion at
a different layer of the network stack (e.g., the work by Tarafdar et al. excludes traffic at Layer 4). We
propose that the unit providing this network isolation be termed the Network Management Unit (NMU).
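As an illustration of what such a unit must check (and not a description of the NMU designed later in this thesis), the sketch below matches the destination MAC address of an ingress frame against a table of per-application MAC addresses to produce a one-hot forwarding decision, and flags an egress frame whose source MAC address differs from the sender's assigned address. The module name, the table layout, and the assumption that the MAC fields have already been extracted from the frame are all illustrative.

// Illustrative sketch only: per-application MAC matching for ingress routing
// and an egress source-address (spoofing) check.
module simple_mac_check #(
    parameter NUM_APPS = 4
)(
    input  wire [47:0] ingress_dst_mac,              // destination MAC of a received frame
    input  wire [47:0] egress_src_mac,               // source MAC of a frame being transmitted
    input  wire [$clog2(NUM_APPS)-1:0] egress_app,   // which application is transmitting
    input  wire [NUM_APPS*48-1:0] assigned_macs,     // MACs assigned by the management layer
    output wire [NUM_APPS-1:0] ingress_forward,      // one-hot: which application receives the frame
    output wire                egress_spoofed        // asserted when the egress frame must be dropped
);
    genvar i;
    generate
        for (i = 0; i < NUM_APPS; i = i + 1) begin : match
            assign ingress_forward[i] = (ingress_dst_mac == assigned_macs[i*48 +: 48]);
        end
    endgenerate

    assign egress_spoofed = (egress_src_mac != assigned_macs[egress_app*48 +: 48]);
endmodule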
3.2.3 Domain Isolation
Following our discussion on network resource isolation, and as an analogue to data isolation, we introduce
the concept of Domain Isolation: virtual compute nodes (Hardware Applications in the case of this thesis
work) should be excluded from interacting with the domain of the other virtual nodes. In the case of
memory, domain isolation implies exclusive access to the memory locations assigned to each Hardware
Application, and in the case of networking, domain isolation implies ownership of some network identity (e.g., a MAC
address at Layer 2) for both received and transmitted traffic.
3.2.4 Channels Targeted for Isolation
To effectively virtualize multiple applications on an FPGA, we have discussed the importance of isolating
those applications in different ways. Specifically, we have introduced performance isolation, data isolation
(for memory), and domain isolation (for networking) as specific isolation needs in Sections 3.2.1, 3.2.2,
and 3.2.3. In the field of security, isolation could imply quite a wide range of potential protections; in
this subsection, we discuss the specific goals of our system in introducing isolation.
In terms of memory connectivity, we focus specifically on introducing isolation for the AXI memory-
mapped interface. This interface is targeted because it is commonly used in FPGA designs that target
Xilinx devices (the vendor used in our work); maintaining backwards compatibility with existing FPGA
designs is an important guarantee of the isolation solution presented in this thesis. The AXI protocol
includes multiple channels, including data channels for read and write paths, address/control channels
for read and write paths, and a write response channel. All five of these channels are targeted for
isolation. Specifically, this isolation solution aims to provide: data isolation guarantees, such that data
is protected against access and/or modification by other users; access guarantees, such that no application
can perform some series of transactions that could starve other applications of access; and some level of
performance guarantees.
In terms of networking connectivity, we focus specifically on introducing isolation for the AXI stream
interface, included for both egress and ingress packets. Again, this interface is targeted for backwards
compatibility with already developed hardware blocks and applications. Both the egress and ingress
channels are targeted for isolation. Specifically, this solution aims to provide: domain isolation guaran-
tees, such that network packets cannot be sent from the device that would interfere with another domain
or section of the network to which the Hardware Application should not have access; access guarantees,
such that no application can perform some series of packet transmissions that could stall the shared
resources or starve other applications; and some level of performance guarantees.
Channels that are explicitly not targeted for isolation include any resources other than memory or
networking connectivity, or any possible vulnerable side-channels. Side-channels commonly refer to
methods by which malicious actors can affect or glean information in reference to other applications in
the system without actually gaining access to the data channels by which the applications traditionally
communicate. For example, a malicious actor could monitor the voltage of the FPGA device, and that
voltage could be used to correlate to a specific value of interest (e.g., an encryption key). This has been
shown to be possible on FPGA devices [41] [42] [43]. No components or solutions are introduced in this
thesis to target side-channel isolation.
3.2.5 Management
A mechanism to manage the multi-tenant FPGA system, including setting up the parameters of the
MMU and the NMU and programming the application regions, is required. Such a management layer
should be integrated with a higher layer scheduling and allocation framework. This higher level schedul-
ing layer could be a cloud management software suite if the FPGA deployment targets a cloud datacentre
(e.g., the OpenStack management software [32]). Such management can be facilitated in a number of
ways, though it is important to note that the implementation of the scheduling layer does not affect the
Virtualization model presented to the Hardware Applications. The implementation of the management
layer only impacts the ease of integrating the system within a higher level scheduling framework.
Most of the previous works include management through an attached dedicated host system. Using
an attached host system would require the FPGA device to be connected to the host through some
mechanism such as PCIe, though it would not necessarily imply the master-slave configuration discussed
in Section 3.1.1. The host connection can be provisioned to provide only management, leaving the
virtualized Hardware Applications as solely directly connected compute modules, or it can be used
for both management and as a connection to the Hardware Applications themselves, enabling users to
decide which configuration to use (i.e., direct-connected or master-slave). This methodology for the
management would meet the deployment model specifications set out in Section 3.1.1.
An alternative to using a direct-connected host to manage the virtualized system is to include some
means of management directly in the Shell. This is the methodology employed by Byma et al. [33] in
their work. Rather than implement the entire scheduling algorithm in dedicated hardware, instead a
small Processor implemented as soft-logic in the Shell was used to program the PR regions and set up
the parameters of the system. Such a soft Processor can also be used to set up the parameters of the
MMU and the NMU. Some scheduling and allocation software can be run on the dedicated processor,
or alternatively a lightweight software setup can be used that simply accepts commands from some
scheduling software running elsewhere. Note, the soft-processor would need to be connected to the
FPGA’s network port for access to the external world.
From the above two discussions, an effective way to implement management of the Shell is by using
a processor to program and set up the Shell's parameters. This processor can be implemented as a soft-
processor on the FPGA directly, can be an embedded ARM processor (as seen in many of the latest
FPGA devices), or can be a processor connected to the FPGA through the PCIe in a host machine.
The Virtualization Model seen by the Hardware Applications and their developers is unaffected by the
means of management.
3.2.6 Interface Abstraction
Another key consideration in the design of a virtualization platform is the way in which the external
interfaces are abstracted and presented to the Hardware Applications. Previous FPGA virtualization
solutions have tended to minimize the extent to which the external resources are abstracted, presenting a
fairly low-level view of the memory and networking resources. Some abstraction is nonetheless included.
For example, in the work by Tarafdar et al. [34], networking connectivity is presented to the Hardware
Applications as an encapsulating Layer 4 switch. In that work the Hardware Application does not need
to construct the Layer 2 and 3 parts of the sent data packets, but instead only sends the payload that
is encapsulated automatically. A similar approach is used in the Catapult v2 work with their LTL
protocol [2].
The Hardware OS works described in Section 2.4 provide a deeper set of interface abstractions. These
abstractions range from automatically instantiated cache structures for accessing memory resources, as
shown in the LEAP work [39], to system calls to the software OS running on a connected processor-based
host system [38]. In addition, different FPGA programming models can provide their own set of abstrac-
tions; the OpenCL programming model used with the SDAccel [30] Shell for example provides a deep
memory and execution model that is provisioned by the Shell. No one abstraction model has become
dominant in the programming of FPGAs, and indeed more abstractions can and likely will be conceived
in future works. If we draw from the analogous software Operating System environment for example,
abstractions to access file storage, to hide network stack complexity, and perhaps even to implement higher-
level programming frameworks could ease the development process for Hardware Application developers.
Some of these design methodologies have been researched on FPGAs before: an in-hardware Trans-
mission Control Protocol (TCP) core has been developed to abstract the Layer 4 network model from
Hardware developers [44], and Message Passing Interface (MPI) abstractions have also been researched
for FPGAs [45].
At this point in the state-of-the-art, it would not make sense to lock in to a particular set of abstrac-
tions for the Shell design. Maximal flexibility in the deployment and development of future abstraction
models would be advantageous. If we consider the software Operating System analogue again, a wide
array and range of system services provide a rich set of abstractions that ease software development.
Such a rich set of abstractions should be the goal of future Hardware Application models to achieve a
similar ease of development for hardware applications. Since an FPGA has limited resources, it would
not be possible to include all possible system services that might be required by any one particular
application within the Shell. This drives our introduction of two new concepts, the “Hard Shell” and
the “Soft Shell”, discussed in the following subsection.
Soft Shell vs. Hard Shell
We draw on two key insights from our previous discussion to introduce the concepts of the “Hard Shell”
and “Soft Shell”. First, as discussed in the previous paragraph, it will not necessarily be possible to
include all desired interface and system abstractions into the Shell design, simply based on resource
limitations. Next, as discussed in section 3.1.2, bitstream portability for our virtualization model cannot
necessarily (or easily) be provisioned. The lowest level of portability for Hardware Applications will
likely be the source HDL. Any Hardware Application must be synthesized for each PR region targeted
anyway, thus any abstractions needed by that Hardware Application can be included and synthesized
with the original HDL source. These abstractions, synthesized and included inside the PR region itself,
are what we call the Soft Shell. In software systems, the Soft Shell might be referred to as the
unprivileged or untrusted domain of operation.
In the design of the Shell, we follow the principle that only those components of the system that need
to be shared amongst the PR regions are included in the static region; we call this the Hard Shell. In
software systems, the Hard Shell might be referred to as the privileged or trusted domain of operation.
This includes the components that arbitrate access to shared resources and ensure domain isolation for
these resources. In addition, performance isolation features may need to be included in the Hard Shell.
In the case that users generate their own bitstream, which would include the Soft Shell (it is synthesized
with the Hardware Application), performance isolation functionality cannot be left to the Soft Shell and
must be included in the Hard Shell as well. This is because a malicious actor could modify the Soft Shell
in its source HDL form to remove any security features included therein, circumventing the protections;
the examination of a synthesized bitstream for such circumvention is a difficult task. If the user does
not generate the bitstream, performance isolation could be reliably provided within the Soft Shell (since
its generation would be hidden from the user). To summarize, the Hard Shell provides only the minimal
features needed to facilitate multi-tenancy and secure resource sharing: the domain and performance
isolation discussed in sections 3.2.3 and 3.2.1.
We envision Soft Shell implementations (and indeed the minimal Soft Shells presented in this work) as
a generated wrapper for a user application that provides higher-level abstractions and services specifically
required by the user application. The generated wrapper changes depending on the abstractions needed
by the specific application, and as such, many different abstraction and programming models can be
implemented within the same Virtualization model. As an example, the soft shell could provide a TCP
stack to ease the development of network connected applications. The introduction of the Soft Shell
also has implications on the design of the Hard Shell. As an illustrative example, consider a situation
where a Soft Shell instantiated abstraction itself needs access to a region in memory. To effectively hide
the provisioning of this abstraction from the Hardware Application, the Hardware Application must be
unaware of this provisioned memory, i.e., it must not be forced to avoid those regions in memory allotted
to the interface abstraction component. The Hard Shell must implement the secure separation of shared
access from within a single PR region as well.
Figure 3.1: Virtualization Shell architecture based on the Soft Shell concept
3.3 Virtualization Shell
The virtualization model discussed in the previous sections of this chapter gives rise to the virtualization
Shell depicted in Figure 3.1. Note, for this thesis work, only external memory and network resources
are considered, as depicted in the figure, though a similar approach could be applied to other external
resources.
3.3.1 Shell Overview
In the Virtualization-based Shell, each Hardware Application is joined by some set of abstractions as
part of the Soft Shell. Both the Hardware Application itself and the Soft Shell components are included
within the PR region. Note, the abstractions included in the diagram are for illustrative purposes only;
this thesis focuses mainly on exploring the design parameters of the Hard Shell. For each of the external
resources, the Hardware Application can include multiple virtualized interfaces to access the physical
resource. For example, Hardware Application 1 from Figure 3.1 includes four virtual memory interfaces
to access the physical off-chip memory (three interfaces from the Application itself and another one from
the TCP interface block). Each of these virtual interfaces is isolated individually in the MMU and NMU
of the Hard Shell.
The Hard Shell includes all those components shown outside of the PR regions. The protocol verifier
and decoupler along with the bandwidth shaping interconnect implement performance isolation for both
memory and network access. The MMU and the NMU implement domain isolation for the memory
and network respectively, treating each virtual interface as an individual domain. This is accomplished
by assigning each of the virtual interfaces within the Soft Shells (whether connected to the Hardware
Application or some Soft Shell component) a unique Virtual Interface ID (VIID). The MMU and the
NMU perform isolation based on this VIID.
Figure 3.2: Virtualization Shell with management infrastructure shown
The way that the management layer interacts with the Soft Shell is shown in Figure 3.2. Each of the
components in the Shell (both within the Hard Shell and Soft Shell) are connected to some management
interconnect. One possible implementation for this interconnect would be a memory mapped interconnect
model that allows for registers within each of the components to be accessed and modified. For the
components within the Soft Shell to be accessible by the management layer, the PR region must include
an interface to the management interconnect. Upon generation of the Soft Shell, each of the Soft Shell
components can be connected to this management interface and mapped within the PR region’s memory
space on the interface. Note, this implies that the Soft Shell configuration space varies from Hardware
Application to Hardware Application. For the Management layer to know how to configure the Soft Shell
components, some listing of the Soft Shell’s configuration must be generated upon the generation of the
Soft Shell itself. The Hardware Application can then be fully described by the generated bitstream and
this Soft Shell description file.
As discussed in Section 3.2.5, the management of the Shell components can be done through either
a PCIe connected host system, or through an integrated soft-processor connected to the shared network
interface. The Shell Management Layer from Figure 3.2 represents either the bridge between the PCIe
controller and the management interconnect (with the management layer acting as the master of the
memory mapped register interconnect), or it represents the soft-processor, which can communicate
over some Ethernet connection. Note, the diagram shows the Management Layer using the Ethernet
Controller through the network interconnect, which is also shared by the PR regions, though a dedicated
Ethernet connection could also be provisioned.
3.3.2 Xilinx Specific Details
The shell shown in Figures 3.1 and 3.2 is implemented, with various component configurations, on an
AlphaData 8k5 FPGA board with a Xilinx Ultrascale XCKU115 FPGA [46]. Further implementation
details are discussed in Chapters 4 and 5.
The AlphaData FPGA board has two channels of DDR4 memory, each containing a single rank.
The capacity of each of these memory ranks is 8GB, with a 64-bit interface. Except where explicitly
mentioned, the shells designed in this thesis use only a single memory channel. The DDR controller
used for all of the designs in this work is generated by the Xilinx Memory Interface Generator
(MIG) provided with the Xilinx Vivado CAD tools [47]. This DDR controller presents a data-width of
512-bits to the interconnect. All of the interconnect components and the respective interfaces provided
to the Soft Shell regions are also configured with a 512-bit interface (i.e., no data-width conversion is
included). The frequency of operation for the DDR controller is 299.581 MHz, which is one quarter of
the frequency of operation of the DDR memory itself. The clock for the DDR controller is provided
to the remaining shell logic, and this clock is used for all of the logic that drives the DDR controller
(which includes the interconnect, any added components, and the Soft Shell memory interfaces). Note,
this wide memory bus and relatively high clock speed (for FPGA designs) introduced a great deal of
routing stress on the FPGA CAD tools. In all of the shell designs, register slices had to be inserted into
the data-path to meet these timing requirements. This high routing stress is consistent with other shell
explorations such as the work described in [35].
The network connections on the AlphaData board are two individual SFP+ cages that can operate
at up to 16Gbps [46]. For Ethernet connections, these ports are set to operate at the 10Gbps Ethernet
line rate. Except where explicitly mentioned, the shells designed in this thesis use only a single network
interface connection. The Ethernet controller used for all the shell designs of this work is the 10Gbps
Ethernet Subsystem offered by Xilinx [48], implemented using the Vivado CAD tools. This Ethernet
controller presents egress and ingress stream interfaces with a data-width of 64-bits each. As with the
memory solutions, there is no data-width conversion included, so all of the attached interconnect and
Soft Shell interfaces are also configured with 64-bit ingress and egress interfaces. The Ethernet controller
generates a clock that operates at a frequency of 156.25 MHz. This clock is used to drive all of the logic
within the network path interconnect and the networking Soft Shell interfaces.
For the management layer, we implement a PCIe-based connection, with all management done in
an attached CPU-Based system. The PCIe connectivity is provisioned with the PCIe Subsystem core
provided by Xilinx [49], which includes AXI interfaces that can easily interface with the other system
components (most Xilinx components operate using the AXI interface). The PCIe Subsystem includes
two separate memory mapped interfaces, one AXI-Lite interface meant to access configuration registers,
and one full AXI4 interface that can operate at higher bandwidths. The AXI-Lite interface is configured
as a 32-bit interface and is connected to all of the configuration registers of all the components within the
Hard Shell. An AXI-Lite interface is also connected to each of the Soft Shells to enable the configurability
of any Soft Shell instantiated abstractions. The full bandwidth AXI4 interface is simply connected to
the DDR Controller to allow the attached host to access the FPGA’s memory. The full AXI4 interface
has a 256-bit data-width. All of the PCIe components are clocked by a 100 MHz frequency clock. To
configure registers in different clock domains, the AXI Interconnect IP Core is used, which automatically
instantiates any necessary clock conversion functionality.
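As an illustration of the kind of register block each Hard Shell or Soft Shell component might expose on this 32-bit AXI-Lite management interconnect, the following sketch implements the write path of a single configuration register. The module name, register map, and single-register scope are assumptions made for brevity, and the read path would follow the same pattern.

// Illustrative sketch only: the write path of a minimal AXI4-Lite configuration
// register block of the kind a shell component could expose for management.
module axil_config_reg_sketch (
    input  wire        aclk,
    input  wire        aresetn,
    // AXI4-Lite write address channel
    input  wire [7:0]  s_axil_awaddr,
    input  wire        s_axil_awvalid,
    output reg         s_axil_awready,
    // AXI4-Lite write data channel
    input  wire [31:0] s_axil_wdata,
    input  wire        s_axil_wvalid,
    output reg         s_axil_wready,
    // AXI4-Lite write response channel
    output reg         s_axil_bvalid,
    output wire [1:0]  s_axil_bresp,
    input  wire        s_axil_bready,
    // component-side configuration output (e.g., a bandwidth budget or an address bound)
    output reg [31:0]  control_reg
);
    assign s_axil_bresp = 2'b00;  // always respond with OKAY

    always @(posedge aclk) begin
        if (!aresetn) begin
            s_axil_awready <= 1'b0;
            s_axil_wready  <= 1'b0;
            s_axil_bvalid  <= 1'b0;
            control_reg    <= 32'd0;
        end else begin
            // retire a completed write response
            if (s_axil_bvalid && s_axil_bready)
                s_axil_bvalid <= 1'b0;

            if (s_axil_awready && s_axil_awvalid) begin
                // address and data handshakes complete in this cycle (readys pulse together)
                if (s_axil_awaddr == 8'h00)
                    control_reg <= s_axil_wdata;
                s_axil_awready <= 1'b0;
                s_axil_wready  <= 1'b0;
                s_axil_bvalid  <= 1'b1;   // issue the write response on the following cycle
            end else if (s_axil_awvalid && s_axil_wvalid && !s_axil_bvalid) begin
                // both the address and the data are waiting; accept them together
                s_axil_awready <= 1'b1;
                s_axil_wready  <= 1'b1;
            end
        end
    end
endmodule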
Chapter 4
Memory Interfaces
This thesis focuses on the security of two specific common FPGA shared resources: memory and net-
working connectivity. In this chapter, we focus on the secured sharing of off-chip memory, such as Double
Data Rate (DDR) attached off-chip memories. Most FPGAs include some form of on-chip memory, and
generally support off-chip memory technologies. The on-chip memories (such as BRAM and LUTRAM
in the Xilinx line of products) are usually spread across the FPGA spatially; these memories can be
split based on their spatial locality, i.e., each Hardware Application would have access to the on-chip
memories located within its assigned PR region. On-chip memories do not need any special security
or isolation considerations, but off-chip memories on the other hand need to be accessed through a single
shared bus and would have their entire contents theoretically accessible by all actors with unfettered
access to this bus. In this chapter we analyze the performance and domain isolation solutions needed to
secure this shared resource.
4.1 AXI4 Protocol Verification and Decoupling
On Xilinx FPGA devices, controllers for managing access to DDR memory are made available that
present an AXI interface to the accessors [47]. Specifically, these controllers target the AXI4 version
of the AXI standard. The FPGA board targeted in this work is the AlphaData 8k5 FPGA board,
which includes a Xilinx Kintex Ultrascale XCKU115 FPGA device [46]. As such, the work presented
herein is specifically tailored towards the AXI4 interface standards. While the work is specific to the
AXI4 standards, we assert that the concepts are general and could be applied to any memory interface
standard.
Figure 4.1: Adding Decoupling for Memory to the Shell (a) Shell without decoupling (b) Shell with added decoupling

As the first step in securing the shared access of off-chip memory resources between multiple co-
resident Hardware Applications, conformance to the AXI4 standard must be checked. In particular, as
mentioned in Section 3.2.1, those aspects of the AXI4 standard that might cause the memory controller
or the shared bus used to access the memory controller to enter into an unexpected state, perform some
erroneous transaction, or stall operation must be enforced. In this subsection, we cover the implemen-
tation of such functionality.
To illustrate the intention of the work described in this subsection, see Figure 4.1. In part (a) of
the figure, we depict an unsecured shell organization that includes only off-chip memory as an external
resource. The multiple Hardware Applications connect to an AXI4 interconnect that arbitrates access to
the Xilinx DDR memory controller. Specifically, for the AlphaData 8k5 board, DDR4 memory is attached
as available off-chip memory. The board itself also includes Peripheral Component Interconnect Express
(PCIe) connectivity to a host CPU-based system. As mentioned in Section 3.2.5, this PCIe is used to
manage the various components of the shell.
The PCIe connectivity is provided by the Xilinx DMA Subsystem for PCIe [49], which is also connected
to the AXI4 interconnect to be able to access the off-chip memory. Part (b) of the figure indicates
how protocol verification and decoupling modifies the simple unsecured shell depicted in part (a); each
AXI4 interface connection from each of the Hardware Applications is connected to a Protocol Verifier-
Decoupler, which in turn outputs a modified AXI4 interface connecting to the original AXI4 interconnect.
These Protocol Verifier-Decouplers are all connected to the PCIe management network.
Figure 4.2: The AXI4 Read Channel Interfaces, adapted from Figure 1.1 of [50]
4.1.1 The AXI4 Protocol Basics
Before introducing the design of the Protocol Verifier-Decoupler, some AXI4 standard details must be
introduced. The complete details of the standard are described in [50]. As a primer, the AXI4 standard
splits memory read and write transactions into channels. Figure 4.2 shows the read channel specification,
and Figure 4.3 shows the write channel specifications. In the AXI4 standard document, those devices
that issue memory access requests are referred to as Masters, while those devices that receive and fulfill
memory access requests are referred to as Slaves. As can be seen from the figures, requests for access
to memory resources, data sent to/from the memory resource, and responses from the memory resource
are separated into individual channels. Each of these channels includes independent handshaking signals
(i.e., signals that indicate a new transaction is available and that an available transaction has been
received).
The address and control channels are used to initiate read and write requests. These channels include
a series of signals (in addition to the handshaking signals) that indicate the type of transaction to initiate,
specifically the size of each data beat and the number of data beats in total to send for that particular
transaction. In addition to the size and number of the expected data beats, the address and control
signals allow for the specification of a “burst mode” for the requested transaction: options include a
“FIXED” mode, where the address of each data beat is the same; an “INCR” mode, where the address of
the data beats is incremented for each beat; and a “WRAP” mode, where the addresses also increment,
but wrap to a lower address once the address reaches some alignment boundary. The write address and
control channel signals are generally prefixed with “aw” and the read address and control signals are
generally prefixed “ar”.
Figure 4.3: The AXI4 Write Channel Interfaces, adapted from Figure 1.2 of [50]
The write and read data channels are responsible for transferring the data to be used in the memory
access operation; multiple data beats can be sent for each requested transaction depending on the burst
mode and length. The read data channel signals are generally prefixed with “r” and the write data
channel signals are generally prefixed with “w”. The read data channel also includes a “resp” signal
that indicates the success of a requested transaction; the analogous signal for the write transaction is
provided on a dedicated write response channel, whose signals are generally prefixed with “b”.
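For orientation, the following Verilog port list sketches a master-side view of the five AXI4 channels and their independent handshake signals; only a representative subset of the signals defined in [50] is shown, and the parameter values are illustrative.

// Representative (incomplete) AXI4 master-side port list, for illustration only.
module axi4_master_ports_sketch #(
    parameter ADDR_WIDTH = 32,
    parameter DATA_WIDTH = 512
)(
    // write address and control channel ("aw"): one request per write burst
    output wire [ADDR_WIDTH-1:0]   awaddr,
    output wire [7:0]              awlen,    // number of data beats minus one
    output wire [2:0]              awsize,   // bytes per beat, log2 encoded
    output wire [1:0]              awburst,  // FIXED, INCR, or WRAP
    output wire                    awvalid,
    input  wire                    awready,
    // write data channel ("w"): one beat per transfer, WLAST marks the final beat
    output wire [DATA_WIDTH-1:0]   wdata,
    output wire [DATA_WIDTH/8-1:0] wstrb,
    output wire                    wlast,
    output wire                    wvalid,
    input  wire                    wready,
    // write response channel ("b"): one response per write burst
    input  wire [1:0]              bresp,
    input  wire                    bvalid,
    output wire                    bready,
    // read address and control channel ("ar")
    output wire [ADDR_WIDTH-1:0]   araddr,
    output wire [7:0]              arlen,
    output wire [2:0]              arsize,
    output wire [1:0]              arburst,
    output wire                    arvalid,
    input  wire                    arready,
    // read data channel ("r"): data and response share this channel, RLAST marks the final beat
    input  wire [DATA_WIDTH-1:0]   rdata,
    input  wire [1:0]              rresp,
    input  wire                    rlast,
    input  wire                    rvalid,
    output wire                    rready
);
endmodule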
4.1.2 AXI4 Master Protocol Verification Requirements
The AXI4 standard includes a companion standards document that details the specific assertions that
must be met to guarantee that any particular transaction is protocol compliant, the AXI4 Protocol As-
sertions User Guide [51]. A summary of relevant assertions is included in Appendix A. Table A.1 includes
a summary of the protocol assertions for the write address and control channel. Table A.2 includes a
summary of the protocol assertions for the write data channel. Table A.3 includes a summary of the
protocol assertions for the write response channel. Table A.4 includes a summary of the protocol asser-
tions for the read address and control channel. Table A.5 includes a summary of the protocol assertions
for the read data channel. Finally, Table A.6 includes a summary of the protocol assertions dealing with
exclusive access. Protocol assertions that target simulations of AXI4 peripherals rather than synthesis,
and those pertaining to signals not included in the Xilinx memory controller, are not included for brevity
(and because they do not need to be considered). In addition, the handshaking assertions are summa-
rized by a single entry in each table (e.g., a single AXI_ERRM_AWxxxxx_STABLE entry is included
in the write address and control assertions table rather than individual AXI_ERRM_AWLEN_STABLE,
AXI_ERRM_AWSIZE_STABLE, etc., entries).
Xilinx actually provides to customers an AXI4 protocol assertion Hardware Core [52], though this
block simply monitors the AXI4 transactions and would not prevent an erroneous transaction from
propagating to the interconnect; for secure isolation, erroneous transactions need to be prevented. Not
all protocol errors can induce erroneous operation in the system; in fact, the Xilinx AXI4 interconnect
user guide [53] and the Xilinx DDR Memory Controller user guide [47] indicate that some errors are
ignored. These ignored errors would not need to be prevented. In Tables A.1 - A.6, the final two columns
indicate whether or not the interconnect and memory controller, respectively, will enter some erroneous
operation state as a result of the AXI4 protocol violation (the highlighted entries indicate an error that
needs to be avoided).
The Xilinx AXI4 interconnect ignores most AXI4 protocol violations, though there are some excep-
tions that must be considered. For the read and write address channels, the size of a single beat must
be indicated to be less than or equal to the interface width; an erroneous value of this signal may cause
width converters in the interconnect (and indeed in the memory controller) to fail to operate correctly.
Also, the WLAST indicator on the write data channel must be asserted correctly such that the inter-
connect arbitrates correctly. For all of the output channels (the read and write address channels, and
the write data channel), the signals must remain stable once they have been indicated to be valid and
before they have been received by the slave interface; changing values might cause different values to
propagate in the interconnect depending on when the values are sampled. Finally, for all input channels
(the read data channel and the write response channel), the data must be accepted within a reasonable
time to prevent the shared interconnect from hanging.
The Xilinx AXI4 memory controller does not ignore as many protocol violations as the Xilinx AXI4
interconnect, and as such more verification considerations need to be addressed. Specifically, additional
errors need to be avoided in the read and write address channels. First, if a burst transaction with
a burst mode of WRAP is indicated, the address of that transaction must be aligned and the burst
length must be a power of 2 between 2 and 16. Incorrect WRAP burst mode transactions may cause
undefined behaviour in the memory controller. In addition, the AXI4 protocol standard specifies that
no transaction may access data across a 4kB boundary. This protocol violation is particularly significant
for shared access since assigned memory regions must not be accessible by actors to whom the memory
is not assigned.
4.1.3 Memory Transaction Decoupling
The AXI4 protocol isolation presented in this work is divided into two separate components: the protocol
decoupler and the protocol assertion verifier. The purpose of the protocol decoupler is to allow the FPGA
management framework to disconnect a Hardware Application from the shared interconnect. This might
be done to reprogram the PR region in which the Hardware Application is resident (i.e., to assign the
PR region to a new or modified Hardware Application) for those deployments where PR is enabled, or to
pause the memory accesses of the Hardware Application for some other reason, such as to disable further
erroneous transactions from a misbehaving Hardware Application or to perform some maintenance on
the memory resource (e.g., defragmentation).
The existing commercial decoupling solutions generally work by driving the handshake signals to
a zero value while allowing the remaining signals to pass through. This works because even if a valid
transaction is presented on the interface, on the decoupled side of the interface decoupler the valid signals
will be held low. This is how the Xilinx Partial Reconfiguration Decoupler works for example [54], as
shown in Figure 4.4, adapted from that work. This works to prevent any new beats from being sent into
the interconnect, but it might lock up the interconnect if some burst transaction has not been completed.
In this work we present a new AXI4-specific decoupler that stops new memory access requests from being
sent while waiting for transactions to finish before decoupling the write data and read/write response
channels.
Figure 4.5 shows the design of this modified decoupler. The read and write address and control
channels are decoupled in a similar way to the previously mentioned decoupler, i.e., the valid and
ready signals are driven low when a decouple event is indicated, however there must be one special
consideration. The AXI protocol specifies that once the valid signal on any channel has been asserted,
it cannot be de-asserted until that transaction has been accepted by the downstream interface. For each
of the read/write address and control channels, a single “sticky” bit is included that goes high whenever
a valid beat is indicated but not accepted; this sticky bit is used to gate the decouple signal until after
that waiting transaction is accepted. The data and response channels must not be decoupled until the
outstanding transactions those channels are to service are completed; to this end counters are included
that count the expected number of write data beats to be sent, the expected number of write responses
to be received, and the expected number of read responses to be received (for read responses, a read
data beat with the LAST signal asserted is counted as a single response). The counters are used to gate
the decouple signal until no transactions are outstanding.

Figure 4.4: Xilinx AXI Decoupler Operation, adapted from Figure 2.3 of [54]
As a final consideration, we know from Section 4.1.2 that a Hardware Application can cause the
interconnect to hang if a malicious user does not send the data beats or does not accept read/write
responses in a timely manner. In fact, the Hardware Application could decide to never accept and thus
deprive all other Hardware Applications of access to this resource. As such, there must also be a way
to force the interface to send data beats and accept responses if it has timed out (a timeout condition
is included in the memory protocol verifier described in Section 4.1.4). A “decouple force” input signal
drives the READY signals of the read and write response channels to high when those channels are not
decoupled. The forced decouple state also generates forced write data beats to be sent on the write data
channel, with the write strobe signal (a signal that indicates which bytes of a write transactions should
actually be written) set to zero. When all of the channels are successfully decoupled, and all outstanding
transactions complete, a “decouple done” output signal is asserted.
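The following fragment sketches the write address channel portion of this behaviour: a sticky bit records a presented-but-not-yet-accepted request and gates the decouple until that request completes. The coding style and signal names are simplifying assumptions; the full decoupler of Figure 4.5 additionally handles the read address, data, and response channels and the forced-response behaviour described above.

// Simplified sketch of the write address channel decoupling with a "sticky" bit.
// A request that has already been presented (awvalid high but not yet accepted)
// must be allowed to complete before the channel is actually decoupled.
module aw_decouple_sketch (
    input  wire clk,
    input  wire resetn,
    input  wire decouple,        // decouple request from the management framework
    // Hardware Application side
    input  wire awvalid_in,
    output wire awready_in,
    // shared interconnect side
    output wire awvalid_out,
    input  wire awready_out,
    output wire aw_decoupled     // ANDed with the other channels to form "decouple done"
);
    reg  sticky;                              // a beat was presented but not yet accepted
    wire gate = decouple && !sticky;          // only decouple once nothing is pending

    always @(posedge clk) begin
        if (!resetn)
            sticky <= 1'b0;
        else if (sticky && awready_out)
            sticky <= 1'b0;                   // the waiting request has now been accepted
        else if (!gate && awvalid_in && !awready_out)
            sticky <= 1'b1;                   // a request is pending on the still-coupled channel
    end

    assign awvalid_out  = gate ? 1'b0 : awvalid_in;
    assign awready_in   = gate ? 1'b0 : awready_out;
    assign aw_decoupled = gate;
endmodule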
Figure 4.5: AXI4 Memory Decoupler

4.1.4 Memory Protocol Verification
The AXI4 protocol verifier prevents the Hardware Application from creating any of the protocol violations
described in Section 4.1.2, and is shown in Figure 4.6. The depicted module labelled Address/Control
Correction modifies any read and write requests that contain a protocol violation such that the protocol
violation is removed. The details of the corrections of this module are described in
Algorithm 1.
For the read/write address and control channels, the first check done is that the SIZE signal does
not indicate a beat size that is greater than the data interface; the SIZE field is overridden if it contains
an invalid value. Next, if a WRAP type burst mode is indicated, the number of beats indicated in the
burst length must be a power of 2 between 2 and 16. If the value is not one of these, the burst length
value remains unchanged but the burst mode is overridden to be of INCR type. This is to prevent the
decoupler from malfunctioning (if the burst length is changed, the decoupler will send a different number
of write beats than expected by the interconnect).

Algorithm 1: Address/Control Channel Correction
Inputs:  size_in, addr_in, burst_in, len_in
Outputs: size_out, addr_out, burst_out, len_out, 4k_error

len_out ← len_in;
if size_in > MAX_SIZE then
    size_out ← MAX_SIZE;
else
    size_out ← size_in;
end

var: is_burst_length ← (len_in = 2) ∨ (len_in = 4) ∨ (len_in = 8) ∨ (len_in = 16);
if burst_in = WRAP ∧ ¬is_burst_length then
    burst_out ← INCR;
else
    burst_out ← burst_in;
end

array: addr_masks[8] ← { ...111111111, ...111111110, ...111111100, ...111111000,
                         ...111110000, ...111100000, ...111000000, ...110000000 };
var: addr_align ← addr_in & addr_masks[size_out];
if burst_out = WRAP then
    addr_out ← addr_align;
else
    addr_out ← addr_in;
end

var: addr_last  ← addr_align + (len_in << size_out) − 1;
var: addr_match ← addr_last[MSB−1 : 12] ≠ addr_in[MSB−1 : 12];
if burst_out = INCR ∧ addr_match then
    4k_error ← 1;
else
    4k_error ← 0;
end

Figure 4.6: AXI4 Memory Protocol Verifier

Finally, if the final burst mode continues to be of
WRAP type (after the check described before), the address is forced to an aligned value by masking
out the least significant bits that correspond to the beat size. Rather than including a barrel shifter to
calculate this mask, a lookup table of masks is used (the SIZE field is 3 bits wide and thus limited to
only 8 different possible values ranging from 1 Byte to 2^7 = 128 Bytes).
Note, overriding parameters on the read/write request changes the behaviour from the Hardware
Application User’s point of view, since the actual transaction sent differs from the one that was intended,
but it will not cause any unexpected behaviour in the shared resources. This system promotes shell
stability over the functioning of an errant user, whose functionality should be suspect in any case
considering the malformed transaction request.
The final check on the read/write address and control channels is a 4kB boundary crossing check.
To simplify this calculation, we note that a WRAP burst is incapable of crossing a 4kB boundary
since the maximum transaction size of a WRAP burst is 2048 Bytes (16 beats x 128 Bytes/beat) and
WRAP bursts always access data aligned to the total transaction size. To determine whether a user
transaction crosses a 4kB boundary, the address of the last byte of the transaction is computed and
its most significant bits are compared to those of the starting address indicated. The number of bits
compared is equal to the address field length minus 12 (since the bus is byte addressable and 12 bits
are needed to index a 4kB page). The last address is computed by taking the aligned version of the
address (as computed for the WRAP address alignment) and adding the burst LENGTH multiplied by
the burst SIZE. The SIZE parameter is in a log base 2 form, so the multiply can be implemented with
a barrel shifter. Thus, a 4kB boundary crossing is detected when the burst type is INCR and the most
significant bits of the final address in the transaction are different from the most significant bits of the
starting address.
In the previous instances of protocol violations, the transaction was changed so as to remove the
violation and then allowed to proceed. In the case of the 4kB boundary crossing, that isn’t possible as
the only change that could guarantee no boundary crossing would be to change the burst length and such
a change would cause the decoupler to malfunction. Instead, an error signal is output from the verifier
that is used by the MMU later (the MMU includes an error handling mechanism, which is described in
Section 4.3).
The write data output channel needs one specific protocol correction, the proper assertion of the
LAST signal, indicating the last data beat in a transaction. When a write transaction is accepted in the
write address channel, the LEN value is pushed to a small First In First Out (FIFO) buffer. The AXI4
protocol standard does not allow for the reordering of write requests (if reordering should be allowed,
separate FIFOs would be needed per write ID), so this FIFO contains the stream of all expected data
beat counts. A counter is incremented every time a write data beat is accepted, and when the counter’s
value is equal to the value at the head of the FIFO, the LAST signal is asserted, the FIFO’s data is
read, and the counter is reset. Thus, the user’s LAST signal is ignored and a corrected version is output,
removing the possibility of this specific protocol violation.
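A sketch of this LAST-regeneration logic is given below. The small length FIFO is written out explicitly here; the depth of 16 entries and the omission of overflow protection are simplifying assumptions, as are the module and signal names.

// Sketch of the WLAST regeneration: accepted AWLEN values are queued in a small
// FIFO and a beat counter recreates the LAST indicator, ignoring the user's own
// WLAST signal. Depth of 16 and the lack of overflow protection are simplifications.
module wlast_correct_sketch (
    input  wire       clk,
    input  wire       resetn,
    input  wire       aw_handshake,   // awvalid && awready: a write request was accepted
    input  wire [7:0] awlen,          // burst length minus one, per the AXI4 encoding
    input  wire       w_handshake,    // wvalid && wready: a write data beat was accepted
    output wire       wlast_out       // corrected LAST indicator driven to the interconnect
);
    // small synchronous FIFO of expected burst lengths (one entry per outstanding write)
    reg [7:0] len_mem [0:15];
    reg [4:0] wr_ptr;
    reg [4:0] rd_ptr;
    reg [7:0] beat_count;

    wire       fifo_empty   = (wr_ptr == rd_ptr);
    wire [7:0] expected_len = len_mem[rd_ptr[3:0]];

    // assert LAST on the beat whose index matches the queued burst length
    assign wlast_out = !fifo_empty && (beat_count == expected_len);

    always @(posedge clk) begin
        if (!resetn) begin
            wr_ptr     <= 5'd0;
            rd_ptr     <= 5'd0;
            beat_count <= 8'd0;
        end else begin
            if (aw_handshake) begin
                len_mem[wr_ptr[3:0]] <= awlen;     // queue the expected beat count
                wr_ptr <= wr_ptr + 5'd1;
            end
            if (w_handshake) begin
                if (wlast_out) begin
                    rd_ptr     <= rd_ptr + 5'd1;   // retire this burst's FIFO entry
                    beat_count <= 8'd0;
                end else begin
                    beat_count <= beat_count + 8'd1;
                end
            end
        end
    end
endmodule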
For all of the output channels, the read and write address channels and the write data channel, the
STABLE AXI protocol assertions must also be met. All of the output signals of these channels are
stored in registers once the VALID signal is asserted, ensuring that any changes to the signals are not
captured by the interconnect.
Figure 4.7: Adding Bandwidth Throttling for Memory to the Shell. (a) Shell without bandwidth throttling; (b) Shell with added bandwidth throttling.

Finally, the read and write response channels, as well as the write data channel, must be handled in a timely manner to prevent the interconnect from hanging. Counters for each of these channels are
included that count the number of cycles for which the channel is ready to proceed with a transaction but
the Hardware Application user does not proceed. Once the counter reaches some parameterized timeout
value, an error signal is asserted. The logical ORing of all of these error signals is output from the module
and used to drive the “decouple force” input signal of the decoupler. Once the decouple force signal
is asserted, the decoupler will override the user's signals and force a response. Once the outstanding
transactions have completed, the decouple force signal will force the interface into a decoupled state,
preventing further starving of the memory resources.
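As a behavioural summary, the timeout and decouple-force logic for one channel can be sketched as follows (Python; the timeout value and signal names are placeholders for the parameterized values in the shell, and the OR of these per-channel error outputs drives the decoupler's force input as described above):

class ChannelTimeout:
    """Flags an error when the interconnect is ready but the user stalls too long."""

    def __init__(self, timeout_cycles):
        self.timeout = timeout_cycles
        self.count = 0

    def cycle(self, shell_ready, user_proceeds):
        if shell_ready and user_proceeds:
            self.count = 0                     # progress was made; restart the count
        elif shell_ready:
            self.count += 1                    # a cycle was wasted waiting on the user
        return self.count >= self.timeout      # error output, ORed into decouple force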
4.2 Performance Isolation for AXI4
Memory resources have limited bandwidth available and that bandwidth must be effectively shared
between competing Hardware Applications co-resident on an FPGA device. As we discussed in Sec-
tion 3.2.1, this Performance Isolation provides more reliable performance for the users of virtualized
FPGA resources. To illustrate the intention of the work described in this subsection, see Figure 4.7. In
part (a) of the figure, we depict the shell described so far, with memory decoupling and protocol veri-
fication implemented. Part (b) of the figure demonstrates how performance isolation modifies the shell
depicted in part (a); the interconnection network is augmented with bandwidth throttling components
to limit the rate of transactions allowed from each Hardware Application. These bandwidth throttlers
are also connected to the PCIe management network.
4.2.1 Traditional Credit-Based Rate Throttling
The concept of Latency-Rate Servers (LR Servers) was introduced in [55] to ensure some level of quality
of service in broadband applications. An LR-Server is a rate limiter for some traffic generating endpoint
on a network, limiting the rate of the sender to ensure a bounded latency and a guaranteed rate of transmission.
The concept has also been applied outside of broadband applications; the works described in [56] [57] [58]
use this conceptual definition in the design of rate limiters for other resources, including SDRAM memory.
The work presented in [58] is particularly interesting given its simple design and powerful bandwidth
guarantees. This credit-based approach can be modified to target the AXI4 memory interface.
The credit-based accounting system of [58] works by assigning to each requester a rate ρ and a bursti-
ness σ. Each requester also has an associated counter whose value is used to determine whether or not
the requester is allowed to initiate a new request. When the requester has a request pending that has
not yet been serviced, the counter is incremented by some amount n every cycle, accumulating credits.
For every cycle that the requester is given access to the shared bus, the counter is decremented by d.
In most conditions, the counter is incremented by n every cycle, and decremented by d every cycle for
which the requester is granted access; if d is greater than n (it should be when configured), the counter
accumulates credits when it does not have access to the bus, and loses credits when it is granted access
to the bus. As an exception, the credit-based accounting work specifies that when the requester does not
have any pending requests, its credit is reset to an initialization amount equal to d× σ, which prevents
the requester from accumulating credits and then sending requests all at once with the accumulated
credits. The credit accounting system is summarized as follows:
credits(t + 1) = credits(t) + n − d    (has bus access)
credits(t + 1) = credits(t) + n        (no access but pending requests)
credits(t + 1) = d × σ                 (no pending requests)
The values of n and d, integers, are chosen such that ρ ≈ n/d, which ensures that credits are re-
populated at a rate that gives the requester the allocated bandwidth. The credit accounting system
can be further simplified if the credits are taken as fixed point values (rather than integer values in the
original formulation), and the value of d is fixed to 1. In that case, the credit accounting system is
simplified to:
credits(t + 1) = credits(t) + ρ − 1    (has bus access)
credits(t + 1) = credits(t) + ρ        (no access but pending requests)
credits(t + 1) = σ                     (no pending requests)

where ρ is a decimal representation of the percentage of time the requester should be granted access to
the bus (ρ ≤ 1), and σ is an integer value that continues to represent the burstiness of the LR Server.
Technically speaking, the original formulation from [58] where ρ ≈ n/d allows for a closer approximation
of ρ as there is more flexibility in choosing integer values of both n and d to get a close approximation.
The fixed point system simply truncates ρ to represent it in the reduced precision formulation. However,
this second formulation eases further calculations.
The requester is granted access to the bus when there are enough credits for the request to be
processed. For requests that require access to the bus for a single cycle, the requester need only check
that there are at least n credits in the first formulation of the credit accounting system, and at least one
credit in the second formulation of the credit accounting system. However, for burst AXI4 transactions,
the number of credits needed to initiate a transaction is LEN ∗ n in the first credit accounting system,
and simply LEN in the second credit accounting system, where LEN is the burst length. Thus, the
second credit accounting system eliminates the need for multiplications and reduces the area needed to
implement the credit accounting system.
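A behavioural sketch of the simplified (fixed-point, d = 1) accounting for a single requester might look like the following Python; it is illustrative only and omits the bandwidth-conservation override described later:

class CreditAccount:
    """Simplified LR-Server credit accounting (fixed point, d = 1)."""

    def __init__(self, rho, sigma):
        self.rho = rho              # allocated fraction of bus cycles (rho <= 1)
        self.sigma = sigma          # burstiness: value restored when the requester idles
        self.credits = float(sigma)

    def can_issue(self, axlen):
        # A burst needs LEN credits in this formulation (no multiplication by n)
        return self.credits >= axlen

    def cycle(self, has_pending, issued_len=None):
        if not has_pending:
            self.credits = float(self.sigma)   # reset when nothing is pending
            return
        self.credits += self.rho               # accumulate while a request waits
        if issued_len is not None:
            self.credits -= issued_len         # pay LEN credits when a burst is accepted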
Note, other rate limiting systems have been implemented to enable multiple memory access interfaces
to share a memory system with some guaranteed bandwidth. For example, the Bandwidth Guaranteed
Prioritized Queuing (BGPQ) system, presented in [59] can effectively manage bandwidth allocations.
The BGPQ system, however, requires that requests be queued, which introduces more area and latency
overhead, and arbitrates based on some credit count value for each requester in the system. This means
that to determine whether a specific requester should have access to the bus, its credit value must be
compared to all other requester’s credit values. This could be an expensive operation. Extensions to
the BGPQ mechanism can implement latency guarantees, so that interfaces that have low bandwidth
requirements but strict latency requirements can be serviced [60]. This implementation introduces
yet more logic to the arbitration. The credit-based accounting mechanism chosen for this thesis can
determine whether to allow transactions to propagate based solely on information for that requester,
reducing combinational logic and routing strain for the FPGA. Comparative analyses of these other rate
limiting systems are not formally included in this thesis.
The work presented in [58] presents a robust proof that the credit accounting system is an LR-Server,
and can thus be used to effectively regulate the latency and rate of transactions for memory access.
4.2.2 AXI4-Specific Credit Mechanism
This thesis presents two separate contributions relating to the credit-based accounting system presented
in [58]: first, the modified credit accounting methodology presented in the previous section (which
replaces the integer-based credit system with a fixed-point one), and second, the modification of
the credit-accounting system for modern memory access protocols, AXI4 in this specific case.
The first change motivated by the AXI4 standard is one that addresses the separation of memory
access requests into multiple channels. The read and write requests are logically separated, and the ad-
dress/control and data parts of the transaction are also logically separated. The simplest way to address
this separation is to instantiate separate credit accounting systems for each of read and write requests
streams. Two separate credit-accounting systems also allow the FPGA management framework to al-
locate read and write bandwidth independently, enabling finer-grained control. The separation of the
address/control from the data motivates changing the way that the credit system is reset to the σ value.
Rather than simply inspecting the memory request channels for pending requests to determine whether
to reset the counter, the write data channel must also be monitored; if there are no outstanding requests
on the write address channel and no outstanding data to send on the write channel, the counter can be
reset. These changes are summarized in the following:
cr_rd(t + 1) = cr_rd(t) + ρ_rd − LEN_rd    (AR request accepted)
cr_rd(t + 1) = cr_rd(t) + ρ_rd             (pending AR request)
cr_rd(t + 1) = σ_rd                        (no pending AR request)

cr_wr(t + 1) = cr_wr(t) + ρ_wr − LEN_wr    (AW request accepted)
cr_wr(t + 1) = cr_wr(t) + ρ_wr             (pending AW request or data to send)
cr_wr(t + 1) = σ_wr                        (no pending AW request or data)

where cr_rd and cr_wr represent the credits for the read and write channels respectively and LEN is the length of the burst transaction. Note, the subscripts on the ρ, σ and LEN values indicate separate parameters for the read and write channels.
As mentioned in Section 4.1.4, the user Hardware Application does not have to send data in the
same cycle that the interconnect is ready to receive the data. With the implementation of the mem-
ory protocol verifier described in this thesis, in Section 4.1.4, a timeout limit is set to avoid indefinite
starvation of the system. This has an impact on the credit accounting system in two ways: first, the
number of credits that must be subtracted when the transaction is accepted is equal to the length of
the transaction multiplied by the total number of cycles the user could wait per data beat; next, if the
user is more efficient in their responses, credits must be redeposited at the time the data is accepted on
the write data channel. The effects of these changes on the write component of the credit accounting
mechanism are as follows:
cr_wr(t + 1) = cr_wr(t) + ρ_wr − cr_aw + cr_w    (pending AW request or data to send)
cr_wr(t + 1) = σ_wr                              (no pending AW request or data)

where:

cr_aw = LEN_wr × (timeout_cyc + 1)    (AW request accepted)
cr_aw = 0                             (otherwise)

cr_w = timeout_cyc − wasted_cyc_wr    (write data accepted)
cr_w = 0                              (otherwise)

wasted_cyc_wr(t + 1) = wasted_cyc_wr(t) + 1    (WREADY high and WVALID low)
wasted_cyc_wr(t + 1) = wasted_cyc_wr(t)        (WREADY not asserted)
wasted_cyc_wr(t + 1) = 0                       (write data accepted or reset)
where the term timeout_cyc is a constant value that represents the maximum number of cycles the user is given to respond to a ready write data channel. If the value of timeout_cyc is chosen such that timeout_cyc + 1 is a power of 2, the multiplication in the above can be implemented as a constant shift
and adds little combinational area overhead. The user can also waste cycles on the read data bus, by
refusing to accept data immediately when it is available. A similar formulation as above for the read
component of the credit accounting system yields the following:
cr_rd(t + 1) = cr_rd(t) + ρ_rd − cr_ar + cr_r    (pending AR request)
cr_rd(t + 1) = σ_rd                              (no pending AR request)

where:

cr_ar = LEN_rd × (timeout_cyc + 1)    (AR request accepted)
cr_ar = 0                             (otherwise)

cr_r = timeout_cyc − wasted_cyc_rd    (read data accepted)
cr_r = 0                              (otherwise)

wasted_cyc_rd(t + 1) = wasted_cyc_rd(t) + 1    (RVALID high and RREADY low)
wasted_cyc_rd(t + 1) = wasted_cyc_rd(t)        (RVALID not asserted)
wasted_cyc_rd(t + 1) = 0                       (read data accepted or reset)
The term wasted_cyc, in both the read and write credit accounting formulations, monitors the corresponding data channel for activity and stores the total number of cycles in which the data bus was ready to
transmit data but the user did not accept the transmission. In this way, the user must “pay” cred-
its to waste bus cycles and thus does not use more than their allotted share of bandwidth. If a user
is efficient in their access of the data bus, no cycles are wasted and all oversubscribed credits will be
returned. In this case, the number of credits initially deducted when the transaction is accepted is
equal to LEN × (timeout_cyc + 1), and the number of credits redeposited at each successful data beat transmission is simply timeout_cyc − 0 (wasted_cyc would be zero), which leaves an effective total number of credits used equal to LEN for each transaction, matching the initial credit formulation. The user
essentially “borrows” credits to pay for any potential wasted cycles in the future.
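Putting the write-side formulation together, a behavioural Python sketch of this borrow-and-redeposit scheme might look as follows (names such as rho_wr, sigma_wr and timeout_cyc mirror the terms above; all are illustrative and the sketch is not the shell's implementation):

class WriteCreditThrottler:
    """Write-channel credits with worst-case borrowing for stalled data beats."""

    def __init__(self, rho_wr, sigma_wr, timeout_cyc):
        self.rho = rho_wr
        self.sigma = sigma_wr
        self.timeout = timeout_cyc       # max cycles the user may stall per data beat
        self.credits = float(sigma_wr)
        self.wasted = 0                  # wasted_cyc_wr: stalls since the last beat

    def allow_aw(self, awlen):
        # Borrow enough credits to cover worst-case stalling of every beat
        return self.credits >= awlen * (self.timeout + 1)

    def cycle(self, pending, aw_len_accepted=None,
              wready=False, wvalid=False, w_beat_accepted=False):
        if not pending:                  # no AW request and no data left to send
            self.credits = float(self.sigma)
            self.wasted = 0
            return
        self.credits += self.rho
        if aw_len_accepted is not None:
            self.credits -= aw_len_accepted * (self.timeout + 1)
        if w_beat_accepted:
            self.credits += self.timeout - self.wasted   # redeposit the unused borrow
            self.wasted = 0
        elif wready and not wvalid:
            self.wasted += 1             # interconnect ready but user not sending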
The hardware implementation of this modified credit accounting based bandwidth throttler is shown
in Figure 4.8. The read and write channels have separate counters to keep track of the number of
credits each stream has. For the address/control channels, the LEN field of the transaction is used to
calculate the number of credits needed, using a constant shift to represent the multiplication by the term timeout_cyc + 1. This is compared to the credit counter's integer component (the counter is a fixed point
decimal number) to determine whether enough credits are available. This comparator signal is used to
decouple the address/control channel from the Hardware Application. The credit update mechanism is
implemented by adding the update term ρ (which is passed to the block as an input) to the fractional
part of the counter, subtracting the shifted LEN value if the requested transaction is accepted, and
adding back a term that implements the timeout_cyc − wasted_cyc calculation if the write/read data
beat is accepted. If at any point there are no outstanding transactions pending, the σ value is loaded into the
credit counter instead of the updated term. For the write transactions, this default value loading is done
when there are no pending transactions and no pending data beats.

Figure 4.8: AXI4 Memory Bandwidth Throttler
4.2.3 Bandwidth Conserving System
The bandwidth throttling system described thus far is limited in that the total system bandwidth would
be limited to the cumulatively assigned ρ values; any unused or unassigned bandwidth is wasted and
cannot be reclaimed. The original credit-accounting system in [58] includes an addendum on bandwidth
conservation. If all of the requesters are blocked from proceeding with a transaction or do not have
a pending transaction, a second tier arbitration scheme is used to override the bandwidth throttler
decouplers. Requesters that are granted access from this second tier arbitration scheme when they are
blocked do not consume credits for the accepted transaction.
To implement this system in the previously described bandwidth throttler, each of the bandwidth
throttlers needs to output its pending status (i.e., whether or not it has a pending transaction that is
blocked, or pending write data to send) and input a credit override signal. The pending status outputs
are ORed to determine if any requester has a valid request pending. If no valid requests are pending
and blocked, a second arbitration scheme outputs the credit override signals back to the bandwidth
throttlers. In the case of this thesis work, a simple time-division multiplexing second tier arbitration is
used, selecting a single requester to override per cycle in a cyclic manner. Once a requester has been
granted an override, or some other requester has a valid request granted, the override system ceases
operation in favour of the original bandwidth throttling system.
If a requester issues a transaction request based on an override, it does not subtract the credits that
would have been necessary had the transaction been issued in a normal manner. One complication
that comes up in this implementation is that without alteration, the current system would continue to
redeposit credits for efficient use of the read/write data channel, which does not make sense since those
credits were not borrowed in the first place. To remedy this, a FIFO buffer between the address/control
channel and the data channel stores whether or not the transaction was issued with an override. If
the transaction was issued with an override, the credits are not redeposited based on the efficiency of
the transaction on the data bus. In the AXI4 standard, write data transactions must be issued in the
same order as the corresponding write address/control transactions, so a single 1-bit FIFO is sufficient
to store this information. However, read data transactions must be returned for a specific transaction
ID value, but can be returned out of order for transactions with different ID values. The read data
transactions need separate FIFOs for each possible ID value, so 2^AXI_ID_WIDTH 1-bit FIFOs are needed
for a functioning override system for the read channel.
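One way to track which outstanding transactions were issued under an override, following the FIFO arrangement described above, is sketched below in Python (the ID handling is simplified and the names are illustrative):

from collections import deque

class OverrideTracker:
    """Remembers whether each outstanding burst was issued via the override path."""

    def __init__(self, axi_id_width):
        self.wr_fifo = deque()                                       # writes stay in order
        self.rd_fifos = [deque() for _ in range(1 << axi_id_width)]  # one FIFO per read ID

    def aw_accepted(self, overridden):
        self.wr_fifo.append(overridden)

    def ar_accepted(self, arid, overridden):
        self.rd_fifos[arid].append(overridden)

    def write_beat(self, wlast):
        # Credits are only redeposited if the burst was NOT issued via an override
        redeposit = not self.wr_fifo[0]
        if wlast:
            self.wr_fifo.popleft()
        return redeposit

    def read_beat(self, rid, rlast):
        redeposit = not self.rd_fifos[rid][0]
        if rlast:
            self.rd_fifos[rid].popleft()
        return redeposit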
4.2.4 Limitations for SDRAM Systems
The total available bandwidth of an SDRAM based memory system is highly dependent on the memory
access pattern. For example, the Xilinx memory controller described in [47] has a stated memory bus
utilization efficiency of 94 percent for sequential read accesses, but only 24 percent for random-addressed
alternating read and write transactions. The sources of inefficiency are summarized in [61], which divides
these inefficiencies into five distinct components: refresh efficiency, which represents the total possible
bus utilization efficiency considering that some number of cycles must be stalled while the DDR memory
undergoes a refresh operation; command efficiency, which encapsulates all inefficiencies as a result of
limitations in the commands that must be issued and the order in which they can be issued; data
efficiency, which refers to the amount of data read/written that goes unused (newer generation DDR
memories access data in bursts, e.g., burst size of four for DDR2 memories and burst size of eight for
DDR3 and DDR4 memories); bank-efficiency, which refers to the inefficiencies created from accesses to
banks in which the requested row is not open (i.e., a page miss); and read-write switching efficiency,
which deals with the increased latency required when back-to-back transactions are of different types
(i.e., a read following a write or a write following a read).
Refresh inefficiencies are not avoidable with DDR memories, the command efficiency is completely a
product of the memory controller design (i.e., outside the influence of the Hardware Application users,
and therefore not of import in this thesis), and the data efficiency is already considered in the current
shell design; the Xilinx memory controller [47] performs fixed bursts of size eight, and the bandwidth
throttler of Section 4.2.2 deducts tokens for the entire transfer whether or not the data is used. In terms
of performance isolation, the relevant efficiencies that determine the effect one Hardware Application
has on the others are the bank efficiency and the read-write switching efficiency.
Software performance isolation solutions have run into the same problem. The work presented in [62]
presents a system whereby a multi-core CPU system allocates bandwidth to each core of the system.
To get past the problem of less than ideal bandwidth availability due to inefficiencies, the total of the
bandwidth that would be assigned to the cores is equal to the guaranteed bandwidth of the memory
system, considering the least efficient use of the memory resource. In that work for example the guaran-
teed bandwidth was equal to 1.2GB/s, compared to the peak bandwidth of 6.4 GB/s (about 19 percent
of the peak bandwidth is guaranteed compared to the 24 percent of our use-case). All excess bandwidth
is allocated to cores on the basis of need (i.e., the cores compete for the excess bandwidth).
Such a system can be applied to the FPGA virtualization solution presented in this thesis. The sum
of all ρ values, across all Hardware Applications’ read and write ports, can be limited to the minimum
efficiency of the memory resource, 0.24 in this case (note, the ρ values are decimal quantities). The
remaining excess bandwidth can then be split amongst the requesters using the bandwidth conservation
mechanism presented in Section 4.2.3. This is not an ideal solution, since many high-performance
applications with efficient memory access patterns may want to reserve bandwidth greater than the
guaranteed bandwidth. Design of more complicated bandwidth throttlers, that consider the memory
access pattern of the requester, is left to future work.
As a potential remedy, the FPGA management framework can assign bandwidth greater than the
guaranteed bandwidth and monitor the resultant utilization rate to ensure the memory access pattern
does not result in oversubscription of the memory data bus, though this system would require periodic
monitoring of the bandwidth utilization to ensure it is not abused by a malicious actor. Figure 4.9
depicts the design of a utilization monitor based on the exponential weighted moving average [63], the
formula for which follows:
Avg_exp(t) = Avg_exp(0)                              (t = 0)
Avg_exp(t) = α · In(t) + (1 − α) · Avg_exp(t − 1)    (t > 0)
Figure 4.9: AXI4 Memory Utilization Monitor
where Avg_exp(0) is the initialization value, α is a parameter less than 1, and In(t) is the input value of the time series to average. With careful selection of the parameter α, this type of average can be implemented using only shifts and adds, without any expensive multiplications or divisions. If a value for α is chosen as 1/1024 for example, the formula simplifies to:

Avg_exp(t) = Avg_exp(t − 1) + (In(t) >> 10) − (Avg_exp(t − 1) >> 10)

This can be done for any α that is a power of two less than one (i.e., two to a negative exponent), which simplifies to a shift of n bits where the power of 2 is expressed as 2^−n. The average can be represented as a fixed point decimal number with a single integer bit and 2 × n decimal bits (this is needed to prevent
a loss of precision in the subtraction). The bandwidth utilization monitor of Figure 4.9 implements such
an average calculator, though with two separate inputs that indicate, respectively, that a valid write data beat or a valid read data beat has been accepted. Since the downstream
memory controller cannot issue read and write beats at the same time, the bus should on average only
have one of these inputs active at a time. This utilization monitor is included after the interconnect
depicted in part (b) of Figure 4.7 to continuously monitor the total granted bandwidth of the memory
data bus.
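A behavioural model of this shift-based moving average, with α = 2^−n as above, could be written as follows (Python; the fixed-point widths follow the description above and the class name is illustrative):

class UtilizationMonitor:
    """Exponentially weighted moving average of data-bus activity using only shifts."""

    def __init__(self, n=10):
        self.n = n                    # alpha = 2^-n, e.g. n = 10 gives alpha = 1/1024
        self.frac_bits = 2 * n        # one integer bit and 2*n fractional bits
        self.avg = 0                  # Avg_exp held as a fixed-point integer

    def cycle(self, write_beat_accepted, read_beat_accepted):
        # The controller cannot issue read and write beats simultaneously, so the
        # sample is 1 whenever either beat is accepted and 0 otherwise.
        sample = 1 if (write_beat_accepted or read_beat_accepted) else 0
        sample_fp = sample << self.frac_bits
        # Avg(t) = Avg(t-1) + (In(t) >> n) - (Avg(t-1) >> n)
        self.avg += (sample_fp >> self.n) - (self.avg >> self.n)
        return self.avg / (1 << self.frac_bits)   # current utilization estimate, 0..1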
Table 4.1: Bandwidth Throttling Performance
Incrementing Read Bursts             Random & Narrow Accesses
ρ Assigned    Reclaimed BW           ρ Assigned    Reclaimed BW
0%            96.5%                  0%            96.5%
25%           70.8%                  6.25%         47.5%
50%           47.1%                  12.5%         6.8%
75%           22.1%                  25%           0%
100%          0%                     50%           0%
4.2.5 Bandwidth Limiting Performance Evaluation
To test the performance of the bandwidth throttlers, one application’s share of memory bandwidth must
be monitored while another application tries to spam the interconnect. In this situation, the bandwidth
throttlers should limit the ability of the second actor to use an excessive amount of memory bandwidth to
starve the first actor. For this setup, we include two applications that are continuously requesting access
to the memory bus. The first application is unthrottled but is connected to a lower priority connection
on the interconnect; this actor will only be granted access to the bus when the second application
is not requesting access to the bus, or is blocked by the bandwidth throttler. This first application
performs exclusively read accesses from the same memory address to model a high efficiency memory
access pattern. The second application’s access pattern is varied as part of this evaluation. The total
utilization of the first application is monitored. Note, both applications are simply micro-benchmarks
developed for this monitoring purpose; no other functionality is included.
The results of this experiment are shown in Table 4.1. The “ρ Assigned” column represents the amount of bandwidth assigned to the spamming interface, and the second column indicates the amount of bandwidth that was reclaimed by the unthrottled interface. For the spamming interface, two different
modes of sending were tested. In the first test, the spamming interface would send efficient memory
accesses (large bursts to the same address). In the second set of tests, the spamming interface sent
randomly addressed memory accesses with a small burst size. As expected, this latter test case is able to
saturate the entire bandwidth of the interconnect with only a 25 percent allowance (i.e., ρ value). While
the bandwidth throttler is effective at splitting memory bandwidth among efficient requesters, inefficient
accesses to memory can slow down the whole system. Just like in the software realm, the best solution
is to assign bandwidth based on the known guaranteed memory bandwidth; any excess bandwidth can
be reclaimed. An effective FPGA management framework should continually monitor the bandwidth to
ensure that the Hardware Applications are not being starved.
Figure 4.10: Adding MMU to the Shell. (a) Shell without MMU; (b) Shell with added MMU.
4.3 Memory Management Unit Design
The previous subsections dealt with the performance isolation of the memory resource, i.e., how memory
resources are shared between the co-resident Hardware Applications to ensure some level of performance
can be guaranteed to the Hardware Applications. The MMU however is needed to guarantee data
isolation, so that memory assigned to one specific Hardware Application is inaccessible by other Hardware
Applications. The MMU is not a new concept and has been implemented in CPU devices to ensure
memory isolation between distinct processes [28]. Two different MMU designs are explored in this thesis work to determine their applicability to a multi-tenant FPGA deployment: the base-and-bounds MMU and the paged MMU, presented in Sections 4.3.1 and 4.3.2 respectively.
To illustrate the intention of the work described in this subsection, see Figure 4.10. In part (a)
of the figure, we depict the shell described so far, with memory decoupling, protocol verification, and
bandwidth throttling implemented. Part (b) of the figure demonstrates how the inclusion of an MMU
modifies the shell depicted in part (a); an MMU sits between the master port (i.e., the request issuing
port) of the interconnect network and the memory controller. An additional interconnect is added
between the master port of the MMU and the memory controller. While this new interconnect includes
only one master and one slave port, it also includes a range of unmapped addresses that the MMU can
target if the memory access request is determined to be errant (i.e., to a region of the memory that
should be inaccessible by the requester). The Xilinx memory interconnect instantiates a dummy slave
device within the interconnect that receives all requests addressed to unmapped addresses [53]. The
MMU is also connected to the PCIe management network.
Figure 4.11: Base and Bounds MMU Design
4.3.1 Base-and-Bounds MMU Design
The base-and-bounds MMU design simply compares incoming addresses to some limit to determine
whether the access is out-of-bounds, and then adds the base address to the incoming address to shift
that Hardware Application’s memory access to some pre-assigned location in memory. This design is not
presented as new work in this thesis, as indeed it is an older design as described in [64]; its description
is included here for clarity.
The base-and-bounds MMU designed for inclusion in the shell is depicted in Figure 4.11 (only a single
channel of request remapping is shown; the logic would be duplicated for the read and write channels). The
incoming address is first compared to the bound register to determine whether it is an errant access,
and then the address is added to the base address to get the effective physical address to access. If the
access is determined to be errant, the most significant bit of the output address is set; this is used to
indicate an error to the downstream interconnect, as that range of addresses is unmapped. The error
bit of the address is also set if a 4k boundary error was indicated at the protocol verifier stage (see
Section 4.1.4 for details), which prevents accesses that cross the 4k boundary from propagating to the
memory controller. The incoming VIID of the request is used to index into a memory containing base
and bound values, so these values can be set independently for each Hardware Application and indeed
each independent memory port within a Hardware Application. This design presents a zero-base address
to each individual memory port while mapping the accesses to physically different parts of the memory.
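A behavioural sketch of the per-VIID base-and-bounds translation follows (Python; the table contents, signal names and error-signalling convention are illustrative assumptions rather than the actual RTL):

class BaseBoundsMMU:
    """Base-and-bounds remapping for one request channel, indexed by VIID."""

    def __init__(self, addr_bits, base_table, bound_table):
        self.addr_bits = addr_bits
        self.base = base_table      # VIID -> physical base address of the region
        self.bound = bound_table    # VIID -> size of the region assigned to the VIID

    def translate(self, viid, addr, boundary_error=False):
        out_of_bounds = addr >= self.bound[viid]
        phys = addr + self.base[viid]
        if out_of_bounds or boundary_error:
            # Set the MSB so the request lands in the unmapped range and is
            # absorbed by the dummy slave in the downstream interconnect.
            phys |= 1 << (self.addr_bits - 1)
        return phys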
4.3.2 Coarse-Grained Paged MMU Design
Page-based MMUs are common in modern CPUs; they implement the memory virtualization solution
presented in Section 2.2.3. This solution divides the entire memory space into segments of equal size
(in most cases) that can be assigned to the Hardware Applications and the memory interfaces therein.
Multiple of these pages can be assigned to a single memory interface, with some number of the most
significant bits of the address used to index a page-mapping table, which stores a remapping value and a
bit to indicate whether or not the mapping is valid. The bits used to index the table are replaced with the
remapping value, resulting in the virtual to physical address translation depicted in Figure 2.3. As with
the base-and-bounds MMU, the error bit from the mapping table and the error bit forwarded from
the AXI4 protocol verifier (that indicates a 4k boundary crossing) are used to set the most significant
bit of the output address, resulting in an output address that is not mapped to the memory controller
and is handled by a dummy AXI4 slave instead.
If N bits are remapped using a page-based MMU (page size of 2^(ADDR_WIDTH−N), and 2^N pages), the mapping table would need to be (N + 1) bits wide and 2^N entries deep. For many page-based MMU systems, the page size tends to be relatively small (compared to the total memory size) and thus the page-mapping table can be quite large. For example, the AXI4 standard specifies a 4k boundary for page mapping (hence the 4k boundary crossing assertion); in a 32-bit (4GB) memory system, this would result in 20 bits remapped, requiring a table of about 1 million entries of width 21 bits. A unique
page table is also needed for each VIID value, as every VIID value has a unique mapping, which further
increases the storage need for page-mapping tables. Rather than storing these page tables in dedicated
hardware structures, the page tables are stored in the CPU-system’s memory and caching structures are
included in the MMU to reduce the latency of access.
For a multi-tenant FPGA deployment where memory virtualization is needed, any large page-table
structure would likely also need to be stored in the memory itself, which would add latency to any
accesses of memory. Caching structures could also be implemented in this case, however associative
structures tend to be expensive to implement on FPGAs. The need for such large page-tables structures
is less obvious for FPGA deployments however. First, multi-tenant FPGA deployments would be limited
in the number of memory interfaces that need simultaneous access to the memory, since the number
of Hardware Applications would be limited by the spatial constraints of the device itself. This is in
contrast to CPU systems, where preemption means that many software processes can be active on the
device even if they are not spatially located in an executing core. And second, many applications that
target FPGAs tend to reserve memory in large contiguous chunks (e.g., Neural Nets that store a large
amount of weights and activations).

Figure 4.12: On-Chip Coarse Grained MMU Design
In this thesis work, we implement a coarse-grained paged MMU. The term “coarse-grained” refers
to the fact that the size of the pages is quite large relative to the size of the memory itself. In this case,
the entire page-table structure, which includes mappings for all VIIDs, can be stored in on-chip BRAM resources with little impediment to the progress of the read and write transactions (outside the latency of access to the BRAM itself). Illustrating with an example, for the AlphaData 8k5 board, an 8GB off-chip memory is attached; if this memory were to be split into coarse-grained pages of 64MB instead of 4kB, and the total number of unique VIIDs is limited to 32 (e.g., four separate Hardware Applications with eight possible interfaces per application), the required size of the entire page-table structure would be a depth of 2^(33−26) × 32 = 4096 and a bit width of (33 − 26) + 1 = 8, which could easily fit into on-chip BRAM. This does of course come with the trade-off that the memory is only divided into 128 pages, so the mapping is a little more limited, though flexibility can be bought at the expense of more BRAM resources by increasing the page-table size. Figure 4.12 shows the implementation of a coarse-grained paged MMU. The figure shows only a single channel of the MMU; the full MMU would have a separate page table and mapping logic for both read and write channels.
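The lookup itself is a single table read; a behavioural Python sketch follows (the sizes match the 8GB/64MB example above when addr_bits = 33, index_bits = 7 and viid_bits = 5; all names and the error-signalling convention are illustrative assumptions):

class CoarsePagedMMU:
    """Coarse-grained page-table translation for one request channel."""

    def __init__(self, addr_bits, index_bits, viid_bits):
        self.addr_bits = addr_bits
        self.index_bits = index_bits      # number of address MSBs that are remapped
        # table[viid][virtual page] = (valid, physical page number)
        self.table = [[(False, 0)] * (1 << index_bits)
                      for _ in range(1 << viid_bits)]

    def translate(self, viid, addr, boundary_error=False):
        offset_bits = self.addr_bits - self.index_bits          # log2(page size)
        vpage = addr >> offset_bits
        offset = addr & ((1 << offset_bits) - 1)
        valid, ppage = self.table[viid][vpage]
        phys = (ppage << offset_bits) | offset
        if (not valid) or boundary_error:
            phys |= 1 << (self.addr_bits - 1)   # route to the unmapped/dummy range
        return phys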
4.4 Memory Virtualizing Shell Overhead Evaluation
Virtualizing any compute resource introduces some overhead, and in the realm of FPGAs that overhead
generally takes the form of area utilization. To evaluate the proposed securitization solutions, we implement them into an FPGA shell and measure this overhead. If the overhead is sufficiently small, the case for virtualizing FPGAs in datacentre deployments is made stronger. In this section, we evaluate our proposed solution in terms of area overhead.

Table 4.2: Shared Memory Secured Shell Utilization
Shell Type             LUT (Num / % Incr)   LUTRAM (Num / % Incr)   FF (Num / % Incr)    BRAM (Num / % Incr)   DSP (Num / % Incr)
No Security Incl.      65,243 / –           6,192 / –               82,928 / –           94.5 / –              3 / –
Add Decouplers         66,159 / 1.4%        6,224 / 0.5%            86,193 / 3.9%        94.5 / 0%             3 / 0%
Add BW Throttlers      67,583 / 2.2%        6,224 / 0%              89,357 / 3.7%        94.5 / 0%             3 / 0%
Add Base+Bound MMU     72,722 / 7.6%        6,417 / 3.1%            99,786 / 11.7%       94.5 / 0%             3 / 0%
Switch to Paged MMU    70,731 / -2.7%       7,344 / 14.4%           96,144 / -3.6%       94.5 / 0%             3 / 0%
4.4.1 Building up the Secure Shell
To determine the overhead of the various components of the memory-securing shell, features can be
added incrementally to see their impact. As a first step, a shell without any of the discussed memory
security features was implemented on the Alpha Data 8k5 FPGA board. This shell implementation
resembled that of part (a) of Figure 4.1. The board includes a Xilinx Kintex Ultrascale XCKU115 with
DDR4 attached off-chip memory. All tests were done using the Xilinx Vivado 2018.1 software, and the
associated versions of the PCIe Subsystem and Memory Controller cores.
Table 4.2 lists the utilization of the shell developed with each of the security components added incrementally. A second column in the table indicates the increase in utilization from the previous shell iteration. The first shell simply implements a memory controller and a shared interconnect, with no technology to manage effective sharing of that interconnect. Note, this shell includes in its synthesis simple applications in each Hardware Application Region to put a realistic routing stress on the Place and Route tools. The Hardware Application is the same one as that used for the testing of the bandwidth
throttlers earlier in this chapter. The first entry of Table 4.2 shows the area utilization of the shell
and these simple Hardware Applications. Note, all of the shells described in this section include four
Hardware Applications with three VIID bits each, for a total of 32 managed logical connections. Note,
Table 4.3 includes the same information except expressed as a percentage of the Kintex Ultrascale
XCKU115’s total resources.
The next entries show different components added to the shell design. The first new entry includes
decouplers and protocol verifiers in the design. From this we can see that the overhead for including AXI4 protocol verifiers is in the range of 1.4 percent to 3.9 percent, depending on which specific resource you
consider. Further adding bandwidth throttlers to the shell, to implement performance isolation, needs 2.2 percent more LUTs and 3.7 percent more flip flops. Finally, the addition of a memory management unit, to perform data isolation, adds a further 7-11 percent to the area utilization of the shell design. The page-based MMU uses many more LUTRAMs than the base-and-bounds MMU, which is expected considering that the page tables take up more memory.

Table 4.3: Shared Memory Secured Shell Utilization (Percentage)
Shell Type             LUT (%)   LUTRAM (%)   FF (%)   BRAM (%)   DSP (%)
No Security Incl.      9.84%     2.11%        6.25%    4.38%      0.05%
Add Decouplers         9.97%     2.12%        6.50%    4.38%      0.05%
Add BW Throttlers      10.19%    2.12%        6.74%    4.38%      0.05%
Add Base+Bound MMU     10.96%    2.18%        7.52%    4.38%      0.05%
Switch to Paged MMU    10.66%    2.50%        7.25%    4.38%      0.05%
We also note from Table 4.3 that the overall area utilization of the shell is not very significant,
though this excludes any networking connectivity. Only about 10 percent of the chip resources are
needed to implement any kind of protocol verification or 11 percent to implement data isolation. These
utilization numbers also include the small Hardware Applications, as they are synthesized with the shell, so the actual area utilization needed is lower. Such a low area overhead is well suited to deployment in a virtualization environment, since higher overheads deplete the FPGA of resources that could have been used by another Hardware Application. While the total area utilization is fairly low, these memory-based solutions tend to induce a lot of routing strain on the system, and meeting the timing requirements
of the system necessitated retiming and high-effort synthesis.
4.4.2 Latency Impact
Introducing the isolation components to virtualize the memory interfaces also introduces some latency to the shell design. These latencies are summarized in Table 4.4. All of
the latencies are normalized to the shell that implements no isolation, with the table displaying the
additional number of cycles of latency added over that shell design. Adding the decouplers and protocol
verifiers only adds a single cycle of latency for all of the output channels. This single cycle of latency
is added because the protocol verifiers must insert a registering stage to avoid transactions that have
changing values while the valid signal is held high and the ready signal has not yet been asserted, as
described in Section 4.1.4. The channels that are driven from the memory controller to the Hardware Applications, the read and write response channels, do not need this additional registering stage.
Bandwidth throttlers have a more significant effect on the amount of latency introduced into the
system.

Table 4.4: Latency Increase Per AXI Channel for Shared Memory Secured Shell
Shell Type           AW Channel   W Channel   B Channel   AR Channel   R Channel
No Security Incl.    –            –           –           –            –
Add Decouplers       +1 cycle     +1 cycle    +0 cycles   +1 cycle     +0 cycles
Add BW Throttlers    +4 cycles    +4 cycles   +3 cycles   +4 cycles    +0 cycles
Add Paged MMU        +6 cycles    +4 cycles   +3 cycles   +6 cycles    +1 cycle

Three additional cycles of latency are added to every channel except for the read data response
channel. Note, the addition of the bandwidth throttler itself does not add any cycles of latency, since
the bandwidth throttler is implemented as combinational logic. All of the additional cycles of latency
are added through the inclusion of register slices, which were required for the updated shell design to
meet the system timing requirements. The bandwidth throttlers put a significant routing stress on the
shell, which necessitated the inclusion of three separate register slice stages (at different points in the
data path). Note, the register slices were not required for the read channel, so no latency was introduced
on received data.
Finally, a paged-based MMU was added and configured to use pages that were 64MB in size. The
paged MMU includes a single cycle of latency in the AW and AR channels. The additional cycles of
latency on the AW and AR channels not accounted for in the MMU design, and the added cycle of
latency for the R channel, were added because of an additional register slice. Note, the MMU does not
impact the read data path directly (it simply passes read responses through without alteration), so the
register slice required on the R Channel was necessary simply from the increased routing stress put on
the entire shell with the inclusion of the MMU.
4.4.3 Paged-MMU Size Comparisons
One of the main changes to a traditional page-based MMU versus the one described in this work is the
coarse-grained nature of the pages. Each page could be up to 128 MB in size, which would leave the 8GB
memory divided into only about 64 pages; for a system with 32 unique VIIDs (e.g., the work presented in this thesis), that leaves an average of two pages per VIID.
large pages are likely sufficient for FPGAs, given the smaller number of concurrent applications and the
big-data nature of many FPGA targeted applications, this may not be true for all circumstances or even
in the future as more applications are ported to FPGAs. In this subsection, we analyze the actual area
tradeoff involved in varying the size of the pages for an on-chip page-based MMU.
The shells studied in the previous subsection served as the platform for this evaluation. As such, it is
necessary to compare the results against each other, since the nominal values themselves also include the
small Hardware Applications themselves.

Table 4.5: Shell Utilization as a Function of Page Size in MMU
Page Size        LUT (%)   LUTRAM (%)   FF (%)   BRAM (%)   DSP (%)
128 MB Pages     10.54%    2.28%        7.24%    4.38%      0.05%
64 MB Pages      10.66%    2.50%        7.25%    4.38%      0.05%
32 MB Pages      10.70%    2.59%        7.25%    4.38%      0.05%
16 MB Pages      11.17%    3.50%        7.25%    4.38%      0.05%
8 MB Pages       11.93%    4.90%        7.27%    4.38%      0.05%
4 MB Pages       13.47%    7.69%        7.26%    4.38%      0.05%

Table 4.5 shows a breakdown of the area utilization needed
to implement coarse-grained MMUs with various page sizes. As the page size decreases, the amount
of FPGA area resources needed increases in turn, namely the amount of LUTRAMs needed for the
solution. This does make intuitive sense, since the main cost in reducing the page size is storage space. In any case, we find that even as the page size decreases to about 4 MB, the total utilization by the shell components is not greatly affected.
4.5 Multi-Channel Memory Considerations
The memory virtualization solutions discussed thus far have only considered a single independent memory
channel. Many FPGA platforms include multiple memory channels to increase the total effective bandwidth
of external memory. This section considers the design decisions to be made in extending the previous
concepts of this chapter to a multi memory channel platform, introducing a few different paradigms for
including multiple memory channels. Note, the Alpha Data 8k5 FPGA board used in this work includes
two separate DDR4 memory channels.
4.5.1 Separately Managed Channels
The simplest way to virtualize multiple memory channels is to separate the channels and attach each to
some fraction of the Hardware Application Regions, i.e., each Hardware Application Region is connected
to a single memory channel. This solution is depicted in Figure 4.13. Separately managed memory
channels do not introduce any increased complexity over the solutions presented earlier in this chapter,
since each memory channel would simply have the performance and data isolation of a single-channel
system. This solution is not explicitly evaluated in this thesis, but it is included here for completeness.
Figure 4.13: Multi-Channel Organization with Separately Managed Channels
Figure 4.14: Multi-Channel Organization with Single Shared MMU
4.5.2 Single Shared MMU
A shared MMU system is depicted in Figure 4.14. In this system, each Hardware Application has a
single top-level memory interface (i.e., the interface that is at the PR region boundary) that connects to
protocol decouplers and verifiers, bandwidth throttlers, and a single MMU, just as in the single-channel
solutions. The difference is that the single MMU is connected at its master (request issuing) side to
an interconnect that can route requests to any of the memory channels (two channels are depicted in
Figure 4.14). In other words, the memory spaces of the memory channels are logically concatenated and
the single MMU serves this concatenated memory space.
If the data width of the interface presented to the Hardware Applications is equal to the width
of a single memory channel, the post-MMU interconnect can be connected directly to the memory
controllers for each of the memory channels; however, this limits the system to just a fraction of the
total memory bandwidth available (e.g., one half for two memory channels and one quarter for four
memory channels). To use the entire available bandwidth, the data width of the interface presented to
the Hardware Applications must be at least the number of channels multiplied by the data width of a
single memory channel. In this case, there must be a data width converter inserted between the post-MMU interconnect and the memory channels, specifically a data width downsizer for write data
received and a data width upsizer for read data returned.
For the write data interface, a downsizer on its own would exert back-pressure on the interconnect
preventing it from sending data at the full bandwidth speed (since a downsizer cannot accept a new
data beat every cycle). To prevent write requests from throttling the performance of the entire system,
a write data buffer must be included for each memory channel. The post-MMU interconnect can simply
write data to these buffers and not be throttled by the back-pressure of the data width downsizers. For
the read channels, the data width upsizers would not have new data available every cycle, as they would
have to wait for multiple read data beats to pack into one larger read data beat. The AXI4 protocol,
however, allows for read data to be interleaved, i.e., the interconnect can interchangeably read data from
different channels and send them upstream out of order. There is no buffering requirement for the read
data channel. These data width converters and write data buffers are shown in Figure 4.14.
This shared MMU solution is simple and requires relatively few changes from the single memory
channel system, but it does present some potential problems. Memory controllers implemented on
FPGAs tend to have wide data widths already because of the relatively slower clock speeds achievable in
FPGA fabric relative to the ASIC devices (e.g., CPUs) for which off-chip memory solutions are generally
designed. The data width must be increased at the same ratio that the clock speed is reduced between
the memory device itself and the FPGA fabric clock (e.g., if the clock is reduced to 1/4 of the memory clock, the data width must be increased four-fold). Introducing multiple memory channels widens that data width even
further, and that might present timing challenges to the Hardware Applications. For example, the Xilinx
memory controller in [47] requires a four-fold data width increase, resulting in a native data width of 256
bits for the memory controller, which would increase to a 512-bit data width in the AXI interconnect
for two memory channels and a 1024-bit data width for four memory channels.
Figure 4.15: Multi-Channel Organization with Parallel Shared MMUs

A further complication actually limits the effectiveness of the performance isolation in a shared MMU solution. The bandwidth throttlers operate on the AXI interface presented to the Hardware Application itself, with no knowledge of the future memory channel that the request will eventually target. If all of the Hardware Applications try to target the same memory channel (assuming the MMU assignment allows
such), the memory bus’ bandwidth would be effectively limited to the bandwidth of that single memory
channel. The only way to ensure performance isolation would be to isolate Hardware Applications to a
single memory channel, which would somewhat defeat the purpose of the multi-channel memory solution.
If any memory channel has memory that is assigned to two or more Hardware Applications, there is a
potential for malicious, or even unintentional, bandwidth limiting for those Hardware Applications.
4.5.3 Parallel MMUs with a Single Port
To overcome the problem in performance isolation for a shared MMU solution, each Hardware Applica-
tion can have a first stage MMU that simply indicates which memory channel the request is to access,
and this information can be used to reroute that request to the correct memory channel. An interconnect
can follow this first stage MMU and map requests to the memory channel indicated by the first stage
MMU. Separate bandwidth throttlers can then be instantiated at the output of this first interconnection
network, that would effectively throttle the bandwidth between each Hardware Application and memory
channel pairing individually. This arrangement is depicted in Figure 4.15.
If the system uses a base-and-bounds MMU design, the first stage MMU would simply be a table
indexed by the VIID of the requester interface and indicate which memory channel that VIID is mapped
to. The most significant bits of the address, which indicate the memory channel, would be replaced with
this stored value. If the system uses a coarse-grained paged MMU, this first stage MMU’s page-table
would be indexed by the same bits of the address (in addition to the VIID) as the later stage MMU,
containing the same number of entries as the portion of the later stage MMU’s page table assigned to
that specific Hardware Application Region. However, the mapping value stored in the page-table simply
indicates the memory channel that page is mapped to, so only the most significant bits of the address,
that indicate the memory channel, would be replaced with this stored value. The remainder of the
mapping would be stored in the second stage MMU’s page table.
This first MMU and interconnect can also handle out-of-bounds accesses, freeing downstream components from wasting bandwidth on useless transactions. This would also mean that bandwidth credits in
the downstream bandwidth throttler are not consumed by out-of-bounds accesses. For a coarse-grained
paged MMU system, the first stage MMU would indicate the validity of a page mapping and act on
4k boundary crossing errors, while the second stage MMU could safely assume all mappings are valid
and ignore 4k boundary crossings. For a base-and-bounds MMU system, the first stage MMU would
deal with the bound check and 4k boundary crossing errors in addition to storing the mapped memory
channel for each VIID, and the second stage could safely ignore any errors and simply add the base
component.
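To make the division of work between the two stages concrete, the following is a minimal behavioural sketch in Python of the base-and-bounds variant (the thesis components themselves are implemented in HDL); the table names, the address width, and the two-channel assumption are purely illustrative.

    # Behavioural sketch of a two-stage base-and-bounds translation (hypothetical parameters).
    CHANNEL_BITS = 1                 # assume two memory channels, so one channel-select bit
    ADDR_BITS = 33                   # assumed total physical address width

    def stage1_translate(addr, viid, channel_table, bound_table):
        """Stage 1: per-VIID bound check, then overwrite the channel-select bits."""
        if addr >= bound_table[viid]:
            raise ValueError("out-of-bounds access; handled before the interconnect")
        offset_bits = ADDR_BITS - CHANNEL_BITS
        offset = addr & ((1 << offset_bits) - 1)
        return (channel_table[viid] << offset_bits) | offset

    def stage2_translate(addr, viid, base_table):
        """Stage 2: add the per-VIID base; stage 1 has already handled all error cases."""
        return addr + base_table[viid]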
In this MMU arrangement, since requests are already separated by a targeted memory channel to
perform performance isolation, those separated request streams need only be forwarded to that memory
channel. Thus, each memory channel can have a separate MMU that only handles requests targeting
that memory channel; we term this MMU system “Parallel MMUs with a Single Port” because each
memory channel has an individual MMU and each Hardware Application Region has a single port.
In the single shared MMU approach, the data width of the memory interface presented to the
Hardware Application has to be wider to allow for the full memory bandwidth of the attached memory
to be realized. In this case, there is no bottleneck at a single MMU, so the interface width can be
smaller. In fact, the interface width presented to the Hardware Applications would only limit the
maximum amount of bandwidth that could be assigned to the Hardware Application, and the total
system bandwidth might not be impacted. For example, if a system has two memory controllers with a
256-bit data width, and each memory interface port at the PR boundary to two Hardware Applications
also has a 256-bit data-width, the full bandwidth of the system could still be used as long as the access
patterns of the Hardware Applications are efficiently mapped across the memory channels. Note, a wider
data interface at the Hardware Application would still be required if the system might want to assign
more than the bandwidth of a single memory channel to any Hardware Application.
The arrangement depicted in Figure 4.15 includes a wider memory access interface and thus also
includes data width converters and write channel buffers before the memory channels. This arrangement
would require all of the MMUs and interconnects to have larger data widths as well. These data
width converters and write channel buffers could however be included immediately before the bandwidth
throttlers, as indicated in the modified arrangement shown in Figure 4.16. This would reduce the size
Figure 4.16: Multi-Channel Organization with Parallel Shared MMUs (modified)
of the downstream interconnect and MMUs, but would require data width converters for each Hardware
Application Region. We term this arrangement the “Parallel MMUs with a Single Port (modified)”.
4.5.4 Parallel MMUs with Multiple Ports
Looking at the modified parallel MMUs arrangement, much of the infrastructure located before the
bandwidth throttlers could be included in a soft shell implementation and need not necessarily be
included in the static hard shell implementation. In essence, what this would do is implement a parallel
MMUs system with multiple ports presented at the PR interface to the Hardware Application Region.
Each port would correspond to a separate memory channel. This is shown in Figure 4.17, which is
essentially the same as Figure 4.16 except with the protocol decouplers and verifiers duplicated and
moved to just before the bandwidth throttlers, and the other components moved inside the soft shell.
The first stage MMU would then be connected to the management framework through the management
connection of the soft shell.
The advantage of this arrangement is that the interconnect instantiated within the soft shell can
be made just large enough to accommodate the largest memory interface needed inside the soft shell.
For example, if a particular Hardware Application needed only memory interfaces of width 64-bits, that
interconnect and first stage MMU could be limited to 64-bits with a data width upsizer included at the
memory interface. Note, since the bandwidth throttler included in this thesis penalizes requesters for
gaps in data transmission and acceptance, and a data-width upsizer would induce such gaps, some buffering of
write requests until enough data has been received would be needed to preserve bandwidth allocations.
Another advantage of this system is that if the memory interfaces within the soft shell use fewer address
Figure 4.17: Multi-Channel Organization with Parallel MMUs and Multiple Ports
bits than the system memory (i.e., they do not need large memory allocations), the first stage MMU’s
depth can be reduced and total FPGA area utilization of the shell (cumulative hard and soft shell
utilization) would also be reduced.
These multi-channel solutions are presented here as a conceptual discussion. The implementation of
such solutions is left to future work.
4.5.5 Multi-Memory Channel Implementations in Previous Works
Most of the previous works described in Chapter 2 include only a single memory channel, similar to the
exploration presented in this thesis. One notable exception is the SDAccel platform created by Xilinx [30].
Specifically, the SDAccel Platform Reference Design described in [65] shows that the Shell implemented
for the SDAccel platform includes four separate memory channels. In that reference platform, the
connections to the off-chip memory are not abstracted through the Shell, but instead presented directly
to the PR region. Since the Shell presented in that work does not have multiple applications, even the
memory controller itself is meant to be implemented within the PR region. This work therefore does
not present a multi-memory channel solution with any kind of virtualization.
One relevant prior work that considers both multiple applications and multi-memory channel de-
ployments is the work presented by Yazdanshenas and Betz [35]. In that work, the overheads associated
with a multi-tenant Shell are explored. In that exploration, multiple memory
channels are considered explicitly. The way that those memory channels are presented to the hardware
applications is consistent with the theoretical solution presented in Section 4.5.4. More specifically, each
of the memory channels is accessed through a separate interface within each application (i.e., each
application has a memory access interface to correspond to each memory channel). However, that work
does not explicitly consider isolation and therefore would not include the parallel MMUs described in
Section 4.5.4.
Chapter 5
Network Interfaces
In this chapter, we switch the focus to securing the sharing of the network interface, which is required for
the direct-connected FPGA deployment model. Network interfaces, particularly Ethernet connectivity,
are provided on many FPGA boards and are often directly supported by FPGA vendors. The Alpha
Data 8k5 FPGA board used in this work for example includes 10 Gbps Ethernet connectivity [46]. Xilinx
provides support for Ethernet ports, including the 10 Gbps port on the Alpha Data device, through its
10G Ethernet Subsystem IP Core [48]. The interface provided to the user for this Xilinx Ethernet
controller is an AXI-Stream interface; while the work presented in this thesis targets the AXI-Stream
interface, this interface is generic enough such that these methods could be applied to other interface
types as well.
In contrast to the solutions that aim at securing memory, presented in Chapter 4, network inter-
faces are connected to the data-centre infrastructure itself, which means activity propagated over these
connections could impact applications beyond the multi-tenant device. In this chapter we analyze the
domain isolation solutions needed to address this problem, as well as discuss how performance isolation
solutions developed for the memory interfaces in Chapter 4 can be extended to the network interface.
5.1 Network Interface Performance Isolation
To institute performance isolation for the network interface, similar stages to those implemented for the
memory channel are required: protocol decoupling, protocol verification, and interconnect bandwidth
throttling. To illustrate the intention of the work described in this section, see Figure 5.1.
In part (a) of the figure, we depict an unsecured shell organization that includes only network
connectivity as an external resource. The multiple Hardware Application Regions include an AXI-
Figure 5.1: Adding Performance Isolation for Networking to the Shell (a) Shell without isolation (b) Shell with added isolation
Stream output port that connects to an AXI-Stream interconnect to arbitrate access to the Xilinx
Ethernet controller. In addition, the Hardware Application Regions include AXI-Stream input ports that
are driven by the output of another AXI-Stream interconnect. Packets that arrive from the Ethernet
controller pass through a component that takes the least significant bits of the packet’s MAC address
as the VIID to determine which interface to route to in the AXI-Stream Interconnect that drives the
AXI-Stream Input ports of the Hardware Application Regions. This component is called a “Simple
NMU” since it manages which interface to route input packets to, though it does not implement any
security features like the NMUs described in Section 5.3. As with the shell for the memory connectivity,
PCIe is included such that a host computer can manage the shell, though in the case of the unsecured
shell there is nothing to manage outside the soft shell components.
Part (b) of the figure indicates how the various performance isolation components modify the simple
unsecured shell depicted in part (a); each of the AXI-Stream input and output ports passes through an
AXI-Stream Protocol Verifier-Decoupler component. The AXI-Stream output ports are connected to
bandwidth throttlers, so that access to the Ethernet output port can be fairly shared amongst the
Hardware Application Regions. Note, no bandwidth regulation is done for input packets since the shell
cannot effectively assert backpressure on the Ethernet input port. All of these performance isolation
components are connected to the PCIe management network.
Figure 5.2: Network Interface Decoupler
5.1.1 AXI-Stream Decoupling
As stated in Section 4.1.3, decouplers are needed so that the Hardware Application Region can be
effectively disconnected from the shared interconnect and Ethernet connection. This could be done to
reprogram the PR region in which the Hardware Application is resident for those deployments where PR
is enabled, or to pause the Hardware Applications for some other reason, such as to prevent packets
from being sent from a particular Hardware Application.
The network connectivity is provided by an AXI-Stream interface, which is a fairly generic interface
providing only a data field (with a strobe value indicating valid bytes), a LAST signal to indicate the
end of a packet, and some handshaking signals. The simple Xilinx decoupler [54] cannot be used in this
case because it might decouple packets midway through transmission, which could cause downstream
components to lock up waiting for the last data beat of a packet. Thus, decoupling activity must be
gated with an indication of whether or not a packet is midway through its transmission. Figure 5.2
shows the implementation of this AXI-Stream decoupler. Outstanding packet trackers (implemented using
procedural HDL code) are used to track whether there is a mid-stream packet for both the input and
output stream directions.
For input packets, the decoupled READY signal is held high so that all packets are seen as accepted
by the downstream interconnect, to prevent the backpressure from locking up the interconnect. Since
packets are pseudo-accepted in this way, there might be a problem if the input AXI-Stream port is un-
decoupled midway through one of these pseudo-accepted packets, as the Hardware Application would
see a partial packet with no way of knowing whether or not it is a complete packet. To prevent this,
the decoupler signal must also be tied to a sticky decouple signal, which simply enables the decoupling
even after the decouple signal has been de-asserted until the pseudo-accepted packet’s transmission is
complete.
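The following Python sketch models the intended behaviour of this decoupler for one pair of egress and ingress streams; the signal names and the exact update ordering are illustrative only and do not reproduce the thesis RTL.

    # Behavioural sketch of the AXI-Stream decoupler (one egress and one ingress stream).
    class StreamDecoupler:
        def __init__(self):
            self.egr_midpacket = False   # outstanding egress packet tracker
            self.sticky = False          # keeps decoupling until a pseudo-accepted packet ends

        def egress(self, decouple, tvalid, tready, tlast):
            gate = decouple and not self.egr_midpacket   # never cut a packet mid-stream
            tvalid_out = tvalid and not gate
            tready_out = tready and not gate
            if tvalid_out and tready_out:                # a beat was transferred
                self.egr_midpacket = not tlast
            return tvalid_out, tready_out

        def ingress(self, decouple, tvalid, tready_app, tlast):
            active = decouple or self.sticky
            tready_out = True if active else tready_app  # pseudo-accept while decoupled
            tvalid_app = tvalid and not active           # hide the beats from the application
            if tvalid and tready_out:
                self.sticky = active and not tlast       # sticky until the packet completes
            return tvalid_app, tready_out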
5.1.2 AXI-Stream Protocol Verification
Unlike the AXI4 Protocol for memory interfaces, the AXI-Stream protocol does not include protocol
assertions that must be met to confirm protocol compliance, because the AXI-Stream interface has little
disallowed behaviour. The only protocol assertion of note that could be inferred for the AXI-Stream
interface is the Handshaking check, i.e., once the VALID signal is asserted, all the signals must maintain
their value until the READY signal indicates that the data beat has been accepted. The Xilinx Ethernet
controller does however impose some additional protocol restrictions, namely that the KEEP signal (the
strobe signal that indicates which data bytes are valid) must be held all high before the last beat is
transferred and that packet transmission cannot include any gaps (i.e., once a packet is started, VALID
cannot be de-asserted until the last beat is transferred). Invalid KEEP values are ignored by the Xilinx
Ethernet controller, but gaps in the transmission can cause packets to be dropped [48]. Finally, the
components implemented in future sections require that the packet be held to a maximum size (this is
often called the Maximum Transmission Unit (MTU)), so packets must assert the LAST signal before
the packet exceeds this size.
The total list of assertions that must be met to ensure that packet transmission is not interrupted
is: the Handshaking check, no gaps in the transmission, and packet size limited to the MTU. Note, the
no-gaps-in-transmission assertion applies to input packets as well, meaning that there can be no gaps in
the acceptance of a packet transmission, but the other assertions apply to the output direction only.
The AXI-Stream protocol verifier design is depicted in Figure 5.3. An outstanding packet tracker
is used to track whether any outgoing packets are midstream; this value is used to override the VALID
signal to ensure there are no gaps in the transmission. Next, a counter is used to keep track of the number
of beats sent for outgoing packets; once the count is equal to one less than the maximum packet size, the
LAST signal is forced high to end the packet. Note, both of these changes could corrupt the packet sent,
but the purpose of the protocol verifier is simply to prevent malformed requests from propagating to
the downstream interconnect, so this is only of concern to the Hardware Application sending malformed
packets. Finally, the AXI-Stream outputs are registered so that the values seen downstream do not change
if the Hardware Application modifies them after the VALID signal has been asserted. For input packets, the
only change required to ensure protocol compliance is the overriding of the READY signal to a high value; the input port must accept all
packets when they arrive and cannot ever assert backpressure to lock up the interconnect.
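As an illustration of these egress-side checks, the sketch below forces VALID high mid-packet and forces LAST at the MTU boundary; the class and parameter names are hypothetical, and the MTU is expressed in beats.

    # Behavioural sketch of the egress protocol checks (MTU given in beats).
    class EgressVerifier:
        def __init__(self, mtu_beats):
            self.mtu_beats = mtu_beats
            self.midpacket = False
            self.beats = 0

        def step(self, tvalid, tlast, tready):
            tvalid_out = tvalid or self.midpacket                     # no gaps once a packet starts
            tlast_out = tlast or (self.beats == self.mtu_beats - 1)   # cap the packet at the MTU
            if tvalid_out and tready:                                 # beat accepted downstream
                self.beats = 0 if tlast_out else self.beats + 1
                self.midpacket = not tlast_out
            return tvalid_out, tlast_out

    # Ingress direction: the only change is the READY override, i.e. ingr_tready_out = 1.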
Figure 5.3: Network Interface Protocol Verifier
5.1.3 Network Interface Bandwidth Throttling
Network bandwidth throttling can be implemented by again using a modified version of the credit-
based accounting system presented in [58]. Since network transmissions cannot be interrupted mid-
transmission, the number of credits needed to initiate the transmission must be the total number of
beats that the transmission might need to use. This is equal to the number of beats that make up
an MTU packet. It is not possible to tell the size of the packet before it has been transmitted, so the
total credits deducted when a new packet transmission is accepted must be equal to the MTU in beats. Once the
end of the packet is reached, credits can be redeposited based on how much shorter the packet is than
the MTU. As a reminder, the original credit accounting mechanism was as follows:
credits(t+1) = credits(t) + ρ − 1    (has bus access)
credits(t+1) = credits(t) + ρ        (no access, but pending requests)
credits(t+1) = σ                     (no pending requests)
Where ρ is a decimal value (less than or equal to one) that represents the proportion of bandwidth
assigned to that interface, and σ represents the burstiness accepted from that interface. This formula-
tion can be adjusted to implement the changes needed for the network interface as follows:
credits(t+1) = credits(t) + ρ − cr_new + cr_last    (pending packet or data to send)
credits(t+1) = σ                                    (no pending packet or data)

where:
cr_new  = MAX_BEATS_MTU    (new packet transmission accepted)
cr_new  = 0                (otherwise)
cr_last = unsent           (last data beat accepted)
cr_last = 0                (otherwise)

unsent(t+1) = MAX_BEATS_MTU − 1    (TLAST and TREADY high, or reset)
unsent(t+1) = unsent(t) − 1        (TREADY high and TVALID high)
unsent(t+1) = unsent(t)            (TREADY not asserted)
The bandwidth throttler implemented based on this formulation is shown in Figure 5.4. As mentioned
in the beginning of this chapter, bandwidth throttling affects only the output AXI-Stream interface.
Again, an outstanding packet tracker is included that prevents decoupling based on the credit count
once a packet has started transmission. The credit count is compared to the MTU to determine whether
or not that interface should be decoupled. The credit update system updates the amount of credits
stored in the credit register whenever a new packet transmission is accepted and/or the last beat of a
transmission is sent.
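A small Python model of this credit update is given below; rho, sigma, and mtu_beats correspond to ρ, σ, and MAX_BEATS_MTU in the formulation above, and the function and variable names are illustrative rather than those of the RTL.

    # Behavioural sketch of the per-interface credit accounting.
    def update_credits(credits, rho, sigma, mtu_beats, unsent,
                       new_packet_accepted, last_beat_accepted, pending):
        if not pending:                                   # nothing to send: reset to the burst allowance
            return sigma
        cr_new = mtu_beats if new_packet_accepted else 0  # charge a full MTU worth of beats up front
        cr_last = unsent if last_beat_accepted else 0     # refund the unused beats when TLAST arrives
        return credits + rho - cr_new + cr_last

    def update_unsent(unsent, mtu_beats, tvalid, tready, tlast, reset=False):
        if reset or (tready and tlast):                   # start of a new refund budget
            return mtu_beats - 1
        if tready and tvalid:                             # one beat actually sent
            return unsent - 1
        return unsent

    # The interface is decoupled while credits < mtu_beats and no packet is already mid-stream.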
Unlike the bandwidth throttling for the memory interface, the total bandwidth available on the
shared network interface is not dependent on the network access pattern. The total bandwidth available
should only be limited by the downstream datacentre switching infrastructure. As such, the network
bandwidth throttling system does not need a bandwidth conserving system like the one introduced for
the memory bandwidth in Section 4.2.3. Instead, the sum of the ρ values should simply be set to the
total bandwidth available in the system.
5.2 Network Security Background
Virtualized FPGA deployments must consider security in the way that Hardware Applications are allowed
to access the shared network. As already mentioned in the introduction to this chapter, this securitization
is required not only to isolate the Hardware Application Regions from each other, but to isolate the rest
of the network from any unwanted accesses from the Hardware Applications themselves. This is what
was termed Domain Isolation in Section 3.2.3. The need for Domain Isolation is not restricted to FPGAs
Figure 5.4: Network Interface Bandwidth Throttler
deployed in the cloud; this security consideration is also necessary for software VMs installed on CPU-
based datacentre nodes. In this section, we discuss the solutions used to provide domain isolation in
other parts of the datacentre.
5.2.1 Software Analogues
In the software domain, the National Institute of Standards and Technology (NIST) details some common
methodologies used to secure access to a shared network by VMs in a virtualized environment [66]. The
main methodology presented is the virtual switch, a fully functional switch implemented in software that
switches traffic from the virtual network connections to the physical network interface and the next-level
physical switch. Distributed virtual switches extend the virtual switch concept by provisioning and
managing virtual switches on multiple physical nodes simultaneously, an avenue that could be explored
for hardware NMU solutions in future work.
Another common network security methodology, according to NIST, is the firewall: devices and/or
security layers within switches or software that filter traffic such that only allowed connections are left to
pass-through to the network. The set of allowed connections is often specified in what are termed Access
Control Lists (ACLs), or alternatively Network Access Control Lists (NACLs). Firewall functionality
can be provisioned using physical appliances installed in the network, through ACLs implemented in
the physical switches of the network, or through firewalls implemented in the virtual switch solutions
mentioned earlier.
For multi-tenant environments, the pushing of ACLs to a physical firewall appliance or the next-level
physical switch is often termed hairpinning, since traffic from the VM is first routed to the physical
appliance and then to its final destination. Note, for such a firewall implementation to work, some
level of source semantics enforcement must be done before routing to the firewall appliance such that
the traffic is uniquely identifiable. Such hairpinning techniques are considered here in this thesis for
analogous hardware solutions.
As a final consideration, virtual networking subdivides the physical network into virtual networks that
can be provisioned to different users and isolated from each other. The simplest form of virtual net-
working is the Virtual Local Area Network (VLAN) tag, IEEE 802.1Q [67]. The VLAN tag includes
a 12-bit virtual ID that allows switches to identify, and isolate packets between, devices on the same
virtual network. Such tagging can often be done by the switches themselves at ingress to the network.
Additionally, network virtualization can be provided using encapsulation-based methods such as VXLAN [68] or
NVGRE [69]; Virtual Tunnel Endpoints (VTEPs), often implemented within virtual switches, perform
the encapsulation and de-encapsulation.
5.2.2 OpenFlow Switching Hardware
In addition to virtual switches implemented on software nodes, hardware network switches can also
be used to implement security for network connected devices. One of the most ubiquitous Hardware
Switch standards is the OpenFlow standard [70]. The OpenFlow standard was specifically introduced
as an open source Software Defined Networking (SDN) solution; SDN describes network deployment
and management solutions that split the data forwarding plane and the control plane. In reference to
security, OpenFlow is relevant because it introduces a format for rules to influence how packets are
Figure 5.5: Example Implementation of an OpenFlow Capable Switch
forwarded or dropped when processed by the OpenFlow switch containing those rules; these rules can
include ACLs to implement security measures on an OpenFlow switch.
Complete OpenFlow switch solutions have been implemented on FPGAs [71] [72], and they can
provide the same level of security afforded to software systems through the use of rules that target
security, such as ACLs, adherence to routing protocols and stateful inspection. However, they consume
significant resources, on the order of 15-36 percent of LUTs and 45-62 percent of BRAMs for the devices
used. This high area overhead indicates that full switch solutions implemented on FPGAs are likely too
large to implement in conjunction with a shell and multiple Hardware Application Regions; alternative
solutions must be sought that minimize the area overhead.
One possible OpenFlow switch solution is shown in Figure 5.5. The packets flow in from the network
inputs to the outputs after they have been processed. When packets arrive at the network input, they
are parsed for key network fields that are required to compare against the rules stored in the OpenFlow
tables. The kinds of fields parsed out from a packet include source and destination MAC addresses,
source and destination IP addresses, port numbers, etc. Once the packet has been parsed, the parsed
fields are sent to a queue, while they wait to be processed by the OpenFlow tables, and the packet itself
is sent to a buffer until its eventual destination is determined.
The OpenFlow table processor pulls parsed packet data from one of the queues waiting to be processed
and compares the fields to the expected field data in each of the OpenFlow rules. OpenFlow rules are
stored in OpenFlow tables, which are implemented as Ternary Content Addressable Memories (TCAMs). If the
parsed packet data matches with a rule stored in the OpenFlow table TCAM, that rule has an associated
action that is used to modify the packet, modify the parsed fields, update some internal switch metrics,
or add some metadata to the parsed fields. The parsed packet data is forwarded through a series of these
OpenFlow tables, matching up to one rule per OpenFlow table. Once the packet has passed through all
of the series of OpenFlow tables, the actions list it has accumulated is implemented by modifying the
packet in the ways specified (e.g. removing a VLAN tag field, or updating some IP field value), and/or
dropping/forwarding the packet to the specified output interface. Note, an OpenFlow switch can send
modified parsed packet data back to the queue to be reprocessed by the OpenFlow tables.
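To illustrate the matching step only, the sketch below models a ternary (value/mask) rule lookup over already-parsed fields; the field names, table contents, and actions are invented for the example and are not taken from the cited implementations.

    # Minimal sketch of a TCAM-style rule match: masked-out bits act as wildcards.
    def match_rule(fields, rule):
        match, actions = rule
        for name, (value, mask) in match.items():
            if (fields.get(name, 0) & mask) != (value & mask):
                return None
        return actions

    table = [({"eth_dst": (0x0002C9000001, 0xFFFFFFFFFFFF)}, ["forward:port1"]),
             ({}, ["drop"])]                                  # empty match acts as a catch-all rule

    fields = {"eth_dst": 0x0002C9000001, "eth_src": 0x0002C9000002}
    actions = next(a for a in (match_rule(fields, r) for r in table) if a is not None)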
Some works modify this basic structure to implement reduced versions of the OpenFlow standard. For
example, the work presented in [72] modifies the OpenFlow table structure such that each rule matched
can have multiple actions associated with it, and then does not include multiple OpenFlow tables (the
work has multiple tables, but it is best interpreted as a single OpenFlow table that is pipelined). This
solution limits the flexibility of the OpenFlow standard, but also reduces the overall area needed for the
Hardware switch implementation.
From this description we can glean why the full OpenFlow Hardware solutions might take up such
a significant amount of FPGA hardware resources. While the OpenFlow switch can implement network
security, it also has a great deal of overhead that is included to deal with other networking needs, such
as packet forwarding and VLAN tagging. Also, the queuing structure for parsed packet data forces the
parsed data and the packets themselves to be buffered. The need for buffer space would be determined by
the maximum number of packets the switch needs to hold while they wait to be processed, which can be
significant depending on the network speed of the Ethernet interface, the number of network interfaces,
and the average time it takes to process a single packet. All of this added buffering, the inclusion of
multiple OpenFlow tables, and the need to sometimes reprocess a packet through the OpenFlow tables
also can add a significant amount of latency to the processing of a packet. Solutions that target security
exclusively can omit some of the overhead of a full OpenFlow switch implementation to reduce this area
overhead and alleviate this long packet processing latency.
5.3 The Network Management Unit
The software analogues demonstrate some of the needs of network security, namely the enforcement of
access control (either directly or by hairpinning such functionality to the next-level physical switch or
some hardware appliance), and the ability to route traffic between logical interfaces on the same FPGA.
In traditional software virtual environments, VMs share memory and I/O connections. The memory
sharing is generally provisioned by hardware means, specifically, data isolation is provided through the
employment of an MMU [28]. As an analogy to the MMU, which provides memory data isolation, we
propose the creation of an NMU, which provides network domain isolation. Based on the related work,
and the trends we identified, we contend that the NMU is required to enable the secure deployment of
direct-connected FPGAs in multi-user or multi-tenant datacenters and cloud deployments.
Similar to the software analogues presented in the previous section, there can be many potential
ways to secure the network interface for shared use of the network resources. For example, in
Chapter 2, several works were presented that had some kind of network security guarantees. The work
presented by Byma et al. [33] policed outgoing traffic by replacing the source MAC address with the
one assigned to the sender; the work presented by Tarafdar et al. [34] encapsulated data within a MAC
packet; and the work presented by Microsoft research, specifically Catapult 2 [3], encapsulated data in a
custom Transport layer protocol called Lightweight Transport Layer (LTL). In this section, some of the
considerations that might be needed for network security are presented, and a nomenclature is developed
to refer to these NMUs.
Note, the exact requirements of the NMU design will always depend on the specific deployment
details of the datacentre in which the FPGAs are to be deployed. For this reason, we do not present a
single NMU that we posit meets the requirements for domain isolation of networking interfaces. Instead,
a number of potential NMU designs are presented, which represent a series of deployment scenarios that
we claim meet the domain isolation needs of many common FPGA deployments.
5.3.1 Access Control Level
We note from the software analogues that ACLs are one important way in which network connectivity
should be secured. Access control functionality can be done within the NMU, or hairpinned to the next
level switch. The first criterion by which we categorize potential NMU designs is the level of access control
done within the NMU rather than pushed to the next level switch.
Un-Inspected Networking (Type A)
At the lowest level, we have NMUs that do not inspect outgoing packets at all and push all access control
functionality to the next-level switch (and potentially a further firewall appliance); we call these Type A
NMUs. Of course, for the next-level switch to be able to uniquely identify separate logical interfaces,
some methodology must be employed to mark outgoing packets as originating from a particular logical
interface. Two recent different IEEE standards could be used to this end.
The Edge Virtual Bridging standard (802.1Qbg) [73] allows for a single physical port of a switch to be
treated as multiple logical ports by associating each logical connection with a specific Service VLAN tag.
Similarly, the Bridge Port Extension standard (802.1BR) [74] allows for a single physical port on a switch
to be expanded into multiple individually managed connections using a custom tag structure. Thus,
a Type A NMU should employ such tagging to push both routing and access control to the next-level
switch.
The simplicity of Type A NMUs lends itself to simple hardware realizations, but they require
all ACLs to be implemented at the next-level switch, tightly coupling the hardware application to the
switch configuration, which is not desirable (the datacentre management framework must manage ACLs
in multiple places with multiple update and management procedures).
Source Semantics Enforcement (Type B)
The next level of access control is source semantics enforcement, i.e., ACLs that ensure the sender
addresses in the packets are correct and no other device addresses are spoofed; we term these Type B
NMUs. This is the type of NMU applied to the work presented by Byma et al. [33]. If source semantics
are enforced on the FPGA, further access controls can be applied at the next-level switch without the
configuration complexity of the Type A NMUs. Also, the Type B NMU does not rely on relatively new
IEEE standards that may have limited adoption. While the configuration complexity is reduced, most
access controls must still be implemented on the next-level switch; Type B NMU solutions remain tightly
coupled to the switch configuration.
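A Type B check can be as simple as overwriting the source address field of every egress frame, as in the following sketch; the byte offsets follow the standard Ethernet header layout, and the table name is hypothetical.

    # Sketch of source semantics enforcement: overwrite the source MAC with the assigned address.
    def enforce_source_mac(frame: bytearray, viid: int, assigned_mac: dict) -> bytearray:
        frame[6:12] = assigned_mac[viid].to_bytes(6, "big")   # bytes 6..11 hold the source MAC
        return frame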
Destination Rule Enforcement (Type C)
We define Type C NMUs as those that perform both sender and destination based access controls on
the FPGA. The full scope of what might constitute access control could be quite wide, and in fact might
include the full implementation of a switch on the FPGA. As discussed in the previous section, such an
implementation is likely infeasible or carries too high an overhead. Instead, we narrow the definition of
access controls.
Some previous works have shown FPGA datacenter deployments that rely solely on static point-to-
point links between the FPGAs. Limiting the NMU’s access control to a single destination field per
logical network interface would allow for some access control to be implemented in the Type C NMU at
a relatively low cost. Moreover, multiple logical network interfaces can be provided to each hardware
application to implement point-to-multipoint connectivity. Other simple destination-based rules can also
be included, such as limiting the ability to send multicast packets, and limiting IP traffic to a specific
subnet. We contend that these simple access controls are powerful enough for many tasks.
The Type C NMU adds complexity in the hardware implementation, and as such area overhead,
however it removes the tight coupling between the hardware application and the network infrastructure,
which should greatly ease deployment. Of course, this is limited: if the point-to-point access controls
are not sufficient to isolate the network accesses, more powerful ACLs from the next-level switch
would be needed.
Packet Encapsulation (Type E)1
Finally, Type E NMUs eliminate the need for access controls by moving packet encapsulation into the
NMU itself; instead of users performing network packetization within their own Hardware Applications,
they simply send the payload to the NMU, which encapsulates it within the appropriate network packet.
This is the methodology imposed in the implementation by Tarafdar et al. [34], and implied as an option
in the Catapult v2 work with the introduction of the LTL protocol [3].
Type E NMU solutions can be quite simple in terms of the hardware required to implement them, and
there is no tight coupling between the hardware application deployment and the network configuration.
Type E NMUs are however the least flexible, as they impose point-to-point only connectivity. Type E
NMUs also share network encapsulation hardware between the hardware applications, reducing area
utilization, but thus also require Hardware Applications to be rewritten to target the encapsulation-
based NMU scheme.
5.3.2 Internal Routing
Another functionality that might be required is the routability of traffic between logical network interfaces
located on the same FPGA. In general, hairpin routing to the next-level switch and back is not possible
since the IEEE switch specifications explicitly forbid the re-routing of packets to the interface on which
the packet was received. The Edge Virtual Bridging [73] and the Bridge Port Extensions [74] protocols
are exceptions, so the Type A NMUs based on these standards enable routability by default.
For other NMU types, routability between the logical network interfaces can only be provided by
including routing functionality directly in the NMU; we term such NMUs as Type *R NMUs. Note,
routability doesn’t necessarily need to be provided, though this would impose on the cloud management
framework the limitation that two applications that need to communicate with each other must be
provisioned on different FPGAs; this might be an onerous limitation. This is the methodology employed
by Byma et al. [33] for example.
1 Type D is intentionally unused, reserved for NMUs with a richer set of access controls (such as fully implemented switches on FPGAs, stateful access controls, or OpenFlow flow tables); these are left for future work.
5.3.3 VLAN Networking Support
From the NIST publication, another common way to ensure network security is by encapsulating packets
within a virtual network, such as a VLAN or a VXLAN. A VLAN-based NMU would tag each logical
network interface with the appropriate VLAN tag without having to parse the packet itself, and as such
we classify it as a Type A NMU (Types Av and ARv). A VXLAN-based NMU would encapsulate the
whole packet within a VXLAN delivery packet, and as such we classify it as a Type E NMU (Types Ev
and ERv).
5.3.4 Layer of Network Virtualization
Routing functionality and access control can be implemented at various levels of the network protocol
stack, depending on the desired abstraction to present to the hardware application. For example, the
hardware applications might have their own MAC addresses, or they might share a MAC/IP address
and differ only on the Layer 4 port number. NMUs can be designed to process packets at a specific layer
of the network protocol stack: MAC-only NMU, MAC/IP NMU, and MAC/IP/Layer4 NMU.
5.3.5 NMU Nomenclature
The previous subsections have presented many different features that could be implemented to provide an
effective network security solution. For simplicity, all of these NMU types and features are summarized
by the nomenclature presented in Table 5.1. The Type of the NMU is determined by the level of access
control that it supports. In addition, an R or a v can be added to indicate that the NMU supports
routing between Hardware Applications on the same FPGA and that the NMU specifically targets a
virtualized network technology, respectively. Finally, the layer of the network stack at which the NMU
works is appended to the end of the name. As a final note, a Universal NMU is used to refer to an NMU
that is designed to support all of the potential features; a Universal NMU can be parameterized by the
FPGA management framework at runtime to determine which of the modes to implement for each of
the Hardware Applications and the network access ports indicated by their VIID.
5.4 Network Management Unit Hardware Design
The Network Management Unit was introduced conceptually in the previous section. This section
discusses the actual hardware implementation of the NMU for synthesis into a shell design. To illustrate
the intention of the work described in this section, see Figure 5.6. In part (a) of the figure, a shell
Table 5.1: NMU Nomenclature Summary

Type (A|B|C|E) [R] [v] - [L2|L3|L4]

Type A      No access controls provided within the FPGA, some tagging such that ACLs can be applied at the next-level physical switch (hairpinning)
Type B      Source semantics enforcement for all outgoing traffic from hardware applications, allowing ACLs at the next-level switch while eliminating spoofing
Type C      Source semantics enforcement and some simple dest. based access controls (e.g. restricting to a single dest, or restricting multicast and/or broadcast)
Type E      Encapsulation: hardware applications send payload without generating packet headers, network packet generation done in the NMU itself
Type *R     Routing between hardware applications on the same FPGA done inside of the NMU (no hairpinning)
Type *v     Virtualized networking environment supported
[L2|L3|L4]  Network protocol stack layer the NMU operates with respect to (L2 = MAC, L3 = IP, L4 = Transport)

E.g. Type A-vepa, Type A-etag, Type Av, Type ARv, Type B-L2, Type B-L3, Type B-L4, Type BR-L2, Type BR-L3, Type BR-L4, Type C-L2, Type C-L3, Type C-L4, Type CR-L2, Type CR-L3, Type CR-L4, Type E-L2, Type E-L3, Type E-L4, Type ER-L2, Type ER-L3, Type ER-L4, Type ERv-vxlan, Type ERv-nvgre, Type Ev-vxlan, Type Ev-nvgre, Universal
with just the performance isolation components is shown. In part (b) of the figure, the Simple NMU of
part (a), which in itself could provide no network security, is replaced with a more complex NMU based on
the descriptions in the previous section. This complex NMU is connected to the PCIe-based management
framework such that the parameters of the NMU can be set at runtime.
5.4.1 Reusable Sub-Components
To implement the functionality required of the NMUs, we need packet processing components that can
examine the packets and pull out the relevant header information, as well as modify the packets by
inserting and removing headers/fields. These components can be designed as reusable sub-components
to reduce the complexity of deploying the various different types of NMUs.
Packet Parser-Processor
Packet parsers are used to pull out header information from a packet. This header information is
then generally compared to some ACLs or a routing table. Previous works doing packet processing
on FPGAs range from complex programmable parser designs [75], to simpler parsers generated from
Figure 5.6: Adding NMU to the Shell (a) Shell without NMU (b) Shell with added NMU
domain specific languages [76]. One of the focuses of our solution is to minimize the hardware overhead
of network security for virtualized FPGAs, so we focus on the simpler designs.
The simple parser architectures include parsers for each part of the network protocol stack, cascad-
ing the parsers and accumulating the parsed information. For example, parsers could be created and
connected in a cascade for MAC-parsing, IPv4-parsing, ARP-parsing, etc. The parsers themselves are
generally simple, including a counter that counts the current position within the packet stream, and
specialized field extractors that look for particular offsets within the packet for the field to be extracted.
Note, the position that the field extractor must look for to find the field can change based on previous
packet fields extracted, and so cannot necessarily be hard-coded.
We employed a similar parser design in our work. Figure 5.7 shows a number of Field Extraction
Sequencers that each extract a particular field in the packet. Traditional packet parsing systems pull
out all the fields of interest, through some series of packet parsers, and then pass those fields en masse
to some routing table or flow table structure to be analyzed and processed (e.g., like the OpenFlow
standard switch implementations). A key difference in our design is the inclusion of the Access Control
and Routing CAM logic for a particular field directly within the parser responsible for extracting that
field. This design allows for the cascaded parsers to simply pass along the cumulative routing and ACL
status instead of the entire field (which might otherwise contribute to high register utilization in highly pipelined
designs). This direct inclusion in the parser also eliminates the need for buffering of packets and queuing
of parsed packet data for processing, since all the parsers by necessity must operate at line rate. Access
Control and Routing CAM components can be excluded if not needed for a particular NMU type.
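The sketch below shows the cascading idea for the MAC stage only: the ACL and CAM lookups happen inside the parser, and only the accumulated status is passed forward; all names and the configuration structure are illustrative.

    # Sketch of one parser stage with its ACL/CAM folded in; only the status context moves on.
    def new_context():
        return {"acl_error": False, "route_mask": 0xFF, "next_header": None}

    def mac_stage(packet: bytes, ctx: dict, cfg: dict) -> dict:
        dst = int.from_bytes(packet[0:6], "big")
        src = int.from_bytes(packet[6:12], "big")
        ctx["acl_error"] |= (src != cfg["allowed_src_mac"])        # source ACL inside the parser
        ctx["route_mask"] &= cfg["dest_cam"].get(dst, 0)           # routing CAM inside the parser
        ctx["next_header"] = int.from_bytes(packet[12:14], "big")  # EtherType selects the next stage
        return ctx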
Figure 5.7: Packet Parser Architecture
Tagger/Encapsulator
The tagger and encapsulation components are used to insert bytes at the beginning or in the middle
of a packet, to support the Type A and Type E NMUs respectively. To insert bytes into a packet, the
incoming packet stream must first be divided into segments, which can be read and pushed to the output
individually. This is accomplished using a segmented FIFO, where the segments form multiple FIFO
outputs. The segmentation is done on a 16-bit basis, since all network headers at Layer 4 and below are
aligned to 16-bit boundaries.
Figure 5.8 shows the implemented tagger/encapsulation core, with the input driving a segmented
FIFO. The output stream is generated by using multiplexers to select from the segments of the input
FIFO, and the tag/encapsulation data to be inserted into the packet. A Packet Output Sequencer,
implemented as a Finite State Machine, sequences the input and the bytes of the data to construct the
output packet. The stream VIID from the input is used to determine which logical network interface
sent the packet that is currently being processed. This VIID is used to index into the Tag/Encap Data
register file to access the tag data to be inserted into packets specifically from that logical interface.
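Functionally, the insertion amounts to splicing a per-VIID tag into the byte stream at a 16-bit-aligned offset, as in the following sketch; a VLAN-style tag placed after the two MAC addresses is assumed purely for illustration.

    # Sketch of tag insertion: the tag for this logical interface is placed after byte 12.
    def insert_tag(packet: bytes, viid: int, tag_table: dict) -> bytes:
        tag = tag_table[viid]             # e.g. a 4-byte 802.1Q tag for this logical interface
        return packet[:12] + tag + packet[12:]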
De-Tagger/De-Encapsulator
The de-tagger and de-encapsulation components do the opposite task as the tagger and encapsulator.
For packets coming in from the network, these components can be used to strip some bytes from the
packet that are not needed in the downstream hardware applications, like various tag information or
Figure 5.8: Tagger & Encapsulation Architecture
Figure 5.9: De-Tagger & De-Encapsulation Architecture
whole network headers in the case of Type E NMUs. See Figure 5.9 for details on the implementation
of the de-tagger/de-encapsulator component. The design is similar to the tagger and encapsulator of
Figure 5.8, except the direction of packet flow is reversed. The 16-bit segments of the input from the
network drive multiplexers that in turn drive the FIFO. The input side of the FIFO is segmented and
the output side is full-width. The Packet Input Sequencer drives the selects of the multiplexers and
the enables of the FIFOs to discard the appropriate bytes.
5.4.2 Destination Rules Enforced
For Type C NMUs some restricted set of destination-based ACLs can be enforced. While there could
be quite a lot of different rules that might be enforced for the destination fields of a network packet,
the full set of possible rules would be unwieldy. As such, we implement some destination-field-based
rules that, from our personal experience, we believe cover many of the secure deployment scenarios that
might be encountered. We do not contend that this list is exhaustive or optimal in any way, but rather
include them here as one possible implementation of a Type C NMU. Any of the ACLs or functionality
described in this section can be optionally enabled/disabled for any network interface individually. The
rules included in this section are implemented in all Type C NMUs evaluated in this thesis.
MAC
For the MAC destination address, a simple ACL is included to restrict the destination address to a single
possible value. This would limit the network interface for which the rule is enabled to a single destination
value, and would be useful in implementing point-to-point networking. If a Hardware Application need
only communicate to a single destination, this ACL could enforce that restriction. In addition, ACLs
are included that can bar the sending of packets destined to any multicast MAC address, or can block
all multicast packets except those sent to an IP4 multicast address. This is included to prevent Hardware
Applications from spamming, whether maliciously or inadvertently, the network interface with many
multicast packets.
VLAN
For the VLAN field, in addition to being able to restrict output to a specific VLAN (which technically
acts as both a source and destination ACL, and is included in both Type B and Type C NMUs), the
VLAN parser can also act on the priority field in the VLAN tag. The priority field can be restricted to
a certain value to prevent the Hardware Applications from targeting a priority class that is reserved for
other network traffic.
EtherType
The EtherType is a field included in the MAC header that indicates the type of the next header in the
packet. For the EtherType, additional ACL functionality is added to restrict sending to only packets
that target IP4, IP6, ARP, and/or sending raw Ethernet packets. The ACL can be restricted to any,
all, or any combination of the above protocols. From our experience, these are the protocols commonly
targeted by FPGA Hardware Applications. If the Hardware Application needs to target some other
packet type, the ACL can be disabled.
IP4
For the IP destination address, we similarly implement restrictions that allow only a single destination
address and that limit the ability to send multicast and broadcast packets. In addition, a subnet mask
field can be configured to allow the Hardware Application to target a specific network subnet, allowing
the NMU to restrict network communication from each Hardware Application to a specific network
segment. Finally, the packet parser for the IP field includes an ACL that can restrict or allow access
to all public IP addresses, i.e., IP addresses outside the range of addresses reserved for private use.
Allowing these IP addresses to be targeted allows the Hardware Application to communicate to any IP
address outside the immediate datacentre deployment, while still restricting access to any other network
devices in the datacentre. This functionality can be combined with the subnet mask functionality to
restrict the Hardware Application to communicating with any public IP address, and only those private
IP addresses within its subnet.
Port
For the transport layer, only point-to-point communication restrictions are possible. In other words, an
ACL can be included that restricts the Hardware Application to addressing only a single destination
port. No other relevant destination-based ACLs were included for the transport layer.
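The sketch below gathers the checks of this section into a single egress filter; every rule can be disabled by leaving its configuration entry unset, and the field and configuration names are illustrative rather than those of the implemented NMU.

    # Sketch of the Type C destination rules; a non-empty result means the packet is dropped.
    def check_egress(fields: dict, cfg: dict) -> list:
        errors = []
        if cfg.get("single_dest_mac") is not None and fields["dst_mac"] != cfg["single_dest_mac"]:
            errors.append("mac-dest")
        if cfg.get("bar_multicast") and (fields["dst_mac"] >> 40) & 0x01:
            errors.append("multicast")                    # I/G bit of the destination MAC address
        if cfg.get("allowed_ethertypes") and fields["ethertype"] not in cfg["allowed_ethertypes"]:
            errors.append("ethertype")
        if cfg.get("subnet") is not None:
            net, mask = cfg["subnet"]
            if (fields["dst_ip"] & mask) != (net & mask):
                errors.append("ip-subnet")
        if cfg.get("single_dest_port") is not None and fields["dst_port"] != cfg["single_dest_port"]:
            errors.append("l4-port")
        return errors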
5.4.3 Universal NMU
The Universal NMU is depicted in Figure 5.10. As stated in Section 5.4.1, the parser components are
cascaded, with routing CAMs and ACL components integrated within the parsers rather than in some
later packet processing stage.
Packets flowing from the FPGA to the network (left to right in the figure) pass through parsing
stages for MAC, VLAN, IP, and the Transport layer. In addition, not shown in the figure, an ARP
stage is included that is processed in parallel with the IP parser. The parser chain is followed by
an On-Chip Router Filtering component that can filter out (i.e., drop) packets that failed any of the
preceding ACLs. This Filtering component also includes a fully described ACL that dictates which of the
Hardware Application interfaces are allowed to communicate with each other. If a packet is attempting
to be forwarded to a co-resident Hardware Application that it is barred from communicating with, that
Figure 5.10: Universal NMU System Architecture
packet will also be dropped. The fully-described on-chip ACL is implemented using a bitmask that
contains a bit for each logical network connection that is used to mask out any communications that are
not permissible. This Filtering component also includes a must-route mask, which can force any Network
Interface's packets to automatically be forwarded to another Hardware Application Region's input port.
This extended functionality can enable the NMU to facilitate direct on-chip communication.
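The following sketch models the bitmask-based filtering and the must-route override; the mask encoding (one bit per logical network connection) follows the description above, while the function and table names are invented for the example.

    # Sketch of the on-chip filtering decision for one egress packet.
    def filter_onchip(src_viid: int, route_mask: int, allow_mask: dict, must_route: dict) -> int:
        if must_route.get(src_viid, 0):                   # forced on-chip forwarding takes priority
            return must_route[src_viid]
        return route_mask & allow_mask[src_viid]          # zero means the packet is dropped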
Following the Filtering component, any undropped packets proceed to an AXI-Stream switch that al-
lows for packets to be routed back into the FPGA or out to the network. This implements the routability
functionality of the NMU, so that if any Hardware Application is targeting another Hardware Application
on the same device for communication, the traffic can be forwarded properly. Finally, a tagger/encapsulator
can be used to implement the tagging or encapsulation modes.
Packets arriving from the network (right to left in the figure) are first de-tagged (if a tagging mode
happens to be enabled) before being passed to the ingress path parsers. While these parsers are logically
separate from the egress path parsers, the CAMs used in all the parsers are register-based and the
registers are shared between both versions of the parsers, reducing the area utilization. After the
parsers, a de-encapsulation stage is included to support any of the encapsulation modes. Finally, a
buffering stage that can hold at least one maximum transmission sized packet must be included since
the ingress port to the FPGA is shared with the packets rerouted inward from the egress path; packets
must be buffered so they are not lost or dropped.
Table 5.2: Shared Network Connectivity Secured Shell Overhead

                       LUT               LUTRAM           FF                BRAM            DSP
Shell Type             Num      % Incr   Num     % Incr   Num      % Incr   Num    % Incr   Num   % Incr
No Security Incl.      36,141   -        4770    -        36,133   -        61     -        0     -
Perf. Isolation        36,416   0.7%     4770    0%       36,587   1.3%     61     0%       0     0%
Universal NMU          59,449   63.2%    8295    73.9%    53,541   46.3%    63     3.27%    0     0%

Table 5.3: Shared Network Connectivity Secured Shell Overhead (Percentage of Available FPGA Resources)

Shell Type             LUT (%)   LUTRAM (%)   FF (%)   BRAM (%)   DSP (%)
No Security Incl.      5.45      1.62         2.71     2.81       0.0
Perf. Isolation        5.49      1.62         2.75     2.81       0.0
Universal NMU          8.96      2.82         4.04     2.92       0.0
5.4.4 Limited Functionality NMUs
In addition to the Universal NMU, many other limited functionality NMUs can be implemented based
on the descriptions in Section 5.3. Depicted in Figure 5.11 is the Universal NMU again, but this time
pictorially labelled so as to act as a legend for the other limited functionality NMUs. The Universal NMU
is shown in Part (a) of the Figure, while parts (b)-(h) show the limited functionality NMUs. Each of the shapes over the labelled components in the Universal NMU is included or excluded in the depiction of the limited functionality NMUs based on whether or not it would be needed to implement that limited functionality.
5.5 Network Virtualizing Shell Overhead Evaluation
As with the memory protections introduced into the shell, we evaluate our shell design based on the
area overhead of its implementation.
5.5.1 Shell Design
Most of the components of the network securitization part of the shell are similar in nature to the solutions presented for the memory interface. We evaluate the overhead of these components in a similar way, incrementally adding the isolation components to an existing shell design. The results are summarized in Table 5.2 and Table 5.3, with the second table giving the percentage of available resources on the FPGA that the shell uses. These evaluations are likewise performed on the Kintex UltraScale XCKU115.
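For clarity, the sketch below reproduces how the "% Incr" columns of Table 5.2 appear to be computed: each shell variant is compared against the variant in the row above it, so the overheads are reported incrementally. This interpretation is inferred from the numbers rather than stated explicitly.

    # Sketch reproducing the apparent convention of the "% Incr" columns in Table 5.2 (an inference).
    shells = [
        ("No Security Incl.", {"LUT": 36141, "LUTRAM": 4770, "FF": 36133, "BRAM": 61}),
        ("Perf. Isolation",   {"LUT": 36416, "LUTRAM": 4770, "FF": 36587, "BRAM": 61}),
        ("Universal NMU",     {"LUT": 59449, "LUTRAM": 8295, "FF": 53541, "BRAM": 63}),
    ]
    for (_, prev), (name, cur) in zip(shells, shells[1:]):
        incr = {res: 100.0 * (cur[res] - prev[res]) / prev[res] for res in cur}
        print(name, {res: f"{v:.1f}%" for res, v in incr.items()})
    # Universal NMU row: LUT ~63.2%, LUTRAM ~73.9%, FF ~46.3%, BRAM ~3.3%, matching the table.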
We note that adding the performance isolation components has very little impact on the total area utilization of the shell.
Figure 5.11: NMU Varieties. (a) Universal NMU, with components labelled and marked with symbols to be used as the legend for the sub-figures; (b) Type A NMUs; (c) Type B NMUs; (d) Type C NMUs; (e) Type BR NMUs; (f) Type CR NMUs; (g) Type E NMUs; (h) Type ER NMUs.
This makes intuitive sense, since the amount of logic needed to decouple and verify the protocol assertions on the network interface is fairly small. The NMU, however, adds a great deal of overhead to the system: LUT usage increases by about 63 percent, LUTRAM by 74 percent, and flip-flop usage by 46 percent. The NMU is considerably more logic intensive than the other components of the design, so this also makes sense. Even so, the total utilization of the modified shell does not exceed 9 percent of the FPGA. Considering the functionality that the Universal NMU provides, it is a worthwhile inclusion in any datacentre FPGA deployment. A more detailed analysis of the NMU follows.
5.5.2 NMU Overhead
The NMU designs were tested on an Alpha Data 8k5 FPGA add-in board with a 10Gb Ethernet connection; the FPGA on that board is a Xilinx Kintex UltraScale XCKU115. All tests were done using the Xilinx Vivado 2018.1 software and the associated versions of the PCIe Subsystem and Ethernet Subsystem cores.
The NMU was placed in a system with four hardware applications, each connected to the ingress and
egress ports of the Ethernet Controller through an AXI Stream Switch. Each application is provided
eight logical network connections, so the NMUs evaluated support 32 total logical connections. The
Ethernet controller has a datapath width of 64-bits and operates at 156.25 MHz, which is the clock used
for the whole test platform (except for the PCIe Controller). The applications themselves simply include
a Block RAM that stores packet data, a DMA device to send that packet data out to the network, and a
DMA engine that receives data from the network to store to Block RAM. Each of the Apps is controlled
through PCIe by a Host PC that manages the test setup. The Host is also responsible for configuring
the NMU. Figure 5.12 shows the architecture of the test platform.
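As a quick sanity check (illustrative only), the 64-bit datapath at 156.25 MHz exactly matches the 10 Gbps line rate of the Ethernet controller used in the test platform:

    # Quick check: datapath width x clock frequency = Ethernet line rate.
    DATAPATH_BITS = 64
    CLOCK_HZ = 156.25e6
    print(DATAPATH_BITS * CLOCK_HZ / 1e9, "Gbps")  # 10.0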
To evaluate the various NMUs based on the previous descriptions, each of those design decisions
is compared on an area utilization and unloaded latency basis. Note, such designs would generally be
evaluated in terms of throughput as well, but all of the packet processing components used in this work
operate at the 10Gbps line-rate of the Ethernet controller. All of the results are shown in Table 5.4.
Access Control
Part (b) of Table 5.4 shows the area and latency results of the four different types of NMUs. The Type A NMU, as expected, has the lowest area and latency, though this is largely because the Type A NMU does not need on-FPGA switching to allow the Hardware Applications to communicate (the Bridge Port Extensions E-tag standard allows for hairpin routing).
Figure 5.12: Multi-Application Test Setup for Networking (four applications connected through AXI Stream Switches to the NMU, the 10Gb Ethernet Controller, and the PCIe Controller)
The Encapsulation-based NMU has a slightly lower utilization, indicating that Type E NMUs might be preferable for reducing area utilization, though at the cost of slightly increased latency caused by the segmented FIFOs included in the packet path. Finally, we note that the added overhead of implementing some destination-based access controls (i.e., Type C NMUs) is fairly minimal.
Virtualization
The results of the evaluation for the two virtualized networking NMUs are shown in Part (c) of Table 5.4.
The VLAN based virtualization solution uses about the same amount of resources as the Type B and
Type C NMUs from Part (b), though there is added latency from the tagging functionality. The VXLAN
virtualized solution has a much higher utilization because it must parse a full Layer 4 packet first before
identifying the virtual ID and routing the packet. The modest area overhead relative to the other NMUs
might be worth it considering the ease of deployment, and ubiquity, of virtual network solutions.
Routability
Dropping the requirement that there be routability between co-resident hardware applications cuts the
area utilization in half for the Type B and Type C NMUs, and nearly in half for the other NMUs, as
shown in Part (d) of Table 5.4. There is also a drop in latency from removing the switching.
Table 5.4: NMU Area and Latency Comparisons

                              CLB LUTs            Flip-Flops          Latency (cycles, egress / ingress)
(a) Universal                 23,014 (3.47%)      16,336 (1.23%)      13–18 / 19–25

(b) Access Control Evaluation
Type A-etag                   4049 (0.61%)        5010 (0.38%)        1 / 4–6
Type BR-L2                    7199 (1.09%)        4311 (0.32%)        5–10 / 6–8
Type CR-L2                    7424 (1.12%)        4378 (0.33%)        5–10 / 6–8
Type ER-L2                    6133 (0.92%)        4316 (0.33%)        6–7 / 8–10

(c) Virtualization Evaluation
Type ARv-vlan                 7218 (1.09%)        5827 (0.44%)        6–8 / 8–10
Type ERv-vxlan                9606 (1.45%)        5628 (0.42%)        6–7 / 9–15

(d) Routability Evaluation
Type Av-vlan                  3753 (0.57%)        4582 (0.35%)        1 / 4–6
Type B-L2                     3516 (0.53%)        2883 (0.22%)        1–6 / 2–4
Type C-L2                     3687 (0.56%)        2867 (0.22%)        1–6 / 2–4
Type E-L2                     3392 (0.51%)        3113 (0.23%)        1 / 4–6

(e) Network Layer Evaluation
Type CR-L2                    7424 (1.12%)        4378 (0.33%)        5–10 / 6–8
Type CR-L3                    11,645 (1.76%)      6372 (0.48%)        6–11 / 7–12
Type CR-L4                    12,550 (1.89%)      7053 (0.53%)        6–11 / 7–12
Note that while there are area and latency benefits to dropping routability, it will likely lead to a system that is more difficult to manage.
Network Layer
From Part (e) of Table 5.4, we note that the biggest increase in area utilization in this evaluation results from elevating functionality to Layer 3 (IPv4) and Layer 4 (Transport) of the network stack. There is a 57 percent increase in LUT utilization from Layer 2 to Layer 3, suggesting that much of the area utilization lies in parsing and controlling the IPv4 network packets. Previous FPGA works have built on top of Layer 2 network packets (e.g., Byma et al. [33]), but higher layer protocols may be needed if FPGAs are to span broadcast domains.
Universal NMU
The Universal NMU’s latency and area results are shown in Part (a) of Table 5.4. There is a consid-
erable, though not unreasonable, increase in latency over other NMU solutions. This is expected, as
more pipeline stages were required to meet timing and all packets must pass through both tagging and
Chapter 5. Network Interfaces 94
Figure 5.13: Universal NMU utilization vs Number of Logical Connections (LUT utilization rises from 1.5% at 4 connections to 18.6% at 256 connections; FF utilization rises from 0.4% to 8.1%)
The latency numbers shown assume no UDP checksum calculation; if a checksum
is to be calculated in UDP-Encap mode (i.e. Type ER-L4), the entire packet would need to be buffered
during the computation. This adds an additional 190 cycles for a maximally-sized packet of 1522 bytes.
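The cycle count follows directly from the datapath width; a small illustrative check:

    # Illustrative check of the ~190-cycle figure: buffering a maximally-sized 1522-byte packet
    # on the 64-bit (8 bytes per cycle) datapath takes roughly 1522 / 8 cycles.
    import math
    print(math.ceil(1522 / 8), "cycles")  # 191, in line with the ~190 cycles stated above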
The total area utilization is just under 3.5 percent of the LUTs available on the FPGA, which includes
LUTs configured as logic as well as LUTs configured as LUTRAMs and Shift Registers. In terms of
flip-flops, the Universal NMU’s utilization is just 1.23 percent, so overall the area overhead of the NMU
is quite modest. Figure 5.13 shows how the Universal NMU scales with the number of logical connections, reaching just over 18 percent of LUTs and 8 percent of FFs at 256 connections. This low utilization can be attributed to the modified parser design discussed in Section 5.4.1, and to the paring down of network security functionality from a fully functional switching capability to the minimally necessary access controls of the NMU.
The Universal NMU is fairly small compared to the Kintex UltraScale XCKU115 FPGA. When the Universal NMU was run through place and route for a Virtex UltraScale+ XCVU13P, it required only 1.32 percent of the LUTs and 0.44 percent of the FFs. As FPGAs increase in size, the resource needs of the NMU solution become nearly negligible. This small size also suggests that hardening the NMU would not require significant die area.
Finally, the Universal NMU can be compared to a full switch solution implemented on an FPGA.
The solution presented in [72] is a relatively low-utilization OpenFlow switch implemented on a Xilinx
Virtex-7 VX485T FPGA. That solution uses 15.93 percent of the LUTs and 7.84 percent of the FFs.
When the Universal NMU was run through place and route for this FPGA, it required 7.36 percent of the LUTs and 2.48 percent of the FFs, which is less than half the LUTs and less than a third of the FFs used by the full switch solution.
Figure 5.14: Multiple Network Interfaces Managed Separately (each 10Gbps Ethernet controller has its own NMU, AXI-Stream interconnect, decouplers/verifiers, and bandwidth throttlers, serving a subset of the Hardware Application soft shell regions)
Note that the OpenFlow switch we are comparing against is a much simplified switch with effectively only one OpenFlow table, so one would assume that a larger, richer implementation would use considerably more resources than our Universal NMU architecture.
5.6 Multi-Channel Networking Considerations
An FPGA deployment could also include multiple network interfaces. For example, the Alpha Data 8k5 FPGA board includes two 10 Gbps Ethernet connections [46]. Including multiple network interfaces impacts the design of an isolation-based networking solution.
5.6.1 Separately Managed
In the simplest deployment, all these interfaces would be connected to the same downstream datacentre network. In this simple use-case, there is no advantage in connecting each Hardware Application to all of the network interfaces, since they all target what is essentially the same network resource. Thus, each network interface can simply be connected to some fraction of the Hardware Applications; each Hardware Application is connected to only one of the network interfaces. In this case, the
performance and domain isolation components would simply be replicated for each network interface
and connected to a subset of the Hardware Applications. Figure 5.14 depicts this organization.
Figure 5.15: Multiple Network Interfaces With an Exclusive Connection (the exclusive Ethernet controller connects to a single Hardware Application through decouplers only, with a destination indicator for incoming packets; the shared controller retains the NMU, decouplers, verifiers, and bandwidth throttlers)
5.6.2 One General and One Exclusive Connection
The Catapult v2 [3] work motivates a special use-case, wherein one of the network interfaces is connected directly to a device rather than to a switching-based network. In the Microsoft work, the connection is made to a traditional CPU-based server, enabling a Bump-in-the-Wire FPGA deployment. In any such
case, where one of the network interfaces has a special purpose rather than connecting to a switched
Ethernet network (e.g., connected to a device’s egress network port, connected to a non-switched or non-
Ethernet network, etc.), that port is likely to be assigned exclusively to a single Hardware Application.
We term this specific Network Interface an exclusive connection, and term the deployment Multiple
Network Interfaces with an Exclusive Connection. The Hardware Application with exclusive ownership
is likely to be a trusted application that can be safely deployed without domain isolation enforced on that
connection. The only components needed to ensure this connectivity would be decouplers to disconnect
all Hardware Applications that are not to access the exclusive interface and a component to indicate the
destination (VIID) for routing incoming packets. This system is shown in Figure 5.15, with the Ethernet
controller on the left implementing an Exclusive Interface (e.g., this interface could be connected directly
to a host server for a bump-in-the-wire deployment).
5.6.3 Individual Connection Per Application
As a final consideration, a special use-case exists if there are enough network interfaces to assign a single
interface to each of the Hardware Applications on the FPGA; some of the isolation components would
not need to be implemented. For example, the bandwidth throttlers implementing performance isolation
would serve no purpose in this configuration. The inclusion of the NMU would not strictly be necessary,
since every application would be uniquely identifiable at the next level switch; if each application is
uniquely identifiable, any ACLs that need to be implemented could be implemented in the switch. The
NMU could still be included to reduce the processing overhead required in the switch. The decouplers and protocol verifiers would still be required, since the application should be prevented from forcing the Ethernet controller into an errant state that persists when the application region is reconfigured with a different application. Alternatively, instead of these decouplers and verifiers, special reset logic could be used that resets the Ethernet controller whenever the application region is reconfigured, breaking the controller out of any errant states.
These multi-interface solutions are presented here as a conceptual discussion. The implementation
of such solutions is left to future work.
Chapter 6
Conclusion
In this thesis, many different concepts and solutions were introduced that aim to better secure and
virtualize multi-tenant FPGA deployments. In this final chapter, these contributions are summarized
and conclusions are drawn.
6.1 Summary of Contributions
The contributions presented in this thesis include:
• the conceptual introduction of the soft shell and the hard shell, and the principle along with them
that in multi-tenant FPGA deployments, any security protections should be embedded in the hard
shell and abstractions can be left to the soft shell (we do not contend that there is no use for
abstractions in the static region)
• extending the concept of a process ID to the FPGA with the coined VIID terminology, enabling
multiple virtual interfaces to coexist within the same PR region or Hardware Application
– we contend this is actually necessary to implement the soft shell functionality since the in-
stantiation of middleware components within the soft shell should be transparent to the user
and as such its resource needs should be virtualized separately
• the development of a set of hardware to effectively block malicious or malformed data transmissions
along a shared AXI bus, and the concepts to extend these hardware items to other interfaces
(decoupling, protocol verification, and bandwidth throttling)
• the extension of the credit-controlled accounting latency-rate server of [58] to consider separate data and command channels and to handle the potential for users to stall the shared interconnect by waiting even when the interconnect is ready to receive/transmit data
• an evaluation of the area overheads associated with the various components identified as necessary to secure the shared memory interface, including the area overhead analysis of existing memory virtualization solutions (base and bounds and paged MMUs)
• an extension of the introduced shared memory protection infrastructure to network resources, including decoupling, protocol verification, and bandwidth throttling
• the introduction of a new type of network security component, the Network Management Unit, that
is a low overhead alternative to full switch implementation and yet more powerful than existing
network protection schemes on FPGAs
• a top-level analysis of how these concepts could be extended to multi-channel memory and multi-
interface network systems
6.2 Conclusions
Just as software compute nodes can benefit from the virtualization of CPU-based computers, the virtu-
alization of the FPGA can provide benefits to datacentre FPGA deployments. In software implementa-
tions, overhead is a key consideration in the effectiveness of virtualization, since a high overhead impacts
the degree to which the physical device can be shared. Applying these same principles to FPGAs, the
solutions presented in this thesis provide a very low overhead means of securing logical isolation between
Hardware Applications on the same FPGA.
The memory interface virtualization was achieved with an incremental overhead ranging from 8.4% to 18.6% of the various FPGA resources, over a simple memory-based Shell that includes no isolation. The network interface virtualization was achieved with an incremental overhead ranging from 48.2% to 73.9% over the base networking Shell. While some of these increases seem relatively high, the total utilization of any of the shells presented in this thesis did not exceed 11% of the LUT resources or 8% of the flip-flop resources. From this we can conclude that the isolation components presented can be implemented at a fairly low overhead, validating the Shell implementation.
The analysis of the implemented bandwidth throttling components showed that these implemen-
tations can effectively manage and allocate bandwidth to particular applications in a shared FPGA
environment. It was also shown, however, that the bandwidth throttling of the memory connection is
highly dependent on the memory access pattern. We conclude that the performance isolation problems
encountered in the software virtualization realm, namely that inefficient memory access patterns make
it difficult to accurately assign the full bandwidth of a memory bus to individual users, also impact the
performance isolation solutions for hardware systems. The bandwidth reclamation system presented can
be used to allocate unused bandwidth. Some amount of memory bandwidth allocation can be guaranteed
with the modified credit-based accounting bandwidth throttlers presented in this work.
6.3 Future Work
Given that FPGAs in cloud and datacentre deployments are relatively new, the work presented herein
has many vectors for future research.
6.3.1 Further Shell Explorations
To begin, the conceptual Soft Shell described in Chapter 3 could be fleshed out to provide a much
more powerful and flexible platform for FPGA development. For instance, auto-generated middleware within the soft shell could abstract difficult protocol details away from the user and allow for much easier hardware development. This auto-generated soft shell is supported by the hard shell features presented
in this thesis, though further development of the soft shell concepts might expose or introduce new
considerations for the virtualization and isolation enabled by the hard shell.
In developing the soft-shell, some concepts touched upon in this thesis should be explored further.
First, the development of tools that automatically generate a wrapper around a Hardware Application, such that it can be deployed in a hard-shell-based FPGA deployment, would enable a powerful use case that lowers the barrier to using FPGAs for computation. Second, the development of actual
middleware for FPGAs is a potentially fruitful extension of the work presented here. As a final point,
more applications developed for FPGAs, targeted at the virtualized Shell presented in this thesis in
particular, could be useful to validate the necessity of the work presented in this thesis and to perhaps
discover future vectors or vulnerabilities that must be addressed to protect Hardware Applications from
malicious activity.
In terms of the Hard Shell itself, work could be done in improving the concepts presented in this thesis
to strengthen the security guarantees. For example, the bandwidth throttler for the memory interface is
quite limited in that it does not take into account the stream of accesses from the requester in calculating
an updated token count. Future work could look at effective ways to build better bandwidth throttlers
given the complications of the SDRAM protocol. The isolation solutions introduced here could be expanded to target other platforms, whether non-AXI or even non-FPGA. For example, some of the same concepts covered in this work could potentially be applied to CGRAs, or perhaps to Neural Network Processors, to allow for co-residency on this new class of devices without relying on software intervention to guarantee protection. The general methods applied for memory and networking could also be extended to other interface types; PCIe and storage, for example, would be worthwhile peripherals to consider for FPGAs.
6.3.2 Additional Security Considerations
The work presented here does introduce some security-based controls for cloud managers to consider
in their FPGA deployments, but the solutions and evaluations presented do not take into account
the entire security situation. This work only considers logical (i.e., in terms of digital logic) security
concerns, whereas electrical means of attack and interference exist as well. Given the level of control that FPGA users have over the electrical circuitry of the FPGA, electrical attacks could be of ever-increasing
importance in FPGA deployments.
There have been a few specific attacks demonstrated already in academic works on FPGA devices.
The work presented in [42] demonstrated that bitstreams can be generated that fluctuate the voltage of
the device in such a way as to crash the entire device. Another work, presented in [41] demonstrated that
the voltage level of a system could be monitored with special digital circuitry implemented on the FPGA,
and that this voltage level could be used to ascertain information about other applications running on
the same FPGA. Finally, [43] showed that malicious applications could leak information from co-resident applications by monitoring the levels of wires under the malicious actor's control that happen to be routed near wires of other applications: the data transferred on the other applications' wires induces voltage-level changes on the wire controlled by the malicious actor. For a multi-tenant device, all of these attack vectors are obvious violations of the principle of isolating co-resident applications. Work on the electrical isolation of applications on the same FPGA is a vital area of future work if FPGAs are ever to be truly securely shared in multi-user environments.
In addition to these FPGA specific security vulnerabilities, some vulnerabilities that target CPU-
based systems might present security concerns for FPGAs as well. As an example, the Row Hammer
vulnerability showed that a malicious actor could implement a memory access technique that allowed
them to change values in memory to which they did not have access [77]. The Row Hammer attack
should also be possible on an FPGA, and in fact might be easier to implement since the user has more
fine-grained control over the sequence of memory accesses that are sent. An analysis of existing software vulnerabilities, especially those that target SDRAM memories (since these same types of memories and similar memory controllers are used in FPGA systems), would be a useful area for future work.
6.3.3 Hardening Shell Components
Many of the ideas and technologies brought up or introduced in this work might be amenable to hardening
within an FPGA. With the increasing focus on compute-oriented FPGA usage, it is likely that FPGA
vendors like Xilinx and Intel might further expand the scope of functionality that is baked into the
device rather than provisioned through programmable fabric. Much of the functionality described in
this thesis could effectively be hardened within FPGAs to reduce the strain that the large shells have on
the placing and routing of the Hardware Applications themselves. Even the simpler shells presented in
this work sometimes presented difficulties in meeting the timing requirements. The memory interfaces
in particular, at 512-bits wide, were challenging to implement on the FPGA. The conceptual ideas
introduced in this work, such at the soft shell and the VIID could influence the design of these hardened
components to ensure maximal compatibility with the greatest number of deployments.
The components that are most amenable to hardening in this work are the MMU and the NMU, since
their inclusion does not depend on the application use case or deployment details of the FPGA. The
implementation of the other components, such as the bandwidth throttlers and the decouplers, depends
on the total number of co-resident applications that should be enabled on the FPGA, and also on the spatial location of those applications. If these components were hardened, the number of applications
and the location on the FPGA of those applications would need to be fixed. The MMU and the NMU
however have fixed sized interfaces regardless of the application deployment scenario and have fixed
locations on the device where they must be located already; the MMU and the NMU must be located
near the DDR and Ethernet controllers that are tied to specific pins on the FPGA device. Note that the
DDR and Ethernet controllers should likely be hardened (if not already) before the hardening of these
data and domain isolation components.
In the previous paragraph we posited that those isolation components that are replicated per hardware
application (i.e., the decouplers, verifiers, and bandwidth throttlers) should not be hardened. A possible
exception to that line of thinking is in an FPGA in which some hardened interconnection network exists.
The work presented in [35] recommended the hardening of so-called Network on Chip (NoC) components in future multi-tenant-targeted FPGAs. In this case, the protocol verification and decoupling components
should be considered for hardening with every interface to the hardened interconnection network. The
added overhead of this solution may not be worth it, since the number of such interfaces might be quite
significant and adding the decoupling and verification logic to each might present too large an overhead.
Nonetheless, the hardening of these components in a system with an already hardened NoC would be a
worthwhile consideration in future work.
Bibliography
[1] A. Putnam, A. Caulfield, E. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh,
J. Fowers, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, E. Peterson,
A. Smith, J. Thong, P. Y. Xiao, D. Burger, J. Larus, G. P. Gopal, and S. Pope, “A Reconfig-
urable Fabric for Accelerating Large-Scale Datacenter Services,” in Proceeding of the 41st Annual
International Symposium on Computer Architecuture (ISCA), pp. 13–24, IEEE Press, June 2014.
[2] D. Chiou, “Heterogeneous Computing and Infrastructure for Energy Efficiency in Microsoft Data
Centers: Extended Abstract,” in Proceedings of the 2016 International Symposium on Low Power
Electronics and Design, ISLPED ’16, (New York, NY, USA), pp. 150–151, ACM, 2016.
[3] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil,
M. Humphrey, P. Kaur, J. Y. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods,
S. Lanka, D. Chiou, and D. Burger, “A cloud-scale acceleration architecture,” in 2016 49th Annual
IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, Oct 2016.
[4] “Amazon EC2 F1 Instances.” aws.amazon.com/ec2/instance-types/f1/.
[5] F. Chen, Y. Shan, Y. Zhang, Y. Wang, H. Franke, X. Chang, and K. Wang, “Enabling FPGAs
in the Cloud,” in Proceedings of the 11th ACM Conference on Computing Frontiers, CF ’14, (New
York, NY, USA), pp. 3:1–3:10, ACM, 2014.
[6] M. McLoone and J. McCanny, High Performance Single-Chip FPGA Rijndael Algorithm Implemen-
tations, pp. 65–76. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001.
[7] S. Rigler, W. Bishop, and A. Kennings, “FPGA-Based Lossless Data Compression using Huffman
and LZ77 Algorithms,” in 2007 Canadian Conference on Electrical and Computer Engineering,
pp. 1235–1238, April 2007.
[8] F. Braun, J. Lockwood, and M. Waldvogel, “Protocol wrappers for layered network packet process-
ing in reconfigurable hardware,” IEEE Micro, vol. 22, pp. 66–74, Jan 2002.
[9] R. DiCecco, G. Lacey, J. Vasiljevic, P. Chow, G. Taylor, and S. Areibi, “Caffeinated FPGAs:
FPGA framework For Convolutional Neural Networks,” in 2016 International Conference on Field-
Programmable Technology (FPT), pp. 265–268, Dec 2016.
[10] M. Russinovich, “Inside the Microsoft FPGA-based config-
urable cloud.” azure.microsoft.com/en-ca/resources/videos/
build-2017-inside-the-microsoft-fpga-based-configurable-cloud/.
[11] J. Sahoo, S. Mohapatra, and R. Lath, “Virtualization: A Survey on Concepts, Taxonomy and
Associated Security Issues,” in 2010 Second International Conference on Computer and Network
Technology, pp. 222–226, April 2010.
[12] “Vivado Design Suite User Guide,” Tech. Rep. UG973 v2017.3, Xilinx, Oct 2017.
[13] “UltraScale Devices Gen3 Integrated Block for PCI Express v4.4,” Tech. Rep. PG156, Xilinx, Dec
2017.
[14] “UltraScale Devices Integrated 100G Ethernet v2.3,” Tech. Rep. PG165, Xilinx, Oct 2017.
[15] I. Kuon, R. Tessier, and J. Rose, “FPGA Architecture: Survey and Challenges,” Found. Trends
Electron. Des. Autom., vol. 2, pp. 135–253, Feb. 2008.
[16] B. Mei, A. Lambrechts, J. Y. Mignolet, D. Verkest, and R. Lauwereins, “Architecture exploration
for a reconfigurable architecture template,” IEEE Design Test of Computers, vol. 22, pp. 90–101,
March 2005.
[17] D. Chen, J. Cong, and P. Pan, “FPGA Design Automation: A Survey,” Foundations and Trends
in Electronic Design Automation, vol. 1, pp. 139–169, Jan. 2006.
[18] J. E. Stone, D. Gohara, and G. Shi, “OpenCL: A Parallel Programming Standard for Heterogeneous
Computing Systems,” IEEE Des. Test, vol. 12, pp. 66–73, May 2010.
[19] A. Canis, J. Choi, M. Aldham, V. Zhang, A. Kammoona, J. H. Anderson, S. Brown, and T. Cza-
jkowski, “LegUp: High-level Synthesis for FPGA-based Processor/Accelerator Systems,” in Pro-
ceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays,
FPGA ’11, (New York, NY, USA), pp. 33–36, ACM, 2011.
[20] T. Feist, “Accelerating IP Development with High-Level Synthesis,” in Vivado Design Suite
Whitepaper, no. WP416 v1.1, Jun 2012.
[21] T. S. Czajkowski, U. Aydonat, D. Denisenko, J. Freeman, M. Kinsner, D. Neto, J. Wong, P. Yian-
nacouras, and D. P. Singh, "From OpenCL to high-performance hardware on FPGAs," in 22nd
International Conference on Field Programmable Logic and Applications (FPL), pp. 531–534, Aug
2012.
[22] D. Dye, “Partial Reconfiguration of Xilinx FPGAs Using ISE Design Suite,” Tech. Rep. WP374
v1.2, Xilinx, May 2012.
[23] “Increasing Design Functionality with Partial and Dynamic Reconfiguration in 28-nm FPGAs,”
Tech. Rep. WP-01137-1.0, Intel, July 2010.
[24] “UltraScale Architecture Configuration,” Tech. Rep. UG570 v1.8, Xilinx, Dec 2017.
[25] G. Vallee, T. Naughton, C. Engelmann, H. Ong, and S. L. Scott, “System-Level Virtualization
for High Performance Computing,” in 16th Euromicro Conference on Parallel, Distributed and
Network-Based Processing (PDP 2008), pp. 636–643, Feb 2008.
[26] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Memory Bandwidth Management for
Efficient Performance Isolation in Multi-Core Platforms,” IEEE Transactions on Computers, vol. 65,
pp. 562–576, Feb 2016.
[27] C. Pahl, “Containerization and the PaaS Cloud,” IEEE Cloud Computing, vol. 2, pp. 24–31, May
2015.
[28] A. Bhattacharjee and D. Lustig, “Architectural and Operating System Support for Virtual Mem-
ory,” Synthesis Lectures on Computer Architecture, vol. 12, no. 5, pp. 1–175, 2017.
[29] Amazon, “AWS EC2 FPGA Hardware and Software Development Kits.” aws.amazon.com/ec2/
instance-types/f1/, 2017.
[30] “SDAccel Environment,” Tech. Rep. UG1164 v2016.3, Xilinx, Nov 2016.
[31] S. A. Fahmy, K. Vipin, and S. Shreejith, “Virtualized FPGA Accelerators for Efficient Cloud Com-
puting,” in 2015 IEEE 7th International Conference on Cloud Computing Technology and Science
(CloudCom), pp. 430–435, Nov 2015.
[32] “OpenStack.” www.openstack.org.
[33] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow, “FPGAs in the Cloud: Booting
Virtualized Hardware Accelerators with OpenStack,” in 2014 IEEE 22nd Annual International
Symposium on Field-Programmable Custom Computing Machines, pp. 109–116, May 2014.
[34] N. Tarafdar, T. Lin, E. Fukuda, H. Bannazadeh, A. Leon-Garcia, and P. Chow, “Enabling Flex-
ible Network FPGA Clusters in a Heterogeneous Cloud Data Center,” in Proceedings of the 2017
ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, FPGA ’17, (New
York, NY, USA), pp. 237–246, ACM, 2017.
[35] S. Yazdanshenas and V. Betz, “Quantifying and mitigating the costs of FPGA virtualization,” in
2017 27th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–7,
Sept 2017.
[36] X. Iturbe, K. Benkrid, C. Hong, A. Ebrahim, R. Torrego, I. Martinez, T. Arslan, and J. Perez,
“R3TOS: A Novel Reliable Reconfigurable Real-Time Operating System for Highly Adaptive, Effi-
cient, and Dependable Computing on FPGAs,” IEEE Transactions on Computers, vol. 62, pp. 1542–
1556, Aug 2013.
[37] A. Agne, M. Happe, A. Keller, E. Lübbers, B. Plattner, M. Platzner, and C. Plessl, "ReconOS: An
Operating System Approach for Reconfigurable Computing,” IEEE Micro, vol. 34, pp. 60–71, Jan
2014.
[38] R. Brodersen, A. Tkachenko, and H. K. H. So, “A unified hardware/software runtime environment
for FPGA-based reconfigurable computers using BORPH,” in Proceedings of the 4th International
Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS ’06), pp. 259–
264, Oct 2006.
[39] K. Fleming, H. J. Yang, M. Adler, and J. Emer, “The LEAP FPGA operating system,” in 2014
24th International Conference on Field Programmable Logic and Applications (FPL), pp. 1–8, Sept
2014.
[40] F. Hategekimana, T. Whitaker, M. J. H. Pantho, and C. Bobda, “Shielding non-trusted IPs in
SoCs,” in 2017 27th International Conference on Field Programmable Logic and Applications (FPL),
pp. 1–4, Sept 2017.
[41] M. Zhao and G. E. Suh, “FPGA-Based Remote Power Side-Channel Attacks,” in 2018 IEEE Sym-
posium on Security and Privacy (SP), pp. 229–244, May 2018.
[42] D. R. E. Gnad, F. Oboril, and M. B. Tahoori, “Voltage drop-based fault attacks on FPGAs us-
ing valid bitstreams,” in 2017 27th International Conference on Field Programmable Logic and
Applications (FPL), pp. 1–7, Sept 2017.
[43] C. Ramesh, S. B. Patil, S. N. Dhanuskodi, G. Provelengios, S. Pillement, D. Holcomb, and R. Tessier,
“FPGA Side Channel Attacks without Physical Access,” in 2018 IEEE 26th Annual International
Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 45–52, April 2018.
[44] D. Sidler, G. Alonso, M. Blott, K. Karras, K. Vissers, and R. Carley, “Scalable 10gbps tcp/ip stack
architecture for reconfigurable hardware,” in 2015 IEEE 23rd Annual International Symposium on
Field-Programmable Custom Computing Machines, pp. 36–43, May 2015.
[45] M. Saldana and P. Chow, “TMD-MPI: An MPI Implementation for Multiple Processors Across
Multiple FPGAs,” in 2006 International Conference on Field Programmable Logic and Applications,
pp. 1–6, Aug 2006.
[46] “ADM-PCIE-8K5 User Manual,” Tech. Rep. 1.9, Alpha Data, September 2017.
[47] “UltraScale Architecture-Based FPGAs Memory IP,” Tech. Rep. PG150 v1.4, Xilinx, April 2018.
[48] “10G/25G High Speed Ethernet Subsystem,” Tech. Rep. PG210 v2.4, Xilinx, June 2018.
[49] “DMA/Bridge Subsystem for PCI Express,” Tech. Rep. PG195 v4.1, Xilinx, April 2018.
[50] “AMBA AXI Protocol Specification,” Tech. Rep. 2.0, ARM, 2010.
[51] “AMBA 4 AXI4, AXI4-Lite, and AXI4-Stream Protocol Assertions User Guide,” Tech. Rep. r0p1,
ARM, 2012.
[52] “AXI Protocol Checker,” Tech. Rep. PG101 v2.0, Xilinx, April 2018.
[53] “AXI Interconnect,” Tech. Rep. PG059 v2.1, Xilinx, December 2017.
[54] “Partial Reconfiguration Decoupler,” Tech. Rep. PG227 v1.0, Xilinx, April 2016.
[55] D. Stiliadis and A. Varma, “Latency-rate servers: a general model for analysis of traffic scheduling
algorithms,” IEEE/ACM Transactions on Networking, vol. 6, pp. 611–624, Oct 1998.
[56] B. Akesson, K. Goossens, and M. Ringhofer, “Predator: A predictable sdram memory controller,” in
2007 5th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System
Synthesis (CODES+ISSS), pp. 251–256, Sept 2007.
[57] B. Akesson, A. Minaeva, P. Sucha, A. Nelson, and Z. Hanzalek, “An efficient configuration method-
ology for time-division multiplexed single resources,” in 21st IEEE Real-Time and Embedded Tech-
nology and Applications Symposium, pp. 161–171, April 2015.
[58] B. Akesson, L. Steffens, E. Strooisma, and K. Goossens, “Real-Time Scheduling of Hybrid Sys-
tems using Credit-Controlled Static-Priority Arbitration,” Tech. Rep. NXP-TN-2007-00119, NXP
Semiconductors, 2008.
[59] K. L. E. Law, “The bandwidth guaranteed prioritized queuing and its implementations,” in GLOBE-
COM 97. IEEE Global Telecommunications Conference. Conference Record, vol. 3, pp. 1445–1449
vol.3, Nov 1997.
[60] Z. Dai, M. Jarvin, and J. Zhu, “Credit borrow and repay: Sharing dram with minimum latency and
bandwidth guarantees,” in 2010 IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), pp. 197–204, Nov 2010.
[61] L. Woltjer, “Optimal DDR controller,” Master’s thesis, University of Twente, the Netherlands, Jan
2005.
[62] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “Memory bandwidth management for
efficient performance isolation in multi-core platforms,” IEEE Transactions on Computers, vol. 65,
pp. 562–576, Feb 2016.
[63] J. S. Hunter, “The exponentially weighted moving average,” Journal of Quality Technology, vol. 18,
pp. 203–210, 1986.
[64] C. P. Pfleeger and S. L. Pfleeger, Security in Computing. Prentice Hall Professional, 4 ed., 2013.
[65] “SDAccel Platform Reference Design User Guide,” Tech. Rep. UG1234 v2017.1, Xilinx, June 2017.
[66] R. Chandramouli, Secure Virtual Network Configuration for Virtual Machine (VM) Protection.
National Institute of Standards and Technology, March 2016.
[67] “IEEE Standard for Local and Metropolitan Area Networks—Virtual Bridged Local Area Net-
works,” IEEE Std 802.1Q-2005 (Incorporates IEEE Std 802.1Q1998, IEEE Std 802.1u-2001, IEEE
Std 802.1v-2001, and IEEE Std 802.1s-2002), pp. 1–300, May 2006.
[68] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T. Sridhar, M. Bursell, and C. Wright,
“Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer
2 Networks over Layer 3 Networks,” IETF RFC 7348, August 2014.
[69] P. Garg and Y. Wang, “NVGRE: Network Virtualization Using Generic Routing Encapsulation,”
IETF RFC 7637, September 2015.
[70] OpenFlow Switch Specification version 1.1.0, February 2011. http://archive.openflow.org/
documents/openflow-spec-v1.1.0.pdf.
[71] B. Ho, C. Pham-Quoc, T. N. Thinh, and N. Thoai, “A Secured OpenFlow-Based Switch Archi-
tecture,” in 2016 International Conference on Advanced Computing and Applications (ACOMP),
pp. 83–89, Nov 2016.
[72] V. B. Wijekoon, T. M. Dananjaya, P. H. Kariyawasam, S. Iddamalgoda, and A. Pasqual, “High
performance flow matching architecture for OpenFlow data plane,” in 2016 IEEE Conference on
Network Function Virtualization and Software Defined Networks (NFV-SDN), pp. 186–191, Nov
2016.
[73] “IEEE Standard for Local and metropolitan area networks–Media Access Control (MAC) Bridges
and Virtual Bridged Local Area Networks–Amendment 21: Edge Virtual Bridging,” IEEE Std
802.1Qbg-2012 (Amendment to IEEE Std 802.1Q-2011 as amended by IEEE Std 802.1Qbe-2011,
IEEE Std 802.1Qbc-2011, IEEE Std 802.1Qbb-2011, IEEE Std 802.1Qaz-2011, IEEE Std 802.1Qbf-
2011, and IEEE Std 802.aq-2012), July 2012.
[74] “IEEE Standard for Local and metropolitan area networks–Virtual Bridged Local Area Networks–
Bridge Port Extension,” IEEE Std 802.1BR-2012, July 2012.
[75] M. Attig and G. Brebner, “400 Gb/s Programmable Packet Parsing on a Single FPGA,” in 2011
ACM/IEEE Seventh Symposium on Architectures for Networking and Communications Systems,
pp. 12–23, Oct 2011.
[76] P. Benáček, V. Puš, and H. Kubátová, "P4-to-VHDL: Automatic Generation of 100 Gbps Packet
Parsers,” in 2016 IEEE 24th Annual International Symposium on Field-Programmable Custom
Computing Machines (FCCM), pp. 148–155, May 2016.
[77] K. Park, S. Baeg, S. Wen, and R. Wong, “Active-precharge hammering on a row induced failure
in DDR3 SDRAMs under 3 nm technology,” in 2014 IEEE International Integrated Reliability
Workshop Final Report (IIRW), pp. 82–85, Oct 2014.
Appendix A
AXI4 Protocol Assertions
Table A.1: AXI4 Protocol Write Address Channel Assertions
(Responses are given as Interconnect [53] / Mem controller [47].)

AXI ERRM AWADDR BOUNDARY: A write burst cannot cross a 4KB boundary.
    Interconnect: protocol error ignored. Mem controller: error, causes out of bounds access.
AXI ERRM AWADDR WRAP ALIGN: A write transaction with burst type WRAP has an aligned address.
    Interconnect: protocol error ignored. Mem controller: error, undefined behaviour.
AXI ERRM AWBURST: A value of 2'b11 on AWBURST is not permitted when AWVALID is High.
    Interconnect: protocol error ignored. Mem controller: defaults to INCR burst type, no error.
AXI ERRM AWLEN LOCK: Exclusive access transactions cannot have a length greater than 16 beats.
    Interconnect: protocol error ignored. Mem controller: exclusive access not supported, ignored, no error.
AXI ERRM AWCACHE: If not cacheable, AWCACHE = 2'b00.
    Interconnect: protocol error ignored. Mem controller: signal unused, no error.
AXI ERRM AWLEN FIXED: Transactions of burst type FIXED cannot have a length greater than 16 beats.
    Interconnect: protocol error ignored. Mem controller: FIXED burst type unsupported, defaults to INCR type, no error.
AXI ERRM AWLEN WRAP: A write transaction with burst type WRAP has a length of 2, 4, 8, or 16.
    Interconnect: protocol error ignored. Mem controller: error, undefined behaviour.
AXI ERRM AWSIZE: The size of a write transfer does not exceed the width of the data interface.
    Interconnect: error, data width converters may not operate correctly. Mem controller: error, interconnect error.
AXI ERRM AWVALID RESET: AWVALID is Low for the first cycle after ARESETn goes High.
    Interconnect and mem controller: PR reset and static region reset are not asserted at the same time, no error.
AXI ERRM AWxxxxx STABLE: Handshake checks; AWxxxxx must remain stable when AWVALID is asserted and AWREADY is Low.
    Interconnect: error, changing signals may affect interconnect functionality. Mem controller: error, interconnect error.
AXI ERRM AWREADY MAX WAIT: Recommended that AWREADY is asserted within MAXWAITS cycles of AWVALID being asserted.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
Table A.2: AXI4 Protocol Write Data Channel Assertions
(Responses are given as Interconnect [53] / Mem controller [47].)

AXI ERRM WDATA NUM: The number of write data items matches AWLEN for the corresponding address. This is triggered when any of the following occurs: write data arrives, WLAST is set, and the WDATA count is not equal to AWLEN; write data arrives, WLAST is not set, and the WDATA count is equal to AWLEN; ADDR arrives, WLAST is already received, and the WDATA count is not equal to AWLEN.
    Interconnect: error, may cause the interconnect to hang. Mem controller: error, interconnect error.
AXI ERRM WSTRB: Write strobes must only be asserted for the correct byte lanes, as determined by the start address, transfer size, and beat number.
    Interconnect: protocol error ignored. Mem controller: protocol error ignored.
AXI ERRM WVALID RESET: WVALID is Low for the first cycle after ARESETn goes High.
    Interconnect and mem controller: PR reset and static region reset are not asserted at the same time, no error.
AXI ERRM Wxxxxx STABLE: Handshake checks; Wxxxxx must remain stable when WVALID is asserted and WREADY is Low.
    Interconnect: error, changing signals may affect interconnect functionality. Mem controller: error, interconnect error.
AXI ERRM WREADY MAX WAIT: Recommended that WREADY is asserted within MAXWAITS cycles of WVALID being asserted.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
Table A.3: AXI4 Protocol Write Response Channel Assertions
(Responses are given as Interconnect [53] / Mem controller [47].)

AXI ERRM BRESP ALL DONE EOS: All write transaction addresses are matched with a corresponding buffered response.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM BRESP EXOKAY: An EXOKAY write response can only be given to an exclusive write access.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM BVALID RESET: BVALID is Low for the first cycle after ARESETn goes High.
    Interconnect and mem controller: PR reset and static region reset are not asserted at the same time, no error.
AXI ERRM BRESP AW: A slave must not take BVALID HIGH until after the write address handshake is complete.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM BRESP WLAST: A slave must not take BVALID HIGH until after the last write data handshake is complete.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM Bxxxxx STABLE: Handshake checks; Bxxxxx must remain stable when BVALID is asserted and BREADY is Low.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM BREADY MAX WAIT: Recommended that BREADY is asserted within MAXWAITS cycles of BVALID being asserted.
    Interconnect: error, not accepting the response will cause the interconnect to hang. Mem controller: error, interconnect error.
Table A.4: AXI4 Protocol Read Address Channel Assertions
(Responses are given as Interconnect [53] / Mem controller [47].)

AXI ERRM ARADDR BOUNDARY: A read burst cannot cross a 4KB boundary.
    Interconnect: protocol error ignored. Mem controller: error, causes out of bounds access.
AXI ERRM ARADDR WRAP ALIGN: A read transaction with burst type WRAP has an aligned address.
    Interconnect: protocol error ignored. Mem controller: error, undefined behaviour.
AXI ERRM ARBURST: A value of 2'b11 on ARBURST is not permitted when ARVALID is High.
    Interconnect: protocol error ignored. Mem controller: defaults to INCR burst type, no error.
AXI ERRM ARLEN LOCK: Exclusive access transactions cannot have a length greater than 16 beats.
    Interconnect: protocol error ignored. Mem controller: exclusive access not supported, ignored, no error.
AXI ERRM ARCACHE: If not cacheable, ARCACHE = 2'b00.
    Interconnect: protocol error ignored. Mem controller: signal unused, no error.
AXI ERRM ARLEN FIXED: Transactions of burst type FIXED cannot have a length greater than 16 beats.
    Interconnect: protocol error ignored. Mem controller: FIXED burst type unsupported, defaults to INCR type, no error.
AXI ERRM ARLEN WRAP: A read transaction with burst type WRAP has a length of 2, 4, 8, or 16.
    Interconnect: protocol error ignored. Mem controller: error, undefined behaviour.
AXI ERRM ARSIZE: The size of a read transfer does not exceed the width of the data interface.
    Interconnect: error, data width converters may not operate correctly. Mem controller: error, interconnect error.
AXI ERRM ARVALID RESET: ARVALID is Low for the first cycle after ARESETn goes High.
    Interconnect and mem controller: PR reset and static region reset are not asserted at the same time, no error.
AXI ERRM ARxxxxx STABLE: Handshake checks; ARxxxxx must remain stable when ARVALID is asserted and ARREADY is Low.
    Interconnect: error, changing signals may affect interconnect functionality. Mem controller: error, interconnect error.
AXI ERRM ARREADY MAX WAIT: Recommended that ARREADY is asserted within MAXWAITS cycles of ARVALID being asserted.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
Table A.5: AXI4 Protocol Read Data Channel Assertions
(Responses are given as Interconnect [53] / Mem controller [47].)

AXI ERRM RLAST ALL DONE EOS: All outstanding read bursts must have completed.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM RDATA NUM: The number of read data items must match the corresponding ARLEN.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM RID: The read data must always follow the address that it relates to. If IDs are used, RID must also match the ARID of an outstanding address read transaction. This violation can also occur when RVALID is asserted with no preceding AR transfer.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM RRESP EXOKAY: An EXOKAY read response can only be given to an exclusive read access.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM RVALID RESET: RVALID is Low for the first cycle after ARESETn goes High.
    Interconnect and mem controller: PR reset and static region reset are not asserted at the same time, no error.
AXI ERRM Rxxxxx STABLE: Handshake checks; Rxxxxx must remain stable when RVALID is asserted and RREADY is Low.
    Interconnect and mem controller: signals from the static region do not need to be checked, no error.
AXI ERRM RREADY MAX WAIT: Recommended that RREADY is asserted within MAXWAITS cycles of RVALID being asserted.
    Interconnect: error, not accepting the response will cause the interconnect to hang. Mem controller: error, interconnect error.
Table A.6: AXI4 Protocol Exclusive Access Assertions
(Responses are given as Interconnect [53] / Mem controller [47].)

AXI ERRM EXCL ALIGN: The address of an exclusive access is aligned to the total number of bytes in the transaction.
    Interconnect: protocol error ignored. Mem controller: exclusive access not supported, ignored, no error.
AXI ERRM EXCL LEN: The number of bytes to be transferred in an exclusive access burst is a power of 2, that is, 1, 2, 4, 8, 16, 32, 64, or 128 bytes.
    Interconnect: protocol error ignored. Mem controller: exclusive access not supported, ignored, no error.
AXI ERRM EXCL MATCH: Recommended that the address, size, and length of an exclusive write with a given ID are the same as the address, size, and length of the preceding exclusive read with the same ID.
    Interconnect: protocol error ignored. Mem controller: exclusive access not supported, ignored, no error.
AXI ERRM EXCL MAX: 128 is the maximum number of bytes that can be transferred in an exclusive burst.
    Interconnect: protocol error ignored. Mem controller: exclusive access not supported, ignored, no error.
AXI ERRM EXCL PAIR: Recommended that every exclusive write has an earlier outstanding exclusive read with the same ID.
    Interconnect: protocol error ignored. Mem controller: exclusive access not supported, ignored, no error.