Hardware Acceleration Landscape for Distributed Real-time Analytics: Virtues and Limitations
Mohammadreza Najafi, Kaiwen Zhang, Hans-Arno Jacobsen (Technical University of Munich)
Mohammad Sadoghi (Purdue University)
Abstract—We are witnessing a technological revolution with a broad impact ranging from daily life (e.g., personalized medicine and education) to industry (e.g., data-driven healthcare, commerce, agriculture, and mining). At the core of this transformation lies “data”. This transformation is facilitated by embedded devices, collectively known as the Internet of Things (IoT), which produce real-time feeds of sensor data that are collected and processed to produce a dynamic physical model used for optimized real-time decision making. At the infrastructure level, there is a need to develop a scalable architecture for processing massive volumes of present and historical data at an unprecedented velocity to support the IoT paradigm. To cope with such extreme scale, we argue for the need to revisit the hardware and software co-design landscape in light of two key technological advancements. First is the virtualization of computation and storage over highly distributed data centers spanning across continents. Second is the emergence of a variety of specialized hardware accelerators that complement traditional general-purpose processors. Further efforts are required to unify these two trends in order to harness the power of big data. In this paper, we present a formulation and characterization of the hardware acceleration landscape geared towards real-time analytics in the cloud. Our goal is to assist both researchers and practitioners navigating the newly revived field of software and hardware co-design for building next-generation distributed systems. We further present a case study to explore software and hardware interplay for designing distributed real-time stream processing.
Traditionally, the data management community has developed specialized solutions that focus on either extreme of a data spectrum between volume and velocity, yielding large-volume batch processing systems (e.g., Hadoop, Spark) and high-velocity stream processing systems (e.g., Flink, Storm, Spark Streaming), respectively. However, the current trend, led by the Internet of Things (IoT) paradigm, leans towards the large-volume processing of rich data produced by distributed sensors in real-time at a high velocity for reactive and automated decision making. While the aforementioned specialized platforms have proven to achieve a certain balance between speed and scale, their performance is still inadequate in light of emerging real-time applications.
This remarkable shift towards big data presents an interesting opportunity to study the interplay of software and hardware in order to understand the limitations of the current co-design space for distributed systems, which must be fully explored before resorting to specialized systems such as ASICs, FPGAs, and GPUs. Each hardware accelerator has a unique performance profile with enormous potential for speed and size to compete with or to complement CPUs, as envisioned in
[Figure: latency spectrum (lower is better) across application-specific integrated circuits, field-programmable gate arrays, and general-purpose processors, ranging from under 1–100 microseconds up to more than 10 minutes.]

Fig. 1: Envisioned acceleration technology outlook.
Figure 1. In this paper, we primarily focus on cost-effective, power-efficient hardware acceleration solutions which excel at analytical computations by tapping into inherent low-level hardware parallelism. To motivate the adoption of these emerging hardware accelerators, we describe the four primary challenges faced by today’s general-purpose processors.
Large & complex control units. The design of general-purpose processors is based on the execution of consecutive operations on data residing in main memory. This architecture must guarantee a correct sequential execution order. As a result, processors include complex logic to increase performance (e.g., super-pipelining and out-of-order execution). Any performance gain comes at the cost of devoting up to 95% of resources (i.e., transistors) to these control units.
Memory wall & von Neumann bottleneck. The current computer architecture suffers from the limited bandwidth between CPU and memory. This issue is referred to as the memory wall and is becoming a major scalability limitation as the gap between CPU and memory speed increases. To mitigate this issue, processors have been equipped with large cache units, but their effectiveness heavily depends on the memory access patterns. Additionally, the von Neumann bottleneck further contributes to the memory wall by sharing the limited memory bandwidth between both instructions and data.
Redundant memory accesses. Present-day systems enforce that data arriving from an I/O device is first read/written to main memory before it is processed by the CPU, resulting in a substantial loss of memory bandwidth. Consider a simple data stream filtering operation that does not require the incoming data stream to be first written to main memory. In theory, the data arriving from the I/O should be streamed directly through the processor; however, today’s computer architecture prevents this basic modus operandi.

[Figure: a general-purpose processor (large control unit, small execution engine, high power consumption, redundant memory accesses over a shared data bus, memory wall) contrasted with a custom stream processing element (large execution engine, small control unit, selective external memory access, low power consumption).]

Fig. 2: General-purpose processor architecture vs. custom hardware architecture.
Power consumption. Manufacturers often aim at increasing transistor density together with higher clock speeds, but an increase in clock speed (requiring a higher driving voltage for the chip) results in a superlinear increase in power consumption and a greater need for heat dissipation.
These performance-limiting factors in today’s computer architecture have generated a growing interest in accelerating distributed data management and data stream processing using custom hardware solutions, as well as more general hardware acceleration at cloud scale. In our past work on designing custom hardware accelerators for data streams (see Figure 2), we have demonstrated the use of a much simpler control logic with better usage of chip area, thereby achieving a higher performance-per-transistor ratio. We can substantially reduce the memory wall overhead by coupling processor and local memory, instantiating as many of these components as necessary. Through this coupling, redundant memory accesses are reduced by avoiding copying and reading memory whenever possible. The use of many low-frequency, but specialized, chips may circumvent the need for generalized but high-frequency processors. Despite the great potential of hardware accelerators, there is a pressing need to navigate this complex and growing landscape before solutions can be deployed in a large-scale environment.
However, today’s hardware accelerators come in a variety of forms, as demonstrated in Figure 3, ranging from embedded hardware features, e.g., hardware threading and Single-Instruction Multiple Data (SIMD), in general-purpose processors (e.g., CPUs) to more specialized processors such as GPUs, FPGAs, and ASICs. For example, graphics processing units (GPUs) offer massive parallelism and floating-point computation based on an architecture equipped with thousands of lightweight cores coupled with substantially higher memory bandwidth compared to CPUs. Due to the fixed architecture of GPUs, only specific types of applications or algorithms can benefit from their superior parallelism, such as tasks that follow a regular pattern (e.g., matrix computation). In contrast, a field-programmable gate array (FPGA) is an integrated circuit that can be configured to encode any algorithm, even irregular ones. FPGAs can be configured to construct an optimal architecture of custom cores engineered for a targeted algorithm. FPGAs contain compute components, namely configurable logic blocks (CLBs) consisting of Lookup Tables (LUTs) and SRAM; memory components, including registers, distributed RAMs, and Block RAMs (BRAMs); and communication components, consisting of configurable interconnects and wiring. Any circuit design developed for FPGAs can be directly burned onto application-specific integrated circuits (ASICs), which provide greater performance in exchange for flexibility. Although the remainder of this paper focuses on FPGAs, much of the content discussed is equally applicable to other forms of acceleration.

[Figure: accelerator spectrum. Single-Instruction Multiple Data (SIMD): Oracle SPARC DAX, AltiVec, SSE, AVX; Graphics Processing Unit (GPU): discrete and integrated GPUs (Nvidia, AMD, Intel); Field-Programmable Gate Array (FPGA): Xilinx and Altera FPGAs; Application-Specific Integrated Circuit (ASIC): SPARC M7 microprocessor.]

Fig. 3: Accelerator spectrum.
To further narrow down our discussion, we focus mostly on distributed real-time analytics over continuous data streams using FPGAs. To cope with the lack of flexibility of custom hardware solutions, the state of the art assumes that the set of queries is known in advance. Essentially, past works rely on the compilation of static queries and fixed stream schemas onto hardware designs synthesized to configure FPGAs. Depending on the complexity of the design, this synthesis step can last minutes or even hours. This is not suitable for modern-day stream processing needs, which require fast, on-the-fly reconfiguration. Furthermore, many of the existing approaches assume a complete halt during any synthesis modification. While synthesis and stream processing may overlap to some extent, a significant amount of time and effort is still required to reconfigure, which may take several minutes up to hours. More importantly, this approach requires additional logic for buffering, handling of dropped tuples, requests for re-transmissions, and additional data flow control tasks, which renders this style of processing difficult, if not impossible, in practice. These concerns are often ignored in the approaches listed above, which assume that processing stops entirely before a new query-stream processing cycle starts.
In this paper, we make the following contributions. First, we present a comprehensive formalization of the acceleration landscape over distributed heterogeneous hardware in the context of real-time analytics over data streams. In our formalization, we incorporate recent attempts to address the aforementioned shortcomings of custom hardware solutions.
[Figure: acceleration design landscape organized into four layers. System Model: standalone, co-placement, co-processor. Programming Model: declarative languages (SQL-based); procedural languages; hardware description languages (VHDL, Verilog, SystemC, TLM); general-purpose languages (C, C++, Java); parallel programming languages and APIs (CUDA, OpenCL, OpenMP). Representational Model: static compiler (Glacier); parametrized circuits (skeleton automata, fpga-ToPSS, OPB, FQP, Ibex, IBM Netezza); temporal/spatial instructions; parametrized data segments (FQP). Algorithmic Model: bi-directional flow (handshake join); multi-query optimization (Rete-like network); Boolean formula precomputation (Ibex).]

Fig. 4: Acceleration design landscape.
Second, we study the fundamental strategy of finding the right balance between the data and control flow on hardware for designing highly distributed stream joins, which are among the most expensive operations in SQL-based stream processing. Our experience working with stream joins on FPGAs represents a case study whose findings and challenges are applicable to other areas found in our acceleration landscape.
II. DESIGN LANDSCAPE
To formalize the acceleration landscape, we develop a broad classification derived from a recent comprehensive FPGA and GPU tutorial. As presented in Figure 4, we divide the design space into four categories: system model, programming model, representational model, and algorithmic model.
In the system model layer, we focus on a holistic view of the entire distributed compute infrastructure, possibly hosted virtually in the cloud, in order to identify how accelerators can be employed when considering the overall system architecture. We identify three main deployment categories in distributed systems: standalone, co-placement, and co-processor. The standalone mode is the simplest architecture, where the entire software stack is embedded in hardware. Past works on event and stream processing, where the processing tasks are carried out entirely on FPGAs, can be categorized as standalone mode. The co-processor design can be viewed as a co-operative method to offload and/or distribute (partial) computation from the host system (e.g., running on CPUs) to accelerators (e.g., FPGAs or GPUs). For instance, FPGA and CPU co-operation is exploited in prior work where the Shore-MT database is modified to offload regular expression computations, expressed as user-defined functions in SQL queries, to FPGAs. In contrast, the co-placement design is focused on identifying the best strategy to place the accelerators on the data path in a distributed system. This mode of computation can be characterized as performing a partial or best-effort computational model. For instance, this model is applied in IBM Netezza.
We refer to an active data path as an alternative data-centric view of the system model. In a distributed setting, each piece of data travels from a source (data producer) to a destination (data consumer), passing through the network and temporarily residing in the storage and memory of intermediate nodes. Usually, the actual data computation task is performed close to the destination using CPUs. Instead, an active data path distributes processing tasks along the entire path to various network, storage, and memory components by making them “active”, i.e., coupled with an accelerator. Thus, a wide range of partial or best-effort computations can be employed using co-placement or co-processor strategies, which are building blocks towards creating active distributed systems.
The programming model focuses on the usability of heterogeneous hardware accelerators. Akin to procedural programming languages such as C and Java, there are numerous hardware description languages, such as VHDL, Verilog, SystemC, and TLM, that expose key attributes of hardware in a C-like syntax. Unlike the software mindset, which is focused on algorithm implementation (e.g., using conditional and loop statements), programming in hardware must also consider the data and control flow mapping onto logic gates and wiring. To simplify the usability of hardware programming, recent works have shifted towards SQL-based declarative languages using static (Glacier) and dynamic (FQP) compilers. Essentially, these compilers generate a hardware description language when given SQL queries as input. Glacier converts a query into a final circuit design by composing operator-based logic blocks. Once synthesized to FPGA, the design cannot be altered. In contrast, FQP starts with a fixed topology of programmable cores (where each core executes a variety of SQL operators such as selection, projection, or joins) that are first synthesized on FPGA. Subsequently, the FQP compiler generates a dynamic mapping of queries onto the FQP topology at runtime.

[Figure: query execution on a general-purpose processor (a fetch/decode/execute/memory pipeline over main memory, fed by a compiler, optimizer, and loader for queries and data streams) vs. the Flexible Query Processor (FQP), a fabric of OP-Blocks and custom blocks (CBs) fed by a distributor using skeleton metadata.]

Fig. 5: Query execution on general-purpose processor (software) vs. flexible hardware solution.
The data and control flow can be realized on hardware using either static or dynamic circuits, which we refer to as the representational model. The most basic form is a static circuit (also the best performing), which has fixed logic and hard-coded wiring that cannot be altered at runtime. However, there is a wide spectrum when moving from a static to a fully dynamic design. The first level of dynamism is named parametrized circuits. In the context of stream processing, this translates to the ability to change the selection or join conditions at runtime without the need to re-synthesize the design.
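As a software analogy for a parametrized circuit, consider a selection operator whose comparison structure is fixed while its operands live in mutable "registers" that can be rewritten at runtime, mimicking a condition change without re-synthesis. This is only an illustrative sketch; the class and method names are invented, not taken from FQP or Ibex.

```python
# Sketch: the comparison datapath is fixed; only its parameters
# (field, operator, constant) are writable at runtime, analogous to
# updating condition registers on a parametrized circuit.

class ParametrizedSelect:
    def __init__(self, field, op, value):
        self.params = {"field": field, "op": op, "value": value}

    def reprogram(self, field, op, value):
        # Equivalent to writing new condition parameters into
        # on-chip registers; the "circuit" itself is untouched.
        self.params.update(field=field, op=op, value=value)

    def matches(self, tuple_):
        ops = {">": lambda a, b: a > b,
               "<": lambda a, b: a < b,
               "=": lambda a, b: a == b}
        p = self.params
        return ops[p["op"]](tuple_[p["field"]], p["value"])

sel = ParametrizedSelect("age", ">", 25)
print(sel.matches({"age": 30}))   # True
sel.reprogram("age", "<", 20)     # runtime update, no "re-synthesis"
print(sel.matches({"age": 30}))   # False
```

In hardware, the analogue of `reprogram` is writing a parameter word into the operator's condition registers, which is orders of magnitude cheaper than re-synthesizing the design.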
To accelerate complex event processing, fpga-ToPSS focuses on hiding the latency of accessing off-chip memory (used for storing dynamic queries) by exploiting fast on-chip memory (for storing static queries). Furthermore, the on-chip memory is coupled with computation using horizontal data partitioning, which eliminates global communication cost. To accelerate XML processing, the notion of skeleton automata allows the decomposition of non-deterministic finite-state automata encoding XPath queries into a static structural skeleton, which is embedded in logic gates, while query conditions are dynamically stored in memory at runtime. Ibex provides co-operation between hardware (to partially push and compute queries on FPGAs) and software (by pre-computing and passing arbitrary Boolean selection conditions onto hardware). Online-Programmable Blocks (OP-Blocks) introduce dynamically re-configurable logic blocks to implement selection, projection, and join operations, where the conditions of each operator can seamlessly be adjusted at runtime. A collection of OP-Blocks and general Custom Blocks (CBs) together form the Flexible Query Processor (FQP) architecture. FQP is a general streaming architecture that can be implemented in fully standalone mode and replace traditional software solutions, as demonstrated in Figure 5. Q100 supports query plans of arbitrary size by horizontally partitioning them into fixed sets of pipelined stages of SQL operators using the proposed temporal and spatial instructions. IBM Netezza is a major commercial product that exploits parametrizable circuits in order to offload query computation.

[Figure: common FPGA-based solutions must apply changes in the hardware model (time-consuming and costly: hours to months), re-synthesize (minutes to days), halt normal system operation with costly data flow control, and resume operation (seconds to minutes); FQP instead maps new operators (microseconds to milliseconds) and applies them (microseconds) at runtime.]

Fig. 6: Comparing standard vs. flexible query execution pipeline on a reconfigurable computing fabric.
The next level of dynamism is to support schemas of varying size, as far as the wiring width is concerned, in order to connect and transfer tuples among computation cores on FPGAs. One solution is to simply provision the wiring for the maximum possible schema size; this option is impractical given the cost of wiring and routing. To support dynamic schemas, parametrized data segments are proposed in FQP through vertical partitioning of both query and data given a fixed wiring and routing budget.
The final level of dynamism, which we call parametrized topology, is the ability to modify both the structure of queries (i.e., macro changes) and the conditions of each operator within a query (i.e., micro changes). This provides support for the real-time addition and modification of queries (see Figure 6). FQP supports parametrized topologies, which allow the construction of complex queries by composing re-configurable logic blocks in the form of OP-Blocks, where the composition itself is also re-configurable (see the example in Figure 7).
The algorithmic model is the lowest layer of our acceleration landscape. To achieve higher throughput and lower latency, the following goals must be met. First, we must exploit the available massive parallelism by minimizing the design’s control flow. Generally speaking, there are three design patterns: data parallelism, task parallelism, and pipeline parallelism. Simply put, data parallelism focuses on executing the same task concurrently over partitioned data (e.g., SIMD), while task parallelism is focused on executing concurrent and independent tasks over replicated/partitioned data (e.g., Intel hyper-threading). Pipeline parallelism, arguably the most important design pattern on hardware, is to break a task into a sequence of sub-tasks, where the data flows from one sub-task to the next. The second algorithmic goal is to minimize the wiring and routing overhead to maximize the data flow. The third algorithmic goal is to maximize the memory bandwidth through tight coupling of memory and logic and by exploiting the memory hierarchy effectively, starting from the fastest to the slowest memory, namely registers, on-chip memory, and off-chip memory. Notably, faster memory often equates to lower density, implying that the amount of memory available through registers is substantially smaller than that of on-chip memory, and likewise for off-chip memory.

[Figure: a query plan combining a selection (Age > 25 & Gender = female) with joins over Product ID, mapped onto OP-Blocks as σ Age > 25 and σ Gender = female feeding joins over Product ID with windows of 1536 and 2048 tuples.]

Fig. 7: An example of query plan assignment in FQP.
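The pipeline-parallel pattern above can be sketched in software with chained generators: tuple i+1 can enter the first stage while tuple i is still in a later stage (on hardware, each stage would be a separate circuit region clocked in lockstep). The stage names and the toy query are illustrative only.

```python
# Minimal sketch of pipeline parallelism: a task split into three
# sub-tasks (parse -> filter -> project), each a generator stage
# that passes tuples to the next.

def parse(stream):
    for raw in stream:
        yield dict(zip(("id", "price"), raw))

def filter_stage(tuples):
    for t in tuples:
        if t["price"] > 10:   # illustrative selection condition
            yield t

def project(tuples):
    for t in tuples:
        yield t["id"]

raw = [(1, 5), (2, 20), (3, 15)]
pipeline = project(filter_stage(parse(raw)))
print(list(pipeline))  # [2, 3]
```

On an FPGA, the throughput of such a chain is set by the slowest stage, which is why balancing sub-task depth matters as much as the decomposition itself.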
For instance, to support multi-query optimization, a global query plan based on a Rete-like network is constructed to exploit both inter- and intra-query parallelism, essentially leveraging all forms of parallelism mentioned earlier, including data, task, and pipeline parallelism. For faster data pruning to improve memory bandwidth, hash-based indexing methods are employed to efficiently identify all relevant queries for incoming tuples. To avoid designing complex adaptive circuitry, Ibex proposes precomputing a truth table for Boolean expressions in software first and transferring the truth table into hardware during FPGA configuration when a new query is inserted. There also exist a number of works focused on the stream join, which is one of the most expensive operations in real-time analytics. For example, handshake join introduced a new data flow that naturally directs streams in opposite directions in order to simplify the control flow, which we categorize as bi-directional flow (bi-flow), while SplitJoin introduced a novel top-down flow, which we refer to as uni-directional flow (uni-flow), to completely eliminate all control flows among join cores and allow them to operate independently. The rest of the paper focuses on stream joins as a case study. Our experience working with stream joins on FPGAs reveals some conclusions which can be generalized to other algorithms and the entire acceleration landscape.
III. CASE STUDY: FLOW-BASED PARALLEL STREAM JOIN
The stream join operation inherits the relational database join semantics by introducing a sliding window to transform unbounded streams, R and S, into finite sets of tuples (i.e., relations). Although sliding window semantics provide a robust abstraction to deal with the unbounded nature of data streams, it remains a challenge to improve parallelism within stream join processing, especially when leveraging many-core systems.
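The sliding-window semantics above can be sketched as a count-based windowed equi-join: each arriving tuple is probed against the other stream's window, then inserted into its own window, evicting the oldest tuple once the window is full. The window size and key function are illustrative, not from the paper's experiments.

```python
from collections import deque

W = 4  # count-based window size per stream (illustrative)

def window_join(arrivals, key=lambda t: t[0]):
    """arrivals: sequence of (stream, tuple) with stream in {"R", "S"}."""
    win = {"R": deque(maxlen=W), "S": deque(maxlen=W)}
    out = []
    for stream, t in arrivals:
        other = "S" if stream == "R" else "R"
        # Probe the opposite window (nested-loop variant).
        out += [(t, u) for u in win[other] if key(t) == key(u)]
        # Insert; deque(maxlen=W) expires the oldest tuple automatically.
        win[stream].append(t)
    return out

arrivals = [("R", (1, "a")), ("S", (1, "x")), ("S", (2, "y")), ("R", (2, "b"))]
print(window_join(arrivals))  # [((1, 'x'), (1, 'a')), ((2, 'b'), (2, 'y'))]
```

The parallelization question addressed in the rest of this section is how to split each `win[...]` into sub-windows across cores while keeping exactly this probe-then-insert semantics.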
[Figure: two arrangements of join cores (JC): (a) a linear chain through which the two streams flow in opposite directions, and (b) independent cores that all receive tuples through a single top-down path.]

Fig. 8: Parallel stream join models: (a) bi-flow and (b) uni-flow.
For instance, a single sliding window could conceptually be divided into many smaller sub-windows, where each sub-window could be assigned to a different join core.1 However, distributing a single logical stream onto many independent cores introduces a new coordination challenge: to guarantee that each incoming tuple in one stream is compared exactly once with all tuples in the other stream. To cope with parallelism in this context, two key approaches have been proposed: handshake join and SplitJoin.
Bi-directional Data Flow For Stream Join: The coordination challenge is addressed by handshake join (and its equivalent realization as OP-Chain in hardware), which transforms the stream join into a data flow (bi-flow) problem (evolved from Kang’s three-step procedure): tuples flow from left-to-right (for the S stream) and from right-to-left (for the R stream) and pass through each join core. We refer to this data flow as bi-directional, since tuples pass in two directions through the join cores.
We broadly classify all join techniques based on similar designs as belonging to the bi-flow model; this data flow model provides scalable parallelism to increase processing throughput. The bi-flow design ensures that every tuple is compared exactly once by design (shown in Figure 8). This data flow model offers greater processing throughput by increasing parallelism, yet suffers from increased latency since the processing of a single incoming tuple requires a sequential flow through the entire processing pipeline. To improve latency, central coordination is introduced to fast-forward tuples through the linear chain of join cores in the low-latency handshake join (the coordination module is depicted on the left side of Figure 8). In order to reduce latency, each tuple of each stream is replicated and forwarded to the next join core before the join computation is carried out by the current core.
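A toy simulation of the bi-flow data path can make the correctness argument concrete: S tuples visit the cores left-to-right, R tuples right-to-left, and each tuple is joined against every core's sub-window of the opposite stream as it passes. This sketch models only the exactly-once comparison property, not the timing or the distributed storage policy of handshake join (here each tuple is simply stored at its entry core, a simplification).

```python
# Toy bi-flow (handshake-join-style) pass: tuples traverse the chain
# in stream-specific directions and probe each core's opposite window.

class JoinCore:
    def __init__(self):
        self.win = {"R": [], "S": []}  # per-core sub-windows

def handshake_pass(cores, stream, t, key=lambda x: x[0]):
    order = cores if stream == "S" else reversed(cores)  # bi-directional
    other = "R" if stream == "S" else "S"
    matches = [(t, u) for c in order for u in c.win[other]
               if key(t) == key(u)]
    # Simplified storage policy: keep the tuple at its entry core.
    entry = cores[0] if stream == "S" else cores[-1]
    entry.win[stream].append(t)
    return matches

cores = [JoinCore() for _ in range(3)]
handshake_pass(cores, "S", (1, "x"))
print(handshake_pass(cores, "R", (1, "a")))  # [((1, 'a'), (1, 'x'))]
```

The latency drawback described above is visible here: a tuple must walk the whole chain before its result set is complete, which is what the low-latency variant's fast-forwarding mitigates.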
Figure 10 demonstrates the internal blocks and their connections in a join core needed to realize the bi-flow model. Window Buffer-R and Window Buffer-S are sliding window buffers, and Buffer Manager-R and Buffer Manager-S are responsible for sending/receiving tuples to/from neighboring join cores and storing/expiring tuples to/from the window buffers. The Coordinator Unit controls permissions and priorities to manage data communication requests to ensure query processing correctness; finally, the Processing Unit is the execution core of the join core. It is activated when a new tuple is inserted. Then, based on the programmed queries (here, a join operator), it processes the new tuple by comparing it to the entire window of the other stream.
1A join core is an abstraction that could apply to a processor’s core, a compute node in a cluster, or a custom-hardware core on FPGAs.
[Figure: incoming tuples pass through a tree of DNodes that broadcasts them to eight join cores, whose outputs feed a result gathering network.]

Fig. 9: Uni-flow parallel stream join hardware architecture.
Uni-directional Data Flow For Stream Join: A fundamentally new approach for computing a stream join, referred to as SplitJoin, has been proposed. SplitJoin essentially changes the way tuples enter and leave the sliding windows, namely, by dropping the need to have separate left and right data flows (i.e., bi-directional flow). SplitJoin diverts from the bi-directional data flow-oriented processing of existing approaches. As illustrated in Figure 8, SplitJoin introduces a single top-down data flow that transforms the overall distributed architecture. First, the join cores are no longer chained linearly (i.e., avoiding linear latency overhead). In fact, they are now completely independent (i.e., also avoiding inter-core communication overhead). Second, both streams travel through a single path entering each join core, thus eliminating all complexity due to potential race conditions caused by in-flight tuples and the complexity of ensuring the correct tuple-arrival order: by relying on the FIFO property, the ordering requirement is trivially satisfied by using a single (logical) path. Therefore, SplitJoin does not rely on any central coordination for propagating and ordering the input/output streams. Third, the communication path can be fully utilized to sustain the maximum throughput, and each tuple no longer needs to pass through every join core. SplitJoin introduces what we here refer to as the unidirectional (top-down) data flow model, uni-flow, in contrast to the bi-flow model introduced earlier.
To parallelize join processing, the uni-flow model divides the sliding window into smaller sub-windows, and each sub-window is assigned to a join core. In the uni-flow model, all join cores receive each new incoming tuple. In each join core, depending on the tuple’s origin, i.e., whether from the R or S stream, the distributed processing and storage steps are orchestrated accordingly. For example, if the incoming tuple belongs to the R stream, TR, then all processing cores compare TR to all the tuples in their S-stream sub-windows. At the same time, TR is stored in exactly one join core’s sub-window for the R stream. The join core that stores a newly arriving tuple is chosen based on a round-robin scheme. Each join core independently counts (separately for each stream) the number of tuples received and, based on its position among the other join cores, determines its turn to store an incoming tuple. This storage mechanism eliminates the need for a central coordinator for tuple assignment, which is a key contributing factor in achieving (almost) linear scalability in the uni-flow model as the number of join cores increases.
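The uni-flow discipline described above can be sketched as follows: every core probes its local sub-window of the opposite stream, but only the core whose turn it is (a round-robin decision derived locally from a per-stream counter and the core's own position) stores the tuple. Class and method names are illustrative, not from SplitJoin's implementation; window expiry is omitted for brevity.

```python
# Sketch of uni-flow (SplitJoin-style) processing: no coordinator,
# each core decides its storage turn from purely local state.

class UniFlowCore:
    def __init__(self, position, n_cores):
        self.pos, self.n = position, n_cores
        self.win = {"R": [], "S": []}    # per-stream sub-windows
        self.count = {"R": 0, "S": 0}    # tuples seen per stream

    def receive(self, stream, t, key=lambda x: x[0]):
        other = "S" if stream == "R" else "R"
        # Every core probes its opposite-stream sub-window.
        matches = [(t, u) for u in self.win[other] if key(t) == key(u)]
        # Exactly one core stores the tuple, decided locally.
        if self.count[stream] % self.n == self.pos:
            self.win[stream].append(t)
        self.count[stream] += 1
        return matches

cores = [UniFlowCore(i, 4) for i in range(4)]
results = []
for stream, t in [("R", (1, "a")), ("S", (1, "x")),
                  ("S", (2, "y")), ("R", (2, "b"))]:
    for c in cores:                      # broadcast to all cores
        results += c.receive(stream, t)
print(results)  # [((1, 'x'), (1, 'a')), ((2, 'b'), (2, 'y'))]
```

Because the storage decision needs only the local counter and position, adding cores changes no shared state, which is the source of the near-linear scalability claimed above.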
IV. REALIZING FLOW-BASED STREAM JOIN IN HARDWARE
We focus on the design and realization of a parallel and distributed stream join in hardware based on the flow-based model. Furthermore, we compare the join core internal architecture for both the uni-flow and bi-flow models. Our parallel uni-flow hardware stream join architecture comprises three main parts: (1) a distribution network, (2) processing (join cores), and (3) a result gathering network, as shown in Figure 9.
Distribution Network: The distribution network is responsible for transferring incoming tuples from the system input to all join cores. In this work, we present two alternatives for this network: (1) a lightweight design and (2) a scalable design.
The lightweight network distributes incoming tuples to all join cores at once without extra components, which is preferable for comparably small solutions, while the scalable variant uses a hierarchical architecture for the distribution. Here, we only present the design of the scalable network, while our experiments include evaluations of both.
In the scalable distribution network, we use DNodes to build a hierarchical network. A DNode receives a tuple on its input port and broadcasts it to all of its output ports. All DNodes rely on the same communication protocol, making it straightforward to scale the design by cascading them. DNodes store incoming tuples as long as their internal buffer is not full. As output, each DNode sends out the stored tuples, one tuple per clock cycle, provided the next DNodes are not full. The upper part of Figure 9 demonstrates the distribution network comprised of DNodes. Here, we see a 1 → 2 fan-out from each level to the next, from top to bottom. Other fan-out sizes (e.g., 1 → 4) could be interesting to explore since they reduce the height of the distribution network and lower the communication latency.
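The fan-out/height trade-off can be sketched with a recursive broadcast: each node forwards a tuple to its children, so n join cores are reached in roughly ceil(log_fanout(n)) hops. DNode buffering and backpressure are deliberately omitted; this only demonstrates the topology and its logarithmic depth.

```python
import math

# Sketch of the hierarchical broadcast tree formed by DNodes.
# Returns the deliveries (one per join core) and the tree depth.

def distribute(core_ids, tuple_, fanout=2, hops=0):
    if len(core_ids) == 1:
        return [(core_ids[0], tuple_)], hops   # reached a join core
    step = math.ceil(len(core_ids) / fanout)   # split among children
    deliveries, depth = [], hops
    for i in range(0, len(core_ids), step):
        d, h = distribute(core_ids[i:i + step], tuple_, fanout, hops + 1)
        deliveries += d
        depth = max(depth, h)
    return deliveries, depth

deliveries, depth = distribute(list(range(8)), ("R", 42))
print(len(deliveries), depth)  # 8 3  -> 8 cores reached in 3 hops

# A wider fan-out (e.g., 1 -> 4) lowers the tree, and so the latency:
_, depth4 = distribute(list(range(8)), ("R", 42), fanout=4)
print(depth4)  # 2
```

As in the hardware design, the depth adds a fixed number of cycles of latency per tuple but leaves the steady-state insertion throughput unchanged, since every level forwards one tuple per cycle.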
The scalable distribution network consumes more resources (i.e., the DNodes’ pipeline buffers) than the lightweight variant and adds a few clock cycles, depending on its height, to the distribution latency (though it does not affect the tuple insertion throughput). On the other hand, the scalable distribution network pays off as the number of join cores increases, since it does not suffer from the clock frequency drop (which degrades performance) observed in the lightweight design.

Fig. 10: Bi-flow join core design.

[Figure: the uni-flow join core used to realize SplitJoin in hardware.]

Fig. 11: Uni-flow join core design.

[Figure: state diagram of the Storage Core controller: from IDLE, a new join operator can be programmed; on a new R or S tuple, the controller either stores it (Store in Window R / Store in Window S) on its turn or skips storage (Not Store Turn).]

Fig. 12: Storage core.

[Figure: state diagram of the Processing Core controller: after reading the join operator, each new tuple triggers the join processing state, which repeats for every tuple in the sliding window.]

Fig. 13: Processing core.
The DNode arrangement, as shown in Figure 9, forms a pipelined distribution network. Utilizing more join cores logarithmically increases the number of pipeline stages. This means it takes more clock cycles for a tuple to reach the join cores. Nonetheless, a constant transfer rate of tuples from one pipeline stage to the next keeps the distribution throughput intact, regardless of the number of stages.
Join Cores: In our parallel hardware design, the actual stream join processing is performed in the join cores. Each join core individually implements the original join operator (without posing any limitation on the chosen join algorithm, e.g., nested-loop join or hash join), but on a fraction of the original sliding window.
The internal architecture of our hardware join core based on the uni-flow model is shown in Figure 11; its main building blocks are the Fetcher, the Storage Core, and the Processing Core. The Fetcher is an intermediate buffer that separates a join core from the distribution network. This reduces the communication path latency and improves the maximum clock frequency. The Storage Core is responsible for storing (and consequently expiring) tuples from the R or S streams into their corresponding sub-window. The task of assigning tuples to each Storage Core is performed in a round-robin fashion. The Storage Core counts the number of tuples received from each stream and (knowing the number of join cores and its own position among them) stores a tuple based on its turn.
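The turn-based storage decision can be sketched as follows. The function names are hypothetical; the paper specifies only the round-robin rule and the three inputs each core needs (its tuple counter, the core count, and its own position):

```python
def my_turn(tuple_index, num_cores, my_position):
    """Round-robin ownership: the k-th tuple of a stream (counting from
    zero) is stored by join core (k mod num_cores). Each core decides
    locally, with no coordination with the other cores."""
    return tuple_index % num_cores == my_position

def owners(num_tuples, num_cores):
    """Which core stores each of the first num_tuples tuples."""
    return [k % num_cores for k in range(num_tuples)]
```

With 4 join cores, the 5th tuple (index 4) wraps back to core 0, keeping the sub-windows balanced.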
The state diagram for the Storage Core controller is presented in Figure 12. A join operator can be dynamically programmed without the need for synthesis (individually for each join core) by an instruction that has two segments. The first segment defines join parameters such as the number of join cores and the current join core's position among them, while the second segment carries the join operator conditions. The join operator programming is performed in the Operator Store 1 and Operator Store 2 states. This makes it possible to update the current join operator in real-time. After programming, the Storage Core is ready to accept tuples from both streams. When receiving a tuple from the S stream, if it is the current join core's turn to store, the tuple is stored in its corresponding sub-window in the Store in Window S state; otherwise, the storage task is skipped by moving to the S Store Done state. The reception of tuples from the R stream follows an identical procedure.
The Processing Core is responsible for the actual execution, in which each new tuple (or chunk of tuples) is compared to all tuples in the sub-window of the other stream by pulling them up one at a time and comparing them with the newly arrived tuple.
Figure 13 presents the state diagram of the Processing Core controller. In the initial step, it reads the join operator in the Operator Read 1 and Operator Read 2 states. The actual comparison is performed in the Join Processing state, where tuples are read from their corresponding sub-window, one read per cycle. In case a match is found, it is emitted in Emit Result. Finally, at the end of processing, the controller waits in the Join Wait state for another tuple to process. After a new tuple is received, the whole execution repeats itself by means of a direct transition to the Join Processing state.
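In software terms, the Join Processing and Emit Result states amount to the following loop. This is a behavioral sketch only; the `match` parameter stands in for the programmed join condition(s), and the equi-join in the usage below is just an example:

```python
def process_tuple(new_tuple, opposite_subwindow, match):
    """Sketch of the Processing Core loop: the newly arrived tuple is
    compared against each tuple of the opposite stream's sub-window,
    one read per clock cycle; every match is emitted immediately."""
    emitted = []
    for stored in opposite_subwindow:            # one sub-window read per cycle
        if match(new_tuple, stored):             # Join Processing state
            emitted.append((new_tuple, stored))  # Emit Result state
    return emitted

# Example: an equi-join condition over a 4-tuple sub-window.
matches = process_tuple(7, [3, 7, 9, 7], lambda a, b: a == b)
```

Because each join core holds only a 1/N fraction of the window, this per-tuple loop is N times shorter than in a single-core design.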
Result Gathering Network: The result gathering network is responsible for collecting result tuples from join cores. Similar to the distribution network, we propose (1) lightweight and (2) scalable alternatives for this network. We focus on the latter in our descriptions while comparing both in the evaluation section.
The lower part of Figure 9 demonstrates the design of the result gathering network using GNodes. Each GNode collects resulting tuples from the two sources connected to its two upper ports using a Toggle Grant mechanism that toggles the collection permission between its previous nodes in each clock cycle. As a result, each source (i.e., a join core or a previous GNode) pushes out a resulting tuple to the next GNode once every two clock cycles.
The Toggle Grant mechanism simplifies the design of the result collection; instead of using a two-directional handshake between two connected GNodes to transfer a resulting tuple, we use single-directional signaling, in which each GNode looks only at the permission to push out one of the results stored in its buffer. The destination (next) GNode simply toggles this permission each cycle without the need for any special control unit.

The GNodes configuration, as shown in Figure 9, provides a
pipelined result gathering network where, in each stage, tuples from two sources are merged into one and pushed to the next pipeline stage from top to bottom (respecting our uni-directional top-down flow model). The pipelining mechanism reduces the effective fan-in size in each pipeline stage and thereby prevents the clock frequency drop.
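A cycle-by-cycle software sketch of the Toggle Grant merge described above (Python lists stand in for the hardware buffers; the function name is illustrative):

```python
def toggle_grant_merge(left, right):
    """Merge two result streams the way a GNode does: the push
    permission alternates between the two upper ports every clock
    cycle, so no request/acknowledge handshake is required and each
    source emits at most once every two cycles."""
    left, right = list(left), list(right)
    out, grant_left = [], True
    while left or right:
        granted = left if grant_left else right
        if granted:                      # granted source pushes one tuple
            out.append(granted.pop(0))
        grant_left = not grant_left      # destination toggles the grant
    return out
```

Note that when one source is empty, its granted cycles are simply wasted, which is why each source sustains at most half the output rate.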
In Figure 9, arrows in the distribution and result gathering networks are data buses that define the width of received tuples,
Fig. 14: Throughput measurements for flow-based stream join on hardware: (a) uni-flow (V5, F: 100 MHz), (b) uni-flow vs. bi-flow (JCs: 16, V5, F: 100 MHz), (c) uni-flow (V7, F: 300 MHz); and on software: (d) uni-flow (JCs: 16 and 28).
including their 2-bit headers. The header defines whether we are dealing with a new join operator or a tuple belonging to either the R or S stream. The width of the data bus for result tuples is twice (not counting the header) the size of the input data bus, since a result is comprised of two input tuples that have met the join condition(s).
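A sketch of that bus format follows. The concrete 2-bit header codes are assumptions (the paper states only that 2 bits distinguish the three cases); the 64-bit payload width is the one used in the evaluation:

```python
# Assumed 2-bit header codes -- the paper fixes only their existence.
NEW_OPERATOR, TUPLE_R, TUPLE_S = 0b00, 0b01, 0b10
TUPLE_BITS = 64  # input tuple width used in the evaluation

def encode(header, payload, width=TUPLE_BITS):
    """Prepend the 2-bit header to a width-bit payload."""
    assert 0 <= payload < (1 << width)
    return (header << width) | payload

def decode(word, width=TUPLE_BITS):
    """Split a bus word back into (header, payload)."""
    return word >> width, word & ((1 << width) - 1)

def encode_result(tuple_r, tuple_s, width=TUPLE_BITS):
    """The result bus carries a matched pair, hence twice the payload
    width of the input bus (header not counted)."""
    return (tuple_r << width) | tuple_s
```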
Flow-based Hardware Design Comparison: In the uni-flow model, data passes in a single top-down flow, in which each join core receives tuples directly from the distributor and operates independently from other join cores (Figure 8b). The uni-flow design offers full utilization of the communication channel's bandwidth, specifically from the input to each join core, since all tuples travel over the same path to each join core. Therefore, regardless of the incoming tuple rate for each stream, every tuple has access to the full bandwidth. To clarify this issue for the bi-flow model, assume we are receiving tuples only from stream R; then, all communication channels for stream S are left unutilized. Even with an equal tuple rate for both streams, it is impossible to achieve simultaneous transmission of both TR and TS between two neighboring join cores due to the locks needed to avoid race conditions. Furthermore, comparing the internal design of a join core based on the bi-flow model (Figure 10) with one based on the uni-flow model (Figure 11), we see a significant reduction in the number of internal components that correspondingly reduces the design complexity. The neighbor-to-neighbor tuple traveling circuitries for the two streams are eliminated from Buffer Manager-R & -S and the Coordination Unit, as they are reduced and merged to form the Fetcher and Storage Core in the uni-flow join core. This improvement also reduces the number of I/Os from five to two, which significantly reduces the hardware complexity, as the number of I/Os is often an important indication of the complexity and final cost of a hardware design.
V. EXPERIMENTAL RESULTS

For hardware experiments, we synthesized and programmed
our solution on an ML505 evaluation platform featuring a Virtex-5 XC5VLX50T FPGA. Additionally, we synthesized our solution on a more recent VC707 evaluation board featuring a Virtex-7 XC7VX485T FPGA. For software experiments, we used a 32-core Dell PowerEdge R820 featuring 4 × Intel E5-4650 (TDP: 130 W) processors and 32 × 16 GB (PC3-12800) memory, running Ubuntu 14.04.2 LTS.
We realized the parallel stream join based on the bi-flow model using a simplified OP-Chain topology from FQP2, proposed in . We used the Xilinx synthesis tool chain to synthesize, map, place, and route both the bi-flow and uni-flow parallel hardware realizations and loaded the resulting bit file onto our FPGA using a JTAG interface. This bit file contains all required information to configure the FPGA.
The input streams consist of 64-bit tuples that are joined against each other using an equi-join, though there is no limitation on the condition(s) used. Both realizations can accommodate larger tuples, whose sizes are defined by pre-synthesis parameters.
Throughput Evaluation: Throughput measurements for the bi-flow and uni-flow parallel stream join realizations are presented in Figure 14. For the uni-flow version, we were able to instantiate 16 join cores on our platform with up to a W: 2^13
window size (per stream), as we see in Figure 14a. We observe a linear speedup with respect to the number of join cores, as expected. We were not able to realize window sizes larger than 2^11 when instantiating 32 and 64 join cores due to the extra consumption of memory resources in the distribution and result gathering networks and auxiliary components.
Figure 14b presents the comparison between the input throughput of parallel stream joins based on the uni-flow and bi-flow models as we change the window size. We observe nearly an order of magnitude speedup when using the uni-flow model compared to the bi-flow model. Although in theory both models are similar in their parallelization concept, the simpler architecture of uni-flow brings superior performance. We were not able to instantiate 16 join cores with a 2^13 window in the bi-flow hardware, unlike the uni-flow one, because each join core is more complex and requires a greater amount of resources to realize.
Figure 14c presents the extracted (from a synthesis report) throughput on a mid-size, but more recent, Virtex-7 (XC7VX485T) FPGA. We were able to realize a uni-flow parallel stream join with as many as 512 join cores and window sizes as large as 2^18. We used a 300 MHz clock frequency for this evaluation, as provided by the synthesis report. As a result of having more join cores and a higher clock frequency, we see an acceleration of around two orders of
2FQP is available in VHSIC Hardware Description Language (VHDL).
Fig. 15: Latency reports for uni-flow parallel stream join on hardware (x-axis: number of join cores; W: 2^18 (V7), W: 2^18 (V7s), W: 2^13 (V5)).
Fig. 16: Latency reports for uni-flow on software (x-axis: number of join cores; W: 2^17, 2^18, 2^19).
Fig. 17: Uni-flow clock frequencies on Virtex-5 and Virtex-7 (W: 2^18 (V7), W: 2^18 (V7s), W: 2^13 (V5)).
magnitude when we utilize a window size of 2^13 compared to the realization on the Virtex-5 (Figure 14a).
We ran our experiments on the software realization of a uni-flow parallel stream join (available from ), and the throughput results are presented in Figure 14d for 16 and 28 join cores. Similar to the experimental setup in , the maximum input throughputs were achieved while using 28 cores out of the 32 on our platform, since some internal components in SplitJoin, i.e., the distribution and result gathering networks, also consume a portion of the processors' capacity.
Although the operating frequency of the Virtex-7 FPGA is significantly lower than that of the processors used in our system (300 MHz compared to the processor base frequency of 2.7 GHz and max turbo of 3.3 GHz), we still observed around 15× acceleration compared to the software realization (28 join cores) while using the same window size (2^18) on both platforms (Figures 14c and 14d). Two factors are the main contributors to this throughput gain: 1) the ability to instantiate more join cores compared to the software version, since they operate in parallel and linearly increase the processing throughput as a function of their count; and 2) the utilization of internal BRAMs in the FPGA, essentially coupling data and processing in each join core, while in the software variant the sliding window data resides in main memory and has to move back and forth through the memory hierarchy for each incoming tuple.
Latency Evaluation: We refer to latency as the time it takes to process and emit all results for a newly inserted tuple. We mainly focus on latency comparisons between the hardware and software realizations of the uni-flow model for a parallel stream join. The measurements for these comparisons are shown in Figures 15 and 16.
Figure 15 captures the latency observed with respect to the number of clock cycles and the execution time (in microseconds). For the realization on the Virtex-5 (V5), we used the lightweight distribution and result gathering networks since the system is relatively small; however, for the synthesized design on the Virtex-7 (V7), we report both the lightweight and scalable (denoted by s in the figures) variants of the distribution and result gathering networks.
As we increase the number of join cores, we do not observe a significant difference in the number of cycles required to process a tuple in either realization. The distribution network in the lightweight design requires fewer cycles to transfer incoming tuples to all join cores, while in the scalable version, a
tuple has to travel through multiple distribution levels (log2 N, where N is the number of join cores) to reach the join cores; however, this advantage is neutralized by the greater latency of the lightweight result gathering network. The cost of round-robin collection from the join cores, one after another, quickly becomes dominant as we approach larger numbers of join cores. Moreover, by taking into account the clock frequency drop in the lightweight solution as we increase the number of join cores, the actual difference in latency becomes significant, as shown in Figure 15.
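The two opposing effects just described can be captured in a toy cost model (this is an illustration of the asymptotics, not the measured numbers; one cycle per tree level and one cycle per polled core are simplifying assumptions):

```python
def tree_levels(num_cores, fan_out=2):
    """Height of a fan_out-ary distribution/gathering tree over
    num_cores leaves, i.e., ceil(log_fan_out(num_cores))."""
    levels, width = 0, 1
    while width < num_cores:
        width *= fan_out
        levels += 1
    return levels

def scalable_path_cycles(num_cores):
    """Scalable design: log2(N) DNode levels down plus log2(N) GNode
    levels back up -- grows logarithmically with N."""
    return 2 * tree_levels(num_cores)

def lightweight_gather_cycles(num_cores):
    """Lightweight design: round-robin collection visits the cores one
    after another, so the worst-case wait grows linearly with N."""
    return num_cores
```

For small core counts the linear term is harmless, but already at 64 cores the round-robin sweep (64 cycles) dwarfs the scalable round trip (12 cycles), matching the trend described above.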
Utilizing a window size of 2^18 for each stream, the hardware version (Figure 15) shows around two orders of magnitude improvement in latency compared to the software variant (Figure 16), mainly due to massive parallelism and the coupling of memory and processing.
Scalability Evaluation: The scalability of a hardware design is determined by how the maximum operating clock frequency is affected as we scale up the system. Here, we scale up our solution by increasing the number of join cores, which translates to a linear increase in the processing throughput.
Figure 17 shows how the clock frequency changes as we increase the number of join cores for the lightweight versions on the Virtex-5 (V5) and Virtex-7 (V7) and the scalable version on the Virtex-7 (V7s). For the realization on our Virtex-5 FPGA, we do not see any significant drop as we increase the number of join cores. Although we are using the lightweight version, the system size (number of join cores) is too small to show its effect; in fact, we even see an increase in the clock frequency when utilizing 16 join cores, which is due to the heuristic mapping algorithms adopted by the synthesis tool3.
For larger uni-flow based realizations with more join cores, we see how the clock frequency of the lightweight version drops as we increase the number of join cores. Since the Virtex-7 FPGA supports higher clock frequencies compared to the Virtex-5, it is more sensitive to large fan-out sizes and longer signal paths; therefore, we see this effect even when using 8 and 16 join cores. For the hardware realization based on uni-flow with scalable distribution and result gathering networks, we observe no significant variations in the clock frequency as we scale up the system.
Power Consumption Evaluation: The extracted power consumption reports when using 16 join cores with a total
3Using more restrictions (such as a higher clock constraint, e.g., 190 MHz), it is possible to achieve higher clock frequencies when utilizing fewer join cores.
Fig. 18: Heterogeneous hardware virtualization.
window size of 2^13 (for each stream) show that the parallel stream joins based on the bi-flow and uni-flow models consumed 1647.53 mW and 800.35 mW, respectively. As expected, the simpler design and correspondingly smaller circuit size resulted in more than 50% power savings for uni-flow compared to bi-flow.
VI. CONCLUSIONS & OPEN PROBLEMS
Hardware accelerators, in particular FPGAs, are gaining momentum in both industry and academia. Cloud providers such as Amazon are building new distributed infrastructure that offers FPGAs4 connected to general-purpose processors using a PCIe connection . Furthermore, the employed FPGAs share the same memory address space with the CPUs, which could give rise to many novel applications of FPGAs through co-placement and co-processor strategies. Microsoft's Configurable Cloud  also uses a layer of FPGAs between network switches and servers to filter and manipulate data flows at line-rate. Another prominent deployment example is IBM FPGA-based acceleration within the SuperVessel OpenPOWER Development Cloud . Google's Tensor Processing Unit (TPU), designed for distributed machine learning workloads, is also gaining traction in its data centers, although current TPUs are based on ASIC chips .
Given that the availability of FPGAs and other forms of accelerators (e.g., GPUs and ASICs) is now becoming the norm even in distributed cloud infrastructures, our broader vision (first discussed in ) is to identify key opportunities to exploit the strengths of available hardware accelerators given the unique characteristics of distributed real-time analytics. As a first step towards fulfilling this vision, we start from our FQP fabric , , a generic streaming architecture composed of dynamically re-programmable stream processing elements that can be chained together to form a customizable processing topology, exhibiting a “Lego-like” connectable property. We argue that our proposed architecture may serve as a basic framework to explore and study the entire life-cycle of accelerating distributed stream processing on hardware. Here, we identify a list of important short- and long-term problems that can be studied within our FQP framework:
1) What is the complexity of query assignment to a set of custom hardware blocks (including but not limited to OP-Blocks)? Note that a poorly chosen query assignment may
4Xilinx UltraScale+ VU9P, fabricated using a 16 nm process, with approximately 2.5 million logic elements and 6,800 Digital Signal Processing (DSP) engines.
increase query execution time, leave some blocks unutilized, negatively affect energy use, and degrade the overall processing performance.
2) How can query assignment be formalized algorithmically (e.g., by developing a cost model), and what is the relationship between query optimization on hardware and classical query optimization in databases? Unlike classical query optimization and physical plan generation, on hardware we are not limited to join reordering and physical operator selection: there is a whole new perspective on how to apply instruction-level and fine-grained memory-access optimizations (through different OP-Block implementations or processor-memory couplings); what is the most efficient method for wiring custom operators to minimize the routing distance; how to collect and store statistics during query execution while minimizing the impact on query evaluation; and how to introduce dynamic rewiring and movement of data given a fixed FQP topology?
3) What is the best initial topology given a sample query workload and a set of application requirements known a priori? How can we construct an optimal initial topology that maximizes on-chip resource utilization (reducing the number of OP-Blocks) or minimizes latency (reducing routing and wiring complexity among OP-Blocks)?
4) Given the topology and the query assignment formalism, is it then possible to generalize the query mapping from single-query optimization to multi-query optimization in order to amortize the execution cost across the shared processing of several queries? Given the dynamic nature of FQP and a lightweight statistics collection capability, the natural question is: how can we dynamically re-optimize and re-map queries onto the FQP topology (in light of observed statistics) in order to maximize inter- and intra-query optimizations that are inspired by the capabilities of custom stream processors?
5) How do we extend query execution on hardware to co-placement and/or co-processor designs by distributing and orchestrating query execution over heterogeneous hardware (each with unique characteristics) such as CPUs, FPGAs, and GPUs? The key insight is how to superimpose the FQP abstraction over these heterogeneous compute nodes in order to hide their intricacy and to virtualize the computation over them, as illustrated in Figure 18. An important design decision arises as to how these distributed and virtualized devices communicate and whether they are placed on a single board, thus having at least a shared external memory space or distributed shared memory through RDMA; placed on multiple boards connected through interfaces such as PCIe; or spread across entire data centers connected through the network.
This research was supported by the Alexander von Humboldt Foundation.
REFERENCES

[1] “Apache Hadoop.” Apache Software Foundation, 2006. [Online]. Available: http://hadoop.apache.org
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, “Spark: Cluster computing with working sets,” HotCloud, 2010.
[3] P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas, “Apache Flink: Stream and batch processing in a single engine,” Data Engineering, 2015.
[4] A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy, “Storm@twitter,” SIGMOD, 2014.
[5] M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica, “Discretized streams: Fault-tolerant streaming computation at scale,” SOSP, 2013.
[6] “Symmetric key cryptography on modern graphics hardware.” Advanced Micro Devices, Inc., 2008.
[7] “Enhanced Intel SpeedStep technology for the Intel Pentium M processor.” Intel, Inc., 2004.
[8] R. Mueller, J. Teubner, and G. Alonso, “Data processing on FPGAs,” VLDB, 2009.
[9] ——, “Streams on wires: A query compiler for FPGAs,” VLDB, 2009.
[10] A. Hagiescu, W.-F. Wong, D. Bacon, and R. Rabbah, “A computing origami: Folding streams in FPGAs,” DAC, 2009.
[11] T. Honjo and K. Oikawa, “Hardware acceleration of Hadoop MapReduce,” BigData, 2013.
[12] M. Sadoghi, R. Javed, N. Tarafdar, H. Singh, R. Palaniappan, and H.-A. Jacobsen, “Multi-query stream processing on FPGAs,” ICDE, 2012.
[13] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “Flexible query processor on FPGAs,” PVLDB, 2013.
[14] L. Wu, A. Lottarini, T. K. Paine, M. A. Kim, and K. A. Ross, “Q100: The architecture and design of a database processing unit,” ASPLOS, 2014.
[15] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “Configurable hardware-based streaming architecture using online programmable-blocks,” ICDE, 2015.
[16] Z. Wang, J. Paul, H. Y. Cheah, B. He, and W. Zhang, “Relational query processing on OpenCL-based FPGAs,” FPL, 2016.
[17] O. Segal, M. Margala, S. R. Chalamalasetti, and M. Wright, “High level programming framework for FPGAs in the data center,” FPL, 2014.
[18] S. Breß, H. Funke, and J. Teubner, “Robust query processing in co-processor-accelerated databases,” SIGMOD, 2016.
[19] T. Gomes, S. Pinto, T. Gomes, A. Tavares, and J. Cabral, “Towards an FPGA-based edge device for the Internet of Things,” ETFA, 2015.
[20] C. Wang, X. Li, and X. Zhou, “SODA: Software defined FPGA based accelerators for big data,” DATE, 2015.
[21] A. Putnam, A. M. Caulfield, E. S. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. P. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Y. Xiao, and D. Burger, “A reconfigurable fabric for accelerating large-scale datacenter services,” ISCA, 2014.
[22] A. M. Caulfield, E. S. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J. Y. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger, “A cloud-scale acceleration architecture,” MICRO, 2016.
[23] R. R. Bordawekar and M. Sadoghi, “Accelerating database workloads by software-hardware-system co-design,” ICDE, 2016.
[24] M. Sadoghi, M. Labrecque, H. Singh, W. Shum, and H.-A. Jacobsen, “Efficient event processing through reconfigurable hardware for algorithmic trading,” PVLDB, 2010.
[25] L. Woods, J. Teubner, and G. Alonso, “Complex event detection at wire speed with FPGAs,” PVLDB, 2010.
[26] M. Sadoghi, H. Singh, and H.-A. Jacobsen, “Towards highly parallel event processing through reconfigurable hardware,” DaMoN, 2011.
[27] D. Sidler, Z. Istvan, M. Ewaida, and G. Alonso, “Accelerating pattern matching queries in hybrid CPU-FPGA architectures,” SIGMOD, 2017.
[28] “IBM Netezza 1000 data warehouse appliance.” IBM, Inc., 2009.
[29] M. Sadoghi, H. P. Singh, and H.-A. Jacobsen, “fpga-ToPSS: Line-speed event processing on FPGAs,” DEBS, 2011.
[30] J. Teubner, L. Woods, and C. Nie, “Skeleton automata for FPGAs: Reconfiguring without reconstructing,” SIGMOD, 2012.
[31] L. Woods, Z. István, and G. Alonso, “Ibex - an intelligent storage engine with support for advanced SQL off-loading,” PVLDB, 2014.
[32] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “The FQP vision: Flexible query processing on a reconfigurable computing fabric,” SIGMOD, 2015.
[33] J. Teubner and R. Mueller, “How soccer players would do stream joins,” SIGMOD, 2011.
[34] M. Najafi, M. Sadoghi, and H.-A. Jacobsen, “SplitJoin: A scalable, low-latency stream join architecture with adjustable ordering precision,” ATC, 2016.
[35] B. Gedik, R. R. Bordawekar, and P. S. Yu, “CellJoin: A parallel stream join operator for the cell processor,” VLDBJ, 2009.
[36] P. Roy, J. Teubner, and R. Gemulla, “Low-latency handshake join,” PVLDB, 2014.
[37] S. Blanas, Y. Li, and J. M. Patel, “Design and evaluation of main memory hash join algorithms for multi-core CPUs,” SIGMOD, 2011.
[38] J. Teubner, G. Alonso, C. Balkesen, and M. T. Ozsu, “Main-memory hash joins on multi-core CPUs: Tuning to the underlying hardware,” ICDE, 2013.
[39] V. Leis, P. Boncz, A. Kemper, and T. Neumann, “Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age,” SIGMOD, 2014.
[40] J. Kang, J. Naughton, and S. Viglas, “Evaluating window joins over unbounded streams,” ICDE, 2003.
[41] “Announcing Amazon EC2 F1 instances with custom FPGAs.” Amazon, Inc., 2016.
[42] A. Caulfield, E. Chung, A. Putnam, H. Angepat, J. Fowers, M. Haselman, S. Heil, M. Humphrey, P. Kaur, J.-Y. Kim, D. Lo, T. Massengill, K. Ovtcharov, M. Papamichael, L. Woods, S. Lanka, D. Chiou, and D. Burger, “A cloud-scale acceleration architecture,” MICRO, 2016.
[43] “OpenPOWER cloud: Accelerating cloud computing,” IBM Research, 2016.
[44] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: A system for large-scale machine learning,” OSDI, 2016.