Atomix: A Framework for Deploying Signal Processing Applications on Wireless Infrastructure

Manu Bansal, Aaron Schulman, and Sachin Katti, Stanford University

This paper is included in the Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI '15), May 4–6, 2015, Oakland, CA, USA. ISBN 978-1-931971-218. Open access to the proceedings of NSDI '15 is sponsored by USENIX.

https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/bansal





Abstract

Multi-processor DSPs have become the platform of choice for wireless infrastructure. This trend presents an opportunity to enable faster and wider-scale deployment of signal processing applications. However, achieving the hardware-like performance required by signal processing applications requires interacting with bare-metal features on the DSP. This makes it challenging to build modular applications.

We present Atomix, a modular software framework for building applications on wireless infrastructure. We demonstrate that it is feasible to build modular DSP software by building the application entirely out of fixed-timing computations that we call atoms. We show that applications built in Atomix achieve hardware-like performance by building an 802.11a receiver that operates at high bandwidth and low latency. We also demonstrate that the modular structure of software built with Atomix makes it easy for programmers to deploy new signal processing applications. We demonstrate this by tailoring the 802.11a receiver to long-distance environments and adding RF localization to it.

1 Introduction

Programmable Digital Signal Processors (DSPs) are increasingly replacing ASICs as the platform of choice for performing the heavyweight signal processing in our wireless infrastructure. The primary driver for the shift toward this programmable infrastructure is the increased rate of wireless standard updates: the 3GPP releases a new LTE standard roughly once every 18 months [3]. DSPs enable such quick upgrade cycles since their software can be updated with the push of a button. DSPs in the infrastructure strike a balance between high performance, low power (only 5-8 Watts to run a 20MHz LTE basestation), and programmability. They do so by combining multiple processing cores for programmability with hardware accelerators for performance and power efficiency.

The trend toward building programmable DSPs into wireless infrastructure presents a valuable opportunity: programmers can build new signal processing applications and quickly deploy them on wireless infrastructure at scale. Such applications encompass both communication and non-communication signal processing. Communication applications include standard wireless protocols (e.g., LTE and WiFi), as well as customized protocols for particular environments (e.g., long-distance rural broadband [14, 6]). Non-communication applications include using radio waves for localization [28, 8].

There are three basic primitives that programmers need from the software framework of DSPs to build and deploy signal processing applications: tapping into a signal processing chain, tweaking a signal processing block, and inserting or deleting a signal processing block. The DSP software framework must present a modular interface to the programmer that supports these primitives. Such modularity is essential to add new applications without needing to modify or understand the full software base, which would take excessive effort. For example, tweaking a specific block to improve its function or efficiency should not affect other blocks, nor should it require the programmer to tweak or understand other parts of the software.

Designing a modular software framework for DSPs is challenging because of the demanding requirements of the communication applications that it must support. Communication applications are both high-throughput and latency-sensitive. This combination requires hardware-like performance from the DSP software; it must be highly efficient and have predictable execution timing. For example, to process a typical 20MHz WiFi channel, the DSP must process a sample of the signal every 25ns (assuming Nyquist sampling). Assuming the DSP is running at 1GHz (a typical DSP clock rate), there are only 25 cycles available on average per DSP core to process each sample. Additionally, the processing has to finish within a short time because of ARQ. For instance, WiFi needs to decode and send an ACK within 16μs of receiving the last sample of a packet.

Building modular DSP software that has hardware-like efficiency and predictability is challenging. The primary reason is that programmers must use several bare-metal features to achieve hardware-like performance. DSPs such as the TI 6670 [23] are highly parallel processors with multiple cores and hardware accelerators. There is no rich operating system support to manage those resources with adequate performance. As a result, the programmer must manually parallelize software. DSPs typically lack cache-coherent shared memory. This forces the programmer to explicitly move data across cores and hardware accelerators. Further, DSPs tend to have shallow memory hierarchies with software-addressable SRAM that the programmer has to explicitly manage.

In this paper, we present Atomix, a modular software framework for developing high-bandwidth and low-latency apps for DSPs. The key idea behind Atomix is the atom abstraction, which is defined as a unit of execution that takes a fixed, known amount of time to run every time it is invoked. In Atomix, programmers can express every module in a signal processing application as an atom. Atoms can be algorithmic blocks such as the FFT, compositions of blocks such as OFDM, different packet processing modes in protocols such as LTE or WiFi, or even low-level plumbing primitives such as data transfers across cores.

We evaluate the Atomix framework by using it to build a standard 802.11a WiFi receiver called Atomix11a. WiFi is an ideal app for evaluating the performance capability of Atomix because it uses the same bandwidth as LTE, with processing complexity and spectral efficiency similar to those of an OFDMA protocol. Additionally, WiFi's decode latency constraints of tens of microseconds are two orders of magnitude tighter than LTE's.¹ For a 10MHz WiFi channel, Atomix11a achieves a frame decode latency of 36.4μs with a variance of 1.5μs.

We also evaluate the modularity of apps developed in Atomix by developing two apps on top of Atomix11a: (1) we customize Atomix11a to support long-distance links and (2) we add RF localization to Atomix11a.

2 Background and Design Goals

2.1 App Taxonomy

A typical wireless stack is naturally specified in terms of signal processing blocks, a data flowgraph that composes those blocks, and a state machine that selects appropriate flowgraphs to process incoming samples. For example, an 802.11a baseband receiver (fig. 1) has blocks for OFDM demodulation, channel equalization, constellation-to-bit slicing, and Viterbi decoding. Different blocks are composed into separate data flowgraphs for decoding a 54Mbps sample stream, a 6Mbps sample stream, or the PHY-layer header, or for transmitting an ACK. Finally, the flowgraphs are assembled into a receiver state machine that transitions from the header decode state to one of the data decode states, optionally followed by the ACK transmit state.

¹ LTE's 3ms HARQ process turnaround requirement allows for approximately 1ms of decode latency.

Figure 1: Modules of an 802.11a receiver: blocks, data flowgraphs, and a state machine. (The figure shows header decode, 6Mbps and 54Mbps data decode, and ACK transmit flowgraphs built from blocks such as CFO & AGC, OFDM demodulator, OFDM equalizer, slicer, deinterleaver, depuncturer, Viterbi decoder, descrambler, and CRC checker, with CRC pass/fail transitions between states.)

We classify apps according to the kind of modifications needed to the base wireless stack. (For a wireless stack app that is bootstrapping the base-station, the base stack is null.) To illustrate the classification, we use a simple OFDM receive chain (fig. 2) as the underlying stack that is modified. The modifications fall into three main categories:

Tap. These modifications tap the signal at various points in the processing chain to implement their functionality. Fig. 2 shows an example of tapping the CSI output and sending it back for offline processing. Such tapping capability is needed for applications like indoor localization [28, 8]. Localization works by converting a few samples from the right point in the signal processing chain into location estimates.

Tweak. These modifications tweak parameters of individual signal processing blocks that are already part of the wireless stack to implement tailored functionality. Fig. 2 shows an example of tweaking the channel equalization block. The ability to tweak parameters or modify the functionality of a block can enable applications like better receiver designs (e.g., [13, 25]) for existing transmission formats, or adapting the signal chain to a different deployment setting such as long-distance rural WiFi links, as we show in sec. 5.

Insert (and Delete). These modifications insert new blocks into the signal processing chain, either in the critical path or as additional branches. By inserting new blocks, we can add new functionality to existing signal processing chains. For example, we can add a BER-estimation block for fine-grained channel quality indication to significantly improve rate-adaptation performance [26].

Figure 2: Three basic kinds of modifications.

A new wireless stack app that bootstraps a basestation also requires insertion of new blocks.

2.2 Requirements and Design Goals

Our goal is to design a software framework for wireless infrastructure in which new signal processing applications can be easily created and deployed. This requires three properties from the framework:

1. Modularity — For a programmer to easily add new apps, the framework must provide a modular interface supporting tap, tweak, and insert primitives. In using those primitives, the programmer must be able to make local code changes without needing to understand or refactor unrelated parts of the existing implementation.

2. Predictable latency — When a new app is deployed, the underlying communication stack should exhibit hardware-like predictable latency. The programmer should be able to precisely estimate and control the latency down to the order of a microsecond. This will let the programmer ensure that the timing requirements specified by the communication protocol are still met (e.g., 16μs sample processing latency for 20MHz WiFi).

3. High computational throughput — As new apps are deployed, the underlying communication stack should remain computationally efficient. The programmer should have the tools needed to utilize the full computational power of the hardware platform to meet sample-rate processing requirements (e.g., 40Msps for a 20MHz WiFi channel). Modularity and predictable latency should not come at the expense of computational efficiency.

While it is easy to meet any of these requirements in isolation, combining them generally leads to tradeoffs.

Figure 3: The atom abstraction. (a) An atom is a unit of execution with fixed timing; (b) atom compositions have fixed timing.

In a modular system designed for predictability, modules could always be provisioned for worst-case execution times. However, that would be an inefficient choice if modules had large gaps between worst-case and average-case execution times. Another system that dynamically schedules modules with the objective of keeping processors busy would achieve high average processing efficiency. However, it could cause high variability in the execution latencies of modules [15]. By programming an application using low-level instructions, a programmer could tightly control processing latencies; by optimizing it as a monolithic piece of software, she could extract the best possible performance from the hardware. However, the resulting implementation would lack modularity.

These tradeoffs make the design of Atomix challenging. However, as we discuss in the next section, our insights about the application structure of wireless dataplanes allow us to design for all of these goals simultaneously.

3 Design

3.1 The atom abstraction

The design of Atomix is based on the key abstraction of an atom:

Atom: A unit of execution with fixed timing.

In Atomix, every operation from signal processing to system handling can be implemented as an atom. Further, those atoms can be composed to form more complex atoms. When atoms are composed, their execution times add up, so that if A and B are atoms, t(A ∘ B) = t(A) + t(B) (fig. 3).
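The additive composition rule can be illustrated with a small sketch (the C types and names here are our own; the paper does not specify this API):

```c
#include <assert.h>

/* An atom: a function together with its fixed, known execution cost
 * (hypothetical sketch; units could be, e.g., microseconds). */
typedef struct {
    void (*run)(void *buf);
    double cost;
} atom_t;

/* Executing a composition runs the atoms in sequence, so its cost is
 * the sum of the per-atom costs: t(A ∘ B) = t(A) + t(B). */
static double composed_cost(const atom_t *atoms, int n) {
    double t = 0.0;
    for (int i = 0; i < n; i++)
        t += atoms[i].cost;
    return t;
}
```

The same rule lets the programmer re-time a tweaked implementation by substituting a single atom's new cost, without re-measuring the whole composition.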

The Atomix framework provides the substrate to implement signal processing blocks as atoms (sec. 3.2). It also enables programmers to compose those simple atoms into more complex flowgraph atoms (sec. 3.3) and state atoms (sec. 3.4) to tie together an entire signal processing application. Finally, it provides the capability to map atoms onto multiple cores to create fine-grained pipelines exploiting parallelism (sec. 3.5).

The Atomix framework provides atoms for common platform handling operations such as transferring data between cores and interacting with accelerators. The programmer only needs to implement atoms in the C language for custom signal processing blocks. Then, she can tie them together into flowgraphs and a state machine using the language provided by Atomix. The framework provides a high-level declarative API to compose atoms and to map them to multiple cores. Atomix provides a compiler to translate the declarative code into C code. Translated application code and custom signal processing atoms are compiled with the target DSP's native C compiler. Atomix also provides an optimized runtime system (sec. 3.6) to execute compiled atoms efficiently and predictably. The full Atomix application development workflow is described in sec. 3.7.

Block, flowgraph, and state abstractions have a one-to-one correspondence with the structure of baseband applications (i.e., the basestation stack and other signal processing applications). Consequently, baseband application software in Atomix has the modularity inherent in the application structure. Since the implementations are entirely made up of atoms, they have predictable timing at every granularity of the modular structure.

To create efficient parallel implementations that can process high sample bandwidths at low latency, the programmer simply ties together appropriate system-handling atoms with signal processing atoms. Then she uses the high-level API to map flowgraphs and state machines onto multiple cores. The modularity and predictable timing of atoms simplify the process of designing pipelines, since the programmer can design a schedule for different bandwidth or latency objectives by simply adding up the timings of blocks under various layouts.

In the modular structure of Atomix, signal processing blocks are encapsulated into separate atoms. As a result, atom interfaces provide fine-grained signal tap points; blocks can be tweaked individually by tweaking the corresponding atoms; and blocks can be inserted and deleted at fine granularity. Since atoms are tied together using a high-level API, inserting and deleting atoms is easy.

The timings of atoms add up when they are composed. This simple additive relation allows the programmer to easily infer the effect of a modification on the end-to-end timing of the implementation. On tweaking an atom, she only needs to re-time the tweaked atom and substitute its new cost. On inserting an atom, she simply adds the execution time of the inserted atom to obtain the new end-to-end timing. After predicting the effect of a modification on timing, the programmer can easily adjust the multicore layout in the high-level API, if needed, to meet processing bandwidth or latency requirements.

We note that the atom abstraction — a unit of execution with fixed timing — is an idealized design goal. A software system will inherently have some variability in execution times. For example, the micro-architecture of a processor could affect timing due to bus contention or hardware instruction manipulation. Atomix strives to achieve sufficiently low variability in the execution times of atoms for the purpose of building wireless signal processing applications.

Figure 4: Signal processing blocks to atoms.

3.2 Blocks decompose into atoms

The basic unit of a signal processing application is a signal processing block. A block takes in an input signal and transforms it into another output signal. For example, a constellation bit-slicer block takes in constellation symbols, like BPSK or QPSK symbols, and produces data bits from them. The execution time of a block depends on the operation it performs, the length of data on which it operates, the processor type (i.e., DSP or accelerator) on which it runs, and the memory location of the data buffers it operates on. For a block to be abstracted as an atom, it must execute in a known, fixed amount of time. An Atomix signal processing block therefore implements a fixed algorithmic function, operates on fixed data lengths, is associated with a specific processor type, and uses only the memory buffers passed to it during invocation.

Consider the example of a bit-slicing operation, as shown in fig. 4. A slicer block will be decomposed in Atomix into separate BPSK, QPSK, and QAM blocks. Each of these blocks will be implemented to accept input and output buffers of fixed lengths, leading to BPSK48 for a BPSK block that takes in 48 complex symbols. The blocks must also be implemented for specific execution core types. If the BPSK48 block could run on both a DSP core and an ARM core, BPSK48DSP and BPSK48ARM would be separate blocks. They may share code internally through common subroutines. By following those decomposition rules, a programmer can implement blocks so that they always execute the same set of instructions on the same kind of processor.
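As a concrete sketch of these rules, a fixed-size BPSK48 slicer might look like the following C fragment. The function name, types, and sign-based bit convention are our assumptions, not the paper's code; what matters is that the data length is fixed and only caller-supplied buffers are touched:

```c
#include <assert.h>

/* Hypothetical fixed-size slicer atom: BPSK48 always consumes exactly 48
 * complex symbols and produces exactly 48 bits, using only the buffers
 * passed in, so it executes the same instruction sequence every time. */
#define BPSK48_LEN 48

typedef struct { float re, im; } cplx_t;

static void bpsk48_run(const cplx_t in[BPSK48_LEN],
                       unsigned char out[BPSK48_LEN]) {
    for (int i = 0; i < BPSK48_LEN; i++)
        /* assumed convention: nearest BPSK constellation point by the
         * sign of the real part */
        out[i] = (in[i].re >= 0.0f) ? 1 : 0;
}
```

A QPSK or QAM slicer would be a separate block (and BPSK48DSP versus BPSK48ARM separate again), even if they share subroutines internally.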

Atomix blocks only make use of memory buffers passed in through function calls, making their memory accesses explicit. Further, the signature of the block implementation is annotated to indicate input and output ports. The Atomix compiler uses the I/O port annotations to manage data flow between atoms, as discussed later in sec. 3.3. The Atomix framework provides high-level APIs to control memory placement so that a block always operates on the same memory types (L2 or DDR, etc.). Further, in Atomix, blocks run to completion uninterrupted.

Atom   Operation
F_RG   buffer * readGet(fifo)
F_WG   buffer * writeGet(fifo)
F_RD   void readDone(fifo, buffer)
F_WD   void writeDone(fifo, buffer)

Table 1: Atomix FIFO atoms (denoted collectively by F).

#atom:<atomname>:<atomtype>
atom :eq     :EQ
atom :bpsk48 :BPSK48
#fifo:<fifoname>:nbufs=<nbufs>:mem=<memoryid>
fifo :fEqDDSyms :nbufs=2 :mem=L2
fifo :fDDBits   :nbufs=4 :mem=L2
#flowgraph:<fgname>:<atomname>;+
flowgraph:FG_BPSK:{csi; ofdm; eq; bpsk48;}
#wire:<atomname>:<fifoname>+
wire :eq     :fCSI,fFdDDSyms,fEqDDSyms
wire :bpsk48 :fEqDDSyms,fDDBits

Figure 5: Composing a flowgraph from atoms. (a) The flowgraph; (b) the code above to compose it.

By following simple decomposition rules, Atomix enables the programmer to implement signal processing blocks as atoms. The blocks will run fixed sets of instructions executing uninterrupted on fixed resources using fixed memories. As a result, they will have fixed execution times.

3.3 Flowgraphs are expressed as atoms

The Atomix framework lets the user program application flowgraphs as atoms. For example, the 6Mbps data decoding flowgraph, shown previously in fig. 1, can be expressed as an atom. Flowgraphs are created by tying signal processing blocks together with FIFO queues. FIFOs provide intermediate storage and pass data between signal processing atoms. The framework provides operations to access FIFOs as atoms. This makes an Atomix flowgraph a composition of signal processing atoms and FIFO access atoms. Since a flowgraph is a composition of atoms, it is also an atom in itself.

To program a flowgraph in Atomix, the user declares atoms, FIFOs, wirings between atoms and FIFOs, and the execution sequence of atoms in the flowgraph. Consider the simple signal processing flowgraph shown previously in fig. 2. It shows a channel state information block (CSI) and an OFDM demodulator block (OFDM), both feeding data to a channel equalizer block (EQ) that feeds data to a slicer block (SLICER). A specific realization of this flowgraph with the BPSK48 block is shown in fig. 5a.

atom :dispatcher :Dispatcher
atom :dxHDR      :HDRParser
atom :jumpToCRC  :Jump
fifo :fDispatcher :nbufs=2 :mem=L2
flowgraph:FG_HDR_AXN:{ ... }
flowgraph:FG_BPSK_AXN:{ ... eq; bpsk48; ... }
flowgraph:FG_QPSK_AXN:{ ... eq; qpsk48; ... }
flowgraph:FG_HDR_RULE:{ dxHDR; }
flowgraph:FG_DD_RULE:{ jumpToCRC; }
wire :dispatcher :fDispatcher
wire :dxHDR      :fDispatcher
#state:<statename>:<FG action>:<FG rule>
state :ST_HDR  :FG_HDR_AXN :FG_HDR_RULE
state :ST_BPSK :FG_BPSK_AXN:FG_DD_RULE
state :ST_QPSK :FG_QPSK_AXN:FG_DD_RULE

Figure 6: Composing a state machine. (a) State control flow (blue: data flow, red: control flow); (b) a state machine example; (c) the code above to compose the state machine.

In the implementation, block-level atoms and FIFOs are declared with the atom and fifo APIs, as shown in fig. 5b. In declaring FIFOs, the programmer is able to control the memory region in which each FIFO will be allocated, thus also controlling the execution time of the atoms that will operate on those FIFOs.

The programmer creates a named flowgraph construct that specifies the sequence in which the atoms of the flowgraph will be executed (e.g., FG_BPSK in fig. 5b). FIFOs are wired to atoms using the wire API to specify the flow of data between atoms; FIFOs are wired to the respective input and output ports of the atoms. Based on the wiring information, the Atomix compiler inserts appropriate FIFO atoms (denoted by F, described in table 1) for each wired atom. FIFO atoms draw and return buffers from FIFO queues. These FIFO access atoms have known, fixed execution times.

Atomix implements a simple control flow model where a flowgraph is executed by executing each of its atoms in sequence without interruption. This straight-through execution model implies that the timing of a flowgraph is simply the sum of the timings of its constituent atoms, namely, the signal processing atoms created by the user and the FIFO atoms inserted by the framework.
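A minimal single-core sketch of the FIFO access atoms of table 1 follows. The ring layout and field names are our guesses, loosely following fig. 8; real Atomix FIFOs are multicore and lock-free, and the buffer sizes are declared per FIFO:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical FIFO: a fixed ring of pre-allocated buffers, with a
 * read index, a write index, and a per-buffer filled/empty status.
 * Each access atom touches a constant amount of state, so it runs in
 * a fixed, small amount of time. */
#define NBUFS 4
#define BUFLEN 48

typedef struct {
    float bufs[NBUFS][BUFLEN];
    int readIdx, writeIdx;
    int filled[NBUFS];
} fifo_t;

static float *writeGet(fifo_t *f) {          /* F_WG: claim an empty buffer */
    return f->filled[f->writeIdx] ? NULL : f->bufs[f->writeIdx];
}
static void writeDone(fifo_t *f, float *b) { /* F_WD: publish it as filled */
    (void)b;
    f->filled[f->writeIdx] = 1;
    f->writeIdx = (f->writeIdx + 1) % NBUFS;
}
static float *readGet(fifo_t *f) {           /* F_RG: take a filled buffer */
    return f->filled[f->readIdx] ? f->bufs[f->readIdx] : NULL;
}
static void readDone(fifo_t *f, float *b) {  /* F_RD: return it as empty */
    (void)b;
    f->filled[f->readIdx] = 0;
    f->readIdx = (f->readIdx + 1) % NBUFS;
}
```

A flowgraph's straight-through execution then interleaves these calls around each block: writeGet/readGet before the block runs, readDone/writeDone after.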

3.4 States are expressed as atoms

At the top level, Atomix applications are state machines. With blocks and flowgraphs implemented as atoms, the final application component that the programmer needs to create is the state machine. Atomix enables the user to program states as atoms.

The main components of a state machine are a dispatcher atom and named state structures, as shown in fig. 6a. A state is made up of two flowgraphs, the action flowgraph and the rule flowgraph. Control flow starts from the dispatcher atom, which invokes the next state in the state machine's transition sequence with an iteration count n. When the framework invokes a state, it executes the action flowgraph n times followed by the rule flowgraph once. The action flowgraph typically performs signal processing operations, while the rule flowgraph uses the output of the action flowgraph to decide the next state to transition to.

As an example, consider the reference 802.11a receiver previously shown in fig. 1. An implementation of a subset of that state machine in Atomix is shown in fig. 6b. It shows a header decode state, two data decode states, and an ACK transmit state. States are declared with the state API, as shown in fig. 6c for the reference state machine.

A block making a state-transition decision writes a decision buffer into the dispatcher's queue, where the decision buffer indicates the next state and the corresponding iteration count n. In the example in fig. 6a, the decision atom dxHDR translates the header field decoded by the corresponding action flowgraph into a decision output of (<ST_BPSK | ST_QPSK>, n). When control returns to the dispatcher at the end of ST_HDR, the dispatcher reads the next-in-queue decision buffer to continue state transitions.

The components that make up the execution sequence of a state are the dispatcher atom and the action and rule flowgraphs. Since a state is composed of atoms, it is also an atom in itself.
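The dispatcher loop described above can be sketched as follows. This is a toy single-core model with hypothetical state names and a hard-coded rule; the real dispatcher reads decision buffers from a FIFO:

```c
#include <assert.h>

/* Hypothetical dispatcher sketch: each state runs its action flowgraph
 * n times, then its rule flowgraph once; the rule returns a decision
 * (next state, iteration count) that drives the next transition. */
enum { ST_HDR, ST_BPSK, ST_DONE };

typedef struct { int next_state; int n; } decision_t;

static int hdr_actions, bpsk_actions;

static void run_action(int state) {          /* stand-in action flowgraph */
    if (state == ST_HDR) hdr_actions++;
    else bpsk_actions++;
}

static decision_t run_rule(int state) {      /* stand-in rule flowgraph */
    /* toy rule: a decoded header dispatches 3 BPSK data iterations,
     * after which we stop */
    if (state == ST_HDR) return (decision_t){ST_BPSK, 3};
    return (decision_t){ST_DONE, 0};
}

static void dispatcher(int state, int n) {
    while (state != ST_DONE) {
        for (int i = 0; i < n; i++)
            run_action(state);
        decision_t d = run_rule(state);
        state = d.next_state;
        n = d.n;
    }
}
```

Since the action runs a fixed number of times and every constituent is an atom, the whole state invocation again has predictable timing.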

3.5 Atoms generalize to multiple cores

DSPs are able to process high-bandwidth communication applications at low power because of the parallelism of multiple DSP cores and hardware acceleration. Atomix enables the programmer to easily parallelize and accelerate flowgraphs and states on multicore DSPs.

Multicore flowgraphs. Flowgraphs are parallelized by splitting them into smaller flowgraphs, one for each core. Each sub-flowgraph contains a subset of the blocks in the original flowgraph. The programmer may choose any assignment of blocks to the sub-flowgraphs.

Consider the example of the BPSK flowgraph shown previously in fig. 5a. It is shown laid out as a pipeline on three DSP cores in fig. 7a. The single flowgraph (FG_BPSK) is split into three sub-flowgraphs, one for each core: FGBa0 processes the ofdm block on dsp0, FGBa1 processes the csi and eq blocks on dsp1, and FGBa2 processes the bpsk48 block (shown as b) on dsp2. Declarations of these flowgraphs are shown in fig. 7c.

Multicore blocking FIFOs and transfer atoms. When flowgraphs are parallelized, FIFOs are assigned to the same cores as the blocks they are wired to. However, it is possible that a FIFO is wired to multiple blocks that are assigned to different cores. In our example, one such FIFO is at the output of ofdm and the input of eq. In such a case, the FIFO is replaced with multiple FIFOs, one for each core to which its wired blocks are assigned. The blocks are re-wired to the FIFOs on their own cores to prevent expensive remote FIFO accesses. This is necessary to preserve the timing of blocks.

When a FIFO is replaced by multiple FIFOs on different cores, data needs to be transferred between them for the flowgraph to compute the correct result. Atomix provides data transfer functionality as transfer atoms. By inserting transfer atoms into the sub-flowgraphs, continuity of the flowgraph is preserved. In the example of the ofdm and eq blocks, the transfer atom t is inserted in sub-flowgraph FGBa0 on dsp0. This atom pushes data from the output FIFO of ofdm on dsp0 to the input FIFO of eq on dsp1, as shown in figs. 7a and 7b. (The atom t could have been inserted into sub-flowgraph FGBa1 instead. In that case, it would pull data from dsp0 to dsp1, and the cost of the data transfer operation would be added to FGBa1.)

The FIFO read and write atoms are implemented as blocking operations; a read call on a FIFO returns only when it has a filled buffer and, similarly, a write call returns only when the FIFO has an empty buffer. By being blocking, FIFO access atoms are able to synchronize execution of multiple cores on data dependencies. For example, the eq atom on dsp1 is able to run only when the ofdm atom on dsp0 has produced a data buffer and t has finished transferring it to dsp1.
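The blocking synchronization can be sketched with POSIX threads standing in for DSP cores (a simplified one-buffer FIFO; the names and structure are our own, and the transfer atom is elided):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical two-"core" sketch: the consumer's blocking read spins on
 * the buffer's filled status until the producer has published it, so
 * cores synchronize on the data dependency without locks. */
static int buf[48];
static atomic_int filled = 0;

static void *producer(void *arg) {  /* e.g. the ofdm atom on dsp0 */
    (void)arg;
    for (int i = 0; i < 48; i++)
        buf[i] = i * 2;
    atomic_store(&filled, 1);       /* writeDone: publish the buffer */
    return 0;
}

static int consume_sum(void) {      /* e.g. the eq atom on dsp1 */
    while (!atomic_load(&filled))   /* blocking readGet: poll the status */
        ;
    int s = 0;
    for (int i = 0; i < 48; i++)
        s += buf[i];
    return s;
}
```

On the DSP the polling wait is bounded by the (fixed) execution times of the upstream atoms, which is what keeps the blocked block's overall timing predictable.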

Simple extensions to the declarative API let the programmer specify the multicore assignment of atoms and FIFOs, as shown in fig. 7c. Transfer atoms are inserted into flowgraphs like any other signal processing atoms.

#atom :<name> :<atomtype> :core=<id>
atom :dis0 :Dispatcher :core=0
atom :dis1 :Dispatcher :core=1
atom :t    :TR         :core=0
atom :dx   :Jump       :core=0
atom :v    :VCPIssue   :core=2
atom :w    :VCPWait    :core=2
#fifo :<name> :nbufs=<nbufs>:mem=<id>
fifo :fDis0 :nbufs=4 :mem=core0.L2
fifo :fDis1 :nbufs=4 :mem=core1.L2
flowgraph:FGBa0:{ofdm; t;}
flowgraph:FGBr0:{dx; cp;}
flowgraph:FGBa1:{csi; eq;}
flowgraph:FGBr1:{t1;}
flowgraph:FGBa2:{u; b; w; v;}
flowgraph:FGBr2:{t2;}
#state:<stname>:core=<id>:<FGaxn>:<FGrule>
state :ST_BPSK :core=0 :FGBa0:FGBr0
state :ST_BPSK :core=1 :FGBa1:FGBr1
state :ST_BPSK :core=2 :FGBa2:FGBr2

Figure 7: Parallelizing atoms on multiple cores. (a) A multicore pipeline example; (b) the multicore state machine structure; (c) the code above to compose a multicore pipeline (highlighted fields are multicore extensions to the API).

Multicore states. Multicore states enable parallelized execution of the top-level application state machine. A multicore state structure is made up of a pair of action and rule flowgraphs for each core. When the multicore system enters a state, each core executes its respective flowgraph pair for that state. By setting the core-specific actions of a multicore state to the sub-flowgraphs of a parallelized flowgraph, the programmer is able to create efficient multicore processing pipelines.

A three-core version of the BPSK state ST_BPSK is shown in fig. 7b. For this state, the cores now use sub-flowgraphs FGBa0, FGBa1, and FGBa2 as actions. Similarly, they use FGBr0, FGBr1, and FGBr2 as rules. The code to declare the multicore state is shown in fig. 7c.

A multicore state machine executes like multiple parallel state machines executing asynchronously. Each core has its own dispatcher atom (fig. 7b). On any core, control flow starts at the dispatcher, passes through the next state to execute, and returns to the dispatcher. Cores synchronize on the state transition sequence by exchanging transition decisions computed in the rule flowgraphs.

In our example, once the system enters ST_BPSK, all cores start processing their respective action flowgraphs for the state. These sub-flowgraphs exchange data through the transfer atoms that the programmer inserted when parallelizing them. Collectively, they execute the parallelized BPSK processing pipeline.

After finishing a state's action iterations, cores independently execute their respective rule flowgraphs for that state. In our configuration, only dsp0 executes the decision atom dx for the BPSK state. Its output decision is distributed to each dispatcher through transfer atoms t1 and t2. By executing state-transition decision atoms on a single (though possibly different) core for each state, the programmer synchronizes state transitions across all cores. In this way, the entire system transitions states in lock-step.

Accelerator atoms. The pipeline shown in fig. 7a includes atoms for Viterbi decoding on Viterbi Co-Processors (VCPs). To use the VCPs, Atomix provides VCPIssue (v) and VCPWait (w) atoms. A VCPIssue atom can configure a VCP and start its execution. A VCPWait atom can wait for VCP execution to terminate and thus synchronize a DSP core with a VCP core. A hardware accelerator takes a fixed amount of time to execute a given workload. Consequently, issue and wait operations interacting with an accelerator finish executing in predictable amounts of time, making them atoms. This model extends to other accelerators.

Multicore execution model and timing. The multicore execution model of Atomix is similar to the single-core setting. Blocks access FIFOs for input and output buffers before they execute, they run as soon as buffers are available, and they always run to completion. Cores communicate by transferring data between FIFOs, and synchronize execution by polling FIFOs with blocking calls. When a block polls a FIFO, its wait duration is determined by the execution times of upstream atoms, including data producers and transfer atoms.

Figure 8: Lock-free queue management. (a) Queue management data structures: queue status (readIdx, writeIdx) and per-buffer status (freeOrBusy, filledOrEmpty). (b) Buffer state machine.

In this manner, execution control flow simply mimics data flow in the system. This allows the timing of a multicore atom to be computed by adding up the execution times of atoms on the critical path of execution, i.e., the slowest path of data flow in the parallel execution.

To illustrate the timing model, we analyze the BPSK action flowgraph pipeline. We assume that data is arriving every 4 units of time. We also assume that the blocks have the following execution costs per iteration: ofdm: 2.5, csi, eq: 1.5, b: 1.0, t, u, w, v: 0.5, and VCP-processing: 6.0. The total DSP computation load of this flowgraph is 6.0 units, which cannot be sustained at the data arrival rate by a single DSP. When laid out on three DSP cores and two VCPs with transfer atoms, the pipeline is able to meet the processing throughput requirement, as shown in fig. 7a. The pipeline's latency is the timing of the critical path ofdm-t-eq-u-b-w-v-VCP, which is 13.5 units.
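This critical-path latency can be computed mechanically: treat the flowgraph as a weighted DAG and take the longest finish time. The helper below is an illustrative sketch; note that summing only the per-atom costs itemized above along the named chain gives 13.0 units, so the reported 13.5 evidently includes a small cost not itemized in the text, and the numbers here demonstrate the method rather than reproduce the exact accounting.

```c
#define MAX_ATOMS 16

/* Longest weighted path through a flowgraph whose atoms are given in
 * topological order: an atom starts once all of its upstream producers
 * finish. pred[i] lists atom i's predecessors, terminated by -1. */
double critical_path(int n, const double cost[],
                     const int pred[][MAX_ATOMS])
{
    double finish[MAX_ATOMS] = {0};
    double latest = 0.0;
    for (int i = 0; i < n; i++) {
        double start = 0.0;
        for (int j = 0; pred[i][j] >= 0; j++)
            if (finish[pred[i][j]] > start)
                start = finish[pred[i][j]];   /* wait for slowest input */
        finish[i] = start + cost[i];
        if (finish[i] > latest)
            latest = finish[i];
    }
    return latest;
}
```

Applied to the chain ofdm-t-eq-u-b-w-v-VCP with the costs listed above, the function returns the sum of the itemized costs along that path.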

3.6 Efficient data-flow implementation

The runtime system of Atomix executes fine-grained pipelines with hardware-like efficiency using two main techniques: lock-free FIFOs and asynchronous data transfers.

Lock-free FIFO implementation. The FIFO implementation in Atomix provides a simple API with four functions (table 1). We design the functions to execute extremely efficiently: the Get functions take 40 cycles each, and the Done functions take 8 cycles each. Typically, FIFO queues in multicore environments are implemented with locks to serialize access. Atomix does not use locks because they are expensive, and they can create timing variability by serializing concurrent accesses in arbitrary order. The Atomix FIFO API implementation is entirely lock-free.

Figure 9: Link number field and Transfer Completion Code (TCC) for handling asynchronous transfers. (a) Extended buffer state: per-buffer status (freeOrBusy, filledOrEmpty, linkNum). (b) Transfer scenario.

In order to operate correctly without locks, Atomix requires every FIFO to have a single reader and a single writer (SRSW) at any point in multicore execution. In addition to the common readIdx and writeIdx status for the queue, a per-buffer 2-bit (freeOrBusy, filledOrEmpty) status tuple is maintained (fig. 8a). As FIFO API calls are made, the buffer transitions through those states (fig. 8b). If a call cannot succeed (e.g., RG in state 01), it blocks until the buffer reaches the required state for the call to proceed. Only one of the four possible API calls can succeed, and hence modify the data structures, in any given state. By the SRSW property, only one FIFO accessor will ever be allowed to write to the FIFO data structure, ensuring race-free operation without locks.
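A minimal single-reader/single-writer ring in plain C conveys the idea. The function names follow the paper's Get/Done vocabulary, but the exact signatures in table 1 may differ, and the non-blocking NULL return here stands in for the blocking spin of the real implementation.

```c
#include <stddef.h>

#define NBUF 4

typedef struct {
    int readIdx, writeIdx;          /* queue status */
    int busy[NBUF], filled[NBUF];   /* per-buffer 2-bit status tuple */
    int data[NBUF];                 /* payload: one word per buffer here */
} fifo_t;

/* writer claims the next buffer; NULL if it is not free+empty
 * (a real implementation would spin, blocking the calling atom) */
int *writeGet(fifo_t *f) {
    int i = f->writeIdx;
    if (f->busy[i] || f->filled[i]) return NULL;
    f->busy[i] = 1;
    return &f->data[i];
}
void writeDone(fifo_t *f) {
    int i = f->writeIdx;
    f->busy[i] = 0; f->filled[i] = 1;
    f->writeIdx = (i + 1) % NBUF;
}

/* reader claims the oldest filled buffer; NULL if the queue is empty */
const int *readGet(fifo_t *f) {
    int i = f->readIdx;
    if (f->busy[i] || !f->filled[i]) return NULL;
    f->busy[i] = 1;
    return &f->data[i];
}
void readDone(fifo_t *f) {
    int i = f->readIdx;
    f->busy[i] = 0; f->filled[i] = 0;
    f->readIdx = (i + 1) % NBUF;
}
```

In each buffer state exactly one of the four calls can succeed, and with one reader and one writer that call is made by exactly one side, which is why no lock is needed.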

The SRSW constraint may seem too restrictive compared to typical FIFO APIs. However, in our experience, most FIFOs in wireless applications naturally have single readers and single writers. Multiple readers or writers from different states can still be wired to the same FIFO since, at any given time, the system is in exactly one state.

Handling asynchronous transfers. The FIFO functions read/writeDone cannot be used with asynchronous DMA transfers. In order to run them upon DMA transfer completion, DSP cores would have to be interrupted, which would cause variability in atom execution. To deal with this issue, we introduce the buffer state forwarding mechanism, whereby the FIFO manager is able to deduce readDones and writeDones without those calls being made explicitly.

To implement buffer state forwarding, we extend the per-buffer status with a linkNum field (LN). On our prototyping platform, TI KeyStone DSPs, DMA channels have associated event registers to indicate transfer completion. We denote this register by TCC (fig. 9). When the DMA transfer atom TD issues a DMA request, it sets up the LN field of the source and destination buffers to point to the TCC register of the DMA channel used for the transfer. TD issues the request and returns. The core that executed TD moves on to other atoms. The transfer finishes later, asynchronously, but the buffers continue to be marked busy. However, when the buffers are accessed again in FIFO order, the queue manager is able to identify from the non-zero LN field that they were marked busy for an asynchronous transfer. The queue manager then polls the TCC flag pointed to by LN. When the TCC indicates transfer completion, the queue manager forwards the state of the buffer as if the corresponding readDone or writeDone was called on the buffer.
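The forwarding check can be sketched as follows. This is a simplified model: linkNum and the TCC flags follow the paper's description, but the single-word `tcc[]` array and the function names are illustrative.

```c
#define NCHAN 4
int tcc[NCHAN + 1];   /* completion flags; index 0 means "no link" */

typedef struct {
    int busy, filled;  /* per-buffer status */
    int linkNum;       /* 0, or the DMA channel whose TCC to poll */
} buf_t;

/* transfer atom: issue an asynchronous DMA write into `b` and return
 * immediately; the buffer stays busy, with linkNum recording which
 * channel's TCC will eventually signal completion */
void dma_issue_write(buf_t *b, int chan) {
    b->busy = 1;
    b->linkNum = chan;
    tcc[chan] = 0;             /* transfer in flight */
}

/* called by the queue manager when the buffer is next accessed in
 * FIFO order: if the buffer was left busy for an async transfer, poll
 * its TCC and, on completion, forward the deferred writeDone */
int forward_state(buf_t *b) {
    if (b->linkNum == 0) return !b->busy;   /* no pending transfer */
    if (!tcc[b->linkNum]) return 0;         /* still in flight */
    b->busy = 0; b->filled = 1;             /* deferred writeDone */
    b->linkNum = 0;
    return 1;
}
```

The key property is that no interrupt is ever taken: completion is discovered lazily, only when the buffer is next needed, so atom timing stays undisturbed.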

3.7 Full app development workflow


Figure 10: Full Atomix app development workflow

Stages in Atomix app development. To summarize the Atomix framework design, the following is an overview of the application development workflow (fig. 10). First, the user implements signal processing blocks as atoms in the C language. Next, she composes blocks into flowgraphs and states using the declarative interface of Atomix. Then, she computes a parallelized schedule and resource assignment that will meet the latency and throughput requirements of the application, and incorporates it in the declarative app code. This high-level app code is then compiled down to low-level C code by the Atomix compiler. Finally, the low-level application C code is compiled with the platform's native C compiler and linked against the Atomix runtime libraries into a binary. The modular structure of Atomix applications and the streamlined development flow let the user rapidly iterate on designs and optimize performance.

Algorithmic scheduling. The execution model of Atomix makes it possible to algorithmically compute the best parallelized schedule for an application. Blocks can be profiled individually for execution times. Then, finding the optimal schedule for a flowgraph can be expressed as a resource assignment problem with dependency constraints. FIFO access costs, data transfer costs, and accelerator access costs can all be modeled as additional constraints. We have been able to formulate the flowgraph scheduling problem as an Integer Linear Program (ILP). The flowgraph scheduling problem in Atomix is similar to the instruction-loop scheduling problem solved by VLIW compilers [17, 18]. Incorporating algorithmic scheduling in the Atomix compiler can simplify the application development flow even further.
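One minimal way such an ILP can be set up (a sketch, not necessarily the paper's exact formulation): let the binary variable $x_{a,c,t}$ mean that atom $a$ starts on resource $c$ at time slot $t$, and let $d_a$ be atom $a$'s profiled execution time.

```latex
\begin{align*}
\min\;& T \\
\text{s.t.}\;& \textstyle\sum_{c,t} x_{a,c,t} = 1
  && \forall a && \text{(each atom scheduled exactly once)} \\
& \textstyle\sum_{a}\sum_{t'=t-d_a+1}^{t} x_{a,c,t'} \le 1
  && \forall c,\,t && \text{(one atom per resource at a time)} \\
& \textstyle\sum_{c,t} t\,x_{b,c,t} \;\ge\; \sum_{c,t} (t + d_a)\,x_{a,c,t}
  && \forall\,(a \to b) && \text{(consumer starts after producer finishes)} \\
& \textstyle\sum_{c,t} (t + d_a)\,x_{a,c,t} \le T
  && \forall a && \text{(makespan bound)}
\end{align*}
```

FIFO access, data transfer, and accelerator access costs can be folded in by inflating $d_a$ or by adding explicit nodes for the corresponding transfer, issue, and wait atoms.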

4 Building an 802.11a Receiver in Atomix

We put the Atomix framework's ability to provide low latency and high throughput to the test by using it to implement a 10MHz 802.11a receiver called Atomix11a. We implement 802.11a as a benchmark due to its particularly demanding requirements. It uses the same bandwidth modes as LTE (5, 10, 20 MHz), imposing similar processing complexity. However, it places much more stringent frame-decoding latency constraints: 64μs for 5MHz and 32μs for 10MHz, compared to more than 1ms for LTE. To stress the system, we implement the highest-throughput Modulation and Coding Scheme (MCS) in 802.11a, MCS 7: 64-QAM with 3/4 coding rate, which operates at 27Mbps on a 10MHz channel.

We first evaluate Atomix11a and show that (1) Atomix11a can decode over-the-air frames in real time on the TI 6670, a 4-core, 1.25GHz DSP with 4 Viterbi accelerators, (2) Atomix11a can achieve 36μs processing latency with 1.5μs of variability, sufficient to meet the requirements of a 5MHz 802.11a channel and close to meeting the 10MHz requirements, and (3) most of the atoms we built for Atomix11a have predictable timing. Next, we describe the use of Atomix to implement a modular 802.11a receiver that has low latency and high throughput on the low-power (7 Watt) TI 6670 DSP. We implemented the receiver in about 3,000 lines of declarative application code (excluding individual block implementations in the C language).

4.1 Evaluation of Atomix11a

We first demonstrate that Atomix11a is a robust, faithful implementation of 802.11a signal processing. Then we evaluate Atomix11a's latency and timing variability, and finally we evaluate the runtime of Atomix11a's atoms to show that they come close to Atomix's fixed-runtime atom abstraction.

Atomix11a exceeds receiver sensitivity requirements. An 802.11a-compliant receiver must be able to decode 1,000-byte packets with a Packet Error Rate (PER) of 0.10 at -65dBm receive power over an AWGN channel, as measured at the RF frontend's antenna port [7]. In our experiment, we feed the signal over an RF co-axial cable to emulate an AWGN channel without multipath reflections. On the receiver, we use the high-quality RF frontend of an R&S FSW spectrum analyzer to digitize the received signal into baseband I/Q samples, so the experiment focuses on the robustness of the Atomix11a baseband running on the TI 6670 DSP, not the quality of the RF frontend. Fig. 11 shows the results of this experiment. Atomix11a exceeds 802.11a's requirement of 0.10 PER at -65dBm; it achieves a 0.046 PER at -78dBm.

Figure 11: Atomix11a exceeds receiver sensitivity specifications (gray box).

Figure 12: Atomix11a processing latency is at most 36.4μs and it varies by at most 1.5μs.

Atomix11a operates robustly indoors. The second robustness test is to see if Atomix11a can decode packets sent over the air in challenging indoor multipath channels causing frequency-selective fading. Robust 802.11a receivers use a computationally intensive zero-forcing channel equalizer to equalize the subcarriers.

We set up a 6-meter link across an office and transmitted 100,000 802.11a frames, each 1,000 bytes, at MCS 7 (64-QAM, 3/4 coding rate), over the 10MHz channel at 2.479GHz. We used a USRP2 RF frontend connected to the TI 6670 DSP over gigabit ethernet. Both the transmitter and the receiver were fitted with 3 dBi omnidirectional antennas. The receiver successfully decoded (verified CRC) 99,999 of the 100,000 frames. Therefore Atomix11a is capable of an extremely low PER of 0.001% in an indoor office environment, and is likely a faithful implementation of 802.11a.

Atomix11a has low processing latency. Next, we perform an end-to-end test of the Atomix framework's ability to support a low-latency, low-timing-variability, and high-throughput signal processing chain. We transmitted 200 frames each of different sizes (6-50 symbols, 122-1310 bytes) to the USRP2, which is connected to the TI 6670 DSP. We compute the processing latency by subtracting the frame processing time of Atomix11a from the frame's airtime. We measure Atomix11a's maximum processing latency, as well as the range over which it varies.

Figure 13: Atomix11a atoms have low timing variability.

Fig. 12 shows the results of this experiment. The whiskers are the minimum and maximum of each of the 200 frames, and the boxes show the 25th, 50th, and 75th percentiles. For all the frame sizes tested, the minimum and maximum processing latency is similar, indicating that the composition of atoms adds minimal latency between frames. The maximum processing latency is 36.4μs, 1.14× the requirement of 32μs for 802.11a. Although 36.4μs latency is low, the Atomix11a implementation could be further optimized to meet the protocol latency requirements. Specifically, there is room to optimize the implementations of individual atoms, and to algorithmically compute the parallelization schedule of atoms that minimizes decode latency.

Most Atomix11a atoms have low timing variability. In the final experiment, we observe the variability of the runtime of every atom in Atomix11a. We expect some variability due to L1 caching and micro-architectural sources of variability like bus contention (sec. 3.2). We instrumented each of the atoms in the Atomix11a implementation with a lightweight cycle counter that records the cycle count of each execution of the atom. We measured the runtime of every atom that executes while receiving 12 frames of 500 bytes (20 OFDM symbols at MCS 7). Most atoms run every OFDM symbol of the frame.
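The per-atom instrumentation can be sketched as below. `read_cycles()` stands in for the DSP's hardware cycle counter (e.g., TSCL on TI KeyStone parts), and the stats structure is illustrative rather than the paper's exact probe.

```c
#include <stdint.h>

/* Running min/max/count of an atom's measured runtimes. */
typedef struct { uint64_t min, max, runs; } atom_stats_t;

void record_run(atom_stats_t *s, uint64_t cycles)
{
    if (s->runs == 0 || cycles < s->min) s->min = cycles;
    if (s->runs == 0 || cycles > s->max) s->max = cycles;
    s->runs++;
}

/* timing variability of the atom: spread between slowest and
 * fastest observed execution */
uint64_t runtime_range(const atom_stats_t *s)
{
    return s->max - s->min;
}
```

Each atom invocation is wrapped as `start = read_cycles(); atom(); record_run(&stats, read_cycles() - start);`, keeping the probe itself to a few cycles so it does not perturb the timing being measured.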

Fig. 13 shows the min, max, and 75th-percentile runtime of all 150 atoms that were executed in Atomix11a (bottom), as well as the number of times each of those atoms was run in our experiment (top). Most atoms have a fixed or insignificant runtime (atoms 60-150). These atoms include framework atoms such as memory transfers, as well as Atomix11a computation atoms such as the 64-QAM soft slicer and the PLCP's soft deinterleaver.

Figure 14: Fine-grained OFDM symbol pipelining in Atomix11a's data decode state. (The figure pipelines per-symbol stages across resources: CFO correction and OFDM demodulation on core 0; equalize, slice, deinterleave, and depuncture on core 1; Viterbi issue and data assembly on core 2; descramble and CRC32 on core 3; with Viterbi co-processors 0-3 decoding in parallel.)

Atomix11a's atoms have at most a 0.486μs range between their maximum and minimum runtime due to L1 caching and micro-architecture. The atom with the largest runtime range (atom 0) is the channel estimator, which is the most computationally intensive atom in Atomix11a. Although the difference between the maximum and minimum runtime for these atoms can be significant, for all atoms the 75th percentile of runtime is close to the minimum. This shows that the atom abstraction holds well in practice. Further, the execution times of different atoms have low co-variance. This explains the low end-to-end packet decoding variability of 1.5μs. This variability could be reduced further by disabling L1 caching on the TI 6670 and managing the L1 SRAM in software with Atomix transfer atoms.

4.2 Implementation of Atomix11a

The modularity resulting from the atom abstraction enabled us to implement Atomix11a with fine-grained pipeline parallelism, which was crucial in achieving high throughput and low latency on the TI 6670 DSP. We implemented signal processing blocks as fine-grained atoms that operate on one OFDM symbol (8μs worth of samples at 10MHz) every iteration, as discussed earlier. The Atomix API enabled us to compose the block-level atoms into flowgraphs and states and schedule them for high performance within a low power budget of 7W.

The precise timing of atoms allowed us to spread the pipeline over all the cores and accelerators so it would execute without stalls and reliably achieve the required processing throughput of 10Msps. Fine-grained atoms allowed us to pipeline data processing with data reception to achieve a low decode latency of 36μs.

Fine-grained pipeline structure. The fine-grained pipeline structure of Atomix11a is shown in fig. 14. It depicts a snapshot of the pipelining in the payload decoding state. At any point in the steady state, six OFDM symbols are being processed in the pipeline. With the fine-grained resource allocation API of Atomix, we could lay out the pipeline so that each DSP core and Viterbi co-processor (see ahead) contributed to the processing of an OFDM symbol. This enabled the highest sample processing throughput achievable by the hardware resources.

Figure 15: Core utilization of Atomix11a while it receives a 1500-byte MCS 7 frame at 10MHz.

The core utilization map resulting from our fine-grained pipeline is shown in fig. 15. It depicts the processor's active and idle time on all four DSP cores while it is receiving a 1500-byte 802.11a MCS 7 frame. The vertical grid indicates the arrival of a symbol to Atomix11a. As the figure shows, all cores participate in a pipeline to process every arriving OFDM symbol.

Parallelizing Viterbi decoding for high throughput. Optimal Viterbi decoding is a sequential algorithm. However, in practice, the decoded sequence of bits (codeword) can be partitioned into overlapping chunks that can be decoded in parallel with negligible performance penalty under certain overlap-size constraints. We use this scheme to apply all 4 VCPs to decoding the same WiFi frame, where each chunk is one-half of an OFDM symbol with appropriate padding of surrounding bits to satisfy the overlap constraints. Such a scheme has also been used in BigStation [30], where it is described in more detail.
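The chunking itself reduces to index arithmetic, as the sketch below shows. The sizes are illustrative: the paper uses half-OFDM-symbol chunks, and the safe overlap length depends on the convolutional code's constraint length, so `overlap` here is a placeholder parameter.

```c
/* Split an n-bit codeword into `chunks` pieces decoded in parallel.
 * Each piece is padded by `overlap` bits on both sides (clamped at the
 * codeword edges) so the independent Viterbi decisions converge; only
 * the core (unpadded) region of each decode is kept. */

typedef struct { int lo, hi, core_lo, core_hi; } chunk_t;

chunk_t chunk_bounds(int n, int chunks, int overlap, int i)
{
    chunk_t c;
    c.core_lo = i * n / chunks;          /* region whose bits we keep */
    c.core_hi = (i + 1) * n / chunks;
    c.lo = c.core_lo - overlap;          /* padded decode region */
    if (c.lo < 0) c.lo = 0;
    c.hi = c.core_hi + overlap;
    if (c.hi > n) c.hi = n;
    return c;
}
```

Because the core regions tile the codeword exactly, concatenating the kept bits of all chunks reconstructs the full decoded sequence.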

Optimizing latency with overlapping states. Atomix allows different cores to operate simultaneously in different states. This feature enables the implementer to use all cores to execute as many states at once as possible, and thus provide lower latency than if the states ran in series. For example, fig. 15 shows that as the LTF symbol arrives we split its processing across two cores, and the same for the PLCP symbol.2 The split point for the LTF state is after CFO estimation, because the CFO estimates can immediately be applied to the PLCP symbol when it arrives. Then, when the channel estimates finish on core 1 in LTF, the rest of the PLCP is processed. Core 1 uses the channel estimates to complete the PLCP processing with the symbol transferred from core 0. In summary, Atomix enabled pipelined processing of two symbols in two states at the same time on two cores.

2 Note that with a 10MHz signal, the LTF and PLCP processing ends before the arrival of the next symbol. However, with a 20MHz signal (4μs interval), the PLCP symbol will arrive during LTF processing.

Easy to program. We implemented the Atomix11a receiver in 3,000 lines of Atomix's high-level code. The Atomix compiler translated the high-level code into about 30,000 lines of low-level application-specific C code. Atomix thus reduces the lines-of-code effort by an order of magnitude. Further, it makes the development process significantly easier through its abstractions compared to directly writing low-level code. All of the blocks in Atomix11a's signal processing library were implemented in less than 300 lines of C code each.

Low resource footprint. The Atomix atom abstraction requires each configuration of a signal processing block to be split out as a separate block, e.g., BPSK48, BPSK96, QPSK48. To avoid code-size explosion, we implement similar blocks as wrappers around a common parameterized kernel (e.g., BPSK<N>). The Atomix API also provides primitives to instantiate parameterized blocks in the atom declarations.

Using these techniques, Atomix11a has a code size of 468 KByte per core. This comfortably fits in the 512 KByte of L2 SRAM per core that we allocated as program memory. The data size on any core is at most 145 KByte out of the 512 KByte of L2 SRAM allocated for data memory. Code size could be further optimized by creating per-core binaries containing only the code executed on each core. Another approach is to write the blocks in C++ using templates. It is also possible to partition the SRAM in favor of program memory.

5 Modifiability

We demonstrate the modularity of the Atomix 802.11a receiver by adapting it to two new applications: long-distance links, and phase-array location signatures.

5.1 Adapting to long-distance links

Long-distance wireless links appear in settings like rural connectivity [6], point-to-point backhaul, and community networks [4]. They are challenging because the delay spread of the multipath wireless channel grows with link distance, while WiFi is designed for shorter-distance links. As a case study of Atomix modifiability, we demonstrate the process of systematically diagnosing and adapting our base WiFi receiver to work well over an emulated long-distance link.

Figure 16: Modifications for long-distance app

Figure 17: Performance on a two-tap multipath channel. A long cyclic prefix (CP) is critical for long outdoor links. (a) Measured CIR. (b) PER with long CP.

We emulate a long-distance channel that has two multipath components: the line-of-sight component, and a reflection delayed by 2μs (20 samples) that is 12dB lower in strength. On this channel, the unmodified Atomix11a receiver suffered a 100% packet error rate (PER). To ascertain that signal strength was not the limiting factor, we tapped the baseband chain to read off the sample energy values computed by the energy-detection block (ED), as shown in fig. 16. On the simulated transmission with an average carrier-to-noise ratio (CNR) of 45dB, our tap revealed the CNR to be 45.4dB, confirming sufficient signal strength for successful decode.

Having ruled out a signal strength limitation, we suspected a high-distortion channel. To look deeper, we inserted an error-vector-magnitude (EVM) computation block after constellation slicing. On a high-distortion channel, we would expect a high EVM value or, equivalently, a low SNR indicated by the EVM. The EVM block runs in 0.64μs on core 1. We set it to compute on the output of the PLCP field. Core 1 had enough spare cycles (fig. 15) in the PLCP decode state to run EVM without adding latency to packet decode. Our EVM block indicated only 19.2dB SNR, much lower than the 45.4dB CNR. This strengthened the suspicion of a poor channel.

To find the ground truth about the channel, we inserted a block to compute the time-domain channel impulse response (CIR) using the long training field (LTF) of the preamble, the same field used for frequency-domain channel estimation. The CIR block runs in 3.4μs. We set it to run on core 3, which has enough spare cycles after packet detection to run CIR without hurting packet decode. The CIR block showed the channel response to be as shown in fig. 17a. It accurately recovered the emulated channel and pointed us to the root problem: two strong components 20 samples apart, a delay spread too high for the standard WiFi OFDM cyclic prefix length of 16 samples.

Figure 18: AoA spectrum app. (a) Modifications for AoA app. (b) Computed AoA spectrum (polar plot).

Armed with this knowledge, we implemented the solution of increasing the cyclic prefix (CP) length to 32 samples. To implement the longer-CP communication mode, only the SYNC block needed tweaking. It is responsible for discarding the cyclic prefix from each OFDM symbol before passing it to the rest of the processing chain. We tweaked it to discard 32 samples instead of 16. After implementing the longer CP, the performance of the link came back to the expected CNR-PER quality of a benign channel. The observed performance of CP length 32 over the 2μs two-tap channel is shown in fig. 17b. On the same reception as before, the EVM block now indicated an SNR of 38.7dB, much closer to the CNR of 45.4dB, confirming that the signal distortion had been effectively corrected.
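The tweak amounts to a one-parameter change in the CP-stripping step. A sketch (the function name and layout are illustrative, not the SYNC block's actual code):

```c
#define FFT_LEN 64   /* data-carrying samples per 802.11a OFDM symbol */

/* Copy the FFT_LEN data samples of one OFDM symbol out of the raw
 * receive stream, skipping cp_len cyclic-prefix samples: 16 for
 * standard WiFi, 32 for the long-CP mode described above. */
void strip_cp(const float *rx, float *sym, int cp_len)
{
    for (int i = 0; i < FFT_LEN; i++)
        sym[i] = rx[cp_len + i];
}
```

The rest of the chain is untouched: downstream blocks still see exactly FFT_LEN samples per symbol regardless of the CP length chosen.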

By simply tapping existing signals, tweaking a block, and inserting a few blocks, we could methodically adapt the receiver to a new application setting. The whole exercise took just a few hours of programmer time to modify or add a total of 20 lines of declarative code (after we implemented the individual signal processing blocks).

5.2 Adding location signatures

Location signatures of a wireless transmitter can be computed using phase-array processing at a MIMO AP receiver. MUSIC [19] is a popular algorithm to compute the angle-of-arrival (AoA) spectrum, which is used in many location-based systems like [28, 29]. It is desirable to embed AoA spectrum computation on the AP to save the bandwidth incurred in remote computation, and to make location estimates available at the AP at sub-millisecond latencies. For such scenarios, we demonstrate the addition of an AoA computation application to our Atomix 802.11a receiver.

The set of blocks used to compute the AoA spectrum is shown in fig. 18a. A flowgraph is composed of the phase-multiplier block (PH), the correlation matrix computation block (Rxx), a complex Hermitian eigendecomposition block (EIG), and a spectrum computation block (SPT). We leverage the baseband receiver to detect WiFi packets using preamble detection. Preamble samples are tapped and copied to a FIFO for AoA computation. Once the packet decode finishes, the AoA computation is invoked. A sample AoA spectrum computed on the DSP processor is shown in fig. 18b.

The entire AoA computation chain takes 530μs, with room for further optimization. Its components are: PH: 1.5μs, Rxx: 21μs, EIG: 178μs, and SPT: 330μs. For the eigendecomposition, we use a FORTRAN-to-C translated routine from EISPACK [2]. Our simple implementation is already able to compute an AoA spectrum in about half a millisecond at the base station. We are able to reproduce the results from [28, 29], which we omit for brevity. To add the AoA app, we needed to add only about 30 lines of declarative code to the base Atomix11a receiver (in addition to the blocks in C).
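The Rxx stage is a standard rank-one accumulation of the spatial correlation matrix over preamble samples, the first step of MUSIC-style AoA estimation. A sketch (the 3-antenna size and function name are illustrative):

```c
#include <complex.h>

#define NANT 3   /* receive antennas (illustrative) */

/* Accumulate R += x * x^H for one snapshot x of per-antenna samples;
 * averaging such outer products over the preamble yields the spatial
 * correlation matrix fed to the eigendecomposition (EIG) block. */
void rxx_accumulate(double complex R[NANT][NANT],
                    const double complex x[NANT])
{
    for (int i = 0; i < NANT; i++)
        for (int j = 0; j < NANT; j++)
            R[i][j] += x[i] * conj(x[j]);
}
```

By construction the accumulated matrix is Hermitian, which is what allows the EIG block to use a complex Hermitian eigendecomposition routine.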

6 Related work

Existing frameworks suited to building baseband applications using abstractions of blocks and flowgraphs have only targeted GPPs and FPGAs. Also, operating systems for building real-time programs on DSPs do not provide the right abstractions for modular communication applications. To the best of our knowledge, Atomix is the first system to enable high-performance, modular communication applications on multi-core DSPs.

Modular frameworks for GPPs and FPGAs. GNU Radio [5] is a rapid prototyping toolkit for signal processing applications. It provides interfaces to connect blocks into flowgraphs on GPPs and DSPs [10]. However, unlike Atomix, GNU Radio's abstractions do not have guaranteed timing. They depend on abundant processing resources for real-time performance.

The SORA [21] software radio and its modular software architecture, called Brick [11], provide real-time performance for wireless applications on GPPs. With similar modularity goals, SORA/Brick and Atomix share similarities in their programming interfaces. However, the differences between GPP and DSP architectures make for very different designs of the two systems. SORA's design includes mechanisms like binding threads to cores, masking interrupts from cores, and special memory handling for optimizing cache performance in signal processing apps. These apply well to the GPP environment with a general-purpose operating system but do not carry over to the DSP architecture.

Ziria [20] is a domain-specific language targeted at GPPs for implementing efficient PHY designs. It facilitates concise and error-free expression of PHY control logic. It also provides support for automatic vectorization of individual signal processing blocks. The advantages of programming signal processing apps in Ziria, currently limited to GPPs, are complementary to those of Atomix. They could be extended to DSPs by a compiler that translates Ziria programs to Atomix programs, using Atomix as a convenient abstraction layer.

The Click modular router [9] provides an abstraction of blocks and flowgraphs for building packet-based network infrastructure. The expressiveness of Click's simple elements and FIFOs was an inspiration for Atomix. However, Click is not suited to developing wireless applications on DSPs. For instance, Click's elements lack timing guarantees; the Click dynamic scheduler is a source of timing variability.

StreamIt [24] provides a flowgraph model and an intuitive syntax to program stream processing applications. The StreamIt programming language explicitly reveals parallelization opportunities to the compiler so it can exploit task, data, and pipeline parallelism. In that sense, StreamIt's compiler complements Atomix. However, StreamIt's abstractions lack guaranteed execution times. Moreover, StreamIt encourages embedding branches within blocks, which can cause variability. Atomix forbids this behavior, and requires all branches to be expressed as their own atoms.

AirBlue [12] simplifies the design of cross-layer signal processing applications for FPGAs. Like Atomix, AirBlue also requires FIFOs between signal processing blocks to enable modular modifications. However, modular implementation of an application for predictable and efficient execution poses different challenges on multi-core DSPs than on FPGAs.

Real-time operating systems for DSPs. Real-time operating systems such as QNX Neutrino [16] and Wind River VxWorks [27] are used in automotive, robotics, and avionics systems. DSPs have their own real-time operating systems, such as TI SYS/BIOS [22] and 3L Diamond [1]. In principle, the real-time guarantees provided by these systems (e.g., priority scheduling, predictable interrupt timing) can be applied to real-time wireless infrastructure applications too. However, these abstractions can only be indirectly mapped to the blocks and flowgraphs that make up baseband applications. Real-time OSes generally provide some form of FIFO-based interprocess communication that can be used to create flowgraphs. However, such FIFOs are not designed for the fine-grained pipeline parallelism that Atomix's lock-free FIFOs enable.

7 Discussion

Atomix in the development pipeline: Atomix aims to make signal processing applications easy to deploy on DSPs in the wireless infrastructure. However, Atomix can also enable at-scale prototyping and testing of wireless applications. Platforms like the USRP E310 and E110 integrate DSPs and ARM cores with commodity RF frontends in embedded form factors. These platforms are ideal for building wide-scale research testbeds. Atomix makes them easy to program for rapid prototyping.

Limitations of Atomix: Atomix pushes conditionals out of blocks to the top-level state machine construct. This makes stepping through conditionals much slower than having them embedded within blocks. Most baseband apps spend the majority of their time in computationally heavy signal processing blocks, making the cost of stepping through states negligible. However, for applications dominated by data-dependent workloads, such as iterative computations like successive interference cancellation or searching/sorting, Atomix can be inefficient.

8 Conclusion

Atomix is a framework for building and widely deploying modular, predictable-latency, high-throughput baseband signal processing applications. The key abstraction that enables Atomix is that of an atom, a computation unit with fixed execution time. Atomix demonstrates that by abstracting operations as atoms, ease of programming and fine control for predictable timing can be provided simultaneously, without sacrificing efficiency, for signal processing applications. We show that it can be used to deploy apps in both WiFi and LTE infrastructures.

We plan to make Atomix available to the research community and enable the implementation and evaluation of new baseband apps at scale. We also plan to further develop Atomix to be a programmable dataplane for a software-defined radio access network. Specifically, we plan to investigate open APIs (analogous to OpenFlow) that programmable basestations should expose to enable networks to tightly manage packet processing on a per-flow basis, and enable software-defined control over the wireless infrastructure.

9 Acknowledgments

We thank Kyle Jamieson and Jie Xiong for providing us with a reference implementation of SecureArray [29]. We thank Jeffrey Mehlman for help with implementation, and Yiannis Yiakoumis for suggestions on presenting the content. We also thank our shepherd Deepak Ganesan for his valuable feedback, and the anonymous reviewers for their insightful comments. This work was supported by the NSF POMI grant.

References

[1] 3L Diamond. http://www.3l.com/products/3l-diamond.

[2] EISPACK. http://www.netlib.org/eispack.

[3] 3GPP. Specification Release version matrix. http://www.3gpp.org/specifications/67-releases.

[4] J. Bicket, D. Aguayo, S. Biswas, and R. Morris. Architecture and evaluation of an unplanned 802.11b mesh network. In Proc. ACM Conference on Mobile Computing and Networking (MobiCom), 2005.

[5] E. Blossom. GNU Radio: tools for exploring the radio frequency spectrum. Linux Journal, 2004(122), June 2004.

[6] K. Chebrolu, B. Raman, and S. Sen. Long-distance 802.11b links: Performance measurements and experience. In Proc. ACM Conference on Mobile Computing and Networking (MobiCom), 2006.

[7] IEEE. 802.11a-1999.

[8] K. Joshi, S. Hong, and S. Katti. Pinpoint: Localizing interfering radios. In Proc. Symposium on Networked Systems Design and Implementation (NSDI), 2013.

[9] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular router. ACM Transactions on Computer Systems (TOCS), 2000.

[10] S. Ma, V. Marojevic, P. Balister, and J. H. Reed. Porting GNU Radio to multicore DSP+ARM system-on-chip: a purely open-source approach. In Karlsruhe Workshop on Software Radios, 2014.

[11] Microsoft Research. Brick Specification. The SORA Project, 2011.

[12] A. Ng, K. E. Fleming, M. Vutukuru, S. Gross, Arvind, and H. Balakrishnan. Airblue: A system for cross-layer wireless protocol development. In Proc. ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), 2010.

[13] K. Nikitopoulos, J. Zhou, B. Congdon, and K. Jamieson. Geosphere: Consistently turning MIMO capacity into throughput. In Proc. ACM SIGCOMM, 2014.

[14] R. Patra, S. Nedevschi, S. Surana, A. Sheth, L. Subramanian, and E. Brewer. WiLDNet: Design and implementation of high performance WiFi based long distance networks. In Proc. Symposium on Networked Systems Design and Implementation (NSDI), 2007.

[15] P. Puschner and A. Burns. Guest editorial: A review of worst-case execution-time analysis. Real-Time Systems, 2000.

[16] QNX Software Systems, Ltd., Kanata, Ontario, Canada. http://www.qnx.com.

[17] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proc. International Symposium on Microarchitecture (MICRO), 1994.

[18] J. Sanchez and A. Gonzalez. The effectiveness of loop unrolling for modulo scheduling in clustered VLIW architectures. In Proc. International Conference on Parallel Processing (ICPP), 2000.

[19] R. O. Schmidt. Multiple emitter location and signal parameter estimation. IEEE Transactions on Antennas and Propagation, 1986.

[20] G. Stewart, M. Gowda, G. Mainland, B. Radunovic, D. Vytiniotis, and C. L. Agullo. Ziria: A DSL for wireless systems programming. In Proc. ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2015.

[21] K. Tan, J. Zhang, J. Fang, H. Liu, Y. Ye, S. Wang, Y. Zhang, H. Wu, W. Wang, and G. M. Voelker. Sora: High performance software radio using general purpose multi-core processors. In Proc. Symposium on Networked Systems Design and Implementation (NSDI), 2009.

[22] Texas Instruments. SYS/BIOS (TI-RTOS Kernel).

[23] Texas Instruments. TMS320C6670: Multi-core Fixed and Floating-Point System-on-Chip. http://www.ti.com/product/tms320c6670.

[24] W. Thies, M. Karczmarek, and S. P. Amarasinghe. StreamIt: A language for streaming applications. In Proc. International Conference on Compiler Construction (CC), 2002.

[25] A. Vigato, S. Tomasin, L. Vangelista, V. Mignone, N. Benvenuto, and A. Morello. Coded decision directed demodulation for second generation digital video broadcasting standard. IEEE Transactions on Broadcasting, 2009.

[26] M. Vutukuru, H. Balakrishnan, and K. Jamieson. Cross-layer wireless bit rate adaptation. In Proc. ACM SIGCOMM, 2009.

[27] Wind River Systems, Inc, Alameda, CA, USA.http://www.vxworks.com.

[28] J. Xiong and K. Jamieson. ArrayTrack: A fine-grained indoor location system. In Proc. Symposium on Networked Systems Design and Implementation (NSDI), 2013.

[29] J. Xiong and K. Jamieson. SecureArray: Improving WiFi security with fine-grained physical-layer information. In Proc. ACM Conference on Mobile Computing and Networking (MobiCom), 2013.

[30] Q. Yang, X. Li, H. Yao, J. Fang, K. Tan, W. Hu, J. Zhang, and Y. Zhang. BigStation: Enabling scalable real-time signal processing in large MU-MIMO systems. In Proc. ACM SIGCOMM, 2013.
