gordon stewart (princeton), mahanth gowda (uiuc), geoff mainland (drexel), cristina luengo agullo...

Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR),

Dimitrios Vytiniotis (MSR)

ZIRIAA DSL for wireless systems programming

2

Wireless PHY and software radios

PHY layer Most computationally demanding part of wireless stack Mainly ASIC, but lots of innovation in PHY layer design

Software Defined Radios (SDR): Popular platforms for experimentation and deployments Architecture: PCI/USB cards send digitized samples to CPU for

processing

header

payload

crcPHY

Sora PCIe card

SDR hardware CPU

PCIe/USB

3

Software radios today Tools: GNURadio, SORA

Dataflow-like C++ programming models and libraries Big community of contributors

Industrial applications: WiFi AP – Sora GSM BS (2G) – OpenBTS LTE eNodeB (4G, 5G research) – OpenAirInterface

But to meet standard timing requirements, lots of manual optimizations in low-level

codeExample: WiFi needs to output 1.28Gbps

4

Wireless PHY programming tutorial

DetectSTSChannel

EstimationInvert

Channel

Packetstart

Channel info

Decode Header

InvertChannel

Decode Packet

Packetinfo

Detect transmission

Estimates effects of communication

channel

Transmission detected

Channel characteristics

Payload modulation parameters

• Coarse-grained control flow across blocks

• Most blocks process large chunks of statically known sizes

• Each block manages only internal state

WiFi RX Exampleheader

payload

crc

5

Issues with existing tools State and control flow issues

In traditional dataflow you program “control” info in the data path, e.g. Streamit

Alternatively, pipeline reconfiguration through shared state (e.g. SORA); invisible to programmer and compiler

Hand-written optimizations to achieve performance Manual lookup table generation code Manually tuned widths of block inputs, hardcoded inside block

implementations

Consequences Very hard (if possible at all) to reuse blocks in different pipelines Very hard to make functionally correct changes to existing pipelines Very hard to debug

6

So is “software” any good?

Common complaints about hardware platforms … Error prone designs, latency-sensitivity issues Code semantics tied to underlying hardware or execution model Hard to develop, maintain the code, big barrier to entry

… become valid against software platforms too, undermining the very advantages of software!

ZIRIA is a DSL for wireless PHY that addresses this problem

7

ZIRIA A DSL for stream processing

1. Bit processing layer and domain-specific optimizations2. Pipeline composition layer and (a different set of) pipeline optimizations

Programming abstractions well-suited for wireless PHY implementations in software (e.g. 802.11a/g)

Optimizing compiler that generates fast C code

Open source under Apache 2.0, including WiFi RX/TXhttp://research.microsoft.com/projects/Ziria

http://research.microsoft.com/projects/Ziria

8

A stream transformer t

t :: STrans a b

ZIRIA abstractions

t

input (a)

output (b)

c

input (a)

output (b)

return ctrl (v)

A stream computer cc :: SComp a b v

DetectSTS

ChannelEstimation

InvertChannel

Packetstart

Channel info

Decode Header

InvertChannel

Decode Packet

Packetinfo

WiFi RX

Primitives: z ::= map(f) | repeat(z) | if c then z1 else z2 | take | emit(v) | while c do z | ...

9

Composing in the data or control path

seq { v <- (c1 >>> t1) ; t2 >>> t3 }

Like a pipe: “along the data

path”

In sequence:“along the control

path”

c1

t1

t2

t3

v

11

Compilation model

Pipeline parallelization across top-level pipes (>>>) Compilation to finite state machines within each core Demand-driven (pull): no queues inserted by

compiler Natural for PHY apps Static optimizations of finite state machines based on properties of

blocks

12

Optimizing ZIRIA code

1. Partial evaluation, inlining 2. Type-directed fusion of parts of dataflow

graphs3. Reuse memory, avoid redundant copying4. Pipeline vectorization transformation 5. Automatically compile to lookup tables

(LUTs)

13

Vectorization Transform pipelines to operate over arrays

Good for: locality, dataflow execution overhead Can transform pipeline loops to tight C-level loops Source-to-source and type-preserving => easier to

get right13

repeat { x <- take ; emit f(x) }

repeat { (vect_xa : arr[8] int) <- take; for i in 0,2 { for j in 0,4 { vect_ya[j] := f(vect_xa[i*4 + j]) } emit vect_ya; }}

STrans int intSTrans (arr[8] int) (arr[4]

int)

14

How to get vectorization right?

repeat { x <- take; emit x

} >>> takes 4;y <- takes 4;print y;

Problem: pipeline reconfiguration

Solution: see paper

0,1,2,3,4,5,6,7,8,9,A,B

4,5,6,7

x <- takes 8emits x[0,4]emits x[4,4] ??8,9,A,

B

WRONG!!!!

15

Throughput (WiFi RX)

sample = a complex number = 2*16bit = 4 bytes

WiFi

16

Throughput (WiFi TX)

WiFi

17

Effects of optimizations (WiFi RX)

18

Latency & real-world performance

Throughput only gives average latency We also evaluate tail latency: see paper for details

Real-world experiments: 98% packet success rate

19

Conclusions

Ziria: a new DSL and compiler for wireless PHY Allows for concise and reusable block implementations

Expresses state and pipeline reconfiguration Compiler optimizes for specific pipeline placements

Generates code that matches hand-optimized C code

20

Next steps Make Ziria a standard platform for software radios Support more radio platforms (SORA, USRP, BladeRF) Support different architectures (CPUs, DSPs, SoCs, FPGAs) Provide more implementations of wireless standards (WiFi, LTE,

Zigbee)

Future research More expressive forms of parallelism Programming up the network stack: MAC layer

http://research.microsoft.com/projects/Ziriahttp://github.com/dimitriv/Ziria



http://github.com/dimitriv/Ziria

21

Thanks!




22

Latency (RX)

Normalized with1/40Mhz = 0.025

23

Effects of optimizations (802.11a/g TX)

Vectorization alone not great (reason: bit array addressing) but enables LUTs!

24

Latency (TX)

Methodology• Sample consecutive

read operations• Normalize 1/data-rate• Result: largely satisfying

latency requirements

25

Pipeline vectorizationProblem statement: Given (c :: SComp a b v), rewrite it to (vectc :: SComp (arr[N] a) (arr[M] b) v)for suitable N,M

Benefits of vectorization Fatter pipelines => lower dataflow graph interpretive overhead Array inputs vs individual elements => more data locality Can transform pipeline loops to tight C-level loops Especially for bit-arrays, enhances effects of LUTs Source-to-source and type-preserving => easier to get rightNB: Manually done in Sora, hurts code reusability

26

Vectorization and LUT synergylet comp scrambler() = var scrmbl_st: arr[7] bit := {'1,'1,'1,'1,'1,'1,'1}; var tmp,y: bit; repeat { (x:bit) <- take; do { tmp := (scrmbl_st[3] ^ scrmbl_st[0]); scrmbl_st[0:5] := scrmbl_st[1:6]; scrmbl_st[6] := tmp; y := x ^ tmp };

emit (y) }

let comp v_scrambler () = var scrmbl_st: arr[7] bit := {'1,'1,'1,'1,'1,'1,'1}; var tmp,y: bit;

var vect_ya_26: arr[8] bit; let auto_map_71(vect_xa_25: arr[8] bit) = LUT for vect_j_28 in 0, 8 { vect_ya_26[vect_j_28] := tmp := scrmbl_st[3]^scrmbl_st[0]; scrmbl_st[0:+6] := scrmbl_st[1:+6]; scrmbl_st[6] := tmp; y := vect_xa_25[0*8+vect_j_28]^tmp; return y }; return vect_ya_26 in map auto_map_71

Vectorization

Automatic lookup-table-compilationInput-vars = scrmbl_st, vect_xa_25 = 15 bitsOutput-vars = vect_ya_26, scrmbl_st = 2 bytesIDEA: precompile to LUT of 2^15 * 2 = 64K

RESULT: ~ 2.8Gbps scrambler

gordon stewart (princeton), mahanth gowda (uiuc), geoff mainland (drexel), cristina luengo agullo...

Documents