gordon stewart (princeton), mahanth gowda (uiuc), geoff mainland (drexel), cristina luengo agullo...

25
Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis (MSR) ZIRIA A DSL for wireless systems programming

Upload: joseph-jordan

Post on 19-Dec-2015

229 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR),

Dimitrios Vytiniotis (MSR)

ZIRIAA DSL for wireless systems programming

Page 2: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

2

Wireless PHY and software radios

PHY layer Most computationally demanding part of wireless stack Mainly ASIC, but lots of innovation in PHY layer design

Software Defined Radios (SDR): Popular platforms for experimentation and deployments Architecture: PCI/USB cards send digitized samples to CPU for

processing

header

payload

crcPHY

Sora PCIe card

SDR hardware CPU

PCIe/USB

Page 3: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

3

Software radios today Tools: GNURadio, SORA

Dataflow-like C++ programming models and libraries Big community of contributors

Industrial applications: WiFi AP – Sora GSM BS (2G) – OpenBTS LTE eNodeB (4G, 5G research) – OpenAirInterface

But to meet standard timing requirements, lots of manual optimizations in low-level

codeExample: WiFi needs to output 1.28Gbps

Page 4: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

4

Wireless PHY programming tutorial

DetectSTSChannel

EstimationInvert

Channel

Packetstart

Channel info

Decode Header

InvertChannel

Decode Packet

Packetinfo

Detect transmission

Estimates effects of communication

channel

Transmission detected

Channel characteristics

Payload modulation parameters

• Coarse-grained control flow across blocks

• Most blocks process large chunks of statically known sizes

• Each block manages only internal state

WiFi RX Exampleheader

payload

crc

Page 5: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

5

Issues with existing tools State and control flow issues

In traditional dataflow you program “control” info in the data path, e.g. Streamit

Alternatively, pipeline reconfiguration through shared state (e.g. SORA); invisible to programmer and compiler

Hand-written optimizations to achieve performance Manual lookup table generation code Manually tuned widths of block inputs, hardcoded inside block

implementations

Consequences Very hard (if possible at all) to reuse blocks in different pipelines Very hard to make functionally correct changes to existing pipelines Very hard to debug

Page 6: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

6

So is “software” any good?

Common complaints about hardware platforms … Error prone designs, latency-sensitivity issues Code semantics tied to underlying hardware or execution model Hard to develop, maintain the code, big barrier to entry

… become valid against software platforms too, undermining the very advantages of software!

ZIRIA is a DSL for wireless PHY that addresses this problem

Page 7: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

7

ZIRIA A DSL for stream processing

1. Bit processing layer and domain-specific optimizations2. Pipeline composition layer and (a different set of) pipeline optimizations

Programming abstractions well-suited for wireless PHY implementations in software (e.g. 802.11a/g)

Optimizing compiler that generates fast C code

Open source under Apache 2.0, including WiFi RX/TXhttp://research.microsoft.com/projects/Ziria

Page 8: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

8

A stream transformer t

t :: STrans a b

ZIRIA abstractions

t

input (a)

output (b)

c

input (a)

output (b)

return ctrl (v)

A stream computer cc :: SComp a b v

DetectSTS

ChannelEstimation

InvertChannel

Packetstart

Channel info

Decode Header

InvertChannel

Decode Packet

Packetinfo

WiFi RX

Primitives: z ::= map(f) | repeat(z) | if c then z1 else z2 | take | emit(v) | while c do z | ...

Page 9: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

9

Composing in the data or control path

seq { v <- (c1 >>> t1) ; t2 >>> t3 }

Like a pipe: “along the data

path”

In sequence:“along the control

path”

c1

t1

t2

t3

v

Page 10: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

11

Compilation model

Pipeline parallelization across top-level pipes (>>>) Compilation to finite state machines within each core Demand-driven (pull): no queues inserted by

compiler Natural for PHY apps Static optimizations of finite state machines based on properties of

blocks

Page 11: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

12

Optimizing ZIRIA code

1. Partial evaluation, inlining 2. Type-directed fusion of parts of dataflow

graphs3. Reuse memory, avoid redundant copying4. Pipeline vectorization transformation 5. Automatically compile to lookup tables

(LUTs)

Page 12: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

13

Vectorization Transform pipelines to operate over arrays

Good for: locality, dataflow execution overhead Can transform pipeline loops to tight C-level loops Source-to-source and type-preserving => easier to

get right13

repeat { x <- take ; emit f(x) }

repeat { (vect_xa : arr[8] int) <- take; for i in 0,2 { for j in 0,4 { vect_ya[j] := f(vect_xa[i*4 + j]) } emit vect_ya; }}

STrans int intSTrans (arr[8] int) (arr[4]

int)

Page 13: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

14

How to get vectorization right?

repeat { x <- take; emit x

} >>> takes 4;y <- takes 4;print y;

Problem: pipeline reconfiguration

Solution: see paper

0,1,2,3,4,5,6,7,8,9,A,B

4,5,6,7

x <- takes 8emits x[0,4]emits x[4,4] ??8,9,A,

B

WRONG!!!!

Page 14: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

15

Throughput (WiFi RX)

sample = a complex number = 2*16bit = 4 bytes

WiFi

Page 15: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

16

Throughput (WiFi TX)

WiFi

Page 16: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

17

Effects of optimizations (WiFi RX)

Page 17: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

18

Latency & real-world performance

Throughput only gives average latency We also evaluate tail latency: see paper for details

Real-world experiments: 98% packet success rate

Page 18: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

19

Conclusions

Ziria: a new DSL and compiler for wireless PHY Allows for concise and reusable block implementations

Expresses state and pipeline reconfiguration Compiler optimizes for specific pipeline placements

Generates code that matches hand-optimized C code

Page 19: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

20

Next steps Make Ziria a standard platform for software radios Support more radio platforms (SORA, USRP, BladeRF) Support different architectures (CPUs, DSPs, SoCs, FPGAs) Provide more implementations of wireless standards (WiFi, LTE,

Zigbee)

Future research More expressive forms of parallelism Programming up the network stack: MAC layer

http://research.microsoft.com/projects/Ziriahttp://github.com/dimitriv/Ziria

Page 20: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

21

Thanks!

http://research.microsoft.com/projects/Ziria

Page 21: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

22

Latency (RX)

Normalized with1/40Mhz = 0.025

Page 22: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

23

Effects of optimizations (802.11a/g TX)

Vectorization alone not great (reason: bit array addressing) but enables LUTs!

Page 23: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

24

Latency (TX)

Methodology• Sample consecutive

read operations• Normalize 1/data-rate• Result: largely satisfying

latency requirements

Page 24: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

25

Pipeline vectorizationProblem statement: Given (c :: SComp a b v), rewrite it to (vectc :: SComp (arr[N] a) (arr[M] b) v)for suitable N,M

Benefits of vectorization Fatter pipelines => lower dataflow graph interpretive overhead Array inputs vs individual elements => more data locality Can transform pipeline loops to tight C-level loops Especially for bit-arrays, enhances effects of LUTs Source-to-source and type-preserving => easier to get rightNB: Manually done in Sora, hurts code reusability

Page 25: Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR), Dimitrios Vytiniotis

26

Vectorization and LUT synergylet comp scrambler() =  var scrmbl_st: arr[7] bit := {'1,'1,'1,'1,'1,'1,'1};   var tmp,y: bit;    repeat {      (x:bit) <- take;      do {        tmp := (scrmbl_st[3] ^ scrmbl_st[0]);        scrmbl_st[0:5] := scrmbl_st[1:6];        scrmbl_st[6] := tmp;        y := x ^ tmp      };

      emit (y)  }

let comp v_scrambler () =  var scrmbl_st: arr[7] bit := {'1,'1,'1,'1,'1,'1,'1};   var tmp,y: bit;

  var vect_ya_26: arr[8] bit;  let auto_map_71(vect_xa_25: arr[8] bit) =    LUT for vect_j_28 in 0, 8 {          vect_ya_26[vect_j_28] := tmp := scrmbl_st[3]^scrmbl_st[0];             scrmbl_st[0:+6] := scrmbl_st[1:+6];             scrmbl_st[6] := tmp;             y := vect_xa_25[0*8+vect_j_28]^tmp;             return y        };        return vect_ya_26  in map auto_map_71

Vectorization

Automatic lookup-table-compilationInput-vars = scrmbl_st, vect_xa_25 = 15 bitsOutput-vars = vect_ya_26, scrmbl_st = 2 bytesIDEA: precompile to LUT of 2^15 * 2 = 64K

RESULT: ~ 2.8Gbps scrambler