gordon stewart (princeton), mahanth gowda (uiuc), geoff mainland (drexel), cristina luengo agullo...
TRANSCRIPT
Gordon Stewart (Princeton), Mahanth Gowda (UIUC), Geoff Mainland (Drexel), Cristina Luengo Agullo (UPC), Bozidar Radunovic (MSR),
Dimitrios Vytiniotis (MSR)
ZIRIAA DSL for wireless systems programming
2
Wireless PHY and software radios
PHY layer Most computationally demanding part of wireless stack Mainly ASIC, but lots of innovation in PHY layer design
Software Defined Radios (SDR): Popular platforms for experimentation and deployments Architecture: PCI/USB cards send digitized samples to CPU for
processing
header
payload
crcPHY
Sora PCIe card
SDR hardware CPU
PCIe/USB
3
Software radios today Tools: GNURadio, SORA
Dataflow-like C++ programming models and libraries Big community of contributors
Industrial applications: WiFi AP – Sora GSM BS (2G) – OpenBTS LTE eNodeB (4G, 5G research) – OpenAirInterface
But to meet standard timing requirements, lots of manual optimizations in low-level
codeExample: WiFi needs to output 1.28Gbps
4
Wireless PHY programming tutorial
DetectSTSChannel
EstimationInvert
Channel
Packetstart
Channel info
Decode Header
InvertChannel
Decode Packet
Packetinfo
Detect transmission
Estimates effects of communication
channel
Transmission detected
Channel characteristics
Payload modulation parameters
• Coarse-grained control flow across blocks
• Most blocks process large chunks of statically known sizes
• Each block manages only internal state
WiFi RX Exampleheader
payload
crc
5
Issues with existing tools State and control flow issues
In traditional dataflow you program “control” info in the data path, e.g. Streamit
Alternatively, pipeline reconfiguration through shared state (e.g. SORA); invisible to programmer and compiler
Hand-written optimizations to achieve performance Manual lookup table generation code Manually tuned widths of block inputs, hardcoded inside block
implementations
Consequences Very hard (if possible at all) to reuse blocks in different pipelines Very hard to make functionally correct changes to existing pipelines Very hard to debug
6
So is “software” any good?
Common complaints about hardware platforms … Error prone designs, latency-sensitivity issues Code semantics tied to underlying hardware or execution model Hard to develop, maintain the code, big barrier to entry
… become valid against software platforms too, undermining the very advantages of software!
ZIRIA is a DSL for wireless PHY that addresses this problem
7
ZIRIA A DSL for stream processing
1. Bit processing layer and domain-specific optimizations2. Pipeline composition layer and (a different set of) pipeline optimizations
Programming abstractions well-suited for wireless PHY implementations in software (e.g. 802.11a/g)
Optimizing compiler that generates fast C code
Open source under Apache 2.0, including WiFi RX/TXhttp://research.microsoft.com/projects/Ziria
8
A stream transformer t
t :: STrans a b
ZIRIA abstractions
t
input (a)
output (b)
c
input (a)
output (b)
return ctrl (v)
A stream computer cc :: SComp a b v
DetectSTS
ChannelEstimation
InvertChannel
Packetstart
Channel info
Decode Header
InvertChannel
Decode Packet
Packetinfo
WiFi RX
Primitives: z ::= map(f) | repeat(z) | if c then z1 else z2 | take | emit(v) | while c do z | ...
9
Composing in the data or control path
seq { v <- (c1 >>> t1) ; t2 >>> t3 }
Like a pipe: “along the data
path”
In sequence:“along the control
path”
c1
t1
t2
t3
v
11
Compilation model
Pipeline parallelization across top-level pipes (>>>) Compilation to finite state machines within each core Demand-driven (pull): no queues inserted by
compiler Natural for PHY apps Static optimizations of finite state machines based on properties of
blocks
12
Optimizing ZIRIA code
1. Partial evaluation, inlining 2. Type-directed fusion of parts of dataflow
graphs3. Reuse memory, avoid redundant copying4. Pipeline vectorization transformation 5. Automatically compile to lookup tables
(LUTs)
13
Vectorization Transform pipelines to operate over arrays
Good for: locality, dataflow execution overhead Can transform pipeline loops to tight C-level loops Source-to-source and type-preserving => easier to
get right13
repeat { x <- take ; emit f(x) }
repeat { (vect_xa : arr[8] int) <- take; for i in 0,2 { for j in 0,4 { vect_ya[j] := f(vect_xa[i*4 + j]) } emit vect_ya; }}
STrans int intSTrans (arr[8] int) (arr[4]
int)
14
How to get vectorization right?
repeat { x <- take; emit x
} >>> takes 4;y <- takes 4;print y;
Problem: pipeline reconfiguration
Solution: see paper
0,1,2,3,4,5,6,7,8,9,A,B
4,5,6,7
x <- takes 8emits x[0,4]emits x[4,4] ??8,9,A,
B
WRONG!!!!
15
Throughput (WiFi RX)
sample = a complex number = 2*16bit = 4 bytes
WiFi
16
Throughput (WiFi TX)
WiFi
17
Effects of optimizations (WiFi RX)
18
Latency & real-world performance
Throughput only gives average latency We also evaluate tail latency: see paper for details
Real-world experiments: 98% packet success rate
19
Conclusions
Ziria: a new DSL and compiler for wireless PHY Allows for concise and reusable block implementations
Expresses state and pipeline reconfiguration Compiler optimizes for specific pipeline placements
Generates code that matches hand-optimized C code
20
Next steps Make Ziria a standard platform for software radios Support more radio platforms (SORA, USRP, BladeRF) Support different architectures (CPUs, DSPs, SoCs, FPGAs) Provide more implementations of wireless standards (WiFi, LTE,
Zigbee)
Future research More expressive forms of parallelism Programming up the network stack: MAC layer
http://research.microsoft.com/projects/Ziriahttp://github.com/dimitriv/Ziria
21
Thanks!
http://research.microsoft.com/projects/Ziria
22
Latency (RX)
Normalized with1/40Mhz = 0.025
23
Effects of optimizations (802.11a/g TX)
Vectorization alone not great (reason: bit array addressing) but enables LUTs!
24
Latency (TX)
Methodology• Sample consecutive
read operations• Normalize 1/data-rate• Result: largely satisfying
latency requirements
25
Pipeline vectorizationProblem statement: Given (c :: SComp a b v), rewrite it to (vectc :: SComp (arr[N] a) (arr[M] b) v)for suitable N,M
Benefits of vectorization Fatter pipelines => lower dataflow graph interpretive overhead Array inputs vs individual elements => more data locality Can transform pipeline loops to tight C-level loops Especially for bit-arrays, enhances effects of LUTs Source-to-source and type-preserving => easier to get rightNB: Manually done in Sora, hurts code reusability
26
Vectorization and LUT synergylet comp scrambler() = var scrmbl_st: arr[7] bit := {'1,'1,'1,'1,'1,'1,'1}; var tmp,y: bit; repeat { (x:bit) <- take; do { tmp := (scrmbl_st[3] ^ scrmbl_st[0]); scrmbl_st[0:5] := scrmbl_st[1:6]; scrmbl_st[6] := tmp; y := x ^ tmp };
emit (y) }
let comp v_scrambler () = var scrmbl_st: arr[7] bit := {'1,'1,'1,'1,'1,'1,'1}; var tmp,y: bit;
var vect_ya_26: arr[8] bit; let auto_map_71(vect_xa_25: arr[8] bit) = LUT for vect_j_28 in 0, 8 { vect_ya_26[vect_j_28] := tmp := scrmbl_st[3]^scrmbl_st[0]; scrmbl_st[0:+6] := scrmbl_st[1:+6]; scrmbl_st[6] := tmp; y := vect_xa_25[0*8+vect_j_28]^tmp; return y }; return vect_ya_26 in map auto_map_71
Vectorization
Automatic lookup-table-compilationInput-vars = scrmbl_st, vect_xa_25 = 15 bitsOutput-vars = vect_ya_26, scrmbl_st = 2 bytesIDEA: precompile to LUT of 2^15 * 2 = 64K
RESULT: ~ 2.8Gbps scrambler