scalapipe: a streaming application generatorroger/transfer/saahpc-2012.pdf · scalapipe: a...
TRANSCRIPT
ScalaPipe: A Streaming Application Generator
Joseph G. Wingbermuehle, Roger D. Chamberlain,
Ron K. Cytron
1
This work is supported by the National Science Foundation under grants CNS-09095368 and CNS-0931693.
Streaming
Computation kernels, or blocks connected by explicit communication channels
Stage 1 Stage 2 Stage 3
2
Advantages: – Performance – Reuse – Abstraction
Systems: – Auto-Pipe [Fr06]
– Streams-C [Go00] – StreamIT [Th02]
Example:Solution to Laplace’s Equation
• PDE with several uses, including stationary heat diffusion • Solvable using a Monte-Carlo technique
3
Streaming Implementation
4
Random Walk Print
Parallel Walks
5
Random Split
Walk
Walk
Average Print
Auto-Pipe X
X Description
C Block
X compiler
VHDL Block
Application
6
&
Edge Connections
Laplace Application in X
block top { ! Random rand; Split split; Walk walk1; Walk walk2; Average avg; Print print; !e1: rand -> split; e2: split.y0 -> walk1; e3: split.y1 -> walk2; e4: walk1 -> avg.x0; e5: walk2 -> avg.x1; e6: avg -> print; };
Labels
Block instances
Blocks are implemented
externally in C or an HDL.
7
rand
split
walk1 walk2
avg
e1
e2 e3
e4 e5
e6
Observation 1As the number of Walk blocks increases,the amount of configuration code increases
8
ÊÊ
Ê
Ê
Ê
5 10 15Walk Blocks
20
40
60
80
100
Lines
128 Walk blocks requires 896 lines of X!
e1: rand -> split;!e2: split.y0 -> walk1;!e3: split.y1 -> walk2;!e4: walk1 -> avg.x0;!e5: walk2 -> avg.x1;!e6: avg -> print;
Two Walk blocks:
e1: rand -> split1; e2: split1.y0 -> split2; e3: split1.y1 -> split3; e4: split2.y0 -> walk1; e5: split2.y1 -> walk2; e6: split3.y0 -> walk3; e7: split3.y1 -> walk4; e8: walk1 -> avg1.x0; e9: walk2 -> avg1.x1; e10: walk3 -> avg2.x0; e11: walk4 -> avg2.x1; e12: avg1 -> avg3.x0; e13: avg2 -> avg3.x1; e14: avg3 -> print;
Four Walk blocks:
Our ApproachType-safe generator language
9
val Laplace = new AutoPipeApp { ! val random = Random() val splits = iteratedMap(levels, random, SplitU32) ! val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() } ! val result = iteratedFold(walks, AverageU32) Print(result) }
Same code can generate 1 Walk block or 128 Walk blocks.
Observation 2
Moving blocks to a new device requires reimplementation
10
HDL Implementation
C Implementation Others
Our ApproachA single language for block implementations
11
ScalaPipe Block
HDL Implementation
C Implementation Others
Observation 3Changing the data type requires new block implementations
12
module ShiftRightU32(...); ! input wire[31:0] input_x; output wire[31:0] output_y;!!...! output_y <= input_x >> 1;!... !endmodule
module ShiftRightS64(...); ! input wire[63:0] input_x; output wire[63:0] output_y;!!...! output_y <= input_x >>> 1;!... !endmodule
Our Solution
Polymorphic block implementations
13
class Average(t: AutoPipeType) extends AutoPipeBlock { ! val in0 = input(t) val in1 = input(t) val out = output(t) ! out = (in0 + in1) / 2 }
Same implementation works for integral, fixed point, and floating point types.
Observation 4
The block interface for blocks on the same resource is a bottleneck
14
Block 1 Implementation
Block 2 Implementation
Runtime System
Block Interface Block Interface
Compiler
Our Approach
Single compiler for both the block language and coordination language.
15
Coordination Language Block Language
ScalaPipe
Source code (Scala)
Scala compiler Generator
ScalaPipe Library
Coordination DSL
Block DSL
Application 1 (e.g. 2 Walks)
16
Application 2 (e.g. 8 Walks)
AverageU32 Block
val AverageU32 extends AutoPipeBlock { ! val in0 = input(UNSIGNED32) val in1 = input(UNSIGNED32) val out = output(UNSIGNED32) ! out = (in0 + in1) / 2 }
17
AverageU32 out
in0
in1
Polymorphic Average Block
18
class Average(t: AutoPipeType) extends AutoPipeBlock { ! val in0 = input(t) val in1 = input(t) val out = output(t) ! out = (in0 + in1) / 2 }!!val AverageU32 = new Average(UNSIGNED32)
t can be any of the following:
•Signed or unsigned integer of any width
•Fixed point type
•Floating point type
Language Virtualization [Ch10]
19
class Repeat(v: Int, count: Int) extends AutoPipeBlock { ! val in = input(SIGNED32) val out = output(SIGNED32) val tmp = local(SIGNED32) ! tmp = in if (tmp == v) { // Evaluated at run time for (i <- 1 to count) { // Expanded at compile time out = tmp } } else { out = tmp } }
External AverageU32
val AverageU32 = new AutoPipeBlock { ! val in0 = input(UNSIGNED32) val in1 = input(UNSIGNED32) val out = output(UNSIGNED32) ! external(“HDL”, “AverageU32”) ! // Optional internal implementation }
• Potentially more efficient • External and internal blocks can be mixed
20
Block Code Generation
Internal Block Specification
Abstract Syntax Tree
Control Flow Graph
C
OpenCL C
Verilog
External Block Specification
21
Optimizer
HDL Code Optimizer
• Common subexpression elimination • Dead store elimination • Dead code elimination • Strength reduction • Copy propagation • ASAP scheduling
22
Coordination DSLDescribes the topology and resource mapping
23
val Laplace = new AutoPipeApp { ! val random = Random() val splits = iteratedMap(levels, random, SplitU32) ! val walks = Array.tabulate(1 << levels) { x => Walk(splits(x))() } ! val result = iteratedFold(walks, AverageU32) Print(result) }
Generating Pipelines
24
def pipeline(s: Stream,! b: AutoPipeBlock,! n: Int): Stream = {! if (n > 0) {! pipeline(b(s), b, n - 1)! } else {! s! }!}
Inc Inc Inc Inc
val result = pipeline(source,! Inc, 4)
block pipeline {!! input UNSIGNED32 source;! output UNSIGNED32 result;!! Inc inc1;! Inc inc2;! Inc inc3;! Inc inc4;!! source -> inc1;! inc1 -> inc2;! inc2 -> inc3;! inc3 -> inc4;! inc4 -> result;!};
ScalaPipe:X language:
CPU 0 CPU 0FPGA 0
Aspect-Oriented Resource Mapping
25
Random Split
Walk
Walk
Average Print
map(ANY_BLOCK -> Print, FPGA2CPU()
map(Random -> ANY_BLOCK, CPU2FPGA())
TimeTrial [La11]
measure(ANY_BLOCK -> Walk, ‘backpressure)
10 20 30 40 50 60Frame
20
40
60
80
100% Backpressure
26
Random Split
Walk
Walk
Average Print
How do we find bottlenecks?
CPU FPGA 16 Walks Custom RNG
50
100
150
200
Time HsL
101s
CPU 0
Illustration of Use
27
RNG Walk Print
CPU FPGA 16 Walks Custom RNG
50
100
150
200
Time HsL
101s
174s
CPU 0
Illustration of Use
28
FPGA 0
RNG Walk Print
83% Backpressure
CPU FPGA 16 Walks Custom RNG
50
100
150
200
Time HsL
101s
174s
41s
CPU 0
Illustration of Use
29
FPGA 0
RNG Split
Walk
0% Backpressure
Walk
CPU 0
Illustration of Use
30
FPGA 0
cRNG Split
Walk
Walk
CPU FPGA 16 Walks Custom RNG
50
100
150
200
Time HsL
101s
174s
41s
12s
The Current State of ScalaPipe
•Code generation for CPUs, FPGAs, and GPUs •FPGA and GPU code generation is suboptimal •No cross-block optimizations
31
The Future of ScalaPipe
•Improved code generation -Consume multiple items at a time -More Verilog and OpenCL C optimizations
•Support for more devices •Library generation •Cross-block optimizations
32
Conclusion
•ScalaPipe is a streaming application generator •The block DSL allows code reuse across data
types and platforms •The coordination DSL allows easy generation of
large and complex topologies •Keeping everything in the same language
exposes optimization opportunities
33
ScalaPipe
Block DSLCoordination DSL
References
H. CHAFI, Z. DEVITO, A. MOORS, T. ROMPF, A. K. SUJEETH, P. HANRAHAN, M. ODERSKY, AND K. OLUKOTUN, “Language virtualization for hetero- geneous parallel computing,” in Proc. of ACM Int’l Conf. on Object Oriented Programming Systems, Languages, and Applications, 2010, pp. 835–847.
J.M. LANCASTER, J. G. WINGBERMUEHLE, AND R. D. CHAMBERLAIN, “Asking for performance: Exploiting developer intuition to guide instrumentation with TimeTrial,” in Proc. of IEEE 13th Int’l Conf. on High Performance Computing and Communcations, Sep. 2011, pp. 321–330.
M. A. FRANKLIN, E. J. TYSON, J. BUCKLEY, P. CROWLEY, AND J. MASCHMEYER, “Auto-Pipe and the X language: A pipeline design tool and description language,” in Proc. of Int’l Parallel and Distributed Processing Symp., Apr. 2006.
M. B. GOKHALE, J. M. STONE, J. ARNOLD, AND M. KALINOWSKI, “Stream- oriented FPGA computing in the Streams-C high level language,” in Proc. of IEEE Symp. on Field-Programmable Custom Computing Machines, Apr. 2000, pp. 49–56.
W. THIES, M. KARCZMAREK, AND S. AMARASINGHE, “StreamIt: A language for streaming applications,” in Proc. of 11th Int’l Conf. on Compiler Construction, 2002, pp. 179–196.
34