Spatial: A Language and Compiler
for Application Accelerators
David Koeplinger Matthew Feldman Raghu Prabhakar
Yaqi Zhang Stefan Hadjis Ruben Fiszel
Tian Zhao Luigi Nardi Ardavan Pedram
Christos Kozyrakis Kunle Olukotun
PLDI June 21, 2018
Instructions Add Overheads
Instruction-Based
32-bit ADD: ~0.5 pJ
Instruction: 70 pJ (I-cache access: 25 pJ; register file access: 6 pJ; control overheads: 38 pJ)
Mark Horowitz, Computing’s Energy Problem (and what we can do about it), ISSCC 2014
[Figure: CPU block diagram with L1 cache (instructions), L1 cache (data), L2 cache, DRAM, register file, instruction queue, arithmetic/logic, control, and floating point. Legend: control, compute, registers, SRAM.]
mov r8, rcx
add r8, 8
mov r9, rdx
add r9, 8
mov rcx, rax
mov rax, 0
.calc:
mov rbx, [r9]
imul rbx, [r8]
add rax, rbx
add r8, 8
add r9, 8
loop .calc
vectorA ∙ vectorB
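The overhead argument on this slide can be checked with quick arithmetic. The sketch below uses only the two figures quoted above (about 0.5 pJ for the 32-bit add, 70 pJ per instruction). The count of six instructions per element is read off the `.calc` loop body, and charging every instruction the same 70 pJ is a simplifying assumption, not something the slide states.

```python
# Energy figures quoted on the slide (Horowitz, ISSCC 2014).
ADD_PJ = 0.5           # useful work: one 32-bit ADD
INSTRUCTION_PJ = 70.0  # total energy to execute one instruction

# Share of an ADD instruction's energy that goes to the arithmetic itself.
useful_fraction = ADD_PJ / INSTRUCTION_PJ
overhead_factor = INSTRUCTION_PJ / ADD_PJ

# Assumption: the .calc loop body issues 6 instructions per element
# (mov, imul, add, add, add, loop), each charged the same 70 pJ.
INSTRUCTIONS_PER_ELEMENT = 6
loop_pj_per_element = INSTRUCTIONS_PER_ELEMENT * INSTRUCTION_PJ

print(f"useful fraction of one ADD instruction: {useful_fraction:.2%}")
print(f"instruction energy / ADD energy: {overhead_factor:.0f}x")
print(f"loop energy per element (assumed): {loop_pj_per_element:.0f} pJ")
```

Under these assumptions, well under 1% of the energy of an ADD instruction is spent on the addition itself.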
A Dark Tale: The CPU Power Wall
A More Efficient Way
Instruction-Based: CPU* (*not to scale)
[Figure: the CPU block diagram from the previous slide, with the dot-product instructions from the .calc loop scattered across it.]
Configuration-Based: Custom Hardware* (*also not to scale)
[Figure: custom hardware for vectorA ∙ vectorB, with labels DRAM, vectorA, vectorB, ×, +, acc, ctr, ctrl. Legend: control, compute, registers, SRAM.]
The Future Is (Probably) Reconfigurable
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 10,000) versus programmability (not programmable, less programmable, more programmable). Dedicated: ASIC. Reconfigurable: CGRA, FPGA. Instruction-based: GPU, CPU.]
25x perf/W vs. CPU, XPU (HotChips ’17)
287 MOps/mW, Brainwave (ISCA ’18)
Key Question
How can we more productively target reconfigurable architectures like FPGAs?
Performance: fast and efficient designs
Productivity: fast and efficient programmers
Portability: target-generic solutions
Language Taxonomy
[Figure: languages placed by domain specificity (domain-specific, multi-domain, general purpose) and abstraction level. For instruction-based architectures (CPUs) and for reconfigurable architectures (FPGAs), abstraction runs from higher level (“What?”) to lower level (“How?”). Labeled points: Halide, x86, MyHDL, VHDL, Verilog, Netlist.]
Abstracting Hardware Design
[Figure: the same taxonomy axes (domain specificity versus abstraction) for instruction-based architectures (CPUs) and reconfigurable architectures (FPGAs). Labeled points: Netlist, HDLs, and “+hardware pragmas”.]
HDLs
Hardware Description Languages (HDLs), e.g. Verilog, VHDL, Chisel, Bluespec
Performance, Productivity, Portability
✓ Arbitrary RTL
✘ No high-level abstractions
✘ Significant target-specific code
C + Pragmas
Existing High Level Synthesis (C + Pragmas), e.g. Vivado HLS, SDAccel, Altera OpenCL
Performance, Productivity, Portability
✓ Nested loops
✘ Difficult to optimize
✘ Ad-hoc mix of software/hardware
✓ Portable for single vendor
✘ No memory hierarchy
✘ No arbitrary pipelining
Criteria for Improved HLS
Requirements (compared against C + Pragmas):
◼ Express control as nested loops: enables analysis of access patterns
◼ Represent memory hierarchy explicitly: aids on-chip memory optimization, specialization
◼ Specialize memory transfers: enables customized memory controllers based on access patterns
◼ Capture design parameters: enables automatic design tuning in compiler
◼ Support arbitrarily nested pipelining: exploits nested parallelism
Design Space Parameters Example
vectorA ∙ vectorB
Small and simple, but slow!
[Figure: FPGA design with vectorA and vectorB in DRAM, on-chip tiles tileA and tileB, one multiplier (×), one adder (+), an accumulator (acc), and a counter (ctr). Legend: control, compute, registers, SRAM.]
Important Parameters: Buffer Sizes
vectorA ∙ vectorB
◼ Increases length of DRAM accesses (Runtime)
◼ Increases exploited locality (Runtime)
◼ Increases local memory sizes (Area)
[Figure: the same FPGA design, with tileA and tileB as the buffers being sized.]
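The buffer-size trade-off can be made concrete with a small model. This is only a sketch: it assumes the vector length is a multiple of the tile size, that each tile is fetched with one DRAM request per vector, and that elements are 4 bytes. None of these numbers come from the slides.

```python
def tile_tradeoff(n, b, bytes_per_elem=4):
    """Return (dram_requests, sram_bytes) for a tiled dot product of length n.

    Assumes one DRAM request per tile per vector and two on-chip tiles
    (tileA and tileB) of b elements each.
    """
    if n % b != 0:
        raise ValueError("this sketch assumes n is a multiple of b")
    tiles = n // b
    dram_requests = 2 * tiles            # one load for tileA, one for tileB
    sram_bytes = 2 * b * bytes_per_elem  # storage for tileA plus tileB
    return dram_requests, sram_bytes


for b in (64, 256, 1024):
    requests, sram = tile_tradeoff(4096, b)
    print(f"B={b:5d}: {requests:4d} DRAM requests, {sram:5d} bytes of SRAM")
```

Larger tiles mean fewer, longer DRAM accesses (better runtime) at the cost of more on-chip memory (more area), which is the tension the bullets describe.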
Important Parameters: Pipelining
vectorA ∙ vectorB
◼ Overlaps memory and compute (Runtime)
◼ Increases local memory sizes (Area)
◼ Adds synchronization logic (Area)
[Figure: Stage 1 loads vectorA and vectorB from DRAM into double-buffered tiles tileA (0)/(1) and tileB (0)/(1); Stage 2 multiplies (×) and adds (+) into acc. Legend: control, compute, registers, SRAM.]
Metapipelining requires buffering
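A rough cycle model shows why overlapping the load stage with the compute stage pays off. The formulas below are the usual two-stage pipeline estimate, with per-tile load and compute times chosen purely for illustration; they are not measurements from the talk.

```python
def sequential_cycles(tiles, load, compute):
    """Load a tile, then compute on it, with no overlap between tiles."""
    return tiles * (load + compute)


def metapipelined_cycles(tiles, load, compute):
    """Two-stage pipeline over tiles with double buffering.

    While stage 2 computes on tile k, stage 1 loads tile k+1 into the
    other buffer, so the steady-state cost per tile is the slower stage.
    """
    if tiles == 0:
        return 0
    return load + (tiles - 1) * max(load, compute) + compute


tiles, load, compute = 8, 100, 60  # illustrative values only
print("sequential:   ", sequential_cycles(tiles, load, compute))
print("metapipelined:", metapipelined_cycles(tiles, load, compute))
```

The overlap is only possible because each tile has two buffers, one being filled while the other is being read, which is the extra area the slide lists.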
Important Parameters: Parallelization
vectorA ∙ vectorB
◼ Improves element throughput (Runtime)
◼ Duplicates compute resources (Area)
[Figure: FPGA design with several counters (ctr) and multipliers (×) feeding adders (+) into acc, reading from tileA and tileB. Legend: control, compute, registers, SRAM.]
Important Parameters: Memory Banking
vectorA ∙ vectorB
◼ Improves memory bandwidth (Runtime)
◼ May duplicate memory resources (Area)
[Figure: the parallelized design, with tileA and tileB implemented as banked SRAM. Legend: control, compute, registers, SRAM.]
Parallelization requires banking
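To see why parallel lanes need banked memory, consider a simple cyclic banking scheme. This is a generic illustration (address modulo the number of banks), not necessarily the scheme the Spatial compiler picks; the helper names are made up for this sketch.

```python
def bank_of(addr, num_banks):
    """Cyclic banking: consecutive addresses go to consecutive banks."""
    return addr % num_banks


def offset_of(addr, num_banks):
    """Position of the address inside its bank."""
    return addr // num_banks


def conflict_free(addrs, num_banks):
    """True if every address in the same cycle hits a different bank."""
    banks = {bank_of(a, num_banks) for a in addrs}
    return len(banks) == len(addrs)


par = 4
# Four lanes reading consecutive elements in the same cycle.
unit_stride = [8 + lane for lane in range(par)]
# Four lanes reading every other element in the same cycle.
stride_two = [2 * lane for lane in range(par)]

print(unit_stride, conflict_free(unit_stride, num_banks=par))
print(stride_two, conflict_free(stride_two, num_banks=par))
```

With a single-ported memory, the four unit-stride reads would take four cycles; spread across four banks they can all be served at once. The stride-two case shows that the banking scheme has to match the access pattern, which is why the compiler analyzes access patterns before choosing one.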
Rethinking HLS
Improved HLS (compared with HDLs and C + Pragmas)
Performance, Productivity, Portability
✓ Nested loops
✓ Automatic memory banking/buffering
✓ Implicit design parameters (unrolling, banking, etc.)
✓ Memory hierarchy
✓ Arbitrary pipelining
✓ Target-generic source across reconfigurable architectures
✓ Automated design tuning
Abstracting Hardware Design
[Figure: the same taxonomy axes (domain specificity versus abstraction) for instruction-based architectures (CPUs) and reconfigurable architectures (FPGAs). Labeled points: Netlist, HDLs, “+pragmas”, and Spatial.]
Spatial: Memory Hierarchy
DDR DRAM (GB):
  val image = DRAM[UInt8](H,W)
On-Chip SRAM (MB):
  val buffer = SRAM[UInt8](C)
  val fifo = FIFO[Float](D)
  val lbuf = LineBuffer[Int](R,C)
Local Regs (KB):
  val accum = Reg[Double]
  val pixels = RegFile[UInt8](R,C)
Transfers between levels:
  buffer load image(i, j::j+C) // dense
  buffer gather image(a) // sparse
Spatial: Control And Design Parameters
Explicit size parameters for loop step size and buffer sizes (informs compiler it can tune this value):
  val B = 64 (64 → 1024)
  val buffer = SRAM[Float](B)
  Foreach(N by B){i =>
    …
  }
Implicit/Explicit parallelization factors (optional, but can be explicitly declared):
  val P = 16 (1 → 32)
  Reduce(0)(N by 1 par P){i =>
    data(i)
  }{(a,b) => a + b}
Implicit/Explicit control schemes (also optional, but can be used to override compiler):
  Stream.Foreach(0 until N){i =>
    …
  }
Implicit memory banking and buffering schemes for parallelized access:
  Foreach(64 par 16){i =>
    buffer(i) // Parallel read
  }
Dot Product in Spatial
val output  = ArgOut[Float]
val vectorA = DRAM[Float](N)
val vectorB = DRAM[Float](N)
Accel {
  Reduce(output)(N by B){ i =>
    val tileA = SRAM[Float](B)
    val tileB = SRAM[Float](B)
    val acc   = Reg[Float]
    tileA load vectorA(i :: i+B)
    tileB load vectorB(i :: i+B)
    Reduce(acc)(B by 1){ j =>
      tileA(j) * tileB(j)
    }{(a, b) => a + b}
  }{(a, b) => a + b}
}
Off-chip memory declarations
[Figure: vectorA and vectorB in DRAM; output on the FPGA.]
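For readers unfamiliar with the two nested `Reduce` patterns, here is a plain software model of what the program computes. It mirrors the loop structure only (outer loop over tiles of size B, inner loop over elements); it says nothing about how the hardware is scheduled, and the function name is made up for this sketch.

```python
def dot_tiled(vector_a, vector_b, tile_size):
    """Software model of the tiled dot product: an outer reduction over
    tiles and an inner reduction over the elements of each tile."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    output = 0.0
    for i in range(0, len(vector_a), tile_size):   # Reduce(output)(N by B)
        tile_a = vector_a[i:i + tile_size]         # tileA load vectorA(i :: i+B)
        tile_b = vector_b[i:i + tile_size]         # tileB load vectorB(i :: i+B)
        acc = 0.0
        for j in range(len(tile_a)):               # Reduce(acc)(B by 1)
            acc += tile_a[j] * tile_b[j]
        output += acc                              # outer reduce function
    return output


a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(dot_tiled(a, b, tile_size=2))
print(sum(x * y for x, y in zip(a, b)))
```

Both prints give the same value, since tiling changes the order of work but not the result for these inputs.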
Dot Product in Spatial (the same program, annotated step by step):
◼ Explicit work division in IR (vectorA and vectorB in DRAM, output on the FPGA)
◼ Tiled reduction (outer): Outer Reduce into output
◼ On-chip memory declarations: tileA (0)/(1), tileB (0)/(1), acc
◼ DRAM → SRAM transfers in Stage 1 (also have store, scatter, and gather)
◼ Tiled reduction (pipelined): Stage 2 multiplies (×) and adds (+) into acc
◼ Outer reduce function: Stage 3 adds (+) into output
Design parameters exposed by this program:
◼ Tile size (B)
◼ Banking strategy
◼ Metapipelining toggle
◼ Parallelism factor #1
◼ Parallelism factor #2
◼ Parallelism factor #3
The Spatial Compiler
Inputs: Spatial program, design parameters
Passes shown over the Spatial IR:
◼ Control Scheduling
◼ Mem. Banking/Buffering
◼ Access Pattern Analysis
◼ Control Inference
◼ Pipeline Unrolling
◼ Pipeline Retiming
◼ [Optional] Design Tuning
◼ Area/Runtime Analysis
◼ Host Resource Allocation
◼ Control Signal Inference
◼ Chisel Code Generation
Legend: intermediate representation, design parameters, IR transformation, IR analysis, code generation
Control Scheduling
◼ Creates loop pipeline schedules
◼ Detects data dependencies across loop intervals
◼ Calculates initiation interval of pipelines
◼ Sets maximum depth of buffers
◼ Supports arbitrarily nested pipelines (commercial HLS tools don’t support this)
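The initiation interval (II) mentioned above is the number of cycles between the starts of consecutive loop iterations. The sketch below uses the standard textbook estimates, not Spatial's scheduler: a loop-carried dependence of a given latency and iteration distance bounds the II from below, and a pipelined loop takes roughly `latency + (iterations - 1) * II` cycles. The function names and sample numbers are illustrative.

```python
import math


def min_initiation_interval(dep_latency, dep_distance):
    """Lower bound on II from one loop-carried dependence.

    A value produced dep_latency cycles after an iteration starts and
    consumed dep_distance iterations later forces
    II >= ceil(dep_latency / dep_distance).
    """
    return max(1, math.ceil(dep_latency / dep_distance))


def pipelined_loop_cycles(iterations, latency, ii):
    """Cycles for a pipelined loop: fill once, then one iteration per II."""
    if iterations == 0:
        return 0
    return latency + (iterations - 1) * ii


# No loop-carried dependence: a new iteration can start every cycle.
print(pipelined_loop_cycles(512, latency=10, ii=1))

# Accumulation through a 4-cycle adder, consumed by the next iteration.
ii = min_initiation_interval(dep_latency=4, dep_distance=1)
print(ii, pipelined_loop_cycles(512, latency=10, ii=ii))
```

The second case shows why detecting dependencies matters: a single slow accumulation can multiply the loop's runtime by the II.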
Local Memory Analysis
◼ Insight: determine banking strategy in a single loop nest using the polyhedral model [Wang, Li, Cong FPGA ’14]
◼ Spatial’s contribution: find the (near) optimal banking/buffering strategy across all loop nests
◼ Algorithm in a nutshell:
1. Bank each reader as a separate coherent copy (accounting for reaching writes)
2. Greedily merge copies if merging is legal and cheaper
[Figure: parallelized dot-product design with tileA and tileB feeding multipliers (×) and adders (+) into acc, driven by counters (ctr).]
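The two-step algorithm can be illustrated with a toy greedy merge. Everything specific here is an assumption made for the sketch: the area cost model (storage plus a crossbar term that grows with the square of the bank count), the legality rule (a cap on banks per copy), and the idea that a merged copy needs the sum of its readers' banks. The real compiler derives banking from access patterns; this only shows the shape of "start with one copy per reader, merge while legal and cheaper".

```python
from itertools import combinations


def cost(banks, words):
    """Toy area model (assumption): storage plus a bank crossbar."""
    return words + banks * banks


def greedy_merge(reader_banks, words, max_banks):
    """Step 1: one private copy per reader. Step 2: greedily merge pairs."""
    copies = [({name}, banks) for name, banks in reader_banks.items()]
    merged = True
    while merged:
        merged = False
        for (i, (readers_a, banks_a)), (j, (readers_b, banks_b)) in combinations(
            enumerate(copies), 2
        ):
            banks = banks_a + banks_b  # assumption: serve both readers at once
            legal = banks <= max_banks  # hypothetical legality rule
            cheaper = cost(banks, words) < cost(banks_a, words) + cost(banks_b, words)
            if legal and cheaper:
                copies = [c for n, c in enumerate(copies) if n not in (i, j)]
                copies.append((readers_a | readers_b, banks))
                merged = True
                break  # restart the scan over the updated list of copies
    return copies


result = greedy_merge({"r1": 2, "r2": 2, "r3": 16}, words=64, max_banks=8)
for readers, banks in result:
    print(sorted(readers), banks)
```

In this example the two small readers end up sharing one copy, while the heavily parallel reader keeps its own because merging it would exceed the assumed bank limit.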
Design Tuning
Design tuning takes the Spatial IR and design parameters, uses area/runtime analysis, and produces modified parameters.
Original tuning methods:
◼ Pre-prune space using simple heuristics
◼ Randomly sample ~100,000 design points
◼ Model area/runtime of each point
Proposed tuning method:
◼ Reinforcement learning: HyperMapper (more details in paper)
◼ Fast: No slow transformers in loop
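The original tuning flow (prune, sample, model) can be sketched in a few lines. The area and runtime models below are toy formulas invented for this example, and the pruning rule (tile size must divide the vector length) is likewise an assumption; only the parameter ranges (tile size 64 to 1024, parallelization 1 to 32) echo the earlier slide. HyperMapper is not modeled here.

```python
import random

DRAM_LATENCY = 100  # assumed fixed cost per tile load, in cycles


def runtime_model(n, b, p):
    """Toy runtime estimate: per tile, one DRAM latency plus b/p compute cycles."""
    return (n // b) * (DRAM_LATENCY + b // p)


def area_model(b, p):
    """Toy area estimate: buffers scale with b, compute lanes with p."""
    return 2 * b + 50 * p


def tune(n, area_budget, samples, seed=0):
    """Randomly sample the pruned space and keep the fastest point that fits."""
    rng = random.Random(seed)
    # Pre-prune with a simple heuristic: tile size must divide n.
    tile_sizes = [b for b in range(64, 1025, 64) if n % b == 0]
    par_factors = [1, 2, 4, 8, 16, 32]
    best = None
    for _ in range(samples):
        b, p = rng.choice(tile_sizes), rng.choice(par_factors)
        if area_model(b, p) > area_budget:
            continue  # does not fit on the target
        point = (runtime_model(n, b, p), b, p)
        if best is None or point < best:
            best = point
    return best


print(tune(n=4096, area_budget=1500, samples=1000))
```

Because every point is scored by a model rather than by synthesizing hardware, many thousands of points can be evaluated quickly, which is what makes sampling on the order of 100,000 points practical.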
The Spatial Compiler: The Rest
Code generation
◼ Synthesizable Chisel
◼ C++ code for host CPU
Evaluation: Performance
◼ FPGA:
  ◼ Amazon EC2 F1 Instance: Xilinx VU9P FPGA
  ◼ Fixed clock rate of 150 MHz
◼ Applications
  ◼ SDAccel: Hand optimized, tuned implementations
  ◼ Spatial: Hand written, automatically tuned implementations
◼ Execution time = FPGA execution time
Performance (Spatial vs. SDAccel)
Average 2.9x faster hardware than SDAccel
[Figure: bar chart of speedup over SDAccel (axis 0 to 15) for seven benchmarks. Bar labels as extracted: 8.5x, 1.4x, 1.6x, 1.4x, 3.5x, 14.1x, 1.3x.]
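The slide does not say which kind of average the 2.9x figure is. A quick check with the seven bar labels suggests a geometric mean, which is the usual choice for averaging speedup ratios; that reading is an inference from the numbers, not a statement from the talk.

```python
import math

# Per-benchmark speedups over SDAccel, as labeled on the chart.
speedups = [8.5, 1.4, 1.6, 1.4, 3.5, 14.1, 1.3]

arithmetic_mean = sum(speedups) / len(speedups)
geometric_mean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))

print(f"arithmetic mean: {arithmetic_mean:.2f}x")
print(f"geometric mean:  {geometric_mean:.2f}x")
```

The geometric mean rounds to the quoted 2.9x, while the arithmetic mean is noticeably higher because of the two large outliers.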
Productivity: Lines of Code
Average 42% shorter programs versus SDAccel
[Figure: bar chart of lines of code (axis 0 to 250), SDAccel versus Spatial, for seven benchmarks. Reduction labels as extracted: 12%, 60%, 47%, 44%, 31%, 66%, 35%.]
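The 42% figure is consistent with a plain arithmetic mean of the seven per-benchmark reductions shown on the chart, as this short check illustrates (the choice of mean is inferred from the numbers, not stated on the slide).

```python
# Per-benchmark code size reductions versus SDAccel, in percent.
reductions = [12, 60, 47, 44, 31, 66, 35]

mean_reduction = sum(reductions) / len(reductions)
print(f"mean reduction: {mean_reduction:.1f}%")
```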
Evaluation: Portability
◼ FPGA 1
  ◼ Amazon EC2 F1 Instance: Xilinx VU9P FPGA
  ◼ 19.2 GB/s DRAM bandwidth (single channel)
◼ FPGA 2
  ◼ Xilinx Zynq ZC706
  ◼ 4.3 GB/s DRAM bandwidth
◼ Applications
  ◼ Spatial: Hand written, automatically tuned implementations
  ◼ Fixed clock rate of 150 MHz
Portability: VU9P vs. Zynq ZC706
Identical Spatial source, multiple targets
Resource ratios (VU9P / ZC706):
◼ DRAM Bandwidth: 4.5x
◼ LUTs (GP compute): 47.3x
◼ DSPs (integer FMA): 7.6x
◼ On-chip memory*: 4.0x
* No URAM used on VU9P
[Figure: bar chart of speedup (axis 0 to 25) for seven benchmarks, built up over three slides. Bar labels as extracted:]
◼ Porting (speedup (VU9P / Zynq) only from moving to larger FPGA): 2.5x, 1.2x, 2.5x, 2.5x, 1.3x, 2.5x, 4.6x
◼ Tuning (speedup only from tuning parameters for larger FPGA): 2.6x, 2.1x, 9.4x, 2.7x, 1.7x, 1.0x, 1.1x
◼ Product = Porting × Tuning: 6.5x, 2.5x, 23.4x, 6.8x, 2.2x, 2.5x, 5.0x
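The relation Product = Porting × Tuning can be checked directly against the bar labels on the three versions of this chart. The sketch assumes the labels appear in the same benchmark order on each slide, which the close agreement below supports.

```python
# Bar labels from the three versions of the portability chart.
porting = [2.5, 1.2, 2.5, 2.5, 1.3, 2.5, 4.6]
tuning = [2.6, 2.1, 9.4, 2.7, 1.7, 1.0, 1.1]
product = [6.5, 2.5, 23.4, 6.8, 2.2, 2.5, 5.0]

for port, tune_gain, shown in zip(porting, tuning, product):
    computed = port * tune_gain
    error = abs(computed - shown) / shown
    print(f"{port} x {tune_gain} = {computed:.2f} (slide: {shown}, diff {error:.1%})")
```

The small differences are what one would expect from multiplying values that were already rounded to one decimal place.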
Portability: Plasticine CGRA
Identical Spatial source, multiple targets
Even reconfigurable hardware that isn’t an FPGA!

| Benchmark | DRAM Bandwidth: Load (%) | DRAM Bandwidth: Store (%) | Resource Utilization: PCU (%) | Resource Utilization: PMU (%) | Resource Utilization: AG (%) | Speedup vs. VU9P |
|---|---|---|---|---|---|---|
| BlackScholes | 77.4 | 12.9 | 73.4 | 10.9 | 20.6 | 1.6 |
| GDA | 24.0 | 0.2 | 95.3 | 73.4 | 38.2 | 9.8 |
| GEMM | 20.5 | 2.1 | 96.8 | 64.1 | 11.7 | 55.0 |
| K-Means | 8.0 | 0.4 | 89.1 | 57.8 | 17.6 | 6.3 |
| TPC-H Q6 | 97.2 | 0.0 | 29.7 | 37.5 | 70.6 | 1.6 |

Prabhakar et al. Plasticine: A Reconfigurable Architecture For Parallel Patterns (ISCA ’17)
Conclusion
◼ Reconfigurable architectures are becoming key for performance / energy efficiency
◼ Current programming solutions for reconfigurables are still inadequate
◼ Need to rethink outside of the C box for high level synthesis:
  ◼ Memory hierarchy for optimization
  ◼ Design parameters for tuning
  ◼ Arbitrarily nestable pipelines
◼ Spatial prototypes these language and compiler criteria:
  ◼ Average speedup of 2.9x versus SDAccel on VU9P
  ◼ Average 42% less code than SDAccel
  ◼ Achieves transparent portability through internal support for automated design tuning (HyperMapper)
Spatial is open source: spatial.stanford.edu
Performance, Productivity, Portability
The Team
Raghu Prabhakar, Yaqi Zhang, David Koeplinger, Matt Feldman, Tian Zhao, Ardavan Pedram, Christos Kozyrakis, Kunle Olukotun, Stefan Hadjis, Ruben Fiszel, Luigi Nardi
Backup Slides
Custom ASICs
Good for widely used, fixed specifications (like compression)
Expensive with long design turnaround for developing fields like ML
[Figure: ML arXiv papers per year over time, 2009 to 2017 (axis 0 to 20,000), with the relative number of papers per year since 2009 (axis 0 to 20). Sources: Jeff Dean, Scaled ML 2018; Kunle Olukotun, ISCA 2018.]
C + Pragmas Example
Add 512 integers originating from accelerator DRAM
void sum(int* mem) {
  mem[512] = 0;
  for(int i=0; i < 512; i++) {
    mem[512] += mem[i];
  }
}
Commercial HLS Tool
Runtime: 27,236 clock cycles (100x too long!)
![Page 49: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/49.jpg)
C + Pragmas Example
Add 512 integers originating from external DRAM
#define CHUNKSIZE (sizeof(MPort)/sizeof(int))
#define LOOPCOUNT (512/CHUNKSIZE)
void sum(MPort* mem) {
  MPort buff[LOOPCOUNT];
  memcpy(buff, mem, LOOPCOUNT);
  int sum = 0;
  for(int i=1; i<LOOPCOUNT; i++) {
    #pragma PIPELINE
    for(int j=0; j<CHUNKSIZE; j++) {
      #pragma UNROLL
      sum += (int)(buff[i]>>j*sizeof(int)*8);
    }
  }
  mem[512] = sum;
}
Width of DRAM controller interface
Burst Access
Use local variable
Special compiler directives
Loop Restructuring
Bit shifting to extract individual elements
Special compiler directives
Runtime: 302 clock cycles
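The burst access and bit-shifting callouts above can be illustrated with a small software model. This is only a sketch in Python, not HLS code: the 512-bit port width, the helper names `pack` and `burst_sum`, and the restriction to unsigned 32-bit values are assumptions made for illustration. The outer loop in this sketch starts at word 0 so that all 512 elements are covered.

```python
# Software model of the burst-and-unpack pattern (not HLS code).
# Assumption: a 512-bit memory port, so each wide word packs 16 32-bit integers.
PORT_BITS = 512                      # hypothetical DRAM interface width
INT_BITS = 32
CHUNKSIZE = PORT_BITS // INT_BITS    # integers per port word
N = 512
LOOPCOUNT = N // CHUNKSIZE           # port words needed for N integers
MASK = (1 << INT_BITS) - 1


def pack(values):
    """Pack unsigned 32-bit ints into wide words, lowest element in the low bits."""
    words = []
    for base in range(0, len(values), CHUNKSIZE):
        word = 0
        for j, v in enumerate(values[base:base + CHUNKSIZE]):
            word |= (v & MASK) << (j * INT_BITS)
        words.append(word)
    return words


def burst_sum(words):
    """Sum every element by shifting and masking each wide word."""
    total = 0
    for i in range(LOOPCOUNT):       # one iteration per port word
        for j in range(CHUNKSIZE):   # this inner loop is what gets unrolled in hardware
            total += (words[i] >> (j * INT_BITS)) & MASK
    return total


data = list(range(N))
assert burst_sum(pack(data)) == sum(data)
```

With these assumed widths the 512 integers arrive in 32 wide words, which is why the restructured loop nest iterates over words first and elements second.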
![Page 50: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/50.jpg)
Hardware Design Considerations
1. Finite physical compute and memory resources
2. Requires aggressive pipelining for performance
◼ Maximize useful execution time of compute resources
3. Disjoint memory space
◼ No hardware managed memory hierarchy
4. Huge design parameter spaces
◼ Parameters are interdependent, change runtime by orders of magnitude
5. Others… pipeline timing, clocking, etc.
![Page 51: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/51.jpg)
Local Memory Analysis Example
Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i =>
    a(i) = …
  }
  Reduce(sum)(D par 2){j =>
    a(b(j))
  }{(a,b) => a + b}
  Foreach(D par 2){k =>
    c(k) = a(k) * sum
  }
}
Step 1: For each read:
Find the banking and buffering for that read
and all writes that may be visible to that read
Diagram: 1 “instance” of a, with write ports 2i and 2i+1 (Foreach{i =>) and read ports b(2j), b(2j+1) (Reduce{j =>) and 2k, 2k+1 (Foreach{k =>)
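To make the banking question in Step 1 concrete, the following sketch checks whether two parallel lanes ever touch the same bank in the same iteration. It assumes simple cyclic banking (address modulo number of banks) and made-up contents for `b`; the helper names are hypothetical, and this is not the search the Spatial compiler performs.

```python
def bank_of(addr, num_banks):
    """Cyclic banking: consecutive addresses map to consecutive banks."""
    return addr % num_banks


def conflict_free(lane_addrs, num_banks, iterations):
    """True if the parallel lanes never hit the same bank in the same iteration.

    lane_addrs holds one function per lane, mapping the iteration index to an address.
    """
    for it in range(iterations):
        banks = [bank_of(f(it), num_banks) for f in lane_addrs]
        if len(set(banks)) != len(banks):
            return False
    return True


D = 16

# Affine accesses from Foreach(D par 2): the two lanes touch 2i and 2i+1.
affine = [lambda i: 2 * i, lambda i: 2 * i + 1]
print(conflict_free(affine, num_banks=2, iterations=D // 2))    # True

# Data-dependent accesses a(b(j)): the addresses are the contents of b.
b = [3, 5, 0, 1, 2, 4, 7, 6, 8, 10, 9, 11, 12, 14, 13, 15]      # made-up contents
indirect = [lambda j: b[2 * j], lambda j: b[2 * j + 1]]
print(conflict_free(indirect, num_banks=2, iterations=D // 2))  # False
```

The affine lanes always land in different banks, so two banks suffice. For the data-dependent lanes no fixed banking scheme can guarantee this, and a common fallback is to give each lane its own copy of the memory.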
![Page 52: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/52.jpg)
Local Memory Analysis Example (Cont.)
Diagram: write ports 2i, 2i+1 (Foreach{i =>) and read ports b(2j), b(2j+1) (Reduce{j =>); memory a drawn three times
![Page 53: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/53.jpg)
Local Memory Analysis Example (Cont.)
Diagram: write ports 2i, 2i+1 (Foreach{i =>) and read ports b(2j), b(2j+1) (Reduce{j =>); memory a drawn twice
![Page 54: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/54.jpg)
Local Memory Analysis Example (Cont.)
Diagram: write ports 2i, 2i+1 (Foreach{i =>) and read ports b(2j), b(2j+1) (Reduce{j =>); memory a drawn four times
Metapipeline distance = 1
(~4-8x memory)
![Page 55: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/55.jpg)
Local Memory Analysis Example (Cont.)
Diagram: write ports 2i, 2i+1 (Foreach{i =>) and read ports 2k, 2k+1 (Foreach{k =>); memory a drawn three times
Metapipeline distance = 2
(~3-6x memory)
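One way to read the copy counts on these slides is as duplicates multiplied by buffer depth. The sketch below assumes N-buffering with depth equal to metapipeline distance plus one, one duplicate per lane for the data-dependent reads, and a single banked copy for the affine reads. These assumptions are inferred from the numbers on the slides rather than stated on them, and the function name `copies` is hypothetical.

```python
def copies(read_duplicates, metapipeline_distance):
    """Physical copies of a memory needed for one group of readers.

    Assumes N-buffering with depth = distance + 1 (distance 1 gives a double buffer).
    """
    depth = metapipeline_distance + 1
    return read_duplicates * depth


# Reduce reads a(b(j)) in 2 lanes with data-dependent addresses:
# assume one duplicate per lane.
reduce_copies = copies(read_duplicates=2, metapipeline_distance=1)

# The last Foreach reads 2k and 2k+1, which two banks can serve from one copy.
foreach_copies = copies(read_duplicates=1, metapipeline_distance=2)

print(reduce_copies, foreach_copies, reduce_copies + foreach_copies)  # 4 3 7
```

Under these assumptions the two reader groups need 4 and 3 copies, so 7 in total if nothing is shared between them.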
![Page 56: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/56.jpg)
Local Memory Analysis Example (Cont.)
Diagram: write ports 2i, 2i+1 (Foreach{i =>); read ports 2k, 2k+1 (Foreach{k =>) and b(2j), b(2j+1) (Reduce{j =>); memory a drawn seven times
Step 2: Greedily combine (merge) instances
- Don’t combine if there are port conflicts
- Don’t combine if the cost of merging is greater than sum of unmerged
**Recompute banking for merged instances!
(~7-14x memory)
![Page 57: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/57.jpg)
Local Memory Analysis
Foreach(N by 1){ r =>
  val a = SRAM[Float](D)
  val b = SRAM[Float](D)
  val c = SRAM[Float](D)
  Foreach(D par 2){i =>
    a(i) = …
  }
  Reduce(sum)(D par 2){j =>
    a(b(j))
  }{(a,b) => a + b}
  Foreach(D par 2){k =>
    c(k) = a(k) * sum
  }
}
Diagram: write ports 2i, 2i+1 (Foreach{i =>); read ports 2k, 2k+1 (Foreach{k =>) and b(2j), b(2j+1) (Reduce{j =>); memory a drawn five times
Step 2: Greedily combine (merge) instances
- Don’t combine if there are bank conflicts
- Don’t combine if the cost of merging is greater than sum of unmerged
**Recompute banking for merged instances!
(~5-10x memory) (40% less)
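The two merge rules in Step 2 can be sketched as a small greedy loop. Everything specific here is an assumption for illustration: the `Instance` record, the toy cost of one unit per buffer stage, and the toy conflict rule that two readers from the same pipeline stage cannot share a copy. The slides do not give the compiler's actual cost model or conflict test, and the sketch skips the banking recomputation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    readers: frozenset   # (stage, lane) pairs served by this copy of the memory
    depth: int           # buffer depth of this copy


def cost(inst):
    """Toy cost model: one unit of memory per buffer stage."""
    return inst.depth


def port_conflict(a, b):
    """Toy rule: two readers from the same pipeline stage cannot share a copy."""
    return bool({s for s, _ in a.readers} & {s for s, _ in b.readers})


def merge(a, b):
    """A merged copy serves both reader sets and needs the deeper buffer."""
    return Instance(a.readers | b.readers, max(a.depth, b.depth))


def greedy_merge(instances):
    merged = []
    for inst in instances:
        for idx, existing in enumerate(merged):
            if port_conflict(existing, inst):
                continue                      # rule 1: conflict, try the next copy
            candidate = merge(existing, inst)
            if cost(candidate) > cost(existing) + cost(inst):
                continue                      # rule 2: merging would cost more
            merged[idx] = candidate           # banking would be recomputed here
            break
        else:
            merged.append(inst)               # no compatible copy found
    return merged


unmerged = [
    Instance(frozenset({("reduce", 0)}), depth=2),
    Instance(frozenset({("reduce", 1)}), depth=2),
    Instance(frozenset({("foreach_k", 0)}), depth=3),
]
result = greedy_merge(unmerged)
print(sum(map(cost, unmerged)), sum(map(cost, result)))   # 7 5
```

In this toy setup one Reduce lane shares a triple-buffered copy with the final Foreach while the other Reduce lane keeps its own double buffer, which takes the total from 7 units to 5. With this simple cost the second rule never fires; a cost model that charges for extra banking logic could make it matter.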
![Page 58: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/58.jpg)
Kernel-Based Approach
Performance, Productivity, Portability
High-level specification
No hardware design knowledge required
Reasonably target-generic if done right
Manually implement each DSL operation; use a simple compiler to stitch them together
Misses cross-kernel optimizations
Excessive memory transfers
Excessive buffering
![Page 59: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/59.jpg)
type TM = FixPt[TRUE,_9,_23]
type TX = FixPt[TRUE,_9,_7]
val data = DRAM[TX](N, D)
val y = DRAM[TM](N)
val weights = DRAM[TM](D)
Accel {
  val yAddr = Reg[Int](-1)
  val yCache = SRAM[TM](CSIZE)
  val wK = SRAM[TM](D)
  wK load weights(0::D)
  Sequential.Foreach(E by 1){e =>
    epoch(random[Int](N), …)
    breakpoint()
  }
  weights(0 :: D) store wK
}
Arbitrary precision custom types
Off-chip memory allocations
Accelerator scope
On-chip memory allocations
Explicit memory transfer
Declaration of a sequential loop
Explicit memory transfer
Debugging breakpoint
Stochastic Gradient Descent in Spatial
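As a rough illustration of what the custom types `TM` and `TX` imply, the sketch below quantizes Python floats to signed fixed point. It assumes that the integer bit count includes the sign bit (so `TX` is 16 bits wide and `TM` is 32 bits wide), and that conversion rounds to nearest and saturates. Both are assumptions of this sketch rather than statements about Spatial's conversion rules, and `to_fixpt`, `tm`, and `tx` are hypothetical helper names.

```python
def to_fixpt(value, int_bits, frac_bits):
    """Quantize a float to signed fixed point, saturating at the range limits.

    Assumes int_bits includes the sign bit. Rounding uses Python's round(),
    which rounds to nearest with ties going to the even value.
    """
    scale = 1 << frac_bits
    total_bits = int_bits + frac_bits
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    raw = max(lo, min(hi, round(value * scale)))
    return raw / scale


def tm(v):
    """Model of FixPt[TRUE,_9,_23]."""
    return to_fixpt(v, 9, 23)


def tx(v):
    """Model of FixPt[TRUE,_9,_7]."""
    return to_fixpt(v, 9, 7)


print(tx(0.3))      # 0.296875, i.e. 38/128
print(tx(1000.0))   # 255.9921875, saturated at the largest representable value
```

The narrower `TX` type keeps only 7 fractional bits, so input data is stored at coarser precision than the weights and labels held in `TM`.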
![Page 60: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/60.jpg)
SGD in Spatial
def epoch(i: Int, ...): Unit = {
  val yPt = Reg[TM]
  if (i >= yAddr & i < yAddr+CSIZE & yAddr != -1) {
    yPt := yCache(i - yAddr)
  } else {
    yAddr := i - (i % CSIZE)
    yCache load y(yAddr::yAddr + CSIZE)
    yPt := yCache(i % CSIZE)
  }
  val x = SRAM[TX](D)
  x load data(i, 0::D)
  // Compute gradient against wK_t
  val yHat = Reg[TM]
  Reduce(yHat)(D by 1){j => wK(j) * x(j).to[TM] }{_+_}
  val yErr = yHat - yPt
  // Update wK_t with reduced variance update
  Foreach(D by 1){i =>
    wK(i) = wK(i) - (A.to[TM] * yErr * x(i).to[TM])
  }
}
Custom caching for random access on y
Explicit memory transfer
Gradient computation
Weight update
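The structure of `epoch` can be followed in a plain software model. The sketch below uses ordinary Python floats instead of fixed point and visits samples in a fixed order instead of using `random[Int](N)`. The class `YCache`, the parameter name `lr` (standing in for `A`), and the tiny dataset are assumptions made for illustration.

```python
class YCache:
    """Block cache over y, mirroring the yAddr / yCache logic on the slide."""

    def __init__(self, y, csize):
        self.y, self.csize = y, csize
        self.addr = -1            # -1 marks an empty cache, like Reg[Int](-1)
        self.block = []
        self.misses = 0

    def read(self, i):
        hit = self.addr != -1 and self.addr <= i < self.addr + self.csize
        if not hit:
            self.addr = i - (i % self.csize)            # align to the block start
            self.block = self.y[self.addr:self.addr + self.csize]
            self.misses += 1
        return self.block[i - self.addr]


def epoch(i, data, cache, w, lr):
    """One SGD step on sample i for a linear model with squared error."""
    y_pt = cache.read(i)
    x = data[i]
    y_hat = sum(wj * xj for wj, xj in zip(w, x))        # the Reduce over D
    y_err = y_hat - y_pt
    for j in range(len(w)):                             # the weight-update Foreach
        w[j] -= lr * y_err * x[j]


data = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
y = [2.0, -1.0, 1.0, 3.0]         # labels generated by the weights [2, -1]
w = [0.0, 0.0]
cache = YCache(y, csize=2)
for step in range(800):
    epoch(step % len(data), data, cache, w, lr=0.1)
```

After the loop `w` is close to `[2, -1]`. With a block size of 2 and this sequential order the cache reloads twice per pass over the four samples; random sampling, as in the Spatial code, would change that miss pattern.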
![Page 61: Spatial: A Language and Compiler for Application Accelerators · using the polyhedral model [Wang, Li, Cong FPGA ’14] Spatial’scontribution: find the (near) optimal banking/buffering](https://reader035.vdocuments.net/reader035/viewer/2022071102/5fdb2bcb6e26ab7fe66d8fab/html5/thumbnails/61.jpg)
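SGD in Spatial: Hardware
Diagram: FPGA datapath for the SGD design, connected to DRAM (weights, y, data). Numbered blocks: 13. load, 15. Sequential.Foreach, 22. store, 27. if … else, 37. load, 41. Reduce, 45. Foreach. On-chip values: wK, x, yCache, yAddr, yPt, yHat, yErr. Operators: ×, +, -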