©2005, IT - Instituto de Telecomunicações. All rights reserved.
From Low-architectural Expertise Up to High-throughput Non-binary LDPC Decoders: Optimization Guidelines using High-level Synthesis
João Andrade¹, Nithin George², Kimon Karras³, David Novo¹, Vitor Silva¹, Paolo Ienne², Gabriel Falcão¹
¹ University of Coimbra, PT; ² EPFL, CH; ³ Xilinx Research Labs, IE
FPL 2015, London, UK, 1-4 Sept. 2015
Outline
The Challenge
The Problem: Non-binary LDPC decoding
Decoding architecture: HLS decoder design
Experimental Results
Conclusion
1 | FPL’15, London, UK
The Challenge
Design an efficient LDPC decoder fast
• RTL requires highly specialized knowledge
  • Our background is GPU, not hardware
• Error-prone design space exploration (DSE)
• Extensive code refactoring for DSE
• High-level synthesis (HLS) has been around for years
• Fast time to market
Design an efficient LDPC decoder fast
• Adjust design-space (DS) decisions much faster, without extensive refactoring
  • Bitwidth for a particular SNR operating point
  • Decoding schedule
  • Decoding algorithm
• A C/C++ code base can be used with Vivado HLS
  • C/C++ supported
  • Cycle-accurate simulation after C-synthesis
  • Code annotations (#pragma) or Tcl commands
Why?
• Power budgets of GPUs are well above requirements
• Real-time operation is required
  • High decoding throughputs
  • Low latencies
The Problem
FEC in communication systems
• Belief propagation problem → LDPC code decoding
• Non-binary LDPC codes can tackle
  • Quantum-key distribution
  • Erasure channel (burst)
  • AWGN channel
• But they have a very high (non-linear) numerical complexity
• Irregular data patterns and intensive access profile
Soft-Decoding Algorithms: Belief-Propagation I
• LDPC decoding is a particular case of belief-propagation
[Figure: Tanner graph with check nodes CN1-CN3 (dc = 4) and variable nodes VN1-VN6 (dv = 2). Edges are labeled with the coefficients 1, α and α². Messages mv(x), mvc(x) and mcv(x) are permuted and depermuted on the edges and pass through the Walsh-Hadamard Transform (F) as mvc(z) and mcv(z); the VN outputs are m*v(x).]
• Messages circulate through a bipartite graph structure, with computation applied at the node and edge level
Soft-Decoding Algorithms: Belief-Propagation II
• The bipartite model can be employed to generalize other algorithms, with the following constraints
  • node-level functions → must produce/consume data coherently
  • edge-level functions → produce/consume without restrictions
• By defining these kernels, different algorithms can be defined
  • CN and VN → Hadamard products
  • Edges permute/depermute
  • Edges apply the Fast Walsh-Hadamard Transform
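As a minimal sketch of the node-level kernel named above, assuming each message is a pmf stored as an array of q = 2^m floats, the Hadamard product is just an element-wise multiplication. The function name `hadamard_product` and the `std::vector<float>` container are illustrative assumptions, not taken from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise (Hadamard) product of two messages of equal length q.
// Hypothetical helper: the slides only state that the CN and VN updates
// are Hadamard products, not how they are coded.
std::vector<float> hadamard_product(const std::vector<float>& a,
                                    const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t g = 0; g < a.size(); ++g)
        out[g] = a[g] * b[g];
    return out;
}
```

A node of degree d would chain this product over its incoming messages; normalization is left out of the sketch.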
HLS decoder: Mapping the LDPC Tanner graph
• We attempt an isomorphic transformation of the Tanner graph
$$H = \begin{bmatrix} \alpha & 0 & 1 & \alpha & 0 & 1 \\ \alpha^2 & \alpha & 0 & 1 & 1 & 0 \\ 0 & \alpha & \alpha^2 & 0 & \alpha^2 & 1 \end{bmatrix}$$
[Figure: the Tanner graph of H (CN1-CN3, VN1-VN6) annotated with the C functions that implement each kernel: vnUpdate(), permute(), fwht(), cnUpdate() and depermute(), together with an index_lut lookup table.]
• Therein, each node/edge-level kernel becomes its own C function, and each node/edge becomes an iteration within a loop structure
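The mapping can be sketched as a top-level loop that calls one C function per kernel. The function names come from the slide; the stub bodies, the `trace` vector and the exact call order (VN update, permute, transform, CN update, transform, depermute) are assumptions made for illustration, following the usual transform-domain sum-product schedule.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stub kernels: each one only records its name so the call order can be
// inspected. Real kernels would loop over nodes/edges and update messages.
static std::vector<std::string> trace;
void vnUpdate()  { trace.push_back("vnUpdate"); }
void permute()   { trace.push_back("permute"); }
void fwht()      { trace.push_back("fwht"); }
void cnUpdate()  { trace.push_back("cnUpdate"); }
void depermute() { trace.push_back("depermute"); }

// One pass per decoding iteration: VN->CN path, then CN->VN path.
void decode(int iterations) {
    assert(iterations >= 0);
    for (int it = 0; it < iterations; ++it) {
        vnUpdate();   // node level: variable-node update
        permute();    // edge level
        fwht();       // edge level: into the transform domain
        cnUpdate();   // node level: check-node update
        fwht();       // edge level: back from the transform domain
        depermute();  // edge level
    }
}
```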
HLS decoder: Where to begin?
• There are several dimensions to non-binary LDPC decoding
  (code related)
  • N VNs and M CNs to process
  • Each VN connects to dv CNs
  • Each CN connects to dc VNs
  (Galois Field related)
  • 2^m probabilities to compute per probability mass function (pmf)
HLS decoder: How to express computation?
• Suppose a GPU SIMT-architecture mindset
    //flat loop unsuitable for Vivado HLS optimizations
    for(int i = 0; i < edges*q*d_v; i++){
        int e = i/(d_v*q);   //get VN id
        int g = i%q;         //get GF(q) element
        int t = (i/q)%d_v;   //get d_v element
        computation();
    }
• How is it any different from this?
    //nested loop suitable for Vivado HLS optimizations
    for(int e = 0; e < edges; e++)
        for(int g = 0; g < q; g++)
            for(int t = 0; t < d_v; t++)
                computation();
• Optimizations are hardly picked up in the former
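To make the comparison concrete, the sketch below checks that both loop styles cover the same (e, g, t) iteration space, so the difference is only in how the bounds are exposed to the tool. The helper names `visit_flat` and `visit_nested` and the use of `std::set` are illustrative assumptions. Note that the two versions visit the triples in a different order (the flat index makes g the fastest-varying dimension).

```cpp
#include <cassert>
#include <set>
#include <tuple>

using Triple = std::tuple<int, int, int>;  // (e, g, t)

// Flat, GPU-style loop: indices are recovered with divisions and modulos.
std::set<Triple> visit_flat(int edges, int q, int d_v) {
    assert(edges > 0 && q > 0 && d_v > 0);
    std::set<Triple> seen;
    for (int i = 0; i < edges * q * d_v; i++) {
        int e = i / (d_v * q);
        int g = i % q;
        int t = (i / q) % d_v;
        seen.insert(Triple(e, g, t));
    }
    return seen;
}

// Nested loops: the same iteration space, one explicit bound per dimension.
std::set<Triple> visit_nested(int edges, int q, int d_v) {
    assert(edges > 0 && q > 0 && d_v > 0);
    std::set<Triple> seen;
    for (int e = 0; e < edges; e++)
        for (int g = 0; g < q; g++)
            for (int t = 0; t < d_v; t++)
                seen.insert(Triple(e, g, t));
    return seen;
}
```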
HLS decoder: Loop structures
[Figure: loop structure of the decoder kernels. VN->CN path: vn_proc (loops E: edges, GF: 2^m, Dv: dv), permute (E: edges, GF_read: 2^m, GF_write: 2^m) and fwht (E: edges, G_read: 2^m, LOGGF: m, G_compute: 2^m, G_write: 2^m). CN->VN path: cn_proc (E: edges, GF: 2^m, Dc: dc), fwht and depermute. The messages mv, mvc and mcv are read from DRAM in a prologue, iterated on in the local BRAM arrays l_mv, l_mvc and l_mcv, and written back in an epilogue. The BRAM arrays have 2 RW ports available and are partitioned in Solutions IV-VII.]
• Loop trip counts
  • E: edges, or N×dv = M×dc
  • GF: 2^m
  • LOGGF: m
  • Dv/Dc: dv/dc
• Local BRAM copies are maintained
• Data streams from DRAM
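The trip counts can be checked on the small example graph shown earlier (N = 6 VNs with dv = 2, M = 3 CNs with dc = 4, over GF(4), so m = 2). The helper names below are hypothetical and only restate the formulas from the list above.

```cpp
#include <cassert>

// Edge count seen from the VN side: N variable nodes of degree d_v.
int edges_from_vns(int N, int d_v) {
    assert(N > 0 && d_v > 0);
    return N * d_v;
}

// Edge count seen from the CN side: M check nodes of degree d_c.
int edges_from_cns(int M, int d_c) {
    assert(M > 0 && d_c > 0);
    return M * d_c;
}

// Trip count of the GF loops: the field size 2^m.
int gf_size(int m) {
    assert(m > 0 && m < 31);
    return 1 << m;
}
```

For the example graph both edge counts equal 12, and one E-by-GF sweep therefore touches 12 × 4 = 48 probabilities.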
HLS decoder: Loop structures
    //nested loop structure of vn_proc
    E:for(int e = 0; e < edges; e++)
        GF:for(int g = 0; g < q; g++)
            Dv:for(int t = 0; t < d_v; t++)
                //computation follows
HLS decoder: Loop structures
    E:for(int e = 0; e < limit; e++){
        GF_read:for(int g = 0; g < GF; g++)
            //load data into temporary buffer
        GF_write:for(int g = 0; g < GF; g++)
            //permute and store back to memory
    }
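A minimal sketch of the read/permute/write pattern for a single edge, assuming the permutation is given as an index lookup table. The table `kMulByAlpha` (multiplication by α in GF(4), with elements indexed 0, 1, α = 2, α² = 3 and α² = α + 1) is an example chosen here; the slides do not show the contents of their lookup table, and which direction is called "permute" is a convention.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplication by alpha in GF(4): 0->0, 1->alpha, alpha->alpha^2, alpha^2->1.
// Example table for illustration only.
static const std::vector<std::size_t> kMulByAlpha = {0, 2, 3, 1};

// GF_read then GF_write: copy to a temporary buffer, store back permuted.
void permute(std::vector<float>& msg, const std::vector<std::size_t>& lut) {
    assert(msg.size() == lut.size());
    const std::vector<float> tmp(msg);             // GF_read
    for (std::size_t g = 0; g < msg.size(); ++g)   // GF_write
        msg[lut[g]] = tmp[g];
}

// Inverse operation: undo the permutation with the same table.
void depermute(std::vector<float>& msg, const std::vector<std::size_t>& lut) {
    assert(msg.size() == lut.size());
    const std::vector<float> tmp(msg);             // GF_read
    for (std::size_t g = 0; g < msg.size(); ++g)   // GF_write
        msg[g] = tmp[lut[g]];
}
```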
HLS decoder: Loop structures
    E:for(int e = 0; e < edges; e++){
        G_read:for(int g = 0; g < q; g++){
            //load data into temporary array
        }
        LogGF:for(int c = 0; c < m; c++)
            GF:for(int g = 0; g < q; g++)
                //perform Radix-2 computation
        G_write:for(int g = 0; g < q; g++){
            //store data back to memory
        }
    }
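The LogGF/GF loop pair above can be filled in with a standard in-place radix-2 fast Walsh-Hadamard transform. The sketch below handles one message of q = 2^m values; the butterfly indexing is one common formulation and is an assumption, since the slides do not show the loop body.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place, unnormalized radix-2 fast Walsh-Hadamard transform of q = 2^m values.
void fwht(std::vector<float>& x, int m) {
    assert(m >= 0 && m < 31);
    const std::size_t q = std::size_t(1) << m;
    assert(x.size() == q);
    for (int c = 0; c < m; ++c) {                 // LogGF: m stages
        const std::size_t h = std::size_t(1) << c;
        for (std::size_t g = 0; g < q; ++g) {     // GF: q points per stage
            if ((g & h) == 0) {                   // g is the first input of a butterfly
                const float a = x[g];
                const float b = x[g + h];
                x[g] = a + b;
                x[g + h] = a - b;
            }
        }
    }
}
```

Applying the transform twice returns the input scaled by q, which is why the same kernel can serve both directions if the scaling is absorbed by a later normalization.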
HLS decoder solutions: Partitioning
• Scheduling analysis after C-synthesis shows a lack of memory ports
  • BRAMs are instantiated with dual-port control
  • Further action is required
    set_directive_resource -core RAM_T2P_BRAM
    set_directive_array_partition -type cyclic -factor 4 -dim 1
[Figure: cyclic partitioning by a factor of 4. Each original array (l_mv, l_mvc, l_mcv; elements 0, 1, 2, 3, 4, 5, ...) is split into four partitioned arrays (l_mv_0 to l_mv_3, l_mvc_0 to l_mvc_3, l_mcv_0 to l_mcv_3) holding elements {0, 4, 8, ...}, {1, 5, 9, ...}, {2, 6, 10, ...} and {3, 7, 11, ...}. After partitioning, 2×2^m RW ports are available.]
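The element-to-bank mapping of cyclic partitioning can be written down directly: element i goes to bank i mod factor, at offset i div factor. The struct and function names below are hypothetical; the sketch only models the address mapping, not the tool's implementation.

```cpp
#include <cassert>

// Location of one array element after cyclic partitioning.
struct BankSlot {
    int bank;    // which partitioned array (e.g. l_mv_0 .. l_mv_3)
    int offset;  // index inside that array
};

// Cyclic partitioning: consecutive elements land in consecutive banks.
BankSlot cyclic_slot(int index, int factor) {
    assert(index >= 0 && factor > 0);
    return BankSlot{index % factor, index / factor};
}
```

With a factor of 4, elements 1, 5 and 9 all map to bank 1, so any four consecutive elements sit in four different banks and can be accessed in the same cycle.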
HLS decoder: Optimization solution I
[Figure: schedule of the fwht kernel (E loop over edges; inner loops GF_read: 2^m, LogGF: m, GF_compute: 2^m, GF_write: 2^m) across Solutions I to VII, ranging from "not parallel, high II" with 2 RW ports available to "parallel, low II" with 2×2^m RW ports available.]
• Solution I: base version without optimizations
HLS decoder: Optimization solution II
• Solution II: full unroll of inner loops LOGGF and GF
    set_directive_unroll "*/LOGGF"
    set_directive_unroll "*/GF"
HLS decoder: Optimization solution III
• Solution III: II + pipeline of outer loops E to II=1
    set_directive_pipeline "*/E" -II=1
HLS decoder: Optimization solution IV
• Solution IV: I + cyclic partitioning of all BRAM arrays by a factor of 2^m
    set_directive_array_partition -type cyclic -factor 2^m -dim 1 "decoder" l_buffer
HLS decoder: Optimization solution V
• Solution V: IV + full unroll of inner loops LOGGF and GF
HLS decoder: Optimization solution VI
• Solution VI: III + IV (unroll, pipeline, partition)
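The same combination can also be expressed with in-source annotations instead of Tcl directives. The sketch below applies the Vivado HLS pragmas `ARRAY_PARTITION`, `PIPELINE` and `UNROLL` to a transform kernel; the constants (GF(4), 12 edges), the function name `fwht_edges` and the loop body are assumptions for illustration and not the authors' code. A regular C++ compiler ignores the pragmas, so the function can still be run as plain software.

```cpp
#include <cassert>

const int M_BITS = 2;          // m, assuming GF(4) for illustration
const int Q = 1 << M_BITS;     // q = 2^m
const int EDGES = 12;          // assumed edge count of a toy code

// Walsh-Hadamard transform of every edge message, annotated in the spirit
// of Solution VI (partition + pipeline + unroll).
void fwht_edges(float mvc[EDGES * Q]) {
    assert(mvc != nullptr);    // sanity check for software runs
    float l_mvc[EDGES * Q];    // local copy, intended to map to BRAM
#pragma HLS ARRAY_PARTITION variable=l_mvc cyclic factor=4 dim=1

    prologue: for (int i = 0; i < EDGES * Q; i++)
        l_mvc[i] = mvc[i];

    E: for (int e = 0; e < EDGES; e++) {
#pragma HLS PIPELINE II=1
        LOGGF: for (int c = 0; c < M_BITS; c++) {
#pragma HLS UNROLL
            GF: for (int g = 0; g < Q; g++) {
#pragma HLS UNROLL
                const int h = 1 << c;
                if ((g & h) == 0) {   // radix-2 butterfly on (g, g + h)
                    const float a = l_mvc[e * Q + g];
                    const float b = l_mvc[e * Q + g + h];
                    l_mvc[e * Q + g] = a + b;
                    l_mvc[e * Q + g + h] = a - b;
                }
            }
        }
    }

    epilogue: for (int i = 0; i < EDGES * Q; i++)
        mvc[i] = l_mvc[i];
}
```

The partition factor of 4 matches q = 2^m for this toy configuration, so all q elements of one edge can be accessed in parallel.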
HLS decoder: Optimization solution VII
• Solution VII: IV + pipeline of inner loops LOGGF and GF to II=1
    set_directive_pipeline "*/LOGGF" -II=1
    set_directive_pipeline "*/GF" -II=1
HLS decoder
• Define fixed-point computation suitable for a target SNR/BER
    #include <ap_fixed.h>
    //data is stored in llr type variables
    //computation is performed in llr_ type variables
    //option 1: use floating-point
    typedef float llr;
    typedef float llr_;
    //option 2: use fixed-point (8 bits with 7 fractional bits for storage,
    //16 bits with 13 fractional bits for computation); keep only one option
    typedef ap_fixed< 8, 1, AP_RND_INF, AP_SAT > llr;
    typedef ap_fixed< 16, 3, AP_RND_INF, AP_SAT > llr_;
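To explore bitwidths without the vendor headers, the effect of such a type can be emulated in plain C++. The sketch below assumes a signed format with round-to-nearest (ties away from zero) and saturation, which is the behaviour requested by the rounding and saturation modes above; the function `quantize` is hypothetical and is not part of any Xilinx library.

```cpp
#include <cassert>
#include <cmath>

// Emulates storing x in a signed fixed-point format with `total_bits` bits,
// of which `frac_bits` are fractional: round to nearest (ties away from
// zero), then saturate to the representable range.
double quantize(double x, int total_bits, int frac_bits) {
    assert(total_bits > 0 && total_bits < 32);
    assert(frac_bits >= 0 && frac_bits <= total_bits);
    const double scale = std::ldexp(1.0, frac_bits);               // 2^frac_bits
    const double max_code = std::ldexp(1.0, total_bits - 1) - 1.0; // largest code
    const double min_code = -std::ldexp(1.0, total_bits - 1);      // smallest code
    double code = std::round(x * scale);   // std::round: ties away from zero
    if (code > max_code) code = max_code;  // saturate high
    if (code < min_code) code = min_code;  // saturate low
    return code / scale;
}
```

For the 8-bit storage format the representable range is [-1, 127/128] in steps of 1/128, so `quantize(2.0, 8, 7)` saturates to 127/128.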
Experimental results: Clock frequency and latency
[Figure: three plots of latency [cycles] (logarithmic axis) and clock frequency [MHz] (0 to 250) for Optimizations I-VII.]
• Best clock frequency of operation obtained for Solution VI
• Lowest latency always achieved for Solution VI
• Solution III is a good compromise between VI and the remaining Solutions
• Solution VII, replication of pipelined loops, is a poor design choice
  • Most similar to the OpenCL strategy
Experimental results: FPGA utilization
[Figure: three plots placing Solutions I-VI by FPGA utilization and latency, with arrows labeled partition, unroll and pipeline linking the solutions. Annotations: "higher utilization, latency unchanged" and "lower latency, utilization unchanged".]
• Under 20% LUT utilization
• Multiple decoder instantiation
• What about pin, clock and memory interface?
HLS host platform
• Originally an RTL project, and can be Tcl-scripted automatically
• VC709 board target (693K logic cells)
[Figure: host platform block diagram. On the FPGA, a memory interface connects board DRAM 0 and DRAM 1 to two AXI4 interconnects, which serve HLS IP cores 1 to K, each with its own BRAMs.]
• DMA via PCIe → 3K LUTs
• Two DRAM banks controlled (MIG)
• Two AXI interconnects can be configured for up to 16 HLS cores
• Data streams from DRAM bank 0 and to bank 1
• Each HLS core performs computation on its own “BRAM” space
Experimental results: Pareto exploration
[Figure: latency [µs] (logarithmic axis, 10^0 to 10^5) versus LUT utilization [%] (0 to 90) for GF(4), GF(8) and GF(16), marking Pareto-optimal and non-optimal points. Single-accelerator designs without DRAM controllers are distinguished from the final decoder design with DRAM controllers and several accelerators instantiated.]
• The host HLS architecture and the multiple decoders raise the LUT utilization to ∼80%
• {14, 5, 3} decoders for GF(2^2), GF(2^3) and GF(2^4)
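Experimental results: Comparison with RTL decoders

| Decoder | m | K | LUT [%] | FF | BRAM | DSP | Thr. [Mbit/s] | Clk [MHz] |
|---|---|---|---|---|---|---|---|---|
| This work | 2 | 1 | 14 | 7 | 0.5 | 0.5 | 1.17 | 250 |
| This work | 2 | 14 | 80 | 35 | 6 | 6 | 14.54 | 219 |
| This work | 3 | 1 | 21 | 9 | 0.9 | 0.9 | 0.95 | 250 |
| This work | 3 | 6 | 81 | 34 | 5 | 5 | 4.81 | 210 |
| This work | 4 | 1 | 30 | 13 | 2 | 2 | 0.66 | 216 |
| This work | 4 | 3 | 73 | 32 | 5 | 5 | 1.85 | 201 |
| Zhang TCS–I’11 | 4 | 1 | 48 (Slices) | | 41 | – | 9.3 | – |
| Emden ISTC’10 | 2 | 1 | – | – | – | – | 33.16 | 100 |
| Emden ISTC’10 | 4 | 1 | – | – | – | – | 13.22 | 100 |
| Emden ISTC’10 | 8 | 1 | – | – | – | – | 1.56 | 100 |
| Spagnol SiPS’09 | 3 | 1 | 13 | 3 | 1 | – | ≤4.7 | 99 |
| Boutillon TCS–I’13 | 6 | 1 | 19 | 6 | 1 | – | 2.95 | 61 |
| Andrade ICASSP’14 | 8 | 1 | 85 (LEs) | | 62 | 7 | 1.1 | 163 |
| Scheiber ICECS’13 | 1 | 1 | 14 (Slices) | | 21 | – | 13.4 | 122 |

∗ Differences in technology nodes and FPGAs are not considered.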
Comparison with previous HLS works
• Maxeler decoders allow for ∼1 Gbit/s decoding throughputs (Pratas, GlobalSIP’13; Andrade, ASAP’14)
• OpenCL (Altera) decoder peaks at hundreds of Kbit/s (Andrade, ICASSP’14)
• Vivado HLS decoder reaches dozens of Mbit/s (Scheiber, ICECS’13)

FPGA utilization [%] per Solution, for GF(2^3) and for GF(2^3) in floating-point:

| Util. [%] | GF(2^3): I | IV | V | VI | GF(2^3) (floating-point): I | IV | V | VI |
|---|---|---|---|---|---|---|---|---|
| LUTs | 0.64 | 1.13 | 5.20 | 10.4 | 0.65 | 1.48 | 11.7 | 17.3 |
| FF | 0.28 | 0.53 | 2.52 | 3.94 | 0.29 | 0.51 | 3.30 | 7.56 |
| DSP | 0.06 | 0.06 | 0.89 | 0.89 | 0.06 | 0.06 | 1.78 | 1.78 |
| BRAM | 0.44 | 0.82 | 0.82 | 0.82 | 0.78 | 1.63 | 2.72 | 1.63 |
Summary
• Code writing-style counts
  • Language is the same, model is not
• Design optimizations come hand-in-hand with the code writing-style
  • Clearly defined bounds are better
  • Optimizations can be double-edged swords
• We can achieve the same ballpark figures as RTL
  • Higher utilization
• Outlook
  • When will platforms be automatically generated?
  • When will the C programming model merge?
*(b++)=*(a++)*c;
What?
b[i]=a[i]*c;
Ah, yes!
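The two snippets compute the same thing; what changes is how visible the access pattern is. A small sketch, with hypothetical function names `scale_ptr` and `scale_idx`, puts them side by side so the equivalence can be checked in software.

```cpp
#include <cassert>

// Pointer-arithmetic style, as in the first snippet.
void scale_ptr(const float* a, float* b, float c, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        *(b++) = *(a++) * c;
}

// Array-index style, as in the second snippet: same result, with the
// element index explicit in every access.
void scale_idx(const float a[], float b[], float c, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        b[i] = a[i] * c;
}
```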
Thank you. Questions are welcome.
What tool to use?
What HLS tool?
• How much control are we willing to lose?
  • A lot? → OpenCL (C-based)
  • Dataflow? → MaxCompiler (Java)
  • Some? → LegUp, Vivado HLS (C/C++, SystemC)
  • None? → Stick to RTL (Verilog, VHDL)
• Vivado HLS allows fine control over
  • Loop scheduling → unroll, pipeline, merge, flatten
  • AXI4 blocks → master/slave memory and stream interfaces
  • Arbitrary bitwidth → fixed-point types supported
• No control over clock, external memory interfaces, or pin I/O layout