1
A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC
DEBLOCKING FILTER
Kefalas Nikolaos, Theodoridis George
VLSI Design Lab., Electrical & Computer Eng. Department
University of Patras, Greece
2
Outline
1. Deblocking filter algorithm
2. Filtering ordering
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work
3
Deblocking Filter Algorithm (1/3)
Deblocking Filter
The deblocking filter is used in H.264/AVC to reduce blocking artifacts
– It improves subjective and objective quality and typically reduces the bit-rate by 5-10%
It is performed on a macroblock (MB) basis after the macroblock reconstruction stage completes
It includes a large number of data-dependent branches – each 4x4 pixel area is processed up to four times
It accounts for over one-third (1/3) of the total decoding time
4
Deblocking Filter Algorithm (2/3)
Each MB is processed in 4x4 blocks
The vertical edges are filtered first, from left to right
– from edge V0 to edge V3
Then the horizontal edges are filtered, from top to bottom
– from edge H0 to edge H3
The eight pixels across the edge of two adjacent 4x4 sub-blocks (p3..p0 and q0..q3) are filtered at the same time
– The same process repeats for the chroma components
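[Figure: vertical edges V0–V3 and horizontal edges H0–H3 of the 16x16 luma (Y) MB, each edge made of 4-pixel sub-edges, with pixels p3 p2 p1 p0 and q0 q1 q2 q3 on either side of an edge; the same layout for the 8x8 chroma (Cb) and 8x8 chroma (Cr) blocks. Legend: edge to be filtered; edge NOT to be filtered; sub-block of current MB; sub-block of neighboring MB.]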
5
Deblocking Filter Algorithm (3/3)
Each sub-edge shares a BS value
The BS along with two thresholds α, β decides the filtering strength of each sub-edge
–A filter samples flag is calculated
Three filter types are used
– Strong filter (4- or 5-tap filter)
– Weak filter
– No filtering
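The choice among the three filter types can be sketched as follows. This is a minimal illustration based on the sample-level conditions of the H.264/AVC standard, not the authors' hardware; the function names and the α, β values used in the usage note are assumptions made for the example.

```python
def filter_samples_flag(bs, p1, p0, q0, q1, alpha, beta):
    """True when the line of samples across the edge should be filtered."""
    return (bs != 0
            and abs(p0 - q0) < alpha
            and abs(p1 - p0) < beta
            and abs(q1 - q0) < beta)


def filter_type(bs, p1, p0, q0, q1, alpha, beta):
    """Pick 'strong', 'weak' or 'none' for one line of samples."""
    if not filter_samples_flag(bs, p1, p0, q0, q1, alpha, beta):
        return "none"
    # BS=4 selects the strong filter; BS=1..3 select the weak filter.
    return "strong" if bs == 4 else "weak"
```

With the illustrative thresholds alpha=20 and beta=6, a small step such as p1, p0, q0, q1 = 100, 102, 104, 105 is filtered, while a large step (a likely real image edge) such as 100, 102, 140, 141 is left untouched.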
6
Outline
1. Deblocking filter algorithm
2. Filtering ordering
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work
7
Filtering Order
During filtering all four sub-edges of each sub-block are filtered and almost all pixels are involved and updated
A suitable filtering order is needed to:
– Reduce the size of the on-chip memory for buffering intermediate data
– Increase data reuse
– Reduce the external memory accesses
– Simplify control and steering logic
– Avoid pipeline stalls due to data and resource hazards
8
Proposed Filtering Order
The vertical sub-edges are filtered in raster-scan order, followed by the horizontal ones
The filtering direction does not change until all vertical edges of luma and chroma have been filtered
The proposed order complies with the standard
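As a rough sketch of the vertical pass, the code below enumerates the 24 vertical sub-edges (16 luma, 4 Cb, 4 Cr) in raster-scan order together with their P and Q sub-blocks. The block labels (B0–B23, L0–L3) follow the slide figures; the function name and the tuple layout are assumptions made for illustration.

```python
def vertical_sub_edges():
    """Return (edge_number, P_block, Q_block) for the 24 vertical sub-edges."""
    edges = []
    n = 0
    # (first block index, blocks per row, rows) for luma Y, chroma Cb, chroma Cr
    for first, cols, rows in ((0, 4, 4), (16, 2, 2), (20, 2, 2)):
        for r in range(rows):
            for c in range(cols):
                q = first + r * cols + c
                # The leftmost edge of each row borders the left neighbour L<r>.
                p = f"L{r}" if c == 0 else f"B{q - 1}"
                edges.append((n, p, f"B{q}"))
                n += 1
    return edges
```

The first entries are (0, L0, B0), (1, B0, B1), (2, B1, B2), (3, B2, B3), (4, L1, B4), which agrees with the PIN/QIN rows of the Vertical Edge Filter Process table.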
[Figure: sub-edge numbering (0–47) over sub-blocks B0–B23, with left neighbors L0–L3 and upper neighbors U0–U3, for the 16x16 luma (Y), 8x8 chroma (Cb) and 8x8 chroma (Cr) blocks.
Proposed Ordering: luma vertical sub-edges 0–15, Cb vertical 16–19, Cr vertical 20–23, luma horizontal 24–39, Cb horizontal 40–43, Cr horizontal 44–47.
Original Ordering: luma sub-edges 0–31, Cb 32–39, Cr 40–47.]
9
Outline
1. Deblocking filter algorithm
2. Filtering ordering
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work
10
Memory Organization (1/2)
Four single-port memories are employed (sizes in bits)
– Current-A (CM-A): 96x32
– Current-B (CM-B): 96x32
– Left_mem (LM): 32x32
– Upper_mem (UM): 2xFWx32 + 2x(FW/16)x32
Transpose buffers TR-P and TR-Q (4x32) – typical systolic array
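[Figure: memory organization block diagram. Blocks shown: previous stage (reconstructed pixels), Current A memory (W/R), Current B memory (W/R), Upper memory (R/W), Left memory (R/W), transpose buffers TR-P and TR-Q (IN/OUT), 4-stage filter with P and Q inputs, Bs unit (Bs, α, β, tc0), reference memory, coding information.]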
All internal buses are 32 bits
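The memory sizes listed above can be totalled for a given frame width. This is a small sketch that assumes FW is the frame width in luma pixels and that integer division is intended for FW/16; the function and dictionary names are hypothetical.

```python
WORD_BITS = 32  # all internal buses and memory words are 32 bits wide


def on_chip_memory_bits(frame_width):
    """Size in bits of each on-chip memory, plus the total."""
    mem = {
        "CM-A": 96 * WORD_BITS,
        "CM-B": 96 * WORD_BITS,
        "LM": 32 * WORD_BITS,
        # Upper memory: 2xFWx32 for pixels + 2x(FW/16)x32 for coding information
        "UM": 2 * frame_width * WORD_BITS + 2 * (frame_width // 16) * WORD_BITS,
    }
    mem["total"] = sum(mem.values())
    return mem
```

Under these assumptions, a 1920-pixel-wide frame needs 130560 bits of upper memory, so the frame-width-dependent upper memory dominates the fixed 7168 bits of the other three memories.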
11
Memory Organization (2/2)
[Figure: frame layout annotated with frame width, frame height and the chroma component width (frame width/2), showing the regions held in Upper Memory, Left Memory, Current Memory A and Current Memory B.]
12
Outline
1. Deblocking filter algorithm
2. Filtering ordering
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work
13
Algorithm Features
Computationally intensive operations of the deblocking filter algorithm
– LUT operations: retrieve the values α(IndexA), β(IndexB), c1(IndexA, BS)
– BS calculation
– Weak filter, BS(1~3): filtering, δ calculation and clipping operations
– Strong filter, BS(4)
The introduced pipeline exploits specific algorithmic features
– BS is the same for all micro-edges of a sub-edge for the luma component
– BS of the luma component is reused for the chroma components
– For the (4:2:0) format, BS changes every 2 micro-edges in the chroma components
14
Proposed Pipeline Organization
[Figure: proposed pipeline organization over five stages (1st to 5th). Blocks shown: Bs unit (Bs, α, β, tc0), memory read (P_in, Q_in), delta calculation, threshold calculation, Bs 1~3 filter (1st, 2nd and 3rd stage), Bs 4 filter (1st and 2nd stage), memory write (P_out, Q_out).]
15
Pipeline Operation
Each sub-block needs 4 cycles to be processed
The BS unit takes 4 cycles (BS calculation & LUT operations)
– BS and LUT operations do not depend on pixel values
BS calculation & LUT operations are overlapped with the filtering operations for the luma component
Four initialization cycles are needed to calculate the BS and the α, β, c1 for the first luma sub-block
16
BS=4 Filtering
[Figure: BS=4 filter datapath. Inputs p0–p3 and q0–q3 feed a tree of adders that forms the shared partial sums sum1–sum13, followed by shifts (<<1, >>2, >>3). Outputs: p0', p1', p2', q0', q1', q2' for the strong variant (bs4s) and p0', q0' for the weak variant (bs4w). The pipeline organization figure is repeated next to it.]
Filter equations modified to improve delay & area
BS=4 – 13 adders instead of 28
Total components – Adders: 13+14+4=31
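For reference, the unmodified BS=4 luma equations from the H.264/AVC standard can be sketched as below. This is not the 13-adder restructuring shown on the slide, which shares partial sums between outputs; it only illustrates what the datapath has to compute. The function name and argument layout are assumptions, and the samples are assumed to have already passed the filterSamplesFlag test.

```python
def bs4_filter_luma(p, q, alpha, beta):
    """p = [p0, p1, p2, p3], q = [q0, q1, q2, q3]; returns the filtered lists."""
    p0, p1, p2, p3 = p
    q0, q1, q2, q3 = q
    small_gap = abs(p0 - q0) < (alpha >> 2) + 2
    ap = abs(p2 - p0) < beta
    aq = abs(q2 - q0) < beta

    if ap and small_gap:   # strong variant: p0, p1 and p2 are modified
        new_p = [(p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3,
                 (p2 + p1 + p0 + q0 + 2) >> 2,
                 (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3,
                 p3]
    else:                  # weak variant: only p0 is modified
        new_p = [(2 * p1 + p0 + q1 + 2) >> 2, p1, p2, p3]

    if aq and small_gap:   # same equations mirrored for the q side
        new_q = [(q2 + 2 * q1 + 2 * q0 + 2 * p0 + p1 + 4) >> 3,
                 (q2 + q1 + q0 + p0 + 2) >> 2,
                 (2 * q3 + 3 * q2 + q1 + q0 + p0 + 4) >> 3,
                 q3]
    else:
        new_q = [(2 * q1 + q0 + p1 + 2) >> 2, q1, q2, q3]
    return new_p, new_q
```

Many terms, such as p0 + q0 and p1 + p2, recur across the output equations, which is the kind of redundancy an adder-sharing implementation can exploit.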
17
Pipeline Benefits
LUT operations and BS calculation are not squeezed in a single pipeline stage
– The Bs unit has 4 cycles available
The filtering operations are expanded in three pipeline stages
The BS values are reused for filtering the chroma components
Modification of the original filtering equations (improve performance & area)
The proposed ordering simplifies control logic and memory addressing, avoiding any potential increase of the critical path
18
Edge Filter Process
Block Cycle 0 1 2 3 4
Filtered Sub-edge
0 1 2 3 4
PIN L0 B0 B1 B2 L1
QIN B0 B1 B2 B3 B4
TR_P-W B0 B1 B2
TR_P-R B0 B1
TR_Q-W B3
TR_Q-R
CM_A-R B0 B1 B2 B3 B4
CM_B-W B0 B1
LM-R L0 L1
LM-W
UPM-W
Ext_M-W L0
19
Vertical Edge Filter Process
Total cycles = 4x27 = 108
– If a two-port memory were used, the total would be 4x24 = 96 cycles, which is the optimum
Block Cycle 0 1 2 3 4 5 6 12 13 14 22 23 24 25 26
Filtered Sub-edge 0 1 2 3 4 5 6 12 13 14 22 23
PIN L0 B0 B1 B2 L1 B4 B5 … L3 B12 B13 … L1 B22
QIN B0 B1 B2 B3 B4 B5 B6 B12 B13 B14 B22 B23
TR_P-W B0 B1 B2 B4 … B10 L3 B12 … B20 L1 B22
TR_P-R B0 B1 B2 B9 B10 L3 B20 B22
TR_Q-W B3 … B11 … B21 B23
TR_Q-R 3 B11 B19 B21 B23
CM_A-R B0 B1 B2 B3 B4 B5 B6 … B12 B13 B14 … B22 B23
CM_B-W B0 B1 B2 B3 B9 B10 B11 B19 B20 B21 B22 B23
LM-R L0 L1 … L3 … L1
LM-W
UPM-W L3 L1
Ext_M-W L0 L1
20
Processing Cycles
Vertical edges: 108 cycles
Horizontal edges: 108 cycles
Initialization: 10 cycles
– 6 to fetch coding info and initialize control
– 4 for the 1st BS calculation
Normal operation: 226 cycles
For the last row (edges 27, 31, 35, 41, 45): 5x4=20 extra cycles
– Resource hazard (bus conflict)
For the last MB in the frame, 12 extra cycles are needed (edges 39, 43, 47)
– Resource hazard (bus conflict)
Worst case total cycles: 258
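The cycle budget above adds up as in the short sketch below. The constant and function names are illustrative, and reading the 12 extra cycles as 3 edges x 4 cycles is an assumption.

```python
VERTICAL = 4 * 27      # 108 cycles for the vertical edges
HORIZONTAL = 4 * 27    # 108 cycles for the horizontal edges
INIT = 6 + 4           # fetch coding info / init control + first BS calculation
LAST_ROW_EXTRA = 5 * 4 # edges 27, 31, 35, 41, 45
LAST_MB_EXTRA = 3 * 4  # edges 39, 43, 47 (assumed 3 edges x 4 cycles)


def cycles_per_mb(last_row=False, last_mb=False):
    """Cycles to filter one macroblock, including boundary penalties."""
    total = VERTICAL + HORIZONTAL + INIT
    if last_row:
        total += LAST_ROW_EXTRA
    if last_mb:
        total += LAST_MB_EXTRA
    return total
```

Here cycles_per_mb() gives the 226 cycles of normal operation, and cycles_per_mb(last_row=True, last_mb=True) gives the 258-cycle worst case.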
21
Outline
1. Deblocking filter algorithm
2. Filtering ordering
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work
22
Experimental Setup
Synthesis setup
– Synopsys Design Compiler
– TSMC 0.18 um
FPGA proven
– Standalone, compared with the JM reference software
– It has also been verified as part of an H.264 hardware encoder
– It achieves 280 MHz on a Virtex 5, speed grade 3
23
Synthesis Results and Comparisons
Parameter | [5] (2008) | [6] (2008) | [7] (2009) | [8] (2006) | Proposed
Pipeline stages | 5 | 5 | 4 | 5 | 5
Filtering order | Hybrid | Hybrid | Hybrid | Hybrid | Impr. Sequential
Local RAMs (bits) | 1P(1) 2x96x32, 1P 32x32 | 1P 96x32, 2P(1) 32x32 | 1P 96x32, 1P 32x32 | 1P 96x32, 2P 32x32 | 1P 2x96x32, 1P 32x32
Upper neighbour RAM (bits) | 1P 2FWx32 | N/A | 1P 2FW(2)x32 | 1P 1.5FWx32 | 1P 2FWx32
Coding information RAM (bits) | N/A | N/A | N/A | N/A | 2(FW/16)x32 (7)
Transpose buffers (4x32 bits) | 7 | 1 | 5 | 2 | 2
Technology (μm) | 0.18 | 0.18 | 0.18 | 0.18 | 0.18
Gate count (10^3 gates) | 21.5 | 18.7 | 26 | 20.9 | 19.2
Kernel processing (cycles/MB) | 204 | 210/222 (3) | 192 (4) | 214/246 (5) | 226/246 (6)
Max frequency (MHz) | 200 | 200 | 220 | 100 | 400 (1.8x up to 4x)
Throughput (10^3 MB/s) | 980 | 952 | 1146 | 467 | 1796 (1.5x up to 3.8x)
Fps – Full HD (1920x1080) | 120 | 102 | 140 | 57 | 216
Fps – Ultra HD (3840x2160) | 30 | 25 | 35 | 14 | 54
Notes: (1) 1P: single-port, 2P: two-port. (2) FW: frame width. (3) Filtering cycles only. (4) Filtering cycles only. (5) It takes 246 cycles to filter an MB at the right frame boundary. (6) It takes 246 cycles to filter an MB at the bottom frame row. (7) The 2x(FW/16)x32 bits are stored in the upper memory.
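The frame rates in the last two rows follow from the clock frequency, the cycles per MB and the number of macroblocks per frame. The sketch below assumes 226 cycles for every MB (boundary penalties ignored) and frame dimensions rounded up to whole 16x16 macroblocks; the function names are hypothetical.

```python
def macroblocks_per_frame(width, height):
    """Number of 16x16 macroblocks, rounding each dimension up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def frames_per_second(freq_hz, cycles_per_mb, width, height):
    """Whole frames per second sustained by the deblocking filter alone."""
    return freq_hz // (cycles_per_mb * macroblocks_per_frame(width, height))
```

For the proposed design at 400 MHz, frames_per_second(400_000_000, 226, 1920, 1080) reproduces the Full HD figure of 216 fps, and the same call with 3840 and 2160 reproduces the Ultra HD figure of 54 fps.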
24
Conclusions
A novel high speed pipeline architecture for the H.264/AVC deblocking filter is proposed
It operates at 400 MHz and occupies 19.2 Kgates in 0.18 um CMOS technology
It achieves 216 and 54 fps for Full and Ultra-HD frames, respectively
Only single port memories are employed
No external memory accesses are needed during filtering
– Parameters and neighbors are stored internally
– Only fully filtered data are written to external memories
25
Questions ???
26
Hardware Architecture (Pipeline organization) 5/
Threshold Calculation
[Figure: threshold calculation datapath. Subtractors form p0-q0, p1-p0, q1-q0, p2-p0 and q2-q0; their absolute values are compared against α, β and (α>>2)+2 to produce filterSamplesFlag, the Bs 1~3 strengths ap and aq, and the Bs 4 filtering strength (apq4).]
27
BS=4 Filtering
[Figure: BS=4 filter datapath. Inputs p0–p3 and q0–q3 feed a tree of adders that forms the shared partial sums sum1–sum13, followed by shifts (<<1, >>2, >>3). Outputs: p0', p1', p2', q0', q1', q2' for the strong variant (bs4s) and p0', q0' for the weak variant (bs4w).]
28
Deblocking Filter Algorithm 3/3
[Figure: BS decision flow for the 4x4 block boundary between block p (pixels p3 p2 p1 p0) and block q (pixels q0 q1 q2 q3)]
Block p or q intra-coded?
– Yes: block boundary is also a macroblock boundary? Yes: Bs=4, No: Bs=3
– No: coefficients coded in block p or q? Yes: Bs=2
– No: ref(p)!=ref(q) or |V(p,x)-V(q,x)| ≥ 1 pel or |V(p,y)-V(q,y)| ≥ 1 pel? Yes: Bs=1, No: Bs=0
Filtering modes: Strong Mode (Bs=4), Standard Mode (Bs=1~3), No Filtering Mode (Bs=0); legend: filtered pixel, original pixel
Each sub-edge between two adjacent 4x4 luma sub-blocks shares one Boundary Strength (BS) value
The BS value, along with two threshold variables, α and β, decides the filtering strength of the sub-edge
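The BS decision for one sub-edge can be written as a short function. This is a sketch of the standard decision rules for progressive frames; the parameter names are hypothetical, and motion vectors are assumed to be stored in quarter-pel units, so a difference of 4 corresponds to 1 pel.

```python
def boundary_strength(p_intra, q_intra, on_mb_boundary,
                      p_has_coeffs, q_has_coeffs,
                      same_reference, mv_p, mv_q):
    """Return the BS (0..4) for the sub-edge between blocks p and q."""
    if p_intra or q_intra:
        return 4 if on_mb_boundary else 3
    if p_has_coeffs or q_has_coeffs:
        return 2
    # mv_p and mv_q are (x, y) tuples in quarter-pel units (assumption).
    if (not same_reference
            or abs(mv_p[0] - mv_q[0]) >= 4
            or abs(mv_p[1] - mv_q[1]) >= 4):
        return 1
    return 0
```

None of the inputs depend on pixel values, which is why this calculation can run ahead of, and in parallel with, the filtering of the previous sub-block.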
29
Hardware Architecture (Pipeline organization) 5/
Bs 1,2,3 filter
[Figure: Bs 1~3 filter datapath. Inputs p0, p1, p2, q0, q1, q2, tc0, ap and aq; differences q0-p0, p2-p1 and q2-q1 are shifted (<<2, <<1, >>1, >>3) and summed to give Δ0, Δp1 and Δq1, which are clipped (Clip3 with tc or tc0) and added to the inputs; Clip1 produces the outputs p0', q0', and p1', q1' are the remaining outputs.]
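The Bs 1~3 (weak) luma filter of the standard can be sketched as follows. These are the unmodified standard equations for 8-bit samples, not the authors' optimized datapath; the function names and argument layout are assumptions, and tc0 is taken as already looked up from the table.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def clip1(x):
    """Clamp to the 8-bit sample range."""
    return clip3(0, 255, x)


def bs123_filter_luma(p, q, tc0, beta):
    """p = (p0, p1, p2), q = (q0, q1, q2); returns the filtered tuples."""
    p0, p1, p2 = p
    q0, q1, q2 = q
    ap = abs(p2 - p0) < beta
    aq = abs(q2 - q0) < beta
    tc = tc0 + int(ap) + int(aq)

    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    new_p0 = clip1(p0 + delta)
    new_q0 = clip1(q0 - delta)

    avg = (p0 + q0 + 1) >> 1
    new_p1, new_q1 = p1, q1
    if ap:  # p1 is only modified when the p side is smooth
        new_p1 = p1 + clip3(-tc0, tc0, (p2 + avg - (p1 << 1)) >> 1)
    if aq:  # q1 is only modified when the q side is smooth
        new_q1 = q1 + clip3(-tc0, tc0, (q2 + avg - (q1 << 1)) >> 1)
    return (new_p0, new_p1, p2), (new_q0, new_q1, q2)
```

Python's >> on negative integers is an arithmetic shift, which matches the behaviour the standard's equations assume.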
30
Deblocking Filter Algorithm 4/4
Boundary strength across horizontal edges
– The boundary strength is calculated for each sub-edge for the luma component
– It is reused for the chroma components in a 2:1 ratio for the 4:2:0 format
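The 2:1 reuse can be illustrated with a one-line mapping. This sketch assumes the 4:2:0 format, where chroma sample k along an 8-sample edge corresponds to luma sample 2k along the 16-sample luma edge; the function name is hypothetical.

```python
def chroma_bs(luma_bs, k):
    """BS for chroma sample k (0..7) along an edge.

    luma_bs holds the four BS values of the luma sub-edges on the same edge.
    """
    # Luma sample 2k lies in luma sub-edge (2k) // 4, which equals k // 2.
    return luma_bs[k // 2]
```

For luma BS values [4, 3, 2, 1], the eight chroma samples use 4, 4, 3, 3, 2, 2, 1, 1, so the BS changes every two chroma samples.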