
1

A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER

Kefalas Nikolaos, Theodoridis George

VLSI Design Lab., Electrical & Computer Eng. Department

University of Patras, Greece

2

Outline

1. Deblocking filter algorithm

2. Filtering ordering

3. Memory organization

4. Pipelined architecture

5. Synthesis results and comparisons

6. Conclusions and future work

3

Deblocking Filter Algorithm (1/3)

Deblocking Filter

The deblocking filter is used in H.264/AVC to reduce blocking artifacts

– It improves subjective and objective quality and typically reduces the bit-rate by 5-10%

It is performed on a macroblock (MB) basis after the macroblock reconstruction stage is complete

It includes a large number of data-dependent branches – each 4x4 pixel area is processed up to four times

It accounts for over one-third (1/3) of the total decoding time

4


Each MB is processed in 4x4 blocks

The vertical edges are filtered first, from left to right

– from edge V0 to edge V3

Then the horizontal edges are filtered, from top to bottom

– from edge H0 to edge H3

The eight pixels p3-p0 and q0-q3 that straddle the edge between two adjacent 4x4 sub-blocks are filtered at the same time

– The same process repeats for the chroma components

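[Figure: edges of the 16x16 luma (Y), 8x8 chroma (Cb) and 8x8 chroma (Cr) blocks of a macroblock. Vertical edges V0-V3 and horizontal edges H0-H3 are marked as "edge to be filtered" or "edge NOT to be filtered". Each sub-edge lies between pixels p3 p2 p1 p0 of one sub-block and q0 q1 q2 q3 of the adjacent sub-block, which belongs either to the current MB or to a neighboring MB.]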

5


Each sub-edge shares one boundary strength (BS) value

The BS, along with two thresholds α and β, decides the filtering strength of each sub-edge

– A filter samples flag (filterSamplesFlag) is calculated

Three filter types are used

– Strong filter (4- or 5-tap filter)

– Weak filter

– No filtering
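The decision described on this slide can be sketched as follows. This is a minimal illustration that assumes the filterSamplesFlag condition of the H.264/AVC standard; the function names and the example pixel values are hypothetical and are not taken from the slides.

```python
def filter_samples_flag(bs, alpha, beta, p1, p0, q0, q1):
    """Return True when the pixel line across the edge should be filtered.

    The edge is filtered only if BS is non-zero and the pixel differences
    are small enough to look like a coding artifact rather than a real edge.
    """
    return (
        bs != 0
        and abs(p0 - q0) < alpha
        and abs(p1 - p0) < beta
        and abs(q1 - q0) < beta
    )


def filter_type(bs, flag):
    """Map BS and the flag to one of the three filter types on the slide."""
    if not flag:
        return "none"
    return "strong" if bs == 4 else "weak"


# Hypothetical example values: a small step across the edge is filtered,
# a large step (a genuine image edge) is left untouched.
small_step = filter_samples_flag(4, 20, 6, 100, 102, 105, 106)
large_step = filter_samples_flag(4, 20, 6, 50, 50, 120, 120)
```

Because BS and the two thresholds do not depend on pixel values, only the three absolute differences have to be evaluated per pixel line.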








7

Filtering Order

During filtering, all four sub-edges of each sub-block are filtered, and almost all pixels are involved and updated

A suitable filtering order is needed to:

– Reduce the size of the on-chip memory for buffering intermediate data

– Increase data reuse

– Reduce the external memory accesses

– Simplify control and steering logic

– Avoid pipeline stalls due to data and resource hazards

8

Proposed Filtering Order

The vertical sub-edges are filtered in raster-scan order, followed by the horizontal ones

The filtering direction does not change until all vertical edges of luma and chroma have been filtered

The proposed order complies with the standard
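As a rough sketch of the vertical pass, the code below enumerates the (P, Q) operand pairs in raster-scan order. It assumes the block labels used in the slides' figures (luma B0-B15, Cb B16-B19, Cr B20-B23, left neighbors L0-L3) and is meant to agree with the PIN/QIN rows of the vertical edge filter table; the function names are hypothetical.

```python
def vertical_pass(blocks_per_row, first_block, n_rows):
    """List (P, Q) block names for one component, row by row (raster scan).

    For the first sub-block of each row, P is the left neighbor L<row>;
    otherwise P is the sub-block immediately to the left of Q.
    """
    pairs = []
    for row in range(n_rows):
        for col in range(blocks_per_row):
            index = first_block + row * blocks_per_row + col
            q = f"B{index}"
            p = f"L{row}" if col == 0 else f"B{index - 1}"
            pairs.append((p, q))
    return pairs


def proposed_vertical_order():
    """Vertical sub-edges 0..23: luma first, then Cb, then Cr."""
    luma = vertical_pass(4, 0, 4)    # B0..B15
    cb = vertical_pass(2, 16, 2)     # B16..B19
    cr = vertical_pass(2, 20, 2)     # B20..B23
    return luma + cb + cr


order = proposed_vertical_order()
```

The list index is the sub-edge number, so `order[4]` is the pair filtered as sub-edge 4.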

[Figure: Proposed Ordering. Sub-edge numbering over the 16x16 luma (Y), 8x8 chroma (Cb) and 8x8 chroma (Cr) blocks, with sub-blocks B0-B23, left neighbors L0-L3 and upper neighbors U0-U3. Vertical sub-edges: 0-15 (Y), 16-19 (Cb), 20-23 (Cr); horizontal sub-edges: 24-39 (Y), 40-43 (Cb), 44-47 (Cr).]

[Figure: Original Ordering. Same layout. Y: vertical sub-edges 0-15, then horizontal sub-edges 16-31; Cb: sub-edges 32-39; Cr: sub-edges 40-47.]








10

Memory Organization (1/2)

Four single-port memories are employed (sizes in bits)

– Current-A (CM-A): 96x32

– Current-B (CM-B): 96x32

– Left_mem (LM): 32x32

– Upper_mem (UM): 2xFWx32 + 2x(FW/16)x32, where FW is the frame width

Transpose buffers TR-P and TR-Q (4x32) – typical systolic array

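[Figure: block diagram showing the 4-stage filter with inputs P and Q, the BS unit (Bs, α, β, tc0), the Current A, Current B, Upper and Left memories with their read (R) and write (W) ports, the transpose buffers TR-P and TR-Q, the reconstructed pixels arriving from the previous stage, the coding information input, and the reference memory.]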

All internal buses are 32 bits
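The memory sizes on this slide can be related to the macroblock geometry with a short calculation. The sketch below is one plausible reading, not the authors' derivation: it assumes 8-bit samples packed four to a 32-bit word, and the function names are hypothetical.

```python
WORD_BITS = 32
PIXEL_BITS = 8  # assumption: 8-bit samples, four per 32-bit word


def current_mb_words():
    """Words needed for one macroblock: 16x16 luma plus two 8x8 chroma blocks."""
    pixels = 16 * 16 + 2 * 8 * 8
    return pixels * PIXEL_BITS // WORD_BITS


def left_mem_words():
    """Words for the four pixel columns left of the MB (16 Y + 8 Cb + 8 Cr rows)."""
    rows = 16 + 8 + 8
    return rows * 4 * PIXEL_BITS // WORD_BITS


def upper_mem_bits(frame_width):
    """Upper memory size in bits, using the expression given on the slide."""
    return 2 * frame_width * WORD_BITS + 2 * (frame_width // 16) * WORD_BITS


full_hd_upper_bits = upper_mem_bits(1920)
```

Under these assumptions the 96-word and 32-word depths fall out directly, and only the upper memory scales with the frame width.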

11

Memory Organization (2/2)

[Figure: frame layout of the luma component (Frame Width x Frame Height) and of the chroma components (Frame Width/2 x Frame Height/2), showing the regions held in the Upper Memory, Left Memory, Current Memory A and Current Memory B.]








13

Algorithm Features

Computationally intensive operations of the deblocking filter algorithm

– LUT operations: retrieve the values α(IndexA), β(IndexB), c1(IndexA, BS)

– BS calculation

– Weak filter, BS(1~3): filtering, δ calculation and clipping operations

– Strong filter, BS(4)

The introduced pipeline exploits specific algorithmic features

– BS is the same for all micro-edges of a sub-edge for the luma component

– BS of the luma component is reused for the chroma components

– For the (4:2:0) format, BS changes every 2 micro-edges in the chroma components

14

Proposed Pipeline Organization

[Figure: five-stage pipeline (1st to 5th stage). Blocks: Bs Unit, Memory Read, Threshold Calculation, Delta Calculation, Bs 1~3 filter (1st, 2nd and 3rd stage), Bs 4 filter (1st and 2nd stage), Memory Write. Signals: Bs, α, β, tc0 from the Bs Unit; P_in, Q_in into the pipeline; P_out, Q_out out of the pipeline.]

15

Pipeline Operation

Each sub-block needs 4 cycles to be processed

The BS unit spends 4 cycles (BS calculation & LUT operations)

– BS and LUT operations do not depend on pixel values

BS calculation & LUT operations are overlapped with the filtering operations for the luma component

Four initialization cycles are needed to calculate the BS and the α, β, c1 for the first luma sub-block


16

BS=4 Filtering

[Figure: BS=4 filter datapath. A shared adder tree over p0-p3 and q0-q3 builds the partial sums sum1-sum13, which are shifted (<<1, >>2, >>3) to produce the outputs labelled p0', p1', p2', q0', q1', q2' (bs4s) and p0', q0' (bs4w). The datapath occupies the Bs 4 1st and 2nd stages of the pipeline.]

Filter equations modified to improve delay & area

BS=4 – 13 adders instead of 28

Total components – Adders: 13+14+4=31
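To illustrate how sharing partial sums reduces the adder count, the sketch below compares the P-side strong luma equations of the H.264/AVC standard with a factored form that reuses two sums. The factoring is only an illustration of the idea; it is not the authors' 13-adder network, and the function names are hypothetical.

```python
def strong_p_direct(p3, p2, p1, p0, q0, q1):
    """P-side strong-filter outputs written as in the standard."""
    p0n = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
    p1n = (p2 + p1 + p0 + q0 + 2) >> 2
    p2n = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    return p0n, p1n, p2n


def strong_p_shared(p3, p2, p1, p0, q0, q1):
    """Same outputs, computed from shared partial sums."""
    s_pq = p0 + q0                 # reused by both sums below
    s_a = p2 + p1 + s_pq           # p2 + p1 + p0 + q0
    s_b = p1 + q1 + s_pq           # p1 + p0 + q0 + q1
    p0n = (s_a + s_b + 4) >> 3
    p1n = (s_a + 2) >> 2
    p2n = (((p3 + p2) << 1) + s_a + 4) >> 3
    return p0n, p1n, p2n


# Hypothetical sample values used to compare the two forms.
samples = [
    (100, 100, 100, 100, 100, 100),
    (10, 20, 30, 40, 50, 60),
    (255, 0, 255, 0, 255, 0),
    (12, 200, 7, 90, 91, 3),
]
results_match = all(strong_p_direct(*s) == strong_p_shared(*s) for s in samples)
```

The Q side is symmetric, so the sum p0 + q0 can be shared between both sides as well.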

17

Pipeline Benefits

LUT operations and BS calculation are not squeezed into a single pipeline stage

– The BS unit has 4 cycles

The filtering operations are spread over three pipeline stages

The BS values are reused for filtering the chroma components

The original filtering equations are modified (to improve performance & area)

The proposed ordering simplifies control logic and memory addressing, avoiding any increase of the critical path


18

Edge Filter Process

Block Cycle 0 1 2 3 4

Filtered Sub-edge

0 1 2 3 4

PIN L0 B0 B1 B2 L1

QIN B0 B1 B2 B3 B4

TR_P-W B0 B1 B2

TR_P-R B0 B1

TR_Q-W B3

TR_Q-R

CM_A-R B0 B1 B2 B3 B4

CM_B-W B0 B1

LM-R L0 L1

LM-W

UPM-W

Ext_M-W L0


19

Vertical Edge Filter Process

Total cycles = 4 x 27 = 108

– If a two-port memory were used, the total would be 4 x 24 = 96 cycles, which is the optimum

Block Cycle 0 1 2 3 4 5 6 12 13 14 22 23 24 25 26

Filtered Sub-edge 0 1 2 3 4 5 6 12 13 14 22 23

PIN L0 B0 B1 B2 L1 B4 B5 … L3 B12 B13 … L1 B22

QIN B0 B1 B2 B3 B4 B5 B6 B12 B13 B14 B22 B23

TR_P-W B0 B1 B2 B4 … B10 L3 B12 … B20 L1 B22

TR_P-R B0 B1 B2 B9 B10 L3 B20 B22

TR_Q-W B3 … B11 … B21 B23

TR_Q-R 3 B11 B19 B21 B23

CM_A-R B0 B1 B2 B3 B4 B5 B6 … B12 B13 B14 … B22 B23

CM_B-W B0 B1 B2 B3 B9 B10 B11 B19 B20 B21 B22 B23

LM-R L0 L1 … L3 … L1

LM-W

UPM-W L3 L1

Ext_M-W L0 L1

20

Processing Cycles

Vertical edges: 108 cycles

Horizontal edges: 108 cycles

Initialization: 10 cycles

– 6 cycles to fetch coding info and initialize control

– 4 cycles for the 1st BS calculation

Normal operation: 226 cycles

For the last row (edges 27, 31, 35, 41, 45): 5x4=20 extra cycles

– Resource hazard (bus conflict)

For the last MB in a frame, 12 extra cycles are needed (edges 39, 43, 47)

– Resource hazard (bus conflict)

Worst case total cycles: 258
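The cycle counts on this slide add up as in the following sketch. It simply restates the slide's arithmetic; the function name and the two flags are hypothetical, and the worst case is taken to be the last MB of the frame, which also lies in the last row.

```python
CYCLES_PER_SUBEDGE = 4


def mb_cycles(last_row=False, last_mb_in_frame=False):
    """Cycles to filter one macroblock, following the slide's breakdown."""
    vertical = 108
    horizontal = 108
    init = 6 + 4  # fetch coding info / initialize control, plus 1st BS calculation
    total = vertical + horizontal + init
    if last_row:
        total += 5 * CYCLES_PER_SUBEDGE  # edges 27, 31, 35, 41, 45
    if last_mb_in_frame:
        total += 3 * CYCLES_PER_SUBEDGE  # edges 39, 43, 47
    return total


normal = mb_cycles()
bottom_row = mb_cycles(last_row=True)
worst_case = mb_cycles(last_row=True, last_mb_in_frame=True)
```

Only macroblocks in the bottom row pay the extra cycles, so the average over a frame stays close to the normal-operation figure.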









22

Experimental Setup

Synthesis setup

– Synopsys Design Compiler

– TSMC 0.18 µm

FPGA proven

– Stand-alone, compared with the JM reference software

– It has also been verified as part of an H.264 hardware encoder

– It achieves 280 MHz on a Virtex-5, speed grade 3

23

Synthesis Results and Comparisons

Parameter | [5] (2008) | [6] (2008) | [7] (2009) | [8] (2006) | Proposed
Pipeline stages | 5 | 5 | 4 | 5 | 5
Filtering order | Hybrid | Hybrid | Hybrid | Hybrid | Impr. Sequential
Local RAMs (bits) | 1P^1 2x96x32, 1P 32x32 | 1P 96x32, 2P^1 32x32 | 1P 96x32, 1P 32x32 | 1P 96x32, 2P 32x32 | 1P 2x96x32, 1P 32x32
Upper neighbour RAM (bits) | 1P 2FWx32 | N/A | 1P 2FW^2 x32 | 1P 1.5FWx32 | 1P 2FWx32
Coding information RAM (bits) | N/A | N/A | N/A | N/A | 2(FW/16)x32^7
Transpose buffers (4x32 bits) | 7 | 1 | 5 | 2 | 2
Technology (μm) | 0.18 | 0.18 | 0.18 | 0.18 | 0.18
Gate count (10^3 gates) | 21.5 | 18.7 | 26 | 20.9 | 19.2
Kernel processing (cycles/MB) | 204 | 210/222^3 | 192^4 | 214/246^5 | 226/246^6
Max frequency (MHz) | 200 | 200 | 220 | 100 | 400 (1.8x up to 4x)
Throughput (10^3 MB/s) | 980 | 952 | 1146 | 467 | 1796 (1.5x up to 3.8x)
Fps – Full HD (1920x1080) | 120 | 102 | 140 | 57 | 216
Fps – Ultra HD (3840x2160) | 30 | 25 | 35 | 14 | 54

Notes: 1: 1P: Single-Port, 2P: Two-Port. 2: FW: Frame width. 3: Filtering cycles only. 4: Filtering cycles only. 5: It takes 246 cycles to filter a MB at the right frame boundary. 6: It takes 246 cycles to filter a MB at the bottom frame row. 7: The 2x(FW/16)x32 bits are stored in upper memory.
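The frame rates in the table can be cross-checked from the clock frequency and the cycles per macroblock. The sketch below assumes that the frame dimensions are rounded up to whole 16x16 macroblocks and that the frame rate is truncated to an integer; both are assumptions, and the function names are hypothetical.

```python
def macroblocks(width, height):
    """Number of 16x16 macroblocks, rounding each dimension up."""
    return ((width + 15) // 16) * ((height + 15) // 16)


def frames_per_second(freq_hz, cycles_per_mb, width, height):
    """Whole frames per second at a given clock and per-MB cycle count."""
    return freq_hz // (cycles_per_mb * macroblocks(width, height))


# Proposed design: 400 MHz, 226 cycles/MB in normal operation.
full_hd_fps = frames_per_second(400_000_000, 226, 1920, 1080)
ultra_hd_fps = frames_per_second(400_000_000, 226, 3840, 2160)
```

Under the same assumptions, 200 MHz with 204 cycles/MB gives the Full HD value listed for [5].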

24

Conclusions

A novel high-speed pipelined architecture for the H.264/AVC deblocking filter is proposed

It operates at 400 MHz and occupies 19.2 Kgates in 0.18 µm CMOS technology

It achieves 216 and 54 fps for Full HD and Ultra HD frames, respectively

Only single-port memories are employed

No external memory accesses are needed during filtering

– Parameters and neighbors are stored internally

– Only fully filtered data are written to external memories

25

Questions ???

26

Hardware Architecture (Pipeline organization) 5/

Threshold Calculation

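[Figure: threshold calculation datapath. Subtractors and absolute-value units form |p0-q0|, |p1-p0|, |q1-q0|, |p2-p0| and |q2-q0|. Comparators test |p0-q0| < α and the other four differences < β, producing filterSamplesFlag and the ap and aq strength flags for Bs 1-3; a further comparator tests |p0-q0| < (α>>2)+2 for the Bs 4 filtering strength.]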

27

BS=4 Filtering

[Figure: BS=4 filter adder-tree datapath with partial sums sum1-sum9, as on the earlier BS=4 Filtering slide.]

28

Deblocking Filter Algorithm 3/3

[Figure: the eight pixels p3 p2 p1 p0 | q0 q1 q2 q3 across a 4x4 block boundary between block p and block q, shown as original pixels and as filtered pixels P3 P2 P1 P0 Q0 Q1 Q2 Q3.]

Boundary strength decision:

– Block p or q intra-coded? If yes: is the block boundary also a macroblock boundary? Yes: Bs=4, No: Bs=3

– Otherwise, coefficients coded in block p or q? Yes: Bs=2

– Otherwise, ref(p)!=ref(q) or |V(p,x)-V(q,x)| ≥ 1 pel or |V(p,y)-V(q,y)| ≥ 1 pel? Yes: Bs=1, No: Bs=0

Filtering modes: Strong Mode (Bs=4), Standard Mode (Bs=1-3), No Filtering Mode (Bs=0)

Each sub-edge between two adjacent 4x4 luma sub-blocks shares a Boundary Strength (BS)

The BS value, along with two threshold variables α and β, decides the filtering strength of the sub-edge
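The boundary strength decision on this slide can be sketched as a small function. The `Block` record and its field names are hypothetical, and motion vectors are assumed to be stored in quarter-pel units, so a difference of 1 pel corresponds to 4.

```python
from dataclasses import dataclass


@dataclass
class Block:
    intra: bool          # block is intra-coded
    coded_coeffs: bool   # block contains coded coefficients
    ref: int             # reference picture identifier
    mv: tuple            # (x, y) motion vector, quarter-pel units (assumption)


def boundary_strength(p, q, on_mb_boundary):
    """Return Bs (0..4) for the edge between blocks p and q."""
    if p.intra or q.intra:
        return 4 if on_mb_boundary else 3
    if p.coded_coeffs or q.coded_coeffs:
        return 2
    mv_differs = (abs(p.mv[0] - q.mv[0]) >= 4
                  or abs(p.mv[1] - q.mv[1]) >= 4)
    if p.ref != q.ref or mv_differs:
        return 1
    return 0


# Hypothetical example blocks.
intra_block = Block(True, False, 0, (0, 0))
inter_block = Block(False, False, 0, (0, 0))
moved_block = Block(False, False, 0, (4, 0))
```

None of the inputs are pixel values, which is why this calculation can run ahead of the filtering stages.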

29

Hardware Architecture (Pipeline organization) 5/

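Bs 1,2,3 filter

[Figure: Bs 1-3 filter datapath. Subtractors, adders and shifts form Δ0 from q0-p0 (<<2), p1, q1 and the rounding constant 4 (>>3), and Δp1 and Δq1 from p2-p1, q2-q1 and the average of p0 and q0. Δ0 is clipped with Clip3 to the range set by tc, and Δp1 and Δq1 to the range set by tc0. The outputs are p0' and q0' (after Clip1), p1' and q1'. tc is derived from tc0, ap and aq.]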
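For reference, the weak (Bs 1-3) luma filter can be written out as below. This is a sketch of the equations in the H.264/AVC standard for 8-bit samples, not of the authors' modified datapath; the function names and the example values are hypothetical.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def clip1(x):
    """Clamp to the 8-bit sample range (assumes 8-bit video)."""
    return clip3(0, 255, x)


def weak_filter_luma(p2, p1, p0, q0, q1, q2, beta, tc0):
    """Return (p1', p0', q0', q1') for one pixel line, Bs in 1..3."""
    ap = abs(p2 - p0)
    aq = abs(q2 - q0)
    tc = tc0 + int(ap < beta) + int(aq < beta)
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    p0n = clip1(p0 + delta)
    q0n = clip1(q0 - delta)
    avg = (p0 + q0 + 1) >> 1
    p1n, q1n = p1, q1
    if ap < beta:
        p1n = p1 + clip3(-tc0, tc0, (p2 + avg - (p1 << 1)) >> 1)
    if aq < beta:
        q1n = q1 + clip3(-tc0, tc0, (q2 + avg - (q1 << 1)) >> 1)
    return p1n, p0n, q0n, q1n


# Hypothetical example: a step of 8 between two flat regions.
step_result = weak_filter_luma(100, 100, 100, 108, 108, 108, beta=4, tc0=2)
```

The clipping to tc and tc0 bounds how far each pixel can move, which is what distinguishes this mode from the strong filter.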

30

Deblocking Filter Algorithm 4/4

Boundary strength across horizontal edges

– The boundary strength is calculated for each sub-edge for the luma component

– It is reused for the chroma components in a 2:1 ratio for the 4:2:0 format
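The 2:1 reuse can be sketched as an index mapping. The code assumes 4:2:0 sampling, where an 8-pixel chroma edge covers the same area as a 16-pixel luma edge; the function name is hypothetical.

```python
def chroma_bs_for_edge(luma_bs_edge):
    """Map the 4 luma BS values of a 16-pixel edge to 8 chroma pixel lines.

    Each luma sub-edge spans 4 luma pixels, i.e. 2 chroma pixels in 4:2:0,
    so the BS value changes every 2 chroma micro-edges.
    """
    return [luma_bs_edge[k // 2] for k in range(8)]


# Hypothetical BS values for the four luma sub-edges of one edge.
chroma_bs = chroma_bs_for_edge([4, 3, 1, 0])
```

No separate BS calculation is needed for Cb and Cr under this mapping.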
