[ieee 2007 25th international conference on computer design iccd 2007 - lake tahoe, ca, usa...



Page 1: [IEEE 2007 25th International Conference on Computer Design ICCD 2007 - Lake Tahoe, CA, USA (2007.10.7-2007.10.10)] 2007 25th International Conference on Computer Design - Speed-area

Speed-area optimized FPGA implementation for Full Search Block Matching

Santosh Ghosh and Avishek Saha
Department of Computer Science and Engineering, IIT Kharagpur, WB, India, 721302

{santosh, avishek}@cse.iitkgp.ernet.in

Abstract

This paper presents an FPGA-based hardware design for Full Search Block Matching (FSBM) based Motion Estimation (ME) in video compression. The significantly higher resolution of HDTV-based applications is achieved by using FSBM-based ME. The proposed architecture uses a modification of the Sum-of-Absolute-Differences (SAD) computation in FSBM such that the total number of addition/subtraction operations is drastically reduced. This successfully optimizes the conflicting design requirements of high throughput and small silicon area. Comparison results demonstrate the superior performance of our architecture. Finally, the design of a reconfigurable block matching hardware is discussed.

1 Introduction

Rapid growth in High-Definition (HD) digital video applications has led to an increased interest in portable HD-quality encoder design. An HD-compatible MPEG2 MP@HL encoder uses Full Search Block Matching Algorithm (FSBMA) based Motion Estimation (ME). The ME module accounts for more than 80% of the computational complexity of a typical video encoder. Moreover, the power consumption of an FSBM-based encoder is prohibitively high, particularly for portable implementations. Hence, efficient ME processor cores need to be designed to realize portable HDTV video encoders.

A parameterizable FSBM ASIC design that solves the input bandwidth problem by using on-chip line buffers was proposed in [15]. [18] proposed a family of modular VLSI architectures which allow sequential inputs but perform parallel processing with 100 percent efficiency. A systolic mapping procedure to derive FSBM architectures was proposed in [4]. The designs of ([2], [20]) and [5] focused on the reduction of pin counts by sharing memory units and 2-dimensional data reuse, respectively. [19] improved the memory bandwidth by using an overlapped data flow of the search area, which increased the processing element (PE) utilization. A low-latency high-throughput tree architecture for FSBM was proposed in [3]. Both [13] and [1] proposed low-power architectures based on the removal of unnecessary computations. Finally, a novel low-power parallel tree FSBM architecture was proposed in [6], which exploited the spatial data correlations within parallel candidate block searches for data sharing and thus effectively reduced data access bandwidth and power consumption. [7] proposed an FPGA architecture to implement parallel computation of FSBM. Systolic array and novel OnLine Arithmetic (OLA) based designs for FSBM were proposed in [8] and [9], respectively. Customizable low-power FPGA cores were proposed by [10]. [11] evaluated the performance of the FSBM hardware architectures of [4] implemented on a Xilinx FPGA. The results show that real-time motion estimation for CIF (352 × 288) sequences can be achieved with 2-D systolic arrays and a moderate-capacity (250 k gates) FPGA chip. An adder-tree based 16 × 1 SAD FPGA hardware was implemented by [17].

The aforementioned FSBM architectures can be divided into two categories, namely, FPGA [7, 8, 9, 10, 11, 17] and ASIC [4, 15, 18, 2, 3, 20, 5, 19, 13, 1, 6]. This work uses FPGA technology to implement a high-performance ME hardware with due consideration to (a) processing speed and (b) silicon area. Almost all of the aforementioned VLSI architectures optimize only one of these parameters. The novelty of the proposed architecture lies in its combined optimization of these conflicting design requirements. The proposed hardware uses an initially-split pipeline to reduce the processing cycles for each MB and thus increases the throughput. In addition, this design requires fewer adders and only one Absolute Difference (AD) PE, which drastically reduces the silicon area when compared to other existing designs. The pixels of the search regions have been organized in memory banks such that two sets of 128-bit data (16 8-bit pixels each) can be accessed in each clock cycle.

Section 2 gives an overview of FSBM-based motion estimation. Section 3 presents a brief discussion of the SAD modifications and describes the proposed FSBM hardware. The implementation and comparative results are presented in Section 4. Section 5 presents a reconfigurable address generator. Finally, Section 6 concludes this paper.

1-4244-1258-7/07/$25.00 ©2007 IEEE


2 FSBM-based Motion Estimation

Motion-compensated video compression models the pixel motion within the current picture as a translation of the pixels within a previous picture. The motion vector is obtained by minimizing a cost function that measures the mismatch between the current MB in the current frame and the candidate block in the reference frame. SAD, the most popular cost function, between the pixels of the current MB x(i, j) and the search region y(i, j) can be expressed as,

$$\mathrm{SAD}(u, v) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| x(i, j) - y(i + u, j + v) \right| \tag{1}$$

where (u, v) is the displacement between these two blocks. Thus, each search requires $N^2$ absolute differences and $(N^2 - 1)$ additions. The FSBMA exhaustively evaluates all possible search locations and hence is optimal in terms of reconstructed video quality and compression ratio. High computational requirements, a regular processing scheme and simple control structures make FSBM a preferred choice for hardware implementation.
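As an illustration of Eq. 1 and the exhaustive search, the following is a minimal software sketch, not the paper's hardware. The function names `sad` and `full_search` are hypothetical, and the offsets (u, v) are assumed to be non-negative indices into the search region rather than signed displacements in the range −p to p.

```python
def sad(cur, ref, u, v):
    """Eq. 1: sum of |x(i, j) - y(i + u, j + v)| over an N x N block."""
    n = len(cur)
    return sum(abs(cur[i][j] - ref[i + u][j + v])
               for i in range(n) for j in range(n))


def full_search(cur, ref):
    """Evaluate every candidate position; return (minimum SAD, (u, v))."""
    n = len(cur)
    positions = len(ref) - n + 1          # candidate offsets per axis
    best = None
    for u in range(positions):
        for v in range(positions):
            cost = sad(cur, ref, u, v)
            if best is None or cost < best[0]:
                best = (cost, (u, v))
    return best


# Tiny example: a 2 x 2 block taken from offset (1, 2) of a 4 x 4 region.
ref = [[10, 11, 12, 13],
       [20, 21, 22, 23],
       [30, 31, 32, 33],
       [40, 41, 42, 43]]
cur = [[22, 23],
       [32, 33]]
print(full_search(cur, ref))   # (0, (1, 2))
```

For N = 16 and a 48 × 48 search region the same loops visit 33 × 33 = 1089 positions, each costing 256 absolute differences, which is the workload the hardware is designed to reduce.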

Table 1: Execution profile of a typical video encoder

| SAD | ME/MC | DCT/IDCT | Q/IQ | VLC/VLD | Others |
|---|---|---|---|---|---|
| 72.28% | 16.85% | 6.17% | 2.35% | 1.45% | 0.32% |

The execution profile of a standard video encoder, obtained using the GNU gprof tool, is shown in Table 1. The table shows that motion estimation is the most computationally expensive module in a typical video encoder. In addition, SAD computations take the maximum time due to the complex nature of the absolute-value operation and the subsequent multitude of additions.

3 Proposed FSBM Architecture

In this section we delineate our proposed speed-area optimized FSBM architecture. The first subsection briefly explains the SAD modification and the MB searching technique. The subsequent subsections describe the proposed hardware and the memory organization.

3.1 SAD modification

This section presents a modification to the SAD computation. From Eq. 1, the SAD can be bounded from below as,

$$\mathrm{SAD}(u, v) \ge \left| \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i, j) - \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} y(i + u, j + v) \right| \tag{2}$$

A detailed proof of the above derivation can be found in [12]. It follows that if,

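$$\left| \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} x(i, j) - \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} y(i + u, j + v) \right| \ge \mathrm{SAD}_{min}$$

then,

$$\mathrm{SAD}(u, v) \ge \mathrm{SAD}_{min} \quad \text{(by Eq. 2)} \tag{3}$$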

where $\mathrm{SAD}_{min}$ denotes the current minimum SAD value. Thus, if Eq. 3 is satisfied, then the SAD computation at the (u, v)th location may be skipped.
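The skip test can be sketched in software as follows. This is an illustrative model under stated assumptions, not the authors' implementation: `block_sum`, `sad` and `search_with_skip` are hypothetical names, offsets are non-negative indices into the search region, and the test data are synthetic.

```python
def block_sum(img, top, left, n):
    """Sum of the n x n block whose top-left corner is (top, left)."""
    return sum(img[top + i][left + j] for i in range(n) for j in range(n))


def sad(cur, ref, u, v):
    """Eq. 1 for the candidate block at offset (u, v)."""
    n = len(cur)
    return sum(abs(cur[i][j] - ref[i + u][j + v])
               for i in range(n) for j in range(n))


def search_with_skip(cur, ref):
    """Full search that skips a location when the bound of Eq. 2 already
    reaches the current minimum (the condition leading to Eq. 3)."""
    n = len(cur)
    positions = len(ref) - n + 1
    cur_sum = block_sum(cur, 0, 0, n)
    best_cost, best_pos, skipped = None, None, 0
    for u in range(positions):
        for v in range(positions):
            bound = abs(cur_sum - block_sum(ref, u, v, n))
            if best_cost is not None and bound >= best_cost:
                skipped += 1          # SAD(u, v) >= SAD_min, cannot improve
                continue
            cost = sad(cur, ref, u, v)
            if best_cost is None or cost < best_cost:
                best_cost, best_pos = cost, (u, v)
    return best_cost, best_pos, skipped


# Synthetic 8 x 8 region; the 4 x 4 current block is copied from it,
# so the minimum SAD is 0.
n, size = 4, 8
ref = [[(37 * i + 91 * j) % 256 for j in range(size)] for i in range(size)]
cur = [row[3:3 + n] for row in ref[2:2 + n]]
cost, pos, skipped = search_with_skip(cur, ref)
print(cost, pos, skipped)
```

Because a skipped location can never have a SAD below the current minimum, the result is the same minimum that an exhaustive search would find.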

In addition, if X(u, v) is the sum of pixel intensities at the (u, v)th MB location, then this sum can be derived from X(u − 1, v) by subtracting and adding the intensity sums of columns at specific positions. Based on this fact, [12] proposes a search strategy to efficiently derive and compute the MB sums at successive locations. The MB search technique used in our proposed design adopts this particular approach.
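The incremental update can be sketched as a sliding-window sum. This is a minimal illustration of the idea only; the exact scan order of [12] is not reproduced here, and the names `column_sum` and `sliding_block_sums` are hypothetical.

```python
def column_sum(region, top, col, n):
    """Sum of n vertically adjacent pixels of one column, starting at row `top`."""
    return sum(region[top + i][col] for i in range(n))


def sliding_block_sums(region, top, n):
    """Sums of all n x n blocks in the horizontal strip starting at row `top`.

    Only the first block is summed in full; each later sum is derived from
    its predecessor by removing one column and adding one column.
    """
    width = len(region[0])
    total = sum(column_sum(region, top, c, n) for c in range(n))
    sums = [total]
    for left in range(1, width - n + 1):
        total -= column_sum(region, top, left - 1, n)       # column leaving
        total += column_sum(region, top, left + n - 1, n)   # column entering
        sums.append(total)
    return sums


# 4 x 6 example region with region[i][j] = 6 * i + j and 2 x 2 blocks.
region = [[6 * i + j for j in range(6)] for i in range(4)]
print(sliding_block_sums(region, 0, 2))   # [14, 18, 22, 26, 30]
```

Each update touches 2N pixels instead of the N² pixels of a full block sum, which is what allows the hardware to produce one new sum per clock cycle from two column reads.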

3.2 Pipelined SAD Operator

The SAD hardware for FSBMA has been divided into eight sequential pipeline stages. It computes the initial full SAD for the first Search Location (SL) and derives the SAD sums for subsequent SLs. Fig. 1 shows the data path of the proposed SAD operator for N = 16.

Stages 1 to 4 of the proposed design have been split to facilitate parallel processing. Each half-stage (from Stage 1 to Stage 4) computes the sum of 16 pixel values per clock cycle. These partial sums are accumulated in the SR and MB registers of Stage 6, which are initialized to 0. For the first SAD calculation, Stage 5 simply passes the intermediate addition result of Stage 4 to Stage 6. This is achieved by setting the S0 control signal of Stage 6 to 0. Thus, the SAD sum of the candidate MB and the first SL can be computed in 6 (for the six stages of the pipeline) + 15 (to add 16 values) = 21 cycles. Thereafter, for every subsequent SL, the right and the left half-stages add the pixel intensities of the old and new rows/columns, respectively. At this point, Stage 5 is activated by enabling the S0 control signal. This stage takes the difference of the sums produced by the two half-stages and accumulates the result in the SR register of Stage 6. Stage 7 computes the AD between the older MB sum and the newly obtained SL sum. Finally, Stage 8 compares the new SAD with the existing $\mathrm{SAD}_{min}$ and stores the minimum SAD sum obtained so far. Thus, at each clock cycle, the proposed pipelined architecture computes one new SAD value and stores the minimum SAD. Hence, with a search range of p = 16, this hardware can find the best match for an MB in only $[(2p + 1)^2 - 1] + 23 = 1111$ clock cycles.
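The behaviour of one half-stage can be modelled as a pairwise adder tree. The sketch below is a software model under the assumption that Stages 1 to 4 of each half form a binary tree over 16 inputs; the function name `adder_tree` and the pixel values are hypothetical.

```python
def adder_tree(values):
    """Pairwise reduction of a power-of-two number of inputs.

    Returns the total and the number of adders used at each tree level,
    where one level corresponds to one pipeline stage in this model.
    """
    level = list(values)
    adders_per_level = []
    while len(level) > 1:
        level = [level[k] + level[k + 1] for k in range(0, len(level), 2)]
        adders_per_level.append(len(level))
    return level[0], adders_per_level


pixels = list(range(100, 116))        # 16 hypothetical 8-bit pixel values
total, adders = adder_tree(pixels)
print(total, adders)                  # 1720 [8, 4, 2, 1]
```

Under this reading, two half-stages would need 16, 8, 4 and 2 adders at the four levels, which appears consistent with the leading adder counts listed for the proposed design in Table 2.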


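[Figure 1 layout: pipeline stages (1) to (8). Stages 1 to 4 are two parallel adder trees, followed by a subtractor and a multiplexer selected by S0, an adder with the SR and MB registers, the AD unit that produces the SAD, and an a < b comparator with an output multiplexer.]

Figure 1: Data path of different pipeline stages of the proposed SAD unit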

3.3 Memory Organization

Our design adopts the MB scanning technique proposed in [12]. The pixels in the p = 16 search region are represented by $P_{i,j}$, where 1 ≤ i ≤ 48 and 1 ≤ j ≤ 48 (shown in Fig. 3). This search region has $(2p + 1)^2 = 33^2 = 1089$ search locations.

[Figure 3 layout: a 48 × 48 grid of pixels from $P_{1,1}$ to $P_{48,48}$, with row numbers 1 to 48 on the vertical axis and column numbers 1 to 48 on the horizontal axis.]

Figure 3: Position of pixels in the search region

Initially, the sum of the first search location is computed as $\sum_{j=1}^{16} \sum_{i=1}^{16} P_{i,j}$. Thereafter, to move left or right, the oldest column of the previous search location is subtracted from the new column of the new search location. This implies that, at every clock, we need to access two 128-bit (16 × 8) data words from the memory. Each 128-bit word represents part of one column in the search region (Fig. 3), e.g., [$P_{1,1}, P_{2,1}, P_{3,1}, ..., P_{16,1}$] is one such 128-bit word, which belongs to column 1 of the search region. It is observed that one of the columns numbered 17 to 32 is always accessed concurrently with a column from the rest of the pre-defined search region, i.e., columns 1 to 16 and 33 to 48. Therefore, the pixels have been organized in two different memory banks, as shown in Fig. 2. The data in these memory banks are organized in column-major format so that a whole column can be fetched by a single memory access. The memory controller generates the right address at every clock for both memory banks. The selected 384 bits (48 pixels of a single column of Fig. 3) of each bank are then multiplexed and the correct 16 pixels are passed on to the SAD processing unit.
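The bank assignment can be checked with a short sketch. It assumes 1-based column numbers, a 16-column window and rightward shifts only; the function name `bank_of` and the bank numbering are hypothetical.

```python
def bank_of(column):
    """Bank 2 holds columns 17..32; bank 1 holds columns 1..16 and 33..48."""
    return 2 if 17 <= column <= 32 else 1


N = 16
pairs = []
for left in range(1, 33):      # leftmost column of the current search location
    old_col = left             # column leaving the window on a right shift
    new_col = left + N         # column entering the window
    pairs.append((old_col, new_col, bank_of(old_col), bank_of(new_col)))

# A conflict would mean both columns live in the same single-port bank.
conflicts = [p for p in pairs if p[2] == p[3]]
print(len(pairs), len(conflicts))   # 32 0
```

Since the two columns needed for a shift are always 16 apart, exactly one of them falls in the range 17 to 32, so the two reads never target the same bank.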

When the search location moves down from the previous position, two sets of row pixels must be accessed. This is not possible in one clock with the column-major memory banks described above. It can be observed from Fig. 3 that either the first 16 pixels or the last 16 pixels of a single row have to be accessed for this purpose. It can also be observed that, for an even row number, the first 16 pixels ($P_{i,1}, P_{i,2}, ..., P_{i,16}$ when i is even) and, for an odd row number, the last 16 pixels ($P_{i,33}, P_{i,34}, ..., P_{i,48}$ when i is odd) are accessed to handle the downward movement of the search location. Hence, we have stored the required row values in two additional memory banks. One is 32 × 128 bits, to store 32 such row pixel sets, and the other is 16 × 128 bits, to store 16 such row pixel sets. Thus, the design needs only 768 bytes of overhead memory. The organization of these memory banks and the stored pixels is shown in Fig. 2.

[Figure 2 panels: (a) column-major bank holding columns 1 to 16 and 33 to 48, each entry containing rows 1 to 48 of one column; (b) row-major banks RB1 and RB3 holding 16 pixels of each of rows 1 to 16 and 33 to 48; (c) column-major bank holding columns 17 to 32; (d) row-major bank RB2 holding 16 pixels of each of rows 17 to 32.]

Figure 2: Organization of pixels in [(a), (c)] column-major and [(b), (d)] row-major format that are added or subtracted during the shift of the search to the left or right and down locations, respectively. (c) and (d) represent the corresponding second column/row memory banks, which are independent of the first column/row memory banks shown in (a) and (b), respectively.
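The 768-byte figure follows directly from the sizes of the two row banks, as the following worked arithmetic shows (assuming 8 bits per byte and no padding):

```latex
\[
(32 + 16) \times 128\ \text{bits} = 6144\ \text{bits}
= \frac{6144}{8}\ \text{bytes} = 768\ \text{bytes}
\]
```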

In order to reduce the total number of memory accesses in an FSBM-based architecture, data reuse can be performed [14] at four different levels. Our on-chip memory bank organization adopts the data reuse defined as Level A and Level B. Level A describes the locality of data within the candidate block strip, where the search locations move within the block strip. Level B describes the locality among the candidate block strips, as vertically adjacent candidate block strips overlap. In our design this memory organization is primarily based on the usage of Look-Up Tables (LUTs) in the FPGA implementation.

4 Performance Analysis

This section presents the implementation results of the proposed hardware. Subsequently, it compares the obtained results with other existing FPGA-based designs.

4.1 Implementation Results

The proposed design has been implemented in Verilog HDL and verified with RTL simulations using Mentor Graphics ModelSim SE. The Verilog RTL has been synthesized on a Xilinx Virtex IV 4vlx100ff1513 FPGA. The synthesis results show that the design requires 333 CLB slices, 416 DFFs/latches and a total of 278 input/output pins. The area of the implementation is 380 look-up tables (LUTs) and the highest achievable frequency is 221.322 MHz.

The pipelined design takes 23 clock cycles to produce the first SAD value. Thereafter, one SAD value is generated in every cycle. A search range of p = 16 has $(2p + 1)^2 = 1089$ search locations. So, for a search range of p = 16, the number of cycles required by our hardware to find the best matching block is 23 (for the first search location) + (1089 − 1) (for the remaining search locations) = 1111 cycles.

Our FPGA implementation works at a maximum frequency of 221.322 MHz (4.52 ns clock cycle). Hence, the FPGA implementation can process an MB (16 × 16) in 5.022 µs (1111 clock cycles per MB × 4.52 ns per clock cycle = 5.022 µs) and a 720p HDTV (1280 × 720) frame in 18.078 ms (3600 MBs per frame × 5.022 µs per MB = 18.078 ms). At this speed, the proposed hardware can process 55.33 720p HDTV frames per second. This is a big improvement over the other approaches, whose frame rates are much lower, as is evident from Table 2. The high speed and throughput of our design are mainly due to the modified SAD operation and the split pipeline design of the proposed architecture.
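The timing figures can be reproduced with a few lines of arithmetic. This sketch uses only the numbers quoted in the text; the variable names are hypothetical.

```python
FREQ_HZ = 221.322e6                           # maximum clock frequency
P = 16                                        # search range
PIPELINE_FILL = 23                            # cycles until the first SAD value
MBS_PER_FRAME = (1280 // 16) * (720 // 16)    # 80 * 45 = 3600 MBs in 720p

search_locations = (2 * P + 1) ** 2                       # 1089
cycles_per_mb = PIPELINE_FILL + (search_locations - 1)    # 1111
mb_time_us = cycles_per_mb / FREQ_HZ * 1e6
frame_time_ms = mb_time_us * MBS_PER_FRAME / 1e3
fps = 1e3 / frame_time_ms

print(cycles_per_mb, round(mb_time_us, 2), round(frame_time_ms, 2), round(fps, 1))
# 1111 5.02 18.07 55.3
```

The per-MB and per-frame times come out slightly below the 5.022 µs and 18.078 ms quoted in the text because the text rounds the clock period to 4.52 ns before multiplying.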

4.2 Performance Comparison

This subsection compares the hardware features and performance of the proposed design with existing FPGA architectures. No comparison has been made with available ASIC solutions.


Table 2: Comparison of hardware features and performance with N = 16 and p = 16. The columns from Cycles/MB to Comp form the feature-based comparison; the last three columns report performance.

| Design (device) | Cycles/MB | Freq (MHz) | CLB slices | Input ports | AD PEs | Adders | Comp | HDTV 720p (fps) | Throughput T (MBs/sec) | T/Area |
|---|---|---|---|---|---|---|---|---|---|---|
| Loukil et al. [7] (Altera Stratix) | 8482 | 103.8 | 1654 | 48 | 33 | 33 8-bit | 17 | 3.4 | 12237.7 | 7.4 |
| Mohammad et al. [8] (Xilinx Virtex II) | 25344 | 191.0 | 300 | 2 | 33 | 33 8-bit | 34 | 2.09 | 7536.3 | 25.1 |
| Olivares et al. [9] (Xilinx Spartan) | 27481 | 366.8 | 2296 | 2 | 256 | 510 1-bit | 1 | 3.71 | 13347.4 | 5.8 |
| Roma et al. [10] (Xilinx XCV3200E) | 2800 | 76.1 | 29430 | 3 | 256 | 15 8-bit | 1 | 7.55 | 27178.6 | 0.92 |
| Ryszko et al. [11] (Xilinx XC40250) | 1584 | 30.0 | 948 | 16 | 256 | 16 8-bit | 1 | 5.26 | 18939.4 | 11.9 |
| Wong et al. [17] (Altera Flex20KE) | 45738 | 197.0 | 1699 | 32 | 16 | 243 8-bit | 1 | 1.2 | 4307.1 | 2.5 |
| Our (Xilinx Virtex IV) | 1111 | 221.322 | 333 | 256 | 1 | 16 8-bit, 8 9-bit, 4 10-bit, 3 11-bit & 2 16-bit | 1 | 55.33 | 199209.7 | 598.2 |

Table 2 compares the hardware features of the proposed and existing FPGA solutions for a macroblock (MB) of size 16 × 16 and a search range of p = 16. As can be seen, our design requires the fewest cycles per MB and has a high maximum operating frequency, exceeded only by that of [9]. The splitting of the initial stages of the pipeline facilitates this high speed. The area required in terms of CLB slices and the hardware complexity in terms of AD PEs (Absolute Difference Processing Elements), adders and comparators are much lower for the proposed architecture than for most of the existing designs. The modification of the SAD operation contributes to the high speed, small area and low hardware complexity. The use of memory banks has led to higher on-chip bandwidth. However, this has also led to the only drawback of our design, which is the high number of input/output pins.

A performance comparison of the various architectures is also shown in Table 2. In order to compare the speed-area optimized performance of different architectures, a new performance criterion, throughput/area, has been used. The higher the throughput/area value of a design, the better the speed-area optimization of the architecture. The architectures have been compared in terms of (a) the number of HDTV 720p (1280 × 720) frames that can be processed per second, (b) throughput, or MBs processed per second, (c) throughput/area, and (d) the I/O bandwidth. As can be seen, the proposed design has a very high throughput and can process the maximum number of HDTV 720p frames per second (fps). Moreover, the superior speed-area optimization of the proposed design is exhibited by its substantially higher throughput/area value of 598.2.
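The derived performance columns can be recomputed from the feature columns. The sketch below assumes that throughput is the clock frequency divided by cycles per MB and that area is the CLB slice count; this reading reproduces the tabulated values for the three rows shown. The dictionary and function names are hypothetical.

```python
# (cycles per MB, frequency in MHz, CLB slices) taken from Table 2
designs = {
    "Loukil et al. [7]": (8482, 103.8, 1654),
    "Mohammad et al. [8]": (25344, 191.0, 300),
    "Proposed": (1111, 221.322, 333),
}
MBS_PER_720P_FRAME = 3600


def throughput_mbs_per_sec(cycles_per_mb, freq_mhz):
    """Macroblocks processed per second at the given clock frequency."""
    return freq_mhz * 1e6 / cycles_per_mb


results = {}
for name, (cycles, freq, slices) in designs.items():
    t = throughput_mbs_per_sec(cycles, freq)
    results[name] = (t, t / MBS_PER_720P_FRAME, t / slices)
    print(f"{name}: T = {t:.1f} MBs/sec, "
          f"fps = {t / MBS_PER_720P_FRAME:.2f}, T/Area = {t / slices:.1f}")
```

Normalizing by slice count makes designs of very different sizes comparable, though it ignores differences between FPGA families in what a slice contains.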

5 Reconfigurable Block Matching Hardware

Apart from using the full pattern, block matching can also be performed by using N-queen decimation patterns. It has been shown [16] that the N-queen patterns have a similar PSNR drop but yield much faster encoding performance as compared to the full pattern, particularly for N = 4 and N = 8. This section presents a reconfigurable hardware design to find the minimum SAD value by selecting any one of the full-search, 8-queen or 4-queen decimation techniques. To the best of our knowledge, no similar hardware design exists in the literature.

For both the 4-queen and 8-queen decimation techniques, the pixels being processed for two consecutive SAD-based block matching operations are mutually independent. This fact can be utilized to further enhance the performance of the SAD operator discussed in Section 3. Only the memory organization and the address generation at each clock will differ for the three decimation patterns. It has been observed that the reconfigurable address generator and SAD operator require only 40% and 2% extra hardware cost, respectively, as compared to the already proposed full-pixel architecture.

The reconfigurable address generator uses a common datapath. Two consecutive addresses are represented by their respective bit value differences. For each decimation technique, the bit value is toggled following some pre-defined patterns. Bit toggling of the 8-bit address lines is controlled by their respective enable signals, which are generated by a dedicated controller. This state-machine-based controller generates the respective enable signals depending on the 2-bit decimation mode select input. The pipelined datapath shown in Fig. 1 can also be reconfigured according to the user-specified decimation mode. In the case of 8-queen decimation on a 16 × 16 block, 32 pixel values are added at every clock by both halves of pipeline stages one to five. The resultant value is directly used to compute the absolute difference with the MB to obtain the current SAD value. The same datapath of the pipelined SAD operator also performs the SAD calculation for 4-queen decimation. This technique requires 64 pixels for each SAD value for a 16 × 16 block. So, the pipeline is reconfigured such that both of its halves from stage one to five, together with stage six, are used to add these 64 pixel values. Subsequently, it performs the sum of absolute differences to get the new SAD.
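The pixel counts of 32 and 64 can be illustrated by tiling an N-queen pattern over a 16 × 16 block. The sketch below is only an assumption-laden illustration: the particular queen placements in `QUEENS` are valid N-queen solutions chosen for this example and are not necessarily the lattices used in [16], and all function names are hypothetical.

```python
# One valid N-queen solution per N: QUEENS[n][row] is the sampled column.
QUEENS = {
    4: [1, 3, 0, 2],
    8: [0, 4, 7, 5, 2, 6, 1, 3],
}


def is_valid_queens(cols):
    """No two queens share a column or a diagonal (rows differ by construction)."""
    n = len(cols)
    return (len(set(cols)) == n
            and len({r - c for r, c in enumerate(cols)}) == n
            and len({r + c for r, c in enumerate(cols)}) == n)


def decimation_mask(n, block=16):
    """Tile the n x n queen pattern over a block x block macroblock."""
    cols = QUEENS[n]
    return [[cols[r % n] == c % n for c in range(block)] for r in range(block)]


def decimated_sad(cur, cand, mask):
    """SAD restricted to the pixel positions selected by the mask."""
    return sum(abs(cur[i][j] - cand[i][j])
               for i, row in enumerate(mask)
               for j, keep in enumerate(row) if keep)


for n in (8, 4):
    mask = decimation_mask(n)
    print(n, sum(map(sum, mask)))   # 8 32, then 4 64
```

With one sample per row of each N × N tile, a 16 × 16 block keeps 256/N pixels, which matches the 32 and 64 pixel values mentioned for the 8-queen and 4-queen modes.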

6 Conclusions

This paper has presented an FPGA-based design for the Full Search Block Matching Algorithm. The novelty of this design lies in its modified SAD calculation and in its split-pipelined design for parallel processing in the initial stages of the hardware. The macroblock search scan has also been suitably altered to facilitate the derivation of SAD sums from previously computed results. Compared to existing FPGA architectures, the proposed design exhibits superior performance in terms of high throughput and low hardware complexity. The high frame processing rate of 55.33 fps makes this design particularly useful in both frame and field processing of HDTV-based applications. Finally, the paper outlines a reconfigurable block matching hardware that could be useful in a general-purpose real-time video processing unit.

References

[1] V. Do and K. Yun. A low-power VLSI architecture for full-search block-matching. IEEE Tran. Circ. and Sys. Video Tech., 8(4):393–398, Aug 1998.

[2] C. Hsieh and T. Lin. VLSI architecture for block-matching motion estimation algorithm. IEEE Tran. Circ. and Sys. Video Tech., 2(2):169–175, June 1992.

[3] Y. Jehng, L. Chen, and T. Chiueh. Efficient and simple VLSI tree architecture for motion estimation algorithms. IEEE Tran. Sig. Pro., 41(2):889–899, Feb 1993.

[4] T. Komarek and P. Pirsch. Array architectures for block matching algorithms. IEEE Circ. and Sys., 36(10):1301–1308, Oct 1989.

[5] Y. Lai and L. Chen. A data-interlacing architecture with two-dimensional data-reuse for full-search block-matching algorithm. IEEE Tran. Circ. and Sys. Video Tech., 8(2):124–127, April 1998.

[6] S. Lin, P. Tseng, and L. Chen. Low-power parallel tree architecture for full search block-matching motion estimation. In Proc. of Intl. Symp. Circ. and Sys., volume 2, pages 313–316, May 2004.

[7] H. Loukil, F. Ghozzi, A. Samet, M. Ben Ayed, and N. Masmoudi. Hardware implementation of block matching algorithm with FPGA technology. In Proc. Intl. Conf. on Microelectronics, pages 542–546, Dec 2004.

[8] M. Mohammadzadeh, M. Eshghi, and M. Azadfar. Parameterizable implementation of full search block matching algorithm using FPGA for real-time applications. In Proc. 5th IEEE Intl. Caracas Conf. on Dev., Circ. and Sys., Dominican Republic, pages 200–203, Nov 2004.

[9] J. Olivares, J. Hormigo, J. Villalba, I. Benavides, and E. Zapata. SAD computation based on online arithmetic for motion estimation. Jrnl. Microproc. and Microsys., 30:250–258, Jan 2006.

[10] N. Roma, T. Dias, and L. Sousa. Customisable core-based architectures for real-time motion estimation on FPGAs. In Proc. of 3rd Intl. Conf. on Field Prog. Logic and Appl., pages 745–754, Sep 2003.

[11] A. Ryszko and K. Wiatr. An assessment of FPGA suitability for implementation of real-time motion estimation. In Proc. IEEE Euromicro Symp. on DSD, pages 364–367, 2001.

[12] A. Saha and S. Ghosh. A speed-area optimization of full search block matching with applications in high-definition TVs (HDTV). To appear in LNCS Proc. of High Performance Computing (HiPC), Dec 2007.

[13] L. Sousa and N. Roma. Low-power array architectures for motion estimation. In IEEE 3rd Workshop on Mult. Sig. Proc., pages 679–684, 1999.

[14] J. Tuan and C. Jen. An architecture of full-search block matching for minimum memory bandwidth requirement. In Proceedings of the IEEE GLSVLSI, pages 152–156, Feb 1998.

[15] L. Vos and M. Stegherr. Parameterizable VLSI architectures for the full-search block-matching algorithm. IEEE Circ. and Sys., 36(10):1309–1316, Oct 1989.

[16] C. Wang, S. Yang, C. Liu, and T. Chiang. A hierarchical N-queen decimation lattice and hardware architecture for motion estimation. IEEE Transactions on CSVT, 14(4):429–440, April 2004.

[17] S. Wong, S. Vassiliadis, and S. Cotofana. A sum of absolute differences implementation in FPGA hardware. In Proc. 28th Euromicro Conf., pages 183–188, Sep 2002.

[18] K. Yang, M. Sun, and L. Wu. A family of VLSI designs for the motion compensation block-matching algorithm. IEEE Circ. and Sys., 36(10):1317–1325, Oct 1989.

[19] Y. Yeh and C. Lee. Cost-effective VLSI architectures and buffer size optimization for full-search block matching algorithms. IEEE Tran. VLSI Sys., 7(3):345–358, Sep. 1999.

[20] H. Yeo and Y. Hu. A novel modular systolic array architecture for full-search block matching motion estimation. In Proc. Intl. Conf. on Acou., Speech, and Sig. Proc., volume 5, pages 3303–3306, 1995.
