A Low Cost FPGA System for High Speed Face Detection and Tracking


  • 8/22/2019 A Low Cost FPGA System for High Speed


A Low Cost FPGA System for High Speed Face Detection and Tracking
Stavros Paschalakis and Miroslaw Bober
Mitsubishi Electric ITE B.V., Visual Information Laboratory
E-mail: {stavros.paschalakis, miroslaw.bober}@vil.ite.mee.com

Abstract

We present an FPGA face detection and tracking system for audiovisual communications, with a particular focus on mobile videoconferencing. The advantages of deploying such a technology in a mobile handset are many, including face stabilisation, reduced bitrate, and higher quality video on practical display sizes. Most face detection methods, however, assume at least modest general purpose processing capabilities, making them inappropriate for real-time applications, especially for power-limited devices, as well as modest custom hardware implementations. We present a method which achieves a very high detection and tracking performance and, at the same time, entails a significantly reduced computational complexity, allowing real-time implementations on custom hardware or simple microprocessors. We then propose an FPGA implementation which entails very low logic and memory costs and achieves extremely high processing rates at very low clock speeds.

1. Introduction

The main aim of audiovisual communications is to

enhance aural interaction through the addition of visual information in the form of real-time or near-real-time video of the users' faces. 3G video-capable mobile phones can significantly reduce the limitations of the conventional phone, but this new technology must be seamlessly integrated into the handset so that it does not create complications for users. More specifically, video-enabled terminals are hand-held and non-stationary, so speakers have to ensure that their faces are central to the camera's field of view, which can be a distraction to conversation. Another problem is picture instability due to the relative motion between the camera and the speaker, which may also have the aforementioned adverse effect. A further issue is the bitrate required on transmission for a satisfactory picture quality. This can be dramatically reduced through the transmission of only the stabilised face of a speaker instead of the entire frames. In turn, this eliminates the need for unnecessarily large displays and facilitates the scaling of the faces to the display dimensions, so that a complete face image is always presented to the speakers, with low coding distortion.

The integration of real-time face detection and tracking technology in mobile videophones can address all these issues. To date, various face detection methods have been proposed, with varying characteristics, such as tolerance to illumination and geometric variations and computational complexity. With regard to the latter, there are techniques which entail relatively undemanding implementations in the context of modern computer systems. These techniques, however, still assume the availability of modest processing power, memory, floating-point capabilities, etc., rendering them inappropriate for constrained real-time implementations. This is especially the case for power-limited devices such as mobile phones, the hardware of which also has to perform other tasks like video coding/decoding [1] and error management [2].

In this paper we propose a low cost FPGA system for face detection and tracking. First, we present a high performance face detection and tracking method which, while being robust to illumination and geometric variations, entails a low computational load. This makes it suitable for simple microprocessor as well as custom hardware implementations. We then present an FPGA realisation of this method. In recent years, FPGAs have proven to be invaluable in image processing applications [3-5] because they combine the reprogrammability advantage of general purpose processors with the parallel processing and speed advantages of custom hardware. Our proposed FPGA face detection and tracking system combines very low logic and memory costs with high processing rates at low clock speeds, making it ideal for small power-limited devices such as mobiles as well as applications where isolated smart sensors are required, e.g. surveillance systems.

2. Face Detection and Tracking Method

The problem of automatic face detection has been an active area of study for some time and numerous


methodologies exist. These may be broadly categorised as appearance-based, feature-based or colour-based. Appearance-based approaches [6,7] treat face detection as a recognition problem and, in general, do not utilise human knowledge of facial appearances, but rely on machine learning and statistical methods to ascertain such characteristics. Feature-based methods detect faces by detecting features such as the eyes and the nose, or parts or combinations thereof [8,9], and usually rely on low-level operators and a priori heuristic rules of the human face structure. Finally, colour-based methods may be viewed as a special case of feature-based methods, where the feature in question is the colour of the human skin [10,11]. The advantage of colour-based face detection methods is that, while computationally efficient, they are generally robust to geometric transformations, such as scale, orientation and viewpoint changes, since such transformations do not affect the colour of skin, as well as to complex backgrounds and illumination variations. Our method also adopts the skin colour-based detection approach and can be summarised as follows.

The first step is the subsampling of each video frame, by dividing it into non-overlapping blocks of a fixed size and averaging the values within each block to produce a single pixel in the subsampled frame. The aim of this process is twofold. First, it reduces the amount of data for the subsequent processing. Second, it results in the generation of larger skin patches, which decrease the susceptibility of the system to features such as a moustache or spectacles. Clearly, the subsampling factor must be chosen carefully so as to strike a balance between computational efficiency and detection performance. Our current system assumes QCIF video, i.e. 176x144 pixels, for which 16x16 pixel subsampling, resulting in an 11x9 pixel frame, achieves this balance. For example, Figure 1(a) shows a QCIF image containing the faces of two people and Figure 1(b) shows its subsampled version.
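The block-averaging subsampling described above can be sketched in a few lines of NumPy. This is our illustration, not the paper's code; the array is indexed rows x columns, so the 11x9 pixel frame appears as a 9x11 array.

```python
import numpy as np

def subsample(plane, block=16):
    """Average non-overlapping block x block tiles of one colour plane,
    producing a single pixel per tile (e.g. a 144x176 QCIF R or G plane
    becomes a 9x11 subsampled plane). Height and width must be
    multiples of `block`."""
    h, w = plane.shape
    # Split the plane into block x block tiles, then average each tile.
    tiles = plane.reshape(h // block, block, w // block, block)
    return tiles.mean(axis=(1, 3)).astype(np.uint8)

plane = np.full((144, 176), 200, dtype=np.uint8)
print(subsample(plane).shape)  # (9, 11)
```

In the system only the R and G planes need this treatment, since the skin filter never uses the B plane.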

The second main stage of the face detection and tracking method is colour-based skin pixel detection in a video frame. Here, this detection relies on a statistical histogram-based skin colour model. More specifically, a statistical skin colour model has been created offline, during a training phase, by projecting manually segmented skin pixels onto a histogram. The training database encompasses the skin tones of different people and of different ethnicities, e.g. Caucasian, Asian, African, etc., captured under various illumination conditions. During skin detection, an unknown pixel is projected onto the histogram to retrieve a skin colour probability value from the appropriate bin, which is then used in conjunction with a threshold to decide the classification of the pixel as skin or non-skin. For example, the result of the skin filtering operation on the image of Figure 1(b) may be seen in Figure 1(c).

With such a filtering approach it is common practice to

convert the original RGB colour space to an alternative colour space for the creation of the skin colour model and the subsequent filtering. Desirable features of such a colour space include dimensionality reduction for a more efficient implementation (e.g. 2D as opposed to the 3D RGB space), robustness to illumination variations, and the clustering of the skin colour of different people and with different ethnicities in one or more compact regions. Thus, a number of appropriate colour spaces have been proposed in the literature, including the normalised rg space, the UV components of CIE LUV, and the HS components of HSV, among others. In this work, our skin colour filter utilises a 2D adaptation of the 3D log-opponent colour space IRgBy, which has been successfully used for skin detection [11], termed LogRG for simplicity and defined as

    L1 = L(G)
    L2 = (L(R) - L(G) + L(RGBmax)) / 2                  (1)
    with L(x) = log10(x+1) * 106.2
    and RGBmax = 255 for 24-bit colour

Assuming a 24-bit colour depth for the RGB frame, the relations of (1) also include scaling to give each of the two LogRG values an 8-bit integer representation. Note that, with this colour space, the B component of the video signal is not used. Extensive experimentation has shown us that the elimination of the B plane, which has the least contribution to skin colour under the most common illuminants, has an insignificant impact with regards to skin filtering compared with using the complete 3D log-opponent colour space. This elimination greatly simplifies the associated implementation, with the additional advantage that only the R and G planes of the original video signal require subsampling.

Despite the robustness of the chosen (or indeed any) colour system to illumination changes and the inclusion of various skin tones in the statistical model, colour-based skin detection can produce sub-optimal results for different illuminations and people when a static model is employed. Thus, what is needed is an adaptive model, which can be calibrated to different people and to different illuminants. Here, this process is user initiated and relies on LogRG-converted pixels of the original frame, which are contained in a small predefined region of fixed size. The current skin colour model and the new calibration data are then fused to produce an adapted model. The exact mechanism of this will be explored later. Obviously, the user initiating the adaptation process will be aware of the size and location of the predefined region where his/her skin should appear for a successful calibration. Any non-skin pixels accidentally contained in the calibration area can be eliminated using the simple rule B < G and R > G, which we have found to be true for skin under most illuminants.
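The LogRG conversion of relation (1) and the calibration-area rule can be sketched as follows. The scanned equation is partly illegible, so the halving of L2 is our assumption, made so that both components fit the 8-bit range the text requires.

```python
import math

RGB_MAX = 255

def L(x):
    """Scaled logarithm of relation (1): maps 0..255 to roughly 0..255."""
    return 106.2 * math.log10(x + 1)

def logrg(r, g):
    """Convert an (R, G) pair to 8-bit LogRG coordinates. The /2 in L2
    is an assumption (the scan is unreadable there); it keeps L2 within
    8 bits, since L(R) - L(G) spans roughly -255..255."""
    l1 = int(L(g))
    l2 = int((L(r) - L(g) + L(RGB_MAX)) / 2)
    return l1, l2

def usable_for_calibration(r, g, b):
    """The simple rule B < G and R > G used to reject non-skin pixels
    accidentally contained in the calibration area."""
    return b < g and r > g

print(logrg(200, 100))  # (212, 143)
```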



Figure 1. (a) Original QCIF frame. (b) Subsampling. (c) Skin detection. The white pixels on the skin map denote detected skin. (d) Face centroid (marked by crosses) and face selection for encoding and transmission (the area within the superimposed rectangle). The images have been modified to protect the identities of the subjects.

The generation of the skin map is followed by spatial

filtering for noise reduction. Currently, this filtering is a neighbourhood operation which resets a skin pixel to non-skin if it has fewer than four skin pixels among its eight neighbours, and sets a non-skin pixel to skin if all of its eight neighbours are skin pixels. A connected component analysis algorithm is then applied to the filtered skin map. The result of this process, the details and implementation of which will be examined in the next Section, is that the individual skin pixels of the skin map are grouped into disjoint skin regions. The next step is the calculation of some statistics for each region, which are subsequently used by the face tracking process. These statistics are the zeroth and first order moments [3], given by

    m00 = ΣΣ f(x,y),  m10 = ΣΣ x·f(x,y),  m01 = ΣΣ y·f(x,y)        (2)
    where f(x,y) is the skin map
    and f(x,y) = 1 for skin pixels, 0 for non-skin pixels

Then, the centroid of each region is calculated as

    x̄ = m10 / m00,  ȳ = m01 / m00                                  (3)

The tracking process is based on distance and mass

measurements. The distance between the centroid of each region and the centroid of the face of the previous frame is calculated as the maximum axis distance d, using

    d = max(dx, dy)  with  dx = |x̄ − x̄prev|,  dy = |ȳ − ȳprev|     (4)
    where (x̄, ȳ) is the centroid of a skin region of the current frame
    and (x̄prev, ȳprev) is the centroid of the face of the previous frame

The skin group which produces the smallest distance d is selected as the tracked face for the current frame, unless that region does not also have the largest mass among the skin regions for ten consecutive frames. In that case, the tracked face switches to the maximum mass region. This allows the system to always lock on to the most prominent skin region which, given the targeted mobile videophone application, is most likely to be the region of interest, i.e. the main speaker. The ten frame hysteresis function employed prevents flickering among skin regions of similar sizes. Another special situation is the absence of any skin pixels following the spatial filtering of the skin map and before grouping. Then, the system assumes the last face position as the current one.
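The face selection logic (closest region by maximum-axis distance, a ten-frame largest-mass hysteresis, and fallback to the previous face when no skin remains) can be modelled as below. The function name and tuple layout are ours, not the paper's:

```python
def select_face(regions, prev_centroid, prev_face, streak, hysteresis=10):
    """Pick the tracked face among candidate skin regions, following the
    rules in the text. `regions` is a list of (cx, cy, mass) tuples;
    returns the chosen region and the updated hysteresis counter."""
    if not regions:               # no skin pixels left after filtering:
        return prev_face, 0       # keep the previous face position
    px, py = prev_centroid
    closest = min(regions, key=lambda r: max(abs(r[0] - px), abs(r[1] - py)))
    largest = max(regions, key=lambda r: r[2])
    if closest is largest:
        return closest, 0         # closest region is also the largest
    streak += 1                   # closest has lacked the largest mass
    if streak >= hysteresis:      # ...for ten consecutive frames:
        return largest, 0         # switch to the most prominent region
    return closest, streak
```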

Finally, the coordinates of the centre of the tracking window are calculated as the average of the centroid coordinates of the face for the current frame and a number of previous frames (currently seven). This allows smooth tracking, as well as smooth switching between skin regions. As for the size of the tracking window, this may be easily obtained from the area of the tracked face or it may be fixed to a pre-determined size. Keeping in mind our target application, and the display size limitations of videophone handsets, the current system assumes a fixed tracking window of QQCIF resolution with portrait aspect ratio, i.e. 72x88 pixels. As an example, Figure 1(d) shows the centroids of both the faces of Figure 1(a) and, assuming the face on the right is selected as the tracked face, the region selected for encoding and transmission is identified by the bounding rectangle. It is this image that will appear on the display of the handset of a second user. Optionally, the handset of each user may also display, at one of the corners of the screen, a scaled down version of the identified face of that user, allowing both users to simultaneously view the other person as well as the video that they are themselves transmitting.

3. FPGA-Based System

The top level organisation of the FPGA system which implements the face detection and tracking method described above can be seen in Figure 2. Thus, at the top level, the proposed FPGA-based system is a pipeline of three stages, namely subsampling, skin detection, and face detection/tracking.



Figure 2. Overall system organisation as a three stage pipeline.

Figure 3. Subsampler implementation.

3.1. Subsampling

The current subsampling circuit has been designed for the processing of QCIF 176x144 pixel 24-bit RGB video frames using a 16x16 mask, giving rise to 11x9 pixel frames. The circuit processes data directly from a camera without reliance on frame storage and assumes that pixels are delivered row-by-row, as is usually the case, e.g. left-to-right and top-to-bottom, and that colour band values are delivered sequentially rather than in parallel, e.g. in an R-G-B byte order. Figure 3 shows that this module is comprised of two accumulator units. As seen earlier, the subsequent skin filtering process does not require the B plane of the original frame. Thus, ACC1 only accumulates the R and G values of each row of each pixel block. Since pixel values are delivered sequentially and in a fixed order, ACC1 does not contain two accumulators but is comprised of a single 12-bit full adder and a 2x12-bit shift register. ACC2 accumulates the results produced by ACC1 and therefore calculates the R and G pixel value sums for entire pixel blocks. Similarly to ACC1, ACC2 is made up of a single 16-bit full adder and a 22x16-bit shift register. The size of all the components has been chosen so that an overflow does not occur. Each final result produced by ACC2 contains a required pixel value for the subsampled frame. The processing time for each pixel byte is a single cycle and the module can process values continuously, regardless of row or frame transitions. The appropriate control circuitry allows the module to be paused at any point and for any number of clock cycles, allowing it to adapt to different camera timings and specifics.
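The two-accumulator organisation can be modelled in software as below. This is our reconstruction of the behaviour (byte stream in R-G-B order, B discarded, ACC1 summing one 16-pixel block row, ACC2 summing 16 such rows for each of the 11 blocks), not a translation of the VHDL:

```python
def stream_subsample(byte_stream, width=176, height=144, block=16):
    """Software model of the Figure 3 subsampler. Consumes one pixel
    byte per step and returns, per 16-row band, the (R_sum, G_sum)
    block totals that ACC2 would emit. Dividing a total by block*block
    (a bit shift in hardware) gives the subsampled pixel value."""
    blocks_per_row = width // block
    out = []
    acc2 = [[0, 0] for _ in range(blocks_per_row)]  # the 22x16-bit SR
    it = iter(byte_stream)
    for y in range(height):
        for bx in range(blocks_per_row):
            acc1 = [0, 0]                           # the 2x12-bit SR
            for _ in range(block):
                r, g, _b = next(it), next(it), next(it)  # B is dropped
                acc1[0] += r                        # single adder, shared
                acc1[1] += g                        #   between R and G
            acc2[bx][0] += acc1[0]                  # single 16-bit adder
            acc2[bx][1] += acc1[1]
        if (y + 1) % block == 0:                    # block row complete:
            out.append([tuple(s) for s in acc2])    # emit final sums
            acc2 = [[0, 0] for _ in range(blocks_per_row)]
    return out
```

The stated component widths check out: a 16-pixel row sum is at most 16x255 = 4080 (12 bits) and a full 256-pixel block sum at most 65280 (16 bits), so no overflow occurs.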

3.2. Skin Filtering

The second pipeline stage, the skin filtering circuit, processes the subsampled frame values as they are produced by the subsampler and does not require the storage of complete frames. The tasks of this circuit are to transform the R and G values of the subsampled or the original frame to the LogRG space in order to obtain, respectively, skin colour probabilities or adapt the original skin colour model to the current skin tone and lighting conditions. Thus, the circuit has two modes of operation, namely skin filtering and model calibration.

Although both modes of operation rely on the same hardware, it is more convenient to first consider the skin filtering process, outlined in Figure 4. This stage is, in itself, a pipeline. The first step is the transformation L(x) of (1) for the R and G subsampled frame values. This can be implemented very efficiently as a lookup operation into a 256x8-bit embedded ROM. The processing time for each value is a single cycle, and the control circuitry allows zero or any number of cycles between the delivery of the two values of each RG pair. The remainder of the skin filtering process requires two cycles. In the first cycle, Cycle 0, the transformed R and G values are converted to the LogRG colours L1 and L2 according to the relations of (1) and are subsequently offset by a fixed amount (-39 for L1 and +2 for L2). The whole process simply requires three 8-bit full adders. The offsetting operation is related to the implementation of the skin colour model. More specifically, the model can be efficiently implemented as an embedded RAM module which contains not the entire LogRG histogram, but only that region where skin colours occur. With our model, this active area is a small 32x16 bin window, resulting in a very inexpensive implementation. This is why the compact clustering of the different skin tones is an important property of the colour space chosen for the model creation. Another optimisation we have carried out is that, instead of using a multi-level skin colour probability model in conjunction with a probability threshold, we directly implemented a thresholded binary model. Thus, the entire skin colour model has been implemented as an inexpensive 512x1-bit embedded RAM created by vectorising the aforementioned 32x16 bin window. In the final cycle, Cycle 1, the appropriately offset L1 and L2 are combined to form an address for asynchronously reading the skin model RAM. The output from this memory is a binarised skin colour probability. The Bounds Check component checks whether the L1 and L2 values are actually within the model range and, if not, the skin colour probability is cleared to 0. This component requires just two 8-bit comparators. The circuit has been implemented so that there can be zero or any number of clock cycles between the delivery of any two RG pairs, regardless of row or frame transitions.
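The Cycle 0/1 datapath (offsetting, bounds check, model RAM read) can be sketched as follows; how the offset pair is vectorised into the 512-entry address is not spelled out in the text, so the row-major mapping here is an assumption:

```python
L1_OFFSET, L2_OFFSET = -39, 2    # fixed offsets from the text
MODEL_ROWS, MODEL_COLS = 32, 16  # the active 32x16 bin window

def skin_probability(l1, l2, model_ram):
    """Offset the LogRG pair (three 8-bit adders in hardware), bounds-
    check it against the active window (two 8-bit comparators) and, if
    inside, read the binarised probability from the vectorised
    512x1-bit model RAM; otherwise clear the probability to 0."""
    a1, a2 = l1 + L1_OFFSET, l2 + L2_OFFSET
    if not (0 <= a1 < MODEL_ROWS and 0 <= a2 < MODEL_COLS):
        return 0
    return model_ram[a1 * MODEL_COLS + a2]
```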



Figure 4. The skin filtering pipeline.

Figure 5. The skin colour model adaptation pipeline.

The advantage of this skin filter implementation is that it entails minimal logic, very small amounts of memory (mainly the 320 bytes for the two memory blocks) and can span as little as four clock cycles. Such a streamlined implementation, however, makes the task of adapting the skin colour model to specific subjects and/or imaging conditions more challenging. A question that arises is how the calibration values are incorporated into the model. In the general case, the adaptation of a multi-level probability model requires the creation of an equivalent calibration model followed by the fusion of the two models, e.g. histogram averaging. With our binarised model implementation this is not an option. The solution which we have identified entails the extraction of some additional information from the multi-level probability model. More specifically, all the bins of the original model can be split into four classes or categories, each encoding: (a) the initial 0/1 probability, (b) how the original probability changes during a "sinking" process at the start of the model adaptation and (c) how it again changes based on the occurrences of the given colour in the calibration area. The initial "sinking" or "probability suppression" process is required at the start of the model adaptation so that the new data can have a significant impact. This approach allows both the adaptation of the original model as well as its recovery at any time. This additional information can be efficiently encoded in a 512x4-bit embedded RAM, which also includes the temporary storage required while the model adaptation is in progress and colour categories are being updated, and some simple logic, i.e. adders and comparators.

Figure 5 shows the model adaptation pipeline of the circuit. The first step in the model adaptation process (not shown) is the combined retrieval and sinking of the original model probabilities, based on the colour category information as discussed above. For each individual pixel, the first step during calibration is the logarithmic transformation of the R and G original pixel values, as described above for the subsampled values. Some additional checks are also performed to ensure that the pixel in question should be used in the calibration process, as shown in Figure 5. The associated components are trivial and require mainly comparators. The rest of the calibration process requires two more cycles. In Cycle 0 the LogRG colours L1 and L2 are calculated in the same fashion as for the skin filtering procedure. Then, in Cycle 1, the L1 and L2 values are used to access the model memory. If it is decided that the pixel should be used for the calibration, the colour category information and, if appropriate, the corresponding probability are updated. With this organisation, the retrieval of the original model, without sinking and calibration, is also straightforward. The pipelining of this path allows zero or any number of cycles between the delivery of pixel bytes or RGB triplets, regardless of row or frame transitions.
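The sink/calibrate/recover mechanism can be illustrated with a deliberately simplified sketch: here each bin only remembers its initial probability and whether calibration set it, whereas the actual circuit packs four category classes into the 512x4-bit RAM. The function names are ours:

```python
def sink_model(model_ram, categories):
    """'Sinking' (probability suppression) at the start of adaptation:
    record each bin's original probability, then clear the bin so that
    the new calibration data can have a significant impact."""
    for i, p in enumerate(model_ram):
        categories[i] = {"initial": p, "calibrated": 0}
        model_ram[i] = 0

def calibrate_bin(model_ram, categories, addr):
    """Record an occurrence of a calibration-area colour and assert the
    corresponding binarised probability."""
    categories[addr]["calibrated"] = 1
    model_ram[addr] = 1

def restore_model(model_ram, categories):
    """Recover the original model at any time from the category info."""
    for i, c in enumerate(categories):
        model_ram[i] = c["initial"]
```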



3.3. Face Detection/Tracking

The tasks of the face detection/tracking circuit are to spatially filter the 11x9 binary skin map produced by the previous stage, perform region growing, calculate region statistics, identify the region corresponding to the face of the previous frame and calculate the centre of the face display window. As for the previous circuits, this module processes the skin map values as they are produced by the previous stage and does not require frame storage.

Figure 6 shows the organisation of this circuit and the actions for each cycle of operation. Since this stage in the processing relies heavily on neighbourhood operations and no frame storage is implemented, the circuit utilises a shift register which stores a continuously shifting image neighbourhood, as shown in Figure 7. Thus, in Cycle 0 the binary skin probability value is stored in a 25x1-bit shift register (SR1). Once a complete neighbourhood is formed, in Cycle 1 the pixel being processed is spatially filtered for noise reduction. This relies mainly on an adder tree, as shown in Figure 8.
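The spatial filtering rule stated earlier (a skin pixel with fewer than four skin neighbours is reset; a non-skin pixel with eight skin neighbours is set) amounts to the following sketch, with pixels outside the frame treated as non-skin:

```python
def spatial_filter(skin_map):
    """Noise-reduction filtering of a binary skin map, given as a list
    of rows of 0/1 values (11x9 in the system)."""
    h, w = len(skin_map), len(skin_map[0])

    def skin_neighbours(y, x):
        # Count skin pixels among the (up to) eight neighbours.
        return sum(skin_map[j][i]
                   for j in range(max(0, y - 1), min(h, y + 2))
                   for i in range(max(0, x - 1), min(w, x + 2))
                   if (j, i) != (y, x))

    out = [row[:] for row in skin_map]
    for y in range(h):
        for x in range(w):
            n = skin_neighbours(y, x)
            if skin_map[y][x] and n < 4:
                out[y][x] = 0        # isolated skin pixel: reset
            elif not skin_map[y][x] and n == 8:
                out[y][x] = 1        # fully surrounded hole: set
    return out
```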

Each filtered pixel is used in the connected component analysis process. This usually involves floodfill-type operations, which are unsuitable for an FPGA implementation because they require a complete frame storage and entail a demanding, and usually recursive, implementation. Thus, a different algorithm is employed here which aims at computational efficiency. As the image is scanned left-to-right and top-to-bottom, each skin pixel is assigned a numerical tag T0 to T7, determined based on the tags of four of its neighbours, namely the three pixels above and the one pixel to the left. This takes place during Cycle 1 of the circuit operation. The tags are encoded using three bits and are stored in a 13x3-bit shift register (SR2) which forms a continuously shifting tag neighbourhood. Non-skin pixels are always assigned the tag T7, while skin pixels take the lowest tag found among their neighbours, e.g. between T0 and T1, T0 is chosen. If none of the neighbours is skin, a new tag is assigned to a skin pixel, e.g. if T0 and T1 have been used, the skin pixel under consideration is assigned T2. This part of the circuit is illustrated in Figure 9.

The circuit also maintains an 8x3-bit tag-group association register array, which indicates which of the skin groups G0 to G7 the pixels of a tag belong to. This array is initialised so that all tags correspond to group G7, i.e. the non-skin group. In Cycles 2 to 5, the tag-group association register array is updated for the tag of each of the four neighbouring skin pixels of the previous cycle. New skin tags are initially associated with the group of the same value, e.g. T2 with G2. Now assume, for example, that a pixel is assigned T1, associated with G1, and that subsequent processing reveals that this pixel is neighbouring a T0 pixel, associated with G0. Rather than changing the tags of all the T1 pixels to T0, the tag-group

Figure 6. Organisation of the face detection/tracking module.

association array is updated to indicate that T1 is also associated with skin group G0. So, when the whole image has been processed, this array shows which pixels make up each distinct skin group. Although it is possible for an 11x9 skin image to require more than the effective seven skin tags, extensive testing showed that, in practice, this does not easily arise; in fact, it never arose in our tests. Even if this event occurs, it will simply result in the abandonment of the processing of just one frame, and the position of the tracked face may be interpolated from the previous frames.
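The single-pass tagging scheme can be modelled as below; the tag-group table plays the role of the 8x3-bit association register array, so merging two touching regions updates one table entry instead of relabelling pixels:

```python
NON_SKIN_TAG = 7  # T7 marks non-skin pixels

def tag_regions(skin_map):
    """Scan left-to-right, top-to-bottom; each skin pixel takes the
    lowest tag among its west, north-west, north and north-east
    neighbours, or a fresh tag if none is skin. Returns the tag map
    and the resolved tag->group table."""
    h, w = len(skin_map), len(skin_map[0])
    tags = [[NON_SKIN_TAG] * w for _ in range(h)]
    group = list(range(8))               # tag-group association array
    next_tag = 0
    for y in range(h):
        for x in range(w):
            if not skin_map[y][x]:
                continue
            neigh = [tags[j][i]
                     for dy, dx in ((0, -1), (-1, -1), (-1, 0), (-1, 1))
                     for j, i in [(y + dy, x + dx)]
                     if 0 <= j < h and 0 <= i < w
                     and tags[j][i] != NON_SKIN_TAG]
            if neigh:
                t = min(neigh)
                for other in neigh:      # merge touching tags' groups
                    group[other] = min(group[other], group[t])
            else:
                if next_tag == NON_SKIN_TAG:
                    raise ValueError("out of tags: frame abandoned")
                t, next_tag = next_tag, next_tag + 1
            tags[y][x] = t
    for t in range(8):                   # resolve chained associations
        g = t
        while group[g] != g:
            g = group[g]
        group[t] = g
    return tags, group
```

On a U-shaped region the two arms receive different tags, and the table, not the pixels, records that both belong to the same group.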



Figure 7. Skin probability (SP) neighbourhood formation in shift register SR1 (25x1-bit).

Figure 8. Skin probability (SP) spatial filtering.

As discussed earlier, the face tracking process relies on

the statistics m00, m10 and m01 of (2), calculated for each skin region. However, these statistics need not be calculated at such a late stage in the processing. With our current circuit these statistics are calculated in Cycle 2, during the connected component analysis process, and for the pixels of each separate tag rather than each separate skin group. These values are stored in three register arrays, 8x7-bit for m00 and 8x9-bit for m10 and m01.

The next cycles are applicable only for the processing of the last pixel in a skin map. When the whole image has been processed, the tag-group array information is used to calculate the m00, m10 and m01 statistics for the skin groups as sums of the statistics of the appropriate tags. This takes place during Cycles 6 to 13 and relies mainly on three adder trees and some simple additional logic. The results are stored back to the m00, m10 and m01 arrays.

The calculation of the x coordinate of the centroid of each skin group according to (3) is performed during Cycles 14 to 22. A 9-bit by 7-bit two-stage pipelined integer divider is used for this. The results are stored back into the m10 array. The same process is performed during Cycles 22 to 30 for the calculation of the y centroid coordinate of each skin group, with the results stored back to the m01 register array. Cycles 31 to 37 are used to identify which of the skin regions represented by groups

G0 to G6 is spatially closest to the face of the previous frame, as well as the largest among those skin regions. Finding the closest face requires only some simple logic for the implementation of the distance equations of (4). In Cycle 38 the face of the current frame is selected between the closest and the largest face based on the simple rules discussed earlier. If the circuit "ran out" of skin tags during the processing of the frame, or no skin pixels were found after filtering, the face of the previous frame is assumed to be the face of the current frame. Finally, the coordinates of the centre of the window for transmission and display are calculated in Cycle 39 as the average of the face centroids for the current and the previous seven frames, which are stored in two shift registers. The display coordinates are then scaled to the dimensions of the original video by bit shifting. As discussed earlier, the module does not dictate what the dimensions of the tracking window should be, due to the fixed size of the mobile videophone device.

Figure 9. Implementation of the tag shift register SR2 (13x3-bit) and tag determination logic.

3.4. Circuit Statistics

The entire circuit has been implemented using VHDL. Table 1 shows the statistics for the implementation of the proposed face detection and tracking system on an ALTERA APEX 20K device. These figures include 46 flip-flops for I/O synchronisation. The total memory requirements of the system are less than 700 bytes. Assuming uninterrupted pixel byte delivery, a processing speed of 434 frames per second is obtained for a clock speed of 33MHz. This is directly derived from the timing of the subsampling module, which is the bottleneck of the complete circuit, despite its single-cycle pixel value processing time. In other words, the pixel delivery mechanism is the actual limitation of the system. It is, therefore, obvious that the circuit has the potential of achieving speeds far higher than real-time video processing requirements, even at low clock speeds.
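The 434 frames per second figure follows directly from consuming one pixel byte per clock cycle over a full QCIF RGB frame:

```python
clock_hz = 33_000_000              # 33 MHz clock
bytes_per_frame = 176 * 144 * 3    # QCIF: one R, G and B byte per pixel
fps = clock_hz / bytes_per_frame   # one byte consumed per cycle
print(round(fps))  # 434
```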



Table 1. Design summary for the FPGA face detection and tracking system

    Device:               ALTERA EP20K1000EBC652-1
    Total pins:           35
    Total logic elements: 3173/38400 (8.26%)
    Total flip-flops:     912
    Total ESB bits:       4608/327680 (1.41%)
    Total memory:         697.5 bytes
    CLK:                  33.13 MHz

4. Discussion and Conclusions

We presented an FPGA-based face detection and tracking system for audiovisual communications, with a particular focus on mobile videoconferencing. The advantages of deploying such a technology in a mobile videoconferencing system include improved visual effect through face stabilisation, reduction in the bandwidth requirements through the encoding and transmission of only the stabilised face, and the provision of high quality video on practical display sizes. The adopted skin colour-based face detection method achieves a very high performance and is robust both to illumination variations and to geometric changes, such as rotation, scale and viewpoint, since skin colour is invariant to such transformations. In particular, extensive evaluation of the method has revealed that it is particularly robust under very low illumination conditions. We have chosen an FPGA as our implementation vehicle because it combines the reprogrammability advantage of general purpose processors with the parallel processing and speed advantages of custom hardware. Thus, it is easy to see that specific aspects of the proposed system, e.g. the frame dimensions, the subsampling mask or the skin colour filtering process, may be easily changed in a larger system by reconfiguring the device. However, the adopted method is also suitable for implementation on simple microprocessors or ASICs.

The size and speed evaluation of the proposed FPGA circuit revealed its low cost in terms of logic and memory (less than 700 bytes), in conjunction with its frame processing capability at low clock speeds (over 400 frames/sec at 33MHz). This frame processing capability far exceeds the requirements for a mobile videoconferencing application and, indeed, real-time video broadcasting requirements. This demonstrates the applicability of our design approach to high-speed applications. Also possible is the high-speed parallel processing of multiple video frames by multiplexing multiple video sources into the system, through the replication of key components and the multiplexing of the majority of the existing hardware. The low circuit area of the current design certainly allows this within a single FPGA device. Finally, it is obvious that the proposed method and implementation would also be applicable to other colour-based detection and tracking applications.

Acknowledgements

The authors acknowledge the contribution of Mr. James Cooper in the development of the face stabilisation and tracking method.

References

1. ITU-T Recommendation H.263 Version 2, Video Coding for Low Bit Rate Communication, 1998.
2. Yang, Y., Zhu, Q., Error Control and Concealment for Video Communications: A Review, Proceedings of the IEEE, 1998, vol. 86, no. 5, pp. 974-997.
3. Paschalakis, S., Moment Methods and Hardware Architectures for High Speed Binary, Greyscale and Colour Pattern Recognition, Ph.D. Thesis, Department of Electronics, University of Kent at Canterbury, UK, 2001.
4. Paschalakis, S., Lee, P., Bober, M., An FPGA System for the High Speed Extraction, Normalization and Classification of Moment Descriptors, to appear in Proceedings of the 13th International Conference on Field Programmable Logic and Applications FPL03, 2003.
5. Swenson, R. L., Dimond, K. R., A Hardware FPGA Implementation of a 2D Median Filter Using a Novel Rank Adjustment Technique, in Proceedings of the 7th IEE International Conference on Image Processing and its Applications IPA99, 1999, vol. 1, pp. 103-106.
6. Turk, M., Pentland, A., Eigenfaces for Recognition, Journal of Cognitive Neuroscience, 1991, vol. 3, no. 1, pp. 71-86.
7. Yang, M. H., Ahuja, N., Kriegman, D., Face Detection Using a Mixture of Linear Subspaces, in Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition FG00, 2000, pp. 70-76.
8. Viola, P., Jones, M. J., Robust Real-Time Object Detection, Technical Report CRL 2001/01, Cambridge Research Laboratory, Compaq, 2001.
9. Leung, T. K., Burl, M. C., Perona, P., Finding Faces in Cluttered Scenes Using Random Labelled Graph Matching, in Proceedings of the 5th IEEE International Conference on Computer Vision ICCV95, 1995, pp. 637-644.
10. Yang, M. H., Ahuja, N., Detecting Human Faces in Color Images, in Proceedings of the 5th IEEE International Conference on Image Processing ICIP98, 1998, vol. 1, pp. 127-130.
11. Fleck, M., Forsyth, D., Bregler, C., Finding Naked People, in Proceedings of the 4th European Conference on Computer Vision ECCV96, 1996, vol. 2, pp. 593-602.
