
Department of Electronic Engineering

FINAL YEAR PROJECT REPORT

BEngCE-2008/09-KLC-03

Human Motion Analysis

Student Name: Chu Kwong Wai
Student ID:
Supervisor: Dr. CHAN, K L
Assessor: Dr. CHAN, Stanley C F

Bachelor of Engineering (Honours) in Computer Engineering

Student Final Year Project Declaration

I have read the student handbook and I understand the meaning of academic dishonesty, in particular plagiarism and collusion. I declare that the work submitted for the final year project does not involve academic dishonesty. I give permission for my final year project work to be electronically scanned and, if found to involve academic dishonesty, I am aware of the consequences as stated in the Student Handbook.

Project Title: Human Motion Analysis

Student Name : Chu Kwong Wai

Student ID:

Signature

Date : 23-Apr-09

No part of this report may be reproduced, stored in a retrieval system, or transcribed in any form or by any means – electronic, mechanical, photocopying, recording or otherwise – without the prior written permission of City University of Hong Kong.


Table of Contents

List of Figures
Abstract
Chapter 1 Introduction
1.1 Background
1.2 Objectives
1.3 Challenges Faced
1.4 Chapter Outline
Chapter 2 Related Work
2.1 Background Subtraction
2.2 Skeletonization
2.3 Motion Tracking
Chapter 3 Methodology
3.1 System Overview
3.2 Background Subtraction Module
3.2.1 Background Modeling
3.2.2 K-means Algorithm
3.2.3 Foreground Detection
3.2.4 Median Filter
3.3 Analysis Module
3.3.1 Dilation Operation
3.3.2 Erosion Operation
3.3.3 Boundary Extraction
3.3.4 Removing the Non-body Objects
3.3.5 Extracting Extremal Points
3.3.6 Locating Interested Points
3.4 Tracking Module
Chapter 4 Results
4.1 System Configuration
4.2 Background Subtraction
4.3 Median Filter
4.4 Morphological Operations
4.5 Locating Interested Points
4.6 Tracking Process
4.6.1 Result of Kalman Filter
4.6.2 Object Tracking
Chapter 5 Discussion
Chapter 6 Conclusion
References

List of Figures

Figure 1. Flow chart of human motion analysis system
Figure 2. Demonstration of a region-based background modeling process
Figure 3. Modeling composite background
Figure 4. Flow chart of the K-means algorithm employed
Figure 5. Histogram of an image block with two background regions
Figure 6. Results of background subtraction and median filtering
Figure 7. Calculating the median value of a pixel neighborhood
Figure 8. Result of dilation operation
Figure 9. 8-connected neighborhood
Figure 10. Illustration of dilation operation
Figure 11. Result of erosion operation
Figure 12. Illustration of erosion operation
Figure 13. Result of boundary extraction
Figure 14. Illustration of boundary extraction operation
Figure 15. Removal of non-body objects
Figure 16. Connected component labeling
Figure 17. Process of extracting extremal points
Figure 18. Interested points located
Figure 19. Locating elbow position
Figure 20. Locating shoulder position
Figure 21. Illustration of Kalman filter
Figure 22. Results of background subtraction
Figure 23. Comparison between frame difference and region-based approaches
Figure 24. Median filtering
Figure 25. Two sets of results of morphological operations
Figure 26. Located interested points
Figure 27. An unsuccessful case of locating interested points
Figure 28. Result of tracking the centroid of the swimmer
Figure 29. Applying Kalman filter on tracking interested points
Figure 30. Object tracking

Abstract

This project aims to analyze human motion in an image sequence of swimming. The proposed system consists of three main modules. First, a background subtraction module segments the human region in spite of the large scene variations in the monitored pool area. Second, an analysis module obtains a skeleton of the segmented human region. Finally, a tracking module acquires the trajectories of the body parts. The background subtraction module employs a region-based background modeling approach that can differentiate the moving human subject from the non-stationary water surface. The skeleton is obtained by locating interested points such as the hands, elbows and shoulders. The tracking module utilizes a Kalman filter to find the position of the human body and the trajectories of the body parts. We test the human motion analysis system on video clips of breaststroke swimming, with promising experimental results.


Chapter 1 Introduction

1.1 Background

Human motion analysis has been an active research topic in computer vision for decades. It aims at developing systems that can understand human activities; such a system would be an essential part of many applications, including human-robot interaction, sports video analysis and visual surveillance. For instance, human motion analysis enables a robot to understand human motion, which is very important for effective communication between robots and humans. Triesch and von der Malsburg [1] proposed a mechanism for implementing a gesture interface for human-robot interaction, which enables a robot to understand simple gesture commands such as grasping and moving objects. It is believed that more natural and intelligent communication between humans and robots could be achieved in the future. In sports applications, a human motion analysis system can segment different parts of the human body, track the trajectories of various joints over an image sequence, and reconstruct the 3-D human body structure. This information is particularly useful for coaches who want to analyze the performance of athletes and study the motions that cause injuries. Human motion analysis systems are also often seen in automatic visual surveillance, which aims to perform monitoring tasks automatically in order to improve safety in different environments. For example, [2] provides commercial automatic video surveillance products that can detect, record and transmit various kinds of motion activity. Besides, Eng et al. [3] developed a live visual surveillance system for detecting early drowning at swimming pools, which can greatly assist the work of lifeguards.

Human motion analysis is the preprocessing step of human motion recognition: before any recognition can be performed, a variety of information about the body parts must be obtained by the analysis process. Traditionally, the work of human motion analysis is divided into the following parts: 1) segmenting the human body from video frames using a background subtraction approach, which usually needs to build a background model; 2) locating interested joints on the human body, which often makes use of prior information or employs a 2D or 3D human model; 3) tracking the movement of the human body, which relies on the correspondence of image features between consecutive video frames using information about the body part's position, shape and velocity.


1.2 Objectives

In this report we address the problem of developing a system for analyzing and tracking the motion of swimmers at a pool. The final year project has the following objectives:

• Obtain the body region of the swimmer from an image sequence of a swimming video.

• Locate some interested points on the upper part of the human body.

• Track the human body along the image sequence in order to get the trajectories of the swimmer.

1.3 Challenges Faced

One key challenge we faced during the project is the relatively high level of noise in the segmentation of the human body. Performing background subtraction on video of an aquatic environment is much more complex than doing so on video of a terrestrial scene: the water surface keeps fluctuating, which makes the background non-stationary, and the intensity variation of scene pixels over time has a very wide distribution, as demonstrated in Figure 1 of [4]. Background subtraction therefore cannot be done with the traditional frame difference approach, which differentiates foreground objects from the background based on the intensity difference of pixels between the foreground and background images. In our system, we use a region-based approach for background subtraction, which is less sensitive to the wide intensity change over video frames. Further discussion is provided in the methodology chapter.

Apart from the above problem in background subtraction, there are also difficulties in locating features and in tracking. Some feature points, such as the hands, are relatively easy to locate since they are the most prominent parts of the body; locating the extremal points on the boundary of the human body successfully gives the hand positions, as discussed in the methodology chapter. Nevertheless, other feature points such as the shoulders and elbows are much more difficult to locate, since they lack an apparent characteristic like the hands. Sometimes those feature points are covered by water or cannot be differentiated from the rest of the body, which makes the locating process even harder. Using a predefined ratio between those feature points can help the locating process, as also discussed in the methodology chapter.

For the tracking process, the major problem is that the measurement data is sometimes missing due to the background subtraction process, which affects the tracking performance. A discussion of this issue is provided with the tracking results in the results chapter.

1.4 Chapter Outline

This report is organized as follows. In Chapter 2, we review relevant prior work on human motion analysis, focusing on background subtraction, skeletonization of the extracted human region, and motion tracking. Chapter 3 gives a detailed explanation of the methodologies employed in the analysis system. In Chapter 4, experimental results of each stage of the human motion analysis are provided. Finally, the discussion and conclusion are presented in Chapters 5 and 6 respectively.


Chapter 2 Related Work

2.1 Background Subtraction

In order to handle the highly dynamic aquatic background, Eng et al. [4] proposed a homogeneous region-based modeling approach for background subtraction. This approach takes the background as a composition of homogeneous region processes. A sequence of background frames is needed to form the homogeneous background regions. Each background frame is first divided into a number of non-overlapping square blocks; a hierarchical K-means algorithm is then performed on the color vectors collected from these blocks to form homogeneous regions. This approach enables an efficient spatial searching scheme with long-range spatial displacement search ability to capture the movements of background pixels. However, an additional thresholding scheme is needed in the foreground detection process in order to obtain good results.

2.2 Skeletonization

Fujiyoshi and Lipton [5] proposed a real-time, robust and computationally cheap approach to obtain a "star" skeleton of a human target. This approach first detects moving targets and extracts their boundaries. Each skeleton is obtained by detecting and extracting the extremal points on the boundary of the segmented target object. The "star" skeleton consists only of the extremals of the target object joined to its centroid. It can be used to determine body posture and the cyclic motion of body segments, which are useful cues for analyzing human activities such as walking and running. Since this approach is not iterative, it is computationally inexpensive. It also provides a mechanism for controlling scale sensitivity, and it does not rely on any a priori human model, which makes it very useful for real-time applications. However, this "star" skeleton consists of only the extremal points, such as the hands, legs and head; individual joints like the elbows, shoulders, ankles and knees cannot be determined. Therefore, we propose a new approach to locate some interested points on the upper part of the human body, including the hands, elbows and shoulders. Details are provided in the methodology chapter.

2.3 Motion Tracking

In the work of [6], a discrete Kalman filter is adopted for tracking human walking motion. The five major equations of the Kalman filter are divided into two groups: the time update equations and the measurement update equations. The time update equations predict the current state and error covariance, and the measurement update equations correct the current state and error covariance predicted by the time update equations. A similar approach is adopted in this project; further explanation of the Kalman filter is provided in the methodology chapter. Chan [6] assumed that the human subject walks at a constant velocity, and the major results are obtained based on this assumption; a discussion of the case where the human subject walks at non-constant velocity is also provided. Apart from the global position of the human subject, 8 joints of the body are also tracked: the left shoulder, left elbow, left hip, left knee, right shoulder, right elbow, right hip and right knee. A similarity matching scheme is used to measure the performance of the tracking system. Most of the time the system performs well, especially at tracking the global position of the human subject; however, the performance deteriorates when joints are hidden due to self-occlusion.


Chapter 3 Methodology

Figure 1. Flow chart of human motion analysis system.

3.1 System overview

The flow chart of the proposed human motion analysis system is shown in Figure 1. The system consists of two phases: the learning phase and the detection phase. In the learning phase, the objective is to build up a background model, which is the essential part of the background subtraction module. The background model is constructed using a sequence of background images. The system enters the detection phase after the background model is obtained. Each input image with foreground objects is first processed by the background subtraction module, which classifies all image pixels into background and foreground pixels. The background pixels are passed back to the background model for updating. Meanwhile, the module outputs an image constructed from the foreground pixels, which is the input of the analysis module. The analysis module performs some morphological operations on the input image to improve the shape of the extracted human body, and then constructs a skeleton of the human body from the boundary of the body and some interested points located in the upper part of the body. The tracking module uses this skeleton image as input and applies a Kalman filter to track the target object. The Kalman filter outputs the body part trajectories, and the whole process finishes there.


3.2 Background subtraction module

3.2.1 Background modeling

Figure 2. Demonstration of a region-based background modeling process: (a) original image; (b) formation of non-overlapping image blocks; (c) one block of pixels; (d) histogram of the block (number of pixels against gray level).

A region-based approach is employed for background modeling; Figure 2 demonstrates the process. Figure 2(a) shows an original image. It is first partitioned into a number of non-overlapping square blocks, as presented in Figure 2(b). For each of those blocks, represented by Figure 2(c), a histogram is established to record the pixel values of the block. This process is repeated over a specific number of background images. Figure 2(d) shows the histogram of a particular block accumulated over 50 image frames. The histogram resembles a Gaussian distribution, hence the standard deviation and mean value can be used to represent the characteristics of that image block. Any image pixel that falls within the mean ± 4 standard deviations is considered a background pixel.
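As a concrete illustration of the learning phase, the following C++ sketch accumulates the statistics of one image block over the training frames and applies the mean ± 4 SD test. It assumes a single-Gaussian block and grayscale pixels; the structure and names are illustrative only, not the project's actual code.

#include <cmath>

// Statistics of one background block, accumulated over the sequence
// of background frames during the learning phase.
struct BlockModel {
    double sum = 0.0, sumSq = 0.0;
    long count = 0;

    void add(unsigned char pixel) {        // called once per pixel, per frame
        sum += pixel;
        sumSq += double(pixel) * pixel;
        ++count;
    }
    double mean() const { return sum / count; }
    double stddev() const {
        double m = mean();
        return std::sqrt(sumSq / count - m * m);
    }
    // A pixel within mean +/- 4 standard deviations is background.
    bool isBackground(unsigned char pixel) const {
        return std::fabs(pixel - mean()) <= 4.0 * stddev();
    }
};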

Figure 3. Modeling composite background: (a) formation of non-overlapping image blocks; (b) one block of composite background; (c) multi-Gaussian distributed histogram (number of pixels against gray level).


However, a composite background can exist in some blocks, whose histograms are then characterized by a multi-Gaussian model. Figure 3(b) shows an image block with two background regions: water and a lane divider. Figure 3(c) shows the corresponding histogram. A K-means algorithm is employed to obtain the mean value and standard deviation of the individual Gaussian distributions for background pixel identification.

3.2.2 K-means algorithm

Figure 4. Flow chart of the K-means algorithm employed.

Figure 4 demonstrates how the K-means algorithm works. The algorithm first assumes that only one group of background pixels exists in an image block, and calculates the mean value and standard deviation of the block histogram. If the standard deviation is larger than a threshold value, a new group is added. Pixels inside the block are then assigned to the closest group, and the mean value and standard deviation of each group are recalculated. The algorithm outputs the mean value and standard deviation of all background regions only when the standard deviation of each group is smaller than the threshold value, or when the number of groups is larger than two.
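To make the procedure concrete, a minimal C++ sketch of this splitting scheme is given below. It operates on the pixel samples of one block, caps the number of groups at two as in Figure 4, and seeds the second group at mean ± SD; the names, the iteration count and the seeding choice are assumptions, not taken from the project code.

#include <cmath>
#include <vector>

struct Group { double mean = 0.0, sd = 0.0; };

// Mean and standard deviation of a set of pixel samples.
static Group stats(const std::vector<double>& v) {
    double s = 0.0, s2 = 0.0;
    for (double x : v) { s += x; s2 += x * x; }
    Group g;
    g.mean = s / v.size();
    g.sd = std::sqrt(s2 / v.size() - g.mean * g.mean);
    return g;
}

// Split the samples of one block into at most two background groups:
// start with a single group; if its spread exceeds the threshold, add
// a second group and repeatedly re-assign each sample to the closest
// mean (the K-means loop of Figure 4).
std::vector<Group> splitBlock(const std::vector<double>& samples,
                              double sdThreshold) {
    Group all = stats(samples);
    if (all.sd <= sdThreshold) return { all };

    double m1 = all.mean - all.sd, m2 = all.mean + all.sd; // seed two means
    std::vector<double> a, b;
    for (int iter = 0; iter < 20; ++iter) {
        a.clear(); b.clear();
        for (double x : samples)
            (std::fabs(x - m1) <= std::fabs(x - m2) ? a : b).push_back(x);
        if (a.empty() || b.empty()) return { all };        // degenerate split
        m1 = stats(a).mean;
        m2 = stats(b).mean;
    }
    return { stats(a), stats(b) };
}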

3.2.3 Foreground detection

Figure 5. Histogram of an image block with two background regions.

When the system enters the detection phase, each image pixel is examined with respect to its own image block histogram. It is first assigned to the closest background region as follows. Suppose two background regions exist in a block, group A and group B as shown in Figure 5, with mean values Ma and Mb and standard deviations SDa and SDb respectively, and let p be one of the pixels of the block.

If the absolute difference between p and Ma is smaller than that between p and Mb,
then p belongs to group A;
else p belongs to group B.

Assume that p belongs to group A in this example. If the value of p falls within the region under the curve of A, p is a background pixel; otherwise, it belongs to the foreground:

If |p - Ma| < N x SDa (i.e. p falls within the region under the curve of A),
then p is a background pixel;
else p is a foreground pixel.

The value of N is determined empirically.

Any detected background pixel is added into its own image block histogram, which is then used for the foreground detection of the next image frame.
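In code, the two tests above collapse into a few lines. The following C++ sketch assumes `groups` holds the per-block mean/SD pairs produced by the K-means step; the names are illustrative.

#include <cmath>
#include <vector>

struct Group { double mean, sd; };

// Classify a pixel p against the background groups of its image block:
// assign it to the group with the closest mean, then test whether it
// falls under that group's Gaussian curve (|p - M| < N x SD).
bool isForeground(double p, const std::vector<Group>& groups, double N) {
    const Group* closest = &groups.front();
    for (const Group& g : groups)
        if (std::fabs(p - g.mean) < std::fabs(p - closest->mean))
            closest = &g;
    return std::fabs(p - closest->mean) >= N * closest->sd;
}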

3.2.4 Median filter

Figure 6. (a) Original image; (b) image after background subtraction; (c) image after median filtering.


Every input image with foreground objects is processed by the background subtraction module; the foreground objects are extracted to form an output image, as presented in Figure 6(b). This output image is then processed by a median filter to remove some of the noise caused by the background subtraction, and the result is shown in Figure 6(c).

Figure 7. Calculating the median value of a pixel neighborhood. [7]

A spatial median filter is employed in the system to remove the speckle noise in the intermediate result. Each pixel in the image is examined by the median filter in turn; the filter looks at each pixel's neighbors to find a representative value of its surroundings, and the median of the neighboring pixel values replaces the value of the pixel being examined. The median is determined as follows: sort all the pixel values in the surrounding neighborhood, together with the pixel being examined, into numerical order; the pixel being examined is then replaced with the middle value. An illustration is given in Figure 7, where a 3x3 square neighborhood is used. The pixel at the center with value 150 is being examined, and it is replaced with the middle value 124 after all neighboring pixel values are sorted into numerical order.
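A direct implementation of this filter is sketched below, assuming a grayscale image stored row-major; border pixels with an incomplete neighborhood are simply copied. The function name and signature are illustrative.

#include <algorithm>
#include <vector>

// Apply a square median filter of odd size k to a grayscale image
// stored row-major (width w, height h). Pixels whose neighborhood
// falls outside the image are copied unchanged.
std::vector<unsigned char> medianFilter(const std::vector<unsigned char>& src,
                                        int w, int h, int k) {
    std::vector<unsigned char> dst(src);
    int r = k / 2;
    std::vector<unsigned char> window;
    for (int y = r; y < h - r; ++y)
        for (int x = r; x < w - r; ++x) {
            window.clear();
            for (int dy = -r; dy <= r; ++dy)      // gather the neighborhood
                for (int dx = -r; dx <= r; ++dx)
                    window.push_back(src[(y + dy) * w + (x + dx)]);
            std::nth_element(window.begin(),
                             window.begin() + window.size() / 2,
                             window.end());       // partial sort to the median
            dst[y * w + x] = window[window.size() / 2];
        }
    return dst;
}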

3.3 Analysis module

After the background subtraction process, some morphological operations are performed in order to obtain a better shape of the target objects. The first morphological operation is dilation.

3.3.1 Dilation operation

Dilation is a basic operation in mathematical morphology. Let E be a Euclidean space, A a binary image in E, and B a structuring element. The dilation of A by B is defined by [8]:

$$A \oplus B = \bigcup_{b \in B} A_b \quad (1)$$

where $A_b$ is the translation of A by the vector b.


Figure 8. Result of dilation operation: (a) image after background subtraction; (b) image after dilation.

The effect of the dilation operation is to enlarge the foreground regions, filling up holes within those regions; in addition, nearby disjoint foreground objects can be connected. Figure 8 shows the result of this operation. In Figure 8(a), the image before dilation, the swimmer's body is separated into several disjoint parts after the background subtraction, and some holes exist inside the body region. In Figure 8(b), the result of the dilation operation, the holes inside the body region are filled up and the disjoint parts are reconnected to form a complete human body.


Figure 9. 8-connected neighborhood.

Figure 10. Illustration of dilation operation (white pixel: background; black pixel: foreground): (a) before dilation; (b) after dilation.

An 8-connected neighborhood scheme is employed in the dilation operation. For a pixel p shown in Figure 9, the eight pixels surrounding it are considered its 8-connected neighbors. The scheme considers each pixel in the image in turn. Assume p is a background pixel: if at least one of its 8-connected neighbors is a foreground pixel, p is changed to a foreground pixel as well. Figure 10 illustrates this process. In Figure 10(a), p is a background pixel, but one of its 8-connected neighbors is a foreground pixel, so p is changed to a foreground pixel. The operation examines every pixel in the image, and finally the foreground region is expanded, as shown in Figure 10(b).

3.3.2 Erosion operation

Since the dilation operation expands the size of the foreground objects, the opposite morphological operation, erosion, is applied to the result of dilation in order to shrink those foreground objects back. Erosion is another basic operation in mathematical morphology. Assume that E is a Euclidean space, A is a binary image in E, and B is a structuring element. The erosion of A by B is defined by [9]:

$$A \ominus B = \{ z \in E \mid B_z \subseteq A \} \quad (2)$$

where $B_z$ is the translation of B by the vector z.


Figure 11. Result of erosion operation: (a) image after dilation; (b) image after erosion.

Figure 11(a) shows an image after the dilation operation, in which the foreground objects are expanded. An erosion operation is applied to this image to shrink the foreground objects. The result is shown in Figure 11(b): the size of the foreground objects is reduced, and their contours are improved.

Figure 12. Illustration of erosion operation (white pixel: background; black pixel: foreground): (a) before erosion; (b) after erosion.

Again, the 8-connected neighborhood scheme is employed here, and every pixel in the image is considered in turn. As shown in Figure 12(a), if p is a foreground pixel and at least one of its 8-connected neighbors is a background pixel, p is changed to a background pixel. After all pixels are examined, the foreground object is shrunk, as shown in Figure 12(b).
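Since dilation and erosion differ only in which pixel they flip, both fit in one routine. A binary C++ sketch over the 8-connected neighborhood is shown below (1 = foreground, 0 = background); it performs one pass over the interior pixels, and the names are illustrative.

#include <vector>

// One pass of 8-connected dilation (grow = true) or erosion
// (grow = false) over a binary image stored row-major (1 = foreground).
// Dilation turns a background pixel with a foreground neighbor into
// foreground; erosion does the opposite.
std::vector<int> morph(const std::vector<int>& src, int w, int h, bool grow) {
    std::vector<int> dst(src);
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            bool hit = false;   // does any 8-neighbor have the opposite value?
            for (int dy = -1; dy <= 1 && !hit; ++dy)
                for (int dx = -1; dx <= 1 && !hit; ++dx)
                    if (dy != 0 || dx != 0)
                        hit = src[(y + dy) * w + (x + dx)] == (grow ? 1 : 0);
            if (grow  && src[y * w + x] == 0 && hit) dst[y * w + x] = 1;
            if (!grow && src[y * w + x] == 1 && hit) dst[y * w + x] = 0;
        }
    return dst;
}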

3.3.3 Boundary extraction

After the erosion operation, the contours of the foreground objects are repaired. The next step is to obtain the boundary of those foreground objects, which is critical information for the subsequent steps in the analysis module. A boundary extraction operation handles this task.

Figure 13. Result of boundary extraction: (a) image before boundary extraction; (b) image after boundary extraction.


Boundary extraction aims to preserve the boundary pixels of the foreground objects while removing all other foreground pixels. The boundary pixels represent the contours of the foreground objects, which are valuable data for further analysis; with the interior foreground pixels removed, the analysis can be performed more efficiently. Figure 13(a) shows an image before boundary extraction. After the operation, only the boundary pixels of the foreground objects remain, as shown in Figure 13(b).

Figure 14. Illustration of boundary extraction operation: (a) original image; (b) image after erosion; (c) image with boundary pixels only.

The operation of boundary extraction can be described as follows:

Step 1. Apply the erosion operation to the original image.

Step 2. Subtract the eroded image from the original image.

Step 3. The pixels that remain are the boundary pixels.


The above process is illustrated in Figure 14, where Figure 14(b) is the eroded image obtained from the original image of Figure 14(a). Figure 14(c) shows the image of boundary pixels obtained by the subtraction.
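Reusing the morph() sketch above, the three steps reduce to an erosion followed by a per-pixel difference, e.g.:

#include <vector>

// Boundary extraction: erode the image, then keep only the original
// foreground pixels that the erosion removed - the object contours.
// Uses the morph() sketch from Section 3.3.2.
std::vector<int> extractBoundary(const std::vector<int>& img, int w, int h) {
    std::vector<int> eroded = morph(img, w, h, /*grow=*/false);
    std::vector<int> out(img.size(), 0);
    for (int i = 0; i < w * h; ++i)
        out[i] = img[i] && !eroded[i];   // original minus eroded
    return out;
}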

3.3.4 Removing the non-body objects

Figure 15. Removal of non-body objects: (a) image with foreground and background objects; (b) image after removing the background objects.

Figure 15(a) shows an image after the boundary extraction operation. Many small connected regions remain after the background subtraction. These connected regions, representing non-stationary background regions, are considered non-body objects and removed by a non-body removal scheme, which works as follows:

Label all connected regions in the image as objects.
For each object:
    if its area is smaller than a threshold value,
    then the object is removed.

The labeling of the connected regions mentioned above employs a connected component labeling algorithm, which uses a two-pass approach [10]:

On the first pass:

Step 1. Scan each pixel of the image.

Step 2. If the pixel is a foreground pixel:
    1. Get its 8-connected neighboring pixels.
    2. If no neighbors are labeled, assign a new label to the current pixel and continue.
    3. Otherwise, find the smallest label among the neighbors and assign it to the current pixel.
    4. Store the equivalence between neighboring labels.

On the second pass:

Step 1. Scan each pixel of the image.

Step 2. If the pixel is a foreground pixel:
    1. Re-label the pixel with the smallest equivalent label.

The process discussed above is illustrated in Figure 16.

Figure 16. Connected component labeling: (a) input image with foreground pixels of value 1; (b) result of the first pass, checking the 8-connected neighbors for labeling, with equivalent labels indicated; (c) result of the second pass, replacing all equivalent labels with the smallest label.
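For reference, the two passes might be implemented as below; this sketch uses a small union-find table for the stored equivalences and folds in the area-based removal described above. On the first pass only the four already-scanned 8-connected neighbors need to be examined. All names are illustrative.

#include <algorithm>
#include <vector>

// Union-find root with path halving; parent[] stores label equivalences.
static int findRoot(std::vector<int>& parent, int i) {
    while (parent[i] != i) i = parent[i] = parent[parent[i]];
    return i;
}

// Two-pass 8-connected component labeling on a binary image, followed
// by removal of components whose area is below minArea.
void labelAndRemoveSmall(std::vector<int>& img, int w, int h, int minArea) {
    std::vector<int> label(img.size(), 0), parent(1, 0);
    for (int y = 0; y < h; ++y)                     // first pass
        for (int x = 0; x < w; ++x) {
            if (!img[y * w + x]) continue;
            const int d[4][2] = {{-1,-1},{0,-1},{1,-1},{-1,0}};
            int best = 0;
            for (const auto& n : d) {               // already-scanned neighbors
                int nx = x + n[0], ny = y + n[1];
                if (nx < 0 || nx >= w || ny < 0) continue;
                int l = label[ny * w + nx];
                if (!l) continue;
                if (!best) { best = l; continue; }
                int a = findRoot(parent, l), b = findRoot(parent, best);
                if (a != b) parent[std::max(a, b)] = std::min(a, b); // equivalence
                best = std::min(best, l);
            }
            if (!best) {                            // no labeled neighbor: new label
                best = (int)parent.size();
                parent.push_back(best);
            }
            label[y * w + x] = best;
        }
    std::vector<int> area(parent.size(), 0);        // second pass
    for (int i = 0; i < w * h; ++i)
        if (img[i]) ++area[label[i] = findRoot(parent, label[i])];
    for (int i = 0; i < w * h; ++i)                 // non-body removal
        if (img[i] && area[label[i]] < minArea) img[i] = 0;
}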

3.3.5 Extracting extremal points

Figure 17. Process of extracting extremal points: (a) centroid of the boundary; (b) centroid distance plotted against border position; (c) distance profile after averaging filtering; (d) extremal points connected to the centroid.

After the boundary image is obtained by the boundary extraction operation, some extremal points on the human body, such as the hands and legs, have to be located. Before that can be done, however, the centroid of the human body has to be determined. Since the positions of the boundary pixels are already known, the centroid can be calculated as follows. Let $(c_x, c_y)$ be the position of the centroid; then

$$c_x = \frac{1}{N} \sum_{i=1}^{N} x_i \quad (3)$$

$$c_y = \frac{1}{N} \sum_{i=1}^{N} y_i \quad (4)$$

where N is the total number of boundary pixels and $(x_i, y_i)$ is the position of the i-th pixel on the boundary.

After the centroid of the human body is obtained, as shown in Figure 17(a), the distance between each boundary pixel and the centroid, shown in Figure 17(b), is calculated as

$$d_i = \sqrt{(x_i - c_x)^2 + (y_i - c_y)^2} \quad (5)$$

The distance profile in Figure 17(b) fluctuates heavily, so an averaging filter is applied to remove the noise; the result is shown in Figure 17(c). The local maxima in Figure 17(c) are taken as the extremal points, and the image in Figure 17(d) is constructed by connecting all of those extremal points to the centroid. The local maxima can be determined by finding the zero-crossings of the difference function

$$\delta_i = d_i - d_{i-1} \quad (6)$$
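Equations (3) to (6) translate almost directly into code. The C++ sketch below takes the ordered list of boundary pixels and returns the extremal points; the averaging window radius is an assumed value, as the report does not state the exact size.

#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Extremal points of a closed boundary: centroid (eqs. 3-4), distance
// profile (eq. 5), averaging filter, then local maxima found as sign
// changes of the difference function (eq. 6).
std::vector<Pt> extremalPoints(const std::vector<Pt>& boundary) {
    int n = (int)boundary.size();
    double cx = 0.0, cy = 0.0;
    for (const Pt& p : boundary) { cx += p.x; cy += p.y; }
    cx /= n; cy /= n;

    std::vector<double> d(n), s(n);
    for (int i = 0; i < n; ++i)                   // eq. (5)
        d[i] = std::hypot(boundary[i].x - cx, boundary[i].y - cy);

    const int r = 5;                              // smoothing radius (assumed)
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int j = -r; j <= r; ++j) sum += d[(i + j + n) % n];
        s[i] = sum / (2 * r + 1);
    }

    std::vector<Pt> ext;
    for (int i = 0; i < n; ++i) {                 // eq. (6): delta goes + to -
        double prev = s[i] - s[(i - 1 + n) % n];
        double next = s[(i + 1) % n] - s[i];
        if (prev > 0.0 && next <= 0.0) ext.push_back(boundary[i]);
    }
    return ext;
}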

3.3.6 Locating interested points

Figure 18. (a) Interested points located in the boundary image; (b) interested points overlaid on the original image.

Some interested points have to be located in the upper part of the human body: the shoulders, elbows and hands. A skeleton representing the upper part of the body can be constructed by connecting those interested points, as shown in Figure 18(a). Figure 18(b) illustrates the accuracy of the locating mechanism, with the interested points overlaid on the original image.

Since the positions of the two hands and the centroid are already known from the previous step, the locating mechanism only needs to locate the elbows and shoulders, as illustrated below.

Figure 19. Locating elbow position.

Figure 19 illustrates how the elbows are located. Since the positions of the two hands are known, the point d can be located by averaging those two positions. A line is then constructed by connecting point d to the centroid; breaking this line into two parts with a predefined ratio locates point e. A horizontal line L crossing point e is formed. Starting from point e and searching along L towards the left, two boundary pixels can be located; the position of the right elbow is the average of the coordinates of those two pixels. A similar approach locates the left elbow by searching towards the right from point e.

Figure 20. Locating shoulder position.

The approach for locating the shoulder positions, illustrated in Figure 20, is similar to that for the elbows. Since the position of the centroid $(c_x, c_y)$ is known, a horizontal line L crossing the centroid is drawn. Starting from the centroid, a boundary pixel a is found by searching along L towards the left. The position of the right shoulder is then located by partitioning the line segment between point a and the centroid with a predefined ratio. A similar approach obtains another boundary pixel b and the position of the left shoulder.
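A sketch of the construction for the right elbow is given below; `ratio` is the predefined ratio mentioned above (its exact value is not stated here, so it is left as a parameter), and `isBoundary` is a hypothetical lookup into the boundary image. The shoulder search follows the same pattern, starting from the centroid instead of point e.

#include <vector>

struct Pt { double x, y; };

// Locate the right elbow: d = midpoint of the two hands; e divides the
// segment from d to the centroid by a predefined ratio; then search
// left from e along a horizontal line for two boundary pixels and
// average them. isBoundary() is a hypothetical boundary-image lookup.
Pt rightElbow(Pt leftHand, Pt rightHand, Pt centroid, double ratio,
              bool (*isBoundary)(int x, int y)) {
    Pt d = { (leftHand.x + rightHand.x) / 2.0,
             (leftHand.y + rightHand.y) / 2.0 };
    Pt e = { d.x + ratio * (centroid.x - d.x),
             d.y + ratio * (centroid.y - d.y) };
    std::vector<Pt> hits;
    for (int x = (int)e.x; x >= 0 && hits.size() < 2; --x)
        if (isBoundary(x, (int)e.y))
            hits.push_back({ double(x), e.y });
    if (hits.size() < 2) return e;            // boundary incomplete: fall back
    return { (hits[0].x + hits[1].x) / 2.0,
             (hits[0].y + hits[1].y) / 2.0 };
}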


3.4 Tracking module

Figure 21. Illustration of Kalman filter.

In the tracking module, we assume that the swimmer swims at a constant velocity. A Kalman filter is employed for tracking the target object; it estimates the position of the swimmer recursively, where the position is represented by the centroid of the body part. Figure 21 illustrates how the filter works. Five major equations are used, separated into two models: the prediction model and the correction model.

The first equation in the prediction model predicts the current position of the human body part from the last estimated value:

$$S_k^- = S_{k-1} + V \Delta T + W_k$$

where $S_k^-$ is the current predicted position, $S_{k-1}$ is the last estimated position, $V$ is the constant velocity, $\Delta T$ is the time between two consecutive video frames, and $W_k$ is a random noise variable. The second equation in the prediction model predicts the error covariance ahead:

$$P_k^- = P_{k-1} + Q$$

where $P_k^-$ is the predicted error covariance, $P_{k-1}$ is the last estimated error covariance, and $Q$ is the error covariance of the random noise variable $W_k$. Both the predicted position $S_k^-$ and the predicted error covariance $P_k^-$ are input into the correction model for correction.

The first equation in the correction model outputs the Kalman gain:

$$K_k = P_k^- \left( P_k^- + R \right)^{-1}$$

where $K_k$ is the Kalman gain and $R$ is the error covariance of the measurement value. The second equation in the correction model calculates the estimated position from the predicted position $S_k^-$, the measured position $Z_k$ and the Kalman gain $K_k$:

$$S_k = S_k^- + K_k \left( Z_k - S_k^- \right)$$

The third equation in the correction model updates the error covariance:

$$P_k = \left( 1 - K_k \right) P_k^-$$

Both the estimated position $S_k$ and the updated error covariance $P_k$ are fed back into the prediction model for predicting the next position. Therefore, once the initial estimates of $S_{k-1}$ and $P_{k-1}$ are supplied, the Kalman filter runs recursively, and the Kalman gain and the other variables are updated automatically.
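The five equations map directly onto a small recursive routine. Below is a scalar C++ sketch under the constant-velocity assumption; the noise covariances Q and R, the velocity V and the sample measurements are illustrative values, not the ones used in the project.

#include <cstdio>

// Scalar constant-velocity Kalman filter: one predict/correct cycle
// per video frame, following the five equations above.
struct Kalman1D {
    double S, P;         // last estimated position and error covariance
    double V, dT, Q, R;  // velocity, frame interval, process/measurement noise

    double step(double z) {              // z: measured position Z_k
        double Sp = S + V * dT;          // predict the position
        double Pp = P + Q;               // predict the error covariance
        double K  = Pp / (Pp + R);       // Kalman gain
        S = Sp + K * (z - Sp);           // update estimate with measurement
        P = (1.0 - K) * Pp;              // update the error covariance
        return S;
    }
};

int main() {
    Kalman1D f{ 0.0, 1.0, 2.0, 0.04, 0.01, 4.0 };     // illustrative values
    const double z[] = { 0.1, 0.2, 0.28, 0.41, 0.5 }; // fake measurements
    for (double m : z) std::printf("estimate: %.3f\n", f.step(m));
    return 0;
}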


Chapter 4 Results

4.1 System Configuration

This final year project is developed on a notebook computer with the following configuration:

CPU: Intel Core 2 Duo P8600, 2.40 GHz
Memory: 2 GB
Hard disk: 200 GB
Operating system: Windows Vista

The program is implemented in the C++ programming language with Microsoft Visual Studio 2008. The standard library headers used in this project are listed below:

#include <stdlib.h>

#include <iostream>

#include <fstream>

#include <string>

#include <sstream>

#include <math.h>

The video data used for testing is in MPEG-2 format with a 720x576 resolution and a frame rate of 25 fps.


4.2 Background Subtraction

Figure 22. Images in the first row are the originals; images in the second row are the corresponding results of background subtraction.

Figure 22 shows the results of background subtraction, where 30 image frames are used to build the initial background model. The major remaining noise consists of the pixels of the lane dividers on the water surface and those on the bottom of the swimming pool. The lane divider on the water surface floats and waves with the movement of the water, so its position differs from frame to frame; this poses a great challenge to the background subtraction process, which assumes that all background objects are static. Besides, the swimmer makes ripples on the water surface while swimming, and the constant color of the lane divider on the bottom of the pool is disturbed by the ripples. Both effects contribute non-stationary background pixels that are difficult to remove by background subtraction. Nevertheless, the results shown in Figure 22 are much better than those of the traditional frame difference approach; the figure below shows a comparison of the two approaches.

Figure 23. Comparison between frame difference and region-based approaches: (a) frame difference with a low threshold; (b) frame difference with a high threshold; (c) region-based approach.

The frame difference approach simply subtracts a background image (template) from an image with foreground objects; a pixel is assigned to the foreground if its difference is larger than a threshold value. However, the variation of the pixel values is very large: whether a low threshold (30) or a relatively high threshold (100) is used, many background objects remain after the background subtraction. The results of using a low threshold and a high threshold are shown in Figure 23(a) and (b) respectively. Figure 23(c) shows the result of the region-based approach we proposed. Far fewer background objects remain in the image than with the frame difference approach, which demonstrates that the region-based approach performs much better at background subtraction in the aquatic environment. This is because the region-based approach constructs the background model in blocks: when classifying a pixel as foreground or background, the pixel is compared with the Gaussian model of its own image block, rather than only with the corresponding pixel values in previous frames. The region-based approach is therefore less sensitive to the variation of pixel values and achieves better performance.


4.3 Median Filter

Figure 24. Median filtering: (a) image after background subtraction; (b), (c) and (d) images after applying a median filter to (a) with window sizes 3x3, 9x9 and 15x15 respectively.

The median filter can effectively remove the background objects that remain after the background subtraction operation, and the results are shown in Figure 24. The larger the window size used in the median filter, the more background objects are removed. The result shown in Figure 24(d) uses the largest window size, so the fewest background objects remain. However, a larger window size also distorts the shape of the foreground region (the human) more. Figure 24(b) uses the smallest window size, so the shape of the human body is the closest to that in Figure 24(a). Considering both the effect on the shape of the human body and the strength of the noise reduction, a window size of 9x9 is considered the best for median filtering; Figure 24(c) shows the result.

4.4 Morphological Operations


Figure 25. Two sets of results of morphological operations, shown in the left and right columns respectively. From top to bottom: original, dilation, erosion, boundary extraction, non-body object removal.

In the left column of Figure 25, the dilation operation successfully fills up all the holes within the body and connects the disjoint parts, which helps the subsequent boundary extraction operation obtain a complete boundary of the human body. However, if the distance between the disjoint parts is too large, the dilation operation cannot reconnect them; the right column of Figure 25 shows such a case. Since the dilation is unable to reconnect the widely separated disjoint parts, the subsequent non-body object removal operation mistakenly treats the isolated left-hand region of the swimmer as a background object and removes it from the image.


4.5 Locating Interested Points

Figure 26. (a) A skeleton drawn on the boundary image with the located interested points; (b) interested points overlaid on the original image.

Figure 26 shows a result of successfully locating the interested points, with a skeleton constructed from those points. This skeleton is valuable information for other projects such as human motion recognition, because high-level interpretation can be done by finding the angles between those points. The interested points are also highlighted and overlaid on the original image for performance evaluation; the accuracy of the locating process is quite high in this case.

Figure 27. An unsuccessful case of locating interested points.


The performance of the locating process relies heavily on the quality of the extracted boundary of the human body. If the extracted body is incomplete, the performance of the locating process deteriorates. Figure 27 shows such a case: although the position of the centroid is acceptable, all other interested points are located incorrectly.

4.6 Tracking Process

4.6.1 Result of Kalman Filter

Figure 28. Result of tracking the centroid of the swimmer: measured, predicted and estimated positions (Y-coordinate in the upper plot, X-coordinate in the lower plot) against frame number.

The Kalman filter has been tested on several video clips. Figure 28 shows the result of tracking the centroid of a swimmer in one clip. The measured, predicted and estimated positions are close to each other most of the time. When there are large errors in the measured value, the estimated value is shifted towards the predicted value, and the tracking process can continue. The measured value drops to zero in some frames because the swimmer is occluded by splash: the background subtraction cannot extract the body region successfully, so there is no measured data.

Figure 29. Applying the Kalman filter on tracking the interested points (centroid, shoulders, elbows and hands): panels (a)-(f).


The Kalman filter is also applied to tracking the movement of the interested points, which aims to assist the locating process. The results are shown in Figure 29. The accuracy of this tracking is high for the gestures in which the two hands are stretched out; the performance deteriorates when both hands are pulled back underneath the body, a self-occlusion problem.

4.6.2 Object Tracking

Figure 30. Object tracking (frames ordered from left to right, top to bottom): a rectangle is drawn around the body part of the swimmer.


Figure 30 shows the tracking of a swimmer in a video clip. A green rectangle is drawn around the body part of the swimmer to illustrate the performance of the tracking process. The rectangle is centered at the estimated position of the centroid, and its size is determined by the body part obtained in the background subtraction process. The tracking process focuses on the upper part of the body, so the size of the bounding rectangle changes according to the stretch of the two hands. As Figure 30 shows, the tracking performance is good, since the rectangle encloses the body part most of the time.


Chapter 5 Discussion

We proposed a region-based approach for the background modeling process, which is less sensitive to the wide variation of background pixels along the image sequence; the background subtraction results are therefore quite acceptable. Some interested points can be located on the upper part of the body, which is valuable information for motion recognition. Besides, the tracking module can successfully track the motion of the swimmer by employing the Kalman filter.

However, the system has three major limitations. The first is that the image frames used to build the background model must come from a stationary camera, because the region-based background subtraction relies heavily on the information provided by previous frames. While light source variation can be compensated by the adaptive updating of the background model, if the background is spatially changing due to waving water, no reliable information is available for this approach to determine whether a pixel belongs to the foreground or the background. Nevertheless, this limitation exists in the traditional frame difference approach too. To perform background subtraction in a dynamic environment, some researchers have used a skin model to identify the human body: a specific color range is used for the skin model, and any pixel falling into that range is considered foreground. This approach, however, assumes that no background objects share the same color range.

Another limitation exists in the analysis module: currently it is developed for analyzing the breaststroke swimming style only. We focused on the breaststroke because this style has three advantages for analysis: 1) the swimming speed of the breaststroke is relatively slow compared with freestyle and butterfly, which results in very small differences between two consecutive frames and hence makes the human region easier to locate and analyze; 2) the breaststroke is a very gentle style with much less splash, which reduces the difficulty of background subtraction; 3) in breaststroke swimming, the two hands are stretched out from the body most of the time, which makes them easier to differentiate from the trunk.

The last limitation is that the Kalman filter employed in the tracking module assumes the swimmer swims at a constant speed, and the prediction model in the filter is built on this assumption. However, even within the same swimming style the speed is not constant all the time, which contributes a large error to the prediction; the final estimated value is then shifted towards the measurement value. The tracking accuracy thus relies on the measurement value, so if the accuracy of the measurement deteriorates due to poor background subtraction, the overall performance of the tracking module is downgraded. Future work could address this problem by increasing the accuracy of the prediction model, for example by considering the periodicity of the swimming motion.


Chapter 6 Conclusion

This final year project proposed a vision system for analyzing a swimmer's motion. We adopted a region-based approach to background subtraction in order to extract the human body from an image sequence of a swimming video. After obtaining the human body region, some interested points, including the hands, elbows, shoulders and the centroid, are located in the upper part of the body. These are useful data for a subsequent project on human motion recognition. A tracking process is also established to track the movement of the swimmer, aiming to acquire the trajectories of the body parts. Future work includes developing a tracking process for tracing the movement of the interested points, which is intended to increase the accuracy of locating those points.


References

[1] J. Triesch and C. von der Malsburg, "A gesture interface for human-robot interaction," Proceedings of the Third IEEE International Conference on Automatic Face and Gesture Recognition, 14-16 April 1998, pp. 546-551.

[2] http://www.detec.no

[3] H. L. Eng, K. A. Toh, W. Y. Yau and J. X. Wang, "DEWS: A live visual surveillance system for early drowning detection at pool," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 18, No. 2, Feb. 2008, pp. 196-210.

[4] H. L. Eng, J. Wang, S. W. Kam and W. Y. Yau, "Robust human detection within a highly dynamic aquatic environment in real time," IEEE Transactions on Image Processing, Vol. 15, No. 6, June 2006, pp. 1583-1600.

[5] H. Fujiyoshi and A. J. Lipton, "Real-time human motion analysis by image skeletonization," Proceedings of the Fourth IEEE Workshop on Applications of Computer Vision (WACV '98), 19-21 Oct. 1998, pp. 15-21.

[6] K. L. Chan, "Video-based gait analysis by silhouette chamfer distance and Kalman filter," International Journal of Image and Graphics, Vol. 8, No. 3, July 2008, pp. 383-418.

[7] http://homepages.inf.ed.ac.uk/rbf/HIPR2/median.htm

[8] http://en.wikipedia.org/wiki/Dilation_(morphology)

[9] http://en.wikipedia.org/wiki/Erosion_(morphology)

[10] http://en.wikipedia.org/wiki/Connected_Component_Labeling