Human Body Pose Estimation With Particle Swarm Optimisation

S. Ivekovic, E. Trucco, Y. R. Petillot
Department of Electrical, Electronic and Computer Engineering, School of Engineering and Physical Sciences, Heriot-Watt University, Edinburgh EH14 4AS, United Kingdom

Abstract

In this paper we present the Particle Swarm Optimisation (PSO) approach to 3-D human body pose estimation from multi-view image data. Pose estimation from multiple views is a challenging problem requiring a powerful optimisation algorithm. PSO, a fairly recent technique, has been shown to tackle multimodal functions successfully and as such presents a promising tool for this problem. We use the silhouettes extracted from the multiple views to construct the objective function and a layered subdivision surface body model to represent the pose. A synthetic sequence of a human performing a rotating walk is used to analyse different parameter settings and the performance of the algorithm. Multi-view image data of the upper human body, typical of immersive videoconferencing scenarios, is then used to show the applicability of the method to real data. The upper-body pose estimation algorithm is formulated in two different ways, in 3-D space and in disparity space, and the resulting pose estimates are presented.

Keywords

Particle Swarm Optimisation, Human Body Pose Estimation, Disparity Space

1 Introduction

Human body pose estimation from images is an important topic in many research areas. Examples include surveillance (Haritaoglu et al. (2000)), motion capture (Deutscher et al. (2000)), human gait analysis (Veeraraghavan et al. (2005)), human activity recognition (Ben-Arie et al. (2002)), sign language recognition (Ong and Ranganath (2005)), medical analysis (Kohle et al. (1997)), communications (Poppe et al. (2005)) and animation (Collomosse et al. (2003)).

In this paper we address the problem of human body pose estimation for application in two distinct problem domains: surveillance with full-body passive millimeter-wave sensors (Haworth et al. (2007)) and immersive videoconferencing (Ivekovic and Trucco (2006); Isgro et al. (2004)).

In surveillance, millimeter-wave sensors have been investigated for detection of threats concealed under clothing. In the setup reported by Haworth et al. (Haworth et al. (2007)), the person is asked to enter the sensor’s field of view and perform a rotating walk around their main body axis. The resulting millimeter-wave frame sequence is then inspected automatically to locate abnormal textures and body shapes, suggesting the presence of possible threats. Fitting an a priori body model enables comparisons with predicted, synthetic frames generated by a simulator (Grafulla-Gonzalez et al. (2005)) and assists with tracking of suspicious areas. Single-view millimeter-wave data does not contain enough information to accurately fit a body model for this purpose. We performed a set of experiments with a simulated synthetic video sequence to show that, for the expected constrained body motion, augmenting the sensor with two video cameras provides a sufficient constraint for reliably solving the pose estimation problem and fitting a model.

© 200X by the Massachusetts Institute of Technology. Evolutionary Computation x(x): xxx-xxx

Figure 1: Human body pose estimation problem. The body pose is represented with a kinematic chain and constrained by the silhouette. Disc Thrower figure courtesy of FCIT.

Immersive videoconferencing aims at recreating the sense of presence that is an integral part of an ordinary meeting. As it is impossible to place a camera in the middle of the screen (where the ‘eyes’ of remote participants appear), the video acquired from remote cameras must be warped to match the local viewpoint. This requires high-quality view synthesis (Ivekovic and Trucco (2006)), which in turn requires high-quality stereo disparity data, normally difficult to achieve given the geometric constraints imposed by videoconferencing setups (Isgro et al. (2004)). By estimating the upper-body pose and fitting a model to the available disparity data, the quality of the view synthesis can be significantly improved and the sense of presence strengthened. We describe two different approaches to estimating the upper-body pose from multi-view video data. The first approach uses a body model in 3-D space and demonstrates the intuitive way of solving the problem. The second, more elegant approach uses a body model in disparity space.

In the remainder of this paper we first describe the motivation for using Particle Swarm Optimisation in Section 2. We give an overview of the related work in Section 3. Details of the PSO algorithm that we used in our experiments are described in Section 4 and the pose estimation algorithm in Section 5. Section 6 describes the experiments with the simulated synthetic sequence for surveillance purposes and Section 7 describes the upper-body pose estimation from multi-view video data in 3-D and disparity space. Conclusions are given in Section 8.

2 Motivation

Human body pose is usually represented with a kinematic chain structure consisting of joints and limbs (see Figure 1 for illustration). Each joint can have up to three rotational degrees of freedom, i.e., rotation around the x, y, and z axes, while the joint at the top of the chain, commonly referred to as the root joint, also possesses up to three translational degrees of freedom, defining the position of the entire structure in the reference space.

A simple example of a kinematic (sub)chain in a human body is a set of two angles which influence the movement of the elbow. A (simplified) model of the elbow consists of two rotational degrees of freedom, one in the shoulder and one in the elbow itself, as shown in Figure 2. Let us assume that the human arm is seen from four different viewpoints and that the evaluation function measures the amount of overlap between the silhouettes (see Section 5.3). A graph of such an evaluation function is shown in Figure 2. This example can be treated as a boundary case for pose estimation, as the human kinematic structure normally contains many more than 2 degrees of freedom. As can be seen from Figure 2, the evaluation function is clearly multimodal. A full kinematic structure describing the entire body will therefore most certainly require a global optimisation algorithm to adequately address the dimensionality and multimodality of the problem while keeping it fully automatic.

Figure 2: Illustration of the elbow joint parametrisation. [Plot: "Graph of the fitness function for lower arm", showing the fitness function value over elbow rotation and shoulder rotation.]

Particle Swarm Optimisation has been reported to perform well on classic multimodal test functions such as Schaffer’s f6 function (Kennedy and Eberhart (1995); Dawis (1991)). As it is a fairly recent technique, its use for optimisation problems such as human body pose estimation has not been widely explored yet. As shown in Section 5, formulating the pose estimation problem in the context of PSO is fairly straightforward. The simplicity of the PSO algorithm, combined with its ability to find global minima, is in itself a very appealing argument for its use, and in this paper we show that it can be used to solve the pose estimation problem as well.

3 Related Work

Human body pose estimation from video data is a well-known problem. In video production and medical contexts, motion and pose are often acquired with commercial systems based on a variety of markers attached to the body. Vision and graphics research has concentrated on pose and motion estimation without markers. Prototype solutions have been reported with and without explicit body models, using image data or 3D scanner data, and single or multiple-viewpoint video sequences. A not-so-recent survey can be found in (Gavrilla (1999)) and partly also in (Moeslund and Granum (2001)).

Body pose estimation from images using a human body model has been addressed by various researchers. Plaenkers and Fua (Plaenkers and Fua (2001)) report using an implicit surface body model (metaballs) which they fit to 3D stereo data constrained by silhouette contours. They use an implementation of the Levenberg-Marquardt optimisation method to fit the model to the stereo data obtained from 3 views. Carranza et al. (Carranza et al. (2003)) use a triangular mesh body model enhanced with a 1D Bezier spline and downhill optimisation constrained with silhouettes to recover the body pose. Poppe et al. (Poppe et al. (2005)) mention an application area similar to ours, virtual environments, and work on monocular video sequences using a simple body model composed of cylinders.

Our work differs from the related work mentioned above in two aspects. Unlike other reported work, we use a subdivision surface body model, the choice of which was motivated by the requirements of our application, and we recover the pose parameters using a global optimisation algorithm, Particle Swarm Optimisation (PSO).

Evolutionary computation research has addressed articulated body pose estimation and analysis in various ways, including Genetic Algorithms (Ye and Liu (2005); Shoji et al. (2000); Hsu et al. (2006)) and Neural Networks (Gross et al. (2006); Guo et al. (1994)). PSO has been successfully applied to various problems; however, we were not able to find many references to its use for articulated body pose estimation. The closest related work found is by Schutte et al. (Schutte et al. (2004)), who report a parallel PSO implementation and illustrate its performance on an example of a simple kinematic chain similar to our boundary case example described in Section 2.

4 Particle Swarm Optimisation

Particle Swarm Optimisation (PSO) is an evolutionary computation technique introduced by Kennedy and Eberhart in 1995 (Kennedy and Eberhart (1995)). The idea originated from the simulation of a simplified social model where the agents were thought of as collision-proof birds and the original intent was to graphically simulate the unpredictable choreography of a bird flock.

The original PSO algorithm was later modified by the authors and other researchers to improve its search capabilities and convergence. Several successful applications of PSO were also reported in the literature. For an overview of the relevant research in this area an interested reader will find a good starting point in (Eberhart and Shi (2004)).

One of the important modifications of PSO was introduced in 1998 by Shi and Eberhart (Shi and Eberhart (1998)). They changed the velocity update equation of the swarm by adding a parameter called the inertia weight, w. The aim of this parameter was to guide the search behaviour of the swarm. The larger the inertia parameter value, the more global the search, and vice versa. Several other modifications were added later on, but for the purpose of this paper we focus on the contribution of (Shi and Eberhart (1998)), as it is also the version of PSO which we used in our experiments.

In the following we give a brief overview of the PSO algorithm using the inertia weight parameter.

4.1 PSO Algorithm with Inertia Weight Parameter

Assume an $n$-dimensional search space $S \subseteq \mathbb{R}^n$, a swarm consisting of $N$ particles and a fitness function $f : S \to \mathbb{R}$ defined on the search space. The $i$-th particle is represented as an $n$-dimensional vector $X_i = (x_{i1}, x_{i2}, \ldots, x_{in})^T \in S$. The velocity of this particle is also an $n$-dimensional vector $V_i = (v_{i1}, v_{i2}, \ldots, v_{in})^T \in S$. The best position encountered by the $i$-th particle so far (personal best) is denoted as $P_i = (p_{i1}, p_{i2}, \ldots, p_{in})^T \in S$ and the value of the fitness function at that position as $pbest_i = f(P_i)$. The index of the particle with the overall best position so far (global best) is denoted as $g$ and $gbest = f(P_g)$. Let us also denote the optimum of the fitness function $f$ by $sol = f(P_s)$, where the index $s$ denotes the solution position in the search space. The PSO algorithm can then be stated as follows.

1. Initialisation:

• Initialise a population of particles $\{X_i\}$, $i = 1 \ldots N$, with random positions and velocities in the search space $S$. For each particle evaluate the desired fitness function and set $pbest_i = f(X_i)$. Identify the best particle in the swarm and store its index as $g$ and its position as $P_g$.

2. Repeat until $|sol - gbest| < \varepsilon$ for some predefined $\varepsilon$ or the number of iterations reaches a predefined limit:

• Move the swarm by updating the position of every particle according to the following two equations:

$$ V_i = w V_i + \varphi_1 (P_i - X_i) + \varphi_2 (P_g - X_i), \qquad X_i = X_i + V_i, \tag{1} $$

where $\varphi_1$ and $\varphi_2$ are random numbers defined by an upper limit which is a parameter of the system and $w$ is the inertia weight parameter.

• For $i = 1 \ldots N$ update $pbest_i$ and $gbest$.

The value of the inertia weight $w$ can remain constant throughout the search or change with time. The parameters $\varphi_1$ and $\varphi_2$ influence the social and cognition components of the swarm behaviour (Shi and Eberhart (1998)). Each is the product of a constant and a random number and can be written as $\varphi_1 = c_1 \, rand_1()$ and $\varphi_2 = c_2 \, rand_2()$, where $c_1$ and $c_2$ are two constants and $rand_1()$ and $rand_2()$ two random numbers in the interval $[0, 1]$. In our experiments the constants $c_1$ and $c_2$ were both set to 2, as recommended by (Shi and Eberhart (1998); Kennedy and Eberhart (1995)), which on average made the weights for the social and cognition components of the swarm equal to 1. Throughout the experiments with pose estimation we concentrated on the influence of the inertia weight parameter on the swarm behaviour and did not experiment with the social or cognition bias that could be introduced by manipulating the values of $\varphi_1$ and $\varphi_2$.
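The two steps above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes a minimisation problem over a box-shaped search space, draws one $\varphi_1$ and one $\varphi_2$ per particle and iteration, clamps velocities and positions to the box, and stops only on the iteration limit (the test $|sol - gbest| < \varepsilon$ needs a known optimum). The function name `pso` and all argument names are hypothetical.

```python
import random


def pso(fitness, bounds, n_particles=20, w=0.8, c1=2.0, c2=2.0,
        max_iter=200, seed=0):
    """Minimise `fitness` inside the box `bounds` with inertia-weight PSO."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Velocity clamp per dimension (an assumption, not stated in the text).
    v_max = [hi - lo for lo, hi in bounds]

    # Step 1: random positions X_i and velocities V_i, personal bests P_i.
    X = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    V = [[rng.uniform(-vm, vm) for vm in v_max] for _ in range(n_particles)]
    P = [x[:] for x in X]
    pbest = [fitness(x) for x in X]
    g = min(range(n_particles), key=pbest.__getitem__)  # index of global best

    # Step 2: move the swarm, then update pbest_i and gbest.
    for _ in range(max_iter):
        for i in range(n_particles):
            phi1 = c1 * rng.random()  # cognition component weight
            phi2 = c2 * rng.random()  # social component weight
            for d in range(dim):
                v = (w * V[i][d]
                     + phi1 * (P[i][d] - X[i][d])
                     + phi2 * (P[g][d] - X[i][d]))
                V[i][d] = max(-v_max[d], min(v_max[d], v))
                lo, hi = bounds[d]
                X[i][d] = max(lo, min(hi, X[i][d] + V[i][d]))
        for i in range(n_particles):
            value = fitness(X[i])
            if value < pbest[i]:
                pbest[i], P[i] = value, X[i][:]
        g = min(range(n_particles), key=pbest.__getitem__)
    return P[g], pbest[g]


def sphere(x):
    """Simple unimodal test function with its minimum at the origin."""
    return sum(v * v for v in x)
```

A call such as `pso(sphere, [(-5.0, 5.0), (-5.0, 5.0)])` returns the best position found and its fitness value. Because personal bests are only replaced by better values, the returned fitness can never be worse than the best initial particle.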

4.2 Inertia weight parameter

Inertia weight plays an important role in directing the exploratory behaviour of the particles. Higher inertia values push the particles to explore more of the search space and emphasise their individual velocity. This behaviour is useful when trying to coarsely explore the entire search space to find a good starting point for a multimodal optimisation. Lower inertia values force particles to focus on a smaller search area and move towards the best solution found so far. This approach makes sense when the global optimum region has been successfully identified and the exact optimum location in the search space is required.

Shi and Eberhart discussed the influence of different inertia values on the exploratory abilities of the swarm in (Shi and Eberhart (1998)). They used a constant inertia weight and one which decreased linearly with time. They tested inertia values in the interval [0, 1.4] and found that for a constant inertia value a medium value of w, i.e., 0.8 < w < 1.2, had the best chance of finding the global optimum while also requiring a moderate number of iterations. Large values of w, i.e., w > 1.2, made PSO behave more like a global search method, always trying to explore new search areas.

We decided to model the inertia change with an exponential function which allowed us to use a constant sampling step while gradually guiding the swarm from a global to a more local exploration:

$$ w(x) = \frac{A}{e^{x}}, \quad x \in [0, \ln(10A)], \tag{2} $$

where $A$ denotes the starting value of $w$ when $x = 0$. The optimisation terminated when $w(x)$ fell below 0.1. The sampling variable $x$ was incremented by $\Delta x = \ln(10A)/N$, where $N$ is the desired number of inertia weight changes (not the swarm size of Section 4.1).

The swarm was allowed to explore the search space with a particular inertia value for as long as every move of the swarm improved the current global optimum estimate. As soon as an iteration failed to improve the estimate, the value of the sampling variable increased and the inertia weight value decreased accordingly. This forced the swarm to identify possible optimum regions at the very beginning, then focus on the best few, and eventually settle down in the most promising region and find the global optimum.
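As an illustration of this schedule, the following sketch tracks the sampling variable of Equation 2 and lowers the inertia weight only after an iteration that fails to improve the global best. The class name `InertiaSchedule` and its methods are hypothetical, and the handling of the exact boundary $w = 0.1$ is subject to floating-point rounding.

```python
import math


class InertiaSchedule:
    """Exponential inertia weight w(x) = A / e^x with step ln(10A) / N."""

    def __init__(self, A, n_changes):
        self.A = A                                  # starting inertia, w(0) = A
        self.x = 0.0                                # sampling variable
        self.dx = math.log(10.0 * A) / n_changes    # constant sampling step

    @property
    def w(self):
        """Current inertia weight."""
        return self.A / math.exp(self.x)

    def report(self, improved):
        """Advance x only when the last swarm move did not improve gbest."""
        if not improved:
            self.x += self.dx

    def finished(self):
        """The optimisation stops once w(x) has fallen below 0.1."""
        return self.w < 0.1
```

In an optimisation loop, `report(improved)` would be called once per iteration with a flag saying whether the global best improved, and the loop would end when `finished()` returns true.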

5 Pose Estimation with PSO

In this section we describe the building blocks of our pose estimation algorithm. We begin with the body model representing the estimated pose, describe the PSO parametrisation of the pose, explain the evaluation function and describe two extensions of the original algorithm presented in Section 4 which allowed us to perform the pose estimation more accurately and efficiently.

5.1 Body Model

We use a 3-D layered subdivision surface body model consisting of two layers, the skeleton and the skin. The skeleton layer is defined as a set of transformation matrices which encode the information about the position and orientation of every joint with respect to its parent joint in the kinematic chain hierarchy:

$$ \mathit{Skeleton} = \{T_1^2, T_2^3, \ldots, T_{N-1}^N\}. \tag{3} $$

$N$ is the number of joints in the skeleton and $T_i^j$ is a homogeneous transformation matrix encoding the orientation of the coordinate system of joint $j$ with respect to the coordinate system of joint $i$. The top of the hierarchy is the root joint, which branches out into a number of kinematic sub-chains modelling the skeletal structure of the human body.

The skin layer represents the second layer in the model and is connected to the skeleton through the joints’ local coordinate systems. Each of the joints controls a certain area of the skin. Whenever a joint or limb moves, the corresponding part of the skin moves and deforms with it. The skin can therefore be described as a set of transformation matrices forming the skeleton layer combined with the sets of points influenced by each of the transformations:

$$ \mathit{Skin} = \{\{T_1^2, P_{T_1^2}\}, \{T_2^3, P_{T_2^3}\}, \ldots, \{T_{N-1}^N, P_{T_{N-1}^N}\}\} \tag{4} $$

In order to generate a smooth skin surface of the model, all the skin points $P_{T_i^j}$ have to be transformed into a common coordinate system such as the world coordinate system:

$$ P_w = T_w^1 * T_1^2 * \cdots * T_i^j * P_{T_i^j}, \quad \forall\, i, j \in [1, \ldots, N] \tag{5} $$

The points (vertices) $P_w^i$ are connected with edges into faces $F$ to form a base mesh:

$$ M^0 = \{V, F\}, \quad \text{where } V = \{P_w^i\}, \; F = \{P_w^{i_1}, P_w^{i_2}, P_w^{i_3}, P_w^{i_4}\} \tag{6} $$

which is then subdivided to obtain the smooth limit surface, i.e., the skin:

$$ M^\infty = S^\infty \cdots S^1 S^0 M^0, \tag{7} $$

where $S$ is a subdivision operator.
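The chaining of transformations in Equation 5 can be illustrated with a small planar example. The sketch below is not the body model used in this paper: it assumes a hypothetical two-joint arm (shoulder and elbow) that rotates about the z-axis only, with an upper arm of unit length, and maps a skin point given in the elbow frame into world coordinates. All function names are illustrative.

```python
import math


def matmul(A, B):
    """Product of two 4x4 homogeneous matrices stored as nested lists."""
    return [[sum(A[r][k] * B[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def rot_z(theta):
    """Homogeneous rotation about the z-axis by `theta` radians."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def translate(x, y, z):
    """Homogeneous translation."""
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def arm_chain(shoulder, elbow, upper_len=1.0):
    """Chain [T_w^1, T_1^2] for the assumed two-joint planar arm."""
    T_w_1 = rot_z(shoulder)                                     # shoulder in world
    T_1_2 = matmul(translate(upper_len, 0.0, 0.0), rot_z(elbow))  # elbow in shoulder
    return [T_w_1, T_1_2]


def to_world(chain, point):
    """Apply T_w^1 * T_1^2 * ... to a point given in the last joint's frame."""
    T = translate(0.0, 0.0, 0.0)  # identity
    for M in chain:
        T = matmul(T, M)
    p = [point[0], point[1], point[2], 1.0]
    return tuple(sum(T[r][k] * p[k] for k in range(4)) for r in range(3))
```

For instance, `to_world(arm_chain(math.pi / 2, 0.0), (1.0, 0.0, 0.0))` places the point at the end of a straight arm raised along the world y-axis.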



5.2 PSO Parametrisation of the Pose

In PSO, each particle represents a potential solution in the search space. Our search space is the space of all plausible skeleton configurations. The individual particle’s position vector in the search space is therefore specified as follows:

$$ X_i = (root_x, root_y, root_z, \alpha_x^0, \beta_y^0, \gamma_z^0, \alpha_x^1, \beta_y^1, \gamma_z^1, \ldots, \gamma_z^N), \tag{8} $$

where $root_x$, $root_y$, $root_z$ denote the position of the root joint (the first joint in the hierarchy) with respect to the reference (world) coordinate system, and $\alpha_x^i$, $\beta_y^i$, $\gamma_z^i$ refer to the rotational degrees of freedom of joint $i$ around the x, y, and z-axis, respectively.

5.3 Evaluation Function

The evaluation function compares the silhouettes extracted from the original images acquired by the cameras and the silhouettes generated by the model in its current pose. The original images can be acquired from $N$ different viewpoints. Each of the original images is foreground-background segmented and binarised to obtain a silhouette. Let the images containing the original silhouettes be denoted as $I_i^o$, $i = 1 \ldots N$. Similarly, let $I_i^m$, $i = 1 \ldots N$, denote images of the model silhouettes. The evaluation function can then be written as follows:

$$ E = \alpha \sum_{1}^{row} \sum_{1}^{col} (I_1^o \,\&\, I_1^m) + \beta \sum_{1}^{row} \sum_{1}^{col} (I_2^o \,\&\, I_2^m) + \gamma \sum_{1}^{row} \sum_{1}^{col} (I_3^o \,\&\, I_3^m) + \cdots + \omega \sum_{1}^{row} \sum_{1}^{col} (I_N^o \,\&\, I_N^m), \tag{9} $$

where $row$ and $col$ denote the image dimensions, i.e., number of rows and columns, respectively, and $\&$ denotes the logical AND operation. Coefficients $\alpha, \beta, \ldots, \omega$ are used to normalise the contribution of every view to the total error count. Let $T_1, T_2, \ldots, T_N$ denote the total number of pixels for the original silhouettes of the views $1 \ldots N$, respectively. The values of the coefficients are then $\alpha = 1/T_1$, $\beta = 1/T_2$, ..., $\omega = 1/T_N$.
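A direct reading of Equation 9 is sketched below for binary silhouettes stored as nested lists of 0 and 1. The function names are hypothetical, and skipping a view whose original silhouette is empty is an assumption made here to avoid a division by zero; how the value of $E$ is passed to the optimiser (for example, with which sign) is not covered by this sketch.

```python
def silhouette_overlap(observed, model):
    """Number of pixels that are foreground in both binary images (logical AND)."""
    return sum(o & m
               for row_o, row_m in zip(observed, model)
               for o, m in zip(row_o, row_m))


def evaluate(observed_views, model_views):
    """Sum over views of the overlap, each normalised by 1 / T_k."""
    total = 0.0
    for observed, model in zip(observed_views, model_views):
        t_k = sum(sum(row) for row in observed)  # pixels in original silhouette
        if t_k == 0:
            continue  # assumption: ignore views without a silhouette
        total += silhouette_overlap(observed, model) / t_k
    return total
```

With two views, an original silhouette of 4 pixels of which the model covers 3 contributes 0.75, and a perfectly covered second view contributes 1, so that $E = 1.75$.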

5.4 Hierarchical Approach

Although PSO has been reported to tackle high-dimensional problems successfully (Kennedy and Eberhart (1995)), the unmodified algorithm described in Section 4 failed to solve the complex pose estimation problem satisfactorily. The main reason for this was the computationally expensive evaluation function, which prevented us from using a large swarm size if the result was to be obtained within an acceptable time frame.

We decided to augment the original algorithm to exploit the inherent hierarchy present in the kinematic chain model. The hierarchy meant that the positions of the joints lower in the chain were constrained by the configurations of the joints higher in the chain. For example, the rotation of the shoulder joint directly restricted the position of the elbow and limited its plausible orientation.

This approach greatly reduced the combinatorial complexity of the search, as it constrained the search space much more than if all the joint values were optimised at once. Taking advantage of the dependency constraint between the individual joints is essential if the algorithm is to be more efficient, given that the swarm size and the time available were limited.

Incorporating the hierarchical search in the PSO algorithm proves to be fairly straightforward. The first step consists of identifying the appropriate kinematic chain structure and establishing dependencies between the individual joints. A commonly used set of guidelines on levels of articulation and dependencies between the joints of an articulated model is contained in the ISO H-Anim Standard (19774:200x (2006)). The Level of Articulation 1 (LOA1) of the H-Anim Standard most closely resembles the kinematic structure of our model, and the schematic illustration of the individual kinematic subchains is shown in Figure 3.

Figure 3: (left) full-body model used in synthetic experiments (centre) schematic colour-coded illustration of the kinematic subchains (right) upper-body model used in multi-view sequences

Once the hierarchy has been identified, we can optimise for the joint rotations in a hierarchical manner. Depending on the complexity of the evaluation function, level of articulation of individual joints, and dependency between them, this can be done one joint at a time or by grouping several joints together. We describe the hierarchical approach in more detail in the experimental section.
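The idea of optimising one group of joints at a time can be sketched as follows. The sketch is only an illustration under stated assumptions: parameters are grouped by index, groups are processed from the top of the chain downwards, parameters outside the current group are held fixed, and a plain random search stands in for the PSO run of each stage. The names `optimise_group` and `hierarchical_optimise` are hypothetical.

```python
import random


def optimise_group(fitness, x, group, bounds, rng, n_samples=300):
    """Stand-in optimiser: varies only the parameters whose indices are in `group`."""
    best_x, best_f = x[:], fitness(x)
    for _ in range(n_samples):
        candidate = best_x[:]
        for d in group:
            lo, hi = bounds[d]
            candidate[d] = rng.uniform(lo, hi)
        f = fitness(candidate)
        if f < best_f:
            best_x, best_f = candidate, f
    return best_x, best_f


def hierarchical_optimise(fitness, x0, groups, bounds, seed=0):
    """Optimise parameter groups in order, e.g. root first, then child joints."""
    rng = random.Random(seed)
    x, f = x0[:], fitness(x0)
    for group in groups:
        # Joints optimised in earlier stages stay fixed at their best values.
        x, f = optimise_group(fitness, x, group, bounds, rng)
    return x, f


def toy_fitness(x, target=(0.3, -0.5, 0.1)):
    """Hypothetical separable error used only to exercise the sketch."""
    return sum((a - b) ** 2 for a, b in zip(x, target))
```

Calling `hierarchical_optimise(toy_fitness, [0.0, 0.0, 0.0], [[0], [1], [2]], [(-1.0, 1.0)] * 3)` runs three one-dimensional searches instead of one three-dimensional search, which is the source of the reduction in search complexity described above.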

5.5 Continuity of Pose Estimates

When optimising the pose over a sequence of frames depicting a continuous body motion, one of the requirements is the temporal and spatial consistency of the estimates. Optimising the pose from scratch on a frame-by-frame basis can, in principle, produce a continuous articulated sequence. In practice, however, it is much more likely that seemingly plausible estimates are ambiguous if not additionally constrained.

For example, the evaluation function comparing the overlap between the silhouettes does not necessarily distinguish between the front and the back of the model. The optimisation can return any of the values ϕ ± kπ, k ∈ Z, as the estimate of the orientation in the reference space, as long as all the answers lie within the allowed parameter boundaries. This becomes obvious when a temporally consistent sequence of estimates is required.

A simple solution is to use the best estimate for frame t − 1 to initialise the optimisation in frame t. This does not in itself guarantee the smoothness of the continuous motion across the frames, as the uncertainty in the accuracy of the individual pose estimates makes it likely that the estimates will not be entirely consistent with the ground truth. It does, however, provide a constraint that enforces the proximity of the neighbouring estimates in the search space and also naturally reduces the complexity of the search.
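One possible way of seeding the swarm in frame t from the estimate of frame t − 1 is sketched below. The text does not specify how the initial particles are placed, so the Gaussian scatter around the previous estimate, the `spread` parameter (a fraction of each parameter range) and the function name `init_swarm_near` are all assumptions of this sketch.

```python
import random


def init_swarm_near(previous_best, bounds, n_particles, spread=0.05, seed=0):
    """Initial particle positions scattered around the previous frame's estimate."""
    rng = random.Random(seed)
    # Keep the previous estimate itself as the first particle.
    swarm = [list(previous_best)]
    while len(swarm) < n_particles:
        particle = []
        for value, (lo, hi) in zip(previous_best, bounds):
            sample = rng.gauss(value, spread * (hi - lo))
            particle.append(max(lo, min(hi, sample)))  # stay inside the bounds
        swarm.append(particle)
    return swarm
```

Combined with a low starting inertia, such an initialisation keeps the search close to the previous pose, which is the proximity constraint described above.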

6 Experiments with Synthetic Data

In this section we describe the experiments performed on the synthetic data sequence with known ground truth. We generated a sequence in which a simple full-body model was used to simulate a constrained rotating walk that would happen inside a millimeter-wave sensor. The aim of these experiments was to establish how reliably a pose could be estimated for such a constrained full-body motion, using only two camera views.

Figure 4: Example frames from the synthetic sequence used as constraints for the pose estimation. Front and top view are shown combined.

Table 1: Degrees of Freedom for Synthetic Data Tests

JOINT (index)           DOF
World orientation       1
Root Location (root)    1
Left Hip rotation       1
Right Hip rotation      1
TOTAL                   4

6.1 The Model and the Sequence

The simple body model used to generate the sequence is shown in Figure 3 (left). As the motion inside a sensor is fairly constrained, it was possible to approximate it with the 4 degrees of freedom shown in Table 1. The synthetic cameras were placed in front of the model and on top of the model, providing two useful constraints for the pose estimation. Figure 4 shows example silhouettes for the top and front view, extracted from the sequence. The sequence consists of 360 frames in which the model performs one full walking rotation around its main axis.

6.2 Evaluation Function

In this set of experiments we use only two views, front view and top view, to constrain the pose. This reduces the evaluation function from Equation 9 to:

$$ E = \alpha \sum_{1}^{row} \sum_{1}^{col} (I_f^o \,\&\, I_f^m) + \beta \sum_{1}^{row} \sum_{1}^{col} (I_t^o \,\&\, I_t^m), $$

where the indices $f$ and $t$ denote front and top view, respectively, and the rest of the parameters remain as in Equation 9.

6.3 PSO Parametrisation of the Pose

Each individual particle models the 4 degrees of freedom that change to influence the pose of the model:

$$ X_i = (root_x, \alpha_{world}, \alpha_{left\,hip}, \alpha_{right\,hip}), \tag{10} $$

where $root_x$ denotes the position of the root joint with respect to the world coordinate system (translation), $\alpha_{world}$ denotes the orientation of the model with respect to the world coordinate system, $\alpha_{left\,hip}$ denotes a left hip rotation and $\alpha_{right\,hip}$ denotes a right hip rotation.

6.4 PSO Performance Analysis

The Particle Swarm Optimisation algorithm has several parameters which influence its performance. Swarm size and the inertia weight parameter are two which, when chosen carefully, can significantly influence the behaviour of the swarm. In pose estimation we are not only interested in the accuracy of the results, but also in the efficiency of the estimation process. To this end we have performed several experiments, measuring how different parameter settings affect the uncertainty of the pose estimates. We present the results below.

The following two standard formulas for sample mean and standard deviation were used to calculate the uncertainty intervals of estimated parameter values:

$$ \bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{11} $$

$$ \sigma_x = \sqrt{\frac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{x})^2} \tag{12} $$
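Equations 11 and 12 correspond to `statistics.mean` and `statistics.stdev` in the Python standard library (the latter uses the $N - 1$ denominator). The sketch below assumes that an uncertainty interval spans one standard deviation on either side of the mean; the width of the interval is not stated in the text, and the function name is hypothetical.

```python
import statistics


def uncertainty_interval(samples):
    """Return (mean - sigma, mean + sigma) for repeated estimates of one parameter."""
    mean = statistics.mean(samples)      # Equation 11
    sigma = statistics.stdev(samples)    # Equation 12, N - 1 denominator
    return mean - sigma, mean + sigma
```

For the eight values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so $\sigma_x = \sqrt{32/7}$ and the interval is centred on 5 with that half-width.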

6.4.1 Swarm Size

Common sense suggests that the larger the number of particles, the more likely it is that one of the particles will explore the right region of the search space and find the optimal solution. However, a large swarm size comes at the price of a higher computational cost, which is not always affordable. In our case, the complexity of the evaluation function restricts the number of particles which can be used if the solution is to be obtained within some pre-defined time period.

The goal of the first experiment was to establish the best compromise between the size of the swarm, acceptable quality of the pose estimate and reasonable estimation time. We performed pose estimation with varying swarm sizes on a single frame of the sequence. The size of the swarm varied from 5 to 100 particles in increments of 5. The estimation with every individual swarm size was run 20 times to establish the uncertainty of the result. The plots in Figure 5 show the pose estimate uncertainty interval for each of the tested swarm sizes overlaid on top of the ground truth information for all 4 optimised parameters. The time necessary to get the estimate and the error value are also shown.

The orientation of the model with respect to the world reference frame is plotted in a separate plot, as its uncertainty interval is much larger than that of the other parameters. This is a consequence of the lack of the continuity constraint which we mention in Section 5.5. The optimisation performance is very good for the other three optimised parameters, as can be seen from the plots. Results shown in Figure 5 indicate that 40 particles present an acceptable compromise between the quality of the estimate and the required time.



Figure 5: Uncertainty interval analysis for varying population sizes. [Four plots against population size: "Uncertainty interval plot for the world rotation parameter" (parameter value, with ground truth); "Uncertainty plot for root position, left and right hip" (parameter value, with ground truth); "Best error estimate uncertainty interval plot" (error value, with ground truth); "Optimisation time uncertainty interval plot" (time).]

6.4.2 Inertia Weight

A dynamically changing inertia weight, modelled with an exponential function (Equation 2), allows us to influence the exploratory behaviour of the swarm as described in Section 4.2. The higher the initial inertia value, the more global the search; however, it also takes longer before the optimisation converges. Just as in the case of the swarm size, there is a trade-off between the extent of the global exploration and the time necessary for convergence. As our pose estimation problem becomes highly nonlinear with the increasing number of parameters, our next goal was to find the lowest possible inertia value which allows enough global exploration to locate the global optimum basin and then quickly converge towards the exact location. In this experiment we kept the swarm size constant at 40 particles, as suggested by the first experiment.

We again performed 20 repetitions of pose estimation for each different starting inertia value. The tested starting values were $w = 2.0/e^x$, beginning at $x = 1.0$ and increasing $x$ in steps of $\Delta x = 0.2$ (so that $w$ decreased) until $x = 3.0$. Figure 6 shows the results of the experiment. The experiments show that the inertia value $w = 2.0/e$, which we used as the highest starting point in this experiment, is also the best one. If the starting value is set any lower, the uncertainty in the error estimates increases. This conclusion is made under the assumption that each estimation is run from scratch without any prior knowledge of the pose.
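For reference, the tested starting values can be tabulated directly from $w = 2.0/e^x$. The short sketch below (function name hypothetical) lists the eleven pairs $(x, w)$; the largest starting value is $2.0/e \approx 0.736$ at $x = 1.0$ and the smallest is $2.0/e^3 \approx 0.0996$ at $x = 3.0$.

```python
import math


def starting_inertias(A=2.0, x_start=1.0, x_end=3.0, dx=0.2):
    """List of (x, w) pairs with w = A / e^x for x from x_start to x_end."""
    n_steps = round((x_end - x_start) / dx)
    return [(x_start + k * dx, A / math.exp(x_start + k * dx))
            for k in range(n_steps + 1)]
```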

Figure 6: Uncertainty interval analysis for varying inertia weight values. [Four plots against inertia weight value: "World rotation estimate uncertainty interval" (parameter value, with ground truth); "Uncertainty plot for root position, left and right hip" (parameter value, with ground truth); "Best error estimate uncertainty interval plot" (error value, with ground truth); "Optimisation time uncertainty interval plot" (time).]

6.4.3 Pose continuity

In Section 5.5 we mentioned the need for consistent estimates and suggested that initialising the search in the new frame with the best result of the previous frame will ensure the coherency of the estimates. In order to force the swarm to only explore close to the suggested initial estimate, the inertia value has to be kept low as well, which additionally reduces the complexity of the search.

In this experiment we ran the pose estimation for the first half of the synthetic sequence, initialising the search in every new frame with the best estimate from the previous frame. The swarm size was set to 40 particles and the inertia weight started at w = 0.3. Figure 7 shows the results, with the PSO estimates on the left and the ground truth on the right. Experimental results confirm that the algorithm is capable of estimating the pose reliably.

Figure 7: Results of the pose estimation for a continuous sequence of 180 frames. Left panel: PSO estimate; right panel: ground truth. Both panels plot parameter value against frame number.



Figure 8: Example pose from the multi-view data set. Each pose is acquired from 4 viewpoints simultaneously.

7 Experiments with Multi-View Still Images
7.1 Data Set
The data for this set of experiments consists of a set of different upper-body poses, typical of immersive videoconferencing scenarios, acquired from 4 different viewpoints. An example pose, shown from all four views, is given in Figure 8. The data was acquired using the setup shown in Figure 9, consisting of 4 FireWire webcams and off-the-shelf lighting.

Figure 9: Multi-view camera setup used to acquire the data and a detailed view of the lighting.

7.2 Upper Body Model
The upper body model (see Figure 3(c)) consists of 10 joints with a total of 20 degrees of freedom, 3 translations and 17 rotations. Limb lengths are fixed. Table 2 shows the detailed list of degrees of freedom used. As the body model is a subdivision surface, it can be used at various levels of smoothness. Two levels of subdivision were sufficient to achieve a shape which was smooth enough to interpret the pose.

7.3 PSO Parametrisation of Pose Estimation
The search space is 20-dimensional and the individual particle's position vector in the search space is specified as follows:

\[
X_i = (root_x, root_y, root_z, \alpha^0_x, \beta^0_y, \gamma^0_z, \alpha^1_x, \gamma^1_z, \ldots, \alpha^7_x), \tag{13}
\]

where $root_x$, $root_y$, $root_z$ denote the position of the root joint with respect to the world coordinate system, $\alpha^0_x$ denotes a rotation around the x-axis of the root joint coordinate system by angle $\alpha$, $\gamma^1_z$ denotes a rotation around the z-axis of the clavicle-neck joint by angle $\gamma$, etc.
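The layout of the position vector can be made concrete with a short Python sketch. The rotation axes per joint are read off Tables 2 and 3; the ordering of the components elided by "..." in Equation 13 is an assumption (joints in index order, axes in the order α, β, γ), and the names used here are illustrative only.

```python
# Joint index -> rotation axes used for that joint (from Tables 2 and 3).
ROTATION_AXES = {
    0: ("alpha_x", "beta_y", "gamma_z"),  # root orientation
    1: ("alpha_x", "gamma_z"),            # clavicle-neck
    2: ("alpha_x", "gamma_z"),            # clavicle-left
    3: ("alpha_x", "beta_y", "gamma_z"),  # left shoulder
    4: ("alpha_x",),                      # left elbow
    5: ("alpha_x", "gamma_z"),            # clavicle-right
    6: ("alpha_x", "beta_y", "gamma_z"),  # right shoulder
    7: ("alpha_x",),                      # right elbow
}


def parameter_names():
    """Names of the 20 components of a particle position vector.

    Assumed ordering: root translation first, then joints by index.
    """
    names = ["root_x", "root_y", "root_z"]
    for joint in sorted(ROTATION_AXES):
        names += [f"{axis}^{joint}" for axis in ROTATION_AXES[joint]]
    return names


if __name__ == "__main__":
    for index, name in enumerate(parameter_names()):
        print(index, name)
```

Under this ordering the vector starts with the three root translations and ends with the right elbow rotation, matching the first and last entries of Equation 13.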

Table 2: Degrees of Freedom for Tests with Real Data
JOINT (index)                    DOF
Root location (root)             3
Root orientation (0)             3
Clavicle-neck orientation (1)    2
Clavicle-left orientation (2)    2
Left Shoulder orientation (3)    3
Left Elbow orientation (4)       1
Clavicle-right orientation (5)   2
Right Shoulder orientation (6)   3
Right Elbow orientation (7)      1
TOTAL                            20

7.4 Evaluation function
Let the images containing the original silhouettes be denoted as $I^o_l$, $I^o_c$, $I^o_r$, and $I^o_t$ for the left, centre, right, and top original silhouette, respectively. Similarly, let $I^m_l$, $I^m_c$, $I^m_r$, and $I^m_t$ denote images of the model silhouettes from the left, centre, right, and top view, respectively. The evaluation function from Equation ?? then simplifies to:

\[
E = \alpha \sum_{1}^{row} \sum_{1}^{col} (I^o_l \,\&\, I^m_l)
  + \beta \sum_{1}^{row} \sum_{1}^{col} (I^o_c \,\&\, I^m_c)
  + \gamma \sum_{1}^{row} \sum_{1}^{col} (I^o_r \,\&\, I^m_r)
  + \delta \sum_{1}^{row} \sum_{1}^{col} (I^o_t \,\&\, I^m_t) \tag{14}
\]
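A minimal NumPy sketch of Equation 14 is given below, reading "&" as a pixel-wise logical AND of binary silhouette images and the double sum as a count over all pixels. The function name, the dictionary-based interface and the view keys are assumptions made for illustration; the per-view weights play the role of α, β, γ and δ.

```python
import numpy as np


def silhouette_overlap(originals, models, weights):
    """Weighted sum over views of the pixel-wise AND of two binary silhouettes.

    originals, models: dicts mapping a view name to a 2-D boolean array.
    weights: dict mapping the same view names to the per-view weight.
    """
    total = 0.0
    for view, weight in weights.items():
        overlap = np.logical_and(originals[view], models[view])
        total += weight * np.count_nonzero(overlap)  # sum over rows and columns
    return total


if __name__ == "__main__":
    original = np.zeros((4, 4), dtype=bool)
    original[0:2, 0:2] = True
    model = np.zeros((4, 4), dtype=bool)
    model[1:3, 1:3] = True
    print(silhouette_overlap({"left": original}, {"left": model}, {"left": 1.0}))
```

Whether the resulting value is maximised directly or converted into an error to be minimised depends on the general evaluation function defined earlier in the paper, which this sketch does not reproduce.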

7.5 Hierarchical Upper-Body Pose Estimation
In Section 5.4 we mentioned the need for a hierarchical approach to pose estimation in order to limit the complexity of the problem. In the case of the synthetic sequence (Section 6), the number of parameters to be optimised was low enough not to require the hierarchical approach.

In the case of upper body pose estimation, where the search space has expanded to 20 dimensions, this is no longer the case. When we used the original algorithm with an exponentially falling inertia weight value, the inertia weight fell to zero well before the optimisation identified the global optimum basin of attraction. Waiting for the right answer by prolonging the optimisation time was simply not feasible. Instead, we decided to optimise hierarchically.

We performed the hierarchical optimisation in 7 steps (see Table 3). First, we optimised the location of the skeleton in space, i.e., the location of the root joint, followed by the root joint orientation. These were both 3 DOF optimisations. Once the skeleton was positioned in space, we optimised the neck and head sub-chain, for which we only used 2 DOF in the clavicle-neck joint to model the tilt of the head. The movement of the clavicle-left and clavicle-right joints on their own does not produce enough variation in the silhouette shape to be optimised individually. Therefore, in the next step, we combined the left clavicle joint with two rotational dimensions of the shoulder joint and optimised the parameters of the left upper arm, a 4 DOF optimisation. Likewise, we then optimised the right upper arm, again 4 DOF. At the end we were left with the left and right lower arm, each modelled with 2 DOF as described in the boundary case. The two 4 DOF upper arm optimisations required a slightly denser sampling of the inertia weight function to correctly locate the optimum region.

Table 3: Steps in the hierarchical optimisation
STEP  BODY PART        JOINTS OPTIMISED                                         DOF
(1)   Torso            Root location                                            3 DOF: $root_x$, $root_y$, $root_z$
(2)   Torso            Root orientation                                         3 DOF: $\alpha^0_x$, $\beta^0_y$, $\gamma^0_z$
(3)   Neck & head      Clavicle-neck orientation                                2 DOF: $\alpha^1_x$, $\gamma^1_z$
(4)   Left upper arm   Clavicle-left orientation + Left shoulder orientation    4 DOF: $\alpha^2_x$, $\gamma^2_z$, $\alpha^3_x$, $\gamma^3_z$
(5)   Right upper arm  Clavicle-right orientation + Right shoulder orientation  4 DOF: $\alpha^5_x$, $\gamma^5_z$, $\alpha^6_x$, $\gamma^6_z$
(6)   Left lower arm   Left shoulder orientation + Left elbow orientation       2 DOF: $\beta^3_y$, $\alpha^4_x$
(7)   Right lower arm  Right shoulder orientation + Right elbow orientation     2 DOF: $\beta^6_y$, $\alpha^7_x$
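The seven stages can be sketched as index groups over the 20-dimensional pose vector, optimised one after another while all other parameters stay fixed. The sketch below is illustrative only: the index layout assumes the vector is ordered by joint index with axes in the order α, β, γ, the per-stage optimiser is a simple random-sampling stand-in rather than PSO, and the toy objective does not model the progressive hiding of body parts used with the silhouette-based fitness.

```python
import numpy as np

# (stage name, indices into the 20-D pose vector); layout is an assumption.
STAGES = [
    ("root location", [0, 1, 2]),
    ("root orientation", [3, 4, 5]),
    ("neck and head", [6, 7]),
    ("left upper arm", [8, 9, 10, 12]),
    ("right upper arm", [14, 15, 16, 18]),
    ("left lower arm", [11, 13]),
    ("right lower arm", [17, 19]),
]


def optimise_stage(objective, pose, idx, rng, n_samples=200, spread=0.5):
    """Stand-in optimiser: perturb only the active dimensions, keep improvements."""
    best, best_val = pose.copy(), objective(pose)
    for _ in range(n_samples):
        cand = best.copy()
        cand[idx] += rng.normal(0.0, spread, size=len(idx))
        val = objective(cand)
        if val < best_val:
            best, best_val = cand, val
    return best


def hierarchical_estimate(objective, init_pose, rng):
    """Run the stages in order, each starting from the previous stage's result."""
    pose = np.asarray(init_pose, dtype=float).copy()
    for _name, idx in STAGES:
        pose = optimise_stage(objective, pose, idx, rng)
    return pose


if __name__ == "__main__":
    target = np.linspace(-1.0, 1.0, 20)
    error = lambda p: float(np.sum((p - target) ** 2))
    result = hierarchical_estimate(error, np.zeros(20), np.random.default_rng(0))
    print(error(np.zeros(20)), "->", error(result))
```

Because each stage only accepts improvements, the error after the last stage cannot exceed the error of the initial pose; any PSO implementation restricted to the active dimensions could replace `optimise_stage`.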

Our fitness function is based on silhouette overlap between the model and the original silhouette coming from the video sequence. When optimising joint parameters hierarchically, the joints lower in the hierarchy mislead the silhouette overlap count, as they contribute to it despite not having been optimised yet. We avoid this by deforming the subdivision body model so that, at a particular stage of the optimisation, only those body parts which are currently being optimised or have already been optimised are visible. We also exclude the hands from the model entirely, as the hands in the original sequence exhibit too much articulation and constantly mislead the optimisation. The hands can be very useful as a constraint; however, that constraint is based entirely on the input images, and explicit modelling of the hands is therefore not necessary.

7.6 Combination of the original and hierarchical approach
The results of the hierarchical approach were encouraging (see Figure 10 left). The optimisation correctly identified the pose. However, having optimised hierarchically without looking back, another problem crept in, namely that of error propagation. In our hierarchical approach we rely on the fact that each individual stage comes up with the best possible result and therefore provides a good starting point for the next stage. However, results from Section 6 show that there is always some uncertainty present in the pose estimates, and this uncertainty propagates and grows through the hierarchical stages.

At this point, going back to optimising all the parameters at once makes much more sense. The result of the hierarchical optimisation, even if corrupted by propagated bad estimates, represents an excellent starting point for the original algorithm. If we use it as the initial position in the search space and initialise the particles around it with a low initial inertia value, e.g., w = 0.5, we can force the swarm to explore the space around the provided initial solution and find a better one. This approach successfully corrects the influence of the error propagation, as shown in Figure 10 right.

Figure 10: The left image shows the result under the influence of the error propagation in the hierarchical approach. The right image illustrates how this can be corrected using the combined approach.
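The refinement step of the combined approach can be sketched as a small PSO that is initialised around a given seed pose and run with a low starting inertia. This is a generic textbook-style PSO written for illustration, not the exact update rule of this paper: the acceleration constants `c1` and `c2`, the initial spread, the exponential decay rate and the assumption that the objective is minimised are all choices made for the sketch.

```python
import numpy as np


def pso_refine(objective, seed, rng, n_particles=40, w0=0.5, n_iter=100,
               c1=2.0, c2=2.0, init_spread=0.1, decay=0.05):
    """Refine `seed` by minimising `objective` with a swarm initialised around it."""
    seed = np.asarray(seed, dtype=float)
    dim = seed.size
    pos = seed + rng.normal(0.0, init_spread, size=(n_particles, dim))
    pos[0] = seed  # keep the seed itself, so the result is never worse than it
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    g = int(np.argmin(pbest_val))
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for it in range(n_iter):
        w = w0 * np.exp(-decay * it)  # low, exponentially falling inertia
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved] = pos[improved]
        pbest_val[improved] = vals[improved]
        g = int(np.argmin(pbest_val))
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, float(gbest_val)


if __name__ == "__main__":
    target = np.linspace(-1.0, 1.0, 20)
    error = lambda p: float(np.sum((p - target) ** 2))
    hierarchical_result = target + 0.3  # stand-in for a slightly wrong estimate
    best, val = pso_refine(error, hierarchical_result, np.random.default_rng(1))
    print(error(hierarchical_result), "->", val)
```

Keeping the seed as one of the particles guarantees that the returned solution is at least as good as the hierarchical estimate it started from.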

Figure 11:

7.7 Pose Estimation in Disparity Space
The pose estimation we presented in the previous section represents only one of the steps in a much larger application, as shown in Table 4. Once the pose has been estimated, the model is fit to the incomplete and noisy disparity data. The new, complete data is then used for high-quality view synthesis.

The previous section described the upper-body pose estimation in 3-D space. Given that we want to post-process disparity data, it makes sense to investigate the possibility of estimating the pose in disparity space itself and using the model directly, without any additional transformations between the image space and 3-D space, once the pose has been estimated. Another advantage is that the noise associated with {x, y, d} is homoscedastic (Demirdjian and Darrell (2002)), unlike the noise associated with the 3-D points reconstructed from stereo data, which is well known to be heteroscedastic.

Figure 12: Combined approach results for third pose, centre and left view.

Table 4: Diagram of the entire algorithm from image acquisition to view synthesis.
1. multi-view image acquisition -> 2. silhouette extraction -> 3. pose estimation -> 4. disparity data completion with model -> 5. view synthesis

In this section we describe the modifications which are necessary to perform the pose estimation in disparity space. Apart from the described modifications, all the remaining steps of the algorithm are exactly the same as in the 3-D case.

Let us assume that a 3-D point $M = (X, Y, Z, 1)^T$ is viewed by two distinct cameras, left camera $P^l$ and right camera $P^r$, and that the image of the point $M$ is defined as $m_l = (x_l, y_l, 1)^T \simeq P^l M$ and $m_r = (x_r, y_r, 1)^T \simeq P^r M$ in the left and right camera's image plane, respectively, where "$\simeq$" denotes equality up to a scale factor. The corresponding points $m_l$ and $m_r$ are related by a disparity which, in a general case, we define as:

\[
d(m_l, m_r) = m_r - m_l = (x_r - x_l, y_r - y_l). \tag{15}
\]

In the case of rectified images, the two corresponding points lie on the same scanline, and the disparity simplifies to a displacement along the scanline:

\[
d(m_l, m_r) = x_r - x_l \tag{16}
\]

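For a rectified stereo pair of images, the disparity space is then defined as a three-dimensional space $\mathbb{D}^3 = \{x, y, d\}$. There exists a projective transformation $\Gamma$ between the 3-D space $\mathbb{R}^3$ and disparity space $\mathbb{D}^3$, $\Gamma : \mathbb{R}^3 \to \mathbb{D}^3$:

\[
\Gamma =
\begin{pmatrix}
p^l_{11} & p^l_{12} & p^l_{13} & p^l_{14} \\
p^l_{21} & p^l_{22} & p^l_{23} & p^l_{24} \\
p^r_{11} - p^l_{11} & p^r_{12} - p^l_{12} & p^r_{13} - p^l_{13} & p^r_{14} - p^l_{14} \\
p^l_{31} & p^l_{32} & p^l_{33} & p^l_{34}
\end{pmatrix}
\tag{17}
\]

for which

\[
D \simeq \Gamma M, \quad M \in \mathbb{R}^3, \; D \in \mathbb{D}^3, \tag{18}
\]

and where $p^l_{ij}$ and $p^r_{ij}$ denote the elements of the left and right rectified camera projection matrices, $P^l$ and $P^r$, respectively.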
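Equations 17 and 18 can be checked numerically with a short NumPy sketch. The function names are illustrative, and the example cameras in the usage part (a shared intrinsic matrix with focal length f and a purely horizontal baseline b) are assumptions chosen so that the expected image coordinates and disparity are easy to verify by hand.

```python
import numpy as np


def disparity_projection(P_left, P_right):
    """Build the 4x4 matrix Gamma of Equation 17 from two rectified 3x4 cameras."""
    P_left = np.asarray(P_left, dtype=float)
    P_right = np.asarray(P_right, dtype=float)
    return np.vstack([
        P_left[0],               # x numerator of the left image point
        P_left[1],               # y numerator of the left image point
        P_right[0] - P_left[0],  # numerator of d = x_r - x_l
        P_left[2],               # common homogeneous scale
    ])


def to_disparity_space(gamma, M):
    """Map a homogeneous 3-D point M to (x, y, d), as in Equation 18."""
    D = gamma @ np.asarray(M, dtype=float)
    return D[:3] / D[3]


if __name__ == "__main__":
    f, cx, cy, b = 500.0, 320.0, 240.0, 0.1
    K = np.array([[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]])
    P_l = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P_r = K @ np.hstack([np.eye(3), np.array([[-b], [0.0], [0.0]])])
    gamma = disparity_projection(P_l, P_r)
    print(to_disparity_space(gamma, [0.2, -0.1, 2.0, 1.0]))
```

For these example cameras the disparity of a point at depth Z equals -f b / Z, so the construction agrees with the rectified definition in Equation 16.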

7.7.1 Body Model in Disparity Space
Given the projective transformation $\Gamma$ as in Equation 17, there exists a relationship between a homogeneous transformation $A = [R|t]$ in $\mathbb{R}^3$ and an equivalent homogeneous transformation $B$ in $\mathbb{D}^3$ (see Figure 13):

\[
B = \Gamma A \Gamma^{-1} \tag{19}
\]

The notion of homogeneous transformations in $\mathbb{R}^3$ is therefore directly transferable to the disparity space $\mathbb{D}^3$. The points can be rotated and translated in disparity space directly if, for every known homogeneous transformation $A$ in 3-D space, a corresponding transformation $B$ is computed as in Equation 19.
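The relationship in Equation 19 can be verified numerically: transforming a point in 3-D space and then mapping it to disparity space should give the same result as mapping it first and then applying B. The sketch below assumes an invertible 4x4 matrix Γ and a 4x4 homogeneous matrix A; the function names and the example values are illustrative only.

```python
import numpy as np


def disparity_transform(gamma, A):
    """Return B = Gamma A Gamma^{-1}, the disparity-space equivalent of A."""
    gamma = np.asarray(gamma, dtype=float)
    return gamma @ np.asarray(A, dtype=float) @ np.linalg.inv(gamma)


def dehomogenise(v):
    """Divide a homogeneous 4-vector by its last component."""
    return v[:3] / v[3]


if __name__ == "__main__":
    # Example Gamma for focal length 500, principal point (320, 240), f*b = 50.
    gamma = np.array([[500.0, 0.0, 320.0, 0.0],
                      [0.0, 500.0, 240.0, 0.0],
                      [0.0, 0.0, 0.0, -50.0],
                      [0.0, 0.0, 1.0, 0.0]])
    c, s = np.cos(np.pi / 6), np.sin(np.pi / 6)
    A = np.array([[c, -s, 0.0, 0.1],   # rotation about z by 30 degrees
                  [s, c, 0.0, 0.0],    # followed by a translation
                  [0.0, 0.0, 1.0, 0.5],
                  [0.0, 0.0, 0.0, 1.0]])
    M = np.array([0.2, -0.1, 2.0, 1.0])
    B = disparity_transform(gamma, A)
    print(dehomogenise(B @ (gamma @ M)))  # transform applied in disparity space
    print(dehomogenise(gamma @ (A @ M)))  # transform applied in 3-D space first
```

Both printed vectors should agree up to floating-point error, which is the commutativity expressed by Figure 13.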

7.8 Articulated Body Pose Estimation in Disparity Space
We use the result from Equation 19 to estimate the pose of an articulated upper body model in disparity space by estimating the homogeneous transformations of individual joints composing the model. The skeleton layer of the upper-body model is now defined as follows:

\[
Skeleton = \{B^2_1, B^3_2, \ldots, B^N_{N-1}\}, \tag{20}
\]

where $N$ is the number of joints in the skeleton and $B^j_i$ is a homogeneous disparity space transformation matrix encoding the orientation of the coordinate system of joint $j$ with respect to the coordinate system of joint $i$. The skin is connected to the skeleton through the joints' local coordinate systems and deforms with it, just like in the 3-D case.

Figure 13: Transformation diagram. $A$ maps $M$ to $AM$ in 3-D space, $\Gamma$ maps both points into disparity space, and $B$ maps $\Gamma M$ to $B \Gamma M = \Gamma A M$.

Figure 14: An example of estimated upper body pose. The upper row shows a wireframe model overlaid on top of the silhouette to illustrate the evaluation function constraint.

Figure 15: Upper body pose estimates. The estimate in the third row is slightly underconstrained with the available views and, as a result, the right lower arm estimate is not entirely correct. This is due to the underconstrained evaluation function and not the fault of PSO itself.



8 Conclusions
In this paper we have presented an algorithm for pose estimation with Particle Swarm Optimisation. Experiments with synthetic and real data illustrated the ability of the method to solve the problem reliably and within an acceptable time frame. Future work will concentrate on demonstrating pose continuity and tracking with PSO on real data, which, due to the restrictions of our simple camera setup, was not possible at this point. The results shown in this paper should be illustrative enough to foster the use of PSO in related problems and so bridge the gap between Evolutionary Methods and Computer Vision.

References
19774:200x, I. F. (2006). Information technology computer graphics and image processing humanoid animation (h-anim).

Ben-Arie, J., Wang, Z., Pandit, P., and Rajaram, S. (2002). Human activity recognition using multidimensional indexing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8).

Carranza, J., Theobalt, C., Magnor, M., and Seidel, H. (2003). Free-viewpoint video of human actors. ACM Transactions on Graphics, 22(3).

Collomosse, J., Rowntree, D., and Hall, P. (2003). Video analysis for cartoon-like special effects. In Proceedings of the 14th British Machine Vision Conference, pages 749–758.

Dawis, L. (1991). Van Nostrand Reinhold, New York.

Demirdjian, D. and Darrell, T. (2002). Using multiple-hypothesis disparity maps and image velocity for 3-d motion estimation. International Journal of Computer Vision, 47(1/2/3):219–228.

Deutscher, J., Blake, A., and Reid, I. (2000). Articulated body motion capture by annealed particle filtering. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2000, volume 2, page 2126.

Eberhart, R. C. and Shi, Y. H. (2004). Special issue on particle swarm optimization. IEEE Transactions on Evolutionary Computation, 8(3).

Gavrilla, D. (1999). Visual analysis of human movement: A survey. Computer Vision and Image Understanding, 1999, 73.

Grafulla-Gonzalez, B., Haworth, C., Harvey, A., Lebart, K., Petillot, Y., de Saint Pern, Y., Tomsin, M., and Trucco, E. (2005). Millimetre-wave personnel scanners for automated weapon detection. Pattern Recognition and Image Analysis, Part 2, Proceedings of the Lecture Notes in Computer Science, 3687.

Gross, H., Richarz, J., Mueller, S., Scheidig, A., and Martin, C. (2006). Probabilistic multi-modal people tracker and monocular pointing pose estimator for visual instruction of mobile robot assistants. In International Joint Conference on Neural Networks, 2006, pages 4209–4217.

Guo, Y., Xu, G., and Tsuji, S. (1994). Understanding human motion patterns. In Proceedings of the 12th IAPR International Conference on Computer Vision Image Processing, pages 325–329.

Haritaoglu, I., Harwood, D., and Davis, L. (2000). W4: real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8).

Haworth, C., De Saint-Pern, Y., Clark, D., Trucco, E., and Petillot, Y. (2007). Detection and tracking of multiple metallic objects in millimetre-wave images. International Journal of Computer Vision, 71(2).

Hsu, H.-H., Hsieh, S.-W., Chen, W.-C., Chen, C.-J., and Yang, C.-Y. (2006). Motion analysis for the standing long jump. In 26th IEEE International Conference on Distributed Computing Systems Workshops, page 47.



Isgro, F., Trucco, E., and Schreer, O. (2004). Three-dimensional image processing in the future of immersive media. IEEE Transactions on Circuits and Systems for Video Technology, 14(3):288–303.

Ivekovic, S. and Trucco, E. (2006). Human body pose estimation with PSO. In World Congress on Computational Intelligence, WCCI 2006.

Kennedy, J. and Eberhart, R. (1995). Particle swarm optimization. In Proceedings of the IEEE International Conference on Neural Networks, volume 4, pages 1942–1948. IEEE.

Kohle, M., Merkl, D., and Kastner, J. (1997). Clinical gait analysis by neural networks: issues and experiences. In Proceedings of the tenth IEEE Symposium on Computer-Based Medical Systems, pages 138–143.

Moeslund, T. and Granum, E. (2001). A survey of computer vision-based human motion capture. Computer Vision and Image Understanding, 81.

Ong, S. and Ranganath, S. (2005). Automatic sign language analysis: A survey and the future beyond lexical meaning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6).

Plaenkers, R. and Fua, P. (2001). Articulated soft objects for video-based body modeling. In 8th International Conference on Computer Vision, ICCV 2001.

R., P., D., H., A., N., and M., P. (2005). Towards real-time body pose estimation for presenters in meeting environments. In Proceedings of the 13-th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision'2005.

Schutte, J. F., Reinbolt, J. A., Fregly, B. J., Haftka, R. T., and George, A. D. (2004). Parallel global optimization with the particle swarm algorithm. International Journal for Numerical Methods in Engineering, 61(13).

Shi, Y. H. and Eberhart, R. C. (1998). A modified particle swarm optimizer. In Proceedings of the IEEE International Conference on Evolutionary Computation.

Shoji, K., Mito, A., and Toyama, F. (2000). Pose estimation of a 2D articulated object from its silhouette using a GA. In Proceedings of the 15th International Conference on Pattern Recognition, volume 3, pages 713–717.

Veeraraghavan, A., Roy-Chowdhury, A., and Chellappa, R. (2005). Matching shape sequences in video with applications in human movement analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(12).

Ye, Z. and Liu, Z.-Q. (2005). Genetic condensation for motion tracking. In Proceedings of 2005 International Conference on Machine Learning and Cybernetics, volume 9, pages 5542–5547.

