
Homogeneous Accelerometer-Based Sensor Networks for Game Interaction

ANTHONY WHITEHEAD, HANNAH JOHNSTON, KAITLYN FOX, NICK CRAMPTON, and JOE TUEN, Carleton University

We have created and tested a wearable sensor network that detects a user’s body position and motion as input for interactive applications. It is envisioned to take game experiences such as Dance Dance Revolution, Wii Fit, and other active play scenarios to a whole new level, augmenting or replacing the binary foot-pad and balance board with a more immersive, full-body input system. We describe the design and functionality of the sensor network to characterize and verify body pose and position, perform experiments, and report on the capabilities and limitations of such a system. Our experience shows that a distributed set of sensors around the body prevents the player from cheating the system by using motion of the device alone to trick the system. In this work we show that a relatively simple sensor network configuration can enforce proper form and ensure that the player is actively participating in the game context, while a larger configuration can be used in training applications.

Categories and Subject Descriptors: H.5.1 [Information Interfaces and Presentation]: Multimedia Information Systems—Evaluation methodology; J.m [Computer Applications]: Miscellaneous

General Terms: Algorithms, Performance, Design, Experimentation, Human Factors

Additional Key Words and Phrases: Sensor networks, video games, human computer interaction, accelerometer, entertainment technologies

ACM Reference Format: Whitehead, A., Johnston, H., Fox, K., Crampton, N., and Tuen, J. 2011. Homogeneous accelerometer-based sensor networks for game interaction. ACM Comput. Entertain. 9, 1, Article 1 (April 2011), 18 pages. DOI = 10.1145/1953005.1953006 http://doi.acm.org/10.1145/1953005.1953006

1. INTRODUCTION

With the introduction of the Wii, Nintendo has once again changed the standard video game input system and players are embracing the ideology whole-heartedly [NPDGroup 2007]. Under the premise that game players are indeed looking for novelty in today’s games, and not just another graphics-enhanced version of a decade-old concept, we have seen increasing demand for unique interfaces and game interaction. This new form of interaction has allowed users to become more engaged and immersed in the games they play. For game designers, game difficulty can now be affected not only by timing (e.g., when you press the buttons), but also by a player’s body movement and control; not just when you do it but how you do it can be an important factor in game play and design.

Accelerometers have inherent limitations when it comes to detecting motion. With only three pieces of information supplied from tri-axis accelerometers (x, y, z acceleration data), it is impossible to know the exact rotation and position of the accelerometer. In this respect, it cannot act as a replacement for a gyroscope and cannot accurately detect the amount of movement itself. Despite these limitations, this work shows that it is still possible to recognize and detect full-body poses using several sensors at a sufficient rate to allow a great many applications.

Authors’ addresses: A. Whitehead (corresponding author), H. Johnston, K. Fox, N. Crampton, and J. Tuen, Department of Information Technology, Carleton University, 1125 Colonel By Drive, Ottawa, ON, Canada K1S 5B6; email: [email protected]. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected]. © 2011 ACM 1544-3574/2011/04-ART1 $10.00

ACM Computers in Entertainment, Vol. 9, No. 1, Article 1, Publication date: April 2011.

We believe that the next logical progression is a multisensor network placed on strategic areas of the body to allow for an even more immersive gaming experience. Such a network allows users to replicate elaborate dances, yoga and tai chi poses, and can verify the accuracy of the pose. Although it is desirable to have a system that easily interprets human input, it is also expected that the users must develop and perfect the application-specific skills with practice. We have shown that only 4 sensors are sufficient to enforce form and require the player to correctly engage in designed activities [Crampton et al. 2007]; however, for training applications such as yoga, pilates, and tai chi, a larger sensor network is preferable as the goal is to improve form and train rather than purely fun and entertainment. The more sensors in the network, the more rigidly we can define the body pose and motion.

This article examines prior research in the field of sensor networks as input devices. This is followed by a look at our creation of a network of accelerometers for recording and detecting full-body position and motion. Experimental results are presented that identify the capabilities and limitations of a homogeneous accelerometer sensor network. A summary of our findings is then presented to support the use of accelerometer-based sensor networks as input in video games and training applications. We conclude with a brief look at our development of games using the sensor network and at future research opportunities.

2. BACKGROUND

There is a great deal of recent interest in the use of accelerometers for various applications. Although there has been significant prior research in the area of single accelerometers as input devices [Wilson et al. 2004; Kratz et al. 2007; Keskin et al. 2003; Segen et al. 1998], especially since their inclusion into cellular phones [Foerster et al. 1999; Ashbrook et al. 2005], it has been primarily focused on gesture recognition [Kratz et al. 2007; Keskin et al. 2003; Segen et al. 1998; Foerster et al. 1999; Wilson et al. 2004; Payne et al. 2006; Baudisch et al. 2007; Ashbrook et al. 2005; Mantyjarvi et al. 2001], and not pose recognition. Prior work discusses physical activity recognition using acceleration [Lee et al. 2002; Bussman et al. 2001] or a fusion of acceleration and other data modalities [Lukowicz et al. 2002], but it is unclear how most of these systems will perform under real-world conditions because of the limited development of the application in question. Wearable computer systems have already been prototyped using acceleration, audio, video, and other sensors to recognize user activity [Clarkson 2002]. Readers are referred to Bao [2003] for an in-depth summary of other work. Although the literature supports the use of acceleration data for physical activity recognition, little work has been done to validate the idea under real-world conditions. Most prior work on activity recognition using acceleration relies heavily on data collected in controlled laboratory settings [Bao et al. 2004] within the context of wearable computing or having no concrete application in mind. Little has been found in the way of creating full-body sensor networks for use in interactive applications. The existing work found in the field of human-input devices can be grouped into three categories: handheld controllers, video game control, and multisensor activity recognition.

2.1. Handheld Controllers

Currently the best known handheld inertial-sensing device, Nintendo’s Wii remote, utilizes a tri-axis accelerometer to detect motion for use in games. It also uses infrared sensors in conjunction with a “sensor bar” to track the position of the remote in relation to the screen. Since only one accelerometer is used, in one inertial sensor location, the motions it detects can easily be cheated with partial movement. For example, swinging a sword can be replaced by quick flicks of the wrist. Microsoft has developed a device called the XWand, which integrates a wide variety of sensor types including a magnetometer, a gyroscope, a microcontroller, an IR LED sensor, and one two-axis accelerometer [Wilson et al. 2004]. The purpose of this apparatus is to detect basic motions that would control a multitude of household electronics through pointing and gesturing. However, the majority of the motions currently recognized by the device can be easily replicated using a single tri-axis accelerometer, making the inclusion of the additional sensors redundant. Others have used tri-axis accelerometers to create simple handheld units to replace input from joysticks, button presses, and mouse movements [Payne et al. 2006; Baudisch et al. 2007]. Similar to the Wii remote, these controllers are limited in their capabilities by only using a single location accelerometer for input. These systems differ from our approach in that they consist of a single sensor or a condensed network of sensors located at a single point on the body (e.g., in the hand), rather than a network of sensors placed in multiple locations on the body. Other systems include Sony’s Six-axis controller and extensions to the Wii-remote with the Wii-Fit balance board.

2.2. Video Game Control

The majority of developments in the field of human-input devices for video games make use of sensors other than accelerometers [Parker 2006]. Two of the most popular human kinetic games on the market are Dance Dance Revolution and Guitar Hero. Although they seem more interactive than standard games, they still make use of binary input, similar to button presses. They are, in fact, less sophisticated than today’s typical console-based handheld controller.

ParaParaParadise [ParaPara 2010] is another dance-oriented game similar to Dance Dance Revolution that uses motion sensors in an octagonal ring to detect various arm movements. Although the input system differs from that of Dance Dance Revolution, it is still quite limited in its range of detectable movements because it is simply another binary button configuration system using hands to simulate button presses rather than feet, as in Dance Dance Revolution.

Sony’s EyeToy is a camera-based motion recognition system developed for PlayStation 2. Its camera allows users to literally appear in games. However, being camera based, it inherits the limitations of other video-based detection systems, such as lighting dependencies and occlusion issues, and real-time interactivity constraints restrict it to computer vision algorithms that are not sophisticated enough to compensate. Recent work in the camera space also includes Microsoft’s Project Natal [ProjectNatal 2010]. However, we note that all camera-based solutions suffer from a number of issues that our system alleviates, chiefly a lack of view invariance: when a player turns away from the camera, parts of the body are obscured. To overcome this issue a network of cameras all around the player is required, and the space needed to accommodate such a system is not generally practical. Also, camera-based systems are affected by field-of-view issues and lighting issues. The system described herein does not suffer from these issues.

2.3. Multisensor Activity Recognition

Current research focusing on the use of larger sensor networks for activity recognition is the area most similar to our work. However, our work considers the correctness of the activities performed, rather than simply distinguishing between them. Identifying what a person is doing is a different problem from determining how well she is doing it compared to an expert, and as a result, these activity recognition systems were never intended as input systems to interactive applications [Bao et al. 2004]. Many of these systems use accelerometers for detecting motion instead of poses. This use of accelerometers can be prone to error if not used in conjunction with other types of sensors like gyroscopes. Acceleration due to gravity and centripetal acceleration both impact sensor readings, making it difficult to determine orientation while in motion. These discrepancies will lead to large errors in distance calculations, making it necessary to constantly recalibrate the sensors [Ashbrook et al. 2005].

Motion Bands are straps containing a tri-axis accelerometer, a tri-axis gyro, and a magnetometer. They are designed to be placed on any limb and detect motion patterns. Although mainly used in applications to analyze physical activities like running and tai chi, preliminary tests have also been conducted using the single-sensor unit to control a character in a snowboarding simulation [Laurila et al. 2005]. This could, however, be done using one accelerometer. Reliance on pure distance information means that the system is affected by scale issues (e.g., significant difference between child and adult motions), thus notably limiting the range of use. The intent of the Motion Bands work is similar to our work; however, our network is significantly simpler and achieves arguably better results with less overhead.

The Acceleration Sensing Glove is a glove consisting of six biaxial accelerometers which acts as a portable replacement for a computer keyboard [Perng et al. 1999]. Users replicate predefined hand positions, each translating to a different character or keyboard command. Though it currently makes use of twenty-eight different commands (twenty-six letters, space, and delete), the system has the theoretical potential of containing up to 4000 predefined poses. The developers of the Acceleration Sensing Glove claim that dynamic hand gestures could be recognized, but for reasons of simplicity, they have not yet been explored.

Body area sensor network platforms such as the one proposed by Fergus et al. [2009] have a similar concept to our work in that spreading the sensors out over the body can produce more information to be used by applications. They describe a framework [Fergus et al. 2009] for aggregating the data in a wireless system, but acknowledge that the application base needs more investigation on their part. In our work, we focus the application development in the game space. Commercial products such as XSENS Moven [Xsens 2010] use a network of body-mounted inertial sensors and Organic Motion [OrganicMotion 2010] uses a network of cameras to do motion capture for animation systems. The camera-based system requires too much space for general use while the cost of the inertial sensors in the Moven system is far too high for general consumption.

Much of the prior work focusing on multisensor activity recognition uses a special, limited subset of physical activities trained in a laboratory environment. However, Bao et al. [2004] examine nine activities, which is one of the larger sets in the literature. For our game context, we have nine different classes to recognize for even the simplest of games. Interestingly, Bao et al. [2004] demonstrated 95.8% recognition rates for data collected in the lab but these success rates dropped to 66.7% for experiments in real-world environments. Within the context of our application, the recognition rates are fundamentally important to the playability of the game and must be not merely better than chance but significantly higher to ensure interactivity. Without reasonable recognition rates, interactivity will not be a feasible venture. Most of the prior work has typically been conducted with only 1–2 accelerometers, typically bi-axial, not tri-axial, with only a few using more than two [Bao et al. 2004; Kern et al. 2003; van Laerhoven et al. 2002; Mantyjarvi et al. 2001], but this work is done in the context of wearable computers, not human-computer interaction. We examine up to 8 accelerometers in our network, but the system as a whole is easily expandable beyond that number. No work directly addresses placement of the accelerometers in order to provide the best data for recognizing activities, although it has been suggested that for some activities, more sensors improve recognition [van Laerhoven et al. 2002]. We agree and verify this in our experiments section.

3. REVIEW OF ACCELEROMETER THEORY

Accelerometers can provide orientation, instantaneous distance, velocity, and acceleration information. Whilst in theory this data would be extremely beneficial in the design of an input system, in practice accelerometers suffer drift and inversion effects, which we shall explain next. However, there is still a great deal of information that can be taken from an accelerometer such as rotation, orientation, and force feedback. Force feedback can be used by directly taking the acceleration or velocity readings and using these readings to indicate some level of force applied to the move. For example, in a football game setting, high acceleration readings would indicate a long bomb, while low acceleration readings on the throwing action would indicate a short pass.

The most significant issue with accelerometers comes from gravity, which results in constant acceleration readings even if the sensor is not moving at all. To remove this type of error, we must compute and negate the effects of acceleration due to gravity. However, under motion, it is impossible to tell how much of the sensor’s reading is from static rotation versus dynamic motion. There are two unknowns, acceleration caused by rotation and acceleration caused by motion, with only a single information source, making the problem mathematically underdetermined. Another important issue comes from what we call the inversion problem. As the sensor slows down abruptly, the sensor will read deceleration (or negative acceleration) and compute instantaneous negative distance values even though the sensor is still moving forward, but slowing down.
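The inversion problem can be illustrated with a small sketch. It assumes, purely for illustration, that an "instantaneous distance" is computed from each acceleration reading in isolation as 0.5·a·Δt²; the acceleration profile, the sample interval, and both function names are hypothetical and are not taken from the system described here.

```python
def per_sample_displacement(accels, dt):
    """Naive per-reading distance: 0.5 * a * dt^2, ignoring current velocity."""
    return [0.5 * a * dt * dt for a in accels]


def tracked_displacement(accels, dt, v0=0.0):
    """Per-step distance when velocity is carried forward between readings."""
    v = v0
    steps = []
    for a in accels:
        steps.append(v * dt + 0.5 * a * dt * dt)
        v += a * dt
    return steps


# Hypothetical profile: speed up for three samples, then brake for two (m/s^2).
accels = [2.0, 2.0, 2.0, -2.0, -2.0]
dt = 0.1  # seconds between readings (assumed)

naive = per_sample_displacement(accels, dt)
tracked = tracked_displacement(accels, dt)

# During braking the naive values turn negative, although the sensor
# is still moving forward (every tracked step stays positive).
print(naive)
print(tracked)
```

Carrying velocity forward avoids the sign flip in this idealized case, but it is the step that accumulates drift on real hardware, which is consistent with the limitations discussed above.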

The acceleration data can be used to determine how much rotation the sensor is under. This is possible because the acceleration due to gravity is constantly being read by the sensor with 1g of acceleration. If the sensor is placed on a flat surface (parallel to the earth’s surface) and is completely motionless, all of this acceleration will be read from the z-axis. The acceleration on the x and y axes will be zero, disregarding noise. When the sensor is rotated on its side ninety degrees, the axis that is now pointing toward the earth will read 1g and the z-axis will read 0, again disregarding noise. Getting the angle from the readings on an axis is done simply by taking the inverse sine of the acceleration readings.
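As a minimal sketch of the single-axis case, the function below converts one axis reading, assumed to be expressed in units of g for a motionless sensor, into a tilt angle with the inverse sine. The function name and the clamping of noisy readings to the range [-1, 1] are illustrative assumptions rather than part of the described system.

```python
import math


def tilt_from_axis(accel_g):
    """Tilt angle in degrees from one axis reading given in units of g."""
    # Noise can push a reading slightly past +/-1 g, which asin cannot accept.
    clamped = max(-1.0, min(1.0, accel_g))
    return math.degrees(math.asin(clamped))


flat = tilt_from_axis(0.0)     # axis perpendicular to gravity
on_side = tilt_from_axis(1.0)  # axis pointing toward the earth
halfway = tilt_from_axis(0.5)  # roughly 30 degrees
```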

Tri-Axis Rotation Sensing. When using a tri-axis accelerometer the z-axis can be combined with both of the rotating axes to improve rotation computation precision. Both pitch (φ) and roll (ρ) can be sensed simultaneously using the readings of all three axes: x, y, and z.

$$\phi = \arctan\left[\frac{X}{\sqrt{Y^2 + Z^2}}\right] \tag{1}$$

$$\rho = \arctan\left[\frac{Y}{\sqrt{X^2 + Z^2}}\right] \tag{2}$$
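Equations (1) and (2) can be sketched directly in code. The function name is an assumption, and `math.atan2` is used in place of a plain arctangent of the quotient: because the denominator is never negative the result is the same, but the call stays defined when the denominator is zero.

```python
import math


def pitch_roll(x, y, z):
    """Pitch and roll in degrees from tri-axis readings, per Eqs. (1) and (2)."""
    pitch = math.atan2(x, math.sqrt(y * y + z * z))
    roll = math.atan2(y, math.sqrt(x * x + z * z))
    return math.degrees(pitch), math.degrees(roll)


# Sensor lying flat: all of gravity is on the z-axis.
flat = pitch_roll(0.0, 0.0, 1.0)

# Sensor tipped so that gravity is entirely on the x-axis.
tipped = pitch_roll(1.0, 0.0, 0.0)

# Gravity split between y and z: a roll of roughly 30 degrees.
rolled = pitch_roll(0.0, 0.5, math.sqrt(3) / 2)
```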

Yaw is not directly computable from a single accelerometer. Recall that pitch causes a change in readings in the y-axis accelerometer due to gravity. The same is true for roll and the x-axis. However, yaw is the rotation around the z-axis that does not change any orthogonal axis reading on the sensor. These calculations are very accurate: with the right sensor, to within fractions of a degree. For a full explanation of the capabilities of an accelerometer sensor, the reader is referred to Whitehead et al. [2007]. The system provides data at rates up to 1000 Hz, and settle times on the devices are shorter than human reaction time. This allows individual poses to be observed, changed, and measured all before the next pose is comprehended by a human player. Pitch and roll information from a network of accelerometers is what we use to determine pose information for the entire body. We discuss this next.

Fig. 1. (a) Sensor pods: Plastic capsules provide a way to fasten accelerometers to the body using adjustable velcro strapping; (b) 4 accelerometer sensor network with pods on the wrists and above the knees.

4. DESIGN AND IMPLEMENTATION OF A SENSOR NETWORK

In this section we describe the design and implementation details of our sensor network and the software components that make up the recognition system.

4.1. Sensor Pods

Our sensor network consists of a variable number of sensor pods depending on the application in question. We have found that 4 sensors are suitable for entertainment-related applications, while 8–16 better serve training applications. Each pod contains two layers of foam that sandwich a 2g tri-axis accelerometer, ensuring it remains secure in its position while minimizing unwanted movement on the sensor itself. Figure 1 shows a tri-axis accelerometer placed into its protective casing; this serves the purpose of protecting the chips and connections from motion-based wear and tear.

4.2. Sensor Pod Placement

The nature of the poses and the application dictate the placement of accelerometers on the body. A truly generic system would be comprised of sensors on all of the major bone segments in the body. However, this is not necessary for all applications, specifically if orientation information for certain parts of the body is not required. In our example application, it is necessary to get position and rotation information of both arms and legs. Figure 1(b) shows sensors placed on the body to minimize movement, slippage, and excessive rotation of the sensor pod once mounted. The system does not require a precise placement of the pods, and each application has the freedom to place pods where appropriate, given the aforementioned constraint. However, the sensor pods need to be placed consistently in the same general vicinity and orientation on each new or returning participant.

4.3. Body Pose Detection

As with any classification technique, we rely on the basic training and recognition (game playing) phases from classical pattern recognition theory [Duda and Hart 1973]. This allows the sensor network to be reusable in many different contexts, as body position and orientation information can be easily gathered and the recognition system trained by example. We use the rotation sensing capabilities of the accelerometers and their placements on the body to determine poses that players recreate. We have used this pose recognition system to successfully create a number of dancing-oriented games that require players to recreate the dance position in time with music. These games are described in Section 5. We will briefly discuss the training and recognition phases next.

4.3.1. Training Phase. The purpose of the training phase is to create a dataset from a given number of example poses by an individual. A reading consists of the x, y, and z acceleration data for each pod in the sensor network. We take readings from a number of people to create the entire dataset, thus determining a range of acceptable positions. To record a pose, the subject performs each pose five times. 500 readings are taken each time, over a two-second period, with a break between; this yields 2500 separate readings. We have found that good real-world recognition rates occur when the sample size is at least 5 people and they are of diverse body types. However, as with any pattern recognition system, the more data available for training, the better the recognition rates. It should be noted at this point that more samples in the training dataset allow for more acceptable variability in the recognition phase. Thus for entertainment applications, a larger dataset is encouraged. But for training applications, where the goal is to effectively learn the proper body positions of an expert or trainer, a smaller training dataset is sufficient. More detail about optimal training set size is covered in Section 5.

We treat the readings as points in n-dimensional space and compute the centroid and convex hull of the volume created by all of the sample points. The data for each axis of each accelerometer from each sensor pod for each pose is compiled and stored in a pose bank used for subsequent recognition tasks. A pose bank is defined as a collection of information from each of the sensors attached to the body for all defined poses.
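One possible in-memory layout for a pose bank is sketched below. It is an illustration under stated assumptions, not the authors' implementation: each reading is taken to be a flat list of axis values covering every pod, the dictionary keys and function name are hypothetical, and the convex hull is omitted in favor of the per-axis mean and standard deviation that the recognition phase needs.

```python
from statistics import mean, pstdev


def build_pose_bank(training):
    """Summarize training readings per pose.

    `training` maps a pose name to a list of readings; each reading is a
    flat list [x1, y1, z1, x2, y2, z2, ...] covering all pods (assumed layout).
    """
    bank = {}
    for pose, readings in training.items():
        axes = list(zip(*readings))  # one tuple of samples per axis
        bank[pose] = {
            "centroid": [mean(a) for a in axes],
            "sigma": [pstdev(a) for a in axes],
        }
    return bank


# Hypothetical data: one pod, two readings of a single pose (values in g).
training = {"arms_up": [[0.0, 0.1, 1.0], [0.2, 0.1, 0.8]]}
bank = build_pose_bank(training)
```

An axis whose training samples never vary has a standard deviation of zero (the y-axis in this toy data), so a real pose bank would need a small floor on sigma before it is used as a divisor.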

4.3.2. Recognition Phase. As an application runs, acceleration values from the sensor network are read in continuously and compared to the poses stored in the pose bank. We use a minimum mean distance rule classifier [Duda and Hart 1973] as our recognition system, which is outlined by Eq. (3). It characterizes each category (pose) by the mean and standard deviation of the components of its training data. The distance, d(M, m), between an unknown sample M (input move) and the mean of the features of class m (trained pose) is then computed. The unknown sample is assigned to the class m* (recognized as move m*) for which the distance d(M, m) is minimum. Formally,

$$m^{*} = \arg\min_{i \in \{1, 2, \ldots, C\}} d(M, m_i) \tag{3}$$

d(M, m_i) is the Mahalanobis distance metric (4). Introduced by P. C. Mahalanobis in 1936 [Mahalanobis 1936], it is based on correlations between variables by which different patterns can be identified and analysed. It is a useful way of determining similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the dataset and is scale invariant.

$$d(p, q) = \sqrt{\sum_{i=1}^{n} \frac{(p_i - q_i)^2}{\sigma_i^2}} \tag{4}$$

σ_i is the standard deviation of the training data over the sample set. The Mahalanobis distance is simply the distance of the test point from the centroid divided by the width of the convex hull in the direction of the test point. If the distance falls within an acceptable threshold for a known pose, the pose is deemed detected. When checked against all poses in a pose bank, it is the minimum distance that would classify the pose.
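The recognition rule of Eqs. (3) and (4), together with the acceptance threshold, can be sketched as follows. As printed, Eq. (4) scales each axis independently by its own standard deviation, which corresponds to a Mahalanobis distance with a diagonal covariance matrix, and the sketch follows the equation in that form. The pose names, the bank layout, the sample values, and the threshold are hypothetical, and every sigma is assumed to be greater than zero.

```python
import math


def pose_distance(sample, centroid, sigma):
    """Eq. (4): Euclidean distance with each axis scaled by its training sigma."""
    return math.sqrt(sum(((s - c) / sd) ** 2
                         for s, c, sd in zip(sample, centroid, sigma)))


def classify(sample, pose_bank, threshold):
    """Eq. (3) plus the threshold test; returns (pose or None, distance)."""
    best_pose, best_dist = None, math.inf
    for pose, stats in pose_bank.items():
        d = pose_distance(sample, stats["centroid"], stats["sigma"])
        if d < best_dist:
            best_pose, best_dist = pose, d
    if best_dist <= threshold:
        return best_pose, best_dist
    return None, best_dist  # closest pose is still too far away


# Hypothetical two-pose bank over a single pod (x, y, z in g).
bank = {
    "arms_down": {"centroid": [0.0, 0.0, 1.0], "sigma": [0.1, 0.1, 0.1]},
    "arms_out": {"centroid": [1.0, 0.0, 0.0], "sigma": [0.1, 0.1, 0.1]},
}

match, dist = classify([0.9, 0.1, 0.1], bank, threshold=5.0)
miss, far = classify([0.5, 0.5, 0.5], bank, threshold=5.0)
```

Dividing by sigma is what makes the measure insensitive to the scale of an individual axis: multiplying a sample value, its centroid, and its sigma by the same constant leaves the distance unchanged.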



Fig. 2. The 8 poses, inspired by Michael Jackson’s Thriller.

5. EXPERIMENTAL RESULTS

Experiments were conducted to examine the capabilities and limitations of our sensor network, specifically to determine the feasibility of applying similar principles in a video game context. Participants were asked to perform multiple trials consisting of the eight distinct poses shown in Figure 2. For these particular experiments, poses were inspired by Michael Jackson’s zombie dance from Thriller, and selected to ensure reasonable differences in limb orientation. Tests were conducted in a supervised lab setting with a consistent level of instruction and coaching given by the trainer to examine the effects that the amount of training data has, the effect of the size of the sensor network, the effect of practice, the effect of pose difficulty, and the process of determining appropriate, application-specific recognition thresholds.

It is unreasonable to expect a user to be able to achieve a “perfect” pose, in other words achieve a recognition distance of zero, even if he is a part of the training sample. Simple variations are inevitable due to core mechanics of the human musculature system and the variability in the placement of the sensors as they are not permanently attached to the body. Our first set of experiments tests how well a person can play against his own training data versus another person’s training data. This experiment has significant impact on the practicality of the system. In this context, the user was given a pose to get into and a set amount of time to complete the task. Once the pose was called for, the system checked continuously for the pose for the prescribed amount of time, typically less than a few seconds. With self-training data the recognition rate was 100%. This means that within the context of these experiments, the users were always able to duplicate the poses for which they trained. This is fundamentally interesting because it allows the game developer to ensure that the game is playable by everyone, regardless of physical limitations. The game developer only needs to incorporate a training phase (already common in many games) into her game. The overall recognition rate for all trials and all moves is 73.75% when testing against the training dataset created by a different individual. This is not quite sufficient for a truly interactive application out of the box and would force a retraining initialization phase.

It became apparent throughout the earlier experiment that, even with practice, some participants were unable to successfully replicate certain poses trained from a single individual. Physical limitations can make it impossible for some subjects to get their body into the exact same position as a single person who trained the dataset. These restrictions include factors like muscle size and volume, limb length, proportions, and overall body type. Combining multiple people’s datasets makes it possible for a wider variety of users to achieve higher recognition rates. To test this, we collected training data from seven individuals of varying heights and shapes. The seven sets were combined, thereby generalizing the pose bank due to a larger training sample. This allows greater variation in the range of acceptable poses and makes the system generally more usable by people who are not a part of the training data. Variations of a few degrees in the limb positions are generally acceptable, but the application would dictate how much variation is acceptable. This phenomenon, in fact, results more from the training process accounting for differences in body sizes, shapes, and flexibilities than from a deficiency in classification methodology and supports our assertion that the stricter the pose and position (such as in a training application), the smaller the amount of training data should be.

Table I. Pose Confusion Matrix for Training Data of Seven Subjects

POSE        1        2        3        4        5        6        7        8
1       3.996   16.386   12.308   12.074   20.178   25.487   29.183   38.874
2      15.878    3.104   14.394   22.878   33.016   33.768   20.393   53.646
3      15.993   27.239    2.192   16.053   22.355   44.357   34.821   18.212
4      10.918   32.047   16.059    2.868   16.341   45.845   39.553   43.530
5       9.704   37.074   12.302   11.682    3.773   48.108   47.171   25.971
6      22.948   21.116   27.062   23.684   45.247    3.315   26.668   56.591
7      43.606   27.261   16.094   30.934   43.458   35.509    3.212   60.027
8      12.896   28.030    8.702   18.970   29.038   39.525   36.946    3.992

Table II. Average Multiplier Factors for 8 Poses, for Single and Multiple Training Dataset Sizes

                     POSE
Training Set Size       1       2       3       4       5       6       7       8
1                    3.44    4.90    3.24    3.46    3.91    4.46    5.29    6.47
7                    4.25    7.74    6.22    6.06    7.07   10.40    9.26    9.42

Table III. Effect of Multiple Training Data Samples on Recognition Rates

Number of Data Sets Used    Successful Recognition (%)
1                           37.5
7                           87.5

Confusion matrices as shown in Table I were created to determine the threshold values and reduce the likelihood of false positives among other poses, while maximizing the likelihood of true positives. The values in Table I are the average distances as computed from Eq. (4) for each pose when compared to each other. In other words, when a player strikes pose 1 (column) we compute the distance for all 8 poses (rows). The confusion matrix helps us to define a global threshold that will allow correct poses to be verified. From Table I we can see that a global threshold distance of 5 would suffice in making poses recognizable and keeping false positive rates down, since every correct-match distance on the diagonal lies below 4 while every mismatch distance lies above 8.
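The choice of a global threshold can be checked against the Table I values with a short sketch. The matrix below is copied from Table I; the function name and the idea of comparing the largest diagonal entry with the smallest off-diagonal entry are illustrative assumptions about how such a check might be automated.

```python
# Table I: average Eq. (4) distances between the 8 poses.
TABLE_I = [
    [3.996, 16.386, 12.308, 12.074, 20.178, 25.487, 29.183, 38.874],
    [15.878, 3.104, 14.394, 22.878, 33.016, 33.768, 20.393, 53.646],
    [15.993, 27.239, 2.192, 16.053, 22.355, 44.357, 34.821, 18.212],
    [10.918, 32.047, 16.059, 2.868, 16.341, 45.845, 39.553, 43.530],
    [9.704, 37.074, 12.302, 11.682, 3.773, 48.108, 47.171, 25.971],
    [22.948, 21.116, 27.062, 23.684, 45.247, 3.315, 26.668, 56.591],
    [43.606, 27.261, 16.094, 30.934, 43.458, 35.509, 3.212, 60.027],
    [12.896, 28.030, 8.702, 18.970, 29.038, 39.525, 36.946, 3.992],
]


def separation(matrix):
    """Return (largest correct-match distance, smallest mismatch distance)."""
    n = len(matrix)
    max_match = max(matrix[i][i] for i in range(n))
    min_mismatch = min(matrix[i][j]
                       for i in range(n) for j in range(n) if i != j)
    return max_match, min_mismatch


max_match, min_mismatch = separation(TABLE_I)
threshold = 5.0

print(max_match, min_mismatch)               # 3.996 8.702
print(max_match < threshold < min_mismatch)  # True
```

Any threshold between those two values separates the averaged matches from the averaged mismatches; individual trials will scatter around the averages, so the margin on either side matters in practice.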

The results indicate little possibility for the occurrence of false positives. An indicator of the reduction of false positives can be observed through the multiplying factor. The multiplying factor is the ratio of the distance to the nearest mismatch (next smallest value) to the distance of the correct pose match (smallest value). This provides an indicator of how far the closest mismatch is from a correct match distance: the higher the factor (ratio), the less likely a false positive is to occur. The average multiplying factor for a correct classification versus a misclassification is higher for multiple training datasets, meaning that the likelihood of false positives actually goes down as the amount of training data goes up. Table II shows the multiplying factors for each pose. This is an indicator that the difference between a correct classification and a misclassification is proportionally increasing and thus there is significant value in having a larger training set.
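A minimal sketch of the multiplying factor follows. The function name and the single-trial distances are hypothetical; the two rows of factors are copied from Table II and are only averaged here, since the per-trial distances behind them are not given.

```python
def multiplying_factor(distances, correct_index):
    """Ratio of the nearest mismatch distance to the correct match distance."""
    correct = distances[correct_index]
    nearest_mismatch = min(d for i, d in enumerate(distances)
                           if i != correct_index)
    return nearest_mismatch / correct


# Hypothetical trial: distances to three poses, pose 0 being the correct one.
trial = [3.0, 12.0, 9.0]
factor = multiplying_factor(trial, 0)  # 9.0 / 3.0

# Table II: average multiplying factors per pose, keyed by training set size.
TABLE_II = {
    1: [3.44, 4.90, 3.24, 3.46, 3.91, 4.46, 5.29, 6.47],
    7: [4.25, 7.74, 6.22, 6.06, 7.07, 10.40, 9.26, 9.42],
}
averages = {size: sum(vals) / len(vals) for size, vals in TABLE_II.items()}

# Roughly 4.40 for one training set and 7.55 for seven.
print(averages)
```

The factor rises for every pose when moving from one training set to seven, which is the trend the text describes.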

As Table III shows, the recognition rates increase substantially as the size of the training set increases. For entertainment applications, where recognition rates must be higher for the general population, more training data is preferable.



Table IV. Mahalanobis Distances for Pose 3, Using Networks with 4 and 8 Sensors

Number of Accelerometers Used    Mahalanobis Distance    Average Distance Per Sensor
4                                2.676948                0.669236
8                                7.269316                0.9086645

Fig. 3. Trend lines for the learning curves of 7 subjects.

Our next experiment independently validates the suggestion of Bao and Intille [2004] that the more accelerometers there are in a sensor network, the more accurately activities can be differentiated. Table IV shows that increasing the number of sensors makes more demands on the user to properly replicate the pose, since each added sensor contributes to the overall distance from the optimal pose. The addition of more sensors to the body places a higher demand on the relationships between the body parts being correct. For example, if the forearm is to be placed parallel to the ground, this can be done in a number of ways as a result of the range of motion of the shoulder. However, if we place another sensor on the upper arm so that it also has to be parallel to the ground, we have reduced the allowable range of motion of the shoulder, and the two body parts must work in conjunction with one another to accurately duplicate a pose. As such, an error in the location of the upper arm directly affects the forearm (by virtue of their being attached) and will, as a result, increase the error determined for the forearm. This is reflected by the higher average distance per sensor.
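The per-sensor averages in Table IV follow from dividing each Mahalanobis distance by the number of sensors. The short check below reproduces that arithmetic; the function name is hypothetical, and the input values are the ones reported in Table IV.

```python
def average_distance_per_sensor(total_distance, n_sensors):
    """Average contribution of one sensor to the overall pose distance."""
    if n_sensors <= 0:
        raise ValueError("n_sensors must be positive")
    return total_distance / n_sensors


four = average_distance_per_sensor(2.676948, 4)   # 4-sensor network
eight = average_distance_per_sensor(7.269316, 8)  # 8-sensor network
increase = eight / four  # relative growth of the per-sensor distance
```

With these values the per-sensor distance for the 8-sensor network is roughly a third larger than for the 4-sensor network, which is the effect the paragraph above attributes to coupled body parts.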

The effect of practice cannot be overlooked. In our experiments we saw that the error rate drops as the player gets more experience playing the game, most significantly after the first few trials. This is an expected side effect of practicing and, in fact, should be a desirable outcome from a game design point of view. As indicated in Figure 3, there is a definite trend for better success as the number of plays increases.

The learning curve experiment was conducted to examine the effects of repetition on successful recognition rates. Our hypothesis was that, with practice, any user would be able to achieve correct recognition of every pose in a trial. In a video game context, we also want the players to progressively succeed with practice. Seven participants performed the eight poses set out by the pose bank illustrated in Figure 2 sequentially, for up to fifteen trials, stopping if they successfully executed all poses in one trial. It is important to note that this experiment was completed with a training sample of one, the most difficult case and the one closest to a training scenario. The resulting trend lines, illustrated in Figure 3, support our hypothesis in that, given enough repetition, any user can improve their ability to match the orientation and position of a single expert. This tends to support our claim that the sensor network can be used in training applications where



Fig. 4. Poses used in subsequent experiments.

body position and orientation are fundamentally important, such as yoga and pilates, but further statistical analysis is required. Although there was observed improvement overall, certain poses were problematic for most subjects, which led to the further pose-based analysis discussed next.

Clearly, different players will learn at different rates, but within 15 trials most players are able to bring their recognition rates up by more than 30%. When we have a larger training dataset, the initial recognition rates are higher than in Figure 3, and all participants were able to achieve perfect recognition rates with practice. These recognition rates are suitable to allow the game to be trained in the development studio by multiple people and subsequently played by most players without the need for a training phase being built into the game. Moreover, this allows game designers to integrate progressively increasing levels of difficulty with the full knowledge that players can improve with practice.

In the following subsections, we highlight results from our previous experiments to determine the optimal number of datasets for training and the technical limitations of speed. We also introduce new results on strictness in pose training, physical exertion, and long-term improvement from learning versus fatigue.

5.1. Optimal Number of Datasets to Include in Training

As previously discussed, it is advantageous to combine data from multiple subjects in order to create a pose bank that is playable by a wide variety of players. A different set of 8 poses was used in this and subsequent experiments. The 8 poses are shown in Figure 4. This data is recorded so that it can be compared to 27 different pose banks, containing between 1 and 9 subjects’ pose training data. Each of the 3 trials includes the same training data, with each dataset added to the bank in a different randomized order. As such, the trials converge when the training data from all 9 participants has been added to the bank. Figure 5 illustrates the decreasing average distance values. The asymptotic nature of the curves suggests that the average threshold distance levels off around 5 and that the inclusion of additional training data will eventually yield diminishing returns.
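The construction of the 27 pose banks (3 randomized orders, each growing from 1 to 9 subjects) can be sketched as follows. This is an illustration only: the function name, the seed, and the use of integer subject identifiers are assumptions, and each bank is represented simply by the set of subjects it contains.

```python
import random


def cumulative_banks(subject_ids, n_orders, seed=0):
    """Build pose banks by adding subjects one at a time in random order.

    Returns n_orders * len(subject_ids) banks; each bank is the set of
    subjects whose training data it would contain.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    banks = []
    for _ in range(n_orders):
        order = list(subject_ids)
        rng.shuffle(order)
        for k in range(1, len(order) + 1):
            banks.append(frozenset(order[:k]))
    return banks


banks = cumulative_banks(range(1, 10), n_orders=3)
```

The last bank of each ordering contains all 9 subjects, so the three orderings necessarily end on the same bank; this is why the curves converge once all of the training data has been added.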

5.2. Strictness in Training Poses

Two alternate methods were used to train pose data. In the first method, subjects replicated the poses seen on screen as they would naturally, and they were recorded this way. Only poses that were clearly misunderstood or incorrect were excluded from the dataset. A second series of training data was recorded with stricter enforcement of the ideal pose. Prior to recording, subjects performing the poses were required to adjust their limb positions until they were deemed comparable to the ideal pose by a human evaluator. When playing against the strict training process, users encountered significantly more difficulty in successfully replicating the poses. In some cases, participants were unable to get into the poses at all, regardless of the amount of coaching or



Fig. 5. Convergence of distances as the training dataset increases.

Fig. 6. Effect of speed on success rate.

limb adjustment. For these reasons, a more natural pose training system is generally desirable within the gaming context; however, a stricter system may be desirable for training or simulation applications.

5.3. Technical Limitations of Speed

In our application, poses are presented at a variable rate. In our trials, the speed was increased to the point where participants were no longer able to replicate the majority of poses. Using a consistent pose validation threshold, one expert player achieved a success rate of 100% at a speed of 1 pose per 750 milliseconds (ms). This player was capable of achieving success rates above 80% at 1 pose per 625 ms, but Figure 6 illustrates the significant drop in success rates between 625 ms and 500 ms per pose for an expert player. With practice, subjects will improve; however, it is unlikely that they will reach perfection at a speed greater than 1 pose per 625 ms. The maximum speed is



Fig. 7. Heart rate increases as the ability to successfully duplicate the presented poses increases.

less a function of hardware limitations and more of human limitations. Settle time for the accelerometers we use is approximately 180 ms. With reaction time accounting for approximately 190 ms [Galton 1899; Woodworth and Schlosberg 1954; Welford 1980], at a speed of 1 pose per 500 ms, roughly 130 ms remains for participants to move their bodies into the correct position.
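The time budget in the paragraph above is simple subtraction, sketched below. The function name is hypothetical; the default settle and reaction times are the approximate values quoted in the text, and the model assumes the three phases do not overlap.

```python
def movement_budget_ms(pose_interval_ms, settle_ms=180, reaction_ms=190):
    """Time left for the player to move once the sensor settle time and
    human reaction time are subtracted from the interval between poses."""
    return pose_interval_ms - settle_ms - reaction_ms


budget_500 = movement_budget_ms(500)
budget_625 = movement_budget_ms(625)
budget_750 = movement_budget_ms(750)
```

Under these assumptions the movement window shrinks from 380 ms at 1 pose per 750 ms to 130 ms at 1 pose per 500 ms, which is in line with the sharp drop in success rate between 625 ms and 500 ms.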

In trials where the time between poses was less than 625 ms, the expert player would no longer attempt all poses. While the behavior of skipping difficult poses is not uncommon among new players, it is uncharacteristic for experts and in this case is a function of strategy as opposed to laziness. With practice at the 500 ms speed, it became evident to the player that it was not possible to get into pose fast enough, so he strategically chose sequential poses that were easier to replicate, skipping a more difficult pose in between to prepare for the next. This allowed the player to optimize his results when reaction and settle times made 100% success impossible. The speed trial results are shown in Figure 6.

It should also be noted that success rates may be superficially increased at faster speeds by using a larger acceptable validation threshold value. The downside of this approach is that it permits players to perform less accurate poses, potentially reducing intended benefits and enjoyment.

5.4. Physical Exertion

In this experiment, heart rate was calculated for each participant using the application. We compared the heart rate in beats per minute against the number of moves performed correctly. There is a strong correlation between heart rate increase and the ability to duplicate the presented poses, shown in Figure 7. This shows that as players attempt to achieve higher success rates, they are required to work harder and thus achieve better physical benefits than those who do not try as hard.



Fig. 8. (a) Screenshot of the Posemania dancing game, created to use the recognition system; (b) screenshot of the RoboPaint dancing game, created to use the recognition system, dancing to YMCA.

Fig. 9. Dance Fighter instructions (a) and game play (b).

5.5. Long-Term Improvement from Learning vs. Fatigue

Our early work shows that players’ success in replicating poses improved over time as a result of practice. In this experiment, prior to commencing the actual play test, brand new participants were shown the 8 poses and given instruction on how to accurately perform them. The first 5 minutes of playing was completed at a speed of 1 pose per 3 seconds (1p/3s) and the next 5 minutes of training were completed at 1p/2s. An additional 10 minutes of practice was completed at a speed of 1p/s. In the testing session, participants continuously performed 15 minutes of play, and then had a 1-minute break. Playing then resumed for another 14 minutes.

Despite moderate levels of continued exertion (and consequently fatigue), 13 out of 15 participants continued to improve when playing for an extended duration in their final trial. The average improvement was approximately 12.5%.

6. EXPERIENCES IN GAME DESIGN AND DEVELOPMENT

We have created several games for use with the sensor network, but pose-based games are the focus of this work. Our game Posemania requires the player to perform dance poses to music at the right time. Poses scroll up the screen and the player must replicate each pose as it reaches the top, in a similar fashion to Dance Dance Revolution (DDR). Also in the spirit of Dance Dance Revolution, RoboPaint requires players to match the



dance of an animated robot. As poses are correctly matched, the player’s robot body gets painted. The player with the most successful poses, and thus painted parts, wins. Both Posemania and RoboPaint can be played with up to 4 players.

In Dance Fighter, players upload images of their faces onto player bodies and dance off against one another, using a series of predefined dance poses. For a duration specified by the players, participants must each perform the dance moves in whichever nonconsecutive order they choose. Players earn more points for performing poses with higher difficulty levels. Dance Fighter relies on the same pose bank recognition system, but the interface was developed using Adobe Flash for the graphics and animation.

6.1. Player Enjoyment

Posemania was tested with 62 participants, aged 9–13 (29 male, 33 female). Participants played two at a time against each other at a speed of 1 pose per 1500 ms, using the precision-based scoring method. They were given group training on the system and the poses prior to playing the game once, for approximately 1 minute. The average response to “Did you enjoy the game?” (on a scale from 0–4) was 3.35 (3.34 among male participants, 3.37 among females).

6.2. Design Issue: Pose Selection

To avoid confusion it is generally desirable to select poses with obvious limb angles that can be seen with minimal image detail, for example, arms and legs forming 45, 90, or 180 degree angles. Complex limb positions are not only difficult to see on screen, but they also tend to be more challenging to replicate. Each pose should be distinct from the others. It is important to recall that a change in yaw will not affect the acceleration values because gravity is still acting in the same direction. In games like Posemania (where the system checks the player’s position against only one pose at a time), it is not critical for the poses to be distinct; however, in other games similar poses may cause confusion or the occurrence of false positives.
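The yaw remark above can be checked numerically. The sketch below assumes a stationary sensor that measures only gravity (in units of g), a world frame whose z axis is vertical, and an orientation built as yaw about the vertical followed by pitch about the sensor’s y axis; all function names are hypothetical.

```python
import math


def rot_z(a):
    """Rotation by angle a about the vertical (z) axis: yaw."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_y(a):
    """Rotation by angle a about the y axis: pitch."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]


def gravity_reading(yaw, pitch):
    """Gravity as seen in the sensor frame for orientation Rz(yaw) * Ry(pitch)."""
    g_world = [0.0, 0.0, -1.0]
    # World -> sensor: undo the yaw first, then undo the pitch.
    v = mat_vec(transpose(rot_z(yaw)), g_world)
    return mat_vec(transpose(rot_y(pitch)), v)
```

Calling `gravity_reading(0.0, 0.5)` and `gravity_reading(1.2, 0.5)` yields the same vector, whereas changing the pitch changes it. Two poses that differ only by a rotation about the vertical are therefore indistinguishable to an accelerometer-only network.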

6.3. Design Issue: Pose Presentation

It is important to illustrate the poses properly so players know what body position to replicate. A previous study [Hoysniemi 2006] suggests that most users find mirrored images easier to replicate. Wii Fit makes use of both a view from behind and a rear-view-style mirror image. In Posemania, three-dimensional poses are represented by flat images in two-dimensional space, making it even more challenging to accurately convey body positions. For this reason, it is important to provide clear images of noncomplex poses. In a modified version of Posemania, pose images were altered so that the left and right sides of the body were each color-coded. This did not drastically improve performance among experienced players; however, it is possible that the color-coding might prove more useful for new users.

6.4. Design Issue: Pose Speed and Precision

Two separate scoring systems were implemented in our prototype games. In the first system, players gain points as they perform poses at or exceeding a given level of precision. Alternatively, players score an increasing number of points the closer they get to the ideal pose. Players receive visual feedback for their pose precision in the form of a color brightness change: the closer the participant is to perfectly replicating the pose, the brighter the hue of the star shape. The two separate scoring methods lead to different game play. With the first scoring system, players generally do the minimum required to successfully replicate the pose. Players are concerned with speed to maximize the number of successful poses. With the additional precision requirement, players focus on carefully executing the pose perfectly and require more time per pose



as a result. This method is generally preferred among new players as it provides more immediate feedback and satisfaction when they are still learning how to execute the poses. Both scoring methods are legitimate and may each be appropriate in different game contexts.
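The difference between the two scoring systems can be illustrated with a short sketch. The point values, the linear mapping from distance to points, and the function names are assumptions made for illustration; only the idea of a pass/fail score versus a score that grows as the distance to the ideal pose shrinks comes from the text.

```python
def threshold_score(distance, threshold=5.0, points=100):
    """First system: full points for any pose within the threshold."""
    return points if distance <= threshold else 0


def precision_score(distance, threshold=5.0, max_points=100):
    """Second system: more points the closer the pose is to the ideal.

    Assumed linear scale: max_points at distance 0, falling to 0 at the
    threshold and beyond.
    """
    if distance > threshold:
        return 0
    closeness = 1.0 - distance / threshold
    return round(max_points * closeness)
```

Under the first function, a barely acceptable pose earns as much as a perfect one, which matches the observation that players do the minimum required; under the second, a distance of 1.0 earns more than a distance of 4.0, rewarding careful execution.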

6.5. Design Issue: Learning Multiple Poses

Posemania requires players to learn and replicate 8 poses, which can be overwhelming for some. The training process is more successful when players are able to see and try out the poses prior to game play. It may help users to start off with a limited set of poses and gradually add new ones in. Pose speed should also be increased gradually, initially allowing more time for players to recognize the pose and then perform it correctly. When players are forced to play at a speed beyond their capabilities, they often skip the most physically challenging poses. This reduces both enjoyment and health benefits from physical exertion.

7. CONCLUSION AND DISCUSSION

In this work we have shown that sensor networks of accelerometers are a viable input system for control in video games. By using only accelerometers, we have created a relatively inexpensive system still capable of accurate, full-body pose detection. Though a new user of the system may not initially achieve 100% pose recognition, it has been shown that improvement will be seen with practice. Furthermore, recognition rates continue to improve as a larger and more diverse sample of training data is used. Game complexity can be naturally introduced by varying the threshold value for recognition, allowing easy, normal, hard, and outrageous levels of difficulty.
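Difficulty levels driven by the recognition threshold could look like the sketch below. Only the level names and the global threshold of 5 come from this work; the other threshold values, the dictionary, and the function name are hypothetical.

```python
# Hypothetical thresholds: a smaller value demands a more precise pose.
DIFFICULTY_THRESHOLDS = {
    "easy": 8.0,
    "normal": 5.0,  # the global threshold discussed earlier
    "hard": 3.5,
    "outrageous": 2.0,
}


def pose_accepted(distance, level="normal"):
    """True if the measured pose distance passes at the given difficulty."""
    return distance <= DIFFICULTY_THRESHOLDS[level]
```

A pose measured at a distance of 4.0 would pass on easy and normal but fail on hard and outrageous, so the same pose bank can serve every difficulty level.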

We have created a number of pose-based games and have shown that this type of input network would be useful in games that are dance related, as well as in other applications such as fitness and training modules. Due to the limited number of moves in a game, relative to other larger datasets such as DNA sequences, it would appear that classical methods of classification such as our nearest mean distance classifier will perform well enough to meet the computational demands of video games. We have shown the pose classification of an accelerometer sensor network to be a feasible technology in the creation of games.

Our future work includes a number of areas for exploration, including dynamic gesture recognition combined with proper form enforcement as well as examinations into better training and classification techniques. In current game versions, poses are represented in only two dimensions, an issue that causes some players difficulties. Further research will explore alternate methods of pose representation, which may extend to three-dimensional models. Also, we are looking to branch off into nonhomogeneous sensor networks. We will continue to expand our sensor network and its applications by adding more accelerometers as well as other possible sensors. Finally, it is worth noting that certain poses may be extremely difficult for some players to reach or maintain, creating a level of frustration for the player. An interesting consequence of this development is that real-time adaptive thresholding algorithms would be useful to help remove frustration points, yet also keep the game challenging. This, in our opinion, would be preferable to a self-training scheme.

REFERENCES

ASHBROOK, D., WESTEYN, T., AND STARNER, T. 2005. Dancing in the streets: Smartphones and gaming. In Proceedings of the Workshop on Ubiquitous Entertainment and Games at the 7th International Conference on Ubiquitous Computing.

BAO, L. 2003. Physical activity recognition from acceleration data under semi-naturalistic conditions. M. thesis, Massachusetts Institute of Technology.

BAO, L. AND INTILLE, S. 2004. Activity recognition from user-annotated acceleration data. In Proceedings of the 2nd International Conference on Pervasive Computing. 1–17.

BAUDISCH, P., SINCLAIR, M., AND WILSON, A. 2007. Soap: How to make a mouse work in mid-air. In Proceedings of CHI’07 Extended Abstracts on Human Factors in Computing Systems. 1935–1940.

BUSSMANN, J., MARTENS, W., TULEN, J., SCHASFOORT, F., VAN DEN BERG-EMONS, H., AND STAM, H. 2001. Measuring daily behavior using ambulatory accelerometry: The activity monitor. Behav. Res. Meth. Instrum. Comput. 33, 3, 349–56.

CLARKSON, B. Life patterns: Structure from wearable sensors. Ph.D. thesis, Massachusetts Institute of Technology.

CRAMPTON, N., FOX, K., JOHNSTON, H., AND WHITEHEAD, A. 2007. Dance, dance evolution: Accelerometer sensor networks as input to video games. In Proceedings of the IEEE International Workshop on Haptic Audio Visual Environments and their Applications (HAVE’07). 107–112.

DUDA, R. AND HART, P. 1973. Pattern Classification. Wiley Interscience.

FERGUS, P., KIFAYAT, K., COOPER, S., MERABTI, M., AND EL RHALIBI, A. 2009. A framework for physical health improvement using wireless sensor networks and gaming. In Proceedings of the ICST/IEEE International Workshop on Technologies to Counter Cognitive Decline.

FOERSTER, F., SMEJA, M., AND FAHRENBERG, J. 1999. Detection of posture and motion by accelerometry: A validation in ambulatory monitoring. Comput. Hum. Behav. 15, 571–583.

GALTON, F. 1899. On instruments for (1) testing perception of differences of tint and for (2) determining reaction time. J. Anthropol. Inst. 19, 27–29.

HOYSNIEMI, J. 2006. Design and evaluation of physically interactive games. In Dissertations in Interactive Technology, Number 5.

KERN, N., SCHIELE, B., AND SCHMIDT, A. 2003. Multi-Sensor activity context detection for wearable computing. In Proceedings of the European Symposium on Ambient Intelligence (EUSAI).

KESKIN, C., ERKAN, A., AND AKARUN, L. 2003. Real time hand tracking and 3d gesture recognition for interactive interfaces using hmm. In Proceedings of the Joint International Conference ICANN/ICONIP. Springer.

KRATZ, L., SMITH, M., AND LEE, F. J. 2007. Wiizards: 3D gesture recognition for game play input. In Proceedings of the Conference on Future Play.

LAURILA, K., PYLVANAINEN, T., SILANTO, S., AND VIROLAINEN, A. 2005. Wireless motion bands. In Proceedings of the Workshop on Ubiquitous Computing to Support Monitoring, Measuring, and Motivating Exercise (UbiComp’05).

LEE, S. W. AND MASE, K. 2002. Activity and location recognition using wearable sensors. IEEE Pervas. Comput. 1, 3, 24–32.

LUKOWICZ, P., JUNKER, H., STAGER, M., BUREN, T., AND TROSTER, G. 2002. WearNET: A distributed multi-sensor system for context aware wearables. In Proceedings of Ubiquitous Computing (UbiComp’02). G. Borriello and L.E. Holmquist, Ed., Lecture Note in Computer Science; vol. 2498, Springer, 361–70.

MAHALANOBIS, P. C. 1936. On the generalised distance in statistics. Proc. Nat. Inst. Sci. India 12, 49–55.

MANTYJARVI, J., HIMBERG, J., AND SEPPANEN, T. 2001. Recognizing human motion with multiple acceleration sensors. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics. 747–52.

NPD GROUP MARKET RESEARCH. 2007. http://www.npd.com/corpServlet?nextpage=corp-welcome.html.

ORGANICMOTION. 2010. http://www.organicmotion.com/solutions.

PARAPARA. 2010. http://www.paraparaonline.com/thegame/.

PARKER, J. R. 2006. Human motion as input and control in kinetic games. In Proceedings of the Future Play Conference. 1–7.

PAYNE, J., KEIR, P., ELGOYHEN, J., MCLUNDIE, M., NAEF, M., HORNER, M., AND ANDERSON, P. 2006. Gameplay issues in the design of spatial 3D gestures for video games. In Proceedings of CHI’06 Extended Abstracts on Human Factors in Computing Systems. 1217–1222.

PERNG, J., FISHER, B., HOLLAR, S., AND PISTER, K. 1999. Acceleration sensing glove (ASG). In Proceedings of the 3rd International Symposium on Wearable Computers (ISWC’99). 178–180.

PROJECT NATAL. 2010. http://www.xbox.com/en-us/live/projectnatal/.

SEGEN, J. AND KUMAR, S. 1998. Fast and accurate 3D gesture recognition interface. In Proceedings of the 14th International Conference on Pattern Recognition.

VAN LAERHOVEN, K., SCHMIDT, A., AND GELLERSEN, H. 2002. Multi-Sensor context aware clothing. In Proceedings of the 6th IEEE International Symposium on Wearable Computers. 49–56.

WELFORD, A. T. 1980. Choice reaction time: Basic concepts. In Reaction Times, A. T. Welford, Ed. Academic Press, New York. 73–128.



WHITEHEAD, A., JOHNSTON, H., FOX, K., AND CRAMPTON, N. 2007. Sensor networks as video game input devices. In Proceedings of the Future Play Conference.

WILSON, D. AND WILSON, A. 2004. Gesture recognition using the XWand. Tech. rep. CMU-RI-TR-04-57, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA.

WOODWORTH, R. S. AND SCHLOSBERG, H. 1954. Experimental Psychology. Henry Holt, New York.

XSENS. 2010. http://www.xsens.com/.

Received June 2010; accepted August 2010

