
Motion Data and Machine Learning: Prototyping and Evaluation

Thierry Ravet
UMONS Numediart
31 boulevard Dolez
7000 Mons, Belgium
[email protected]

Nicolas d'Alessandro
Hovertone SPRL
4 rue des Soeurs Noires
B-7000, Mons (Belgium)
[email protected]

Joëlle Tilmanne
UMONS Numediart
31 boulevard Dolez
7000 Mons, Belgium
[email protected]

Sohaib Laraba
UMONS Numediart
31 boulevard Dolez
7000 Mons, Belgium
[email protected]

Abstract
In this work, we address the problem of graphical visualization for training and validating machine learning solutions with motion capture data. We describe our experiment in building an efficient system to explore and manipulate motion data collections, both spatially and temporally. We present a prototyping tool for motion representation and interaction design based on the MotionMachine framework. This framework provides a coherent processing chain to annotate the data, apply training algorithms and graphically validate the obtained results.

Author Keywords
Motion Capture, Gesture Recognition, Motion Synthesis, Fast Prototyping

ACM Classification Keywords
H.5.m [Information interfaces and presentation (e.g., HCI)]: Miscellaneous.

Introduction
Over the last ten years, a large number of motion capture techniques have emerged. However, most of these techniques – such as inertial suits or optical marker tracking – have remained expensive and cumbersome. More recently, the democratization of depth cameras – like the Microsoft Kinect – has considerably changed the scope of markerless mocap research. Indeed, the wide dissemination of these sensors has provided various resources (databases, software, results) to the scientific community [1].

Skeletal data acquisition generates a huge amount of high-dimensional data. Such hierarchical data is neither very readable nor very reusable. Machine learning has emerged as a good way to extract the useful information from the studied data and to describe the properties of a collection of motion-captured data. Many authors have described how to use models such as Randomized Decision Forests, Hidden Markov Models, Neural Networks or Restricted Boltzmann Machines in various tasks: interpretation of raw data [2], gesture recognition [3], motion synthesis [4] or animation retargeting [5].

While performance results for recognition tasks can be reported in terms of accuracy and specificity, it is difficult to quantify the quality of synthesized sequences given the dimensionality of the data. In such applications, most studies present their method without giving information about the quality of their results, or simply give a link to some sequences of synthesized motion. One solution is to use subjective tests. This requires good display conditions and tools that allow the evaluators to judge the naturalness of the resulting sequences.

Furthermore, formatting and annotating skeletal data to build training sets is not straightforward. Video annotation systems like Anvil [8] can be transposed to the mocap domain, but an adequate 3D visualization system is an advantage, as it allows the users to change the point of view. Indeed, it is difficult to annotate a gesture correctly without the capability to focus on the most informative skeletal joints.

In the next sections, we present MotionMachine, a motion data processing toolkit, and two different models that we trained with this framework. We highlight the advantages that this platform provides in the development of these machine learning models.

Related Works
Several tools exist to manipulate and visualize motion capture data with the goal of creating motion-enabled applications. The MOCAP toolbox [9] is among these tools, giving access to many high-level processing functions, but it is only available for Matlab and is therefore not very suitable for real-time, iterative and interactive testing and performance. When it comes to performance-driven tools, we find software like the RAMToolkit [10] with the opposite issue: the tool is not generic but very specific to a single use case (Kinect-captured dancers or a specific mocap marker placement). Other frameworks based on visual programming languages allow motion data processing adapted to real-time streaming: we can mention EyesWeb [11] or MuBu [12].

MotionMachine Framework
MotionMachine is an open-source C++ library that enables the rapid prototyping of motion features. These features can be computed on standardized motion capture data structures coming from both typical motion capture file formats and live OSC streams. The user can choose a selection of these features to represent motion in the considered use case. The framework was first introduced in [6].

General Structure
The overall data flow used in MotionMachine is presented in Figure 1. The library is built from two independent modules: one for data representation and feature extraction (built on top of the Armadillo C++ library [13]), the other taking care of 2D and 3D scene visualization and general user interaction aspects (built on top of the openFrameworks C++ library [15]).

Figure 1: Overall data flow used in MotionMachine: modular structure to process mocap data files and/or streams into feature files and streams, label files and visualization [6].

Four important core features are available in the MotionMachine framework:

1. Skeletal Model Independent Motion Data: Most motion capture devices provide skeletal data, i.e. changes in the position and/or orientation of 3D joints and/or segments. In MotionMachine, we have developed a model-independent formalism for storing and accessing such skeletal data through APIs [6].

2. Collections of Motion Feature Extractors: MotionMachine is built around the idea that developers can write custom code to be inserted in the motion capture data processing pipeline, while still preserving the intuitiveness and efficiency of the overall environment. The design principle underlying the available collection of feature extractors is essentially container-driven and based on the idea that offline batch and windowed real-time processing should both be available by default for any built-in or third-party feature (a standalone sketch of this idea is given after this list).

Figure 2: Visualization of the 3D scene and 2D scene (five features and annotations are displayed) for one contemporary dance sequence [6].

3. Interactive 2D/3D Scene View: In MotionMachine, we wanted to improve the affordance of motion capture data processing by solving several visualization issues and bringing the user faster to his/her valuable work. The library comes with an integrated 2D/3D scene viewer for displaying mocap data on screen and interacting with the contents [6]. The 3D and 2D timelines are automatically synchronized, helping to observe the available motion capture data from different viewpoints, as can be seen in Figure 2.

4. Annotation Layer: In MotionMachine, we have integrated a lightweight annotation scheme. It allows the programmatic and UI-based insertion of Labels alongside the motion capture data. This means that the time tag of these Labels can be automatically derived from signal properties in the feature extraction code or added manually by the user.
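As an illustration of the container-driven idea mentioned in point 2, the following standalone Armadillo sketch computes a simple motion feature (mean joint speed) from a buffer of frames; the same function can be called on a full offline recording or on a short window refilled from a live stream. The Frame/Buffer layout and function names are assumptions for this example, not MotionMachine types.

```cpp
// Standalone sketch of a windowed feature extractor in the spirit of the
// container-driven design (offline batch and real-time share one code path).
// Illustrative only: these types are not the MotionMachine API.
#include <armadillo>
#include <vector>
#include <cstdio>

// One skeletal frame: a 3 x nJoints matrix of joint positions (x, y, z).
// A motion buffer is a time-ordered vector of such frames.
using Frame = arma::mat;
using Buffer = std::vector<Frame>;

// Mean speed of one joint over the last `window` frames of the buffer.
// Works identically on a full recording (offline) or on a short ring buffer
// refilled from a live stream (real-time).
double meanJointSpeed(const Buffer& frames, arma::uword joint,
                      std::size_t window, double frameRate)
{
    if (frames.size() < 2) return 0.0;
    std::size_t first = frames.size() > window ? frames.size() - window : 1;
    double total = 0.0;
    for (std::size_t t = first; t < frames.size(); ++t)
        total += arma::norm(frames[t].col(joint) - frames[t - 1].col(joint));
    return total * frameRate / static_cast<double>(frames.size() - first);
}

int main()
{
    // Toy data: 100 frames, 20 joints, random positions, 30 fps.
    Buffer buffer;
    for (int t = 0; t < 100; ++t)
        buffer.push_back(arma::randu<arma::mat>(3, 20));

    double speed = meanJointSpeed(buffer, 5, 30, 30.0); // joint 5, 1 s window
    std::printf("mean speed of joint 5: %f units/s\n", speed);
    return 0;
}
```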


Motion Data and Machine Learning
In this section, we present two machine learning applications which have been implemented in MotionMachine to assess the efficiency of the platform in prototyping such processes.

Figure 3: HMM-based gesture decoding of a dance step sample. Four HMMs (for four different dance steps) were trained with HTK. Each gesture was modeled with 10 states. At the top of the figure, we can see the decoded gesture and the most likely current state. The cursor is pointing at a frame decoded as gesture 2 and state 5.

HMM Decoding with HTK
Our first application deals with gesture recognition. In previous work, we detailed how to implement a gesture decoding system based on Hidden Markov Models (HMMs) [17]. The HMMs represent a gesture as a succession of states. In each state, local statistics of the observations apply, and both these local statistics and the state transition probabilities are determined by training. Hence, it is possible to identify the most likely gesture corresponding to new observations and to compute an approximation of the current state, which reflects the progression within the executed gesture.
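As a reminder (standard HMM notation, ours rather than the paper's), the most likely state path underlying this decoding is obtained with the Viterbi recursion:

\[
\delta_t(j) = \max_i \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(\mathbf{o}_t),
\qquad
\psi_t(j) = \arg\max_i \left[ \delta_{t-1}(i)\, a_{ij} \right],
\]

where \(a_{ij}\) is the transition probability from state \(i\) to state \(j\) and \(b_j(\mathbf{o}_t)\) is the emission likelihood of the observed pose vector \(\mathbf{o}_t\) in state \(j\); the best final state is \(\arg\max_j \delta_T(j)\) and the full path is recovered by backtracking through \(\psi\).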

The annotation of a traditional dance data collection was made in MotionMachine. The HMMs used for gesture recognition were trained using HTK (the Hidden Markov Model Toolkit [16]). Each input data frame contains the body skeleton pose data; we can choose to describe this pose with the positions of the skeleton nodes or the orientations of the bones. The gesture recognition is performed using the Viterbi algorithm; we used the implementation provided by the mlpack library [14]. The Viterbi algorithm is applied on the whole temporal window of data stored in a MotionMachine Track data type. The decoding results (decoded gesture and most likely current state) are displayed in the 2D scene view (see Fig. 3). A cursor is synchronized with the played skeletal data stream. This interface makes it easier for the user to verify the timing accuracy of the decoding and to visualize potential problems in the model.
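As a rough illustration of this decoding step, the sketch below runs Viterbi decoding with mlpack's HMM class over a random observation sequence. Class and header names follow mlpack 3.x and may differ in other versions; the model is untrained here, and transferring the HTK-trained parameters into the mlpack model is not shown. This is not the paper's exact code.

```cpp
// Viterbi decoding of a pose-feature sequence with an HMM whose states have
// Gaussian emissions (mlpack 3.x naming). Illustrative sketch only.
#include <mlpack/core.hpp>
#include <mlpack/methods/hmm/hmm.hpp>
#include <mlpack/core/dists/gaussian_distribution.hpp>
#include <iostream>

using mlpack::hmm::HMM;
using mlpack::distribution::GaussianDistribution;

int main()
{
    const size_t numStates = 10;   // 10 states per gesture, as in the paper
    const size_t dim = 60;         // e.g. 20 joints x 3 position coordinates

    // One HMM per gesture class; a single untrained model here for brevity.
    HMM<GaussianDistribution> hmm(numStates, GaussianDistribution(dim));

    // Observation sequence: one column per frame (dim x T), e.g. taken from
    // the temporal window stored in a MotionMachine Track.
    arma::mat observations = arma::randu<arma::mat>(dim, 200);

    // Predict() runs the Viterbi algorithm: it returns the log-likelihood of
    // the most probable state path and fills `states` with that path.
    arma::Row<size_t> states;
    double logLik = hmm.Predict(observations, states);

    // In the full application, the gesture whose HMM yields the highest
    // log-likelihood is the decoded gesture, and states(t) gives the most
    // likely state (progress within the gesture) at frame t.
    std::cout << "log-likelihood: " << logLik
              << ", state at last frame: " << states(states.n_elem - 1)
              << std::endl;
    return 0;
}
```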

Figure 4: An Input-Output Temporal Restricted Boltzmann Machine: the input data are motion capture data performed by an actor; the output data are animation signals adapted for a specific virtual character.


Animation Retargeting and Difference Visualization
This second application aims at automatically retargeting motion capture data to virtual characters for animation. As [4] and [5] have shown, Restricted Boltzmann Machines (RBMs) can model temporal series such as motion data. We implemented the retargeting solution proposed by [5]. This algorithm provides a way to train a model that automatically adapts motion capture data for a given character.

We recorded a motion capture data collection of 10 gestures. Each gesture sequence was then manually adapted to the morphology of a cartoon-like virtual avatar by a 3D animator.

Figure 5: Visualization of the difference between captured motion and synthesized data adapted to a virtual character.

With this double set of temporally aligned data, we trained a TRBM (temporal RBM) that follows the structure depicted in Figure 4. This TRBM model can then be used to automatically adapt new sets of motion capture data for the animation of the virtual character for which it was trained.
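To make the structure of Figure 4 more concrete, the following sketch performs a few mean-field inference sweeps in a simplified input-output TRBM with binary hidden units and Gaussian outputs, conditioned on the current actor pose and the previous output frame. The weight matrices, dimensions and single-frame history are illustrative assumptions; this does not reproduce the exact architecture or trained model of [5].

```cpp
// Simplified, untrained sketch of inference in an input-output temporal RBM:
// given the actor pose x and the previously generated character pose y_prev,
// alternate mean-field updates between the binary hidden layer and the
// Gaussian output layer. Names and dimensions are illustrative assumptions.
#include <armadillo>

// Logistic sigmoid, applied element-wise.
static arma::vec sigmoid(const arma::vec& z) { return 1.0 / (1.0 + arma::exp(-z)); }

struct IOTRBM
{
    arma::mat W;   // hidden <-> output (undirected), nHidden x nOut
    arma::mat A;   // input   -> hidden (directed),   nHidden x nIn
    arma::mat B;   // y_prev  -> hidden (directed),   nHidden x nOut
    arma::mat P;   // input   -> output (directed),   nOut x nIn
    arma::mat Q;   // y_prev  -> output (directed),   nOut x nOut
    arma::vec b;   // output bias
    arma::vec c;   // hidden bias

    // One retargeted frame: a few mean-field sweeps, starting from y_prev.
    arma::vec retargetFrame(const arma::vec& x, const arma::vec& yPrev,
                            int sweeps = 10) const
    {
        arma::vec y = yPrev;
        for (int k = 0; k < sweeps; ++k)
        {
            arma::vec h = sigmoid(W * y + A * x + B * yPrev + c); // hidden means
            y = W.t() * h + P * x + Q * yPrev + b;                // Gaussian means
        }
        return y;
    }
};

int main()
{
    const arma::uword nIn = 60, nOut = 60, nHidden = 100;
    IOTRBM m;
    m.W.randn(nHidden, nOut); m.A.randn(nHidden, nIn); m.B.randn(nHidden, nOut);
    m.P.randn(nOut, nIn);     m.Q.randn(nOut, nOut);
    m.b.zeros(nOut);          m.c.zeros(nHidden);

    arma::vec actorPose = arma::randu<arma::vec>(nIn);   // captured frame
    arma::vec prevOut   = arma::zeros<arma::vec>(nOut);  // previous output
    arma::vec charPose  = m.retargetFrame(actorPose, prevOut);
    charPose.print("retargeted character pose (random weights):");
    return 0;
}
```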

However, comparing original motion capture data and synthesized adapted data is often complicated for the human eye, as the differences can sometimes be very subtle. MotionMachine provides a user-friendly way to compare the data. To analyze the results, as can be seen in Figure 5, the difference between the original captured skeletal poses and the synthesized adapted data can be clearly highlighted for each frame. This enables the user to visualize the influence of the model on the original data, and hence to verify the accuracy of the machine learning process.
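As a minimal sketch of the comparison behind Figure 5, the code below computes, for each frame, the Euclidean distance between corresponding joints of the captured and synthesized poses. The 3 x nJoints frame layout is an assumption for this example, not a MotionMachine data type.

```cpp
// Per-frame, per-joint comparison of captured and synthesized poses.
#include <armadillo>
#include <vector>
#include <cstdio>

// Per-joint distances for one frame; both matrices are 3 x nJoints.
arma::rowvec jointDistances(const arma::mat& captured, const arma::mat& synthesized)
{
    return arma::sqrt(arma::sum(arma::square(captured - synthesized), 0));
}

int main()
{
    const arma::uword nJoints = 20, nFrames = 100;
    std::vector<arma::mat> captured(nFrames), synthesized(nFrames);
    for (arma::uword t = 0; t < nFrames; ++t)
    {
        captured[t]    = arma::randu<arma::mat>(3, nJoints);
        synthesized[t] = captured[t] + 0.05 * arma::randn<arma::mat>(3, nJoints);
    }

    // A per-frame summary (mean joint error) can be plotted as a feature curve
    // in the 2D scene view, while per-joint values can drive the colour of
    // each joint in the 3D view.
    for (arma::uword t = 0; t < nFrames; ++t)
    {
        arma::rowvec d = jointDistances(captured[t], synthesized[t]);
        std::printf("frame %3llu: mean joint error = %f\n",
                    (unsigned long long) t, arma::mean(d));
    }
    return 0;
}
```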

Conclusion
The high dimensionality of motion capture data makes it difficult for the user to manipulate it efficiently when designing machine learning based applications. As human beings, our understanding of motion is mainly based on the visualization of the motion itself, displayed on a 3D character, rather than on abstract high-dimensional coordinate time series.

In this paper, we have described our proposal for a platform designed to help the user prototype a machine learning processing chain for motion analysis, integrating what we consider important for such an interface. A 3D scene visualization, synchronized and superimposed with a 2D scene rendering motion feature time series, allows the user to manipulate motion data collections and to segment and annotate them efficiently. The framework that we propose provides parsing functions to extract the skeletal data from motion capture archives in different file formats. By implementing machine learning algorithms within the MotionMachine framework, we propose to follow the vision of openFrameworks [15]: simplicity, intuitiveness and extensibility. This results in a modular platform that can manage both offline motion data and real-time data streams.

Future work will include improvements in how the user can interact with the underlying machine learning processes. In order for such interaction to lead to more adequate machine learning models, we need to further explore the visual representation of the impact of learning algorithm parameter tuning on modeling performance. We will explore such visual representations and evaluate their efficiency in the context of prototyping machine learning applications.

Acknowledgements
This work has been supported by the European Union (FP7-ICT-2011-9) under grant agreement no. 600676 (i-Treasures project) and by regional funds from the Région Wallonne GREENTIC programme (convention number 1317957).

References
[1] Z. Zhang: Microsoft Kinect Sensor and Its Effects. IEEE MultiMedia, vol. 19, no. 2, 4–10 (2012).
[2] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake: Real-Time Human Pose Recognition in Parts from Single Depth Images. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR '11), 1297–1304 (2011).
[3] Y. Yin and R. Davis: Real-Time Continuous Gesture Recognition for Natural Human-Computer Interaction. IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), 113–120 (2014).
[4] G. W. Taylor, G. E. Hinton, and S. T. Roweis: Modeling Human Motion Using Binary Latent Variables. NIPS 2006, 1345–1352 (2006).
[5] M. D. Zeiler, G. W. Taylor, L. Sigal, I. Matthews, and R. Fergus: Facial Expression Transfer with Input-Output Temporal Restricted Boltzmann Machines. Neural Information Processing Systems (2011).
[6] J. Tilmanne and N. d'Alessandro: MotionMachine: A New Framework for Motion Capture Signal Feature Prototyping. Proc. of EUSIPCO 2015.
[7] N. d'Alessandro et al.: Towards the Sketching of Performative Control with Data. Proc. of the eNTERFACE Summer Workshop on Multimodal Interfaces, 2013.
[8] Anvil, http://www.anvil-software.org/
[9] B. Burger and P. Toiviainen: MoCap Toolbox – A Matlab Toolbox for Computational Analysis of Movement Data. Proc. of the 10th Sound and Music Computing Conference (SMC) (2013).
[10] RAMToolkit, http://interlab.ycam.jp/en/projects/ram/ram_dance_toolkit
[11] The EyesWeb Project, http://www.infomus.org/eyesweb_ita.php
[12] MuBu, http://imtr.ircam.fr/imtr/MuBu
[13] Armadillo, C++ linear algebra library, http://arma.sourceforge.net/
[14] mlpack, scalable machine learning library, http://www.mlpack.org/
[15] openFrameworks, C++ creative coding library, http://www.openframeworks.cc/
[16] University of Cambridge: The Hidden Markov Model Toolkit (HTK), 2009. http://htk.eng.cam.ac.uk
[17] T. Ravet, J. Tilmanne, and N. d'Alessandro: Hidden Markov Model Based Real-Time Motion Recognition and Following. Proc. of the 2014 International Workshop on Movement and Computing (MOCO '14), 2014.