Sensing (E)motions
A review of state of the art motion and emotion sensing technologies in human computer interaction
Nikolaos Poulios
MSc. Computer Science / Multimedia
Vrije Universiteit Amsterdam
Creative Learning Lab – Waag Society
Amsterdam 2011
Table of Contents
1. Introduction
2. Basic motion sensors
   2.1 Sensing forces
   2.2 Detecting motion
   2.3 Measuring distance
3. Motion Capture and tracking systems
   3.1 Optical Systems
   3.2 Non-optical systems
   3.3 Motion capture libraries
4. Motion sense in interaction
   Hand Tracking
   Head/Face Tracking
   Eye Tracking
   Nintendo Wii Remote
   Floor boards
   Sony PlayStation Move
   Microsoft Kinect
5. Sensing emotions
   5.1 Speech analysis
   5.2 Facial expressions
   5.3 Body movement/postures
   5.4 Pupil size
   5.5 Bio-sensors
   5.6 Brain Computer Interfaces (BCI)
   5.7 Developing Tools for Emotional Intelligence
6. Sensor Hardware Platforms
   6.1 Arduino
   6.2 .Net Gadgeteer
   6.3 Phidgets
   6.4 Shimmer
   6.5 I-CubeX
7. Interactive Software Development Frameworks
   Visual Programming Languages
   Working with sensors
Bibliography
References
1. Introduction
This document is a study of current trends in human computer physical interaction, focusing on input interfaces utilizing motion and emotion sensing technologies. This study is meant to be a review of state of the art interaction systems, presenting their main characteristics, as well as of hardware and software frameworks that facilitate the development of projects utilizing motion sensing technologies and multi-modal emotion recognition techniques based on the analysis of image, vocal, and biophysical signals.
Physical interaction has been a challenge for HCI researchers and designers for many years, but it was recent technological progress that allowed the production of innovative and practical motion sensing interfaces. These interfaces introduce new possibilities for interaction, but also the challenge for designers of adapting new technologies to existing mental models. A common pitfall in motion based interaction design is to end up with a system that requires the user to move in a very strict way, using very specific gestures, and is thus not truly physical. Reflecting this challenge, most systems presented in this document are currently used commercially mainly in video games, which are a non-critical application field, offering more freedom for experimentation with new technologies.
Most commercial games with motion interaction are based on sports themes, encouraging the physical activity of the player and enhancing the entertainment value. Beyond exertion games, motion interaction technology provides the base for a new range of educational games based on virtual worlds, offering user immersion and evoking lifelike experiences, focusing on embodied and playful learning. Physical education is included in all school programs and has been shown to benefit not only the physical but also the mental state of students. Motion interaction gives the opportunity to embed movement in the learning process with a playful and more active approach. Learning via movement may add an additional modality and prime knowledge for later recall. Creating more opportunities for physical, embodied learning may enable students to utilize more neural connections (via movement) to aid the recall of new knowledge [1].
Embedding emotion sensing technologies adds another dimension to game dynamics and interactive storytelling, enhancing affective communication and the idea of extended cognition, where mind, body and environment together form a complete cognitive system of their own. Multi-modal signal analysis can provide us with a better insight into players' state and behaviour and allow us to develop more personalized educational, training and assessment tools. Virtual worlds and the dynamics of game and play provide excellent possibilities to train and assess social emotional competencies, and allow learners to interact with simulated conflict situations within a safe and confined space.
2. Basic motion sensors
This part of the document is a short presentation of the fundamental motion sensors used in applications that are studied further in the document. These sensors are basic electronic components with a very specific function: translating changes in one form of energy into changes in electrical energy. All the presented sensors have been around us for quite a while now, in everyday systems like automatic sliding doors and lights, alarm systems, cars, and various industrial control systems. During recent years, technological progress has reduced their size and cost, allowing their application in a variety of devices like mobile phones and game controllers, while certain projects have developed frameworks to facilitate and simplify their use in multi-purpose applications made by a wider range of people involved in the design and programming of interactive systems.
2.1 Sensing forces
Piezoelectric sensors are a category of sensors that use the piezoelectric effect to measure pressure, acceleration, strain or force by converting them to an electrical charge. Piezoelectricity is the ability of some materials, notably crystals and certain ceramics, to generate an electric potential in response to physical stress.
Force-sensing resistors are materials whose resistance changes when a force is applied to them. Flexible force sensors are ultra-thin, flexible printed circuits, consisting of two laminated layers of conductive material and pressure-sensitive ink. The resistance of a flexible sensor in a circuit decreases under pressure. Flexible sensors are used to measure forces in a higher range than that of a piezoelectric sensor.
Capacitance sensors are very sensitive sensors, detecting anything that is conductive or has a dielectric different from that of air. Nowadays they are usually found in touch screens, though there are capacitance sensors that can detect the body's charge from distances of up to a meter (such sensors are used by the Theremin musical instrument).
An accelerometer is a sensor that measures the change in speed of movement, or acceleration. Conceptually, an accelerometer behaves as a damped mass on a spring. When the accelerometer experiences acceleration, the mass is displaced to the point that the spring is able to accelerate the mass at the same rate as the casing. The displacement is then measured to give the acceleration. An accelerometer thus measures weight per unit of (test) mass, a quantity also known as specific force, or g-force. Another way of stating this is that, by measuring weight, an accelerometer measures the acceleration of the free-fall reference frame relative to itself. Accelerometers typically have two or sometimes three axes of measurement.
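To illustrate how raw accelerometer readings are typically interpreted, the short sketch below converts raw ADC values to g-force and derives a static tilt angle. The zero-g offset and sensitivity constants are hypothetical; real values come from the datasheet of the specific sensor and ADC used.

import math

# Hypothetical calibration constants for a 10-bit ADC and a generic
# analogue accelerometer; real values come from the sensor datasheet.
ZERO_G_OFFSET = 512.0   # ADC counts at 0 g
COUNTS_PER_G = 102.3    # ADC counts per 1 g

def raw_to_g(raw):
    """Convert a raw ADC reading to acceleration in g (specific force)."""
    return (raw - ZERO_G_OFFSET) / COUNTS_PER_G

def tilt_angle_deg(ax_g, az_g):
    """Estimate tilt around one axis from two acceleration components,
    valid only when the device is (nearly) static so gravity dominates."""
    return math.degrees(math.atan2(ax_g, az_g))

if __name__ == "__main__":
    ax, az = raw_to_g(545), raw_to_g(610)
    print(f"ax = {ax:.2f} g, az = {az:.2f} g, tilt = {tilt_angle_deg(ax, az):.1f} deg")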
Gyroscopes are sensors that measure angular velocity. They are similar to accelerometers, except that they measure how fast the angle of rotation is changing, rather than measuring acceleration in a straight line. Gyroscopes work based on the principle of conservation of angular momentum. Mechanical gyroscopes consist of a high-rate spinning disk whose axle is free to take any orientation, mounted on a set of two gimbals with orthogonal pivot axes, allowing the gyroscope to minimize the effect of external torques and preserve its orientation, regardless of any motion of the platform on which it is mounted.
2.2 Detecting motion
Photoelectric switches use a light beam hitting a photosensitive target sensor. When a body breaks the beam, passing between the light source and the sensor, the switch is activated.
Passive infrared sensors measure infrared light radiating from objects in their field of view. Apparent motion is detected when an infrared source with one temperature, such as a human, passes in front of an infrared source with another temperature, such as a wall.
Magnetic switches consist of a very thin pair of contacts in a protective housing. When exposed to a magnet they are drawn together closing the switch.
Hall effect sensors are transducers that vary their output voltage in response to changes in the magnetic field around them.
2.3 Measuring distance
Most distance sensors use an energy source transmitting a reference signal, and a sensor measuring the signal reflected by the target back to the source, to calculate the distance of the target. Most applications use infrared light sensors, sending an infrared beam and reading the reflection of the beam off a target. For longer ranges, ultrasonic sensors are used, sending a ping of ultrasonic sound and then timing how long it takes to bounce back. Alternative implementations of distance sensors are based on combinations of magnetic or Hall effect sensors (for very short distances), measuring variations in a reference magnetic field.
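A minimal sketch of the timing principle behind ultrasonic distance sensors: the measured echo time is halved (the ping travels to the target and back) and multiplied by the speed of sound. The echo time used below is a hypothetical value; on real hardware it would be measured from the sensor's echo signal.

SPEED_OF_SOUND = 343.0  # metres per second, in air at roughly 20 degrees C

def distance_from_echo(echo_time_s):
    """Distance to the target from a round-trip ultrasonic echo time."""
    return SPEED_OF_SOUND * echo_time_s / 2.0

if __name__ == "__main__":
    # Hypothetical echo time of 5.8 ms corresponds to roughly 1 metre.
    print(f"{distance_from_echo(0.0058):.2f} m")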
3. Motion Capture and tracking systems
Motion capture (mocap)/tracking is the process of recording/tracking body movement and mapping it onto the movement of a digital model. The mechanics of human body movement has been a topic of scientific interest since ancient times, and today many different disciplines use motion analysis systems to capture movement and posture of the human body. In clinical research, motion capture has been used to analyze walking patterns of impaired patients in order to select the right orthopedic treatment, to monitor the progress of a treatment, and to help the design of prosthetics. Motion analysis is also widely used in sports to analyze and optimize athletes' movement in order to achieve better performance.
In recent years motion capture systems have been used extensively in cinematography and video games to animate computer generated characters with natural human movement, following the recorded moves of an actor inside special studios, replacing the traditional animation method of rotoscoping, in which animators trace over live-action film footage frame by frame. Despite the high cost of the special equipment, space and setup required for a motion capture system, such systems are preferred by some productions over traditional animation techniques for their ability to give more realistic results in shorter, or even real, time.
Motion capture is a very active field of research; today there are many alternative types of systems using different technologies, with differences in accuracy, functional requirements and cost, and their suitability depends on the nature of the project. The range of applications utilizing motion capture is becoming wider, following the progress made on processors, memory chips and sensors regarding their speed, accuracy, size and cost, as well as the progress on algorithms developed for data processing. The two major categories of motion capture systems are optical and non-optical.
3.1 Optical Systems
Optical systems work based on data captured from a single or multiple image sensors calibrated to provide overlapping projections, and algorithms to triangulate the 3D position of a subject in space. Most optical systems utilize markers, distinguishable by the cameras from the rest of the captured image in order to determine their position more easily and accurately. The process of motion capture begins with the calibration of the system, in which markers are placed in known positions and every camera position and lens distortion is calculated accordingly. If two calibrated cameras see a marker, its 3D position can be determined. After calibration of the system a performer wears markers near each joint of her body so that the motion can be identified by the positions of, or angles between, the markers. The number of cameras required for an optical system depends on the size of the space we need to cover, the desired accuracy and the number of subjects we need to track at the same time. Typically such a system consists of 6 to 24 high-speed cameras, while there are systems using hundreds of cameras to achieve better accuracy. Optical systems are characterized by the captured image resolution in pixels, the sampling frequency in hertz and the frame rate, which is balanced against the image resolution and sampling frequency. Different types of markers exist between optical systems.
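Before turning to the marker types, the triangulation step described above can be sketched briefly. The example below assumes two calibrated cameras whose 3x4 projection matrices are already known from the calibration procedure, and uses OpenCV's triangulatePoints as one possible implementation; the intrinsics, baseline and pixel coordinates are hypothetical.

import numpy as np
import cv2

# Hypothetical intrinsics shared by both cameras and a 20 cm baseline;
# real projection matrices would come from the calibration step.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

# Hypothetical pixel coordinates of the same marker in both views, shape (2, N).
pts1 = np.array([[320.0], [240.0]])
pts2 = np.array([[300.0], [240.0]])

# Triangulate: OpenCV returns homogeneous coordinates of shape (4, N).
points_h = cv2.triangulatePoints(P1, P2, pts1, pts2)
marker_3d = (points_h[:3] / points_h[3]).T
print(marker_3d)   # roughly (0, 0, 5) metres for these values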
Passive markers are the simplest type of markers, featuring retro-reflective material to reflect light generated near the camera lens. The camera's threshold is adjusted to sample the bright reflective markers, ignoring the rest of the captured image. The major advantage of passive markers is that the subject does not need to wear any electronics that might limit her freedom to move. Passive markers are attached directly to the skin or to a specially designed spandex/lycra full body suit. Their major disadvantage is what is called marker swapping: all markers are identical, so the system might mismatch a marker with the corresponding joint, requiring a larger number of cameras to avoid the problem.
Figure 1: Active marker motion capture system
Active markers are another type of markers. Instead of reflecting light, active markers use LEDs to emit light, increasing the maximum distances and volume for capture. Optical systems using active markers triangulate positions by illuminating one LED at a time very quickly, or multiple LEDs with software to identify them by their relative position. Refined versions of active markers exist, using time modulation over the amplitude or pulse of the LEDs to provide a marker ID in order to eliminate marker swapping. Computer processing of modulated IDs offers cleaner data and less filtered results. This higher accuracy and resolution requires more processing than passive technologies, but the additional processing is done at the camera to improve resolution via subpixel or centroid processing, providing both high resolution and high speed.
Both technologies mentioned above are mainly used indoors in special motion capture studios. Passive systems are usually less expensive than active ones and easier to set up, while active systems are more accurate and, after the initial setup, require less time to get results from. Commercial active and passive systems are available from companies like Vicon, NaturalPoint, Qualisys and PhaseSpace, and usually cost between tens and hundreds of thousands of euros.
Semi-passive (photosensitive) markers. Prakash [2] is a motion capture system developed at MIT's Media Lab as an inexpensive alternative (the overall cost is less than 1,000 euros), suitable also for outdoor use and real time motion capture. Instead of using expensive high-speed cameras, Prakash uses multi-LED high-speed projectors with passive binary films (masks) set in front. The light intensity sequencing provides a temporal modulation and the masks provide a spatial modulation. Each beamer projects invisible (near infrared) binary patterns thousands of times per second. Tags with photo sensors attached to the scene determine their location by decoding the transmitted space-dependent labels. Apart from their position, tags can compute their own orientation, incident illumination, and reflectance. These tracking tags work in natural lighting conditions and can be imperceptibly embedded in attire or other objects. The system supports an unlimited number of tags in a scene, with each tag uniquely identified, eliminating marker-swapping issues. Since the system eliminates the high-speed camera and the corresponding high-speed image stream, it requires significantly lower data bandwidth. The tags also provide incident illumination data, which can be used to match scene lighting when inserting synthetic elements.
Markerless Motion Capture. Motion capture and computer vision have been a very active field of research during the last 15 years, and there have been many studies aiming to develop markerless motion capture systems, based on the use of a single or multiple cameras and optimized image analysis algorithms, with performance comparable to that of the more expensive commercial systems mentioned previously.
Recently a team from Carnegie Mellon University working with Disney Research presented a system featuring small body-mounted cameras to reconstruct the motion of a subject [3]. Outward-looking cameras are attached to the limbs of the subject, and the joint angles and root pose are estimated through non-linear optimization. The optimization objective function incorporates terms for image matching error and temporal continuity of motion. Structure-from-motion is used to estimate the skeleton structure and to provide initialization for the non-linear optimization procedure. Global motion is estimated and drift is controlled by matching the captured set of videos to a 3D reconstruction of the scene built from reference imagery. By estimating the camera poses, the global and relative motion of an actor can be captured outdoors under a wide variety of lighting conditions or in extended indoor regions without any additional equipment.
Several other techniques and algorithms have been proposed for markerless motion capture of single or multiple subjects. Most of them use footage from multiple cameras to make a volumetric reconstruction of the body, using background removal, skin color detection, "shape from silhouette" (SFS) and structure from motion methods. The formalism of SFS was introduced by A. Laurentini [4]. By definition, an object lies inside the volume generated by back-projecting its silhouette through the camera center (called the silhouette's cone). With multiple views of the same object at the same time, the intersection of all the silhouette cones builds a volume called the "visual hull", which is guaranteed to contain the real object. After the visual hull has been constructed, body pose is estimated by fitting shape models of specific body parts to the volume, or by applying heuristic assumptions about features related to position and establishing the correspondence of joints between successive frames. Markerless motion capture systems based on these methods have been developed by various academic research laboratories, like the BioMotion Lab of Stanford University [5], the University of Amsterdam [6], and the Max Planck Institute [7], as well as commercial systems like Organic Motion's solutions.
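A simplified sketch of the shape-from-silhouette idea described above: voxel centres in a working volume are projected into each camera and kept only if they fall inside every silhouette, approximating the visual hull. Projection matrices and silhouette masks are assumed to be available from calibration and background removal; the grid resolution and volume bounds are illustrative.

import numpy as np

def carve_visual_hull(silhouettes, projections, bounds, resolution=64):
    """Approximate the visual hull by voxel carving.

    silhouettes: list of binary masks (H, W), one per camera
    projections: list of 3x4 projection matrices, one per camera
    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) of the working volume
    Returns a boolean voxel grid; True voxels lie inside every silhouette cone.
    """
    axes = [np.linspace(lo, hi, resolution) for lo, hi in bounds]
    xs, ys, zs = np.meshgrid(*axes, indexing="ij")
    # Homogeneous voxel centres, shape (4, N)
    voxels = np.stack([xs.ravel(), ys.ravel(), zs.ravel(), np.ones(xs.size)])
    inside = np.ones(xs.size, dtype=bool)
    for mask, P in zip(silhouettes, projections):
        h, w = mask.shape
        proj = P @ voxels                      # (3, N) image-plane projections
        u = proj[0] / proj[2]
        v = proj[1] / proj[2]
        valid = (proj[2] > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(xs.size, dtype=bool)
        hit[valid] = mask[v[valid].astype(int), u[valid].astype(int)] > 0
        inside &= hit                          # intersection of silhouette cones
    return inside.reshape(xs.shape)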
3.2 Non-optical systems
This category includes all motion capture systems that, instead of image sensors, use alternative types of sensors to capture motion. These systems collect data from wearable sensors attached to the subject's body and translate them into motion in space. Their main advantage is that, because they are not based on cameras, they don't require a studio setup, they are more portable, and they can be used outdoors, capturing motion in large areas independent of light conditions. Their main disadvantages are that they are usually less accurate than optical systems and that they might limit the subject's freedom to move and perform.
Inertial systems use miniature inertial sensors attached to the joints of the body, biomechanical models and sensor fusion algorithms to translate data into motion. Starting from a known position, inertial systems use wireless accelerometers and gyroscopes, sending data to a computer to continuously calculate the position, orientation and velocity of the subject with full six degrees of freedom of body motion. Their accuracy depends on the number of sensors used. Commercial inertial motion capture systems are available from companies like XSens and Animazoo.
Mechanical or exo-skeleton systems use a skeletal-like structure worn by the subject, consisting either of straight metal or plastic rods linked together with potentiometers articulating the joints, or of flexible sensors measuring joint angles during motion. Mechanical systems are real time and low cost, but they capture only the relative movement of the subject, requiring an external absolute positioning system, and they might not be comfortable for a performer to wear. Commercial systems like the Gypsy 7 by Animazoo combine gyroscopes and an exo-skeleton to capture absolute and relative motion.
Magnetic systems utilize sensors placed on the body to measure the low-frequency magnetic field generated by a transmitter source. Position and orientation are calculated from the relative magnetic flux of three orthogonal coils on both the transmitter and each receiver. The relative intensity of the voltage or current of the three coils allows these systems to calculate both range and orientation by meticulously mapping the tracking volume. Each sensor captures 6 degrees of freedom, which provides useful results with two-thirds the number of markers required in optical systems; one on the upper arm and one on the lower arm suffice for elbow position and angle. Magnetic systems are low cost but nowadays rarely used because of their major disadvantages. Since each sensor requires its own (fairly thick) shielded cable, the tether used by magnetic systems can be quite cumbersome. Magnetic systems also have issues with azimuth: if an actor is doing a push-up type posture, the system will get confused. Multiple actor magnetic setups also have problems with two or more actors in close proximity, as sensors from the different actors will interfere with each other, producing distorted results. Finally, magnetic systems react very negatively to metal or magnetic fields in the environment caused by metallic construction materials in buildings or other electrical appliances in use.
3.3 Motion capture libraries
As mentioned before, motion capture is an easier way to give realistic motion to virtual characters, and although most motion capture systems require expensive equipment and special studios, independent developers can take advantage of free or commercial libraries available online, which include motion captured data from various human activities, in file formats that can be imported into 3D animation software and mapped to any character model. A quick search for motion capture libraries will return a long list of resources, among them Carnegie Mellon University, which has published a very large motion capture database, freely available at http://mocap.cs.cmu.edu/; http://www.mocapclub.com/, which includes a library from the Motion Capture Society association; and http://mocapdata.com, which is also a large resource of both free and commercial animation files.
4. Motion sense in interaction
In recent years, sensors and principles used in motion capture systems have been applied on a smaller scale to low cost consumer input devices, to provide physical interaction input interfaces. During the last five years, all major companies in the video game industry have developed different technologies for games and controllers with motion based interaction. Although sports have always been a popular theme in video games, and game companies started to explore sensor based physical interfaces from the middle of the 1980s, it was not until recently that technology allowed them to produce wireless and lightweight devices, practical to use as game controllers. That fact, along with the popularity of large TV screens in today's average living room, has created the basis for games offering more immersion and encouraging gamers' physical activity. Today "exertion games" or "exergames" are a growing market, attracting also people who were not traditionally drawn to video games and considered them a rather passive activity.
This part is a presentation of current techniques and examples of devices for physical input interfaces and game controllers, based on motion sensors.
Hand Tracking
Designing wearable input interfaces, usually called "data gloves", to allow a user to use her hands and fingers to navigate in a virtual world, use hand gestures, and interact with objects in a more natural way, was one of the first examples of natural user interfaces. The first data glove was created in 1977, and since then a few companies and laboratories have come up with their own implementations. Data gloves use various sensors, such as accelerometers or gyroscopes, to capture hand movement, and flexible sensors for the bending of fingers. Some data gloves use optical fibers attached to the fingers and a photocell as a way to measure bending, since some light escapes the fiber when it is bent. Some data gloves also provide haptic feedback, applying small forces and vibrations to give users a sense of touch.
Data gloves are also used in body motion capture systems, because solutions based on markers are not able to capture such detail in finger movement. This technique is called hand-over.
Head/Face Tracking
Facial expressions and small facial muscle movements are also difficult to capture during body motion capture. For that reason facial motion capture is done in a separate recording, by attaching many small markers to the actor's face.
In the field of interaction and the gaming industry, head tracking devices exist, allowing the computer to set a camera's viewpoint according to the position of the player in space. Commercial systems, like NaturalPoint's TrackIR, use an infrared sensor and active markers attached to the player's head. Other systems, like many head mounted displays for virtual reality, use tilt sensors to track head movement. There are also applications that use a plain camera and automatic face detection algorithms to track the user's position, but because they rely on a plain camera they lack, or are less accurate on, movement along the depth axis.
Eye Tracking
Eye tracking is the process of measuring either the point of gaze of a viewer or the motion of an eye relative to the head. Eye trackers are mostly used in research on the visual system, in psychology, in cognitive linguistics, and also in marketing research, product design and usability testing, to spot elements that attract viewers' gaze and others that do not.
Eye trackers measure rotations of the eye and principally fall into three categories. The first category uses an attachment to the eye, like a contact lens with an embedded mirror or magnetic field sensor. Measurements with tight fitting contact lenses have provided extremely sensitive recordings of eye movement, and magnetic search coils are the method of choice for researchers studying the dynamics and underlying physiology of eye movement. The second category uses electric potentials measured with electrodes placed around the eyes. The eyes are the origin of a steady electric potential field, which can also be detected in total darkness and with the eyes closed. It can be modeled as being generated by a dipole with its positive pole at the cornea and its negative pole at the retina. The electric signal derived using two pairs of contact electrodes placed on the skin around one eye is called the electrooculogram (EOG). If the eyes move from the centre position towards the periphery, the retina approaches one electrode while the cornea approaches the opposing one. This change in the orientation of the dipole, and consequently in the electric potential field, results in a change in the measured EOG signal. Inversely, by analysing these changes, eye movement can be tracked.
The last and most commonly used category is non-intrusive, optical systems using the Pupil Centre Corneal Reflection (PCCR) technique. This technique uses a light source to illuminate the eye, causing highly visible reflections, and a camera to capture an image of the eye showing these reflections. Image processing algorithms are then used to identify the reflection of the light source on the cornea and the pupil. Calculating the angle between the two reflections, combined with other geometrical characteristics of the reflections, allows us to determine the gaze direction.
There are two different illumination setups that can be used with the PCCR technique: bright pupil tracking, where an illuminator is placed close to the optical axis of the imaging device, which causes the pupil to appear lit up; and dark pupil tracking, where the illuminator is placed away from the optical axis, causing the pupil to appear darker than the iris. Different factors affect pupil detection with each of the two techniques, such as the age of the subject, light conditions and ethnicity. Some commercial systems, like Tobii eye trackers, can use both techniques, determining the best one during the calibration procedure in which the viewer is asked to gaze at certain points on screen.
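To give a rough idea of the final step of the PCCR technique, the sketch below maps the pupil-to-glint offset vector to screen coordinates with a simple linear model fitted during calibration. Real systems use considerably more elaborate geometrical models; all coordinate values here are hypothetical.

import numpy as np

def fit_gaze_mapping(offsets, screen_points):
    """Fit an affine mapping from pupil-glint offset vectors to screen points.

    offsets: (N, 2) pupil centre minus corneal reflection, in pixels
    screen_points: (N, 2) known calibration targets on screen
    Returns a (3, 2) matrix used by predict_gaze().
    """
    design = np.hstack([offsets, np.ones((len(offsets), 1))])  # affine model
    mapping, *_ = np.linalg.lstsq(design, screen_points, rcond=None)
    return mapping

def predict_gaze(mapping, offset):
    """Estimate the on-screen gaze point for a new offset vector."""
    return np.hstack([offset, 1.0]) @ mapping

# Hypothetical calibration data: the viewer looked at four known targets.
offsets = np.array([[-10, -6], [12, -5], [-9, 8], [11, 9]], dtype=float)
targets = np.array([[100, 100], [1820, 100], [100, 980], [1820, 980]], dtype=float)
M = fit_gaze_mapping(offsets, targets)
print(predict_gaze(M, np.array([0.0, 0.0])))  # roughly the screen centre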
Eye trackers can also be used as an interaction input interface, replacing a mouse for example, allowing the user to control the cursor with her eyes. EyeWriter [8] is a collaborative research project for building an eye tracker from inexpensive materials, along with open source software, developed to empower people who are suffering from ALS and other physical disabilities with creative technologies.
Nintendo Wii Remote
In 2006, Nintendo released its now popular Wii video game console. The major innovation of the Wii was its remote game controller, the Wii Remote. The Wii Remote features an infrared sensor and an accelerometer that allow it to calculate its position in space and track hand movement. Using the Wii Remote, the player is able to aim at items on screen, and interact using gestures and natural movement.
Upon its release, the Wii Remote gained much attention thanks to its advanced features and quickly became very popular among programming enthusiasts, who wrote software that allowed the use of the device beyond the game console. Since then the Wii Remote has been used in numerous projects as a controller, or as an infrared sensor to track infrared LEDs attached to other items, for example in a head tracking system like the one previously mentioned.
Floor boards
Floorboards equipped with pressure sensors were the first attempt to make an input interface, with which a player would utilize her whole body in game interaction. The first controller of this kind was created by Atari, in 1982, called Joyboard. In 2007, Nintendo released a modern, wireless version, called Balance Board, along with a series of fitness games utilizing it, called Wii Fit, for the Wii game console.
Sony PlayStation Move
Sony's motion sensing platform for the PlayStation console includes the PlayStation Eye camera, which is capable of capturing standard video at 60 Hz at 640x480 pixel resolution, or at 120 Hz at 320x240 pixels, along with computer vision and gesture recognition software, and a microphone array for voice location tracking and voice command recognition.
The PlayStation Move motion controller features an orb at the head, which can glow in any of a full range of RGB colors using LEDs. Based on the colors in the user environment captured by the PlayStation Eye camera, the system dynamically selects an orb color that can be distinguished from the rest of the scene. The colored light serves as an active marker, the position of which can be tracked by the camera. The uniform spherical shape and known size of the light, also allows the system to accurately determine the controller's distance from the camera through the light's image size. The controller also features an accelerometer and a gyroscope, used to track rotation as well as overall motion. An internal magnetometer is also used for calibrating the controller's orientation against the Earth's magnetic field to help correct against cumulative error (drift) by the inertial sensors. The inertial sensors can be used to calculate position in cases where the camera tracking is insufficient, such as when the controller is obscured behind the player's back.
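The depth estimate from the orb's image size follows directly from the pinhole camera model: with the sphere's physical diameter and the camera's focal length (in pixels) known, distance is inversely proportional to the diameter of the sphere in the image. The numbers below are illustrative, not the actual PlayStation Eye or Move specifications.

def distance_from_sphere_size(focal_px, real_diameter_m, image_diameter_px):
    """Pinhole-model distance estimate from the apparent size of a sphere
    of known physical size."""
    return focal_px * real_diameter_m / image_diameter_px

# Illustrative values: a ~4.5 cm sphere imaged 30 pixels wide by a camera
# with an assumed focal length of 540 pixels.
print(f"{distance_from_sphere_size(540.0, 0.045, 30.0):.2f} m")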
Microsoft Kinect
Kinect was Microsoft's answer to the motion sensing competition among video game consoles. Initially released as an accessory for the Xbox 360 game console, Kinect was the first consumer device that allowed real-time, markerless, full body 3D motion capture in a room environment. Kinect features a normal RGB camera and a depth sensor consisting of an infrared laser projector and an infrared camera, capable of capturing 3D video data at 30 Hz at 640x480 pixels. The sensor also includes a 3-axis accelerometer to determine its orientation and a four-microphone array allowing it to receive voice commands, perform ambient noise reduction, and determine the source location of a sound. The most innovative part of Kinect, though, is a microprocessor running an algorithm trained, using machine learning and a large training set of images, to track the motion of multiple bodies, based on 20 joints per body.
Kinect uses a single depth image [9], which is segmented into a dense probabilistic body part labeling, with the parts defined so as to be spatially localized near skeletal joints of interest. Reprojecting the inferred parts into world space, the spatial modes of each part distribution are localized, generating confidence-weighted proposals for the 3D locations of each skeletal joint. The segmentation into body parts is treated as a per-pixel classification task. A very large collection of realistic depth images of humans of many shapes and sizes, in highly varied poses sampled from a large motion capture database, was used to train a deep randomized decision forest classifier that avoids over-fitting. Simple, discriminative depth comparison image features yield 3D translation invariance while maintaining high computational efficiency. Finally, the spatial modes of the inferred per-pixel distributions are computed using mean shift, resulting in the 3D joint proposals.
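The depth comparison features referred to above (as described in [9]) compare the depth at two offsets around a pixel, with the offsets normalised by the depth at the pixel itself so that the feature is invariant to the distance of the body from the camera. Below is a minimal sketch of one such feature evaluation on a synthetic depth map; the offsets and image values are illustrative only.

import numpy as np

BACKGROUND = 1e6  # large constant for probes that fall outside the image

def depth_feature(depth, x, u, v):
    """Depth comparison feature f(I, x) = d(x + u/d(x)) - d(x + v/d(x)).

    depth: 2D depth image (metres), x: (row, col) pixel,
    u, v: 2D offsets normalised by the depth at x.
    """
    d_x = depth[x]
    def probe(offset):
        r = int(x[0] + offset[0] / d_x)
        c = int(x[1] + offset[1] / d_x)
        if 0 <= r < depth.shape[0] and 0 <= c < depth.shape[1]:
            return depth[r, c]
        return BACKGROUND  # off-image probes get a large depth value
    return probe(u) - probe(v)

# Tiny illustrative example on a synthetic depth map.
d = np.full((480, 640), 4.0)
d[100:300, 200:400] = 2.0            # a "body" closer to the camera
print(depth_feature(d, (200, 300), (0.0, 150.0), (0.0, 300.0)))  # -2.0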
Figure 2: Kinect tracking joints
Kinect truly revolutionized the field of natural user interfaces for gaming and became, upon its release, the fastest selling consumer electronics device ever. As with the release of the Wii Remote, it quickly attracted the attention of a large community of programming enthusiasts who wrote open source software allowing the use of Kinect in independent computer applications, followed by a large number of projects found on the internet utilizing the sensor, including interactive applications, games, installations and robotics. After the release on the internet of a large number of impressive examples of uses of the Kinect, companies involved in its development, like PrimeSense and Microsoft, decided to support these efforts by releasing software to facilitate independent project development.
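As an example of the community software mentioned above, the open source libfreenect project provides Python bindings for reading Kinect frames directly on a PC. The sketch below assumes the freenect Python module is installed and a Kinect is connected; it is a rough illustration of the synchronous capture interface rather than a complete application.

# Requires the libfreenect Python bindings (the "freenect" module) and a
# connected Kinect; a sketch of the synchronous capture calls.
import freenect
import numpy as np

depth, _ = freenect.sync_get_depth()   # 480x640 array of raw depth values
video, _ = freenect.sync_get_video()   # 480x640x3 RGB frame

# Crude conversion of the raw values to a normalised image for
# visualisation purposes (not metric depth).
depth_norm = (depth.astype(np.float32) / 2047.0 * 255).astype(np.uint8)
print(depth.shape, video.shape, depth_norm.max())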
5. Sensing emotions
The vision of machines with emotional intelligence [10] has coexisted with that of artificial intelligence since the invention of the term. It is a popular theme in science fiction literature, featuring androids that understand emotions and display human-like behavior, and aptly raising ethical questions about the use of such technologies. Although we are still quite far from this vision (or nightmare for some), research laboratories around the world work on developing emotion sensing technology to support the study of human behavior, affective human computer interaction, and communication between people. Automatic recognition of human affective states is an important research topic for a broad range of applications, including psychology research, computer assisted therapeutic systems, safety monitoring applications, assessment and training systems, user experience studies, marketing research, and automatic affect-based indexing of digital material [11].
Emotion recognition can make social interaction more effective in cases where there are difficulties in communicating expressively, for example for people on the autistic spectrum, where an autistic person might outwardly appear calm and relaxed while experiencing a state of emotional or cognitive overload [12], and in everyday social networking applications where there is a tendency towards text based communication, or communication through avatars in virtual worlds.
As with physical interaction interfaces, a lot of studies experiment with the application of physiological sensors in video games and interactive storytelling [13]. Video games are an excellent application area to explore the benefits and drawbacks of physiological sensor interaction because there are less severe consequences of failure than in critical control systems, making games a field bridging laboratory research and commercial systems. It has also been shown that video games can stimulate strong emotional reactions from players, making them an appropriate field for behavior studies, and as gaming has turned into a huge entertainment industry, companies are interested in using physiological feedback for game design evaluation. Explorations to develop "biofeedback" games, games that make users more aware of their physiological state and train them to control it using game dynamics, started in the early 1980s. In 1984, Thought Technology developed a racing game called CalmPrix [14], utilizing a modified galvanic skin response sensor, and other innovative game companies like Atari and Nintendo followed, using a variety of bio-sensors to present their own biofeedback games. Some of these games never made it to the market, while others did, but without the expected market success.
As we all know from personal experience, emotions are hard to define and recognize. Despite all our senses and the verbal and non-verbal communication skills we have as humans, it is often hard to immediately recognize someone's emotions, whether they are real or pretended, whether someone is talking seriously or joking, laughing or crying, etc. Expression of emotions becomes even more complex when analyzed on a global, cross-cultural scale. It is easy to imagine, thus, that emotion recognition is a very difficult task for a computer, especially in real time applications where the system has to analyze the user's state and give a response within a very narrow time frame. Classic psychological research claims the existence of six basic expressions of emotion that are universally displayed and recognized: happiness, anger, sadness, surprise, disgust, and fear [15]; other studies on emotion recognition also include emotions like despair, interest, irritation and pride [16]. A lot of studies do not accept this categorization of emotions, suggesting that it is not emotions but some components of emotions that are universally linked with certain communicative displays. Most theorists agree that the two dominant dimensions of emotion can be described as valence (pleasant vs. unpleasant) and arousal (activated vs. deactivated, or excited vs. calm) [17]. Mapping even basic emotions onto these two dimensions is challenging, and emotion recognition systems analyzing single human modalities like voice or facial expressions usually suffer either from poor accuracy or from an oversimplified classification of emotions.
Figure 3: Emotions mapped on basic dimensions
The next part is a presentation of the various sensors used to capture physiological signals that can be associated with the emotional state of a person, along with software for emotion recognition developed in previous research.
5.1 Speech analysis
Speech is the primary method of human communication. Analysis of certain features extracted from speech characteristics, like intensity, pitch, phonetic features, voice segments, pause length, and spectral modeling, along with linguistic analysis based on the keywords used, can be used to draw conclusions about the emotional state of a person [18].
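As a small sketch of extracting two of the prosodic features mentioned above (energy and pitch) from an audio file, the example below uses the librosa library as one possible tool; the file name is hypothetical, and a real emotion recognizer would compute many more features and feed statistics of them to a classifier.

import numpy as np
import librosa

# Hypothetical input file containing a short speech utterance.
y, sr = librosa.load("utterance.wav", sr=16000)

# Short-time energy (RMS) per frame: a rough proxy for vocal intensity.
rms = librosa.feature.rms(y=y)[0]

# Fundamental frequency (pitch) per frame via the YIN estimator.
f0 = librosa.yin(y, fmin=75, fmax=400, sr=sr)

# Simple utterance-level statistics of the kind fed to an emotion classifier.
features = {
    "rms_mean": float(np.mean(rms)),
    "rms_std": float(np.std(rms)),
    "f0_mean": float(np.nanmean(f0)),
    "f0_range": float(np.nanmax(f0) - np.nanmin(f0)),
}
print(features)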
EmoVoice [19], developed by the Human Centered Multimedia Laboratory of the University of Augsburg, is a framework for emotional speech corpus and classifier creation and for offline as well as real-time online speech emotion recognition. The framework is meant to be used by non-experts and therefore comes with an interface to create one's own personal or application specific emotion recognizer. EmoVoice is now integrated into the SSI framework (see the frameworks in section 5.7).
openEar [20], developed by the Institute for Human-Machine Communication of the Technische Universität München, is an open source C++ library for speech processing and emotion recognition, combining features for audio recording, feature extraction, and classification of results, along with pre-trained models.
5.2 Facial expressions
Facial expression analysis was the first method used for emotion recognition and has been used extensively in multiple studies since then; it is the preferred method for single modal emotion recognition systems. Facial expressions are the main non-verbal communication tool, providing the most powerful, versatile and natural means of communicating motivational and affective state. Apart from expressing emotion, facial expressions provide important communicative cues during social interaction, such as our level of interest, our desire to take a speaking turn, and continuous feedback signaling understanding of the information conveyed. Facial expression constitutes 55 percent of the effect of a communicated message [21] and is hence a major modality in human communication. Several studies have also shown that ordinary people can detect six emotional facial expressions with an accuracy ranging from 70% to 98%.
In facial expression analysis systems, the face is segmented, focusing on the facial areas of the eyes, eyebrows, mouth and nose. Each of these feature-candidate areas contains features whose boundaries are extracted and stored over time, and the displacement of each feature is then compared to "neutral face" model images to conclude the emotion expressed by the subject. Differences between systems usually lie in the number of features tracked and the kind of classifier used.
There are already quite a few systems for facial expression analysis developed by research institutes, and some are available for research or commercially. Examples of such systems are: the SHORE system [22], developed by Fraunhofer; eMotion [23], a project started at the University of Amsterdam, which also includes software to map captured facial expressions onto Second Life avatars; MindReader [24], developed initially by Cambridge University (based on the commercial system of Nevenvision, now acquired by Google); projects of the ibug (intelligent behaviour understanding group) of the Imperial College London [25]; and FaceAPI [26] from Seeing Machines. There are also some open source examples of facial feature tracking using the openCV [27] (Open Computer Vision) library and the included Haar classifier. openCV is a library for real time image analysis and has become one of the standard libraries for computer vision, with C, C++, Python, and Java interfaces, used in robotics and multimedia applications, and included in a lot of frameworks for the development of such applications.
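As a minimal example of the openCV Haar classifier mentioned above, the sketch below detects faces in a single webcam frame. A full expression analysis system would then locate the eyes, eyebrows and mouth inside each detected face and track their displacement from a neutral model; this sketch only covers the detection step.

import cv2

# Load one of the pre-trained Haar cascades shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

cap = cv2.VideoCapture(0)          # default webcam
ok, frame = cap.read()
cap.release()

if ok:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
    cv2.imwrite("faces.png", frame)
    print(f"Detected {len(faces)} face(s)")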
5.3 Body movement/postures
Although a lot has been written about so-called "body language", body movement and posture have not been researched for emotion recognition as extensively as facial expressions and voice analysis. There are, though, some studies questioning the validity of facial expressions as a modality for recognizing affective states, because the face is involved in various functions and many of the famously recognized facial expressions represent only a small subset of the possible expressions, suggesting body posture as a very good indicator for certain categories of basic emotions. Most studies, however, have not been able to demonstrate recognition accuracy similar to that of facial expression classifiers, especially those which study emotion recognition from static body postures only. Coulson [28] considered how 6 joint rotations (head bend, chest bend, abdomen twist, shoulder forward/backward, shoulder swing, and elbow bend) could help recognize 6 emotions (anger, fear, happiness, sadness, surprise and disgust). Concordance rates for attributions of the 6 emotions ranged from zero for many disgust postures to over 90 percent for some anger and sadness postures. Kleinsmith and Bianchi-Berthouze [29] used four affective dimensions (valence, arousal, potency, and avoidance) instead of discrete emotion categories. In their study there was a 12% error rate for valence, 10% for both arousal and potency, and 11% in the case of avoidance. In their conclusions they report that other types of body motion features may be necessary for achieving better recognition of some affective states, such as fear, and better performance of their model. Other studies that include body motion as a modality [30], tracking features like the quantity of motion and contraction index of the body (see the sketch below), the velocity, acceleration and fluidity of the hand's barycenter, and orientation and approach/avoidance behaviors of two participants towards their interlocutor in an interaction, suggest that body language reflects their level of activation and dominance but is less informative about their valence (positive vs. negative).
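Two of the body-motion features mentioned above, quantity of motion and contraction index, can be approximated from binary silhouette images in a few lines. This is a rough sketch of the definitions used in the EyesWeb-related literature, assuming silhouettes have already been extracted by background removal.

import numpy as np

def quantity_of_motion(prev_silhouette, curr_silhouette):
    """Fraction of the current silhouette area that changed since the
    previous frame: a rough measure of how much the body is moving."""
    changed = np.logical_xor(prev_silhouette > 0, curr_silhouette > 0).sum()
    area = max((curr_silhouette > 0).sum(), 1)
    return changed / area

def contraction_index(silhouette):
    """Ratio of silhouette area to its bounding-box area: values near 1
    indicate a contracted posture, lower values an expanded one."""
    ys, xs = np.nonzero(silhouette)
    if len(xs) == 0:
        return 0.0
    box_area = (xs.max() - xs.min() + 1) * (ys.max() - ys.min() + 1)
    return len(xs) / box_area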
Another role of body posture should also be noted. Studies suggest that body posture can actually induce changes in affective states or have a feedback role affecting motivation and emotion. A study by Riskind and Gotay [31], for example, revealed how "subjects who had been temporarily placed in a slumped, depressed physical posture later appeared to develop helplessness more readily, as assessed by their lack of persistence in a standard learned helplessness task, than did subjects who had been placed in an expansive, upright posture." Furthermore, it was shown that posture also had an effect on verbally reported self-perceptions. Another study [32] examining postures as a modality for recognizing emotions suggests that involving the body in the control of technology facilitates users' expression of their feelings, which in turn gives them an improved experience, i.e., being engaged.
An open source library for analyzing body motion extracted from video is the EyesWeb [33] Expressive Gesture Analysis Library. EyesWeb refers both to research projects of InfoMus Lab of the University of Genova, on
multimodal interactive systems and expressive gesture, and to an open software platform to support the development of real-‐time multimodal distributed interactive applications.
5.4 Pupil size
Studies have shown that the eye's pupil is significantly larger during both emotionally negative and positive stimuli than during neutral stimuli [34]. Although it cannot distinguish valence, pupil size can be used as an additional indicator of arousal. Many eye tracker devices have the ability to measure the pupil's size.
5.5 Bio-sensors
Of the range of modalities mentioned in the previous sections, facial expression analysis has been researched the most and proven to be the most accurate. The use of this technique, though, introduces a number of practical difficulties in some applications. During face tracking the camera must have a clear image of the face, which limits freedom of movement, requires good lighting conditions, and calls for a rather static background image. Additionally, it is easy for someone not to reveal their emotions to the camera, or, as mentioned earlier, autistic persons for example might even have difficulty doing so when they want to express their emotions. For these reasons scientists have also turned to the use of embodied biophysical sensors, monitoring signals that can reveal valuable information not only about someone's physical state, but about their emotional and mental state as well.
The physiological signals usually monitored in behavior studies are:
Heartbeat rate (ECG): Electrocardiography sensors determine heartbeat rate by detecting and amplifying the tiny electrical changes on the skin that are caused when the heart muscle depolarizes, measuring the difference in voltage between two electrodes placed on either side of the heart. There are also optical heartbeat sensors, using an infrared LED and a phototransistor placed close to each other, usually with a fingertip or the ear lobe in between. These sensors work based on the fact that when your heart beats there is a quick rush of blood into tiny blood vessels close to your skin, which makes the skin less transparent, so less light comes through it to the phototransistor. Changes in heartbeat can give us a clear index of arousal, but the sensors are prone to movement artifacts, and it is difficult to determine valence.
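A short sketch of how a heartbeat rate can be derived from the optical (photoplethysmographic) signal described above, by detecting peaks and averaging the intervals between them. The signal below is synthetic; a real sensor stream would first need filtering against the movement artifacts mentioned above.

import numpy as np
from scipy.signal import find_peaks

FS = 100  # sampling rate in Hz

def heart_rate_bpm(signal, fs=FS):
    """Estimate beats per minute from a PPG-like signal via peak detection."""
    # Require peaks to be at least 0.4 s apart (i.e. below 150 bpm).
    peaks, _ = find_peaks(signal, distance=int(0.4 * fs))
    if len(peaks) < 2:
        return 0.0
    intervals = np.diff(peaks) / fs          # seconds between beats
    return 60.0 / intervals.mean()

# Synthetic signal at roughly 70 bpm, for illustration only.
t = np.arange(0, 10, 1 / FS)
ppg = np.sin(2 * np.pi * (70 / 60) * t) + 0.05 * np.random.randn(len(t))
print(f"{heart_rate_bpm(ppg):.1f} bpm")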
Galvanic Skin Response (GSR) and Electro Dermal Activity (EDA) both refer to the electrical changes measured at the surface of the skin. EDA sensors usually work by passing a minuscule amount of direct current between two electrodes in contact with the skin. When a person experiences emotional arousal, increased cognitive workload or physical exertion, the brain sends signals to the skin to increase the level of sweating. Sweat is a weak electrolyte and a good conductor, and the filling of sweat ducts results in increased conductance of the applied current. Changes in skin conductance at the surface thus provide a sensitive and convenient measure for assessing the sympathetic arousal changes associated with emotion, cognition and attention.
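A very rough sketch of how a skin conductance stream is often split into a slowly varying tonic level and faster phasic responses (the peaks usually associated with arousal events). Real EDA analysis uses more careful filtering and deconvolution; the signal below is synthetic and the window length is an assumption.

import numpy as np

def split_eda(conductance, fs, window_s=4.0):
    """Separate an EDA signal (microsiemens) into tonic and phasic parts
    using a simple moving-average baseline."""
    win = max(int(window_s * fs), 1)
    kernel = np.ones(win) / win
    tonic = np.convolve(conductance, kernel, mode="same")  # slow baseline
    phasic = conductance - tonic                            # fast responses
    return tonic, phasic

# Synthetic example: a drifting baseline with one arousal-like response.
fs = 20
t = np.arange(0, 60, 1 / fs)
eda = 2.0 + 0.005 * t + 0.3 * np.exp(-((t - 30) ** 2) / 4.0)
tonic, phasic = split_eda(eda, fs)
print(f"max phasic response: {phasic.max():.3f} uS")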
Skin temperature/Heat flux is the amount of heat that the body emits. Studies have shown that Heat Flux is effective in detecting context switches. This is because context switches often involve physical movement, which causes the body to warm up and therefore emit heat.
There are a lot of companies today producing commercial wireless, wearable biophysical sensors that transmit signals to software running on a smartphone or computer, aimed at sports enthusiasts who like to monitor and keep track of their exercising habits. Most of them do not offer an open API for application development, but in some cases it is possible to read the packets sent by the sensor with custom libraries.
5.6 Brain Computer Interfaces (BCI)
Brain computer interfaces are sensors monitoring brain activity to translate user’s thoughts or mental state into actions on the computer. The brain’s electrical charge is maintained by billions of neurons. Neurons are electrically charged by membrane transport proteins that pump ions across their membranes. Neurons are constantly exchanging ions with the extracellular milieu, for example to propagate action potentials.
Electroencephalography (EEG) is the recording of electrical activity, using electrodes attached along the scalp, measuring voltage fluctuations resulting from ionic current flows within neurons, and generated by the synchronous activity of thousands or millions of neurons with similar spatial orientation in the brain.
Since its discovery in 1924 by Hans Berger, EEG has been widely used in clinical research and neurology, to diagnose epilepsy, coma, brain death and various encephalopathies. Scalp EEG activity shows oscillations at a variety of frequencies, and researchers have associated certain oscillation frequency ranges and spatial distributions with different states of brain functioning. Although EEG is not the most accurate method to monitor brain activity, its ease of use, portability and low set-up cost have made it the most studied one, and have led to its application in other research fields and all kinds of experiments where it is interesting to monitor the mental state of the subject. Usually three frequency ranges are used for this purpose (a short sketch of estimating their power follows the list):
• Theta (4 – 7 Hz): related to drowsiness
• Alpha (8 – 13 Hz): related to relaxation
• Beta (13 – 30 Hz): related to alertness
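As referenced above, the relative power in these bands can be estimated from a single EEG channel using Welch's method. The signal below is synthetic, and real EEG processing would include artifact rejection and per-electrode analysis; this is only a sketch of the band-power computation.

import numpy as np
from scipy.signal import welch

BANDS = {"theta": (4, 7), "alpha": (8, 13), "beta": (13, 30)}

def band_powers(eeg, fs):
    """Return absolute power per frequency band for one EEG channel."""
    freqs, psd = welch(eeg, fs=fs, nperseg=fs * 2)
    powers = {}
    for name, (lo, hi) in BANDS.items():
        mask = (freqs >= lo) & (freqs <= hi)
        powers[name] = np.trapz(psd[mask], freqs[mask])  # integrate PSD over band
    return powers

# Synthetic alpha-dominated (10 Hz) signal, i.e. a "relaxed" subject.
fs = 256
t = np.arange(0, 10, 1 / fs)
eeg = np.sin(2 * np.pi * 10 * t) + 0.3 * np.random.randn(len(t))
print(band_powers(eeg, fs))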
In recent years, EEG has made its way into human computer interaction research and research towards machines with emotional intelligence, and a small number of companies are working on developing low cost, non-invasive brain computer interface products, like the Emotiv headset, Neurosky's Mindwave, and Starlab's Enobio (which combines EEG, ECG and EOG sensors), while OpenEEG [35] is a community project created to support the creation of open hardware and software solutions. On a consumer level, these interfaces are currently used mainly in gaming and other entertainment applications, since they still prove to be inaccurate and impractical for more critical applications.
Functional near-infrared spectroscopy (fNIRS) is an emerging technique for sensing brain activity, similar to the technique used by the optical heartbeat sensors mentioned earlier in the document. The fNIRS system is made up of probes that send light at two wavelengths in the near-infrared range. Biological tissues are relatively transparent to light at these wavelengths. The main absorbers of the light are oxygenated hemoglobin and deoxygenated hemoglobin. These act as relevant markers of hemodynamic and metabolic changes associated with neural activity in the brain. The reflected light is then picked up by the detectors on the device. Depending on the amount of light that is reflected, we can get a measure of brain activity in the area beneath the sensors.
Studies in fNIRS [36] report that the hemodynamic response being measured in the brain is slow, occurring over 5–8 seconds. This currently makes the technique impractical for interaction input interfaces. For the moment there is no commercial brain computer interface utilizing the fNIRS technique.
5.7 Developing Tools for Emotional Intelligence
As mentioned in the introduction of this chapter, emotion recognition is a difficult task for a computer, and the performance of such systems can vary depending on the state of the interacting person as well as on environmental conditions. In order to increase the reliability of emotion sensing systems, and after gaining experience with single-modality analysis systems, modern research examines multi-modal systems [37][38], which combine various sensors and data analyses and share a final decision level to determine the emotional or affective state of the subject (a minimal sketch of such decision-level fusion is given below). In this direction there have been a number of projects, with contributions from universities all over Europe, developing frameworks and middleware that make it easier for researchers to build and use multi-modal emotion recognition systems.
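As a rough illustration of decision-level fusion, each modality-specific classifier can report a probability distribution over the same emotion labels, and a final stage combines them, for example by weighted averaging. The modality names, weights and label set below are made up for the example and are not taken from the cited systems:

LABELS = ["neutral", "joy", "anger", "sadness"]  # illustrative label set

def fuse(decisions, weights):
    # decisions: {modality: {label: probability}}, weights: {modality: float}
    total_weight = sum(weights[m] for m in decisions)
    fused = {}
    for label in LABELS:
        fused[label] = sum(
            weights[m] * decisions[m].get(label, 0.0) for m in decisions
        ) / total_weight
    return max(fused, key=fused.get), fused

# Hypothetical outputs of three single-modality recognizers.
decisions = {
    "speech":  {"neutral": 0.2, "joy": 0.5, "anger": 0.20, "sadness": 0.10},
    "face":    {"neutral": 0.3, "joy": 0.6, "anger": 0.05, "sadness": 0.05},
    "posture": {"neutral": 0.4, "joy": 0.3, "anger": 0.20, "sadness": 0.10},
}
weights = {"speech": 0.4, "face": 0.4, "posture": 0.2}  # confidence per modality
print(fuse(decisions, weights))  # expected winner: "joy"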
CALLAS [39] (Conveying Affectiveness in Leading-edge Living Adaptive System) is a project funded by the European Commission under the 6th Framework Programme, with the participation of many universities around Europe. CALLAS is a framework based on a plug-in multimodal architecture, containing a collection of components that extract features from text, audio, video and motion sensors and process their emotional aspects in real time, for easy development of applications for art and entertainment. The CALLAS framework also includes its own visual programming and authoring tool, CAT.
SEMAINE [40] is also a project funded by the European Commission, under the 7th Framework Programme, aiming to build a Sensitive Artificial Listener: a multimodal dialogue system which can sustain an interaction with a user for some time and react appropriately to the user's non-verbal behavior. The system can take video and audio input to analyze the user's emotional state. The SEMAINE API is available as open source, supporting C++ and Java; it uses the Apache ActiveMQ message broker as an integration layer and can run as a distributed system.
SSI [41] (Social Signal Interpretation) is a framework developed by the Human Centered Multimedia research laboratory of the University of Augsburg. It is available as open source, written in C++, and contains tools to record, analyze and recognize human behavior in real time, such as gestures, facial expressions, head nods and emotional speech. It also follows a plug-in based design, with a growing collection of components including, among others, input from the Wii Remote and the Kinect sensor (under development), and it supports the use of external libraries such as OpenCV, ARToolKit, SHORE, Torch, Speex and Watson. SSI supports the machine-learning pipeline in its full length and offers a graphical interface that assists a user in collecting their own training corpora and obtaining personalized models. It also features an XML-editor programming environment to draft and run pipelines without special programming skills.
Apart from developing dedicated software, many projects have focused on creating standard formats to represent human emotions and share them among emotion-aware applications. These formats can be used, for example, to annotate digital media in order to train models for affective indexing, to collect data to train virtual agents, or to share data between an emotion recognition system and an application developed by another party that animates a virtual avatar of the user accordingly.
MPEG-4 (Part 2, "Visual") contains the MPEG-4 FAP [42] (Facial Animation Parameters), a set of 68 parameters that allow the animation of synthetic face models and can be used in facial expression analysis applications. MPEG-V [43] is a standard under development for a common middle-layer format for interaction and visualization among virtual world applications.
EMMA [44] (Extensible Multimodal Annotation Language) is an XML markup language, recommended by the W3C, for containing and annotating the interpretation of user input. It is a wrapper language that can include various kinds of payloads representing interpretations of user input. An interpretation element carries information about the modality upon which the interpretation is based, can indicate start and end timestamps of the interpretation, and supports many more attributes. EmotionML [45] is a "plug-in" language, also specified by the W3C, which can be combined with EMMA to represent human emotions in user input. EmotionML recognizes the fact that there is no single agreed representation of affective states, or of vocabularies to use. Therefore, an emotional state <emotion> can be characterized using four types of descriptions: <category>, <dimensions>, <appraisals>, and <action-tendencies>. An example of an EMMA document carrying EmotionML as its interpretation payload is given below:
<emma:emma xmlns:emma="http://www.w3.org/2003/04/emma" version="1.0">
  <emma:interpretation emma:start="123456789">
    <emotion xmlns="http://www.w3.org/2005/Incubator/emotion">
      <dimensions set="valenceArousalPotency">
        <arousal value="-0.29"/>
        <valence value="-0.22"/>
      </dimensions>
    </emotion>
  </emma:interpretation>
</emma:emma>
HEO [46] (Human Emotion Ontology) is an effort to build an RDF/OWL ontology to represent human emotions, with subclasses and attributes to describe input modalities, dimensions (arousal, valence, dominance), action tendencies and more.
SAIBA [47] (Situation, Agent, Intention, Behavior, Animation) is an ongoing project focusing on the creation of a framework of languages for Embodied Conversational Agents, with three stages representing intent planning, behavior planning and behavior realization. A Function Markup Language (FML), describing intent without referring to physical behavior, mediates between the first two stages, and a Behavior Markup Language (BML), describing the desired physical realization, mediates between the last two. BML has behavior elements for the head, torso, face, gaze, body, legs, gestures, speech and lips, and defines attributes for synchronizing animation, lip movement, gaze, gestures, etc.
More information, articles and tools can be found on the website of the HUMAINE Association [48], an international community around research on emotions and human-machine interaction.
6. Sensor Hardware Platforms
There is a very large number of companies producing sensors and offering specialized solutions for any kind of project. Being final products designed for a specific use, however, these solutions often restrict their application in custom projects and their integration with custom-written software. The architectural design of a project featuring multiple sensors requires not only a sensor network in which all sensors work together without problems, but also one that can be customized to fit the project's data-flow design. Sensor platforms meet both requirements, offering a common standard base across sensors and the freedom to customize their function and connectivity. The following part presents some examples of sensor platforms in use today, with different design approaches.
6.1 Arduino
Arduino is an open-source electronics platform, designed as a low-cost, expandable, multi-purpose prototyping platform based on flexible, easy-to-use hardware and software. Since its introduction, Arduino has created a very large community sharing support and code; it is used for education in many laboratories around the world and has become a standard for interaction designers, media artists, and hobbyists.
The basic Arduino platform consists of three parts. The first is the Arduino microcontroller board, which can be built by hand using the provided schematics or purchased preassembled, in different versions and sizes: versions designed to implement wireless nodes, with an XBee* radio connector and circuitry for battery power and charging, or versions like the LilyPad, designed to be sewn onto fabric for wearable applications. The Arduino boards are based on Atmel's 8-bit AVR family of microcontrollers with RISC architecture.
The second part of the platform is the language and compiler. Arduino's language is based on C and designed to simplify the creation of physical interaction applications, in combination with the third part, the IDE, which is built in Java. Together the three parts make a platform with a simplified programming language, basic enough to be easily used for common programming tasks, yet powerful enough to support complex projects.
Arduino can be expanded with a great variety of add-ons, called Arduino shields, as well as a great variety of motion and environmental sensors, network devices and servomotors, and can be used to implement wireless sensors, tangible interfaces and robots.
*XBee is a ZigBee-enabled radio module used with Arduino. ZigBee is a wireless communication standard designed to be inexpensive and low-power. Most importantly, ZigBee is particularly well suited to mesh networks, which connect node to node instead of relying on a single router.
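To give an idea of how readings from such a board typically reach the application side, the sketch below (in Python, using the pyserial library) reads values that an Arduino is assumed to stream as plain text lines over its USB serial port. The port name, baud rate and line format are assumptions for the example; on the board side, a sketch would simply read a pin and print the value over serial:

import serial  # pyserial

# Assumed settings: adjust the port name and baud rate to the actual board.
PORT = "/dev/ttyUSB0"
BAUD = 9600

with serial.Serial(PORT, BAUD, timeout=1) as conn:
    while True:
        line = conn.readline().decode("ascii", errors="ignore").strip()
        if not line:
            continue  # timeout or empty line
        try:
            value = int(line)  # e.g. a 10-bit analog reading (0-1023)
        except ValueError:
            continue  # skip malformed lines
        print("sensor value:", value)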
6.2 .Net Gadgeteer
Following Arduino's success, Microsoft Research recently launched the open source .NET Gadgeteer platform, a microcontroller based on the ARM7 processor, designed to be programmed with Microsoft's .NET Micro Framework and C# and expanded through solder-less connection modules. The idea of solder-less modules should encourage more people without any experience in building circuits to try building their own gadget prototypes. Since Gadgeteer is a very new platform and uses its own connection standard, the list of available sensor modules is still limited.
6.3 Phidgets
Phidgets is also a platform on the same concept as Arduino, designed to be even simpler. Phidgets is a line of plug-and-play building blocks for physical computing that connect to a computer over USB and communicate with any application. The Phidgets API handles all the USB communication with the devices, simplifying the communication between devices and applications. Arduino supports the creation of more complex projects, but Phidgets allows simpler prototypes to be built faster, and supports programming in a large variety of programming languages, including high-level languages like C# and ActionScript 3, as well as visual programming frameworks like Max/PureData and LabVIEW (see Ch. 7).
6.4 Shimmer
Shimmer is an open source platform for small, wearable, wireless sensors. Shimmer started as a project of Intel Research and is now a division of Realtime Technologies. Unlike the previously presented hardware platforms, which focus on multi-purpose prototype building, Shimmer produces preassembled, highly sophisticated sensors, focusing more on research around Body (or Personal) Area Networks (BAN/PAN). BAN research aims at the development of wireless distributed systems for autonomous and remote monitoring of patients in health care.
The Shimmer platform consists of the main unit, a lightweight pack with an MSP430 processor, a battery, Bluetooth and 802.15.4 connectivity, a micro SD memory slot for offline data storage, a tilt sensor and an accelerometer. A variety of motion, biophysical and ambient sensors can be connected to the unit. The firmware of the unit embeds TinyOS [53], a very light and highly customizable operating system specially designed for low-power embedded systems and sensor networks. Shimmer supports application development in C# and also provides a LabVIEW library, while every unit is an autonomous node providing data in raw or semi-processed format, accessible to any application via custom libraries.
6.5 I-CubeX
I-CubeX is a commercial platform producing a variety of sensors and providing multiple sensor kits for research and interactive projects. I-CubeX provides an API with support for various languages and environments such as C++, ActionScript and Max/Jitter, while the sensors can communicate directly with musical keyboard instruments using the MIDI interface. On the platform's website there are many application code examples and suggested sensor kits for a wide range of interactive application categories.
7. Interactive Software Development Frameworks
The last part of this document is a short presentation of various useful frameworks and toolkits for interactive application programming. Although many of the frameworks mentioned below share common elements, this list serves two purposes. The first is to cover frameworks written in different languages, so that readers can find one written in a language familiar to them, or one that better serves their project's requirements. The second is to encourage readers to visit and explore the websites of the tools mentioned, where previous work by very talented programmers and artists is showcased, often with source code available, making them a great source of inspiration for anyone interested in multimedia programming and visual arts.
Processing (Java based) is an open source programming language and environment focusing on graphics and interaction programming. Based on a very minimal environment, Processing was developed as a "software sketchbook" and a tool to teach fundamental computer programming for the visual arts. Processing was the first of a series of frameworks that appeared in recent years, wrapping a growing collection of standard libraries for graphics, image, video and audio manipulation, network libraries, physics engines and many more, and offering simplified interfaces to all these libraries so that they are easy to combine inside a program.
After the success of Processing, openFrameworks (C++) was released following the same concept, using C++ to deliver applications with better performance than Processing and access to native C++ libraries, and offering the ability to develop native applications for the iOS and Android mobile platforms. openFrameworks has built a very large support community and has been used successfully in everything from mobile apps to large and complex interactive installations. Beyond the basic standard libraries wrapped by openFrameworks, users are constantly expanding the list of add-on libraries and components, including libraries for tangible interfaces and physical interaction, like the TUIO and Touchlib libraries, and the OpenNI framework, which has already produced a few very interesting projects using the Kinect sensor. Cinder (C++) and Polycode (C++/Lua) are two other open source toolkits similar to openFrameworks.
Visual Programming Languages
Visual programming languages combine traditional coding with tools that allow the user to handle components as blocks on a canvas. Each block has some kind of input signal, and the code inside the block determines its output. In that way the user controls the flow of data inside a program by virtually wiring signals to the blocks' inputs and outputs. Apart from offering a clearer structure, through this visual schematic, to people with no programming background, visual programming languages also focus more on live, or run-time, coding, allowing the behavior of a block to be changed without recompiling the whole program.
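The underlying block-and-wire idea can be sketched in a few lines of ordinary code. This is only a conceptual illustration of the dataflow model, not the implementation of Max, PureData or any other tool: each block is an object with an input and an output, and "wiring" means feeding one block's output into the next block's input.

class Block:
    def __init__(self, func):
        self.func = func          # the code inside the block
        self.target = None        # the block this one is wired to

    def wire(self, other):
        self.target = other
        return other

    def send(self, value):
        out = self.func(value)    # the block computes its output
        if self.target:
            self.target.send(out) # pass it down the wire
        return out

# Three simple blocks: scale a raw sensor value, gate it, print the result.
scale = Block(lambda v: v / 1023.0)
gate = Block(lambda v: v if v > 0.5 else 0.0)
printer = Block(lambda v: print("output:", v))

scale.wire(gate).wire(printer)    # wiring the blocks on the "canvas"
scale.send(812)                   # output: 0.7937...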
The most popular visual programming languages are Max, developed by Cycling '74, and PureData, its free open source equivalent, developed by one of the initial developers of Max, Miller Puckette. Max and PureData have been particularly popular with musicians, since electronic music was one of the first fields to use digital technology and programming, and this logic of dataflow programming, wiring together different signals, effects and sensors, was something musicians were already familiar with from recording studios. Today both tools have a very large collection of patches and programming APIs to integrate different effects and sensors.
Isadora, developed by TroikaTronix, the software branch of Troika Ranch, a media-intensive dance company, is a visual programming language focusing mainly on the manipulation of video and audio for live performances, supporting up to six independent outputs and including a C++ SDK to develop custom filters and effects.
Field is a Python based open source toolkit developed by OpenEndedGroup, a team of artists with experience in interactive installations and in theatre and dance performances. Field includes a Processing plug-in which replaces the Processing IDE and through which all Processing libraries can be used in Field. A program written in Field can also include code in other programming languages, including languages that execute inside other applications such as Autodesk Maya and Adobe After Effects. Field supports only the Mac and Linux platforms.
VVVV is another newer visual programming toolkit, free for non-commercial use and compatible with the Windows platform only, using DirectX libraries and supporting programming in C#.
Quartz Composer is part of Apple's Xcode developer tools, for visual programming using the native libraries of Mac OS X.
Working with sensors
For working more specifically with sensors, signal processing and pattern recognition, the most popular applications offering both visual and traditional programming are LabVIEW, by National Instruments, and Simulink, developed by MathWorks.
BioMOBIUS is an open platform, developed by an open community of researchers and by the TRIL Centre, which allows researchers to rapidly develop sophisticated technology solutions for biomedical research. It was developed with the philosophy of providing a common technology platform comprising hardware, software, services and sensors. The BioMOBIUS development environment is based on EyesWeb and provides support for designing applications based on the Shimmer sensor platform.
Exemplar is an open source kit for programming prototypes using sensors, developed by Stanford University's Human Computer Interaction Group. Exemplar is a plug-in for the Eclipse IDE, offering a GUI through which it is possible to visually monitor live sensor signals and manipulate them.
ROS (Robot Operating System) is an open source project providing libraries and tools such as device drivers, message-passing middleware, computer vision libraries, and more, to support the creation of robot applications. Since robots are an ensemble of sensors and motors, ROS features could also support a project utilizing a network of autonomous sensor nodes. Among other sensors, ROS now includes drivers and libraries for the Kinect sensor, which is a perfect solution for computer vision in low-cost robot projects and has already been used with very interesting results.
A result of the combination of ROS with the Kinect sensor is the Point Cloud Library (PCL), a sister project of ROS, which includes state-of-the-art algorithms for 3D point cloud processing: filtering, feature estimation, surface reconstruction and registration, model fitting and segmentation.
Bibliography
Joshua Noble (2009). Programming Interactivity. Sebastopol (U.S.A.): O’Reilly Media
Dan O’Sullivan and Tom Igoe (2004). Physical Computing. Boston (U.S.A.): Thomson Course Technology
References
[1]: M. C. Johnson-‐Glenberg, D. Birchfield, P. Savvides, C. Megowan-‐Romanowicz. In: L. Annetta & S. Bronack (eds.) Serious Educational Game Assessment: Practical Methods and Models for Educational Games, Simulations and Virtual Worlds. pp. 225-‐241. Sense Publications, Rotterdam. 2010
[2]: Ramesh Raskar, Hideaki Nii, Bert deDecker, Yuki Hashimoto, Jay Summet, Dylan Moore, Yong Zhao, Jonathan Westhues, Paul Dietz, John Barnwell, Shree Nayar, Masahiko Inami, Philippe Bekaert, Michael Noland, Vlad Branzoi, and Erich Bruns. 2007. Prakash: lighting aware motion capture using photosensing markers and multiplexed illuminators. In ACM SIGGRAPH 2007 papers (SIGGRAPH '07). ACM, New York, NY, USA, Article 36.
[3]: Takaaki Shiratori, Hyun Soo Park, Leonid Sigal, Yaser Sheikh, Jessica K. Hodgins "Motion Capture from Body-‐Mounted Cameras" ACM Transactions on Graphics, Vol. 30, No. 4 (Proc. ACM SIGGRAPH 2011), July 2011
[4]: A. Laurentini (February 1994). "The visual hull concept for silhouette-based image understanding". IEEE Trans. Pattern Analysis and Machine Intelligence, pp. 150–162.
[5]: Corazza S., Mündermann L., Andriacchi T., A Framework For The Functional Identification Of Joint Centers Using Markerless Motion Capture, Validation For The Hip Joint, Journal of Biomechanics, 2007.
[6]: L. Xinghan, B. Berendsen, R.T. Tan, R.C. Veltkamp, Dept. of Inf. & Comput. Sci., Utrecht Univ., Utrecht, Netherlands. Human Pose Estimation for Multiple Persons Based on Volume Reconstruction. In: Proc. 2010 20th ICRP. IEEE, 2010, pp 3591-‐3594.
[7]: Rosenhahn, B., Brox, T., Kersting, U. G., Smith, A. W., Gurney, J. K., & Klette, R. (2006). A system for marker-‐less motion capture. Main, 1(1), 45-‐51. Citeseer.
[8]: http://www.eyewriter.org
[9]: J. Shotton, A. Fitzgibbon, M. Cook,T. Sharp, M. Finocchio, R. Moore,A. Kipman, A. Blake. Real-‐Time Human Pose Recognition in Parts from a Single Depth Image. Microsoft Research Cambridge, 2011.
[10]: R. W. Picard. Toward Machines with Emotional Intelligence. In: IEEE Transactions on Pattern Analysis and Machine Intelligence - Graph Algorithms and Computer Vision Journal, Vol. 23, 10, IEEE Computer Society, 2001, pp. 1175-1191.
[11]: O.A. Schipor, Ş.G. Pentiuc, M.D. Schipor. Towards a multimodal emotion recognition framework to be integrated in a computer based speech therapy system. In: The 6th International Conference on Speech Technology and Human-Computer Dialogue, 2011.
[12]: Rosalind W. Picard. Future affective technology for autism and emotion communication. Phil. Trans. R. Soc. B, December 12, 2009.
[13]: IRIS project. Integrate Research on Interactive Storytelling. http://iris.scm.tees.ac.uk/
[14]: Lennart E. Nacke. Directions in Physiological Game Evaluation and Interaction. In CHI 2011 BBI Workshop Proceedings, Vancouver, BC, Canada. 2011
[15]: Ekman, P, & Friesen, W. V. (1978). The facial action coding system: A technique for the measurement of facial movement. Palo Alto: Consulting Psychologists Press.
[16]: G. Castellano, L. Kessous, G. Caridakis. Emotion Recognition through Multiple Modalities: Face, Body Gesture, Speech. In: Affect and Emotion in Human-Computer Interaction, Springer Berlin / Heidelberg, 2008, pp. 92-103.
[17]: Rosalind W. Picard. Future affective technology for autism and emotion communication. Phil. Trans. R. Soc. B, December 12, 2009.
[18]: A. Batliner, D. Seppi, S. Steidl, B. Schuller. Segmenting into Adequate Units for Automatic Recognition of Emotion-‐Related Episodes: A Speech-‐Based Approach. In :Advances in Human-‐Computer Interaction Volume 2010 (2010)
[19] T. Vogt, E. André and N. Bee, "EmoVoice - A framework for online recognition of emotions from voice," in Proceedings of Workshop on Perception and Interactive Technologies for Speech-Based Systems, 2008.
[20] F. Eyben, M. Wöllmer, and B. Schuller. openEAR -‐ Introducing the Munich Open-‐Source Emotion and Affect Recognition Toolkit. In:Proc. 4th International HUMAINE Association Conference on Affective Computing and Intelligent Interaction 2009 (ACII 2009), Amsterdam, The Netherlands, volume I, pp. 576–581. IEEE, 2009. 10.-‐12.09.2009.
[21]: A. Mehrabian. Communication without words. Psychology Today, 2(4):53–56, 1968.
[22]: Fraunhofer Institute. Germany http://www.iis.fraunhofer.de/en/bf/bsy/produkte/shore/
[23]:Salah, A.A., N. Sebe, Th. Gevers, Communication and automatic interpretation of affect from facial expressions, in D. Gökçay & G. Yıldırım (eds.), Affective Computing and Interaction: Psychological, Cognitive and Neuroscientific Perspectives, to appear.
[24]: Rana el Kaliouby and Peter Robinson. Real-‐Time Inference of Complex Mental States from Facial Expressions and Head Gestures. In the IEEE International Workshop on Real Time Computer Vision for Human Computer Interaction at el Kaliouby, CVPR, 2004.
[25]:Intelligent Behaviour Understanding Group (iBUG), Department of Computing, Imperial College London http://ibug.doc.ic.ac.uk/resources/facial-tracker-2011/
[26]: Seeing Machines. FaceAPI http://www.seeingmachines.com/product/faceapi/
[27] open Computer Vision Library. http://opencv.org
[28] Coulson, M. (2004) 'Attributing Emotion To Static Body Postures: Recognition Accuracy, Confusions, And Viewpoint Dependence.' Journal of Nonverbal Behavior 28 (2) 117-139
[29] Kleinsmith A., and Bianchi-‐Berthouze N., Recognizing affective dimensions from body posture, In: Proc. 2nd Intl Conf of ACII, LNCS 4738, Portugal, pp. 48-‐58, 2007
[30] A. Metallinou , A. Katsamanis, Wang Yun, S.Narayanan. Tracking changes in continuous emotion states using body language and prosodic cues. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE. Prague. 2011. pp 2288-2291
[31] Riskind, J.H., and Gotay, C.C.: Physical posture: Could it have regulatory or feedback effects on motivation and emotion? Motivation and Emotion 6(3) (1982).pp 273–298
[32] N. Bianchi-Berthouze, P. Cairns, A. Cox, C. Jennett, W. Kim. On posture as a modality for expressing and recognizing emotions. Emotion and HCI workshop at BCS HCI London, September 2006.
[33] A. Camurri, B. Mazzarino, G. Volpe. Analysis of Expressive Gesture: The EyesWeb Expressive Gesture Processing Library In : GESTURE-‐BASED COMMUNICATION IN HUMAN-‐COMPUTER INTERACTION Lecture Notes in Computer Science, 2004, Volume 2915/2004, 469-‐470
[34] Timo Partala and Veikko Surakka. 2003. Pupil size variation as an indication of affective processing. Int. J. Hum.-Comput. Stud. 59, 1-‐2 (July 2003), 185-‐198.
[35] OpenEEG. http://openeeg.sourceforge.net/doc/
[36] Erin Treacy Solovey, Audrey Girouard, Krysta Chauncey, Leanne M. Hirshfield, Angelo Sassaroli, Feng Zheng, Sergio Fantini, and Robert J.K. Jacob. 2009. Using fNIRS brain sensing in realistic HCI settings: experiments and guidelines. In Proceedings of the 22nd annual ACM symposium on User interface software and technology (UIST '09). ACM, New York, NY, USA, 157-‐166.
[37] O. A. Schipor, S. G. Pentiuc, M. D. Schipor. Towards a multimodal emotion recognition framework to be integrated in a Computer Based Speech Therapy System. In: 6th Conference on Speech Technology and Human-‐Computer Dialogue (SpeD), IEEE.Brasov.Romania.2011. pp 1-‐6.
[38] Eija Haapalainen, SeungJun Kim, Jodi F. Forlizzi, and Anind K. Dey. 2010. Psycho-‐physiological measures for assessing cognitive load. In Proceedings of the 12th ACM international conference on Ubiquitous computing (Ubicomp '10). ACM, New York, NY, USA, 301-‐310.
[39] Bertoncini, M. and Cavazza, M., 2007. Emotional Multimodal Interfaces for Digital Media: The CALLAS Challenge. Proceedings of HCI International 2007.
[40] Marc Schröder. The SEMAINE API: Towards a Standards-Based Framework for Building Emotion-Oriented Systems. In: Advances in Human-Computer Interaction, Volume 2010 (2010), Article ID 319406, 21 pages.
[41] J. Wagner, F. Lingenfelser, and E. André, "The Social Signal Interpretation Framework (SSI) for Real Time Signal Processing and Recognition," in Proceedings of INTERSPEECH 2011, Florence, Italy, 2011.
[42] F. Lavagetto and R. Pockaj, "The Facial Animation Engine: towards a high-‐level interface for the design of MPEG-‐4 compliant animated faces", IEEE Trans. on Circuits and Systems for Video Technology, Vol. 9, n.2, March 1999, pp.277-‐289
[43] MPEG-‐V (Information Exchange with Virtual Worlds) http://mpeg.chiariglione.org/working_documents.htm#MPEG-‐V
http://www.metaverse1.org/
[44] EMMA: Extensible MultiModal Annotation markup language W3C Recommendation 10 February 2009 http://www.w3.org/TR/emma/
[45] Emotion Markup Language (EmotionML) 1.0. W3C Working Draft 7 April 2011 http://www.w3.org/TR/emotionml/
[46] : Marco Grassi. 2009. Developing HEO human emotions ontology. In Proceedings of the 2009 joint COST 2101 and 2102 international conference on Biometric ID management and multimodal communication (BioID_MultiComm'09), Julian Fierrez, Javier Ortega-‐Garcia, Anna Esposito, Andrzej Drygajlo, and Marcos Faundez-‐Zanuy (Eds.). Springer-‐Verlag, Berlin, Heidelberg, 244-‐251.
[47] S. Kopp, B. Krenn, S. Marsella, et al., “Towards a common framework for multimodal generation: the behavior markup language,” in Proceedings of the 6th International Conference on Intelligent Virtual Agents (IVA ’06), vol. 4133 of Lecture Notes in Computer Science, pp. 205–217, 2006.
[48] HUMAINE. http://emotion-‐research.net/