
ORIGINAL ARTICLE

GeeAir: a universal multimodal remote control device for home appliances

Gang Pan · Jiahui Wu · Daqing Zhang · Zhaohui Wu · Yingchun Yang · Shijian Li

Received: 1 June 2009 / Accepted: 22 October 2009 / Published online: 10 March 2010
© Springer-Verlag London Limited 2010
Pers Ubiquit Comput (2010) 14:723-735. DOI 10.1007/s00779-010-0287-7

Abstract  In this paper, we present a handheld device called GeeAir for remotely controlling home appliances via a mixed modality of speech, gesture, joystick, button, and light. This solution is superior to existing universal remote controllers in that it can be used by users with physical and vision impairments in a natural manner. By combining diverse interaction techniques in a single device, GeeAir enables different user groups to control home appliances effectively, satisfying even the unmet needs of physically and vision-impaired users while maintaining high usability and reliability. The experiments demonstrate that the GeeAir prototype achieves prominent performance by standardizing a small set of verbal and gesture commands and introducing feedback mechanisms.

Keywords  Universal remote controller · Gesture recognition · Speech recognition · Smart home

G. Pan (✉) · J. Wu · Z. Wu · Y. Yang · S. Li (✉)
Department of Computer Science, Zhejiang University, Zhejiang, China
e-mail: [email protected]

D. Zhang
Handicom Lab, Institut TELECOM SudParis, Evry, France
e-mail: [email protected]

1 Introduction

Nowadays, it is almost impossible for home inhabitants to go for a day without interacting with home appliances. Although remote control of home appliances such as TVs, DVD players, windows, lights, etc. serves ordinary people well with acceptable physical or emotional comfort, it can do even more for the dignity, security, and well-being of elderly or disabled people [1]. One can imagine a situation where a person has lost some of his/her physical dexterity or mobility. In the absence of suitable controls, he/she would need a caregiver to assist with the operation of home appliances, with the attendant expense and loss of independence and privacy. But with adequate assistance, this person might be able to live independently in his/her own home.

Current home appliances are often equipped with remote controllers operating via infrared (IR) light signals. Each household is likely to own several remote controllers, which are often incompatible with each other and have different layouts. In order to reduce the number of remote controls, universal remote controllers (URCs) were introduced to merge the functions of individual controllers into one device [2-5]. A URC learns IR command sets from each appliance and operates the appliance selected by a user. There are two fundamental steps involved in the control procedure of a URC: target object selection and command issuing. To select a target object for operation, a user might press a button, turn a rotary wheel, or touch an icon, depending on how the panel of the URC is designed.

To issue a command, a user needs to point the controller at the target appliance and press a specific button on the controller. Subsequently, the controller emits the infrared signal to the selected appliance for the specified operation.

Although URCs combine the functions of remote controllers into one device, elderly and disabled home users may still have difficulties in using a URC for a number of reasons. First, a URC has too many buttons that need to be remembered, and several button presses may be needed to achieve a simple function. Second, the buttons on a URC may be too small for elderly, physically disabled, and vision-impaired people to use. Finally, button operation is just one modality for interacting with home appliances, and it may not be the most natural and efficient means of human-machine interaction.

Speech and gesture are two natural ways that people interact with each other. Much research has been done on using speech, gesture, or eye gaze to control home appliances. However, only limited success is reported in the literature on the deployment of these modalities, due to the constraints of each single modality. Controlling through spoken language or oral commands is indeed straightforward for expressing intentions, but the single modality of speech has the following limitations in real implementations. First, accurately extracting and recognizing control commands from daily continuous speech is still difficult due to the ambiguities of natural language, especially in noisy environments. Second, speech is not instant: some commands need complex phrases or sentences, which may take a long time to utter and process.

Using the single modality of gesture to control home appliances has also been explored. Since computer-vision-based gesture and eye-gaze control is highly dependent on the lighting conditions and camera facing angle, it turns out to be rather difficult to accurately recognize gestures under poor lighting conditions using a camera-based system. In addition, it is uncomfortable and inconvenient if the user is required to face the camera directly to complete a gesture. Different from the vision-based gesture recognition approach, accelerometer-based gesture interaction is an emerging technique that exploits the acceleration data of hand motion for recognition and control. No camera is required, only a wearable or portable accelerometer-equipped device used in daily life, such as a watch, a smart phone, or an MP3 player. These wireless-enabled portable/wearable devices provide new possibilities for interacting with a wide range of home appliances such as doors, window curtains, TVs, etc.

In this paper, we present a universal multimodal remote control device which unifies several interaction modalities, such as speech, gesture, button, joystick, and light, so that home inhabitants ranging from common users to elderly, physically disabled, and vision-impaired people are all able to interact with home appliances in the way they feel comfortable. Specifically, we develop a universal multimodal remote controller, called GeeAir, which not only provides comfort and convenience for common users in controlling home appliances, but also meets the special needs of physically and vision-impaired people in operating home appliances so that they can live independently and enjoy a better quality of life.

The paper is organized as follows. First, the related work on universal remote controllers and multimodal control systems is summarized in Sect. 2. Then an overview of the GeeAir system architecture is presented in Sect. 3. In Sect. 4, the key techniques for selecting the desired target appliance are described, followed by the introduction of feedback mechanisms ensuring reliable confirmation. Section 5 proposes a standard set of hand gestures for operating different home appliances and a novel algorithm for accelerometer-based gesture recognition. Section 6 reports the implementation details and the experimental results of the speech/gesture recognition algorithms compared to other existing algorithms. An initial evaluation of the GeeAir prototype with 10 users is also given in this section. Finally, we provide our conclusions on the design and testing of GeeAir and highlight some future research directions in Sect. 7.

2 Related work

In the consumer electronics market, several universal remote control products can be found in home electronics stores. These products can be roughly categorized into two groups, according to how the target appliance is selected: button-based URCs and screen-based URCs. The former group allocates a few buttons on the control panel of the URC for appliance selection, where one button corresponds to one appliance. For example, the Philips 4-in-1 URC has four buttons reserved on the panel to control TV/VCR/DVD/SAT, respectively; users select one of the four appliances by pressing the corresponding button [2]. Since the number of buttons on a URC control panel is fixed, the extensibility of button-based URCs is limited. Screen-based URCs overcome this limitation by putting a built-in mini-screen and a navigation button on the control panel. When users press the navigation button, the mini-screen shows the selectable home appliances one after another; when the target appliance appears on the screen, the user completes the device selection by releasing the button [3-5]. Apparently, both kinds of URCs support only button pressing as the single input modality, so people with limited motor skills, finger dexterity, or weak vision might not be able to use these remote controls.

In parallel to the efforts of consumer electronics manufacturers in developing universal remote controllers, there has been a lot of research on universal GUIs to enable mobile devices for home appliance control. Different approaches have been proposed to generate a universal graphical user interface on various mobile platforms [6, 7]. All those solutions assume that users can navigate the GUI on the tiny screen of a mobile device with a pen or button. Thus, they support only a single input modality and consequently cannot meet the needs of elders and those with certain physical or vision impairments.

Compared to single-modality solutions, multimodal control systems combine the strengths of multiple modalities, and thus increase the applicability and usability of human-machine interaction. To meet the different requirements of varied users and applications, various combinations of input and output modalities have been explored in previous projects. For example, the seminal work by Bolt [8] created the "Put-That-There" system, where people can use a pointing gesture to select an object from a virtual diagram of a room shown on a large-screen display and subsequently use speech to operate on the selected object. The EU HOME-AOM project [9, 10] applied the mixed modality of speech, gesture, and GUI to home appliance control for disabled people, in which speech and gesture were used to assist in the navigation of GUI commands. GWindows [11] operated Microsoft Windows applications by using speech to move/close/minimize/maximize/scroll and using motion gestures to determine the movement distance. Krum et al. [12] implemented a system that helps users navigate a whole-earth 3D visualization environment at a distance from the display; it employs the Gesture Pendant [13] for tracking simple hand motions and utilizes speech for navigation commands. Different from those projects, our work intends to provide a single, multimodal control device for a wider range of home users, including elders and those with physical or vision impairments besides ordinary users. Our solution supports a mixed modality of speech, gesture, button, joystick, and light as input and output, adapting to the different needs and interaction preferences of various user groups. In addition, we use an accelerometer-based gesture recognition approach instead of the camera-based one used previously, which allows users to move freely in a ubiquitous home environment and control the home appliances in any lighting condition.

The closest research to our work is by Kela et al. [14], who used several modalities to interact with a design studio environment. The modalities explored include speech input and output, gesture input, RFID tags, a laser-tracked pen, and a mobile device with a touch screen. Our work differs from theirs in the following aspects:

(1) While Kela et al.'s work uses diverse modalities in a studio environment, they deploy multiple devices to control multiple applications, whereas we focus on building a handy, single multimodal device for controlling multiple home appliances.

(2) Kela et al.'s work takes the design studio as the application environment, designers as the user group, and convenience and comfort as the design goal. Instead, our research aims at a different, and actually larger, user group. We not only provide ordinary home inhabitants with convenience and comfort, but also serve elders and those with physical and vision impairments. For example, we provide the joystick as one input modality, which is very useful for people with hand disabilities.

(3) In order to ensure the reliability and robustness of the multimodal remote controller for elders and disabled people, we introduce voice and light as feedback, so that the desired control object can be reliably identified even if speech recognition is not 100% accurate. In our GeeAir solution, users are allowed to use speech or the joystick to select a target appliance for operation and to use voice and light to get feedback. Such a solution can satisfy the needs of user groups with impairments in speaking, hearing, vision, and hand use.

(4) Although we also use an accelerometer-based approach for gesture control as Kela et al. did, we developed a novel and very different algorithm [15] which is more accurate than the algorithm used in Ref. [14]. While they adopted an HMM (hidden Markov model)-based approach for gesture recognition and processed the acceleration data in the time domain without feature extraction, we process the data in the frequency domain with feature extraction to reduce the noise and variation of the gesture data, thus significantly improving the recognition performance.

3 GeeAir: an overview

The design goal of GeeAir is to be a single universal remote controller which serves not only common users but also physically disabled and vision-impaired people. In the home environment illustrated in Fig. 1, GeeAir first takes inputs from the user to select a target appliance and then recognizes the user's predefined hand gestures to control the selected appliance. As described before, the mixed modalities of speech, joystick, light, and button are used for selecting a desired target appliance. In order to avoid any potential error during the selection, two feedback mechanisms are introduced in the GeeAir design: lighting feedback and voice echo.

The look and feel of the GeeAir prototype is shown in Fig. 2; the design is borrowed from the Nintendo Nunchuk. The key components of GeeAir and their functionalities are described as follows:

(1) A built-in three-axis accelerometer: to capture the user's 3-D hand gesture signals.
(2) An eight-orientation joystick: to select a target appliance efficiently.
(3) A built-in microphone: to acquire the user's speech commands.
(4) A speaker: to provide users with voice feedback and reminders.
(5) Buttons A and B: used to mark the beginning and end of speech and gesture commands. The two buttons are designed in different sizes and shapes in order to help users differentiate them by touch.
(6) A built-in digital signal processing unit: to handle the computation involved in processing the multimodal inputs and outputs.
(7) A built-in communication unit: to send and receive wireless signals.

Fig. 1  Illustration of GeeAir for remote control of home appliances

Fig. 2  Conceptual illustration of GeeAir's components for multimodal control. a A three-axis accelerometer, joystick, microphone, speaker, and two buttons are built into GeeAir; b the two buttons (Button A and Button B) in the front view of GeeAir

The workflow of using GeeAir consists of three main stages: appliance selection, feedback and confirmation, and operation command issuing, as shown in Fig. 3. At any moment, GeeAir has a current appliance for operation, indicated by the light signal or voice reminder. If a user intends to control an appliance other than the current one, he/she needs to select the desired one via the joystick or by speaking the target appliance's name. If speech is used, GeeAir obtains the name of the target appliance with speech recognition. The feedback for appliance selection has two options: a light signal (a controllable light attached to each appliance) and voice echo, which help users correct occasional errors in the speech recognition of the target appliance name. If the current appliance is exactly the one that the user wants to operate, the user can wave the GeeAir in the air for the follow-up operations. The gesture is then recognized by GeeAir and the corresponding command is issued to the current appliance wirelessly.
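To make the three-stage workflow concrete, the following minimal sketch shows how appliance selection, feedback, and command issuing could be sequenced. It is an illustration only: the callbacks (recognize_speech, recognize_gesture, send_command, give_feedback) and the event format are hypothetical placeholders, not part of GeeAir's actual firmware.

```python
# Minimal sketch of the three-stage GeeAir workflow (appliance selection,
# feedback/confirmation, operation command issuing). All callbacks are
# hypothetical placeholders, not a real GeeAir API.

APPLIANCES = ["television", "dvd", "radio", "speaker",
              "air conditioner", "lamp", "curtain"]

def control_loop(events, recognize_speech, recognize_gesture,
                 send_command, give_feedback, current=0):
    """events: iterable of (kind, payload) tuples coming from the handheld.

    kind == "speech"   : Button-A utterance, payload is the recorded audio
    kind == "joystick" : payload is +1 (clockwise) or -1 (counter-clockwise)
    kind == "gesture"  : Button-B motion, payload is the acceleration data
    """
    for kind, payload in events:
        if kind == "speech":
            name = recognize_speech(payload)            # isolated-word recognizer
            if name in APPLIANCES:
                current = APPLIANCES.index(name)
                give_feedback(APPLIANCES[current])      # voice echo / signal light
        elif kind == "joystick":
            current = (current + payload) % len(APPLIANCES)
            give_feedback(APPLIANCES[current])
        elif kind == "gesture":
            command = recognize_gesture(payload)        # e.g. "up", "down", "v"
            send_command(APPLIANCES[current], command)  # issued wirelessly
    return current
```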

4 Multimodal selection of a target appliance

4.1 Selecting via speech commands

Speech is one of the most natural ways for humans and machines to interact. However, for home appliance control it is still a great challenge to robustly extract and recognize control commands in real-life environments using user-independent, large-vocabulary continuous speech recognition technology. In contrast, small-vocabulary recognition of isolated words is quite reliable and accurate, as verified by many successful practical applications.

GeeAir provides the option of selecting a target appliance via speech commands. GeeAir records the user's utterance through the equipped microphone and then recognizes the appliance name. In this case, the vocabulary to be recognized is small because the number of home appliances is limited and their names are relatively fixed. In order to avoid having to segment the appliance name from a natural utterance, users are asked to press Button A on GeeAir before speaking the appliance name for object selection, and to release the button after speaking the appliance name.

For isolated word recognition, the commonly used techniques include VQ (vector quantization), DTW (dynamic time warping), and HMM (hidden Markov model) [16, 17]. For GeeAir, we build an isolated word recognition system based on the continuous density hidden Markov model (CDHMM) [18]. The whole recognition process consists of the following steps:

(1) Defining the lexicon: recording the words to be recognized by the system. Each word is recorded several times by each participant.
(2) Feature extraction: the MFCC (Mel frequency cepstrum coefficient) feature vectors [19] are computed, together with their first derivatives.
(3) Modeling words: for each word in the lexicon, a left-to-right CDHMM is built with a number of states. Each state is characterized by a Gaussian mixture model (GMM).
(4) Training the models: the parameters of the GMM distributions and the state transition probabilities within the CDHMMs are estimated using the Baum-Welch algorithm [17].
(5) Recognizing a word: first, we compute the observations (feature vectors) of the word, and then the probability of generating these observations is evaluated for each word's CDHMM using the Viterbi algorithm. The word is recognized as the one whose model has the highest probability.
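The authors' recognizer is built on HTK (Sect. 6.1.2: 3-state left-to-right CDHMMs, 8-component Gaussian mixtures, 13 MFCCs plus first derivatives). As a rough illustration of the same train-and-score pattern, the sketch below uses the open-source hmmlearn and python_speech_features packages as stand-ins; it does not enforce the left-to-right topology and scores with the forward likelihood rather than the Viterbi path, so it is an approximation of the described pipeline, not the authors' implementation.

```python
# Sketch of CDHMM-based isolated-word recognition, using open-source
# stand-ins (hmmlearn, python_speech_features) for the HTK toolchain.
import numpy as np
from hmmlearn.hmm import GMMHMM
from python_speech_features import mfcc, delta

def features(signal, rate=16000):
    """13 MFCCs plus first derivatives: 32 ms window, 16 ms step (Sect. 6.1.2)."""
    c = mfcc(signal, samplerate=rate, winlen=0.032, winstep=0.016,
             numcep=13, nfft=512)
    return np.hstack([c, delta(c, 2)])              # 26-dimensional vectors

def train_lexicon(recordings):
    """recordings: {word: [waveform, ...]} -> {word: trained CDHMM}."""
    models = {}
    for word, waves in recordings.items():
        feats = [features(w) for w in waves]
        X, lengths = np.vstack(feats), [len(f) for f in feats]
        m = GMMHMM(n_components=3, n_mix=8, covariance_type="diag", n_iter=6)
        models[word] = m.fit(X, lengths)            # Baum-Welch re-estimation
    return models

def recognize(models, signal):
    """Return the word whose model assigns the highest log-likelihood."""
    obs = features(signal)
    return max(models, key=lambda w: models[w].score(obs))
```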

4.2 Selecting via joystick

The second modality GeeAir provides for selecting a target appliance is the built-in joystick. The joystick is a traditional input device for machine control in trucks, CT scanners, and video games. It outperforms buttons in navigation because of its continuity, fast reaction, and the nearly absent relative movement between the hand and the stick during control. Thus, the joystick is a good choice for selecting objects that are arranged around the user in physical space.

The operation principle of the joystick is illustrated in Fig. 4. The accessible area is octagonal. Two states are defined for joystick operation: inactive and active. The inactive state indicates that the joystick is not pushed and stays in the middle of the octagon; the active state indicates that the joystick is pushed to the edge of the octagon at some angle. The eight valid joystick positions are: north, northeast, east, southeast, south, southwest, west, and northwest. Each position occupies 45 degrees.

Fig. 3  Workflow of GeeAir: begin → select a target appliance (rotate the joystick or speak its name) → feedback (signal light or voice echo) → if wrong, reselect; if right, operate the current appliance with gesture and command issuing → continue operating the current appliance?

A user can move the joystick along the octagon to select appliances in physical space. Intuitively, an octagonal joystick could be statically matched to eight appliances. However, to select the target appliance from the differing number of appliances in each household, GeeAir exploits a rule of dynamic, relative association between positions and appliances: a valid position is not necessarily associated with a fixed device. When a user intends to select an appliance, the initial position to which he/she first pushes the joystick is dynamically associated with the currently selected appliance. As the user rotates the joystick to a neighboring position, the current appliance also shifts to its neighboring appliance. Whether the nearest appliance to the left or to the right is selected depends on the user's rotating direction, i.e., counter-clockwise or clockwise. The dynamic association ensures flexibility when the number of appliances varies. Thus, any number of appliances can easily be navigated using the joystick.
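The dynamic, relative association amounts to a small amount of state: the octant first touched is bound to the currently selected appliance, and each step clockwise or counter-clockwise shifts the selection by one. A minimal sketch follows (a hypothetical class for illustration, not GeeAir firmware; the 45-degree octant mapping and the inactive/active threshold follow Fig. 4, while the dead-zone radius is an assumption).

```python
import math

class JoystickSelector:
    """Relative (dynamic) mapping of the eight joystick octants to an
    arbitrary-length appliance list, as described in Sect. 4.2."""

    def __init__(self, appliances, current=0):
        self.appliances = appliances
        self.current = current          # index of the selected appliance
        self.last_octant = None         # None while the stick is inactive

    @staticmethod
    def octant(x, y):
        """Map an active stick position to one of 8 positions (0..7), each
        covering 45 degrees; return None when the stick is near the centre."""
        if x * x + y * y < 0.5 ** 2:    # inactive: stick inside the dead zone
            return None
        angle = math.degrees(math.atan2(y, x)) % 360
        return int(((angle + 22.5) % 360) // 45)

    def update(self, x, y):
        """Feed a raw stick sample; return the currently selected appliance."""
        o = self.octant(x, y)
        if o is None:                                   # released: drop the binding
            self.last_octant = None
        elif self.last_octant is None:                  # first push: bind this octant
            self.last_octant = o                        # to the current appliance
        elif o != self.last_octant:                     # rotated to a neighbour
            step = 1 if (o - self.last_octant) % 8 == 1 else -1
            self.current = (self.current + step) % len(self.appliances)
            self.last_octant = o
        return self.appliances[self.current]
```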

4.3 Feedback mechanism

GeeAir has two kinds of feedback mechanisms available for confirmation purposes: voice echo and signal light. GeeAir has a built-in mini-speaker, which can replay the name of the appliance when it is selected by either speech or joystick. The voice echo informs the user whether the object recognized by the system is the one the user intended to select. If a controllable LED light is attached to each appliance, the lights can be used as feedback: the red LED light of the selected appliance is turned on for user confirmation while the other lights are kept off.

For joystick-based appliance selection, the light feedback occurs immediately as soon as the joystick changes position; that is, when the joystick moves from one position to another, the light signal also shifts from one appliance to the next. The instant lighting during joystick rotation is very helpful to the user because of the quick response of joystick operations. However, the voice echo cannot be produced for every covered position if the joystick rotates too fast, because there is not enough time to play it. For this reason, GeeAir sets a movement speed limit of one position per second for the voice echo: if the joystick stays in a position for less than one second, the voice echo of the appliance dynamically associated with that position is suppressed. Any voice echo can be interrupted by rotating the joystick to the next position when the user knows that the current one is not the desired one, which helps users speed up the selection process.
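The one-position-per-second rule amounts to a simple rate limiter: an echo is scheduled when the joystick settles on a position and cancelled if the selection changes within the dwell threshold. A hedged sketch, assuming hypothetical speak()/stop_speaking() audio hooks:

```python
import time

ECHO_DWELL = 1.0   # seconds a position must be held before it is echoed

class EchoFeedback:
    """Voice-echo rate limiting as described in Sect. 4.3: positions passed
    over in less than ECHO_DWELL seconds are silently skipped, and an
    ongoing echo is interrupted as soon as the selection moves on."""

    def __init__(self, speak, stop_speaking):
        self.speak = speak                  # hypothetical playback callbacks
        self.stop_speaking = stop_speaking
        self.pending = None                 # (appliance, time_selected)

    def on_selection(self, appliance):
        self.stop_speaking()                # interrupt any echo in progress
        self.pending = (appliance, time.monotonic())

    def tick(self):
        """Call periodically; fires the echo once the dwell time has passed."""
        if self.pending:
            appliance, t0 = self.pending
            if time.monotonic() - t0 >= ECHO_DWELL:
                self.speak(appliance)
                self.pending = None
```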

With the feedback mechanisms, if the user finds that the recognized object is not the desired one, he/she can correct it immediately by repeating the appliance selection. Thus, issuing a command to the wrong appliance can be avoided. Either of the two feedback mechanisms can be combined with either of the two selection schemes introduced previously, i.e., there are four combinations available: speech-voice, speech-light, joystick-voice, and joystick-light.

Both feedback modalities, voice and light, are suitable for motor-impaired people, and they also free users from reading on-screen prompts. The voice-based feedback is suitable for anyone with normal hearing. Although the signal light requires the user's vision, recognizing the binary states of a light, ON and OFF, is less demanding than reading the semantic information in text or pictures on a screen.

Fig. 4  Octagonal accessible area of the joystick. Each position covers 45 degrees. The joystick can be rotated either clockwise or counter-clockwise to change the position

5 Operating an appliance via gesture

After the target appliance is selected, GeeAir uses gesture commands to operate it. Gestures performed with GeeAir are recognized based on acceleration data acquired by the built-in three-axis accelerometer [15]. Compared to camera-based gesture recognition techniques [20], accelerometer-based gesture recognition does not rely on lighting conditions or the camera facing angle, and it does not require any deployment of devices in the environment. Similar to issuing speech commands, users begin a gesture by pressing Button B and end it by releasing the button, avoiding the accuracy degradation caused by gesture segmentation.

5.1 Gesture command definition

In order to enable effective gesture-based interaction, several requirements must be met when designing a set of gesture commands for home appliances:

(1) The semantic connection between gestures and commands should be natural, so that the meaning of a gesture is easy for users to learn and remember.
(2) Gestures should be simple and terse, avoiding those that require high precision over a long period of time. Moreover, they should be quick to perform and repeat, without causing fatigue over time.

(3) The gesture commands for different appliances should be consistent, i.e., similar operations of different appliances should be defined as the same gesture, to reduce the size of the gesture vocabulary that users have to learn.

Usually there are two different approaches to gesture command definition: user-dependent and user-independent. Previous work focuses more on user-dependent gesture recognition [21-23], where each user is required to perform a couple of gestures as training/template samples before using the system. In this case, users are requested to personalize a remote controller by mapping each operation to a gesture they find suitable and comfortable. However, the training process is still a burden for users, although some work [23, 24] has been done on optimizing recognition algorithms to reduce the size of the training sample set. GeeAir aims at user-independent gesture recognition and control: different users share a common set of gesture commands and do not need to train GeeAir person by person.

In this paper, we define a nine-gesture vocabulary to control the frequently used functions of seven categories of home appliances, as listed in Table 1. The Forward-Backward gesture is performed in the XY plane, and the other eight gestures are waved in the YZ plane.

(1) The Forward-Backward gesture is performed as if pushing an ON/OFF switch button on the control panel of an electronic appliance.
(2) The swinging gestures Up and Down are very natural for expressing the meaning of up and down, e.g. volume up/down or temperature up/down.
(3) Similarly, the two gestures Left and Right naturally represent the meaning of previous and next.
(4) The gestures Double-Left and Double-Right, denoting a fast move toward the left or right, suggest fast backward/fast forward to users.
(5) The gesture of the letter V, implying a tick or rising up, suggests a Play operation. Additionally, we follow the tradition that most current players use the same button for the Play and Pause operations.
(6) The Inverted-V gesture implies a decreasing trend, which we define as a Stop operation.

Note, however, that Up/Down and Double-Left/Double-Right are continuous commands rather than instant ones; for example, modulating the volume or adjusting curtains is a continuous operation. In order to avoid the user having to perform the same gesture repeatedly, when such a command is recognized GeeAir continuously issues the command at a certain interval until the user presses Button B or the setting reaches its maximum.
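The repeat-until-released behaviour for continuous commands can be sketched as a loop that keeps re-issuing the recognized command at a fixed interval until Button B is pressed again or a maximum count is reached. The interval, the maximum, and the helper callbacks below are assumptions for illustration, not values taken from the paper.

```python
import time

CONTINUOUS = {"up", "down", "double-left", "double-right"}

def issue(command, send, button_b_pressed, interval=0.5, max_repeats=20):
    """Send a recognized gesture command to the current appliance.

    Instant commands are sent once; continuous ones (volume, temperature,
    curtain adjustment, fast forward/backward) are repeated every `interval`
    seconds until Button B is pressed or `max_repeats` is reached.
    `send` and `button_b_pressed` are hypothetical device callbacks.
    """
    send(command)
    if command not in CONTINUOUS:
        return
    for _ in range(max_repeats - 1):
        if button_b_pressed():          # user stops the continuous operation
            break
        time.sleep(interval)
        send(command)
```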

5.2 Gesture recognition with FDSVM

GeeAir employs the FDSVM algorithm [15], proposed by the authors, to recognize gesture commands from acceleration data. FDSVM uses a frame-based descriptor to compactly represent a gesture, which reduces the noise and variation of the gesture data and thus significantly improves gesture recognition performance.

Table 1  Definition of gesture commands for appliances

Appliance       | Forward-backward | Up; Down             | Left; Right                 | Double-left; Double-right | V; Inverted-V
Television      | ON/OFF           | Vol. up; Vol. down   | Prev. channel; Next channel |                           |
DVD             | ON/OFF           | Vol. up; Vol. down   | Prev. track; Next track     | F Forward; F Backward     | Play/pause; Stop
Radio           | ON/OFF           | Vol. up; Vol. down   | Prev. channel; Next channel |                           |
Speaker         | ON/OFF           | Vol. up; Vol. down   |                             |                           |
Air conditioner | ON/OFF           | Temp. up; Temp. down |                             |                           |
Lamp            | ON/OFF           | Brtn. up; Brtn. down |                             |                           |
Curtain         | Open/Close       | Curt. up; Curt. down |                             |                           |

Vol, volume; F Forward, fast forward; F Backward, fast backward; Temp, temperature; Brtn, brightness; Curt, curtain

The FDSVM system has two main phases, training and recognizing, and four components: acceleration data acquisition, feature extraction, SVM training, and recognition by SVM, as shown in Fig. 5. The first two components are shared by the training and recognizing phases.

5.2.1 Feature extraction: frame-based gesture descriptor

The three-axis accelerometer built into GeeAir discretely senses the gestural acceleration along three spatially orthogonal axes. We denote a gesture command as

G = (a_x, a_y, a_z)

where a_x, a_y, a_z are the acceleration sequences from the three axes. We divide a gesture into N + 1 segments of identical length, and every two adjacent segments make up a frame with a segment-length overlap, as illustrated in Fig. 6.

We employ five features in both the frequency and spatial domains to characterize each frame.

In the frequency domain (discrete Fourier transform (DFT) on each frame per axis):

(1) mean: the DC component over the frame;
(2) energy: the sum of the squared DFT component magnitudes excluding the DC component, divided by the number of components for normalization;
(3) entropy: the normalized information entropy of the DFT component magnitudes with the DC component excluded.

In the spatial domain:

(4) standard deviation: indicates the amplitude variability of a gesture;
(5) correlation among the axes: indicates the strength of the linear relationship between each pair of axes.

We combine all the features extracted as described above to form a feature vector s, which represents the gesture command itself. Considering 5 features per frame per axis, 3 axes, and N frames per gesture, the dimension of the feature vector is d = 5 × 3 × N = 15N.
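A minimal NumPy sketch of the frame-based descriptor follows: the gesture is cut into N + 1 equal segments, adjacent segments form N half-overlapping frames, and each frame contributes the three frequency-domain features per axis, the per-axis standard deviation, and the three pairwise axis correlations, giving 15 values per frame (15N in total). The exact normalizations used in FDSVM [15] may differ; this is an illustration, not the authors' code.

```python
import numpy as np

def frame_descriptor(gesture, n_frames=5):
    """gesture: (T, 3) array of acceleration samples (ax, ay, az).
    Returns a 15*n_frames feature vector built from overlapping frames."""
    segments = np.array_split(gesture, n_frames + 1)      # N+1 equal segments
    features = []
    for i in range(n_frames):                             # two adjacent segments
        frame = np.vstack([segments[i], segments[i + 1]]) # overlap by one segment
        for axis in range(3):
            x = frame[:, axis]
            spec = np.abs(np.fft.rfft(x))
            rest = spec[1:]                                # DFT magnitudes, no DC
            mean = x.mean()                                # DC component (mean)
            energy = np.sum(rest ** 2) / len(rest)         # normalized energy
            p = rest / (np.sum(rest) + 1e-12)              # spectral distribution
            entropy = -np.sum(p * np.log2(p + 1e-12))      # spectral entropy
            features += [mean, energy, entropy, np.std(x)]
        for a, b in [(0, 1), (1, 2), (0, 2)]:              # pairwise correlations
            features.append(np.corrcoef(frame[:, a], frame[:, b])[0, 1])
    return np.asarray(features)
```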

5.2.2 Gesture classification: multiclass SVM

Suppose there are two types of gestures, GTR1 and GTR2, to be classified. We denote the training set with n samples as

{(s_i, g_i)}, i = 1, ..., n

where s_i ∈ R^d represents the feature vector of a gesture command and

g_i = +1 if s_i belongs to GTR1, and g_i = -1 if s_i belongs to GTR2.

The separating plane, written as

w · s + b = 0,

can be obtained by solving a dual convex quadratic programming problem [25].

The extension to the classification of multiple gestures is achieved by a multiclass SVM using a one-versus-one or one-versus-all strategy. The SVM is a method for dealing with highly non-linear classification and regression problems. Benefiting from the structural risk minimization principle and the avoidance of over-fitting through its soft margin, the SVM usually outperforms traditional parameter estimation methods based on the law of large numbers when only limited training data are available.
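As a concrete illustration of the one-versus-one multiclass strategy, the sketch below uses scikit-learn, whose SVC classifier trains pairwise binary SVMs internally; the authors' implementation actually uses the SVMmulticlass package [28], and the kernel choice here is an assumption not stated in the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

def train_gesture_svm(descriptors, labels):
    """descriptors: (n_samples, 15*N) frame-based feature vectors;
    labels: gesture class for each sample. SVC decomposes the multiclass
    problem into one-versus-one binary SVMs."""
    clf = make_pipeline(
        StandardScaler(),                                   # scale each feature
        SVC(kernel="rbf", C=1.0, decision_function_shape="ovo"))
    clf.fit(np.asarray(descriptors), labels)
    return clf

# Usage: predicted = train_gesture_svm(train_X, train_y).predict(test_X)
```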

6 Evaluations

6.1 Implementation

Fig. 5  Block diagram of the FDSVM gesture recognition system: acceleration data acquisition → feature extraction (frame segmentation, feature calculation) → SVM training / recognition by SVM

Fig. 6  Illustration of segments and frames for a gesture: the gesture is divided into segments 0 to N; frame i is formed by segments i and i+1, so frames 0 to N-1 overlap by one segment

We built a prototype of GeeAir, including the hardware and algorithm implementations, to verify the design and performance. Currently, GeeAir can acquire speech and gesture commands with the two buttons and perform joystick-based selection. The software, including the algorithms for speech recognition and gesture recognition, is still implemented on a PC rather than on GeeAir itself. We use Bluetooth to connect GeeAir and the PC.

6.1.1 Hardware setup

The GeeAir prototype is built on the Nintendo Wiimote for acceleration sensing and its Nunchuk expansion for joystick selection. It has a 3-D accelerometer, a joystick, and two buttons, Button A and Button B (inspired by Button C and Button Z of the Nunchuk). The built-in microphone and speaker of GeeAir are simply replaced with a Bluetooth wireless headset connected to a laptop computer. The Wiimote is also employed to provide the communication between the laptop computer and GeeAir.

GeeAir utilizes Bluetooth as its non-directional wireless communication channel. However, most current appliances adopt infrared remote controllers and are therefore unable to receive Bluetooth signals. We developed a Bluetooth-infrared adaptor (BI Adaptor) to convert the Bluetooth signals into infrared signals; it will become unnecessary once appliances are able to communicate via Bluetooth. The signal light for the feedback mechanism is also embedded on the BI Adaptor, as shown in Fig. 7.

6.1.2 Algorithms implementation

For the isolated word recognition in GeeAir, the lexicon has 12 words for seven categories of home appliances, shown in Table 2. The utterances are recorded with a 16 kHz sampling frequency and 16-bit resolution. A 26-dimensional MFCC feature vector (13 cepstrum coefficients and their first derivatives) is employed, computed with a window size of 32 ms and a step size of 16 ms. Each word is represented by a trained left-to-right CDHMM with 3 states, implemented on the basis of HTK (the Hidden Markov Model Toolkit) [26]. An eight-component Gaussian mixture distribution is used for modeling the states. We use 6 Baum-Welch re-estimation iterations.

Gesture recognition with FDSVM for GeeAir uses the open-source FFTW package [27] for the discrete Fourier transform. The five features (mean, energy, entropy, correlation, and standard deviation of each individual axis in one frame) are then calculated. The feature vector is eventually fed into a classifier, either to train an SVM model or to retrieve the recognized gesture type. The SVM component utilizes the SVMmulticlass package [28]. For further details, refer to [15].
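For reference, the frame arithmetic implied by these speech front-end parameters is summarized in the small sketch below (the helper is an illustration of the stated window/step sizes, not part of the HTK configuration used by the authors).

```python
# Frame-size arithmetic for the speech front end in Sect. 6.1.2
# (16 kHz sampling, 32 ms window, 16 ms step, 13 MFCCs + first derivatives).
SAMPLE_RATE = 16000                   # Hz
WINDOW = int(0.032 * SAMPLE_RATE)     # 512 samples per analysis window
STEP = int(0.016 * SAMPLE_RATE)       # 256 samples between successive windows
N_MFCC = 13
FEATURE_DIM = 2 * N_MFCC              # 26: cepstra plus first derivatives

def n_frames(n_samples):
    """Number of analysis frames for an utterance of n_samples samples."""
    return 1 + max(0, n_samples - WINDOW) // STEP

# Example: a one-second utterance yields 1 + (16000 - 512) // 256 = 61 frames.
```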

6.2 Data acquisition

Fig. 7  Components of the Bluetooth-infrared adaptor

Table 2  Speech vocabulary of twelve Chinese words for seven appliances

No. | Appliance       | Chinese words (pinyin)
1   | Television      | dian shi; dian shi ji
2   | DVD player      | DVD
3   | Radio           | shou yin; shou yin ji
4   | Speaker         | yin xiang; yin xiang
5   | Air conditioner | kong tiao
6   | Lamp            | dian deng; tai deng; ri guang deng
7   | Curtain         | chuang lian

To evaluate GeeAir's performance in oral command recognition and gesture recognition, we built a speech database with 7 appliance names and a gesture acceleration database with 9 gestures. Both databases were acquired from 10 persons, 5 males and 5 females. The collection procedure lasted 5 days.

The vocabulary in the speech database includes the 12 Chinese words for the 7 appliances listed in Table 2; some of the appliances have more than one name, depending on users' habits. Each user was required to record each word 4 times per day. Thus, each user has 20 samples for each Chinese word.

For the gesture acceleration database, each participant was asked to perform each gesture 6 times per day. Thus, there are 6 × 5 × 9 × 10 = 2,700 samples. The start and end of a gesture were labeled by pressing Button B on the Wiimote during data acquisition. Figure 8 illustrates the acquisition devices. We divided the 9 gestures into 3 groups, as listed in Table 3, for the purpose of evaluating usability for different potential appliances. For example, Group 1 is for the speaker, air conditioner, lamp, and curtain; Group 2 is for the television and radio.

We employed leave-one-day-out cross-validation for the user-dependent case and leave-one-person-out cross-validation for the user-independent case in the speech and gesture experiments. For leave-one-day-out cross-validation, we divide all the samples into five partitions, taking one day's samples as a partition (namely 60 samples per gesture per partition and 40 samples per word per partition). Each time, four of the five partitions are used for training and the remaining partition for testing; we repeat this five times and take the average recognition rate. For leave-one-person-out cross-validation, nine participants' data (out of ten) are used as the training set, and the data of the remaining participant are used as the testing set.
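Both protocols are instances of leave-one-group-out cross-validation, grouped by day or by person; a compact sketch is given below (the record fields and the train_and_test callback are hypothetical, used only to show the partitioning scheme).

```python
from statistics import mean

def grouped_cv_accuracy(samples, group_key, train_and_test):
    """Generic leave-one-group-out cross-validation.

    samples: list of records, e.g. {"day": 3, "person": 7, "x": ..., "y": ...}
    group_key: "day" for leave-one-day-out (user-dependent case) or
               "person" for leave-one-person-out (user-independent case).
    train_and_test(train, test): returns the accuracy on the held-out group.
    """
    groups = sorted({s[group_key] for s in samples})
    accuracies = []
    for g in groups:
        train = [s for s in samples if s[group_key] != g]   # all other groups
        test = [s for s in samples if s[group_key] == g]    # held-out group
        accuracies.append(train_and_test(train, test))
    return mean(accuracies)                                  # average over folds
```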

6.3 Speech recognition accuracy

Using the 12-word speech data described previously, the experimental results show that user-dependent speech recognition achieves an accuracy of 98.21%, while the user-independent recognition rate is 91.79%. Figure 9 illustrates the recognition performance over time in the user-dependent case.

Fig. 9  User-dependent speech recognition results varying over time (recognition rate for Day 1 through Day 5 and the average)

6.4 Gesture recognition accuracy

Fig. 8  Acquisition devices for the gesture acceleration data

Table 3  The nine gestures are divided into three groups for the gesture recognition experiments

No. | Size | Gestures
1   | 3    | Forward-backward, up, down
2   | 5    | Forward-backward, up, down, left, right
3   | 9    | Forward-backward, up, down, left, right, double-left, double-right, V, inverted-V

6.4.1 Experiment 1: effect of frame number N

The purpose of analyzing a gesture in frames rather than as a whole is to describe its local characteristics over time. The frame count N determines how precisely a gesture is described: intuitively, the more frames a gesture is broken into, the more details are known about it. However, too large a frame number N may lead to over-fitting; it also increases the dimension of the feature space, which increases the computational complexity. This experiment examines the effect of varying N.

Figure 10 shows the experimental results for varying the frame number N using the data set of Group 3. As can be seen, higher recognition rates occur at the center of both curves and lower rates at both ends.

This result supports our assumption that the features convey little discriminative information when N is too small, and that over-fitting occurs when N is too large. The recognition accuracy is clearly lower than the rest when N is 2, and the two curves are nearly flat when N is between 4 and 7. In the following experiments, we choose N = 5.

Fig. 10  Experimental results for various frame numbers N (user-dependent and user-independent recognition rates, N from 2 to 19)

6.4.2 Experiment 2: user-dependent gesture recognition

In this experiment, to demonstrate the performance of our method, we compare it with four other methods: the C4.5 decision tree, Naive Bayes, DTW, and an HMM-based algorithm. We employed Quinlan's implementation of C4.5 [29] for comparison purposes.

We carried out the experiments and comparison tests on the 3 groups of data sets, respectively. The comparison results are shown in Fig. 11. When recognizing the three gestures of Group 1, all five approaches obtain recognition rates of more than 90%, with our proposed FDSVM achieving 99.17% (slightly lower than DTW at 99.76%). When the number of gesture types increases, the performance of HMM and DTW decreases significantly. In contrast, our FDSVM method performs well even when recognizing all 9 gestures, with a recognition rate of 96.40%.

Fig. 11  Experimental results for the user-dependent case (recognition rates of FDSVM, Naive Bayes, C4.5, DTW, and HMM on the three gesture groups)

6.4.3 Experiment 3: user-independent gesture recognition

The user-independent case means that the system is fully trained before users use it; such an implementation spares users the effort of performing several gestures as training data. The results of the user-independent gesture recognition test and comparison are shown in Fig. 12. As expected, the recognition rate of user-independent gesture recognition is lower than that of the user-dependent case. Our FDSVM shows very stable recognition performance as the number of gesture types increases: it achieves a recognition rate of 94.17% for the 3 gestures of Group 1 and 91.07% for the 9 gestures of Group 3. DTW achieves recognition rates of 97.38% for Group 1 and 95.78% for Group 2, slightly outperforming our method; however, our FDSVM significantly outperforms DTW on the 9 gestures of Group 3. The result reveals that our FDSVM has good generalization capability with respect to the number of gesture types.

Fig. 12  Experimental results for the user-independent case (recognition rates of FDSVM, Naive Bayes, C4.5, DTW, and HMM on the three gesture groups)

6.5 Response time test

We set up 8 home appliances as control objects in the laboratory: a curtain, two lights, a TV, an air conditioner, a speaker, and a DVD player. We then recruited 10 graduate students from the laboratory for the experiments, none of whom had used the GeeAir before. A series of tasks was defined as follows in order to test each user one after another:

1. Use speech to select a target appliance (one of eight). After a red-light feedback from the system for confirmation, perform gestures to control the appliance.
2. Use the joystick to repeat the same task as Step 1.
3. Cover the eyes of the participant to simulate the situation of a blind person, and use speech to select a target appliance (one of eight). After a voice feedback from the system, perform gestures to control the appliance.
4. Use the joystick to repeat the same task as Step 3.

Table 4 shows the average response times of the different stages when the students used the GeeAir prototype. We can see that it is faster to select a target using the joystick than using speech, because selection by speech needs considerable time (about 1.4 s) to speak an appliance name. The computational cost of recognition, for both speech and gesture, is less than 0.5 s. For the user, the response time of feedback by light is nearly negligible (only 43 ms). For the gesture command procedure, including the gesture action and gesture recognition, the average time spent is 0.483 s.

Table 4  Average response time of different stages (unit: milliseconds)

Target selection                             | Feedback   | Gesture (action + recognition)
Joystick: 1266                               | Light: 43  | 426 + 57
Speech (speaking + recognition): 1397 + 406  | Voice: 736 |

7 Conclusions

We have developed a handheld, universal multimodal remote control device, called GeeAir, for controlling home appliances via a mixed modality of speech, gesture, joystick, button, and light. Compared to existing universal remote controllers, GeeAir enables even those with physical, hearing, and vision impairments to control home appliances in a natural manner. Compared to existing multimodal solutions for interacting with smart environments, GeeAir provides a handy, single-device solution, not only offering comfort and convenience for common users in controlling home appliances but also meeting the special needs of physically and vision-impaired people in operating them.

Each single modality, such as speech, gesture, joystick, button, or light, has its own strengths and weaknesses. By combining these diverse but complementary modalities and integrating them into a single device, different home user groups can always find a combination of modalities with which they feel comfortable interacting with the environment. GeeAir represents an interesting attempt toward bringing multimodal interaction techniques closer to the everyday life of home users, particularly those who need assistance for independent living.

Speech and gesture are the two most natural ways that people interact with each other. Even though continuous speech and gesture recognition techniques are still not mature enough to be deployed in real applications, we achieved very good performance in our work by standardizing a small set of easily learned verbal commands and gestures and by introducing feedback mechanisms.

Multimodal interaction devices are necessary for mobile and ubiquitous environments. The GeeAir prototype permits us to begin developing the design space for mapping interactions to multimodal commands. Such a space will be necessary for optimally supporting different home users in different contexts.

The initial test results show clear benefits of the multimodal GeeAir device over universal remote controllers and other single-modality solutions. In the future, we plan to conduct a series of formal evaluations of GeeAir with real home users, including elderly and disabled inhabitants. Hopefully, the study will shed light on the cognitive load of various combinations of modalities (speech-gesture, joystick-gesture, speech-button, and joystick-button) in order to further improve the future design of GeeAir.

Acknowledgments  The authors would like to thank the anonymous reviewers for their comments and suggestions. The laboratory students' participation in the experiments is greatly appreciated. This work is supported in part by the National High-Tech Research and Development (863) Program of China (No. 2008AA01Z132, 2009AA011900), the Natural Science Fund of China (No. 60525202, 60533040), and the France ICT-Asia I-CROSS program. Dr. Shijian Li is the corresponding author.

References

1. Campbell LW (1997) A more universal remote control. http://web.media.mit.edu/~lieber/Teaching/Collab97/Collab-Projects/remote.html
2. http://www.consumer.philips.com/consumer/en/gb/consumer/cc/_categoryid_3000_SERIES_REMOTE_CONTROL_SU_GB_CONSUMER/ [4-in-1 TV/VCR/DVD/SAT]
3. http://www.oneforall.co.uk/en_UK/product/1/universal-remotes/3/advanced/25/digital-12
4. http://www.logitech.com/index.cfm/remotes/universal_remotes/devices/3898&cl=us,en
5. http://www.universalremote.com/product_detail.php?model=158
6. Lee L, Johnson T (2006) URCousin: universal remote control user interface. In: Proceedings of the Human Interface Technologies Conference, April 2006
7. Niezen G, Hancke GP (2008) Gesture recognition as ubiquitous input for mobile phones. In: International Workshop on Devices that Alter Perception (DAP'08), in conjunction with Ubicomp'08, 2008
8. Bolt RA (1980) "Put-that-there": voice and gesture at the graphics interface. In: SIGGRAPH'80, pp 262-270

9. Machate J, Burmester M, Bekiaris E (1997) Towards an intelligent multimodal and multimedia user interface providing a new dimension of natural HMI in the teleoperation of all home appliances by E&D users. In: 6th International Conference on Man-Machine Interactions Intelligent Systems in Business, Montpellier, May 1997, pp 226-229
10. Machate J (1999) Being natural: on the use of multimodal interaction concepts in smart homes. In: Proceedings of HCI International '99, pp 937-941
11. Wilson A, Oliver N (2003) GWindows: robust stereo vision for gesture-based control of windows. In: Proceedings of the 5th International Conference on Multimodal Interfaces, New York, NY, USA, pp 211-218
12. Krum DM, Omoteso O, Ribarsky W, Starner T, Hodges LF (2002) Speech and gesture multimodal control of a whole earth 3D visualization environment. In: Proceedings of the Symposium on Data Visualization, Barcelona, Spain, pp 195-200
13. Starner T, Auxier J, Ashbrook D, Gandy M (2000) The gesture pendant: a self-illuminating, wearable, infrared computer vision system for home automation control and medical monitoring. In: International Symposium on Wearable Computers (ISWC'00), pp 87-95
14. Kela J, Korpipaa P, Mantyjarvi J, Kallio S, Savino G, Jozzo L, Marca D (2006) Accelerometer-based gesture control for a design environment. Pers Ubiquitous Comput 10:285-299
15. Wu J, Pan G, Li S, Zhang D (2009) Gesture recognition with a 3D accelerometer. In: The Sixth International Conference on Ubiquitous Intelligence and Computing (UIC-09), Brisbane, Australia, 7-9 July 2009
16. Rabiner L, Levinson L (1981) Isolated and connected word recognition: theory and selected applications. IEEE Trans Commun 29(5):621-659
17. Rabiner LR (1989) A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 77:257-286
18. Lee C-H, Lin C-H, Juang B-H (1991) A study on speaker adaptation of the parameters of continuous density hidden Markov models. IEEE Trans Signal Process 39(4):806-814
19. Davis SB, Mermelstein P (1980) Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans Acoust Speech Signal Process 28:357-366
20. Mitra S, Acharya T (2007) Gesture recognition: a survey. IEEE Trans Syst Man Cybern Part C 37(3):311-324
21. Schlomer T, Poppinga B, Henze N, Boll S (2008) Gesture recognition with a Wii controller. In: International Conference on Tangible and Embedded Interaction (TEI'08), Bonn, Germany, Feb 18-20, 2008, pp 11-14
22. Mantyla V-M (2001) Discrete hidden Markov models with application to isolated user-dependent hand gesture recognition. VTT Publications
23. Liu J, Wang Z, Zhong L, Wickramasuriya J, Vasudevan V (2009) uWave: accelerometer-based personalized gesture recognition and its applications. In: IEEE PerCom'09, 2009
24. Mantyjarvi J, Kela J, Korpipaa P, Kallio S (2004) Enabling fast and effortless customization in accelerometer based gesture interaction. In: Proceedings of the 3rd International Conference on Mobile and Ubiquitous Multimedia (MUM'04), ACM Press, October 27-29, pp 25-31
25. Cristianini N, Shawe-Taylor J (2000) An introduction to support vector machines and other kernel-based methods. Cambridge University Press, Cambridge
26. HTK: http://htk.eng.cam.ac.uk/
27. Frigo M, Johnson SG (2005) The design and implementation of FFTW3. Proc IEEE 93(2)
28. Joachims T (1999) Making large-scale SVM learning practical. In: Scholkopf B, Burges C, Smola A (eds) Advances in kernel methods: support vector learning. MIT Press
29. Quinlan JR (1996) Improved use of continuous attributes in C4.5. J Artif Intell Res 4:77-90
