PERCEPTUAL INTELLIGENCE
SEMINAR REPORT
submitted by
AMITH K P
EPAHECS009
for the award of the degree
of
Bachelor of Technology
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
GOVERNMENT ENGINEERING COLLEGE SREEKRISHNAPURAM
PALAKKAD
September 2010
CERTIFICATE
This is to certify that the seminar report entitled PERCEPTUAL INTELLIGENCE submitted by AMITH K P, to the Department of Computer Science and Engineering, Government Engineering College, Sreekrishnapuram, Palakkad-678633, in partial fulfilment of the requirement for the award of the B.Tech Degree in Computer Science and Engineering, is a bona fide record of the work carried out by him during the year 2010.
Dr P C Reghu Raj (Seminar Co-ordinator)
Dr P C Reghu Raj (Head of the Department)
Place: Sreekrishnapuram
Date: 24-10-2010
Acknowledgement
It stands to reason that the completion of a seminar of this scope needs the support of many people. I take this opportunity to express my boundless thanks to everyone who helped me in the successful completion of my seminar. I am happy to acknowledge the help of all the individuals who supported my attempt.

First and foremost, I wish to express wholehearted indebtedness to God Almighty for the gracious constant care and magnanimity showered blissfully over me during this endeavour.

I thank Dr P C Reghu Raj, Head of the Department of Computer Science and Engineering, Govt. Engineering College Sreekrishnapuram, for providing all the required facilities for undertaking the seminar in a systematic way. I express my heartfelt gratitude to him for serving as seminar coordinator and guide, correcting me and giving valuable suggestions.

Gratitude is extended to all teaching and non-teaching staff of the Department of Computer Science and Engineering, Govt. Engineering College Sreekrishnapuram, for the sincere directions imparted and the cooperation in connection with the seminar.

I am also thankful to my parents for the support given in connection with the seminar. Gratitude is extended to all well-wishers and my friends who supported me to complete the seminar in time.
Table of Contents
List of Figures
Abstract
1 Introduction
1.1 Motivation
1.2 Why Perceptual Intelligence?
1.3 Organisation of the Report
2 Perceptual Intelligence
2.1 Perception
2.1.1 Filters Of Perception
2.2 Perceptual User Interfaces
2.2.1 Information flow in Perceptual User Interfaces
2.2.1.1 Perceptive User Interface/Perceptual Interface
2.2.1.2 Multimodal Interfaces
2.2.1.3 Multimedia Interfaces
2.3 Perceptual Intelligence
2.4 Perceptual Intelligent Systems
2.4.1 Gesture Recognition System
2.4.1.1 Gesture Recognition
2.4.1.2 Working
2.4.1.3 Challenges of Gesture Recognition
2.4.2 Speech Recognition System
2.4.2.1 Basic Needs Of Speech Recognition
2.4.2.2 Working
2.4.2.3 Performance of speech recognition systems
2.4.3 Nouse Perceptual Vision Interface
2.4.3.1 Computer Controlled Actions
2.4.3.2 Tools Used In Nouse Perceptual Vision Interface
2.4.3.3 Working Of Nouse Perceptual Vision Interface
3 Applications
3.1 SMART ROOMS
3.2 SMART CLOTHES
4 Conclusion
Bibliography
List of Figures
2.1 Perception
2.2 Perceptual User Interface
2.3 Information flow in Perceptual User Interface
2.4 Human Computer Interaction
2.5 Perceptual Interface
2.6 Phantom input/output device
2.7 Different Media
2.8 User adjusts sound volume with slider and turns off sound with Sound off push button
2.9 Background subtracted mask
2.10 Image illustrating body center
2.11 Selecting Shift, selecting H
3.1 Smart room
Abstract
Human computer interaction has not changed fundamentally for nearly two decades. Most users interact with computers by typing, clicking and pointing. Much current research therefore concentrates on interaction techniques that combine an understanding of natural human capabilities with computer I/O devices and machine perception and reasoning.

Perceptual Intelligence is the knowledge and understanding that everything we experience (especially thoughts and feelings) is defined by our perception. It is important to realize that this is an active, not passive, process and that we therefore have the ability to control or change it. Computers need to share our perceptual environment before they can be really helpful. They need to be situated in the same world that we are; they need to know much more than just the text of our words. They also need to know who we are, see our expressions and gestures, and hear the tone and emphasis of our voice.
CHAPTER 1
Introduction
Inanimate things are coming to life. The simple objects that surround us are gaining sensors, computational power, and actuators. Consequently, desks and doors, TVs and telephones, cars and trains, eyeglasses and shoes, and even the shirts on our backs are changing from static, inanimate objects into adaptive, reactive systems that can be more friendly, useful, and efficient. Or, of course, these new systems could be even more difficult to use than current systems; it depends on how we design the interface between the world of humans and the world of this new generation of machines.
1.1 Motivation
The main problem with today's systems is that they are both deaf and blind. They mostly experience the world around them through a slow serial line to a keyboard and mouse. Even multimedia computers, which can handle signals like sound and image, do so only as a transport device that knows nothing about the content. Hence these objects are still static and inanimate.

To change inanimate objects like offices, houses, cars, or glasses into smart, active help-mates, they need some kind of intelligence. The kind of intelligence they need here is Perceptual Intelligence: paying attention to people and the surrounding situation in the same way another person would, thus allowing these new devices to learn to adapt their behaviour to suit us, rather than us adapting to them as we do today.
1.2 Why Perceptual Intelligence?
The problem with current computers is that they are incredibly isolated. If you imagine yourself living in a closed, dark, soundproof box with only a telegraph connection to the outside world, you can get some sense of how difficult it is for computers to act intelligently or be helpful. They exist in a world almost completely disconnected from ours, so how can they know what they should do in order to be helpful?

Computers need to share our perceptual environment before they can be really helpful. They need to be situated in the same world that we are; they need to know much more than just the text of our words or the content of our signals. Once a computer has the perceptual ability to know who, what, when, where, and why, by understanding, learning and interacting with the physical world, it has enough information to determine a good course of action. If systems have the ability to learn perception, they can act in a smart way. Perceptual intelligence is actually a learned skill.
1.3 Organisation of the Report
Each chapter begins with a brief introduction to its content. Chapter 2 describes the
Perceptual Intelligence in detail. Chapter 3 discusses the Applications of Perceptual
Intelligence. Chapter 4 includes the conclusion.
CHAPTER 2
Perceptual Intelligence
2.1 Perception
Perception is the process of receiving information about and making sense of the world around us. It involves deciding which information to notice, how to categorize this information and how to interpret it within the framework of existing knowledge. It is a process by which individuals organize and interpret their sensory impressions in order to give meaning to their environment. For instance, as shown in Fig 2.1, the perception process first receives stimuli and selects among them; after selecting, it organizes and interprets them before giving a response.
Perception is the end result of a thought that begins its journey with the senses.
We see, hear, physically feel, smell or taste an event. After the event is experienced
it must then go through various filters before our brains decipher what exactly has
happened and how we feel about it. Even though this process can seem instantaneous,
it still always happens.
2.1.1 Filters Of Perception
The filters that make up perception are as follows:
• What we know about the subject or event. I saw an orange and knew it was edible.
• What our previous experience with the subject or event was. Last time I ate an orange I peeled it first (knowledge to peel an orange before eating it) and it was sweet. Our previous experience forms our expectations.
Figure 2.1: Perception
• Our current emotional state. How we are feeling at the time of the event does affect how we will feel after the event. I was in a bad mood when I ate the orange and it angered me that it was sour and not sweet.
• In the end, my intellectual and emotional perception of eating an orange was of an unpleasant experience. How strong that experience was determines how I will feel the next time I eat an orange. For example, if I got violently sick after eating an orange, the next time I see an orange, I probably won't want to eat it. If I had a pleasant experience eating an orange, the next time I see an orange, I'll likely want to eat it.
Even though emotions seemingly occur as a result of an experience, they are actually
the result of a complicated process. This process involves interpreting action and
thought and then assigning meaning to it. The mind attaches meaning with prejudice
as the information goes through the perceptual filters we mentioned above.
Our perceptual filters also determine truth, logic and meaning, though they don't always do this accurately. Only when we become aware that a bad feeling could be an indication of a misunderstanding (an error in perception) can we begin to make adjustments to our filters and change the emotional outcome.

When left alone and untrained, the mind chooses emotions and reactions based on a "survival" program which does not take into account that we are civilized beings; it is only concerned with survival.

A good portion of this program is faulty because the filters have created distortions, deletions and generalizations which alter perception. For example, jumping to a conclusion about "all" or "none" of something based on one experience. The unconscious tends to think in absolutes and supports "one time" learnings from experience (this is the survival aspect of learning).
2.2 Perceptual User Interfaces
A perceptual interface is one that allows a computer user to interact with the computer without having to use the normal keyboard and mouse. These interfaces are realised by giving the computer the capability of interpreting the user's movements or voice commands. Perceptual interfaces are concerned with extending human-computer interaction to use all modalities of human perception. Current research efforts focus on including vision, audition, and touch in the process. The goal of creating perceptual user interfaces is to allow humans to have natural means of interacting with computers, appliances and devices using voice, sounds, gestures, and touch.

Perceptual User Interfaces (PUI) are characterised by interaction techniques that combine an understanding of natural human capabilities with computer I/O devices and machine perception and reasoning. Devices and sensors should be transparent and passive if possible, and machines should perceive relevant human communication channels as well as generate output that is naturally understood. This is expected to require integration at multiple levels of technologies such as speech and sound recognition and generation, computer vision, graphical animation and visualization, language understanding, touch-based sensing and feedback, learning, user modelling and dialogue management, as in Figure 2.2.

Figure 2.2: Perceptual User Interface
2.2.1 Information flow in Perceptual User Interfaces
PUIs integrate perceptive, multimodal, and multimedia interfaces to bring our human capabilities to bear on creating more natural and intuitive interfaces, as shown in Figure 2.3.
2.2.1.1 Perceptive User Interface/Perceptual Interface
A perceptive user interface is one that adds human-like perceptual capabilities to the computer, for example, making the computer aware of what the user is saying or what the user's face, body and hands are doing. These interfaces provide input to the computer while leveraging human communication and motor skills. Unlike traditional passive interfaces that wait for users to enter commands before taking any action, perceptual interfaces actively sense and perceive the world and take actions based on goals and knowledge at various levels. Perceptual interfaces move beyond the limited modalities and channels available with a keyboard, mouse, and monitor, to take advantage of a wider range of modalities, either sequentially or in parallel.

Figure 2.3: Information flow in Perceptual User Interface
The general model for perceptual interfaces is that of human-to-human communication. While this is not universally accepted in the Human-Computer Interaction community as the ultimate interface model, there are several practical and intuitive reasons why it makes sense to pursue this goal. Human interaction is natural and in many ways effortless; beyond an early age, people do not need to learn special techniques or commands to communicate with one another.

Figures 2.4 and 2.5 depict natural interaction between people and, similarly, between humans and computers. Perceptual interfaces can potentially effect improvements in the human factors objectives mentioned earlier in the section, as they can be easy to learn and efficient to use, they can reduce error rates by giving users multiple ways to communicate, and they can be very satisfying and compelling for users.
Figure 2.4: Human Computer Interaction
Figure 2.5: Perceptual Interface
2.2.1.2 Multimodal Interfaces
A multimodal user interface is closely related, in that it emphasizes human communication skills. It is a system that combines two or more input modalities in a coordinated manner. Humans interact with the world by way of information being sent and received, primarily through the five major senses of sight, hearing, touch, taste, and smell. A modality refers to a particular sense. A communication channel is a course or pathway through which information is transmitted. In typical HCI usage, a channel describes the interaction technique that utilizes a particular combination of user and computer communication, i.e., the user output/computer input pair. This can be based on a particular device, such as the keyboard channel or the mouse channel, or on a particular action, such as spoken language, written language, or dynamic gestures. In this view, the following are all channels: text (which may use multiple modalities when typing in text or reading text on a monitor), sound, speech recognition, images/video, and mouse pointing and clicking.
In this sense, nearly all interfaces are already multimodal and multi-channel: certainly every command line interface uses multiple modalities, as sight and touch are vital to these systems. The same is true for graphical user interfaces, which in addition use multiple channels such as keyboard text entry, mouse pointing and clicking, sound, and images. Most work on multimodal user interfaces is focused on computer input. Multimodal output uses different modalities, such as visual display, audio and tactile feedback, to engage human perceptual, cognitive and communication skills in understanding what is being presented. Multimodal interfaces focus on integrating sensor-based recognition input technologies, such as speech recognition, pen gesture recognition, and computer vision, into the user interface.
Multimodal interface systems have used a number of non-traditional
modes and technologies.
Some of the most common are the following:
• Speech recognition: Speech is a very important and flexible communication modality for humans, and is much more natural than typing or any other way of expressing particular words, phrases, and longer utterances. Despite decades of research in speech recognition and over a decade of commercially available speech recognition products, the technology is still far from perfect, due to the size, complexity, and subtlety of language, the limitations of microphone technology, and problems of noisy environments. Systems using speech recognition have to be able to recover from the inevitable errors produced by the system.
• Pen-based gesture: Pen-based gesture has been popular in part because of computer form factors that include a pen or stylus as a primary input device. Pen input is particularly useful for pointing gestures, defining lines, contours, and areas, and specially defined gesture commands, e.g., minimizing a window by drawing a large M on the screen. Pen-based systems are quite useful in mobile computing, where a small computer can be carried.
• Haptic input and force feedback: Haptic, or touch-based, input devices measure pressure, velocity and location, essentially perceiving aspects of a user's manipulative and explorative manual actions. These can be integrated into existing devices, e.g., keyboards and mice that know when they are being touched, and possibly by whom. Or they can exist as standalone devices, such as the well-known PHANTOM device by SensAble Technologies shown in Fig. 2.6. These and most other haptic devices integrate force feedback and allow the user to experience the touch and feel of simulated artifacts as if they were real. Through the mediation of a hand-held stylus or probe, haptic exploration can now receive simulated feedback including rigid boundaries of virtual objects, soft tissue, and surface texture properties. A tempting goal is to simulate all haptic experiences and to be able to recreate objects with all their physical properties in virtual worlds so they can be touched and handled in a natural way.
Figure 2.6: Phantom input/output device
• Computer vision: Computer vision has many advantages as an input modality for multimodal or perceptual interfaces. Visual information is clearly important in human-human communication, as meaningful information is conveyed through identity, facial expression, posture, gestures, and other visually observable cues. Sensing and perceiving these visual cues from video cameras appropriately placed in the environment is the domain of computer vision.
2.2.1.3 Multimedia Interfaces
Multimedia user interfaces use perceptual and cognitive skills to interpret information presented to the user. They combine several kinds of media to help people use a computer. These media can include text, graphics, animation, images, voice, music, and touch, as shown in Fig 2.7. Multimedia user interfaces are gaining popularity because they are very effective at keeping the interest of their users, improving the amount of information users remember, and being very cost-effective.

Figure 2.7: Different Media
Successful multimedia design requires primary emphasis on the user. Multimedia interfaces determine which human sensory system is most efficient at receiving specific information, then use the media that involve that human sensory system.
For example, to teach someone how to fix a part on a jet engine it is probably most
efficient for the student to watch someone else fixing the part rather than hearing a
lecture on how to fix the part. The human visual system is better at receiving this
complex information. So, the multimedia should probably use video as the medium
of communication.
This heavy emphasis on the user's senses, rather than the media, means that we should probably call these user interfaces multisensory rather than multimedia. The human senses used most frequently in multimedia are sight, sound, and touch. Multimedia often stimulates these senses simultaneously. For example, a user can see and hear a video, and then touch a graphical button to freeze the video image. Since so many media are available to the multimedia user interface, it is very easy to overwhelm and confuse the users.
Medias of Multimedia Interfaces
Sight
The medium of sight is helpful when you need to communicate detailed information,
such as instructions, that the user may need to refer to later. Here are some guidelines
that involve sight.
• Use pastel colors
Except for videos and black and white text, it is generally a good idea to use
slightly washed out,desaturated, impure colors. This is most important for the
small objects that you put on screens. Scientists believe that these kinds of
colors let people focus on small objects better and are less likely to cause objects
to appear to float on the display. In your videos, stick with the original colors
in which the video was shot. Use high-contrast foreground-background colors
for text.
• Use fonts that have serifs
Serifs are small lines that finish off the main strokes of a letter and often appear
at the top or bottom of a letter. The text you are reading has a serif font. Use
fonts with serifs because these fonts may help readers differentiate letters and
because readers seem to prefer fonts with serifs.
Sound
The medium of sound is helpful when you need to communicate short, simple infor-
mation, such as confirmation beeps, that users don’t have to refer to later. Here are
some guidelines that involve sound.
• Use lower frequency sounds
High-frequency sounds can be shrill and annoying, especially when the user has to hear them repeatedly. Generally, try to use lower frequency sounds, around 100 hertz to 1000 hertz.
• Let the user control sound volume quickly and easily
The user may want to make the sound louder or softer for personal preference or to avoid disturbing people nearby. Design so that the user can adjust the sound volume. Make sound controls large and obvious and let the user turn the sound off. These features are shown in Figure 2.8.

Figure 2.8: User adjusts sound volume with slider and turns off sound with Sound off push button
Touch
The medium of touch is helpful when you need to ask the user to make simple choices,
such as navigation decisions, without using a keyboard. Here are some guidelines that
involve touch.
• Use large touch areas
For a touch-sensitive screen, make it easy for the users to activate the touch areas by making those areas fairly large, for example by using rectangular touch areas that are 40 mm wide and 36 mm high surrounding smaller visual targets, such as push buttons, that are 27 mm wide and 22 mm high.
• Use consistent colors and shapes to designate touch areas
To make it obvious which objects can be selected with touch, use the same color and shape for those objects that can be selected with touch. For example, always use a blue square to designate buttons that the user can select with touch.
2.3 Perceptual Intelligence
Perceptual Intelligence is the knowledge and understanding that everything we experience (especially thoughts and feelings) is defined by our perception, i.e., paying attention to people and the surrounding situation in the same way another person would, thus allowing these new devices to learn to adapt their behaviour to suit us, rather than us adapting to them as we do today. In the language of cognitive science, perceptual intelligence is the ability to deal with the frame problem; it is the ability to classify the current situation, so that it is possible to know which variables are important and thus to take appropriate action. Once a computer has the perceptual ability to know who, what, when, where, and why, then the probabilistic rules derived by statistical learning methods are normally sufficient for the computer to determine a good course of action.

The key to perceptual intelligence is making machines aware of their environment, and in particular, sensitive to the people who interact with them. They should know who we are, see our expressions and gestures, and hear the tone and emphasis of our voice. The goal of the perceptual intelligence approach is not to create computers with the logical powers envisioned in most Artificial Intelligence research, or to have computers that are ubiquitous and networked, because most of the tasks we want performed do not seem to require complex reasoning or a god's-eye view of the situation. One can imagine, for instance, a well-trained dog controlling most of the functions we envision for future smart environments. So instead of logic or ubiquity, we strive to create systems with reliable perceptual capabilities and the ability to learn simple responses.
2.4 Perceptual Intelligent Systems
We have developed computer systems that can follow people's actions, recognizing their faces, gestures, and expressions.
Some of the systems are:
• Gesture recognition system
• Speech recognition system
• Nouse perceptual vision interface
2.4.1 Gesture Recognition System
2.4.1.1 Gesture Recognition
Gesture Recognition deals with the goal of interpreting human gestures via mathematical algorithms. Gestures can originate from any bodily motion or state but commonly originate from the face or hand. Current focuses in the field include emotion recognition from the face and hand gesture recognition. Many approaches have been made using cameras and computer vision algorithms to interpret sign language. Gesture Recognition can be seen as a way for computers to begin to understand human body language, thus building a richer bridge between machines and humans than primitive text user interfaces or even GUIs (Graphical User Interfaces), which still limit the majority of input to keyboard and mouse.

Gesture Recognition enables humans to interface with the machine (HMI) and interact naturally without any mechanical devices. Using the concept of Gesture Recognition, it is possible to point a finger at the computer screen so that the cursor will move accordingly. This could potentially make conventional input devices such as the mouse, keyboard and even touch-screens redundant. Gesture Recognition can be conducted with techniques from computer vision and image processing.
2.4.1.2 Working
The system must be able to look at a user and determine where they are pointing. Such a system would be integral to an intelligent room, in which a user interacts with a computer by making verbal statements accompanied by gestures to inform the computer of, for example, the location of a newly filed, physical document.

The system was written in C++ using components in C. The operation of the system proceeds in four basic steps:
• Image input
• Background subtraction
• Image processing and Data extraction
• Decision tree generation/parsing
Image input
To input image data into the system, an SGI IndyCam (Silicon Graphics Inc.) was used, with an SGI image capture program used to take the picture. The camera was used to take first a background image and then subsequent images of a person pointing in particular directions.
Background subtraction
Once images are taken, the system performs a background subtraction of the image to isolate the person and create a mask. The background subtraction proceeds in two steps. First, each pixel from the background image is channelwise subtracted from the corresponding pixel of the foreground image. The resulting channel differences are summed, and if they are above a threshold, the corresponding pixel of the mask is set white; otherwise it is set black. The resulting image is a mask that outlines the body of the person as in Figure 2.9.

Figure 2.9: Background subtracted mask
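To make this step concrete, the following Python sketch (assuming NumPy and two same-sized RGB frames already loaded as arrays) performs the channelwise subtraction, per-pixel summation and thresholding just described; the threshold value is an arbitrary illustrative choice, not the one used in the original system.

import numpy as np

def background_mask(background, foreground, threshold=60):
    """Channelwise subtraction followed by thresholding.

    background, foreground: H x W x 3 uint8 arrays of the same size.
    Returns an H x W boolean mask that is True (white) where the summed
    channel differences exceed the threshold, i.e. where the person is.
    """
    # Work in a signed type so the subtraction does not wrap around.
    diff = np.abs(background.astype(np.int16) - foreground.astype(np.int16))
    # Sum the per-channel differences for each pixel.
    summed = diff.sum(axis=2)
    # Pixels above the threshold belong to the person; the rest is background.
    return summed > threshold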
Image processing and Data extraction
Once the mask is generated, it can be processed to extract data into a decision tree. Two strategies are used for extracting data from the image. First, the body and arms are found: each column of pixels from the mask is summed, and the column with the highest sum is assumed to be the center of the body. This is a valid criterion for determining the body center, based on the assumptions about the input image. From the center of the body, horizontal rows are traced out to the left and right until the edge of the mask is reached (pixels turn from white to black). The row of pixels that extends the furthest is assumed to be the arm that is pointing. This again is a valid decision based on the assumptions about the input image: only one arm is pointing at a time, and the arm is pointing in a direction, as in Figure 2.10.

Figure 2.10: Image illustrating body center
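A rough Python sketch of this extraction, under the same assumptions about the input image (one person, one arm pointing), is given below; the exact tracing details of the original system are not given in this report, so the code is only an approximation.

import numpy as np

def body_center_and_arm(mask):
    """Estimate the body center column and the pointing-arm row.

    mask: H x W boolean array from background subtraction.
    The column with the most white pixels is taken as the body center;
    the row whose white run extends furthest left or right of that
    column is taken as the pointing arm.
    """
    column_sums = mask.sum(axis=0)
    center_col = int(np.argmax(column_sums))          # body center column

    best_row, best_extent, best_side = 0, 0, "none"
    for row in range(mask.shape[0]):
        # Trace right from the center until the mask turns black.
        right = 0
        while center_col + right + 1 < mask.shape[1] and mask[row, center_col + right + 1]:
            right += 1
        # Trace left from the center until the mask turns black.
        left = 0
        while center_col - left - 1 >= 0 and mask[row, center_col - left - 1]:
            left += 1
        extent, side = max((right, "right"), (left, "left"))
        if extent > best_extent:
            best_row, best_extent, best_side = row, extent, side
    return center_col, best_row, best_side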
Decision tree generation/parsing
Once the data are extracted, they are used to create the decision tree. This process is straightforward, involving only the manual classification of the test vectors into six categories representing the images: right-up, right-middle, right-down, left-up, left-middle, left-down. Since the system uses a decision tree, the method of learning must be supervised. Besides being tedious, the manual classification of the images makes it difficult to classify into more categories. The mask shown in Fig. 2.10 is converted to a vertical bar representation; the system uses a combination of two algorithms for image data extraction. To use the system, instead of generating the tree, a single image is processed and then run through the decision tree, resulting in the classification of the image into one of the six mentioned categories.
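As a hedged illustration of this classification step, the sketch below uses scikit-learn's DecisionTreeClassifier with made-up feature vectors standing in for the vertical-bar representation; the data, the feature layout and the three (of six) categories shown are hypothetical.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature vectors (e.g. vertical-bar heights extracted from masks)
# and their manually assigned labels; real data would come from the step above.
X_train = [
    [0.9, 0.7, 0.2, 0.1],   # arm extended up and to the right
    [0.5, 0.5, 0.5, 0.1],   # arm extended to the middle right
    [0.1, 0.3, 0.8, 0.1],   # arm extended down and to the right
]
y_train = ["right-up", "right-middle", "right-down"]

clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)                 # supervised learning from labelled vectors

# Classify a new, unseen image's feature vector into one of the categories.
print(clf.predict([[0.85, 0.65, 0.25, 0.15]]))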
2.4.1.3 Challenges of Gesture Recognition
There are many challenges associated with the accuracy and usefulness of gesture recognition software. For image-based gesture recognition there are limitations on the equipment used and image noise. Images or video may not be under consistent lighting, or in the same location. Items in the background or distinct features of the users may make recognition more difficult. The variety of implementations for image-based gesture recognition may also be an issue for the viability of the technology for general usage. For example, recognition using stereo cameras or depth-detecting cameras is not currently commonplace. Video or web cameras can give less accurate results because of their limited resolution.
2.4.2 Speech Recognition System
Speech recognition converts spoken words to machine-readable input (for
example, to the binary code for a string of character codes). The term voice recog-
nition may also be used to refer to speech recognition, but more precisely refers to
speaker recognition, which attempts to identify the person speaking, as opposed to
what is being said.
2.4.2.1 Basic Needs Of Speech Recognition
The following definitions are the basics needed for understanding speech
recognition technology.
• Utterance
An utterance is the vocalization (speaking) of a word or words that represent a
single meaning to the computer. Utterances can be a single word, a few words,
a sentence, or even multiple sentences.
• Speaker Dependence
Speaker dependent systems are designed around a specific speaker; they are more accurate for the correct speaker, but much less accurate for other speakers. They assume the speaker will speak in a consistent voice and tempo. Adaptive systems usually start as speaker independent systems and utilize training techniques to adapt to the speaker to increase their recognition accuracy.
• Vocabularies
Vocabularies (or dictionaries) are lists of words or utterances that can be recognized by the SR system. Smaller vocabularies are easier for a computer to recognize, while larger vocabularies are more difficult. Unlike normal dictionaries, each entry doesn't have to be a single word; entries can be as long as a sentence or two. Small vocabularies can have as few as one or two recognized utterances (e.g. "Wake Up"), while very large vocabularies can have a hundred thousand entries or more.
• Accuracy
The ability of a recognizer can be examined by measuring its accuracy, or how well it recognizes utterances. This includes not only correctly identifying an utterance but also identifying when the spoken utterance is not in its vocabulary. Good ASR systems can reach an accuracy of about 98%, but the acceptable accuracy of a system really depends on the application (a simple word-error-rate computation is sketched after this list).
• Training
Some speech recognizers have the ability to adapt to a speaker. When the
system has this ability, it may allow training to take place. An ASR system
is trained by having the speaker repeat standard or common phrases and ad-
justing its comparison algorithms to match that particular speaker. Training a
recognizer usually improves its accuracy. Training can also be used by speakers
that have difficulty speaking, or pronouncing certain words.
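As promised in the Accuracy item above, the following sketch computes a simple word error rate by aligning a recognized word sequence against a reference transcript with edit distance; this is a standard way of measuring recognizer accuracy, not a detail of any particular system.

def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words (Levenshtein).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution out of three reference words gives a WER of 0.33.
print(word_error_rate("check my balance", "check me balance"))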
2.4.2.2 Working
The speech recognition process is performed by a software component known
as the speech recognition engine. The primary function of the speech recognition
engine is to process spoken input and translate it into text that an application un-
derstands. The application can then do one of two things:
• The application can interpret the result of the recognition as a command. In this case, the application is a command and control application. An example of a command and control application is one in which the caller says "check balance", and the application returns the current balance of the caller's account.
• If an application handles the recognized text simply as text, then it is considered a dictation application. In a dictation application, if you said "check balance", the application would not interpret the result, but simply return the text "check balance".
Recognition systems can be broken down into two main types. Pattern Recog-
nition systems compare patterns to known/trained patterns to determine a match.
Acoustic Phonetic systems use knowledge of the human body (speech production, and
hearing) to compare speech features (phonetics such as vowel sounds). Most modern
systems focus on the pattern recognition approach because it combines nicely with
current computing techniques and tends to have higher accuracy. Most recognizers
can be broken down into the following steps:
1. Audio recording and Utterance detection
2. PreFiltering (preemphasis, normalization, banding, etc.)
3. Framing and Windowing (chopping the data into a usable format)
4. Filtering (further filtering of each window/frame/freq. band)
5. Comparison and Matching (recognizing the utterance)
6. Action (Perform function associated with the recognized pattern)
Although each step seems simple, each one can involve a multitude of different
(and sometimes completely opposite) techniques.
(1) Audio recording and utterance detection can be accomplished in a number of ways. Starting points can be found by comparing ambient audio levels (acoustic energy in some cases) with the sample just recorded. Endpoint detection is harder because speakers tend to leave "artifacts" including breathing/sighing, teeth chatter, and echoes.

(2) Pre-filtering is accomplished in a variety of ways, depending on other features of the recognition system. The most common methods are the "bank of filters" method, which utilizes a series of audio filters to prepare the sample, and the Linear Predictive Coding method, which uses a prediction function to calculate differences (errors). Different forms of spectral analysis are also used.

(3) Framing/windowing involves separating the sample data into pieces of a specific size. This is often rolled into step 2 or step 4. This step also involves preparing the sample boundaries for analysis (removing edge clicks, etc.).

(4) Additional filtering is not always present. It is the final preparation for each window before comparison and matching. Often this consists of time alignment and normalization.

(5) Comparison and matching: most approaches involve comparing the current window with known samples. There are methods that use Hidden Markov Models (HMM), frequency analysis, differential analysis, linear algebra techniques/shortcuts, spectral distortion, and time distortion. All these methods are used to generate a probability and accuracy match.

(6) Actions can be just about anything the developer wants.
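To illustrate steps (2) and (3), the sketch below applies a simple pre-emphasis filter and then chops the signal into overlapping Hamming-windowed frames; the coefficient, frame length and step size are common textbook values assumed for the example, not values prescribed by any particular recognizer.

import numpy as np

def preemphasis(signal, coeff=0.97):
    """Step 2 (one common pre-filter): y[n] = x[n] - coeff * x[n-1]."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

def frame_and_window(signal, sample_rate, frame_ms=25, step_ms=10):
    """Step 3: chop the signal into overlapping frames and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step = int(sample_rate * step_ms / 1000)
    n_frames = max(1, 1 + (len(signal) - frame_len) // step)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * step : i * step + frame_len] * window
                       for i in range(n_frames)])
    return frames   # each row is one windowed frame, ready for further filtering

# Example with one second of synthetic audio at 16 kHz.
audio = np.random.randn(16000)
frames = frame_and_window(preemphasis(audio), 16000)
print(frames.shape)   # (number of frames, samples per frame)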
2.4.2.3 Performance of speech recognition systems
The performance of speech recognition systems is usually specified in terms of accuracy and speed. Most speech recognition users would tend to agree that dictation machines can achieve very high performance in controlled conditions. There is some confusion, however, over the interchangeability of the terms "speech recognition" and "dictation".

Commercially available speaker-dependent dictation systems usually require only a short period of training (sometimes also called 'enrollment') and may successfully capture continuous speech with a large vocabulary at normal pace with a very high accuracy. Most experiments claim that recognition software can achieve more than 75 percent accuracy if operated under optimal conditions.

Speech recognition in video has become a popular search technology used by several video search companies. Limited vocabulary systems, requiring no training, can recognize a small number of words (for instance, the ten digits) as spoken by most speakers. Such systems are popular for routing incoming phone calls to their destinations in large organizations. Both acoustic modeling and language modeling are important parts of modern statistically-based speech recognition algorithms. Hidden Markov models (HMMs) are widely used in many systems.
2.4.3 Nouse Perceptual Vision Interface
Nouse PVI evolved from the original Nouse, i.e. "Nose as Mouse". It has several unique features that make it preferable to other hands-free vision-based computer input alternatives. Its original idea of using the nose tip as a single reference point to control a computer has been confirmed to be very convenient. The nose literally becomes a new finger which can be used to write words, move a cursor on screen, click or type. Being able to track the nose tip with subpixel precision within a wide range of head motion makes performing all control tasks possible.
2.4.3.1 Computer Controlled Actions
Nouse PVI is a perceptual vision interface program that offers a complete solution for working with a computer hands-free in Microsoft Windows. Using a camera connected to a computer, the program analyzes the facial motion of the user to allow him/her to use it instead of a mouse and a keyboard. As such, Nouse PVI allows a user to perform the three basic computer-control actions:
• cursor control: includes
a) Cursor positioning
b) Cursor moving, and
c) Object dragging - which are normally performed using mouse motion
• clicking: includes
a) right-button click
b) left-button click
c) Double-click, and
d) Holding the button down - which are normally performed using the mouse
buttons
• key/letter entry: includes
a) typing of English letters
b) Switching from capital to small letters, and to functional keys
c) entering basic MS Windows functional keys as well as Nouse functional keys
- which would normally be performed using a keyboard
2.4.3.2 Tools Used In Nouse Perceptual Vision Interface
The program is equipped with such tools as:
1. Nousor (Nouse Cursor) - the video-feedback-providing cursor that is used to point and to provide the feeling of "touch" with a computer.
2. NouseClick - a nose-operated mechanism to simulate different types of clicks.
3. NouseCodes - a configurable Nouse tool that allows entering computer commands and operating the program using head motion codes.
4. NouseEditor - provides an easy way of typing and storing messages hands-free using face motion. Typed messages are automatically stored in the clipboard (as with CTRL+A, CTRL+C).
5. NouseBoard - an on-screen keyboard specially designed for face-motion-based typing, which automatically maps to the user's facial motion range.
6. NouseTyper - a configurable Nouse tool that allows typing letters by drawing them inside the cursor (instead of using the NouseBoard).
7. NouseChalk - a configurable Nouse tool that allows writing letters as with a chalk on a piece of paper. Written letters are automatically saved on the hard drive as images that can be opened and emailed.
2.4.3.3 Working Of Nouse Perceptual Vision Interface
Once a camera is selected, the program starts in the inactive sleep state. It remains in this state until a user showing facial motion is detected by the camera. When a user is detected, it switches to the inactive wake-up state: the Nouse icon turns to face straight up and the buttons on the Nouse Cursor turn white. In this state the system verifies the position of the user's face. If this position is not consistent over 10-20 seconds, Nouse goes back to the sleep state.
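The sleep/wake behaviour just described can be summarized as a small state machine; the sketch below is only one interpretation of that description, with the camera-driven predicates and the time-out left as hypothetical placeholders.

import time

SLEEP, WAKE_UP, ACTIVE = "sleep", "wake-up", "active"

def run_interface(detect_face_motion, face_position_consistent, timeout_s=15):
    """Loop through the states described in the text.

    detect_face_motion() and face_position_consistent() are hypothetical
    camera-driven predicates that would be supplied by the vision layer.
    """
    state = SLEEP
    wake_time = None
    while True:
        if state == SLEEP and detect_face_motion():
            state, wake_time = WAKE_UP, time.time()   # user showed facial motion
        elif state == WAKE_UP:
            if face_position_consistent():
                state = ACTIVE                        # face held steady: start nousing
            elif time.time() - wake_time > timeout_s:
                state = SLEEP                         # inconsistent for 10-20 s: back to sleep
        elif state == ACTIVE:
            break                                     # hand over to cursor/keyboard modes
        time.sleep(0.1)
    return state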
Working in NouseCursor State
1. Sleep: Start the nouse cursor (bring it to "put face in centre" mode).
Initially, the NouseCursor state is the Nouse sleep state, and all Nousor buttons are coloured red.
2. Wake-up: Put the nose (face) in the centre of the nouse cursor to activate it.
The two middle buttons will remain red, while the side buttons become blue, which indicates that the user should put their face in the centre. Once this is done, nousing starts.
3. Move the cursor to a point on the screen as with a joystick.
The nouse cursor works as a joystick. If the nose is in the centre, the nouse cursor will not move; the further the nose moves in a direction, the proportionally faster the cursor moves in that direction (a minimal sketch of this mapping is given after these steps).
4. Move the cursor as by pushing a mouse.
This can be likened to moving an actual mouse cursor when the mouse pad is very small. In order to move to the right, for example, nudge the nose to the right, then move back to the centre and repeat this process. This mode is used to place the cursor exactly where you want to click. Wait for the cursor to switch to clicking mode.
5. Prepare to click.
Now that the cursor has been placed, a countdown starting at 3 will count down, and to click all you need to do is move. After you click, you will go back to the "put face in centre" mode.
6. Perform clicking.
a. Do a right click.
To do a right click, move down and to the right.
b. Do a double click.
To do a double click, move up and to the left.
c. Drag an item.
If you select drag, the system will emulate pressing down on the left mouse button and holding it down. Once you select drag, a 'D' will be drawn in the nouse cursor. When you choose your next place to click, the action will be to release the mouse button no matter what direction you move.
7. Entering motion codes with the Nousor (NouseCode).
When the user moves to a corner of the motion range, that corner's corresponding number is remembered and shown as a green dot. To enter a code, simply move to the correct corners in order. For example, to enter the code 1-2-3-4, move to the top left, bottom left, bottom right and top right, in that order. After any code is entered, you need to confirm whether or not you actually want to enter that code: a letter corresponding to that code will appear on the nouse cursor, which is then in confirmation mode. To confirm, you must move in a vertical direction.
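The joystick-style mapping mentioned in step 3 can be sketched as follows; the dead-zone and gain values are arbitrary illustrative choices, not those used by Nouse PVI.

def joystick_step(cursor_xy, nose_offset_xy, dead_zone=0.05, gain=400.0):
    """Move the cursor proportionally to how far the nose is from the centre.

    nose_offset_xy: nose position relative to the cursor centre, in the
    range [-1, 1] on each axis. Inside the dead zone the cursor stays put;
    outside it, the cursor velocity grows with the offset.
    """
    x, y = cursor_xy
    dx, dy = nose_offset_xy
    if abs(dx) > dead_zone:
        x += gain * dx          # further from centre -> proportionally faster
    if abs(dy) > dead_zone:
        y += gain * dy
    return (x, y)

# Example: a nose nudged slightly right of centre moves the cursor to the right.
print(joystick_step((500, 300), (0.2, 0.0)))   # -> (580.0, 300)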
Working in NouseBoard state
• Motion-based key entry using virtual (on-screen) NouseBoard and NousePad.
When you first open the NouseBoard you will be in the state where you need to put your face in the centre of the nouse cursor. Once you do this, the NouseBoard and NousePad will pop up (see Figure 2.11), ready for typing. The first state of the NouseBoard is one where the whole board is coloured white. This indicates that no key has been selected yet. Note that for each cell of the NouseBoard there are four characters (or commands), one at each corner. The first thing you need to do is decide which corner the character you would like to type resides in. When you move the NouseBoard selector to a corner, all of the cells will have their respective characters in that corner stay white while the rest of the NouseBoard turns gray. Now that a corner has been selected, you can move the NouseBoard selector around to select the character you would like to type. Once you have moved the selector to the requested cell, stay still for a couple of seconds. This will cause the character or command to turn green. Once a character has turned green, you simply need to move the selector to another cell and that character will be typed in the NousePad. The NouseBoard will then return to its initial all-white state.
• Choosing not to type a letter.
If you have a character coloured green but you don't want to type it, simply stay still in the same cell for a couple of seconds. This will cause the letter not to be typed and the NouseBoard to return to its initial all-white state.

Figure 2.11: Selecting Shift, selecting H
• Getting back nouse after losing it.
All three methods described for the Nouse Cursor state will work equally here.
• Cancelling corner selection.
If you select a corner and then realize that the character you want to type is not
in that corner of its cell, there are two ways to cancel the corner selection. The
first is to highlight a letter and then not type it. Once you do so, the keyboard will be reset to the all-white state.
• Exiting NousePad (copying to clipboard).
Once you are satisfied with the text in the NousePad, you can exit the NouseBoard state by selecting "mouse" on the NouseBoard. This will cause the NouseBoard to close and the text in the NousePad to be copied to your system's clipboard.
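To illustrate the two-stage selection (corner first, then cell) described above, the sketch below models the NouseBoard as a grid of cells with four characters each; the layout shown is invented for the example and is not the actual NouseBoard layout.

# Each cell holds four characters, indexed by corner:
# "tl" (top-left), "tr" (top-right), "bl" (bottom-left), "br" (bottom-right).
# The layout below is a made-up fragment, not the real NouseBoard.
BOARD = {
    (0, 0): {"tl": "a", "tr": "b", "bl": "c", "br": "d"},
    (0, 1): {"tl": "e", "tr": "f", "bl": "g", "br": "h"},
    (1, 0): {"tl": "i", "tr": "j", "bl": "k", "br": "l"},
    (1, 1): {"tl": "shift", "tr": "space", "bl": "enter", "br": "backspace"},
}

def select_character(corner, cell):
    """First the corner is chosen, then the cell; dwelling confirms the pick."""
    return BOARD[cell][corner]

# Example: choosing the top-right corner and then cell (0, 1) types "f".
print(select_character("tr", (0, 1)))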
CHAPTER 3
Applications
There are some applications currently using the Perceptual Intelligence technique for future development. Some of these applications are given below.
3.1 SMART ROOMS
The idea of a smart room is a little like having a butler: a passive observer who usually stands quietly in the corner but who is constantly looking for opportunities to help. Both smart rooms and smart clothes are instrumented with sensors that allow the computer to see, hear, and interpret users' actions (currently mainly cameras, microphones, and electromagnetic field sensors, and also biosensors like heart rate and muscle action). People in a smart room can control programs, browse multimedia information, and experience shared virtual environments without keyboards, special sensors, or special goggles. The key idea is that if the smart room or clothing knows something about what is going on, it can react intelligently.

Our first smart room was developed in 1989. Now there are smart rooms in Japan, England, and throughout the U.S. They can be linked together by ISDN telephone lines to allow shared virtual environment and cooperative work experiments.

Some of the perceptual capabilities available to smart rooms are described below. To act intelligently in a day-to-day environment, the first thing we need to know is: where are the people? The human body is a complex dynamic system, whose visual features are time varying, noisy signals.
Object tracking can be used to detect the location. Once the person is located, and visual and auditory attention has been focused on them, the next question to ask is: who is it? The question of identity is central to adaptive behaviour because who is giving a command is often as important as the command itself. Perhaps the best way to answer the question is to recognize people by their facial appearance and by their speech. For general perceptual interfaces, person recognition systems will need to recognize people under much less constrained conditions. One method of achieving greater generality is to employ multiple sensory inputs; audio- and video-based recognition systems in particular have the critical advantage of using the same modalities that humans use for recognition.

Figure 3.1: Smart room
We have developed computer systems for speech recognition, hand gesture recognition, face recognition, etc. After a person is identified, the next crucial factor is facial expression. For instance, a car should know if the driver is sleepy, and a teaching program should know if the student looks bored. So, just as we can recognize a person once we have accurately located their face, we can also analyse the person's facial motion to determine their expression. The lips are of particular importance in interpreting facial expression, so we give particular attention to the tracking and classification of lip shape. The first step of processing is to detect and characterize the shape of the lip region. For this task, a system called LAFTER was developed. Online algorithms are used to make maximum a posteriori estimates of 2D head pose and lip shape. Using lip shape features derived from the LAFTER system, we can train Hidden Markov Models for various mouth configurations. HMMs are well developed statistical modelling techniques for modelling time series data, and are used widely in speech recognition.
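A hedged sketch of how such per-configuration HMMs might be trained and used for classification is shown below, using the hmmlearn library; the number of states, the feature dimensionality and the random data are assumptions made for illustration only and are not details of the LAFTER system.

import numpy as np
from hmmlearn import hmm

def train_model(sequences, n_states=3):
    """Fit one Gaussian HMM per mouth configuration on its training sequences."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
    model.fit(X, lengths)
    return model

def classify(models, sequence):
    """Pick the configuration whose HMM gives the new sequence the highest likelihood."""
    return max(models, key=lambda name: models[name].score(sequence))

# Toy example with random vectors standing in for real lip-shape feature sequences.
rng = np.random.default_rng(0)
models = {
    "smile": train_model([rng.normal(1.0, 0.1, (20, 4)) for _ in range(5)]),
    "open":  train_model([rng.normal(-1.0, 0.1, (20, 4)) for _ in range(5)]),
}
print(classify(models, rng.normal(1.0, 0.1, (15, 4))))   # expected: "smile"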
3.2 SMART CLOTHES
In the case of a smart room, cameras and microphones watch people from a third-person perspective. However, when we build the computers, cameras, microphones and other sensors into our clothes, the computer's view moves from a passive third-person to an active first-person vantage point. That means smart clothes can be more intimately and actively involved in the user's activities. If these wearable devices have sufficient understanding of the user's situation, that is, enough perceptual intelligence, then they should be able to act as an intelligent personal agent, proactively providing the wearer with information relevant to the current situation.
For example, suppose we place a GPS (Global Positioning System) sensor into our belt; then navigation software can help us find our way around by whispering directions in our ear or showing a map on a display built into our glasses. Similarly, body-worn accelerometers and tilt sensors can distinguish walking from standing from sitting, and biosensors such as galvanic skin response (GSR) are correlated with mental arousal, allowing the construction of wearable medical monitors. A simple but important application for a medical wearable is to give people feedback about their alertness and stress level. The Centre for Future Health at the University of Rochester has developed early warning systems for people with high-risk medical problems, and eldercare wearables to help keep seniors out of nursing homes. These wearable devices are examples of personalized perceptual intelligence, allowing proactive fetching and filtering of information for immediate use by the wearer. With the development of new self-care technology, tiny wearable health monitors may one day continuously collect signals from the body and transmit data to a base station at home. Predictive software may identify trends and make specific health predictions, so users can prevent crises and better manage daily routines and health interventions.
Consider that we have built wearables that continuously analyze background sound to detect human speech. Using this information, the wearable is able to know when you and another person are talking, so that it won't interrupt. Now researchers are going a step further, using microphones built into a jacket to allow word-spotting software to analyze your conversation and remind you of relevant facts. Cameras make attractive candidates for a wearable, perceptually intelligent interface, because a sense of environmental context may be obtained by pointing the camera in the direction of the user's gaze. For instance, by building a camera into your eyeglasses, face recognition software can help you remember the name of the person you are looking at.
CHAPTER 4
Conclusion
Now it is possible to track people's motion, identify them by voice and facial appearance, and recognize their actions in real time using only modest computational resources. By using this perceptual information we have been able to build smart rooms and smart clothes that can recognize people, understand their speech, allow them to control information displays without mouse or keyboard, communicate by facial and hand gesture, and interact in a more personalized, adaptive manner.

Our overall goal is to make computers seem as natural to interact with as another person. Sometimes this means that there should be no interface; the system should just recognize what is going on and do the right thing. At other times, it means that the system should engage in a dialogue with a person. We want a system that is truly human centred and natural to interact with; this requires not just perception but also a significant understanding of the semantics of the everyday world and the reasoning capabilities to use this understanding flexibly.
Bibliography
[1] Alex Pentland, "Perceptual Intelligence", Volume 1707/1999, pp. 74-88, 1999.
[2] Matthew Turk, George Robertson, "Perceptual user interfaces (introduction)", Communications of the ACM, vol. 43, no. 3, pp. 32-34, March 2000.
[3] Blair MacIntyre, Steven Feiner, "Future Multimedia User Interfaces", Multimedia Systems, vol. 4, pp. 250-268, 1996.
[4] Marcia Seivert Entwistle, "The Performance of Automated Speech Recognition Systems Under Adverse Conditions of Human Exertion", International Journal of Human-Computer Interaction, pp. 127-140, 2003.
[5] Oliver, N., Bernard, F., Coutaz, J., and Pentland, A., "LAFTER: Lips and face tracker", IEEE CVPR '97, June 17-19, 1997, San Juan, PR, IEEE Press, New York, N.Y.
[6] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition", Proceedings of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[7] Ying Wu, Thomas S. Huang, "Vision-Based Gesture Recognition: A Review", Volume 1739, pp. 103-115, 1999.
[8] D. Gorodnichy, G. Roth, "Nouse 'Use your nose as a mouse' perceptual vision technology for hands-free games and interfaces", Image and Vision Computing, vol. 22, no. 12, pp. 931-942, NRC 47140, 2004.
[9] D. Gorodnichy, "On importance of nose for face tracking", in Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pp. 188-196, Washington DC, May 20-21, 2002.
[10] Zimmerman, T., Lanier, J., Blanchard, C., Bryson, S., Harvill, Y., "A Hand Gesture Interface Device", in Proc. ACM CHI + GI '87, pp. 189-192, 1987.
[11] Goldberg, D., Richardson, C., "Touch-Typing with a Stylus", in Proc. ACM INTERCHI '93, pp. 80-87, 1993.
[12] S. Oviatt, W. Wahlster, "Human-Computer Interaction (Special Issue on Multimodal Interfaces)", Lawrence Erlbaum Associates, vol. 12, no. 1-2, 1997.
[13] Hauptmann, A. G., McAvinney, P., "Gestures with Speech for Graphics Manipulation", Intl. J. Man-Machine Studies, vol. 38, pp. 231-249, 1993.
[14] J. Tu, T. Huang, H. Tao, "Face as mouse through visual face tracking", in Second Workshop on Face Processing in Video, Proc. of CRV'05, ISBN 0-7695-2319-6, 2005.
[15] R. Ruddarraju, "Perceptual user interfaces using vision-based eye tracking", in 5th International Conference on Multimodal Interfaces, pp. 227-233, Vancouver, British Columbia, Canada, 2003.