ENHANCING VIRTUAL LEARNING THROUGH EMOTIONAL AGENTS


CHAPTER 1

INTRODUCTION

1.1 A Brief Description

Virtual learning is increasing day by day, and Human Computer

Interaction is a necessity to make virtual learning a better experience.

The emotions of a person play a major role in the learning process.

Hence, the proposed work detects the emotions of a person from his or her facial expressions.

For a facial expression to be detected face location and area must

be known; therefore in most cases, emotion detection algorithms start

with face detection, taking into account the fact that facial emotions are mostly depicted using the mouth. Consequently, algorithms for eye and

mouth detection and tracking are necessary, in order to provide the

features for subsequent emotion recognition. In this project we propose a

detection system for natural emotion recognition.

1.2 Need For Face Detection

Human activity is a major concern in a wide variety of

applications such as video surveillance, human computer interface, face

recognition and face database management. Most face recognition

algorithms assume that face location is known. Similarly, face-tracking

algorithms often assume that initial face location is known. In order to

improve the efficiency of the face recognition systems, an efficient face

detection algorithm is needed.


1.3 Need For Emotion Detection

Human beings communicate through facial emotions in day-to-day interactions with others. Humans perceive the emotions of fellow humans naturally and with inherent accuracy, and they can express their inner state of mind through emotions. Many times, an emotion indicates that a person needs help. Enabling computers to recognise emotions is an important research area in Human Computer Interfacing (HCI). Such an interface can be a welcome aid for the physically disabled, for those who are unable to express their needs by voice or by other means, and especially for those who are confined to bed. Human emotion can be detected

through facial actions or through biosensors. Facial actions are imaged

through still or video cameras. From still images, taken at discrete

times, the changes in the eye and mouth areas can be observed. Measuring

and analysing such changes will lead to the determination of human

emotions.

1.4 Existing Face Detection Approaches

1.4.1 Feature Invariant Methods

These methods aim to find structural features that exist even when

the pose, viewpoint, or lighting conditions vary, and then use these to

locate faces. These methods are designed mainly for face localization.


Texture

Human faces have a distinct texture that can be used to

separate them from other objects. The textures are computed using second-order statistical features on sub-images of 16x16 pixels. Three

types of features are considered: skin, hair, and others. To infer the

presence of a face from the texture labels, the votes of occurrence of hair

and skin textures are used. Colour information can also be incorporated into the face-texture model. Using this model, a scanning scheme for face detection in colour scenes has been devised, in which the orange-like parts, including the face areas, are enhanced. One advantage of this

approach is that it can detect faces which are not upright or have features

such as beards and glasses.

Skin Colour

Human skin colour has been used and proven to be an effective

feature in many applications from face detection to hand tracking.

Although different people have different skin colours, several studies have shown that the major difference lies largely in intensity rather than in chrominance. Several colour spaces have been utilized to label

pixels as skin including RGB, Normalized RGB, HSV, YCbCr, YIQ,

YES, CIE XYZ and CIE LUV.

1.4.2 Template Matching Methods

In template matching, a standard face pattern is manually

predefined or parameterized by a function. Given an input image, the

correlation values with the standard patterns are computed for the face contour, eyes, nose, and mouth independently. The existence of a face is

determined based on the correlation values. This approach has the

advantage of being simple to implement. However, it has proven to be

inadequate for face detection since it cannot effectively deal with

variation in scale, pose, and shape. Multiresolution, multiscale, sub

templates, and deformable templates have subsequently been proposed

to achieve scale and shape invariance.

Predefined Face Template

In this approach several sub templates for nose, eyes, mouth and

face contour are used to model a face. Each sub template is defined in

terms of line segments. Lines in the input image are extracted based on

greatest gradient change and then matched against the sub templates. The

correlations between sub images and contour templates are computed

first to detect candidate location of faces. Then, matching with the other

sub templates is performed at the candidate positions. In other words, the

first phase determines focus of attention or region of interest and second

phase examines the details to determine the existence of a face.

1.4.3 Appearance Based Methods

In the appearance based methods the templates are learned from

examples in images. In general, appearance based methods rely on

techniques from statistical analysis and machine learning to find the

relevant characteristics of face and non face images. The learned

characteristics are in the form of distribution models that are

consequently used for face detection.


1.5 Existing Emotion Detection Approaches

1.5.1 Genetic Algorithm

The eye feature plays a vital role in classifying the face emotion

using a Genetic Algorithm. The acquired images must go through a few pre-processing steps such as grayscale conversion, histogram equalization and filtering. A Genetic Algorithm methodology estimates the emotion from the eye feature alone. Observation of various emotions leads to a unique characteristic of the eye: the eye exhibits ellipses of different parameters for each emotion. The Genetic Algorithm is adopted to optimize

the ellipse characteristics of the eye features. Processing time for Genetic

Algorithm varies for each emotion.

1.5.2 Neural Network

Neural networks have found profound success in the area of

pattern recognition. By repeatedly showing a neural network inputs that have been classified into groups, the network can be trained to discern the criteria used for classification, and it can do so in a generalized manner, allowing successful classification of new inputs not seen during training. With the

explosion of research in emotions in recent years, the application of

pattern recognition technology to emotion detection has become

increasingly interesting. Since emotion has become an important

interface for the communication between human and machine, it plays a

basic role in rational decision-making, learning, perception, and various

cognitive tasks.


Human emotion can be detected based on physiological measurements or facial expressions. Since humans use the same facial muscles when expressing a particular emotion, the emotion can be quantified. Primary emotions such as anger, disgust, fear, happiness, sadness and surprise can be classified using a neural network.

1.5.3 Feature Point Extraction

Template Matching

An interesting approach in the problem of automatic facial feature

extraction is a technique based on the use of template prototypes, which

are portrayed on the 2-d space in gray scale format. This is a technique

that is, to some extent, easy to use, but also effective. It uses correlation

as a basic tool for comparing the template with the part of the image that

we wish to recognize. An interesting question that arises is how recognition with template matching behaves at different resolutions.

This involves multi-resolution representations through the use of

Gaussian pyramids. The experiments proved that not very high

resolutions are needed for template matching recognition. For example,

the use of templates of 36x36 pixels proved sufficient. This fact shows

us that template matching is not as computationally complex as we

originally imagined.

The face detection algorithm starts by scanning the given image with the SSR filter and locating the face candidates. It then assembles candidates that are close to each other using connected components (fewer candidates mean less processing time, which matters for a real-time application). The centre of each cluster is taken and a template is extracted based on this centre; the template is passed to a Support Vector Machine, which tells us whether it is a face or not. If it is, the eyes are located, and then the nose.

Face detection techniques fall into two categories:

1. Feature-based approach

2. Image-based approach.

Template matching can provide the basis for a human face detection system.

1. Feature Based Technique:

The techniques in the first category make use of apparent

properties of the face such as face geometry, skin colour, and motion. Although the feature-based technique can achieve high speed in face detection, it suffers from poor reliability under varying lighting conditions.

2. Image Based Technique:

The image-based approach takes advantage of recent advances in pattern recognition theory. Most image-based approaches apply a window scanning technique for detecting faces, which requires a large amount of computation.

To achieve a fast and reliable face detection system, we propose a method which combines both the feature-based and image-based approaches using the SSR filter.


1.5.4 Template Matching

Template matching is a technique in digital image processing for

finding small parts of an image which match a template image or as a

way to detect edges in images.

The basic method of template matching uses a convolution mask

(template), tailored to a specific feature of the search image, which we

want to detect.

This technique can be easily performed on grey images or edge

images. The convolution output will be highest at places where the

image structure matches the mask structure, where large image values

get multiplied by large mask values.
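As an illustration of this idea, normalized cross-correlation is often used in place of plain convolution because it is insensitive to overall brightness; the following MATLAB sketch locates a feature with it (the file names are placeholders, not part of the proposed system).

face     = rgb2gray(imread('face.png'));          % hypothetical search image
template = rgb2gray(imread('eye_template.png'));  % hypothetical feature template

c = normxcorr2(template, face);                   % correlation surface
[~, idx] = max(abs(c(:)));                        % strongest match
[peakY, peakX] = ind2sub(size(c), idx);
% normxcorr2 pads the result, so subtract the template size to get the
% top-left corner of the matched region in the search image.
topLeft = [peakY - size(template, 1) + 1, peakX - size(template, 2) + 1];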

Eyes and Nose detection using SSR Filter.

A real-time face detection algorithm uses the Six-Segmented Rectangular (SSR) filter for eye and nose detection.

SSR is a six segment rectangle as illustrated in Figure 1.1.

Figure 1.1 SSR Filter


At the beginning, a rectangle is scanned throughout the input

image. This rectangle is segmented into six segments as shown below.

The SSR filter is used to detect the Between-the-Eyes based on

two characteristics of face geometry.

BTE - Between The Eyes

The detection of the BTE point is based on the image characteristics of that area of the face. The intensity of the BTE image

closely resembles a hyperbolic surface as shown in Figure 1.2. The BTE

is the saddle point on the hyperbolic surface. A rotationally invariant

filter could thus be devised for detecting the BTE area.


Figure 1.2 Determination of BTE

The nose area is usually calculated to be 2/3rd of the value of L as

shown in Figure 1.3. The L is calculated as the approximate distance

between both eyes and the distance from eye to nose.

Figure1.3 Nose Tip Search Area Relative to Eyes

The common BTE area on the human face resembles a hyperbolic surface. The proposed work uses this hyperbolic model to describe the BTE region; the centre of the BTE is thus the saddle point on the surface.
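The following MATLAB sketch illustrates how the segment sums of an SSR window can be evaluated efficiently with an integral image; the window position, size and brightness conditions below are assumptions for illustration, not the exact criteria of the cited method.

gray = double(rgb2gray(imread('frame_0001.png')));   % hypothetical input frame
ii = zeros(size(gray) + 1);
ii(2:end, 2:end) = cumsum(cumsum(gray, 1), 2);        % zero-padded integral image

% Sum of pixels in rows r1..r2 and columns c1..c2 (1-based indices).
rectsum = @(r1, r2, c1, c2) ii(r2+1, c2+1) - ii(r1, c2+1) - ii(r2+1, c1) + ii(r1, c1);

r = 60; c = 80; h = 24; w = 48;            % assumed window position and size
rs = [r, r + h/2, r + h];                  % row boundaries (2 rows)
cs = [c, c + w/3, c + 2*w/3, c + w];       % column boundaries (3 columns)
S = zeros(2, 3);                           % the six segment sums
for i = 1:2
    for j = 1:3
        S(i, j) = rectsum(rs(i), rs(i+1) - 1, cs(j), cs(j+1) - 1);
    end
end
% Assumed candidate test: the upper-centre segment (between the eyes) is
% brighter than the upper left/right (eye) segments, and the eye segments
% are darker than the cheek segments below them.
isCandidate = S(1,2) > S(1,1) && S(1,2) > S(1,3) && S(2,1) > S(1,1) && S(2,3) > S(1,3);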

Blobs

Blobs provide a complementary description of image structures in

terms of regions, as opposed to corners that are more point-like.

Nevertheless, blob descriptors often contain a preferred point (a local

maximum of an operator response or a centre of gravity) which means


that many blob detectors may also be regarded as interest point

operators. Blob detectors can detect areas in an image which are too

smooth to be detected by a corner detector.

Gabor Filtering

It is possible for Gabor filtering to be used in a facial recognition

system. The neighbouring region of a pixel may be described by the

response of a group of Gabor filters in different frequencies and

directions, which have a reference to the specific pixel. In that way, a

feature vector may be formed, containing the responses of those filters.

Automated Facial Feature Extraction

In this approach, as far as the frontal images are concerned, the

fundamental concept upon which the automated localization of the

predetermined points is based consists of two steps: the hierarchic and

reliable selection of specific blocks of the image and subsequently the

use of a standardized procedure for the detection of the required

benchmark points. In order for the former of the two processes to be

successful, the need of a secure method of approach has emerged. The

detection of a block describing a facial feature relies on a previously,

effectively detected feature. By adopting this reasoning, the choice of the

most significant characteristic -the ground of the cascade routine- has to

be made. The importance that each of the commonly used facial features,

regarding the issue of face recognition, has already been studied by other

researchers. The outcome of surveys proved the eyes to be the most

dependable and easily located of all facial features, and as such they

were used. The techniques that were developed and tried separately,

utilize a combination of template matching and Gabor filtering.


The Hybrid Method

The basic search for the desired feature blocks is performed by a simple template matching procedure. Each feature prototype is selected from one of the frontal images of the face base. The comparison criterion used is the maximum correlation coefficient between the prototype and the candidate blocks examined within a suitably restricted area of the face.

In order to restrict the search area effectively, knowledge of human face physiology has been applied, without hindering the satisfactory performance of the algorithm in cases of small violations of the initial limitations. However, the final block selection by this method alone has not always been successful; therefore, a measure of reliability was needed.

For that reason, the use of Gabor filtering was deemed to be one suitable

tool. As can be mathematically deduced from the filter's form, it ensures simultaneously optimal localization in the spatial domain as well as in the frequency domain.

The filter is applied both on the localized area and the template in

four different spatial frequencies. Its response is regarded as valid, only

in the case that its amplitude exceeds a saliency threshold. The area with

minimum phase distance from its template is considered to be the most

reliably traced block.


1.5.5 Preprocessing and Postprocessing of Images

Image Processing Toolbox provides reference-standard algorithms

for pre-processing and post processing tasks that solve frequent system

problems, such as interfering noise, low dynamic range, out-of-focus

optics, and the difference in colour representation between input and

output devices. Region-of-interest tools can be used to select items in the original image and create a mask from them.

Image enhancement techniques in the Image Processing Toolbox enable the user to increase the signal-to-noise ratio and accentuate image features by modifying the colours or intensities of an image. Using the toolbox we can, for example (a small sketch follows the list below):

• Perform histogram equalization

• Perform decorrelation stretching

• Remap the dynamic range

• Adjust the gamma value

• Perform linear, median or adaptive filtering.
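A minimal MATLAB sketch of these operations is given below (the input file name is a placeholder).

rgb  = imread('student.jpg');                       % hypothetical input frame
gray = rgb2gray(rgb);

eq    = histeq(gray);                               % histogram equalization
dstr  = decorrstretch(rgb);                         % decorrelation stretching
remap = imadjust(gray, stretchlim(gray), []);       % remap the dynamic range
gam   = imadjust(gray, [], [], 0.8);                % adjust the gamma value
med   = medfilt2(gray, [3 3]);                      % median filtering
adapt = wiener2(gray, [5 5]);                       % adaptive (Wiener) filtering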

1.5.6 Typical Tasks of Computer Vision

Each of the application areas in computer vision systems employs a range of computer vision tasks, more or less well-defined measurement

problems or processing problems, which can be solved using a variety of

methods. Some examples of typical computer vision tasks are presented


below.

Recognition

The classical problem in computer vision, image processing and

machine vision is that of determining whether or not the image data

contains some specific object, feature, or activity. This task can normally

be solved robustly and without effort by a human, but is still not

satisfactorily solved in computer vision for the general case: arbitrary objects in arbitrary situations. The existing methods for dealing

with this problem can at best solve it only for specific objects, such as

simple geometric objects (e.g., polyhedrons), human faces, printed or

hand-written characters, or vehicles, and in specific situations, typically

described in terms of well-defined illumination, background, and pose of

the object relative to the camera.

Different varieties of the recognition problem are described in the

literature:

Recognition: one or several pre-specified or learned objects or object

classes can be recognized, usually together with their 2D positions in the

image or 3D poses in the scene.

Identification: An individual instance of an object is recognized.

Examples: identification of a specific person's face or fingerprint, or

identification of a specific vehicle. Detection based on relatively simple

and fast computations is sometimes used for finding smaller regions of

interesting image data.


CHAPTER 2

LITERATURE SURVEY

Jarkiewicz et al [1] propose an emotion detection system where

analysis is done using a Haar-like detector and face detection is done

using a hybrid approach. The technique proposed here is to localize

seventeen characteristic points on the face and based on their

displacements certain emotions can be automatically recognized. An

improvement over the above proposed method is the feature extraction

technique.

A face detection algorithm is proposed by Zhao et al [2] for colour

images. This work is based on an adaptive threshold and a chroma chart

that shows probability of skin colours. Thus by identifying the skin

region, the facial part can be identified in the image. This technique

when used with the feature extraction technique yields better results.

Maglogiannis et al [3] present an integrated system for emotion

detection. The system uses colour images and it is composed of three

modules. The first module implements skin detection, using Markov

random fields for image segmentation and face detection. A second

module is responsible for eye and mouth detection and extraction. The

specific module uses the HSV colour space of the specified eye and

mouth region. The third module detects the emotions, pictured in the

eyes and mouth using edge detection and measuring the gradient of the

eye’s and mouth’s region.


A detailed experimental study of face detection algorithms based

on skin colour has been made by Singh et al [4]. Three colour spaces,

RGB, YCbCr and HSI, are of main concern. The algorithms of these

three colour spaces have been compared and then combined to get a new

skin colour based face detection algorithm which gives higher accuracy.

A survey by Yang et al [5] categorizes and evaluates the various

face detection algorithms. Other relevant issues such as benchmarking,

data collection and evaluation techniques have also been discussed. The

algorithms have been analysed and their limitations have been identified.

The Eigenface method [6] which uses principal components

analysis for dimensionality reduction, yields projection directions that

maximize the total scatter across all classes, i.e., across all images of all

faces. In choosing the projection which maximizes total scatter, principal

components analysis retains unwanted variations due to lighting and

facial expression. The Eigenface method is also based on linearly

projecting the image space to a low dimensional feature space.

The Bunch Graph technique [7] has been fairly reliable to

determine facial attributes from single images, such as gender or the

presence of glasses or a beard. If this technique was developed to extract

independent and stable personal attributes, such as age, race or gender,

recognition from large databases could be improved and sped up considerably by preselecting corresponding sectors of the database. Image deblurring algorithms include blind, Lucy-Richardson, Wiener and regularized filter deconvolution, as well as conversions between point spread and optical transfer functions.

The Fisherfaces method [8], a derivative of Fisher’s Linear

Discriminant (FLD) maximizes the ratio between class scatter to that of

within-class scatter and appears to be the best at extrapolating and

interpolating over variation in lighting, although the Linear Subspace

method is a close second. The Eigenface method is also based on linearly

projecting the image space to a low dimensional feature space.

However, the Eigenface method, which uses principal components

analysis, yields projection directions that maximize the total scatter.

Cheng-Chin Chiang et al. [9] present a real-time face detection algorithm for locating faces in images and videos. This

algorithm finds not only the face regions, but also the precise locations

of the facial components such as eyes and lips. The algorithm starts from

the extraction of skin pixels based upon rules derived from a simple

quadratic polynomial model. Interestingly, with a minor modification,

this polynomial model is also applicable to the extraction of lips. The

benefits of applying these two similar polynomial models are twofold.

First, much computation time is saved. Second, both extraction

processes can be performed simultaneously in one scan of the image or

video frame. The eye components are then extracted after the extraction

of skin pixels and lips. Afterwards, the algorithm removes the falsely

extracted components by verifying with rules derived from the spatial

and geometrical relationships of facial components. Finally, the precise

face regions are determined accordingly. According to the experimental

results, the proposed algorithm exhibits satisfactory performance in

terms of both accuracy and speed for detecting faces with wide

variations in size, scale, orientation, colour, and expressions.


Hironori Yamauchi [9] proposed bio-security using face recognition for industrial use, describing current face recognition systems, which often use either SVM or AdaBoost techniques for the face detection part and PCA for the face recognition part.

In Robust Real-Time Face Tracking for the Analysis of Human Behaviour, Damien Douxchamp and Nick Campbell [10] presented a real-time system for face detection, tracking and characterization from omnidirectional video. Viola-Jones is used as a basis for face detection, and then various filters are applied to eliminate false positives. Gaps between two detections of a face by the Viola-Jones algorithm are filled using colour-based tracking.

Shinjiro Kawato and Nobuji Tetsutani [11] proposed Scale Adaptive Face Detection and Tracking in Real Time for the detection and tracking of faces in video sequences in real time. It can be applied to a wide range of face scales. Fast extraction of face candidates is done with a Six-Segmented Rectangular (SSR) filter, and face verification is done by a support vector machine.


In Real-Time Face Detection Using a Six-Segmented Rectangular Filter (SSR Filter), Oraya Sawettanusorn et al. [12] proposed a real-time face detection algorithm using the Six-Segmented Rectangular (SSR) filter, distance information, and a template matching technique. Between-the-Eyes is selected as the face representative because its characteristic is common to most people and is easily seen over a wide range of face orientations. A rectangle is scanned throughout the face image and divided into six segments.

Research by Li Zhang et al. [13] concentrates on intelligent

neural network based facial emotion recognition and Latent Semantic

Analysis based topic detection for a humanoid robot. The work has first

of all incorporated Facial Action Coding System describing physical

cues and anatomical knowledge of facial behavior for the detection of

neutral and six basic emotions from real-time posed facial expressions.

Feedforward neural networks (NN) are used to respectively implement

both upper and lower facial Action Units (AU) analyzers to recognize six

upper and 11 lower facial actions including Inner and Outer Brow

Raiser, Lid Tightener, Lip Corner Puller, Upper Lip Raiser, Nose

Wrinkler, Mouth Stretch etc. An artificial neural network based facial

emotion recognizer is subsequently used to accept the derived 17 Action

Units as inputs to decode neutral and six basic emotions from facial

expressions. Moreover, in order to advise the robot to make appropriate

responses based on the detected affective facial behaviors, Latent

Semantic Analysis is used to focus on underlying semantic structures of

the data and go beyond linguistic restrictions to identify topics embedded

in the users’ conversations. The overall development is integrated with a

modern humanoid robot platform under its Linux C++ SDKs. The work


presented here shows great potential in developing personalized

intelligent agents/robots with emotion and social intelligence.

CHAPTER 3

PROBLEM DEFINITION

The aim of this project is to detect human facial emotions namely

happiness, sadness and surprise. This is done by first detecting the face

from an image, based on the skin colour detection technique. It is then

followed by image segmentation and feature extraction techniques,

where eye and mouth parts are extracted. Based on the eye and mouth

variances the emotions are detected. From the position of eyes, emotions

are detected. If the person is happy or sad then eyes will be open and

when a person is surprised, eyes will be wide open. Similarly for the lips

the shape and colour properties are important. Depending on the shape of

the lips, emotions are detected, i.e., if the lips are closed and curved upwards, it indicates happiness; if the lips are open, it indicates surprise, and so on. Therefore, based on the facial features such as the eyes and mouth,

emotions are detected and recognized.


CHAPTER 4

FACIAL EMOTION DETECTION AND RECOGNITION

4.1 Overview of the Algorithm

Our project proposes an emotion detection system wherein the facial emotions, namely happy, sad and surprised, are detected. First

the face is detected from an image using the skin colour model. This is

then followed by feature extraction such as eyes and mouth. This is used

for further processing to detect the emotion. For detecting the emotion

we take into account the fact that emotions are basically represented

using mouth expressions. This is done using the shape and colour

properties of the lips.

4.1.1 Video Fragmentation

The input video of an e-learning student is acquired using an image

acquisition device and stored into a database. This video is extracted and


fragmented into several frames to detect the emotions of the e-learning student and thereby improve the virtual learning environment. The video acquisition feature records the ongoing emotional changes in the e-learning student, and the resulting emotions are detected by mapping the changes in the eye and lip regions. The

videos are recorded into a database before processing, thereby making it

useful to analyse the changes of emotion for a particular subject or

during a particular time of the day.
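A minimal MATLAB sketch of this fragmentation step is shown below; the file name and the sampling interval are assumptions for illustration.

v = VideoReader('student.avi');            % recorded e-learning session
fps  = v.FrameRate;
step = round(2 * fps);                     % sample roughly one frame every 2 s
for k = 1:step:v.NumberOfFrames
    frame = read(v, k);                    % extract frame k as an RGB image
    imwrite(frame, sprintf('frame_%04d.png', k));   % store for later processing
end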

Frame rate and motion blur are important aspects of video quality.

Motion blur is a natural effect when you film the world in discrete time

intervals. When a film is recorded at 25 frames per second, each frame

has an exposure time of up to 40 milliseconds (1/25 seconds). All the

changes in the scene over that entire 40 milliseconds will blend into the

final frame. Without motion blur, animation will appear to jump and will

not look fluid.

When the frame rate of a movie is too low, your mind will no longer be

convinced that the contents of the movie are continuous, and the movie

will appear to jump (also called strobing).

The human eye and its brain interface, the human visual system, can

process 10 to 12 separate images per second, perceiving them

individually, but the threshold of perception is more complex, with

different stimuli having different thresholds: the average shortest


noticeable dark period, such as the flicker of a cathode ray tube monitor

or fluorescent lamp, is 16 milliseconds, while single-millisecond visual

stimulus may have a perceived duration between 100ms and 400ms due

to persistence of vision in the visual cortex. This may cause images

perceived in this duration to appear as one stimulus, such as a 10ms

green flash of light immediately followed by a 10ms red flash of light

perceived as a single yellow flash of light.

4.1.2 Face Detection

The first step for face detection is to make a skin colour model.

After the skin colour model is produced, the test image is skin

segmented (binary image) and the face is detected. The result of Face

Detection is processed by a decision function based on the chroma

components (CrCb from YCbCr and Hue from HSV). Before the result is

passed to the next module, it is cropped according to the skin mask.

Small background areas which could lead to errors during the next stages

will be deleted.

A model image of face detection with the bounding box is

illustrated below in Figure 4.1.


Figure 4.1 Face Detection

4.1.3 Feature Extraction

After the face has been detected the next step is feature extraction

where the eyes and mouth are extracted from the detected face. For eye extraction, this is done by creating two eye maps, a chrominance eye

map and a luminance eye map. The two maps are then combined to

locate the eyes in a face image, as shown in Figure 4.2.


Figure 4.2 Feature Detection

To locate the mouth region, we use the fact that it contains

stronger red components and weaker blue components than other facial

regions (Cr > Cb), and thus the mouth map is constructed. Based on this, the mouth region is extracted. Finally, the extracted eyes and mouth from the face image, according to the maps, are passed on to the next module of

our algorithm.

4.1.4 Emotion Detection

The last module is emotion detection. This module makes use of

the fact that the emotions are expressed majorly with the help of eye and

mouth expressions, as shown in Figure 4.3. Emotion detection from lip

images is based on colour and shape properties of human lips. Having a

binary lip image, shape detection can be performed. Thus, depending on

the shape of the lips and other morphological properties the emotions are

detected. A computer is being taught to interpret human emotions based

on lip pattern, according to research published in the International

Journal of Artificial Intelligence and Soft Computing. The system could

improve the way we interact with computers and perhaps allow disabled

people to use computer-based communications devices, such as voice

synthesizers, more effectively and more efficiently.


Figure 4.3 Emotion Detection

4.2 Architectural Design

The architectural diagram shows the overall working of the

system, where captured colour image sample is taken as the input and it

is processed using image processing tools and is analysed to locate the

facial features such as eyes and mouth, which will be further processed

to recognize the emotion of the person. After the localization of the facial

features the next step is to localize the characteristic points on the face.

This is followed by the feature extraction process, where features such as the eyes and mouth are extracted.

Based on the variations of eyes and mouth, emotion of a person is

detected and recognized. For a person who is happy, the eyes will be

open and the lips will be closed and curved upwards, whereas for a person who is


sad, the eyes will be open and the lips will be closed facing downwards.

Similarly for a person who is surprised the eyes will be wide open and

there will be a considerable displacement of the eye brows from the eyes

and the mouth will be wide open. Based on the above measures mood

exhibited by a person is detected and it is recognized.

The Figure 4.4 shows the overall working of the system where the

input is the image and the output is the emotion recognized such as

happy, sad or surprised.


Figure 4.4 – Architectural Diagram

CHAPTER 5

REQUIREMENT ANALYSIS


The Software Requirements Specification is based on the problem

definition. Ideally, the requirement specification states the “what” of the software product without implying the “how”, which is the concern of the software design; it does not specify how the product will provide the required features.

5.1 Product Requirements

5.1.1 Input Requirements

The input for this work is the video of an e-learning student, which

may contain the human face.

5.1.2 Output Requirements

The output is the detected facial emotion such as happy, sad, and

surprised.

5.2 Resource Requirements

The hardware configuration requirement is shown in Table 5.1 and

software configuration required to run this software is shown in Table

5.2.

5.2.1 Hardware Requirements


Table 5.1 – Hardware Requirements

S.No Feature Configuration

1 CPU Intel core 2 Duo processor

2 Main memory 1 GB RAM

3 Hard Disk 60 GB Disk size

The above configuration in the Table 5.1 is the minimum hardware

requirements for the proposed system.

5.2.2 Software Requirements

Table 5.2 – Software Requirements

S.No Software Version

1 Windows 7

2 Matlab R2012a

3 Picasa 3

The proposed system is executed using Windows 7, Matlab R2012a and Picasa 3, as shown in Table 5.2.

CHAPTER 6

DEVELOPMENT PROCESS AND DOCUMENTATION


6.1 Face Detection

Face detection is used in biometrics, often as a part of or together with

a facial recognition system. It is also used in video surveillance, human

computer interface and image database management. Some recent digital

cameras use face detection for autofocus. Face detection is also useful

for selecting regions of interest in photo slideshows that use a pan-and-

scale Ken Burns effect.

Face detection can be regarded as a specific case of object-class

detection. In object-class detection, the task is to find the locations and

sizes of all objects in an image that belong to a given class. Examples

include upper torsos, pedestrians, and cars.

Face detection can be regarded as a more general case of face

localization. In face localization, the task is to find the locations and

sizes of a known number of faces. In face detection, one does not have

this additional information.

6.1.1 Sample Collection

Sample skin-coloured pixels are collected from images of people belonging to different races. Each pixel is carefully chosen from the images so that regions which do not belong to the skin colour are not included.

6.1.2 Chroma Chart Preparation

Chroma chart shown in Figure 6.1 is the distribution of the skin

colour of different people over the chromatic colour space.


Figure 6.1 – Chroma Chart Diagram

Here the chromatic colour is taken in the (Cb, Cr) colour space.

Normally the images will be stored in the (R, G, B) format. A suitable

conversion is needed to convert it into YCbCr colour space.

The collected sample pixel values are converted from the (R, G, B) colour space to the YCbCr colour space, and a chart is drawn by taking Cb along the x-axis and Cr along the y-axis. The obtained chart shows the distribution of the skin colour of different people. The intensity (Y) component is not considered because it has very little effect on the chrominance variation; Figure 6.1 shows the resulting distribution.
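A minimal MATLAB sketch of this conversion and plot is given below (the file holding the hand-picked skin pixels is a placeholder name).

rgb   = imread('skin_samples.png');        % hypothetical sampled skin pixels
ycbcr = rgb2ycbcr(rgb);                    % (R, G, B) to YCbCr conversion
Cb = ycbcr(:,:,2);  Cr = ycbcr(:,:,3);
plot(Cb(:), Cr(:), '.');                   % Cb along the x-axis, Cr along the y-axis
xlabel('Cb'); ylabel('Cr'); title('Chroma chart of sampled skin pixels');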

6.1.3 Skin Colour Model

The skin-likelihood image is obtained using the developed skin

colour model. The skin colour model is the distribution of skin colour


over the chromatic colour space. Each and every pixel in the given input

image is compared with the skin colour model. If the particular

chrominance pair is present in the model, then that pixel is made a white pixel by assigning its red, green and blue components the value 255. If the chrominance pair is not present, the pixel is made a black pixel by assigning its red, green and blue components the value 0.

The result of Face Detection is first processed by a decision

function based on the chroma components (Cr and Cb from YCbCr, and Hue from HSV). If all the following conditions are true for a pixel, it is marked as a skin area: 140 < Cr < 165 and 140 < Cb < 195. The obtained image is a binary image where the white coloured regions show the possible

skin coloured region. The black region shows the non-skin coloured

region. Before the result is passed to the next module, it is cropped

according to the skin mask. Small background areas which could lead to

errors during the next stages will be deleted.
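A minimal MATLAB sketch of this skin segmentation, using the thresholds stated above, is shown below (the frame name and the small-area limit are assumptions).

rgb   = imread('frame_0001.png');          % hypothetical fragmented frame
ycbcr = rgb2ycbcr(rgb);
Cb = double(ycbcr(:,:,2));  Cr = double(ycbcr(:,:,3));
skin = (Cr > 140 & Cr < 165) & (Cb > 140 & Cb < 195);    % binary skin mask
skin = bwareaopen(skin, 200);               % delete small background areas
out  = uint8(repmat(skin, [1 1 3])) .* rgb; % keep only the skin-coloured region
imshow(skin);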

6.2 Feature Extraction

Feature extraction is the process of detecting the required features

from the face and extracting it by cropping or other such technique.

6.2.1 Eye Detection

Two separate eye maps are built, one from the chrominance

component and the other from the luminance component. These two


maps are then combined into a single eye map. The eye map from the

chrominance is based on the fact that high-Cb and low-Cr values can be

found around the eyes. The following formula helps us to construct the chrominance eye map:

EyeMapChr = 1/3 * (Cb*Cb + (255-Cr)*(255-Cr) + Cb/Cr)

Eyes usually contain both dark and bright pixels in the luminance

component, so gray scale operators can be designed to emphasize

brighter and darker pixels in the luminance component around eye

regions. Such operators are dilation and erosion. We use gray scale

dilation and erosion with a spherical structuring element to construct the

eye map.

The eye map from the chrominance is then combined with the eye

map from the luminance by an AND (multiplication) operation: EyeMap = (EyeMapChr) AND (EyeMapLum). The resulting eye map is then

dilated and normalized to brighten both the eyes and suppress other

facial areas. Then with an appropriate choice of a threshold, we can track

the location of the eye region.
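A minimal MATLAB sketch of this eye-map construction is given below; the face crop name, the structuring-element size and the final threshold are assumptions.

ycbcr = rgb2ycbcr(imread('face_crop.png'));          % hypothetical face region
Y  = double(ycbcr(:,:,1));
Cb = double(ycbcr(:,:,2));
Cr = double(ycbcr(:,:,3));

eyeMapChr = (1/3) * (Cb.^2 + (255 - Cr).^2 + Cb ./ (Cr + 1));   % chrominance map (+1 avoids division by zero)

se = strel('ball', 5, 5);                            % roughly spherical element
eyeMapLum = imdilate(Y, se) ./ (imerode(Y, se) + 1); % bright/dark luminance ratio

eyeMap  = eyeMapChr .* eyeMapLum;                    % AND (multiplication)
eyeMap  = mat2gray(imdilate(eyeMap, se));            % dilate and normalize
eyeMask = eyeMap > 0.8;                              % assumed threshold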

6.2.2 Mouth Detection

To locate the mouth region, we use the fact that it contains

stronger red components and weaker blue components than other facial


regions (Cr > Cb), so the mouth map is constructed as follows:

n = 0.95 * (1/k * sum(Cr(x,y)^2)) / (1/k * sum(Cr(x,y)/Cb(x,y)))

MouthMap = Cr^2 * (Cr^2 - n * Cr/Cb)

where k is the number of pixels in the face region.
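A minimal MATLAB sketch of the mouth map above is given below (the face crop name and the final threshold are assumptions).

ycbcr = rgb2ycbcr(imread('face_crop.png'));   % hypothetical face region
Cb = double(ycbcr(:,:,2));  Cr = double(ycbcr(:,:,3));

Cr2   = Cr .^ 2;
ratio = Cr ./ (Cb + 1);                       % +1 avoids division by zero
k     = numel(Cr);                            % number of pixels in the face region
n     = 0.95 * (sum(Cr2(:)) / k) / (sum(ratio(:)) / k);
mouthMap  = mat2gray(Cr2 .* (Cr2 - n .* ratio));
mouthMask = mouthMap > 0.7;                   % assumed threshold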

The mouth detection results for the happy and surprised cases are shown in Figure 6.2.

Figure 6.2 – Mouth Detection Diagram

6.3 Emotion Detection

Emotion detection from lip images is based on colour and shape

properties of human lips. For this task we assume we already have a


rectangular colour image containing lips and surrounding skin (with as

small amount of skin as possible). Given this we can start extracting a

binary image of lips, which would give us the necessary information

about the shape.

To extract a binary image of lips, a double threshold approach was

used. First, a binary image (mask) containing objects similar to lips is

extracted. The mask image is extracted in such a way that it contains a set of pixels which is equal to, or larger than, the exact set of lip pixels.

Then, another image (marker) is generated by extracting pixels which

contain lips with highest probability. Later, the mask image is

reconstructed using the marker image to make results more accurate.

Having a binary lip image, shape detection can be performed.

Some lip features of a face expressing certain emotions are obvious: the side corners of happy lips are higher relative to the lip centre than they are for serious or sad lips. One way to express this more mathematically is to find the leftmost and rightmost pixels (the lip corners), draw a line between them and calculate the position of the lip centre with respect to that line. The lower the centre is below the line, the happier the lips are. Another

morphological lip property that can be extracted is mouth openness.

Open lips imply certain emotions: usually happiness and surprise.

For example (surprised and happy):

1. Based on the original binary image, the first step is to remove small areas, which is done with the 'sizethre(x,y,'z')' function.


2. In the second step a morphological closing (imclose(bw,se)) with a

'disk' structure element is done.

3. In the third step some properties of image regions are measured (blob

analysis). More precisely:

A 'BoundingBox' is calculated which contains the smallest

rectangle of the region (in our case the green box). In digital image

processing, the bounding box is merely the coordinates of the

rectangular border that fully encloses a digital image when it is placed

over a page, a canvas, a screen or other similar bidimensional

background.

The 'Extrema' property was calculated, which is an 8-by-2 matrix that specifies the extrema points in the region. Each row of the matrix

contains the x- and y-coordinates of one of the points. The format of the

vector is [top-left top-right right-top right-bottom bottom-right bottom-

left left-bottom left-top] (in our case the cyan dots).

A 'Centroid' was calculated, which is a 1-by-ndims(L) vector that specifies the

centre of mass of the region (in our case the blue 'star').

Based on these region properties, the following quantities are calculated (a sketch follows the list):

1. p_poly_dist ... calculates the distance (shown as a red line) between the centroid and the 'left-top to right-top' line.

2. lipratio ... the ratio between the width and height of the bounding box.

3. lip_sign ... a positive/negative number, which is calculated to detect whether the 'left-top to right-top' line runs over or under the centroid.

4. The decision is then made whether the mood is 'happy', 'sad' or 'surprised'.
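A minimal MATLAB sketch of this blob analysis is given below. 'sizethre' and 'p_poly_dist' are project-specific helpers that are not reproduced here; bwareaopen and an explicit point-to-line distance are used instead, and the decision thresholds are assumptions for illustration only.

lipMask = imread('lip_mask.png') > 0;            % hypothetical binary lip image
bw = bwareaopen(lipMask, 50);                    % 1. remove small areas
bw = imclose(bw, strel('disk', 3));              % 2. morphological closing

s   = regionprops(bw, 'BoundingBox', 'Extrema', 'Centroid');   % 3. blob analysis
ext = s(1).Extrema;                              % 8-by-2 matrix of extreme points
c   = s(1).Centroid;
p1  = ext(8, :);                                 % left-top corner
p2  = ext(3, :);                                 % right-top corner
bb  = s(1).BoundingBox;                          % [x y width height]

lipratio = bb(3) / bb(4);                        % width-to-height ratio
v = p2 - p1;                                     % the left-top to right-top line
% Signed distance of the centroid from that line; image y grows downward,
% so a positive value means the centre lies below the corner line.
lip_sign = (v(1)*(c(2) - p1(2)) - v(2)*(c(1) - p1(1))) / norm(v);

% 4. Assumed decision rule.
if lipratio < 2.0
    mood = 'surprised';                          % tall, open mouth
elseif lip_sign > 2
    mood = 'happy';                              % centre clearly below the corner line
else
    mood = 'sad';
end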

After reviewing some illumination correction (colour constancy)

algorithms we decided to use the "Max-RGB" (also known as "White

patch") algorithm. This algorithm assumes that in every image there is a

white patch, which is then used as a reference for present illumination. A

more accurate "Colour by Correlation" algorithm was also considered,

but it required building a precise colour-illumination correlation table in

controlled conditions, which would be beyond the scope of this task.
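A minimal MATLAB sketch of the Max-RGB (white patch) correction is shown below; each channel is simply rescaled so that its maximum value becomes white (the input file name is a placeholder).

img = im2double(imread('frame_0001.png'));   % hypothetical input frame
for ch = 1:3
    m = max(max(img(:,:,ch)));
    if m > 0
        img(:,:,ch) = img(:,:,ch) / m;       % per-channel gain of 1/max
    end
end
corrected = im2uint8(img);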

As the face detection is always the first step in the processes of these

recognition or transmission systems, its performance would put a strict

limit on the achieved performance of the whole system. Ideally, a good

face detector should accurately extract all faces in images regardless of

their positions, scales, orientations, colours, shapes, poses, expressions

and light conditions. However, for the current state of the art in image

processing technologies, this goal is a big challenge. For this reason,

many designed face detectors deal with only upright and frontal faces in

well-constrained environments.

This lip emotion detection algorithm has one restriction - the face

cannot be rotated more than 90 degrees, since then the corner detection

would obviously fail.

CHAPTER 7

EXPERIMENTAL RESULTS


7.1 General

The results obtained after successful implementation of the project are given in this chapter, on a step-by-step basis.

7.2 Chroma Chart

Chroma chart displayed in Figure 7.1 is the distribution of the skin

colour of different people over the chromatic colour space. Here the

chromatic colour is taken in the (Cb, Cr) colour space. The Intensity(Y)

component is not considered because it has very little effect in the

chrominance variation. The following diagram shows the distribution of

the skin colour of different people.

Figure 7.1 – Chroma Chart

7.3 Result Analysis

This section gives the overall efficiency of the proposed system, measured


at each step. The system was analysed for its detection rate and time

taken to detect a particular stage for a specified number of input images.

Three stages were considered in the system: skin detection, face detection (eyes and mouth), and emotion detection and recognition. At each of these stages the detection rate and the time taken were calculated. The results are tabulated in Table 7.1.

Table 7.1 – Result Analysis

STAGES                              DETECTION RATE (%)   NUMBER OF IMAGES   TIME (s)
SKIN DETECTION                      94.44                17                 1.4
FACE DETECTION (EYES AND MOUTH)     83.33                15                 1
EMOTION DETECTION AND RECOGNITION   88.88                16                 0.5

According to the table, 17 image samples were taken to determine the skin detection rate and it was found that, out of 17 images, skin was detected for 16 images, giving a detection rate of 94.44 % with an average time of 1.4 seconds per image. The face detection rate was

calculated for 15 images, out of which the face was detected successfully for 12 images, giving a detection rate of 83.33 % with an average time of 1 second per image. Similarly, the emotion detection and recognition rate was calculated for 16 images, out of which exact emotions were detected and recognized for 14 images, giving a detection rate of 88.88 % with an average time of 0.5 seconds per image.

The video fragmentation rate of a video depends on the duration

and length of the original video. The Frames per Second (fps) rate is

dependent on the time span of the video.

Frame rate (also known as frame frequency) is the frequency (rate)

at which an imaging device produces unique consecutive images

called frames. The term applies equally well to film and

video cameras, computer graphics, and motion capture systems. Frame

rate is most often expressed in frames per second (FPS) and is also

expressed for progressive scan monitors in hertz (Hz). If a video of a greater time span is given, the interval between the fragments remains constant. For every fragment produced, the emotion of the person is detected. This gives an indication of the intervals at which the change of emotions occurs, and helps narrow down the corresponding reason for the change.

CHAPTER 8

CONCLUSION AND FUTURE WORK


Conclusion

The proposed system utilizes feature extraction techniques and

determines the emotion of the person based on the facial features namely

eyes and lips. The emotion exhibited by a person is determined with good accuracy, and the system is user friendly.

Face-Detection and Segmentation

In this project we have proposed an emotion detection and

recognition system for colour images. Although our application is constructed only for full frontal pictures with only one person per picture, face detection is still necessary for decreasing the area of interest needed for further processing in order to achieve the best results.

Trying to detect the skin of a face in an image really is a hard task

due to the variance of illumination. The success of correct detection

depends a lot on the light sources and illumination properties of the environment in which the picture is taken.

Emotion Detection

The major difficulty of the used approach is determining the right

hue threshold range for lip extraction. Lip colours vary mostly according to the person's race, the presence of make-up, and the illumination under which the photo was taken. The latter is the least problematic, since illumination correction algorithms exist.

Future Enhancements


The future work includes enhancement of the system so that it is

able to detect emotions of the person even in complex backgrounds

having different illumination conditions and to eliminate the lip colour

constraint in the coloured images. Another aspect that can be worked upon is detecting more emotions than just happy, sad and surprised.

APPENDIX 1

SCREENSHOTS


SCREEN 1 : The detected face for the given video input.


SCREEN 2: The interface which is used to select the input image.


SCREEN 3: The image which is to be given as reference.


SCREEN 4: The image to be tested.


SCREEN 5: The smoothened reference image.


SCREEN 6: The test image after smoothening.


SCREEN 7: The image after the detection of edges.


SCREEN 8: The above screen is the result screen which displays the

end result of the system, the emotion portrayed by the person in the

image.

REFERENCES


[1] J. Jarkiewicz, R. Kocielnik and K. Marasek, “Anthropometric Facial Emotion Recognition”, Novel Interaction Methods and Techniques, Lecture Notes in Computer Science, Volume 5611, 2009.

[2] L. Zhao, X. LinSun, J. Liu and X. Hexu, “Face Detection Based on Skin Colour”, Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 2004.

[3] I. Maglogiannis, D. Vouyioukas and C. Aggelopoulos, “Face Detection and Recognition of Natural Human Emotion Using Markov Random Fields”, Personal and Ubiquitous Computing, 2009.

[4] M. H. Yang, D. J. Kriegman and N. Ahuja, “Detecting Faces in Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, 2002.

[5] Pedro J. Muñoz-Merino, Carlos Delgado Kloos and Mario Muñoz-Organero, “Enhancement of Student Learning Through the Use of a Hinting Computer e-Learning System and Comparison With Human Teachers”, IEEE Journal, vol. 52, 2011.

[6] Emily Mower, Maja J. Mataric and Shrikanth Narayanan, “A Framework for Automatic Human Emotion Classification Using Emotion Profiles”, IEEE Journal, vol. 23, 2011.

[7] Xiaogang Wang and Xiaoou Tang, “Face Photo-Sketch Synthesis and Recognition”, IEEE Transactions, 2009.

[8] Yan Tong, Jixu Chen and Qiang Ji, “A Unified Probabilistic Framework for Spontaneous Facial Action Modelling”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, 2010.

[9] L. S. Chen and T. S. Huang, “Emotional Expressions in Audiovisual Human Computer Interaction”, IEEE International Conference, Volume 1, 2000.

[10] L. C. De Silva and P. C. Ng, “Bimodal Emotion Recognition”, Fourth IEEE International Conference, 2000.
