Domain Knowledge to Support Understanding and Treatment of Outliers

Robert Redpath, Judy Sheard
Caulfield School of Information Technology
Monash University, Caulfield East, Victoria, Australia
Email: [email protected]
Email: [email protected]

Proceedings of the International Conference on Information and Automation, December 15-18, 2005, Colombo, Sri Lanka.
Abstract - The understanding and treatment of outliers is complex and non-trivial in many data analysis and data mining exercises, and it is not always done well. One approach, usable in combination with others, is to understand the domain of interest and use this knowledge to guide the data preparation and subsequent steps in the treatment and interpretation of outliers. To demonstrate the proposed approach, a study on web usage in the tertiary education sector is used. A particular issue in this sector, where use of the World Wide Web is widespread, is monitoring students' web use in the environment. This is important in evaluating and improving teaching outcomes. Data mining techniques play a key role in analyzing student interaction as it is captured in web logs. This paper considers the non-trivial task of data preparation and analysis of web data and, in particular, the treatment of outliers in this domain. Conclusions are drawn on how to define an outlier in terms of the strategic aims of the particular analysis, and some general conclusions are made about how to classify outliers in a web environment as noise or as indicators of interest. It is argued that the approach demonstrated can be applied across a range of domains and serves as a guide to how the knowledge discovery task may be partially automated.
I. INTRODUCTION
Outlier treatment is an important aspect of the Knowledge
Discovery in Databases (KDD) process where outliers may
reveal important information about the data under analysis.
Outliers can be determined by a number of techniques, including visualization and proximity-based approaches. All approaches require some input from an expert in the domain of interest, so domain knowledge plays an important role in their identification. Once outliers have been identified, they may be rejected as due to measurement error or as coming from another sample population, or they may be accepted as phenomena of interest. Making the distinction between the outliers that should be rejected and those that should be accepted is not always easy.
The purpose of this paper is to use a case study on student
behavior, as demonstrated by their usage of a courseware
web system to support their activities, to make some general
conclusions about how outliers may be treated in the domain
of the case study and to extend those conclusions to other
domains of interest. Having gained an understanding of how outliers may be treated in a general way, it is suggested how this understanding could be incorporated into a partially automated model for KDD.
Section 2 reviews the treatment of outliers, from traditional statistical approaches through to the more recent approaches suggested with the increased interest in data mining, where data mining serves as a label for a new grouping of techniques for data analysis. Section 3 details the data preparation for the case study on student behavior, with an emphasis on how outliers are treated in the sample data. The particular role of the goals of the analysis is demonstrated (1) in identifying outliers and (2) in deciding whether they should be retained or rejected. Section 4 discusses how domain knowledge is important at every step of the KDD process and relates the points made to the case study. Section 5 describes how capturing the appropriate domain knowledge will allow outlier treatment to be part of an automated KDD system.
Use of domain knowledge is only one aspect of attempts to automate the KDD process; other issues, such as visualization of data and knowledge, integration of data mining tools, and automated support via workbenches, also need to be considered. This paper increases understanding of the issues by focusing on outliers, one of the types of knowledge that has the potential to be revealed.
II. TREATMENT OF OUTLIERS
There is a large literature on outliers extending over many years. One would therefore expect that a concise definition of an outlier could be provided, but this is in fact a difficult task. Many writers convey a notion of what an outlier is within a larger series of observations, but providing an objective statement that can be used to identify an outlier is a great challenge. Some attempts are included here:
It is well recognized by those who collect or analyse
data that values occur in a sample which are so far removed from the remaining values that the analyst is not willing to believe that these values have come from the same population. Many times values occur which are dubious in the eyes of the analyst and he feels that he should make a decision as to whether to accept or reject these values as part of his sample. [1]
The outliers are values which seem either too large or
too small as compared to the rest of the observations. [2]
The use of phrases such as "in the eyes of the analyst" and "seem" indicates the subjective approach used for identifying outliers. Collett and Lewis [3] provide a comprehensive discussion of the subjective nature of the exercise. They point out that most procedures for identifying outliers essentially have two stages. First, a subjective decision is made that a value (or observation) is unusual in comparison to most other values; they use the term "surprising" for such a value. Second, the identified value is tested for discordance with the rest of the sample based on satisfying some objective criterion. It thus becomes a hypothesis-testing procedure using traditional statistical methods. By experiment they observe and support the conclusion that identification of outliers by such methods is subjective and is affected by the method of presentation and the scale and pattern of the data [3].
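The two-stage procedure can be illustrated with a minimal sketch. A simple z-score criterion stands in for the objective discordance test; the sample values and the 3-sigma cut-off are illustrative assumptions, not taken from [3].

```python
# Stage 1 is subjective: the analyst flags a 'surprising' value.
# Stage 2 is objective: test that value for discordance with the rest.
import statistics

def discordant(sample, candidate, z_threshold=3.0):
    """Return True if candidate lies more than z_threshold standard
    deviations from the mean of the remaining values."""
    rest = [x for x in sample if x != candidate]  # exclude the candidate
    mean = statistics.mean(rest)
    sd = statistics.stdev(rest)
    return abs(candidate - mean) / sd > z_threshold

values = [12, 14, 13, 15, 12, 14, 13, 95]  # 95 looks 'surprising' (stage 1)
print(discordant(values, 95))  # True: discordant with the rest
```

As [3] emphasises, the outcome still depends on the subjective first stage: a value never flagged is never tested.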
While recognizing the subjective nature of outlier identification, once identified, outliers can be treated in two broad ways: as values for rejection or as values that point to phenomena of interest. They are rejected if they can be considered values drawn from another population, or values resulting from measurement error, and therefore suitable for exclusion from the sample as distorting the analysis. Alternatively, they can be considered phenomena of interest that should not be excluded.
In the past an approach was taken to outlier detection
that placed an emphasis on rejecting outliers as values that
are not part of the population being analyzed and so more
statistically based approaches were used. Recent activity in
the data mining area is more particularly concerned with
outliers as anomalies of interest. Data mining as a term
embraces traditional statistical approaches and other less
statistically rigorous techniques such as decision trees and
neural networks. Data mining will often avoid a strictly statistical view of data, as the volume of observations and number of attributes does not easily permit statistically based approaches. But outliers are still of great interest, so identification by density-based and proximity-based techniques, combined with visualizations, is typically employed. The aim is to find outliers that represent valid data that is significant but differs from most of the sample. Applications include identifying unusual weather events, fraud, intrusion, medical conditions and public health issues. A range of other approaches exists for the detection of outliers, including model-based techniques and techniques that make use of class labels. If class labels
also techniques that make use of class labels. If class labels
are used and training set data is available a supervised
learning approach could be employed. If no training data is
available an unsupervised techniques may be employed. Itmay also be appropriate to remove an observation in the
data preparation step of data mining if it is indicative of a
measurement error or from another population not of
interest in the analysis. Outliers of this type are often
referred to as noise and would typically be rejected from the
sample as they would distort the analysis being carried out.
An excellent overview of all these approaches can be found
in the book by Tan et al [4].
Other literature on outliers includes Knorr and Ng [5], who take an intuitive notion of outliers and provide a formalization as follows:
An object O in a dataset T is a DB(p,D)-outlier if at least a fraction p of the objects in T are >= distance D from O.
It is intuitive in the sense that the analyst must nominate the fraction p and the distance D based on a notion that an outlier lies at a distance from the main fraction of the data. The authors note that this can be used as the basis for approaches that are effective with the large datasets encountered in data mining. We note here that it also depends on the domain knowledge of the expert in the area to establish the fraction and distance parameters.
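A direct, brute-force reading of the definition above can be sketched as follows. The quadratic scan is only for illustration (real implementations index the data, as [5] discusses), and the 2-D points and the p and D values are illustrative assumptions.

```python
import math

def is_db_outlier(o, dataset, p, d):
    """True if at least fraction p of objects in dataset lie at
    distance >= d from object o (the Knorr-Ng formalization)."""
    far = sum(1 for x in dataset if math.dist(o, x) >= d)
    return far / len(dataset) >= p

points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(is_db_outlier((10, 10), points, p=0.8, d=5.0))  # True
print(is_db_outlier((0, 0), points, p=0.8, d=5.0))    # False
```

The domain expert's role is visible in the call itself: p and d must be supplied from knowledge of the data, not derived by the algorithm.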
Work by Liu [6] addresses the problem of distinguishing between outliers that should be rejected and those that should be retained in the sample as phenomena of interest. The criteria suggested for making this decision are the characteristics of the data and relevant domain knowledge. A strategy is suggested that models noise and error processes and accepts outliers as phenomena of interest if the noise model cannot account for them. We again note here that the domain knowledge of the expert in the domain of interest is required to construct the noise model. Liu has also published a number of papers addressing this theme for the treatment of outliers with G. Cheng and J. Wu [7, 8]. Cheng, in his PhD dissertation, suggests a methodology that uses two strategies for outlier treatment. The methodology requires that outliers in multivariate data first be detected and visualized by the use of self-organizing maps (SOMs). SOMs, a technique pioneered by T. Kohonen [9], are widely used for understanding and interpreting clusters in large datasets containing multivariate data. Two strategies are then proposed for handling the outliers. Domain knowledge can be used to identify outliers for retention, and to model what is considered normal data and thus reason about outliers based on this model. Domain knowledge can also be used to identify outliers for rejection and to build a model of the behavior of these outliers. Data outside the norms would be accepted if it is not accounted for by the model developed
for the outliers suitable for rejection. One of the logical considerations for proceeding in this way is that large datasets would not permit all outliers to be dealt with manually. The key point to note, in our view, is that the domain knowledge of the expert in the domain of interest essentially determines the treatment of outliers after application of some widely used supervised and unsupervised SOM learning approaches that generate visualizations. The visualizations permit the domain expert first to identify the outliers, then to decide whether to accept them as phenomena of interest or reject them as distorting the analysis.
III. OUTLIER TREATMENT IN A WEBSITE USAGE ANALYSIS
A case study was chosen to demonstrate the application of domain knowledge in the treatment of outliers in the context of the overall KDD process. The particular analysis used in this case study was concerned with the effectiveness of a web-based learning environment and was part of a study executed in 2002. The web-based learning environment was provided to students in the final year of their undergraduate degree, where they undertake an industrial experience (IE) project in which they design, develop and deliver a small computer system for a client. A full description of the IE project can be found in Hagan [10]. The web site, known as the Web Industrial Experience Resources Website (WIER), was developed to provide students with an integrated learning environment to use during their IE project work. The site provides various resources, including general project information, a facility to enable event scheduling, facilities for project management (including a task time tracker, time graph generator, file manager and risk list), and various forms of communication via news groups and discussion forums. The site also provides access to a repository of resources including standards documents, document templates, and samples of past projects. Details can be found in [11].
The research studies made use of web log data generated when students used the WIER system. The full details of the data preparation can be reviewed in the report by Sheard et al. [12], and details of the findings based on web usage analysis can be found in [13].
To enable meaningful interpretation of the data, the following abstractions were defined. A connection to the WIER system was termed a session. A student session at the WIER web site can logically be viewed as a number of episodes. An episode corresponds to a student making use of one of the functional areas of the system (e.g. accessing a past project, using the time tracker or engaging in a discussion). Each episode consists of a series of interactions. An interaction was defined as a page request. This is shown diagrammatically in Figure 1. Over the 27-week period of the data collection there were 9442 sessions, consisting of 47725 episodes in total, in the analysis.

[Figure 1 shows a class hierarchy: Session 1 contains Episode 1 ... Episode n, and each episode contains Interaction 1 ... Interaction n.]
Fig. 1 Class hierarchy of the abstractions used
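The hierarchy of Fig. 1 can be sketched as plain data classes. The field names here are illustrative assumptions; the actual log schema used in the study is described in [12].

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    """One page request."""
    timestamp: float
    page: str

@dataclass
class Episode:
    """Use of one functional area of the system."""
    functional_area: str
    interactions: list = field(default_factory=list)

@dataclass
class Session:
    """One connection to the WIER system."""
    student_id: str
    episodes: list = field(default_factory=list)

s = Session("s123", [Episode("time tracker", [Interaction(0.0, "/timer")])])
print(len(s.episodes))  # 1
```

Keeping the three levels distinct matters below, because what counts as an outlier differs by level of abstraction.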
The goals of the analysis embraced three major themes: (1) frequency of sessions for a student over the semester, (2) time of session analysis and (3) time of episode analysis. The level of abstraction (session, episode or interaction) suitable for one analysis goal, and what constituted an outlier for that goal, was not necessarily the same for another analysis goal. Considering the analysis goal of frequency of sessions, it was decided to include all sessions, even though some lasted a number of days and were clearly inactive, and some lasted less than a second so no work could have been done. This was valid in terms of understanding this simple aspect of student behaviour: their attempts to connect to the system.
Considering analysis goals (2) and (3), the aspects of student behaviour of interest required that students be actively engaged with the system, not merely connected to it, so apparent failed logins were eliminated. Furthermore, a long session did not necessarily indicate inactivity; neither did a long episode. An active episode was defined as one in which the time between interactions within that episode was below some threshold. Careful consideration of the percentages of episodes and sessions excluded at different thresholds for the interaction times, combined with knowledge of the system functions from an educator's perspective, allowed the analyst to arrive at a threshold of 600 seconds (10 minutes) between interactions. The threshold was used to attach a class label of inactive to an episode: if an episode had any time between interactions greater than or equal to 600 seconds, the episode was considered inactive. Sessions (and their single episode) that lasted less than 1 second were also excluded as inactive.
Considering the analysis goal of time of session, if a session contained an inactive episode, the session was identified as inactive and the entire session was excluded from the analysis as misleading. Considering the analysis goal of time of episode, any inactive episode was excluded.
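The labelling rules just described can be sketched directly: an episode is inactive if any gap between consecutive interactions is at least 600 seconds, and a session is inactive if any of its episodes is. Representing an episode simply as a list of interaction timestamps is an assumption made for illustration.

```python
THRESHOLD = 600  # seconds (10 minutes), the threshold chosen in the study

def episode_inactive(times, threshold=THRESHOLD):
    """An episode is inactive if any gap between consecutive
    interactions is >= the threshold."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return any(g >= threshold for g in gaps)

def session_inactive(episodes, threshold=THRESHOLD):
    """A session is inactive if it contains any inactive episode."""
    return any(episode_inactive(e, threshold) for e in episodes)

active = [0, 30, 95, 200]   # largest gap 105 s -> active
stalled = [0, 40, 700]      # 660 s gap -> inactive
print(episode_inactive(active))             # False
print(episode_inactive(stalled))            # True
print(session_inactive([active, stalled]))  # True
```

Note that the same episode data yields different exclusions under the two goals: for time-of-episode analysis only `stalled` is dropped, while for time-of-session analysis the whole session containing it is dropped.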
Table I, below, compares the method employed to exclude outliers with a simpler method of rejecting episodes with a time greater than 1 hour. The method chosen includes some outliers (as phenomena of interest) that would be excluded by simpler methods or visualization approaches, while also excluding a greater number of undesirable outliers. Some episodes (18) were retained that had episode times greater than 1 hour, including one episode that lasted
more than 2 hours.

TABLE I
PERCENTAGE OF EPISODES REMAINING AFTER EXCLUDING OUTLIERS

Method                                  | Percentage of episodes remaining | Episodes with times >=3600 s included | Episodes with times >=7200 s included
Extreme episode time (>3600 sec.)       | 97.5                             | 0                                     | 0
Interaction interval (the method used)  | 93.9                             | 18                                    | 1
IV. DOMAIN KNOWLEDGE TO SUPPORT THE KDD PROCESS

It has been suggested that domain knowledge has a role in all steps of the KDD process [14]. It must be noted that the success of a data mining exercise depends on a number of factors, including the degree of automation and the use of visualization tools. A comprehensive overview of the issues and literature concerning this can be found in the paper [15], which details the impact of domain knowledge on all aspects of the KDD process. Also suggested in [15] is an architecture for a partially automated KDD system that splits the analyst/user's involvement into initial knowledge acquisition and subsequent user interventions. Additionally, it is noted there that the strategic goals of the analyst would and should be included in a broad definition of domain knowledge.
Web usage analysis and data collection present some particular challenges, and an understanding of the domain of interest must inform this analysis. As Sullivan comments, many page hits could represent "a deeply satisfying experience or a hopelessly lost reader" [16]. So in discussing the steps in the KDD process, from data collection and data preparation through to interpretation of the results, an emphasis will be placed on how domain knowledge informs the various decisions being made. (A detailed commentary on the steps can be found in [12].) In order to understand the users' behavior better, it was recognised that typical web log file data, if used alone, did not provide enough information. A script was therefore included in each page on the site that recorded information in a database each time a page was loaded. The students entered the site via a login page, and thus the start of each session could be determined. The domain expert understood that identity was vital for the analysis; students' identities could also be matched to interactions. Caching was disabled on most pages, and thus page requests could be recorded as interactions. Pages were also categorized according to which functional area of the system they belonged, so that a student's use of each function could be analysed. The categorization required the domain knowledge of the expert in the application. Another key decision informed by domain knowledge is the suitable level of abstraction to apply in terms of the analysis goals. In the case study, the goals of analysis required that the class hierarchy of session, episode and interaction be established. The interaction time was used to attach a class label (or abstraction label) of inactive to an episode.
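The page categorization step described above can be sketched as a simple lookup from page path to functional area. The path prefixes and area names below are invented for illustration; constructing the real mapping required the domain expert's knowledge of the WIER system.

```python
# Hypothetical mapping from URL prefix to WIER functional area.
AREA_BY_PREFIX = {
    "/timer": "task time tracker",
    "/files": "file manager",
    "/forum": "discussion forum",
    "/projects": "past project repository",
}

def functional_area(page_path):
    """Categorize a page request into a functional area of the system."""
    for prefix, area in AREA_BY_PREFIX.items():
        if page_path.startswith(prefix):
            return area
    return "other"

print(functional_area("/forum/thread/42"))  # discussion forum
```

Grouping consecutive interactions by functional area is what turns a raw stream of page requests into the episodes of Fig. 1.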
It is clear that domain knowledge assists at a number of steps in the analysis. Table II summarises the contribution at each step.

TABLE II
SUMMARY OF DECISIONS INFORMED BY DOMAIN KNOWLEDGE AT EACH STEP IN THE ANALYSIS

Data Process Step  | Decision informed by Domain Knowledge
Collection design  | Capture additional information for the analysis
Collection         | Monitor student logons and record identity
Abstraction        | Decide on semantically meaningful groupings
Integration        | Combine with additional data related to the user
Cleaning           | Remove irrelevant items; determine missing items; remove outliers that distort the analysis
Transformation     | Identify the user; identify the session
Mining             | Outliers as phenomena of interest are identified
In order to know whether outliers should be rejected or included as phenomena of interest, the goals of the analyst must be taken into account. Here the goals were to answer the questions:
How long do students spend using the web site?
What are the patterns and trends in web site usage over the year?
Are there any differences in use based on student performance?
Clearly, the data on students who spend long amounts of time at initial setup need to be included to address these objectives. Observations related to inactive or disinterested users, as defined via the domain knowledge that defines the inactive observations, would be rejected from the sample.
V. OUTLIER TREATMENT AS PART OF A PARTIALLY AUTOMATED MODEL FOR KNOWLEDGE DISCOVERY
There are two ways in which domain knowledge can be captured in order to partially automate the knowledge discovery process and, in particular, aid in categorizing identified outliers as suitable for rejection or as phenomena of interest. The domain knowledge can be captured prior to and during the analysis via a knowledge acquisition module. It can also be embodied in case studies of previous analyses that have been formally captured in a
data bank of representative case studies. The web usage case study detailed here will be used to demonstrate how the approaches suggested could be applied in a particular domain of interest. Domain knowledge acquisition and use of representative case studies can be applied separately but would preferably be applied in combination; used together, they would have complementary and consequential effects on each other. The user would also have the discretion to intervene at points of decision as the data analysis proceeded.

Figure 2, adapted from [15], shows the broad architecture of what is proposed.

Fig. 2 Proposed Architecture for a Partially Automated KDD System
Case-based reasoning would allow a previous web usage analysis, such as the one described here, to provide prompts for the major steps in the analysis if the user so desired, overruling a generic set of steps provided as a default. The steps would then guide the interaction with the knowledge acquisition module in prompting for the appropriate domain knowledge. Table III summarises the knowledge acquisition as it would apply to the case study.
The goals of the analysis would be captured and formalized in terms of the level of abstraction they relate to and the attributes of that level of abstraction that are important in categorizing its instances; in the case study this was the time between interactions in an episode. With reference to outliers, the knowledge acquisition module would capture the hierarchy of abstractions, in this case session, episode and interaction. The domain expert could then be provided with a frequency distribution for the level of abstraction showing the instances above and below certain threshold values. This would assist the domain expert in determining upper and lower threshold values for the analysis in question that captured phenomena of interest but rejected outliers that would, in their view, distort the analysis.

TABLE III
AN AUTOMATED MODEL APPLIED TO THE CASE STUDY

Relevant domain knowledge captured by knowledge acquisition module at the beginning of the KDD process | Case study application
Attributes to enrich data requested | Student identity on page requests
Goals in terms of data abstraction requested; supported by automatic interrogation of metadata | Session frequency; session length
Abstraction hierarchy requested | Session has many episodes; episode has many interactions
Request class label | Inactive episode
Request attribute to determine class label | Time between interactions
Request threshold for attribute identified | Threshold is 10 minutes
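The threshold-selection support described in the text above (showing the expert how many instances fall above candidate thresholds) can be sketched as a small frequency table. The interaction gaps and candidate thresholds below are illustrative assumptions.

```python
def threshold_table(values, thresholds):
    """For each candidate threshold, count the instances at or above it,
    so a domain expert can judge what each cut-off would exclude."""
    return {t: sum(1 for v in values if v >= t) for t in thresholds}

interaction_gaps = [5, 12, 40, 90, 310, 640, 1200, 7300]  # seconds
table = threshold_table(interaction_gaps, [300, 600, 3600])
for t, n in table.items():
    print(f">= {t:4d} s: {n} of {len(interaction_gaps)} gaps")
```

In the case study this kind of summary, read alongside knowledge of the system functions, is what led to the 600-second threshold.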
[Figure 2 components: Data Selection, Design and Collection; Data Integration, Transformation and Coding; Data Mining; Result Evaluation; Selected Data; Transformed Data; User KDD Process Interface and Workbench; Formal Domain Knowledge; Case Repository; Knowledge Acquisition Module; Case Based Reasoning Module; Process Metadata; Web Usage Log Data; Convert to Formalism Module; Initial Knowledge Acquisition; DK Meta Definition Module.]
VI. CONCLUSION
It is not always clear how outliers can be identified and, once identified, how they should be treated. Domain knowledge can play a key part in outlier treatment. The definition of domain knowledge must be broad enough to include the goals of the analysis being carried out. It can be employed at a number of points in the KDD process to ensure correct handling of outliers. In particular it can inform the following steps:
Establishment of the goals of the analysis
Design of data collection to allow the correct attributes to be captured
Determination of the suitable level of abstraction to allow identification of outliers for both rejection and inclusion in the analysis
Determination of the attributes to permit the correct class labels to be attached to the appropriate level of abstraction
In the particular example of the web usage case study, it was found that the data collection had to be designed to allow individual student sessions to be identified, and also students' use of the functional areas (episodes) within a session. Episodes then had to be labeled as active or inactive; sessions that contained an inactive episode were also considered inactive by the domain expert.
An understanding of the treatment of outliers will permit a model for a partially automated KDD workbench to be developed that handles outliers based on the principles established. This model will have two major parts relating to domain knowledge acquisition. The first is an initial knowledge acquisition module that captures domain knowledge. The second is a case-based reasoning module that makes the major steps of previous analyses available for any future analysis exercise in the same domain of interest.
REFERENCES
[1] W. J. Dixon, "Analysis of Extreme Values," Ann. Math. Statist., vol. 21, pp. 488-506, 1950.
[2] E. J. Gumbel, "Discussion on 'Rejection of Outliers' by Anscombe, F. J.," Technometrics, vol. 2, pp. 165-166, 1960.
[3] D. Collett and T. Lewis, "The Subjective Nature of Outlier Rejection Procedures," Applied Statistics, vol. 25, pp. 228-237, 1976.
[4] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining: Pearson Education, Inc., 2006.
[5] E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation," presented at Proceedings of Knowledge Discovery in Databases (KDD-97), Newport Beach, CA, 1997.
[6] X. Liu, "Strategies for Outlier Analysis," presented at Colloquium on Knowledge Discovery and Data Mining, London, UK, 1998.
[7] X. Liu, G. Cheng, and J. X. Wu, "Noise and Uncertainty Management in Intelligent Data Modelling," presented at Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 1994.
[8] J. G. Cheng, "Outlier Management in Intelligent Data Analysis," in Department of Computer Science, Birkbeck College. London: University of London, 2000, pp. 163.
[9] T. Kohonen, Self-Organizing Maps: Springer; New York, Berlin, 1995.
[10] D. Hagan, S. Tucker, and J. Ceddia, "Industrial Experience Products: A Balance of Product and Process," Computer Science Education, vol. 9, pp. 106-113, 1999.
[11] J. Ceddia, S. Tucker, C. Clemence, and A. Cambrell, "WIER - Implementing Artifact Reuse in an Educational Environment with Real Products," presented at 31st Annual Frontiers in Education Conference, Reno, Nevada, 2001.
[12] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Determining Website Usage Time from Interactions: Data Preparation and Analysis," Educational Technology Systems, vol. 32, pp. 101-121, 2003-2004.
[13] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Inferring Student Learning Behaviour from Website Interactions: A Usage Analysis," Education and Information Technologies, vol. 8, pp. 245-266, 2003.
[14] A. Knobbe, A. Schipper, and P. Brockhausen, "Domain Knowledge and Data Mining Process Decisions: Enabling End-User Datawarehouse Mining," Contract No. IST-1999-11993, Deliverable No. D5, www-ai.cs.uni-dortmund.de/MMWEB/content/publications.html, 2000.
[15] R. Redpath and B. Srinivasan, "A Model for Domain Centered Knowledge Discovery in Databases," presented at IEEE 4th International Conference on Intelligent Systems Design and Applications (ISDA 2004), Budapest, Hungary, 2004.
[16] T. Sullivan, "Reading Reader Reaction: A Proposal for Inferential Analysis of Web Server Log Files," presented at 3rd Conference on Human Factors and the Web, Denver, Colorado, 1997.