Domain Knowledge to Support Understanding and Treatment of Outliers

Robert Redpath, Judy Sheard
Caulfield School of Information Technology
Monash University, Caulfield East, Victoria, Australia
Email: [email protected]
Email: [email protected]

Proceedings of the International Conference on Information and Automation, December 15-18, 2005, Colombo, Sri Lanka.
Abstract - The understanding and treatment of outliers is complex and non-trivial in many data analysis and data mining exercises, and it is not always done well. One approach, usable in combination with others, is to understand the domain of interest and use this knowledge to guide the data preparation and subsequent steps in the treatment and interpretation of outliers. To demonstrate the proposed approach, a study on web usage in the tertiary education sector is used. A particular issue in this sector, where use of the World Wide Web is widespread, is monitoring students' web use in the environment. This is important in evaluating and improving teaching outcomes. Data mining techniques play a key role in analyzing student interaction as it is captured in web logs. This paper considers the non-trivial task of data preparation and analysis of web data and, in particular, the treatment of outliers in this domain. Conclusions are drawn on how to define an outlier in terms of the strategic aims of the particular analysis, and some general conclusions are made about how to classify outliers in a web environment as noise or as indicators of interest. It is argued that the approach demonstrated can be applied across a range of domains and serves as a guide to how the knowledge discovery task may be partially automated.
I. INTRODUCTION
Outlier treatment is an important aspect of the Knowledge
Discovery in Databases (KDD) process where outliers may
reveal important information about the data under analysis.
Outliers can be determined by a number of techniques, including visualization and proximity-based approaches. All approaches require some input from an expert in the domain of interest, so domain knowledge plays an important role in their identification. Once outliers have been identified, they may be rejected as due to measurement error or as coming from another sample population, or they may be accepted as phenomena of interest. Making the distinction between the outliers that should be rejected and those that should be accepted is not always easy.
The purpose of this paper is to use a case study on student
behavior, as demonstrated by their usage of a courseware
web system to support their activities, to make some general
conclusions about how outliers may be treated in the domain
of the case study and to extend those conclusions to other
domains of interest. Having gained an understanding of how outliers may be treated in a general way, it is suggested how this understanding could be incorporated into a partially automated model for KDD.
Section 2 reviews the treatment of outliers, from traditional statistical approaches through to the more recent approaches suggested with the increased interest in data mining, where data mining serves as a label for a new grouping of techniques for data analysis. Section 3 details the data preparation for the case study on student behavior, with an emphasis on how outliers are treated in the sample data. The particular role of the goals of the analysis is demonstrated (1) in identifying outliers and (2) in deciding whether they should be retained or rejected. Section 4 discusses how domain knowledge is important at every step of the KDD process and relates the points made to the case study. Section 5 describes how capturing the appropriate domain knowledge will allow outlier treatment to be part of an automated KDD system.
Use of domain knowledge is only one aspect of attempts to automate the KDD process; other issues, such as visualization of data and knowledge, integration of data mining tools, and automated support via workbenches, also need to be considered. This paper increases understanding of the issues by focusing on outliers, one of the types of knowledge that has the potential to be revealed.
II. TREATMENT OF OUTLIERS
There is a large literature on outliers extending over many years. One would therefore expect that a concise definition of an outlier could be provided, but this is in fact a difficult task. Many writers convey a notion of what an outlier is within a larger series of observations, but providing an objective statement that can be used to identify an outlier is a great challenge. Some attempts are included here:
It is well recognized by those who collect or analyse
data that values occur in a sample which are so far removed from the remaining values that the analyst is not willing to believe that these values have come from the same population. Many times values occur which are dubious in the eyes of the analyst and he feels that he should make a decision as to whether to accept or reject these values as part of his sample. [1]
The outliers are values which seem either too large or
too small as compared to the rest of the observations. [2]
The use of phrases such as "in the eyes of the analyst" and "seem" indicates the subjective approach used for identifying outliers. Collett and Lewis [3] provide a comprehensive discussion of the subjective nature of the exercise. They point out that most procedures for identifying outliers essentially have two stages. First, a subjective decision is made that a value (or observation) is unusual in comparison to most other values; they use the term "surprising" for such a value. Second, the identified value is tested for discordance with the rest of the sample based on satisfying some objective criterion. It thus becomes a hypothesis-testing procedure using traditional statistical methods. By experiment they observe and support the conclusion that identification of outliers by such methods is subjective and is affected by the method of presentation and the scale and pattern of the data [3].
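The two-stage procedure can be illustrated with a minimal sketch. A simple z-score criterion stands in for the objective discordance test; the sample values and the 3-sigma cut-off are illustrative assumptions, not taken from [3].

```python
# Stage 1 is subjective: the analyst flags a 'surprising' value.
# Stage 2 is objective: test that value for discordance with the rest.
import statistics

def discordant(sample, candidate, z_threshold=3.0):
    """Return True if candidate lies more than z_threshold standard
    deviations from the mean of the remaining values."""
    rest = [x for x in sample if x != candidate]  # exclude the candidate
    mean = statistics.mean(rest)
    sd = statistics.stdev(rest)
    return abs(candidate - mean) / sd > z_threshold

values = [12, 14, 13, 15, 12, 14, 13, 95]  # 95 looks 'surprising' (stage 1)
print(discordant(values, 95))  # True: discordant with the rest
```

As [3] emphasises, the outcome still depends on the subjective first stage: a value never flagged is never tested.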
While recognizing the subjective nature of outlier identification, once identified, outliers can be treated in two broad ways: as values for rejection or as values that point to phenomena of interest. They are rejected if they can be considered values drawn from another population, or values resulting from measurement error, and therefore suitable for exclusion from the sample as distorting the analysis. Alternatively, they can be considered phenomena of interest that should not be excluded.
In the past an approach was taken to outlier detection
that placed an emphasis on rejecting outliers as values that
are not part of the population being analyzed and so more
statistically based approaches were used. Recent activity in
the data mining area is more particularly concerned with
outliers as anomalies of interest. Data mining as a term
embraces traditional statistical approaches and other less
statistically rigorous techniques such as decision trees and
neural networks. Data mining will often avoid a strictly statistical view of data, as the volume of observations and number of attributes does not easily permit statistically based approaches. But outliers are still of great interest, so identification by density-based and proximity-based techniques, combined with visualizations, is typically employed. The aim is to find outliers that represent valid data that is significant but differs from most of the sample. Applications include identifying unusual weather events, fraud, intrusion, medical conditions and public health issues. A range of other approaches exists for the detection of outliers, including model-based techniques and techniques that make use of class labels. If class labels
also techniques that make use of class labels. If class labels
are used and training set data is available a supervised
learning approach could be employed. If no training data is
available an unsupervised techniques may be employed. Itmay also be appropriate to remove an observation in the
data preparation step of data mining if it is indicative of a
measurement error or from another population not of
interest in the analysis. Outliers of this type are often
referred to as noise and would typically be rejected from the
sample as they would distort the analysis being carried out.
An excellent overview of all these approaches can be found
in the book by Tan et al [4].
Other literature on outliers includes Knorr and Ng [5], who take an intuitive notion of outliers and provide a formalization as follows:
An object O in a dataset T is a DB(p,D)-outlier if at least a fraction p of the objects in T are >= distance D from O.
It is intuitive in the sense that the analyst must nominate the fraction p and the distance D based on a notion that an outlier lies at a distance from the main fraction of the data. The authors note that this can be used as the basis for approaches that are effective with the large datasets encountered in data mining. We note here that it also depends on the domain knowledge of the expert in the area to establish the fraction and distance parameters.
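A direct, brute-force reading of the definition above can be sketched as follows. The quadratic scan is only for illustration (real implementations index the data, as [5] discusses), and the 2-D points and the p and D values are illustrative assumptions.

```python
import math

def is_db_outlier(o, dataset, p, d):
    """True if at least fraction p of objects in dataset lie at
    distance >= d from object o (the Knorr-Ng formalization)."""
    far = sum(1 for x in dataset if math.dist(o, x) >= d)
    return far / len(dataset) >= p

points = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 10)]
print(is_db_outlier((10, 10), points, p=0.8, d=5.0))  # True
print(is_db_outlier((0, 0), points, p=0.8, d=5.0))    # False
```

The domain expert's role is visible in the call itself: p and d must be supplied from knowledge of the data, not derived by the algorithm.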
Work by Liu [6] addresses the problem of distinguishing between outliers that should be rejected and those that should be retained in the sample as phenomena of interest. The criteria suggested for making this decision are the characteristics of the data and relevant domain knowledge. A strategy is suggested that models noise and error processes and accepts outliers as phenomena of interest if the noise model cannot account for them. We again note here that the domain knowledge of the expert in the domain of interest is required to construct the noise model. Liu has also published a number of papers addressing this theme for the treatment of outliers with G. Cheng and J. Wu [7, 8]. Cheng, in his PhD dissertation, suggests a methodology that uses two strategies for outlier treatment. The methodology requires that outliers in multivariate data first be detected and visualized by the use of self-organizing maps (SOMs). SOMs, a technique pioneered by T. Kohonen [9], are widely used for understanding and interpreting clusters in large datasets containing multivariate data. Two strategies are then proposed for handling the outliers. Domain knowledge can be used to identify outliers for retention, and to model what is considered normal data and thus reason about outliers based on this model. Domain knowledge can also be used to identify outliers for rejection and to build a model of the behavior of these outliers. Data outside the norms would be accepted if it is not accounted for by the model developed
for the outliers suitable for rejection. One of the logical considerations for proceeding in this way is that large datasets would not permit all outliers to be dealt with manually. The key point to note, in our view, is that the domain knowledge of the expert in the domain of interest essentially determines the treatment of outliers after application of some widely used supervised and unsupervised SOM learning approaches that generate visualizations. The visualizations permit the domain expert first to identify the outliers, then to decide whether to accept them as phenomena of interest or reject them as distorting the analysis.
III. OUTLIER TREATMENT IN A WEBSITE USAGE ANALYSIS
A case study was chosen to demonstrate the application of domain knowledge in the treatment of outliers in the context of the overall KDD process. The particular analysis used in this case study was concerned with the effectiveness of a web-based learning environment and was part of a study executed in 2002. The web-based learning environment was provided to students in the final year of their undergraduate degree, where they undertake an industrial experience (IE) project in which they design, develop and deliver a small computer system for a client. A full description of the IE project can be found in Hagan [10]. The web site, known as the Web Industrial Experience Resources Website (WIER), was developed to provide students with an integrated learning environment to use during their IE project work. The site provides various resources, including general project information, a facility to enable event scheduling, facilities for project management (including a task time tracker, time graph generator, file manager and risk list), and various forms of communication via news groups and discussion forums. The site also provides access to a repository of resources including standards documents, document templates, and samples of past projects. Details can be found in [11].
The research studies made use of web log data generated when students used the WIER system. The full details of the data preparation can be reviewed in the report by Sheard et al. [12], and details of the findings based on web usage analysis can be found in [13].
To enable meaningful interpretation of the data, the following abstractions were defined. A connection to the WIER system was termed a session. A student session at the WIER web site can logically be viewed as a number of episodes. An episode corresponds to a student making use of one of the functional areas of the system (e.g. accessing a past project, using the time tracker or engaging in a discussion). Each episode consists of a series of interactions. An interaction was defined as a page request. This is shown diagrammatically in Figure 1. Over the 27-week period of the data collection there were 9442 sessions, consisting of 47725 episodes in total, in the analysis.

[Figure 1 shows a class hierarchy: Session 1 contains Episode 1 ... Episode n, and each episode contains Interaction 1 ... Interaction n.]
Fig. 1 Class hierarchy of the abstractions used
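The hierarchy of Fig. 1 can be sketched as plain data classes. The field names here are illustrative assumptions; the actual log schema used in the study is described in [12].

```python
from dataclasses import dataclass, field

@dataclass
class Interaction:
    """One page request."""
    timestamp: float
    page: str

@dataclass
class Episode:
    """Use of one functional area of the system."""
    functional_area: str
    interactions: list = field(default_factory=list)

@dataclass
class Session:
    """One connection to the WIER system."""
    student_id: str
    episodes: list = field(default_factory=list)

s = Session("s123", [Episode("time tracker", [Interaction(0.0, "/timer")])])
print(len(s.episodes))  # 1
```

Keeping the three levels distinct matters below, because what counts as an outlier differs by level of abstraction.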
The goals of the analysis embraced three major themes: (1) frequency of sessions for a student over the semester, (2) time of session analysis and (3) time of episode analysis. The level of abstraction (session, episode or interaction) suitable for one analysis goal, and what constituted an outlier for that goal, was not necessarily the same for another analysis goal. Considering the analysis goal of frequency of sessions, it was decided to include all sessions, even though some lasted a number of days and were clearly inactive, and some lasted less than a second so no work could have been done. This was valid in terms of understanding this simple aspect of student behaviour: their attempts to connect to the system.
Considering analysis goals (2) and (3), the aspects of student behaviour of interest required that students be actively engaged with the system, not merely connected to it, so apparent failed logins were eliminated. Furthermore, a long session did not necessarily indicate inactivity; neither did a long episode. An active episode was defined as one in which the time between interactions within that episode was below some threshold. Careful consideration of the percentages of episodes and sessions excluded at different thresholds for the interaction times, combined with knowledge of the system functions from an educator's perspective, allowed the analyst to arrive at a threshold of 600 seconds (10 minutes) between interactions. The threshold was used to attach a class label of inactive to an episode: if an episode had any time between interactions greater than or equal to 600 seconds, the episode was considered inactive. Sessions (and their single episode) that lasted less than 1 second were also excluded as inactive.
Considering the analysis goal of time of session, if a session contained an inactive episode, the session was identified as inactive and the entire session was excluded from the analysis as misleading. Considering the analysis goal of time of episode, any inactive episode was excluded.
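The labelling rules just described can be sketched directly: an episode is inactive if any gap between consecutive interactions is at least 600 seconds, and a session is inactive if any of its episodes is. Representing an episode simply as a list of interaction timestamps is an assumption made for illustration.

```python
THRESHOLD = 600  # seconds (10 minutes), the threshold chosen in the study

def episode_inactive(times, threshold=THRESHOLD):
    """An episode is inactive if any gap between consecutive
    interactions is >= the threshold."""
    gaps = [b - a for a, b in zip(times, times[1:])]
    return any(g >= threshold for g in gaps)

def session_inactive(episodes, threshold=THRESHOLD):
    """A session is inactive if it contains any inactive episode."""
    return any(episode_inactive(e, threshold) for e in episodes)

active = [0, 30, 95, 200]   # largest gap 105 s -> active
stalled = [0, 40, 700]      # 660 s gap -> inactive
print(episode_inactive(active))             # False
print(episode_inactive(stalled))            # True
print(session_inactive([active, stalled]))  # True
```

Note that the same episode data yields different exclusions under the two goals: for time-of-episode analysis only `stalled` is dropped, while for time-of-session analysis the whole session containing it is dropped.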
Table I, below, compares the method employed to exclude outliers with a simpler method of rejecting episodes with a time greater than 1 hour. The method chosen includes some outliers (as phenomena of interest) that would be excluded by simpler methods or visualization approaches, while also excluding a greater number of undesirable outliers. Some episodes (18) were retained that had episode times greater than 1 hour, including one episode that lasted
more than 2 hours.

TABLE I
PERCENTAGE OF EPISODES REMAINING AFTER EXCLUDING OUTLIERS

Method                                  | Percentage of episodes remaining | Episodes with times >=3600 s included | Episodes with times >=7200 s included
Extreme episode time (>3600 sec.)       | 97.5                             | 0                                     | 0
Interaction interval (the method used)  | 93.9                             | 18                                    | 1
IV. DOMAIN KNOWLEDGE TO SUPPORT THE KDD PROCESS

It has been suggested that domain knowledge has a role in all steps of the KDD process [14]. It must be noted that the success of a data mining exercise depends on a number of factors, including the degree of automation and the use of visualization tools. A comprehensive overview of the issues and literature concerning this can be found in the paper [15], which details the impact of domain knowledge on all aspects of the KDD process. Also suggested in [15] is an architecture for a partially automated KDD system that splits the analyst/user's involvement into initial knowledge acquisition and subsequent user interventions. Additionally, it is noted there that the strategic goals of the analyst would and should be included in a broad definition of domain knowledge.
Web usage analysis and data collection present some particular challenges, and an understanding of the domain of interest must inform this analysis. As Sullivan comments, many page hits could represent "a deeply satisfying experience or a hopelessly lost reader" [16]. So in discussing the steps in the KDD process, from data collection and data preparation through to interpretation of the results, an emphasis will be placed on how domain knowledge informs the various decisions being made. (A detailed commentary on the steps can be found in [12].) In order to understand the users' behavior better, it was recognised that typical web log file data, if used alone, did not provide enough information. A script was therefore included in each page on the site that recorded information in a database each time a page was loaded. The students entered the site via a login page, and thus the start of each session could be determined. The domain expert understood that identity was vital for the analysis; students' identities could also be matched to interactions. Caching was disabled on most pages, and thus page requests could be recorded as interactions. Pages were also categorized according to which functional area of the system they belonged, so that a student's use of each function could be analysed. The categorization required the domain knowledge of the expert in the application. Another key decision informed by domain knowledge is the suitable level of abstraction to apply in terms of the analysis goals. In the case study, the goals of analysis required that the class hierarchy of session, episode and interaction be established. The interaction time was used to attach a class label (or abstraction label) of inactive to an episode.
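The page categorization step described above can be sketched as a simple lookup from page path to functional area. The path prefixes and area names below are invented for illustration; constructing the real mapping required the domain expert's knowledge of the WIER system.

```python
# Hypothetical mapping from URL prefix to WIER functional area.
AREA_BY_PREFIX = {
    "/timer": "task time tracker",
    "/files": "file manager",
    "/forum": "discussion forum",
    "/projects": "past project repository",
}

def functional_area(page_path):
    """Categorize a page request into a functional area of the system."""
    for prefix, area in AREA_BY_PREFIX.items():
        if page_path.startswith(prefix):
            return area
    return "other"

print(functional_area("/forum/thread/42"))  # discussion forum
```

Grouping consecutive interactions by functional area is what turns a raw stream of page requests into the episodes of Fig. 1.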
It is clear that domain knowledge assists at a number of steps in the analysis. Table II summarises the contribution at each step.

TABLE II
SUMMARY OF DECISIONS INFORMED BY DOMAIN KNOWLEDGE AT EACH STEP IN THE ANALYSIS

Data Process Step  | Decision informed by Domain Knowledge
Collection design  | Capture additional information for the analysis
Collection         | Monitor student logons and record identity
Abstraction        | Decide on semantically meaningful groupings
Integration        | Combine with additional data related to the user
Cleaning           | Remove irrelevant items; determine missing items; remove outliers that distort the analysis
Transformation     | Identify the user; identify the session
Mining             | Outliers as phenomena of interest are identified
In order to know whether outliers should be rejected or included as phenomena of interest, the goals of the analyst must be taken into account. Here the goals were to answer the questions:
How long do students spend using the web site?
What are the patterns and trends in web site usage over the year?
Are there any differences in use based on student performance?
Clearly, the data on students who spend long amounts of time at initial setup need to be included to address these objectives. Observations related to inactive or disinterested users, as defined via the domain knowledge that defines the inactive observations, would be rejected from the sample.
V. OUTLIER TREATMENT AS PART OF A PARTIALLY AUTOMATED MODEL FOR KNOWLEDGE DISCOVERY
There are two ways in which domain knowledge can be captured in order to partially automate the knowledge discovery process and, in particular, aid in categorizing identified outliers as suitable for rejection or as phenomena of interest. The domain knowledge can be captured prior to and during the analysis via a knowledge acquisition module. It can also be embodied in case studies of previous analyses that have been formally captured in a
data bank of representative case studies. The web usage case study detailed here will be used to demonstrate how the approaches suggested could be applied in a particular domain of interest. Domain knowledge acquisition and use of representative case studies can be applied separately but would preferably be applied in combination; used together, they would have complementary and consequential effects on each other. The user would also have the discretion to intervene at points of decision as the data analysis proceeded.

Figure 2, adapted from [15], shows the broad architecture of what is proposed.

Fig. 2 Proposed Architecture for a Partially Automated KDD System
Case-based reasoning would allow a previous web usage analysis, such as the one described here, to provide prompts for the major steps in the analysis if the user so desired, overruling a generic set of steps provided as a default. The steps would then guide the interaction with the knowledge acquisition module in prompting for the appropriate domain knowledge. Table III summarises the knowledge acquisition as it would apply to the case study.
The goals of the analysis would be captured and formalized in terms of the level of abstraction they relate to and the attributes of that level of abstraction that are important in categorizing its instances; in the case study this was the time between interactions in an episode. With reference to outliers, the knowledge acquisition module would capture the hierarchy of abstractions, in this case session, episode and interaction. The domain expert could then be provided with a frequency distribution for the level of abstraction showing the instances above and below certain threshold values. This would assist the domain expert in determining upper and lower threshold values for the analysis in question that captured phenomena of interest but rejected outliers that would, in their view, distort the analysis.

TABLE III
AN AUTOMATED MODEL APPLIED TO THE CASE STUDY

Relevant domain knowledge captured by knowledge acquisition module at the beginning of the KDD process | Case study application
Attributes to enrich data requested | Student identity on page requests
Goals in terms of data abstraction requested; supported by automatic interrogation of metadata | Session frequency; session length
Abstraction hierarchy requested | Session has many episodes; episode has many interactions
Request class label | Inactive episode
Request attribute to determine class label | Time between interactions
Request threshold for attribute identified | Threshold is 10 minutes
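The threshold-selection support described in the text above (showing the expert how many instances fall above candidate thresholds) can be sketched as a small frequency table. The interaction gaps and candidate thresholds below are illustrative assumptions.

```python
def threshold_table(values, thresholds):
    """For each candidate threshold, count the instances at or above it,
    so a domain expert can judge what each cut-off would exclude."""
    return {t: sum(1 for v in values if v >= t) for t in thresholds}

interaction_gaps = [5, 12, 40, 90, 310, 640, 1200, 7300]  # seconds
table = threshold_table(interaction_gaps, [300, 600, 3600])
for t, n in table.items():
    print(f">= {t:4d} s: {n} of {len(interaction_gaps)} gaps")
```

In the case study this kind of summary, read alongside knowledge of the system functions, is what led to the 600-second threshold.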
[Figure 2 components: Data Selection, Design and Collection; Data Integration, Transformation and Coding; Data Mining; Result Evaluation; Selected Data; Transformed Data; User KDD Process Interface and Workbench; Formal Domain Knowledge; Case Repository; Knowledge Acquisition Module; Case Based Reasoning Module; Process Metadata; Web Usage Log Data; Convert to Formalism Module; Initial Knowledge Acquisition; DK Meta Definition Module.]
VI. CONCLUSION
It is not always clear how outliers can be identified and, once identified, how they should be treated. Domain knowledge can play a key part in outlier treatment. The definition of domain knowledge must be broad enough to include the goals of the analysis being carried out. It can be employed at a number of points in the KDD process to ensure correct handling of outliers. In particular it can inform the following steps:
Establishment of the goals of the analysis
Design of data collection to allow the correct attributes to be captured
Determination of the suitable level of abstraction to allow identification of outliers for both rejection and inclusion in the analysis
Determination of the attributes to permit the correct class labels to be attached to the appropriate level of abstraction
In the particular example of the web usage case study, it was found that the data collection had to be designed to allow individual student sessions to be identified, and also students' use of the functional areas (episodes) within a session. Episodes then had to be labeled as active or inactive; sessions that contained an inactive episode were also considered inactive by the domain expert.
An understanding of the treatment of outliers will permit a model for a partially automated KDD workbench to be developed that handles outliers based on the principles established. This model will have two major parts relating to domain knowledge acquisition. The first is an initial knowledge acquisition module that captures domain knowledge. The second is a case-based reasoning module that makes the major steps of previous analyses available for any future analysis exercise in the same domain of interest.
REFERENCES
[1] W. J. Dixon, "Analysis of Extreme Values," Ann. Math. Statist., vol. 21, pp. 488-506, 1950.
[2] E. J. Gumbel, "Discussion on 'Rejection of Outliers' by Anscombe, F. J.," Technometrics, vol. 2, pp. 165-166, 1960.
[3] D. Collett and T. Lewis, "The Subjective Nature of Outlier Rejection Procedures," Applied Statistics, vol. 25, pp. 228-237, 1976.
[4] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining: Pearson Education, Inc., 2006.
[5] E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation," presented at Proceedings of Knowledge Discovery in Databases (KDD-97), Newport Beach, CA, 1997.
[6] X. Liu, "Strategies for Outlier Analysis," presented at Colloquium on Knowledge Discovery and Data Mining, London, UK, 1998.
[7] X. Liu, G. Cheng, and J. X. Wu, "Noise and Uncertainty Management in Intelligent Data Modelling," presented at Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 1994.
[8] J. G. Cheng, "Outlier Management in Intelligent Data Analysis," in Department of Computer Science, Birkbeck College. London: University of London, 2000, pp. 163.
[9] T. Kohonen, Self-Organizing Maps: Springer; New York, Berlin, 1995.
[10] D. Hagan, S. Tucker, and J. Ceddia, "Industrial Experience Products: A Balance of Product and Process," Computer Science Education, vol. 9, pp. 106-113, 1999.
[11] J. Ceddia, S. Tucker, C. Clemence, and A. Cambrell, "WIER - Implementing Artifact Reuse in an Educational Environment with Real Products," presented at 31st Annual Frontiers in Education Conference, Reno, Nevada, 2001.
[12] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Determining Website Usage Time from Interactions: Data Preparation and Analysis," Educational Technology Systems, vol. 32, pp. 101-121, 2003-2004.
[13] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Inferring Student Learning Behaviour from Website Interactions: A Usage Analysis," Education and Information Technologies, vol. 8, pp. 245-266, 2003.
[14] A. Knobbe, A. Schipper, and P. Brockhausen, "Domain Knowledge and Data Mining Process Decisions: Enabling End-User Datawarehouse Mining," Contract No. IST-1999-11993, Deliverable No. D5, www-ai.cs.uni-dortmund.de/MMWEB/content/publications.html, 2000.
[15] R. Redpath and B. Srinivasan, "A Model for Domain Centered Knowledge Discovery in Databases," presented at IEEE 4th International Conference on Intelligent Systems Design and Applications (ISDA 2004), Budapest, Hungary, 2004.
[16] T. Sullivan, "Reading Reader Reaction: A Proposal for Inferential Analysis of Web Server Log Files," presented at 3rd Conference on Human Factors and the Web, Denver, Colorado, 1997.