Domain Knowledge to Support Understanding and Treatment of Outliers

Robert Redpath, Judy Sheard
Caulfield School of Information Technology
Monash University, Caulfield East, Victoria, Australia
Email: [email protected]
Email: [email protected]

Proceedings of the International Conference on Information and Automation, December 15-18, 2005, Colombo, Sri Lanka.

Abstract: The understanding and treatment of outliers is complex and non-trivial in many data analysis and data mining exercises, and it is not always done well. One approach, which can be used in combination with others, is to understand the domain of interest and use this knowledge to guide the data preparation and subsequent steps in the treatment and interpretation of outliers. To demonstrate the proposed approach, a study on web usage in the tertiary education sector is used. A particular issue in this sector, where use of the World Wide Web is widespread, is monitoring students' web use in the learning environment. This is important in evaluating and improving teaching outcomes. Data mining techniques play a key role in analyzing student interaction as it is captured in web logs. This paper considers the non-trivial task of data preparation and analysis of web data, and in particular the treatment of outliers in this domain. Some conclusions are made on how to define an outlier in terms of the strategic aims of the particular analysis, and some general conclusions are made about how to classify outliers as noise or as meaningful indicators in a web environment. It is argued that the approach demonstrated can be applied across a range of domains and is a guide to how the knowledge discovery task may be partially automated.

    I. INTRODUCTION

Outlier treatment is an important aspect of the Knowledge Discovery in Databases (KDD) process, where outliers may reveal important information about the data under analysis. Outliers can be detected by a number of techniques, including visualization and proximity-based approaches. All approaches require some input from the expert in the domain of interest, so domain knowledge plays an important role in their identification. Once outliers have been identified, they may be rejected as due to measurement error or as being from another sample population, or they may be accepted as phenomena of interest. Making the distinction between the outliers that should be rejected and those that should be accepted is not always easy.

The purpose of this paper is to use a case study on student behavior, as demonstrated by students' usage of a courseware web system to support their activities, to make some general conclusions about how outliers may be treated in the domain of the case study, and to extend those conclusions to other domains of interest. Having gained an understanding of how outliers may be treated in a general way, it is suggested how this understanding could be incorporated into a partially automated model for KDD.

Section 2 reviews the treatment of outliers, from the use of traditional statistical approaches through to the more recent approaches that have been suggested with the increased interest in data mining, where data mining serves as a label for a new grouping of techniques for data analysis. In Section 3 the data preparation for the case study on student behavior is detailed, with an emphasis on how outliers are treated in the sample data. The particular role of the goals of the analysis is demonstrated (1) in identifying outliers and (2) in deciding whether they should be retained or rejected. Section 4 discusses how domain knowledge is important at every step of the KDD process and relates the points made to the case study. Section 5 describes how the capture of the appropriate domain knowledge will allow outlier treatment to be part of an automated KDD system.

Use of domain knowledge is only one aspect of attempts to automate the KDD process, and other issues such as visualization of data and knowledge, integration of data mining tools, and automated support via workbenches need to be considered. This paper increases understanding of the issues through a focus on outliers, one of the types of knowledge that has the potential to be revealed.

    II. TREATMENT OF OUTLIERS

There is a large literature on outliers extending over many years. In consequence one would expect that a concise definition of an outlier could be provided, but in fact this is a difficult task. It is a complex matter to precisely encapsulate what an outlier is. Many writers convey a notion of what an outlier is in a larger series of observations, but providing an objective statement that can be used to identify an outlier is a great challenge. Some attempts are included here:

"It is well recognized by those who collect or analyse data that values occur in a sample which are so far removed from the remaining values that the analyst is not willing to believe that these values have come from the same population. Many times values occur which are dubious in the eyes of the analyst and he feels that he should make a decision as to whether to accept or reject these values as part of his sample." [1]

"The outliers are values which seem either too large or too small as compared to the rest of the observations." [2]

The use of phrases such as "in the eyes of the analyst" and "seem" indicates the subjective approach used for identifying outliers. Collett and Lewis [3] provide a comprehensive discussion of the subjective nature of the exercise. They point out that most procedures for identifying outliers essentially have two stages. First, a subjective decision is made that a value (or observation) is unusual in comparison to most other values; they use the term "surprising" for such a value. Second, the identified value is tested for discordance with the rest of the sample based on satisfying some objective criterion. It thus becomes a hypothesis testing procedure using traditional statistical methods. By experiment they observe and support the conclusion that identification of outliers by such methods is subjective and is affected by the method of presentation and the scale and pattern of the data [3].
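
To make the two-stage procedure concrete, the sketch below first flags the most "surprising" value and then applies an objective discordance criterion. The paper names no specific test; Grubbs' test, and the example data, are illustrative assumptions.

# A minimal sketch of the two-stage procedure, assuming Python with
# NumPy/SciPy and Grubbs' test as the objective criterion.
import numpy as np
from scipy import stats

def grubbs_discordant(values, alpha=0.05):
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    # Stage 1: flag the most extreme ("surprising") value.
    idx = int(np.argmax(np.abs(x - mean)))
    g = abs(x[idx] - mean) / sd
    # Stage 2: objective criterion -- the flagged value is discordant
    # if G exceeds the Grubbs critical value from the t distribution.
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    g_crit = ((n - 1) / np.sqrt(n)) * np.sqrt(t_crit**2 / (n - 2 + t_crit**2))
    return idx, g > g_crit

idx, discordant = grubbs_discordant([9.8, 10.1, 9.9, 10.0, 10.2, 14.7])
print(idx, discordant)  # flags 14.7 as discordant at the 5% level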

While recognizing the subjective nature of outlier identification, once identified, outliers can be treated in two broad ways: as values for rejection or as values that point to phenomena of interest. They are rejected if they can be considered values drawn from another population, or values resulting from measurement error and suitable for exclusion from the sample as distorting the analysis. Alternately they can be considered phenomena of interest that should not be excluded.

In the past, an approach was taken to outlier detection that placed an emphasis on rejecting outliers as values that are not part of the population being analyzed, and so more statistically based approaches were used. Recent activity in the data mining area is more particularly concerned with outliers as anomalies of interest. Data mining as a term embraces traditional statistical approaches and other less statistically rigorous techniques such as decision trees and neural networks. Data mining will often avoid a strictly statistical view of data, as the volume of observations and number of attributes does not easily permit statistically based approaches. But outliers are still of great interest, so identification by density-based and proximity-based techniques, combined with visualizations, is typically employed. The aim is to find outliers that represent valid data that is significant but that differs from most of the sample. Applications include identifying unusual weather events, fraud, intrusion, medical conditions and public health issues. A range of other approaches exists for detection of outliers, including model-based techniques and techniques that make use of class labels. If class labels are used and training set data is available, a supervised learning approach could be employed; if no training data is available, an unsupervised technique may be employed. It may also be appropriate to remove an observation in the data preparation step of data mining if it is indicative of a measurement error or from another population not of interest in the analysis. Outliers of this type are often referred to as noise and would typically be rejected from the sample as they would distort the analysis being carried out. An excellent overview of all these approaches can be found in the book by Tan et al. [4].

Other literature on outliers includes Knorr and Ng [5], who take an intuitive notion of outliers and provide a formalization as follows: an object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of the objects in T are at distance >= D from O. It is intuitive in the sense that the analyst must nominate the fraction p and the distance D based on a notion that an outlier is at a distance from the main fraction of the data. It is noted by the authors that this can be used as the basis for approaches that are effective with the large datasets encountered in data mining. We note here that it also depends on the domain knowledge of the expert in the area to establish the fraction and distance parameters.
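
A minimal sketch of this definition follows. The brute-force pairwise distance computation and the parameter values are assumptions for illustration; Knorr and Ng describe algorithms that scale to large datasets.

import numpy as np

# A sketch of the DB(p, D) notion for small in-memory data; p and D
# are the values a domain expert would nominate.
def db_outliers(points, p, D):
    X = np.asarray(points, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise distances
    n = len(X)
    # Fraction of the other objects at distance >= D from each object.
    far = (dist >= D).sum(axis=1) / (n - 1)
    return np.where(far >= p)[0]

pts = [[0, 0], [0.5, 0.2], [0.1, 0.4], [0.3, 0.1], [9.0, 9.0]]
print(db_outliers(pts, p=0.9, D=3.0))  # -> [4], the isolated point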

Work by Liu [6] addresses the problem of distinguishing between outliers that should be rejected and those that should be retained in the sample as phenomena of interest. The criteria suggested to make this decision are the characteristics of the data and also relevant domain knowledge. A strategy is suggested that models noise and error processes and accepts outliers as phenomena of interest if the noise model cannot account for them. We again note here that the domain knowledge of the expert in the domain of interest is required to construct the noise model.
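
As a concrete illustration of this strategy, the sketch below fits a simple Gaussian noise model to the bulk of the data and retains only those flagged outliers the model cannot plausibly account for. The Gaussian model and the tail-probability cutoff are assumptions; the paper's point is precisely that a domain expert must supply the real noise model.

import numpy as np
from scipy import stats

def split_outliers(values, flagged_idx, tail_prob=0.001):
    x = np.asarray(values, dtype=float)
    bulk = np.delete(x, flagged_idx)
    mu, sigma = bulk.mean(), bulk.std(ddof=1)   # assumed noise model
    noise, interesting = [], []
    for i in flagged_idx:
        # Two-sided probability of a value at least this extreme
        # under the noise model; implausible values are retained
        # as phenomena of interest.
        p = 2 * stats.norm.sf(abs(x[i] - mu) / sigma)
        (noise if p >= tail_prob else interesting).append(i)
    return noise, interesting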

Liu has also published a number of papers addressing this theme for treatment of outliers with G. Cheng and J. Wu [7, 8]. G. Cheng, in his PhD dissertation, suggests a methodology that uses two strategies for outlier treatment. The methodology requires that outliers in multivariate data are first detected and visualized by the use of self-organizing maps (SOMs). SOMs, a technique pioneered by T. Kohonen [9], are widely used for understanding and interpreting clusters in large datasets containing multivariate data. Two strategies are then proposed for handling the outliers. Domain knowledge can be used to identify outliers for retention and to model what is considered normal data, and thus reason about outliers based on this model. Domain knowledge can also be used to identify outliers for rejection and to build a model of the behavior of these outliers; data outside the norms would then be accepted if it is not accounted for by the model developed for the outliers suitable for rejection. One of the logical considerations for proceeding in this way is that large datasets would not permit all outliers to be dealt with manually. The key point to note, in our view, is that the domain knowledge of the expert in the domain of interest essentially determines the treatment of outliers after application of some widely used supervised and unsupervised SOM learning approaches that generate visualizations. The visualizations permit the domain expert to first identify the outliers, then decide whether to accept the outliers as phenomena of interest or reject them as distorting the analysis.

III. OUTLIER TREATMENT IN A WEB SITE USAGE ANALYSIS

A case study was chosen to demonstrate the application of domain knowledge in the treatment of outliers in the context of the overall KDD process. The particular analysis used in this case study was concerned with the effectiveness of a web-based learning environment and was part of a study executed in 2002. The web-based learning environment was provided to students in the final year of their undergraduate degree, where they undertake an industrial experience (IE) project in which they design, develop and deliver a small computer system for a client. A full description of the IE project can be found in Hagan [10]. The web site, known as the Web Industrial Experience Resources website (WIER), was developed to provide students with an integrated learning environment to use during their IE project work. The site provides various resources including general project information, a facility to enable event scheduling, facilities for project management including a task timer tracker, time graph generator, file manager and risk list, and various forms of communication facilities via news group and discussion forums. The site also provides access to a repository of resources including standards documents, document templates, and samples of past projects. Details can be found in [11].

The research studies made use of web log data generated when students used the WIER system. The full details of the data preparation can be reviewed in the report by Sheard et al. [12] and details of the findings based on web usage analysis can be found in [13].

To enable meaningful interpretation of the data the following abstractions were defined. A connection to the WIER system was termed a session. A student session at the WIER web site can logically be viewed as a number of episodes. An episode corresponds to a student making use of one of the functional areas of the system (e.g. accessing a past project, using the time tracker or engaging in a discussion). Each episode consists of a series of interactions. An interaction was defined as a page request. This is shown diagrammatically in Figure 1. Over the 27-week period of the data collection there were 9442 sessions, consisting of 47725 episodes in total, in the analysis.

[Fig. 1 Class hierarchy of the abstractions used: a session consists of episodes (Episode 1 ... Episode n), and an episode consists of interactions (Interaction 1 ... Interaction n).]
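
The hierarchy can be expressed as a simple data structure. The sketch below, including its field names, is an assumption for illustration; it is reused in later sketches.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Interaction:      # one page request
    timestamp: float    # seconds since some epoch
    page: str

@dataclass
class Episode:          # use of one functional area of the system
    functional_area: str
    interactions: List[Interaction] = field(default_factory=list)

@dataclass
class Session:          # one connection to the WIER system
    student_id: str
    episodes: List[Episode] = field(default_factory=list)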

The goals of the analysis embraced three major themes. These were (1) frequency of sessions for a student over the semester, (2) time of session analysis and (3) time of episode analysis. The level of abstraction (session, episode or interaction) suitable for one analysis goal, and what constituted an outlier for one analysis goal, was not necessarily the same for another. Considering the analysis goal of frequency of sessions, it was decided to include all sessions, even though some lasted a number of days and were clearly inactive, and some lasted less than a second so that no work could have been done. This was valid in terms of understanding this simple aspect of student behaviour: their attempts to connect to the system.

Considering analysis goals (2) and (3), the aspects of student behaviour of interest required that students be actively engaged in using the system, doing more than just connecting to it, so apparent failed logins were eliminated. Furthermore, a long session did not necessarily indicate inactivity; neither did a long episode. An episode was considered active only if no time between interactions within that episode exceeded some threshold. A careful consideration of the percentages of episodes and sessions excluded at different thresholds for the interaction times, combined with knowledge of the system functions from an educator's perspective, allowed the analyst to arrive at a threshold of 600 seconds (10 minutes) between interactions. The threshold was used to attach a class label of "inactive" to an episode: if an episode had any time between interactions greater than or equal to 600 seconds, the episode was considered inactive. Sessions (and their single episode) that lasted less than 1 second were also excluded as being inactive.

Considering the analysis goal of time of session, if a session contained an inactive episode this identified the session as inactive, and the entire session was excluded from the analysis as being misleading. Considering the analysis goal of time of episode, any inactive episode was excluded.
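
These exclusion rules can be sketched as follows, reusing the hypothetical data structures above. The 600-second and 1-second thresholds are the values reported in the paper; the helper names are assumptions.

def episode_is_inactive(episode, gap_threshold=600.0):
    times = sorted(i.timestamp for i in episode.interactions)
    gaps = (b - a for a, b in zip(times, times[1:]))
    # Inactive if any gap between interactions is >= 600 seconds.
    return any(g >= gap_threshold for g in gaps)

def session_duration(session):
    times = [i.timestamp for e in session.episodes for i in e.interactions]
    return (max(times) - min(times)) if times else 0.0

def sessions_for_time_analysis(sessions):
    # Goal (2): drop sub-second sessions and any session containing
    # an inactive episode.
    return [s for s in sessions
            if session_duration(s) >= 1.0
            and not any(episode_is_inactive(e) for e in s.episodes)]

def episodes_for_time_analysis(sessions):
    # Goal (3): drop inactive episodes individually.
    return [e for s in sessions for e in s.episodes
            if not episode_is_inactive(e)]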

Table I, below, compares the method employed to exclude outliers with simply rejecting episodes with a time greater than 1 hour. The method chosen includes some outliers (as phenomena of interest) that would be excluded by simpler methods or visualization approaches, while also excluding a greater number of undesirable outliers. Some episodes (18) were retained that had episode times greater than 1 hour, including one episode that lasted more than 2 hours.

TABLE I
PERCENTAGE OF EPISODES REMAINING AFTER EXCLUDING OUTLIERS

Method                              | Percentage of episodes remaining | Episodes with times >= 3600 s included | Episodes with times >= 7200 s included
Extreme episode time (> 3600 sec.)  | 97.5                             | 0                                      | 0
Interaction interval (method used)  | 93.9                             | 18                                     | 1

IV. DOMAIN KNOWLEDGE TO SUPPORT THE KDD PROCESS

It has been suggested that domain knowledge has a role in all steps of the KDD process [14]. It must be noted that the success of a data mining exercise depends on a number of factors, including the degree of automation and the use of visualization tools. A comprehensive overview of the issues and literature concerning this can be found in the paper [15] that details the impact of domain knowledge on all aspects of the KDD process. Also suggested in [15] is an architecture for a partially automated KDD system that splits the analyst/user's involvement into initial knowledge acquisition and subsequent user interventions. Additionally it is noted there that the strategic goals of the analyst would and should be included in a broad definition of domain knowledge.

Web usage analysis and data collection present some particular challenges, and an understanding of the domain of interest must inform this analysis. As Sullivan comments, "many page hits could represent a deeply satisfying experience or a hopelessly lost reader" [16]. So in discussing the steps in the KDD process, from data collection and data preparation through to interpretation of the results, an emphasis will be placed on how domain knowledge informs the various decisions being made. (A detailed commentary on the steps can be found in [12].) In order to understand the users' behavior better, it was recognised that typical web log file data, if used alone, did not provide enough information. A script was therefore included in each page on the site that recorded information in a database each time a page was loaded. The students entered the site via a login page and thus the start of each session could be determined. The domain expert understood that identity was vital for the analysis, and students' identities could also be matched to interactions. Caching was disabled on most pages and thus page requests could be recorded as interactions. Pages were also categorized according to which functional area of the system they belonged, so that a student's use of each function could be analysed. The categorization required the domain knowledge of the expert in the application. Another key decision informed by domain knowledge is the suitable level of abstraction to apply in terms of the analysis goals. In the case study the goals of analysis required that the class hierarchy of session, episode and interaction be established. The interaction time was used to attach the class label (or abstraction label) "inactive" to an episode.
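
A minimal sketch of the kind of per-page recording just described follows, using SQLite as the database. The schema, the page-to-area mapping, and the function names are assumptions, not the actual WIER implementation.

import sqlite3, time

conn = sqlite3.connect("wier_usage.db")
conn.execute("""CREATE TABLE IF NOT EXISTS interaction (
                    student_id TEXT, page TEXT,
                    functional_area TEXT, ts REAL)""")

# Mapping pages to functional areas: the categorization step that
# required the domain expert's knowledge of the system.
PAGE_AREAS = {"/timer": "task timer", "/forum": "discussion forum",
              "/projects/past": "past projects"}

def record_page_load(student_id, page):
    # Called by the script embedded in each page when it loads.
    area = PAGE_AREAS.get(page, "other")
    conn.execute("INSERT INTO interaction VALUES (?, ?, ?, ?)",
                 (student_id, page, area, time.time()))
    conn.commit()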

It is clear that domain knowledge assists at a number of steps in the analysis. Table II, following, summarises the contribution at each step.

TABLE II
SUMMARY OF DECISIONS INFORMED BY DOMAIN KNOWLEDGE AT EACH STEP IN THE ANALYSIS

Data Process Step  | Decision informed by Domain Knowledge
Collection design  | Capture additional information for the analysis
Collection         | Monitor student logons and record identity
Abstraction        | Decide on semantically meaningful groupings
Integration        | Combine with additional data related to the user
Cleaning           | Remove irrelevant items; determine missing items; remove outliers that distort the analysis
Transformation     | Identify the user; identify the session
Mining             | Identify outliers as phenomena of interest

In order to know whether outliers should be rejected or included as phenomena of interest, the goals of the analyst must be taken into account. Here the goals were to answer the questions:

How long do students spend using the web site?
What are the patterns and trends in web site usage over the year?
Are there any differences in use based on student performance?

Clearly the data on students who spend long amounts of time at initial setup need to be included to address these objectives. Observations related to inactive or disinterested users, as defined via the domain knowledge that defines the inactive observations, would be rejected from the sample.

V. OUTLIER TREATMENT AS PART OF A PARTIALLY AUTOMATED MODEL FOR KNOWLEDGE DISCOVERY

There are two ways in which domain knowledge can be captured in order to partially automate the knowledge discovery process and, in particular, aid in identifying outliers for categorization as suitable for rejection or as phenomena of interest. The domain knowledge can be captured prior to and during the analysis via a knowledge acquisition module. It can also be embodied in case studies of previous analyses that have been formally captured in a data bank of representative case studies. The web usage case study detailed here will be used to demonstrate how the approaches suggested could be applied in a particular domain of interest. Domain knowledge acquisition and use of representative case studies can be applied separately, but would preferably be applied in combination. If used in combination they would have complementary and consequential effects on each other. The user would also have the discretion to intervene at points of decision as the data analysis proceeded.

Figure 2, adapted from [15], shows the broad architecture of what is proposed.

Fig. 2 Proposed architecture for a partially automated KDD system

Case-based reasoning would allow a previous web usage analysis, such as the one described here, to provide prompts for the major steps in the analysis if the user so desired, overriding a generic set of steps provided as a default. The steps would then guide the interaction with the knowledge acquisition module in prompting for the appropriate domain knowledge. Table III shows the knowledge acquisition in summary as it would apply to the case study.
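
As an illustration, the domain knowledge captured for this case study might be held in a simple structure like the following. The representation itself is an assumption; the content is taken from Table III.

# A sketch of captured domain knowledge; the structure is assumed,
# the values come from the case study.
domain_knowledge = {
    "enrichment_attributes": ["student identity on page requests"],
    "analysis_goals": ["session frequency", "session length"],
    "abstraction_hierarchy": ["session", "episode", "interaction"],
    "class_labels": {
        "inactive episode": {
            "attribute": "time between interactions",
            "threshold_seconds": 600,
        },
    },
}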

The goals of the analysis would be captured and formalized in terms of which level of abstraction they relate to and the attributes of that level of abstraction that are important in categorizing its instances; in the case study this was the time between interactions in an episode. With reference to outliers, the knowledge acquisition module would capture the hierarchy of abstractions, in this case session, episode and interaction. The domain expert could then be provided with a frequency distribution for the level of abstraction showing the instances above and below certain threshold values. This would assist the domain expert in determining upper and lower threshold values for the analysis in question that captured phenomena of interest but rejected outliers that would, in their view, distort the analysis.
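
A sketch of such a threshold-setting aid follows, reusing the hypothetical episode_is_inactive helper from Section III; the candidate thresholds are illustrative assumptions.

def retention_by_threshold(sessions, candidates=(60, 300, 600, 1200)):
    episodes = [e for s in sessions for e in s.episodes]
    # Show the expert what each candidate threshold would keep.
    for t in candidates:
        kept = sum(not episode_is_inactive(e, gap_threshold=t)
                   for e in episodes)
        print(f"threshold {t:>5} s: {100 * kept / len(episodes):.1f}% of episodes kept")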

TABLE III
AN AUTOMATED MODEL APPLIED TO THE CASE STUDY

Relevant domain knowledge captured by knowledge acquisition module at the beginning of the KDD process | Case study application
Attributes to enrich data requested | Student identity on page requests
Goals in terms of data abstraction requested; supported by automatic interrogation of metadata | Session frequency; session length
Abstraction hierarchy requested | Session has many episodes; episode has many interactions
Request class label | Inactive episode
Request attribute to determine class label | Time between interactions
Request threshold for attribute identified | Threshold is 10 minutes

[Fig. 2 components: Initial Knowledge Acquisition; Knowledge Acquisition Module; DK Meta Definition Module; Convert to Formalism Module; Formal Domain Knowledge; Case Repository; Case Based Reasoning Module; Process Metadata; User / KDD Process Interface and Workbench; KDD steps: Data Selection (design and collection), Data Integration (transformation and coding), Data Mining, Result Evaluation; input: Web Usage Log Data.]


    VI. CONCLUSION

It is not always clear how outliers can be identified and, once identified, how they should be treated. Domain knowledge can play a key part in outlier treatment. The definition of domain knowledge must be broad enough to include the goals of the analysis being carried out. It can be employed at a number of points in the KDD process to ensure correct handling of outliers. In particular it can inform the following steps:

Establishment of the goals of the analysis.
Design of data collection to allow the correct attributes to be captured.
Determination of the suitable level of abstraction to allow identification of outliers for both rejection and inclusion in the analysis.
Determination of the attributes to permit the correct class labels to be attached to the appropriate level of abstraction.

In the particular example of the web usage case study it was found that the data collection had to be designed to allow the individual student sessions to be identified, and also the use by students of the functional areas (episodes) within a session. Episodes then had to be labeled as active or inactive. Sessions that contained an inactive episode were also considered inactive by the domain expert.

An understanding of the treatment of outliers will permit a model for a partially automated KDD workbench to be developed that handles outliers based on the principles established. This model will have two major parts that relate to domain knowledge acquisition. The first part will be an initial knowledge acquisition module that captures domain knowledge. The second part will be a case-based reasoning module that will record the major steps and make them available for any future analysis exercise in the same domain of interest.

    REFERENCES

[1] W. J. Dixon, "Analysis of Extreme Values," Annals of Mathematical Statistics, vol. 21, pp. 488-506, 1950.
[2] E. J. Gumbel, "Discussion on 'Rejection of Outliers' by F. J. Anscombe," Technometrics, vol. 2, pp. 165-166, 1960.
[3] D. Collett and T. Lewis, "The Subjective Nature of Outlier Rejection Procedures," Applied Statistics, vol. 25, pp. 228-237, 1976.
[4] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson Education, Inc., 2006.
[5] E. M. Knorr and R. T. Ng, "A Unified Notion of Outliers: Properties and Computation," presented at the Proceedings of Knowledge Discovery in Databases (KDD-97), Newport Beach, CA, 1997.
[6] X. Liu, "Strategies for Outlier Analysis," presented at the Colloquium on Knowledge Discovery and Data Mining, London, UK, 1998.
[7] X. Liu, G. Cheng, and J. X. Wu, "Noise and Uncertainty Management in Intelligent Data Modelling," presented at the Proceedings of the 12th National Conference on Artificial Intelligence (AAAI-94), 1994.
[8] J. G. Cheng, "Outlier Management in Intelligent Data Analysis," PhD dissertation, Department of Computer Science, Birkbeck College, University of London, 2000, 163 pp.
[9] T. Kohonen, Self-Organizing Maps. Springer: New York, Berlin, 1995.
[10] D. Hagan, S. Tucker, and J. Ceddia, "Industrial Experience Products: A Balance of Product and Process," Computer Science Education, vol. 9, pp. 106-113, 1999.
[11] J. Ceddia, S. Tucker, C. Clemence, and A. Cambrell, "WIER - Implementing Artifact Reuse in an Educational Environment with Real Products," presented at the 31st Annual Frontiers in Education Conference, Reno, Nevada, 2001.
[12] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Determining Website Usage Time from Interactions: Data Preparation and Analysis," Journal of Educational Technology Systems, vol. 32, pp. 101-121, 2003-2004.
[13] J. Sheard, J. Ceddia, J. Hurst, and J. Tuovinen, "Inferring Student Learning Behaviour from Website Interactions: A Usage Analysis," Education and Information Technologies, vol. 8, pp. 245-266, 2003.
[14] A. Knobbe, A. Schipper, and P. Brockhausen, "Domain Knowledge and Data Mining Process Decisions: Enabling End-User Datawarehouse Mining," Contract No. IST-1999-11993, Deliverable No. D5, www-ai.cs.uni-dortmund.de/MMWEB/content/publications.html, 2000.
[15] R. Redpath and B. Srinivasan, "A Model for Domain Centered Knowledge Discovery in Databases," presented at the IEEE 4th International Conference on Intelligent Systems Design and Applications (ISDA 2004), Budapest, Hungary, 2004.
[16] T. Sullivan, "Reading Reader Reaction: A Proposal for Inferential Analysis of Web Server Log Files," presented at the 3rd Conference on Human Factors and the Web, Denver, Colorado, 1997.