chapter 2

Upload: smartlife0888

Post on 02-Feb-2016

Data Mining: Practical Machine Learning Tools and Techniques
Slides for Chapter 2 of Data Mining by I. H. Witten, E. Frank and M. A. Hall

Input: Concepts, instances, attributes

Terminology

What's a concept?
Classification, association, clustering, numeric prediction

What's in an example?
Relations, flat files, recursion

What's in an attribute?
Nominal, ordinal, interval, ratio

Preparing the input
ARFF, attributes, missing values, getting to know data

Terminology

Components of the input:

Concepts: kinds of things that can be learned
Aim: intelligible and operational concept description

Instances: the individual, independent examples of a concept
Note: more complicated forms of input are possible

Attributes: measuring aspects of an instance
We will focus on nominal and numeric ones

What's a concept?

Styles of learning:

Classification learning:
predicting a discrete class

Association learning:
detecting associations between features

Clustering:
grouping similar instances into clusters

Numeric prediction:
predicting a numeric quantity

Concept: thing to be learned

Concept description:
output of learning scheme

Classification learning

Example problems: weather data, contact lenses, irises, labor negotiations

Classification learning is supervised
Scheme is provided with actual outcome

Outcome is called the class of the example

Measure success on fresh data for which class labels are known (test data)

In practice success is often measured subjectively

Association learning

Can be applied if no class is specified and any kind of structure is considered interesting

Difference from classification learning:
Can predict any attribute's value, not just the class, and more than one attribute's value at a time

Hence: far more association rules than classification rules

Thus: constraints are necessary
Minimum coverage and minimum accuracy
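
The two constraints can be illustrated with a minimal sketch. The four-row table, the rule, and the helper name `coverage_and_accuracy` are hypothetical. The sketch assumes coverage means the number of instances a rule predicts correctly, and accuracy means that number divided by the number of instances the rule's antecedent applies to.

```python
# Hypothetical toy data: each instance maps attribute names to nominal values.
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
]


def coverage_and_accuracy(instances, antecedent, consequent):
    """Return (coverage, accuracy) of the rule antecedent => consequent."""

    def matches(instance, conditions):
        return all(instance[attr] == value for attr, value in conditions.items())

    # Instances to which the rule applies at all.
    applies = [inst for inst in instances if matches(inst, antecedent)]
    # Of those, the instances for which the prediction is right.
    correct = [inst for inst in applies if matches(inst, consequent)]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


# Rule under test: windy = false => play = yes
cov, acc = coverage_and_accuracy(instances, {"windy": "false"}, {"play": "yes"})
print(cov, acc)
```

A rule miner would keep only rules whose coverage and accuracy both exceed chosen minimums.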

Clustering

Finding groups of items that are similar

Clustering is unsupervised
The class of an example is not known

Success often measured subjectively

Row  Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica

Numeric prediction

Variant of classification learning where class is numeric (also called regression)

Learning is supervised
Scheme is being provided with target value

Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40

What's in an example?

Instance: specific type of example

Thing to be classified, associated, or clustered

Individual, independent example of target concept

Characterized by a predetermined set of attributes

Input to learning scheme: set of instances/dataset

Represented as a single relation/flat file

Rather restricted form of input

No relationships between objects

Most common form in practical data mining

A family tree

Peter (M) = Peggy (F)
Children: Steven (M), Graham (M), Pam (F)

Grace (F) = Ray (M)
Children: Ian (M), Pippa (F), Brian (M)

Pam (F) = Ian (M)
Children: Anna (F), Nikki (F)

Family tree represented as a table

Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian

The sister-of relation

First person  Second person  Sister of?
Peter         Peggy          No
Peter         Steven         No
Steven        Peter          No
Steven        Graham         No
Steven        Pam            Yes
Ian           Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes

Closed-world assumption:

First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No

A full representation in one table

First person                      Second person                     Sister of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Steven  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Graham  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
Ian     Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Brian   Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
Anna    Female  Pam      Ian      Nikki   Female  Pam      Ian      Yes
Nikki   Female  Pam      Ian      Anna    Female  Pam      Ian      Yes
All the rest                                                        No

If second person's gender = female
and first person's parent = second person's parent
then sister-of = yes
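
The rule can be checked against the family table with a short sketch. The function name `sister_of` and the dictionary layout are hypothetical. Two details go beyond the rule as written and are assumptions: a person is not paired with herself, and unknown parents (shown as "?" in the table, `None` here) never count as equal.

```python
# Family table from the slides: name -> (gender, parent1, parent2); None = unknown.
people = {
    "Peter": ("male", None, None),
    "Peggy": ("female", None, None),
    "Steven": ("male", "Peter", "Peggy"),
    "Graham": ("male", "Peter", "Peggy"),
    "Pam": ("female", "Peter", "Peggy"),
    "Ian": ("male", "Grace", "Ray"),
    "Pippa": ("female", "Grace", "Ray"),
    "Brian": ("male", "Grace", "Ray"),
    "Anna": ("female", "Pam", "Ian"),
    "Nikki": ("female", "Pam", "Ian"),
}


def sister_of(first, second):
    """Apply the rule: is the second person a sister of the first person?"""
    _, first_p1, first_p2 = people[first]
    second_gender, second_p1, second_p2 = people[second]
    # Assumption: unknown parents are never treated as matching.
    same_parents = first_p1 is not None and (first_p1, first_p2) == (second_p1, second_p2)
    # Assumption: nobody is her own sister.
    return first != second and second_gender == "female" and same_parents


# Flatten into ordered pairs; every pair not listed here is "no"
# (closed-world assumption).
positives = sorted((a, b) for a in people for b in people if sister_of(a, b))
print(positives)
```

The six positive pairs found this way match the rows marked "yes" in the closed-world table.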

Generating a flat file

Process of flattening is called denormalization
Several relations are joined together to make one

Possible with any finite set of finite relations

Problematic: relationships without pre-specified number of objects
Example: concept of nuclear-family

Denormalization may produce spurious regularities that reflect structure of database
Example: "supplier" predicts "supplier address"

The ancestor-of relation

First person                      Second person                     Ancestor of?
Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
Peter   Male    ?        ?        Steven  Male    Peter    Peggy    Yes
Peter   Male    ?        ?        Pam     Female  Peter    Peggy    Yes
Peter   Male    ?        ?        Anna    Female  Pam      Ian      Yes
Peter   Male    ?        ?        Nikki   Female  Pam      Ian      Yes
Pam     Female  Peter    Peggy    Nikki   Female  Pam      Ian      Yes
Grace   Female  ?        ?        Ian     Male    Grace    Ray      Yes
Grace   Female  ?        ?        Nikki   Female  Pam      Ian      Yes
Other positive examples here                                        Yes
All the rest                                                        No

Recursion

Appropriate techniques are known as inductive logic programming (e.g. Quinlan's FOIL)

Problems: (a) noise and (b) computational complexity

If person1 is a parent of person2
then person1 is an ancestor of person2

If person1 is a parent of person2
and person2 is an ancestor of person3
then person1 is an ancestor of person3

Infinite relations require recursion
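
The two ancestor rules translate directly into a recursive function. This is only a sketch of how the rules evaluate, not of how an inductive logic programming system would learn them. The `children` mapping is built from the family tree in the slides, and the function name `is_ancestor` is hypothetical.

```python
# parent -> children, taken from the family tree in the slides.
children = {
    "Peter": ["Steven", "Graham", "Pam"],
    "Peggy": ["Steven", "Graham", "Pam"],
    "Grace": ["Ian", "Pippa", "Brian"],
    "Ray": ["Ian", "Pippa", "Brian"],
    "Pam": ["Anna", "Nikki"],
    "Ian": ["Anna", "Nikki"],
}


def is_ancestor(person1, person2):
    """True if person1 is an ancestor of person2 under the two rules."""
    for child in children.get(person1, []):
        # Rule 1: a parent of person2 is an ancestor of person2.
        if child == person2:
            return True
        # Rule 2: a parent of an ancestor of person2 is also an ancestor.
        if is_ancestor(child, person2):
            return True
    return False


print(is_ancestor("Peter", "Nikki"))
print(is_ancestor("Steven", "Pam"))
```

The recursion terminates here because the family tree has no cycles; noisy data that introduced a cycle would need an explicit visited set.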

Multi-instance Concepts

Each individual example comprises a set of instances
All instances are described by the same attributes

One or more instances within an example may be responsible for its classification

Goal of learning is still to produce a concept description

Important real-world applications
e.g. drug activity prediction

What's in an attribute?

Each instance is described by a fixed predefined set of features, its attributes

But: number of attributes may vary in practice
Possible solution: "irrelevant value" flag

Related problem: existence of an attribute may depend on value of another one

Possible attribute types ("levels of measurement"):
Nominal, ordinal, interval and ratio

Nominal quantities

Values are distinct symbols
Values themselves serve only as labels or names

Nominal comes from the Latin word for name

Example: attribute outlook from weather data
Values: sunny, overcast, and rainy

No relation is implied among nominal values (no ordering or distance measure)

Only equality tests can be performed

Ordinal quantities

Impose order on values

But: no distance between values defined

Example:
attribute temperature in weather data
Values: hot > mild > cool

Note: addition and subtraction don't make sense

Example rule:
temperature < hot => play = yes

Distinction between nominal and ordinal not always clear (e.g. attribute outlook)
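
One way to support order comparisons on an ordinal attribute is to map each value to its position on the scale. The sketch below does this for the temperature example; the names `TEMPERATURE_ORDER`, `rank`, and `rule_fires` are hypothetical. Only the order of the ranks is meaningful, not their differences.

```python
# Ordinal scale for temperature, lowest to highest (slide: hot > mild > cool).
TEMPERATURE_ORDER = ["cool", "mild", "hot"]


def rank(value):
    """Position of a value on the ordinal scale; only the order is meaningful."""
    return TEMPERATURE_ORDER.index(value)


def rule_fires(temperature):
    """Antecedent of the example rule: temperature < hot."""
    return rank(temperature) < rank("hot")


print([t for t in TEMPERATURE_ORDER if rule_fires(t)])
```

A nominal attribute such as outlook would support only the equality test, so no comparable ordering list exists for it.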

Interval quantities

Interval quantities are not only ordered but measured in fixed and equal units

Example 1: attribute temperature expressed in degrees Fahrenheit

Example 2: attribute year

Difference of two values makes sense

Sum or product doesn't make sense
Zero point is not defined!

Ratio quantities

Ratio quantities are ones for which the measurement scheme defines a zero point

Example: attribute distance
Distance between an object and itself is zero

Ratio quantities are treated as real numbers
All mathematical operations are allowed

But: is there an inherently defined zero point?
Answer depends on scientific knowledge (e.g. Fahrenheit knew no lower limit to temperature)
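
The difference between interval and ratio scales shows up when the unit of measurement changes. The following sketch uses made-up values: the ratio of two Fahrenheit temperatures changes when the same temperatures are expressed in Celsius, whereas the ratio of two distances is the same in miles and kilometres.

```python
import math


def fahrenheit_to_celsius(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32.0) * 5.0 / 9.0


# Interval scale: the zero point is arbitrary, so ratios depend on the unit.
hot_f, cold_f = 80.0, 40.0  # hypothetical readings
ratio_f = hot_f / cold_f
ratio_c = fahrenheit_to_celsius(hot_f) / fahrenheit_to_celsius(cold_f)
print(ratio_f, ratio_c)  # the two "ratios" disagree

# Ratio scale: distance has a true zero, so ratios survive a change of unit.
KM_PER_MILE = 1.609344
far_mi, near_mi = 10.0, 5.0  # hypothetical distances
ratio_mi = far_mi / near_mi
ratio_km = (far_mi * KM_PER_MILE) / (near_mi * KM_PER_MILE)
print(math.isclose(ratio_mi, ratio_km))
```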

Attribute types used in practice

Most schemes accommodate just two levels of measurement: nominal and ordinal

Nominal attributes are also called categorical, enumerated, or discrete
But: enumerated and discrete imply order

Special case: dichotomy (boolean attribute)

Ordinal attributes are called numeric, or continuous
But: continuous implies mathematical continuity

Metadata

Information about the data that encodes background knowledge

Can be used to restrict search space

Examples:

Dimensional considerations
(i.e. expressions must be dimensionally correct)

Circular orderings
(e.g. degrees in compass)

Partial orderings
(e.g. generalization/specialization relations)

Preparing the input

Denormalization is not the only issue

Problem: different data sources (e.g. sales department, customer billing department, ...)
Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors

Data must be assembled, integrated, cleaned up

Data warehouse: consistent point of access

External data may be required (overlay data)

Critical: type and level of data aggregation

The ARFF format

%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
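
To make the file layout concrete, here is a minimal sketch of a reader for small dense ARFF files like the one above. It is not Weka's parser: the function name `parse_arff` is hypothetical, and the sketch assumes unquoted values, one instance per line, and only nominal and numeric attributes. It treats "?" as a missing value, which is the marker ARFF uses.

```python
ARFF_TEXT = """\
% ARFF file for weather data with some numeric features
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""


def parse_arff(text):
    """Parse a small dense ARFF file into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":
                    row[name] = None  # missing value
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value  # nominal value kept as a string
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                # Nominal attribute: kind is the list of allowed values.
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, len(attributes), rows[0])
```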

Additional attribute types

ARFF supports string attributes:

@attribute description string

Similar to nominal attributes but list of values is not pre-specified

It also supports date attributes:

@attribute today date

Uses the ISO-8601 combined date and time format yyyy-MM-ddTHH:mm:ss
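
A value of a date attribute can be read in Python as sketched below. The date value is hypothetical, and the `strptime` pattern is an assumed equivalent of the ISO-8601 combined format named above (four-digit year, a literal T between date and time).

```python
from datetime import datetime

# Assumed Python equivalent of the ISO-8601 combined date and time pattern.
ISO_FORMAT = "%Y-%m-%dT%H:%M:%S"


def parse_arff_date(text):
    """Parse one date value written in the combined date and time format."""
    return datetime.strptime(text, ISO_FORMAT)


today = parse_arff_date("2004-04-03T12:00:00")  # hypothetical value
print(today.isoformat())
```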

Relational attributes

Allow multi-instance problems to be represented in ARFF formatThe value of a relational attribute is a separate set of instances

Nested attribute block gives the structure of the referenced instances

@attribute bag relational
  @attribute outlook { sunny, overcast, rainy }
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy { true, false }
@end bag

Multi-instance ARFF

%
% Multiple instance ARFF file for the weather data
%
@relation weather

@attribute bag_ID { 1, 2, 3, 4, 5, 6, 7 }
@attribute bag relational
  @attribute outlook {sunny, overcast, rainy}
  @attribute temperature numeric
  @attribute humidity numeric
  @attribute windy {true, false}
@end bag
@attribute play? {yes, no}

@data
1, "sunny, 85, 85, false\nsunny, 80, 90, true", no
2, "overcast, 83, 86, false\nrainy, 70, 96, false", yes
...
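
One data line of a multi-instance file can be unpacked as in the sketch below. It assumes the value of the relational attribute is written as a double-quoted string whose instances are separated by a literal backslash-n, with the bag ID before it and the class label after it. The function name `parse_bag_line` is hypothetical.

```python
import csv


def parse_bag_line(line):
    """Split one multi-instance data line into (bag_id, instances, label)."""
    # csv handles the quoted bag; skipinitialspace drops the blank after commas.
    bag_id, bag, label = next(csv.reader([line], skipinitialspace=True))
    # Inside the bag, instances are separated by the two characters "\" and "n".
    instances = [
        [value.strip() for value in instance.split(",")]
        for instance in bag.split("\\n")
    ]
    return bag_id, instances, label


# Raw string so that \n stays a literal backslash followed by n.
line = r'1, "sunny, 85, 85, false\nsunny, 80, 90, true", no'
bag_id, instances, label = parse_bag_line(line)
print(bag_id, instances, label)
```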

Sparse data

In some applications most attribute values in a dataset are zero
E.g.: word counts in a text categorization problem

ARFF supports sparse data:

0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"
0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"

{1 26, 6 63, 10 "class A"}
{3 42, 10 "class B"}

This also works for nominal attributes (where the first value corresponds to zero)
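
The conversion from a dense row to the sparse form can be sketched as follows. Attribute indices start at zero, and only numeric zeros are dropped; the nominal case, where the first declared value plays the role of zero, is not handled. The function name `to_sparse` is hypothetical, and quoting values that contain a space is an assumption about how such values are written.

```python
def to_sparse(row):
    """Format a dense row as a sparse instance: {index value, ...}."""
    parts = []
    for index, value in enumerate(row):
        if value == 0:
            continue  # zeros are omitted in the sparse format
        if isinstance(value, str) and " " in value:
            text = f'"{value}"'  # assumption: quote values containing spaces
        else:
            text = str(value)
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


dense_rows = [
    [0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"],
    [0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"],
]
sparse_rows = [to_sparse(row) for row in dense_rows]
print("\n".join(sparse_rows))
```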

Attribute types

Interpretation of attribute types in ARFF depends on learning scheme

Numeric attributes are interpreted as
ordinal scales if less-than and greater-than are used

ratio scales if distance calculations are performed (normalization/standardization may be required)

Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise)

Integers in some given data file: nominal, ordinal, or ratio scale?
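
A distance function over mixed attributes might look like the sketch below. Numeric attributes are min-max normalized before differences are taken, and nominal attributes contribute 0 when the values are equal and 1 otherwise. The choice of Euclidean distance, the function names, and the attribute ranges are assumptions made for illustration.

```python
import math


def normalize(value, lo, hi):
    """Min-max scale a numeric value to the range [0, 1]."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0


def distance(a, b, numeric_ranges):
    """Euclidean distance between two instances with mixed attributes.

    numeric_ranges maps numeric attribute names to (min, max); every other
    attribute is treated as nominal: difference 0 if equal, 1 otherwise.
    """
    total = 0.0
    for name in a:
        if name in numeric_ranges:
            lo, hi = numeric_ranges[name]
            diff = normalize(a[name], lo, hi) - normalize(b[name], lo, hi)
        else:
            diff = 0.0 if a[name] == b[name] else 1.0
        total += diff * diff
    return math.sqrt(total)


x = {"outlook": "sunny", "temperature": 85, "humidity": 85}
y = {"outlook": "overcast", "temperature": 83, "humidity": 86}
ranges = {"temperature": (64, 85), "humidity": (65, 96)}  # hypothetical ranges
d = distance(x, y, ranges)
print(d)
```

Without normalization, an attribute measured in large units would dominate the sum, which is why the slide notes that normalization or standardization may be required.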

Nominal vs. ordinal

Attribute age nominal

If age = young and astigmatic = no
and tear production rate = normal
then recommendation = soft

If age = pre-presbyopic and astigmatic = no
and tear production rate = normal
then recommendation = soft

Attribute age ordinal
(e.g. young < pre-presbyopic < presbyopic)

If age <= pre-presbyopic and astigmatic = no
and tear production rate = normal
then recommendation = soft

Missing values

Frequently indicated by out-of-range entries
Types: unknown, unrecorded, irrelevant

Reasons:
malfunctioning equipment

changes in experimental design

collation of different datasets

measurement not possible

Missing value may have significance in itself (e.g. missing test in a medical examination)
Most schemes assume that is not the case: "missing" may need to be coded as additional value
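
Coding "missing" as an additional value can be sketched in a few lines. The column contents, the function name `code_missing_as_value`, and the label "missing" are hypothetical; "?" is the marker ARFF uses for a missing value.

```python
from collections import Counter


def code_missing_as_value(column, marker="?", label="missing"):
    """Replace the missing-value marker with an explicit nominal value."""
    return [label if value == marker else value for value in column]


outlook = ["sunny", "?", "overcast", "rainy", "?"]  # hypothetical column
coded = code_missing_as_value(outlook)
counts = Counter(coded)
print(coded, counts["missing"])
```

After recoding, a learning scheme sees "missing" as one more nominal value and can use it in tests like any other.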

Inaccurate values

Reason: data has not been collected for mining it

Result: errors and omissions that don't affect original purpose of data (e.g. age of customer)

Typographical errors in nominal attributes => values need to be checked for consistency

Typographical and measurement errors in numeric attributes => outliers need to be identified

Errors may be deliberate (e.g. wrong zip codes)

Other problems: duplicates, stale data

Getting to know the data

Simple visualization tools are very useful

Nominal attributes: histograms
(Distribution consistent with background knowledge?)

Numeric attributes: graphs
(Any obvious outliers?)

2-D and 3-D plots show dependencies

Need to consult domain experts

Too much data to inspect? Take a sample!
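
A first look at the data needs very little tooling. The sketch below prints a text histogram for a nominal column, lists the largest values of a numeric column as a crude outlier check, and draws a repeatable random sample. The columns are made up, and the function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical columns in the style of the weather data.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast"]
temperature = [85, 80, 83, 70, 68, 65, 640]  # 640 is a deliberate typo-style outlier


def text_histogram(values):
    """Return one 'value count bar' line per distinct nominal value."""
    counts = Counter(values)
    return [
        f"{value:<10}{count:>3} {'#' * count}"
        for value, count in sorted(counts.items())
    ]


def take_sample(rows, k, seed=0):
    """Draw k rows without replacement; the seed makes the draw repeatable."""
    return random.Random(seed).sample(rows, min(k, len(rows)))


print("\n".join(text_histogram(outlook)))
print(sorted(temperature)[-3:])  # largest values: an easy first outlier check
print(take_sample(temperature, 3))
```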
