dbm630 lecture04


Page 1: Dbm630 lecture04

DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University, Semester 2/2011
Lecture 4: Data Mining Concepts - Data Preprocessing and Postprocessing
by Kritsada Sriphaew (sriphaew.k AT gmail.com)

Page 2: Dbm630 lecture04

Topics

Data Mining vs. Machine Learning vs. Statistics
Instances with attributes and concepts (input)
Knowledge representation (output)
Why do we need data preprocessing and postprocessing?
Engineering the input: data cleaning, data integration, data transformation and data reduction
Engineering the output: combining multiple models

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 3: Dbm630 lecture04

Data Mining vs. Machine Learning

We are overwhelmed with electronic/recorded data; how can we discover knowledge from such data?

Data Mining (DM) is a process of discovering patterns in data. The process must be automatic or semi-automatic.

Many techniques have been developed within a field known as Machine Learning (ML).

DM is a practical topic and involves learning in a practical, not a theoretical, sense, while ML focuses on the theoretical side.

DM is for gaining knowledge, not just good prediction: DM = ML + topic-oriented + knowledge-oriented

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 4: Dbm630 lecture04

DM&ML vs. Statistics

DM = Statistics + Marketing

Machine learning has been more concerned with formulating the process of generalization as a search through possible hypotheses.

Statistics has been more concerned with testing hypotheses.

Very similar schemes have been developed in parallel in machine learning and statistics, e.g., decision tree induction, classification and regression tree, nearest-neighbor methods.

Most learning algorithms use statistical tests when constructing rules or trees and for correcting models that are “overfitted” in that they depend too strongly on the details of particular examples used for building the model.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 5: Dbm630 lecture04

Generalization as Search

An aspect that distinguishes ML from statistical approaches is a search process through a space of possible concept descriptions for one that fits the data.

Three properties are important for characterizing a machine learning process:
- language bias: the concept description language, e.g., decision trees, classification rules, association rules
- search bias: the order in which the space is explored, e.g., greedy search, beam search
- overfitting-avoidance bias: the way overfitting to the particular training data is avoided, e.g., forward pruning or backward pruning

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 6: Dbm630 lecture04

An Example of Structural Patterns

Part of a structural description of the contact lens data might be as follows:

If tear_production_rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft

[Table: the full contact lens data with attributes age {young, pre-presbyopic, presbyopic}, spectacle prescription {myope, hypermetrope}, astigmatism {no, yes}, tear production rate {reduced, normal}, and the recommended lens {none, soft, hard}.]

All combinations of possible values = 3 x 2 x 2 x 2 = 24 possibilities

Page 7: Dbm630 lecture04

Input: Concepts, Instance & Attributes

Concept description: the thing that is to be learned (the learning result); hard to pin down precisely, but it should be intelligible and operational.

Instances ('examples', the input): the information that the learner is given; a single table vs. multiple tables (denormalization into a single table).
Denormalization sometimes produces apparent regularities, such as a supplier and its supplier address always matching.

Attributes (features): each instance is characterized by a fixed, predefined set of features or attributes.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 8: Dbm630 lecture04

Input: Concepts, Instance & Attributes

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     85     87        True   Sony     85         Y
sunny     80     90        False  HP       90         Y
rainy     87     75        True   Ford     63         Y
rainy     70     95        True   Ford     5          N
overcast  75     65        False  HP       56         Y
overcast  90     94        True   ?        25         N
overcast  65     86        True   Nokia    5          N
rainy     88     92        True   Honda    86         Y
rainy     79     75        False  Ford     78         Y
sunny     85     88        True   Sony     74         Y

Rows are the instances (examples); columns are the attributes (ordinal, numeric and nominal). The play-time and play columns are the concepts to be learned (a numeric and a nominal concept, respectively), and the '?' entry is a missing value.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 9: Dbm630 lecture04

Independent vs. Dependent Instances

Normally, the input data are represented as a set of independent instances.

But there are many problems involving relationships between objects; that is, some instances depend on others.

Ex.: a family tree and the sister-of relation.

Family members: Harry (M), Sally (F), Julia (F), Demi (F), Steven (M), Bruce (M), Tison (M), Bill (M), Diana (F), Nina (F), Rica (F), Richard (M).

Sister-of table (first person, second person, sister?): pairs such as (Steven, Demi), (Bruce, Demi), (Tison, Diana), (Bill, Diana), (Nina, Rica) and (Rica, Nina) are marked Y; all the rest are marked N, a closed-world assumption.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 10: Dbm630 lecture04

Independent vs. Dependent Instances

The family tree can be denormalized into a single table in which each person is described by name, gender, parent1 and parent2: Steven, Bruce and Demi are children of Harry and Sally; Tison, Bill and Diana are children of Richard and Julia; Nina and Rica are children of Demi and Tison; the parents of Harry, Sally, Richard and Julia are unknown ('?').

The sister-of relation over this denormalized table (pairs marked Y as on the previous slide, all the rest N) can then be captured by a single rule:

sister_of(X,Y) :- female(Y),
                  parent(Z,X),
                  parent(Z,Y).

Denormalization

Data Warehousing and Data Mining by Kritsada Sriphaew
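As an illustration only (not from the slides), the same relation can be evaluated over the denormalized facts in Python; the parent/gender dictionaries and the extra x != y guard are assumptions added for this sketch.

```python
# A minimal sketch of the sister_of relation under the closed-world assumption,
# using hypothetical parent/gender facts taken from the family-tree example.
parents = {                      # child -> (parent1, parent2)
    "Steven": ("Harry", "Sally"), "Bruce": ("Harry", "Sally"), "Demi": ("Harry", "Sally"),
    "Tison": ("Richard", "Julia"), "Bill": ("Richard", "Julia"), "Diana": ("Richard", "Julia"),
    "Nina": ("Demi", "Tison"), "Rica": ("Demi", "Tison"),
}
female = {"Sally", "Julia", "Demi", "Diana", "Nina", "Rica"}

def sister_of(x, y):
    """True if y is a sister of x: y is female and they share a parent."""
    if x == y or x not in parents or y not in parents:
        return False              # closed world: anything not provable is False
    shared = set(parents[x]) & set(parents[y])
    return y in female and bool(shared)

print(sister_of("Steven", "Demi"))  # True
print(sister_of("Demi", "Steven"))  # False (Steven is not female)
```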

Page 11: Dbm630 lecture04

Problems of Denormalization

A large table with duplicated values is produced.

Relations among instances (rows) are ignored.

Some regularities in the data are merely reflections of the original database structure but might be found by the data mining process, e.g., supplier and supplier address.

Some relations are not finite, e.g., the ancestor-of relation. Inductive logic programming can use recursion to deal with such situations (an infinite number of possible instances), as in the sketch below.

If person1 is a parent of person2, then person1 is an ancestor of person2. If person1 is a parent of person2 and person2 is a parent of person3, then person1 is an ancestor of person3.
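A minimal sketch (an assumption, not from the slides) of the recursive ancestor-of relation that inductive logic programming can express:

```python
# ancestor_of expressed recursively over hypothetical child -> parents facts.
def ancestor_of(a, b, parents):
    """True if a is an ancestor of b (parent, grandparent, ...)."""
    direct = parents.get(b, ())
    return a in direct or any(ancestor_of(a, p, parents) for p in direct)

parents = {"Nina": ("Demi", "Tison"), "Demi": ("Harry", "Sally")}
print(ancestor_of("Harry", "Nina", parents))  # True: Harry -> Demi -> Nina
```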

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 12: Dbm630 lecture04

Missing, Inaccurate, duplicated values

Many practical datasets may include three types of errors:

Missing values
- frequently indicated by out-of-range entries (e.g., a negative number)
- unknown vs. unrecorded vs. irrelevant values

Inaccurate values
- typographical errors: misspelling, mistyping
- measurement errors: errors generated by a measuring machine
- intentional errors, e.g., entering the zip code of the rental agency instead of the renter's zip code

Duplicated values
- repetition of data gives such data more influence on the result

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 13: Dbm630 lecture04

Output: Knowledge Representation

There are many different ways of representing the patterns that can be discovered by machine learning. Some popular ones are: decision tables, decision trees, classification rules, association rules, rules with exceptions, rules involving relations, trees for numeric prediction, instance-based representations, and clusters.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 14: Dbm630 lecture04

Decision Tables

The simplest, most rudimentary way of representing the output from machine learning or data mining.

Ex.: a decision table for the weather data to decide whether or not to "play":

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     hot    high      True   Sony     85         Y
sunny     hot    high      False  HP       90         Y
rainy     hot    normal    True   Ford     63         Y
rainy     mild   high      True   Ford     5          N
overcast  cool   low       False  HP       56         Y
overcast  hot    low       True   Sony     25         N
overcast  cool   normal    True   Nokia    5          N
rainy     mild   high      True   Honda    86         Y
rainy     mild   low       False  Ford     78         Y
sunny     hot    high      True   Sony     74         Y

Two open issues: (1) how to make a smaller, condensed table in which useless attributes are omitted, and (2) how to cope with a case that does not exist in the table.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 15: Dbm630 lecture04

Decision Trees (1)

A "divide-and-conquer" approach to the problem of learning.

Ex.: a decision tree (DT) for the contact lens data to decide which type of contact lens is suitable:

tear production rate?
  reduced -> none
  normal  -> astigmatism?
               no  -> soft
               yes -> spectacle prescription?
                        myope        -> hard
                        hypermetrope -> none

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 16: Dbm630 lecture04

Decision Trees (2)

Nodes in a DT involve testing a particular attribute against a constant. However, it is possible to compare two attributes with each other, or to utilize some function of one or more attributes.

If the attribute that is tested at a node is a nominal one, the number of children is usually the number of possible values of the attribute.

In this case, the same attribute will not be tested again further down the tree.

In the case where the attribute values are divided into two subsets, the attribute might be tested more than once along a path.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 17: Dbm630 lecture04

Decision Trees (3)

If the attribute is numeric, the test at a node usually determines whether its value is greater or less than a predetermined constant.

If a missing value is treated as an attribute value, there will be a third branch.

Three-way split into (1) less-than, equal-to and greater-than, or (2) below, within and above.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 18: Dbm630 lecture04

Classification Rules (1)

A popular alternative to decision trees; also called a decision list.

Ex.:
If outlook = sunny and humidity = high then play = yes
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes

(The weather decision table from the previous slide is repeated on this slide.)

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 19: Dbm630 lecture04

Classification Rules (2)

A set of rules is interpreted in sequence.

The antecedent (or precondition) is a series of tests, while the consequent (or conclusion) gives the class or classes that apply to the instance.

It is easy to read a set of rules directly off a decision tree, but the opposite is not quite straightforward.

Ex.: the replicated subtree problem
If a and b then x
If c and d then x

[Decision tree for these two rules: the root tests a; the a = y branch tests b (b = y gives x), and both the a = n branch and the b = n branch must each repeat the subtree that tests c and then d before reaching x, so that subtree is replicated.]

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 20: Dbm630 lecture04

Classification Rules (3)

One reason why classification rules are popular: each rule seems to represent an independent "nugget" of knowledge.

New rules can be added to an existing rule set without disturbing those already there (in the DT case, it is necessary to reshape the whole tree).

If a rule set gives multiple classifications for a particular example, one solution is to give no conclusion at all.

Another solution is to count how often each rule fires on the training data and go with the most popular one.

One more problem occurs when an instance is encountered that the rules fail to classify at all. Solutions: (1) fail to classify, or (2) choose the most popular class

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 21: Dbm630 lecture04

Classification Rules (4)

A particularly straightforward situation arises when rules lead to a class that is boolean (y/n) and only rules leading to one outcome (say, yes) are expressed: a form of the closed-world assumption.

The resulting rules cannot conflict, and there is no ambiguity in rule interpretation.

Such a set of rules can be written as a logic expression in disjunctive normal form (a disjunction (OR) of conjunctive (AND) conditions).

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 22: Dbm630 lecture04

Association Rules (1)

Association rules are really no different from classification rules except that they can predict any attribute, not just the class.

This gives them the freedom to predict combinations of attributes, too.

Association rules (ARs) are not intended to be used together as a set, as classification rules are.

Different ARs express different regularities that underlie the dataset, and they generally predict different things.

Even from a small dataset, a large number of ARs can be generated, so some constraints are needed for finding useful rules. The two most popular ones are (1) support and (2) confidence.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 23: Dbm630 lecture04

Association Rules (2)

For an association rule x => y, the support is s = p(x,y) and the confidence is c = p(x,y)/p(x). For example (from the weather data):

If temperature = hot then humidity = high (s = 3/10, c = 3/5)
If windy = true and play = Y then humidity = high and outlook = overcast (s = 2/10, c = 2/4)
If windy = true and play = Y and humidity = high then outlook = overcast (s = 2/10, c = 2/3)

(The weather decision table shown earlier is repeated on this slide.)
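As a rough illustration (not the lecture's code), the support and confidence of the first rule can be recomputed from the temperature and humidity columns of the decision table as reconstructed earlier:

```python
# Support and confidence of "temperature = hot => humidity = high".
temperature = ["hot", "hot", "hot", "mild", "cool", "hot", "cool", "mild", "mild", "hot"]
humidity    = ["high", "high", "normal", "high", "low", "low", "normal", "high", "low", "high"]
rows = list(zip(temperature, humidity))

n = len(rows)
antecedent = [r for r in rows if r[0] == "hot"]         # rows matching x
both       = [r for r in antecedent if r[1] == "high"]  # rows matching x and y

support = len(both) / n                    # p(x, y)        -> 3/10
confidence = len(both) / len(antecedent)   # p(x, y)/p(x)   -> 3/5
print(support, confidence)
```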

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 24: Dbm630 lecture04

Rules with Exception (1)

For classification rules, incremental modifications can be made to a rule set by expressing exceptions to existing rules rather than by re-engineering the entire set.

Ex.: If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor

A new case: sepal length = 5.1, sepal width = 3.5, petal length = 2.6, petal width = 0.2, type = Iris-setosa

The rule misclassifies the new case, so an exception is added:
If petal-length >= 2.45 and petal-length < 4.45 then Iris-versicolor
EXCEPT if petal-width < 1.0 then Iris-setosa

Of course, we can have exceptions to the exceptions, exceptions to these, and so on.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 25: Dbm630 lecture04

Rules with Exception (2)

Rules with exceptions can be used to represent the entire concept description in the first place.

Ex.:
Default: Iris-setosa
  except if petal-length >= 2.45 and petal-length < 5.355 and petal-width < 1.75
    then Iris-versicolor
      except if petal-length >= 4.95 and petal-width < 1.55
        then Iris-virginica
      else if sepal-length < 4.95 and sepal-width >= 2.45
        then Iris-virginica
  else if petal-length >= 3.35
    then Iris-virginica
      except if petal-length < 4.85 and sepal-length < 5.95
        then Iris-versicolor

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 26: Dbm630 lecture04

Rules with Exception (3)

Rules with exceptions can be proved to be logically equivalent to if-then-else statements.

Plausibly, for the user, an expression in terms of (common) rules and (rare) exceptions will be easier to grasp than a plain if-else structure.
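To make the equivalence concrete, here is a minimal sketch (with assumed attribute-name spellings and made-up test measurements) that translates the default/except rule set from the previous slide into plain if-else code; it is an illustration, not code from the lecture.

```python
def classify_iris(sepal_length, sepal_width, petal_length, petal_width):
    if 2.45 <= petal_length < 5.355 and petal_width < 1.75:
        # main rule: Iris-versicolor, with two exceptions
        if petal_length >= 4.95 and petal_width < 1.55:
            return "Iris-virginica"
        elif sepal_length < 4.95 and sepal_width >= 2.45:
            return "Iris-virginica"
        return "Iris-versicolor"
    elif petal_length >= 3.35:
        # second rule: Iris-virginica, with one exception
        if petal_length < 4.85 and sepal_length < 5.95:
            return "Iris-versicolor"
        return "Iris-virginica"
    return "Iris-setosa"   # the default

print(classify_iris(5.0, 3.3, 1.4, 0.2))  # Iris-setosa
print(classify_iris(5.9, 3.0, 4.2, 1.5))  # Iris-versicolor
```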

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 27: Dbm630 lecture04

Rules involving relations (1)

So far, the conditions in rules involve testing an attribute value against a constant.

Such rules are called propositional (as in propositional calculus).

However, there are situations where a more expressive form of rule would provide a more intuitive and concise concept description.

Ex.: the concept of standing up.
There are two classes: standing and lying.
The information given is the width, the height and the number of sides of each block.

[Figure: example blocks labeled lying and standing.]

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 28: Dbm630 lecture04

Rules involving relations (2)

A propositional rule set produced for this data might be:

If width >= 3.5 and height < 7.0 then lying
If height >= 3.5 then standing

A rule set with relations that would be produced is:

If width(b) > height(b) then lying
If height(b) > width(b) then standing

width  height  sides  class
2      4       4      standing
2      6       4      standing
7      3       4      lying
4      8       3      standing
10     6       3      lying
7      9       3      standing
9      1       4      lying
3      2       3      lying

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 29: Dbm630 lecture04

Trees for numeric prediction

Instead of predicting categories, predicting numeric quantities is also very important.

We can use a regression equation.

There are two further knowledge representations: the regression tree and the model tree.
Regression trees are decision trees with averaged numeric values at the leaves.

It is possible to combine regression equations with regression trees: the resulting model is a model tree, a tree whose leaves contain linear expressions.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 30: Dbm630 lecture04

An Example of Numeric Prediction: CPU Performance

Linear regression over all attributes:
PRP = -55.9 + 0.0489 MYCT + 0.153 MMIN + 0.0056 MMAX + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX

[Table: a sample of the CPU performance data, with attributes MYCT (cycle time), MMIN and MMAX (main memory min/max), CACH (cache, KB), CHMIN and CHMAX (channels min/max), and the performance PRP to be predicted.]

[Figures: a regression tree for the same data, with averaged PRP values and their coverage at the leaves, and a model tree whose leaves contain the linear models below.]

LM1: PRP = 8.29 + 0.004 MMAX + 2.77 CHMIN
LM2: PRP = 20.3 + 0.004 MMIN - 3.99 CHMIN + 0.946 CHMAX
LM3: PRP = 38.1 + 0.12 MMIN
LM4: PRP = 19.5 + 0.02 MMAX + 0.698 CACH + 0.969 CHMAX
LM5: PRP = 285 + 1.46 MYCT + 1.02 CACH - 9.39 CHMIN
LM6: PRP = -65.8 + 0.03 MMIN - 2.94 CHMIN + 4.98 CHMAX

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 31: Dbm630 lecture04

Instance-based representation (1)

The simplest form of learning is plain memorization.

When a new instance is encountered, the memory is searched for the training instance that most strongly resembles it.

This is a completely different way of representing the “knowledge” extracted from a set of instances: just store the instances themselves and operate by relating new instances whose class is unknown to existing ones whose class is known.

Instead of creating rules, work directly from the examples themselves.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 32: Dbm630 lecture04

Instance-based representation (2)

32

Instance-based learning is lazy, deferring the real work as long as possible.

Other methods are eager, producing a generalization as soon as the data has been seen.

In instance-based learning, each new instance is compared with existing ones using a distance metric, and the closest existing instance is used to assign the class to the new one. This is also called the nearest-neighbor classification method.

Sometimes more than one nearest neighbor is used, and the majority class of the closest k neighbors is assigned to the new instance. This is termed the k-nearest-neighbor method.
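A minimal k-nearest-neighbor sketch (an illustration with made-up data, not the lecture's code): numeric attributes contribute squared differences, nominal attributes a 0/1 mismatch as described on the next slide, and the majority class among the k closest stored instances is returned.

```python
from collections import Counter
from math import sqrt

def distance(a, b):
    total = 0.0
    for x, y in zip(a, b):
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            total += (x - y) ** 2            # numeric attribute
        else:
            total += 0.0 if x == y else 1.0  # nominal attribute: 0 if identical, else 1
    return sqrt(total)

def knn_classify(query, training, k=3):
    # training is a list of (attribute_tuple, class_label) pairs
    neighbors = sorted(training, key=lambda inst: distance(query, inst[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

data = [((85, 87, "sunny"), "Y"), ((70, 95, "rainy"), "N"), ((65, 86, "overcast"), "N")]
print(knn_classify((80, 90, "sunny"), data, k=1))  # "Y"
```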

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 33: Dbm630 lecture04

Instance-based representation (3)

When computing the distance between two examples, the standard Euclidean distance may be used.

When nominal attributes are present, we may use the following procedure: a distance of 0 is assigned if the values are identical, otherwise the distance is 1.

Some attributes will be more important than others, so we need some kind of attribute weighting; getting suitable attribute weights from the training set is a key problem.

It may not be necessary, or desirable, to store all the training instances: this reduces the nearest-neighbor calculation time and avoids unrealistic amounts of storage.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 34: Dbm630 lecture04

Instance-based representation (4)

34

Generally some regions of attribute space are more stable with regard to class than others, and just a few examples are needed inside stable regions.

An apparent drawback of instance-based representations is that they do not make the learned structures explicit.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 35: Dbm630 lecture04

Clusters

The output takes the form of a diagram that shows how the instances fall into clusters.

The simplest case involves associating a cluster number with each instance (Fig. a).

Some clustering algorithms allow one instance to belong to more than one cluster, shown as a Venn diagram (Fig. b).

Some algorithms associate instances with clusters probabilistically rather than categorically (Fig. c).

Other algorithms produce a hierarchical structure of clusters, called a dendrogram (Fig. d).

Clustering may also be combined with other learning methods to improve performance.

[Figures a to d: instances a to h shown as (a) disjoint clusters, (b) overlapping clusters in a Venn diagram, (c) a table of cluster-membership probabilities, and (d) a dendrogram.]

Page 36: Dbm630 lecture04

Why Data Preprocessing? (1)

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data. Ex: occupation = " "
- noisy: containing errors or outliers. Ex: salary = "-10"
- inconsistent: containing discrepancies in codes or names. Ex: Age = "42" but Birthday = "01/01/1997"; rating was "1,2,3" but is now "A,B,C"

No quality data, no quality mining results!
- Quality decisions must be based on quality data.
- A data warehouse needs consistent integration of quality data.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 37: Dbm630 lecture04

Why Data Preprocessing? (2)

To integrate multiple sources of data into a more meaningful whole.

To transform data into a form that makes sense and is more descriptive.

To reduce the size, (1) in cardinality and/or (2) in variety, in order to improve computational time and accuracy.

Multi-dimensional measure of data quality (a well-accepted multidimensional view): accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 38: Dbm630 lecture04

Major Tasks in Data Preprocessing

Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies.

Data integration: integration of multiple databases, data cubes, or files.

Data transformation and data reduction: normalization and aggregation; obtains a representation reduced in volume that produces the same or similar analytical results.

Data discretization: data reduction, especially for numerical data.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 39: Dbm630 lecture04

Forms of Data Preprocessing

[Figure: the forms of data preprocessing: data integration, data cleaning, data transformation, and data reduction.]

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 40: Dbm630 lecture04

Topics in Data Cleaning

40

Data cleaning tasks

Fill in missing values

Identify outliers and smooth out noisy data

Correct inconsistent data

Advanced techniques for automatic data cleaning

Improving decision trees

Robust regression

Detecting anomalies

Data Cleaning

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 41: Dbm630 lecture04

Missing Data

Data is not always available, e.g., many tuples have no recorded value for several attributes, such as customer income in sales data.

Missing data may be due to:
- equipment malfunction
- data that was inconsistent with other recorded data and thus deleted
- data not entered due to misunderstanding
- data not considered important at the time of entry
- history or changes of the data not being registered

Missing data may need to be inferred.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 42: Dbm630 lecture04

How to Handle Missing Data?

Ignore the tuple: usually done when the class label is missing.

Fill in the missing value manually: tedious and possibly infeasible.

Use a global constant to fill in the missing value: e.g., "unknown", which effectively creates a new class.

Use the attribute mean to fill in the missing value.

Use the attribute mean of all samples belonging to the same class to fill in the missing value: smarter.

Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree. This is the most popular approach and preserves the relationship between the missing attribute and the other attributes.
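Below is a minimal sketch, not the lecture's code, of the attribute-mean and class-conditional-mean strategies, applied to the humidity column of the example on the next slide (None marks a missing value).

```python
humidity = [87, 90, 75, 95, None, 94, 86, 92, 75, 88]
play     = ["Y", "Y", None, "N", "Y", "N", "N", "Y", "Y", "Y"]

# attribute mean over all known values
known = [h for h in humidity if h is not None]
mean_all = sum(known) / len(known)                       # ~86.9

# attribute mean over samples of the same class (here play = Y)
known_y = [h for h, p in zip(humidity, play) if h is not None and p == "Y"]
mean_y = sum(known_y) / len(known_y)                     # 86.4

filled = [h if h is not None else round(mean_y, 1) for h in humidity]
print(round(mean_all, 1), mean_y, filled)
```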

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 43: Dbm630 lecture04

How to Handle Missing Data?

(Examples)

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     85     87        True   Sony     85         Y
sunny     80     90        False  HP       90         Y
rainy     87     75        True   Ford     63         ?
rainy     70     95        True   Ford     5          N
overcast  75     ?         False  HP       56         Y
overcast  90     94        True   ?        25         N
overcast  65     86        True   Nokia    5          N
rainy     88     92        True   Honda    86         Y
rainy     79     75        False  Ford     78         Y
sunny     85     88        ?      Sony     74         Y

The slide annotates the missing entries with the strategies of the previous slide: ignore the instance, check manually, add an "unknown" value, fill with the attribute mean (humidity = 86.9), fill with the class-conditional mean (humidity | play = Y is 86.4), or predict the value with a Bayesian formula or a decision tree.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 44: Dbm630 lecture04

Noisy Data

Noise: random error or variance in a measured variable.

Incorrect attribute values may be due to:
- faulty data collection instruments
- data entry problems
- data transmission problems
- technology limitations
- inconsistency in naming conventions

Other data problems that require data cleaning: duplicate records, incomplete data, inconsistent data.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 45: Dbm630 lecture04

How to Handle Noisy Data

Binning (data smoothing): first sort the data and partition it into (equi-depth) bins, then smooth by bin means, bin medians, bin boundaries, etc.

Clustering: detect and remove outliers.

Combined computer and human inspection: detect suspicious values and have them checked by a human.

Regression: smooth by fitting the data into regression functions.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 46: Dbm630 lecture04

Simple Discretization Methods: Binning

Equal-width (distance) partitioning:
- divides the range into N intervals of equal size (a uniform grid)
- if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
- the most straightforward, but outliers may dominate the presentation (since the lowest/highest values are used), and skewed (asymmetrical) data is not handled well

Equal-depth (frequency) partitioning:
- divides the range into N intervals, each containing around the same number of samples
- good data scaling, but managing categorical attributes can be tricky

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 47: Dbm630 lecture04

Binning Methods for Data Smoothing

Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34

Partition into equi-depth bins (four values per bin):
- Bin 1: 4, 8, 9, 15 (mean = 9, median = 8.5)
- Bin 2: 21, 21, 24, 25 (mean = 22.75, median = 22.5)
- Bin 3: 26, 27, 29, 34 (mean = 29, median = 28)

Smoothing by bin means (each value in a bin is replaced by the mean of the bin; smoothing by bin medians is analogous):
- Bin 1: 9, 9, 9, 9
- Bin 2: 22.75, 22.75, 22.75, 22.75
- Bin 3: 29, 29, 29, 29

Smoothing by bin medians:
- Bin 1: 8.5, 8.5, 8.5, 8.5
- Bin 2: 22.5, 22.5, 22.5, 22.5
- Bin 3: 28, 28, 28, 28

Smoothing by bin boundaries (the minimum and maximum values of a bin are its boundaries; each value is replaced by the closest boundary):
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
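A minimal sketch (not the lecture's code) that reproduces the equi-depth bins and the mean/boundary smoothing shown above:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 27, 29, 34]   # already sorted
depth = 4                                                 # values per bin
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

def smooth_by_means(bins):
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by the nearer of the bin's min and max
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 27, 29, 34]]
print(smooth_by_means(bins))      # [[9.0, ...], [22.75, ...], [29.0, ...]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```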

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 48: Dbm630 lecture04

Cluster Analysis

48

[Clustering] detect and remove outliers

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 49: Dbm630 lecture04

Regression

[Figure: a regression line y = x + 1 fitted to the data; a point (X1, Y1) is smoothed to the value Y1' on the line.]

[Regression] smooth by fitting the data into regression functions

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 50: Dbm630 lecture04

Automatic Data Cleaning (Improving Decision Trees)

50

Improving decision trees: relearn tree with misclassified instances removed or pruning away some subtrees

Better strategy (of course): let human expert check misclassified instances

When systematic noise is present it is better not to modify the data

Also: attribute noise should be left in training set

(Unsystematic) class noise in training set should be eliminated if possible

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 51: Dbm630 lecture04

Automatic Data Cleaning (Robust Regression - I)

51

Statistical methods that address problem of outliers are called robust

Possible way of making regression more robust:

Minimize absolute error instead of squared error

Remove outliers (i. e. 10% of points farthest from the regression plane)

Minimize median instead of mean of squares (copes with outliers in any direction)

Finds narrowest strip covering half the observations

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 52: Dbm630 lecture04

Automatic Data Cleaning

(Robust Regression - II)

52

[Figure: regression fits compared, including a least-absolute-error fit and a fit minimizing perpendicular distance.]

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 53: Dbm630 lecture04

Automatic Data Cleaning (Detecting Anomalies)

Visualization is the best way of detecting anomalies (but often can't be done).

Automatic approach: a committee of different learning schemes, e.g., a decision tree, a nearest-neighbor learner, and a linear discriminant function.

Conservative approach: only delete instances which are incorrectly classified by all of them

Problem: might sacrifice instances of small classes

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 54: Dbm630 lecture04

Data Integration

Data integration: combines data from multiple sources into a coherent store.

Schema integration: integrate metadata from different sources.

Entity identification problem: identify real-world entities from multiple data sources, e.g., how to match A.cust-num with B.customer-id.

Detecting and resolving data value conflicts: for the same real-world entity, attribute values from different sources may differ; possible reasons are different representations or different scales, e.g., metric vs. British units.

Data Integration

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 55: Dbm630 lecture04

Handling Redundant Data in Data Integration

Redundant data occur often when integrating multiple databases:
- the same attribute may have different names in different databases
- one attribute may be a "derived" attribute in another table, e.g., annual revenue

Some redundancies can be detected by correlation analysis. The correlation between attributes A and B is

R(A,B) = Σ_i (a_i - mean(A)) (b_i - mean(B)) / ((n - 1) × sd(A) × sd(B))

where n is the number of tuples, mean(A) and mean(B) are the means, and sd(A) and sd(B) the standard deviations of A and B.

If R(A,B) > 0 then A and B are positively correlated.
If R(A,B) = 0 then A and B are independent.
If R(A,B) < 0 then A and B are negatively correlated.

Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality.
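A minimal sketch (not the lecture's code) of this correlation check on two small made-up columns; a coefficient near 1 or -1 flags a likely redundant attribute:

```python
from math import sqrt

def correlation(a, b):
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a) / (n - 1))
    sd_b = sqrt(sum((x - mean_b) ** 2 for x in b) / (n - 1))
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cov / ((n - 1) * sd_a * sd_b)

A = [1, 2, 3, 4, 5]
B = [2, 4, 6, 8, 10]                  # B is derived from A, so R should be 1.0
print(round(correlation(A, B), 3))    # 1.0 -> redundant attribute
```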

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 56: Dbm630 lecture04

Data Transformation

56

Smoothing: remove noise from the data.

Aggregation: summarization, data cube construction.

Generalization: concept hierarchy climbing.

Normalization: scale values to fall within a small, specified range
- min-max normalization
- z-score normalization
- normalization by decimal scaling

Attribute/feature construction: new attributes constructed from the given ones.

Data Transformation and Data Reduction

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 57: Dbm630 lecture04

Data Transformation: Normalization

min-max normalization:
v' = ((v - min_A) / (max_A - min_A)) × (new_max_A - new_min_A) + new_min_A

z-score normalization:
v' = (v - mean_A) / sd_A, where sd_A is the standard deviation of A

normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
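A minimal sketch (not the lecture's code) of the three normalization methods, applied to a made-up salary column:

```python
from math import floor, log10, sqrt

def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]

def decimal_scaling(values):
    m = max(abs(v) for v in values)
    j = floor(log10(m)) + 1 if m > 0 else 0   # smallest j with max(|v / 10^j|) < 1
    return [v / (10 ** j) for v in values]

salaries = [12000, 35000, 56000, 98000]
print(min_max(salaries))          # values scaled into [0, 1]
print(z_score(salaries))          # mean 0, unit variance
print(decimal_scaling(salaries))  # [0.12, 0.35, 0.56, 0.98]
```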

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 58: Dbm630 lecture04

Data Reduction Strategies

A warehouse may store terabytes of data: complex data analysis/mining may take a very long time to run on the complete data set.

Data reduction obtains a reduced representation of the data set that is much smaller in volume but produces the same (or almost the same) analytical results.

Data reduction strategies:
- data cube aggregation (reduce rows)
- dimensionality reduction (reduce columns)
- numerosity reduction (reduce columns or values)
- discretization / concept hierarchy generation (reduce values)

Data Reduction

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 59: Dbm630 lecture04

Three Types of Data Reduction

Three types of data reduction are:
- reduce the number of columns (features or attributes)
- reduce the number of rows (cases, examples or instances)
- reduce the number of values in a column (numeric/nominal)

outlook   temp.  humidity  windy  Sponsor  play-time  play
sunny     85     87        True   Sony     85         Y
rainy     80     90        False  HP       90         Y
overcast  87     75        True   Ford     63         Y
rainy     70     95        True   Ford     5          N
sunny     75     65        False  HP       56         Y

(Reduction can remove rows, columns, or distinct values from such a table.)

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 60: Dbm630 lecture04

Data Cube Aggregation

Ex.: you are interested in annual sales rather than the total per quarter, so the data can be aggregated and the resulting data summarize the total sales per year instead of per quarter.

The resulting data set is smaller in volume, without loss of information necessary for the analysis task

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 61: Dbm630 lecture04

Dimensionality Reduction

Feature selection (i.e., attribute subset selection):
- select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features
- reduces the number of patterns, easier to understand

Heuristic methods (due to the exponential number of choices):
- decision-tree induction (wrapper approach)
- independent assessment (filter method)
- step-wise forward selection
- step-wise backward elimination
- combining forward selection and backward elimination

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 62: Dbm630 lecture04

Decision Tree Induction (Wrapper Approach)

62

Initial attribute set:

{A1, A2, A3, A4, A5, A6}

A4 ?

A1? A6?

Class 1 Class 2 Class 1 Class 2

Reduced attribute set: {A1, A4, A6}

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 63: Dbm630 lecture04

Numerosity Reduction

65

Parametric methods: assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers).
- Log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces (estimate the probability of each cell in a larger cuboid based on the smaller cuboids).

Non-parametric methods: do not assume models.
- Major families: histograms, clustering, sampling.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 64: Dbm630 lecture04

Regression

66

Linear regression: Y = a + bX

Two parameters, a and b, specify the line; they are estimated from the data at hand by applying the least squares criterion to the known values Y1, Y2, ..., X1, X2, ....

Multiple regression: Y = a + b1X1 + b2X2.

Many nonlinear functions can be transformed into the above.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 65: Dbm630 lecture04

Histograms

A popular data reduction technique.

Divide the data into buckets and store the average (or sum) for each bucket.

Related to quantization problems.

[Histogram figure: price buckets at 10000, 30000, 50000, 70000 and 90000 with their counts.]

67 Data Warehousing and Data Mining by Kritsada Sriphaew

Page 66: Dbm630 lecture04

Clustering

68

Partition data set into clusters, and one can store cluster representation only

Can be very effective if data is clustered but not if data is “smeared (dirty)”

Can have hierarchical clustering and be stored in multi-dimensional index tree structures

There are many choices of clustering definitions and clustering algorithms.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 67: Dbm630 lecture04

Sampling

69

Sampling allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data.

Choose a representative subset of the data: simple random sampling may perform very poorly in the presence of skew (bias).

Develop adaptive sampling methods, e.g., stratified sampling:
- approximate the percentage of each class (or subpopulation of interest) in the overall database
- used in conjunction with skewed (biased) data
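A minimal sketch (not the lecture's code) contrasting simple random sampling with stratified sampling on a made-up, skewed class distribution:

```python
import random
from collections import Counter, defaultdict

data = [("A", i) for i in range(90)] + [("B", i) for i in range(10)]  # 90% A, 10% B

def simple_random_sample(data, n):
    return random.sample(data, n)

def stratified_sample(data, n):
    by_class = defaultdict(list)
    for label, value in data:
        by_class[label].append((label, value))
    sample = []
    for label, items in by_class.items():
        k = round(n * len(items) / len(data))    # keep each class's proportion
        sample.extend(random.sample(items, k))
    return sample

print(Counter(l for l, _ in simple_random_sample(data, 10)))  # may miss class B
print(Counter(l for l, _ in stratified_sample(data, 10)))     # about 9 A and 1 B
```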

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 68: Dbm630 lecture04

Sampling

70

Raw Data

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 69: Dbm630 lecture04

Sampling

71

Raw Data Cluster/Stratified Sample

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 70: Dbm630 lecture04

Discretization

72

Three types of attributes:
- nominal: values from an unordered set
- ordinal: values from an ordered set
- continuous: real numbers

Discretization: divide the range of a continuous attribute into intervals.
- Some classification algorithms only accept categorical attributes.
- Reduces the data size.
- Prepares the data for further analysis.

Discretization and concept hierarchy generation

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 71: Dbm630 lecture04

Discretization and Concept Hierarchy

73

Discretization

reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values.

Concept hierarchies

reduce the data by collecting and replacing low level concepts (such as numeric values for the attribute age) by higher level concepts (such as young, middle-aged, or senior).

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 72: Dbm630 lecture04

Discretization and Concept hierarchy generation - numeric data

74

Binning (see sections before)

Histogram analysis (see sections before)

Clustering analysis (see sections before)

Entropy-based discretization

Keywords:

Supervised discretization

Entropy-based discretization

Unsupervised discretization

Clustering, Binning, Histogram

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 73: Dbm630 lecture04

Entropy-Based Discretization

75

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

info(S,T) = (|S1|/|S|) × info(S1) + (|S2|/|S|) × info(S2)

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.

The process is recursively applied to the partitions obtained until some stopping criterion is met, e.g.,

info(S) - info(S,T) < threshold

Experiments show that it may reduce data size and improve classification accuracy.

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 74: Dbm630 lecture04

Entropy-Based Discretization

Entropy: info(X) = - Σ_{i=1..N} p_i log2(p_i)

Ex.: the temperature attribute of the weather data:

temperature: 64 65 68 69 70 71 72  75  80 81 83 85
play:        y  n  y  y  y  n  y/n y/y n  y  y  n

Split at Temp = 71.5:
info([4,2],[5,3]) = (6/14) × info([4,2]) + (8/14) × info([5,3]) = 0.939 bits
info([9,5]) = 0.940 bits
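A minimal sketch (not the lecture's code) that recomputes the two entropy values above:

```python
from math import log2

def info(counts):
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def info_after_split(left_counts, right_counts):
    n = sum(left_counts) + sum(right_counts)
    return (sum(left_counts) / n) * info(left_counts) + \
           (sum(right_counts) / n) * info(right_counts)

print(f"{info([9, 5]):.3f}")                      # 0.940 bits (no split)
print(f"{info_after_split([4, 2], [5, 3]):.3f}")  # 0.939 bits (split at 71.5)
```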

Page 75: Dbm630 lecture04

Specification of a set of attributes (Concept hierarchy generation)

A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set; the attribute with the most distinct values is placed at the lowest level of the hierarchy.

country (15 distinct values)
  province_or_state (65 distinct values)
    city (3567 distinct values)
      street (674,339 distinct values)

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 76: Dbm630 lecture04

Why Postprocessing?

78

To improve the acquired model (the mined knowledge).

Techniques combine several mining approaches to find better results.

[Figure: input data is fed to Method 1, Method 2, ..., Method N, and their outputs are combined into a single output.]

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 77: Dbm630 lecture04

Combining Multiple Models (Overview)

79

Basic idea of “meta” learning schemes: build different “experts” and let them vote

Advantage: often improves predictive performance

Disadvantage: produces output that is very hard to analyze

Schemes we will discuss are bagging, boosting and stacking (or stacked generalization)

These approaches can be applied to both numeric and nominal classification

Engineering the Output

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 78: Dbm630 lecture04

Combining Multiple Models (Bagging - general)

80

Employs the simplest way of combining predictions: voting/averaging.

Each model receives equal weight.

"Idealized" version of bagging:
- sample several training sets of size n (instead of just having one training set of size n)
- build a classifier for each training set
- combine the classifiers' predictions

This improves performance in almost all cases if the learning scheme is unstable (e.g., decision trees).

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 79: Dbm630 lecture04

Combining Multiple Models (Bagging - algorithm)

81

Model generation:
  Let n be the number of instances in the training data.
  For each of t iterations:
    Sample n instances with replacement from the training set.
    Apply the learning algorithm to the sample.
    Store the resulting model.

Classification:
  For each of the t models:
    Predict the class of the instance using the model.
  Return the class that has been predicted most often.
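A minimal sketch of this procedure (not the lecture's code); the base learner here is a made-up one-attribute decision stump, used only so the example runs end to end:

```python
import random
from collections import Counter

def bagging_fit(train, learn, t=10):
    models = []
    n = len(train)
    for _ in range(t):
        sample = [random.choice(train) for _ in range(n)]  # sample n with replacement
        models.append(learn(sample))
    return models

def bagging_predict(models, instance):
    votes = Counter(model(instance) for model in models)
    return votes.most_common(1)[0][0]          # class predicted most often

# Toy base learner: a threshold "stump" on a single numeric attribute.
def learn(sample):
    threshold = sum(x for x, _ in sample) / len(sample)
    above_counts = Counter(y for x, y in sample if x > threshold).most_common(1)
    below_counts = Counter(y for x, y in sample if x <= threshold).most_common(1)
    above = above_counts[0][0] if above_counts else "?"
    below = below_counts[0][0] if below_counts else "?"
    return lambda x: above if x > threshold else below

train = [(1, "N"), (2, "N"), (3, "N"), (7, "Y"), (8, "Y"), (9, "Y")]
models = bagging_fit(train, learn, t=25)
print(bagging_predict(models, 8.5))  # expected "Y"
```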

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 80: Dbm630 lecture04

Combining Multiple Models (Boosting - general)

82

Also uses voting/averaging, but models are weighted according to their performance.

Iterative procedure: new models are influenced by the performance of previously built ones.
- A new model is encouraged to become an expert for instances classified incorrectly by earlier models.
- Intuitive justification: models should be experts that complement each other.

(There are several variants of this algorithm.)

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 81: Dbm630 lecture04

Combining Multiple Models (Stacking - I)

83

Hard to analyze theoretically: “black magic”

Uses “meta learner” instead of voting to combine predictions of base learners

Predictions of base learners (level-0 models) are used as input for meta learner (level-1 model)

Base learners usually have different learning schemes

Predictions on training data can’t be used to generate data for level-1 model!

Cross-validation-like scheme is employed

Data Warehousing and Data Mining by Kritsada Sriphaew

Page 82: Dbm630 lecture04

Combining Multiple Models (Stacking - II)

84 Data Warehousing and Data Mining by Kritsada Sriphaew