course material dmba

Indian Statistical Institute

Training Program on Data Mining & Business Analytics using Rapid Miner

Boby J

Contents


1. Introduction to Rapid Miner

2. Missing Value Analysis

3. Data Visualization

4. Market Basket Analysis

5. Correlation & Regression

6. Data partitioning & Classification

7. Cluster Analysis

DATA PREPROCESSING

Missing Value Handling


Example: Suppose a telecom company wants to introduce a scoring mechanism to rate its circles based on the following parameters:

1. Current Month's Usage

2. Last 3 Month's Usage

3. Average Recharge

4. Projected Growth

The data set is given below. Some of the values are missing. How should we proceed?


Example: Circle-wise Data (- denotes a missing value)

SL No.  Current   Last 3    Average   Projected  Circle
        Month's   Month's   Recharge  Growth
        Usage     Usage
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       -         3.2       -         99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         -         98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      -         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       -         95.5      98.3       B
13      6.3       3.3       -         98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      -         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C


Step 1: Calculate the % of missing values in each attribute

                Current   Last 3    Average   Projected  Circle
                Month's   Month's   Recharge  Growth
                Usage     Usage
Missing Values  3         2         2         0          0
Total Records   19        19        19        19         19
% Missing       15.79     10.53     10.53     0.00       0.00

If the % Missing of an attribute is > 20%, the data on that attribute is not sufficient to develop the model. Ignore that attribute and proceed.
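The Step 1 check can be sketched in a few lines of Python. This is only an illustration, not part of the Rapid Miner workflow: the dictionary `columns`, the helper `percent_missing` and the constant `THRESHOLD` are names assumed here, and `None` is used to mark a missing value.

```python
# Attribute columns of the circle-wise example; None marks a missing value.
columns = {
    "Current Month's Usage": [5.1, 4.9, None, 4.6, 5.0, 5.4, 7.0, 6.4, 6.9, None,
                              6.5, 5.7, 6.3, 6.7, 6.7, 6.3, None, 6.2, 5.9],
    "Last 3 Month's Usage": [3.5, 3.0, 3.2, 3.1, None, 3.9, 3.2, 3.2, 3.1, 2.3,
                             2.8, None, 3.3, 3.3, 3.0, 2.5, 3.0, 3.4, 3.0],
    "Average Recharge": [99.4, 98.6, None, 98.5, 98.4, 98.3, 95.3, 95.5, 95.1, 96.0,
                         95.4, 95.5, None, 94.3, 94.8, 95.0, 94.8, 94.6, 94.9],
}

THRESHOLD = 20.0  # the 20% rule stated above


def percent_missing(values):
    """Return the percentage of entries that are missing (None)."""
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)


# % missing per attribute, rounded as in the table above.
report = {name: round(percent_missing(vals), 2) for name, vals in columns.items()}

# Attributes that pass the 20% rule and can be kept for modelling.
usable = [name for name, pct in report.items() if pct <= THRESHOLD]

for name, pct in report.items():
    print(f"{name}: {pct}% missing")
```

All three attributes stay below the 20% limit, so none of them is dropped.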

Step 3: Prepare Pivot table of attributes (counts by circle)

Current Month's Usage   A   B   C   Grand Total
Missing                 1   1   1   3
Non Missing             5   6   5   16
Grand Total             6   7   6   19

Last 3 Month's Usage    A   B   C   Grand Total
Missing                 1   1   0   2

Average Recharge        A   B   C   Grand Total
Missing                 1   1   0   2

Projected Growth        A   B   C   Grand Total
Missing                 0   0   0   0

Conclusion: In none of the cases are 100% of the values missing.


Step 3: Prepare Pivot table of attributes (% by circle)

Current Month's Usage   A       B       C
Missing                 16.67   14.29   16.67
Non Missing             83.33   85.71   83.33
Grand Total             100     100     100

Last 3 Month's Usage    A       B       C
Missing                 16.67   14.29   0.00
Non Missing             83.33   85.71   100.00

Average Recharge        A       B       C
Missing                 16.67   14.29   0.00
Non Missing             83.33   85.71   100.00

Projected Growth        A       B       C
Missing                 0       0       0
Non Missing             100     100     100

Conclusion: In none of the cases are 100% of the values missing.


Example: 3 Choices

Choice 1: Ignore the records with missing values

SL No.  Current   Last 3    Average   Projected  Circle
        Month's   Month's   Recharge  Growth
        Usage     Usage
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
4       4.6       3.1       98.5      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
11      6.5       2.8       95.4      98.5       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C



Choice 2: Replace the missing values with the attribute mean, minimum, maximum or mode (- denotes a missing value)

SL No.  Current   Last 3    Average   Projected  Circle
        Month's   Month's   Recharge  Growth
        Usage     Usage
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       -         3.2       -         99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         -         98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      -         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       -         95.5      98.3       B
13      6.3       3.3       -         98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      -         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C

Mean    6.0       3.1       96.1      98.5
Min     4.6       2.3       94.3      97.3
Max     7.0       3.9       99.4      99.4



Choice 2: Replace the missing values with the attribute mean, minimum, maximum or mode (here the missing values are replaced with the attribute mean)

SL No.  Current   Last 3    Average   Projected  Circle
        Month's   Month's   Recharge  Growth
        Usage     Usage
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       6         3.2       96.1      99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         3.1       98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      6         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       3.1       95.5      98.3       B
13      6.3       3.3       96.1      98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      6         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C

Mean    6.0       3.1       96.1      98.5
Min     4.6       2.3       94.3      97.3
Max     7.0       3.9       99.4      99.4



Choice 3: Replace the missing values with the attribute mean of the corresponding circle (- denotes a missing value)

SL No.  Current   Last 3    Average   Projected  Circle
        Month's   Month's   Recharge  Growth
        Usage     Usage
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       -         3.2       -         99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         -         98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      -         2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       -         95.5      98.3       B
13      6.3       3.3       -         98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      -         3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C

Mean    5.00      3.34      98.64     99.24      A
Mean    6.47      2.98      95.47     98.45      B
Mean    6.36      3.03      94.73     97.97      C



Choice 3: Replace the missing values with the attribute mean of the corresponding circle (after replacement)

SL No.  Current   Last 3    Average   Projected  Circle
        Month's   Month's   Recharge  Growth
        Usage     Usage
1       5.1       3.5       99.4      99.2       A
2       4.9       3         98.6      99.2       A
3       5         3.2       98.64     99.2       A
4       4.6       3.1       98.5      99.2       A
5       5         3.34      98.4      99.2       A
6       5.4       3.9       98.3      99.4       A
7       7         3.2       95.3      98.4       B
8       6.4       3.2       95.5      98.5       B
9       6.9       3.1       95.1      98.5       B
10      6.47      2.3       96        98.3       B
11      6.5       2.8       95.4      98.5       B
12      5.7       2.98      95.5      98.3       B
13      6.3       3.3       95.47     98.6       B
14      6.7       3.3       94.3      97.5       C
15      6.7       3         94.8      97.3       C
16      6.3       2.5       95        98.9       C
17      6.36      3         94.8      98         C
18      6.2       3.4       94.6      97.3       C
19      5.9       3         94.9      98.8       C

Mean    5.00      3.34      98.64     99.24      A
Mean    6.47      2.98      95.47     98.45      B
Mean    6.36      3.03      94.73     97.97      C
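Choice 3 can be illustrated outside Rapid Miner with a short Python sketch. It handles one attribute (Current Month's Usage) only; the function name `impute_by_group` and the use of `None` for a missing value are assumptions made for this illustration.

```python
from collections import defaultdict

# Circle of each record (SL No. 1 to 19) and the Current Month's Usage column.
circles = ["A"] * 6 + ["B"] * 7 + ["C"] * 6
current_usage = [5.1, 4.9, None, 4.6, 5.0, 5.4,
                 7.0, 6.4, 6.9, None, 6.5, 5.7, 6.3,
                 6.7, 6.7, 6.3, None, 6.2, 5.9]


def impute_by_group(values, groups):
    """Replace each missing value by the mean of the non-missing values of its group."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value, group in zip(values, groups):
        if value is not None:
            sums[group] += value
            counts[group] += 1
    means = {group: sums[group] / counts[group] for group in counts}
    # A group with no observed value has no mean; its entries stay missing (None).
    return [means.get(group) if value is None else value
            for value, group in zip(values, groups)]


filled = impute_by_group(current_usage, circles)
print([round(v, 2) for v in filled])
```

Records 3, 10 and 17 receive the means of circles A, B and C respectively; all other records are left unchanged.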

Exercise: The data on 3 modes of transport of a supply chain management company are given below. Handle the missing values. (- denotes a missing value)

SL No  Delivery Speed  Vehicles  Extra Handling  Cost  Mode of Transport
1      27.75           3         2               -     Water
2      -               3         -               445   Water
3      28.2            3         1               460   Water
4      8.75            1         0               980   Direct Truck
5      9.25            -         0               950   Direct Truck
6      9.15            1         1               -     Direct Truck
7      15.2            3         2               820   LTL Truck
8      16.2            2         2.5             810   LTL Truck
9      -               3         1.5             835   LTL Truck

LTL Truck: Less than truckload


MARKET BASKET ANALYSIS

A modeling technique based on the logic that if a customer buys a certain group of items, he or she is more (or less) likely to buy another group of items.

Example:

Those who buy cigarettes are more likely to buy a matchbox also.



Association Rule Mining:

Developing rules that predict the occurrence of an item based on the occurrence of other items in the transaction

Example

Id  Items
1   Milk, Bread
2   Bread, Biscuits, Toys, Eggs
3   Milk, Biscuits, Toys, Fruits
4   Bread, Milk, Toys, Biscuits
5   Milk, Bread, Biscuits, Fruits

{Milk, Bread} → {Biscuits} with probability = 2 / 3



Itemset:

A collection of one or more items

k-itemset:

An itemset consisting of k items

Support count:

Frequency of occurrence of an itemset

Example: Support count of {Milk, Bread, Biscuits} = 2

Support:

Proportion or fraction of transactions that contain an itemset

Example: Support of {Milk, Bread, Biscuits} = 2 / 5

Frequent Itemset:

An itemset whose support is greater than or equal to the minimum support







Confidence

Conditional probability that an item will appear in transactions that contain other items

Example

Confidence that Toys will appear in a transaction containing Milk & Biscuits

= Support count {Milk, Biscuits, Toys} / Support count {Milk, Biscuits} = 2 / 3 = 0.67
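The definitions of support count, support and confidence can be checked with a small Python sketch on the five example transactions. The function names `support_count`, `support` and `confidence` are chosen here for illustration only.

```python
# The five example transactions (Id -> items).
transactions = {
    1: {"Milk", "Bread"},
    2: {"Bread", "Biscuits", "Toys", "Eggs"},
    3: {"Milk", "Biscuits", "Toys", "Fruits"},
    4: {"Bread", "Milk", "Toys", "Biscuits"},
    5: {"Milk", "Bread", "Biscuits", "Fruits"},
}


def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for items in transactions.values() if set(itemset) <= items)


def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Conditional probability of the consequent given the antecedent."""
    both = set(antecedent) | set(consequent)
    return support_count(both) / support_count(antecedent)


print(support_count({"Milk", "Bread", "Biscuits"}))            # 2
print(support({"Milk", "Bread", "Biscuits"}))                  # 0.4
print(round(confidence({"Milk", "Biscuits"}, {"Toys"}), 2))    # 0.67
```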

Association Rule Mining

1. Frequent Itemset Generation

   Fix minimum support value

   Generate all itemsets whose support ≥ minimum support

2. Rule Generation

   Fix minimum confidence value

   Generate high confidence rules from each frequent itemset



Frequent Itemset Generation: Apriori Algorithm

a. Fix minimum support count

b. Generate all itemsets of length = 1

c. Calculate the support count for each itemset

d. Eliminate all itemsets with support count < minimum support count

e. Repeat steps b to d for itemsets of length = 2, 3, ... built from the surviving itemsets, until no itemset survives
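Steps a to e can be sketched in Python as below. This is a simplified illustration rather than a reference implementation: the function name `apriori` is chosen here, candidates of length k are formed by joining surviving itemsets of length k - 1, and the usual subset-pruning optimisation is left out, so a few extra candidates are counted before being eliminated. The six transactions are those of the worked example in this section.

```python
from itertools import combinations

# Six transactions of the worked example (Id 1 to 6).
transactions = [
    {"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
    {"B", "E"}, {"A", "E"}, {"A", "C", "E"},
]


def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Step b: candidate itemsets of length 1.
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Steps c and d: count each candidate and drop the infrequent ones.
        counted = {c: count(c) for c in candidates}
        survivors = {c: n for c, n in counted.items() if n >= min_count}
        frequent.update(survivors)
        # Step e: join survivors to form candidates one item longer.
        k += 1
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == k})
    return frequent


frequent = apriori(transactions, min_count=2)
for itemset, n in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```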

Example:

Minimum Support count = 2

Id  Items
1   A, C, D
2   B, C, E
3   A, B, C, E
4   B, E
5   A, E
6   A, C, E




Step 1:

Generate itemsets of length = 1 & calculate support

Item  Support count
A     4
B     3
C     4
D     1
E     5




Step 2:

Eliminate itemsets with support count < minimum support count (2). Item D (support count 1) is eliminated.

Item  Support count
A     4
B     3
C     4
E     5




Step 3:

Generate itemsets of length = 2

Itemset  Support count
A, B     1
A, C     3
A, E     3
C, E     3
B, E     3
B, C     2




Step 4:

Eliminate itemsets with support count < minimum support count (2). Itemset {A, B} (support count 1) is eliminated.

Itemset  Support count
A, C     3
A, E     3
C, E     3
B, E     3
B, C     2




Step 5:

Generate itemsets of length = 3

Itemset  Support count
A, C, E  2
B, C, E  2




Step 6:

Generate itemsets of length = 4

Itemset     Support count
A, B, C, E  1




Result:

Itemset  Support count  Support
A, C     3              0.50
A, E     3              0.50
B, C     2              0.33
B, E     3              0.50
C, E     3              0.50
A, C, E  2              0.33
B, C, E  2              0.33



Association Rule Mining: Apriori Algorithm

Example:

Minimum Support = 0.50

Minimum Confidence = 0.5

Rule   Support  Confidence
A → C  0.50     0.75
A → E  0.50     0.75
B → E  0.50     1.00
E → A  0.50     0.60
C → A  0.50     0.75
C → E  0.50     0.75
E → C  0.50     0.60
E → B  0.50     0.60



Association Rule Mining: Other Measures

Lift

Lift (A → C) = Confidence (A → C) / Support (C)

Example

Rule   Support of consequent  Confidence  Lift
A → C  C = 0.67               0.75        1.12
A → E  E = 0.83               0.75        0.90

Criteria: Lift ≥ 1

Lift (A → C) = 1.12 > Lift (A → E) indicates that A has a greater impact on the frequency of C than it has on the frequency of E
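The lift formula can be evaluated directly from the six transactions of the Apriori example. The sketch below is illustrative only; the function names `support` and `lift` are chosen here, and the values are computed from exact counts rather than from the rounded supports.

```python
# Six transactions of the Apriori example (Id 1 to 6).
transactions = [
    {"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"},
    {"B", "E"}, {"A", "E"}, {"A", "C", "E"},
]


def support(itemset):
    """Fraction of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def lift(antecedent, consequent):
    """Confidence of the rule divided by the support of the consequent."""
    confidence = support(antecedent | consequent) / support(antecedent)
    return confidence / support(consequent)


lift_a_c = lift({"A"}, {"C"})
lift_a_e = lift({"A"}, {"E"})
print(lift_a_c, lift_a_e)
```

A lift above 1 means the consequent appears more often in transactions containing the antecedent than in transactions overall; a lift below 1 means it appears less often.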

Exercise 1: The data on transactions from a mobile outlet is given below.

1. Generate frequent itemsets with a support of at least 25%.

2. Generate associations of items with a confidence of at least 50%.

3. Estimate the chance that Mobile Slim, Landline and Broadband will be subscribed together.

4. Estimate the chance that the customers who buy Landline will also purchase Broadband & Ring tones.



Exercise 2:

The Market Basket Software data set contains the details of transactions at a software product company.

1. Identify the frequent product types with a minimum support of 25%.

2. Also identify the associations of products with a minimum confidence of 50%.

3. What is the chance that Operating System and Office Suite will be purchased together?

4. What is the chance that Operating System and Visual Studio will be purchased together?

5. Estimate the chance that the customers who buy Operating System will also purchase Office Suite.

6. Estimate the chance that the customers who buy Operating System will also purchase Visual Studio.


LINEAR REGRESSION

CORRELATION & REGRESSION

Correlation:

Correlation analysis is a technique to identify the type and degree of relationship between two variables.

Correlation: Usage

Explore the relationship between the output characteristic and an input or process variable.

Output variable : Y : Dependent variable

Input / Process variable : X : Independent variable

Positive Correlation: Y increases as X increases & vice versa

[Scatter plot of Y against X]

Negative Correlation: Y decreases as X increases & vice versa

[Scatter plot of Y against X]

No Correlation: Random distribution of points

[Scatter plot of Y against X]

Is there any correlation?

[Scatter plot of Y against X]

Measure of Correlation: Coefficient of Correlation

Symbol : r

Range : -1 to 1

Sign : Type of correlation

Value : Degree of correlation

Examples:

r = 0.6: positive correlation of degree 0.6

r = -0.82: negative correlation of degree 0.82

r = 0: no linear correlation

Coefficient of Correlation: Positive Correlation

Collect data on x and y: When x is low, y is also low & vice versa

Calculate Mean of x & y values

SL No.  x   y
1       2   5
2       3   7
3       1   3
4       5   11
5       6   12
6       7   15
Mean    4   8.83

Take x – Mean x and y – Mean y

Low values will become negative & high values will become positive. Generally, when the x values are negative, the y values are also negative & vice versa. Then the product of the x & y values will be positive, and the sum of the products (Sxy) will be positive.

SL No.  x – Mean x  y – Mean y  Product
1       -2          -3.83       7.66
2       -1          -1.83       1.83
3       -3          -5.83       17.49
4       1           2.17        2.17
5       2           3.17        6.34
6       3           6.17        18.51
Sum = Sxy                       54

Coefficient of Correlation: Negative Correlation

Collect data on x and y: When x is low, y will be high & vice versa

Calculate Mean of x & y values

SL No.  x   y
1       2   12
2       3   11
3       1   15
4       5   7
5       6   5
6       7   3
Mean    4   8.83

Take x – Mean x and y – Mean y

Low values will become negative & high values will become positive. Generally, when the x values are negative, the y values are positive & vice versa. Then the product of the x & y values will be negative, and the sum of the products (Sxy) will be negative.

SL No.  x – Mean x  y – Mean y  Product
1       -2          3.17        -6.34
2       -1          2.17        -2.17
3       -3          6.17        -18.51
4       1           -1.83       -1.83
5       2           -3.83       -7.66
6       3           -5.83       -17.49
Sum = Sxy                       -54

In Short

If correlation is positive, Sxy will be positive

If correlation is negative, Sxy will be negative

Coefficient of Correlation:

To avoid scale issues, Sxy is divided by √(Sxx.Syy)

Sxy = Σ(x – Mean x)(y – Mean y)

Sxx = Σ(x – Mean x)²

Syy = Σ(y – Mean y)²

Correlation Coefficient r = Sxy / √(Sxx.Syy)




SL No.  x – Mean x  y – Mean y  Product  (x – Mean x)²  (y – Mean y)²
1       -2          3.17        -6.34    4              10.0489
2       -1          2.17        -2.17    1              4.7089
3       -3          6.17        -18.51   9              38.0689
4       1           -1.83       -1.83    1              3.3489
5       2           -3.83       -7.66    4              14.6689
6       3           -5.83       -17.49   9              33.9889
Sum                             Sxy: -54 Sxx: 28        Syy: 104.83

r = Sxy / √(Sxx.Syy) = -54 / √(28 x 104.83) = -0.9967
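The calculation of r from Sxy, Sxx and Syy can be reproduced with a short Python sketch. The function name `correlation` is chosen here for illustration; the two data sets are the positive and negative correlation examples of this section.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [2, 3, 1, 5, 6, 7]
y_positive = [5, 7, 3, 11, 12, 15]   # y rises with x
y_negative = [12, 11, 15, 7, 5, 3]   # y falls as x rises

r_positive = correlation(x, y_positive)
r_negative = correlation(x, y_negative)
print(round(r_positive, 4), round(r_negative, 4))
```

The two examples use the same y values in a different order, so Syy is the same in both and only the sign of Sxy, and hence of r, changes.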

Regression

Correlation helps

• To check whether two variables are related

• If related, to identify the type & degree of relationship

Regression helps

• To identify the exact form of the relationship

• To model output in terms of input or process variables

Examples:

Yield = 5 + 3 x Time - 2 x Temperature

Y = 2 - 5x
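As an illustration of how an equation of the form Y = a + bX can be obtained from data, the sketch below fits a straight line by ordinary least squares, using b = Sxy / Sxx and a = Mean y - b x Mean x. The choice of least squares is an assumption made here (the slides do not name the fitting method), the function name `fit_line` is hypothetical, and the data are the x and y values of the positive correlation example.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    return a, b


x = [2, 3, 1, 5, 6, 7]
y = [5, 7, 3, 11, 12, 15]

a, b = fit_line(x, y)
print(f"Y = {a:.3f} + {b:.3f} X")

# Predicted value for a new x, here x = 4.
predicted = a + b * 4
```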



Multiple Regression

To model output variable y in terms of two or more variables.

General Form:

Y = a + b1X1 + b2X2 + ... + bkXk

Two variable case:

Y = a + b1X1 + b2X2

Exercise 1: The data on Vendor performance score and the number of On Time, Complete, Undamaged & Correctly billed shipments from the vendors of a supply chain management company are given below. Can you develop a model for Vendor performance score in terms of the other variables?

Vendor Id  Ontime    Complete  Undamaged  Correctly  Performance
           Shipment  Shipment  Shipments  billed     Score
1          950       990       980        550        2985
2          1450      1425      1475       975        4576
3          1700      1575      1730       1320       5435
4          1800      1515      1890       1615       5955
5          1675      1420      1756       1456       5400
6          1756      1645      1835       1489       5590
7          1236      1462      1335       1435       4675
8          1100      1523      1565       1625       4960
9          1325      1725      1570       1520       5325
10         1450      1620      1463       1430       5170
11         1570      1458      1356       1630       5190

Exercise 2: A construction company wants to develop a model for concrete compressive strength. The attributes of interest are given in the table below. The training data is given in the file Concrete_Data.xls.

1. Can you develop the model?

2. How closely will it predict the values?

1 Cement (component 1) (kg in a m^3 mixture)
2 Blast Furnace Slag (component 2) (kg in a m^3 mixture)
3 Fly Ash (component 3) (kg in a m^3 mixture)
4 Water (component 4) (kg in a m^3 mixture)
5 Superplasticizer (component 5) (kg in a m^3 mixture)
6 Coarse Aggregate (component 6) (kg in a m^3 mixture)
7 Fine Aggregate (component 7) (kg in a m^3 mixture)
8 Age (day)
9 Concrete compressive strength (MPa, megapascals)


CLASSIFICATION METHODS

INTRODUCTION

Objective

To develop a mathematical model for an attribute or response metric (Y) in terms of other available attributes (Xs).

When to Use

Xs : Continuous or discrete

Y : Discrete

Classifies data (develops a model) based on the training data

Each sample is assumed to belong to a predefined class

The sample data set used for building the model is the training set

Usage:

For classifying future or unknown data



Example:

Attribute 1 : x1
Attribute 2 : x2
Label : y, with classes y1 (Red) and y2 (Blue)

x1     x2    Y      x1     x2    Y
11.35  23    Blue   11.85  39.9  Red
11.59  22.3  Blue   12.09  39.5  Red
12.19  24.5  Blue   12.69  37.8  Red
13.23  26.4  Blue   13.73  38.2  Red
13.51  30.2  Blue   14.01  37.8  Red
13.68  32    Blue   14.18  36.5  Red
14.78  33.1  Blue   15.28  36    Red
15.11  33    Blue   15.61  37.1  Red
15.55  25.2  Blue   16.05  33.1  Red
16.37  24.1  Blue   16.87  32.4  Red
16.99  22    Blue   17.49  31    Red
18.23  23.5  Blue   18.73  32    Red
18.83  24.1  Blue   19.33  31.8  Red
19.06  25    Blue   19.56  30.9  Red



Example:

[Scatter plot of x2 (Attribute 2) against x1 (Attribute 1); points labelled y1 (Red) or y2 (Blue)]

The decision tree is built up split by split:

Split on x2:
  x2 > 35 : y1
  x2 < 28 : y2
  otherwise, split on x1:
    x1 < 15.5 : y2
    x1 > 15.5 : y1



Example: Rules

Attribute 1 : x1
Attribute 2 : x2

If x2 > 35, then y = y1

If x2 < 28, then y = y2

If 28 < x2 < 35 & x1 > 15.5, then y = y1

If 28 < x2 < 35 & x1 < 15.5, then y = y2
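The rules of this example translate directly into a small function. The sketch below is illustrative: the function name `classify` is chosen here, and the treatment of points lying exactly on a boundary (x2 = 28, x2 = 35 or x1 = 15.5) is an assumption, since the rules do not cover those cases.

```python
def classify(x1, x2):
    """Apply the decision rules of the example; returns 'y1' (Red) or 'y2' (Blue)."""
    if x2 > 35:
        return "y1"
    if x2 < 28:
        return "y2"
    # Middle band of x2: decide on x1.
    # Assumption: boundary values fall into this branch, and x1 = 15.5 gives y2.
    return "y1" if x1 > 15.5 else "y2"


# A few records from the training data: (x1, x2, label).
samples = [
    (11.35, 23.0, "Blue"), (13.51, 30.2, "Blue"), (15.11, 33.0, "Blue"),
    (15.55, 25.2, "Blue"), (11.85, 39.9, "Red"), (15.28, 36.0, "Red"),
    (16.05, 33.1, "Red"), (19.56, 30.9, "Red"),
]

colour = {"y1": "Red", "y2": "Blue"}
predictions = [colour[classify(x1, x2)] for x1, x2, _ in samples]
print(predictions)
```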

Example: Table 1 gives the profile of customers (Refund, Marital Status & Taxable Income) who have taken a loan from a bank. The table also shows how many of them actually cheated the bank.

1. Can you develop a decision rule to classify whether a customer will cheat or not, based on the values of the 3 attributes (Refund, Marital Status & Taxable Income)?

2. Validate the model using the test data given in Table 2.

Table 2: Test Data

SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Married         > 80 K          No
2      No      Single          > 80 K          No
3      No      Single          < 80 K          No
4      No      Married         > 80 K          No
5      No      Divorced        > 80 K          Yes



Table 1: Training Data Set

SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Single          > 80 K          No
2      No      Married         > 80 K          No
3      No      Single          < 80 K          No
4      Yes     Married         > 80 K          No
6      No      Married         < 80 K          No
7      Yes     Divorced        > 80 K          No
8      No      Single          > 80 K          Yes
9      No      Married         > 80 K          No
10     No      Single          > 80 K          Yes

Class variable: Cheat

Number of predefined classes: 2 (Cheat = No & Cheat = Yes)



Example: Result

If Marital Status = Married, then Cheat: No

If Marital Status = Single & Refund = Yes, then Cheat: No

If Marital Status = Single, Refund = No & Taxable Income < 80K, then Cheat: No

If Marital Status = Single, Refund = No & Taxable Income > 80K, then Cheat: Yes

If Marital Status = Divorced & Refund = Yes, then Cheat: No

If Marital Status = Divorced & Refund = No, then Cheat: Yes



Example: Decision Tree (training data as in Table 1)



Example: Test Data Set

SL No  Refund  Marital Status  Taxable Income  Cheat  Predicted Cheat
1      Yes     Married         > 80 K          No     No
2      No      Single          > 80 K          No     Yes
3      No      Single          < 80 K          No     No
4      No      Married         > 80 K          No     No
5      No      Divorced        > 80 K          Yes    Yes



Performance Evaluation Measures

1. Confusion Matrix

                 Actual Class
Predicted Class  Class = Yes  Class = No
Class = Yes      a            b
Class = No       c            d

2. Accuracy: (a + d) / (a + b + c + d)

3. Precision: a / (a + b)

4. Recall: a / (a + c)

5. F Measure = 2 x Precision x Recall / (Precision + Recall)



Example: Performance Evaluation Measures

SL No  Cheat  Predicted Cheat
1      No     No
2      No     Yes
3      No     No
4      No     No
5      Yes    Yes

1. Confusion Matrix

                 Actual Class
Predicted Class  Cheat = No  Cheat = Yes
Cheat = No       3           0
Cheat = Yes      1           1

2. Accuracy: (3 + 1) / (3 + 0 + 1 + 1) = 4 / 5 = 0.8

3. Precision: 3 / (3 + 0) = 3 / 3 = 1.0

4. Recall: 3 / (3 + 1) = 3 / 4 = 0.75

5. F Measure = 2 x Precision x Recall / (Precision + Recall)

= 2 x 1.0 x 0.75 / (1.00 + 0.75) = 0.86
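The four measures can be recomputed from the actual and predicted labels of the test data. In the sketch below the function name `evaluate` is chosen for illustration, and Cheat = No is taken as the class of interest (the first row and column of the confusion matrix), which is an assumption that matches the worked numbers above.

```python
actual = ["No", "No", "No", "No", "Yes"]       # Cheat, SL No 1 to 5
predicted = ["No", "Yes", "No", "No", "Yes"]   # Predicted Cheat, SL No 1 to 5


def evaluate(actual, predicted, positive):
    """Return (accuracy, precision, recall, f_measure) for the given class of interest."""
    pairs = list(zip(actual, predicted))
    a = sum(1 for act, pred in pairs if pred == positive and act == positive)
    b = sum(1 for act, pred in pairs if pred == positive and act != positive)
    c = sum(1 for act, pred in pairs if pred != positive and act == positive)
    d = sum(1 for act, pred in pairs if pred != positive and act != positive)
    accuracy = (a + d) / (a + b + c + d)
    # Note: a + b or a + c equal to zero would need separate handling.
    precision = a / (a + b)
    recall = a / (a + c)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


accuracy, precision, recall, f_measure = evaluate(actual, predicted, positive="No")
print(accuracy, precision, recall, round(f_measure, 2))
```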

Challenges

How to represent the entire information in the dataset using the minimum number of rules?

How to develop the smallest tree?

Solution

Select the attribute with maximum information for the first split

Split   Attribute
First   Marital Status
Second  Refund
Third   Taxable Income



CHAID Algorithm

Example: A marketing company wants to optimize its mailing campaign by sending the brochure mail only to those customers who responded to previous mail campaigns. The profiles of the customers are given below. Can you develop a rule to identify the profile of customers who are likely to respond?

SL No  District  House Type     Income  Previous_Customer  Outcome
1      Suburban  Detached       High    No                 No Response
2      Suburban  Detached       High    Yes                No Response
3      Rural     Detached       High    No                 Responded
4      Urban     Semi-detached  High    No                 Responded
5      Urban     Semi-detached  Low     No                 Responded
6      Urban     Semi-detached  Low     Yes                No Response
7      Rural     Semi-detached  Low     Yes                Responded
8      Suburban  Terrace        High    No                 No Response
9      Suburban  Semi-detached  Low     No                 Responded
10     Urban     Terrace        Low     No                 Responded
11     Suburban  Terrace        Low     Yes                Responded
12     Rural     Terrace        High    Yes                Responded
13     Rural     Detached       Low     No                 Responded
14     Urban     Terrace        High    Yes                No Response




Number of variables = 4

SL No  Variable Name      Number of values
1      District           3
2      House Type         3
3      Income             2
4      Previous Customer  2

Total Combination of Customer Profiles = 3 x 3 x 2 x 2 = 36


Exercise 1: A bank wants to know the profile of customers who will buy a Personal Equity Plan (Pep) after the mailing campaign. The data is given in the file named bank-data.xls.

1. Can you develop a decision methodology?

2. How good is your model?


Exercise 1: The file contains the following fields.

Field         Description
Id            a unique identification number
Age           age of customer in years (numeric)
Sex           MALE / FEMALE
Region        inner_city / rural / suburban / town
Income        income of customer (numeric)
Married       is the customer married (YES/NO)
Children      number of children (numeric)
Car           does the customer own a car (YES/NO)
Save_acct     does the customer have a saving account (YES/NO)
Current_acct  does the customer have a current account (YES/NO)
Mortgage      does the customer have a mortgage (YES/NO)
Pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)


Exercise 2: The profile of the customers of a telecom service provider in the grace period is given in the churn.xls file.

1. Can you develop a model to identify potential churners (disconnections) so that the organization can win back the customers by providing different offers?

2. How good is the decision rule?


Exercise 2: The file contains the following fields.

1) Service class
2) Class change in last week
3) Class change in last 15 days
4) Class change in last month
5) Class change in last two months
6) Usage amount in last week
7) Usage amount in last 15 days
8) Usage amount in last month
9) Usage amount in last two months
10) Recharge amount in last week
11) Recharge amount in last 15 days
12) Recharge amount in last month
13) Recharge amount in last two months
14) Recharge count in last week
15) Recharge count in last 15 days
16) Recharge count in last month
17) Recharge count in last two months
18) Closing balance in last week
19) Closing balance in last 15 days
20) Closing balance in last month
21) Closing balance in last two months


CLUSTER ANALYSIS

Objective

To classify the records or items into a smaller number of groups based on the values of available attributes.

When to Use

When there is no Y attribute

All attributes are considered as Xs only

Methodology to group objects based on many attributes such that objects in a group

• will be similar (or related) to one another

• will be different from (or unrelated to) the objects in other groups


Types of Clustering

• K Mean Clustering

• K Medoid Clustering

K Mean Clustering

Methodology to group objects based on many attributes such that objects in a cluster will be closer (or more similar) to the centroid of the cluster than to the centroid of any other cluster.

1. Each cluster is associated with a centroid

2. Each point is assigned to the cluster with the closest centroid

3. Number of clusters, K must be specified

4. Initially centroids are often chosen randomly

5. The centroid is (typically) the mean of the points in the cluster

6. Closeness is measured by Euclidean distance


K Mean Clustering: Euclidean Distance

D(x, y) = √((x1 – y1)² + (x2 – y2)² + ... + (xk – yk)²)

Example:

SL No       Attribute 1  Attribute 2  Attribute 3  Attribute 4  Attribute 5
1           25.8         15.7         56.0         0.57         5.3
2           23.6         18.1         48.9         0.62         7.2

Difference  2.2          -2.4         7.1          -0.05        -1.9
Square      4.84         5.76         50.41        0.0025       3.61

Sum = 64.6225

Sq Root = 8.038812101 (Euclidean Distance)
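The distance between the two example records can be recomputed with a few lines of Python. The function name `euclidean` is chosen here for illustration.

```python
from math import sqrt


def euclidean(p, q):
    """Euclidean distance between two records with the same attributes."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


record_1 = [25.8, 15.7, 56.0, 0.57, 5.3]   # SL No 1
record_2 = [23.6, 18.1, 48.9, 0.62, 7.2]   # SL No 2

distance = euclidean(record_1, record_2)
print(round(distance, 4))
```

Attribute 3 contributes most of the squared sum, which shows why attributes on very different scales are often standardized before clustering.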

K Mean Clustering: Algorithm

1. Get the number of clusters (k) required from the user

2. Randomly select k centroids

3. Calculate the Euclidean distance of each data record to each & every centroid

4. For each record, identify the cluster with minimum Euclidean distance

5. Allocate the record to the cluster with minimum distance

6. Recalculate the centroids

7. Repeat steps 3 to 6 until there is no change in the cluster elements


K Mean Clustering: Example

Cluster the following data with 3 attributes (Spend in 3 quarters) into 2 clusters

SL No.  Quarter 1  Quarter 2  Quarter 3
1       1.425172   31.08748   108.5436
2       3.017551   34.17728   103.4577
3       3.803405   34.78973   101.7977
4       4.299151   31.02313   107.3701
5       5.352034   22.80945   109.9353
6       6.038361   22.21948   100.1809
7       6.128493   25.04893   111.0543
8       8.381028   23.6761    106.3302
9       8.989409   27.62143   106.7186
10      9.788646   27.35268   105.7799

Step 1: k = 2

Step 2: Randomly identify 2 centroids

Centroid  Quarter 1  Quarter 2  Quarter 3
1         1.5        35         100
2         9.8        22         111


Step 3: Calculate the Euclidean distance of each point from centroid 1

        Difference from Centroid 1
SL No.  Quarter 1  Quarter 2  Quarter 3  Sum of Squares  Euclidean Distance
1       -0.07483   -3.91252   8.543587   88.30632        9.397144437
2       1.517551   -0.82272   3.4577     14.93552        3.864650273
3       2.303405   -0.21027   1.797705   8.58163         2.929441858
4       2.799151   -3.97687   7.370058   77.96853        8.829979305
5       3.852034   -12.1906   9.935263   262.1572        16.19127037
6       4.538361   -12.7805   0.180881   183.9713        13.56360037
7       4.628493   -9.95107   11.05433   242.6451        15.57706944
8       6.881028   -11.3239   6.330205   215.6508        14.68505506
9       7.489409   -7.37857   6.71863    155.6745        12.47695732
10      8.288646   -7.64732   5.779929   160.5907        12.67243867


Step 4: Calculate the Euclidean distance of each point from centroid 2

        Difference from Centroid 2
SL No.  Quarter 1  Quarter 2  Quarter 3  Sum of Squares  Euclidean Distance
1       -8.37483   9.087476   -2.45641   158.7539        12.59975882
2       -6.78245   12.17728   -7.5423    251.174         15.84846897
3       -5.99659   12.78973   -9.2023    284.2187        16.85878574
4       -5.50085   9.023126   -3.62994   124.8526        11.17374688
5       -4.44797   0.809445   -1.06474   21.57327        4.644703413
6       -3.76164   0.219475   -10.8191   131.2514        11.45650187
7       -3.67151   3.048927   0.054333   22.77887        4.772721122
8       -1.41897   1.676096   -4.6698    26.62977        5.160403756
9       -0.81059   5.621434   -4.28137   50.58771        7.112503764
10      -0.01135   5.352683   -5.22007   55.90048        7.476662339


Step 5: Allocate records to clusters with minimum distance

SL No.  Quarter 1  Quarter 2  Quarter 3  Cluster 1  Cluster 2    Allocation
1       1.425172   31.08748   108.5436   9.397144   12.59975882  1
2       3.017551   34.17728   103.4577   3.86465    15.84846897  1
3       3.803405   34.78973   101.7977   2.929442   16.85878574  1
4       4.299151   31.02313   107.3701   8.829979   11.17374688  1
5       5.352034   22.80945   109.9353   16.19127   4.644703413  2
6       6.038361   22.21948   100.1809   13.5636    11.45650187  2
7       6.128493   25.04893   111.0543   15.57707   4.772721122  2
8       8.381028   23.6761    106.3302   14.68506   5.160403756  2
9       8.989409   27.62143   106.7186   12.47696   7.112503764  2
10      9.788646   27.35268   105.7799   12.67244   7.476662339  2

Step 6: Recalculate the centroids and repeat the steps

Centroid  Quarter 1  Quarter 2  Quarter 3
(mean of the records allocated to the cluster)
1         3.13632    32.7694    105.2923
2         7.446328   24.78801   106.6665
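The worked example can be reproduced with a short Python sketch of the K Mean steps. It is an illustration under stated assumptions, not the Rapid Miner implementation: the function names `assign`, `recompute` and `kmeans` are chosen here, clusters are numbered from 0 instead of 1, the starting centroids are the two used in the example rather than random ones, and a cluster that ends up empty simply keeps its previous centroid. `math.dist` needs Python 3.8 or later.

```python
from math import dist

# Spend in 3 quarters for the 10 records of the example.
points = [
    (1.425172, 31.08748, 108.5436), (3.017551, 34.17728, 103.4577),
    (3.803405, 34.78973, 101.7977), (4.299151, 31.02313, 107.3701),
    (5.352034, 22.80945, 109.9353), (6.038361, 22.21948, 100.1809),
    (6.128493, 25.04893, 111.0543), (8.381028, 23.6761, 106.3302),
    (8.989409, 27.62143, 106.7186), (9.788646, 27.35268, 105.7799),
]
initial = [(1.5, 35.0, 100.0), (9.8, 22.0, 111.0)]


def assign(points, centroids):
    """Index of the nearest centroid (Euclidean distance) for each point."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def recompute(points, labels, centroids):
    """Mean of the points in each cluster; an empty cluster keeps its old centroid."""
    updated = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            updated.append(old)
    return updated


def kmeans(points, centroids, max_iter=100):
    """Repeat allocation and centroid update until the allocation stops changing."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = recompute(points, labels, centroids)
    return labels, centroids


# First pass, matching Steps 3 to 6 of the example.
first_labels = assign(points, initial)
first_centroids = recompute(points, first_labels, initial)
print(first_labels)
print(first_centroids)

# Full run until no record changes cluster.
final_labels, final_centroids = kmeans(points, initial)
```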

Exercise 1: The data on the % Erlang utilization of mobile towers of a telecom service provider is given in Erlang_Utilization.xls. Group the towers into 5 clusters based on the utilization.


K Medoid Clustering

Methodology to group objects based on many attributes such that objects in a cluster will be closer (or more similar) to the most centrally located object of the cluster.

1. Number of clusters, K must be specified

2. Closeness is measured by Euclidean distance

Exercise: Perform the exercises 1 to 3 using k medoid clustering method