course material dmba


Upload: khan-shaad

Post on 27-Nov-2014


Page 1: Course Material DMBA

Indian Statistical Institute

Training Program on Data Mining & Business Analytics using Rapid Miner

Boby J

Page 2: Course Material DMBA

Contents

1. Introduction to Rapid Miner

2. Missing Value Analysis

3. Data Visualization

4. Market Basket Analysis

5. Correlation & Regression

6. Data partitioning & Classification

7. Cluster Analysis

Page 3: Course Material DMBA

DATA PREPROCESSING

Page 4: Course Material DMBA

DATA PREPROCESSING

1. Missing Value Handling

Page 5: Course Material DMBA

Missing Value Handling

Example: Suppose a telecom company wants to introduce a scoring mechanism to rate its circles based on the following parameters:

1. Current Month's Usage

2. Last 3 Months' Usage

3. Average Recharge

4. Projected Growth

The data set is given on the next slide. Some values are missing. How should we proceed?

Page 6: Course Material DMBA

Missing Value Handling

Example: Circle wise Data (- marks a missing value)

SL No.  Current Month's Usage  Last 3 Months' Usage  Average Recharge  Projected Growth  Circle
1       5.1                    3.5                   99.4              99.2              A
2       4.9                    3                     98.6              99.2              A
3       -                      3.2                   -                 99.2              A
4       4.6                    3.1                   98.5              9..2              A
5       5                      -                     98.4              99.2              A
6       5.4                    3.9                   98.3              99.4              A
7       7                      3.2                   95.3              98.4.             B
8       6.4                    3.2                   95.5              98.5              B
9       6.9                    3.1                   95.1              98.5              B
10      -                      2.3                   96                98.3              B
11      6.5                    2.8                   95.4              98.5              B
12      5.7                    -                     95.5              98.3              B
13      6.3                    3.3                   -                 98.6              B
14      6.7                    3.3                   94.3              97.5              C
15      6.7                    3                     94.8              97.3              C
16      6.3                    2.5                   95                98.9              C
17      -                      3                     94.8              98                C
18      6.2                    3.4                   94.6              97.3              C
19      5.9                    3                     94.9              98.8              C

Page 7: Course Material DMBA

Missing Value Handling

Step 1: Calculate the % of missing values in each attribute

                Current Month's Usage  Last 3 Months' Usage  Average Recharge  Projected Growth  Circle
Missing Values  3                      2                     2                 0                 0
Total Records   19                     19                    19                19                19
% Missing       15.79                  10.53                 10.53             0.00              0.00

If more than 20% of an attribute's values are missing, the data is not sufficient to develop the model with that attribute: ignore the attribute and proceed.
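Step 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Rapid Miner process used in the course: the `DATA` string re-types the circle-wise table with `-` standing for a missing value, and the names `percent_missing` and `usable` are invented for this example. The two malformed Projected Growth entries are kept verbatim and counted as present.

```python
# Circle-wise data; "-" marks a missing value (a convention of this sketch).
# Columns: current month's usage, last 3 months' usage, average recharge,
# projected growth, circle. The entries "9..2" and "98.4." are copied
# verbatim from the slide and are treated as present, not missing.
DATA = """
5.1 3.5 99.4 99.2  A
4.9 3.0 98.6 99.2  A
-   3.2 -    99.2  A
4.6 3.1 98.5 9..2  A
5.0 -   98.4 99.2  A
5.4 3.9 98.3 99.4  A
7.0 3.2 95.3 98.4. B
6.4 3.2 95.5 98.5  B
6.9 3.1 95.1 98.5  B
-   2.3 96.0 98.3  B
6.5 2.8 95.4 98.5  B
5.7 -   95.5 98.3  B
6.3 3.3 -    98.6  B
6.7 3.3 94.3 97.5  C
6.7 3.0 94.8 97.3  C
6.3 2.5 95.0 98.9  C
-   3.0 94.8 98.0  C
6.2 3.4 94.6 97.3  C
5.9 3.0 94.9 98.8  C
"""
COLUMNS = ["Current Month's Usage", "Last 3 Months' Usage",
           "Average Recharge", "Projected Growth", "Circle"]

rows = [line.split() for line in DATA.strip().splitlines()]


def percent_missing(rows, col):
    """Percentage of records whose value in column `col` is missing."""
    missing = sum(1 for r in rows if r[col] == "-")
    return round(100.0 * missing / len(rows), 2)


pct = {name: percent_missing(rows, i) for i, name in enumerate(COLUMNS)}

# Keep only the attributes that pass the 20% rule.
usable = [name for name, p in pct.items() if p <= 20.0]

for name, p in pct.items():
    print(f"{name:24s} {p:6.2f}% missing")
```

All five attributes stay below the 20% threshold here, so none is dropped.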

Page 8: Course Material DMBA

Missing Value Handling

Step 3: Prepare Pivot table of attributes (number of records per circle)

Current Month's Usage  A  B  C  Grand Total
Missing                1  1  1  3
Non Missing            5  6  5  16
Grand Total            6  7  6  19

Last 3 Months' Usage   A  B  C  Grand Total
Missing                1  1  0  2
Non Missing            5  6  6  17
Grand Total            6  7  6  19

Average Recharge       A  B  C  Grand Total
Missing                1  1  0  2
Non Missing            5  6  6  17
Grand Total            6  7  6  19

Projected Growth       A  B  C  Grand Total
Missing                0  0  0  0
Non Missing            6  7  6  19
Grand Total            6  7  6  19

Conclusion: In no circle are 100% of an attribute's values missing.

Page 9: Course Material DMBA

Missing Value Handling

Step 3: Prepare Pivot table of attributes (% of records per circle)

Current Month's Usage  A      B      C
Missing                16.67  14.29  16.67
Non Missing            83.33  85.71  83.33
Grand Total            100    100    100

Last 3 Months' Usage   A      B      C
Missing                16.67  14.29  0.00
Non Missing            83.33  85.71  100.00
Grand Total            100    100    100

Average Recharge       A      B      C
Missing                16.67  14.29  0.00
Non Missing            83.33  85.71  100.00
Grand Total            100    100    100

Projected Growth       A      B      C
Missing                0      0      0
Non Missing            100    100    100
Grand Total            100    100    100

Conclusion: In no circle are 100% of an attribute's values missing.

Page 10: Course Material DMBA

Missing Value Handling

Example: 3 Choices

Choice 1: Ignore missing value records

SL No.  Current Month's Usage  Last 3 Months' Usage  Average Recharge  Projected Growth  Circle
1       5.1                    3.5                   99.4              99.2              A
2       4.9                    3                     98.6              99.2              A
4       4.6                    3.1                   98.5              9..2              A
6       5.4                    3.9                   98.3              99.4              A
7       7                      3.2                   95.3              98.4.             B
8       6.4                    3.2                   95.5              98.5              B
9       6.9                    3.1                   95.1              98.5              B
11      6.5                    2.8                   95.4              98.5              B
14      6.7                    3.3                   94.3              97.5              C
15      6.7                    3                     94.8              97.3              C
16      6.3                    2.5                   95                98.9              C
18      6.2                    3.4                   94.6              97.3              C
19      5.9                    3                     94.9              98.8              C

Page 11: Course Material DMBA

Missing Value Handling

Example: Circle wise Data (- marks a missing value)

Choice 2: Replace the missing values with attribute mean, minimum, maximum or mode

SL No.  Current Month's Usage  Last 3 Months' Usage  Average Recharge  Projected Growth  Circle
1       5.1                    3.5                   99.4              99.2              A
2       4.9                    3                     98.6              99.2              A
3       -                      3.2                   -                 99.2              A
4       4.6                    3.1                   98.5              9..2              A
5       5                      -                     98.4              99.2              A
6       5.4                    3.9                   98.3              99.4              A
7       7                      3.2                   95.3              98.4.             B
8       6.4                    3.2                   95.5              98.5              B
9       6.9                    3.1                   95.1              98.5              B
10      -                      2.3                   96                98.3              B
11      6.5                    2.8                   95.4              98.5              B
12      5.7                    -                     95.5              98.3              B
13      6.3                    3.3                   -                 98.6              B
14      6.7                    3.3                   94.3              97.5              C
15      6.7                    3                     94.8              97.3              C
16      6.3                    2.5                   95                98.9              C
17      -                      3                     94.8              98                C
18      6.2                    3.4                   94.6              97.3              C
19      5.9                    3                     94.9              98.8              C

Mean    6.0                    3.1                   96.1              98.5
Min     4.6                    2.3                   94.3              97.3
Max     7.0                    3.9                   99.4              99.4

Page 12: Course Material DMBA

Missing Value Handling

Example: Circle wise Data

Choice 2: Replace the missing values with attribute mean, minimum, maximum or mode

Data after replacing each missing value with the attribute mean:

SL No.  Current Month's Usage  Last 3 Months' Usage  Average Recharge  Projected Growth  Circle
1       5.1                    3.5                   99.4              99.2              A
2       4.9                    3                     98.6              99.2              A
3       6                      3.2                   96.1              99.2              A
4       4.6                    3.1                   98.5              9..2              A
5       5                      3.1                   98.4              99.2              A
6       5.4                    3.9                   98.3              99.4              A
7       7                      3.2                   95.3              98.4.             B
8       6.4                    3.2                   95.5              98.5              B
9       6.9                    3.1                   95.1              98.5              B
10      6                      2.3                   96                98.3              B
11      6.5                    2.8                   95.4              98.5              B
12      5.7                    3.1                   95.5              98.3              B
13      6.3                    3.3                   96.1              98.6              B
14      6.7                    3.3                   94.3              97.5              C
15      6.7                    3                     94.8              97.3              C
16      6.3                    2.5                   95                98.9              C
17      6                      3                     94.8              98                C
18      6.2                    3.4                   94.6              97.3              C
19      5.9                    3                     94.9              98.8              C

Mean    6.0                    3.1                   96.1              98.5
Min     4.6                    2.3                   94.3              97.3
Max     7.0                    3.9                   99.4              99.4
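Choice 2 with the attribute mean can be sketched as follows. This is a minimal stand-alone illustration, not the Rapid Miner operator: the `DATA` string re-types the three attributes that have missing values (`-` is this sketch's marker for a missing cell), and Projected Growth is left out because it has nothing to replace.

```python
from statistics import mean

# Current month's usage, last 3 months' usage, average recharge, circle.
# "-" marks a missing value (a convention of this sketch).
DATA = """
5.1 3.5 99.4 A
4.9 3.0 98.6 A
-   3.2 -    A
4.6 3.1 98.5 A
5.0 -   98.4 A
5.4 3.9 98.3 A
7.0 3.2 95.3 B
6.4 3.2 95.5 B
6.9 3.1 95.1 B
-   2.3 96.0 B
6.5 2.8 95.4 B
5.7 -   95.5 B
6.3 3.3 -    B
6.7 3.3 94.3 C
6.7 3.0 94.8 C
6.3 2.5 95.0 C
-   3.0 94.8 C
6.2 3.4 94.6 C
5.9 3.0 94.9 C
"""


def parse(cell):
    """Turn a table cell into a float, or None when it is missing."""
    return None if cell == "-" else float(cell)


# Keep the three numeric columns; the circle label is not needed here.
records = [[parse(c) for c in line.split()[:3]]
           for line in DATA.strip().splitlines()]

# Attribute means over the non-missing values only.
col_means = [mean(r[j] for r in records if r[j] is not None) for j in range(3)]

# Replace every missing value with the mean of its attribute.
filled = [[col_means[j] if v is None else v for j, v in enumerate(r)]
          for r in records]

print([round(m, 1) for m in col_means])  # → [6.0, 3.1, 96.1]
```

Swapping `mean` for `min` or `max` gives the other variants named in Choice 2.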

Page 13: Course Material DMBA

Missing Value Handling

Example: Circle wise Data (- marks a missing value)

Choice 3: Replace the missing values with the attribute mean of the corresponding circle

SL No.  Current Month's Usage  Last 3 Months' Usage  Average Recharge  Projected Growth  Circle
1       5.1                    3.5                   99.4              99.2              A
2       4.9                    3                     98.6              99.2              A
3       -                      3.2                   -                 99.2              A
4       4.6                    3.1                   98.5              9..2              A
5       5                      -                     98.4              99.2              A
6       5.4                    3.9                   98.3              99.4              A
7       7                      3.2                   95.3              98.4.             B
8       6.4                    3.2                   95.5              98.5              B
9       6.9                    3.1                   95.1              98.5              B
10      -                      2.3                   96                98.3              B
11      6.5                    2.8                   95.4              98.5              B
12      5.7                    -                     95.5              98.3              B
13      6.3                    3.3                   -                 98.6              B
14      6.7                    3.3                   94.3              97.5              C
15      6.7                    3                     94.8              97.3              C
16      6.3                    2.5                   95                98.9              C
17      -                      3                     94.8              98                C
18      6.2                    3.4                   94.6              97.3              C
19      5.9                    3                     94.9              98.8              C

Mean    5.00                   3.34                  98.64             99.24             A
Mean    6.47                   2.98                  95.47             98.45             B
Mean    6.36                   3.03                  94.73             97.97             C

Page 14: Course Material DMBA

Missing Value Handling

Example: Circle wise Data

Choice 3: Replace the missing values with the attribute mean of the corresponding circle

Data after replacing each missing value with the mean of its circle:

SL No.  Current Month's Usage  Last 3 Months' Usage  Average Recharge  Projected Growth  Circle
1       5.1                    3.5                   99.4              99.2              A
2       4.9                    3                     98.6              99.2              A
3       5                      3.2                   98.64             99.2              A
4       4.6                    3.1                   98.5              9..2              A
5       5                      3.34                  98.4              99.2              A
6       5.4                    3.9                   98.3              99.4              A
7       7                      3.2                   95.3              98.4.             B
8       6.4                    3.2                   95.5              98.5              B
9       6.9                    3.1                   95.1              98.5              B
10      6.47                   2.3                   96                98.3              B
11      6.5                    2.8                   95.4              98.5              B
12      5.7                    2.98                  95.5              98.3              B
13      6.3                    3.3                   95.47             98.6              B
14      6.7                    3.3                   94.3              97.5              C
15      6.7                    3                     94.8              97.3              C
16      6.3                    2.5                   95                98.9              C
17      6.36                   3                     94.8              98                C
18      6.2                    3.4                   94.6              97.3              C
19      5.9                    3                     94.9              98.8              C

Mean    5.00                   3.34                  98.64             99.24             A
Mean    6.47                   2.98                  95.47             98.45             B
Mean    6.36                   3.03                  94.73             97.97             C
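Choice 3 differs from Choice 2 only in that the mean is taken within each circle. The sketch below illustrates this under the same assumptions as before: the data is re-typed by hand, `-` is an invented marker for a missing cell, only the three attributes with missing values are included, and names such as `circle_means` are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Current month's usage, last 3 months' usage, average recharge, circle.
# "-" marks a missing value (a convention of this sketch).
DATA = """
5.1 3.5 99.4 A
4.9 3.0 98.6 A
-   3.2 -    A
4.6 3.1 98.5 A
5.0 -   98.4 A
5.4 3.9 98.3 A
7.0 3.2 95.3 B
6.4 3.2 95.5 B
6.9 3.1 95.1 B
-   2.3 96.0 B
6.5 2.8 95.4 B
5.7 -   95.5 B
6.3 3.3 -    B
6.7 3.3 94.3 C
6.7 3.0 94.8 C
6.3 2.5 95.0 C
-   3.0 94.8 C
6.2 3.4 94.6 C
5.9 3.0 94.9 C
"""

rows = []
for line in DATA.strip().splitlines():
    *cells, circle = line.split()
    rows.append(([None if c == "-" else float(c) for c in cells], circle))

# Collect the non-missing values of each attribute, circle by circle.
observed = defaultdict(lambda: [[] for _ in range(3)])
for values, circle in rows:
    for j, v in enumerate(values):
        if v is not None:
            observed[circle][j].append(v)

circle_means = {c: [mean(col) for col in cols] for c, cols in observed.items()}

# Replace every missing value with the mean of its attribute in its circle.
filled = [([circle_means[c][j] if v is None else v
            for j, v in enumerate(values)], c)
          for values, c in rows]

for circle, means in sorted(circle_means.items()):
    print(circle, [round(m, 2) for m in means])
```

Grouping by circle keeps the replacement values close to the other records of the same circle, which the overall mean does not guarantee (compare 98.64 with 96.1 for record 3).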

Page 15: Course Material DMBA

Exercise: The data on 3 modes of transport of a supply chain management company are given below. Handle the missing values.

SL No Delivery Speed Vehicles Extra Handling Cost Mode of Transport

1 27.75 3 2 Water

2 3 445 Water

3 28.2 3 1 460 Water

4 8.75 1 0 980 Direct Truck

5 9.25 0 950 Direct Truck

6 9.15 1 1 Direct Truck

7 15.2 3 2 820 LTL Truck

8 16.2 2 2.5 810 LTL Truck

9 3 1.5 835 LTL Truck

LTL Truck : Less than truck load

DATA PREPROCESSING: Missing Value Handling

Page 16: Course Material DMBA

MARKET BASKET ANALYSIS

Page 17: Course Material DMBA

MARKET BASKET ANALYSIS

A modeling technique based on the logic that if a customer buys a certain group of items, he or she is more (or less) likely to buy another group of items.

Example:

Those who buy cigarettes are more likely to buy a matchbox as well.

Page 18: Course Material DMBA

MARKET BASKET ANALYSIS

Association Rule Mining:

Developing rules that predict the occurrence of an item based on the occurrence of other items in the transaction

Example

Id  Items
1   Milk, Bread
2   Bread, Biscuits, Toys, Eggs
3   Milk, Biscuits, Toys, Fruits
4   Bread, Milk, Toys, Biscuits
5   Milk, Bread, Biscuits, Fruits

{Milk, Bread} → {Biscuits} with probability = 2 / 3

Page 19: Course Material DMBA

MARKET BASKET ANALYSIS

Itemset: A collection of one or more items

k-itemset: An itemset consisting of k items

Id  Items
1   Milk, Bread
2   Bread, Biscuits, Toys, Eggs
3   Milk, Biscuits, Toys, Fruits
4   Bread, Milk, Toys, Biscuits
5   Milk, Bread, Biscuits, Fruits

Page 20: Course Material DMBA

MARKET BASKET ANALYSIS

Support count: Frequency of occurrence of an itemset

Example

Support count of {Milk, Bread, Biscuits} = 2

Id  Items
1   Milk, Bread
2   Bread, Biscuits, Toys, Eggs
3   Milk, Biscuits, Toys, Fruits
4   Bread, Milk, Toys, Biscuits
5   Milk, Bread, Biscuits, Fruits

Page 21: Course Material DMBA

MARKET BASKET ANALYSIS

Support: Proportion or fraction of transactions that contain an itemset

Example

Support of {Milk, Bread, Biscuits} = 2 / 5

Id  Items
1   Milk, Bread
2   Bread, Biscuits, Toys, Eggs
3   Milk, Biscuits, Toys, Fruits
4   Bread, Milk, Toys, Biscuits
5   Milk, Bread, Biscuits, Fruits

Frequent Itemset: An itemset whose support is greater than or equal to the minimum support

Page 22: Course Material DMBA

MARKET BASKET ANALYSIS

Id  Items
1   Milk, Bread
2   Bread, Biscuits, Toys, Eggs
3   Milk, Biscuits, Toys, Fruits
4   Bread, Milk, Toys, Biscuits
5   Milk, Bread, Biscuits, Fruits

Confidence: Conditional probability that an item will appear in transactions that contain other items

Example

Confidence that Toys will appear in transactions containing Milk & Biscuits

= Support count {Milk, Biscuits, Toys} / Support count {Milk, Biscuits} = 2 / 3 = 0.67
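The definitions of support count, support and confidence can be checked against the five example transactions with a short Python sketch. The function names are chosen for this illustration only and do not correspond to any Rapid Miner operator.

```python
# The five example transactions, keyed by transaction Id.
TRANSACTIONS = {
    1: {"Milk", "Bread"},
    2: {"Bread", "Biscuits", "Toys", "Eggs"},
    3: {"Milk", "Biscuits", "Toys", "Fruits"},
    4: {"Bread", "Milk", "Toys", "Biscuits"},
    5: {"Milk", "Bread", "Biscuits", "Fruits"},
}


def support_count(itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for items in TRANSACTIONS.values() if set(itemset) <= items)


def support(itemset):
    """Fraction of transactions that contain `itemset`."""
    return support_count(itemset) / len(TRANSACTIONS)


def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support_count(both) / support_count(antecedent)


print(support_count({"Milk", "Bread", "Biscuits"}))       # → 2
print(support({"Milk", "Bread", "Biscuits"}))             # → 0.4
print(round(confidence({"Milk", "Biscuits"}, {"Toys"}), 2))  # → 0.67
```

The rule {Milk, Bread} → {Biscuits} from the earlier slide is the same calculation: `confidence({"Milk", "Bread"}, {"Biscuits"})` gives 2 / 3.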

Page 23: Course Material DMBA

MARKET BASKET ANALYSIS

Association Rule Mining

1. Frequent Itemset Generation

• Fix minimum support value

• Generate all itemsets whose support ≥ minimum support

2. Rule Generation

• Fix minimum confidence value

• Generate high confidence rules from each frequent itemset

Page 24: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

a. Fix minimum support count

b. Generate all itemsets of length = 1

c. Calculate the support count for each itemset

d. Eliminate all itemsets with support count < minimum support count

e. Build itemsets of the next length (2, 3, ...) from the surviving itemsets and repeat steps c & d

Page 25: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Id  Items
1   A, C, D
2   B, C, E
3   A, B, C, E
4   B, E
5   A, E
6   A, C, E

Page 26: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 1:

Generate itemsets of length = 1 & calculate support

Item  Support count
A     4
B     3
C     4
D     1
E     5

Page 27: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 2:

Eliminate itemsets with support count < minimum support count (2)

Item  Support count
A     4
B     3
C     4
D     1
E     5

Page 28: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 2:

Eliminate itemsets with support count < minimum support count (2)

Item  Support count
A     4
B     3
C     4
E     5

Page 29: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 3:

Generate itemsets of length = 2

Itemset  Support count
A, B     1
A, C     3
A, E     3
C, E     3
B, E     3
B, C     2

Page 30: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 4:

Eliminate itemsets with support count < minimum support count (2)

Itemset  Support count
A, B     1
A, C     3
A, E     3
C, E     3
B, E     3
B, C     2

Page 31: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 4:

Eliminate itemsets with support count < minimum support count (2)

Itemset  Support count
A, C     3
A, E     3
C, E     3
B, E     3
B, C     2

Page 32: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 5:

Generate itemsets of length = 3

Itemset  Support count
A, C, E  2
B, C, E  2

Page 33: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Step 6:

Generate itemsets of length = 4

Itemset     Support count
A, B, C, E  1

Page 34: Course Material DMBA

MARKET BASKET ANALYSIS

Frequent Itemset Generation: Apriori Algorithm

Example:

Minimum Support count = 2

Result:

Itemset  Support count  Support
C, E     3              0.50
B, E     3              0.50
B, C     2              0.33
A, E     3              0.50
A, C     3              0.50
A, C, E  2              0.33
B, C, E  2              0.33
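The whole walk-through (steps 1 to 6) can be reproduced with a small Python sketch of the Apriori idea. It is an illustration, not the implementation behind any Rapid Miner operator, and it makes one assumption beyond the slides: before counting a longer candidate it applies the usual Apriori pruning rule (every subset one item shorter must itself be frequent), so {A, B, C, E} is discarded without being counted. The frequent itemsets found are the same.

```python
from itertools import combinations

# The six example transactions (Ids 1 to 6).
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "E"},
    {"A", "C", "E"},
]


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Steps c & d: count each candidate, keep those that reach min_count.
        counts = {c: count(c) for c in candidates}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Step e: join survivors into candidates that are one item longer.
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Pruning (an addition to the slides): all (k-1)-subsets must survive.
        candidates = [c for c in joined
                      if all(frozenset(s) in survivors
                             for s in combinations(c, k - 1))]
    return frequent


freq = apriori(TRANSACTIONS, min_count=2)
for itemset, n in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n, round(n / len(TRANSACTIONS), 2))
```

The output also lists the frequent single items A, B, C and E, which the result table above omits.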

Page 35: Course Material DMBA

MARKET BASKET ANALYSIS

Association Rule Mining: Apriori Algorithm

Example:

Minimum Support = 0.50

Minimum Confidence = 0.5

Itemset  Support count  Support
C, E     3              0.50
B, E     3              0.50
B, C     2              0.33
A, E     3              0.50
A, C     3              0.50
A, C, E  2              0.33
B, C, E  2              0.33

Page 36: Course Material DMBA

MARKET BASKET ANALYSIS

Association Rule Mining: Apriori Algorithm

Example:

Minimum Support = 0.50

Minimum Confidence = 0.5

Rule   Support  Confidence
B → E  0.50     1.00
A → E  0.50     0.75
A → C  0.50     0.75
E → A  0.50     0.60
C → A  0.50     0.75
C → E  0.50     0.75
E → C  0.50     0.60
E → B  0.50     0.60
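The rule table can be regenerated from the transactions. The sketch below is a hand-rolled illustration: it starts from the four pairs whose support reaches 0.50 (listed by hand in `FREQUENT_PAIRS`, a name invented here) and keeps each direction of a pair whose confidence is at least 0.5.

```python
from itertools import permutations

# The six example transactions (Ids 1 to 6).
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "E"},
    {"A", "C", "E"},
]
MIN_CONFIDENCE = 0.5

# Pairs whose support is at least 0.50 (3 of 6 transactions).
FREQUENT_PAIRS = [("A", "C"), ("A", "E"), ("B", "E"), ("C", "E")]


def count(items):
    """Number of transactions containing all of `items`."""
    return sum(1 for t in TRANSACTIONS if set(items) <= t)


rules = {}
for pair in FREQUENT_PAIRS:
    # Each pair yields two candidate rules: lhs → rhs and rhs → lhs.
    for lhs, rhs in permutations(pair):
        conf = count(pair) / count([lhs])
        if conf >= MIN_CONFIDENCE:
            rules[(lhs, rhs)] = conf

for (lhs, rhs), conf in sorted(rules.items()):
    print(f"{lhs} → {rhs}: support {count((lhs, rhs)) / len(TRANSACTIONS):.2f}, "
          f"confidence {conf:.2f}")
```

All eight candidate rules pass the confidence threshold in this example, which matches the eight rows of the table.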

Page 37: Course Material DMBA

MARKET BASKET ANALYSIS

Association Rule Mining: Other Measures

Lift

Lift (A → C) = Confidence (A → C) / Support (C)

Example

Rule   Support of consequent  Confidence  Lift
A → C  C = 0.67               0.75        1.12
A → E  E = 0.83               0.75        0.90

Criteria: Lift ≥ 1

Lift (A → C) = 1.12 > Lift (A → E) = 0.90 indicates that A has a greater impact on the frequency of C than it has on the frequency of E
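Lift can be computed directly from the definition. The following is a minimal sketch on the six example transactions, with function names chosen for this illustration. It gives about 1.125 for A → C and 0.90 for A → E (0.75 / 0.8333).

```python
# The six example transactions (Ids 1 to 6).
TRANSACTIONS = [
    {"A", "C", "D"},
    {"B", "C", "E"},
    {"A", "B", "C", "E"},
    {"B", "E"},
    {"A", "E"},
    {"A", "C", "E"},
]


def support(items):
    """Fraction of transactions containing all of `items`."""
    return sum(1 for t in TRANSACTIONS if set(items) <= t) / len(TRANSACTIONS)


def confidence(lhs, rhs):
    """Confidence of the rule lhs → rhs."""
    return support(set(lhs) | set(rhs)) / support(lhs)


def lift(lhs, rhs):
    """Lift of lhs → rhs: confidence divided by the support of rhs."""
    return confidence(lhs, rhs) / support(rhs)


for lhs, rhs in [("A", "C"), ("A", "E")]:
    print(f"Lift({lhs} → {rhs}) = {lift(lhs, rhs):.3f}")
```

A lift above 1 means the consequent appears more often with the antecedent than it does overall; a lift below 1, as for A → E, means it appears less often.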

Page 38: Course Material DMBA

MARKET BASKET ANALYSIS

Exercise 1: The data on transactions from a mobile outlet are given below.

1. Generate frequent itemsets with a support of at least 25%.

2. Generate associations of items with a confidence of at least 50%.

3. Estimate the chance that Mobile Slim, Landline and Broadband will be subscribed together.

4. Estimate the chance that customers who buy Landline will also purchase Broadband & Ring tones.

Page 39: Course Material DMBA

MARKET BASKET ANALYSIS

Exercise 2:

The Market Basket Software data set contains the details of transactions at a software product company.

1. Identify the frequent product types with a minimum support of 25%.

2. Identify the associations of products with a minimum confidence of 50%.

3. What is the chance that Operating System and Office Suite will be purchased together?

4. What is the chance that Operating System and Visual Studio will be purchased together?

5. Estimate the chance that customers who buy Operating System will also purchase Office Suite.

6. Estimate the chance that customers who buy Operating System will also purchase Visual Studio.

Page 40: Course Material DMBA

LINEAR REGRESSION

Page 41: Course Material DMBA

CORRELATION & REGRESSION

Correlation:

Correlation analysis is a technique to identify the relationship between two variables: both the type and the degree of the relationship.

Page 42: Course Material DMBA

CORRELATION & REGRESSION

Correlation: Usage

Explore the relationship between the output characteristic and an input or process variable.

Output variable : Y : Dependent variable

Input / Process variable : X : Independent variable

Page 43: Course Material DMBA

CORRELATION & REGRESSION

Positive Correlation: Y increases as X increases & vice versa

[Scatter plot: Y (axis 0 to 20) against X (axis 0 to 12)]

Page 44: Course Material DMBA

CORRELATION & REGRESSION

Negative Correlation: Y decreases as X increases & vice versa

[Scatter plot: Y (axis 0 to 9) against X (axis 1 to 10)]

Page 45: Course Material DMBA

CORRELATION & REGRESSION

No Correlation: Random distribution of points

[Scatter plot: Y (axis 0 to 100) against X (axis 0 to 100)]

Page 46: Course Material DMBA

CORRELATION & REGRESSION

Is there any correlation?

[Scatter plot: Y (axis 0 to 30) against X (axis 0 to 12)]

Page 47: Course Material DMBA

CORRELATION & REGRESSION

Measure of Correlation: Coefficient of Correlation

Symbol : r

Range : -1 to 1

Sign : Type of correlation

Value : Degree of correlation

Examples:

r = 0.6: positive correlation with a degree of 0.6 (60%)

r = -0.82: negative correlation with a degree of 0.82 (82%)

r = 0: no correlation

Page 48: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Collect data on x and y: When x is low, y is also low & vice versa

x  y
2  5
3  7
1  3
5  11
6  12
7  15

Page 49: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Calculate Mean of x & y values

SL No.  x  y
1       2  5
2       3  7
3       1  3
4       5  11
5       6  12
6       7  15
Mean    4  8.83

Page 50: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Take x – Mean x and y – Mean y

SL No.  x – Mean x  y – Mean y
1       -2          -3.83
2       -1          -1.83
3       -3          -5.83
4       1           2.17
5       2           3.17
6       3           6.17

Conclusion: Low values will become negative & high values will become positive

Page 51: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Generally, when x – Mean x is negative, y – Mean y is also negative & vice versa

SL No.  x – Mean x  y – Mean y
1       -2          -3.83
2       -1          -1.83
3       -3          -5.83
4       1           2.17
5       2           3.17
6       3           6.17

Page 52: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

Then the product of (x – Mean x) and (y – Mean y) will be positive

SL No.     x – Mean x  y – Mean y  Product
1          -2          -3.83       7.66
2          -1          -1.83       1.83
3          -3          -5.83       17.49
4          1           2.17        2.17
5          2           3.17        6.34
6          3           6.17        18.51
Sum = Sxy                          54

Page 53: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Positive Correlation

The sum of the products of (x – Mean x) and (y – Mean y), Sxy, will be positive

SL No.     x – Mean x  y – Mean y  Product
1          -2          -3.83       7.66
2          -1          -1.83       1.83
3          -3          -5.83       17.49
4          1           2.17        2.17
5          2           3.17        6.34
6          3           6.17        18.51
Sum = Sxy                          54

Page 54: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Collect data on x and y: When x is low then y will be high & vice versa

x  y
2  12
3  11
1  15
5  7
6  5
7  3

Page 55: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Calculate Mean of x & y values

SL No.  x  y
1       2  12
2       3  11
3       1  15
4       5  7
5       6  5
6       7  3
Mean    4  8.83

Page 56: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Take x – Mean x and y – Mean y

SL No.  x – Mean x  y – Mean y
1       -2          3.17
2       -1          2.17
3       -3          6.17
4       1           -1.83
5       2           -3.83
6       3           -5.83

Conclusion: Low values will become negative & high values will become positive

Page 57: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Generally, when x – Mean x is negative, y – Mean y is positive & vice versa

SL No.  x – Mean x  y – Mean y
1       -2          3.17
2       -1          2.17
3       -3          6.17
4       1           -1.83
5       2           -3.83
6       3           -5.83

Page 58: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

Then the product of (x – Mean x) and (y – Mean y) will be negative

SL No.     x – Mean x  y – Mean y  Product
1          -2          3.17        -6.34
2          -1          2.17        -2.17
3          -3          6.17        -18.51
4          1           -1.83       -1.83
5          2           -3.83       -7.66
6          3           -5.83       -17.49
Sum = Sxy                          -54

Page 59: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation: Negative Correlation

The sum of the products of (x – Mean x) and (y – Mean y), Sxy, will be negative

SL No.     x – Mean x  y – Mean y  Product
1          -2          3.17        -6.34
2          -1          2.17        -2.17
3          -3          6.17        -18.51
4          1           -1.83       -1.83
5          2           -3.83       -7.66
6          3           -5.83       -17.49
Sum = Sxy                          -54

Page 60: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation:

In Short

If correlation is positive, Sxy will be positive

If correlation is negative, Sxy will be negative

Page 61: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation:

To avoid scale issues, Sxy is divided by √(Sxx · Syy)

$S_{xy} = \sum (x - \bar{x})(y - \bar{y})$

$S_{xx} = \sum (x - \bar{x})^2$

$S_{yy} = \sum (y - \bar{y})^2$

Correlation Coefficient: $r = S_{xy} / \sqrt{S_{xx} \, S_{yy}}$

Page 62: Course Material DMBA

CORRELATION & REGRESSION

Coefficient of Correlation:

SL No.  x – Mean x  y – Mean y  Product  (x – Mean x)²  (y – Mean y)²
1       -2          3.17        -6.34    4              10.0489
2       -1          2.17        -2.17    1              4.7089
3       -3          6.17        -18.51   9              38.0689
4       1           -1.83       -1.83    1              3.3489
5       2           -3.83       -7.66    4              14.6689
6       3           -5.83       -17.49   9              33.9889
Sum                             Sxy: -54 Sxx: 28        Syy: 104.83

$r = S_{xy} / \sqrt{S_{xx} \, S_{yy}} = -54 / \sqrt{28 \times 104.83} = -0.9967$
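The calculation of r can be verified with a few lines of Python. This is a plain translation of the formulas for Sxy, Sxx and Syy; the function name `correlation` is chosen for this sketch.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [2, 3, 1, 5, 6, 7]
y_positive = [5, 7, 3, 11, 12, 15]   # data of the positive example
y_negative = [12, 11, 15, 7, 5, 3]   # data of the negative example

print(round(correlation(x, y_positive), 4))  # → 0.9967
print(round(correlation(x, y_negative), 4))  # → -0.9967
```

Both examples use the same y values in a different order, so Sxx and Syy are identical and only the sign of Sxy, and therefore of r, changes.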

Page 63: Course Material DMBA

CORRELATION & REGRESSION

Regression

Correlation helps

• To check whether two variables are related

• If related, to identify the type & degree of relationship

Page 64: Course Material DMBA

CORRELATION & REGRESSION

Regression

Regression helps

• To identify the exact form of the relationship

• To model output in terms of input or process variables

Examples:

Yield = 5 + 3 x Time - 2 x Temperature

Y = 2 - 5x
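How such an equation is obtained from data is not spelled out on the slides. A common choice, assumed here, is ordinary least squares, where for one input variable the slope is Sxy / Sxx and the intercept is Mean y minus slope times Mean x. The sketch below applies it to the x and y data of the positive correlation example; `fit_line` is a name invented for this illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


x = [2, 3, 1, 5, 6, 7]
y = [5, 7, 3, 11, 12, 15]

a, b = fit_line(x, y)
print(f"y = {a:.2f} + {b:.2f} x")

# Predicted value for a new input, for example x = 4.
print(round(a + b * 4, 2))
```

With Sxy = 54 and Sxx = 28 the slope is about 1.93 and the intercept about 1.12, so the fitted line passes through the point (Mean x, Mean y) = (4, 8.83).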

Page 65: Course Material DMBA

CORRELATION & REGRESSION

Multiple Regression

To model output variable y in terms of two or more variables.

General Form:

$Y = a + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$

Two variable case:

$Y = a + b_1 X_1 + b_2 X_2$

Page 66: Course Material DMBA

CORRELATION & REGRESSION

Exercise 1: The data on Vendor performance score and the number of On Time, Complete, Undamaged & Correctly billed shipments from the vendors of a supply chain management company are given below. Can you develop a model for Vendor performance score in terms of the other variables?

Vendor Id  Ontime Shipment  Complete Shipment  Undamaged Shipments  Correctly billed  Performance Score
1          950              990                980                  550               2985
2          1450             1425               1475                 975               4576
3          1700             1575               1730                 1320              5435
4          1800             1515               1890                 1615              5955
5          1675             1420               1756                 1456              5400
6          1756             1645               1835                 1489              5590
7          1236             1462               1335                 1435              4675
8          1100             1523               1565                 1625              4960
9          1325             1725               1570                 1520              5325
10         1450             1620               1463                 1430              5170
11         1570             1458               1356                 1630              5190

Page 67: Course Material DMBA

LINEAR REGRESSION

Exercise 2: A construction company wants to develop a model for concrete compressive strength. The attributes of interest are given in the table below. The training data is given in the file Concrete_Data.xls.

1. Can you develop the model?

2. How closely will it predict the values?

1  Cement (component 1) (kg in a m^3 mixture)
2  Blast Furnace Slag (component 2) (kg in a m^3 mixture)
3  Fly Ash (component 3) (kg in a m^3 mixture)
4  Water (component 4) (kg in a m^3 mixture)
5  Superplasticizer (component 5) (kg in a m^3 mixture)
6  Coarse Aggregate (component 6) (kg in a m^3 mixture)
7  Fine Aggregate (component 7) (kg in a m^3 mixture)
8  Age (day)
9  Concrete compressive strength (MPa, megapascals)

Page 68: Course Material DMBA

CLASSIFICATION METHODS

Page 69: Course Material DMBA

INTRODUCTION

Objective

To develop a mathematical model for an attribute or response metric (Y) in terms of other available attributes (Xs).

When to Use

Xs : Continuous or discrete

Y : Discrete

Page 70: Course Material DMBA

CLASSIFICATION METHODS

Classifies data (develops a model) based on the training data

Each sample is assumed to belong to a predefined class

The sample data set used for building the model is the training set

Usage:

For classifying future or unknown data

Page 71: Course Material DMBA

CLASSIFICATION METHODS

Example:

Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)

x1     x2    Y     x1     x2    Y
11.35  23    Blue  11.85  39.9  Red
11.59  22.3  Blue  12.09  39.5  Red
12.19  24.5  Blue  12.69  37.8  Red
13.23  26.4  Blue  13.73  38.2  Red
13.51  30.2  Blue  14.01  37.8  Red
13.68  32    Blue  14.18  36.5  Red
14.78  33.1  Blue  15.28  36    Red
15.11  33    Blue  15.61  37.1  Red
15.55  25.2  Blue  16.05  33.1  Red
16.37  24.1  Blue  16.87  32.4  Red
16.99  22    Blue  17.49  31    Red
18.23  23.5  Blue  18.73  32    Red
18.83  24.1  Blue  19.33  31.8  Red
19.06  25    Blue  19.56  30.9  Red

Page 72: Course Material DMBA

CLASSIFICATION METHODS

Example:

Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)

[Scatter plot: x2 (axis 20 to 40) against x1 (axis 10 to 20)]

Page 73: Course Material DMBA

CLASSIFICATION METHODS

Example:

Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)

[Scatter plot: x2 (axis 20 to 40) against x1 (axis 10 to 20)]

Decision tree, split on x2: x2 > 35 → y1

Page 74: Course Material DMBA

CLASSIFICATION METHODS

Example:

Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)

[Scatter plot: x2 (axis 20 to 40) against x1 (axis 10 to 20)]

Decision tree, split on x2: x2 > 35 → y1; x2 < 28 → y2

Page 75: Course Material DMBA

CLASSIFICATION METHODS

Example:

Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)

[Scatter plot: x2 (axis 20 to 40) against x1 (axis 10 to 20)]

Decision tree, split on x2: x2 > 35 → y1; x2 < 28 → y2; otherwise split on x1: x1 < 15.5 → y2; x1 > 15.5 → y1

Page 76: Course Material DMBA

CLASSIFICATION METHODS

Example: Rules

Attribute 1: x1
Attribute 2: x2
Label y: y1 (Red), y2 (Blue)

Decision tree, split on x2: x2 > 35 → y1; x2 < 28 → y2; otherwise split on x1: x1 < 15.5 → y2; x1 > 15.5 → y1

If x2 > 35, then y = y1

If x2 < 28, then y = y2

If 28 ≤ x2 ≤ 35 & x1 > 15.5, then y = y1

If 28 ≤ x2 ≤ 35 & x1 < 15.5, then y = y2
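The four rules translate directly into a small function, which can then be checked against the 28 points of the example. The sketch below is an illustration only; the middle band is taken as 28 ≤ x2 ≤ 35, and the function name `classify` is invented here.

```python
# The 28 example points as (x1, x2) pairs, grouped by their label.
BLUE = [(11.35, 23), (11.59, 22.3), (12.19, 24.5), (13.23, 26.4),
        (13.51, 30.2), (13.68, 32), (14.78, 33.1), (15.11, 33),
        (15.55, 25.2), (16.37, 24.1), (16.99, 22), (18.23, 23.5),
        (18.83, 24.1), (19.06, 25)]
RED = [(11.85, 39.9), (12.09, 39.5), (12.69, 37.8), (13.73, 38.2),
       (14.01, 37.8), (14.18, 36.5), (15.28, 36), (15.61, 37.1),
       (16.05, 33.1), (16.87, 32.4), (17.49, 31), (18.73, 32),
       (19.33, 31.8), (19.56, 30.9)]


def classify(x1, x2):
    """Apply the decision rules: first split on x2, then on x1."""
    if x2 > 35:
        return "Red"    # y1
    if x2 < 28:
        return "Blue"   # y2
    # Middle band 28 <= x2 <= 35: x1 decides.
    return "Red" if x1 > 15.5 else "Blue"


errors = [(p, "Blue") for p in BLUE if classify(*p) != "Blue"]
errors += [(p, "Red") for p in RED if classify(*p) != "Red"]
print(f"misclassified points: {len(errors)} of {len(BLUE) + len(RED)}")
```

The rules reproduce the label of every training point, which is what a fully grown tree on this data is expected to do; it says nothing yet about accuracy on new data.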

Page 77: Course Material DMBA

CLASSIFICATION METHODS

Example: Table 1 gives the profile of customers (Refund, Marital Status & Taxable Income) who have taken a loan from a bank. The table also shows which of them actually cheated the bank.

1. Can you develop a decision rule to classify whether a customer will cheat or not, based on the values of the 3 attributes (Refund, Marital Status & Taxable Income)?

2. Validate the model using the test data given in Table 2.

Table 2: Test Data

SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Married         > 80 K          No
2      No      Single          > 80 K          No
3      No      Single          < 80 K          No
4      No      Married         > 80 K          No
5      No      Divorced        > 80 K          Yes

Page 78: Course Material DMBA

CLASSIFICATION METHODS

Table 1: Training Data Set

SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Single          > 80 K          No
2      No      Married         > 80 K          No
3      No      Single          < 80 K          No
4      Yes     Married         > 80 K          No
5      No      Divorced        > 80 K          Yes
6      No      Married         < 80 K          No
7      Yes     Divorced        > 80 K          No
8      No      Single          > 80 K          Yes
9      No      Married         > 80 K          No
10     No      Single          > 80 K          Yes

Class variable: Cheat

Number of predefined classes: 2 (Cheat = No & Cheat = Yes)

Page 79: Course Material DMBA

CLASSIFICATION METHODS

Example: Result

If Marital Status = Married, then Cheat: No

If Marital Status = Single & Refund = Yes, then Cheat: No

If Marital Status = Single, Refund = No & Taxable Income < 80K, then Cheat: No

If Marital Status = Single, Refund = No & Taxable Income > 80K, then Cheat: Yes

If Marital Status = Divorced & Refund = Yes, then Cheat: No

If Marital Status = Divorced & Refund = No, then Cheat: Yes
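The six rules can be written as one function and applied to both tables. This is a sketch for checking the rules by hand, not output of Rapid Miner; the tuples re-type Table 1 and Table 2, and Taxable Income is encoded as a boolean `income_above_80k`, a name chosen for this example.

```python
# (refund, marital status, income above 80K?, actual cheat)
TRAIN = [
    ("Yes", "Single", True, "No"),
    ("No", "Married", True, "No"),
    ("No", "Single", False, "No"),
    ("Yes", "Married", True, "No"),
    ("No", "Divorced", True, "Yes"),
    ("No", "Married", False, "No"),
    ("Yes", "Divorced", True, "No"),
    ("No", "Single", True, "Yes"),
    ("No", "Married", True, "No"),
    ("No", "Single", True, "Yes"),
]
TEST = [
    ("Yes", "Married", True, "No"),
    ("No", "Single", True, "No"),
    ("No", "Single", False, "No"),
    ("No", "Married", True, "No"),
    ("No", "Divorced", True, "Yes"),
]


def predict_cheat(refund, marital, income_above_80k):
    """Apply the decision rules derived from the training data."""
    if marital == "Married":
        return "No"
    if refund == "Yes":
        return "No"            # Single or Divorced with a refund
    if marital == "Single":
        return "Yes" if income_above_80k else "No"
    return "Yes"               # Divorced without a refund


train_predictions = [predict_cheat(r, m, i) for r, m, i, _ in TRAIN]
test_predictions = [predict_cheat(r, m, i) for r, m, i, _ in TEST]

print(test_predictions)  # → ['No', 'Yes', 'No', 'No', 'Yes']
```

The rules fit all ten training records, while on the test data record 2 is predicted as a cheat although the customer did not cheat.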

Page 80: Course Material DMBA

CLASSIFICATION METHODS

Example: Decision Tree

SL No  Refund  Marital Status  Taxable Income  Cheat
1      Yes     Single          > 80 K          No
2      No      Married         > 80 K          No
3      No      Single          < 80 K          No
4      Yes     Married         > 80 K          No
5      No      Divorced        > 80 K          Yes
6      No      Married         < 80 K          No
7      Yes     Divorced        > 80 K          No
8      No      Single          > 80 K          Yes
9      No      Married         > 80 K          No
10     No      Single          > 80 K          Yes

Page 81: Course Material DMBA

CLASSIFICATION METHODS

Example: Test Data Set

SL No  Refund  Marital Status  Taxable Income  Cheat  Predicted Cheat
1      Yes     Married         > 80 K          No     No
2      No      Single          > 80 K          No     Yes
3      No      Single          < 80 K          No     No
4      No      Married         > 80 K          No     No
5      No      Divorced        > 80 K          Yes    Yes

Page 82: Course Material DMBA

CLASSIFICATION METHODS

Performance Evaluation Measures

1. Confusion Matrix

                 Actual Class
Predicted Class  Class = Yes  Class = No
Class = Yes      a            b
Class = No       c            d

2. Accuracy: (a + d) / (a + b + c + d)

3. Precision: a / (a + b)

4. Recall: a / (a + c)

5. F Measure = 2 x Precision x Recall / (Precision + Recall)

Page 83: Course Material DMBA

CLASSIFICATION METHODS

Example: Performance Evaluation Measures

1. Confusion Matrix

                 Actual Class
Predicted Class  Cheat = No  Cheat = Yes
Cheat = No       3           0
Cheat = Yes      1           1

SL No  Cheat  Predicted Cheat
1      No     No
2      No     Yes
3      No     No
4      No     No
5      Yes    Yes

Page 84: Course Material DMBA

CLASSIFICATION METHODS

Example: Performance Evaluation Measures

1. Confusion Matrix

                 Actual Class
Predicted Class  Cheat = No  Cheat = Yes
Cheat = No       3           0
Cheat = Yes      1           1

2. Accuracy: (3 + 1) / (3 + 0 + 1 + 1) = 4 / 5 = 0.8

3. Precision: 3 / (3 + 0) = 3 / 3 = 1.0

4. Recall: 3 / (3 + 1) = 3 / 4 = 0.75

5. F Measure = 2 x Precision x Recall / (Precision + Recall)

= 2 x 1.0 x 0.75 / (1.00 + 0.75) = 0.86
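The measures can be recomputed from the actual and predicted labels of the five test records. The sketch below follows the layout used on the slides (rows are predicted classes, columns are actual classes) and treats Cheat = No as the first class; the function names are invented for this illustration.

```python
ACTUAL = ["No", "No", "No", "No", "Yes"]      # Cheat, test records 1 to 5
PREDICTED = ["No", "Yes", "No", "No", "Yes"]  # Predicted Cheat


def confusion_counts(actual, predicted, first_class):
    """Return (a, b, c, d) with rows = predicted and columns = actual."""
    pairs = list(zip(predicted, actual))
    a = sum(1 for p, t in pairs if p == first_class and t == first_class)
    b = sum(1 for p, t in pairs if p == first_class and t != first_class)
    c = sum(1 for p, t in pairs if p != first_class and t == first_class)
    d = sum(1 for p, t in pairs if p != first_class and t != first_class)
    return a, b, c, d


def evaluate(a, b, c, d):
    """Accuracy, precision, recall and F measure from the four counts."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b)
    recall = a / (a + c)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


a, b, c, d = confusion_counts(ACTUAL, PREDICTED, first_class="No")
accuracy, precision, recall, f_measure = evaluate(a, b, c, d)
print(a, b, c, d)                                   # → 3 0 1 1
print(accuracy, precision, recall, round(f_measure, 2))  # → 0.8 1.0 0.75 0.86
```

Note that the F measure combines precision and recall, not precision and accuracy.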

Page 85: Course Material DMBA

CLASSIFICATION METHODS

Challenges

How to represent the entire information in the dataset using the minimum number of rules?

How to develop the smallest tree?

Solution

Select the attribute with maximum information for the first split

Split   Attribute
First   Marital Status
Second  Refund
Third   Taxable Income
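The slides do not say how "information" is measured. One common measure, assumed here, is entropy-based information gain as used by ID3-style trees (CHAID, named later, relies on chi-square tests instead). Computed on the ten training records, it ranks the attributes in the same order as the table above.

```python
from collections import Counter
from math import log2

# Training data: (refund, marital status, taxable income, cheat)
TRAIN = [
    ("Yes", "Single", ">80K", "No"),
    ("No", "Married", ">80K", "No"),
    ("No", "Single", "<80K", "No"),
    ("Yes", "Married", ">80K", "No"),
    ("No", "Divorced", ">80K", "Yes"),
    ("No", "Married", "<80K", "No"),
    ("Yes", "Divorced", ">80K", "No"),
    ("No", "Single", ">80K", "Yes"),
    ("No", "Married", ">80K", "No"),
    ("No", "Single", ">80K", "Yes"),
]
ATTRIBUTES = {"Refund": 0, "Marital Status": 1, "Taxable Income": 2}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())


def information_gain(rows, col):
    """Entropy of the class minus the weighted entropy after splitting on col."""
    base = entropy([r[-1] for r in rows])
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r[-1] for r in rows if r[col] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


gains = {name: information_gain(TRAIN, col) for name, col in ATTRIBUTES.items()}
ranking = sorted(gains, key=gains.get, reverse=True)

for name in ranking:
    print(f"{name:15s} gain = {gains[name]:.3f}")
```

Marital Status has the largest gain (about 0.28), followed by Refund (about 0.19) and Taxable Income (about 0.12).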

Page 86: Course Material DMBA

86

Indian Statistical Institute

CLASSIFICATION METHODS

CHAID Algorithm

Example: A marketing company wants to optimize its mailing campaign by sending the brochure mail only to those customers who responded to previous mail campaigns. The profiles of the customers are given below. Can you develop a rule to identify the profile of customers who are likely to respond?

SL No  District  House Type     Income  Previous_Customer  Outcome

1      Suburban  Detached       High    No                 No Response

2      Suburban  Detached       High    Yes                No Response

3      Rural     Detached       High    No                 Responded

4      Urban     Semi-detached  High    No                 Responded

5      Urban     Semi-detached  Low     No                 Responded

6      Urban     Semi-detached  Low     Yes                No Response

7      Rural     Semi-detached  Low     Yes                Responded

8      Suburban  Terrace        High    No                 No Response

9      Suburban  Semi-detached  Low     No                 Responded

10     Urban     Terrace        Low     No                 Responded

11     Suburban  Terrace        Low     Yes                Responded

12     Rural     Terrace        High    Yes                Responded

13     Rural     Detached       Low     No                 Responded

14     Urban     Terrace        High    Yes                No Response
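CHAID (Chi-squared Automatic Interaction Detection) chooses split attributes with chi-square tests of independence between each attribute and the outcome. The sketch below illustrates only that first step on the 14 profiles above: it cross-tabulates each attribute against the outcome and computes the Pearson chi-square statistic. It is a simplified illustration, not the Rapid Miner operator. Full CHAID also merges similar categories and compares adjusted p-values, which this sketch omits, and the function and variable names are assumptions.

```python
from collections import Counter

# The 14 customer profiles:
# (District, House Type, Income, Previous_Customer, Outcome)
records = [
    ("Suburban", "Detached", "High", "No", "No Response"),
    ("Suburban", "Detached", "High", "Yes", "No Response"),
    ("Rural", "Detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "High", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Semi-detached", "Low", "Yes", "No Response"),
    ("Rural", "Semi-detached", "Low", "Yes", "Responded"),
    ("Suburban", "Terrace", "High", "No", "No Response"),
    ("Suburban", "Semi-detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "Low", "No", "Responded"),
    ("Suburban", "Terrace", "Low", "Yes", "Responded"),
    ("Rural", "Terrace", "High", "Yes", "Responded"),
    ("Rural", "Detached", "Low", "No", "Responded"),
    ("Urban", "Terrace", "High", "Yes", "No Response"),
]
attributes = ["District", "House Type", "Income", "Previous_Customer"]


def chi_square(rows, col, target_col=4):
    """Pearson chi-square statistic and degrees of freedom for one attribute."""
    n = len(rows)
    value_totals = Counter(r[col] for r in rows)
    outcome_totals = Counter(r[target_col] for r in rows)
    observed = Counter((r[col], r[target_col]) for r in rows)
    stat = 0.0
    for value, value_n in value_totals.items():
        for outcome, outcome_n in outcome_totals.items():
            expected = value_n * outcome_n / n
            stat += (observed[(value, outcome)] - expected) ** 2 / expected
    dof = (len(value_totals) - 1) * (len(outcome_totals) - 1)
    return stat, dof


results = {name: chi_square(records, i) for i, name in enumerate(attributes)}
for name, (stat, dof) in results.items():
    print(f"{name}: chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

Statistics with different degrees of freedom are not directly comparable, so the attributes should be ranked by the corresponding p-values rather than by the raw statistics.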

Page 87: Course Material DMBA

CLASSIFICATION METHODS

CHAID Algorithm

SL No  Variable Name      Number of values

1      District           3

2      House Type         3

3      Income             2

4      Previous Customer  2

Number of variables = 4

Total combinations of customer profiles = 3 x 3 x 2 x 2 = 36
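The count of 36 possible customer profiles can be checked by enumerating every combination of attribute values. This is a small illustrative sketch; the value lists are taken from the customer profile table, and the dictionary name is an assumption.

```python
from itertools import product

# Distinct values of each variable in the customer profile data.
values = {
    "District": ["Suburban", "Rural", "Urban"],
    "House Type": ["Detached", "Semi-detached", "Terrace"],
    "Income": ["High", "Low"],
    "Previous Customer": ["Yes", "No"],
}

# Every possible profile is one choice of value per variable.
profiles = list(product(*values.values()))
print(len(profiles))  # 3 x 3 x 2 x 2 = 36
```

Only 14 of these 36 profiles appear in the data, so a rule set has to generalize to combinations that were never observed.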

Page 88: Course Material DMBA

CLASSIFICATION METHODS

Page 89: Course Material DMBA

CLASSIFICATION METHODS

Exercise 1: A bank wants to know the profile of customers who will buy a Personal Equity Plan (PEP) after the mailing campaign. The data is given in the file named bank-data.xls.

1. Can you develop a decision methodology?

2. How good is your model?

Page 90: Course Material DMBA

CLASSIFICATION METHODS

Exercise 1: The file contains the following fields.

Field         Description

Id            a unique identification number

Age           age of customer in years (numeric)

Sex           MALE / FEMALE

Region        inner_city / rural / suburban / town

Income        income of customer (numeric)

Married       is the customer married (YES/NO)

Children      number of children (numeric)

Car           does the customer own a car (YES/NO)

Save_acct     does the customer have a saving account (YES/NO)

Current_acct  does the customer have a current account (YES/NO)

Mortgage      does the customer have a mortgage (YES/NO)

Pep           did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)

Page 91: Course Material DMBA

CLASSIFICATION METHODS

Exercise 2: The profiles of the customers of a telecom service provider who are in the grace period are given in the file churn.xls.

1. Can you develop a model to identify potential churners (disconnections) so that the organization can win back the customers by providing different offers?

2. How good is the decision rule?

Page 92: Course Material DMBA

CLASSIFICATION METHODS

Exercise 2: The file contains the following fields.

1) Service class

2) Class change in last week

3) Class change in last 15 days

4) Class change in last month

5) Class change in last two months

6) Usage amount in last week

7) Usage amount in last 15 days

8) Usage amount in last month

9) Usage amount in last two months

10) Recharge amount in last week

11) Recharge amount in last 15 days

12) Recharge amount in last month

13) Recharge amount in last two months

14) Recharge count in last week

15) Recharge count in last 15 days

16) Recharge count in last month

17) Recharge count in last two months

18) Closing balance in last week

19) Closing balance in last 15 days

20) Closing balance in last month

21) Closing balance in last two months

Page 93: Course Material DMBA

CLUSTER ANALYSIS

Page 94: Course Material DMBA

CLUSTER ANALYSIS

Objective

To classify the records or items into a smaller number of groups based on the values of the available attributes.

When to Use

When there is no Y attribute

All attributes are considered as Xs only

Page 95: Course Material DMBA

CLUSTER ANALYSIS

Methodology to group objects based on many attributes such that objects in a group

• will be similar (or related) to one another

• will be different from (or unrelated to) the objects in other groups

Page 96: Course Material DMBA

CLUSTER ANALYSIS

Types of Clustering

• K Means Clustering

• K Medoids Clustering

Page 97: Course Material DMBA

CLUSTER ANALYSIS

K Means Clustering

Methodology to group objects based on many attributes such that objects in a cluster will be closer (or more similar) to the centroid of that cluster than to the centroid of any other cluster.

1. Each cluster is associated with a centroid

2. Each point is assigned to the cluster with the closest centroid

3. The number of clusters, K, must be specified

4. Initial centroids are often chosen randomly

5. The centroid is (typically) the mean of the points in the cluster

6. Closeness is measured by Euclidean distance

Page 98: Course Material DMBA

CLUSTER ANALYSIS

K Means Clustering: Euclidean Distance

$D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_k - y_k)^2}$

Example:

SL No  Attribute 1  Attribute 2  Attribute 3  Attribute 4  Attribute 5

1      25.8         15.7         56.0         0.57         5.3

2      23.6         18.1         48.9         0.62         7.2

            Attribute 1  Attribute 2  Attribute 3  Attribute 4  Attribute 5

Difference  2.2          -2.4         7.1          -0.05        -1.9

Square      4.84         5.76         50.41        0.0025       3.61

Sum of squares = 64.6225

Euclidean distance = square root of 64.6225 = 8.038812101
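The same calculation can be reproduced with a short Python function. This is a minimal sketch using only the standard library; the function and variable names are assumptions, and the two records are the ones from the example.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared attribute-wise differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Attribute 1..5 of the two records in the example.
record_1 = [25.8, 15.7, 56.0, 0.57, 5.3]
record_2 = [23.6, 18.1, 48.9, 0.62, 7.2]

distance = euclidean_distance(record_1, record_2)
print(distance)
```

Attribute 3 contributes 50.41 of the 64.6225 total, so attributes on larger scales dominate the distance unless the data is normalized first.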

Page 99: Course Material DMBA

CLUSTER ANALYSIS

K Means Clustering: Algorithm

1. Get the number of clusters (k) required from the user

2. Randomly select k centroids

3. Calculate the Euclidean distance of each data record to each and every centroid

4. For each record, identify the cluster with minimum Euclidean distance

5. Allocate the record to the cluster with minimum distance

6. Recalculate the centroids

7. Repeat steps 3 to 6 until there is no change in the cluster elements
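The seven steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the Rapid Miner implementation: the initial centroids are sampled from the records themselves, an empty cluster keeps its previous centroid, and the toy data and all names are hypothetical.

```python
import math
import random


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def k_means(records, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: pick k of the records at random as the initial centroids.
    centroids = rng.sample(records, k)
    assignment = None
    for _ in range(max_iter):
        # Steps 3 to 5: allocate each record to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(rec, centroids[j]))
            for rec in records
        ]
        # Step 7: stop when no record changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 6: recalculate each centroid as the mean of its members.
        for j in range(k):
            members = [rec for rec, c in zip(records, assignment) if c == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Hypothetical toy data, for illustration only.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
assignment, centroids = k_means(points, k=2)
print(assignment)
print(centroids)
```

Because the starting centroids are random, different seeds can give different final clusters, so it is common to run the procedure several times and compare the results.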

Page 100: Course Material DMBA

CLUSTER ANALYSIS

K Means Clustering: Example

Cluster the following data with 3 attributes (Spend in 3 quarters) into 2 clusters

SL No. Quarter 1 Quarter 2 Quarter 3

1 1.425172 31.08748 108.5436

2 3.017551 34.17728 103.4577

3 3.803405 34.78973 101.7977

4 4.299151 31.02313 107.3701

5 5.352034 22.80945 109.9353

6 6.038361 22.21948 100.1809

7 6.128493 25.04893 111.0543

8 8.381028 23.6761 106.3302

9 8.989409 27.62143 106.7186

10 9.788646 27.35268 105.7799

Step 1: k = 2

Step 2: Randomly identify 2 centroids

Centroid Quarter 1 Quarter 2 Quarter 3

1 1.5 35 100

2 9.8 22 111

Page 101: Course Material DMBA

CLUSTER ANALYSIS

K Means Clustering: Example

Step 3: Calculate the Euclidean distance of each point from centroid 1

        Difference from Centroid 1
SL No.  Quarter 1  Quarter 2  Quarter 3  Sum of Squares  Euclidean Distance

1       -0.07483   -3.91252   8.543587   88.30632        9.397144437

2       1.517551   -0.82272   3.4577     14.93552        3.864650273

3       2.303405   -0.21027   1.797705   8.58163         2.929441858

4       2.799151   -3.97687   7.370058   77.96853        8.829979305

5       3.852034   -12.1906   9.935263   262.1572        16.19127037

6       4.538361   -12.7805   0.180881   183.9713        13.56360037

7       4.628493   -9.95107   11.05433   242.6451        15.57706944

8       6.881028   -11.3239   6.330205   215.6508        14.68505506

9       7.489409   -7.37857   6.71863    155.6745        12.47695732

10      8.288646   -7.64732   5.779929   160.5907        12.67243867

Page 102: Course Material DMBA

CLUSTER ANALYSIS

K Means Clustering: Example

Step 4: Calculate the Euclidean distance of each point from centroid 2

        Difference from Centroid 2
SL No.  Quarter 1  Quarter 2  Quarter 3  Sum of Squares  Euclidean Distance

1       -8.37483   9.087476   -2.45641   158.7539        12.59975882

2       -6.78245   12.17728   -7.5423    251.174         15.84846897

3       -5.99659   12.78973   -9.2023    284.2187        16.85878574

4       -5.50085   9.023126   -3.62994   124.8526        11.17374688

5       -4.44797   0.809445   -1.06474   21.57327        4.644703413

6       -3.76164   0.219475   -10.8191   131.2514        11.45650187

7       -3.67151   3.048927   0.054333   22.77887        4.772721122

8       -1.41897   1.676096   -4.6698    26.62977        5.160403756

9       -0.81059   5.621434   -4.28137   50.58771        7.112503764

10      -0.01135   5.352683   -5.22007   55.90048        7.476662339

Page 103: Course Material DMBA

CLUSTER ANALYSIS

K Means Clustering: Example

Step 5: Allocate records to clusters with minimum distance

SL No.  Quarter 1  Quarter 2  Quarter 3  Distance to Centroid 1  Distance to Centroid 2  Allocation

1       1.425172   31.08748   108.5436   9.397144                12.59975882             1

2       3.017551   34.17728   103.4577   3.86465                 15.84846897             1

3       3.803405   34.78973   101.7977   2.929442                16.85878574             1

4       4.299151   31.02313   107.3701   8.829979                11.17374688             1

5       5.352034   22.80945   109.9353   16.19127                4.644703413             2

6       6.038361   22.21948   100.1809   13.5636                 11.45650187             2

7       6.128493   25.04893   111.0543   15.57707                4.772721122             2

8       8.381028   23.6761    106.3302   14.68506                5.160403756             2

9       8.989409   27.62143   106.7186   12.47696                7.112503764             2

10      9.788646   27.35268   105.7799   12.67244                7.476662339             2

Step 6: Recalculate the centroids and repeat the steps

          Mean
Centroid  Quarter 1  Quarter 2  Quarter 3

1         3.13632    32.7694    105.2923

2         7.446328   24.78801   106.6665
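The first pass of the worked example can be reproduced in Python as a check on the hand calculation. This sketch covers only one iteration (distances, allocation, recalculated centroids) from the starting centroids given in Step 2. The variable names are assumptions, and the data is the ten-record quarterly spend table.

```python
import math

# Quarterly spend (Quarter 1, Quarter 2, Quarter 3) for SL No. 1..10.
spend = [
    (1.425172, 31.08748, 108.5436),
    (3.017551, 34.17728, 103.4577),
    (3.803405, 34.78973, 101.7977),
    (4.299151, 31.02313, 107.3701),
    (5.352034, 22.80945, 109.9353),
    (6.038361, 22.21948, 100.1809),
    (6.128493, 25.04893, 111.0543),
    (8.381028, 23.6761, 106.3302),
    (8.989409, 27.62143, 106.7186),
    (9.788646, 27.35268, 105.7799),
]
# Starting centroids from Step 2.
centroids = [(1.5, 35.0, 100.0), (9.8, 22.0, 111.0)]


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Distance of every record to both centroids, then allocation to the nearer one.
distances = [[euclidean(rec, c) for c in centroids] for rec in spend]
allocation = [row.index(min(row)) + 1 for row in distances]

# Recalculate each centroid as the mean of the records allocated to it.
new_centroids = []
for cluster in (1, 2):
    members = [rec for rec, a in zip(spend, allocation) if a == cluster]
    new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))

print(allocation)
for centroid in new_centroids:
    print([round(v, 4) for v in centroid])
```

To continue the example, feed `new_centroids` back in as `centroids` and repeat until the allocation stops changing.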

Page 104: Course Material DMBA

CLUSTER ANALYSIS

Exercise 1: The data on the % Erlang utilization of mobile towers of a telecom service provider is given in Erlang_Utilization.xls. Group the towers into 5 clusters based on the utilization.

Page 105: Course Material DMBA

CLUSTER ANALYSIS

K Medoids Clustering

Methodology to group objects based on many attributes such that objects in a cluster will be closer (or more similar) to the most centrally located object of the cluster.

1. The number of clusters, K, must be specified

2. Closeness is measured by Euclidean distance

Exercise: Perform exercises 1 to 3 using the K medoids clustering method
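The "most centrally located object" of a cluster is usually called its medoid. The sketch below assumes the common definition, namely the member record whose total distance to all other members is smallest, and compares it with the mean of the same records. The cluster data is hypothetical and the names are assumptions.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def medoid(points):
    """Member point with the smallest total distance to all other members."""
    return min(points, key=lambda p: sum(euclidean(p, q) for q in points))


# Hypothetical cluster containing one outlying record.
cluster = [(1.0, 1.0), (2.0, 1.0), (1.5, 2.0), (10.0, 10.0)]

center = medoid(cluster)
mean = tuple(sum(col) / len(cluster) for col in zip(*cluster))
print(center)  # an actual record from the cluster
print(mean)    # pulled towards the outlying record
```

Unlike the mean used in K means, the medoid is always one of the observed records, which makes it less sensitive to outliers.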