Indian Statistical Institute
Training Program
on
Data Mining & Business Analytics
using
Rapid Miner
Boby J
2
Contents
Indian Statistical Institute
1. Introduction to Rapid Miner
2. Missing Value Analysis
3. Data Visualization
4. Market Basket Analysis
5. Correlation & Regression
6. Data partitioning & Classification
7. Cluster Analysis
DATA PREPROCESSING
1. Missing Value Handling
Example: Suppose a telecom company wants to introduce a scoring mechanism to rate
its circles based on the following parameters
1. Current Month’s Usage
2. Last 3 Month’s Usage
3. Average Recharge
4. Projected Growth
The data set is given in the next slide. Some values are missing. How should we
proceed?
6
Missing Value Handling
Example: Circle wise Data
Indian Statistical Institute
SL No. | Current Month's Usage | Last 3 Month's Usage | Average Recharge | Projected Growth | Circle
1 | 5.1 | 3.5 | 99.4 | 99.2 | A
2 | 4.9 | 3 | 98.6 | 99.2 | A
3 |  | 3.2 |  | 99.2 | A
4 | 4.6 | 3.1 | 98.5 | 99.2 | A
5 | 5 |  | 98.4 | 99.2 | A
6 | 5.4 | 3.9 | 98.3 | 99.4 | A
7 | 7 | 3.2 | 95.3 | 98.4 | B
8 | 6.4 | 3.2 | 95.5 | 98.5 | B
9 | 6.9 | 3.1 | 95.1 | 98.5 | B
10 |  | 2.3 | 96 | 98.3 | B
11 | 6.5 | 2.8 | 95.4 | 98.5 | B
12 | 5.7 |  | 95.5 | 98.3 | B
13 | 6.3 | 3.3 |  | 98.6 | B
14 | 6.7 | 3.3 | 94.3 | 97.5 | C
15 | 6.7 | 3 | 94.8 | 97.3 | C
16 | 6.3 | 2.5 | 95 | 98.9 | C
17 |  | 3 | 94.8 | 98 | C
18 | 6.2 | 3.4 | 94.6 | 97.3 | C
19 | 5.9 | 3 | 94.9 | 98.8 | C
(Empty cells are missing values.)
7
Missing Value Handling
Step 1: Calculate the % of missing values in each attribute
Indian Statistical Institute
Attribute | Missing Values | Total Records | % Missing
Current Month's Usage | 3 | 19 | 15.79
Last 3 Month's Usage | 2 | 19 | 10.53
Average Recharge | 2 | 19 | 10.53
Projected Growth | 0 | 19 | 0.00
Circle | 0 | 19 | 0.00
If % Missing is > 20% for an attribute, the data on that attribute is not sufficient to develop the model.
Ignore that attribute and proceed with the rest.
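Step 1 can be sketched in a few lines of Python. This is only an illustration, not part of the Rapid Miner workflow: the names ROWS, NAMES and percent_missing are hypothetical, None stands for a missing value, and the projected growth of records 4 and 7 is read as 99.2 and 98.4 (an assumption about two misprints in the table).

```python
# Rows: (current usage, last-3-month usage, average recharge, projected growth, circle).
# None marks a missing value. Records 4 and 7 assume 99.2 and 98.4 for projected growth.
ROWS = [
    (5.1, 3.5, 99.4, 99.2, "A"), (4.9, 3.0, 98.6, 99.2, "A"),
    (None, 3.2, None, 99.2, "A"), (4.6, 3.1, 98.5, 99.2, "A"),
    (5.0, None, 98.4, 99.2, "A"), (5.4, 3.9, 98.3, 99.4, "A"),
    (7.0, 3.2, 95.3, 98.4, "B"), (6.4, 3.2, 95.5, 98.5, "B"),
    (6.9, 3.1, 95.1, 98.5, "B"), (None, 2.3, 96.0, 98.3, "B"),
    (6.5, 2.8, 95.4, 98.5, "B"), (5.7, None, 95.5, 98.3, "B"),
    (6.3, 3.3, None, 98.6, "B"), (6.7, 3.3, 94.3, 97.5, "C"),
    (6.7, 3.0, 94.8, 97.3, "C"), (6.3, 2.5, 95.0, 98.9, "C"),
    (None, 3.0, 94.8, 98.0, "C"), (6.2, 3.4, 94.6, 97.3, "C"),
    (5.9, 3.0, 94.9, 98.8, "C"),
]
NAMES = ["Current Month's Usage", "Last 3 Month's Usage",
         "Average Recharge", "Projected Growth", "Circle"]


def percent_missing(rows, names):
    """Return {attribute name: percentage of records with a missing value}."""
    total = len(rows)
    result = {}
    for col, name in enumerate(names):
        missing = sum(1 for row in rows if row[col] is None)
        result[name] = round(100.0 * missing / total, 2)
    return result


pct = percent_missing(ROWS, NAMES)
# Attributes that pass the 20% rule and can be kept for modelling.
usable = [name for name, p in pct.items() if p <= 20.0]
print(pct)
```

Here every attribute stays below the 20% limit, so none is dropped.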
8
Missing Value Handling
Step 3: Prepare Pivot table of attributes
Indian Statistical Institute
Current Month's Usage | A | B | C | Grand Total
Missing | 1 | 1 | 1 | 3
Non Missing | 5 | 6 | 5 | 16
Grand Total | 6 | 7 | 6 | 19
Last 3 Month's Usage | A | B | C | Grand Total
Missing | 1 | 1 | 0 | 2
Non Missing | 5 | 6 | 6 | 17
Grand Total | 6 | 7 | 6 | 19
Average Recharge | A | B | C | Grand Total
Missing | 1 | 1 | 0 | 2
Non Missing | 5 | 6 | 6 | 17
Grand Total | 6 | 7 | 6 | 19
Projected Growth | A | B | C | Grand Total
Missing | 0 | 0 | 0 | 0
Non Missing | 6 | 7 | 6 | 19
Grand Total | 6 | 7 | 6 | 19
Conclusion: In no case are 100% of the values missing.
9
Missing Value Handling
Step 3: Prepare Pivot table of attributes
Indian Statistical Institute
Current Month's Usage (%) | A | B | C
Missing | 16.67 | 14.29 | 16.67
Non Missing | 83.33 | 85.71 | 83.33
Grand Total | 100 | 100 | 100
Last 3 Month's Usage (%) | A | B | C
Missing | 16.67 | 14.29 | 0.00
Non Missing | 83.33 | 85.71 | 100.00
Grand Total | 100 | 100 | 100
Average Recharge (%) | A | B | C
Missing | 16.67 | 14.29 | 0.00
Non Missing | 83.33 | 85.71 | 100.00
Grand Total | 100 | 100 | 100
Projected Growth (%) | A | B | C
Missing | 0 | 0 | 0
Non Missing | 100 | 100 | 100
Grand Total | 100 | 100 | 100
Conclusion: In no case are 100% of the values missing.
10
Missing Value Handling
Example: 3 Choices
Choice 1: Ignore missing value records
Indian Statistical Institute
SL No. | Current Month's Usage | Last 3 Month's Usage | Average Recharge | Projected Growth | Circle
1 | 5.1 | 3.5 | 99.4 | 99.2 | A
2 | 4.9 | 3 | 98.6 | 99.2 | A
4 | 4.6 | 3.1 | 98.5 | 99.2 | A
6 | 5.4 | 3.9 | 98.3 | 99.4 | A
7 | 7 | 3.2 | 95.3 | 98.4 | B
8 | 6.4 | 3.2 | 95.5 | 98.5 | B
9 | 6.9 | 3.1 | 95.1 | 98.5 | B
11 | 6.5 | 2.8 | 95.4 | 98.5 | B
14 | 6.7 | 3.3 | 94.3 | 97.5 | C
15 | 6.7 | 3 | 94.8 | 97.3 | C
16 | 6.3 | 2.5 | 95 | 98.9 | C
18 | 6.2 | 3.4 | 94.6 | 97.3 | C
19 | 5.9 | 3 | 94.9 | 98.8 | C
11
Missing Value Handling
Example: Circle wise Data
Choice 2. Replace the missing values with attribute mean, minimum,
maximum or mode
Indian Statistical Institute
SL No. | Current Month's Usage | Last 3 Month's Usage | Average Recharge | Projected Growth | Circle
1 | 5.1 | 3.5 | 99.4 | 99.2 | A
2 | 4.9 | 3 | 98.6 | 99.2 | A
3 |  | 3.2 |  | 99.2 | A
4 | 4.6 | 3.1 | 98.5 | 99.2 | A
5 | 5 |  | 98.4 | 99.2 | A
6 | 5.4 | 3.9 | 98.3 | 99.4 | A
7 | 7 | 3.2 | 95.3 | 98.4 | B
8 | 6.4 | 3.2 | 95.5 | 98.5 | B
9 | 6.9 | 3.1 | 95.1 | 98.5 | B
10 |  | 2.3 | 96 | 98.3 | B
11 | 6.5 | 2.8 | 95.4 | 98.5 | B
12 | 5.7 |  | 95.5 | 98.3 | B
13 | 6.3 | 3.3 |  | 98.6 | B
14 | 6.7 | 3.3 | 94.3 | 97.5 | C
15 | 6.7 | 3 | 94.8 | 97.3 | C
16 | 6.3 | 2.5 | 95 | 98.9 | C
17 |  | 3 | 94.8 | 98 | C
18 | 6.2 | 3.4 | 94.6 | 97.3 | C
19 | 5.9 | 3 | 94.9 | 98.8 | C
Mean | 6.0 | 3.1 | 96.1 | 98.5 |
Min | 4.6 | 2.3 | 94.3 | 97.3 |
Max | 7.0 | 3.9 | 99.4 | 99.4 |
12
Missing Value Handling
Example: Circle wise Data
Choice 2. Replace the missing values with attribute mean, minimum,
maximum or mode
Indian Statistical Institute
SL No. | Current Month's Usage | Last 3 Month's Usage | Average Recharge | Projected Growth | Circle
1 | 5.1 | 3.5 | 99.4 | 99.2 | A
2 | 4.9 | 3 | 98.6 | 99.2 | A
3 | 6 | 3.2 | 96.1 | 99.2 | A
4 | 4.6 | 3.1 | 98.5 | 99.2 | A
5 | 5 | 3.1 | 98.4 | 99.2 | A
6 | 5.4 | 3.9 | 98.3 | 99.4 | A
7 | 7 | 3.2 | 95.3 | 98.4 | B
8 | 6.4 | 3.2 | 95.5 | 98.5 | B
9 | 6.9 | 3.1 | 95.1 | 98.5 | B
10 | 6 | 2.3 | 96 | 98.3 | B
11 | 6.5 | 2.8 | 95.4 | 98.5 | B
12 | 5.7 | 3.1 | 95.5 | 98.3 | B
13 | 6.3 | 3.3 | 96.1 | 98.6 | B
14 | 6.7 | 3.3 | 94.3 | 97.5 | C
15 | 6.7 | 3 | 94.8 | 97.3 | C
16 | 6.3 | 2.5 | 95 | 98.9 | C
17 | 6 | 3 | 94.8 | 98 | C
18 | 6.2 | 3.4 | 94.6 | 97.3 | C
19 | 5.9 | 3 | 94.9 | 98.8 | C
Mean | 6.0 | 3.1 | 96.1 | 98.5 |
Min | 4.6 | 2.3 | 94.3 | 97.3 |
Max | 7.0 | 3.9 | 99.4 | 99.4 |
13
Missing Value Handling
Example: Circle wise Data
Choice 3: Replace the missing values with the attribute mean of the corresponding
circle
Indian Statistical Institute
SL No. | Current Month's Usage | Last 3 Month's Usage | Average Recharge | Projected Growth | Circle
1 | 5.1 | 3.5 | 99.4 | 99.2 | A
2 | 4.9 | 3 | 98.6 | 99.2 | A
3 |  | 3.2 |  | 99.2 | A
4 | 4.6 | 3.1 | 98.5 | 99.2 | A
5 | 5 |  | 98.4 | 99.2 | A
6 | 5.4 | 3.9 | 98.3 | 99.4 | A
7 | 7 | 3.2 | 95.3 | 98.4 | B
8 | 6.4 | 3.2 | 95.5 | 98.5 | B
9 | 6.9 | 3.1 | 95.1 | 98.5 | B
10 |  | 2.3 | 96 | 98.3 | B
11 | 6.5 | 2.8 | 95.4 | 98.5 | B
12 | 5.7 |  | 95.5 | 98.3 | B
13 | 6.3 | 3.3 |  | 98.6 | B
14 | 6.7 | 3.3 | 94.3 | 97.5 | C
15 | 6.7 | 3 | 94.8 | 97.3 | C
16 | 6.3 | 2.5 | 95 | 98.9 | C
17 |  | 3 | 94.8 | 98 | C
18 | 6.2 | 3.4 | 94.6 | 97.3 | C
19 | 5.9 | 3 | 94.9 | 98.8 | C
Mean | 5.00 | 3.34 | 98.64 | 99.24 | A
Mean | 6.47 | 2.98 | 95.47 | 98.45 | B
Mean | 6.36 | 3.03 | 94.73 | 97.97 | C
14
Missing Value Handling
Example: Circle wise Data
Choice 3: Replace the missing values with the attribute mean of the corresponding
circle
Indian Statistical Institute
SL No. | Current Month's Usage | Last 3 Month's Usage | Average Recharge | Projected Growth | Circle
1 | 5.1 | 3.5 | 99.4 | 99.2 | A
2 | 4.9 | 3 | 98.6 | 99.2 | A
3 | 5 | 3.2 | 98.64 | 99.2 | A
4 | 4.6 | 3.1 | 98.5 | 99.2 | A
5 | 5 | 3.34 | 98.4 | 99.2 | A
6 | 5.4 | 3.9 | 98.3 | 99.4 | A
7 | 7 | 3.2 | 95.3 | 98.4 | B
8 | 6.4 | 3.2 | 95.5 | 98.5 | B
9 | 6.9 | 3.1 | 95.1 | 98.5 | B
10 | 6.47 | 2.3 | 96 | 98.3 | B
11 | 6.5 | 2.8 | 95.4 | 98.5 | B
12 | 5.7 | 2.98 | 95.5 | 98.3 | B
13 | 6.3 | 3.3 | 95.47 | 98.6 | B
14 | 6.7 | 3.3 | 94.3 | 97.5 | C
15 | 6.7 | 3 | 94.8 | 97.3 | C
16 | 6.3 | 2.5 | 95 | 98.9 | C
17 | 6.36 | 3 | 94.8 | 98 | C
18 | 6.2 | 3.4 | 94.6 | 97.3 | C
19 | 5.9 | 3 | 94.9 | 98.8 | C
Mean | 5.00 | 3.34 | 98.64 | 99.24 | A
Mean | 6.47 | 2.98 | 95.47 | 98.45 | B
Mean | 6.36 | 3.03 | 94.73 | 97.97 | C
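Choice 3 can be sketched as follows for a single attribute (Current Month's Usage). The function name impute_by_group_mean and the list USAGE are hypothetical names used only for this illustration; None stands for a missing value.

```python
from collections import defaultdict

# (current month's usage, circle) for the 19 records; None marks a missing value.
USAGE = [
    (5.1, "A"), (4.9, "A"), (None, "A"), (4.6, "A"), (5.0, "A"), (5.4, "A"),
    (7.0, "B"), (6.4, "B"), (6.9, "B"), (None, "B"), (6.5, "B"), (5.7, "B"),
    (6.3, "B"), (6.7, "C"), (6.7, "C"), (6.3, "C"), (None, "C"), (6.2, "C"),
    (5.9, "C"),
]


def impute_by_group_mean(pairs):
    """Replace each missing value with the mean of the observed values in its group."""
    observed = defaultdict(list)
    for value, group in pairs:
        if value is not None:
            observed[group].append(value)
    means = {group: sum(vals) / len(vals) for group, vals in observed.items()}
    return [(round(means[g], 2) if v is None else v, g) for v, g in pairs]


filled = impute_by_group_mean(USAGE)
print(filled[2], filled[9], filled[16])
```

Records 3, 10 and 17 receive the circle means 5.0, 6.47 and 6.36, matching the table. Replacing the per-group means with one overall mean would give Choice 2 instead.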
15
Exercise: The data on 3 modes of transport of a supply chain management company
are given below. Handle the missing values.
Indian Statistical Institute
SL No Delivery Speed Vehicles Extra Handling Cost Mode of Transport
1 27.75 3 2 Water
2 3 445 Water
3 28.2 3 1 460 Water
4 8.75 1 0 980 Direct Truck
5 9.25 0 950 Direct Truck
6 9.15 1 1 Direct Truck
7 15.2 3 2 820 LTL Truck
8 16.2 2 2.5 810 LTL Truck
9 3 1.5 835 LTL Truck
LTL Truck : Less than truck load
DATA PREPROCESSING: Missing Value Handling
MARKET BASKET ANALYSIS
A modeling technique based on the logic that if a customer buys a certain group of
items, he or she is more (or less) likely to buy another group of items
Example:
Those who buy cigarettes are more likely to buy a matchbox as well.
18
Indian Statistical Institute
MARKET BASKET ANALYSIS
Association Rule Mining:
Developing rules that predict the occurrence of an item based on the
occurrence of other items in the transaction
Example
Id | Items
1 | Milk, Bread
2 | Bread, Biscuits, Toys, Eggs
3 | Milk, Biscuits, Toys, Fruits
4 | Bread, Milk, Toys, Biscuits
5 | Milk, Bread, Biscuits, Fruits
{Milk, Bread} → {Biscuits} with probability = 2 / 3
19
Indian Statistical Institute
MARKET BASKET ANALYSIS
Itemset:
A collection of one or more items
k – itemset
An itemset consisting of k items
Id | Items
1 | Milk, Bread
2 | Bread, Biscuits, Toys, Eggs
3 | Milk, Biscuits, Toys, Fruits
4 | Bread, Milk, Toys, Biscuits
5 | Milk, Bread, Biscuits, Fruits
20
Indian Statistical Institute
MARKET BASKET ANALYSIS
Support count:
Frequency of occurrence of an itemset
Example
Support count of {Milk, Bread, Biscuits} = 2
Id | Items
1 | Milk, Bread
2 | Bread, Biscuits, Toys, Eggs
3 | Milk, Biscuits, Toys, Fruits
4 | Bread, Milk, Toys, Biscuits
5 | Milk, Bread, Biscuits, Fruits
21
Indian Statistical Institute
MARKET BASKET ANALYSIS
Support:
Proportion or fraction of transactions that contain an itemset
Example
Support of {Milk, Bread, Biscuits} = 2 / 5
Id | Items
1 | Milk, Bread
2 | Bread, Biscuits, Toys, Eggs
3 | Milk, Biscuits, Toys, Fruits
4 | Bread, Milk, Toys, Biscuits
5 | Milk, Bread, Biscuits, Fruits
Frequent Itemset
An itemset whose support is greater than or equal to the minimum support
22
Indian Statistical Institute
MARKET BASKET ANALYSIS
Id | Items
1 | Milk, Bread
2 | Bread, Biscuits, Toys, Eggs
3 | Milk, Biscuits, Toys, Fruits
4 | Bread, Milk, Toys, Biscuits
5 | Milk, Bread, Biscuits, Fruits
Confidence
Conditional probability that an item will appear in transactions that contain certain
other items
Example
Confidence that Toys will appear in transactions containing Milk & Biscuits
= Support count {Milk, Biscuits, Toys} / Support count {Milk, Biscuits} = 2 / 3 = 0.67
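The three measures can be checked with a short Python sketch on the five transactions above. The function names support_count, support and confidence are hypothetical helper names chosen for this illustration.

```python
# The five example transactions, Id 1 to 5.
TRANSACTIONS = [
    {"Milk", "Bread"},
    {"Bread", "Biscuits", "Toys", "Eggs"},
    {"Milk", "Biscuits", "Toys", "Fruits"},
    {"Bread", "Milk", "Toys", "Biscuits"},
    {"Milk", "Bread", "Biscuits", "Fruits"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Conditional probability of the consequent given the antecedent."""
    both = support_count(set(antecedent) | set(consequent), transactions)
    return both / support_count(antecedent, transactions)


print(support_count({"Milk", "Bread", "Biscuits"}, TRANSACTIONS))   # 2
print(support({"Milk", "Bread", "Biscuits"}, TRANSACTIONS))         # 0.4
print(confidence({"Milk", "Biscuits"}, {"Toys"}, TRANSACTIONS))     # 2/3
```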
23
Indian Statistical Institute
MARKET BASKET ANALYSIS
Association Rule Mining
1. Frequent Itemset Generation
Fix minimum support value
Generate all itemsets whose support ≥ minimum support
2. Rule Generation
Fix minimum confidence value
Generate high confidence rules from each frequent itemset
24
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
a. Fix minimum support count
b. Generate all itemsets of length = 1
c. Calculate the support for each itemset
d. Eliminate all itemsets with support count < minimum support count
e. Repeat steps c & d for itemsets of length = 2, 3, ...
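Steps a to e can be sketched in Python as below. This is a minimal illustration, not the Rapid Miner operator: the function name apriori is hypothetical, and the sketch adds the usual Apriori pruning rule (a candidate is counted only if all of its smaller subsets are frequent), which the steps above do not spell out.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]   # step b: itemsets of length 1
    frequent = {}
    k = 1
    while candidates:
        # Step c: support count of each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Step d: drop candidates below the minimum support count.
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Step e: build candidates of the next length from the survivors.
        k += 1
        joined = {a | b for a in survivors for b in survivors if len(a | b) == k}
        candidates = [
            c for c in joined
            if all(frozenset(s) in survivors for s in combinations(c, k - 1))
        ]
    return frequent


# Six example transactions over the items A to E.
TRANSACTIONS = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("AE"), set("ACE")]
result = apriori(TRANSACTIONS, min_count=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

With a minimum support count of 2, the sketch keeps four single items, five pairs and the two triples {A, C, E} and {B, C, E}.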
25
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Id | Items
1 | A, C, D
2 | B, C, E
3 | A, B, C, E
4 | B, E
5 | A, E
6 | A, C, E
26
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 1:
Generate itemsets of length = 1 & calculate support
Item | Support count
A | 4
B | 3
C | 4
D | 1
E | 5
27
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 2:
Eliminate itemsets with support count < minimum support count (2)
Item | Support count
A | 4
B | 3
C | 4
D | 1
E | 5
28
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 2:
Eliminate itemsets with support count < minimum support count (2)
Item | Support count
A | 4
B | 3
C | 4
E | 5
29
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 3:
Generate itemsets of length = 2
Item | Support count
A, B | 1
A, C | 3
A, E | 3
C, E | 3
B, E | 3
B, C | 2
30
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 4:
Eliminate itemsets with support count < minimum support count (2)
Item | Support count
A, B | 1
A, C | 3
A, E | 3
C, E | 3
B, E | 3
B, C | 2
31
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 4:
Eliminate itemsets with support count < minimum support count (2)
Item | Support count
A, C | 3
A, E | 3
C, E | 3
B, E | 3
B, C | 2
32
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 5:
Generate itemsets of length = 3
Item | Support count
A, C, E | 2
B, C, E | 2
33
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Step 6:
Generate itemsets of length = 4
Itemset | Support count
A, B, C, E | 1
34
Indian Statistical Institute
MARKET BASKET ANALYSIS
Frequent Itemset Generation: Apriori Algorithm
Example:
Minimum Support count = 2
Result:
Item | Support count | Support
C, E | 3 | 0.50
B, E | 3 | 0.50
B, C | 2 | 0.33
A, E | 3 | 0.50
A, C | 3 | 0.50
A, C, E | 2 | 0.33
B, C, E | 2 | 0.33
35
Indian Statistical Institute
MARKET BASKET ANALYSIS
Association Rule Mining: Apriori Algorithm
Example:
Minimum Support = 0.50
Minimum Confidence = 0.5
Item | Support count | Support
C, E | 3 | 0.50
B, E | 3 | 0.50
B, C | 2 | 0.33
A, E | 3 | 0.50
A, C | 3 | 0.50
A, C, E | 2 | 0.33
B, C, E | 2 | 0.33
36
Indian Statistical Institute
MARKET BASKET ANALYSIS
Association Rule Mining: Apriori Algorithm
Example:
Minimum Support = 0.50
Minimum Confidence = 0.5
Rule | Support | Confidence
A → C | 0.50 | 0.75
A → E | 0.50 | 0.75
B → E | 0.50 | 1.00
E → A | 0.50 | 0.60
C → A | 0.50 | 0.75
C → E | 0.50 | 0.75
E → C | 0.50 | 0.60
E → B | 0.50 | 0.60
37
Indian Statistical Institute
MARKET BASKET ANALYSIS
Association Rule Mining: Other Measures
Lift
Lift (A → C) = Confidence (A → C) / Support (C)
Example
Rule | Support of consequent | Confidence | Lift
A → C | C = 0.67 | 0.75 | 1.12
A → E | E = 0.83 | 0.75 | 0.90
Criteria: Lift ≥ 1
Lift (A → C) = 1.12 > Lift (A → E) = 0.90 indicates that A has a greater impact on the
frequency of C than it has on the frequency of E
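The lift values can be recomputed from the six example transactions with the sketch below. The helper names support and lift are hypothetical and used only for this illustration.

```python
# Six example transactions over the items A to E.
TRANSACTIONS = [set("ACD"), set("BCE"), set("ABCE"), set("BE"), set("AE"), set("ACE")]


def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / support(antecedent, transactions)
    return confidence / support(consequent, transactions)


lift_ac = lift("A", "C", TRANSACTIONS)   # 0.75 / 0.667
lift_ae = lift("A", "E", TRANSACTIONS)   # 0.75 / 0.833
print(round(lift_ac, 3), round(lift_ae, 3))
```

Without intermediate rounding the two values are 1.125 and 0.9, so only A → C meets the criterion Lift ≥ 1.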
38
Indian Statistical Institute
MARKET BASKET ANALYSIS
Exercise 1: The data on transactions from a mobile outlet are given below.
1. Generate frequent itemsets with a support of at least 25%.
2. Generate association rules between items with a confidence of at least 50%.
3. Estimate the chance that Mobile Slim, Landline and Broadband will
be subscribed together?
4. Estimate the chance that the customers who buy Landline will also
purchase Broadband & Ring tones?
39
Indian Statistical Institute
MARKET BASKET ANALYSIS
Exercise 2:
The market basket Software data set contains the details of transaction at a
software product company.
1. Identify the frequent product types with a minimum support of 25%.
2. Also identify the associations between products with a minimum confidence of 50%.
3. What is the chance that Operating System and Office Suite will be
purchased together?
4. What is the chance that Operating System and Visual Studio will be
purchased together?
5. Estimate the chance that the customers who buy Operating System will also
purchase Office Suite ?
6. Estimate the chance that the customers who buy Operating System will also
purchase Visual Studio?
40
Indian Statistical Institute
LINEAR REGRESSION
41
CORRELATION & REGRESSION
Correlation:
Correlation analysis is a technique to identify the relationship between two
variables.
It describes the type and degree of the relationship between the two variables.
Indian Statistical Institute
42
CORRELATION & REGRESSION
Correlation: Usage
Explore the relationship between the output characteristic and input or process
variable.
Output variable : Y : Dependent variable
Input / Process variable : X : Independent variable
Indian Statistical Institute
43
Positive Correlation: Y increases as X increases & vice versa
[Scatter plot: Y against X]
CORRELATION & REGRESSION
Indian Statistical Institute
44
Negative Correlation: Y decreases as X increases & vice versa
[Scatter plot: Y against X]
CORRELATION & REGRESSION
Indian Statistical Institute
45
No Correlation: Random Distribution of points
[Scatter plot: Y against X]
Indian Statistical Institute
CORRELATION & REGRESSION
46
Is there any correlation ?
[Scatter plot: Y against X]
CORRELATION & REGRESSION
Indian Statistical Institute
47
Measure of Correlation: Coefficient of Correlation
Symbol : r
Range : -1 to 1
Sign : Type of correlation
Value : Degree of correlation
Examples:
r = 0.6 , 60 % positive correlation
r = -0.82, 82% negative correlation
r = 0, No correlation
CORRELATION & REGRESSION
Indian Statistical Institute
48
Coefficient of Correlation: Positive Correlation
Collect data on x and y: When x is low, y is also low & vice versa
x y
2 5
3 7
1 3
5 11
6 12
7 15
CORRELATION & REGRESSION
Indian Statistical Institute
49
Calculate Mean of x & y values
SL No. x y
1 2 5
2 3 7
3 1 3
4 5 11
5 6 12
6 7 15
Mean 4 8.83
Coefficient of Correlation: Positive Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
50
Take x – Mean x and y – Mean y
SL No. x – Mean x y – Mean y
1 -2 -3.83
2 -1 -1.83
3 -3 -5.83
4 1 2.17
5 2 3.17
6 3 6.17
Coefficient of Correlation: Positive Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
Conclusion:
Low values will become
negative & high values will
become positive
51
Generally when x values are negative, y values are also negative & vice versa
SL No. x – Mean x y – Mean y
1 -2 -3.83
2 -1 -1.83
3 -3 -5.83
4 1 2.17
5 2 3.17
6 3 6.17
Coefficient of Correlation: Positive Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
52
Then
Product of x & y values will be positive
SL No. x – Mean x y – Mean y Product
1 -2 -3.83 7.66
2 -1 -1.83 1.83
3 -3 -5.83 17.49
4 1 2.17 2.17
5 2 3.17 6.34
6 3 6.17 18.51
Sum = Sxy 54
Coefficient of Correlation: Positive Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
53
Sum of Product of x & y values (Sxy) will be positive
SL No. x – Mean x y – Mean y Product
1 -2 -3.83 7.66
2 -1 -1.83 1.83
3 -3 -5.83 17.49
4 1 2.17 2.17
5 2 3.17 6.34
6 3 6.17 18.51
Sum = Sxy 54
Coefficient of Correlation: Positive Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
54
Coefficient of Correlation: Negative Correlation
Collect data on x and y: When x is low then y will be high & vice versa
x y
2 12
3 11
1 15
5 7
6 5
7 3
CORRELATION & REGRESSION
Indian Statistical Institute
55
Calculate Mean of x & y values
SL No. x y
1 2 12
2 3 11
3 1 15
4 5 7
5 6 5
6 7 3
Mean 4 8.83
Coefficient of Correlation: Negative Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
56
Take x – Mean x and y – Mean y
SL No. x – Mean x y – Mean y
1 -2 3.17
2 -1 2.17
3 -3 6.17
4 1 -1.83
5 2 -3.83
6 3 -5.83
Coefficient of Correlation: Negative Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
Conclusion:
Low values will become
negative & high values will
become positive
57
Generally when x values are negative, y values are positive & vice versa
SL No. x – Mean x y – Mean y
1 -2 3.17
2 -1 2.17
3 -3 6.17
4 1 -1.83
5 2 -3.83
6 3 -5.83
Coefficient of Correlation: Negative Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
58
Then
Product of x & y values will be negative
SL No. x – Mean x y – Mean y Product
1 -2 3.17 -6.34
2 -1 2.17 -2.17
3 -3 6.17 -18.51
4 1 -1.83 -1.83
5 2 -3.83 -7.66
6 3 -5.83 -17.49
Sum = Sxy -54
Coefficient of Correlation: Negative Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
59
Sum of Product of x & y values Sxy will be negative
Coefficient of Correlation: Negative Correlation
CORRELATION & REGRESSION
Indian Statistical Institute
SL No. x – Mean x y – Mean y Product
1 -2 3.17 -6.34
2 -1 2.17 -2.17
3 -3 6.17 -18.51
4 1 -1.83 -1.83
5 2 -3.83 -7.66
6 3 -5.83 -17.49
Sum = Sxy -54
60
In Short
If correlation is positive
Sxy will be positive
If correlation is negative
Sxy will be negative
Coefficient of Correlation:
CORRELATION & REGRESSION
Indian Statistical Institute
61
To avoid scale issues
Sxy is divided by √ (Sxx.Syy)
Coefficient of Correlation:
CORRELATION & REGRESSION
Indian Statistical Institute
Sxy = Σ(x - Mean x)(y - Mean y)
Sxx = Σ(x - Mean x)²
Syy = Σ(y - Mean y)²
Correlation Coefficient r = Sxy / √(Sxx · Syy)
62
Coefficient of Correlation:
CORRELATION & REGRESSION
Indian Statistical Institute
SL No. | x - Mean x | y - Mean y | Product | (x - Mean x)² | (y - Mean y)²
1 | -2 | 3.17 | -6.34 | 4 | 10.0489
2 | -1 | 2.17 | -2.17 | 1 | 4.7089
3 | -3 | 6.17 | -18.51 | 9 | 38.0689
4 | 1 | -1.83 | -1.83 | 1 | 3.3489
5 | 2 | -3.83 | -7.66 | 4 | 14.6689
6 | 3 | -5.83 | -17.49 | 9 | 33.9889
Sum | | | Sxy: -54 | Sxx: 28 | Syy: 104.83
r = Sxy / √(Sxx · Syy) = -54 / √(28 x 104.83) = -0.9967
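The calculation of r can be reproduced with a few lines of Python. The function name correlation is a hypothetical helper; the data are the x and y values of the positive and negative examples.

```python
from math import sqrt


def correlation(xs, ys):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


x = [2, 3, 1, 5, 6, 7]
r_pos = correlation(x, [5, 7, 3, 11, 12, 15])   # positive correlation example
r_neg = correlation(x, [12, 11, 15, 7, 5, 3])   # negative correlation example
print(round(r_pos, 4), round(r_neg, 4))
```

Both examples use the same y values in a different order, so Sxx and Syy are identical and only the sign of Sxy (and therefore of r) changes.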
63
Regression
Correlation helps
To check whether two variables are related
If related
Identify the type & degree of relationship
CORRELATION & REGRESSION
Indian Statistical Institute
64
Regression
Regression helps
• To identify the exact form of the relationship
• To model output in terms of input or process variables
Examples:
Yield = 5 + 3 x Time - 2 x Temperature
Y = 2 - 5x
CORRELATION & REGRESSION
Indian Statistical Institute
65
Multiple Regression
To model output variable y in terms of two or more variables.
General Form:
Y = a + b1X1 + b2X2 + - - - + bkXk
Two variable case:
Y = a + b1X1 + b2X2
CORRELATION & REGRESSION
Indian Statistical Institute
66
Exercise 1: The data on Vendor performance score and the number of On Time,
Complete, Undamaged & Correctly billed shipments from the vendors of a
supply chain management company are given below. Can you develop a
model for Vendor performance score in terms of other variables?
CORRELATION & REGRESSION
Indian Statistical Institute
Vendor Id | Ontime Shipment | Complete Shipment | Undamaged Shipments | Correctly billed | Performance Score
1 | 950 | 990 | 980 | 550 | 2985
2 | 1450 | 1425 | 1475 | 975 | 4576
3 | 1700 | 1575 | 1730 | 1320 | 5435
4 | 1800 | 1515 | 1890 | 1615 | 5955
5 | 1675 | 1420 | 1756 | 1456 | 5400
6 | 1756 | 1645 | 1835 | 1489 | 5590
7 | 1236 | 1462 | 1335 | 1435 | 4675
8 | 1100 | 1523 | 1565 | 1625 | 4960
9 | 1325 | 1725 | 1570 | 1520 | 5325
10 | 1450 | 1620 | 1463 | 1430 | 5170
11 | 1570 | 1458 | 1356 | 1630 | 5190
67
Exercise 2: A construction company wants to develop a model for concrete
compressive strength. The attributes of interest are given in the table
below. The training data is given in the file Concrete_Data.xls .
1. Can you develop the model?
2. How closely will it predict the values?
LINEAR REGRESSION
Indian Statistical Institute
1 Cement (component 1)(kg in a m^3 mixture)
2 Blast Furnace Slag (component 2)(kg in a m^3 mixture)
3 Fly Ash (component 3)(kg in a m^3 mixture)
4 Water (component 4)(kg in a m^3 mixture)
5 Superplasticizer (component 5)(kg in a m^3 mixture)
6 Coarse Aggregate (component 6)(kg in a m^3 mixture)
7 Fine Aggregate (component 7)(kg in a m^3 mixture)
8 Age (day)
9 Concrete compressive strength(MPa, megapascals)
68
Indian Statistical Institute
CLASSIFICATION METHODS
69
Indian Statistical Institute
INTRODUCTION
Objective
To develop a mathematical model for an attribute or response metric (Y) in terms of
other available attributes (Xs).
When to Use
Xs : Continuous or discrete
Y : Discrete
70
Indian Statistical Institute
CLASSIFICATION METHODS
Classifies data (develops a model) based on the training data
Each sample is assumed to belong to a predefined class
Sample data set used for building the model is training set
Usage:
For classifying future or unknown data
71
Indian Statistical Institute
CLASSIFICATION METHODS
Example:
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
x1 x2 Y x1 x2 Y
11.35 23 Blue 11.85 39.9 Red
11.59 22.3 Blue 12.09 39.5 Red
12.19 24.5 Blue 12.69 37.8 Red
13.23 26.4 Blue 13.73 38.2 Red
13.51 30.2 Blue 14.01 37.8 Red
13.68 32 Blue 14.18 36.5 Red
14.78 33.1 Blue 15.28 36 Red
15.11 33 Blue 15.61 37.1 Red
15.55 25.2 Blue 16.05 33.1 Red
16.37 24.1 Blue 16.87 32.4 Red
16.99 22 Blue 17.49 31 Red
18.23 23.5 Blue 18.73 32 Red
18.83 24.1 Blue 19.33 31.8 Red
19.06 25 Blue 19.56 30.9 Red
72
Indian Statistical Institute
CLASSIFICATION METHODS
Example:
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
[Scatter plot: x2 (vertical axis, 20 to 40) against x1 (horizontal axis, 10 to 20)]
73
Indian Statistical Institute
CLASSIFICATION METHODS
Example:
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
[Scatter plot: x2 (vertical axis, 20 to 40) against x1 (horizontal axis, 10 to 20)]
Decision tree so far:
x2 > 35 → y1
74
Indian Statistical Institute
CLASSIFICATION METHODS
Example:
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
[Scatter plot: x2 (vertical axis, 20 to 40) against x1 (horizontal axis, 10 to 20)]
Decision tree so far:
x2 > 35 → y1
x2 < 28 → y2
75
Indian Statistical Institute
CLASSIFICATION METHODS
Example:
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
[Scatter plot: x2 (vertical axis, 20 to 40) against x1 (horizontal axis, 10 to 20)]
Decision tree:
x2 > 35 → y1
x2 < 28 → y2
otherwise, split on x1:
x1 < 15.5 → y2
x1 > 15.5 → y1
76
Indian Statistical Institute
CLASSIFICATION METHODS
Example: Rules
Attribute 1: x1
Attribute 2: x2
Label: y, with classes y1 (Red) and y2 (Blue)
Decision tree:
x2 > 35 → y1
x2 < 28 → y2
otherwise, split on x1:
x1 < 15.5 → y2
x1 > 15.5 → y1
If x2 > 35, then y = y1
If x2 < 28, then y = y2
If 28 ≤ x2 ≤ 35 & x1 > 15.5, then y = y1
If 28 ≤ x2 ≤ 35 & x1 < 15.5, then y = y2
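The four rules can be written as a small Python function and checked against a few of the training records. The function name classify is hypothetical; records with x1 exactly equal to 15.5 are not covered by the rules, and the sketch assigns them to y2 (Blue) as an assumption.

```python
def classify(x1, x2):
    """Apply the decision rules: y1 is Red, y2 is Blue."""
    if x2 > 35:
        return "Red"
    if x2 < 28:
        return "Blue"
    # Middle band (28 <= x2 <= 35): decided by x1.
    # Assumption: x1 == 15.5 falls to Blue, since the rules leave it open.
    return "Red" if x1 > 15.5 else "Blue"


# A few records from the training table: (x1, x2, actual label).
SAMPLE = [
    (11.35, 23.0, "Blue"), (15.11, 33.0, "Blue"), (19.06, 25.0, "Blue"),
    (11.85, 39.9, "Red"), (16.05, 33.1, "Red"), (19.56, 30.9, "Red"),
]
predictions = [classify(x1, x2) for x1, x2, _ in SAMPLE]
print(predictions)
```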
77
Indian Statistical Institute
CLASSIFICATION METHODS
Example: Table 1 gives the profile (Refund, Marital Status & Taxable Income) of
customers who have taken a loan from a bank. The table also
shows which of them actually cheated the bank.
1. Can you develop a decision rule to classify whether a customer will
cheat or not, based on the values of the 3 attributes (Refund, Marital Status &
Taxable Income)?
2. Validate the model using the test data given in Table 2.
Table 2: Test Data
SL No | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Married | > 80 K | No
2 | No | Single | > 80 K | No
3 | No | Single | < 80 K | No
4 | No | Married | > 80 K | No
5 | No | Divorced | > 80 K | Yes
78
Indian Statistical Institute
CLASSIFICATION METHODS
Table 1: Training Data Set
SL No | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | > 80 K | No
2 | No | Married | > 80 K | No
3 | No | Single | < 80 K | No
4 | Yes | Married | > 80 K | No
5 | No | Divorced | > 80 K | Yes
6 | No | Married | < 80 K | No
7 | Yes | Divorced | > 80 K | No
8 | No | Single | > 80 K | Yes
9 | No | Married | > 80 K | No
10 | No | Single | > 80 K | Yes
Class variable: Cheat
Number of predefined classes: 2 (Cheat = No & Cheat = Yes)
79
Indian Statistical Institute
CLASSIFICATION METHODS
Example:Result
If Marital Status = Married then cheat : No
If Marital Status = Single & Refund = Yes then cheat : No
If Marital Status = Single, Refund = No & Taxable Income < 80K then cheat: No
If Marital Status = Single, Refund = No & Taxable Income > 80K then cheat: Yes
If Marital Status = Divorced & Refund = Yes then cheat : No
If Marital Status = Divorced & Refund = No then cheat : Yes
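The six rules can be coded directly and applied to the five test records. The function name predict_cheat and the boolean flag income_above_80k are hypothetical names for this illustration.

```python
def predict_cheat(refund, marital_status, income_above_80k):
    """Apply the decision rules derived from the training data."""
    if marital_status == "Married":
        return "No"
    if refund == "Yes":
        return "No"            # Single or Divorced with a refund
    if marital_status == "Divorced":
        return "Yes"           # Divorced without a refund
    # Single without a refund: decided by taxable income.
    return "Yes" if income_above_80k else "No"


# Test data: (refund, marital status, taxable income > 80 K, actual cheat).
TEST_DATA = [
    ("Yes", "Married", True, "No"),
    ("No", "Single", True, "No"),
    ("No", "Single", False, "No"),
    ("No", "Married", True, "No"),
    ("No", "Divorced", True, "Yes"),
]
predictions = [predict_cheat(r, m, inc) for r, m, inc, _ in TEST_DATA]
print(predictions)
```

The rules misclassify only test record 2 (a single customer with no refund and income above 80 K who did not cheat).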
80
Indian Statistical Institute
CLASSIFICATION METHODS
Example: Decision Tree
SL No | Refund | Marital Status | Taxable Income | Cheat
1 | Yes | Single | > 80 K | No
2 | No | Married | > 80 K | No
3 | No | Single | < 80 K | No
4 | Yes | Married | > 80 K | No
5 | No | Divorced | > 80 K | Yes
6 | No | Married | < 80 K | No
7 | Yes | Divorced | > 80 K | No
8 | No | Single | > 80 K | Yes
9 | No | Married | > 80 K | No
10 | No | Single | > 80 K | Yes
81
Indian Statistical Institute
CLASSIFICATION METHODS
Example: Test Data Set
SL No | Refund | Marital Status | Taxable Income | Cheat | Predicted Cheat
1 | Yes | Married | > 80 K | No | No
2 | No | Single | > 80 K | No | Yes
3 | No | Single | < 80 K | No | No
4 | No | Married | > 80 K | No | No
5 | No | Divorced | > 80 K | Yes | Yes
82
Indian Statistical Institute
CLASSIFICATION METHODS
Performance Evaluation Measures
1. Confusion Matrix (rows: predicted class, columns: actual class)
Predicted Class | Actual Class = Yes | Actual Class = No
Class = Yes | a | b
Class = No | c | d
2. Accuracy: (a + d) / (a + b + c + d)
3. Precision: a / (a + b)
4. Recall: a / (a + c)
5. F Measure = 2 x Precision x Recall / (Precision + Recall)
83
Indian Statistical Institute
CLASSIFICATION METHODS
Example: Performance Evaluation Measures
1. Confusion Matrix (rows: predicted class, columns: actual class)
Predicted Class | Actual Cheat = No | Actual Cheat = Yes
Cheat = No | 3 | 0
Cheat = Yes | 1 | 1
SL No | Cheat | Predicted Cheat
1 | No | No
2 | No | Yes
3 | No | No
4 | No | No
5 | Yes | Yes
84
Indian Statistical Institute
CLASSIFICATION METHODS
Example: Performance Evaluation Measures
1. Confusion Matrix (rows: predicted class, columns: actual class)
Predicted Class | Actual Cheat = No | Actual Cheat = Yes
Cheat = No | 3 | 0
Cheat = Yes | 1 | 1
2. Accuracy: (3 + 1) / (3 + 1 + 0 + 1) = 4 / 5 = 0.8
3. Precision: 3 / (3 + 0) = 3 / 3 = 1.0
4. Recall: 3 / (3 + 1) = 3 / 4 = 0.75
5. F Measure = 2 x Precision x Recall / (Precision + Recall)
= 2 x 1.0 x 0.75 / (1.00 + 0.75) = 0.86
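The four measures can be computed from the actual and predicted labels with the sketch below. The function name evaluate is hypothetical; as in the worked example, Cheat = No is treated as the first (positive) class, and the sketch assumes that class is predicted and occurs at least once so that no denominator is zero.

```python
def evaluate(actual, predicted, positive):
    """Return accuracy, precision, recall and F measure for the positive class."""
    pairs = list(zip(actual, predicted))
    a = sum(1 for y, p in pairs if p == positive and y == positive)
    b = sum(1 for y, p in pairs if p == positive and y != positive)
    c = sum(1 for y, p in pairs if p != positive and y == positive)
    d = sum(1 for y, p in pairs if p != positive and y != positive)
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b)
    recall = a / (a + c)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure


actual = ["No", "No", "No", "No", "Yes"]        # Cheat, test records 1 to 5
predicted = ["No", "Yes", "No", "No", "Yes"]    # Predicted Cheat
accuracy, precision, recall, f_measure = evaluate(actual, predicted, positive="No")
print(accuracy, precision, recall, round(f_measure, 2))
```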
85
Indian Statistical Institute
CLASSIFICATION METHODS
Challenges
How to represent the entire information in the dataset using minimum number
of rules?
How to develop the smallest tree?
Solution
Select the attribute with maximum information for first split
Split | Attribute
First | Marital Status
Second | Refund
Third | Taxable Income
86
Indian Statistical Institute
CLASSIFICATION METHODS
Example: A marketing company wants to optimize its mailing campaign by sending
the brochure mail only to those customers who are likely to respond, based on
previous mail campaigns. The profiles of customers are given below. Can you develop a rule to
identify the profile of customers who are likely to respond?
SL No District House Type Income Previous_Customer Outcome
1 Suburban Detached High No No Response
2 Suburban Detached High Yes No Response
3 Rural Detached High No Responded
4 Urban Semi-detached High No Responded
5 Urban Semi-detached Low No Responded
6 Urban Semi-detached Low Yes No Response
7 Rural Semi-detached Low Yes Responded
8 Suburban Terrace High No No Response
9 Suburban Semi-detached Low No Responded
10 Urban Terrace Low No Responded
11 Suburban Terrace Low Yes Responded
12 Rural Terrace High Yes Responded
13 Rural Detached Low No Responded
14 Urban Terrace High Yes No Response
CHAID Algorithm
87
Indian Statistical Institute
Example: A marketing company wants to optimize its mailing campaign by sending
the brochure mail only to those customers who are likely to respond, based on
previous mail campaigns. The profiles of customers are given below. Can you develop a rule to
identify the profile of customers who are likely to respond?
SL No | Variable Name | Number of values
1 | District | 3
2 | House Type | 3
3 | Income | 2
4 | Previous Customer | 2
Number of variables = 4
Total Combination of Customer Profiles = 3 x 3 x 2 x 2 = 36
CHAID Algorithm
CLASSIFICATION METHODS
89
Indian Statistical Institute
Exercise 1: A bank wants to know the profile of customers who will buy a Personal
Equity Plan (Pep) after the mailing campaign. The data is given in the
file named bank-data.xls.
CLASSIFICATION METHODS
1. Can you develop a decision methodology?
2. How good is your model?
90
Indian Statistical Institute
CLASSIFICATION METHODS
Exercise 1: The file contains the following fields.
Field | Description
Id | a unique identification number
Age | age of customer in years (numeric)
Sex | MALE / FEMALE
Region | inner_city/rural/suburban/town
Income | income of customer (numeric)
Married | is the customer married (YES/NO)
Children | number of children (numeric)
Car | does the customer own a car (YES/NO)
Save_acct | does the customer have a saving account (YES/NO)
Current_acct | does the customer have a current account (YES/NO)
Mortgage | does the customer have a mortgage (YES/NO)
Pep | did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
91
Indian Statistical Institute
Exercise 2: The profile of the customers of a telecom service provider in
grace period is given in churn.xls file.
1. Can you develop a model to identify potential churners
(disconnections) so that the organization can win back the customers by
providing different offers?
2. How good is the decision rule?
CLASSIFICATION METHODS
92
Indian Statistical Institute
CLASSIFICATION METHODS
Exercise 2: The file contains the following fields.
1) Service class
2) Class change in last week
3) Class change in last 15 days
4) Class change in last month
5) Class change in last two months
6) Usage amount in last week
7) Usage amount in last 15 days
8) Usage amount in last month
9) Usage amount in last two months
10) Recharge amount in last week
11) Recharge amount in last 15 days
12) Recharge amount in last month
13) Recharge amount in last two months
14) Recharge count in last week
15) Recharge count in last 15 days
16) Recharge count in last month
17) Recharge count in last two months
18) Closing balance in last week
19) Closing balance in last 15 days
20) Closing balance in last month
21) Closing balance in last two months
93
Indian Statistical Institute
CLUSTER ANALYSIS
94
Indian Statistical Institute
Objective
To classify the records or items into a smaller number of groups based on the values
of available attributes.
When to Use
When there is no Y attribute
All attributes are considered as Xs only
CLUSTER ANALYSIS
95
Indian Statistical Institute
CLUSTER ANALYSIS
Methodology to group objects based on many attributes such that objects in a group
will be similar (or related) to one another
will be different from (or unrelated to) the objects in other groups
96
CLUSTER ANALYSIS
Types of Clustering
• K Mean Clustering
• K Medoid Clustering
97
CLUSTER ANALYSIS
K Mean Clustering
Methodology to group objects based on many attributes such that objects in a cluster will
be closer (or more similar) to the centroid of the cluster than to the centroid of any other
cluster.
1. Each cluster is associated with a centroid
2. Each point is assigned to the cluster with the closest centroid
3. The number of clusters, K, must be specified
4. Initial centroids are often chosen randomly
5. The centroid is (typically) the mean of the points in the cluster
6. Closeness is measured by Euclidean distance
98
CLUSTER ANALYSIS
K Mean Clustering: Euclidean Distance
$$D(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_k - y_k)^2}$$
Example:
SL No Attribute 1 Attribute 2 Attribute 3 Attribute 4 Attribute 5
1 25.8 15.7 56.0 0.57 5.3
2 23.6 18.1 48.9 0.62 7.2
Difference 2.2 -2.4 7.1 -0.05 -1.9
Square 4.84 5.76 50.41 0.0025 3.61
Sum 64.6225
Sq Root (Euclidean Distance) 8.038812101
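The worked example above can be checked with a few lines of Python. This is a minimal sketch; the function name `euclidean_distance` and the variable names are illustrative and not part of the course material.

```python
import math

def euclidean_distance(x, y):
    """Return the Euclidean distance between two equal-length records."""
    if len(x) != len(y):
        raise ValueError("records must have the same number of attributes")
    # Sum the squared attribute-wise differences, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# The two records from the example (Attribute 1 to Attribute 5)
record_1 = [25.8, 15.7, 56.0, 0.57, 5.3]
record_2 = [23.6, 18.1, 48.9, 0.62, 7.2]

distance = euclidean_distance(record_1, record_2)
print(round(distance, 4))  # 8.0388
```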
99
CLUSTER ANALYSIS
K Mean Clustering: Algorithm
1. Get the number of clusters (k) required from the user
2. Randomly select k centroids
3. Calculate the Euclidean distance of each data record to each and every centroid
4. For each record, identify the cluster with minimum Euclidean distance
5. Allocate the record to the cluster with minimum distance
6. Recalculate the centroids
7. Repeat steps 3 to 6 until there is no change in the cluster elements
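The seven steps above can be sketched in plain Python as follows. This is an illustrative sketch rather than what Rapid Miner does internally. It assumes the random starting centroids are drawn from the records themselves, and the `seed`, `max_iter` and `initial_centroids` parameters are additions for reproducibility that do not appear in the slides.

```python
import math
import random

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_point(points):
    """Attribute-wise mean of a non-empty list of records."""
    return [sum(col) / len(points) for col in zip(*points)]

def k_means(records, k, initial_centroids=None, seed=0, max_iter=100):
    # Step 1: k is supplied by the caller
    # Step 2: pick k starting centroids (assumption: sampled from the records)
    if initial_centroids is None:
        initial_centroids = random.Random(seed).sample(records, k)
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Steps 3 to 5: assign each record to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: euclidean_distance(r, centroids[c]))
            for r in records
        ]
        # Step 7: stop when the cluster membership no longer changes
        if new_labels == labels:
            break
        labels = new_labels
        # Step 6: recalculate each centroid as the mean of its members
        for c in range(k):
            members = [r for r, lab in zip(records, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return labels, centroids

# Hypothetical two-attribute data with two well-separated groups
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]]
labels, centroids = k_means(points, k=2, initial_centroids=[[0, 0], [10, 10]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Because the result depends on the starting centroids, different random starts can give different clusters on less clearly separated data.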
100
CLUSTER ANALYSIS
K Mean Clustering: Example
Cluster the following data with 3 attributes (Spend in 3 quarters) into 2 clusters
SL No. Quarter 1 Quarter 2 Quarter 3
1 1.425172 31.08748 108.5436
2 3.017551 34.17728 103.4577
3 3.803405 34.78973 101.7977
4 4.299151 31.02313 107.3701
5 5.352034 22.80945 109.9353
6 6.038361 22.21948 100.1809
7 6.128493 25.04893 111.0543
8 8.381028 23.6761 106.3302
9 8.989409 27.62143 106.7186
10 9.788646 27.35268 105.7799
Step 1: k = 2
Step 2: Randomly identify 2 centroids
Centroid Quarter 1 Quarter 2 Quarter 3
1 1.5 35 100
2 9.8 22 111
101
CLUSTER ANALYSIS
K Mean Clustering: Example
Step 3: Calculate the Euclidean distance of each point from centroid 1
Difference from Centroid 1 (Quarter columns)
SL No. Quarter 1 Quarter 2 Quarter 3 Sum of Squares Euclidean Distance
1 -0.07483 -3.91252 8.543587 88.30632 9.397144437
2 1.517551 -0.82272 3.4577 14.93552 3.864650273
3 2.303405 -0.21027 1.797705 8.58163 2.929441858
4 2.799151 -3.97687 7.370058 77.96853 8.829979305
5 3.852034 -12.1906 9.935263 262.1572 16.19127037
6 4.538361 -12.7805 0.180881 183.9713 13.56360037
7 4.628493 -9.95107 11.05433 242.6451 15.57706944
8 6.881028 -11.3239 6.330205 215.6508 14.68505506
9 7.489409 -7.37857 6.71863 155.6745 12.47695732
10 8.288646 -7.64732 5.779929 160.5907 12.67243867
102
CLUSTER ANALYSIS
K Mean Clustering: Example
Step 4: Calculate the Euclidean distance of each point from centroid 2
Difference from Centroid 2 (Quarter columns)
SL No. Quarter 1 Quarter 2 Quarter 3 Sum of Squares Euclidean Distance
1 -8.37483 9.087476 -2.45641 158.7539 12.59975882
2 -6.78245 12.17728 -7.5423 251.174 15.84846897
3 -5.99659 12.78973 -9.2023 284.2187 16.85878574
4 -5.50085 9.023126 -3.62994 124.8526 11.17374688
5 -4.44797 0.809445 -1.06474 21.57327 4.644703413
6 -3.76164 0.219475 -10.8191 131.2514 11.45650187
7 -3.67151 3.048927 0.054333 22.77887 4.772721122
8 -1.41897 1.676096 -4.6698 26.62977 5.160403756
9 -0.81059 5.621434 -4.28137 50.58771 7.112503764
10 -0.01135 5.352683 -5.22007 55.90048 7.476662339
103
CLUSTER ANALYSIS
K Mean Clustering: Example
Step 5: Allocate records to clusters with minimum distance
SL No. Quarter 1 Quarter 2 Quarter 3 Cluster 1 Cluster 2 Allocation
1 1.425172 31.08748 108.5436 9.397144 12.59975882 1
2 3.017551 34.17728 103.4577 3.86465 15.84846897 1
3 3.803405 34.78973 101.7977 2.929442 16.85878574 1
4 4.299151 31.02313 107.3701 8.829979 11.17374688 1
5 5.352034 22.80945 109.9353 16.19127 4.644703413 2
6 6.038361 22.21948 100.1809 13.5636 11.45650187 2
7 6.128493 25.04893 111.0543 15.57707 4.772721122 2
8 8.381028 23.6761 106.3302 14.68506 5.160403756 2
9 8.989409 27.62143 106.7186 12.47696 7.112503764 2
10 9.788646 27.35268 105.7799 12.67244 7.476662339 2
Step 6: Recalculate the centroids and repeat the steps
Centroid (Mean) Quarter 1 Quarter 2 Quarter 3
1 3.13632 32.7694 105.2923
2 7.446328 24.78801 106.6665
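Steps 3 to 6 of this example can be reproduced with the short Python sketch below. It covers only the first pass through the algorithm. The variable names are illustrative, and the data values are copied from the tables above as displayed, so results agree with the slides only to the precision shown.

```python
import math

# Spend in Quarter 1, Quarter 2, Quarter 3 for SL No. 1 to 10
records = [
    (1.425172, 31.08748, 108.5436),
    (3.017551, 34.17728, 103.4577),
    (3.803405, 34.78973, 101.7977),
    (4.299151, 31.02313, 107.3701),
    (5.352034, 22.80945, 109.9353),
    (6.038361, 22.21948, 100.1809),
    (6.128493, 25.04893, 111.0543),
    (8.381028, 23.6761, 106.3302),
    (8.989409, 27.62143, 106.7186),
    (9.788646, 27.35268, 105.7799),
]
# Step 2: the two starting centroids chosen on the slide
centroids = [(1.5, 35.0, 100.0), (9.8, 22.0, 111.0)]

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Steps 3 and 4: distance of every record from centroid 1 and centroid 2
distances = [[euclidean_distance(r, c) for c in centroids] for r in records]

# Step 5: allocate each record to the nearer centroid (numbered 1 and 2)
allocation = [d.index(min(d)) + 1 for d in distances]

# Step 6: each new centroid is the attribute-wise mean of its members
new_centroids = []
for cluster in (1, 2):
    members = [r for r, a in zip(records, allocation) if a == cluster]
    new_centroids.append([sum(col) / len(members) for col in zip(*members)])

print(allocation)  # [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]
for number, centroid in enumerate(new_centroids, start=1):
    print(number, [round(v, 4) for v in centroid])
```

Repeating Steps 3 to 6 with `new_centroids` in place of `centroids` continues the algorithm until the allocation stops changing.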
104
CLUSTER ANALYSIS
Exercise 1: The data on the % Erlang utilization of mobile towers of a telecom service
provider is given in Erlang_Utilization.xls. Group the towers into 5
clusters based on the utilization.
105
CLUSTER ANALYSIS
K Medoid Clustering
Methodology to group objects based on many attributes such that objects in a cluster will
be closer (or more similar) to the most centrally located object of the cluster.
1. The number of clusters, K, must be specified
2. Closeness is measured by Euclidean distance
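The slides do not spell out how the "most centrally located object" is found. A common definition, assumed in this sketch, is the cluster member with the smallest total distance to all other members (the medoid). The function name `medoid` and the data points are hypothetical.

```python
import math

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def medoid(points):
    """Return the member of `points` with the smallest total distance to the others."""
    return min(points, key=lambda p: sum(euclidean_distance(p, q) for q in points))

# Hypothetical cluster with one extreme record
cluster = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (20.0, 20.0)]

centre = medoid(cluster)
mean = tuple(sum(col) / len(cluster) for col in zip(*cluster))
print(centre)  # (3.0, 3.0)
print(mean)    # (6.0, 6.0)
```

Unlike the mean used in K Mean clustering, the medoid is always an actual record, and in this example it is pulled far less towards the extreme record.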
Exercise: Perform Exercises 1 to 3 using the K Medoid clustering method.