8/2/2019 Ankit Pandey_10BM60012_WEKA Term Paper
Data Mining Techniques using WEKA
IT for Business Intelligence
Ankit Pandey (10BM60012)
This term paper gives a brief introduction to WEKA, a powerful data mining tool, along with a hands-on guide to two data mining techniques, namely clustering (k-means) and linear regression, using WEKA.
Ankit Pandey (10BM60012), MBA 2nd Year, VGSoM, IIT Kharagpur
Introduction to WEKA
WEKA (Waikato Environment for Knowledge Analysis) is a collection of state-of-the-art
machine learning algorithms and data preprocessing tools written in Java, developed at the
University of Waikato, New Zealand. It is free software that runs on almost any platform and
is available under the GNU General Public License. It has a wide range of applications in
various data mining techniques. It provides extensive support for the entire process of
experimental data mining, including preparing the input data, evaluating learning schemes
statistically, and visualizing the input data and the result of learning. The WEKA workbench
includes methods for the main data mining problems: regression, classification, clustering,
association rule mining, and attribute selection. It can be used through either of the following two interfaces:
- Command Line Interface (CLI)
- Graphical User Interface (GUI)
The WEKA GUI Chooser appears like this
Fig.1
The buttons can be used to start the following applications:
- Explorer: an environment for exploring data with WEKA. It gives access to all the facilities using menu selection and form filling.
- Experimenter: can be used to answer the question: which methods and parameter values work best for the given problem?
- KnowledgeFlow: offers the same functions as the Explorer and also supports incremental learning. It allows designing configurations for streamed data processing, so incremental algorithms can be used to process very large datasets.
- Simple CLI: provides a simple command-line interface for directly executing WEKA commands.
This term paper will demonstrate the following two data mining techniques using WEKA:
- Clustering (Simple K Means)
- Linear regression
Clustering using WEKA
Clustering
Clustering is a class of techniques used to classify objects or cases into relatively homogeneous
groups called clusters. Objects in each cluster tend to be similar to each other and dissimilar
to objects in the other clusters. In clustering, there is no a priori information about the group
or cluster membership of any of the objects.
There are two major types of clustering techniques, viz.
- Hierarchical clustering
- Non-hierarchical clustering (of which k-means clustering is the most widely used method)
HIERARCHICAL CLUSTERING - Some measure of distance (usually Euclidean or squared Euclidean)
is used to compute the distances between all pairs of objects to be clustered. We begin with all
objects in separate clusters. The two closest objects are joined to form a cluster. This process
continues, with points joining existing clusters (because they are closest to an existing cluster)
and clusters joining other clusters, based on the shortest-distance criterion.
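The merging procedure described above can be sketched in a few lines of Python. This is only a minimal illustration, not what WEKA runs internally: it assumes Euclidean distance, uses the shortest distance between members as the distance between two clusters (single linkage), and the function name `single_linkage` and the sample points are made up for the example.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def single_linkage(points, target_clusters):
    """Merge the two closest clusters until `target_clusters` remain."""
    # Start with every object in its own cluster (stored as lists of indices).
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Shortest distance between any member of a and any member of b.
        return min(dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > target_clusters:
        # Find the closest pair of clusters.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Join the pair; y > x, so removing y leaves index x valid.
        clusters[x] = clusters[x] + clusters.pop(y)
    return [sorted(c) for c in clusters]


# Hypothetical 2-D points: two tight groups that lie far apart.
points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (0.9, 1.1)]
groups = single_linkage(points, target_clusters=2)
print(groups)
```

Stopping at a target number of clusters is a simplification; in practice the full sequence of merges is usually inspected before deciding where to cut.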
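NON-HIERARCHICAL (K-MEANS) CLUSTERING - We need to specify in advance, on rational grounds, the
number of clusters we want the objects to be grouped into. Each object is then assigned to the
cluster whose centre (centroid) is nearest, and the centroids are recomputed until the
assignments stop changing. In this term paper, I will illustrate the process of k-means
clustering through WEKA.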
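To make the idea of k-means concrete before moving to WEKA, here is a minimal sketch in Python. It assumes Euclidean distance and takes the starting centroids as an argument; the function name `k_means` and the six sample responses are hypothetical and are not taken from the survey data.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def k_means(points, initial_centroids, max_iterations=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centroids = [tuple(c) for c in initial_centroids]
    k = len(centroids)
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each instance joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # no instance changed cluster, so we have converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments


# Hypothetical responses to three statements on the 1-5 agreement scale.
responses = [(1, 1, 2), (1, 2, 1), (2, 1, 1), (5, 4, 5), (4, 5, 5), (5, 5, 4)]
centroids, assignments = k_means(
    responses, initial_centroids=[responses[0], responses[3]]
)
print(assignments)
print(centroids)
```

The result of k-means depends on the starting centroids. WEKA picks its own starting centroids (controlled by a seed setting), so its clusters may differ from those obtained with a hand-picked start like the one above.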
Business applications of Clustering
- Segmentation of the market
- Understanding buyer behavior
- Identifying new product opportunities
- Selecting test markets
- Reducing data
K-means clustering in WEKA
Example Problem: A major Indian FMCG company wants to map the profile of its target
market in terms of lifestyle, attitudes and perceptions. The company's managers prepare, with
the help of their marketing research team, a set of 15 statements, which they feel measure many of the variables of interest. These 15 statements are given below. The respondent had to agree
or disagree (1 = Strongly Agree, 2 = Agree, 3 = Neither Agree nor Disagree, 4 = Disagree, 5
= Strongly Disagree) with each statement.
1. I prefer to use e-mail rather than write a letter.
2. I feel that quality products are always priced high.
3. I think twice before I buy anything.
4. Television is a major source of entertainment.
5. A car is a necessity rather than a luxury.
6. I prefer fast food and ready to use products.
7. People are more health conscious today.
8. Entry of foreign companies has increased the efficiency of Indian companies.
9. Women are active participants in purchase decisions.
10. I believe politicians can play a positive role.
11. I enjoy watching movies.
12. If I get a chance, I would like to settle abroad.
13. I always buy branded products.
14. I frequently go out on weekends.
15. I prefer to pay by credit card rather than in cash.
The company wants to cluster the market on the above attributes so that it can cater
effectively to the most feasible and lucrative segment. I will describe how WEKA can be
used to do this.
To simplify matters, the 15 statements above were renamed as variables var01 through var15 in
the csv file clustering data. This data file contains 1436 instances (responses). WEKA accepts
several input file formats, including arff and csv. In this case we are using a csv file as
input.
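WEKA reads the csv file directly, so no conversion is needed. Purely to show how the two formats relate, the sketch below builds ARFF text from CSV text in Python. It assumes every column is numeric; the function name `csv_to_arff` and the three-column sample are illustrative only (the real file has var01 through var15).

```python
import csv
import io


def csv_to_arff(csv_text, relation):
    """Build ARFF text from CSV text, assuming every column is numeric."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    lines = [f"@relation '{relation}'"]
    # One @attribute line per CSV column, taken from the header row.
    lines += [f"@attribute {name} numeric" for name in header]
    lines.append("@data")
    lines += [",".join(row) for row in data]
    return "\n".join(lines)


# Illustrative three-column sample, not the actual survey file.
sample_csv = "var01,var02,var03\n1,3,1\n2,3,2\n"
arff_text = csv_to_arff(sample_csv, "clustering data")
print(arff_text)
```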
https://docs.google.com/spreadsheet/ccc?key=0AmVsF7VYJHmmdDFJYVRrYlVza3RYdjBkUjZBRlRaaUE#gid=0
Steps to be followed:
1. Select Explorer in the WEKA GUI Chooser window (displayed previously).
2. The following window will appear:
Fig.2
3. Select Open File and load the csv file clustering data. After loading the file, the interface will look like this:
Fig.3
4. We can click on Visualize All to view the distribution of all variables in the sample population as follows:
Fig.4
The preprocessing tasks in WEKA obviate the need to convert the data set into the standard
spreadsheet format and to convert categorical attributes to binary. By default, the WEKA
SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between
instances and cluster centroids.
5. To perform the clustering operation, select the Cluster tab in the Explorer window.
Fig.5
[All the figures from Fig. 4 onwards can be viewed more clearly in a separate window by clicking over them]
https://docs.google.com/open?id=0B2VsF7VYJHmmbXg5NXVTNm9lWUU
https://docs.google.com/open?id=0B2VsF7VYJHmmZ0JaenBmZDZSUEk
6. In the Clusterer panel, click on Choose and select SimpleKMeans.
7. Then click the text box beside the Choose button (a pop-up window will appear). Set the numClusters value to 4 and click OK.
Fig.6
8. Make sure that Use training set is selected in the Cluster mode panel and then click the Start button to begin the clustering process.
Fig.7
https://docs.google.com/open?id=0B2VsF7VYJHmmNjZEVlRPNlhESU0
9. In the Result list panel, right-click the result and select View in a separate window. The following result will be displayed:
Fig.8
Interpretation of clustering results
Based on the values of cluster centroids as shown in the above figure, we can state the
characteristics of each of the clusters. As an example we will describe the characteristics of
cluster 2 having 264 instances.
Cluster 2 characteristics:
- Prefer to use e-mail rather than letters, and credit cards over cash
- Somewhat believe that quality products are priced high
- Don't think much before buying; TV is not a major source of entertainment
- Consider a car more of a luxury; somewhat prefer fast food
- Health conscious; women are active decision makers
- Friendly towards products of foreign companies
- Enjoy movies; prefer branded products and weekend trips
Similarly, the salient features of each of the clusters can be obtained from the results.
These would subsequently help the FMCG firm to decide which segment (cluster) it should
primarily target.
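Reading a centroid in this way amounts to mapping each mean score back to the agreement scale used in the survey (1 = Strongly Agree to 5 = Strongly Disagree). A minimal sketch of that mapping is given below. The centroid values are hypothetical and are not taken from the WEKA output, and the helper name `describe_centroid` is my own.

```python
LIKERT_LABELS = {
    1: "Strongly Agree",
    2: "Agree",
    3: "Neither Agree nor Disagree",
    4: "Disagree",
    5: "Strongly Disagree",
}


def describe_centroid(centroid):
    """Map each mean score to the nearest point on the 1-5 agreement scale."""
    description = {}
    for variable, mean_score in centroid.items():
        nearest = min(5, max(1, int(mean_score + 0.5)))  # round half up, clamp
        description[variable] = LIKERT_LABELS[nearest]
    return description


# Hypothetical centroid values, NOT the ones reported by WEKA.
example_centroid = {"var01": 1.4, "var04": 3.8, "var11": 1.9}
summary = describe_centroid(example_centroid)
print(summary)
```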
https://docs.google.com/open?id=0B2VsF7VYJHmmTERVR2lYZ3NfbWc
Visualization of Clustering Results
A more intuitive way to go through the results is to visualize them in graphical form. To do
so:
- Right-click the result in the Result list panel
- Select Visualize cluster assignments
- Set the X-axis variable to Cluster, the Y-axis variable to Instance_number and Color to var11 to get the following output:
Fig.9
From the above graph we can see that people in cluster 0 like watching movies a lot,
while people in cluster 3 don't like watching movies. Clusters 1 and 2 have mixed
responses, which are skewed towards watching movies.
Similarly, we can change the variables on the X-axis, Y-axis and color to visualize other aspects of
the result. Note that WEKA has generated an extra variable named Cluster (not present in the original data), which signifies the cluster membership of each instance.
We can save the output as an arff file by clicking on the save button in Fig. 9.
The output file contains an additional attribute cluster for each instance.
Thus besides the value of fifteen attributes for any instance, the output also specifies the
cluster membership for that instance.
https://docs.google.com/open?id=0B2VsF7VYJHmma1I3NGllSnRKZDg
A part of the saved output arff file:
@relation 'Clustering Data_clustered'
@attribute Instance_number numeric
@attribute var01 numeric
@attribute var02 numeric
@attribute var03 numeric
@attribute var04 numeric
@attribute var05 numeric
@attribute var06 numeric
@attribute var07 numeric
@attribute var08 numeric
@attribute var09 numeric
@attribute var10 numeric
@attribute var11 numeric
@attribute var12 numeric
@attribute var13 numeric
@attribute var14 numeric
@attribute var15 numeric
@attribute Cluster {cluster0, cluster1, cluster2, cluster3}
@data
0,1,3,1,2,3,1,3,2,3,2,2,1,1,1,1,cluster0
1,2,3,2,3,2,2,3,2,4,1,5,2,2,2,2,cluster3
2,3,2,3,2,3,1,3,3,2,2,2,3,2,2,3,cluster0
3,3,2,2,2,2,2,3,2,1,2,1,2,1,1,1,cluster0
4,2,2,2,2,2,1,3,3,2,2,1,1,3,3,2,cluster0
5,2,2,3,3,1,2,2,2,3,2,1,2,3,3,3,cluster0
6,1,1,2,2,2,1,2,2,2,1,2,3,3,3,1,cluster0
7,2,1,1,2,1,2,1,1,1,1,3,3,1,1,2,cluster0
8,2,1,1,3,2,2,2,1,2,1,2,2,2,2,3,cluster0
9,1,2,2,3,2,1,1,1,3,2,1,1,2,2,1,cluster0
10,2,3,3,2,1,2,1,1,2,2,2,1,1,1,2,cluster0
11,3,2,2,2,3,2,1,1,1,3,2,2,2,2,3,cluster0
12,2,3,2,2,3,3,2,2,2,3,2,3,1,1,2,cluster0
..
We can also use other clustering methods to group the data into clusters. WEKA is particularly
useful in the clustering process when the data set is large, as it can generate clusters
quickly even then. With numerous applications of clustering in business, WEKA can
be very useful for clustering data in real business scenarios.
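The saved arff file can also be processed outside WEKA. As a rough sketch, the Python code below reads the attribute names and data rows and counts how many instances fall in each cluster. It handles only simple comma-separated ARFF like the excerpt here (no quoted values), the function name `parse_arff` is my own, and only the first three instances of the excerpt are included to keep the example short.

```python
from collections import Counter


def parse_arff(text):
    """Read attribute names and data rows from simple, comma-separated ARFF text."""
    names, records, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if in_data:
            records.append(dict(zip(names, line.split(","))))
        elif line.lower().startswith("@attribute"):
            names.append(line.split()[1])  # second token is the attribute name
        elif line.lower().startswith("@data"):
            in_data = True
    return names, records


# Header and first three instances of the saved output file.
header = ["@relation 'Clustering Data_clustered'", "@attribute Instance_number numeric"]
header += [f"@attribute var{i:02d} numeric" for i in range(1, 16)]
header.append("@attribute Cluster {cluster0, cluster1, cluster2, cluster3}")
rows = [
    "0,1,3,1,2,3,1,3,2,3,2,2,1,1,1,1,cluster0",
    "1,2,3,2,3,2,2,3,2,4,1,5,2,2,2,2,cluster3",
    "2,3,2,3,2,3,1,3,3,2,2,2,3,2,2,3,cluster0",
]
names, records = parse_arff("\n".join(header + ["@data"] + rows))
cluster_sizes = Counter(record["Cluster"] for record in records)
print(cluster_sizes)
print(records[1]["var11"])
```

In these three rows, the single cluster 3 respondent answered 5 (Strongly Disagree) to var11, "I enjoy watching movies", which agrees with the reading of Fig.9.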
Linear Regression using WEKA
Regression
Regression analysis helps us to determine the nature and strength of the relationship between
two variables, or between one dependent variable and a number of independent variables. In
regression analysis, an estimating equation is developed: a mathematical formula that
relates the known (independent) variables to the unknown (dependent) variable. After this,
correlation analysis can be applied to determine the degree to which the variables are related.
Broadly, regression can be classified into two types:
- Simple linear regression (one dependent variable and one independent variable)
- Multiple regression (one dependent variable and many independent variables)
I will illustrate the process of multiple regression in WEKA with an example in this term paper.
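Before turning to WEKA, the estimating equation of a multiple regression can be illustrated with a small least-squares fit in plain Python. This is a sketch under stated assumptions: it solves the normal equations directly, the function name `fit_linear_regression` is my own, and the data are hypothetical, generated exactly from y = 2 + 3*x1 - 1*x2. WEKA's LinearRegression has further options (such as attribute selection), so its output need not match a plain least-squares fit exactly.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept term
    n = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(n):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    # The matrix is now diagonal; read off intercept and coefficients.
    return [A[i][n] / A[i][i] for i in range(n)]


# Hypothetical data generated from y = 2 + 3*x1 - 1*x2 (no noise).
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [3, 7, 7, 11, 12]
coefficients = fit_linear_regression(X, y)
print(coefficients)  # intercept first, then one coefficient per variable
```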
Business applications of Regression
- Pricing decisions
- Risk analysis for investments
- Sales/market forecasts
- Trend line analysis
- Total quality control
- Development of better hiring plans
Regression in WEKA
Example Problem: Kristal Auto (fictional name) is a car manufacturing company that has a
presence in all segments, ranging from A1-segment hatchbacks to premium saloons and SUVs.
It is planning to launch a crossover. To price the new crossover appropriately and
competitively, Kristal wants to determine which factors determine the price of a car and to
what extent each factor influences the price. To do so, it collects data for 2220 different car
models. It considers 10 features that primarily determine the price of a car. These 10
features are:
- Displacement (cc)
- Mileage (kmpl)
- Boot Space (ltrs)
- Length (mm)
- Anti-lock Braking System (ABS)
- Electronic Stability Program (ESP)
- Anti-theft alarm
- Airbags
- Keyless Entry
- Global Positioning System (GPS)
The company then regresses the price on these independent variables to develop a regression
model which could assist it in its pricing decisions. I will describe the regression process using
WEKA in this term paper.
We will use the data in the csv file named regression data.
Steps to be followed:
1. Select Explorer from the WEKA GUI Chooser window and load the file regression data as described in the clustering example. The following screen will appear after this:
Fig.10
https://docs.google.com/open?id=0B2VsF7VYJHmmOXd0dnFLZmdIcFE
https://docs.google.com/open?id=0B2VsF7VYJHmmRUFQTm1LNEhRbTg
2. Click the Classify tab in the Explorer window and then click the Choose button in the Classifier panel. Then select LinearRegression from functions. The following screen will
appear:
Fig.11
It automatically identifies the dependent variable as Price (as shown below the Test options panel); by default, WEKA treats the last attribute as the dependent variable.
If it does not, we can select the dependent variable ourselves.
3. Press the Start button. The following output will be generated:
Fig.12
Output can also be viewed in a separate window (as described earlier in clustering example).
https://docs.google.com/open?id=0B2VsF7VYJHmmWno5VlZ1WVhPZ28
https://docs.google.com/open?id=0B2VsF7VYJHmmUl9NWHN3LVhhTDg
We can see that the output contains the regression equation, which can be used to price the
new crossover. (Note that the equation may not be consistent with real-life situations, as the
data used were manipulated to demonstrate the technique effectively.)
From the regression equation we can see that boot space and length have negative coefficients,
i.e. they are negatively related to the price when the other features are held constant, while
displacement, ABS and GPS are positively related to the price.
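To show how such an equation would be used for pricing, here is a small Python sketch. The intercept and coefficients are entirely hypothetical: only their signs follow the discussion above, and the numbers are not the ones produced by WEKA. The names `predict_price` and `crossover` are likewise made up for the example.

```python
# Hypothetical equation: only the signs follow the text, the numbers are made up.
INTERCEPT = 500000.0
COEFFICIENTS = {
    "displacement_cc": 300.0,
    "boot_space_ltrs": -200.0,
    "length_mm": -50.0,
    "abs": 40000.0,  # 1 if fitted, 0 otherwise
    "gps": 60000.0,  # 1 if fitted, 0 otherwise
}


def predict_price(car):
    """Apply the regression equation: intercept + sum(coefficient * feature)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * car[name] for name in COEFFICIENTS)


# A hypothetical specification for the new crossover.
crossover = {
    "displacement_cc": 1500,
    "boot_space_ltrs": 350,
    "length_mm": 4000,
    "abs": 1,
    "gps": 1,
}
base_price = predict_price(crossover)
bigger_boot = predict_price({**crossover, "boot_space_ltrs": 450})
print(base_price, bigger_boot - base_price)
```

With a negative coefficient on boot space, adding 100 litres of boot space lowers the predicted price when everything else is unchanged, which is what the sign of the coefficient expresses.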
We can also visualize the classifier errors, i.e. the differences between the actual prices and
the prices predicted by the regression equation, by right-clicking on the result set in the
Result list panel and selecting Visualize classifier errors.
Fig.13
The X-axis has Price (actual) and the Y-axis has Predicted Price.
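The same comparison of actual and predicted prices can be summarized numerically. The sketch below computes two common error measures, the mean absolute error and the root mean squared error, in Python. The price values are hypothetical (arbitrary units) and the function name `error_summary` is my own.

```python
from math import sqrt


def error_summary(actual, predicted):
    """Return (mean absolute error, root mean squared error)."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(r) for r in residuals) / len(residuals)
    rmse = sqrt(sum(r * r for r in residuals) / len(residuals))
    return mae, rmse


# Hypothetical actual and predicted prices, in arbitrary units.
actual_prices = [10, 12, 15, 20]
predicted_prices = [11, 12, 13, 22]
mae, rmse = error_summary(actual_prices, predicted_prices)
print(mae, rmse)
```

Points far from the diagonal of the actual versus predicted plot correspond to large residuals, and these contribute most to the root mean squared error.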
Other applications of WEKA in data mining
WEKA can be used in various other data mining techniques. Some of them are:
- Classification (using decision trees)
- Collaborative filtering (Nearest Neighbor)
- Association
https://docs.google.com/open?id=0B2VsF7VYJHmmU3puSGdBM3NfT0k