
Top 10 data mining algorithms applications with R

Fayan Tao

November 10, 2015

Abstract

This report briefly introduces the top 10 data mining algorithms and demonstrates specific experimental examples of these algorithms using R and RStudio. The algorithms include C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART, which are all widely used in clustering, classification, association analysis, and link mining.

Introduction

About top 10 algorithms in data mining[1]

In an effort to identify some of the most influential algorithms that have been widely used in the data mining community, the IEEE International Conference on Data Mining (ICDM, http://www.cs.uvm.edu/~icdm/) identified the top 10 algorithms in data mining for presentation at ICDM '06 in Hong Kong.

These top 10 algorithms play a vital role in research; in particular, they are widely used in data mining. Because these significant algorithms cover clustering, statistical learning, association analysis, and link mining, they appeal to the many researchers engaged in those topics.

About R[3]

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering...) and graphical techniques, and is highly extensible. R also provides an Open Source route to participation in statistical research.

R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.

About RStudio[4]

RStudio is an integrated development environment (IDE) for R. It includes a console, syntax-highlighting editor that supports direct code execution, as well as tools for plotting, history, debugging and workspace management.

RStudio is available in open source and commercial editions and runs on the desktop (Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro (Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).

About dataset used in this report[2]

The iris dataset has been used for classification in many research publications, such as in data mining. It consists of 50 samples from each of three classes of iris flowers (setosa, versicolour, and virginica). One class is linearly separable from the other two, while the latter are not linearly separable from each other. There are five attributes for those three kinds of flowers:

1. sepal length in cm;

2. sepal width in cm;

3. petal length in cm;

4. petal width in cm;

5. class: Iris Setosa, Iris Versicolour, and Iris Virginica.

In this report, we choose iris dataset to do some related experiments.
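As a quick orientation (not part of the original report), the dataset can be inspected directly, since iris ships with base R:

# Look at the structure and class balance of the iris dataset.
data(iris)
str(iris)            # 150 observations of 5 variables
head(iris, 3)        # first few rows
table(iris$Species)  # 50 flowers per species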

Experimental examples[5][6]

1. C4.5

C4.5 constructs a classifier in the form of a decision tree. In order to do this, C4.5 is given a set of data representing things that are already classified. In this report, we train on 100 out of 150 samples and attempt to predict which class the remaining 50 samples belong to.

We are going to train C5.0, which is the latest version of C4.5, to recognize 3 different species of iris. Once C5.0 is trained, we will test it with some data it has not seen before to see how accurately it "learned" the characteristics of each species.

library(C50)
library(printr)

This code takes a sample of 100 rows from the iris dataset:

train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]

This code trains a model based on the training data:

model <- C5.0(Species ~ ., data = iris.train)

This code tests the model using the test data:

results <- predict(object = model, newdata = iris.test, type = "class")

This code generates a confusion matrix for the results:

table(results, iris.test$Species)

results      setosa versicolor virginica
setosa           20          0         0
versicolor        0         18         2
virginica         0          0        10

The rows above represent the predicted species, and the columns above represent the actual species from the iris dataset.

Starting from the setosa row, we can see that:

(1) 20 iris observations were predicted to be setosa when they were actually setosa.

(2) 18 iris observations were predicted to be versicolor when they were actually versicolor.

(3) 2 iris observations were predicted to be versicolor when they were actually virginica.

(4) 10 iris observations were predicted to be virginica when they were actually virginica.
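As a quick sanity check (not part of the original report), the overall accuracy can be computed from the predictions, assuming the results and iris.test objects created above are still in the workspace:

# Fraction of test samples classified correctly; for the confusion matrix above,
# this is (20 + 18 + 10) / 50 = 0.96.
mean(results == iris.test$Species)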

2. k-means

k-means creates k groups from a set of objects so that the members of a group are more similar to each other than to objects outside the group. It is a popular cluster analysis technique for exploring a dataset.


k-means can be used to pre-cluster a massive dataset followed by a more expensive cluster analysis on the sub-clusters. k-means can also be used to rapidly "play" with k and explore whether there are overlooked patterns or relationships in the dataset.

library(stats)
library(printr)

This code removes the Species column from the iris dataset. Then it uses k-means to create 3 clusters:

model <- kmeans(x = subset(iris, select = -Species), centers = 3)

This code generates a confusion matrix for the results:

table(model$cluster, iris$Species)

cluster   setosa versicolor virginica
1             33          0         0
2              0         46        50
3             17          4         0

The above matrix shows us that:

1. k-means did not pick up the characteristics of setosa very well in cluster 1: out of 50 setosa irises, it grouped together just 33.

2. k-means had a tough time with setosa and versicolor, since setosa is split between clusters 1 and 3, and versicolor between clusters 2 and 3.
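An optional visualization, not in the original report: plotting the petal measurements coloured by cluster assignment makes the overlap between versicolor and virginica visible. This assumes the model object from above; note that kmeans starts from random centers, so set.seed() would make the runs reproducible.

# Scatter plot of petal length vs. petal width, coloured by the k-means cluster.
plot(iris$Petal.Length, iris$Petal.Width,
     col = model$cluster, pch = 19,
     xlab = "Petal length (cm)", ylab = "Petal width (cm)")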

3. Support Vector Machines

Support vector machine (SVM) learns a hyperplane to classify data into 2 classes. At a high level, SVM performs a task similar to C4.5, except that SVM does not use decision trees at all.

SVM maps the data into a higher-dimensional space and then finds the hyperplane that separates the classes. SVM attempts to maximize the margin (the distance between the hyperplane and the closest data points from each class), so that the hyperplane is as far as possible from the closest points of both classes. In this way, it decreases the chance of misclassification.

library(e1071)
library(printr)

This code takes a sample of 100 rows from the iris dataset:


train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]

This code trains a model based on the training data:

model <- svm(Species ~ ., data = iris.train)

This code tests the model using the test data:

results <- predict(object = model, newdata = iris.test, type = "class")

This code generates a confusion matrix for the results:

table(results, iris.test$Species)

results      setosa versicolor virginica
setosa           12          0         0
versicolor        0         17         1
virginica         0          1        19

We can see that SVM made two mistakes: it misclassified one virginica iris as versicolor and one versicolor as virginica. Although the testing so far has not been very thorough, based on these runs SVM and C5.0 seem to do about the same on this dataset, and both do better than k-means.
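An optional refinement, not part of the original report: e1071 can tune the SVM hyperparameters by cross-validation on the training data, which may reduce such misclassifications. The parameter grids below are illustrative choices.

# Grid search over gamma and cost using cross-validation on the training set.
tuned <- tune.svm(Species ~ ., data = iris.train,
                  gamma = 10^(-2:1), cost = 10^(0:2))
summary(tuned)    # best parameters and cross-validation error
tuned$best.model  # an svm model refitted with the best parameters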

4. Apriori

The Apriori algorithm learns association rules and is applied to a database containing a large number of transactions.

The basic Apriori algorithm is a 3 step approach:

(1) Join: scan the whole database for how frequent 1-itemsets are.

(2) Prune: those itemsets that satisfy the support and confidence move onto the next round for 2-itemsets.

(3) Repeat: this is repeated for each itemset level until we reach our previously defined size.

Apriori is well understood, easy to implement, and has many derivatives. On the other hand, the algorithm can be quite memory-, space- and time-intensive when generating itemsets.
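To make the support and confidence criteria concrete, here is a small hand-worked sketch, not from the original report, for the rule {A} => {B} over five toy transactions:

# Five toy transactions; support and confidence of {A} => {B} computed by hand.
transactions <- list(c("A", "B"), c("A", "C"), c("A", "B", "C"), c("B"), c("A", "B"))

support.A  <- mean(sapply(transactions, function(t) "A" %in% t))               # 4/5 = 0.8
support.AB <- mean(sapply(transactions, function(t) all(c("A", "B") %in% t)))  # 3/5 = 0.6
confidence <- support.AB / support.A                                           # 0.75

c(support = support.AB, confidence = confidence)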

library(arules)
library(printr)
data("Adult")

This code generates the association rules from the dataset:


rules <- apriori(Adult, parameter = list(support = 0.4, confidence = 0.7), appearance = list(rhs = c("race=White", "sex=Male"), default = "lhs"))

##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.7    0.1    1 none FALSE            TRUE     0.4      1     10
##  target   ext
##   rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[2 item(s)] done [0.00s].
## set transactions ...[115 item(s), 48842 transaction(s)] done [0.03s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [32 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].

This code gives us a view of the rules:

rules.sorted <- sort(rules, by = "lift")
top5.rules <- head(rules.sorted, 5)
as(top5.rules, "data.frame")

   rules                                                                    support confidence     lift
2  {relationship=Husband} => {sex=Male}                                   0.4036485  0.9999493 1.495851
12 {marital-status=Married-civ-spouse,relationship=Husband} => {sex=Male} 0.4034028  0.9999492 1.495851
3  {marital-status=Married-civ-spouse} => {sex=Male}                      0.4074157  0.8891818 1.330151
4  {marital-status=Married-civ-spouse} => {race=White}                    0.4105892  0.8961080 1.048027
19 {workclass=Private,native-country=United-States} => {race=White}       0.5433848  0.8804113 1.029669

The above shows the top 5 rules. It turns out that:


(1) In the 1st rule, when we see Husband we are virtually guaranteed to see Male. Nothing surprising in this revelation.

(2) The 2nd rule is basically the same as the 1st rule.

(3) In the 3rd and 4th rules, when we see a married civilian spouse, we have a high chance of seeing Male and White. This is interesting, because it potentially tells us something about the data.

(4) In the 5th rule, when we see US, we tend to see White. This seems to fit with our expectations, but it could also point to the way the data was collected.

5. EM

In data mining, expectation-maximization (EM) is generally used as a clustering algorithm (like k-means) for knowledge discovery.

In statistics, the EM algorithm iterates and optimizes the likelihood of seeing observed data while estimating the parameters of a statistical model with unobserved variables.

EM begins by making a guess at the model parameters. Then it follows an iterative 3-step process:

(1) E-step: Based on the model parameters, it calculates the probabilities for assignments of each data point to a cluster;

(2) M-step: Update the model parameters based on the cluster assignments from the E-step;

(3) Repeat until the model parameters and cluster assignments stabilize (a.k.a. convergence).
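To make the E-step/M-step loop concrete, here is a minimal hand-rolled sketch for a two-component one-dimensional Gaussian mixture. It is illustrative only; the experiment below relies on the mclust package instead.

# Simulate data from two Gaussians, then run a plain EM loop.
set.seed(1)
x <- c(rnorm(100, mean = 0), rnorm(100, mean = 4))

mu <- c(-1, 1); sigma <- c(1, 1); w <- c(0.5, 0.5)  # initial parameter guesses
for (iter in 1:50) {
  # E-step: responsibility of each component for each data point.
  d1 <- w[1] * dnorm(x, mu[1], sigma[1])
  d2 <- w[2] * dnorm(x, mu[2], sigma[2])
  r1 <- d1 / (d1 + d2); r2 <- 1 - r1
  # M-step: update weights, means and standard deviations from the responsibilities.
  w     <- c(mean(r1), mean(r2))
  mu    <- c(sum(r1 * x) / sum(r1), sum(r2 * x) / sum(r2))
  sigma <- c(sqrt(sum(r1 * (x - mu[1])^2) / sum(r1)),
             sqrt(sum(r2 * (x - mu[2])^2) / sum(r2)))
}
round(c(mu, sigma, w), 2)  # estimates close to the true means 0 and 4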

library(mclust)

## Package 'mclust' version 5.1
## Type 'citation("mclust")' for citing this R package in publications.

library(printr)

This code removes the Species column from the iris dataset. Then it uses Mclust to create clusters:

model <- Mclust(subset(iris, select = -Species))

This code generates a confusion matrix for the results:

table(model$classification, iris$Species)


cluster   setosa versicolor virginica
1             50          0         0
2              0         50        50

Looking at the table above, we find that the numbers along the left-hand side are just 1 and 2, which means that EM found only 2 clusters: it had trouble distinguishing between versicolor and virginica. Just like k-means clustering, the algorithm has no idea what the cluster names are, so it simply numbers them.

The following figures also show that EM clusters versicolor and virginica into almost the same group.

This code plots the EM model:

plot(model)


6. PageRank

PageRank is a link analysis algorithm designed to determine the relative importance of some object linked within a network of objects.

Arguably, the main selling point of PageRank is its robustness due to the difficulty of getting a relevant incoming link.
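To illustrate what PageRank computes, here is a small hand-rolled power-iteration sketch on a 3-node graph. It is not part of the original report; the experiment below uses igraph's page.rank instead.

# Adjacency matrix: A[i, j] = 1 means object i links to object j.
A <- matrix(c(0, 1, 1,
              0, 0, 1,
              1, 0, 0), nrow = 3, byrow = TRUE)
d  <- 0.85              # damping factor (the usual default)
M  <- t(A / rowSums(A)) # column j spreads object j's rank over its out-links
pr <- rep(1/3, 3)       # start from a uniform rank vector
for (i in 1:100) {
  pr <- (1 - d) / 3 + d * (M %*% pr)  # power iteration until convergence
}
round(as.vector(pr), 3) # object 3, with the most incoming rank, scores highest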

library(igraph)

##
## Attaching package: 'igraph'
##
## The following object is masked from 'package:arules':
##
##     union
##
## The following objects are masked from 'package:stats':
##
##     decompose, spectrum
##
## The following object is masked from 'package:base':
##
##     union

library(dplyr)

##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:igraph':
##
##     %>%, as_data_frame, groups, union
##
## The following objects are masked from 'package:arules':
##
##     intersect, setdiff, setequal, union
##
## The following objects are masked from 'package:stats':
##
##     filter, lag
##
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union

library(printr)

This code generates a random directed graph with 15 objects:

g <- random.graph.game(n = 15, p.or.m = 1/4, directed = TRUE)


plot(g)

In a graph like the one above, there can be 2 kinds of links: directed and undirected. Directed links are single directional.

For example, a web page hyperlinking to another web page is one-way. Unless the 2nd web page hyperlinks back to the 1st page, the link does not go both ways.

This code calculates the PageRank for each object:

pr <- page.rank(g)$vector

This code outputs the PageRank for each object:

df <- data.frame(Object = 1:15, PageRank = pr)
arrange(df, desc(PageRank))

Object   PageRank
     3  0.0913739
     4  0.0869785
    15  0.0846917
     6  0.0740617
    13  0.0704102
    14  0.0697306
    12  0.0661287
    10  0.0655945
     1  0.0626079
    11  0.0607512
     5  0.0586047
     7  0.0570346
     9  0.0530647
     2  0.0509461
     8  0.0480209

The above table tells us the relative importance of each object in the graph. For example:

(1) Object 3 is the most relevant with a PageRank of 0.091;

(2) Object 8 is the least relevant with a PageRank of 0.048.

Looking back at the original graph, this seems to be accurate. Object 3 is linked to by more than 4 objects, while Object 8 is linked to by just 3 objects.

7. AdaBoost

AdaBoost is a boosting algorithm which constructs a classifier. Here, a classifier takes a bunch of data and attempts to predict or classify which class a new data element belongs to. Boosting is an ensemble learning algorithm which takes multiple learning algorithms (e.g. decision trees) and combines them. The goal is to take an ensemble or group of weak learners and combine them to create a single strong learner.

library(adabag)

## Loading required package: rpart
## Loading required package: mlbench
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2

library(printr)

This code takes a sample of 100 rows from the iris dataset:

train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]

This code trains a model based on the training data:

model <- boosting(Species ~ ., data = iris.train)


This code tests the model using the test data and displays the confusion matrix:

results <- predict(object = model, newdata = iris.test, type = "class")

results$confusion

Predicted Class / Observed Class   setosa versicolor virginica
setosa                                 15          0         0
versicolor                              0         14         0
virginica                               0          2        19

We can see that AdaBoost made just one kind of mistake: it misclassified 2 versicolor irises as virginica.
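As an optional follow-up, not in the original report, adabag exposes some details of the boosted ensemble, assuming the model object from above is still in the workspace:

# Relative importance of each predictor in the boosted ensemble.
model$importance

# Number of weak learners (rpart trees) combined; controlled by boosting()'s mfinal argument.
length(model$trees)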

8. kNN

kNN, or k-Nearest Neighbors, is a classification algorithm. However, it differs from the classifiers previously described because it is a lazy learner.

A lazy learner does not do much during the training process other than store the training data. Only when new unlabeled data is input does this type of learner look to classify. On the other hand, an eager learner builds a classification model during training. When new unlabeled data is input, this type of learner feeds the data into the classification model.

kNN builds no such classification model. Instead, it just stores the labeled training data. When new unlabeled data comes in, kNN operates in 2 basic steps:

(1) First, it looks at the k closest labeled training data points, in other words, the k-nearest neighbors.

(2) Second, using the neighbors' classes, kNN gets a better idea of how the new data should be classified.

However, kNN also has some drawbacks:

(1) kNN can get very computationally expensive when trying to determine the nearest neighbors on a large dataset;

(2) Noisy data can throw off kNN classifications;

(3) Features with a larger range of values can dominate the distance metric relative to features that have a smaller range, so feature scaling is important (see the sketch after this list);

(4) Since data processing is deferred, kNN generally has greater storage requirements than eager classifiers;

(5) Selecting a good distance metric is crucial to kNN's accuracy.
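Here is the sketch referenced in drawback (3), not part of the original experiment: standardizing the numeric predictors keeps any single feature from dominating the distance metric.

# Standardize the four numeric columns of iris to mean 0 and standard deviation 1.
iris.scaled <- iris
iris.scaled[, 1:4] <- scale(iris.scaled[, 1:4])
summary(iris.scaled$Sepal.Length)  # now centered on 0

The same train/test split and knn() call used below could then be run on iris.scaled instead of iris; for this dataset all four features are in centimetres and of similar magnitude, so the effect is likely small.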

library(class)


##
## Attaching package: 'class'
##
## The following object is masked from 'package:igraph':
##
##     knn

library(printr)

This code takes a sample of 100 rows from the iris dataset:

train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]

This code initializes kNN with the training data and classifies the test data:

results <- knn(train = subset(iris.train, select = -Species),
               test = subset(iris.test, select = -Species),
               cl = iris.train$Species)

table(results, iris.test$Species)

results      setosa versicolor virginica
setosa           16          0         0
versicolor        0         13         3
virginica         0          2        16

We can see that kNN made two kinds of mistakes:

(1) it misclassified 3 virginica irises as versicolor;

(2) it misclassified 2 versicolor irises as virginica.

9. Naive Bayes

Naive Bayes is not a single algorithm, but a family of classification algorithms that share one common assumption: Every feature of the data being classified is independent of all other features given the class.

Given a set of objects, each of which belongs to a known class and each of which has a known vector of variables, our aim is to construct a rule which will allow us to assign future objects to a class, given only the vectors of variables describing those future objects. Problems of this kind, called problems of supervised classification, are ubiquitous, and many methods for constructing such rules have been developed. One very important one is the naive Bayes method, also called idiot's Bayes, simple Bayes, and independence Bayes. This method is important for several reasons: it is very easy to construct, not needing any complicated iterative parameter estimation schemes.
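To make the independence assumption concrete, here is a small hand-worked sketch, not from the original report, using a single feature; with more features, the per-feature class-conditional densities would simply be multiplied together.

# Class priors and the Gaussian class-conditional density of sepal length
# for one hypothetical new flower.
data(iris)
priors <- table(iris$Species) / nrow(iris)

new.sepal.length <- 5.0
dens <- sapply(levels(iris$Species), function(cl) {
  x <- iris$Sepal.Length[iris$Species == cl]
  dnorm(new.sepal.length, mean = mean(x), sd = sd(x))
})

# Posterior probabilities (normalized) based on this single feature.
round(priors * dens / sum(priors * dens), 3)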

library(e1071)
library(printr)

This code takes a sample of 100 rows from the iris dataset:

train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]

This code trains a model based on the training data:

model <- naiveBayes(x = subset(iris.train, select=-Species), y = iris.train$Species)

This code tests the model using the test data:

results <- predict(object = model, newdata = subset(iris.test, select = -Species), type = "class")

table(results, iris.test$Species)

results      setosa versicolor virginica
setosa           15          1         0
versicolor        0         14         2
virginica         0          3        15

As the table shows, Naive Bayes made three kinds of mistakes:

(1) misclassifying 1 versicolor iris as setosa;

(2) misclassifying 2 virginica irises as versicolor;

(3) misclassifying 3 versicolor irises as virginica.

10. CART

CART stands for classification and regression trees. It is a decision tree learning technique that outputs either classification or regression trees. Like C4.5, CART is a classifier.

Much of what applies to C4.5 also applies to CART, since they are both decision tree learning techniques. Things like ease of interpretation and explanation apply to CART as well. Like C4.5, CART is also quite fast, quite popular, and its output is human readable.

library(rpart)
library(printr)

This code takes a sample of 100 rows from the iris dataset:


train.indeces <- sample(1:nrow(iris), 100)
iris.train <- iris[train.indeces, ]
iris.test <- iris[-train.indeces, ]

This code trains a model based on the training data:

model <- rpart(Species ~ ., data = iris.train)

This code tests the model using the test data:

results <- predict(object = model, newdata = iris.test, type = "class")

table(results, iris.test$Species)

results      setosa versicolor virginica
setosa           13          0         0
versicolor        0         16         1
virginica         0          3        17

In this particular case, CART misclassified 1 virginica iris as versicolor and 3 versicolor irises as virginica.


It can be seen that CART also has some difficulty in distinguishing versicolor from virginica.
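One advantage mentioned above is that the output is human readable; as an optional extra, not in the original report, the fitted tree can be drawn directly, assuming the rpart model from above is still in the workspace:

# Draw the fitted classification tree and label its nodes.
plot(model, margin = 0.1)
text(model, use.n = TRUE)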

Reference

[1] Xindong Wu, Vipin Kumar, et al., Top 10 algorithms in data mining, Knowledge and Information Systems, 2009, Volume 14, Number 1, Page 1.

[2] Yanchang Zhao, R and Data Mining: Examples and Case Studies, Published by Elsevier, December 2012.

[3] More details about R: https://www.r-project.org/about.html

[4] More details about RStudio: https://www.rstudio.com/products/RStudio/

[5] http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-english/

[6] http://rayli.net/blog/data/top-10-data-mining-algorithms-in-plain-r/