
Tutorial on “Orange: An Open Source Data Mining Package”

Prepared By:

Mr. KISHOJ BAJRACHARYA (ID No: 111224)
[email protected]

Department of Computer Science and Information Management
School of Engineering and Technology

Asian Institute of Technology

October 12, 2011

Contents

1 Orange
  1.1 Introduction
  1.2 Features of Orange
  1.3 Installing Orange-Canvas
    1.3.1 Installing on Windows
    1.3.2 Installing on Ubuntu
  1.4 Python Scripting
2 Python Scripting Code Examples
  2.1 Using Python Code
  2.2 Support, Confidence and Lift for Association Rule
  2.3 Naive Bayes Classifier
  2.4 Regression
  2.5 K-Means Clustering Algorithm
3 References


1 Orange

1.1 Introduction

Orange is a collection of Python-based modules that sit over a core library of C++ objects and routines that handle machine learning and data mining algorithms. It is an open source data mining package built on Python, wrapped C, C++ and Qt.

Orange widgets provide a graphical user interface to Orange's data mining and machine learning methods. They include widgets for data entry and pre-processing, data visualization, classification, regression, association rules and clustering, a set of widgets for model evaluation and visualization of evaluation results, and widgets for exporting the models into a decision support system. Orange widgets and Orange Canvas are all written in Python using the Qt graphical user interface library. This allows Orange to run on various platforms, including MS Windows and Linux.

1.2 Features of Orange

1. Open and free software: Orange is an open source and free data mining software tool.

2. Platform independent software: Orange is supported on various versions of Linux, Microsoft Windows, and Apple's Mac.

3. Programming support: Orange supports visual programming tools for data mining. Users can design a data analysis process via visual programming. Orange provides different visualizations such as bar diagrams, scatterplots, trees, networks, etc.

4. Scripting interface: Orange provides Python scripting. Programmers can test various new algorithms and data analyses using Python scripting.

5. Support for other components: Orange provides support for machine learning, bioinformatics, text mining, etc.

1.3 Installing Orange-Canvas

Orange-Canvas can be installed on any platform. Browse the URI http://orange.biolab.si/nightly_builds.html for more information on installing Orange-Canvas on different platforms. Here we focus on two platforms: Windows 7 and Ubuntu.

1.3.1 Installing on Windows

1. Browse the URI http://orange.biolab.si/nightly_builds.html.

2. Download “orange-win-w-python-snapshot-2011-09-11-py2.7.exe”.

3. Install the software by double-clicking the file “orange-win-w-python-snapshot-2011-09-11-py2.7.exe”.

The installation steps are shown in the figures below:


Fig 1: License Agreement

Fig 2: Completion of Installation

Fig 3: Locating Orange-Canvas


Fig 4: Orange-Canvas GUI

1.3.2 Installing on Ubuntu

1. Browse the URI http://orange.biolab.si/download/archive/.

2. Download the compressed file “orange-2.0-20101215svn.zip”.

3. Extract all the files from the file “orange-2.0-20101215svn.zip”.

unzip orange-2.0-20101215svn.zip

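4. Type the following commands in a Linux terminal from the extracted directory. The two install commands are alternatives: the first installs Orange system-wide, the second installs it for the current user only.

    python setup.py build

    sudo python setup.py install

    python setup.py install --user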

1.4 Python Scripting

In the Python scripting language, the module orange can be imported with the following code.

    # Import module orange for python scripting
    import orange
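Before running the later scripts, it can help to confirm that the interpreter can actually find the module. The following is a minimal sketch in plain Python 3, not part of Orange itself; the helper name `module_available` is a hypothetical name introduced only for this illustration.

```python
import importlib


def module_available(name):
    """Return True if the named module can be imported, False otherwise."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


if __name__ == "__main__":
    # "orange" is the module name used throughout this manual.
    if module_available("orange"):
        print("orange can be imported")
    else:
        print("orange is not installed for this interpreter")
```

If the check reports that the module is missing, the likely cause is that Orange was installed for a different Python interpreter than the one running the script.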


Fig 5(a): Using Python Scripting on Windows

Fig 5(b): Using Python Scripting on Ubuntu


2 Python Scripting Code Examples

Orange provides a scripting interface in the Python programming language. Programmers can test various new algorithms and data analyses using Python scripting.

2.1 Using Python Code

The following code in the file test.py is used to test Python scripting for Orange. It is a simple Python program that imports data from an external file and exercises the data access mechanism.

    # test.py
    # Importing Orange Library for python
    import orange

    # Importing data from the file named "test.tab"
    data = orange.ExampleTable("test")

    # Printing the attributes of the table
    print "Attributes:"
    print data.domain.attributes

    # List of attribute names
    attributeList = []

    # Printing the name of each attribute and collecting it in the list
    for i in data.domain.attributes:
        attributeList.append(i.name)
        print i.name
    # The class variable is added as the last entry of the list
    attributeList.append(data.domain.classVar.name)

    # Class Name
    print "Class:", data.domain.classVar.name

    # Display attributes
    print attributeList
    print

    # Displaying the data from the table (all rows)
    print "Data items:"
    for i in range(len(data)):
        print data[i]

The data table used by the above code is shown in Fig 6.

Fig 6: Data Table 1
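To make the attribute and class access above more concrete, here is a minimal sketch in plain Python 3 that reads a small tab-delimited table laid out in the style of Orange's .tab files: a line of names, a line of types, and a line of flags marking the class column. The sample contents and the helper name `read_tab` are assumptions made for this illustration; they are not the author's test.tab and not Orange's own loader.

```python
import csv
import io

# Hypothetical contents of a small Orange-style .tab file
# (illustrative only, not the table shown in Fig 6).
SAMPLE_TAB = (
    "outlook\ttemperature\tplay\n"
    "discrete\tcontinuous\tdiscrete\n"
    "\t\tclass\n"
    "sunny\t85\tno\n"
    "overcast\t83\tyes\n"
    "rainy\t70\tyes\n"
)


def read_tab(text):
    """Split a tab-delimited table into attribute names, class name and records."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    names, flags = rows[0], rows[2]          # rows[1] holds the column types
    class_name = names[flags.index("class")]
    attributes = [n for n, f in zip(names, flags) if f != "class"]
    records = [dict(zip(names, row)) for row in rows[3:]]
    return attributes, class_name, records


if __name__ == "__main__":
    attributes, class_name, records = read_tab(SAMPLE_TAB)
    print("Attributes:", attributes)
    print("Class:", class_name)
    for record in records:
        print(record)
```

The split between ordinary attributes and the class column mirrors what test.py reads through data.domain.attributes and data.domain.classVar.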


The output of the above program is shown in Fig 7.

Fig 7: Output of test.py

2.2 Support, Confidence and Lift for Association Rule

The following “example2.py” shows how to use a scripting language like Python to get the support, confidence and lift of all the possible association rules derived from the data in the imported file “association.tab”.

    # example2.py
    # Importing modules orange and orngAssoc
    import orange, orngAssoc

    # Importing data from a file named association.tab
    data = orange.ExampleTable("association")

    # Data preprocessing: discretize continuous attributes into 4 intervals
    data = orange.Preprocessor_discretize(data, method=orange.EquiNDiscretization(numberOfIntervals=4))

    # Data selection (range(2) keeps the first two attributes)
    data = data.select(range(2))

    # List of minimum supports to try
    iList = [0.1, 0.2, 0.3, 0.4]

    for x in iList:
        # Inducing association rules with Orange
        rules = orange.AssociationRulesInducer(data, support=x)

        # If there is no association rule
        if len(rules) == 0:
            print "No association rules for support = %5.3f" % (x)
        # If there exists an association rule
        else:
            print "%i rules with support = %5.3f found.\n" % (len(rules), x)
            orngAssoc.sort(rules, ["support", "confidence", "lift"])
            orngAssoc.printRules(rules, ["support", "confidence", "lift"])
            print

The output of the above program is shown in Fig 8.


Fig 8: Output of example2.py
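The three measures printed by example2.py can be checked by hand on a small example. The sketch below, in plain Python 3, uses the usual textbook definitions: support is the fraction of transactions containing both sides of the rule, confidence is that count divided by the number of transactions containing the left side, and lift is confidence divided by the fraction of transactions containing the right side. The transactions and the function name `rule_metrics` are hypothetical; this is not Orange's implementation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, b = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)         # contain the left side
    n_b = sum(1 for t in transactions if b <= t)         # contain the right side
    n_ab = sum(1 for t in transactions if (a | b) <= t)  # contain both sides
    if n_a == 0 or n_b == 0:
        raise ValueError("antecedent or consequent never occurs in the data")
    support = n_ab / n
    confidence = n_ab / n_a
    lift = confidence / (n_b / n)
    return support, confidence, lift


if __name__ == "__main__":
    # Hypothetical market-basket data, one set of items per transaction.
    baskets = [
        {"bread", "milk"},
        {"bread", "butter"},
        {"bread", "milk", "butter"},
        {"milk"},
        {"bread", "milk"},
    ]
    s, c, l = rule_metrics(baskets, ["bread"], ["milk"])
    print("support=%.3f confidence=%.3f lift=%.4f" % (s, c, l))
```

For the rule bread -> milk above, three of the five baskets contain both items, four contain bread and four contain milk, so support is 0.6, confidence is 0.75 and lift is 0.9375. A lift below 1 indicates that bread makes milk slightly less likely than its overall frequency would suggest.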

2.3 Naive Bayes Classifier

Using Python, we observe how the Bayesian classifier works on the voting data set, i.e. “voting.tab”, and use it to classify the first five instances from this data set.

    # classifier.py
    import orange

    # Load the voting data set and build a naive Bayes classifier from it
    data = orange.ExampleTable("voting")
    classifier = orange.BayesLearner(data)

    # Classify the first five instances and compare with the original class
    for i in range(5):
        c = classifier(data[i])
        print "%d: %s (originally %s)" % (i+1, c, data[i].getclass())

The script loads the data, uses it to construct a classifier using the naive Bayesian method, and then classifies the first five instances of the data set. Naive Bayes made a mistake on the third instance, but otherwise predicted correctly, as shown in the figure below.


Fig 9: Output of classifier.py
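To illustrate what a naive Bayesian classifier computes, the following plain Python 3 sketch multiplies the class prior by one conditional probability per attribute and picks the class with the largest product. It assumes categorical attributes and add-one (Laplace) smoothing; the tiny data set and the names `train_naive_bayes` and `predict` are hypothetical, and Orange's BayesLearner may estimate probabilities differently.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count classes and attribute values per class for categorical data."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values = defaultdict(set)            # attribute index -> set of seen values
    for row, label in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(j, label)][v] += 1
            values[j].add(v)
    return class_counts, value_counts, values


def predict(model, row):
    """Return the class maximizing prior * product of smoothed conditionals."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total
        for j, v in enumerate(row):
            # Add-one smoothing avoids zero probabilities for unseen values.
            score *= (value_counts[(j, label)][v] + 1) / (count + len(values[j]))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Hypothetical votes on two issues ("y"/"n") with the party as class.
    rows = [("y", "n"), ("y", "y"), ("n", "n"), ("n", "y"), ("y", "n")]
    labels = ["democrat", "democrat", "republican", "republican", "democrat"]
    model = train_naive_bayes(rows, labels)
    for row in rows[:3]:
        print(row, "->", predict(model, row))
```

As in classifier.py, the model here is evaluated on instances it was trained on, so the predictions only show how the classifier is applied and do not measure its accuracy on new data.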

2.4 Regression

The following example uses both regression trees and k-nearest neighbors, and also uses a majority learner, which for regression simply returns the average value from the learning data set.

    # regression2.py
    import orange, orngTree

    # Split the housing data randomly into two halves: training and test
    data = orange.ExampleTable("housing.tab")
    selection = orange.MakeRandomIndices2(data, 0.5)
    train_data = data.select(selection, 0)
    test_data = data.select(selection, 1)

    # Baseline: always predicts the average of the training data
    maj = orange.MajorityLearner(train_data)
    maj.name = "default"

    # Regression tree
    rt = orngTree.TreeLearner(train_data, measure="retis", mForPruning=2, minExamples=20)
    rt.name = "reg. tree"

    # k-nearest neighbors
    k = 5
    knn = orange.kNNLearner(train_data, k=k)
    knn.name = "k-NN (k=%i)" % k

    regressors = [maj, rt, knn]

    # Header row: original value followed by the name of each regressor
    print "\n%10s " % "original",
    for r in regressors:
        print "%10s " % r.name,
    print

    # Predictions for the first ten test instances
    for i in range(10):
        print "%10.1f " % test_data[i].getclass(),
        for r in regressors:
            print "%10.1f " % r(test_data[i]),
        print

The output of the above program is shown in Fig 10.

Fig 10: Output of regression2.py
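The baseline and the k-nearest-neighbors regressor used above are simple enough to sketch by hand. The plain Python 3 code below assumes Euclidean distance and an unweighted average of the k closest training targets; the data and the names `mean_regressor` and `knn_regressor` are hypothetical, and Orange's kNNLearner may weight neighbors differently.

```python
import math


def mean_regressor(train_y):
    """Baseline: always predict the average of the training targets."""
    average = sum(train_y) / len(train_y)
    return lambda x: average


def knn_regressor(train_x, train_y, k):
    """Predict the unweighted mean target of the k nearest training points."""
    def distance(p, q):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

    def predict(x):
        ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], x))
        nearest = ranked[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


if __name__ == "__main__":
    # Hypothetical one-attribute training data.
    train_x = [(1.0,), (2.0,), (3.0,), (10.0,)]
    train_y = [10.0, 20.0, 30.0, 100.0]
    baseline = mean_regressor(train_y)
    knn = knn_regressor(train_x, train_y, k=2)
    query = (2.4,)
    print("baseline: %.1f  k-NN: %.1f" % (baseline(query), knn(query)))
```

For the query 2.4 the two nearest training points are 2.0 and 3.0, so the k-NN prediction is the mean of 20.0 and 30.0, while the baseline returns 40.0 for every query. Comparing a learner against such a baseline, as regression2.py does, shows whether the learner has picked up anything beyond the overall average.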


2.5 K-Means Clustering Algorithm

Let us use Python to implement the K-means clustering algorithm for the problem solved in class, i.e. K = 2 and array = [1,2,3,4,8,9,10,11].

    # test3.py
    import math

    # Given array of elements that need to be clustered
    iArray = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0, 10.0, 11.0]

    # Returns the mean of the elements of an array
    def meanArray(aArray):
        icount = len(aArray)
        iSum = 0
        for x in aArray:
            iSum = iSum + x
        return (iSum/icount)

    Count = len(iArray)

    # Initial centroids: the last two elements of the array
    c1 = iArray[Count-2]
    c2 = iArray[Count-1]

    # Class1 starts with a dummy value so that it differs from oldClass1
    # and the loop below is entered
    Class1 = [0.0]
    Class2 = []
    oldClass1 = []
    i = 1

    # Loop until the members of Class1 no longer change
    while (oldClass1 != Class1):
        print "Iteration: " + str(i)
        oldClass1 = Class1
        Class1 = []
        Class2 = []
        # Assign each element to the class with the nearest centroid
        for x in iArray:
            if math.fabs(c1 - x) < math.fabs(c2 - x):
                Class1.append(x)
            else:
                Class2.append(x)
        # Recompute the centroids as the class means
        print "Class1: " + str(Class1)
        c1 = round(meanArray(Class1),1)
        print "c1 = " + str(c1)

        print "Class2: " + str(Class2)
        c2 = round(meanArray(Class2),1)
        print "c2 = " + str(c2)
        print
        i = i + 1


Fig 11: Output of K-Means Clustering Algorithm
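The same procedure can be written as a reusable function, which makes the two alternating steps (assign each value to the nearest centroid, then recompute each centroid as a mean) easier to see. This is a minimal sketch in plain Python 3 under the same starting centroids as test3.py; the function name `kmeans_1d` is hypothetical. Unlike test3.py it does not round the centroids, and on an exact tie it assigns the value to the first cluster rather than the second.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Cluster one-dimensional values around the given initial centroids."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every value.
        new_assignment = [
            min(range(len(centroids)), key=lambda j: abs(v - centroids[j]))
            for v in values
        ]
        if new_assignment == assignment:
            break  # no value changed cluster, so the algorithm has converged
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(values, assignment) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    clusters = [
        [v for v, a in zip(values, assignment) if a == j]
        for j in range(len(centroids))
    ]
    return clusters, centroids


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0, 10.0, 11.0]
    clusters, centroids = kmeans_1d(data, [10.0, 11.0])
    print("clusters:", clusters)
    print("centroids:", centroids)
```

Starting from centroids 10 and 11, the first pass puts only 11 in the second cluster; over the following passes the boundary moves left until the clusters settle at [1, 2, 3, 4] and [8, 9, 10, 11] with centroids 2.5 and 9.5.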

Using Orange, we can easily implement the K-Means clustering algorithm and plot a graph using the following code.

    import orange
    import orngClustering
    import pylab

    # To plot the 2D points
    def plot_scatter(data, km, attx, atty, filename="kmeans-scatter", title=None):
        """plot a data scatter plot with the position of centroids"""
        pylab.rcParams.update({'font.size': 8, 'figure.figsize': [4,3]})

        # For the points
        x = [float(d[attx]) for d in data]
        y = [float(d[atty]) for d in data]
        colors = ["c", "b"]
        cs = "".join([colors[c] for c in km.clusters])
        pylab.scatter(x, y, c=cs, s=10)

        # For the centroid points
        xc = [float(d[attx]) for d in km.centroids]
        yc = [float(d[atty]) for d in km.centroids]
        pylab.scatter(xc, yc, marker="x", c="k", s=200)

        pylab.xlabel(attx)
        pylab.ylabel(atty)
        if title:
            pylab.title(title)
        pylab.savefig("%s-%03d.png" % (filename, km.iteration))
        pylab.close()

    # Called after each iteration: report progress and save a plot
    def in_callback(km):
        print "Iteration: %d, changes: %d" % (km.iteration, km.nchanges)
        plot_scatter(data, km, "X", "Y", title="Iteration %d" % km.iteration)

    # Read the data from table
    data = orange.ExampleTable("data")
    km = orngClustering.KMeans(data, 2, minscorechange=-1, maxiters=10, inner_callback=in_callback)


The output of this program is shown below: Result of test.py

Fig 12(a): During iteration 0

Fig 12(b): During iteration 1


3 References

The following references were used in preparing this manual:

1. http://en.wikipedia.org/wiki/Orange_(software)

2. http://orange.biolab.si/

3. http://orange.biolab.si/nightly_builds.html

4. http://orange.biolab.si/doc/ofb-rst/genindex.html
