CSI 5388: Topics in Machine Learning
Inductive Learning: A Review


Page 1: CSI 5388:Topics in Machine Learning

CSI 5388: Topics in Machine Learning

Inductive Learning: A Review

Page 2: CSI 5388:Topics in Machine Learning

Course Outline

• Overview
• Theory
• Version Spaces
• Decision Trees
• Neural Networks
• Bagging
• Boosting

Page 3: CSI 5388:Topics in Machine Learning

Inductive Learning: Overview

Different types of inductive learning:

• Supervised Learning: The program attempts to infer an association between the attributes and their class.
   Concept Learning
   Classification

• Unsupervised Learning: The program attempts to infer an association between attributes, but no class is assigned:
   Reinforcement learning
   Clustering
   Discovery

• Online vs. Batch Learning

We will focus on supervised learning in batch mode.

Page 4: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (1)

• Let X be the set of all examples.
• A concept C is a subset of X.
• A training set T is a subset of X such that some examples of T are elements of C (the positive examples) and some are not elements of C (the negative examples).

Page 5: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (2)

Learning: from {<x_i, y_i>}, i = 1..n, with x_i ∈ T and y_i ∈ Y (= {0,1}), infer f: X → Y, where
   y_i = 1 if x_i is positive (x_i ∈ C)
   y_i = 0 if x_i is negative (x_i ∉ C)

Goals of learning: f must be such that for all x_j ∈ X (not only T):
   f(x_j) = 1 if x_j ∈ C
   f(x_j) = 0 if x_j ∉ C

[Diagram: a learning system maps the training pairs <x_i, y_i> to the hypothesis f.]

(A toy illustration of this setup follows below.)
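As a concrete toy illustration of this setup (everything below is made up for the example, not part of the slides), the sketch defines an example space X, a target concept C, a small labelled training set T, and checks a candidate hypothesis f against the goal stated above:

```python
# Toy illustration of the inductive inference setup (assumed example, not from the slides).
X = set(range(20))                      # the set of all examples
C = {x for x in X if x % 2 == 0}        # the target concept: the even numbers

# A training set T: a few labelled examples <x_i, y_i>, with y_i = 1 iff x_i is in C
T = [(0, 1), (3, 0), (8, 1), (13, 0), (16, 1)]

def f(x):
    """A candidate hypothesis f: X -> {0, 1}."""
    return 1 if x % 2 == 0 else 0

# Goal of learning: f should agree with C on all of X, not only on T
consistent_on_T = all(f(x) == y for x, y in T)
correct_on_X = all(f(x) == (1 if x in C else 0) for x in X)
print(consistent_on_T, correct_on_X)    # True True
```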

Page 6: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (3)

Problem: The task of learning is not well formulated, because there exists an infinite number of functions that satisfy the goal. It is necessary to find a way to constrain the search space of f.

Definitions:
• The set of all f's that satisfy the goal is called the hypothesis space.
• The constraints on the hypothesis space are called the inductive bias.
• There are two types of inductive bias:
   The hypothesis space restriction bias
   The preference bias

Page 7: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (4)

Hypothesis space restriction bias: We restrict the language of the hypothesis space.

Examples:
• k-DNF: We restrict f to the set of Disjunctive Normal Form formulas having an arbitrary number of disjunctions, but at most k conjuncts in each conjunction.
• k-CNF: We restrict f to the set of Conjunctive Normal Form formulas having an arbitrary number of conjunctions, but at most k disjuncts in each disjunction.

Properties of this type of bias:
• Positive: Learning will be simplified (computationally).
• Negative: The language can exclude the "good" hypothesis.

Page 8: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (5)

Preference Bias: An ordering or measure that serves as the basis for a preference relation over the hypothesis space.

Examples:
• Occam's razor: We prefer a simple formula for f.
• Principle of minimum description length (an extension of Occam's razor): The best hypothesis is the one that minimises the total length of the hypothesis and of the description of the exceptions to this hypothesis.

Page 9: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (6)

How do we implement learning with these biases?

Hypothesis space restriction bias:
• Given:
   A set S of training examples
   A restricted set of hypotheses, H
• Find: a hypothesis f ∈ H that minimizes the number of incorrectly classified training examples of S. (A small exhaustive-search sketch follows below.)
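A minimal sketch of this "Given / Find" formulation, assuming a tiny made-up training set S and a deliberately restricted hypothesis space H (each hypothesis predicts the class from a single boolean attribute):

```python
# Illustrative sketch (not from the slides): learning under a hypothesis space
# restriction bias by exhaustively searching a small restricted space H.

# Training set S: pairs (x, y) with x a tuple of boolean attributes, y in {0, 1}
S = [((1, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 1), 0), ((0, 1, 0), 0)]

# Restricted hypothesis space H: "the class is the value of a single attribute"
H = [lambda x, i=i: x[i] for i in range(3)]

def training_error(f, S):
    """Number of training examples of S that f classifies incorrectly."""
    return sum(1 for x, y in S if f(x) != y)

# Find: a hypothesis f in H minimizing the number of misclassified examples
best = min(H, key=lambda f: training_error(f, S))
print(training_error(best, S))   # 0: attribute 0 predicts the class perfectly here
```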

Page 10: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (7)

Preference Bias:
• Given:
   A set S of training examples
   An order of preference better(f1, f2) over the functions of the hypothesis space H
• Find: the best hypothesis f ∈ H (using the "better" relation) that minimises the number of training examples of S incorrectly classified.

Search techniques:
• Heuristic search
• Hill climbing
• Simulated annealing and genetic algorithms

Page 11: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (8)

When can we trust our learning algorithm?
• Theoretical answer
• Experimental answer

Theoretical answer: PAC-Learning (Valiant, 1984)
PAC-Learning provides a bound on the number of examples needed (given a certain bias) to believe, with a certain confidence, that the result returned by the learning algorithm is approximately correct (similar in spirit to a t-test). This number of examples is called the sample complexity of the bias.

If the number of training examples exceeds the sample complexity, we can be confident in our results.

Page 12: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (9): PAC-Learning

• Let Pr(X) be the probability distribution with which the examples are drawn from X.
• Let f be a hypothesis from the hypothesis space.
• Let D be the set of all examples on which f and C differ.
• The error associated with f and the concept C is:
   Error(f) = Σ_{x ∈ D} Pr(x)
• f is approximately correct with exactitude ε iff: Error(f) ≤ ε
• f is probably approximately correct (PAC) with probability δ and exactitude ε if: Pr(Error(f) > ε) < δ

Page 13: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (10): PAC-Learning

Theorem: A program that returns any hypothesis consistent with the training examples is PAC if n, the number of training examples, is greater than ln(δ/|H|) / ln(1-ε), where |H| represents the number of hypotheses in H.

Examples (for error below ε = 0.1 with probability 0.9, i.e. δ = 0.1):
• For 100 hypotheses, roughly 70 examples are needed.
• For 1,000 hypotheses, roughly 90 are required.
• For 10,000 hypotheses, roughly 110 are required.
ln(δ/|H|) / ln(1-ε) grows slowly with |H|. That's good! (A quick numerical check follows below.)
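A quick numerical check of this bound, as a sketch that just evaluates the formula from the slide with ε = 0.1 and δ = 0.1 (the function name is ours):

```python
import math

def pac_sample_complexity(num_hypotheses, epsilon, delta):
    """Sample size n such that any consistent hypothesis is PAC,
    using the bound n > ln(delta / |H|) / ln(1 - epsilon)."""
    return math.ceil(math.log(delta / num_hypotheses) / math.log(1 - epsilon))

for H_size in (100, 1_000, 10_000):
    n = pac_sample_complexity(H_size, epsilon=0.1, delta=0.1)
    print(H_size, n)   # 66, 88, 110: in line with the 70 / 90 / 110 quoted above
```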

Page 14: CSI 5388:Topics in Machine Learning

Inductive Inference Theory (11)

When can we trust our learning algorithm?
• Theoretical answer
• Experimental answer

Experimental answer: error estimation
• Suppose you have access to 1000 examples of a concept f.
• Divide the data into 2 sets: one training set and one test set.
• Train the algorithm on the training set only.
• Evaluate the resulting hypothesis on the test set to obtain an estimate of its error. (A minimal sketch of this protocol follows below.)
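A minimal sketch of this train/test protocol, assuming scikit-learn and a synthetic 1000-example dataset standing in for the concept (the data generator, classifier, and split ratio are illustrative choices, not from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 1000 labelled examples standing in for the concept to be learned
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Divide the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Train on the training set only
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# The error on the held-out test set estimates the true error of the hypothesis
test_error = 1 - accuracy_score(y_test, clf.predict(X_test))
print(test_error)
```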

Page 15: CSI 5388:Topics in Machine Learning

A Taxonomy of Machine Learning Techniques: Highlight on Important Approaches

[Diagram: a taxonomy tree, approximately as follows]
• Supervised
   Linear: Linear Regression, Logistic Regression, Perceptron
   Nonlinear
      Single
         Easy to interpret: Decision Trees, Rule Learning, Naïve Bayes, k-Nearest Neighbours
         Hard to interpret: Multi-Layer Perceptron, SVM
      Combined: Bagging, Boosting, Random Forests
• Unsupervised: K-Means, EM, Self-Organizing Maps

Page 16: CSI 5388:Topics in Machine Learning

Version Spaces: Definitions

• Let C1 and C2 be two concepts represented by sets of examples. If C1 ⊆ C2, then C1 is a specialisation of C2 and C2 is a generalisation of C1.
• C1 is also said to be more specific than C2. Example: the set of all blue triangles is more specific than the set of all triangles.
• C1 is an immediate specialisation of C2 if there is no concept that is a specialisation of C2 and a generalisation of C1.
• A version space defines a graph whose nodes are concepts and whose arcs specify that a concept is an immediate specialisation of another one.

(See in-class example)

Page 17: CSI 5388:Topics in Machine Learning

Version Spaces: Overview (1)

• A version space has two limits: the general limit and the specific limit.
• The limits are modified after each addition of a training example.
• The initial general limit is simply (?, ?, ?); the initial specific limit contains all the leaves of the version space graph.
• When a positive example is added, the hypotheses of the specific limit are generalized until they are compatible with the example.
• When a negative example is added, the hypotheses of the general limit are specialised until they are no longer compatible with the example. (A sketch of these boundary updates follows below.)
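A rough sketch of these boundary updates for conjunctive hypotheses over three discrete attributes (the attributes, values, and examples are made up; the specific limit is initialized from a first positive example for brevity, rather than from all leaves):

```python
# Illustrative version-space boundary updates (not the course's code).
# A hypothesis is a tuple of attribute values, where '?' matches any value.

ATTRIBUTES = [("red", "blue"), ("triangle", "square"), ("small", "large")]

def covers(h, x):
    """True if hypothesis h matches example x."""
    return all(hv in ("?", xv) for hv, xv in zip(h, x))

def more_general(h1, h2):
    """True if h1 is at least as general as h2."""
    return all(a == "?" or a == b for a, b in zip(h1, h2))

def generalize(s, x):
    """Minimal generalization of specific-limit hypothesis s to cover positive example x."""
    return tuple(sv if sv == xv else "?" for sv, xv in zip(s, x))

def specialize(g, x):
    """Minimal specializations of general-limit hypothesis g that exclude negative example x."""
    return [g[:i] + (v,) + g[i + 1:]
            for i, values in enumerate(ATTRIBUTES) if g[i] == "?"
            for v in values if v != x[i]]

# Specific limit starts at a first positive example; general limit starts at (?, ?, ?)
S = ("red", "triangle", "small")
G = [("?", "?", "?")]

# Positive example: generalize the specific limit until it is compatible with it
S = generalize(S, ("red", "triangle", "large"))
print(S)                     # ('red', 'triangle', '?')

# Negative example: specialize the general-limit hypotheses that still cover it,
# keeping only specializations that remain more general than the specific limit
neg = ("blue", "triangle", "small")
new_G = []
for g in G:
    if not covers(g, neg):
        new_G.append(g)      # already excludes the negative example
    else:
        new_G.extend(h for h in specialize(g, neg) if more_general(h, S))
G = new_G
print(G)                     # [('red', '?', '?')]
```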

Page 18: CSI 5388:Topics in Machine Learning

Version Spaces: Overview (2)

If the specific limit and the general limit are maintained with the previous rules, then a concept is guaranteed to include all the positive examples and exclude all the negative examples if it falls between the limits.

[Diagram: the general limit (more general) above the specific limit (more specific); if f lies between them, it includes all positive examples and excludes all negative examples.]

(See in-class example)

Page 19: CSI 5388:Topics in Machine Learning

Decision Trees: Introduction

• The simplest form of learning is the memorization of all the training examples.
• Problem: Memorization is not useful for new examples. We need to find ways to generalize beyond the training examples.
• Possible solution: Instead of memorizing every attribute of every example, we can memorize only those that distinguish between positive and negative examples. That is what a decision tree does.
• Notice: The same set of examples can be represented by different trees. Occam's razor tells you to take the smallest tree.

Page 20: CSI 5388:Topics in Machine Learning

Supervised Learning: Example

Patient  Temperature  Cough  Sore Throat  Sinus Pain  Class
1        37           yes    no           no          no flu
2        39           no     yes          yes         flu
3        38.4         no     no           no          no flu
4        36.8         no     yes          no          no flu
5        38.5         yes    no           yes         flu
6        39.2         no     no           yes         flu

Goal: Learn how to predict whether a new patient with a given set of symptoms does or does not have the flu.

Page 21: CSI 5388:Topics in Machine Learning

Decision Trees: An Example

[Diagram: a decision tree for the flu concept. The root tests Temperature (Low / Medium / High); one branch leads directly to a No Flu leaf, while the others test Cough and Sore Throat (Yes / No), ending in Flu / No Flu leaves.]

Page 22: CSI 5388:Topics in Machine Learning

Decision Tree: Construction

Step 1: We choose an attribute A (= node 0) and split the examples by the value of this attribute. Each of these groups corresponds to a child of node 0.

Step 2: For each descendant of node 0, if the examples of this descendant are homogeneous (have the same class), we stop.

Step 3: If the examples of this descendant are not homogeneous, we call the procedure recursively on that descendant. (A small recursive sketch of this procedure follows below.)
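An illustrative recursive sketch of this procedure on the flu table from the earlier slide; discretizing Temperature into low / medium / high and picking the split attribute in a fixed order are simplifying assumptions for the example (attribute selection by information gain is discussed on the next slides):

```python
# Illustrative sketch (not the course's code) of the recursive tree-construction steps.
from collections import Counter

# The six patients from the flu table; temperature is discretized (an assumption).
examples = [
    ({"temp": "low",    "cough": "yes", "sore": "no",  "sinus": "no"},  "no flu"),
    ({"temp": "high",   "cough": "no",  "sore": "yes", "sinus": "yes"}, "flu"),
    ({"temp": "medium", "cough": "no",  "sore": "no",  "sinus": "no"},  "no flu"),
    ({"temp": "low",    "cough": "no",  "sore": "yes", "sinus": "no"},  "no flu"),
    ({"temp": "medium", "cough": "yes", "sore": "no",  "sinus": "yes"}, "flu"),
    ({"temp": "high",   "cough": "no",  "sore": "no",  "sinus": "yes"}, "flu"),
]

def build_tree(examples, attributes):
    classes = [c for _, c in examples]
    if len(set(classes)) == 1:                 # Step 2: homogeneous -> leaf
        return classes[0]
    if not attributes:                         # no attribute left -> majority leaf
        return Counter(classes).most_common(1)[0][0]
    a = attributes[0]                          # Step 1: choose an attribute
    # (here simply the first one; a real learner would use information gain)
    tree = {}
    for value in {x[a] for x, _ in examples}:  # split the examples by its values
        subset = [(x, c) for x, c in examples if x[a] == value]
        tree[value] = build_tree(subset, attributes[1:])   # Step 3: recurse
    return (a, tree)

print(build_tree(examples, ["temp", "cough", "sore", "sinus"]))
```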

Page 23: CSI 5388:Topics in Machine Learning

Construction of Decision Trees I

[Diagram: the 14 training examples D1 .. D14 grouped at the root node.]

What is the most informative attribute? Assume: Temperature.

Page 24: CSI 5388:Topics in Machine Learning

Construction of Decision Trees II

[Diagram: the examples D1 .. D14 split by Temperature into Low, Medium and High branches.]

What are the most informative attributes for the resulting branches? Assume: Cough and Sore Throat.

Page 25: CSI 5388:Topics in Machine Learning

Construction of Decision Trees III

• The informativeness of an attribute is an information-theoretic measure that identifies the attribute producing the purest children nodes.
• This is done by minimizing the entropy of the subsets that the attribute split generates.
• Entropy and information are linked in the following way: the more entropy there is in a set S, the more information is needed to guess correctly an element of this set.

Info[x, y] = Entropy[x, y] = - x/(x+y) log2 x/(x+y) - y/(x+y) log2 y/(x+y)

Page 26: CSI 5388:Topics in Machine Learning

Construction of Decision Trees IV

Splitting on Temperature (Low / Medium / High):
   Info[2,3] = .971 bits   Info[4,0] = 0 bits   Info[3,2] = .971 bits
   Avg tree info = 5/14 * .971 + 4/14 * 0 + 5/14 * .971 = .693
   Prior Info[9,5] = .940   Gain = .940 - .693 = .247

Splitting on Sore Throat (No / Yes):
   Info[2,6] = .811   Info[3,3] = 1
   Avg tree info = 8/14 * .811 + 6/14 * 1 = .892
   Prior Info[9,5] = .940   Gain = .940 - .892 = .048

(A numerical check of these figures follows below.)
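A numerical check of the information-gain computation above (a sketch; the class counts are read directly off the slide):

```python
from math import log2

def info(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def gain(parent_counts, children_counts):
    """Information gain of a split: prior entropy minus weighted child entropy."""
    total = sum(parent_counts)
    weighted = sum(sum(c) / total * info(c) for c in children_counts)
    return info(parent_counts) - weighted

# Temperature split: children with class counts [2,3], [4,0], [3,2]
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # 0.247
# Sore Throat split: children with class counts [2,6], [3,3]
print(round(gain([9, 5], [[2, 6], [3, 3]]), 3))           # 0.048
```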

Page 27: CSI 5388:Topics in Machine Learning

Decision Trees: Other Questions

• We have to find a way to deal with attributes that have continuous values, or discrete values drawn from a very large set.
• We have to find a way to deal with missing values.
• We have to find a way to deal with noise (errors) in the examples' classes and in the attribute values.

Page 28: CSI 5388:Topics in Machine Learning

Neural Networks: Introduction (I)

What is a neural network? It is a formalism inspired by biological systems, composed of units that perform simple mathematical operations in parallel.

Examples of simple mathematical operation units:
• Addition unit
• Multiplication unit
• Threshold unit (continuous, e.g. the sigmoid, or not)

Page 29: CSI 5388:Topics in Machine Learning

Multi-Layer Perceptrons: An Opaque Approach

[Diagram: examples feed a layer of input units, connected by weights to hidden units and then to output units; the architecture can be used for autoassociation or heteroassociation.]

Page 30: CSI 5388:Topics in Machine Learning

Representation in a Multi-Layer Perceptron

• h_j = g(Σ_i w_ji x_i)
• y_k = g(Σ_j w_kj h_j)
   where g(x) = 1/(1 + e^(-x)) is the sigmoid, rising from 0 through 1/2 towards 1.

[Diagram: input units x1 .. x6 (layer i), hidden units h1 .. h3 (layer j), an output unit y1 (layer k), with weights w_ji between the input and hidden layers and w_kj between the hidden and output layers; an inset shows the sigmoid g.]

Typically, y1 = 1 for a positive example and y1 = 0 for a negative example. (A forward-pass sketch follows below.)
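A numpy sketch of the forward pass just described; the layer sizes and random weights are illustrative assumptions:

```python
# Forward pass of the multi-layer perceptron described above (illustrative sizes/weights).
import numpy as np

def g(z):
    """Sigmoid activation g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.random(6)                # one example with 6 input units x_1 .. x_6
W_ji = rng.normal(size=(3, 6))   # weights from input layer i to hidden layer j
W_kj = rng.normal(size=(1, 3))   # weights from hidden layer j to output layer k

h = g(W_ji @ x)                  # h_j = g(sum_i w_ji x_i)
y = g(W_kj @ h)                  # y_k = g(sum_j w_kj h_j)
print(y)                         # output in (0, 1); e.g. > 0.5 read as "positive"
```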

Page 31: CSI 5388:Topics in Machine Learning

Neural Networks: Learning (I)

• The units are connected in order to create a network capable of computing complicated functions.
• Since the network has a sigmoid output, it implements a function f(x1, x2, x3, x4) whose output is in the range [0, 1].
• We are interested in neural networks capable of learning such a function.

Page 32: CSI 5388:Topics in Machine Learning

Learning in a Multi-Layer Perceptron I

Learning consists of searching through the space of all possible matrices of weight values for a combination of weights that satisfies a database of positive and negative examples (multi-class as well as regression problems are possible).

It is an optimization problem which tries to minimize the sum of squared errors:

E = 1/2 Σ_{n=1..N} Σ_{k=1..K} [y_k^n - f_k(x^n)]^2

where N is the total number of training examples, K is the total number of output units (useful for multi-class problems), and f_k is the function implemented by the neural net.

Page 33: CSI 5388:Topics in Machine Learning

Learning in a Multi-Layer Perceptron II

• The optimization problem is solved by searching the space of possible solutions by gradient descent.
• This consists of taking small steps in the direction that decreases the error, following the (negative) gradient (or derivative) of the error of the function we are trying to learn.
• When the gradient is zero, we have reached a local minimum that we hope is also the global minimum.

Page 34: CSI 5388:Topics in Machine Learning

Neural Networks: Learning (II)

• Notice that a neural network with a set of adjustable weights represents a restricted hypothesis space corresponding to a family of functions. The size of this space can be increased or decreased by changing the number of hidden units in the network.
• Learning is done by a hill-climbing approach called backpropagation, which is based on the paradigm of gradient search. (A gradient-descent sketch follows below.)
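A minimal gradient-descent (backpropagation) sketch for a one-hidden-layer sigmoid network minimizing the squared error E defined earlier; the data, target concept, learning rate, and layer sizes are assumptions for the illustration:

```python
# Backpropagation sketch: gradient descent on E = 1/2 * sum (y - f(x))^2
# for a one-hidden-layer sigmoid network (illustrative data and sizes).
import numpy as np

def g(z):                              # sigmoid
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((8, 4))                 # 8 training examples, 4 input units
y = (X[:, 0] > X[:, 1]).astype(float)  # a made-up target concept, y in {0, 1}

W1 = rng.normal(scale=0.5, size=(3, 4))   # input -> hidden weights (w_ji)
W2 = rng.normal(scale=0.5, size=(1, 3))   # hidden -> output weights (w_kj)
lr = 0.5

for epoch in range(2000):
    for x, t in zip(X, y):
        h = g(W1 @ x)                  # hidden activations
        o = g(W2 @ h)                  # network output f(x)
        delta_o = (o - t) * o * (1 - o)           # error gradient at the output unit
        delta_h = (W2.T @ delta_o) * h * (1 - h)  # backpropagated to the hidden units
        W2 -= lr * np.outer(delta_o, h)           # small step against the gradient
        W1 -= lr * np.outer(delta_h, x)

E = 0.5 * sum((t - g(W2 @ g(W1 @ x))[0]) ** 2 for x, t in zip(X, y))
print(E)                               # the squared error shrinks toward a (local) minimum
```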

Page 35: CSI 5388:Topics in Machine Learning

Neural Networks: Learning (III)

• The idea of gradient search is to take small steps in the direction that decreases the error, following the (negative) gradient (or derivative) of the error of the function we are trying to learn.
• When the gradient is zero, we have reached a local minimum that we hope is also the global minimum.

Page 36: CSI 5388:Topics in Machine Learning

Description of Two Classifier Combination Schemes:

- Bagging
- Boosting

Page 37: CSI 5388:Topics in Machine Learning

Combining Multiple Models

• The idea is the following: in order to make the outcome of automated classification more reliable, it may be a good idea to combine the decisions of several single classifiers through some sort of voting scheme.
• Bagging and Boosting are the two most widely used combination schemes, and they usually yield much improved results over those of single classifiers.
• One disadvantage of these multiple-model combinations is that, as in the case of neural networks, the learned model is hard, if not impossible, to interpret.

Page 38: CSI 5388:Topics in Machine Learning

Bagging: Bootstrap Aggregating

Idea: Perturb the composition of the data sets from which the classifier is trained. Learn a classifier from each different dataset. Let these classifiers vote. This procedure reduces the portion of the performance error caused by variance in the training set. (A sketch of the procedure follows below.)

[Diagram: the bagging algorithm]
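A rough sketch of the bagging idea under these assumptions (synthetic data, decision-tree base classifiers, 25 bootstrap replicates; none of this comes from the algorithm figure itself):

```python
# Bagging sketch: train one classifier per bootstrap sample, combine by majority vote.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
rng = np.random.default_rng(1)

classifiers = []
for _ in range(25):                               # 25 bootstrap replicates
    idx = rng.integers(0, len(X), size=len(X))    # sample the training set with replacement
    classifiers.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Majority vote of the individual classifiers
votes = np.stack([clf.predict(X) for clf in classifiers])
bagged_prediction = (votes.mean(axis=0) > 0.5).astype(int)
print((bagged_prediction == y).mean())            # training accuracy of the ensemble
```

scikit-learn's BaggingClassifier packages the same procedure if you prefer a library implementation.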

Page 39: CSI 5388:Topics in Machine Learning

Boosting

Idea: To build models that complement each other. A first classifier is built. Its errors are given a higher weight than its right answers, so that the next classifier being built focuses on these errors, and so on. (A sketch of the reweighting idea follows below.)

[Diagram: the boosting algorithm. At each round, decrease the weight of the right answers and increase the weight of the errors.]
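A sketch of this reweighting idea in the style of AdaBoost (an assumption, since the slide does not name a specific boosting algorithm); the data and the choice of decision stumps as base classifiers are also illustrative:

```python
# Boosting sketch: wrong answers get heavier weights, right answers get lighter ones,
# and each new classifier is trained on the reweighted data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=2)
y_pm = 2 * y - 1                      # labels in {-1, +1}
w = np.full(len(X), 1 / len(X))       # start with uniform example weights

models, alphas = [], []
for _ in range(10):                   # 10 boosting rounds
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = 2 * stump.predict(X) - 1
    err = np.clip(w[pred != y_pm].sum(), 1e-10, 1 - 1e-10)   # weighted error this round
    alpha = 0.5 * np.log((1 - err) / err)
    w *= np.exp(-alpha * y_pm * pred) # increase weight of errors, decrease right answers
    w /= w.sum()
    models.append(stump)
    alphas.append(alpha)

# Final prediction: weighted vote of the rounds' classifiers
score = sum(a * (2 * m.predict(X) - 1) for a, m in zip(alphas, models))
print((np.sign(score) == y_pm).mean())   # training accuracy of the boosted ensemble
```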