
Inductive Learning (1/2): Decision Tree Method. Russell and Norvig: Chapter 18, Sections 18.1 through 18.3. CS121 – Winter 2003


Inductive Learning (1/2)

Decision Tree Method

Russell and Norvig: Chapter 18, Sections 18.1 through 18.3

CS121 – Winter 2003

Quotes

“Our experience of the world is specific, yet we are able to formulate general theories that account for the past and predict the future.” (Genesereth and Nilsson, Logical Foundations of AI, 1987)

“Entities are not to be multiplied without necessity.” (Ockham, 1285–1349)

Learning Agent

[Diagram: a learning agent interacting with its environment through sensors (percepts) and actuators (actions); its components include a problem solver, a learning element, a critic, and a knowledge base (KB).]

Contents

Introduction to inductive learning
Logic-based inductive learning:
  Decision tree method
  Version space method

Function-based inductive learning:
  Neural nets

Contents

Introduction to inductive learning
Logic-based inductive learning:
  Decision tree method
  Version space method
  + why inductive learning works

Function-based inductive learning:
  Neural nets


Inductive Learning Frameworks

1. Function-learning formulation

2. Logic-inference formulation

Function-Learning Formulation

Goal function f
Training set: (xi, f(xi)), i = 1, …, n

Inductive inference: Find a function h that fits the points well

Example: neural nets

[Plot: training points (x, f(x)) and a candidate function h fitting them.]
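The fitting step can be sketched in code. This is illustrative only: the slides do not prescribe a hypothesis class, so here h is restricted to lines h(x) = a·x + b and fitted by least squares, with made-up training points.

```python
# Function-learning formulation, sketched: given pairs (x_i, f(x_i)),
# find a function h that fits the points well. Here h is a line
# fitted by ordinary least squares (pure Python, no libraries).

def fit_line(points):
    """Least-squares line through a list of (x, y) pairs."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

# Hypothetical training set sampled from f(x) = 2x + 1 (no noise,
# so the learned h happens to recover f exactly)
training_set = [(0, 1), (1, 3), (2, 5), (3, 7)]
h = fit_line(training_set)
```

With noisy points the same code returns the best-fitting line rather than f itself, which is the usual situation in practice.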

Logic-Inference Formulation

Background knowledge KB
Training set D (observed knowledge) such that KB ⊭ D and {KB, D} is satisfiable
Inductive inference: Find h (inductive hypothesis) such that {KB, h} is satisfiable and KB, h ⊨ D

Usually, not a sound inference

h = D is a trivial, but uninteresting solution (data caching)

Rewarded Card Example

Deck of cards, with each card designated by [r,s], its rank and suit, and some cards “rewarded”
Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇔ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇔ FACE(r)
((s=S) ∨ (s=C)) ⇔ BLACK(s)
((s=D) ∨ (s=H)) ⇔ RED(s)

Training set D:
REWARD([4,C])  REWARD([7,C])  REWARD([2,S])
¬REWARD([5,H])  ¬REWARD([J,S])

Rewarded Card Example

Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇔ NUM(r)
((r=J) ∨ (r=Q) ∨ (r=K)) ⇔ FACE(r)
((s=S) ∨ (s=C)) ⇔ BLACK(s)
((s=D) ∨ (s=H)) ⇔ RED(s)

Training set D:
REWARD([4,C])  REWARD([7,C])  REWARD([2,S])
¬REWARD([5,H])  ¬REWARD([J,S])

Possible inductive hypothesis:
h ≡ (NUM(r) ∧ BLACK(s) ⇒ REWARD([r,s]))

There are several possible inductive hypotheses

Learning a Predicate

Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)

Example: CONCEPT describes the precondition of an action, e.g., Unstack(C,A)
• E is the set of states
• CONCEPT(x) ⇔ HANDEMPTY ∈ x, BLOCK(C) ∈ x, BLOCK(A) ∈ x, CLEAR(C) ∈ x, ON(C,A) ∈ x
Learning CONCEPT is a step toward learning the action

Learning a Predicate

Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)
Observable predicates A(x), B(x), … (e.g., NUM, RED)
Training set: values of CONCEPT for some combinations of values of the observable predicates

A Possible Training Set

Ex. #  A      B      C      D      E      CONCEPT
1      True   True   False  True   False  False
2      True   False  False  False  False  True
3      False  False  True   True   True   False
4      True   True   True   False  True   True
5      False  True   True   False  False  False
6      True   True   False  True   True   False
7      False  False  True   False  True   False
8      True   False  True   False  True   True
9      False  False  False  True   True   False
10     True   True   True   True   False  True

Note that the training set does not say whether an observable predicate A, …, E is pertinent or not

Learning a Predicate

Set E of objects (e.g., cards)
Goal predicate CONCEPT(x), where x is an object in E, that takes the value True or False (e.g., REWARD)
Observable predicates A(x), B(x), … (e.g., NUM, RED)
Training set: values of CONCEPT for some combinations of values of the observable predicates
Find a representation of CONCEPT in the form: CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates, e.g.: CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x))

Learning the Concept of an Arch

[Figure: block-world objects; some are arches (positive examples), the others are not (negative examples).]

ARCH(x) ⇔ HAS-PART(x,b1) ∧ HAS-PART(x,b2) ∧ HAS-PART(x,b3) ∧ IS-A(b1,BRICK) ∧ IS-A(b2,BRICK) ∧ ¬MEET(b1,b2) ∧ (IS-A(b3,BRICK) ∨ IS-A(b3,WEDGE)) ∧ SUPPORTED(b3,b1) ∧ SUPPORTED(b3,b2)

Example Set

An example consists of the values of CONCEPT and the observable predicates for some object x
An example is positive if CONCEPT is True; otherwise it is negative
The set E of all examples is the example set
The training set is a subset of E (a small one!)

An hypothesis is any sentence h of the form: CONCEPT(x) ⇔ S(A,B,…), where S(A,B,…) is a sentence built with the observable predicates
The set of all hypotheses is called the hypothesis space H
An hypothesis h agrees with an example if it gives the correct value of CONCEPT

Hypothesis Space

It is called a space because it has some internal structure

[Figure: the example set drawn as a cloud of positive (+) and negative (−) examples.]

Inductive Learning Scheme

[Diagram: a training set, drawn from the example set X = {[A, B, …, CONCEPT]}, is given to the learner, which outputs an inductive hypothesis h from the hypothesis space H = {[CONCEPT(x) ⇔ S(A,B,…)]}.]

Size of Hypothesis Space

n observable predicates ⇒ 2^n entries in the truth table
In the absence of any restriction (bias), there are 2^(2^n) hypotheses to choose from
n = 6 ⇒ ≈ 2 × 10^19 hypotheses!
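These counts are easy to check directly:

```python
# n observable predicates give 2**n truth-table entries; an unrestricted
# hypothesis assigns True/False independently to each entry, so there
# are 2**(2**n) hypotheses.
n = 6
entries = 2 ** n           # 64 entries in the truth table
hypotheses = 2 ** entries  # 2**64, about 1.8 * 10**19
```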

Rewarded Card Example

Background knowledge KB:
((r=1) ∨ … ∨ (r=10)) ⇔ NUM([r,s])
((r=J) ∨ (r=Q) ∨ (r=K)) ⇔ FACE([r,s])
((s=S) ∨ (s=C)) ⇔ BLACK([r,s])
((s=D) ∨ (s=H)) ⇔ RED([r,s])

Training set D:
REWARD([4,C])  REWARD([7,C])  REWARD([2,S])
¬REWARD([5,H])  ¬REWARD([J,S])

Possible inductive hypothesis:
h ≡ (NUM(x) ∧ BLACK(x) ⇒ REWARD(x))

h1 ≡ NUM(x) ∧ BLACK(x) ⇒ REWARD(x)
h2 ≡ BLACK([r,s]) ∧ ¬(r=J) ⇒ REWARD([r,s])
h3 ≡ ([r,s]=[4,C]) ∨ ([r,s]=[7,C]) ∨ ([r,s]=[2,S]) ⇒ REWARD([r,s])
h4 ≡ ¬([r,s]=[5,H]) ∧ ¬([r,s]=[J,S]) ⇒ REWARD([r,s])

All of these agree with all the examples in the training set
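Agreement with the training set can be checked mechanically. A sketch (the card encoding is mine, and the hypotheses are treated as definitions of REWARD for the purpose of the check):

```python
# Hypotheses h1 and h2 from the slides, checked against the five
# training examples. Cards are (rank, suit) pairs; rank is an int
# 1..10 or 'J'/'Q'/'K'; suits are 'S', 'C', 'D', 'H'.
def num(card):   return card[0] in range(1, 11)
def black(card): return card[1] in ('S', 'C')

# Training set: (card, REWARD) pairs
training = [((4, 'C'), True), ((7, 'C'), True), ((2, 'S'), True),
            ((5, 'H'), False), (('J', 'S'), False)]

h1 = lambda card: num(card) and black(card)       # numbered and black
h2 = lambda card: black(card) and card[0] != 'J'  # black and not a Jack

# Both hypotheses classify every training example correctly
assert all(h(card) == reward for h in (h1, h2) for card, reward in training)
```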

Multiple Inductive Hypotheses

Need for a system of preferences – called a bias – to compare possible hypotheses

Keep-It-Simple (KIS) Bias

Motivation
• If an hypothesis is too complex, it may not be worth learning it (data caching might just do the job as well)
• There are far fewer simple hypotheses than complex ones, hence the hypothesis space is smaller

Examples:
• Use far fewer observable predicates than suggested by the training set
• Constrain the learnt predicate, e.g., to use only “high-level” observable predicates such as NUM, FACE, BLACK, and RED, and/or to have simple syntax (e.g., conjunction of literals)

If the bias allows only sentences S that are conjunctions of k ≪ n predicates picked from the n observable predicates, then the size of H is O(n^k)
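The count behind the O(n^k) claim can be written out (a sketch; whether literals may appear negated is my assumption and only changes the constant 2^k factor, not the growth rate):

```python
# Under the conjunctive bias, a sentence is a conjunction of k literals
# chosen from n predicates, each possibly negated: C(n, k) * 2**k
# sentences, which grows like n**k for fixed k.
from math import comb

def num_conjunctions(n, k):
    """Number of conjunctions of k literals over n predicates."""
    return comb(n, k) * 2 ** k

# Compare with 2**(2**n) when there is no bias at all
assert num_conjunctions(6, 2) == 60
assert num_conjunctions(6, 2) < 2 ** (2 ** 6)
```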

Putting Things Together

[Diagram: the object set, goal predicate, and observable predicates determine the example set X. A training set drawn from X is given, together with a bias that restricts the hypothesis space H, to a learning procedure L, which outputs an induced hypothesis h. The hypothesis is evaluated on a test set; if the evaluation fails (no), the loop repeats; otherwise (yes) learning stops.]

Predicate-Learning Methods

• Decision tree
• Version space

Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the following decision tree:

[Decision tree: test A?; if A is False, return False; if A is True, test B?; if B is False, return True; if B is True, test C? and return its value.]

Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED

Predicate as a Decision Tree

The predicate CONCEPT(x) ⇔ A(x) ∧ (¬B(x) ∨ C(x)) can be represented by the same decision tree:

Example: A mushroom is poisonous iff it is yellow and small, or yellow, big and spotted
• x is a mushroom
• CONCEPT = POISONOUS
• A = YELLOW
• B = BIG
• C = SPOTTED
• D = FUNNEL-CAP
• E = BULKY
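That a tree and a formula represent the same predicate can be verified by brute force over all truth assignments. A sketch, with the predicate letters of the slides:

```python
from itertools import product

# CONCEPT as the decision tree: test A, then B, then C
def concept_tree(a, b, c):
    if not a:     # not yellow: not poisonous
        return False
    if not b:     # yellow and small: poisonous
        return True
    return c      # yellow and big: poisonous iff spotted

# CONCEPT as the formula A ∧ (¬B ∨ C)
def concept_formula(a, b, c):
    return a and (not b or c)

# The two representations agree on all 8 truth assignments
assert all(concept_tree(*v) == concept_formula(*v)
           for v in product([False, True], repeat=3))
```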

Training Set

Ex. #  A      B      C      D      E      CONCEPT
1      False  False  True   False  True   False
2      False  True   False  False  False  False
3      False  True   True   True   True   False
4      False  False  True   False  False  False
5      False  False  False  True   True   False
6      True   False  True   False  False  True
7      True   False  False  True   False  True
8      True   False  True   False  True   True
9      True   True   True   False  True   True
10     True   True   True   True   True   True
11     True   True   False  False  False  False
12     True   True   False  False  True   False
13     True   False  True   True   True   True


Possible Decision Tree

[Decision tree with root D; internal nodes test C, E, B, E, A, A, A; leaves are True/False.]

Possible Decision Tree

[Same tree as above.]

CONCEPT ⇔ (D ∧ (E ∨ A)) ∨ (C ∧ (B ∨ ((E ∧ A) ∨ A)))

[By contrast, the small tree seen earlier: test A?; if A is False, return False; if A is True, test B?; if B is False, return True; if B is True, test C?.]

CONCEPT ⇔ A ∧ (¬B ∨ C)

KIS bias ⇒ Build smallest decision tree

Computationally intractable problem ⇒ greedy algorithm

Getting Started

The distribution of the training set is:
True: 6, 7, 8, 9, 10, 13
False: 1, 2, 3, 4, 5, 11, 12



Without testing any observable predicate, we could report that CONCEPT is False (majority rule) with an estimated probability of error P(E) = 6/13


Assuming that we will only include one observable predicate in the decision tree, which predicate should we test to minimize the probability of error?

Assume It’s A

If A is True: CONCEPT True for 6, 7, 8, 9, 10, 13; False for 11, 12
If A is False: CONCEPT False for 1, 2, 3, 4, 5

If we test only A, we will report that CONCEPT is True if A is True (majority rule) and False otherwise

The estimated probability of error is:
Pr(E) = (8/13)×(2/8) + (5/13)×0 = 2/13

Assume It’s B

If B is True: CONCEPT True for 9, 10; False for 2, 3, 11, 12
If B is False: CONCEPT True for 6, 7, 8, 13; False for 1, 4, 5

If we test only B, we will report that CONCEPT is False if B is True and True otherwise

The estimated probability of error is:
Pr(E) = (6/13)×(2/6) + (7/13)×(3/7) = 5/13

Assume It’s C

If C is True: CONCEPT True for 6, 8, 9, 10, 13; False for 1, 3, 4
If C is False: CONCEPT True for 7; False for 2, 5, 11, 12

If we test only C, we will report that CONCEPT is True if C is True and False otherwise

The estimated probability of error is:
Pr(E) = (8/13)×(3/8) + (5/13)×(1/5) = 4/13

Assume It’s D

If D is True: CONCEPT True for 7, 10, 13; False for 3, 5
If D is False: CONCEPT True for 6, 8, 9; False for 1, 2, 4, 11, 12

If we test only D, we will report that CONCEPT is True if D is True and False otherwise

The estimated probability of error is:
Pr(E) = (5/13)×(2/5) + (8/13)×(3/8) = 5/13

Assume It’s E

If E is True: CONCEPT True for 8, 9, 10, 13; False for 1, 3, 5, 12
If E is False: CONCEPT True for 6, 7; False for 2, 4, 11

If we test only E, we will report that CONCEPT is False, independent of the outcome

The estimated probability of error is unchanged:
Pr(E) = (8/13)×(4/8) + (5/13)×(2/5) = 6/13

So, the best predicate to test is A
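The five computations above can be reproduced directly from the training-set table, a sketch:

```python
# For each predicate, split the training set on its value, apply the
# majority rule in each branch, and count the misclassified examples.
from fractions import Fraction

T, F = True, False
# Training set rows: (A, B, C, D, E, CONCEPT) for examples 1..13
rows = [
    (F, F, T, F, T, F), (F, T, F, F, F, F), (F, T, T, T, T, F),
    (F, F, T, F, F, F), (F, F, F, T, T, F), (T, F, T, F, F, T),
    (T, F, F, T, F, T), (T, F, T, F, T, T), (T, T, T, F, T, T),
    (T, T, T, T, T, T), (T, T, F, F, F, F), (T, T, F, F, T, F),
    (T, F, T, T, T, T),
]

def majority_error(i):
    """Pr(E) if we test only predicate i and apply the majority rule."""
    errors = 0
    for value in (True, False):
        branch = [r[5] for r in rows if r[i] == value]
        errors += min(branch.count(True), branch.count(False))
    return Fraction(errors, len(rows))

errors = {name: majority_error(i) for i, name in enumerate("ABCDE")}
# A gives the smallest probability of error, 2/13
assert min(errors, key=errors.get) == "A"
assert errors["A"] == Fraction(2, 13)
```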

Choice of Second Predicate

[On the A-True branch, test C; the A-False branch reports False:]
If C is True: CONCEPT True for 6, 8, 9, 10, 13
If C is False: CONCEPT True for 7; False for 11, 12

The majority rule gives the probability of error Pr(E|A) = 1/8 and Pr(E) = 1/13

Choice of Third Predicate

[On the branch where A is True and C is False, test B:]
If B is True: CONCEPT False for 11, 12
If B is False: CONCEPT True for 7

Final Tree

[Decision tree: test A?; if A is False, return False; if A is True, test C?; if C is True, return True; if C is False, test B?; if B is True, return False; if B is False, return True.]

CONCEPT ⇔ A ∧ (C ∨ ¬B)

Learning a Decision Tree

DTL(Δ, Predicates)
1. If all examples in Δ are positive, then return True
2. If all examples in Δ are negative, then return False
3. If Predicates is empty, then return failure
4. A ← most discriminating predicate in Predicates
5. Return the tree whose:
   - root is A,
   - left branch is DTL(Δ+A, Predicates − A),
   - right branch is DTL(Δ−A, Predicates − A)

Δ+A is the subset of examples in Δ that satisfy A; Δ−A is the subset that do not

With noise in the training set, step 3 may return the majority rule instead of failure
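A direct Python rendering of DTL, as a sketch: "most discriminating" is implemented here with the probability-of-error criterion from the preceding slides (the pseudocode itself leaves the criterion open), and trees are encoded as nested dicts.

```python
T, F = True, False
# Training set rows: (A, B, C, D, E, CONCEPT) for examples 1..13
rows = [
    (F, F, T, F, T, F), (F, T, F, F, F, F), (F, T, T, T, T, F),
    (F, F, T, F, F, F), (F, F, F, T, T, F), (T, F, T, F, F, T),
    (T, F, F, T, F, T), (T, F, T, F, T, T), (T, T, T, F, T, T),
    (T, T, T, T, T, T), (T, T, F, F, F, F), (T, T, F, F, T, F),
    (T, F, T, T, T, T),
]
examples = [(dict(zip("ABCDE", r[:5])), r[5]) for r in rows]

def dtl(examples, predicates):
    """Returns True/False, None (failure), or {pred: {True: ..., False: ...}}."""
    labels = [concept for _, concept in examples]
    if all(labels):
        return True
    if not any(labels):
        return False
    if not predicates:
        return None  # failure (with noisy data, return the majority label instead)
    def errors(p):  # majority-rule errors after splitting on p
        return sum(min(branch.count(True), branch.count(False))
                   for value in (True, False)
                   for branch in [[c for a, c in examples if a[p] == value]])
    best = min(predicates, key=errors)  # most discriminating predicate
    rest = [p for p in predicates if p != best]
    pos = [(a, c) for a, c in examples if a[best]]      # Δ+A
    neg = [(a, c) for a, c in examples if not a[best]]  # Δ−A
    return {best: {True: dtl(pos, rest), False: dtl(neg, rest)}}

tree = dtl(examples, list("ABCDE"))
# Reproduces the final tree from the slides: A, then C, then B
assert tree == {"A": {True: {"C": {True: True,
                                   False: {"B": {True: False, False: True}}}},
                      False: False}}
```

On this training set the greedy choice at each level matches the hand computation: A first (error 2/13), then C on the A-True branch, then B.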

Using Information Theory

Rather than minimizing the probability of error, most existing learning procedures try to minimize the expected number of questions needed to decide if an object x satisfies CONCEPT. This minimization is based on a measure of the “quantity of information” that is contained in the truth value of an observable predicate.
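The slides stop short of the formula; the usual measure is entropy, and the selection criterion is information gain. A sketch following the standard definitions:

```python
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of True/False labels."""
    h = 0.0
    for value in (True, False):
        p = labels.count(value) / len(labels)
        if p > 0:
            h -= p * log2(p)
    return h

def information_gain(examples, pred):
    """Expected drop in entropy of CONCEPT from testing predicate pred.

    examples: list of (assignment dict, concept bool) pairs.
    """
    labels = [concept for _, concept in examples]
    gain = entropy(labels)
    for value in (True, False):
        branch = [c for a, c in examples if a[pred] == value]
        if branch:
            gain -= len(branch) / len(labels) * entropy(branch)
    return gain

# A perfectly discriminating predicate gains one full bit on balanced data
examples = [({"A": True}, True), ({"A": False}, False)]
assert abs(information_gain(examples, "A") - 1.0) < 1e-9
```

Picking the predicate with the highest information gain plays the same role as picking the one with the lowest probability of error in the greedy construction above, and often makes the same choices.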

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

[Plot: typical learning curve, showing % correct on the test set (approaching 100) versus size of the training set.]

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

Overfitting: risk of using irrelevant observable predicates to generate an hypothesis that agrees with all examples in the training set
Tree pruning: terminate recursion when information gain is too small; the resulting decision tree + majority rule may not classify correctly all examples in the training set

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

Overfitting: tree pruning, cross-validation
Missing data: the value of an observable predicate P is unknown for an example x. Then construct a decision tree for both values of P and select the value that ends up classifying x in the largest class

For a continuous attribute, select the threshold that maximizes information gain

Miscellaneous Issues

Assessing performance:
• Training set and test set
• Learning curve

Overfitting: tree pruning, cross-validation
Missing data
Multi-valued and continuous attributes

These issues occur with virtually any learning method

Multi-Valued Attributes

WillWait predicate (Russell and Norvig)

[Decision tree with multi-valued attributes: internal nodes Patrons? (None / Some / Full), Hungry?, Type? (Italian / Thai / Burger), and FriSat?, with True/False leaves.]

Applications of Decision Trees

• Medical diagnosis / drug design
• Evaluation of geological systems for assessing gas and oil basins
• Early detection of problems (e.g., jamming) during oil drilling operations
• Automatic generation of rules in expert systems

Summary

• Inductive learning frameworks
• Logic inference formulation
• Hypothesis space and KIS bias
• Inductive learning of decision trees
• Assessing performance
• Overfitting