Learning II
Introduction to Artificial Intelligence
CS440/ECE448
Lecture 21

TRANSCRIPT
Last lecture
• The (in)efficiency of exact inference with Bayes nets
• The learning problem
• Decision trees

This lecture
• Identification trees
• Neural networks: Perceptrons

Reading
• Chapters 18 and 20
Inductive learning method
• Construct/adjust h to agree with f on training set.
  (h is consistent if it agrees with f on all examples)
• E.g., curve fitting:
Inductive learning method
• Ockham’s razor: prefer the simplest consistent hypothesis.
• Construct/adjust h to agree with f on training set.
  (h is consistent if it agrees with f on all examples)
Inductive Learning
• Given examples of some concepts and a description (features) for these concepts as training data, learn how to classify subsequent descriptions into one of the concepts.
  – Concepts (classes)
  – Features
  – Training set
  – Test set
• Here, the function has discrete outputs (the classes).
Decision Trees
“Should I play tennis today?”

Note: A decision tree can be expressed as a disjunction of conjunctions:
(Outlook = Sunny ∧ Humidity = Normal) ∨ (Outlook = Overcast) ∨ (Outlook = Rain ∧ Wind = Weak)

[Tree: Outlook at the root, with branches Sunny, Overcast, Rain.
 Sunny → Humidity (High: No, Normal: Yes).
 Overcast → Yes.
 Rain → Wind (Strong: No, Weak: Yes)]
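The tree above can be sketched as a nested structure; a minimal illustration (the feature names and tree shape come from the slide, the representation and function name are my own):

```python
# Minimal sketch of the tennis decision tree.
# An internal node is (feature, {value: subtree}); a leaf is a label string.
tennis_tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(tree, example):
    """Walk down the tree, following the example's feature values."""
    while not isinstance(tree, str):      # a leaf is just a label string
        feature, branches = tree
        tree = branches[example[feature]]
    return tree

print(classify(tennis_tree, {"Outlook": "Sunny", "Humidity": "Normal"}))  # Yes
```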
Learning Decision Trees
• Inductive learning.
• Given a set of positive and negative training examples of a concept, can we learn a decision tree that can be used to appropriately classify other examples?
• Identification trees: ID3 [Quinlan, 1979].
What on Earth causes people to get sunburns?
I don’t know, so let’s go to the beach and collect some data.
Sunburn data

Name   Hair    Height   Swim Suit Color   Lotion   Result
Sarah  Blond   Average  Yellow            No       Sunburned
Dana   Blond   Tall     Red               Yes      Fine
Alex   Brown   Short    Red               Yes      Fine
Annie  Blond   Short    Red               No       Sunburned
Emily  Red     Average  Blue              No       Sunburned
Pete   Brown   Tall     Blue              No       Fine
John   Brown   Average  Blue              No       Fine
Katie  Blond   Short    Yellow            Yes      Fine

There are 3 × 3 × 3 × 2 = 54 possible feature vectors.
Exact Matching Method
• Construct a table recording observed cases.
• Use table lookup to classify new data.
• Problem: for realistic problems, exact matching can’t be used.
  – 8 people and 54 possible feature vectors:
    15% chance of finding an exact match.
  – Another example: 10^6 examples, 12 features, 5 values per feature:
    10^6 / 5^12 ≈ 0.4% chance of finding an exact match.
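These odds are quick to check; a trivial sketch, assuming feature vectors are uniformly distributed and the stored examples are distinct:

```python
# Probability that a new feature vector matches one of the stored examples,
# under the (idealized) assumption of distinct, uniformly likely vectors.
print(f"{8 / 54:.1%}")          # ~14.8%, i.e. roughly 15%
print(f"{10**6 / 5**12:.2%}")   # ~0.41%, i.e. roughly 0.4%
```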
How can we do the classification?
• Nearest-neighbor method (but only if we can establish a distance between feature vectors).
• Use identification trees:
  An identification tree is a decision tree in which each set of possible conclusions is implicitly established by a list of samples of known class.
An ID tree consistent with the data

[Tree: Hair Color at the root.
 Blond → Lotion Used: No → {Sarah, Annie} (Sunburned); Yes → {Dana, Katie} (Not Sunburned).
 Red → {Emily} (Sunburned).
 Brown → {Alex, Pete, John} (Not Sunburned)]
Another consistent ID tree

[Tree: Height at the root.
 Short → Suit Color: Yellow → {Katie} (Not Sunburned); Red → Hair Color: Blond → {Annie} (Sunburned), Brown → {Alex} (Not Sunburned).
 Average → Suit Color: Yellow → {Sarah} (Sunburned); Blue → Hair Color: Red → {Emily} (Sunburned), Brown → {John} (Not Sunburned).
 Tall → {Dana, Pete} (Not Sunburned)]
An idea
Select tests that divide people as well as possible into sets with homogeneous labels.

Hair Color: Blond → {Sarah, Annie, Dana, Katie}; Red → {Emily}; Brown → {Alex, Pete, John}
Lotion used: No → {Sarah, Annie, Emily, Pete, John}; Yes → {Dana, Alex, Katie}
(and similarly for Height and Suit Color)
Then among blonds…

Lotion used: No → {Sarah, Annie}; Yes → {Dana, Katie} — this is perfectly homogeneous
Height: Short → {Katie, Annie}; Average → {Sarah}; Tall → {Dana}
Suit Color: Yellow → {Sarah, Katie}; Red → {Dana, Annie}
Combining these two together…

[Tree: Hair Color at the root.
 Blond → Lotion Used: No → {Sarah, Annie} (Sunburned); Yes → {Dana, Katie} (Not Sunburned).
 Red → {Emily} (Sunburned).
 Brown → {Alex, Pete, John} (Not Sunburned)]
Decision Tree Learning Algorithm
• Problem:
  – For practical problems, it is unlikely that any test will produce one completely homogeneous subset.
• Solution:
  – Minimize a measure of inhomogeneity or disorder, available from information theory.
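This greedy procedure can be sketched recursively: at each node, pick the test whose branches need the least additional information, then recurse on each non-homogeneous branch. A minimal sketch (my own illustration, not the exact ID3 pseudocode; examples are dicts mapping feature names to values):

```python
import math

def disorder(labels):
    """Average information (entropy, in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def avg_disorder(examples, labels, feature):
    """Weighted disorder of the subsets produced by splitting on `feature`."""
    n = len(labels)
    return sum(len(branch) / n * disorder(branch)
               for value in {ex[feature] for ex in examples}
               for branch in [[lab for ex, lab in zip(examples, labels)
                               if ex[feature] == value]])

def build_tree(examples, labels, features):
    """Greedily build an identification tree consistent with the data."""
    if len(set(labels)) == 1:                  # homogeneous: make a leaf
        return labels[0]
    if not features:                           # no tests left: majority leaf
        return max(set(labels), key=labels.count)
    best = min(features, key=lambda f: avg_disorder(examples, labels, f))
    branches = {}
    for value in {ex[best] for ex in examples}:
        idx = [i for i, ex in enumerate(examples) if ex[best] == value]
        branches[value] = build_tree([examples[i] for i in idx],
                                     [labels[i] for i in idx],
                                     [f for f in features if f != best])
    return (best, branches)
```

On the sunburn data this picks hair color at the root and lotion below the blond branch, matching the tree built by hand later in the lecture.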
Information
• Let’s say we have a question which has n possible answers, v1, …, vn.
• If answer vi occurs with probability P(vi), then the information content (entropy), measured in bits, of knowing the answer is:

  I(P(v1), …, P(vn)) = − Σ_{i=1..n} P(vi) log2 P(vi)

• One bit of information is enough information to answer a yes-or-no question.
• E.g., consider flipping a fair coin: how much information do you have if you know which side comes up?

  I(1/2, 1/2) = −(1/2 log2 1/2 + 1/2 log2 1/2) = 1 bit
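The formula translates directly into code; a small sketch (the function name is mine):

```python
import math

def information(probs):
    """Entropy in bits: I(P(v1), ..., P(vn)) = -sum_i P(vi) log2 P(vi)."""
    # Skip zero probabilities, since lim_{x->0} x log2 x = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information([0.5, 0.5]))  # 1.0 — the fair-coin example
```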
Information at a node
• In our decision tree, for a given feature (e.g. hair color), we have:
  – b: number of branches (e.g. possible values for the feature)
  – Nb: number of samples in branch b
  – Np: number of samples in all branches
  – Nbc: number of samples of class c in branch b
• Using frequencies as an estimate of the probabilities, the average information over all branches is:

  Information = Σ_b (Nb / Np) · [ − Σ_c (Nbc / Nb) log2 (Nbc / Nb) ]

• For a single branch, the information is simply:

  Information_b = − Σ_c (Nbc / Nb) log2 (Nbc / Nb)
Example
• Consider a single branch (b = 1) which only contains members of two classes, A and B.
  – If half of the points belong to A and half belong to B:

    Information = −(1/2 log2 1/2 + 1/2 log2 1/2) = 1

  – What if all the points belong to A (or to B):

    Information = −(1 log2 1 + 0 log2 0) = 0
    (Note: lim_{x→0} x log2 x = 0)

• We like the latter situation, since the branch is homogeneous: less information is needed to make a decision (information gain is maximized).
What is the amount of information required for classification after we have used the hair test?

Hair Color: Blond → {Sarah, Annie, Dana, Katie}; Red → {Emily}; Brown → {Alex, Pete, John}

Blond branch: −(2/4 log2 2/4 + 2/4 log2 2/4) = 1
Red branch: −(1 log2 1 + 0 log2 0) = 0
Brown branch: −(0 log2 0 + 3/3 log2 3/3) = 0

Information = 4/8 · 1 + 1/8 · 0 + 3/8 · 0 = 0.5
Selecting top level feature
• Using the 8 samples we have so far, we get:

  Test         Information
  Hair         0.5
  Height       0.69
  Suit Color   0.94
  Lotion       0.61

• Hair wins: least additional information needed for the rest of the classification.
• This is used to build the first level of the identification tree:

  Hair Color: Blond → {Sarah, Annie, Dana, Katie}; Red → {Emily}; Brown → {Alex, Pete, John}
Selecting second level feature
• Let’s consider the remaining features for the blond branch (4 samples):

  Test         Information
  Height       0.5
  Suit Color   1
  Lotion       0

• Lotion wins: least additional information.
Thus we get to the tree we had arrived at earlier:

[Tree: Hair Color at the root.
 Blond → Lotion Used: No → {Sarah, Annie} (Sunburned); Yes → {Dana, Katie} (Not Sunburned).
 Red → {Emily} (Sunburned).
 Brown → {Alex, Pete, John} (Not Sunburned)]
Using the identification tree as a classification procedure

[Tree: Hair Color at the root.
 Blond → Lotion Used: No → Sunburn; Yes → OK.
 Red → Sunburn.
 Brown → OK]

Rules:
• If Blond and uses lotion, then OK
• If Blond and does not use lotion, then gets burned
• If red-haired, then gets burned
• If brown hair, then OK
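The rules read off the tree map directly to code; a minimal sketch (the function name is mine):

```python
def sunburn_risk(hair, lotion):
    """Classification procedure read off the identification tree."""
    if hair == "Blond":
        return "OK" if lotion == "Yes" else "Sunburned"
    if hair == "Red":
        return "Sunburned"
    return "OK"  # brown hair

print(sunburn_risk("Blond", "No"))  # Sunburned
```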
Performance measurement
How do we know that h ≈ f?
1. Use theorems of computational/statistical learning theory.
2. Try h on a new test set of examples (use same distribution over example space as training set).

Learning curve = % correct on test set as a function of training set size.