Learning with Decision Trees
Artificial Intelligence
CMSC 25000
February 20, 2003
Agenda
• Learning from examples
  – Machine learning review
  – Identification trees:
    • Basic characteristics
    • Sunburn example
    • From trees to rules
    • Learning by minimizing heterogeneity
    • Analysis: pros & cons
Machine Learning: Review
• Learning:
  – Automatically acquire a function from inputs to output values, based on previously seen inputs and output values.
  – Input: vector of feature values
  – Output: value
• Examples: word pronunciation, robot motion, speech recognition
Machine Learning: Review
• Key contrasts:
  – Supervised versus unsupervised
    • With or without labeled examples (known outputs)
  – Classification versus regression
    • Output values: discrete versus continuous-valued
  – Types of functions learned
    • aka “inductive bias”
    • The learning algorithm restricts what can be learned
Machine Learning: Review
• Key issues:
  – Feature selection:
    • What features should be used?
    • How do they relate to each other?
    • How sensitive is the technique to feature selection?
      – Irrelevant, noisy, or absent features; feature types
  – Complexity & generalization:
    • Tension between
      – Matching the training data
      – Performing well on NEW UNSEEN inputs
Machine Learning Features
• Inputs:
  – E.g. words, acoustic measurements, financial data
  – Vectors of features:
    • E.g. word: letters – ‘cat’: L1 = c; L2 = a; L3 = t
    • Financial data: F1 = # late payments/yr (integer); F2 = ratio of income to expense (real)
Machine Learning Features
• Questions:
  – Which features should be used?
  – How should they relate to each other?
• Issue 1: How do we define relations in feature space if features have different scales?
  – Solution: scaling/normalization
• Issue 2: Which features are important?
  – If two instances differ only in an irrelevant feature, that feature should be ignored
Complexity & Generalization
• Goal: predict values accurately on new inputs
• Problem:
  – We train on sample data
  – We can make an arbitrarily complex model that fits it
  – BUT it will probably perform badly on NEW data
• Strategy:
  – Limit the complexity of the model (e.g. the degree of the equation)
  – Split the data into training and validation sets
    • Hold out data to check for overfitting
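A minimal sketch of the holdout strategy (the function name and the 25% split are my own choices, not from the slides):

```python
import random

def train_validation_split(examples, validation_fraction=0.25, seed=0):
    """Hold out part of the labeled data: train on one subset,
    then check for overfitting by scoring on the held-out subset."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)        # deterministic shuffle
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]

train, validation = train_validation_split(list(range(8)))
print(len(train), len(validation))  # 6 2
```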
Learning: Identification Trees
• (aka decision trees)
• Supervised learning
• Primarily classification
• Rectangular decision boundaries
  – More restrictive than nearest neighbor
• Robust to irrelevant attributes and noise
• Fast prediction
Sunburn Example

| Name  | Hair   | Height  | Weight  | Lotion | Result |
|-------|--------|---------|---------|--------|--------|
| Sarah | Blonde | Average | Light   | No     | Burn   |
| Dana  | Blonde | Tall    | Average | Yes    | None   |
| Alex  | Brown  | Short   | Average | Yes    | None   |
| Annie | Blonde | Short   | Average | No     | Burn   |
| Emily | Red    | Average | Heavy   | No     | Burn   |
| Pete  | Brown  | Tall    | Heavy   | No     | None   |
| John  | Brown  | Average | Heavy   | No     | None   |
| Katie | Blonde | Short   | Light   | Yes    | None   |
Learning about Sunburn
• Goal:
  – Train on labeled examples
  – Predict Burn/None for new instances
• Solution??
  – Exact match: same features, same output
    • Problem: $2 \times 3^3 = 54$ possible feature combinations
      – Could be much worse
Learning about Sunburn
• Better solution: identification tree
  – Training:
    • Divide examples into subsets based on feature tests
    • The sets of samples at the leaves define the classification
  – Prediction:
    • Route a NEW instance through the tree to a leaf based on its feature tests
    • Assign the same value as the samples at that leaf
Sunburn Identification Tree

• Hair Color
  – Blonde → Lotion Used
    • No → Sarah: Burn; Annie: Burn
    • Yes → Katie: None; Dana: None
  – Red → Emily: Burn
  – Brown → Alex: None; John: None; Pete: None
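One way to see the prediction step: represent the tree with nested structures and route an instance by its feature values (a sketch; the nested-tuple encoding is my own, not from the slides):

```python
# Internal nodes are (feature, {value: subtree}); leaves are class labels.
TREE = ("Hair", {
    "Blonde": ("Lotion", {"No": "Burn", "Yes": "None"}),
    "Red": "Burn",
    "Brown": "None",
})

def classify(tree, instance):
    """Route an instance through the tree to a leaf label."""
    while isinstance(tree, tuple):              # still at a test node
        feature, branches = tree
        tree = branches[instance[feature]]      # follow the matching branch
    return tree

print(classify(TREE, {"Hair": "Blonde", "Lotion": "Yes"}))  # None (no burn), like Katie and Dana
```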
Simplicity
• Occam’s razor:
  – The simplest explanation that covers the data is best
• Occam’s razor for ID trees:
  – The smallest tree consistent with the samples will be the best predictor for new data
• Problem:
  – Finding all trees and selecting the smallest: expensive!
• Solution:
  – Greedily build a small tree
Building ID Trees
• Goal: build a small tree such that all samples at the leaves have the same class
• Greedy solution:
  – At each node, pick the test whose branches are closest to having a single class
    • Split into subsets with the least “disorder”
      – (Disorder ~ entropy)
  – Find the test that minimizes disorder
Minimizing Disorder

• Hair Color:
  – Blonde: Sarah: B, Dana: N, Annie: B, Katie: N
  – Red: Emily: B
  – Brown: Alex: N, Pete: N, John: N
• Height:
  – Short: Alex: N, Annie: B, Katie: N
  – Average: Sarah: B, Emily: B, John: N
  – Tall: Dana: N, Pete: N
• Weight:
  – Light: Sarah: B, Katie: N
  – Average: Dana: N, Alex: N, Annie: B
  – Heavy: Emily: B, Pete: N, John: N
• Lotion:
  – No: Sarah: B, Annie: B, Emily: B, Pete: N, John: N
  – Yes: Dana: N, Alex: N, Katie: N
Minimizing Disorder (within the Blonde branch)

• Height:
  – Short: Annie: B, Katie: N
  – Average: Sarah: B
  – Tall: Dana: N
• Weight:
  – Light: Sarah: B, Katie: N
  – Average: Dana: N, Annie: B
  – Heavy: (no samples)
• Lotion:
  – No: Sarah: B, Annie: B
  – Yes: Dana: N, Katie: N
Measuring Disorder
• Problem:
  – In general, tests on large databases don’t yield homogeneous subsets
• Solution:
  – A general information-theoretic measure of disorder
  – Desired features:
    • Homogeneous set: least disorder = 0
    • Even split: most disorder = 1
Measuring Entropy

• If we split $m$ objects into two bins of sizes $m_1$ and $m_2$, what is the entropy?

$$-\sum_{i} \frac{m_i}{m}\log_2\frac{m_i}{m} \;=\; -\frac{m_1}{m}\log_2\frac{m_1}{m} \;-\; \frac{m_2}{m}\log_2\frac{m_2}{m}$$

[Figure: Disorder as a function of $m_1/m$: 0 at $m_1/m = 0$ and $m_1/m = 1$, maximum 1 at $m_1/m = 1/2$]
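The desired endpoints (0 for a homogeneous set, 1 for an even two-way split) can be checked with a small helper (a sketch; `split_entropy` is my own name for the two-bin formula above):

```python
from math import log2

def split_entropy(m1, m2):
    """Entropy, in bits, of splitting m = m1 + m2 objects into two bins."""
    m = m1 + m2
    total = 0.0
    for mi in (m1, m2):
        p = mi / m
        if p > 0:                 # convention: 0 * log2(0) = 0
            total -= p * log2(p)
    return total

print(split_entropy(4, 4))  # 1.0: even split, most disorder
print(split_entropy(8, 0))  # 0.0: homogeneous set, least disorder
```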
Measuring Disorder: Entropy

• Let $p_i = m_i/m$ be the probability of being in bin $i$, with $\sum_i p_i = 1$ and $0 \le p_i \le 1$.
• Entropy (disorder) of a split:

$$\text{Entropy} = -\sum_i p_i \log_2 p_i$$

(assume $0 \log_2 0 = 0$)

| $p_1$ | $p_2$ | Entropy |
|-------|-------|---------|
| ½ | ½ | $-\frac12\log_2\frac12 - \frac12\log_2\frac12 = \frac12 + \frac12 = 1$ |
| ¼ | ¾ | $-\frac14\log_2\frac14 - \frac34\log_2\frac34 = 0.5 + 0.311 = 0.811$ |
| 1 | 0 | $-1\log_2 1 - 0\log_2 0 = 0 - 0 = 0$ |
Computing Disorder

$$\text{AvgDisorder} = \sum_{i=1}^{k} \frac{n_i}{n_t} \left( \sum_{c \in \text{classes}} -\frac{n_{i,c}}{n_i} \log_2 \frac{n_{i,c}}{n_i} \right)$$

• $n_i/n_t$: fraction of samples down branch $i$
• Inner sum: disorder of the class distribution on branch $i$

[Diagram: $n_t$ instances divided between Branch 1 and Branch 2, each holding a mix of class-$a$ and class-$b$ samples]
Entropy in Sunburn Example

Applying the AvgDisorder formula to each first-round split:

• Hair color $= \frac{4}{8}\left(-\frac{2}{4}\log_2\frac{2}{4} - \frac{2}{4}\log_2\frac{2}{4}\right) + \frac{1}{8}\cdot 0 + \frac{3}{8}\cdot 0 = 0.5$
• Height $= 0.69$
• Weight $= 0.94$
• Lotion $= 0.61$
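These four numbers can be reproduced directly from the sunburn table (a sketch; the tuple encoding and the `avg_disorder` name are mine):

```python
from math import log2

# (hair, height, weight, lotion) -> result, one row per person in the table
DATA = [
    ("Blonde", "Average", "Light",   "No",  "Burn"),   # Sarah
    ("Blonde", "Tall",    "Average", "Yes", "None"),   # Dana
    ("Brown",  "Short",   "Average", "Yes", "None"),   # Alex
    ("Blonde", "Short",   "Average", "No",  "Burn"),   # Annie
    ("Red",    "Average", "Heavy",   "No",  "Burn"),   # Emily
    ("Brown",  "Tall",    "Heavy",   "No",  "None"),   # Pete
    ("Brown",  "Average", "Heavy",   "No",  "None"),   # John
    ("Blonde", "Short",   "Light",   "Yes", "None"),   # Katie
]
FEATURES = ["Hair", "Height", "Weight", "Lotion"]

def avg_disorder(rows, feature_index):
    """Weighted average entropy of the class labels after splitting on one feature."""
    branches = {}
    for row in rows:
        branches.setdefault(row[feature_index], []).append(row[-1])
    total = 0.0
    for labels in branches.values():
        weight = len(labels) / len(rows)         # fraction of samples down this branch
        for label in set(labels):
            p = labels.count(label) / len(labels)
            total -= weight * p * log2(p)
    return total

for i, name in enumerate(FEATURES):
    print(f"{name}: {avg_disorder(DATA, i):.2f}")
# Hair: 0.50, Height: 0.69, Weight: 0.94, Lotion: 0.61
```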
Entropy in Sunburn Example

Second round, within the Blonde branch:

• Height $= \frac{2}{4}\left(-\frac12\log_2\frac12 - \frac12\log_2\frac12\right) + \frac14\cdot 0 + \frac14\cdot 0 = 0.5$
• Weight $= \frac{2}{4}\left(-\frac12\log_2\frac12 - \frac12\log_2\frac12\right) + \frac{2}{4}\left(-\frac12\log_2\frac12 - \frac12\log_2\frac12\right) = 1$
• Lotion $= 0$ (homogeneous subsets, so Lotion is chosen)
Building ID Trees with Disorder
• Until each leaf is as homogeneous as possible:
  – Select an inhomogeneous leaf node
  – Replace that leaf node by a test node that creates the subsets with the least average disorder
• Effectively creates a set of rectangular regions
  – Repeatedly draws lines in different axes
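The loop above can be written as a short recursive procedure (a sketch under my own representation: rows are dicts, internal nodes are `(feature, {value: subtree})` tuples):

```python
from math import log2

def entropy(labels):
    """Disorder of a list of class labels, in bits."""
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def build_tree(rows, features):
    """Greedy ID-tree induction: at each node, split on the feature
    whose branches have the least weighted average disorder."""
    labels = [r["Result"] for r in rows]
    if len(set(labels)) == 1 or not features:    # homogeneous (or no tests left)
        return labels[0]
    def avg_disorder(f):
        groups = {}
        for r in rows:
            groups.setdefault(r[f], []).append(r["Result"])
        return sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    best = min(features, key=avg_disorder)
    branches = {}
    for r in rows:
        branches.setdefault(r[best], []).append(r)
    rest = [f for f in features if f != best]
    return (best, {v: build_tree(rs, rest) for v, rs in branches.items()})

ROWS = [
    {"Hair": "Blonde", "Height": "Average", "Weight": "Light",   "Lotion": "No",  "Result": "Burn"},   # Sarah
    {"Hair": "Blonde", "Height": "Tall",    "Weight": "Average", "Lotion": "Yes", "Result": "None"},   # Dana
    {"Hair": "Brown",  "Height": "Short",   "Weight": "Average", "Lotion": "Yes", "Result": "None"},   # Alex
    {"Hair": "Blonde", "Height": "Short",   "Weight": "Average", "Lotion": "No",  "Result": "Burn"},   # Annie
    {"Hair": "Red",    "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "Burn"},   # Emily
    {"Hair": "Brown",  "Height": "Tall",    "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},   # Pete
    {"Hair": "Brown",  "Height": "Average", "Weight": "Heavy",   "Lotion": "No",  "Result": "None"},   # John
    {"Hair": "Blonde", "Height": "Short",   "Weight": "Light",   "Lotion": "Yes", "Result": "None"},   # Katie
]
LEARNED_TREE = build_tree(ROWS, ["Hair", "Height", "Weight", "Lotion"])
print(LEARNED_TREE)  # splits on Hair first, then Lotion inside the Blonde branch
```

On the sunburn table this reproduces the slide's tree: Hair Color at the root, Lotion Used under Blonde.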
Features in ID Trees: Pros
• Feature selection:
  – Tests the features that yield low disorder
    • I.e. selects the features that are important!
  – Ignores irrelevant features
• Feature type handling:
  – Discrete type: one branch per value
  – Continuous type: branch on >= value
    • Need to search to find the best breakpoint
• Absent features: distribute uniformly
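For a continuous feature, the breakpoint search mentioned above can be done by trying the midpoints between consecutive sorted values (a sketch; `best_threshold` and the example numbers are mine):

```python
from math import log2

def entropy(labels):
    return -sum((labels.count(c) / len(labels)) * log2(labels.count(c) / len(labels))
                for c in set(labels))

def best_threshold(values, labels):
    """Return the threshold t for a 'value >= t' test with least average disorder,
    trying each midpoint between consecutive distinct sorted values."""
    pairs = sorted(zip(values, labels))
    candidates = [(v1 + v2) / 2
                  for (v1, _), (v2, _) in zip(pairs, pairs[1:]) if v1 != v2]
    def disorder(t):
        lo = [l for v, l in pairs if v < t]
        hi = [l for v, l in pairs if v >= t]
        return sum(len(s) / len(pairs) * entropy(s) for s in (lo, hi) if s)
    return min(candidates, key=disorder)

# Hypothetical weights (kg) with Burn/None labels:
print(best_threshold([50, 55, 60, 70, 80, 90], ["B", "B", "B", "N", "N", "N"]))  # 65.0
```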
Features in ID Trees: Cons
• Features:
  – Assumed independent
  – To capture a group effect, it must be modeled explicitly
    • E.g. make a new feature AorB
• Feature tests are conjunctive (each path ANDs its tests together)
From Trees to Rules
• Tree:
  – Each branch from root to leaf is a sequence of tests ending in a classification
  – Tests = if-parts (antecedents); leaf labels = then-parts (consequents)
  – All ID trees can be converted to rules; not all rule sets can be expressed as trees
From ID Trees to Rules

Tree:
• Hair Color
  – Blonde → Lotion Used
    • No → Sarah: Burn; Annie: Burn
    • Yes → Katie: None; Dana: None
  – Red → Emily: Burn
  – Brown → Alex: None; John: None; Pete: None

Rules:
(if (equal haircolor blonde) (equal lotionused yes) (then None))
(if (equal haircolor blonde) (equal lotionused no) (then Burn))
(if (equal haircolor red) (then Burn))
(if (equal haircolor brown) (then None))
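Reading the rules off the tree is a walk over every root-to-leaf path (a sketch; the nested-tuple tree encoding is my own, not from the slides):

```python
def tree_to_rules(tree, conditions=()):
    """Collect one (antecedents, consequent) rule per root-to-leaf path."""
    if not isinstance(tree, tuple):              # leaf: emit accumulated tests
        return [(list(conditions), tree)]
    feature, branches = tree
    rules = []
    for value, subtree in branches.items():
        rules += tree_to_rules(subtree, conditions + ((feature, value),))
    return rules

ID_TREE = ("Hair", {
    "Blonde": ("Lotion", {"No": "Burn", "Yes": "None"}),
    "Red": "Burn",
    "Brown": "None",
})
for antecedents, consequent in tree_to_rules(ID_TREE):
    tests = " and ".join(f"{f} = {v}" for f, v in antecedents)
    print(f"if {tests} then {consequent}")   # e.g. "if Hair = Red then Burn"
```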
Identification Trees
• Train:
  – Build the tree by forming subsets of least disorder
• Predict:
  – Traverse the tree based on feature tests
  – Assign the label of the samples at the leaf reached
• Pros: robust to irrelevant features and some noise; fast prediction; perspicuous rules can be read off the tree
• Cons: poor at capturing feature combinations and dependencies; building the optimal tree is intractable