Discriminative v. generative
TRANSCRIPT
Geoff Gordon—Machine Learning—Fall 2013
Naive Bayes
Naive Bayes

Naive Bayes
MLE:
\max_{a_j, b_j, p} P(x_{ij}, y_i), where P(x_{ij}, y_i) = \prod_i P(y_i) \prod_j P(x_{ij} \mid y_i)
P(y_i = +) = p
P(x_{ij} = 1 \mid y_i = -) = a_j
P(x_{ij} = 1 \mid y_i = +) = b_j
p = \frac{1}{N} \sum_i \delta[y_i = +]
a_j = \sum_i \delta[(y_i = -) \wedge (x_{ij} = 1)] \,/\, \sum_i \delta[y_i = -]
b_j = \sum_i \delta[(y_i = +) \wedge (x_{ij} = 1)] \,/\, \sum_i \delta[y_i = +]
2k+1 parameters
P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))
z_i = w_0 + \sum_j w_j x_{ij}
naive Bayes model: y_i -> [x_ij]
N training examples (x_i, y_i), k binary features; y_i \in {0,1}, x_i \in {0,1}^k
MLE led to classifier: P(y_i=+ \mid x_{ij}) = 1/(1+\exp(-z_i)), z_i = w_0 + \sum_j w_j x_{ij}
===
P(x_{ij}, y_i) = \prod_i P(y_i) \prod_j P(x_{ij} \mid y_i)
\max_{a_j, b_j, p} P(x_{ij}, y_i)
P(y_i=+) = p
P(x_{ij}=1 \mid y_i=-) = a_j
P(x_{ij}=1 \mid y_i=+) = b_j
p = \frac{1}{N} \sum \delta[y_i=+]
a_j = \sum \delta[(y_i=-) \wedge (x_{ij}=1)] / \sum \delta[y_i=-]
b_j = \sum \delta[(y_i=+) \wedge (x_{ij}=1)] / \sum \delta[y_i=+]
Logistic regression
\arg\max_w \prod_i P(y_i \mid x_{ij}) = \arg\min_w \sum_i \ln(1 + \exp(-y_i z_i)) = \arg\min_w \sum_i h(y_i z_i)
P(y_i = + \mid x_{ij}) = 1/(1 + \exp(-z_i))
z_i = w_0 + \sum_j w_j x_{ij}
[sketch h]
optional prior for both NB and log. reg.
SVM: can think of it as approximating this optimization with a QP
===
P(y_i=+ \mid x_{ij}) &= 1/(1+\exp(-z_i))\\z_i &= \textstyle w_0 + \sum_j w_j x_{ij}
\arg\max_w \prod_i P(y_i | x_{ij}) = \arg\min_w \sum_i \ln(1+\exp(-y_i z_i)) = \arg\min_w \sum_i h(y_i z_i)
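The MCLE objective above can be minimized by plain gradient descent. A sketch, assuming y_i ∈ {−1,+1}; the function name, learning rate, and iteration count are arbitrary choices, not part of the lecture:

```python
import numpy as np

def logistic_loss_fit(X, y, lr=0.1, iters=2000):
    """Minimize sum_i ln(1 + exp(-y_i z_i)) with z_i = w0 + w . x_i.
    X: (N, k) features, y: (N,) labels in {-1, +1}."""
    N, k = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])      # prepend a column of 1s for w0
    w = np.zeros(k + 1)
    for _ in range(iters):
        z = Xb @ w
        # gradient of the logistic loss: -y_i x_i / (1 + exp(y_i z_i))
        g = -(y / (1 + np.exp(y * z))) @ Xb
        w -= lr * g / N
    return w
```

On separable data this drives every y_i z_i positive; with an optional prior (ridge penalty) the weights would stay bounded.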
Same model, different answer
Why?
‣ max P(X, Y) vs. max P(Y | X)
‣ generative vs. discriminative
‣ MLE v. MCLE (max conditional likelihood estimate)
How to pick?
‣ Typically MCLE better if lots of data, MLE better if not
MCLE better if lots of data: we’ll see why below
it’s perhaps disturbing that there are two different ways to train the same model (3 if we count the SVM, but we can justify that as an approximation to MCLE); can we relate them?
[also integration v. maximization, but ignore that]
MCLE as MLE
\max \prod_i P(x_i, y_i \mid \theta) \quad \text{vs.} \quad \max \prod_i P(y_i \mid x_i, \theta)
P(x_i, y_i | \theta) = P(x_i | \theta) P(y_i | x_i, \theta)
now suppose \theta = (\theta_x, \theta_y), and
P(x_i | \theta) = P(x_i | \theta_x)
P(y_i | x_i, \theta) = P(y_i | x_i, \theta_y)
then max \sum_i \ln P(x_i, y_i | \theta) = max \sum_i [\ln P(x_i | \theta_x) + \ln P(y_i | x_i, \theta_y)]
can solve separately for \theta_x and \theta_y; \theta_y is the MCLE solution
MCLE as MLE
Recipe: MCLE = MLE + extra parameters to decouple P(x) from P(y|x)
Bias-variance tradeoff: MLE places additional constraints on θ by coupling to P(x)
‣ MLE increases bias, decreases variance (vs. MCLE)
To interpolate generative / discriminative models, soft-tie θx to θy w/ a prior
Tom Minka. Discriminative models, not discriminative training. MSR tech report TR-2005-144, 2005.
parameters of P(y|x) are the ones of interest
Comparison
As #examples → ∞
‣ if the Bayes net is right: NB & LR get the same answer
‣ if not:
  ‣ LR has minimum possible training error
  ‣ train error → test error
  ‣ so LR does at least as well as NB, usually better
Comparison
Finite sample: n examples with k attributes
‣ how big should n be for excess risk ≤ ϵ?
‣ GNB needs n = Θ(log k), as long as a constant fraction of attributes are relevant
  ‣ Hoeffding for each weight + union bound over weights + bound z away from 0
‣ LR needs n = Θ(k)
  ‣ VC-dimension of a linear classifier
GNB converges much faster to its (perhaps less-accurate) final estimates
see [Ng & Jordan, 2002]
informally, the difference in convergence rates happens because NB’s parameter estimates are uncoupled, while LR’s are coupled
Comparison on UCI
[Figure: test error vs. number of training examples m on UCI datasets. Panels: pima (continuous), adult (continuous), boston (predict if > median price, continuous), optdigits (0’s and 1’s, continuous), optdigits (2’s and 3’s, continuous), ionosphere (continuous), liver disorders (continuous), sonar (continuous), adult (discrete), promoters (discrete), lymphography (discrete), breast cancer (discrete), lenses (predict hard vs. soft, discrete), sick (discrete), voting records (discrete). Each panel plots error against m.]
see [Ng & Jordan, 2002]
Comparison on UCI
[Figure: the same set of UCI panels as the previous slide (pima, adult, boston, optdigits 0/1, optdigits 2/3, ionosphere, liver disorders, sonar, adult, promoters, lymphography, breast cancer, lenses, sick, voting records), again plotting error vs. m.]
see [Ng & Jordan, 2002]
Decision trees
Dichotomous classifier
Winged Insect Key (Flies, bees, butterflies, beetles, etc.)
Starting with question #1, determine which statement (a or b) is true for your insect. Follow the direction at the end of the true statement until you are finally given the name of the Order your insect belongs to.
1. a. Insect has 1 pair of wings ... Order Diptera (flies, mosquitoes)
   b. Insect has 2 pair of wings ... go to #2
2. a. Insect has extremely long prothorax (neck) ... go to #3
   b. Insect has a regular length or no prothorax ... go to #4
3. a. Forelegs come together in a 'praying' position ... Order Mantodea (mantids)
   b. Forelegs do not come together in a 'praying' position ... Order Raphidioptera (snakeflies)
4. a. Wings are armour-like with membranous hindwings underneath them ... Order Coleoptera (beetles)
   b. Wings are not armour-like ... go to #5
5. a. Wings twist when insect is in flight ... Order Strepsiptera (twisted-wing parasites)
   b. Wings flap up and down (no twisting) when in flight ... go to #6
6. a. Wings are triangular in shape ... go to #7
   b. Wings are not triangular in shape ... go to #8
7. a. Insect lacks a proboscis and has long filaments at abdominal tip ... Order Ephemeroptera (mayflies)
   b. Insect has a proboscis and lacks long filaments at abdominal tip ... Order Lepidoptera (butterflies)
8. a. Head is elongated (snout-like) ... Order Mecoptera (scorpionflies)
   b. Head is not elongated (snout-like) ... go to #9
9. a. Insect has 2 pair of cerci (pincers) at tip of abdomen ... Order Dermaptera (earwigs)
   b. Insect does not have 2 pair of cerci (pincers) at tip of abdomen ... go to #10
10. a. All 4 wings are both similar in size and in shape to each other ... go to #11
    b. All 4 wings are not similar in size nor in shape to each other ... go to #16
11. a. Eyes nearly cover or make up entire head ... Order Odonata (dragonflies)
    b. Eyes do not nearly cover nor make up entire head ... go to #12
12. a. All 4 wings are finely veined and are almost 2x longer than abdomen ... Order Isoptera (termites)
    b. All 4 wings are not finely veined and are not almost 2x longer than abdomen ... go to #13
13. a. All 4 wings are transparent with many criss-crossing veins ... Order Neuroptera (lacewings)
    b. All 4 wings are not transparent with many criss-crossing veins ... go to #14
http://www.insectidentification.org/winged-insect-key.asp
decision trees were invented long before machine learning -- first called “dichotomous classifiers”
used for field identification of species, rock types, etc.
Decision tree
Problem: classification (or regression)
‣ n training examples (x1, y1), (x2, y2), … (xn, yn)
‣ xi ∈ Rk, yi ∈ {0, 1}
well-known implementations: ID3, C4.5, J48, CART
input variables: real valued, categorical, binary, …
output variables: real valued, categorical, binary, …
tree shape [sketch]: a question in each node (about values of input variables), branch based on the answer, e.g., x_{i,3} > 5
when we reach a leaf, it tells us about the output variable
usually yes/no questions, but could be multiple choice -- answers must be mutually exclusive and exhaustive
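The node/leaf structure just described can be sketched directly; the class and field names here are illustrative, not from the lecture:

```python
# Minimal decision-tree representation: internal nodes ask a threshold
# question about one input variable; leaves hold an output label.
class Node:
    def __init__(self, attr, threshold, left, right):
        self.attr, self.threshold = attr, threshold  # question: x[attr] > threshold?
        self.left, self.right = left, right          # branches for no / yes

class Leaf:
    def __init__(self, label):
        self.label = label

def predict(tree, x):
    """Walk from the root, answering each node's question, until a leaf."""
    while isinstance(tree, Node):
        tree = tree.right if x[tree.attr] > tree.threshold else tree.left
    return tree.label
```

Test-time cost is just the depth of the tree, which is why decision trees are very fast to evaluate.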
The picture
typical decision tree cartoon: divide the plane into rectangular regions
The picture
Composition II in Red, Blue, and Yellow
Piet Mondrian, 1930
Variants
Type of question at internal nodes
Type of label at leaf nodes
Labels on internal nodes or edges
type of question at internal node: most common: >, <, or = on a single attribute, e.g., height > 3? color = blue? also used: logical fns (conjuncts, disjuncts); also used: linear threshold (e.g., 3*height + width − 2 > 0)
type of label at leaf: constant (“3” or “true”), or could be any classifier or regressor (e.g., linear regression on all data points w/in the box)
labeled internal nodes/edges (combine e.g. w/ sum): color = blue: +3, go to node 3; else +2, go to node 4; height > 5: −2, go to node 7; else +1, go to node 13
Variants
Decision list
Decision diagram (DAG)
allow cycles + side effects: flowchart
fine print: this might have cycles in it
Example
[Figure: iris data, Petal Length (x-axis) vs. Sepal Length (y-axis)]
petal: range 1..7; sepal: range 4..8
red: setosa; green: versicolor; blue: virginica
suppose we had another type at bottom right; could split on sepal length
Why decision trees?
Why?
‣ work pretty well
‣ fairly interpretable
‣ very fast at test time
‣ closed under common operations
Why not DTs?
‣ learning is NP-hard
‣ often not state-of-art error rate
‣ but: see bagging, boosting
closed under common operations: e.g., the sum of 2 decision trees/diagrams is another tree/diagram (and there are good algorithms to compute and optimize the representation of the sum)
not usually state-of-art performance, but: boosted or bagged versions may be; but: these lose interpretability
Learning
red? fuzzy? Class
T     T      –
T     F      +
T     F      –
F     T      –
F     F      +
split on red: T: –+–, F: –+; best (MLE) labels: 1/3, 1/2; lik: log(2/3)+log(1/3)+log(2/3)+log(1/2)+log(1/2)
alternately, split on fuzzy: T: ––, F: +–+; MLE labels: 0, 2/3 [could use Laplace smoothing]; lik: log(1)+log(2/3)+log(1/3)+log(1)+log(2/3); better by 2 log(2) = 2 bits
if we now split the fuzzy=F bin on red?, the red=F leaf is pure, but the red=T leaf still mixes + and –: examples 2 and 3 have identical features with different classes, so no tree can get perfect performance on this training data
Learning
Bigger data sets with more attributes: finding the training-set MLE is NP-hard
Heuristic search: build the tree greedily, root down
‣ start with all training examples in one bin
‣ pick an impure bin
‣ try some candidate splits (e.g., all single-attribute binary tests), pick the best (largest increase in likelihood)
‣ repeat until all bins are either pure or have no possible splits left (ran out of attributes to split on)
tradeoff: if we consider stronger splits (e.g., linear classifiers vs. single-attribute tests) we get more progress at each node, but more selection bias; the overall goal is not to do well at a node, but to do well with the final tree
heuristic: learn slowly (pick a weak set of splits)
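The greedy loop above can be sketched for binary attributes; this uses information gain as the split score and nested dicts as the tree, all of which are illustrative choices:

```python
from math import log2

def entropy(labels):
    """H(Y) in bits for a list of 0/1 labels."""
    n = len(labels)
    h = 0.0
    for c in (0, 1):
        p = labels.count(c) / n
        if p > 0:
            h -= p * log2(p)
    return h

def grow(X, y, attrs):
    """Greedy top-down tree. X: list of 0/1 tuples, y: list of 0/1 labels.
    Internal nodes are {attr: (subtree_if_0, subtree_if_1)}; leaves are labels."""
    if len(set(y)) == 1 or not attrs:              # pure bin, or no splits left
        return max(set(y), key=y.count)            # majority label
    def gain(j):                                   # info gain of splitting on j
        parts = [[yi for xi, yi in zip(X, y) if xi[j] == v] for v in (0, 1)]
        return entropy(y) - sum(len(p) / len(y) * entropy(p) for p in parts if p)
    j = max(attrs, key=gain)                       # best single-attribute split
    rest = [a for a in attrs if a != j]
    children = []
    for v in (0, 1):
        part = [(xi, yi) for xi, yi in zip(X, y) if xi[j] == v]
        if part:
            Xv, yv = zip(*part)
            children.append(grow(list(Xv), list(yv), rest))
        else:
            children.append(max(set(y), key=y.count))
    return {j: tuple(children)}

def predict(tree, x):
    while isinstance(tree, dict):
        (j, (lo, hi)), = tree.items()
        tree = hi if x[j] == 1 else lo
    return tree
```

Note this is exactly the heuristic search: no backtracking, so it can miss the globally best tree (consistent with MLE tree learning being NP-hard).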
Information gain
Initially: L = 2 log(.4) + 3 log(.6)
Split on red:
‣ bin T: 2 log(.667) + log(.333)
‣ bin F: 2 log(.5)
Split on fuzzy:
‣ bin T: 2 log 1 + 0 log 0 = 0
‣ bin F: 2 log(.667) + log(.333)
In general: H(Y) – E_X[H(Y|X)]
red? fuzzy? Class
T     T      –
T     F      +
T     F      –
F     T      –
F     F      +
evaluating a candidate split: increase in likelihood = information gain = measure of purity of bins
init: –4.85; red: –4.75 (gain .1 bits); fuzzy: –2.75 (gain 2.1 bits)
there are other splitting criteria besides info gain (e.g., Gini) but we won’t cover them
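These log-likelihoods are easy to check numerically; a small script reproducing the slide's numbers (the function name is illustrative):

```python
from math import log2

# The five training examples from the table: (red, fuzzy, class)
data = [(True, True, '-'), (True, False, '+'), (True, False, '-'),
        (False, True, '-'), (False, False, '+')]

def split_loglik(data, attr):
    """Training-set log-likelihood (in bits) after splitting on one boolean
    attribute (attr: 0 = red, 1 = fuzzy), using MLE labels in each bin."""
    total = 0.0
    for v in (True, False):
        classes = [c for r, f, c in data if (r, f)[attr] == v]
        p = classes.count('+') / len(classes)     # MLE leaf probability of '+'
        for c in classes:
            q = p if c == '+' else 1 - p
            if q > 0:                             # 0 log 0 = 0 convention
                total += log2(q)
    return total

root = 2 * log2(0.4) + 3 * log2(0.6)              # no split: about -4.85 bits
```

Running this gives roughly −4.85 at the root, −4.75 for red, and −2.75 for fuzzy, matching the gains of 0.1 and 2.1 bits quoted above.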
Real-valued attributes
finding the threshold for a real attribute: sort by attribute value, try n+1 thresholds, one in each gap between observed values
===
import numpy as np
import matplotlib.pyplot as plt
xs = np.sort(np.random.randn(50) - 1)   # attribute values for one class
ys = np.sort(np.random.randn(50) + 1)   # attribute values for the other class
plt.clf()
plt.plot(xs, np.arange(1, 51) / 50., ys, np.arange(1, 51) / 50.,
         marker='x', ls='none', mew=3, ms=5)   # empirical CDFs of the two classes
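The note's "sort, then try n+1 thresholds" step can be sketched as a helper that enumerates candidate thresholds, one per gap (the function name and the ±1 padding for the two outer thresholds are arbitrary):

```python
import numpy as np

def candidate_thresholds(values):
    """Candidate split thresholds for one real-valued attribute:
    midpoints of gaps between sorted distinct observed values,
    plus one below the min and one above the max (n+1 total for n distinct values)."""
    v = np.unique(values)                 # sorts and removes duplicates
    mids = (v[:-1] + v[1:]) / 2.0         # one midpoint per gap
    return np.concatenate(([v[0] - 1.0], mids, [v[-1] + 1.0]))
```

Only these thresholds need to be scored, since any threshold within a gap induces the same partition of the training data.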
Multi-way discrete splits
Split on temp yields {–,–} and {+,–,+}
Split on SS# yields 5 pure leaves
SS#           Temp  Sick?
123-45-6789   36    –
010-10-1010   36.5  +
555-55-1212   41    +
314-15-9265   37    –
271-82-8183   40    +
unfair advantage of multi-way splits
fix: penalize splits of high arity; e.g., allow only binary (1 vs rest); e.g., use a statistical test of significance to select a split variable
Pruning
Build tree on training set
Prune on holdout set:
‣ while removing the last split along some path improves holdout error, do so
‣ if a node N’s children are all pruned, then N becomes eligible for pruning
note: order of testing children of N is unimportant
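The pruning loop can be sketched bottom-up. This is a simplified sketch: the tree format (nested dicts with 0/1 leaves) is illustrative, the replacement label is the majority of leaf labels rather than of training examples, and the whole holdout set is scored at every node instead of routing examples:

```python
def predict(tree, x):
    """Evaluate a nested-dict tree {attr: (if_0, if_1)} with 0/1 leaves."""
    while isinstance(tree, dict):
        (j, (lo, hi)), = tree.items()
        tree = hi if x[j] else lo
    return tree

def accuracy(tree, X, y):
    return sum(predict(tree, xi) == yi for xi, yi in zip(X, y)) / len(y)

def majority(tree):
    """Most common leaf label under a subtree (crude majority stand-in)."""
    if not isinstance(tree, dict):
        return tree
    (j, (lo, hi)), = tree.items()
    labels = [majority(lo), majority(hi)]
    return max(set(labels), key=labels.count)

def prune(tree, X, y):
    """Bottom-up reduced-error pruning on holdout set (X, y): collapse a
    subtree to a leaf whenever that doesn't hurt holdout accuracy."""
    if not isinstance(tree, dict):
        return tree
    (j, (lo, hi)), = tree.items()
    tree = {j: (prune(lo, X, y), prune(hi, X, y))}   # children pruned first
    leaf = majority(tree)
    if accuracy(leaf, X, y) >= accuracy(tree, X, y):
        return leaf
    return tree
```

Pruning children first is what makes a node "eligible" once its subtrees have collapsed, matching the slide's rule.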
Prune as rules
Alternately, convert each leaf to a rule, then prune
‣ test1 ∧ ¬test2 ∧ test3 …
‣ while dropping a test from a rule improves performance, do so
rule-based version: typically leads to smaller, more interpretable classifiers
may get overlap among rules; if so, e.g., average their predictions
Bagging
Bagging = bootstrap aggregating
Can be used with any classifier, but particularly effective with decision trees
Generate M bootstrap resamples
Train a decision tree on each one
Final classifier: vote all M trees
‣ e.g., tree 1 says p(+) = .7, tree 2 says p(+) = .9: predict .8
train: could use different training methods on each resample; choices include candidate splits, pruning strategies
random forests: restrict each tree to use a random subsample of k' << k attributes; don’t prune
bagging can increase performance substantially (random forests often get state-of-art performance) but reduces interpretability
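The resample-and-vote scheme can be sketched in a few lines; both function names and the callable-model interface are hypothetical:

```python
import numpy as np

def bootstrap_resamples(X, y, M, rng):
    """Generate M bootstrap resamples: draw N examples with replacement."""
    N = len(y)
    for _ in range(M):
        idx = rng.integers(0, N, size=N)
        yield X[idx], y[idx]

def bag_predict_proba(models, x):
    """Vote the trees by averaging their probability estimates, as in the
    slide's example (.7 and .9 average to .8). `models` are fitted
    callables returning p(+) for x."""
    return np.mean([m(x) for m in models])
```

One tree would be trained per resample; the averaged probabilities (or a majority vote on hard labels) give the final classifier.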
Out-of-bag error estimates
Each bag contains ~1−1/e (~63%) of the distinct examples
Use out-of-bag examples to estimate the error of each tree
To estimate the error of the overall vote
‣ for each example, classify using all out-of-bag trees
‣ average across all examples
Conservative: each example is voted on by only the ~37% of trees that left it out—but if we have lots of trees, the bias is small
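The in-bag fraction is easy to check by simulation (variable names are arbitrary); a bootstrap bag of N draws contains about 1 − 1/e ≈ 63% of the distinct examples:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 200
fracs = []
for _ in range(trials):
    idx = rng.integers(0, N, size=N)           # one bootstrap bag of N draws
    fracs.append(len(np.unique(idx)) / N)      # fraction of distinct examples in bag
in_bag = np.mean(fracs)                         # close to 1 - 1/e
```

Equivalently, each example is missing from a given bag with probability (1 − 1/N)^N → 1/e, which is why each example is out-of-bag for about 37% of the trees.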
Boosting
Voted classifiers
f: Rk → {–1, 1}
Voted classifier: ∑j fj(x) > 0
Weighted vote: ∑j αj fj(x) > 0
‣ assume wlog αj > 0
‣ optionally scale so the αj sum to 1
5 halfspaces (or add the constant classifier for |H| = 6)
typically a larger hypothesis space (vs. the base set of classifiers) -- e.g., voted halfspaces
terminology: the base f_j are called “weak hypotheses” to distinguish them from the stronger class of voted f_j
wlog: since we assume the hypothesis space is closed under negation (for each f, -f is also in the space)
idea: learn a voted classifier by MCLE
potential benefit: improved performance, if we can avoid overfitting due to the bigger hypothesis space
Voted classifiers—the matrix
[Figure: matrix with n training examples as rows and T distinct classifiers (T < 2^n) as columns]
write f_1 ... f_T for all *distinct* classifiers in our hypothesis space (at most 2^n on n training examples)
write z_ij = y_i f_j(x_i) = does f_j get (x_i, y_i) right? (matrix dimensions: # examples × # classifiers)
Finding the best voted classifier
write s_i = y_i [weighted vote]; the voted classifier is right on (x_i, y_i) iff s_i > 0
s_i = y_i [\sum_j \alpha_j f_j(x_i)] = \sum_j \alpha_j z_{ij}
MCLE: min_{\alpha,s} L = \sum_i h(s_i) s.t. s = Z \alpha, where h(s) = log(1+exp(-s))
this is a convex program (since h is convex) but too big to solve directly -- how to do it?
Coordinate descent
Repeat:
‣ Find an index j s.t. dL/dαj < 0
‣ Increase αj
“Repeatedly increase the weight of a useful weak hypothesis”
find an index j s.t. dL/d\alpha_j < 0 [by assumption, we don’t have to check separately for dL/d\alpha_j > 0]
concretely, \alpha_j += \alpha [there are other strategies, but this is actually one of the best]
to make this fast, “find an index j” has to be efficient (can’t enumerate columns)
Finding a good weak hypothesis
Find j s.t. –dL/dαj is big
‣ –dL/dαj = \sum_i –h'(s_i) z_{ij}
–dL/d\alpha_j = \sum_i –h'(s_i) ds_i/d\alpha_j = \sum_i –h'(s_i) z_{ij} = \sum_i –h'(s_i) y_i f_j(x_i) = “edge” of classifier j
we want to find j to make the edge as big as possible; –h' > 0, so we want to make each y_i f_j(x_i) big [sketch –h']
y_i f_j(x_i) is big <==> f_j gets x_i confidently right
weights –h'(s_i): example i is important if the current voted classifier gets it confidently wrong
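The edge computation and the coordinate-descent loop can be sketched together; the fixed step size, iteration count, and the assumption that Z is small enough to enumerate columns are all simplifications for illustration:

```python
import numpy as np

def boost(Z, steps=200, step_size=0.5):
    """Coordinate descent on L(alpha) = sum_i h(s_i), h(s) = log(1+exp(-s)),
    s = Z @ alpha, where Z[i, j] = y_i f_j(x_i) in {-1, +1}.
    Each round picks the column with the largest edge and increases its weight."""
    n, T = Z.shape
    alpha = np.zeros(T)
    for _ in range(steps):
        s = Z @ alpha
        w = 1.0 / (1.0 + np.exp(s))   # -h'(s_i): weight on example i
        edges = w @ Z                  # edge of each classifier: sum_i -h'(s_i) z_ij
        j = np.argmax(edges)
        if edges[j] <= 0:              # no useful weak hypothesis: stop
            break
        alpha[j] += step_size
    return alpha

def loss(Z, alpha):
    """The MCLE objective L(alpha), for checking progress."""
    return np.sum(np.log1p(np.exp(-(Z @ alpha))))
```

In real boosting the `argmax` over columns is replaced by a call to the weak learner, since the columns of Z cannot be enumerated.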
Weak learner
Weak learner = weighted classification algorithm that gets edge ≥ γ
‣ i.e., finds a classifier that performs well on the currently-wrong examples
Thm: if the weak learner always succeeds, L → 0
Discussion
Can take h(s) to be any convex, decreasing fn of s
‣ e.g., exp(–s) or hinge loss max(0, 1–s)
‣ we used log(1+exp(–s)) — discrete variant of LogitBoost
‣ exp(–s) leads to AdaBoost
Can use confidence-rated classifiers (range [–1,1]) or regression algorithms
Weak hypothesis class: usually want a less-complex class than we’d use on its own—mitigates overfitting
‣ same “slow learning rate” idea as for decision tree splits
original (real-valued) LogitBoost uses regression as the weak learner, like IRLS
In practice
Boosting typically takes training error quickly to 0
‣ could also stop with failure of the weak learner, but this doesn’t typically happen
Tends to keep increasing the margin, even after training error is 0
Tends not to overfit—usually attributed to margin
Is weak learning reasonable?
If the weak learner can always succeed, then ∃ a vote that gets every training example right
If the weak learner can fail, boosting seems like a good algorithm for making it do so
‣ but as we said, the weak learner usually keeps working
first line: a theorem of Freund & Schapire