27 February 2001 What is Confidence? Slide 1
What is Confidence?
How to Handle Overfitting When Given Few Examples
Top Changwatchai
AIML seminar
27 February 2001
(based on my ongoing research, Fall 2000)
27 February 2001 What is Confidence? Slide 2
Overview
• The problem of overfitting
• Bayesian network models
• Defining confidence
27 February 2001 What is Confidence? Slide 3
A Learning Problem
• Say we want to learn a classifier
  – fixed distribution of examples, each drawn independently
  – example consists of set of features
  – given a set of labeled examples drawn from the distribution
• Is our primary goal to fit these examples as well as possible?
  – No!
  – We want to fit the underlying distribution
• Overfitting
  – finding a hypothesis which fits the training examples better than some other hypothesis, but which fits the underlying distribution worse than that hypothesis
• Moral
  – focusing only on fitting the training data exposes you to the danger of overfitting
27 February 2001 What is Confidence? Slide 4
Handling Overfitting
• Approaches
  – Assume you have enough examples
    • Statistical anomalies are minimized
  – Smoothing (handling unlikely examples)
  – Other statistical methods, such as partitioning into training/test data
  – Other heuristics/assumptions
• My constraints
  – Very small number of examples
    • Shh! Ultimate goal is incorporating domain knowledge...
  – If approximations are necessary, make only those that can be directly, quantitatively justified
  – Want a quantitative measure of overfitting
27 February 2001 What is Confidence? Slide 5
Bayesian Networks
• Notes
  – BN’s are simply an example application
  – Focusing on inverted-tree BN’s used as classifiers
• Quick overview
  – Nodes represent features (assume Boolean)
    • Bottom node represents label
  – Links represent direct dependence
    • Absence of link represents lack of direct dependence
  – Conditional Probability Tables (CPT’s)
    • Each node has $2^{Pa}$ entries (Pa = # parents)
  – It is fairly straightforward to:
    • Learn CPT entries
    • Perform inference
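As a rough illustration of the CPT bullets above, the following Python sketch counts the CPT entries of a Boolean node and estimates them by relative frequency. The function names and the tuple-of-bools data layout are assumptions made for this example; the slides do not specify a learning procedure, and plain frequency counting is only one common choice.

```python
from collections import defaultdict
from itertools import product


def cpt_size(num_parents: int) -> int:
    """Number of CPT entries for a Boolean node: one per parent assignment."""
    return 2 ** num_parents


def learn_cpt(examples, parent_idx, child_idx):
    """Estimate P(child = True | parents) by relative frequency.

    examples   : list of tuples of bools, one tuple per example (assumed layout)
    parent_idx : positions of the parent features within each tuple
    child_idx  : position of the child node within each tuple
    Returns a dict mapping every parent assignment to an estimate, or to
    None when that assignment never occurs in the data.
    """
    true_counts = defaultdict(int)
    totals = defaultdict(int)
    for ex in examples:
        key = tuple(ex[i] for i in parent_idx)
        totals[key] += 1
        true_counts[key] += int(ex[child_idx])

    cpt = {}
    for key in product([False, True], repeat=len(parent_idx)):
        cpt[key] = true_counts[key] / totals[key] if totals[key] else None
    return cpt
```

With only a handful of examples, many parent assignments are never observed (the `None` entries), which is one concrete way to see why few examples and many CPT entries do not mix well.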
27 February 2001 What is Confidence? Slide 6
Bayesian Network Structures
• Structure determines expressiveness
  – More parents per node = more expressive
  – Directly related to the total number of CPT entries in the BN
• “Bayesian networks don’t overfit”
  – Given:
    • A BN structure
    • A set of training examples
  – There is a way of choosing CPT entries which fits the training examples as well as possible
  – Since we’re given the structure, we must assume that the best fit to the training examples is also the best fit to the underlying distribution
• However:
  – Manually building BN structures is a lot of work
  – We’d like to not only learn the CPT entries, but also learn the correct structures
27 February 2001 What is Confidence? Slide 7
BN’s and Overfitting
• Choosing the “correct” structure is where overfitting becomes a problem
• If the goal is to maximize accuracy on the training data, then we always prefer more expressive networks
  – In our inverted-tree classifiers, it would be a naïve Bayes structure
• Unfortunately, the more expressive the network, the greater the tendency to overfit for a fixed number of training examples
• Intuition:
  – Fitting curves to data points
  – “Spending” examples to increase confidence
• Current approaches in addressing overfitting
  – BIC, AIC, MDL, etc.
  – Each network structure is given a two-part “score”
    • Accuracy (the more accurate, the better)
    • Expressiveness (the fewer CPT entries, the better)
  – I think these rely on the assumption that we have sufficiently many examples
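To make the two-part score concrete, here is a minimal Python sketch of the standard BIC in its "higher is better" form: the log-likelihood of the training data minus (d/2) ln N, where d is the number of free parameters. Taking d to be the total number of CPT entries is an assumption that fits Boolean nodes (one free probability per parent assignment). This is the textbook BIC, not the confidence measure developed in this talk, and the function name is hypothetical.

```python
import math


def bic_score(log_likelihood: float, num_cpt_entries: int, num_examples: int) -> float:
    """Two-part structure score (standard BIC, higher is better).

    log_likelihood  : natural-log likelihood of the training data under the
                      fitted network (the accuracy part)
    num_cpt_entries : total CPT entries in the network, used here as the
                      number of free parameters (the expressiveness part)
    num_examples    : number of training examples N
    """
    penalty = 0.5 * num_cpt_entries * math.log(num_examples)
    return log_likelihood - penalty
```

For two structures that fit the training data equally well, the one with fewer CPT entries scores higher. The ln N penalty comes from a large-sample approximation, which is consistent with the concern on this slide about needing sufficiently many examples.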
27 February 2001 What is Confidence? Slide 8
Confidence
• Recall our needs:
  – Given very few examples, we want a
  – Quantitative measure of overfitting that is
  – As exact as possible
• Intuitive definition of “confidence” of a given BN structure
  – Probability that we have seen enough examples to either accept or reject this structure
• Confidence and accuracy
  – Low confidence: need more examples
  – High confidence, low accuracy: reject this structure
  – High confidence, high accuracy: accept this structure
• Sadly, there is not enough time to cover my definition of confidence for an inverted-tree Bayesian network classifier
  – Coincidentally, I have run into certain technical difficulties in realizing a practical algorithm for evaluating this confidence
  – See me afterward for discussion
27 February 2001 What is Confidence? Slide 9
A New Problem Domain
• Goal at the end of this section:
  – Motivate a quantitative definition of confidence of a single-node “Bayesian network” (each example has no other features except for its Boolean label)
• Coin-flipping domain
  – k coins
    • Coin i has weight $w_i$ (probability of getting heads)
    • One of these coins is picked at random (prior probability of picking coin i is $p_i$)
  – This coin is flipped N times, and we observe heads H times
  – Assuming we know the $w_i$’s, $p_i$’s, H, N, and the experimental setup, we can calculate the probability that the next toss of the coin is heads
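$$p_{\text{heads}} = \sum_{i=1}^{k} w_i \, \frac{\binom{N}{H} w_i^{H} (1-w_i)^{N-H} p_i}{\sum_{j=1}^{k} \binom{N}{H} w_j^{H} (1-w_j)^{N-H} p_j} = \sum_{i=1}^{k} w_i \, \frac{w_i^{H} (1-w_i)^{N-H} p_i}{\sum_{j=1}^{k} w_j^{H} (1-w_j)^{N-H} p_j}$$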
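The calculation on this slide can be sketched in a few lines of Python. The function names are assumptions made for illustration. The code applies Bayes' rule to get the posterior probability of each coin given H heads in N tosses, then averages the coin weights under that posterior.

```python
def coin_posteriors(weights, priors, heads, tosses):
    """Posterior probability of each coin given `heads` heads in `tosses` tosses.

    weights : probability of heads for each coin (the w_i)
    priors  : prior probability of picking each coin (the p_i)
    The binomial coefficient C(N, H) is identical for every coin, so it
    cancels between numerator and denominator and is omitted.
    """
    joint = [w ** heads * (1 - w) ** (tosses - heads) * p
             for w, p in zip(weights, priors)]
    total = sum(joint)
    if total == 0:
        raise ValueError("observed data has zero probability under every coin")
    return [x / total for x in joint]


def p_heads(weights, priors, heads, tosses):
    """Probability that the next toss is heads: posterior-weighted average of w_i."""
    post = coin_posteriors(weights, priors, heads, tosses)
    return sum(w * q for w, q in zip(weights, post))
```

For example, with two equally likely coins of weight 0.25 and 0.75 and a single observed head (H = 1, N = 1), the posteriors are 0.25 and 0.75, and `p_heads` returns 0.625.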
27 February 2001 What is Confidence? Slide 10
Confidence in Coin Flipping
• How do we define confidence?
• First, we need to define our decision algorithm
• In this case it’s easy:
  – if $p_{\text{heads}} \ge 0.5$, then we predict “heads”
  – if $p_{\text{heads}} < 0.5$, then we predict “tails”
• Define confidence as follows:
  – Our confidence in our decision is the probability that if we saw an arbitrarily large (infinite) number of tosses, we would still make the same decision
  – Seeing an infinite number of tosses is tantamount to knowing what the weight of the coin ($w_i$) is
  – In other words, confidence = Prob(make the same decision | know the coin’s weight)
• Subtle point: we don’t know the coin’s weight, but we speculate that we do
  – Alternative POV: say Tasha is in the next room. She knows everything we know ($w_i$’s, $p_i$’s, H, N, experimental setup). In addition, she knows the weight of the coin ($w_i$) that was picked. Her decision is likewise simple:
    • if $w_i \ge 0.5$, then predict “heads”
    • if $w_i < 0.5$, then predict “tails”
  – Then we can restate the definition:
    • confidence = Prob(we make the same decision as Tasha)
27 February 2001 What is Confidence? Slide 11
An Equation for Confidence
• To repeat:
  – confidence = Prob(we make the same decision as Tasha)
• WLOG, let’s say, after calculating $p_{\text{heads}}$ = Prob(heads | H, N), we pick heads (i.e., $p_{\text{heads}} \ge 0.5$)
  – Then confidence = Prob(Tasha also picked heads)
  – In other words:
$$\text{confidence} = \sum_{i \,:\, w_i \ge 0.5} P(\text{coin } i \mid H, N)$$
  – where
$$P(\text{coin } i \mid H, N) = \frac{\binom{N}{H} w_i^{H} (1-w_i)^{N-H} p_i}{\sum_{j=1}^{k} \binom{N}{H} w_j^{H} (1-w_j)^{N-H} p_j}$$
  – Recall
$$p_{\text{heads}} = \sum_{i=1}^{k} w_i \, P(\text{coin } i \mid H, N)$$
• Thus if we define a random variable X that takes the value $w_i$ with probability P(coin i | H, N), then $p_{\text{heads}}$ = E(X) and confidence = Prob(X ≥ 0.5)
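A minimal Python sketch of this definition follows, with hypothetical function and variable names. It computes the posterior over coins, makes our decision from the posterior mean, and sums the posterior mass of the coins for which Tasha, who knows the true weight, would make the same decision. The slide treats only the "we pick heads" case (WLOG); the "tails" branch here is the symmetric counterpart.

```python
def coin_confidence(weights, priors, heads, tosses):
    """Return (predict_heads, p_heads, confidence) for the coin-flipping domain.

    weights : probability of heads for each coin (the w_i)
    priors  : prior probability of picking each coin (the p_i)
    """
    # Posterior P(coin i | H, N); the common factor C(N, H) cancels.
    joint = [w ** heads * (1 - w) ** (tosses - heads) * p
             for w, p in zip(weights, priors)]
    total = sum(joint)
    if total == 0:
        raise ValueError("observed data has zero probability under every coin")
    post = [x / total for x in joint]

    # Our decision: based on the posterior mean E(X).
    p_heads = sum(w * q for w, q in zip(weights, post))
    predict_heads = p_heads >= 0.5

    # Tasha's decision for coin i is "heads" iff w_i >= 0.5; confidence is the
    # posterior probability that her decision matches ours.
    confidence = sum(q for w, q in zip(weights, post)
                     if (w >= 0.5) == predict_heads)
    return predict_heads, p_heads, confidence
```

Note that the decision depends on the mean of X while the confidence depends on how much of its probability mass lies on the same side of 0.5, so a prediction can be correct on average and still carry low confidence.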
27 February 2001 What is Confidence? Slide 12
Returning to Single-Node Network
• Coin-flipping is a discrete domain (discrete set of coins)
• Results generalize to the continuous case
• Consider our Boolean-valued labeled examples
  – Underlying distribution (which we are trying to learn): a single number $w_0$, which is the probability that a given example will be labeled true
  – We observe N examples, with H of them labeled true
• Let W be a random variable corresponding to the prior distribution of the weight $w_0$
• Let X be a random variable representing the posterior distribution of the weight $w_0$ given H and N
• It can be shown that if W has a beta distribution, then X also has a beta distribution. In particular, if W is uniform (we have no information about the prior probability), then X ~ Beta(H+1, N−H+1)
• From the properties of a beta distribution, we see that
$$\Pr(\text{true}) = E(X) = \frac{H+1}{N+2}$$
• Note this is not H/N!
• As before, confidence = Prob(X ≥ 0.5)
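For the uniform-prior case, this confidence can be evaluated exactly with the Python standard library. The sketch below assumes the usual identity linking a beta CDF with integer parameters to a binomial tail, which for X ~ Beta(H+1, N−H+1) gives Prob(X ≥ 0.5) = Σ_{j=0}^{H} C(N+1, j) / 2^{N+1}. The function names are hypothetical, and the "predict false" branch is the symmetric counterpart of the case stated on the slide.

```python
from fractions import Fraction
from math import comb


def posterior_mean(h: int, n: int) -> Fraction:
    """E(X) for X ~ Beta(h + 1, n - h + 1), i.e. (H + 1) / (N + 2)."""
    return Fraction(h + 1, n + 2)


def prob_weight_at_least_half(h: int, n: int) -> Fraction:
    """Exact Prob(X >= 0.5) for X ~ Beta(h + 1, n - h + 1).

    Uses the beta/binomial identity for integer parameters:
    Prob(X >= 1/2) = sum_{j=0}^{h} C(n + 1, j) / 2^(n + 1).
    """
    return Fraction(sum(comb(n + 1, j) for j in range(h + 1)), 2 ** (n + 1))


def confidence(h: int, n: int):
    """Return (predict_true, confidence) after h 'true' labels in n examples."""
    predict_true = posterior_mean(h, n) >= Fraction(1, 2)
    p_ge = prob_weight_at_least_half(h, n)
    return predict_true, (p_ge if predict_true else 1 - p_ge)
```

For example, after three examples all labeled true (H = 3, N = 3), the estimate is Pr(true) = 4/5 rather than H/N = 1, and the confidence in predicting "true" is 15/16. With no examples at all (N = 0) the confidence is exactly 1/2.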
27 February 2001 What is Confidence? Slide 13
Final Notes
• We have made few assumptions about the data (for example, N can be small)
• We have come up with an exact, quantitative expression for confidence (although it may be difficult to evaluate)
• Analysis extends (not trivially) to multivariate case (more than one node in BN)
• Defining confidence can be an important first step to dealing with overfitting when given few examples (I haven’t shown the next few steps)
27 February 2001 What is Confidence? Slide 14
Summary
• Overfitting is bad
  – Overfitting is an issue any time we do learning from examples
  – Often we make assumptions which allow us to assume we don't overfit
  – At the very least, we should be aware of these assumptions when we do learning
• Too much expressiveness is bad
  – Limiting expressiveness (introducing bias) not only helps to reduce the number of examples needed to learn, but also reduces tendency to overfit
• You can quantify overfitting
  – I'm not aware of any other efforts in this direction, but it is doable and may prove useful, especially in reducing reliance on assumptions
  – To do so, you must clearly define your learning goals (not just the concept to be learned)
  – In this presentation, we define and use "confidence"