27 february 2001what is confidence?slide 1 what is confidence? how to handle overfitting when given...

27 February 2001 What is Confidence? Slide 1

What is Confidence?How to Handle Overfitting When Given

Few Examples

Top Changwatchai

AIML seminar

27 February 2001

(based on my ongoing research, Fall 2000)


Overview• The problem of overfitting• Bayesian network models• Defining confidence


A Learning Problem• Say we want to learn a classifier

– fixed distribution of examples, each drawn independently

– example consists of set of features

– given a set of labeled examples drawn from the distribution

• Is our primary goal to fit these examples as well as possible?– No!

– We want to fit the underlying distribution

• Overfitting– finding a hypothesis which fits the training examples better than

some other hypothesis, but which fits the underlying distribution worse than that hypothesis

• Moral– focusing only on fitting the training data exposes you to the danger

of overfitting


Handling Overfitting• Approaches

– Assume you have enough examples• Statistical anomalies are minimized

– Smoothing (handling unlikely examples)– Other statistical methods, such as partitioning into

training/test data– Other heuristics/assumptions

• My constraints– Very small number of examples

• Shh! Ultimate goal is incorporating domain knowledge...

– If approximations are necessary, make only those that can be directly, quantitatively justified

– Want a quantitative measure of overfitting


Bayesian Networks• Notes

– BN’s are simply an example application– Focusing on inverted-tree BN’s used as classifiers

• Quick overview– Nodes represent features (assume Boolean)

• Bottom node represents label

– Links represent direct dependence• Absence of link represents lack of direct dependence

– Conditional Probability Tables (CPT’s)• Each node has 2Pa entries (Pa = # parents)

– It is fairly straightforward to:• Learn CPT entries

• Perform inference


Bayesian Network Structures• Structure determines expressiveness

– More parents per node = more expressive

– Directly related to the total number of CPT entries in the BN

• “Bayesian networks don’t overfit”– Given:

• A BN structure• A set of training examples

– There is a way of choosing CPT entries which fits the training examples as well as possible

– Since we’re given the structure, we must assume that the best fit to the training examples is also the best fit to the underlying distribution

• However:– Manually building BN structures is a lot of work

– We’d like to not only learn the CPT entries, but also learn the correct structures


BN’s and Overfitting• Choosing the “correct” structure is where overfitting becomes a problem

• If the goal is to maximize accuracy on the training data, then we always prefer more expressive networks

– In our inverted-tree classifiers, it would be a naïve Bayes structure

• Unfortunately, the more expressive the network, the greater the tendency to overfit for a fixed number of training examples

• Intuition:– Fitting curves to data points– “Spending” examples to increase confidence

• Current approaches in addressing overfitting– BIC, AIC, MDL, etc.– Each network structure is given a two-part “score”

• Accuracy (the more accurate, the better)

• Expressiveness (the fewer CPT entries, the better)

– I think these rely on assumption that we have sufficiently many examples


Confidence• Recall our needs:

– Given very few examples, we want a– Quantitative measure of overfitting that is– As exact as possible

• Intuitive definition of “confidence” of a given BN structure– Probability that we have seen enough examples to either accept or reject

this structure

• Confidence and accuracy– Low confidence: need more examples– High confidence, low accuracy: reject this structure– High confidence, high accuracy: accept this structure

• Sadly, there is not enough time to cover my definition of confidence for an inverted-tree Bayesian network classifier

– Coincidentally, I have run into certain technical difficulties in realizing a practical algorithm for evaluating this confidence

– See me afterward for discussion


A New Problem Domain• Goal at the end of this section:

– Motivate a quantitative definition of confidence of a single-node “Bayesian network” (each example has no other features except for its Boolean label)

• Coin-flipping domain– k coins

• Coin i has weight wi (probability of getting heads)

• One of these coins is picked at random (prior probability of picking coin i is pi)

– This coin is flipped N times, and we observe heads H times

– Assuming we know the wi’s, pi’s, H, N, and the experimental setup, we can calculate the probability that the next toss of the coin is heads

k

ik

jj

HNj

Hj

iHN

iHi

k

ik

jj

HNj

Hj

iHN

iHi

iheads

pww

pww

pwwH

N

pwwH

N

wp

1

1

1

1

1

1

1

1

1


Confidence in Coin Flipping• How do we define confidence?• First, we need to define our decision algorithm• In this case it’s easy:

– if pheads 0.5, then we predict “heads”

– if pheads < 0.5, then we predict “tails”

• Define confidence as follows:– Our confidence in our decision is the probability that if we saw an arbitrarily large

(infinite) number of tosses, we would still make the same decision

– Seeing an infinite number of tosses is tantamount to knowing what the weight of the coin (wi) is

– In other words, confidence Prob(make the same decision | know the coin’s weight)

• Subtle point: we don’t know the coin’s weight, but we speculate that we do– Alternative POV: say Tasha is in the next room. She knows everything we know (wi’s,

pi’s, H, N, experimental setup). In addition, she knows the weight of the coin (wi) that was picked. Her decision is likewise simple:

• if wi 0.5, then predict “heads”

• if wi < 0.5, then predict “tails”

– Then we can restate the definition:• confidence = Prob(we make the same decision as Tasha)


An Equation for Confidence• To repeat:

– confidence = Prob(we make the same decision as Tasha)

• WLOG, let’s say, after calculating pheads = Prob(heads | H, N), we pick heads (i.e., pheads 0.5)

– Then confidence = Prob(Tasha also picked heads)– In other words:

– where

– Recall

• Thus if we define a random variable X mapping wi to P(coin i | H, N), then pheads = E(X) and confidence = Prob(X 0.5)

k

jj

HNj

Hj

iHN

iHi

pwwH

N

pwwH

N

NHiP

1

1

1

,|coin

5.0:

,|coin iwi

NHiPconfidence

k

iiheads NHiPwp

1

,|coin


Returning to Single-Node Network• Coin-flipping is a discrete domain (discrete set of coins)• Results generalize to the continuous case• Consider our Boolean-valued labeled examples

– Underlying distribution (which we are trying to learn): a single number w0, which is the probability that a given example will be labeled true

– We observe N examples, with H of them labeled true

• Let W be a random variable corresponding to the prior probability of the weight w0

• Let X be a random variable representing the posterior probability of the weight w0 given H and N

• It can be shown that if W has a beta distribution, then X also has a beta distribution. In particular, if W is uniform (we have no information about the prior probability), then Xbeta(H+1, N-H+1)

• From the properties of a beta distribution, we see that

• Note this is not H/N!• As before, confidence = Prob(X 0.5)

2

1)Pr(

N

HXEtrue


Final Notes• We have made few assumptions about the data (for

example, N can be small)• We have come up with an exact, quantitative

expression for confidence (although it may be difficult to evaluate)

• Analysis extends (not trivially) to multivariate case (more than one node in BN)

• Defining confidence can be an important first step to dealing with overfitting when given few examples (I haven’t shown the next few steps)


Summary• Overfitting is bad

– Overfitting is an issue any time we do learning from examples– Often we make assumptions which allow us to assume we don't overfit– At the very least, we should be aware of these assumptions when we do

learning

• Too much expressiveness is bad– Limiting expressiveness (introducing bias) not only helps to reduce the

number of examples needed to learn, but also reduces tendency to overfit

• You can quantify overfitting– I'm not aware of any other efforts in this direction, but it is doable and may

prove useful, especially in reducing reliance on assumptions– To do so, you must clearly define your learning goals (not just the concept

to be learned)– In this presentation, we define and use "confidence"

27 february 2001what is confidence?slide 1 what is confidence? how to handle overfitting when given...

Documents