CS583, Bing Liu, UIC
Chapter 3: Combining Classifiers
From “Web Data Mining”, by Bing Liu (UIC),
Springer Verlag, 2007
Outline
Ensemble methods: Bagging and Boosting
Fully supervised learning (traditional classification)
Partially (semi-) supervised learning (or classification)
Learning with a small set of labeled examples and a large set of unlabeled examples (LU learning)
Combining classifiers
So far, we have only discussed individual classifiers, i.e., how to build them and use them.
Can we combine multiple classifiers to produce a better classifier?
Yes, sometimes
We discuss two main algorithms:
Bagging
Boosting
Bagging
Breiman, 1996
Bootstrap Aggregating = Bagging
Application of bootstrap sampling
Given: a set D containing m training examples.
Create a sample S[i] of D by drawing m examples at random with replacement from D.
S[i] is of size m and is expected to leave out about 0.37 (≈ 1/e) of the examples in D.
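A quick numeric check of the 0.37 figure: the probability that a given example is never drawn in m draws with replacement is (1 - 1/m)^m, which approaches 1/e ≈ 0.368 as m grows.

```python
# The chance that a fixed example is left out of a bootstrap sample of
# size m is (1 - 1/m)^m, which tends to 1/e as m grows.
import math

for m in (10, 100, 1000, 10000):
    p_left_out = (1 - 1 / m) ** m
    print(m, round(p_left_out, 4))

print(round(math.exp(-1), 4))  # limiting value 1/e, about 0.3679
```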
Bagging (cont…)
Training
Create k bootstrap samples S[1], S[2], …, S[k].
Build a distinct classifier on each S[i] to produce k classifiers, using the same learning algorithm.
Testing
Classify each new instance by voting of the k classifiers (equal weights).
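The train/test recipe above can be sketched end to end. This is a minimal illustration assuming a toy 1-D dataset and a decision-stump base learner; neither comes from the slides.

```python
# Sketch of bagging: k bootstrap samples, one classifier per sample,
# equal-weight majority vote at test time.
import random

def stump_learner(sample):
    """Fit a one-split decision stump on (x, y) pairs with y in {0, 1}."""
    best = None
    for t in sorted({x for x, _ in sample}):
        for left in (0, 1):
            correct = sum((left if x <= t else 1 - left) == y for x, y in sample)
            if best is None or correct > best[0]:
                best = (correct, t, left)
    _, t, left = best
    return lambda x, t=t, left=left: left if x <= t else 1 - left

def bagging_train(D, k, learner, rng):
    """Training: k bootstrap samples of size m = |D|, one classifier each."""
    m = len(D)
    return [learner([rng.choice(D) for _ in range(m)]) for _ in range(k)]

def bagging_predict(classifiers, x):
    """Testing: equal-weight majority vote of the k classifiers."""
    votes = sum(h(x) for h in classifiers)
    return int(votes * 2 >= len(classifiers))

rng = random.Random(0)
D = [(x, int(x > 5)) for x in range(11)]      # toy separable data
ensemble = bagging_train(D, k=15, learner=stump_learner, rng=rng)
print([bagging_predict(ensemble, x) for x in (0, 10)])
```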
Bagging Example
Original:       1 2 3 4 5 6 7 8
Training set 1: 2 7 8 3 7 6 3 1
Training set 2: 7 8 5 6 4 2 7 1
Training set 3: 3 6 2 7 5 6 2 2
Training set 4: 4 5 1 4 6 4 3 8
Bagging (cont …)
When does it help?
When the learner is unstable: a small change to the training set causes a large change in the output classifier.
True for decision trees and neural networks; not true for k-nearest neighbor, naïve Bayesian, and class association rules.
Experimentally, bagging can help substantially for unstable learners, and may somewhat degrade results for stable learners.
Bagging Predictors, Leo Breiman, 1996
Boosting
A family of methods: We only study AdaBoost (Freund & Schapire, 1996)
Training: produce a sequence of classifiers (using the same base learner).
Each classifier depends on the previous one and focuses on the previous one's errors.
Examples that are incorrectly predicted by previous classifiers are given higher weights.
Testing: for a test case, the results of the series of classifiers are combined to determine its final class.
AdaBoost
Weighted training set:
(x1, y1, w1)
(x2, y2, w2)
…
(xn, yn, wn)
Non-negative weights that sum to 1.
Build a classifier ht whose accuracy on the training set is > ½ (better than random); such an ht is called a weak classifier.
Change the weights and repeat.
AdaBoost algorithm
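The algorithm on this slide was an image in the transcript. A hedged sketch of AdaBoost (Freund & Schapire, 1996) with decision stumps as the weak learner, on an assumed toy dataset with labels in {-1, +1}:

```python
# AdaBoost sketch: maintain weights over examples, fit a weak learner on
# the weighted data each round, and upweight the examples it gets wrong.
import math

def weighted_stump(X, y, w):
    """Pick the (threshold, polarity) stump with the lowest weighted error."""
    best = None
    for t in sorted(set(X)):
        for pol in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(X, y, w)
                      if (pol if xi <= t else -pol) != yi)
            if best is None or err < best[0]:
                best = (err, t, pol)
    err, t, pol = best
    return err, (lambda x, t=t, pol=pol: pol if x <= t else -pol)

def adaboost(X, y, T=10):
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []                      # list of (alpha_t, h_t)
    for _ in range(T):
        err, h = weighted_stump(X, y, w)
        if err >= 0.5:                 # weak learner must beat random guessing
            break
        err = max(err, 1e-10)          # guard against division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        # Increase weights of misclassified examples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * h(xi)) for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]       # renormalize: weights sum to 1
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, x):
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

X = [0, 1, 2, 3, 4, 5, 6, 7]
y = [-1, -1, -1, 1, 1, 1, -1, 1]       # not separable by any single stump
model = adaboost(X, y, T=20)
print(sum(predict(model, xi) == yi for xi, yi in zip(X, y)), "of", len(X), "correct")
```

No single stump classifies this toy set perfectly, but the weighted vote of successive stumps does much better than any one of them.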
Bagging, Boosting and C4.5
C4.5's mean error rate over the 10 cross-validations.
Bagged C4.5 vs. C4.5.
Boosted C4.5 vs. C4.5.
Boosting vs. Bagging.
Does AdaBoost always work?
The actual performance of boosting depends on the data and the base learner.
Like bagging, it requires the base learner to be unstable.
Boosting seems to be susceptible to noise: when the number of outliers is very large, the emphasis placed on the hard examples can hurt performance.
C4.5 and Boosting
Boosting over Reuters
Source: A Short Introduction to Boosting (Freund & Schapire, 1999)
http://www.site.uottawa.ca/~stan/csi5387/boost-tut-ppr.pdf
Chapter 5: Partially-Supervised Learning
Learning from a small labeled set and a large unlabeled set (LU learning)
Unlabeled Data
One of the bottlenecks of classification is the labeling of a large set of examples (data records or text documents).
Often done manually; time consuming.
Can we label only a small number of examples and make use of a large number of unlabeled examples to learn?
Possible in many cases.
Why are unlabeled data useful?
Unlabeled data are usually plentiful, labeled data are expensive.
Unlabeled data provide information about the joint probability distribution over words and collocations (in texts).
We will use text classification to study this problem.
[Figure: a small set of labeled documents (ClassLabel: Positive), all containing the word "homework", shown next to unlabeled documents that contain both "homework" and "lecture".]
Documents containing "homework" tend to belong to the positive class, so the unlabeled documents suggest that the co-occurring word "lecture" also indicates the positive class.
How to use unlabeled data
One way is to use the EM algorithm.
EM: Expectation-Maximization.
The EM algorithm is a popular iterative algorithm for maximum likelihood estimation in problems with missing data.
It consists of two steps:
Expectation step: filling in the missing data.
Maximization step: calculating a new maximum a posteriori estimate for the parameters.
Incorporating Unlabeled Data with EM (Nigam et al., 2000)
Basic EM
Augmented EM with weighted unlabeled data
Augmented EM with multiple mixture components per class
Algorithm Outline
1. Train a classifier with only the labeled documents.
2. Use it to probabilistically classify the unlabeled documents.
3. Use ALL the documents to train a new classifier.
4. Iterate steps 2 and 3 to convergence.
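A minimal sketch of these four steps, substituting a 1-D Gaussian class-conditional model for the slides' naive Bayes text model (an illustrative assumption):

```python
# Semi-supervised EM sketch: labeled points keep hard labels; unlabeled
# points receive soft class responsibilities that are re-estimated each round.
import math

def fit(points):
    """points: list of (x, resp_class0, resp_class1); weighted prior/mean/var per class."""
    params = []
    total = sum(r0 + r1 for _, r0, r1 in points)
    for c in (1, 2):                     # tuple index of each class's responsibility
        wsum = sum(p[c] for p in points)
        mean = sum(p[0] * p[c] for p in points) / wsum
        var = sum(p[c] * (p[0] - mean) ** 2 for p in points) / wsum + 1e-3
        params.append((wsum / total, mean, var))
    return params

def posterior(params, x):
    """P(class | x) for both classes (the E step on one example)."""
    likes = [pr * math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(var)
             for pr, mu, var in params]
    z = sum(likes)
    return [l / z for l in likes]

labeled = [(0.0, 0), (1.0, 0), (9.0, 1), (10.0, 1)]        # small labeled set
unlabeled = [0.5, 1.5, 2.0, 8.0, 8.5, 9.5]                 # large unlabeled set

# Step 1: train with only the labeled examples (hard responsibilities).
data = [(x, 1.0 - y, float(y)) for x, y in labeled]
params = fit(data)
for _ in range(10):                                        # step 4: iterate
    # Step 2: probabilistically classify the unlabeled examples.
    soft = [(x, *posterior(params, x)) for x in unlabeled]
    # Step 3: use ALL examples to train a new classifier.
    params = fit(data + soft)

print([round(posterior(params, x)[1], 2) for x in (1.0, 9.0)])
```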
Basic Algorithm
Basic EM: E Step & M Step
E Step:
M Step:
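The E-step and M-step formulas on this slide were images in the original transcript. For the naive Bayes text model of Nigam et al. (2000) they take the following standard form (a sketch reconstructed from that paper, with N_{ti} the count of word w_t in document d_i, V the vocabulary, C the set of classes, and D the document set):

```latex
% E step: probabilistically label each document
P(c_j \mid d_i; \hat{\Theta}) =
  \frac{\hat{P}(c_j)\,\prod_{k}\hat{P}(w_{d_i,k} \mid c_j)}
       {\sum_{r}\hat{P}(c_r)\,\prod_{k}\hat{P}(w_{d_i,k} \mid c_r)}

% M step: re-estimate the parameters (with Laplace smoothing)
\hat{P}(w_t \mid c_j) =
  \frac{1 + \sum_{i} N_{ti}\, P(c_j \mid d_i)}
       {|V| + \sum_{s}\sum_{i} N_{si}\, P(c_j \mid d_i)}
\qquad
\hat{P}(c_j) = \frac{1 + \sum_{i} P(c_j \mid d_i)}{|C| + |D|}
```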
The problem
It has been shown that the EM algorithm in Fig. 5.1 works well if the two mixture model assumptions hold for a particular data set.
The two mixture model assumptions, however, can cause major problems when they do not hold; in many real-life situations they may be violated.
It is often the case that a class (or topic) contains a number of sub-classes (or sub-topics). For example, the class Sports may contain documents about different sub-classes of sports: Baseball, Basketball, Tennis, and Softball.
Some methods to deal with the problem follow.
Weighting the influence of unlabeled examples by a factor λ
New M step:
The prior probability also needs to be weighted.
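The new M-step formula was likewise an image in the transcript. A sketch of the weighted word-probability update, assuming the λ-weighting of Nigam et al. (2000), where the counts from unlabeled documents U are discounted by λ relative to labeled documents L:

```latex
\hat{P}(w_t \mid c_j) =
  \frac{1 + \sum_{d_i \in L} N_{ti}\, P(c_j \mid d_i)
          + \lambda \sum_{d_i \in U} N_{ti}\, P(c_j \mid d_i)}
       {|V| + \sum_{s=1}^{|V|}\Bigl(\sum_{d_i \in L} N_{si}\, P(c_j \mid d_i)
          + \lambda \sum_{d_i \in U} N_{si}\, P(c_j \mid d_i)\Bigr)}
```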
Experimental Evaluation
Newsgroup postings: 20 newsgroups, 1000 documents/group.
Web page classification: student, faculty, course, project; 4,199 web pages.
Reuters newswire articles: 12,902 articles, 10 main topic categories.
20 Newsgroups
Another approach: Co-training
Again, learning with a small labeled set and a large unlabeled set.
The attributes describing each example (instance) can be partitioned into two subsets, each of which is sufficient for learning the target function.
E.g., hyperlinks and page contents in Web page classification.
Two classifiers can be learned from the same data.
Co-training Algorithm [Blum and Mitchell, 1998]
Given: labeled data L, unlabeled data U
Loop:
Train h1 (e.g., hyperlink classifier) using L
Train h2 (e.g., page classifier) using L
Allow h1 to label p positive, n negative examples from U
Allow h2 to label p positive, n negative examples from U
Add these most confident self-labeled examples to L
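A sketch of the loop above, assuming toy two-view numeric features and a nearest-centroid base learner in place of real hyperlink and page classifiers:

```python
# Co-training sketch: each classifier sees only its own view of the data,
# labels its most confident unlabeled examples, and grows the labeled set.

def centroid_classifier(view, L):
    """Train a nearest-centroid classifier on one view of labeled (features, y) pairs."""
    pos = [f[view] for f, y in L if y == 1]
    neg = [f[view] for f, y in L if y == 0]
    cp, cn = sum(pos) / len(pos), sum(neg) / len(neg)
    def h(f):
        dp, dn = abs(f[view] - cp), abs(f[view] - cn)
        conf = abs(dn - dp)                  # margin between centroid distances
        return (1 if dp < dn else 0), conf
    return h

def cotrain(L, U, rounds=3, p=1, n=1):
    L, U = list(L), list(U)
    for _ in range(rounds):
        h1 = centroid_classifier(0, L)       # e.g., the hyperlink view
        h2 = centroid_classifier(1, L)       # e.g., the page-content view
        for h in (h1, h2):
            if not U:
                break
            # Let h label its p most confident positives and n most
            # confident negatives from U, then move them into L.
            scored = sorted(U, key=lambda f: -h(f)[1])
            picked_pos = [f for f in scored if h(f)[0] == 1][:p]
            picked_neg = [f for f in scored if h(f)[0] == 0][:n]
            for f in picked_pos + picked_neg:
                L.append((f, h(f)[0]))
                U.remove(f)
    return L

labeled = [((0.0, 0.1), 0), ((9.0, 8.9), 1)]                 # tiny labeled set
unlabeled = [(1.0, 0.8), (0.5, 1.2), (8.5, 9.1), (9.5, 8.0)] # unlabeled pool
grown = cotrain(labeled, unlabeled)
print(sorted((round(f[0], 1), y) for f, y in grown))
```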
Co-training: Experimental Results
Begin with 12 labeled web pages (academic course).
Provide 1,000 additional unlabeled web pages.
Average error with labeled data only: 11.1%; average error with co-training: 5.0%.
                      Page-based   Link-based   Combined
                      classifier   classifier   classifier
Supervised training      12.9         12.4         11.1
Co-training               6.2         11.6          5.0
When the generative model is not suitable
Multiple Mixture Components per Class (M-EM): e.g., a class may consist of a number of sub-topics or clusters.
Results of an example using the 20 newsgroup data (40 labeled, 2360 unlabeled, and 1600 test documents):
Accuracy: NB 68%, EM 59.6%.
Solutions:
M-EM (Nigam et al., 2000): cross-validation on the training data to determine the number of components.
Partitioned-EM (Cong et al., 2004): uses hierarchical clustering; it does significantly better than M-EM.
Summary
Using unlabeled data can improve the accuracy of a classifier when the data fit the generative model.
Partitioned-EM and the EM classifier based on a multiple-mixture-components model (M-EM) are more suitable for real data, where one class often contains multiple mixture components.
Co-training is another effective technique when redundantly sufficient features are available.
Further Topics
Learning from Positive and Unlabeled examples (PU learning).
Graph-based methods for semi-supervised learning: labeled and unlabeled examples are nodes in a graph.
Mincut: view the labeling of the unlabeled examples as a graph partitioning process (polynomial time).
Spectral Graph Transducer: map the graph partition into a minimization problem and apply eigenvector analysis to find the best solutions. Parameters: balancing factors between P and U instances.
ICML '07 tutorial (by Jerry Zhu) at: http://pages.cs.wisc.edu/~jerryzhu/icml07tutorial.html