Multiple-Instance Learning
Paper 1: A Framework for Multiple-Instance Learning [Maron and Lozano-Perez, 1998]
Paper 2: EM-DD: An Improved Multiple-Instance Learning Technique [Zhang and Goldman, 2001]


Page 1:

Multiple-Instance Learning

Paper 1: A Framework for Multiple-Instance Learning [Maron and Lozano-Perez, 1998]

Paper 2: EM-DD: An Improved Multiple-Instance Learning Technique [Zhang and Goldman, 2001]

Page 2:

Multiple-Instance Learning (MIL)

A variation on supervised learning.

Supervised learning: each training example is individually labeled.

MIL: each training example is a set (or bag) of instances along with a single label equal to the maximum label among all instances in the bag. With Boolean labels, this means a bag is positive iff at least one of its instances is positive.

Goal: to learn to accurately predict the label of previously unseen bags.

Page 3:

MIL Setup

Training data: D = {<B_1, l_1>, ..., <B_m, l_m>}, i.e., m bags, where bag B_i has label l_i.

Boolean labels: positive bags are written B_i^+ and negative bags B_i^-.

If bag B_i^+ = {B_i1^+, ..., B_ij^+, ..., B_in^+}, then B_ij^+ is the jth instance in B_i^+, and B_ijk^+ is the value of the kth feature of instance B_ij^+.

Real-valued labels: l_i = max(l_i1, l_i2, ..., l_in).
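As a concrete illustration (not taken from the slides), a Boolean-labeled MIL training set can be represented as a list of (bag, label) pairs; the data values below are made up:

```python
import numpy as np

# A bag is a 2-D array whose rows are instances and whose columns are features;
# a training set is a list of (bag, label) pairs. Toy data with d = 2 features.
bags = [
    (np.array([[0.9, 1.1], [5.0, 5.0]]), 1),   # positive bag: one instance lies near (1, 1)
    (np.array([[1.2, 0.8], [7.0, 2.0]]), 1),   # positive bag: also has an instance near (1, 1)
    (np.array([[5.0, 5.0], [7.0, 2.0]]), 0),   # negative bag: no instance near (1, 1)
]

# With Boolean labels the bag label is the maximum (logical OR) of its instance labels,
# so a bag is positive iff at least one of its instances is positive.
```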

Page 4:

Diverse Density Algorithm [Maron and Lozano-Perez, 1998]

Main idea: find a point in feature space that has high Diverse Density:
- a high density of positive instances (“close” to at least one instance from each positive bag);
- a low density of negative instances (“far” from every instance in every negative bag).

Higher Diverse Density = higher probability of being the target concept.

Page 5:

A Motivating Example for DD

Goal: find an area where there is both a high density of positive points and a low density of negative points.

The difficulty with using regular density, which adds up the contributions of the positive bags and subtracts those of the negative bags, is illustrated in panel (b).

Page 6:

Diverse Density

Assuming that the target concept is a single point t, and x is some point in feature space,

Pr(x = t | B_1^+, ..., B_n^+, B_1^-, ..., B_n^-)    (1)

represents the probability that x is the target concept given the training examples.

We can find t if we maximize the above probability over all points x.

Page 7:

Probabilistic Measure of Diverse Density

Using Bayes’ rule, maximizing (1) is equivalent to maximizing

Pr(B_1^+, ..., B_n^+, B_1^-, ..., B_n^- | x = t)    (2)

Further assuming that the bags are conditionally independent given t, the best hypothesis is

argmax_x ∏_i Pr(B_i^+ | x = t) ∏_i Pr(B_i^- | x = t)    (3)

Page 8:

General Definition of DD

Again using Bayes’ rule, (3) is equivalent to

argmax_x ∏_i Pr(x = t | B_i^+) ∏_i Pr(x = t | B_i^-)    (4)

(assume a uniform prior over concept location)

x will have high Diverse Density if every positive bag has an instance close to x and no negative bags are close to x.
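Spelled out, the chain from (1) to (4) is (a restatement of the reasoning above, with n positive and n negative bags as in the slides):

```latex
\begin{align*}
\hat{t}
 &= \arg\max_x \Pr(x = t \mid B_1^+,\dots,B_n^+, B_1^-,\dots,B_n^-)          && (1)\\
 &= \arg\max_x \Pr(B_1^+,\dots,B_n^+, B_1^-,\dots,B_n^- \mid x = t)          && \text{Bayes' rule, uniform prior over $t$}\ (2)\\
 &= \arg\max_x \prod_i \Pr(B_i^+ \mid x = t)\,\prod_i \Pr(B_i^- \mid x = t)  && \text{bags conditionally independent given $t$}\ (3)\\
 &= \arg\max_x \prod_i \Pr(x = t \mid B_i^+)\,\prod_i \Pr(x = t \mid B_i^-)  && \text{Bayes' rule again, uniform prior}\ (4)
\end{align*}
```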

Page 9:

Noisy-or Model

The causal probability of instance j in bag B_i:

Pr(x = t | B_ij) = exp(-||B_ij - x||^2)

A positive bag's contribution:

Pr(x = t | B_i^+) = Pr(x = t | B_i1^+, B_i2^+, ...) = 1 - ∏_j (1 - Pr(x = t | B_ij^+))

A negative bag's contribution:

Pr(x = t | B_i^-) = Pr(x = t | B_i1^-, B_i2^-, ...) = ∏_j (1 - Pr(x = t | B_ij^-))
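A minimal sketch of how these noisy-or contributions combine into a Diverse Density score, assuming unweighted Euclidean distance and the (bag, label) representation sketched earlier; the function names are illustrative, not from the papers:

```python
import numpy as np

def instance_prob(x, instance):
    """Causal probability Pr(x = t | B_ij) = exp(-||B_ij - x||^2)."""
    return np.exp(-np.sum((np.asarray(instance) - x) ** 2))

def bag_prob(x, bag, positive):
    """Noisy-or contribution of one bag evaluated at candidate point x."""
    p = np.array([instance_prob(x, inst) for inst in bag])
    if positive:
        return 1.0 - np.prod(1.0 - p)   # at least one instance should be close to x
    return np.prod(1.0 - p)             # every instance should be far from x

def log_diverse_density(x, bags):
    """Log of the product in equation (4); working in logs avoids numerical underflow."""
    x = np.asarray(x, dtype=float)
    return sum(np.log(bag_prob(x, bag, label == 1) + 1e-12) for bag, label in bags)
```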

Page 10:

Feature Relevance

“Closeness” depends on the features. Problem: some features might be irrelevant, and others might be more important than the rest.

||B_ij - x||^2 = ∑_k w_k (B_ijk - x_k)^2

Solution: “weight” the features according to their relevance. Find the best weighting of the features by finding the weights that maximize Diverse Density.
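In code, the weighted squared distance above might look like the following sketch; `w` is a nonnegative weight vector that would be optimized jointly with the concept location:

```python
import numpy as np

def weighted_sq_distance(x, instance, w):
    """||B_ij - x||^2 = sum_k w_k * (B_ijk - x_k)^2, where w_k >= 0 scales feature k."""
    return np.sum(np.asarray(w) * (np.asarray(instance) - np.asarray(x)) ** 2)
```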

Page 11:

Label Prediction

Predict the label of an unknown bag B_i under hypothesis t:

Label(B_i | t) = max_j exp[-∑_k (w_k (B_ijk - t_k))^2]

where w_k is a scale factor indicating the importance of feature k.
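A sketch of this prediction rule for a single bag, assuming the concept point t and feature weights w have already been learned (names are illustrative):

```python
import numpy as np

def predict_label(bag, t, w):
    """Label(B_i | t) = max_j exp(-sum_k (w_k * (B_ijk - t_k))^2): the predicted bag label
    is driven by the instance closest to the concept point t under the weighted metric."""
    bag = np.asarray(bag, dtype=float)
    return float(np.max(np.exp(-np.sum((w * (bag - t)) ** 2, axis=1))))
```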

Page 12:

Finding the Maximum DD

Use gradient ascent with multiple starting points.

The maximum DD peak is made up of contributions from some set of positive instances. If we start an ascent from every positive instance, one of them is likely to be closest to the maximum, contribute most to it, and climb directly to it (a code sketch of this multi-start search follows below).

While this heuristic is sensible for maximizing w.r.t. location, maximizing w.r.t. scaling of feature weights may still lead to local maxima.
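A sketch of the multi-start search over concept locations, using an off-the-shelf quasi-Newton optimizer on the negative log-DD; this is an illustration under the unweighted noisy-or model above, not the exact optimization procedure of the paper:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_dd(x, bags):
    """Negative log Diverse Density at candidate point x (unweighted features)."""
    total = 0.0
    for bag, label in bags:
        p = np.exp(-np.sum((np.asarray(bag) - x) ** 2, axis=1))   # Pr(x = t | B_ij) per instance
        bag_p = 1.0 - np.prod(1.0 - p) if label == 1 else np.prod(1.0 - p)
        total -= np.log(bag_p + 1e-12)
    return total

def max_dd(bags):
    """Restart a quasi-Newton ascent from every instance of every positive bag
    and keep the best local optimum found."""
    best = None
    for bag, label in bags:
        if label != 1:
            continue
        for start in np.asarray(bag, dtype=float):
            res = minimize(neg_log_dd, start, args=(bags,), method="L-BFGS-B")
            if best is None or res.fun < best.fun:
                best = res
    return best.x, -best.fun   # location of the DD maximum and its log-DD value
```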

Page 13:

Experiments

Page 14:

Experiments

Figure 3(a) shows the regular density surface for the data set in Figure 2, and it is clear that finding the peak is difficult. Figure 3(b) plots the DD surface, and it is easy to pick out the global maximum, which is the desired concept.

Page 15:

Performance Evaluation

The table below lists the average accuracy over twenty runs, compared with the performance of the two principal algorithms reported in [Dietterich et al., 1997] (iterated-discrim APR and GFS elim-kde APR), as well as the MULTINST algorithm from [Auer, 1997].

Page 16:

EM-DD[Zhang and Goldman, 2001]

In the MIL setting, the label of a bag is determined by the "most positive" instance in the bag, i.e., the one with the highest probability of being positive among all the instances in that bag. The difficulty of MIL comes from the ambiguity of not knowing which instance is the most likely one.

In [Zhang and Goldman, 2001], the knowledge of which instance determines the label of the bag is modeled with a set of hidden variables, which are estimated using an Expectation-Maximization (EM) style approach. The result is an algorithm called EM-DD, which combines this EM-style approach with the DD algorithm.

Page 17:

EM-DD Algorithm

An Expectation-Maximization algorithm [Dempster, Laird and Rubin, 1977].

Start with an initial guess h (which can be obtained using the original DD algorithm), set to some appropriate instance from a positive bag.

E-step: h is used to pick, from each bag, the one instance that is most likely (given the generative model) to be responsible for the bag's label.

M-step: a two-step gradient-ascent search (quasi-Newton search), as in the standard DD algorithm, finds a new h' that maximizes DD(h').
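A schematic of the EM-DD loop under the same unweighted noisy-or model; this is a simplification of the algorithm in the paper (a single optimizer call per M-step, no feature scaling, and a fixed iteration count instead of the paper's stopping criterion):

```python
import numpy as np
from scipy.optimize import minimize

def em_dd(bags, h0, n_iters=10):
    """Alternate between picking one responsible instance per bag (E-step) and
    maximizing the single-instance DD objective over those instances (M-step)."""
    h = np.asarray(h0, dtype=float)
    for _ in range(n_iters):
        # E-step: from each bag, select the instance most likely to determine its label,
        # i.e. the instance closest to the current hypothesis h.
        chosen = [(np.asarray(bag)[np.argmin(np.sum((np.asarray(bag) - h) ** 2, axis=1))], label)
                  for bag, label in bags]

        # M-step: gradient-based search for a new h maximizing DD on the selected instances.
        def neg_log_dd(x):
            total = 0.0
            for inst, label in chosen:
                p = np.exp(-np.sum((inst - x) ** 2))
                total -= np.log(p + 1e-12) if label == 1 else np.log(1.0 - p + 1e-12)
            return total

        h = minimize(neg_log_dd, h, method="L-BFGS-B").x
    return h
```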

Page 18:

Comparison of Performance

Page 19:

Thank you!