MultiClass CS446 Fall ’12

Page 1: Multi-Categorical Output Tasks

Multi-class Classification (y ∈ {1,...,K}): character recognition (‘6’); document classification (‘homepage’)

Multi-label Classification (y ⊆ {1,...,K}): document classification (‘(homepage, facultypage)’)

Category Ranking (y = a ranking over {1,...,K}): user preference (‘love > like > hate’); document classification (‘homepage > facultypage > sports’)

Hierarchical Classification (y ⊆ {1,...,K}, coherent with a class hierarchy): place a document into an index where ‘soccer’ is-a ‘sport’

There are other kinds of rankings, e.g., example ranking. This is, conceptually, more related to simple classification, with the key difference being the loss function: what counts as a good ranking.

Page 2: One-Vs-All

Assumption: Each class can be separated from the rest using a binary classifier in the hypothesis space.

Learning: Decomposed into learning k independent binary classifiers, one corresponding to each class. An example (x, y) is treated as positive for class y and negative for all others. Assume m examples and k class labels with, for simplicity, m/k examples per class. Then classifier f_i sees m/k positive and (k−1)m/k negative examples.

Decision: Winner Takes All (WTA): f(x) = argmax_i f_i(x) = argmax_i v_i · x
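A minimal sketch of this scheme (not from the slides), assuming numpy and a simple mistake-driven linear learner; the function names and the perceptron-style update are illustrative choices, not the lecture's prescribed algorithm:

    import numpy as np

    def train_ova(X, y, k, epochs=10, lr=0.1):
        # Learn k independent binary linear classifiers; row V[i] scores class i.
        V = np.zeros((k, X.shape[1]))
        for i in range(k):
            t = np.where(y == i, 1.0, -1.0)      # class i positive, all others negative
            for _ in range(epochs):
                for xj, tj in zip(X, t):
                    if tj * (V[i] @ xj) <= 0:    # mistake-driven (perceptron-style) update
                        V[i] += lr * tj * xj
        return V

    def predict_wta(V, x):
        # Winner Takes All: f(x) = argmax_i v_i . x
        return int(np.argmax(V @ x))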

Page 3: Solving MultiClass with binary learning

MultiClass classifier: a function f : R^d → {1, 2, 3, ..., k}

Decompose into binary problems

Not always possible to learn. There is no theoretical justification (unless the problem is easy).

Page 4: Learning via One-Versus-All (OvA) Assumption

Find v_r, v_b, v_g, v_y ∈ R^n such that:
v_r · x > 0 iff y = red
v_b · x > 0 iff y = blue
v_g · x > 0 iff y = green
v_y · x > 0 iff y = yellow

Classification: f(x) = argmax_i v_i · x

H = R^{kn}

(Figure: the learned decision regions vs. the real problem.)

Page 5: All-Vs-All

Assumption: There is a separation between every pair of classes using a binary classifier in the hypothesis space.

Learning: Decomposed into learning [k choose 2] ~ k² independent binary classifiers, one corresponding to each pair of classes. For the pair (i, j), an example (x, y) is positive if y = i and negative if y = j; examples of the other classes are not used. Assume m examples, k class labels and, for simplicity, m/k per class. Then classifier f_ij sees m/k positive and m/k negative examples.

Decision: the decision procedure is more involved, since the outputs of the binary classifiers may not cohere. Options:
Majority: classify example x with label i if i wins on it more often than each j (j = 1, …, k).
Use real-valued outputs (or: interpret the outputs as probabilities).
Tournament: start with k/2 pairs; continue with the winners.
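A sketch of the AvA decomposition with the majority-vote decision (illustrative; train_binary is a placeholder for any binary learner that returns a real-valued scoring function):

    import numpy as np
    from itertools import combinations

    def train_ava(X, y, k, train_binary):
        # One classifier f_ij per pair: class i examples positive, class j negative;
        # examples of all other classes are not used.
        F = {}
        for i, j in combinations(range(k), 2):
            mask = (y == i) | (y == j)
            F[(i, j)] = train_binary(X[mask], np.where(y[mask] == i, 1, -1))
        return F

    def predict_majority(F, x, k):
        # Label x with the class that wins the most pairwise contests.
        wins = np.zeros(k)
        for (i, j), f in F.items():
            wins[i if f(x) > 0 else j] += 1
        return int(np.argmax(wins))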

Page 6: Learning via All-Versus-All (AvA) Assumption

Find v_rb, v_rg, v_ry, v_bg, v_by, v_gy ∈ R^d such that:
v_rb · x > 0 if y = red, < 0 if y = blue
v_rg · x > 0 if y = red, < 0 if y = green
... (and so on for all pairs)

(Figures: the individual classifiers and the resulting decision regions.)

H = R^{k²n}

How to classify?

Page 7: Classifying with AvA

Options: a tournament (a tree of pairwise winners) or a majority vote.

(Figure: an example vote of 1 red, 2 yellow, 2 green: which label should win?)

All of these are post-learning decision procedures and can produce inconsistent results.


Page 8: Error Correcting Codes Decomposition

1-vs-all uses k classifiers for k labels; can you use only log₂ k?

Reduce the multi-class classification to random binary problems. Choose a “code word” for each label.

Rows: an encoding of each class (k rows). Columns: L dichotomies of the data, each corresponding to a new binary classification problem. Extreme cases:
1-vs-all: k rows, k columns
Compact code: k rows, log₂ k columns

Each training example is mapped to one example per column: (x, 3) → {(x, P1, +); (x, P2, −); (x, P3, −); (x, P4, +)}

To classify a new example x: evaluate the hypotheses on the L binary problems {(x, P1), …, (x, P4)} and choose the label that is most consistent with the results.

There are nice theoretical results as a function of the performance of the P_i's (depending on the code and its size). Potential problems?

Label  P1  P2  P3  P4
1      −   +   −   +
2      −   +   +   −
3      +   −   −   +
4      +   −   +   +
…
k      −   +   −   −
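A sketch of encoding and decoding with the 4-column code above (the ±1 matrix mirrors the table; labels are 0-indexed here, so the table's label 1 is row 0):

    import numpy as np

    # Rows = the table's labels 1..4 (0-indexed); columns = dichotomies P1..P4.
    CODE = np.array([[-1, +1, -1, +1],
                     [-1, +1, +1, -1],
                     [+1, -1, -1, +1],
                     [+1, -1, +1, +1]])

    def relabel(y, col):
        # Map a training label to the binary problem of column `col`.
        return CODE[y, col]

    def decode(preds):
        # Choose the label whose code word is most consistent with the
        # L binary predictions (minimum Hamming distance).
        dists = (CODE != np.sign(preds)).sum(axis=1)
        return int(np.argmin(dists))

For example, decode(np.array([+1, -1, -1, +1])) returns 2, i.e., the table's label 3, matching the (x, 3) mapping above.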


Page 9: Problems with Decompositions

Learning optimizes over local metrics, which may yield poor global performance. What is the metric? We don't care about the performance of the local classifiers.

A poor decomposition yields poor performance: difficult local problems; irrelevant local problems.

This is especially true for Error Correcting Output Codes, another (class of) decomposition. Difficulty: how to make sure that the resulting problems are separable.

Efficiency: e.g., All-vs-All vs. One-vs-All; the former has an advantage when working with the dual space.

It is not clear how to generalize multi-class to problems with a very large number of outputs.

Page 10: A sequential model for multiclass classification

A practical approach, towards a pipeline model. As before, it relies on a decomposition of the Y space; this time, a hierarchical decomposition (but sometimes the X space is also decomposed).

Goal: deal with a large Y space. Problem: the performance of a multiclass classifier goes down as the number of labels grows.

Page 11: A sequential model for multiclass classification

Assume Y = {1, 2, …, k}. In the course of classifying we build a collection of subsets: Y_d ⊆ … ⊆ Y_2 ⊆ Y_1 ⊆ Y_0 = Y.

Idea: sequentially, learn f_i : X → Y_{i−1} (i = 1, …, d). f_i is used to restrict the set of labels; in the next step we deal only with labels in Y_i.

f_i outputs a probability distribution over the labels in Y_{i−1}: P_i = (p_i(y_1 | x), …, p_i(y_m | x)). Define Y_i = {y ∈ Y_{i−1} | p_i(y | x) > θ_i} (other decision rules are possible; θ_i is a per-stage threshold). Now we need to deal with a smaller set of labels.
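A sketch of this sequential pruning (illustrative; each p_i stands for a learned stage classifier returning a distribution over the surviving labels, and thetas holds the per-stage thresholds θ_i):

    def sequential_classify(x, stages, thetas, labels):
        # Y_0 = Y; at stage i keep only labels with p_i(y | x) > theta_i.
        Y = list(labels)
        for p_i, theta_i in zip(stages, thetas):
            probs = p_i(x, Y)                 # dict: label -> p_i(y | x) over Y_{i-1}
            Y = [y for y in Y if probs[y] > theta_i]
            if len(Y) <= 1:                   # nothing left to disambiguate
                break
        return Y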

Page 12: Example: Question Classification

A common first step in Question Answering determines the type of the desired answer. Examples:
1. What is bipolar disorder? (a definition or a disease/medicine)
2. What do bats eat? (food, plant, or animal)
3. What is the pH scale? (could be a numeric value or a definition)

Page 13: A taxonomy for question classification

Page 14: Example: A hierarchical QC Classifier

The initial confusion set of any question is C0 (coarse classes)

The coarse classifier determines a set of preferred labels, C1 ⊆ C0, with |C1| < 5 (tunable)

Each coarse label is expanded to a set of fine labels using the fixed hierarchy to yield C2

This process continues now on the fine labels, to yield C3 ⊆ C2

Output C1, C3 (or continue)


Page 15: 1-vs-All: Learning Architecture

k label nodes; n input features; nk weights. Evaluation: Winner Take All. Training: each set of n weights, corresponding to the i-th label, is trained independently, given its performance on example x and independently of the performance of label j on x.

However, this architecture allows multiple learning algorithms; e.g., see the implementation in the SNoW multi-class classifier.

(Figure: a two-layer architecture with target nodes (each an LTU), input features, and weighted edges (the weight vectors).)

Page 16: Recall: Winnow’s Extensions

Winnow learns monotone Boolean functions. We extended it to general Boolean functions via “Balanced Winnow”: 2 weights per variable; decision: using the “effective weight”, the difference between w⁺ and w⁻. This is equivalent to the Winner-Take-All decision.

Learning: in principle, it is possible to use the 1-vs-all rule and update each set of n weights separately, but we suggested the “balanced” update rule, which takes into account how both sets of n weights predict on example x.

If y [(w⁺ − w⁻) · x] ≤ 0 (a mistake), then for each i:
w⁺_i ← w⁺_i · r^{y·x_i},   w⁻_i ← w⁻_i · r^{−y·x_i}

(Figure: positive target w⁺, negative target w⁻.)

Page 17: An Intuition: Balanced Winnow

In most multi-class classifiers you have a target node that represents positive examples and a target node that represents negative examples. Typically, we train each node separately (mine / not-mine examples). Rather, given an example, we could say: this is more a + example than a − example.

We compared the activation of the different target nodes (classifiers) on a given example (this example is more class + than class −). Can this be generalized to the case of k labels, k > 2?

If y [(w⁺ − w⁻) · x] ≤ 0 (a mistake), then for each i:
w⁺_i ← w⁺_i · r^{y·x_i},   w⁻_i ← w⁻_i · r^{−y·x_i}
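A sketch of one Balanced Winnow step as reconstructed above (assuming y ∈ {−1, +1} and a promotion rate r > 1; the threshold-at-zero mistake condition is an assumption):

    import numpy as np

    def balanced_winnow_update(w_pos, w_neg, x, y, r=1.5):
        # Decision uses the effective weight (w+ - w-); update only on a mistake.
        if y * ((w_pos - w_neg) @ x) <= 0:
            w_pos *= r ** (y * x)      # w+_i <- w+_i * r^(y * x_i)
            w_neg *= r ** (-y * x)     # w-_i <- w-_i * r^(-y * x_i)
        return w_pos, w_neg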


Page 18: Constraint Classification

The examples we give the learner are pairs (x, y), y ∈ {1,…,k}. The “black box learner” we described seems like it is a function of only x, but actually we made use of the labels y. How is y being used? y decides what to do with the example x; that is, which of the k classifiers should take the example as a positive example (making it a negative example for all the others).

How do we make decisions? Let f_y(x) = w_y^T · x. Then we predict using: y* = argmax_{y=1,…,k} f_y(x).

Equivalently, we can say that we predict as follows:
Predict y iff ∀ y' ∈ {1,…,k}, y' ≠ y: (w_y^T − w_{y'}^T) · x ≥ 0   (**)

So far, we did not say how we learn the k weight vectors w_y (y = 1, …, k).


Page 19: Linear Separability for Multiclass

We are learning k n-dimensional weight vectors, so we can concatenate the k weight vectors into w = (w₁, w₂, …, w_k) ∈ R^{nk}.

Key Construction (Kesler Construction; Zimak’s Constraint Classification): we represent each example (x, y) as an nk-dimensional vector x_y, with x embedded in the y-th block and zeros elsewhere (y = 1, 2, …, k). E.g., x_y = (0, x, 0, 0) ∈ R^{kn} (here k = 4, y = 2).

Now we can understand the decision rule. Predict y iff ∀ y' ∈ {1,…,k}, y' ≠ y: (w_y^T − w_{y'}^T) · x ≥ 0 (**). In the nk-dimensional space this reads: predict y iff ∀ y' ≠ y: w^T · (x_y − x_{y'}) ≡ w^T · x_{yy'} ≥ 0.

Conclusion: the set {(x_{yy'}, +)} = {(x_y − x_{y'}, +)} is linearly separable from the set {(−x_{yy'}, −)} using the linear separator w ∈ R^{kn}. We solved the Voronoi diagram challenge.


Page 20: Details: Kesler Construction & Multi-Class Separability

Transform examples: label 2 yields the constraints 2>1, 2>3, 2>4. In general:

i > j ⟺ f_i(x) − f_j(x) > 0 ⟺ w_i · x − w_j · x > 0 ⟺ W · X_i − W · X_j > 0 ⟺ W · (X_i − X_j) > 0 ⟺ W · X_ij > 0

where, for example (k = 4, i = 2, j = 4):
X_i = (0, x, 0, 0) ∈ R^{kd}
X_j = (0, 0, 0, x) ∈ R^{kd}
X_ij = X_i − X_j = (0, x, 0, −x)
W = (w₁, w₂, w₃, w₄) ∈ R^{kd}

If (x, i) is a given n-dimensional example (that is, x is labeled i), then x_ij, ∀ j = 1,…,k, j ≠ i, are positive examples in the nk-dimensional space, and −x_ij are negative examples.

Page 21: Kesler’s Construction (1)

y = argmax_{i ∈ {r,b,g,y}} w_i · x, with w_i, x ∈ R^n

Find w_r, w_b, w_g, w_y ∈ R^n such that:
w_r · x > w_b · x
w_r · x > w_g · x
w_r · x > w_y · x

H = R^{kn}

Page 22: Kesler’s Construction (2)

Let w = (w_r, w_b, w_g, w_y) ∈ R^{kn}, and let 0ⁿ be the n-dimensional zero vector. Then:

w_r · x > w_b · x ⟺ w · (x, −x, 0ⁿ, 0ⁿ) > 0 ⟺ w · (−x, x, 0ⁿ, 0ⁿ) < 0
w_r · x > w_g · x ⟺ w · (x, 0ⁿ, −x, 0ⁿ) > 0 ⟺ w · (−x, 0ⁿ, x, 0ⁿ) < 0
w_r · x > w_y · x ⟺ w · (x, 0ⁿ, 0ⁿ, −x) > 0 ⟺ w · (−x, 0ⁿ, 0ⁿ, x) < 0

Page 23: Kesler’s Construction (3)

Let w = (w₁, …, w_k) ∈ Rⁿ × … × Rⁿ = R^{kn}.

x_ij = (0^{(i−1)n}, x, 0^{(k−i)n}) + (0^{(j−1)n}, −x, 0^{(k−j)n}) ∈ R^{kn}; i.e., x in block i and −x in block j.

Given (x, y) ∈ Rⁿ × {1,…,k}, for all j ≠ y:
Add (x_yj, +1) to P⁺(x,y); add (−x_yj, −1) to P⁻(x,y).

P⁺(x,y) has k−1 positive examples (∈ R^{kn}); P⁻(x,y) has k−1 negative examples (∈ R^{kn}).
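A sketch of the construction (illustrative helper names; blocks are 0-indexed):

    import numpy as np

    def kesler_vector(x, i, j, k):
        # x in block i, -x in block j, zeros elsewhere: an element of R^{kn}.
        n = len(x)
        v = np.zeros(k * n)
        v[i * n:(i + 1) * n] = x
        v[j * n:(j + 1) * n] = -x
        return v

    def expand_example(x, y, k):
        # P+(x,y): the k-1 points x_yj (j != y); P-(x,y): their negations.
        pos = [kesler_vector(x, y, j, k) for j in range(k) if j != y]
        neg = [-v for v in pos]
        return pos, neg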


Page 24: Learning via Kesler’s Construction

Given (x₁, y₁), …, (x_N, y_N) ∈ Rⁿ × {1,…,k}, create:
P⁺ = ∪ᵢ P⁺(xᵢ, yᵢ)
P⁻ = ∪ᵢ P⁻(xᵢ, yᵢ)

Find w = (w₁, …, w_k) ∈ R^{kn} such that w separates P⁺ from P⁻.

One can use any algorithm in this space: Perceptron, Winnow, etc. To understand how to update the weight vector in the n-dimensional space, note that w^T · x_{yy'} ≥ 0 (in the nk-dimensional space) is equivalent to (w_y^T − w_{y'}^T) · x ≥ 0 (in the n-dimensional space).

Page 25: Learning via Kesler Construction (2)

A Perceptron update rule applied in the nk-dimensional space, due to a mistake on the constraint w^T · x_ij ≥ 0, or equivalently on (w_i^T − w_j^T) · x ≥ 0 in the n-dimensional space, implies the following update:

Given example (x, i) (example x ∈ Rⁿ, labeled i), ∀ (i, j), j = 1,…,k, j ≠ i:   (***)
If (w_i^T − w_j^T) · x < 0 (mistaken prediction; equivalent to w^T · x_ij < 0):
w_i ← w_i + x (promotion) and w_j ← w_j − x (demotion)

Note that this is a generalization of the Balanced Winnow rule. Note that we promote w_i and demote the k−1 weight vectors w_j.
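A sketch of this update carried out directly in the n-dimensional space (W is the k×n matrix of weight vectors; the learning rate is an illustrative addition):

    import numpy as np

    def kesler_perceptron_epoch(W, X, y, lr=1.0):
        # For each (x, i), check every constraint (w_i - w_j) . x >= 0;
        # on a violation, promote w_i and demote w_j.
        for x, i in zip(X, y):
            for j in range(W.shape[0]):
                if j != i and (W[i] - W[j]) @ x < 0:
                    W[i] += lr * x     # promotion
                    W[j] -= lr * x     # demotion
        return W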

Page 26: Conservative update

The general scheme suggests: given example (x, i) (example x ∈ Rⁿ, labeled i), ∀ (i, j), j = 1,…,k, j ≠ i:   (***)
If (w_i^T − w_j^T) · x < 0 (mistaken prediction; equivalent to w^T · x_ij < 0):
w_i ← w_i + x (promotion) and w_j ← w_j − x (demotion)
Promote w_i and demote the k−1 weight vectors w_j.

A conservative update (SNoW’s implementation): in case of a mistake, only the weights corresponding to the target node i and the closest node j are updated. Let j* = argmin_{j=1,…,k, j≠i} (w_i^T − w_j^T) · x.

If (w_i^T − w_{j*}^T) · x < 0 (mistaken prediction):
w_i ← w_i + x (promotion) and w_{j*} ← w_{j*} − x (demotion)
The other weight vectors are not updated.
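A sketch of the conservative variant, again in the n-dimensional space (a reading of the SNoW-style rule, not its actual implementation):

    import numpy as np

    def conservative_update(W, x, i, lr=1.0):
        # Find the closest competitor j* = argmin_{j != i} (w_i - w_j) . x.
        scores = (W[i] - W) @ x
        scores[i] = np.inf            # exclude j = i from the argmin
        j_star = int(np.argmin(scores))
        if scores[j_star] < 0:        # mistaken prediction
            W[i] += lr * x            # promote the target
            W[j_star] -= lr * x       # demote only the closest competitor
        return W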

Page 27: Significance

The hypothesis learned above is more expressive than when the OvA assumption is used.

Any linear learning algorithm can be used, and algorithm-specific properties are maintained (e.g., attribute efficiency when using Winnow).

E.g., the multiclass support vector machine can be implemented by learning a hyperplane that separates P(S) with maximal margin.

As a byproduct of the linear separability observation, we get a natural notion of a margin in the multi-class case, inherited from the binary separability in the nk-dimensional space. Given examples x_ij ∈ R^{nk}: margin(x_ij, w) = min_{ij} w^T · x_ij. Consequently, given x ∈ Rⁿ labeled i: margin(x, w) = min_j (w_i^T − w_j^T) · x.
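The inherited margin is straightforward to compute; a brief sketch (W as the k×n weight matrix, as in the earlier sketches):

    def multiclass_margin(W, x, i):
        # margin(x, w) = min_{j != i} (w_i - w_j) . x
        return min((W[i] - W[j]) @ x for j in range(W.shape[0]) if j != i)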

Page 28: Constraint Classification

The scheme presented can be generalized to provide a uniform view of multiple types of problems: multi-class, multi-label, category ranking.

Reduces learning to a single binary learning task
Captures the theoretical properties of the binary algorithm
Experimentally verified
Naturally extends Perceptron, SVM, etc.

It is called “constraint classification” since it does it all by representing labels as a set of constraints or preferences among output labels.


Page 29: Multi-category to Constraint Classification

The unified formulation is clear from the following examples:
Multiclass: (x, A) → (x, (A>B, A>C, A>D))
Multilabel: (x, (A, B)) → (x, (A>C, A>D, B>C, B>D))
Label Ranking: (x, (5>4>3>2>1)) → (x, (5>4, 4>3, 3>2, 2>1))

In all cases, we have examples (x, y) with y ∈ S_k, where S_k is a partial order over the class labels {1,…,k} that defines a “preference” relation (>) for class labeling.

Consequently, the constraint classifier is h: X → S_k; h(x) is a partial order, and h(x) is consistent with y if (i < j) ∈ y ⟹ (i < j) ∈ h(x).

Just as in the multiclass case we learn one w_i ∈ Rⁿ for each label, the same is done for multi-label and ranking; the weight vectors are updated according to the requirements from y ∈ S_k.
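A sketch of the three conversions (illustrative; constraints are encoded as (preferred, dispreferred) pairs):

    def multiclass_constraints(label, classes):
        # (x, A) -> (A>B, A>C, A>D): the true label beats every other class.
        return [(label, c) for c in classes if c != label]

    def multilabel_constraints(labels, classes):
        # (x, (A, B)) -> each relevant label beats each irrelevant one.
        return [(l, c) for l in labels for c in classes if c not in labels]

    def ranking_constraints(ranking):
        # (x, 5>4>3>2>1) -> adjacent preference pairs (5>4, 4>3, ...).
        return list(zip(ranking, ranking[1:]))

For example, multiclass_constraints('A', 'ABCD') returns [('A','B'), ('A','C'), ('A','D')], matching the first line above.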


Page 30: Properties of Construction

Can learn any argmax_i v_i·x function (even when class i isn't linearly separable from the union of the others). Can use any algorithm to find the linear separation:

Perceptron Algorithm → ultraconservative online algorithm [Crammer & Singer, 2001]

Winnow Algorithm → multiclass Winnow [Mesterharm, 2000]

Defines a multiclass margin via the binary margin in R^{kd} → multiclass SVM [Crammer & Singer, 2001]

Page 31: Margin Generalization Bounds

Linear hypothesis space: h(x) = argsort_i v_i · x, with v_i, x ∈ R^d; argsort returns a permutation of {1,…,k}.

CC margin-based bound, with margin γ = min_{(x,y)∈S} min_{(i<j)∈y} (v_i · x − v_j · x):

err_D(h) ≤ (C/m) · (R²/γ² − ln δ)   (up to constants)

m - number of examples
R - max_x ||x||
δ - confidence
C - average # constraints

Page 32: VC-style Generalization Bounds

Linear hypothesis space: h(x) = argsort_i v_i · x, with v_i, x ∈ R^d; argsort returns a permutation of {1,…,k}.

CC VC-based bound:

err_D(h) ≤ err(S, h) + √( (kd log(mk/d) − ln δ) / m )   (up to constants)

m - number of examples
d - dimension of the input space
δ - confidence
k - number of classes


Page 33: Beyond MultiClass Classification

Ranking: category ranking (over classes); ordinal regression (over examples)

Multilabel: x is both red and blue

Complex relationships: x is more red than blue, but not green

Millions of classes: sequence labeling (e.g., POS tagging). The same algorithms can be applied to these problems, namely, to Structured Prediction. This observation is the starting point for CS546.

Page 34: (more) Multi-Categorical Output Tasks

Sequential Prediction (y ∈ {1,…,K}⁺)
e.g., POS tagging (‘(N V N N A)’): “This is a sentence.” → D V D N
e.g., phrase identification
Many labels: K^L for a length-L sentence

Structured Output Prediction (y ∈ C({1,…,K}⁺))
e.g., parse tree, multi-level phrase identification
e.g., sequential prediction
Constrained by domain, problem, data, background knowledge, etc.

Page 35: Semantic Role Labeling: A Structured-Output Problem

For each verb in a sentence:
1. Identify all constituents that fill a semantic role.
2. Determine their roles: core arguments, e.g., Agent, Patient, or Instrument; their adjuncts, e.g., Locative, Temporal, or Manner.

I left my pearls to my daughter-in-law in my will.
A0: leaver
A1: thing left
A2: benefactor
AM-LOC: location (‘in my will’)

Page 36: Semantic Role Labeling

I left my pearls to my daughter-in-law in my will. Many possible valid outputs; many possible invalid outputs; typically, one correct output (per input).

[A0 I] left [A1 my pearls] [A2 to my daughter-in-law] [AM-LOC in my will]