Page 1

Multi-label Classification

Charmgil Hong cs3750

(Presented on Nov 11, 2014)

Page 2

Goals of the talk

1. To understand the geometry of different approaches for multi-label classification

2. To appreciate how machine learning techniques further improve multi-label classification methods

3. To learn how to evaluate multi-label classification methods

Page 3

Agenda

• Motivation & Problem definition

• Solutions

• Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary

Page 4

Notation

• X ∈ R^m : feature vector variable (input)

• Y ∈ R^d : class vector variable (output)

• x = {x1, ..., xm}: feature vector instance

• y = {y1, ..., yd}: class vector instance

• As a shorthand, we write P(Y=y|X=x) as P(y|x)

• Dtrain : training dataset; Dtest : test dataset

Page 5

Motivation

• Traditional classification

• Each data instance is associated with a single class variable

Page 6

Motivation

• An issue with traditional classification

• In many real-world applications, each data instance can be associated with multiple class variables

• Examples

• A news article may cover multiple topics, such as politics and economics

• An image may include multiple objects, such as a building, a road, and a car

• A gene may be associated with several biological functions

Page 7

Problem Definition

• Multi-label classification (MLC)

• Each data instance is associated with multiple binary class variables

• Objective: assign to each instance the most probable joint assignment of the class variables

[Figure: each instance is assigned two classes, e.g., Class 1 ∈ {R, B} (color) and Class 2 (shape; symbols not recovered)]

Page 8

A simple solution

• Idea

• Transform a multi-label classification problem to multiple single-label classification problems

• Learn d independent classifiers for d class variables

Page 9


Binary Relevance (BR) [Clare and King, 2001; Boutell et al, 2004]

• Illustration

Dtrain    X1    X2    Y1   Y2   Y3
n=1       0.7   0.4   1    1    0
n=2       0.6   0.2   1    1    0
n=3       0.1   0.9   0    0    1
n=4       0.3   0.1   0    0    0
n=5       0.8   0.9   1    0    1

h1 : X → Y1
h2 : X → Y2
h3 : X → Y3
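As a concrete companion to the illustration, here is a minimal BR sketch on the toy Dtrain above, with scikit-learn's LogisticRegression as the base learner (the base-learner choice is our assumption; BR itself only prescribes one independent classifier per label):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy Dtrain from the slide: 5 instances, 2 features, 3 binary labels
X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1]])

# Binary Relevance: one independent classifier h_i per class variable Y_i
classifiers = [LogisticRegression().fit(X, Y[:, i]) for i in range(Y.shape[1])]

# Prediction: each h_i predicts its own label, ignoring the other labels
x_new = np.array([[0.5, 0.3]])
y_pred = np.array([h.predict(x_new)[0] for h in classifiers])
print(y_pred)
```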

Page 10

Binary Relevance (BR) [Clare and King, 2001; Boutell et al, 2004]

• Advantages

• Computationally efficient

• Disadvantages

• Does not capture the dependence relations among the class variables

• Not suitable for the objective of MLC

• Does not find the most probable assignment

• Instead, it maximizes the marginal distribution of each class variable

Page 11

Binary Relevance (BR) [Clare and King, 2001; Boutell et al, 2004]

• Marginal vs. Joint: a motivating example

• Question: find the most probable assignment (MAP: maximum a posteriori) of Y = (Y1,Y2)

P(Y1,Y2|X=x)    Y1 = 0    Y1 = 1    P(Y2|X=x)
Y2 = 0           0.2       0.45      0.65
Y2 = 1           0.35      0         0.35
P(Y1|X=x)        0.55      0.45

➡ Prediction on the joint (MAP): Y1 = 1, Y2 = 0
➡ Prediction on the marginals: Y1 = 0, Y2 = 0

• We want to maximize the joint distribution of Y given the observation X = x; i.e., y* = argmax_y P(Y=y|X=x)
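The gap between the two prediction rules is easy to check numerically; a small sketch over the 2×2 table above:

```python
import numpy as np

# P(Y1, Y2 | x) from the slide; rows index Y2, columns index Y1
joint = np.array([[0.20, 0.45],   # Y2 = 0
                  [0.35, 0.00]])  # Y2 = 1

# MAP on the joint: the single most probable (y1, y2) pair
y2, y1 = np.unravel_index(joint.argmax(), joint.shape)
print("joint MAP:", (y1, y2))                        # (1, 0)

# Maximizing each marginal independently (what BR effectively does)
p_y1 = joint.sum(axis=0)                             # [0.55, 0.45]
p_y2 = joint.sum(axis=1)                             # [0.65, 0.35]
print("marginals:", (p_y1.argmax(), p_y2.argmax()))  # (0, 0)
```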

Page 12

Another simple solution

• Idea

• Transform each label combination to a class value

• Learn a multi-class classifier with the new class values

Page 13


Label Powerset (LP) [Tsoumakas and Vlahavas, 2007]

• Illustration

Dtrain    X1    X2    Y1   Y2   Y3    YLP
n=1       0.7   0.4   0    0    1      1
n=2       0.6   0.2   0    0    1      1
n=3       0.1   0.9   0    1    0      2
n=4       0.3   0.1   0    1    1      3
n=5       0.8   0.9   1    0    1      4

hLP : X → YLP
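A minimal LP sketch in the same style: observed label combinations are mapped to atomic class ids, one multi-class classifier is trained, and predictions map back (the logistic-regression base learner is again our assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[0, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 1]])

# Encode each observed label combination as one class value
combos = sorted(set(map(tuple, Y)))
combo_to_id = {c: k for k, c in enumerate(combos)}
y_lp = np.array([combo_to_id[tuple(row)] for row in Y])

# One multi-class classifier over the (at most 2^d) combination ids
h_lp = LogisticRegression().fit(X, y_lp)

# Decode: the predicted id maps back to a full label vector
pred_id = h_lp.predict([[0.75, 0.5]])[0]
print(combos[pred_id])   # always a combination seen in training
```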

Page 14

Label Powerset (LP) [Tsoumakas and Vlahavas, 2007]

• Advantages

• Learns the full joint of the class variables

• Each of the new class values maps to a label combination

• Disadvantages

• The number of choices in the new class can be exponential (|YLP| = O(2^d))

• Learning a multi-class classifier on exponential choices is expensive

• The resulting class distribution would be sparse and imbalanced

• Only predicts the label combinations that are seen in the training set

Page 15

BR vs. LP

• BR and LP are two extreme MLC approaches

• BR maximizes the marginals of each class variable, while LP directly models the joint of all class variables

• BR is computationally more efficient; but does not consider the relationship among the class variables

• LP considers the relationship among the class variables by modeling the full joint of the class variables; but can be computationally very expensive

[Diagram: a spectrum from BR (independent classifiers) at one end to LP (all possible label combinations) at the other]

Page 16

Agenda

✓ Motivation

• Solutions

• Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary


Solutions

Page 17

Solutions

• Section agenda

• Solutions rooted in BR

• Solutions rooted in LP

• Other solutions

Page 18

Solutions rooted in BR

• BR: Binary Relevance [Clare and King, 2001; Boutell et al, 2004]

• Models independent classifiers P(yi|x) on each class variable

• Does not learn the class dependencies

• Key extensions from BR

• Learn the class dependency relations by adding new class-dependent features: P(yi|x, {new_features})

Page 19

Solutions rooted in BR

• Idea: layered approach

• Layer-1: Learn and predict on Dtrain, using the BR approach

• Layer-2: Learn d classifiers on the original features and the output of layer-1

• Existing methods

• Classification with Heterogeneous Features (CHF) [Godbole et al, 2004]

• Instance-based Logistic Regression (IBLR) [Cheng and Hüllermeier, 2009]

Page 20

Classification with Heterogeneous Features (CHF)

• Illustration

Layer-1: plain BR on Dtrain

Dtrain    X1    X2    Y1   Y2   Y3
n=1       0.7   0.4   1    1    0
n=2       0.6   0.2   1    1    0
n=3       0.1   0.9   0    0    1
n=4       0.3   0.1   0    0    0
n=5       0.8   0.9   1    0    1

hbr1 : X → Y1
hbr2 : X → Y2
hbr3 : X → Y3

Layer-2: original features enriched with the layer-1 outputs

XCHF      X1    X2    hbr1(X)  hbr2(X)  hbr3(X)   Y1   Y2   Y3
n=1       0.7   0.4    .xx      .xx      .xx       1    1    0
n=2       0.6   0.2    .xx      .xx      .xx       1    1    0
n=3       0.1   0.9    .xx      .xx      .xx       0    0    1
n=4       0.3   0.1    .xx      .xx      .xx       0    0    0
n=5       0.8   0.9    .xx      .xx      .xx       1    0    1

h1 : XCHF → Y1
h2 : XCHF → Y2
h3 : XCHF → Y3
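A sketch of the two-layer CHF pipeline under the same assumptions as the BR sketch (logistic-regression base learners; in practice the layer-1 outputs would be held-out predictions rather than in-sample fits):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1]])
d = Y.shape[1]

# Layer-1: plain BR
layer1 = [LogisticRegression().fit(X, Y[:, i]) for i in range(d)]

# Enrich the feature space with the layer-1 probability outputs
scores = np.column_stack([h.predict_proba(X)[:, 1] for h in layer1])
X_chf = np.hstack([X, scores])

# Layer-2: BR again, but on the heterogeneous features
layer2 = [LogisticRegression().fit(X_chf, Y[:, i]) for i in range(d)]

def predict(x):
    s = np.array([[h.predict_proba(x)[0, 1] for h in layer1]])
    return np.array([h.predict(np.hstack([x, s]))[0] for h in layer2])

print(predict(np.array([[0.5, 0.3]])))
```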

Page 21

Instance-based Logistic Regression (IBLR)

• Illustration

Dtrain    X1    X2    Y1   Y2   Y3
n=1       0.7   0.4   1    1    0
n=2       0.6   0.2   1    1    0
n=3       0.1   0.9   0    0    1
n=4       0.3   0.1   0    0    0
n=5       0.8   0.9   1    0    1

kNN scores (k = 3): λi is the fraction of an instance's k nearest neighbors with yi = 1; e.g., if the three neighbors carry the label vectors (1,1,0), (0,0,1), and (1,0,0), then (λ1, λ2, λ3) = (2/3, 1/3, 1/3)

XIBLR     X1    X2    λ1    λ2    λ3    Y1   Y2   Y3
n=1       0.7   0.4   .xx   .xx   .xx   1    1    0
n=2       0.6   0.2   .xx   .xx   .xx   1    1    0
n=3       0.1   0.9   .xx   .xx   .xx   0    0    1
n=4       0.3   0.1   .xx   .xx   .xx   0    0    0
n=5       0.8   0.9   .xx   .xx   .xx   1    0    1

h1 : XIBLR → Y1
h2 : XIBLR → Y2
h3 : XIBLR → Y3
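A sketch of the IBLR feature construction, assuming the λi are plain k-nearest-neighbor label averages as in the illustration (for training points the neighborhood includes the point itself; Cheng and Hüllermeier's full method is more careful):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1]])
k = 3

nn = NearestNeighbors(n_neighbors=k).fit(X)

def knn_scores(Q):
    # lambda_i = fraction of the k nearest neighbors having y_i = 1
    _, idx = nn.kneighbors(Q)       # indices of the k nearest training points
    return Y[idx].mean(axis=1)      # shape (len(Q), d)

X_iblr = np.hstack([X, knn_scores(X)])
models = [LogisticRegression().fit(X_iblr, Y[:, i]) for i in range(Y.shape[1])]

x_new = np.array([[0.5, 0.3]])
x_aug = np.hstack([x_new, knn_scores(x_new)])
print([m.predict(x_aug)[0] for m in models])
```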

Page 22

Solutions rooted in BR: CHF & IBLR

• Advantages

• Model the class dependences by enriching the feature space using the layer-1 classifiers

• Disadvantages

• Learn the dependence relations only indirectly

• The predictions can be unstable, since layer-2 inherits errors from the layer-1 outputs

Page 23

Solutions rooted in LP

• LP: Label Powerset [Tsoumakas and Vlahavas, 2007]

• Models a multi-class classifier on the enumeration of all possible class assignments

• Can create exponentially many classes, and is computationally very expensive

• Key extensions from LP

• Prune infrequent class assignments from consideration to reduce the size of the class assignment space

• Represent the joint distribution more compactly

Page 24

Pruned problem transformation (PPT) [Read et al, 2008]

• Class assignment conversion in PPT

• Prune infrequent class assignment sets

• User specifies the threshold for “infrequency”

24

Dtrain    X1    X2    Y1   Y2   Y3    YLP
n=1       0.7   0.4   0    0    1      1
n=2       0.6   0.2   0    0    1      1
n=3       0.1   0.9   0    0    0      0
n=4       0.3   0.1   0    1    0      2
n=5       0.1   0.8   0    0    0      0
n=6       0.2   0.1   0    1    0      2
n=7       0.2   0.2   0    1    0      2
n=8       0.2   0.9   0    0    0      0
n=9       0.7   0.3   0    0    1      1
n=10      0.9   0.9   0    1    1      3

|YLP| = 4

Page 25

Pruned problem transformation (PPT) [Read et al, 2008]

• Class assignment conversion in PPT (continued)

• The infrequent combination (0, 1, 1) from the previous slide is pruned; its instance is re-expressed via the frequent subsets (0, 0, 1) and (0, 1, 0), giving n=10 and n=11 below

Dtrain    X1    X2    Y1   Y2   Y3    YPPT
n=1       0.7   0.4   0    0    1      1
n=2       0.6   0.2   0    0    1      1
n=3       0.1   0.9   0    0    0      0
n=4       0.3   0.1   0    1    0      2
n=5       0.1   0.8   0    0    0      0
n=6       0.2   0.1   0    1    0      2
n=7       0.2   0.2   0    1    0      2
n=8       0.2   0.9   0    0    0      0
n=9       0.7   0.3   0    0    1      1
n=10      0.9   0.9   0    0    1      1
n=11      0.9   0.9   0    1    0      2

|YPPT| = 3
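A minimal sketch of the pruning step, assuming the simplest variant that just drops instances whose label combination falls below the frequency threshold (Read et al.'s full method re-introduces such instances via frequent subsets, as in the illustration above):

```python
import numpy as np
from collections import Counter

Y = np.array([[0,0,1],[0,0,1],[0,0,0],[0,1,0],[0,0,0],
              [0,1,0],[0,1,0],[0,0,0],[0,0,1],[0,1,1]])
threshold = 2   # combinations seen fewer than `threshold` times are infrequent

counts = Counter(map(tuple, Y))
keep = np.array([counts[tuple(row)] >= threshold for row in Y])

Y_ppt = Y[keep]                       # (0,1,1) occurs once -> pruned
combos = sorted(set(map(tuple, Y_ppt)))
print(len(combos))                    # |Y_PPT| = 3 instead of 4
```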

Page 26

Solutions rooted in LP: PPT

• Advantages

• A simple add-on to the LP method that focuses on the key label relationships

• Models the full joint more efficiently

• Disadvantages

• Based on an ad-hoc pruning heuristic

• The mapping to the lower-dimensional label space is not clearly defined

• (As LP) Only predicts the label combinations that are seen in the training set

Page 27

Other solution: MLKNN [Zhang and Zhou, 2007]

• Multi-label k-Nearest Neighbor (MLKNN) [Zhang and Zhou, 2007]

• Learn a classifier for each class (as BR) by combining k-nearest neighbor with Bayesian inference

• Its applicability is limited, as with kNN in general

• Does not produce a model

• Does not work well on high-dimensional data

Page 28

Multi-label output coding

• Key idea

• Motivated by the error-correcting output coding (ECOC) scheme [Dietterich 1995; Bose & Ray-Chaudhuri 1960] in communication

• Solve the MLC problems using lower dimensional codewords

• An output coding MLC method usually consists of three parts:

• Encoding: Convert the output vectors Y into codewords Z

• Prediction: Perform regression from X to Z; say R

• Decoding: Recover the class assignments Y from R

Page 29

Multi-label output coding

• Existing methods

• OC (Output Coding) with Compressed Sensing (OCCS) [Hsu et al, 2009]

• Principle Label Space Transformation (PLST) [Tai and Lin, 2010]

• OC with Canonical Correlation Analysis (CCAOC) [Zhang and Schneider, 2011]

• Maximum Margin Output Coding (MMOC) [Zhang and Schneider, 2012]

Page 30

Principle Label Space Transformation (PLST) [Tai and Lin, 2010]

• Encoding: Convert output vectors Y into codewords Z, using the singular value decomposition (SVD)

• Z = VᵀY = (V1ᵀY, ..., VqᵀY), where V is a d × q projection matrix (d > q)

• Prediction: Perform regression from X to Z; say R

• Decoding: Recover the class labels Y from R using SVD

• Achieved by optimizing a combinatorial loss function

[Diagram: Labels Y –(i) encoding (SVD)→ Codewords Z; Features X –(ii) regression→ Z; predicted codewords –(iii) decoding (SVD)→ recovered labels]
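A compact numpy sketch of the PLST pipeline under simplifying assumptions: no label-matrix centering, plain least-squares regression, and rounding of the back-projection in place of the paper's loss-optimizing decoder (the data here is synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, d, q = 100, 10, 6, 3
X = rng.normal(size=(N, m))
Y = (rng.random(size=(N, d)) < 0.3).astype(float)

# (i) Encoding: project labels onto the top-q right singular vectors of Y
# (rows of Y are label vectors, so Z = Y V corresponds to z = V^T y)
_, _, Vt = np.linalg.svd(Y, full_matrices=False)
V = Vt[:q].T                  # d x q projection matrix
Z = Y @ V                     # N x q codewords

# (ii) Prediction: least-squares regression from X to Z
W, *_ = np.linalg.lstsq(X, Z, rcond=None)

# (iii) Decoding: back-project the regressed codewords and round
Z_hat = X @ W
Y_hat = (Z_hat @ V.T > 0.5).astype(int)
print(Y_hat[:3])
```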

Page 31

Multi-label output coding

• Existing methods are differentiated from one to another mainly by the encoding/decoding schemes they apply

Method   Key difference
OCCS     Uses compressed sensing [Donoho 2006] for encoding and decoding
PLST     Uses singular value decomposition (SVD) [Johnson & Wichern 2002] for encoding and decoding
CCAOC    Uses canonical correlation analysis (CCA) [Johnson & Wichern 2002] for encoding and mean-field approximation for decoding
MMOC     Uses SVD for encoding and a maximum margin formulation for decoding

Page 32

Multi-label output coding

• Advantages

• Shows excellent prediction performance

• Disadvantages

• Only able to predict the single best output for a given input

• Cannot estimate probabilities for different input-output pairs

• Not scalable

• Encoding and decoding steps rely on matrix decomposition, whose complexities are sensitive to d and N

• Cannot be generalized to non-binary cases

Page 33

Section summary

[Diagram: a map of the methods so far — BR (independent classifiers) extended by CHF and IBLR (enriched feature space); LP (all possible label combinations) extended by PPT (pruned label combinations); Others: MLKNN and output coding; a question mark marks the gap where we want to achieve something better]

Page 34

Agenda

✓ Motivation

✓ Solutions

• Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary


Advanced solutions

Page 35

Advanced solutions

• Section agenda

• Extensions using probabilistic graphical models (PGMs)

• Extensions using ensemble techniques

Page 36

Extensions using PGMs

• Probabilistic Graphical Models (PGMs)

• PGM refers to a family of distributions on a set of random variables that are compatible with all the probabilistic independence propositions encoded in a graph

• A smart way to formulate exponentially large probability distributions without paying an exponential cost

• Using PGMs, we can reduce the model complexity

• PGM = Multivariate statistics + Graphical structure

Page 37

Extensions using PGMs

• Representation: Two types

• Undirected graphical models (UGMs)

• Also known as Markov networks (MNs)

• Directed graphical models (DGMs)

• Also known as Bayesian networks (BNs)

[Diagram: two two-node graphs over X1 and X2. A node is a variable; in the undirected graph (MN) an edge denotes correlation, while in the directed graph (BN) an edge denotes a causal relation]

Page 38

Extensions using PGMs

• How PGMs reduce the model complexity?

• Key idea: Exploit the conditional independence (CI) relations among variables!!

• Conditional independence (CI): Random variables A, B are conditionally independent given C, if P(A,B|C) = P(A|C)P(B|C)

• UGM and DGM offer a set of graphical notations for CI

[Diagram: CI representation in a UGM — a graph A – C – B, where C separates A and B, so A ⊥ B | C]

Page 39

Extensions using PGMs

[Diagram: CI representations in a DGM — directed structures over A, B, C (e.g., the chain A → C → B and the fork A ← C → B), each encoding A ⊥ B | C]

Page 40

Extensions using PGMs

• PGMs have been an excellent representation / formulation tool for the MLC problems

• The dependences among features (X) and class variables (Y) can be represented easily with PGMs

• By exploiting the conditional independence, we can make the computation simpler

Page 41

Extensions using PGMs

• Existing methods

• Undirected models (Markov networks)

• Multi-label Conditional Random Field (ML-CRF) [Ghamrawi and McCallum, 2005; Pakdaman et al, 2014]

• Composite Marginal Models (CMM) [Zhang and Schneider, 2012]

• Directed models (Bayesian networks)

• Multi-dimensional Bayesian Classifiers (MBC) [van der Gaag and de Waal, 2006]

• Classifier Chains (CC) [Read et al, 2009]

• Conditional Tree-structured Bayesian Networks (CTBN) [Batal et al, 2013]

Page 42

Multi-dimensional Bayesian Networks (MBC) [van der Gaag and de Waal, 2006]

• Key idea

• Model the full joint of input and output using a Bayesian network

• Use graphical structures to represent the dependence relations among the input and output variables

• Example MBC (d = 3, m = 4)

[Diagram: a two-layer Bayesian network with class nodes Y1, Y2, Y3 above feature nodes X1, X2, X3, X4]

The joint distribution P(X,Y) is represented by the decomposition over the network; within the layers shown,

P(X1|X2) · P(X2|X3) · P(X3) · P(X4|X2) over the features, and P(Y1|Y2) · P(Y2) · P(Y3|Y2) over the classes

Page 43

Multi-dimensional Bayesian Networks (MBC) [van der Gaag and de Waal, 2006]

• Advantages

• The full joint distribution of the feature and class variables can be represented efficiently using the Bayesian network

• Disadvantages

• Also models the relations among the feature variables, which carry little information for modeling the multi-label relations

Page 44

Multi-label Conditional Random Fields (MLCRF) [Pakdaman et al, 2014]

• Key idea

• Model the conditionals P(Y|X) to capture the relations among the class variables conditioned on the feature variables

• Learn a pairwise Markov network to model the relations between the input and output variables

• Representation

P(y|x) = (1/Z) · Π(i,j) ψi,j(yi, yj, x) · Πi 𝜙i(yi, x)

[Diagram: a cycle of class nodes Y1 – Y2 – Y3 – … – Yd with pairwise potentials ψ1,2, ψ2,3, …, ψd,1]

(ψi,j and 𝜙i are the potentials over Yi, Yj, and X; Z is the normalization term)

Page 45

Multi-label Conditional Random Fields (MLCRF) [Pakdaman et al, 2014]

• Advantages

• Directly models the conditional joint distribution P(Y|X)

• Disadvantages

• Learning and prediction are computationally very demanding

• To perform inference, the normalization term Z must be computed, which is usually very costly

• The iterative parameter-learning process requires inference at each step, which is even more expensive

• In practice, approximate inference techniques are applied to make the model usable

Page 46

Classifier Chains (CC) [Read et al, 2009]

• Key idea

• Model P(Y|X) using a directed chain network, where each class variable is conditioned on all preceding classes in the chain (and on X): P(Y|X) = P(Y1|X) · P(Y2|X, Y1) · … · P(Yd|X, Y1, …, Yd−1)

• Representation

[Diagram: a chain Y1 → Y2 → Y3 → … → Yd, all conditioned on X; each node receives edges from all preceding labels]

Page 47

Classifier Chains (CC) [Read et al, 2009]

• Learning

• No structure learning (random chain order)

• Parameter learning is performed on the decomposed CPDs: argmaxθ P(Yi|X, π(Yi); θ)

• Prediction

• Performed by greedy maximization of each factor (CPD) along the chain: argmaxYi P(Yi|X, π(Yi); θ)
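A minimal CC sketch on the toy data from the BR slide, assuming logistic-regression CPDs, a random chain order, and the greedy prediction described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1]])
d = Y.shape[1]

order = rng.permutation(d)            # random chain order, as in CC
chain = []
for pos, i in enumerate(order):
    # Classifier for Y_i sees X plus the true values of all preceding labels
    X_aug = np.hstack([X, Y[:, order[:pos]]])
    chain.append(LogisticRegression().fit(X_aug, Y[:, i]))

def predict(x):
    y = np.zeros(d, dtype=int)
    for pos, i in enumerate(order):
        prev = y[order[:pos]].reshape(1, -1)   # greedy: feed predictions down
        y[i] = chain[pos].predict(np.hstack([x, prev]))[0]
    return y

print(predict(np.array([[0.5, 0.3]])))
```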

Page 48

Conditional Tree-structured Bayesian Networks (CTBNs) [Batal et al, 2013]

• Key idea

• Learn P(Y|X) using a tree-structured Bayesian network of the class labels

• Tree-structures can be seen as restricted chains, where each class variable has at most one parent class variable

• Example CTBN

[Diagram: an example CTBN — class nodes Y1, …, Y4, each conditioned on X, with at most one parent label per node]

This network represents a decomposition of the form P(Y|X) = Πi P(Yi | X, π(Yi)), where π(Yi) is the (at most one) parent label of Yi

Page 49

Conditional Tree-structured Bayesian Networks (CTBNs) [Batal et al, 2013]

• Learning

• Structure learning by optimizing conditional log-likelihood

1. Define a complete weighted directed graph over the class variables, whose edge weights are the conditional log-likelihood scores

2. Find the maximum branching tree of the graph (* maximum branching tree = maximum-weight directed spanning tree)

[Diagram: the learned tree over Y1, …, Y4, conditioned on X]
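A sketch of the structure-learning step, assuming logistic-regression CPDs scored by in-sample conditional log-likelihood and networkx's maximum_spanning_arborescence for the maximum branching (simplifications of ours; the paper scores edges more carefully):

```python
import numpy as np
import networkx as nx
from sklearn.linear_model import LogisticRegression

X = np.array([[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]])
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1]])
d = Y.shape[1]

def cll(v, parent):
    # Conditional log-likelihood of Y_v given X and one candidate parent label
    feats = np.hstack([X, Y[:, [parent]]])
    p = LogisticRegression().fit(feats, Y[:, v]).predict_proba(feats)
    return np.log(p[np.arange(len(Y)), Y[:, v]]).sum()

# Complete weighted directed graph over the class variables
G = nx.DiGraph()
for u in range(d):
    for v in range(d):
        if u != v:
            G.add_edge(u, v, weight=cll(v, parent=u))

# Maximum branching tree = maximum-weight directed spanning tree
tree = nx.maximum_spanning_arborescence(G)
print(sorted(tree.edges()))
```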

Page 50

Conditional Tree-structured Bayesian Networks (CTBNs) [Batal et al, 2013]

• Learning

• Structure learning by optimizing conditional log-likelihood

• Parameter learning is performed on the decomposed CPDs

• Prediction

• Exact MAP prediction is performed by a belief propagation (max-product) algorithm


Page 51

CC vs. CTBN

CC:

• Decomposes the joint probability along the chain structure (each label conditioned on all preceding labels)

• Prediction maximizes the marginals greedily along the chain (suboptimal solution)

• Errors in prediction propagate to the following label predictions

• No structure learning (the label ordering is given at random)

CTBN:

• Decomposes the joint probability along the tree structure (at most one parent label per node)

• Performs exact MAP prediction (linear-time optimal solution)

• The tree-structure assumption may restrict its modeling ability

• The tree structure is learned using a score-based algorithm

Page 52

Advanced solutions

• Section agenda

✓ Extensions using probabilistic graphical models (PGMs)

• Extensions using ensemble techniques

Page 53

Extensions using ensemble techniques

• Ensemble techniques

• Techniques that train multiple classifiers and combine their predictions to produce a single, stronger classifier

• Ensemble techniques can further improve the performance of MLC classifiers

• Objective: Use a combination of simpler classifiers to improve predictions

Page 54

Extensions using ensemble techniques

• Existing methods

• Ensemble of CCs (ECC) [Read et al, 2009]

• Mixture of CTBNs (MC) [Hong et al, 2014]

Page 55

Ensemble of Classifier Chains (ECC) [Read et al, 2009]

• Recall CC

• Key Idea

• Create a user-specified number of CCs on random subsets of the data, with random orderings of the class labels

• Predict by majority vote over all base classifiers
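An ECC is then a loop over CC training runs; a sketch assuming bootstrap subsets, shuffled label orders, logistic-regression CPDs, and per-label majority vote (synthetic data, since bootstrapping the 5-instance toy set could leave a label with a single class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
Y = (X @ rng.normal(size=(5, 3)) + 0.5 * rng.normal(size=(200, 3)) > 0).astype(int)
N, d = Y.shape

def train_cc(X, Y, order):
    # One classifier per label, each seeing X plus the preceding labels
    return [LogisticRegression().fit(np.hstack([X, Y[:, order[:p]]]), Y[:, i])
            for p, i in enumerate(order)]

def predict_cc(chain, order, x):
    y = np.zeros(len(order), dtype=int)
    for p, i in enumerate(order):
        x_aug = np.hstack([x, y[order[:p]].reshape(1, -1)])
        y[i] = chain[p].predict(x_aug)[0]
    return y

# Ensemble: K chains, each on a bootstrap sample with a shuffled label order
K, ensemble = 5, []
for _ in range(K):
    idx = rng.choice(N, size=N, replace=True)
    order = rng.permutation(d)
    ensemble.append((train_cc(X[idx], Y[idx], order), order))

x_new = rng.normal(size=(1, 5))
votes = np.array([predict_cc(c, o, x_new) for c, o in ensemble])
print((votes.mean(axis=0) >= 0.5).astype(int))   # majority vote per label
```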

Page 56

Ensemble of Classifier Chains (ECC) [Read et al, 2009]

• Advantages

• Oftentimes the performance improves over a single CC

• Disadvantages

• Ad-hoc ensemble implementation

• Learns base classifiers on random subsets of data with random label ordering

• Ensemble decisions are made by simple averaging over the base models, and are often inaccurate

Page 57

Mixture of CTBNs (MC) [Hong et al, 2014]

• Motivation

• If the underlying dependency structure in data is more complex than a tree structure, a single CTBN cannot model the data properly

• Key idea

• Use the Mixtures-of-Trees [Meila and Jordan, 2000] framework to learn multiple CTBNs and use them for prediction

Page 58

Mixture of CTBNs (MC) [Hong et al, 2014]

• MC defines the multivariate posterior distribution of the class vector as a mixture: P(y|x) = P(y1, …, yd|x) = Σk λk · P(y|x, Tk), where Σk λk = 1

• P(y|x,Tk) is the k-th mixture component defined by a CTBN Tk

• λk is the mixture coefficient representing the weight of the k-th component (influence of the k-th CTBN model Tk to the mixture)
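Numerically, the mixture is just a weighted sum of the component conditionals; a minimal sketch with stand-in component distributions (the comp table below is made-up data, not CTBN output):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
d, K = 3, 2
lambdas = np.array([0.6, 0.4])            # mixture coefficients (sum to 1)

# Stand-in for the CTBN conditionals: each component assigns a probability
# to every assignment y in {0,1}^d, for a fixed input x
assignments = list(product([0, 1], repeat=d))
comp = rng.random((K, 2 ** d))
comp /= comp.sum(axis=1, keepdims=True)   # each row is P(y | x, T_k)

# P(y|x) = sum_k lambda_k * P(y | x, T_k); MAP by enumeration (small d only)
mix = lambdas @ comp
print("MAP assignment:", assignments[int(mix.argmax())])
```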

Page 59

Mixture of CTBNs (MC) [Hong et al, 2014]

• An example MC [figure not recovered]

Page 60

Mixture of CTBNs (MC) [Hong et al, 2014]

• Parameter learning

• Objective: Optimize the model parameters (CTBN parameters {θ1, …, θK} and mixture coefficients {λ1, …, λK})

• Idea (apply EM)

1. Associate each instance (x(n), y(n)) with a hidden variable z(n) ∈ {1, …, K} indicating which CTBN it belongs to.

2. Iteratively optimize the expected complete log-likelihood: Q = Σn Σk P(z(n) = k | x(n), y(n)) · [ log λk + log P(y(n) | x(n), Tk; θk) ]

Page 61

Mixture of CTBNs (MC) [Hong et al, 2014]

• Structure learning

• Objective: Find multiple CTBN structures from data

• Idea (boosting-like heuristic)

1. On each addition of a new structure to the mixture, recalculate the weight of each data instance (ω) such that it represents the relative “hardness” of the instance

2. Learn the best tree structure by optimizing the weighted conditional log-likelihood: Σn ω(n) · log P(y(n) | x(n); T)

Page 62

Mixture of CTBNs (MC) [Hong et al, 2014]

• Prediction

• Objective: Find the maximum a posteriori (MAP) prediction for a new instance x

• Idea

1. Search the space of all class assignments by defining a Markov chain

2. Use an annealed version of exploration procedure to speed up the search

Page 63

Mixture of CTBNs (MC) [Hong et al, 2014]

• Advantages

• Learns an ensemble model for MLC in a principled way

• Produces accurate and reliable results

• Disadvantages

• Iterative optimization process in learning requires a large amount of time

Page 64

Agenda

✓ Motivation

✓ Solutions

✓ Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary

Page 65

Evaluation metrics

• Evaluation of MLC methods is more difficult than that of single-label classification

• Measuring the Hamming accuracy is not sufficient for the goal of MLC

• Hamming accuracy (HA) = (1/(N·d)) Σn Σi 1(yi(n) = ŷi(n))

• HA measures the individual accuracy on each class variable, which can be optimized by the binary relevance (BR) model

• We want to find “jointly accurate” class assignments

• We want to measure if the model predicts all the labels correctly

• Exact match accuracy (EMA) = (1/N) Σn 1(y(n) = ŷ(n))

Page 66

Evaluation metrics

• Exact match accuracy (EMA) = (1/N) Σn 1(y(n) = ŷ(n))

• EMA evaluates whether the prediction is correct on all class variables

• The most appropriate metric for MLC

• We are looking for the most probable assignment of classes

• It can be too strict

• Multi-label accuracy (MLA) = (1/N) Σn |y(n) ∧ ŷ(n)| / |y(n) ∨ ŷ(n)|

• MLA evaluates the Jaccard index between the predicted and true class assignments

• It is less strict than EMA, but tends to overestimate the model accuracy

Page 67

Evaluation metrics

• Conditional log-likelihood loss (CLL-loss)

• CLL-loss = − Σn log P(y(n) | x(n))

• Reflects the model fitness

• F1-scores: the harmonic mean of precision and recall

• Micro F1 = 2·ΣTP / (2·ΣTP + ΣFP + ΣFN)

• Aggregates the true/false positive counts over all instances and classes, then computes a single F1-score

• Macro F1 = (1/d) Σi F1(Yi)

• Computes the F1-score on each class and then averages
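The metrics on these two slides take only a few lines of numpy; a sketch assuming Y_true and Y_pred are N×d 0/1 matrices (function and variable names are ours):

```python
import numpy as np

def mlc_metrics(Y_true, Y_pred):
    N, d = Y_true.shape
    ha = (Y_true == Y_pred).mean()                  # Hamming accuracy
    ema = (Y_true == Y_pred).all(axis=1).mean()     # exact match accuracy
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    # Jaccard per instance; define 1.0 when both vectors are all-zero
    mla = np.where(union > 0, inter / np.maximum(union, 1), 1.0).mean()
    tp = np.logical_and(Y_true == 1, Y_pred == 1).sum()
    fp = np.logical_and(Y_true == 0, Y_pred == 1).sum()
    fn = np.logical_and(Y_true == 1, Y_pred == 0).sum()
    micro_f1 = 2 * tp / (2 * tp + fp + fn)          # pooled counts
    tp_c = np.logical_and(Y_true == 1, Y_pred == 1).sum(axis=0)
    fp_c = np.logical_and(Y_true == 0, Y_pred == 1).sum(axis=0)
    fn_c = np.logical_and(Y_true == 1, Y_pred == 0).sum(axis=0)
    macro_f1 = np.mean(2 * tp_c / np.maximum(2 * tp_c + fp_c + fn_c, 1))
    return ha, ema, mla, micro_f1, macro_f1

Y_true = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 0, 1], [1, 0, 1]])
print(mlc_metrics(Y_true, Y_pred))
```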

Page 68

Agenda

✓ Motivation

✓ Solutions

✓ Advanced solutions

✓ Evaluation metrics

• Toolboxes

• Summary

Page 69

Toolboxes

• MEKA: a Multi-label Extension to WEKA http://meka.sourceforge.net/

• Mulan: a Java library for Multi-label Learning http://mulan.sourceforge.net/

• LibSVM MLC Extension (BR and LP) http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multilabel/

• LAMDA Lab (Nanjing Univ., China) Code Repository http://lamda.nju.edu.cn/Default.aspx?Page=Data&NS=&AspxAutoDetectCookieSupport=1

• Prof. Min-Ling Zhang (Southeast Univ., China) http://cse.seu.edu.cn/old/people/zhangml/Resources.htm#codes

Page 70

Summary

[Diagram: the method map revisited — BR (independent classifiers) extended by CHF and IBLR (enriched feature space); LP (all possible label combinations) extended by PPT (pruned label combinations); Others: MLKNN and output coding; the advanced solutions CC, ECC, CTBN, and MC now fill the "want to achieve something better" gap]

Page 71

References

• [Batal et al, 2013] I. Batal, C. Hong, and M. Hauskrecht. “An efficient probabilistic framework for multi-dimensional classification”. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM). 2013, pp. 2417–2422.

• [Bose & Ray-Chaudhuri 1960] R. C. Bose and D. K. Ray-Chaudhuri. “On a class of error correcting binary group codes”. In: Information and Control 3 (1960), pp. 68–79.

• [Boutell et al, 2004] M. R. Boutell et al. “Learning Multi-label Scene Classification”. In: Pattern Recognition 37.9 (2004).

• [Cheng and Hüllermeier, 2009] W. Cheng and E. Hüllermeier. “Combining instance-based learning and logistic regression for multilabel classification”. In: Machine Learning 76.2-3 (2009).

• [Clare and King, 2001] A. Clare and R. D. King. “Knowledge Discovery in Multi-Label Phenotype Data”. In: Lecture Notes in Computer Science. Springer, 2001.

• [Dietterich, 1995] T. G. Dietterich and G. Bakiri. “Solving Multiclass Learning Problems via Error-Correcting Output Codes”. In: Journal of Artificial Intelligence Research 2 (1995), pp. 263–286.

• [Donoho, 2006] D. Donoho. “Compressed sensing”. In: IEEE Transactions on Information Theory 52.4 (April 2006), pp. 1289–1306.

Page 72

References

• [Ghamrawi and McCallum, 2005] N. Ghamrawi and A. McCallum. “Collective multi-label classification”. In: Proceedings of the 14th ACM International Conference on Information and Knowledge Management (CIKM). 2005, pp. 195–200.

• [Godbole et al, 2004] S. Godbole and S. Sarawagi. “Discriminative Methods for Multi-labeled Classification”. In: PAKDD’04. 2004, pp. 22–30.

• [Hong et al, 2014] C. Hong, I. Batal, and M. Hauskrecht. “A mixtures-of-trees framework for multi-label classification”. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM). ACM, 2014.

• [Hsu et al, 2009] D. Hsu et al. “Multi-Label Prediction via Compressed Sensing”. In: NIPS. 2009, pp. 772–780.

• [Johnson & Wichern, 2002] R. A. Johnson and D. W. Wichern. “Applied Multivariate Statistical Analysis” (5th ed.). Upper Saddle River, NJ: Prentice-Hall, 2002.

• [Meila and Jordan, 2000] M. Meila and M. I. Jordan. “Learning with mixtures of trees”. In: Journal of Machine Learning Research 1 (2000), pp. 1–48.

• [Pakdaman et al, 2014] M. Pakdaman, I. Batal, Z. Liu, C. Hong, and M. Hauskrecht. “An optimization-based framework to learn conditional random fields for multi-label classification”. In: SDM. SIAM, 2014.

Page 73

References

• [Read et al, 2008] J. Read, B. Pfahringer, and G. Holmes. “Multi-label Classification Using Ensembles of Pruned Sets”. In: ICDM. IEEE Computer Society, 2008, pp. 995–1000.

• [Read et al, 2009] J. Read et al. “Classifier Chains for Multi-label Classification”. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II (ECML PKDD ’09). Bled, Slovenia: Springer-Verlag, 2009, pp. 254–269.

• [Tai and Lin, 2010] F. Tai and H.-T. Lin. “Multi-label Classification with Principle Label Space Transformation”. In: Proceedings of the 2nd International Workshop on Multi-Label Learning. 2010.

• [van der Gaag and de Waal, 2006] L. C. van der Gaag and P. R. de Waal. “Multi-dimensional Bayesian Network Classifiers”. In: Probabilistic Graphical Models. 2006, pp. 107–114.

• [Zhang and Schneider, 2012a] Y. Zhang and J. Schneider. “A Composite Likelihood View for Multi-Label Classification”. In: AISTATS (2012).

• [Zhang and Schneider, 2012b] Y. Zhang and J. Schneider. “Maximum Margin Output Coding”. In: Proceedings of the 29th International Conference on Machine Learning (ICML ’12). Edinburgh, Scotland, UK: Omnipress, 2012, pp. 1575–1582.

• [Zhang and Zhou, 2007] M.-L. Zhang and Z.-H. Zhou. “ML-KNN: A lazy learning approach to multi-label learning”. In: Pattern Recognition 40.7 (July 2007), pp. 2038–2048.

Page 74

Thanks!
