slides mlc 3750 - people.cs.pitt.edumilos/courses/cs3750-fall2014/lectures/class22.pdf · solutions...

Multi-label Classification

Charmgil Hong cs3750

(Presented on Nov 11, 2014)

Goals of the talk

1. To understand the geometry of different approaches to multi-label classification

2. To appreciate how machine learning techniques further improve multi-label classification methods

3. To learn how to evaluate multi-label classification methods

2

Agenda

• Motivation & Problem definition

• Solutions

• Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary

3

Notation

• X ∈ R^m : feature vector variable (input)

• Y ∈ R^d : class vector variable (output)

• x = (x1, ..., xm): feature vector instance

• y = (y1, ..., yd): class vector instance

• As shorthand, P(Y=y|X=x) = P(y|x)

• Dtrain : training dataset; Dtest : test dataset

4

Motivation

• Traditional classification

• Each data instance is associated with a single class variable

5

Motivation

• An issue with traditional classification

• In many real-world applications, each data instance can be associated with multiple class variables

• Examples

• A news article may cover multiple topics, such as politics and economics

• An image may include multiple objects, such as a building, a road, and a car

• A gene may be associated with several biological functions

6

Problem Definition

• Multi-label classification (MLC)

• Each data instance is associated with multiple binary class variables

• Objective: assign each instance the most probable assignment of the class variables

7

[Figure: example instances labeled with two class variables, Class 1 ∈ {R, B} and Class 2]

A simple solution

• Idea

• Transform a multi-label classification problem to multiple single-label classification problems

• Learn d independent classifiers for d class variables


Binary Relevance (BR) [Clare and King, 2001; Boutell et al, 2004]

• Illustration

Dtrain   X1    X2    Y1   Y2   Y3
n=1      0.7   0.4   1    1    0
n=2      0.6   0.2   1    1    0
n=3      0.1   0.9   0    0    1
n=4      0.3   0.1   0    0    0
n=5      0.8   0.9   1    0    1

h1 : X → Y1

h2 : X → Y2

h3 : X → Y3

• Advantages

• Computationally efficient

• Disadvantages

• Does not capture the dependence relations among the class variables

• Not suitable for the objective of MLC

• Does not find the most probable assignment

• Instead, it maximizes the marginal distribution of each class variable
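The BR idea can be sketched in a few lines. This is a minimal illustration, not the implementation from the cited papers: the nearest-centroid base learner and the function names (`fit_br`, `predict_br`) are assumptions chosen to keep the example self-contained, and any binary classifier could take its place. The training data is the Dtrain table from the illustration.

```python
# Binary Relevance sketch: one independent binary classifier per label.
# The nearest-centroid base learner is an illustrative stand-in (assumption),
# not a method named in the slides.

def fit_centroid(X, y):
    """Return {class_value: centroid} for one binary label."""
    centroids = {}
    for value in set(y):
        rows = [x for x, label in zip(X, y) if label == value]
        centroids[value] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict_centroid(centroids, x):
    """Predict the class whose centroid is closest to x (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda value: sq_dist(centroids[value]))

def fit_br(X, Y):
    """Train d independent models h_1..h_d, one per column of Y."""
    return [fit_centroid(X, list(col)) for col in zip(*Y)]

def predict_br(models, x):
    """Each label is predicted on its own; label dependences are ignored."""
    return [predict_centroid(m, x) for m in models]

X_train = [[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]]
Y_train = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1]]

models = fit_br(X_train, Y_train)
prediction = predict_br(models, [0.65, 0.3])
print(prediction)
```

Because each model only ever sees its own label column, nothing in this sketch can express that Y1 and Y2 tend to co-occur, which is the disadvantage listed above.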

10


• Marginal vs. Joint: a motivating example

• Question: find the most probable assignment (MAP: maximum a posteriori) of Y = (Y1,Y2)

P(Y1,Y2|X=x)   Y1 = 0   Y1 = 1   P(Y2|X=x)
Y2 = 0         0.2      0.45     0.65
Y2 = 1         0.35     0        0.35
P(Y1|X=x)      0.55     0.45

➡ Prediction on the joint (MAP): Y1 = 1, Y2 = 0

➡ Prediction on the marginals: Y1 = 0, Y2 = 0

• We want to maximize the joint distribution of Y given observation X = x; i.e., y* = argmax_y P(Y=y|X=x)
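The gap between the two predictions can be checked numerically. The sketch below hard-codes the joint table from this example (keys are (y1, y2) pairs); the helper name `marginal` is illustrative.

```python
# Joint distribution P(Y1, Y2 | X = x) from the motivating example.
joint = {(0, 0): 0.2, (1, 0): 0.45, (0, 1): 0.35, (1, 1): 0.0}

def marginal(joint, index):
    """Sum the joint over all variables except the one at position `index`."""
    m = {}
    for y, p in joint.items():
        m[y[index]] = m.get(y[index], 0.0) + p
    return m

# MAP on the joint: the single most probable assignment of (Y1, Y2).
map_joint = max(joint, key=joint.get)

# Prediction on the marginals: maximize each P(Yi | X = x) separately.
map_marginals = tuple(
    max(marginal(joint, i), key=marginal(joint, i).get) for i in range(2)
)

print(map_joint, joint[map_joint])
print(map_marginals, joint[map_marginals])
```

The marginal-wise prediction (0, 0) has joint probability 0.2, while the joint MAP (1, 0) has probability 0.45.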

Another simple solution

• Idea

• Transform each label combination to a class value

• Learn a multi-class classifier with the new class values


Label Powerset (LP) [Tsoumakas and Vlahavas, 2007]

• Illustration

Dtrain   X1    X2    Y1   Y2   Y3   YLP
n=1      0.7   0.4   0    0    1    1
n=2      0.6   0.2   0    0    1    1
n=3      0.1   0.9   0    1    0    2
n=4      0.3   0.1   0    1    1    3
n=5      0.8   0.9   1    0    1    4

hLP : X → YLP

• Advantages

• Learns the full joint of the class variables

• Each of the new class values maps to a label combination

• Disadvantages

• The number of choices in the new class can be exponential (|YLP| = O(2^d))

• Learning a multi-class classifier on exponential choices is expensive

• The resulting class distribution would be sparse and imbalanced

• Only predicts the label combinations that are seen in the training set
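The LP transformation itself is a simple relabeling. The sketch below assigns class ids in order of first appearance, which reproduces the YLP column of the illustration (1, 1, 2, 3, 4); the function names are illustrative, and the multi-class learner hLP is left out.

```python
def fit_label_powerset(Y):
    """Map each distinct label combination to a class id (1, 2, ...)."""
    combo_to_class = {}
    y_lp = []
    for row in Y:
        combo = tuple(row)
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class) + 1
        y_lp.append(combo_to_class[combo])
    return combo_to_class, y_lp

def decode(combo_to_class, class_id):
    """Map a predicted class id back to its label combination."""
    class_to_combo = {c: combo for combo, c in combo_to_class.items()}
    return list(class_to_combo[class_id])

Y_train = [[0, 0, 1], [0, 0, 1], [0, 1, 0], [0, 1, 1], [1, 0, 1]]
mapping, y_lp = fit_label_powerset(Y_train)
print(y_lp)
print(decode(mapping, 3))
```

Since `decode` can only return combinations stored in `mapping`, a classifier built this way never predicts a label combination that was absent from the training set.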

14

BR vs. LP

• BR and LP are two extreme MLC approaches

• BR maximizes the marginals of each class variable, while LP directly models the joint of all class variables

• BR is computationally more efficient, but does not consider the relationships among the class variables

• LP considers the relationships among the class variables by modeling their full joint, but can be computationally very expensive

[Diagram: BR (independent classifiers) and LP (all possible label combinations) as the two extremes]

Agenda

✓ Motivation

• Solutions

• Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary

16

Solutions


• Section agenda

• Solutions rooted on BR

• Solutions rooted on LP

• Other solutions

17

Solutions rooted on BR

• BR: Binary Relevance [Clare and King, 2001; Boutell et al, 2004]

• Models independent classifiers P(yi|x) on each class variable

• Does not learn the class dependences

• Key extensions from BR

• Learn the class dependence relations by adding new class-dependent features: P(yi|x, {new_features})

• Idea: layered approach

• Layer-1: Learn and predict on Dtrain, using the BR approach

• Layer-2: Learn d classifiers on the original features and the output of layer-1

• Existing methods

• Classification with Heterogeneous Features (CHF) [Godbole et al, 2004]

• Instance-based Logistic Regression (IBLR) [Cheng and Hüllermeier, 2009]

Classification with Heterogeneous Features (CHF)

• Illustration

Layer-1: learn BR classifiers on Dtrain

Dtrain   X1    X2    Y1   Y2   Y3
n=1      0.7   0.4   1    1    0
n=2      0.6   0.2   1    1    0
n=3      0.1   0.9   0    0    1
n=4      0.3   0.1   0    0    0
n=5      0.8   0.9   1    0    1

hbr1 : X → Y1

hbr2 : X → Y2

hbr3 : X → Y3

Layer-2: learn classifiers on XCHF (the original features plus the layer-1 outputs)

XCHF     X1    X2    hbr1(X)   hbr2(X)   hbr3(X)   Y1   Y2   Y3
n=1      0.7   0.4   .xx       .xx       .xx       1    1    0
n=2      0.6   0.2   .xx       .xx       .xx       1    1    0
n=3      0.1   0.9   .xx       .xx       .xx       0    0    1
n=4      0.3   0.1   .xx       .xx       .xx       0    0    0
n=5      0.8   0.9   .xx       .xx       .xx       1    0    1

h1 : XCHF → Y1

h2 : XCHF → Y2

h3 : XCHF → Y3

Instance-based Logistic Regression (IBLR)

• Illustration

Dtrain   X1    X2    Y1   Y2   Y3
n=1      0.7   0.4   1    1    0
n=2      0.6   0.2   1    1    0
n=3      0.1   0.9   0    0    1
n=4      0.3   0.1   0    0    0
n=5      0.8   0.9   1    0    1

KNN score (k=3): for each instance, λ1, λ2, λ3 are computed from the labels of its k nearest neighbors

Neighbor   y1    y2    y3
1st        1     1     0
2nd        0     0     1
3rd        1     0     0
KNN score  2/3   1/3   1/3

XIBLR    X1    X2    λ1    λ2    λ3    Y1   Y2   Y3
n=1      0.7   0.4   .xx   .xx   .xx   1    1    0
n=2      0.6   0.2   .xx   .xx   .xx   1    1    0
n=3      0.1   0.9   .xx   .xx   .xx   0    0    1
n=4      0.3   0.1   .xx   .xx   .xx   0    0    0
n=5      0.8   0.9   .xx   .xx   .xx   1    0    1

h1 : XIBLR → Y1

h2 : XIBLR → Y2

h3 : XIBLR → Y3
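The KNN score in the illustration is the fraction of the k nearest neighbors that carry each label. The sketch below computes only that score, under the assumptions of squared Euclidean distance and a query point that is not itself in the training set; the logistic regression layer of IBLR is omitted, and the function names are illustrative.

```python
from fractions import Fraction

def label_fractions(neighbor_labels):
    """Per-label fraction of neighbors that have the label switched on."""
    k = len(neighbor_labels)
    return [Fraction(sum(col), k) for col in zip(*neighbor_labels)]

def knn_scores(X_train, Y_train, x, k):
    """KNN scores (lambda_1..lambda_d) of query x from its k nearest training instances."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    order = sorted(range(len(X_train)), key=lambda n: sq_dist(X_train[n], x))
    return label_fractions([Y_train[n] for n in order[:k]])

# The three neighbor label vectors shown in the illustration (k = 3).
slide_scores = label_fractions([[1, 1, 0], [0, 0, 1], [1, 0, 0]])

X_train = [[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.8, 0.9]]
Y_train = [[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0], [1, 0, 1]]
query_scores = knn_scores(X_train, Y_train, [0.65, 0.3], k=3)
print(slide_scores, query_scores)
```

These scores are then appended to the original features (the λ columns of XIBLR) before the layer-2 classifiers are trained.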

Solutions rooted on BR: CHF & IBLR

• Advantages

• Model the class dependences by enriching the feature space using the layer-1 classifiers

• Disadvantages

• Learn the dependence relations in an indirect way

• The predictions are not stable

22

Solutions rooted on LP

• LP: Label Powerset [Tsoumakas and Vlahavas, 2007]

• Models a multi-class classifier on the enumeration of all possible class assignments

• Can create exponentially many classes and be computationally very expensive

• Key extensions from LP

• Prune the infrequent class assignments from consideration to reduce the size of the class assignment space

• Represent the joint distribution more compactly

23

Pruned problem transformation (PPT) [Read et al, 2008]

• Class assignment conversion in PPT

• Prune infrequent class assignment sets

• User specifies the threshold for “infrequency”

Before pruning (LP conversion):

Dtrain-LP   X1    X2    Y1   Y2   Y3   YLP
n=1         0.7   0.4   0    0    1    1
n=2         0.6   0.2   0    0    1    1
n=3         0.1   0.9   0    0    0    0
n=4         0.3   0.1   0    1    0    2
n=5         0.1   0.8   0    0    0    0
n=6         0.2   0.1   0    1    0    2
n=7         0.2   0.2   0    1    0    2
n=8         0.2   0.9   0    0    0    0
n=9         0.7   0.3   0    0    1    1
n=10        0.9   0.9   0    1    1    3

|YLP| = 4

After pruning (PPT conversion): the infrequent assignment (0, 1, 1) of n=10 is replaced by the frequent assignments (0, 0, 1) and (0, 1, 0)

Dtrain-PPT  X1    X2    Y1   Y2   Y3   YPPT
n=1         0.7   0.4   0    0    1    1
n=2         0.6   0.2   0    0    1    1
n=3         0.1   0.9   0    0    0    0
n=4         0.3   0.1   0    1    0    2
n=5         0.1   0.8   0    0    0    0
n=6         0.2   0.1   0    1    0    2
n=7         0.2   0.2   0    1    0    2
n=8         0.2   0.9   0    0    0    0
n=9         0.7   0.3   0    0    1    1
n=10        0.9   0.9   0    0    1    1
n=11        0.9   0.9   0    1    0    2

|YPPT| = 3
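The conversion in the illustration can be sketched as follows. This is a simplified reading of PPT, not the reference implementation: it assumes that an instance with an infrequent assignment is re-introduced once for every frequent, non-empty assignment whose labels are a subset of its own, and `min_count` is a hypothetical name for the user-specified threshold.

```python
from collections import Counter

def ppt_transform(X, Y, min_count):
    """Prune label combinations seen fewer than min_count times."""
    combos = [tuple(row) for row in Y]
    counts = Counter(combos)
    frequent = {c for c, n in counts.items() if n >= min_count}
    X_out, Y_out = [], []
    for x, combo in zip(X, combos):
        if combo in frequent:
            X_out.append(x)
            Y_out.append(combo)
            continue
        # Infrequent combination: keep the instance, but relabel it with each
        # frequent non-empty combination that is a subset of its labels.
        for cand in sorted(frequent):
            if any(cand) and all(c <= y for c, y in zip(cand, combo)):
                X_out.append(x)
                Y_out.append(cand)
    return X_out, Y_out

X = [[0.7, 0.4], [0.6, 0.2], [0.1, 0.9], [0.3, 0.1], [0.1, 0.8],
     [0.2, 0.1], [0.2, 0.2], [0.2, 0.9], [0.7, 0.3], [0.9, 0.9]]
Y = [[0, 0, 1], [0, 0, 1], [0, 0, 0], [0, 1, 0], [0, 0, 0],
     [0, 1, 0], [0, 1, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1]]

X_ppt, Y_ppt = ppt_transform(X, Y, min_count=2)
print(len(Y_ppt), len(set(Y_ppt)))
```

With a threshold of 2, the single (0, 1, 1) instance becomes two instances labeled (0, 0, 1) and (0, 1, 0), so the training set grows to 11 rows while the number of classes drops from 4 to 3.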

Solutions rooted on LP: PPT

• Advantages

• Simple add-on to the LP method that focuses on key relationships

• Models the full joint more efficiently

• Disadvantages

• Based on an ad-hoc pruning heuristic

• Mapping to lower dimensional label space is not clear

• (As LP) Only predicts the label combinations that are seen in the training set

26

Other solution: MLKNN [Zhang and Zhou, 2007]

• Multi-label k-Nearest Neighbor (MLKNN) [Zhang and Zhou, 2007]

• Learn a classifier for each class (as BR) by combining k-nearest neighbor with Bayesian inference

• Applicability is limited, as with KNN

• Does not produce a model

• Does not work well on high-dimensional data

27

Multi-label output coding

• Key idea

• Motivated by the error-correcting output coding (ECOC) scheme [Dietterich 1995; Bose & Ray-Chaudhuri 1960] in communication

• Solve the MLC problems using lower dimensional codewords

• An output coding MLC method usually consists of three parts:

• Encoding: Convert output vectors Y into codewords Z

• Prediction: Perform regression from X to Z; say R

• Decoding: Recover the class assignments Y from R



• OC (Output Coding) with Compressed Sensing (OCCS) [Hsu et al, 2009]

• Principle Label Space Transformation (PLST) [Tai and Lin, 2010]

• OC with Canonical Correlation Analysis (CCAOC) [Zhang and Schneider, 2011]

• Maximum Margin Output Coding (MMOC) [Zhang and Schneider, 2012b]

29

Principle Label Space Transformation (PLST) [Tai and Lin, 2010]

• Encoding: Convert output vectors Y into codewords Z, using the singular value decomposition (SVD)

• Z = V^T Y = (V1^T Y, ..., Vq^T Y), where V is a d × q projection matrix (d > q)

• Prediction: Perform regression from X to Z; say R

• Decoding: Recover the class labels Y from R using SVD

• Achieved by optimizing a combinatorial loss function

[Figure: Features X, Codewords Z, Labels Y; (i) Encoding (SVD): Y → Z, (ii) Regression: X → Z, (iii) Decoding (SVD): Z → Y]


• Existing methods are differentiated from one to another mainly by the encoding/decoding schemes they apply

31

Method   Key difference
OCCS     Uses compressed sensing [Donoho 2006] for encoding and decoding
PLST     Uses singular value decomposition (SVD) [Johnson & Wichern 2002] for encoding and decoding
CCAOC    Uses canonical correlation analysis (CCA) [Johnson & Wichern 2002] for encoding and mean-field approximation for decoding
MMOC     Uses SVD for encoding and maximum margin formulation for decoding


• Advantages

• Show excellent prediction performances

• Disadvantages

• Only able to predict the single best output for a given input

• Cannot estimate probabilities for different input-output pairs

• Not scalable

• Encoding and decoding steps rely on matrix decomposition, whose complexities are sensitive to d and N

• Cannot be generalized to non-binary cases

32

Section summary

BR: independent classifiers

• CHF, IBLR: enriched feature space

LP: all possible label combinations

• PPT: pruned label combinations

Others: MLKNN, output coding

?: want to achieve something better

Agenda

✓ Motivation

✓ Solutions

• Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary

34

Advanced solutions


• Section agenda

• Extensions using probabilistic graphical models (PGMs)

• Extensions using ensemble techniques

35

Extensions using PGMs

• Probabilistic Graphical Models (PGMs)

• PGM refers to a family of distributions on a set of random variables that are compatible with all the probabilistic independence propositions encoded in a graph

• A smart way to formulate exponentially large probability distribution without paying an exponential cost

• Using PGMs, we can reduce the model complexity

• PGM = Multivariate statistics + Graphical structure

36


• Representation: Two types

• Undirected graphical models (UGMs)

• Also known as Markov networks (MNs)

• Directed graphical models (DGMs)

• Also known as Bayesian networks (BNs)

[Figure: two-node examples over X1 and X2. Undirected (MN): nodes are variables, edges are correlations. Directed (BN): nodes are variables, edges are causal relations.]


• How do PGMs reduce the model complexity?

• Key idea: Exploit the conditional independence (CI) relations among variables!!

• Conditional independence (CI): Random variables A, B are conditionally independent given C, if P(A,B|C) = P(A|C)P(B|C)

• UGM and DGM offer a set of graphical notations for CI

[Figure: CI representation in UGM over nodes A, B, C, encoding A ⊥ B | C]

[Figure: CI representations in DGM over nodes A, B, C; two structures encode A ⊥ B | C, and one structure (marked ⧸) does not]


• PGMs have been an excellent representation / formulation tool for the MLC problems

• The dependences among features (X) and class variables (Y) can be represented easily with PGMs

• By exploiting the conditional independence, we can make the computation simpler

40



• Undirected models (Markov networks)

• Multi-label Conditional Random Field (ML-CRF) [Ghamrawi and McCallum, 2005; Pakdaman et al, 2014]

• Composite Marginal Models (CMM) [Zhang and Schneider, 2012a]

• Directed models (Bayesian networks)

• Multi-dimensional Bayesian Network Classifiers (MBC) [van der Gaag and de Waal, 2006]

• Classifier Chains (CC) [Read et al, 2009]

• Conditional Tree-structured Bayesian Networks (CTBN) [Batal et al, 2013]

41

Multi-dimensional Bayesian Network Classifiers (MBC) [van der Gaag and de Waal, 2006]

• Key idea

• Model the full joint of input and output using a Bayesian network

• Use graphical structures to represent the dependence relations among the input and output variables

• Example MBC (d = 3, m = 4)

[Figure: example MBC with class variables Y1, Y2, Y3 and feature variables X1, X2, X3, X4]

The joint distribution P(X,Y) is represented by the decomposition

X = X1|X2 · X2|X3 · X3 · X4|X2 and

Y = Y1|Y2 · Y2 · Y3|Y2

Multi-dimensional Bayesian Network Classifiers (MBC) [van der Gaag and de Waal, 2006]

• Advantages

• The full joint distribution of the feature and class variables can be represented efficiently using the Bayesian network

• Disadvantages

• Models the relations among the feature variables which do not carry much information in modeling the multi-label relations

43

Multi-label Conditional Random Fields (MLCRF) [Pakdaman et al, 2014]

• Key idea

• Model the conditionals P(Y|X) to capture the relations among the class variables conditioned on the feature variables

• Learn a pairwise Markov network to model the relations between the input and output variables

• Representation

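[Figure: class variables Y1, Y2, Y3, ..., Yd connected in a loop by the pairwise potentials ψ1,2, ψ2,3, ..., ψd,1]

P(y|x) = (1/Z) ∏(i,j) ψi,j(yi, yj, x) ∏i 𝜙i(yi, x)

(ψi,j and 𝜙i are the potentials of Yi, Yj, X; and Z is the normalization term)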

Multi-label Conditional Random Fields (MLCRF) [Pakdaman et al, 2014]

• Advantages

• Directly models the conditional joint distribution P(Y|X)

• Disadvantages

• Learning and prediction are computationally very demanding

• To perform inference, the normalization term Z must be computed, which is usually very costly

• The iterative parameter learning process requires inference at every step, which makes learning even more expensive than prediction

• In practice, approximate inference techniques are applied to make the model usable

45

Classifier Chains (CC) [Read et al, 2009]

• Key idea

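• Model P(Y|X) using a directed chain network, where each class variable is conditioned on all preceding class variables in the chain

• Representation

P(Y1, ..., Yd|X) = ∏i P(Yi|X, π(Yi)), where π(Yi) denotes all preceding labels Y1, ..., Yi−1

[Figure: chain Y1, Y2, Y3, ..., Yd; X is a parent of every class variable, and each Yi also depends on all preceding labels]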

Classifier Chains (CC) [Read et al, 2009]

• Learning

• No structure learning (random chain order)

• Parameter learning is performed on the decomposed CPDs: argmaxθ P(Yi|X, π(Yi); θ)

• Prediction

• Performed by greedy maximization of each factor (CPD) in chain order: argmaxYi P(Yi|X, π(Yi); θ)
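Greedy prediction along the chain can miss the most probable joint assignment. The sketch below uses hypothetical CPDs for a fixed x, chosen so that the chain Y1 → Y2 reproduces the joint table of the earlier marginal vs. joint example; each CPD is a function from the tuple of preceding label values to a distribution over {0, 1}. The exhaustive search is included only for comparison and is exponential in d.

```python
from itertools import product

def greedy_predict(cpds):
    """Pick each label in chain order, conditioning on the labels already chosen."""
    y = []
    for cpd in cpds:
        dist = cpd(tuple(y))
        y.append(max(dist, key=dist.get))
    return tuple(y)

def joint_prob(cpds, y):
    """P(y | x) as the product of the chain's CPDs."""
    p = 1.0
    for i, cpd in enumerate(cpds):
        p *= cpd(tuple(y[:i]))[y[i]]
    return p

def exact_map(cpds):
    """Brute-force MAP over all 2^d assignments (for comparison only)."""
    return max(product((0, 1), repeat=len(cpds)),
               key=lambda y: joint_prob(cpds, y))

# Hypothetical CPDs: P(Y1 | x) and P(Y2 | x, Y1).
cpds = [
    lambda prev: {0: 0.55, 1: 0.45},
    lambda prev: ({0: 0.2 / 0.55, 1: 0.35 / 0.55} if prev[0] == 0
                  else {0: 1.0, 1: 0.0}),
]

greedy = greedy_predict(cpds)
exact = exact_map(cpds)
print(greedy, joint_prob(cpds, greedy))
print(exact, joint_prob(cpds, exact))
```

Here the greedy pass commits to Y1 = 0 because it has the larger marginal, and ends at an assignment with probability 0.35, while the exact MAP (1, 0) has probability 0.45.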


Conditional Tree-structured Bayesian Networks (CTBNs) [Batal et al, 2013]

• Key idea

• Learn P(Y|X) using a tree-structured Bayesian network of the class labels

• Tree-structures can be seen as restricted chains, where each class variable has at most one parent class variable

• Example CTBN

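[Figure: example CTBN over X and Y1, Y2, Y3, Y4, where each class variable has at most one parent label, together with the factorization of P(Y1, ..., Y4|X) that the network represents]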

• Learning

• Structure learning by optimizing conditional log-likelihood

1. Define a complete weighted directed graph whose edge weights are equal to the conditional log-likelihood

2. Find the maximum branching tree of the graph (maximum branching tree = maximum weighted directed spanning tree)

• Parameter learning is performed on the decomposed CPDs

• Prediction

• Exact MAP prediction is performed by a belief propagation (max-product) algorithm
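Max-product on a tree can be sketched with a small recursive routine. This is an illustration under stated assumptions, not the authors' code: labels are binary, x is fixed, `parent[i]` gives the parent label of Yi (or None for a root), and `cpd[i][u][v]` is a hypothetical table holding P(Yi = v | x, parent value u). Each (node, parent value) pair is solved once, so the cost grows linearly with the number of labels.

```python
def tree_map(parent, cpd):
    """Exact MAP assignment over a tree (or forest) of binary labels."""
    d = len(parent)
    children = [[] for _ in range(d)]
    for i, p in enumerate(parent):
        if p is not None:
            children[p].append(i)

    memo = {}  # (node, parent value) -> (best subtree score, best node value)

    def message(i, u):
        """Best score of the subtree rooted at i, given its parent's value u."""
        if (i, u) not in memo:
            best = (-1.0, None)
            for v in (0, 1):
                score = cpd[i][u][v]
                for c in children[i]:
                    score *= message(c, v)[0]
                if score > best[0]:
                    best = (score, v)
            memo[(i, u)] = best
        return memo[(i, u)]

    y = [None] * d

    def backtrack(i, u):
        """Read off the maximizing values from the root downwards."""
        y[i] = message(i, u)[1]
        for c in children[i]:
            backtrack(c, y[i])

    for root in [i for i, p in enumerate(parent) if p is None]:
        backtrack(root, None)
    return tuple(y)

def joint_prob(parent, cpd, y):
    """P(y | x) under the tree factorization."""
    prob = 1.0
    for i, p in enumerate(parent):
        u = None if p is None else y[p]
        prob *= cpd[i][u][y[i]]
    return prob

# Hypothetical tree: Y1 is the root, Y2 and Y3 both have Y1 as parent.
parent = [None, 0, 0]
cpd = [
    {None: {0: 0.55, 1: 0.45}},
    {0: {0: 0.4, 1: 0.6}, 1: {0: 0.9, 1: 0.1}},
    {0: {0: 0.5, 1: 0.5}, 1: {0: 0.2, 1: 0.8}},
]
y_map = tree_map(parent, cpd)
print(y_map, joint_prob(parent, cpd, y_map))
```

In this example the root's marginal favors Y1 = 0, yet the MAP assignment sets Y1 = 1 because its children are much more certain under that value.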


CC vs. CTBN

CC (each label is conditioned on all preceding labels)

• Decomposes the joint probability along the chain structure

• Maximizes the marginals along the chain (suboptimal solution)

• Errors in prediction propagate to the following label predictions

• No structure learning (label ordering is given at random)

CTBN (each label has at most one parent label)

• Decomposes the joint probability along the tree structure

• Performs exact MAP prediction (linear-time optimal solution)

• The tree-structure assumption may restrict its modeling ability

• Tree structure is learned using a score-based algorithm

Advanced solutions

• Section agenda

✓ Extensions using probabilistic graphical models (PGMs)

• Extensions using ensemble techniques

52

Extensions using ensemble techniques

• Ensemble techniques

• Techniques of training multiple classifiers and combining their predictions to produce a single classifier

• Ensemble techniques can further improve the performance of MLC classifiers

• Objective: Use a combination of simpler classifiers to improve predictions

53

Extensions using ensemble techniques


• Ensemble of CCs (ECC) [Read et al, 2009]

• Mixture of CTBNs (MC) [Hong et al, 2014]

54

Ensemble of Classifier Chains (ECC) [Read et al, 2009]

• Recall CC

• Key Idea

• Create a user-specified number of CCs on random subsets of data with random orderings of the class labels

• Predict by majority vote over all base classifiers
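The two ingredients of ECC, randomized training setups and voting, can be sketched without the base chains themselves. The function names, the `subset_fraction` parameter, and the tie rule (ties go to 0) are assumptions made for illustration.

```python
import random

def sample_chain_setups(n_instances, d, n_chains, subset_fraction, seed=0):
    """Draw a random data subset and a random label order for each chain."""
    rng = random.Random(seed)
    subset_size = max(1, int(subset_fraction * n_instances))
    setups = []
    for _ in range(n_chains):
        subset = sorted(rng.sample(range(n_instances), subset_size))
        order = list(range(d))
        rng.shuffle(order)
        setups.append((subset, order))
    return setups

def majority_vote(predictions):
    """Combine one label vector per chain; a label is on if most chains say so."""
    n = len(predictions)
    return [1 if 2 * sum(col) > n else 0 for col in zip(*predictions)]

setups = sample_chain_setups(n_instances=10, d=3, n_chains=5, subset_fraction=0.67)

# Hypothetical predictions of three chains for one test instance.
votes = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
ensemble = majority_vote(votes)
print(ensemble)
```

Each setup would be used to train one classifier chain; at prediction time the chains' outputs are mapped back to the original label order before voting.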


Ensemble of Classifier Chains (ECC) [Read et al, 2009]

• Advantages

• Performance often improves over a single CC

• Disadvantages

• Ad-hoc ensemble implementation

• Learns base classifiers on random subsets of data with random label ordering

• Ensemble decisions are made by simple averaging over the base models and are often inaccurate

56

Mixture of CTBNs (MC) [Hong et al, 2014]

• Motivation

• If the underlying dependency structure in data is more complex than a tree structure, a single CTBN cannot model the data properly

• Key idea

• Use the Mixtures-of-Trees [Meila and Jordan, 2000] framework to learn multiple CTBNs and use them for prediction

57


• MC defines the multivariate posterior distribution of class vector P(y|x) = P(y1, …,yd|x) as P(y|x) = Σk λk P(y|x,Tk), for k = 1, …, K

• P(y|x,Tk) is the k-th mixture component defined by a CTBN Tk

• λk is the mixture coefficient representing the weight of the k-th component (influence of the k-th CTBN model Tk to the mixture)
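As a small numeric illustration of the mixture, P(y|x) is the λ-weighted sum of the component probabilities P(y|x,Tk). The weights and component probabilities below are made-up values, and the function name is illustrative.

```python
def mixture_prob(weights, component_probs):
    """P(y | x) = sum over k of lambda_k * P(y | x, T_k)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture coefficients must sum to 1")
    return sum(w * p for w, p in zip(weights, component_probs))

# Hypothetical mixture of K = 3 CTBNs evaluated at one (x, y) pair.
weights = [0.5, 0.3, 0.2]          # lambda_1..lambda_3
component_probs = [0.4, 0.1, 0.7]  # P(y | x, T_k) for each tree
p = mixture_prob(weights, component_probs)
print(p)
```

A component with a small coefficient contributes little even when its own probability is high, which is how the coefficients express each tree's influence on the mixture.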

58


• An example MC

59


• Parameter learning

• Objective: Optimize the model parameters (CTBN parameters {θ1, …, θK} and mixture coefficients {λ1, …, λK})

• Idea (apply EM)

1. Associate each instance (x(n), y(n)) with a hidden variable z(n) ∈ {1, …, K} indicating which CTBN it belongs to.

2. Iteratively optimize the expected complete log-likelihood: E[ Σn Σk 1[z(n) = k] log( λk P(y(n)|x(n),Tk) ) ]

60


• Structure learning

• Objective: Find multiple CTBN structures from data

• Idea (boosting-like heuristic)

1. On each addition of a new structure to the mixture, recalculate the weight of each data instance (ω) such that it represents the relative “hardness” of the instance

2. Learn the best tree structure by optimizing the weighted conditional log-likelihood: Σn ω(n) log P(y(n)|x(n),T)

61


• Prediction

• Objective: Find the maximum a posteriori (MAP) prediction for a new instance x

• Idea

1. Search the space of all class assignments by defining a Markov chain

2. Use an annealed version of exploration procedure to speed up the search

62


• Advantages

• Learns an ensemble model for MLC in a principled way

• Produces accurate and reliable results

• Disadvantages

• Iterative optimization process in learning requires a large amount of time

63

Agenda

✓ Motivation

✓ Solutions

✓ Advanced solutions

• Evaluation metrics

• Toolboxes

• Summary

64

Evaluation metrics

• Evaluation of MLC methods is more difficult than that of single-label classification

• Measuring the Hamming accuracy is not sufficient for the goal of MLC

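• Hamming accuracy (HA) = (1/N) Σn (1/d) Σi 1[yi(n) = ŷi(n)], where ŷ(n) is the predicted class vector for the n-th of N test instances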

• HA measures the individual accuracy on each class variable, which can be optimized by the binary relevance (BR) model

• We want to find “jointly accurate” class assignments

• We want to measure if the model predicts all the labels correctly

• Exact match accuracy (EMA) = (1/N) Σn 1[y(n) = ŷ(n)], where ŷ(n) is the predicted class vector

• EMA evaluates whether the prediction is correct on all class variables

• Most appropriate metric for MLC

• We are looking for the most probable assignment of classes

• It can be too strict

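• Multi-label accuracy (MLA) = (1/N) Σn |y(n) ∩ ŷ(n)| / |y(n) ∪ ŷ(n)|, treating each class vector as the set of labels that are on

• MLA evaluates the Jaccard index between the predicted and true class assignments

• It is less strict than EMA, but it can overestimate the model accuracy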
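The three accuracy measures can be compared on a toy example. The sketch below follows the verbal definitions in this section; the convention that two empty label sets have a Jaccard index of 1 is an assumption, and the labels are made-up values.

```python
def hamming_accuracy(Y_true, Y_pred):
    """Fraction of individual class variables predicted correctly."""
    n, d = len(Y_true), len(Y_true[0])
    correct = sum(t == p for yt, yp in zip(Y_true, Y_pred)
                  for t, p in zip(yt, yp))
    return correct / (n * d)

def exact_match_accuracy(Y_true, Y_pred):
    """Fraction of instances whose whole class vector is predicted correctly."""
    hits = sum(list(yt) == list(yp) for yt, yp in zip(Y_true, Y_pred))
    return hits / len(Y_true)

def multilabel_accuracy(Y_true, Y_pred):
    """Mean Jaccard index between the true and predicted label sets."""
    total = 0.0
    for yt, yp in zip(Y_true, Y_pred):
        inter = sum(1 for t, p in zip(yt, yp) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(yt, yp) if t == 1 or p == 1)
        total += inter / union if union else 1.0  # assumption: empty vs. empty = 1
    return total / len(Y_true)

Y_true = [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
Y_pred = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]

ha = hamming_accuracy(Y_true, Y_pred)
ema = exact_match_accuracy(Y_true, Y_pred)
mla = multilabel_accuracy(Y_true, Y_pred)
print(ha, ema, mla)
```

On this data HA is 7/9, MLA is 2/3, and EMA is 1/3: two of the three predictions are each wrong on a single label, which HA and MLA largely forgive and EMA does not.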

66

Evaluation metrics

• Conditional log-likelihood loss (CLL-loss)

• CLL-loss = Σn −log P(y(n)|x(n))

• Reflects how well the model fits the data

• F1-scores: harmonic mean of precision and recall

• Micro F1

• Computes the F1-score on each instance and then averages the scores

• Macro F1

• Computes the F1-score on each class and then averages the scores
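The two averaging schemes can be sketched as follows, using the definitions stated in this section (per-instance averaging and per-class averaging). Note that some libraries instead define micro-averaged F1 by pooling true positives, false positives, and false negatives over all labels before computing a single score. The function names and the convention F1 = 0 when nothing is true or predicted are assumptions.

```python
def f1(true_bits, pred_bits):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true_bits, pred_bits) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true_bits, pred_bits) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0  # assumption: 0 when undefined

def instance_averaged_f1(Y_true, Y_pred):
    """F1 on each instance (row), then the mean over instances."""
    scores = [f1(yt, yp) for yt, yp in zip(Y_true, Y_pred)]
    return sum(scores) / len(scores)

def class_averaged_f1(Y_true, Y_pred):
    """F1 on each class (column), then the mean over classes."""
    scores = [f1(ct, cp) for ct, cp in zip(zip(*Y_true), zip(*Y_pred))]
    return sum(scores) / len(scores)

Y_true = [[1, 1, 0], [0, 0, 1], [1, 0, 1]]
Y_pred = [[1, 1, 0], [0, 1, 1], [0, 0, 1]]

per_instance = instance_averaged_f1(Y_true, Y_pred)
per_class = class_averaged_f1(Y_true, Y_pred)
print(per_instance, per_class)
```

The two averages happen to coincide on this small example; they generally differ when label frequencies are imbalanced.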

67

Agenda

✓ Motivation

✓ Solutions

✓ Advanced solutions

✓ Evaluation metrics

• Toolboxes

• Summary

68

Toolboxes

• MEKA: a Multi-label Extension to WEKA http://meka.sourceforge.net/

• Mulan: a Java library for Multi-label Learning http://mulan.sourceforge.net/

• LibSVM MLC Extension (BR and LP) http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multilabel/

• LAMDA Lab (Nanjing Univ., China) Code Repository http://lamda.nju.edu.cn/Default.aspx?Page=Data&NS=&AspxAutoDetectCookieSupport=1

• Prof. Min-Ling Zhang (Southeast Univ., China) http://cse.seu.edu.cn/old/people/zhangml/Resources.htm#codes

69

Summary

BR: independent classifiers

• CHF, IBLR: enriched feature space

LP: all possible label combinations

• PPT: pruned label combinations

Others: MLKNN, output coding

Want to achieve something better: CC, CTBN, ECC, MC

References

• [Batal et al, 2013] I. Batal, C. Hong, and M. Hauskrecht. “An efficient probabilistic framework for multi-dimensional classification”. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management (CIKM). 2013, pp. 2417–2422.

• [Bose & Ray-Chaudhuri 1960] R. C. Bose and D. K. Ray-Chaudhuri. “On a class of error correcting binary group codes”. In: Information and Control 3 (1960), pp. 68–79.

• [Boutell et al, 2004] M. R. Boutell et al. “Learning Multi-label Scene Classification”. In: Pattern Recognition 37.9 (2004)

• [Cheng and Hüllermeier, 2009] W. Cheng and E. Hüllermeier. “Combining instance-based learning and logistic regression for multilabel classification”. In: Machine Learning 76.2-3 (2009)

• [Clare and King, 2001] A. Clare and R. D. King. “Knowledge Discovery in Multi-Label Phenotype Data”. In: Lecture Notes in Computer Science. Springer, 2001.

• [Dietterich, 1995] T. G. Dietterich and G. Bakiri. “Solving Multiclass Learning Problems via Error-Correcting Output Codes”. In: Journal of Artificial Intelligence Research 2 (1995), pp. 263–286.

• [Donoho, 2006] D. Donoho. “Compressed sensing”. In: IEEE Transactions on Information Theory 52.4 (2006), pp. 1289–1306.

• [Ghamrawi and McCallum, 2005] N. Ghamrawi and A. McCallum. “Collective multi-label classification”. In: Proceedings of the 14th ACM international conference on Information and knowledge management (CIKM). 2005, pp. 195–200.

• [Godbole et al, 2004] S. Godbole and S. Sarawagi. “Discriminative Methods for Multi-labeled Classification”. In: PAKDD’04. 2004, pp. 22–30.

• [Hong et al, 2014] C. Hong, I. Batal, and M. Hauskrecht. “A mixtures-of-trees framework for multi-label classification”. In: Proceedings of the 23rd ACM International Conference on Information and Knowledge Management (CIKM), 2014. ACM.

• [Hsu et al, 2009] D. Hsu et al. “Multi-Label Prediction via Compressed Sensing”. In: NIPS. 2009, pp. 772–780.

• [Johnson & Wichern, 2002] R. A. Johnson and D. W. Wichern. “Applied Multivariate Statistical Analysis” (5th Ed.). 2002. Upper Saddle River, N.J., Prentice-Hall.

• [Meila and Jordan, 2000] M. Meila and M. I. Jordan. “Learning with mixtures of trees”. Journal of Machine Learning Research, 1:1–48, 2000.

• [Pakdaman et al, 2014] M. Pakdaman, I. Batal, Z. Liu, C. Hong, and M. Hauskrecht. “An optimization-based framework to learn conditional random fields for multi-label classification”. In SDM. SIAM, 2014.


• [Read et al, 2008] J. Read, B. Pfahringer, and G. Holmes. “Multi-label Classification Using Ensembles of Pruned Sets”. In: ICDM. IEEE Computer Society, 2008, pp. 995–1000.

• [Read et al, 2009] J. Read et al. “Classifier Chains for Multi-label Classification”. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II. ECML PKDD ’09. Bled, Slovenia: Springer-Verlag, 2009, pp. 254–269.

• [Tai and Lin, 2010] Farbound Tai and Hsuan-Tien Lin. “Multi-label Classification with Principle Label Space Transformation”. In: Proceedings of the 2nd International Workshop on Multi-Label Learning. 2010.

• [van der Gaag and de Waal, 2006] L. C. van der Gaag and P. R. de Waal. “Multi-dimensional Bayesian Network Classifiers”. In: Probabilistic Graphical Models. 2006, pp. 107–114

• [Zhang and Schneider, 2012a] Y. Zhang and J. Schneider. “A Composite Likelihood View for Multi-Label Classification”. In: AISTATS (2012).

• [Zhang and Schneider, 2012b] Y. Zhang and J. Schneider. “Maximum Margin Output Coding”. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). ICML ’12. Edinburgh, Scotland, UK: Omnipress, 2012, pp. 1575–1582.

• [Zhang and Zhou, 2007] Min-Ling Zhang and Zhi-Hua Zhou. “ML-KNN: A lazy learning approach to multi-label learning”. In: Pattern Recogn. 40.7 (July 2007), pp. 2038–2048.

73

Thanks!

74
