slides mlc 3750 - people.cs.pitt.edumilos/courses/cs3750-fall2014/lectures/class22.pdf · solutions...
TRANSCRIPT
Multi-label Classification
Charmgil Hong cs3750
(Presented on Nov 11, 2014)
Goals of the talk
1.To understand the geometry of different approaches for multi-label classification
2.To appreciate how the Machine Learning techniques further improve the multi-label classification methods
3.To learn how to evaluate the multi-label classification methods
2
Agenda
• Motivation & Problem definition
• Solutions
• Advanced solutions
• Evaluation metrics
• Toolboxes
• Summary
3
Notation
• X ∈ Rm : feature vector variable (input)
• Y ∈ Rd : class vector variable (output)
• x = {x1, ..., xm}: feature vector instance
• y = {y1, ..., yd}: class vector instance
• In a shorthand, P(Y=y|X=x) = P(y|x)
• Dtrain : training dataset; Dtest : test dataset
4
Motivation
• Traditional classification
• Each data instance is associated with a single class variable
5
Motivation
• An issue with traditional classification
• In many real-world applications, each data instance can be associated with multiple class variables
• Examples
• A news article may cover multiple topics, such as politics and economics
• An image may include multiple objects as building, road, and car
• A gene may be associated with several biological functions
6
Problem Definition
• Multi-label classification (MLC)
• Each data instance is associated with multiple binary class variables
• Objective: assign each instance the most probable assignment of the class variables
7
Class 1 ∈ { R, B } Class 2 ∈ { , }
A simple solution
• Idea
• Transform a multi-label classification problem to multiple single-label classification problems
• Learn d independent classifiers for d class variables
8
• Idea
• Transform a multi-label classification problem to multiple single-label classification problems
• Learn d independent classifiers for d class variables
9
Binary Relevance (BR) [Clare and King, 2001; Boutell et al, 2004]
• IllustrationDtrain X1 X2 Y1 Y2 Y3
n=1 0.7 0.4 1 1 0n=2 0.6 0.2 1 1 0n=3 0.1 0.9 0 0 1n=4 0.3 0.1 0 0 0n=5 0.8 0.9 1 0 1
h1 : X → Y1
h2 : X → Y2
h3 : X → Y3
Binary Relevance (BR) [Clare and King, 2001; Boutell et al, 2004]
• Advantages
• Computationally efficient
• Disadvantages
• Does not capture the dependence relations among the class variables
• Not suitable for the objective of MLC
• Does not find the most probable assignment
• Instead, it maximizes the marginal distribution of each class variable
10
Binary Relevance (BR) [Clare and King, 2001; Boutell et al, 2004]
• Marginal vs. Joint: a motivating example
• Question: find the most probable assignment (MAP: maximum a posteriori) of Y = (Y1,Y2)
11
➡ Prediction on the joint (MAP): Y1 = 1, Y2 = 0➡ Prediction on the marginals: Y1 = 0, Y2 = 0
P(Y1,Y2|X=x) Y1 = 0 Y1 = 1 P(Y2|X=x)Y2 = 0 0.2 0.45 0.65Y2 = 1 0.35 0 0.35
P(Y1|X=x) 0.55 0.45
• We want to maximize the joint distribution of Y given observation X = x; i.e.,
Another simple solution
• Idea
• Transform each label combination to a class value
• Learn a multi-class classifier with the new class values
12
• Idea
• Transform each label combination to a class value
• Learn a multi-class classifier with the new class values
13
• IllustrationDtrain X1 X2 Y1 Y2 Y3
n=1 0.7 0.4 0 0 1n=2 0.6 0.2 0 0 1n=3 0.1 0.9 0 1 0n=4 0.3 0.1 0 1 1n=5 0.8 0.9 1 0 1
YLP
11234
hLP : X → YLP
Label Powerset (LP) [Tsoumakas and Vlahavas, 2007]
Label Powerset (LP) [Tsoumakas and Vlahavas, 2007]
• Advantages
• Learns the full joint of the class variables
• Each of the new class values maps to a label combination
• Disadvantages
• The number of choices in the new class can be exponential (|YLP| = O(2d))
• Learning a multi-class classifier on exponential choices is expensive
• The resulting class distribution would be sparse and imbalanced
• Only predicts the label combinations that are seen in the training set
14
BR vs. LP
• BR and LP are two extreme MLC approaches
• BR maximizes the marginals on each class variable;while LP directly models the joint of all class variables
• BR is computationally more efficient; but does not consider the relationship among the class variables
• LP considers the relationship among the class variables by modeling the full joint of the class variables; but can be computationally very expensive
15
BR LPIndependent classifiers
All possible label combo.
Agenda
✓ Motivation
• Solutions
• Advanced solutions
• Evaluation metrics
• Toolboxes
• Summary
16
Solutions
Solutions
• Section agenda
• Solutions rooted on BR
• Solutions rooted on LP
• Other solutions
17
• BR: Binary Relevance [Clare and King, 2001; Boutell et al, 2004]
• Models independent classifiers P(yi|x) on each class variable
• Does not learn the class dependences
• Key extensions from BR
• Learn the class dependence relations by adding new class-dependent features : P(yi|x, {new_features})
Solutions rooted on BR
18
Solutions rooted on BR
• Idea: layered approach
• Layer-1: Learn and predict on Dtrain, using the BR approach
• Layer-2: Learn d classifiers on the original features and the output of layer-1
• Existing methods
• Classification with Heterogeneous Features (CHF) [Godbole et al, 2004]
• Instance-based Logistic Regression (IBLR) [Cheng et al, 2009]
19
20
Classification with Heterogeneous Features (CHF)
• Illustration
Dtrain X1 X2 Y1 Y2 Y3
n=1 0.7 0.4 1 1 0n=2 0.6 0.2 1 1 0n=3 0.1 0.9 0 0 1n=4 0.3 0.1 0 0 0n=5 0.8 0.9 1 0 1
hbr1 : X → Y1
hbr2 : X → Y2
hbr3 : X → Y3
XCHF
X1 X2 hbr1(X) hbr2(X) hbr3(X) Y1 Y2 Y3
n=1 0.7 0.4 .xx .xx .xx 1 1 0n=2 0.6 0.2 .xx .xx .xx 1 1 0n=3 0.1 0.9 .xx .xx .xx 0 0 1n=4 0.3 0.1 .xx .xx .xx 0 0 0n=5 0.8 0.9 .xx .xx .xx 1 0 1
h1 : XCHF → Y1
h2 : XCHF → Y2
h3 : XCHF → Y3
Laye
r-1La
yer-2
21
Instance-based Logistic Regression (IBLR)
• Illustration
Dtrain X1 X2 Y1 Y2 Y3
n=1 0.7 0.4 1 1 0n=2 0.6 0.2 1 1 0n=3 0.1 0.9 0 0 1n=4 0.3 0.1 0 0 0n=5 0.8 0.9 1 0 1
XIBLR
X1 X2 λ1 λ2 λ3 Y1 Y2 Y3
n=1 0.7 0.4 .xx .xx .xx 1 1 0n=2 0.6 0.2 .xx .xx .xx 1 1 0n=3 0.1 0.9 .xx .xx .xx 0 0 1n=4 0.3 0.1 .xx .xx .xx 0 0 0n=5 0.8 0.9 .xx .xx .xx 1 0 1
h1 : XIBLR → Y1
h2 : XIBLR → Y2
h3 : XIBLR → Y3
KNN Scoreλ1 λ2 λ3
.xx .xx .xx
.xx .xx .xx
.xx .xx .xx
.xx .xx .xx
.xx .xx .xx
1 1 00 0 11 0 02/3 1/3 1/3
k=3
y1 y2 y3
Solutions rooted on BR: CHF & IBLR
• Advantages
• Model the class dependences by enriching the feature space using the layer-1 classifiers
• Disadvantages
• Learn the dependence relations in an indirect way
• The predictions are not stable
22
Solutions rooted on LP
• LP: Label Powerset [Tsoumakas and Vlahavas, 2007]
• Models a multi-class classifier on the enumeration of all possible class assignment
• Can create exponentially many classes and computationally very expensive
• Key extensions from LP
• Prune the infrequent class assignments from the consideration to reduce the size of the class assignment space
• Represent the joint distribution more compactly
23
Pruned problem transformation (PPT) [Read et al, 2008]
• Class assignment conversion in PPT
• Prune infrequent class assignment sets
• User specifies the threshold for “infrequency”
24
Dtrain X1 X2 Y1 Y2 Y3
n=1 0.7 0.4 0 0 1n=2 0.6 0.2 0 0 1n=3 0.1 0.9 0 0 0n=4 0.3 0.1 0 1 0n=5 0.1 0.8 0 0 0n=6 0.2 0.1 0 1 0n=7 0.2 0.2 0 1 0n=8 0.2 0.9 0 0 0n=9 0.7 0.3 0 0 1
n=10 0.9 0.9 0 1 1
Dtrain-LP YLP
n=1 1n=2 1n=3 0n=4 2n=5 0n=6 2n=7 2n=8 0n=9 1
n=10 3
|YLP| = 4
Pruned problem transformation (PPT) [Read et al, 2008]
• Class assignment conversion in PPT
• Prune infrequent class assignment sets
• User specifies the threshold for “infrequency”
25
Dtrain X1 X2 Y1 Y2 Y3
n=1 0.7 0.4 0 0 1n=2 0.6 0.2 0 0 1n=3 0.1 0.9 0 0 0n=4 0.3 0.1 0 1 0n=5 0.1 0.8 0 0 0n=6 0.2 0.1 0 1 0n=7 0.2 0.2 0 1 0n=8 0.2 0.9 0 0 0n=9 0.7 0.3 0 0 1n=10 0.9 0.9 0 0 1n=11 0.9 0.9 0 1 0
Dtrain-PPT YPPT
n=1 1n=2 1n=3 0n=4 2n=5 0n=6 2n=7 2n=8 0n=9 1
n=10 1n=11 2
|YPPT| = 3
Solutions rooted on LP: PPT
• Advantages
• Simple add-on to the LP method that focuses on key relationships
• Models the full joint more efficiently
• Disadvantages
• Based on an ad-hoc pruning heuristic
• Mapping to lower dimensional label space is not clear
• (As LP) Only predicts the label combinations that are seen in the training set
26
Other solution: MLKNN [Zhang and Zhou, 2007]
• Multi-label k-Nearest Neighbor (MLKNN) [Zhang and Zhou, 2007]
• Learn a classifier for each class (as BR) by combining k-nearest neighbor with Bayesian inference
• Application is limited as KNN
• Does not produce a model
• Does not work well on high-dimensional data
27
Multi-label output coding
• Key idea
• Motivated by the error-correcting output coding (ECOC) scheme [Dietterich 1995; Bose & Ray-Chaudhuri 1960] in communication
• Solve the MLC problems using lower dimensional codewords
• An output coding MLC method usually consists of three parts:
• Encoding:
• Prediction:
• Decoding:
28
Convert output vectors Y into codewords Z
Perform regression from X to Z; say R
Recover the class assignments Y from R
Multi-label output coding
• Existing methods
• OC (Output Coding) with Compressed Sensing (OCCS) [Hsu et
al, 2009]
• Principle Label Space Transformation (PLST) [Tai and Lin, 2010]
• OC with Canonical Correlation Analysis (CCAOC) [Zhang and
Schneider, 2011]
• Maximum Margin Output Coding (MMOC) [Zhang and Schneider, 2012]
29
Principle Label Space Transformation (PLST) [Tai and Lin, 2010]
• Encoding: Convert output vectors Y into codewords Z, using the singular vector decomposition (SVD)
• Z = VTY = (V1TY, ..., VqTY), where V is a d × q projection vector (d > q)
• Prediction: Perform regression from X to Z; say R
• Decoding: Recover the class labels Y from R using SVD
• Achieved by optimizing a combinatorial loss function30
Z YX(i) Encoding (SVD)(ii) Regression
(iii) Decoding (SVD)CodewordsFeatures Labels
Multi-label output coding
• Existing methods are differentiated from one to another mainly by the encoding/decoding schemes they apply
31
Method Key difference
OCCSUses compressed sensing [Donoho 2006] for encoding and decoding
PLSTUses singular vector decomposition (SVD) [Johnson & Wichern 2002] for encoding and decoding
CCAOCUses canonical correlation analysis (CCA) [Johnson & Wichern 2002] for encoding and mean-field approximation for decoding
MMOC Uses SVD for encoding and maximum margin formulation for decoding
Multi-label output coding
• Advantages
• Show excellent prediction performances
• Disadvantages
• Only able to predict the single best output for a given input
• Cannot estimate probabilities for different input-output pairs
• Not scalable
• Encoding and decoding steps rely on matrix decomposition, whose complexities are sensitive to d and N
• Cannot be generalized to non-binary cases
32
Section summary
33
BR LP
Independent classifiers
All possible label combo.
CHF PPT
IBLR Pruned label combo.
Enriched feature space
Others
Output codingMLKNN
?
Want to achieve
something better
Agenda
✓ Motivation
✓ Solutions
• Advanced solutions
• Evaluation metrics
• Toolboxes
• Summary
34
Advanced solutions
Advanced solutions
• Section agenda
• Extensions using probabilistic graphical models (PGMs)
• Extensions using ensemble techniques
35
Extensions using PGMs
• Probabilistic Graphical Models (PGMs)
• PGM refers to a family of distributions on a set of random variables that are compatible with all the probabilistic independence propositions encoded in a graph
• A smart way to formulate exponentially large probability distribution without paying an exponential cost
• Using PGMs, we can reduce the model complexity
• PGM = Multivariate statistics + Graphical structure
36
Extensions using PGMs
• Representation: Two types
• Undirected graphical models (UGMs)
• Also known as Markov networks (MNs)
• Directed graphical models (DGMs)
• Also known as Bayesian networks (BNs)
37
X1
X2
X1
X2
node: variable
edge:correlation
edge:causal relation
Undirected(MN)
Directed(BN)
Extensions using PGMs
• How PGMs reduce the model complexity?
• Key idea: Exploit the conditional independence (CI) relations among variables!!
• Conditional independence (CI): Random variables A, B are conditionally independent given C, if P(A,B|C) = P(A|C)P(B|C)
• UGM and DGM offer a set of graphical notations for CI
38
A ⊥ B | C
CI representation in UGM
A B
C
Extensions using PGMs
• How PGMs reduce the model complexity?
• Key idea: Exploit the conditional independence (CI) relations among variables!!
• Conditional independence (CI): Random variables A, B are conditionally independent given C, if P(A,B|C) = P(A|C)P(B|C)
• UGM and DGM offer a set of graphical notations for CI
39
A ⊥ B | C A ⊥ B | C
CI representations in DGM
⧸
A B
CA BC
A B
C
Extensions using PGMs
• PGMs have been an excellent representation / formulation tool for the MLC problems
• The dependences among features (X) and class variables (Y) can be represented easily with PGMs
• By exploiting the conditional independence, we can make the computation simpler
40
Extensions using PGMs
• Existing methods
• Undirected models (Markov networks)
• Multi-label Conditional Random Field (ML-CRF) [Ghamrawi and McCallum, 2005;
Pakdaman et al, 2014]
• Composite Marginal Models (CMM) [Zhang and Schneider, 2012]
• Directed models (Bayesian networks)
• Multi-dimensional Bayesian Classifiers (MBC) [van der Gaag and de Waal, 2006]
• Classifier Chains (CC) [Read et al, 2009]
• Conditional Tree-structured Bayesian Networks (CTBN) [Batal et al, 2013]
41
Multi-dimensional Bayesian Networks (MBC) [van der Gaag and de Waal, 2006]
• Key idea
• Model the full joint of input and output using a Bayesian network
• Use graphical structures to represent the dependence relations among the input and output variables
• Example MBC (d = 3, m = 4)
42
Y3Y2Y1
X3X2X1 X4
The joint distribution P(X,Y) is represented by the decomposition
X = X1|X2 · X2|X3 · X3 · X4|X2 and
Y = Y1|Y2 · X2 · Y3|Y2
Multi-dimensional Bayesian Networks (MBC) [van der Gaag and de Waal, 2006]
• Advantages
• The full joint distribution of the feature and class variables can be represented efficiently using the Bayesian network
• Disadvantages
• Models the relations among the feature variables which do not carry much information in modeling the multi-label relations
43
Multi-label Conditional Random Fields (MLCRF) [Pakdaman et al, 2014]
• Key idea
• Model the conditionals P(Y|X) to capture the relations among the class variables conditioned on the feature variables
• Learn a pairwise Markov network to model the relations between the input and output variables
• Representation
44
Y1
Y2
Y3
Yd ...
ψ1,2 ψ2,3
ψd,1(ψi,j and 𝜙i are the potentials of Yi, Yj, X;and Z is the normalization term)
Multi-label Conditional Random Fields (MLCRF) [Pakdaman et al, 2014]
• Advantages
• Directly models the conditional joint distribution P(Y|X)
• Disadvantages
• Learning and prediction is computationally very demanding
• To perform an inference, the normalization term Z should be computed, which is usually very costly
• The iterative parameter learning process requires inference at each step whose computational cost is even more expensive
• In practice, approximate inference techniques are applied to make the model usable
45
Classifier Chains (CC) [Read et al, 2009]
• Key idea
• Model P(Y|X) using a directed chain network, where all preceding classes in the chain are conditioning the following class variables
• Representation
46
all preceding labels
Y1 Y2 Y3 Yd...
X
Classifier Chains (CC) [Read et al, 2009]
• Learning
• No structure learning (random chain order)
• Parameter learning is performed on the decomposed CPDs: argmaxθ P(Yi|X, π(Yi); θ)
• Prediction
• Performed by greedy maximization of each factors (CPDs): argmaxYi P(Yi|X, π(Yi); θ)
47
Y1 Y2 Y3 Yd...
Conditional Tree-structured Bayesian Networks (CTBNs) [Batal et al, 2013]
• Key idea
• Learn P(Y|X) using a tree-structured Bayesian network of the class labels
• Tree-structures can be seen as restricted chains, where each class variable has at most one parent class variable
• Example CTBN
48
X
Y1 Y2 Y3 Y4at most one parent label
This network represents:
Conditional Tree-structured Bayesian Networks (CTBNs) [Batal et al, 2013]
• Learning
• Structure learning by optimizing conditional log-likelihood
1.Define a complete weighted directed graph, whose edge weights is equal to conditional log-likelihood
2.Find the maximum branching tree from the graph(* Maximum branching tree = maximum weighted directed spanning tree)
49
X
Y1 Y2 Y3 Y4
Conditional Tree-structured Bayesian Networks (CTBNs) [Batal et al, 2013]
• Learning
• Structure learning by optimizing conditional log-likelihood
• Parameter learning is performed on the decomposed CPDs
• Prediction
• Exact MAP prediction is performed by a belief propagation (max-product) algorithm
50
X
Y1 Y2 Y3 Y4
CC vs. CTBN
51
CC CTBN
X
Y1 Y2 Y3 Y4
Y1 Y2 Y3 Yd...
at most one parent labelall preceding labels
• Decomposes the joint probability along with the chain structure
• Decomposes the joint probability along with the tree structure
• Performs exact MAP prediction (linear time optimal solution)
• Maximizes the marginals along the chain (suboptimal solution)
• Errors in prediction propagate to the following label prediction
• The tree-structure assumption may restrict its modeling ability
• No structure learning (label ordering is given at random)
• Tree structure is learned using a score-based algorithm
Advanced solutions
• Section agenda
✓ Extensions using probabilistic graphical models (PGMs)
• Extensions using ensemble techniques
52
Extensions using ensemble techniques
• Ensemble techniques
• Techniques of training multiple classifiers and combining their predictions to produce a single classifier
• Ensemble techniques can further improve the performance of MLC classifiers
• Objective: Use a combination of simpler classifiers to improve predictions
53
Extensions using ensemble techniques
• Existing methods
• Ensemble of CCs (ECC) [Read et al, 2009]
• Mixture of CTBNs (MC) [Hong et al, 2014]
54
Ensemble of Classifier Chains (ECC) [Read et al, 2009]
• Recall CC
• Key Idea
• Create user-specified number of CC’s on random subsets of data with random orderings of the class labels
• Predict by majority vote over all base classifiers
55
Y1 Y2 Y3 Yd...
Ensemble of Classifier Chains (ECC) [Read et al, 2009]
• Advantages
• Often times, the performance improves
• Disadvantages
• Ad-hoc ensemble implementation
• Learns base classifiers on random subsets of data with random label ordering
• Ensemble decisions are made by simple averaging over the base models and often inaccurate
56
Mixture of CTBNs (MC) [Hong et al, 2014]
• Motivation
• If the underlying dependency structure in data is more complex than a tree structure, a single CTBN cannot model the data properly
• Key idea
• Use the Mixtures-of-Trees [Meila and Jordan, 2000] framework to learn multiple CTBNs and use them for prediction
57
Mixture of CTBNs (MC) [Hong et al, 2014]
• MC defines the multivariate posterior distribution of class vector P(y|x) = P(y1, …,yd|x) as
• P(y|x,Tk) is the k-th mixture component defined by a CTBN Tk
• λk is the mixture coefficient representing the weight of the k-th component (influence of the k-th CTBN model Tk to the mixture)
58
Mixture of CTBNs (MC) [Hong et al, 2014]
• An example MC
59
Mixture of CTBNs (MC) [Hong et al, 2014]
• Parameter learning
• Objective: Optimize the model parameters (CTBN parameters {θ1, …, θK} and mixture coefficients {λ1, …, λK})
• Idea (apply EM)
1. Associate each instance (x(n), y(n)) with a hidden variable z(n) ∈ {1, …, K} indicating which CTBN it belongs to.
2. Iteratively optimize the expected complete log-likelihood:
60
Mixture of CTBNs (MC) [Hong et al, 2014]
• Structure learning
• Objective: Find multiple CTBN structures from data
• Idea (boosting-like heuristic)
1. On each addition of a new structure to the mixture, recalculate the weight of each data instance (ω) such that it represents the relative “hardness” of the instance
2. Learn the best tree structure by optimizing the weighted conditional log-likelihood:
61
Mixture of CTBNs (MC) [Hong et al, 2014]
• Prediction
• Objective: Find the maximum a posteriori (MAP) prediction for a new instance x
• Idea
1. Search the space of all class assignments by defining a Markov chain
2. Use an annealed version of exploration procedure to speed up the search
62
Mixture of CTBNs (MC) [Hong et al, 2014]
• Advantages
• Learns an ensemble model for MLC in a principled way
• Produces accurate and reliable results
• Disadvantages
• Iterative optimization process in learning requires a large amount of time
63
Agenda
✓ Motivation
✓ Solutions
✓ Advanced solutions
• Evaluation metrics
• Toolboxes
• Summary
64
Evaluation metrics
• Evaluation of MLC methods is more difficult than that of single-label classification
• Measuring the Hamming accuracy is not sufficient for the goal of MLC
• Hamming accuracy (HA) =
• HA measures the individual accuracy on each class variable, which can be optimized by the binary relevance (BR) model
• We want to find “jointly accurate” class assignments
• We want to measure if the model predicts all the labels correctly
• Exact match accuracy (EMA) =
65
Evaluation metrics
• Exact match accuracy (EMA) =
• EMA evaluate if the prediction is correct on all class variables
• Most appropriate metric for MLC
• We are looking for the most probable assignment of classes
• It can be too strict
• Multi-label accuracy (MLA) =
• MLA evaluate the Jaccard index between prediction and true class assignments
• It is less strict than EMA; overestimates the model accuracy
66
Evaluation metrics
• Conditional log-likelihood loss (CLL-loss)
• CLL-loss =
• Reflects the model fitness
• F1-scores: harmonics mean of precision and recall
• Micro F1 =
• Computes the F1-score on each instance and then average out
• Macro F1 =
• Computes the F1-score on each class and then average out
67
Agenda
✓ Motivation
✓ Solutions
✓ Advanced solutions
✓ Evaluation metrics
• Toolboxes
• Summary
68
Toolboxes
• MEKA: a Multi-label Extension to WEKA http://meka.sourceforge.net/
• Mulan: a Java library for Multi-label Learning http://mulan.sourceforge.net/
• LibSVM MLC Extension (BR and LP) http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/multilabel/
• LAMDA Lab (Nanjing Univ., China) Code Repository http://lamda.nju.edu.cn/Default.aspx?Page=Data&NS=&AspxAutoDetectCookieSupport=1
• Prof. Min-Ling Zhang (Southeast Univ., China) http://cse.seu.edu.cn/old/people/zhangml/Resources.htm#codes
69
Summary
70
BR LP
Independent classifiers
All possible label combo.
CHF PPT
IBLR Pruned label combo.
Enriched feature space
Others
Output codingMLKNN
CCWant to achieve
something better
CTBN
MCECC
References
• [Batal et al, 2013] I. Batal, C. Hong, and M. Hauskrecht. “An efficient probabilistic framework for multi-dimensional classification”. In: Proceedings of the 22nd ACM international conference on Conference on information and knowledge management (CIKM). 2013, pp. 2417–2422.
• [Bose & Ray-Chaudhuri 1960] R.C. Bose, D.K. Ray-Chaudhuri. On a class of error correcting binary group codes. In: Inform and Control, 3. 1960, pp. 68–79.
• [Boutell et al, 2004] M. R. Boutell et al. “Learning Multi-label Scene Classification”. In: Pattern Recognition 37.9 (2004)
• [Cheng and Hüllermeier, 2009] W. Cheng and E. Hüllermeier. “Combining instance-based learning and logistic regression for multilabel classification”. In: Machine Learning 76.2-3 (2009)
• [Clare and King, 2001] A. Clare and R. D. King. “Knowledge Discovery in Multi-Label Pheno- type Data”. In: Lecture Notes in Computer Science. Springer, 2001.
• [Dietterich, 1995] T. G. Dietterich and G. Bakiri. "Solving Multiclass Learning Problems via Error-Correcting Output Codes", In: Journal of Artificial Intelligence Research. 1995. Volume 2, pages 263-286.
• [Donoho, 2006] D. Donoho, “Compressed sensing,” IEEE Trans. Inform. Theory, vol. 52, no. 4, pp. 1289– 1306, April 2006.
71
References
• [Ghamrawi and McCallum, 2005] N. Ghamrawi and A. McCallum. “Collective multi-label classification”. In: Proceedings of the 14th ACM international conference on Information and knowledge management (CIKM). 2005, pp. 195–200.
• [Godbole et al, 2004] S. Godbole and S. Sarawagi. “Discriminative Methods for Multi-labeled Classification”. In: PAKDD’04. 2004, pp. 22–30.
• [Hong et al, 2014] C. Hong, I. Batal, and M. Hauskrecht. “A mixtures-of-trees framework for multi-label classification”. In: Proceedings of the 23nd ACM International Conference on Information and Knowledge Management (CIKM), 2014. ACM.
• [Hsu et al, 2009] [Hsu et al, 2009] D. Hsu et al. “Multi-Label Prediction via Compressed Sensing”. In: NIPS. 2009, pp. 772– 780.
• [Johnson & Wichern, 2002] R. A. Johnson and D. W. Wichern. “Applied Multivariate Statistical Analysis” (5th Ed.). 2002. Upper Saddle River, N.J., Prentice-Hall.
• [Meila and Jordan, 2000] M. Meila and M. I. Jordan. “Learning with mixtures of trees”. Journal of Machine Learning Research, 1:1–48, 2000.
• [Pakdaman et al, 2014] M. Pakdaman, I. Batal, Z. Liu, C. Hong, and M. Hauskrecht. “An optimization-based framework to learn conditional random fields for multi-label classification”. In SDM. SIAM, 2014.
72
References
• [Read et al, 2008] J. Read, B. Pfahringer, and G. Holmes. “Multi-label Classification Using Ensembles of Pruned Sets”. In: ICDM. IEEE Computer Society, 2008, pp. 995–1000.
• [Read et al, 2009] J. Read et al. “Classifier Chains for Multi-label Classification”. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases: Part II. ECML PKDD ’09. Bled, Slovenia: Springer-Verlag, 2009, pp. 254–269.
• [Tai and Lin, 2010] Farbound Tai and Hsuan-Tien Lin. “Multi-label Classification with Principle Label Space Transformation”. In: Proceedings of the 2nd International Workshop on Multi-Label Learning. 2010.
• [van der Gaag and de Waal, 2006] L. C. van der Gaag and P. R. de Waal. “Multi-dimensional Bayesian Net- work Classifiers”. In: Probabilistic Graphical Models. 2006, pp. 107–114
• [Zhang and Schneider, 2012a] Y. Zhang and J. Schneider. “A Composite Likelihood View for Multi-Label Classification”. In: AISTATS (2012).
• [Zhang and Schneider, 2012b] Y. Zhang and J. Schneider. “Maximum Margin Output Coding”. In: Proceedings of the 29th International Conference on Machine Learning (ICML-12). ICML ’12. Edinburgh, Scotland, UK: Omnipress, 2012, pp. 1575–1582.
• [Zhang and Zhou, 2007] Min-Ling Zhang and Zhi-Hua Zhou. “ML-KNN: A lazy learning approach to multi-label learning”. In: Pattern Recogn. 40.7 (July 2007), pp. 2038–2048.
73
Thanks!
74