A Semi-supervised Approach to Multi-dimensional Classification with Application to Sentiment Analysis
Jonathan Ortigosa-Hernández (1), Juan Diego Rodríguez (1), Leandro Alzate (2), Iñaki Inza (1), and José A. Lozano (1)
(1) Intelligent Systems Group, Computer Science and Artificial Intelligence Department, University of the Basque Country
(2) Socialware Company, Bilbao, Spain
CEDI 2010 Valencia – September 9th, 2010
Index
1 Introduction
2 Multi-dimensional J/K Dependence Bayesian Classifier
3 Multi-dimensional Semi-supervised Learning
4 Experimentation in Sentiment Analysis
   Sentiment Analysis Problem
   ASOMO Dataset
   Results
5 Conclusions and Future Work
6 References
Introduction
Supervised Classification
It consists of building a classifier Ψ from a given labelled training dataset D by using an induction algorithm A, so that A(D) = Ψ,

X_1       X_2       ...   X_n       C
x_1^(1)   x_2^(1)   ...   x_n^(1)   c^(1)
x_1^(2)   x_2^(2)   ...   x_n^(2)   c^(2)
...       ...       ...   ...       ...
x_1^(N)   x_2^(N)   ...   x_n^(N)   c^(N)

in order to predict the value of a class variable C for any new unlabelled instance x, that is, Ψ(x) = c.
Uni-dimensional and Multi-dimensional Classification
Uni-dimensional classification tries to predict a single class variable based on a dataset composed of a set of labelled examples.
(Uni-dimensional Class) Bayesian Network Classifiers (Larrañaga et al., 2005).
Multi-dimensional classification is the generalisation of the single-class classification task to the simultaneous prediction of a set of class variables.
Multi-dimensional Class Bayesian Network Classifiers (van der Gaag and de Waal, 2006).
Multi-dimensional Supervised Learning
A typical supervised training dataset

X_1       X_2       ...   X_n       C_1       C_2       ...   C_m
x_1^(1)   x_2^(1)   ...   x_n^(1)   c_1^(1)   c_2^(1)   ...   c_m^(1)
x_1^(2)   x_2^(2)   ...   x_n^(2)   c_1^(2)   c_2^(2)   ...   c_m^(2)
...       ...       ...   ...       ...       ...       ...   ...
x_1^(N)   x_2^(N)   ...   x_n^(N)   c_1^(N)   c_2^(N)   ...   c_m^(N)

Each instance of the dataset contains both the values of the attributes and m labels which characterise the attributes.
Bayesian Network Classifiers
[Figure: a Bayesian network classifier with a single class variable C and features X1 to X6.]
Multi-dimensional Class Bayesian Network Classifiers (MDBNC)
[Figure: a multi-dimensional class Bayesian network classifier with class variables C1 to C3 and features X1 to X6.]
MDBNC Structure
[Figure: structure of an MDBNC over class variables C1 to C3 and features X1 to X5, in four panels: (a) Complete graph, (b) Feature selection subgraph, (c) Class subgraph, (d) Feature subgraph.]
Sub-families of MDBNC
[Figure: example structures of three MDBNC sub-families: (a) Multi-dimensional naive Bayes, (b) Multi-dimensional tree-augmented network, (c) Multi-dimensional J/K dependence Bayesian (2/3).]
Sub-families of MDBNC
1 Multi-dimensional naive Bayes classifier (MDnB)
It has a fixed structure. Thus, it requires no structural learning (van der Gaag and de Waal, 2006).
2 Multi-dimensional tree-augmented network classifier (MDTAN)
It uses the structural learning proposed in (van der Gaag and de Waal, 2006).
3 Multi-dimensional J/K dependence Bayesian classifier (MD J/K)
Its structural learning is proposed in this paper.
Multi-dimensional J/K Dependence Bayesian Classifier
Multi-dimensional J/K Dependence Bayesian Classifier (MD J/K) I
A supervised method to learn a J/K structure
(1) Learn the structure between the class variables (A_C):
1 Calculate the p-value (significance of the mutual information MI(Ci, Cj)) using the independence test for each pair of class variables, and rank the pairs.
2 Remove the pairs whose p-value is greater than the threshold α = 0.1.
3 Use the ranking to add arcs between the class variables, subject to two conditions: no cycles between the class variables and no more than J parents per class.
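
As a rough illustration of step 1, the Python sketch below computes the empirical mutual information between two discrete variables and the corresponding G statistic, 2·N·MI (with MI in nats). Under independence this statistic is asymptotically chi-square distributed with (r-1)(c-1) degrees of freedom, which is where a p-value would come from. The slides do not name the independence test, so using the G-test here is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), nxy in joint.items():
        # p(x, y) * log(p(x, y) / (p(x) * p(y))), written in terms of counts
        mi += (nxy / n) * math.log(nxy * n / (px[x] * py[y]))
    return mi


def g_statistic(xs, ys):
    """Return (G, degrees of freedom) for the G-test of independence.

    Assumption: the independence test is the likelihood-ratio (G) test,
    whose statistic equals 2 * N * MI.
    """
    n = len(xs)
    dof = (len(set(xs)) - 1) * (len(set(ys)) - 1)
    return 2.0 * n * mutual_information(xs, ys), dof


# Toy labels for two class variables (made-up values, not ASOMO data).
subjectivity = ["yes", "yes", "no", "no"]
sentiment = ["positive", "positive", "neutral", "neutral"]
g, dof = g_statistic(subjectivity, sentiment)
```

The p-value itself is the upper tail of the chi-square distribution at G with `dof` degrees of freedom; it is left out here to keep the sketch free of external dependencies.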
Multi-dimensional J/K Dependence Bayesian Classifier (MD J/K) II
A supervised method to learn a J/K structure
(2) Learn the structure between the class variables and the features (A_CF):
1 Calculate the p-value (significance of the mutual information MI(Ci, Xj)) using the independence test for each pair Ci and Xj, and rank the pairs.
2 Remove the pairs whose p-value is greater than the threshold α = 0.1.
3 Use the ranking to add arcs from the class variables to the features.
Multi-dimensional J/K Dependence Bayesian Classifier (MD J/K) III
A supervised method to learn a J/K structure
(3) Learn the structure between the features (A_F):
1 Calculate the p-value (significance of the conditional mutual information MI(Xi, Xj | Pac(Xj))) using the conditional independence test for each pair Xi and Xj given Pac(Xj), and rank the pairs.
2 Remove the pairs whose p-value is greater than the threshold α = 0.1.
3 Use the ranking to add arcs between the features, subject to two conditions: no cycles between the features and no more than K parents per feature.
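
The filter, rank and add-arcs steps used for the feature subgraph (and, with J in place of K, for the class subgraph) might be sketched as follows. The sketch takes precomputed p-values as input. The slides do not say how an arc is oriented, so treating each key (u, v) as the candidate arc u -> v is an assumption, and all function names are hypothetical.

```python
def creates_cycle(parents, child, parent):
    """Return True if adding the arc parent -> child would close a directed cycle."""
    # A cycle appears exactly when child is already an ancestor of parent,
    # so walk upwards from parent through the existing parent sets.
    stack, seen = [parent], set()
    while stack:
        node = stack.pop()
        if node == child:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False


def add_ranked_arcs(p_values, max_parents, alpha=0.1):
    """Greedily add arcs in order of increasing p-value.

    p_values maps a candidate arc (u, v), read as u -> v (an assumption),
    to its p-value. Pairs with p-value above alpha are discarded.
    Returns a dict mapping each node to the list of its parents.
    """
    parents = {}
    ranked = sorted((p, arc) for arc, p in p_values.items() if p <= alpha)
    for _, (u, v) in ranked:
        if len(parents.get(v, ())) >= max_parents:
            continue  # v already has the maximum number of parents
        if creates_cycle(parents, v, u):
            continue  # the arc would break acyclicity
        parents.setdefault(v, []).append(u)
    return parents


# Made-up p-values for three class variables, with J = 1.
class_p_values = {
    ("C1", "C2"): 0.01,
    ("C2", "C3"): 0.03,
    ("C3", "C1"): 0.05,  # rejected: would close the cycle C1 -> C2 -> C3 -> C1
    ("C1", "C3"): 0.50,  # rejected: p-value above alpha
}
class_subgraph = add_ranked_arcs(class_p_values, max_parents=1)
```

Calling the same function with `max_parents=K` on feature pairs gives the corresponding step for the feature subgraph.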
Multi-dimensional Semi-supervised Learning
Major Problem of Supervised Learning
The labelling of the training data is usually done by an external mechanism (usually human beings).
However, in many real-world problems, obtaining data is relatively easy, while labelling is difficult, expensive or labour intensive.
This problem is accentuated when using multiple target variables.
DESIRE: Learning algorithms able to incorporate a large number of unlabelled data with a small number of labelled data when learning classifiers.
Multi-dimensional Semi-supervised Learning
A typical semi-supervised training dataset

X_1         X_2         ...   X_n         C_1       C_2       ...   C_m
x_1^(1)     x_2^(1)     ...   x_n^(1)     c_1^(1)   c_2^(1)   ...   c_m^(1)
x_1^(2)     x_2^(2)     ...   x_n^(2)     c_1^(2)   c_2^(2)   ...   c_m^(2)
...         ...         ...   ...         ...       ...       ...   ...
x_1^(L)     x_2^(L)     ...   x_n^(L)     c_1^(L)   c_2^(L)   ...   c_m^(L)
x_1^(L+1)   x_2^(L+1)   ...   x_n^(L+1)   ?         ?         ...   ?
x_1^(L+2)   x_2^(L+2)   ...   x_n^(L+2)   ?         ?         ...   ?
...         ...         ...   ...         ...       ...       ...   ...
x_1^(N)     x_2^(N)     ...   x_n^(N)     ?         ?         ...   ?

Semi-supervised Learning fulfills this desire.
The Expectation-Maximization Algorithm
The EM algorithm (Dempster et al., 1977)
Learn an initial model.
Repeat until convergence:
(a) Expectation step: Using the current model, estimate the missing values of the data.
(b) Maximisation step: Using the whole data and the previous estimations, learn a new current model.
Any MDBNC learning algorithm can be used as the model in this algorithm.
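
The loop above can be sketched in Python as follows. This is a simplified hard-assignment variant: the expectation step fills each missing label vector with the current model's single best guess rather than keeping a probability distribution over labels. The `learn_lookup` learner is a deliberately trivial, hypothetical stand-in for an MDBNC learning algorithm, and the data are made up.

```python
from collections import Counter, defaultdict


def learn_lookup(instances, labels):
    """Toy induction algorithm (hypothetical stand-in for an MDBNC learner).

    Predicts the label vector most often seen with the first feature value,
    falling back to the overall most frequent label vector.
    """
    votes = defaultdict(Counter)
    overall = Counter()
    for x, y in zip(instances, labels):
        votes[x[0]][y] += 1
        overall[y] += 1
    default = overall.most_common(1)[0][0]

    def predict(x):
        seen = votes.get(x[0])
        return seen.most_common(1)[0][0] if seen else default

    return predict


def em_learn(learn, labelled_x, labelled_y, unlabelled_x, max_iter=20):
    """Hard-assignment EM loop around an arbitrary induction algorithm."""
    model = learn(labelled_x, labelled_y)  # initial model: labelled data only
    previous = None
    for _ in range(max_iter):
        # Expectation step: estimate the missing labels with the current model.
        guessed = [model(x) for x in unlabelled_x]
        if guessed == previous:
            break  # estimates no longer change: treat as converged
        # Maximisation step: relearn from the whole data plus the estimates.
        model = learn(labelled_x + unlabelled_x, labelled_y + guessed)
        previous = guessed
    return model


# Each label is a vector of class values, here (subjectivity, sentiment).
labelled_x = [("a",), ("a",), ("b",)]
labelled_y = [("yes", "positive"), ("yes", "positive"), ("no", "neutral")]
unlabelled_x = [("a",), ("b",), ("c",)]
model = em_learn(learn_lookup, labelled_x, labelled_y, unlabelled_x)
```

Passing the learner as a function mirrors the remark that any MDBNC learning algorithm can be plugged into the loop.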
Experimentation in Sentiment Analysis
Sentiment Analysis Problem
Definitions (Liu, 2010)
Sentiment Analysis (also known as Opinion Mining) is the computational study of opinions, sentiments and emotions expressed in text.
When treating Sentiment Analysis as a classification problem, two different problems appear:
1 Subjectivity Classification. Its aim is to classify a text as subjective or objective.
2 Sentiment Classification. It classifies an opinionated text as expressing a positive, neutral, or negative opinion.
Motivation for Using Semi-supervised Learning of Multi-dimensional Classifiers
1 Up to now, these two subproblems have been studied in isolation despite being closely related.
So, it would probably be helpful to use multi-dimensional classifiers.
2 Obtaining enough labelled examples for a classifier may be costly and time consuming.
This motivates us to deal with unlabelled examples in a semi-supervised framework.
ASOMO Dataset
Properties of the Dataset
Collected by Socialware Company, from the ASOMO service of mobilised opinion analysis.
It consists of 2,542 Spanish reviews extracted from a blog:
150 documents have been labelled in isolation by an expert.
2,392 posts are left unlabelled.
Properties of Each Document
Each document is represented as:
14 features
Obtained by using an open source morphological analyser (Carreras et al., 2006).
Each feature provides different information related to part-of-speech (POS).
E.g. First Persons, Agreement Expressions, Imperatives, Prediction Verbs (future), Questions, Positive Adjectives, etc.
Represented as a real number between 0 and 1.
3 class variables
Will to Influence: {declarative sentence, soft WI, medium WI, strong WI}
Sentiment: {very negative, negative, neutral, positive, very positive}
Subjectivity: {Yes (subjective), No (objective)}
Results
Experiment
The ASOMO dataset has been used to learn:
3 (uni-dimensional) Bayesian network classifiers: nB, TAN and 2DB.
5 MDBNC: MDnB, MDTAN, MD 2/2, MD 2/3 and MD 2/4.
In both supervised and semi-supervised (EM algorithm) learning frameworks.
Features from the ASOMO dataset are discretised into 3 values using equal frequency.
Results are averaged over 5 × 5-fold cross-validation.
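
Equal-frequency discretisation into 3 values could look like the sketch below. The slides do not specify how cut points are chosen or how ties are handled, so the convention used here (cut points taken from the sorted sample, values equal to a cut point going to the upper interval) is an assumption, as are the function names.

```python
from bisect import bisect_right


def equal_frequency_cuts(values, bins=3):
    """Return bins - 1 cut points that split the sorted values into equal-sized groups."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[n * k // bins] for k in range(1, bins)]


def discretise(values, cuts):
    """Map each value to the index of its interval (0 .. len(cuts))."""
    return [bisect_right(cuts, v) for v in values]


# Made-up feature values in [0, 1], as the ASOMO features are described.
feature = [0.5, 0.1, 0.4, 0.2, 0.6, 0.3]
cuts = equal_frequency_cuts(feature)
codes = discretise(feature, cuts)
```

In a cross-validation setting the cut points would normally be computed on the training folds only and then applied to the test fold.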
JOINT Accuracy
This measure evaluates the predictions for all class variables simultaneously: it counts a success only if all the classes are correctly predicted; otherwise it counts an error.
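
A minimal Python sketch of this measure, alongside the per-class accuracy it is contrasted with, is given below. The function names and the example labels are illustrative only.

```python
def joint_accuracy(true_labels, predicted_labels):
    """Fraction of instances whose whole label vector is predicted correctly."""
    hits = sum(
        1 for t, p in zip(true_labels, predicted_labels) if tuple(t) == tuple(p)
    )
    return hits / len(true_labels)


def single_class_accuracy(true_labels, predicted_labels, index):
    """Fraction of instances for which the class variable at `index` is correct."""
    hits = sum(
        1 for t, p in zip(true_labels, predicted_labels) if t[index] == p[index]
    )
    return hits / len(true_labels)


# Label vectors ordered as (will to influence, sentiment, subjectivity).
truth = [("soft WI", "positive", "yes"), ("strong WI", "negative", "yes")]
guess = [("soft WI", "positive", "yes"), ("strong WI", "neutral", "yes")]
```

In this example the second prediction gets two of three class variables right but still counts as an error for the joint measure.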
[Figure: JOINT accuracy of nB, TAN, 2DB, MDnB, MDTAN, MD 2/2, MD 2/3 and MD 2/4; vertical axis ticks from 5 to 25.]
Will to Influence
[Figure: Will to Influence accuracy of nB, TAN, 2DB, MDnB, MDTAN, MD 2/2, MD 2/3 and MD 2/4; vertical axis ticks from 30 to 70.]
Sentiment Polarity
[Figure: Sentiment Polarity accuracy of nB, TAN, 2DB, MDnB, MDTAN, MD 2/2, MD 2/3 and MD 2/4; vertical axis ticks from 20 to 40.]
Subjectivity
[Figure: Subjectivity accuracy of nB, TAN, 2DB, MDnB, MDTAN, MD 2/2, MD 2/3 and MD 2/4; vertical axis ticks from 50 to 90.]
Conclusions and Future Work
Conclusions
Multi-dimensional classification and semi-supervised learning are two different branches of machine learning.
In this research, we have established a bridge between them, showing that these techniques are competitive with the state-of-the-art algorithms.
“As discussed in the literature, currently there is no coherent strategy for handling unlabelled data, so some creativity must be exercised.” (Cohen)
Future work
What kinds of problem could benefit from this?
Web-related problems. E.g. Affect Analysis.
Scalability of MDBNC (high dimensionality of multi-label problems).
The literature has shown that structure learning algorithms that maximise the likelihood are not likely to find structures yielding good classifiers in a semi-supervised manner.
The ASOMO preliminary results show that using the EM with a probabilistic model cannot ensure learning a correct model.
Thus, we need to perform structure search, attempting to maximise classification accuracy directly.
Questions
THANK YOU
References
Liu, B. (2010). Sentiment Analysis and Subjectivity. In: Indurkhya, N. and Damerau, F.J. Handbook of Natural Language Processing, Chapman & Hall, 2nd Ed.
Rodriguez, J.D. and Lozano, J.A. (2008). Multi-objective learning of multi-dimensional Bayesian classifiers. In Proceedings of the Eighth International Conference on Hybrid Intelligent Systems, HIS 2008, Barcelona, Spain, pp. 501-506.
van der Gaag, L. and de Waal, P. (2006). Multi-dimensional Bayesian Classifiers. In Proceedings of the Third European Workshop in Probabilistic Graphical Models, pp. 107–114.
Larrañaga, P., Lozano, J.A., Peña, J.M. and Inza, I. (2005). Special Issue on Probabilistic Graphical Models for Classification. Machine Learning, 59(3).
Cozman, F. and Cohen, I. (2006). Risk of Semi-Supervised Learning. In: Chapelle, O., Scholkopf, B. and Zien, A. Semi-Supervised Learning. The MIT Press, pp. 57-72.
Friedman, N. (1998). The Bayesian Structural EM algorithm. In Proc. 14th Conf. on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, CA, pp. 129–138.
Dempster, A., Laird, N. and Rubin, D. (1977). Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, Series B, 39(1): 1–38.
Carreras, X., Chao, I., Padro, L. and Padro, M. (2006). An Open-Source Suite of Language Analyzers. In Proceedings of the 4th Int. Conference on Language Resources and Evaluation, Vol. 10, pp. 239–342.