semi-supervised prediction-constrained topic models · 2018. 4. 9. · semi-supervised...

Semi-Supervised Prediction-Constrained Topic Models

Michael C. Hughes1, Gabriel Hope2, Leah Weiner3,Thomas H. McCoy, Jr.4, Roy H. Perlis4, Erik Sudderth2,3, and Finale Doshi-Velez1

1School of Engineering and Applied Sciences, Harvard University2School of Information & Computer Sciences, Univ. of California, Irvine

3Department of Computer Science, Brown University4Massachusetts General Hospital & Harvard Medical School

1 IntroductionDiscrete count data are common: news articles canbe represented as word counts, patient records asdiagnosis counts, and images as visual descriptorcounts. Topic models such as latent Dirichlet al-location (LDA, Blei et al. (2003)) are popular forfinding cooccurance structure in datasets of countvectors, producing a small set of learned topicswhich help users understand the core themes of acorpus too large to comprehend manually.

The low-dimensional feature space produced byLDA can be useful as the input to some predictivetask, where the user seeks to predict labels asso-ciated with each count vector. Supervised topicmodels extend standard topic models to modeldocument topics and labels jointly, leading to bet-ter label predictions and more informative topicrepresentations.

In this work, we correct a key deficiency inprevious formulations of supervised topic mod-els. Our learning objective directly encourageslow-dimensional data representations to produceaccurate predictions by deliberately encoding theasymmetry of prediction tasks: we want to predictlabels from text, not text from labels. Approacheslike supervised LDA (sLDA, McAuliffe and Blei(2008)) that optimize the joint likelihood of labelsand words ignore this crucial asymmetry.

2 BackgroundStandard LDA. The LDA topic model findsstructure in a collection of D documents, or moregenerally,D examples of count vectors. Each doc-ument d is represented by a count vector xd of Vdiscrete words or features: xd ∈ ZV+. The LDAmodel generates these counts via a document-specific mixture of K topics:

πd|α ∼ Dir(πd | α),xd|πd, φ ∼ Mult(xd |

∑Kk=1 πdkφk, Nd). (1)

The random variable πd is a document-topic prob-ability vector, where πdk is the probability of topick in document d and

∑Kk=1 πdk = 1. The vec-

tor φk is a topic-word probability vector, whereφkv gives the probability of word v in topic k and∑V

v=1 φkv = 1. Nd is the (observed) size of doc-ument d: Nd =

∑v xdv. LDA assumes πd and

φk have symmetric Dirichlet priors, with hyperpa-rameters α > 0 and τ > 0.

Topic-based Prediction of Binary Labels.Suppose document d also has a binary label yd ∈{0, 1}. Standard supervised topic models assumelabels and word counts are conditionally indepen-dent given document-topic probabilities πd:

yd|πd, η ∼ Bern(yd | σ(∑K

k=1 πdkηk)), (2)

where σ(z) = (1 + e−z)−1 is the logit function,and η ∈ RK is a vector of real-valued regressionweights with a vague prior ηk ∼ N (0, σ2η).

3 Prediction-Constrained sLDA

We propose a novel, prediction-constrained (PC)objective that finds the best generative model forwords x, while satisfying the constraint that topicsφ must yield accurate predictions about labels ygiven x alone:

minφ,η−[ D∑d=1

log p(xd | φ, α)]− log p(φ, η) (3)

subject to −∑D

d=1 log p(yd | xd, φ, η, α) ≤ �.The scalar � is the highest aggregate loss we arewilling to tolerate, and p(φ, η) = p(φ)p(η) areindependent priors used for regularization. Thestructure of Eq. (3) matches the goals of a domainexpert who wishes to explain as much of the datax as possible, while still making sufficiently accu-rate predictions of y.

Applying the Karush-Kuhn-Tucker conditions,we transform the inequality constrained objective

(a) Movie reviews (b) Yelp reviews

Figure 1: Movie and Yelp tasks: Performance metrics vs. fraction of labeled training documents used for 100 topics. On bothtasks, we artificially include only a small fraction (0.05, 0.10, or 0.20) of available training labels, chosen at random. Fullysupervised methods (e.g. BP-sLDA, MED-sLDA) are only given documents (xd, yd) from this subset, because third-party codedoes not allow using unlabeled data at training. Left: Heldout discriminative performance (AUC, higher is better). Note thatimprovements over supervised learning algorithms, including logistic regression, are particularly large when the fraction oflabeled documents is small. Right: Heldout generative performance (negative likelihood, lower is better).

in Eq. (3) to an equivalent unconstrained optimiza-tion problem:

minφ,η−

D∑d=1

[log p(xd|φ) + λ� log p(yd|xd, φ, η)

]− log p(φ, η). (4)

For any prediction tolerance �, there exists a scalarmultiplier λ� > 0 such that the optimum of Eq. (3)is a minimizer of Eq. (4). The relationship be-tween λ� and � is monotonic, but does not havean analytic form, so we search over the 1-D spaceof penalties λ� for an appropriate value.

Computing this objective directly is not feasi-ble as computing both p(xd|φ) and p(yd|xd, φ, η)requires taking an intractable integral over thelatent variable πd. For learning, we approxi-mate this objective using a point estimate of πd,where πd is fixed to its MAP estimate given xd:argmaxπd p(πd|xd, φ, α). Using this tractable ob-jective, we can fit the model parameters (φ, η) withstochastic gradient descent.

If documents are partially labeled, the objectiveof Eq. (3) can be naturally generalized to only in-clude prediction constraints for observed labels.This allows our approach to be easily applied tosemi-supervised settings.

4 Experimental ResultsWe evaluate our approach on two real-world bag-of-words prediction tasks, comparing with a num-ber of discriminative baselines including: lo-gistic regression, the fully supervised BP-sLDAalgorithm of Chen et al. (2015), the unsuper-vised Gibbs sampler for LDA (Griffiths andSteyvers, 2004) from the Mallet toolbox (McCal-lum, 2002), and the supervised MED-sLDA Gibbs

sampler (Zhu et al., 2013) which is reported to im-prove on an earlier variational method (Zhu et al.,2012). In our experiments we partition documentsinto three sets (training/validation/test). For allmethods shown, we tune any hyperparameters onthe validation set and report results on the test set.Movie task. Each of the 4004/500/501 doc-uments is a movie review by a professionalcritic (Pang and Lee, 2005), with V = 5338 terms.Each review has a binary label, where yd = 1means the critic gave the film more than 2 stars.Yelp task. Each of the 23159/2895/2895 docu-ments (Yelp Dataset Challenge, 2016) aggregatesall text reviews for a single restaurant, using V =10, 000 vocabulary terms. Each document also hasa label indicating the availability of wifi.Discussion. Fig. 1 shows that PC-sLDA is con-sistently competitive with purely discriminativemethods like logistic regression (LR) or BP-sLDAwhen datasets are fully labeled. When the num-ber of available labels is limited, PC-sLDA stillperforms well. For the Movie task in Fig. 1(a),PC-sLDA dominates the AUC metric for smallfractions of labels (0.05, 0.1), beating even LR.In this regime, unsupervised Gibbs-LDA has bet-ter AUC than BP-sLDA and MED-sLDA, demon-strating the value of unlabeled data for prediction.On Yelp, PC-sLDA predictions at small fractionsare better than all but BP-sLDA.

Fig. 1 also shows trends in heldout data negativelog likelihood (lower is better). As expected, un-supervised Gibbs-LDA consistently achieves thebest scores, as explaining data is its sole objec-tive. However, PC-sLDA is able to achieve com-petitive likelihood scores, while achieving betterAUC scores when compared to all other methods.

Acknowledgements

An extended version of this abstract will appearat the 2018 International Conference on ArtificialIntelligence and Statistics

This research supported in part by NSF CA-REER Award No. IIS-1349774, and by the Officeof the Director of National Intelligence (ODNI)via IARPA contract 2016-16041100004. MCH issupported by Oracle Labs. We thank Victor Cas-tro for preparing deidentified EHR data. We thankHarvard FAS Research Computing, Partners Re-search Computing, and Brown Computer Sciencefor computational resources.

ReferencesDavid M. Blei, Andrew Y. Ng, and Michael I. Jordan.

2003. Latent Dirichlet allocation. Journal of Ma-chine Learning Research, 3:993–1022.

Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, XiaodongHe, Jianfeng Gao, Xinying Song, and Li Deng.2015. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture.In Neural Information Processing Systems.

Yelp Dataset Challenge. 2016. Yelp dataset chal-lenge. https://www.yelp.com/dataset_challenge. Accessed: 2016-03.

Tom L Griffiths and Mark Steyvers. 2004. Finding sci-entific topics. Proceedings of the National Academyof Sciences.

Jon D McAuliffe and David M Blei. 2008. Super-vised topic models. In Neural Information Process-ing Systems, pages 121–128.

Andrew Kachites McCallum. 2002. MALLET: Ma-chine learning for language toolkit. mallet.cs.umass.edu.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploit-ing class relationships for sentiment categorizationwith respect to rating scales. In Proc. of the AnnualMeeting of the Association for Computational Lin-guistics.

Jun Zhu, Amr Ahmed, and Eric P Xing. 2012.MedLDA: maximum margin supervised topic mod-els. The Journal of Machine Learning Research,13(1):2237–2278.

Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang.2013. Gibbs max-margin topic models with fastsampling algorithms. In International Conferenceon Machine Learning.
https://www.yelp.com/dataset_challengehttps://www.yelp.com/dataset_challengemallet.cs.umass.edumallet.cs.umass.edu

semi-supervised prediction-constrained topic models · 2018. 4. 9. · semi-supervised...

Documents