Semi-Supervised Prediction-Constrained Topic Models

Michael C. Hughes1, Gabriel Hope2, Leah Weiner3, Thomas H. McCoy, Jr.4, Roy H. Perlis4, Erik Sudderth2,3, and Finale Doshi-Velez1

1School of Engineering and Applied Sciences, Harvard University
2School of Information & Computer Sciences, Univ. of California, Irvine
3Department of Computer Science, Brown University
4Massachusetts General Hospital & Harvard Medical School

1 Introduction

Discrete count data are common: news articles can be represented as word counts, patient records as diagnosis counts, and images as visual descriptor counts. Topic models such as latent Dirichlet allocation (LDA, Blei et al. (2003)) are popular for finding co-occurrence structure in datasets of count vectors, producing a small set of learned topics which help users understand the core themes of a corpus too large to comprehend manually.

The low-dimensional feature space produced by LDA can be useful as the input to some predictive task, where the user seeks to predict labels associated with each count vector. Supervised topic models extend standard topic models to model document topics and labels jointly, leading to better label predictions and more informative topic representations.

In this work, we correct a key deficiency in previous formulations of supervised topic models. Our learning objective directly encourages low-dimensional data representations to produce accurate predictions by deliberately encoding the asymmetry of prediction tasks: we want to predict labels from text, not text from labels. Approaches like supervised LDA (sLDA, McAuliffe and Blei (2008)) that optimize the joint likelihood of labels and words ignore this crucial asymmetry.

2 Background

Standard LDA. The LDA topic model finds structure in a collection of D documents, or more generally, D examples of count vectors. Each document d is represented by a count vector xd of V discrete words or features: xd ∈ Z_+^V. The LDA model generates these counts via a document-specific mixture of K topics:

πd | α ∼ Dir(πd | α),    xd | πd, φ ∼ Mult(xd | ∑_{k=1}^K πdk φk, Nd).    (1)

The random variable πd is a document-topic probability vector, where πdk is the probability of topic k in document d and ∑_{k=1}^K πdk = 1. The vector φk is a topic-word probability vector, where φkv gives the probability of word v in topic k and ∑_{v=1}^V φkv = 1. Nd is the (observed) size of document d: Nd = ∑_v xdv. LDA assumes πd and φk have symmetric Dirichlet priors, with hyperparameters α > 0 and τ > 0.
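The generative process of Eq. (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only: the function and variable names, and the hyperparameter values in the demo, are ours, not the paper's.

```python
import numpy as np

def sample_lda_document(phi, alpha, N_d, rng):
    """Sample one document's count vector per Eq. (1).

    phi   : (K, V) topic-word probabilities; each row sums to 1.
    alpha : symmetric Dirichlet hyperparameter for pi_d.
    N_d   : observed document length (total word count).
    """
    K, V = phi.shape
    # Draw document-topic probabilities pi_d ~ Dir(alpha).
    pi_d = rng.dirichlet(alpha * np.ones(K))
    # The topic mixture gives per-word probabilities; counts are multinomial.
    word_probs = pi_d @ phi           # (V,) vector, sums to 1
    x_d = rng.multinomial(N_d, word_probs)
    return pi_d, x_d

# Demo with K=5 topics over a V=20 word vocabulary (values illustrative).
rng = np.random.default_rng(0)
phi = rng.dirichlet(0.1 * np.ones(20), size=5)
pi_d, x_d = sample_lda_document(phi, alpha=0.5, N_d=100, rng=rng)
```

Note that, as in Eq. (1), the individual word-topic assignments are marginalized out: only the mixture πd·φ matters for the counts.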

Topic-based Prediction of Binary Labels. Suppose document d also has a binary label yd ∈ {0, 1}. Standard supervised topic models assume labels and word counts are conditionally independent given document-topic probabilities πd:

yd | πd, η ∼ Bern(yd | σ(∑_{k=1}^K πdk ηk)),    (2)

where σ(z) = (1 + e^{−z})^{−1} is the logistic sigmoid function, and η ∈ R^K is a vector of real-valued regression weights with a vague prior ηk ∼ N(0, σ_η²).
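Eq. (2) is simply logistic regression on the document-topic vector; a minimal numerical sketch (all values here are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def label_probability(pi_d, eta):
    """p(y_d = 1 | pi_d, eta) from Eq. (2)."""
    return sigmoid(pi_d @ eta)

pi_d = np.array([0.7, 0.2, 0.1])   # document-topic probabilities (sum to 1)
eta = np.array([2.0, -1.0, 0.0])   # per-topic regression weights
p = label_probability(pi_d, eta)   # = sigmoid(0.7*2.0 - 0.2*1.0)
```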

    3 Prediction-Constrained sLDA

We propose a novel, prediction-constrained (PC) objective that finds the best generative model for words x, while satisfying the constraint that topics φ must yield accurate predictions about labels y given x alone:

min_{φ,η}  −[∑_{d=1}^D log p(xd | φ, α)] − log p(φ, η)    (3)

subject to  −∑_{d=1}^D log p(yd | xd, φ, η, α) ≤ ε.

The scalar ε is the highest aggregate loss we are willing to tolerate, and p(φ, η) = p(φ)p(η) are independent priors used for regularization. The structure of Eq. (3) matches the goals of a domain expert who wishes to explain as much of the data x as possible, while still making sufficiently accurate predictions of y.

Applying the Karush-Kuhn-Tucker conditions, we transform the inequality-constrained objective

Figure 1: (a) Movie reviews, (b) Yelp reviews. Performance metrics vs. fraction of labeled training documents used, for 100 topics. On both tasks, we artificially include only a small fraction (0.05, 0.10, or 0.20) of available training labels, chosen at random. Fully supervised methods (e.g. BP-sLDA, MED-sLDA) are only given documents (xd, yd) from this subset, because third-party code does not allow using unlabeled data at training. Left: Heldout discriminative performance (AUC, higher is better). Note that improvements over supervised learning algorithms, including logistic regression, are particularly large when the fraction of labeled documents is small. Right: Heldout generative performance (negative likelihood, lower is better).

in Eq. (3) to an equivalent unconstrained optimization problem:

min_{φ,η}  −∑_{d=1}^D [log p(xd | φ) + λ_ε log p(yd | xd, φ, η)] − log p(φ, η).    (4)

For any prediction tolerance ε, there exists a scalar multiplier λ_ε > 0 such that the optimum of Eq. (3) is a minimizer of Eq. (4). The relationship between λ_ε and ε is monotonic, but does not have an analytic form, so we search over the 1-D space of penalties λ_ε for an appropriate value.

Computing this objective directly is not feasible, as computing both p(xd | φ) and p(yd | xd, φ, η) requires taking an intractable integral over the latent variable πd. For learning, we approximate this objective using a point estimate of πd, where πd is fixed to its MAP estimate given xd: argmax_{πd} p(πd | xd, φ, α). Using this tractable objective, we can fit the model parameters (φ, η) with stochastic gradient descent.
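The per-document piece of this tractable objective can be sketched as follows. The abstract does not specify the inner optimizer for the MAP estimate of πd, so this sketch uses simple gradient ascent on a softmax parameterization; all function names, step sizes, and demo values are our assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_pi(x_d, phi, alpha, n_iters=200, lr=1.0):
    """Crude MAP estimate of pi_d: gradient ascent on
    log p(x_d | pi, phi) + log Dir(pi | alpha), with pi = softmax(r)."""
    K, V = phi.shape
    r = np.zeros(K)                       # unconstrained logits for pi_d
    scale = max(1.0, float(x_d.sum()))    # keep step sizes O(1) in logit space
    for _ in range(n_iters):
        pi = np.exp(r - r.max()); pi /= pi.sum()
        word_probs = pi @ phi             # (V,)
        # Gradient wrt pi: likelihood term plus Dirichlet prior term.
        grad_pi = phi @ (x_d / word_probs) \
            + (alpha - 1.0) / np.maximum(pi, 1e-10)
        # Chain rule through the softmax.
        grad_r = pi * (grad_pi - pi @ grad_pi)
        r += lr * grad_r / scale
    pi = np.exp(r - r.max())
    return pi / pi.sum()

def pc_loss(x_d, y_d, phi, eta, alpha, lam):
    """Per-document term of Eq. (4), with pi_d fixed to its MAP estimate."""
    pi = map_pi(x_d, phi, alpha)
    log_px = x_d @ np.log(pi @ phi)       # log p(x_d | phi), up to a constant
    p1 = np.clip(sigmoid(pi @ eta), 1e-12, 1.0 - 1e-12)
    log_py = y_d * np.log(p1) + (1 - y_d) * np.log(1 - p1)
    return -(log_px + lam * log_py)

# Tiny demo with random parameters (illustrative only).
rng = np.random.default_rng(0)
K, V = 4, 30
phi = rng.dirichlet(0.5 * np.ones(V), size=K)
x_d = rng.multinomial(50, rng.dirichlet(0.5 * np.ones(V)))
eta = np.array([1.0, -1.0, 0.5, 0.0])
loss = pc_loss(x_d, y_d=1, phi=phi, eta=eta, alpha=2.0, lam=10.0)
```

Summing `pc_loss` over minibatches of documents (plus the −log p(φ, η) regularizer) gives the quantity minimized by stochastic gradient descent.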

If documents are partially labeled, the objective of Eq. (3) can be naturally generalized to only include prediction constraints for observed labels. This allows our approach to be easily applied to semi-supervised settings.
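Concretely, this generalization amounts to masking the label term of Eq. (4) for documents without labels. A toy numeric sketch, with made-up per-document losses:

```python
import numpy as np

# Hypothetical per-document negative log-likelihoods (values made up);
# the label losses are only meaningful where a label was observed.
gen_loss   = np.array([3.2, 2.8, 4.1, 3.6])        # -log p(x_d | phi)
label_loss = np.array([0.4, 0.0, 0.9, 0.2])        # -log p(y_d | x_d, ...)
labeled    = np.array([True, False, True, False])  # which docs have y_d

lam = 10.0
# Eq. (4) summed over the corpus, with label terms masked out
# for the unlabeled documents:
total = gen_loss.sum() + lam * label_loss[labeled].sum()
```

Unlabeled documents still contribute their generative term, which is why unlabeled data can improve the learned topics.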

4 Experimental Results

We evaluate our approach on two real-world bag-of-words prediction tasks, comparing with a number of baselines including: logistic regression, the fully supervised BP-sLDA algorithm of Chen et al. (2015), the unsupervised Gibbs sampler for LDA (Griffiths and Steyvers, 2004) from the Mallet toolbox (McCallum, 2002), and the supervised MED-sLDA Gibbs sampler (Zhu et al., 2013), which is reported to improve on an earlier variational method (Zhu et al., 2012). In our experiments we partition documents into three sets (training/validation/test). For all methods shown, we tune any hyperparameters on the validation set and report results on the test set.

Movie task. Each of the 4004/500/501 documents is a movie review by a professional critic (Pang and Lee, 2005), with V = 5338 terms. Each review has a binary label, where yd = 1 means the critic gave the film more than 2 stars.

Yelp task. Each of the 23159/2895/2895 documents (Yelp Dataset Challenge, 2016) aggregates all text reviews for a single restaurant, using V = 10,000 vocabulary terms. Each document also has a label indicating the availability of wifi.

Discussion. Fig. 1 shows that PC-sLDA is consistently competitive with purely discriminative methods like logistic regression (LR) or BP-sLDA when datasets are fully labeled. When the number of available labels is limited, PC-sLDA still performs well. For the Movie task in Fig. 1(a), PC-sLDA dominates the AUC metric for small fractions of labels (0.05, 0.1), beating even LR. In this regime, unsupervised Gibbs-LDA has better AUC than BP-sLDA and MED-sLDA, demonstrating the value of unlabeled data for prediction. On Yelp, PC-sLDA predictions at small fractions are better than all but BP-sLDA.

Fig. 1 also shows trends in heldout negative log likelihood (lower is better). As expected, unsupervised Gibbs-LDA consistently achieves the best scores, as explaining data is its sole objective. However, PC-sLDA achieves competitive likelihood scores while attaining better AUC scores than all other methods.

Acknowledgements

An extended version of this abstract will appear at the 2018 International Conference on Artificial Intelligence and Statistics.

This research was supported in part by NSF CAREER Award No. IIS-1349774, and by the Office of the Director of National Intelligence (ODNI) via IARPA contract 2016-16041100004. MCH is supported by Oracle Labs. We thank Victor Castro for preparing deidentified EHR data. We thank Harvard FAS Research Computing, Partners Research Computing, and Brown Computer Science for computational resources.

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022.

Jianshu Chen, Ji He, Yelong Shen, Lin Xiao, Xiaodong He, Jianfeng Gao, Xinying Song, and Li Deng. 2015. End-to-end learning of LDA by mirror-descent back propagation over a deep architecture. In Neural Information Processing Systems.

Yelp Dataset Challenge. 2016. Yelp dataset challenge. https://www.yelp.com/dataset_challenge. Accessed: 2016-03.

Tom L. Griffiths and Mark Steyvers. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences.

Jon D. McAuliffe and David M. Blei. 2008. Supervised topic models. In Neural Information Processing Systems, pages 121–128.

Andrew Kachites McCallum. 2002. MALLET: Machine learning for language toolkit. mallet.cs.umass.edu.

Bo Pang and Lillian Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proc. of the Annual Meeting of the Association for Computational Linguistics.

Jun Zhu, Amr Ahmed, and Eric P. Xing. 2012. MedLDA: Maximum margin supervised topic models. Journal of Machine Learning Research, 13(1):2237–2278.

Jun Zhu, Ning Chen, Hugh Perkins, and Bo Zhang. 2013. Gibbs max-margin topic models with fast sampling algorithms. In International Conference on Machine Learning.

