
Towards a continuous modeling of natural language domains

Sebastian Ruder 1,2, Parsa Ghaffari 2, and John G. Breslin 1

1 Insight Centre for Data Analytics, National University of Ireland, Galway
{sebastian.ruder,john.breslin}@insight-centre.org

2 Aylien Ltd., Dublin, Ireland
{sebastian,parsa}@aylien.com

Abstract

Humans continuously adapt their style and language to a variety of domains. However, a reliable definition of ‘domain’ has eluded researchers thus far. Additionally, the notion of discrete domains stands in contrast to the multiplicity of heterogeneous domains that humans navigate, many of which overlap. In order to better understand the change and variation of human language, we draw on research in domain adaptation and extend the notion of discrete domains to the continuous spectrum. We propose representation learning-based models that can adapt to continuous domains and detail how these can be used to investigate variation in language. To this end, we propose to use dialogue modeling as a test bed due to its proximity to language modeling and its social component.

1 Introduction

The notion of domain permeates natural language and human interaction: Humans continuously vary their language depending on the context, in writing, dialogue, and speech. However, the concept of domain is ill-defined, with conflicting definitions aiming to capture the essence of what constitutes a domain. In semantics, a domain is considered a “specific area of cultural emphasis” (Oppenheimer, 2006) that entails a particular terminology, e.g. a specific sport. In sociolinguistics, a domain consists of a group of related social situations, e.g. all human activities that take place at home. In discourse, a domain is a “cognitive construct (that is) created in response to a number of factors” (Douglas, 2004) and includes a variety of registers. Finally, in the context of transfer learning, a domain is defined as consisting of a feature space X and a marginal probability distribution P(X), where X = {x_1, ..., x_n} and x_i is the i-th feature vector (Pan and Yang, 2010).

These definitions, although pertaining to different concepts, have a commonality: They separate the world into stationary domains that have clear boundaries. However, the real world is more ambiguous. Domains permeate each other and humans navigate these changes in domain.

Consequently, it seems only natural to step away from a discrete notion of domain and adopt a continuous notion. Utterances often cannot be naturally separated into discrete domains, but rather arise from a continuous underlying process that is reflected in many facets of natural language: The web contains an exponentially growing amount of data, where each document “is potentially its own domain” (McClosky et al., 2010); a second-language learner adapts their style as their command of the language improves; language changes with time and with locality; even the WSJ section of the Penn Treebank – often treated as a single domain – contains different types of documents, such as news, lists of stock prices, etc. Continuity is also an element of real-world applications: In spam detection, spammers continuously change their tactics; in sentiment analysis, sentiment is dependent on trends emerging and falling out of favor.

Drawing on research in domain adaptation, we first compare the notion of continuous natural language domains against mixtures of discrete domains and motivate the choice of using dialogue modeling as a test bed. We then present a way of representing continuous domains and show how continuous domains can be incorporated into existing models. We finally propose a framework for evaluation.

2 Continuous domains vs. mixtures of discrete domains

In domain adaptation, a novel target domain is traditionally assumed to be discrete and independent of the source domain (Blitzer et al., 2006). Other research uses mixtures to model the target domain based on a single (Daume III and Marcu, 2006) or multiple discrete source domains (Mansour, 2009). We argue that modeling a novel domain as a mixture of existing domains falls short in light of three factors.

Firstly, the diversity of human language makes it unfeasible to restrict oneself to a limited number of source domains, from which all target domains are modeled. This is exemplified by the diversity of the web, which contains billions of heterogeneous websites; the Yahoo! Directory1 famously contained thousands of hand-crafted categories in an attempt to separate these. Notably, many sub-categories were cross-linked as they could not be fully separated and websites often resided in multiple categories.

Similarly, wherever humans come together, the culmination of different profiles and interests gives rise to cliques, interest groups, and niche communities that all demonstrate their own unique behaviors, unspoken rules, and memes. A mixture of existing domains fails to capture these varieties.

Secondly, using discrete domains for soft assignments relies on the assumption that the source domains are clearly defined. However, discrete labels only help to explain domains and make them interpretable, when in reality, a domain is a heterogeneous amalgam of texts. Indeed, Plank and van Noord (2011) show that selection based on human-assigned labels fares worse than using automatic domain similarity measures for parsing.

Thirdly, not only a speaker’s style and command of a language are changing, but a language itself is continuously evolving. This is amplified in fast-moving media such as social platforms. Therefore, applying a discrete label to a domain merely anchors it in time. A probabilistic model of domains should in turn not be restricted to treat domains as independent points in a space. Rather, such a model should be able to walk the domain manifold and adapt to the underlying process that is producing the data.

1 https://en.wikipedia.org/wiki/Yahoo!_Directory

3 Dialogue modeling as a test bed for investigating domains

As a domain presupposes a social component and relies on context, we propose to use dialogue modeling as a test bed to gain a more nuanced understanding of how language varies with domain.

Dialogue modeling can be seen as a prototypical task in natural language processing akin to language modeling and should thus expose variations in the underlying language. It allows one to observe the impact of different strategies to model variation in language across domains on a downstream task, while being inherently unsupervised.

In addition, dialogue has been shown to exhibit characteristics that expose how language changes as conversation partners become more linguistically similar to each other over the course of the conversation (Niederhoffer and Pennebaker, 2002; Levitan et al., 2011). Similarly, it has been shown that the linguistic patterns of individual users in online communities adapt to match those of the community they participate in (Nguyen and Rose, 2011; Danescu-Niculescu-Mizil et al., 2013).

For this reason, we have selected reddit as a medium and compiled a dataset from large amounts of reddit data. Reddit comments live in a rich environment that is dependent on a large number of contextual factors, such as community, user, conversation, etc. Similar to Chen et al. (2016), we would like to learn representations that allow us to disentangle factors that are normally intertwined, such as style and genre, and that will allow us to gain more insight about the variation in language. To this end, we are currently training models that condition on different communities, users, and threads.

4 Representing continuous domains

In line with past research (Daume III, 2007; Zhou et al., 2016), we assume that every domain has an inherent low-dimensional structure, which allows its projection into a lower-dimensional subspace.

In the discrete setting, we are given two domains, a source domain X_S and a target domain X_T. We represent examples in the source domain X_S as x_{S_1}, ..., x_{S_{n_S}} ∈ R^d, where x_{S_i} is the i-th source example and n_S is the number of examples in X_S. Similarly, we have n_T target domain examples x_{T_1}, ..., x_{T_{n_T}} ∈ R^d.

We now seek to learn a transformation W that allows us to transform the examples in X_S so that their distribution is more similar to the distribution of X_T. Equivalently, we can factorize the transformation W into two transformations A and B with W = AB^T that we can use to project the source and target examples into a joint subspace.

We assume that X_S and X_T lie on lower-dimensional orthonormal subspaces, S, T ∈ R^{D×d}, which can be represented as points on the Grassmann manifold G(d, D), as in Figure 1, where d ≪ D.

Figure 1: Transforming a discrete source domain subspace S into a target domain subspace T with a transformation W.

In computer vision, methods such as Subspace Alignment (Fernando et al., 2013) or the Geodesic Flow Kernel (Gong et al., 2012) have been used to find such transformations A and B. Similarly, in natural language processing, CCA (Faruqui and Dyer, 2014) and Procrustes analysis (Mogadala and Rettinger, 2016) have been used to align subspaces pertaining to different languages.
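As an illustration of how such an alignment can be computed, the following is a minimal sketch of Subspace Alignment in the spirit of Fernando et al. (2013), assuming each domain is given as a matrix of D-dimensional feature vectors (e.g. bag-of-words or embedding features). The function names and the choice of PCA are our own illustrative assumptions, not details prescribed by that work.

import numpy as np
from sklearn.decomposition import PCA

def domain_subspace(X, d):
    # Orthonormal basis of the top-d principal directions: a point on G(d, D).
    return PCA(n_components=d).fit(X).components_.T            # shape (D, d)

def subspace_alignment(X_source, X_target, d=50):
    S = domain_subspace(X_source, d)                           # source basis S
    T = domain_subspace(X_target, d)                           # target basis T
    M = S.T @ T                                                # closed-form minimizer of ||S M - T||_F
    Z_source = X_source @ S @ M                                # source examples aligned to the target subspace
    Z_target = X_target @ T                                    # target examples in their own subspace
    return Z_source, Z_target

A model trained on Z_source can then be applied to Z_target; the same bases S and T also provide a representation of each domain that can be compared directly.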

Many recent approaches using autoencoders (Bousmalis et al., 2016; Zhou et al., 2016) learn such a transformation between discrete domains. Similarly, in a sequence-to-sequence dialogue model (Vinyals and Le, 2015), we can not only train the model to predict the source domain response, but also – via a reconstruction loss – its transformations to the target domain.
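A minimal sketch of such a combined objective is given below, assuming a standard sequence-to-sequence dialogue model; model.encode and model.decode_loss are hypothetical helpers, and coupling the response-prediction loss with a reconstruction loss routed through a factorized transformation is our illustrative reading of the idea rather than a prescribed implementation.

import torch.nn as nn

class DomainTransform(nn.Module):
    # Factorized transformation W = AB^T between domain subspaces.
    def __init__(self, hidden_size, subspace_size):
        super().__init__()
        self.into_subspace = nn.Linear(hidden_size, subspace_size, bias=False)    # plays the role of A
        self.out_of_subspace = nn.Linear(subspace_size, hidden_size, bias=False)  # plays the role of B

    def forward(self, h):
        return self.out_of_subspace(self.into_subspace(h))

def training_loss(model, transform, message, source_response, weight=0.5):
    h = model.encode(message)                                  # encode the incoming message
    response_loss = model.decode_loss(h, source_response)      # predict the source-domain reply
    recon_loss = model.decode_loss(transform(h), message)      # reconstruct the message via the transformation
    return response_loss + weight * recon_loss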

For continuous domains, we can assume that source domain X_S and target domain X_T are not independent, but that X_T has evolved from X_S based on a continuous process. This process can be indexed by time, e.g. in order to reflect how a language learner’s style changes or how language varies as words rise and drop in popularity. We thus seek to learn a time-varying transformation W_t between S and T that allows us to transform between source and target examples dependent on t, as in Figure 2.

Figure 2: Transforming a source domain subspace S into continuous domain subspaces T_t with a temporally varying transformation W_t.

Hoffman et al. (2014) assume a stream of observations z_1, ..., z_{n_t} ∈ R^d drawn from a continuously changing domain and regularize W_t by encouraging the new subspace at t to be close to the previous subspace at t − 1. Assuming a stream of (chronologically) ordered input data, a straightforward application of this to a representation-learning based dialogue model trains the parts of the model that auto-encode and transform the original message for each new example – possibly regularized with a smoothness constraint – while keeping the rest of the model fixed.
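To make the smoothness constraint concrete, the following is a minimal sketch of a temporally regularized update in the spirit of Hoffman et al. (2014): the transformation W_t is refit to the newest target subspace while being pulled toward W_{t−1}. The closed form below follows from the assumed objective ||S W − T_t||_F^2 + λ ||W − W_{t−1}||_F^2 with an orthonormal source basis S; this objective and the windowing scheme are illustrative assumptions, not the authors' exact algorithm.

import numpy as np

def update_transformation(W_prev, S, T_t, lam=1.0):
    # S, T_t: (D, d) orthonormal bases of the source and the current target subspace.
    # Minimizer of ||S W - T_t||_F^2 + lam * ||W - W_prev||_F^2, using S^T S = I.
    return (S.T @ T_t + lam * W_prev) / (1.0 + lam)

As new data streams in, one would re-estimate T_t from the latest window of examples (e.g. with the domain_subspace helper sketched above) and set W_t = update_transformation(W_{t−1}, S, T_t), so that each new subspace stays close to its predecessor.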

This can be seen as an unsupervised variant of fine-tuning, a common neural network domain adaptation baseline. As our learned transformation continuously evolves, we run the risk associated with fine-tuning of forgetting the knowledge acquired from the source domain. For this reason, neural network architectures that are immune to forgetting, such as the recently proposed Progressive Neural Networks (Rusu et al., 2016), are appealing for continuous domain adaptation.

While time is the most obvious dimension along which language evolves, other dimensions are possible: Geographical location influences dialectal variations as in Figure 3; socio-economic status, political affiliation, as well as a domain’s purpose or complexity all influence language and can thus be conceived as axes that span a manifold for embedding domain subspaces.

Figure 3: Transforming a source domain subspace S into continuous target domain subspaces T_s using a spatially varying transformation W_s.

5 Investigating language change

A continuous notion of domains naturally lends itself to a diachronic study of language. By looking at the representations produced by the model over different time steps, one gains insight into the change of language in a community or another domain. Similarly, observing how a user adapts their style to different users and communities reveals insights about the language of those entities.

Domain mixture models use various domain similarity measures to determine how similar the languages of two domains are, such as Renyi divergence (Van Asch and Daelemans, 2010), Kullback-Leibler (KL) divergence, Jensen-Shannon divergence, and vector similarity metrics (Plank and van Noord, 2011), as well as task-specific measures (Zhou et al., 2016).

While word distributions have been used traditionally to compare domains, embedding domains in a manifold offers the possibility to evaluate the learned subspace representations. For this, cosine similarity as used for comparing word embeddings or KL divergence as used in the Variational Autoencoder (Kingma and Welling, 2013) are a natural fit.
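As a concrete illustration, the following minimal sketch computes two of the measures mentioned above: the Jensen-Shannon divergence between the word distributions of two domains, and the cosine similarity between their learned representations. The helper names and the choice to flatten a subspace basis into a single vector are illustrative assumptions.

import numpy as np
from scipy.spatial.distance import jensenshannon, cosine

def word_distribution(counts):
    # Normalize a vocabulary-sized count vector into a probability distribution.
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum()

def domain_similarity(counts_a, counts_b, repr_a, repr_b):
    # Jensen-Shannon divergence between the two domains' word distributions
    # (scipy returns its square root, hence the squaring).
    js_divergence = jensenshannon(word_distribution(counts_a),
                                  word_distribution(counts_b)) ** 2
    # Cosine similarity between the flattened learned domain representations.
    cosine_similarity = 1.0 - cosine(np.ravel(repr_a), np.ravel(repr_b))
    return js_divergence, cosine_similarity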

6 Evaluation

Our evaluation consists of three parts for evaluating the learned representations, the model, and the variation of language itself.

Firstly, as our models produce new representations for every subspace, we can compare a snapshot of a domain’s representation after every n time steps to chart a trajectory of its changes.

Secondly, as we are conducting experiments on dialogue modeling, gold data for evaluation is readily available in the form of the actual response. We can thus train a model on reddit data of a certain period, adapt it to a stream of future conversations and evaluate its performance with BLEU or another metric that might be more suitable to expose variation in language. At the same time, human evaluations will reveal whether the generated responses are faithful to the target domain.
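A minimal sketch of this response-level scoring, assuming whitespace tokenization and NLTK's smoothed sentence-level BLEU (both our own choices rather than details fixed by the evaluation protocol):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def response_bleu(gold_response, generated_response):
    # Score the generated reply against the actual next turn in the conversation.
    reference = [gold_response.lower().split()]
    hypothesis = generated_response.lower().split()
    return sentence_bleu(reference, hypothesis,
                         smoothing_function=SmoothingFunction().method1)

Averaging response_bleu over a held-out stream of future conversations, before and after adaptation, then quantifies how much the continuous adaptation helps.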

Finally, the learned representations will allow us to investigate the variations in language. Ideally, we would like to walk the manifold and observe how language changes as we move from one domain to the other, similarly to Radford et al. (2016).

7 Conclusion

We have proposed a notion of continuous natural language domains along with dialogue modeling as a test bed. We have presented a representation of continuous domains and detailed how this representation can be incorporated into representation learning-based models. Finally, we have outlined how these models can be used to investigate change and variation in language. While our models allow us to shed light on how language changes, models that can adapt to continuous changes are key for personalization and the reality of grappling with an ever-changing world.


References

John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain Adaptation with Structural Correspondence Learning. Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 120–128.

Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. 2016. Domain Separation Networks. NIPS.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. 2016. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv preprint arXiv:1606.03657.

Cristian Danescu-Niculescu-Mizil, Robert West, Dan Jurafsky, and Christopher Potts. 2013. No Country for Old Members: User Lifecycle and Linguistic Change in Online Communities. Proceedings of the 22nd International Conference on World Wide Web, pages 307–317.

Hal Daume III and Daniel Marcu. 2006. Domain Adaptation for Statistical Classifiers. Journal of Artificial Intelligence Research, 26:101–126.

Hal Daume III. 2007. Frustratingly Easy Domain Adaptation. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pages 256–263.

Dan Douglas. 2004. Discourse Domains: The Cognitive Context of Speaking. In Diana Boxer and Andrew D. Cohen, editors, Studying Speaking to Inform Second Language Learning. Multilingual Matters.

Manaal Faruqui and Chris Dyer. 2014. Improving Vector Space Word Representations Using Multilingual Correlation. Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 462–471.

Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. 2013. Unsupervised Visual Domain Adaptation Using Subspace Alignment. Proceedings of the IEEE International Conference on Computer Vision (ICCV).

Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. 2012. Geodesic Flow Kernel for Unsupervised Domain Adaptation. 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Judy Hoffman, Trevor Darrell, and Kate Saenko. 2014. Continuous Manifold Based Adaptation for Evolving Visual Domains. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 867–874.

Diederik P. Kingma and Max Welling. 2013. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114.

Rivka Levitan, Agustín Gravano, and Julia Hirschberg. 2011. Entrainment in Speech Preceding Backchannels. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL-HLT), pages 113–117.

Yishay Mansour. 2009. Domain Adaptation with Multiple Sources. NIPS, pages 1–8.

David McClosky, Eugene Charniak, and Mark Johnson. 2010. Automatic Domain Adaptation for Parsing. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 28–36.

Aditya Mogadala and Achim Rettinger. 2016. Bilingual Word Embeddings from Parallel and Non-parallel Corpora for Cross-Language Text Classification. NAACL, pages 692–702.

Dong Nguyen and Carolyn P. Rose. 2011. Language Use as a Reflection of Socialization in Online Communities. Proceedings of the Workshop on Languages in ..., pages 76–85.

K. G. Niederhoffer and J. W. Pennebaker. 2002. Linguistic Style Matching in Social Interaction. Journal of Language and Social Psychology, 21(4):337–360.

Harriet J. Oppenheimer. 2006. The Anthropology of Language: An Introduction to Linguistic Anthropology. Wadsworth, Belmont (Canada).

Sinno Jialin Pan and Qiang Yang. 2010. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Barbara Plank and Gertjan van Noord. 2011. Effective Measures of Domain Similarity for Parsing. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 1:1566–1576.

Alec Radford, Luke Metz, and Soumith Chintala. 2016. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. ICLR, pages 1–15.

Andrei A. Rusu, Neil C. Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. 2016. Progressive Neural Networks. arXiv preprint arXiv:1606.04671.

Vincent Van Asch and Walter Daelemans. 2010. Using Domain Similarity for Performance Estimation. Computational Linguistics, pages 31–36.

Oriol Vinyals and Quoc V. Le. 2015. A Neural Conversational Model.

Guangyou Zhou, Zhiwen Xie, Jimmy Xiangji Huang, and Tingting He. 2016. Bi-Transferring Deep Neural Networks for Domain Adaptation. ACL, pages 322–332.