Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), pages 66–72, Valencia, Spain, April 4, 2017. © 2017 Association for Computational Linguistics

Factoring Ambiguity out of the Prediction of Compositionality for German Multi-Word Expressions

Stefan Bott and Sabine Schulte im Walde
Institut für Maschinelle Sprachverarbeitung

Universität Stuttgart
Pfaffenwaldring 5b, 70569 Stuttgart, Germany

    {stefan.bott,schulte}@ims.uni-stuttgart.de

    Abstract

Ambiguity represents an obstacle for distributional semantic models (DSMs), which typically subsume the contexts of all word senses within one vector. While individual vector space approaches have been concerned with sense discrimination (e.g., Schütze (1998), Erk (2009), Erk and Pado (2010)), such discrimination has rarely been integrated into DSMs across semantic tasks. This paper presents a soft-clustering approach to sense discrimination that filters sense-irrelevant features when predicting the degrees of compositionality for German noun-noun compounds and German particle verbs.

    1 Introduction

Addressing the compositionality of complex words is a crucial ingredient for lexicography and NLP applications, in order to know whether an expression should be treated as a whole or through its constituents, and what the expression means. For example, studies such as Cholakov and Kordoni (2014), Weller et al. (2014), Cap et al. (2015), and Salehi et al. (2015b) have integrated the prediction of multi-word compositionality into statistical machine translation.

We are interested in predicting the degrees of compositionality of two types of German multi-word expressions: (i) German noun-noun compounds (NCs) represent nominal multi-word expressions (MWEs), e.g., Feuer|werk ‘fire works’ is composed of the constituents Feuer ‘fire’ and Werk ‘opus’. (ii) German particle verbs (PVs) are complex verbs such as an|strahlen (‘beam/smile at’) which are composed of a separable prefix particle (an) and a base verb (strahlen ‘beam/smile’). Both types of German MWEs are highly frequent and highly productive in the lexicon. Table 1 presents some example MWEs and their constituents with human ratings on compositionality.¹

Automatic approaches to predict compositionality degrees typically exploit distributional semantic models (DSMs), i.e. vector representations relying on the distributional hypothesis (Harris, 1954; Firth, 1957) that words with similar distributions have related meanings. Regarding compositionality prediction, DSMs represent the meanings of the MWEs and their constituents by distributional vectors, and the similarity of a compound–constituent vector pair is taken as the predicted degree of compound–constituent compositionality. Existing approaches addressed the compositionality of NCs (Reddy et al., 2011; Salehi and Cook, 2013; Schulte im Walde et al., 2013; Salehi et al., 2014) and complex verbs (Baldwin, 2005; Bannard, 2005; Bott and Schulte im Walde, 2015), mainly for English and for German.

A major obstacle for DSMs is their conflation of contexts across individual word senses. DSMs typically subsume evidence of cooccurring items within one vector for the target word type, rather than discriminating contextual evidence for the specific target word senses. Taking the German noun-noun compound Blatt|salat ‘leaf salad’ as an example, its modifier constituent Blatt has at least four senses: ‘leaf’, ‘sheet of paper’, ‘newspaper’ and ‘hand of cards’. If we had individual sense vectors for each sense of Blatt, a DSM might successfully predict a strong compositionality for the compound Blatt|salat regarding this constituent, when comparing the compound vector with the ‘leaf’ sense vector, because the vectors agree on salient features such as green and fresh.

¹ The scales for mean ratings were 1–7 for noun-noun compounds, and 1–6 for particle verbs. Examples were taken from the two gold standards described in section 2.


Multi-Word Expression                Constituent Glosses       Modifier Rating   Head Rating
Ahorn|blatt ‘maple leaf’             maple + leaf              5.64              5.71
Blatt|salat ‘green salad’            leaf + salad              3.56              5.68
See|zunge ‘sole’                     sea + tongue              3.57              3.27
Löwen|zahn ‘dandelion’               lion + tooth              2.10              2.23
Fliegen|pilz ‘toadstool’             fly/bow tie + mushroom    1.93              6.55
Fleisch|wolf ‘meat chopper’          meat + wolf               6.00              1.90
an|leuchten ‘illuminate’             an (PRT) + illuminate     –                 5.95
auf|horchen ‘listen attentively’     auf (PRT) + listen        –                 4.55
aus|reizen ‘exhaust’                 aus (PRT) + provoke       –                 3.62
ein|fallen ‘remember/invade’         ein (PRT) + fall          –                 2.54
an|stiften ‘instigate’               an (PRT) + create         –                 1.80

Table 1: Examples of German noun-noun compounds and German particle verbs, accompanied by translations and human mean ratings on the degrees of compound–constituent compositionality.

But traditionally, the constituent vector contains distributional information across all Blatt senses, and the similarity between the conflated word type vector and the compound vector is most probably determined by the predominant sense of the word type (which does not necessarily coincide with the relevant sense).

While individual vector space approaches have been concerned with sense discrimination (e.g., Schütze (1998), Erk (2009), Erk and Pado (2010)), the approaches have rarely been integrated into DSMs across semantic tasks. Alternatively, sense disambiguation/discrimination approaches have been developed for SemEval tasks on Word Sense Disambiguation/Discrimination and (Cross-lingual) Lexical Substitution (McCarthy and Navigli, 2007; Mihalcea et al., 2010; Jurgens and Klapaftis, 2013). To our knowledge, few systems have attempted to distinguish between word senses and then address various semantic relatedness tasks, such as Li and Jurafsky (2015) and Iacobacci et al. (2015). Computational compositionality assessment has been studied for NCs (Reddy et al., 2011; Schulte im Walde et al., 2013; Salehi and Cook, 2013; Schulte im Walde et al., 2016a) and PVs (McCarthy et al., 2003; Baldwin et al., 2003; Bannard, 2005; Kühner and Schulte im Walde, 2010). Most similar to our current work is Salehi et al. (2015a), who addressed the problem of semantic ambiguity in MWEs by using a multi-sense skip-gram model with two to five embeddings per word. They expected multiple embeddings to capture different word senses. They could, however, not find an improvement over the use of single-word embeddings.

In this paper, we suggest soft clustering as an approximation to separate the different senses of a word type. We expect that the assignments of compound and constituent words to clusters reflect the differences between word senses, and that the underlying features refer to the features of the respective word sense. We assume further that if we find a pair of an MWE µ and one of its constituents κ with high distributional similarity in the same cluster, this indicates closeness in meaning and therefore strong compositionality. We exploit the soft clusters by (a) identifying the relevant senses of the MWE and constituents based on overlap in cluster assignment, and by (b) comparing reduced vectors of MWEs and constituents, taking into account only a subset of cluster-based salient sense features, in order to optimize the prediction of compositionality.

    2 Experiment Setup

Distributional Semantic Models Our DSM is a word space model that uses lemmatized words as dimensions in the high-dimensional vector space (Sahlgren, 2006; Turney and Pantel, 2010). The associative strength between target and context words is measured as Local Mutual Information (LMI) (Evert, 2004), based on context word frequency. The context of the targets is defined as a window of n words to the left and the right of the target. We use the cosine of the angle between two vectors as a measure for semantic similarity and compositionality. For technical reasons we ignore context words with a count of 5 or less or an LMI value below 0.
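The following is a minimal sketch (not the authors' implementation) of such a window-based DSM: it collects LMI-weighted co-occurrence counts over a symmetric lemma window and compares sparse vectors by cosine. The corpus format, window size and all function names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

def build_lmi_vectors(sentences, window=10, min_count=6):
    """sentences: lists of lemmas; returns {target: {context_word: LMI weight}}."""
    cooc = defaultdict(Counter)
    for sent in sentences:
        for i, target in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[target][sent[j]] += 1  # symmetric window count

    target_freq = Counter({t: sum(ctx.values()) for t, ctx in cooc.items()})
    context_freq = Counter()
    for ctx in cooc.values():
        context_freq.update(ctx)
    total = sum(target_freq.values())

    vectors = {}
    for t, ctx in cooc.items():
        vec = {}
        for c, f in ctx.items():
            if f < min_count:   # drop context words seen 5 times or less
                continue
            lmi = f * math.log2(f * total / (target_freq[t] * context_freq[c]))
            if lmi >= 0:        # drop negative LMI values
                vec[c] = lmi
        vectors[t] = vec
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors represented as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Comparing, e.g., the vectors for Blatt|salat and Blatt with this cosine function would correspond to the plain window-model prediction described next.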

We use the word vectors in three ways here: (a) we use them directly as window models in order to measure the distance between vector pairs for an MWE and each of its components (e.g. Blatt|salat vs. Blatt); we also use them (b) as an input matrix for soft clustering, and (c) we build word vector models for each cluster.

LSC for Soft Clustering We use Latent Semantic Classes (LSC) as a soft clustering algorithm (Rooth, 1998; Rooth et al., 1999). LSC is a two-dimensional soft-clustering algorithm which learns three probability distributions: (a) across the clusters, (b) for the output probabilities of each element within a cluster, and (c) for each feature type with regard to a cluster. The access to all three probability distributions is crucial for our approach, since it allows us to determine which features are salient for individual clusters.
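The LSC tool itself is not reproduced here; the following is only a compact illustration of the kind of two-dimensional EM soft clustering it performs (cf. Rooth et al., 1999), estimating the three distributions mentioned above — p(c), p(word|c) and p(feature|c) — from (word, feature) co-occurrence counts. The cluster number, iteration count and initialisation are arbitrary assumptions.

```python
import random
from collections import defaultdict

def lsc_em(pair_counts, n_clusters=50, iterations=30, seed=0):
    """pair_counts: {(word, feature): count}; returns (p_c, p_w_c, p_f_c)."""
    rng = random.Random(seed)
    words = {w for w, f in pair_counts}
    feats = {f for w, f in pair_counts}
    # random soft initialisation of the three distributions
    p_c = [1.0 / n_clusters] * n_clusters
    p_w_c = [{w: rng.random() for w in words} for _ in range(n_clusters)]
    p_f_c = [{f: rng.random() for f in feats} for _ in range(n_clusters)]
    for dist in p_w_c + p_f_c:
        z = sum(dist.values())
        for k in dist:
            dist[k] /= z

    for _ in range(iterations):
        new_c = [0.0] * n_clusters
        new_w = [defaultdict(float) for _ in range(n_clusters)]
        new_f = [defaultdict(float) for _ in range(n_clusters)]
        for (w, f), n in pair_counts.items():
            # E-step: responsibility of each cluster for this (word, feature) pair
            post = [p_c[c] * p_w_c[c][w] * p_f_c[c][f] for c in range(n_clusters)]
            z = sum(post) or 1.0
            for c in range(n_clusters):
                r = n * post[c] / z
                new_c[c] += r
                new_w[c][w] += r
                new_f[c][f] += r
        # M-step: renormalise the expected counts into probability distributions
        total = sum(new_c)
        p_c = [x / total for x in new_c]
        p_w_c = [{w: new_w[c][w] / (new_c[c] or 1.0) for w in words} for c in range(n_clusters)]
        p_f_c = [{f: new_f[c][f] / (new_c[c] or 1.0) for f in feats} for c in range(n_clusters)]
    return p_c, p_w_c, p_f_c
```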

The Pipeline We create two types of models: The window models are simple word space models which use LMI values based on counts of context words. The clustering models apply soft clustering as a step prior to the determination of distributional similarity. For their construction, we use the window-based models as an input to the LSC algorithm. The clusters produced by LSC are used to create individual models for each cluster C, such that each of these cluster-specific models only contains vectors for the target words which are contained in C, and only represents those features as dimensions which are predicted to be salient features for C. The models vary with respect to the number of clusters created.
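As a rough sketch under the same assumptions (the LMI vectors from the window-model sketch above, plus per-cluster target and salient-feature sets, derived for instance as in the Thresholds paragraph below), the cluster-specific models could be built as follows; all names are illustrative, not the authors' code.

```python
def cluster_models(vectors, cluster_targets, cluster_features):
    """vectors: {target: {feature: LMI}};
    cluster_targets / cluster_features: one set of target words and one set of
    salient features per cluster."""
    models = []
    for targets, salient in zip(cluster_targets, cluster_features):
        model = {}
        for t in targets:
            if t in vectors:
                # reduce the word-type vector to the dimensions salient for this cluster
                reduced = {f: w for f, w in vectors[t].items() if f in salient}
                if reduced:
                    model[t] = reduced
        models.append(model)
    return models
```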

With this, we expect that in our example of Blatt|salat some clusters will capture the ‘leaf’ sense and others the ‘sheet’ or other senses. The comparison between the vectors for Blatt and Blatt|salat is then done separately for each cluster, where the context dimensions of the vectors are reduced to only those context words which are also salient features of each cluster. We expect that the pair from our example only occurs in clusters which can be attributed to the ‘leaf’ sense.

Comparison across Clusters In cases like the NC Blatt|salat it appears that the word sense which should be considered for compositionality assessment is the one which is distributionally closest to the target MWE. But this is not necessarily the case for all MWEs. The PV zu|schlagen is one example: it can mean either to hit hard and quickly or to take advantage of a good offer/bargain; in this case the MWE itself is ambiguous. The base verb schlagen means to hit, so one sense of the PV is highly compositional and the other sense is less so; nevertheless, neither of the senses is predominant. We use three methods to compare the distributional similarity across clusters: highest, lowest and average. In the first two methods (highest/lowest) we select the cluster with the highest/lowest distributional similarity between µ and κ and use its similarity value. In the last method (average) the average similarity is computed over those clusters which contain both the MWE µ and the target constituent κ, while clusters which do not contain the pair are ignored.
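A small sketch of these three combination methods, reusing the cosine function and cluster-specific models from the sketches above (the method names follow the text; everything else is illustrative):

```python
def combine_across_clusters(models, mwe, constituent, method="average"):
    """Combine per-cluster similarities into one score per MWE-constituent pair."""
    scores = [cosine(m[mwe], m[constituent])
              for m in models if mwe in m and constituent in m]
    if not scores:                       # the pair never shares a cluster: no prediction
        return None
    if method == "highest":
        return max(scores)
    if method == "lowest":
        return min(scores)
    return sum(scores) / len(scores)     # "average" over clusters containing both words
```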

Thresholds The fact that LSC outputs probabilities for both targets and features allows us to set two different thresholds on these probabilities. The threshold on the target output probability (t-threshold) controls the number of clusters to which a target element is assigned. The lower the threshold is set, the more elements each cluster will contain. Lower threshold values also lead to higher average numbers of clusters to which each element is assigned. The t-threshold influences the predictions of our models in that low values also increase the likelihood, for each cluster C and for each pair ⟨α, β⟩ of an MWE and a constituent word, that both α and β are included in C. The threshold on the feature output probability (f-threshold) allows us to filter the vectors for both elements of ⟨α, β⟩ according to each cluster C, so that only the dimensions representing the salient features for C are included in the vectors.
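How the two thresholds could be applied to the LSC output is sketched below, assuming the p(word|c) and p(feature|c) distributions from the EM sketch above; the concrete threshold values are arbitrary placeholders, not the settings used in the paper. The resulting sets are exactly what the cluster_models sketch above expects.

```python
def apply_thresholds(p_w_c, p_f_c, t_threshold=0.01, f_threshold=0.001):
    """t-threshold: which targets count as members of a cluster;
    f-threshold: which features count as salient for a cluster."""
    cluster_targets = [{w for w, p in dist.items() if p >= t_threshold}
                       for dist in p_w_c]
    cluster_features = [{f for f, p in dist.items() if p >= f_threshold}
                        for dist in p_f_c]
    return cluster_targets, cluster_features
```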

Corpus For the extraction of features we use the SdeWaC (v.3, 880 million words) corpus (Faaß and Eckart, 2013), in a tokenized (Schmid, 2000), POS-tagged and lemmatized (Schmid, 1994) version.

Gold Standards For NCs and PVs we use the following gold standards:

• GS-NN: 868 German NCs (Schulte im Walde et al., 2016b), randomly selected from different frequency ranges, different ambiguity levels of the heads and different levels of modifier and head productivity. The NCs were annotated by eight native speakers on a scale from 1 to 6 for compositionality with respect to both head and modifier constituents.

• GS-PV: 354 PVs, for 11 verb particles. The PVs were randomly selected, balanced over 3 frequency bands: they were automatically harvested from various corpora, assigned to 3 different frequency ranges per particle and then automatically selected. Some manual revision was done to filter out non-existing PVs resulting from lemmatization errors. Ratings were obtained with Amazon Mechanical Turk.²

Figure 1: Results (in ρ values) for different window sizes for the NC-head gold standard

Figure 2: Results (in ρ values) for different window sizes for the PV gold standard

Feature Sets We were interested in which parts of speech provide the best predictive features for compositionality. We use only content-word categories: adjectives, nouns and verbs. We use four different feature sets: all content words combined, and each category in isolation.

Measures Distributional similarity is measured with the cosine between vectors. The cosine similarity values are used to rank the compared pairs from lowest to highest. For the evaluation, system rankings and human judgment rankings of MWEs are compared to each other with Spearman's rank-order correlation ρ (Siegel and Castellan, 1988). Spearman's ρ is a non-parametric measure which assesses monotonic relationships of ranks; it ranges between -1 (inverse correlation) and 1 (perfect correlation), and a ρ value of 0 indicates a lack of correlation. Significance is determined with the use of the Fisher transformation.
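For concreteness, here is a sketch of this evaluation step in its usual formulation: Spearman's ρ via SciPy and a Fisher-transformation z-test against zero correlation. The exact significance computation used in the paper is not spelled out, so the test below is an assumption, and the function name is illustrative.

```python
import math
from scipy.stats import spearmanr, norm

def evaluate(gold_ratings, predictions):
    """Rank-correlate gold compositionality ratings with predicted similarities."""
    rho, _ = spearmanr(gold_ratings, predictions)
    n = len(gold_ratings)
    z = 0.5 * math.log((1 + rho) / (1 - rho))                # Fisher transformation of rho
    p_value = 2 * (1 - norm.cdf(abs(z) * math.sqrt(n - 3)))  # two-sided test vs. rho = 0
    return rho, p_value
```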

Soft clustering does not guarantee that each pair of an NC and a constituent word is placed together in at least one of the clusters. This may potentially lead to problems of coverage. In practice, however, we experience coverage problems only for very restrictive threshold settings.

² This gold standard is a preliminary, but not identical, version of the one presented in Bott et al. (2016). It was also used in Bott and Schulte im Walde (2014).

    3 Results and Discussion

Figures 1 and 2 show the results for different window sizes for NCs and PVs. The two figures have different scales, and higher ρ scores are obtained for NCs. The values are compared to the results of the window-based models. The predictions of compositionality levels become more accurate with increasing window sizes. For NC compositionality, apparently more general information about the larger context plays an important role. Interestingly, no negative effect from larger contexts can be observed, even if smaller contexts tend to concentrate on closely related words such as complements, modifiers and the complementary parts of collocations in which the target word takes part. All ρ values above 0.108 are statistically highly significant.

Figure 3: Results for different numbers of clusters for the NC gold standard (heads vs. modifiers)

Figure 3 shows the effect of the number of clusters which are used in the clustering stage. The graphic shows that the number of clusters does not have a strong influence on performance, but slightly better results can be observed with smaller numbers of clusters. This might be due to the fact that larger numbers of clusters split up the feature space into smaller segments, and the feature vectors tend to suffer from sparseness. Figure 3 also shows that the predictions for noun compound compositionality with respect to the heads are generally better than with respect to the modifiers. This is probably a consequence of the fact that the meaning of NCs is in most cases more strongly determined by the meanings of their heads than by their modifiers, which might explain the observed asymmetry. This finding is in line with earlier studies (Hätty, 2016; Schulte im Walde et al., 2016a) which investigated the asymmetry between the properties of heads and modifiers in noun-noun compounds. They showed that head constituent properties, such as their ambiguity or frequency, influence the predictability of NC compositionality to a much larger degree than modifier constituent properties.

As for feature selection, we found that adjectives represented the least reliable predictive features for compositionality assessment, while nouns were the most reliable ones. The use of the latter even leads to slightly better performance than the use of the full feature set that contains all content word categories.

Figure 4: ρ values for variations over thresholds (NC gold standard)

Figure 4 shows the influence of the target and the feature thresholds on compositionality prediction. As expected, very high threshold values lead to poor performance since they cause very sparse vector representations. When the threshold is lowered, the performance curve rises steeply and reaches a stable plateau, which is observable in this figure.

    4 Conclusions

We started this paper with a theoretical justification for factoring out the influence of ambiguity from the prediction of compositionality across multi-word expressions. We applied soft clustering to extract word-sense vectors from word-type vectors, in order to strengthen salient sense features and improve the prediction of compound–constituent compositionality. Both NCs and PVs benefit from the use of clustering in distributional modeling, but in different ways. First, PVs benefit much more than NCs. Second, the optimal combination method, which calculates a global similarity score per compound–constituent pair from the cluster-specific DSMs, differs between the two types of MWEs. This suggests an underlying difference between them.

In future work we will explore alternative ways to treat the ambiguity of constituent words more adequately. We further plan to examine why different types of MWEs benefit from the clustering approach, but with different cluster combination methods. We will also extend our investigation to other semantic relatedness tasks, such as the distinction between semantic relations, which potentially suffer from the same ambiguity issue.


References

Timothy Baldwin, Colin Bannard, Takaaki Tanaka, and Dominic Widdows. 2003. An empirical model of multiword expression decomposability. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 89–96, Sapporo, Japan. Association for Computational Linguistics.

Timothy Baldwin. 2005. Deep lexical acquisition of verb–particle constructions. Computer Speech and Language, 19:398–414.

Collin Bannard. 2005. Learning about the meaning of verb–particle constructions from corpora. Computer Speech and Language, 19:467–478.

Stefan Bott and Sabine Schulte im Walde. 2014. Syntactic transfer patterns of German particle verbs and their impact on lexical semantics. In Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), pages 182–192, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.

Stefan Bott and Sabine Schulte im Walde. 2015. Exploiting fine-grained syntactic transfer features to predict the compositionality of German particle verbs. In Proceedings of the 11th International Conference on Computational Semantics, pages 34–39, London, UK. Association for Computational Linguistics.

Stefan Bott, Nana Khvtisavrishvili, Max Kisselew, and Sabine Schulte im Walde. 2016. Ghost-PV: A representative gold standard of German particle verbs. In Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), pages 125–133, Osaka, Japan. The COLING 2016 Organizing Committee.

Fabienne Cap, Manju Nirmal, Marion Weller, and Sabine Schulte im Walde. 2015. How to account for idiomatic German support verb constructions in statistical machine translation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 19–28, Denver, Colorado. Association for Computational Linguistics.

Kostadin Cholakov and Valia Kordoni. 2014. Better statistical machine translation through linguistic treatment of phrasal verbs. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 196–201, Doha, Qatar. Association for Computational Linguistics.

Katrin Erk and Sebastian Pado. 2010. Exemplar-based models for word meaning in context. In Proceedings of the ACL 2010 Conference Short Papers, pages 92–97, Uppsala, Sweden. Association for Computational Linguistics.

Katrin Erk. 2009. Representing words in regions in vector space. In Proceedings of the 13th Conference on Computational Natural Language Learning, pages 57–65, Boulder, CO.

Stefan Evert. 2004. The statistical analysis of morphosyntactic distributions. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004), pages 1539–1542, Lisbon, Portugal. European Language Resources Association (ELRA). ACL Anthology Identifier: L04-1509.

Gertrud Faaß and Kerstin Eckart. 2013. SdeWaC – a corpus of parsable sentences from the web. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology, pages 61–68, Darmstadt, Germany.

John R. Firth. 1957. Papers in Linguistics 1934–51. Longmans, London, UK.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146–162.

Anna Hätty. 2016. Vector space models of compositionality for German and English noun-noun compounds. Master Thesis. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

Ignacio Iacobacci, Mohammad Taher Pilehvar, and Roberto Navigli. 2015. SensEmbed: Learning sense embeddings for word and relational similarity. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, pages 95–105, Beijing, China. Association for Computational Linguistics.

David Jurgens and Ioannis Klapaftis. 2013. SemEval-2013 Task 13: Word sense induction for graded and non-graded senses. In Second Joint Conference on Lexical and Computational Semantics (*SEM), Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013), pages 290–299, Atlanta, Georgia, USA. Association for Computational Linguistics.

Natalie Kühner and Sabine Schulte im Walde. 2010. Determining the degree of compositionality of German particle verbs by clustering approaches. In Proceedings of the 10th Conference on Natural Language Processing, pages 47–56, Saarbrücken, Germany.

Jiwei Li and Dan Jurafsky. 2015. Do multi-sense embeddings improve natural language understanding? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1722–1732, Lisbon, Portugal. Association for Computational Linguistics.

Diana McCarthy and Roberto Navigli. 2007. SemEval-2007 Task 10: English lexical substitution task. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pages 48–53, Prague, Czech Republic. Association for Computational Linguistics.


Diana McCarthy, Bill Keller, and John Carroll. 2003. Detecting a continuum of compositionality in phrasal verbs. In Proceedings of the ACL-SIGLEX Workshop on Multiword Expressions: Analysis, Acquisition and Treatment, Sapporo, Japan.

Rada Mihalcea, Ravi Sinha, and Diana McCarthy. 2010. SemEval-2010 Task 2: Cross-lingual lexical substitution. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 9–14, Uppsala, Sweden. Association for Computational Linguistics.

Siva Reddy, Diana McCarthy, and Suresh Manandhar. 2011. An empirical study on compositionality in compound nouns. In Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 210–218, Chiang Mai, Thailand. Asian Federation of Natural Language Processing.

Mats Rooth, Stefan Riezler, Detlef Prescher, Glenn Carroll, and Franz Beil. 1999. Inducing a semantically annotated lexicon via EM-based clustering. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, pages 104–111, College Park, Maryland, USA. Association for Computational Linguistics.

Mats Rooth. 1998. Two-dimensional clusters in grammatical relations. In Inducing Lexicons with the EM Algorithm, AIMS Report 4(3). Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart.

Magnus Sahlgren. 2006. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations between Words in High-Dimensional Vector Spaces. Ph.D. thesis, Stockholm University.

Bahar Salehi and Paul Cook. 2013. Predicting the compositionality of multiword expressions using translations in multiple languages. In Second Joint Conference on Lexical and Computational Semantics (*SEM), pages 266–275, Atlanta, Georgia, USA. Association for Computational Linguistics.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2014. Using distributional similarity of multi-way translations to predict multiword expression compositionality. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 472–481, Gothenburg, Sweden. Association for Computational Linguistics.

Bahar Salehi, Paul Cook, and Timothy Baldwin. 2015a. A word embedding approach to predicting the compositionality of multiword expressions. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 977–983, Denver, Colorado. Association for Computational Linguistics.

Bahar Salehi, Nitika Mathur, Paul Cook, and Timothy Baldwin. 2015b. The impact of multiword expression compositionality on machine translation evaluation. In Proceedings of the 11th Workshop on Multiword Expressions, pages 54–59, Denver, Colorado. Association for Computational Linguistics.

Helmut Schmid. 1994. Probabilistic part-of-speech tagging using decision trees. In Proceedings of the International Conference on New Methods in Language Processing, pages 44–49, Manchester, UK.

Helmut Schmid. 2000. Unsupervised learning of period disambiguation for tokenisation. Technical report, Universität Stuttgart.

Sabine Schulte im Walde, Stefan Müller, and Stefan Roller. 2013. Exploring vector space models to predict the compositionality of German noun-noun compounds. In Second Joint Conference on Lexical and Computational Semantics (*SEM), pages 255–265, Atlanta, Georgia, USA. Association for Computational Linguistics.

Sabine Schulte im Walde, Anna Hätty, and Stefan Bott. 2016a. The role of modifier and head properties in predicting the compositionality of English and German noun-noun compounds: A vector-space perspective. In Proceedings of the Fifth Joint Conference on Lexical and Computational Semantics, pages 148–158, Berlin, Germany. Association for Computational Linguistics.

Sabine Schulte im Walde, Anna Hätty, Stefan Bott, and Nana Khvtisavrishvili. 2016b. Ghost-NN: A representative gold standard of German noun-noun compounds. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2285–2292, Portoroz, Slovenia. European Language Resources Association (ELRA).

Hinrich Schütze. 1998. Automatic word sense discrimination. Computational Linguistics, 24(1):97–123. Special Issue on Word Sense Disambiguation.

Sidney Siegel and N. John Castellan. 1988. Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, Boston, MA.

Peter D. Turney and Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research, 37:141–188.

Marion Weller, Fabienne Cap, Stefan Müller, Sabine Schulte im Walde, and Alexander Fraser. 2014. Distinguishing degrees of compositionality in compound splitting for statistical machine translation. In Proceedings of the First Workshop on Computational Approaches to Compound Analysis (ComAComA 2014), pages 81–90, Dublin, Ireland. Association for Computational Linguistics and Dublin City University.
