Not All Contexts Are Equal: Automatic Identification of Antonyms, Hypernyms, Co-Hyponyms, and Synonyms in DSMs
Enrico Santus, Qin Lu, Alessandro Lenci and Chu-Ren Huang


TRANSCRIPT

Page 1:

Not All Contexts Are Equal: Automatic Identification of Antonyms, Hypernyms, Co-Hyponyms, and Synonyms in DSMs

Enrico Santus, Qin Lu, Alessandro Lenci and Chu-Ren Huang

Page 2:

Modeling Human Language Ability

• Over the last decades, NLP has achieved impressive progress in modeling human language ability, developing a large number of applications:
– Information Retrieval (IR)
– Information Extraction (IE)
– Question Answering (QA)
– Machine Translation (MT)
– Others…

Page 3:

The Need for Resources

• These applications were improved not only by bettering the algorithms, but also through the use of better lexical resources and ontologies (Lenci, 2008a):
– WordNet
– SUMO
– DOLCE
– ConceptNet
– Others…

Page 4:

Automatic Creation of Resources

• As the relevance of these resources has grown, systems for their automatic creation have assumed a key role in NLP.

• Handmade resources are in fact:
– Arbitrary
– Expensive to create
– Time-consuming
– Difficult to keep updated

Page 5:

Semantic Relations as Building Blocks

• Entities and relations have been identified as the main building blocks of these resources (Herger, 2014).

• NLP has focused on methods for the automatic extraction and representation of entities and relations, in order to:
– Increase the effectiveness of such resources;
– Reduce the costs of development and updates.

• Yet, we are still far from achieving satisfactory results.

Page 6:

Semantic Relations Identification and Discrimination in DSMs

• The distributional approach was chosen because it is:
– Completely unsupervised (Turney and Pantel, 2010);
– Portable to any language for which large corpora can be collected (ibid.);
– Applicable to a large range of tasks (ibid.);
– Cognitively plausible (Lenci, 2008b);
– Strong at identifying similarity (ibid.).

Page 7:

From Syntagmatic to Paradigmatic Relations

• The main semantic relations (i.e. synonymy, antonymy, hypernymy, meronymy) are also called paradigmatic semantic relations.

• Paradigmatic relations are concerned with the possibility of substitution in the same syntagmatic contexts.

• They should be considered in opposition to syntagmatic relations, which are instead concerned with the position in the sentence (syntagm).

(de Saussure, 1916)

Page 8:

Distributional Semantics

• Distributional Semantics can be used to derive the paradigmatic relations from syntagmatic ones.

• It relies on the Distributional Hypothesis (Harris, 1954), according to which:
1. At least some aspects of the meaning of a linguistic expression depend on its distribution in contexts;
2. The degree of similarity between two linguistic expressions is a function of the similarity of the contexts in which they occur.

Page 9:

Vector Space Models

• Starting from the Distributional Hypothesis, computational models have been developed that represent words as vectors, whose dimensions contain the Strength of Association (SoA) with the contexts.
– The SoA is generally the co-occurrence frequency or the mutual information (PMI, PPMI, LMI, etc.)
– Contexts may be single words within a window or within a syntactic structure, pairs of words, etc.

• Words are therefore represented spatially, and their meaning is given by their proximity to other vectors in such a vector space, often also referred to as a semantic space (Turney and Pantel, 2010).
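
For reference (not part of the original slide), the association measures listed above are commonly defined as follows, with LMI following Evert (2005):

```latex
\mathrm{PMI}(w,c)  = \log_2 \frac{P(w,c)}{P(w)\,P(c)}
\qquad
\mathrm{PPMI}(w,c) = \max\big(0,\ \mathrm{PMI}(w,c)\big)
\qquad
\mathrm{LMI}(w,c)  = f(w,c)\cdot \mathrm{PMI}(w,c)
```

where f(w, c) is the observed co-occurrence frequency of the word w with the context c.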

Page 10:

Distributional Semantic Models: Similarity as Proximity

• DSMs are known for their ability to identify semantically similar lexemes.

• The vector cosine is generally used: it returns a value between 0 and 1, where 0 means paradigmatically totally unrelated and 1 means distributionally identical.

(Santus et al., 2014a-c)
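
The cosine formula shown on the slide did not survive extraction; the standard definition, which stays in [0, 1] for the non-negative LMI-weighted vectors used here, is:

```latex
\cos(\vec{w}_1, \vec{w}_2) =
\frac{\sum_{i} w_{1,i}\, w_{2,i}}
     {\sqrt{\sum_{i} w_{1,i}^{2}}\ \sqrt{\sum_{i} w_{2,i}^{2}}}
```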

Page 11:

Shortcomings of DSMs: Semantic Relations

• Unfortunately the definition of distributional similarity is so loose that under its umbrella fall not only near-synonyms (e.g. nice-good), but also:
– hypernyms (e.g. car-vehicle)
– co-hyponyms (e.g. car-motorbike)
– antonyms (e.g. good-bad)
– meronyms (e.g. dog-tail)

• Words holding these relations have in fact similar distributions.

(Santus et al., 2014a-c; 2015a-c)

Page 12:

How to Identify and Discriminate Semantic Relations

• Identification of semantic relations (classification) consists in classifying word-pairs according to the semantic relation they hold. The F1 score is generally used to evaluate the accuracy of the algorithm.

• Semantic relation discrimination (relation retrieval) consists in returning a list of word-pairs, sorted according to a score that aims to predict a specific relation. Average Precision is generally used to evaluate the accuracy of the algorithm.
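
As a quick reference for the retrieval setting, here is a minimal sketch of standard Average Precision over a ranked list of word-pairs; it is illustrative, not the authors' evaluation code:

```python
def average_precision(ranked_labels):
    """ranked_labels: booleans, True where the pair holds the target
    relation, ordered from highest to lowest score."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

print(average_precision([True, True, False, False]))   # 1.0   (all relevant pairs on top)
print(average_precision([False, False, True, True]))   # ~0.42 (all at the bottom)
```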

Page 13:

Distributional Semantic Model

• All the experiments described in the next slides are performed on a standard window-based DSM recording co-occurrences with the nearest X content words to the left and right of each target word.
– In most of our experiments, we have used X = 2 or 5, because small windows are most appropriate for paradigmatic relations.

• Co-occurrences were extracted from a combination of the freely available ukWaC and WaCkypedia corpora, and weighted with Local Mutual Information (LMI; Evert, 2005).

(Santus et al., 2014a-c; 2015a-c)
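
To make the setup concrete, here is a toy sketch of a window-based co-occurrence matrix weighted with LMI. It is an illustration under simplifying assumptions (pre-tokenized, content-word-only input), not the authors' pipeline:

```python
import math
from collections import Counter, defaultdict

def build_lmi_dsm(sentences, window=2):
    """Toy window-based DSM: count co-occurrences within +/- `window`
    content words and weight them with Local Mutual Information,
    LMI(w, c) = f(w, c) * PMI(w, c) (Evert, 2005)."""
    cooc = defaultdict(Counter)
    w_marg, c_marg, total = Counter(), Counter(), 0
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[w][tokens[j]] += 1
                    w_marg[w] += 1
                    c_marg[tokens[j]] += 1
                    total += 1
    dsm = defaultdict(dict)
    for w, contexts in cooc.items():
        for c, f in contexts.items():
            expected = w_marg[w] * c_marg[c] / total
            lmi = f * math.log2(f / expected)
            if lmi > 0:  # keep only positively associated contexts
                dsm[w][c] = lmi
    return dsm
```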

Page 14:

NEAR-SYNONYMY

Page 15:

Near-Synonymy: TOEFL & ESL

• Near-synonym: a word having the same or nearly the same meaning as another in the language, as happy, joyful, elated.

• Similarity is the main organizer of the semantic lexicon (Landauer and Dumais, 1997).

• Two common tests for evaluating methods of near-synonymy identification are the TOEFL and the ESL tests.

• These tests consist of several questions: a word is provided and the algorithm should find the most similar one (near-synonym) among four possible choices.
– TOEFL (Test of English as a Foreign Language): 80 questions, with four choices each;
– ESL (English as a Second Language): 50 questions, with four choices each.

(Santus et al., 2016b; 2016d)

Page 16:

APSyn: Hypothesis

• We have developed APSyn (Average Precision for Synonyms), a variation of the Average Precision measure that aims to automatically identify near-synonyms in corpora.

• The measure is based on the hypothesis that:
– Not only do similar words occur in similar contexts, they also tend to share their most relevant contexts.
• E.g. good-nice will share contexts like pretty, very, rather, quite, weather, etc. (SketchEngine: Diffs)

Page 17:

APSyn: Method

• To identify the most related contexts, we decided to rank them according to Local Mutual Information (LMI; Evert, 2005).
– LMI is similar to Pointwise Mutual Information (PMI), but it is not biased towards low-frequency elements.

• In our experiments, after having ranked the contexts, we pick the top N of them, where 100 ≤ N ≤ 1000.

• At this point, the intersection of the top N contexts of the two target words in the word-pair is evaluated and weighted according to the average rank of the shared contexts.

Page 18:

APSyn: Definition

• For every feature f included in the intersection between the top N features of w1 (i.e. N(F1)) and w2 (i.e. N(F2)), APSyn adds 1 divided by the average rank of the feature among the top LMI-ranked features of w1 (i.e. rank1(f)) and w2 (i.e. rank2(f)).

• Expected scores:
– High scores for synonyms
– Low scores or zero for less similar words
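
The formula on this slide was an image; from the verbal description above it can be reconstructed as APSyn(w1, w2) = Σ_{f ∈ N(F1) ∩ N(F2)} 1 / avg(rank1(f), rank2(f)). A minimal Python sketch of this definition (the function name and the 1-based ranking convention are assumptions, not the authors' code):

```python
def apsyn(contexts_w1, contexts_w2, n=100):
    """contexts_w1 / contexts_w2: context lists sorted by decreasing LMI.
    Sums 1 / average-rank over the contexts shared by the two top-n lists."""
    rank1 = {c: r for r, c in enumerate(contexts_w1[:n], start=1)}
    rank2 = {c: r for r, c in enumerate(contexts_w2[:n], start=1)}
    shared = rank1.keys() & rank2.keys()
    return sum(1.0 / ((rank1[c] + rank2[c]) / 2.0) for c in shared)
```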

Page 19:

Experiments

1. Questions were transformed into word-pairs: PROBL_WORD – POSSIB_CHOICE.

2. APSyn scores were assigned to all the word-pairs.

3. In every question, the word-pairs were ranked in decreasing order according to APSyn.

4. If the right answer was ranked first, we added 0.25 times the number of WRONG ANSWERS present in our DSM to the final score.

• BASELINES
– Cosine and co-occurrence baselines are provided for comparison.
– The random baseline is 25%.
– The average non-English US college applicant scores 64.5% on the TOEFL.

Page 20:

Discussion

• APSyn, without any optimization, is still not as good as the state of the art:
– 100% on TOEFL and 82% on ESL.

• However, it is:
– completely unsupervised (therefore applicable to other languages);
– linguistically grounded (therefore it captures some linguistic properties).

• And it performs:
– better than the random baseline, the vector cosine and the co-occurrence baseline on our DSM;
– very similarly to foreign students taking the TOEFL test.

• The value of N:
– the smaller N (close to 100), the better the performance of APSyn.

• This is probably due to the fact that when N is too big, contexts beyond the most relevant ones are also considered.

• In order to optimize performance, N can be learnt from a training set.

(Santus et al., 2016b; 2016d)

Page 21:

ANTONYMY  

Page 22:

Antonymy: Importance & Definition

• Antonymy:
– is one of the main relations shaping the organization of semantic memory (together with near-synonymy and hypernymy);
– although it is essential for many NLP tasks (e.g. MT, SA, etc.), current approaches to antonymy identification are still weak at discriminating antonyms from synonyms.

• Antonymy is in fact:
– similar to synonymy in many respects (e.g. distributional behavior);
– hard to define:
• there are many subtypes of antonymy;
• even native speakers of a language do not always agree on classifying word-pairs as antonyms.

(Mohammad et al., 2008; 2013)

Page 23:

Antonymy: Definition

• Over the years, scholars from different disciplines have tried to:
– define antonymy;
– classify the different subtypes of antonymy.

• Kempson (1977) defined antonyms as word-pairs with a “binary incompatible relation”, such that the presence of one meaning entails the absence of the other.
• giant – dwarf vs. giant – person

• Cruse (1986) identified an important property of antonymy and called it the paradox of simultaneous similarity and difference between antonyms:
– Antonyms are similar in every dimension of meaning except in a specific one.
• giant = dwarf, except for size (big vs. small)

Page 24:

Antonymy: Co-Occurrence Hypothesis

• Most of the unsupervised work on antonymy identification is based on the co-occurrence hypothesis:
– antonyms co-occur in the same sentence more often than expected by chance (e.g. in coordinate contexts of the form A and/or B).
• Do you prefer meat or vegetables?

• Shortcoming: other semantic relations are also characterized by this property (e.g. co-hyponyms, near-synonyms).
• Do you prefer a dog or a cat?
• Is she only pretty or wonderful?

(Santus et al., 2014b-c)

Page 25:

APAnt: Hypothesis

• If we consider the paradox of simultaneous similarity and difference between antonyms, we have the following distributional correlate:

MEANING → DISTRIBUTIONAL BEHAVIOUR
SYNONYMS: similar in every dimension → similar distributional behaviors
ANTONYMS: similar in every dimension except one → ???

• We can fill the empty field with:
– “Similar distributional behaviors except for one dimension of meaning”.

• Since giant and dwarf are similar in every dimension of meaning except for the one related to size, they occur in similar contexts, except for those related to that dimension.

• We can also assume that the dimension of meaning in which they differ is a salient one, and, by consequence, that they will behave distributionally differently in their most relevant contexts.

• Size is a salient dimension for both giant and dwarf, and they are expected to have a different distributional behavior for this dimension of meaning (i.e. big vs. small).

Page 26:

APAnt: Method

• APAnt (Average Precision for Antonyms) is defined as the inverse of APSyn (Santus et al., 2014b-c; 2015b-c).
– Note:
• 1/vector cosine = “no distributional similarity”
• while 1/APSyn = “not sharing the most salient contexts”

• Expected scores:
– High scores for words not sharing many top contexts (antonyms or unrelated words)
– Low scores for words sharing many top contexts (near-synonyms)

• Recall APSyn: for every feature f included in the intersection between the top N features of w1 (i.e. N(F1)) and w2 (i.e. N(F2)), APSyn adds 1 divided by the average rank of the feature among the top LMI-ranked features of w1 (i.e. rank1(f)) and w2 (i.e. rank2(f)).
– High scores for synonyms
– Low scores or zero for less similar words
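
Following the "inverse of APSyn" definition, a minimal sketch reusing the hypothetical apsyn function from the APSyn slide could look as follows; how the empty-intersection case is handled is an assumption here, not part of the original definition:

```python
def apant(contexts_w1, contexts_w2, n=100):
    """Inverse of APSyn: pairs that share few of their top-n LMI-ranked
    contexts (antonyms, unrelated words) receive high scores."""
    score = apsyn(contexts_w1, contexts_w2, n)
    return float("inf") if score == 0 else 1.0 / score
```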

Page 27:

APAnt: Evaluation

• We performed several antonym retrieval experiments to evaluate APAnt.

• For the evaluation, we relied on three main datasets, which contain word-pairs labeled with the semantic relations they hold:
– BLESS (Baroni and Lenci, 2011)
• Hypernyms, Co-Hyponyms, Meronyms, etc.
– Lenci/Benotto (Santus et al., 2014b-c)
• Antonyms, Synonyms and Hypernyms
– EVALution 1.0 (Santus et al., 2015a)
• Hypernyms, Meronyms, Synonyms, Antonyms, etc.

• DSM: 2 content words on the left and the right.

Page 28:

APAnt: Experiment 1 (Information Retrieval)

• APAnt scores were assigned to all the word-pairs in the dataset.

• Word-pairs were ranked in decreasing order, first by the first word in the pair and then by the APAnt value.

• Average Precision is used to evaluate the ranking (Kotlerman et al., 2010). It returns a value between 0 and 1, where 0 is returned if all antonyms are at the bottom, and 1 if they are all at the top.

• Results for the Lenci/Benotto dataset (2,232 word-pairs), and by POS, are provided.

• BASELINES
– Vector cosine and co-occurrence baselines.

(Santus et al., 2014b-c)

Page 29:

APAnt: Discussion

• The evaluation was performed on:
– 2,232 word-pairs
• about 50% antonyms
• about 50% synonyms

• APAnt outperforms the vector cosine and co-occurrence baselines on the full dataset.
– The co-occurrence and cosine baselines promote synonyms.

• APAnt also outperforms the vector cosine and co-occurrence baselines for the different POS:
– Best results are obtained for NOUNS
– Worst results are obtained for ADJECTIVES
• It is in fact likely that opposite adjectives share their main contexts more than nouns do (e.g. cold/hot can be used to describe the same entity, while giant/dwarf cannot).

• N = 100 is the best value in our settings.

Page 30:

HYPERNYMY  

Page 31:

Hypernymy: Hypothesis

• Another measure, this one for the identification of hypernyms, was proposed in Santus et al. (2014a): SLQS.

• Given a word-pair, SLQS evaluates the generality of the N most related contexts of the two words, under the hypothesis that:
– hypernyms tend to occur in more general contexts (e.g. animal → eat) than hyponyms (e.g. dog → bark).

• Generality is evaluated in terms of the median Shannon entropy of the N most related contexts: the higher the median entropy, the more general the word is considered.

Page 32:

SLQS: Method

• The N most LMI-related contexts of both words are selected (100 ≤ N ≤ 250).

• For each context, we calculate its entropy (Shannon, 1948).

• Then, for each word, we pick the median entropy among its N most LMI-related contexts.

• And we finally calculate SLQS according to the following formula (see the reconstruction after this list).

• Expected results:
– SLQS = 0 if the words in the pair have similar generality
– SLQS > 0 if w2 is more general
– SLQS < 0 if w2 is less general
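
The three formulas on this slide were images; following the verbal description (and Santus et al., 2014a), they can plausibly be reconstructed as:

```latex
% Entropy of a context c over the words w it co-occurs with
H(c) = -\sum_{w} p(w \mid c)\, \log_2 p(w \mid c)

% Generality of a word w_i: median entropy of its N top LMI-ranked contexts
E_{w_i} = \operatorname*{median}_{c \,\in\, N(F_i)} H(c)

% SLQS compares the generality of the two words in the pair
\mathrm{SLQS}(w_1, w_2) = 1 - \frac{E_{w_1}}{E_{w_2}}
```

With this form the expected results above follow: equal medians give 0, a more general w2 (higher E_{w2}) gives a positive score, and a less general w2 gives a negative one.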

Page 33:

SLQS: Experiment 1

• Task: identify the directionality of the pair.

• DSM: 2-word window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 hypernym pairs
• Note: all of them are in hyponym-hypernym order; therefore we expect SLQS > 0.

• Results:
– SLQS obtains 87% precision, outperforming both WeedsPrec, which is based on the inclusion hypothesis, and the frequency baselines.

(Santus et al., 2014)

Page 34:

SLQS: Experiment 2

• Task: Information Retrieval.
– Given hypernyms, coordinates, meronyms and randoms, score them in such a way that the hypernyms are ranked at the top.

• DSM: 2-word window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 → hypernyms, coordinates, meronyms and randoms.

• We combined SLQS and the cosine, as they respectively capture generality and similarity.

• Results:
– SLQS*Cosine obtains 59% AP (Kotlerman et al., 2010), outperforming WeedsPrec, which is based on the inclusion hypothesis, as well as the cosine and frequency baselines.

(Santus et al., 2014)

Page 35:

HYPERNYMY, CO-HYPONYMY AND RANDOMS

Page 36:

ROOT13: A Supervised Method

• Task: classification of Hypernyms, Co-Hyponyms and Randoms
• Classifier: Random Forest

• Features (a sketch of this feature set follows the slide):
– Cosine, co-occurrence frequency, frequency of w1 and w2, entropy of w1 and w2 (Turney and Pantel, 2010; Shannon, 1948)
– Shared: size of the intersection between the top 1k associated contexts of the two terms, according to the LMI score (Evert, 2005)
– APSyn: for every context in the intersection between the top 1k associated contexts of the two terms, this measure adds 1 divided by its average rank in the term-context lists (Santus et al., 2014b)
– Diff Freqs: difference between the terms' frequencies
– Diff Entrs: difference between the terms' entropies
– C-Freq 1, 2: two features storing the average frequency among the top 1k associated contexts of each term
– C-Entr 1, 2: two features storing the average entropy among the top 1k associated contexts of each term (Shannon, 1948)

• Dataset: combination of BLESS, Lenci/Benotto and EVALution 1.0
– 9,600 pairs: 33% Hypernyms, 33% Co-Hyponyms and 33% Randoms
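
A minimal sketch of how such a feature set could feed a Random Forest, assuming a hypothetical dsm object that exposes the needed statistics and the apsyn sketch from earlier; this is an illustration, not the authors' ROOT13 implementation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def pair_features(w1, w2, dsm, k=1000):
    """Feature vector for a word pair, mirroring the list on the slide.
    The dsm helpers (cosine, cooccurrence, frequency, entropy, top_contexts,
    avg_context_frequency, avg_context_entropy) are hypothetical."""
    top1, top2 = dsm.top_contexts(w1, k), dsm.top_contexts(w2, k)
    return [
        dsm.cosine(w1, w2),
        dsm.cooccurrence(w1, w2),
        dsm.frequency(w1), dsm.frequency(w2),
        dsm.entropy(w1), dsm.entropy(w2),
        len(set(top1) & set(top2)),               # Shared
        apsyn(top1, top2, n=k),                   # APSyn (earlier sketch)
        dsm.frequency(w1) - dsm.frequency(w2),    # Diff Freqs
        dsm.entropy(w1) - dsm.entropy(w2),        # Diff Entrs
        dsm.avg_context_frequency(w1, k),         # C-Freq 1
        dsm.avg_context_frequency(w2, k),         # C-Freq 2
        dsm.avg_context_entropy(w1, k),           # C-Entr 1
        dsm.avg_context_entropy(w2, k),           # C-Entr 2
    ]

# X = [pair_features(w1, w2, dsm) for (w1, w2) in labeled_pairs]
# y = labels  # HYPER / COORD / RANDOM
# clf = RandomForestClassifier(n_estimators=100)
# print(cross_val_score(clf, X, y, scoring="f1_macro").mean())
```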

Page 37:

ROOT13: Experiment 1

• Baseline: Cosine
• Accuracy: measured with F1

• Three classes: 88.3% vs. 57.6%
• Hyper-Coord: 93.4% vs. 60.2%
• Hyper-Random: 92.3% vs. 65.5%
• Coord-Random: 97.3% vs. 81.5%

(Santus et al., 2016)

Page 38:

CONCLUSIONS  

Page 39:

Conclusions

• Extracting properties from the most related contexts seems to provide important information about semantic relations (entropy, intersection, frequency, etc.).

• APSyn, APAnt, SLQS and ROOT13 are all methods that try to investigate and combine such properties.

• The first three methods have obtained good results in the tasks we have performed, without any particular optimization. Moreover, they are:
– Unsupervised (and therefore applicable to several languages)
– Linguistically grounded (they tell us something about word usage)

• Up to now, we have identified the most related contexts with Local Mutual Information (Evert, 2005), but it is likely that we will start using Positive Pointwise Mutual Information, as most of the literature uses it.

• We are currently developing a system to automatically extract as many statistical properties as possible, evaluating their correlation with semantic relations.

• Briefly: Not All Contexts Are Equal

Page 40:

Thank you

Enrico Santus – The Hong Kong Polytechnic University
Alessandro Lenci – University of Pisa
Qin Lu – The Hong Kong Polytechnic University
Sabine Schulte im Walde – Institute for Natural Language Processing, University of Stuttgart
Frances Yung – Nara Institute of Science and Technology