Not All Contexts Are Equal: Automatic Identification of Antonyms, Hypernyms, Co-Hyponyms, and Synonyms in DSMs
Enrico Santus, Qin Lu, Alessandro Lenci and Chu-Ren Huang


TRANSCRIPT

Page 1:

Not All Contexts Are Equal: Automatic Identification of Antonyms, Hypernyms, Co-Hyponyms, and Synonyms in DSMs

Enrico Santus, Qin Lu, Alessandro Lenci and Chu-Ren Huang

Page 2:

Modeling Human Language Ability

• Over the last decades, NLP has achieved impressive progress in modeling human language ability, developing a large number of applications:
– Information Retrieval (IR)
– Information Extraction (IE)
– Question Answering (QA)
– Machine Translation (MT)
– Others…

Page 3:

The Need for Resources

• These applications were improved not only by bettering the algorithms, but also through the use of better lexical resources and ontologies (Lenci, 2008a):
– WordNet
– SUMO
– DOLCE
– ConceptNet
– Others…

Page 4:

Automatic Creation of Resources

• As the relevance of these resources has grown, systems for their automatic creation have assumed a key role in NLP.

• Handmade resources are in fact:
– Arbitrary
– Expensive to create
– Time-consuming
– Difficult to keep updated

Page 5:

Semantic Relations as Building Blocks

• Entities and relations have been identified as the main building blocks of these resources (Herger, 2014).

• NLP has focused on methods for the automatic extraction and representation of entities and relations, in order to:
– Increase the effectiveness of such resources;
– Reduce the costs of development and updates.

• Yet, we are still far from achieving satisfactory results.

Page 6:

Semantic Relations Identification and Discrimination in DSMs

• The distributional approach was chosen because it is:
– Completely unsupervised (Turney and Pantel, 2010);
– Portable to any language for which large corpora can be collected (ibid.);
– Applicable to a large range of tasks (ibid.);
– Cognitively plausible (Lenci, 2008b);
– Strong at identifying similarity (ibid.).

Page 7:

From Syntagmatic to Paradigmatic Relations

• The main semantic relations (i.e. synonymy, antonymy, hypernymy, meronymy) are also called paradigmatic semantic relations.

• Paradigmatic relations are concerned with the possibility of substitution in the same syntagmatic contexts.

• They should be considered in opposition to syntagmatic relations, which are instead concerned with the position in the sentence (syntagm).

(de Saussure, 1916)

Page 8:

Distributional Semantics

• Distributional Semantics can be used to derive the paradigmatic relations from syntagmatic ones.

• It relies on the Distributional Hypothesis (Harris, 1954), according to which:
1. At least some aspects of the meaning of a linguistic expression depend on its distribution in contexts;
2. The degree of similarity between two linguistic expressions is a function of the similarity of the contexts in which they occur.

Page 9:

Vector Space Models

• Starting from the Distributional Hypothesis, computational models have been developed that represent words as vectors, whose dimensions contain the Strength of Association (SoA) with the contexts.
– The SoA is generally the co-occurrence frequency or the mutual information (PMI, PPMI, LMI, etc.)
– Contexts may be single words within a window or within a syntactic structure, pairs of words, etc.

• Words are therefore represented spatially, and their meaning is given by their proximity to other vectors in such a vector space, often also referred to as a semantic space (Turney and Pantel, 2010).
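
For reference (not part of the original slide), the association measures listed above are commonly defined as follows, with LMI following Evert (2005):

```latex
\mathrm{PMI}(w,c)  = \log_2 \frac{P(w,c)}{P(w)\,P(c)}
\qquad
\mathrm{PPMI}(w,c) = \max\big(0,\ \mathrm{PMI}(w,c)\big)
\qquad
\mathrm{LMI}(w,c)  = f(w,c)\cdot \mathrm{PMI}(w,c)
```

where f(w, c) is the observed co-occurrence frequency of the word w with the context c.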

Page 10:

Distributional Semantic Models: Similarity as Proximity

• DSMs are known for their ability to identify semantically similar lexemes.

• The vector cosine is generally used: it returns a value between 0 and 1, where 0 means paradigmatically totally unrelated and 1 means distributionally identical.

(Santus et al., 2014a-c)
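
The cosine formula shown on the slide did not survive extraction; the standard definition, which stays in [0, 1] for the non-negative LMI-weighted vectors used here, is:

```latex
\cos(\vec{w}_1, \vec{w}_2) =
\frac{\sum_{i} w_{1,i}\, w_{2,i}}
     {\sqrt{\sum_{i} w_{1,i}^{2}}\ \sqrt{\sum_{i} w_{2,i}^{2}}}
```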

Page 11:

Shortcomings of DSMs: Semantic Relations

• Unfortunately the definition of distributional similarity is so loose that under its umbrella fall not only near-synonyms (e.g. nice-good), but also:
– hypernyms (e.g. car-vehicle)
– co-hyponyms (e.g. car-motorbike)
– antonyms (e.g. good-bad)
– meronyms (e.g. dog-tail)

• Words holding these relations have in fact similar distributions.

(Santus et al., 2014a-c; 2015a-c)

Page 12:

How to Identify and Discriminate Semantic Relations

• Identification of semantic relations (classification) consists in classifying word-pairs according to the semantic relation they hold. The F1 score is generally used to evaluate the accuracy of the algorithm.

• Semantic relation discrimination (relation retrieval) consists in returning a list of word-pairs, sorted according to a score that aims to predict a specific relation. Average Precision is generally used to evaluate the accuracy of the algorithm.
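
As a quick reference for the retrieval setting, here is a minimal sketch of standard Average Precision over a ranked list of word-pairs; it is illustrative, not the authors' evaluation code:

```python
def average_precision(ranked_labels):
    """ranked_labels: booleans, True where the pair holds the target
    relation, ordered from highest to lowest score."""
    hits, precision_sum = 0, 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0

print(average_precision([True, True, False, False]))   # 1.0   (all relevant pairs on top)
print(average_precision([False, False, True, True]))   # ~0.42 (all at the bottom)
```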

Page 13:

Distributional Semantic Model

• All the experiments described in the next slides are performed on a standard window-based DSM recording co-occurrences with the nearest X content words to the left and right of each target word.
– In most of our experiments, we have used X = 2 or 5, because small windows are most appropriate for paradigmatic relations.

• Co-occurrences were extracted from a combination of the freely available ukWaC and WaCkypedia corpora, and weighted with Local Mutual Information (LMI; Evert, 2005).

(Santus et al., 2014a-c; 2015a-c)
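
To make the setup concrete, here is a toy sketch of a window-based co-occurrence matrix weighted with LMI. It is an illustration under simplifying assumptions (pre-tokenized, content-word-only input), not the authors' pipeline:

```python
import math
from collections import Counter, defaultdict

def build_lmi_dsm(sentences, window=2):
    """Toy window-based DSM: count co-occurrences within +/- `window`
    content words and weight them with Local Mutual Information,
    LMI(w, c) = f(w, c) * PMI(w, c) (Evert, 2005)."""
    cooc = defaultdict(Counter)
    w_marg, c_marg, total = Counter(), Counter(), 0
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    cooc[w][tokens[j]] += 1
                    w_marg[w] += 1
                    c_marg[tokens[j]] += 1
                    total += 1
    dsm = defaultdict(dict)
    for w, contexts in cooc.items():
        for c, f in contexts.items():
            expected = w_marg[w] * c_marg[c] / total
            lmi = f * math.log2(f / expected)
            if lmi > 0:  # keep only positively associated contexts
                dsm[w][c] = lmi
    return dsm
```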

Page 14:

NEAR-SYNONYMY

Page 15:

Near-Synonymy: TOEFL & ESL

• Near-synonym: a word having the same or nearly the same meaning as another in the language, as happy, joyful, elated.

• Similarity is the main organizer of the semantic lexicon (Landauer and Dumais, 1997).

• Two common tests for evaluating methods of near-synonymy identification are the TOEFL and the ESL tests.

• These tests consist of several questions: a word is provided and the algorithm should find the most similar one (near-synonym) among four possible choices.
– TOEFL (Test of English as a Foreign Language): 80 questions, with four choices each;
– ESL (English as a Second Language): 50 questions, with four choices each.

(Santus et al., 2016b; 2016d)

Page 16:

APSyn: Hypothesis

• We have developed APSyn (Average Precision for Synonyms), a variation of the Average Precision measure that aims to automatically identify near-synonyms in corpora.

• The measure is based on the hypothesis that:
– Not only do similar words occur in similar contexts, they also tend to share their most relevant contexts.
• E.g. good-nice will share contexts like pretty, very, rather, quite, weather, etc. (SketchEngine: Diffs)

Page 17:

APSyn: Method

• To identify the most related contexts, we decided to rank them according to Local Mutual Information (LMI; Evert, 2005).
– LMI is similar to Pointwise Mutual Information (PMI), but it is not biased towards low-frequency elements.

• In our experiments, after having ranked the contexts, we pick the top N of them, where 100 ≤ N ≤ 1000.

• At this point, the intersection of the top N contexts of the two target words in the word-pair is evaluated and weighted according to the average rank of the shared contexts.

Page 18:

APSyn: Definition

• For every feature f included in the intersection between the top N features of w1 (i.e. N(F1)) and w2 (i.e. N(F2)), APSyn adds 1 divided by the average rank of the feature among the top LMI-ranked features of w1 (i.e. rank1(f)) and w2 (i.e. rank2(f)).

• Expected scores:
– High scores for synonyms
– Low scores or zero for less similar words
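
The formula on this slide was an image; from the verbal description above it can be reconstructed as APSyn(w1, w2) = Σ_{f ∈ N(F1) ∩ N(F2)} 1 / avg(rank1(f), rank2(f)). A minimal Python sketch of this definition (the function name and the 1-based ranking convention are assumptions, not the authors' code):

```python
def apsyn(contexts_w1, contexts_w2, n=100):
    """contexts_w1 / contexts_w2: context lists sorted by decreasing LMI.
    Sums 1 / average-rank over the contexts shared by the two top-n lists."""
    rank1 = {c: r for r, c in enumerate(contexts_w1[:n], start=1)}
    rank2 = {c: r for r, c in enumerate(contexts_w2[:n], start=1)}
    shared = rank1.keys() & rank2.keys()
    return sum(1.0 / ((rank1[c] + rank2[c]) / 2.0) for c in shared)
```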

Page 19:

Experiments

1. Questions were transformed into word-pairs: PROBL_WORD – POSSIB_CHOICE.

2. APSyn scores were assigned to all the word-pairs.

3. In every question, the word-pairs were ranked in decreasing order according to APSyn.

4. If the right answer was ranked first, we added 0.25 times the number of WRONG ANSWERS present in our DSM to the final score.

• BASELINES
– Cosine and co-occurrence baselines are provided for comparison.
– The random baseline is 25%.
– The average non-English US college applicant scores 64.5% on the TOEFL.

Page 20:

Discussion

• APSyn, without any optimization, is still not as good as the state of the art:
– 100% on TOEFL and 82% on ESL.

• However, it is:
– completely unsupervised (therefore applicable to other languages);
– linguistically grounded (therefore it captures some linguistic properties).

• And it performs:
– better than the random baseline, the vector cosine and the co-occurrence baseline on our DSM;
– very similarly to foreign students taking the TOEFL test.

• The value of N:
– the smaller N (close to 100), the better the performance of APSyn.

• This is probably due to the fact that when N is too big, contexts beyond the most relevant ones are also considered.

• In order to optimize performance, N can be learnt from a training set.

(Santus et al., 2016b; 2016d)

Page 21:

ANTONYMY  

Page 22:

Antonymy: Importance & Definition

• Antonymy:
– is one of the main relations shaping the organization of semantic memory (together with near-synonymy and hypernymy);
– although it is essential for many NLP tasks (e.g. MT, SA, etc.), current approaches to antonymy identification are still weak at discriminating antonyms from synonyms.

• Antonymy is in fact:
– similar to synonymy in many respects (e.g. distributional behavior);
– hard to define:
• there are many subtypes of antonymy;
• even native speakers of a language do not always agree on classifying word-pairs as antonyms.

(Mohammad et al., 2008; 2013)

Page 23:

Antonymy: Definition

• Over the years, scholars from different disciplines have tried to:
– define antonymy;
– classify the different subtypes of antonymy.

• Kempson (1977) defined antonyms as word-pairs with a “binary incompatible relation”, such that the presence of one meaning entails the absence of the other.
• giant – dwarf vs. giant – person

• Cruse (1986) identified an important property of antonymy and called it the paradox of simultaneous similarity and difference between antonyms:
– Antonyms are similar in every dimension of meaning except in a specific one.
• giant = dwarf, except for size (big vs. small)

Page 24:

Antonymy: Co-Occurrence Hypothesis

• Most of the unsupervised work on antonymy identification is based on the co-occurrence hypothesis:
– antonyms co-occur in the same sentence more often than expected by chance (e.g. in coordinate contexts of the form A and/or B).
• Do you prefer meat or vegetables?

• Shortcoming: other semantic relations are also characterized by this property (e.g. co-hyponyms, near-synonyms).
• Do you prefer a dog or a cat?
• Is she only pretty or wonderful?

(Santus et al., 2014b-c)

Page 25:

APAnt: Hypothesis

• If we consider the paradox of simultaneous similarity and difference between antonyms, we have the following distributional correlate:

MEANING → DISTRIBUTIONAL BEHAVIOUR
SYNONYMS: similar in every dimension → similar distributional behaviors
ANTONYMS: similar in every dimension except one → ???

• We can fill the empty field with:
– “Similar distributional behaviors except for one dimension of meaning”.

• Since giant and dwarf are similar in every dimension of meaning except for the one related to size, they occur in similar contexts, except for those related to that dimension.

• We can also assume that the dimension of meaning in which they differ is a salient one, and, by consequence, that they will behave distributionally differently in their most relevant contexts.

• Size is a salient dimension for both giant and dwarf, and they are expected to have a different distributional behavior for this dimension of meaning (i.e. big vs. small).

Page 26:

APAnt: Method

• APAnt (Average Precision for Antonyms) is defined as the inverse of APSyn (Santus et al., 2014b-c; 2015b-c).
– Note:
• 1/vector cosine = “no distributional similarity”
• while 1/APSyn = “not sharing the most salient contexts”

• Expected scores:
– High scores for words not sharing many top contexts (antonyms or unrelated words)
– Low scores for words sharing many top contexts (near-synonyms)

• Recall APSyn: for every feature f included in the intersection between the top N features of w1 (i.e. N(F1)) and w2 (i.e. N(F2)), APSyn adds 1 divided by the average rank of the feature among the top LMI-ranked features of w1 (i.e. rank1(f)) and w2 (i.e. rank2(f)).
– High scores for synonyms
– Low scores or zero for less similar words
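
Following the "inverse of APSyn" definition, a minimal sketch reusing the hypothetical apsyn function from the APSyn slide could look as follows; how the empty-intersection case is handled is an assumption here, not part of the original definition:

```python
def apant(contexts_w1, contexts_w2, n=100):
    """Inverse of APSyn: pairs that share few of their top-n LMI-ranked
    contexts (antonyms, unrelated words) receive high scores."""
    score = apsyn(contexts_w1, contexts_w2, n)
    return float("inf") if score == 0 else 1.0 / score
```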

Page 27:

APAnt: Evaluation

• We performed several antonym retrieval experiments to evaluate APAnt.

• For the evaluation, we relied on three main datasets, which contain word-pairs labeled with the semantic relations they hold:
– BLESS (Baroni and Lenci, 2011)
• Hypernyms, Co-Hyponyms, Meronyms, etc.
– Lenci/Benotto (Santus et al., 2014b-c)
• Antonyms, Synonyms and Hypernyms
– EVALution 1.0 (Santus et al., 2015a)
• Hypernyms, Meronyms, Synonyms, Antonyms, etc.

• DSM: 2 content words on the left and the right.

Page 28:

APAnt: Experiment 1 (Information Retrieval)

• APAnt scores were assigned to all the word-pairs in the dataset.

• Word-pairs were ranked in decreasing order, first by the first word in the pair and then by the APAnt value.

• Average Precision is used to evaluate the ranking (Kotlerman et al., 2010). It returns a value between 0 and 1, where 0 is returned if all antonyms are at the bottom, and 1 if they are all at the top.

• Results for the Lenci/Benotto dataset (2,232 word-pairs), and by POS, are provided.

• BASELINES
– Vector cosine and co-occurrence baselines.

(Santus et al., 2014b-c)

Page 29:

APAnt: Discussion

• The evaluation was performed on:
– 2,232 word-pairs
• about 50% antonyms
• about 50% synonyms

• APAnt outperforms the vector cosine and co-occurrence baselines on the full dataset.
– The co-occurrence and cosine baselines promote synonyms.

• APAnt also outperforms the vector cosine and co-occurrence baselines for the different POS:
– Best results are obtained for NOUNS
– Worst results are obtained for ADJECTIVES
• It is in fact likely that opposite adjectives share their main contexts more than nouns do (e.g. cold/hot can be used to describe the same entity, while giant/dwarf cannot).

• N = 100 is the best value in our settings.

Page 30:

HYPERNYMY  

Page 31:

Hypernymy: Hypothesis

• Another measure, this one for the identification of hypernyms, was proposed in Santus et al. (2014a): SLQS.

• Given a word-pair, SLQS evaluates the generality of the N most related contexts of the two words, under the hypothesis that:
– hypernyms tend to occur in more general contexts (e.g. animal → eat) than hyponyms (e.g. dog → bark).

• Generality is evaluated in terms of the median Shannon entropy of the N most related contexts: the higher the median entropy, the more general the word is considered.

Page 32:

SLQS: Method

• The N most LMI-related contexts of both words are selected (100 ≤ N ≤ 250).

• For each context, we calculate its entropy (Shannon, 1948).

• Then, for each word, we pick the median entropy among its N most LMI-related contexts.

• And we finally calculate SLQS according to the following formula (see the reconstruction after this list).

• Expected results:
– SLQS = 0 if the words in the pair have similar generality
– SLQS > 0 if w2 is more general
– SLQS < 0 if w2 is less general
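
The three formulas on this slide were images; following the verbal description (and Santus et al., 2014a), they can plausibly be reconstructed as:

```latex
% Entropy of a context c over the words w it co-occurs with
H(c) = -\sum_{w} p(w \mid c)\, \log_2 p(w \mid c)

% Generality of a word w_i: median entropy of its N top LMI-ranked contexts
E_{w_i} = \operatorname*{median}_{c \,\in\, N(F_i)} H(c)

% SLQS compares the generality of the two words in the pair
\mathrm{SLQS}(w_1, w_2) = 1 - \frac{E_{w_1}}{E_{w_2}}
```

With this form the expected results above follow: equal medians give 0, a more general w2 (higher E_{w2}) gives a positive score, and a less general w2 gives a negative one.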

Page 33:

SLQS: Experiment 1

• Task: identify the directionality of the pair.

• DSM: 2-word window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 hypernym pairs
• Note: all of them are in hyponym-hypernym order; therefore we expect SLQS > 0.

• Results:
– SLQS obtains 87% precision, outperforming both WeedsPrec, which is based on the inclusion hypothesis, and the frequency baselines.

(Santus et al., 2014)

Page 34:

SLQS: Experiment 2

• Task: Information Retrieval.
– Given hypernyms, coordinates, meronyms and randoms, score them in such a way that the hypernyms are ranked at the top.

• DSM: 2-word window
• Dataset: BLESS (Baroni and Lenci, 2011)
– 1,277 → hypernyms, coordinates, meronyms and randoms.

• We combined SLQS and the cosine, as they respectively capture generality and similarity.

• Results:
– SLQS*Cosine obtains 59% AP (Kotlerman et al., 2010), outperforming WeedsPrec, which is based on the inclusion hypothesis, as well as the cosine and frequency baselines.

(Santus et al., 2014)

Page 35:

HYPERNYMY, CO-HYPONYMY AND RANDOMS

Page 36:

ROOT13: A Supervised Method

• Task: classification of Hypernyms, Co-Hyponyms and Randoms
• Classifier: Random Forest

• Features (a sketch of this feature set follows the slide):
– Cosine, co-occurrence frequency, frequency of w1 and w2, entropy of w1 and w2 (Turney and Pantel, 2010; Shannon, 1948)
– Shared: size of the intersection between the top 1k associated contexts of the two terms, according to the LMI score (Evert, 2005)
– APSyn: for every context in the intersection between the top 1k associated contexts of the two terms, this measure adds 1 divided by its average rank in the term-context lists (Santus et al., 2014b)
– Diff Freqs: difference between the terms' frequencies
– Diff Entrs: difference between the terms' entropies
– C-Freq 1, 2: two features storing the average frequency among the top 1k associated contexts of each term
– C-Entr 1, 2: two features storing the average entropy among the top 1k associated contexts of each term (Shannon, 1948)

• Dataset: combination of BLESS, Lenci/Benotto and EVALution 1.0
– 9,600 pairs: 33% Hypernyms, 33% Co-Hyponyms and 33% Randoms
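
A minimal sketch of how such a feature set could feed a Random Forest, assuming a hypothetical dsm object that exposes the needed statistics and the apsyn sketch from earlier; this is an illustration, not the authors' ROOT13 implementation:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def pair_features(w1, w2, dsm, k=1000):
    """Feature vector for a word pair, mirroring the list on the slide.
    The dsm helpers (cosine, cooccurrence, frequency, entropy, top_contexts,
    avg_context_frequency, avg_context_entropy) are hypothetical."""
    top1, top2 = dsm.top_contexts(w1, k), dsm.top_contexts(w2, k)
    return [
        dsm.cosine(w1, w2),
        dsm.cooccurrence(w1, w2),
        dsm.frequency(w1), dsm.frequency(w2),
        dsm.entropy(w1), dsm.entropy(w2),
        len(set(top1) & set(top2)),               # Shared
        apsyn(top1, top2, n=k),                   # APSyn (earlier sketch)
        dsm.frequency(w1) - dsm.frequency(w2),    # Diff Freqs
        dsm.entropy(w1) - dsm.entropy(w2),        # Diff Entrs
        dsm.avg_context_frequency(w1, k),         # C-Freq 1
        dsm.avg_context_frequency(w2, k),         # C-Freq 2
        dsm.avg_context_entropy(w1, k),           # C-Entr 1
        dsm.avg_context_entropy(w2, k),           # C-Entr 2
    ]

# X = [pair_features(w1, w2, dsm) for (w1, w2) in labeled_pairs]
# y = labels  # HYPER / COORD / RANDOM
# clf = RandomForestClassifier(n_estimators=100)
# print(cross_val_score(clf, X, y, scoring="f1_macro").mean())
```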

Page 37:

ROOT13: Experiment 1

• Baseline: Cosine
• Accuracy: measured with F1

• Three classes: 88.3% vs. 57.6%
• Hyper-Coord: 93.4% vs. 60.2%
• Hyper-Random: 92.3% vs. 65.5%
• Coord-Random: 97.3% vs. 81.5%

(Santus et al., 2016)

Page 38:

CONCLUSIONS  

Page 39:

Conclusions

• Extracting properties from the most related contexts seems to provide important information about semantic relations (entropy, intersection, frequency, etc.).

• APSyn, APAnt, SLQS and ROOT13 are all methods that try to investigate and combine such properties.

• The first three methods have obtained good results in the tasks we have performed, without any particular optimization. Moreover, they are:
– Unsupervised (and therefore applicable to several languages)
– Linguistically grounded (they tell us something about word usage)

• Up to now, we have identified the most related contexts with Local Mutual Information (Evert, 2005), but it is likely that we will start using Positive Pointwise Mutual Information, as most of the literature uses it.

• We are currently developing a system to automatically extract as many statistical properties as possible, evaluating their correlation with semantic relations.

• Briefly: Not All Contexts Are Equal

Page 40:

Thank you

Enrico Santus – The Hong Kong Polytechnic University
Alessandro Lenci – University of Pisa
Qin Lu – The Hong Kong Polytechnic University
Sabine Schulte im Walde – Institute for Natural Language Processing, University of Stuttgart
Frances Yung – Nara Institute of Science and Technology