reciclagem: exploring portuguese lexical...

48
Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo Gon¸calo Oliveira, Ana Oliveira Alves, Ricardo Rodrigues {hroliv,ana,rmanuel}@dei.uc.pt CISUC, Department of Informatics Engineering University of Coimbra, Portugal ASSIN@PROPOR 2016 Tomar, 13 July, 2016 [email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 1 / 20

Upload: others

Post on 22-Aug-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Reciclagem: Exploring Portuguese LexicalKnowledge-Bases in the ASSIN Task

Hugo Goncalo Oliveira, Ana Oliveira Alves, Ricardo Rodrigues

{hroliv,ana,rmanuel}@dei.uc.pt

CISUC, Department of Informatics EngineeringUniversity of Coimbra, Portugal

ASSIN@PROPOR 2016Tomar, 13 July, 2016

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 1 / 20

Page 2: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Introduction

Unsupervised approach to the ASSIN task

Rely exclusively on the exploitation of external sources oflexical-semantic knowledge

Heuristics based on known semantic relations

Instead of selecting a single knowledge base, test several and theircombination

Two main goals:1 Test whether an unsupervised approach is enough to compute

semantic similarity

For English, knowledge-based approaches to other tasks rival withunsupervised approaches (e.g. WSD)

2 Indirect comparison of a set of open Portuguese lexical knowledgebases using ASSIN as a benchmark

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 2 / 20

Page 3: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Introduction

Unsupervised approach to the ASSIN task

Rely exclusively on the exploitation of external sources oflexical-semantic knowledge

Heuristics based on known semantic relations

Instead of selecting a single knowledge base, test several and theircombination

Two main goals:1 Test whether an unsupervised approach is enough to compute

semantic similarity

For English, knowledge-based approaches to other tasks rival withunsupervised approaches (e.g. WSD)

2 Indirect comparison of a set of open Portuguese lexical knowledgebases using ASSIN as a benchmark

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 2 / 20

Page 4: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Introduction

Unsupervised approach to the ASSIN task

Rely exclusively on the exploitation of external sources oflexical-semantic knowledge

Heuristics based on known semantic relations

Instead of selecting a single knowledge base, test several and theircombination

Two main goals:1 Test whether an unsupervised approach is enough to compute

semantic similarity

For English, knowledge-based approaches to other tasks rival withunsupervised approaches (e.g. WSD)

2 Indirect comparison of a set of open Portuguese lexical knowledgebases using ASSIN as a benchmark

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 2 / 20

Page 5: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Introduction

Unsupervised approach to the ASSIN task

Rely exclusively on the exploitation of external sources oflexical-semantic knowledge

Heuristics based on known semantic relations

Instead of selecting a single knowledge base, test several and theircombination

Two main goals:1 Test whether an unsupervised approach is enough to compute

semantic similarity

For English, knowledge-based approaches to other tasks rival withunsupervised approaches (e.g. WSD)

2 Indirect comparison of a set of open Portuguese lexical knowledgebases using ASSIN as a benchmark

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 2 / 20

Page 6: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Introduction

Unsupervised approach to the ASSIN task

Rely exclusively on the exploitation of external sources oflexical-semantic knowledge

Heuristics based on known semantic relations

Instead of selecting a single knowledge base, test several and theircombination

Two main goals:1 Test whether an unsupervised approach is enough to compute

semantic similarity

For English, knowledge-based approaches to other tasks rival withunsupervised approaches (e.g. WSD)

2 Indirect comparison of a set of open Portuguese lexical knowledgebases using ASSIN as a benchmark

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 2 / 20

Page 7: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Procedure

Given two sentences t and h...1 Pre-processing (OpenNLP, LemPORT [Rodrigues et al., 2014]):

TokenizationPOS-taggingLematization

2 Compute a similarity score between words in t and h

According to the knowledge baseWords are represented as a tuple (token,POS,lemma)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 3 / 20

Page 8: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Procedure

Given two sentences t and h...1 Pre-processing (OpenNLP, LemPORT [Rodrigues et al., 2014]):

TokenizationPOS-taggingLematization

2 Compute a similarity score between words in t and h

According to the knowledge baseWords are represented as a tuple (token,POS,lemma)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 3 / 20

Page 9: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Exploited resources

PAPEL [Goncalo Oliveira et al., 2008], relations extractedfrom Porto Editora’s Dicionario da Lıngua Portuguesa,using grammars based on regularities in the definitions;

Dicionario Aberto [Simoes et al., 2012], relationsextracted using the grammars of PAPEL;

Wikcionario.PT, relations extracted using the grammarsof PAPEL;

TeP [Maziero et al., 2008], thesaurus that groups wordswith their synonyms + antonymy relations;

OpenThesaurus.PT, similar to the previous, but smallerand without antonymy;

OpenWordNet-PT [de Paiva et al., 2012], openPortuguese wordnet;

PULO [Simoes and Guinovart, 2014], another Portuguesewordnet, smaller than the previous.

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 4 / 20

Page 10: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Exploited resources

PAPEL [Goncalo Oliveira et al., 2008], relations extractedfrom Porto Editora’s Dicionario da Lıngua Portuguesa,using grammars based on regularities in the definitions;

Dicionario Aberto [Simoes et al., 2012], relationsextracted using the grammars of PAPEL;

Wikcionario.PT, relations extracted using the grammarsof PAPEL;

TeP [Maziero et al., 2008], thesaurus that groups wordswith their synonyms + antonymy relations;

OpenThesaurus.PT, similar to the previous, but smallerand without antonymy;

OpenWordNet-PT [de Paiva et al., 2012], openPortuguese wordnet;

PULO [Simoes and Guinovart, 2014], another Portuguesewordnet, smaller than the previous.

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 4 / 20

Page 11: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Exploited resources

PAPEL [Goncalo Oliveira et al., 2008], relations extractedfrom Porto Editora’s Dicionario da Lıngua Portuguesa,using grammars based on regularities in the definitions;

Dicionario Aberto [Simoes et al., 2012], relationsextracted using the grammars of PAPEL;

Wikcionario.PT, relations extracted using the grammarsof PAPEL;

TeP [Maziero et al., 2008], thesaurus that groups wordswith their synonyms + antonymy relations;

OpenThesaurus.PT, similar to the previous, but smallerand without antonymy;

OpenWordNet-PT [de Paiva et al., 2012], openPortuguese wordnet;

PULO [Simoes and Guinovart, 2014], another Portuguesewordnet, smaller than the previous.

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 4 / 20

Page 12: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Exploited resources

PAPEL [Goncalo Oliveira et al., 2008], relations extractedfrom Porto Editora’s Dicionario da Lıngua Portuguesa,using grammars based on regularities in the definitions;

Dicionario Aberto [Simoes et al., 2012], relationsextracted using the grammars of PAPEL;

Wikcionario.PT, relations extracted using the grammarsof PAPEL;

TeP [Maziero et al., 2008], thesaurus that groups wordswith their synonyms + antonymy relations;

OpenThesaurus.PT, similar to the previous, but smallerand without antonymy;

OpenWordNet-PT [de Paiva et al., 2012], openPortuguese wordnet;

PULO [Simoes and Guinovart, 2014], another Portuguesewordnet, smaller than the previous.

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 4 / 20

Page 13: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Semantic network representation

Knowledge bases used as semantic networks N(W ,C )|W | words (nodes)|C | connections between words (edges)

Each with a semantic relation label (e.g. sinonimo-de,hiperonimo-de, parte-de, ...)Triples word1 related-to word2 (e.g. animal hiperonimo-de cao,roda parte-de carro)

PAPEL is already in this format!

Wordnets and synonymy thesauri were convertedEach pair of words in a synset resulted in a synonymy tripleA relation for each pair of words in two related synsets

For instance...

{porta, portao} parte-de {automovel, carro, viatura}(porta sinonimo-de portao), (automovel sinonimo-de carro),(automovel sinonimo-de viatura), (carro sinonimo-de viatura),(porta parte-de automovel), (porta parte-de carro), (porta parte-de viatura),(portao parte-de automovel), (portao parte-de carro), (portao parte-de viatura)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 5 / 20

Page 14: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Semantic network representation

Knowledge bases used as semantic networks N(W ,C )|W | words (nodes)|C | connections between words (edges)

Each with a semantic relation label (e.g. sinonimo-de,hiperonimo-de, parte-de, ...)Triples word1 related-to word2 (e.g. animal hiperonimo-de cao,roda parte-de carro)

PAPEL is already in this format!

Wordnets and synonymy thesauri were convertedEach pair of words in a synset resulted in a synonymy tripleA relation for each pair of words in two related synsets

For instance...

{porta, portao} parte-de {automovel, carro, viatura}(porta sinonimo-de portao), (automovel sinonimo-de carro),(automovel sinonimo-de viatura), (carro sinonimo-de viatura),(porta parte-de automovel), (porta parte-de carro), (porta parte-de viatura),(portao parte-de automovel), (portao parte-de carro), (portao parte-de viatura)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 5 / 20

Page 15: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Semantic network representation

Knowledge bases used as semantic networks N(W ,C )|W | words (nodes)|C | connections between words (edges)

Each with a semantic relation label (e.g. sinonimo-de,hiperonimo-de, parte-de, ...)Triples word1 related-to word2 (e.g. animal hiperonimo-de cao,roda parte-de carro)

PAPEL is already in this format!

Wordnets and synonymy thesauri were convertedEach pair of words in a synset resulted in a synonymy tripleA relation for each pair of words in two related synsets

For instance...

{porta, portao} parte-de {automovel, carro, viatura}(porta sinonimo-de portao), (automovel sinonimo-de carro),(automovel sinonimo-de viatura), (carro sinonimo-de viatura),(porta parte-de automovel), (porta parte-de carro), (porta parte-de viatura),(portao parte-de automovel), (portao parte-de carro), (portao parte-de viatura)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 5 / 20

Page 16: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Semantic network representation

Knowledge bases used as semantic networks N(W ,C )|W | words (nodes)|C | connections between words (edges)

Each with a semantic relation label (e.g. sinonimo-de,hiperonimo-de, parte-de, ...)Triples word1 related-to word2 (e.g. animal hiperonimo-de cao,roda parte-de carro)

PAPEL is already in this format!

Wordnets and synonymy thesauri were convertedEach pair of words in a synset resulted in a synonymy tripleA relation for each pair of words in two related synsetsFor instance...

{porta, portao} parte-de {automovel, carro, viatura}(porta sinonimo-de portao), (automovel sinonimo-de carro),(automovel sinonimo-de viatura), (carro sinonimo-de viatura),(porta parte-de automovel), (porta parte-de carro), (porta parte-de viatura),(portao parte-de automovel), (portao parte-de carro), (portao parte-de viatura)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 5 / 20

Page 17: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Combining semantic networks

CARTAO [Goncalo Oliveira et al., 2011], relations extracted fromthree dictionaries: PAPEL + Dicionario Aberto + Wikcionario.PT

Todos, all the triples from all the exploited resources

Redun2, all the triples in at least two exploited resources

CONTO.PT [Goncalo Oliveira, 2016], fuzzy wordnet, w/ confidencedegrees based on the redundancy in the exploited resources

Words have variable memberships to synsetsSynset connections also have a confidence degree

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 6 / 20

Page 18: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Combining semantic networks

CARTAO [Goncalo Oliveira et al., 2011], relations extracted fromthree dictionaries: PAPEL + Dicionario Aberto + Wikcionario.PT

Todos, all the triples from all the exploited resources

Redun2, all the triples in at least two exploited resources

CONTO.PT [Goncalo Oliveira, 2016], fuzzy wordnet, w/ confidencedegrees based on the redundancy in the exploited resources

Words have variable memberships to synsetsSynset connections also have a confidence degree

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 6 / 20

Page 19: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Combining semantic networks

CARTAO [Goncalo Oliveira et al., 2011], relations extracted fromthree dictionaries: PAPEL + Dicionario Aberto + Wikcionario.PT

Todos, all the triples from all the exploited resources

Redun2, all the triples in at least two exploited resources

CONTO.PT [Goncalo Oliveira, 2016], fuzzy wordnet, w/ confidencedegrees based on the redundancy in the exploited resources

Words have variable memberships to synsetsSynset connections also have a confidence degree

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 6 / 20

Page 20: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Similarity heuristics

Three different kinds of tested heuristics:

Word neighbourhoods in the semantic networks

Based on the structure of the semantic network

Based on the membership to fuzzy synsets

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 7 / 20

Page 21: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Similarity heuristics

Three different kinds of tested heuristics:

Word neighbourhoods in the semantic networks

Based on the structure of the semantic network

Based on the membership to fuzzy synsets

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 7 / 20

Page 22: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Similarity heuristics

Three different kinds of tested heuristics:

Word neighbourhoods in the semantic networks

Based on the structure of the semantic network

Based on the membership to fuzzy synsets

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 7 / 20

Page 23: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Neighbourhood similarity

Similarity between two sentences t and hEach represented as a set of words, T and H.T and H contain all the words of each sentence and theiradjacencies in the semantic network.

Neigh(word) =synonyms(word)

∪ hypernyms(word)

∪ hyponyms(word)

∪ parts(word)

∪ ...

Neighbourhood can be restricted to a subset of relation types

Similarity between T and H:

Simmax (t, h) =

|t|∑i=1

max(

Sim(Neighbours(Ti ),Neighbours(Hj )

)): Hj ∈ H

(alternatives were tested but this lead to the best results)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 8 / 20

Page 24: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Neighbourhood similarity

Similarity between two sentences t and hEach represented as a set of words, T and H.T and H contain all the words of each sentence and theiradjacencies in the semantic network.

Neigh(word) =synonyms(word)

∪ hypernyms(word)

∪ hyponyms(word)

∪ parts(word)

∪ ...

Neighbourhood can be restricted to a subset of relation types

Similarity between T and H:

Simmax (t, h) =

|t|∑i=1

max(

Sim(Neighbours(Ti ),Neighbours(Hj )

)): Hj ∈ H

(alternatives were tested but this lead to the best results)

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 8 / 20

Page 25: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Neighbourhood similarity heuristics

Adaptations of the Lesk algorithm [Banerjee and Pedersen, 2003]:

Jaccard(A,B) =|Neigh(A) ∩ Neigh(B)||Neigh(A) ∪ Neigh(B)|

Overlap(A,B) =|Neigh(A) ∩ Neigh(B)|

min(|Neigh(A)|, |Neigh(B)|)

Dice(A,B) = 2.|Neigh(A)| ∪ |Neigh(B)||Neigh(A)|+ |Neigh(B)|

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 9 / 20

Page 26: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Network structure heuristics

Average distance

Between each pair of words (pt , ph), such that pt ∈ t and ph ∈ h

Similarity = 11+distance

Should have probably used the lowest distance...

Personalized PageRank [Agirre and Soroa, 2009]

Order the network nodes according to their structural relevance foreach sentence:

1 Each node is weighted: 1|F | , if it is a word in f , 0 otherwise;

2 With the previous weights, PageRank is run for 30 iterations;3 Nodes are ordered according to their rank;4 Define sets Efn with the top-n words (n = 50).5 Similarity given by Etn∩Ehn

n

Much tuning required to set the best parameters...

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 10 / 20

Page 27: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Network structure heuristics

Average distance

Between each pair of words (pt , ph), such that pt ∈ t and ph ∈ h

Similarity = 11+distance

Should have probably used the lowest distance...

Personalized PageRank [Agirre and Soroa, 2009]

Order the network nodes according to their structural relevance foreach sentence:

1 Each node is weighted: 1|F | , if it is a word in f , 0 otherwise;

2 With the previous weights, PageRank is run for 30 iterations;3 Nodes are ordered according to their rank;4 Define sets Efn with the top-n words (n = 50).5 Similarity given by Etn∩Ehn

n

Much tuning required to set the best parameters...

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 10 / 20

Page 28: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Network structure heuristics

Average distance

Between each pair of words (pt , ph), such that pt ∈ t and ph ∈ h

Similarity = 11+distance

Should have probably used the lowest distance...

Personalized PageRank [Agirre and Soroa, 2009]

Order the network nodes according to their structural relevance foreach sentence:

1 Each node is weighted: 1|F | , if it is a word in f , 0 otherwise;

2 With the previous weights, PageRank is run for 30 iterations;3 Nodes are ordered according to their rank;4 Define sets Efn with the top-n words (n = 50).5 Similarity given by Etn∩Ehn

n

Much tuning required to set the best parameters...

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 10 / 20

Page 29: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Fuzzy wordnet heuristics

Different approach, given the features of CONTO.PT...

µ(w ,S): membership of words w to synset S

conf (S1,R,S2): confidence on relation of type R between S1 and S2

Weights ρs > ρh > ρo for synonymy, hypernymy and other relations

Sim(t, h) = maximum similarity between each pair of words (pt , ph),such that pt ∈ t and ph ∈ h

1 If there is at least one synsetS12 : p1 ∈ S12 ∧ p2 ∈ S12 → Sim(p1, p2) = (µ(p1,S1) + µ(p2,S2))× ρs

2 If there are two synsets S1,S2 : p1 ∈ S1 ∧ p2 ∈ S2 ∧ (S1 relatedTo S2)→ Sim(p1, p2) = (µ(p1,S1) + µ(p2,S2))× conf (S1,R,S2)× ρh/o

Not explored enough...

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 11 / 20

Page 30: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Fuzzy wordnet heuristics

Different approach, given the features of CONTO.PT...

µ(w ,S): membership of words w to synset S

conf (S1,R,S2): confidence on relation of type R between S1 and S2

Weights ρs > ρh > ρo for synonymy, hypernymy and other relations

Sim(t, h) = maximum similarity between each pair of words (pt , ph),such that pt ∈ t and ph ∈ h

1 If there is at least one synsetS12 : p1 ∈ S12 ∧ p2 ∈ S12 → Sim(p1, p2) = (µ(p1,S1) + µ(p2,S2))× ρs

2 If there are two synsets S1,S2 : p1 ∈ S1 ∧ p2 ∈ S2 ∧ (S1 relatedTo S2)→ Sim(p1, p2) = (µ(p1,S1) + µ(p2,S2))× conf (S1,R,S2)× ρh/o

Not explored enough...

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 11 / 20

Page 31: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Fuzzy wordnet heuristics

Different approach, given the features of CONTO.PT...

µ(w ,S): membership of words w to synset S

conf (S1,R,S2): confidence on relation of type R between S1 and S2

Weights ρs > ρh > ρo for synonymy, hypernymy and other relations

Sim(t, h) = maximum similarity between each pair of words (pt , ph),such that pt ∈ t and ph ∈ h

1 If there is at least one synsetS12 : p1 ∈ S12 ∧ p2 ∈ S12 → Sim(p1, p2) = (µ(p1,S1) + µ(p2,S2))× ρs

2 If there are two synsets S1,S2 : p1 ∈ S1 ∧ p2 ∈ S2 ∧ (S1 relatedTo S2)→ Sim(p1, p2) = (µ(p1,S1) + µ(p2,S2))× conf (S1,R,S2)× ρh/o

Not explored enough...

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 11 / 20

Page 32: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Entailment heuristics

Exclusively based on the fuzzy wordnet CONTO.PT...

Use CONTO.PT as a normal wordnet by setting cut-pointsθs , for synset memberships µθh, for hypernymy relations confidence conf

δ is a predefined threshold

∆ = ||T | − |H||if (∆ < δ)

every word in T has a synonym in Hreturn Paraphraseevery word in T has a synonym, a hypernym or a hyponym in Hreturn Entailmentreturn None

return None

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 12 / 20

Page 33: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Entailment heuristics

Exclusively based on the fuzzy wordnet CONTO.PT...

Use CONTO.PT as a normal wordnet by setting cut-pointsθs , for synset memberships µθh, for hypernymy relations confidence conf

δ is a predefined threshold

∆ = ||T | − |H||if (∆ < δ)

every word in T has a synonym in Hreturn Paraphraseevery word in T has a synonym, a hypernym or a hyponym in Hreturn Entailmentreturn None

return None

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 12 / 20

Page 34: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Best results for similarityTraining

Network Heuristic Pearson MSE

PT-PT

Redun2 Overlap 0.600 1.173Redun2 Dice 0.598 1.185

OpenWN-PT Jaccard 0.596 1.159Redun2 Jaccard 0.596 1.190PAPEL Overlap 0.594 1.195

TeP Dice 0.592 1.330PULO Jaccard 0.590 1.259

OpenWN-PT PPR 0.528 1.301CONTO.PT N/A 0.587 1.189

PT-BR

Redun2 Overlap 0.546 1.065OpenWN-PT Dice 0.546 1.077OpenWN-PT Jaccard 0.545 1.081OpenWN-PT Overlap 0.544 1.039

Redun2 Jaccard 0.544 1.070Redun2 Overlap 0.544 1.052PAPEL Overlap 0.543 1.027

TeP Dice 0.543 1.090PULO Jaccard 0.541 1.037PAPEL PPR 0.447 1.150

CONTO.PT N/A 0.535 1.078

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 13 / 20

Page 35: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Best resultsTest

Network Heuristic Pearson MSE

PT-PT

Redun2 Overlap 0.536 1.105Redun2 Dice 0.536 1.130Redun2 Jaccard 0.535 1.149

OpenWN-PT Jaccard 0.533 1.141TeP Dice 0.532 1.131TeP Jaccard 0.532 1.151

PAPEL Dice 0.530 1.146PULO Jaccard 0.527 1.313

OpenWN-PT PPR 0.513 1.177CONTO.PT N/A 0.526 1.179

PT-BR

TeP Overlap 0.593 1.256OpenWN-PT Dice 0.589 1.312OpenWN-PT Overlap 0.589 1.345

TeP Dice 0.588 1.311OpenWN-PT Jaccard 0.588 1.329

Redun2 Dice 0.588 1.356PULO Dice 0.584 1.326PAPEL Dice 0.584 1.335

OpenWN-PT PPR 0.464 1.225CONTO.PT N/A 0.580 1.367

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 14 / 20

Page 36: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Comments on Similarity

Substantially different results for training and testTraining: best results for PT-PT (0.6 vs 0.54)Test: best results for PT-BR (0.59 vs 0.53)

Subtle differences in Pearson among best configurations

Typically less than 0.01!

Similar sentences share several words... are the heuristics are morerelevant than the semantic network?

Best results always obtained with the Dice coefficientPageRank always below neighbourhood-based heuristicsAverage distance performed poorly

Additional observations:Redun2 was the best network, except for PT-BR test

Benefits of combining knowledge from different sources!

OpenWN-PT always close to the bestTeP got the best results in PT-BR testCONTO.PT just slightly below the semantic networks

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 15 / 20

Page 37: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Comments on Similarity

Substantially different results for training and testTraining: best results for PT-PT (0.6 vs 0.54)Test: best results for PT-BR (0.59 vs 0.53)

Subtle differences in Pearson among best configurations

Typically less than 0.01!

Similar sentences share several words... are the heuristics are morerelevant than the semantic network?

Best results always obtained with the Dice coefficientPageRank always below neighbourhood-based heuristicsAverage distance performed poorly

Additional observations:Redun2 was the best network, except for PT-BR test

Benefits of combining knowledge from different sources!

OpenWN-PT always close to the bestTeP got the best results in PT-BR testCONTO.PT just slightly below the semantic networks

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 15 / 20

Page 38: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Comments on Similarity

Substantially different results for training and testTraining: best results for PT-PT (0.6 vs 0.54)Test: best results for PT-BR (0.59 vs 0.53)

Subtle differences in Pearson among best configurations

Typically less than 0.01!

Similar sentences share several words... are the heuristics are morerelevant than the semantic network?

Best results always obtained with the Dice coefficientPageRank always below neighbourhood-based heuristicsAverage distance performed poorly

Additional observations:Redun2 was the best network, except for PT-BR test

Benefits of combining knowledge from different sources!

OpenWN-PT always close to the bestTeP got the best results in PT-BR testCONTO.PT just slightly below the semantic networks

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 15 / 20

Page 39: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Comments on Similarity

Substantially different results for training and testTraining: best results for PT-PT (0.6 vs 0.54)Test: best results for PT-BR (0.59 vs 0.53)

Subtle differences in Pearson among best configurations

Typically less than 0.01!

Similar sentences share several words... are the heuristics are morerelevant than the semantic network?

Best results always obtained with the Dice coefficientPageRank always below neighbourhood-based heuristicsAverage distance performed poorly

Additional observations:Redun2 was the best network, except for PT-BR test

Benefits of combining knowledge from different sources!

OpenWN-PT always close to the bestTeP got the best results in PT-BR testCONTO.PT just slightly below the semantic networks

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 15 / 20

Page 40: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Best resultsEntailment

θs θh δ Accuracy Macro F1

PT-PT (train)0.1 0.01 0.5 73.83% 0.450.1 0.1 0.4 71.67% 0.38

0.25 0.2 0.5 73.83% 0.45

PT-BR (train)0.1 00.1 0.3 77.47% 0.310.1 00.1 0.5 76.70% 0.420.2 0.2 0.1 77.70% 0.29

PT-PT (test)0.1 00.1 0.5 73.10% 0.43

0.15 0.1 0.4 72.10% 0.380.05 0.01 0.3 70.80% 0.32

PT-BR (test)0.2 0.2 0.1 77.65% 0.29

0.15 0.1 0.3 79.05% 0.390.1 0.01 0.3 78.30% 0.33

Higher accuracy in PT-BR, higher Macro F1 in PT-PT

Gold collectionPT-PT: 24% entailment and 7% paraphrasePT-BR: 17% entailment and 5% paraphrase

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 16 / 20

Page 41: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Conclusions

Modest results when compared to other participants

Too many parameters involved

semantic networks, their combination, used relation types, relationweights, normalisation, from word similarity to sentence similarity,similarity measures, maximum distance, PageRank set size, cut pointsin fuzzy wordnet ...

Difficult to explore / compare all of them properly

Genetic algorithm?Lines for future work!

Computed scores used as features to the supervised approach ASAPP

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 17 / 20

Page 42: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Conclusions

Modest results when compared to other participants

Too many parameters involved

semantic networks, their combination, used relation types, relationweights, normalisation, from word similarity to sentence similarity,similarity measures, maximum distance, PageRank set size, cut pointsin fuzzy wordnet ...

Difficult to explore / compare all of them properly

Genetic algorithm?Lines for future work!

Computed scores used as features to the supervised approach ASAPP

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 17 / 20

Page 43: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Conclusions

Modest results when compared to other participants

Too many parameters involved

semantic networks, their combination, used relation types, relationweights, normalisation, from word similarity to sentence similarity,similarity measures, maximum distance, PageRank set size, cut pointsin fuzzy wordnet ...

Difficult to explore / compare all of them properly

Genetic algorithm?Lines for future work!

Computed scores used as features to the supervised approach ASAPP

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 17 / 20

Page 44: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

Conclusions

Modest results when compared to other participants

Too many parameters involved

semantic networks, their combination, used relation types, relationweights, normalisation, from word similarity to sentence similarity,similarity measures, maximum distance, PageRank set size, cut pointsin fuzzy wordnet ...

Difficult to explore / compare all of them properly

Genetic algorithm?Lines for future work!

Computed scores used as features to the supervised approach ASAPP

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 17 / 20

Page 45: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

The end

Thank you!

Questions?

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 18 / 20

Page 46: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

The end

Thank you!

Questions?

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 18 / 20

Page 47: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

References I

Agirre, E. and Soroa, A. (2009).

Personalizing PageRank for word sense disambiguation.

In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics,EACL’09, pages 33–41, Stroudsburg, PA, USA. ACL Press.

Banerjee, S. and Pedersen, T. (2003).

Extended gloss overlaps as a measure of semantic relatedness.In Proceedings of the 18th International Joint Conference on Artificial Intelligence (IJCAI’03), pages 805–810, CA, USA.

de Paiva, V., Rademaker, A., and de Melo, G. (2012).

OpenWordNet-PT: An open Brazilian wordnet for reasoning.In Proceedings of 24th International Conference on Computational Linguistics, COLING (Demo Paper).

Goncalo Oliveira, H. (2016).

CONTO.PT: Groundwork for the Automatic Creation of a Fuzzy Portuguese Wordnet.In Proceedings of 12th International Conference on Computational Processing of the Portuguese Language (PROPOR2016), page in press, Tomar, Portugal. Springer.

Goncalo Oliveira, H., Anton Perez, L., Costa, H., and Gomes, P. (2011).

Uma rede lexico-semantica de grandes dimensoes para o portugues, extraıda a partir de dicionarios electronicos.Linguamatica, 3(2):23–38.

Goncalo Oliveira, H., Santos, D., Gomes, P., and Seco, N. (2008).

PAPEL: A dictionary-based lexical ontology for Portuguese.In Proceedings of Computational Processing of the Portuguese Language – 8th International Conference (PROPOR2008), volume 5190 of LNCS/LNAI, pages 31–40, Aveiro, Portugal. Springer.

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 19 / 20

Page 48: Reciclagem: Exploring Portuguese Lexical …propor2016.di.fc.ul.pt/wp-content/uploads/2015/10/Gonca...Reciclagem: Exploring Portuguese Lexical Knowledge-Bases in the ASSIN Task Hugo

References II

Maziero, E. G., Pardo, T. A. S., Felippo, A. D., and Dias-da-Silva, B. C. (2008).

A Base de Dados Lexical e a Interface Web do TeP 2.0 - Thesaurus Eletronico para o Portugues do Brasil.In VI Workshop em Tecnologia da Informacao e da Linguagem Humana (TIL), pages 390–392.

Rodrigues, R., Goncalo-Oliveira, H., and Gomes, P. (2014).

LemPORT: a High-Accuracy Cross-Platform Lemmatizer for Portuguese.

In Pereira, M. J. V., Leal, J. P., and Simoes, A., editors, Proceedings of the 3rd Symposium on Languages, Applicationsand Technologies (SLATE ’14), OpenAccess Series in Informatics, pages 267–274, Germany. Schloss Dagstuhl —Leibniz-Zentrum fur Informatik, Dagstuhl Publishing.

Simoes, A. and Guinovart, X. G. (2014).

Bootstrapping a Portuguese wordnet from Galician, Spanish and English wordnets.In Advances in Speech and Language Technologies for Iberian Languages, volume 8854 of LNCS, pages 239–248.

Simoes, A., Alvaro Iriarte Sanroman, and Almeida, J. J. (2012).

Dicionario-Aberto: A source of resources for the Portuguese language processing.In Proceedings of 10th International Conference on the Computational Processing of the Portuguese Language(PROPOR 2012), volume 7243 of LNCS, pages 121–127, Coimbra Portugal. Springer.

[email protected] (CISUC,PT) Reciclagem ASSIN@PROPOR 2016 20 / 20