definition clustering, sense naming & lexical augmentation

Post on 08-Jan-2016


DESCRIPTION

Mathieu LAFOURCADE (lafourcade@lirmm.fr), Fabien JALABERT (jalabert@lirmm.fr). Definition Clustering, Sense Naming & Lexical Augmentation.

TRANSCRIPT

Definition Clustering, Sense Naming & Lexical Augmentation

Fabien JALABERT, jalabert@lirmm.fr

Mathieu LAFOURCADE, lafourcade@lirmm.fr

Natural Language Processing

• Lexical semantics - WSD - document indexing

• Dictionary construction and vectorization
  Problem: extracting the definition meta-language
  Example: ‘cannibale’ = ‘qui mange l’Homme en parlant de l’Homme’ ("that eats humans, said of a human")
  Themes: homme (human), manger (to eat), rhétorique (rhetoric)

• Multi-source approach
  Noise reduction
  Problem: the atomic element is a definition, not a sense (definition ≠ sense)

• Objectives
  - clustering definitions to obtain senses
  - naming these senses

Study context 1/2

[Figure: processing pipeline for a term T.
Multi-source base: def 1, def 2, def 3 (source 1); def 1, def 2 (source 2); def 1, def 2 (source 3).
Clustering produces the ‘acception’ (sense) base: sense 1 = {def 1 - source 1}; sense 2 = {def 2 - source 1, def 2 - source 2, def 1 - source 3}; sense 3 = {def 3 - source 1, def 1 - source 2, def 2 - source 3}.
Sense naming gives each sense a name, and the named senses are re-injected as a new lexical source. Labels t1, t2, …, tn.]

Study context 2/2

• Model, Construction, Organization
• Definition Clustering
• Sense Naming
• Lexical Augmentation
• Results

Summary

• An idea = a vector

• A vector component = a primitive as defined in a thesaurus
  – Thesaurus Larousse: 873 concepts
  – Concepts are inter-related

Generator space

• A definition → a vector

Conceptual Vector Model 1/2

[Figure labels: arme, transports maritimes et fluviaux, oiseau]

Most activated primitives for ‘frégate’: (oiseau 6134) (transports maritimes et fluviaux 5644) (arme 4891) …

Salton, Deerwester
Chauché, Lafourcade

Terms thematically close to ‘frégate’: (destroyer 0.2246) (youyou 0.2267) (voilier 0.2268) (contre-torpilleur 0.2274) (chlamydère 0.2276) (oiseau-jardinier 0.2295) (trois-mâts 0.233) …

Terms thematically close to ‘frégate/oiseau/’: (oiseau-jardinier 0.1237) (plumeur 0.1319) (goglu 0.136) (travailleur 0.136) (chlamydère 0.1385) (penne 0.141) (Galliformes 0.1422) (agami 0.1428) …

Terms thematically close to ‘frégate/bateau/’: (démâtage 0.1604) (dégréer 0.1676) (naval 0.1718) (bateau-piège 0.1774) (bateau-vanne 0.1821) (batelet 0.1824) …

Conceptual Vector Model 2/2

Thematic distance = angle between two vectors x and y
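As a minimal sketch of the thematic distance stated above, the angle between two conceptual vectors can be computed from their normalized dot product. The function name `angular_distance` and the two comparison vectors are illustrative assumptions; only the three activations of ‘frégate’ come from the slides.

```python
import math

def angular_distance(x, y):
    """Angle in radians between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)

# Components ordered as (oiseau, transports maritimes et fluviaux, arme).
fregate = [6134, 5644, 4891]      # activations quoted on the slide
bird_like = [9000, 300, 100]      # hypothetical vector, for illustration only
ship_like = [200, 8000, 4000]     # hypothetical vector, for illustration only

print(angular_distance(fregate, bird_like))
print(angular_distance(fregate, ship_like))
```

A smaller angle means the two items are thematically closer; identical directions give 0 and orthogonal vectors give π/2.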

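[Figure: SYGMART analysis of the ambiguous sentence ‘la petite brise la glace’ (lemmas: le petit briser le glace; tags: GN – Gouv - adj, GV - Gouv, GN – Gouv - nf). The root PHAMBG (ambiguous sentence) dominates two parse trees:
Reading 1 (PH): GN ‘la petite’ (le, petit) + GV ‘brise’ (briser) + GN ‘la glace’ (le, glace), i.e. "the little girl breaks the ice".
Reading 2 (PH): GN ‘la petite brise’ (le, GA petit, brise) + GV ‘la glace’ (GN le, glacer), i.e. "the light breeze chills her".]

Definition Vector Computation

Chauché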

Learning agents : Sygmart, computation of vectors from definition, synonymy, antonymy, …

Multi-Agent Organization
Double-loop

Lecerf, Schwab

Endogenous loop

Exogenous loop

Other agents (society)

Agent

Grouping definitions into senses

Clustering

Objective

• Deep analysis - several criteria
• No training (but enhancement through the exogenous loop)
• Frontier between senses and definitions
  - Centroid approach
  - Heuristics (preferences)
    - number of clusters = maximum number of definitions found in any one dictionary
    - two definitions from the same source → two different clusters

Clustering 1/5
Strategy
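The two heuristics and the centroid approach listed above can be sketched as follows. This is an illustrative reading, not the authors' code: the function names are hypothetical, and taking the centroid as the component-wise mean is an assumption the slides do not confirm.

```python
def cluster_count(definitions_by_source):
    """Heuristic 1: as many clusters as the largest number of
    definitions given by any single source."""
    return max(len(defs) for defs in definitions_by_source.values())

def respects_source_constraint(cluster):
    """Heuristic 2: a cluster holds at most one definition per source.
    `cluster` is a list of (source, definition_id) pairs."""
    sources = [source for source, _ in cluster]
    return len(sources) == len(set(sources))

def centroid(vectors):
    """Component-wise mean of the definition vectors of one cluster
    (assumed form of the centroid)."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

# Multi-source base shaped like the one in the study-context figure.
base = {
    "source 1": ["def 1", "def 2", "def 3"],
    "source 2": ["def 1", "def 2"],
    "source 3": ["def 1", "def 2"],
}
print(cluster_count(base))
print(respects_source_constraint([("source 1", "def 2"), ("source 2", "def 2")]))
print(centroid([[1.0, 3.0], [3.0, 5.0]]))
```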

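[Figure: senses of ‘botte’ and the difficulty of delimiting them]
- Chaussure montante (quel qu'en soit l'usage): high boot, whatever its use
  Distinction between "chaussure élégante" (elegant shoe) and "chaussure tout-terrain" (all-terrain shoe)
- Coup porté (en escrime ou non): blow struck, in fencing or not
  Distinction between "le coup en escrime" (the fencing thrust) and "l'attaque surprise" (the surprise attack)
- Réunion de végétaux: bundle of plants

Clustering 2/5
Difficulty

‘botte’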

• Source-by-source iteration until a minimum-value distribution is obtained

Minimum-value assignment between the definitions of a source and the clusters
From a distance matrix: Hungarian method, O(n³)

Clustering 3/5
Algorithm 1/2

Kuhn; Ford, Fulkerson
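To make the assignment step concrete, here is a small sketch that finds the minimum-cost assignment of one source's definitions to clusters from a distance matrix. It deliberately uses exhaustive search over permutations rather than the Hungarian method named on the slide: it returns the same optimum but in factorial time, so it is only suitable for the handful of definitions a single entry has. The function name and the matrices are hypothetical.

```python
from itertools import permutations

def assign_definitions(distance):
    """Minimum-total-distance assignment of definitions to distinct clusters.

    distance[i][j] is the distance between definition i of the current
    source and cluster j; there must be at least as many clusters as
    definitions. Returns (clusters, total), where clusters[i] is the
    cluster chosen for definition i.
    """
    n_defs = len(distance)
    n_clusters = len(distance[0])
    best_total, best = float("inf"), None
    # Each permutation gives every definition its own cluster, which also
    # enforces "two definitions of a same source, two different clusters".
    for clusters in permutations(range(n_clusters), n_defs):
        total = sum(distance[i][c] for i, c in enumerate(clusters))
        if total < best_total:
            best_total, best = total, clusters
    return list(best), best_total

# Illustrative matrix: a greedy choice (definition 0 -> cluster 0) would
# force definition 1 into cluster 1 at cost 0.9; the optimum swaps them.
matrix = [[0.10, 0.20],
          [0.15, 0.90]]
print(assign_definitions(matrix))
```

For larger matrices, an O(n³) Hungarian implementation would replace the permutation loop without changing the interface.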

• For each criterion: one evaluation, one distance matrix

• Criteria
  - Comparing the lexical contents of definitions (with term frequency, co-occurrences, etc.)
  - Angular distance
  - Symbolic markers
    - morphology
    - etymology (‘avocat’: ‘ahuacatl’ / ‘advocatus’)
    - use (‘vieux’, ‘ancien’, ‘poétique’, …)
    - language level (‘argot’, ‘familier’, …)
    - domain (‘médecine’, ‘zoologie’, …)

Clustering 4/5
Algorithm 2/2

We would like to designate meanings

‘botte’

Correct results in many cases: 90 % for nouns, 70 % for verbs; still to be done for adjectives

Problems with very strong polysemy: vagueness, continuity between meanings
Support verbs: ‘prendre’, …

To study: increasing the number of clusters

Clustering 5/5
Results

Sense Naming

Objective

To give the system some capacity to « talk about a sense »

• Dictionary independent
• Interface (man-system & system-system)
• A new lexical source: looping :-)

Semantic annotation

La frégate/vaisseau/ naviguait à travers les océans
("The frigate/ship/ sailed across the oceans")

La frégate/oiseau/ planait à travers les nues en poussant son cri incomparable
("The frigate/bird/ glided through the clouds, uttering its incomparable cry")

Sense Naming 1/10
Properties

1. Extraction
2. Validation and dispatching of polysemous bags (bijection)
3. Evaluation of candidates: ordering and extracting the most appropriate ones

Sense Naming 2/10
Procedure

• Extraction attached to a meaning
  – Morpho-syntactic analysis of the definition
  – Extraction of markers: « anc. », « méd. », …
  – Extraction from unstructured or semi-structured data (XML…)

‘frégate’ : [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet]
("In the 15th century, a large half-decked boat rigged with two lateen sails on a yard, providing the link between ports and galley squadrons.")

• Extraction from polysemous bags
  – Word list (such as the synonym list of the Université de Caen)

Sense Naming 3/10
Extraction

Ploux, Victorri

ex: ‘botte’ = chaussure, bottillon, coup, attaque, amas, bouquet, …

Bijection: being able to re-associate the proper meaning

ƒ : (term, sense) → (term, annotation)
ƒ⁻¹ : (term, annotation) → (term, sense)

Sense Naming 4/10
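The bijection requirement can be illustrated with a pair of lookup tables: the forward map gives the annotation of a sense, and the inverse map must recover the sense unambiguously. The sketch below is an assumption about one possible representation; the function name and the integer sense identifiers are hypothetical.

```python
def build_annotation_maps(sense_names):
    """Build f and its inverse from {(term, sense_id): annotation}.

    f         : (term, sense_id)   -> (term, annotation)
    f_inverse : (term, annotation) -> (term, sense_id)
    Raises ValueError if two senses of a term share an annotation,
    because the mapping would then no longer be a bijection.
    """
    f, f_inverse = {}, {}
    for (term, sense_id), annotation in sense_names.items():
        annotated = (term, annotation)
        if annotated in f_inverse:
            raise ValueError(f"{annotation!r} names two senses of {term!r}")
        f[(term, sense_id)] = annotated
        f_inverse[annotated] = (term, sense_id)
    return f, f_inverse

# Hypothetical sense identifiers; the annotations are those of the slides.
f, f_inverse = build_annotation_maps({
    ("frégate", 1): "oiseau",
    ("frégate", 2): "vaisseau",
})
print(f[("frégate", 2)])
print(f_inverse[("frégate", "oiseau")])
```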

• A candidate associated with a sense should be closer to its own sense than to any other:
  $\forall s_j \neq s_i,\ D_A(a, s_i) \leq D_A(a, s_j)$
• Unattached candidates are associated with the closest meaning
• A candidate should not be present in a competing definition

Validation
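The first two validation rules can be sketched directly from the inequality: a candidate is kept for its sense only if no other sense is strictly closer, and an unattached candidate goes to whichever sense is closest. The function names and the distance values below are illustrative assumptions, and the distances are taken as already computed.

```python
def is_valid_candidate(distances, own_sense):
    """True if the candidate is at least as close to its own sense as to
    every other sense. `distances` maps each sense to D_A(candidate, sense)."""
    own = distances[own_sense]
    return all(own <= d for sense, d in distances.items() if sense != own_sense)

def closest_sense(distances):
    """Sense to which an unattached candidate is associated."""
    return min(distances, key=distances.get)

# Hypothetical distances from a candidate to the three senses of a term.
candidate_distances = {"sense 1": 0.30, "sense 2": 0.40, "sense 3": 0.95}
print(is_valid_candidate(candidate_distances, "sense 1"))
print(is_valid_candidate(candidate_distances, "sense 2"))
print(closest_sense(candidate_distances))
```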

• Extraction grade
• Evaluating the capacity to disambiguate (to distinguish a sense from all others)
• Evaluating the capacity to associate: cognitive cost reduction

Sense Naming 5/10
Evaluation

Prince

‘frégate’ : [nf] [ancien] Au XVe s., grande barque demi-pontée gréant deux voiles latines sur antenne et assurant la liaison entre les ports et les escadres de galère. [Club Internet]

[Figure: syntactic analysis of the definition (‘au XVe, grande barque demi-pontée gréant deux voiles latines sur antennes …’; node labels Sujet, GV, COD, CC) with the candidates extracted from it and their extraction grades, in slide order:
XVe, grande, barque, demi-pontée, barque demi-pontée: (6) (2) (1) (3) (1)
gréant, voiles, latines, voiles latines, antenne: (4) (5) (6) (5) (7)]

Sense Naming 6/10
Extraction grade

Absolute margin: $M_A = |d_1 - d_2|$

Relative margin: $M_R = M_A / d_1$

Risk of ‘non-sense’: $R_{NS} = d_3 / M_R$

Sense Naming 7/10

Disambiguation capacity 1/2

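[Figure: ‘frégate’ annotated with the candidate ‘vaisseau’]
Senses of ‘frégate’: w.1 (oiseau, bird), w.2 (navire ancien, old ship), w.3 (navire moderne, modern ship)
Senses of ‘vaisseau’: t.11 (navire, ship), t.12 (sanguin, blood vessel)
Distances shown: 0.95, 1.2, 0.8, 0.85, d1 = 0.3, d2 = 0.4, d3 = 0.2

M_A = |d1 - d2| = 0.1
M_R = 0.1 / d1 = 0.33
R_NS = d3 / 0.33 = 0.6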

Sense Naming 8/10

Disambiguation capacity 2/2

[Figure: the ‘frégate’ / ‘vaisseau’ example of the previous slide (M_A = 0.1, M_R = 0.33, R_NS = 0.6), shown next to a second candidate, ‘voilier’]
Senses of ‘frégate’: w.1 (oiseau, bird), w.2 (navire ancien, old ship), w.3 (navire moderne, modern ship)
Senses of ‘voilier’: t.11 (oiseau, bird), t.12 (navire, ship)
Distances shown: 0.3, 0.7, 0.72, 0.72, d1 = 0.25, d2 = 0.29, d3 = 0.65

M_A = |d1 - d2| = 0.04
M_R = 0.04 / d1 = 0.16
R_NS = d3 / 0.16 = 4
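The three measures can be recomputed from the distances of the two worked examples. This is a sketch under one assumption: the absolute margin is taken as the absolute difference of d1 and d2, which is the reading that matches the positive values given on the slides. What d1, d2 and d3 measure is not spelled out in the transcript, so they are treated here as plain inputs, and the function name is hypothetical.

```python
def disambiguation_scores(d1, d2, d3):
    """Return (absolute margin, relative margin, risk of non-sense)."""
    absolute_margin = abs(d1 - d2)           # M_A
    relative_margin = absolute_margin / d1   # M_R = M_A / d1
    non_sense_risk = d3 / relative_margin    # R_NS = d3 / M_R
    return absolute_margin, relative_margin, non_sense_risk

# Distances read from the two slide examples.
vaisseau = disambiguation_scores(d1=0.3, d2=0.4, d3=0.2)
voilier = disambiguation_scores(d1=0.25, d2=0.29, d3=0.65)

for name, scores in (("vaisseau", vaisseau), ("voilier", voilier)):
    print(name, [round(value, 2) for value in scores])
```

On these inputs ‘vaisseau’ has the larger relative margin and the much lower risk value, which agrees with the figures quoted on the slides (0.33 and 0.6 against 0.16 and about 4).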

Survey
- collocations (botte de paille, …)
- co-occurrences (Tintin → Milou)
- synonyms and hypernyms (manger → se nourrir; mouche → insecte → animal)
- domain / context for technical terms (médecine, architecture, agriculture, sport, …)

Done for 13 terms, totalling 38 definitions; 134 answers

Sense Naming 9/10
Cognitive cost

Church, Daille, Véronis

‘botte’

- the multi-criteria approach seems well suited
- easily extensible
- high precision

- enhancement needed for meta-language processing
- implementation of further criteria (associative memory, lexical functions)
- synthesis grammar (botte/secret/ vs. botte/secrète/)

Useful for multilingual lexical databases

Sense Naming 10/10
Results

Mel’cuk, Schwab

Multilingual lexical database
Some terms are not lexicalized in some languages

Objective
Lexicalize these terms

Lexical Augmentation

[Figure: multilingual lexical database in three columns, FRANÇAIS / ACCEPTIONS / ENGLISH.
French: abats, déchet, abats de volaille, abats de bœuf, abats de porc.
Acceptions: offal.1, offal.2, refuse.
English: giblets, offal, refuse, scrap, beef offal, pork offal.]

Lexical Augmentation 1/2
Papillon project
Boitet, Lepage, Mangeot-Lerebours, Sérasset

• Extraction from definitions and sense names (dictionary glosses)
  abats = {‘porc’, ‘volaille’, ‘bœuf’, …}
• Patterns
  ‘abats de volaille’, ‘abats en volaille’, …
• Pattern validation with co-occurrences
  relative number of hits in Google
• Difficulties
  ‘dog meat’ → ‘viande pour chien’ / ‘viande de chien’ ?

Lexical Augmentation 2/2
Procedure
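The procedure above (generate candidate patterns, then keep the one that co-occurrence counts support) can be sketched as follows. Several things here are assumptions: the slides do not say what the hit counts are relative to, so the sketch uses each pattern's share of the hits among competing patterns; the 0.5 threshold, the function names and the counts are all hypothetical; and the counts are passed in as a plain dictionary rather than fetched from any search service.

```python
def candidate_patterns(head, modifier, prepositions=("de", "en")):
    """Candidate compounds such as 'abats de volaille', 'abats en volaille'."""
    return [f"{head} {prep} {modifier}" for prep in prepositions]

def relative_hits(patterns, hit_counts):
    """Share of the total hits obtained by each competing pattern."""
    total = sum(hit_counts.get(p, 0) for p in patterns)
    if total == 0:
        return {p: 0.0 for p in patterns}
    return {p: hit_counts.get(p, 0) / total for p in patterns}

def best_pattern(head, modifier, hit_counts, threshold=0.5):
    """Pattern with the largest share of hits, or None if no pattern
    reaches the (assumed) threshold."""
    patterns = candidate_patterns(head, modifier)
    shares = relative_hits(patterns, hit_counts)
    winner = max(patterns, key=shares.get)
    return winner if shares[winner] >= threshold else None

# Made-up counts, for illustration only.
hits = {"abats de volaille": 900, "abats en volaille": 3}
print(best_pattern("abats", "volaille", hits))
print(best_pattern("abats", "porc", hits))
```

The ‘viande pour chien’ / ‘viande de chien’ difficulty shows the limit of this filter: both patterns are frequent and valid, but they mean different things, so hit counts alone cannot choose between them.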

Clustering
• Promising results
  manual evaluation on 100 difficult terms: 70 % proper clusters, 30 % wrong assignments (locutions)
• Problem: increasing the number of clusters; maturing of the basic clusters

Sense Naming, complementary to conceptual vectors
• Good precision
  manual evaluation: 90 % of pertinent terms
  automatic evaluation: 70 % (angular distance)
• Towards a synthesis grammar: botte/secret/ → botte/secrète/

Future work
• More criteria (associative memory, more lexical functions)
• Enhance definition analysis (meta-language)

Conclusion

Theoretical
- formalization of the ‘disambiguation capacity’ and of the ‘risk of non-sense’
- formalization of annotation in lexical semantics
- proposal of a generic similarity measure between definitions

Practical
- implementation as agents
- categorization, naming (services on the Web)
- lexical augmentation (in progress)

Dissemination
- a poster at RECITAL’2003 (Batz-sur-Mer, 10-14 June 2003)
- a paper at Papillon’2003 (Sapporo, 2-6 July 2003)
- submission to RFIA’2004

Contribution

Thank you
