CSE Department, I.I.T. Bombay

Automatic Lexicon Generation through WordNet

by

Nitin Verma and Pushpak Bhattacharyya

Jan 21, 2004

Introduction

A lexicon is the heart of any natural language processing system, but it is difficult to construct and requires an enormous amount of time and manpower.

Document-specific dictionary generation:

– Given a document D and a word W in it, which sense S of W should be picked up from the document?
– Can one construct a document-specific dictionary in which a single sense of each word is stored?

Introduction (UW dictionary)

The UW dictionary is an important machine-readable lexical resource used by the enconverter and deconverter software.

[Figure: the Enconverter takes natural language as input and produces UNL, using the UW dictionary and the analysis rules.]

Introduction (UW dictionary)

Format of dictionary entries:

[crane] “crane(icl>bird)” (N, ANIMT, FAUNA, BIRD);

Here [crane] is the headword (HW), “crane(icl>bird)” is the UW with its restriction (icl>bird), and (N, ANIMT, FAUNA, BIRD) is the list of attributes (both syntactic and semantic).

– Semantic attributes (derived from the ontology).
– Syntactic attributes (POS, person, number, tense).
– The attributes are used for the firing of appropriate analysis rules.
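The entry format above can be read mechanically. The following is a minimal sketch of such a reader, not the authors' code: the names `UWEntry` and `parse_entry` are made up for illustration, and the regular expression assumes exactly the three-part layout shown on the slide (headword in square brackets, UW in quotes, attributes in parentheses).

```python
import re
from typing import NamedTuple


class UWEntry(NamedTuple):
    """One dictionary line split into its three parts (hypothetical type)."""
    headword: str
    uw: str
    attributes: list[str]


# Assumed layout: [headword] "universal word" (ATTR, ATTR, ...);
# Both straight and curly quotes are accepted around the UW.
ENTRY_RE = re.compile(
    r'^\[(?P<hw>[^\]]+)\]\s*["“](?P<uw>[^"”]+)["”]\s*\((?P<attrs>[^)]*)\)\s*;?\s*$'
)


def parse_entry(line: str) -> UWEntry:
    """Split a dictionary line into headword, UW and attribute list."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a dictionary entry: {line!r}")
    attrs = [a.strip() for a in match.group("attrs").split(",") if a.strip()]
    return UWEntry(match.group("hw"), match.group("uw"), attrs)


entry = parse_entry('[crane] "crane(icl>bird)" (N, ANIMT, FAUNA, BIRD);')
print(entry.headword, entry.uw, entry.attributes)
```

The first attribute (N) is the syntactic one; the rest come from the ontology.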

Introduction (Ontology*)

Animate (ANIMT)
– Flora (FLORA)
    Shrubs (ANIMT, FLORA, SHRB), e.g. jasmine
    Aquatic plants (ANIMT, FLORA, AQTC), e.g. lotus
    ….
– Fauna (FAUNA)
    Mammals (MML)
    Reptiles (ANIMT, FAUNA, RPTL), e.g. lizard
    Birds (ANIMT, FAUNA, BIRD)
    Fish (ANIMT, FAUNA, FISH)
    Insects (ANIMT, FAUNA, INSCT), e.g. butterfly
    ……

* Dictionary group, CFILT, IIT Bombay.

English-UW dictionary generation

Resources used:

– English WordNet, a WSD* system (soft word sense disambiguation method), the UNL KB and an inferencer.

Knowledge-based approach.

* G. Ramakrishnan and P. Bhattacharyya. Soft Word Sense Disambiguation, GWN 2004.

English-UW dictionary generation (Method)

Stage 1 – The input document (Word1 Word2 ...) is passed through the WSD* system, which outputs a POS- and sense-tagged document (Word1:N:1 Word2:N:3 ...).

Stage 2 – The tagged document (Word1:pos1:sense1 Word2:pos2:sense2 ...) is passed to the inference engine, which consults the KB (WordNet, the UNL KB and the database of rules) and produces the UW dictionary together with an explanation.
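The tagged document uses tokens of the form word:pos:sense, such as crane:N:4. A small sketch of how such tokens could be split before they reach the inference engine is given below; `TaggedWord` and `parse_tagged` are illustrative names, and whitespace-separated tokens are an assumption about the file layout.

```python
from typing import NamedTuple


class TaggedWord(NamedTuple):
    """A word with its POS tag and WordNet sense number (hypothetical type)."""
    word: str
    pos: str
    sense: int


def parse_tagged(text: str) -> list[TaggedWord]:
    """Split 'word:pos:sense' tokens, assumed to be whitespace-separated."""
    tagged = []
    for token in text.split():
        # rsplit keeps any ':' inside the word itself intact.
        word, pos, sense = token.rsplit(":", 2)
        tagged.append(TaggedWord(word, pos, int(sense)))
    return tagged


print(parse_tagged("crane:N:4 word2:N:3"))
```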

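UW generation for nouns

1. A tagged word, e.g. crane:N:4, is taken from the tagged document (crane:N:4 Word2:pos2:sense2 ...) and given to the inference engine.
2. The inference engine sends a query to the KB (WordNet) to collect semantic information.
3. The hypernyms are returned: crane → bird → fauna, animal → organism.
4. The inference engine sends a query to the KB (UNL KB) to collect relevant rules.
5. The matching UNL KB entries are returned:

   depth   word           relation   restriction
   6       bird           icl        animal
   5       animal         icl        living thing
   4       living thing   null       null

6. The UW is generated: crane(icl>bird).
7. An explanation is produced.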
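The noun walk-through can be sketched in a few lines. This is only an illustration under a stated assumption: that the restriction is taken from the nearest hypernym that also appears in the UNL KB (bird, at depth 6 in the example). The function name `noun_uw` and the in-memory data are hypothetical; the real system queries WordNet and a database.

```python
def noun_uw(word, hypernym_chain, kb_words):
    """Return word(icl>h) for the nearest hypernym h found in the UNL KB.

    hypernym_chain lists synonym groups from the nearest hypernym outwards.
    Assumption: the nearest hypernym present in the KB supplies the restriction.
    """
    for hypernym_group in hypernym_chain:
        for hypernym in hypernym_group:
            if hypernym in kb_words:
                return f"{word}(icl>{hypernym})"
    return word  # no restriction could be attached


# Hypernyms of crane:N:4 as shown on the slide, nearest first.
chain = [["bird"], ["fauna", "animal"], ["organism"]]

# UNL KB rows from the slide: word -> (relation, restriction).
unl_kb = {
    "bird": ("icl", "animal"),
    "animal": ("icl", "living thing"),
    "living thing": (None, None),
}

print(noun_uw("crane", chain, unl_kb))
```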

UW generation for verbs

Input word:

1. If {hypernyms(word)} ∩ {‘be’, ‘continue’, etc.} ≠ ∅, generate (icl>be), e.g. exist(icl>be).
2. Otherwise, if {hypernyms(nominal word)} ∩ {‘phenomenon’, ‘natural event’, etc.} ≠ ∅, generate (icl>occur), e.g. rain(icl>occur).
3. Otherwise, generate (icl>do), e.g. make(icl>do).
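The verb decision procedure amounts to two set-intersection tests followed by a default. The sketch below reads each test on the slide as "the intersection is non-empty", which is an interpretation of the extracted text. The trigger sets contain only the members named on the slide (the "etc." is not filled in), and the hypernym sets in the usage lines are hand-written placeholders rather than WordNet output.

```python
# Trigger sets: only the members named on the slide; the full lists are unknown.
BE_LIKE = frozenset({"be", "continue"})
OCCUR_LIKE = frozenset({"phenomenon", "natural event"})


def verb_restriction(verb_hypernyms, nominal_hypernyms):
    """Choose icl>be, icl>occur or icl>do from the two hypernym sets."""
    if BE_LIKE & set(verb_hypernyms):
        return "icl>be"
    if OCCUR_LIKE & set(nominal_hypernyms):
        return "icl>occur"
    return "icl>do"


def verb_uw(word, verb_hypernyms, nominal_hypernyms):
    """Attach the chosen restriction to the word."""
    return f"{word}({verb_restriction(verb_hypernyms, nominal_hypernyms)})"


# Placeholder hypernym sets, written by hand for illustration.
print(verb_uw("exist", {"be"}, set()))
print(verb_uw("rain", {"precipitate"}, {"precipitation", "phenomenon"}))
print(verb_uw("make", {"create"}, set()))
```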

UW generation for adjectives

Input word:

1. Is a UW for the word present in the UNL KB? If yes, pick that UW, e.g. broad(aoj>thing).
2. If no, is the is_a_value_of relation defined (IS_DEFINED) on the input word? If yes, generate (aoj>thing), e.g. good(aoj>thing).
3. If no, generate (mod>thing), e.g. green(mod>thing).
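The adjective flowchart is a three-way fallback. A minimal sketch follows, with the UNL KB modelled as a plain dictionary and the is_a_value_of check passed in as a function; both are stand-ins for the database and WordNet lookups, and the name `adjective_uw` is hypothetical.

```python
def adjective_uw(word, unl_kb, has_is_a_value_of):
    """Follow the flowchart: KB lookup, then is_a_value_of test, then default."""
    if word in unl_kb:
        return unl_kb[word]            # UW already present in the UNL KB
    if has_is_a_value_of(word):
        return f"{word}(aoj>thing)"    # is_a_value_of is defined on the word
    return f"{word}(mod>thing)"        # default case


# Toy stand-ins for the real resources (assumptions for illustration only).
toy_kb = {"broad": "broad(aoj>thing)"}
value_adjectives = {"good"}

for adjective in ("broad", "good", "green"):
    print(adjective_uw(adjective, toy_kb, value_adjectives.__contains__))
```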

Semantic attribute generation

1. A tagged word, e.g. crane:N:4, is taken from the tagged document (crane:N:4 Word2:pos2:sense2 ...) and given to the inference engine.
2. The inference engine sends a query to the KB (WordNet) to collect semantic information.
3. The hypernyms are returned: crane → bird → fauna, animal → organism.
4. The inference engine sends a query to the KB (database of rules) to collect relevant rules.
5. The matching rules are returned:

   IF hypernym=‘organism’ THEN generate ‘ANIMT’ ELSE generate ‘INANI’;
   IF hypernym=‘fauna’ THEN generate ‘FAUNA’;
   IF hypernym=‘bird’ THEN generate ‘BIRD’;
   ...

6. The attribute list is generated: (N, ANIMT, FAUNA, BIRD).
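The three rules shown for crane can be applied with a simple lookup over the hypernym chain. The sketch below is an illustration only: the rule table holds just the noun rules quoted on the slides, the function name `noun_attributes` is made up, and walking the chain from the most general hypernym to the nearest one is an assumption made so that the output order matches (N, ANIMT, FAUNA, BIRD).

```python
# Noun rules quoted on the slides: hypernym -> attribute.
NOUN_RULES = {"flora": "FLORA", "fauna": "FAUNA", "bird": "BIRD"}


def noun_attributes(hypernyms_nearest_first):
    """Build the attribute list for a noun from its hypernyms."""
    hypernyms = list(hypernyms_nearest_first)
    attributes = ["N"]
    # IF hypernym='organism' THEN 'ANIMT' ELSE 'INANI'
    attributes.append("ANIMT" if "organism" in hypernyms else "INANI")
    # Assumed order: most general hypernym first, nearest hypernym last.
    for hypernym in reversed(hypernyms):
        attribute = NOUN_RULES.get(hypernym)
        if attribute and attribute not in attributes:
            attributes.append(attribute)
    return attributes


print(noun_attributes(["bird", "fauna", "animal", "organism"]))
```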

Semantic attribute generation (Database of rules)

No. of such rules: 4344

Table 1. Rules for nouns (96)

   HYPERNYM      ATTRIBUTE
   organism      ANIMT
   flora         FLORA
   fauna         FAUNA
   bird          BIRD

Table 2. Rules for verbs (405)

   HYPERNYM      ATTRIBUTE
   change        VOA,CHNG
   communicate   VOA,COMM
   move          VOA,MOTN
   complete      VOA,CMPLT

Table 3.1. Rules for adjectives (29)

   IS_A_VALUE_OF   ATTRIBUTE
   weight          DES,WT
   strength        DES,STRNGTH
   qual            DES,QUAL

Table 3.2. Rules for adjectives (3258)

   SYNONYMY OR ANTONYMY   ATTRIBUTE
   bright                 DES,APPR
   deep                   DES,DPTH
   shallow                DES,DPTH

Table 4. Rules for adverbs (556)

   SYNONYMY      ATTRIBUTE
   backward      DRCTN
   always        FREQ
   frequent      FREQ
   beautifully   MAN
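As a quick consistency check, the per-table rule counts given in the captions add up to the stated total of 4344. The snippet only restates numbers from the slide; the dictionary name is arbitrary.

```python
# Rule counts taken from the table captions.
rule_counts = {
    "nouns (Table 1)": 96,
    "verbs (Table 2)": 405,
    "adjectives, is_a_value_of (Table 3.1)": 29,
    "adjectives, synonymy or antonymy (Table 3.2)": 3258,
    "adverbs (Table 4)": 556,
}

total_rules = sum(rule_counts.values())
print(total_rules)
```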

Experiments and results

Precision = (No. of correct entries in the dictionary) / (Total no. of entries in the dictionary)

[Figures: precision (%) plotted against document number (1 to 10), one plot each for nouns, verbs, adjectives and adverbs.]

Precision for nouns – 93.9%
Precision for verbs – 84.4%
Precision for adjectives – 90.06%
Precision for adverbs – 86%
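The precision measure used in the plots is the ratio of correct entries to total entries, expressed as a percentage. A short sketch of the computation follows; the function name and the counts in the usage line are invented for illustration and are not figures from the experiments.

```python
def precision(correct_entries: int, total_entries: int) -> float:
    """Precision in percent: correct entries divided by total entries."""
    if total_entries <= 0:
        raise ValueError("total_entries must be positive")
    return 100.0 * correct_entries / total_entries


# Hypothetical counts for one document, not taken from the slides.
print(precision(47, 50))
```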

Implementation details

Subtasks identified:

– A MySQL database is used for storing the rules and the UNL KB.
    7540 entries in the UNL KB.
    4344 entries in the rule base.
– The inference engine is written in C++.
– The web interface of the DDG is written in CGI and PHP.
– Other utilities, such as the UNL KB organizer, the rule entry interface and the WSD integrator, are implemented in Perl.
– LOC: 4761
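To make the rule storage concrete, here is a sketch of how a rule table could be laid out and queried. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column names and helper functions are all assumptions; the actual schema is not given on the slides. The three rows are copied from the rule tables shown earlier.

```python
import sqlite3


def build_rule_base() -> sqlite3.Connection:
    """Create an in-memory rule table (hypothetical schema) with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rules ("
        "pos TEXT, relation TEXT, trigger_word TEXT, attributes TEXT)"
    )
    conn.executemany(
        "INSERT INTO rules VALUES (?, ?, ?, ?)",
        [
            ("N", "HYPERNYM", "bird", "BIRD"),
            ("V", "HYPERNYM", "move", "VOA,MOTN"),
            ("A", "IS_A_VALUE_OF", "weight", "DES,WT"),
        ],
    )
    return conn


def lookup(conn, pos, relation, trigger_word):
    """Return the attribute list for a matching rule, or [] if none matches."""
    row = conn.execute(
        "SELECT attributes FROM rules "
        "WHERE pos = ? AND relation = ? AND trigger_word = ?",
        (pos, relation, trigger_word),
    ).fetchone()
    return row[0].split(",") if row else []


rule_base = build_rule_base()
print(lookup(rule_base, "V", "HYPERNYM", "move"))
```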

Demo

Hindi-UW dictionary generation (Method)

1. The WordNet API is used to obtain all possible parts of speech and all possible senses for every word.
2. The Hindi WN is queried (using the Hindi WN API) to obtain the semantic attributes.
3. The Hindi UW dictionary database is queried (on the basis of the input word and its POS) to obtain an appropriate UW.
4. The irrelevant entries are disabled and the incorrect ones are corrected manually by the lexicographer.
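The four Hindi steps can be pictured as candidate generation followed by manual review. In the sketch below the three lookups are passed in as plain functions because the Hindi WN API and the Hindi UW dictionary database are not described on the slides; every name here (`Candidate`, `generate_candidates`, the stub data) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate dictionary entry awaiting review (hypothetical type)."""
    word: str
    pos: str
    sense: int
    attributes: list
    uw: str
    enabled: bool = True  # step 4: the lexicographer may disable or correct it


def generate_candidates(word, senses_of, attributes_of, uw_of):
    """Steps 1 to 3: enumerate senses, then attach attributes and a UW."""
    candidates = []
    for pos, sense in senses_of(word):                 # step 1
        attributes = attributes_of(word, pos, sense)   # step 2
        uw = uw_of(word, pos)                          # step 3
        candidates.append(Candidate(word, pos, sense, attributes, uw))
    return candidates


# Stub lookups with invented data, standing in for the real resources.
stub_senses = {"w": [("N", 1), ("N", 2)]}
entries = generate_candidates(
    "w",
    lambda word: stub_senses.get(word, []),
    lambda word, pos, sense: [pos],
    lambda word, pos: f"{word}(icl>thing)",
)
entries[1].enabled = False  # step 4: an irrelevant sense is disabled by hand
print(entries)
```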

Demo

Conclusion and future work

– The burden of lexicography has been reduced considerably.
– The system is being routinely used in our work on machine translation in a tri-language setting (English, Hindi and Marathi).
– Future work will be directed towards the implementation of a part-of-speech tagger and a word sense disambiguator for Hindi and Marathi.