Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Università degli studi di Bari “Aldo Moro” Dipartimento di Informatica Cooperating Techniques for Extracting Conceptual Taxonomies from Text S. Ferilli, F. Leuzzi, F. Rotella AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence Workshop on Mining Complex Patterns (MCP 2011) Palermo, Italy, September 17, 2011 L.A.C.A.M. http://lacam.di.uniba.it:8000

Page 1: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Università degli studi di Bari “Aldo Moro” Dipartimento di Informatica

Cooperating Techniques for Extracting Conceptual Taxonomies from Text

S. Ferilli, F. Leuzzi, F. Rotella

AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence Workshop on Mining Complex Patterns (MCP 2011)

Palermo, Italy, September 17, 2011

L.A.C.A.M. http://lacam.di.uniba.it:8000

Page 2: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Overview

1. Introduction & Objectives

2. Extraction of knowledge from text

3. Knowledge representation formalism

4. Identification of relevant concepts

5. Generalization of similar concepts

6. Reasoning ‘by association’

7. Conclusions & Future works

Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 2

Page 3: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Introduction

The spread of electronic documents and document repositories has generated the need for automatic techniques that understand and handle document content, in order to help users satisfy their information needs.

Full Text Understanding is not trivial, due to:

1. the intrinsic ambiguity of natural language;

2. the huge amount of common sense and conceptual background knowledge required.

Lexical and/or conceptual taxonomies are useful for facing these problems, even though building them manually is very costly and error-prone.

Page 4: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Introduction

The lack of such taxonomies is a strong motivation towards the automatic construction of conceptual networks by mining large amounts of natural language documents.

However, even assuming a correct knowledge representation, we are still far from simulating human abilities.

Page 5: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Objectives

1. Definition of a representation formalism for knowledge extracted from natural language texts

2. Extraction of concepts and relevance assessment

3. Generalization of concepts having similar descriptions

4. Definition of a kind of reasoning by concept association that looks for possible indirect connections between two identified concepts

Page 6: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Extraction of knowledge from text

Knowledge is extracted by processing each sentence separately, using:

● Stanford Parser [1]

● Stanford Dependencies [2]

The final output of the Stanford Dependencies is a typed syntactic structure of each sentence.

Page 7: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Knowledge representation formalism

Among all the grammatical roles played by words in a sentence, only subject, verb and complement have been considered.

In the final conceptual graph, subjects and complements will represent concepts, while verbs will express relations between them.

[Figure labels: <subject,verb,complement>; <subject,complement>]
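
As a rough illustration of the formalism above, the following sketch turns a list of typed dependencies for one sentence into <subject,verb,complement> triples and then into <subject,complement> pairs. The `Dep` record, the function name and the choice of `dobj`/`iobj` as complement relations are assumptions made for this example; the slides do not say which dependency labels were actually used.

```python
from typing import NamedTuple


class Dep(NamedTuple):
    """One typed dependency (hypothetical record layout for this sketch)."""
    relation: str   # dependency label, e.g. "nsubj" or "dobj"
    governor: str   # head word (here: the verb)
    dependent: str  # dependent word


SUBJECT_RELS = {"nsubj"}
COMPLEMENT_RELS = {"dobj", "iobj"}  # assumption: which labels count as complements


def extract_triples(deps):
    """Pair every subject of a verb with every complement of the same verb."""
    subjects, complements = {}, {}
    for dep in deps:
        if dep.relation in SUBJECT_RELS:
            subjects.setdefault(dep.governor, []).append(dep.dependent)
        elif dep.relation in COMPLEMENT_RELS:
            complements.setdefault(dep.governor, []).append(dep.dependent)
    triples = []
    for verb, subjs in subjects.items():
        for subj in subjs:
            for comp in complements.get(verb, []):
                triples.append((subj, verb, comp))
    return triples


# Toy dependencies for a sentence such as "The adult uses the platform".
deps = [
    Dep("nsubj", "use", "adult"),
    Dep("dobj", "use", "platform"),
    Dep("det", "platform", "the"),  # ignored: neither subject nor complement
]
triples = extract_triples(deps)
pairs = [(subj, comp) for subj, _verb, comp in triples]
```

In the graph, each pair would become an edge between two concept nodes, with the verb kept as the label of the relation.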

Page 8: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Identification of relevant concepts

A mix of several techniques is brought to cooperation for identifying relevant concepts:

● Hub Words [3]: words having high frequency, whose relevance is computed as:

$$W(t) = \alpha w_0 + \beta n + \gamma \sum_{i=1}^{n} w(t_i)$$

where: $w_0$ is the initial weight; $n$ is the number of relationships; $w(t_i)$ is the tf*idf weight of the i-th word related to $t$.

● Keyword extraction techniques from single documents.

● EM Clustering provided by Weka [4], based on Euclidean distance.
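
The Hub Words weight can be written as a one-line function. This is only a sketch: the function name is hypothetical, and it assumes that the number of relationships n coincides with the number of related words whose tf*idf weights are summed.

```python
def hub_word_weight(w0, related_weights, alpha, beta, gamma):
    """W(t) = alpha*w0 + beta*n + gamma*sum(w(t_i)) for one word t.

    related_weights holds the tf*idf weights of the words related to t;
    n is taken to be their count (an assumption of this sketch).
    """
    n = len(related_weights)
    return alpha * w0 + beta * n + gamma * sum(related_weights)


# Illustrative values only, not taken from the slides.
example = hub_word_weight(1.0, [0.5, 0.25], alpha=0.5, beta=0.25, gamma=1.0)
```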

Page 9: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Identification of relevant concepts

Inspired by the Hub Words approach, we have defined a Relevance Weight:

$$W(\bar{c}) = \underbrace{\alpha \frac{w(\bar{c})}{\max_c w(c)}}_{A} + \underbrace{\beta \frac{e(\bar{c})}{\max_c e(c)}}_{B} + \underbrace{\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}}_{C} + \underbrace{\delta \frac{d_M - d(\bar{c})}{d_M}}_{D} + \underbrace{\varepsilon \frac{k(\bar{c})}{\max_c k(c)}}_{E}$$

where: $\alpha + \beta + \gamma + \delta + \varepsilon = 1$

Nodes in the network are ranked by decreasing Relevance Weight.

A suitable cut-point in the ranking is determined by choosing the first item $c_k$ such that:

$$W(c_k) - W(c_{k+1}) \geq p \cdot \max_{i=0,\dots,n-1} \left( W(c_i) - W(c_{i+1}) \right)$$

where: $p \in [0,1]$
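
The Relevance Weight and the cut-point rule can be sketched as two small functions. All names are hypothetical, the inputs to `relevance_weight` are assumed to be precomputed (with at least one edge per concept, so that e > 0), and the example ranking uses made-up numbers rather than the experimental values.

```python
def relevance_weight(w, w_max, e, e_max, neighbour_w_sum, d, d_max, k, k_max,
                     alpha, beta, gamma, delta, epsilon):
    """Sum the five components A..E for a single concept."""
    a = alpha * w / w_max                 # A: normalised initial weight
    b = beta * e / e_max                  # B: normalised number of edges
    c = gamma * neighbour_w_sum / e       # C: average initial weight of neighbours
    dd = delta * (d_max - d) / d_max      # D: closeness to the cluster centre
    ee = epsilon * k / k_max              # E: normalised keyword-extraction score
    return a + b + c + dd + ee


def cut_point(ranked_weights, p):
    """Return the index k of the first item whose gap to the next item
    is at least p times the largest gap in the ranking.

    ranked_weights must already be sorted in decreasing order.
    """
    gaps = [ranked_weights[i] - ranked_weights[i + 1]
            for i in range(len(ranked_weights) - 1)]
    if not gaps:
        return None
    threshold = p * max(gaps)
    for k, gap in enumerate(gaps):
        if gap >= threshold:
            return k
    return None


# Made-up ranking: the largest gap lies between the third and fourth items.
ranking = [0.9, 0.85, 0.8, 0.4, 0.35]
k_strict = cut_point(ranking, p=1.0)   # only the largest gap qualifies
k_loose = cut_point(ranking, p=0.1)    # a small gap is already enough
selected = ranking[:k_strict + 1]      # concepts kept as relevant
```

With p = 1 the ranking is cut exactly at its largest gap, while smaller values of p cut earlier and keep fewer concepts.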

Page 10: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Identification of relevant concepts

Relevance Weight in detail

Definition of the Initial Weight

The whole set of triples <subject,verb,complement> is represented in a Concepts x Attributes matrix V, recalling the classical Terms x Documents Vector Space Model.

Resembling tf*idf:

$$\frac{f_{i,j}}{\sum_k f_{k,j}} \cdot \log \frac{|A|}{|\{ j : c_i \in a_j \}|}$$

Therefore component A is:

$$\alpha \frac{w(\bar{c})}{\max_c w(c)}$$

where w(c) is the initial weight assigned to node c, computed according to the above tf*idf schema.
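
A minimal sketch of the tf*idf-like schema on a small Concepts x Attributes frequency matrix follows. The function name and the toy matrix are assumptions; the sketch produces one weight per matrix cell, and how these are combined into a single w(c) per node is not stated in the slides, so it is left out.

```python
import math


def tfidf_weights(freq):
    """Apply f[i][j] / sum_k f[k][j] * log(|A| / |{j : c_i in a_j}|) cell by cell.

    freq[i][j] is the frequency of concept i with attribute j;
    |A| is the number of attributes (columns).
    """
    n_concepts = len(freq)
    n_attributes = len(freq[0])
    column_totals = [sum(freq[k][j] for k in range(n_concepts))
                     for j in range(n_attributes)]
    weights = []
    for i in range(n_concepts):
        # Number of attributes in which concept i occurs at least once.
        containing = sum(1 for j in range(n_attributes) if freq[i][j] > 0)
        idf = math.log(n_attributes / containing) if containing else 0.0
        row = []
        for j in range(n_attributes):
            tf = freq[i][j] / column_totals[j] if column_totals[j] else 0.0
            row.append(tf * idf)
        weights.append(row)
    return weights


# Toy matrix: concept 0 occurs with one attribute, concept 1 with both.
V = [[2, 0],
     [2, 2]]
weights = tfidf_weights(V)
```

As in classical tf*idf, a concept that occurs with every attribute gets a zero weight, because the logarithmic factor vanishes.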

Page 11: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Identification of relevant concepts

Relevance Weight in detail

Connections Number

Component B considers the number of connections (edges) in which c is involved:

$$\beta \frac{e(\bar{c})}{\max_c e(c)}$$

Neighborhood Weight Summary

Component C takes into account the average initial weight of all neighbors of c:

$$\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}$$
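
Components B and C can both be derived from the edge list of the concept graph, as in the sketch below. It assumes undirected edges with duplicates collapsed, so that e(c) equals the number of distinct neighbors; the function name, the concept names and the weights are illustrative only.

```python
from collections import defaultdict


def components_b_and_c(edges, w, beta, gamma):
    """Compute component B (edge count) and C (average neighbour weight).

    edges: iterable of (concept, concept) pairs, treated as undirected.
    w:     dict mapping each concept to its initial weight.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    e = {c: len(ns) for c, ns in neighbours.items()}
    e_max = max(e.values())
    comp_b = {c: beta * e[c] / e_max for c in e}
    comp_c = {c: gamma * sum(w[n] for n in neighbours[c]) / e[c] for c in e}
    return comp_b, comp_c


# Toy graph with made-up initial weights.
edges = [("network", "user"), ("network", "access")]
w = {"network": 0.4, "user": 0.2, "access": 0.6}
comp_b, comp_c = components_b_and_c(edges, w, beta=0.5, gamma=0.5)
```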

Page 12: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Identification of relevant concepts

Relevance Weight in detail

Inverse Distance from Center

Component D represents the closeness to the center of the cluster:

$$\delta \frac{d_M - d(\bar{c})}{d_M}$$

KE Influence

Component E takes into account the outcome of three keyword extraction (KE) techniques, suitably weighted:

$$\varepsilon \frac{k(\bar{c})}{\max_c k(c)}$$

where:

$$k(\bar{c}) = \varsigma k_{co\text{-}occurrences}(\bar{c}) + \eta k_{synset}(\bar{c}) + \theta k_{mvn}(\bar{c})$$

Page 13: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Identification of relevant concepts

Relevance Weight in detail

● KE based on co-occurrences:

$$k_{co\text{-}occurrences} = \varsigma \frac{\chi^2}{\max_{cluster} \chi^2}$$

● KE based on WordNet Synsets:

$$k_{synset} = \eta \frac{kw_{synset}}{\max(kw_{synset})}$$

● KE by means of Multivariate Normal Distribution (MVN):

$$k_{mvn} = \theta \frac{kw_{mvn}}{\max(kw_{mvn})}$$

Page 14: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Identification of relevant concepts

Evaluations

Test #   α      β      γ      δ      ε      p
1        0.10   0.10   0.30   0.25   0.25   1.0
2        0.20   0.15   0.15   0.25   0.25   0.7
3        0.15   0.25   0.30   0.15   0.15   1.0

Test #   Concept      A         B       C        D       E       W
1        network      0.100     0.100   0.021    0.178   0.250   0.649
1        access       0.001     0.001   0.154    0.239   0.250   0.646
1        subset       6.32E-4   0.001   0.150    0.239   0.250   0.641
2        network      0.200     0.150   0.0105   0.178   0.250   0.789
3        network      0.150     0.250   0.021    0.146   0.150   0.717
3        user         0.127     0.195   0.022    0.146   0.150   0.641
3        number       0.113     0.187   0.022    0.146   0.150   0.619
3        individual   0.103     0.174   0.020    0.146   0.150   0.594

Page 15: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Generalization of similar concepts

Pairwise clustering

Consider the description of each concept, consisting of a binary vector that represents the presence or absence (1 or 0, respectively) of a <subject,complement> relation between the involved concepts. The Hamming distance between these vectors provides a similarity evaluation between the concepts.
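
The comparison of two concept descriptions can be sketched as follows. The function name and the toy vectors are hypothetical, and dividing by the vector length is an assumption of this sketch (suggested by the fractional thresholds used in the experiments), not something the slides state.

```python
def hamming_distance(u, v, normalized=True):
    """Count the positions where two equal-length binary vectors differ.

    With normalized=True the count is divided by the vector length,
    giving a value in [0, 1] (an assumption of this sketch).
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    differing = sum(1 for a, b in zip(u, v) if a != b)
    return differing / len(u) if normalized else differing


# Toy descriptions: one entry per concept in the graph, 1 if a
# <subject,complement> relation with that concept exists.
concept_a = [1, 0, 1, 1]
concept_b = [1, 1, 1, 0]
distance = hamming_distance(concept_a, concept_b)
raw_count = hamming_distance(concept_a, concept_b, normalized=False)
```

A pair of concepts would then be a candidate for generalization when its distance falls below a chosen threshold.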

Page 16: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Generalization of similar concepts

WordNet

WordNet (http://wordnet.princeton.edu/) is an external resource that has some useful properties:

1. it is a lexical taxonomy;

2. each concept is described as a set of synonyms (synset);

3. synsets are interlinked by means of conceptual-semantic and lexical relations.

We focus on hypernymy, a relation that links the current synset to more general ones.

Page 17: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Generalization of similar concepts

Taxonomical similarity function

More general: provides a similarity value on the basis of common relations, without focusing on the specific path.

More specific: provides a similarity value on the basis of common relations, relying on the specific path.

Page 18: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Generalization of similar concepts

WSD Domain Driven

One Domain per Discourse assumption: many uses of a word in a coherent portion of text tend to share the same domain.

1. Extraction of all synsets for each term

2. Extraction of all domains for each synset

3. Prevalent domain identification

4. Choice of the prevalent domain synset
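
The domain-driven disambiguation steps can be sketched with plain dictionaries standing in for the lexical resource. The synset identifiers, domain labels and function names below are invented for the example, and falling back to the first listed synset when none matches the prevalent domain is an assumption, not something the slides specify.

```python
from collections import Counter


def prevalent_domain(terms, synsets_of, domains_of):
    """Count the domains of all synsets of all terms; return the most frequent."""
    counts = Counter()
    for term in terms:
        for synset in synsets_of.get(term, []):
            counts.update(domains_of.get(synset, []))
    return counts.most_common(1)[0][0] if counts else None


def disambiguate(terms, synsets_of, domains_of):
    """For each term, choose a synset that belongs to the prevalent domain."""
    domain = prevalent_domain(terms, synsets_of, domains_of)
    chosen = {}
    for term in terms:
        candidates = synsets_of.get(term, [])
        matching = [s for s in candidates if domain in domains_of.get(s, [])]
        if matching:
            chosen[term] = matching[0]
        elif candidates:
            chosen[term] = candidates[0]  # fallback: assumption of this sketch
    return domain, chosen


# Hypothetical toy resource (not real WordNet identifiers).
synsets_of = {
    "network": ["network.computer", "network.broadcast"],
    "user": ["user.computer"],
    "platform": ["platform.stage", "platform.computer"],
}
domains_of = {
    "network.computer": ["computer_science"],
    "network.broadcast": ["telecommunication"],
    "user.computer": ["computer_science"],
    "platform.stage": ["architecture"],
    "platform.computer": ["computer_science"],
}
domain, chosen = disambiguate(["network", "user", "platform"], synsets_of, domains_of)
```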

Page 19: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Generalization of similar concepts

Evaluations

Two toy experiments have been performed, with the Hamming distance threshold equal to 0.001 and 0.0001 respectively, while the taxonomical similarity function threshold has been kept equal to 0.4.

Page 20: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Reasoning ‘by association’

Breadth-First Search

Given two nodes (concepts), a Breadth-First Search starts from both nodes: the former searches for the latter's frontier and vice versa, until the two frontiers meet at common nodes.

Then the path is restored by going backward to the roots in both directions.
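
A minimal sketch of such a two-sided search is given below. It assumes an undirected graph stored as an adjacency dictionary, alternates one level of expansion per side, and stops at the first node reached from both sides. It returns a connecting path, which is not guaranteed to be the shortest one; the function name and the toy graph are hypothetical.

```python
from collections import deque


def bidirectional_bfs(graph, start, goal):
    """Search from both nodes until the frontiers meet, then rebuild the path."""
    if start == goal:
        return [start]
    parents_s, parents_g = {start: None}, {goal: None}
    frontier_s, frontier_g = deque([start]), deque([goal])

    def expand(frontier, parents, other_parents):
        """Expand one full level; return a meeting node if one is found."""
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbour in graph.get(node, ()):
                if neighbour not in parents:
                    parents[neighbour] = node
                    if neighbour in other_parents:
                        return neighbour
                    frontier.append(neighbour)
        return None

    while frontier_s and frontier_g:
        meet = expand(frontier_s, parents_s, parents_g)
        if meet is None:
            meet = expand(frontier_g, parents_g, parents_s)
        if meet is not None:
            path = []
            node = meet
            while node is not None:          # walk back to the start root
                path.append(node)
                node = parents_s[node]
            path.reverse()
            node = parents_g[meet]
            while node is not None:          # walk back to the goal root
                path.append(node)
                node = parents_g[node]
            return path
    return None


# Toy concept graph (verbs omitted), made up for this example.
graph = {
    "adult": ["freedom", "platform"],
    "freedom": ["adult"],
    "platform": ["adult", "technology"],
    "technology": ["platform", "internet"],
    "internet": ["technology"],
}
path = bidirectional_bfs(graph, "adult", "internet")
```

Keeping a parent map for each side is what allows the path to be restored backward to both roots once a common node is found.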

Page 21: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Reasoning ‘by association’

Evaluations

The table below shows a sample of possible outcomes. E.g., an interpretation of case 5 can be: “the adults write about freedom and use platform, that is recognized as a technology, as well as the internet”.

Page 22: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Conclusions

This work proposes an approach to automatically extract a conceptual taxonomy from natural language texts.

It works by mixing different techniques in order to:

● identify relevant terms/concepts in text;

● generalize similar concepts;

● perform some kind of reasoning “by association”.

Preliminary experiments show that this approach can be viable, although extensions and refinements are needed.

A reliable outcome might help users understand the text content, and help machines automatically perform some kind of reasoning on the taxonomy.

Page 23: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

Future works

1. Extending the knowledge representation formalism to express negation.

2. Defining a strategy to make a better choice of weights in the Relevance Weight computation.

3. Enriching the adjacency matrix to improve concept descriptions.

4. Exploring alternatives to the One Domain per Discourse (ODD) assumption, to overcome its limits.

5. Extending the taxonomical similarity measures, which currently take into account only the hypernym relation, with other relations to obtain a more accurate similarity.

6. Defining a strategy to prefer one verb rather than keeping all of them in the reasoning ‘by association’ phase.

Page 24: Cooperating Techniques for Extracting Conceptual Taxonomies from Text

References

[1] Dan Klein and Christopher D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[2] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure trees. In LREC, 2006.

[3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing an ontology based on hub words. In ISMIS’03, pages 93–97, 2003.

[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.