cooperating techniques for extracting conceptual taxonomies from text

Università degli studi di Bari “Aldo Moro”

Dipartimento di Informatica

Cooperating Techniques for Extracting Conceptual Taxonomies from Text

S. Ferilli, F. Leuzzi, F. Rotella

AI*IA 2011, XIIth Conference of the Italian Association for Artificial Intelligence

Workshop on Mining Complex Patterns (MCP 2011)

Palermo, Italy, September 17, 2011

L.A.C.A.M. http://lacam.di.uniba.it:8000

Overview

1. Introduction & Objectives

2. Extraction of knowledge from text

3. Knowledge representation formalism

4. Identification of relevant concepts

5. Generalization of similar concepts

6. Reasoning ‘by association’

7. Conclusions & Future works


Introduction


The spread of electronic documents and document repositories has generated the need for automatic techniques to understand and handle document content, in order to help users satisfy their information needs.

Full Text Understanding is not trivial, due to:

1. the intrinsic ambiguity of natural language;

2. the huge amount of common sense and conceptual background knowledge required.

Lexical and/or conceptual taxonomies are useful for facing these problems, but building them manually is very costly and error prone.

Introduction


The lack of such taxonomies is a strong motivation towards the automatic construction of conceptual networks by mining large amounts of documents in natural language.

However, even assuming a correct knowledge representation, we are still far from simulating human abilities.

Objectives


1. Definition of a representation formalism for knowledge extracted from natural language texts

2. Extraction of concepts and relevance assessment

3. Generalization of concepts having similar descriptions

4. Definition of a kind of reasoning by concept association that looks for possible indirect connections between two identified concepts

Extraction of knowledge from text


Knowledge is extracted by processing each sentence separately, using:

● Stanford Parser [1]

● Stanford Dependencies [2]

The final output of the Stanford Dependencies is a typed syntactic structure of each sentence.

Knowledge representation formalism


Among all grammatical roles played by words in a sentence, only subject, verb and complement have been considered.

In the final conceptual graph, subjects and complements will represent concepts, while verbs will express relations between them.

<subject,verb,complement>

<subject,complement>
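As a minimal sketch of this formalism, the snippet below stores <subject,verb,complement> triples as a graph in which concept pairs are edges and verbs are edge labels. The function name `build_concept_graph` and the sample triples are illustrative assumptions, not taken from the original system.

```python
from collections import defaultdict


def build_concept_graph(triples):
    """Map each (subject, complement) pair to the set of verbs relating them.

    Hypothetical helper: subjects and complements act as concept nodes,
    verbs act as labels on the edge between the two concepts.
    """
    graph = defaultdict(set)
    for subject, verb, complement in triples:
        graph[(subject, complement)].add(verb)
    return dict(graph)


# Illustrative triples (made up for this sketch).
triples = [
    ("adult", "write", "freedom"),
    ("adult", "use", "platform"),
    ("adult", "read", "freedom"),
]
graph = build_concept_graph(triples)

# The set of concepts is the set of all subjects and complements.
concepts = sorted({concept for pair in graph for concept in pair})
```

Keeping every verb on an edge, rather than a single one, mirrors the fact that the same two concepts may be related by several verbs across sentences.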

Identification of relevant concepts


A mix of several techniques is brought into cooperation for identifying relevant concepts:

● Hub Words [3]: words having high frequency, whose relevance is computed as:

\[ W(t) = \alpha w_0 + \beta n + \gamma \sum_{i=1}^{n} w(t_i) \]

where w_0 is the initial weight, n is the number of relationships, and w(t_i) is the tf*idf weight of the i-th word related to t.

● Keyword extraction techniques from single documents.

● EM Clustering provided by Weka [4], based on Euclidean distance.

Identification of relevant concepts


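Inspired by the Hub Words approach, we have defined a Relevance Weight:

\[
W(\bar{c}) =
\underbrace{\alpha \frac{w(\bar{c})}{\max_c w(c)}}_{A}
+ \underbrace{\beta \frac{e(\bar{c})}{\max_c e(c)}}_{B}
+ \underbrace{\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}}_{C}
+ \underbrace{\delta \frac{d_M - d(\bar{c})}{d_M}}_{D}
+ \underbrace{\varepsilon \frac{k(\bar{c})}{\max_c k(c)}}_{E}
\]

where:

\[ \alpha + \beta + \gamma + \delta + \varepsilon = 1 \]

Nodes in the network are ranked by decreasing Relevance Weight.

A suitable cut-point in the ranking is determined by choosing the first item c_k such that:

\[ W(c_k) - W(c_{k+1}) \geq p \cdot \max_{i=0,\dots,n-1} \left( W(c_i) - W(c_{i+1}) \right) \]

where:

\[ p \in [0,1] \]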
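The Relevance Weight and the cut-point rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the dictionary-based inputs are hypothetical, d_M is assumed to be the maximum distance from the cluster center, and the graph is assumed to be given as a map from each concept to its set of neighbors.

```python
def relevance_weight(c, w, neighbors, d, k, coeffs):
    """Sum of components A..E for concept c (hypothetical helper).

    w: initial weight per concept; neighbors: concept -> set of neighbors;
    d: distance from cluster center; k: KE score; coeffs: the five
    coefficients (alpha, beta, gamma, delta, epsilon), summing to 1.
    """
    alpha, beta, gamma, delta, epsilon = coeffs
    edges = len(neighbors[c])
    max_edges = max(len(n) for n in neighbors.values())
    d_max = max(d.values())  # assumption: d_M is the largest distance

    a = alpha * w[c] / max(w.values())
    b = beta * edges / max_edges
    # Average initial weight of the neighbors of c (0 if c is isolated).
    avg_neighbor_w = sum(w[n] for n in neighbors[c]) / edges if edges else 0.0
    c_comp = gamma * avg_neighbor_w
    d_comp = delta * (d_max - d[c]) / d_max
    e = epsilon * k[c] / max(k.values())
    return a + b + c_comp + d_comp + e


def cut_point(ranked_weights, p):
    """Index k of the first item whose gap to the next one is >= p * max gap.

    ranked_weights must be sorted in decreasing order.
    """
    gaps = [x - y for x, y in zip(ranked_weights, ranked_weights[1:])]
    threshold = p * max(gaps)
    for index, gap in enumerate(gaps):
        if gap >= threshold:
            return index
    return None


# Toy data, made up for this sketch.
w = {"a": 1.0, "b": 0.5}
neighbors = {"a": {"b"}, "b": {"a"}}
d = {"a": 0.0, "b": 2.0}
k = {"a": 1.0, "b": 1.0}
coeffs = (0.2, 0.2, 0.2, 0.2, 0.2)
weight_a = relevance_weight("a", w, neighbors, d, k, coeffs)
weight_b = relevance_weight("b", w, neighbors, d, k, coeffs)
```

With p = 1 the cut falls at the largest gap in the ranking, while smaller values of p accept the first gap that is merely a fraction of the largest one.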

Identification of relevant concepts

Relevance Weight in detail


Definition of the Initial Weight

The whole set of triples <subject,verb,complement> is represented in a Concepts x Attributes matrix V, recalling the classical Terms x Documents Vector Space Model. Its entries are weighted by a scheme resembling tf*idf:

\[ \frac{f_{i,j}}{\sum_k f_{k,j}} \cdot \log \frac{|A|}{|\{ j : c_i \in a_j \}|} \]

Therefore component A is:

\[ \alpha \frac{w(\bar{c})}{\max_c w(c)} \]

where w(c) is the initial weight assigned to node c, computed according to the above tf*idf schema.
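The tf*idf-like weighting above can be illustrated on a small Concepts x Attributes matrix. In this sketch f_{i,j} is assumed to be the entry of V for concept i and attribute j, and |A| is assumed to be the number of attributes (columns); the slides do not say how the per-entry values are aggregated into a single w(c), so the function stops at the per-entry weights. The name `tfidf_matrix` is hypothetical.

```python
import math


def tfidf_matrix(V):
    """Per-entry weights (f_ij / sum_k f_kj) * log(|A| / |{j : c_i in a_j}|).

    V[i][j] is assumed to be the frequency of concept i with attribute j.
    """
    n_attributes = len(V[0])
    column_sums = [sum(row[j] for row in V) for j in range(n_attributes)]
    weighted = []
    for row in V:
        # Number of attributes in which this concept appears at all.
        n_containing = sum(1 for f in row if f > 0)
        idf = math.log(n_attributes / n_containing) if n_containing else 0.0
        weighted.append([
            (f / column_sums[j]) * idf if column_sums[j] else 0.0
            for j, f in enumerate(row)
        ])
    return weighted


# Toy matrix: 2 concepts x 2 attributes (made up for this sketch).
V = [
    [2, 0],
    [2, 2],
]
weights = tfidf_matrix(V)
```

As with classical tf*idf, a concept that appears with every attribute gets a logarithm of 1, hence weight 0, because it does not discriminate among attributes.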


Connections Number

Component B considers the number of connections (edges) in which c is involved:

\[ \beta \frac{e(\bar{c})}{\max_c e(c)} \]

Neighborhood Weight Summary

Component C takes into account the average initial weight of all neighbors of c:

\[ \gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})} \]



Inverse Distance from Center

Component D represents the closeness to the center of the cluster:

\[ \delta \frac{d_M - d(\bar{c})}{d_M} \]

KE Influence

Component E takes into account the outcome of three keyword extraction (KE) techniques, suitably weighted:

\[ \varepsilon \frac{k(\bar{c})}{\max_c k(c)} \]

where:

\[ k(\bar{c}) = \varsigma k_{co\text{-}occurrences}(\bar{c}) + \eta k_{synset}(\bar{c}) + \theta k_{mvn}(\bar{c}) \]

● KE based on co-occurrences:

\[ k_{co\text{-}occurrences} = \varsigma \frac{\chi^2}{\max_{cluster} \chi^2} \]

● KE based on WordNet Synsets:

\[ k_{synset} = \eta \frac{kw_{synset}}{\max(kw_{synset})} \]

● KE by means of Multivariate Normal Distribution (MVN):

\[ k_{mvn} = \theta \frac{kw_{mvn}}{\max(kw_{mvn})} \]



Identification of relevant concepts

Evaluations

Test #   α      β      γ      δ      ε      p
1        0.10   0.10   0.30   0.25   0.25   1.0
2        0.20   0.15   0.15   0.25   0.25   0.7
3        0.15   0.25   0.30   0.15   0.15   1.0

Test #   Concept      A         B       C        D       E       W
1        network      0.100     0.100   0.021    0.178   0.250   0.649
1        access       0.001     0.001   0.154    0.239   0.250   0.646
1        subset       6.32E-4   0.001   0.150    0.239   0.250   0.641
2        network      0.200     0.150   0.0105   0.178   0.250   0.789
3        network      0.150     0.250   0.021    0.146   0.150   0.717
3        user         0.127     0.195   0.022    0.146   0.150   0.641
3        number       0.113     0.187   0.022    0.146   0.150   0.619
3        individual   0.103     0.174   0.020    0.146   0.150   0.594
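As a quick sanity check on the table, W should equal the sum of components A to E for each concept. The snippet below verifies this for the Test 3 rows, using the values reported above; the small tolerance is an assumption to absorb rounding to three decimals.

```python
# Test 3 rows from the evaluation table: (A, B, C, D, E) and the reported W.
rows = {
    "network": ((0.150, 0.250, 0.021, 0.146, 0.150), 0.717),
    "user": ((0.127, 0.195, 0.022, 0.146, 0.150), 0.641),
    "number": ((0.113, 0.187, 0.022, 0.146, 0.150), 0.619),
    "individual": ((0.103, 0.174, 0.020, 0.146, 0.150), 0.594),
}

# Largest absolute difference between the summed components and reported W.
max_error = max(abs(sum(components) - reported)
                for components, reported in rows.values())
```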

Generalization of similar concepts

Pairwise clustering


Pairwise clustering takes into account the description of each concept, consisting of a binary vector that represents the presence or absence (1 or 0 respectively) of a <subject,complement> relation between the involved concepts. The Hamming distance between two such vectors provides an evaluation of their similarity.
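A minimal sketch of the distance computation on two binary concept descriptions follows. The distance is normalized by the vector length here, which is an assumption (the slides do not say whether the raw or the normalized count is used); the function name and the vectors are illustrative.

```python
def hamming_distance(u, v):
    """Fraction of positions in which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


# Each position marks presence (1) or absence (0) of a relation with
# another concept; the values are made up for this sketch.
concept_a = [1, 0, 1, 1]
concept_b = [1, 1, 1, 0]
distance = hamming_distance(concept_a, concept_b)
```

Two concepts would then be candidates for generalization when their distance falls below a chosen threshold.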


WordNet


WordNet (http://wordnet.princeton.edu/) is an external resource that has some useful properties:

1. it is a lexical taxonomy;

2. each concept is described as a set of synonyms (synset);

3. synsets are interlinked by means of conceptual-semantic and lexical relations.

We focus on hypernymy, a relation that links the current synset to more general ones.

Generalization of similar concepts

Taxonomical similarity function


More general: provides a similarity value on the basis of common relations, without focusing on the specific path.

More specific: provides a similarity value on the basis of common relations, relying on the specific path.


WSD Domain Driven


One Domain per Discourse assumption: many uses of a word

in a coherent portion of text tend to share the same domain.

1. Extraction of all synsets for each term

2. Extraction of all domains for each synset

3. Identification of the prevalent domain

4. Choice of the synset belonging to the prevalent domain
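The domain-driven disambiguation steps above can be sketched as follows. All names and data here are hypothetical: the synset identifiers and domain labels are invented for illustration and are not real WordNet entries, and the fallback to the first synset when no synset matches the prevalent domain is an assumption.

```python
from collections import Counter


def prevalent_domain(term_synsets, synset_domains):
    """Most frequent domain over all synsets of all terms."""
    counts = Counter()
    for synsets in term_synsets.values():
        for synset in synsets:
            counts.update(synset_domains.get(synset, set()))
    return counts.most_common(1)[0][0]


def disambiguate(term_synsets, synset_domains):
    """For each term, pick a synset that belongs to the prevalent domain."""
    domain = prevalent_domain(term_synsets, synset_domains)
    chosen = {}
    for term, synsets in term_synsets.items():
        matching = [s for s in synsets if domain in synset_domains.get(s, set())]
        # Assumption: fall back to the first listed synset if none matches.
        chosen[term] = matching[0] if matching else (synsets[0] if synsets else None)
    return domain, chosen


# Invented toy data: term -> candidate synsets, synset -> domains.
term_synsets = {
    "network": ["network_computing", "network_broadcast"],
    "platform": ["platform_stage", "platform_computing"],
    "internet": ["internet_computing"],
}
synset_domains = {
    "network_computing": {"computer_science"},
    "network_broadcast": {"telecommunication"},
    "platform_stage": {"architecture"},
    "platform_computing": {"computer_science"},
    "internet_computing": {"computer_science"},
}
domain, chosen = disambiguate(term_synsets, synset_domains)
```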


Evaluations


Two toy experiments have been performed with the Hamming distance threshold set to 0.001 and 0.0001 respectively, while the threshold of the taxonomical similarity function has been kept equal to 0.4.

Reasoning ‘by association’

Breadth-First Search


Given two nodes (concepts), a Breadth-First Search starts from both nodes: each search looks for the other's frontier, until the two frontiers meet in a common node.

The path is then restored by going backward from the meeting node to the two roots, one in each direction.
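The search described above can be sketched as a bidirectional Breadth-First Search. This is an illustrative implementation, not the authors' code: the graph is assumed to be undirected and given as an adjacency map, and the function name and sample graph are made up.

```python
from collections import deque


def bidirectional_bfs(graph, start, goal):
    """Return a path between start and goal, or None if they are unconnected."""
    if start == goal:
        return [start]
    parents_start = {start: None}
    parents_goal = {goal: None}
    frontier_start = deque([start])
    frontier_goal = deque([goal])

    def expand(frontier, parents, other_parents):
        # Expand one full level; return a node already seen by the other side.
        for _ in range(len(frontier)):
            node = frontier.popleft()
            for neighbor in graph.get(node, ()):
                if neighbor not in parents:
                    parents[neighbor] = node
                    if neighbor in other_parents:
                        return neighbor
                    frontier.append(neighbor)
        return None

    while frontier_start and frontier_goal:
        meeting = expand(frontier_start, parents_start, parents_goal)
        if meeting is None:
            meeting = expand(frontier_goal, parents_goal, parents_start)
        if meeting is not None:
            # Walk back to the start root, then forward to the goal root.
            path = []
            node = meeting
            while node is not None:
                path.append(node)
                node = parents_start[node]
            path.reverse()
            node = parents_goal[meeting]
            while node is not None:
                path.append(node)
                node = parents_goal[node]
            return path
    return None


# Made-up undirected concept graph (adjacency lists are symmetric).
graph = {
    "adult": ["freedom", "platform"],
    "freedom": ["adult"],
    "platform": ["adult", "technology"],
    "technology": ["platform", "internet"],
    "internet": ["technology"],
    "island": [],
}
path = bidirectional_bfs(graph, "adult", "internet")
```

Searching from both ends keeps each frontier small, which matters when the concept graph is large and the two concepts are only indirectly connected.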

Reasoning ‘by association’

Evaluations


The table below shows a sample of possible outcomes. For example, case 5 can be interpreted as: “the adults write about freedom and use platform, that is recognized as a technology, as well as the internet”.

Conclusions


This work proposes an approach to automatically extract a conceptual taxonomy from natural language texts.

It works by mixing different techniques in order to:

● identify relevant terms/concepts in text;

● generalize similar concepts;

● perform some kind of reasoning “by association”.

Preliminary experiments show that this approach can be viable, although extensions and refinements are needed.

A reliable outcome might help users understand the text content, and help machines automatically perform some kind of reasoning on the taxonomy.

Future works


1. Extending the knowledge representation formalism to express negation.

2. Defining a strategy to make a better choice of weights in the Relevance Weight computation.

3. Enriching the adjacency matrix to improve concept descriptions.

4. Exploring alternatives to the One Domain per Discourse (ODD) assumption, to overcome its limits.

5. Adding other relations to the taxonomical similarity measures, which currently take into account only the hypernym relation, to obtain a more accurate similarity.

6. Defining a strategy to prefer one verb rather than keeping all of them in the reasoning ‘by association’ phase.

References


[1] Dan Klein and Christopher D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.

[2] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure trees. In LREC, 2006.

[3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing an ontology based on hub words. In ISMIS’03, pages 93–97, 2003.

[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.
