Cooperating Techniques for Extracting Conceptual Taxonomies from Text
DESCRIPTION
The current abundance of electronic documents requires automatic techniques that support users in understanding their content and extracting useful information. To this aim, it is important to have conceptual taxonomies that express common sense and implicit relationships among concepts. This work proposes a mix of several techniques that are brought to cooperation for learning them automatically. Although the work is at a preliminary stage, interesting initial results suggest extending and improving the approach. More details can be found here: http://www.di.uniba.it/~loglisci/MCP2011/mce2011.pdf
TRANSCRIPT
Università degli studi di Bari “Aldo Moro”
Dipartimento di Informatica
Cooperating Techniques for Extracting Conceptual Taxonomies from Text
S. Ferilli, F. Leuzzi, F. Rotella
AI*IA 2011 XIIth Conference of the Italian Association for Artificial Intelligence
Workshop on Mining Complex Patterns (MCP 2011)
Palermo, Italy, September 17, 2011
L.A.C.A.M. http://lacam.di.uniba.it:8000
Overview
1. Introduction & Objectives
2. Extraction of knowledge from text
3. Knowledge representation formalism
4. Identification of relevant concepts
5. Generalization of similar concepts
6. Reasoning ‘by association’
7. Conclusions & Future works
Cooperating Techniques for Extracting Conceptual Taxonomies from Text - S. Ferilli, F. Leuzzi, F. Rotella 2
Introduction
The spread of electronic documents and document repositories has generated the need for automatic techniques that understand and handle document content, in order to help users satisfy their information needs.
Full Text Understanding is not trivial, due to:
1. the intrinsic ambiguity of natural language;
2. the huge amount of common sense and conceptual background knowledge required.
Lexical and/or conceptual taxonomies help in facing these problems, but building them manually is very costly and error prone.
Introduction
The lack of such taxonomies is a strong motivation towards the automatic construction of conceptual networks by mining large amounts of documents in natural language.
However, even assuming a correct knowledge representation, we are still far from simulating human abilities.
Objectives
1. Definition of a representation formalism for knowledge
extracted from natural language texts
2. Extraction of concepts and relevance assessment
3. Generalization of concepts having similar descriptions
4. Definition of a kind of reasoning by concept association that
looks for possible indirect connections between two
identified concepts
Extraction of knowledge from text
Stanford Parser [1]
Stanford Dependencies [2]
The final output of the Stanford Dependencies is a typed syntactic structure of each sentence.
Knowledge is extracted by processing each sentence separately.
Knowledge representation formalism
Among all grammatical roles played by words in a sentence,
only subject, verb and complement have been considered.
In the final conceptual graph subjects and complements will
represent concepts, while verbs will express relations between
them.
[Figure: extracted structures <subject,verb,complement> and <subject,complement>]
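The mapping from triples to a conceptual graph can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_concept_graph`, the directed adjacency-map layout and the sample triples are all assumptions made for the example.

```python
from collections import defaultdict


def build_concept_graph(triples):
    """Build a directed adjacency map from <subject, verb, complement> triples.

    Nodes are subjects and complements (concepts); each edge carries the
    set of verbs (relations) observed between the two concepts.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for subject, verb, complement in triples:
        graph[subject][complement].add(verb)
        # Touch the complement so it exists as a node even without outgoing edges.
        graph[complement]
    return graph


# Hypothetical triples, for illustration only.
triples = [
    ("user", "access", "network"),
    ("user", "use", "network"),
    ("adult", "use", "platform"),
]
graph = build_concept_graph(triples)
print(sorted(graph["user"]["network"]))
```

Keeping all verbs on an edge, rather than choosing one, mirrors the open point listed among the future works.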
Identification of relevant concepts
A mix of several techniques is brought to cooperation for identifying relevant concepts:
● Hub Words [3]: words having high frequency, whose relevance is computed as:
$$W(t) = \alpha w_0 + \beta n + \gamma \sum_{i=1}^{n} w(t_i)$$
where: w_0 is the initial weight; n is the number of relationships; w(t_i) is the tf*idf weight of the i-th word related to t.
● Keyword extraction techniques from single documents.
● EM Clustering provided by Weka [4], based on Euclidean distance.
Identification of relevant concepts
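Inspired by the Hub Words approach, we have defined a Relevance Weight:
$$W(\bar{c}) = \underbrace{\alpha \frac{w(\bar{c})}{\max_c w(c)}}_{A} + \underbrace{\beta \frac{e(\bar{c})}{\max_c e(c)}}_{B} + \underbrace{\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}}_{C} + \underbrace{\delta \frac{d_M - d(\bar{c})}{d_M}}_{D} + \underbrace{\varepsilon \frac{k(\bar{c})}{\max_c k(c)}}_{E}$$
where: $\alpha + \beta + \gamma + \delta + \varepsilon = 1$
Nodes in the network are ranked by decreasing Relevance Weight.
A suitable cut-point in the ranking is determined by choosing the first item $c_k$ such that:
$$W(c_k) - W(c_{k+1}) \ge p \cdot \max_{i=0,\dots,n-1} \left( W(c_i) - W(c_{i+1}) \right)$$
where: $p \in [0,1]$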
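The cut-point rule can be sketched in a few lines. This is a hedged illustration: the function name `cut_point` and the sample weights are hypothetical, and reading the result as "keep items c_0 to c_k" is an assumption not stated on the slide.

```python
def cut_point(weights, p):
    """Return the index k of the first item whose gap to the next item
    is at least p times the largest gap in the ranking.

    `weights` must already be sorted in decreasing order; p lies in [0, 1].
    """
    if len(weights) < 2:
        raise ValueError("need at least two ranked items")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # Gap between each item and its successor in the ranking.
    gaps = [weights[i] - weights[i + 1] for i in range(len(weights) - 1)]
    threshold = p * max(gaps)
    for k, gap in enumerate(gaps):
        if gap >= threshold:
            return k
    raise AssertionError("unreachable: the largest gap always meets the threshold")


# Hypothetical ranking, for illustration only.
ranking = [0.9, 0.8, 0.4, 0.35, 0.1]
print(cut_point(ranking, 1.0), cut_point(ranking, 0.2))
```

With p = 1 the cut falls at the largest gap in the ranking; smaller values of p let an earlier, smaller gap trigger the cut.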
Identification of relevant concepts
Relevance Weight in detail
Definition of the Initial Weight
The whole set of triples <subject,verb,complement> is represented in a Concepts x Attributes matrix V, recalling the classical Terms x Documents Vector Space Model.
Resembling tf*idf:
$$\frac{f_{i,j}}{\sum_k f_{k,j}} \cdot \log \frac{|A|}{|\{ j : c_i \in a_j \}|}$$
Therefore component A is:
$$\alpha \frac{w(\bar{c})}{\max_c w(c)}$$
where w(c) is the initial weight assigned to node c, computed according to the above tf*idf schema.
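The tf*idf-like weighting of the Concepts x Attributes matrix can be sketched as below. This is only an illustration under stated assumptions: the function name `tfidf_like`, the list-of-lists matrix layout and the sample frequencies are hypothetical, and how the weighted entries are aggregated into the single initial weight w(c) is not specified on the slide, so it is left out.

```python
import math


def tfidf_like(freq):
    """Weight a Concepts x Attributes frequency matrix.

    freq[i][j] is the frequency f_ij of concept i with attribute j.
    Each entry becomes (f_ij / sum_k f_kj) * log(|A| / |{j : c_i in a_j}|),
    where |A| is the number of attributes.
    """
    n_concepts = len(freq)
    n_attrs = len(freq[0])
    col_totals = [sum(freq[k][j] for k in range(n_concepts)) for j in range(n_attrs)]
    weighted = []
    for i in range(n_concepts):
        # Number of attributes in which concept i occurs at least once.
        support = sum(1 for j in range(n_attrs) if freq[i][j] > 0)
        idf = math.log(n_attrs / support) if support else 0.0
        row = [
            (freq[i][j] / col_totals[j]) * idf if col_totals[j] else 0.0
            for j in range(n_attrs)
        ]
        weighted.append(row)
    return weighted


# Hypothetical frequencies: 2 concepts (rows) x 3 attributes (columns).
weights = tfidf_like([[2, 0, 0], [1, 1, 1]])
```

A concept that occurs with every attribute gets a zero weight, exactly as a term occurring in every document does in classical tf*idf.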
Connections Number
Component B considers the number of connections (edges) in which $\bar{c}$ is involved:
$$\beta \frac{e(\bar{c})}{\max_c e(c)}$$
Neighborhood Weight Summary
Component C takes into account the average initial weight of all neighbors of $\bar{c}$:
$$\gamma \frac{\sum_{(c,\bar{c})} w(c)}{e(\bar{c})}$$
Inverse Distance from Center
Component D represents the closeness to the center of the cluster:
$$\delta \frac{d_M - d(\bar{c})}{d_M}$$
KE Influence
Component E takes into account the outcome of three KE techniques, suitably weighted:
$$\varepsilon \frac{k(\bar{c})}{\max_c k(c)}$$
where:
$$k(\bar{c}) = \varsigma\, k_{\text{co-occurrences}}(\bar{c}) + \eta\, k_{\text{synset}}(\bar{c}) + \theta\, k_{\text{mvn}}(\bar{c})$$
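● KE based on co-occurrences:
$$k_{\text{co-occurrences}} = \varsigma \frac{\chi^2}{\max_{\text{cluster}} \chi^2}$$
● KE based on WordNet Synsets:
$$k_{\text{synset}} = \eta \frac{kw_{\text{synset}}}{\max(kw_{\text{synset}})}$$
● KE by means of Multivariate Normal Distribution (MVN):
$$k_{\text{mvn}} = \theta \frac{kw_{\text{mvn}}}{\max(kw_{\text{mvn}})}$$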
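Putting components A to E together, the Relevance Weight of a node can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `relevance_weight`, the dictionary-based inputs and the sample values are hypothetical; e(c) is derived here from neighbor lists, d_max is assumed to be the largest distance from the cluster center, and k is taken as already computed from the three KE techniques. The graph is assumed to have at least one edge.

```python
def relevance_weight(node, w, neighbors, d, d_max, k, coeffs):
    """Combine components A..E for `node`.

    w: initial weights; neighbors: adjacency lists; d: distance of each
    node from its cluster center; d_max: assumed maximum distance;
    k: combined KE score; coeffs: (alpha, beta, gamma, delta, epsilon).
    """
    alpha, beta, gamma, delta, epsilon = coeffs
    if abs(sum(coeffs) - 1.0) > 1e-9:
        raise ValueError("coefficients must sum to 1")
    # e(c): number of edges of each node, derived from the neighbor lists.
    e = {c: len(ns) for c, ns in neighbors.items()}
    comp_a = alpha * w[node] / max(w.values())
    comp_b = beta * e[node] / max(e.values())
    # Average initial weight of the neighbors; 0 for an isolated node.
    comp_c = gamma * sum(w[n] for n in neighbors[node]) / e[node] if e[node] else 0.0
    comp_d = delta * (d_max - d[node]) / d_max
    comp_e = epsilon * k[node] / max(k.values())
    return comp_a + comp_b + comp_c + comp_d + comp_e


# Hypothetical values, for illustration only.
w = {"network": 0.5, "user": 0.25, "access": 0.25}
neighbors = {"network": ["user", "access"], "user": ["network"], "access": ["network"]}
d = {"network": 1.0, "user": 2.0, "access": 4.0}
k = {"network": 1.0, "user": 0.5, "access": 0.5}
coeffs = (0.2, 0.2, 0.2, 0.2, 0.2)
w_network = relevance_weight("network", w, neighbors, d, 4.0, k, coeffs)
w_user = relevance_weight("user", w, neighbors, d, 4.0, k, coeffs)
```

Since every component is normalized and the coefficients sum to 1, each coefficient bounds the contribution of its component.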
Identification of relevant concepts
Evaluations
Test #   α      β      γ      δ      ε      p
1        0.10   0.10   0.30   0.25   0.25   1.0
2        0.20   0.15   0.15   0.25   0.25   0.7
3        0.15   0.25   0.30   0.15   0.15   1.0
Test #   Concept      A         B       C        D       E       W
1        network      0.100     0.100   0.021    0.178   0.250   0.649
1        access       0.001     0.001   0.154    0.239   0.250   0.646
1        subset       6.32E-4   0.001   0.150    0.239   0.250   0.641
2        network      0.200     0.150   0.0105   0.178   0.250   0.789
3        network      0.150     0.250   0.021    0.146   0.150   0.717
3        user         0.127     0.195   0.022    0.146   0.150   0.641
3        number       0.113     0.187   0.022    0.146   0.150   0.619
3        individual   0.103     0.174   0.020    0.146   0.150   0.594
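As a quick consistency check on the reported values, W should equal the sum of components A to E. For the concept "network" the figures agree in all three tests (Test 2 up to rounding to three decimals):

```latex
\begin{align*}
\text{Test 1: } W &= 0.100 + 0.100 + 0.021 + 0.178 + 0.250 = 0.649 \\
\text{Test 2: } W &= 0.200 + 0.150 + 0.0105 + 0.178 + 0.250 = 0.7885 \approx 0.789 \\
\text{Test 3: } W &= 0.150 + 0.250 + 0.021 + 0.146 + 0.150 = 0.717
\end{align*}
```

In each test, component A of "network" equals α, which is what the formula gives for the concept with the largest initial weight.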
Generalization of similar concepts
Pairwise clustering
Each concept is described by a binary vector that represents the presence or absence (1 or 0, respectively) of a <subject,complement> relation with each of the other concepts. The Hamming distance between two such vectors evaluates how similar the concepts are.
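The distance between two concept descriptions can be sketched as below. This is a minimal illustration: the function name `hamming_distance` and the sample vectors are hypothetical, and dividing by the vector length is an assumption (the slides do not say whether the distance is normalized).

```python
def hamming_distance(u, v, normalized=True):
    """Count the positions where two binary vectors differ.

    With normalized=True the count is divided by the vector length,
    giving a value in [0, 1].
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    if not u:
        raise ValueError("vectors must not be empty")
    differing = sum(1 for a, b in zip(u, v) if a != b)
    return differing / len(u) if normalized else differing


# Hypothetical descriptions of two concepts over five other concepts.
concept_a = [1, 0, 1, 1, 0]
concept_b = [1, 1, 1, 0, 0]
print(hamming_distance(concept_a, concept_b))
```

Two concepts become candidates for generalization when this distance falls below a chosen threshold.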
Generalization of similar concepts
WordNet
WordNet (http://wordnet.princeton.edu/) is an external resource that has some useful properties:
1. it is a lexical taxonomy;
2. each concept is described as a set of synonyms (synset);
3. synsets are interlinked by means of conceptual-semantic and lexical relations.
We focus on hypernymy, a relation that links the current synset to more general ones.
Generalization of similar concepts
Taxonomical similarity function
More general: provides a similarity value on the basis of common relations, without focusing on the specific path.
More specific: provides a similarity value on the basis of common relations, relying on the specific path.
Generalization of similar concepts
WSD Domain Driven
One Domain per Discourse assumption: many uses of a word
in a coherent portion of text tend to share the same domain.
1. Extraction of all synsets for each term
2. Extraction of all domains for each synset
3. Identification of the prevalent domain
4. Choice of the synset belonging to the prevalent domain
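The domain-driven disambiguation steps can be sketched with toy data. This is an illustration under stated assumptions: the function name `disambiguate`, the synset identifiers and the domain labels are hypothetical (they are not real WordNet or WordNet Domains entries), and falling back to the first candidate synset when none belongs to the prevalent domain is an assumption.

```python
from collections import Counter


def disambiguate(terms, synsets_of, domains_of):
    """Pick, for each term, a synset belonging to the prevalent domain."""
    # Collect the domains of every synset of every term.
    domain_counts = Counter()
    for term in terms:
        for synset in synsets_of.get(term, []):
            domain_counts.update(domains_of.get(synset, []))
    if not domain_counts:
        return None, {}
    # The prevalent domain is the most frequent one.
    prevalent = domain_counts.most_common(1)[0][0]
    # For each term, prefer a synset labeled with the prevalent domain.
    chosen = {}
    for term in terms:
        candidates = synsets_of.get(term, [])
        in_domain = [s for s in candidates if prevalent in domains_of.get(s, [])]
        if in_domain:
            chosen[term] = in_domain[0]
        elif candidates:
            chosen[term] = candidates[0]  # assumed fallback
    return prevalent, chosen


# Hypothetical synsets and domains, for illustration only.
synsets_of = {
    "network": ["network#1", "network#2"],
    "platform": ["platform#1", "platform#2"],
    "user": ["user#1"],
}
domains_of = {
    "network#1": ["computer_science"],
    "network#2": ["telecommunication"],
    "platform#1": ["architecture"],
    "platform#2": ["computer_science"],
    "user#1": ["computer_science", "person"],
}
prevalent, chosen = disambiguate(["network", "platform", "user"], synsets_of, domains_of)
```

Counting domains over the whole text portion is what makes the One Domain per Discourse assumption operational here.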
Generalization of similar concepts
Evaluations
Two toy experiments have been performed with the Hamming distance threshold set to 0.001 and 0.0001 respectively, while the taxonomical similarity function threshold has been kept equal to 0.4.
Reasoning ‘by association’
Breadth-First Search
Given two nodes (concepts), a Breadth-First Search starts from both nodes: each search looks for the frontier of the other, until the two frontiers meet in common nodes. Then the path is restored by going backward to the roots in both directions.
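The two-sided search can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `bidirectional_bfs`, the undirected adjacency-list layout and the toy graph are assumptions, and verb labels on the edges are omitted for brevity.

```python
from collections import deque


def bidirectional_bfs(graph, start, goal):
    """Return a path between start and goal, or None if they are not connected.

    graph maps each node to an iterable of neighbors (treated as undirected).
    """
    if start == goal:
        return [start]
    parents_a, parents_b = {start: None}, {goal: None}
    queue_a, queue_b = deque([start]), deque([goal])

    def expand(queue, parents, other_parents):
        # Expand one whole level; return a node where the two frontiers meet.
        for _ in range(len(queue)):
            node = queue.popleft()
            for nxt in graph.get(node, ()):
                if nxt not in parents:
                    parents[nxt] = node
                    if nxt in other_parents:
                        return nxt
                    queue.append(nxt)
        return None

    while queue_a and queue_b:
        meet = expand(queue_a, parents_a, parents_b)
        if meet is None:
            meet = expand(queue_b, parents_b, parents_a)
        if meet is not None:
            # Walk back to the start root, then forward to the goal root.
            path, node = [], meet
            while node is not None:
                path.append(node)
                node = parents_a[node]
            path.reverse()
            node = parents_b[meet]
            while node is not None:
                path.append(node)
                node = parents_b[node]
            return path
    return None


# Hypothetical concept graph, for illustration only.
graph = {
    "adult": ["freedom", "platform"],
    "freedom": ["adult"],
    "platform": ["adult", "technology"],
    "technology": ["platform", "internet"],
    "internet": ["technology"],
}
path = bidirectional_bfs(graph, "adult", "internet")
```

Searching from both ends keeps each frontier small compared with a single search that must reach the full distance on its own.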
Reasoning ‘by association’
Evaluations
The table below shows a sample of possible outcomes. E.g., an interpretation of case 5 can be: “the adults write about freedom and use platform, that is recognized as a technology, as well as the internet”.
Conclusions
This work proposes an approach to automatically extract conceptual taxonomies from natural language texts.
It works by mixing different techniques in order to:
● identify relevant terms/concepts in text;
● generalize similar concepts;
● perform some kind of reasoning “by association”.
Preliminary experiments show that this approach can be viable, although extensions and refinements are needed.
A reliable outcome might help users in understanding the text content and machines in automatically performing some kind of reasoning on the taxonomy.
Future works
1. Extending the knowledge representation formalism to express negation.
2. Defining a strategy to make a better choice of weights in Relevance Weight computation.
3. Enriching the adjacency matrix to improve concept descriptions.
4. Exploring alternatives to the One Domain per Discourse (ODD) assumption, to overcome its limits.
5. Extending the taxonomical similarity measures, which currently take into account only the hypernym relation, with other relations to obtain a more accurate similarity.
6. Defining a strategy to prefer one verb rather than keeping all of them in the reasoning ‘by association’ phase.
References
[1] Dan Klein and Christopher D. Manning. Fast exact inference with a factored model for natural language parsing. In Advances in Neural Information Processing Systems, volume 15. MIT Press, 2003.
[2] Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. Generating typed dependency parses from phrase structure trees. In LREC, 2006.
[3] Sang Ok Koo, Soo Yeon Lim, and Sang-Jo Lee. Constructing an ontology based on hub words. In ISMIS’03, pages 93–97, 2003.
[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I.H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.