Ontology Learning. An introduction
Simonetta [email protected], Istituto di Linguistica Computazionale - CNR
Summer School LEX2009 - Ontology in the Legal Domain - ONTOLOGY LEARNING 10 September 2009
Summary
PART 1 Ontology Learning: basics
• Why
• How
• Evaluation
PART 2 Ontology Learning in the Legal domain
• Prior work
• Feasibility study carried out in the legal domain
• Open issues
Why learn ontologies from texts
The problem
• “…manual acquisition and modeling of ontologies still remains a tedious, cumbersome task resulting in a knowledge acquisition bottleneck” (Alexander Maedche, 2002)
A possible solution: data-driven knowledge acquisition
• semi-automatic ontology development, extension and tuning from domain text analysis
• reduces ontology development time and costs
• extracted concepts are “well adapted” to texts
Why learn LEGAL ontologies from texts
The current situation
• a number of legal ontologies have been proposed
  • mostly focusing on upper-level concepts
  • hand-crafted by domain experts
The need
• realistically large knowledge-based applications in the legal domain need ever more comprehensive ontologies that incrementally integrate continuously updated knowledge
A possible solution
• techniques for automated ontology learning from texts can play an increasingly prominent role in the near future
• relatively few attempts have been made so far to automatically induce legal domain ontologies from texts
Ontology Learning from texts as Reverse Engineering
Ontology Learning from texts: the general approach
[Diagram: from Text (implicit knowledge) to Structured content (explicit knowledge), through Linguistic analysis, Knowledge extraction and Dynamic content structuring]
Ontology Learning from texts: how
• carried out by combining Natural Language Processing technologies with Machine Learning techniques
• dynamic and incremental process following the Balanced Cooperative Modeling Paradigm
• semi-automatic development/extension/tuning of the ontology with human intervention
[Diagram labels: ontology, ontology learning, candidate new concepts, new ontology]
Ontology Learning “Layer Cake” (Buitelaar, Cimiano and Magnini, 2005)
The knowledge acquisition process is organised into a “layer cake” of increasingly complex subtasks (listed from the top, most complex layer down to the bottom layer):
• Axioms & Rules: ∀x, y (sufferFrom(x, y) → ill(x))
• (Other) Relations: cure (dom:DOCTOR, range:DISEASE)
• Taxonomy (Concept Hierarchies): is_a (DOCTOR, PERSON)
• Concepts: DISEASE:=<Int,Ext,Lex>
• Synonyms: {disease, illness}
• Terms: disease, illness, hospital
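The layers above can be pictured as plain data structures. The following is only a minimal Python sketch: the class and field names (Concept, intension, extension, lexicalizations, Ontology) are assumptions made for illustration and do not come from the slides or from any ontology learning toolkit.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """A concept as the triple <Int, Ext, Lex> (field names are assumptions)."""
    name: str
    intension: frozenset = frozenset()        # characterising properties
    extension: frozenset = frozenset()        # known instances
    lexicalizations: frozenset = frozenset()  # terms and synonyms


@dataclass
class Ontology:
    concepts: dict = field(default_factory=dict)  # name -> Concept
    is_a: set = field(default_factory=set)        # (child, parent) pairs
    relations: set = field(default_factory=set)   # (label, domain, range) triples

    def add(self, concept):
        self.concepts[concept.name] = concept


onto = Ontology()
# Terms and synonyms feed the lexical side of each concept.
onto.add(Concept("DISEASE", lexicalizations=frozenset({"disease", "illness"})))
onto.add(Concept("DOCTOR", lexicalizations=frozenset({"doctor"})))
onto.add(Concept("PERSON", lexicalizations=frozenset({"person"})))
onto.is_a.add(("DOCTOR", "PERSON"))                # is_a (DOCTOR, PERSON)
onto.relations.add(("cure", "DOCTOR", "DISEASE"))  # cure (dom:DOCTOR, range:DISEASE)
```

Axioms and rules are left out of the sketch, since they would need a logic formalism rather than simple containers.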
Linguistic analysis “Layer Cake”
OL systems exploit different levels of linguistic annotation of texts, in an incremental fashion (listed from the top, most complex layer down to the bottom layer):
• Dependency analysis: MODIF(decreto,presente) SUBJ(stabilire,decreto) OBJD(stabilire,norma)
• Shallow syntactic parsing (chunking): [NC Il presente decreto] [VC stabilisce] [NC le norme] […]
• POS tagging: decreto DECRETO#S@MS#
• Morphological analysis: decreto DECRETARE#V@S1IP# DECRETO#S@MS#
• Tokenization: Il | presente | decreto | stabilisce | […] | dell' | inquinamento | da | rumore | .
• Sentence segmentation: Il presente decreto stabilisce le norme per la prevenzione dell'inquinamento da rumore. In particolare, […] (English gloss: “This decree lays down the rules for the prevention of noise pollution. In particular, […]”)
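The two bottom layers (sentence segmentation and tokenization) can be approximated with regular expressions. This is a naive sketch, not the behaviour of any specific analyser mentioned in these slides; the function names are assumptions, and a real segmenter would also need to handle abbreviations such as “art.”.

```python
import re

TEXT = ("Il presente decreto stabilisce le norme per la prevenzione "
        "dell'inquinamento da rumore. In particolare, […]")


def split_sentences(text):
    """Naive sentence segmentation: break after '.', '!' or '?' followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def tokenize(sentence):
    """Naive tokenization: elided forms (dell'), then words, then single punctuation marks."""
    return re.findall(r"\w+['’]|\w+|[^\w\s]", sentence)


sentences = split_sentences(TEXT)
tokens = tokenize(sentences[0])
print(" | ".join(tokens))
```

The elided preposition is kept as a separate token (dell'), matching the tokenization shown in the layer cake.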
Terms
• terms are the linguistic representation of domain-specific concepts
• term extraction is a basic prerequisite for more advanced ontology learning tasks
• terms may consist of
  • a single wordform: so-called “simple” (or one-word) terms, e.g. law
  • two or more wordforms: so-called “multi-word” (or complex) terms, e.g. public administration
• the term extraction process is articulated into two fundamental steps:
  • identifying term candidates in the text
  • filtering the candidates to separate terms from non-terms
• term extraction systems rely on two different types of knowledge, linguistic and statistical, or on a combination of the two
Typical term extraction architecture
1) INPUT: text file
2) NLP analysis, producing linguistically analysed text (e.g. POS-tagged, “chunked”)
3) extraction of candidate terms
4) statistical processing with statistical measures (mutual information, log-likelihood, TF-IDF etc.)
5) OUTPUT: list of selected domain-relevant terms
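The two-step architecture above (linguistic candidate extraction, then statistical filtering) can be sketched as follows. The toy POS-tagged sentences, the tag set (D, A, N, V) and the choice of pointwise mutual information as the only measure are assumptions made for illustration; real systems typically combine several measures and frequency thresholds.

```python
import math
from collections import Counter

# Toy POS-tagged input; tags and sentences are illustrative assumptions.
TAGGED = [
    [("the", "D"), ("public", "A"), ("administration", "N"),
     ("applies", "V"), ("the", "D"), ("law", "N")],
    [("the", "D"), ("public", "A"), ("administration", "N"),
     ("issues", "V"), ("a", "D"), ("decree", "N")],
    [("the", "D"), ("public", "A"), ("law", "N"),
     ("binds", "V"), ("the", "D"), ("administration", "N")],
]


def candidate_bigrams(sentence):
    """Step 1 (linguistic filter): keep adjective + noun sequences as candidates."""
    for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
        if t1 == "A" and t2 == "N":
            yield (w1, w2)


unigrams = Counter(word for sent in TAGGED for word, _ in sent)
bigrams = Counter(bg for sent in TAGGED for bg in candidate_bigrams(sent))
n_tokens = sum(unigrams.values())


def pmi(bigram):
    """Step 2 (statistical filter): pointwise mutual information, in bits."""
    w1, w2 = bigram
    p_xy = bigrams[bigram] / n_tokens
    p_x, p_y = unigrams[w1] / n_tokens, unigrams[w2] / n_tokens
    return math.log2(p_xy / (p_x * p_y))


# Candidates ranked by association strength, strongest first.
ranked = sorted(bigrams, key=pmi, reverse=True)
```

On this toy input "public administration" is ranked above "public law", because its two words co-occur more often than their individual frequencies would predict.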
Synonyms
• identification of semantic term variants denoting the same concept
  • within the same language: synonyms
  • across different languages: translation equivalents
• methods for the automatic acquisition of synonyms
  • clustering of distributionally similar terms
    • The Distributional Hypothesis (Harris 1968): two words that tend to occur in similar linguistic contexts will be positioned closer together in semantic space
  • mutual information association measure computed on a very large corpus (the web) with a medium-sized co-occurrence window
    • synonyms tend to occur near each other
• methods for the automatic acquisition of translation equivalents
  • similar to the monolingual case, but depending on translated contexts (i.e., document collections)
    • Parallel corpora: pairs of translated documents
    • Comparable corpora: pairs of documents in different languages on the same topic
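The distributional hypothesis mentioned above can be illustrated with co-occurrence vectors compared by cosine similarity. This is a minimal sketch under stated assumptions: the three toy sentences, the window size and the function names are invented for the example, and cosine is only one of several possible similarity measures.

```python
import math
from collections import Counter, defaultdict

# Toy tokenized corpus (an assumption, not data from the slides).
SENTENCES = [
    ["the", "doctor", "treats", "the", "disease"],
    ["the", "doctor", "treats", "the", "illness"],
    ["the", "patient", "visits", "the", "hospital"],
]


def context_vectors(sentences, window=2):
    """Count, for each word, the words found within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


vectors = context_vectors(SENTENCES)
sim_synonyms = cosine(vectors["disease"], vectors["illness"])
sim_related = cosine(vectors["disease"], vectors["hospital"])
```

In this toy corpus "disease" and "illness" share all their contexts and are therefore closer to each other than "disease" and "hospital"; synonym candidates would be the pairs above a similarity threshold chosen on real data.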
Concepts
• concept learning can be approached from quite different perspectives (cf. the semiotic triangle of Ogden & Richards, 1923)
• From a purely linguistic perspective: conceptual classes can be induced from the set of associated linguistic realizations emerging from texts (term extraction, synonym detection)
• Intensionally: a concept is a description of its intension, i.e. the set of properties that characterizes it or its relationships to other concepts (definition extraction and formalization)
• Extensionally: a concept is learned by identifying its instances in texts (ontology population)
Taxonomy (Concept Hierarchies) (1)
Basic methods used for taxonomy extraction
• Lexico-syntactic patterns (Hearst 1992)
  • Pattern: NP0 such as {NP1, NP2, …, (and | or)} NPn
  • Matching context: All common-law countries, such as Canada and England …
  • Extracted relations:
    • HYPONYM(Canada, common-law country)
    • HYPONYM(England, common-law country)
• Definition analysis (1980s machine-readable dictionary (MRD) literature)
  • Definition: Ai fini della presente legge si intende … e) per - responsabile - la persona fisica, la persona giuridica, la pubblica amministrazione e qualsiasi altro ente, associazione od organismo preposti dal titolare al trattamento di dati personali; (English gloss: “For the purposes of this law … e) 'responsabile' means the natural person, the legal person, the public administration and any other body, association or organisation appointed by the 'titolare' to process personal data;”)
  • Extracted relations:
    • HYPONYM(responsabile, persona_fisica)
    • HYPONYM(responsabile, persona_giuridica)
    • HYPONYM(responsabile, pubblica_amministrazione)
    • …
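The “such as” pattern of the first method can be approximated with a regular expression. This is a rough sketch, not Hearst's original implementation: noun phrases are approximated as one or two words (a real system would use chunker output), and both the regular expression and the crude singularize helper are assumptions made for illustration.

```python
import re

# NP0 such as NP1, NP2, ... (and | or) NPn, with NP0 limited to one or two words.
HEARST = re.compile(
    r"\b(?P<hyper>[\w-]+(?: [\w-]+)?),? such as "
    r"(?P<hypos>[\w-]+(?:(?:, (?:and |or )?| and | or )[\w-]+)*)"
)


def singularize(noun_phrase):
    """Crude English singularization of the last word (not a real lemmatizer)."""
    if noun_phrase.endswith("ies"):
        return noun_phrase[:-3] + "y"
    if noun_phrase.endswith("s"):
        return noun_phrase[:-1]
    return noun_phrase


def hearst_hyponyms(text):
    """Return (hyponym, hypernym) pairs for every 'such as' match in the text."""
    relations = []
    for match in HEARST.finditer(text):
        hypernym = singularize(match.group("hyper"))
        for hyponym in re.split(r", (?:and |or )?| and | or ", match.group("hypos")):
            relations.append((hyponym, hypernym))
    return relations


relations = hearst_hyponyms("All common-law countries, such as Canada and England …")
for hyponym, hypernym in relations:
    print(f"HYPONYM({hyponym}, {hypernym})")
```

Patterns of this kind tend to be precise but sparse, which is why they are usually combined with the other methods listed in this part of the talk.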
Taxonomy (Concept Hierarchies) (2)
Basic methods used for taxonomy extraction
• Co-occurrence analysis
  • based on Harris' distributional hypothesis
  • exploits unsupervised hierarchical clustering techniques, which learn concepts by grouping terms into clusters of semantically related terms; the cluster hierarchies they produce can also be used to automatically derive term/concept hierarchies from texts
• Linguistic approaches
  • modifiers typically restrict or narrow down the meaning of the modified noun
  • hyponymy relations are induced from head-sharing terms
    • HYPONYM(commercial_activity, activity)
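The head-sharing heuristic can be sketched in a few lines. The sketch assumes English, head-final terms (modifiers on the left), a function name chosen for the example, and a toy term list in which "retail commercial activity" is invented for illustration; for head-initial languages such as Italian the modifiers would be stripped from the right instead.

```python
def head_sharing_hyponyms(terms):
    """Link each multi-word term to the longest known term obtained by
    stripping leading modifiers (assumes head-final terms, as in English)."""
    known = set(terms)
    relations = []
    for term in terms:
        words = term.split()
        for i in range(1, len(words)):
            parent = " ".join(words[i:])
            if parent in known:
                relations.append((term, parent))
                break  # keep only the closest (most specific) parent
    return relations


TERMS = ["activity", "commercial activity", "retail commercial activity"]
relations = head_sharing_hyponyms(TERMS)
for hyponym, hypernym in relations:
    print(f"HYPONYM({hyponym.replace(' ', '_')}, {hypernym.replace(' ', '_')})")
```

Linking each term only to its closest parent yields chains (retail commercial activity, commercial activity, activity) rather than a flat list under the shared head.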
(Other) Relations
• exploiting linguistic structure
• Part_of, meronymy
  • Part_of(wheel, bike); holonymy(bike, wheel)
• Qualia roles
  • Constitutive(blade, knife)
  • Formal(artifact_tool, knife)
  • Telic(cut_act, knife)
  • Agentive(make_act, knife)
• Other relations
  • Located_in
  • Author_of
• More complex relations
  • cure (dom:DOCTOR, range:DISEASE)
Evaluating Ontology Learning results
Evaluation approaches
• comparing the ontology to a “gold standard” in terms of
  • Precision: percentage of correctly acquired items with respect to all acquired items
  • Recall: percentage of correctly acquired items with respect to all items in the gold standard
• using the ontology in an application and evaluating the results (task-based evaluation)
• comparing the ontology with a source of data about the domain that it is meant to cover
• having humans assess how well the ontology meets a set of predefined criteria, standards, requirements, etc. (manual evaluation)
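The gold-standard comparison reduces to two set ratios. A minimal sketch follows; the function name and the two toy term sets are assumptions made for illustration and are not data from the case studies.

```python
def precision_recall(acquired, gold):
    """Precision and recall of a set of acquired items against a gold standard."""
    acquired, gold = set(acquired), set(gold)
    correct = acquired & gold
    precision = len(correct) / len(acquired) if acquired else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Toy data: 3 of the 4 acquired terms are in the gold standard, which has 5 entries.
acquired = {"law", "decree", "public administration", "page"}
gold = {"law", "decree", "public administration", "regulation", "court"}
precision, recall = precision_recall(acquired, gold)
print(f"precision = {precision:.2%}, recall = {recall:.2%}")
```

Exact string matching is the simplest possible comparison; in practice, terms are usually normalized (e.g. lemmatized) before being matched against the reference resource.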
Ontology learning in the legal domain
• relatively few attempts have been made so far to automatically induce legal domain ontologies from texts
• ontology learning experiments carried out in the legal domain have mainly focused on concept extraction
• among them:
  • Walter and Pinkal (2006): focus on definitions in German court decisions, from which legal concepts are identified together with relevant terminology and relations
  • extraction of domain-relevant terminology, from which domain-relevant concepts are derived together with the relations linking them
    • Lame (2000, 2005): French
    • Saias and Quaresma (2005): Portuguese
    • Völker, J., Langa S.F., Sure Y. (2008): Spanish
Feasibility studies carried out in the legal domain with T2K
• T2K (Text-to-Knowledge) ontology learning system
  • Institute of Computational Linguistics (CNR) and Department of Linguistics of the University of Pisa (Dell'Orletta et al. 2006)
  • offers a battery of tools for Natural Language Processing (NLP), statistical text analysis and machine language learning, dynamically integrated to induce ontological knowledge from texts
• Two case studies in the legal domain
  • Corpus of environmental laws (Venturi 2006)
    • 1,399,617 tokens
    • 824 institutional and administrative acts by the EU, the State and the Piedmont Region
    • time span: from 1997 to 2005
  • Consumer Law corpus (European DALOS project)
    • including EU Directives, Regulations and case law on the protection of consumers' economic and legal interests
    • 292,609 word tokens
Ontology Learning with T2K
[Architecture diagram]
• input: text
• AnIta linguistic analysis: Tokenizer, Morphological Analyser, POS Tagger, Chunker, Dependency Analyser
• ontology learning: Terminology extraction, Semantic structuring
• highlighted component: Terminology extraction (domain terminology)
Ontology Learning with T2K
[Same architecture diagram as the previous slide; highlighted component: Semantic structuring]
Ontology Learning (1) Terminology extraction
• Starting point: texts annotated with basic non-recursive syntactic structures (i.e. “chunks”) by AnIta
• The result, stored in a term repository, includes
  • single terms, e.g. autorità (authority), inquinamento (pollution)
  • multi-word terms, e.g. beni culturali (cultural heritage), sistemi di gestione e controllo (management and control systems)
  • terminological variants
Ontology Learning (2) Organisation and structuring of the set of acquired terms
• partial taxonomical chains reconstructed from the internal linguistic structure of terms
  • simple and complex terms structured in a vertical hierarchy
  • [Diagram: isa hierarchy rooted in riduzione (reduction), including the terms riduzione dell'inquinamento, riduzione dell'inquinamento acustico, riduzione delle emissioni, riduzione delle emissioni inquinanti, riduzione dei consumi, riduzione della produzione, …]
• clusters of semantically related terms inferred through dynamic distributionally-based similarity measures
  • using a context-sensitive notion of semantic similarity
  • computed from the most relevant co-occurring verb/subject and verb/object pairs (verb vj, noun ni) in the dependency-annotated text
  • example clusters:
    • DISPOSIZIONI: NORME, DISPOSIZIONI LEGISLATIVE, DECISIONE, ATTO, PRESCRIZIONI
    • INQUINAMENTO: DANNO AMBIENTALE, INQUINAMENTO MARINO, EFFETTI NOCIVI, CONSEGUENZA, INQUINAMENTO ATMOSFERICO
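The idea of comparing nouns through the verbs they occur with as subject or object can be sketched as follows. This is not T2K's actual similarity measure, which the slides do not spell out: plain Jaccard overlap is swapped in for illustration. Of the toy triples, only OBJD(stabilire, norma) and SUBJ(stabilire, decreto) come from the slides; the others are invented.

```python
from collections import defaultdict

# (relation, verb, noun) triples from a dependency-annotated text.
# Only OBJD(stabilire, norma) and SUBJ(stabilire, decreto) appear in the slides.
TRIPLES = [
    ("OBJD", "stabilire", "norma"),
    ("OBJD", "applicare", "norma"),
    ("OBJD", "stabilire", "disposizione"),
    ("OBJD", "applicare", "disposizione"),
    ("OBJD", "ridurre", "inquinamento"),
    ("SUBJ", "stabilire", "decreto"),
]

# Each noun is described by the set of (relation, verb) contexts it occurs in.
contexts = defaultdict(set)
for relation, verb, noun in TRIPLES:
    contexts[noun].add((relation, verb))


def jaccard(noun_a, noun_b):
    """Share of (relation, verb) contexts that the two nouns have in common."""
    union = contexts[noun_a] | contexts[noun_b]
    return len(contexts[noun_a] & contexts[noun_b]) / len(union) if union else 0.0


print(jaccard("norma", "disposizione"))  # nouns sharing all their contexts
print(jaccard("norma", "decreto"))       # same verb, but different grammatical role
```

Keeping the grammatical relation inside the context is what makes the comparison sensitive to the role a noun plays: sharing a verb as subject and as object does not count as shared context.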
Achieved results
• Environmental corpus
  • 4,685 terminological units
  • vertical (hyponymy) relations: 2,181, concerning 272 terms
  • semantically related terms: 3,448, concerning 665 terms
• DALOS corpus
  • 1,443 terminological units
  • vertical (hyponymy) relations: 623, concerning 229 terms
  • semantically related terms: 1,258, concerning 279 terms
• Two-faced terminology: in both case studies the acquired terminology includes both legal terms and terms of the regulated domain (environmental and consumer protection terms, respectively)
Evaluation of achieved results in terms of precision and recall
• reference resources selected as a gold standard
  • Legal domain
    • Dizionario giuridico, Edizioni Simone (6,041 entries)
    • the thesaurus of the DOGI Archive
  • Environmental domain
    • Glossary of the Osservatorio Nazionale sui Rifiuti (1,090 entries)
    • the thesaurus EARTh (Environmental Applications Reference Thesaurus)
  • precision: 75.4%
  • through manual checking, the percentage of correctly acquired terms grows to 83.7%, e.g.
    • anidride carbonica (carbon dioxide)
    • beneficiari (beneficiaries)
• reference resources selected as a gold standard
  • the thesaurus of the DOGI Archive
  • JurWordNet
  • precision: 85.38%
  • recall with respect to the 56 relevant European Union Legal Concepts (EULG): 80.69%
Open issues
• semi-automatic identification of the domain relevance of each acquired term, with respect to the legal or the regulated domain
• semi-automatic induction and labelling of basic ontological classes from the acquired proto-conceptual structures
• extension of the acquired domain ontology with concept-linking relations (e.g. events)
• identification of definitions and extraction of the embedded domain knowledge
Semi-automatic identification of the domain relevance of each acquired term
• Two-faced nature of acquired terminology
  • depending on the usage, it may be useful to discriminate between the two term types
• First experiments based on a contrastive analysis of term collections bootstrapped from different corpora

kwid  term                                value  lemma
3     ARTICOLO                            638    ARTICOLO
2     DIRETTIVA                           681    DIRETTIVA
15    COMMISSIONE                         185    COMMISSIONE
4     MEMBRI                              635    MEMBRO
30    ATTIVITÀ                            119    ATTIVITA'
17    CONSIGLIO                           183    CONSIGLIO
9     DISPOSIZIONI                        300    DISPOSIZIONE
5     STATI MEMBRI                        594    STATO MEMBRO

kwid  term                                value  lemma
315   FIDUCIA DEI CONSUMATORI             5      FIDUCIA CONSUMATORE
302   ASPETTI DELLA VENDITA               6      ASPETTO VENDITA
303   FORNITORE DI BENI                   6      FORNITORE BENE
1     CONSUMATORE                         757    CONSUMATORE
304   DATA DEL PRESTITO                   6      DATA PRESTITO
680   CALCOLO DEL TASSO ANNUO EFFETTIVO   3      CALCOLO DI IL TASSO ANNUO EFFETTIVO
307   DISCRIMINAZIONE DIRETTA             6      DISCRIMINAZIONE DIRETTO
310   DECISIONE CONSAPEVOLE               6      DECISIONE CONSAPEVOLE
335   OBBLIGAZIONI CONTRATTUALI           5      OBBLIGAZIONE CONTRATTUALE
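One simple way to run a contrastive analysis is to compare the relative frequency of each term in the target corpus with its relative frequency in a contrast corpus. The slides do not say which contrastive measure was used in the experiments, so the smoothed frequency ratio below is an assumption, as are the function name and the toy counts (which are not the values shown in the tables).

```python
def contrastive_scores(target_counts, contrast_counts, smoothing=1.0):
    """Ratio of smoothed relative frequencies: target corpus vs contrast corpus."""
    n_target = sum(target_counts.values())
    n_contrast = sum(contrast_counts.values())
    vocab_size = len(set(target_counts) | set(contrast_counts))
    scores = {}
    for term, freq in target_counts.items():
        p_target = (freq + smoothing) / (n_target + smoothing * vocab_size)
        p_contrast = ((contrast_counts.get(term, 0) + smoothing)
                      / (n_contrast + smoothing * vocab_size))
        scores[term] = p_target / p_contrast
    return scores


# Hypothetical term frequencies in a consumer-law and an environmental-law corpus.
consumer = {"consumatore": 50, "direttiva": 40, "fornitore di beni": 10}
environment = {"direttiva": 45, "inquinamento": 50, "articolo": 5}

scores = contrastive_scores(consumer, environment)
for term, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{term}: {score:.2f}")
```

Under this reading, terms with a ratio well above 1 are candidates for the regulated domain, while terms with a ratio near or below 1 occur across corpora and are more likely to belong to the general legal vocabulary.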
Semi-automatic induction and labelling of basic ontological classes
1) definition of root concepts
2) definition of sub-concepts
[Diagram: two root concepts, each with is_a links from its sub-concepts]
• colour: bianco (white), beige, scuro (dark), grigio (grey), blu (blue), rosso (red)
• material: acciaio (steel), pino (pine), betulla (birch), alluminio (aluminium), rovere (oak), plastica (plastic), faggio (beech), vetro (glass)
Extension of the acquired domain ontology with concept-linking relations (e.g. events)
• ontology extension with events (typically expressed by verbs) as connecting elements between concepts
  • automatic acquisition of clusters of semantically related verbs on the basis of distributionally-based similarity measures
  • semi-automatic bootstrapping of predicate-argument structures from texts
[Diagram: concept nodes {centro servizi}, {aiuto, prestazione, servizio}, {aiuti di stato, …}, {servizi per l'impiego, servizi alle imprese, servizi integrati, …} and {direttore}, connected by ISA links and by the verb clusters {fornire, offrire, eroga}, {dirige} and {controlla, autorizza}]
Identification of definitions and extraction of the embedded domain knowledge
Example definition:
2. Ai fini della presente legge si intende … e) per responsabile la persona fisica, la persona giuridica, la pubblica amministrazione e qualsiasi altro ente, associazione od organismo preposti dal titolare al trattamento di dati personali; …
Chunked representation:
[ [ CC: U_C] [ FORM: per] [ POTGOV: PER]]
[ [ CC: PUNC_C] [ PUNCTYPE: #@]]
[ [ CC: N_C] [ POTGOV: RESPONSABILE#S@FS@MS]]
[ [ CC: PUNC_C] [ PUNCTYPE: #@]]
[ [ CC: N_C] [ DET: LO#RD@FS] [ POTGOV: PERSONA_FISICA#S@FS]]
[ [ CC: PUNC_C] [ PUNCTYPE: ,#@]]
[ [ CC: N_C] [ DET: LO#RD@FS] [ POTGOV: PERSONA_GIURIDICA#S@FS]]
[ [ CC: PUNC_C] [ PUNCTYPE: ,#@]]
[ [ CC: N_C] [ DET: LO#RD@FS] [ POTGOV: PUBBLICA_AMMINISTRAZIONE#S@FS]]
[ [ CC: COORD_C] [ CONJTYPE: E#CC]]
[ [ CC: N_C] [ PREMODIF: QUALSIASI#A@MS ALTRO#A@MS] [ POTGOV: ENTE#S@MS]]
[ [ CC: PUNC_C] [ PUNCTYPE: ,#@]]
[ [ CC: N_C] [ AGR: @FS] [ POTGOV: ASSOCIAZIONE#S@FS]]
[ [ CC: SUBORD_C] [ CONJTYPE: OD#CS]]
[ [ CC: N_C] [ AGR: @MS] [ POTGOV: ORGANISMO#S@MS]]
[ [ CC: ADJPART_C] [ AGR: @MP@MP] [ POTGOV: PREPORRE#V@MPPR PREPOSTO#A@MP]]
[Diagram annotations: definiendum, ISA]
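A chunked definition of the kind shown in this slide can be turned into ISA relations with a simple scan over the chunks. This is only a sketch of the idea, not the procedure used by the authors: the rule (the first nominal chunk after “per” is the definiendum, the following nominal chunks are its hypernyms, and any other chunk type ends the list) and the function names are assumptions.

```python
import re

# Chunk lines copied from the slide (one chunk per string).
CHUNKS = [
    "[ [ CC: U_C] [ FORM: per] [ POTGOV: PER]]",
    "[ [ CC: PUNC_C] [ PUNCTYPE: #@]]",
    "[ [ CC: N_C] [ POTGOV: RESPONSABILE#S@FS@MS]]",
    "[ [ CC: PUNC_C] [ PUNCTYPE: #@]]",
    "[ [ CC: N_C] [ DET: LO#RD@FS] [ POTGOV: PERSONA_FISICA#S@FS]]",
    "[ [ CC: PUNC_C] [ PUNCTYPE: ,#@]]",
    "[ [ CC: N_C] [ DET: LO#RD@FS] [ POTGOV: PERSONA_GIURIDICA#S@FS]]",
    "[ [ CC: PUNC_C] [ PUNCTYPE: ,#@]]",
    "[ [ CC: N_C] [ DET: LO#RD@FS] [ POTGOV: PUBBLICA_AMMINISTRAZIONE#S@FS]]",
    "[ [ CC: COORD_C] [ CONJTYPE: E#CC]]",
    "[ [ CC: N_C] [ PREMODIF: QUALSIASI#A@MS ALTRO#A@MS] [ POTGOV: ENTE#S@MS]]",
    "[ [ CC: PUNC_C] [ PUNCTYPE: ,#@]]",
    "[ [ CC: N_C] [ AGR: @FS] [ POTGOV: ASSOCIAZIONE#S@FS]]",
    "[ [ CC: SUBORD_C] [ CONJTYPE: OD#CS]]",
    "[ [ CC: N_C] [ AGR: @MS] [ POTGOV: ORGANISMO#S@MS]]",
    "[ [ CC: ADJPART_C] [ AGR: @MP@MP] [ POTGOV: PREPORRE#V@MPPR PREPOSTO#A@MP]]",
]

SKIPPED = {"PUNC_C", "COORD_C", "SUBORD_C"}  # punctuation and conjunction chunks


def parse(line):
    """Return (chunk type, head lemma or None) for one chunk line."""
    chunk_type = re.search(r"CC: (\w+)", line).group(1)
    head = re.search(r"POTGOV: (\w+)", line)
    return chunk_type, head.group(1) if head else None


def extract_isa(chunk_lines):
    """Assumed rule: first N_C after 'per' = definiendum, later N_C = hypernyms."""
    started, definiendum, relations = False, None, []
    for chunk_type, head in map(parse, chunk_lines):
        if chunk_type == "U_C" and head == "PER":
            started = True
        elif not started or chunk_type in SKIPPED:
            continue
        elif chunk_type == "N_C":
            if definiendum is None:
                definiendum = head
            else:
                relations.append((definiendum, head))
        else:
            break  # e.g. the participial chunk (preposti ...) closes the list
    return relations


relations = extract_isa(CHUNKS)
```

On the slide's example the scan links RESPONSABILE to each of the six coordinated nominal heads, in line with the HYPONYM relations listed for the same definition earlier in the talk.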
References
Ontology learning
Buitelaar, P., Cimiano, P., Magnini, B. (Eds.) Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications Series, Vol. 123, IOS Press, July 2005.
Cimiano, Philipp Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer, 2006.
Maedche, Alexander Ontology Learning for the Semantic Web. Kluwer Academic Publishers, 2002.
Ontology learning in the legal domain
Cimiano, P., Völker, J. Text2Onto - A Framework for Ontology Learning and Data-driven Change Discovery. In Montoyo et al. (eds.), Proceedings of the 10th International Conference on Applications of Natural Language to Information Systems (NLDB). pp. 227–238. Springer, Alicante, Spain, June 2005.
Lame, G. Knowledge acquisition from texts towards an ontology of French law. in Proceedings of the International Conference on Knowledge Engineering and Knowledge Management Managing Knowledge in a World of Networks (EKAW-2000), Juan-les-Pins, 2000.
Lame, G. Using NLP techniques to identify legal ontology components: concepts and relations. In Benjamins et al. (eds.), Law and the Semantic Web. Legal Ontologies, Methodologies, Legal Information Retrieval, and Applications. Lecture Notes in Computer Science, Volume 3369: 169–184, 2005.
Saias, J. and P. Quaresma A Methodology to Create Legal Ontologies in a Logic Programming Based Web Information Retrieval System. In Benjamins et al. (eds.), Law and the Semantic Web. Legal Ontologies, Methodologies, Legal Information Retrieval, and Applications. Lecture Notes in Computer Science, Volume 3369: 185–200, 2005.
Walter, S. and M. Pinkal. Automatic extraction of definitions from German court decisions. In Proceedings of the International Conference on Computational Linguistics (COLING-2006) “Workshop on Information Extraction Beyond The Document”: 20–28, Sydney, 2006.
T. Agnoloni, L. Bacci, E. Francesconi, W. Peters, S. Montemagni, G. Venturi, 2008, A two-level knowledge approach to support multilingual legislative drafting, in Joost Breuker, Pompeu Casanovas, Michel C.A. Klein, Enrico Francesconi (eds.), Law, Ontologies and the Semantic Web - Channelling the Legal Information Flood, Frontiers in Artificial Intelligence and Applications, Springer, Volume 188, 2008, pp. 177-198.
Alessandro Lenci, Simonetta Montemagni, Vito Pirrelli, Giulia Venturi, 2008, Ontology learning from Italian legal texts, in Joost Breuker, Pompeu Casanovas, Michel C.A. Klein, Enrico Francesconi (eds.), Law, Ontologies and the Semantic Web - Channelling the Legal Information Flood, Frontiers in Artificial Intelligence and Applications, Springer, Volume 188, 2008, pp. 75-94.