ontology learning from text - ferdowsi university of …wtlab.um.ac.ir/images/seminars/docs/ontology...


Ehsan Asgarian

Ontology Learning from Text

Definition of Ontology

‘A formal, explicit specification of a shared conceptualization’

• Formal: must be machine understandable
• Explicit: types of concepts and constraints must be clearly defined
• Shared: not private to some individual, but accepted by a group
• Conceptualization: an abstract model of some phenomenon in the world, formed by identifying the relevant concepts of that phenomenon

Or simply, a data model describing a domain.

Main elements of an ontology

• Hierarchy of concepts (is-a relations)
• Object property (relation), with a domain and a range, e.g. wasWrittenBy
• Datatype property (attribute), with a domain and a datatype range such as xsd:string, e.g. hasTitle

The spectrum of ontology kinds.

Applications of Ontologies

Knowledge representation and knowledge management systems

Intelligent query-answering systems

Information retrieval and extraction

Semantic Web

• Web pages annotated with ontologies

• User queries for Web pages analysed at

knowledge level and answered by inferencing on

ontological knowledge

Ontology Engineering

Definition of Ontology Learning

The application of a set of methods and techniques used for building an ontology from scratch

Uses distributed and heterogeneous knowledge and information sources

Allows a reduction in the time and effort needed in the ontology development process

Task: automatic ontology extraction from domain texts (texts → ontology extraction → ontology)

Ontology Learning (Construction)

Manual construction

• Corpus is not necessary

• Small scale

Automatic or semiautomatic construction

• Domain specific corpus

• Good domain knowledge coverage

Ontology Learning methods from…

Unstructured sources

• Involves NLP techniques, morphological and syntactic

analysis, etc.

Semi-structured source

• elicit an ontology from sources that have some predefined

structure, such as XML Schema

Structured data

• Extracting concepts and relations from knowledge contained

in structured data, such as databases

Ontology Learning ‘Layer Cake’

Layers from top to bottom, each with an example:
• Axioms & Rules: ∀x, y (sufferFrom(x, y) → ill(x))
• Relations: cure (domain: Doctor, range: Disease)
• Taxonomy (Concept hierarchies): is_a (Doctor, Person)
• Concepts: Disease := <I, E, L>
• Synonyms: {disease, illness}
• Terms: disease, illness, hospital

An overview of the outputs, tasks, and

common techniques for ontology learning

Subtasks in ontology learning

Extract the relevant domain terminology and synonyms from a

text collection

Discover concepts which can be regarded as abstractions of

human thought

Derive a concept hierarchy organizing these concepts

Extend an existing concept hierarchy with new concepts

Learn non-taxonomic relations between concepts

Populate the ontology with instances of relations and concepts

Discover other axiomatic relationships or rules involving

concepts and relations

Sample (partial) Ontology –

Electronic Voting Domain

Concepts: person, voter, worker, poll watcher, location, county, precinct, vote, ballot, machine, voting machine, manufacturer, etc.

Attributes: name of person, model of machine, etc.

Taxonomical relations:

• Voter is a person; precinct is a location; voting

machine is a machine, etc.

Non-hierarchical relations:

• Voter cast ballot; voter trust machine; county

adopt machine; equipment miscount ballot, etc.


ConceptNet — a practical commonsense reasoning

Open Mind Common Sense (OMCS) is an artificial intelligence

project based at the Massachusetts Institute of Technology (MIT)

Media Lab whose goal is to build and utilize a large

commonsense knowledge base from the contributions of many

thousands of people across the Web.

ConceptNet is a multilingual knowledge base, representing

words and phrases that people use and the common-sense

relationships between them.

Since its founding in 1999, it has

accumulated more than a million

English facts from over 15,000

contributors in addition to knowledge

bases in other languages.


ConceptNet — a practical commonsense reasoning

The knowledge base is a semantic network presently consisting

of over 1.6 million assertions of commonsense knowledge

encompassing the spatial, physical, social, temporal, and

psychological aspects of everyday life.

It is built from nodes representing concepts, in the form of words

or short phrases of natural language, and labeled relationships

between them. These are the kinds of things computers need to

know to search for information better, answer questions, and

understand people's goals.

ConceptNet is generated automatically from the 700 000

sentences of the Open Mind Common Sense Project — a World

Wide Web based collaboration with over 14 000 authors.


Challenges in Text Processing

Unstructured texts

Ambiguity in English text
• Multiple senses of a word
• Multiple parts of speech, e.g., “like” can occur in 8 PoS:
   • Verb: “Fruit flies like a banana”
   • Noun: “We may not see its like again”
   • Adjective: “People of like tastes agree”
   • Adverb: “The rate is more like 12 percent”
   • Preposition: “Time flies like an arrow”
   • etc.

Lack of closed domain of lexical categories

Noisy texts

Requirement of very large training text sets

Lack of standards in text processing

Part 1 Terms Extraction

Layer cake level: Terms, e.g. disease, illness, hospital

Terms

Linguistic realizations of domain-specific concepts

Are the basis of the ontology learning process

Term extraction implies:
• Linguistic processing: part-of-speech tagging, morphological analysis, etc.
• Statistical processing: compares the distribution of terms between corpora

Terms Extraction: Process

Run a Part-Of-Speech (POS) tagger over the domain

corpus

Identify possible terms by constructing patterns, such

as: Adj-Noun, Noun-noun, Adj-Noun-Noun,…

Ignore Names

Keep only the terms that are relevant to the text by applying statistical metrics
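The pattern-matching step of this process can be sketched as follows. This is only an illustration under stated assumptions: the input is already POS-tagged with Penn Treebank-style tags (JJ, NN, NNS, NNP), the function names `coarse` and `extract_candidates` and the sample sentence are hypothetical, and the final statistical filtering step is not shown.

```python
from collections import Counter

# Tag patterns named on the slide: Adj-Noun, Noun-Noun, Adj-Noun-Noun.
PATTERNS = [("ADJ", "NOUN"), ("NOUN", "NOUN"), ("ADJ", "NOUN", "NOUN")]


def coarse(tag):
    """Map a Penn Treebank-style tag (assumed tag set) to a coarse class."""
    if tag in ("NNP", "NNPS"):
        return "NAME"   # names are ignored, so they never match a pattern
    if tag.startswith("NN"):
        return "NOUN"
    if tag.startswith("JJ"):
        return "ADJ"
    return "OTHER"


def extract_candidates(tagged_tokens):
    """Count every token n-gram whose coarse tags match one of PATTERNS."""
    classes = [coarse(tag) for _, tag in tagged_tokens]
    counts = Counter()
    for pattern in PATTERNS:
        n = len(pattern)
        for i in range(len(classes) - n + 1):
            if tuple(classes[i:i + n]) == pattern:
                words = [word.lower() for word, _ in tagged_tokens[i:i + n]]
                counts[" ".join(words)] += 1
    return counts


# Hypothetical tagged sentence from the voting domain.
tagged = [("The", "DT"), ("electronic", "JJ"), ("voting", "NN"),
          ("machine", "NN"), ("in", "IN"), ("Ohio", "NNP"),
          ("miscounted", "VBD"), ("paper", "NN"), ("ballots", "NNS")]
candidates = extract_candidates(tagged)
```

Overlapping candidates such as "voting machine" and "electronic voting machine" are both kept here; deciding between them is left to the statistical metrics.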

Linguistic Analysis: an example

Levels of analysis, from lowest to highest, each with an example:
• Tokenization (incl. Named-Entity Rec.): [table] [2005-06-01] [John Smith]
• Part of Speech & Semantic Tagging: [table N:ARTIFACT] [table N:furniture]
• Morphological Analysis (stemming): [work~ing V]
• Phrase Recognition: [[the] [large] [table] NP] [[in] [the] [corner] PP]
• Dependency Structure (Phrases): [[the SPEC] [large MOD] [table HEAD] NP]
• Dependency Structure (S): [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ] S]
• Discourse Analysis: [[He SUBJ] [booked PRED] [[this] [table HEAD] NP:DOBJ:X1] …] … [[It SUBJ:X1] [was PRED] still available …]

Statistical Analysis

Statistical metrics used in term extraction:
• Chi-square: $\chi^2 = \sum \frac{(obs - exp)^2}{exp}$
• Term weighting (TFIDF): $tfidf(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$
• Mutual Information: $mi(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)}$
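The chi-square and mutual information scores can be sketched as below, assuming their standard definitions: the sum of (obs − exp)² / exp over all cells, and the logarithm of P(x, y) / (P(x) P(y)). The base-2 logarithm, the function names and the counts in the usage lines are assumptions made for illustration.

```python
import math


def chi_square(observed, expected):
    """Sum of (obs - exp)^2 / exp over paired observed/expected counts."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def mutual_information(n_xy, n_x, n_y, n_total):
    """log2 of P(x, y) / (P(x) * P(y)), with probabilities estimated from counts."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts: x and y co-occur ten times more often than chance.
mi_score = mutual_information(n_xy=20, n_x=40, n_y=50, n_total=1000)
chi_score = chi_square(observed=[20, 30], expected=[25, 25])
```

A score near zero means the two words co-occur about as often as independence would predict.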

TFIDF

$tfidf(w) = tf(w) \cdot \log\left(\frac{N}{df(w)}\right)$

tf(w): term frequency (number of occurrences of the word in a document)
df(w): document frequency (number of documents containing the word)
N: number of all documents
tfidf(w): relative importance of the word in the document

Most popular weighting schema
• The word is more important if it appears several times in a document
• The word is more important if it appears in fewer documents
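A minimal sketch of this weighting, assuming documents are already tokenized and the natural logarithm is used (the slide does not fix the base). The function name `tfidf` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {word: tf(w) * log(N / df(w))} dict per tokenized document."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))            # a document counts once per word
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


docs = [
    ["ballot", "machine", "ballot", "voter"],
    ["voter", "county", "machine"],
    ["voter", "precinct"],
]
scores = tfidf(docs)
```

In the first document "ballot" scores highest (frequent there, absent elsewhere), while "voter" scores zero because it appears in every document.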

Part 2 Synonyms

Layer cake level: Synonyms, e.g. {disease, illness}

Synonyms

Identification of terms that share

semantics, i.e., potentially refer to the

same concept

Methods for extracting synonyms

• Based on WordNet

• Latent Semantic Indexing (LSI)

WordNet

A lexical database for the English language

Nouns, verbs, adjectives & adverbs are grouped into sets of

synonyms (synsets)

Synsets are interlinked by means of conceptual-semantic

and lexical relations

Adapting WordNet to specific domain

Partition the set of synonymy relations defined in WordNet into three classes:

• Relations irrelevant in the specific domain

• Relations that are relevant but incorrect in the specific

domain

• Relations that are relevant and correct in the specific

domain

Remove relations from the first two classes and include

relations from the third class

Rank the remaining synonym sets according to their frequency in the corpus

Latent Semantic Indexing (LSI)

LSI is a technique in NLP of analyzing relationships

between a set of documents and the terms they contain

Uses a term-document matrix which describes the

occurrences of terms in documents – Vector Space Model

Example: doc1 doc2

database X

computer X X

access X
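The term-document matrix can be sketched in plain Python as below. This covers only the vector space model step; the dimensionality reduction that LSI applies to the matrix afterwards is not shown. The three sample documents and the function names are hypothetical and do not reproduce the table above.

```python
import math


def term_document_matrix(documents):
    """Build {term: [count in doc 0, count in doc 1, ...]} from tokenized docs."""
    terms = sorted({term for doc in documents for term in doc})
    return {term: [doc.count(term) for doc in documents] for term in terms}


def cosine(u, v):
    """Cosine similarity between two term vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


docs = [["database", "computer"],
        ["computer", "access"],
        ["database", "computer", "access"]]
matrix = term_document_matrix(docs)
similarity = cosine(matrix["database"], matrix["access"])
```

Terms whose rows point in similar directions occur in similar documents, which is the signal used to propose synonym candidates.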

Part 3 Concepts

Layer cake level: Concepts, e.g. Disease := <I, E, L>

Concepts

Intension, Extension, Lexicon

A term may indicate a concept if we can define its:
• Intension: (in)formal definition of the set of objects that this concept describes
   Example: a disease is an impairment of health or a condition of abnormal functioning
• Extension: a set of objects that the definition of this concept describes (the name of the nearest common ancestor)
   Example: influenza, cancer, heart disease
• Lexical realizations: the term itself and its multilingual synonyms
   Example: disease, illness, maladie

Part 4 Taxonomy Induction

Layer cake level: Taxonomy (Concept hierarchies), e.g. is_a (Doctor, Person)

Concept Hierarchy Extraction

Basic methods used for taxonomy extraction:
• With the use of WordNet
• Lexico-syntactic patterns
• Machine Readable Dictionaries
• Co-occurrence Analysis
• Unsupervised hierarchical clustering techniques
• Linguistic approaches

Taxonomy Extraction with WordNet

Given two terms t1 and t2, check if they stand in a hypernym relation with regard to WordNet

Normalize the number of hypernym paths by dividing by the number of senses of t1:

$isa(t_1, t_2) = \min\left(\frac{|paths(senses(t_1), senses(t_2))|}{|senses(t_1)|}, 1\right)$

path: a sequence of edges connecting the two synsets

Example:
• 4 different hypernym paths between synsets ‘country’ and ‘region’
• ‘country’ has 5 senses
• value of isa (country, region) = 4/5 = 0.8
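The normalization can be checked with a few lines of code. This sketch does not query WordNet: the number of hypernym paths and the number of senses are assumed to be supplied by the caller, and the function name `isa_score` is hypothetical.

```python
def isa_score(num_hypernym_paths, num_senses_t1):
    """min(|paths| / |senses(t1)|, 1), with both counts supplied by the caller."""
    if num_senses_t1 <= 0:
        raise ValueError("t1 must have at least one sense")
    return min(num_hypernym_paths / num_senses_t1, 1.0)


# The 'country' / 'region' example: 4 hypernym paths, 5 senses of 'country'.
score = isa_score(4, 5)
```

The cap at 1 matters when a term has more hypernym paths to t2 than it has senses.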

Lexico-syntactic patterns - Hearst

Aim: the acquisition of hyponym lexical relations from text

Uses a set of predefined lexico-syntactic patterns which

• occur frequently and in many text genres

• indicate the relation of interest

• can be recognized with little or no pre-encoded knowledge

Main idea: match these patterns in texts to retrieve is_a relations

Precision with respect to WordNet: 55.45%

Lexico-syntactic patterns - Hearst

NP0 such as {NP1, NP2, …, (and | or)} NPn
‘Vehicles such as cars, trucks and bikes…’
is_a (car, vehicle), is_a (truck, vehicle), is_a (bike, vehicle)

such NP as {NP,}* {(or | and)} NP
‘Such fruits as oranges, nectarines or apples…’
is_a (orange, fruit), is_a (nectarine, fruit), is_a (apple, fruit)

NP {, NP}* {,} {or | and} other NP
‘Swimming, running, or/and other activities…’
is_a (swimming, activity), is_a (running, activity)

NP {,} including {NP,}* {or | and} NP
‘Injuries, including broken bones, wounds and bruises…’
is_a (broken bone, injury), is_a (wound, injury), is_a (bruise, injury)

NP {,} especially {NP,}* {or | and} NP
‘Publications, especially papers and books…’
is_a (paper, publication), is_a (book, publication)
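Three of these patterns (“such as”, “including”, “especially”) can be approximated with regular expressions over raw text. This is a rough sketch, not Hearst's implementation: it assumes the hypernym is a single word, that the hyponym list runs to the end of the clause, and it does no noun-phrase chunking or lemmatization, so plurals are returned as they appear. The name `hearst_pairs` is hypothetical.

```python
import re

# Hypernym (one word), optional comma, cue phrase, then the hyponym list
# up to the next full stop or semicolon.
CUES = re.compile(r"(\w+),? (?:such as|including|especially) ([^.;]+)")

# Separators inside the hyponym list: commas and/or the words 'and' / 'or'.
SEPARATORS = re.compile(r",\s*(?:and |or )?|\s+(?:and|or)\s+")


def hearst_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the cue-phrase patterns."""
    pairs = []
    for match in CUES.finditer(sentence.lower()):
        hypernym, tail = match.group(1), match.group(2)
        for hyponym in SEPARATORS.split(tail):
            hyponym = hyponym.strip()
            if hyponym:
                pairs.append((hyponym, hypernym))
    return pairs


vehicle_pairs = hearst_pairs("Vehicles such as cars, trucks and bikes.")
injury_pairs = hearst_pairs("Injuries, including broken bones, wounds and bruises.")
```

On running text the hyponym list would need a proper noun-phrase chunker to know where it ends; the sample sentences stop right after the list.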

Machine Readable Dictionaries

A method for extracting taxonomies which goes back to the 1980s

Main idea: exploit the regularity of dictionary entries to find a suitable hypernym for the defined word

Example:
spring: “the season between winter and summer and in which leaves and flowers appear”
is_a (spring, season)

MRDs: Exceptions

The hypernym can be preceded by an expression such as ‘a kind of’, ‘a sort of’, or ‘a type of’

The problem is solved by keeping an exception list with words such as ‘kind’, ‘sort’, ‘type’ and taking the head of the NP following the preposition ‘of’

Example:
hornbeam: “a type of tree with a hard wood, sometimes used in hedges”
is_a (hornbeam, tree)

The word can be defined in terms of a part-of or membership relation, in which case an is_a relation would be wrong

Example:
republican: “a member of a political party advocating republicanism”
part_of (republican, political party), not is_a (republican, political party)

Co-occurrence analysis

Document-based subsumption

A term t1 is more specific than a term t2 if t2 also appears in all the documents in which t1 appears.

$P(x \mid y) = \frac{n(x, y)}{n(y)}$

Term x subsumes term y iff P(x | y) ≥ 1, i.e. x appears in every document that contains y, where
n(x, y): the number of documents in which x and y co-occur
n(y): the number of documents that contain y
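Document-based subsumption reduces to counting documents. In the sketch below each document is a set of terms; the `threshold` parameter is an addition for illustration (a value of 1.0 reproduces the strict condition that x occurs in every document containing y), and the sample documents are hypothetical.

```python
def subsumes(x, y, documents, threshold=1.0):
    """True if P(x | y) = n(x, y) / n(y) reaches the threshold."""
    n_y = sum(1 for doc in documents if y in doc)
    if n_y == 0:
        return False                      # y never occurs, nothing to conclude
    n_xy = sum(1 for doc in documents if x in doc and y in doc)
    return n_xy / n_y >= threshold


docs = [{"machine", "voting machine", "ballot"},
        {"machine", "voting machine"},
        {"machine", "manufacturer"}]

general_over_specific = subsumes("machine", "voting machine", docs)
specific_over_general = subsumes("voting machine", "machine", docs)
```

Here "machine" subsumes "voting machine" because it appears in both documents that mention "voting machine", while the reverse does not hold.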

Unsupervised hierarchical clustering techniques

Unsupervised hierarchical clustering techniques are known from machine learning research

• very noisy as they highly depend on the frequency and

behavior of the terms in the text collection under consideration

• learn concepts at the same time since they also group terms

(the most related to each other)

• can be regarded as abstractions over words and thus, to

some extent, as concepts

It is unclear which specific relation actually holds between the involved words.

Example: Semantic_relatedness (cut, knife)

Linguistic Approaches

Modifiers typically restrict or narrow down the meaning of the modified noun.

Example: is_a (international credit card, credit card)

Syntactic structure analysis and dependency analysis: words and modifiers in syntactic structures (noun, verb, prepositional, … phrases) are analyzed to discover potential terms and relations, e.g. by the head-modifier principle, in which the head of a term assumes the hypernym role

In dependency analysis, grammatical relations, such as subject, object, adjunct, and complement, are used for determining more complex relations
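The head-modifier principle can be illustrated by stripping leading modifiers from a multi-word term one at a time. This sketch assumes English noun phrases whose head is the last word and whose modifiers all precede it; the function name is hypothetical.

```python
def head_modifier_hypernyms(term):
    """Return (term, hypernym) pairs obtained by dropping one leading modifier at a time."""
    words = term.split()
    return [(" ".join(words[i:]), " ".join(words[i + 1:]))
            for i in range(len(words) - 1)]


pairs = head_modifier_hypernyms("international credit card")
```

Each pair reads as is_a (first, second), so the example yields is_a (international credit card, credit card) and is_a (credit card, card).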

Extending Concept Hierarchy

with new Concepts

…by adding a new concept at an appropriate position in the existing taxonomy

Supervised methods:
• classifiers need to be trained which predict membership for every concept in the existing concept hierarchy
• they need a considerable amount of training data for each concept
• such approaches typically do not scale to arbitrarily large ontologies

Unsupervised approaches:

• assume a similarity function which computes a measure of fit between

the new concept and the concepts existing in the ontology.

• rely on an appropriate contextual representation of the different

concepts on the basis of which similarity can be computed.

• the hierarchical structure of the ontology needs to be considered and

somehow integrated into the similarity measure

Part 5 Relations (non-taxonomic)

Layer cake level: Relations, e.g. cure (domain: Doctor, range: Disease)

Extracting relations (the interactions

between concepts) & attributes

Specific relations

• Part-of

• Qualia (Formal, Constitutive, Telic, Agentive)

General relations

• Exploiting linguistic structure

Attributes

Learning attributes: Introduction

Attributes are relations with a datatype as range

Typically expressed in texts using preposition of, the verb have or

genitive constructs, e.g. ‘the color of the car’, ‘the car’s color’, ‘every

car has a color’

Values of attributes are expressed using copula constructs,

adjectives or expressions specific to the attribute in question, e.g.,

• ‘the car is red’ (copula + value)

• ‘the red car’ (adjective)

• ‘the baby weighs 3 kg’ (specific expressions)

Classification of attributes

To systematize the learning process attributes are classified according to their range

An approach to learning attributes

Tokenize and part-of-speech tag the corpus

Apply the following patterns to extract adjective/noun pairs:
(\w+{DET})? (\w+{NN})+ is{VBZ} \w+{JJ}
(\w+{DET})? \w+{JJ} (\w+{NN})+
where JJ: adjective, DET: determiner, NN: noun, VBZ: verb, 3rd person singular present

These pairs are weighted using conditional probability:

$P(a \mid n) = \frac{f(n, a)}{f(n)}$

f(n, a): joint frequency of adjective a and noun n
f(n): the frequency of noun n

For each of the adjectives we look up the corresponding attributes in WordNet
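The pattern matching and the conditional-probability weighting can be sketched as follows. Several points are assumptions for illustration: the corpus is written in word{TAG} notation with single spaces between tokens, f(n) is counted over the extracted pairs only, and the sample corpus and function names are hypothetical. The WordNet attribute lookup is not shown.

```python
import re
from collections import Counter

# 'the car is red': optional determiner, noun(s), copula, adjective.
COPULA = re.compile(r"(?:\w+\{DET\} )?((?:\w+\{NN\} )+)is\{VBZ\} (\w+)\{JJ\}")
# 'the red car': optional determiner, adjective, noun(s).
ATTRIBUTIVE = re.compile(r"(?:\w+\{DET\} )?(\w+)\{JJ\} ((?:\w+\{NN\} ?)+)")


def strip_tags(chunk):
    """Remove {TAG} markers and surrounding whitespace from a matched chunk."""
    return re.sub(r"\{\w+\}", "", chunk).strip()


def adjective_noun_pairs(tagged_text):
    """Return (noun, adjective) pairs matched by either pattern."""
    pairs = [(strip_tags(n), a) for n, a in COPULA.findall(tagged_text)]
    pairs += [(strip_tags(n), a) for a, n in ATTRIBUTIVE.findall(tagged_text)]
    return pairs


def conditional_weights(pairs):
    """Weight each pair by f(n, a) / f(n), counting over the extracted pairs."""
    joint = Counter(pairs)
    noun_freq = Counter(noun for noun, _ in pairs)
    return {(n, a): joint[(n, a)] / noun_freq[n] for (n, a) in joint}


corpus = ("the{DET} car{NN} is{VBZ} red{JJ} .{PUNCT} "
          "the{DET} red{JJ} car{NN} stopped{VBD} .{PUNCT} "
          "the{DET} car{NN} is{VBZ} fast{JJ} .{PUNCT}")
weights = conditional_weights(adjective_noun_pairs(corpus))
```

With this corpus "red" is seen twice with "car" and "fast" once, so "red" receives the larger weight.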

“meronymy” / “part-of” relations

Given a “seed” word, find parts of that word in a large corpus of text

Patterns (format: type_of_word TAG type_of_word TAG …):
• whole NN[-PL] ’s POS part NN[-PL], e.g. …building’s basement…
• part NN[-PL] of PREP {the|a} DET mods [JJ|NN]* whole NN, e.g. …basement of a building…

Tags: NN = Noun, NN-PL = Plural Noun, PREP = Preposition, POS = Possessive, JJ = Adjective

Accuracy: 55%

Qualia structures

The meaning of a lexical element is described in terms of four roles:
• Constitutive: physical properties of an object (e.g., weight, material, parts)
• Agentive: typically a verb denoting an action which brings the object into existence
• Formal: normally consists of typing information about the object (e.g., hypernym)
• Telic: the purpose or function of an object, expressed either by a verb or by a nominal

Example: qualia structure for knife
• Formal: artifact_tool
• Constitutive: blade, handle, …
• Telic: cut_act
• Agentive: make_act

Qualia Structures: Learning Approach

Aim: to automatically learn qualia structures from the WWW

Based on the idea of matching certain lexico-syntactic patterns conveying a standard relation

Clues: search engine queries indicating the relation of interest

Calculate the weight of a candidate qualia element e for the term t using the Jaccard coefficient:

$Jaccard(e, t) = \frac{GoogleHits(e \wedge t)}{GoogleHits(e) + GoogleHits(t) - GoogleHits(e \wedge t)}$

Qualia Structures: Learning Process

Word → Generate Clues → Download Google Abstracts → POS-tagging → Matching regular expressions → Statistical Weighting → Weighted QS
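The statistical weighting step can be sketched with the standard Jaccard coefficient over hit counts: hits for both words together, divided by hits for either word. No search engine is queried here; the three counts are assumed to be supplied by the caller, and the function name and the numbers in the usage line are hypothetical.

```python
def jaccard_weight(hits_e, hits_t, hits_both):
    """Jaccard coefficient: hits(e and t) / (hits(e) + hits(t) - hits(e and t))."""
    union = hits_e + hits_t - hits_both
    return hits_both / union if union else 0.0


# Hypothetical hit counts for a candidate element e and a term t.
weight = jaccard_weight(hits_e=1000, hits_t=500, hits_both=250)
```

The weight is 1.0 when the two words always occur together and 0.0 when they never do.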

Qualia Structure: Patterns (1/2)

Formal Role

Telic Role

Qualia Structure: Patterns (2/2)

Constitutive Role

Relations by syntactic analysis

OntoLT mapping rule: SubjToClass_PredToSlot_DObjToRange

Maps a subject to the domain, the predicate or verb to a slot or relation, and the object to its range.

Example: ‘The player kicked the ball to the net’
relation: kick (domain: player, range: ball)
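The mapping itself is simple once a parse is available. The sketch below is not OntoLT's API: it assumes a parser has already produced lemmatized subject, predicate and direct object, and the `Relation` class, the dictionary keys and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    name: str      # taken from the predicate (verb lemma)
    domain: str    # taken from the subject
    range_: str    # taken from the direct object


def subj_pred_dobj_to_relation(parse):
    """Map a {'subj', 'pred', 'dobj'} parse to a Relation."""
    return Relation(name=parse["pred"], domain=parse["subj"], range_=parse["dobj"])


# Assumed parser output for 'The player kicked the ball to the net'.
relation = subj_pred_dobj_to_relation({"subj": "player", "pred": "kick", "dobj": "ball"})
```

The prepositional phrase "to the net" is ignored by this rule, which only looks at subject, predicate and direct object.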

Relations by linguistic theory

The subcategorization frame of a word is the number and kinds of other words that it selects when appearing in a sentence.

E.g. identify verbs in text as indicators of a relation between their arguments (object properties)

Example: ‘Joe wrote a letter’
relation: write (subject: Joe, object: letter)

Selectional restrictions for the subject and object of the verb “write”: subject: Person, object: written-communication

Part 6 Axioms & Rules

Layer cake level: Axioms & Rules, e.g. ∀x, y (sufferFrom(x, y) → ill(x))

DIRT

Discovery of Inference Rules from Text

An unsupervised method for discovering inference rules from text, such as
• X is author of Y ≈ X wrote Y
• X caused Y ≈ Y is blamed on X
• X manufactures Y ≈ X’s Y factory

Is based on the Distributional Hypothesis: words that occurred in the same contexts tend to be similar

DIRT: Distributional Hypothesis

The Distributional Hypothesis is applied to dependency trees

If two paths tend to link the same sets of

words, their meanings are hypothesized to be

similar

DIRT: Dependency trees

The inference rules discovered by DIRT are between paths in dependency trees

The dependency trees are generated by the Minipar parser

Minipar represents its grammar as a network where nodes represent grammatical categories and links represent syntactic relationships

[Table: a subset of the dependency relations in Minipar output]

DIRT: Dependency trees

“John found a solution to the problem”

Dependency links (head → modifier):
• found -subj→ John
• found -obj→ solution
• solution -det→ a
• solution -mod→ to
• to -pcomp→ problem
• problem -det→ the

Links represent dependency relationships
Direction: from the head to the modifier
Labels represent types of dependency relations
Each link between two words represents a direct semantic relationship

Path between “John” and “problem”:
N:subj:V ← find → V:obj:N → solution → N:to:N
meaning “X finds solution to Y”

DIRT: Paths in Dependency Trees

Transformation rule: connect the prepositional complement directly to the words modified by the preposition

Each link between two words represents a direct semantic relationship

A path represents indirect semantic relationships between two content words
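The transformation rule and the path between two words can be sketched on a hand-written dependency tree. This is an illustration only: no parser is called, the links for “John found a solution to the problem” are typed in by hand as (head, relation, modifier) triples with the verb lemmatized, every word is assumed to occur once in the sentence, and the function names are hypothetical.

```python
# (head, relation, modifier) links, written by hand for the example sentence.
EDGES = [("find", "subj", "John"), ("find", "obj", "solution"),
         ("solution", "det", "a"), ("solution", "mod", "to"),
         ("to", "pcomp", "problem"), ("problem", "det", "the")]


def collapse_prepositions(edges):
    """Transformation rule: head -mod-> prep -pcomp-> N becomes head -prep-> N."""
    complement = {head: mod for head, rel, mod in edges if rel == "pcomp"}
    result = []
    for head, rel, mod in edges:
        if rel == "pcomp":
            continue                                   # folded into the link below
        if rel == "mod" and mod in complement:
            result.append((head, mod, complement[mod]))
        else:
            result.append((head, rel, mod))
    return result


def path(edges, start, end):
    """Return the links on the tree path between two words, or None."""
    neighbours = {}
    for link in edges:
        head, _, mod = link
        neighbours.setdefault(head, []).append((mod, link))
        neighbours.setdefault(mod, []).append((head, link))
    stack, seen = [(start, [])], {start}
    while stack:
        word, links = stack.pop()
        if word == end:
            return links
        for nxt, link in neighbours.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append((nxt, links + [link]))
    return None


john_to_problem = path(collapse_prepositions(EDGES), "John", "problem")
```

The resulting links (find, subj, John), (find, obj, solution) and (solution, to, problem) correspond to the path read as “X finds solution to Y”.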

Evaluation of Ontology Learning Techniques

1) Task-based evaluation (improve quality): the first

approach evaluates the adequacy of ontologies in the

context of other applications.

2) Corpus-based evaluation : the second approach uses

domain-specific data sources to determine to what

extent the ontologies are able to cover the

corresponding domain.

3) Criteria-based evaluation: the third approach assesses ontologies by determining how well they adhere to a set of criteria.

Task-based evaluation

Measures how well an ontology meets the requirements of the system that uses it. Examples:
• an ontology designed to improve the performance of document retrieval: are the retrieved documents more relevant when the ontology is used?
• the use of ontological relations in the context of speech recognition (compared with a gold standard generated by humans)

Corpus-based evaluation

methods for evaluating the ‘fit’ between an ontology and

the domain knowledge in the form of text corpora.

In this approach, natural language processing (e.g.,

latent semantic analysis, clustering) or information

extraction (e.g., named-entity recognition) techniques

are used to analyze the content of the corpus and

identify terms.

Criteria-based evaluation

Example criterion: the average number of terms that were aggregated to form a concept in an ontology. This criterion reflects the view that the more variants of a term are used to form a concept, the more complete the concept is.
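This criterion is a plain average and can be sketched as below, assuming the ontology is available as a mapping from each concept to the set of terms aggregated into it. The function name and the sample concepts are hypothetical.

```python
def average_terms_per_concept(concepts):
    """Mean number of terms aggregated per concept; 0.0 for an empty ontology."""
    if not concepts:
        return 0.0
    return sum(len(terms) for terms in concepts.values()) / len(concepts)


ontology = {
    "Disease": {"disease", "illness"},
    "Doctor": {"doctor", "physician", "medic"},
    "Hospital": {"hospital"},
}
average = average_terms_per_concept(ontology)
```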

Other measurement

Evaluation approaches can also be distinguished by the

layers of an ontology :

• term,

• concept,

• relation

Evaluations can be performed to assess the :

• correctness at the terminology layer,

• coverage at the conceptual layer,

• wellness at the taxonomy layer,

• adequacy of the non-taxonomic relations.