ontology learning

Post on 24-Jan-2016



Ontology Learning

Shalini Gupta - 07305R02, Apoorv Sharma - 07305913

Chirag Patel - 07305909, Shitanshu Verma - 07305037

Issue

There is a lot of information, but its current representation renders it uninterpretable for machines.
As a consequence, most of the information remains undiscovered.

Big and popular search engines are able to search only 3-4% of the total information on the web.

What is needed?

Improved machine intelligence: make machines read, understand, use, and modify information, with minimal human intervention.

To achieve it?

Enable machines to populate, enrich, evaluate, and maintain their knowledge representation.

What is an ontology?

A representation format that conceptualizes a domain.
It captures classes, instances, attributes, and relationships.
It provides a sound semantic ground for machine-understandable descriptions of digital content.
It is used in various fields, such as SE and AI.
It is represented using languages such as OWL.

What is ontology learning?

The process of preparing and updating ontologies from sources such as documents in natural language, with the help of dictionaries, thesauruses, etc.

Environment

The flow

An initial ontology is given.
Information sources are given.
Machines work over the data sources to enrich the ontology.
Once the ontology is enriched, a consistency check is done, followed by evaluation.

Terms related to the process

Ontology enrichment: improving an existing ontology.
Ontology population: creating a new ontology or adding new concepts to it.
Inconsistency resolution: resolving inconsistencies that come up while acquiring ontologies.

Enrichment of Ontology

Term identification: identify important terms in the text.
Taxonomy extraction: identify taxonomical relationships between the terms identified.
Non-taxonomical relationship extraction: identify other relationships.

Review

Ontology learning
Ontology enrichment
Term identification, taxonomy extraction, non-taxonomic relationship extraction

Term Identification: Basics

Everything is a concept: an object, an idea, or a thing.
A term lexicalizes a concept. It is a word or multi-word string that conveys 'a single meaning' within a given community, e.g. company, Paris, man, cellphone, Red Hat, car parking.
Goal: find representative concepts.

Term Identification: Steps

Term Recognition: find the terms.
Term Classification: cluster the terms that are the same.
Term Mapping: link the terms to well-defined concepts of referent data sources.
Various techniques exist for every step.

Term Identification: Tokenizing

Different combinations of linguistic techniques have been used to accomplish this step.
Tokenizing: scan the text in order to identify the boundaries of words and complex expressions.

Term Identification: Tokenizing

Remove stop words like 'a', 'the', 'of', 'with'.
E.g. Check of the Electrical Bonding of External Composite Panels with a CORAS Resistivity-Continuity Test Set
Terms: Check, Electrical Bonding, External Composite Panels, CORAS Resistivity-Continuity Test Set.
Generally, nouns are considered as candidate concepts.
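The stop-word step can be sketched in a few lines of Python. This is only an illustration: the stop-word list and the function name `candidate_terms` are assumptions, and a real system would add part-of-speech tagging to keep noun phrases only.

```python
import re

# Small illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "the", "of", "with", "in", "on", "for", "and", "to"}

def candidate_terms(text):
    """Split text into tokens and cut the token stream at every stop word."""
    # Keep hyphenated words such as "Resistivity-Continuity" as one token.
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    terms, current = [], []
    for token in tokens:
        if token.lower() in STOP_WORDS:
            if current:
                terms.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        terms.append(" ".join(current))
    return terms

sentence = ("Check of the Electrical Bonding of External Composite Panels "
            "with a CORAS Resistivity-Continuity Test Set")
print(candidate_terms(sentence))
```

On this sentence the chunks between stop words are exactly the four terms listed on the slide.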

Term Identification: Importance of a term

The TF-IDF technique can be used to find the important keywords [6]. It is a balanced measure stating that a word is more important if it appears several times in a target document and, at the same time, appears rarely in other documents.
Seed concepts from existing ontologies can also be used.
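A minimal TF-IDF sketch, assuming the common variant tf = count / document length and idf = ln(N / document frequency); other weightings exist, and the toy documents below are made up for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

# Hypothetical toy corpus.
docs = [
    "hotel room cost hotel".split(),
    "youth hostel cost".split(),
    "train ticket cost".split(),
]
scores = tf_idf(docs)
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(term, round(score, 3))
```

In the first document 'hotel' scores highest because it is frequent there and absent elsewhere, while 'cost' scores zero because it occurs in every document.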

Term Identification: Importance of a term

Multi-word terms: the C/NC-value method [5] considers
(1) the frequency of occurrence of the candidate term,
(2) the frequency of occurrence as a sub-string of other candidate terms,
(3) the number of candidate terms containing the given term as a sub-string,
(4) the number of words contained in the candidate term.
The relevant terms can also be determined by their mutual cohesiveness, using Mutual Expectation.
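The four factors above combine into the C-value score. The sketch below assumes the commonly quoted form of the formula: log2(|a|) * f(a) for a term that is not nested, and log2(|a|) * (f(a) - average frequency of the longer candidates containing a) otherwise. The frequencies are hypothetical, and the NC-value context weighting is left out.

```python
import math

def c_value(freq):
    """freq: {candidate term: corpus frequency}. Returns {term: C-value}."""
    def contains(longer, shorter):
        # True if `shorter` occurs as a contiguous word sequence in `longer`.
        lw, sw = longer.split(), shorter.split()
        return len(lw) > len(sw) and any(
            lw[i:i + len(sw)] == sw for i in range(len(lw) - len(sw) + 1))

    result = {}
    for term, f_term in freq.items():
        # Frequencies of the longer candidate terms that contain this term.
        nesting = [f for other, f in freq.items() if contains(other, term)]
        length_weight = math.log2(len(term.split()))
        if nesting:
            result[term] = length_weight * (f_term - sum(nesting) / len(nesting))
        else:
            result[term] = length_weight * f_term
    return result

# Hypothetical frequencies, for illustration only.
freq = {
    "composite panels": 10,
    "external composite panels": 4,
    "electrical bonding": 6,
}
cv = c_value(freq)
for term, value in cv.items():
    print(term, round(value, 2))
```

Note that log2 of the term length is zero for single-word terms, so in this form the score is only meaningful for multi-word candidates.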

Term Identification: Morphological Analysis

Uses morphological knowledge of a word [9].
A technique that identifies a word stem from a full word form.
To identify small domain-specific units, it studies patterns of word formation and attempts to formulate rules using the word structure.
E.g. in the biomedical domain, a word ending in “-ofilous” or “-itis” is very probably a bio-molecule or a medical term.
Advantage: can identify “background terms” even with a low frequency of appearance.

Term Identification: Named Entity Recognition

Recognition of person, location, and organization names as single complex entities, e.g. 'Merrill Lynch'.
Also covers complex date and time expressions, percentages, and monetary values.
The next step associates single words or complex expressions with concepts, e.g. 'Merrill Lynch' is related to the concept organization.

Identifying Relationships

More information for later steps.
Dependency relations: between a word and its neighbours, the mind perceives connections, the totality of which forms the structure of the sentence.
Structural connections establish dependency relations between the words.

Deriving Relationships from Dependency Relations

Syntactic dependency relations coincide closely with semantic relations [3].
E.g. France Telecom in Paris offers the new DSL technology.
Dependency relations would give a link between France Telecom (organization) and Paris (city).
From this we can derive a semantic relationship between organization and city.

Term Identification

Identifying Relationships

Taxonomic Relationships

Non-Taxonomic Relationships

Taxonomy Construction

A hierarchy of concepts.
Inclusion relations provide a tree view of the ontology and imply inheritance between super-concepts and sub-concepts.
E.g. 'Living being' is a super-concept and 'mammal' is a sub-concept.
In terms of the ontology, the root node is the most general one for the domain of interest.

Discovering taxonomic relations

Based on lexico-syntactic patterns.
Inclusion relations between concepts can be found through simple pattern matching on a set of documents.
E.g. NP such as NP, NP, ..., and NP
"... works by authors such as Herrick, Goldsmith, and Shakespeare"
hyponym(“author”, Herrick)
hyponym(“author”, Goldsmith)
hyponym(“author”, Shakespeare)
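A rough sketch of matching the "NP such as NP, NP, ..., and NP" pattern with a regular expression. It is a simplification under stated assumptions: noun phrases are single words (a real system would use an NP chunker), and the hypernym is singularized by simply dropping a trailing "s". The names `SUCH_AS` and `hyponym_pairs` are made up for this example.

```python
import re

# "<noun> such as A, B, and C", restricted to single-word noun phrases.
# The lookahead keeps "and"/"or" from being read as a list member.
SUCH_AS = re.compile(
    r"(\w+) such as (\w+(?:, (?!and\b|or\b)\w+)*(?:,? (?:and|or) \w+)?)")

def hyponym_pairs(text):
    """Return (hypernym, hyponym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(text):
        hypernym = match.group(1)
        if hypernym.endswith("s"):
            hypernym = hypernym[:-1]  # crude singularization (assumption)
        members = re.split(r",\s*(?:and |or )?|\s+(?:and|or)\s+", match.group(2))
        pairs.extend((hypernym, member) for member in members if member)
    return pairs

text = "... works by authors such as Herrick, Goldsmith, and Shakespeare."
for hypernym, hyponym in hyponym_pairs(text):
    print(f"hyponym({hypernym!r}, {hyponym})")
```

Run on the example sentence, this yields the three hyponym facts shown on the slide.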

Discovering new patterns

The idea is to use a pattern learner to generate new patterns.
Generated patterns can then be used to generate new information (new inclusion relations), as well as to assess the validity of extracted information.
E.g. starting from the pattern
NP such as NP, NP, ..., and NP
we can generate new patterns like
NP is NP
NP, NP, ..., and other NP
NP, especially NP, NP, ..., and NP

Algorithm for finding new patterns

1. Decide on a lexical relation, R, that is of interest, e.g. "group/member" or a hyponym relation like (author, Shakespeare).

2. Gather a list of terms/instances for which this relation holds.

3. Find places in the corpus where these terms/instances occur syntactically near one another and record the environment.

4. Find new patterns using this.

5. Once a new pattern has been positively identified, use it to gather more instances of the target relation and go to Step 2.

Multi-word concepts

A concept may be represented by a multi-word term.
A concept 'A' is a hyponym of a concept 'B' if
A has more tokens than B,
all the tokens of B are present in A, and
both terms have the same head.
E.g. the concepts 'private customer' and 'business customer' are hyponyms of the concept 'customer'.
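The three conditions translate directly into a small check. The sketch assumes that the head of an English noun phrase is its last token, which is a simplification; `is_hyponym` is a name chosen for this example.

```python
def is_hyponym(a, b):
    """True if multi-word term `a` is a hyponym of term `b` under the three rules."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    return (
        len(a_tokens) > len(b_tokens)                 # A has more tokens than B
        and all(tok in a_tokens for tok in b_tokens)  # all tokens of B occur in A
        and a_tokens[-1] == b_tokens[-1]              # same head (assumed: last token)
    )

print(is_hyponym("private customer", "customer"))
print(is_hyponym("business customer", "customer"))
print(is_hyponym("customer service", "customer"))
```

The third call fails the head test: 'customer service' contains the token 'customer', but its head is 'service'.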

Mining non-taxonomic relations

Relationships other than is-a relationships.
E.g. linguistic processing may find that the word 'cost' occurs frequently with the words 'hotel', 'guest house', and 'youth hostel' in sentences like 'Costs at the youth hostel are $20 per night'.
The relations (cost, hotel), (cost, guest house), and (cost, youth hostel) exist.
The discovery algorithm finds support and confidence measures for these pairs, as well as for relationships at higher levels of abstraction, such as accommodation and costs.

Finding non-taxonomic relations

Based on the basic association rule algorithm [3].
Basic association rule algorithm:
Given a set of transactions, T.
Each transaction has a set of items, i1, i2, ..., in.
Goal: compute association rules of the form i1 → i2.
Trick: exploits the fact that many items appear together, so the occurrence of one implies the occurrence of another with a high probability (confidence).

Association Rule Mining

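E.g. consider the transactions
(bread, butter, jam, chips)
(bread, butter, jam, ketchup)
(ketchup, chips)
(bread, butter, jam, chips)
(bread, rice)
E.g. for the rule bread → butter, jam, take X = {bread} and Y = {butter, jam}. Here n(S) is the number of transactions containing every item of S, and N is the total number of transactions.
Support = n(X ∪ Y) / N, e.g. Support = 3/5
Confidence = n(X ∪ Y) / n(X), e.g. Confidence = 3/4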
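The two measures can be checked on the five transactions with a few lines of Python. The helper name `count` is an assumption; it counts the transactions that contain every item of a given set.

```python
transactions = [
    {"bread", "butter", "jam", "chips"},
    {"bread", "butter", "jam", "ketchup"},
    {"ketchup", "chips"},
    {"bread", "butter", "jam", "chips"},
    {"bread", "rice"},
]

def count(items, transactions):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

x = {"bread"}
y = {"butter", "jam"}

support = count(x | y, transactions) / len(transactions)
confidence = count(x | y, transactions) / count(x, transactions)
print(support, confidence)  # 0.6 0.75
```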

Algorithm

1. Extend each transaction to include the ancestors of each item. E.g. include the word 'Accommodation' in the transactions containing the word 'guest house'.
2. Determine association rules of the form Xk → Yk, where |Xk| = 1 and |Yk| = 1.
3. Determine the confidence for all rules that exceed the user-determined support.
4. Prune the rules subsumed by ancestral rules. E.g. if we found two rules, (cost, accommodation) and (cost, hotel), we prune the latter rule (cost, hotel).
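The four steps can be sketched as follows. Everything concrete here is an assumption made for illustration: the tiny taxonomy in `PARENT`, the transactions, the thresholds, and the choice to treat rules as directed pairs and to skip rules between an item and its own ancestor.

```python
from itertools import permutations

# Hypothetical taxonomy: child -> parent.
PARENT = {
    "hotel": "accommodation",
    "guest house": "accommodation",
    "youth hostel": "accommodation",
}

def ancestors(item):
    """All ancestors of an item, nearest first."""
    chain = []
    while item in PARENT:
        item = PARENT[item]
        chain.append(item)
    return chain

def mine(transactions, min_support, min_confidence):
    # Step 1: extend each transaction with the ancestors of its items.
    extended = [set(t).union(*(ancestors(i) for i in t)) for t in transactions]
    n = len(extended)
    items = set().union(*extended)
    # Steps 2 and 3: single-item rules x -> y with enough support and confidence.
    rules = {}
    for x, y in permutations(sorted(items), 2):
        if y in ancestors(x) or x in ancestors(y):
            continue  # an item trivially co-occurs with its own ancestors
        both = sum(1 for t in extended if x in t and y in t)
        with_x = sum(1 for t in extended if x in t)
        if both / n >= min_support and both / with_x >= min_confidence:
            rules[(x, y)] = (both / n, both / with_x)
    # Step 4: prune rules subsumed by a rule over ancestors.
    def subsumed(x, y):
        return any(
            (ax, ay) != (x, y) and (ax, ay) in rules
            for ax in [x] + ancestors(x)
            for ay in [y] + ancestors(y)
        )
    return {rule: m for rule, m in rules.items() if not subsumed(*rule)}

# Hypothetical transactions (one per sentence, for example).
transactions = [
    {"cost", "hotel"},
    {"cost", "guest house"},
    {"cost", "youth hostel"},
    {"cost", "hotel"},
    {"ticket", "train"},
]
result = mine(transactions, min_support=0.3, min_confidence=0.5)
for (x, y), (sup, conf) in result.items():
    print(f"{x} -> {y}: support={sup}, confidence={conf}")
```

With these inputs the rule (cost, hotel) passes the thresholds but is pruned in favour of the more general (cost, accommodation), as in the slide's example.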

Statistics-based Extraction of Taxonomic Relations [12][13]

Uses hierarchical clustering.
Groups similar terms in a bottom-up fashion.
Uses the cosine similarity function.

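The cosine measure, or normalized correlation coefficient, between two vectors x and y is given by
cos(x, y) = (x · y) / (‖x‖ ‖y‖) = Σ_i x_i y_i / ( sqrt(Σ_i x_i²) · sqrt(Σ_i y_i²) )

Algorithm

Computation of the similarity function: the similarity matrix is built from the pairwise cosine of the term vectors, e.g.

Hotel vector = (0, 14, 7, 4, 6)
Accommodation vector = (14, 0, 11, 2, 5)
cos(Hotel, Accommodation) = (0*14 + 14*0 + 7*11 + 4*2 + 6*5) / (sqrt(297) * sqrt(346)) = 115 / 320.6 ≈ 0.36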
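As a check on the arithmetic, the cosine of the two vectors above can be computed directly. This sketch uses the standard definition over all five components; the function name `cosine` is chosen for the example.

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

hotel = (0, 14, 7, 4, 6)
accommodation = (14, 0, 11, 2, 5)

similarity = cosine(hotel, accommodation)
print(round(similarity, 3))  # 0.359
```

A bottom-up clustering step would compute this value for every pair of terms and repeatedly merge the most similar pair.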

Case study: Web-based Ontology Learning with ISOLDE

ISOLDE (Information System for Ontology Learning and Domain Exploration) produces a domain ontology from a base ontology.
It uses the following:
An unsupervised named entity recognition system.
Web resources like DWDS, Wikipedia, and Wiktionary.

Analysis steps used by ISOLDE

1. Named-entity recognition (NER): uses a domain-specific corpus, a base ontology, and a general-purpose NER system (SProUT, see Drozdzynski et al. 2004) to find instances for the classes in the base ontology.
2. Linguistic pattern analysis: extracts class candidates from the context of the instances extracted in step 1, by use of lexico-syntactic patterns.
3. Collecting web-based knowledge: collects information on and between the extracted class candidates from online resources and integrates it into a new or extended taxonomy/ontology.

Architecture

Stage-wise Examples

After step 1 we get Ballack, Munich, etc. as named entities from the soccer corpus.
In the second step we find the class candidates for the named entities from the sentences in the corpus, and then filter the domain-specific candidates using the χ² method. E.g. the sentence 'Ballack, the best midfielder in the German national team.' gives midfielder as the class candidate of Ballack.
In the third step we search the web for the class candidates. E.g. the Wikipedia definition of midfielder is 'A midfielder is a player whose position of play is midway between the attacking strikers and the defenders'.

Example contd.

We learn the relation 'midfielder is a player' (a taxonomic relationship).

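Relevance Factor χ²

χ² = Σ_{i,j} (O_{i,j} − E_{i,j})² / E_{i,j}, where O holds the observed counts and E the counts expected under independence

O matrix for striker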
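A sketch of the relevance test, assuming the usual Pearson form of χ² over a matrix O of observed counts, with expected counts derived from the row and column totals. The 2x2 counts below are hypothetical (rows: 'striker' vs. all other terms; columns: domain corpus vs. reference corpus) and are not the values from the original slide.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if rows and columns were independent.
            e = row_totals[i] * col_totals[j] / total
            chi2 += (o - e) ** 2 / e
    return chi2

# Hypothetical observed counts, for illustration only.
observed = [
    [30, 10],    # 'striker': domain corpus, reference corpus
    [20, 940],   # other terms: domain corpus, reference corpus
]
print(round(chi_square(observed), 1))
```

A large value means the term is distributed very differently in the domain corpus than in the reference corpus, which is the signal used to keep it as a domain-specific class candidate.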

Issues in Learning

human understandable vs machine understandable
learning higher degree relations
mapping to high level ontology
evaluation benchmark
incremental ontology learning
multi-agent learning

Application of ontology

Is ubiquitous in information systems [2].
Improving the performance of information retrieval and reasoning.
Making data between different applications interoperable.
Ontology-type semantic descriptions of behaviors and services allow software agents in a multi-agent system to better coordinate themselves.

References

[1] Elias Zavitsanos, Georgios Paliouras, and George Vouros. Ontology Learning and Evaluation: A Survey. Technical Report, 2006.

[2] Nicolas Weber and Paul Buitelaar. Web-based Ontology Learning with ISOLDE. DFKI GmbH - Language Technology Lab, Saarbrücken, Germany, 2006.

[3] Alexander Maedche and Steffen Staab. Mining Ontologies from Text. 2000.

[4] Alexander Maedche, Viktor Pekar, and Steffen Staab. Ontology Learning Part One - On Discovering Taxonomic Relations from the Web. 2003.

[5] K. Frantzi, S. Ananiadou, and H. Mima. Automatic recognition of multi-word terms: The C-value/NC-value method. 3(2):115–130, 2000.

[6] G. Salton, A. Wong, and C.S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613–620, 1975.

[7] D.I. Moldovan and R.C. Girju. An interactive tool for the rapid development of knowledge bases. International Journal on Artificial Intelligence Tools (IJAIT), 10(1-2), 2001.

[8] J.D. Cohen. Highlights: Language and domain independent automatic indexing terms for abstracting. Journal of the American Society for Information Science, 46(3):162–174, 1995.

[9] U. Heid. A linguistic bootstrapping approach to the extraction of term candidates from German text. Terminology, 5(2):161–181, 1998.

[10] L.M. Iwanska, N. Mata, and K. Kruger. Fully Automatic Acquisition of Taxonomic Knowledge from Large Corpora of Texts, pages 335–345. MIT/AAAI Press, 2000.

[11] J.U. Kietz, A. Maedche, and R. Volz. A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet. Juan-Les-Pins, France, 2000.

[12] A. Maedche, V. Pekar, and S. Staab. Ontology learning part one - on discovering taxonomic relations from the web. In Proceedings of the Web Intelligence conference. Springer Verlag, 2002.

[13] Vincent Schickel-Zuber and Boi Faltings. Using hierarchical clustering for learning the ontologies used in recommendation systems. KDD 2007: 599-608.

[14] A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In Proceedings of ECAI 2000, IOS Press, Amsterdam, 2000.

Thank You
