Human Language Technology (HLT) and Knowledge Acquisition for the Semantic
Web: a Tutorial
Diana Maynard (University of Sheffield)
Julien Nioche (University of Sheffield)
Marta Sabou (Vrije Universiteit Amsterdam)
Johanna Völker (AIFB)
Atanas Kiryakov (Ontotext Lab, Sirma AI)
EKAW 2006
[This work has been supported by SEKT (http://sekt.semanticweb.org/) and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
Aims of this tutorial
• Investigates some technical aspects of HLT for the SW and brings this methodology closer to non-HLT experts
• Provides an introduction to an HLT toolkit (GATE)
• Demonstrates using HLT for automating SW-specific knowledge acquisition tasks such as:
  – Semantic annotation
  – Ontology learning
  – Ontology population
Some Terminology
• Semantic annotation – annotate in the texts all mentions of instances relating to concepts in the ontology
• Ontology learning – automatically derive an ontology from texts
• Ontology population – given an ontology, populate the concepts with instances derived automatically from a text
Semantic Annotation: Motivation
• Semantic metadata extraction and annotation is the glue that ties ontologies into document spaces
• Metadata is the link between knowledge and its management
• Manual metadata production cost is too high
• The state of the art in automatic annotation needs to be extended to target ontologies and scaled to industrial document stores and the web
Challenge of the Semantic Web
• The Semantic Web requires machine processable, repurposable data to complement hypertext
• Once metadata is attached to documents, they become much more useful and more easily processable, e.g. for categorising, finding relevant information, and monitoring
• Such metadata can be divided into two types of information: explicit and implicit.
Metadata extraction
• Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.)
• Implicit metadata extraction involves semantic information deduced from the text, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
Ontology Learning and Population: Motivation
• Creating and populating ontologies manually is a very time-consuming and labour-intensive task
• It requires both domain and ontology experts
• Manually created ontologies are generally not compatible with other ontologies, which reduces interoperability and reuse
• Manual methods are impossible with very large amounts of data
Semantic Annotation vs Ontology Population
• Semantic Annotation
  – Mentions of instances in the text are annotated with respect to concepts (classes) in the ontology.
  – Requires that instances are disambiguated.
  – It is the text which is modified.
• Ontology Population
  – Generates new instances in an ontology from a text.
  – Links unique mentions of instances in the text to instances of concepts in the ontology.
  – Instances must not only be disambiguated; co-reference between them must also be established.
  – It is the ontology which is modified.
GATE : an open source framework for HLT
• GATE (General Architecture for Text Engineering) is a framework for language processing (http://gate.ac.uk)
• Open Source (LGPL licence)
• Hosted on SourceForge: http://sourceforge.net/projects/gate
• Ten years old (!), with 1000s of users at 100s of sites
• Current version 3.1
4 sides to the story
• An architecture: A macro-level organisational picture for HLT software systems.
• A framework: For programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: For language engineers, computational linguists et al, a graphical development environment.
• A community of users and contributors
Architectural principles
• Non-prescriptive, theory neutral (strength and weakness)
• Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of Protégé, Jena, Yale...)
• (Almost) everything is a component, and component sets are user-extendable
• (Almost) all operations are available both from API and GUI
All the world’s a Java Bean....
CREOLE: a Collection of REusable Objects for Language Engineering:
• GATE components: modified Java Beans with XML configuration
• The minimal component = 10 lines of Java, 10 lines of XML, 1 URL
Why bother?
• Allows the system to load arbitrary language processing components
NOTES
• everything is a replaceable bean
• all communication via fixed APIs
• low coupling, high modularity, high extensibility
[Figure: GATE architecture diagram. Document Format Layer (LRs): HTML, RTF, XML, PDF, … docs handled by HTML Document Format, XML Document Format, PDF Document Format, … Document Format. DataStore Layer: XML, Oracle, PostgreSql, .ser. Corpus Layer (LRs): Corpus, Document, DocumentContent, AnnotationSet, Annotation, FeatureMap. GATE APIs. Processing Layer (PRs): NE, Co-ref, TEs, TRs, POS, … Language Resource Layer (LRs): Ontology, Protégé Ontology, WordNet, Gazetteers, ... Application Layer: ANNIE, OBIE, … IDE GUI Layer (VRs): ADiff, OntolVR, DocVR, ...]
GATE Users
• American National Corpus project
• Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs: Melandra, SG-MediaStyle, ...
• a large number of other UK, US and EU Universities
• UK and EU projects inc. SEKT, PrestoSpace, KnowledgeWeb, MyGrid, CLEF, Dot.Kom, AMITIES, CubReporter, …
Past Projects using GATE
• MUMIS: conceptual indexing: automatic semantic indices for sports video
• MUSE: multi-genre multilingual IE
• HSL: IE in domain of health and safety
• Old Bailey: IE on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research in e-science
• EMILLE: creation of S. Asian language corpus
• ACE / TIDES: IE competitions and collaborations in English, Chinese, Arabic, Hindi
• h-TechSight: ontology-based IE and text mining
Current projects using GATE
• ETCSL: Language tools for Sumerian digital library
• SEKT: Semantic Knowledge Technologies
• PrestoSpace: Preservation of audiovisual data
• KnowledgeWeb: Semantic Web network of excellence
• MEDIACAMPAIGN: Discovering, inter-relating and navigating cross-media campaign knowledge
• TAO: Transitioning Applications to Ontologies
• MUSING: SW-based business intelligence tools
• NEON: Networked Ontologies
GATE
IE is not IR
IE pulls facts and structured information from the content of large text collections. You analyse the facts.
IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.
IE for Document Access
• With traditional query engines, getting the facts can be hard and slow
• Where has the Queen visited in the last year?
• Which places on the East Coast of the US have had cases of West Nile Virus?
• Which search terms would you use to get this kind of information?
• How can you specify you want someone’s home page?
• IE returns information in a structured way
• IR returns documents containing the relevant information somewhere (if you’re lucky)
HaSIE: an example application
• Application developed by the University of Sheffield, which aims to find out how companies report health and safety information
• Answers questions such as:
  “How many members of staff died or had accidents in the last year?”
  “Is there anyone responsible for health and safety?”
  “What measures have been put in place to improve health and safety in the workplace?”
HaSIE
• Identification of such information is too time-consuming and arduous to be done manually.
• Each company report may be hundreds of pages long.
• IR systems can’t help because they return whole documents
• System identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with relevant information
• This can then be analysed by an expert
HASIE
Named Entity Recognition: the cornerstone of IE
• Identification of proper names in texts, and their classification into a set of predefined categories of interest
• Persons
• Organisations (companies, government organisations, committees, etc)
• Locations (cities, countries, rivers, etc)
• Date and time expressions
• Various other types as appropriate
Why is NE important?
• NE provides a foundation from which to build more complex IE systems
• Relations between NEs can provide tracking, ontological information and scenario building
• Tracking (co-reference) “Dr Smith”, “John Smith”, “John”, “he”
• Ontologies “Athens, Georgia” vs “Athens, Greece”
Two kinds of approaches
Knowledge Engineering
• rule based
• developed by experienced language engineers
• make use of human intuition
• require only small amount of training data
• development can be very time consuming
• some changes may be hard to accommodate
Learning Systems
• use statistics or other machine learning
• developers do not need LE expertise
• require large amounts of annotated training data
• some changes may require re-annotation of the entire training corpus
Typical NE pipeline
• Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
• Entity finding (gazetteer lookup, NE grammars)
• Coreference (alias finding, orthographic coreference etc.)
• Export to database / XML
An Example
Ryanair announced yesterday that it will make Shannon its next European base, expanding its route network to 14 in an investment worth around €180m. The airline says it will deliver 1.3 million passengers in the first year of the agreement, rising to two million by the fifth year.
• Entities: Ryanair, Shannon
• Descriptions: European base
• Relations: Shannon base_of Ryanair
• Events: investment(€180m)
• Mentions: it=Ryanair, The airline=Ryanair, it=the airline
System development cycle
1. Collect corpus of texts
2. Manually annotate gold standard
3. Develop system
4. Evaluate performance against gold standard
5. Return to step 3, until desired performance is reached
Performance Evaluation
2 main requirements:
• Evaluation metric: mathematically defines how to measure the system’s performance against human-annotated gold standard
• Scoring program: implements the metric and provides performance measures
  – For each document and over the entire corpus
  – For each type of NE
Evaluation Metrics
• Most common are Precision and Recall
• Precision = correct answers / answers produced
• Recall = correct answers / total possible correct answers
• Trade-off between precision and recall
• F1 (balanced) Measure = 2PR / (P + R)
• Some tasks sometimes use other metrics, e.g. cost-based (good for application-specific adjustment)
• Ontology-based IE requires measures sensitive to the ontology
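The precision, recall and F1 definitions can be sketched in a few lines of Python. This is a minimal illustration, not GATE's scoring program: it assumes annotations are represented as hypothetical (start, end, type) tuples, and it only credits exact matches.

```python
def precision_recall_f1(gold, predicted):
    """Score predicted annotations against a gold standard (exact matches only)."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall: 2PR / (P + R).
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations: (start offset, end offset, entity type).
gold = {(0, 7, "Organization"), (40, 47, "Location"), (60, 70, "Date")}
predicted = {(0, 7, "Organization"), (40, 47, "Person")}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Here one of the two answers produced is correct, out of three possible correct answers, so precision is 1/2, recall is 1/3 and F1 is 0.4.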
GATE AnnotationDiff Tool
Corpus-level Regression Testing
• Need to track system’s performance over time
• When a change is made we want to know implications over whole corpus
• Why: because an improvement in one case can lead to problems in others
• GATE offers corpus benchmark tool, which can compare different versions of the same system against a gold standard
• This operates on a whole corpus rather than a single document
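The idea of corpus-level regression testing can be sketched as follows. This is an illustrative sketch only, not the corpus benchmark tool itself: it assumes each run is a hypothetical mapping from document name to a set of annotations, and it reports the documents whose F1 dropped between two versions of a system.

```python
def f1(gold, predicted):
    """Exact-match F1 of one document's predicted annotations against gold."""
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


def regressions(gold_corpus, old_run, new_run):
    """Return {document: (old F1, new F1)} for documents that got worse."""
    report = {}
    for doc, gold in gold_corpus.items():
        before = f1(gold, old_run.get(doc, set()))
        after = f1(gold, new_run.get(doc, set()))
        if after < before:
            report[doc] = (before, after)
    return report


# Hypothetical corpus: annotations are reduced to plain identifiers.
gold_corpus = {"a.txt": {1, 2}, "b.txt": {3, 4}}
old_run = {"a.txt": {1}, "b.txt": {3, 4}}
new_run = {"a.txt": {1, 2}, "b.txt": {3}}

report = regressions(gold_corpus, old_run, new_run)
print(report)
```

In this toy data the new version fixes a.txt but loses an annotation in b.txt, which is the kind of side effect that only shows up when the whole corpus is scored.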
Corpus Benchmark Tool
GATE’s Rule-based System - ANNIE
• ANNIE – A Nearly-New IE system
• A version distributed as part of GATE
• GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
• GATE has a finite-state pattern-action rule language - JAPE, used by ANNIE
• A reusable and easily extendable set of components
What is ANNIE?
ANNIE is a vanilla information extraction system comprising a set of core PRs:
– Tokeniser
– Gazetteers
– Sentence Splitter
– POS tagger
– Semantic tagger (JAPE transducer)
– Orthomatcher (orthographic coreference)
Core ANNIE Components
Re-using ANNIE
• Typically a new application will use most of the core components from ANNIE
• The tokeniser, sentence splitter and orthomatcher are basically language, domain and application-independent
• The POS tagger is language dependent but domain and application-independent
• The gazetteer lists and JAPE grammars may act as a starting point but will almost certainly need to be modified
• You may also require additional PRs (either existing or new ones)
DEMO of ANNIE and GATE GUI
• Loading ANNIE
• Creating a corpus
• Loading documents
• Running ANNIE on corpus
• Demo
Gazetteers
• Gazetteers are plain text files containing lists of names (e.g. rivers, cities, people, …)
• Information used by JAPE rules
• Each gazetteer set has an index file listing all the lists, plus features of each list (majorType, minorType and language)
• Lists can be modified either internally using Gaze, or externally in your favourite editor
• Gazetteers can also be mapped to ontologies
• Generates Lookup results of the given kind
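The way gazetteer lists turn into Lookup results can be sketched in Python. This is a simplified illustration, not GATE's gazetteer implementation: the list names, their entries and the longest-match strategy are assumptions made for the example.

```python
# Hypothetical gazetteer set: list name -> (majorType, minorType, entries).
GAZETTEER = {
    "city.lst": ("location", "city", ["Sheffield", "New York", "Amsterdam"]),
    "company_designator.lst": ("companyDesignator", None, ["Ltd", "Inc", "GmbH"]),
}


def build_index(gazetteer):
    """Map each entry (as a tuple of words) to the features of its list."""
    index = {}
    for major, minor, entries in gazetteer.values():
        for entry in entries:
            index[tuple(entry.split())] = {"majorType": major, "minorType": minor}
    return index


def lookup(tokens, index):
    """Return (start, end, features) for the longest entry found at each position."""
    max_len = max(len(key) for key in index)
    results = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so "New York" beats a shorter entry.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            features = index.get(tuple(tokens[i:i + n]))
            if features:
                results.append((i, i + n, features))
                i += n
                break
        else:
            i += 1
    return results


tokens = "Digital Pebble Ltd has offices in New York and Sheffield".split()
hits = lookup(tokens, build_index(GAZETTEER))
for start, end, features in hits:
    print(" ".join(tokens[start:end]), features)
```

Each hit carries the majorType and minorType of the list it came from, which is the information a later grammar rule can test.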
JAPE grammars
• JAPE is a pattern-matching language
• The LHS of each rule contains patterns to be matched
• The RHS contains details of annotations (and optionally features) to be created
• The patterns in the corpus are identified using ANNIC
Input specifications
The head of each grammar phase needs to contain certain information
– Phase name
– Inputs
– Matching style
e.g.
Phase: location
Input: Token Lookup Number
Options: control = appelt
NE Rule in JAPE
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+   //from tokeniser
  {Lookup.kind == companyDesignator}         //from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }
=> will match “Digital Pebble Ltd”
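For readers who do not know JAPE, the logic of the Company1 rule can be approximated in Python. This is only an analogy, not how JAPE executes rules: tokens are modelled as hypothetical dictionaries with a string, an orthography value and an optional gazetteer lookup kind.

```python
def tok(string, lookup=None):
    """Build a toy token; anything not capitalised is simply called 'lowercase'."""
    upper_initial = string[:1].isupper() and string[1:].islower()
    orth = "upperInitial" if upper_initial else "lowercase"
    return {"string": string, "orth": orth, "lookup": lookup}


def find_companies(tokens):
    """One or more upperInitial tokens followed by a company designator."""
    matches = []
    i = 0
    while i < len(tokens):
        j = i
        # ( {Token.orthography == upperInitial} )+
        while (j < len(tokens)
               and tokens[j]["orth"] == "upperInitial"
               and tokens[j]["lookup"] != "companyDesignator"):
            j += 1
        # {Lookup.kind == companyDesignator}
        if j > i and j < len(tokens) and tokens[j]["lookup"] == "companyDesignator":
            text = " ".join(t["string"] for t in tokens[i:j + 1])
            matches.append({"type": "NamedEntity", "kind": "company",
                            "rule": "Company1", "text": text})
            i = j + 1
        else:
            i += 1
    return matches


sentence = [tok("We"), tok("met"), tok("Digital"), tok("Pebble"),
            tok("Ltd", lookup="companyDesignator"), tok("in"), tok("Sheffield")]
companies = find_companies(sentence)
print(companies)
```

The capitalised words "We" and "Sheffield" are not matched because no company designator follows them.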
LHS of the rule
• LHS is expressed in terms of existing annotations, and optionally features and their values
• Any annotation to be used must be included in the input header
• Any annotation not included in the input header will be ignored (e.g. whitespace)
• Each annotation is enclosed in curly braces
• Each pattern to be matched is enclosed in round brackets and has a label attached
Macros
• Macros look like the LHS of a rule but have no label
Macro: NUMBER
(({Digit})+)
• They are used in rules by enclosing the macro name in round brackets
( (NUMBER)+ ):match
• Conventional to name macros in uppercase letters
• Macros hold across an entire set of grammar phases
Contextual information
• Contextual information can be specified in the same way, but has no label
• Contextual information will be consumed by the rule
({Annotation1})
({Annotation2}):match
({Annotation3})
RHS of the rule
• LHS and RHS are separated by -->
• Label matches that on the LHS
• Annotation to be created follows the label
({Annotation1}):match
-->
:match.NE = {feature1 = value1, feature2 = value2}
Example Rule for Dates
Macro: ONE_DIGIT
({Token.kind == number, Token.length == "1"})

Macro: TWO_DIGIT
({Token.kind == number, Token.length == "2"})

Rule: TimeDigital1
// 20:14:25
(
 (ONE_DIGIT|TWO_DIGIT){Token.string == ":"} TWO_DIGIT
 ({Token.string == ":"} TWO_DIGIT)?
 (TIME_AMPM)?
 (TIME_DIFF)?
 (TIME_ZONE)?
):time
-->
:time.TempTime = {kind = "positive", rule = "TimeDigital1"}
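As a rough point of comparison, the core of the TimeDigital1 pattern (one or two digits, a colon, two digits, and an optional seconds part) can be written as a Python regular expression. This is a character-level approximation of a rule that actually works over Token annotations, and it leaves out the optional TIME_AMPM, TIME_DIFF and TIME_ZONE macros, whose definitions are not shown here.

```python
import re

# (ONE_DIGIT|TWO_DIGIT) ":" TWO_DIGIT (":" TWO_DIGIT)?
TIME_DIGITAL = re.compile(r"\b\d{1,2}:\d{2}(?::\d{2})?\b")


def find_times(text):
    """Return every substring that looks like a digital time."""
    return [m.group(0) for m in TIME_DIGITAL.finditer(text)]


times = find_times("The train left at 20:14:25 and arrived at 9:05.")
print(times)
```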
Identifying patterns in corpora
• ANNIC – ANNotations In Context
• Provides a keyword-in-context-like interface for identifying annotation patterns in corpora
• Uses JAPE LHS syntax, except that + and * need to be quantified
• e.g. {Person}{Token}*3{Organisation} – find all Person and Organisation annotations within up to 3 tokens of each other
• To use, pre-process the corpus with ANNIE or your own components, then query it via the GUI
ANNIC Demo
• Formulating queries
• Finding matches in the corpus
• Analysing the contexts
• Refining the queries
• Demo
Using phases
• Grammars usually consist of several phases, run sequentially
• A definition phase (conventionally called main.jape) lists the phases to be used, in order
• Only the definition phase needs to be loaded
• Temporary annotations may be created in early phases and used as input for later phases
• Annotations from earlier phases may need to be combined or modified
Matching algorithms and Rule Priority
• Rules compete within a single phase!
• 3 styles of matching:
  – Brill (fire every rule that applies)
  – First (shortest rule fires)
  – Appelt (use of priorities)
• Appelt priority is applied in the following order
  – Starting point of a pattern
  – Longest pattern
  – Explicit priority (default = -1)
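The Appelt ordering can be expressed as a sort key. The sketch below is an illustration of the three criteria only, not JAPE's matching engine: it assumes the candidate matches have already been found and are given as hypothetical (rule name, start, end, priority) tuples.

```python
def appelt_select(candidates):
    """Pick one match: earliest start, then longest span, then highest priority."""
    return min(
        candidates,
        key=lambda c: (c[1], -(c[2] - c[1]), -c[3]),
    )


# Hypothetical competing matches within one phase: (rule, start, end, priority).
candidates = [
    ("Person1", 5, 7, 10),
    ("Org1", 0, 2, -1),   # starts first but is shorter
    ("Org2", 0, 3, -1),   # same span as Loc1, default priority
    ("Loc1", 0, 3, 50),   # same span, higher explicit priority
]

winner = appelt_select(candidates)
print(winner)
```

Under the Brill style, by contrast, all four candidates would fire.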
Named Entities in GATE
Using co-reference
• Orthographic co-reference module matches proper names in a document
• Improves results by assigning entity type to previously unclassified names, based on relations with classified entities
• May not reclassify already classified entities
• Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. [Bonfield] will match [Sir Peter Bonfield]; [International Business Machines Ltd.] will match [IBM]
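The two matching cases mentioned above (a surname matching a full name, and an abbreviation matching a long form) can be sketched as follows. This is a toy approximation, not the orthomatcher's actual rule set: the title and designator lists and the two matching heuristics are assumptions made for the example.

```python
# Hypothetical word lists used to strip titles and company designators.
TITLES = {"Sir", "Dr", "Mr", "Mrs", "Ms"}
DESIGNATORS = {"Ltd", "Ltd.", "Inc", "Inc.", "plc"}


def acronym(name):
    """Initial letters of the capitalised words, ignoring company designators."""
    words = [w for w in name.split() if w not in DESIGNATORS]
    return "".join(w[0] for w in words if w[0].isupper())


def same_entity(short, full):
    """True if `short` is the last name in `full` or the acronym of `full`."""
    words = [w for w in full.split() if w not in TITLES]
    is_surname = bool(words) and short == words[-1]
    return is_surname or short == acronym(full)


def classify_unknown(unknown, classified):
    """Give unknown names the type of a classified entity they match."""
    result = {}
    for name in unknown:
        for full, entity_type in classified.items():
            if same_entity(name, full):
                result[name] = entity_type
                break
    return result


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
newly_typed = classify_unknown(["Bonfield", "IBM", "Shannon"], classified)
print(newly_typed)
```

Names that are already classified are never passed in, so the sketch cannot reclassify them; "Shannon" stays unclassified because it matches nothing.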
Named Entity Coreference
GATE 4.0
• Before end of 2006
• Faster and leaner!
• Nicer GUI
• ANNIC included
• Improved Machine Learning API (based on YALE)
• and more…
![Page 65: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/65.jpg)
66
Information Extraction for the Semantic Web
• Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time etc.
• For the Semantic Web, we need information in a hierarchical structure
• Idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology
• Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
![Page 66: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/66.jpg)
67
Richer NE Tagging
• Attachment of instances in the text to concepts in the domain ontology
• Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
![Page 67: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/67.jpg)
68
Magpie: an example
• Developed by the Open University
• Plugin for standard web browser
• Automatically associates an ontology-based semantic layer to web resources, allowing relevant services to be linked
• Provides means for a structured and informed exploration of the web resources
• e.g. looking at a list of publications, we can find information about an author such as projects they work on, other people they work with, etc.
![Page 68: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/68.jpg)
69
MAGPIE in action
![Page 69: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/69.jpg)
70
MAGPIE in action
![Page 70: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/70.jpg)
71
GATE and the Semantic Web
• Supports ontologies as part of IE applications - Ontology-Based IE (OBIE)
• Supports semantic annotation and ontology population
• Can combine learning and rule-based methods
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
![Page 71: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/71.jpg)
72
Ontology Management in GATE
![Page 72: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/72.jpg)
73
Linking the Text to the Ontology
![Page 73: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/73.jpg)
74
Exported Database
![Page 74: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/74.jpg)
75
Evaluation for OBIE
• Traditional IE is evaluated in terms of Precision, Recall and F-measure.
• But these are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious
• Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong
• Similarity metrics need to be integrated so that a wrong response scores higher the closer it is to the key concept in the hierarchy
![Page 75: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/75.jpg)
76
Augmented Precision and Recall
• Development of a new BDM (Balanced Distance Metric) which compares key and response concepts wrt a given ontology
• In the case of ontological mismatch, provides an indication of how serious the error is, and weights it accordingly
• BDM provides a score between 0 and 1 for each key/response match instead of a binary measure
![Page 76: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/76.jpg)
77
Augmented Precision and Recall
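BDM is integrated with traditional Precision and Recall in the following way to produce a score at the corpus level:
$$\mathrm{AP} = \frac{\mathrm{BDM}}{\mathrm{BDM} + \mathrm{Spurious}}$$
$$\mathrm{AR} = \frac{\mathrm{BDM}}{\mathrm{BDM} + \mathrm{Missing}}$$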
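As a rough illustration of the two formulas on this slide, the following Python sketch computes AP and AR from a list of per-match BDM scores. It assumes that "BDM" in the formulas stands for the sum of the per-match scores, which the slide does not state explicitly; the function name and the example numbers are hypothetical.

```python
def augmented_precision_recall(bdm_scores, spurious, missing):
    """Return (AP, AR) from per-match BDM scores and spurious/missing counts."""
    # Assumption: the corpus-level BDM term is the sum of the per-match scores.
    bdm_total = sum(bdm_scores)
    ap_denominator = bdm_total + spurious
    ar_denominator = bdm_total + missing
    ap = bdm_total / ap_denominator if ap_denominator else float("nan")
    ar = bdm_total / ar_denominator if ar_denominator else float("nan")
    return ap, ar


# Hypothetical corpus: one exact match and two near misses,
# plus one spurious and two missing annotations.
example_scores = [1.0, 0.724, 0.959]
ap, ar = augmented_precision_recall(example_scores, spurious=1, missing=2)
print(f"AP={ap:.3f} AR={ar:.3f}")
```

With binary scores (every match worth exactly 1.0) the same function reduces to ordinary Precision and Recall.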
![Page 77: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/77.jpg)
78
Examples of misclassification
| Entity | Response | Key | BDM |
|---|---|---|---|
| Sochi | Location | City | 0.724 |
| FBI | Org | GovOrg | 0.959 |
| Al-Jazeera | Org | TVCompany | 0.783 |
| Islamic Jihad | Company | ReligiousOrg | 0.816 |
| Brazil | Object | Country | 0.587 |
| Senate | Company | Political Entity | 0.826 |
![Page 78: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/78.jpg)
79
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
![Page 79: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/79.jpg)
Ontology Learning with Text2Onto
http://ontoware.org/projects/text2onto/
Johanna Völker
Institute AIFB, University of Karlsruhe
![Page 80: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/80.jpg)
81
Agenda
• Ontology Learning
– Tasks
– Problems
• Text2Onto
– Overview
– Architecture
– Linguistic preprocessing
– Ontology learning approaches
– Summary
![Page 81: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/81.jpg)
82
Ontology Learning
• Extraction of (domain) ontologies from natural language text
– Machine learning
– Natural language processing
• Tools: OntoLearn, OntoLT, ASIUM, Mo’K Workbench, JATKE, TextToOnto, …
![Page 82: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/82.jpg)
83
Ontology Learning – Tasks
| Task | Example |
|---|---|
| Concept extraction | car, vehicle, person |
| Concept classification | subclass-of( car, vehicle ) |
| Instance extraction | Peter, his-car |
| Instance classification | instance-of( Peter, person ) |
| Relation extraction | drive( person, car ) |
| Relation instance extraction | drive( Peter, his-car ) |
![Page 83: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/83.jpg)
84
instance-of( Hewlett Packard, organization )
subclass-of( research, activity )
![Page 84: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/84.jpg)
85
reach( information, people )
address_in( issue, article )
subclass-of( resource, knowledge )
![Page 85: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/85.jpg)
86
Ontology Learning – Problems: Text Understanding
• Words are ambiguous
– ‘A bank is a financial institution. A bank is a piece of furniture.’ subclass-of( bank, financial institution ) ?
• Natural language is informal
– ‘The sea is water.’ subclass-of( sea, water ) ?
• Sentences may be underspecified
– ‘Mary started the book.’ read( Mary, book_1 ) ?
• Anaphora
– ‘Peter lives in Munich. This is a city in Bavaria.’ instance-of( Munich, city ) ?
• Metaphors, …
![Page 86: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/86.jpg)
87
Ontology Learning – Problems: Knowledge Modeling
• What is an instance / concept?
– ‘The koala is an animal living in Australia.’ instance-of( koala, animal ) subclass-of( koala, animal ) ?
• How to deal with opinions and quoted speech?
– ‘Tom thinks that Peter loves Mary.’ love( Peter, Mary ) ?
• Knowledge is changing
– instance-of( Pluto, planet ) ?
Conclusion:
• Ontology learning is difficult.
• What we can learn is fuzzy and uncertain.
• Ontology maintenance is important.
![Page 87: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/87.jpg)
88
Text2Onto
• Support for (semi-)automatic ontology extraction from natural language text
• Support for ontology maintenance and data-driven ontology evolution by incremental ontology learning
• Model of Possible Ontologies (POM)
– Confidence / relevance values attached to all concepts, instances and relations
• Enhanced user interaction
• Maintenance of multiple modeling alternatives in parallel
• Independence of any particular ontology language
![Page 88: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/88.jpg)
89
subclass-of( user, human ) / confidence 1.0
subclass-of( document, communication ) / confidence 0.75
![Page 89: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/89.jpg)
90
Text2Onto – Evidence, Reference and Change Management
• Explicit modeling of evidence
– Algorithms provide different types of evidence
– Explanation component
• References for annotation and change detection
• Explicit modeling of changes
– Corpus, evidence, reference and ontology changes
– Future work: ontology change strategies
![Page 90: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/90.jpg)
91
Text2Onto – Workflow
Workflow composition
• Complex algorithms
– Different types of algorithms for each ontology learning task
– Flexible combination of results
• Combination strategies
– minimum, maximum, average, linear, classifier, …
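To make the combination strategies concrete, here is a minimal Python sketch that merges the confidence values which several algorithms assign to the same candidate relation. It is not Text2Onto's API: the `combine` function, the strategy names as dictionary keys, and the example weights are all hypothetical, and the classifier-based strategy is left out.

```python
from statistics import mean

# Simple strategies that map a list of confidences to one value.
STRATEGIES = {
    "minimum": min,
    "maximum": max,
    "average": mean,
}


def combine(confidences, strategy="average", weights=None):
    """Combine per-algorithm confidences in [0, 1] into a single value."""
    if strategy == "linear":
        # Weighted sum; equal weights are assumed when none are given.
        if weights is None:
            weights = [1.0 / len(confidences)] * len(confidences)
        return sum(w * c for w, c in zip(weights, confidences))
    return STRATEGIES[strategy](confidences)


# Hypothetical confidences from three algorithms for one subclass-of candidate.
votes = [0.5, 0.75, 1.0]
for name in ("minimum", "maximum", "average"):
    print(name, combine(votes, name))
print("linear", combine(votes, "linear", weights=[0.2, 0.3, 0.5]))
```

Choosing "minimum" is the most cautious option, since a relation only keeps a high confidence if every algorithm agrees on it.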
![Page 91: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/91.jpg)
92
[Figure: Text2Onto architecture diagram. Components shown: Corpus, GATE, Algorithm Controller, WorkflowManager, API, POM, POM Visualization, Evidence Store, Reference Store, Ontology; writers: OWLWriter, RDFSWriter, F-LogicWriter.]
![Page 92: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/92.jpg)
93
Linguistic Preprocessing – GATE
• Standard ANNIE components for
– Tokenization
– Sentence splitting
– POS tagging
– Stemming / lemmatizing
• Self-defined JAPE patterns and processing resources for
– Stop word detection
– Shallow parsing
• GATE applications for English, German and Spanish
![Page 93: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/93.jpg)
94
Ontology Learning Approaches – Concept Classification
• Heuristics
– ‘image processing software’ subclass-of( image processing software, software )
• Patterns
– ‘animals such as dogs’
– ‘dogs and other animals’
– ‘a dog is an animal’ subclass-of( dog, animal )
![Page 94: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/94.jpg)
95
JAPE Patterns for Ontology Learning
Rule: Hearst_1
(
  (NounPhrase):superconcept
  {SpaceToken.kind == space}
  {Token.string == "such"}
  {SpaceToken.kind == space}
  {Token.string == "as"}
  {SpaceToken.kind == space}
  (NounPhrasesAlternatives):subconcept
):hearst1
-->
  :hearst1.SubclassOfRelation = { rule = "Hearst1" },
  :subconcept.Domain = { rule = "Hearst1" },
  :superconcept.Range = { rule = "Hearst1" }
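For readers without a GATE installation, the idea behind this rule can be approximated on plain strings. The sketch below is a simplified stand-in, not the JAPE rule itself: it assumes a noun phrase is a single lowercase word, does no lemmatization (so it yields "dogs" rather than "dog"), and the names `HEARST_1` and `hearst_subclass_relations` are hypothetical.

```python
import re

# "<superconcept> such as <subconcept>", with one word standing in for each NP.
HEARST_1 = re.compile(
    r"\b(?P<superconcept>[a-z]+) such as (?P<subconcept>[a-z]+)\b"
)


def hearst_subclass_relations(text):
    """Return (subconcept, superconcept) pairs found by the 'such as' pattern."""
    return [
        (match.group("subconcept"), match.group("superconcept"))
        for match in HEARST_1.finditer(text.lower())
    ]


for sub, sup in hearst_subclass_relations("Animals such as dogs need exercise."):
    print(f"subclass-of( {sub}, {sup} )")
```

The JAPE version works on NounPhrase annotations produced by earlier processing steps, which is what lets it handle multi-word phrases and alternatives that this regular expression ignores.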
![Page 95: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/95.jpg)
96
Ontology Learning Approaches – Instance Classification
• Context similarity
‘Columbus is the capital of the state of Ohio. Columbus has a population of about 700.000 inhabitants.’
• Columbus ( capital (1), state (1), Ohio (1), population (1), inhabitant (1) )
• city ( country (2), state (1), inhabitant (2), mayor (1), attraction (1) )
• explorer ( ship (1), sailor (2), discovery (1) )
instance-of( Columbus, city )
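The Columbus example can be reproduced with a few lines of Python. The slide only says "context similarity"; using cosine similarity over the word-count vectors is an assumption made here for illustration, and the `cosine` helper is hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Context vectors copied from the slide.
columbus = {"capital": 1, "state": 1, "Ohio": 1, "population": 1, "inhabitant": 1}
concepts = {
    "city": {"country": 2, "state": 1, "inhabitant": 2, "mayor": 1, "attraction": 1},
    "explorer": {"ship": 1, "sailor": 2, "discovery": 1},
}

# Assign the instance to the concept with the most similar context.
best = max(concepts, key=lambda name: cosine(columbus, concepts[name]))
print(f"instance-of( Columbus, {best} )")
```

The instance shares "state" and "inhabitant" with the city context and nothing with the explorer context, so city wins under any reasonable similarity measure.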
![Page 96: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/96.jpg)
97
Ontology Learning Approaches – Relation Extraction
• Subcategorization frames
– ‘Tina drives a Ford.’
  • instance-of( Tina, person )
  • instance-of( Ford, vehicle )
– ‘Her father drives a bus.’
  • subclass-of( father, person )
  • subclass-of( bus, vehicle )
subcat: drive( subj: person, obj: vehicle )
drive( person, vehicle )
![Page 97: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/97.jpg)
98
incluyen( ontologías, definiciones ) / confidence 1.0
![Page 98: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/98.jpg)
99
Other Ontology Learning Approaches
• WordNet
– Hyponym( ‘bank’, ‘institution’ ) subclass-of( bank, institution ) ?
• Google
– ‘cities such as London’, ‘persons such as London’ …
– ‘such as London’ instance-of( London, city ) ?
• Instance clustering
– Hierarchical clustering of context vectors
• Formal Concept Analysis (FCA)
– breathe( animal )
– breathe( human ), speak( human ) subclass-of( human, animal ) ?
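The FCA example can be sketched with plain attribute sets. This is a deliberate simplification: full Formal Concept Analysis builds a concept lattice, whereas the code below only checks attribute-set inclusion, which is enough to reproduce the slide's example. The function name `subclass_candidates` is hypothetical.

```python
def subclass_candidates(context):
    """Suggest (sub, super) pairs where sub has all of super's attributes and more.

    `context` maps each term to the set of verbs it was observed with.
    """
    pairs = []
    for sub, sub_attrs in context.items():
        for sup, sup_attrs in context.items():
            # Proper subset: the more specific term carries extra attributes.
            if sub != sup and sup_attrs < sub_attrs:
                pairs.append((sub, sup))
    return pairs


# Observations from the slide: breathe( animal ), breathe( human ), speak( human ).
observed = {"animal": {"breathe"}, "human": {"breathe", "speak"}}
for sub, sup in subclass_candidates(observed):
    print(f"subclass-of( {sub}, {sup} ) ?")
```

The question mark is kept in the output because, as on the slide, the result is a candidate that still needs confirmation.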
![Page 99: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/99.jpg)
100
Summary
• Ontology Learning is difficult, because
– Language is fuzzy
– Knowledge is changing
• Text2Onto targets these problems
– Model of Possible Ontologies
– Heterogeneous sources of evidence
– Incremental ontology learning
![Page 100: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/100.jpg)
Thanks!
http://www.aifb.de/WBS/jvo/ontology-learning
http://www.ontoware.org/projects/text2onto
![Page 101: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/101.jpg)
102
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
![Page 102: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/102.jpg)
Focused Ontology Learning with GATE
Marta Sabou
A Practical Report on Learning Web Service Ontologies
![Page 103: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/103.jpg)
Goal of the Talk
The goal of this talk is:
• To describe a Semantic Web relevant task: Focused Ontology Learning.
• To exemplify this task in the context of Web Services.
• To show how focused ontology learning can be implemented in GATE.
The focus of the talk is NOT ontology learning but the elements of GATE that helped to perform this task.
![Page 104: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/104.jpg)
Outline
1) Generic Problem:
* Focused Ontology Learning (definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies (Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns
* evaluating term extraction performance
![Page 105: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/105.jpg)
106
Ontology Learning in Restricted Domains
Focused Ontology Learning:
• is Ontology Learning in a restricted domain, for a well-defined task
• therefore, simpler than Ontology Learning in general
• more and more frequent with the growth of the Semantic Web
Previous talk’s conclusion: Generic Ontology Learning is important but difficult because:
• Language is fuzzy
• Knowledge is changing
However... The Semantic Web is increasingly used in specialized domains, where:
• Language exhibits (strong) domain characteristics
– e.g., mathematics, medicine
• The knowledge to be extracted is defined by the task for which the ontology will be used
– e.g., searching patient records, accessing drug related articles
![Page 106: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/106.jpg)
107
Focused Ontology Learning
Focused Ontology Learning characteristics:
1. (Small) corpus with special (domain/context) characteristics;
2. Well defined ontological knowledge to be extracted;
3. An easy to detect correspondence between text characteristics and ontology elements;
4. Usually an easy solution (adaptation of OL techniques);
5. Implemented/adapted by a non NLP-expert.
What is needed to support domain experts?
• libraries of basic NLP tools/data structures;
• tools to easily adapt/combine these NLP elements;
• intuitive way to create and debug own applications;
• usability plays an important role;
• generic methodologies of ontology learning rather than hard-coded algorithms.
![Page 107: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/107.jpg)
Outline
1) Generic Problem:
* Focused Ontology Learning (definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies (Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns (given)
* evaluating term extraction performance (given)
![Page 108: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/108.jpg)
109
Context - Semantic Web Services
* Semantic WS - semantically annotated WS
* to automate discovery, composition, execution
< rdf:ID="WS1">
<owls:hasInput rdf:resource=" "/>
<owls:hasInput rdf:resource=" "/>
<owls:hasOutput rdf:resource=" "/>
</ >
do:HotelBooking
do:HotelReservation
do:HotelBooking
do:Hotel
do:ReservationDates
=> broad domain coverage
But… increasing number of web services
![Page 109: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/109.jpg)
110
A real life story…
• Semantic Grid middleware to support in silico experiments in biology
• Bioinformatics programs are exposed as semantic web services
[Diagram labels: 150 (Services); Domain Expert; 4 months!!; 550 Concepts, but only 125 (23%) used for SWS tasks; 600 (Services)]
Our GOAL: Support Expert to learn:
1) From more services
2) In less time
3) A “Better” ontology (for SWS descriptions)
![Page 110: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/110.jpg)
111
FOL Characteristics - 1
1. (Small) corpus with special (domain/context) characteristics
* Data Source:
  * short descriptions of service functionalities
* characteristics:
  * small corpora (100/200 documents)
  * employ specific style (sublanguage)
• Replace or delete sequence sections.
• Find antigenic sites in proteins.
• Cai codon usage statistic.
![Page 111: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/111.jpg)
112
FOL Characteristics - 2
2. Well defined ontology structure to be extracted
• Web Service Ontologies contain:
– A Data Structure hierarchy
– A Functionality hierarchy
![Page 112: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/112.jpg)
113
FOL Characteristics - 3
3. An easy to detect correspondence between text characteristics and ontology elements
Replace or delete sequence sections.
NP VB_NP
![Page 113: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/113.jpg)
114
FOL Characteristics - 4
4. Usually an easy solution (adaptation of OL techniques). E.g. POS Tagging
| Generic Solution | Implementation |
|---|---|
| Linguistic Analysis | English Tokenizer, Sentence Splitter, POS Tagger |
| Extraction Patterns | JAPE Rules |
| Ontology Building, Ontology Pruning | OntologyBuilding&Pruning |
Linguistic Analysis:
|Replace| |or| |delete| |sequence| ….
Replace or delete sequence sections. (VB) (Prep) (VB) (NN) (NNS)
JAPE Rules:
Replace or delete sequence sections. (VB) (Prep) (VB) (NN) (NNS)
r1 => (NP)
r2 => (Funct)
![Page 114: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/114.jpg)
115
FOL Characteristics - 4
4. Usually an easy solution (adaptation of OL techniques). E.g. Dependency Parsing
| Generic Solution | Implementation |
|---|---|
| Linguistic Analysis | Minipar |
| Extraction Patterns | JAPE Rules |
| Ontology Building, Ontology Pruning | OntologyBuilding&Pruning |
Minipar:
word : 1 : replace : replace : V : * : i :
word : 2 : or : or : U : 1 : lex-mod :
word : 3 : delete : delete : V : 1 : lex-dep :
word : 4 : sequence : sequence : N : 5 : nn :
word : 5 : sections : section : N : 1 : obj :
JAPE Rules:
Replace or delete sequence sections. (VB) (Prep) (VB) (NN) (NNS)
r1 => (NP)
r2 => (Funct)
r2 => (Funct)
![Page 115: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/115.jpg)
Outline
1) Generic Problem:
* Focused Ontology Learning (definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies (Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns
* evaluating term extraction performance
![Page 116: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/116.jpg)
117
GATE Implementation
* Easy to follow extraction (step by step)
* Easy to adapt for domain engineers
![Page 117: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/117.jpg)
118
Pattern based rules – Example
( (DET)*:det ( (ADJ)|(NOUN) )*:mods (NOUN):hn ):np --> :np.NP = {}
A noun phrase consists of:
• zero or more determiners;
• zero or more modifiers which can be adjectives or nouns;
• one noun which is the head-noun.
Macro: ADJ
(
  {Token.category == JJ, Token.kind == word} |
  {Token.category == JJR, Token.kind == word} |
  {Token.category == JJS, Token.kind == word}
)
The ADJ macro identifies any Token tagged as JJ, JJR or JJS.
DET, ADJ and NOUN are macros, which make rules more readable.
Extract NP(data) from NP(aaindex).
Displays NP(a non-overlapping wordmatch dotplot) of two NP(sequences)
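The noun-phrase pattern can also be tried outside GATE on a list of POS-tagged tokens. The sketch below mirrors the rule DET* (ADJ|NOUN)* NOUN in plain Python; it is not how JAPE executes the rule. The ADJ tag set follows the macro on this slide, while the DET and NOUN tag sets (Penn Treebank style) and the function name `noun_phrases` are assumptions.

```python
DET = {"DT"}                          # assumed determiner tag
ADJ = {"JJ", "JJR", "JJS"}            # as in the ADJ macro
NOUN = {"NN", "NNS", "NNP", "NNPS"}   # assumed noun tags


def noun_phrases(tagged):
    """Return NP strings from a list of (word, POS) pairs."""
    phrases, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        while j < n and tagged[j][1] in DET:       # zero or more determiners
            j += 1
        k, last_noun = j, -1
        while k < n and tagged[k][1] in ADJ | NOUN:  # modifiers and head noun
            if tagged[k][1] in NOUN:
                last_noun = k                       # candidate head noun
            k += 1
        if last_noun >= 0:
            # The phrase ends at the last noun, which acts as the head.
            phrases.append(" ".join(word for word, _ in tagged[i:last_noun + 1]))
            i = last_noun + 1
        else:
            i += 1
    return phrases


sentence = [("Extract", "VB"), ("data", "NN"), ("from", "IN"), ("aaindex", "NN")]
print(noun_phrases(sentence))
```

Running it on the first example sentence of the slide marks the same two noun phrases, "data" and "aaindex".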
![Page 118: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/118.jpg)
Outline
1) Generic Problem:
* Focused Ontology Learning (definition and characteristics)
2) Specific Problem:
* Learning Web Service Ontologies (Context, Problem Scenario)
3) GATE support for:
* writing extraction patterns (given)
* evaluating term extraction performance (given)
![Page 119: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/119.jpg)
120
Performance Evaluation
Pipeline steps: Linguistic Analysis, Extraction Patterns, Ontology Building, Ontology Pruning
• A set of important terms is extracted. Terms are indicated by annotations of type NP or Funct.
• The correctness of these terms has a direct influence on the correctness of the Ontology Building (OB) step, so evaluating them is important.
• The Corpus Benchmark Tool of GATE compares annotation types in 2 corpora, usually:
– the manually annotated Gold Standard corpus and
– the automatically annotated corpus.
• It identifies correct, missed and spurious annotations of a certain type and computes Precision and Recall for each document and for the whole corpus.
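The comparison described here boils down to set operations on the two annotation sets. The following Python sketch is not the Corpus Benchmark Tool's implementation: it treats each annotation as a plain string, ignores offsets and partially correct matches, and the function name `compare_annotations` is hypothetical. The sample data are the Funct annotations of the 105_profit.xml example used in this talk.

```python
def compare_annotations(keys, responses):
    """Classify annotations of one type and compute Precision and Recall."""
    keys, responses = set(keys), set(responses)
    correct = keys & responses     # in both gold standard and system output
    missed = keys - responses      # in the gold standard only
    spurious = responses - keys    # in the system output only
    precision = len(correct) / len(responses) if responses else float("nan")
    recall = len(correct) / len(keys) if keys else float("nan")
    return {
        "correct": correct,
        "missed": missed,
        "spurious": spurious,
        "precision": precision,
        "recall": recall,
    }


gold = ["scan_sequence", "scan_database"]
automatic = ["scan_sequence", "scan_database", "scan_profile"]
result = compare_annotations(gold, automatic)
print(result["spurious"], result["precision"], result["recall"])
```

Returning the three sets, and not only the scores, gives direct access to the missed and spurious items that matter when fine-tuning extraction patterns.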
![Page 120: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/120.jpg)
121
Performance Evaluation
Example 1:
Scan a sequence or database with a matrix or profile.
Gold Standard Annotations: Funct(scan_sequence), Funct(scan_database)
Automatic Annotations: Funct(scan_sequence), Funct(scan_database), Funct(scan_profile)
105_profit.xml; Keys : 2, Resp : 3
| Annotation Type | Precision | Recall |
|---|---|---|
| Funct | 0.666666 | 1.0 |
Correct = correctly identified annotations (true positives)
Spurious = incorrect annotations (false positives)
![Page 121: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/121.jpg)
122
Performance Evaluation
Example 2:
Preprocess the prints database for use with the program pscan.
Gold Standard Annotations: Funct(preprocess_prints database)
Automatic Annotations: (none)
104_printsextract.xml; Keys : 1, Resp : 0
| Annotation Type | Precision | Recall |
|---|---|---|
| Funct | NaN | 0.0 |
Missed = unidentified annotations (false negatives)
![Page 122: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/122.jpg)
123
Performance Evaluation
Statistics
| Annotation Type | Correct | Partially Correct | Missing | Spurious | Precision | Recall | F-Measure |
|---|---|---|---|---|---|---|---|
| Funct | 70 | 0 | 78 | 3 | 0.958904 | 0.47297 | 0.63348416 |
[Diagram labels: GoldStandard_Terms, Extracted_Terms, correct, missed, spurious]
Precision = correct / (All_Extr)
Recall = correct / (All_GS)
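The figures in the statistics row can be checked from the raw counts. The sketch below takes All_Extr as correct + spurious and All_GS as correct + missing, and assumes the reported F-Measure is the balanced harmonic mean of Precision and Recall, which matches the numbers on the slide; the function name `precision_recall_f` is hypothetical.

```python
import math


def precision_recall_f(correct, missing, spurious):
    """Compute (Precision, Recall, F-measure) from annotation counts."""
    all_extracted = correct + spurious   # All_Extr
    all_gold = correct + missing         # All_GS
    precision = correct / all_extracted if all_extracted else math.nan
    recall = correct / all_gold if all_gold else math.nan
    if math.isnan(precision) or math.isnan(recall) or precision + recall == 0:
        f_measure = math.nan
    else:
        # Assumption: balanced F-measure (harmonic mean of P and R).
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Counts for the Funct annotation type from the statistics row.
p, r, f = precision_recall_f(correct=70, missing=78, spurious=3)
print(f"P={p:.6f} R={r:.5f} F={f:.8f}")
```

With no extracted annotations at all, Precision is undefined, which is why the second example in this talk reports NaN for it.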
![Page 123: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/123.jpg)
124
Performance Evaluation
PROS:
• It is very important when developing term extraction.
• It allows evaluating:
– 1) the performance of the linguistic analyses
– 2) the coverage of the patterns
• Allows comparing the performance of different tools:
– E.g. two different POS taggers
• Easy to use (both from GUI and command line)
Possible improvement:
* The current textual output does not give direct access to all spurious or all missing annotations (these are important when fine-tuning the extraction).
* We try to improve this usability issue through visualisation.
![Page 124: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/124.jpg)
125
Summary
• Focused Ontology Learning = OL in a restricted domain.
• GATE supports the development of FOL in many ways:
– allows easy reuse and combination of basic NLP modules;
– offers software libraries for fundamental NLP data structures (Documents, Corpora, Annotations);
– incorporates evaluation mechanisms;
– easy to debug and use for non-NLP experts.
• Example FOL = OL for Web Services.
![Page 125: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/125.jpg)
126
Structure of the Tutorial
1. Motivation, background
2. GATE overview
3. Information Extraction
4. GATE’s HLT components
5. IE and the Semantic Web
6. Ontology learning with Text2Onto
7. Focused ontology learning
8. Massive Semantic Annotation
![Page 126: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/126.jpg)
KIM Platform: An Overview
Atanas Kiryakov
Ontotext Lab, Sirma AI
http://www.ontotext.com/kim/
![Page 127: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/127.jpg)
128
Semantic Annotation: An example
XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …
[Diagram: Ontology & KB. Classes: Company, City, Country, Location. Instances: XYZ, London, UK, Bulgaria. Relations and attributes: type, HQ, partOf, establOn (“03/11/1978”).]
![Page 128: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/128.jpg)
129
Semantic Annotation of NEs
A Semantic Annotation of the named entities (NEs) in a text includes:
- recognition of the type of the entities in the text,
  - out of a rich taxonomy of classes (not a flat set of 10 types);
- identification of the entities, which is also a reference to their semantic description.
The traditional (IE-style) NE recognition approach results in:
<Person>Lama Ole Nydahl</Person>
The Semantic Annotation of NEs results in:
<ReligiousPerson ID="http://..kim/Person111111">Lama Ole Nydahl</ReligiousPerson>
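The difference between the two outputs can be sketched as a data structure: a semantic annotation carries a class from a taxonomy plus an instance identifier, so a query for the general class still finds the more specific one. This is a minimal illustration, not KIM's data model; the `SUBCLASS_OF` fragment (including the `Entity` root), the `SemanticAnnotation` type, and the `is_a` helper are all hypothetical, and the identifier is copied as elided on the slide.

```python
from dataclasses import dataclass

# Hypothetical fragment of a class taxonomy (child -> parent).
SUBCLASS_OF = {
    "ReligiousPerson": "Person",
    "Person": "Entity",
}


@dataclass(frozen=True)
class SemanticAnnotation:
    start: int         # character offset where the mention begins
    end: int           # character offset just past the mention
    cls: str           # most specific class recognised for the mention
    instance_id: str   # reference to the entity's semantic description


def is_a(cls, ancestor, taxonomy=SUBCLASS_OF):
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = taxonomy.get(cls)
    return False


text = "Lama Ole Nydahl"
ann = SemanticAnnotation(0, len(text), "ReligiousPerson",
                         "http://..kim/Person111111")

if __name__ == "__main__":
    # A flat NE tagger would only say "Person"; here that is derivable.
    print(text[ann.start:ann.end], is_a(ann.cls, "Person"))
```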
![Page 129: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/129.jpg)
130
Platforms for Large-Scale Semantic Annotation
• Allow use of corpus-wide statistics to improve metadata quality, e.g., disambiguation
• Automated alias discovery
• Generate SemWeb output (RDF, OWL)
• Stand-off storage and indexing of metadata
• Use large instance bases to disambiguate to
• Ontology servers for reasoning and access
• Architecture elements:
– Crawler, onto storage, doc indexing, query, annotators
– Apps: sem browsers, authoring tools, etc.
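As a rough illustration of stand-off storage and indexing, the sketch below keeps the document text untouched and stores annotations separately as character offsets, indexed by entity identifier. It is an assumption-laden toy, not the design of any particular platform: the `kb:` identifiers, the `DOCS` store, and the `annotate` and `mentions` helpers are hypothetical.

```python
from collections import defaultdict

# Document text is stored once and never modified by annotation.
DOCS = {"doc1": "XYZ was established on 03 November 1978 in London."}

# Stand-off index: entity id -> list of (doc id, start offset, end offset).
index = defaultdict(list)


def annotate(doc_id, mention, entity_id):
    """Record the first occurrence of a mention as a stand-off annotation."""
    start = DOCS[doc_id].find(mention)
    if start < 0:
        raise ValueError(f"{mention!r} not found in {doc_id}")
    index[entity_id].append((doc_id, start, start + len(mention)))


def mentions(entity_id):
    """Recover the annotated surface strings for an entity from the offsets."""
    return [DOCS[d][s:e] for d, s, e in index[entity_id]]


annotate("doc1", "XYZ", "kb:XYZ")
annotate("doc1", "London", "kb:London")

if __name__ == "__main__":
    print(index["kb:XYZ"], mentions("kb:London"))
```

Because the annotations live outside the text, the same document can carry overlapping or competing annotations without any change to its content.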
![Page 130: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/130.jpg)
131
The KIM Platform
• A platform offering services and infrastructure for:
– (semi-)automatic semantic annotation;
– ontology population;
– semantic indexing and retrieval of content;
– query and navigation over the formal knowledge.
• Based on Information Extraction technology
• Aim: to arm Semantic Web applications
– by providing a metadata generation technology
– in a standard, consistent, and scalable framework
![Page 131: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/131.jpg)
132
KIM Architecture
[Figure: KIM architecture diagram. Labels: Semantic Repository API, Semantic Annotation API, Query API, Index API, Document Persistence API, KIM Web UI, Annotation Server, News Collector, Any Web Browser, Browser Plug-in, Custom Applications, Custom Back-end, Custom IE, Entity Ranking, KIM Server, RMI.]
![Page 132: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/132.jpg)
133
PROTON Ontology
- a light-weight upper-level ontology;
- 250 NE classes;
- 100 relations and attributes;
- 200,000 entity descriptions;
- covers mostly NE classes, and ignores general concepts;
- includes classes representing lexical resources.
proton.semanticweb.org
![Page 133: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/133.jpg)
134
KIM Scaling on Data
• The Semantic Repository is based on Sesame.
• Our practical tests demonstrate good performance on top of:
– 1.2M entity descriptions;
– about 15M explicit statements;
– above 30M statements after forward chaining.
• Document and annotation storage and indexing with Lucene:
– 0.5M docs, processed on a $1000 machine;
– retrieval in milliseconds.
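The growth from explicit statements to a larger set "after forward chaining" comes from materialising inferred statements ahead of query time. The sketch below shows the idea with three simplified rules (transitive `partOf`, transitive `subClassOf`, and type inheritance); it is a naive fixpoint loop under assumed rules and made-up facts, not the inference actually configured in the Semantic Repository.

```python
def forward_chain(explicit):
    """Apply three simplified rules until no new statement is produced."""
    closed = set(explicit)
    while True:
        new = set()
        for a, p, b in closed:
            for c, q, d in closed:
                if b != c:
                    continue
                if p == q == "partOf":
                    new.add((a, "partOf", d))        # partOf is transitive
                elif p == q == "subClassOf":
                    new.add((a, "subClassOf", d))    # subClassOf is transitive
                elif p == "type" and q == "subClassOf":
                    new.add((a, "type", d))          # instances inherit classes
        if new <= closed:
            return closed
        closed |= new


# Hypothetical explicit statements, chosen only for illustration.
EXPLICIT = {
    ("London", "partOf", "UK"),
    ("UK", "partOf", "Europe"),
    ("Vodafone", "type", "MobileOperator"),
    ("MobileOperator", "subClassOf", "TelecomCompany"),
}

if __name__ == "__main__":
    inferred = forward_chain(EXPLICIT) - EXPLICIT
    print(sorted(inferred))
```

Materialising the closure trades storage for query speed: a query for entities in Europe becomes a direct lookup rather than a chain of joins.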
![Page 134: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/134.jpg)
135
Simple Usage: Highlight, Hyperlink, and …
![Page 135: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/135.jpg)
136
Simple Usage: … Explore and Navigate
![Page 136: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/136.jpg)
137
How KIM Searches Better
KIM can match a query:
Documents about a telecom company in Europe, John Smith, and a date in the first half of 2002.
With a document containing:
“At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO”
Classical IR could not match:
- Vodafone with a “telecom company in Europe”, since that requires knowing that:
  - Vodafone is a mobile operator, which is a sort of telecom;
  - Vodafone is in the UK, which is part of Europe;
- the 10th of May with a “date in the first half of 2002”;
- “John G. Smith” with “John Smith”.
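The three matches above can be sketched as three small checks: class subsumption plus location containment, alias resolution, and a date-range test. This is an illustrative toy under stated assumptions, not KIM's query engine: the knowledge-base contents, the `reaches`, `resolve`, and `matches` helpers are hypothetical, and the document's year (2002) is assumed because the quoted sentence does not give one.

```python
from datetime import date

# Hypothetical background knowledge (child -> parent).
SUBCLASS_OF = {"MobileOperator": "TelecomCompany", "TelecomCompany": "Company"}
PART_OF = {"UK": "Europe"}
ENTITIES = {
    "Vodafone": {"cls": "MobileOperator", "located_in": "UK",
                 "aliases": {"Vodafone"}},
    "John Smith": {"cls": "Person", "located_in": None,
                   "aliases": {"John Smith", "John G. Smith"}},
}


def reaches(start, target, parent_of):
    """Follow parent links from start; True if target is reached."""
    while start is not None:
        if start == target:
            return True
        start = parent_of.get(start)
    return False


def resolve(mention):
    """Map a surface mention to a known entity via its aliases."""
    for name, entity in ENTITIES.items():
        if mention in entity["aliases"]:
            return name
    return None


def matches(doc_mentions, doc_date):
    """Query: a telecom company in Europe, John Smith, a date in H1 2002."""
    found = {resolve(m) for m in doc_mentions} - {None}
    telecom_in_europe = any(
        reaches(ENTITIES[e]["cls"], "TelecomCompany", SUBCLASS_OF)
        and reaches(ENTITIES[e]["located_in"], "Europe", PART_OF)
        for e in found
    )
    in_first_half = date(2002, 1, 1) <= doc_date <= date(2002, 6, 30)
    return telecom_in_europe and "John Smith" in found and in_first_half


if __name__ == "__main__":
    # The year 2002 is an assumption; the quoted document only says "10th of May".
    print(matches(["Vodafone", "John G. Smith"], date(2002, 5, 10)))
```

A keyword index sees none of these links, because the query terms "telecom", "Europe", and "2002" never occur in the document text.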
![Page 137: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/137.jpg)
138
Entity Pattern Search
![Page 138: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/138.jpg)
139
Pattern Search: Entity Results
![Page 139: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/139.jpg)
140
Entity Pattern Search: KIM Explorer
![Page 140: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/140.jpg)
141
Pattern Search, Referring Documents
![Page 141: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/141.jpg)
142
Document Details
![Page 142: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/142.jpg)
143
Summary
KIM is a platform for:
- semantic annotation and ontology population,
- semantic indexing and retrieval,
- providing an API for remote access and integration,
- based on Information Extraction (IE) using GATE.
KIM is:
- Robust
- Scalable
- General-purpose, off-the-shelf platform!
![Page 143: Human Language Technology (HLT) and Knowledge Acquisition for the Semantic Web: a Tutorial](https://reader036.vdocuments.net/reader036/viewer/2022062422/56813e24550346895da80454/html5/thumbnails/143.jpg)
144
THANK YOU! (for not snoring)
The slides: http://www.gate.ac.uk/sale/talks/ekaw2006/ekaw2006-tutorial.ppt
[This work has been supported by SEKT (http://sekt.semanticweb.org/)
and KnowledgeWeb (http://knowledgeweb.semanticweb.org/)]