Post on 19-Dec-2015
1
Concepts, Ontologies, and Project TANGO
Deryle Lonsdale, BYU Linguistics and English Language
2
Outline
NSF projects
Semantic Web
Concepts
Project TIDIE
Ontologies
Project TANGO
Tables
Ontology generation
3
Acknowledgements
NSF
David Embley (BYU CS), Steve Liddle (BYU Marriott School) and Yuri Tijerino
BYU Data Extraction Group members
4
The National Science Foundation
Federal agency, $5.5 billion budget, funds 20% of all federally supported basic research conducted by America’s colleges and universities
7 directorates: Biological Sciences; Computer and Information Science and Engineering; Engineering; Geosciences; Mathematics and Physical Sciences; Social, Behavioral and Economic Sciences; Education and Human Resources
200,000 scientists, engineers, educators and students at universities, laboratories and field sites
10,000 awards/year, 3 years duration (avg.)
5
The NSF Nifty 50 (general)
ACCELERATING, EXPANDING UNIVERSE
ANTARCTIC OZONE HOLE RESEARCH
ARABIDOPSIS: A PLANT GENOME PROJECT
BAR CODES
BLACK HOLES CONFIRMED
BUCKY BALLS
COMPUTER VISUALIZATION TECHNIQUES
DATA COMPRESSION TECHNOLOGY
DISCOVERY OF PLANETS
DOPPLER RADAR
EFFECTS OF ACID RAIN
EL NIÑO AND LA NIÑA PREDICTIONS
FIBER OPTICS
GEMINI TELESCOPES
HANTAVIRUS IDENTIFICATION
DNA FINGERPRINTING
MRI: MAGNETIC RESONANCE IMAGING
NANOTECHNOLOGY
THE NATIONAL OBSERVATORIES
OVERCOMING HEAVY METALS
OVERCOMING SALT TOXICITY
TISSUE ENGINEERING
TUMOR DETECTION
VOLCANIC ERUPTION DETECTION
YELLOW BARRELS
6
Language-related Nifty 50
AMERICAN SIGN LANGUAGE DICTIONARY DEVELOPMENT
COMPUTER VISUALIZATION TECHNIQUES
THE DARCI CARD
DATA COMPRESSION TECHNOLOGY
THE "EYE CHIP" OR RETINA CHIP
THE INTERNET
PERSONS WITH DISABILITIES ACCESS TO THE WEB
PROJECT LISTEN
SPEECH RECOGNITION TECHNOLOGY
vBNS: VERY HIGH SPEED BACKBONE NETWORK SYSTEM
WEB BROWSERS
7
Browsing the Semantic Web
[Figure: annotated screenshot with callouts labeled "The search query", "Annotation", "Synonym" and "Hypernym"]
8
Browsing the Semantic Web
Ranking based on content data and structure
Grouping results by their conceptual relationships
Using lexical semantics for similarity search
[Figure: search results grouped under the labels "movie", "astronomy" and "sports"]
9
Desirable, not (yet) possible
Word sense disambiguation
Other types of queries (e.g. services)
What is the cheapest available round-trip flight to Cancun the day after finals this semester?
Set up an appointment with my optometrist for next week.
List available 4-person BYU-approved apartments in Orem for under $150/month.
Find me a linguistics job in Tahiti.
10
Project TIDIE
Apr 10, 2001 – May 12, 2005
11
Overview of TIDIE
3-year NSF project at BYU
Total amount about $430,000
PI David Embley (BYU CS), 4 co-PIs from BYU
18 grad students, 45 publications
Demos, tools, papers, presentations at website (www.deg.byu.edu/)
12
Goal of TIDIE
Target-Based Independent-of-Document Information Extraction
Target-based: user specifies what to find
  Not just keyword search, but concept-based search using an ontology
Document independent
  Should work even if pages change over time, on new documents
IE: match, merge, retrieve, format information
Present in a way that the user can search and query results
13
Document-based IE
14
Recognition and extraction
Car | Year | Make | Model | Mileage | Price | PhoneNr
0001 | 1989 | Subaru | SW | | $1900 | (336)835-8597
0002 | 1998 | | Elantra | | | (336)526-5444
0003 | 1994 | HONDA | ACCORD EX | 100K | | (336)526-1081

Car | Feature
0001 | Auto
0001 | AC
0002 | Black
0002 | 4 door
0002 | tinted windows
0002 | Auto
0002 | pb
0002 | ps
0002 | cruise
0002 | am/fm
0002 | cassette stereo
0002 | a/c
0003 | Auto
0003 | jade green
0003 | gold
15
Concepts
What drives the matching process for IE
Inherent in words, numbers, phrases, text
Linguistics: lexical semantics
Denotations: entities, attributes
Location: relationships
Occurrences: constraints
16
Concept matching
We use exhaustive concept matching techniques to find concepts in documents, including:
Lexical information (lexicons)
Natural language processing (NLP) techniques
Similarity of values
Features of value
Data frames
Constraints
17
Lexicons
Repositories of enumerable classes of lexical information
FirstNames, LastNames, USStates, ProvoOremApts, CarMakes, Drugs, CampGroundFeats, etc.
WordNet (synonyms, word senses, hypernyms/hyponyms)
18
The data-frame library
Snippets of real-world knowledge about data (type, length, nearby keywords, patterns [as in regexps], functional relations, etc.)
Low-level patterns implemented as regular expressions
Match items such as email addresses, phone numbers, names, etc.

Mileage matches [8]
  constant
    { extract "\b[1-9]\d{0,2}k";
      substitute "[kK]" -> "000"; },
    { extract "[1-9]\d{0,2}?,\d{3}";
      context "[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
      substitute "," -> ""; },
    { extract "[1-9]\d{0,2}?,\d{3}";
      context "(mileage\:\s*)[^\$\d][1-9]\d{0,2}?,\d{3}[^\d]";
      substitute "," -> ""; },
    { extract "[1-9]\d{3,6}";
      context "[^\$\d][1-9]\d{3,6}\s*mi(\.|\b\les\b)"; },
    { extract "[1-9]\d{3,6}";
      context "(mileage\:\s*)[^\$\d][1-9]\d{3,6}\b"; };
  keyword "\bmiles\b", "\bmi\.", "\bmi\b", "\bmileage\b";
end;
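To make the extract/substitute idea concrete, here is a minimal Python sketch of two Mileage-style rules. It is not the data-frame language itself: the names `MILEAGE_RULES` and `extract_mileage` are hypothetical, and the patterns are simplified relatives of the ones on the slide (the keyword check is folded into a lookahead).

```python
import re

# Hypothetical, simplified rendering of two Mileage rules.
# Each rule pairs an extraction pattern with a normalizing substitution.
MILEAGE_RULES = [
    # "100K" style values: replace the trailing k/K with "000".
    (re.compile(r"\b[1-9]\d{0,2}[kK]\b"),
     lambda s: s[:-1] + "000"),
    # "7,000" style values followed by a mileage keyword: drop the comma.
    (re.compile(r"\b[1-9]\d{0,2},\d{3}(?=\s*(?:miles\b|mi\b|mi\.))"),
     lambda s: s.replace(",", "")),
]

def extract_mileage(text):
    """Return normalized mileage strings found in text, in rule order."""
    found = []
    for pattern, normalize in MILEAGE_RULES:
        for match in pattern.finditer(text):
            found.append(normalize(match.group()))
    return found
```

With these assumed rules, a price such as $11,995 is not returned because no mileage keyword follows it, which is the role the context and keyword clauses play in the data frame.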
19
Isolated concepts are OK, but...
We’re also interested in the relations between concepts
This is often best done graphically
Ontology: arrangement of concepts that makes their relations and constraints explicit
Conceptual modeling: field of CS / linguistics that deals with formalizing concepts, using such information
BYU has its own well-known conceptual modeling framework (OSM)
20
Conceptual modeling (OSM)
[Diagram: OSM model centered on Car, with object sets Year, Price, Make, Mileage, Model, Feature, PhoneNr and Extension; relationship sets labeled "has" and "is for"; participation constraints 0..1, 0..* and 1..*]
21
Ontologies and IE
Source Target
22
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles. Previous owner heart broken! Asking only $11,995. #1415. JERRY SEINER MIDVALE, 566-3800 or 566-3888
Constant/keyword recognition
Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
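A rough Python sketch of this recognition step follows. The `RECOGNIZERS` table and the `recognize` function are hypothetical stand-ins for the real data frames and lexicons; the only thing taken from the slide is the output shape, Descriptor|String|start|end with 1-based inclusive positions.

```python
import re

# Hypothetical recognizers; real data frames and lexicons are much richer.
RECOGNIZERS = {
    "Year": r"(?<=')\d{2}\b",
    "Make": r"\bCHEVY\b",
    "Model": r"\bCavalier\b",
    "Mileage": r"\b\d{1,3},\d{3}\b",
    "Price": r"(?<=\$)\d{1,3},\d{3}\b",
    "PhoneNr": r"\b\d{3}-\d{4}\b",
}

def recognize(text):
    """Return (descriptor, string, start, end) tuples, 1-based inclusive."""
    hits = []
    for descriptor, pattern in RECOGNIZERS.items():
        for m in re.finditer(pattern, text):
            hits.append((descriptor, m.group(), m.start() + 1, m.end()))
    return sorted(hits, key=lambda h: (h[2], h[3]))

ad = "'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles."
for hit in recognize(ad):
    print("|".join(str(part) for part in hit))
```

As in the slide's output, a string such as 11,995 can be claimed by more than one descriptor (here Mileage and Price); resolving that ambiguity is left to later steps that use keywords and ontology constraints.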
23
Database instance generator
insert into Car values(1001, '97', 'CHEVY', 'Cavalier', '7,000', '11,995', '566-3800');
insert into CarFeature values(1001, 'Red');
insert into CarFeature values(1001, '5 spd');
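The same inserts can be sketched in Python with the standard `sqlite3` module. The table schema and the helper name `store_car` are assumptions made for illustration; the column names are inferred from the order of the values on the slide.

```python
import sqlite3

# Assumed schema: column names inferred from the order of values on the slide.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Car (CarId INTEGER, Year TEXT, Make TEXT, Model TEXT, "
    "Mileage TEXT, Price TEXT, PhoneNr TEXT)"
)
conn.execute("CREATE TABLE CarFeature (CarId INTEGER, Feature TEXT)")

CAR_COLUMNS = ("Year", "Make", "Model", "Mileage", "Price", "PhoneNr")

def store_car(conn, car_id, values, features):
    """Insert one car record and its features with parameterized statements."""
    row = (car_id,) + tuple(values.get(col) for col in CAR_COLUMNS)
    conn.execute("INSERT INTO Car VALUES (?, ?, ?, ?, ?, ?, ?)", row)
    conn.executemany(
        "INSERT INTO CarFeature VALUES (?, ?)",
        [(car_id, feature) for feature in features],
    )

store_car(
    conn,
    1001,
    {"Year": "97", "Make": "CHEVY", "Model": "Cavalier",
     "Mileage": "7,000", "Price": "11,995", "PhoneNr": "566-3800"},
    ["Red", "5 spd"],
)
```

Attributes that were not recognized are stored as NULL, since `values.get` returns `None` for a missing descriptor.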
24
Car ads extraction ontology
[Diagram: extraction ontology centered on CarAds, with object sets Color, Feature, Accessory, BodyType, OtherFeature, Engine, Transmission, Mileage, ModelTrim, Trim, Model, Year, Make, Price and PhoneNr; "has" relationship sets; participation constraints such as 0:1, 0:* and 1:*, plus constraints with an expected value such as 0:0.7:1, 0:0.9:1 and 0:0.78:1]
25
Car ads ontology (textual)
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4]
  constant { extract "\d{2}";
             context "([^\$\d]|^)[4-9]\d[^\d]";
             substitute "^" -> "19"; },
  … …
End;
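The Year rule above (extract two digits in a suitable context, then prefix "19") can be approximated in Python as follows. This is only a sketch: `YEAR_IN_CONTEXT` and `extract_years` are hypothetical names, and the trailing `[^\d]` of the slide's context is expressed as a lookahead so that it is not consumed.

```python
import re

# Approximation of the slide's Year rule: a two-digit number starting with 4-9,
# not preceded by "$" or a digit, and followed by a non-digit.
YEAR_IN_CONTEXT = re.compile(r"(?:[^$\d]|^)([4-9]\d)(?=\D)")

def extract_years(text):
    """Return four-digit years, applying the substitution "^" -> "19"."""
    return ["19" + m.group(1) for m in YEAR_IN_CONTEXT.finditer(text)]
```

For the car ad used earlier, only '97 qualifies; the digits inside 7,000 and $11,995 fail the context test.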
26
A gene ontology
27
A genealogy data model
28
Finding jobs in linguistics
Built ontology for linguistics jobs: what defines a linguistics job
Data frames and lexicons: language names (www.ethnologue.com), subfields of linguistics (www.linguistlist.org), tools linguists use, programming languages, activities, responsibilities, country names
Documents: 3500 web pages + emails to me
Complete results reported in DLLS 2003
29
Sample query
30
Sample output
31
Subfield expertise sought
[Two bar charts of counts per subfield. First chart (axis 0 to 700): IE/IR, Morpho, NLP, Phonetics, Phonology, Pragmatics, Speech, Syntax, Semantics, MT, TESOL/EFL, Translation. Second chart (axis 0 to 800): Psycho, Neuro, Historical, Typological, Acquisition, Cognition, Socioling, Lexicography, Philology, Philosophy, Anthropo.]
32
Technical skills sought
[Two bar charts of counts per skill. First chart (axis 0 to 700): C/C++, CGI, HTML/SGML, Java/Jscript, Lisp, Perl, Prolog, SQL, Tcl, VB, XML/XSLT. Second chart (axis 0 to 300): Machine learning, Finite-state, Statistical, Stoch/Prob, Math, Generative, Field Methods.]
33
Sample observations
270 don’t have linguist* (!)
Computer/computational background required for almost 1/3 (1116)
Noticeable amount of headhunting, particularly in Seattle, DC areas
Often a job title is not even listed (!)
Great need for ontologies related to linguistics
  job titles
  theoretical frameworks, subfields
  typical linguist job activities
  linguistic research/development venues
34
An engineering discipline?
160 linguistics jobs ending in “engineer”
Software development cycle
  research e., software design e.
  development e., software e.
  software quality e., linguistic test e., linguistic quality e.
  linguistic support e., user experience e.
  presales e., technical sales e.
Specific subfields
  web site e.
  speech e., voice recognition e., speech recognition application e., speech e., ASR tuning e., audio e.
  dialog e.
  tools e.
  AI e., NLP e.
  knowledge e., ontology e.
  linguist e., natural language e.
  staff e.
  human factors e., user interface e.
35
A recent ontologist job ad
Date: Thu, 28 Jul 2005 11:44:40
Subject: General Linguistics: Ontologist, Denver, USA
Job Rank: Ontologist
Specialty Areas: General Linguistics
Position Summary: Ontologist
This person will be responsible for modifying & editing Ontology structures.
Skills:
  Basic computer skills such as Internet, email, and spreadsheet programs
  In-depth knowledge of any major industry, such as Health Care, Automotive, Legal, Construction, and so forth helpful
  Superior communication skills, both oral and written. Ability to communicate effectively with reports, peers, superiors, and customers essential
  Travel &/or foreign language experience desired
Personal Characteristics:
  A healthy sense of logic, and a love for details
  A deep and abiding love of language, and of rule-governed classification systems. This person should be excited by the challenge of figuring out the precise place where a word belongs, and be delighted with the prospect of performing such tasks as the major part of their job
Position Qualifications:
  Bachelor's degree, preferably in Linguistics, Library Science, English, or related field
36
Another recent ontologist ad
Position Summary: Lead Ontologist
The Lead Ontologist will be responsible for creating & designing Ontology and Ontology structures. This person will be responsible for innovation and general Ontology development as Ontology requirements change. They will serve as Team Lead on various Ontology projects, and they will assist the Director with certain aspects of management, including the development of department culture and standards. They will also serve as a liaison between the Director and the rest of the team.
Skills:
  Ability to edit & manipulate text highly desired, using tools such as Emacs and Perl. High level programming language experience and SQL also desired
  Knowledge of Ontology structures, and experience with developing and maintaining such structures
  Ability to assist with Ontology development and use problem-solving skills to overcome obstacles
  Ability to QA own Ontology work, and work of others
  Ability to lead projects from set-up through to QA
  Leadership or management experience a plus
Position Qualifications:
  Bachelor's degree in Linguistics, Library Science, or related field
  2-3 years experience in Ontology or related field
Application Deadline: Open until filled.
37
Matching request with ontology
“Tell me about cruises on San Francisco Bay. I’d like to know scheduled times, cost, and the duration of cruises on Friday of next week.”
38
Building a query
[Diagram: the request is mapped onto a query over the ontology]
Selection constants: San Francisco Bay; Friday, Oct. 29th
Projection: scheduled times, cost, duration
Join path
= Result ( )
39
StartTime | Price | Duration | Source
10:45 am, 12:00 pm, 1:15, 2:30, 4:00 | $20.00, $16.00, $12.00 | | 1
10:00 am, 10:45 am, 11:15 am, 12:00 pm, 12:30 pm, 1:15 pm, 1:45 pm, 2:30 pm, 3:00 pm, 3:45 pm, 4:15 pm, 5:00 pm | $17.00, $16.00, $12.00 | 1 Hour | 2
40
Another example
Service request: "I want to see a dermatologist next week; any day would be ok for me, at 4:00 p.m. The dermatologist must be within 20 miles from my home and must accept my insurance."
Match with: Task Ontology, Domain Ontology, Process Ontology
Complete, Negotiate, Finalize
41
Service domain ontology
[Diagram: service domain ontology with Appointment marked as the main object set (->). Object sets: Appointment, Place, Insurance, Service Provider, Person, Name, Doctor, Pediatrician, Service Description, Duration, Medical Service Provider, Auto Service Provider, Auto Mechanic, Dermatologist, Address, Cost, Date, Time. Relationship sets: has, is at, is on, is with, is for, provides, accepts. Insurance values shown: "IHC", "DMBA".]
42
[Diagram: the service domain ontology from the previous slide, shown again with the same object sets (Appointment, Place, Insurance, Service Provider, Person, Name, Doctor, Pediatrician, Service Description, Duration, Medical Service Provider, Auto Service Provider, Auto Mechanic, Dermatologist, Address, Cost, Date, Time) and relationship sets]
43
Relevant mini-ontology
[Diagram: the part of the service domain ontology relevant to the request, with Appointment as the main object set (->). Object sets: Appointment, Place, Dermatologist, Person, Name, Address, Date, Time. Relationship sets: has, is at, is on, is with, is for.]
44
Ontologies: issues
Most successful in data-rich, narrow-domain applications
Ambiguities are problematic, context only partially eliminates
Incompleteness: implicit information
Commonsense world pragmatics evasive
Knowledge prerequisites are steep
Major efforts in creation, maintenance
  Must be created by experts
  Experts are biased in knowledge, agreement needed
  Ontologies continually change; upkeep a massive task
45
Ontologies: possible solutions
Some automation is needed
Current automatic generation of ontologies is not successful, because the ontologies are extracted from free-form, unstructured text.
A more effective alternative is to extract ontologies from structured data on the web (tables, charts, etc.)
TANGO project
  Part 1: Extract tables from the web
  Part 2: Define mini-ontologies from tables
  Part 3: Merge into growing domain ontology
46
Project TANGO
47
Overview
Table ANalysis for Generating Ontologies
3-year NSF-funded project
Joint BYU/RPI project
Uses and extends TIDIE concepts, ontologies
Goal is to process tables, generate ontologies, use results for IE
48
Motivation
Keyword or link analysis search not enough to search for information in tables
Structure in tables can lead to domain knowledge which includes concepts, relationships and constraints (ontologies)
Tables on web created for human use can lead to robust domain ontologies
49
Table understanding
What is a table?
Why table normalization?
What is table understanding?
What is mini-ontology generation?
50
What is a table?
“…a two-dimensional assembly of cells used to present information…” (Lopresti and Nagy)
Normalized tables (row-column format)
Small paper (using OCR) and/or electronic tables (marked up) intended for human use
51
?
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10 x
Digital Zoom: 4 x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm
54
?
Olympus C-750 Ultra Zoom
Sensor Resolution 4.2 megapixels
Optical Zoom 10 x
Digital Zoom 4 x
Installed Memory 16 MB
Lens Aperture F/8-2.8/3.7
Focal Length min 6.3 mm
Focal Length max 63.0 mm
55
Digital Camera
Olympus C-750 Ultra Zoom
Sensor Resolution: 4.2 megapixels
Optical Zoom: 10 x
Digital Zoom: 4 x
Installed Memory: 16 MB
Lens Aperture: F/8-2.8/3.7
Focal Length min: 6.3 mm
Focal Length max: 63.0 mm
56
?
Flight # | Class | From | Time/Date | To | Time/Date | Stops
Delta 16 | Coach | JFK | 6:05 pm, 02 01 04 | CDG | 7:35 am, 03 01 04 | 0
Delta 119 | Coach | CDG | 10:20 am, 09 01 04 | JFK | 1:00 pm, 09 01 04 | 0
58
Airline Itinerary
Flight # | Class | From | Time/Date | To | Time/Date | Stops
Delta 16 | Coach | JFK | 6:05 pm, 02 01 04 | CDG | 7:35 am, 03 01 04 | 0
Delta 119 | Coach | CDG | 10:20 am, 09 01 04 | JFK | 1:00 pm, 09 01 04 | 0
59
?
Place | Bonnie Lake
County | Duchesne
State | Utah
Type | Lake
Elevation | 10,000 feet
USGS Quad | Mirror Lake
Latitude | 40.711ºN
Longitude | 110.876ºW
62
Maps
Place | Bonnie Lake
County | Duchesne
State | Utah
Type | Lake
Elevation | 10,100 feet
USGS Quad | Mirror Lake
Latitude | 40.711ºN
Longitude | 110.876ºW
63
Table normalization
Take any table, produce a standard row-column table with all data cells containing expanded values and type information

Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | $21,000,000,000 | $800 | ? | ?
Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0%
Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0%
Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3%
Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0%
Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4%
… | … | … | … | …

Raw table -> Normalized table
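One way to picture "expanded values and type information" is a small cell normalizer. The sketch below is an assumption about what such a step might look like, not the project's implementation; the type labels ("dollar", "percent", "unknown", "text") and the function name `normalize_cell` are invented for illustration.

```python
def normalize_cell(cell):
    """Return a (type, value) pair for one table cell.

    Hypothetical type labels; a real system would consult data frames.
    """
    text = cell.strip()
    if text == "?":
        return ("unknown", None)
    if text.startswith("$"):
        return ("dollar", int(text[1:].replace(",", "")))
    if text.endswith("%"):
        return ("percent", float(text[:-1]))
    return ("text", text)

row = ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"]
normalized_row = [normalize_cell(cell) for cell in row]
```

Keeping the type next to the value lets later steps compare columns by kind (dollar amounts, percentages) rather than by surface string.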
64
Normalizing across hyperlinks
65
Normalized table??
(unlabeled) | Population | Population Growth rate | Population Density | Birth Rate | Death Rate | Migration Rate | Life Expectancy Male | Life Expectancy Female | Infant Mortality
Afghanistan | 25,824,882 | 3.95% | 39.88 persons/km2 | 4.19% | 1.70% | 1.46% | 47.82 years | 46.82 years | 14.06%
Albania | 3,364,571 | 1.05% | 122.79 persons/km2 | 2.07% | 0.74% | -0.29% | 65.92 years | 72.33 years | 4.29%
Algeria | 31,133,486 | 2.10% | 13.07 persons/km2 | 2.70% | 0.55% | -0.05% | 68.07 years | 70.46 years | 4.38%
American Samoa | 63,786 | 2.64% | 320.53 persons/km2 | 2.65% | 0.40% | 0.39% | 71.23 years | 79.95 years | 1.02%
Andorra | 65,939 | 2.24% | 146.53 persons/km2 | 1.03% | 0.55% | 1.76% | 80.55 years | 86.55 years | 0.41%
Angola | 11,510 | 2.84% | 8.97 persons/km2 | 4.31% | 1.64% | 0.16% | 46.08 years | 50.82 years | 12.92%
… | … | … | … | … | … | … | … | … | …
Western Sahara | 239,333 | 2.34% | 0.90 persons/km2 | 4.54% | 1.66% | -0.54% | 47.98 years | 50.57 years | 13.67%
World | 5,995,544,836 | 1.30% | 14.42 persons/km2 | 2.20% | 0.90% | ? | 61.00 years | 65.00 years | 5.60%
Yemen | 16,942,230 | 3.34% | 32.09 persons/km2 | 4.33% | 0.99% | 0.00% | 58.17 years | 61.88 years | 6.98%
Zambia | 9,663,535 | 2.12% | 13.05 persons/km2 | 4.45% | 2.26% | 0.08% | 36.72 years | 37.21 years | 9.19%
Zimbabwe | 11,163,160 | 1.02% | 28.87 persons/km2 | 3.06% | 2.04% | ? | 38.77 years | 38.94 years | 6.12%
66
How to understand tables
Captions – in vicinity of table (above, below etc)
Footnotes – on annotated column labels or data cells
Embedded information – in rows, columns or cells {e.g., $, %, (1,000), billions, etc}
Links to other views of the table, possibly with new information
67
Use of normalized data
Take a table as an input and produce standard records in the form of attribute-value pairs as output
Discover constraints among columns
Understand the data values

Country | GDP/PPP | GDP/PPP Per Capita | Real-Growth Rate | Inflation
Afghanistan | $21,000,000,000 | $800 | ? | ?
Albania | $13,200,000,000 | $3,800 | 7.3% | 3.0%
Algeria | $177,000,000,000 | $5,600 | 3.8% | 3.0%
Andorra | $1,300,000,000 | $19,000 | 3.8% | 4.3%
Angola | $13,300,000,000 | $1,330 | 5.4% | 110.0%
Antigua and Barbuda | $674,000,000 | $10,000 | 3.5% | 0.4%
… | … | … | … | …

Annotations on the table:
  Country: left-most column, primary key; country names (from data frame)
  GDP/PPP, GDP/PPP Per Capita: dollar amount (from data frame)
  Real-Growth Rate, Inflation: percentage (from data frame)
Constraints: {has(Country, GDP/PPP), has(Country, GDP/PPP Per Capita), has(Country, Real-growth rate*), has(Country, Inflation*)}
Record: {<Country: Afghanistan>, <GDP/PPP: $21,000,000,000>, <GDP/PPP per capita: $800>, <Real-growth rate: ?>, <Inflation: ?>}
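A minimal sketch of turning a normalized table into attribute-value records and candidate has(...) relationships is shown below. The function names are hypothetical, and treating the left-most column as the key is the slide's heuristic, applied here without any checking.

```python
def table_to_records(header, rows):
    """Turn each row into a dict of attribute-value pairs."""
    return [dict(zip(header, row)) for row in rows]

def candidate_relationships(header):
    """Pair the left-most column (assumed key) with every other column."""
    key = header[0]
    return [(key, attribute) for attribute in header[1:]]

header = ["Country", "GDP/PPP", "GDP/PPP Per Capita", "Real-Growth Rate", "Inflation"]
rows = [
    ["Afghanistan", "$21,000,000,000", "$800", "?", "?"],
    ["Albania", "$13,200,000,000", "$3,800", "7.3%", "3.0%"],
]
records = table_to_records(header, rows)
relationships = candidate_relationships(header)
```

Each tuple in `relationships` corresponds to one has(Country, ...) entry in the constraint list above.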
68
Ontology generation overview
Concepts of Interest
Concepts with Relations
Data extraction ontology
Sample Documents
69
Example: Creating a domain ontology
[Diagram: domain ontology with object sets Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude and Time. Labels: "has names", "has GMT", "Latitude and longitude designates location". Annotations: has associated data frames; includes procedural knowledge (distances, duration between time zones).]
70
Example: Table understanding to mini-ontology generation

Agglomeration | Population | Continent | Country
Tokyo | 31,139,900 | Asia | Japan
New York-Philadelphia | 30,286,900 | The Americas | United States of America
Mexico | 21,233,900 | The Americas | Mexico
Seoul | 19,969,100 | Asia | Korea (South)
Sao Paulo | 18,847,400 | The Americas | Brazil
Jakarta | 17,891,000 | Asia | Indonesia
Osaka-Kobe-Kyoto | 17,621,500 | Asia | Japan
… | … | … | …
Niigata | 503,500 | Asia | Japan
Raurkela | 503,300 | Asia | India
Homjel | 502,200 | Europe | Belarus
Zunyi | 501,900 | Asia | China
Santiago | 501,800 | The Americas | Dominican Republic
Pingdingshan | 501,500 | Asia | China
Fargona | 501,000 | Asia | Uzbekistan
Kirov | 500,200 | Europe | Russia
Newcastle | 500,000 | Australia/Oceania | Australia

[Diagram: resulting mini-ontology relating Agglomeration, Population, Country and Continent]
71
Example: Concept matching to ontology merging
[Diagram: the mini-ontology (Agglomeration, Population, Country, Continent) is merged with the domain ontology (Geopolitical Entity, Name, Country, City, Location, Longitude, Latitude, Time). The merged result contains Geopolitical Entity, Name, Location, Longitude, Latitude, Time, Population, City, Agglomeration, Country and Continent.]
72
Ontology merging/growing
Direct merge (no conflicts)
  Use results of matching phase to find similar concepts in ontologies (e.g., data value similarities, data frames, NLP, etc.)
Conflict resolution
  Interactively identify evidence and counter-evidence of functional relationships among mini-ontologies using constraint resolution
IDS: interaction with human knowledge engineer
  Issues: identify
  Default strategy: apply
  Suggestions: make
73
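Example: Another mini-ontology generation
[Diagram: mini-ontology for Place with Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country and State; Place has the specializations Mine, Reservoir, Lake and City/town under a ⊎ constraint]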
74
Example: Another mini-ontology generation
[Diagram: the Place mini-ontology (Place Name, Longitude, Latitude, Elevation, USGS Quad, Area, Country, State; specializations Mine, Reservoir, Lake, City/town under ⊎) is merged with the growing ontology (Geopolitical Entity, Name, Location, Longitude, Latitude, Population, City, Agglomeration, Country, Continent, Time)]
75
Example: Concept Mapping to Ontology Merging
[Diagram: merged ontology. Place (with Elevation, USGS Quad, Area, Country, State) has the specializations Mine, Reservoir and Lake under ⊎; City/town appears alongside "Geopolitical Entity with population"; the rest of the ontology keeps Location, Longitude, Latitude, Name, Geopolitical Entity, Population, Agglomeration, Country, Continent and Time.]
76
Recognize Table Information
Column headers: Country; Population (July 2001 est.); Religion: Albanian Orthodox, Muslim, Roman Catholic, Shi’a Muslim, Sunni Muslim, other
Afghanistan 26,813,057 15% 84% 1%
Albania 3,510,484 20% 70% 30%
77
Construct Mini-Ontology
Column headers: Country; Population (July 2001 est.); Religion: Albanian Orthodox, Muslim, Roman Catholic, Shi’a Muslim, Sunni Muslim, other
Afghanistan 26,813,057 15% 84% 1%
Albania 3,510,484 20% 70% 30%
78
Discover Mappings
79
Merge
80
Review: the TANGO process
Start out with normalized table
Generate likely candidates for:
  Object Sets
  Relationship Sets
  Functional Constraints
  Inclusion Constraints/Hierarchical Structure
Get help from user when needed
Choose best candidate for the ontology
81
Generate concepts
Create list of candidate concepts (usually column names)
82
Example 1: Generate Concepts
Determine lexicalization (columns with associated values are lexical)
83
Example 1: Generate Concepts
Current ontology
84
Example 1: Generate Relationships
Decide relationship sets
Exponential number of combinations
Basic assumption: one main concept relates to all others (attributes)
Goal: find central column of interest
85
Example 1: Generate Relationships
Look for mapping between one column and title of table
86
Example 1: Generate Relationships
Current ontology
87
Example 1: Generate Constraints
FDs and Participation Constraints
FD definition: X → Y iff (X[i] = X[j]) → (Y[i] = Y[j]) for all row indexes i and j.
Unless there is a solid case (two or more rows with the same value), only consider FDs from central object to attributes
Use heuristics for setting exact participation (0:1, 1:*, etc.)
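The FD definition above translates directly into a check over table rows. The sketch below assumes rows are dictionaries keyed by column name; `holds_fd` is a hypothetical helper, and the sample rows are taken from the agglomeration table shown earlier.

```python
def holds_fd(rows, x, y):
    """True if column x functionally determines column y in the given rows.

    Implements: X -> Y iff X[i] == X[j] implies Y[i] == Y[j] for all i, j.
    """
    seen = {}
    for row in rows:
        if row[x] in seen and seen[row[x]] != row[y]:
            return False
        seen.setdefault(row[x], row[y])
    return True

rows = [
    {"Agglomeration": "Tokyo", "Country": "Japan", "Continent": "Asia"},
    {"Agglomeration": "Seoul", "Country": "Korea (South)", "Continent": "Asia"},
    {"Agglomeration": "Osaka-Kobe-Kyoto", "Country": "Japan", "Continent": "Asia"},
    {"Agglomeration": "Sao Paulo", "Country": "Brazil", "Continent": "The Americas"},
]
```

On these rows Country → Continent holds, while Continent → Country does not, because Asia appears with both Japan and Korea (South). An FD that holds on a sample is only a candidate; a single unseen row can refute it, which is why the slide restricts attention to solid cases.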
88
Example 1: Generate Concepts
Numerical values are usually functionally determined by column of interest and have 0:* participation constraint.
89
Example 1: Generate Constraints
Completed mini-ontology
90
Example 2: Generate Concepts
SubFamily, Group, and SubGroup are generic types
Enumerate column values as object sets because there are fewer than 5 divisions (recursively)
91
Example 2: Generate Relationships
Found mapping of central column of interest to title (Language)
Exceptions to basic assumption
  Hierarchy (enumerated object sets)
  Transitive FDs (X → Y, Y → Z, remove X → Z)
Create ISA hierarchy from table structure
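The transitive-FD rule (given X → Y and Y → Z, remove X → Z) can be sketched as a single pass over a set of dependency pairs. `remove_transitive` is a hypothetical helper; the sketch assumes there are no cyclic dependencies, since with cycles a single pass could remove more than intended.

```python
def remove_transitive(fds):
    """Drop X -> Z whenever X -> Y and Y -> Z are both present.

    fds is an iterable of (determinant, dependent) pairs.
    Assumes the dependencies contain no cycles.
    """
    fds = set(fds)
    redundant = {
        (x, z)
        for (x, y) in fds
        for (y2, z) in fds
        if y == y2 and len({x, y, z}) == 3 and (x, z) in fds
    }
    return fds - redundant

fds = {
    ("Language", "SubGroup"),
    ("SubGroup", "Group"),
    ("Language", "Group"),
}
reduced = remove_transitive(fds)
```

Here Language → Group is dropped because it follows from Language → SubGroup and SubGroup → Group; the column names come from the Example 2 slides.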
92
Example 2: Generate Relationships
Current ontology
93
Example 2: Generate Hierarchical Constraints
Assign members to each object set for easy calculation
Find inclusion dependencies:
  Union: all members of the parents are members of one or more children
  Intersection (less common): child members are always in both parents
  Mutual exclusion: the intersection of any two children's members is empty
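With members assigned to each object set, the three inclusion tests reduce to set operations. The following sketch uses hypothetical function names, and the sample members are illustrative only (Bonnie Lake appears in the earlier Place table; Provo is an assumed example).

```python
from itertools import combinations

def is_union(parent, children):
    """Every parent member belongs to at least one child."""
    return parent <= set().union(*children.values())

def is_mutually_exclusive(children):
    """No two children share a member."""
    return all(a.isdisjoint(b) for a, b in combinations(children.values(), 2))

def is_intersection(child, parent_a, parent_b):
    """Every child member belongs to both parents."""
    return child <= (parent_a & parent_b)

# Illustrative members only.
place = {"Bonnie Lake", "Provo"}
children = {"Lake": {"Bonnie Lake"}, "City/town": {"Provo"}}
```

When both the union and the mutual-exclusion tests succeed, the children partition the parent, which matches the ⊎ marking used in the Place diagrams.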
94
Example 2: Generate Hierarchical Constraints
Completed mini-ontology
95
Future direction
Start with multiple tables (or URLs) and generate mini-ontologies
Identify most suitable mini-ontologies to merge by calculating which tables have most overlap of concepts
Generate multiple domain ontologies
Integrate with form-based data extraction tools (smarter Web search engines)
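As one possible reading of "most overlap of concepts", the sketch below scores each pair of tables by the Jaccard overlap of their column names and picks the best pair. The Jaccard measure, the exact-name comparison and the function names are all assumptions; the matching described earlier in the talk draws on richer evidence such as data values, data frames and NLP.

```python
from itertools import combinations

def concept_overlap(a, b):
    """Jaccard overlap between two collections of concept names."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def best_pair(tables):
    """Return the pair of table names whose concepts overlap the most."""
    return max(
        combinations(tables, 2),
        key=lambda pair: concept_overlap(tables[pair[0]], tables[pair[1]]),
    )

# Column names taken from tables shown earlier; table names are made up.
tables = {
    "agglomerations": ["Agglomeration", "Population", "Continent", "Country"],
    "economy": ["Country", "GDP/PPP", "Inflation"],
    "population": ["Country", "Population", "Birth Rate"],
}
```

In this toy input the agglomerations and population tables share two of five distinct concepts, so they would be merged first.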