Big Data and How to Overcome the Problems it Causes
Ontology Engineering CSE 510/PHI 598 Fall 2014
September 8, 2014
Big Data Problem
• Wikipedia defines Big Data as “…a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools.”
• Gartner defines Big Data with three ‘V’s:
– Volume
– Velocity (of production and analysis)
– Variety
• This means that Big Data are beyond our control (as opposed to those complex and big systems with diverse and changing data where the complexity is known)
The Promise of Big Data
• Great insights can be obtained from large diverse data sets if properly exploited with the right analytics
• Proper exploitation requires solutions in the areas of
– Hardware
– Software
– Method
Knowledge Representations: Attribute-Value Systems
Restaurant | Cuisine | Cost | Avg. Diner Review | Avg. Critic Review | Reservation Required
Tom’s Diner | American | $ | 3.2 | 2.8 | No
Les Gros Poissons | French | $$$$ | 4.5 | 4.8 | Yes
Il Grand Pesce | Italian | $$$ | 3.8 | 3.5 | Yes
El Gran Pez | Spanish | $$ | 4.3 | 4.4 | No
Den Stora Fisken | Swedish | $$$ | 3.2 | 4.8 | Yes
De Grote Vis | Dutch | $$$$ | 4.0 | 2.2 | Preferred
A Shortcoming of Attribute-Value Systems
• Duplicate Attributes
Restaurant | Cuisine | … | Owner | Owner 2 | Owner 3
Tom’s Diner | American | … | Tom Washington | |
Les Gros Poissons | French | … | Jean Adams | Simone Jefferson |
Il Grand Pesce | Italian | … | Robert Madison | Simone Jefferson |
El Gran Pez | Spanish | … | Louis Adams | |
Den Stora Fisken | Swedish | … | Philip Jackson | Claire Van Buren | Susan Harrison
De Grote Vis | Dutch | … | Kate Tyler | |
Relational Database Solutions
• 1st Normal Form – no attribute may itself be a set of values, so each owner gets its own row
Restaurant | Cuisine | … | Owner
Tom’s Diner | American | … | Tom Washington
Les Gros Poissons | French | … | Jean Adams
Les Gros Poissons | French | … | Simone Jefferson
Il Grand Pesce | Italian | … | Robert Madison
Il Grand Pesce | Italian | … | Simone Jefferson
El Gran Pez | Spanish | … | Louis Adams
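The move from repeated owner columns to 1st Normal Form can be sketched in a few lines. This is only an illustration under assumptions: the list-of-tuples layout and the name `to_first_normal_form` are hypothetical, and only the restaurant, cuisine, and owner columns from the slide are carried.

```python
# Hypothetical "wide" rows: each restaurant carries a set of owners.
wide_rows = [
    ("Tom's Diner", "American", ["Tom Washington"]),
    ("Les Gros Poissons", "French", ["Jean Adams", "Simone Jefferson"]),
    ("Il Grand Pesce", "Italian", ["Robert Madison", "Simone Jefferson"]),
    ("El Gran Pez", "Spanish", ["Louis Adams"]),
]

def to_first_normal_form(rows):
    """Emit one (restaurant, cuisine, owner) row per owner."""
    return [
        (name, cuisine, owner)
        for name, cuisine, owners in rows
        for owner in owners
    ]

flat_rows = to_first_normal_form(wide_rows)
```

Each resulting row is identified by the Restaurant and Owner pair, which is the aggregate key discussed on the next slide.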
Rows Represent Unique Objects
• Each row now uniquely represents an aggregate entity of Restaurant and Owner
• This aggregate forms the primary key of the table
Restaurant | Cuisine | … | Owner
Tom’s Diner | American | … | Tom Washington
Les Gros Poissons | French | … | Jean Adams
Les Gros Poissons | French | … | Simone Jefferson
Il Grand Pesce | Italian | … | Robert Madison
Il Grand Pesce | Italian | … | Simone Jefferson
El Gran Pez | Spanish | … | Louis Adams
A Shortcoming of 1st Normal Form
• Since the attributes depend on only a part of the primary key (i.e. Restaurant), the table risks inconsistency if an attribute is changed in one row for a restaurant but not in the others (below, Les Gros Poissons is both Creole and French)
Restaurant | Cuisine | … | Owner
Tom’s Diner | American | … | Tom Washington
Les Gros Poissons | Creole | … | Jean Adams
Les Gros Poissons | French | … | Simone Jefferson
Il Grand Pesce | Italian | … | Robert Madison
Il Grand Pesce | Italian | … | Simone Jefferson
El Gran Pez | Spanish | … | Louis Adams
Relational Database Solutions
• 2nd Normal Form requires that any attribute must describe the object designated by the primary key rather than just some part of it
Restaurant | Cuisine | Cost | …
Tom’s Diner | American | $ | …
Les Gros Poissons | Creole | $$$$ | …
Il Grand Pesce | Italian | $$$ | …
El Gran Pez | Spanish | $$ | …
Den Stora Fisken | Swedish | $$$ | …
De Grote Vis | Dutch | $$$$ | …

Restaurant | Owner
Tom’s Diner | Tom Washington
Les Gros Poissons | Jean Adams
Les Gros Poissons | Simone Jefferson
Il Grand Pesce | Robert Madison
Il Grand Pesce | Simone Jefferson
El Gran Pez | Louis Adams
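The 2nd Normal Form split can be sketched as a function that separates attributes depending on the restaurant alone from the restaurant and owner pairs. The function name and the choice to raise on conflicting cuisines are assumptions made for illustration, not part of the slides.

```python
def to_second_normal_form(rows):
    """Split (restaurant, cuisine, owner) rows into two tables."""
    restaurants = {}  # restaurant -> cuisine (depends on the restaurant only)
    ownership = []    # (restaurant, owner) pairs
    for name, cuisine, owner in rows:
        # setdefault stores the first cuisine seen; a later mismatch is an error.
        if restaurants.setdefault(name, cuisine) != cuisine:
            raise ValueError(f"conflicting cuisine for {name}")
        ownership.append((name, owner))
    return restaurants, ownership

rows_1nf = [
    ("Tom's Diner", "American", "Tom Washington"),
    ("Les Gros Poissons", "French", "Jean Adams"),
    ("Les Gros Poissons", "French", "Simone Jefferson"),
]
restaurants, ownership = to_second_normal_form(rows_1nf)
```

Once the split is made, a cuisine is stored exactly once per restaurant, so the Creole/French disagreement shown earlier can no longer be represented.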
A Shortcoming of 2nd Normal Form
• While both Date and Day of Purchase describe the unique object of the table (i.e. the Restaurant+Owner primary key), there are duplicate combinations of the two
• If one of the combinations is changed without the other, a date may be shown as falling on two days of the week
Restaurant | Owner | Date of Purchase | Day of Purchase
Tom’s Diner | Tom Washington | 5/3/1994 | Wednesday
Les Gros Poissons | Jean Adams | 4/14/2008 | Friday
Les Gros Poissons | Simone Jefferson | 4/14/2008 | Saturday
Il Grand Pesce | Robert Madison | 10/28/2003 | Thursday
Il Grand Pesce | Simone Jefferson | 2/2/1998 | Monday
El Gran Pez | Louis Adams | 7/30/2012 | Tuesday
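The inconsistency in this table can be found mechanically by grouping days under each date. The sketch below uses a subset of the slide's rows as given; the function name `conflicting_dates` is a hypothetical helper.

```python
from collections import defaultdict

purchases = [
    ("Tom's Diner", "Tom Washington", "5/3/1994", "Wednesday"),
    ("Les Gros Poissons", "Jean Adams", "4/14/2008", "Friday"),
    ("Les Gros Poissons", "Simone Jefferson", "4/14/2008", "Saturday"),
    ("Il Grand Pesce", "Robert Madison", "10/28/2003", "Thursday"),
]

def conflicting_dates(rows):
    """Return dates that are recorded with more than one day of the week."""
    days_by_date = defaultdict(set)
    for _restaurant, _owner, date, day in rows:
        days_by_date[date].add(day)
    return {date: days for date, days in days_by_date.items() if len(days) > 1}

conflicts = conflicting_dates(purchases)
```

Here `conflicts` holds the single date 4/14/2008 with both Friday and Saturday, which is the transitive dependency that 3rd Normal Form removes.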
Relational Database Solutions
• 3rd Normal Form requires that any attribute describes the entity represented by the primary key and only that entity
• No transitive descriptions as in the example from the previous slide
Restaurant | Owner | Date of Purchase
Tom’s Diner | Tom Washington | 5/3/1994
Les Gros Poissons | Jean Adams | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce | Robert Madison | 10/28/2003
Il Grand Pesce | Simone Jefferson | 2/2/1998
El Gran Pez | Louis Adams | 7/30/2012

Date | Day of Week
5/3/1994 | Wednesday
4/14/2008 | Friday
10/28/2003 | Thursday
2/2/1998 | Monday
7/30/2012 | Tuesday
Knowledge Representations As Highly Designed Artifacts
Restaurant | Cuisine | Cost | …
Tom’s Diner | American | $ | …
Les Gros Poissons | Creole | $$$$ | …
Il Grand Pesce | Italian | $$$ | …
El Gran Pez | Spanish | $$ | …
Den Stora Fisken | Swedish | $$$ | …
De Grote Vis | Dutch | $$$$ | …

Restaurant | Owner | Date of Purchase
Tom’s Diner | Tom Washington | 5/3/1994
Les Gros Poissons | Jean Adams | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce | Robert Madison | 10/28/2003
Il Grand Pesce | Simone Jefferson | 2/2/1998
El Gran Pez | Louis Adams | 7/30/2012

Date | Day of Week
5/3/1994 | Wednesday
4/14/2008 | Friday
10/28/2003 | Thursday
2/2/1998 | Monday
7/30/2012 | Tuesday
Application Translation Layers
• Presentation Layer
• Business Layer
• Data Access Layer
Big Data Hardware Solution
• Storing and processing Big Data is costly and can overrun the capabilities of the largest single machines
• A solution is to distribute information across many smaller machines
Hardware Solution is Contrary to Relational Design
• Relational databases were designed to run on single machines
• Attempting to disassemble them and run them on a cluster of machines is very difficult
• Big Data requires a different data model, one that is cluster-friendly, that is, one that can be distributed while still being efficient at retrieving the data that is needed
NoSQL Database Solutions
• Do not require a highly structured representation of data; the data models are relatively simple
– Key-Value Model
– Document Model
– Column Family Model
– Graph Model
Key-Value Data Model
• Key-Value pair, where the key is associated with some value
• The value can be any type of object: a number, a text value, an array, an image, a file, etc.
Tom’s Diner → Value associated with Tom’s Diner
Les Gros Poissons → Value associated with Les Gros Poissons
Il Grand Pesce → Value associated with Il Grand Pesce
El Gran Pez → Value associated with El Gran Pez
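A key-value store can be pictured as a plain mapping whose values are opaque to the store. The sketch below is an in-memory stand-in, not any particular NoSQL product; `put`, `get`, and the stored values are hypothetical.

```python
store = {}

def put(key, value):
    """Associate a key with a value of any type; the store never inspects it."""
    store[key] = value

def get(key, default=None):
    """Look a value up by its key, the only access path the model offers."""
    return store.get(key, default)

put("Tom's Diner", {"Cuisine": "American", "Cost": "$"})
put("El Gran Pez", ["Spanish", "$$"])  # a differently shaped value is fine
```

Because lookup is by key only, finding every American restaurant would require reading each value in application code.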
Document Data Model
• Each element is a document, that is, a complex data structure of some type, usually expressed in JSON (JavaScript Object Notation)
• No set schema for the documents
• More transparent than the Key-Value model
[
  {
    "id": 1,
    "Name": "Tom's Diner",
    "Cuisine": "American",
    "Cost": "$",
    "Average Diner Review": 3.2,
    "Average Critic Review": 2.8,
    "Reservation Required": "No",
    "Owner": "Tom Washington"
  }
]
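Since documents are transparent, a store can filter on fields inside them. The following sketch parses JSON with Python's standard `json` module; the `find` helper and the second, sparser document (with its `id` of 2) are assumptions added to show that no fixed schema is required.

```python
import json

raw = """
[
  {"id": 1, "Name": "Tom's Diner", "Cuisine": "American", "Cost": "$",
   "Reservation Required": "No", "Owner": "Tom Washington"},
  {"id": 2, "Name": "El Gran Pez", "Cuisine": "Spanish"}
]
"""
documents = json.loads(raw)

def find(docs, **criteria):
    """Return documents whose fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]

american = find(documents, Cuisine="American")
```

Documents that lack a queried field simply do not match, so the sparser second document causes no error.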
Column Family Data Model
• A Row Key is associated with n-many column families (i.e. groups of columns that store related data)
Row Key: 1234
Restaurant Column Family: Name “Tom’s Diner”; Cuisine “American”; Cost “$”; Avg Review 2.8
Owner Column Family: Name “Tom Washington”
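The column family layout can be approximated with nested mappings: row key, then family, then column. This is a toy model under assumptions (real column stores add timestamps and on-disk layout), and `read_family` is a hypothetical accessor.

```python
rows = {
    "1234": {
        "Restaurant": {
            "Name": "Tom's Diner",
            "Cuisine": "American",
            "Cost": "$",
            "Avg Review": 2.8,
        },
        "Owner": {"Name": "Tom Washington"},
    }
}

def read_family(row_key, family):
    """Fetch one group of related columns for a row without touching the rest."""
    return rows[row_key][family]

owner_columns = read_family("1234", "Owner")
```

Grouping columns into families lets a reader fetch only the family it needs for a given row key.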
Aggregate Orientation
• As noticed and described by Martin Fowler*, all of the aforementioned NoSQL data models share an orientation towards storing the description of a significant object
• This enables the distribution of data that tends to be requested together (cluster-friendly)
• Tends to be difficult to re-order the data to query by different aggregates
* NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, by Sadalage, P.J. and Fowler, M. (2012)
Graph Data Model
[Figure: graph centered on Tom’s Diner, with nodes Restaurant, Tom Washington (Owner), Cost of $, American Cuisine, Reservations Not Required, Avg. Diner Review of 3.2, Avg. Critic Review of 2.8, and Date of Purchase 5/13/94, which is in turn linked to Wednesday]
Graph Data Model
• Does not have an aggregate orientation, rather the opposite, a granular orientation that breaks the aggregate into its composite elements
• Good for data exploration
• Still cluster-friendly, similar data can be stored in separate graphs
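The exploratory strength of the graph model can be illustrated with a breadth-first walk over labeled edges. The edge labels below (`is_a`, `has_owner`, and so on) are hypothetical names chosen for this sketch; the nodes come from the Tom's Diner graph.

```python
from collections import deque

edges = [
    ("Tom's Diner", "is_a", "Restaurant"),
    ("Tom's Diner", "has_owner", "Tom Washington"),
    ("Tom's Diner", "has_date_of_purchase", "5/13/94"),
    ("5/13/94", "falls_on", "Wednesday"),
]

def reachable(start):
    """Return every node that can be reached by following edges from start."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for source, _label, target in edges:
            if source == node and target not in seen:
                seen.add(target)
                queue.append(target)
    return seen

found = reachable("Tom's Diner")
```

The walk reaches Wednesday through the purchase date even though no edge links it to the restaurant directly, which is the kind of discovery a granular model supports.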
RDF Data Model
• RDF specifies a regular syntax for well-formed expressions
– rdf:Statement – a simple expression that relates one entity to another
– rdf:subject – the entity the statement is about
– rdf:predicate – the relationship said to hold between the two entities
– rdf:object – the entity that is related to the subject
• Humans are mortal
• UB’s website homepage has URL http://www.buffalo.edu/
• Remus is the brother of Romulus
RDF Data Model
Subject | Predicate | Object
Tom’s Diner | Is_a | Restaurant
Tom’s Diner | Offers | American Cuisine
Tom’s Diner | Costs | $
Tom’s Diner | Has_average_diners_review | 3.2
Tom’s Diner | Has_average_critics_review | 2.8
Tom’s Diner | Requires_reservation | No
Tom’s Diner | Has_owner | Tom Washington
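Subject, predicate, object statements can be held as plain tuples and queried by pattern, with `None` acting as a wildcard. This is a minimal sketch rather than a real RDF library; `match` is a hypothetical helper and plain strings stand in for IRIs and literals.

```python
triples = [
    ("Tom's Diner", "Is_a", "Restaurant"),
    ("Tom's Diner", "Offers", "American Cuisine"),
    ("Tom's Diner", "Requires_reservation", "No"),
    ("Tom's Diner", "Has_owner", "Tom Washington"),
]

def match(subject=None, predicate=None, obj=None):
    """Return triples agreeing with every non-None position of the pattern."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]

owners = [o for _s, _p, o in match(predicate="Has_owner")]
```

The same function answers "what does Tom's Diner offer?" and "who owns anything?" without re-ordering the data around an aggregate.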
Methodological Solution
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Origin
• Formats of data sources included free text, semi-structured and structured
• Some data sets are made available only a short time prior to system testing
• Data sets and domain of interest will change
• Data cannot be collected into a single store
• Provide cross-source searching and analytics
• Need to maintain the provenance of data
High Level View of Ontology Content
• Enable Description of Human Activity
[Figure: diagram linking People & Organizations, Attributes, Artifacts, Actions, Natural & Artificial Environments, and Time with the relations “are distinguished by”, “use”, “to perform”, and “that take place in”]
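High Level View of Ontology Content
• Including the Activity of Describing Human Activity
[Figure: diagram linking People & Organizations, Information, Attributes, and Time with the relations “is distinguished by”, “produce”, “that describe”, and “at a”; the content described is a smaller copy of the human activity diagram (Attribute, Action, Natural & Artificial Environments, Time, People & Orgs, Artifacts)]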
Current Import Structure of the I2WD Ontologies
[Figure: import diagram with a legend for Upper Level, Mid-Level, and Domain ontologies. Ontologies shown: Basic Formal Ontology (BFO), Relation Ontology (RO), RO BFO Bridge 1.1, Extended Relation Ontology, Time Ontology, Quality Ontology, Information Entity Ontology, Geospatial Ontology, Event Ontology, Artifact Ontology, Agent Ontology, AIRS Mid-Level Ontology, Emotion Ontology, Counter-terrorism Ontology, Information Technology Ontology, ChEBI Ontology, Manufactured Chemicals Ontology]
Highlighted Capabilities of Ontologies
• Objects (persons, organizations, facilities, materials, etc.) are linked to qualities, functions and roles
– these links can be time-stamped
– these attributes can be differentiated between designed and improvised
– these attributes can be measured using nominal (tall, average), ordinal (1st, best), interval (30° Celsius), and ratio (30mm, 10 gallons) measurement types
Highlighted Capabilities of I2WD Ontologies
• Events can be linked together with temporal or causal relationships
• Ambiguous times (… occurred during the Spring of 2010) and places (… happened in New York) can be integrated with more precise information (…occurred on April 18th, 2010, …happened in Central Park)
• Vocabulary for output of sentiment analysis
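Integrating an ambiguous time with a precise one amounts to checking that one interval lies within another. The sketch below assumes calendar bounds for "Spring of 2010" (March 20 to June 20), which the slides do not define, and `during` is a hypothetical helper.

```python
from datetime import date

# Assumed bounds for "the Spring of 2010"; not specified in the slides.
spring_2010 = (date(2010, 3, 20), date(2010, 6, 20))
# A precise report: the single day April 18th, 2010.
april_18 = (date(2010, 4, 18), date(2010, 4, 18))

def during(inner, outer):
    """True if the inner interval lies entirely within the outer interval."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

compatible = during(april_18, spring_2010)
```

Two reports whose intervals nest in this way are candidates for describing the same event; the analogous check for places would test whether Central Park is part of New York.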
Using States to Express Time Dependent Attributes
• In 2004, Alaa al-Tamimi became Mayor of Baghdad.
[Figure: instance graph for this statement. Classes: Year, Mayor Role, Person, Temporal Interval, City, Gain Of Role, Government. Instances: 2004, Alaa al-Tamimi, Alaa al-Tamimi’s Mayor Role, Baghdad, Temporal Interval of Gain of Alaa al-Tamimi’s Mayor Role, Gain of Alaa al-Tamimi’s Mayor Role, City Government Of Baghdad. Relations: Is instance of, Interval during, Occurs on, Delimited by, Participates in, Has role, Is organizational context of]
Designed and Measured Artifact Attributes
[Figure: graph of designed and measured attributes of a Samsung Galaxy S4. Nodes: Samsung Galaxy S4, Design Specifications of Samsung Galaxy S4, Data Transfer Speed Specification, Data Transfer Speed Specification Value, Data Transfer Speed, Data Transfer Speed Ratio Measurement, Data Transfer Speed Measurement Value, Lithium Ion Battery, Portion of Lithium Cobalt Oxide, Lithium, Oxygen, Cobalt, Thermal Stability, Thermal Stability Nominal Measurement, Thermal Stability Nominal Measurement Value. Values shown: 42.2 and 36.6 (unit Mbps), and the text value Poor. Relations: prescribed_by, prescribes, has_part, bearer_of, Inheres_in, Is ratio measurement of, Is nominal measurement of, is made of, Has decimal value, Has text value, Uses measurement unit]
Ontology Content Based on Standards
Partial List of Doctrine and Standards Used
• Basic Formal Ontology (BFO)
• DOD Dictionary of Military and Associated Terms (JP 1-02)
• Operations (FM 3-0)
• Multinational Operations (JP 3-16)
• Counterinsurgency (FM 3-24)
• International Standard Industrial Classification of all Economic Activities Rev.4 (ISIC4)
• Universal Joint Task List (CJCSM 3500.04C)
• Weapon Technical Intelligence (WTI) Improvised Explosive Device (IED) Lexicon
• JC3IEDM
• Information Artifact Ontology (IAO)
• Phenotype and Trait Ontology (PATO)
• Foundational Model of Anatomy (FMA)
• Regional Connection Calculus (RCC-8)
• Allen Time Calculus
• Wikipedia
Ontology Content Tested Against Data
Partial List of Data Sources Used
• Treasury Office of Foreign Assets Control – Specially Designated Nationals and Blocked Persons
• NCTC – Worldwide Incidents Tracking System
• UMD – Global Terrorism Database
• RAND – Database of Worldwide Terrorism Incidents
• LDM version .60 (TED)
• VMF PLI
• DCGS-A Event Reporting
• BFT Report (CCRi test data)
• Cidne Sigact (CCRi test data)
• Long War Journal
• Harmony Documents from CTC at West Point
• Threats Open Source Intelligence Gateway
Ontologies Use a Common Upper Ontology
• Produces common patterns within ontologies
  – Reuse of mappings from the sources
    • Easier to include new sources of data
  – Enables more uniformity between queries
    • Easier to transition to new domains of interest
[Figure: class diagram with Entity, Object, Organization, Physical Artifact, Quality, Quality of Organization, and Quality of Physical Artifact, connected by has_quality and bearer_of]
Ontologies are Modular
• Each Class is defined in one place
  – Facilitates locating a class within the target ontologies
  – Provides better recall in queries
    • Less likely to overlook relevant data
[Figure: class diagram with Entity, Object, Organization, and Physical Artifact, where Organization and Physical Artifact are each linked to Spatial Location by located_at]
Ontologies Enable both Early and Late Fusion
• Granular classes allow direct mappings from various perspectives on the same domain while preserving information that can be later used for entity resolution
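[Figure: three data sources mapped to one ontology pattern. Data Source 1: Car classified as Full Size, Mid Size, or Compact. Data Source 2: Car with VIN and Owner. Data Source 3: Car with Make, Model, and VIN. Ontology classes: Car, Length of Wheelbase, Manufacturer, Model, Compact, Mid Size, Full Size, Vehicle Identification Number. Relations: prescribes, manufactures, has quality, is nominally measured by, designates]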
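Late fusion through a preserved identifier can be sketched by merging records that share a VIN. Every record value below is a made-up placeholder, and `fuse_by_vin` is a hypothetical helper; the field names follow Data Sources 2 and 3 in the figure.

```python
# Placeholder records shaped like Data Source 2 (VIN, Owner)
# and Data Source 3 (Make, Model, VIN).
source_2 = [{"VIN": "VIN-001", "Owner": "Owner A"}]
source_3 = [
    {"VIN": "VIN-001", "Make": "Make X", "Model": "Model Y"},
    {"VIN": "VIN-002", "Make": "Make Z", "Model": "Model W"},
]

def fuse_by_vin(*sources):
    """Merge records from all sources that designate the same car by VIN."""
    cars = {}
    for source in sources:
        for record in source:
            cars.setdefault(record["VIN"], {}).update(record)
    return cars

cars = fuse_by_vin(source_2, source_3)
```

Because each source was mapped without discarding the VIN, entity resolution can be deferred until the sources are queried together.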
Organization of Ontologies
• A limited number of upper and mid-level ontologies are carefully managed
• Domain ontologies are developed by subject matter experts and tested by automated procedures
• Content is pushed from domain ontologies to mid-level ontologies as usage levels warrant
Future Re-Organization of Ontologies
[Figure: import diagram with a legend for Upper Level, Mid-Level, and Domain ontologies. Ontologies shown: BFO, Extended Relation Ontology, Time Ontology, Quality Ontology, Information Artifact Ontology, Geospatial Ontology, Event Ontology, Artifact Ontology, Agent Ontology, Human Anatomy, Ethnicities, Occupations, Nationalities, Military Units, Religions, Ideologies, Watercraft, Ground Vehicles, Aircraft, Clothing, Weapons, Communication Devices, Tools, Military Events, Interpersonal Events, Weather Events, Acts of Government, Disease Ontology, Legal System Events, Acts of Artifact Use, Criminal Acts, Mental Function Ontology, Anthropogenic Feature, Atmospheric Feature, Hydrographic Feature, Landform, Geopolitical Feature, Role Defined Area, Chemical Ontology, Plant Taxonomy, Animal Taxonomy, Geological Taxonomy]
Conformance Testing
• Inconsistency – A class is identified as being uninstantiable
• Semantic Smuggling – A class or property is reused with changed content
• Multiple Inheritance – A class or property is asserted to be a subclass of more than one superclass
• Taxonomy Overloading – A class or property is related to its parent by a relationship other than subclass
• Containment – A class or property is not a child of any class or property of the imported ontologies
• Conflation – A class or property includes information model assertions that are not true of the domain
• Logic of Terms – A class or property is a set-theoretic combination of other classes or properties
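One of these checks, Multiple Inheritance, is simple enough to sketch as an automated test over asserted subclass links. The taxonomy below is a toy example: "Amphibious Vehicle" is a hypothetical class added to trigger the check, and `multiple_inheritance` is not taken from any actual test suite.

```python
# Asserted subclass links: class -> list of direct superclasses.
parents = {
    "Organization": ["Entity"],
    "Object": ["Entity"],
    "Physical Artifact": ["Object"],
    "Amphibious Vehicle": ["Watercraft", "Ground Vehicle"],  # hypothetical
}

def multiple_inheritance(parents):
    """Return classes asserted under more than one distinct superclass."""
    return sorted(cls for cls, supers in parents.items() if len(set(supers)) > 1)

violations = multiple_inheritance(parents)
```

A check like this only inspects asserted links; superclasses inferred by a reasoner would need separate handling.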
Building a Taxonomy – Common Problems
• Use-Mention Errors
• Part of rather than subclass of
[Figure: taxonomy with Postal Address as parent of Country, Address Locality, Address Region, Postal Code, Post Office Box Number, and Street Address]
Building a Taxonomy – Common Problems
• Narrower in meaning than rather than subclass of
• Logic of Terms
[Figure: taxonomy rooted at Adhesives & Sealants, with nodes Adhesives Applicators & Dispensers, Adhesive Application Services, Glue Applicators, Epoxy Dispensers, and Sealants]
In Thomasnet.com (http://www.thomasnet.com/browse) classes are formed by conjunctions and the class hierarchy contains examples of subclasses based on search patterns
Building a Taxonomy – Common Problems
• Narrower in meaning than rather than subclass of
[Figure: taxonomy with Color as parent of Green, and Green as parent of Brown Green, Dark Green, Desaturated Green, Light Green, Saturated Green, and Yellow Green]
In the Phenotypic Quality Ontology (http://purl.obolibrary.org/obo/PATO_0000320) classes are subclasses by hue.
Building a Taxonomy – Common Problems
• Non-Disjoint Classes
[Figure: taxonomy rooted at Day, with nodes Day of Week, Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Holiday, and Anniversary]
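Non-disjoint sibling classes can be detected by looking for instances that fall under more than one of them. The memberships below are illustrative (September 1, 2014 was a Monday and the US Labor Day holiday), and `overlapping_pairs` is a hypothetical helper.

```python
from itertools import combinations

# Illustrative instance data: which days belong to which class.
members = {
    "Monday": {"2014-09-01", "2014-09-08"},
    "Tuesday": {"2014-09-02"},
    "Holiday": {"2014-09-01"},
}

def overlapping_pairs(members):
    """Return pairs of classes that share at least one instance."""
    return [
        (a, b)
        for a, b in combinations(sorted(members), 2)
        if members[a] & members[b]
    ]

overlaps = overlapping_pairs(members)
```

Monday and Tuesday never overlap, but Holiday overlaps with Monday, so placing Holiday alongside the days of the week yields siblings that are not disjoint.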