Presented By: Kiran Kancharlapalli DBMS - Topics 11 & 12

Post on 09-Feb-2016


Presented By: Kiran Kancharlapalli

DBMS - Topics 11 & 12


Semantic Interoperability


What is Semantic Interoperability?

• Ability of computer systems to transmit data with unambiguous, shared meaning

• Data must be made available between heterogeneous agents
• Metadata must also be made available, allowing a software agent to learn how to interpret the data
– Document Type Definition
– XML-Schema
– RDF Annotations

• A requirement for enabling machine-computable logic, inference, and knowledge discovery between information systems

• Results in the Semantic Web


How is it accomplished?

• By adding data about the data (metadata), linking each data element to a controlled, shared vocabulary

• The meaning of the data is transmitted with the data itself, in one self-describing "information package" that is independent of any information system

• Syntactic interoperability is a prerequisite for semantic interoperability
– Syntactic interoperability refers to the packaging and transmission mechanisms for data
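The idea of a self-describing "information package" can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the vocabulary URI, the field-to-term mapping, and the function names `make_package` and `interpret` are hypothetical and do not correspond to any particular standard.

```python
import json

# Hypothetical shared vocabulary namespace (an assumption, not from the slides).
VOCABULARY = "http://example.org/vocab#"

def make_package(record, term_map):
    """Bundle a record with metadata linking each field to a shared term."""
    context = {field: VOCABULARY + term for field, term in term_map.items()}
    return {"context": context, "data": record}

def interpret(package):
    """Re-key the data by shared-vocabulary term, using only the package itself."""
    context = package["context"]
    return {context[field]: value for field, value in package["data"].items()}

# The sender uses a local, cryptic field name; the package carries its meaning.
package = make_package({"POS": "mgmnt"}, {"POS": "jobPosition"})
wire_format = json.dumps(package)              # what is actually transmitted
received = interpret(json.loads(wire_format))  # the receiver needs no schema of its own
print(received)
```

The JSON serialization stands in for the syntactic layer; the `context` mapping is what carries the shared meaning.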

What is the Semantic Web?

• Will be able to provide justified answers to natural language questions
– Current search engines provide lists of resources that are supposed to contain the answer
• Knowledge rather than plain data would be retrieved, i.e. data which is relevant to the user’s task
• Social factors such as privacy and trust would also be taken into account

Benefits

• Search can often be frustrating because of the limitations of keyword-based matching techniques.
– Users frequently experience one of two problems: they either get back no results or get too many irrelevant results.

• The problem is that words can be synonymous (that is, two words have the same meaning) or polysemous (a single word has multiple meanings).

• However, if the languages used to describe web pages were semantically interoperable, then the user could specify a query in the terminology that was most convenient, and be assured that the correct results were returned, regardless of how the data was expressed in the sources.
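The contrast between keyword matching and matching on shared concepts can be sketched as follows. The vocabulary, the concept identifiers, and the three documents are invented for illustration; a real system would take them from an ontology and annotated sources.

```python
# Hypothetical controlled vocabulary: surface word -> shared concept identifier.
VOCABULARY = {
    "car": "concept:Automobile",
    "automobile": "concept:Automobile",
}

# Hypothetical documents, each annotated with the concepts it is about.
DOCUMENTS = [
    {"id": 1, "text": "Used automobile for sale", "concepts": {"concept:Automobile"}},
    {"id": 2, "text": "Jaguar spotted in the rainforest", "concepts": {"concept:BigCat"}},
    {"id": 3, "text": "Jaguar car dealership opens", "concepts": {"concept:Automobile"}},
]

def keyword_search(word):
    """Return ids of documents whose text contains the exact word."""
    word = word.lower()
    return [d["id"] for d in DOCUMENTS if word in d["text"].lower().split()]

def concept_search(word):
    """Map the word to its concept, then match on document annotations."""
    concept = VOCABULARY[word.lower()]
    return [d["id"] for d in DOCUMENTS if concept in d["concepts"]]

print(keyword_search("car"))     # misses the synonym "automobile"
print(keyword_search("jaguar"))  # mixes the animal with the car brand (polysemy)
print(concept_search("car"))     # finds both automobile documents
```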


Ontologies

What are Ontologies?

• Content theories about the kinds of objects that are possible in a specified domain

• A representation vocabulary, specialized to some domain or subject matter

• Provide potential terms for describing knowledge about the domain

• Translating the terms in an ontology from, say, English to French does not change the ontology conceptually

• Designed for reuse across multiple applications and implementations

Motivation

• select EMPDAT from PERSTAB where POS = 'mgmnt'
– What does it mean?
– PERSTAB is a table which lists employee data

• What’s an employee? How is an employee different from a contractor? What if I want data on both?

• Even if this information is available in English, a human has to read it

Motivation (contd.)

• "Parenthood is a more general relationship than motherhood."

• "Mary is the mother of Bill."

• "Who are Bill's parents?"
• "Mary is the parent of Bill."
– That fact is not stated anywhere, but can be derived by a DAML application.

• More formally, the statements

(motherOf subProperty parentOf)
(Mary motherOf Bill)

• when stated in DAML, allow you to conclude

(Mary parentOf Bill)

• Java code or a stored procedure could do this sort of inference for facts in XML or SQL
• But the DAML spec itself says the conclusion is true
• In contrast, different Java code could reach a different conclusion
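The sub-property inference above can be sketched in Python. This is hand-written illustration code, not a DAML reasoner, and the function name `infer_subproperties` and the triple layout are assumptions. It also shows the slide's point: here the rule lives in application code, whereas in DAML it is part of the language semantics.

```python
def infer_subproperties(facts, sub_property_of):
    """Close a set of (subject, property, object) facts under sub-property rules.

    sub_property_of maps a property to its more general parent property.
    """
    inferred = set(facts)
    changed = True
    while changed:                      # repeat until no new fact is derived
        changed = False
        for subject, prop, obj in list(inferred):
            parent = sub_property_of.get(prop)
            if parent is not None and (subject, parent, obj) not in inferred:
                inferred.add((subject, parent, obj))
                changed = True
    return inferred

facts = {("Mary", "motherOf", "Bill")}
rules = {"motherOf": "parentOf"}        # (motherOf subProperty parentOf)

closure = infer_subproperties(facts, rules)
parents_of_bill = {s for s, p, o in closure if p == "parentOf" and o == "Bill"}
print(parents_of_bill)
```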

Everything is not a nail

• Ontology is not always the right tool for the job

• Face recognition, vehicle control systems, etc. are not the right applications for ontology

Many Ways to Use Ontology

• As an information engineering tool
– Create a database schema
– Map the schema to an upper ontology
– Use the ontology as a set of reminders for additional information that should be included
• As more formal comments
– Define an ontology that is used to create a DB or OO system
– Use a theorem prover at design time to check for inconsistencies
• For taxonomic reasoning
– Do limited run-time inference in Prolog, a description logic, or even Java
• For first order logical inference
– Full-blown use of all the axioms at run time


Upper Ontology

• An attempt to capture the most general and reusable terms and definitions

Motivation to capture Upper Ontology

• Ontologies may have different names for the same things
– type – a relation between a class and an instance
– instance – a relation between a class and an instance
– isa – a relation between a class and an instance
– …

• Ontologies may have the same name for different things, and no corresponding terms
– before – a relation between two time points
– before – a relation between two time intervals

• Either use the same upper ontology, or at least map to a common upper ontology
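Mapping to a common upper ontology can be pictured as a lookup from (ontology, local term) to a shared term. The ontology names and the `upper:` identifiers below are hypothetical placeholders, not terms from any real upper ontology.

```python
# Hypothetical mappings from each ontology's local term to a shared upper-ontology term.
UPPER_MAPPING = {
    ("ontologyA", "type"): "upper:instanceOf",
    ("ontologyB", "instance"): "upper:instanceOf",
    ("ontologyC", "isa"): "upper:instanceOf",
    ("ontologyA", "before"): "upper:beforeTimePoint",
    ("ontologyB", "before"): "upper:beforeTimeInterval",
}

def to_upper(ontology, term):
    """Translate a local term into the shared upper-ontology term."""
    return UPPER_MAPPING[(ontology, term)]

# Different names, same meaning: they resolve to one shared term.
print(to_upper("ontologyA", "type") == to_upper("ontologyC", "isa"))
# Same name, different meaning: they resolve to distinct shared terms.
print(to_upper("ontologyA", "before") == to_upper("ontologyB", "before"))
```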

Some Formal Upper Ontologies

• DOLCE
• Cyc
• SUMO

Simple Methodology

• Extract nouns and verbs from a source text
• Find classes in SUMO for the nouns and verbs
• Record a mapping as being either equal, subsuming or instance.
– type a single word that relates to the UBL term in the "SUMO term" or "English Word" text areas in the SUMO browser
• Create a subclass of the SUMO class if it's a subsuming mapping
• Add properties to the subclass
– reusing SUMO properties
– extending SUMO properties by creating a subrelation of an existing property
• Add English definition to the class
– define constraints that express how the subclass is more specific than the superclass

• Express the classes and properties in KIF and begin creating axioms, based on the English definitions created previously

High Level Distinctions

• The first fundamental distinction is that between ‘Physical’ (things which have a position in space/time) and ‘Abstract’ (things which don’t)

[Diagram: Physical | Abstract]

High Level Distinctions

• Partition of ‘Physical’ into ‘Objects’ and ‘Processes’

[Diagram: Physical branches into Object and Process]

DBpedia: A Nucleus for a Web of Open Data

• DBpedia.org is an effort to:
– extract structured information from Wikipedia
– make this information available on the Web under an open license
– interlink the DBpedia dataset with other datasets on the Web

• Title
• Abstract
• Infoboxes
• Geo-coordinates
• Categories
• Images
• Links

• Other languages
• Other wiki pages
• To the web
• Redirects
• Disambiguations

Extracting Structured Information from Wikipedia

Wikipedia consists of
– 6.9 million articles
– in 251 languages
– monthly growth-rate: 4%

Wikipedia articles contain structured information
– infoboxes which use a template mechanism
– images depicting the article’s topic
– categorization of the article
– links to external webpages
– intra-wiki links to other articles
– inter-language links to articles about the same topic in different languages

[Figure: DBpedia architecture. Labels recovered from the diagram: Wikipedia Dumps (article texts, DB tables); Extraction (articles, infobox, categories); DBpedia datasets, loaded into Virtuoso and MySQL and published via a SPARQL Endpoint and Linked Data; clients: Traditional Web Browser, Web 2.0 Mashups, Semantic Web Browsers, SNORQL Browser, Query Builder.]


Extracting Infobox Data (RDF Representation)

DBpedia Basics

• Structured information can be extracted from Wikipedia and can serve as a basis for enabling sophisticated queries against Wikipedia content.

• The DBpedia.org project uses the Resource Description Framework (RDF) as a flexible data model for representing extracted information and for publishing it on the Web. It uses the SPARQL query language to query this data.

• The Developers Guide to Semantic Web Toolkits lists development toolkits, by programming language, that can be used to process DBpedia data.

The DBpedia Dataset

• 1,600,000 concepts

• including
– 58,000 persons
– 70,000 places
– 35,000 music albums
– 12,000 films

• described by 91 million triples
– using 8,141 different properties
– 557,000 links to pictures
– 1,300,000 links to external web pages
– 207,000 Wikipedia categories
– 75,000 YAGO categories


Accessing the DBpedia Dataset over the Web

1. SPARQL Endpoint

2. Linked Data Interface

3. DB Dumps for Download

SPARQL

• SPARQL is a query language for RDF.

• RDF is a directed, labeled graph data format for representing information on the Web.

• The W3C SPARQL specification defines the syntax and semantics of the query language.

• SPARQL can be used to express queries across diverse data sources, whether the data is stored natively as RDF or viewed as RDF via middleware.

The DBpedia SPARQL Endpoint

• http://dbpedia.org/sparql

• hosted on an OpenLink Virtuoso server

• can answer SPARQL queries like
– Give me all sitcoms that are set in NYC?
– All tennis players from Moscow?
– All films by Quentin Tarantino?
– All German musicians that were born in Berlin in the 19th century?

Example

To find everything Bart wrote on the blackboard in season 12 of The Simpsons:
• The Simpsons episode Wikipedia pages are the identified "things" that we would consider as the subjects of our RDF triples.
• The bottom of the Wikipedia page for the "Tennis the Menace" episode tells us that it is a member of the Wikipedia category "The Simpsons episodes, season 12".
• The episode's DBpedia page tells us that p:blackboard is the property name for the Wikipedia infobox "Chalkboard" field.

# dbpedia2: is the DBpedia property namespace
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?episode ?chalkboard_gag
WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}
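A query like this one can be sent to the endpoint as an HTTP GET request with the query text in a `query` parameter, as the SPARQL protocol describes. The sketch below only builds the request with the Python standard library and does not send it. The helper name `build_request` is an assumption, and the property names are taken from the slide as-is and may not match what the live endpoint uses today.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "http://dbpedia.org/sparql"   # endpoint URL as given on the slide

# Query text from the slide; property names may differ on the current dataset.
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dbpedia2: <http://dbpedia.org/property/>
SELECT ?episode ?chalkboard_gag
WHERE {
  ?episode skos:subject <http://dbpedia.org/resource/Category:The_Simpsons_episodes%2C_season_12> .
  ?episode dbpedia2:blackboard ?chalkboard_gag
}
"""

def build_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

request = build_request(QUERY)
print(request.full_url[:60])
# To execute it, pass `request` to urllib.request.urlopen and parse the JSON body.
```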


Possible Improvements

• Better data cleansing required.

• Improvement in the classification.

• Interlink DBpedia with more datasets.

• Improvement in the user interfaces.

• Performance

• Scalability

• More Expressiveness

Questions for Discussion

• DBpedia gains new information when it extracts data from the latest Wikipedia dump, whereas Freebase, in addition to Wikipedia extractions, gains new information through its userbase of editors.
– Which one is the better approach?
• Can Freebase or DBpedia be a substitute for Wikipedia?
– Freebase: not good, in that we would have two similar things (Wikipedia and Freebase)
– DBpedia: not good, in that it extracts data from a dump
• How can we interlink Freebase and DBpedia?
• What could be killer applications using DBpedia?
– If there are some, okay
– If there are none, do we really need a large general structured knowledge base?

Uncertainty propagation

• Every physical quantity has:
– A value or size
– Uncertainty (or ‘Error’)
– Units

• Without these three things, no physical quantity is complete.

• When quoting your measured result, follow these simple rules (e.g. A = 1.71 ± 0.01 m):
– Always quote the main value to the same number of decimal places as the uncertainty
– Always include units (but if the quantity is dimensionless, say so)
– Never quote the uncertainty to more than 1 or 2 significant figures (this would make no sense)
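The quoting rules can be sketched as a small formatting helper. The function name `format_measurement` and its `sig_figs` parameter are assumptions made for illustration; the rounding uses Python's built-in `round`, which is adequate for a sketch but not a metrology-grade implementation.

```python
import math

def format_measurement(value, uncertainty, unit, sig_figs=1):
    """Quote value ± uncertainty with matching decimal places and units."""
    if uncertainty <= 0:
        raise ValueError("uncertainty must be positive")
    # Position of the leading significant digit of the uncertainty.
    exponent = math.floor(math.log10(uncertainty))
    decimals = max(0, sig_figs - 1 - exponent)
    rounded_value = round(value, decimals)
    rounded_uncertainty = round(uncertainty, decimals)
    return f"{rounded_value:.{decimals}f} ± {rounded_uncertainty:.{decimals}f} {unit}"

print(format_measurement(1.7134, 0.0123, "m"))
```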

Terminology: ‘Uncertainty’ and ‘Error’

• The terms Uncertainty and Error are used interchangeably to describe a measured range of possible true values.

• The meaning of the term Error is:
– NOT the DIFFERENCE between your experimental result and that predicted by theory, or an accepted standard result!
– NOT a MISTAKE in the experimental procedure or analysis!

• Hence, the term Uncertainty is less ambiguous. Nevertheless, we still use terms like ‘propagation of errors’, ‘error bars’, ‘standard error’, etc.

• The term “human error” is imprecise - avoid using this as an explanation of the source of error.

Error Propagation using Calculus
Functions of one variable

If the uncertainty in measured x is Δx, what is the uncertainty in a derived quantity z(x)?

Error propagation is just calculus – you do this formally in the “Data Handling” course

The basic principle is that, if (Δx)/x is small, then to first order:

$$\Delta z = \frac{dz}{dx}\,\Delta x$$

e.g., if $z = x^n$, then:

$$\Delta z = \frac{dz}{dx}\,\Delta x = n x^{n-1}\,\Delta x = n x^{n}\,\frac{\Delta x}{x}$$

Hence, for this particular function, the percent (or fractional) error in z is:

$$\frac{\Delta z}{z} = n\,\frac{\Delta x}{x}$$

or, put simply, just n times the percent error in x
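The first-order rule can be checked numerically. The sketch below estimates dz/dx with a central difference rather than symbolic calculus; the helper name `propagate_1d` and the step size `h` are assumptions chosen for illustration.

```python
def propagate_1d(f, x, dx, h=1e-6):
    """First-order uncertainty in z = f(x): |dz/dx| * dx, using a central difference."""
    dz_dx = (f(x + h) - f(x - h)) / (2 * h)
    return abs(dz_dx) * dx

def cube(x):
    return x ** 3                # z = x^n with n = 3

x, dx = 2.0, 0.02                # a 1% uncertainty in x
dz = propagate_1d(cube, x, dx)
print(dz / cube(x))              # close to 3 * (dx / x), i.e. about 3%
```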

Error Propagation using Calculus
Functions of more than one variable

Suppose the uncertainties in two measured quantities x and y are Δx and Δy. What is the uncertainty in some derived quantity z(x,y)?

For such functions of 2 variables we use partial differentiation:

$$\Delta z = \frac{\partial z}{\partial x}\,\Delta x + \frac{\partial z}{\partial y}\,\Delta y$$

But combining errors ALWAYS INCREASES the total error, so make sure the terms add with the same sign:

$$\Delta z = \left|\frac{\partial z}{\partial x}\right|\Delta x + \left|\frac{\partial z}{\partial y}\right|\Delta y$$

It is better to add in quadrature, i.e. “the root of the sum of the squares”:

$$\Delta z = \sqrt{\left(\frac{\partial z}{\partial x}\right)^{2}(\Delta x)^{2} + \left(\frac{\partial z}{\partial y}\right)^{2}(\Delta y)^{2}}$$

We can almost always handle error propagation in this way by calculus
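Adding in quadrature generalizes to any number of variables. The following sketch estimates each partial derivative with a central difference; the helper name `propagate_quadrature` and the step size `h` are illustrative assumptions.

```python
import math

def propagate_quadrature(f, values, uncertainties, h=1e-6):
    """Root-sum-of-squares uncertainty in z = f(*values)."""
    total = 0.0
    for i, (value, delta) in enumerate(zip(values, uncertainties)):
        up = list(values)
        down = list(values)
        up[i] = value + h
        down[i] = value - h
        partial = (f(*up) - f(*down)) / (2 * h)   # estimate of dz/d(variable i)
        total += (partial * delta) ** 2
    return math.sqrt(total)

def area(x, y):
    return x * y

# x = 3.0 ± 0.1 and y = 4.0 ± 0.2
dz = propagate_quadrature(area, [3.0, 4.0], [0.1, 0.2])
print(dz)   # sqrt((4 * 0.1)**2 + (3 * 0.2)**2)
```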

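Simplified Error Propagation
A short-cut avoiding calculus

Instead of differentiating (∂z/∂x, ∂z/∂y, etc.), a simpler approach is also acceptable:

1. In the derived quantity z, replace x by x + Δx, say

2. Evaluate Δz in the approximation that Δx is small:

$$\Delta z = z(x + \Delta x) - z(x)$$

Ex. 1 : z = x + a , where a = constant

$$z + \Delta z = (x + \Delta x) + a = z + \Delta x \quad\Rightarrow\quad \Delta z = \Delta x$$

Ex. 2 : z = bx , where b = constant

$$z + \Delta z = b(x + \Delta x) = bx + b\,\Delta x = z + b\,\Delta x \quad\Rightarrow\quad \Delta z = b\,\Delta x, \qquad \frac{\Delta z}{z} = \frac{\Delta x}{x}$$

Ex. 3 : $z = bx^2$ , where b = constant

$$z + \Delta z = b(x + \Delta x)^{2} = b x^{2}\left(1 + \frac{2\,\Delta x}{x} + \frac{(\Delta x)^{2}}{x^{2}}\right) \approx z\left(1 + \frac{2\,\Delta x}{x}\right) \quad\Rightarrow\quad \frac{\Delta z}{z} = \frac{2\,\Delta x}{x}$$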
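The short-cut amounts to evaluating the function twice. A minimal sketch, with the hypothetical helper name `shortcut_error`, compared against the calculus result for z = bx² with b = 5:

```python
def shortcut_error(f, x, dx):
    """Uncertainty in z = f(x) by direct substitution: |f(x + dx) - f(x)|."""
    return abs(f(x + dx) - f(x))

def z(x):
    return 5 * x ** 2            # z = b * x**2 with b = 5

x, dx = 2.0, 0.01
by_shortcut = shortcut_error(z, x, dx)
by_calculus = 2 * 5 * x * dx     # dz = 2 * b * x * dx
print(by_shortcut, by_calculus)  # they differ only by the dropped b * dx**2 term
```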

Synthetic Data

• Any production data applicable to a given situation that are not obtained by direct measurement

• Used in a variety of fields as a filter for information that would otherwise compromise the confidentiality of particular aspects of the data.

• These aspects are often personal information (e.g. name, home address, IP address, telephone number, social security number, credit card number)

Importance

• Obtaining actual or real data sets could be difficult, and sometimes impossible, due to impediments such as
– Privacy issues
– Image control
– Logistics issues
– Time
– Cost

• Protecting information confidentiality
– Data cannot be traced back to an individual

• Certain conditions may not be found in the original data

Importance (contd.)

• Used to train the fraud detection system itself, thus creating the necessary adaptation of the system to a specific environment
– By creating realistic behavior profiles of users and attackers
– Ex: Intrusion Detection Systems are trained using synthetic data

• Allows a baseline to be set
– Ex: Researchers doing clinical trials generate synthetic data to aid in creating a baseline for future studies and testing

• More or less realism can be exhibited, according to the selected properties of the original data sets

Synthetic Data Generation

• Mostly scenario based
– Evaluating Information Analytics Software
– Matching Data Mining Patterns
– Evaluating the quality of extraction algorithms

• Specific algorithms and generators for a scenario or a set of (similar) scenarios

• Patterns from data mining techniques could be used to generate synthetic data sets

• Researchers frequently need to explore the effects of certain data characteristics on their models.
– To help construct datasets exhibiting specific properties, such as autocorrelation or degree disparity, synthetic data could be generated having one of several types of graph structure:
  • random graphs
  • independent and identically distributed (i.i.d.) connected components
  • lattice graphs having a ring structure
  • lattice graphs having a grid structure
  • forest fire graphs
  • cluster graphs with nodes arranged in separate clusters (cliques)

• Synthetic data is generated with simple forms of realism by:
– Domain sampling within a field
– Preserving cardinality relationships

• In all cases, the data generation process follows the same steps:
– Generate the empty graph structure.
– Generate attribute values based on user-supplied prior probabilities.
  • Because the attribute values of one object may depend on the attribute values of related objects, the attribute generation process assigns values collectively.
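The two-step process can be sketched for one of the listed structures, a ring lattice. Everything here is an assumption made for illustration: the function names, the `correlation` parameter, and in particular the sequential copy-from-a-neighbour step, which is only a simple stand-in for true collective assignment.

```python
import random

def ring_lattice(n, k=1):
    """Step 1: empty graph structure. Each node links to k neighbours on each side."""
    return {
        node: sorted({(node + d) % n for d in range(-k, k + 1) if d != 0})
        for node in range(n)
    }

def assign_attributes(graph, prior, correlation, seed=0):
    """Step 2: attribute values drawn from prior probabilities.

    With probability `correlation` a node copies the value of an already
    labelled neighbour, which induces autocorrelation along the ring.
    """
    rng = random.Random(seed)
    labels = list(prior)
    weights = [prior[label] for label in labels]
    values = {}
    for node in graph:
        labelled = [values[m] for m in graph[node] if m in values]
        if labelled and rng.random() < correlation:
            values[node] = rng.choice(labelled)
        else:
            values[node] = rng.choices(labels, weights=weights)[0]
    return values

graph = ring_lattice(6)
values = assign_attributes(graph, {"fraud": 0.1, "normal": 0.9}, correlation=0.5)
print(graph[0], values)
```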

Data Quality

• Some Definitions
– The state of completeness, validity, consistency, timeliness and accuracy that makes data appropriate for a specific use.
– The totality of features and characteristics of data that bear on their ability to satisfy a given purpose; the sum of the degrees of excellence for factors related to data.
– Complete, standards based, consistent, accurate and time stamped.

Data Quality

• Data are of high quality if
– they are fit for their intended uses in operations, decision making and planning
– they correctly represent the real-world construct to which they refer

• As data volume increases
– the question of internal consistency within data arises, regardless of fitness for use for any external purpose
– e.g. a person's age and birth date may conflict within different parts of a database
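An internal-consistency check of this kind can be sketched as follows. The record layout (`id`, `age`, `birth_date`) and the function names are hypothetical; a real database would supply its own schema.

```python
from datetime import date

def age_on(birth_date, today):
    """Whole years elapsed between birth_date and today."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1               # birthday not yet reached this year
    return years

def find_age_conflicts(records, today):
    """Return ids of records whose stored age disagrees with the birth date."""
    return [r["id"] for r in records if r["age"] != age_on(r["birth_date"], today)]

records = [
    {"id": 1, "age": 30, "birth_date": date(1990, 5, 1)},
    {"id": 2, "age": 25, "birth_date": date(1980, 1, 1)},   # stored age is stale
]
print(find_age_conflicts(records, today=date(2020, 6, 1)))
```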

Data Attributes

• Nearly 200 data quality attributes have been identified, and there is little agreement on their definitions or measures

• Most common are
– Accuracy
– Correctness
– Currency
– Completeness
– Relevance

Incorrect Data

• Includes invalid and outdated information
• Can originate from different data sources, as a result of data entry or data migration and conversion projects

• Total cost to the US economy due to data quality problems is over US$600 billion per annum

Frameworks for understanding data quality

• A systems-theoretical approach influenced by American pragmatism expands the definition of data quality to include information quality, and emphasizes the inclusiveness of the fundamental dimensions of accuracy and precision

• One framework seeks to integrate
– the product perspective (conformance to specifications) and
– the service perspective (meeting consumers' expectations)


• One highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously

• Another framework evaluates the quality of the form, meaning and use of the data


Data Quality Assurance

• Service providers clean the data on a contract basis

• Consultants advise on fixing processes or systems to avoid data quality problems in the first place

• Tools for analyzing and repairing poor quality data

• Data profiling - initially assessing the data to understand its quality challenges

• Data standardization - a business rules engine ensures that data conforms to quality rules

• Geocoding - for name and address data. Corrects data to US and worldwide postal standards

• Matching or Linking - a way to compare data so that similar, but slightly different, records can be aligned.
– Matching may use "fuzzy logic" to find duplicates in the data. It often recognizes that 'Bob' and 'Robert' may be the same individual.
– It might be able to find links between husband and wife at the same address.
– It often can build a 'best of breed' record, taking the best components from multiple data sources and building a single super-record.

• Monitoring - keeping track of data quality over time and reporting variations in the quality of data. Software can also auto-correct the variations based on pre-defined business rules.

• Batch and Real time - Once the data is initially cleansed (batch), companies build the processes into enterprise applications to keep it clean.
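The matching step can be sketched with a nickname table plus a string-similarity score from Python's standard `difflib` module. The nickname table, the record fields, and the 0.75 threshold are illustrative assumptions; commercial matching tools use far richer rules.

```python
from difflib import SequenceMatcher

# Hypothetical nickname table; real tools ship much larger ones.
NICKNAMES = {"bob": "robert", "bill": "william", "liz": "elizabeth"}

def canonical_name(name):
    """Lower-case the name and expand a known nickname."""
    name = name.strip().lower()
    return NICKNAMES.get(name, name)

def is_probable_match(a, b, threshold=0.75):
    """Treat two records as the same person if the canonical first names agree
    and the addresses are similar enough."""
    if canonical_name(a["first_name"]) != canonical_name(b["first_name"]):
        return False
    similarity = SequenceMatcher(None, a["address"].lower(), b["address"].lower()).ratio()
    return similarity >= threshold

record_a = {"first_name": "Bob", "address": "12 Main Street"}
record_b = {"first_name": "Robert", "address": "12 Main St."}
record_c = {"first_name": "Robert", "address": "99 Oak Avenue"}

print(is_probable_match(record_a, record_b))   # nickname and address both line up
print(is_probable_match(record_a, record_c))   # same name, clearly different address
```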


?