
Steven Seida

Semantic Web Application Architecture: Introduction and Issues

Semantic Application Components

[Figure: "Information Extraction in Context," from "Information Extraction" by Andrew McCallum, ACM Queue, November 2005, pp. 49-57. A spider gathers textual documents, web pages, etc.; a filter feeds information extraction (segment, classify, associate, normalize); deduplication loads the results into a database; data mining (discover patterns, select models, fit parameters, inference, report results) then produces actionable knowledge (prediction, outlier detection, decision support). Inputs range from unstructured through semi-structured to structured, and the results feed the knowledge store.]

Information/Knowledge Extraction

Convert information (and knowledge) into computer-understandable knowledge:
- Unstructured Extraction
- Semi-Structured Extraction
- Structured Extraction

Information/Knowledge Extraction

Unstructured Extraction: Extract Entities and Relationships

Populate an ontology
- Also referred to as populating the slots of a frame (old A.I. speak)
- Populating can be a challenge
- Semantic technology separates the description of structure from the data

Knowledge Extraction Technologies

- Natural Language Processing
- Multi-Lingual Handling
- Multi-Modal (text, audio, video)

Foundational Technologies
- Rule-Based Systems
- Machine Learning

On Extraction Accuracy
- Be wary of junk in your knowledge store – it is very hard to clean out!
- It is better to extract less with higher accuracy, especially since many sources tend to yield redundant information.

http://www.opencalais.com

Example

New York Times says:

Saddam found in bunker

<Entity> <Relationship> <Entity>

Associated Press says:

Saddam Hussein captured at secret underground room

<Entity> <Relationship> <Entity>

Lots of redundancy, and likely significantly improved accuracy over any single source (based on the strengths of the extractors and the ability to map to relationships your system is ready to handle).
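As a minimal sketch (not from the slides) of where this leads: once both extractions are mapped to one relationship the system handles, they land in an RDF store as the same normalized triple. The rdflib package is assumed, and the ex: namespace and capturedAt property are invented for illustration.

```python
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/")
g = Graph()

# New York Times: "Saddam found in bunker"
g.add((EX.SaddamHussein, EX.capturedAt, EX.Bunker))
# Associated Press: "Saddam Hussein captured at secret underground room"
# maps to the same normalized triple after entity resolution, so the
# redundant extraction reinforces rather than duplicates.
g.add((EX.SaddamHussein, EX.capturedAt, EX.Bunker))

print(len(g))  # 1 -- a graph is a set of triples
```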

Structured Knowledge Extraction

Given a database
- Typically rows of well-defined relationships

Extract useful entities
- Often, a row (or joined row) is a useful entity
- Given a schema, the knowledge is pretty explicit

If structure is good, leave it there (conversion to RDF or similar probably doesn't make sense)

D2RQ (http://www4.wiwiss.fu-berlin.de/bizer/d2rq/) converts database data into RDF form. It can convert entire databases or provide a SPARQL endpoint that just converts on-the-fly.
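As a hedged sketch of the on-the-fly option: a Python client querying a D2RQ SPARQL endpoint. The SPARQLWrapper package is assumed, and the endpoint URL and vocab: property names are hypothetical (D2RQ generates its vocabulary from the database schema).

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# D2R Server serves SPARQL on port 2020 by default; this URL is illustrative.
endpoint = SPARQLWrapper("http://localhost:2020/sparql")
endpoint.setQuery("""
    PREFIX vocab: <http://localhost:2020/vocab/resource/>
    SELECT ?name ?dept WHERE {
      ?employee vocab:employee_name ?name ;
                vocab:employee_department ?dept .
    } LIMIT 10
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["name"]["value"], row["dept"]["value"])
```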

Structured Knowledge – Semantic Wiki

Normal Wiki Semantics are a Collection of Knowledge!
- Pages reference pages
- Pages might have attributes

Semantic Wiki Enables the Knowledge
- Pages have specific types: Class (rdf:type)
- References between pages have a relationship property: Object property
- Pages have semantic attributes (metadata): Datatype property

A full-featured semantic wiki is Semantic MediaWiki – an extension to the MediaWiki infrastructure that powers Wikipedia. Extensions for input forms are required to make the system usable by normal humans. One example is: http://foodfinds.referata.com/wiki/Home. A list of examples in use is at http://semantic-mediawiki.org/wiki/Sites_using_Semantic_MediaWiki.
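To make the three property kinds concrete, here is a minimal sketch of the triples a semantic wiki page might yield (rdflib assumed; the pages, property names, and population value are invented):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

WIKI = Namespace("http://example.org/wiki/")
g = Graph()

# Page type: Class (rdf:type)
g.add((WIKI.Austin, RDF.type, WIKI.City))
# Reference between pages: object property
g.add((WIKI.Austin, WIKI.isCapitalOf, WIKI.Texas))
# Semantic attribute (metadata): datatype property
g.add((WIKI.Austin, WIKI.population, Literal(790390, datatype=XSD.integer)))

print(g.serialize(format="turtle"))
```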

Semi-Structured Knowledge Extraction

Mix of structured and unstructured
- Like a form with check boxes for some fields and open text areas for others
- Common to human intelligence reports like police or border patrol reports

Leverage structured tools and provide the known or simple knowledge to the unstructured extractor
- Many extractors won't use this additional contextual information (still a research area)

GRDDL (http://www.w3.org/2001/sw/grddl-wg/) is a W3C mechanism for creating/extracting RDF from normal XML documents. Semi-structured data can typically be converted into XML; then use unstructured tools on the unstructured fields.
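A toy sketch of that divide-and-conquer split (the report format, field names, and the extract_entities stub are all hypothetical; a real system would plug an NLP extractor into the stub):

```python
import xml.etree.ElementTree as ET

# Semi-structured report converted to XML: some fields are structured,
# one is free text.
report = ET.fromstring("""
<report>
  <officer>J. Smith</officer>
  <vehicle_involved>yes</vehicle_involved>
  <narrative>Subject fled on foot toward the river.</narrative>
</report>
""")

def extract_entities(text):
    # Placeholder for an unstructured extractor (NLP, rules, ML).
    return []

# Structured fields: the schema makes the knowledge explicit -- take it directly.
facts = {el.tag: el.text for el in report if el.tag != "narrative"}
# Unstructured field: hand it to the unstructured extraction tools.
facts["entities"] = extract_entities(report.findtext("narrative"))
print(facts)
```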

More Data vs More Entities

Be wary of more data – knowledge extraction must result in entities, not just data!
- Data grows exponentially (forever)
- Entities grow nearly asymptotically (and are therefore manageable)
- For example: Facebook people, cars, cell phones…

[Chart: typical growth of data through time versus expected growth of entities through time.]

Disambiguation (Deduplication)

Ambiguous: adj., capable of being understood in two or more possible senses or ways

Disambiguate: v., to establish a single semantic interpretation for

Also known as entity resolution and deduplication
- Given sets of extracted knowledge, determine when pieces refer to the same object or entity

Example: Hussein is the same entity as Saddam Hussein in the previous news example.

Jaccard Measures for Disambiguation

Jaccard similarity measure
- Similarity between sample sets: the size of the intersection divided by the size of the union
- J(A,B) = |A ∩ B| / |A ∪ B|
- Jaccard distance = the dissimilarity: 1 − J(A,B)

Is John(#1) the same as John(#2)?
- We have a set of assertions about John(#1) and John(#2), derived from source sets A and B.
- How many features do we get if we presume #1 and #2 are the same, versus how many if we assume they are different?
- If the sizes are similar enough, then the references are to the same John.

Jaccard Measures for Disambiguation

Binary data example (slight formulation difference – see Wikipedia)

Compare the features of two entities (where each feature is true/false):

         Sphere Shape   Sweet   Sour   Crunchy
Apple    Yes            Yes     Yes    Yes
Banana   No             Yes     No     No

- Both true: M11 = 1
- First true, second false: M10 = 3
- First false, second true: M01 = 0
- Both false: M00 = 0

Jaccard similarity = M11 / (M10 + M01 + M11) = 1/4
Jaccard distance = (M01 + M10) / (M10 + M01 + M11) = 3/4

Brief, handy Jaccard tutorial at: http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
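A small sketch reproducing the apple/banana numbers above (binary-feature formulation of Jaccard, as on the slide):

```python
def jaccard_binary(a, b):
    m11 = sum(1 for x, y in zip(a, b) if x and y)          # both true
    m10 = sum(1 for x, y in zip(a, b) if x and not y)      # first true only
    m01 = sum(1 for x, y in zip(a, b) if not x and y)      # second true only
    similarity = m11 / (m11 + m10 + m01)
    return similarity, 1 - similarity  # (similarity, distance)

# Features: sphere shape, sweet, sour, crunchy
apple = [True, True, True, True]
banana = [False, True, False, False]

print(jaccard_binary(apple, banana))  # (0.25, 0.75)
```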

Random Forest for Name Disambiguation

Automatically create a large number of decision trees, each built from a different bootstrap sample of the training data
- Select the number of trees
- Select the number of features to consider at each node split

Decision tree conclusion
- Run the data through all trees for a match yes/no – the majority count wins

Tested on names from the Medline bibliography
- 16 million biomedical research references

Raw features
- Primary author name – first, last, middle
- Co-author last names; affiliation of first author
- Title, journal name, publication year, language
- MeSH/keyword terms

Decision tree features (21 total)
- Comparisons of name, etc., between each pair of articles
- Uniqueness of names, titles, etc.
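A hedged scikit-learn sketch of the approach (not the Medline study's actual code; the random dummy data stands in for the 21 pairwise comparison features and same-author labels):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 21))    # 21 comparison features per article pairing
y = rng.integers(0, 2, 1000)  # 1 = same author, 0 = different (dummy labels)

clf = RandomForestClassifier(
    n_estimators=500,     # select the number of trees
    max_features="sqrt",  # number of features considered at each node split
    bootstrap=True,       # each tree sees a different bootstrap sample
)
clf.fit(X, y)
print(clf.predict(X[:5]))  # majority vote across all trees decides match yes/no
```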

Disambiguation Results

Random Forests (RF) work as well as any – at impressive accuracy
- RF trains orders of magnitude faster than SVM

Bayesian issue: high correlation between features violated the independence assumption of random variables

[Results table: methods compared were Bayesian, simple regression, decision tree, support vector machines (radial basis), and random forests, along with a majority-class-in-training-data baseline, against the number of article pairings and the number of features in the test set.]

Knowledge Store

Need to be able to discover and manipulate knowledge across stores, and leave most of the knowledge where it is

Problems:
- Database schemas and knowledge store ontologies are different, inconsistent, and conflicting
- Entities are not aligned across the stores

Some approach to federating the information is required
- Federated data must conform to some schema or ontology applicable to the problem being solved

Swoogle can help with ontologies: http://swoogle.umbc.edu/
- Swoogle provides search over more than a million RDF documents on the web

Federating Knowledge Stores Approaches

- Unifying relational database schema
- Distributed RDF query

Unified Relational Database

Unifying relational database schema
- Define a domain schema
- Convert all data into the domain schema

Possible to have the schema federate across multiple stores
- Performance is likely to drive all data onto a central store

Disambiguate across data sources
- Also likely to drive all data onto a central store

Distributed RDF

Distributed RDF query
- Define a domain ontology
- Enable all data as RDF: SPARQL endpoints (D2RQ, RDF files from GRDDL or other, Triplify from web pages, RDF stores)
- Enable within the domain schema

Disambiguate across data sources
- Can leave the data where it is – use owl:sameAs or convert via SWRL (see the sketch below)
- Queries will have to include the reasoning effects of the disambiguation
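A minimal sketch of leaving data in place and linking identities with owl:sameAs (rdflib and the owlrl OWL-RL reasoner are assumed; the store URIs and properties are invented):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL
import owlrl

A = Namespace("http://storeA.example.org/")
B = Namespace("http://storeB.example.org/")
g = Graph()

g.add((A.John42, A.worksFor, A.Acme))               # fact held in store A
g.add((B.JohnSmith, B.phone, Literal("555-0100")))  # fact held in store B
g.add((A.John42, OWL.sameAs, B.JohnSmith))          # disambiguation result

# The query layer must apply the reasoning effects of the disambiguation:
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)
print((B.JohnSmith, A.worksFor, A.Acme) in g)  # True after sameAs expansion
```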

Twine at http://www.twine.com – people create shared threads around a topic (of tags). It is not a federated store, but it uses RDF for storage. To start, search on Twine for "twine".

(The Semantic Application Components / "Information Extraction in Context" pipeline figure is repeated here for orientation; see above.)

Data Mining / Discovery Patterns

Vector Space Model
- Multi-dimensional representation of documents as vectors
- Every dimension of the vector is a term of interest
- All vectors represent the same order of terms (at least, to be manipulated together)
- The value at each vector location represents the weight for that term
- Given vectors, you can perform vector operations like 'angle-between' as a similarity measure

The weighting approach is some of the magic:

Term Frequency–Inverse Document Frequency (TF-IDF) Weighting
- Used to evaluate how important a word is to a document in a collection
- Combines how likely a term is to occur in a document with a weight based on how unique the term is across all the documents of interest
- See Wikipedia for a good explanation of the details

Entropy Weighting
- Used to estimate the relative importance of all words within a document
- Take the negative log of the number of occurrences (normalized by the number of unique words in the document): -ln(cnt / sum_of_all_words)
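A short scikit-learn sketch (assumed available) of TF-IDF document vectors plus the 'angle-between' measure computed as cosine similarity:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "Saddam found in bunker",
    "Saddam Hussein captured at secret underground room",
    "The capital of Texas is Austin",
]
vectors = TfidfVectorizer().fit_transform(docs)  # one weighted vector per document
print(cosine_similarity(vectors))  # pairwise cosine of the document vectors
```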

Data Mining / Discovery Patterns

F-Measure as a General Measure of Mining Performance
- A measure of a test's accuracy, considering both the precision p and the recall r of the test
- Precision: typically the number of relevant documents returned, scaled by the total number returned
- Recall: the number of relevant documents retrieved by a search, scaled by the total number of existing relevant documents (that should have been returned)
- Fβ measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision
- F1 = 2 * Precision * Recall / (Precision + Recall)
- See Wikipedia for a good explanation of the details
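Those definitions as a small worked sketch (the document counts are invented):

```python
def f_beta(precision, recall, beta=1.0):
    # General form: (1 + b^2) * p * r / (b^2 * p + r); beta=1 gives F1.
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Example: 8 relevant documents among 10 returned, out of 20 relevant overall.
p = 8 / 10  # precision: relevant returned / total returned
r = 8 / 20  # recall: relevant returned / total relevant in existence
print(f_beta(p, r))          # F1 = 2pr/(p+r) ~= 0.533
print(f_beta(p, r, beta=2))  # recall weighted twice as important as precision
```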

Data Mining / Discovery Patterns

Seeing some popularity in using Kullback-Leibler (K-L) divergence as a metric
- Also referred to as information gain
- See Wikipedia to become quite confused on the details
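For the brave, a small sketch of the discrete K-L divergence, D(P||Q) = sum of p * ln(p/q) (the two distributions are invented):

```python
import math

def kl_divergence(p, q):
    # Distributions must be aligned, and q must be nonzero wherever p is.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # 0 only when the distributions match
```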

Wolfram|Alpha at http://www.wolframalpha.com supports data mining by model fitting, inference, and reporting, given human-language questions. For example, you can ask, "What is the capital of Texas" and receive both short and long answers.

Data Mining / Discovery Patterns

Consistency issues
- Mining and discovery include inference
- What about when a new fact or inference contradicts a prior one?
- IBM Entity Analytics removes old assertions that are contradicted by new information
- An alternative is to have some sort of confidence metric, but there is still a need for aging (at least) some relationships

Most important is to:
1) Have a way to detect you are contradicting yourself
2) Have some way to address it

Actionable Knowledge/Decision Making

Decision making is quite domain dependent
- Deciding someone is a terrorist versus where to construct a highway

Typically involves inference and reasoning
- Rule languages (Pellet, SWRL, SPIN, Jena reasoners, etc.)
- Bayesian networks
- Dempster-Shafer belief handling

Provenance is required (see the sketch below)
- RDF reification or a custom approach (albeit a bit messy)
- Tends to make databases messy, too

Must be able to remove assertions (i.e., step back) when an incorrect path is detected
- Multi-hypothesis reasoning approaches can be a help
- Maintain multiple paths of possibility, simultaneously
- Obviously an information management headache
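A hedged sketch of provenance via RDF reification, including the "step back" removal (rdflib assumed; the ex: vocabulary, source, and confidence properties are invented):

```python
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.PersonX, EX.affiliatedWith, EX.GroupY))  # the base assertion

stmt = BNode()  # reified statement carrying the provenance
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.PersonX))
g.add((stmt, RDF.predicate, EX.affiliatedWith))
g.add((stmt, RDF.object, EX.GroupY))
g.add((stmt, EX.source, EX.ReportsDB))     # where the assertion came from
g.add((stmt, EX.confidence, Literal(0.7))) # a confidence metric

# Stepping back: if the path proves incorrect, remove the assertion
# (and decide whether to keep the reified record as an audit trail).
g.remove((EX.PersonX, EX.affiliatedWith, EX.GroupY))
```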

The Payoff – Information Superiority

[Figure: information yield potential (# inferences / time) plotted against knowledge density (# related facts / domain). Low knowledge density gives information inferiority; past a phase transition from "low yield to high yield," high knowledge density gives information superiority. Two number-puzzle grids illustrate the point: a sparsely filled grid (the LOW knowledge density example) is marked EXPERT, while a densely filled grid (the HIGH knowledge density example) is marked EASY.]

Interesting problem domains are typically large, so information superiority also requires a large set of related facts.

Summary

Leveraging semantic technologies for arriving at actionable knowledge requires:
- Converting information into computer-usable knowledge
- Resolving ambiguity across many sources of knowledge
- Accumulating sufficient knowledge density to result in high-yield intelligence

Backup

RMS In Class Teams Exercise

Relationship Management System (RMS) - You need an information system to allow you to maintain information about university partnerships with your company. This is initially for your own use, but eventually will be a system to share with others that includes summary reporting about recent activity (including contact reports and contracts).

You will have to deal with at least these types of entities:
- Organization
- Person (in a position at an organization)
- Contract
- Contact Reports (by people at your organization with a person)

Management has decided that semantic wiki technology should be tried out on this system. (You are allowed to convince them otherwise, but they may override your objections.)

Federation In Class Teams Exercise

Intelligent Skills Inventory System – When a Request For Proposal (RFP) arrives, we need to construct an inventory of skills among our personnel that is specific to those needed for the RFP. The skills of your personnel are scattered across a number of systems, which you are required to deal with. You will not be providing any new system for gathering skills; rather, you are to:
1. Build a system to provide federated query and reporting of skills.
2. Provide a system to extract appropriate skills from the RFP and provide a list of potential personnel with matching (or near-matching) skills.

Emerging Alternative Federation Approach

Object Accumulation Overlay (sketched below)
- Develop an abstraction layer of persisted objects
- Adapters convert sources into objects – generally driven by a high-level ontology
- Objects have methods – set/retrieve attributes or perform operations
- Data (i.e., attributes) is left at the original source
- Disambiguate within the object space: adapters and custom operations perform the disambiguation
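A minimal Python sketch of the overlay idea (all class and method names are invented; real adapters would wrap databases, SPARQL endpoints, web services, etc.):

```python
from abc import ABC, abstractmethod

class SourceAdapter(ABC):
    """Converts one source into overlay objects, driven by a high-level ontology."""

    @abstractmethod
    def fetch_attribute(self, entity_id, attribute):
        """Retrieve an attribute value from the original source."""

    @abstractmethod
    def same_entity(self, entity_id_a, entity_id_b):
        """Adapter-specific disambiguation within the object space."""

class Entity:
    """Persisted overlay object; attribute data stays at the original source."""

    def __init__(self, entity_id, adapter: SourceAdapter):
        self.entity_id = entity_id
        self._adapter = adapter

    def get(self, attribute):
        # Retrieval is delegated: the data never leaves its source.
        return self._adapter.fetch_attribute(self.entity_id, attribute)
```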