
Page 1: Making the semantic web work

Making the Semantic Web Work

Reasoning beyond OWL

Page 2: Making the semantic web work

What is semantics?

Although animals do not use language, they are capable of many of the same kinds of cognition as us; much of our experience is at a non-verbal level.

Semantics is the bridge between surface forms used in language and what we do and experience.

Language understanding depends on world knowledge (i.e. “the pig is in the pen” vs. “the ink is in the pen”)

Page 3: Making the semantic web work

Machine to Machine Communication

message exchange

Underlying the systems are different databases; the ability to “get something done” is like a non-verbalized ability, but to work with other systems we need to formulate messages in an artificial language.

Understanding human language is a big problem.

What chunk can we break off that will be useful and can be done today?

Key insight:

The semantic problems of communication between business IT systems aren't that different from the semantic problems of communication between animals.

Page 4: Making the semantic web work

Natural Language to support M2M

Internal database → industry-standard message format

Machine-readable and human-readable specifications

Capture critical knowledge in a graph database; perhaps 80% of the process can be automated, but human effort is part of a structured process that clearly links specification to implementation

Captured specifications are used to compile data transformation rules.

The graph model is used as a "universal solvent"

Paul Houle
Page 5: Making the semantic web work

More generally…

requirements, regulations, policies

Programs that implement behaviors

We might not be ready for executives to specify policies themselves, but we can make the process from specification to behavior more automated, linked to precise vocabulary, and more traceable.

Advances such as SBVR and an English serialization for ISO Common Logic mean that executives and line workers can understand why the system does certain things, or verify that policies and regulations are implemented.

Logged Decision Process

Focusing on the execution of tasks is the road to real semantics; anything that does a useful job solves the "grounding problem." Children can't learn language by watching television, only by talking with others.

Page 6: Making the semantic web work

Making Expressive Reasoning Scalable

Scalable fabric: BACKGROUND KNOWLEDGE, RULES, MODELS, ALGORITHMS, HEURISTICS

Scalable system merges data from siloed sources; constructs graph(s) of facts relevant to specific records and entities

Profiler: a scalable profiler lets the system discover "ground truth" about data to inform generated rules and behaviors

Capabilities: VOCABULARY MANAGEMENT, VERSION CONTROL, EXCEPTION HANDLING, BUSINESS RULES MANAGEMENT, CASE MANAGEMENT, CONCEPT MATCHING, BEHAVIOR TRACEABLE TO REQUIREMENTS, MULTILINGUAL SUPPORT, ENRICHED LINKED DATA

Page 7: Making the semantic web work

People are looking for better tools

Unconstructive Criticism of the Semantic Web is Common

Blanket dismissals displace real thinking, particularly a “gap analysis” as to what is missing.

Yet, certain unworkable standards (OWL) have also displaced real progress.

Page 8: Making the semantic web work

History of RDF is about evolution: good stuff survives, bad ideas (slowly) fade away

RDF/XML: Early work built on XML; it had natural representations for ordered collections but was pedagogically awful (where are the triples?)

N-Triples / Turtle: Turtle is a human-friendly format but isn't scalable to billions of triples

RDFS / OWL: Competition between schema/inference languages left two winners

SPARQL: A full-featured query language changed everything, but ordered collections go "under the bus"

SPIN: New inference and transformation languages emerge

Linked Data: In the Linked Data era we can handle billions of triples, but collections and blank nodes become awkward

ISO Common Logic: In the long term we'll see highly expressive languages forward compatible with RDF

RDF*: RDF* and SPARQL* let us make statements about statements and query them; this increases expressiveness and can be used for data management

Page 9: Making the semantic web work

We can be optimistic because… multiple communities have been working on similar things in parallel

• Semantic web: RDF / SPARQL
• Diagramming and representation of data structures, processes, systems, models, etc.
• Common Logic and message vocabularies
• SUMO upper ontology
• Commercial Master Data Management products accurately match entities
• Vocabularies and message formats for business

When you look at the pieces of the puzzle developed by communities that don’t really talk to each other, you see that the “state of the art” is better than it appears…

Page 10: Making the semantic web work

Common data models

• Relational data model: fundamentally tabular, like a CSV file
• Object-relational model: a column can contain rows; this is like XML or JSON
• Graph model: highly general
• Hypergraphs: "property graphs" and RDF*

These models are compatible in that you can represent a graph with relational tables, break up an XML record into multiple relational tables, or even embed a hypergraph inside a graph, but there are big differences when it comes to efficiency when you need a certain set of facts in one place.

Page 11: Making the semantic web work

Predicate Calculus

RDF is a special case of the "predicate calculus"

Statement of arity 2

Predicate Calculus:

A(:Dog,:Fido)

RDF:

:Fido A :Dog .

Statement of arity 3

Predicate Calculus:

:Population(:Nigeria,2013,173.6e6)

RDF:

[ a :Population ; :where :Nigeria ; :when 2013 ; :amount 173.6e6 ] .

It’s not too hard to write this in Turtle

This implementation, however, is structurally unstable, since we went from one triple to four triples

Page 12: Making the semantic web work

How to think about RDF

• The basic element of RDF is the Node
  • This borrows heavily from XML: terms come out of a URL-based namespace so we can throw everything in a big pot; we get the basic types from XML Schema; plus we can even use XML literals
• A triple is just a tuple with (i) three nodes and (ii) set semantics (see the sketch below)
  • Higher-arity predicates are tuples with >3 nodes
  • SPARQL result sets and intermediate results are tuples of Nodes
  • Official serialization formats exist for SPARQL result sets
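The Node-and-tuple view maps directly onto the Jena API. Here is a minimal sketch, not from the slides, using Jena's graph-level classes (the example.org URIs are assumptions for illustration):

import org.apache.jena.datatypes.xsd.XSDDatatype;
import org.apache.jena.graph.Node;
import org.apache.jena.graph.NodeFactory;
import org.apache.jena.graph.Triple;

public class NodesAndTriples {
    public static void main(String[] args) {
        // Nodes: URIs, literals with XML Schema datatypes, and blank nodes
        Node fido    = NodeFactory.createURI("http://example.org/Fido");
        Node rdfType = NodeFactory.createURI("http://www.w3.org/1999/02/22-rdf-syntax-ns#type");
        Node dog     = NodeFactory.createURI("http://example.org/Dog");
        Node blank   = NodeFactory.createBlankNode();
        Node amount  = NodeFactory.createLiteral("173.6e6", XSDDatatype.XSDdouble);

        // A triple is just a tuple of three Nodes with set semantics
        Triple t = Triple.create(fido, rdfType, dog);
        System.out.println(t);
        System.out.println(amount.getLiteralValue());
    }
}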

ISO Common Logic is the obvious upgrade path, since it uses the same data types as RDF and can handle RDF triples, as well as higher-order predicates and intuitively obvious inference.

Page 13: Making the semantic web work

ISO Common Logic: next step in evolution

• Uses the RDF Node as its basic data type, with all benefits thereof
• RDF triples are just arity-2 predicates and can be used directly
• First-order logic operators supported; typed logic allows some "beyond first-order logic" capabilities
• OWL and RDFS can be implemented as a theory in FOL
• Builds on KIF, the Knowledge Interchange Format

• Foundation for additional developments:
  • Controlled English format for Common Logic statements
  • Modal logics: SBVR
  • Interchange language for knowledge-based systems of all kinds

Page 14: Making the semantic web work

The Old RDF: Expressive but not scalable

Early RDF:

RDF/XML serialization, heavy use of blank nodes, extreme expressiveness:

[ a sp:Select ;
  sp:resultVariables (_:b2) ;
  sp:where ([ sp:object rdfs:Class ;
              sp:predicate rdf:type ;
              sp:subject _:b1 ]
            [ a sp:SubQuery ;
              sp:query [ a sp:Select ;
                         sp:resultVariables (_:b2) ;
                         sp:where ([ sp:object _:b2 ;
                                     sp:predicate rdfs:label ;
                                     sp:subject _:b1 ]) ] ]) ]

This is a representation of a SPARQL query in RDF!

This example uses Turtle, where square brackets create blank nodes and parentheses create lists.

With this graph loaded into the Jena framework you can easily manipulate it as an abstract syntax tree.

Very complex relationships, such as mathematical equations, can be built this way; blank nodes can be used to write high-arity predicates.

Accessing it through SPARQL would not be so easy!
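To make the "abstract syntax tree" point concrete, here is a minimal Jena sketch, not from the slides, assuming the Turtle above (plus @prefix lines for sp:, rdf:, and rdfs:) has been placed in the String ttl:

import org.apache.jena.rdf.model.*;
import java.io.StringReader;

public class SpinAst {
    public static void main(String[] args) {
        String ttl = "...";  // the Turtle shown above, with @prefix declarations added
        Model query = ModelFactory.createDefaultModel();
        query.read(new StringReader(ttl), null, "TTL");

        // Every bracketed [...] became an anonymous Resource; the graph can be
        // walked like a parse tree. Here we just count the blank nodes.
        int blankNodes = 0;
        ResIterator it = query.listSubjects();
        while (it.hasNext()) {
            if (it.next().isAnon()) blankNodes++;
        }
        System.out.println(blankNodes + " blank nodes in the query AST");
    }
}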

Page 15: Making the semantic web work

Linked Data: New Focus

Linked data source

Blank nodes are discouraged because it’s hard for a distributed community to talk about something without a name.

(the same blank-node-heavy query graph shown on the previous page)

Turtle and RDF/XML (which have sweet syntax for blank nodes) are not scalable because the parser cannot be restarted after a failure: if you have billions of triples, a few will be bad

<http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show"^^<http://www.w3.org/2001/XMLSchema#string> .
<http://example.org/show/218> <http://www.w3.org/2000/01/rdf-schema#label> "That Seventies Show" .
<http://example.org/show/218> <http://example.org/show/localName> "That Seventies Show"@en .
<http://example.org/show/218> <http://example.org/show/localName> "Cette Série des Années Septante"@fr-be .
<http://example.org/#spiderman> <http://example.org/text> "This is a multi-line\nliteral with many quotes (\"\"\"\"\")\nand two apostrophes ('')." .
<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/atomicNumber> "2"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://en.wikipedia.org/wiki/Helium> <http://example.org/elements/specificGravity> "1.663E-4"^^<http://www.w3.org/2001/XMLSchema#double> .

N-Triples is practical for large databases such as Freebase and DBpedia because records are isolated, but blank nodes must be named and triple-centric modelling is encouraged

We now have a great query language, SPARQL. SPARQL supports the same shorthand for blank nodes as Turtle. Some blank node patterns work naturally, but it is particularly hard to ask questions about ordered collections.

Blank nodes, collections, etc. are out of fashion.

Page 16: Making the semantic web work

Old Approaches to Reification: Named Graphs

:graph :subject :predicate :object .

Adding an extra node to a triple is simple, practical and useful for many purposes.

For instance, I could take in triple data from various sources and keep them apart by putting them in different graphs.

The trouble is that this is a one trick pony: I can’t take collections of named graphs from different sources and keep them apart using named graphs

For practical logic we need to be able to qualify statements to manage:
• Provenance
• Access controls
• Metadata
• Modal relationships
• Time
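A minimal sketch, not from the slides, of the "keep sources apart" use mentioned above, using a Jena Dataset (the example.org graph and resource names are assumptions for illustration):

import org.apache.jena.query.*;
import org.apache.jena.rdf.model.*;

public class NamedGraphs {
    public static void main(String[] args) {
        Dataset ds = DatasetFactory.create();

        // Put triples from one source into their own named graph
        Model fromWikipedia = ModelFactory.createDefaultModel();
        fromWikipedia.add(
            fromWikipedia.createResource("http://example.org/Tolkien"),
            fromWikipedia.createProperty("http://example.org/wrote"),
            fromWikipedia.createResource("http://example.org/LordOfTheRings"));
        ds.addNamedModel("http://example.org/graphs/wikipedia", fromWikipedia);

        // Ask which graph (i.e., which source) each statement came from
        String q = "SELECT ?g ?s WHERE { GRAPH ?g { ?s ?p ?o } }";
        try (QueryExecution qe = QueryExecutionFactory.create(q, ds)) {
            ResultSetFormatter.out(qe.execSelect());
        }
    }
}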

Page 17: Making the semantic web work

Old Approaches to Reification: Reification with Blank Nodes

[ rdf:type rdf:Statement ;
  rdf:subject :Tolkien ;
  rdf:predicate :wrote ;
  rdf:object :LordOfTheRings ;
  :said :Wikipedia ] .

http://stackoverflow.com/questions/1312741/simple-example-of-reification-in-rdf

This isn’t too hard to write in Turtle, but it breaks SPARQL queries and inference for reified triples.

The number of triples is at the very least tripled; the triple store is unlikely to be able to optimize for common use cases.

Page 18: Making the semantic web work

a new standard that unifies RDF with the property graph model

RDF*/SPARQL* (Reification Done Right)

Turtle facts:

:bob foaf:name "Bob" .
<<:bob foaf:age 23>> dct:creator <http://example.com/crawlers#c1> ;
                     dct:source <http://example.net/homepage-listing.html> .

Sparql query:

SELECT ?age ?src
WHERE {
  ?bob foaf:name "Bob" .
  <<?bob foaf:age ?age>> dct:source ?src .
}

This is huge! So far products based on property graphs have been ad-hoc, without a formal model. SPARQL* brings rich queries to the property graph model and the reverse mapping means RDF* can be processed with traversal-based languages like Gremlin.

Page 19: Making the semantic web work

Roles of Schemas
• Documentation
• Integrity Preservation
• Efficiency
• Inference

Page 20: Making the semantic web work

Schemas as Documentation

Humans write code to insert data and write queries: schemas tell us how the data is organized.

Automated systems can also use schemas to drive code generation (consider object-relational mapping)

Page 21: Making the semantic web work

Schemas can preserve integrity

SQL:

create table customer (
  id integer primary key,
  username varchar(16) unique not null,
  email varchar(64) not null
);

SQL prevents attempts to insert records with non-existing fields or lacking required fields. SQL can enforce key integrity and other constraints.

You can (often) code algorithms and take it for granted that data structures satisfy invariants required for those algorithms to work.

RDF:

RDFS and OWL, implemented with the standard semantics, do not validate data.

Practically, RDF users will use types and properties across a wide range of standard and proprietary namespaces, and it can be hard to keep track of them all.

For instance, rdfs:label is defined in RDFS, despite the fact that you can have labels without schemas. Terms that are the bread-and-butter of RDFS, such as rdf:type and rdf:Property, are defined in the RDF specification.

It's an easy mistake to get the "s" wrong when writing either data or queries; if you do, you run a query, get zero results, and could easily chase your tail looking for other causes.

You can (and should) define an alternate semantics for RDFS and OWL, which rejects types and properties that are not listed in either data or queries, but this is nonstandard.

These issues are addressed in the "RDF Data Shapes" effort, to be completed in 2017.
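As a sketch of the nonstandard "closed vocabulary" check described above (the file names and the notion of "declared" used here are assumptions for illustration, not part of any RDF standard):

import org.apache.jena.rdf.model.*;
import org.apache.jena.vocabulary.RDF;

public class VocabularyCheck {
    public static void main(String[] args) {
        Model schema = ModelFactory.createDefaultModel().read("schema.ttl");
        Model data   = ModelFactory.createDefaultModel().read("data.ttl");

        // Complain about any predicate in the data that the schema never mentions;
        // standard RDFS/OWL semantics would silently accept these.
        data.listStatements().forEachRemaining(st -> {
            Property p = st.getPredicate();
            if (!schema.contains(p, RDF.type, RDF.Property) && !schema.containsResource(p)) {
                System.out.println("Undeclared property: " + p.getURI());
            }
        });
    }
}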

Page 22: Making the semantic web work

Schemas can promote efficiency

JSON (and XML): structural information is repeated in each record

{ "red": 201, "green": 99, "blue": 82, "alpha": 115 }

(85 bytes)

C:

typedef unsigned char byte;
struct color { byte red; byte green; byte blue; byte alpha; };

defines the meaning of

201 99 82 115

(4 bytes; roughly 20x compression!)

In numerical work, it often takes longer to convert a million numbers from ASCII to float than you spend working on the floats. The speed of text parsing is a limiting factor in electronic trading systems and many other applications.

GZIP compression of repetitive data helps, but you get a smaller file if you apply GZIP to binary data. You pay a CPU price for data compression plus a large price in string parsing.

Textual data formats have been fashionable in the Internet Age, because it is easy to get string parsing code to “almost work;” one of the reasons we are hearing about security breaches every day is that it’s extremely difficult to write correct string parsing code.

RDF standards do not address binary serialization; however, Vital AI can create binary formats based on OWL schemas.

Page 23: Making the semantic web work

Schemas and Inference: the unique value of RDF!

A-BOX (look at 4 RDF vocabularies and find 4 ways to write an e-mail address):

:Joe myvocab:emailAddress "[email protected]" .
dbpedia:Some_Body db:eMailAddress <mailto:[email protected]> .
basekb:m.3137q basekb:organization.email_contact.email_address "[email protected]" .
:Lily schemaOrg:email "[email protected]" .

T-BOX:

myvocab:emailAddress rdfs:subPropertyOf foaf:email .
db:eMailAddress rdfs:subPropertyOf foaf:email .
basekb:organization.email_contact.email_address rdfs:subPropertyOf foaf:email .
schemaOrg:email rdfs:subPropertyOf foaf:email .

Inferred facts:

:Joe foaf:email "[email protected]" .
dbpedia:Some_Body foaf:email <mailto:[email protected]> .
basekb:m.3137q foaf:email "[email protected]" .
:Lily foaf:email "[email protected]" .
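A minimal sketch of running this inference with Jena's built-in RDFS reasoner, assuming the T-BOX and A-BOX above are stored in schema.ttl and data.ttl (those file names are assumptions; foaf:email is used only because the slides use it):

import org.apache.jena.rdf.model.*;

public class EmailInference {
    public static void main(String[] args) {
        Model schema = ModelFactory.createDefaultModel().read("schema.ttl");  // T-BOX
        Model data   = ModelFactory.createDefaultModel().read("data.ttl");    // A-BOX

        // RDFS entailment: rdfs:subPropertyOf lifts each vocabulary-specific
        // property up to foaf:email
        InfModel inf = ModelFactory.createRDFSModel(schema, data);

        Property foafEmail = inf.createProperty("http://xmlns.com/foaf/0.1/email");
        inf.listStatements(null, foafEmail, (RDFNode) null).forEachRemaining(System.out::println);
    }
}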

Page 24: Making the semantic web work

It looks like an answer for data integration,but…

:Joe foaf:email "[email protected]" .
dbpedia:Some_Body foaf:email <mailto:[email protected]> .
basekb:m.3137q foaf:email "[email protected]" .
:Lily foaf:email "[email protected]" .

There are two reasonable ways to write an email address: as a string or as a URI

foaf:email rdfs:range owl:Thing .

According to the FOAF spec, only the URI form is correct, since "in OWL DL literals are disjoint from owl:Thing" (at least if we are using OWL DL…)

Any ETL tool has an ability to apply a function to data (it’s not hard at all to write code to translate a string to a mailto: URI)

RDFS and OWL, however, can’t do simple format conversion. For instance, it is reasonable for people to specify temperatures in Fahrenheit or Centigrade or Kelvin, but OWL inference can’t “multiply by something and add” – even though it can state that properties “mean the same thing”, it can’t specify simple transformations.

Something like OWL may be necessary for data integration, but OWL is not sufficient.
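The kinds of trivial transformations OWL can't express are one-liners in ordinary code; a minimal illustrative sketch (the method names are made up for this example):

public class Conversions {
    // "user@host" -> "mailto:user@host", the string-to-URI fix-up mentioned above
    static String toMailtoUri(String email) {
        return email.startsWith("mailto:") ? email : "mailto:" + email.trim();
    }

    // Fahrenheit -> Celsius, the "multiply by something and add" that OWL can't state
    static double fahrenheitToCelsius(double f) {
        return (f - 32.0) * 5.0 / 9.0;
    }
}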

Page 25: Making the semantic web work

Other things OWL Can't Do

• We can't reject data
  • Reject things that we don't agree with
  • Reject things we don't need; let's use Freebase to seed…
    • A directory of ski areas
    • The spatial hierarchy of Africa
    • A biomedical ontology

We don’t want to pay to store stuff we don’t need, or wait for it to be processed, do quality control on it, or deal with any problems it might create

Page 26: Making the semantic web work

OWL is unintuitive

Here's an excerpt from the FIBO (Finance) ontology:

Organization:

A social unit of people, systematically structured and managed to meet a need or pursue collective goals on a continuing basis.

Autonomous Agent:

An agent is an autonomous individual that can adapt to and interact with its environment.

Property Restriction 1:

Set of things that must have property "has member" at least 2 taken from "autonomous agent"

Property Restriction 2:

Set of things that may have property "has part" taken from "organization"

Property Restriction 3:

Set of things that must have property "has" at least 1 taken from "goal"

Property Restriction 4:

Set of things that may have property "has" taken from "postal address”

LEGEND: the arrow means "has parent"

How do you explain this to your boss? To the programmer who just joined the team? What kind of inference does this entail? (I think two people with a goal are an organization, but is there a real difference between a DBA filing for a person who is self-employed and one that has an additional employee?)

Page 27: Making the semantic web work

It’s not always obvious how to do things in OWL

You can’t say

“The United States Has 50 States”

But you can say

“Anything that has 50 states is the United States”

You can get close to what you want to say by

“The United States is a member of an anonymousclass that contains anything with 50 states.”

You can get some entailments from that, but nothing happens if only 47 states are on the list (it’s an open world, we just don’t know about…)

Thus:

It’s not obvious what exactly can be specified in OWL.

If you talk to an expert, you’ll find that he can do a lot of things you might think aren’t possible.

Page 28: Making the semantic web work

Production Rules and First-Order Logic

Many 1970s "expert systems" were driven by production rules; these are now widespread in "Business Rules Engines".

Condition -> Action

Common data transformations can be easily written with production rules:

Weight(person,weight) and Height(person,height) -> BodyMassIndex(person,weight/height^2)

BodyMassIndex(person,bmi) and bmi<18.5 -> Underweight(person)
BodyMassIndex(person,bmi) and 18.5<=bmi<25 -> NormalWeight(person)
BodyMassIndex(person,bmi) and 25<=bmi<30 -> Overweight(person)
BodyMassIndex(person,bmi) and 30<=bmi -> Obese(person)

You can't do the arithmetic (weight/height^2) in OWL. You just can't. (You could easily miss it reading the documentation, but the range-based classification by itself can be stated in OWL by using XML Schema constraints on data types.)
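For contrast, here is the BMI example running as forward-chaining rules. The deck's rule engines are Drools and JESS; this sketch instead uses Jena's general-purpose rule engine (mentioned on page 32) so it stays close to RDF data. The ex: vocabulary, property names, and plain numeric thresholds are assumptions for illustration:

import org.apache.jena.rdf.model.*;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

public class BmiRules {
    public static void main(String[] args) {
        String EX = "http://example.org/ex#";
        String rules =
            "@prefix ex: <" + EX + "> .\n" +
            "[bmi:    (?p ex:weightKg ?w) (?p ex:heightM ?h) product(?h, ?h, ?h2) quotient(?w, ?h2, ?b) -> (?p ex:bmi ?b)]\n" +
            "[under:  (?p ex:bmi ?b) lessThan(?b, 18.5)            -> (?p ex:category ex:Underweight)]\n" +
            "[normal: (?p ex:bmi ?b) ge(?b, 18.5) lessThan(?b, 25) -> (?p ex:category ex:NormalWeight)]\n" +
            "[over:   (?p ex:bmi ?b) ge(?b, 25) lessThan(?b, 30)   -> (?p ex:category ex:Overweight)]\n" +
            "[obese:  (?p ex:bmi ?b) ge(?b, 30)                    -> (?p ex:category ex:Obese)]\n";

        Model data = ModelFactory.createDefaultModel();
        Resource joe = data.createResource(EX + "Joe");
        joe.addLiteral(data.createProperty(EX + "weightKg"), 70.0);
        joe.addLiteral(data.createProperty(EX + "heightM"), 1.75);

        GenericRuleReasoner reasoner = new GenericRuleReasoner(Rule.parseRules(rules));
        reasoner.setMode(GenericRuleReasoner.FORWARD);
        InfModel inf = ModelFactory.createInfModel(reasoner, data);
        inf.listObjectsOfProperty(joe, data.createProperty(EX + "category"))
           .forEachRemaining(System.out::println);   // ex:NormalWeight for 70 kg / 1.75 m
    }
}

The point of the production-rule style is visible even in this toy: the engine, not the programmer, decides that the bmi rule must fire before the classification rules.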

Page 29: Making the semantic web work

Production Rules vs. Imperative Programming Languages

The BMI example could easily be written in (say) Java…BUT

You have to get the steps in the right order; this is trivial to do in a simple case, but it gets increasingly harder as complexity goes up. This is one of the reasons why programming is a specialized skill.

Production Rules constrain the conditions so the engine can quickly determine which rules are fired when the state changes…

… but the actions are written in a conventional programming language like LISP or Java, so we can use a full spectrum of programming techniques and a lot of existing code.

Note: rules engines have advanced greatly since the "golden age of AI", and now 100,000+ rules and 10 million+ facts are practical.

Page 30: Making the semantic web work

Production Rules in the Wider Picture

Drools Expert: Execution of production rules

Drools Fusion: Complex event processing

jBPM: Business Process Management; coordination of asynchronous human and automated behaviors – controlled by rules

OptaPlanner: Multi-objective combinatorial optimization for tasks such as scheduling, vehicle routing, box packing – controlled by rules

This is the JBoss stack; products such as Blaze Advisor and ILOG do all this and more.

The use of production rules to control business processes, particularly in scenarios involving complex workflows and complex multiple requirements is well established.

This is an emerging research topic in the semweb community, but in the business rules world this is a mature technology

Page 31: Making the semantic web work

"Impedance Mismatch" between Business Rules and RDF is minimal

Most Java Rules Engines (like JESS and Drools) can reason about ordinary Java objects

RDF data can be converted to specialized predicate objects for performance or convenience, but it is very possible to insert objects from the Jena framework, such as Nodes, Triples and Models, directly into a rules engine.

Page 32: Making the semantic web work

OWL and RDFS implementations often use production rules

OWL 2 RL dialect: forward chaining

The semantics of RDFS and most of OWL can be implemented with production rules; RETE and post-RETE algorithms can evaluate these efficiently.

Popular reasoners such as Jena and OWLIM often use a box of production rules to implement RDFS and OWL, and expose this functionality so you can implement custom inference.

OWL 2 QL dialect: backward chaining

RDFS, and another major subset of OWL, can be implemented by rewriting SPARQL queries.

Since SPARQL is based on relational algebra, the whole bag of tricks used to optimize relational database queries can be used to efficiently answer queries.

Page 33: Making the semantic web work

OWL dialects have “computer science” advantages(i.e., algorithms exist to answer queries in bounded time, with scaling that looks good on paper)

More expressive logics that are undecidable sound scary…

However, many things about conventional programming languages are undecidable…

For instance, you can't solve the halting problem for conventional programming languages; yet that doesn't drive most people to use languages that lack recursion and unbounded loops.

Algorithms to exactly solve common optimization problems (travelling salesman problem, etc.) are computationally intractable, but approximate algorithms are fine for the real world.

(Evaluation of production rules is not decidable in finite time since it is possible to create an “infinite loop”)

Page 34: Making the semantic web work

Logical Theorem Proving (e.g., VAMPIRE)

If we constrain the action fields of rules a bit, we can prove theorems, a highly flexible form of reasoning. There are other ways to do it, but one effective method is the saturation solver.

(Diagram: the axioms, plus the logical negation of the statement to prove (S), are fed to the solver, which produces conclusions.)

If S is true, then not S is false. Eventually the solver will find a contradiction and produce the conclusion false.

Since you can derive an infinite number of conclusions from most theories, this process is not guaranteed to finish. A lucky or clever algorithm could reach false with a short chain. State of the art reasoners use multiple search strategies that work well in many real-life cases.

Page 35: Making the semantic web work

Real-life OWL and RDFS performance doesn't satisfy

RDFS inference, done according to the book, generates a vast number of trivial and uninteresting conclusions; practical reasoners usually don't implement the complete standard

Page 36: Making the semantic web work

Requirements for Practical Logic

One long-term goal for logic is to "capture 100% of critical knowledge in business documents"

It might sound like science fiction, but if we hire a team of programmers to implement a policy or to make a system that complies with regulation and requirements, it is the goal. Can we (i) reduce team size, (ii) speed up the project, and (iii) be able to show the rules being enforced to management in a way they can understand?

Plain first-order logic does not cover all the bases.

We need:
• Modal logic (CAN, SHOULD, MUST, IT WAS TRUE THAT, HARRY BELIEVES THAT)
• Temporal logic (things change at different times)
• Default and defeasible logic
• Higher-order logic: (for all statements) or (there exists a statement)

These logics are not as mature as FOL, but we can often use tricks to simulate them

Page 37: Making the semantic web work

Modal Logic: Key for Law, Contracts, Requirements, …

A modal operator qualifies a statement:

MUST(S) -> S is necessarily true in any situation

USUALLY(S) -> S is usually true

PERMISSIBLE(S) -> It is permissible that S is true

BELIEVES(person,S) -> the specified person believes S is true

PREVIOUSLY(S) -> S was true in the past

Some modal logic problems can be addressed by rewriting the problem, for instance if S(x,y) is a simple predicate we could define a predicate like

BELIEVES_S(person,x,y)

We can’t express arbitrary statements this way, but we may be able to express all the ones that we’ll really use.

Systems like SUMO use tricks like this to punch above their weight

Page 38: Making the semantic web work

Temporal Logic

Change is the one thing that is constant. The population of Las Vegas was 25 in 1900 and 583,736 in 2010. Since laws change over time, to know if a set of actions was illegal we need to know when the actions were taken and what the law was at the time, and answer questions like "What did the President know and when did he know it?"

A complete theory is not fully developed, but some pretty good tools are available

The Allen Algebra

Time intervals are closer to reality than points in time; with time intervals we can specify that a meeting starts at 6:00 pm on a certain day and goes on for 1 hour. We could ask if this overlaps with the interval of another meeting to know if I need to choose between one meeting and the other.

Allen Algebra doesn’t cover all temporal reasoning cases, but it works well with production rule systems, and is widely used in complex event processing.

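A minimal sketch, not from the slides, of the meeting-conflict check described above; the Interval type and the half-open convention are assumptions, and this single test is only the "do they share any time" union of several Allen relations:

import java.time.LocalDateTime;

record Interval(LocalDateTime start, LocalDateTime end) {
    // True when the two half-open intervals share any time at all; a full Allen
    // algebra would further distinguish overlaps, during, starts, meets, etc.
    boolean intersects(Interval other) {
        return start.isBefore(other.end) && other.start.isBefore(end);
    }
}

class MeetingCheck {
    public static void main(String[] args) {
        Interval a = new Interval(LocalDateTime.of(2017, 1, 7, 18, 0),
                                  LocalDateTime.of(2017, 1, 7, 19, 0));   // 6:00 pm, one hour
        Interval b = new Interval(LocalDateTime.of(2017, 1, 7, 18, 30),
                                  LocalDateTime.of(2017, 1, 7, 19, 30));
        System.out.println("Need to choose between the meetings: " + a.intersects(b));
    }
}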

Page 39: Making the semantic web work

Default and Defeasible Reasoning

The following logical chain leads to a bad result:

Flies(Bird)
A(Penguin, Bird)
Flies(Penguin)

Exceptions are widespread in real life:

“A year divisible by 4 is a leap year, unless the year is divisible by 100; however, if the year is divisible by 400 it IS a leap year”

"An amateur radio operator may not transmit music unless they are retransmitting a signal from the International Space Station."

We could write

Any(x): A(x,Bird) and NOT(A(x,Penguin)) -> Flies(x)

But this gets hard to maintain when we find out about ostriches, domestic ducks, etc. It would be worse yet to maintain a list of flying birds.

Default logic adds features that let us express defaults.
Defeasible logic allows us to retract a conclusion if we find contrary evidence later.

Page 40: Making the semantic web work

Logical Negation

ALL APPROACHES ARE SOMEWHAT PROBLEMATIC

There are many ways to implement logical negation, but there is no universal answer to the problem.

For instance, suppose we add

NOT(Underweight(person)) -> WellFed(person)

to the rules we’ve been working on.

If this rule is activated before we have (i) gotten height and weight information, (ii) computed the BMI, and (iii) classified this person, it will fire improperly. This might not be a problem if it has no real-world consequences and is retracted when it becomes false, but it's not the behavior we want.

Page 41: Making the semantic web work

Logic Programming: Practical Concessions

Phase I: Extract information about height and weight

Phase II: Compute BMI and classify

Phase III: Make additional conclusions knowing ALL Phase II conclusions

With the agenda mechanism in most Business Rules Systems, each phase can get a complete view of what happened in the last phase, meaning that negation, counting and similar operations work as expected

(At the cost that we need to assign rules to the right phases)

Page 42: Making the semantic web work

What about SPIN?

SPIN is similar in expressiveness to production rules.

ex:Person
  a rdfs:Class ;
  rdfs:label "Person"^^xsd:string ;
  rdfs:subClassOf owl:Thing ;
  spin:rule [
      a sp:Construct ;
      sp:text """
        CONSTRUCT { ?this ex:grandParent ?grandParent . }
        WHERE {
          ?parent ex:child ?this .
          ?grandParent ex:child ?parent .
        }"""
  ] .

This is like a production rule written in reverse: we infer triples from the CONSTRUCT clause based on matching the WHERE clause.

TopBraid Composer implements most inference through primitive forward chaining (a fixed-point algorithm; RETE cannot be used because the order of rule firing is unpredictable.)

Backwards chaining can be accomplished through the definition of “magic properties” (something similar can be done with Drools too)

SPIN has support for query templates, in some ways like Decision Tables but possibly more palatable for coders and for semantic apps

Control of execution order, negation, and non-monotonic reasoning are not settled. Less is known about how to implement it efficiently.

Page 43: Making the semantic web work

Linked Data: "Trough of Disillusionment"

The dream of linked data is that you can easily “mash up” data from multiple sources to answer questions.

If you want to get the right answers, however, it is not so easy.

If you didn’t have a lot of experience in the corporate world you might blame data publishers, RDF, and the incentive structures around linked data for this, however…

Page 44: Making the semantic web work

Corporate Data

… real-life data in business is frequently bad; 80% of effort in data mining projects goes into data preparation and cleaning.

(Diagram: a business analyst's tools sit on top of many systems: ERP, POS, email, CRM, web, factory automation, wiki, HR, SharePoint, CMS, custom apps, inventory, social CMS, SaaS apps)

A large business has multiple business units running a huge number of applications written at different times by different people

Businesses grow by acquisition; to the extent that customers and employees are aware of different IT systems and their histories, customer service sucks, employees underperform, and costs are high

Businesses face the same problem as the Linked Data Community but these problems happen behind closed doors and people are cursing COBOL and SAP instead of RDF and SPARQL

Page 45: Making the semantic web work

While Linked Data was emerging, Enterprise IT developed "Master Data Management" to enable a "Customer Centric" enterprise

(Diagram: one household's accounts at the credit union: Personal Account, Paul's Business Account A, Paul's Business Account B, Olivia's Business Account, Child's Account, Paul's IRA, Olivia's IRA, SEP IRA, Home Equity Line, Houseguest, Tenant A Personal, Tenant A Corporate, Tenant B)

Traditional business systems are “account centric”, which is enough to get by but not enough to thrive. To really serve me well, my credit union needs a complete picture of the relationship I have with it. (It took me a while to remember how many accounts I have and I might have missed one)

Financial institutions are under legal pressure to “know your customer” (KYC) and linking accounts that belong to a customer is necessary to prevent monkey business

but I own shares in this one!

My name is on this column of accounts, but not the others

Page 46: Making the semantic web work

Dominant paradigm for master data management:

Objects are clustered based on a distance metric; objects are "blocked" beforehand to avoid the N² cost of computing distances

… this is effective in the case of matching different records for the same customer, but is NOT effective in cases where we have a ground truth and can know rather than guess …

Tyrol / Tirol

Two variants that differ by a letter can be fuzzy matched, but it's hard to guess arbitrary things like

AT-7 (ISO 3166-2)
AT33 (NUTS)
AU07 (FIPS 10-4)
蒂罗尔州 (Chinese)

… and why guess when you can just look them up in a quality controlled database?

Conventional MDM focuses on resolving customers (people or businesses); in some cases it involves resolving products.

Generally the objects being matched are "equal" to each other in ontological status, such as two customer records.

Semantic MDM covers a wider range of concepts and often imports large amounts of knowledge from general databases or involves alignment with industry ontologies.

In some cases we are discovering new concepts and maintaining the ontology, but more often we are matching surface forms to underlying concepts.

Page 47: Making the semantic web work

Do we clean data before or after query time?

Weather station reports temperature in centigrade, reports -999 upon error

32.1 34.6 36.3 -999 33.8

Let’s say we want to compute the average…

If we use the arithmetic mean, we get -172.44° C. Outrageously wrong!

If we know this device reports -999 on error, or that temperatures can never be less than -273.15° C, we can reject the bad value and get 34.2° C

If we use the median instead of the mean, the outlier is automatically ignored and we get 33.8° C, close to the cleaned mean

In this case it's reasonable to clean the data or use an algorithm that is robust to outliers – they teach kids in elementary school that the median is robust, but how many other robust algorithms are on the tip of everyone's tongue?
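A minimal sketch of the three calculations above, using the readings and the -999 error sentinel from this page:

import java.util.Arrays;

public class RobustAverage {
    public static void main(String[] args) {
        double[] readings = {32.1, 34.6, 36.3, -999, 33.8};

        // Naive arithmetic mean: dragged down to -172.44 by the error sentinel
        double naiveMean = Arrays.stream(readings).average().orElse(Double.NaN);

        // Cleaned mean: reject values that are error codes or physically impossible
        double cleanedMean = Arrays.stream(readings)
                                   .filter(t -> t > -273.15)
                                   .average().orElse(Double.NaN);

        // Median: robust to a single outlier without any domain knowledge
        double[] sorted = readings.clone();
        Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];   // five values, take the middle one

        System.out.printf("naive mean=%.2f  cleaned mean=%.2f  median=%.1f%n",
                          naiveMean, cleanedMean, median);
    }
}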

Page 48: Making the semantic web work

Ahead-of-time data preparation

TEST CASES

Test failure blocks further analysis

Queries (business analyst)

Error reports thrown “over the wall”

data quality team

A line drawn between data processing and data use establishes a test perimeter and makes the process scalable in human terms

Page 49: Making the semantic web work

Fixing up at query time will drive you nuts

Scenario: Business Analyst writes queries while talking to co-workers to quickly build collective understanding.

Requirement: easy to write queries off the cuff and get the right answer!

"[email protected]"
<mailto:[email protected]>
"[email protected]"
"[email protected]"

It’s not hard to canonicalize two variant forms of an e-mail address in either a query or in processing the result set

(Chart: effort grows rapidly with query complexity)

A real query might be querying tens of values: some are used in conditions, others end up in the results. If many things are being joined (i.e., you're using SPARQL) the query will explode exponentially in complexity.

Will you trust the answer?

Some kind of query rewriting (like the implementation of OWL 2 QL) might help, but we still lack a perimeter where we can test the system and give it a clean bill of health

Page 50: Making the semantic web work

Ordered collections are awkward in RDF: two ways to do it, because neither one is satisfying

RDF Containers

:Missions a rdf:Seq ; rdf:_1 :Mercury ; rdf:_2 :Gemini ; rdf:_3 :Apollo .

This could generate huge numbers of predicates; also, nothing stops one from accidentally using a numbered label more than once. The facts comprising this list could be spread across a system.

RDF Collections

:Missions a rdf:List ;
    rdf:first :Mercury ;
    rdf:rest _:n1 .
_:n1 rdf:first :Gemini ;
    rdf:rest _:n2 .
_:n2 rdf:first :Apollo ;
    rdf:rest rdf:nil .

Operations on a LISP-style list are slow because you need to follow lots of pointers. The use of blank nodes can protect Collections from modification (important in the OWL spec).

Neither construction is easy to query in (standard) SPARQL

Page 51: Making the semantic web work

Yet, some RDF syntaxes look almost the same as JSON/XML

JSON

{ "missions": [ "Mercury", "Gemini", "Apollo" ] }

TURTLE

:Missions :members (:Mercury :Gemini :Apollo) .

Most RDF tools will expand this into a LISP-list with blank nodes, but in TURTLE format the physical layout is the same as JSON.

Collections and Containers are described as “non-normative” in RDF 1.1; advanced tools may use special efficient representations (like would be used for JSON).

It’s awkward to work with ordered collections in the common “client-server” model that revolves around SPARQL engines, but for small graphs in memory, the situation is different – the Jena framework provides a facility for accessing Collections that feels a lot like accessing data in JSON

Ordered collections are critical for dealing with external data formats that support ordered collections AND critical for many traditional RDF use cases such as metadata (you'll find scientists are pretty sensitive to the order of authors on a paper)
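A minimal sketch of the Jena Collection facility mentioned above, reading the Turtle from this page (the example.org prefix is an assumption for illustration):

import org.apache.jena.rdf.model.*;
import java.io.StringReader;

public class ReadMissions {
    public static void main(String[] args) {
        String ttl =
            "@prefix : <http://example.org/> .\n" +
            ":Missions :members (:Mercury :Gemini :Apollo) .\n";
        Model m = ModelFactory.createDefaultModel();
        m.read(new StringReader(ttl), null, "TTL");

        Resource missions = m.createResource("http://example.org/Missions");
        Property members  = m.createProperty("http://example.org/members");

        // The parser expanded ( ... ) into rdf:first/rdf:rest blank nodes;
        // RDFList hides that plumbing and gives list-like access, in order.
        RDFList list = m.getRequiredProperty(missions, members).getObject().as(RDFList.class);
        list.asJavaList().forEach(System.out::println);
    }
}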

Page 52: Making the semantic web work

Another Bad Idea in Linked Data: DEREFERENCING (over HTTP)

In principle a client could ask questions about individual items and "follow its nose" to discover related information.

In practice, however, you miss data quality problems that are obvious when you look at data holistically (i.e., 47 instead of 50 states).

If the data was clean ahead of time, and if we understood the structure of the data completely ahead of time, dereferencing might work.

Since Linked Data does not enforce quality standards, however, dereferencing is one of those dangerous things that “almost works”.

Page 53: Making the semantic web work

John Martin    T   34    $17.50   I first met…
Barry Robnson  F   17    $12.76   Barry has…
Mary Capps     T   104   $541.99  Sometimes…
Eric Kramer    T   95    $214.22  Nobody who…
Matt Butts     F   32    $6.54    I've never…

Imagine we find a CSV file without any specification as to format…

Column observations:
• Most of these match a list of common first names
• Most of these match a list of common last names
• These look like Boolean values
• All of these are integers
• These look like monetary values
• These fields appear to contain free text

In the last example, we were able to make some pretty good guesses by looking at the data, not knowing anything about the names of the headers. This could go a long way towards interpreting this file in an automated way.

Add knowledge about the problem domain and we’re cooking with gas…

PROFILING

For best results, do analysis against ALL of the data!
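A minimal sketch, not from the slides, of this kind of column profiling; the categories and regular expressions are assumptions for illustration:

import java.util.List;
import java.util.regex.Pattern;

public class ColumnProfiler {
    private static final Pattern BOOL  = Pattern.compile("T|F|true|false", Pattern.CASE_INSENSITIVE);
    private static final Pattern INT   = Pattern.compile("-?\\d+");
    private static final Pattern MONEY = Pattern.compile("\\$\\d+(\\.\\d{2})?");

    // Guess a column's type by testing whether every observed value fits a pattern;
    // a real profiler would tolerate a few bad values and also consult name lists.
    static String guessType(List<String> values) {
        if (values.stream().allMatch(v -> BOOL.matcher(v).matches()))  return "boolean";
        if (values.stream().allMatch(v -> INT.matcher(v).matches()))   return "integer";
        if (values.stream().allMatch(v -> MONEY.matcher(v).matches())) return "money";
        return "free text (try first-name / last-name / place lookup lists next)";
    }

    public static void main(String[] args) {
        System.out.println(guessType(List.of("T", "F", "T")));        // boolean
        System.out.println(guessType(List.of("34", "17", "104")));    // integer
        System.out.println(guessType(List.of("$17.50", "$541.99")));  // money
    }
}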

Page 54: Making the semantic web work

Traditional Data Warehousing

POS sales data B

POS sales data C

POS sales data D

POS sales data A

Data from four different point-of-sale systems used in different parts of a company

CANONICAL DATA MODEL

The good: analysts work with consistent, clean data

The bad: the burden of normalizing the data when it is generated is felt acutely; in a worst case we could do this work and never end up analyzing the data.

The ugly: since the normalization was done before the requirements for analysis were known, normalized data may not satisfy the requirements of analysts

Page 55: Making the semantic web work

Data Lake Enabled by Hadoop

Ingestion is simple because we simply copy raw data of any kind to HDFS.

Development and operations are not burdened by ingestion requirements

Data import is lossless.

Compute and data are tightly coupled; we can “full scan” the data quickly at any time.

Data cleanup can be performed to meet requirements of specific uses AND can be informed by inspection of the complete data set.

Analysis can be performed on text and other kinds of data which cannot be normalized conventionally.

Page 56: Making the semantic web work

We can square this circle…

Data Lake: raw data; not perfect, but not damaged by the import process!

Per-project operations: data preparation is driven by requirements; no wasted time and no compromises

Project outputs: queries, predictive analytics, machine learning, other projects

Ontologies, taxonomies, and logic programming mean an increasing amount of work can be shared between projects

Page 57: Making the semantic web work

Putting Knowledge To Work (UNIT CONVERSION ONCE AGAIN)

EnglishTemp(location,amount) -> INSERT(MetricTemp(location,(5/9)*(amount-32)))

Conversion of a unit represented by a predicate is one simple rule that could be written by hand

Input data specification

Output data specification

Analysis of input and output schema reveals need for unit conversion; system gets conversion rule out of world knowledge library and specializes it

World Knowledge Libraries

General, industry-specific, company-specific

Code generation

Page 58: Making the semantic web work

Intelligent Data Preparation

Data Lake

Documentation, machine-readable schemas

describes

Scalable/Parallel

profiler

transformer

consumers

Ontologies Requirements

Knowledge base about instances (ex. Places) and common patterns in data expression (ex. Date formats)

broad spectrum

vertical specific

company specific

application specific

compiles rules

feedback

Iterative development process: generates and tests hypotheses