the empirical turn in knowledge representation
Creative Commons CC BY 3.0:
allowed to share & remix
(also commercial)
but must attribute
Frank van Harmelen
The empirical turn in
Knowledge Representation
Contributions from many people in the KR&R group over many years.
And thanks to NWO for a 750k€ TOP grant for this
KR in the pre-empirical era
Handbook of Knowledge Representation (1000 pages, ToC alone is 14 pages)
• propositional logic & satisfiability solvers
• first order logic & resolution
• description logic
• constraint (logic) programming
• nonmonotonic reasoning
• belief revision
• qualitative reasoning
• model-based diagnosis
• bayesian networks
• temporal logic
• spatial reasoning
• epistemic logic
• deontic logic
• situation calculus
• default logic
• event calculus
• ……
KR metrics in the pre-empirical era
KR = logic
• Show small examples
• Prove properties (expressivity, complexity)
• Give algorithms (sound, complete)
KR = engineering
• Build applications
• Show high performance
• Show low engineering costs
BUT AN EXPERIMENT IN THE PAST 10 YEARS
MADE IT POSSIBLE TO DO SOMETHING VERY DIFFERENT:
OBSERVE HOW KNOWLEDGE REPRESENTATIONS BEHAVE
AT VERY LARGE SCALE
Rest of the talk
• Which KR’s were part of the experiment?
• How much of it was there to observe?
• How did we manage to observe it?
• What did we learn from observing it?
Which KR’s ?
RDF (for non-logicians)
RDF (for logicians)
• ground binary predicate: P(O₁, O₂)
• Limited existential variables: ∃x: P(C₁, x) ∧ P(C₂, x)
• Type is unary predicate: Tᵢ(x)
• Subtypes: ∀x: T₁(x) → T₂(x)
• Type restrictions: ∀x, y: P(x, y) → T₁(x) ∧ T₂(y)
• Equality: O₁ = O₂
• Extensions to DL:
– Disjointness of types
– Cardinality restrictions (0,1)
– always decidable: sub-FOL.
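The subtype and type-restriction rules above can be read as forward-chaining rules over triples. The following is a minimal sketch of that reading, not the deduction procedure of any particular RDF reasoner. The predicate names ("type", "subTypeOf", "domain", "range") and the example facts, including the assumption that analgesic is a subtype of drug, are made up for illustration.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical predicate names chosen for this sketch.
TYPE, SUBTYPE, DOMAIN, RANGE = "type", "subTypeOf", "domain", "range"


def closure(facts: Set[Triple]) -> Set[Triple]:
    """Apply the subtype and type-restriction rules until nothing new appears."""
    inferred = set(facts)
    while True:
        new: Set[Triple] = set()
        for s, p, o in inferred:
            for s2, p2, o2 in inferred:
                # Subtypes: T1(x) and (T1 subTypeOf T2) give T2(x).
                if p == TYPE and p2 == SUBTYPE and o == s2:
                    new.add((s, TYPE, o2))
                # Type restrictions: P(x, y) gives T1(x) for the domain T1 of P ...
                if p2 == DOMAIN and s2 == p:
                    new.add((s, TYPE, o2))
                # ... and T2(y) for the range T2 of P.
                if p2 == RANGE and s2 == p:
                    new.add((o, TYPE, o2))
        if new <= inferred:
            return inferred
        inferred |= new


# Illustrative facts (assumptions, not taken from a real dataset).
facts = {
    ("aspirin", "treats", "headache"),
    ("treats", DOMAIN, "analgesic"),
    ("treats", RANGE, "symptom"),
    ("analgesic", SUBTYPE, "drug"),
}
result = closure(facts)
```

The nested loop is quadratic in the number of facts, which is fine for a toy example but not for billions of triples.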
RDF deduction
OWL Semantics
How much is there to observe?
± 45-100 billion facts
1 fact
How big is 100 billion
≈ 1 fact per web-page
(Denny Vrandečić – AIFB, Universität Karlsruhe)
100 billion golfballs ≈ Jupiter
x T
[<x> IsOfType <T>]
different owners & locations
< analgesic >
BTW: How did it get so big?
On the Web, anybody can say anything about anything
x T
R
How did you manage to observe it?
LOD Laundromat: Beek & Rietveld et al. 2014, LOD laundromat: a uniform way of publishing other people's dirty data, http://lodlaundromat.org/pdf/lodlaundry.pdf
HDT: Fernández & Martínez-Prieto & Gutiérrez, 2013, Binary RDF representation for publication and exchange (HDT)
LDF: Verborgh & Vander Sande et al. 2014, Web-Scale Querying through Linked Data Fragments
LOD-a-lot: http://lod-a-lot.lod.labs.vu.nl/
Surprisingly efficient
1 file
28,362,198,927 unique triples
>650K data documents
524 GB of disk space
16 GB of RAM
Only €305,- hardware cost
Meta-Data for a lot of LOD: http://www.semantic-web-journal.net/content/meta-data-lot-lod-2
Statistics (boring)
triples 28,362,198,927
subjects 3,214,347,198
predicates 1,168,932
objects 3,178,409,386
literals 5.3B
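Counts of this kind are distinct-value counts per triple position. A minimal in-memory sketch of how they could be computed is shown here. The function name and the sample triples are hypothetical, and the real figures were obtained from the LOD-a-lot file, not with code like this.

```python
from typing import Dict, Iterable, Tuple

Triple = Tuple[str, str, str]


def triple_statistics(triples: Iterable[Triple]) -> Dict[str, int]:
    """Count unique triples and the distinct terms in each position."""
    unique = set(triples)  # duplicate statements are counted once
    return {
        "triples": len(unique),
        "subjects": len({s for s, _, _ in unique}),
        "predicates": len({p for _, p, _ in unique}),
        "objects": len({o for _, _, o in unique}),
    }


# Made-up sample data; the third statement repeats the first.
sample = [
    ("a", "knows", "b"),
    ("b", "knows", "c"),
    ("a", "knows", "b"),
    ("a", "type", "Person"),
]
stats = triple_statistics(sample)
```

Holding every term in a Python set would not scale to billions of triples; the sketch only illustrates what is being counted.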
Re-use is fairly high… or not…
Analysing Logical identity
Joe Raad, Wouter Beek (ESWC 2018, under submission)
Identity clusters
1. Extracting all owl:sameAs statements on the LOD
LOD-a-lot file: http://lod-a-lot.lod.labs.vu.nl [Fernández 2017]
558 million owl:sameAs statements (309 million distinct terms)
≈ 4 hours
HDT file (4.5 GB)
2. Generating the Identity Closure
HDT file (4.5 GB)
x owl:sameAs y, z owl:sameAs y
Identity Closure: {x, y, z}
Identity Closure 1, Identity Closure 2, … Identity Closure 89 387 082
- The largest Identity Closure contains 177 794 terms (including all the countries in the world, Albert Einstein, « empty string », etc.)
- The smallest Identity Closure contains 2 terms
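An identity closure groups all terms that are connected by owl:sameAs links, directly or through intermediate terms. One common way to compute such groups is a union-find (disjoint-set) structure. The sketch below assumes the owl:sameAs statements are available as pairs of strings. It illustrates the idea and is not necessarily the implementation used in this work.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def identity_closures(same_as: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group terms into sets that are connected through owl:sameAs pairs."""
    parent: Dict[str, str] = {}

    def find(term: str) -> str:
        parent.setdefault(term, term)
        root = term
        while parent[root] != root:
            root = parent[root]
        # Path compression: point every visited term directly at the root.
        while parent[term] != root:
            parent[term], term = root, parent[term]
        return root

    for left, right in same_as:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two closures

    groups: Dict[str, Set[str]] = defaultdict(set)
    for term in parent:
        groups[find(term)].add(term)
    return list(groups.values())


# x owl:sameAs y and z owl:sameAs y end up in one closure; a and b in another.
pairs = [("x", "y"), ("z", "y"), ("a", "b")]
closures = identity_closures(pairs)
```

Because owl:sameAs is symmetric and transitive, the direction of each pair does not matter for the result.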
Identity Closure « Cities »
3. Detecting Communities (using the Louvain Algorithm)
This network (i.e. identity closure) has a community structure, as it can be grouped into different sets of nodes, with each set of nodes being densely connected internally.
Goal: find (and later evaluate) the most “suspicious” identity links (i.e. the links between different communities)
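Once a community detection algorithm such as Louvain has assigned every term to a community, the candidate suspicious links are those whose two endpoints fall in different communities. The sketch below covers only that last filtering step and takes the community assignment as given. The function name, the term names and the community numbers are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Link = Tuple[str, str]


def suspicious_links(links: Iterable[Link], community_of: Dict[str, int]) -> List[Link]:
    """Return the identity links whose endpoints lie in different communities."""
    return [(a, b) for a, b in links if community_of[a] != community_of[b]]


# Hypothetical identity closure with two communities (0 and 3).
links = [("a1", "a2"), ("a2", "b1"), ("b1", "b2")]
community_of = {"a1": 0, "a2": 0, "b1": 3, "b2": 3}
candidates = suspicious_links(links, community_of)
```

In this toy input only the link between a2 and b1 crosses a community boundary, so it is the one that would be passed on for evaluation.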
4. Application: debugging identity statements
Identity closure containing the term
“dbpedia.org/page/Barack_Obama”
This Identity Closure contains 388 terms (i.e. 387 distinct terms are owl:sameAs this term)
95 communities detected
largest community = 99 terms
4. Application: debugging identity statements
comm0 and comm3: connected by 2 links
Community 0
1. dbpedia.org/resource/B_hussein_obama
2. dbpedia.org/resource/Barack_H_Obama,_Jr
3. dbpedia.org/resource/Barak_hussein_obama
4. dbpedia.org/resource/President_Barack
5. dbpedia.org/resource/Senator_Barack_Obama
6. dbpedia.org/resource/Obama
…
99. dbpedia.org/resource/Hussein_Obama
Community 3
1. dbpedia.org/resource/Presidency_of_Barack_Obama
2. dbpedia.org/resource/Barack_Obama_Administration
3. dbpedia.org/resource/Barack_Obama_Cabinet
4. dbpedia.org/resource/Obama_White_House
5. dbpedia.org/resource/Obama_regime
6. dbpedia.org/resource/America_under_Obama
…
52. dbpedia.org/resource/Presidential_transition_of_Barack_Obama
Symbols or words?
Steven de Rooij, Peter Bloem, Wouter Beek (ISWC 2016) http://www.cs.vu.nl/~frankh/postscript/ISWC2016.pdf
Symbols or words?
Symbol names are supposed to be meaningless
Aspirin treats headache
analgesic treats pain
(types: drug, symptom)
Measure mutual information content between string and semantics of a symbol
E(x) = length of an efficient encoding of x
Mutual information content
M(x, y) = E(x) + E(y) – E(x, y)
Take x = symbol name of x as a string
Take 𝑦1 = {types of x} ≈ semantics of x
Take 𝑦2 = {properties of x} ≈ semantics of x
Calculate M(x, 𝑦1) and M(x, 𝑦2) for all symbols in 600k datasets
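The formula M(x, y) = E(x) + E(y) – E(x, y) can be approximated with any general-purpose compressor. The sketch below uses zlib as a stand-in for the "efficient encoding" and compresses the concatenation of x and y to estimate E(x, y). Both choices are assumptions made for illustration; the encoding used in the actual study is not given on these slides, and the example URIs are invented.

```python
import zlib


def E(data: bytes) -> int:
    """Length in bytes of a zlib encoding of data (stand-in for an efficient encoding)."""
    return len(zlib.compress(data, 9))


def mutual_information(x: bytes, y: bytes) -> int:
    """Estimate M(x, y) = E(x) + E(y) - E(x, y), with E(x, y) taken as E of x followed by y."""
    return E(x) + E(y) - E(x + y)


# Hypothetical symbol name and type set, encoded as byte strings.
name = b"http://example.org/drug/Aspirin"
types = b"http://example.org/type/Drug http://example.org/type/Analgesic"
score = mutual_information(name, types)
```

On strings this short the compressor overhead dominates, so the estimate is noisy; it becomes meaningful only when aggregated over many symbols, as in the per-dataset analysis described above.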
But variables do encode meaning!
Fraction of datasets with redundancy for types/predicates at significance level > 0.99
BTW, this is 600,000 data points (RDF docs)
Very different network structures
for different predicates
Tobias Kuhn, Wouter Beek http://ceur-ws.org/Vol-1946/paper-05.pdf
skos:exactMatch
foaf:knows
osspr:contains
Geopolitics:hasborderWith
Summary &
So what…
• We now have larger KB’s than ever before
• We now have the instruments to observe and analyse these very large KB’s
• We can use these insights for better tools:
– query & inference
– publish & maintain
– visualise & explain
– …
But my secret hope is that this will help us to understand the patterns of knowledge:
AI as a computational theory of knowledge