Slides for "ROCKER – A Refinement Operator for Key Discovery"

Tommaso Soru, Edgard Marx, Axel-Cyrille Ngonga Ngomo
AKSW, Department of Computer Science, University of Leipzig, Germany
May 22, 2015, WWW 2015, Florence, Italy

Upload: tommaso-soru

Post on 03-Aug-2015

State of the Linked Open Data cloud.

353 accessible RDF datasets; ~74 billion triples.

Sources: State of the LOD cloud, LODStats, 2015.

Decentralized data publication.

The real-world entity “Florence, Italy” is described in:

DBpedia
LinkedGeoData
GeoNames

Unique descriptions of resources.

Entity search.

Data integration.

Linked data compression.

Link discovery.

Question answering.

Data quality.

Keys.

Background.

A key is a set of properties which can distinguish all instances of a class in a knowledge base.

[Figure: example graph with the films :Oceans_Eleven and :The_Mexican, the actors :Brad_Pitt and :Julia_Roberts, four :hasActor edges, and four rdfs:label edges to the literals “Ocean’s Eleven”, “The Mexican”, “Brad Pitt”, and “Julia Roberts”.]

Background.

A key is a minimal key if none of its proper subsets is also a key.

Class: dbpedia-owl:Film

candidate key              distinguishable resources   key?   min-key?
{rdfs:label}               2 / 2                       yes    yes
{:hasActor}                1 / 2                       no     no
{rdfs:label, :hasActor}    2 / 2                       yes    no
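The key and minimal-key checks in the table can be sketched in a few lines of Python. This is an illustration only, not ROCKER's code: the data is an assumed reading of the example graph (both films share both actors), and the helper names `distinguishable`, `is_key`, and `is_minimal_key` are hypothetical. Counting distinct value signatures is one interpretation that reproduces the 2 / 2, 1 / 2, and 2 / 2 figures above; the paper's formal definition may differ.

```python
# Toy data: an assumed reading of the slide's example graph.
FILMS = {
    ":Oceans_Eleven": {
        "rdfs:label": frozenset({"Ocean's Eleven"}),
        ":hasActor": frozenset({":Brad_Pitt", ":Julia_Roberts"}),
    },
    ":The_Mexican": {
        "rdfs:label": frozenset({"The Mexican"}),
        ":hasActor": frozenset({":Brad_Pitt", ":Julia_Roberts"}),
    },
}


def distinguishable(instances, props):
    """Count distinct value signatures over props (hypothetical helper)."""
    props = sorted(props)
    signatures = {
        tuple(values.get(p, frozenset()) for p in props)
        for values in instances.values()
    }
    return len(signatures)


def is_key(instances, props):
    """A candidate is a key if every instance has its own signature."""
    return distinguishable(instances, props) == len(instances)


def is_minimal_key(instances, props):
    """A key is minimal if dropping any single property breaks it."""
    props = set(props)
    if not is_key(instances, props):
        return False
    return not any(is_key(instances, props - {p}) for p in props)


for candidate in ({"rdfs:label"}, {":hasActor"}, {"rdfs:label", ":hasActor"}):
    print(sorted(candidate),
          distinguishable(FILMS, candidate),
          is_key(FILMS, candidate),
          is_minimal_key(FILMS, candidate))
```

Checking only the subsets with one property removed is enough here, because any smaller subset that is a key would make those larger subsets keys as well.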

Background.

A set of properties is called an n-almost-key for a class if it can distinguish all except n instances of that class.

[Figure: example graph in which the films :Interstellar, :Blade_Runner, and :2001_A_Space_Odyssey are linked by :filmedIn to the countries :Canada, :Iceland, :United_States, and :United_Kingdom; :WALL-E is also shown.]

ROCKER’s score function.

The score function expresses the rate of distinguishable instances in a class, given a set of properties (i.e., a candidate key).

[Figure: the instances :Interstellar, :Blade_Runner, :2001_A_Space_Odyssey, and :WALL-E, annotated with a ✗.]

\[
\mathrm{score}(\{\texttt{:filmedIn}\}) = \frac{\left|\{\, s \in S : \forall s' \in S,\ s \neq s' \Rightarrow \mathrm{discr}(s, s', \{\texttt{:filmedIn}\}) \,\}\right|}{|S|} = .75
\]

An n-almost-key has a score of at least

\[
\alpha = \frac{|S| - n}{|S|}.
\]
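A small Python sketch may make the score and the n-almost-key threshold concrete. Everything here is an assumption for illustration: the `:filmedIn` values are a guessed reading of the example graph, `score` counts distinct value signatures (an interpretation that yields the .75 shown above for this data), and `is_n_almost_key` is a hypothetical helper name.

```python
# Assumed reading of the example graph; :WALL-E has no :filmedIn value.
FILMED_IN = {
    ":Interstellar": {":filmedIn": frozenset({":Canada", ":Iceland", ":United_States"})},
    ":Blade_Runner": {":filmedIn": frozenset({":United_States", ":United_Kingdom"})},
    ":2001_A_Space_Odyssey": {":filmedIn": frozenset({":United_States", ":United_Kingdom"})},
    ":WALL-E": {},
}


def score(instances, props):
    """Rate of distinct value signatures among all instances."""
    props = sorted(props)
    signatures = {
        tuple(values.get(p, frozenset()) for p in props)  # empty set stands for null
        for values in instances.values()
    }
    return len(signatures) / len(instances)


def is_n_almost_key(instances, props, n):
    """Check score >= alpha, with alpha = (|S| - n) / |S|."""
    alpha = (len(instances) - n) / len(instances)
    return score(instances, props) >= alpha


print(score(FILMED_IN, {":filmedIn"}))               # 0.75
print(is_n_almost_key(FILMED_IN, {":filmedIn"}, 1))  # True
print(is_n_almost_key(FILMED_IN, {":filmedIn"}, 0))  # False
```

With four instances and n = 1, the threshold is (4 - 1) / 4 = .75, so under these assumptions {:filmedIn} is a 1-almost-key but not a key.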

Contribution #1

A more complete definition of key.

All object values are considered.

Null values are accepted.

[Figure: example graph with :Interstellar and :Blade_Runner linked by :filmedIn to :Canada, :Iceland, :United_States, and :United_Kingdom; :WALL-E is also shown.]

Properties of a key.

Key monotonicity. Adding a property to a key yields another key.

{:p1, :p2} ∪ {:p3} = {:p1, :p2, :p3}

Non-key monotonicity. Removing a property from a non-key yields another non-key.

{:p1, :p2, :p4} ∖ {:p2} = {:p1, :p4}

Proposed approach.

We adopt a refinement operator to refine candidates.

[Figure: refinement tree over the candidates {:p1}, {:p2}, {:p3}, {:p1, :p2}, {:p1, :p3}, {:p2, :p3}, and {:p1, :p2, :p3}.]
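As a rough illustration of what a refinement step could look like, the sketch below generates the children of a candidate by adding one property at a time. This is an assumption about the direction of refinement (from smaller to larger property sets); the function name `refine` is hypothetical and not taken from ROCKER's source.

```python
def refine(candidate, properties):
    """Return the children of a candidate: one extra property each."""
    candidate = frozenset(candidate)
    return [candidate | {p} for p in sorted(properties) if p not in candidate]


PROPS = {":p1", ":p2", ":p3"}

# Children of the empty candidate are the three single-property sets.
print(refine(frozenset(), PROPS))

# Children of {:p1} are {:p1, :p2} and {:p1, :p3}.
print(refine({":p1"}, PROPS))
```

A candidate that already contains every property has no children, which is where a branch of the tree ends.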

Proposed approach.

Pro. The score function induces a quasi-ordering ‘≼’ over the set of all candidates: P ≼ Q means score(P) ≤ score(Q).

Contra. Visiting the refinement tree is an intractable problem: n properties yield 2ⁿ − 1 nodes.
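To see where the 2ⁿ − 1 figure comes from, the following sketch enumerates every non-empty subset of a property set and compares the count with the closed form. The helper names are hypothetical and serve only this illustration.

```python
from itertools import combinations


def non_empty_subsets(properties):
    """Yield every non-empty subset of the given properties."""
    props = sorted(properties)
    for size in range(1, len(props) + 1):
        for combo in combinations(props, size):
            yield frozenset(combo)


def count_candidates(n):
    """Number of non-empty subsets of n properties."""
    return 2 ** n - 1


subsets = list(non_empty_subsets({":p1", ":p2", ":p3"}))
print(len(subsets), count_candidates(3))  # 7 7
print(count_candidates(20))               # 1048575
```

Three properties give only seven candidates, but twenty already give more than a million, which is why an exhaustive visit does not scale.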

Solutions to intractability.

Prune branches using key monotonicity:
• for all descendants of a key;
• for all ancestors of a non-key.

Consider only a subset of popular properties.

Provide a “fast search” option which selects one of the multiple discovery strategies.
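The two pruning rules can be sketched as a lookup that decides a candidate's status without scoring it. The sketch assumes that descendants are supersets and ancestors are subsets (refinement adds properties); `infer_status` is a hypothetical helper, not part of ROCKER's API.

```python
def infer_status(candidate, known_keys, known_non_keys):
    """Return "key", "non-key", or None if the candidate must be scored."""
    candidate = frozenset(candidate)
    # Key monotonicity: any superset of a known key is also a key.
    if any(key <= candidate for key in known_keys):
        return "key"
    # Non-key monotonicity: any subset of a known non-key is also a non-key.
    if any(candidate <= non_key for non_key in known_non_keys):
        return "non-key"
    return None


print(infer_status({":p1", ":p3"}, [{":p1"}], []))   # key
print(infer_status({":p2"}, [], [{":p2", ":p3"}]))   # non-key
print(infer_status({":p2"}, [{":p1"}], []))          # None
```

Only candidates for which the lookup returns None need their score computed against the data.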

Algorithm.

[Flowchart. Start: Frontier := {∅}. Steps and decisions shown: “Sort by score”; “Top el. score?” (branches < α and ≥ α); “Halt”; “Refine pivot, remove pivot & add children to frontier”; “Has children?” (yes / no); “Next child”; “Ancestor of !key?” (true / false); “Descendant of key?” (yes / no); “Score?” (< α / ≥ α); “Add to keys”; “Add to !keys”.]
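The flowchart's exact control flow is not recoverable from this transcript, so the following is only a simplified best-first search in the same spirit, not the authors' algorithm. It assumes that refinement adds one property at a time, that the highest-scoring candidate is expanded first, and that the score counts distinct value signatures; it applies only the "descendant of a key" pruning. All names are hypothetical.

```python
import heapq
from itertools import count

# Toy data: an assumed reading of the two-film example graph.
FILMS = {
    ":Oceans_Eleven": {
        "rdfs:label": frozenset({"Ocean's Eleven"}),
        ":hasActor": frozenset({":Brad_Pitt", ":Julia_Roberts"}),
    },
    ":The_Mexican": {
        "rdfs:label": frozenset({"The Mexican"}),
        ":hasActor": frozenset({":Brad_Pitt", ":Julia_Roberts"}),
    },
}


def score(instances, props):
    """Rate of distinct value signatures among all instances."""
    props = sorted(props)
    signatures = {
        tuple(values.get(p, frozenset()) for p in props)
        for values in instances.values()
    }
    return len(signatures) / len(instances)


def discover_minimal_keys(instances, properties, alpha=1.0):
    """Best-first search from the empty candidate (simplified sketch)."""
    properties = sorted(properties)
    keys = []
    seen = {frozenset()}
    tie = count()  # tie-breaker so the heap never compares frozensets
    frontier = [(-score(instances, frozenset()), next(tie), frozenset())]
    while frontier:
        _, _, pivot = heapq.heappop(frontier)  # highest score first
        for p in properties:
            child = pivot | {p}
            if child in seen:
                continue
            seen.add(child)
            if any(k <= child for k in keys):
                continue  # descendant of a key: a key, but not minimal
            s = score(instances, child)
            if s >= alpha:
                keys.append(child)
            else:
                heapq.heappush(frontier, (-s, next(tie), child))
    # Best-first order may reach a superset before a smaller key, so filter.
    return [k for k in keys if not any(other < k for other in keys)]


print(discover_minimal_keys(FILMS, {"rdfs:label", ":hasActor"}))
```

On the toy data with α = 1, the only candidate returned is {rdfs:label}; {rdfs:label, :hasActor} is skipped because it is a superset of a key already found.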

Refinement operator.

[Animated figure: a step-by-step walk through the refinement tree over {:p1}, {:p2}, {:p3}, {:p1, :p2}, {:p1, :p3}, {:p2, :p3}, and {:p1, :p2, :p3}, starting from ∅. The legend distinguishes the frontier, min-keys, max-non-keys, unvisited nodes, and visited nodes.]

Related work on key discovery.

Linkkey (Atencia et al., 2014)
• Tool able to retrieve keys.
• Relies on an incomplete definition of key.
• State of the art for small datasets.

SAKey (Symeonidou et al., 2014)
• Tool able to retrieve keys and n-almost-keys.
• Relies on an incomplete definition of key.
• State of the art on bigger datasets.

KD2R (Symeonidou et al., 2011)
• Tool able to retrieve keys.
• Relies on an incomplete definition of key.

Evaluation.

Runtime.

Memory consumption.

Quality of the keys found.

Results – Runtime.

Dataset                               ROCKER       Linkkey     SAKey
OAEI Restaurant1 (10^?)               1,880        1,698       1,028
DBpedia Person Function (10^?)        14,565       OutOfMem    6,221
DBpedia Career Station (10^?)         79,964       OutOfMem    2,199,854
DBpedia Organisation Member (10^?)    1,075,679    227,336     OutOfMem
DBpedia Village (10^?)                4,224,338    OutOfMem    OutOfMem
DBpedia Musical Work (10^?)           2,524,120    OutOfMem    OutOfMem

Dataset sizes in triples. Results in milliseconds.

Results – RAM consumption.

Dataset                               ROCKER     Linkkey     SAKey
OAEI Restaurant1 (10^?)               ~5 MB      ~2 MB       ~2 MB
DBpedia Person Function (10^?)        2.5 GB     > 16 GB     1.8 GB
DBpedia Career Station (10^?)         3.5 GB     > 16 GB     14.0 GB
DBpedia Organisation Member (10^?)    3.8 GB     14.5 GB     > 16 GB
DBpedia Village (10^?)                4.1 GB     > 16 GB     > 16 GB
DBpedia Musical Work (10^?)           5.0 GB     > 16 GB     > 16 GB

Dataset sizes in triples. Experiments were run on a 16 GB Ubuntu Linux machine.

Runtime by threshold.

Retrieve all candidates whose score is above a threshold α.

[Figure: runtime in milliseconds for α = 1 and α = .999.]

[Figure: runtime (ms) by threshold for dataset dbpedia:Monument.]

Contributions.

Complete definition of keys by considering multi-object properties and null values.

Better scalability: faster execution on larger datasets and lower memory consumption.

Running ROCKER without restrictions is guaranteed to return minimal keys.

Info and future work.

ROCKER is part of LIMES, a link discovery framework. Its source code is online at http://github.com/AKSW/rocker.

A demo is currently under development to show how ROCKER can improve data quality by searching for n-almost-keys.

We will evaluate ROCKER inside the link discovery workflow, i.e., how can keys help find good link specifications?

Tommaso Soru
PhD student at University of Leipzig

Room P905, Fakultät für Mathematik und Informatik
Augustusplatz 10, D-04109 Leipzig, Germany

http://tommaso-soru.it

Proceedings: http://www.www2015.it/documents/proceedings/proceedings/p1025.pdf