Dedalo: Looking for Cluster Explanations in a Labyrinth of Linked Data
DESCRIPTION
Presentation of Dedalo at the Extended Semantic Web Conference 2014 in Crete (ESWC 2014).
TRANSCRIPT
Dedalo: Looking for Clusters’ Explanations in a Labyrinth of Linked Data
Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta
Knowledge Media Institute, The Open University
May 28, 2014
The Knowledge Discovery process
• Explaining patterns requires background knowledge.
• Background knowledge is usually held by the experts.
• Background knowledge comes from different domains.
• Experts might not be aware of some background knowledge.
Explaining clusters: an example
Authors clustered according to the papers they wrote together.
How to explain those clusters?
Explaining clusters – the easy solution
Use an expert
“each cluster represents a research group in KMi”
Can one trust those experts?
Explaining clusters – the nice solution
Use Inductive Logic Programming (ILP)
E+ (positive examples): attendsESWC(M.dAquin). attendsESWC(V.Lopez).
E− (negative examples): attendsESWC(E.Motta).
B (knowledge about E = E+ ∪ E−): submitted(M.dAquin). submitted(V.Lopez). submitted(E.Motta). accepted(V.Lopez). accepted(M.dAquin).
Learn an explanation for the relation attendsESWC(X) that is complete (B ∪ H ⊨ E+) and consistent (B ∪ H ⊭ E−):
attendsESWC(X) <- submitted(X) ∧ accepted(X)
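The completeness and consistency conditions above can be sketched in a few lines. This is a minimal illustration only: the set-based encoding of B, the `covers` helper, and the Python framing are assumptions for the sketch, not part of an ILP system.

```python
# Toy sketch of the ILP check: a hypothesis H is complete if B ∪ H
# entails every positive example, and consistent if it entails no
# negative example. B is encoded as plain sets for illustration.
submitted = {"M.dAquin", "V.Lopez", "E.Motta"}  # B: submitted(X) facts
accepted = {"V.Lopez", "M.dAquin"}              # B: accepted(X) facts

e_pos = {"M.dAquin", "V.Lopez"}  # E+: attendsESWC(X) holds
e_neg = {"E.Motta"}              # E−: attendsESWC(X) does not hold

# Candidate hypothesis H: attendsESWC(X) <- submitted(X) ∧ accepted(X)
def covers(x):
    return x in submitted and x in accepted

complete = all(covers(x) for x in e_pos)        # B ∪ H ⊨ E+
consistent = not any(covers(x) for x in e_neg)  # B ∪ H ⊭ E−
print(complete, consistent)  # True True
```

E.Motta submitted but was not accepted, so the learned rule correctly excludes the negative example while covering both positives.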
Explaining clusters – still the nice solution
E+ (positive examples): inMyCluster(M.dAquin). inMyCluster(V.Lopez). inMyCluster(M.Sabou).
E− (negative examples): inMyCluster(M.Fernandez). inMyCluster(H.Saif). inMyCluster(C.Pedrinaci). inMyCluster(J.Domingue).
B (knowledge about E = E+ ∪ E−): B = ?
inMyCluster(X) <– ?
Explaining clusters – the cool solution
Integrate ILP with Linked Data
Explaining clusters – the cool solution
E+ (positive examples): inMyCluster(M.dAquin). inMyCluster(V.Lopez). inMyCluster(M.Sabou).
E− (negative examples): inMyCluster(M.Fernandez). inMyCluster(H.Saif). inMyCluster(C.Pedrinaci). inMyCluster(J.Domingue).
B (knowledge about E = E+ ∪ E−, from Linked Data):
topic(M.dAquin, SemanticWeb). topic(M.Sabou, SemanticWeb). topic(V.Lopez, SemanticWeb).
topic(H.Saif, SocialWeb). topic(M.Fernandez, SocialWeb).
topic(C.Pedrinaci, SemanticWebServices). topic(J.Domingue, SemanticWebServices).
inMyCluster(X) <- topic(X, SemanticWeb)
Is this enough?
Producing Linked Data Explanations
People working in the same place, on similar topics, or on the same project are likely to write papers together.
Producing Linked Data Explanations
People working under the same person, or with the same partner, are likely to write papers together.
Producing Linked Data Explanations
People working under people interested in the same thing write papers together.
Integrating ILP and Linked Data
Add to B each Linked Data explanation hi = 〈pk〉.〈vk〉*, where:
• pk (path): a chain of RDF properties, pk = {prop0 → prop1 → … → propn}
• vk (value): a final instance
• roots(hi): the elements ∈ Ci having hi in common, e.g. roots(hi) = {ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
*spread across different datasets
Example: hi = 〈ou:project→ou:ledBy→foaf:topic〉pk . 〈edu:SemanticWeb〉vk
Building each hi:
– how?
– which chains of properties?
– where to find the good ones?
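One way to picture the construction of each hi is a breadth-first traversal that follows chains of properties outward from the cluster's entities, recording which roots reach each (path, value) pair. The triple store below is a toy assumption loosely modelled on the slide's example, not Dedalo's actual implementation.

```python
from collections import deque

# Toy RDF triples (subject, property, object), assumed for illustration.
triples = [
    ("M.dAquin", "ou:project", "ProjA"),
    ("V.Lopez", "ou:project", "ProjA"),
    ("M.Sabou", "ou:project", "ProjB"),
    ("ProjA", "ou:ledBy", "E.Motta"),
    ("ProjB", "ou:ledBy", "E.Motta"),
    ("E.Motta", "foaf:topic", "edu:SemanticWeb"),
]

def build_hypotheses(roots, max_len=3):
    """Map each (property path, end value) pair to the roots reaching it."""
    hypotheses = {}
    for root in roots:
        queue = deque([(root, ())])  # (current node, path of properties so far)
        while queue:
            node, path = queue.popleft()
            if path:  # any reached node is a candidate value vk for path pk
                hypotheses.setdefault((path, node), set()).add(root)
            if len(path) < max_len:
                for s, p, o in triples:
                    if s == node:
                        queue.append((o, path + (p,)))
    return hypotheses

hyps = build_hypotheses({"M.dAquin", "V.Lopez", "M.Sabou"})
key = (("ou:project", "ou:ledBy", "foaf:topic"), "edu:SemanticWeb")
print(hyps[key])  # all three cluster members share this hypothesis
```

Here the path 〈ou:project→ou:ledBy→foaf:topic〉 ending in edu:SemanticWeb is reached by all three roots, mirroring the roots(hi) example on the slide.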
Dedalo – An iterative Linked Data traversal
Scoring hypotheses
WRacc1(hi) = (|roots(hi)| / |R|) · (|roots(hi) ∩ Ci| / |roots(hi)| − |Ci| / |R|)
1 Geng et al. (2006). Interestingness measures for data mining: A survey.
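The WRacc score above is coverage times the gain in precision over the cluster's base rate. A direct transcription, with illustrative sets for R, Ci and roots(hi):

```python
# Weighted Relative Accuracy of a hypothesis h_i, as defined above:
# WRacc(h_i) = |roots(h_i)|/|R| * (|roots(h_i) ∩ C_i|/|roots(h_i)| - |C_i|/|R|)
def wracc(roots_hi, cluster, population):
    coverage = len(roots_hi) / len(population)            # |roots|/|R|
    precision = len(roots_hi & cluster) / len(roots_hi)   # |roots ∩ Ci|/|roots|
    default = len(cluster) / len(population)              # |Ci|/|R|
    return coverage * (precision - default)

# Illustrative sets, not taken from the experiments.
R = {f"a{i}" for i in range(10)}   # whole population, |R| = 10
Ci = {"a0", "a1", "a2", "a3"}      # one cluster, |Ci| = 4
roots = {"a0", "a1", "a2", "a9"}   # entities sharing h_i
print(round(wracc(roots, Ci, R), 3))  # 0.4 * (0.75 - 0.4) = 0.14
```

A hypothesis scores highly only if it is both frequent (large coverage) and more concentrated in Ci than chance would predict.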
Dedalo – An iterative Linked Data traversal
How to define the interestingness of a path pk?
How to reach the best hi in the shortest time?
Dedalo – Comparing Heuristics
• We compare different search strategies.
• We want to find the path pk leading to the best hi in the shortest time.
• We want to save time and reduce computational complexity.
Path Length: length of pk in number of properties composing it
Path Frequency: frequency of the path in the graph
Adapted PMI: joint and individual distribution of pk and Ci
Adapted TF–IDF: how important pk (term) is in Ci (doc)
Delta: |vals(pk)| ≈ |C|
Entropy2: distribution of |vals(pk)|
Conditional Entropy: distribution of |vals(pk)| w.r.t. Ci
2Shannon, C. (1948). A Mathematical Theory of Communication.
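The entropy-based heuristics can be sketched as follows: a path whose values spread the entities across several groups carries more information for splitting a cluster than a path that maps almost everything to one value. The value assignments below are illustrative assumptions, not data from the experiments.

```python
from math import log2
from collections import Counter

# Shannon entropy of the distribution of values reached through a path p_k.
def entropy(values):
    counts = Counter(values)
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# A discriminating path (e.g. foaf:topic) spreads entities over values...
print(entropy(["SemWeb", "SemWeb", "SocialWeb", "Services"]))  # 1.5
# ...while a near-constant path carries no information for the search.
print(entropy(["KMi", "KMi", "KMi", "KMi"]))  # 0.0
```

Dedalo uses such scores to decide which paths are worth expanding at the next traversal cycle, rather than following every chain of properties blindly.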
Dedalo’s Heuristics
Ci={ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
Path Frequency top(pk)=〈foaf:topic〉
Dedalo’s Heuristics
Ci={ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
Adapted TF–IDF top(pk)=〈ou:exMember〉
Dedalo’s Heuristics
Ci={ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
Entropy top(pk)=〈ou:project→ou:ledBy→foaf:topic〉
Experiments – KMi co-authorship
• Authors clustered according to their co-authorships.
• Network Partitioning clustering, |R|=92, |C|= 6
[Figure: WRacc over traversal cycles (0–15) for the Semantic Web authors and the Learning Analytics authors clusters, comparing the Len, Fq, D, Ent, C.Ent, TFIDF and PMI heuristics; WRacc values range from 0.00 to 0.12.]
|Ci|  hi                                                                                        WRacc
22    〈org:hasMembership→ox:hasPrincipalInvestigator→org:hasMembership〉p . 〈ou:SmartProducts〉v1   0.128
23    〈org:hasMembership→ox:hasPrincipalInvestigator→org:hasMembership〉p . 〈ou:SocialLearn〉v2     0.127
Experiments – KMi Publications
• Papers clustered according to their keywords.
• XK-Means clustering, |R|=865, |C|= 6
[Figure: WRacc over traversal cycles (0–10) for the Learning Analytics papers and the Semantic Web papers clusters, comparing the Len, Fq, D, Ent, C.Ent, TFIDF and PMI heuristics.]
|Ci|  hi                                                      WRacc
601   〈dc:creator→ntag:isRelatedTo〉p . 〈ou:LearningAnalytics〉v1  0.042
220   〈dc:creator→ntag:isRelatedTo〉p . 〈ou:SemanticWeb〉v2        0.073
Experiments –Huddersfield’s dataset
• Books clustered according to the students’ Faculties.
• K-Means clustering, |R|=6969, |C|= 14
[Figure: WRacc over traversal cycles (0–15) for the Music students' borrowings and the Theatre students' borrowings clusters, comparing the Len, Fq, D, Ent, C.Ent, TFIDF and PMI heuristics.]
|Ci|  hi                                                            WRacc
335   〈dc:subject→skos:broader〉p1 . 〈lcsh:PhysicalScience〉v           0.005
919   〈dc:creator→bl:hasCreated→dc:subject〉p2 . 〈bl:EnglishDrama〉v    0.013
Experiments – Comparing heuristics
Heuristics speed comparison in seconds.
                      KMiA1  KMiA2  KMiP1  KMiP2  Hud1   Hud2
Len                   1.64   4.15   8.95   9.01   69.13  135.5
Freq                  2.57   4.35   7.5    9.29   180    180
PMI                   2.05   3.88   11.28  18.42  180    180
TF–IDF                1.69   3.18   10.61  17.19  180    180
Delta                 2.02   3.92   180    180    180    180
Entropy               4.19   3.27   7.1    7.3    41.15  105.09
Conditional Entropy   2.64   3.89   7.48   7.55   70.91  40.89
(−) Len, Freq: fast but inaccurate baselines
(+) Entropy/Conditional Entropy: outperforming measures, reducing redundancy (following wrong paths) and time efforts
(−) PMI, TF–IDF, Delta: they might work on less homogeneous clusters
Conclusions
• Linked Data – automatically explaining clusters
• Dedalo – traversing Linked Data to reveal explanations
• Entropy – driving the search in the Linked Data cloud
Beyond Dedalo. Dedalo works as long as the domain is limited; new use cases require its extension.
Future work: the OU students enrolment dataset
• Add sameAs linking
• Use of literals
• Aggregation of atomic rules
• Explore new hypotheses evaluation measures
Thanks for your attention!3
[email protected], [email protected]
Questions? Better asking the robot than the experts.
3Special thanks to the KMi (happy) faces.