Dedalo: Looking for Cluster Explanations in a Labyrinth of Linked Data
DESCRIPTION
Presentation of Dedalo at the Extended Semantic Web Conference 2014 in Crete (ESWC 2014).
TRANSCRIPT
Dedalo: Looking for Clusters’ Explanations in a Labyrinth of Linked Data
Ilaria Tiddi, Mathieu d’Aquin, Enrico Motta
Knowledge Media Institute, The Open University
May 28, 2014
The Knowledge Discovery process
• Explaining patterns requires background knowledge.
• Background knowledge is usually held by the experts.
• Background knowledge comes from different domains.
• Experts might not be aware of some background knowledge.
Explaining clusters: an example
Authors clustered according to the papers they wrote together.
How to explain those clusters?
Explaining clusters – the easy solution
Use an expert
“each cluster represents a research group in KMi”
Can one trust those experts?
Explaining clusters – the nice solution
Use Inductive Logic Programming (ILP)
E+ (positive examples): attendsESWC(M.dAquin). attendsESWC(V.Lopez).
E− (negative examples): attendsESWC(E.Motta).
B (knowledge about E = E+ ∪ E−): submitted(M.dAquin). submitted(V.Lopez). submitted(E.Motta). accepted(V.Lopez). accepted(M.dAquin).
Learn an explanation for the relation attendsESWC(X) that is complete (B ∪ H ⊨ E+) and consistent (B ∪ H ⊭ E−):
attendsESWC(X) <- submitted(X) ∧ accepted(X)
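The completeness and consistency conditions above can be sketched in a few lines. This is a minimal illustration only: the set-based encoding of B, the `covers` helper, and the Python framing are assumptions for the sketch, not part of an ILP system.

```python
# Toy sketch of the ILP check: a hypothesis H is complete if B ∪ H
# entails every positive example, and consistent if it entails no
# negative example. B is encoded as plain sets for illustration.
submitted = {"M.dAquin", "V.Lopez", "E.Motta"}  # B: submitted(X) facts
accepted = {"V.Lopez", "M.dAquin"}              # B: accepted(X) facts

e_pos = {"M.dAquin", "V.Lopez"}  # E+: attendsESWC(X) holds
e_neg = {"E.Motta"}              # E−: attendsESWC(X) does not hold

# Candidate hypothesis H: attendsESWC(X) <- submitted(X) ∧ accepted(X)
def covers(x):
    return x in submitted and x in accepted

complete = all(covers(x) for x in e_pos)        # B ∪ H ⊨ E+
consistent = not any(covers(x) for x in e_neg)  # B ∪ H ⊭ E−
print(complete, consistent)  # True True
```

E.Motta submitted but was not accepted, so the learned rule correctly excludes the negative example while covering both positives.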
Explaining clusters – still the nice solution
E+ (positive examples): inMyCluster(M.dAquin). inMyCluster(V.Lopez). inMyCluster(M.Sabou).
E− (negative examples): inMyCluster(M.Fernandez). inMyCluster(H.Saif). inMyCluster(C.Pedrinaci). inMyCluster(J.Domingue).
B (knowledge about E = E+ ∪ E−): B = ?
inMyCluster(X) <– ?
Explaining clusters – the cool solution
Integrate ILP with Linked Data
Explaining clusters – the cool solution
E+ (positive examples): inMyCluster(M.dAquin). inMyCluster(V.Lopez). inMyCluster(M.Sabou).
E− (negative examples): inMyCluster(M.Fernandez). inMyCluster(H.Saif). inMyCluster(C.Pedrinaci). inMyCluster(J.Domingue).
B (knowledge about E = E+ ∪ E−, from Linked Data):
topic(M.dAquin, SemanticWeb). topic(M.Sabou, SemanticWeb). topic(V.Lopez, SemanticWeb).
topic(H.Saif, SocialWeb). topic(M.Fernandez, SocialWeb).
topic(C.Pedrinaci, SemanticWebServices). topic(J.Domingue, SemanticWebServices).
inMyCluster(X) <- topic(X, SemanticWeb)
Is this enough?
Producing Linked Data Explanations
People working in the same place, on similar topics, or on the same project are likely to write papers together.
Producing Linked Data Explanations
People working under the same person, or with the same partner, are likely to write papers together.
Producing Linked Data Explanations
People working under people interested in the same thing write papers together.
Integrating ILP and Linked Data
Add to B each Linked Data explanation hi = 〈pk〉.〈vk〉*, where:
• pk (path): a chain of RDF properties, pk = {prop0 → prop1 → … → propn}
• vk (value): a final instance
• roots(hi): the elements ∈ Ci having hi in common, e.g. roots(hi) = {ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
*spread across different datasets
Example: hi = 〈ou:project→ou:ledBy→foaf:topic〉pk . 〈edu:SemanticWeb〉vk
Building each hi:
– how?
– which chains of properties?
– where to find the good ones?
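One way to picture the construction of each hi is a breadth-first traversal that follows chains of properties outward from the cluster's entities, recording which roots reach each (path, value) pair. The triple store below is a toy assumption loosely modelled on the slide's example, not Dedalo's actual implementation.

```python
from collections import deque

# Toy RDF triples (subject, property, object), assumed for illustration.
triples = [
    ("M.dAquin", "ou:project", "ProjA"),
    ("V.Lopez", "ou:project", "ProjA"),
    ("M.Sabou", "ou:project", "ProjB"),
    ("ProjA", "ou:ledBy", "E.Motta"),
    ("ProjB", "ou:ledBy", "E.Motta"),
    ("E.Motta", "foaf:topic", "edu:SemanticWeb"),
]

def build_hypotheses(roots, max_len=3):
    """Map each (property path, end value) pair to the roots reaching it."""
    hypotheses = {}
    for root in roots:
        queue = deque([(root, ())])  # (current node, path of properties so far)
        while queue:
            node, path = queue.popleft()
            if path:  # any reached node is a candidate value vk for path pk
                hypotheses.setdefault((path, node), set()).add(root)
            if len(path) < max_len:
                for s, p, o in triples:
                    if s == node:
                        queue.append((o, path + (p,)))
    return hypotheses

hyps = build_hypotheses({"M.dAquin", "V.Lopez", "M.Sabou"})
key = (("ou:project", "ou:ledBy", "foaf:topic"), "edu:SemanticWeb")
print(hyps[key])  # all three cluster members share this hypothesis
```

Here the path 〈ou:project→ou:ledBy→foaf:topic〉 ending in edu:SemanticWeb is reached by all three roots, mirroring the roots(hi) example on the slide.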
Dedalo – An iterative Linked Data traversal
Scoring hypotheses
WRacc1(hi) = (|roots(hi)| / |R|) · (|roots(hi) ∩ Ci| / |roots(hi)| − |Ci| / |R|)
1 Geng et al. (2006). Interestingness measures for data mining: A survey.
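The WRacc score above is coverage times the gain in precision over the cluster's base rate. A direct transcription, with illustrative sets for R, Ci and roots(hi):

```python
# Weighted Relative Accuracy of a hypothesis h_i, as defined above:
# WRacc(h_i) = |roots(h_i)|/|R| * (|roots(h_i) ∩ C_i|/|roots(h_i)| - |C_i|/|R|)
def wracc(roots_hi, cluster, population):
    coverage = len(roots_hi) / len(population)            # |roots|/|R|
    precision = len(roots_hi & cluster) / len(roots_hi)   # |roots ∩ Ci|/|roots|
    default = len(cluster) / len(population)              # |Ci|/|R|
    return coverage * (precision - default)

# Illustrative sets, not taken from the experiments.
R = {f"a{i}" for i in range(10)}   # whole population, |R| = 10
Ci = {"a0", "a1", "a2", "a3"}      # one cluster, |Ci| = 4
roots = {"a0", "a1", "a2", "a9"}   # entities sharing h_i
print(round(wracc(roots, Ci, R), 3))  # 0.4 * (0.75 - 0.4) = 0.14
```

A hypothesis scores highly only if it is both frequent (large coverage) and more concentrated in Ci than chance would predict.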
Dedalo – An iterative Linked Data traversal
How to define the interestingness of a path pk?
How to reach the best hi in the shortest time?
Dedalo – Comparing Heuristics
• We compare different search strategies.
• We want to find the path pk leading to the best hi in the shortest time.
• We want to save time and reduce computational complexity.
Path Length: length of pk in number of properties composing it
Path Frequency: frequency of the path in the graph
Adapted PMI: joint and individual distribution of pk and Ci
Adapted TF–IDF: how important pk (term) is in Ci (doc)
Delta: |vals(pk)| ≈ |C|
Entropy2: distribution of |vals(pk)|
Conditional Entropy: distribution of |vals(pk)| w.r.t. Ci
2Shannon, C. (1948). A Mathematical Theory of Communication.
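The entropy-based heuristics can be sketched as follows: a path whose values spread the entities across several groups carries more information for splitting a cluster than a path that maps almost everything to one value. The value assignments below are illustrative assumptions, not data from the experiments.

```python
from math import log2
from collections import Counter

# Shannon entropy of the distribution of values reached through a path p_k.
def entropy(values):
    counts = Counter(values)
    total = len(values)
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# A discriminating path (e.g. foaf:topic) spreads entities over values...
print(entropy(["SemWeb", "SemWeb", "SocialWeb", "Services"]))  # 1.5
# ...while a near-constant path carries no information for the search.
print(entropy(["KMi", "KMi", "KMi", "KMi"]))  # 0.0
```

Dedalo uses such scores to decide which paths are worth expanding at the next traversal cycle, rather than following every chain of properties blindly.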
Dedalo’s Heuristics
Ci={ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
Path Frequency top(pk)=〈foaf:topic〉
Dedalo’s Heuristics
Ci={ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
Adapted TF–IDF top(pk)=〈ou:exMember〉
Dedalo’s Heuristics
Ci={ou:M.dAquin, ou:V.Lopez, ou:M.Sabou}
Entropy top(pk)=〈ou:project→ou:ledBy→foaf:topic〉
Experiments – KMi co-authorship
• Authors clustered according to their co-authorships.
• Network Partitioning clustering, |R|=92, |C|= 6
[Figure: WRacc over traversal cycles (0–15) for the Semantic Web authors and the Learning Analytics authors clusters, comparing the Len, Fq, D, Ent, C.Ent, TFIDF and PMI heuristics; WRacc values range from 0.00 to 0.12.]
|Ci|  hi                                                                                        WRacc
22    〈org:hasMembership→ox:hasPrincipalInvestigator→org:hasMembership〉p . 〈ou:SmartProducts〉v1   0.128
23    〈org:hasMembership→ox:hasPrincipalInvestigator→org:hasMembership〉p . 〈ou:SocialLearn〉v2     0.127
Experiments – KMi Publications
• Papers clustered according to their keywords.
• XK-Means clustering, |R|=865, |C|= 6
[Figure: WRacc over traversal cycles (0–10) for the Learning Analytics papers and the Semantic Web papers clusters, comparing the Len, Fq, D, Ent, C.Ent, TFIDF and PMI heuristics.]
|Ci|  hi                                                      WRacc
601   〈dc:creator→ntag:isRelatedTo〉p . 〈ou:LearningAnalytics〉v1  0.042
220   〈dc:creator→ntag:isRelatedTo〉p . 〈ou:SemanticWeb〉v2        0.073
Experiments –Huddersfield’s dataset
• Books clustered according to the students’ Faculties.
• K-Means clustering, |R|=6969, |C|= 14
[Figure: WRacc over traversal cycles (0–15) for the Music students' borrowings and the Theatre students' borrowings clusters, comparing the Len, Fq, D, Ent, C.Ent, TFIDF and PMI heuristics.]
|Ci|  hi                                                            WRacc
335   〈dc:subject→skos:broader〉p1 . 〈lcsh:PhysicalScience〉v           0.005
919   〈dc:creator→bl:hasCreated→dc:subject〉p2 . 〈bl:EnglishDrama〉v    0.013
Experiments – Comparing heuristics
Heuristics speed comparison in seconds.
                      KMiA1  KMiA2  KMiP1  KMiP2  Hud1   Hud2
Len                   1.64   4.15   8.95   9.01   69.13  135.5
Freq                  2.57   4.35   7.5    9.29   180    180
PMI                   2.05   3.88   11.28  18.42  180    180
TF–IDF                1.69   3.18   10.61  17.19  180    180
Delta                 2.02   3.92   180    180    180    180
Entropy               4.19   3.27   7.1    7.3    41.15  105.09
Conditional Entropy   2.64   3.89   7.48   7.55   70.91  40.89
(−) Len, Freq: fast but inaccurate baselines
(+) Entropy/Conditional Entropy: outperforming measures, reducing redundancy (following wrong paths) and time efforts
(−) PMI, TF–IDF, Delta: they might work on less homogeneous clusters
Conclusions
• Linked Data – automatically explaining clusters
• Dedalo – traversing Linked Data to reveal explanations
• Entropy – driving the search in the Linked Data cloud
Beyond Dedalo. Dedalo works as long as the domain is limited; new use cases require its extension.
Future work: the OU students enrolment dataset
• Add sameAs linking
• Use of literals
• Aggregation of atomic rules
• Explore new hypotheses evaluation measures
Thanks for your attention!3
[email protected], [email protected]
Questions? Better asking the robot than the experts.
3Special thanks to the KMi (happy) faces.