Information-theoretic analysis of entity dynamics on the Linked Open Data cloud
www.moving-project.eu
TraininG towards a society of data-saVvy inforMation prOfessionals to enable open leadership INnovation
Chifumi Nishioka and Ansgar Scherp
ZBW -- Leibniz Information Centre for Economics and Kiel University, Germany
Information-theoretic Analysis of Entity Dynamics on the Linked Open Data cloud
Motivation
• Understanding the dynamics of the LOD cloud is important for many applications
  • e.g., SPARQL query caching, crawling strategies, term recommendations
• Related work
  • Evolution of LOD documents [Käfer et al. 13]
  • Dynamics of LOD sources [Dividino et al. 14]
• Entities on the LOD cloud
  • Used by many applications
    • Knowledge graph in search engines
    • Document modeling [Schuhmacher and Ponzetto 14]
Chifumi Nishioka ([email protected])
Come to the presentation of “TermPicker” by Johann Schaible at 14:30 on 1st June (Wednesday)
We conduct an analysis focusing on entities
Research Goals
• Measure the changes in entities between two points in time
• Represent the temporal dynamics of entities as time-series
• Time-series clustering
• Periodicity detection
• Evaluate four different features of entities
Goal 1: Represent the temporal dynamics of entities
Goal 2: Find out the representative temporal patterns of entity dynamics
Goal 3: Find out which features of entities are more likely to determine their temporal dynamics
Formalization
• $X_t$: snapshot of LOD documents at point in time $t$
  • A snapshot is a collection of triples $x$
• $x$: triple
  • $x = \langle s, p, o \rangle$: subject, predicate, and object
Entity and Entity Representations
• Entities are represented by a set of triples
• Entity Representation: Out
  • Set of triples with a common subject URI
  • e.g., db:John_Brown is defined by two triples
• Entity Representation: InOut
  • Set of triples with a common subject URI or object URI
  • e.g., db:John_Brown is defined by three triples
db:Anne_Smith db:spouseOf db:John_Brown
db:John_Brown db:birthplace db:Los_Angels
db:John_Brown db:works db:Green_University
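The two representations can be sketched in a few lines of Python. This is only an illustration, assuming triples are stored as plain (subject, predicate, object) tuples; the function names out_representation and inout_representation are hypothetical and not taken from the talk.

```python
# The three example triples from the slide, as (subject, predicate, object) tuples.
TRIPLES = [
    ("db:Anne_Smith", "db:spouseOf", "db:John_Brown"),
    ("db:John_Brown", "db:birthplace", "db:Los_Angels"),
    ("db:John_Brown", "db:works", "db:Green_University"),
]

def out_representation(triples, entity):
    """Out: all triples whose subject is the entity URI."""
    return {t for t in triples if t[0] == entity}

def inout_representation(triples, entity):
    """InOut: all triples whose subject or object is the entity URI."""
    return {t for t in triples if t[0] == entity or t[2] == entity}

out_size = len(out_representation(TRIPLES, "db:John_Brown"))
inout_size = len(inout_representation(TRIPLES, "db:John_Brown"))
print(out_size, inout_size)  # 2 3
```

The counts match the slide: two triples for Out and three for InOut.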
Triple Weighting
• Example: Barack Obama
  • <Barack_Obama, dbp:vicePresident, Joe_Biden> is more important than <Barack_Obama, rdf:type, foaf:Person>
• Baseline
  • All triples have the same weight
• Combined Information Content (combIC) [Schuhmacher and Ponzetto 14]
  • $IC(v) = -\log(P(v))$
  • $pred(x)$ and $obj(x)$ return the predicate and the object of a triple $x$, respectively
$w_{baseline}(x) = 1$
Each triple of an entity has a different importance for that entity
$w_{combIC}(x) = IC(pred(x)) + IC(obj(x))$
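A minimal sketch of the combIC weighting follows. The slides do not say how $P(v)$ is estimated, so this sketch assumes it is the relative frequency of a predicate (or object) among all triples of a snapshot and uses the natural logarithm; both choices, the function name comb_ic_weights, and the toy snapshot are assumptions for illustration.

```python
import math
from collections import Counter

def comb_ic_weights(triples):
    """Weight each triple by IC(pred) + IC(obj), with IC(v) = -log(P(v)).

    Assumption: P(v) is the relative frequency of v among all triples.
    """
    n = len(triples)
    pred_counts = Counter(p for _, p, _ in triples)
    obj_counts = Counter(o for _, _, o in triples)

    def ic(count):
        return -math.log(count / n)

    return {
        (s, p, o): ic(pred_counts[p]) + ic(obj_counts[o])
        for s, p, o in triples
    }

# Hypothetical toy snapshot: rdf:type / foaf:Person is frequent, vicePresident is rare.
snapshot = [
    ("db:Barack_Obama", "rdf:type", "foaf:Person"),
    ("db:Joe_Biden", "rdf:type", "foaf:Person"),
    ("db:Anne_Smith", "rdf:type", "foaf:Person"),
    ("db:Barack_Obama", "dbp:vicePresident", "db:Joe_Biden"),
]
weights = comb_ic_weights(snapshot)
w_type = weights[("db:Barack_Obama", "rdf:type", "foaf:Person")]
w_vp = weights[("db:Barack_Obama", "dbp:vicePresident", "db:Joe_Biden")]
```

In this toy snapshot the rare vicePresident triple receives a higher weight than the frequent rdf:type triple, which is the behaviour the Barack Obama example motivates.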
Measuring Entity Dynamics
Goal 1: Represent the temporal dynamics of entities
1. Measure the amount of change in an entity between two successive snapshots by one of two distance measures
• Cosine distance: $\delta_{cosd}(E_{t_1}, E_{t_2}) = 1 - \frac{E_{t_1} \cdot E_{t_2}}{\|E_{t_1}\| \, \|E_{t_2}\|}$
• Euclidean distance: $\delta_{euc}(E_{t_1}, E_{t_2}) = \sqrt{\sum_{i} (E_{t_1,i} - E_{t_2,i})^2}$, where $i$ runs over all vector components
Vector Representation of Entities
• Represent an entity $E$ by one-hot encoding
  • Extract all unique triples from the different snapshots
  • Fix the order of triples
  • e.g., db:Anne_Smith at $t_1$ is (1,1,1,0,0) and at $t_2$ is (1,0,1,1,1)
• Cosine distance: $\delta_{cosd}(E_{t_1}, E_{t_2}) = 1 - \frac{2}{\sqrt{3} \cdot \sqrt{4}} = 0.42$
• Euclidean distance: $\delta_{euc}(E_{t_1}, E_{t_2}) = \sqrt{3}$
Snapshot $t_1$ (triple index, triple):
1 db:Anne_Smith db:birthplace db:New_York
2 db:Anne_Smith db:works db:Green_University
3 db:Anne_Smith db:spouseOf db:John_Brown
Snapshot $t_2$ (triple index, triple):
1 db:Anne_Smith db:birthplace db:New_York
4 db:Anne_Smith db:works db:Royal_University
3 db:Anne_Smith db:spouseOf db:John_Brown
5 db:Anne_Smith db:degree db:Master_of_Science
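The worked example can be reproduced with a short Python sketch. It assumes unweighted (baseline) vectors and that neither vector is all zeros; the helper names to_vector, cosine_distance, and euclidean_distance are illustrative only.

```python
import math

# Fixed order of all unique triples seen in either snapshot (indices 1..5 on the slide).
ORDER = [
    ("db:Anne_Smith", "db:birthplace", "db:New_York"),
    ("db:Anne_Smith", "db:works", "db:Green_University"),
    ("db:Anne_Smith", "db:spouseOf", "db:John_Brown"),
    ("db:Anne_Smith", "db:works", "db:Royal_University"),
    ("db:Anne_Smith", "db:degree", "db:Master_of_Science"),
]
T1 = {ORDER[0], ORDER[1], ORDER[2]}
T2 = {ORDER[0], ORDER[3], ORDER[2], ORDER[4]}

def to_vector(snapshot, order):
    """Encode a snapshot as a 0/1 vector over the fixed triple order."""
    return [1.0 if triple in snapshot else 0.0 for triple in order]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)  # assumes neither vector is all zeros

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

e_t1, e_t2 = to_vector(T1, ORDER), to_vector(T2, ORDER)
cos_d = cosine_distance(e_t1, e_t2)
euc_d = euclidean_distance(e_t1, e_t2)
print(round(cos_d, 2), round(euc_d, 2))  # 0.42 1.73
```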
Temporal Dynamics of Entities
• Temporal dynamics of an entity $E$: $\Delta(E) = (\delta(E_{t_1}, E_{t_2}), \delta(E_{t_2}, E_{t_3}), \ldots, \delta(E_{t_{n-1}}, E_{t_n}))$
  • $n$: the number of snapshots
2. Represent the temporal dynamics of an entity by a time-series of the amount of change between each pair of successive snapshots
Subsequently, we mine the resulting time-series to find patterns in the temporal dynamics of entities
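As a rough sketch, the time-series $\Delta(E)$ can be built by applying a distance to each pair of successive snapshots. The example below uses the Euclidean distance on unweighted 0/1 vectors; the function name temporal_dynamics and the abbreviated triples in the toy history are assumptions for illustration.

```python
import math

def temporal_dynamics(snapshots):
    """Return (delta(E_t1, E_t2), ..., delta(E_tn-1, E_tn)) for one entity.

    `snapshots` is a chronologically ordered list of triple sets.
    """
    order = sorted(set().union(*snapshots))  # fixed order over all unique triples
    vectors = [[1 if t in snap else 0 for t in order] for snap in snapshots]
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(prev, curr)))
        for prev, curr in zip(vectors, vectors[1:])
    ]

# Hypothetical three-snapshot history; triples are abbreviated to single letters.
history = [{"a", "b", "c"}, {"a", "c", "d", "e"}, {"a", "c", "d", "e"}]
series = temporal_dynamics(history)
```

With $n$ snapshots the series has $n - 1$ values; here the first value is $\sqrt{3}$ (three triples differ) and the second is 0 (no change).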
Time-series Clustering
• Clustering algorithm: k-means++ [Arthur and Vassilvitskii 07]
  • Introduces an improved initial seeding into k-means
• Distance measure: Euclidean distance
  • The most efficient measure for the distance between time-series with a reasonably high accuracy [Wang et al. 13]
• Optimization of the number of clusters: Average Silhouette
Goal 2: Find out the representative temporal patterns of entity dynamics
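The clustering step can be sketched in plain Python as follows: k-means++ seeding, standard k-means iterations with the Euclidean distance, and selection of the number of clusters by the average silhouette. This is a small didactic sketch, not the implementation used in the study; all function names, the candidate range of k, the fixed random seed, and the toy time-series are assumptions.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_pp_seeds(series, k, rng):
    """k-means++ seeding: each further seed is drawn with probability
    proportional to its squared distance to the nearest chosen seed."""
    seeds = [rng.choice(series)]
    while len(seeds) < k:
        d2 = [min(euclidean(s, c) ** 2 for c in seeds) for s in series]
        seeds.append(rng.choices(series, weights=d2, k=1)[0])
    return seeds

def kmeans(series, k, rng, max_iter=100):
    centroids = kmeans_pp_seeds(series, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: euclidean(s, centroids[j]))
                      for s in series]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [s for s, lab in zip(series, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

def average_silhouette(series, labels):
    if len(set(labels)) < 2:
        return 0.0  # assumption: score a single-cluster result as 0
    scores = []
    for i, s in enumerate(series):
        own = [euclidean(s, t) for j, t in enumerate(series)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # singleton cluster
            continue
        a = sum(own) / len(own)
        b = min(
            sum(euclidean(s, t) for t, lab in zip(series, labels) if lab == other)
            / labels.count(other)
            for other in set(labels) if other != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)

def choose_k(series, candidates, seed=0):
    rng = random.Random(seed)
    scores = {k: average_silhouette(series, kmeans(series, k, rng)[0])
              for k in candidates}
    return max(scores, key=scores.get), scores

# Hypothetical time-series of length 3 forming three well-separated groups.
toy = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0),
       (5, 5, 5), (5.1, 5, 5), (5, 5, 5.1),
       (0, 5, 0), (0, 5.1, 0), (0.1, 5, 0)]
best_k, silhouettes = choose_k(toy, candidates=[2, 3, 4])
```

On data with three clearly separated groups, the average silhouette is expected to peak at three clusters; on real entity time-series the peak is usually far less pronounced.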
Periodicity Detection
• Periodicity detection
  • The task of detecting periodicity in a time-series
  • Example 1: (1, 3, 2, 1, 3, 2) -> periodicity of three
  • Example 2: (1, 2, 1, 2, 1, 2) -> periodicity of two
  • Employ a convolution-based algorithm [Elfeky et al. 05]
We assume that the amount of change in entities has some periodicity
We see the centroids of the resulting clusters as patterns of entity dynamics
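To make the two examples concrete, here is a deliberately naive period check in Python. It is not the convolution-based algorithm of [Elfeky et al. 05]; it simply compares the series with shifted copies of itself and returns the smallest shift at which enough positions match. The function name detect_period and the min_confidence parameter are hypothetical, and exact equality is used, so real-valued series would first have to be discretized (a step omitted here).

```python
def detect_period(series, min_confidence=1.0):
    """Return the smallest period p such that series[i] == series[i + p]
    holds for at least `min_confidence` of the comparable positions,
    or None if no period up to half the series length qualifies."""
    n = len(series)
    for period in range(1, n // 2 + 1):
        pairs = n - period
        matches = sum(series[i] == series[i + period] for i in range(pairs))
        if matches / pairs >= min_confidence:
            return period
    return None

period_1 = detect_period((1, 3, 2, 1, 3, 2))
period_2 = detect_period((1, 2, 1, 2, 1, 2))
period_3 = detect_period((1, 2, 3, 4, 5, 6))
print(period_1, period_2, period_3)  # 3 2 None
```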
Dataset
• Dynamic Linked Data Observatory (DyLDO) dataset [Käfer et al. 12]
  • Weekly snapshots of a fixed set of LOD documents
  • 165 snapshots over three years (05/2012 to 07/2015)
• Entities in the DyLDO dataset
  • Almost 75% of entities appear in only one snapshot
  • Focus on entities that appear in >70% of snapshots
Entity representation | # of unique entities in 165 snapshots | # of entities that appear in >70% of snapshots
Out | 27,788,902 | 2,909,700
InOut | 29,097,929 | 2,950,533
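The filtering step can be sketched as follows, assuming each snapshot is available as a set of entity URIs. The function name frequent_entities and the toy data are hypothetical; the 0.7 threshold is the one stated on the slide.

```python
from collections import Counter

def frequent_entities(snapshots, min_ratio=0.7):
    """Keep entities that appear in more than `min_ratio` of the snapshots."""
    counts = Counter(entity for snap in snapshots for entity in set(snap))
    return {e for e, c in counts.items() if c / len(snapshots) > min_ratio}

# Hypothetical toy data: 10 snapshots, each a set of entity URIs.
snapshots = [{"db:A"} for _ in range(10)]
for snap in snapshots[:5]:
    snap.add("db:B")          # db:B appears in 5 of 10 snapshots
snapshots[0].add("db:C")      # db:C appears in 1 of 10 snapshots
kept = frequent_entities(snapshots)
print(kept)  # {'db:A'}
```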
Patterns of Entity Dynamics (1/3)
• Analysis with respect to eight conditions
  • The conditions are formed by combining two entity representations, two distance measures, and two triple weighting methods
• Result of clustering
  • The number of clusters is smaller when using combIC
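For reference, the eight conditions are simply the Cartesian product of the three binary choices. A small Python sketch (the constant names are illustrative):

```python
from itertools import product

REPRESENTATIONS = ("Out", "InOut")
DISTANCES = ("Cosine", "Euclidean")
WEIGHTINGS = ("Baseline", "CombIC")

# 2 x 2 x 2 = 8 experimental conditions.
conditions = list(product(REPRESENTATIONS, DISTANCES, WEIGHTINGS))
for representation, distance, weighting in conditions:
    print(representation, distance, weighting)
```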
Patterns of Entity Dynamics (2/3)
[Figure, four panels: Out Cosine Baseline; Out Cosine CombIC; Out Euclidean Baseline; Out Euclidean CombIC]
Patterns of Entity Dynamics (3/3)
[Figure, four panels: InOut Cosine Baseline; InOut Cosine CombIC; InOut Euclidean Baseline; InOut Euclidean CombIC]
Periodicity of Entity Dynamics
We observe periodicities in the temporal dynamics of entities
• e.g., “Periodicity of 56” indicates that the amount of change in entities varies on a roughly one-year cycle (56 weekly snapshots)
• Different patterns have different periodicities
Features for Entity Dynamics (1/2)
• Four features of entities
  • RDF type ($f_1$)
  • Property ($f_2$)
  • Union of RDF types and properties ($f_3$)
  • Pay level domain (PLD) of entity URI ($f_4$)
    • e.g., http://dbpedia.org/resource/The_Beatles -> dbpedia.org
• Evaluate the four features by RandIndex
  • RandIndex: a metric for comparing two clusterings
  • Measures how well the clustering induced by a feature agrees with the clustering by entity dynamics (i.e., time-series vectors)
Goal 3: Find out which features of entities are more likely to determine their temporal dynamics
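A minimal sketch of this evaluation, using the PLD feature as the example: entities are grouped by PLD, and the Rand index counts the fraction of entity pairs on which this grouping and the dynamics-based clustering agree (both together or both apart). The helper naive_pld is a simplifying assumption that just takes the last two host labels; a correct PLD extraction needs the public suffix list (e.g., for hosts under co.uk). The second URI, the example.org URIs, and the dynamics labels are hypothetical.

```python
from itertools import combinations
from urllib.parse import urlparse

def naive_pld(uri):
    """Approximate the pay level domain by the last two host labels (assumption)."""
    host = urlparse(uri).hostname or ""
    return ".".join(host.split(".")[-2:])

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two clusterings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

uris = [
    "http://dbpedia.org/resource/The_Beatles",
    "http://dbpedia.org/resource/Kiel",
    "http://example.org/a",
    "http://example.org/b",
]
feature_labels = [naive_pld(u) for u in uris]  # clustering by feature f4
dynamics_labels = [0, 0, 1, 0]                 # hypothetical time-series clusters
score = rand_index(feature_labels, dynamics_labels)
print(score)  # 0.5
```

A score of 1.0 means the feature reproduces the dynamics-based clustering exactly; lower values mean more pairs are grouped differently.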
Features for Entity Dynamics (2/2)
• Entities that share a common PLD are more likely to have similar temporal dynamics when the baseline is employed for triple weighting
• When using combIC, entities that have a common RDF type or ECS are more likely to belong to the same cluster
Thank you for your attention!
Project consortium and funding agency
MOVING is funded by the EU Horizon 2020 Programme under the project number INSO-4-2015: 693092
Conclusion
• Temporal dynamics of entities on the LOD cloud
  • Represent the temporal dynamics of entities as time-series
  • Find out the representative temporal patterns of entity dynamics
  • Find out which features of entities are more likely to determine their temporal dynamics (Goal 3)
• Future work
  • e.g., SPARQL query caching
References
• [Arthur and Vassilvitskii 07] D. Arthur and S. Vassilvitskii. k-means++: The advantages of careful seeding. SODA, 2007.
• [Elfeky et al. 05] M.G. Elfeky, W.G. Aref, and A.K. Elmagarmid. Periodicity detection in time series databases. IEEE TKDE, 2005.
• [Käfer et al. 12] T. Käfer, J. Umbrich, A. Hogan, and A. Polleres. Towards a dynamic linked data observatory. LDOW, 2012.
• [Käfer et al. 13] T. Käfer, A. Abdelrahman, J. Umbrich, P. O’Byrne, and A. Hogan. Observing linked data dynamics. ESWC, 2013.
• [Neumann and Moerkotte 11] T. Neumann and G. Moerkotte. Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. ICDE, 2011.
• [Schuhmacher and Ponzetto 14] M. Schuhmacher and S.P. Ponzetto. Knowledge-based graph document modeling. WSDM, 2014.
• [Wang et al. 13] X. Wang, A. Mueen, H. Ding, G. Trajcevski, P. Scheuermann, and E. Keogh. Experimental comparison of representation methods and distance measures for time series data. Data Mining and Knowledge Discovery, 2013.
• [Yang and Leskovec 11] J. Yang and J. Leskovec. Patterns of temporal variation in online media. WSDM, 2011.
Entities in the DyLDO dataset
• Distribution of the number of snapshots (out of 165) in which each entity appears
[Figure, two panels: Entity representation Out; Entity representation InOut]
Resulting Temporal Patterns (1/4)
[Figure, two panels: Out Cosine Baseline; Out Cosine CombIC]
Resulting Temporal Patterns (2/4)
[Figure, two panels: Out Euclidean Baseline; Out Euclidean CombIC]
Resulting Temporal Patterns (3/4)
[Figure, two panels: InOut Cosine Baseline; InOut Cosine CombIC]
Resulting Temporal Patterns (4/4)
[Figure, two panels: InOut Euclidean Baseline; InOut Euclidean CombIC]