KIT – The Research University in the Helmholtz Association
INSTITUTE OF APPLIED INFORMATICS AND FORMAL DESCRIPTION METHODS (AIFB)
www.kit.edu
PageRank on Wikipedia: Towards General Importance Scores for Entities
Andreas Thalhammer and Achim Rettinger
Know@LOD 2016, 30.05.2016
Outline
Motivation
Background
Experiments
Conclusions
06.06.2016 Andreas Thalhammer and Achim Rettinger – PageRank on Wikipedia: Towards
General Importance Scores for Entities
MOTIVATION
PageRank on Wikipedia
[3]
Wikipedia pages link to each other.
Wikipedia can be regarded as a link graph.
PageRank needs a link graph as an input.
The output of PageRank is a set of ranking scores for Web pages.
Idea: Combine Wikipedia PageRank scores with DBpedia/Wikidata
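# Top five scientists by PageRank score
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX v: <http://purl.org/voc/vrank#>
SELECT ?e ?r
FROM <http://dbpedia.org>
FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
WHERE {
  ?e rdf:type dbo:Scientist ;
     v:hasRank/v:rankValue ?r .
}
ORDER BY DESC(?r) LIMIT 5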
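The same query pattern can be reused for other DBpedia classes. The following is a minimal sketch, not part of the original slides: the helper names `top_ranked_query` and `request_url` are hypothetical, the `rdf:` and `dbo:` prefixes are spelled out explicitly, and the endpoint URL in the usage lines is only an assumption about where both graphs might be queryable. No network request is made.

```python
from urllib.parse import urlencode

# Named graph with the PageRank scores, as used in the query above.
RANK_GRAPH = "http://people.aifb.kit.edu/ath/#DBpedia_PageRank"

QUERY_TEMPLATE = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX v: <http://purl.org/voc/vrank#>
SELECT ?e ?r
FROM <http://dbpedia.org>
FROM <{graph}>
WHERE {{
  ?e rdf:type dbo:{cls} ;
     v:hasRank/v:rankValue ?r .
}}
ORDER BY DESC(?r) LIMIT {limit}
"""


def top_ranked_query(cls, limit=5):
    """Return a SPARQL query for the highest-ranked entities of a DBpedia class."""
    if not cls.isidentifier():
        # Guard against injecting arbitrary text into the query string.
        raise ValueError(f"unexpected class name: {cls!r}")
    return QUERY_TEMPLATE.format(graph=RANK_GRAPH, cls=cls, limit=int(limit))


def request_url(endpoint, query):
    """Build a GET URL that passes the query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query})


# Usage: the endpoint below is an assumption, not taken from the slides.
query = top_ranked_query("Scientist", limit=5)
url = request_url("https://dbpedia.org/sparql", query)
```

Because the ordering happens in the query itself, a client only needs to read the first rows of the result, which is the point of the slide: large result sets become usable once every resource carries a rank.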
Every resource can be ranked.
SPARQL result sets are often large.
PageRank scores capture a general notion of importance.
dbpedia:Carl_Linnaeus 551.791
dbpedia:Charles_Darwin 215.028
dbpedia:Albert_Einstein 186.549
dbpedia:Isaac_Newton 167.811
dbpedia:Sigmund_Freud 140.245
Done. Wait!
Who is Carl Linnaeus?
Main idea: Change the input of the PageRank algorithm.
BACKGROUND
The PageRank algorithm
Important parameters:
l(w) – the set of all pages that link to w.
c(w) – the number of outgoing links of w.
d – the damping factor
Traditional PageRank [1]:
Variant: Weighted Links Rank (WLRank) [2]:
Link weights (lw): relative position of a link in the article
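In the notation above, the traditional PageRank update following [1] can be written as pr(w) = (1 - d) + d * (sum over all p in l(w) of pr(p) / c(p)). The sketch below iterates this update on a toy link graph. With link weights, each link passes on a share proportional to its weight instead of 1 / c(p); this normalization is an assumption made for illustration and not necessarily the exact WLRank definition of [2]. The function name, the toy graph, d = 0.85, and the fixed number of iterations are likewise illustrative choices.

```python
from collections import defaultdict


def pagerank(links, d=0.85, iterations=50, weights=None):
    """Iterate pr(w) = (1 - d) + d * sum_{p in l(w)} pr(p) * share(p, w).

    links:   dict mapping each page to the pages it links to.
    weights: optional dict mapping (source, target) to a positive link weight;
             without it, every outgoing link of p gets the share 1 / c(p).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # share[(p, w)]: fraction of p's score that is passed on to w.
    share = {}
    for p, targets in links.items():
        unique = set(targets)
        weight = {w: (weights[(p, w)] if weights else 1.0) for w in unique}
        total = sum(weight.values())
        for w in unique:
            share[(p, w)] = weight[w] / total

    # incoming[w] plays the role of l(w): all pages that link to w.
    incoming = defaultdict(list)
    for p, w in share:
        incoming[w].append(p)

    pr = {w: 1.0 for w in pages}
    for _ in range(iterations):
        pr = {
            w: (1 - d) + d * sum(pr[p] * share[(p, w)] for p in incoming[w])
            for w in pages
        }
    return pr


# Toy graph (made up): A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
plain = pagerank(links)

# Hypothetical weights: the link A -> B appears early in A and counts three times as much.
weights = {("A", "B"): 3.0, ("A", "C"): 1.0, ("B", "C"): 1.0, ("C", "A"): 1.0}
weighted = pagerank(links, weights=weights)
```

In the toy graph, C collects the highest score in the unweighted run, and raising the weight of the link from A to B moves score from C to B. Changing the weights therefore changes the ranking without changing the algorithm itself.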
The Wikipedia link graph
We use non-rendered versions of articles.
Wiki syntax:
Link: [[brown bear|bears]]
Template: {{ ... }}
Link in a template: {{ ... [[brown bear|bears]] ... }}
DBpedia Extraction Framework
Extracts links from the wikitext.
Doesn’t consider:
links from category pages;
links in <ref> tags.
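A rough sketch of how links in the wiki syntax above could be split into article-text links and template links while ignoring links inside <ref> tags. This is not code from the DBpedia Extraction Framework. The function name and the regular expressions are illustrative, and real wikitext has more cases (comments, nowiki sections, file links) that are not handled here. Category pages would simply not be passed to the function.

```python
import re

# <ref>...</ref> blocks and self-closing <ref ... /> tags are dropped first.
REF_RE = re.compile(r"<ref\b[^>/]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE)
# Tokens of interest: template open, template close, and [[target|label]] links.
TOKEN_RE = re.compile(r"\{\{|\}\}|\[\[([^\[\]|]+)(?:\|[^\[\]]*)?\]\]")


def extract_links(wikitext):
    """Return (text_links, template_links) as lists of link targets."""
    text_links, template_links = [], []
    depth = 0  # current template nesting level
    for match in TOKEN_RE.finditer(REF_RE.sub("", wikitext)):
        token = match.group(0)
        if token == "{{":
            depth += 1
        elif token == "}}":
            depth = max(depth - 1, 0)
        else:
            target = match.group(1).strip()
            # A link counts as a template link if it sits inside {{ ... }}.
            (template_links if depth else text_links).append(target)
    return text_links, template_links


# Made-up wikitext for illustration.
sample = (
    "The [[brown bear|bears]] live here.<ref>See [[Zoo]].</ref> "
    "{{Infobox | habitat = [[forest]] }} Also [[salmon]]."
)
text_links, template_links = extract_links(sample)
```

For the sample string, the link to Zoo inside the <ref> tag is dropped, the link to forest is classified as a template link, and the remaining two links are classified as article-text links.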
EXPERIMENTS
Compared rankings
Newly constructed rankings:
ALL – all links from the article text and from the templates
ATL – article text links
ATL-RP – article text links with WLRank and relative position
ABL – abstract links
TEL – template links
Reference rankings:
DBP 2014 – PageRank on “Pagelinks” dataset
DBP 2015-04 – PageRank on “Pagelinks” dataset
DBP-U 2015-04 – PageRank on “Pagelinks unredirected” dataset
TOWR-PV – “The Open Wikipedia Ranking” (page views)
SUB – “SubjectiveEye3D” by Paul Houle (page views)
Setup
Measures (pairwise):
Kendall’s tau
Spearman’s rho
Hypotheses:
1. Links in templates are created in a “please fill out” manner and tend to
negatively influence the general salience that PageRank scores
should represent.
2. Links that appear at the beginning of articles are clicked more often,
and WLRank correlates more strongly with page views.
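The two pairwise measures named above can be sketched in a few lines. This is a minimal illustration that assumes no tied scores and compares only the items present in both rankings. The function names and the toy scores are made up, and the quadratic pair loop in `kendall_tau` would be far too slow for rankings of Wikipedia size.

```python
from itertools import combinations


def ranks(scores):
    """Map each item to its rank (1 = highest score); assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {item: position for position, item in enumerate(ordered, start=1)}


def spearman_rho(scores_a, scores_b):
    """Spearman's rho as 1 - 6 * sum(d^2) / (n * (n^2 - 1)); valid without ties."""
    common = sorted(set(scores_a) & set(scores_b))
    ra = ranks({k: scores_a[k] for k in common})
    rb = ranks({k: scores_b[k] for k in common})
    n = len(common)
    d2 = sum((ra[k] - rb[k]) ** 2 for k in common)
    return 1 - 6 * d2 / (n * (n * n - 1))


def kendall_tau(scores_a, scores_b):
    """Kendall's tau: (concordant - discordant) pairs divided by all pairs."""
    common = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for x, y in combinations(common, 2):
        sign = (scores_a[x] - scores_a[y]) * (scores_b[x] - scores_b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(common)
    return (concordant - discordant) / (n * (n - 1) / 2)


# Made-up scores for four items, one link-based and one pageview-based ranking.
link_based = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
view_based = {"a": 40, "b": 10, "c": 30, "d": 20}
rho = spearman_rho(link_based, view_based)
tau = kendall_tau(link_based, view_based)
```

Both measures range from -1 (reversed order) to 1 (identical order), so a higher value between a PageRank variant and a pageview-based ranking means the two orderings agree more closely.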
Results
[Tables: pairwise Spearman’s rho and Kendall’s tau between the compared rankings; input graph size]
SUMMARY
Conclusions
Hypothesis 1 (removing template links):
Subjectively, PageRank scores got “better”: the PageRank of Carl Linnaeus was reduced.
Objectively, PageRank scores do not change strongly.
Using only template links is not recommended for computing PageRank
(note that DBpedia extracts most semantic links from there).
Hypothesis 2 (ranking with relative positions):
WLRank with weights on the link positions in the article text correlates
more strongly with pageview-based rankings.
Further insights
Rank  Pageview-based (SubjectiveEye3D)  Link-based (DBpedia PageRank)
1     YouTube                           Category:Living_people
2     Searching                         United_States
3     Facebook                          List_of_sovereign_states
4     United_States                     Animal
5     Undefined                         France
6     Lists_of_deaths_by_year           United_Kingdom
7     Wikipedia                         World_War_II
8     The_Beatles                       Germany
9     Barack_Obama                      Canada
10    Web_search_engine                 India
11    Google                            Iran
12    Michael_Jackson                   Association_football
13    Sex                               England
14    Lady_Gaga                         Australia
15    World_War_II                      Arthropod
16    United_Kingdom                    Insect
17    Eminem                            Russia
18    Lil_Wayne                         Japan
19    Adolf_Hitler                      China
20    India                             Italy
21    Justin_Bieber                     English_language
22    How_I_Met_Your_Mother             Poland
23    The_Big_Bang_Theory               London
24    World_War_I                       Spain
25    Miley_Cyrus                       New_York_City
26    Glee_(TV_series)                  Catholic_Church
27    Favicon                           World_War_I
28    Canada                            Bakhsh
29    Sex_position                      Latin
30    Kim_Kardashian                    Village
Page views or PageRank?
(it depends, but in general)
Both!
Resources
Entity rankings:
DBpedia PageRank: http://people.aifb.kit.edu/ath/#DBpedia_PageRank
SubjectiveEye3D: https://github.com/paulhoule/telepath/wiki/SubjectiveEye3D
The Open Wikipedia Ranking: http://wikirank.di.unimi.it/
Applications (of DBpedia PageRank):
LinkSUM entity summarization system: http://km.aifb.kit.edu/services/link/
DBtrends: http://dbtrends.aksw.org
Questions?
@thalhamm
References
[1] S. Brin and L. Page. The Anatomy of a Large-scale Hypertextual Web Search Engine. In Proceedings
of the Seventh International Conference on World Wide Web 7, WWW7, pages 107–117. Elsevier
Science Publishers B. V., Amsterdam, The Netherlands, 1998.
[2] R. Baeza-Yates and E. Davis. Web Page Ranking Using Link Attributes. In Proceedings of the 13th
International World Wide Web Conference on Alternate Track Papers & Posters, WWW Alt. ’04,
pages 328–329, New York, NY, USA, 2004. ACM.
[3] Illustration drawn by Felipe Micaroni Lalli.