Analysing Structured Scholarly Data Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu, Sriparna Saha and Stefan Dietze
WWW 2016
April 11th, 2016, Montreal, Canada
OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK
INTRODUCTION (1/3)
The Web: nearly 46 trillion Web pages indexed by Google
VS
Linked Data: approx. 1000 datasets & 100 billion statements
● different order of magnitude w.r.t. scale & dynamics
Are there other semantics (structured facts) on the Web?
INTRODUCTION (2/3)
● Web pages embed structured data (microdata, microformats and RDFa)
  ○ Interpretation of web documents (search & retrieval)
● Increase in prevalence of embedded markup (2014 Google study of 12 bn pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al. [ISWC’14])
  ○ Markup from Common Crawl (2.2 bn pages)
  ○ 17 billion RDF quads
  ○ Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
Other semantics (structured facts) on the Web!
INTRODUCTION (3/3)
Characteristics of Markup Data
MOTIVATION
● Embedded markup ⇒ sparsely linked, large % of coreferences, redundant statements
● Uptake and reuse of embedded markup is hindered by the lack of dynamics, scale
● Lack of understanding of the adoption of markup for scholarly resource metadata
WHAT WE BRING TO THE TABLE ...
● Study of scholarly data extracted from embedded annotations (Web Data Commons)
● Shape & characteristics of entity descriptions
● Level of adoption of terms & types, distributions across TLDs, PLDs, data publishers
RESEARCH QUESTIONS
RQ1 What are frequently used terms & types for scholarly data?
RQ2 How are statements about bibliographic data distributed across the web? Who are the key providers of bibliographic markup?
RQ3 What are the frequent errors that can be observed?
DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities of type s:ScholarlyArticle or co-occurring on the same document with any s:ScholarlyArticle instance
  ○ 6,793,764 quads
  ○ 1,184,623 entities
  ○ 83 distinct classes
  ○ 429 distinct predicates
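The subset selection on this slide could be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the quads are already parsed into (subject, predicate, object, document URL) tuples, and the function name `scholarly_subset` and the sample data are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumed in-memory shape of a quad: (subject, predicate, object, document URL).
Quad = Tuple[str, str, str, str]

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "http://schema.org/ScholarlyArticle"


def scholarly_subset(quads: Iterable[Quad]) -> List[Quad]:
    """Keep every quad from documents that contain at least one s:ScholarlyArticle."""
    quads = list(quads)
    # Documents in which some entity is explicitly typed as s:ScholarlyArticle.
    docs = {g for (_, p, o, g) in quads if p == RDF_TYPE and o == SCHOLARLY_ARTICLE}
    # All statements on those documents, including co-occurring entities.
    return [q for q in quads if q[3] in docs]


# Hypothetical sample data for illustration only.
sample = [
    ("_:a1", RDF_TYPE, SCHOLARLY_ARTICLE, "http://example.org/paper1"),
    ("_:a1", "http://schema.org/name", "A paper", "http://example.org/paper1"),
    ("_:p1", RDF_TYPE, "http://schema.org/Person", "http://example.org/paper1"),
    ("_:x1", RDF_TYPE, "http://schema.org/Product", "http://example.org/shop"),
]
subset = scholarly_subset(sample)
print(len(subset))
```

The s:Person statement is kept because it shares a document with an article, while the product page is dropped entirely.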
DATASET - Considerations
● s:ScholarlyArticle is the only type which explicitly refers to scholarly articles
● We focus on schema.org, the most widely used schema
● Types considered ⇒ s:ScholarlyArticle, s:Person and s:Organization
  ○ 280,616 instances (s:ScholarlyArticle)
  ○ 847,417 instances (s:Person)
  ○ 3,798 instances (s:Organization)
SCHOLARLY TYPES & PREDICATES (1/2)
Cumulative dist. of predicates over instances across extracted types
[Figure annotations: 1 to 14, 1 to 9, 1 to 4]
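A cumulative distribution of this kind could be computed along the following lines. This is a sketch under assumptions: statements are taken as plain (subject, predicate, object) triples for the instances of a single type, and the function names and sample data are hypothetical rather than taken from the study.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # assumed shape: (subject, predicate, object)


def predicates_per_instance(triples: Iterable[Triple]) -> Dict[str, int]:
    """Count the distinct predicates used to describe each instance."""
    preds = defaultdict(set)
    for s, p, _ in triples:
        preds[s].add(p)
    return {s: len(ps) for s, ps in preds.items()}


def cumulative_distribution(counts: Dict[str, int]) -> List[Tuple[int, float]]:
    """Return (k, fraction of instances with at most k distinct predicates)."""
    hist = Counter(counts.values())
    total = len(counts)
    running = 0
    cdf = []
    for k in sorted(hist):
        running += hist[k]
        cdf.append((k, running / total))
    return cdf


# Hypothetical sample: one richly described article, one sparse one.
triples = [
    ("_:a1", "name", "Paper A"),
    ("_:a1", "author", "_:p1"),
    ("_:a1", "datePublished", "2014"),
    ("_:a2", "name", "Paper B"),
]
cdf = cumulative_distribution(predicates_per_instance(triples))
print(cdf)
```

Running the same computation separately per type would give one curve for each extracted type.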
SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle
DOMAINS & DOCUMENTS (1/5)
Distribution of Entities & Statements across PLDs
DOMAINS & DOCUMENTS (2/5)
Top-10 PLDs (ranked by no. of entities)
DOMAINS & DOCUMENTS (3/5)
Distribution of Entities & Statements across TLDs
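Grouping statements by TLD and PLD could be approximated as in the sketch below. The PLD helper is deliberately naive and is an assumption of this example: it takes the last two labels of the hostname, which is wrong for suffixes such as co.uk, so a real analysis would rely on the Public Suffix List. All function names and URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Lower-cased hostname of a URL, or an empty string if there is none."""
    return (urlsplit(url).hostname or "").lower()


def tld_of(url: str) -> str:
    """Top-level domain: the last label of the hostname."""
    return host_of(url).rsplit(".", 1)[-1]


def naive_pld_of(url: str) -> str:
    """Naive pay-level domain: the last two hostname labels (assumption, see above)."""
    return ".".join(host_of(url).split(".")[-2:])


# Hypothetical document URLs, one per statement.
urls = [
    "http://www.example.org/a",
    "https://journals.example.org/b",
    "http://lib.example.fr/c",
]
tld_counts = Counter(tld_of(u) for u in urls)
pld_counts = Counter(naive_pld_of(u) for u in urls)
print(tld_counts.most_common())
print(pld_counts.most_common())
```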
DOMAINS & DOCUMENTS (4/5)
Distribution of Entities & Statements across HTML Documents
DOMAINS & DOCUMENTS (5/5)
Top-10 Documents Ranked According to Embedded Entities
TOPICS & PUBLICATION TYPES (1/4)
Distribution of Scholarly Articles across Publishers
TOPICS & PUBLICATION TYPES (2/4)
Top-10 Publishers and corresponding no. of Publications
TOPICS & PUBLICATION TYPES (3/4)
Top-10 Publication Types (genres) across WDC
TOPICS & PUBLICATION TYPES (4/4)
Top-10 Article Titles (ranked by frequency of occurrence)
FREQUENT ERRORS - Schema Violations
Top-10 Misused Predicates
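One plausible reading of "misused predicate" is a predicate attached to an entity whose type does not allow it. The sketch below counts such cases. The `ALLOWED` table is a tiny, hand-written, hypothetical excerpt for illustration and not the schema.org vocabulary, and the slides do not state how the authors detected violations.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical, hand-written excerpt of allowed predicates per type (illustration only).
ALLOWED = {
    "ScholarlyArticle": {"name", "author", "datePublished", "publisher"},
    "Person": {"name", "affiliation"},
}


def misused_predicates(statements: Iterable[Tuple[str, str]]) -> Counter:
    """Count predicates used on an entity type that does not list them as allowed.

    Each statement is an (entity type, predicate) pair; unknown types are skipped.
    """
    misuse = Counter()
    for entity_type, predicate in statements:
        allowed = ALLOWED.get(entity_type)
        if allowed is not None and predicate not in allowed:
            misuse[predicate] += 1
    return misuse


# Hypothetical sample statements.
statements = [
    ("ScholarlyArticle", "affiliation"),
    ("ScholarlyArticle", "name"),
    ("Person", "datePublished"),
    ("ScholarlyArticle", "affiliation"),
]
top_misused = misused_predicates(statements).most_common(10)
print(top_misused)
```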
CONCLUSIONS (1/2)
● First study on coverage & characteristics of bibliographic metadata embedded in web pages.
● Early adopters ⇒ publishers, libraries, other providers of bibliographic data.
● Usage of terms, types ⇒ dist. across providers, domains and topics follows a power law; few providers & documents contributing to majority of data.
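The concentration described in the last bullet can be made concrete by asking what share of all statements the top-k providers contribute. The sketch below is illustrative only: the function name `top_k_share` and the per-provider counts are made up and are not figures from the study.

```python
from typing import Dict


def top_k_share(counts: Dict[str, int], k: int) -> float:
    """Fraction of all statements contributed by the k largest providers."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    largest = sorted(counts.values(), reverse=True)[:k]
    return sum(largest) / total


# Hypothetical statement counts per provider (not data from the study).
per_provider = {"a.example": 80, "b.example": 10, "c.example": 5, "d.example": 5}
print(top_k_share(per_provider, 1))
print(top_k_share(per_provider, 2))
```

A heavily skewed distribution shows up as a share close to 1 for small k.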
CONCLUSIONS (2/2)
● Top-k genres & publishers indicate a bias towards French, English data providers.
● Article titles, PLDs & publishers ⇒ bias towards Computer Science and Life Sciences.
● In this study we only consider entities tagged explicitly as s:ScholarlyArticle. A deeper analysis considering more types (article, book, etc.) and other creative works can shed light on the true scale and potential of embedded markup data.
FUTURE WORK
● Targeted crawl of typical providers of scholarly data (publishers, academic orgs., libraries, etc.)
● Consider implicitly typed bibliographic or creative work as scholarly data
LIMITATIONS
● Our study is limited to schema.org & the types of s:ScholarlyArticle, s:Person, s:Organization.
● We consider only explicitly linked scholarly works.