CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
DESCRIPTION
By Albert Meroño, presented at the 3rd PRELIDA Consolidation and Dissemination Workshop, Riva, Italy, October 17, 2014. More information about the workshop at: prelida.eu
TRANSCRIPT
![Page 1: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/1.jpg)
CEDAR & PRELIDA Preservation of Linked Socio-Historical Data
Albert Meroño-Peñuela
@albertmeronyo
PRELIDA consolidation workshop @ ISWC, 17-10-2014
![Page 2: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/2.jpg)
CEDAR: Harmonizing Historical Census Data in the Semantic Web
![Page 3: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/3.jpg)
CEDAR: Source Historical Data
Dutch Historical Censuses (1795-1971)
[Public Historical Statistical Data]
![Page 4: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/4.jpg)
From scans to spreadsheets
![Page 5: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/5.jpg)
CEDAR goal: cross queries
Across census years 1795, 1830, 1889, 1930 and 1971 (through ~3K tables)
![Page 6: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/6.jpg)
Towards 5-star Census Data
![Page 7: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/7.jpg)
Towards 5-star Census Data
>1 year ago
1 year ago
![Page 8: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/8.jpg)
![Page 9: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/9.jpg)
• Web publishable
• Machine processable
• Dynamic schema
• Easily link with other datasets
![Page 10: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/10.jpg)
Why with semantic technology?
• Web publishable, human & machine readable
• Finer granularity level (cell level)
• Statistical comparability by leveraging semantic descriptions
• Provenance
• Harmonization through linkage to other datasets (the 5th star)
![Page 11: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/11.jpg)
RDF Data Cube
“There are many situations where it would be useful to be able to publish multi-dimensional data, such as statistics, on the web in such a way that they can be linked to related data sets and concepts.”
![Page 12: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/12.jpg)
![Page 13: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/13.jpg)
![Page 14: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/14.jpg)
RDF Data Cube vocabulary (QB)
• SDMX compatible
• Defines cubes as a set of observations that consist of dimensions, measures and attributes
• Dimensions: time period, region, sex (qb:DimensionProperty)
• Measure: population life expectancy (qb:MeasureProperty)
• Attribute: unit of measure = years, metadata status = measured (qb:AttributeProperty)
Observation: “the measured life expectancy of males in Newport in the period 2004-2006 is 76.7 years”
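The split into dimensions, measure and attributes can be sketched in plain Python. This is only an illustration of the data model, not RDF and not the QB vocabulary itself: the class `Observation`, the helper `slice_cube` and the key names (`refPeriod`, `refArea`, `lifeExpectancy`, and so on) are assumptions chosen for readability. The only data used is the example observation from the slide.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One data point of a cube (hypothetical stand-in for qb:Observation)."""
    dimensions: dict  # where the value sits: time period, region, sex
    measures: dict    # what was observed
    attributes: dict  # how to read the value: unit, status


def slice_cube(cube, **fixed):
    """Return the observations whose dimensions match every given value."""
    return [
        obs for obs in cube
        if all(obs.dimensions.get(key) == value for key, value in fixed.items())
    ]


# The example observation from the slide, with assumed key names.
newport_males = Observation(
    dimensions={"refPeriod": "2004-2006", "refArea": "Newport", "sex": "male"},
    measures={"lifeExpectancy": 76.7},
    attributes={"unitMeasure": "years", "obsStatus": "measured"},
)
cube = [newport_males]

for obs in slice_cube(cube, sex="male", refArea="Newport"):
    print(obs.measures["lifeExpectancy"], obs.attributes["unitMeasure"])
```

Fixing some dimensions and reading the measure of whatever remains is the basic operation that cross-census queries rely on.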
![Page 15: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/15.jpg)
CEDAR Integrator
https://github.com/CEDAR-project/Integrator
![Page 16: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/16.jpg)
Raw data
cedar:BRT_1889_08_T1-S0-K17 a tablink:DataCell ;
rdfs:label "K17";
tablink:value "12.0" ;
tablink:dimension cedar:BRT_1889_08_T1-S0-A8 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-K6 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-J3 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-K4 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-K5 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-B8 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-C12 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-E17 ;
tablink:dimension cedar:BRT_1889_08_T1-S0-F17 ;
tablink:sheet cedar:BRT_1889_08_T1-S0 .
![Page 17: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/17.jpg)
Harmonization Rules as Open Annotations
cedar:BRT_1889_08_T1-S0-K4-mapping a oa:Annotation ;
oa:hasBody cedar:BRT_1889_08_T1-S0-K4-mapping-body ;
oa:hasTarget cedar:BRT_1889_08_T1-S0-K4 ;
oa:serializedAt "2014-09-24"^^xsd:date ;
oa:serializedBy
<https://github.com/CEDAR-project/Integrator> ;
prov:wasGeneratedBy
cedar:BRT_1889_08_T1-S0-mapping-activity .
cedar:BRT_1889_08_T1-S0-K4-mapping-body a rdfs:Resource ;
sdmx-dimension:sex sdmx-code:sex-F .
![Page 18: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/18.jpg)
Harmonized RDF Data Cube
cedar:BRT_1889_02_T1-S0-K17-h a qb:Observation ;
cedar:population "12"^^xsd:decimal ;
maritalstatus:maritalStatus
maritalstatus:single ;
cedarterms:occupationPosition cedarterms:job-D ;
sdmx-dimension:sex sdmx-code:sex-F ;
cedarterms:occupation hisco:88030 ;
sdmx-dimension:refArea gg:11150 ;
prov:wasDerivedFrom
cedar:BRT_1889_08_T1-S0-K17 ;
prov:wasGeneratedBy
cedar:BRT_1889_08_T1-S0-K17-activity .
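The three slides above suggest a pipeline: a raw cell points at its header cells, annotations attach harmonized dimension values to header cells, and the harmonized observation collects those values. A minimal sketch of that idea follows, using plain dictionaries instead of RDF. The function `harmonize` and the dictionary layout are assumptions made for illustration, and this is not the code of the CEDAR Integrator. The identifiers are taken from the slides, and only two of the cell's header links are kept.

```python
from decimal import Decimal

# Raw cell as on the "Raw data" slide (header list shortened to two entries).
raw_cell = {
    "id": "cedar:BRT_1889_08_T1-S0-K17",
    "value": "12.0",
    "dimensions": [
        "cedar:BRT_1889_08_T1-S0-K4",
        "cedar:BRT_1889_08_T1-S0-A8",
    ],
}

# Harmonization rules: annotation target (header cell) -> annotation body.
mappings = {
    "cedar:BRT_1889_08_T1-S0-K4": {"sdmx-dimension:sex": "sdmx-code:sex-F"},
}


def harmonize(cell, mappings):
    """Build a harmonized observation and report headers without a rule."""
    observation = {
        "@id": cell["id"] + "-h",
        "@type": "qb:Observation",
        "cedar:population": Decimal(cell["value"]),
        "prov:wasDerivedFrom": cell["id"],
    }
    unmapped = []
    for header in cell["dimensions"]:
        body = mappings.get(header)
        if body is None:
            unmapped.append(header)   # a missing harmonized dimension
        else:
            observation.update(body)  # copy the dimension/value pairs
    return observation, unmapped


observation, unmapped = harmonize(raw_cell, mappings)
print(observation)
print("headers without a rule:", unmapped)
```

Returning the unmapped headers separately makes the gaps explicit, so that header cells still lacking a harmonization rule can be listed and handled.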
![Page 19: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/19.jpg)
Classification Systems and Concept Schemes
• Some missing harmonized dimensions!
• Encode all variables and their values using concept schemes
• Some already exist
– Which ones? How many of them?
– Where?
– By whom?
– Are they used at all? Can I reuse them?
• Some need to be created
– Manual and expert knowledge based
– Can we do it automatically? Or assist the process?
![Page 20: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/20.jpg)
• Dutch Historical Censuses (CEDAR)
• Dutch Ships and Sailors
• Gemeentegeschiedenis.nl
• HISCO
• ICONCLASS
• Dutch Historical Religions
• Dutch Historical House Types
![Page 22: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/22.jpg)
Existing dimensions
• Gemeentegeschiedenis.nl
![Page 23: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/23.jpg)
Existing LSD dimensions
• P1: Discoverability? How to discover dimensions created by others?
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others?
• P3: Relevance? What’s the size of LSD?
![Page 24: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/24.jpg)
LSD Dimensions
http://lsd-dimensions.org/
https://github.com/albertmeronyo/LSD-Dimensions
Hourly JSON-LD dumps
![Page 26: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/26.jpg)
![Page 27: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/27.jpg)
![Page 28: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/28.jpg)
![Page 29: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/29.jpg)
Existing LSD dimensions
• P1: Discoverability? How to discover dimensions created by others? => LSD Dimensions
• P2: Reusability? How often are dimensions reused? Can we reuse dimensions created by others? => Logarithmic law / probably yes
• P3: Relevance? What’s the size of LSD? => ~7.9% of the LOD cloud
![Page 30: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/30.jpg)
Creating new LSD Dimensions
• CEDAR needs concept schemes for
– Historical religious denominations (i.e. religions in the Netherlands in the 18th-20th c.)
– Historical occupations (same place and period)
– Historical building types (same place and period)
![Page 31: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/31.jpg)
https://github.com/CEDAR-project/TabCluster
![Page 32: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/32.jpg)
TabCluster
Leverages
● Lexical properties
○ Hierarchical clustering in Python scipy
○ String distances
● Semantic properties (LOD tagging)
○ skos:Concept of most frequent cluster-term
○ Closest common skos:broader skos:Concept of all cluster-terms
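The lexical half of this approach, grouping table terms by string distance with hierarchical clustering, can be sketched as follows. This is not the TabCluster code: to stay dependency-free it swaps scipy for a small pure-Python single-linkage procedure and uses `difflib.SequenceMatcher` as the string distance. The function names, the threshold and the four Dutch-looking sample terms are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def distance(a, b):
    """String distance in [0, 1]; 0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def cluster(terms, threshold):
    """Agglomerative single-linkage clustering, cut at a distance threshold."""
    clusters = [[term] for term in terms]
    while len(clusters) > 1:
        # Single linkage: two clusters are as close as their closest members.
        best, i, j = min(
            (min(distance(a, b) for a in clusters[i] for b in clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if best > threshold:
            break  # remaining clusters are too far apart to merge
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Hypothetical header terms, not taken from the census tables.
terms = ["bakker", "smid", "bakkers", "smit"]
print(cluster(terms, threshold=0.3))
```

Each resulting cluster could then be tagged with a skos:Concept as listed under "Semantic properties"; that step needs a lookup against Linked Open Data and is left out here.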
![Page 33: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/33.jpg)
Compatibility? Remixability? Reusability?
Sarven Capadisli, Albert Meroño-Peñuela, Sören Auer, Reinhard Riedl. “Semantic Similarity and Correlation of Linked Statistical Data Analysis”. 2nd Int. Workshop on Semantic Statistics (SemStats) ISWC 2014.
![Page 34: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/34.jpg)
Concept Drift
Census classification of occupations as of 1859
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
![Page 35: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/35.jpg)
Concept Drift
Census classification of occupations as of 1889
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
![Page 36: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/36.jpg)
Concept Drift
Census classification of occupations as of 1899
• Root node is void
• Depth 1: occupation groups
• Leaves: actual occupations
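The tree shape described on these slides (void root, occupation groups at depth 1, occupations as leaves) maps naturally onto a dictionary from group to occupations, which makes two census years easy to compare. The sketch below is one possible way to quantify the change between two such trees, using Jaccard overlap of the leaves plus a list of occupations that changed group. The slides do not say this is the measure CEDAR uses, and both toy classifications are invented for illustration, not taken from the 1859 or 1889 censuses.

```python
def leaves(tree):
    """Map each leaf occupation to its depth-1 group (the root is implicit)."""
    return {occ: group for group, occs in tree.items() for occ in occs}


def drift(old, new):
    """Summarize how a classification changed between two census years."""
    a, b = leaves(old), leaves(new)
    shared = a.keys() & b.keys()
    return {
        "jaccard": len(shared) / len(a.keys() | b.keys()),
        "moved": sorted(occ for occ in shared if a[occ] != b[occ]),
        "dropped": sorted(a.keys() - b.keys()),
        "added": sorted(b.keys() - a.keys()),
    }


# Hypothetical toy classifications; labels are made up.
earlier = {
    "Agriculture": ["farmer", "shepherd"],
    "Trade": ["baker", "merchant"],
}
later = {
    "Agriculture": ["farmer"],
    "Industry": ["baker", "typesetter"],
    "Trade": ["merchant"],
}

print(drift(earlier, later))
```

Occupations reported under `moved` are the interesting cases: the label survives between years but its place in the classification does not.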
![Page 37: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/37.jpg)
Concept Drift
Upper ontologies (HISCO, AC)
Year-dependent ontologies: 1859, 1869, 1879
![Page 38: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/38.jpg)
Concept Drift
Upper ontologies (HISCO, AC)
Year-dependent ontologies
![Page 39: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/39.jpg)
Concept Drift
Upper ontologies (HISCO, AC)
Year-dependent ontologies
? ?
![Page 40: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/40.jpg)
Preserving CEDAR
![Page 41: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/41.jpg)
Preserving CEDAR
• DANS-EASY as backend (http://easy.dans.knaw.nl/)
• Archived objects: Turtle snapshots
– 20 GB uncompressed, 200 MB compressed (per snapshot)
– Versioning (stats on current release)
• Users still need to
– SPARQL the data => bring up the endpoint on demand
– Run analytics on the data => outsource statistical analysis
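Archiving a Turtle snapshot in compressed form can be sketched with the Python standard library. The slide does not say which compressor is used for the DANS-EASY deposits, so gzip is an assumption here, as are the function name `compress_snapshot` and the sample text, which simply repeats one triple from the "Raw data" slide as a stand-in for a real snapshot.

```python
import gzip


def compress_snapshot(turtle_text):
    """Gzip a Turtle snapshot; return the bytes and the compression ratio."""
    raw = turtle_text.encode("utf-8")
    packed = gzip.compress(raw)
    return packed, len(raw) / len(packed)


# Stand-in for a real snapshot: one triple repeated many times.
sample = 'cedar:BRT_1889_08_T1-S0-K17 tablink:value "12.0" .\n' * 1000

packed, ratio = compress_snapshot(sample)
restored = gzip.decompress(packed).decode("utf-8")

print(f"{len(sample)} bytes -> {len(packed)} bytes, ratio {ratio:.0f}:1")
print("lossless round trip:", restored == sample)
```

The sizes quoted on the slide correspond to a ratio of roughly 100:1. RDF serializations repeat long URI prefixes, which is why they tend to compress well; the ratio printed for this artificial sample says nothing about real snapshots.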
![Page 42: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/42.jpg)
Thank you
Questions, suggestions, comments most welcome
@albertmeronyo
http://www.cedar-project.nl
http://krr.cs.vu.nl/
http://easy.dans.knaw.nl/
http://lsd-dimensions.org/
![Page 43: CEDAR & PRELIDA Preservation of Linked Socio-Historical Data](https://reader033.vdocuments.net/reader033/viewer/2022060121/5593ecf01a28ab5d3b8b456f/html5/thumbnails/43.jpg)
Me in 6 tweets
http://www.albertmeronyo.org
• Background: Computer Science, Web hacker, AI & Law
• PhD candidate at the VU University Amsterdam, DANS, and eHumanities group (KNAW)
• Topic: Semantic Web for the Humanities
• CEDAR project (2012-2015): harmonized historical Dutch censuses in the Semantic Web
• Problem: statistical data publishing, concept drift and dynamics of meaning
• Last paper: What is Linked Historical Data? (EKAW 2014)