AGREEMENT No LC-00921441 CEF-TC-2018-15 eArchiving

Preparing Digital Collections for Big Data Analysis

Sven Schlarb, Austrian Institute of Technology

e-Archiving, Cordoba, Spain

05th October 2018

Digital Transformation

Copyright Doc Searls, https://flic.kr/p/9o5AEY

Digital Transformation

Copyright (network diagram) https://www.wikidata.org/wiki/User:Thepwnco, CC BY-SA 4.0

Archiving at internet scale

Snapshots from 2003 and 2018: https://web.archive.org/web/*/https://www.cordoba.es/

Is big data still a hype? 2014

BIG DATA

Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons

Is big data still a hype? 2015

Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons

Is big data still a hype? 2018

BIG DATA

Jeremykemp at English Wikipedia [GFDL (http://www.gnu.org/copyleft/fdl.html) or CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)], from Wikimedia Commons

To SQL or to NoSQL?

• Relational databases
• NoSQL databases

NoSQL Databases

Key-Value

K1   AAA,BBB,CCC
K2   AAA,BBB
K3   AAA,DDD
K4   AAA,2,01/01/2018
K5   3,ZZZ,5623

Wide Column

Key  Participant      Conference
ID   Name   City      Name     Address     City
1    John   London    PVC2018  Townroad 2  Manchester
2    Linda  Palme     TFC2018  Market 2    Berlin

Document

{
  "name": "Sven Schlarb",
  "email": "sven.schlarb@ait.ac.at",
  "events": [
    {
      "name": "Kulturhackathon openGLAM.at",
      "date": "2018-09-22T00:00:00.000Z"
    },
    {
      "name": "e-Archiving Cordoba",
      "date": "2018-10-05T00:00:00.000Z"
    }
  ]
}

Graph

Nodes: Person, Event, Person

Different NoSQL database types
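
The four data models on this slide can be contrasted by holding the slide's sample records in plain Python structures. This is only an illustrative sketch: no database is involved, the node identifiers are hypothetical, and the "attended" edge label is an assumption, since the slide shows only the Person and Event nodes.

```python
import json

# Key-value: one opaque value per key; the store does not interpret it.
key_value = {"K1": "AAA,BBB,CCC", "K2": "AAA,BBB", "K3": "AAA,DDD"}

# Wide column: a row key maps to column families, each with its own columns.
wide_column = {
    1: {"Participant": {"Name": "John", "City": "London"},
        "Conference": {"Name": "PVC2018", "Address": "Townroad 2", "City": "Manchester"}},
    2: {"Participant": {"Name": "Linda", "City": "Palme"},
        "Conference": {"Name": "TFC2018", "Address": "Market 2", "City": "Berlin"}},
}

# Document: a self-contained nested record, typically JSON.
document = json.loads("""
{
  "name": "Sven Schlarb",
  "events": [
    {"name": "Kulturhackathon openGLAM.at", "date": "2018-09-22T00:00:00.000Z"},
    {"name": "e-Archiving Cordoba", "date": "2018-10-05T00:00:00.000Z"}
  ]
}
""")

# Graph: typed nodes plus labelled edges (ids and edge label are assumptions).
nodes = {"p1": "Person", "e1": "Event", "p2": "Person"}
edges = [("p1", "attended", "e1"), ("p2", "attended", "e1")]


def attendees(event_id):
    """Return the ids of all nodes with an 'attended' edge to the event."""
    return sorted(src for src, label, dst in edges
                  if label == "attended" and dst == event_id)


print(key_value["K1"])
print(wide_column[1]["Conference"]["City"])
print([event["name"] for event in document["events"]])
print(attendees("e1"))
```

The same facts are reachable in each model, but the access path differs: by key, by row and column family, by walking a nested record, or by following edges.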

Job Tracker

Task Trackers

Data Nodes

Name Node

CPU: 1 x 2.53GHz Quadcore CPU (8 HyperThreading cores)
RAM: 16GB
DISK: 2 x 1TB DISKs configured as RAID0 (Performance) – 2 TB effective

• Of 8 HT cores: 5 for Map; 2 for Reduce; 1 for OS.
– 25 processing cores for Map tasks
– 10 processing cores for Reduce tasks
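
The cluster-wide totals follow from the per-node split if that split is applied on each of five worker nodes. The node count is an assumption: the slide does not state it, and five is simply the value for which the per-node figures reproduce both totals.

```python
# Per-node core allocation as listed on the slide.
MAP_CORES_PER_NODE = 5
REDUCE_CORES_PER_NODE = 2
OS_CORES_PER_NODE = 1

# Assumption (not on the slide): number of worker nodes.
WORKER_NODES = 5

# Cores accounted for on a single node.
cores_per_node = MAP_CORES_PER_NODE + REDUCE_CORES_PER_NODE + OS_CORES_PER_NODE

# Cluster-wide capacity for parallel Map and Reduce tasks.
map_cores = WORKER_NODES * MAP_CORES_PER_NODE
reduce_cores = WORKER_NODES * REDUCE_CORES_PER_NODE

print(f"cores accounted for per node: {cores_per_node}")
print(f"Map cores in cluster: {map_cores}")
print(f"Reduce cores in cluster: {reduce_cores}")
```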

CPU: 2 x 2.40GHz Quadcore CPU (16 HyperThreading cores)
RAM: 24GB
DISK: 3 x 1TB DISKs configured as RAID5 (Redundancy) – 2 TB effective

E-ARK Experimental Cluster

• Modular package transformation workflows & metadata creation
• Parallelize full-text indexing
• Fast random access to individual files
• Aggregating data using facet queries
• Data mining (Classification, NER)
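
As an illustration of aggregating data with facet queries, the sketch below builds a Solr facet request and reads the counts out of a response. It assumes a Solr server (Solr is named later in this deck); the base URL, the core name `collection1`, the field name `mime_type` and the sample response are all hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode


def facet_query_url(base_url, core, field, query="*:*"):
    """Build a Solr select URL that returns only facet counts for one field."""
    params = {
        "q": query,            # match all documents by default
        "rows": 0,             # no documents wanted, only the aggregation
        "facet": "true",
        "facet.field": field,
        "wt": "json",
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


def facet_counts(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list to a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


url = facet_query_url("http://localhost:8983/solr", "collection1", "mime_type")

# Hypothetical response body, shaped like Solr's default JSON facet output.
sample_response = {
    "facet_counts": {
        "facet_fields": {"mime_type": ["application/pdf", 12, "text/xml", 7]}
    }
}
counts = facet_counts(sample_response, "mime_type")
print(url)
print(counts)
```

Setting `rows` to 0 keeps the response small, because only the per-value counts are of interest here.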

Faceted Search & Data Mining

Access

Full-text indexing & search

Package transformation and Ingest

Reference Implementation

SIP

E-ARK Information Package (simplified)

representations

metadata

[schemas/documentation]

Structural metadata

Provenance metadata

Technical metadata

Descriptive metadata

SIP

DIP

DIP

Metadata edits

Migrations

Add emulation info
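
The simplified package layout named on this slide can be sketched as a folder skeleton. The folder names come from the slide; treating the bracketed `schemas` and `documentation` folders as optional is an assumption based on the brackets, and the package name `example-sip` is hypothetical. This is not the full E-ARK package specification.

```python
import tempfile
from pathlib import Path

# Folders named on the slide.
CORE_FOLDERS = ["representations", "metadata"]
# Shown in brackets on the slide; treated here as optional (assumption).
BRACKETED_FOLDERS = ["schemas", "documentation"]


def create_package_skeleton(root, name, with_bracketed=True):
    """Create an empty information package folder structure under root."""
    package = Path(root) / name
    folders = CORE_FOLDERS + (BRACKETED_FOLDERS if with_bracketed else [])
    for folder in folders:
        (package / folder).mkdir(parents=True, exist_ok=True)
    return package


with tempfile.TemporaryDirectory() as tmp:
    package = create_package_skeleton(tmp, "example-sip")
    created = sorted(path.name for path in package.iterdir())

print(created)
```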

• earkweb is based on Python and the Celery task execution system.
– Create archival workflows from predefined tasks which can be executed in parallel on a computer cluster.
– Examples are data validation, format migration, content extraction, database transformation, packaging, interfacing with storage systems.
– earkweb provides a graphical interface and can be used interactively as well as in batch mode.
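
The idea of composing a workflow from predefined tasks and running it for many packages in parallel can be sketched with the Python standard library. This is a stand-in only: it uses `concurrent.futures` threads on one machine instead of Celery workers on a cluster, and the task names and package dictionaries are hypothetical, not earkweb's actual code.

```python
from concurrent.futures import ThreadPoolExecutor


# Predefined tasks: each takes a package record and returns it updated.
def validate(package):
    package["log"].append("validated")
    return package


def migrate_formats(package):
    package["log"].append("migrated")
    return package


def create_package(package):
    package["log"].append("packaged")
    return package


# A workflow is an ordered list of predefined tasks.
WORKFLOW = [validate, migrate_formats, create_package]


def run_workflow(name):
    """Run every task of the workflow, in order, for one package."""
    package = {"name": name, "log": []}
    for task in WORKFLOW:
        package = task(package)
    return package


def run_batch(names, workers=4):
    """Process several packages in parallel; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_workflow, names))


results = run_batch(["pkg-a", "pkg-b", "pkg-c"])
for result in results:
    print(result["name"], result["log"])
```

Within one package the tasks run in sequence, while different packages are processed independently, which is what makes the batch easy to distribute.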

earkweb

Worker Worker Worker Worker

Staging/Storage Area

NAS <<package transfer>>

decoupled

<<notification>>

<<search and retrieval>>

Information package status

Task results

Cluster Deployment Stack

Standalone Deployment Stack

Worker Worker Worker Worker

Staging/Storage Area

NAS <<indexing>>

<<search and retrieval>>

Information package status

Task results

Data Mining/NLP

• Purpose: Analyse digital resources of collections
• Selected use cases:
– Location names occurring in texts: named entity recognition and incorporation of geo-information
– Text classification

Location names occurring in texts

StanfordNER for NER

Nominatim (the geocoding service behind the openstreetmap.org search) for georeferencing

Peripleo for visualization
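
The georeferencing step can be sketched as follows: keep the location entities from the NER output, build a Nominatim search request for each, and read coordinates from the reply. The NER output shape, the sample reply and its coordinate values are illustrative assumptions; no request is sent, and this is not the project's actual pipeline code.

```python
import json
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def geocode_url(place_name):
    """Build a Nominatim search URL asking for the single best JSON hit."""
    params = {"q": place_name, "format": "json", "limit": 1}
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"


def first_hit(response_text):
    """Return name and coordinates of the first hit, or None if no hits."""
    hits = json.loads(response_text)
    if not hits:
        return None
    hit = hits[0]
    # Nominatim returns lat/lon as strings, so convert them to floats.
    return {"name": hit["display_name"],
            "lat": float(hit["lat"]),
            "lon": float(hit["lon"])}


# Stand-in for NER output as (text, label) pairs; the shape is an assumption.
ner_output = [("Cordoba", "LOCATION"), ("Sven Schlarb", "PERSON")]
locations = [text for text, label in ner_output if label == "LOCATION"]

# Illustrative reply with made-up, rounded coordinates.
sample_response = '[{"display_name": "Cordoba, Spain", "lat": "37.88", "lon": "-4.78"}]'

for place in locations:
    print(geocode_url(place))
print(first_hit(sample_response))
```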

Location names occurring in texts

Peripleo - PELAGIOS Project

Geographical/timeline search

Peripleo - PELAGIOS Project

Provided: GML data and TIFF images of maps with metadata (coordinate system, time, etc.)

Convert GML data to Peripleo RDF

Translate coordinate system if necessary

Use peripleo to search for and visualize regions and filter by time
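
Reading positions from GML and translating the coordinate system can be sketched as below. The slide does not name the source coordinate system, so Web Mercator (EPSG:3857) is assumed purely for illustration, and the GML snippet is made up. A production workflow would more likely use a projection library, and the conversion to Peripleo RDF is not shown.

```python
import math
import xml.etree.ElementTree as ET

GML_POS_TAG = "{http://www.opengis.net/gml}pos"
EARTH_RADIUS_M = 6378137.0  # sphere radius used by Web Mercator


def read_positions(gml_text):
    """Extract (x, y) pairs from all gml:pos elements in a GML fragment."""
    root = ET.fromstring(gml_text)
    positions = []
    for pos in root.iter(GML_POS_TAG):
        x, y = (float(value) for value in pos.text.split())
        positions.append((x, y))
    return positions


def web_mercator_to_wgs84(x, y):
    """Convert Web Mercator metres to (longitude, latitude) in degrees."""
    lon = math.degrees(x / EARTH_RADIUS_M)
    lat = math.degrees(2.0 * math.atan(math.exp(y / EARTH_RADIUS_M)) - math.pi / 2.0)
    return lon, lat


# Made-up sample; srsName records the assumed source coordinate system.
SAMPLE_GML = """<gml:Point xmlns:gml="http://www.opengis.net/gml" srsName="EPSG:3857">
  <gml:pos>0 0</gml:pos>
</gml:Point>"""

positions = read_positions(SAMPLE_GML)
converted = [web_mercator_to_wgs84(x, y) for x, y in positions]
print(positions)
print(converted)
```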

Geographical/timeline search

Peripleo - PELAGIOS Project

Text classification using scikit-learn

• Prepare data to train SVM classifier
• Dump full-texts of the repository into re-usable packages
• Apply text classification and update Solr records accordingly
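
A minimal sketch of these steps with scikit-learn: train a linear SVM on TF-IDF features, classify new full texts, and prepare Solr atomic-update documents carrying the predicted label. The training texts, labels, document ids and the `category` field name are made up for illustration; the deck does not describe the actual features, classes or Solr schema.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data (hypothetical), standing in for dumped full texts.
train_texts = [
    "invoice payment tax amount",
    "payment invoice due bank",
    "council meeting minutes agenda",
    "meeting agenda council vote",
]
train_labels = ["finance", "finance", "administration", "administration"]

# TF-IDF features feeding a linear support vector machine.
classifier = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
classifier.fit(train_texts, train_labels)


def solr_atomic_update(doc_id, label, field="category"):
    """Build a Solr atomic-update document that sets one field."""
    return {"id": doc_id, field: {"set": label}}


# New documents to classify (hypothetical ids and texts).
new_docs = {"doc-1": "bank invoice payment", "doc-2": "council agenda"}
predicted = classifier.predict(list(new_docs.values()))
updates = [solr_atomic_update(doc_id, str(label))
           for doc_id, label in zip(new_docs, predicted)]
print(updates)
```

The list of update documents could then be posted to the index's update handler; that step is omitted because it depends on the deployment.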

Database archiving, rebuilding and analysis

source: wikipedia

SIARD

RDBMS data (up to 80TB)

e.g. Postgres, e.g. Oracle

Submit ... Archive ... Reconstruct ... Analyse

Thank you very much for your attention! Any questions?

sven.schlarb@ait.ac.at
