Invenio: Pythonic Framework for Large-Scale Digital Libraries

Tibor Šimko, CERN

AIMS/FAO Webinar, November 13th 2014

@tiborsimko 1 / 46



Outline

1 Introduction

2 Use cases

3 Architecture

4 Technology

5 Preservation Practices

6 From Publications to Data

7 Conclusions

1 Introduction

What is Invenio?

digital library and document repository software
– mature platform: first public release in 2002
– rich data: articles, books, notes, photos, videos, data, code

originated in high-energy physics
– institutional repository: CERN Document Server
– integrated library system: CERN Library
– disciplinary repository: INSPIRE

nowadays co-developed by an international collaboration

participating in and collaborating with several EU projects

2 Use cases

Example: cds.cern.ch
– CERN photo search
– CERN books and loans

Example: inspirehep.net
– citation summary
– citation history
– INSPIRE author page

Example: ILO

Example: IHEID

Example: RERO

Example: EPFL

3 Architecture

Modular Architecture

[Diagram. Actors: Author, Sources, Librarian, User. Components: Database, Ingestion Modules, Processing Modules, Dissemination Modules, Curation Modules.]

Ingestion Modules

[Diagram. Actor: Author. Inputs: OAI Data Source, Non-OAI Data Source. Modules: WebSubmit, WebSession, WebAccess, BibHarvest, ElmSubmit, BibConvert, BibUpload, BibSched. Data: metadata (MARCXML), full-text document. Stores: Metadata, Full-text.]
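The ingestion diagram names MARCXML as the metadata format handed between modules. As a rough illustration of what such a record looks like to downstream code, the sketch below reads a MARCXML fragment with the Python standard library only. The sample record and the helper name `marc_subfields` are assumptions made up for this example; this is not the code of BibConvert or BibUpload.

```python
import xml.etree.ElementTree as ET

# MARC21 "slim" namespace used by MARCXML documents.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}

# Hypothetical sample record, invented for this illustration.
SAMPLE = """<collection xmlns="http://www.loc.gov/MARC21/slim">
  <record>
    <datafield tag="100" ind1=" " ind2=" ">
      <subfield code="a">Doe, J.</subfield>
    </datafield>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">An example title</subfield>
    </datafield>
  </record>
</collection>"""


def marc_subfields(xml_text, tag, code):
    """Return all values of subfield `code` in datafields with MARC tag `tag`."""
    root = ET.fromstring(xml_text)
    values = []
    for field in root.iterfind(".//marc:datafield", MARC_NS):
        if field.get("tag") != tag:
            continue
        for sub in field.iterfind("marc:subfield", MARC_NS):
            if sub.get("code") == code:
                values.append(sub.text)
    return values


# MARC 245 $a holds the title, MARC 100 $a the first author.
titles = marc_subfields(SAMPLE, "245", "a")
authors = marc_subfields(SAMPLE, "100", "a")
```

A tag that does not occur in the record simply yields an empty list, which keeps the helper safe to call on sparse records.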

Processing Modules

[Diagram. Modules: RefExtract, BibClassify, BibDocFile, BibEncode, BibIndex, WebColl, BibRank, BibFormat, BibSort, BibAuthorID. Stores: Metadata, Full-text, Clusters.]

Dissemination Modules

[Diagram. Actors: User, OAI Harvester. Modules: WebSearch, WebBasket, BibAuthorID, WebAlert, BibHarvest, WebComment, WebMessage, WebJournal, BibCirculation, WebStat, WebHelp. Stores: Metadata, Full-text, Clusters.]

Curation Modules

[Diagram. Actor: Librarian. Modules: BibEdit, MultiEdit, BatchUploader, BibCheck, BibCirculation, BibDocFile, BibClassify, RefExtract, BibCatalog, BibKnowledge, BibExport, BibMatch, BibMerge. Stores: Metadata, Full-text, Tasks, Knowledge Bases.]

Module overview

60+ modules

450,000+ lines of code

160+ authors and contributors since 2002

4 Technology

Community

Invenio User Group Workshop 2012

developer community growing
– 4 developers and contributors in 2002
– 48 developers and contributors in 2012

adapting tools and processes
– custom processes → mainstream processes
– custom MVC → mainstream MVC

Example: UI evolution

Invenio v1
– home-grown CSS (custom)
– fast but niche template engine (Python)

Invenio v2
– mainstream CSS (Twitter Bootstrap)
– mainstream template engine (Jinja)

Example: API evolution

Google search trends: XML API vs JSON API

standardising upon /api structure

(GET|POST) /api/<service_family>/<service_verb>/<mandatoryarg>?optarg1=val1&optarg2=val2
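To make the /api URL pattern concrete, here is a minimal sketch of how a client might assemble such a URL. The function name `build_api_url` and the example service names ("record", "get") are hypothetical; the slide only gives the general shape of the path and query string.

```python
from urllib.parse import quote, urlencode


def build_api_url(service_family, service_verb, mandatory_arg, **optional_args):
    """Assemble /api/<service_family>/<service_verb>/<mandatoryarg>?opt=val."""
    path = "/api/" + "/".join(
        quote(str(part), safe="")
        for part in (service_family, service_verb, mandatory_arg)
    )
    # Sort optional arguments so the resulting URL is deterministic.
    query = urlencode(sorted(optional_args.items()))
    return path + ("?" + query if query else "")


# Hypothetical call: family "record", verb "get", mandatory argument 42.
url = build_api_url("record", "get", 42, format="json", verbose=1)
```

Keeping the mandatory argument in the path and the optional ones in the query string mirrors the split shown in the pattern.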

Example: Trac → GitHub

tickets · pull requests · code reviews · kwalitee · coverage · builds

Overview of technologies

front-end:

back-end:

persistence:


E.g. Abstraction of records

abstraction of record fields:
– metadata fields, e.g. author
– derived fields, e.g. number_of_authors
– virtual fields, e.g. number_of_citations

abstraction of record formats:

[Diagram. Formats: JSON, UNIMARC, MARC21, EAD. Backends: MongoDB, PostgreSQL. Label: model configuration.]
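One plausible reading of the three field kinds is sketched below: metadata fields are stored as-is, derived fields are computed from stored metadata, and virtual fields are resolved from another source at access time. The `Record` class, the `recid` key, and the `citation_counter` callable are assumptions for illustration and do not reproduce Invenio's own record API.

```python
class Record:
    """Toy record exposing metadata, derived, and virtual fields uniformly."""

    def __init__(self, metadata, citation_counter):
        self._metadata = dict(metadata)
        # Callable standing in for an external citation index (assumption).
        self._citation_counter = citation_counter

    def __getitem__(self, field):
        if field == "number_of_authors":
            # Derived field: computed from the stored metadata.
            return len(self._metadata.get("author", []))
        if field == "number_of_citations":
            # Virtual field: looked up elsewhere at access time, never stored.
            return self._citation_counter(self._metadata["recid"])
        # Plain metadata field: returned as stored.
        return self._metadata[field]


# Hypothetical citation data keyed by record identifier.
citations = {1: 7}
rec = Record(
    {"recid": 1, "author": ["Doe, J.", "Roe, R."]},
    citation_counter=lambda recid: citations.get(recid, 0),
)
```

Callers read `rec["author"]`, `rec["number_of_authors"]`, and `rec["number_of_citations"]` the same way, without needing to know which kind of field they are touching.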

E.g. Abstraction of storage

[Diagram. Stores: document store, record store, annotation store. Layers: file storage abstraction layer, JSON Alchemy, linked data. File storage backends: CASTOR, Ceph, Box, AFS, Drive, EOS, S3.]
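The idea of a file storage abstraction layer can be sketched as a small interface with interchangeable backends. Everything named here (`FileStorage`, `InMemoryStorage`, `DirectoryStorage`, `archive`) is hypothetical; real backends such as EOS or S3 would need their own client libraries, so a dictionary and a local directory stand in for them.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileStorage(ABC):
    """Minimal interface that every storage backend implements."""

    @abstractmethod
    def save(self, name, data):
        """Store the bytes `data` under `name`."""

    @abstractmethod
    def load(self, name):
        """Return the bytes stored under `name`."""


class InMemoryStorage(FileStorage):
    """Backend keeping files in a dictionary."""

    def __init__(self):
        self._blobs = {}

    def save(self, name, data):
        self._blobs[name] = bytes(data)

    def load(self, name):
        return self._blobs[name]


class DirectoryStorage(FileStorage):
    """Backend keeping files in a local directory."""

    def __init__(self, root):
        self._root = root

    def save(self, name, data):
        with open(os.path.join(self._root, name), "wb") as handle:
            handle.write(data)

    def load(self, name):
        with open(os.path.join(self._root, name), "rb") as handle:
            return handle.read()


def archive(storage, name, data):
    """Caller code depends only on the interface, not on the backend."""
    storage.save(name, data)
    return storage.load(name)


results = []
with tempfile.TemporaryDirectory() as tmp:
    for backend in (InMemoryStorage(), DirectoryStorage(tmp)):
        results.append(archive(backend, "paper.pdf", b"%PDF-1.4 example"))
```

Because `archive` never inspects the backend type, adding another storage system means adding one class rather than changing the callers.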

5 Preservation Practices

OAIS Reference Model

SIP = Submission Information Package · AIP = Archival Information Package · DIP = Dissemination Information Package


Example: BlogForever


Example: BagIt
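The slide shows BagIt only as a screenshot. As a hedged illustration of the packaging idea, the sketch below writes a minimal bag (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest) and then verifies it. The helper names `make_bag` and `verify_bag`, the flat payload layout, and the choice of SHA-256 and version 0.97 are assumptions; this is not how Invenio itself produces bags.

```python
import hashlib
import os
import tempfile


def make_bag(bag_dir, payload):
    """Write `payload` ({file name: bytes}) as a minimal BagIt bag."""
    data_dir = os.path.join(bag_dir, "data")
    os.makedirs(data_dir)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        with open(os.path.join(data_dir, name), "wb") as handle:
            handle.write(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest line: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}\n")
    with open(os.path.join(bag_dir, "bagit.txt"), "w", encoding="utf-8") as handle:
        handle.write("BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
    manifest = os.path.join(bag_dir, "manifest-sha256.txt")
    with open(manifest, "w", encoding="utf-8") as handle:
        handle.writelines(manifest_lines)


def verify_bag(bag_dir):
    """Return True if every payload file still matches its manifest checksum."""
    manifest = os.path.join(bag_dir, "manifest-sha256.txt")
    with open(manifest, encoding="utf-8") as handle:
        for line in handle:
            digest, path = line.split(None, 1)
            with open(os.path.join(bag_dir, path.strip()), "rb") as payload_file:
                if hashlib.sha256(payload_file.read()).hexdigest() != digest:
                    return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    bag = os.path.join(tmp, "example-bag")
    make_bag(bag, {"paper.txt": b"hello"})
    ok_before = verify_bag(bag)
    # Simulate corruption of the payload after packaging.
    with open(os.path.join(bag, "data", "paper.txt"), "wb") as handle:
        handle.write(b"tampered")
    ok_after = verify_bag(bag)
```

The second verification fails because the payload no longer matches the recorded checksum, which is the fixity check a preservation workflow relies on.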

6 From Publications to Data

Example: ZENODO


“Code You Can Cite”

developed Invenio ↔ GitHub bridge

archive software automatically upon release; mint it with a DOI

https://guides.github.com/activities/citable-code

Code ↔ Data ↔ Paper

link data (DATAVERSE) to code (ZENODO) to papers (INSPIRE)

first use cases: hep-ex/0011057, arXiv:1401.0080

Example: HEP Data Analysis

Example: CERN Open Data
– Primary Datasets
– Data visualisations

7 Conclusions

Invenio as an established digital library software

modular, scalable, configurable architecture

institutional repositories: CERN, ILO, and more

world wide repositories: INSPIRE

rich formats: articles, books, audio, photo, video, data, software

OAIS-inspired preservation practices

from “small data” to “big data”

Capturing and disseminating knowledge on data, code, and publications

to enable future data reuse

http://invenio-software.org/

@inveniosoftware
