Invenio: Pythonic Framework for Large-Scale Digital Libraries
Tibor Šimko, CERN
AIMS/FAO Webinar, November 13th 2014
@tiborsimko 1 / 46
Outline
1 Introduction
2 Use cases
3 Architecture
4 Technology
5 Preservation Practices
6 From Publications to Data
7 Conclusions
What is Invenio?
digital library and document repository software
– mature platform: first public release in 2002
– rich data: articles, books, notes, photos, videos, data, code
originated in high-energy physics
– institutional repository: CERN Document Server
– integrated library system: CERN Library
– disciplinary repository: INSPIRE
nowadays co-developed by an international collaboration
participating in and collaborating with several EU projects
Modular Architecture
[Diagram labels: actors Author, Sources, Librarian, User; Database; module groups Ingestion Modules, Processing Modules, Dissemination Modules, Curation Modules]
Ingestion Modules
Modules shown: WebSubmit, WebSession, WebAccess, BibUpload, BibSched, BibConvert, BibHarvest, ElmSubmit
[Other diagram labels: Author, Metadata, Full-text, full-text document, metadata, MARCXML, OAI Data Source, Non-OAI Data Source]
Processing Modules
Modules shown: RefExtract, BibClassify, BibDocFile, BibEncode, BibIndex, WebColl, BibRank, BibFormat, BibSort, BibAuthorID
[Other diagram labels: Metadata, Full-text, Clusters]
Dissemination Modules
Modules shown: WebSearch, WebBasket, BibAuthorID, WebAlert, BibHarvest, WebComment, WebMessage, WebJournal, BibCirculation, WebStat, WebHelp
[Other diagram labels: Metadata, Full-text, Clusters, User, OAI Harvester]
Curation Modules
Modules shown: BibEdit, MultiEdit, BatchUploader, BibCheck, BibCirculation, BibDocFile, BibClassify, RefExtract, BibCatalog, BibKnowledge, BibExport, BibMatch, BibMerge
[Other diagram labels: Metadata, Librarian, Full-text, Tasks, Knowledge Bases]
Module overview
60+ modules
450,000+ lines of code
160+ authors and contributors since 2002
Community
Invenio User Group Workshop 2012
developer community growing
– 4 developers and contributors in 2002
– 48 developers and contributors in 2012
adapting tools and processes
– custom processes → mainstream processes
– custom MVC → mainstream MVC
Example: UI evolution
Invenio v1
– home-grown CSS (custom)
– fast but niche template engine (Python)
Invenio v2
– mainstream CSS (Twitter Bootstrap)
– mainstream template engine (Jinja)
Example: API evolution
Google search trends: XML API vs JSON API
standardising upon /api structure
(GET|POST) /api/<service_family>/<service_verb>/<mandatoryarg>?optarg1=val1&optarg2=val2
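The /api URL pattern above can be illustrated with a minimal sketch using only the Python standard library. The helper names `build_api_url` and `parse_api_url`, and the example values `record`, `get`, and `123`, are hypothetical and not taken from Invenio itself; the sketch only shows how the four components of the pattern fit together.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_api_url(service_family, service_verb, mandatory_arg, **optional_args):
    """Assemble /api/<service_family>/<service_verb>/<mandatoryarg>?opt=val."""
    url = "/api/{}/{}/{}".format(service_family, service_verb, mandatory_arg)
    if optional_args:
        # Sorting keeps the query string deterministic.
        url += "?" + urlencode(sorted(optional_args.items()))
    return url


def parse_api_url(url):
    """Split such a URL back into family, verb, mandatory arg, and options."""
    parts = urlsplit(url)
    segments = parts.path.strip("/").split("/")
    if len(segments) != 4 or segments[0] != "api":
        raise ValueError("not an /api/<family>/<verb>/<arg> URL: " + url)
    _, family, verb, arg = segments
    # parse_qs returns lists of values; keep the first value of each option.
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return family, verb, arg, options


if __name__ == "__main__":
    # Hypothetical service family and verb, for illustration only.
    print(build_api_url("record", "get", "123", optarg1="val1", optarg2="val2"))
```

Keeping the mandatory argument in the path and the optional ones in the query string means a client can tell from the URL shape alone which parameters a service requires.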
Example: Trac → GitHub
tickets · pull requests · code reviews · kwalitee · coverage · builds
E.g. Abstraction of records
abstraction of record fields:
– metadata fields, e.g. author
– derived fields, e.g. number_of_authors
– virtual fields, e.g. number_of_citations
abstraction of record formats:
[Diagram labels: JSON, UNIMARC, MARC21, EAD, MongoDB, PostgreSQL, model configuration]
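The three kinds of record fields listed on this slide can be sketched as follows. This is one plausible reading, not Invenio's actual implementation: it assumes a derived field is computed from stored metadata and kept alongside it, while a virtual field is never stored and is looked up on each access. The `Record` class, the `citation_lookup` callable, and the `recid` key are hypothetical names introduced for illustration.

```python
class Record(dict):
    """A record with metadata, derived, and virtual fields (illustrative only)."""

    def __init__(self, metadata, citation_lookup=None):
        # Metadata fields: stored as given, e.g. "authors".
        super().__init__(metadata)
        # Derived field (assumption): computed from metadata and stored with it.
        self["number_of_authors"] = len(self.get("authors", []))
        # Hypothetical external source of citation counts; defaults to zero.
        self._citation_lookup = citation_lookup or (lambda recid: 0)

    @property
    def number_of_citations(self):
        # Virtual field (assumption): not stored, recomputed on every access.
        return self._citation_lookup(self.get("recid"))


if __name__ == "__main__":
    citations = {1: 7}
    record = Record(
        {"recid": 1, "authors": ["A. Author", "B. Author"]},
        citation_lookup=lambda recid: citations.get(recid, 0),
    )
    print(record["number_of_authors"], record.number_of_citations)
```

Under this reading, the virtual field always reflects the current state of the external source, whereas the derived field only changes when the record itself is rebuilt.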
E.g. Abstraction of storage
[Diagram labels: document store; file storage abstraction layer; storage backends CASTOR, Ceph, Box, AFS, Drive, EOS, S3; JSON Alchemy; linked data; record store; annotation store]
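The idea of a file storage abstraction layer can be sketched as a small interface with interchangeable backends. This is a generic illustration of the pattern, not Invenio's API: the class names `FileStorage`, `LocalStorage`, and `MemoryStorage` and the function `archive_document` are hypothetical, and a local directory and an in-memory dictionary stand in for real backends such as those named on the slide.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileStorage(ABC):
    """Minimal interface that every storage backend implements."""

    @abstractmethod
    def save(self, name, data):
        """Store bytes under the given name."""

    @abstractmethod
    def load(self, name):
        """Return the bytes stored under the given name."""


class LocalStorage(FileStorage):
    """Backend that writes files under a root directory."""

    def __init__(self, root):
        self.root = root

    def save(self, name, data):
        with open(os.path.join(self.root, name), "wb") as handle:
            handle.write(data)

    def load(self, name):
        with open(os.path.join(self.root, name), "rb") as handle:
            return handle.read()


class MemoryStorage(FileStorage):
    """Backend that keeps everything in a dictionary (useful for tests)."""

    def __init__(self):
        self._blobs = {}

    def save(self, name, data):
        self._blobs[name] = bytes(data)

    def load(self, name):
        return self._blobs[name]


def archive_document(storage, name, data):
    """Caller code depends only on the interface, not on a concrete backend."""
    storage.save(name, data)
    return storage.load(name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        for backend in (LocalStorage(root), MemoryStorage()):
            print(type(backend).__name__, archive_document(backend, "note.txt", b"hello"))
```

Because `archive_document` only uses `save` and `load`, adding another backend means writing one more subclass without touching the calling code.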
OAIS Reference Model
SIP = Submission Information Package · AIP = Archival Information Package · DIP = Dissemination Information Package
“Code You Can Cite”
developed Invenio ↔ GitHub bridge
archive software automatically upon release; mint it with a DOI
https://guides.github.com/activities/citable-code
Code ↔ Data ↔ Paper
link data (DATAVERSE) to code (ZENODO) to papers (INSPIRE)
first use cases: hep-ex/0011057, arXiv:1401.0080
Conclusions
Invenio as an established digital library software
modular, scalable, configurable architecture
institutional repositories: CERN, ILO, and more
worldwide repositories: INSPIRE
rich formats: articles, books, audio, photo, video, data, software
OAIS-inspired preservation practices
from “small data” to “big data”
Capturing and disseminating knowledge on data, code, and publications
to enable future data reuse