Unified Digital Format Registry: a semantic registry for digital preservation
UDFR: A Semantic Registry for Format Representation Information
Lisa Dawn Colvin, Abhishek Salve, Stephen Abrams
UC Curation Center, California Digital Library
Digital Library Federation Forum, Baltimore, October 31-November 2, 2011
Outline
• What
• Why
• How
• When
Why formats?
“Format” is the dividing line between bits and information
ffd8ffe000104a46494600010201008300830000ffed0fb050686f746f73686f7020332e30003842494d03e90a5072696e7420496e666f000000007800000000004800480000000002f40240ffeeffee030602520347052803fc00020000004800480000000002d802280001000000640000000100030...
SOI, APP0 JFIF 1.2, APP13 IPTC, APP2 ICC, DQT, SOF0 183x512, DRI, DHT, SOS, ECS0, RST0, ECS1, RST1, ECS2, ...
Syntax Semantics
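The step from raw bits to named structure can be sketched in a few lines. This is a minimal illustration, not part of UDFR: it reads only the first 20 bytes of the hex dump above (the SOI marker and the APP0/JFIF segment) and names the fields they contain. The function name `describe_jfif_header` and the returned dictionary keys are assumptions made for this example.

```python
import struct

# First 20 bytes of the hex dump on the slide: SOI marker plus one APP0 segment.
SLIDE_PREFIX = bytes.fromhex("ffd8ffe000104a46494600010201008300830000")

# Density unit codes used in a JFIF APP0 segment.
UNITS = {0: "aspect ratio only", 1: "dots per inch", 2: "dots per cm"}


def describe_jfif_header(data: bytes) -> dict:
    """Interpret the leading SOI + APP0 (JFIF) bytes of a JPEG stream."""
    if data[0:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker (ff d8)")
    if data[2:4] != b"\xff\xe0":
        raise ValueError("expected APP0 marker (ff e0) after SOI")
    # The segment length is big-endian and counts its own two bytes.
    (length,) = struct.unpack(">H", data[4:6])
    if data[6:11] != b"JFIF\x00":
        raise ValueError("APP0 segment does not carry the JFIF identifier")
    # Version (major, minor), density units, horizontal and vertical density.
    major, minor, units, x_density, y_density = struct.unpack(">BBBHH", data[11:18])
    return {
        "segment_length": length,
        "version": f"{major}.{minor:02d}",
        "units": UNITS.get(units, "unknown"),
        "x_density": x_density,
        "y_density": y_density,
    }


info = describe_jfif_header(SLIDE_PREFIX)
print(info)
```

Without knowledge of the format, the same 20 bytes are only a sequence of numbers; the field names and their meanings are exactly the representation information a registry records.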
Why formats?
There are many necessary preservation activities that can be usefully performed on bits qua bits
But to preserve information you must act on formatted bits and know what those formats mean
• Preservation of syntax and semantics
Unified Digital Format Registry
“A reliable, publicly accessible, and sustainable knowledge base of file format representation information for use by the digital preservation community”
• “Unification” of the function and holdings of PRONOM and GDFR
http://www.nationalarchives.gov.uk/PRONOM
http://gdfr.info/
• Open source platform / GPL
• Semantic wiki
• Funded by the Library of Congress
Timeline
PRONOM – National Archives [UK], 2002
http://www.nationalarchives.gov.uk/PRONOM
“ready access to reliable technical information about the nature of electronic records”
JHOVE – Harvard, 2003
http://hul.harvard.edu/jhove
“digital object validation and characterization”
GDFR – Harvard/OCLC, 2006
http://gdfr.info/
“a distributed and replicated registry of format information populated and vetted by experts and enthusiasts world-wide”
Timeline
UDFR – Ad hoc stakeholder community, 2009
• Resolve PRONOM IPR issues and develop a community-supported open source solution
• Advance beyond legacy RDBMS and XML database technology
UDFR – CDL, January 2011
http://udfr.org/
“a semantic registry for digital preservation”
• Stakeholder meeting, April 2011
• Beta release, November 2011
• Production release, January 2012
Representation information
What you need to know about something in order to exploit that thing meaningfully [OAIS/ISO 14721]
Information that lets you answer important preservation questions
• What format is it?
• What are its significant properties?
• Is it valid?
• Is it at risk?
• How can I render/play/read it?
• What can it be transformed into?
• And how?
Why semantic?
Everyone wants to say something about everything
• The semantic web lets anyone say anything about anything
• Understandable to both people and machines
Data modeling
[Diagram: class model of the registry]
Classes: Abstract Base, Abstract Product, Abstract Format, File Format, Character Encoding, Compression Algorithm, Media, Hardware, Software, Document, File, Agent, IPR, Controlled Vocabulary …, Holding, Process, Abstract Signature, External Signature, Internal Signature, Digest, Assessment, Grammar
Relationships: specification, reference, file, holder, owner, creator, maintainer, ipr, embodies, product, input / output, dependency, signature, digest, grammar, assessment
Provenance
“Trust, but verify”
• Complete change history at the assertion level, including
– Who made the assertion, and when?
– Confidence based on personal and institutional reputation
• Imprimatur by technically knowledgeable reviewers
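One way to picture change history at the assertion level is an append-only log in which every statement carries its author, timestamp, and optional reviewer. The sketch below is purely illustrative and does not reflect how UDFR actually stores provenance; the class names `Assertion` and `AssertionLog`, the `ex:` identifiers, and the people named are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Assertion:
    """A single statement plus who made it, when, and who (if anyone) reviewed it."""
    subject: str
    predicate: str
    value: str
    author: str
    made_at: datetime
    reviewed_by: Optional[str] = None


class AssertionLog:
    """Append-only log: nothing is overwritten, so the full history survives."""

    def __init__(self) -> None:
        self._entries: List[Assertion] = []

    def record(self, assertion: Assertion) -> None:
        self._entries.append(assertion)

    def history(self, subject: str, predicate: str) -> List[Assertion]:
        """All assertions about one property of one subject, oldest first."""
        matching = [a for a in self._entries
                    if a.subject == subject and a.predicate == predicate]
        return sorted(matching, key=lambda a: a.made_at)

    def latest_reviewed(self, subject: str, predicate: str) -> Optional[Assertion]:
        """Most recent assertion that carries a reviewer's imprimatur, if any."""
        reviewed = [a for a in self.history(subject, predicate)
                    if a.reviewed_by is not None]
        return reviewed[-1] if reviewed else None


# Hypothetical identifiers and contributors, for illustration only.
log = AssertionLog()
log.record(Assertion("ex:format-1", "ex:version", "1.01", "alice",
                     datetime(2011, 10, 1), reviewed_by="carol"))
log.record(Assertion("ex:format-1", "ex:version", "1.02", "bob",
                     datetime(2011, 11, 1)))
```

A consumer who wants only vetted information would call `latest_reviewed`, while one who wants the newest claim regardless of review would take the last item of `history`.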
Ontologies
Prefix   Namespace
udfrs    http://udfr.org/onto#
udfr     http://udfr.org/udfr/
dc       http://purl.org/dc/elements/1.1/
dcterms  http://purl.org/dc/terms/
foaf     http://xmlns.com/foaf/0.1/
owl      http://www.w3.org/2002/07/owl#
pronom   http://reference.data.gov.uk/technical-registry/
rdf      http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs     http://www.w3.org/2000/01/rdf-schema#
skos     http://www.w3.org/2004/02/skos/core#
xsd      http://www.w3.org/2001/XMLSchema#
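A prefix table like this one is used to abbreviate full IRIs: a prefixed name such as `skos:prefLabel` stands for the namespace followed by the local name. The sketch below shows that expansion for a subset of the prefixes listed; the function name `expand` and the local name `FileFormat` under `udfrs` are assumptions for illustration, not confirmed terms of the UDFR ontology.

```python
# Subset of the prefix table from the slide.
PREFIXES = {
    "udfrs": "http://udfr.org/onto#",
    "udfr": "http://udfr.org/udfr/",
    "dcterms": "http://purl.org/dc/terms/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}


def expand(prefixed_name: str) -> str:
    """Expand 'prefix:local' into a full IRI using the table above."""
    prefix, sep, local = prefixed_name.partition(":")
    if not sep:
        raise ValueError(f"not a prefixed name: {prefixed_name!r}")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix!r}")
    return PREFIXES[prefix] + local


print(expand("skos:prefLabel"))
print(expand("rdfs:label"))
# 'FileFormat' is a hypothetical local name used only to show the mechanics.
print(expand("udfrs:FileFormat"))
```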
Technology stack
OntoWiki
http://ontowiki.net/
Virtuoso, 4store
http://virtuoso.openlinksw.com/
Zend Framework
http://www.zend.com/
PHP
http://www.php.net/
Apache httpd
http://httpd.apache.org/
RDF
http://www.w3.org/RDF
JavaScript / CSS
HTTP / SPARQL
Erfurt / RDFauthor
http://aksw.org/Projects/Erfurt
https://github.com/AKSW/RDFauthor
Initial population
Export from PRONOM
• Working with TNA to identify appropriate subset
• Transform to cross-walk modeling differences
Licensing
Code is available under GPLv3
http://www.gnu.org/copyleft/gpl.html
• Hosted on Bitbucket
http://www.bitbucket.org/udfr
Data is contributed and available under CC-BY
http://creativecommons.org/licenses/by/3.0/
• Consistent with UK open government license applicable to PRONOM data
http://www.nationalarchives.gov.uk/doc/open-government-licence
Demo
Lessons learned
People with semantic experience are scarce
Too much time evaluating/prototyping potential technology choices
More difficulty than anticipated integrating disparate open source products
0.x software is often numbered that for a reason
Feature lists aren’t (always)
Lessons learned
Availability of a worldwide selection of products is a good thing (except when you don’t read German)
• Excellent support from AKSW/Universität Leipzig
Modeling differences
• RDF (non-)standards
VM deployment
• Disparate IT organizations supporting dev/prod instances
Next steps
Long-term governance and operational support
Technical maintenance and enhancement
Replication/synchronization
Building contributor and reviewer communities
For more information
UDFR
http://udfr.org/
http://bitbucket.org/udfr
PRONOM
http://www.nationalarchives.gov.uk/PRONOM
GDFR
http://gdfr.info/
OntoWiki
http://ontowiki.net/Projects/OntoWiki
Virtuoso
http://www.openlinksw.com/dataspace/dav/wiki/Main/VOSRDFWP
Agile Knowledge and Semantic Web (AKSW), Universität Leipzig
http://aksw.org/
UC3
http://www.cdlib.org/uc3
Stephen Abrams, Lisa Colvin, Patricia Cruse, Scott Fisher, Erik Hetzner, Greg Janée, John Kunze, Margaret Low, David Loy, Mark Reyes, Abhishek Salve, Tracy Seneca, Joan Starr, Carly Strasser, Marisa Strong, Adrian Turner, Perry Willett