Digital Curation: Curation Micro-services approach to building repositories

Mark Phillips, UNT Libraries, November 8, 2010


Digital Curation: Thoughts and Examples from the University of North Texas Libraries

Digital Curation

Digital curation is the set of policies and practices focused on maintaining and adding value to trusted digital content for use now and into the indefinite future. Curation encompasses preservation and access, and can be applied to the humanities, social sciences, and sciences.

- [...] is the goal; it is like the finish line in a relay: you hand the responsibility off to various actors, with the overarching goal that the content is viable at a later point in time.
- Focus on doing the smartest thing you can do right now.
- Smartest does not equal best.
- As in many areas, the perfect is the enemy of the good.

Diagram labels: Digital Curation, Digital Stewardship, Digital Preservation, Access, Preservation, Maintain, Add value, Cross discipline

Curation Micro-Services

- California Digital Library
- Methodology for building infrastructure to support curation and preservation
- Thinking about the components and interactions in a repository as a set of smaller services
- Loosely coupled services
- Reaction to large monolithic repository systems

- Unix philosophy for system design
- Output of one service is the input for another, yet-to-be-created service
- Swap out pieces as needed
- Focus on simple tools that do one thing
- Often referred to as building blocks or Legos

At this time it isn't exactly clear what is and what isn't a Curation Micro-Service

- Kind of sounds like a Web service
- Or any other service, for that matter
- This really hasn't been answered in the community

CDL services

- Identity Service
- Storage Service
- Fixity Service
- Replication Service
- Inventory Service
- Characterization Service
- Ingest Service
- Index Service
- Search Service
- Transformation Service
- Notification Service
- Annotation Service

Some example components

- Anvl
- Namaste
- Pairtree
- BagIt
- CAN
- D-Flat
- ReDD
- Checkm
- Cutie
- ERC

Pairtree

- Filesystem hierarchy for holding objects
- Identifier strings mapped to object directory
- Two characters at a time

abcd -> ab/cd/
abcdefg -> ab/cd/ef/g/
12-986xy4 -> 12/-9/86/xy/4/

Object folder at the end of the mapping
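The two-characters-at-a-time mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the reference implementation: the function name `pairtree_path` and the `shorty_length` parameter are assumptions made for this example, and the sketch skips any character escaping that a full Pairtree implementation would need for identifiers containing characters that are unsafe in file names.

```python
def pairtree_path(identifier: str, shorty_length: int = 2) -> str:
    """Map an identifier string to a Pairtree-style directory path.

    Hypothetical helper for illustration: it only splits the identifier
    into fixed-size pieces and does no character escaping.
    """
    if not identifier:
        raise ValueError("identifier must not be empty")
    # Cut the identifier into pieces of `shorty_length` characters;
    # the last piece may be shorter (e.g. "abcdefg" ends in "g").
    pieces = [
        identifier[start:start + shorty_length]
        for start in range(0, len(identifier), shorty_length)
    ]
    # Each piece becomes one directory level.
    return "/".join(pieces) + "/"


if __name__ == "__main__":
    for ident in ("abcd", "abcdefg", "12-986xy4"):
        print(ident, "->", pairtree_path(ident))
```

Run as a script, this reproduces the three mappings listed above. The object folder itself would then be created inside the final directory of the returned path.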

Full Example

current_directory/
|   pairtree_version0_1        [which version of pairtree]
|    ( This directory conforms to Pairtree Version 0.1. Updated spec: )
|    ( )
|
|   pairtree_prefix
|    ( )
|
\--- pairtree_root/
     |--- aa/
     |    |--- cd/
     |    |    |--- foo/
     |    |    |    |   README.txt
     |    |    |    |   thumbnail.gif
     |    |    ...
     |    |--- ab/ ...
     |    |--- af/ ...
     |    |--- ag/ ...
     |    ...
     |--- ab/ ...
     ...
     \--- zz/ ...
          | ...

Namaste

- NAMe AS TExt
- File naming convention
- Primitive directory-level metadata tags exposed directly via filenames
- Answers the following question: What kind of directory is this?

Examples

0=bagit_0.96
0=untl_sip_1.0
0=untl_aip_1.0
0=untl_acp_1.0

Building a repository
UNT Libraries

- Two separate systems with similar components
- Access system = Aubrey
- Preservation system = Coda
- Built as a set of services

UNT and micro-services

- Modular
- Build out in stages
- Presentation System
- Preservation System
- Other services as we need them
- Replace in the future as needed
- Easy to implement, easy to discard

Identity Service

- Archival Resource Keys (ARK) for identifiers
- Number Server for minting names for objects
- Implemented as a Web service
- Query a URL and get a new unique name: metapth12604
- Append that to UNT's NAAN: ark:/67531/metapth12604
- Currently 5 namespaces for identifiers

Vocabulary Service

- Simple system for providing canonical versions of names
- Unique identifiers for each vocabulary term
- Provided as Linked Data in RDF/XML
- Other serializations: legacy XML format, JSON, Python object
- Easy to integrate into code
- Promotes reuse of vocabularies

Storage Service

- Provide a consistent way of requesting an item
- Use HTTP for communication
- Read-only currently
- Makes use of public specifications: CAN, Pairtree, BagIt
- Exposed with Apache

Storage Service Example

For a known identifier and a known storage service: coda1gel on coda-005

|-- 0=can_0.10
|-- admin
|-- can-info.txt
|-- log
`-- store
    |-- pairtree_index
    |-- pairtree_prefix
    |-- pairtree_root
    |   `-- co
    |       `-- da
    |           |-- 1g
    |           |   |-- el
    |           |   |   `-- coda1gel
    |           |   |       |-- 0=untl_aip_1.0
    |           |   |       |-- bag-info.txt
    |           |   |       |-- bagit.txt
    |           |   |       |--
    |           |   |       |-- data
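The listing above carries two Namaste tags, `0=can_0.10` at the top level and `0=untl_aip_1.0` inside the object folder. A small sketch of how a script might read such tags to answer "what kind of directory is this?" follows. The function names `parse_namaste` and `directory_type` are hypothetical, and the sketch assumes a tag is a single digit, an equals sign, and a value, as in the examples shown in this deck.

```python
import re
from typing import Iterable, Optional, Tuple

# Assumed tag shape, based on the examples in the slides: "<digit>=<value>".
NAMASTE_RE = re.compile(r"^(\d)=(.+)$")


def parse_namaste(filename: str) -> Optional[Tuple[int, str]]:
    """Return (tag_number, value) for a Namaste-style file name, else None."""
    match = NAMASTE_RE.match(filename)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def directory_type(filenames: Iterable[str]) -> Optional[str]:
    """Return the value of the first '0=' tag among the given file names."""
    for name in filenames:
        parsed = parse_namaste(name)
        if parsed is not None and parsed[0] == 0:
            return parsed[1]
    return None


if __name__ == "__main__":
    top_level = ["0=can_0.10", "admin", "can-info.txt", "log", "store"]
    print(directory_type(top_level))
```

In practice the file names would come from something like `os.listdir(path)`; passing a plain list keeps the sketch independent of any particular storage layout.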

Storage Service Example

- Proxy for abstracting which node
- We expected to never have all of our data in one place
- Shifts the problem from infrastructure/storage to a software problem
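One way a proxy like this could find an object is to probe each content node in turn with an HTTP HEAD request and return the first node that answers. The sketch below illustrates that idea only; it is not Coda's code. The names `head_ok` and `locate_object`, the node URLs, and the URL layout (`<node>/<identifier>/`) are all assumptions made for this example.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def head_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP HEAD request to `url` gets a 2xx response."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Unreachable node or non-2xx answer: treat as "not here".
        return False


def locate_object(
    identifier: str,
    nodes: Iterable[str],
    probe: Callable[[str], bool] = head_ok,
) -> Optional[str]:
    """Return the URL of the first node that reports having the object."""
    for node in nodes:
        # Hypothetical URL layout; a real node may expose a different path.
        url = f"{node.rstrip('/')}/{identifier}/"
        if probe(url):
            return url
    return None
```

Passing the probe as a parameter keeps the lookup logic testable without a network, for example `locate_object("coda1gel", nodes, probe=lambda url: url in known_urls)`.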

Storage Service

- Coda repository application has a list of active content nodes
- Coda queries each content node for the desired object (HTTP HEAD requests)
- Primary and secondary content nodes are usable for increased fault tolerance
- Coda streams content to the end user to allow very large files to be transferred

Replication Service

- Software-neutral content replication
- Master nodes in Library server room
- Secondary nodes at Library Annex server room
- Coda instance at each location
- Different number of content nodes: 6 vs 3 currently
- Different content node sizes: 9 TB vs 25 TB
- Need to balance content across content nodes

Replication Service

- Series of conventions for making content available for replication
- Three requirements:
  - Provide a list of objects you want to replicate
  - Point to a manifest defining all files of an object
  - Provide a way to validate an object when replicated

Replication Service: Coda Implementation

- RESTful replication service
- Components:
  - Replication queue: queue of objects to replicate
  - Collector: adds objects to the replication queue
  - Harvester: queries the queue for objects to harvest
- Coda Metadata Store
- As content is replicated, a validation and replication event is logged centrally.
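The second and third replication requirements (a manifest of all files, and a way to validate the replicated copy) can be illustrated with a checksum comparison. This is a sketch under stated assumptions, not the Coda harvester: the manifest is assumed to be a mapping of relative file path to hex digest, SHA-256 is an arbitrary choice of algorithm, and the names `file_digest` and `validate_replica` are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, algorithm: str = "sha256") -> str:
    """Compute a hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_replica(
    object_dir: Path,
    manifest: Dict[str, str],
    algorithm: str = "sha256",
) -> List[str]:
    """Compare a replicated object against its manifest.

    Returns a list of problems; an empty list means every file named in
    the manifest is present and has the expected checksum.
    """
    problems: List[str] = []
    for relative_path, expected in manifest.items():
        candidate = object_dir / relative_path
        if not candidate.is_file():
            problems.append(f"missing: {relative_path}")
        elif file_digest(candidate, algorithm) != expected:
            problems.append(f"checksum mismatch: {relative_path}")
    return problems
```

A harvester built along these lines would log a validation event when the returned list is empty and report the listed problems otherwise.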

Event Service

- Based on the PREMIS Event Model
- RESTful interface for creating new events
- Provides an interface for creating and maintaining PREMIS Agents
- Collects and provides access to events important to the lifecycle of the object
- Currently set up to capture ingest, replication, fixityCheck and virusCheck events

Ingest Service

- A more complex workflow for accessioning content into the repository
- Uses BagIt as a packaging container
- Validation of content at each network or disk hop
- Sanity check after atomic moves
- Folder-based workflow with Python management scripts

Folder Workflow

pth_dropbox/
|-- 0.Staging/
|-- 1.ToAIP/
|-- 2.ToAIP-Error/
|-- 3.ToACP/
|-- 4.ToACP-Error/
|-- 5.ToArchive/
|-- 6.ToAubrey/
|-- 7.ToAubrey-Sorted/
|-- 8.ToAubrey-Sorted-Error/
|--
|-- -> /home/digitalprojects/coda/
|-- -> /home/digitalprojects/coda/
|-- -> /home/digitalprojects/coda/
`--

1.ToAIP

- Objects start in this directory, typically using rsync from local machines
- Full validation of Bag
- Check that Bag is a Submission Information Package (SIP)
- Check for [...] for processing instructions
- Request identifier from Number Server
- Create METS document from supplied files
- Create PREMIS record, JHOVE stream, File stream
- Move to 3.ToACP on success or 2.ToAIP-Error on failure

3.ToACP

- Check that Bag is an Archival Information Package (AIP)
- Check for [...] for processing instructions
- Process METS structure and create Web derivatives based on current practice
- Move AIP to 5.ToArchive on success; move AIP to 4.ToACP-Error on failure
- Move ACP to 6.ToAubrey on success

5.ToArchive / 6.ToAubrey

- Run bash script to rsync contents of 5.ToArchive over to current archival dropbox
- Run [...] to sort contents of 6.ToAubrey into odd and even folders, then upload to the appropriate content node on the delivery system
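The folder workflow amounts to a small state machine in which each stage processes a Bag and then moves it to a success folder or an error folder. The sketch below shows that pattern for the 1.ToAIP stage. It is an illustration only, not the UNT management scripts: the function name `advance` and the `process` callback are hypothetical, and any exception raised by `process` is treated as a failure.

```python
import shutil
from pathlib import Path
from typing import Callable


def advance(
    bag_dir: Path,
    dropbox: Path,
    process: Callable[[Path], None],
    success: str = "3.ToACP",
    failure: str = "2.ToAIP-Error",
) -> Path:
    """Run one workflow stage on a Bag, then move it to the next folder.

    `process` stands in for the stage's real work (validation, METS
    creation, and so on) and should raise an exception on failure.
    """
    try:
        process(bag_dir)
        target = dropbox / success
    except Exception:
        # Assumption for this sketch: any error routes to the error folder.
        target = dropbox / failure
    target.mkdir(parents=True, exist_ok=True)
    destination = target / bag_dir.name
    shutil.move(str(bag_dir), str(destination))
    return destination
```

The same function could serve the 3.ToACP stage by passing `success="5.ToArchive"` and `failure="4.ToACP-Error"`. Keeping the move in one place makes it straightforward to add a sanity check after each move.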

Ingest Service

Archival Information Package (AIP) is ingested into Coda in a very similar fashion. It has the following steps:

- Verify Bag
- Check bag is AIP
- Assign Coda identifier

Access Content Package (ACP) is moved to the Aubrey content delivery platform and made available in the following systems [...]

Statistics for UNT systems

Coda

- 27,552,721 files
- 139,062 objects
- 42.3 TB in use / 120 TB capacity

Aubrey

texashistory
- 125,721 objects
- 114,847 live
- 1,248,416 fileSets

digital.library
- 38,755 objects
- 38,451 live
- 2,253,031 fileSets