Supporting Significant Properties in a Working Archive (SPs part 5), by Stephen Grace and Gareth Knight

Supporting SPs in a working archive: Software Tools

Upload: jisc-keepit-project

Post on 28-Nov-2014


DESCRIPTION

This presentation, the fifth of six parts on the practical analysis of significant properties of digital objects, seeks out tools capable of extracting and maintaining properties for an archival-scale number of objects. The presentation was given as part of module 3 of a 5-module course on digital preservation tools for repository managers, presented by the JISC KeepIt project. For more on this and other presentations in this course look for the tag 'KeepIt course' in the project blog http://blogs.ecs.soton.ac.uk/keepit/

TRANSCRIPT

Page 1: Supporting Significant Properties in a Working Archive (SPs part 5), by Stephen Grace and Gareth Knight

Supporting SPs in a working archive: Software Tools

Page 2

Challenge

Reality: It is infeasible to maintain a large number of objects manually. Software is required that can extract and maintain SPs for large numbers of objects.

Requirements:

1. Object analysis tools
• Support requisite formats
• Identify all/some SPs
• Support batch analysis
• Ideally well supported and documented

2. Description schemas to record SPs
• Flexible
• Machine and format independent

3. Conversion/emulation tools capable of maintaining SPs
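As a rough illustration of requirements 1 and 2, the sketch below walks a directory in batch and records a few basic properties of each file as JSON. The choice of JSON, the field names and the helper names (describe, batch_describe) are assumptions made for this example and are not taken from the presentation; a real archive would record far richer properties in an agreed schema.

```python
import json
import tempfile
from pathlib import Path


def describe(path: Path) -> dict:
    """Record a few basic properties of one file as a plain dictionary."""
    return {
        "name": path.name,
        "size_bytes": path.stat().st_size,
        "extension": path.suffix.lower(),
    }


def batch_describe(root: Path) -> list:
    """Walk a directory tree and describe every regular file found in it."""
    return [describe(p) for p in sorted(root.rglob("*")) if p.is_file()]


# Demonstration on a throwaway directory holding two small files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"hello")
    (root / "b.TIF").write_bytes(b"II*\x00")
    records = batch_describe(root)
    # JSON is used here only as one example of a machine-independent encoding.
    report = json.dumps(records, indent=2)
    print(report)
```

Because each record is a plain dictionary, the same output could be serialised to XML or another schema without changing the analysis step.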

Page 3

Format identification

• File identification through magic number and ‘light touch’ scan of encoding structure
• Recognise 100s (potentially 1000s) of formats
• Provide basic encoding info, but not detailed structure
• Examples:
  • file(1): Free version created in 1986 and available for all operating systems. http://gnuwin32.sourceforge.net/packages/file.htm (Windows)
  • DROID: Java app developed by TNA. Integration with PRONOM. Format ID and assignment of a PUID, which can be linked to preservation planning. http://droid.sourceforge.net/
  • FFIdent: Java library to ID and extract basic information. Recognises 27 encoding formats using header information (magic number and common structural information)
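Magic number identification can be sketched in a few lines. The example below is not how file(1), DROID or FFIdent are implemented; it is a minimal illustration that assumes a tiny, hand-written signature table (SIGNATURES) and a hypothetical identify function that only inspects the first bytes of an object.

```python
# Deliberately tiny, hand-written signature table: (leading bytes, format label).
# Real tools hold hundreds of signatures and also inspect internal structure.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"GIF87a", "GIF image"),
    (b"GIF89a", "GIF image"),
    (b"%PDF-", "PDF document"),
    (b"II*\x00", "TIFF image (little-endian)"),
    (b"MM\x00*", "TIFF image (big-endian)"),
    (b"\xff\xd8\xff", "JPEG image"),
]


def identify(data: bytes) -> str:
    """Return the label of the first signature that matches the start of data."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


print(identify(b"%PDF-1.4\n"))
print(identify(b"II*\x00\x08\x00\x00\x00"))
print(identify(b"plain text"))
```

A match on the leading bytes says nothing about whether the rest of the file is valid, which is why identification is treated separately from detailed analysis.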


Page 5

Detailed Analysis

Perform detailed analysis of internal structure of one or more files.

• Email:
  • Aperture: Java framework able to decode structured text and convert to other formats
  • ReadPST: Open source tool for processing Outlook PSTs
  • XENA: Java tool developed by NAA
• Audio:
  • MP3Info: technical info viewer and ID3 1.x tag editor that supports the MP3 file format
  • SoX/SOXI (Sound eXchange): extracts descriptive metadata and technical info
  • MetaFlac: Extractor tool for FLAC audio
• Images:
  • TiffInfo
  • ImageMagick
  • JHOVE

See InSPECT Testing Reports available at http://www.significantproperties.org.uk/ for further info on these tools
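To give a feel for the kind of technical information such tools report for audio, the sketch below reads a few properties from a WAV file using Python's standard wave module. It is not a substitute for any tool named above; the helper names (make_silent_wav, wav_properties) and the property names are assumptions for this example, and the test file is generated in memory rather than taken from a real collection.

```python
import io
import wave


def make_silent_wav(seconds: float, rate: int = 8000) -> bytes:
    """Build a mono, 16-bit WAV file of silence entirely in memory."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))
    return buffer.getvalue()


def wav_properties(data: bytes) -> dict:
    """Extract basic technical properties from WAV data."""
    with wave.open(io.BytesIO(data), "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "bit_depth": w.getsampwidth() * 8,
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


props = wav_properties(make_silent_wav(0.5))
print(props)
```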

Page 6

JHOVE 1/2

JHOVE (http://hul.harvard.edu/jhove/)
• Format-specific digital object validation API written in Java
• Functionality: Format identification, format validation, format characterisation
• Supports: AIFF, ASCII, Bytestream, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, UTF-8, WAV, and XML

JHOVE2 (https://confluence.ucop.edu/display/JHOVE2Info/Home)
• Supports: JPEG 2000, PDF, SGML, Shapefile, TIFF, ASCII and UTF-8 encoded text, WAVE, XML, ICC color profile
• Functionality: Format identification, validation, feature extraction and policy-based assessment
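The difference between identification and validation can be illustrated with a small sketch. This is not JHOVE's API or output format; it assumes two simple checks written with the Python standard library (whether bytes decode as UTF-8, and whether they parse as well-formed XML) and a hypothetical assess function that combines them. As a simplification, XML is only checked when the bytes are valid UTF-8.

```python
import xml.etree.ElementTree as ET


def check_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def check_xml(data: bytes) -> bool:
    """True if the bytes parse as well-formed XML (no schema validation)."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def assess(data: bytes) -> dict:
    """Combine the checks into one small report for a single object."""
    utf8 = check_utf8(data)
    return {
        "utf8_well_formed": utf8,
        "xml_well_formed": utf8 and check_xml(data),
        "size_bytes": len(data),
    }


print(assess(b"<a><b/></a>"))
print(assess(b"<a><b></a>"))
print(assess(b"\xff\xfe"))
```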

Page 7

JHOVE Demo

Page 8

XCL (eXtensible Characterization Language)

• Content extraction
  • Extracts content and technical properties through use of XCEL, saved as XCDL
• Format support:
  • PNG, TIFF, GIF, BMP, JPEG, JP2, PBM, PCD, PCX, PICT, PPM, PSD, SVG, TGA, XBM and XPM, MS DOC, DocX, PDF
• Content comparison
  • Compare 2 objects, e.g. TIFF & PNG, PDF & Doc

Page 9

XCL Extract & compare

[Diagram. Inputs: Object A, Object B, Format A XCEL, Format B XCEL. Processes: Conversion, Extractor, Comparator. Outputs: Object A XCDL, Object B XCDL.]
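The comparison step can be pictured as checking two sets of extracted properties against each other. The sketch below does not use XCDL or the XCL comparator; it assumes properties have already been extracted into plain dictionaries, and the property names, values and the compare function are hypothetical.

```python
def compare(props_a: dict, props_b: dict, tolerance: float = 0.0) -> dict:
    """Compare two property sets and report each property's status."""
    report = {}
    for name in sorted(set(props_a) | set(props_b)):
        if name not in props_a or name not in props_b:
            # The property was extracted from only one of the two objects.
            report[name] = "missing"
            continue
        a, b = props_a[name], props_b[name]
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            # Numeric properties may differ by up to the given tolerance.
            same = abs(a - b) <= tolerance
        else:
            same = a == b
        report[name] = "equal" if same else "different"
    return report


# Hypothetical properties for a TIFF original and its PNG conversion.
tiff = {"width": 640, "height": 480, "bits_per_sample": 8, "compression": "LZW"}
png = {"width": 640, "height": 480, "bits_per_sample": 8, "compression": "deflate"}

result = compare(tiff, png)
print(result)
```

Which differences matter is a policy decision: a change of compression scheme may be acceptable after migration, while a change of width or height usually is not.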

Page 10

XCL Demo

Page 11

Final thoughts

• Analysis tools useful, but have problems:
  • Limited format support
  • Variable access methods (GUI, CLI, APIs)
  • Inconsistent reporting process
  • Different metrics (e.g. text vs. number)
  • Metric variations (e.g. milliseconds)
• Partial solution: Wrap tools into services
  • PLANETS Interoperability Framework
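The idea of wrapping tools can be sketched as a thin layer that hides each tool's reporting style behind one common result type. The example below is not the PLANETS Interoperability Framework; tool_a and tool_b are hypothetical stand-ins for real tools, one reporting a number in milliseconds and the other reporting text in seconds, and the Report class is an assumed common format.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """Common result type: every wrapper reports duration in seconds."""
    tool: str
    duration_s: float


def tool_a(_path: str) -> dict:
    # Hypothetical tool that reports duration as a number of milliseconds.
    return {"length_ms": 1500}


def tool_b(_path: str) -> str:
    # Hypothetical tool that reports duration as text, in seconds.
    return "duration=1.5s"


def wrap_a(path: str) -> Report:
    """Convert tool_a's millisecond value into seconds."""
    return Report("tool_a", tool_a(path)["length_ms"] / 1000.0)


def wrap_b(path: str) -> Report:
    """Parse tool_b's text output into a numeric value in seconds."""
    value = tool_b(path).split("=", 1)[1].rstrip("s")
    return Report("tool_b", float(value))


# Callers see one interface, whichever tool sits behind it.
SERVICES = {"tool_a": wrap_a, "tool_b": wrap_b}

reports = [service("example.wav") for service in SERVICES.values()]
for r in reports:
    print(r)
```

Once both wrappers return the same type and unit, their results can be compared or stored without knowing which tool produced them.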