fido article v01

Upload: bram-van-der-werf

Post on 10-Apr-2018


  • 8/8/2019 Fido Article v01

    Identifying file formats: taking a closer look at Pronom and Droid

    Adam Farquhar, 10-2010

    Pronom and Droid, developed primarily at the National Archives (TNA) of the United Kingdom, have been a key contribution to the digital preservation community. Pronom is a registry of information about file formats. The TNA provides access to the Pronom registry on-line at http://www.nationalarchives.gov.uk/PRONOM and maintains the information. Droid is a software application that uses some of the file format information to identify the type of specific digital objects. Droid is available on-line through SourceForge at http://droid.sourceforge.net/ and is managed as an open source project.

    In October, I spent some time looking closely at both Pronom and Droid to get a better understanding of them and to evaluate ways that they could be improved. In this series of blog posts, I'll be reporting on the results of this investigation, which brought several surprises along with it.

    The TNA have built a lovely web interface to interact with Pronom, whose contents are stored in a rather complex database. This is a great way to look at the information they have on a specific file format. You can search for the format you want and read through all of the information. As of 22-Oct-2010, Pronom has information about 731 file formats.

    The most important benefit that Pronom provides is to give each of these formats a persistent unique identifier. This is a truly important contribution. There is no other registry in the world that provides persistent unique identifiers to digital object formats without regard to origin at an appropriate level of detail for digital preservation. We can contrast Pronom and Droid with other initiatives. IANA manages the mimetype names. Mimetype is heavily used by browsers and email applications to decide how to display files that are downloaded, for example. The mimetype is, however, very coarse-grained. For example, there is only one mimetype for all versions of PDF, even though the different PDF versions have substantially different features as PDF has developed over the decades. Mimetype also covers a relatively small number of format types, and provides no method to recognise the formats. The Unix file command provides a fast and robust method to identify many types of digital objects, but it does not provide a persistent unique identifier for each type that it recognises. The person who implements the recognition routine for each type is free to print out whatever seems useful, and there is no guarantee that the output format will be persistent.
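The coarseness of mimetypes can be seen in a minimal Python sketch using the standard-library mimetypes module. The file names are hypothetical, and the module guesses from the name alone, so nothing about the PDF version can appear in the result.

```python
import mimetypes

# Hypothetical file names; guess_type inspects only the name, never the content.
names = ["report-pdf-1.1.pdf", "report-pdf-1.7.pdf"]

# Map each name to the guessed mimetype (first element of the returned tuple).
guesses = {name: mimetypes.guess_type(name)[0] for name in names}

for name, mime in guesses.items():
    print(name, "->", mime)
```

Both names map to the same mimetype, which is exactly the loss of detail described above.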

    For me and some other users of the information, however, Pronom has two major shortcomings. First, I don't have much need for a handsome web interface to access the information about a single format at a time! I need to use the information in an automated way and for as many formats as possible. Second, the coverage is much more limited than it looks at first. Most of the registered formats have only outline descriptions. This means that they provide a name and identifier, and some useful textual information about the format, but no method for recognising an instance of a format.

    The Droid 5.0 application provides a nice user interface that enables a user to point to a set of files or a directory, identify the likely file types, and explore them. The format is placed into a database and the user can run reports, filter, sort, and so on. One can also export the results as a CSV file that can be imported into a spreadsheet application for further analysis. In addition, there is a command-line version of the tools to support automated processes.

    In order to identify file formats, Droid uses a Signature File. This is an XML file that contains a substantial subset of the information in Pronom. It contains an element for each of the file formats known to Pronom.

    I am particularly interested in how Droid recognises specific file formats. At the British Library, we need this to be fairly efficient, accurate, and comprehensive.

    The Droid Signature File was the starting point for my exploration. Both it and the underlying Pronom XML (more on this later) are very clearly documented in http://www.nationalarchives.gov.uk/aboutapps/fileformat/pdf/automatic_format_identification.pdf. This important paper lays out the signature language, the Droid algorithms, and more with considerable precision.

    The Signature file includes several key pieces of information in addition to the format name and identifier. First, it includes the typical file extensions for the format. For example, PDF files typically end with a .pdf extension. Pronom calls these "external signatures". Second, it includes patterns that can be used to recognise a file format. For example, a PDF file starts with %PDF and ends with %%EOF. Pronom calls these "internal signatures". Third, it includes some relationships between formats. For example, PDF is a supertype of PDF 1.1, 1.2, and so on. This means that any object that is an instance of the PDF 1.1 format is also an instance of PDF.
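The three kinds of information can be sketched in a few lines of Python. This is only an illustration, not how Droid works internally: the function names, the 1024-byte tail window, and the format labels in the supertype table are all assumptions made for the example.

```python
from pathlib import Path

def has_pdf_extension(path: str) -> bool:
    """External signature: the file name ends with .pdf (case-insensitive)."""
    return Path(path).suffix.lower() == ".pdf"

def looks_like_pdf(data: bytes, tail_window: int = 1024) -> bool:
    """Internal signature: %PDF at the start and %%EOF near the end.

    tail_window is an assumed tolerance for trailing newlines or padding.
    """
    return data.startswith(b"%PDF") and b"%%EOF" in data[-tail_window:]

# Hypothetical labels standing in for real Pronom identifiers.
SUPERTYPE = {"PDF 1.1": "PDF", "PDF 1.2": "PDF"}

def formats_for(label: str) -> list:
    """Return the format label followed by all of its supertypes."""
    result = [label]
    while label in SUPERTYPE:
        label = SUPERTYPE[label]
        result.append(label)
    return result

sample = b"%PDF-1.1\n(body omitted)\n%%EOF\n"
print(has_pdf_extension("Report.PDF"), looks_like_pdf(sample), formats_for("PDF 1.1"))
```

The extension check needs only the name, while the internal check needs the bytes, which is why the two can disagree on a renamed file.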

    When I looked more closely at the Signature file, I had two surprises. First, the Signature file included patterns to recognise only 208 of the formats, less than a third. This means that the effective coverage of Droid is much smaller than I had first expected. Second, I couldn't make any sense of the patterns! I was expecting to see something like:
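A coverage count of this kind could be made with a short Python script. The sketch below runs against a tiny inline stand-in rather than the real Signature File; the element and attribute names (SignatureFile, FileFormat, InternalSignatureID) are assumptions for illustration and would need to be checked against the actual schema and its XML namespace.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in document; names are assumed, not the real Droid schema.
SAMPLE = """
<SignatureFile>
  <FileFormat ID="1" Name="Format A"><InternalSignatureID>10</InternalSignatureID></FileFormat>
  <FileFormat ID="2" Name="Format B"/>
  <FileFormat ID="3" Name="Format C"/>
</SignatureFile>
"""

def coverage(xml_text: str):
    """Return (formats with an internal signature, total formats)."""
    root = ET.fromstring(xml_text)
    formats = root.findall("FileFormat")
    with_signature = [f for f in formats if f.find("InternalSignatureID") is not None]
    return len(with_signature), len(formats)

covered, total = coverage(SAMPLE)
print(f"{covered} of {total} sample formats have an internal signature")

# The figures quoted in the article: 208 of 731 formats.
print(f"article figures: {208 / 731:.1%}")
```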

    %PDF-1.0

    Instead, I encountered:

    255044462D312E309123

    45678

    2525454F466

    1234

    This was substantially more complicated than I had anticipated and it sent me back to the definition of the Droid Signature language! It turns out that the internal signatures in this XML document are not the patterns as held in Pronom. Instead, they are the result of compiling those patterns into a form that can be used for efficient pattern matching. You have to go back to Pronom to find the original pattern.
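To give a feel for what "compiling" a pattern can mean, here is a minimal Python sketch of a skip table of the kind used by Boyer-Moore-style searching (the Sunday "quick search" variant). This is a generic technique chosen for illustration; that Droid's compiled form resembles it is an assumption, and the function names are hypothetical.

```python
def compile_shift_table(pattern: bytes):
    """Precompute how far the search window may jump.

    For each byte in the pattern, the shift is its distance from the end
    of the pattern plus one; any other byte gets len(pattern) + 1.
    """
    default_shift = len(pattern) + 1
    shifts = {}
    for index, byte in enumerate(pattern):
        # Later occurrences overwrite earlier ones, keeping the smallest shift.
        shifts[byte] = len(pattern) - index
    return default_shift, shifts

def find(pattern: bytes, data: bytes) -> int:
    """Return the first offset of pattern in data, or -1 if absent."""
    default_shift, shifts = compile_shift_table(pattern)
    m, n = len(pattern), len(data)
    pos = 0
    while pos + m <= n:
        if data[pos:pos + m] == pattern:
            return pos
        if pos + m >= n:
            break  # no byte after the window to decide the next jump
        # Jump according to the byte just past the current window.
        pos += shifts.get(data[pos + m], default_shift)
    return -1

default_shift, shifts = compile_shift_table(bytes.fromhex("255044462D312E30"))
print(default_shift, shifts)
print(find(b"%PDF-1.0", b"xx%PDF-1.0 rest"))
```

The table is built once per pattern and reused for every file, which is the sense in which a compiled signature trades readability for speed.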

    In this very simple case, the pattern is:

    255044462D312E30

    Again, this is not quite what I was expecting. I needed to go back to the documentation again to learn that this is a sequence of bytes coded as pairs of hex digits:

    25 50 44 46 2D 31 2E 30

    We can use a table of character encodings to recognise this as:

    % P D F - 1 . 0

    This is great, but the InternalSignature specification seems like a very complicated way of saying "look for %PDF-1.0 at the start of the file".
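The decoding step does not need a lookup table; a few lines of Python do the same job. This is a minimal sketch, and the helper name starts_with_pattern is hypothetical.

```python
hex_pattern = "255044462D312E30"

# Each pair of hex digits becomes one byte.
pattern = bytes.fromhex(hex_pattern)

# These particular bytes are all printable ASCII, so they decode to text.
text = pattern.decode("ascii")
print(text)

def starts_with_pattern(data: bytes, hex_pattern: str) -> bool:
    """Check whether data begins with the bytes named by a hex-coded pattern."""
    return data.startswith(bytes.fromhex(hex_pattern))

print(starts_with_pattern(b"%PDF-1.0\n(rest of file)", hex_pattern))
```

Hex coding earns its keep for formats whose signatures are not printable text, where a plain string could not express the bytes.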

    After reviewing the Droid signature file, I was convinced that I needed to go back to the source in Pronom. The problem was how to extract all of the signatures in a form that I could work with. While it is not obvious how to accomplish this, the engineers who developed Pronom have made it possible. In the next post, I'll show how to get every bit of information out of Pronom in XML format and we'll take a closer look at some of the signatures.