Working With Digital Archives at the Harry Ransom Center

A Presentation About Processing the Digital Archives of British Playwright Arnold Wesker

Metadata and Digital Object Roundtable Society of American Archivists Annual Meeting 2007

Catherine Stollar Peters
New York State Archives

Background

Worked at Harry Ransom Center in Austin, Texas from 2004 to early 2007

Now work at the New York State Archives (Cultural Education Center, Albany, New York)

In January 2007 the Ransom Center was

• Processing collections with electronic records
• Developing policies and procedures for processing electronic records
• Evaluating options for a Trusted Digital Repository (TDR)
– At the School of Information at the University of Texas at Austin
– At the University Libraries at the University of Texas at Austin
– Or develop an institutional TDR
• Conducting a general electronic records survey and needs assessment (with a more thorough survey planned for the fall)

HRC DSpace at School of Information
https://pacer.ischool.utexas.edu/handle/2081/288

About the Case Study

In January 2007 at the School of Information

• Dr. Patricia Galloway was offering the Problems in Permanent Retention of Electronic Records course
• Dr. Galloway contacted the Ransom Center for potential support of group projects

School of Information Course

Three collections were processed by students during Spring 2007 semester

• Leon Uris Papers
– Lessons in digital archeology
– Limited migrated content
• John Crowley Papers
– Standard manual processing
• Arnold Wesker Papers
– Largely automated processing, migration, ingest procedures
– Fragile media
– Living author

Arnold Wesker

• British playwright and author
• Born in London in 1932
• The Four Seasons ran in March 2007 at Arcola Theatre
• Ransom Center maintains paper archives
• Works include
– As Much as I Dare (autobiography)
– Longitude (adaptation of Dava Sobel’s book)
– Groupie
– Chips with Everything

Automated Processing

Largely automated processing, migration and ingest procedures possible because

• One author
• Similar content/materials (works, correspondence, diaries, personal files)
• Mostly same format (Corel WordPerfect 5.0, 9.0 and Microsoft Word 97 and 2000)
• Easily migrated (to RTF)
• Well arranged
• Manageable number of files (5,000+)
• Readable disks (75 3.5-inch floppies and 1 zip disk)

Processing Issues

• Some files were password restricted
• Bank account numbers were included
• Encoded date fields would automatically update

Archival Theory Applied to Digital Materials

Acquisition
• Create a disk catalog with all pertinent metadata
• Copy to a processing computer drive

Appraisal
• Appraise for duplicates and restricted material

Arrangement
• Arrange material according to author’s original arrangement

Description
• Create a file catalog with the pertinent metadata
• Create and record checksums
• Extract metadata
• Transform metadata from NLNZ Schema to Dublin Core

Preservation
• Migrate all of the files to a more stable format, such as Rich Text Format
• Make physical copies of all the files onto new media
• Ingest the files into DSpace
• Ingest the project documentation

Reference
• Integration into paper-based finding aid

Disk Catalog

File Catalog

Appraise for Duplicates

• Files on zip disk contained some duplicates
• Developed rules for removing duplicates so that files with duplicate names but different content were not deleted automatically
• Erased duplicate files but recorded presence of duplicates in file catalog
• Used Zizasoft’s comparison software: zsCompare and zsDuplicate Hunter Standard 2.31
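
The distinction between duplicate names and duplicate files can be illustrated with a short sketch. This is not the Zizasoft tooling used in the project; it is a minimal Python example, assuming duplicates are defined as byte-identical content, and the function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root: Path) -> dict:
    """Group files under root by content hash.

    Only groups with more than one member are returned, so two files
    that share a name but differ in content are never reported.
    """
    groups = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[md5_of(path)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}
```

Each returned group could be written to the file catalog before any copy is erased, so the presence of the duplicates stays on record.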

Restricted Material

• Bank account numbers
– Investigate to see if the accounts were closed
• Password protected diary entries
– Remove password to migrate
– Place restrictions on access through DSpace instead of word processing software
– Paper copy already exists and is in restricted section of stacks

Checksums

• Command line utility automatically creates checksum
• Jacksum is one Java checksum utility
• Export results to spreadsheet
• Compare to MD5 hash created by DSpace
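
The checksum steps can be sketched in Python rather than with Jacksum. The example below is an illustration under stated assumptions: the CSV layout and the function names are hypothetical, and the repository's checksums are assumed to be available as a simple path-to-hash mapping.

```python
import csv
import hashlib
from pathlib import Path


def write_manifest(root: Path, csv_path: Path) -> dict:
    """Record the MD5 of every file under root in a spreadsheet-friendly CSV.

    Files are read whole, which is adequate for small word-processing files.
    """
    manifest = {
        p.relative_to(root).as_posix(): hashlib.md5(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    with csv_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["path", "md5"])
        writer.writerows(manifest.items())
    return manifest


def mismatches(local: dict, repository: dict) -> list:
    """Return paths whose checksum is missing from, or differs in, the repository list."""
    return sorted(p for p, h in local.items() if repository.get(p) != h)
```

An empty result from `mismatches` would indicate that every locally recorded hash matches the one reported after ingest.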

Migrate Text to More Stable Format

• Chose RTF because it is widely accessible by multiple readers and it retains formatting
– ODF is new and as yet untested
– TXT loses formatting
– Microsoft Word DOC and Corel WordPerfect WPD are proprietary and accessible by few readers
• Used ABC Text Converter to migrate files from DOC or WPD into RTF
– Used Perl script to add extensions to files to mitigate Wesker’s use of 3-digit extensions
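
The slides do not say how the Perl script chose which extension to add. One plausible approach, sketched here in Python as an assumption rather than a description of the project's script, is to inspect each file's leading bytes: WordPerfect documents begin with the bytes FF "WPC", and Word 97/2000 documents are stored in OLE2 containers that begin with D0 CF 11 E0 A1 B1 1A E1. The function names are hypothetical.

```python
from pathlib import Path
from typing import Optional

WORDPERFECT_MAGIC = b"\xffWPC"
# OLE2 containers also hold other Office formats, so ".doc" is a guess
# that is only reasonable for a collection known to be word-processing files.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"


def guess_extension(path: Path) -> Optional[str]:
    """Return ".wpd" or ".doc" based on the file's leading bytes, else None."""
    with path.open("rb") as handle:
        header = handle.read(8)
    if header.startswith(WORDPERFECT_MAGIC):
        return ".wpd"
    if header.startswith(OLE2_MAGIC):
        return ".doc"
    return None


def add_extension(path: Path) -> Path:
    """Append the guessed extension, keeping the original name intact.

    Unrecognised files, and files that already carry the extension,
    are returned unchanged.
    """
    ext = guess_extension(path)
    if ext is None or path.suffix.lower() == ext:
        return path
    target = path.with_name(path.name + ext)
    if target.exists():
        raise FileExistsError(target)
    path.rename(target)
    return target
```

Appending rather than replacing (for example `letter.001` becoming `letter.001.wpd`) preserves the author's original file name as evidence of his naming practice.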

Create Duplicate Physical Copy

• Save files to CD, DVD or hard drive for extra, short-term backup copy while processing (and before ingest into Institutional Repository)

Extract Metadata

National Library of New Zealand XML

National Library of New Zealand XML (cont.)

Dublin Core XML

Directory Arrangement for DSpace Bulk Ingest

Automated Processes

• Created Perl scripts to automate processing
– Modified Perl scripts from Queen’s University Library in Ontario, Canada
http://library.queensu.ca/webir/qspace-project/tutorials/qspace_bulk_upload.doc
– Metadata conversion script (from National Library of New Zealand Metadata Extraction Tool v 3.0)
– Script to move individual XML files into individual directories
– Script to create contents file for each directory
– Scripts to rename files for format transformation
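
The directory and contents-file steps correspond to the layout DSpace expects for batch import: one directory per item, holding the files, a `dublin_core.xml` metadata file, and a `contents` file listing the file names. The sketch below is a Python illustration of that layout, not the project's Perl scripts; the function names, the `item_NNN` naming, and the choice of unqualified Dublin Core elements are assumptions.

```python
import shutil
from pathlib import Path
from xml.etree import ElementTree as ET


def write_dublin_core(item_dir: Path, fields: dict) -> None:
    """Write dublin_core.xml with one unqualified <dcvalue> per field."""
    root = ET.Element("dublin_core")
    for element, value in fields.items():
        node = ET.SubElement(root, "dcvalue", {"element": element, "qualifier": "none"})
        node.text = value
    ET.ElementTree(root).write(
        str(item_dir / "dublin_core.xml"), encoding="utf-8", xml_declaration=True
    )


def build_item(archive: Path, index: int, source: Path, fields: dict) -> Path:
    """Create archive/item_NNN holding one file, its metadata, and a contents file."""
    item_dir = archive / f"item_{index:03d}"
    item_dir.mkdir(parents=True, exist_ok=False)
    shutil.copy2(source, item_dir / source.name)
    write_dublin_core(item_dir, fields)
    # The contents file lists one file name per line.
    (item_dir / "contents").write_text(source.name + "\n", encoding="utf-8")
    return item_dir
```

Calling `build_item` once per migrated file, with an incrementing index, would produce a directory tree ready for a bulk ingest run.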

Issues with Metadata Extraction

• Author unreliable
– Partially solved by adding code to Perl scripts to export standard author information
• No subject metadata
• Inaccurate dates
– Date created sometimes newer than date modified due to Windows file system
• Inaccurate titles
– First line in document
– Title from template
• Format problems when extensions are used as part of name field
• No recipient information (potential text mining project)
• Path name derived from location of file on processing computer, not the author’s original system
• Sometimes NLNZ Metadata Extractor v 3.0 processes files with default adapter instead of actual suitable adapter
• Dublin Core metadata is not robust enough for digital preservation needs
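
Two of these issues, the unreliable author field and the implausible dates, lend themselves to a simple post-processing pass. The following Python sketch is illustrative only: the record keys, the form of the standard author entry, and the function name are assumptions, not the fields produced by the NLNZ tool.

```python
from datetime import datetime

# Assumed form of the standard author entry; the project's actual value is not given.
STANDARD_AUTHOR = "Wesker, Arnold"


def clean_record(record: dict) -> dict:
    """Return a copy with the author standardised and suspect dates flagged.

    Expects ISO 8601 strings under the hypothetical keys "created" and "modified".
    """
    cleaned = dict(record)
    cleaned["author"] = STANDARD_AUTHOR
    created = datetime.fromisoformat(record["created"])
    modified = datetime.fromisoformat(record["modified"])
    # On Windows a copied file typically receives a new creation time but
    # keeps its old modification time, so created > modified suggests the
    # creation date reflects the copy, not the author's original file.
    cleaned["date_suspect"] = created > modified
    return cleaned
```

Flagging rather than correcting the date leaves the decision about which value to publish with the archivist.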

New Zealand XML Wrong Author

Dublin Core XML

Ingest

Created detailed ingest procedures based on
• Cornell’s eCommons@Cornell procedures as example
• DSpace instructions

Takeaways

• More automated tools

• Toolkit to aggregate tasks

• Better metadata extraction potential

• Support of more schemas

MetaTools: Investigating Metadata Generation Tools

• JISC funded grant project undertaken by the Arts and Humanities Data Service, King’s College London

• 18-month project, ends September 2008
• Project goals
– Develop a methodology for evaluating metadata generation tools
– Compare the quality of currently available metadata generation tools (including NLNZ Metadata Extractor, DROID, JHOVE)
– Develop, test and disseminate prototype web services that integrate metadata generation tools

Student Publication

Lorraine Dong, Megan Durden and Sarah Kim presented "Silicon Chips with Everything: Preserving Arnold Wesker’s Digital Manuscripts" at SSA 2007

https://pacer.ischool.utexas.edu/handle/2081/2322

(Look for their forthcoming publication)

Contact Information

Catherine Stollar Peters

New York State Archives
Cultural Education Center
Albany, New York
[email protected]

(518)486-7820