
Page 1:

Exploring the Applicability of Scientific Data Management Tools and Techniques on the Records Management Requirements for the National Archives and Records Administration (NARA)

Oct 01, 2002 – Sept. 30, 2003

Page 2:

Introduction

• The National Archives and Records Administration (NARA) has the duty to preserve the nation’s history through archival storage and management of federal records.

• By law, Federal records are:
– all books, papers, maps, photographs, machine-readable materials, or other documentary materials,
– regardless of physical form or characteristics,
– made or received by an agency of the U.S. Government under Federal law or in connection with the transaction of public business,
– and preserved or appropriate for preservation by that agency or its legitimate successor.

Page 3:

Electronic records critically challenge NARA (and many other archives, libraries, agencies, and businesses)

• the sheer volume of electronic records,
• their diversity and complexity,
• the rapidity of change in the information technologies used to create, store, and manage these records.

Page 4:

Preserving the nation’s history requires more than the simple archiving of electronic records

• It requires the capability to mine, generate knowledge from, and reorganize these records in response to public and government queries.
• It also requires integrating the capability to preserve knowledge embedded in these records, given the inevitable and frequent changes in technology.

Page 5:

Archival and retrieval systems

• Must be efficient and scalable to cope with multi-petabyte archives.
• Preservation of metadata is vital.
• Making large-scale collections accessible to diverse users requires:
– components that provide high-performance digital library services (such as indexing, clustering, browsing, querying, translation, and change management),
– as well as the means to rapidly deploy these components in configurations that meet the needs of a particular task or user group.

Page 6:

Six projects

• A study of file formats for long-term record archiving
• Automatic email classification
• Publishing, exploring, and mining heterogeneous distributed data
• Digital library component technology for large-scale archives
• Time series characterization of archival I/O behavior
• Performance analysis of archive data management and retrieval

Page 7:

A Study of File Formats for Long-Term Record Archiving

• PI: Mike Folk
• Investigate the suitability of scientific data formats and access methods for record archives.
• Look at HDF5 as an archival format for a variety of different kinds of records; possibilities include GIS and CAD (see the sketch after this list).
• Interface with SRB and an OAIS implementation of a sample collection.
• Prototype implementation in HDF5 of NARA records collections, to be identified by NARA.
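
To make the HDF5 idea concrete, here is a minimal sketch, assuming the h5py Python binding (the project does not name its tooling), of storing a record and its descriptive metadata together in one HDF5 file. All paths, names, and values are illustrative.

```python
# A hedged sketch: one scanned record page plus archival metadata in HDF5.
import numpy as np
import h5py

with h5py.File("sample_record.h5", "w") as f:
    # Store the record payload, e.g. a scanned page as a raster image
    # (zeros here stand in for real pixel data).
    page = f.create_dataset(
        "records/doc0001/page1",
        data=np.zeros((1024, 768), dtype="uint8"),
        compression="gzip",
    )
    # Attach archival metadata as HDF5 attributes, so the description
    # travels with the record independently of any surrounding system.
    page.attrs["agency"] = "Example Agency"    # hypothetical value
    page.attrs["record_group"] = "RG-000"      # hypothetical value
    page.attrs["date_received"] = "2002-10-01"
```

One attraction of this layout is that payload and metadata live in a single self-describing file, which is the property the study would need to evaluate for GIS and CAD records as well.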

Page 8:

Automatic Email Classification

• PI: Michael Welge
• There is much interest in automatic text classification (ATC), where automated learning techniques are used to categorize text documents into pre-defined discrete sets of topics.
• Automatic Email Classification (AEC) can be seen as a subtask of ATC.
• AEC differs from common ATC in many ways:
– e.g., sentences are ill-structured, knowledge is embedded in nondiscriminatory email fields, etc.
• We propose to focus on two main questions:
– What is the best machine learning technique for classifying email messages?
– Which attributes within an email message are important for classification?
• Support this with a series of experiments on several benchmarks and real-world data (a minimal example follows this list).
• In particular, we would like to experiment with a large real-world data set such as the Clinton White House email archive.
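
As a concrete illustration of the kind of AEC experiment proposed, here is a minimal sketch using scikit-learn as a stand-in toolkit (an assumption; the project does not specify one). The messages and labels are invented.

```python
# A hedged sketch: TF-IDF features over message bodies, Naive Bayes classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Please review the attached budget memo.",
    "Lunch on Friday?",
    "Draft policy statement for your comments.",
]
labels = ["official", "personal", "official"]

# The body is one candidate attribute; header fields (From, Subject, ...)
# could be appended as extra features to test which attributes help.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(emails, labels)
print(model.predict(["Schedule for the policy review meeting"]))
```

Swapping MultinomialNB for an SVM or a decision tree in the same pipeline is one way to run the "which technique is best" comparison on a fixed feature set.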

Page 9:

Publishing, Exploring, and Mining Heterogeneous Distributed Data

• PI: Michael Welge
• Look at the performance of NCSA's existing data mining tool, Data Spaces, on distributed data.
• Extend Data Spaces so that it can understand HDF data (sketched below).
• Then apply Data Spaces to a real collection, probably using HDF, as well as collections managed by Bob Grossman of the University of Chicago.
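
A minimal sketch of what "understanding HDF data" might involve in practice: walking an HDF5 file and flattening its numeric datasets into in-memory arrays that a mining tool could consume. This is illustrative only and is not the Data Spaces API.

```python
# A hedged sketch: collect every numeric HDF5 dataset as a named array.
import h5py
import numpy as np

def collect_tables(path):
    tables = {}
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            # Keep integer/unsigned/float datasets; skip groups and text.
            if isinstance(obj, h5py.Dataset) and obj.dtype.kind in "iuf":
                tables[name] = np.asarray(obj)
        f.visititems(visit)
    return tables

# E.g. against the file written in the earlier HDF5 sketch.
for name, arr in collect_tables("sample_record.h5").items():
    print(name, arr.shape, arr.dtype)
```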

Page 10:

Digital Library Component Technology for Large-scale Archives

• PI: Joseph Futrelle
• Making large-scale collections accessible to a variety of kinds of users requires components that provide high-performance digital library services:
– e.g., indexing, clustering, browsing, querying, translation, and change management,
– as well as the means to rapidly deploy these components in configurations that meet the needs of a particular task or user group.
• The NCSA Digital Library Technologies group has been developing distributed digital library components for several years.
• Recent work is within the Open Digital Library (ODL) framework:
– based on extensions to the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH; a harvesting sketch follows this list).
• Tasks:
– Investigate the applicability of the ODL framework to problems of the scale and heterogeneity represented by NARA records.
– Attempt to integrate the ODL framework with NCSA's D2K framework.
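
For reference, here is a minimal sketch of harvesting record identifiers over OAI-PMH, the protocol the ODL framework extends. The endpoint URL is a placeholder, not a real repository.

```python
# A hedged sketch: issue a ListRecords request and print record identifiers.
import urllib.request
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 namespace
url = "http://example.org/oai?verb=ListRecords&metadataPrefix=oai_dc"

with urllib.request.urlopen(url) as resp:
    tree = ET.parse(resp)

# Each harvested record carries a header with a unique identifier.
for header in tree.iter(OAI + "header"):
    print(header.findtext(OAI + "identifier"))
```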

Page 11:

Digital Library Component Technology for Large-scale Archives

• Key questions:
– Can we build large-scale, high-performance Open Archives services using caching and proxying strategies?
– Can hierarchical configurations of filtering components be used to scale services by performing records reduction on multiple streams of documents? (A sketch of this idea follows the list.)
– Can translation components be used in conjunction with indexing or clustering components to build unified representations that span large-scale heterogeneous collections?
– Can NARA records be made to interoperate with external data sources using the Open Archives protocol?
– Can Open Archives components be used to help NARA acquire records from other government agencies?
– Can ODL components be rapidly assembled into applications using the D2K rapid application development environment, or a derived environment, which would not only facilitate application building but also allow the ODL components to interoperate with D2K's machine learning components?
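
As a purely illustrative sketch of the records-reduction question, filtering components can be modeled as composable stages over a document stream, each stage passing on only the records the next stage needs. ODL components are not actually Python generators; the records and predicates here are invented.

```python
# A hedged sketch: two filter stages composed hierarchically over a stream.
def filter_stage(source, predicate):
    return (r for r in source if predicate(r))

docs = [
    {"id": 1, "type": "email", "year": 1998},
    {"id": 2, "type": "memo",  "year": 1994},
    {"id": 3, "type": "email", "year": 1995},
]

# Stage 1 reduces by record type; stage 2 reduces the survivors by year.
emails = filter_stage(iter(docs), lambda r: r["type"] == "email")
recent = filter_stage(emails, lambda r: r["year"] >= 1996)
print(list(recent))  # -> [{'id': 1, 'type': 'email', 'year': 1998}]
```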

Page 12:

Performance Analysis of Archive Data Management and Retrieval

• PI: Dan Reed
• Extend the functionality of the Pablo I/O analysis toolkit to analyze I/O performance when accessing data via archival systems supported by large Linux clusters.
• We will characterize performance at three levels, driven to the maximum extent possible by expected NARA access patterns and integration with the HDF5, D2K, and Emerge toolkits:
– the time required to execute high-level archival commands;
– the cost of performing Linux-level I/O operations;
– the cost of storage and retrieval from physical storage devices.
• Add procedures to produce SDDF trace data from high-level archival operations (sketched below).
• Develop new analysis tools to process this data.
• Develop the requisite interfaces to extract data from the SDDF trace files in a form that can be used by the ARIMA time series modeling software described in the following project.
• Then apply time series techniques to characterize the behavior of archival operations.
• We also propose to study the cost and power demands of different archival operations, comparing alternative implementations and analyzing patterns of basic operations that occur frequently throughout the use of the archive.
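
As a sketch of the instrumentation step, a high-level archival command can be wrapped, timed, and logged as one trace event per invocation. The plain-text trace layout below is hypothetical; real SDDF records are self-describing data produced by the Pablo toolkit, not by this code.

```python
# A hedged sketch: time a high-level archival command and emit a trace event.
import time

def traced(op_name, func, *args):
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    # One event per archival command: operation name, wall time, timestamp.
    print(f"TRACE op={op_name} seconds={elapsed:.6f} t={time.time():.3f}")
    return result

def archive_put(path):
    # Stand-in for a real archival "put"; sleeps instead of doing I/O.
    time.sleep(0.01)

traced("put", archive_put, "/archive/rg000/doc0001")
```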

Page 13:

Time Series Characterization of Archival I/O Behavior

• PI: Nancy Tran
• This project plans to work closely with the Pablo team to model and characterize I/O behaviors using the Pablo group's SDDF-instrumented data.
• Interested in the cost (a fraction of the total execution time) of major HDF5 I/O function calls in applications run on Linux clusters.
• Leveraging their online time series modeling framework (TsModeler), they plan to analyze HDF5 cost time series built automatically from SDDF traces (a sketch follows below).
• Will correlate costs with I/O behaviors, compare the costs of different functions, and identify the most serious performance bottlenecks.
• Will also develop graphical tools for viewing I/O function cost patterns and their evolution.
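
To illustrate the modeling step, here is a minimal sketch fitting an ARIMA model to a synthetic I/O-cost series, using statsmodels as a stand-in for the TsModeler framework (an assumption; the data are invented).

```python
# A hedged sketch: fit ARIMA(1,1,1) to a synthetic per-call cost series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Synthetic cost series: fraction of total execution time per I/O call.
costs = 0.05 + 0.005 * rng.standard_normal(200).cumsum()

fit = ARIMA(costs, order=(1, 1, 1)).fit()
print("AIC:", fit.aic)
print("next 5 forecast costs:", fit.forecast(5))
```

A rising forecast in such a series is the kind of signal that would flag a function call as an emerging bottleneck.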