

Archiving Digital Resources for Future

Shigeo Sugimoto
Research Center for Knowledge Communities
Grad. School of Library, Information and Media Studies
University of Tsukuba, Japan
[email protected]

Personal Background

• Born in Osaka, Japan in 1953
• Education: BE, ME and PhD from Dept. of Information Science, Faculty of Engineering, Kyoto University
  – Software Engineering and Programming Languages
• Job: Faculty at a LIS school in Tsukuba since 1983
• Research: Digital Libraries, Digital Archives, Metadata
• International activities:
  – Dublin Core Metadata Initiative
  – Digital libraries, preservation, metadata research conferences
  – Consortium of information Schools in Asia-Pacific (CiSAP)
• Governmental committee work:
  – Records Management, Digital Archive and Publishing
  – Digital Resources for National Diet Library, National Archives of Japan

Goal of this Talk

• Discuss issues for long-term use of digital resources as an important asset of our knowledge-centric society

• View digital archives as well-organized collections of information and knowledge resources in our networked information society

• Understand issues for further development of digital archives from broad perspectives

Outline

• Introduction
• Terms
• Digital Archive
• Digital Preservation
• Metadata
• Concluding Remarks

Introduction

• Our information environment is already “paperless”, e.g.
  – Create, deliver, store and access documents via the Internet
  – Use paper as a one-time medium for reading and discard it once the contents are read
  – Scan in prints, store contents electronically and discard prints
  – Use online dictionaries more frequently than printed dictionaries
• Resources are born in a digital form and used in a digital environment
  – Digital cameras, smartphones
  – Digital books and mobile reading devices

Introduction

• Information resources are easily lost unless they are given special attention
  – Deterioration of paper, film, CDs/DVDs
  – Software obsolescence
• Archiving and preservation of digital resources are indispensable for
  – Keeping the resources searchable, accessible and usable
  – Maintaining the resources for future users

Introduction

• Long-term use of digital resources is a crucial issue for the networked information society because
  – We rely so heavily on the networked information environment today,
  – So many information and knowledge resources are published and consumed digitally,
  – So many government and corporate records are created and stored digitally, and
  – We need to keep our information and knowledge resources for future users, but
  – The lifetime of digital media is shorter than that of paper.

Introduction

• The goal of this talk is to give an overview of digital archives and their long-term use, covering
  – Terms and concepts,
  – Typical digital archive services,
  – Preservation of digital resources,
  – Metadata issues, and
  – Personal perspectives for the future

Terms and Concepts

• Resource (Information Resource): Any instance from which we get information, typically a book, a paper, a file or a set of files.

• Born Digital Resource: A digital resource created natively in a digital form

• Digitized Resource: A resource created by converting a physical resource or non-digital resource into a digital format

• Turned Digital Resource: Same as Digitized Resource

• Metadata: Data about data (or data about a resource)

Terms and Concepts

• Digital Archive:
  – Collection of digital resources organized and preserved for long-term use of the resources
  – Collecting, organizing and preserving digital resources for long-term use
• Digital Preservation:
  – Preserving digital objects for long-term use
• Digital Curation:
  – Similar to Digital Archive
  – Maintaining, preserving and adding value to digital research data throughout its lifecycle (definition by the Digital Curation Centre, UK, http://www.dcc.ac.uk/)

Digital Archive – Typical Digital Archives

• Web Archive - Archived collection of Web resources, e.g. Internet Archive

• Institutional Repository, Scholarly Archive - Archived collection of scholarly resources, e.g. academic institutional repositories, preprints and technical reports archives, electronic theses and dissertations

• Digitized collection of cultural and historical resources, e.g. digitized collection of library and museum holdings such as American Memory, World Digital Library

• Digital collection of records of governments and corporate bodies


Digital Archive – Some Examples

• Very high-tech, high-quality digitization of physical objects
  – 3D sensing + Virtual Reality technology, e.g. digitization of Bayon at the ruins of Angkor (http://www.cvl.iis.u-tokyo.ac.jp/projects.html)
• Massive digitization of books and documents
  – Google Books project
  – Book digitization by National Diet Library, Japan (http://www.ndl.go.jp/en/data/endl.html)
    • 240K books online for off-library use, 570K books for in-house use
  – Records Database at Japan Center for Asian Historical Records (http://www.jacar.go.jp/english/index.html)
    • 22M images as of April 2011, 1.6M catalog records of the Japanese Government from the Meiji Era up to World War II

Digital Archive – Some Examples

• Collaborative Archives
  – Europeana (http://europeana.eu/portal/): “Paintings, music, films and books from Europe's galleries, libraries, archives and museums”
  – World Digital Library (http://www.wdl.org/en/): “The World Digital Library (WDL) makes available on the Internet, free of charge and in multilingual format, significant primary materials from countries and cultures around the world.”
  – National Digital Archive Project, Taiwan (NDAP) (http://www.ndap.org.tw/index_en.php) and Taiwan e-Learning and Digital Archive Program (TELDAP) (http://www.teldap.tw/en/): Multi-Disciplinary National Archives

Digital Archive – Why Digital Archive?

• Easy and flexible access to important resources
  – Collect and organize information resources for users in the networked information environment
  – Geographical distance has been a fundamental barrier for the general public to access valuable resources stored at major memory institutions, e.g. national libraries, national archives, national museums
  – Equal access for anyone to valuable resources is crucial to the progress of our knowledge-centric society
  – Encourage inter- and cross-disciplinary use of resources

Digital Archive – Why Digital Archive?

• Preserving digital resources for future users
  – Many important resources already exist only in digital form
  – Adding value by maintaining valuable resources for a long period of time
• Preserving non-digital resources using digital technologies for future users
  – Physical resources may be damaged or lost in a disaster

Digital Archive – Why Digital Archive?

• Preserving born digital resources
  – Preparation for the growth of digital publishing and electronic records of governments
    • Legal deposit of digitally published resources
    • Digital archives of e-government records
  – Preserving database contents for future use
    • Scientific databases, statistics databases, etc.
  – Preserving Web and Internet resources
    • Open Web and Hidden Web
    • Institutional Web (Intranet Web)

Archival Functions

[Diagram: archival functions between “Resources to be archived” and “Resources for Users”]

• Collection: Collect, Organize, Re-format, Rights Management
• Preservation: Keep Resources Accessible and Usable
• Access: Search and Access, Browse, Access Control

Archival Functions

[Diagram: archival functions between “Resources to be archived” and “Resources for Users”, with the label “Trusted” added]

• Collection: Collect, Organize, Re-format, Rights Management
• Preservation: Keep Resources Accessible and Usable
• Access: Search and Access, Browse, Access Control

Sharing Preservation Repository

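[Diagram: Collection and Access functions around a shared Trusted Repository, between “Resources to be archived” and “Resources for Users”]

• Collection: Collect, Organize, Re-format, Rights Management
• Trusted Repository: Sharing Preservation Function
• Access: Search and Access, Browse, Access Control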

Digital Preservation - Fundamental Issues -

• In general, the lifetime of digital resources is short
  – Rapid progress and change of technologies
  – Hardware issues: short lifetime of electronic storage media and their players, e.g. floppy disks, CD, DVD, LD, video tapes/cassettes, audio tapes/cassettes, magnetic tapes, etc.
  – Software issues: frequent version changes of software tools and their dependency on the running environment, e.g. word processors, authoring tools, spreadsheets, browsers, PC operating systems, etc.

Digital Preservation - Fundamental Issues -

• The diversity of hardware and software is always increasing

• Special-purpose software is used for specific contents, which are usually high-end contents
  – 3D graphics, Virtual Reality, interactive contents

• The volume of database contents is always increasing

• Network-oriented digital publishing is growing: a paradigm shift toward digital publishing

Digital Preservation - Basic Solution -

• Migration and Emulation
  – Migration: migrate the preserved resources to a new system environment
  – Emulation: build an emulator to realize a working environment for the preserved resources
• Metadata
  – Fundamental component for preservation, used to record information about a preserved resource and to keep track of its preservation history
  – Descriptive, administrative and technical metadata for archiving and preserving resources
  – Preservation of metadata and its schema is required
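As a rough illustration of how migration and preservation-history metadata fit together, the sketch below migrates a text resource to a new character encoding and appends an event to its history. The class `PreservedResource`, the function `migrate_text` and the event field names are hypothetical, chosen only for this example; they are not taken from any standard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PreservedResource:
    """Hypothetical record: content plus minimal technical metadata."""
    identifier: str
    encoding: str                                  # technical metadata: current text encoding
    content: bytes
    history: list = field(default_factory=list)    # preservation history (list of events)


def migrate_text(resource: PreservedResource, target_encoding: str = "utf-8") -> PreservedResource:
    """Return a migrated copy; the original resource is left untouched."""
    text = resource.content.decode(resource.encoding)
    event = {
        "type": "migration",
        "from": resource.encoding,
        "to": target_encoding,
        "when": datetime.now(timezone.utc).isoformat(),
    }
    return PreservedResource(
        identifier=resource.identifier,
        encoding=target_encoding,
        content=text.encode(target_encoding),
        history=resource.history + [event],
    )


original = PreservedResource("doc-001", "latin-1", "caf\u00e9".encode("latin-1"))
migrated = migrate_text(original)
print(migrated.encoding, len(migrated.history))
```

Keeping the original untouched and returning a new copy reflects one possible policy (retain the source until the migrated version has been checked); an archive could equally decide to replace it.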

Digital Preservation - Basic Solution -

• Open Archival Information System (OAIS) [1]
  – International Standard
  – Reference Model for Archival Systems: a system framework for archival systems
  – Information Package: a package structure to keep a data object for the long term
    • Information Object
    • Preservation Description Information: metadata for preservation
    • Packaging Information: metadata for finding and managing the information package

[1] CCSDS Reference Model for an Open Archival Information System, http://public.ccsds.org/publications/archive/650x0b1.PDF

Digital Preservation: OAIS

[Diagram: OAIS functional entities (Ingest, Archival Storage, Data Management, Access, Preservation Planning, Administration) between PRODUCER, CONSUMER and MANAGEMENT. Labelled flows: SIP, AIP, DIP, Descriptive Info, queries, result sets, orders.]

IP: Information Package; SIP: Submission IP; AIP: Archival IP; DIP: Dissemination IP

OAIS: Information Object

• Data Object: physical or digital object
• Representation Information: information required to represent a data object in a meaningful way for users
  – Structural, semantic, technological information

Data Object + Representation Information → Information Object

OAIS: Information Package and Content Information

[Diagram: an Information Package containing an Information Object and its Preservation Description Information (PDI), together with Packaging Information and a description about the package]

Content Information
• Set of information that is the original target of preservation
• Content Data Object together with its Representation Information, i.e. Information Object
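The package structure can be pictured as nested records. The following is a minimal sketch only: the class and field names are assumptions made for illustration, and OAIS itself is a reference model that does not prescribe any such data types.

```python
from dataclasses import dataclass, field


@dataclass
class InformationObject:
    """Data Object + Representation Information (i.e. the Content Information)."""
    data_object: bytes
    representation_information: dict   # hints needed to interpret the bytes


@dataclass
class InformationPackage:
    content_information: InformationObject
    pdi: dict = field(default_factory=dict)                    # Preservation Description Information
    packaging_information: dict = field(default_factory=dict)  # how the parts are bound together
    package_description: str = ""                              # description about the package


def describe(package: InformationPackage) -> str:
    """Summarize a package using only its own metadata."""
    info = package.content_information
    fmt = info.representation_information.get("format", "unknown")
    return f"{package.package_description}: {len(info.data_object)} bytes, format={fmt}"


aip = InformationPackage(
    content_information=InformationObject(
        b"Hello, archive", {"format": "text/plain", "charset": "ascii"}
    ),
    pdi={"provenance": "created for this example"},
    packaging_information={"container": "directory"},
    package_description="example package",
)
print(describe(aip))  # example package: 14 bytes, format=text/plain
```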


OAIS: Preservation Description Information

• Context: relationships of the Content Information to its environment
• Provenance: history of the Content Information, i.e. its origin or source, any changes that may have taken place since it was originated, and who has had custody of it since it was originated
• Fixity: data integrity checks or validation/verification keys used to ensure that the particular Content Information object has not been altered in an undocumented manner
• Reference: one or more mechanisms used to provide assigned identifiers for the Content Information, e.g. taxonomic systems, reference systems, registration systems

[Diagram: Information Package = Information Object + PDI, with Packaging Information]
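Fixity is the easiest of these categories to show in code. The sketch below assumes a SHA-256 digest is recorded when the content is ingested and recomputed later; the function names and the sample content are made up for the example, and an archive may well use a different algorithm or store several digests.

```python
import hashlib


def compute_fixity(data: bytes) -> str:
    """Return a SHA-256 hex digest to be stored as fixity information."""
    return hashlib.sha256(data).hexdigest()


def verify_fixity(data: bytes, recorded_digest: str) -> bool:
    """True if the content still matches the digest recorded at ingest."""
    return compute_fixity(data) == recorded_digest


content = b"minutes of the committee meeting"
recorded = compute_fixity(content)   # would be stored in the PDI at ingest time

print(verify_fixity(content, recorded))                 # True
print(verify_fixity(content + b" (edited)", recorded))  # False
```

A failed check only says that the bytes changed; whether the change was documented is a question for the Provenance information.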

Digital Preservation: Metadata Issues

• Metadata for Digital Preservation
  – Metadata schema projects based on OAIS PDI
    • CEDARS, NEDLIB, OCLC-RLG
  – METS: Metadata Encoding and Transmission Standard (http://www.loc.gov/standards/mets/)
    • Standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library
    • Container standard which has seven categories of description
  – PREMIS (http://www.loc.gov/standards/premis/)
    • Preservation Metadata: Implementation Strategies
    • PREMIS Data Model
    • PREMIS Data Dictionary

PREMIS Data Model

• Data model shows entities and their relationships

[Diagram: five entities: Intellectual Entities, Objects, Events, Rights, Agents]

PREMIS Data Model

[Diagram: five entities: Intellectual Entities, Objects, Events, Rights, Agents]

• Information objects to be preserved
• Separation of Intellectual Entity and Objects: files of the same content in different formats

PREMIS Data Model

[Diagram: five entities: Intellectual Entities, Objects, Events, Rights, Agents]

• Entities associated with preservation tasks
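To make the separation of Intellectual Entity and Objects concrete, here is a small sketch with one intellectual entity held as two files in different formats. The five class names follow the entities on the slide, but every field name and value is an assumption for illustration and does not reproduce the PREMIS Data Dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str                 # person, organization or software


@dataclass
class Event:
    event_type: str           # e.g. "ingest", "migration"
    agent: Agent              # who or what carried out the event


@dataclass
class Rights:
    statement: str


@dataclass
class StoredObject:
    """An Object: one concrete file holding the content."""
    file_name: str
    file_format: str
    events: list = field(default_factory=list)
    rights: list = field(default_factory=list)


@dataclass
class IntellectualEntity:
    """The content as an intellectual unit, independent of file format."""
    title: str
    objects: list = field(default_factory=list)


converter = Agent("format-converter (hypothetical)")
report = IntellectualEntity("Annual Report 2010")
report.objects.append(StoredObject("report.doc", "application/msword"))
report.objects.append(
    StoredObject(
        "report.pdf",
        "application/pdf",
        events=[Event("migration", converter)],
        rights=[Rights("preservation copying permitted")],
    )
)

formats = [obj.file_format for obj in report.objects]
print(formats)
```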

Digital Preservation - Fundamental Issues Again -

• There is no perfect solution for preservation of digital resources
  – There is no perfect solution for preservation of non-digital resources either
  – Preservation of conventional non-digital resources is mainly preservation of the information media
• Digital preservation is primarily preservation of the information contents, not the information media, i.e. preservation of the contents, not the container
  – For example, electronic journals published on the Web use no tangible media, and emails in governmental sectors can be official records.

Digital Preservation - Fundamental Issues Again -

• How do we preserve?
  – Preserve the original content in the original binary data in the original format
    • Keeps the functionality and look-and-feel
    • Risk of obsolescence of the software and hardware needed to render and interact with the content
  – Convert the original content into a format suitable for long-term use, i.e. widely used standard formats are preferable
    • May lose some functionality of digital resources, e.g. hyperlinks, dynamic contents
    • Need to identify the important content that has to be preserved

Digital Preservation - Fundamental Issues Again -

• Confidentiality, Integrity, Authenticity
  – Crucial aspects for preservation, especially preservation of official documents
    • Confidentiality changes over time
• Rights Issues
  – Copyright issues
  – Privacy issues
• Metadata Issues
  – Metadata has to be preserved with the primary resources, otherwise the resources would lose their value
  – Metadata schema has to be preserved as well, otherwise metadata will lose interpretability
    • Semantics of metadata terms have to be recorded and preserved

Digital Preservation - Fundamental Issues Again -

• Proper preservation management
  – Preservation planning based on risk management
    • Obsolescence of software and hardware
    • Degradation of storage media
• Digital preservation is a management issue rather than a technological issue, because
  – There is no perfect technological solution to preserve anything forever
  – We need to determine what information resources should be preserved and how
  – We need to cope with organizational changes of archives and also manage archives as social circumstances change

Digital Preservation - Fundamental Issues Again -

• A personal perspective
  – We are responsible for preserving resources for the next generation.
  – It is not realistic for us to anticipate changes in technology and the social environment over 100 or 1000 years.
  – Digital technologies change very rapidly, which is disadvantageous for preservation from the viewpoint of stability. However, digital resources can be copied easily and flexibly, which is a significant advantage for preservation.

Metadata

• (Structured) Data about Data
• Description about a resource from a certain point of view in accordance with the requirements in the domain

[Diagram: a resource with two Metadata descriptions]

Metadata

• Users search for, access and evaluate a resource, and pay money for the resource on the network
  – These tasks are carried out in virtual space, not physical space
• Metadata is required in all tasks of this process
• We need to use metadata technology suitable to our applications and also to our network environment

[Diagram: Tasks over the Net]

Metadata

• Interoperability is a key issue for metadata
  – Interoperability across communities
  – Interoperability over time: preservation
• A fundamental barrier is the semantic gap between communities
  – Same word for different concepts
  – Different words for the same concept
  – Linked Open Data: sharing concepts expressed as data, i.e. terms, phrases, etc.
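One way to picture "different words for the same concept" is a crosswalk that re-keys each community's local field names with shared term identifiers. The sketch below uses Dublin Core term URIs as the shared identifiers; the local field names, the two mapping tables and the sample records are all hypothetical and serve only to illustrate the idea.

```python
DCTERMS = "http://purl.org/dc/terms/"

# Hypothetical local field names used by two communities.
LIBRARY_TO_SHARED = {"author": DCTERMS + "creator", "title": DCTERMS + "title"}
MUSEUM_TO_SHARED = {"maker": DCTERMS + "creator", "object name": DCTERMS + "title"}


def to_shared_terms(record: dict, mapping: dict) -> dict:
    """Re-key a record with shared term identifiers; unmapped fields are dropped."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


library_record = {"author": "Hokusai", "title": "Thirty-six Views of Mount Fuji"}
museum_record = {"maker": "Hokusai", "object name": "Thirty-six Views of Mount Fuji", "room": "3"}

a = to_shared_terms(library_record, LIBRARY_TO_SHARED)
b = to_shared_terms(museum_record, MUSEUM_TO_SHARED)
print(a == b)  # True
```

Once both records are expressed with the same identifiers they can be compared or merged directly, which is the practical benefit of sharing concepts as data.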


Metadata

• Promote sharing and reuse of metadata vocabularies
  – A metadata vocabulary, a controlled set of terms used to express metadata, is the semantic basis of metadata
  – Sharing a metadata vocabulary means sharing concepts
• Application Profile concept of Dublin Core
  – Mixing and matching metadata vocabularies
  – Clear separation of metadata vocabularies and structural constraints in a metadata schema

Application Profile

A metadata schema (conceptual):

Element     Constraints
Title       Mandatory
Subject     Optional, Repeatable
Author      Mandatory, Repeatable
Publisher   Mandatory if applicable

Application Profile

[Diagram: choose appropriate terms for an application scheme]

• Metadata Vocabulary 1 (Metadata Element Set): Title, Date, Subject
• Metadata Vocabulary 2 (Metadata Element Set): Author, Type, Publisher
• Terms chosen for the application: Title, Subject, Author, Publisher

Application Profile

[Diagram: terms from Metadata Vocabulary 1 (Metadata Element Set: Title, Date, Subject) and Metadata Vocabulary 2 (Metadata Element Set: Author, Type, Publisher) combined into an application profile]

• Structural constraints for every element
• Define encoding scheme for implementation

Element     Constraints
Title       Mandatory
Subject     Optional, Repeatable
Author      Mandatory, Repeatable
Publisher   Mandatory if applicable
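The structural constraints of the profile can be written down as data and checked mechanically. This is a minimal sketch under stated assumptions: the `PROFILE` dictionary layout, the `validate` function and the `not_applicable` parameter are invented for the example; elements not marked Repeatable on the slide are assumed to be single-valued; and "Mandatory if applicable" is modelled by letting the caller declare an element not applicable.

```python
# Constraints from the slide; "if_applicable" models "Mandatory if applicable".
PROFILE = {
    "title":     {"mandatory": True,  "repeatable": False},
    "subject":   {"mandatory": False, "repeatable": True},
    "author":    {"mandatory": True,  "repeatable": True},
    "publisher": {"mandatory": True,  "repeatable": False, "if_applicable": True},
}


def validate(record: dict, not_applicable: frozenset = frozenset()) -> list:
    """Return a list of constraint violations (an empty list means valid)."""
    errors = []
    for element in record:
        if element not in PROFILE:
            errors.append(f"unknown element: {element}")
    for element, rule in PROFILE.items():
        values = record.get(element, [])
        exempt = rule.get("if_applicable", False) and element in not_applicable
        if rule["mandatory"] and not values and not exempt:
            errors.append(f"missing mandatory element: {element}")
        if not rule["repeatable"] and len(values) > 1:
            errors.append(f"element not repeatable: {element}")
    return errors


good = {
    "title": ["Digital Archives"],
    "author": ["A. Author", "B. Author"],
    "publisher": ["Example Press"],
}
bad = {"title": ["One", "Two"], "subject": ["metadata"]}

print(validate(good))
print(validate(bad))
```

Keeping the constraints in a table separate from the checking code mirrors the separation between vocabularies and structural constraints: the same terms could be reused under a different profile by swapping the table.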

Some Remarks before Conclusion

• A personal perspective learned from the March 11, 2011 earthquake and tsunami
  – Physical objects are easily lost.
    • Many heritage resources were lost.
    • Many PCs and servers were lost.
  – More robust infrastructure is required to keep important resources safe and preserve them for the future
    • A robust Cloud environment looks advantageous; however,
    • The current Cloud is too simple to be adopted for archiving important resources

Some Remarks before Conclusion

[Diagram: Archival Cloud, a layered architecture]

• Layers: Application Systems / Services; Archiving as a Service; Preservation as a Service
• Collect: Collect, Organize, Re-format, Rights Management
• Preserve
• Access: Search and Access, Browse, Access Control

Some Remarks before Conclusion

• A personal perspective for promoting the digital environment at Memory Institutions, e.g. Museums, Libraries, and Archives (MLAs)
  – High-tech, high-quality digitization is crucial to increase the potential of MLAs
  – Adoption of digital technologies, which should be really usable but need not be high-tech, is crucial to promote usability of resources at MLAs
  – Human resource development is crucial to further develop MLAs for the future networked information society

Some Remarks before Conclusion

• A personal perspective learned from governmental committee work
  – Paradigm shift in publishing environment
    • Shrinking print publishing market, expanding digital publishing market in Japan
      – Print publishing: 2,600 billion yen (1996) → 1,900 billion yen (2009)
      – Digital publishing: 0.4 billion yen (2004) → 50 billion yen (2009)
    • E-publishing business
      – Manga on mobile phones has been growing
      – E-book readers and smartphones may expand the market
    • Piracy issue for manga (comics) and novels
      – Illegal scanlation of weekly manga magazines
    • Rights issues: relationship between publishers and creators

Some Remarks before Conclusion

  – Governmental records management
    • Promotion of e-Gov, but real change is slow
    • New national law for official records management, effective since April 2011
      – Need improvement of records management and archival services
      – Hope to promote national infrastructure for records management and archives
  – Book digitization at NDL
    • NDL, which is a national legal deposit library, is allowed to convert books into digital format for preservation purposes
    • NDL and publishers have agreed to make digitized books accessible at public libraries for those books which are not obtainable in the market, even if they are still in copyright

Conclusion

• Digital Archive is an important function and service for our society to maintain valuable intellectual resources and preserve them for the future

• There are many different types of digital archives, but their common mission is to select, collect, organize, preserve and provide access to valuable resources

• Digital preservation is a challenging task, but we have to find appropriate solutions. There is no single solution; we need to find one that fits the requirements of the archiving task and the community.

Conclusion

• Metadata is an important component for archiving and preservation.

• Preservation of metadata is as challenging a task as preservation of primary resources

Thank you very much for your attention and patience

For Your Information

• iPres 2011: Int’l Conf. on Preservation of Digital Objects, November 1-4, Singapore (http://ipres2011.sg/)
• Any questions: [email protected]