Archiving Digital Resources for Future
Shigeo Sugimoto
Research Center for Knowledge Communities
Grad. School of Library, Information and Media Studies
University of Tsukuba

Personal Background
• Born in Osaka, Japan in 1953
• Education: BE, ME and PhD from Dept. of Information Science, Faculty of Engineering, Kyoto University
  – Software Engineering and Programming Languages
• Job: Faculty at a LIS school in Tsukuba since 1983
• Research: Digital Libraries, Digital Archives, Metadata
• International activities:
  – Dublin Core Metadata Initiative
  – Digital libraries, preservation, metadata research conferences
  – Consortium of information Schools in Asia-Pacific (CiSAP)
• Governmental committee work:
  – Records Management, Digital Archive and Publishing
  – Digital Resources for National Diet Library, National Archives of Japan

Goal of this Talk
• Discuss issues for long-term use of digital resources as an important asset of our knowledge-centric society
• View digital archives as a well-organized collection of information and knowledge resources in our networked information society
• Understand issues for further development of digital archives from broad perspectives
Outline
• Introduction
• Terms
• Digital Archive
• Digital Preservation
• Metadata
• Concluding Remarks

Introduction
• Our information environment is already “paperless”, e.g.
  – Create, deliver, store and access documents via the Internet
  – Use paper as a one-time medium for reading and discard it once the contents are read
  – Scan in prints, store contents electronically and discard prints
  – Use on-line dictionaries more frequently than printed dictionaries
• Resources are born in a digital form and used in a digital environment
  – Digital Cameras, Smart Phones
  – Digital Books and Mobile Reading Devices

Introduction
• Information resources are easily lost unless special attention is paid to them
  – Deterioration of Papers, Films, CDs/DVDs
  – Software Obsolescence
• Archiving and preservation of digital resources is indispensable for
  – Keeping the resources searchable, accessible and usable
  – Maintaining the resources for future users

Introduction
• Long-term use of digital resources is a crucial issue for the networked information society because
  – We rely so heavily on the networked information environment today,
  – So many information and knowledge resources are published and consumed digitally,
  – So many government and corporate records are created and stored digitally, and
  – We need to keep our information and knowledge resources for future users, but
  – The lifetime of digital media is shorter than that of paper.

Introduction
• The goal of this talk is to give an overview of digital archives and their long-term use, covering
  – Terms and concepts,
  – Typical digital archive services,
  – Preservation of digital resources,
  – Metadata issues, and
  – Personal perspectives for the future

Terms and Concepts
• Resource (Information Resource): Any instance from which we get information, typically a book, a paper, a file or a set of files.
• Born Digital Resource: A digital resource created natively in a digital form
• Digitized Resource: A resource created by converting a physical resource or non-digital resource into a digital format
• Turned Digital Resource: Same as Digitized Resource
• Metadata: Data about data (or data about a resource)
Terms and Concepts
• Digital Archive:
  – Collection of digital resources organized and preserved for long-term use of the resources
  – Collecting, organizing and preserving digital resources for long-term use
• Digital Preservation:
  – Preserving digital objects for long-term use
• Digital Curation:
  – Similar to Digital Archive
  – Maintaining, preserving and adding value to digital research data throughout its lifecycle (definition by the Digital Curation Centre, UK, http://www.dcc.ac.uk/)

Digital Archive – Typical Digital Archives
• Web Archive - Archived collection of Web resources, e.g. Internet Archive
• Institutional Repository, Scholarly Archive - Archived collection of scholarly resources, e.g. academic institutional repositories, preprints and technical reports archives, electronic theses and dissertations
• Digitized collection of cultural and historical resources, e.g. digitized collection of library and museum holdings such as American Memory, World Digital Library
• Digital collection of records of governments and corporate bodies
Digital Archive – Some Examples
• Very High-Tech High-Quality Digitization of Physical Objects
• 3D sensing + Virtual Reality technology, e.g. digitization of Bayon at the ruins of Angkor (http://www.cvl.iis.u-tokyo.ac.jp/projects.html)
• Massive Digitization of Books and Documents
• Google Books project
• Book digitization by National Diet Library, Japan (http://www.ndl.go.jp/en/data/endl.html)
  – 240K books available online for use outside the library, 570K books for in-house use
• Records Database at Japan Center for Asian Historical Records (http://www.jacar.go.jp/english/index.html)
  – 22M images as of 2011.4, 1.6M catalog records of the Japanese Government from the Meiji Era up to World War II

Digital Archive – Some Examples
• Collaborative Archives
  – Europeana (http://europeana.eu/portal/): “Paintings, music, films and books from Europe's galleries, libraries, archives and museums”
  – World Digital Library (http://www.wdl.org/en/): “The World Digital Library (WDL) makes available on the Internet, free of charge and in multilingual format, significant primary materials from countries and cultures around the world.”
  – National Digital Archive Project, Taiwan (NDAP) (http://www.ndap.org.tw/index_en.php) and Taiwan e-Learning and Digital Archive Program (TELDAP) (http://www.teldap.tw/en/): Multi-Disciplinary National Archives

Digital Archive – Why Digital Archive?
• Easy and flexible access to important resources
  – Collect and organize information resources for users in the networked information environment
  – Geographical distance has been a fundamental barrier for the general public to access valuable resources stored at major memory institutions, e.g. national libraries, national archives, national museums
  – Equal access for anyone to valuable resources is crucial to empower the progress of our knowledge-centric society
  – Encourage inter- and cross-disciplinary use of resources

Digital Archive – Why Digital Archive?
• Preserving digital resources for future users
  – Many important resources already exist only in digital form
  – Adding value by maintaining valuable resources for a long period of time
• Preserving non-digital resources using digital technologies for future users
  – Physical resources may be damaged or lost in a disaster

Digital Archive – Why Digital Archive?
• Preserving born digital resources
  – Preparation for the growth of digital publishing and electronic records of governments
    • Legal deposit of digitally published resources
    • Digital archives of e-government records
  – Preserving database contents for future use
    • Scientific databases, statistics databases, etc.
  – Preserving Web and Internet resources
    • Open Web and Hidden Web
    • Institutional Web (Intranet Web)

Archival Functions
[Diagram: Resources to be archived → Collection (Collect, Organize, Re-format, Rights Management) → Preservation (Keep Resources Accessible and Usable) → Access (Search and Access, Browse, Access Control) → Resources for Users]

Archival Functions
[Diagram as on the previous slide, with the added label “Trusted”]

Sharing Preservation Repository
[Diagram: Resources to be archived → Collection (Collect, Organize, Re-format, Rights Management) → Trusted Repository (Sharing Preservation Function) → Access (Search and Access, Browse, Access Control) → Resources for Users]

Digital Preservation- Fundamental Issues -
• In general, the lifetime of digital resources is short
  – Rapid progress and change of technologies
  – Hardware issues: Short lifetime of electronic memory media and their players, e.g. Floppy, CD, DVD, LD, Video Tapes/Cassettes, Audio Tapes/Cassettes, Magnetic Tapes, etc.
  – Software issues: Frequent version changes of software tools and their dependency on the running environment, e.g., word processors, authoring tools, spreadsheets, browsers, PC operating systems, etc.

Digital Preservation- Fundamental Issues -
• The diversity of hardware and software is always increasing
• Special purpose software is used for specific contents, which are usually high-end contents
  – 3D graphics, Virtual Reality, Interactive Contents
• The volume of database contents is always increasing
• Network oriented digital publishing is growing – paradigm shift toward digital publishing
Digital Preservation- Basic Solution -
• Migration and Emulation
  – Migration: migrate the preserved resources to a new system environment
  – Emulation: build an emulator to realize a working environment for the preserved resources
• Metadata
  – Fundamental component for preservation to record information about a preserved resource and to keep track of its preservation history
  – Descriptive, administrative and technical metadata for archiving and preserving resources
  – Preservation of metadata and its schema is required

Digital Preservation- Basic Solution -
• Open Archival Information System (OAIS) [1]
  – International Standard
  – Reference Model for Archival Systems: a system framework for archival systems
  – Information Package: a package structure to keep a data object for the long term
    • Information Object
    • Preservation Description Information: metadata for preservation
    • Packaging Information: metadata for finding and managing the information package
[1] CCSDS Reference Model for an Open Archival Information System, http://public.ccsds.org/publications/archive/650x0b1.PDF

Digital Preservation: OAIS
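[Diagram: OAIS functional entities (Ingest, Archival Storage, Data Management, Access, Administration, Preservation Planning) placed between PRODUCER, CONSUMER and MANAGEMENT. The PRODUCER submits SIPs; AIPs and Descriptive Info are held inside the archive; the CONSUMER sends queries and orders and receives result sets and DIPs.]
IP: Information Package; SIP: Submission IP; AIP: Archival IP; DIP: Dissemination IP
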
OAIS: Information Object
• Data Object: Physical or Digital Object
• Representation Information: Information required to represent a data object in a meaningful way for users
  – Structural, Semantic, Technological information
Data Object + Representation Information → Information Object

OAIS: Information Package and Content Information
[Diagram: an Information Package contains an Information Object and its Preservation Description Information (PDI); Packaging Information is the description about the package]
Content Information
• Set of information that is the original target of preservation
• Content Data Object together with its Representation Information, i.e. Information Object

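The package structure on this slide can be sketched as a small data model. This is only an illustration of how the terms relate, not something defined by OAIS: the class names, field names and the sample identifier below are assumptions chosen to mirror the slide's vocabulary.

```python
from dataclasses import dataclass, field


@dataclass
class InformationObject:
    """Content Information: a data object plus its Representation Information."""
    data_object: bytes
    representation_info: dict  # hypothetical keys, e.g. format and encoding


@dataclass
class PreservationDescriptionInfo:
    """PDI, simplified here to plain strings, lists and dicts."""
    reference: str
    context: str
    provenance: list = field(default_factory=list)
    fixity: dict = field(default_factory=dict)


@dataclass
class InformationPackage:
    content: InformationObject
    pdi: PreservationDescriptionInfo
    packaging_info: dict = field(default_factory=dict)  # description about the package


package = InformationPackage(
    content=InformationObject(
        data_object=b"Hello, archive",
        representation_info={"format": "text/plain", "encoding": "utf-8"},
    ),
    pdi=PreservationDescriptionInfo(
        reference="example:0001",  # hypothetical identifier
        context="part of a sample collection",
        provenance=["created 2011 by a sample producer"],
    ),
    packaging_info={"layout": "single file"},
)

# Representation Information is what makes the stored bits meaningful to a user.
encoding = package.content.representation_info["encoding"]
text = package.content.data_object.decode(encoding)
print(text)
```

Without the `representation_info` entry the bytes could still be stored, but a future user would have to guess how to interpret them, which is the point of pairing a Data Object with Representation Information.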
OAIS: Preservation Description Information
• Context: relationships of the Content Information to its environment
• Provenance: history of the Content Information, i.e., origin or source of the Content Information, any changes that may have taken place since it was originated, and who has had custody of it since it was originated
• Fixity: Data Integrity checks or Validation/Verification keys used to ensure that the particular Content Information object has not been altered in an undocumented manner
• Reference: one or more mechanisms used to provide assigned identifiers for the Content Information, e.g. taxonomic systems, reference systems, registration systems
[Diagram: Information Package containing an Information Object and its PDI, together with Packaging Information]

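Fixity information is commonly implemented as a checksum stored alongside the content. The following is a minimal sketch using Python's standard `hashlib`; the function names and the choice of SHA-256 are assumptions made for illustration, not something prescribed by OAIS or by this talk.

```python
import hashlib


def compute_fixity(data: bytes, algorithm: str = "sha256") -> str:
    """Return a hex digest that can be stored as fixity information."""
    return hashlib.new(algorithm, data).hexdigest()


def verify_fixity(data: bytes, expected_digest: str, algorithm: str = "sha256") -> bool:
    """Recompute the digest and compare it with the stored value."""
    return compute_fixity(data, algorithm) == expected_digest


original = b"content of a preserved file"
stored_digest = compute_fixity(original)  # recorded at ingest time

print(verify_fixity(original, stored_digest))                 # True
print(verify_fixity(original + b" (edited)", stored_digest))  # False
```

A failed check only says that the bits changed; deciding whether the change was documented (for example, a planned migration) still depends on the provenance record.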
Digital Preservation: Metadata Issues
• Metadata for Digital Preservation
  – Metadata schema projects based on OAIS PDI
    • Cedars, NEDLIB, OCLC-RLG
  – METS: Metadata Encoding and Transmission Standard (http://www.loc.gov/standards/mets/)
    • Standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library
    • Container standard which has seven categories of description
  – PREMIS (http://www.loc.gov/standards/premis/)
    • Preservation Metadata: Implementation Strategies
    • PREMIS Data Model
    • PREMIS Data Dictionary

PREMIS Data Model
• Data model shows entities and their relationships
[Diagram: five entities: Intellectual Entities, Objects, Events, Rights, Agents]

PREMIS Data Model
[Diagram: the same five entities: Intellectual Entities, Objects, Events, Rights, Agents]
• Information objects to be preserved
• Separation of Intellectual Entity and Objects: files of the same content and in different formats

PREMIS Data Model
[Diagram: the same five entities: Intellectual Entities, Objects, Events, Rights, Agents]
• Entities associated with preservation tasks

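The separation between an Intellectual Entity and its Objects can be illustrated with a toy model of a format migration. This is a loose sketch inspired by the entity names on the slides, not the PREMIS Data Dictionary: all class names, fields and the `record_migration` helper are hypothetical, and the Rights entity is left out for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str


@dataclass
class PremisObject:
    filename: str
    file_format: str


@dataclass
class Event:
    event_type: str
    agent: Agent
    source: PremisObject
    outcome: PremisObject


@dataclass
class IntellectualEntity:
    title: str
    objects: list = field(default_factory=list)
    events: list = field(default_factory=list)


def record_migration(entity, source, new_filename, new_format, agent):
    """Add a new Object for the same content and log the Event that produced it."""
    migrated = PremisObject(new_filename, new_format)
    entity.objects.append(migrated)
    entity.events.append(Event("migration", agent, source, migrated))
    return migrated


report = IntellectualEntity("Annual Report 2010")
doc = PremisObject("report.doc", "application/msword")
report.objects.append(doc)

record_migration(report, doc, "report.pdf", "application/pdf", Agent("archive staff"))

print([o.file_format for o in report.objects])
print(report.events[0].event_type)
```

The intellectual content ("Annual Report 2010") stays one entity while its files multiply, and each Event keeps track of which Agent produced which Object from which source.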
Digital Preservation- Fundamental Issues Again-
• There is no perfect solution for preservation of digital resources
  – There is no perfect solution for preservation of non-digital resources either
  – Preservation of conventional non-digital resources is mainly preservation of the information media
• Digital preservation is primarily preservation of the information contents, not the information media, i.e. preservation of contents but not container
  – For example, electronic journals published on the Web use no tangible media, and emails in government sectors could be official records.

Digital Preservation- Fundamental Issues Again-
• How do we preserve?
  – Preserve the original content in the original binary data in the original format
    • Keep the functionality and look-and-feel
    • Risk of obsolescence of the software and hardware needed to render and interact with the content
  – Convert the original content into a format suitable for long-term use, i.e. widely used standard formats are preferable
    • May lose some functionality of digital resources, e.g. hyperlinks, dynamic contents
    • Need to identify the important content that has to be preserved

Digital Preservation- Fundamental Issues Again-
• Confidentiality, Integrity, Authenticity
  – Crucial aspects for preservation, especially preservation of official documents
    • Confidentiality changes over time
• Rights Issues
  – Copyright issues
  – Privacy issues
• Metadata Issues
  – Metadata has to be preserved with the primary resources, otherwise the resources would lose their value
  – Metadata schema has to be preserved as well, otherwise metadata will lose interpretability
    • Semantics of metadata terms have to be recorded and preserved

Digital Preservation- Fundamental Issues Again-
• Proper preservation management
  – Preservation planning based on risk management
    • Obsolescence of software and hardware
    • Degradation of memory media
• Digital preservation is a management issue rather than a technological issue, because
  – There is no perfect technological solution to preserve anything forever
  – We need to determine which information resources should be preserved and how
  – We need to cope with organizational changes of archives and also manage archives as social circumstances change

Digital Preservation- Fundamental Issues Again-
• A personal perspective
• We are responsible for preserving resources for the next generation.
• It is not realistic for us to foresee technology and social environment changes over 100 years or 1000 years.
• Digital technologies change very rapidly, which is disadvantageous for preservation from the viewpoint of stability. However, digital resources are easily and flexibly copiable, which is a significant advantage for preservation.

Metadata
• (Structured) Data about Data
• Description about a resource from a certain point of view in accordance with the requirements in the domain
[Diagram: one resource described by more than one metadata description]

Metadata
• Users search, access, evaluate a resource, and pay money for the resource on the network
  – These tasks are carried out in virtual space, not physical space
• Metadata is required in all tasks of this process
• We need to use metadata technology suitable to our applications and also to our network environment
[Diagram: Tasks over the Net]
Metadata
• Interoperability is a key issue for metadata
  – Interoperability across communities
  – Interoperability over time: Preservation
• A fundamental barrier is the semantic gap between communities
  – Same word for different concepts
  – Different words for the same concept
  – Linked Open Data: sharing concepts expressed as data, i.e., terms, phrases, etc.

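The "different words for the same concept" problem can be sketched as a mapping from each community's local field names to shared concept identifiers. The example below is an assumption-laden illustration: the local field names and sample records are invented, and mapping a museum's "maker" to the Dublin Core `creator` term is a simplification made only for this sketch.

```python
# Hypothetical local field names from two communities, mapped to shared term URIs.
LIBRARY_TO_SHARED = {
    "author": "http://purl.org/dc/terms/creator",
    "title": "http://purl.org/dc/terms/title",
}
MUSEUM_TO_SHARED = {
    "maker": "http://purl.org/dc/terms/creator",
    "object_name": "http://purl.org/dc/terms/title",
}


def to_shared(record: dict, mapping: dict) -> dict:
    """Re-key a record by shared concept identifiers, dropping unmapped fields."""
    return {mapping[key]: value for key, value in record.items() if key in mapping}


library_record = {"author": "Hokusai", "title": "The Great Wave"}
museum_record = {"maker": "Hokusai", "object_name": "The Great Wave"}

shared_library = to_shared(library_record, LIBRARY_TO_SHARED)
shared_museum = to_shared(museum_record, MUSEUM_TO_SHARED)

# Once both records use shared identifiers, they can be compared directly.
print(shared_library == shared_museum)
```

The mapping tables themselves are metadata that has to be preserved: without them, the local field names lose their meaning over time.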
Metadata
• Promote sharing and reuse of metadata vocabularies
  – A metadata vocabulary, a controlled set of terms used to express metadata, is the semantic basis of metadata
  – Sharing metadata vocabulary means sharing concepts
• Application Profile concept of Dublin Core
  – Mixing and matching metadata vocabularies
  – Clear separation of metadata vocabularies and structural constraints in a metadata schema

Application Profile
A metadata schema (conceptual):
Element     Constraints
Title       Mandatory
Subject     Optional, Repeatable
Author      Mandatory, Repeatable
Publisher   Mandatory if applicable

Application Profile
Choose appropriate terms for an application scheme:
Metadata Vocabulary 1 (Metadata Element Set): Title, Date, Subject
Metadata Vocabulary 2 (Metadata Element Set): Author, Type, Publisher
Terms chosen: Title, Subject, Author, Publisher

Application Profile
Metadata Vocabulary 1 (Metadata Element Set): Title, Date, Subject
Metadata Vocabulary 2 (Metadata Element Set): Author, Type, Publisher
Structural constraints for every element:
Element     Constraints
Title       Mandatory
Subject     Optional, Repeatable
Author      Mandatory, Repeatable
Publisher   Mandatory if applicable
Define encoding scheme for implementation

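The structural constraints of the conceptual schema above can be checked mechanically once they are written down as data. The following is a minimal sketch under stated assumptions: the `PROFILE` table and `validate` function are hypothetical names, records are assumed to map each element to a list of values, and "Mandatory if applicable" for Publisher is treated as optional because applicability cannot be decided from the record alone.

```python
PROFILE = {
    # element: (mandatory, repeatable)
    "Title": (True, False),
    "Subject": (False, True),
    "Author": (True, True),
    "Publisher": (False, False),  # "mandatory if applicable", simplified to optional
}


def validate(record: dict) -> list:
    """Return a list of constraint violations (empty if the record conforms)."""
    problems = []
    for element, (mandatory, repeatable) in PROFILE.items():
        values = record.get(element, [])
        if mandatory and not values:
            problems.append(f"{element}: mandatory element is missing")
        if not repeatable and len(values) > 1:
            problems.append(f"{element}: element is not repeatable")
    for element in record:
        if element not in PROFILE:
            problems.append(f"{element}: not part of this application profile")
    return problems


good = {"Title": ["Digital Archives"], "Author": ["A. Writer", "B. Writer"]}
bad = {"Title": ["One", "Two"], "Date": ["2011"]}

print(validate(good))
for problem in validate(bad):
    print(problem)
```

Keeping the constraints in a table separate from the vocabulary terms reflects the separation the Application Profile concept asks for: the same terms can be reused by another application with different constraints.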
Some Remarks before Conclusion
• A personal perspective learned from the 2011.3.11 Quake and Tsunami
  – Physical things are easily lost.
    • Many heritage resources were lost.
    • Many PCs and servers were lost.
  – More robust infrastructure is required to keep important resources safe and preserve them for the future
    • A robust Cloud environment looks advantageous; however,
    • The current Cloud is too simple to be adopted for archiving important resources

Some Remarks before Conclusion
[Diagram: Archival Cloud, a layered architecture. Layers: Application Systems / Services; Preservation as a Service; Archiving as a Service. Functions: Collect (Collect, Organize, Re-format, Rights Management), Preserve, Access (Search and Access, Browse, Access Control)]

Some Remarks before Conclusion
• A personal perspective for promoting digital environment at Memory Institutions, e.g. Museums, Libraries, and Archives (MLAs)
  – High-tech, high-quality digitization is crucial to increase the potential of MLAs
  – Adoption of digital technologies, which should be really usable but need not be high-tech, is crucial to promote usability of resources at MLAs
  – Human resource development is crucial to further develop MLAs for the future networked information society

Some Remarks before Conclusion
• A personal perspective learned from governmental committee work
  – Paradigm shift in publishing environment
    • Shrinking print publishing market, expanding digital publishing market in Japan
      – Print Publishing: 2600 BYen (1996) → 1900 BYen (2009)
      – Digital Publishing: 0.4 BYen (2004) → 50 BYen (2009)
    • E-publishing business
      – Manga on mobile phones has been growing
      – E-book readers and smart phones may expand the market
    • Piracy issue for Manga (Comics) and Novels
      – Illegal scanlation of weekly Manga magazines
    • Rights issues: relationship between publishers and creators

Some Remarks before Conclusion
  – Governmental records management
    • Promotion of e-Gov, but real change is slow
    • New national law for official records management, effective since 2011.4
      – Need improvement of records management and archival services
      – Hope to promote national infrastructure for records management and archives
  – Book digitization at NDL
    • NDL, which is a national legal deposit library, is allowed to convert books into digital format for preservation purposes
    • NDL and publishers have agreed to make digitized books accessible at public libraries for those books which are not obtainable in the market, even if their copyrights are still in force

Conclusion
• Digital Archive is an important function and service for our society to maintain valuable intellectual resources and preserve them for the future
• There are many different types of digital archives but their mission is to select, collect, organize, preserve and provide access to valuable resources
• Digital preservation is a challenging task but we have to find appropriate solutions. There is no unique solution. We need to find an appropriate solution in accordance with requirements of the archiving task and the community.
Conclusion
• Metadata is an important component for archiving and preservation.
• Preservation of metadata is a challenging task as well as preservation of primary resources
Thank you very much for your attention and patience
For Your Information
• iPres 2011: Int’l Conf. on Preservation of Digital Objects, November 1-4, Singapore, http://ipres2011.sg/
• Any questions: [email protected]