class 5-introto dl
Digital Libraries
Lillian N. Cassel
A digital library
• An informal definition of a digital library is a managed collection of information, with associated services, where the information is stored in digital formats and accessible over a network.
-- Wm Arms, Digital Libraries, 1999
• A focused collection of digital objects, including text, video, and audio, along with methods for access and retrieval, and for selection, organization, and maintenance of the collection.
-- Witten and Bainbridge, How to Build a Digital Library, 2003
What is a library?
• An active exercise to explore what we know about, and think about, traditional libraries.
• How do we translate these characteristics to the digital world?
– Is that the right model? Are we unnecessarily constraining the digital environment? Are there things that do not translate?
Vannevar Bush
• “As We May Think”
• (http://www.theatlantic.com/doc/194507/bush)
• Reflecting after WWII
– The value of collaboration
– The sad use of scientific expertise to invent the atomic bomb
– The need for organization and access to information.
memex
• Vannevar Bush’s vision
Image sources: kelty.rice.edu/375/images/memex/camera.jpg and http://www.knowledgesearch.org/presentations/etcon/images/memex.gif
MyLifeBits
• Gordon Bell and Microsoft
• http://www.guardian.co.uk/science/story/0,3605,1674359,00.html
“Gordon Bell doesn't need to remember, but has no chance of forgetting. At the age of 71, he is recording as much of his life as modern technology will allow, storing it all on a vast database: a digital facsimile of a life lived.
If he goes for a walk, a miniature camera that dangles from his neck snaps pictures every minute or so, immediately committing the scene to a memory built not of neurons but ones and noughts. If he wanders into a cafe, sensors note the change in light, the shift of temperature and squirrel the information away. Conversations are recorded and steps logged thanks to a GPS receiver carried with him.”
Related work
• Walden’s Path
– http://www.csdl.tamu.edu/walden/
– System used by itself or as a service within a digital library
– Allows a user to make a path through a set of related resources and save the path for reuse at a later time.
• Used to allow a teacher to “blaze a trail” through a collection of materials to help students find their way from a starting point to a goal.
• Also for recording personal trips through a collection of material to be revisited.
How does that compare to a set of bookmarks?
Moving Forward
• Looked at what a library is
• Now
– How do we translate that to a digital entity?
• Information resources, including digital libraries, are very complex systems.
– A formal model helps to capture the essence of the system and give special attention to specific areas
– The model also allows developers of digital libraries to have a checklist of areas to consider and develop well.
The 5S model
• Streams
– The flow of information in various formats
• Structures
– Organizational aspects of the DL
• Spaces
– Views of components; real or abstract images
• Scenarios
– Services and behaviors
• Societies
– Communities and relationships among them
5S summary
Model | Primitives | Formalisms | Objectives
Stream | Text; video; audio; software program | Sequences; types | Describes properties of the DL content, encoding and textual material or particular forms of multimedia data.
Structure | Collection; catalog; hypertext; document; metadata; organizational tools | Graphs; nodes; links; labels; hierarchies | Specifies organizational aspects of the DL content
Space | User interface; index; retrieval model | Sets; operations; vector space; measure space; probability space | Defines logical and presentational views of several DL components
Scenarios | Service; event; condition; action | Sequence diagrams; collaboration diagrams | Details the behavior of DL services
Societies | Community; managers; actors; classes; relationships; attributes; operators | Object-oriented modeling constructs; design patterns | Defines managers responsible for running DL services; actors that use those services, and relationships among them
Source: http://www.dlib.vt.edu/projects/5S-Model/
Etana - A DL for archeology
An example application of 5S - Etana: A DL for an archeological site
[Figure: 5S diagram of the Etana DL with five panels: Stream model, Structure model, Space model, Scenario model, Society model. Labels in the figure: Text, Video, Audio, Drawing, Photo, 3D; Site, Partition, Sub-partition, Locus, Container, Artifact, Region; Taxonomies, Temporal, Spatial, Artifact-specific; Metadata; Geographic space, Metric space, User interface; Archaeologist, General public, Service Manager; Services, Information Satisfaction, Value added, Repository building, Domain specific.]
Source: E. A. Fox http://feathers.dlib.vt.edu/
Applying the model, informally
Personal Photos; Movie, TV, media
• Stream - What types of data? Gif, jpg, avi?
• Structure - How are the elements organized? Is there a hierarchy? Are there multiple structures?
• Spaces - How would you index the items? How would you divide them into related groups?
• Scenarios - What services would you provide? What information do we need to provide those services?
• Societies - Who is the library intended to serve? Remember to include agents and other processes as well as users.
In your group, choose one or the other (photos or movie/TV/media).
Start with stream, scenarios, societies.
More formally: Definitions
• Definition: A stream is a sequence whose codomain is a nonempty set.
• Definition: A structure is a tuple (G, L, F) where G = (V,E) is a directed graph with vertex set V and edge set E, L is a set of label values, and F is a labeling function.
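The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `text_stream`, `Structure`, and `book` are hypothetical, and the labeling function F is assumed here to map both vertices and edges to labels.

```python
from dataclasses import dataclass

# A stream: a sequence whose values come from a nonempty set (here, characters).
text_stream = list("digital library")


@dataclass
class Structure:
    """A structure (G, L, F): directed graph G = (V, E), label set L, labeling function F."""
    vertices: set
    edges: set      # set of (source, target) pairs
    labels: set
    labeling: dict  # F, assumed here to map vertices and edges to labels

    def is_valid(self) -> bool:
        # Every edge must join known vertices, and F may only assign labels from L.
        edges_ok = all(u in self.vertices and v in self.vertices for u, v in self.edges)
        labels_ok = all(label in self.labels for label in self.labeling.values())
        return edges_ok and labels_ok


# Hypothetical example: a book that contains two chapters.
book = Structure(
    vertices={"book", "ch1", "ch2"},
    edges={("book", "ch1"), ("book", "ch2")},
    labels={"document", "chapter", "contains"},
    labeling={
        "book": "document",
        "ch1": "chapter",
        "ch2": "chapter",
        ("book", "ch1"): "contains",
        ("book", "ch2"): "contains",
    },
)
```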
Definitions, cont’d
• Definition: A space is a measurable space, measure space, probability space, vector space, topological space, or metric space
– A vector space is a representation for the set of elements in a collection. The vector representing each element encodes the characteristics of that element, both connecting it to similar elements and distinguishing it from different ones.
– We will do an exercise to illustrate
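One way to make the vector space idea concrete is to compare items by the angle between their vectors. The sketch below is an assumption for illustration, not the exercise from the class: the three-term vocabulary and the counts in `doc_a`, `doc_b`, and `doc_c` are made up, and cosine similarity is one common choice of measure among several.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector shares no characteristics with anything
    return dot / (norm_a * norm_b)


# Hypothetical term counts over the vocabulary ("library", "metadata", "copyright").
doc_a = [2, 1, 0]
doc_b = [4, 2, 0]  # same proportions as doc_a, so it points in the same direction
doc_c = [0, 0, 3]  # shares no terms with doc_a

similar = cosine_similarity(doc_a, doc_b)
different = cosine_similarity(doc_a, doc_c)
```

Items with similar characteristics score near 1, and items with nothing in common score 0.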
Definitions - 3
• Definition: A scenario is a sequence of related transition events (e_1, e_2, …, e_n) on state set S such that e_k = (s_k, s_{k+1}) for 1 <= k <= n.
– More easily visualized, a scenario is a path in a directed graph, G = (S, Σ_e), where vertices correspond to states in the state set S and directed edges are equivalent to events in a set of events, Σ_e, and correspond to transitions between states.
– Scenarios must be implemented to make a working system.
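The path view of a scenario can be checked mechanically. The following is a minimal sketch, assuming events are represented as (state, next state) pairs; the function name `is_scenario` and the search-service states are hypothetical examples.

```python
def is_scenario(events, states):
    """Check that events form a path: every event joins known states,
    and each event starts in the state where the previous one ended."""
    if not events:
        return False
    for source, target in events:
        if source not in states or target not in states:
            return False
    return all(events[k][1] == events[k + 1][0] for k in range(len(events) - 1))


# Hypothetical states of a simple search service.
states = {"idle", "query_entered", "results_shown", "document_viewed"}

search = [
    ("idle", "query_entered"),
    ("query_entered", "results_shown"),
    ("results_shown", "document_viewed"),
]

# Not a scenario: the second event does not start where the first one ended.
broken = [("idle", "query_entered"), ("results_shown", "document_viewed")]
```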
Definitions - 4
• Definition: A society is a tuple (C, R) where
– C = {c_1, c_2, …, c_n} is a set of conceptual communities, each community referring to a set of individuals of the same class or type (e.g. actors, activities, components, hardware, software, data);
– R = {r_1, r_2, …, r_m} is a set of relationships, each relationship being a tuple r_j = (e_j, i_j) where e_j is a Cartesian product c_{k_1} x c_{k_2} x … x c_{k_{n_j}}, 1 <= k_1 < k_2 < … < k_{n_j} <= n, which specifies the communities involved in the relationship, and i_j is an activity.
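A small sketch of the society definition above, under illustrative assumptions: the community names, their members, and the "uses" activity are invented for the example, and each relationship is stored as a pair (names of the communities involved, activity).

```python
from itertools import product

# C: communities, each a set of individuals of the same type (hypothetical members).
communities = {
    "actors": {"student", "librarian"},
    "services": {"search", "browse"},
}

# R: each relationship names the communities involved and an activity.
relationships = [(("actors", "services"), "uses")]


def participants(relationship, communities):
    """Enumerate the Cartesian product of the communities a relationship involves."""
    names, _activity = relationship
    return set(product(*(communities[name] for name in names)))


pairs = participants(relationships[0], communities)
```

Here `pairs` holds every (actor, service) combination that the "uses" relationship ranges over.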
The Digital Library Content
• Essential elements for a digital library
– Users
– Content
– Services
Content - requirements
• Store
– Organize
– Describe
• Find
• Deliver
Describing the content
• How to describe content
– Metadata
• Machine readable description of anything
• What description
– Machine readable requires standard descriptive elements
• Dublin Core (http://dublincore.org/)
– International standard
– “a standard for cross-domain information resource description.”
– 15 descriptive elements
• Other metadata schemes
– IEEE-LOM
Metadata
• What does metadata look like?
• Metadata is data about data
– Information about a resource, encoded in the resource or associated with the resource.
• The language of metadata: XML
– eXtensible Markup Language
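As a rough idea of what XML metadata looks like, the sketch below builds a small Dublin Core record with Python's standard library. The wrapper element name `record` and the helper `build_record` are assumptions for illustration; the namespace URI is the one used for the 15 Dublin Core elements, and the field values are illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_record(fields):
    """Build a <record> element (hypothetical wrapper) with one dc: child per field."""
    record = ET.Element("record")
    for name, value in fields.items():
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


record = build_record({
    "title": "An example map",
    "creator": "Nicolaum Visscher",
    "date": "1996-04-17",
})
xml_text = ET.tostring(record, encoding="unicode")
```

Printing `xml_text` shows one element per descriptive field, each prefixed with `dc:`.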
Google Books Project
• Michael A. Keller, Closing Keynote
– Ida M. Green University Librarian at Stanford,
– Director of Academic Information Resources,
– Publisher of HighWire Press, and
– Publisher of the Stanford University Press:
• "One good turn deserves another; how the Google Book Search project is benefiting everyone".
Google Books demo
• Full text - Life of Miguel de Cervantes
• Limited Preview - The Life of Miguel de Cervantes Saavedra
• Snippet View - "Discreción" in the Works of Cervantes: A Semantic Study
What has been accomplished
• As of September 2006
• Nearly 30,000 Stanford books digitized
– ~1M books from all partner libraries
• Over 4,000 books identified as needing preservation treatment (& so not digitized)
• A great debate about copyright has started
– Orphan works
– What can an archive do to provide access
– Defense of fair use underway
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Original Principles
• If legally possible, digitize every book (9M volumes) in the Stanford libraries
– Now digitizing with imprint dates up to 1963
• Partner libraries (*added recently)
– University of Michigan (similar to Stanford)
– Harvard (public domain (?), maybe > 1M)
– NYPL (public domain, unusual collections)
– Oxford - Bodleian (earlier than 1885, ~ 1M titles)
– University of California (similar to Stanford >6M)
– (more to follow)
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Purposes
• Digital preservation
– Virtual Bookshelves under construction as part of the Stanford Digital Repository
– For Stanford use only
• Other searching and research functions
– Subtle searching (as in Socrates & HighWire)
– Taxonomic (LCSH & HighWire) & Associative Searching (Takano)
– Citation linking (HighWire & “InforTools” (Ebrary))
– Better navigation (through visualization ?) (Grokker)
• Digitized books from all sources as test bed for new research; combine with articles, datasets, etc. for data mining & other transformative uses.
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Some Conclusions
• Google Book Search
– Is an indexing, not a publishing project
– Offers substantial increases in access to contents of books in library collections by keyword searching
– Offers publishers global marketing of their publications
– Offers several useful services to readers
• Offers participating libraries
– Digital copies of books on their shelves for preservation
– New possibilities for services to local readers
– New possibilities for research for local faculty & students
This slide is taken from the presentation by Michael A. Keller at ECDL 2006
Google statement
• “Many of the books in Google Book Search come from authors and publishers who participate in our Partner Program. For these books, our partners decide how much of the book is browsable -- anywhere from a few sample pages to the whole book.
• For books that enter Book Search through the Library Project, what you see depends on the book's copyright status. We respect copyright law and the tremendous creative effort authors put into their work. If the book is in the public domain and therefore out of copyright, you can page through the entire book and even download it and read it offline. But if the book is under copyright, and the publisher or author is not part of the Partner Program, we only show basic information about the book, similar to a card catalog, and, in some cases, a few snippets -- sentences of your search terms in context. The aim of Google Book Search is to help you discover books and learn where to buy or borrow them, not read them online from start to finish. It's like going to a bookstore and browsing - with a Google twist.”
http://books.google.com/support/bin/answer.py?answer=43729&topic=9259&hl=en
Other projects
• Open Content Alliance (Yahoo and the Internet Archive)
• The Internet Archive www.archive.org
• The European Digital Library (Growing number of countries)
• others
Comments? Discussion?
A DL example
• Library of Congress American Memory project
– http://memory.loc.gov/ammem/index.html
– “American Memory provides free and open access through the Internet to written and spoken words, sound recordings, still and moving images, prints, maps, and sheet music that document the American experience. It is a digital record of American history and creativity. These materials, from the collections of the Library of Congress and other institutions, chronicle historical events, people, places, and ideas that continue to shape America, serving the public as a resource for education and lifelong learning.”
Dublin Core for a map
• Map found in the LOC American Memory collection
– Map at http://memory.loc.gov/ammem/gmdhtml/gmdhome.html
• Dublin Core metadata illustration found at http://webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
– Part of a DL course at U. of Alabama
Go to web site to explore what is there -- including copyright information, title, history, etc.
Dublin Core: Title
• Name given, usually by the creator or publisher
<META name = "DC.Title"
content = "Novi Belgii Novæque Angliæ: nec non partis Virginiæ tabula multis in locis emendata"
lang = "la">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Subject
• What the work is about, possibly keywords, terms from classification scheme if available.
<META name = "DC.Subject"
content = "Middle Atlantic States - Maps - Early works to 1800 - Facsimiles"
scheme = "LCSH">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
LCSH = Library of Congress Subject Headings
Dublin Core: Description
• Free text description, abstract, etc.
<META name = "DC.Description"
content = "An (sic) historical map showing the coast of New Jersey as perceived in the seventeenth century">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Source
• Is this object derived from another? Is this map a part of a larger map? Is this text a variation or revision of another piece of text?
<META name = "DC.Source"
content = "G3715 1685 .V5 1969"
scheme = "LCCN">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
LCCN = Library of Congress Call Number
Dublin Core: Language
• Language of the content of the resource
• For the map, there is no language content
<META name = "DC.Language"
content = "nl">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Relation
• To what other object(s) or collection is this object related? Does it also exist in another collection? Is it derived from another document or image? How is it related?
<META name = "DC.Relation"
content = "isPartOf http://lcweb2.loc.gov/cgi-bin/query/r?ammem/gmd:@filreq(@field(NUMBER+@band(g3715+ct000001))+@field(COLLID+dsxpmap))">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Creator
• Person or organization responsible for the Intellectual Content of this object
<META name = "DC.Creator"
content = "Nicolaum Visscher">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Publisher
• Entity responsible for making the resource available in its present form
• Not shown in the example, but should be something like this:
<META name = "DC.Publisher"
content = "Library of Congress American Memory Project">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Contributor
• Any entity making a contribution to this object.
• Example: someone who added some information to the original document or image
• No entry for this map.
Dublin Core: Rights
• A pointer to a copyright notice, a rights management statement, or a rights server.
<META name = "DC.Rights"
content = "http://lcweb2.loc.gov/cgi-bin/ammemrr.pl?title=%3ca%20href%3d%22%2fammem%2fgmdhtml%2fdsxphome.html%22%3eDiscovery%20and%20Exploration%3c%2fa%3e&coll=gmd&div=&agg=g3715&default=ammem&dir=ammem">
Dublin Core: Date
• Date on which this object was made available in its present form, possibly the date it was entered into this digital collection.
<META name = "DC.Date"
content = "1996-04-17"
scheme = "ISO 8601">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Specify the date format so that others can interpret it correctly
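To show why naming the date format matters, here is a minimal sketch that accepts a date only when it follows the ISO 8601 calendar form YYYY-MM-DD. The helper name `parse_dc_date` is hypothetical; `date.fromisoformat` is part of Python's standard library.

```python
from datetime import date


def parse_dc_date(value):
    """Parse an ISO 8601 calendar date; return None if the string is not in that form."""
    try:
        return date.fromisoformat(value)
    except ValueError:
        return None


entered = parse_dc_date("1996-04-17")    # the value from the example above
ambiguous = parse_dc_date("04/17/1996")  # not ISO 8601, so it is rejected
```

A reader who knows the scheme can tell that 04 is the month; without the scheme, a value such as 04/05/1996 could be read two ways.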
Dublin Core: Type or Category
• What sort of thing is this? Some examples: home page, novel, poem, working paper, technical report, essay, dictionary, …
• Type should be selected from a controlled list. For example, see the DCMI Type Vocabulary:
• http://dublincore.org/documents/2006/08/28/dcmi-type-vocabulary/
Why is this recommended as a controlled vocabulary field?
DCMI Type Vocabulary
• Collection
• Dataset
• Event
• Image
• InteractiveResource
• MovingImage
• PhysicalObject
• Service
• Software
• Sound
• StillImage
• Text
See the official page for explanations of the categories. Note that Image is a broad category and MovingImage and StillImage are more restricted subcategories.
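A controlled vocabulary is easy to enforce in software. The sketch below checks a proposed Type value against the twelve terms listed above; the names `DCMI_TYPES` and `is_dcmi_type` are hypothetical, and the check is case-sensitive by assumption.

```python
# The twelve terms of the DCMI Type Vocabulary, as listed above.
DCMI_TYPES = frozenset({
    "Collection", "Dataset", "Event", "Image",
    "InteractiveResource", "MovingImage", "PhysicalObject",
    "Service", "Software", "Sound", "StillImage", "Text",
})


def is_dcmi_type(value):
    """Return True only if value is exactly one of the vocabulary terms."""
    return value in DCMI_TYPES


ok = is_dcmi_type("StillImage")
rejected = is_dcmi_type("photo of a map")  # free text is not in the vocabulary
```

Because every record draws from the same short list, a search for StillImage finds all still images, however each cataloger might otherwise have described them.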
Dublin Core: Type
• Category of this resource
<META name = "DC.Type"
content = "image.photograph">
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Dublin Core: Format
• The way the content is encoded. This tells what resource is needed to access this content.
<META name = "DC.Format"
content = "image/gif"
scheme = "IMT">
Internet MIME Types: http://www.ltsw.se/knbase/internet/mime.htp
See also Internet Media Type: http://www.graphcomp.com/info/specs/mime.html
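A Format value can often be derived from the file name rather than typed by hand. This is a small sketch using Python's standard `mimetypes` module; the helper name `guess_dc_format` and the file name `map.gif` are assumptions for illustration.

```python
import mimetypes


def guess_dc_format(filename):
    """Return the Internet media type suggested by the file extension, or None."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


map_format = guess_dc_format("map.gif")
```

The guess is based on the extension only, so it should be treated as a default that a cataloger can override.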
Dublin Core: Unique ID
• The key for this object in the collection.
• I cannot find one for the map we are looking at, but the ID for the map of which it is a part is g3715 ct000001
• The Metadata specification for that would be
<META name = "DC.Id"
content = "g3715 ct000001">
Source: http://memory.loc.gov/cgi-bin/query/r?ammem/gmd:@filreq(@field(NUMBER+ @band(g3715+ct000001))+@field(COLLID+dsxpmap))
Dublin Core: Coverage
• The time, space or other measurement of the scope or completeness of the object.
• No coverage entry specified, but might be this:
<META name = "DC.Coverage"
content = "North America, Eastern lands and coast, as viewed in late seventeenth century">
The example does not use a controlled vocabulary. Why would a controlled vocabulary be better?
International Consensus
• Recognition of International Scope of Resource Discovery on Web
• 17 Countries Currently Involved in DC Working Groups
• 50+ Implementation Projects in 10 Countries
Source: webapp.slis.ua.edu/smmweb/DLib/Metadata/OrganizingInternetResources_files/v3_document.htm
Guide to Good Practice
• The NINCH Guide to Good Practice in the Digital Representation and Management of Cultural Heritage Materials
• http://www.nyu.edu/its/humanities/ninchguide/index.html
Legal and Technical Issues
• Legal: When may a resource be digitized and made available? What requirements exist for controlling access?
• Technical: How do we control access to a resource that is stored online?
– Policies
– Encoding
– Distribution limitations
Date of work | Protected from | Term
Created 1-1-78 or after | When work is fixed in tangible medium of expression | Life + 70 years (or if work of corporate authorship, the shorter of 95 years from publication, or 120 years from creation)
Published before 1923 | In public domain | None
Published 1923 - 63 | When published with notice | 28 years + could be renewed for 47 years, now extended by 20 years for a total renewal of 67 years. If not so renewed, now in public domain
Published from 1964 - 77 | When published with notice | 28 years for first term; now automatic extension of 67 years for second term
Created before 1-1-78 but not published | 1-1-78, the effective date of the 1976 Act which eliminated common law copyright | Life + 70 years or 12-31-2002, whichever is greater
Created before 1-1-78 but published between then and 12-31-2002 | 1-1-78, the effective date of the 1976 Act which eliminated common law copyright | Life + 70 years or 12-31-2047, whichever is greater
Chart created by Lolly Gasaway. Updates at http://www.unc.edu/~unclng/public-d.htm
Works for hire
• Usual case -- works created by faculty are not the property of the university.
– Faculty surrender copyright to publishers of journals and books
– Some publishers allow faculty to retain copyright, giving the publisher specific limited rights to reproduce and distribute the work.
Fair use
• No clear, easy answers.
• Checksheet provided in the article is a good guide to the issues.
• Link to the checksheet: http://www.copyright.iupui.edu/checklist.htm
Moral rights
• Fair to the creator
– Keep the identity of the creator of the work
– Do not cut the work
– Generally, be considerate of the person (or institution) that created the work.
Getting Permission
• With the best will in the world, getting the appropriate permissions is not always easy.
– Identify who holds the rights
– Get in touch with the rights holder
– Get a suitable agreement to cover the needs of your use.
• Useful links:
– http://www.loc.gov/copyright/
– http://www.utsystem.edu/OGC/IntellectualProperty/PERMISSN.HTM
– Connections to various ways to discover and contact the rights holder of a work.
Source: NINCH Guide to Good Practice. Chapter 4: Rights Management
Checking copyright status
Source: NINCH Guide to Good Practice. Chapter 4: Rights Management. Copyright: Lauryn G. Grant
Considering people depicted in the work
Technical issues
• Link the resource to the copyright statements
• Maintain that link when the resource is copied or used
• Approaches:
– Steganography
– Encryption
– Digital Wrappers
– Digital Watermarks
Issues in Encryption
• General cases for protection of controlled content: concern for passive listening and active interference.
– Listening: intruder gains information, may not be detected. Effects indirect.
– Active interference
• Intruder may prevent delivery of the message to the intended recipient.
• Intruder may substitute a fake message for the intended one
• Effects are direct and immediate
• Less likely in the case of digital library content
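The substitution threat listed above is usually countered with a message authentication code rather than with encryption alone. The slides do not name a technique; the sketch below uses HMAC-SHA256 from Python's standard library as one possible approach, and the key, messages, and function names are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    """Compute an HMAC-SHA256 tag that only holders of the key can produce."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    """Check a received message against its tag using a constant-time comparison."""
    return hmac.compare_digest(sign(key, message), tag)


# Hypothetical shared key and messages.
key = b"shared-secret-key"
original = b"Licensed copy of item g3715 ct000001"
tag = sign(key, original)

genuine = verify(key, original, tag)
substituted = verify(key, b"Fake replacement message", tag)
```

A tag detects substitution or tampering, but it does not hide the content from a passive listener and does not stop an intruder from blocking delivery.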