dl:lesson 5 classification schemas luca dini [email protected]

41
DL:Lesson 5 Classification Schemas Luca Dini [email protected]

Upload: lyris

Post on 15-Jan-2016

54 views

Category:

Documents


0 download

DESCRIPTION

DL:Lesson 5 Classification Schemas Luca Dini [email protected]. Web: Crawling. “central” index. ?. Metadata harvesting. metadata. Author Title Abstract Identifer. ?. metadata. Metasearching. Metasearch Engine. ?. What is metasearching?. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

DL:Lesson 5Classification Schemas

Luca [email protected]

Page 2: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Web: Crawling

“central”index

?

Page 3: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Metadata harvesting

metadata

Page 4: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

metadata

AuthorTitleAbstractIdentifer

?

Page 5: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Metasearching

MetasearchEngine

?

Page 6: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

What is metasearching?

Given many document sources and a query, a metasearcher:

– Finds the good sources for the query– Evaluates the query at these sources– Merges the results from these sources

Metasearcher

<%

%>

UnindexedDocuments Legacy

Database /WAIS / etc.

ExistingWebApplication

Page 7: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Main Issues

How to query different types of sources? How to combine results and rankings from multiple

data sources?

Metasearcher

<%

%>

grep ‘biomedical’*.txt

SELECT title FROMarticles . . .

http://…/getTitle?title=‘biomedical’&…

Page 8: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Other Issues

How to choose among multiple data sources? How to get metadata about multiple data

sources?

Metasearcher

<%

%>

cat*.txt

SELECT SCHEMA…….

Best:http://….?getMetaDataWorst:“Hi. What do you have?”

Page 9: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Cost/Functionality

Cost of acceptance

Metadata Harvesting

SDLIP/STARTS

Z39.50

google

Function

Page 10: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Z39.50

http://www.loc.gov/z3950/agency/

Page 11: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Goals

• Permits one computer, the client, to search and retrieve information on another, the database server

• Important both technically and for its wide use in library systems

• Most development has concentrated on bibliographic data

• Most implementations emphasize searches that use a bibliographic set of attributes to search databases of MARC records

Page 12: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Principles

Abstract view of database searching.

• Server stores a set of databases with searchable indexes

• Interactions are based on a session

• The client opens a connection with the server, carries out a sequence of interactions and then closes the connection.

• During the course of the session, both the server and the client remember the state of their interaction.

Page 13: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

The results

Z39.50

• The server carries out the search and builds a results set

• Server saves the results set.

• Subsequent message from the client can reference the result set.

• Thus the client can modify a large set by increasingly precise requests, or can request a presentation of any record in the set, without searching entire database.

Page 14: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Services

init -- client connects to the server and exchanges initial information, e.g., preferred message size

explain -- client inquires of the server what databases are available for searching, the fields that are available, the syntax and formats supported, and other options

search -- client presents a query to a database choices of syntax for specifying searches

• only Boolean queries widely implemented • one or more records may be returned to the client

Page 15: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Services

manipulation of results sets -- e.g., sort or delete

present -- requests the server to send specified records from the results set to the client in a specified format

• options:

for controlling content and formats

for managing large records or large results sets

Page 16: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Example

In the database named "Books" find all records for which the access point title that contains the value "evangeline" and the access point author contains the value "longfellow.“

Z39.50 defines a rich variety of search access points that can be extended by implementers

Page 17: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Problems

Very difficult to implement– There are freely available implementations, but they are

complex

Outdated assumptions– Searching is expensive computationally – Bandwidth is limited (ASN.1 compression)

Originally designed for bibliographic record retrieval, and not full documents or other objects

“Overspecified” (Almost) Nobody Implements Explain! Assumes questionable user model (stateful)

Page 18: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Simple Digital Library Interoperability Protocol

http://www-diglib.stanford.edu/~testbed/doc2/SDLIP/

Page 19: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

SDLIP

Compromise between a full-scale, all encompassing search middleware design such as Z39.50 and the “anything goes” approach typical for ad-hoc search interface design on web

Support for stateful and stateless operation by the server

Support for thin clients, such as handheld devices Developed jointly by Stanford, Berkeley, and UC

Santa Barbara

Page 20: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

SDLIP – Search Middleware

Page 21: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Interfaces

                                                          

Page 22: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Interfaces

Search Interface – defines simple query language, protocol can then include other languages

Result Interface – parking meter metaphor supports varying notions of results sets

Source Metadata Interface – provides extension mechanism through discovery server capabilities

Page 23: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Result access interface

This interface allows client applications to access the set of result documents, wherever that set is maintained

Four services:– getSessionInfo– getDocs– extendStateTimeout– cancelRequest

Page 24: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Source metadata interface

Provides information about the service and server itself, such as– Collections served– Collection metadata/content information– Searchable properties

Three operations– getInterface– getSubcollectionInfo– getPropertyInfo

Page 25: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

OAI

Page 26: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Cost

Functionality

HTTPGoogle

Z39.50SGML

DublinCore

MetadataHarvesting

Page 27: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Metadata

metadata

AuthorTitleAbstractIdentifer

Page 28: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

History

Increasing interest in alternative scholarly publishing solutions – e.g., LANL arXiv

Increasing impact through federation UPS Mtg., Sante Fe, October 1999

– Representatives of various ePrint, library, publishing, communities

– Goal: definition of an interoperability framework among ePrint providers

– Result: Santa Fe Convention, interoperability through metadata harvesting

Page 29: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Umbrella model

ReferenceLibraries

PublishersE-Print

Archives

…that can be exploited by different communities

Museums

Page 30: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Key Technical features

Deploy now technology – 80/20 rule Two-party model – providers (data providers) and

consumers (service providers) Simple HTTP encoding XML schema for some degree of protocol

conformance Extensibility

– Multiple item-level metadata– Collection level metadata

Page 31: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Roles

DiscoveryCurrent

AwarenessPreservation

Service Providers

Data Providers

Meta

data

harv

estin

g

Page 32: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Key Features

definitions & concepts– repository– record– identifier– datestamp– set

protocol features– HTTP encoding– metadata prefix &

schema– flow control

protocol requests– supporting requests– harvesting requests

Page 33: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Record

<record><header>

<identifier>oai:eg:001</identifier><datestamp>1999-01-01</datestamp>

</header><metadata>

<dc xmlns=“http://purl.org/dc”><title>My Example</title>

</dc></metadata><about>

<ea xmlns=“http://www.arXiv.org/ea”<usage>No restrictions</usage>

</ea></about>

</record>

protocol support

format-specificmetadata

community-specific

record data

Page 34: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Identifiers

oai-identifier = oai:archive-identifier:record-identifier

Registered URI

Scheme

Archive Idendifier:

Registered within OAI

Unique ID within archive:

(syntax is archive-specific)

example = oai:ncstrl:ncstrl.cornellcs/TR94-1418

locally unique key for extracting a record from a repository

Page 35: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Identify

•Repository name •Base-URL

• Admin e-mail• OAI protocol version

• Description Container

repos i tory

harves ter

service provider data provider

Page 36: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

ListMetadataFormats

REPEAT• Format prefix

• Format XML schema/REPEAT

repos i tory

harves ter

service provider data provider

Page 37: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

ListSets

REPEAT• Set Specification

• Set Name/REPEAT

repos i tory

harves ter

service provider data provider

Page 38: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

* from=a * until=b * set=klmListRecords * metadataPrefix=oai_dc

REPEAT• Identifier

• Datestamp• Metadata

•About Container/REPEAT

repos i tory

harves ter

service provider data provider

Page 39: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

REPEAT• Identifier

• Datestamp/REPEAT

repos i tory

* from=a * until=bListIdentifiers * set=klmh

arves ter

service provider data provider

Page 40: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

* identifier=oai:mlib:123a GetRecord * metadataPrefix=oai_dc

• Identifier• Datestamp

• Metadata• About

repos i tory

harves ter

service provider data provider

Page 41: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

http://www.google.com.tw/webmasters/sitemaps/docs/en/other.html#oai

http://www.nla.gov.au/digicoll/oai/getRecord.html

oai_dc http://www.nla.gov.au/digicoll/oai/ http://www.nla.gov.au/digicoll/oai/

listMetadataFormats.html (oai:nla.gov.au:nla.pic-an22111591)