JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer Science Old Dominion University, Norfolk, VA 23529 K. Maly, M. Zubair, M. Nelson In Collaboration With Los Alamos National Laboratory (R. Luce) & American Physical Society (M. Doyle)

Upload: mervin-briggs

Post on 21-Jan-2016


Page 1: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer

JISC/NSF PI Meeting, June 24-25

Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness

Department of Computer Science

Old Dominion University, Norfolk, VA 23529

K. Maly, M. Zubair, M. Nelson

In Collaboration With

Los Alamos National Laboratory (R. Luce) &

American Physical Society (M. Doyle)

Page 2:

Motivation

Lack of a federation service that provides a unified interface to diverse collections in the physics domain whose metadata differ in richness, syntax, and semantics

Page 3:

Motivation

• Dissemination and discovery of Physics resources
• Contributors
  – LANL, APS, AIP, CERN
  – researchers, teachers
• Users
  – Students, teachers, researchers

Page 4:

Arc: The Basic Federation Engine

[Figure: Arc architecture. Components shown: User Interface; Search Engine (Servlet); JDBC; Data Normalization; History Harvest; Daily Harvest; Harvester; Data Provider; Cache; Oracle; MySQL]
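The diagram distinguishes a history harvest from a daily harvest. A minimal sketch of that distinction, assuming the harvester talks OAI-PMH and uses the protocol's `from` argument for incremental requests; the endpoint URL and the function name `list_records_url` are hypothetical and not taken from Arc:

```python
from datetime import date, timedelta
from typing import Optional
from urllib.parse import urlencode


def list_records_url(base_url: str, since: Optional[date] = None,
                     metadata_prefix: str = "oai_dc") -> str:
    """Build an OAI-PMH ListRecords request URL.

    since=None asks for every record (a "history" harvest); a date limits
    the request to records changed on or after that day (a "daily" harvest).
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if since is not None:
        params["from"] = since.isoformat()
    return base_url + "?" + urlencode(params)


# Hypothetical data provider endpoint, for illustration only.
BASE = "http://example.org/oai"
history = list_records_url(BASE)
daily = list_records_url(BASE, since=date(2002, 6, 24) - timedelta(days=1))
```

The sketch only builds request URLs; fetching, resumption tokens, and writing into the cache are left out.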

Page 5:

Arc: The Basic Federation Engine

[Figure: search-side components. Shown: Searcher; Grouper; Displayer; Session Manager; Local Query Cache and Session Related Data; Database (Metadata & Index)]

Page 6:

Page 7:

Challenges

• Resource Discovery
  – Diversity in metadata richness
  – Lack of controlled vocabulary
  – Ease of discovering (formula based discovery)
  – Cross linking support
  – Classification
• Creation and Maintenance
  – Freshness of metadata
  – Dynamic nature of collections
  – Filtering
• Economic Sustainability
  – Rights management
  – Who pays? For what?

Page 8:

Issues – No controlled vocabulary

• Different subject classifications
• Same authors but different rendering
• Same affiliation but different form

Page 9:

Interactive resource discovery approach components

[Figure: components shown: Harvester; Index Generator for Union of Key Metadata Fields; Harvested Metadata; Indexed field contents; User Interface; Search Engine]

1. User interacts to identify all the collections to be searched and with which options.

2. User executes search based on the selected options.
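One way to read the two steps is sketched below: index the union of fields that appear across collections, let the user pick collections, then search only within that choice. This is an illustration under assumptions, not Arc's implementation; the sample records, field names, and function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical harvested records; field names loosely follow Dublin Core.
records = [
    {"id": "lanl:1", "collection": "LANL", "title": "Quantum dots", "author": "Maly, K."},
    {"id": "aps:7", "collection": "APS", "title": "Quantum wells", "subject": "condensed matter"},
]


def build_index(records):
    """Index every field that appears in any collection (the union of fields)."""
    index = defaultdict(lambda: defaultdict(set))  # field -> term -> record ids
    fields_by_collection = defaultdict(set)        # collection -> fields it supplies
    for rec in records:
        for field, value in rec.items():
            if field in ("id", "collection"):
                continue
            fields_by_collection[rec["collection"]].add(field)
            for term in re.findall(r"\w+", value.lower()):
                index[field][term].add(rec["id"])
    return index, fields_by_collection


def search(index, records, collections, field, term):
    """Step 2: search restricted to the collections chosen in step 1."""
    hits = index.get(field, {}).get(term.lower(), set())
    chosen = {r["id"] for r in records if r["collection"] in collections}
    return sorted(hits & chosen)


index, fields_by_collection = build_index(records)
```

`fields_by_collection` is what a user interface could show in step 1, so the user sees which search options each collection can actually support.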

Page 10:

Issues - Equation based search

• Representing search query
• Rendering of equations and embedding them into the HTML display
• Integrating into search interface
• Identifying equations inside the metadata
• Filtering equations
• Equation storage

Page 11:

[Figure: equation search components. Groups: Servlet; Image Converter; Formula Filter; Formula Extractor. Classes: DisplayEqn; EqnSearch; oai.search.Search; EqnFilter; EqnRecorder; EqnCleaner; EqnExtractor; Img2Gif; Eqn2Gif; Acme.JPM.Encoders.GifEncoder; cHotEqn; MathEqn. Data: Eqn Data; DC Metadata]

Page 12:

Filtering Equations

• Errors in equation encoding, some examples:
  – missing "$" in LaTeX representation
  – illegal LaTeX symbols
• Simple equations like "n=3"
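A minimal sketch of how the missing-"$" case could be detected while pulling equations out of an abstract, assuming equations are written inline between single dollar signs. The regular expression and function names are illustrative and are not the EqnExtractor code.

```python
import re

# Matches an inline LaTeX equation delimited by single, unescaped dollar signs.
INLINE_MATH = re.compile(r"(?<!\\)\$(.+?)(?<!\\)\$")


def extract_equations(text):
    """Return the LaTeX bodies found between paired '$' delimiters."""
    return [body.strip() for body in INLINE_MATH.findall(text)]


def has_unbalanced_dollar(text):
    """An odd number of unescaped '$' suggests a missing delimiter."""
    return len(re.findall(r"(?<!\\)\$", text)) % 2 == 1
```

A record whose abstract fails the balance check could be set aside for cleaning rather than indexed with a wrongly delimited equation.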

Page 13:

Filtering/categorizing Equations

Approach:

Use of "Stop Equation File" similar to "Stop Word File" used for indexing.

In the equation filtering context, the stop equation file consists of rules in the form of regular expressions, which describe the LaTeX strings to be dropped. The regular expression approach gives us the flexibility to easily describe a variety of strings to be filtered.
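The idea can be sketched as follows. The three rules are hypothetical examples of what a stop equation file might contain (a single variable set to a number, a bare number, a lone LaTeX command); the actual rule set used in Archon is not given in the slides.

```python
import re

# Hypothetical stop equation file: one regular expression per line.
STOP_EQUATION_RULES = r"""
^\s*[A-Za-z]\s*=\s*\d+\s*$
^\s*\d+(\.\d+)?\s*$
^\s*\\[A-Za-z]+\s*$
"""


def load_rules(text):
    """Compile each non-blank line of the stop equation file."""
    return [re.compile(line) for line in text.splitlines() if line.strip()]


def filter_equations(equations, rules):
    """Keep only the equations that match none of the stop rules."""
    return [eq for eq in equations if not any(r.search(eq) for r in rules)]


rules = load_rules(STOP_EQUATION_RULES)
kept = filter_equations(["n=3", r"\alpha", r"E = \hbar \omega", "42"], rules)
```

Keeping the rules in a separate file, as with a stop word list, means the filter can be tuned without touching the indexing code.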

Page 14:

How to search for records using equations?

Three search alternatives (or any combination of these) for the user:

• Search for docs containing all formulae found in a) abstracts b) subject fields of documents containing user input ‘keywords’
• Search for docs containing formulae defined by category (e.g. integrals, moments, limits)
• Browse formulae by various categorizations and search for docs containing selected formulae
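The first alternative is a two-stage lookup: keywords select documents, the formulae in those documents become the query, and the final result is every document containing all of them. A small sketch under assumptions follows; the sample documents, the whole-word keyword match, and the function names are hypothetical.

```python
# Hypothetical records: an abstract plus the formulae extracted from it.
docs = {
    "d1": {"abstract": "Casimir force between plates",
           "formulae": {r"F = -\frac{\pi^2 \hbar c}{240 a^4}"}},
    "d2": {"abstract": "Vacuum energy and plates",
           "formulae": {r"F = -\frac{\pi^2 \hbar c}{240 a^4}", r"E = \hbar \omega / 2"}},
    "d3": {"abstract": "Harmonic oscillator ground state",
           "formulae": {r"E = \hbar \omega / 2"}},
}


def formulae_for_keywords(docs, keywords):
    """Collect formulae from documents whose abstract has every keyword."""
    wanted = {k.lower() for k in keywords}
    found = set()
    for doc in docs.values():
        if wanted <= set(doc["abstract"].lower().split()):
            found |= doc["formulae"]
    return found


def docs_containing_all(docs, formulae):
    """Return ids of documents that contain every formula in the set."""
    if not formulae:
        return []
    return sorted(d for d, doc in docs.items() if formulae <= doc["formulae"])
```

With the sample data, the keyword "casimir" matches only d1, but the formula it carries also retrieves d2, whose abstract never mentions the word.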

Page 15:

Page 16:

Page 17:

Issues - Cross Linking References

• Obtaining references from full-text documents or parallel metadata sets

• Bad format of such references when obtained from full text

• A standard way to represent references across collections is needed

Page 18:

Page 19:

Page 20:

Page 21:

Issues – Name similarity

• Authors use different names for themselves and their affiliation

• Could use authority files, but these are difficult to create and maintain across different collections

Page 22:

Similarity approach

Clustering

Iterative refinement approach:

• Coarse level clusters based on approximate string matching (edit-distance, soundex, n-gram)

• Refining clusters based on affiliation where available

Presentation

Allow the user to follow a search by clicking on an author and then selecting the appropriate name variants, i.e., no authority files
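The iterative refinement can be sketched with one of the listed measures. The example below uses character bigram overlap (an n-gram measure) for the coarse pass and then splits each cluster by affiliation. The threshold of 0.4, the greedy clustering, and the sample authors are assumptions made for illustration, not parameters reported in the slides.

```python
def bigrams(name):
    """Character bigrams of a name, ignoring case, punctuation and spaces."""
    s = "".join(ch for ch in name.lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def similarity(a, b):
    """Jaccard overlap of bigram sets (one form of n-gram matching)."""
    ga, gb = bigrams(a), bigrams(b)
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0


def coarse_clusters(authors, threshold=0.4):
    """Greedy pass: join the first cluster whose first member is similar enough."""
    clusters = []
    for author in authors:
        for cluster in clusters:
            if similarity(author["name"], cluster[0]["name"]) >= threshold:
                cluster.append(author)
                break
        else:
            clusters.append([author])
    return clusters


def refine(cluster):
    """Split a coarse cluster by affiliation; missing affiliations share a group."""
    groups = {}
    for author in cluster:
        groups.setdefault(author.get("affiliation"), []).append(author)
    return list(groups.values())


# Hypothetical author records as they might appear across collections.
authors = [
    {"name": "Zubair, M.", "affiliation": "ODU"},
    {"name": "Zubair, Mohammad", "affiliation": "ODU"},
    {"name": "Zubair, M.", "affiliation": "Other Univ"},
    {"name": "Nelson, M."},
]
clusters = coarse_clusters(authors)
```

Here the three "Zubair" renderings land in one coarse cluster, and `refine` separates the record with a different affiliation, which is the kind of grouping a user could then click through instead of relying on an authority file.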

Page 23:

Page 24:

Page 25:

Homogenizing User Space

• Enabling Web users to discover information in OAI collections (DP-9 Service)
  – http://arc.cs.odu.edu:8080/dp9/

• Enabling OAI users to discover information in Web enabled non-OAI compliant collections/databases/web sites

Page 26:

DP-9 Service for Exposing OAI Collections to Web

Page 27:

Page 28:

Vac: Gateway Service for Harvesting Non-OAI Collections

[Figure: components shown: OAI Service Provider; Gateway to Non-OAI Collections; three Web Enabled Non-OAI Compliant Collections/Databases/Web Sites, each with its own WIDL Description (XML based language)]

Page 29:

Sample Description in WIDL of a Web enabled Non-OAI Collection

<WIDL NAME="NonOAIGateway" Template="TRcollector" BASEURL="http://www.princeton.edu" VERSION="2.0">
  <SERVICE NAME="getURL" METHOD="GET" URL="" INPUT="" OUTPUT="urlOutput" />
  <BINDING NAME="urlOutput" TYPE="OUTPUT">
    <VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
    <VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
    <VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
    <VARIABLE NAME="descriptionr" TYPE="String" REFERENCE="abstract" />
  </BINDING>
</WIDL>
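To show how a gateway might read such a description, here is a minimal sketch that parses the sample with Python's standard XML parser and lists the output variables of the service. The sample is repeated with its quotation marks normalised and the unmatched closing tag dropped so that it parses; the function name `output_variables` is hypothetical, and this is not the Vac gateway's code.

```python
import xml.etree.ElementTree as ET

# The slide's sample description, cleaned up just enough to be well-formed XML.
WIDL_TEXT = """
<WIDL NAME="NonOAIGateway" Template="TRcollector"
      BASEURL="http://www.princeton.edu" VERSION="2.0">
  <SERVICE NAME="getURL" METHOD="GET" URL="" INPUT="" OUTPUT="urlOutput" />
  <BINDING NAME="urlOutput" TYPE="OUTPUT">
    <VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
    <VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
    <VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
    <VARIABLE NAME="descriptionr" TYPE="String" REFERENCE="abstract" />
  </BINDING>
</WIDL>
"""


def output_variables(widl_text):
    """Map each output variable name to the page reference it is read from."""
    root = ET.fromstring(widl_text.strip())
    binding_name = root.find("SERVICE").get("OUTPUT")
    for binding in root.findall("BINDING"):
        if binding.get("NAME") == binding_name:
            return {v.get("NAME"): v.get("REFERENCE")
                    for v in binding.findall("VARIABLE")}
    return {}


variables = output_variables(WIDL_TEXT)
```

The resulting mapping tells the gateway which parts of a fetched page supply the link, title, author, and description of each record it exposes.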

Page 30:

Federation/archives Consistency

[Figure: Arc architecture. Components shown: User Interface; Search Engine (Servlet); JDBC; Data Normalization; History Harvest; Daily Harvest; Harvester; Data Provider; Cache; Oracle; MySQL]

Page 31:

Future Tasks

• Post processing of search results for easier navigation
• Exploiting richer metadata and handling diversity in metadata across all participating collections
• Concentrate on interactive search interface for resource discovery
• Data normalization, authority files, filtering
• Investigating different schemes for maintaining federation/archives consistency
• More high level services beyond formula based search and cross-linking
• User testing!!!!

Page 32:

Links

• ODU DL research group:
  – http://dlib.cs.odu.edu/
• Main federation engine:
  – http://arc.cs.odu.edu/
• NSDL research:
  – http://archon.cs.odu.edu/
• ITR/IM research:
  – http://kepler.cs.odu.edu/

Page 33:

Not used

Page 34:

Automated metadata mapping approach

[Figure: components shown: Los Alamos Collection; American Physical Society Collection; Arc Service Provider; TRI Service Provider; OAI Layer (four instances); Harvester; Metadata Processor; Harvested Metadata; Unified and Normalized Metadata; User Interface; Search Engine; Registration Server (XML mapping for each DP); Name authority file]
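The diagram suggests a per-provider mapping, held by the registration server, that a metadata processor applies to turn harvested records into unified metadata. A rough sketch of that step follows. The slide specifies an XML mapping for each data provider (DP); a Python dictionary stands in for it here, and every field name below is hypothetical.

```python
# Hypothetical per-provider mappings from native field names to unified ones.
MAPPINGS = {
    "LANL": {"title": "title", "creator": "author", "description": "abstract"},
    "APS": {"article_title": "title", "authors": "author", "pacs": "subject"},
}


def normalize(provider, record):
    """Rename a harvested record's fields to the unified schema."""
    mapping = MAPPINGS[provider]
    unified = {target: record[source]
               for source, target in mapping.items() if source in record}
    unified["provider"] = provider
    return unified


unified = normalize("APS", {"article_title": "Quantum wells", "pacs": "73.21.Fg"})
```

Fields a provider does not supply are simply absent from the unified record, which keeps the differences in metadata richness visible to later stages.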