![Page 1: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/1.jpg)
JISC/NSF PI Meeting, June 24-25
Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness
Department of Computer Science
Old Dominion University, Norfolk, VA 23529
K. Maly, M. Zubair, M. Nelson
In Collaboration With
Los Alamos National Laboratory (R. Luce) &
American Physical Society (M. Doyle)
![Page 2: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/2.jpg)
Motivation
Lack of a federation service that provides a unified interface to diverse collections in the physics domain whose metadata differ in richness, syntax, and semantics
![Page 3: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/3.jpg)
Motivation
• Dissemination and discovery of physics resources
• Contributors
– LANL, APS, AIP, CERN
– Researchers, teachers
• Users
– Students, teachers, researchers
![Page 4: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/4.jpg)
Arc: The Basic Federation Engine
[Figure: Arc architecture diagram. Labels: User Interface; Search Engine (Servlet); JDBC; Data Normalization; History Harvest; Daily Harvest; Harvester; Data Provider (shown twice); Cache; Oracle; MySQL.]
![Page 5: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/5.jpg)
Arc: The Basic Federation Engine
[Figure: Arc search service components. Labels: Session Manager; Searcher; Grouper; Displayer; Local Query Cache and Session Related Data; Database (Metadata & Index).]
![Page 6: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/6.jpg)
![Page 7: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/7.jpg)
Challenges
• Resource Discovery
– Diversity in metadata richness
– Lack of controlled vocabulary
– Ease of discovering (formula based discovery)
– Cross linking support
– Classification
• Creation and Maintenance
– Freshness of metadata
– Dynamic nature of collections
– Filtering
• Economic Sustainability
– Rights management
– Who pays? For what?
![Page 8: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/8.jpg)
Issues – No controlled vocabulary
• Different subject classifications
• Same authors but different rendering
• Same affiliation but different form
![Page 9: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/9.jpg)
Interactive resource discovery approach components
[Figure: component diagram. Labels: Harvester; Harvested Metadata; Index Generator for Union of Key Metadata Fields; Indexed field contents; User Interface; Search Engine.]
1. User interacts to identify all the collections to be searched and with what options.
2. User executes search based on the selected options.
![Page 10: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/10.jpg)
Issues - Equation based search
• Representing search query
• Rendering of equations and embedding them into the HTML display
• Integrating into search interface
• Identifying equations inside the metadata
• Filtering equations
• Equation storage
![Page 11: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/11.jpg)
[Figure: equation search components. Classes: DisplayEqn, EqnFilter, EqnRecorder, Img2Gif, EqnExtractor, Acme.JPM.Encoders.GifEncoder, Eqn2Gif, EqnSearch, oai.search.Search, cHotEqn, MathEqn, EqnCleaner. Data: Eqn Data, DC Metadata. Groupings: Servlet, Image Converter, Formula Filter, Formula Extractor.]
![Page 12: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/12.jpg)
Filtering Equations
• Errors in equation encoding, some examples:
– missing "$" in LaTeX representation
– illegal LaTeX symbols
• Simple equations like "n=3"
![Page 13: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/13.jpg)
Filtering/categorizing Equations
Approach:
Use of a "Stop Equation File", similar to the "Stop Word File" used for indexing.
In the equation filtering context, the stop equation file consists of rules in the form of regular expressions, which describe the LaTeX strings to be dropped. The regular expression approach gives us the flexibility to easily describe a variety of strings to be filtered.
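A minimal sketch of how such a stop equation file might be applied. The three rules below are hypothetical examples (the slides do not show the actual rule file), and the sketch assumes each equation arrives as a LaTeX string with its "$" delimiters already stripped.

```python
import re

# Hypothetical stop-equation rules; the real rule file is not shown in the slides.
STOP_EQUATION_RULES = [
    r"^\s*[A-Za-z]\s*=\s*\d+(\.\d+)?\s*$",  # trivial assignments such as n=3
    r"^\s*\d+(\.\d+)?\s*$",                 # bare numbers
    r"^\s*\\?[A-Za-z]+\s*$",                # a lone symbol or command, e.g. \alpha
]
_COMPILED = [re.compile(rule) for rule in STOP_EQUATION_RULES]


def is_stop_equation(latex: str) -> bool:
    """Return True if any stop rule matches the LaTeX string."""
    return any(pattern.search(latex) for pattern in _COMPILED)


def filter_equations(equations):
    """Keep only the equations that no stop rule matches, preserving order."""
    return [eq for eq in equations if not is_stop_equation(eq)]


if __name__ == "__main__":
    sample = ["n=3", "E = mc^2", r"\alpha", r"\int_0^1 x^2 dx"]
    print(filter_equations(sample))
```

Keeping the rules as data rather than code mirrors the stop word file idea: new patterns can be added without touching the filter itself.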
![Page 14: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/14.jpg)
How to search for records using equations?
Three search alternatives (or any combination of these) for the user:
• Search for docs containing all formulae found in a) abstracts b) subject fields of documents containing user input 'keywords'
• Search for docs containing formulae defined by category (e.g. integrals, moments, limits)
• Browse formulae by various categorizations and search for docs containing selected formulae
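The category-based alternative can be sketched as a small inverted index from formula category to document. This is only an illustration: the category patterns, function names, and document identifiers are assumptions, and "moments" is left out because the slides do not say how that category is recognized.

```python
import re
from collections import defaultdict

# Hypothetical category rules keyed on LaTeX commands.
# The negative lookahead keeps \lim from matching \limits.
CATEGORY_RULES = {
    "integrals": re.compile(r"\\(?:i{1,3}nt|oint)(?![A-Za-z])"),
    "limits": re.compile(r"\\lim(?![A-Za-z])"),
}


def categorize(latex: str) -> set:
    """Return the names of all categories whose pattern occurs in the formula."""
    return {name for name, pattern in CATEGORY_RULES.items() if pattern.search(latex)}


def build_category_index(doc_formulae: dict) -> dict:
    """Map each category to the set of document ids containing such a formula."""
    index = defaultdict(set)
    for doc_id, formulae in doc_formulae.items():
        for formula in formulae:
            for category in categorize(formula):
                index[category].add(doc_id)
    return dict(index)


if __name__ == "__main__":
    docs = {  # hypothetical document ids and formulae
        "doc1": [r"\int_0^1 x\,dx"],
        "doc2": [r"\lim_{x\to 0} f(x)", r"\sum\limits_i a_i"],
    }
    print(build_category_index(docs))
```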
![Page 15: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/15.jpg)
![Page 16: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/16.jpg)
![Page 17: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/17.jpg)
Issues - Cross Linking References
• Obtaining references from full-text documents or parallel metadata sets
• Bad format of such references when obtained from full text
• Need a standard way to represent references across collections
![Page 18: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/18.jpg)
![Page 19: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/19.jpg)
![Page 20: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/20.jpg)
![Page 21: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/21.jpg)
Issues – Name similarity
• Authors use different names for themselves and their affiliation
• Could use authority files, but they are difficult to create and maintain across different collections
![Page 22: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/22.jpg)
Similarity approach
Clustering
Iterative refinement approach:
• Coarse level clusters based on approximate string matching (edit-distance, soundex, n-gram)
• Refining clusters based on affiliation where available
Presentation
Allow the user to follow up a search by clicking on an author and then selecting the appropriate name variants, i.e., no authority files are needed
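The two clustering steps can be sketched as follows. This is a simplified illustration, not the Archon implementation: it uses only edit distance for the coarse pass (soundex and n-gram matching are omitted), a greedy comparison against the first member of each cluster, and hypothetical names, affiliations, and threshold.

```python
import re
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def normalize(name: str) -> str:
    """Lowercase and drop punctuation so 'Smith, J.' compares as 'smith j'."""
    return re.sub(r"[^a-z ]", "", name.lower()).strip()


def coarse_clusters(records, max_dist=2):
    """Greedy pass: a (name, affiliation) record joins the first cluster whose
    first member's name is within max_dist edits; otherwise it starts a new one."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if edit_distance(normalize(rec[0]), normalize(cluster[0][0])) <= max_dist:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def refine_by_affiliation(cluster):
    """Split a coarse cluster by affiliation, but only when it contains
    more than one distinct known affiliation."""
    known = {aff for _, aff in cluster if aff}
    if len(known) <= 1:
        return [cluster]
    groups = defaultdict(list)
    for rec in cluster:
        groups[rec[1]].append(rec)
    return list(groups.values())


if __name__ == "__main__":
    records = [  # hypothetical author records
        ("Smith, J.", "University A"),
        ("Smith, J A", "University A"),
        ("Smyth, J.", "Institute B"),
        ("Jones, R.", None),
    ]
    for cluster in coarse_clusters(records):
        print(refine_by_affiliation(cluster))
```

Presenting the resulting clusters to the user, who then picks the variants that belong together, is what removes the need for a maintained authority file.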
![Page 23: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/23.jpg)
![Page 24: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/24.jpg)
![Page 25: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/25.jpg)
Homogenizing User Space
• Enabling Web users to discover information in OAI collections (DP-9 Service)
– http://arc.cs.odu.edu:8080/dp9/
• Enabling OAI users to discover information in Web enabled non-OAI compliant collections/databases/web sites
![Page 26: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/26.jpg)
DP-9 Service for Exposing OAI Collections to Web
![Page 27: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/27.jpg)
![Page 28: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/28.jpg)
Vac: Gateway Service for Harvesting Non-OAI Collections
[Figure: gateway diagram. Labels: OAI Service Provider; Gateway to Non-OAI Collections; three Web Enabled Non-OAI Compliant Collections/Databases/Web Sites; three WIDL Descriptions (XML based language).]
![Page 29: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/29.jpg)
Sample Description in WIDL of a Web enabled Non-OAI Collection
<WIDL NAME="NonOAIGateway" Template="TRcollector" BASEURL="http://www.princeton.edu" VERSION="2.0">
  <SERVICE NAME="getURL" METHOD="GET" URL="" INPUT="" OUTPUT="urlOutput" />
  <BINDING NAME="urlOutput" TYPE="OUTPUT">
    <VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
    <VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
    <VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
    <VARIABLE NAME="description" TYPE="String" REFERENCE="abstract" />
  </BINDING>
</WIDL>
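As a rough illustration of how a gateway could read such a description, the sketch below parses the sample with Python's standard XML parser and lists which page reference feeds each output variable. The embedded text is the sample above written with plain double quotes and well-formed nesting; the variable name "description" is assumed to be the intended spelling, and the function name is hypothetical. How Vac itself processes WIDL is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# The WIDL sample, written with plain double quotes and well-formed nesting.
WIDL_TEXT = """
<WIDL NAME="NonOAIGateway" Template="TRcollector"
      BASEURL="http://www.princeton.edu" VERSION="2.0">
  <SERVICE NAME="getURL" METHOD="GET" URL="" INPUT="" OUTPUT="urlOutput" />
  <BINDING NAME="urlOutput" TYPE="OUTPUT">
    <VARIABLE NAME="link" TYPE="String" REFERENCE="doc.p[1].text" />
    <VARIABLE NAME="title" TYPE="String" REFERENCE="title" />
    <VARIABLE NAME="author" TYPE="String" REFERENCE="author" />
    <VARIABLE NAME="description" TYPE="String" REFERENCE="abstract" />
  </BINDING>
</WIDL>
"""


def output_bindings(widl_text: str) -> dict:
    """Return {binding name: {variable name: reference}} for OUTPUT bindings."""
    root = ET.fromstring(widl_text.strip())
    result = {}
    for binding in root.findall("BINDING"):
        if binding.get("TYPE") == "OUTPUT":
            result[binding.get("NAME")] = {
                var.get("NAME"): var.get("REFERENCE")
                for var in binding.findall("VARIABLE")
            }
    return result


if __name__ == "__main__":
    for name, variables in output_bindings(WIDL_TEXT).items():
        print(name, variables)
```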
![Page 30: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/30.jpg)
Federation/archives Consistency
[Figure: Arc architecture diagram. Labels: User Interface; Search Engine (Servlet); JDBC; Data Normalization; History Harvest; Daily Harvest; Harvester; Data Provider (shown twice); Cache; Oracle; MySQL.]
![Page 31: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/31.jpg)
Future Tasks
• Post processing of search results for easier navigation
• Exploiting richer metadata and handling diversity in metadata across all participating collections
• Concentrate on interactive search interface for resource discovery
• Data normalization, authority files, filtering
• Investigating different schemes for maintaining federation/archives consistency
• More high level services beyond formula based search and cross-linking
• User testing!!!!
![Page 32: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/32.jpg)
Links
• ODU DL research group:
– http://dlib.cs.odu.edu/
• Main federation engine:
– http://arc.cs.odu.edu/
• NSDL research:
– http://archon.cs.odu.edu/
• ITR/IM research:
– http://kepler.cs.odu.edu/
![Page 33: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/33.jpg)
Not used
![Page 34: JISC/NSF PI Meeting, June 24-25 Archon - A Digital Library that Federates Physics Collections with Varying Degrees of Metadata Richness Department of Computer](https://reader035.vdocuments.net/reader035/viewer/2022062519/5697bfe91a28abf838cb682d/html5/thumbnails/34.jpg)
Automated metadata mapping approach
[Figure: metadata mapping diagram. Labels: Los Alamos Collection; American Physical Society Collection; Arc Service Provider; TRI Service Provider; OAI Layer (four instances); Harvester; Metadata Processor; Harvested Metadata; Unified and Normalized Metadata; User Interface; Search Engine; Registration Server (XML mapping for each DP); Name authority file.]