Federating Repositories of Scientific Literature
The Interspace Prototype (1997-2000)
Digital Libraries Initiative (1994-1998)
Worm Community System (1990-1993)
Telesophy System (1984-1989)
www.canis.uiuc.edu
Federating Repositories of Scientific Literature
The University of Illinois Digital Libraries Initiative (DLI)
Project Status & Retrospective
Bruce R. Schatz dli@uiuc.edu
http://dli.grainger.uiuc.edu
AAAS-98, Digital Libraries Session
Philadelphia, February 1998
Evolution of Information Retrieval across the Net
[Figure: timeline from 1960 to 2010 with stages Grand Visions, Text Search, Document Search, Concept Search; levels labeled Syntax, Structure, Semantics]
from: Bruce R. Schatz, “Information Retrieval in Digital Libraries: Bringing Search to the Net” cover article in Science, vol 275, Jan 17, 1997 special issue on Bioinformatics
Illinois DLI Status
• Production Testbed based in a Real Library
– Document Search based on Structure
– SGML Publisher Stream deployed at U of Illinois
• Technology Research for Scalable Federation
– Concept Search based on Semantics
– Statistical Indexes across subjects and media
Production Testbed Status
• Based in major Engineering Library
• Production Stream - in testbed before on shelves
• Full-text SGML -- Federated Structure Search
• 5 publishers, 55 journals, 40,000 articles
• Web version campus rollout October 1997
• integrated within library information services
Production Testbed Evaluation
• 700 users, steadily increasing to max 1500
• used in intro Computer Science classes
• developers and evaluators work closely
• needs assessment and usability studies
• careful multi-modal usage evaluation
• session observations and transaction logs
Primary Partners
• journal/magazine Publishers:
– American Institute of Physics (AIP)
– American Physical Society (APS)
– American Astronomical Society (AAS)
– American Society of Civil Engineers (ASCE)
– American Society of Mechanical Engineers (ASME)
– American Society of Agricultural Engineers (ASAE)
– American Institute of Aeronautics & Astronautics (AIAA)
– Institute of Electrical and Electronics Engineers (IEEE)
– Institution of Electrical Engineers (IEE)
– IEEE Computer Society (IEEE-CS)
• testbed: SoftQuad, OpenText
• infrastructure: Hewlett-Packard, Microsoft
DeLIver Search Interface
DeLIver Search Results
(Full Text Retrieval)
Result of “Figure Caption Search”
Dynamic Linking in Bibliography
Testbed Difficulties
• Original plan was to modify Mosaic for search
– Web became commercial -- we lost control of developers
• Plan to use standard BRS as full-text backend
– needed to use SGML-specific OpenText search engine
• Good-quality SGML simply not available
– we had to train every publisher; nothing was ready
• SGML interactive display not journal quality
– physics requires equations -- hard to display well
• Custom software hard to deploy widely
– Web widespread but too low-end for professional search
Testbed Successes
• Willing to build custom encoding procedures
– so succeed with SGML where Elsevier and OCLC failed
• Canonical encoding for structure tags
– so can federate across publishers and journals
• Willing to build custom software for Search
– so able to do multiple views, not single stream like Web
• Production repositories for real Publishers
– became R&D arm of major scientific publishers
• Changing the nature of libraries with research
– research prototype becomes standard service
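As an illustration of the "canonical encoding for structure tags" point, here is a minimal sketch of how publisher-specific tags might be mapped onto one shared tag set so that a single field search covers every publisher. The publisher names, tag names, and the `canonicalize` and `search_field` helpers are all hypothetical; the slides do not describe the testbed's actual DTDs or mapping procedure.

```python
# Hypothetical mapping from (publisher, publisher-specific tag) to a canonical tag.
CANONICAL = {
    ("publisher_a", "atl"): "title",
    ("publisher_a", "au"): "author",
    ("publisher_b", "article-title"): "title",
    ("publisher_b", "contrib"): "author",
    ("publisher_b", "figcap"): "figure_caption",
}


def canonicalize(publisher, fields):
    """Rewrite one article's tagged fields into canonical tags.

    Fields whose tag has no canonical mapping are dropped in this sketch.
    """
    out = {}
    for tag, value in fields.items():
        canonical = CANONICAL.get((publisher, tag))
        if canonical is not None:
            out.setdefault(canonical, []).append(value)
    return out


def search_field(records, field, query):
    """Case-insensitive substring search restricted to one canonical field."""
    q = query.lower()
    return [r for r in records if any(q in v.lower() for v in r.get(field, []))]


records = [
    canonicalize("publisher_a", {"atl": "Laser cooling of atoms", "au": "A. Author"}),
    canonicalize("publisher_b", {"article-title": "Cooling towers", "figcap": "Laser setup"}),
]
title_matches = search_field(records, "title", "cooling")
caption_matches = search_field(records, "figure_caption", "laser")
print(len(title_matches), len(caption_matches))
```

Once every record uses the same field names, a structure-aware query such as a figure caption search needs no per-publisher logic.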
Technology Transfer
• Illinois DLI considered R&D arm of publishers
– broad spectrum of major publishers in scientific literature
– successful annual partners' workshop plus high-level visits
• Technology transferred to Publisher partners
– contract with AIP to clone testbed software & processing
– arrangements with ASCE for a second cloning
• Testbed Continuance by University Library
– industrial partners program between Library & Publishers
– company formed to provide software and service
Technology Research
• Scalable Semantics becoming feasible
– statistical clustering proves useful interactively
– concept spaces and category maps
• Semantic indexes for large collections
– 400K Inspec (1995)
– 4M Compendex (1996)
• Simulation of Community Repositories
– 1000 collections across all of engineering
– testbed for vocabulary switching (federation)
Vocabulary Switching
• Grand Challenge of Digital Libraries
– semantic interoperability across subject domains
– vocabulary switching to suggest across domains
• Generating 1000 community repositories
– 600 categories across engineering (38 top-level)
– 150 categories across EE, CS, physics
– 3M raw abstracts, about 10M in community spaces
• Large-scale supercomputer simulation
– 7 days of dedicated computation (10 days overall)
– have space navigation; need space intersection
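One way to picture vocabulary switching is as an intersection of two concept spaces: terms related to the query in the source domain act as bridges into the target domain's space. The sketch below assumes each space is a plain dict of term-to-term weights; the `switch_vocabulary` function, the scoring rule (product of weights), and the toy weights are illustrative assumptions, not the project's algorithm.

```python
def switch_vocabulary(term, source_space, target_space, k=3):
    """Suggest target-domain terms for a source-domain term.

    Each space maps a term to {related term: weight in [0, 1]}.
    """
    scores = {}
    # Bridge terms: the query itself plus its neighbours in the source domain.
    bridges = {term: 1.0, **source_space.get(term, {})}
    for bridge, w_src in bridges.items():
        for candidate, w_tgt in target_space.get(bridge, {}).items():
            if candidate == term:
                continue
            # Keep the strongest path found to each candidate.
            scores[candidate] = max(scores.get(candidate, 0.0), w_src * w_tgt)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy spaces with made-up weights.
ee_space = {"noise": {"signal": 0.9, "filter": 0.5}}
cs_space = {
    "signal": {"interrupt handler": 0.4, "process": 0.6},
    "filter": {"pipeline": 0.8},
}
suggestions = switch_vocabulary("noise", ee_space, cs_space)
print(suggestions)
```

Only terms present in both vocabularies can serve as bridges here, which is why overlap between community spaces matters for this kind of suggestion.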
Multimedia Federation
• Semantic Indexing within Media
– Text, Image, Number
• Semantic Interoperability across Media
– Spatial Data (GIS) dataset intersection
• Multi-site DLI Collaboration
– U Illinois: systems and supercomputers
– U Arizona: algorithms and experiments
– UC Santa Barbara: collections and metadata
Semantic Analysis of Multimedia
• Collections of Objects containing Units
– Text: community repository (topic proximity)
document abstracts containing noun phrases
– Image: aerial photograph (spatial proximity)
feature regions containing texture tiles
• Units are media-dependent (statistical parsers)
– Text: phrase segmentation (nouns on word parts of speech)
– Image: texture segmentation (orientation on pixel densities)
• Indexes are media-independent (statistical clusters)
– Concept: co-occurrence similarity of units within objects
– Category: self-organizing maps of objects within collections
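The "co-occurrence similarity of units within objects" idea can be sketched for text as follows. This is a simplified stand-in, not the project's indexing code: it assumes noun phrases have already been extracted, and it uses a plain asymmetric ratio (objects containing both units divided by objects containing the first) rather than whatever weighting the actual concept spaces used. The function names and sample phrases are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_concept_space(objects):
    """Build asymmetric co-occurrence weights between units.

    objects: list of unit lists (e.g. noun phrases per abstract).
    weights[a][b] = (objects containing both a and b) / (objects containing a)
    """
    unit_count = Counter()
    pair_count = Counter()
    for units in objects:
        present = set(units)  # count each unit once per object
        unit_count.update(present)
        pair_count.update(permutations(sorted(present), 2))
    weights = defaultdict(dict)
    for (a, b), n in pair_count.items():
        weights[a][b] = n / unit_count[a]
    return weights


def suggest(weights, term, k=3):
    """Return up to k related units, strongest first, ties broken alphabetically."""
    related = weights.get(term, {})
    return sorted(related, key=lambda t: (-related[t], t))[:k]


abstracts = [
    ["digital library", "information retrieval", "sgml"],
    ["digital library", "information retrieval"],
    ["information retrieval", "concept space"],
    ["sgml", "digital library"],
]
space = build_concept_space(abstracts)
print(suggest(space, "digital library"))
```

Because the computation only needs "units inside objects", the same function would apply to texture tiles inside image regions, which is the sense in which the index is media-independent.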
Media Interoperability Experiment
• Feature regions containing texture tiles in aerial photos
– 1M regions in 5K photos around southern California (GIS)
• text concept space and category map in geoscience
– 10M phrases in 500K abstracts from Georef and Petroleum Abstracts
• image concept space and category map in aerial photos
– tile similarity space and visual thesaurus maps (10M tiles)
• numeric satellite sensor data
– 1M NASA AVHRR temperature records, 2M GNIS feature names
• spatial gazetteer as bridge image<=>text<=>number
– images are labeled by GNIS gazetteer (feature names for text search)
Federated Search
• Multiple Indexes in Distributed Repositories
– text search: SGML for full-text articles (Testbed)
bibliographic abstracts for full coverage (INSPEC)
– term suggestion: thesaurus for taxonomy (INSPEC)
concept spaces for term coverage (SGML)
• Multiple View User Interface Client
– uniform displays for multiple indexes
– drag-and-drop between display views to mix-and-match
– uniform search across multiple repositories
• Multiple Protocol Stateful Gateway
– single query stream analogous to single user interface
– will handle distributed repositories for federation, e.g. AAS
– OpenText (socket), term-suggest (SQL), Ovid/DRA (Z39.50)
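The gateway idea above, one query stream fanned out to repositories that each speak a different protocol, might be sketched like this. The `Gateway` class and the in-memory backends are hypothetical stand-ins: a real adapter would wrap a socket, SQL, or Z39.50 client behind the same callable interface, and none of those protocols are implemented here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Hit:
    repository: str
    title: str


class Gateway:
    """Keeps per-session state and fans one query out to every backend."""

    def __init__(self) -> None:
        self._backends: Dict[str, Callable[[str], List[str]]] = {}
        self.history: List[str] = []  # session state: queries issued so far

    def register(self, name: str, search_fn: Callable[[str], List[str]]) -> None:
        self._backends[name] = search_fn

    def search(self, query: str) -> List[Hit]:
        self.history.append(query)
        hits: List[Hit] = []
        for name, search_fn in self._backends.items():
            try:
                titles = search_fn(query)
            except Exception:
                continue  # one failing repository should not sink the whole query
            hits.extend(Hit(name, t) for t in titles)
        return hits


def in_memory_backend(titles: List[str]) -> Callable[[str], List[str]]:
    """Stand-in for a protocol adapter: substring match over a fixed title list."""
    def search(query: str) -> List[str]:
        q = query.lower()
        return [t for t in titles if q in t.lower()]
    return search


gateway = Gateway()
gateway.register("fulltext", in_memory_backend(["Turbulence in pipe flow", "Laser cooling"]))
gateway.register("abstracts", in_memory_backend(["Pipe flow measurements"]))
results = gateway.search("pipe flow")
print([(h.repository, h.title) for h in results])
```

Keeping the protocol details inside each adapter is what lets the client present uniform displays while repositories are added or swapped behind the gateway.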
IODyne Engineering Search Example
Building a new Community
starting the field of Digital Libraries
• IEEE Computer DLI special issue May 1996
• Computer DLI retrospective planned for 1999
• Allerton workshops on DL Sociology
• edited book planned on DL Evaluation
• DLI National Coordination effort
• Illinois DLI retrospective conference (Mar 98)
The 21st Century: Analysis
• Beyond Search to Analysis
• Cross-Correlating Information from many sources across the Net
• The Net solves problems
• Every community has its own special library
• Every community and every person does indexing!!
• The Internet evolves into the Interspace