

0018-9162/99/$10.00 © 1999 IEEE February 1999 51

Cover Feature

Federated Search of Scientific Literature

The Digital Libraries Initiative (DLI) project at the University of Illinois at Urbana-Champaign (UIUC) was one of six sponsored by the NSF, DARPA, and NASA from 1994 through 1998. Our goal was to develop widely usable Web technology to effectively search technical documents on the Internet. We concentrated on building the experimental Illinois DLI Testbed with tens of thousands of full-text journal articles from physics, engineering, and computer science, and on making these articles available over the Internet before they are available in print.

Our DLI Testbed used document structure to provide federated search across publisher collections, by merging diverse tags from multiple publishers into a single uniform collection. Our sociology research evaluated the usage of the DLI Testbed by more than a thousand UIUC faculty and students. Our technology research moved beyond document structure to document semantics, testing contextual indexing of document content on millions of documents.

DLI TESTBED AND FEDERATED SEARCH

The DLI Testbed team designed, developed, and evaluated mechanisms to provide effective access to full-text physics and engineering journal articles within an Internet environment. The team, based in the Engineering Library at UIUC, had as its primary goals to

• construct and test a multipublisher, full-text DLI Testbed that employs flexible search and rendering capabilities and offers rich links to internal and external resources, with the sources tagged in Standard Generalized Markup Language (SGML);

• integrate the DLI Testbed and other full-text repositories into the continuum of information resources offered to end users within the Engineering Library system;

• determine the efficacy of full-text article searching compared to document surrogate searching and explore end-user full-text searching behavior, in order to identify user-searching needs; and

• identify models for effective publishing and retrieval of full-text articles within an Internet environment and employ these models in the DLI Testbed design and development.

Document collection and retrieval

The DLI Testbed supports full text in SGML format, associated article metadata, and bit-mapped figure images for scientific journal articles. At present, the collection includes 63 journals containing 66,000 articles from five professional societies:

• American Institute of Physics
• American Physical Society
• American Society of Civil Engineers
• Institution of Electrical Engineers
• IEEE Computer Society

Each publisher transmits electronic copies of its journals to us as they go to print, allowing the issues to appear in the DLI Testbed before the hard copies appear in the Engineering Library. The production stream is increasing at approximately 2,000 articles per month. We achieved a critical mass for useful search in 1997 when the SGML collection reached coverage of two years for each journal.

The Illinois Digital Library Project has developed an infrastructure for federated repositories. The deployed testbed indexes articles from many scientific journals and publishers in a production stream that can be searched as though they form a single collection.

Bruce Schatz, William Mischo, Timothy Cole, Ann Bishop, Susan Harum, Eric Johnson, Laura Neumann
University of Illinois at Urbana-Champaign

Hsinchun Chen, Dorbin Ng
University of Arizona

To support federated search across this collection, our DLI Testbed team developed a Web-based retrieval system called DELIVER (DEsktop LInk to Virtual Engineering Resources). In operation since October 1997, DELIVER has been used by more than 1,900 registered UIUC students and faculty, plus designated outside researchers. We have recorded detailed transaction logs for more than 97,000 user search sessions.

Figure 1 shows a DELIVER search session. The initial screen, in the upper left, indicates to the user that parts (structures) of the documents are searchable. The user has requested the term surface waves when it occurs in a figure caption. The search results appear in the lower left, showing four of the articles retrieved for the requested text phrase occurring in the requested document structure. Note that each article is from a different journal and that these journals span multiple publishers. The screen on the right gives a portion of the SGML display for the full text of the article. The red arrows show that surface waves occurs in a figure caption, but not in the title. Note that SGML tags the complete structure of the document, including figures and equations.

Federated structure search

A critical element of the DLI Testbed was the effective use of SGML to reveal document structure and produce associated article-level metadata, which homogenizes heterogeneous SGML and allows short-entry display. We take the SGML directly from the publishers’ collections, converting it to a canonical format for federated searching and transforming tags into a standard set.

The metadata also contains links to internal and external data, such as other DLI Testbed articles and bibliographic abstract databases. The metadata and index files, which contain pointers to the full-text data, are stored separately from the full text.

With SGML, documents can be treated as objects, allowing viewing, manipulation, and output. For retrieval purposes, SGML’s major strength is its ability to reveal a document’s component structure. While SGML is becoming ubiquitous in publishing, it is largely generated by publishers as a production by-product. The coming widespread availability of rich markup formats, such as XML (eXtensible Markup Language), a simplified subset of SGML, will likely make such formats the standard for open document systems. We plan to use XML to represent structure in future versions of our DLI Testbed.

The DTD (Document Type Definition), which accompanies each publisher’s SGML file, specifies the semantics and syntax of the SGML tags. The DTD also specifies the rules for how SGML tags may be applied to the documents to identify where components occur.

Figure 1. DLI Testbed Web-based DELIVER search session, showing the query interface (upper left), the results interface (lower left), and full-text display (right).

One of the hardest problems in successfully deploying the DLI Testbed has been processing heterogeneous DTDs. We developed a number of techniques to address these problems and normalize processing, indexing, storage, retrieval, and rendering. For example, there is a standard canonical set of document tags, and all tags from all publishers are heuristically mapped into these.
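The mapping of publisher tags onto a canonical set can be sketched as follows. This is a minimal illustration only, not the testbed's code: the publisher names, tag names, and alias tables are all hypothetical, and the heuristics the team actually used are not described here.

```python
# Hypothetical alias tables: publisher-specific tag -> canonical tag.
# None of these tag names come from a real publisher DTD.
TAG_ALIASES = {
    "publisher_a": {"atl": "title", "au": "author", "figcap": "caption"},
    "publisher_b": {"ti": "title", "aut": "author", "fc": "caption"},
}


def normalize(publisher, fields):
    """Map (tag, text) pairs from one publisher onto the canonical tag set."""
    aliases = TAG_ALIASES[publisher]
    canonical = {}
    for tag, text in fields:
        # Tags with no alias are kept under their own name rather than dropped.
        canonical.setdefault(aliases.get(tag, tag), []).append(text)
    return canonical


doc_a = normalize("publisher_a", [("atl", "Surface waves on thin films"),
                                  ("au", "A. Author")])
doc_b = normalize("publisher_b", [("ti", "Wave propagation"),
                                  ("fc", "Surface waves at 10 Hz")])
```

After normalization, both documents expose the same field names (title, author, caption) regardless of which publisher supplied them, which is what lets one query run across collections.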

Distributed repositories and links

An important concern was developing effective retrieval models for journals published on the Web. We designed a distributed repository architecture that federates individual publisher repositories of full-text documents. Normalized metadata and index data are extracted from the full text, allowing searches via a parallel execution monitor. This architecture enables standardized and canonical searches of subject and author that are consistent across distributed and disparate repositories.
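The idea of sending one canonical query to several repositories in parallel and merging the answers can be sketched as below. This is an assumption-laden illustration: the repositories are in-memory lists, the records are invented, and a thread pool stands in for the parallel execution monitor, whose real design is not given in the article.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for publisher repositories holding normalized metadata.
REPOSITORIES = {
    "repo_a": [{"id": "a1", "author": "Smith", "title": "Surface waves"}],
    "repo_b": [{"id": "b1", "author": "Smith", "title": "Plasma physics"},
               {"id": "b2", "author": "Jones", "title": "Bridges"}],
}


def search_repository(name, field, term):
    """Search one repository's metadata for a term within a canonical field."""
    term = term.lower()
    return [(name, rec["id"]) for rec in REPOSITORIES[name]
            if term in rec.get(field, "").lower()]


def federated_search(field, term):
    """Send the same canonical query to every repository in parallel, then merge."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(search_repository, name, field, term)
                   for name in REPOSITORIES]
        hits = [hit for future in futures for hit in future.result()]
    return sorted(hits)


results = federated_search("author", "smith")
```

Because every repository exposes the same canonical fields, the caller never needs to know which publisher holds a given article.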

The DLI Testbed team successfully demonstrated the efficacy of the distributed repository model. We produced cross-DTD metadata, providing parallel database querying and distributed retrieval techniques across a distinguished subset of the full-text repositories. We then installed an off-site repository by cloning the testbed environment at the actual site of a publisher partner (the American Institute of Physics, in New York City).

We made significant progress developing a metadata specification to support standardized retrieval across repositories. This allowed for short-entry display independent of the full-text document repositories and links to associated testbed items and bibliographic databases. We used SGML tag aliasing for normalization to accommodate heterogeneous DTDs. The DELIVER client supports searching, retrieval, and display across multiple repositories, providing cross-repository retrieval with single searches.

Our innovations include integration of DELIVER with other retrieval services. We implemented Inspec and Compendex proxies for the Ovid retrieval system, with links to the DELIVER Testbed. These proxies enable those databases to be searched with comprehensive coverage for journal abstracts, with transparent links into the full-text SGML documents when the article is covered by the DLI Testbed. The DLI Testbed also provides links from the bibliographies of retrieved DELIVER articles to other items in the testbed, citation links to previous testbed articles, and links from bibliographic references in retrieved DELIVER articles to Inspec and Compendex database records.

Multiple-view interfaces

Complete search sessions across multiple sources are necessary to effectively handle scientific literature. Our DLI Testbed provides support for federated searching across document structures from different publisher repositories. The user can use a single high-level structure, such as author or caption, and have it automatically translated into the appropriate SGML tags for each document.
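A structure-scoped search of the kind described here, where a phrase must occur inside a named document structure, might look like the following sketch. The article records and field names are invented for illustration, and simple substring matching is an assumption rather than the testbed's retrieval method.

```python
# Hypothetical articles already normalized to canonical structure names.
ARTICLES = [
    {"id": "a1", "title": ["Surface waves on thin films"],
     "caption": ["Experimental setup"]},
    {"id": "b7", "title": ["Wave propagation in plasmas"],
     "caption": ["Surface waves at 10 Hz"]},
]


def structure_search(articles, field, phrase):
    """Return ids of articles whose given structure contains the phrase."""
    phrase = phrase.lower()
    return [article["id"] for article in articles
            if any(phrase in text.lower() for text in article.get(field, []))]


in_captions = structure_search(ARTICLES, "caption", "surface waves")
in_titles = structure_search(ARTICLES, "title", "surface waves")
```

The same phrase returns different articles depending on the structure requested, mirroring the Figure 1 session in which the term occurs in a figure caption but not in the title.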

Our experimental user interfaces, which ran in the Engineering Library prior to DELIVER, showed that effective information retrieval for full-text structure search is greatly facilitated by multiple views. Traditional information retrieval has supported only a single view, which sends a query to an index and returns a result. This is the model currently supported within commercial online systems and within Web search systems. A multiple-view interface supports sessions, in which the results of different searches can be combined.

We developed an experimental multiple-view interface, called IODYNE, which seamlessly integrates many different kinds of indexes with drag-and-drop between screen windows for search indexes [1]. Such a client is the paradigm for the next generation of search systems on the Web, where multiple indexes for different purposes can be easily combined within an entire session. IODYNE supports text search of full-text SGML via the DLI Testbed and of bibliographic abstracts via Ovid.

This text search can be boosted by term suggestion, where the user specifies a broad query and the system returns related terms to be interactively selected for future queries. IODYNE supports term suggestion via subject thesauri (such as the Inspec thesaurus) and via concept spaces (automatically generated thesauri provided by our technology research, as described below in the “Semantic Indexing and Technology Research” section).

Testbed partners and continuance

The collaborative relationship between the DLI Testbed team and its publishing partners was particularly strong: they grew to refer to and rely upon us as their “R&D arm.” The strong partnering relationship is evidenced by the agreement between the DLI project and the publisher partners to initiate a Collaborative Partners Program, whose funding is enabling the continuation of the DLI Testbed beyond the DLI grant period. The Engineering Library is also a recipient of a three-year grant from the US Defense Advanced Research Projects Agency (DARPA) to continue the SGML testbed for evaluation purposes.

These new funds will allow the DLI Testbed team to continue investigating issues connected with full-text article indexing, interface design, retrieval, and rendering. Continued contributions of materials from the publishing partners will allow for the increase of both the depth and breadth of the digital collection. Plans are also underway to extend DLI Testbed access to the Big Ten University Consortium throughout the US Midwest, to enlarge the user population, and further develop the distributed repository model.

TESTBED EVALUATION AND SOCIOLOGY RESEARCH

Our social science team pursued an integrated investigation of the social practices of digital libraries [2]. Throughout the project, we carried out user studies and evaluations aimed at improving the DLI Testbed. We also documented and analyzed the extent and nature of DLI Testbed use, satisfaction, and impacts regarding engineering work and communication. These efforts informed our broader contributions to knowledge about engineering work, the use of scientific and engineering journals, and the changing information infrastructure.

We pursued several research threads that are relevant to understanding social practices associated with the development and use of federated repositories of structured documents: article disaggregation during knowledge construction, user understanding of newly encountered digital libraries, convergence of communities of practice with information artifacts, and resolution of digital library visions held by different stakeholders.

In our research, we carefully adapted traditional social science methods to the study of social phenomena involving information systems. We employed a variety of qualitative and quantitative techniques for collecting and analyzing data, including

• observing engineering work and learning activities,
• conducting focus groups with potential system users,
• conducting interviews with actual system users,
• performing usability testing of system prototypes,
• recording transaction logs of system sessions, and
• conducting large-scale user surveys.

In addition, we initiated computer-mediated data-gathering techniques, such as user registration and exit polls after sessions. We have considered results from all these methods to triangulate our findings and provide a deeper understanding of the nature of digital library use and social phenomena involved.

Analysis of DELIVER users and use

Users are required to fill out an online demographic questionnaire to register for a DELIVER login. These questionnaires were analyzed when the registrations reached a total of 1,200 UIUC faculty, staff, and students. Half of these users are graduate students, who also account for the most searches. About 75 percent of users are men, most between 23 and 29 years old. Faculty members are a small, but intense, segment of users.

DELIVER users cover a wide spectrum, representing all campus engineering disciplines, science-related fields (such as ecological modeling and biology), and fields such as communications and psychology. We found, however, that most users’ backgrounds reflect the DLI Testbed’s contents, which concentrate on journals from physics, civil engineering, electrical engineering, and computer science.

A preliminary analysis of 226 recently completed user surveys suggests that people are generally satisfied with our system. The mean response to three separate questions meant to gauge people’s reaction to DELIVER was 3.5 (where 1 corresponded to “terrible,” “frustrating,” and “inadequate search power,” and 5 corresponded to “wonderful,” “satisfying,” and “adequate search power”).

DELIVER transaction logs reveal the use of various system features. Analysis of more than 4,200 sessions indicates that about 20 percent of sessions used the extended citation screen, while 38 percent of sessions viewed the full text of the article. In usability interviews, we found that users’ ability to view full text was limited by the fact that they had to first download additional software (an SGML plug-in) to view it.

Use of document structure

Given the nature of searching and display made possible through the use of SGML, we explored how researchers use journal components (such as abstracts, figures, equations, or bibliographic citations) in their work [3]. We identified five basic purposes for article components:

• To identify documents of interest.
• To assess the relevance of an article before retrieving and reading the full text.
• To create a customized document surrogate after retrieval that includes a combination of bibliographic and other elements (for example, author’s name, article title, tables).
• To provide specific pieces of information, such as an equation, a fact, or a diagram.
• To convey knowledge not easily rendered by words, especially through figures and tables.

Engineers describe a common pattern for utilizing document components by zooming in on and filtering information in their initial reading of an article. They tend to first read the title and abstract, then skim section headings. Next, they look at lists, summary statements, definitions, and illustrations, before zeroing in on key sections, reading conclusions, and skimming references.

But engineers pursue unique practices after this initial reading, as they disaggregate and reaggregate article components for use in their own work. Everyone takes scraps or reusable pieces of information from the article, but they do this differently, perhaps by using a marker to highlight text portions of interest or by making a mental register of key ideas.

Engineers then create some kind of transitory compilation of reusable pieces, such as a personal bibliographic database, folders containing the first page of an article stapled to handwritten notes, or a pile of journal issues with key sections bookmarked. These intellectual and physical practices associated with component use seem to be based on a combination of tenure in the field, the nature of the task at hand, personal work habits, and cognitive style.

Use of digital libraries

Our digital library also allowed us to step back and take a broader look at the use of online digital collections and how people attempt to make sense of them. In analyzing results from several different data collection efforts, we found that users can be confused by a newly encountered digital library, and that it takes some time and interaction for them to figure out what a particular system, like DELIVER, is.

In usability tests, we identified patterns of user actions designed to uncover what sort of system the DLI Testbed was and what it could do. What first appeared to be random trial-and-error use of the interface was actually structured exploration, which occurred frequently across sessions. Such exploration is a cut-and-try approach [4]. Situating usage in the real world forced us to think about who our most likely audience was, what they were probably most interested in using our system for, and how best to reach them.

SEMANTIC INDEXING AND TECHNOLOGY RESEARCH

Improving Web searching beyond full-text retrieval requires using document structure in the short term and document semantics in the long term. Our technology research team focused on developing new infrastructure for our vision of the future Internet, termed the Interspace, where each community indexes its own repository of its own knowledge [5]. For community amateurs to provide classification comparable to today’s trained professionals, information infrastructure must provide substantial support for semantic indexing and retrieval.

The Interspace focuses on scalable technologies for semantic indexing that work generically across all subject domains [6]. We can automatically generate analogues of concepts and categories. We can use concept spaces (collections of abstract concepts generated from concrete objects) to boost searches by interactively suggesting alternative terms [1,7]. We can use category maps to boost navigation by interactively browsing clusters of related documents [8].

Scalable semantics

Scalable semantics is our term for the new technologies that can index the semantics of document contents on large collections. These algorithms rely on statistical techniques, which correlate the context of phrases within the documents. For example, concept spaces use text documents as the objects and noun phrases as the concepts. The concept spaces are then the co-occurrence frequencies between related terms within the documents of a collection.
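A toy version of a concept space built from co-occurrence can be sketched as follows. It assumes noun phrases have already been extracted from each document and uses raw document-level co-occurrence counts; the weighting actually used in the project's computations is not described in this passage, and the three sample documents are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_concept_space(documents):
    """For each phrase pair, count the documents in which both phrases occur."""
    space = defaultdict(Counter)
    for phrases in documents:
        for a, b in combinations(sorted(set(phrases)), 2):
            space[a][b] += 1
            space[b][a] += 1
    return space


def suggest(space, term, k=3):
    """Return up to k related terms, strongest co-occurrence first."""
    ranked = sorted(space.get(term, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [related for related, _ in ranked[:k]]


# Each document is represented by its (already extracted) noun phrases.
docs = [
    ["object database", "software configuration management"],
    ["object database", "software configuration management",
     "revision control system"],
    ["software configuration management", "revision control system"],
]
space = build_concept_space(docs)
suggestions = suggest(space, "object database")
```

Term suggestion then amounts to looking up a broad term and offering its most strongly co-occurring neighbors for the user to select.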

Over the past several years, using DLI materials, we have used the supercomputers at the National Center for Supercomputing Applications (NCSA) to compute concept spaces for progressively larger collections [9], until the scale of entire disciplines, such as engineering, has been reached. By partitioning a large existing collection into discipline subcollections, which are the equivalent of community repositories, we use supercomputers to simulate the future world of a billion repositories.

In 1995 we generated concept spaces for 400,000 abstracts from Inspec (deep coverage in physics, electrical engineering, and computer science), and in 1996 we generated concept spaces for 4 million abstracts from Compendex (broad coverage across all of engineering, some 38 subject disciplines). The first computation took one day of supercomputer time, and the second took 10 days of high-end time on the HP Convex Exemplar. The second computation provided a comprehensive simulation of community repositories for 1,000 collections across all of engineering, generated by partitioning the abstracts along the subject classification hierarchy [5].

Concept spaces and document search

The Interspace consists of multiple spaces at the category, concept, and object levels. Within the course of an interaction session, users will move across different spaces at different levels of abstraction and across different subject domains. For example, the system enables users to locate desired terms in the concept space by starting from broad terms, then traversing into narrow terms specific to that document collection. They can then move across into document space to perform full-text searches by dragging the concept term into the document space search window.

Figure 2 is a composite of a session with an experimental Interspace containing concept spaces for engineering literature [10]. The different windows sampled in this session illustrate abstract indexes for categories and concepts, plus indexes for documents within collections.

The upper-left window shows an integrated list of categories for the Inspec, Compendex, and Patterns collections. Inspec and Compendex are standard commercial bibliographic databases, and Patterns is a software engineering community repository. These categories can be selected to retrieve a concept space for a specified subject domain. The upper-middle window shows portions of a concept space for the Inspec category Software Engineering Techniques. The concept space allows the user to interactively refine a search by selecting from related concepts. The user specified the general term object database, and the system returned a list of specific terms, such as software configuration management.

The user wants to locate and search for an even more specific term related to the use of “object-oriented databases in software engineering.” The upper-right window shows a further navigation of the concept space, listing related terms of software configuration management, which was selected from the previous related list. The very specific term revision control system is located and used immediately by dragging it into the Full-Text Search window.

This window at the lower left has two panes showing the results of the query for revision control system. The left pane lists the articles from the Software Engineering Techniques collection mentioning the term, while the right pane displays the abstract of a selected article. Note that a specific article about software objects in configuration management has been found by navigating the concept space of terms starting from the broad object database, without the user ever being required to type a specific search term from memory.

Vocabulary switching across concept spaces

Finally, to search a subject domain they are unfamiliar with, users can begin within the concept space for a familiar subject domain, then choose another concept space for the unfamiliar domain and navigate across spaces based on common terms. This interactive vocabulary exploration is our approach to vocabulary switching, the classic information retrieval problem of different terms for the same concepts across different subject domains [11].
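Navigating between two concept spaces through a shared term can be sketched as below. The two small spaces and their co-occurrence weights are invented for illustration, and treating any term present in both spaces as a bridge is an assumption about how common terms might be found, not a description of the Interspace implementation.

```python
def common_terms(space_a, space_b):
    """Terms present in both concept spaces; these act as bridges."""
    return sorted(set(space_a) & set(space_b))


def related(space, term, k=3):
    """Top-k related terms for a term within one concept space."""
    ranked = sorted(space.get(term, {}).items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical concept spaces: term -> {related term: co-occurrence weight}.
software_engineering = {
    "complex object": {"configuration management": 5, "revision control system": 3},
    "configuration management": {"complex object": 5},
}
oo_programming = {
    "complex object": {"class hierarchy": 7, "inheritance": 4},
    "inheritance": {"complex object": 4},
}

bridges = common_terms(software_engineering, oo_programming)
familiar_view = related(software_engineering, bridges[0])
unfamiliar_view = related(oo_programming, bridges[0])
```

The same bridge term yields a different related-term list in each space, which is what lets a user carry a known term into an unfamiliar domain and pick up that domain's vocabulary.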

In Figure 2, for example, the user has navigated across concept spaces in the upper windows to locate a specific document on revision control in configuration management that discusses software objects. To further investigate the use of object-oriented techniques, the user selects complex object from the title of the article displayed in the lower-left window and drags this term back into the concept space window for Software Engineering. This action switches the level of abstraction from documents (objects) to terms (concepts).

Figure 2. An Interspace Navigator for engineering literature, showing concept spaces for community repositories, with upper windows illustrating term suggestion using concept spaces and lower windows illustrating document search leading the user to concept switching across subject domains.

The lower-right window displays the related terms for complex object within Software Engineering in the left pane. Scanning this term list, the user now wants even more detailed information about object-oriented techniques than seems to be available in this collection. So the user moves up another level of abstraction (from concepts to categories) and selects the category for Object-Oriented Programming from the Categories window.

A vocabulary switch is now performed between the domains of Software Engineering Techniques and Object-Oriented Programming. The user drags complex object from one domain pane to another; the lower-right window shows the result. The right pane (at the lower rightmost side) shows the related terms for complex object within Object-Oriented Programming. Note that this related term list is different from that for the same term in Software Engineering since the collections are different (so the commonly occurring terms will be different). This new term list can then be scanned to select specific terms related to software objects and configuration management to use in searching any of the available collections.

Such a fluid flow across levels and subjects supports semantic interoperability. This vocabulary switching by interactive navigation across concept spaces illustrates why the system is named the Interspace. Interspace navigation enables location of documents with specific concepts without previous knowledge of the terms within the documents. We are constructing and using a full-fledged Interspace prototype with semantic indexing and space navigation for community repositories in engineering and medicine [12].

CONCLUSIONS AND IMPLICATIONS

We believe that both the DLI Testbed and the research efforts of the UIUC DLI project were major successes.

The DLI Testbed efforts built a production systemwith federated search across structured documents.The articles arrive in a production stream directlyfrom major scientific publishers in full-text SGML andare fully federated at the DTD level with a Web inter-face. The DLI Testbed collection is currently thelargest federated repository of SGML articles fromscientific literature anywhere.

The DLI Testbed users represent an order of mag-nitude bigger population than the last-generationresearch system for search of scientific literature. TheDLI Testbed evaluation performed comprehensive

fine-grained methodologies such as user inter-views and large-scale methodologies such astransaction logs. Our results will shortly leadto commercial technologies for federating struc-tured documents across the Internet.

Our research efforts built an experimentalsystem with semantic indexes from documentcontent. Concept spaces are generated for termsuggestion and integrated with text search viaa multiple view interface. Vocabulary switch-ing is supported by interactive navigation acrossconcept spaces.

Our research computations are the largest ever in information science. They represent the first time that semantic indexes using generic technology have been generated on discipline-scale collections with millions of documents. They are the first large step toward scalable semantics, statistical indexes with domain-independent computations.

The Internet of the 21st century will radically transform how we interact with knowledge. Traditionally, online information has been dominated by data centers with large collections indexed by trained professionals. The rise of the World Wide Web and the information infrastructure of distributed personal computing have rapidly developed the technologies of collections for independent communities. In the future, online information will be dominated by small collections maintained and indexed by community members themselves.

The information infrastructure must similarly be radically different to support indexing of community collections and searching across such small collections. The base infrastructure will be knowledge networks rather than transmission networks. Users will consider themselves to be navigating in the Interspace, across logical spaces of semantic indexes, rather than in the Internet, across physical networks of computer servers.

Future knowledge networks will rely on scalable semantics, on automatically indexing the community collections so that users can effectively search within the Interspace of a billion repositories. The most important feature of the infrastructure is therefore support of semantic correlation across the indexed collections. Just as the transmission networks of the Internet are connected via switching machines that switch packets, the knowledge networks of the Interspace will be connected via switching machines that switch concepts. ❖

Acknowledgments

This work was supported by the NSF, DARPA, and NASA under Cooperative Agreement No. NSF-IRI-94-11318COOP. We thank the American Institute of Physics, the American Physical Society, the American Society of Civil Engineers, the IEEE Computer Society, and the Institution of Electrical Engineers for making their SGML materials available to us on an experimental basis. Engineering Index and IEE kindly provided Compendex and Inspec, respectively. Indexing was done on Hewlett-Packard servers, obtained through an educational grant program. Many people have contributed to the research discussed here. In particular, we thank Robert Wedgeworth, Kevin Powell, Ben Gross, Donal O'Connor, Robert Ferrer, Tom Habing, Hanwen Hsiao, Heidi Kellner, Emily Ignacio, Cecelia Merkel, Bob Sandusky, Eric Larson, S. Leigh Star, Pauline Cochrane, Andrea Houston, Melanie Loots, Larry Jackson, Mike Folk, Kevin Gamiel, Joseph Futrelle, Roy Campbell, Robert McGrath, Duncan Lawrie, and Leigh Estabrook.

References

1. B. Schatz et al., “Interactive Term Suggestion for Users of Digital Libraries: Using Subject Thesauri and Co-occurrence Lists for Information Retrieval,” Proc. First ACM Int’l Conf. Digital Libraries, ACM Press, New York, 1996, pp. 126-133.

2. A. Bishop and S. Star, “Social Informatics for Digital Library Use and Infrastructure,” Ann. Rev. Information Science and Technology, Vol. 31, 1996, pp. 301-401.

3. A. Bishop, “Digital Libraries and Knowledge Disaggregation: The Use of Journal Article Components,” Proc. Third ACM Int’l Conf. Digital Libraries, ACM Press, New York, 1998, pp. 29-39.

4. L. Neumann and E. Ignacio, “Trial and Error as a Learning Strategy in System Use,” Proc. 61st American Soc. Information Science Ann. Meeting, Information Today, Medford, N.J., 1998, pp. 243-252.

5. B. Schatz, “Information Retrieval in Digital Libraries: Bringing Search to the Net,” Science, Jan. 1997, pp. 327-334.

6. B. Schatz et al., “Federating Diverse Collections of Scientific Literature,” Computer, May 1996, pp. 28-36.

7. H. Chen et al., “Automatic Thesaurus Construction for an Electronic Community System,” J. American Soc. Information Science, Mar. 1995, pp. 175-193.

8. H. Chen et al., “Internet Browsing and Searching: User Evaluations of Category Map and Concept Space Techniques,” J. American Soc. Information Science, July 1998, pp. 582-603.

9. H. Chen et al., “A Parallel Computing Approach to Creating Engineering Concept Spaces for Semantic Retrieval: The Illinois Digital Library Project,” IEEE Trans. Pattern Analysis Machine Intelligence, Aug. 1996, pp. 771-782.

10. H. Chen et al., “Alleviating Search Uncertainty through Concept Associations: Automatic Indexing, Co-occurrence Analysis, and Parallel Computing,” J. American Soc. Information Science, Mar. 1998, pp. 206-216.

11. H. Chen et al., “A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System,” J. American Soc. Information Science, Jan. 1997, pp. 17-31.

12. B. Schatz et al., “The Interspace Prototype,” http://www.canis.uiuc.edu.

Bruce Schatz is director of the Community Architectures for Network Information Systems (CANIS) Laboratory at UIUC and professor in the Graduate School of Library and Information Science, with joint appointments in computer science, neuroscience, and health information sciences. He was the principal investigator of the Illinois Digital Library Project. He is senior research scientist at the National Center for Supercomputing Applications (NCSA). He served as the scientific advisor on information systems when NCSA developed Mosaic, which was inspired by his earlier network information systems and spawned the World Wide Web. His current research is building analysis environments to support community repositories (Interspace), and performing large-scale experiments in semantic retrieval for vocabulary switching. He received a BA in mathematical sciences from Rice University, an MS in artificial intelligence from MIT, an MS in computer science from Carnegie Mellon University, and a PhD in computer science from the University of Arizona.

William Mischo is director of the Grainger Engineering Library Information Center, the Engineering Librarian, and a professor of library administration at UIUC. His current interests include expanding the DLI project under the auspices of a CNRI grant to investigate the use of XML, cascading style sheets, and dynamic HTML in retrieving and rendering article full text. He received a BA in mathematics from Carthage College and an MA in library and information science from the University of Wisconsin.

Timothy Cole is system librarian for digital projects and associate professor of library administration at the UIUC library. His interests include information retrieval interfaces, processing of document structure, and the use of the Web in libraries. He received a BS in aeronautical and astronautical engineering and an MS in library and information science from UIUC.

Ann Bishop is assistant professor in the Graduate School of Library and Information Science at UIUC and principal investigator of a US Department of Commerce grant on the introduction of computers in low-income neighborhoods. Her primary research interest is social aspects of information system design, evaluation, and use. She received a BA in Russian literature from Cornell University and an MLS and a PhD in information transfer from Syracuse University.

Susan Harum was the external relations coordinator for the University of Illinois Digital Libraries Initiative at UIUC from 1994 to 1998; she continues to assist the program director for digital libraries at the NSF. Her interests include collaborative research tools for international digital libraries. She received a BS in anthropology from Michigan State University, an MA in eastern Asian languages, and an MS in library and information science from UIUC.

Eric Johnson is a PhD candidate in the UIUC Graduate School of Library and Information Science. His research interests include hypermedia systems and hypertext use in thesaurus navigation and bibliographic retrieval. He received a BS in computer science and science and technology studies from Michigan State University, an MS in computer science from Northwestern University, and an MA in sociology and an MS in library and information science from UIUC.

Laura Neumann is a PhD candidate in the UIUC Graduate School of Library and Information Science. Her research interests include work practice and social issues surrounding the digitization of information, the use of digital libraries and other information systems, and the automation of work tasks. She received a BA in sociology and anthropology from the University of Minnesota.

Hsinchun Chen is a full professor in the Department of Management Information Systems at the University of Arizona and director of the UA/MIS Artificial Intelligence Lab. His research interests include semantic retrieval, search algorithms, knowledge discovery, and collaborative computing. He received a BA from National Chiao-Tung University in Taiwan, an MBA from SUNY Buffalo, and an MS and a PhD in information systems from New York University.

Dorbin Ng is a PhD candidate in the Department of Management Information Systems at the University of Arizona. His research interests include multimedia information retrieval, knowledge discovery using supercomputers, and user interfaces for collaborative computing. He received a BS in business administration and an MS in management information systems from the University of Arizona.

For more information on the Illinois Digital Library Project, see http://dli.grainger.uiuc.edu or [email protected]. Readers can contact the authors at CANIS Laboratory, University of Illinois, 704 S. Sixth St., Champaign, IL 61820; www.canis.uiuc.edu or [email protected].
