
Original articles

The Stanford Digital Library metadata architecture*

Michelle Baldonado, Chen-Chuan K. Chang, Luis Gravano, Andreas Paepcke

Computer Science Department, Stanford University, Stanford, CA 94305-9040, USA, {michelle,kevin,gravano,paepcke}@db.stanford.edu

Received: 15 October 1996 / Accepted: 14 January 1997

Abstract. The overall goal of the Stanford Digital Library project is to provide an infrastructure that affords interoperability among heterogeneous, autonomous digital library services. These services include both search services and remotely usable information processing facilities. In this paper, we survey and categorize the metadata required for a diverse set of Stanford Digital Library services that we have built. We then propose an extensible metadata architecture that meets these requirements. Our metadata architecture fits into our established infrastructure and promotes interoperability among existing and de-facto metadata standards. Several pieces of this architecture are implemented; others are under construction. The architecture includes attribute model proxies, attribute model translation services, metadata information facilities for search services, and local metadata repositories. In presenting and discussing the pieces of the architecture, we show how they address our motivating requirements. Together, these components provide, exchange, and describe metadata for information objects and metadata for information services. We also consider how our architecture relates to prior, relevant work on these two types of metadata.

Key words: Metadata architecture – Interoperability – Attribute model – Metadata repository – Proxy architecture – Heterogeneity – Metadata survey

1 Introduction

As traditional libraries have evolved to meet the needs of their patrons, librarians have encountered and addressed a host of metadata-related issues. Today, sophisticated library cataloging principles and schemes help all of us in finding the information that we need in our local library. As work on digital libraries progresses, however, new metadata needs are arising. The increased ease with which a user can cross the boundaries from one "library" to another, the ability to organize online contents into complex structures, and the development of tools that let users transform those structures, all call for a rethinking of what metadata is and how it can be shared in digital libraries.

In the Stanford Digital Library project, we view long-term digital library systems as collections of widely distributed, autonomously maintained services. Of course, a digital library system must include services that allow users to search over collections of information objects. Examples of searchable collections include traditional library collections, digital images, e-mail archives, video, on-line books, and scientific article citation catalogs (containing only metadata about the articles, not the articles themselves). While searching services are valuable, they are not the only kind of service in the digital library of the future. Remotely usable information processing facilities are also important digital library services. These services provide support for activities such as document summarization, indexing, collaborative annotation, format conversion, bibliography maintenance, and copyright clearance.

Our project has focused on developing an infrastructure in which these disparate services can communicate and interoperate with one another. Specifically, the goal of our digital library testbed is to provide an infrastructure that affords interoperability among these heterogeneous, autonomous components, much like a hardware bus enables interaction among disparate hardware elements. Our assumption is that inventing a new standard is unlikely to solve the interoperability problem.

* This material is based upon work supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA, NASA, and the industrial partners of the Stanford Digital Libraries Project. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation or the other sponsors.

Int J Digit Libr (1997) 1: 108–121


Instead, we propose that InfoBus components use proxies to access and communicate with each other. Proxies (also called "wrappers") allow heterogeneous services to give the illusion that they respond to a standard set of methods. We call our proxy-based infrastructure the InfoBus [1].

This paper provides a framework for understanding the classes of metadata and range of metadata needs that are necessary for our InfoBus services. We outline and ground this framework in Sect. 2 by surveying our InfoBus services and analyzing the categories of metadata that they require. In particular, we give concrete examples of four very different InfoBus services. First, we show that our automated resource-discovery service needs metadata about what collections exist and what each collection contains. Second, we observe that our service for formulating queries appropriate for multiple sources relies upon metadata about how information objects are described within each source. Third, we explain why our service for translating queries requires protocol-related metadata about the selected search services. Finally, we analyze how our service for making sense of query results can utilize metadata about information objects and their underlying representations in order to compare them and to portray their surrounding context.

Our decision to design and implement a metadata architecture has grown out of this understanding of metadata needs. We have found that ad-hoc approaches to these metadata issues do not scale and cause problems for interoperability. Related work on metadata issues (discussed in Sect. 4) is relevant for specific metadata issues that we have encountered, but does not address the problem of integrating and sharing different types of metadata information in the ways that we require. Hence, we have designed our own metadata architecture. It rests on top of our established proxy-based framework and can thus be situated amidst existing and de-facto standards for metadata. Several pieces of this metadata architecture are implemented; others are under construction. We present the architecture in detail in Sect. 3 and show how its features map onto our concrete metadata requirements. Readers interested in the design rationale for this architecture should refer to a separate paper on this topic [2].

2 Our metadata requirements

In this section, we present the InfoBus services that have motivated and shaped our metadata architecture. The discussion of each service illustrates a number of concrete requirements for the architecture.

The services described in this section are interconnected in that they all address the problem of interacting with heterogeneous searchable collections. This problem is multi-faceted because it can involve a range of activities, including locating and selecting among relevant collections, retrieving information from these collections, interpreting the information retrieved from them, managing and organizing the retrieved information at a local level, and sharing this information with others [3, 4].

A simplified example scenario gives an overview of how metadata-related issues emerge. Imagine a user, Pat, who is interested in the topic of data mining. Pat's first task is to determine what searchable collections are relevant to this topic. Pat turns for help to a resource-discovery service (Sect. 2.1) that relies upon meta-information about each collection to answer the question. With a set of heterogeneous collections selected, Pat is ready to begin searching. The user interface service that Pat is using allows for queries to be entered in a rich Boolean front-end language. The query constructor (Sect. 2.2) provided by the user interface uses knowledge of the attributes supported by each search service in order to present Pat with a set of attributes that could be of value in the query. Once Pat has entered a query, it is up to a query translation service (Sect. 2.3) to translate this front-end query into its native equivalent for each selected collection. The query translator requires meta-information about the capabilities of each collection to accomplish this task. After the native queries have been issued, Pat receives back a set of results from the selected, heterogeneous collections. The result analysis component of the user interface then unifies these heterogeneous results and, under user guidance, constructs integrated overviews of them (Sect. 2.4). Pat can now make sense of these initial results and decide what to do next.

2.1 Automated resource discovery

End users have a large variety of searchable collections available to them. Accessing all these collections for each query is not practical. First, there may be too many such collections, with varying response times. Second, some collections charge for their access. Furthermore, presumably only a few collections contain useful items for a given query. Therefore, a crucial component of a Digital Library is a tool that assists users in discovering the useful resources for their queries.

Finding the best collections for a query is the goal of GlOSS [5, 6], a resource-discovery service within our Digital Library testbed. A user submits a query to GlOSS, and GlOSS returns a rank of the available collections. This rank is based on estimates of the expected number of hits for the query at each collection (Fig. 1). Then, the user submits the query to the top collections, as determined by GlOSS. This way, the user avoids accessing the majority of the available collections, focusing the search on the most promising ones. GlOSS uses content summaries of all target collections to compute its estimated ranks, which brings up the problem of scalability. GlOSS needs to be able to handle content summaries of a large number of searchable collections. To keep these summaries small, GlOSS only stores partial information.

Fig. 1. Users submit queries to GlOSS, and GlOSS ranks the available collections accordingly

The GlOSS content summary of a collection consists of the words that appear in the collection, together with the number of items where each word occurs, as the following example illustrates. An unusual characteristic of the GlOSS metadata is that it is aggregate information about the collection as a whole, rather than information about the individual items in the collection.

Example 1. Consider the following user query q:

Title Contains mining

Suppose that GlOSS has three collections available, c1, c2, and c3. To rank these collections for q, GlOSS knows how many items match q at each collection. For example, GlOSS knows that c1 has a total of 100 items that contain the word "mining" in their title, c2 has 10 such items, and c3 none. GlOSS will suggest c1 as the best source for the query (100 matching items), and c2 as the second best source (10 matching items). Source c3 will not appear in the GlOSS rank.

GlOSS does not keep information on how words co-occur in the items. Hence, it needs to make assumptions on the distribution of query words in the collections whenever a query contains more than one word. These assumptions may not hold in reality, leading to wrong result-size estimates. However, the GlOSS information tends to be orders of magnitude smaller than the corresponding collections, facilitating scalability. Furthermore, experimental results with real-user queries showed that GlOSS ranks searchable collections correctly most of the time [5].
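To make the ranking step concrete, the following is a minimal sketch (not the actual GlOSS implementation) of how a resource-discovery service might rank collections from per-collection word counts, assuming query words are distributed independently within each collection. The names ContentSummary and rank_collections, and the num_docs figures, are illustrative only.

    # Hypothetical sketch of GlOSS-style collection ranking.
    # A content summary stores, per collection, the number of items and the
    # document frequency of each (field, word) pair -- aggregate data only.
    from dataclasses import dataclass

    @dataclass
    class ContentSummary:
        name: str
        num_docs: int
        doc_freq: dict  # (field, word) -> number of items containing the word

    def estimate_hits(summary, query):
        """Estimate the number of items matching an AND query of (field, word)
        predicates, assuming the words occur independently of one another."""
        if summary.num_docs == 0:
            return 0.0
        estimate = float(summary.num_docs)
        for field, word in query:
            df = summary.doc_freq.get((field, word), 0)
            estimate *= df / summary.num_docs
        return estimate

    def rank_collections(summaries, query):
        """Return collections with a non-zero estimate, best first."""
        ranked = [(estimate_hits(s, query), s.name) for s in summaries]
        return [(name, est) for est, name in sorted(ranked, reverse=True) if est > 0]

    # Example 1 from the text: three collections, query "Title Contains mining".
    c1 = ContentSummary("c1", 1000, {("Title", "mining"): 100})
    c2 = ContentSummary("c2", 500, {("Title", "mining"): 10})
    c3 = ContentSummary("c3", 200, {})
    print(rank_collections([c1, c2, c3], [("Title", "mining")]))
    # -> [('c1', 100.0), ('c2', 10.0)]  (c3 does not appear in the rank)

For a single-word query the estimate is just the stored document frequency; for multi-word queries the independence assumption is exactly the approximation discussed above.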

To suggest collections for a query, GlOSS extracts content summaries from each collection. These content summaries, like in the example above, include the vocabulary of each collection, together with frequency counts that are associated with them. The current implementation of GlOSS (available at http://gloss.stanford.edu/) gets the content summaries in ad hoc ways. For example, GlOSS extracts all the bibliographic entries from the CS-TR databases and processes these entries locally to produce the right content summaries. (CS-TR is an emerging digital library of computer science technical reports. See http://www.ncstrl.org/) To get the summary of a collection of HTML documents, GlOSS could fetch all the documents in the collection by following links, and then extract the necessary information locally. However, this approach is expensive in terms of the number of accesses to the collection and the amount of information retrieved. Also, it would not work for collections of non-HTML documents that do not export their entire contents. Therefore, GlOSS needs the collections to collaborate and export their content summaries upon request. In summary, GlOSS requires the following of our metadata architecture:

– We need a way to learn about the available collections.

– We need a way to extract the content summary associated with each collection.

2.2 Query formulation

Different searchable collections typically export different attributes for searching. In query formulation, an important role of the user interface is thus to make the user aware of what attributes are available for searching.

DLITE is a user interface service that has been developed for our Digital Library testbed [7, 8, 9]. It allows librarians or end users to specify workcenters for the various information-related tasks in which they frequently engage. A DLITE workcenter gathers together components and tools that a user might need to complete a task. The current DLITE workcenter for searching includes a query constructor that relies upon a minimal, canonical attribute model. This attribute model is unstructured and contains just the canonical attributes Author, Title, and Subject. By "unstructured," we mean that the attribute model is a flat name space. We will introduce more complex attribute models in Sect. 2.4. Figure 2 shows how the DLITE query constructor appears to the user. The user fills in as many of the fields as are desired, then hits the "Create Query" button. An object representing the query then appears in the circular area below the constructor. The user can take this query object and drop it on any of the represented search services (Dialog_275 and AltaVista in this example) for evaluation.

Fig. 2. The DLITE query constructor is shown here using a static set of canonical attributes. With a richer metadata model, it could be extended to use a dynamically determined set

We can allow users to formulate queries that refer to these canonical attributes because we have tables that map canonical attributes (often approximately) onto their native collection attribute equivalents. However, this current approach does not scale well since it requires one table entry for each possible equivalence mapping. A more general approach would introduce independent attribute translators. With independent attribute translators in place, the query translation service could first find out what attributes are supported natively by a search service. If the necessary attributes are not supported, then either the query translation service or the search service itself could try to locate an attribute translator to do the mapping (attribute translators are likely to be built for a wide variety of mappings, but are not guaranteed to exist for all possible mappings). Additionally, the translation process could be decomposed into several intermediate translation steps, each of which could be carried out by a separate translator. Either the query translation service or an individual translator could take the initiative to compute this translation decomposition. This strategy reduces the number of equivalence mappings that must be recorded.
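As a rough illustration of the translator-decomposition idea, the sketch below chains independent attribute translators when no direct mapping is recorded. It is hypothetical: the table of translators, the model names, and the native field names are assumptions for illustration, not part of the InfoBus interfaces.

    # Hypothetical sketch: composing attribute translators instead of keeping
    # one table entry per (canonical attribute, native attribute) pair.
    from collections import deque

    # Each translator maps attribute names from one model to another.
    # Here a translator is just (from_model, to_model, {from_attr: to_attr}).
    TRANSLATORS = [
        ("Canonical", "DublinCore", {"Author": "Creator", "Title": "Title"}),
        ("DublinCore", "Folio-INSPEC", {"Creator": "AU", "Title": "TI"}),
    ]

    def translate_attr(attr, from_model, to_model):
        """Breadth-first search for a chain of translators from from_model to
        to_model, applying each name mapping along the way."""
        queue = deque([(from_model, attr)])
        seen = {from_model}
        while queue:
            model, name = queue.popleft()
            if model == to_model:
                return name
            for src, dst, mapping in TRANSLATORS:
                if src == model and name in mapping and dst not in seen:
                    seen.add(dst)
                    queue.append((dst, mapping[name]))
        return None  # no translator chain exists for this attribute

    print(translate_attr("Author", "Canonical", "Folio-INSPEC"))  # -> 'AU'

The same search could be run by the query translation service or by an individual translator, which is the decomposition choice discussed above.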

These query formulation needs correspond directly torequirements on our metadata architecture.

– We need to include attribute models as first-class objects so that we can communicate with them and perform necessary transformations.

– We need facilities for finding out efficiently what attribute models and individual attributes are natively supported by each search service.

– We need attribute model translators that are capable of translating from canonical attributes and canonical attribute values into native attributes and native attribute values.

By designing a metadata architecture that takes into account these requirements, we also have the building blocks for designing more sophisticated query constructors. It is unlikely that any one canonical attribute model would be sufficient for all levels of users, tasks, and collections. A metadata architecture that treats attribute models as first-class objects can easily accommodate a pool of several canonical attribute models. A DLITE search workcenter could use meta-information about each of these canonical attribute models to determine when a particular model is appropriate. Furthermore, it could tailor the user's view of the chosen canonical model by using information about the correspondences between each search service's native attributes and the canonical model's attributes. For example, if we think of the native attributes for each search service as corresponding to a set of canonical attributes, then we can imagine a query constructor that presents the user with the intersection of these canonical attribute sets. The automated query translation module could then use the attribute model translation services to prepare the user queries for delivery to the target collections.
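The intersection idea can be stated compactly. The sketch below is illustrative only; it assumes each selected service exposes a map from its native attributes to canonical attributes, which in practice would come from the attribute model translators.

    # Hypothetical sketch: offer the user only the canonical attributes that
    # every selected search service can support through its native attributes.

    def searchable_canonical_attrs(selected_services):
        """selected_services maps a service name to a dict of
        native attribute -> canonical attribute."""
        canonical_sets = [
            set(native_to_canonical.values())
            for native_to_canonical in selected_services.values()
        ]
        return set.intersection(*canonical_sets) if canonical_sets else set()

    services = {
        "Dialog_275": {"AU": "Author", "TI": "Title", "DE": "Subject"},
        "AltaVista":  {"title": "Title", "url": "URL"},
    }
    print(searchable_canonical_attrs(services))  # -> {'Title'}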

2.3 Automated query translation

There is a wealth of search engines behind the collections in digital libraries, each with a different document model and query language. There have been various approaches to address this problem. The most typical way is to provide a front-end that supports the least common denominator of the underlying services to hide the heterogeneity [10, 11]. In contrast, our approach is to allow a user to compose Boolean queries in one rich front-end language [12, 13]. For each user query and target source, we transform the user query into a subsuming query that can be supported by the source but that may return extra documents. The results are then processed by a filter query to yield the correct final result. In summary, given a user query, the query translator generates a native query that returns a minimal number of extra documents and also a filter query to apply in post-processing. The following example illustrates our approach.

Example 2. Suppose that a user is interested in documents discussing data mining and data warehousing. Say the user's query is originally formulated as follows:

Title Contains warehousing AND data (W) mining

This query selects documents with the three given words in the Title field; furthermore, the W proximity operator specifies that the word "data" must immediately precede the word "mining."

Now assume the user wishes to query the INSPEC database offered by the Stanford University Folio system. First, the translator must map the front-end Title field onto the corresponding INSPEC field (as described in Sect. 2.2). Next, it must map the query operators. Unfortunately, this source does not understand the W operator. In this case, our approach will be to approximate the predicate "data (W) mining" by the closest predicate supported by Folio, "data AND mining." This predicate requires that the two words appear in matching documents, but in any position. Thus, the native query that is sent to Folio-INSPEC is:

Find Title warehousing AND data AND mining

Notice that now this query is expressed in the syntax understood by Folio. The native query will return a preliminary result set that is a super-set of what the user expects. Therefore, an additional post-filtering step is required at the front-end to eliminate all documents that do not have the words "data" and "mining" occurring next to each other. For this example, the required filter query is:

Title Contains data (W) mining

As illustrated in the above example, the major component of query translation is query capability mapping, in which user queries are rewritten to be expressible with respect to the target services' capabilities. The query capability mapping process is therefore driven by the metadata that describe the query capabilities of the search services. In Example 2, the query translator needs metadata related to service functionality to answer the following questions: Is Title a searchable attribute? Can one search Title with the Contains operator? (A service may only support the Equals operator for some attributes.) Is the proximity operator W supported? If not, are there any semantically related operators to substitute? For instance, DEC's AltaVista supports the operator NEAR, which restricts search terms to be no more than 10 words apart, and is therefore a better substitute than AND. Moreover, the query translator also needs to check if the query words (i.e., warehousing, data, and mining) are stopwords: common words that are not indexed in the target system. If this is the case, the query may get no hits at all, and the user may be advised of this fact.
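A minimal sketch of the subsuming-query idea from Example 2 follows. It is illustrative only: the capability description and the function name are assumptions, not the actual query translator interface.

    # Hypothetical sketch: rewrite an unsupported proximity predicate into a
    # subsuming native query plus a filter query for post-processing.

    def translate_proximity(field, first_word, second_word, capabilities):
        """If the target service supports the (W) adjacency operator, use it
        natively; otherwise approximate with AND and return a filter query."""
        if "W" in capabilities.get("proximity_ops", set()):
            native = f"{field} Contains {first_word} (W) {second_word}"
            filter_query = None  # no post-filtering needed
        else:
            native = f"{field} Contains {first_word} AND {second_word}"
            filter_query = f"{field} Contains {first_word} (W) {second_word}"
        return native, filter_query

    # Folio-INSPEC (per Example 2) does not support the W operator.
    folio_caps = {"proximity_ops": set()}
    print(translate_proximity("Title", "data", "mining", folio_caps))
    # -> ('Title Contains data AND mining', 'Title Contains data (W) mining')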

In general, our algorithms for query capability mapping require metadata about the underlying search engines to perform a complete translation. Currently, we encode the required metadata (capability and schema definition) within the query translator. Consequently, the current implementation may not scale well. In addition, the query translator's metadata knowledge may not be up-to-date if the underlying services change. Therefore, to address these problems, it would be ideal to have services maintain and provide their own metadata in our architecture. In summary, the query translator needs the following metadata from search services:

– We need the schema definition: the set of searchable and/or retrievable attributes and the supported search methods (i.e., legal combinations of operators) for searchable attributes.

– We need to know the supported operators in addition to Boolean operators (e.g., the type of proximity operators supported).

– We need to be aware of the stopwords used.

– We need to know the vocabulary of the collection, i.e., the set of words indexed by the service. This is used, for instance, to enumerate words that match the stem of a particular word when stemming [14, 15] must be emulated because it is not a supported capability.

– We need to know the details of other features, e.g., for truncation, the supported truncation patterns.

2.4 Result analysis

The introduction of canonical attribute models into the metadata architecture is valuable not only for query formulation, but also for result analysis. Viewing heterogeneous results through the lens of a canonical attribute model allows users to obtain a unified overview of those results.

Making result analysis easier for the user is one goal of SenseMaker, another user interface service developed for our Digital Library testbed [9, 16]. SenseMaker allows users to experiment iteratively with different views of their results. Within a view, complexity is reduced in two ways. Similar results may be bundled together, and identical results may be merged together. Since many bundling and identity criteria are possible, users can try out different combinations in order to gain multiple perspectives on the set of results. Users also determine what attribute values should be displayed in each view.

Figure 3 shows a partial tabular view of results obtained by sending the keyword query java development to two technical article citation collections and two World Wide Web collections. In this view, the displayed canonical attributes are Shared Site, Title, and Abstract. The bundling criterion specifies that results with URL values referring to the same Internet site should be bundled together (results with undefined URL values are put into their own bundle). Other possible bundling criteria include bundling together results with the same Author value, bundling together results with the same or similar Title values, and bundling together results from the same collection, among other criteria. For the view portrayed here, the identity criterion specifies that results with identical Title values should be treated as duplicates. Other possible identity criteria include merging results with identical URL values, merging results with identical URL values and identical Title values, and so on.

Fig. 3. SenseMaker unified result overview

In SenseMaker, the user's choices of display attributes, bundling criterion, and identity criterion all depend upon the characteristics of the chosen collections. If a user chooses a World Wide Web collection (e.g., AltaVista), then the list of possible display attributes would include URL, the list of bundling criteria would include bundling by Internet site, and the list of identity criteria would include identity by comparing URL values. If the user does not choose a World Wide Web collection, these choices will not be presented. SenseMaker is able to tailor these choices by checking with the individual collections to find out what attributes they support natively, determining how to map each of these native attributes onto the chosen canonical attribute model, and then using knowledge about what attributes are present to pare down the chosen canonical attribute model to just those attributes that are relevant. Note that the current mapping from native attributes onto the chosen attribute model is performed entirely within SenseMaker.
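The following sketch shows, under stated assumptions, the shape of the merge-and-bundle step described above. It is not SenseMaker code: the dict representation of results and the two criterion functions are illustrative only.

    # Hypothetical sketch: merging duplicates and bundling results by a
    # user-chosen criterion, in the spirit of SenseMaker's views.
    from urllib.parse import urlparse
    from collections import defaultdict

    def unify(results, identity_key, bundle_key):
        """results: list of dicts of canonical attribute values.
        identity_key / bundle_key: functions from a result to a comparable key."""
        # Merge results that the identity criterion considers the same item.
        unique = {}
        for r in results:
            unique.setdefault(identity_key(r), r)
        # Bundle the remaining results by the bundling criterion.
        bundles = defaultdict(list)
        for r in unique.values():
            bundles[bundle_key(r)].append(r)
        return dict(bundles)

    # Identity by Title; bundling by Internet site of the URL attribute
    # (results without a URL value fall into their own 'undefined' bundle).
    by_title = lambda r: r.get("Title")
    by_site = lambda r: urlparse(r["URL"]).netloc if r.get("URL") else "undefined"

    results = [
        {"Title": "Java development tools", "URL": "http://www.example.com/a"},
        {"Title": "Java development tools", "URL": "http://www.example.com/b"},
        {"Title": "JDK notes"},  # a citation record with no URL value
    ]
    print(unify(results, by_title, by_site))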

SenseMaker further tailors the presented choices by utilizing a structured canonical attribute model. A structured canonical attribute model makes relations among attributes explicit. The SenseMaker structured canonical attribute model encodes generalization and composition relationships. For example, the model records that a Reporter is a kind of Creator, and that a Publication Date is made up of a Publication Day, Publication Month, and Publication Year. This relationship information allows SenseMaker to present the user with choices that are at the right level of granularity for the given set of results. If a user chooses only newspaper databases, then bundling by Reporter is offered as a possibility. If a user chooses both a newspaper database and a book database, then bundling by Creator subsumes bundling by Reporter – thus allowing books and newspaper articles by the same person to be bundled together.

Our experience with SenseMaker confirms the need for the requirements discussed in the section on query formulation (Sect. 2.2). For the SenseMaker approach to scale, we need to include attribute models and attribute translators in our architecture, and we need to develop protocols for communicating with these entities. Furthermore, the SenseMaker design adds two additional requirements:

– We need to include structured attribute models as first-class objects and to establish methods for extracting the structure information from these models.

– We need attribute model translators that are capable of translating from native attributes and native attribute values into canonical attributes and canonical attribute values.

3 The InfoBus metadata architecture

Motivated by the concrete metadata needs described in Sect. 2, we have designed an integrated metadata architecture. In this section, we begin by giving an overview of how our InfoBus works. Then we describe our metadata architecture and show how it fits into the InfoBus.

3.1 InfoBus overview

Details of the Stanford InfoBus design are described in [1]. We give here only enough detail to provide context for our metadata architecture.

Our InfoBus consists of distributed objects that communicate with each other through remote method calls. In particular, we use the CORBA specifications [17], with Xerox PARC's ILU as the object system implementation [18].

Existing external services with different interfaces are made accessible by service proxies. These are objects that provide a standard set of methods on "one side," but can also communicate with the services they represent. For example, a proxy that represents Knight-Ridder's Dialog Information Service responds to the same methods as proxies that represent World Wide Web services (such as AltaVista or Lycos). Proxies shield InfoBus clients from the differences in access medium and protocol.

Figure 4 illustrates the InfoBus infrastructure. It shows some of the different services that co-exist on the InfoBus, including InfoBus-specific services, user interface services, information processing services, and information sources. Examples of InfoBus-specific services include GlOSS, the resource discovery service discussed in Sect. 2.1, and the query translation service discussed in Sect. 2.3. User interface services that we have developed include DLITE (Sect. 2.2) and SenseMaker (Sect. 2.4). The InfoBus proxy strategy allows us also to include external interfaces (such as a Z39.50 interface) on the InfoBus. Likewise, the InfoBus can use proxies to accommodate information processing services (such as a document summarization service) and personal information sources (such as an e-mail archive). This paper focuses primarily on the metadata needs of our InfoBus-specific services and user interface services. However, we have deliberately made our architecture extensible so that we can address the metadata needs of other non-search-related services at a later point.

Fig. 4. The InfoBus architecture

Whenever possible, we try to present InfoBus components as collections of objects called Item. These are objects with various standard methods and arbitrary numbers of property-name/property-value pairs that may represent any real world object, ranging from a piece of text to a citation to a person. The property values are individually retrievable.

Search service proxies present themselves as instances of the class ConstrainableCollection. These represent sets of Items and respond to the method ConstrainCollection( ), which returns a subset of the contained Items. For example, given a Knight-Ridder Dialog proxy DP and an AltaVista proxy AP, both calls DP.ConstrainCollection(query) and AP.ConstrainCollection(query) return collections of Items that represent results.

Search service proxies may provide access to nested levels of subcollections that may be searched individually or together. Figure 5 shows an example.

Fig. 5. The Dialog collection has several subcollections (e.g., News and Business) that may be searched individually or together

In general, the structure of a proxy reflects the structure of the external service to which it provides access. When submitting a query to a ConstrainableCollection, any desired target subcollections are specified as well.
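The real InfoBus objects are CORBA/ILU objects; the Python rendering below is only meant to show the interface shape a client sees (Items, a ConstrainableCollection, and subcollection targeting). Everything beyond the class and method names named in the text is an assumption for illustration.

    # Hypothetical sketch of the proxy interface shape described above.

    class Item:
        """A bag of property-name/property-value pairs."""
        def __init__(self, **properties):
            self._properties = dict(properties)

        def get_property(self, name):
            return self._properties.get(name)

    class ConstrainableCollection:
        def __init__(self, name, items=(), subcollections=()):
            self.name = name
            self._items = list(items)
            self.subcollections = {c.name: c for c in subcollections}

        def ConstrainCollection(self, query, subcollection_names=None):
            """Return the subset of contained Items matching the query.
            `query` is a predicate here; a real proxy would translate a query
            expression into the external service's native form."""
            targets = [self] if not subcollection_names else [
                self.subcollections[n] for n in subcollection_names]
            results = []
            for target in targets:
                results.extend(item for item in target._items if query(item))
            return results

    # A Dialog-like proxy with News and Business subcollections (cf. Fig. 5).
    news = ConstrainableCollection("News", [Item(Title="Data mining boom")])
    business = ConstrainableCollection("Business", [Item(Title="Warehousing costs")])
    dialog = ConstrainableCollection("Dialog", subcollections=[news, business])
    hits = dialog.ConstrainCollection(
        lambda item: "mining" in (item.get_property("Title") or ""),
        subcollection_names=["News", "Business"])
    print([h.get_property("Title") for h in hits])  # -> ['Data mining boom']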

3.2 InfoBus metadata facilities

Figure 6 presents our metadata architecture and shows how it fits into the InfoBus infrastructure. New components include attribute model proxies, attribute model translation services, a metadata information facility for each search service proxy, and a metadata repository.

Fig. 6. Our metadata architecture

An attribute model proxy represents a real world attribute model, just as a search service proxy represents a real world search service. For example, we might have one attribute model proxy for the USMARC set of bibliographic attributes (referred to as "fields" in the USMARC community) [19], another for the Dublin Core set of attributes [20], and so on. An attribute model proxy allows us to encapsulate information that is specific to an attribute model and independent of a search service proxy.

An attribute model translation service serves to mediate among the different metadata conventions that are represented by the attribute model proxies. These translation services, available via remote method calls, know how to translate (often approximately) attributes (and their values) from one attribute model into attributes from a second attribute model.

The metadata information facility that we attach to each search service proxy is responsible for exporting metadata about the proxy as a whole, as well as for exporting metadata about the collections to which it provides access. Collection metadata includes descriptions of the collection, declarations as to what attribute models are supported, information about the collection's query facilities, and the statistical information necessary for meta-searchers to predict the collection's relevance for a particular query.

Finally, the metadata repository is a local database that caches information from the other metadata facilities in order to produce a one-stop-shopping location for locally valuable metadata. We allow for the metadata repository to pull metadata from the various facilities, as well as for the facilities to push their metadata to the repository directly.

In the following sections, we describe each of these facilities in more detail. Several are implemented in our current prototype testbed; others are still under construction. Overall, our goal in implementing this architecture is to allow our InfoBus services to progress from ad-hoc metadata interactions to well-specified, general, and scalable metadata interactions.

3.2.1 Attribute model proxies

Attribute model proxies are our way of making attribute models first-class objects in our computational environment. They allow us to store and search over attribute-specific information. On a per-attribute basis, attribute model proxies maintain attribute documentation and declare attribute value types. Furthermore, attribute model proxies record what relationships hold among the included attributes. This latter feature allows us to represent attribute models that are more complex than flat name spaces.

In the InfoBus ontology, an attribute model proxy is a ConstrainableCollection that contains AttributeItem instances. The fact that the attribute model proxy is a ConstrainableCollection means that it is accessible via the same interface as all other search service proxies. In other words, the attribute model proxy has a ConstrainCollection( ) method that responds to a query by returning the appropriate subset of the included AttributeItems.

Each of the AttributeItems that make up an attribute model proxy contains information that is relevant for a particular attribute, independent of the capabilities possessed by any particular search service proxy. Specifically, an AttributeItem's properties include the following:

– attrModelName: String
– attrName: String
– attrValueType: String
– attrDocumentation: String

The attrModelName and attrName are both strings that serve to identify the AttributeItem uniquely. The attrModelName is repeated in all items to make them self-contained. This is important when the items are passed around the system to components other than the attribute model proxy. For example, meta-information recorded for a particular book might include values for attributes from many attribute models, including USMARC [19], Dublin Core [20], Z39.50 Bib1 [21], and one of the Stanford structured attribute models.

The attrValueType dictates the data type for the AttributeItem's values. We use the interface specification language that is part of our CORBA implementation to specify these types. It is up to each search service proxy to ensure that the values it returns conform to these type specifications. If the external service that the proxy represents natively returns a different type, then the proxy is expected to transform the value into the specified type before returning it.

For example, let us assume a simple, flat attribute model that includes Title, Individual Author, and Corporate Author. A proxy for this model is a ConstrainableCollection containing the following AttributeItem instances:

SimpleModel: {
  inst1: {
    attrModelName -> 'SimpleModel'
    attrName -> 'Title'
    attrValueType -> 'String'
    attrDocumentation -> 'This is a title for an entire volume of work.'
  }
  inst2: {
    attrModelName -> 'SimpleModel'
    attrName -> 'Individual Author'
    attrValueType -> 'SEQUENCE of String'
    attrDocumentation -> 'First-name/last-name...'
  }
  inst3: {
    attrModelName -> 'SimpleModel'
    attrName -> 'Corporate Author'
    attrValueType -> 'Record{String: companyName, Integer: sicCode}'
    attrDocumentation -> 'Name as stated in corporation records...'
  }
}

Since SimpleModel is a ConstrainableCollection, any InfoBus component is free to search over instances of it. If SM is an instance of SimpleModel, then executing the statement authorAttrs = SM.Constrain("attrName Contains Author") will result in setting authorAttrs to a list of inst2 and inst3. Finding out details about the two kinds of Author attributes is then as simple as examining the properties of inst2 and inst3.

In this example, SimpleModel is a flat attribute model: all of the attributes are independent of each other. However, we saw in Sect. 2.4 that some user interface services require structured attribute models. Figure 7 depicts a structured attribute model that encodes is-a relationships among its attributes. In our architecture, the attribute model proxy for this attribute model would be a ConstrainableCollection containing AttributeItems subclassed to include is-a( ) methods. (Other relationships between the attributes in an attribute model, like has-a, can be treated analogously.)

Fig. 7. A complex attribute model with the is-a relationship among its attributes

A search service proxy that supports this attribute model could use its relationship information when processing queries. For example, a search service proxy might determine that items match the query Creator contains Ullman if they contain "Ullman" in the value of the Creator attribute or if they contain "Ullman" in the value of descendant attributes (i.e., in the value of Reporter or Author).
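A minimal sketch of this query expansion over an is-a hierarchy follows; the dictionary encoding of the hierarchy and the item representation are assumptions for illustration, not the proxy's actual data structures.

    # Hypothetical sketch: expanding a query over an is-a hierarchy so that a
    # predicate on Creator also matches its descendant attributes.

    IS_A = {               # child attribute -> parent attribute (cf. Fig. 7)
        "Author": "Creator",
        "Reporter": "Creator",
    }

    def descendants(attr):
        """All attributes that are (transitively) a kind of `attr`, plus attr."""
        result = {attr}
        changed = True
        while changed:
            changed = False
            for child, parent in IS_A.items():
                if parent in result and child not in result:
                    result.add(child)
                    changed = True
        return result

    def matches(item, attr, word):
        """True if `word` occurs in `attr` or in any descendant attribute."""
        return any(word in item.get(a, "") for a in descendants(attr))

    item = {"Title": "Principles of Database Systems", "Author": "Ullman"}
    print(matches(item, "Creator", "Ullman"))  # -> True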

Thus, we see that attribute model proxies play an important role in our metadata architecture. Attribute model proxies allow InfoBus components to search over attribute models, to obtain documentation and value type information for specific attributes within an attribute model, and to obtain information about the relationships that hold among an attribute model's attributes. This is the functionality that SenseMaker needs (Sect. 2.4) in order to determine display attributes, bundling criteria, and identity criteria that are at the appropriate level of granularity. The attribute model proxies are also key to building the translation services that we describe next.

3.2.2 Attribute model translation services

In heterogeneous environments, many different attribute models co-exist. Therefore, we need to have strategies for resolving the inevitable mismatches that will arise when InfoBus components that support different attribute models attempt to communicate with each other. For example, consider a bibliographic database proxy and a client of that proxy. The bibliographic database proxy might support only the Dublin Core attribute model, while the client might support only the USMARC bibliographic data attribute model. In order for this client and this proxy to communicate with each other, they must be able to translate from USMARC attributes to Dublin Core attributes and vice versa. In other words, they require intermediate attribute model translation services. Of course, in some cases, translation may not even be possible at all: consider translating from an attribute model designed for chemistry databases into an attribute model designed for ancient Greek texts.

Even when translation is appropriate, the translation from one attribute model to another is often difficult and lossy. For example, the Dublin Core describes authorship using the single attribute Author. However, USMARC distinguishes among several different types of authors, including Individual Author (recorded in the 100 attribute) vs. Corporate Author (recorded in the 110 attribute). When translating an AttributeItem from Dublin Core to USMARC, a decision must be made whether to translate the Dublin Core Author attribute value into a USMARC 100 attribute value or a USMARC 110 attribute value. This may be hard-coded, or the translation may be performed heuristically and may take into account the other attribute values of the Item being translated.

Translation services do more than map source attributes onto target attributes. They must also convert each attribute value from the data type specified for the source attribute into the data type specified for the target attribute. This conversion can be quite complex. For example, one attribute model might call for authors to be represented as lists of records, where each record contains fields for first name, last name, and author address. Another model might call for just a comma-separated string of authors in last-name plus initials format. When translating among these values, some information may again be lost if, for example, the address is simply discarded.

Attribute model translation services are thus an integral component in our metadata architecture. They may be accessed by a variety of other InfoBus components, including search service proxies. For example, a search service proxy might choose to use attribute model translators to be attractive to more clients, because it can then advertise that it deals in multiple attribute models. On the other hand, clients might use attribute model translators in order to ensure their ability to communicate with a wide variety of search services. In addition to query translators (Sect. 2.3), the attribute model translation services are needed during query formulation (Sect. 2.2) to map the canonical attributes in the user interface to the actual attributes supported by the collections. Also, SenseMaker (Sect. 2.4) uses the attribute model translators to convert the items returned from the collections into items that use the canonical attribute model.

The methods on standard model translation servicesare:

– getNameMapping(toAttrModel, fromAttrModel, fromAttrName): SEQUENCE of String
  /* Given a fromAttrModel attribute name, returns the corresponding toAttrModel attribute names */

– getValueMapping(toAttrModel, fromAttrModel, toAttrName, fromAttrName, fromAttrValue): Any
  /* Given a fromAttrModel attribute name and value, plus the toAttrModel attribute to which these should be mapped, returns the corresponding toAttrModel value */

– getToAttrModels( ): SEQUENCE of String
  /* Returns the list of toAttrModels that this translator accepts */

– getFromAttrModels( ): SEQUENCE of String
  /* Returns the list of fromAttrModels that this translator accepts */

– getDocumentation( ): String
  /* Returns human-readable documentation for this translator */

For heuristic translators, we can subclass and add an additional parameter called item to the first two methods. These heuristic translators optionally allow passing an item pointer, so that the translator can use knowledge about all of the item's attribute values to drive its translation algorithm. Our current implementation includes translators that perform non-heuristic mappings between two attribute models, and translations to and from a structured, hierarchical model. The implementation includes heuristic translators from USMARC to BibTeX and from Refer to BibTeX models.
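As a rough illustration of how a heuristic translator might realize the interface above, consider the Dublin Core Author example. This sketch is hypothetical: the method names follow the list above, but the class, the heuristic itself, and the details of the models are assumptions, not the implemented translators.

    # Hypothetical sketch of a heuristic Dublin Core -> USMARC translator
    # implementing the interface described above.

    class DublinCoreToUSMARCTranslator:
        def getFromAttrModels(self):
            return ["DublinCore"]

        def getToAttrModels(self):
            return ["USMARC"]

        def getNameMapping(self, toAttrModel, fromAttrModel, fromAttrName, item=None):
            if fromAttrName != "Author":
                return []
            # Heuristic: corporate-sounding names go to the corporate-author
            # attribute, everything else to the personal-author attribute.
            value = (item or {}).get("Author", "")
            corporate = any(w in value for w in ("Inc.", "Corp.", "University"))
            return ["110"] if corporate else ["100"]

        def getValueMapping(self, toAttrModel, fromAttrModel, toAttrName,
                            fromAttrName, fromAttrValue, item=None):
            return fromAttrValue  # a real translator would also convert the type

        def getDocumentation(self):
            return "Maps Dublin Core Author onto USMARC 100/110 heuristically."

    t = DublinCoreToUSMARCTranslator()
    print(t.getNameMapping("USMARC", "DublinCore", "Author",
                           item={"Author": "Stanford University"}))  # -> ['110']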

3.2.3 Search service proxy metadata information facilities

Each service proxy exports metadata information about itself and about the collection to which it provides access. Initially, clients can use this information to make judgments about how well the search service matches their needs. Later, clients can use the information to determine how best to access the collection maintained by the search service (i.e., what capabilities the search service supports).

We have decided to make the interface for accessing the metadata facility of search service proxies very simple in order to encourage proxy writers to provide this information. Search service proxy metadata is accessed via the call getMetadata(subCollectionName). For each subcollection supported by the proxy, this call returns two metadata objects. Alternatively, each proxy may opt to "push" these metadata objects to its clients. The first metadata object (Table 1) contains the general service information, and it is based heavily on the source metadata objects defined by STARTS [22]. The general service information includes human-readable information about the (sub)collection, as well as information that is used by our query translation facility. Examples for the latter are the type of truncation that is applied to query terms, and the list of stopwords. Our current translation engine is driven by local tables containing this kind of information about target sources.

The collectionName attribute is the name of the (sub)collection that the metadata object describes. In case of a subcollection, the parentCollectionName attribute is the name of the immediately enclosing collection. Finally, if the (sub)collection described is in turn a ConstrainableCollection with subcollections, the subCollectionNames attribute lists its subcollections. For an example, consider the Dialog collection as depicted in Fig. 5. The metadata object for the Dialog collection lists "Dialog" as the collectionName, no parentCollectionName, and "News" and "Business" as the subCollectionNames. The metadata object for the Business subcollection has "Business" as its collectionName, "Dialog" as its parentCollectionName, and no subCollectionNames.

Another interesting attribute of our first metadata object is contentSummaryLinkage (see Table 1). The value for this attribute is a URL that points to a content summary of the collection. Content summaries are potentially large, hence our decision to make them retrievable using ftp, for example, instead of using our protocol. The content summary follows the STARTS content summaries, and consists of the information that a resource-discovery service like GlOSS needs (Sect. 2.1). Content summaries are formatted as Harvest SOIFs (http://harvest.transarc.com/afs/transarc.com/public/trg/Harvest/user-manual/).

Example 3. Consider the following content summary for a collection:

@SContentSummary{
  Version{10}: STARTS 1.0
  Stemming{1}: F
  StopWords{1}: F
  CaseSensitive{1}: F
  Fields{1}: T
  NumDocs{3}: 892
  Field{5}: Title
  DocFreq{11023}: "algorithm" 53
                  "analysis" 23
                  ...
  Field{6}: Author
  DocFreq{1211}: "ullman" 11
                 "knuth" 15
                 ...
}

Table 1. The attributes present in the first metadata object for a (sub)collection, describing general service characteristics

Metadata attribute        Description
version                   Version of the metadata object
collectionName            Name of the (sub)collection being described
parentCollectionName      Name of the immediate parent collection, if any
subCollectionNames        Names of the available subcollections
attrModelNames            Attribute models supported
attrNames                 Attributes supported
booleanOps                Boolean operators supported
proximity                 Type of word proximity supported
truncation                Truncation patterns supported
implicitModifiers         Modifiers implicitly applied (e.g., stemming)
stopWordList              Stopwords used
languages                 Languages present (e.g., en-US for American English)
contentSummaryLinkage     URL of the content summary of the collection
dateChanged               Date the metadata object last changed
dateExpires               Date the metadata object will be reviewed
abstract                  Abstract of the collection
accessConstraints         Constraints for accessing the collection
contact                   Contact information of the collection administrator



This summary indicates, for example, that the word "algorithm" appears in the Title of 53 items in the collection, and that "ullman" appears as an Author of 11 items in the collection.
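To suggest how a resource-discovery service might consume such a summary, here is a deliberately simplified reading of the document-frequency entries. It assumes the layout shown in Example 3 and ignores the {byte-count} markers that a full SOIF/STARTS parser would honour; the helper names are our own.

# Simplified reading of the document frequencies in the content summary
# above. Assumes one "word" count pair per line under each Field entry.

summary_lines = [
    'Field{5}: Title',
    'DocFreq{11023}: "algorithm" 53',
    '"analysis" 23',
    'Field{6}: Author',
    'DocFreq{1211}: "ullman" 11',
    '"knuth" 15',
]

doc_freq = {}           # (field, word) -> number of documents
current_field = None
for line in summary_lines:
    if line.startswith("Field{"):
        current_field = line.split(":", 1)[1].strip()
    elif '"' in line:
        payload = line.split(":", 1)[-1].strip()
        word, count = payload.rsplit(" ", 1)
        doc_freq[(current_field, word.strip('"'))] = int(count)

print(doc_freq[("Title", "algorithm")])   # -> 53
print(doc_freq[("Author", "ullman")])     # -> 11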

The second metadata object returned by the getMetadata() method contains attribute access characteristics (see Table 2). This is attribute-specific information for each attribute that the proxy supports. Recall that attribute model proxies contain only information that is independent from any particular search service. Attribute access characteristics complement this information in that they add the service-specific details for each attribute.

For example, some search services allow only phrase searching over their Author attributes, while others allow keyword searching. Similarly, some search services may index their publication attributes, while others may not. The attribute access characteristics describe this information for each attribute supported by the (sub)collection specified in the getMetadata() call. The query translation services (Sect. 2.3) need this information to submit the right queries to the collections.
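As an illustration of how a translator might act on this information, the sketch below chooses between a phrase and a keyword formulation of an Author query per target collection. The table of legal modifier combinations and the query representation are assumptions made for the example, not the actual translator interface.

# Hypothetical use of attribute access characteristics by a query
# translator: pick a legal formulation of an Author query for each target.

access_characteristics = {
    # collectionName -> attrName -> legal modifier combinations
    "Business": {"Author": ["phrase"]},
    "News":     {"Author": ["keyword", "phrase"]},
}

def formulate_author_query(collection, author):
    legal = access_characteristics[collection]["Author"]
    if "keyword" in legal:
        # Keyword search: individual terms, any order.
        return [("Author", term) for term in author.split()]
    # Fall back to phrase search over the whole name.
    return [("Author", f'"{author}"')]

print(formulate_author_query("Business", "Jeffrey Ullman"))
print(formulate_author_query("News", "Jeffrey Ullman"))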

Notice that this design does not allow clients to query search service proxies directly for their metadata. Search service proxies only export all their metadata in one structured "blob." The reason for this is that the InfoBus includes many proxies, and more are being constructed as our testbed evolves. We therefore want proxies to be as lightweight as possible. Querying over collection-related metadata as exported by search service proxies is instead available through a special component, the metadata repository.

3.2.4 Metadata repository

The metadata repository is a central, though possibly replicated, database of the system's metadata. The repository collects or subscribes to the metadata available from selected attribute model proxies, attribute model translation services, search service proxies, and other InfoBus services. The intent is for this repository to be a local resource for finding answers to metadata-related questions and for finding specialized metadata resources. The interface to the repository is again the same as a search service proxy interface. Subcollections within the repository include:

– A subcollection containing all AttributeItems from locally relevant attribute model proxies. This subcollection can be used, for example, to search for documentation on Dublin Core attributes.

– A subcollection containing locally relevant attribute model translator information, structured as Items. This subcollection is useful for searching for translators to or from particular models. Searches return pointers to the translator components.

– General service information for locally relevant search service proxies. These Items are obtained through getMetadata() calls on the proxies, as discussed in the previous section. This subcollection can be used, for example, to find collections whose abstracts match a user's information need. It can also be used for more technical inquiries, such as for finding search proxies that support a given attribute model, or proxies that support proximity search.

– The attribute access characteristics of the locally relevant search proxies are also collected in the metadata repository. These are primarily useful to the query translator. Translators can, for example, find out which proxies support keyword-searchable Dublin Core Author attributes.

The metadata repository provides the list of available collections to a resource-discovery service like GlOSS (Sect. 2.1). It also lets the query formulation and translation services find out efficiently what attribute models and individual attributes are supported by each collection, for example (Sects. 2.2 and 2.3).
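The following sketch hints at the kind of resource-discovery query this enables. In the real architecture the repository is searched through our established search protocol; here a plain list of dictionaries stands in for the general-service-information subcollection, and the attribute values shown are invented for the example.

# Sketch of a resource-discovery query against the metadata repository's
# subcollection of general service information (hypothetical data).

general_service_items = [
    {"collectionName": "Business", "attrModelNames": ["DublinCore"],
     "proximity": "word", "abstract": "Business news and company profiles"},
    {"collectionName": "CS-TR", "attrModelNames": ["DublinCore", "Bib-1"],
     "proximity": None, "abstract": "Computer science technical reports"},
]

def collections_supporting(model, need_proximity=False):
    return [item["collectionName"]
            for item in general_service_items
            if model in item["attrModelNames"]
            and (item["proximity"] is not None or not need_proximity)]

print(collections_supporting("Bib-1"))                            # ['CS-TR']
print(collections_supporting("DublinCore", need_proximity=True))  # ['Business']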

3.2.5 Metadata facilities summary

In this section, we have seen that each metadata facility in our architecture serves to satisfy a number of the requirements imposed by our motivating set of InfoBus services. Thus, our metadata architecture is an integrated solution to the metadata needs of these independently developed and maintained services. We summarize here the matches between metadata facilities and requirements.

Attribute model proxies

– They make attribute models first-class objects so that we can communicate with them and perform necessary transformations (Sect. 2.2).

– They allow attribute models to be structured and they establish methods for extracting the structure information from these models (Sect. 2.4).

Attribute model translation services

– They translate canonical attributes and canonical attribute values into native attributes and native attribute values (Sect. 2.2).

Table 2. The properties for one search attribute, as described in the second metadata object for a (sub)collection, and listing the attribute access characteristics

Access characteristic     Description
collectionName            Name of the (sub)collection being described
attrModelName             Model of the attribute
attrName                  Name of the attribute
searchRetrieve            Whether the attribute is searchable, retrievable, or both
modifierCombinations      Legal combinations of modifiers (e.g., stemming, >) for the attribute



– They translate native attributes and native attribute values into canonical attributes and canonical attribute values (Sect. 2.4).

Search service proxy metadata information facilities

– They export the content summary associated with each collection (Sect. 2.1).

– They provide facilities for finding out efficiently what attribute models and individual attributes are natively supported by each search service (Sect. 2.2).

– They declare whether each attribute is searchable and/or retrievable, and what the supported search methods are for searchable attributes (i.e., legal combinations of operators) (Sect. 2.3).

– They declare what operators are supported in addition to Boolean operators (e.g., the type of proximity operators supported) (Sect. 2.3).

– They export the stopwords used (Sect. 2.3).

– They export the vocabulary of the collection, i.e., the set of words indexed by the service, to emulate stemming when it is not natively supported (Sect. 2.3); see the sketch after this list.

– They declare the details of other features, e.g., for truncation, the supported truncation patterns (Sect. 2.3).
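The sketch below illustrates the stemming-emulation idea mentioned in the list above: a stemmed query term is expanded into the disjunction of indexed words that share its stem. The vocabulary, the crude truncation-based stemmer, and the OR syntax are placeholders chosen for the example rather than the actual translation machinery.

# Illustration (ours, not part of the protocol) of emulating stemming with
# an exported vocabulary: expand a query term into all indexed words that
# share its stem. Any stemmer could be plugged in; crude truncation stands
# in for one here.

vocabulary = ["algorithm", "algorithms", "algorithmic", "analysis", "analyses"]

def crude_stem(word):
    # Placeholder for a real stemming algorithm such as Porter's [14].
    return word[:8] if len(word) > 8 else word

def expand_for_unstemmed_service(query_term):
    stem = crude_stem(query_term)
    matches = [w for w in vocabulary if crude_stem(w) == stem]
    return " OR ".join(matches) if matches else query_term

print(expand_for_unstemmed_service("algorithm"))
# -> "algorithm OR algorithms OR algorithmic" with this crude stemmer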

Metadata repositories

– They provide the list of locally available collections (Sect. 2.1).

– They provide facilities for finding out efficiently what attribute models and individual attributes are natively supported by each search service (Sect. 2.2).

4 Related work

In a distributed library environment, we can distinguish between metadata for information objects (e.g., documents) and metadata for information services (e.g., search services). The encoding of metadata for information objects can facilitate the unification of heterogeneous information objects, while the encoding of metadata for information services can facilitate communication among disparate services. Since interoperability is an important goal of our InfoBus, we have designed a metadata architecture that encompasses both types of metadata. In this section, we discuss related work on metadata for information objects and metadata for information services.

Many recent efforts have concentrated on the metadata for information objects. In this context, the term metadata refers to the description of information objects to support the major functions of digital libraries, such as search, assessment, and acquisition of information. Work in this area generally falls into two categories: specification of metadata sets and architectures that integrate them.

Relevant work in the first category (specification of metadata sets) includes, for instance, the Bib-1 attribute set in Z39.50 and the Dublin Core. The Z39.50 Bib-1 attribute set [23, 21] registers a large set of bibliographic attributes. The focus of the Dublin Core [20] is primarily on developing a simple yet usable set of attributes to describe the essential features of networked documents (e.g., World Wide Web documents), which the report of the Dublin meeting terms "document-like objects." The Dublin Core metadata set consists of 13 metadata elements, including familiar descriptive attributes such as Author, Title, and Subject.

These standard metadata sets fit nicely into our architecture. As we have described, we represent metadata sets by attribute models that serve as reference points for defined attributes. In addition, our attribute models can be more structured than just flat name spaces. Our attribute models go beyond these metadata sets because they can optionally include structure and because they are reified as searchable collections.

Given the existence of various metadata sets for information objects, and the fact that no single set covers all the possible aspects, there have also been significant efforts in developing architectures that support the integration and interoperation of various metadata sets. Work on this aspect of the metadata problem includes the Warwick Framework [24] and the Jet Propulsion Laboratory's DARE metadata model [25].

The Warwick Framework proposes a container architecture as a mechanism for incorporating attribute values from different metadata sets in a single information object. Within each object are "metadata packages," one for each distinct metadata set, e.g., Dublin Core or USMARC. The Warwick Framework also includes implementation suggestions for integrating it with HTML, MIME, SGML, and distributed object technology (e.g., CORBA).

The Warwick Framework is essentially an encoding scheme that incorporates various metadata packages with document data. Common to our work is the support of multiple metadata sets in describing documents. However, because we represent documents as a flat set of attribute-value pairs (similar to HTML), the encoding scheme of the Warwick Framework is more sophisticated than ours: its structure supports recursion and indirection. Consequently, to fully express the Warwick Framework we would need to extend our document models. On the other hand, our work complements the Warwick Framework in that we do provide attribute models as registries for metadata attributes. The procedure for specifying and registering new metadata sets is still an open issue in the Warwick Framework.

The DARE metadata model was developed to support interaction with metadata elements and to tie these elements to the data themselves. It uses ODL (Object Definition Language) to define new metadata elements that are stored in a "data dictionary." This notion of data dictionaries is similar to our attribute models. However, to facilitate interoperability, we also provide translation services that translate attributes among different models.

In addition to the work on metadata for information objects discussed above, there are also proposals that support metadata for search services. In this context, the efforts that are probably closest to what is described here are the Explain facility of Z39.50-1995 (i.e., Version 3 of Z39.50 [23]) and Stanford's STARTS, both of which require services to export their "source metadata." The former represents a standards effort for information retrieval, while the latter is intended to be an informal, lightweight agreement for interoperability among Internet search engine vendors.

The Explain facility is the primary mechanism for Z39.50 clients to discover servers' capabilities. Explain-based clients can dynamically configure themselves to match individual servers so that they can support more than the least common denominator of the servers. Z39.50 servers present metadata about their services via the Explain facility. This metadata is essentially another database that can be queried by the clients via the Z39.50 protocol. In the Explain database, server characteristics are divided into categories, and database record structures are defined for each category. The GILS profile for Z39.50 [26] defines a metadata attribute set for search services.

The Explain facility provided by Z39.50 servers corresponds to the metadata general information facility supported by our service proxies. The major distinction is that the interface to access the service metadata provided by our proxies is simpler than that of the Explain facility. As explained in the previous sections, we have decided to shift the more complex functionalities (e.g., searchable metadata) to the metadata repositories. The rationale is to make proxies as lightweight as possible and therefore easy to implement. As every service needs a proxy to enter our InfoBus, making proxies lightweight becomes critical. On the other hand, this simplicity may not be an issue for Z39.50, as both their clients and servers will necessarily have most of the required capabilities [27]. However, our architecture can benefit from the Explain facility; it should be relatively easy to build proxies to Explain-compliant services that will support our proposed metadata facility.

Finally, another relevant effort on top of which we built our architecture is STARTS (http://www-db.stanford.edu/~gravano/starts.html). STARTS is an informal "standards" effort coordinated by Stanford, whose main goal is to facilitate the interoperability of search engines for text. STARTS specifies what metadata should be exported by each collection of text documents. This information is essentially what the search-service proxies export through their metadata information facility (Sect. 3.2.3). However, STARTS does not describe a more sophisticated metadata architecture. It just specifies the information that the collections should provide. As with the Explain facility, the architecture that we present in this paper benefits from STARTS-compliant services: it is easy to build proxies for such services that will satisfy our metadata requirements.

Notice that what is unique in our architecture is the metadata repository, which serves as a central, though possibly replicated, database of metadata. The advantages are, first, that clients may conveniently use it as a local cache of metadata. Second, it naturally serves as a service "directory," so that interesting queries may be made to help in resource discovery, e.g., find all services that support the Dublin Core Title attribute. Finally, service proxies can "publish" their metadata to the metadata repositories. This makes updates to service metadata easier to propagate through the information space.

5 Conclusion

In this paper, we have surveyed the metadata needs of our InfoBus services and presented a metadata architecture that meets these needs. In the process, we have outlined a framework for understanding different classes of metadata and metadata uses. The components of our proposed architecture are integrated into the InfoBus. Each component is a distributed object that can communicate through remote method calls. Furthermore, the metadata repository is an InfoBus ConstrainableCollection object that can be searched using our already established search protocol.

More specifically, the components in our metadata architecture include: attribute model proxies, attribute model translation services, metadata information facilities for InfoBus search service proxies, and local metadata repositories. Together, these components can provide, exchange, and describe metadata for information objects and metadata for information services. We have found that these facilities are necessary building blocks for scalable interoperability.

Our current implementation includes six attribute model translators, table-driven query translation, content statistics over the CS-TR collection, and dynamic attribute derivations. We are currently modularizing these facilities, building attribute model proxies, adding the metadata repository, and making our search proxies compliant. The pieces of the architecture that are already implemented allow for interoperability among the services and collections that are in the InfoBus today. Once our architecture is complete, we will have a framework that enables us to incorporate many more heterogeneous services and collections into the InfoBus. In addition, we will be able to extend it to meet the needs of other types of information processing services.

References

1. Andreas Paepcke, Steve B. Cousins, Héctor García-Molina, Scott W. Hassan, Steven K. Ketchpel, Martin Röscheisen, and Terry Winograd. Towards interoperability in digital libraries: Overview and selected highlights of the Stanford Digital Library Project. IEEE Computer Magazine, May 1996
2. Michelle Baldonado, Chen-Chuan K. Chang, Luis Gravano, and Andreas Paepcke. Metadata for digital libraries: Architecture and design rationale. In Proceedings of the 2nd ACM International Conference on Digital Libraries, pages 47–56, July 1997
3. Andreas Paepcke. Searching is not enough: What we learned on-site. D-Lib Magazine, May 1996
4. Andreas Paepcke. Information needs in technical work settings and their implications for the design of computer tools. Computer Supported Cooperative Work: The Journal of Collaborative Computing, 5:63–92, 1996
5. Luis Gravano, Héctor García-Molina, and Anthony Tomasic. The effectiveness of GlOSS for the text-database discovery problem. In Proceedings of the 1994 ACM SIGMOD Conference, May 1994
6. Luis Gravano and Héctor García-Molina. Generalizing GlOSS to vector-space databases and broker hierarchies. In Proceedings of VLDB '95, pages 78–89, September 1995
7. Steve B. Cousins, Scott W. Hassan, Andreas Paepcke, and Terry Winograd. A distributed interface for the digital library. Technical Report SIDL-WP-1996-0037, Stanford University, 1996. Accessible at http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1996-0037
8. Steve B. Cousins. A task-oriented interface to a digital library. In CHI 96 Conference Companion, pages 103–104, 1996
9. Michelle Q Wang Baldonado and Steve B. Cousins. Addressing heterogeneity in the networked information environment. The New Review of Information Networking, 2:83–102, 1996
10. D. T. Hawkins and L. R. Levy. Front end software for online database searching Part 1: Definitions, system features, and evaluation. Online, 9(6):30–37, November 1985
11. M. E. Williams. Transparent information systems through gateways, front ends, intermediaries, and interfaces. Journal of the American Society for Information Science, 37(4):204–214, July 1986
12. Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. Boolean query mapping across heterogeneous information sources. IEEE Transactions on Knowledge and Data Engineering, 8(4):515–521, August 1996
13. Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. Predicate rewriting for translating boolean queries in a heterogeneous information system. Technical Report SIDL-WP-1996-0028, Stanford University, 1996. Accessible at http://www-diglib.stanford.edu
14. M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, 1980
15. J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11(1-2):22–31, 1968
16. Michelle Q Wang Baldonado and Terry Winograd. SenseMaker: An information-exploration interface supporting the contextual evolution of a user's interests. In Proceedings of CHI 97, 1997
17. Object Management Group. The Common Object Request Broker: Architecture and specification. Accessible at ftp://omg.org/pub/CORBA, December 1993
18. Doug Cutting, Bill Janssen, Mike Spreitzer, and Farrell Wymore. ILU Reference Manual. Xerox Palo Alto Research Center, December 1993. Accessible at ftp://ftp.parc.xerox.com/pub/ilu/ilu.html
19. USMARC format for bibliographic data: Including guidelines for content designation, 1994
20. Stuart Weibel, Jean Godby, Eric Miller, and Ron Daniel, Jr. OCLC/NCSA metadata workshop report. Accessible at http://www.oclc.org:5047/oclc/research/publications/weibel/metadata/dublin_core_report.html, March 1995
21. Z39.50 Maintenance Agency. Attribute set Bib-1 (Z39.50-1995): Semantics. Accessible at ftp://ftp.loc.gov/pub/z3950/defs/bib1.txt, September 1995
22. Luis Gravano, Chen-Chuan K. Chang, Héctor García-Molina, and Andreas Paepcke. STARTS: Stanford protocol proposal for Internet retrieval and search. Technical Report SIDL-WP-1996-0043, Stanford University, August 1996. Accessible at http://www-diglib.stanford.edu/cgi-bin/WP/get/SIDL-WP-1996-0043
23. National Information Standards Organization. Information Retrieval (Z39.50): Application Service Definition and Protocol Specification (ANSI/NISO Z39.50-1995). NISO Press, Bethesda, MD, 1995. Accessible at http://lcweb.loc.gov/z3950/agency/
24. Carl Lagoze, Clifford A. Lynch, and Ron Daniel, Jr. The Warwick Framework: A container architecture for aggregating sets of metadata. Technical Report TR96-1593, Cornell University, Computer Science Dept., June 1996
25. Jason J. Hyon and Rosana Bisciotti Borgen. Data archival and retrieval enhancement (DARE) metadata modeling and its user interface. In Proceedings of the First IEEE Metadata Conference, Silver Spring, MD, April 1996. IEEE
26. Government Information Locator Service (GILS), 1996. Accessible at http://info.er.usgs.gov:80/gils/
27. Denis Lynch. Implementing Explain. Accessible at ftp://ftp.loc.gov/pub/z3950/articles/denis.ps