Library MARC Records Into Linked Open Data: Challenges and Opportunities

This article was downloaded by: [Georgia Tech Library]
On: 15 November 2014, At: 06:29
Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Journal of Library Metadata
Publication details, including instructions for authors and subscription information: http://www.tandfonline.com/loi/wjlm20

Library Marc Records Into Linked Open Data: Challenges and Opportunities
Timothy W. Cole a, Myung-Ja Han b, William Fletcher Weathers b & Eric Joyner c
a Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
b Content and Access Management, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
c Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
Published online: 20 Sep 2013.

To cite this article: Timothy W. Cole, Myung-Ja Han, William Fletcher Weathers & Eric Joyner (2013) Library Marc Records Into Linked Open Data: Challenges and Opportunities, Journal of Library Metadata, 13:2-3, 163-196, DOI: 10.1080/19386389.2013.826074

To link to this article: http://dx.doi.org/10.1080/19386389.2013.826074

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and should be independently verified with primary sources of information. Taylor and Francis shall not be liable for any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://www.tandfonline.com/page/terms-and-conditions


Journal of Library Metadata, 13:163–196, 2013
Published with license by Taylor & Francis
ISSN: 1938-6389 print / 1937-5034 online
DOI: 10.1080/19386389.2013.826074

Library MARC Records Into Linked Open Data: Challenges and Opportunities

TIMOTHY W. COLE
Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA

MYUNG-JA HAN and WILLIAM FLETCHER WEATHERS
Content and Access Management, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA

ERIC JOYNER
Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA

Today researchers search for books in various ways. Once a book is discovered, a variety of Web technologies can be used to link to related resources and/or associate context with it. This environment creates an opportunity for libraries. The linked open data (LOD) model of the Web offers a potential foundation for innovative user services and the wider dissemination of bibliographic metadata. However, best practices for transforming library catalog records into LOD are still evolving. The practical utility on the Semantic Web of library metadata transformed from MARC remains unclear. Using a test set of MARC21 records describing 30,000 retrospectively digitized books, the University of Illinois at Urbana-Champaign (UIUC) Library explored options for adding links, transforming into non-library specific LOD-friendly semantics, and deploying as RDF to maximize the utility of these records. This paper highlights lessons learned during this process, discusses findings to date, and suggests possible avenues for further work and experimentation.

KEYWORDS bibliographic metadata, linked open data, MARC, RDF, Semantic Web

© Timothy W. Cole, Myung-Ja Han, William Fletcher Weathers, and Eric Joyner

Address correspondence to Timothy W. Cole, Grainger Engineering Library Information Center, University of Illinois at Urbana-Champaign, 157 Grainger Library (MC-274), 1301 W. Springfield Ave., Urbana, IL 61801, USA. E-mail: [email protected]


Faculty and students have many ways to discover books. In addition to the traditional online public access catalog of their institution’s library, researchers today routinely discover books of interest, especially digitized or born-digital books, through Web search engines (e.g., Google, Bing, Yahoo!), through specialized portals (e.g., HathiTrust,1 OpenLibrary2), and through vended and publisher-specific tools and Web scale discovery services to which their institution subscribes (e.g., ExLibris Primo,3 Serials Solutions Summon®,4 SpringerLink5). At the same time, academic libraries singly and collectively have invested heavily over many years in descriptive cataloging as a way to organize their collections and make items easier to find, identify, select, and obtain.6 The challenge is to make maximum use of the existing wealth of library-vetted, well curated book-level descriptive and identifying bibliographic metadata, and to do so in a manner that complements new service models (both within and external to the library) that take advantage of full-text search and Semantic Web technologies and standards including the Resource Description Framework (RDF)7 and linked open data (LOD)8 services.

There is no shortage of interest in this challenge, nor any shortage of early effort. Several major libraries and entities that work with libraries (e.g., OCLC, various publishers, and library catalog vendors) are busy transforming bibliographic catalog records into LOD in an effort to make these records more useful to users both within and outside of their institutions. LOD has the potential to help facilitate discovery and provide library users with added context about resources discovered by linking bibliographic catalog records to resources such as Wikipedia/DBpedia,9 the Online Computer Library Center (OCLC)’s Virtual International Authority File (VIAF),10 the Library of Congress (LC) Authorities and Vocabularies Linked Data Service,11 and OCLC’s Faceted Application of Subject Terminology (FAST).12

However, our review (summarized below) suggests that best practices for transforming library catalog data into LOD are still evolving. Guidance for the numerous decisions about how to integrate links and exactly which URIs (Uniform Resource Identifiers) to include in LOD-enriched catalog records is sparse and inconsistent. In a library context, practical LOD applications are as of yet few, of limited maturity, and relatively untested. There are a variety of semantic options available for encoding catalog records in RDF, and to date there is little consistency in metadata schemas selected (beyond the use of RDF itself). There is in fact a surprising lack of homogeneity in catalog-based LOD data sets so far created. While not on its own disabling, this semantic heterogeneity at a minimum complicates interoperability. The effectiveness with which catalog-derived Library LOD records can be leveraged by Semantic Web–based discovery, retrieval, and content use services will be dependent in part on the degree to which the community is able to develop sensible, coherent standards and best practices for making library bibliographic catalog records more LOD-friendly.


To inform planning and make better sense of work to date in this domain, the University of Illinois at Urbana-Champaign (UIUC) Library has been exploring options and methods for taking the metadata found in MAchine-Readable Cataloging (MARC) bibliographic records and enriching these metadata with links and LOD-friendly semantics. Given that MARC records are not designed to accommodate URIs associated with names and subject headings, a part of this process has been examining options for transforming link-enriched bibliographic records into a metadata schema more appropriate for RDF and LOD. Using as our test data set MARC records for a broad spectrum set of our retrospectively digitized books (almost 30,000 volumes), we began by analyzing and then transforming our MARC records into the Metadata Object Description Schema (MODS), simultaneously integrating URIs connecting author, contributor, and publisher names to the VIAF and subject headings to the LCSH Linked Data Service.13 While emerging approaches for serializing MODS as RDF are encouraging, they are still a work in progress. Accordingly, we experimented next with transforming our link-enriched records into more LOD-friendly and RDF-compatible schemas.

In this paper we discuss the motivation for this work, describe its place relative to the wealth of other ongoing activities in this domain, and highlight issues encountered and findings to date. In particular we examine the following:

• a few of the limitations of MARC for use with RDF, especially in contrast to MODS and a number of other metadata schemas

• common inconsistencies in current examples of library LOD catalog records, e.g., from OCLC WorldCat14 and the British Library’s British National Bibliography15

• challenges of authority control in the context of LOD, e.g., issues of dealing with complex subject headings such as compound headings, prepositional phrase headings, and headings with subdivisions

• preliminary statistical findings for transforming MARC string-based authority control terms into VIAF and LCSH links for our sample of library catalog records using automated scripts

• a way of using Resource Description Framework in Attributes (RDFa)16 to embed LOD within an XHTML page display designed for human consumption

We conclude with preliminary (and to this point necessarily speculative) observations about a few near-term ways that library LOD bibliographic records might improve user experience and a brief discussion of the major remaining challenges for transforming MARC records into LOD.


MOTIVATION AND RESEARCH QUESTIONS

This paper started with a simple question: Can traditional library bibliographic catalog records be considered a nascent form of LOD? This question spawned a number of more specific and practical questions:

1. Are existing catalog record semantics (e.g., MARC21 semantics) sufficient? If not, which schema(s) define the additional semantics needed to express catalog records as LOD?

2. What links (URIs) need to be added to library catalog records to make them useful as LOD, and how difficult is it to gather these URIs?

3. What are the best practices for transforming library catalog records into RDF records?

4. What kinds of useful (and short-term feasible) added LOD-based services utilizing bibliographic metadata are made possible by such transformations?

5. How do these service objectives influence the answers to the preceding questions?

6. Ultimately, are the likely benefits worth the effort?

Current discussions of LOD offer a great deal of promise. Among suggested potential benefits are an increased ability to discover library resources from outside of libraries (Coyle, 2012) and enhanced interoperability between libraries, publishers, aggregators, vendors, etc. (Byrne & Goddard, 2010). The Library Linked Open Data incubator group outlined a broad range of potential benefits for library-affiliated organizations and groups in their final report.17 Nonetheless, while current thinking (albeit often still speculative at this point in time) suggesting the potential benefits of LOD to facilitate a broad range of library-based use cases is enticing, many practical issues remain unsettled. In this paper, we focus primarily on exploring in a concrete scenario our first three research questions, returning in the last section of the paper, Discussion and Future Work, to the last three questions and how their answers might provide a lens for viewing the results of our experimentation with LOD and might reveal opportunities for new, innovative library services.

In a library context, production-level end-user library services based on LOD are as of yet few, of limited maturity, and relatively untested. Moreover, because there are a variety of semantic options available for encoding catalog records as RDF, those services that do exist are not uniform in underlying record structures. So even as OCLC and several large libraries continue large-scale experiments with converting bibliographic records to LOD data sets, we wanted to take a closer look at the various options and different approaches for treating library catalog records as LOD. Decisions about how library bibliographic records are made more LOD-friendly will help determine how effectively Library LOD can be leveraged by user-oriented discovery, retrieval, and content use services.

For example, by definition, LOD services are based on identifying relationships between resources. As memory institutions, libraries have a long history of managing and organizing information resources. One of the methods libraries have devised to organize resources is through the use of classification schemes and controlled vocabularies. Library catalog records are built in large part on a backbone of taxonomies, thesauri, and authority files. Libraries collocate books on their shelves and in their catalogs through call numbers, subject headings, author name entries, etc. In applying name authorities and assigning call numbers and subject headings, are libraries instantiating relationships between books and other entities that are also useful in an LOD context? How might the URIs associated with the subject headings or name entry headings of a bibliographic record be useful in helping users discover or browse book-level resources on the Web?

With this long and rich experience and history of working with thesauri, authority files, and classification schemes, libraries are seemingly in a better position than many in the era of LOD; however, this perception needs further testing. OCLC and the LC have instantiated several thesauri (LCSH, FAST), authority files (VIAF), and classification schemes (Dewey Decimal system) as LOD data sets, accessible through Web-based, typically RESTful, services that conform to LOD principles.18 In practical terms, how easy and, more importantly, how amenable to machine processing is it to transform the string data found in a MARC catalog record into a URI linking to a LOD data set such as those listed above?
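To make the string-to-URI question concrete, the following is a minimal sketch of exact-match linking after light normalization. It is not the scripts used in this study: the index contents, the example.org URIs, and the function names are all hypothetical, and a real index would be built from a downloaded authority data set or queried through a service.

```python
import re
import unicodedata

# Hypothetical local index of authorized headings -> URIs.
# The example.org URIs are placeholders, not real authority identifiers.
AUTHORITY_INDEX = {
    "architecture": "http://example.org/authorities/subjects/0001",
    "universities and colleges": "http://example.org/authorities/subjects/0002",
}

def normalize(heading: str) -> str:
    """Fold accents, trailing punctuation, extra whitespace, and case."""
    text = unicodedata.normalize("NFKD", heading)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[\s.,;:]+$", "", text)      # drop trailing punctuation
    text = re.sub(r"\s+", " ", text).strip()    # collapse internal whitespace
    return text.lower()

def link_headings(headings):
    """Return (matched, unmatched): heading -> URI, and headings with no exact match."""
    matched, unmatched = {}, []
    for heading in headings:
        uri = AUTHORITY_INDEX.get(normalize(heading))
        if uri:
            matched[heading] = uri
        else:
            unmatched.append(heading)
    return matched, unmatched

matched, unmatched = link_headings(
    ["Architecture.", "Universities  and colleges", "Campus planning"]
)
print(f"{len(matched)} linked, {len(unmatched)} left for review")
```

Keeping the unmatched headings in a separate list reflects the practical point raised above: the share of headings a script can link without human review is itself a finding worth measuring.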

Even assuming reasonable success at obtaining URI links to LOD-friendly data sets, there remains the issue of integrating these URIs with the other metadata found in library catalog records. As discussed below, MARC, the lingua franca of library catalogs, which was not created with LOD in mind, is not sufficient without transformation. Practically speaking, what are the options for refactoring library catalog records into alternative, more LOD-friendly (i.e., more suitable for RDF) semantics? Since it would be inefficient to maintain multiple copies of each library’s catalog, that is, one for use with their Online Public Access Catalog (OPAC) and a separate one for use in the context of LOD, this in turn engenders the additional question: Can the information richness and OPAC-suitability of MARC be retained when transforming to LOD-friendly semantics? In practical terms this might suggest an intermediate format that is at once easily mappable to both MARC and RDF and that has the potential to serve both functions (i.e., as both OPAC and LOD-based applications evolve).


Review of Current Practice

MARC AND LOD

Libraries have a long tradition of collecting, managing, and preserving information. Key to libraries being able to carry out these missions is the proper bibliographic control of the information resources they curate. Bibliographic control in the library domain relies on established traditions and conventions of bibliographic description. Since the 1960s, MARC has been the preferred carrier for most bibliographic records created and used in libraries. The MARC format in its modern serializations (e.g., MARC21XML) remains a good way (in the specific context of library operations) to maintain bibliographic control over print-based library collections.

However, as discussed above, LOD-based applications assume the use of RDF with its inherent subject-predicate-object triples approach for encoding metadata; extensive use of URIs (rather than strings) to identify entities, including entities found in thesauri and authority files; and a focus on instantiating relationships between entities described in a broad range of distributed and disparate LOD data sets. The field/subfield data model of MARC, wherein the range (and meaning) of the same named subfield varies according to which parent field the subfield appears in, complicates the mapping of MARC to RDF. Also, even when serialized as XML, MARC relies on string-based (rather than URI-based) authority control and does not natively provide a machine-actionable means, within the MARC data model, to associate controlled vocabulary strings (e.g., subject headings, personal name entries) with URIs.
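The dependence of subfield meaning on the parent field can be illustrated with a short sketch. The crosswalk below is an assumption made for illustration only (three MARC tag/subfield pairs mapped to Schema.org property names); the record values, the example.org subject URI, and the function name are hypothetical.

```python
# Illustrative crosswalk only: (MARC tag, subfield code) -> RDF predicate.
# The same subfield code ($a) means something different under each parent field,
# so the predicate cannot be chosen from the subfield code alone.
PREDICATES = {
    ("245", "a"): "schema:name",    # title statement, title proper
    ("100", "a"): "schema:author",  # main entry, personal name
    ("650", "a"): "schema:about",   # subject added entry, topical term
}

def marc_to_triples(subject_uri, fields):
    """fields: iterable of (tag, [(code, value), ...]) -> list of (s, p, o) triples."""
    triples = []
    for tag, subfields in fields:
        for code, value in subfields:
            predicate = PREDICATES.get((tag, code))
            if predicate is not None:  # unmapped subfields are skipped in this sketch
                triples.append((subject_uri, predicate, value))
    return triples

# Placeholder record content, not taken from the test data set.
record = [
    ("245", [("a", "Example title")]),
    ("100", [("a", "Example, Author")]),
    ("650", [("a", "Example topic"), ("z", "Example place")]),
]
triples = marc_to_triples("http://example.org/book/1", record)
for triple in triples:
    print(triple)
```

Note that every object in the output is still a string literal. Replacing the author and subject literals with URIs is the separate linking step that MARC itself has no native place to record.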

For these reasons in particular, the MARC21XML format is inadequate for direct use in most LOD-based applications. This is unlikely to change. Additionally, MARC embodies a distinct mind set and an approach to description that is not wholly compatible with LOD and RDF. As noted more than 20 years ago by Michael Gorman, the structure of the MARC record itself, and the ways the format has come to be used by library catalogers, constrain and to some degree define the scope and utility of library bibliographic descriptions: “The truth of the matter is that one cannot think about any aspect of cataloguing, except at the most rarified and abstract level, without taking the effects of the MARC record into account” (Gorman, 1990, p. 63). A different metadata schema, more compatible with RDF and LOD, is needed. Simultaneously we need to start thinking about our catalog records differently than we have in the past.

OTHER SCHEMAS

So what are the alternatives to MARC? There are many, arguably too many. MODS was developed after the advent of the Web (unlike MARC). By design, the semantics of MARC map well, for the most part, to MODS, but where MARC uses fixed-length alphanumeric codes to label data fields and subfields, and even encodes some information byte by byte in fixed byte-length strings (a holdover from its origin as a data-on-tape format), MODS expresses its semantics using more conventional string-based labels. For example, MODS does not redefine the semantics of labels used in record hierarchy according to parent elements. MODS semantics also have proven more easily extensible than MARC. Finally, unlike MARC, MODS allows URIs to be associated with most metadata elements. It is still not entirely clear how well metadata encoded in MODS can be expressed as RDF triples; however, the LC has relatively recently proposed a possible transform for doing this. Though this proposed transform is new and still labeled a “work in progress,”19 it shows considerable progress. At the very least MODS can be seen as a bridge format, facilitating mapping from MARC semantics, designed with library catalogs in mind, to semantic sets like Schema.org (see below), designed with RDF and LOD in mind.

Beyond MARC and MODS, there are numerous other metadata schemas that are at once more compatible with LOD and RDF and also useful for expressing book-level bibliographic metadata attributes. None are entirely sufficient on their own. To date, most who are transforming library catalog records into LOD have chosen to use multiple schemas.

Simple Dublin Core, having only 15 elements and no real facility for binding URIs to string values, is generally not considered expressive enough (e.g., there is no way to distinguish between different classes of identifiers, subjects, corporate and personal names, etc.).20 Qualified Dublin Core, with its added elements, refinements, and encoding schemes is more suitable. It is frequently found in library bibliographic LOD, though on its own even Qualified Dublin Core is limited as compared to MARC and MODS. Schema.org is arguably more expressive than Qualified Dublin Core, though because it is meant as a general-purpose metadata schema, that is, useful for describing more than just document-like resources, it also is relatively incomplete for book descriptions as compared to MARC and MODS.

OCLC and others have proposed a set of library-specific extensions to Schema.org base semantics to help mitigate this limitation.21 Other institutions have chosen to augment available semantics with their own institution-specific schemas, for example, the British Library Terms RDF schema, self-describing as “some useful terms for describing bibliographic resources that other models did not include.”22 Some implementers have made selective use of newer, general-purpose bibliographic schemas to supplement schemas that predate LOD. As discussed below, this includes schemas based on Resource Description and Access (RDA)23 semantics. Another example is the Bibliographic Ontology,24 co-edited by Frederick Giasson and Bruce D’Arcus and hosted by OpenLink Software, Inc. This schema was designed explicitly to describe bibliographic information in the context of LOD. It has been used by libraries, for example, the British Library, but generally only in part and sparingly.

In addition to schemas that are generally comparable in scope and granularity to MARC and MODS, there are also other schemas focusing specifically on semantics to express relationships between books and the subjects and concepts they deal with, between books and the entities that create them or contribute to their creation (e.g., authors), and between books and their identifiers. For example, SKOS (Simple Knowledge Organization System)25; MADS (Metadata Authority Description Schema)26; and UMBEL (Upper Mapping and Binding Exchange Layer),27 a vocabulary and reference concept ontology that shares an editor with the Bibliographic Ontology, are among the schemas used to augment library catalog records with links to subject thesauri and the like. FOAF (Friend of a Friend) and BIO (A Vocabulary for Biographical Information) are among the schemas used for expressing classes of metadata attributes that have to do with agent entities, for example, book authors and contributors.

In sum there is a plethora of semantic metadata schemas available for expressing book-level bibliographic metadata in the context of LOD and the Semantic Web. It is commonplace to use multiple schemas in combination. Table 1 shows the schemas used by OCLC, the British Library (BL), and the Bibliotheque nationale de France (BnF) in their catalog-based LOD data sets. That no community consensus has yet emerged about which schemas to favor and how to use them remains a potential concern for at least some implementers; at a minimum the lack of consensus can complicate interoperability. In practical terms this large set of schemas to choose from can be intimidating for librarians seeking to migrate their legacy local catalog records to LOD. Attempts are being made by some researchers to mitigate this problem by reconciling disparate property semantics commonly used in current library LOD practice. The Metadata Vocabulary Junction Project, an example of one such project, is trying to align the terms used by many data providers looking to create LOD data sets.28

TABLE 1 Schemas and Ontologies Referenced in LOD Catalog Record Data Sets

| OCLC | Bibliotheque nationale de France | British Library |
| RDF | RDF | RDF |
| RDFS | RDFS | RDFS |
| OWL | RDA Relationships for WEMI | OWL |
| Schema.org | RDA Elements (Group 1) | RDA Elements (Group 2) |
| Schema.org (library extension) | BnF Ontology | British Library Terms for RDF |
| MADS-RDF | SKOS | SKOS |
| UMBEL | FOAF | FOAF |
| | Simple Dublin Core | Qualified DC |
| | | ISBD Elements (IFLA) |
| | | The Bibliographic Ontology |
| | | BIO |
| | | The Event Ontology |
| | | W3C Organizational Structure |
| | | W3C WGS84 (Geo Positioning) |
| | | W3C Ontology for Time |

ONGOING RELATED WORK BY OCLC AND INDIVIDUAL LIBRARIES

As greater attention has been paid to LOD, many libraries and cultural memory institutions have experimented with and published catalog records as LOD. The BL published a subset of 2.8 million British National Bibliography (BNB) catalog records (89,733,617 triples) as LOD. The BNB data set can be downloaded in its entirety or queried through multiple services, including a SPARQL endpoint, a Describe endpoint, and a simple form-based search service.29 The BL also published a graphical representation of its data model for BNB books and serials to help others understand and implement their preferred model of LOD in their own environments. Of note here is the tendency of the BL to mint its own local identifiers for most entities (e.g., for all authors) and then equate these identifiers with shared identifiers (e.g., VIAF) using <owl:sameAs>. This is not uncommon. The BnF has made 20% of its main catalog data available as LOD (Simon, Wenz, Michel, & Di Mascio, 2013). According to Simon et al., the BnF used MARCXML catalog records as a starting point, transforming these into multiple serializations of RDF (RDF-NT and RDF-XML), into a comma-separated text file, and into a relational database over which services were implemented. Some researchers are also investigating the challenges and benefits of alignment of domain-specific information into LOD, specifically, music-related resources (Gracy, Zeng, & Skirvin, 2013). As a final illustrative example (there are many more available on the Web30), the Oslo Public Library also has made catalog records available as LOD. In part they did this to improve the discovery of the resources they own (Westrum, Rekkavik, & Talleras, 2012).
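The pattern of minting a local identifier and then equating it with a shared one can be sketched in a few lines. This is an illustration of the general pattern, not the BL's actual data or tooling: the local namespace, the local IDs, and the shared URIs are placeholders on example.org, and only the owl:sameAs property URI is a real identifier.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
LOCAL_BASE = "http://example.org/person/"  # hypothetical local namespace

def same_as_triples(links):
    """links: {local_id: shared_uri} -> list of N-Triples lines, sorted by local ID."""
    lines = []
    for local_id, shared_uri in sorted(links.items()):
        lines.append(f"<{LOCAL_BASE}{local_id}> <{OWL_SAME_AS}> <{shared_uri}> .")
    return lines

# Placeholder identifiers; a real data set would point at shared authority URIs.
links = {
    "p0002": "http://example.org/shared-authority/b2",
    "p0001": "http://example.org/shared-authority/a1",
}
for line in same_as_triples(links):
    print(line)
```

One consequence of this design is that a consumer has to follow the owl:sameAs statements before records from two providers can be joined on the same author, which is part of why locally minted identifiers add a step to interoperability.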

At the same time that individual libraries are experimenting with transforming bibliographic catalog records into LOD, OCLC, a union catalog of resources available from 71,000 libraries in 112 countries,31 has also published most of its catalog data as LOD. In August 2012, OCLC provided a downloadable sample (about 80 million triples) for the 1.2 million most widely held works in WorldCat (Online Library Computer Center, 2012). OCLC also became “a champion for Linked Data” by publishing linked data sources, including VIAF, FAST, and Dewey (Breeding, 2013). OCLC now includes the LOD version of records in WorldCat, so that users (and machine applications) can more easily connect and access additional contextual information about name, geographic location, and subject heading entities from sources such as VIAF, Wikipedia, and WorldCat Identity. Recently, the LC, in addition to publishing many of its authority files as LOD data sets, has been busy developing a new, more LOD-friendly data model called BibFrame32 (Library of Congress, 2012).

All of these libraries have one thing in common: they publish their catalog records as LOD and use them in discovery services. However, these LOD experimentations have varied significantly in the modeling and semantics they chose to use for encoding their metadata in RDF. For example, the BL, the BnF, and the Oslo Public Library all implemented the Functional Requirements for Bibliographic Records (FRBR) model as outlined in the new RDA in their LOD model. However, the ways they represent FRBR relationships are different; for example, the BL and the BnF use RDA vocabularies available in the Open Metadata Registry,33 while the Oslo Public Library uses locally developed semantics on top of RDA vocabularies. When representing bibliographic data in RDF, all borrow semantics from multiple namespaces, but some much more than others (see Table 1). The BL and the BnF also developed their own namespaces to describe specific information, for example, the British Library Terms RDF schema and the Bibliotheque nationale de France Ontology. By contrast, OCLC uses semantics from only a few namespaces, mostly RDF itself, Schema.org,34 and the Schema.org library extension (Online Library Computer Center, 2012).35 Table 2 shows WorldCat and BNB examples to illustrate the differences.

TABLE 2 Variation in Schemas Used for Common Bibliographic Properties

| Attribute | WorldCat | British National Bibliography |
| Title (literal) | schema:name | rdfs:label, dct:title |
| Author (literal & URI) | schema:contributor; madsrdf:isIdentifiedByAuthority; rdfs:label; URIs used are VIAF, id.loc.gov | dct:contributor; rdfs:label; foaf:name, foaf:familyName, foaf:givenName; URI is BNB-specific; owl:sameAs links to VIAF |
| Subject (literal & URI) | schema:about; schema:name; madsrdf:isIdentifiedByAuthority; URI is FAST, id.loc.gov, etc. | dct:subject; rdfs:label; skos:notation; skos:prefLabel; skos:inScheme; URI is BNB-specific, although in Scheme of LCSH, etc. |

FIGURE 1 RDF for a book titled 100 years of campus architecture at the University of Illinois (UIUC version).

Table 2 is just the tip of the iceberg. Clearly there is no agreed-upon best practice for using and mixing semantics when encoding library bibliographic metadata in RDF. After reviewing options and the range of current practices, we favored in a general way the more minimalist approach, closer to OCLC/WorldCat than to BL/BNB. We also determined not to create yet another local namespace. Instead we relied heavily on Schema.org (like OCLC), although we also made use of SKOS (but not MADS) in expressing subject relationships. We did not include links to both FAST and LCSH, but only to LCSH. In part this made it easier to address issues around more complex subject headings, such as compound headings, prepositional phrase headings, and headings with subdivisions. Also like OCLC, we serialized our LOD bibliographic metadata in RDFa so as to provide both a machine-readable record and human-readable Web display. Figures 1 and 2 show the RDF we created for a book digitized from the UIUC collection versus the RDF available for the same book from WorldCat. More about the differences between these two records is described below.

FIGURE 2 RDF for a book titled 100 years of campus architecture at the University of Illinois (OCLC WorldCat version).

EXPERIMENT DEFINITION

To better make sense of the diversity of practice when it came to creating LOD data sets from library catalog metadata, and to better understand our options moving forward, we followed our review of current practice with an experiment in constructing and using an LOD data set based on a set of records from our library catalog.

Data Set

The University of Illinois at Urbana-Champaign Library began its first large-scale retrospective book-digitization project in 2006, joining the Open Content Alliance (OCA) at that time and later expanding its efforts in this domain by working with the Google Books project and contracting with vendors to digitize more narrow collections of content held by the library. With OCA, the library established a satellite OCA scanning center at the library’s off-campus remote facility. With OCA, the UIUC Library has digitized books in areas such as Illinois history, culture, and natural resources; U.S. railroad history; rural studies and agriculture; and works in translation as well as extensive collections of 19th century “triple-decker” novels36 and emblem books written between 1540 and 1800.

By the end of 2012, approximately 30,000 volumes from UIUC collections had been digitized by OCA. Metadata and in most instances copies of OCA-generated scans were downloaded to the library’s local servers. The files routinely downloaded for each volume include the MARCXML record with which the volume is associated, a Dublin Core version of the metadata (also in XML), JPEG 2000 page image files, and DjVu and PDF files derived from these page images. The MARCXML records were used to create an e-book record for each work digitized. These are added to the UIUC Library’s Voyager catalog, uploaded to OCLC, and disseminated freely to others via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). MARCXML records as modified (e.g., identifiers added) both pre- and post-digitization by OCA and the UIUC Library also are used to trigger archival digital copy deposit in the HathiTrust Digital Library.

The initial part of our workflow with the MARC catalog records for OCA-digitized items is illustrated in Figure 3. For our LOD study, we used 28,907 MARCXML records describing the print works from our collection digitized by OCA from July 2007 to October 2012.

Overview of Experimental Phase of Study

Having defined our research questions and investigated practice elsewhere, our first order of business in this practical experimentation phase of our study was to extend our existing workflow, which was limited to working almost exclusively with MARC records (see Figure 3), so as to integrate URIs into the metadata records describing UIUC books that had been digitized by OCA.

FIGURE 3 MARC workflow for UIUC books digitized by OCA.

This entailed identifying relationships between metadata attributes in local records and items in LOD data sets such as Wikipedia (DBpedia), VIAF, and the LCSH authority Linked Data Service. As alluded to above, an important consideration was the degree to which this process could be successfully automated. One goal was to learn from our experiences with this small subset of records about potentially larger-scale projects. Given the variation in semantics and data models that we observed being used elsewhere, we also wanted to contrast how we would transform our records with how others had done this. If an LOD initiative were undertaken with a larger set of our catalog records, would it be worthwhile to transform our records to LOD ourselves or better to use directly the LOD catalog records WorldCat had already created? Implicit in this goal is a reflection upon the amount of work that goes into this project relative to the anticipated workload of integrating WorldCat’s LOD data sets and services into our own locally delivered services.

This question can be answered by looking at different factors, not only the suitability of OCLC’s mapping to LOD, but also the degree to which the process of mapping data strings from MARC into URIs can be automated. This is an important step in determining the wide-scale feasibility of transforming library data into LOD. This process entails identifying relationships between metadata in local records and Semantic Web resources. We set out to measure the success rate of matching personal names, corporate names, and meeting names present in the MARCXML records to a corresponding entity in VIAF in addition to the success rate when attempting to match subject headings in the MARCXML records against terms in the LC subject authority LOD Services. We anticipated that these success rates would depend in large part on the quality of data present in our catalog and on the current state of the library community’s LOD work.

We also wanted to understand the implication of having link-enriched metadata for extending existing services to end users. At the start of our study, scripts were already in place on our servers to dynamically generate a splash page for each book that had been digitized by OCA. This splash page provides links to PDF, Text, DjVu, and Flip Book derivatives resulting directly or indirectly from the OCA digitization (some of these derivatives were served from local servers and some were maintained at the Internet Archive37). The splash page presents a human-readable view of bibliographic metadata and is generated directly from the MARCXML record. This display shows Title, Publisher, Date, Physical Description, Language, Notes, and Copyright Status of the digitized item. How might the end user experience be enriched by having more links, that is, more URIs available when we construct a book’s splash page?

EXPERIMENT DETAILS AND RESULTS

To begin to answer these questions, we needed to extend our existing MARC-based workflow shown in Figure 3 to support the integration of URIs and then demonstrate, even in a relatively isolated and anecdotal manner, a few of the ways that having these URIs in our bibliographic metadata can enhance end-user experience. In practical terms, and in the context of existing workflows and infrastructure, this meant:

1. transforming MARCXML records to MODS (more amenable to the inclusion of URIs)
2. adding URIs for name entities found in VIAF
3. adding URIs for subject entries found in LCSH
4. dynamically generating (from the enriched MODS) a new XHTML splash page that takes advantage of these links
5. embedding RDFa (with non-MARC, non-MODS semantics as needed) into the XHTML generated so as to facilitate dissemination of our catalog records as LOD

FIGURE 4 Transforming MARC to MODS and adding URIs from VIAF and LC Linked Data Services.

These extensions to our baseline MARC-only workflow are shown in Figures 4 and 5.

Transforming MARCXML to MODS

As part of previously established workflow, all the University of Illinois Library’s books digitized by OCA are assigned persistent URI handles that resolve to a splash page for each item showing the available digital representations and metadata. These splash pages are generated dynamically from MARCXML, exploiting XML-related technologies such as XSLT to transform from MARC to XHTML. However, when it comes to adding URIs of the available linked data sources into the appropriate metadata fields, MARC has clear limitations. Although MARC has close to 2,000 fields for describing or coding descriptive and technical information for the resources, there is no field that is designed for containing a URI as a value in 1XX, 6XX, and 7XX fields where linked data source URIs will be added.

To accommodate this need, we decided to use MODS as our metadata standard, that is, as part of our study’s experimental phase all MARCXML records were first transformed to MODS with links to VIAF and the LC subject authority LOD Service. The MODS schema version 3.4 allows both <name> and <subject> elements to have attributes authority, authorityURI, valueURI, and xlink that make it possible to have linked data source URIs in its elements.38

FIGURE 5 Creating XHTML, RDFa, and RDF from MODS with VIAF and LCSH URIs.

For this experimental investigation, the tasks of transforming our MARCXML records to MODS and simultaneously searching name strings in VIAF were performed using a single script written in the Python programming language, since Python is lightweight and easy to use. In addition, it is a widely used programming language in the library community. The main Python script first calls a slightly modified version of the XSLT39 developed by the LC that transforms MARCXML records to MODS. We made two small changes to the standard LC MARC to MODS XSLT. First, the OCLC number is mapped to the <recordIdentifier>, which allows linking to a WorldCat record that provides additional information associated with the book, such as holdings information. Second, we added provenance information to each handle, so that users are able to see where the digitized resources were produced and can be accessed.

Adding VIAF URIs to Transformed MODS Records

As it transforms the MARCXML to MODS XML, our Python script also identifies name entities in the MARC record and searches VIAF for these names.

FIGURE 6 VIAF search results by name types.

Multiple MARC data fields are checked: personal names (data fields 100, 700), corporate names (data fields 110, 710), and meeting names (data fields 111, 711). When searching VIAF, we used the complete name information, not only the name (subfield a), but also the birth and/or death information (subfield d), if the information is available, which ensures the exact search retrieval in VIAF. The search query string was built from the basic VIAF search URL, http://viaf.org/viaf/search, with syntax that allows us to use a name found from the appropriate MARC data fields.40
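The query construction described above can be sketched roughly as follows. This is a minimal illustration, not the script used in the study: the CQL index names in `ASSUMED_INDEX_BY_TAG`, the `relation` keyword, and the function name `build_viaf_query` are all assumptions made for the example, and only the base URL comes from the text.

```python
from urllib.parse import urlencode

VIAF_SEARCH_URL = "http://viaf.org/viaf/search"  # base URL given in the text

# Assumption: index names per MARC tag are illustrative placeholders and
# should be checked against the VIAF search documentation before use.
ASSUMED_INDEX_BY_TAG = {
    "100": "local.personalNames", "700": "local.personalNames",
    "110": "local.corporateNames", "710": "local.corporateNames",
    "111": "local.names", "711": "local.names",
}


def build_viaf_query(tag, subfield_a, subfield_d=None, relation="exact"):
    """Build a search URL from a MARC name field (hypothetical helper).

    subfield_a holds the name; subfield_d holds birth/death dates and is
    appended when present, as the text describes.
    """
    parts = [subfield_a.strip()]
    if subfield_d:
        parts.append(subfield_d.strip())
    heading = " ".join(parts)
    cql = f'{ASSUMED_INDEX_BY_TAG[tag]} {relation} "{heading}"'
    query_string = urlencode({"query": cql})
    return f"{VIAF_SEARCH_URL}?{query_string}"


example_url = build_viaf_query("100", "Twain, Mark,", "1835-1910")
print(example_url)
```

Keeping the tag-to-index mapping in one table makes it easy to correct the index names without touching the query logic.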

Among our 28,907 sample records, 22,167 records (76.68%) have content in data field 100, 3,127 records (10.82%) have content in data field 110, and 81 records have content in data field 111. For data fields 7XX, 6,434 records have content in data field 700, 1,781 records have content in data field 710, and 25 records have content in data field 711. The statistical results of our searching for the 100/700 (personal names) and 110/710 (corporate names) data fields can be considered representative for the kinds of books included in our sample; the results for searches of names found in 111/711 (meeting names) are probably less meaningful given the relatively small sample size.

Our VIAF search results (Figure 6) show that 80.96% of the personal name entries, 74.98% of the corporate name entries, and 59.43% of the meeting name entries from our sample records have matching VIAF URIs discovered through exact search, that is, personal names have more matching VIAF links than corporate names and meeting names. As shown in Figure 6, multiple matches were found for a substantial number of additional name entries; some names (∼10% of personal names) were not found at all in VIAF by our automated, scripted search. Since an author may have published multiple works, there are 10,178 unique names included in the data field 100 compared with the 22,167 total name entries. Among the unique names, our “exact” search found VIAF links for 7,086 names (69.62%) and the “all” search found links for 8,522 names (83.73%). Although not all of the names have matching entries in the VIAF, that is, not all names are from authority files consolidated in VIAF, the results suggest that by and large catalogers of the books included in our sample diligently used authorized name forms when creating catalog records. This is especially noteworthy since the majority of books digitized were pre-1923 publications (with correspondingly older catalog records).

FIGURE 7 MODS <name> element with VIAF link.

Note that in deciding which VIAF URIs to add to our MODS records, we limited ourselves to “exact” matches in preference to the “all” matches to make sure that the terms in the VIAF and the MARCXML records were exactly the same. Since the VIAF includes the 24 different name authority files,41 there is a possibility that one name can have more than one established authority name. Additionally, multiple matches in VIAF can be indicative of false matches. When we sampled the “all” match search results and manually examined 100 results, we found only two correct matches. This also contributed to our decision to rely only on “exact” search matches: selecting the correct name from many variant names is difficult, and we felt that at this time our Python script is not sophisticated enough to select among multiple matching names.
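One way to express this policy in code is shown below. It is a sketch under stated assumptions: the function name `classify_name`, the status labels, and the rule that a URI is accepted only when the exact search yields a single distinct URI are illustrative choices, not details confirmed by the study. The VIAF identifiers are placeholders.

```python
def classify_name(exact_matches, all_matches):
    """Decide whether a VIAF URI can be added automatically.

    exact_matches / all_matches: lists of candidate VIAF URIs returned by
    the two kinds of search. Returns a (status, uri) pair.
    """
    distinct_exact = set(exact_matches)
    if len(distinct_exact) == 1:
        # A single unambiguous exact match: safe to add to the record.
        return ("exact", next(iter(distinct_exact)))
    if distinct_exact or all_matches:
        # Several candidates, or only loose matches: leave for manual review.
        return ("needs_review", None)
    return ("not_found", None)


# Placeholder URIs for illustration only.
print(classify_name(["http://viaf.org/viaf/0000001"], []))
print(classify_name([], ["http://viaf.org/viaf/0000002", "http://viaf.org/viaf/0000003"]))
print(classify_name([], []))
```

Separating "needs_review" from "not_found" keeps the loosely matched names available for later manual checking without letting them into the enriched records.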

Exact matches found were added to the MODS <name> element using an attribute xlink:href, as shown in Figure 7.
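A minimal sketch of this step using Python's standard library follows. The helper name `add_viaf_link`, the sample name, and the VIAF identifier are hypothetical; the namespace URIs are the standard MODS and XLink namespaces.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Register prefixes so the serialized output uses mods: and xlink:.
ET.register_namespace("mods", MODS_NS)
ET.register_namespace("xlink", XLINK_NS)


def add_viaf_link(name_element, viaf_uri):
    """Attach a VIAF URI to a MODS <name> element as xlink:href."""
    name_element.set(f"{{{XLINK_NS}}}href", viaf_uri)
    return name_element


# Illustrative record fragment; the name and VIAF id are placeholders.
name = ET.fromstring(
    f'<name xmlns="{MODS_NS}" type="personal">'
    "<namePart>Doe, Jane</namePart>"
    "</name>"
)
add_viaf_link(name, "http://viaf.org/viaf/0000000")
serialized = ET.tostring(name, encoding="unicode")
print(serialized)
```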

Adding LCSH URIs to Transformed MODS Records

After all the MARCXML metadata had been transformed to MODS and VIAF links added, we made another pass through the records with a second Python script to add links to the LCSH Linked Data Service.

The MARC data fields 6XX (Subject Access Fields) and MODS <subject> element can include multiple subject terms that describe different kinds of subjects: main topic, geographic names, temporal coverage, or sometimes genre information. When these different terms are included in one subject data field, they refine/complement each other as one complex subject heading. We recognized the importance of complex subject headings and tried to preserve the relationships between the terms in linked data sources as well. Since the LCSH Linked Data Service also provides the link for many complex subject headings, part of our script combines all of the subject terms included in one <subject> element into one complex subject heading and searches for it in the LCSH Linked Data Service. If there is no matching URI found, then each separate component of the complex subject term is searched. For example, if one <subject> element has subfields, as shown in Figure 8, the terms are added together with double dashes and become one string, for example, Working class--United States--History. When searched as a single term, the URI http://id.loc.gov/authorities/subjects/sh2008113772 is found for this complex subject heading. Had no results been returned, the script would have searched separately for “Working class,” “United States,” and “History.”

FIGURE 8 MODS <subject> element with multiple subelements.
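The search-then-fall-back logic can be sketched as below. The lookup is passed in as a function so the example runs without network access; `find_subject_uris` and the stub dictionary are hypothetical, and the only real URI is the one quoted in the text. The `example.org` address is a placeholder.

```python
def find_subject_uris(terms, lookup):
    """Search the combined heading first, then each component.

    terms:  component strings of one <subject> element, in order.
    lookup: callable mapping a heading string to a URI, or None if no match.
    """
    heading = "--".join(terms)  # components joined with double dashes
    heading_uri = lookup(heading)
    if heading_uri is not None:
        return {"heading": heading, "heading_uri": heading_uri, "component_uris": {}}
    # No match for the complex heading: search each component separately.
    return {
        "heading": heading,
        "heading_uri": None,
        "component_uris": {term: lookup(term) for term in terms},
    }


# Stub standing in for the LCSH Linked Data Service.
stub_lcsh = {
    "Working class--United States--History":
        "http://id.loc.gov/authorities/subjects/sh2008113772",
    "Working class": "http://example.org/placeholder/working-class",
}

matched = find_subject_uris(["Working class", "United States", "History"], stub_lcsh.get)
fallback = find_subject_uris(["Working class", "Atlantis"], stub_lcsh.get)
print(matched["heading_uri"])
print(fallback["component_uris"])
```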

When the script finds the link for a complex subject string, the link is added in the <subject> element with an attribute valueURI. Since we know the source and URI of the authority, that information is also added with attributes authority and authorityURI (see Figure 9). The complex subject heading link has the RDF/XML record available, and it contains links for individual subject terms as shown in Figure 9.

The script also grabs each individual link associated with the components of the complex subject heading and adds these links into the subfields of the <subject> element with attributes authorityURI and valueURI, as shown in Figure 10.

As mentioned, when the complex subject heading searching does not yield any matches, the individual components of the subject heading are searched. In this case, the <subject> element will not contain any link, only the subelements will have the matching link found in the LCSH Linked Data Service.
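The attribute placement described in the last three paragraphs might look like this in code. It is a sketch only: `annotate_subject` is a hypothetical helper, the `authorityURI` value is an assumption (the text names the attribute but not its value), and the component URI is a placeholder.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"
ET.register_namespace("mods", MODS_NS)

# Assumption: value used for authorityURI; not stated in the text.
LCSH_AUTHORITY_URI = "http://id.loc.gov/authorities/subjects"


def annotate_subject(subject, heading_uri=None, component_uris=None):
    """Add LCSH links to a MODS <subject> and its subelements.

    heading_uri:    URI for the whole complex heading, if one was found.
    component_uris: mapping of component text to URI for matched components.
    """
    component_uris = component_uris or {}
    if heading_uri:
        subject.set("authority", "lcsh")
        subject.set("authorityURI", LCSH_AUTHORITY_URI)
        subject.set("valueURI", heading_uri)
    for child in subject:
        uri = component_uris.get((child.text or "").strip())
        if uri:
            child.set("authorityURI", LCSH_AUTHORITY_URI)
            child.set("valueURI", uri)
    return subject


subject = ET.fromstring(
    f'<subject xmlns="{MODS_NS}">'
    "<topic>Working class</topic>"
    "<geographic>United States</geographic>"
    "<topic>History</topic>"
    "</subject>"
)
annotate_subject(
    subject,
    heading_uri="http://id.loc.gov/authorities/subjects/sh2008113772",
    component_uris={"Working class": "http://example.org/placeholder/working-class"},
)
print(ET.tostring(subject, encoding="unicode"))
```

Calling the helper with `heading_uri=None` covers the fallback case, where only the subelements carry links.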

FIGURE 9 Portion of RDF/XML for a complex subject heading as retrieved from LCSH Linked Data Service.

In our sample of MODS records, our script found 6,591 complex subject headings. Among these complex subject headings, 25.84% (1,703 headings) have matching URIs in the LCSH Linked Data Service (Figure 11). By comparison, 67.85% (3,283 headings) of the simple subject headings have matching URIs, that is, the simple subject headings have more than twice the matching frequency of the complex subject headings in the LCSH Linked Data Service. In part this is due to the presence of components in subject headings that are not themselves in LCSH. Relying on disassociated simple subject headings for three-quarters of the subject headings found in our catalog record sample is of concern. Additional effort needs to be put into preserving the relationship between components of complex subject headings. In doing this we need to account for complex headings that involve components that are not part of LCSH (e.g., geographic names that are not part of LCSH, personal names that are not part of LCSH, etc.). Note that structures for doing this are available in MADS/RDF.

FIGURE 10 MODS <subject> element with LCSH Linked Data Service links.

FIGURE 11 LCSH Linked Data Service search results for simple and complex subject headings.

Creating XHTML, RDFa and RDF From MODS Records Dynamically

MODS records with VIAF and LCSH links were then used to dynamically generate new splash pages for each book. Each splash page includes some generic HTML scaffolding as a foundation. An XSLT stylesheet is used to transform the MODS to XHTML, with results inserted into this scaffolding. Additional JavaScript and RDFa are also added by this same process as appropriate to the content of the MODS record. The resulting XHTML+RDFa supports a human-readable display when rendered in a Web browser (Figure 12) and provides a machine-actionable bibliographic record in RDF when the embedded RDFa is distilled. JavaScript included with the page leverages links included in the RDFa to offer the user additional information and context about the book described.

FIGURE 12 Splash page of OCA digitized book.

For example, if the MODS record contains a <name> element with an xlink:href attribute that links to that <name>’s VIAF page, additional information is added to the splash page. During the XSLT-driven transformation process, an additional <a> element is added to the author name section of the page, and a JavaScript onclick listener is attached to it. When the link is clicked, the page sends a request back to the Web server, and the Web server retrieves additional information from the VIAF page and other pages linked by the VIAF page. The information is processed by the server and returned as a JSON object to the splash page, where additional JavaScript formats the response and displays the information in a jQuery User Interface modal dialog box, floating above the splash page. The additional information returned by the server can include the author’s gender and nationality, related names, and additional links to more information on the author (see Figure 13).

The RDFa that is embedded in the splash page is structured and includes machine-readable information represented as attributes on preexisting HTML elements or as attributes of additional <span> or <div> elements. This additional information does not change the human-readable aspects of the splash page, but does provide a better way (i.e., better than trying to parse and infer from display-oriented HTML) for machine agents to read and act on the information contained in the page. To describe these books in RDFa we utilized three different vocabularies: Schema.org, an experimental library extension to Schema.org authored by OCLC,42 and SKOS. Table 3 shows the basic mappings from MODS to these sets of more LOD-friendly semantic labels.

FIGURE 13 Popup window with additional information about a name appearing in the splash page.

Schema.org provides a rich vocabulary set that met many of our needs of describing the OCA digitized books. Although we could have described all of the information we wanted to map from MODS to RDFa using the Schema.org vocabulary alone, it would have been necessary to combine bibliographic attributes in ways that would have reduced the precision of the bibliographic record as a whole. Specifically we needed more precise terms for place of publication, OCLC number, and subject headings. The library extension to Schema.org provided us with precise properties for these three areas. Similarly, while the “CreativeWork” class of Schema.org does have an “about” property, which is defined as “the subject matter of the content,” we found the SKOS class “Concept” and property “prefLabel” more fitting.

FIGURE 14 A prefix attribute in RDFa giving the full URIs of metadata schemas and semantics used in the RDFa elements and attributes.

TABLE 3 Mapping from MODS to RDF Using Semantics from Schema.org, the Library Extension to Schema.org and SKOS

| MODS element | RDF Class | RDF Property |
| title | schema:Book | schema:name |
| subTitle | schema:Book | schema:name |
| originInfo/publisher | schema:Book | schema:publisher |
| | schema:Organization | schema:name |
| originInfo/place/placeTerm | schema:Place | library:placeOfPublication |
| languageTerm | schema:Book | schema:inLanguage |
| dateIssued | schema:Book | schema:datePublished |
| genre | schema:Book | schema:genre |
| subject/topic | schema:Book | schema:about |
| | skos:Concept | skos:prefLabel |
| subject/geographic | schema:Book | schema:about |
| | skos:Concept | skos:prefLabel |
| name/namePart | schema:Book | schema:creator |
| | schema:Person | schema:name |
| name/xlink | schema:Person | schema:name |
| tableOfContents | schema:Book | schema:description |
| recordIdentifier | schema:Book | library:oclcnum |
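For readers who want to work with the Table 3 mappings programmatically, they can be held in a simple lookup structure. The sketch below transcribes the table as published; the dictionary layout and the helper `vocabularies_used` are illustrative choices, not part of the study's XSLT.

```python
# Each MODS path maps to a list of (RDF class, RDF property) pairs,
# in the order they appear in Table 3.
MODS_TO_RDF = {
    "title": [("schema:Book", "schema:name")],
    "subTitle": [("schema:Book", "schema:name")],
    "originInfo/publisher": [
        ("schema:Book", "schema:publisher"),
        ("schema:Organization", "schema:name"),
    ],
    "originInfo/place/placeTerm": [("schema:Place", "library:placeOfPublication")],
    "languageTerm": [("schema:Book", "schema:inLanguage")],
    "dateIssued": [("schema:Book", "schema:datePublished")],
    "genre": [("schema:Book", "schema:genre")],
    "subject/topic": [
        ("schema:Book", "schema:about"),
        ("skos:Concept", "skos:prefLabel"),
    ],
    "subject/geographic": [
        ("schema:Book", "schema:about"),
        ("skos:Concept", "skos:prefLabel"),
    ],
    "name/namePart": [
        ("schema:Book", "schema:creator"),
        ("schema:Person", "schema:name"),
    ],
    "name/xlink": [("schema:Person", "schema:name")],
    "tableOfContents": [("schema:Book", "schema:description")],
    "recordIdentifier": [("schema:Book", "library:oclcnum")],
}


def vocabularies_used(mapping):
    """Return the sorted set of namespace prefixes appearing in the mapping."""
    return sorted(
        {term.split(":")[0] for pairs in mapping.values() for pair in pairs for term in pair}
    )


print(vocabularies_used(MODS_TO_RDF))
```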

The RDFa portion of the splash-page documents began with a prefix attribute in the <table> tag, as shown in Figure 14. In RDFa, the prefix attribute is used to associate URIs of metadata schemas and semantics with namespace prefixes, which in turn simplify references to these namespaces when authoring RDFa elements and attributes.
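To make the role of the prefix attribute concrete, the sketch below builds a prefix attribute value and expands a prefixed name to a full URI. The Schema.org and SKOS namespace URIs are the standard ones; the URI given for the library extension is an assumption, since the figure is not reproduced here, and both helper names are hypothetical.

```python
PREFIXES = {
    "schema": "http://schema.org/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    # Assumption: namespace of the experimental library extension.
    "library": "http://purl.org/library/",
}


def prefix_attribute(prefixes):
    """Format mappings as an RDFa prefix attribute value: 'p: uri p2: uri2'."""
    return " ".join(f"{prefix}: {uri}" for prefix, uri in prefixes.items())


def expand(curie, prefixes):
    """Expand a prefixed name such as 'schema:name' to its full URI."""
    prefix, local_name = curie.split(":", 1)
    return prefixes[prefix] + local_name


print(f'<table prefix="{prefix_attribute(PREFIXES)}">')
print(expand("schema:name", PREFIXES))
```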

The context for the RDFa within the HTML table (the subject of top-level RDF triples comprising the bibliographic record) is the value of the resource attribute, which is the persistent URI for the digitized book. With the subject set, triple predicates and objects are then expressed in other attributes on other elements contained in the HTML table.

RDF allows both Resources (represented by URIs) and string literals as the object of RDF triples. Our MODS records only had URIs for named entities from VIAF and subject terms from the LCSH Linked Data Service. Values of other bibliographic entities in our sample records were represented as string literals. Additional work with other vocabularies, for example, for language, publication place names, and so on, would allow us to express more elements of our bibliographic records in terms of URIs. The RDFa markup would be straightforward.

Triple predicates (like triple subjects) are always represented by URIs in RDF. For example, the predicate used to express a book title is http://schema.org/name. The range of this predicate is a string literal. For the book identified by the URI http://hdl.handle.net/10111/UIUCOCA:100yearsofcampus00well, the object of the http://schema.org/name predicate is “100 years of campus architecture at the University of Illinois.” As shown in Figure 15, in RDFa, the subject, predicate, and object components of this triple appear as values of specific attributes. Figure 16 illustrates the same triple when transformed into RDF/XML.

FIGURE 15 A triple where the object of the triple is a literal value as serialized in RDFa.

FIGURE 16 The triple shown in Figure 15 transformed to RDF.
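The same title triple can also be written in N-Triples, a line-based RDF serialization that makes the three components easy to see. The helper `ntriple_literal` is a hypothetical name, and it handles only the basic escaping of backslashes and double quotes; the subject, predicate, and title are taken from the text.

```python
def ntriple_literal(subject_uri, predicate_uri, literal):
    """Serialize one triple with a plain string literal object as N-Triples."""
    escaped = literal.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject_uri}> <{predicate_uri}> "{escaped}" .'


title_triple = ntriple_literal(
    "http://hdl.handle.net/10111/UIUCOCA:100yearsofcampus00well",
    "http://schema.org/name",
    "100 years of campus architecture at the University of Illinois",
)
print(title_triple)
```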

Blank nodes are sometimes needed for uncontrolled values in fields like corporate author. For example, if a match was not made on a name for the creator of a book but this creator was known to be an organization whose name was known, a blank node was used as shown in Figure 17.

Here, the property attribute is used as before. In order to establish a new data item, the typeof attribute is used. In the absence of an href, about, or resource attribute, no resource can be identified as the subject of the triple, hence the blank node. In the next nested span there is a property referring to this blank node. The rest of the RDFa markup involves descriptions of resources where matches were found in either VIAF or LC Linked Data Service and the MODS record was enriched with the appropriate URL.

Instances such as these will produce triples where the subject, predicate, and the object are all represented by URIs. Since we were providing links to the user for all of these additional resources, the URI was expressed in the href attribute of the <a> tag, as shown in Figure 18.
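The two creator cases described above (a blank node when no match exists, an <a> element with href when a VIAF URI is known) can be sketched as one small generator. The markup it emits is a plausible RDFa 1.1 form consistent with the description, not a reproduction of the figures; `creator_rdfa`, the sample names, and the VIAF identifier are hypothetical.

```python
from html import escape


def creator_rdfa(name, rdf_type, uri=None):
    """Return RDFa markup for a book's creator.

    With a URI, the creator is a named resource carried in href.
    Without one, typeof alone introduces a blank node, and the nested
    span's property attaches the name to that blank node.
    """
    label = escape(name)
    if uri:
        return (
            f'<a property="schema:creator" typeof="{rdf_type}" '
            f'href="{escape(uri, quote=True)}">'
            f'<span property="schema:name">{label}</span></a>'
        )
    return (
        f'<span property="schema:creator" typeof="{rdf_type}">'
        f'<span property="schema:name">{label}</span></span>'
    )


# Placeholder values for illustration only.
unmatched = creator_rdfa("Example Historical Society", "schema:Organization")
matched_creator = creator_rdfa("Doe, Jane", "schema:Person", "http://viaf.org/viaf/0000000")
print(unmatched)
print(matched_creator)
```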

FIGURE 17 Creator described in RDFa using blank node.

FIGURE 18 Creator described in RDFa with URI from VIAF.

Creating RDFa, which is embedded in the HTML of a publicly available library service page, has advantages. It enables scholars and developers both within and outside of the University of Illinois to harvest and make use of our triples. This can be accomplished by passing the splash page URL to the W3C RDFa 1.1 Distiller and Parser43 or, if a larger scale operation is planned, by downloading the distiller software and running the process locally. The RDFa Distiller and Parser produces RDF in one of four different formats: Turtle, RDF/XML, JSON-LD, or N-Triples. Once each page is distilled to RDF/XML, we then load these triples into a local instance of OpenLink Software’s Virtuoso triple store. This allows querying over the whole group of 28,000 bibliographic records.

DISCUSSION AND FUTURE WORK

The experimentation we describe here focused on assessing the feasibility and challenges of transforming traditional library bibliographic records into LOD. This work has naturally whetted our appetite to explore some of the possible opportunities that bibliographic LOD might offer for libraries and library users. In closing, we speculate about the potential for providing useful LOD-based services to library users. And though it is too early to know, based on work to date, we remain optimistic that the potential benefits to users are worth the effort of transforming our bibliographic records into LOD and learning how to use LOD effectively and efficiently in a library context.

Using LOD to Enhance Services to Library Users

Growth in the use of born-digital and retrospectively digitized information resources continues and is irreversible. The Semantic Web, based on technologies and data models such as RDF, OWL, and LOD, provides a paradigm for organizing digital information in ways that will enable libraries to implement new kinds of services. However, the exact form of these services, and their impact and effectiveness, are as yet unclear.

We illustrated above a possible way to enrich splash pages for retrospectively digitized books with links to additional context pulled from LOD-friendly authority services VIAF and the Library of Congress. This additional information was retrieved by automated means, added to MODS records (derived from existing library catalog MARC records), and then retained in records transformed from MODS to RDFa embedded in XHTML. Through the use of ubiquitous JavaScript libraries and standard Web browser technologies, these links can then be leveraged in various ways to provide additional access services to the user not previously possible when splash pages were based solely upon information present in the original MARC catalog records.

This is only one, relatively simple illustration of the potential utility of LOD. The additional VIAF-based LOD services added to our splash pages in the experiment described above have implications for other discovery services at the University of Illinois Library. The library is currently implementing ExLibris Primo, a Web-scale discovery tool.44 We supplement the out-of-the-box Primo implementation with a locally developed integrated recommender service implemented in a Primo “Custom Tile.” These services make use of the library’s homegrown federated search application, Easy Search, which is exposed through the tile as a set of RESTful Web services. Currently our Custom Tile is primarily used to provide contextual search suggestions based upon a user’s query. Figure 19 displays an instance in which the Custom Tile has detected that the user may be searching for an author’s name and so offers the option to “perform this search as an author search in Primo” along with a link that will initiate this search.

FIGURE 19 Custom Tile with author search suggestion.

We are now investigating the possibility of adding LOD-based services to our Custom Tile. For example, if a user’s query resembled a name query (as shown in Figure 19), this could potentially trigger a search of VIAF to see if there was an exact match. If so, a link similar to the “more info” popup menu, as described above (Figure 13), could be included among the suggestions shown in the Custom Tile for that Primo search. By providing additional information about the author (i.e., beyond that which is or can be contained in the library bibliographic record) and links to even more information about the author and his or her life and times, the user is presented with context for interpreting the works discovered, their potential relevance to the user’s information need, and the appropriateness of redoing their search as a search for works by a specific, well-identified author.

It also might be possible to provide recommender services based on variant forms of a name. When a user enters a query into the system, a search could be performed for the variant terms available in VIAF in addition to the original query. If an exact match is made, the service could then retrieve any variant forms of the name and requery our local system. If there are results, these could be presented to the user as a “Did you mean?” suggestion next to the variant form of the name and perhaps with some added contextual information gleaned from VIAF. Because VIAF aggregates the authority files of many national libraries and other institutions, this approach can also be used to provide a multilingual search service, which can improve the discovery of resources involving names that are spelled differently in different languages (e.g., Johann Vogel vs. Johannes Vogelius).
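The variant-name idea can be illustrated with a minimal Python sketch. It is not the service described above: the authority cluster and the local index are small, hand-made stand-ins (a real service would retrieve name forms from VIAF and hit counts from the local catalog), the hit counts are invented for illustration, and the function names are hypothetical.

```python
import unicodedata

# Hypothetical stand-in for authority data. A real service would obtain the
# name forms for a matched identity from VIAF rather than from a constant.
AUTHORITY_CLUSTERS = [
    {"names": ["Johann Vogel", "Johannes Vogelius"]},
]

# Hypothetical stand-in for the local catalog: normalized name -> hit count.
LOCAL_INDEX = {"johann vogel": 3, "johannes vogelius": 2}


def normalize(name):
    """Fold case, strip combining accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def find_cluster(query):
    """Return the cluster containing an exact (normalized) match, or None."""
    key = normalize(query)
    for cluster in AUTHORITY_CLUSTERS:
        if any(normalize(name) == key for name in cluster["names"]):
            return cluster
    return None


def did_you_mean(query):
    """Requery the local index with every other name form in the cluster.

    Returns (name, hit_count) pairs for forms that differ from the query
    and have at least one local result.
    """
    cluster = find_cluster(query)
    if cluster is None:
        return []
    key = normalize(query)
    suggestions = []
    for name in cluster["names"]:
        hits = LOCAL_INDEX.get(normalize(name), 0)
        if normalize(name) != key and hits > 0:
            suggestions.append((name, hits))
    return suggestions
```

With this sample data, a query for one form of the name yields the other form as a suggestion, and a query with no exact match in the authority data yields no suggestions at all, which mirrors the "exact match first, then requery" order described in the text.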

Both FAST and the LCSH Linked Data Service can be used in a similar fashion to facilitate topic term suggestions. Although FAST and the LCSH Linked Data Service have different structures and use different subject terms, both services enable the library to provide a range of subject search term suggestions when a user enters a subject term in a search box.

For example, the LCSH Linked Data Service provides broader, narrower, and related terms as appropriate for individual headings retrieved from the service. These can be used to facilitate discovery. URIs for both complex and simple subject terms were harvested from the LC Linked Data Service and then passed through the workflow to the XHTML splash page. We have begun experimenting with passing these URIs back to the LCSH Linked Data Service to retrieve lists of broader and narrower subject terms. By following these links users may view these related terms and, once informed of these additional terms, may opt to reformulate their original query. These broader and narrower subject terms could also be retrieved dynamically and presented in a popup window that would float above the splash page in the same manner that additional author information is retrieved from VIAF and presented. Users could potentially browse the LCSH subject hierarchy, as shown in Figure 20, without navigating away from the resource they are viewing. Such immediate access to related headings would provide users with a powerful tool, aiding discovery. A similar heading browse service has been successfully integrated into the Emblematica Online project using the Iconclass vocabulary Web service (i.e., rather than LCSH).45 (Also see: Cole, Han, & Vannoy, 2012.)
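As a rough sketch of how broader and narrower terms might be collected for such a popup, the Python below walks a tiny in-memory set of triples. It assumes the heading data is available with SKOS-style `broader`, `narrower`, and `prefLabel` properties; the `example.org` URIs, the labels, and the relationships between them are placeholders invented for illustration, not data retrieved from the LCSH Linked Data Service.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/subjects/"  # placeholder namespace, not id.loc.gov

# Hypothetical (subject, predicate, object) triples standing in for the
# RDF that a heading lookup would return.
TRIPLES = [
    (BASE + "emblems", SKOS + "prefLabel", "Emblems"),
    (BASE + "emblems", SKOS + "broader", BASE + "symbolism"),
    (BASE + "emblems", SKOS + "narrower", BASE + "emblem-books"),
    (BASE + "symbolism", SKOS + "prefLabel", "Symbolism"),
    (BASE + "emblem-books", SKOS + "prefLabel", "Emblem books"),
]


def objects(subject, predicate):
    """Return every object of triples matching the subject and predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def label(uri):
    """Return the preferred label for a URI, falling back to the URI itself."""
    labels = objects(uri, SKOS + "prefLabel")
    return labels[0] if labels else uri


def related_terms(uri):
    """Collect labels of broader and narrower headings for display."""
    return {
        "broader": [label(u) for u in objects(uri, SKOS + "broader")],
        "narrower": [label(u) for u in objects(uri, SKOS + "narrower")],
    }
```

A page script could call `related_terms` for the subject URI embedded in the splash page and render the two lists as links; a heading with no recorded relationships simply produces two empty lists.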

Looking Ahead—Challenges and Opportunities

Linked open data has become a fashionable buzzword in both library bibliographic control and discovery services because of the opportunities and possibilities it presents to us. However, in order to fully implement LOD in library services, there are many challenges that we must work through as a community.

First, we need a new carrier for library bibliographic records in order to facilitate library LOD development. As discussed above, early implementers are using a wide variety of library-specific and non-library metadata schemas in various combinations. Our experimentation reinforces the view of many in the community that Schema.org is a reasonable option. Developed in collaboration with and under the sponsorship of the major Web search engine providers (i.e., Google, Yahoo!, and Microsoft Corporation), the semantics of Schema.org have gained good traction within the broader Web community. Within a year of its introduction in June 2011, one researcher reported that 7% to 10% of pages being indexed by major search engines contained Schema.org markup (Wallis, 2012). This percentage is expected to grow. In his presentation at the Semantic Web in Libraries 2012 conference, Richard Wallis of OCLC also reported that 80% of users connecting to library discovery services are being directed there by Web search engines. Our users, Wallis maintains, are being connected to libraries and top-level library services from Google, but they are not discovering the individual items we hold this way. Why not? Wallis says it is because Google (and other major Web search engines) do not understand MARC, RDA, Z39.50, and so on (Wallis, 2012).

FIGURE 20 LCSH subject hierarchy browsing service provided in popup window.

The widespread uptake of Schema.org by Web developers and its use by Web search engines to bolster their indexing, considered in conjunction with the fact that many library users already rely heavily on Web search engines, even when looking for materials in library collections, is a powerful inducement for libraries to consider the use of Schema.org. However, even advocates of Schema.org acknowledge its limitations. It is not sufficient for many library-specific information exchange use cases. It also leaves out a few key properties associated with typical library items. The latter issue has largely been addressed by the library-specific extensions to Schema.org, as described above and as used by us in our experimentation. Nonetheless, it is not clear when or even if a broad consensus supporting Schema.org will emerge within the library community.

As an alternative to Schema.org, there is considerable interest today in a more library-specific, more comprehensive (for bibliographic metadata) approach. The LC initiated and has been leading an effort, BibFrame, to create a new bibliographic framework since 2010; as of this writing, it is still in a developmental state. Prototype tools for working with BibFrame have been made available.46 There is also an early effort to create a mapping between BibFrame and Schema.org.47 Currently there are six libraries experimenting with transforming traditional MARC records into the BibFrame data model. One library is also working with records compliant with RDA, which is a new content standard and data model optimized for bibliographic records in a Web environment.48 While the LC is working hard to finalize its data and semantic models after its initial announcement of the BibFrame Model Primer in November 2012,49 it is still not clear when production implementations of BibFrame will take place and how well BibFrame records will deliver on the promise of LOD.

Second, we need clear guidelines/best practices on how to transform library catalog records into LOD. OCLC has led major library LOD developments in recent years and now makes its LOD-friendly bibliographic records accessible on the Web. This means more than 290 million records in WorldCat can be retrieved as embedded RDFa or in other formats. OCLC LOD records are also available via content negotiation in four different formats: RDF/XML, JSON, text/turtle, and plain text.50 Although it is welcome news that anyone who wants to experiment with LOD in their library environment can freely harvest library LOD from OCLC, there is still little in the way of published guidelines or workflow descriptions regarding how OCLC transforms its massive bibliographic database into the LOD records it has published. Given the differences in how OCLC, BL, BnF, and other libraries have chosen to implement LOD for bibliographic metadata, it is also not clear if the OCLC approach is optimal. More research is needed and some convergence would be helpful.
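Content negotiation means that a client asks for a particular serialization of the same resource URI by sending an HTTP `Accept` header. The Python sketch below only builds such a request and does not contact any server. The mapping of format names to media types is an assumption based on common conventions (only `text/turtle` is named in the text above), and the `example.org` URI is a placeholder rather than a documented WorldCat address.

```python
from urllib.request import Request

# Assumed media types for the four formats mentioned above; a given service
# may expect different values.
MEDIA_TYPES = {
    "rdfxml": "application/rdf+xml",
    "json": "application/ld+json",
    "turtle": "text/turtle",
    "text": "text/plain",
}


def build_request(resource_uri, fmt):
    """Build (but do not send) a GET request asking for one serialization."""
    if fmt not in MEDIA_TYPES:
        raise ValueError(f"unsupported format: {fmt}")
    return Request(resource_uri, headers={"Accept": MEDIA_TYPES[fmt]})


# Placeholder URI for a bibliographic resource.
request = build_request("http://example.org/record/12345", "turtle")
```

Passing the resulting object to `urllib.request.urlopen` would perform the retrieval; whether the server honors the header, and which media types it accepts, would need to be confirmed against the service's own documentation.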

This leads to a third consideration, namely, that there are too many semantics options available for creating RDF representations of bibliographic records. Since the traditional carrier for library bibliographic records, MARC, is not suitable for LOD and the Semantic Web environment, early experimenters with library LOD often have developed their own namespaces and semantics when publishing their catalog records as LOD data sets (as discussed above). As a result, there are too many semantic sets used for library LOD data sets. No single semantic set seems sufficient for describing library bibliographic catalog records. (Ironically, we had a very similar conversation about metadata standards a decade ago when we tried to create metadata for digital resources, i.e., no single metadata standard was thought to work for everything.) Just as libraries and other cultural heritage institutions worked together to develop best practices and guidelines for quality metadata creation a decade ago (and continue to refine the approaches used), now we must work together to share our experience and knowledge to develop best practices and guidelines on how to make use of library LOD.

NOTES

1. http://www.hathitrust.org/
2. http://openlibrary.org/
3. http://www.exlibrisgroup.com/category/PrimoOverview
4. http://www.serialssolutions.com/en/services/summon/
5. http://link.springer.com/
6. http://archive.ifla.org/VII/s13/frbr/frbr3.htm
7. http://www.w3.org/RDF/
8. “Linked data” refers to linking resources on the Web (see: http://www.w3.org/standards/semanticweb/data). This process is facilitated if the links and data being linked are open to all, i.e., “linked open data.” In library-oriented discussions of linked data, openness is often assumed, as we do here. Most of the illustrations discussed in this paper also would apply to closed linked data systems, e.g., those involving copyrighted primary resources, although linking across closed systems can raise additional challenges that are not discussed here.
9. http://www.wikipedia.org/, http://dbpedia.org/
10. http://viaf.org/
11. http://id.loc.gov
12. http://www.oclc.org/research/activities/fast.html
13. http://id.loc.gov/authorities/subjects.html
14. http://www.worldcat.org/
15. http://www.bl.uk/bibliographic/datafree.html
16. http://www.w3.org/TR/xhtml-rdfa-primer/
17. http://www.w3.org/2005/Incubator/lld/XGR-lld-20111025/#Benefits_of_the_Linked_Data_Approach
18. http://www.w3.org/DesignIssues/LinkedData.html
19. http://www.loc.gov/standards/mods/modsrdf/
20. Researchers at Kent State University, bucking this common wisdom, have developed a utility for library use that converts MARC records to RDF-DC, which is then enriched with links using LOD authority and vocabulary services using a prototype wizard. http://lod-lam.slis.kent.edu/
21. http://purl.org/library/
22. http://www.bl.uk/schemas/bibliographic/blterms#
23. http://RDVocab.info/Elements/
24. http://bibliontology.com/
25. http://www.w3.org/2004/02/skos/
26. http://www.loc.gov/standards/mads/
27. http://umbel.org/
28. http://lod-lam.slis.kent.edu/about/default.html#junction
29. http://www.bl.uk/bibliographic/datafree.html
30. For more examples see the results of this query at http://datahub.io/dataset?q=library
31. http://www.worldcat.org/whatis/
32. http://www.loc.gov/bibframe/
33. http://metadataregistry.org/vocabulary/list.html
34. http://schema.org/
35. http://www.essepuntato.it/lode/http://purl.org/library/
36. In the 1800s in England, it was commonplace to publish novels in three volumes, driven in part by the model of circulating libraries in place at the time (http://archive.org/details/19thcennov).
37. http://archive.org/index.php
38. http://www.loc.gov/standards/mods/changes-3-4.html
39. We use the Library of Congress’s MARC to MODS Stylesheet available at http://www.loc.gov/standards/mods/v3/MARC21slim2MODS3.xsl
40. The detailed guideline of VIAF request types is available at http://www.oclc.org/developer/documentation/virtual-international-authority-file-viaf/request-types
41. The VIAF homepage (http://viaf.org/) lists the 24 different name authority files as of May 6, 2013.
42. http://purl.org/library/
43. http://www.w3.org/2012/pyRdfa/#distill_by_uri
44. http://www.exlibrisgroup.com/category/PrimoOverview and http://uofi-primo.hosted.exlibrisgroup.com:1701/primo_library/libweb/action/search.do?vid=UIU
45. http://www.iconclass.nl/work-in-progress
46. http://bibframe.org/tools/
47. http://www.oclc.org/content/dam/research/publications/library/2013/2013-05.pdf
48. http://bibframe.org/demos/
49. http://www.loc.gov/bibframe/pdf/marcld-report-11-21-2012.pdf
50. http://dataliberate.com/2013/06/content-negotiation-for-worldcat/

REFERENCES

Breeding, M. (2013). Linked Data: The next big wave or another tech fad? Computers in Libraries, 33(3) April 2013. Retrieved from http://www.infotoday.com/cilmag/apr13/Breeding-Linked-Data-The%20Next-Big-Wave-or-Another-Tech-Fad.shtml

Byrne, G., & Goddard, L. (2010). The strongest link: Libraries and Linked Data. D-Lib Magazine, 16(11/12). Retrieved from http://www.dlib.org/dlib/november10/byrne/11byrne.html

Cole, T., Han, M.-J., & Vannoy, J. (2012, June 10–14). Descriptive metadata, iconclass, and digitized emblem literature. In Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries, Washington, D.C., USA. http://dx.doi.org/10.1145/2232817.2232839

Coyle, K. (2012). Dispatches from the field: Populating the Semantic Web. American Libraries, July/August 2012. Retrieved from http://www.americanlibrariesmagazine.org/article/new-world-data

Gorman, M. (1990). Descriptive cataloguing: Its past, present, and future. In M. Gorman (Ed.), Technical services today and tomorrow (pp. 63–73). Englewood, CO: Libraries Unlimited.

Gracy, K., Zeng, M., & Skirvin, L. (2013). Exploring methods to improve access to music resources by aligning library data with Linked Data: A report of methodologies and preliminary findings. Journal of the American Society for Information Science and Technology. http://dx.doi.org/10.1002/asi.22914

Library of Congress. (2012). Bibliographic framework initiative. Retrieved from http://www.loc.gov/bibframe/

Online Computer Library Center. (2012). OCLC adds linked data to WorldCat.org. Retrieved from http://www.oclc.org/news/releases/2012/201238.en.html

Simon, A., Wenz, R., Michel, V., & Di Mascio, A. (2013). Publishing bibliographic records on the web of data: Opportunities for the BnF (French National Library). In P. Cimiano, O. Corcho, V. Presutti, L. Hollink, & S. Rudolph (Eds.), The Semantic Web: Semantics and big data (pp. 563–577). Berlin, Germany: Springer. doi: 10.1007/978-3-642-38288-8_38

Wallis, R. (2012, November 26–28). Why schema.org? Presented at Semantic Web in Libraries, Cologne, Germany. Retrieved from http://www.slideshare.net/rjw/why-schemaorg

Westrum, A.-L., Rekkavik, A., & Talleras, K. (2012). Improving the presentation of library data using FRBR and Linked data. Code4Lib, 16(2/3). Retrieved from http://journal.code4lib.org/articles/6424
