from metadata to metadata

16
8/6/2019 From Metadata to MetaDATA http://slidepdf.com/reader/full/from-metadata-to-metadata 1/16 14    L    i    b   r   a   r   y    T   e   c    h   n   o    l   o   g   y    R   e   p   o   r    t   s   w   w   w  .   a    l   a    t   e   c    h   s   o   u   r   c   e  .   o   r   g    J   a   n   u   a   r   y    2    0    1    0 Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle Chapter 2 Changing the Nature o Library Data Together these provide a new conceptual oundation but leave us with a key piece missing: how to express our data in a twenty-rst-century data ormat. For this we are given some direction in the report o the Working Group on the Future o Bibliographic Control: Desired Outcomes: Library bibliographic data will move rom the closed database model to the open Web-based model wherein records are addressable by programs and are in ormats that can be easily integrated into Web services and computer applications. This will enable libraries to make better use o networked data resources and to take advantage o the relationships that exist (or could be made to exist) among various data sources on the Web. 1 The report does not say how library data must change to make this mandate a reality. There will surely be more than one way to accomplish this goal, but a ew things are certain: the library catalog data must be transormed rom being primarily a textual description to a set o data elements to which machine processes can be applied; and these data elements must be compatible with the current mainstream technology that is the World Wide Web. One possible direction or library data is to join the linked data “cloud,” a growing set o data on the World Wide Web that many see as having great promise or a richer inormation uture. From Med o MeData In our current technology environment, all inormation goes through computers beore reaching a human being, so it is necessary to design our metadata to be data—that is, to give it the ability to be manipulated by computer Chapter Abstract  In our current technology environment, all inormation  goes through computers beore reaching a human being,  so it is necessary to design our metadata to be data—that is, to give it the ability to be manipulated by computer  programs. To keep pace with modern advances in tech- nology, the library catalog data must be transormed rom being primarily a textual description to a set o data elements to which machine processes can be applied; and these data elements must be compatible with the current mainstream technology that is the World Wide Web. This chapter o “Understanding the Semantic Web:  Bibliographic Data and Metadata” examines what steps the library community will need to take to acilitate this transormation. C hange can be dicult, and change within long- standing communities o practice can be particu- larly dicult. The rst hurdle is recognizing that change is necessary. The next is to understand the nature o the change: its goals, its possibilities, and the natural limitations that will inevitably move the eort rom an ideal solution to a more realistic one. The last challenge is to arrive at an agreement within the community on a change that will return a good value or the eort it requires. Among librarians, there has already been a real- ization that a change is needed when it comes to how libraries present their catalog data. This has been a topic o study and action or well over a decade. Such think- ing produced a new model or bibliographic data, the Functional Requirements or Bibliographic Data (FRBR), and a proposed new rule set or cataloging practice, Resource Description and Access (RDA).

Upload: american-library-association

Post on 08-Apr-2018

274 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 1/16

14

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

Chapter 2

Changing the Nature oLibrary Data

Together these provide a new conceptual oundation

but leave us with a key piece missing: how to express our

data in a twenty-rst-century data ormat. For this we are

given some direction in the report o the Working Group

on the Future o Bibliographic Control:

Desired Outcomes: Library bibliographic data will

move rom the closed database model to the open

Web-based model wherein records are addressable

by programs and are in ormats that can be

easily integrated into Web services and computer

applications. This will enable libraries to make better

use o networked data resources and to take advantageo the relationships that exist (or could be made to

exist) among various data sources on the Web.1

The report does not say how library data must change

to make this mandate a reality. There will surely be more

than one way to accomplish this goal, but a ew things

are certain: the library catalog data must be transormed

rom being primarily a textual description to a set o data

elements to which machine processes can be applied; and

these data elements must be compatible with the current 

mainstream technology that is the World Wide Web. One

possible direction or library data is to join the linked

data “cloud,” a growing set o data on the World Wide

Web that many see as having great promise or a richer

inormation uture.

From Med o MeData

In our current technology environment, all inormation

goes through computers beore reaching a human being,

so it is necessary to design our metadata to be data—that 

is, to give it the ability to be manipulated by computer

Chapter Abstract 

 In our current technology environment, all inormation

 goes through computers beore reaching a human being,

 so it is necessary to design our metadata to be data—that 

is, to give it the ability to be manipulated by computer 

 programs. To keep pace with modern advances in tech-

nology, the library catalog data must be transormed 

rom being primarily a textual description to a set o data

elements to which machine processes can be applied;

and these data elements must be compatible with the

current mainstream technology that is the World Wide

Web. This chapter o “Understanding the Semantic Web: Bibliographic Data and Metadata” examines what steps

the library community will need to take to acilitate this

transormation.

Change can be dicult, and change within long-

standing communities o practice can be particu-

larly dicult. The rst hurdle is recognizing that 

change is necessary. The next is to understand the nature

o the change: its goals, its possibilities, and the natural

limitations that will inevitably move the eort rom an

ideal solution to a more realistic one. The last challenge

is to arrive at an agreement within the community on

a change that will return a good value or the eort it 

requires.

Among librarians, there has already been a real-

ization that a change is needed when it comes to how

libraries present their catalog data. This has been a topic

o study and action or well over a decade. Such think-

ing produced a new model or bibliographic data, the

Functional Requirements or Bibliographic Data (FRBR),

and a proposed new rule set or cataloging practice,

Resource Description and Access (RDA).

Page 2: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 2/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

programs. In a sense, we did this with the MARC record in

the 1960s, but at that time the capabilit ies or processing

data were much, much less advanced than they are today.

It was a time beore keyword searching, beore data min-

ing, and beore the concept that inormation rom a wide

variety o heterogeneous sources would all intermingle

over a single, large network, the Internet.Libraries were among the rst institutions to use

computers to process text. In the 1960s, when the MARC

ormat was introduced, it was extremely unusual to pro-

cess elds o variable length and to process text as it is

normally written, using both upper- and lowercase, punc-

tuation, and even accented characters. Libraries devel-

oped ways to create mixed character sets with both Latin

and non-Latin characters years beore other communities

ound the need to do so. We are no longer alone, however,

in our need to process and manipulate text. The develop-

ment o Unicode, a single character set or all known lan-

guages and scripts, and XML, a data ormat that is fexible

enough to describe very complex texts, have brought us

into a world where text processing is no longer the excep-

tion in the computing world.

Today’s data design has to balance the unctionality

needed or machine processing with the understandable

inormation ormat needs o the human end user. It is

denitely not a matter o serving only the machine or

only the human reader, but o creating data that can serve

both. Compromises will oten have to be made. The cur-

rent version o library data, however, is not serving the

machine unctionality well, so our challenge is to bring

our data into the twenty-rst century or machine process-

ing and to improve service to our human end users by

being able to oer more unctionality in our systems. This

report presents a sample o some steps that can be taken

to accomplish this goal, but please keep in mind that this

is not a complete recipe or the uture o library data, just 

some o the ingredients.

D-y he D

The library catalog record is mainly a textual document.

It is true that this text is coded in elds and subelds in

the machine-readable record, but the physical basis o the

record is still primarily text. In essence, the MARC record

can be considered one o the rst text markup languages,

i not  the rst. The record has some elds with coded

data that were designed or the machine processing o 

that era, which needed data to be stored at a xed length

and with its contents as compact as possible. When the

MARC ormat was developed in the 1960s, the dierence

between the storage o “eng” versus “English” to describe

the language o a work was signicant in terms o system

capabilities. The xed-eld data overcomes some o the

“text-ness” o the primary bibliographic data. For exam-

ple, the date o publication in the publication statement 

can take dierent orms, such as:

1966 (a simple date)

c1966 (a copyright date)

[1966] (a date supplied by cataloger)

[1966?] (a date supplied by cataloger and uncertain)

In the xed-eld area, the ormat o the data is strictly

controlled as our characters, generally numeric. Each o 

these would be coded there as “1966”:

740813s1966 enkc b 000 0aeng

The punctuation and other inormation in the date

eld, while perhaps useul to human readers (at least to

those who know what they mean), are an impediment tomachine processing.

Many o the key elds o the bibliographic record,

however, are not available in a data ormat. One exam-

ple is the ISBN, a very important element in a number

o library operations, rom acquisitions to linking cover

images with the user interace. The ISBN is stored in a

subeld in the MARC record, but that subeld can con-

tain other inormation in textual orm:

9781416554950 (trade pbk.)

0817315497 (cloth : alk. Paper)

0415981484 (Hardcover : alk. Paper)

0847829413 (hbk.)

080327946X :

Although it is possible to select the ISBN itsel rom

this string using programming algorithms,* and all sys-

tems that use the ISBN in processing must do so, there

seems to be little reason not to provide the ISBN in a orm

that can readily be manipulated by machines, since that 

is how it will be used. What causes libraries to continue

with practices that aren’t appropriate to this day and age?

Habit, and the very important act that a large body o 

legacy data is a refection o those practices.

While the use o a data eld or a particular data ele-

ment may be a solution to the problem o text versus data,

one o the results is that many data elements in library

records are entered more than once in the same record, in

slightly dierent ormats. It is well known in the world o 

inormation technology that any time you store the same

inormation in more than one place, you risk those sepa-

*As an example, this is a line o code rom the Open Library project (http://openlibrary.org) that extracts the ISBN rom the MARC subeld using

a regular expression: re_isbn = re.compile(‘([^ ()]+[\dX])(?: \((?:v\. (\d+)(?: : )?)?(.*)\))?’)

Page 3: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 3/16

16

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

help catalogers create library data with a certain level

o consistency. This isn’t as useul as it could be in auto-

mated systems because the connection between the head-

ing in the bibliographic record and that in the authority

record are made on the basis o the display text in the

elds. Should the display text change, the link between

these two elements is broken, and systems cannot bring

them together. This loss o connection between the bib-

liographic and the name authority data can be remedied

by making use o identiers that can be read by machines.

Both bibliographic and authority records can containthis identier, and the display text can be changed as

needed without breaking the link. In other words, one

links through identiers, not through display text. Say

that you have bibliographic and name records that need

to link, and what they have in terms o data is shown in

gure 10.

I the name record changes, nothing links the two any

more, as shown in gure 11, because the display orm has

changed, and the display orm was also the linking string.

I one uses identiers or names, in addition to the

display orms and other common reerences in a name

authority record, as shown in gure 12, display orms can

change without breaking the link between the records.The bibliographic record now needs to update its dis-

play orm, but it can do so using the shared identier.

Although the two records are showing dierent display

orms, the link between them is not broken.

In addition, some areas o our data may appear the

same in display, but have dierent coding in the underly-

ing data record. Because o this dierence in coding, they

will be considered dierent by machines. For example, the

data elements o what libraries call a “main entry” (usu-

ally a person or corporate body that creates a resource)

rate versions o the data alling out o sync. For example,

someone may discover that the date has been entered

incorrectly in a record, as “c1964” instead o “c1965.”

That person may correct the display orm o the data in

the publication statement, but could easily orget that 

there is also a coded orm in the xed-eld area. Adding

to this possibility is the act that in all systems that I have

seen, these two data elements are not near each other in

the user interace used by the cataloger.

The method o adding some data elds to what is

essentially a textual document (the catalog entry) mayhave been appropriate in the middle o the twentieth cen-

tury, but twenty-rst-century computing provides us with

better solutions. Those solutions allow or the coding o 

data or machine use without sacricing service to the

human user. The use o authority-controlled headings in

library data is a good example o where a small change in

how we store our data could greatly increase the machine

capabilities in relation to library records.

Ideniy he D

Some o the inormation in the library record will, o necessity, be text. The concept o authority and control

and headings in library data, however, means that even

many o the text elds are not simply ree text but have

structure and are controlled as to their content. These

headings are oten the primary access points that cor-

respond to the inormation that users have when they

approach the library: authors, titles, and subjects. There

are separate records in our systems or some o these

elements, records that contain additional inormation

needed to provide entry vocabulary or the user and to

Figure 10Sample bibliographic and name records, linked.

Bibliographic record Authority record

Smith, John J. Smith, John J.

Page 4: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 4/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

Figure 11

Sample bibliographic and name records; link broken due to name change.

Figure 12Sample bibliographic and name records with identifers; links remain.

Bibliographic record Authority record

ID 3536 ID 3536

Smith, John J. Smith, J. J.

See rom: Smith, John J.

Bibliographic record Authority record

Smith, John J. Smith, J. J.

See rom: Smith, John J.

are subdivided into numerous subelds in one part o the library record, but given as a single string in other

areas o the record. Sometimes the type o entity is speci-

ed (person, corporate body), other times it is not. These

inconsistencies are not visible in the user display o the

bibliographic data, but i the data is to be used consis-

tently in automated unctions, these dierences need to

be overcome. Some o them can be overcome with pro-

gramming, and some, unortunately, cannot. The use o 

identiers or the intended text orms removes the ambi-

guity or machine processing and allows algorithms to

know immediately when the bibliographic data is reer-ring to a particular name entry.

Exi he Dbse

In the latter hal o the twentieth century, the primary

data model was that o the database, and in particular the

relational database. Nearly all systems that made use o 

the data in libraries used database management systems.

The primarily textual nature o the library bibliographic

Page 5: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 5/16

18

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

record made it less suitable or the kind o organization

that database management systems do best. Database

technology was designed or a dierent kind o data, less

textual and more compact, with ewer data elements and

less o a range o content in those elements. Database

technology is designed to retrieve. For example, it can

retrieve all o the invoices that contain a particular prod-uct code. Database management systems work best in

environments with a lot o repetition o data values.

Library data is less data-like than the business data

or which database management systems were designed.

For example, most titles o works are unique, and most 

authors appear only once in any given database. Much o 

the eciency o database management is lost when work-

ing with data o this type. It is also dicult to make use

o some o the eatures o database management systems,

at least in a way that would be ecient. Library data was

designed to be ordered alphabetically by very long text 

strings, something that database management systems

do airly poorly. Similarly, the systems oten have limits

in terms o the length o a string that they can index.

Keyword indexing opened up a whole new way to access

bibliographic data, but the vast number o individual key-

words that the catalog o a large, multilingual library col-

lection will produce can easily overwhelm a traditional

business database product.

In addition, the record ormat that we are using,

MARC, is unlike the ormat o any other community,

and that means that those creating library systems have

to write specic programs to make the data t in the

standard business database world. It is rather ironic, but 

our data is so much more textually sophisticated than

most business data that we simply cannot use standard

business sotware in our systems. Library systems have

to be built rom the ground up, taking a great deal o 

system developer resources and adding to the cost o 

the systems.

Today, many more communities produce and manipu-

late sophisticated textual data in a machine-readable orm,

and the technology has been developed to make working

with that data easier. We now have the means to design

and store our data that are not limited to libraries, but are

part o the mainstream o computing. The creation o an

XML ormat or MARC, using Unicode, is a step toward

the transormation o library-specic data into somethingthat could be used by anyone, anywhere in an inorma-

tion system. The problem, as has been noted by many,

is that MARC is still primarily a textual document,2 and

MARCXML provides the kind o markup o that text that 

would guide displays. A more radical transormation o 

library data is necessary i we are to move rom database-

managed search and display into an interactive use o 

library data on the Web. Why on the Web? The answer

is rather simple: because that’s where our users are. In

addition, that is also where many people are creating and

working with bibliographic data, and they are doing so

without the benet o the great wealth o bibliographic

knowledge that has already been by created in libraries.

Semnic Web, Linked D, nd

Oher New technologies

As is oten the case, each era has a technology trend that 

is put orward as the solution to every problem. In act,

these technology trends usually have merit in spite o the

surplus o hype that surrounds them. In addition, the

hype almost never mentions any deects or downsides o 

the currently hyped concept. The important thing is to

look beyond the hype to the actual value that the tech-

nology provides, and to what the technology can be in

 practice. Every new technology solves some problems,

but not all o them, and there is always some room or

improvement. This is certainly the case in this moment 

in time. The Semantic Web is currently the “favor o the

month” as ar as technology goes.

The basic concept behind the Semantic Web is not 

as mysterious as it sounds. Today’s Web is a web o doc-

uments that link to each other. This was the idea that 

led Tim Berners-Lee to create the World Wide Web: the

need or scientists to put their documents on the Internet 

and to create links so that the documents could link to

one another, thus creating a web o documentation, not 

unlike the amed memex o Vannevar Bush.3 The links

generally go rom a point in one document to another

document, with some pages consisting almost entirely o 

links that serve as entry points to collections. The links

themselves, however, are not very inormative: they have

no meaning beyond link. They do not explain why you

have linked, nor what the link itsel could mean. A link

could be a citation that supports a quote, it could be doc-

ument that gives urther inormation, or it could be criti-

cal (“Whatever you do, don’t believe what this person says

here!”). Note also that links go in only one direction, so a

document that has links to it is not aware o those links.

Such links can sometimes be discovered through search

engines, but the Web itsel does not provide or their easy

discovery or use.

At the time that he “invented” the Web, Tim Berners-

Lee’s online world was a very narrow one o research-ers at the European Organization or Nuclear Research

(CERN) in Geneva. It is probable that at that time he had

no idea o the types o inormation resources that might 

eventually be ound on the Web, resources that have noth-

ing to do with scientic papers. There is almost no type

o human endeavor that you can’t nd today on the Web,

and it has evolved into a somewhat messy, heterogeneous

source o inormation o all kinds.

Web pages take orms that have little i anything to

do with the ormat and content o a scientic article. What 

Page 6: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 6/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

is clear, though, is that there is a lot o inormation on

the Web that is embedded in text. This text is the object 

o search engines, which extract words so that Internet 

users can nd the pages or texts based on the words

within them. This is very dierent rom the library cata-

log, which creates an entirely new text—a catalog record—

to represent the actual content o the document. Websearching operates directly on the text o the document 

itsel. But in many ways, words are problematic as inor-

mation resources. They can be ambiguous (e.g., Pluto the

Disney character versus Pluto the orbiting body). They

can be incomplete inormationally, since many concepts

require more than one word (e.g., solar energy, ancient 

Rome). They are language-based, so a search on computer  

does not bring up documents with the term ordinateur. 

And o course keyword searching alls prey to dierences

in spelling (ber versus bre) and errors in spelling or

typography (history versus histroy).

All o this could lead one to conclude that something

else is needed, something that helps us nd inormation in

the great wealth o expression that is the World Wide Web.

For this, Tim Berners-Lee and colleagues at the World

Wide Web Consortium have proposed the Semantic Web

as one solution.4 The Semantic Web would, in essence,

make it possible to mine the World Wide Web or inorma-

tion that can be ound within the many pages o the Web.

It creates a new layer on the World Wide Web, a web o 

inormation ound in the web o documents. Essentially,

the idea is to evolve rom a web o documents that link to

each other to a way to connect, search, and make use o 

data in those documents.

To do this, it is necessary to be more precise in how

we code certain parts o our texts. In particular, we need

ways to make our texts more usable by machines. It may

seem obvious to a person reading a text that a statement 

like this one:

Herman Melville was an American author in the mid-

1800s and wrote numerous literary works, among

which the most amous is Moby Dick, a book based in

the whaling culture o New England.

that the acts contained in here are

•HermanMelvillewasanAmerican.

•HermanMelvillewasanauthor.•HermanMelvilleisnolongeralive.

•HermanMelvillewrote Moby Dick.

• Moby Dick is a book.

•HermanMelvillewroteotherthingsaswell.

• Moby Dick, the book, has something to do with whal-

ing and New England.

• Moby Dick, the book, is amous.

Unortunately, none o this is understood by a com-

puter, nor can one easily program a computer to retrieve

these acts rom the text as understood by a person.

Teaching computers to understand simple utterances has

been the goal o the discipline o articial intelligence or

over hal a century, with disappointing results. The idea

o the Semantic Web is to make it possible to identiy the

data in Web documents and make that data usable as aweb o inormation.

It might occur to the reader at this point that library

catalog records do contain some o the inerences listed

above in a coded orm. In act, library data could help the

Web understand its own hidden data. To do so, however,

it will be necessary to ollow Semantic Web standards

or data ormats. First, we need to understand what the

underlying structures o the Semantic Web are.

Semnic Web

The Semantic Web is about metadata designed to be used

by computers. The act that this is called “semantic” is

more than a little conusing since we know that computers

are totally a-semantic—that is, that they can calculate, move,

rearrange, and sort, but totally lack understanding. They

have what Bruce Sterling has termed “the truly proound

stupidity o the inanimate.”5 The commonly understood

denition o  semantics is “meaning o words.” That is not 

the case with the Semantic Web. The Semantic Web uses

the term as it is dened in an area o mathematics known

as ormal languages. For example, computer programming

languages use ormal semantics that dene the set o possi-

ble computations that the language can perorm, like addi-

tion, subtraction, and division. In this sense o the term,

 semantics means a calculatable syntax, such as:

i A = B, and A = C, then B = C

In ormal languages, you model rules that allow you

to determine i something is or is not true within the de-

nitions o the model. The result is a mathematical truth,

not a human semantic truth, and what the computer does

has nothing at all to do with any meaning that humans

may impart to A, B, or C. I the data read

ANIMAL = DOG and ANIMAL = CAT

the ormal language would conclude

Thereore DOG = CAT

This is obviously a alse statement to a human reader

(and one that both cats and dogs would nd insulting),

but it would be true to the computer under the rules o 

its language. This serves, thereore, as an illustration o 

another well-known computing concept: GIGO, or “gar-

bage in, garbage out.” The computer will perorm its oper-

ations even when the data itsel is absurd, and it will be

unaware o the absurdity. Quality results come only rom

Page 7: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 7/16

20

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

quality data, and quality data has meaning in the human

sense. Computers perorm operations based on rules, not 

on meaning in the human sense o that term.

The data model or the Semantic Web is dened in the

documentation or the Resource Description Framework

(RDF). RDF is devilishly dicult to understand in its or-

mal, mathematical denition, but is actually quite simple

at a conceptual level. It denes a set o rules or the ormalsemantics o metadata that is meant or the elements and

structure o metadata that will be able to operate on the

Semantic Web. Very simply put, in the Semantic Web all

data consists o things and relationships between them,

with the smallest unit being a statement o the orm:

a thing → with relationship to → another thing

This is o course an oversimplication, but it is the

basic concept to keep in mind when working with data in

this new environment. You may see this concept expressed

using diagrams, such as the one shown in gure 13, rom

the RDF Primer document:

The Semantic Web elements and rules are simple, yet,like atoms, they can be combined into very complex work-

ing units. Some o the basic components are resources

and classes, properties, and values.

Resources

The Semantic Web does not limit what it can describe.

Essentially any thing or concept can be part o a Semantic

Web statement. Even relationships can be considered

things. To talk about the Semantic Web in a general way,

documents (and users) tend to use the term resource as

a neutral way o reerring to what Semantic Web state-

ments can be about.

Classes, Properties, and Values

It is easy to dene the three Semantic Web terms classes,

 properties, and values by using analogous terms rom

inormation technology, but be aware that the Semantic

Web has very specic denitions that go beyond the com-

mon understanding o these terms.

Classes

A class is much like a class in a scientic taxonomic sense:

it is a grouping o like resources that all belong together

based on some common characteristics that make them

members o the same set. An example o a class is “vehi-

cles,” where members o the class are “cars,” “trains,” and

“planes.” But you could also dene a class called “things

I own,” and this class can include “car,” “computer,” and

“dog.” In other words, there isn’t a universal class, just the classes that are useul or your metadata.

Properties

A property is what we oten think o as a data element in

metadata. It’s an element o description or the resources

Links to W3C’s documents on RDF www.w3.org/RDF

Figure 13From Frank Manola and Eric Miller, eds., “RDF Primer,” World Wide Web Consortium website, Feb. 10, 2004, fg. 4, www.w3.org/TR/2004/REC-rd-primer-20040210/#fgure4 (accessed Nov. 17, 2009). Copyright © 2004 World Wide Web Consortium

(Massachusetts Institute o Technology, European Research Consortium or Inormatics and Mathematics, Keio University). All

Rights Reserved. http://www.w3.org/Consortium/Legal/2002/copyright-documents-20021231.

http://www.example.org/index.html

http://www.example.org/stafd/85740

http://www.example.org/terms/name http://www.example.org/terms/age

http://www.example.org/dc/elements/1.1/creator

John Smith 27

Page 8: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 8/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

your metadata addresses. I your metadata describes cars

on a car lot, then properties can be “model,” “color,” and

“price.” I our class is “books in a library,” then we will

have properties like “title,” “subject,” “location,” and

“accession number.”

ValuesValue as dened in RDF is the actual content o the data

element. So i “color” is a property, then “Dyno blue

pearl” is a value—at least it is when you are creating meta-

data or a Honda Civic. When it comes to library materi-

als, where “title” is a property, values will be things like

“Harry Potter and the Chamber o Secrets,” “Brave new

world,” and “5 piano concertos.”

An important concept in the area o values is that o 

uncontrolled and controlled values. The Semantic Web

uses the terms “literal” and “non-literal” or these, but I

will use the more common library terminology here or

the sake o communication. These concepts are ones that 

the library world has embraced since the beginning o 

modern cataloging practice, but are newly discovered or

some nonlibrary communities that are creating metadata

or the rst time.

In library metadata we have some elds that can

take any string o characters. Among these are the title,

various notes, and other elds whose content is mainly

textual and unpredictable. These elds thereore have

uncontrolled values in them. Many elds, however, includ-

ing some elds that are textual in nature, have controlled

values. All o the elds that comprise the catalog record’s

headings are in some way or another under control. They

are oten taken rom authority records, where a decision

has been made as to the preerred display and its or-

mat. Other elds, primarily the coded elds in the MARC

record, select rom a list o possible values, such as the

codes or “music orm o composition” or “target audi-

ence.” These elements have a nite list that catalogers

must choose rom.

Values may also be controlled as to orm as well as to

content. Library data has ew o these, but may embrace

more as catalog metadata becomes more data-like. For

example, metadata oten restricts the ormat o date and

time so that it can be processed by a machine. The MARC21

record does this in the 005 eld, where the date and time

are recorded according to “Numeric Representation o Dates and Times” as dened by the standard ISO 8601,

the international standard covering the exchange o date-

and time-related data.6 The standard denes the structure

o the data element (YYYY-MM-DD, where Y is a year digit,

M is a month digit, and D is a day digit) and exactly which

characters are valid or each portion o the date. Other

structured elements, although less obviously so, are the

ISBN, which has a language group segment, publisher seg-

ment, and a book segment; and the LCCN, ormatted as

the year ollowed by an accession number.

Idenifcion nd URIs

A key element o the Semantic Web is to identiy our

things and relationships in a way that can be understood

by machines. This means that every thing that we wish

to work with has to have an identier that distinguishes

it rom any other thing. There are many kinds o identi-ers, rom plain numbers to complex alphanumeric strings.

The primary rule or the Semantic Web is that identiers

need to be in the orm o a Uniorm Resource Identier,

which is a particular orm o identier. We don’t need to

go into the structure o URIs because it turns out that the

common Uniorm Resource Locator, URL, is in URI or-

mat and is the preerred identier to use on the Semantic

Web. This means that URLs can be used as identiers

as well as locations, which can sometimes be conusing.

Among the advantages o using URLs, however, is that 

they can easily be used to return inormation about the

thing being identied using HTTP (Hypertext Transer

Protocol). HTTP is used by browsers and other programsto retrieve documents on the Web and thereore needs no

additional programming or this approach to work. In the

case o an identier, what you might retrieve is a descrip-

tion o what is being identied, whereas when the URL is

a location, you are directed to the document or to a site

that is at that location.

Tim Berners-Lee, the “ather o the World Wide Web”

and one o the primary originators o the concepts o the

Semantic Web, says this about the use o URIs:

An inormation object is “on the Web” i it has a URI.

Objects which have URIs are sometimes known as

“First Class Objects” (FCOs). The Web works best when any inormation object o value and identity is

a rst class object. I something does not have a URI,

you can’t reer to it, and the power o the Web is the

less or that.7

The URI is what allows a resource to be a thing on

the Web and to be actionable on the Web. I you want to

link to something, it has to have a URI. I you want to

locate it, it has to have a URL (which is also a URI). To

give library metadata a presence on the Web, we will have

to rst give it an identity, in the orm o a URI.

Identication actually takes place on two levels, that 

o the property (i.e., data element) and where possible,the values.

Properties are identied along with a denition

o their meaning (in human terms) and any rules that 

are applied to the use o the property. For example, the

Dublin Core (DC) community has created identiers that 

begin with http://purl.org/dc/terms/ or each o the DC

terms, and includes denitions or each o the terms at 

that site. Some terms, such as http://purl.org/dc/terms/ 

date, speciy the type o value that can be used with the

term, in this case, values in the orm o ISO 8601.

Page 9: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 9/16

22

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

There is not currently a standard or the denition o 

properties. The requirement that they be RDF-compliant 

imposes some constraints, but displays o property deni-

tions on the Web can vary. Here is one example o such a

display or the term “title” rom the Dublin Core Metadata

Terms:8

Term Name: title

URI: http://purl.org/dc/terms/title

Label: Title

Defnition: A name given to the resource.

Type o Term: Property

Refnes: http://purl.org/dc/elements/1.1/title

Version: http://dublincore.org/usage/terms/ history/#titleT-001

As o December 2007, the DCMI Usage Board is leav-

ing this range unspecied pending an investigation o 

options.

While not exactly a thrilling read, the basic inorma-

tion is in understandable text. In contrast, the machine-

actionable orm o the term named title looks like this:

<rd:Description rd:about=”http://purl.org/dc/ 

terms/title”>

<rds:label xml:lang=”en-US”>Title</ 

rds:label>

<dcterms:description xml:lang=”en-

US”>A name given to the resource.</ 

dcterms:description>

<rds:isDenedBy rd:resource=”http://purl.

org/dc/terms/”/><dcterms:issued>2008-01-1

4</dcterms:issued><dcterms:modied>2008-0

1-14</dcterms:modied>

<rd:type rd:resource=”http://www.

w3.org/1999/02/22-rd-syntax-

ns#Property”/>

<dcterms:hasVersion rd:resource=”http:// 

dublincore.org/usage/terms/ 

history/#titleT-001”/>

<skos:note xml:lang=”en-US”>

In current practice, this term is used primarily with

literal values; however, there are important uses with non-

literal values as well.*

As o December 2007, the DCMIUsage Board is leaving this range unspecied pending an

investigation o options.

</skos:note>

<rds:subPropertyO rd:resource=”http:// 

purl.org/dc/elements/1.1/title”/>

</rd:Description>

In some cases, what data will populate a eld is unpre-

dictable, such as with titles, tables o contents, or ree-orm

notes. In other cases, when the values or a property come

rom a controlled list, such as a list o language codes, each

value can be represented by either a string o text or a URI.

For example, a list o colors could be dened this way:

red

yellow

blue

The list would be identied with a URI, such as

http://example.com/colorList, and anyone using that list 

would identiy the list using the URI. For greater preci-

sion, each color could have its own URI:

http://example.com/colorList/1

http://example.com/colorList/2http://example.com/colorList/3

For these identiers to be useul, they need to be

documented, preerably in both a machine-actionable and

a human-readable orm, as we saw above with the Dublin

Core denition o “title.” Documentation can include

denitions and display orms, and both preerred and

alternate displays. One important advantage o the use

o identiers, rather than natural language, is that the

documentation can include display orms and other inor-

mation in any number o languages.

http://example.com/colorList/1

<label lang=en> red

<label lang=r> rouge

<label lang=jp>赤

<label lang=de> Röte

Applications can then use the language appropriate

to their audience.

Dublin Core Metadata Termshttp://dublincore.org/documents/dcmi-terms

*Note that the Dublin Core community uses the RDF terminology o “literal” and “non-literal” in its documentation.

Page 10: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 10/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

Creation o URIs or some o the many controlled

lists used in library catalog data has begun. The Library

o Congress has created a site or the identication o 

the value lists under its control at http://id.loc.gov and

has dened over three hundred thousand entries rom the

Library o Congress Subject Heading list. In anticipation

o uture Semantic Web usage, all properties, relation-ships and value lists related to the Resource Description

and Access (RDA) rules have been registered in an RDF-

compatible ormat and given unique identiers under the

domain http://RDVocab.org. These are thereore now

available to be used in Web applications that conorm to

the Semantic Web data ormat.

Creing nd Using Idenifers

The act that some string o characters identies some

thing does not necessarily make it a good identier. There

are many poor identiers in use—identiers that willnot last as long as the thing they identiy, that are not 

unique in the environment in which they are being used,

or that have been misapplied in practice. When working

with identiers, one must answer some basic questions in

order to nd the appropriate identier and use it wisely.

Among these questions are

•Whatdoestheidentieractuallyidentify?

•Who created theidentier, and isthat creatorthe

appropriate authority or my purpose?

•Whatisthemaintenancecommitmentfortheidenti -

er?

•What isthe context within whichthis identierisunique?

•Howtrustworthyistheidentier?

I’ll use a very common identier, the ISBN, to illus-

trate the issues these questions bring up.

•What does the ISBN identiy? It identies a particu-

lar product rom a particular publisher. When the

product changes signicantly (and especially i that 

changes what physical object the publisher will actu-

ally move around the planet or its price), the pub-

lisher creates a dierent ISBN or that new product.

The ISBN has some standards that address how one

denes “new product” so there is some consistency

across publishers, but obviously also some variance.

For digital books, there is disagreement among pub-

lishers on how to use the ISBN properly and whether

book publishers must assign separate ISBNs or each

e-book ormat.

•Who created the identifer? The ISBN is assigned to

the product by the publisher that is producing and

distributing the product. The publisher is the correct 

one to make this decision. In addition, the ISBN is

an industry standard and is managed by a series o 

agencies around the world. It has standards or appli-

cation.

•What is the maintenance commitment? This is not 

made explicit. Since the ISBN is printed on book cov-

ers and oten also on the verso o the title page, the

identier will endure as long as there is a copy o thebook in existence. For e-books, it isn’t clear i or how

the ISBN will be associated with the book product 

over time.

•In what context is the ISBN a unique identifer? 

The ISBN is unique within the work fow o the book

trade. However, because the ISBN is merely a set o 

digits (with the occasional X in the check digit place)

it must always be clearly identied as an ISBN to

be considered unique. In a larger context, such as

a warehouse database with many product numbers,

the ISBN could easily be conused with another iden-

tier. In a library database, an ISBN could overlapwith another identier, like the OCLC number. On

the open Web, the ISBN denitely would have little

use i not coded as an ISBN.

The Internet Engineering Task Force (IETF) has

designated a “name space” or the ISBN that takes

the URN ormat: urn:isbn:<isbn string>.9 A name

space is a unique naming area that can be used only

or that entity. In this way, the ISBN can be moved

reely around the Internet without losing its identity.

That’s the good news. The bad news is that URNs

are not commonly used in the real Web, so most 

applications have their own designator or the ISBN,

and there are dierent orms in use. In other words,we do not have a consistent way to treat ISBNs as

identiers on the Web.

•How trustworthy is the ISBN? Anyone who has

worked with large bodies o bibliographic data has

seen some o the problems o misapplied ISBNs. The

vast majority o ISBNs work as designed: they iden-

tiy a particular publisher product. There are cases

where the wrong ISBN is assigned to a book or the

wrong ISBN is entered into the metadata or a book.

I have personally seen cases where a publisher, gener-

ally a very small publisher, reuses ISBNs. What this

means is that or some unctions, it may be necessary

to use another data element, such as the title o the

book, in addition to the ISBN as a way to catch these

erroneous identiers.

In all systems, there will be internal identiers—

which are oten never seen by users o the system—that 

have meaning only within that system. As part o its unc-

tioning, a database management system assigns internal

identiers to every table and every eld. These are rarely

useul or anything other than the internal workings o 

that system. The identiers that interest us or metadata

Page 11: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 11/16

24

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

are those that will have use in a larger context. Returning

to our ISBN example, the ISBN is used throughout the

book publishing supply chain, rom the publisher to the

individual bookstore. It has the advantage o being well

known and available through a number o sources.

As we move into a more data-ocused Web, agencies

that maintain data are being encouraged to create identi-

ers or their data and controlled vocabularies. It should

be the maintenance agency that provides the identier,

and the identier should be given a name that only that 

agency can provide. This is why this practice is advisable

or agencies that use their registered domain name as therst part o the identier. Because no two agencies can

own the same domain name on the Web, the identier is

guaranteed to be unique in that context. Two agencies

that have identiers that are six-digit numbers would cre-

ate identiers that look like this:

http://agency1.com/123456

http://agency2.com/123456

Without the domain names, the numeric identiers

would be the same, even though they point to entirely

dierent things within the agencies’ schemes. With the

domain names, each one is unique.

A controversial area o identiers is whether it is best 

or them to be semantically meaningul to humans or not.

An example o a meaningul identier is the one used

by Wikipedia or its articles. The URL that identies the

article includes the words used in the title o the article,

and thereore is somewhat understandable to a person:

http://en.wikipedia.org/wiki/Melvil_Dewey

This practice works well until you have more than

one article with the same name, at which point you need

to add something to disambiguate the term:

http://en.wikipedia.org/wiki/Pluto_(mythology)

http://en.wikipedia.org/wiki/Pluto_(Disney)

Disambiguation is one o the main unctions o name

authority control in libraries. The LC Subject Heading

le gives each o these two Plutos a dierent name and a

dierent identier:

Pluto (Greek deity) http://id.loc.gov/authorities/ 

sh85103581

Pluto (Fictitious character) http://id.loc.gov/ 

authorities/sh96010495

Disambiguation becomes more dicult and less

meaningul as the number o items with the same name

increases. For names o persons, there are oten very many

instances o the same name. The library name authority

le, where possible, disambiguates with dates o birth

and death. This, too, becomes less useul over time.—or

instance, between these two:

Fitzgerald, Michael, 1955

Fitzgerald, Michael, 1955 June 11

Only amily and dear riends will have the inorma-

tion necessary to pick out the person they are seeking in

the library catalog. Yet, because these two individuals are

separately identied, they are dierent.

The subjects or names may conuse human readers,

but once they are assigned identiers, machines will have

no problem knowing the dierence. In this case, Pluto

(Greek deity) is not a display orm o the name since LCSH

uses instead “Hades (Greek deity)” with “Pluto (Greek

Deity)” as a cross reerence. Both, however, are identied

with http://id.loc.gov/authorities/sh85103581 and will

be considered the same subject when used in a Semantic

Web context. In the same way, the various persons named

Michael Fitzgerald will each be given a unique identier

that will make them clearly distinguished or any machine

processes.

URIs nd URLs

The general wisdom among developers o the Semantic

Web unctionality is that the best identiers are URIs

(Uniorm Resource Identiers), and the best URIs are

URLs (Uniorm Resource Locators). Because URLs are

URIs (but with the specic purpose o providing a Web

location or something), URLs can be used as identiers

as well as locations. Among the advantages o using URLs

is that they can easily be used to return inormation about 

the thing being identied, as using the HTTP (Hypertext Transer Protocol) directs browsers and other programs

to retrieve documents on the Web. In the case o an iden-

tier, what you might retrieve is a description o what 

is being identied. The machine unctions will work well

with a string o characters that have little or no meaning

to a human, yet there can also be human-riendly data

attached to the identier, and it can be ound by ollowing

the identier to its Web location.

An example is a Semantic Web language called Friend

o a Friend (FOAF). This language allows you to describe

Internet Engineering Task Forcewww.iet.org

Links to registered RDA element sets and value vocabularieshttp://metadataregistry.org/rdabrowse.htm

Page 12: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 12/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

people, particularly in a social networking environment.

FOAF has elds or the common personal elements used

in social networking sites like name, sex, birthday, and

e-mail address. An identier or a person could return this

FOAF inormation in a display so that human users could

see inormation that means more to them than an identi-

er. An example rom the FOAF documentation shows aperson’s name and his work home page address:

<oa:Person>

<oa:name>Dan Brickley</oa:name>

<oa:workplaceHomepage

rd:resource=”http://www.w3.org/”/>

</oa:Person>

This inormation could be located at a URL that is

used to identiy Dan Brickley as a member o the W3C (the

World Wide Web Consortium), or example: http://w3c.org/ 

people/dbrickley. Whenever one reers to Dan Brickley in

Semantic Web data, one could use this identier.

An example rom the library world is the online ver-

sion o the Library o Congress Subject Headings that has

been made available by the Library in linked data ormat.

There is a separate identier or each entry in the subject 

authority le, about 350,000 total. Because the identier

is also a URL, the Library has placed inormation about 

the subject heading at that location and can display it 

in ormats or human readers or or programs. When a

machine process requests the data, it is returned in one o 

several possible ormats, depending on what was specied

in the request. Here is a ragment rom the entry or the

subject heading “Semantic Web” as it would be returned

to the requester in a ormat called RDF/XML:

<rd:Description rd:about=”http://id.loc.gov/ 

authorities/sh2002000569#concept”>

<rd:type rd:resource=”http://www.

w3.org/2004/02/skos/core#Concept”/>

<skos:broader rd:resource=”http://id.loc.gov/ 

authorities/sh2004000479#concept”/>

<skos:preLabel xml:lang=”en”>Semantic Web</ 

skos:preLabel>

</rd:Description>

Obviously this is not human-riendly, but the ull

record that is returned in this ormat can also be dis-

played to a human, using the same underlying data, as

shown in gure 14.

The ability to return inormation based on the iden-

tier is one way to extend the Semantic Web, both or

machine processing and or the human user. A computer

program can use some o the data provided in machine-

readable orm to make additional connections. For exam-

ple, the alternate orms o terms in LCSH or the broader

and narrower terms can extend

a vocabulary search. The e-mail

address in a FOAF record could

allow an application to oer the

option to send an e-mail to the

person who is identied. In the

end, though, it is people who

design and write the programs

that are used by computers, and

that human understanding is

key to the development o inor-

mation resources. The tool that 

does the heavy liting, the com-

puter, is stupidly inanimate.

O course, it is not reason-

able to assume a single system

o identiers or everything onthe Web. Undoubtedly, dierent 

communities will assign identi-

ers o their own, some overlap-

ping with those o another com-

munity. “Switching stations”

will be needed that gather

identiers that are equivalent 

or nearly so. An example o 

this is the Virtual InternationalFigure 14Catalog search or “Semantic Web”

Page 13: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 13/16

26

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

Authority File under development. This le registers

equivalent orms o the same name as used by dierent 

national libraries. Each library can assign its own identi-

er, and through the VIAF it will be possible to learn

that one library’s Tolstoy, Aleksey Konstantinovich,

  gra, 1817–1875 is another library’s ТОЛСТОЙ ,

АЛЕКСЕЙ КОНСТАНТИНОВИЧ , ГРАФ , 1817–1875.They both are identied, however, with http://via.org/ 

via/20473541.

How Idenifers nd RelionshipsMke he Web Semnic

So ar we’ve seen what identiers are, but how do they

create a Semantic Web? In its simplest orm, identiers

allow a computer to act on the elements o Semantic Web

statements without understanding their human meaning.

For example, suppose you have two things (a circle and

a box), and a relationship (arrow), as shown in gure 15.

Then you can ask the question shown in gure 16.

An automated system can answer with Circle 

without in any way knowing what any o this means.

This doesn’t seem to be much, but i  Circle represents

“Herman Melville” and Box represents “ Moby Dick, or 

The Whale,” and the arrow represents the relationship

“is author o,” then the question can be translated to

human language as “Who wrote   Moby Dick, or The

Whale”? And the system can provide the answer: Herman

Melville (gure 17).

You can also ask, “Who wrote Moby Dick?” and get 

the answer “Herman Melville.” With more statements,you can get more than one answer.

Now you can ask, “What books did Herman Melville

write?” and get back a list o two, as shown in gure 18.

Because the Semantic Web would be operating on

identiers, the correct (but much more awkward) original

statement would be:

<identier or Herman Melville> <identier or “is

author o”> <identier or “Moby Dick or The

Whale”>

In actual code it could be something like this, using

the LC Name Authorities identier or Melville, the

RDVocab property or the author role, and the WorldCat 

identier or the book:

<http://id.loc.gov/authorities/n79006936>

<http://rdvocab.ino/roles/author> <http:// 

www.worldcat.org/oclc/25788271>

As more inormation is added to the Semantic Web,

a greater variety o questions can be answered. The state-

ments create a web o inormation that can answer ques-

tions like “What other books were published by the same

publisher in the same year as this edition o  Moby Dick?”

as shown in gure 19.

The extraction o meaningul inormation rom the

web o data requires a method to query the Web in a

structured way. The standard or querying databases is a

language known as Structured Query Language (SQL).

The linked data community has developed a query lan-

guage, called SPARQL (pronounced “sparkle”), that oper-

ates similarly on the Semantic Web. SPARQL is relatively

new, and the tools available or using it are suitable pri-

marily or knowledgeable developers. However, we can

Figure 15Circle → Box.

Figure 16?→ Box.

Figure 17Circle and box with values answering one query

Figure 18Circle and box with values answering two queries

Page 14: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 14/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

expect that as the Semantic Web progresses, there will

be interaces that allow more o us to experiment with

querying the Semantic Web and the great variety o data

that is in the linked data cloud.

Key Chrcerisics o heSemnic Web

There are specic characteristics o the Semantic Web that 

make it particularly appropriate or the broad reach o 

inormation and the dynamic nature o inormation today.

First, the Semantic Web is dynamic. Unlike the linking

that takes place in database management systems, where

links have to be created internally between items, linking

on the Web is dynamic and automatic. Think o what hap-

pens when you add a hyperlink to a document: the mere

existence o the HTTP address in the link creates the con-nection between the current document and what it links

to, and that link is active immediately. With the Semantic

Web, data added to the Web in Semantic Web ormat will

be able to link and be accessed immediately without any

urther programming. The disadvantage to the dynamic

nature o the Web and the Semantic Web is that nothing

stays the same or very long. Searches done at one time

will turn up dierent results when done again later i new

resources and new inormation have been added to the

Semantic Web.

The Web is a global inormation resource, although

each user tends to see only a particular slice that responds

to her area o interest. While most online documents will

be seen by only a small percentage o Internet users, there

is no way to predict who will make use o the online dataor how they will use it. We have to assume that the con-

text or any online data is the entire Web itsel. This is

in contrast to data stored in databases, where the data

needs to be dened and identied only within the con-

text o that particular database. Even with the sharing o 

library metadata on a large scale, the context or the data

is assumed to be that o library standards. On the Web,

library data will interact with data that was created in

other contexts by nonlibrarians and or other purposes.

The global nature o the Web means that there are an

untold number o possible uses o the data, but also a

great variety in the data sources. A less controlled envi-

ronment has disadvantages, such as unevenness in thequality o resources. However, by structuring metadata in

a way that it can be used on the global Internet, it can be

used anywhere on the Web without ambiguity.

An advantage o the networked environment or

metadata is its extensibility. New data and data types can

be added at any time. Mashups, or the recombination o 

data rom dierent sources, actually create new inorma-

tion, even i only temporarily as the outcome o a search

over multiple sources. This has an eect on the standards

process, o course. Changes and additions do not have to

Figure 19As more data is added to the semantic web, a greater variety o questions can be answered.

Melville

Typee

is author o

is author o

Moby Dick

published by

Harper & Bros.

published date

1851

published date

published by

16 Months at theGold Diggings

Page 15: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 15/16

28

   L   i   b  r  a  r  y

   T  e  c   h  n  o   l  o  g  y

   R  e  p  o  r   t  s

  w  w  w .  a

   l  a   t  e  c   h  s  o  u  r  c  e .  o  r  g

   J  a  n  u  a  r  y

   2   0   1   0

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

be pre-announced or the Web to continue working as it 

does. The Web itsel becomes the standard communication

platorm, providing notication o changes and download-

able versions o new and proposed elements and values.

Changes can prolierate quickly throughout the community

using the data.

A key element o the Semantic Web is that there aremeaningul relationships between things. We don’t just 

link John and Mary, we are able to say what that link

means. It could be that John is the son o Mary, or Mary

the manager o John. These are very dierent statements,

but on the Web today you cannot make this distinction

with hyperlinks. It is this addition o relationships to the

hyperlinking o the Web that can transorm the Web rom

what it is today to a richer, more meaningul inormation

environment.

Linked (Open) DThe Semantic Web as introduced by Tim Berners-Lee is a

linked web o inormation encoded in documents through-

out the web. Achievement o this vision is still over the

visible horizon. In practice, however, there is a growing

community o people and organizations who have meta-

data available to them that they have structured using

Semantic Web rules. These disparate sets o data can

be combined into a base o actionable data. These sets

o data are being reerred to as “linked data,” and the

Linked Data Cloud is an open and inormal representa-

tion o compatible data available over the Internet. New

linked data is being added to the cloud daily. Each new

resource that is added to the Web in this ormat increases

the number o data connections that are possible between

existing data sets.

Many o the early participants in linked data are insti-

tutions with already existing scientic data sets. Linking

allows data rom PubMed records, or example, to link to

scientic databases like the Protein Databank, the HumanGene Nomenclature Database, and Chemical Entities o 

Biological Interest database. There are also some general

purpose data sets available, such as geographic names,

U.S. Census data, and the wealth o names and acts rom

Wikipedia. Helping to bring these sets o data together is

a particular data set called Dbpedia. The core o Dbpedia

is an extraction o coded data rom Wikipedia, with related

links added rom a variety o linked data sources on the

open Web. A ew o the items o interest in the lengthy

set o data about Herman Melville in Dbpedia include

the date and place o his birth and death, the various

subject headings and genres that have been assigned to

his works, writers who infuenced him and those he infu-

enced.10 This inormation displays in Dbpedia in a rather

unriendly, pseudo-code orm, but the data is intended pri-

marily to be used or linking by programs taking advan-

tage o Semantic Web capabilities. Here is an excerpt romthe Dbpedia page on Herman Melville:

dbpedia-owl:Person/birthDate

1819-08-01 (xsd:date)

dbpedia-owl:Person/birthPlace

dbpedia:New_York_City

dbpedia:United_States

dbpedia:New_York

dbpedia-owl:Person/deathDate

1891-09-28 (xsd:date)

dbpedia-owl:Person/deathPlace

dbpedia:New_York_City

dbpedia:United_States

dbpedia:New_York

It is somewhat dicult to explain what you can do

with linked open data because the answer is just about any-

thing. Where data can be combined on any data element,you can do data mining in the linked open data space in

the same way that you would in a database. A geographic

name in a text could be linked to a geographic name in a

database that returns longitude and latitude, or one that 

links to historical inormation, images, and photographs, or

even genealogy data. In our Herman Melville case, data o 

this nature could be used to gather a list o authors born in

a particular era and a particular place. Mining o this data

is available openly to anyone on the Web, and there is no

way to predict the uses that might be discovered.

Imagine, thereore, what possibilities there would

be i all o the bibliographic data rom all o the library

catalogs o the world were added to the mix. We’ve saidthat the value o linked data increases with every node

that joins the linked data space. This is also true with

library data as linked data. Data rom a single library will

add many interesting data points rom new cataloged

resources, but as each library’s data is added, new kinds

Linked Data Cloud http://linkeddata.org

WorldCat Identitieshttp://orlabs.oclc.org/Identities

Page 16: From Metadata to MetaDATA

8/6/2019 From Metadata to MetaDATA

http://slidepdf.com/reader/full/from-metadata-to-metadata 16/16

Understanding the Semantic Web: Bibliographic Data and Metadata Karen Coyle

o inormation can be extracted. Not only will more librar-

ies add new titles, the repeated inormation rom the hold-

ings o many libraries can be useul. The value o library

holdings is evident in the WorldCat Identities project pro-

duced by the OCLC Research Division. While inormation

about a single book can tell you who the author is, and

where and when it was published, inormation about howmany libraries hold a book can add qualitative inorma-

tion, like which are the most requently held books by

or about a person. Linked open data has the potential

to oster the discovery o new inormation as users nd

interesting ways to combine it.

To understand how linked data becomes inorma-

tion we can look again at our sentence about Herman

Melville and  Moby Dick. Data provided in linked data

orm in Dbpedia gives useul dates, links to works by

the author, and more. Library data, both rom WorldCat 

Identities and library catalog records, contains inorma-

tion that could ll in some gaps in the current linked

data view o this work. Library data cannot participate,

however, until that catalog is made available in the

linked data ormat.

What can libraries oer that no other community can?

First, libraries have vast holdings o published and unpub-

lished materials that are not currently represented on the

Web. Next, they have metadata or most o those materials.

The metadata includes controlled orms o personal and

corporate names, physical description, topical headings,

and classication assignments. This data could interact 

with almost any inormation on the Web, since libraries

cover the ull range o human endeavor in their holdings.

The entry o library data into the linked data cloud

is not a replacement or library catalogs. Indeed, the unc-

tions related to library management like inventory con-

trol, acquisitions, and materials handling will continue

to be part o the institutional database used by libraries.

What is gained rom the transormation o library data to

linked data is an entry into a larger inormation universe

that reaches users on the Web and can be exploited incombination with any other data that is ound there. The

Semantic Web is a logical extension o the inormation

services provided to library users and is one that could

greatly expand the number and range o persons who ben-

et rom the library.

The movement o library data into the linked data

cloud is not as ar o as it might seem. Like the scien-

tic databases, the metadata already exists and is in a

data ormat. Some transormation o the data to a ormat 

compatible with the Semantic Web will be necessary, but 

the encoding that has already been done (mainly in the

MARC ormat) and the degree o vocabulary control that exists acilitate the transormation. It truly is a matter o 

transormation, at least in a rst step. Ater that, the only

limits are those o the imaginat ion o inormation seekers

all over the globe.

Noes

1. On the Record: Report o the Library o Congress Working

Group on the Future o Bibliographic Control (Washington,

DC: Library o Congress, Jan. 9, 2008), 26, www.loc.gov/ 

bibliographic-uture/news/lcwg-ontherecord-jan08-nal.pd 

(accessed Nov. 17, 2009).

2. Ibid, 24;  Roy Tennant, “MARC Must Die,” Library Journal ,

Oct. 15, 2002, www.libraryjournal.com/article/CA250046.

html (accessed Nov. 17, 2009).

3. Tim Berners-Lee, “Inormation Management, a Proposal,”

March 1989, www.w3.org/History/1989/proposal.html

(accessed Nov. 17, 2009); Vannevar Bush, “As We May

Think,” The Atlantic, July 1945, www.theatlantic.com/ 

doc/194507/bush (accessed Nov. 17, 2009).

4. Tim Berners-Lee, James Hendler, and Ora Lassila, “The

Semantic Web,” Scientifc American, May 2001, 34–43.

5. Bruce Sterling, The Hacker Crackdown: Law and Disorder 

on the Electronic Frontier  (New York: Bantam Books,

1992), 31.

6. International Organization or Standardization, “Numeric

Representations o Dates and Time,” ISO website, FAQ

pages, www.iso.org/iso/date_and_time_ormat (accessed

Nov. 17, 2009).

7. Tim Berners-Lee, “Universal Resource Identiers: Axioms o 

Web Architecture,” World Wide Web Consortium website,

Dec. 19, 1996, www.w3.org/DesignIssues/Axioms.html

(accessed Nov. 17, 2009).

8. Dublin Core Metadata Initiative, “DCMI Metadata Terms,”

  Jan. 14, 2008, http://dublincore.org/documents/dcmi-

terms (accessed Nov. 17, 2009).

9. J. Hakala and H. Walravens, “Using International Standard

Book Numbers as Uniorm Resource Names,” Internet Engineering Task Force website, Oct. 2001, www.iet.org/ 

rc/rc3187.txt (accessed Nov. 17, 2009).

10. “About Herman Melville,” Dbpedia, http://dbpedia.org/ 

page/Herman_Melville (accessed Nov. 17, 2009).

Dbpediahttp://dbpedia.org