TRANSCRIPT
© Rightscom – All rights reserved
Testbed for Interoperable Metadata for Ebooks
Hugh Look (Project Manager)
TIME: presentation to Discovery and Access seminar 13 December 2006
© Rightscom 2006– All rights reserved
Testbed for Interoperable Metadata for Ebooks
►Which spells…
►TIME
►Weird coincidence, isn’t it?
We built a TIME machine
►The team included the very memorable Kane Richmond as Brick Bradford & Linda Leighton as June Salisbury
►And, of course, the Time Top
When they heard we were building a time machine, of course the client wanted…
So we said “hold on…it’s only a testbed…”
Overview of the project
►Objectives
►To develop a testbed system to support ebook cataloguing
►"The testbed will help provide solutions to one of the key challenges identified for the takeup of ebooks: the lack of standardised e-book catalogue records and also the lack of interoperability between different e-book metadata records."
►The key participants
►EPICentre
►Rightscom
►Supported by
►Book Industry Communications
►Helen Henderson
Overview of the project (Cont.)
►Formats we are transforming
►Relevance confirmed by librarians & VLE specialists
►Dublin Core – Simple and Qualified
►ONIX
►MARC
►LOM
> No publishers in the project are using this at present – we will transform other formats to LOM, but not from it
> Not specifically an e-book standard
►LOM input can be added later (as can any other format)
Overview of the project (Cont.)
►Key concept: map to and from a single intermediate format
> Intermediate format is comprehensive
> Extensible to new formats
►Data
►Records were obtained from publishers and intermediaries (to whom many thanks are due):
> Oxford University Press
> Taylor and Francis
> Cambridge University Press
> OCLC
►A total of 1,886 records were received, in DC, MARC and ONIX formats.
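The scaling argument behind the single intermediate format can be shown with a quick count (a sketch for illustration, not project code):

```python
# Why a hub format scales: with N metadata formats, direct pairwise
# mapping needs N*(N-1) transforms, while a hub-and-spoke design needs
# only 2*N (one "in" and one "out" transform per format).

def pairwise_transforms(n_formats: int) -> int:
    """Transforms needed if every format maps directly to every other."""
    return n_formats * (n_formats - 1)

def hub_transforms(n_formats: int) -> int:
    """Transforms needed if every format maps via one intermediate hub."""
    return 2 * n_formats

for n in (4, 10):
    print(n, pairwise_transforms(n), hub_transforms(n))
# For the project's four formats (DC, ONIX, MARC, LOM): 12 vs 8.
```

The gap widens as formats are added, which is what makes the hub extensible to new formats at low cost.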
Requirements process
►Review of requirements from documentation
►Requirements analysis focused on needs of libraries
►Range of documents identified
►None contain complete requirements
►Synthesis presented to workshop
Requirements process: standards at the centre
►Requirements validation workshop
►Focused on standards
►No radical disagreements or additions to the synthesis
►Confirmed standards identified during the analysis process were appropriate
►No other significant issues identified
Delivery
►Working transform system►Simple user interface►Testbed released to JISC
►Packaged for installation by further testers
Standards (brace yourselves)
[Diagram: a timeline from the 1980s through the mid-90s to today, showing the proliferation of metadata and identifier standards across domains – books, audio, audiovisual, libraries, copyright, journals, magazines, newspapers, education, music, texts, technology, archives and museums. Standards shown include ISBN, ISSN, MARC, EAN, UPC, CAE, ISO codes, Dublin Core, ONIX, EPICS, LOM, DOI, FRBR, Handle, ISRC, ISAN, ISMN, CIS, IMS, IIM, ISWC, URL, URN, SICI, IPI, UMID, ISTC, SMPTE, <indecs>, MPEG-7, MPEG-21, ISO 11179, RDF, XML Schema, PRISM, OeBF, NITF, CIDOC, CrossRef, P/META, XrML, URI, BICI, MPEG-21 RDD/REL, MI3P, SCORM, NewsML, GRid, SAN, V-ISAN, ERMI, DAISY, METS, MODS and OWL.]
The testbed
[Diagram: many data formats – MARC, Dublin Core, ONIX and LOM – feeding into the eBook Catalogue hub, a common (generic) semantic and syntactic format, and out again to the same four formats.]
The longer-term potential
[Diagram: the same hub-and-spoke architecture, with additional "Other" formats on both the input and output sides alongside MARC, Dublin Core, ONIX and LOM.]
Technical
►Technology & tools
►Fedora open source XML repository
►XML schemas and XSLT transforms
►Internal generic representation: Contextual Ontology Architecture ("COA")
►OAI-PMH compliance
Issues for interoperability
►The “hub” needs to be at least as rich as all of the “spokes” put together
►The value mappings need to preserve all their semantics in the hub
Mapping to COA
MARC
  tag=100, subfield a="Kriegel, Mark", subfield e="author"
Dublin Core
  creator="Kriegel, Mark"
ONIX
  Contributor: ContributorRole=B01, PersonNameInverted="Kriegel, Mark", NamesBeforeKey="Mark", KeyName="Kriegel"
COA
  A IsA Resource
  A IsA EBook
  A HasAuthor B
  B IsA Party
  B HasName C
  C HasNameInverted "Kriegel, Mark"
  C HasNamePart D
  D HasValue "Kriegel"
  D IsA KeyName
  D HasIdentifier E
  E HasValue "1"
  E IsA SequenceNumber
  C HasNamePart F
  F HasValue "Mark"
  F IsA NamesBeforeKeyName
  F HasIdentifier G
  G HasValue "2"
  G IsA SequenceNumber
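The COA representation is essentially a set of subject–predicate–object statements. A minimal Python sketch (illustrative only, not the testbed's implementation; node labels and predicate names follow the slide) shows how a simple Dublin Core view can be regenerated from the richer graph:

```python
# The slide's author example as plain subject-predicate-object triples.
coa_triples = [
    ("A", "IsA", "Resource"),
    ("A", "IsA", "EBook"),
    ("A", "HasAuthor", "B"),
    ("B", "IsA", "Party"),
    ("B", "HasName", "C"),
    ("C", "HasNameInverted", "Kriegel, Mark"),
    ("C", "HasNamePart", "D"),
    ("D", "HasValue", "Kriegel"),
    ("D", "IsA", "KeyName"),
    ("C", "HasNamePart", "F"),
    ("F", "HasValue", "Mark"),
    ("F", "IsA", "NamesBeforeKeyName"),
]

def values_of(node, predicate):
    """All objects for a given subject/predicate pair."""
    return [o for s, p, o in coa_triples if s == node and p == predicate]

# Walk the graph: eBook -> author -> name -> inverted form,
# which is all that simple Dublin Core needs for dc:creator.
name_node = values_of(values_of("A", "HasAuthor")[0], "HasName")[0]
dc_creator = values_of(name_node, "HasNameInverted")[0]
print(dc_creator)  # Kriegel, Mark
```

Going the other way – from a bare `creator` string up into the full graph – is where the richer structure (key name, names before key, sequence numbers) has to be parsed or left unpopulated.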
Scheme to scheme mapping issues
► ONIX – rich and well-structured – a good input format, creating accurate if limited output in MARC or DC
► MARC – rich, but not always well defined or unambiguous – weaker as an input format (made to be read by humans, not computers)
► Dublin Core – input data weak and often uncontrolled, so transforms no better
> But can output richer Qualified Dublin Core from both MARC and ONIX
► LOM – pedagogic classifications not generally captured in MARC, ONIX or DC, so a poor match at that level
> But even weak transforms can create "basic" records that can be added to later
Semantic loss: relative strength of metadata schemes
► Transformations both in and out of Dublin Core were generally poor
► Relative semantic poverty and ambiguity
► As a source schema, unqualified or lightly-qualified Dublin Core has huge limitations
► dc:date may be the date of creation, the date of publication (and if so, where?) or of anything else
> Unless a default assumption is made, such data cannot be transformed and is "lost"
► dc:identifier often does not provide the IdentifierType, which renders it meaningless
► Text in dc:coverage may mean more or less anything
► No controlled values in basic DC
► Code lists such as those supported by ONIX and MARC cannot be mapped into it
► Dublin Core as a basis for automated transformation is effectively a non-starter
► Has its uses as a human-readable record
► As an output schema, DC does much better
► Good DC records can be produced from ONIX or MARC input
► Both ONIX and MARC are good source schemas for descriptive eBook metadata
► Some inherent limitations, but most of these can be overcome
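The dc:date point can be made concrete with a small sketch (an illustrative rule, not the testbed's code – the field name `PublicationDate` is an assumed example): without a stated event type, a transform must either apply a default assumption or drop the value.

```python
# Unqualified dc:date carries no event type (created? published? digitised?),
# so a transform into a typed intermediate field has only two options.

def map_dc_date(value: str, assume_publication_date: bool):
    """Map an unqualified dc:date into a typed field.

    Returns a (field_type, value) pair, or None when no default
    assumption is allowed and the semantics cannot be recovered.
    """
    if assume_publication_date:
        return ("PublicationDate", value)  # default assumption applied
    return None  # semantics unknown: the value is "lost"

print(map_dc_date("2006-12-13", assume_publication_date=True))
print(map_dc_date("2006-12-13", assume_publication_date=False))
```

Qualified Dublin Core avoids the problem by typing the date explicitly (e.g. dcterms:issued), which is why output into Qualified DC fares better than input from simple DC.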
Data quality issues: errors
►Generally the quality of data supplied was good (but a small sample)
►Amount of data contained in each record
►Homogeneity of metadata from record to record
►Input data inevitably contains errors
►Random and systematic
Data quality issues: errors (cont)
►Systematic
►Most frequent was misinterpretation of some fields – used for data that belongs elsewhere
> One set of ONIX input data included the email address of the sender in the FromPerson element (there is a specific FromEmail element available for this)
►Use of a wrong ONIX code
> Another set, derived from print book data, contained ONIX format codes showing each eBook incorrectly to be either a hardback or a paperback book
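Because systematic errors are consistent, they can be handled post hoc with a per-source correction rule. A minimal sketch (the dictionary-record shape and the codes `BB`/`BC`/`DG` are assumed examples, not the testbed's actual data model):

```python
# A supplier's print-derived records carried print product-form codes,
# so one consistent rewrite fixes every affected record.
# Hypothetical correction table: hardback/paperback -> ebook code.
PRODUCT_FORM_FIX = {"BB": "DG", "BC": "DG"}

def fix_product_form(record: dict) -> dict:
    """Apply the supplier-specific systematic correction to one record."""
    fixed = dict(record)
    code = fixed.get("ProductForm")
    if code in PRODUCT_FORM_FIX:
        fixed["ProductForm"] = PRODUCT_FORM_FIX[code]
    return fixed

print(fix_product_form({"ProductForm": "BB", "Title": "Example"}))
```

Random errors offer no such handle: there is no consistent pattern for a rule to match, which is why they need QA at source.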
Data quality issues: errors (cont)
►Random
►Discovered by chance when analysing a few test files
> E.g. the affiliated role of an author ("Professor of Physics") was sometimes included with his affiliated institution ("University of Somewhere"), sometimes included in a separate field
►For random data errors there is nothing that can be done post hoc apart from manual correction when an error is detected
> Needs improved QA processes at source
►Considerable scope for post-hoc management of systematic or habitual errors
Data quality issues: variant schemas
► Variations in the way in which a particular schema is implemented
► One publisher used an ONIX format code to indicate the format of the source printed book, instead of the eBook
► Another supplier had a set of controlled values, used for a particular MARC tag, which were not standard but were internally consistent
► In another case, MARC tag 043 (Geographic Area Code) was apparently used in a particular and consistent way by the supplier, but as MARC itself is non-specific, nothing could be done with the data in the general scheme
► MARC users also have their own variant practices
► Especially in the use of "internal" 900 tags
Data quality issues: variant schemas (cont)
►It has been remarked that, as there are over 50 UK publishers now providing data in ONIX, there are over 50 variant ONIX schemas
►Not in principle a problem for the COA one-to-many approach
►Variant mappings can be made for different sources where consistent behaviour is identified
►Maintaining such variations in conventional pairwise mapping is very resource-intensive
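One way to picture per-source variants in a hub design (a hypothetical structure for illustration – the tag names and field labels are invented, not the testbed's): a shared base mapping plus a small override table per supplier, rather than a separate full mapping for every supplier and format pair.

```python
# Base hub mapping shared by all ONIX suppliers (tags are illustrative).
BASE_ONIX_MAPPING = {"b004": "ISBN", "b203": "Title"}

# Per-supplier variants: only the deviations need to be recorded.
SUPPLIER_OVERRIDES = {
    # e.g. a supplier whose format code described the source print book:
    "supplier_x": {"b012": "SourcePrintFormat"},
}

def mapping_for(supplier: str) -> dict:
    """Base hub mapping with any supplier-specific variants applied."""
    mapping = dict(BASE_ONIX_MAPPING)
    mapping.update(SUPPLIER_OVERRIDES.get(supplier, {}))
    return mapping

print(mapping_for("supplier_x"))
```

Each new variant costs one small override entry against the hub, instead of edits to every pairwise transform that touches that supplier's data.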
Issues at end of project?
►None of the existing metadata standards meets all the requirements
►Publishers apply the standards differently
►The COA model can handle variations much more efficiently than pairwise mapping
►Rich standards (e.g. LOM) will require additional effort from publishers
►Impact of new models for selling and supplying e-books?
TIME is only just (the) beginning…
Thank you
►Hugh Look►www.rightscom.com►020 7620 4433