TRANSCRIPT
© Rightscom – All rights reserved
Testbed for Interoperable Metadata for Ebooks
Hugh Look (Project Manager)
TIME: presentation to Discovery and Access seminar 13 December 2006
© Rightscom 2006– All rights reserved
Testbed for Interoperable Metadata for Ebooks
►Which spells…
►TIME
►Weird coincidence, isn’t it?
We built a TIME machine
►The team included the very memorable Kane Richmond as Brick Bradford & Linda Leighton as June Salisbury
►And, of course, the Time Top
When they heard we were building a time machine, of course the client wanted…
So we said “hold on…it’s only a testbed…”
Overview of the project
►Objectives
►To develop a testbed system to support ebook cataloguing
►"The testbed will help provide solutions to one of the key challenges identified for the takeup of ebooks: the lack of standardised e-book catalogue records and also the lack of interoperability between different e-book metadata records."
►The key participants
►EPICentre
►Rightscom
►Supported by
►Book Industry Communications
►Helen Henderson
Overview of the project (Cont.)
►Formats we are transforming
►Relevance confirmed by librarians & VLE specialists
►Dublin Core – Simple and Qualified
►ONIX
►MARC
►LOM
> No publishers in the project are using this at present – we will transform other formats to LOM, but not from it
> Not specifically an e-book standard
►LOM input can be added later (as can any other format)
Overview of the project (Cont.)
►Key concept: map to and from a single intermediate format
> Intermediate format is comprehensive
> Extensible to new formats
►Data
►Records were obtained from publishers and intermediaries (to whom many thanks are due):
> Oxford University Press
> Taylor and Francis
> Cambridge University Press
> OCLC
►A total of 1,886 records were received, in DC, MARC and ONIX formats.
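The scaling argument behind the single intermediate format can be shown with a quick count (a sketch for illustration, not project code):

```python
# Why a hub format scales: with N metadata formats, direct pairwise
# mapping needs N*(N-1) transforms, while a hub-and-spoke design needs
# only 2*N (one "in" and one "out" transform per format).

def pairwise_transforms(n_formats: int) -> int:
    """Transforms needed if every format maps directly to every other."""
    return n_formats * (n_formats - 1)

def hub_transforms(n_formats: int) -> int:
    """Transforms needed if every format maps via one intermediate hub."""
    return 2 * n_formats

for n in (4, 10):
    print(n, pairwise_transforms(n), hub_transforms(n))
# For the project's four formats (DC, ONIX, MARC, LOM): 12 vs 8.
```

The gap widens as formats are added, which is what makes the hub extensible to new formats at low cost.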
Requirements process
►Review of requirements from documentation
►Requirements analysis focused on needs of libraries
►Range of documents identified
►None contain complete requirements
►Synthesis presented to workshop
Requirements process: standards at the centre
►Requirements validation workshop
►Focused on standards
►No radical disagreements or additions to the synthesis
►Confirmed standards identified during the analysis process were appropriate
►No other significant issues identified
Delivery
►Working transform system►Simple user interface►Testbed released to JISC
►Packaged for installation by further testers
Standards (brace yourselves)
[Diagram: a timeline from the 1980s through the mid-90s to today, showing the proliferation of metadata and identifier standards across domains – books, audio, audiovisual, libraries, copyright, journals, magazines, newspapers, education, music, texts, technology, archives and museums. Standards shown include ISBN, ISSN, MARC, EAN, UPC, CAE, ISO codes, Dublin Core, ONIX, EPICS, LOM, DOI, FRBR, Handle, ISRC, ISAN, ISMN, CIS, IMS, IIM, ISWC, URL, URN, SICI, IPI, UMID, ISTC, SMPTE, <indecs>, MPEG-7, MPEG-21, ISO 11179, RDF, XML Schema, PRISM, OeBF, NITF, CIDOC, CrossRef, P/META, XrML, URI, BICI, MPEG-21 RDD/REL, MI3P, SCORM, NewsML, GRid, SAN, V-ISAN, ERMI, DAISY, METS, MODS and OWL.]
The testbed
[Diagram: many data formats – MARC, Dublin Core, ONIX and LOM – feeding into the eBook Catalogue hub, a common (generic) semantic and syntactic format, and out again to the same four formats.]
The longer-term potential
[Diagram: the same hub-and-spoke architecture, with additional "Other" formats on both the input and output sides alongside MARC, Dublin Core, ONIX and LOM.]
Technical
►Technology & tools
►Fedora open source XML repository
►XML schemas and XSLT transforms
►Internal generic representation: Contextual Ontology Architecture ("COA")
►OAI-PMH compliance
Issues for interoperability
►The “hub” needs to be at least as rich as all of the “spokes” put together
►The value mappings need to preserve all their semantics in the hub
Mapping to COA
MARC
  tag=100, subfield a="Kriegel, Mark", subfield e="author"
Dublin Core
  creator="Kriegel, Mark"
ONIX
  Contributor: ContributorRole=B01, PersonNameInverted="Kriegel, Mark", NamesBeforeKey="Mark", KeyName="Kriegel"
COA
  A IsA Resource
  A IsA EBook
  A HasAuthor B
  B IsA Party
  B HasName C
  C HasNameInverted "Kriegel, Mark"
  C HasNamePart D
  D HasValue "Kriegel"
  D IsA KeyName
  D HasIdentifier E
  E HasValue "1"
  E IsA SequenceNumber
  C HasNamePart F
  F HasValue "Mark"
  F IsA NamesBeforeKeyName
  F HasIdentifier G
  G HasValue "2"
  G IsA SequenceNumber
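The COA representation is essentially a set of subject–predicate–object statements. A minimal Python sketch (illustrative only, not the testbed's implementation; node labels and predicate names follow the slide) shows how a simple Dublin Core view can be regenerated from the richer graph:

```python
# The slide's author example as plain subject-predicate-object triples.
coa_triples = [
    ("A", "IsA", "Resource"),
    ("A", "IsA", "EBook"),
    ("A", "HasAuthor", "B"),
    ("B", "IsA", "Party"),
    ("B", "HasName", "C"),
    ("C", "HasNameInverted", "Kriegel, Mark"),
    ("C", "HasNamePart", "D"),
    ("D", "HasValue", "Kriegel"),
    ("D", "IsA", "KeyName"),
    ("C", "HasNamePart", "F"),
    ("F", "HasValue", "Mark"),
    ("F", "IsA", "NamesBeforeKeyName"),
]

def values_of(node, predicate):
    """All objects for a given subject/predicate pair."""
    return [o for s, p, o in coa_triples if s == node and p == predicate]

# Walk the graph: eBook -> author -> name -> inverted form,
# which is all that simple Dublin Core needs for dc:creator.
name_node = values_of(values_of("A", "HasAuthor")[0], "HasName")[0]
dc_creator = values_of(name_node, "HasNameInverted")[0]
print(dc_creator)  # Kriegel, Mark
```

Going the other way – from a bare `creator` string up into the full graph – is where the richer structure (key name, names before key, sequence numbers) has to be parsed or left unpopulated.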
Scheme to scheme mapping issues
► ONIX – rich and well-structured – a good input format, creating accurate if limited output in MARC or DC
► MARC – rich, but not always well defined or unambiguous – weaker as an input format (made to be read by humans, not computers)
► Dublin Core – input data weak and often uncontrolled, so transforms no better
> But can output richer Qualified Dublin Core from both MARC and ONIX
► LOM – pedagogic classifications not generally captured in MARC, ONIX or DC, so a poor match at that level
> But even weak transforms can create "basic" records that can be added to later
Semantic loss: relative strength of metadata schemes
► Transformations both in and out of Dublin Core were generally poor
► Relative semantic poverty and ambiguity
► As a source schema, unqualified or lightly-qualified Dublin Core has huge limitations
► dc:date may be the date of creation, the date of publication (and if so, where?) or of anything else
> Unless a default assumption is made, such data cannot be transformed and is "lost"
► dc:identifier often does not provide the IdentifierType, which renders it meaningless
► Text in dc:coverage may mean more or less anything
► No controlled values in basic DC
► Code lists such as those supported by ONIX and MARC cannot be mapped into it
► Dublin Core as a basis for automated transformation is effectively a non-starter
► Has its uses as a human-readable record
► As an output schema, DC does much better
► Good DC records can be produced from ONIX or MARC input
► Both ONIX and MARC are good source schemas for descriptive eBook metadata
► Some inherent limitations, but most of these can be overcome
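The dc:date point can be made concrete with a small sketch (an illustrative rule, not the testbed's code – the field name `PublicationDate` is an assumed example): without a stated event type, a transform must either apply a default assumption or drop the value.

```python
# Unqualified dc:date carries no event type (created? published? digitised?),
# so a transform into a typed intermediate field has only two options.

def map_dc_date(value: str, assume_publication_date: bool):
    """Map an unqualified dc:date into a typed field.

    Returns a (field_type, value) pair, or None when no default
    assumption is allowed and the semantics cannot be recovered.
    """
    if assume_publication_date:
        return ("PublicationDate", value)  # default assumption applied
    return None  # semantics unknown: the value is "lost"

print(map_dc_date("2006-12-13", assume_publication_date=True))
print(map_dc_date("2006-12-13", assume_publication_date=False))
```

Qualified Dublin Core avoids the problem by typing the date explicitly (e.g. dcterms:issued), which is why output into Qualified DC fares better than input from simple DC.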
Data quality issues: errors
►Generally the quality of data supplied was good (but a small sample)
►Amount of data contained in each record
►Homogeneity of metadata from record to record
►Input data inevitably contains errors
►Random and systematic
Data quality issues: errors (cont)
►Systematic
►Most frequent was misinterpretation of some fields – used for data that belongs elsewhere
> One set of ONIX input data included the email address of the sender in the FromPerson element (there is a specific FromEmail element available for this)
►Use of a wrong ONIX code
> Another set, derived from print book data, contained ONIX format codes showing each eBook incorrectly to be either a hardback or a paperback book
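Because systematic errors are consistent, they can be handled post hoc with a per-source correction rule. A minimal sketch (the dictionary-record shape and the codes `BB`/`BC`/`DG` are assumed examples, not the testbed's actual data model):

```python
# A supplier's print-derived records carried print product-form codes,
# so one consistent rewrite fixes every affected record.
# Hypothetical correction table: hardback/paperback -> ebook code.
PRODUCT_FORM_FIX = {"BB": "DG", "BC": "DG"}

def fix_product_form(record: dict) -> dict:
    """Apply the supplier-specific systematic correction to one record."""
    fixed = dict(record)
    code = fixed.get("ProductForm")
    if code in PRODUCT_FORM_FIX:
        fixed["ProductForm"] = PRODUCT_FORM_FIX[code]
    return fixed

print(fix_product_form({"ProductForm": "BB", "Title": "Example"}))
```

Random errors offer no such handle: there is no consistent pattern for a rule to match, which is why they need QA at source.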
Data quality issues: errors (cont)
►Random
►Discovered by chance when analysing a few test files
> E.g. the affiliated role of an author ("Professor of Physics") was sometimes included with his affiliated institution ("University of Somewhere"), sometimes included in a separate field
►For random data errors there is nothing that can be done post hoc apart from manual correction when an error is detected
> Needs improved QA processes at source
►Considerable scope for post-hoc management of systematic or habitual errors
Data quality issues: variant schemas
► Variations in the way in which a particular schema is implemented
► One publisher used an ONIX format code to indicate the format of the source printed book, instead of the eBook
► Another supplier had a set of controlled values, used for a particular MARC tag, which were not standard but were internally consistent
► In another case, MARC tag 043 (Geographic Area Code) was apparently used in a particular and consistent way by the supplier, but as MARC itself is non-specific, nothing could be done with the data in the general scheme
► MARC users also have their own variant practices
► Especially in the use of "internal" 900 tags
Data quality issues: variant schemas (cont)
►It has been remarked that, as there are over 50 UK publishers now providing data in ONIX, there are over 50 variant ONIX schemas
►Not in principle a problem for the COA one-to-many approach
►Variant mappings can be made for different sources where consistent behaviour is identified
►Maintaining such variations in conventional pairwise mapping is very resource-intensive
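One way to picture per-source variants in a hub design (a hypothetical structure for illustration – the tag names and field labels are invented, not the testbed's): a shared base mapping plus a small override table per supplier, rather than a separate full mapping for every supplier and format pair.

```python
# Base hub mapping shared by all ONIX suppliers (tags are illustrative).
BASE_ONIX_MAPPING = {"b004": "ISBN", "b203": "Title"}

# Per-supplier variants: only the deviations need to be recorded.
SUPPLIER_OVERRIDES = {
    # e.g. a supplier whose format code described the source print book:
    "supplier_x": {"b012": "SourcePrintFormat"},
}

def mapping_for(supplier: str) -> dict:
    """Base hub mapping with any supplier-specific variants applied."""
    mapping = dict(BASE_ONIX_MAPPING)
    mapping.update(SUPPLIER_OVERRIDES.get(supplier, {}))
    return mapping

print(mapping_for("supplier_x"))
```

Each new variant costs one small override entry against the hub, instead of edits to every pairwise transform that touches that supplier's data.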
Issues at end of project?
►None of the existing metadata standards meets all the requirements
►Publishers apply the standards differently
►The COA model can handle variations much more efficiently than pairwise mapping
►Rich standards (e.g. LOM) will require additional effort from publishers
►Impact of new models for selling and supplying e-books?
TIME is only just (the) beginning…
Thank you
►Hugh Look►www.rightscom.com►020 7620 4433