TRANSCRIPT
GETTING METADATA TO WORK HARDER: re-use, standardisation and streamlining, a data archive perspective
LUCY BELL
MANAGEMENT INFORMATION MANAGER
UK DATA ARCHIVE
UNIVERSITY OF ESSEX
THE VALUE OF CATALOGUING, CIG 2012, UNIVERSITY OF SHEFFIELD
10 – 11 SEPTEMBER 2012
UK DATA ARCHIVE
Introduction
• recent changes to 45 years’ worth of cataloguing and indexing practices
• changes are large, wide-ranging – and still underway!
• we hope they will both enhance the user’s experience and create organisational efficiencies
Themes
• the UK Data Archive: what it is
• current practice: metadata schema and tools used at the Archive
• recent internal initiatives
• generally: the problems we encountered; the solutions we have employed
• next steps
The UK Data Archive
• based at the University of Essex since 1967
• curator of the largest collection of digital data in the social sciences and humanities in the UK
• holds several thousand datasets relating to society, both historical and contemporary, making these available via its services:
  • UK Data Service from October 2012
  • previously, the Economic and Social Data Service (ESDS)
• it is a place of national deposit for The National Archives
• www.data-archive.ac.uk / (www.esds.ac.uk)
The UK Data Archive: current cataloguing standards
• the Archive provides access to over 5000 digital data collections
• all of these items are catalogued at study level, and many at variable level
• using the de facto standard data cataloguing schema, DDI (Data Documentation Initiative, see http://www.ddialliance.org/)
• currently, the Archive uses:
  • DDI 2.1 (now known as DDI-C, for codebook)
  • the Humanities and Social Science Electronic Thesaurus (HASSET), © University of Essex, based on UNESCO
  • internally-controlled authority lists and controlled vocabularies (CVs)
HASSET
• multidisciplinary thesaurus developed to support the UK Data Archive collection
• coverage in the core subject areas of social science disciplines
• uses standard hierarchical relationships: TT (top term); BT (broader term); NT (narrower term); RT (related term) etc.
• role of HASSET in the Archive is twofold:
  • used internally for indexing studies and series with HASSET terms
  • also a separate product licensed to others
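The hierarchical relationships listed above (TT, BT, NT, RT) can be pictured as a small data structure. The following is a minimal sketch only: the `Term` class, the helper functions and the three example terms are hypothetical illustrations, not HASSET's actual implementation or content.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus term with its standard relationships."""
    label: str
    broader: list[str] = field(default_factory=list)   # BT
    narrower: list[str] = field(default_factory=list)  # NT
    related: list[str] = field(default_factory=list)   # RT


def add_broader(thesaurus: dict[str, Term], child: str, parent: str) -> None:
    """Record 'child BT parent' and the reciprocal 'parent NT child'."""
    for label in (child, parent):
        thesaurus.setdefault(label, Term(label))
    thesaurus[child].broader.append(parent)
    thesaurus[parent].narrower.append(child)


def top_terms(thesaurus: dict[str, Term], label: str) -> set[str]:
    """Follow BT links upwards until terms with no broader term (TT) are reached."""
    seen, stack, tops = set(), [label], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        parents = thesaurus[current].broader
        if parents:
            stack.extend(parents)
        else:
            tops.add(current)
    return tops


# Hypothetical terms, for illustration only.
thesaurus: dict[str, Term] = {}
add_broader(thesaurus, "SECONDARY EDUCATION", "EDUCATION")
add_broader(thesaurus, "SECONDARY SCHOOLS", "SECONDARY EDUCATION")
print(top_terms(thesaurus, "SECONDARY SCHOOLS"))  # {'EDUCATION'}
```

Storing the reciprocal NT link whenever a BT link is added keeps the hierarchy consistent in both directions, which is what allows indexing at a narrow term and browsing from a top term.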
Significant recent metadata/indexing developments
1. May – October 2010: a review of the UK Data Archive’s resource discovery tools was carried out.
   • 2011: a project was started to apply the review’s results to the Archive’s resource discovery applications.
2. 2011 onwards: work was started to move from the DDI-C to the DDI-L (for lifecycle) metadata schema.
3. June 2012 – January 2013: SKOS-HASSET, a JISC-funded project, is being undertaken to apply SKOS to HASSET and to test its automated indexing capacity.
Shared requirements…
• it became clear that most of these initiatives were pointing at one thing:
The need for more controlled - and harder-working - metadata
1. Resource discovery review
• How do researchers find data?
• trends in information-seeking behaviour show that users prefer simple, Google-like interfaces…
• …but which still return acutely-focused and highly-relevant results.
• the look and feel of the interfaces should be simple but the results must achieve academic rigour.
Result of the review: the metadata conundrum
• for data services to produce simple interfaces - which still return highly-relevant results - metadata are required which are both:
  • extremely powerful
  • increasingly invisible
• a conceptual shift has taken place: the work to focus searches has moved behind the interface
The previous Archive search context
[Diagram of the previous search and browse landscape. Labels recovered from the slide: ESDS Data Catalogue search and variable search; ESDS Qualidata, ESDS International, ESDS Longitudinal and ESDS Government search interfaces; ESDS Qualidata free text search; ESDS Government Survey Finder and Variable Search; data exploration via Quali Online and Nesstar; browse by Major Studies, Subject Headings, New releases and Thematic pages; searches of RELU-DSS, UKDA-Store, the CESSDA catalogue, the Survey Question Bank, the Census data catalogue, HDS and SDS; "publications citing" lists for ESDS Government, ESDS Longitudinal and ESDS International; comparable geography and comparable indicators (Longitudinal); HASSET]
• HASSET and other CVs may be used in the majority of search and browse activities.
• 21 interfaces
The vision: use CVs to enhance the user’s experience
• We wanted:
  • a single search interface
  • the ability to move seamlessly from one type of resource to another:
    • via faceted browsing and
    • directly from within each resource type
• This required:
  • cross-referencing data collections with publications, with research outputs, with support guides, with case studies, using metadata
• Many controlled vocabularies!
The result: single faceted search/browse interface
• We are moving from this:
• To this:
Facets needing controlled vocabularies
• Some were already in a fit state:
  • Depositor (existing authority list)
  • Country (existing authority list)
• Others needed mapping to high levels:
  • Subject categories (116 categories mapped to 21 top terms)
• Many were populated with freetext:
  • Observation unit
  • Spatial unit
  • Kind of data
  • Time method
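Why facets need controlled values can be shown with a toy example: facet counts and filters are only meaningful when every record uses the same term for the same thing. This is a minimal sketch under stated assumptions; the records, field names and helper functions are hypothetical and do not reflect the Archive's catalogue or search software.

```python
from collections import Counter

# Hypothetical catalogue records; field names and values are illustrative only.
records = [
    {"title": "Study A", "country": "England", "kind_of_data": "Numeric"},
    {"title": "Study B", "country": "Scotland", "kind_of_data": "Numeric"},
    {"title": "Study C", "country": "England", "kind_of_data": "Textual"},
]


def facet_counts(items, facet):
    """Count how many records carry each value of the given facet."""
    return Counter(item[facet] for item in items)


def filter_by(items, **selected):
    """Keep only the records that match every selected facet value."""
    return [r for r in items if all(r[k] == v for k, v in selected.items())]


numeric = filter_by(records, kind_of_data="Numeric")
print(facet_counts(records, "country"))
print(facet_counts(numeric, "country"))
```

If "Numeric" had also been entered as "numeric data" or "Survey data (numeric)", the filter would silently miss records and the counts would fragment, which is the problem the freetext clean-up addresses.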
Freetext to controlled vocabularies mapping
• Mapping freetext values to controlled values (all metadata held in SQL tables)
• Same principles for all:
  • Obtain dump of metadata and manipulate in Excel
  • Identify CV to be used
  • Use Google Refine to identify existing, similar, freetext entries
  • Re-export into Excel and apply mapping (at item level or, if possible, at value level)
  • CVs to be used in the future
• So far, it has taken 2 staff members, working c. 0.4 FTE, 4 months to clean 3 elements
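The step of finding existing, similar freetext entries can be approximated in a few lines. The sketch below is loosely modelled on the key-collision style of clustering that tools such as Google Refine offer; it is not the Archive's actual workflow, and the sample values are hypothetical.

```python
import re
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Normalise a freetext value so that near-identical entries share a key."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(re.split(r"\s+", cleaned)) - {""})
    return " ".join(tokens)


def cluster(values):
    """Group raw values by fingerprint; return only groups with variants."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Hypothetical freetext entries, not taken from the Archive's catalogue.
raw = ["Individuals", "individuals.", " Individuals ", "Households", "households", "Events"]
print(cluster(raw))
```

Each cluster is a candidate for a single value-level mapping; entries that fall into no cluster are the ones most likely to need item-level attention.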
The mappings
• Spatial unit <geogUnit>
  • Previous Archive project, U.Geo, had created a spatial unit CV
  • 653 unique values, now mapped to 194
  • This has now been used for all items:
The mappings
• Unit of observation <anlyUnit>
  • 183 unique values, now mapped to 11, using DDI CVG recommended list:
    • Individuals
    • Organizations
    • Families/households
    • Housing Units
    • Events/Processes
    • Geographic Units
    • Time Units
    • Text units
    • Groups
    • Objects
    • Other
The mappings
• Kind of data <dataKind>
  • 294 unique values, now mapped to 7:
    • Alpha-numeric
    • Audio
    • GIS
    • Image
    • Numeric
    • Textual
    • Video
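A value-level mapping of this kind can be pictured as a lookup table plus a check against the controlled list. This is a sketch only: the seven controlled terms come from the slide above, but the freetext values in `VALUE_MAP` and the function names are hypothetical.

```python
# The seven controlled terms listed above for <dataKind>.
DATA_KIND_CV = {"Alpha-numeric", "Audio", "GIS", "Image", "Numeric", "Textual", "Video"}

# Hypothetical value-level mapping from freetext to controlled terms.
VALUE_MAP = {
    "numeric data": "Numeric",
    "survey data (numeric)": "Numeric",
    "interview transcripts": "Textual",
    "sound recordings": "Audio",
}


def map_value(freetext: str):
    """Return the controlled term for a freetext value, or None if unmapped."""
    term = VALUE_MAP.get(freetext.strip().lower())
    if term is not None and term not in DATA_KIND_CV:
        raise ValueError(f"{term!r} is not in the controlled vocabulary")
    return term


def apply_mapping(values):
    """Split values into (original, term) pairs and leftovers for item-level review."""
    mapped, unmapped = [], []
    for value in values:
        term = map_value(value)
        if term is None:
            unmapped.append(value)
        else:
            mapped.append((value, term))
    return mapped, unmapped


mapped, unmapped = apply_mapping(["Numeric data", "Sound recordings", "maps and plans"])
print(mapped)
print(unmapped)
```

Keeping the original value next to its controlled term means the richer freetext description is retained while the facet uses only the controlled value.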
The mappings
• More to come…
  • Method of data collection
  • Access/restrictions (Secure data; standard access conditions etc.)
  • Method of access (Explore online or download)
• Faceted search/browse will be released as a beta in late 2012
  • More development will occur during its beta phase following user feedback
2. Metadata schema: DDI-C to DDI-L
• Simultaneously, the Archive has been preparing for the move from DDI-C to DDI-L
  • DDI-C is similar to a traditional metadata schema
  • DDI-L is more flexible – to the benefit of users:
    • permits data as well as metadata to be encoded
    • captures survey lifecycles
    • gives users a fully-rounded view of a survey from inception to results
    • broad and flexible, allowing groupings to be made – re-use is key
• to support all this, it requires CVs to be used in several elements (the DDI Alliance Controlled Vocabularies Group is working on these)
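What "CVs in several elements" means in practice can be sketched as a check that certain elements only hold controlled values. The fragment below is a simplified, namespace-free illustration that borrows the element names quoted on the mapping slides; it is not a schema-valid DDI instance, and the choice of which elements to check is an assumption.

```python
import xml.etree.ElementTree as ET

# Controlled lists taken from the mapping slides above.
ANLY_UNIT_CV = {
    "Individuals", "Organizations", "Families/households", "Housing Units",
    "Events/Processes", "Geographic Units", "Time Units", "Text units",
    "Groups", "Objects", "Other",
}
DATA_KIND_CV = {"Alpha-numeric", "Audio", "GIS", "Image", "Numeric", "Textual", "Video"}
CONTROLLED = {"anlyUnit": ANLY_UNIT_CV, "dataKind": DATA_KIND_CV}

# Simplified, hypothetical fragment for illustration only.
FRAGMENT = """
<sumDscr>
  <geogUnit>Wards</geogUnit>
  <anlyUnit>Individuals</anlyUnit>
  <dataKind>Numeric</dataKind>
  <dataKind>Survey data</dataKind>
</sumDscr>
"""


def uncontrolled_values(xml_text: str):
    """Return (element, value) pairs whose value is not in the element's CV."""
    root = ET.fromstring(xml_text.strip())
    problems = []
    for tag, vocabulary in CONTROLLED.items():
        for element in root.iter(tag):
            value = (element.text or "").strip()
            if value not in vocabulary:
                problems.append((tag, value))
    return problems


print(uncontrolled_values(FRAGMENT))  # [('dataKind', 'Survey data')]
```

A check like this could run before records are migrated, so that freetext survivors are found and mapped rather than carried into the new schema.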
3. CVs for organisational efficiency: SKOS application
• JISC project: SKOS-HASSET
  • 8 months (June 2012 – January 2013)
  • part of the JISC Research Tools Programme
  • Multi-disciplinary project team:
    • Information Scientists, Data/text Mining Programmer, Linguist, RDF specialist, Developers
SKOS-HASSET
• three aims:
  • apply SKOS to HASSET – making the thesaurus more flexible
  • improve its online presence
  • test its automated indexing capabilities; corpora:
    • questions
    • questionnaires
    • abstracts
    • publications
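Applying SKOS to a thesaurus essentially means expressing each term as a `skos:Concept` with `skos:broader`, `skos:narrower` and `skos:related` links. The sketch below writes one concept as Turtle using plain string formatting; the base URI, the `slug` helper and the example terms are hypothetical placeholders, not the identifiers used by the SKOS-HASSET project, and labels are not escaped.

```python
SKOS_PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."
BASE = "http://example.org/thesaurus/"  # placeholder namespace, not HASSET's


def slug(label: str) -> str:
    """Turn a term label into a simple URI fragment."""
    return label.strip().lower().replace(" ", "-")


def concept_to_turtle(label, broader=(), narrower=(), related=()):
    """Serialise one term as a skos:Concept in Turtle."""
    lines = [
        f"<{BASE}{slug(label)}> a skos:Concept ;",
        f'    skos:prefLabel "{label}"@en',
    ]
    links = (("broader", broader), ("narrower", narrower), ("related", related))
    for prop, targets in links:
        for target in targets:
            lines[-1] += " ;"  # the previous statement continues
            lines.append(f"    skos:{prop} <{BASE}{slug(target)}>")
    lines[-1] += " ."  # close the final statement
    return "\n".join(lines)


print(SKOS_PREFIX)
print(concept_to_turtle(
    "SECONDARY EDUCATION",
    broader=["EDUCATION"],
    narrower=["SECONDARY SCHOOLS"],
))
```

Because BT, NT and RT map directly onto SKOS properties, the existing thesaurus structure carries over without remodelling, which is part of what makes the thesaurus easier to share and re-use as RDF.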
SKOS-HASSET
• Progress so far:
  • SKOS has been applied to HASSET
  • Texts prepared for the automated indexing case study
  • Gold standard of manual indexing of questions is taking place
  • TF/IDF, KEA and WEKA all being used for term extraction – work underway
• Next steps:
  • SKOS product licensing
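The idea of TF/IDF-based term suggestion can be shown on a toy corpus: terms from the controlled vocabulary that occur often in one text but rarely elsewhere score highest. This is a minimal sketch of plain TF-IDF restricted to single-word vocabulary terms; it does not reproduce KEA or WEKA, and the vocabulary and question texts are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical controlled terms and texts, for illustration only.
VOCABULARY = {"employment", "income", "housing", "health"}
DOCUMENTS = {
    "q1": "What is your current employment status and weekly income?",
    "q2": "How would you rate your general health?",
    "q3": "Is your housing rented, and what share of income goes on housing?",
}


def tokenize(text):
    """Lower-case the text and keep alphabetic words only."""
    return re.findall(r"[a-z]+", text.lower())


def suggest_terms(doc_id, documents, vocabulary, top_n=2):
    """Rank vocabulary terms found in one document by TF-IDF weight."""
    tokens = {name: tokenize(text) for name, text in documents.items()}
    counts = Counter(tokens[doc_id])
    scores = {}
    for term in vocabulary & set(counts):
        tf = counts[term] / len(tokens[doc_id])
        df = sum(1 for words in tokens.values() if term in words)
        idf = math.log(len(documents) / df)
        scores[term] = tf * idf
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in ranked[:top_n]]


print(suggest_terms("q3", DOCUMENTS, VOCABULARY))  # ['housing', 'income']
```

Suggestions like these would then be compared with the manually indexed gold standard to judge how well automated indexing performs.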
SKOS-HASSET
• Communication:
  • SKOS-HASSET blog: http://hassetukda.wordpress.com/
  • [email protected] email list
  • Project web site: http://www.data-archive.ac.uk/find/our-projects/skos-hasset
  • Webinar planned for the winter
  • User guidance
• Please contribute, give feedback!
Developments… to issues … to improvements
• For users:
  • the faceted search/browse interface exposed a lack of standardisation in the underlying metadata
  • …freetext terms have been used over 45 years; these are now being standardised
  • …rich freetext metadata has not been lost
  • the move from DDI-C (DDI 2.1) to DDI-L (DDI 3.1) brings in a conceptually different type of schema to the users’ benefit…
  • …but which also requires more controlled vocabularies
Developments… to issues … to improvements
• For us:
  • Applying more CVs will provide efficiencies:
    • …the Archive wants to introduce an online deposit form for its depositors which will include CV dropdowns
    • …create more ways of suggesting terms for the cataloguers
  • SKOS gives the opportunity to work more flexibly with the thesaurus
    • …automated indexing using CVs is being tested
    • …SKOS will allow for easier future thesaurus development
The future: analysis and reporting enhanced
[Diagram; labels recovered from the slide:]
• Web deposit form captures more and more controlled metadata from depositors
• Input programs automatically generate SN user guides and title pages
• Additional metadata created through text mining; geographic coordinates
• Manual metadata created, auto metadata checked; record completed with descriptors
• Metadata record
• User queries database
• Other, related terms automatically searched; ‘just-in-time’ and ‘similar’ results returned
• Metadata results returned
• User questioned about usefulness of results
• Results of user quality evaluation of search analysed
• Evaluation of metadata systems
• Search and browse activity monitored to inform data acquisition
• Management Information
• Analysis and reporting and future acquisitions decisions supported
Conclusion
• we all NEED metadata so that we can find stuff
• there is too much stuff (or not enough bodies) to create all the metadata ourselves in time these days
• searchers/users often expect the applications to do the work for them
• use the tools at our disposal to make this happen by:
  • employing more CVs where appropriate
  • sharing and using RDF-enabled CVs
  • and, crucially, continuing the creation of quality-assured metadata using fewer resources
Conclusion
• JISC Intrallect report; quotation from Vic Lyte:
  • “A new researcher wishing to approach scholarly inquiry to determine the impact of global warming on penguin populations in South Antarctica doesn’t walk up to a Librarian and shout ‘Penguins!’.”
(Duncan, C. & Douglas, P. (2009). Automatic metadata generation: use cases and tools/priorities. Intrallect (for JISC).)
CONTACT
UK DATA ARCHIVE
UNIVERSITY OF ESSEX
WIVENHOE PARK
COLCHESTER
ESSEX CO4 3SQ
T +44 (0)1206 872001
E [email protected]