IWMW 2002: The Value of Metadata and How to Realise It

Parallel Session on Metadata

The Value of Metadata and How to Realise It

Date: 18th June 2002

Facilitator: Dennis Nicholson, Centre for Digital Library Research

Notes and Slides

Theme: Examine, Discuss:

… the value of using metadata as an aid to reliable retrieval, both within individual Web sites and across distributed sites

… what the barriers to effective use of metadata are and how they can be overcome

… who should be responsible for creating and maintaining metadata: resource creators, web-masters, librarians?

Theme: Examine, Discuss:

… whether embedding and harvesting or a central database is the best approach

… plus (if time allows): a step beyond, the value of Content Management Systems

Focus: General

My background...

Responsibility to...

Stimulate: thought; discussion; debate

Draw out the important points

Impart ability to apply what we’ve discovered

Ensure participation

So…

Individual needs and circumstances?

Effective Retrieval: What is it?

A balance of precision and recall best suited to a given problem

High precision and low recall are usually preferred, but in some cases (e.g. patents) there may be an advantage in lowering precision to boost recall

The level of precision and recall should be under the user’s control, not a side effect of poor metadata
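To make the trade-off concrete, here is a minimal sketch using the standard definitions: precision is the share of retrieved items that are relevant, and recall is the share of relevant items that are retrieved. The document identifiers and the two searches are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical collection: four documents are actually relevant to the query.
relevant = {"d1", "d2", "d3", "d4"}

# A narrow search returns little, but everything it returns is relevant.
narrow_search = ["d1", "d2"]

# A broad search finds every relevant document, plus as much noise again.
broad_search = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"]

print(precision_recall(narrow_search, relevant))  # (1.0, 0.5)
print(precision_recall(broad_search, relevant))   # (0.5, 1.0)
```

The narrow search is the usual preference; the broad one is the patent-style case, where missing a relevant item costs more than wading through extra results.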

Effective Retrieval: Why does it matter?

Costs: it costs the University and the public purse to create the material, which is a waste if the people it is aimed at can’t find it

Strategic/PR considerations: if they can’t find your courses, expertise registers or digital images for sale when you want or need them to, they won’t use you, or talk or write about you

Effective Retrieval: When does it matter?

Only if it is ‘stuff’ you want found

The bigger they come, the sooner they fail…

The more ‘stuff’ you have, and the more campuses, or organisations in a collaboration, the harder it is to ensure effective retrieval

Especially with no or poor metadata

What is metadata?

Metadata is data about data

Consists of things like: Author; Title; Subject; Description; Level; Language; Viewer

Appropriate to function

The route to effective retrieval

Maybe...

What can go wrong?

Limited penetration (i.e. only some available documents covered): misleading results for users

Different metadata record formats: can the software cope? Is there a cross-walk?

Incompatible core field sets: cross-walk not possible
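A cross-walk is, at its simplest, a table that maps field names in one record format onto those of another. The sketch below assumes a made-up local format whose field names (author, keywords, summary, lang, viewer) are hypothetical; the target names are Dublin Core elements. Fields with no counterpart are reported rather than silently dropped, which is where incompatible core field sets show up.

```python
# Hypothetical cross-walk from a local record format to Dublin Core element names.
LOCAL_TO_DC = {
    "author": "creator",
    "title": "title",
    "keywords": "subject",
    "summary": "description",
    "lang": "language",
}


def crosswalk(record, mapping):
    """Rename fields using the mapping; return (converted, unmapped)."""
    converted, unmapped = {}, {}
    for field, value in record.items():
        target = mapping.get(field)
        if target is None:
            unmapped[field] = value  # no equivalent in the target format
        else:
            converted[target] = value
    return converted, unmapped


local_record = {
    "author": "Darwin, Charles",
    "title": "On the Origin of Species",
    "keywords": "evolution",
    "viewer": "PDF reader",
}

dc_record, left_over = crosswalk(local_record, LOCAL_TO_DC)
print(dc_record)  # {'creator': 'Darwin, Charles', 'title': 'On the Origin of Species', 'subject': 'evolution'}
print(left_over)  # {'viewer': 'PDF reader'}
```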

What can go wrong?

Different field sub-sets used (both use DC but with different field sets): full service limited to common fields

Different fields used for same data element (I put subject headings in the subject field and free-form keywords in the keyword field, but you put subject headings in the keyword field): misleading results
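The first point can be shown with a few lines of set arithmetic: a cross-site search can only offer the fields that every site populates. The two field sets below are invented for illustration.

```python
def common_fields(*field_sets):
    """Return the fields that every site supplies."""
    if not field_sets:
        return set()
    return set.intersection(*(set(fields) for fields in field_sets))


# Hypothetical sites, both using Dublin Core but with different sub-sets.
site_a_fields = {"title", "creator", "subject", "description"}
site_b_fields = {"title", "creator", "date", "language"}

print(sorted(common_fields(site_a_fields, site_b_fields)))  # ['creator', 'title']
```

Here neither subject nor date can be offered as a search option across both sites, even though each site holds one of them.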

What can go wrong?

Different or no standards applied in creating data element content (e.g. Darwin, C. or Charles Darwin): reduced retrieval; varied results

Different or no subject schemes and/or category lists (educational levels; LCSH v. UNESCO v. made up): reduced retrieval; varied results

Insufficient granularity (if everything physical is ‘physics’): poor precision, high recall
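The Darwin example can be sketched as follows. An exact-match search on the creator field finds only the records that happen to use the same form of the name, whereas mapping every variant to one preferred form first finds them all. The lookup table is a hypothetical stand-in for whatever name standard a site agrees on.

```python
# Hypothetical table mapping name variants to one agreed form.
PREFERRED_FORMS = {
    "darwin, c.": "Darwin, Charles",
    "charles darwin": "Darwin, Charles",
    "darwin, charles": "Darwin, Charles",
}


def preferred_name(name):
    """Return the agreed form of a name, or the name unchanged if unknown."""
    cleaned = name.strip()
    return PREFERRED_FORMS.get(cleaned.lower(), cleaned)


records = [
    {"id": 1, "creator": "Darwin, C."},
    {"id": 2, "creator": "Charles Darwin"},
    {"id": 3, "creator": "Darwin, Charles"},
]


def search_exact(records, creator):
    """Match the creator field character for character."""
    return [r["id"] for r in records if r["creator"] == creator]


def search_normalised(records, creator):
    """Match after converting both sides to the agreed form."""
    wanted = preferred_name(creator)
    return [r["id"] for r in records if preferred_name(r["creator"]) == wanted]


print(search_exact(records, "Charles Darwin"))       # [2]
print(search_normalised(records, "Charles Darwin"))  # [1, 2, 3]
```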

What can go wrong?

Varied or no methods of central co-ordination (2 sites or campuses): can cause some of the other problems listed above and below

Different sites index different fields (one has subjects and keywords in one index, another in separate indices): misleading for users

What can go wrong?

Missing indices (nothing on the subject in the index, or no subject index? (2 sites)): misleading retrieval

Humans can cope but machines can’t (a machine finds it harder than a human does to ‘spot’ different usages of the ‘same’ word or alternative words for the same thing): semantic web won’t work

Safeguards against:

Limited penetration: Policy? Training? DC-dot? Human monitor?

Different formats: discover need, agree policy, set standards, ensure software can cope with formats

Incompatible core field sets: identify formats (DC, IMS, MARC?) then agree core set of fields (e.g. 15 in DC base)
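Once a core set is agreed, records can be checked against it automatically. The sketch below lists the 15 elements of the basic Dublin Core set; the five-field core set and the sample record are assumptions made up for illustration, not a recommendation.

```python
# The 15 elements of the basic Dublin Core element set.
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
}

# Hypothetical core set that all sites have agreed to fill in.
CORE_SET = {"title", "creator", "subject", "description", "language"}


def check_record(record, core=CORE_SET, allowed=DC_ELEMENTS):
    """Report core fields that are missing or empty, and fields outside DC."""
    present = {field for field, value in record.items() if value}
    return {
        "missing": sorted(core - present),
        "unknown": sorted(set(record) - allowed),
    }


record = {
    "title": "Course list",
    "creator": "",            # present but empty, so treated as missing
    "subject": "physics",
    "keyword": "mechanics",   # not a Dublin Core element
}

print(check_record(record))
# {'missing': ['creator', 'description', 'language'], 'unknown': ['keyword']}
```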

Safeguards against:

Different field sub-sets used: agree, monitor, one core set

Different fields used for same data element: templates and examples, central co-ordination, guidelines, training

Safeguards against:

Different or no standards applied in creating data element content: template with examples

Different or no subject schemes and/or category lists: agree single schemes or lists, have drop-down lists, upgrade centrally

Safeguards against:

Insufficient granularity: agree usable level, training, examples

Varied or no methods of central co-ordination (2 sites or campuses): make sure it doesn’t happen!

Different sites index different fields: agree approach, implement and monitor standards

Safeguards against:

Missing indices: agree not to do this, and warn users if you can’t agree

Humans can cope but machines can’t (semantic web): use standard schemes and ontologies in standard ways, and map between different ones in a way that your software can process
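Mapping between schemes in a machine-processable way can be as simple as translating each site's subject terms into one shared vocabulary before searching. In this sketch both site vocabularies, the shared terms and the records are invented for illustration; real schemes would need agreed, maintained mappings.

```python
# Hypothetical mappings from each site's own subject terms to shared terms.
SITE_A_TO_SHARED = {
    "Physics": "physics",
    "Mechanics": "physics/mechanics",
}
SITE_B_TO_SHARED = {
    "Physical sciences": "physics",
    "Classical mechanics": "physics/mechanics",
}

site_a = [
    {"id": "a1", "subject": "Mechanics"},
    {"id": "a2", "subject": "Physics"},
]
site_b = [
    {"id": "b1", "subject": "Classical mechanics"},
    {"id": "b2", "subject": "Astronomy"},  # no mapping, so never matched
]


def search(shared_term, sites):
    """Return ids of records whose subject maps to the shared term."""
    hits = []
    for records, mapping in sites:
        for record in records:
            if mapping.get(record["subject"]) == shared_term:
                hits.append(record["id"])
    return hits


sites = [(site_a, SITE_A_TO_SHARED), (site_b, SITE_B_TO_SHARED)]
print(search("physics/mechanics", sites))  # ['a1', 'b1']
```

The unmapped term at the second site shows the limit of the approach: anything outside the mapping is invisible to a cross-site search.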

Where to keep it? Pros and cons of:

Embedding and harvesting: Metadata creation more likely? Harder to co-ordinate, easier to resource? More often out of date? Harder to ensure standardised metadata?

A central database: Easier to co-ordinate, more expensive to resource? Easier to maintain standards? How to ensure new stuff is notified?
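As a rough illustration of the embedding and harvesting option, the sketch below pulls Dublin Core values out of HTML meta tags written in the common name="DC.element" form, using only Python's standard library. The sample page is invented, and a real harvester would also need to fetch pages and cope with malformed markup.

```python
from html.parser import HTMLParser


class MetaHarvester(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> values from one page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = attributes.get("content")
        if name.startswith("dc.") and content:
            # Elements may repeat (e.g. several subjects), so keep a list.
            self.metadata.setdefault(name[3:], []).append(content)


def harvest(html_text):
    """Return a dict mapping DC element name to the list of values found."""
    parser = MetaHarvester()
    parser.feed(html_text)
    parser.close()
    return parser.metadata


# Hypothetical page with embedded metadata.
PAGE = """<html><head>
<title>Physics courses</title>
<meta name="DC.title" content="Physics courses">
<meta name="DC.creator" content="Example Department">
<meta name="DC.subject" content="physics">
<meta name="DC.subject" content="mechanics">
<meta name="keywords" content="not harvested">
</head><body></body></html>"""

print(harvest(PAGE))
# {'title': ['Physics courses'], 'creator': ['Example Department'], 'subject': ['physics', 'mechanics']}
```

The harvested dictionaries could then be loaded into a central index, which is one way of combining the two approaches.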

Where to keep it? Pros and cons of:

A mix of the two? Worst of both worlds? Or best? How to ensure the latter? Optimise author input of embedded metadata but allow central upgrades by metadata experts? Is this feasible? Is it cost-effective?

Depends on other factors? A question of designing to be fit for purpose?

Whose Responsibility?

Candidates; their pros and cons:

Resource creators? Au fait with the resource; Labour saving

Web-masters? Au fait with the technical landscape

Librarians? Au fait with knowledge and metadata domains

Public Relations? Au fait with the needs of the University

Anybody else? All of the above? Co-ordinated by?

Other Related Issues

A CMS would ensure: currency; accuracy; legality; authority of content retrieved by metadata

Not to mention: uniform look and feel control; easy total redesign and global changes; all content tracked; joint authorship across departments, units, different institutions; easy repurposing

All who have some responsibility can be involved in a controlled way?

Facilities

It would provide: content authoring; collaborative authoring; editing and workflow; preventing unauthorised editing or creation; scheduling publication; tracking changes; personalising; repurposing; metadata creation; knowledge management through semantic control

Closing Discussion…

Who has/plans to have a CMS?

What does it/will it cost?

Are they: Essential? Optional? Impractical? A threat to academic freedom?

Do they help solve the metadata problem?

Useful URLs

Metadata

http://content.lib.washington.edu/METADATA/ (Why should we care?)

http://www.ukoln.ac.uk/metadata/dcdot/

http://www.ukoln.ac.uk/web-focus/metadata/seminar-materials/exercises/dc-dot/dc-dot.doc

http://www.ukoln.ac.uk/metadata/dcassist/

Content Management Systems

http://www.ukoln.ac.uk/nof/support/help/papers/cms.htm (What are they?)

http://www.ariadne.ac.uk/issue30/techwatch/ (Who needs them?)

http://www.cultivate-int.org/issue5/cms/ (CMSs available)
