
Post on 01-Apr-2015


Page 1: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

DL: Lesson 5
Classification Schemas

Luca Dini
dini@celi.it

Page 2: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Overview

The Dublin Core defines a number of metadata elements, but what about the values for those elements?

Should they be unrestricted text values or come from pre-defined vocabularies?

The answer: "it depends".

We will discuss how to determine the appropriate approach for an organization's situation.

We will also cover how pre-defined vocabularies should be sourced, structured, and maintained.

Page 3: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Vocabulary development and maintenance

Vocabulary development and maintenance is the LEAST of three problems:

– The Vocabulary Problem: How are we going to build and maintain the lists of pre-defined values that can go into some of the metadata elements?

– The Tagging Problem: How are we going to populate metadata elements with complete and consistent values?

What can we expect to get from automatic classifiers? What kind of error detection and error correction procedures do we need?

– The ROI Problem: How are we going to use content, metadata, and vocabularies in applications to obtain business benefits?

More sales? Lower support costs? Greater productivity? How much content? How big an operating budget?

Need to know the answer to the ROI Problem before solving the Vocabulary Problem.

Page 4: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Definitions

Term: Definition

Metadata Element: A ‘field’ for storing information about one piece of content. Examples: Title, Creator, Subject, Date, …

Metadata Value: The ‘contents’ of one Metadata Element. Values may be text strings, or selections from a predefined vocabulary.

Metadata Schema: A defined set of metadata elements. The Dublin Core is one schema.

Free Text Value: An unconstrained text metadata value. Some text values are constrained to follow a format (e.g. YYYY-MM-DD).

Vocabulary: A list of predefined values for a metadata element.

Controlled Vocabulary: A vocabulary with a defined and enforced procedure for its update.

Page 5: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Controlled vocabularies

Hierarchical classification of things into a tree structure

Linnaeus …

Kingdom: Animalia
   Phylum: Chordata
      Class: Mammalia
         Order: Carnivora
            Family: Canidae
               Genus: Canis
                  Species: C. familiaris

UNSPSC …

Segment: 44 - Office Equipment and Accessories and Supplies
   Family: .12 - Office Supplies
      Class: .17 - Writing Instruments
         Commodity: .05 - Mechanical pencils
         Commodity: .06 - Wooden pencils
         Commodity: .07 - Colored pencils
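As a minimal sketch of how such a tree-structured code can be handled in software, the snippet below splits a commodity code into the four levels shown on the slide. It assumes the levels 44 / .12 / .17 / .05 are written together as one 8-digit string such as "44121705"; the function names are hypothetical.

```python
from typing import NamedTuple


class UnspscCode(NamedTuple):
    segment: str
    family: str
    klass: str
    commodity: str


def split_unspsc(code: str) -> UnspscCode:
    """Split an 8-digit code into its four two-digit levels."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError(f"expected 8 digits, got {code!r}")
    return UnspscCode(code[0:2], code[2:4], code[4:6], code[6:8])


def ancestors(code: str) -> list[str]:
    """Return the code prefixes from segment level down to commodity level."""
    split_unspsc(code)  # reuse the validation above
    return [code[:n] for n in (2, 4, 6, 8)]
```

Because each level is a fixed-width prefix of the next, broadening a query from "Mechanical pencils" to "Writing Instruments" is just a prefix match on the stored code.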

Page 6: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Types of vocabularies

Vocabulary Type | Complexity | Description | Relation Type

Term List | 1 | Simple list of terms with no internal structure or relations. | None

Synonym Rings | 2 | List of sets of terms to regard as equivalent. Widely supported in search software. | Equivalence

Authority Files | 3 | List of names for known entities – people, organizations, books, etc. | Reference

Classification Schemes | 4 | Hierarchical arrangement of concepts. | Loose Hierarchy

Thesauri | 5 | Hierarchical arrangement of concepts plus supporting information and additional, non-hierarchical, relations. | “Is-a” Hierarchy plus Loose Relations

Ontologies | 6 | Arrangement of concepts and relations based on a model of underlying reality – e.g. organs, symptoms, diseases & treatments in medicine. | Model-based Typed Relations
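To make the second row concrete, here is a rough sketch of a synonym ring used for query expansion. The ring contents and the `expand` helper are illustrative assumptions, not part of any particular search product.

```python
# Each ring is a set of terms that search should treat as equivalent.
SYNONYM_RINGS = [
    {"digital libraries", "electronic libraries", "virtual libraries"},
    {"pencil", "pencils"},
]


def expand(term, rings=SYNONYM_RINGS):
    """Return every term equivalent to `term`, or just the term if no ring has it."""
    key = term.strip().lower()
    for ring in rings:
        if key in ring:
            return set(ring)
    return {key}
```

A ring has no preferred term and no hierarchy, which is what separates it from the authority files and thesauri further down the table.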

Page 7: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Vocabulary Control

The degree of control over a vocabulary is (mostly) independent of its type.

– Uncontrolled – Anybody can add anything at any time and no effort is made to keep things consistent. Multiple lists and variations will abound.

– Managed – Software makes sure there is a list that is consistent (no duplicates, no orphan nodes) at any one time. Almost anybody can add anything, subject to consistency rules. (e.g. File System Hierarchy)

– Controlled – A documented process is followed for the update of the vocabulary. Few people have authority to change the list. Software may help, but emphasis is on human processes and custodianship. (e.g. Employee list)

Term lists, synonym lists, … can be controlled, managed, or uncontrolled.

Ontologies are managed.
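A "managed" vocabulary in the sense above can be sketched as a small class whose only job is to keep the list consistent. This is an illustrative assumption about what the software check might look like; the class and method names are hypothetical.

```python
class ManagedVocabulary:
    """Hierarchical term list that rejects duplicates and orphan nodes."""

    def __init__(self, root: str):
        self.parent = {root: None}  # term -> parent term (None for the root)

    def add(self, term: str, parent: str) -> None:
        """Anybody may add a term, as long as the consistency rules hold."""
        if term in self.parent:
            raise ValueError(f"duplicate term: {term!r}")
        if parent not in self.parent:
            raise ValueError(f"unknown parent, would create an orphan: {parent!r}")
        self.parent[term] = parent

    def path(self, term: str) -> list[str]:
        """Return the terms from the root down to `term`."""
        chain = []
        while term is not None:
            chain.append(term)
            term = self.parent[term]
        return chain[::-1]
```

Nothing here restricts who may call `add`; a controlled vocabulary would add a documented approval process around the same structure.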

Page 8: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Type of controls

Controlled vocabularies are frequently mentioned

– That does not mean they are always necessary

– Control comes at a cost, but can provide significant data quality benefits by reducing variations.

Is this a well-controlled vocabulary?

– No! It is an uncontrolled, but well-managed, term list

Is this part of an appropriate solution to the ROI problem?

– Yes! There is no budget to do ongoing control and QA

Source: http://del.icio.us/tag/

Page 9: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Likelihood of controlled values

Scale: (Virtually) Mandatory | Highly Likely | Maybe | Highly Unlikely | (Virtually) Impossible

Element: suggested vocabulary

Language: RFC 3066

Format: IMT

Coverage: ISO 3166

Type: DCMI Type?

Subject: Custom

Creator: LDAP?

Publisher: Custom

Contributor: LDAP?

Identifier: Custom

Date: W3C DTF

Rights, Title, Relation, Source, Description: none listed

Page 10: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Mandatory

DC recommends specific best practices:
– Language: RFC 3066 (which works with ISO 639)
– Format: Internet Media Types (aka MIME)

These vocabularies are widely used throughout the Internet. If you want to do something else, it should be justified.

– Describing physical objects? Use Extent and Medium refinements instead of Format.

– Regional (vs. National) dialects? a) Why? b) Consider a custom element in addition to standard Language
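One way to enforce these two "mandatory" vocabularies at data-entry time is a shape check on the values. The sketch below is an assumption about how that could look: the patterns are simplified (language tag: 1-8 letters followed by optional hyphenated subtags; media type: "type/subtype"), they test syntax only and do not confirm that a value is actually registered, and `check_record` is a hypothetical helper.

```python
import re

# Simplified RFC 3066 shape: 1-8 letters, then optional "-" subtags of 1-8 alphanumerics.
LANGUAGE_TAG = re.compile(r"[A-Za-z]{1,8}(?:-[A-Za-z0-9]{1,8})*")
# Simplified "type/subtype" shape of an Internet Media Type.
MEDIA_TYPE = re.compile(r"[a-z]+/[A-Za-z0-9][A-Za-z0-9.+-]*")


def check_record(record: dict) -> list[str]:
    """Return the names of elements whose values do not have the expected shape."""
    errors = []
    if not LANGUAGE_TAG.fullmatch(record.get("language", "")):
        errors.append("language")
    if not MEDIA_TYPE.fullmatch(record.get("format", "")):
        errors.append("format")
    return errors
```

For example, a record with language "en-GB" and format "text/html" passes, while "en_US" and "HTML page" are both flagged.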

Page 11: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Likely

DC recommends specific best practices:

– Coverage: ISO 3166
   ISO 3166 should be used unless you have good reasons to use something else
   Consider Getty Thesaurus of Geographic Names if you need cities, rivers, etc. (http://www.getty.edu/research/conducting_research/vocabularies/tgn/)
   DC provides Encodings for both

– Type: DCMITypes (http://dublincore.org/documents/dcmi-type-vocabulary/)
   DCMIType list is not necessarily a best practice
   No widely accepted type list exists, so a custom list is likely

Page 12: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

May be

Creator, Contributor could come from an “authority file”
– LC NAF in library contexts
– LDAP Directory in corporate contexts
   Recommended where possible
   Many exceptions where author is outside LDAP

Publisher could come from an authority file
– Org chart in corporate contexts – e.g. internal records management system.

Identifier should be a URI
– Organization may manage these, but it's typically a text field, not a controlled list.

Page 13: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Subject and extensions

Best practice: Use pre-defined subject schemes, not user-selected keywords.

– DC Encodings (DDC, LCC, LCSH, MESH, UDC) most useful in library contexts.

– Not useful for most corporate needs

Recommended: Factor “Subject” into separate facets.
– People, Places, Organizations, Events, Objects, Products & Services, Industry sectors, Content types, Audiences, Business Functions, Competencies, …

Store the different facets in different fields
– Use DC elements where appropriate (coverage, type, audience, …)
– Extend with custom elements for other fields (industry, products, …)

Page 14: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Thesauri

A Thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among synonymous, equivalent, broader, narrower and other related terms

Page 15: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Standards

National and International Standards for Thesauri

– ANSI/NISO Z39.19-1994 - American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri

– ANSI/NISO Draft Standard Z39.4-199x - American National Standard Guidelines for Indexes in Information Retrieval

– ISO 2788 - Documentation - Guidelines for the establishment and development of monolingual thesauri

– ISO 5964 - Documentation - Guidelines for the establishment and development of multilingual thesauri

Page 16: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Thesaurus Examples

Examples
– The ERIC Thesaurus of Descriptors
– The Medical Subject Headings (MeSH) of the National Library of Medicine
– The Art and Architecture Thesaurus

Page 17: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

ERIC Thesaurus – Entry

Page 18: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

ERIC Thesaurus – Online

http://www.eric.ed.gov/ERICWebPortal/Home.portal?_nfpb=true&_pageLabel=Thesaurus&_nfls=false

Page 19: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

MeSH

Page 20: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

MeSH Online

http://www.nlm.nih.gov/mesh/meshhome.html

Page 21: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Dewey

Dewey Decimal Classification System (DDC) first published in 1876 by Melvil Dewey

Most widely used classification system in the world (used in 135 countries)

In this country used primarily by public and school libraries

Editorial office located at the Library of Congress (published by OCLC)

Page 22: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Dewey

DDC is divided into ten main classes, each class into ten divisions, and each division into ten sections

The first digit in each three-digit number represents the main class.
– “500” = natural sciences and mathematics.

The second digit in each three-digit number indicates the division.
– “500” is used for general works on the sciences
– “510” for mathematics
– “520” for astronomy
– “530” for physics

Page 23: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Dewey

The third digit in each three-digit number indicates the section.

– “530” is used for general works on physics
– “531” for classical mechanics
– “532” for fluid mechanics
– “533” for gas mechanics

A decimal point follows the third digit in a class number, after which division by ten continues to the specific degree of classification needed.
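The digit-by-digit reading of a class number can be sketched in a few lines. `parse_ddc` is a hypothetical helper that only unpacks the notation; it does not look up captions and it does not check that a number is actually assigned in the DDC schedules.

```python
def parse_ddc(number: str) -> dict:
    """Unpack a Dewey class number such as '532.5' into its notational levels."""
    whole, _, decimals = number.partition(".")
    if len(whole) != 3 or not whole.isdigit():
        raise ValueError(f"expected three digits before the point, got {number!r}")
    if decimals and not decimals.isdigit():
        raise ValueError(f"non-digit subdivision in {number!r}")
    return {
        "main_class": whole[0] + "00",   # first digit, e.g. 500
        "division": whole[:2] + "0",     # first two digits, e.g. 530
        "section": whole,                # all three digits, e.g. 532
        "subdivision": decimals,         # digits after the decimal point, if any
    }
```

Under this sketch, "532.5" unpacks to main class 500, division 530, section 532, with "5" as the further decimal subdivision.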

Page 24: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Library of Congress Subjects

Essentially an artificial indexing language

Based on literary warrant

Entry vocabulary provided in the form of reference structure

Moving slowly towards a real thesaurus structure (not there yet)

Not faceted: subdivisions pre-selected, based on individual heading or “pattern” heading

Page 25: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

LCSH

Digital libraries
– see from “Electronic libraries”
– see from “Virtual libraries”
– see broader term: “Libraries”
– see also “Information storage and retrieval systems”
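The reference structure of this heading can be held in a small record, with the "see from" terms acting as entry vocabulary that leads to the preferred heading. This is a sketch under assumed names (`Heading`, `build_entry_index`); it is not how the Library of Congress stores its authority data.

```python
from dataclasses import dataclass, field


@dataclass
class Heading:
    label: str
    used_for: list[str] = field(default_factory=list)  # "see from" entry terms
    broader: list[str] = field(default_factory=list)   # "see broader term"
    related: list[str] = field(default_factory=list)   # "see also"


DIGITAL_LIBRARIES = Heading(
    label="Digital libraries",
    used_for=["Electronic libraries", "Virtual libraries"],
    broader=["Libraries"],
    related=["Information storage and retrieval systems"],
)


def build_entry_index(headings):
    """Map every entry term (lower-cased) to its preferred heading label."""
    index = {}
    for heading in headings:
        index[heading.label.lower()] = heading.label
        for variant in heading.used_for:
            index[variant.lower()] = heading.label
    return index
```

With such an index, a cataloguer or searcher who types "virtual libraries" is redirected to the preferred heading "Digital libraries".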

Page 26: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Library of Congress Classification

21 basic classes, based on single alphabetic character (K=law, N=art, etc.)

Subdivided into two or three alpha characters (KF=American Law, ND=painting, etc.)

Further subdivision by specific numeric assignment

Author numbers and dates arrange works by a particular author together and in chronological order

Page 27: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

LCC

153##$aQL638.E55$hZoology$hChordates. Vertebrates$hFishes$hSystematic divisions$hOsteichthys (Bony fishes). By family, A-Z$hFamilies$jEngraulidae (Anchovies)

– $a = Classification number--single number or beginning number of span (R)
– $h = Caption hierarchy
– $j = Caption (lowest level, relating to the specific number in $a)
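To show how the subfields of this record separate, here is a minimal sketch that splits the display form on "$". It assumes the textual "$x" convention used on the slide and that "$" never occurs inside the data; binary MARC files use a control character as the subfield delimiter instead, so this is not a MARC parser. `parse_subfields` is a hypothetical name.

```python
RECORD = (
    "153##$aQL638.E55$hZoology$hChordates. Vertebrates$hFishes"
    "$hSystematic divisions$hOsteichthys (Bony fishes). By family, A-Z"
    "$hFamilies$jEngraulidae (Anchovies)"
)


def parse_subfields(field_text: str) -> list[tuple[str, str]]:
    """Split '...$aXXX$hYYY' into [('a', 'XXX'), ('h', 'YYY')]."""
    pairs = []
    # Everything before the first '$' is the tag and indicators, so skip it.
    for chunk in field_text.split("$")[1:]:
        if chunk:
            pairs.append((chunk[0], chunk[1:]))
    return pairs


subfields = parse_subfields(RECORD)
number = next(value for code, value in subfields if code == "a")
hierarchy = [value for code, value in subfields if code == "h"]
caption = next(value for code, value in subfields if code == "j")
```

Here `number` holds the classification number, `hierarchy` the chain of captions from "Zoology" down to "Families", and `caption` the lowest-level caption.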

Page 28: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

DMOZ: A worst case example of a unified ‘subject’

DMOZ has over 600k categories

Most are a combination of common facets – Geography, Organization, Person, Document Type, …

(e.g.) Top: Regional: Europe: Spain: Travel and Tourism: Travel Guides

www.dmoz.org

Page 29: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

History of Faceted Navigation

Relatively new (taxonomies go back to Aristotle)

S. R. Ranganathan – 1960’s
– Issue of Compound Subjects
– The Universe consists of PMEST: Personality, Matter, Energy, Space, Time

Classification Research Group – 1950’s, 1970’s
– Based on Ranganathan, simplified, less doctrinaire
– Principles:
   Division – a facet must represent only one characteristic
   Mutual Exclusivity

Classification Theory to Web Implementation
– An Idea waiting for a technology
– Multiple Filters / dimensions

Page 30: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

What are Facets?

Facets are not categories
– Entities or concepts belong to a category
– Entities have facets

Facets are metadata - properties or attributes
– Entities or concepts fit into one category
– All entities have all facets – defined by set of values

Facets are orthogonal – mutually exclusive – dimensions
– An event is not a person is not a document is not a place.
– A winery is not a region is not a price is not a color.
– Relations between facets, subfacets, and foci (elements) are not restricted to hierarchical generalization-specialization relations
– Combined using grammars of order and relation to form compound descriptions

Page 31: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Faceted Classification

Clearly distinguishes between semantic relationships and syntactic relationships
– Semantic relationships
   Within a facet
   Containment relations
– Syntactic relationships
   Across facets
   Combinatoric relations

Have a “syntax” for syntactic combination of semantic terms

Page 32: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Semantic and Syntactic Relationships

Semantic relationships
– Is-A (thing/kind, genus/species)
   Mammals
   – Primates
      Humans
– Has-Parts
   Human
   – Head
      Eyes

Syntactic relationships
– Compounds
   Wheat + harvesting = “wheat harvesting”
   Object + operation = operation on object

Page 33: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

What is Faceted Navigation?

Not a Yahoo-style Browse
– Computer Stores under Computers and Internet
– One value per facet per entity

Faceted Navigation is not hierarchical
– Tree – travel up and down, not across
– Facets are filters, multidimensional

Facets are applied at search results time – post-coordination, not pre-coordination [Advanced Search]

Faceted Navigation is an active interface – dynamic combination of search and browse

Page 34: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

When to Use Faceted Navigation: Advantages

Systematic Advantages:
– Need fewer Elements
   4 facets of 10 nodes = 10,000 node taxonomy
– Ability to Handle Compound Subjects

Content Management Advantages:
– Easier to “categorize” – not as conceptual
– Fewer = simple, can use auto-classification better
– Flexible – can add new facets, elements in facet

Page 35: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

When to Use Faceted Navigation: Advantages (Implementation)

More intuitive – easy to guess what is behind each door

Simplicity of internal organization
– 20 questions – we know and use

Dynamic selection of categories
– Allow multiple perspectives

Trick Users into “using” Advanced Search
– wine where color = red, price = x-y, etc.
– Click on color red, click on price x-y, etc.

Flexible – can be combined with other navigation elements
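The "click on color red" interaction amounts to filtering the current result set by facet values and recounting what remains. The following is a rough sketch with made-up sample data; `refine` and `facet_counts` are hypothetical names, not the API of any search engine.

```python
from collections import Counter

# Illustrative sample data only.
WINES = [
    {"name": "Wine A", "color": "red", "region": "California", "price": 18},
    {"name": "Wine B", "color": "red", "region": "Australia", "price": 42},
    {"name": "Wine C", "color": "white", "region": "California", "price": 23},
]


def refine(items, **selected):
    """Keep items whose facet values equal every selected value (a logical AND)."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count the remaining items per value of one facet, for display next to each link."""
    return Counter(item[facet] for item in items)
```

Each click adds one more keyword argument to `refine`, so the user builds the equivalent of an advanced-search query without ever seeing a query form.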

Page 36: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

When to Use Faceted Navigation: Disadvantages

Systematic Disadvantages:
– Lack of Standards for Faceted Classifications
   Every project is unique customization

Implementation Disadvantages:
– Loss of Browse Context
   Difficult to grasp scope and relationships
– No immediate support for popular subjects

Essential Limit of Faceted Navigation
– Limited Domain Applicability – type and size
– Entities not concepts, documents, web sites

Page 37: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Developing Facet Structure: Selection of Facets: Theory

Issue - Complete Model of a domain

Ranganathan – PMEST
– Personality – Person, animal, event
– Matter – what x is made of
– Energy – how x changes
– Space – where x is
– Time – when x happens

Three Planes – Idea, Verbal, Notational

Page 38: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Facets: an example

A Language
– a English
– b French
– c Spanish

B Genre
– a Prose
– b Poetry
– c Drama

C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century

Aa English Literature

AaBa English Prose

AaBaCa English Prose 16th Century

AbBbCd French Poetry 19th Century

BcCd Drama 19th Century
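The notation in this example is regular enough to decode mechanically: each pair is a facet letter followed by a focus letter. A minimal sketch, with the table copied from the slide and `decode` as a hypothetical helper:

```python
FACETS = {
    "A": ("Language", {"a": "English", "b": "French", "c": "Spanish"}),
    "B": ("Genre", {"a": "Prose", "b": "Poetry", "c": "Drama"}),
    "C": ("Period", {"a": "16th Century", "b": "17th Century",
                     "c": "18th Century", "d": "19th Century"}),
}


def decode(notation: str) -> dict:
    """Turn a notation such as 'AbBbCd' into a facet-name -> focus mapping."""
    if len(notation) % 2:
        raise ValueError("notation must be pairs of facet letter + focus letter")
    result = {}
    for i in range(0, len(notation), 2):
        facet, focus = notation[i], notation[i + 1]
        name, foci = FACETS[facet]  # KeyError here means an unknown facet letter
        result[name] = foci[focus]  # KeyError here means an unknown focus letter
    return result
```

For instance, `decode("AbBbCd")` yields Language = French, Genre = Poetry, Period = 19th Century, and a facet that is absent from the notation is simply absent from the result.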

Page 39: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Developing Facet Structure: Selection of Facets: Practice (Wine.com)

Region
– Australia, California

Type
– Red Wine, White, Bubbly

Winery
– Alphabetical listing

Price
– $25 and below
– $25-$50

Top Rated Wines
– 90+ under $20

Top Sellers
– Cabernet Sauvignon
– Pinot Noir

Hot Features
– Wine outlet
– Sideways collection

Page 40: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Faceted Approach

Power
– 4 independent categories of 10 nodes = 10,000 nodes (10^4)

Faster construction
– Use existing taxonomies in specific fields

Reduced maintenance cost

More opportunity for data reuse

Can be easier to navigate with appropriate UI

60 nodes, 24,000 combinations
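The arithmetic behind the "power" claim is that the number of terms to maintain grows with the sum of the facet sizes, while the number of expressible combinations grows with their product. A small sketch, with `taxonomy_size` as a hypothetical helper:

```python
from math import prod


def taxonomy_size(facet_sizes):
    """Return (terms to maintain, combinations reachable) for independent facets."""
    return sum(facet_sizes), prod(facet_sizes)


# Four independent facets of ten nodes each, as on the slide.
nodes, combinations = taxonomy_size([10, 10, 10, 10])
```

For four facets of ten nodes this gives 40 terms to maintain against 10,000 combinations, which is the 10^4 figure quoted above.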

Page 41: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Organization

Either expose them directly in the user interface (post-coordination), or

combine them in a minimal hierarchy (pre-coordination), or

hide them from the user!

Post-coordination takes software support, which may be fancy or basic.

How many facets?
– Log10(#documents) as a guide
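As a worked example of the rule of thumb, the helper below turns a collection size into a suggested facet count. Rounding to the nearest integer and never going below one facet are assumptions added here; the slide only gives Log10(#documents) as a guide.

```python
import math


def suggested_facet_count(num_documents: int) -> int:
    """Rule-of-thumb facet count: log10 of the collection size, at least 1."""
    if num_documents < 1:
        raise ValueError("need at least one document")
    return max(1, round(math.log10(num_documents)))
```

Under this reading, a collection of 10,000 documents suggests about 4 facets and one of 1,000,000 documents about 6.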

Page 42: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Element | Data Type | Length | Req. / Repeat | Source | Purpose

Asset Metadata

Unique ID | Integer | Fixed | 1 | System supplied | Basic accountability

Recipe Title | String | Variable | 1 | Licensed Content | Text search & results display

Recipe summary | String | Variable | 1 | Licensed Content | Content

Main Ingredients | List | Variable | ? | Main Ingredients vocabulary | Key index to retrieve & aggregate recipes, & generate shopping list

Subject Metadata

Meal Types | List | Variable | * | Meal Types vocab | Browse or group recipes & filter search results

Cuisines | List | Variable | * | Cuisines |

Courses | List | Variable | * | Courses vocab |

Cooking Method | Flag | Fixed | * | Cooking vocab |

Link Metadata

Recipe Image | Pointer | Variable | ? | Product Group | Merchandize products

Use Metadata

Rating | String | Variable | 1 | Licensed Content | Filter, rank, & evaluate recipes

Release Date | Date | Fixed | 1 | Product Group | Publish & feature new recipes

Dublin Core mapping column, in slide order: dc:identifier, dc:title, dc:description, X, X, X, X, X, dcterms:hasPart, dc:date

Record-level values: dc:type=“recipe”, dc:format=“text/html”, dc:language=“en”
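A schema table like this one can be turned into a simple completeness check. The sketch below is an assumption-laden illustration: it reads "1" in the Req. / Repeat column as "required" and treats "?" and "*" as optional, the snake_case element names are invented for the example, and `validate` is a hypothetical helper.

```python
# Element name -> Req. / Repeat symbol, following the rows of the table.
SCHEMA = {
    "unique_id": "1", "recipe_title": "1", "recipe_summary": "1",
    "main_ingredients": "?", "meal_types": "*", "cuisines": "*",
    "courses": "*", "cooking_method": "*", "recipe_image": "?",
    "rating": "1", "release_date": "1",
}


def validate(record: dict) -> list[str]:
    """Report required elements that are missing and elements not in the schema."""
    problems = []
    for element, symbol in SCHEMA.items():
        # Assumption: only "1" marks an element as required.
        if symbol == "1" and record.get(element) in (None, "", []):
            problems.append(f"{element}: required")
    for element in sorted(record.keys() - SCHEMA.keys()):
        problems.append(f"{element}: not in schema")
    return problems
```

A check of this kind addresses only part of the Tagging Problem: it catches incomplete records, while consistency of the values still depends on the vocabularies named in the Source column.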

Page 43: DL:Lesson 5 Classification Schemas Luca Dini dini@celi.it

Project/exercise

Produce a faceted classification of your documents (at least 3 facets, min 5 foci each)

Encode the faceted classification as an extension of dc:subject

Attribute facets to your docs. Check extensibility by adding 10 new docs