2002.09.12 - SLIDE 1 IS 202 - FALL 2002 Lecture 06: Controlled Vocabularies Introduction Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00 pm Fall 2002 SIMS 202: Information Organization and Retrieval Some slides in this lecture were developed by Prof. Marti Hearst

2002.09.12 - SLIDE 1IS 202 - FALL 2002

Lecture 06: Controlled Vocabularies Introduction

Prof. Ray Larson & Prof. Marc Davis

UC Berkeley SIMS

Tuesday and Thursday 10:30 am - 12:00 pm

Fall 2002

SIMS 202:

Information Organization

and Retrieval

Some slides in this lecture were developed by Prof. Marti Hearst

2002.09.12 - SLIDE 2IS 202 - FALL 2002

Lecture Contents

• Review
– Dublin Core
– Other Metadata Systems

• Controlled Vocabularies

• Name Authority Files
– Choice of Names
– Form of Names

• Other Types of Controlled Vocabularies

• Faceted vs. Hierarchic Organization of Vocabularies

2002.09.12 - SLIDE 3IS 202 - FALL 2002

Lecture Contents

• Review
– Metadata Systems
– Dublin Core

• Controlled Vocabularies

• Name Authority Files
– Choice of Names
– Form of Names

• Other Types of Controlled Vocabularies

• Faceted vs. Hierarchic Organization of Vocabularies

2002.09.12 - SLIDE 4IS 202 - FALL 2002

Metadata Systems and Standards

• Naming and ID systems – URLs, ISBNs

• Bibliographic description – MARC, Dublin Core, TEI, etc.

• Music – SMDL

• Images and objects – CIMI, VRA Core Categories

• Numeric data – DDI, SDSM

• Geospatial data – FGDC

• Collections – EAD

2002.09.12 - SLIDE 5IS 202 - FALL 2002

Dublin Core

• Simple metadata for describing internet resources

• For “Document-Like Objects”

• 15 Elements (in base DC)

2002.09.12 - SLIDE 6IS 202 - FALL 2002

Dublin Core Elements

• Title

• Creator

• Subject

• Description

• Publisher

• Other Contributors

• Date

• Resource Type

• Format

• Resource Identifier

• Source

• Language

• Relation

• Coverage

• Rights Management

2002.09.12 - SLIDE 7IS 202 - FALL 2002

Title

• Label: TITLE

• The name given to the resource by the CREATOR or PUBLISHER

2002.09.12 - SLIDE 8IS 202 - FALL 2002

Author or Creator

• Label: CREATOR

• The person(s) or organization(s) primarily responsible for the intellectual content of the resource. For example, authors in the case of written documents, artists, photographers, or illustrators in the case of visual resources.

2002.09.12 - SLIDE 9IS 202 - FALL 2002

Subject and Keywords

• Label: SUBJECT

• The topic of the resource, or keywords or phrases that describe the subject or content of the resource. The intent of the specification of this element is to promote the use of controlled vocabularies and keywords. This element might well include scheme-qualified classification data (for example, Library of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified controlled vocabularies (such as Medical Subject Headings or Art and Architecture Thesaurus descriptors) as well.

2002.09.12 - SLIDE 10IS 202 - FALL 2002

Description

• Label: DESCRIPTION

• A textual description of the content of the resource, including abstracts in the case of document-like objects or content descriptions in the case of visual resources. Future metadata collections might well include computational content description (spectral analysis of a visual resource, for example) that may not be embeddable in current network systems. In such a case this field might contain a link to such a description rather than the description itself.

2002.09.12 - SLIDE 11IS 202 - FALL 2002

Publisher

• Label: PUBLISHER

• The entity responsible for making the resource available in its present form, such as a publisher, a university department, or a corporate entity. The intent of specifying this field is to identify the entity that provides access to the resource.

2002.09.12 - SLIDE 12IS 202 - FALL 2002

Other Contributors

• Label: CONTRIBUTORS

• Person(s) or organization(s) in addition to those specified in the CREATOR element who have made significant intellectual contributions to the resource but whose contribution is secondary to the individuals or entities specified in the CREATOR element (for example, editors, transcribers, illustrators, and convenors).

2002.09.12 - SLIDE 13IS 202 - FALL 2002

Date

• Label: DATE

• The date the resource was made available in its present form. The recommended best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI X3.30-1985. In this scheme, the date element for the day this is written would be 19961203, or December 3, 1996. Many other schema are possible, but if used, they should be identified in an unambiguous manner.
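The YYYYMMDD form described above can be sketched in a few lines of Python. This is only an illustration using the standard library; the helper names `to_dc_date` and `from_dc_date` are hypothetical and are not part of any Dublin Core specification.

```python
from datetime import date, datetime


def to_dc_date(d: date) -> str:
    """Format a date as the 8-digit YYYYMMDD string recommended for DATE."""
    return d.strftime("%Y%m%d")


def from_dc_date(text: str) -> date:
    """Parse an 8-digit YYYYMMDD string back into a date (raises ValueError if malformed)."""
    return datetime.strptime(text, "%Y%m%d").date()


# The example from the slide: December 3, 1996
print(to_dc_date(date(1996, 12, 3)))  # 19961203
```

Because the year comes first and every field is zero-padded, these strings sort chronologically as plain text, which is one practical reason to prefer a single controlled date form.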

2002.09.12 - SLIDE 14IS 202 - FALL 2002

Resource Type

• Label: RESOURCE TYPE

• The category of the resource, such as home page, novel, poem, working paper, preprint, technical report, essay, dictionary. It is expected that RESOURCE TYPE will be chosen from an enumerated list of types. One preliminary set of such types can be found at the following URL (now out of date): http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html

2002.09.12 - SLIDE 15IS 202 - FALL 2002

Format

• Label: FORMAT

• The data representation of the resource, such as text/html, ASCII, Postscript file, executable application, or JPEG image. The intent of specifying this element is to provide information necessary to allow people or machines to make decisions about the usability of the encoded data (what hardware and software might be required to display or execute it, for example). As with RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered Internet Media Types (MIME types). In principle, formats can include physical media such as books, serials, or other non-electronic media.
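As a rough illustration of drawing FORMAT values from an enumerated list, Python's standard `mimetypes` module maps file names to registered Internet Media Types. The wrapper `guess_format` and the fallback value "unknown" are assumptions made for this sketch, not Dublin Core practice.

```python
import mimetypes


def guess_format(filename: str) -> str:
    """Return a MIME type for the file name, or "unknown" if none is registered."""
    media_type, _encoding = mimetypes.guess_type(filename)
    return media_type or "unknown"


print(guess_format("sonnet.html"))  # a text/html document
print(guess_format("portrait.jpeg"))  # a JPEG image
```

The point is that the cataloguer picks a value from a shared list rather than typing "HTML page" or "web document" freehand.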

2002.09.12 - SLIDE 16IS 202 - FALL 2002

Resource Identifier

• Label: IDENTIFIER

• String or number used to uniquely identify the resource. Examples for networked resources include URLs and URNs (when implemented). Other globally-unique identifiers, such as International Standard Book Numbers (ISBN) or other formal names would also be candidates for this element.

2002.09.12 - SLIDE 17IS 202 - FALL 2002

Source

• Label: SOURCE

• The work, either print or electronic, from which this resource is derived, if applicable. For example, an html encoding of a Shakespearean sonnet might identify the paper version of the sonnet from which the electronic version was transcribed.

2002.09.12 - SLIDE 18IS 202 - FALL 2002

Language

• Label: LANGUAGE

• Language(s) of the intellectual content of the resource. Where practical, the content of this field should coincide with the Z39.53 three character codes for written languages. See: http://www.sil.org/sgml/nisoLang3-1994.html

2002.09.12 - SLIDE 19IS 202 - FALL 2002

Relation

• Label: RELATION

• Relationship to other resources. The intent of specifying this element is to provide a means to express relationships among resources that have formal relationships to others, but exist as discrete resources themselves. For example, images in a document, chapters in a book, or items in a collection. A formal specification of RELATION is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

2002.09.12 - SLIDE 20IS 202 - FALL 2002

Coverage

• Label: COVERAGE

• The spatial locations and temporal duration characteristic of the resource. Formal specification of COVERAGE is currently under development. Users and developers should understand that use of this element should be currently considered experimental.

2002.09.12 - SLIDE 21IS 202 - FALL 2002

Rights Management

• Label: RIGHTS

• The content of this element is intended to be a link (a URL or other suitable URI as appropriate) to a copyright notice, a rights-management statement, or perhaps a server that would provide such information in a dynamic way. The intent of specifying this field is to allow providers a means to associate terms and conditions or copyright statements with a resource or collection of resources. No assumptions should be made by users if such a field is empty or not present.
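To make the fifteen elements reviewed on the preceding slides concrete, here is a minimal sketch of a record held as a Python dictionary keyed by the labels given on those slides. The sample values, the name forms, and the helper `unknown_labels` are illustrative assumptions, not an official Dublin Core encoding.

```python
# The fifteen element labels as given on the slides above.
DC_LABELS = (
    "TITLE", "CREATOR", "SUBJECT", "DESCRIPTION", "PUBLISHER",
    "CONTRIBUTORS", "DATE", "RESOURCE TYPE", "FORMAT", "IDENTIFIER",
    "SOURCE", "LANGUAGE", "RELATION", "COVERAGE", "RIGHTS",
)


def unknown_labels(record: dict) -> list:
    """Return, in sorted order, any keys of the record that are not Dublin Core labels."""
    return sorted(label for label in record if label not in DC_LABELS)


# A hypothetical record for this lecture; every element is optional and repeatable.
record = {
    "TITLE": "Lecture 06: Controlled Vocabularies Introduction",
    "CREATOR": ["Larson, Ray", "Davis, Marc"],  # name form chosen for illustration
    "DATE": "20020912",                         # YYYYMMDD
    "FORMAT": "text/html",                      # assumed for the example
    "LANGUAGE": "eng",                          # three-character code for English
}

print(unknown_labels(record))  # []
```

Note that the check only controls which elements appear. It says nothing about what goes inside them, which is exactly the gap the next slide raises.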

2002.09.12 - SLIDE 22IS 202 - FALL 2002

Issues in Dublin Core

• Lack of guidance on what to put into each element

• How to structure or organize at the element level?

• How to ensure consistency across descriptions for the same persons, places, things, etc.?

2002.09.12 - SLIDE 23IS 202 - FALL 2002

Metadata

• Structures and languages for the description of information resources and their elements (components or features)

• “Metadata is information on the organization of the data, the various data domains, and the relationship between them” (Baeza-Yates p. 142)

2002.09.12 - SLIDE 24IS 202 - FALL 2002

Metadata

• Often two main types of metadata are distinguished:
– Descriptive metadata
  • Describes the information/data object and its properties
  • May use a variety of descriptive formats and rules
– Topical metadata
  • Describes the topic or “aboutness” of an information/data object
  • May include a variety of vocabularies for describing subjects, topics, categories, etc.

2002.09.12 - SLIDE 25IS 202 - FALL 2002

Lecture Contents

• Review
– Metadata Systems
– Dublin Core

• Controlled Vocabularies

• Name Authority Files
– Choice of Names
– Form of Names

• Other Types of Controlled Vocabularies

• Faceted vs. Hierarchic Organization of Vocabularies

2002.09.12 - SLIDE 26IS 202 - FALL 2002

Controlled Vocabularies

• Vocabulary control is the attempt to provide a standardized and consistent set of terms (such as subject headings, names, classifications, etc.) with the intent of aiding the searcher in finding information

• That is, it is an attempt to provide a consistent set of descriptions for use in (or as) metadata

2002.09.12 - SLIDE 27IS 202 - FALL 2002

Controlled Vocabularies

• Names and name authorities

• Gazetteers (geographic names)

• Code lists (e.g., LC language codes)

• Subject heading lists

• Classification schemes

• Thesauri

2002.09.12 - SLIDE 28IS 202 - FALL 2002

Lecture Contents

• Review
– Metadata Systems
– Dublin Core

• Controlled Vocabularies

• Name Authority Files
– Choice of Names
– Form of Names

• Other Types of Controlled Vocabularies

• Faceted vs. Hierarchic Organization of Vocabularies

2002.09.12 - SLIDE 29IS 202 - FALL 2002

Names

• Remember Cutter’s objectives of bibliographic description?
– To enable a person to find a document of which the author is known
– To show what the library has by a given author

• First serves access

• Second serves collocation

2002.09.12 - SLIDE 30IS 202 - FALL 2002

Problems with Names

• How many names should be associated with a document?

• Which of these should be the “main entry?”

• What form should each of the names take?

• What references should be made from other possible forms of names that haven’t been used?

2002.09.12 - SLIDE 31IS 202 - FALL 2002

The Problem

• Proliferation of the forms of names
– Different names for the same person
– Different people with the same names

• Examples
– From Books in Print (semi-controlled but not consistent)
– ERIC author index (not controlled)

2002.09.12 - SLIDE 32IS 202 - FALL 2002

Goethe

…etc…

2002.09.12 - SLIDE 33IS 202 - FALL 2002

John Muir

2002.09.12 - SLIDE 34IS 202 - FALL 2002

Pauline Cochrane nee Atherton

2002.09.12 - SLIDE 35IS 202 - FALL 2002

Pauline Cochrane nee Atherton

2002.09.12 - SLIDE 36IS 202 - FALL 2002

Rules for Description

• AACR II and other sets of descriptive cataloging rules provide guidelines for:
– Determining the number of name entries
– Choosing a main entry
– Deciding on the form of name to be used
– Deciding when to make references

2002.09.12 - SLIDE 37IS 202 - FALL 2002

Authority Control

• Authority control is concerned with the creation and maintenance of a set of terms that have been chosen as the standard representatives (also known as established) based on some set of rules

• If you have rules, why do you need to keep track of all of the headings? Can’t you just infer the headings from the rules?

2002.09.12 - SLIDE 38IS 202 - FALL 2002

Conditions of Authorship?

• Single person or single corporate entity

• Unknown or anonymous authors
– Fictitiously ascribed works

• Shared responsibility

• Collections or editorially assembled works

• Works of mixed responsibility (e.g., translations)

• Related works

2002.09.12 - SLIDE 39IS 202 - FALL 2002

Added Entries

• Personal names
– Collaborators
– Editors, compilers, writers
– Translators (in some cases)
– Illustrators (in some cases)
– Other persons associated with the work (such as the honoree in a festschrift)

• Corporate names
– Any prominently named corporate body that has involvement in the work beyond publication, distribution, etc.

2002.09.12 - SLIDE 40IS 202 - FALL 2002

Choice of Name

• AACR II says that the predominant form of the name used in a particular author’s writings should be chosen as the form of name

• References should be made from the other forms of the name

2002.09.12 - SLIDE 41IS 202 - FALL 2002

Form of the Name

• When names appear in multiple forms, one form needs to be chosen

• Criteria for choice are:
– Fullness (e.g., full names vs. initials only)
– Language of the name
– Spelling (choose predominant form)

• Entry element:
– John Smith or Smith, John?
– Mao Zedong or Zedong, Mao? (Mao Tse Tung?)
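The entry-element question can be made concrete with a deliberately naive sketch. The function below (a hypothetical helper, not an AACR II rule) simply moves the last word to the front, which works for "John Smith" but gives the wrong answer for a name whose family name already comes first.

```python
def naive_entry_form(name: str) -> str:
    """Invert a name by treating the last word as the entry element.

    This is intentionally simplistic: it assumes Western name order.
    """
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


print(naive_entry_form("John Smith"))  # Smith, John
print(naive_entry_form("Mao Zedong"))  # Zedong, Mao  (wrong: Mao is the family name)
```

A mechanical rule cannot know which element is the family name, so the chosen form has to be decided once and recorded somewhere. That record is the authority file.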

2002.09.12 - SLIDE 42IS 202 - FALL 2002

Name Authority Files

ID:NAFL8057230 ST:p EL:n STH:a MS:c UIP:a TD:19910821174242
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:05-14-80
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-21-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
053 PR6005.R517
100 10 Creasey, John
400 10 Cooke, M. E.
400 10 Cooke, Margaret,$d1908-1973
400 10 Cooper, Henry St. John,$d1908-1973
400 00 Credo,$d1908-1973
400 10 Fecamps, Elise
400 10 Gill, Patrick,$d1908-1973
400 10 Hope, Brian,$d1908-1973
400 10 Hughes, Colin,$d1908-1973
400 10 Marsden, James
400 10 Matheson, Rodney
400 10 Ranger, Ken
400 20 St. John, Henry,$d1908-1973
400 10 Wilde, Jimmy
500 10 $wnnnc$aAshe, Gordon,$d1908-1973

Different names for the same person
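One way to read the record above: the 100 field carries the established heading and each 400 field is a "see from" reference pointing to it. The sketch below models that with a plain dictionary, using a few of the variant names from the record. The data structure and function names are assumptions for illustration and do not reflect how any real authority system stores its data.

```python
# Established heading -> variant forms that should lead to it (subset of the 400 fields).
AUTHORITY = {
    "Creasey, John": [
        "Cooke, M. E.",
        "Fecamps, Elise",
        "Marsden, James",
        "Matheson, Rodney",
        "Ranger, Ken",
        "Wilde, Jimmy",
    ],
}


def build_see_index(authority: dict) -> dict:
    """Map every heading and every variant form to its established heading."""
    index = {}
    for heading, variants in authority.items():
        index[heading] = heading
        for variant in variants:
            index[variant] = heading
    return index


def resolve(index: dict, name: str):
    """Return the established heading for a name, or None if it is not in the file."""
    return index.get(name)


SEE_INDEX = build_see_index(AUTHORITY)
print(resolve(SEE_INDEX, "Wilde, Jimmy"))  # Creasey, John
```

A searcher who knows only one of the pseudonyms still lands on the single heading under which the works are collocated.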

2002.09.12 - SLIDE 43IS 202 - FALL 2002

Name Authority Files

ID:NAFO9114111 ST:p EL:n STH:a MS:n UIP:a TD:19910817053048
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:06-03-91
RFE:a CSC:c SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 08-19-91
040 OCoLC$cOCoLC
100 10 Marric, J. J.,$d1908-1973
500 10 $wnnnc$aCreasey, John
663 Works by this author are entered under the name used in the item. For a listing of other names used by this author, search also under$bCreasey, John
670 OCLC 13441825: His Gideon's day, 1955$b(hdg.: Creasey, John; usage: J.J. Marric)
670 LC data base, 6/10/91$b(hdg.: Creasey, John; usage: J.J. Marric)
670 Pseuds. and nicknames dict., c1987$b(Creasey, John, 1908-1973; British author; pseud.: Marric, J. J.)

2002.09.12 - SLIDE 44IS 202 - FALL 2002

Name Authority Files

ID:NAFL8166762 ST:p EL:n STH:a MS:c UIP:a TD:19910604053124
KRC:a NMU:a CRC:c UPN:a SBU:a SBC:a DID:n DF:08-20-81
RFE:a CSC: SRU:b SRT:n SRN:n TSS: TGA:? ROM:? MOD: VST:d 06-06-91
Other Versions: earlier
040 DLC$cDLC$dDLC$dOCoLC
100 10 Butler, William Vivian,$d1927-
400 10 Butler, W. V.$q(William Vivian),$d1927-
400 10 Marric, J. J.,$d1927-
670 His The durable desperadoes, 1973.
670 His The young detective's handbook, c1981:$bt.p. (W.V. Butler)
670 His Gideon's way, 1986:$bCIP t.p. (William Vivian Butler writing as J.J. Marric)

Different people writing with the same name
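The two records above show the same name, "Marric, J. J.", attached to two different people, told apart only by the dates in subfield $d. A small sketch of that disambiguation follows; the list-of-dicts layout and the `candidates` function are hypothetical, with values taken from the two records.

```python
# One entry per authority reference for the shared name (values from the records above).
ENTRIES = [
    {"name": "Marric, J. J.", "dates": "1908-1973",
     "heading": "Marric, J. J., 1908-1973", "see_also": "Creasey, John"},
    {"name": "Marric, J. J.", "dates": "1927-",
     "heading": "Butler, William Vivian, 1927-", "see_also": None},
]


def candidates(name: str, dates: str = None) -> list:
    """Return the headings a name may lead to, optionally narrowed by dates."""
    return [
        entry["heading"]
        for entry in ENTRIES
        if entry["name"] == name and (dates is None or entry["dates"] == dates)
    ]


print(candidates("Marric, J. J."))           # two possible people
print(candidates("Marric, J. J.", "1927-"))  # ['Butler, William Vivian, 1927-']
```

Without a qualifier such as dates, the name alone is ambiguous, which is why headings cannot simply be inferred from the form-of-name rules.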

2002.09.12 - SLIDE 45IS 202 - FALL 2002

The Haunting of Lauran Paine

1. Paine, Lauran. ALSO KNOWN AS: Carrel, Mark. Thompson, Russ. Andrews, A. A. Benton, Will. Bradford, Will. Bradley, Concho. Brennan, Will. Carter, Nevada. Allen, Clay. Almonte, Rosa. Armour, John. Cassady, Claude. Glendenning, Donn. Kelley, Ray. Kilgore, John. Martin, Tom. Slaughter, Jim. Standish, Buck. …

Batchelor, Reg. Beck, Harry. Bedford, Kenneth. Bosworth, Frank. Bovee, Ruth. Cassidy, Claude. Custer, Clint. Dana, Amber. Dana, Richard. Davis, Audrey. Drexler, J. F. Duchesne, Antoinette. Fisher, Margot. Fleck, Betty. Frost, Joni. Gordon, Angela. Gorman, Beth. Hayden, Jay. Houston, Will. Howard, Troy. Ingersol, Jared. …

Kelly, Ray. Ketchum, Jack. Liggett, Hunter. Lucas, J. K. Lyon, Buck. Morgan, Arlene. Morgan, Valerie. O'Connor, Clint. St. George, Arthur. Sharp, Helen. Thorn, Barbara. Archer, Dennis. Clark, Badger.

2002.09.12 - SLIDE 46IS 202 - FALL 2002

Some Interesting Ones…

2002.09.12 - SLIDE 47IS 202 - FALL 2002

Lecture Contents

• Review
– Dublin Core
– Other Metadata Systems

• Controlled Vocabularies

• Name Authority Files
– Choice of Names
– Form of Names

• Other Types of Controlled Vocabularies

• Faceted vs. Hierarchic Organization of Vocabularies

2002.09.12 - SLIDE 48IS 202 - FALL 2002

Structure of an IR System

[Figure: Information Storage and Retrieval System. Search line: interest profiles and queries, then formulating the query in terms of descriptors, then storage of profiles (Store 1: profiles/search requests). Storage line: documents and data, then indexing (descriptive and subject), then storage of documents (Store 2: document representations). Both lines are governed by the "rules of the game" = rules for subject indexing + thesaurus (which consists of a lead-in vocabulary and an indexing language). The two stores meet at comparison/matching, which yields potentially relevant documents.]

Adapted from Soergel, p. 19

2002.09.12 - SLIDE 49IS 202 - FALL 2002

Uses of Controlled Vocabularies

• Library subject headings, classification, and authority files

• Commercial journal indexing services and databases

• Yahoo, and other web classification schemes

• Online and manual systems within organizations
– SunSolve
– MacArthur

2002.09.12 - SLIDE 50IS 202 - FALL 2002

Types of Indexing Languages

• Uncontrolled keyword indexing

• Indexing languages
– Controlled, but not structured

• Thesauri
– Controlled and structured

• Classification systems
– Controlled, structured, and coded

• Faceted thesauri and classification systems

2002.09.12 - SLIDE 51IS 202 - FALL 2002

Indexing Languages

• An index is a systematic guide designed to indicate topics or features of documents in order to facilitate retrieval of documents or parts of documents

• An indexing language is the set of terms used in an index to represent topics or features of documents, and the rules for combining or using those terms

2002.09.12 - SLIDE 52IS 202 - FALL 2002

Indexing Languages

• Library of Congress Subject Headings

• Yellow pages topics

• Wilson indexes (“reader’s guide”)

2002.09.12 - SLIDE 53IS 202 - FALL 2002

Thesauri

• A thesaurus is a collection of selected vocabulary (preferred terms or descriptors) with links among
– Synonymous
– Equivalent
– Broader
– Narrower, and
– Other related terms
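The kinds of links listed above map naturally onto a small data structure. The sketch below is one possible shape, assuming each preferred term keeps sets of broader, narrower, related, and "used for" (non-preferred) terms. The class names and the sample relationships are invented for illustration and are not taken from any published thesaurus.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    label: str
    used_for: set = field(default_factory=set)  # synonymous/equivalent entry terms
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)


class Thesaurus:
    def __init__(self):
        self.terms = {}    # preferred label -> Term
        self.lead_in = {}  # non-preferred word -> preferred label

    def term(self, label: str) -> Term:
        """Fetch the Term for a preferred label, creating it on first use."""
        if label not in self.terms:
            self.terms[label] = Term(label)
        return self.terms[label]

    def add_broader(self, narrow: str, broad: str) -> None:
        """Record a broader/narrower link in both directions."""
        self.term(narrow).broader.add(broad)
        self.term(broad).narrower.add(narrow)

    def add_related(self, a: str, b: str) -> None:
        """Record a symmetric related-term link."""
        self.term(a).related.add(b)
        self.term(b).related.add(a)

    def add_use_for(self, preferred: str, variant: str) -> None:
        """Record that `variant` is a lead-in term pointing to `preferred`."""
        self.term(preferred).used_for.add(variant)
        self.lead_in[variant] = preferred

    def preferred(self, word: str):
        """Return the descriptor to index or search under, or None if unknown."""
        if word in self.terms:
            return word
        return self.lead_in.get(word)


# Hypothetical relationships, for illustration only.
demo = Thesaurus()
demo.add_broader("Athletic Coaches", "Athletics")
demo.add_related("Athletics", "Sports Psychology")
demo.add_use_for("Athletics", "Sports")
print(demo.preferred("Sports"))  # Athletics
```

Storing each link in both directions means a searcher can move from a term to its broader, narrower, or related neighbours no matter which end they start from.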

2002.09.12 - SLIDE 54IS 202 - FALL 2002

Thesauri (Cont.)

• National and international standards for thesauri

– ANSI/NISO Z39.19 -- 1994 -- American National Standard Guidelines for the Construction, Format and Management of Monolingual Thesauri

– ANSI/NISO Draft Standard Z39.4-199x -- American National Standard Guidelines for Indexes in Information Retrieval

– ISO 2788 -- Documentation -- Guidelines for the establishment and development of monolingual thesauri

– ISO 5964 -- Documentation -- Guidelines for the establishment and development of multilingual thesauri

2002.09.12 - SLIDE 55IS 202 - FALL 2002

Thesauri (Cont.)

• Examples:
– The ERIC Thesaurus of Descriptors
– The Art and Architecture Thesaurus
– The Medical Subject Headings (MeSH) of the National Library of Medicine

2002.09.12 - SLIDE 56IS 202 - FALL 2002

Classification Systems

• A classification system is an indexing language often based on a broad ordering of topical areas

• Thesauri and classification systems both use this broad ordering and maintain a structure of broader, narrower, and related topics

• Classification schemes commonly use a coded notation for representing a topic and its place in relation to other terms

2002.09.12 - SLIDE 57IS 202 - FALL 2002

Classification Systems (Cont.)

• Examples:
– The Library of Congress Classification System
– The Dewey Decimal Classification System
– The ACM Computing Reviews Categories
– The American Mathematical Society Classification System

2002.09.12 - SLIDE 58IS 202 - FALL 2002

Using Controlled Vocabulary

• Start with the text of the document

• Attempt to “control” or regularize:
– The concepts expressed within
  • mutually exclusive
  • exhaustive
– The language used to express those concepts
  • limit the normal linguistic variations
  • regulate word order and structure of phrases
  • reduce the number of synonyms or near-synonyms

• Also, provide cross-references between concepts and their expression

Slide author: Marti Hearst (These slides follow Bates 88)

2002.09.12 - SLIDE 59IS 202 - FALL 2002

Classification Schemes

• Classify possible concepts.

• Goals:
– Completely distinct conceptual categories (mutually exclusive)
– Complete coverage of conceptual categories (exhaustive)

Slide author: Marti Hearst

2002.09.12 - SLIDE 60IS 202 - FALL 2002

Assigning Headings vs. Descriptors

• Subject headings
– Assign one (or a few) complex heading(s) to the document

• Descriptors
– Mix and match

How would we describe recipes using each technique?

Slide author: Marti Hearst

2002.09.12 - SLIDE 61IS 202 - FALL 2002

Subject Heading vs. Descriptors

• Wilsonline
– Athletes
– Athletes -- Health & hygiene
– Athletes -- Nutrition
– Athletes -- Physical Exams
– …
– Athletics
– Athletics -- Administration
– Athletics -- Equipment -- Catalogs
– …
– Sports -- Accidents and Injuries
– Sports -- Accidents and Injuries -- Prevention

• ERIC
– Athletes
– Athletic Coaches
– Athletic Equipment
– Athletic Fields
– Athletics
– …
– Sports Psychology
– Sportsmanship

Slide author: Marti Hearst

2002.09.12 - SLIDE 62IS 202 - FALL 2002

Subject Headings vs. Descriptors

Subject headings

• Describe the contents of an entire document
• Designed to be looked up in an alphabetical index
– Look up document under its heading
• Few (1-5) headings per document

Descriptors

• Describe one concept within a document
• Designed to be used in Boolean searching
– Combine to describe the desired document
• Many (5-25) descriptors per document

Slide author: Marti Hearst
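The contrast between the two access styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three documents and their term assignments are hypothetical, the heading and descriptor strings are borrowed from the Wilsonline and ERIC lists on the previous slide, and the helper names `browse_headings` and `boolean_and` are invented here. Real records would typically carry more descriptors per document than shown.

```python
# Hypothetical mini-collection; the term assignments are made up for illustration.
DOCS = {
    "doc1": {
        "headings": ["Athletes -- Nutrition"],
        "descriptors": {"Athletes", "Athletic Coaches", "Sports Psychology"},
    },
    "doc2": {
        "headings": ["Athletics -- Administration"],
        "descriptors": {"Athletics", "Athletic Coaches", "Athletic Fields"},
    },
    "doc3": {
        "headings": ["Sports -- Accidents and Injuries -- Prevention"],
        "descriptors": {"Athletes", "Athletic Equipment", "Sportsmanship"},
    },
}


def browse_headings(prefix):
    """Subject-heading access: scan an alphabetical index of whole headings."""
    index = sorted(
        (heading, doc_id)
        for doc_id, record in DOCS.items()
        for heading in record["headings"]
    )
    return [entry for entry in index if entry[0].startswith(prefix)]


def boolean_and(*terms):
    """Descriptor access: return documents indexed under ALL the given terms."""
    wanted = set(terms)
    return sorted(
        doc_id
        for doc_id, record in DOCS.items()
        if wanted <= record["descriptors"]
    )


print(browse_headings("Athlet"))
print(boolean_and("Athletes", "Athletic Coaches"))
```

The heading is a single pre-combined string that a searcher finds by position in the index, while descriptors are single concepts that the searcher combines at query time.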

2002.09.12 - SLIDE 63IS 202 - FALL 2002

Lecture Contents

• Review
– Dublin Core
– Other Metadata Systems
• Controlled Vocabularies
• Name Authority Files
– Choice of Names
– Form of Names
• Other Types of Controlled Vocabularies
• Faceted vs. Hierarchic Organization of Vocabularies

2002.09.12 - SLIDE 64IS 202 - FALL 2002

Hierarchical Classification

• Each category is successively broken down into smaller and smaller subdivisions

• No item occurs in more than one subdivision

• Each level divided out by a “character of division” (also known as a feature)
– Example:
  • Distinguish “Literature” based on:
    – Language
    – Genre
    – Time Period

Slide author: Marti Hearst

2002.09.12 - SLIDE 65IS 202 - FALL 2002

Hierarchical Classification

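[Tree diagram: Literature is divided by language (English, French, Spanish); each language is divided by genre (Prose, Poetry, Drama); each genre is divided by period (16th, 17th, 18th, 19th); the remaining branches are elided as "..."]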

Slide author: Marti Hearst

2002.09.12 - SLIDE 66IS 202 - FALL 2002

Labeled Categories for Hierarchical Classification

• LITERATURE
– 100 English Literature
  • 110 English Prose
    – English Prose 16th Century
    – English Prose 17th Century
    – English Prose 18th Century
    – ...
  • 111 English Poetry
    – 121 English Poetry 16th Century
    – 122 English Poetry 17th Century
    – ...
  • 112 English Drama
    – 130 English Drama 16th Century
    – …
– 200 French Literature

Slide author: Marti Hearst
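A hierarchy of this kind can be sketched as a nested structure in which the characters of division are applied in a fixed order (language, then genre, then period). The sketch below is a minimal illustration, not the scheme used by any real classification: the function names and the sample title are hypothetical, and the slide's class numbers are left out.

```python
# Characters of division, applied in this fixed order (values from the slides).
LANGUAGES = ["English", "French", "Spanish"]
GENRES = ["Prose", "Poetry", "Drama"]
PERIODS = ["16th Century", "17th Century", "18th Century", "19th Century"]


def build_hierarchy():
    """Return a nested dict: language -> genre -> period -> list of items."""
    return {
        language: {
            genre: {period: [] for period in PERIODS} for genre in GENRES
        }
        for language in LANGUAGES
    }


def place(tree, title, language, genre, period):
    """File an item in exactly one leaf and return the path to that leaf."""
    tree[language][genre][period].append(title)
    return ("Literature", language, genre, period)


def leaf_count(tree):
    """Count the leaf classes the hierarchy has to enumerate in advance."""
    return sum(
        len(periods) for genres in tree.values() for periods in genres.values()
    )


tree = build_hierarchy()
path = place(tree, "Sample play", "English", "Drama", "17th Century")
print(" > ".join(path))
print(leaf_count(tree))
```

Because every combination needs its own pre-built slot, three languages, three genres, and four periods already require 36 leaf classes, and each item lives in exactly one of them.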

2002.09.12 - SLIDE 67IS 202 - FALL 2002

Faceted Classification

• Create a separate, free-standing list for each characteristic or division (feature)

• Combine features to create a classification

Slide author: Marti Hearst

2002.09.12 - SLIDE 68IS 202 - FALL 2002

Faceted Classification Along With Labeled Categories

• A Language
– a English
– b French
– c Spanish

• B Genre
– a Prose
– b Poetry
– c Drama

• C Period
– a 16th Century
– b 17th Century
– c 18th Century
– d 19th Century

• Aa English Literature
• AaBa English Prose
• AaBaCa English Prose 16th Century
• AbBbCd French Poetry 19th Century
• BcCd Drama 19th Century

Slide author: Marti Hearst
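The facet notation can be decoded mechanically: each pair of letters names a facet (capital) and a value within it (lowercase). The following is a minimal sketch assuming exactly the three facets listed on this slide; the function names `decode` and `label` are invented for illustration, and the rule that a language on its own is read as "<Language> Literature" is inferred from the "Aa English Literature" example.

```python
# Facet tables copied from the slide: capital letter = facet, lowercase = value.
FACETS = {
    "A": {"a": "English", "b": "French", "c": "Spanish"},  # Language
    "B": {"a": "Prose", "b": "Poetry", "c": "Drama"},  # Genre
    "C": {  # Period
        "a": "16th Century",
        "b": "17th Century",
        "c": "18th Century",
        "d": "19th Century",
    },
}


def decode(notation):
    """Split a notation such as 'AaBaCa' into its facet terms."""
    if len(notation) % 2 != 0:
        raise ValueError("notation must be a sequence of facet/value pairs")
    return [
        FACETS[notation[i]][notation[i + 1]]
        for i in range(0, len(notation), 2)
    ]


def label(notation):
    """Build a readable class label from the combined facets."""
    terms = decode(notation)
    # Assumption: a language with no other facet is read as its literature.
    if len(notation) == 2 and notation[0] == "A":
        terms.append("Literature")
    return " ".join(terms)


for code in ["Aa", "AaBa", "AaBaCa", "AbBbCd"]:
    print(code, "=", label(code))
```

Only the 10 facet terms (3 languages, 3 genres, 4 periods) are stored; any combination, including partial ones that omit a facet, is composed on demand rather than enumerated in advance.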

2002.09.12 - SLIDE 69IS 202 - FALL 2002

Important Questions

• How to use both types of classification structures?

• How to look through them?

• How to use them in search?

Slide author: Marti Hearst

2002.09.12 - SLIDE 70IS 202 - FALL 2002

Next Time

• Multimedia Information Organization and Retrieval (MED)

• Readings for next time (in Protected)
– “Indexing the Content of Multimedia Documents” (S. W. Smoliar, L. D. Wilcox)
– “Computational Media Aesthetics: Finding Meaning Beautiful” (C. Dorai, S. Venkatesh)
– “The Holy Grail of Content-Based Media Analysis” (S. Chang)

2002.09.12 - SLIDE 71IS 202 - FALL 2002

Homework (!)

• Do Readings

• Receive and integrate feedback on Assignment 2 to iterate your Photo Use Scenario (nothing to turn in on this yet)

• Assignment 3: Photo Metadata Design
– Due by Thursday, September 19