Finding a Common Language: Bringing Complex and Disparate Vocabularies Together

Paula R. McCoy, Manager, Taxonomy Development, ProQuest [email protected]

Upload: daniela-barbosa

Post on 19-May-2015

DESCRIPTION

This case study addresses the challenges ProQuest faced in managing multilingual controlled vocabularies using multiple Word documents and authority files maintained in an Oracle database. Speakers describe how implementing a thesaurus management tool helped ProQuest simplify and standardize its business semantic management to create a common language and connect disparate information assets. The tool also helped ProQuest handle large and varied vocabularies and authority files, link new and existing editorial systems, enable hierarchical views, and automate thesaurus management tasks.

TRANSCRIPT

Page 1: Finding a Common Language: Bringing Complex and Disparate Vocabularies Together

Paula R. McCoy, Manager, Taxonomy Development

[email protected]

Page 2

Part of Cambridge Information Group & CSA

Headquartered in Ann Arbor, Michigan

Editorial offices in Louisville, Kentucky

Page 3

Access to over 125 billion digital pages of content from magazine, trade, & scholarly publications, current & historical newspapers, original materials such as annual reports & civil war pamphlets, and daily wire feeds

Subscription-based ProQuest® online information service available in academic and public libraries

Page 4

Louisville editors abstract & index 4,000+ periodicals & newspapers

ProQuest Controlled Vocabulary used to index subjects; Authority Files used to index company, geographic, personal, product names

CV applied to non-periodical & third-party content via mapping, to allow cross-searching of multiple DBs with one vocabulary

Page 5

Topics of Discussion

Description of ProQuest Controlled Vocabulary & Authority Files

Taxonomy Management -- Overview

Life Before Synaptica

Thesaurus Management System Purchase

Implementing Synaptica

Life With Synaptica

Q&A

Page 6

ProQuest Controlled Vocabulary

PQ CV

Created in 1970s for ABI/INFORM business database

Based on Library of Congress Subject Headings

Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)

Page 7

ProQuest Controlled Vocabulary

Thesaurus subjects:
Business, economics & trade – 4300 terms
Science, math & technology – 1600 terms
Medicine – 1150 terms
Humanities – 960 terms
Government & policy – 850 terms
Education – 400 terms

Merged with general reference vocabulary in 1980s

Major development effort in past 4 years to boost science, education & medical terms

PQ CV

Page 8

ProQuest CV: Statistics

Preferred terms: 11,046

Non-preferred terms: 5631

Scope Notes: 3194 (29%)

Cross-references (Broader, Narrower, Related terms): 67,700

Terms added in 2007: 77

Terms added in 2008: 58+

PQ CV

Page 9

Authority Files: Statistics

Corporate/Organization Names: 438,098 (names added in 2008: 5489)

Personal Names: 416,239 (names added in 2008: 1526)

Geographic (Location) Names: 34,331 (names added in 2008: 144)

Product Names: 38,210 (names added in 2008: 54)

PQ CV

Page 10

The Taxonomy Manager’s Job

Add subject terms as dictated by new concepts & new content to index

Maintain hierarchies & Scope Notes

Load updated Thesaurus to ProQuest interface

Manage authority files to maintain standards & control file size

Taxonomy Management

Page 11

The Taxonomy Manager’s Job

OBJECTIVE: To ensure that indexers and searchers alike have access to a complete and accurate Thesaurus that they can use to maximize the discoverability of documents in ProQuest

Taxonomy Management

Page 12

Thesaurus on ProQuest®

Taxonomy Management

Page 13

Sample Subject Term

Taxonomy Management

Chronic obstructive pulmonary disease
SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
UF COPD
BT Disease
BT Respiratory diseases
NT Asthma
NT Bronchitis
NT Emphysema
RT Airway management
RT Lungs

Callouts:

Chronic obstructive pulmonary disease: Preferred, or main term

SN: Scope note defining term and how it is used

UF: Non-preferred term: points to term used to index

BT: Terms broader in nature to main term: COPD is a disease, and specifically, a respiratory disease

NT: Terms narrower in nature to main term: these are chronic lung diseases

RT: Terms related to main term that might be used to narrow the search
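The record above can be modelled as a small data structure. The following Python sketch is illustrative only and is not ProQuest's or Synaptica's data model; the `Term` class and the `resolve` helper are hypothetical names. It shows how a non-preferred entry such as COPD leads a searcher to the preferred term used for indexing.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """One thesaurus record, using the field codes from the slide."""
    label: str                                    # preferred (main) term
    sn: str = ""                                  # scope note
    uf: List[str] = field(default_factory=list)   # non-preferred terms
    bt: List[str] = field(default_factory=list)   # broader terms
    nt: List[str] = field(default_factory=list)   # narrower terms
    rt: List[str] = field(default_factory=list)   # related terms


copd = Term(
    label="Chronic obstructive pulmonary disease",
    sn=("Any lung disease, such as chronic bronchitis or emphysema, "
        "causing obstruction of bronchial airflow"),
    uf=["COPD"],
    bt=["Disease", "Respiratory diseases"],
    nt=["Asthma", "Bronchitis", "Emphysema"],
    rt=["Airway management", "Lungs"],
)


def resolve(query: str, terms: List[Term]) -> Optional[str]:
    """Return the preferred label for a query, following UF pointers."""
    for term in terms:
        if query == term.label or query in term.uf:
            return term.label
    return None


print(resolve("COPD", [copd]))
```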

Page 14

Before Synaptica

Managing terms meant:

Multiple files

Duplicate entries

Errors

= less than ideal thesaurus management

Page 15

MS Word Document

Before Synaptica

Academic degrees
SN: A title conferred on students upon graduating from a program of study at a college or university
UF: Associates degree; Bachelors degree; Doctoral degree; Masters degree
BT: Academic achievement
RT: Colleges & universities; Graduate studies; Graduation requirements; Higher education; MBA programs & graduates

Academic failure
SN: The failure of a student to meet academic standards, including failure to be promoted or to graduate
UF: Student failure
RT: Academic achievement; Academic grading; Academic probation; Academic underachievement; At risk students; Grade repetition; Graduation requirements; School dropouts; Social promotion

Academic freedom
SN: Educators’ freedom to teach and research what they choose
BT: Education
RT: Colleges & universities; Curricula; Research; Teachers; Teaching

Academic grading
UF: Grading of students
BT: Academic achievement
RT: Academic failure; Academic probation; Achievement tests; Cheating; Education portfolios; Educational evaluation; Tests

Academic guidance counseling
UF: Guidance counseling; Student counseling
BT: Counseling; Education
RT: Career preparation; Counselor client relationships; Counselor education; School counseling

Academic libraries
UF: College libraries; School libraries
BT: Libraries
RT: Librarians; Library resources

Academic marketing
SN: Efforts of educational institutions to attract students and funding
BT: Marketing
NT: Student recruitment
RT: Admissions policies; College admissions; College choice; Colleges & universities; Enrollment management; Enrollments

Academic probation
RT: Academic failure; Academic grading; Academic underachievement

Academic standards
SN: Standards for performance in defined academic areas set at the local, state, or federal levels
BT: Standards
RT: Academic achievement; Academic achievement gaps; Academic underachievement; Achievement tests; Core curriculum; Education policy; Educational evaluation; No Child Left Behind Act 2001-US; Quality of education; School effectiveness; Standardized tests

Academic underachievement
SN: Student performance that is below standards or below potential
RT: Academic achievement; Academic achievement gaps; Academic failure; Academic standards; At risk students; Grade repetition; Social promotion

Academy awards
UF: Oscars (Motion picture awards)
BT: Awards & honors; Motion picture industry
RT: Actors

Acadian culture
UF: Cajuns
BT: Minority & ethnic groups

Accelerated cost recovery system
CC: 4210
UF: ACRS
BT: Cost recovery; Depreciation; Depreciation methods
NT: Modified accelerated cost recovery system
RT: Capital cost recovery allowances; Declining balance method; Depreciable assets; Tax basis

Accelerated death benefits
CC: 4220
CC: 8210
UF: Living benefits; Viatical settlement
BT: Death benefits
RT: Estate planning; Hardship distributions; Insurance policies; Life insurance; Riders; Terminal illnesses

Accelerated depreciation methods
USE: Depreciation methods

Key: SN=Scope note CC=Classification code UF=Use for BT=Broader term NT=Narrower term RT=Related term

Version 2004 ProQuest Controlled Vocabulary of Subject Terms Page 3

Page 16

Vocabulary Documents in Word

ProQuest controlled vocabulary

French-language controlled vocabulary

German-language controlled vocabulary

Spanish-language controlled vocabulary

Combined PQ-CBCA controlled vocabulary

Ethnic database vocabulary, English

Ethnic database vocabulary, Spanish

Before Synaptica

Page 17

Oracle Database Forms

Before Synaptica

Page 18

Authority Files in Oracle

Class codes (related to subjects)

CORP names (391,665+ terms)

GEOG names (32,000+ terms)

PERS names (350,000+ terms)

PROD names (38,000+ terms)

NAIC codes (related to companies)

Before Synaptica

Page 19

Foreign-Language Vocabularies

French

German

Spanish

Before Synaptica

Page 20

Adding New Terms

1. Enter full term hierarchy into new Word doc

2. Copy term into main Word-based vocabulary & enter reciprocal relationships

3. Enter term & relationships into Oracle

4. Review next-day report on Oracle activity

5. Send new term doc to editors via e-mail

6. Print new vocabulary (at least every two years)

Before Synaptica
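Step 2 requires that every relationship entered on one term also be entered in reverse on the other term. A minimal Python sketch of that bookkeeping follows, assuming the usual thesaurus pairs (BT/NT, RT/RT, UF/USE). The function and variable names are hypothetical, and the sample terms are taken from the Word page shown earlier; this is not how the Word or Oracle workflow was implemented.

```python
from collections import defaultdict

# Reciprocal of each relationship type: BT <-> NT, RT <-> RT, UF <-> USE.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def add_relationship(thesaurus, term, rel, target):
    """Record term -rel-> target and the reciprocal link in one call."""
    thesaurus[term][rel].add(target)
    thesaurus[target][RECIPROCAL[rel]].add(term)


# thesaurus[term][relationship type] -> set of target terms
thesaurus = defaultdict(lambda: defaultdict(set))
add_relationship(thesaurus, "Academic grading", "BT", "Academic achievement")
add_relationship(thesaurus, "Academic grading", "UF", "Grading of students")
add_relationship(thesaurus, "Academic grading", "RT", "Academic failure")

print(sorted(thesaurus["Academic achievement"]["NT"]))
```

Entering both directions in a single call is what removes the separate "enter reciprocal relationships" step from the manual procedure.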

Page 21

Thesaurus Management Systems

TMS Purchase

Page 22

Buying Criteria

Up to 40 admin & 100 read-only users in multiple locations

Ability to load vocabs from multiple Word docs & Oracle authority files

Support for foreign language vocabularies

Ability to add new vocabularies

Vendor onsite installation & training

Software upgrades & tech support

TMS Purchase

Page 23

Buying Criteria

1. Ability to interact in real time with editorial system

2. Ability to accommodate authority files of 400,000+ names

TMS Purchase

Page 24

Implementing Synaptica

Contract signed and work begun in August 2004

PQ sent all the Word & Oracle files to Synaptica for analysis

Decision points: how to load & structure data; how to handle “suspect” or erroneous relationships

Page 25

Synaptica Data Analysis

Relationship Validation Tests:

Term Uniqueness
Use Violations
Self-Referencing Relationships
One Relationship per Term Pair
Relationship Unique
Circular References
Relationship Reciprocates

Exception Reports delivered to PQ; Errors fixed before production

Implementing Synaptica
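The slide names the validation tests but does not say how they were run. As a rough illustration, three of them (self-referencing relationships, reciprocation, and circular references) can be sketched in Python over a list of (term, relationship, target) triples. This is one plausible reading of the test names, not Synaptica's implementation; the function names and the sample data, which contains deliberate errors, are invented for the example.

```python
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def self_references(rels):
    """Relationships whose source and target are the same term."""
    return [r for r in rels if r[0] == r[2]]


def missing_reciprocals(rels):
    """Relationships whose reverse link is absent from the data."""
    present = set(rels)
    return [(s, r, t) for (s, r, t) in rels
            if (t, RECIPROCAL[r], s) not in present]


def circular_references(rels):
    """Terms that can reach themselves by following BT links."""
    broader = {}
    for s, r, t in rels:
        if r == "BT":
            broader.setdefault(s, set()).add(t)

    def reaches_itself(start):
        stack, seen = list(broader.get(start, ())), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(broader.get(node, ()))
        return False

    return sorted(term for term in broader if reaches_itself(term))


# Invented sample data with deliberate errors.
rels = [
    ("Water pollution", "BT", "Pollution"),
    ("Pollution", "NT", "Water pollution"),
    ("Pollution", "BT", "Water pollution"),   # creates a BT cycle
    ("Oceans", "RT", "Oceans"),               # self-reference
    ("Marine ecology", "RT", "Oceans"),       # reverse RT link missing
]

print(self_references(rels))
print(missing_reciprocals(rels))
print(circular_references(rels))
```

Each function returns the offending rows, which is the shape an exception report would need.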

Page 26

Use Validation Error

Marine resources is a non-preferred term (USE: Underwater resources), yet it still appears as a related term under Marine pollution and Marine ecology:

Underwater resources
UF: Marine resources
BT: Natural resources
RT: Marine conservation; Marine ecology; Marine pollution

Marine pollution
BT: Pollution; Water pollution
RT: Marine conservation; Marine ecology; Ocean dumping; Marine resources

Marine ecology
SN: The ecology of the seas and oceans
UF: Benthic ecology
BT: Ecology
RT: Marine conservation; Marine pollution; Marine resources; Oceans

Marine resources
USE: Underwater resources

Implementing Synaptica
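A check for this kind of error can be sketched in Python, assuming the slide's example reads as follows: Marine resources is non-preferred (USE: Underwater resources) but is still listed as an RT under Marine pollution and Marine ecology. The function name and data layout are hypothetical, and only the relationships needed for the example are included.

```python
def use_violations(records, use_pointers):
    """Find BT/NT/RT targets that are non-preferred terms.

    records:      {preferred term: {relationship type: [targets]}}
    use_pointers: {non-preferred term: preferred term}
    Returns (term, relationship, bad target, suggested replacement) tuples.
    """
    hits = []
    for term, rels in records.items():
        for rel in ("BT", "NT", "RT"):
            for target in rels.get(rel, []):
                if target in use_pointers:
                    hits.append((term, rel, target, use_pointers[target]))
    return hits


records = {
    "Underwater resources": {
        "BT": ["Natural resources"],
        "RT": ["Marine conservation", "Marine ecology", "Marine pollution"],
    },
    "Marine pollution": {
        "BT": ["Pollution", "Water pollution"],
        "RT": ["Marine conservation", "Marine ecology",
               "Ocean dumping", "Marine resources"],
    },
    "Marine ecology": {
        "BT": ["Ecology"],
        "RT": ["Marine conservation", "Marine pollution",
               "Marine resources", "Oceans"],
    },
}
use_pointers = {"Marine resources": "Underwater resources"}

for hit in use_violations(records, use_pointers):
    print(hit)
```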

Page 27

Foreign-Language Errors

Terms with no language equivalent (LEQ), e.g., no translation

In all 3 languages, multiple English terms with the same translation, e.g.:

English term         French term    French term-revised
Purchasing           Achats
Shopping             Achats         Shopping
Buyers               Acheteurs
Purchasing agents    Acheteurs      Agents d'achat

Implementing Synaptica
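Duplicate translations like these are easy to detect mechanically. The Python sketch below groups English terms by their translation and reports any translation shared by more than one term. It uses the French pairs from the slide; the function name is hypothetical, and the source does not say how the errors were actually found.

```python
from collections import defaultdict


def shared_translations(pairs):
    """Map each translation to its English terms, keeping only duplicates."""
    by_translation = defaultdict(list)
    for english, translated in pairs:
        by_translation[translated].append(english)
    return {t: terms for t, terms in by_translation.items() if len(terms) > 1}


french = [
    ("Purchasing", "Achats"),
    ("Shopping", "Achats"),
    ("Buyers", "Acheteurs"),
    ("Purchasing agents", "Acheteurs"),
]
# After the revisions shown on the slide, every translation is unique.
french_revised = [
    ("Purchasing", "Achats"),
    ("Shopping", "Shopping"),
    ("Buyers", "Acheteurs"),
    ("Purchasing agents", "Agents d'achat"),
]

print(shared_translations(french))
print(shared_translations(french_revised))
```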

Page 28

Final Challenge

Issue: Different editorial systems = 2x data entry: once for Synaptica, once for Oracle

Solution: Overnight synchronization process to copy Synaptica work into Oracle every night

Synch process discontinued April 2008

Implementing Synaptica
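A one-way nightly copy of this kind can be sketched in a few lines of Python. The source does not describe how the synchronization was built, so everything below is an assumption: both systems are modelled as plain dictionaries keyed by term, `sync_one_way` is a hypothetical name, and the sample records are invented. Deletions are not propagated in this sketch.

```python
def sync_one_way(source, target):
    """Copy added or changed entries from source into target.

    Returns the keys that were written, so the run can be logged.
    """
    written = []
    for key, value in source.items():
        if target.get(key) != value:
            target[key] = value
            written.append(key)
    return written


# Invented sample state: the day's edits exist only in the source system.
synaptica = {
    "Marine ecology": {"BT": ["Ecology"]},
    "Marine pollution": {"BT": ["Pollution", "Water pollution"]},
}
oracle = {
    "Marine ecology": {"BT": ["Ecology"]},
    "Marine pollution": {"BT": ["Pollution"]},
}

print(sync_one_way(synaptica, oracle))
```

Running the copy a second time writes nothing, which makes the job safe to repeat after a failed night.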

Page 29

Putting Synaptica Into Production

Nov 2004

Deal with people resistant to change

Train users: provide documentation & hands-on demonstrative training

Encourage written feedback on system functionality

Send feedback to Synaptica; many of our suggestions were implemented in later versions

Implementing Synaptica

Page 30

Life With Synaptica

Word – Old, Bad

Synaptica – New, Good

Page 31

Adding Terms Today: 3 Easy Steps

1. Enter term and relationships into Synaptica “Item Details” window

2. Export report of new terms into Word

3. Send Word document to editors

Life With Synaptica

Page 32

Synaptica Updates

Synaptica version 6.0 released in early 2006

Synaptica version 7.0 is being implemented now:
• Enhanced user interface
• Semantic Web standardization (RDF, OWL, SKOS) and Web Services integration
• Expanded Reporting functionality
• Enhanced adding and editing of term relationships including “rapid-fire” simple drag-and-drop editing
• Improved global term editing
• Online help and user guides

Life With Synaptica
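To give a sense of what SKOS standardization means for a thesaurus record, the Python sketch below writes one term as SKOS in Turtle syntax using only string formatting. The SKOS properties (`skos:prefLabel`, `skos:altLabel`, `skos:scopeNote`, `skos:broader`, `skos:narrower`, `skos:related`) are part of the W3C vocabulary. Everything else is an assumption for illustration: the `ex:` namespace, the `slug` and `to_turtle` helpers, and the mapping of UF/BT/NT/RT onto those properties. It does not reflect Synaptica's actual export format, and labels containing quotation marks would need escaping.

```python
PREFIXES = (
    "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .\n"
    "@prefix ex: <http://example.org/vocab/> .\n"  # hypothetical namespace
)


def slug(label):
    """Turn a term label into a simple local name (illustrative only)."""
    return "".join(ch if ch.isalnum() else "_" for ch in label)


def to_turtle(label, scope_note="", uf=(), bt=(), nt=(), rt=()):
    """Render one thesaurus term as a skos:Concept in Turtle."""
    parts = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
    if scope_note:
        parts.append(f'skos:scopeNote "{scope_note}"@en')
    parts += [f'skos:altLabel "{alt}"@en' for alt in uf]
    parts += [f"skos:broader ex:{slug(t)}" for t in bt]
    parts += [f"skos:narrower ex:{slug(t)}" for t in nt]
    parts += [f"skos:related ex:{slug(t)}" for t in rt]
    return f"ex:{slug(label)} " + " ;\n    ".join(parts) + " ."


print(PREFIXES)
print(to_turtle(
    "Marine ecology",
    scope_note="The ecology of the seas and oceans",
    uf=["Benthic ecology"],
    bt=["Ecology"],
    rt=["Marine conservation", "Oceans"],
))
```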

Page 33

Benefits of Synaptica

Life With Synaptica

Greater awareness of thesaurus standards and terminology, e.g.: “preferred” and “non-preferred” instead of Use and Used For

Long-needed updating and improvement in term hierarchies; ability to provide thesaurus statistics

Increase in Company name NPTs — from 1935 to 8952 today

Immediate responsiveness to indexer needs — real-time term additions, esp. NPTs and SNs

Easier loading of updated Thesaurus on PQ interface