Finding a Common Language: Bringing Complex and Disparate Vocabularies Together
DESCRIPTION
This case study addresses the challenges ProQuest faced in managing multilingual controlled vocabularies kept in multiple Word documents and authority files maintained in an Oracle database. The speakers describe how implementing a thesaurus management tool helped ProQuest simplify and standardize its business semantic management, create a common language, and connect disparate information assets. The tool also handles large and varied vocabularies and authority files, links new and existing editorial systems, enables hierarchical views, and automates thesaurus management tasks.
TRANSCRIPT
Paula R. McCoy, Manager, Taxonomy Development
Finding a Common Language: Bringing Complex and Disparate Vocabularies Together
Part of Cambridge Information Group & CSA
Headquartered in Ann Arbor, Michigan
Editorial offices in Louisville, Kentucky
Access to over 125 billion digital pages of content from magazine, trade, & scholarly publications, current & historical newspapers, original materials such as annual reports & Civil War pamphlets, and daily wire feeds
Subscription-based ProQuest® online information service available in academic and public libraries
Louisville editors abstract & index 4,000+ periodicals & newspapers
ProQuest Controlled Vocabulary used to index subjects; Authority Files used to index company, geographic, personal, product names
CV applied to non-periodical & third-party content via mapping, to allow cross-searching of multiple DBs with one vocabulary
Topics of Discussion
Description of ProQuest Controlled Vocabulary & Authority Files
Taxonomy Management -- Overview
Life Before Synaptica
Thesaurus Management System Purchase
Implementing Synaptica
Life With Synaptica
Q&A
ProQuest Controlled Vocabulary
PQ CV
Created in 1970s for ABI/INFORM business database
Based on Library of Congress Subject Headings
Natural language, hierarchical vocabulary complying with ANSI/NISO Standard Z39.19 (Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies)
ProQuest Controlled Vocabulary
Thesaurus subjects:
Business, economics & trade – 4300 terms
Science, math & technology – 1600 terms
Medicine – 1150 terms
Humanities – 960 terms
Government & policy – 850 terms
Education – 400 terms
Merged with general reference vocabulary in 1980s
Major development effort in past 4 years to boost science, education & medical terms
ProQuest CV: Statistics
Preferred terms: 11,046
Non-preferred terms: 5631
Scope Notes: 3194 (29%)
Cross-references (Broader, Narrower, Related terms): 67,700
Terms added in 2007: 77
Terms added in 2008: 58+
Authority Files: Statistics
Corporate/Organization Names: 438,098 (names added in 2008: 5489)
Personal Names: 416,239 (names added in 2008: 1526)
Geographic (Location) Names: 34,331 (names added in 2008: 144)
Product Names: 38,210 (names added in 2008: 54)
The Taxonomy Manager’s Job
Add subject terms as dictated by new concepts & new content to index
Maintain hierarchies & Scope Notes
Load updated Thesaurus to ProQuest interface
Manage authority files to maintain standards & control file size
Taxonomy Management
The Taxonomy Manager’s Job
OBJECTIVE: To ensure that indexers and searchers alike have access to a complete and accurate Thesaurus that they can use to maximize the discoverability of documents in ProQuest
Thesaurus on ProQuest®
Sample Subject Term
Chronic obstructive pulmonary disease
  SN: Any lung disease, such as chronic bronchitis or emphysema, causing obstruction of bronchial airflow
  UF: COPD
  BT: Disease
  BT: Respiratory diseases
  NT: Asthma
  NT: Bronchitis
  NT: Emphysema
  RT: Airway management
  RT: Lungs
Heading: Preferred, or main term
SN: Scope note defining term and how it is used
UF: Non-preferred term: points to term used to index
BT: Terms broader in nature to main term: COPD is a disease, and specifically, a respiratory disease
NT: Terms narrower in nature to main term: these are chronic lung diseases
RT: Terms related to main term that might be used to narrow the search
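The structure of the sample term can be sketched as a small record type. This is only an illustration of the SN/UF/BT/NT/RT fields shown on the slide; the class name `Term`, its field names, and the `resolve` helper are hypothetical and do not reflect how ProQuest or Synaptica store terms.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Term:
    """One thesaurus entry, using the slide's field abbreviations."""
    preferred: str                                      # preferred (main) term
    scope_note: str = ""                                # SN
    used_for: List[str] = field(default_factory=list)   # UF: non-preferred terms
    broader: List[str] = field(default_factory=list)    # BT
    narrower: List[str] = field(default_factory=list)   # NT
    related: List[str] = field(default_factory=list)    # RT


copd = Term(
    preferred="Chronic obstructive pulmonary disease",
    scope_note=("Any lung disease, such as chronic bronchitis or emphysema, "
                "causing obstruction of bronchial airflow"),
    used_for=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
)


def resolve(query: str, terms: List[Term]) -> Optional[str]:
    """Return the preferred term for a query, following UF pointers."""
    wanted = query.casefold()
    for term in terms:
        if wanted == term.preferred.casefold():
            return term.preferred
        if any(wanted == alias.casefold() for alias in term.used_for):
            return term.preferred
    return None


print(resolve("copd", [copd]))
```

A searcher who types the non-preferred term COPD is pointed to the preferred heading, which is the behavior the UF line describes.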
Before Synaptica
Managing terms meant:
Multiple files
Duplicate entries
Errors
= less-than-ideal thesaurus management
MS Word Document
Academic degrees
  SN: A title conferred on students upon graduating from a program of study at a college or university
  UF: Associates degree; Bachelors degree; Doctoral degree; Masters degree
  BT: Academic achievement
  RT: Colleges & universities; Graduate studies; Graduation requirements; Higher education; MBA programs & graduates
Academic failure
  SN: The failure of a student to meet academic standards, including failure to be promoted or to graduate
  UF: Student failure
  RT: Academic achievement; Academic grading; Academic probation; Academic underachievement; At risk students; Grade repetition; Graduation requirements; School dropouts; Social promotion
Academic freedom
  SN: Educators’ freedom to teach and research what they choose
  BT: Education
  RT: Colleges & universities; Curricula; Research; Teachers; Teaching
Academic grading
  UF: Grading of students
  BT: Academic achievement
  RT: Academic failure; Academic probation; Achievement tests; Cheating; Education portfolios; Educational evaluation; Tests
Academic guidance counseling
  UF: Guidance counseling; Student counseling
  BT: Counseling; Education
  RT: Career preparation; Counselor client relationships; Counselor education; School counseling
Academic libraries
  UF: College libraries; School libraries
  BT: Libraries
  RT: Librarians; Library resources
Academic marketing
  SN: Efforts of educational institutions to attract students and funding
  BT: Marketing
  NT: Student recruitment
  RT: Admissions policies; College admissions; College choice; Colleges & universities; Enrollment management; Enrollments
Academic probation
  RT: Academic failure; Academic grading; Academic underachievement
Academic standards
  SN: Standards for performance in defined academic areas set at the local, state, or federal levels
  BT: Standards
  RT: Academic achievement; Academic achievement gaps; Academic underachievement; Achievement tests; Core curriculum; Education policy; Educational evaluation; No Child Left Behind Act 2001-US; Quality of education; School effectiveness; Standardized tests
Academic underachievement
  SN: Student performance that is below standards or below potential
  RT: Academic achievement; Academic achievement gaps; Academic failure; Academic standards; At risk students; Grade repetition; Social promotion
Academy awards
  UF: Oscars (Motion picture awards)
  BT: Awards & honors; Motion picture industry
  RT: Actors
Acadian culture
  UF: Cajuns
  BT: Minority & ethnic groups
Accelerated cost recovery system
  CC: 4210
  UF: ACRS
  BT: Cost recovery; Depreciation; Depreciation methods
  NT: Modified accelerated cost recovery system
  RT: Capital cost recovery allowances; Declining balance method; Depreciable assets; Tax basis
Accelerated death benefits
  CC: 4220
  CC: 8210
  UF: Living benefits; Viatical settlement
  BT: Death benefits
  RT: Estate planning; Hardship distributions; Insurance policies; Life insurance; Riders; Terminal illnesses
Accelerated depreciation methods
  USE: Depreciation methods
Key: SN=Scope note CC=Classification code UF=Use for BT=Broader term NT=Narrower term RT=Related term
Version 2004 ProQuest Controlled Vocabulary of Subject Terms, Page 3
Vocabulary Documents in Word
ProQuest controlled vocabulary
French-language controlled vocabulary
German-language controlled vocabulary
Spanish-language controlled vocabulary
Combined PQ-CBCA controlled vocabulary
Ethnic database vocabulary, English
Ethnic database vocabulary, Spanish
Oracle Database Forms
Authority Files in Oracle
Class codes (related to subjects)
CORP names (391,665+ terms)
GEOG names (32,000+ terms)
PERS names (350,000+ terms)
PROD names (38,000+ terms)
NAIC codes (related to companies)
Foreign-Language Vocabularies
French
German
Spanish
Adding New Terms
1. Enter full term hierarchy into new Word doc
2. Copy term into main Word-based vocabulary & enter reciprocal relationships
3. Enter term & relationships into Oracle
4. Review next-day report on Oracle activity
5. Send new term doc to editors via e-mail
6. Print new vocabulary (at least every two years)
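Step 2 above required entering every reciprocal relationship by hand. As a rough sketch of what that bookkeeping involves, the snippet below derives the reciprocal entries that a new term implies. The `RECIPROCAL` mapping and the `reciprocal_entries` function are hypothetical names for illustration, not part of any ProQuest or Synaptica tooling.

```python
# Reciprocal of each relationship code: BT <-> NT, RT <-> RT, UF <-> USE.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT", "UF": "USE", "USE": "UF"}


def reciprocal_entries(term, relationships):
    """Given a new term and its (code, target) pairs, list the
    (target, reciprocal_code, term) entries that must also be recorded."""
    return [(target, RECIPROCAL[code], term) for code, target in relationships]


new_term = "Chronic obstructive pulmonary disease"
relationships = [
    ("BT", "Respiratory diseases"),
    ("NT", "Emphysema"),
    ("RT", "Lungs"),
    ("UF", "COPD"),
]

for target, code, source in reciprocal_entries(new_term, relationships):
    print(f"{target} {code}: {source}")
```

Each printed line is an edit that had to be made somewhere else in the Word document, which is where duplicate entries and missed reciprocals could creep in.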
Thesaurus Management Systems
TMS Purchase
Buying Criteria
Up to 40 admin & 100 read-only users in multiple locations
Ability to load vocabs from multiple Word docs & Oracle authority files
Support for foreign language vocabularies
Ability to add new vocabularies
Vendor onsite installation & training
Software upgrades & tech support
Buying Criteria
1. Ability to interact in real time with editorial system
2. Ability to accommodate authority files of 400,000+ names
Implementing Synaptica
Contract signed and work begun in August 2004
PQ sent to Synaptica all the Word & Oracle files for analysis
Decision points: how to load & structure data; how to handle “suspect” or erroneous relationships
Synaptica Data Analysis
Relationship Validation Tests:
Term Uniqueness
Use Violations
Self-Referencing Relationships
One Relationship per Term Pair
Relationship Unique
Circular References
Relationship Reciprocates
Exception Reports delivered to PQ; Errors fixed before production
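A few of the relationship tests named above can be sketched in a short script. This is a minimal illustration and not Synaptica's implementation: the triple format, the function names `validate` and `circular_terms`, and the error labels are all assumptions made for the example, and only BT, NT, and RT codes are handled.

```python
from collections import defaultdict

# Only hierarchical and associative codes are handled in this sketch.
RECIPROCAL = {"BT": "NT", "NT": "BT", "RT": "RT"}


def circular_terms(relations):
    """Return terms that can reach themselves by following BT links."""
    broader = defaultdict(set)
    for src, code, tgt in relations:
        if code == "BT" and src != tgt:
            broader[src].add(tgt)

    def ancestors(term):
        seen, stack = set(), list(broader.get(term, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(broader.get(current, ()))
        return seen

    return sorted(t for t in list(broader) if t in ancestors(t))


def validate(relations):
    """relations: iterable of (source, code, target) triples.
    Returns a list of (label, detail) exceptions."""
    rel_set = set(relations)
    errors = []
    codes_by_pair = defaultdict(set)
    for src, code, tgt in sorted(rel_set):
        if src == tgt:
            errors.append(("self-referencing", (src, code, tgt)))
            continue
        codes_by_pair[(src, tgt)].add(code)
        if (tgt, RECIPROCAL[code], src) not in rel_set:
            errors.append(("missing reciprocal", (src, code, tgt)))
    for (src, tgt), codes in sorted(codes_by_pair.items()):
        if len(codes) > 1:
            errors.append(("multiple relationships for pair", (src, tgt)))
    errors.extend(("circular reference", t) for t in circular_terms(rel_set))
    return errors


sample = [
    ("Asthma", "BT", "Respiratory diseases"),
    ("Respiratory diseases", "NT", "Asthma"),
    ("Lungs", "RT", "Lungs"),                       # points at itself
    ("Emphysema", "BT", "Respiratory diseases"),    # reciprocal NT is missing
]

for problem in validate(sample):
    print(problem)
```

The returned list plays the role of an exception report: each entry names the failed test and the relationship that triggered it.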
Use Validation Error
Underwater resources
  UF: Marine resources
  BT: Natural resources
  RT: Marine conservation; Marine ecology; Marine pollution
Marine pollution
  BT: Pollution; Water pollution
  RT: Marine conservation; Marine ecology; Ocean dumping; Marine resources
Marine ecology
  SN: The ecology of the seas and oceans
  UF: Benthic ecology
  BT: Ecology
  RT: Marine conservation; Marine pollution; Marine resources; Oceans
Marine resources
  USE: Underwater resources
Error: Marine resources is a non-preferred term (USE: Underwater resources) but still appears as an RT under Marine pollution and Marine ecology
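The Marine resources case can be checked mechanically. The sketch below assumes a hypothetical dictionary layout for the four entries on this slide (the grouping of the RT terms is my reading of the flattened slide text) and flags any BT, NT, or RT that names a non-preferred term. The function name `use_violations` is illustrative only.

```python
def use_violations(entries):
    """entries: {heading: {code: [targets]}}.
    A heading that carries a USE field is non-preferred, so any
    BT/NT/RT pointing at it is reported as a violation."""
    non_preferred = {h for h, fields in entries.items() if "USE" in fields}
    found = []
    for heading, fields in entries.items():
        for code in ("BT", "NT", "RT"):
            for target in fields.get(code, []):
                if target in non_preferred:
                    found.append((heading, code, target))
    return found


entries = {
    "Underwater resources": {
        "UF": ["Marine resources"],
        "BT": ["Natural resources"],
        "RT": ["Marine conservation", "Marine ecology", "Marine pollution"],
    },
    "Marine pollution": {
        "BT": ["Pollution", "Water pollution"],
        "RT": ["Marine conservation", "Marine ecology", "Ocean dumping",
               "Marine resources"],
    },
    "Marine ecology": {
        "UF": ["Benthic ecology"],
        "BT": ["Ecology"],
        "RT": ["Marine conservation", "Marine pollution", "Marine resources",
               "Oceans"],
    },
    "Marine resources": {"USE": ["Underwater resources"]},
}

for heading, code, target in use_violations(entries):
    print(f"{heading} {code}: {target} (non-preferred)")
```

Both flagged relationships would need to point at Underwater resources instead.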
Foreign-Language Errors
Terms with no language equivalent (LEQ), i.e., no translation
In all 3 languages, multiple English terms with the same translation, e.g.:
English term         French term    French term-revised
Purchasing           Achats
Shopping             Achats         Shopping
Buyers               Acheteurs
Purchasing agents    Acheteurs      Agents d'achat
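Both kinds of foreign-language error lend themselves to a simple check. The following is a minimal sketch over (English term, translation) pairs; the function names and the pair format are assumptions made for illustration and say nothing about how the analysis was actually run.

```python
from collections import defaultdict


def duplicate_translations(pairs):
    """Return {translation: [english terms]} for every translation
    that is shared by more than one English term."""
    by_translation = defaultdict(list)
    for english, translation in pairs:
        if translation:
            by_translation[translation].append(english)
    return {t: terms for t, terms in by_translation.items() if len(terms) > 1}


def missing_translations(pairs):
    """Return English terms that have no language equivalent."""
    return [english for english, translation in pairs if not translation]


french = [
    ("Purchasing", "Achats"),
    ("Shopping", "Achats"),
    ("Buyers", "Acheteurs"),
    ("Purchasing agents", "Acheteurs"),
]

for translation, english_terms in duplicate_translations(french).items():
    print(f"{translation}: {', '.join(english_terms)}")
```

After the revision shown in the French example (Shopping and Agents d'achat), the same check would report no duplicates for these four terms.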
Final Challenge
Issue: Different editorial systems = 2x data entry: once for Synaptica, once for Oracle
Solution: Overnight synchronization process to copy Synaptica work into Oracle every night
Synch process discontinued April 2008
Putting Synaptica Into Production
Deal with people resistant to change
Train users — provide documentation & hands-on demonstrative training
Encourage written feedback on system functionality
Send feedback to Synaptica – many of our suggestions implemented in later versions
Nov 2004
Life With Synaptica
Word – Old, Bad
Synaptica – New, Good
Adding Terms Today: 3 Easy Steps
1. Enter term and relationships into Synaptica “Item Details” window
2. Export report of new terms into Word
3. Send Word document to editors
Synaptica Updates
Synaptica version 6.0 released in early 2006
Synaptica version 7.0 is being implemented now:
• Enhanced user interface
• Semantic Web standardization (RDF, OWL, SKOS) and Web Services integration
• Expanded reporting functionality
• Enhanced adding and editing of term relationships, including “rapid-fire” simple drag-and-drop editing
• Improved global term editing
• Online help and user guides
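To give a sense of what SKOS support means for a thesaurus term, the sketch below writes one term as SKOS in Turtle syntax using plain string formatting. The SKOS property names are from the W3C vocabulary, but the function `to_skos_turtle`, the placeholder base URI, and the label-to-IRI scheme are assumptions for illustration and do not represent Synaptica's export format. Labels containing quotation marks would need escaping, which this sketch omits.

```python
PREFIX = "@prefix skos: <http://www.w3.org/2004/02/skos/core#> ."


def to_skos_turtle(pref, alt=(), broader=(), narrower=(), related=(),
                   scope_note=None, base="http://example.org/vocab/"):
    """Render one thesaurus term as a skos:Concept in Turtle.
    `base` is a placeholder namespace, not a real vocabulary URI."""
    def iri(label):
        return f"<{base}{label.replace(' ', '_')}>"

    props = [f'skos:prefLabel "{pref}"@en']
    if scope_note:
        props.append(f'skos:scopeNote "{scope_note}"@en')
    props += [f'skos:altLabel "{a}"@en' for a in alt]          # UF
    props += [f"skos:broader {iri(b)}" for b in broader]       # BT
    props += [f"skos:narrower {iri(n)}" for n in narrower]     # NT
    props += [f"skos:related {iri(r)}" for r in related]       # RT
    body = " ;\n    ".join(["a skos:Concept"] + props)
    return f"{PREFIX}\n\n{iri(pref)} {body} ."


print(to_skos_turtle(
    "Chronic obstructive pulmonary disease",
    alt=["COPD"],
    broader=["Disease", "Respiratory diseases"],
    narrower=["Asthma", "Bronchitis", "Emphysema"],
    related=["Airway management", "Lungs"],
))
```

The mapping is direct: the preferred term becomes `skos:prefLabel`, UF becomes `skos:altLabel`, and BT, NT, and RT become `skos:broader`, `skos:narrower`, and `skos:related`.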
Benefits of Synaptica
Greater awareness of thesaurus standards and terminology, e.g.: “preferred” and “non-preferred” instead of Use and Used For
Long-needed updating and improvement in term hierarchies; ability to provide thesaurus statistics
Increase in Company name NPTs — from 1935 to 8952 today
Immediate responsiveness to indexer needs — real-time term additions, esp. NPTs and SNs
Easier loading of updated Thesaurus on PQ interface