![Page 1: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/1.jpg)
Helping Interdisciplinary Vocabulary Engineering (HIVE)
OCTOBER 31, 2011
Joan Boone, Nico Carver, Jane Greenberg, Lina Huang, Robert Losee, Mady Madhura, José Ramón Pérez Agüera, Lee Richardson, Ryan Scherle, Todd Vision, Hollie White, Craig Willis
![Page 2: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/2.jpg)
Overview
Part 1
• Introduction to HIVE
• Underlying rationale
• A scenario
• Research and challenges
Part 2
• Technical overview and implementation
• Progress and challenges
• Next steps
Part 3
• Let you experiment
![Page 3: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/3.jpg)
HIVE Team
Craig Willis
Bob Losee
Lee Richardson
Hollie White
Jane Greenberg
Madhura Marathe
Lina Huang
José R. P. Agüera
Ryan Scherle
![Page 4: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/4.jpg)
HIVE model
<AMG> approach for integrating discipline CVs
Model addressing CV cost, interoperability, and usability constraints (interdisciplinary environment)
![Page 5: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/5.jpg)
Data underlying peer-reviewed articles in the basic and applied biosciences
![Page 6: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/6.jpg)
Vocabulary needs for Dryad
• Vocabulary analysis – 600 keywords, Dryad partner journals
• Vocabularies: NBII Thesaurus, LCSH, the Getty’s TGN, ERIC Thesaurus, Gene Ontology, ITIS (10 vocabularies)
• Facets: taxon, geographic name, time period, topic, research method, genotype, phenotype…
• Results
  – 431 topical terms, exact matches: NBII Thesaurus, 25%; MeSH, 18%
  – 531 terms (topical terms, research method and taxon): LCSH, 22% found exact matches, 25% partial
• Conclusion: Need multiple vocabularies
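The exact/partial match percentages on this slide can be illustrated with a minimal sketch. It assumes that "exact" means a case-insensitive string match and that "partial" means sharing at least one whole word with some vocabulary term; the slide does not define either, and the sample keywords and vocabulary below are hypothetical, not data from the Dryad study.

```python
def normalize(term):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(term.lower().split())


def classify_match(keyword, vocabulary):
    """Return 'exact', 'partial', or 'none' for one keyword.

    `vocabulary` is a set of already-normalized terms.
    """
    kw = normalize(keyword)
    if kw in vocabulary:
        return "exact"
    kw_words = set(kw.split())
    # Assumed definition of "partial": at least one shared whole word.
    for term in vocabulary:
        if kw_words & set(term.split()):
            return "partial"
    return "none"


def match_rates(keywords, vocabulary_terms):
    """Fraction of keywords that match exactly, partially, or not at all."""
    vocab = {normalize(t) for t in vocabulary_terms}
    counts = {"exact": 0, "partial": 0, "none": 0}
    for kw in keywords:
        counts[classify_match(kw, vocab)] += 1
    total = len(keywords)
    if total == 0:
        return {kind: 0.0 for kind in counts}
    return {kind: n / total for kind, n in counts.items()}


# Hypothetical sample data for illustration only.
keywords = ["Seed dispersal", "wood density", "Phylogeny", "leaf traits"]
vocabulary = ["Seed dispersal", "Phylogeny", "Wood", "Climate change"]
rates = match_rates(keywords, vocabulary)
print(rates)
```

Running the same function once per vocabulary gives per-vocabulary coverage figures of the kind reported above; low coverage for every single vocabulary is what motivates combining several.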
![Page 7: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/7.jpg)
HIVE Goals
1. Provide efficient, affordable, interoperable, and user-friendly access to multiple vocabularies during metadata creation activities
2. Present a model and an approach that can be replicated
→ not necessarily a service
HIVE work-plan: 3 Phases
1. Building HIVE: vocabulary preparation, server development
2. Sharing HIVE: continuing education (empowering information professionals)
3. Evaluating HIVE: examining HIVE in Dryad
![Page 8: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/8.jpg)
HIVE Partners
Vocabulary Partners
• Library of Congress: LCSH
• The Getty Research Institute (GRI): TGN (Thesaurus of Geographic Names)
• United States Geological Survey (USGS): NBII Thesaurus, Integrated Taxonomic Information System (ITIS)
• National Library of Medicine and the National Agricultural Library
Advisory Board
• Jim Balhoff, NESCent
• Libby Dechman, LCSH
• Mike Frame, USGS
• Alistair Miles, Oxford, UK
• William Moen, University of North Texas
• Eva Méndez Rodríguez, University Carlos III of Madrid
• Joseph Shubitowski, Getty Research Institute
• Ed Summers, LCSH
• Barbara Tillett, Library of Congress
• Kathy Wisser, Simmons
• Lisa Zolly, USGS
Workshop Hosts
• Columbia Univ.; Univ. of California, San Diego; George Washington University; Univ. of North Texas; Universidad Carlos III de Madrid, Madrid, Spain
![Page 9: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/9.jpg)
![Page 10: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/10.jpg)
![Page 11: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/11.jpg)
![Page 12: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/12.jpg)
HIVE is for…
HIVE for resource creators (with Dryad: scientists, depositors)
HIVE for information professionals: curators, professional librarians, archivists, museum catalogers
![Page 13: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/13.jpg)
Amy
• Meet Amy Zanne. She is a botanist.
• Like every good scientist, she publishes, and she deposits data in Dryad.
Amy’s data
![Page 14: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/14.jpg)
![Page 15: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/15.jpg)
![Page 16: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/16.jpg)
Usability
Huang, 2010
![Page 17: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/17.jpg)
System usability and flow metrics
Huang, 2010
![Page 18: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/18.jpg)
Challenges
• Building vs. doing/analysis
  – Source for HIVE generation, beyond abstracts
• Combining many vocabularies during the indexing/term matching phase is difficult, time consuming, inefficient.
  – NLP and machine learning offer promise
• Interoperability = dumbing down ontologies
• Proof-of-concept/illustrate the differences between HIVE and other vocabulary registries (NCBO and OBO Foundry)
• People wanting a service
• General large team logistics, and having people from multiple disciplines (also the ++)
![Page 19: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/19.jpg)
HIVE Technical Overview
Craig Willis
![Page 20: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/20.jpg)
Credits
Ryan Scherle (NESCent)
José Ramón Pérez Agüera (UNC)
Lina Huang (UNC)
Duane Costa (LTER)
Alyona Medelyan & Ian Witten (Univ. of Waikato/NZDL)
![Page 21: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/21.jpg)
HIVE Technical Overview
• HIVE combines several open-source technologies to provide a framework for vocabulary services.
• Java-based web services can run in any Java application server
• Demonstration website (http://hive.nescent.org/)
• Open-source Google Code project (http://code.google.com/p/hive-mrc/)
• Source code, pre-compiled releases, documentation, mailing lists
![Page 22: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/22.jpg)
Who’s using HIVE?
HIVE is being evaluated by several institutions and organizations:
• Long Term Ecological Research Network (LTER)
  – Prototype for keyword suggestion for Ecological Metadata Language (EML) documents.
• Library of Congress Web Archives (Minerva)
  – Evaluating HIVE for automatic LCSH subject heading suggestion for web archives.
• Dryad Data Repository
  – Evaluating HIVE for suggestion of controlled terms during the submission and curation process (scientific name, spatial coverage, temporal coverage, keywords).
• Yale University, Smithsonian Institution Archives
![Page 23: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/23.jpg)
HIVE Functions
• System for management of multiple controlled vocabularies in SKOS format
• Single interface for browsing, searching, and indexing using multiple vocabularies.
• Natural language and structured (SPARQL) queries
• Rich internet application (RIA) demonstration interface
• Java API and REST interfaces for programmatic access
• Framework for conversion of vocabularies to SKOS
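The idea of a single search interface over several SKOS vocabularies can be sketched as follows. This is an illustrative in-memory model only, not the HIVE Java API: the `Concept` and `VocabularyRegistry` names, the example.org URIs, and the sample concepts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """A pared-down SKOS concept: one preferred label, optional alternates."""
    uri: str
    pref_label: str
    alt_labels: list = field(default_factory=list)
    broader: list = field(default_factory=list)  # URIs of broader concepts


class VocabularyRegistry:
    """Hypothetical registry that holds several vocabularies side by side."""

    def __init__(self):
        self._vocabularies = {}  # vocabulary name -> {uri: Concept}

    def add(self, vocab_name, concept):
        self._vocabularies.setdefault(vocab_name, {})[concept.uri] = concept

    def search(self, keyword):
        """Case-insensitive substring search over all labels in all vocabularies."""
        needle = keyword.lower()
        hits = []
        for vocab_name, concepts in self._vocabularies.items():
            for concept in concepts.values():
                labels = [concept.pref_label] + concept.alt_labels
                if any(needle in label.lower() for label in labels):
                    hits.append((vocab_name, concept.pref_label))
        return sorted(hits)


# Hypothetical sample concepts; the URIs are placeholders.
registry = VocabularyRegistry()
registry.add("LCSH", Concept("http://example.org/lcsh/1", "Seeds", ["Seed"]))
registry.add("NBII", Concept("http://example.org/nbii/7", "Seed dispersal"))
registry.add("NBII", Concept("http://example.org/nbii/9", "Pollination"))

results = registry.search("seed")
print(results)
```

Because every vocabulary is reduced to the same SKOS-shaped record, one query covers all of them and each hit still reports which vocabulary it came from.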
![Page 24: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/24.jpg)
HIVE Components
• HIVE Core API
  – Java API for management of HIVE vocabularies.
• HIVE Web Service
  – Google Web Toolkit (GWT) based interface to demonstrate the HIVE service. Includes Concept Browser and Indexer.
• HIVE REST API
  – RESTful API developed by Duane Costa of the Long Term Ecological Research Network (LTER)
![Page 25: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/25.jpg)
Supporting Technologies
• Sesame: Open-source triple store and framework for storing and querying RDF data
  – Used for primary storage, structured queries
• Lucene: Java-based full-text search engine
  – Used for keyword searching, autocomplete (version 2.0)
• H2: Embedded relational database
  – Stores administrative data, fast concept index, KEA++ lookup tables.
• KEA++: Algorithm and Java API for automatic indexing
![Page 26: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/26.jpg)
Architecture
![Page 27: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/27.jpg)
Converting Vocabularies to SKOS
“We learned that some thesauri have complex structures for which no SKOS counterparts can be found and that for some features care is required in converting them in such a way that they are still usable for their original purpose.”
Van Assem, Mark. (2010). Converting and Integrating Vocabularies for the Semantic Web. Unpublished dissertation.
![Page 28: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/28.jpg)
Converting Vocabularies to SKOS
• SKOS does not fit all vocabularies/thesauri
• For example, MeSH
  – Is a MeSH descriptor a SKOS Concept? “A Method to Convert Thesauri to SKOS” (van Assem et al.) http://thesauri.cs.vu.nl/eswc06/
  – Or is a MeSH concept a SKOS concept? “Converting MeSH to SKOS for HIVE” http://code.google.com/p/hive-mrc/wiki/MeshToSKOS
• Either way, information is lost about the vocabulary
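The two mapping choices can be compared with a small sketch. It relies on the MeSH record structure in which a descriptor groups one or more concepts and each concept groups one or more terms. The record below is invented, the field names are hypothetical, and neither function reproduces the converters cited above; they only show what each choice discards.

```python
# Invented descriptor record; identifiers and labels are placeholders.
descriptor = {
    "ui": "D-EXAMPLE",
    "name": "Alpha",
    "concepts": [
        {"ui": "M-1", "terms": ["Alpha", "Alpha syndrome"]},
        {"ui": "M-2", "terms": ["Alpha variant", "Variant alpha"]},
    ],
}


def descriptor_as_concept(desc):
    """Option 1: one SKOS concept per MeSH descriptor.

    Every term of every MeSH concept becomes a label, so the grouping of
    terms into MeSH concepts is lost.
    """
    all_terms = [t for c in desc["concepts"] for t in c["terms"]]
    return {
        "id": desc["ui"],
        "prefLabel": desc["name"],
        "altLabels": sorted(t for t in all_terms if t != desc["name"]),
    }


def concepts_as_concepts(desc):
    """Option 2: one SKOS concept per MeSH concept.

    The first term is taken as the preferred label (an assumption of this
    sketch). The descriptor as a unit no longer appears in the output.
    """
    return [
        {"id": c["ui"], "prefLabel": c["terms"][0], "altLabels": c["terms"][1:]}
        for c in desc["concepts"]
    ]


option_1 = descriptor_as_concept(descriptor)
option_2 = concepts_as_concepts(descriptor)
print(option_1)
print(option_2)
```

In the first output nothing records that "Variant alpha" belonged to a narrower MeSH concept than "Alpha syndrome"; in the second, nothing records that the two concepts were indexed under one descriptor.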
![Page 29: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/29.jpg)
Converting Vocabularies to SKOS
• Additional information: http://code.google.com/p/hive-mrc/wiki/VocabularyConversion
• Each vocabulary has different requirements
  – AGROVOC: Available in SKOS
  – ITIS: Convert from RDB (MySQL) to SKOS RDF/XML
  – LCSH: Available in SKOS
  – MeSH: Convert from XML to SKOS RDF/XML (SAX)
  – NBII: Convert from XML to SKOS RDF/XML (SAX)
  – TGN: Convert from flat-file to SKOS RDF/XML
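An XML-to-SKOS conversion driven by a SAX parser might look roughly like the sketch below. The input schema (`thesaurus`, `term`, `name`, `synonym`, `broader`), the base URI, and the sample terms are all hypothetical; the real MeSH and NBII source files use their own schemas, and the HIVE converters are written in Java.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

BASE = "http://example.org/vocab/"  # hypothetical namespace for concept URIs

# Hypothetical source format, invented for this sketch.
SOURCE_XML = """<thesaurus>
  <term id="t0"><name>Ecosystems</name></term>
  <term id="t1"><name>Wetlands</name><synonym>Marshes</synonym><broader>t0</broader></term>
</thesaurus>"""


class TermHandler(xml.sax.ContentHandler):
    """Collects one dict per <term> element while the parser streams the file."""

    def __init__(self):
        super().__init__()
        self.terms = []
        self._current = None
        self._text = []

    def startElement(self, name, attrs):
        self._text = []
        if name == "term":
            self._current = {"id": attrs["id"], "name": "", "synonyms": [], "broader": []}

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        text = "".join(self._text).strip()
        self._text = []
        if self._current is None:
            return
        if name == "name":
            self._current["name"] = text
        elif name == "synonym":
            self._current["synonyms"].append(text)
        elif name == "broader":
            self._current["broader"].append(text)
        elif name == "term":
            self.terms.append(self._current)
            self._current = None


def to_skos_rdfxml(terms):
    """Serialize the collected terms as SKOS RDF/XML."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"',
        '         xmlns:skos="http://www.w3.org/2004/02/skos/core#">',
    ]
    for t in terms:
        lines.append("  <skos:Concept rdf:about=%s>" % quoteattr(BASE + t["id"]))
        lines.append('    <skos:prefLabel xml:lang="en">%s</skos:prefLabel>' % escape(t["name"]))
        for s in t["synonyms"]:
            lines.append('    <skos:altLabel xml:lang="en">%s</skos:altLabel>' % escape(s))
        for b in t["broader"]:
            lines.append("    <skos:broader rdf:resource=%s/>" % quoteattr(BASE + b))
        lines.append("  </skos:Concept>")
    lines.append("</rdf:RDF>")
    return "\n".join(lines)


handler = TermHandler()
xml.sax.parseString(SOURCE_XML.encode("utf-8"), handler)
rdf = to_skos_rdfxml(handler.terms)
print(rdf)
```

A streaming parser suits this job because large thesaurus files never have to be held in memory as a full tree; only the term currently being read is kept.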
![Page 30: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/30.jpg)
KEA++ for Keyphrase Extraction
• Algorithm and open-source Java library for extracting keyphrases from documents using SKOS vocabularies.
• Domain-independent machine learning approach with minimal training set (~50 documents).
• Leverages SKOS relationships and alternate/preferred labels
• Developed by Alyona Medelyan (KEA++), based on earlier work by Ian Witten (KEA), University of Waikato, New Zealand (http://www.nzdl.org/Kea/)
• (Expanded implementation in Medelyan’s MAUI)
Medelyan, O. and Witten, I.H. (2008). “Domain-independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7): 1026-1040.
![Page 31: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/31.jpg)
KEA++: Feature definition
• Term Frequency/Inverse Document Frequency: Compares the frequency of a phrase’s occurrence in a document with its frequency in general use.
• Position of first occurrence: Distance from the beginning of the document. Candidates with high/low values are more likely to be valid (introduction/conclusion)
• Phrase length: Analysis suggests that indexers prefer to assign two-word descriptors
• Node degree: Number of relationships connecting the term to other terms in the CV.
• (MAUI expands feature set)
Medelyan, O. and Witten, I.H. (2008). “Domain-independent automatic keyphrase indexing with small training sets.” Journal of the American Society for Information Science and Technology, 59(7): 1026-1040.
Medelyan, O. (2010). Human-competitive automatic topic indexing. Unpublished dissertation.
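The four features can be made concrete with a minimal sketch. The formulas are one common formulation and are assumptions here, not the KEA++ source: TF×IDF as (occurrences / document length) × −log2(document frequency / corpus size), first occurrence as the fraction of the document preceding the phrase, and node degree as the number of vocabulary relationships that link the phrase to other candidates found in the same document. The sample document, counts, and relationships are hypothetical.

```python
import math


def tokenize(text):
    """Naive whitespace tokenizer; real systems also strip punctuation and stem."""
    return text.lower().split()


def find_positions(words, phrase_words):
    """Word offsets at which the phrase starts in the document."""
    n = len(phrase_words)
    return [i for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words]


def candidate_features(phrase, document, doc_freq, num_docs, related, candidates):
    """Compute the four features for one candidate phrase.

    doc_freq   -- number of training documents containing the phrase (assumed known)
    num_docs   -- size of the training corpus
    related    -- dict: phrase -> related vocabulary terms (hypothetical CV links)
    candidates -- set of all candidate phrases found in this document
    """
    words = tokenize(document)
    phrase_words = tokenize(phrase)
    positions = find_positions(words, phrase_words)
    if not positions:
        return None
    tf = len(positions) / len(words)
    idf = -math.log2(doc_freq / num_docs)
    return {
        "tfidf": tf * idf,
        "first_occurrence": positions[0] / len(words),
        "length": len(phrase_words),
        "node_degree": sum(1 for r in related.get(phrase, ()) if r in candidates),
    }


# Hypothetical inputs for illustration only.
document = "seed dispersal by birds shapes forest structure and seed dispersal"
related = {"seed dispersal": ["birds", "pollination"]}
candidates = {"seed dispersal", "birds", "forest structure"}

features = candidate_features(
    "seed dispersal", document, doc_freq=2, num_docs=8,
    related=related, candidates=candidates,
)
print(features)
```

In a full pipeline these feature vectors, computed for candidates in the training documents, would feed a classifier that learns which combinations indexers tend to accept.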
![Page 32: Helping Interdisciplinary Vocabulary Engineering (HIVE) OCTOBER 31, 2011 Joan Boone Nico Carver Jane Greenberg Lina Huang Robert Losee Mady Madhura José](https://reader036.vdocuments.net/reader036/viewer/2022062301/56649d1a5503460f949ef1c7/html5/thumbnails/32.jpg)
HIVE – Upcoming
• Vocabulary synchronization
  – Integration of HIVE with LCSH Atom Feed (http://id.loc.gov/authorities/feed/)
• Integration and evaluation of alternative algorithms
  – As part of the Dryad/HIVE integration
• Questions:
  – What is the best algorithm for automatic term suggestion for Dryad vocabularies?
  – Do different algorithms perform better for title, abstract, full-text, data?
  – Do different algorithms perform better for a particular vocabulary/taxonomy/ontology?
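One way such questions could be approached is to score each algorithm's suggested terms against curator-assigned terms. The sketch below uses precision, recall, and F1 as the measure; that choice is an assumption, as the slides do not say how the evaluation will be done, and the algorithm names and term lists are hypothetical.

```python
def precision_recall_f1(suggested, assigned):
    """Compare suggested terms with the terms a curator actually assigned."""
    suggested, assigned = set(suggested), set(assigned)
    hits = len(suggested & assigned)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(assigned) if assigned else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical curator-assigned terms for one data package.
assigned = ["seed dispersal", "frugivory", "tropical forests", "birds"]

# Hypothetical output of two term-suggestion algorithms for the same input.
suggestions = {
    "algorithm_A": ["seed dispersal", "birds", "climate", "soil"],
    "algorithm_B": ["seed dispersal", "birds", "frugivory"],
}

scores = {name: precision_recall_f1(terms, assigned) for name, terms in suggestions.items()}
best = max(scores, key=lambda name: scores[name][2])  # rank by F1

for name, (p, r, f) in scores.items():
    print(name, round(p, 3), round(r, 3), round(f, 3))
print("best by F1:", best)
```

Repeating the comparison with suggestions generated from titles, abstracts, or full text, and separately for each vocabulary, would address the second and third questions with the same scoring function.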