Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction
Riza Batista-Navarro, William Ulate, Jennifer Hammock, Georgios Kontonatsios, Trish
Rose-Sandler and Sophia Ananiadou
Structured Data
Text Mining
http://miningbiodiversity.org
The partners
Social Media Lab
10/9/2015 Mining Biodiversity
Mining Biodiversity
• Transform BHL into a next-generation social digital library
• A multi-disciplinary approach
– Text Mining
– Machine learning
– History of Science
– Environmental History & Studies
– Library and Information Science
– Social Media
What do we want to do?
Social Media
Visualisation
Semantic
Metadata
Biodiversity Heritage Library
• a consortium of botanical and natural history libraries
• stores digitised legacy literature on biodiversity
• currently holds 160,000 volumes = millions of pages (PDFs and OCR-generated text)
• open-access
Current features
• supports keyword-based search
• species names annotated and linked to the Encyclopedia of Life
• integrates automatic taxonomic name finding tools (uBio TaxonFinder)
• data access through export functionalities and Web services
Keyword-based search and Browsing
Advanced search (also keyword-based)
What’s wrong with keyword-based search?
• Ambiguity!
Boxwood
historic place in Alabama?
North American term for plants in the Buxaceae family?
Box
container?
term for boxwood in other English-speaking countries?
What’s wrong with keyword-based search?
• Ambiguity!
California bay
hardwood tree?
location?
Drum
musical instrument?
fish?
What’s wrong with keyword-based search?
• Ambiguity!
Emperor
fish?
person?
Scrambled eggs
food?
plant?
Semantic metadata generation
• Entity types
– species
– location
– habitat
– anatomical parts
– qualities
– persons
– temporal expressions
• Association types
– observation
– habitation
– nutrition
– trait
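The entity and association types listed above could be represented as a small data model. The following is a minimal sketch, not the project's actual schema: the class names, the field names and the example sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    SPECIES = "species"
    LOCATION = "location"
    HABITAT = "habitat"
    ANATOMICAL_PART = "anatomical part"
    QUALITY = "quality"
    PERSON = "person"
    TEMPORAL_EXPRESSION = "temporal expression"


class AssociationType(Enum):
    OBSERVATION = "observation"
    HABITATION = "habitation"
    NUTRITION = "nutrition"
    TRAIT = "trait"


@dataclass(frozen=True)
class Entity:
    type: EntityType
    text: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


@dataclass(frozen=True)
class Association:
    type: AssociationType
    arguments: tuple  # the entities that take part in the association


# Invented example sentence, used only to show how the pieces fit together.
sentence = "The California bay grows in moist canyons."
species = Entity(EntityType.SPECIES, "California bay", 4, 18)
habitat = Entity(EntityType.HABITAT, "moist canyons", 28, 41)
habitation = Association(AssociationType.HABITATION, (species, habitat))
```

Storing character offsets alongside the type keeps each annotation tied to an exact span of the OCR text, so the same string can carry different types in different places.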
Examples of semantic metadata (annotations)
• Observation
• Habitation
Examples of semantic metadata (annotations)
• Nutrition
• Trait
How does semantic information help?
SPECIES: California bay (hardwood tree)
LOCATION: California bay (location)
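A toy comparison may make the difference concrete. In this sketch the annotation records, document ids and function names are hypothetical; it only illustrates how an entity type attached to a mention lets a query separate the two senses of "California bay".

```python
# Hypothetical annotated mentions: (document id, surface text, entity type).
annotations = [
    ("doc1", "California bay", "SPECIES"),
    ("doc2", "California bay", "LOCATION"),
    ("doc3", "drum", "SPECIES"),
]


def keyword_search(term):
    """Return every document mentioning the term, whatever its sense."""
    return sorted({doc for doc, text, _ in annotations
                   if text.lower() == term.lower()})


def semantic_search(term, entity_type):
    """Return only documents where the term was annotated with the given type."""
    return sorted({doc for doc, text, etype in annotations
                   if text.lower() == term.lower() and etype == entity_type})


all_hits = keyword_search("California bay")              # both senses mixed
tree_hits = semantic_search("California bay", "SPECIES")  # the tree only
place_hits = semantic_search("California bay", "LOCATION")  # the place only
```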
Text mining-based approach
Seed documents
Unlabelled documents
Learn semantics
Annotator/Curator
Validate
Feedback
Annotate
Search index
Store
Annotate
Automatic annotation by text mining (TM)
– Web-based, graphical TM workbench
– conforms with the Unstructured Information Management Architecture (UIMA) standard
– facilitates the straightforward integration of various analytics into workflows
– allows for the validation of annotations
Argo interface
Learning semantics
• Training of models using machine learning
– conditional random fields (CRFs) for sequence labelling
– learning the features of mentions and relations of interest based on labelled documents
• contextual features: surrounding, co-occurring words
• dictionary matches: presence of certain words in controlled vocabularies, e.g., Catalogue of Life, Phenotype and Trait Ontology, Gazetteer
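The two feature families named above (contextual words and dictionary matches) can be sketched as a per-token feature function of the kind usually fed to a CRF toolkit. This is an illustration under assumptions: the feature names, the function names and the two-word vocabulary are invented, and no CRF is trained here.

```python
def token_features(tokens, i, dictionary):
    """Build a feature dict for tokens[i] from its context and a dictionary."""
    token = tokens[i]
    features = {
        "word.lower": token.lower(),
        "word.istitle": token.istitle(),
        # Dictionary match: is the token listed in the controlled vocabulary?
        "in_dictionary": token.lower() in dictionary,
    }
    # Contextual features: the neighbouring words, or sentence-boundary flags.
    if i > 0:
        features["prev.lower"] = tokens[i - 1].lower()
    else:
        features["BOS"] = True
    if i < len(tokens) - 1:
        features["next.lower"] = tokens[i + 1].lower()
    else:
        features["EOS"] = True
    return features


def sentence_features(tokens, dictionary):
    """Feature dicts for every token of one sentence, in order."""
    return [token_features(tokens, i, dictionary) for i in range(len(tokens))]


# Toy stand-in for a controlled vocabulary such as a species name list.
species_terms = {"umbellularia", "californica"}
tokens = ["Umbellularia", "californica", "grows", "near", "streams"]
features = sentence_features(tokens, species_terms)
```

A sequence labeller would pair each feature dict with a label (for example a begin/inside/outside tag per entity type) taken from the labelled seed documents.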
Argo interface
Annotation workflow
Pre-processing
Dictionary lookup
Machine learning-based recognition
Relation extraction
Saving
Validation interface
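The workflow steps above can be pictured as functions chained one after another. The sketch below is plain Python, not an Argo or UIMA workflow: the dictionaries, the function names and the pairing rule for relations are assumptions, and the machine learning-based recognition and validation steps are left out.

```python
import re

# Toy dictionaries standing in for controlled vocabularies (assumed entries).
DICTIONARIES = {
    "SPECIES": ["california bay"],
    "HABITAT": ["moist canyons"],
}


def preprocess(text):
    """Collapse whitespace, as a stand-in for OCR clean-up and tokenisation."""
    return re.sub(r"\s+", " ", text).strip()


def dictionary_lookup(text):
    """Find case-insensitive dictionary matches and record their offsets."""
    found = []
    lowered = text.lower()
    for entity_type, terms in DICTIONARIES.items():
        for term in terms:
            for match in re.finditer(re.escape(term), lowered):
                found.append({
                    "type": entity_type,
                    "text": text[match.start():match.end()],
                    "start": match.start(),
                    "end": match.end(),
                })
    return sorted(found, key=lambda entity: entity["start"])


def extract_relations(entities):
    """Naive rule: pair every species with every habitat in the same text."""
    species = [e for e in entities if e["type"] == "SPECIES"]
    habitats = [e for e in entities if e["type"] == "HABITAT"]
    return [{"type": "habitation", "species": s["text"], "habitat": h["text"]}
            for s in species for h in habitats]


def run_workflow(text):
    """Run the steps in order and return a record ready to be saved."""
    clean = preprocess(text)
    entities = dictionary_lookup(clean)
    relations = extract_relations(entities)
    return {"text": clean, "entities": entities, "relations": relations}


result = run_workflow("The  California bay grows\nin moist canyons.")
```

Keeping each step as a separate function mirrors the idea of swappable workflow components: a learned recogniser could replace or follow `dictionary_lookup` without touching the other steps.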
Enhanced searching of BHL content
Faceted search
Automatically generated questions
Time-sensitive search
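Faceted and time-sensitive search can be illustrated over a handful of index records. Everything in this sketch is assumed: the records are invented, the field names are hypothetical, and "time-sensitive" is read here simply as filtering on a year attached to each record.

```python
from collections import Counter

# Invented index records, one per annotated document.
records = [
    {"doc": "doc1", "species": "California bay", "location": "Oregon", "year": 1890},
    {"doc": "doc2", "species": "California bay", "location": "California", "year": 1912},
    {"doc": "doc3", "species": "Emperor", "location": "California", "year": 1905},
]


def search(species=None, location=None, year_from=None, year_to=None):
    """Return records matching every filter that was supplied."""
    hits = []
    for record in records:
        if species is not None and record["species"] != species:
            continue
        if location is not None and record["location"] != location:
            continue
        if year_from is not None and record["year"] < year_from:
            continue
        if year_to is not None and record["year"] > year_to:
            continue
        hits.append(record)
    return hits


def facet_counts(hits, field):
    """Count how many hits fall under each value of a facet field."""
    return Counter(hit[field] for hit in hits)


bay_hits = search(species="California bay")
location_facet = facet_counts(bay_hits, "location")
period_hits = search(location="California", year_from=1900, year_to=1910)
```

The facet counts are what a search interface would display next to the results so that a user can narrow a query by clicking a value.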
Enhanced document viewing
Page in PDF/image format
OCR-corrected text with colour-coded annotations
Conclusions
• Literature is a rich source of information but difficult to search
• Keyword-based search is not enough to address ambiguity
• Semantic metadata allows for more accurate searching
• Semantic metadata can be extracted using text mining tools
• The Argo text mining workbench facilitates the construction of custom semantic metadata generation workflows