Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction

Riza Batista-Navarro, William Ulate, Jennifer Hammock, Georgios Kontonatsios, Trish Rose-Sandler and Sophia Ananiadou

Upload: william-ulate

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction

Unlocking knowledge in biodiversity legacy literature through automatic semantic metadata extraction

Riza Batista-Navarro, William Ulate, Jennifer Hammock, Georgios Kontonatsios, Trish Rose-Sandler and Sophia Ananiadou

Page 2

Structured Data

? Text Mining

Page 3

http://miningbiodiversity.org

Page 4

The partners

Social Media Lab

10/9/2015 Mining Biodiversity

Page 5

Mining Biodiversity

• Transform BHL into a next-generation social digital library

• A multi-disciplinary approach

– Text Mining

– Machine learning

– History of Science

– Environmental History & Studies

– Library and Information Science

– Social Media

Page 6

What do we want to do?

Social Media

Visualisation

Semantic Metadata

Page 7

Biodiversity Heritage Library

• a consortium of botanical and natural history libraries

• stores digitised legacy literature on biodiversity

• currently holds 160,000 volumes, amounting to millions of pages (PDFs and OCR-generated text)

• open-access

Page 8

Current features

• supports keyword-based search

• species names annotated and linked to the Encyclopedia of Life

• integrates automatic taxonomic name finding tools (uBio TaxonFinder)

• data access through export functionalities and Web services

Page 9

Keyword-based search and browsing

Page 10

Advanced search (also keyword-based)

Page 11

What’s wrong with keyword-based search?

• Ambiguity!

Boxwood

historic place in Alabama?

North American term for plants in the Buxaceae family?

Box

container?

term for boxwood in other English-speaking countries?

Page 12

What’s wrong with keyword-based search?

• Ambiguity!

California bay

hardwood tree?

location?

Drum

musical instrument?

fish?

Page 13

What’s wrong with keyword-based search?

• Ambiguity!

Emperor

fish?

person?

Scrambled eggs

food?

plant?

Page 14

Semantic metadata generation

• Entity types

– species

– location

– habitat

– anatomical parts

– qualities

– persons

– temporal expressions

• Association types

– observation

– habitation

– nutrition

– trait
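One way to picture these entity and association types as data is sketched below. The type names come from the slide; the class names, field names, character offsets and the example sentence are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

# Type names follow the slide; everything else here is hypothetical.
ENTITY_TYPES = {"species", "location", "habitat", "anatomical part",
                "quality", "person", "temporal expression"}
ASSOCIATION_TYPES = {"observation", "habitation", "nutrition", "trait"}


@dataclass(frozen=True)
class Entity:
    type: str
    text: str
    start: int  # character offset where the mention begins
    end: int    # offset one past the last character of the mention

    def __post_init__(self):
        if self.type not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.type}")


@dataclass(frozen=True)
class Association:
    type: str
    participants: tuple  # the Entity objects linked by this association

    def __post_init__(self):
        if self.type not in ASSOCIATION_TYPES:
            raise ValueError(f"unknown association type: {self.type}")


# Made-up sentence used only to show the shape of an annotation.
sentence = "The California bay grows in coastal forests."
species = Entity("species", "California bay", 4, 18)
habitat = Entity("habitat", "coastal forests", 28, 43)
habitation = Association("habitation", (species, habitat))
```

Storing offsets alongside the mention text keeps each annotation tied to a position in the page text, which is what colour-coded display and validation would need.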

Page 15

Examples of semantic metadata (annotations)

• Observation

• Habitation

Page 16

Examples of semantic metadata (annotations)

• Nutrition

• Trait

Page 17

How does semantic information help?

SPECIES: California bay (hardwood tree)

LOCATION: California bay (location)
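The difference between keyword search and search over typed annotations can be sketched as follows. The two pages, their annotations and the function names are invented for illustration and do not reflect how BHL indexes its content.

```python
from collections import defaultdict

# Hypothetical annotated pages: (page id, text, [(entity type, mention)]).
pages = [
    ("p1", "California bay is a hardwood tree.",
     [("species", "California bay")]),
    ("p2", "Specimens were collected near California bay.",
     [("location", "California bay")]),
]


def keyword_search(query):
    """Return every page whose text contains the query string."""
    return [pid for pid, text, _ in pages if query.lower() in text.lower()]


# Index keyed on (entity type, mention) rather than on the bare string.
index = defaultdict(set)
for pid, _, annotations in pages:
    for etype, mention in annotations:
        index[(etype, mention.lower())].add(pid)


def semantic_search(etype, mention):
    """Return only pages where the mention was annotated with this type."""
    return sorted(index.get((etype, mention.lower()), set()))


print(keyword_search("California bay"))              # both pages match
print(semantic_search("species", "California bay"))  # only the tree page
```

Keying the index on the pair of type and mention is what separates the tree from the place; the keyword search has no way to tell them apart.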

Page 18

Text mining-based approach

Seed documents

Unlabelled documents

Learn semantics

Annotator/Curator

Validate

Feedback

Annotate

Search index

Store

Annotate

Page 19

Automatic annotation by text mining (TM)

– Argo: a Web-based, graphical TM workbench

– conforms to the Unstructured Information Management Architecture (UIMA) standard

– facilitates the straightforward integration of various analytics into workflows

– allows for the validation of annotations

Page 20

Argo interface

Page 21

Learning semantics

• Training of models using machine learning

– conditional random fields (CRFs) for sequence labelling

– learning the features of mentions and relations of interest based on labelled documents

• contextual features: surrounding, co-occurring words

• dictionary matches: presence of certain words in controlled vocabularies, e.g., Catalogue of Life, Phenotype and Trait Ontology, Gazetteer
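The two feature families above can be sketched as a per-token feature function of the kind CRF toolkits typically consume. This is a minimal illustration, not the project's feature set: the function name, feature keys, window size and the tiny stand-in vocabularies are all assumptions, and real dictionary matching would also handle multi-word names.

```python
def token_features(tokens, i, dictionaries, window=2):
    """Build a feature dictionary for the token at position i."""
    token = tokens[i]
    features = {
        "word.lower": token.lower(),
        "word.istitle": token.istitle(),
    }
    # Contextual features: surrounding words within the window.
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        if 0 <= j < len(tokens):
            features[f"word[{offset:+d}]"] = tokens[j].lower()
    # Dictionary matches: is the token present in a controlled vocabulary?
    for name, vocabulary in dictionaries.items():
        features[f"in_dict.{name}"] = token.lower() in vocabulary
    return features


# Toy stand-ins for resources such as the Catalogue of Life or a gazetteer.
dictionaries = {
    "species": {"umbellularia", "californica"},
    "gazetteer": {"california", "alabama"},
}
tokens = "Umbellularia californica grows in California".split()
features = [token_features(tokens, i, dictionaries)
            for i in range(len(tokens))]
```

A sequence labeller would be trained on such feature dictionaries paired with one label per token from the labelled documents.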

Page 22

Argo interface

Page 23

Annotation workflow

Pre-processing

Dictionary lookup

Machine learning-based recognition

Relation extraction

Saving
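The five workflow stages can be sketched as a chain of functions. Every stage here is a toy stand-in: the function names, the vocabulary, the four-digit rule that replaces a trained model, and the co-occurrence rule for relations are assumptions made for illustration, not the components used in the actual workflow.

```python
import re


def preprocess(doc):
    # Tokenise into words and punctuation marks.
    doc["tokens"] = re.findall(r"\w+|[^\w\s]", doc["text"])
    return doc


def dictionary_lookup(doc, vocab):
    # Tag tokens that appear in a controlled vocabulary.
    doc["entities"] = [(t, vocab[t.lower()])
                       for t in doc["tokens"] if t.lower() in vocab]
    return doc


def ml_recognition(doc):
    # Stand-in for a trained model: a rule that tags four-digit tokens.
    doc["entities"] += [(t, "temporal expression")
                        for t in doc["tokens"] if re.fullmatch(r"\d{4}", t)]
    return doc


def relation_extraction(doc):
    # Naive rule: link every species to every location in the document.
    species = [t for t, kind in doc["entities"] if kind == "species"]
    locations = [t for t, kind in doc["entities"] if kind == "location"]
    doc["relations"] = [("observation", s, loc)
                        for s in species for loc in locations]
    return doc


def save(doc, store):
    store.append(doc)
    return doc


def run_workflow(text, vocab, store):
    doc = {"text": text}
    doc = preprocess(doc)
    doc = dictionary_lookup(doc, vocab)
    doc = ml_recognition(doc)
    doc = relation_extraction(doc)
    return save(doc, store)


store = []
vocab = {"quercus": "species", "alabama": "location"}  # toy vocabulary
result = run_workflow("Quercus was observed in Alabama in 1887.", vocab, store)
```

Because each stage takes a document and returns it enriched, a stage can be swapped for a stronger component without touching the others.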

Page 24

Validation interface

Page 25

Enhanced searching of BHL content

Faceted search

Automatically generated questions

Time-sensitive search
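Faceted and time-sensitive search over annotated documents might look like the sketch below. The documents, the field names and the reading of "time-sensitive" as a year-range filter are assumptions made for illustration only.

```python
from collections import Counter

# Hypothetical annotated documents; "year" is an assumed field.
documents = [
    {"id": "d1", "year": 1887,
     "entities": [("species", "Quercus alba"), ("location", "Alabama")]},
    {"id": "d2", "year": 1902,
     "entities": [("species", "Quercus alba"), ("location", "California")]},
    {"id": "d3", "year": 1910,
     "entities": [("species", "Buxus sempervirens"),
                  ("location", "California")]},
]


def filter_docs(docs, selected=(), year_range=None):
    """Keep documents carrying every selected facet value within the years."""
    result = []
    for doc in docs:
        if any(pair not in doc["entities"] for pair in selected):
            continue
        if year_range and not (year_range[0] <= doc["year"] <= year_range[1]):
            continue
        result.append(doc)
    return result


def facet_counts(docs, etype):
    """Count how often each value of one entity type occurs in the results."""
    return Counter(value for doc in docs
                   for kind, value in doc["entities"] if kind == etype)


californian = filter_docs(documents, selected=[("location", "California")])
species_facet = facet_counts(californian, "species")
early_1900s = filter_docs(documents, selected=[("location", "California")],
                          year_range=(1900, 1905))
```

The facet counts are what a search interface would show next to each filter option, so that users can narrow results by entity type rather than by keyword.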

Page 26

Enhanced document viewing

Page in PDF/image format

OCR-corrected text with colour-coded annotations

Page 27

Conclusions

• Literature is a rich source of information but difficult to search

• Keyword-based search is not enough to address ambiguity

• Semantic metadata allows for more accurate searching

• Semantic metadata can be extracted using text mining tools

• The Argo text mining workbench facilitates the construction of custom semantic metadata generation workflows