Transcript
Page 1: Ontology Engineering approaches based on semi-automated curation of the primary literature

Ontology Engineering approaches based on semi-automated curation of the primary literature

Gully APC Burns, Tommy Ingulfsen, Donghui Feng and Ed Hovy

Biomedical Knowledge Engineering Group, Information Sciences Institute, University of Southern California

Page 2: Ontology Engineering approaches based on semi-automated curation of the primary literature

Where’s all the knowledge?

Image taken from U.S. Geological Survey Energy Resource Surveys Program

The primary research literature...
• is the end-product of all scientific research
• forms the basis for human understanding of the subject
• is written in natural language
• is structured
• is interpretable
• is expensive
• is terse

Page 3: Ontology Engineering approaches based on semi-automated curation of the primary literature

Precision and imprecision in biological representation

Conceptual model (high-level concepts): ‘Stress’, ‘energy balance’, ‘homeostasis’, ‘glucoprivation’

Assay: define model system (independent variables): 2-deoxyglucose (2DG) administered intravenously to rats, look for activation in ‘stress-responsive’ neurons

Experiment: perform measurements (dependent variables): MAP-K and pERK activate in neurons in PVH, BST and CEAl

Scale: Imprecise (conceptual model) to Precise (measurements)

Page 4: Ontology Engineering approaches based on semi-automated curation of the primary literature

Partitioning the literature

Page 5: Ontology Engineering approaches based on semi-automated curation of the primary literature

The problem with knowledge: an over-abundance of data

Page 6: Ontology Engineering approaches based on semi-automated curation of the primary literature

Corpus Preparation for Natural Language Processing

The Journal of Comparative Neurology is the foremost international journal for neuroanatomy. We downloaded ~12,000 PDFs in total from 1970-2005.

We preprocessed papers with consistent formatting from vols. 204-490 (1982-2005), providing a corpus of 9,474 PDF files. This corpus contains 99,094,318 words.

Page 7: Ontology Engineering approaches based on semi-automated curation of the primary literature

Active Learning / Information Extraction Methodology

Page 8: Ontology Engineering approaches based on semi-automated curation of the primary literature

The logical structure of a tract-tracing experiment

Tracer Chemical [1]

Injection Site [1]
• Location: brain structure, topography, side

Labeled region [1...*]
• Location: brain structure, topography, ipsi-contra relative to injection site?
• Label type: ‘anterograde’, ‘retrograde’
• Label density
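The structure on this slide can be sketched as a small set of record types. This is only an illustrative reading of the diagram, not the authors' implementation: the class names, field names, and the placeholder values in the usage lines are all assumptions, and the cardinalities ([1] and [1...*]) are enforced with simple checks.

```python
from dataclasses import dataclass
from typing import List, Optional

LABEL_TYPES = ("anterograde", "retrograde")  # the two values shown on the slide


@dataclass
class Location:
    """A place in the brain (hypothetical field names)."""
    brain_structure: str
    topography: Optional[str] = None
    # For an injection site: the side of the brain.
    # For a labeled region: 'ipsi' or 'contra' relative to the injection site.
    side: Optional[str] = None


@dataclass
class LabeledRegion:
    location: Location
    label_type: str
    label_density: Optional[str] = None

    def __post_init__(self) -> None:
        if self.label_type not in LABEL_TYPES:
            raise ValueError(f"label_type must be one of {LABEL_TYPES}")


@dataclass
class TractTracingExperiment:
    tracer_chemical: str                   # cardinality [1]
    injection_site: Location               # cardinality [1]
    labeled_regions: List[LabeledRegion]   # cardinality [1...*]

    def __post_init__(self) -> None:
        if not self.labeled_regions:
            raise ValueError("an experiment needs at least one labeled region")


# Placeholder values only; they do not come from any paper.
example = TractTracingExperiment(
    tracer_chemical="tracer-X",
    injection_site=Location("region-A", side="left"),
    labeled_regions=[
        LabeledRegion(Location("region-B", side="ipsi"), "anterograde", "dense"),
    ],
)
print(len(example.labeled_regions))
```

Keeping the labeled regions as a list under a single injection mirrors the one-to-many relation in the diagram.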

Page 9: Ontology Engineering approaches based on semi-automated curation of the primary literature

Annotated XML Example from Albanese & Minciacchi, 1983, JCN 216:406-420

expt. labeldelineation injectionlabelingdescription

Page 10: Ontology Engineering approaches based on semi-automated curation of the primary literature

Recall, Precision and F-Score
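The slide's own formulas did not survive in this transcript. As a reminder, the standard definitions are given here, assuming the balanced F-score (harmonic mean of precision and recall); TP, FP and FN denote true positives, false positives and false negatives.

```latex
\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F = \frac{2PR}{P + R}
\]
```

The balanced form is consistent with the reported results: for the Baseline row, 2(0.3926)(0.1673) / (0.3926 + 0.1673) ≈ 0.2346.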

Page 11: Ontology Engineering approaches based on semi-automated curation of the primary literature

Field Labeling Results: overall label level

System Features                                          Precision  Recall  F-Score
Baseline                                                 0.3926     0.1673  0.2346
Lexicon                                                  0.5689     0.3771  0.4536
Lexicon + Surface Words                                  0.7415     0.6817  0.7103
Lexicon + Surface Words + Window Words                   0.7843     0.7039  0.7420
Lexicon + Surface + Window Words + Dependency features   0.7756     0.7347  0.7546

Preliminary data from training on 14 documents and testing on 16 documents

Page 12: Ontology Engineering approaches based on semi-automated curation of the primary literature

Field Labeling Results: Confusion Matrices

Counts               O      injectionLocation  injectionSpread  labelingDescription  labelingLocation  tracerChemical  Total
O                    41087  141                97               338                  1751              6               43420
injectionLocation    545    744                48               6                    820               1               2164
injectionSpread      126    43                 147              11                   155               0               482
labelingDescription  1121   5                  0                3773                 82                47              5028
labelingLocation     1988   224                110              27                   9251              0               11600
tracerChemical       108    1                  12               0                    0                 623             744
Total                44975  1158               414              4155                 12059             677

Axis labels: machine labels, human labels
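Per-label precision, recall and F-score can be read off the confusion matrix. The sketch below uses the counts from this slide. It assumes that rows are human labels and columns are machine labels, which the transcript does not confirm; if the axes are the other way round, the precision and recall values swap. The function and variable names are illustrative.

```python
LABELS = [
    "O",
    "injectionLocation",
    "injectionSpread",
    "labelingDescription",
    "labelingLocation",
    "tracerChemical",
]

# Counts from the slide. ASSUMPTION: rows = human labels, columns = machine labels.
COUNTS = [
    [41087, 141, 97, 338, 1751, 6],
    [545, 744, 48, 6, 820, 1],
    [126, 43, 147, 11, 155, 0],
    [1121, 5, 0, 3773, 82, 47],
    [1988, 224, 110, 27, 9251, 0],
    [108, 1, 12, 0, 0, 623],
]


def per_label_scores(counts, labels):
    """Return {label: (precision, recall, f_score)} from a square confusion matrix."""
    scores = {}
    for i, label in enumerate(labels):
        correct = counts[i][i]
        human_total = sum(counts[i])                    # row sum
        machine_total = sum(row[i] for row in counts)   # column sum
        precision = correct / machine_total if machine_total else 0.0
        recall = correct / human_total if human_total else 0.0
        denom = precision + recall
        f_score = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_score)
    return scores


scores = per_label_scores(COUNTS, LABELS)
for label, (p, r, f) in scores.items():
    print(f"{label:20s} P={p:.4f} R={r:.4f} F={f:.4f}")
```

The row and column sums of `COUNTS` reproduce the totals printed on the slide, which is a useful check that the matrix was transcribed correctly.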

Page 13: Ontology Engineering approaches based on semi-automated curation of the primary literature

Generalizing the methodology: ‘Histology’

[from Gonzalo-Ruiz et al 1992, JCN 321: 300-311]

Page 14: Ontology Engineering approaches based on semi-automated curation of the primary literature

The logical structure of a tract-tracing experiment

Tracer Chemical [1]

Injection Site [1]
• Location: brain structure, topography, side

Labeled region [1...*]
• Location: brain structure, topography, ipsi-contra relative to injection site?
• Label type: ‘anterograde’, ‘retrograde’
• Label density

Page 15: Ontology Engineering approaches based on semi-automated curation of the primary literature

Time and effort
• Current performance achieved by annotating 40 documents
• Each document contains 97 sentences (in results section) on average
• Annotation rate: ~40 sent/hr (no support), ~115 sent/hr (after 20 documents)
• Time taken to annotate documents to train the system to perform at this standard: ~65 hours with no support
• Estimate ~2 months for a 50% RA (20 hours / week)
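The hours figure can be checked with simple arithmetic. The sketch below is one possible reading, not something the slide states: it assumes the first 20 documents are annotated at the unsupported rate and the remaining 20 at the supported rate. The function name and the switch-over assumption are illustrative. It also prints the total if all 40 documents were annotated at the unsupported rate, for comparison.

```python
DOCS = 40                 # documents annotated (from the slide)
SENTENCES_PER_DOC = 97    # average sentences per results section (from the slide)
RATE_UNSUPPORTED = 40     # sentences per hour, no support (from the slide)
RATE_SUPPORTED = 115      # sentences per hour, after 20 documents (from the slide)
SWITCH_AFTER_DOCS = 20    # ASSUMPTION: rate changes after this many documents


def annotation_hours(docs, per_doc, rate_before, rate_after, switch_after):
    """Hours to annotate `docs` documents when the rate changes part-way."""
    early_docs = min(docs, switch_after)
    late_docs = docs - early_docs
    return early_docs * per_doc / rate_before + late_docs * per_doc / rate_after


hours_mixed = annotation_hours(
    DOCS, SENTENCES_PER_DOC, RATE_UNSUPPORTED, RATE_SUPPORTED, SWITCH_AFTER_DOCS
)
hours_unsupported = DOCS * SENTENCES_PER_DOC / RATE_UNSUPPORTED

print(f"mixed rates:      {hours_mixed:.1f} hours")
print(f"unsupported only: {hours_unsupported:.1f} hours")
```

Under this reading the mixed-rate total comes to roughly 65 hours, which is close to the figure on the slide.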

Page 16: Ontology Engineering approaches based on semi-automated curation of the primary literature

Can we discover the schema from the text?

Given a large review or a grant proposal specific to a single laboratory

Annotate independent and dependent variables in papers.

Can we learn and extract these patterns?

Page 17: Ontology Engineering approaches based on semi-automated curation of the primary literature

An example from current set of annotations

10 independent variables:
• age
• species
• sex
• weight
• agonist/antagonist combinations (9)
• primary antibody
• preparation
• protocol
• brain region

1 dependent variable:
• signal density

Page 18: Ontology Engineering approaches based on semi-automated curation of the primary literature

Acknowledgements

Funding: Information Sciences Institute, seed funding * National Library of Medicine (RO1-LM07061) * NSF (LONI MAP project) HBP (USCBP)

Neuroscience consultants: Alan Watts * Larry Swanson * Arshad Khan * Rick Thompson * Joel Hahn * Lori Gorton * Kim Rapp

Computer Scientists: Eduard Hovy * Donghui Feng * Patrick Pantel

Developers: Tommy Ingulfsen * Wei-Cheng Cheng

