Post on 15-Jan-2016


Mary McGee Wood, Shenghui Wang
Dept of Computer Science, U. of Manchester

Valentin Tablan, Diana Maynard, Hamish Cunningham
Dept of Computer Science, U. of Sheffield

Susannah Lydon
Earth Science Education Unit, U. of Keele

Populating a Database from Parallel Texts using “Ontology-based” Information Extraction

The hypothesis

Overview

Parallel texts

Legacy data in the natural sciences

“Ontology-based” Information Extraction

NLDB’04 - a few running threads

Multiple / semi-overlapping text sources

Sophisticated vs shallow or statistical text processing

“Ontologies” are not the same as gazetteers or lexicons (or semantic nets!)

Autonomous agents vs HCC (Human-Computer Collaborative) approaches

We are doing…

Highly homogeneous data sources

Shallow text processing

“Ontologies” only as a last resort

HCC approach

We are not doing…

Heterogeneous data sources

Sophisticated language processing

Improvement of single-source IE or question-answering

Autonomous agents

Parallel texts

Text descriptions in the traditional descriptive sciences.

Descriptions of protein sequences and functions in molecular biology.

Press coverage of news stories.

Police witness-of-crime reports.

(Semi-) automatic marking of free text answers in examinations.

Legacy data in the natural sciences

Text descriptions in the traditional descriptive sciences:

Species descriptions in botany and zoology

Descriptions of diseases in medicine.

Data sources

Five species of Ranunculus (buttercups)

Six botanists’ text descriptions (Floras)

Typical data

R. acris L. - Meadow Buttercup. Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.
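A description like this one is regular enough that shallow processing gets a long way. The following is a minimal sketch, not the MultiFlora pipeline: it splits the text into clauses on semicolons and picks out measurements with a regular expression. The pattern `MEASURE` and the helper names `split_clauses` and `find_measures` are assumptions made for illustration.

```python
import re

# Hypothetical pattern: a number or numeric range followed by a unit,
# e.g. "1m", "15-25mm", "2-3.5mm".
MEASURE = re.compile(r"(\d+(?:\.\d+)?)(?:\s*-\s*(\d+(?:\.\d+)?))?\s*(mm|cm|m)\b")


def split_clauses(description):
    """Split a description into clauses on semicolons, dropping empty pieces."""
    return [c.strip() for c in description.split(";") if c.strip()]


def find_measures(clause):
    """Return a (low, high, unit) tuple for each measurement in a clause."""
    found = []
    for low, high, unit in MEASURE.findall(clause):
        lo = float(low)
        hi = float(high) if high else lo  # a single value is a degenerate range
        found.append((lo, hi, unit))
    return found


text = ("Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent; "
        "flowers 15-25mm across; sepals not reflexed; "
        "achenes 2-3.5mm, glabrous, smooth, with short hooked beak; 2n=14.")

for clause in split_clauses(text):
    print(clause, "->", find_measures(clause))
```

The chromosome count "2n=14" is deliberately not matched, since it carries no unit.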

Hand Parsing & Correlation

Sources: CTM, FE, FNA, GLEASON, GRAY, STACE

Petals

number: 5; 5

length: usually 10-15 mm; 9-13 mm; 8-14 mm. long; 0.8-1.4 cm long

width: 8-11 mm; nearly as broad

shape: broadly obovate with cuneate base; broadly obovate; rounded-obovate

colour: bright glossy yellow, rarely paler or white; yellow

Hand-parsed species descriptions for Ranunculus bulbosus

Results of hand-analysis of Ranunculus descriptions from six sources

- Most data from one source only

- Individual texts contain on average 39% of the total information for each species

MultiFlora I

Automatic compilation of accurate taxonomic databases from multiple non-computerised sources

Department of Computer Science, University of Manchester: Mary McGee Wood, David Rydeheard, Susannah Lydon

Department of Botany, Natural History Museum, London: Rob Huxley, David Sutton

Supported by the BBSRC / EPSRC joint Bioinformatics Initiative, grant reference number 34/BIO12072

GATE I

Tagger output

Parse trees

Names & verbs

‘Basal leaves more or less deeply divided…’

1231 semantics 179 191 (qlf:[ne_tag(e13, offsets(179, 184)), name(e13, 'Basal'), realisation(e13, offsets(179, 184)), leave(e12), time(e12, present), aspect(e12, simple), voice(e12, active), realisation(e12, offsets(185, 191)), realisation(e12, offsets(185, 191)), lsubj(e12, e13)])

1247 semantics 200 226 (qlf:[divide(e14), adv(e14, less), adv(e14, deeply), time(e14, none), aspect(e14, simple), voice(e14, passive), into(e14, e15), count(e15, 3), realisation(e15, offsets(225, 226)), realisation(e14, offsets(200, 226)), realisation(e14, offsets(200, 226))])

Template output (1)

Erect perennial to 1m; basal leaves deeply palmately lobed, pubescent;

HEAD KIND FEATURE TYPE KIND

Erect

Perennial

to 1m measure unknown

basal position pubescent

leaves Prefix

deeply

palmately

lobed

Template output (2)

flowers 15-25mm across; sepals not reflexed; achenes 2-3.5mm, glabrous, smooth, with short hooked beak;

HEAD KIND FEATURE TYPE KIND NEGATION

flowers 15-25mm measure width

across

sepals reflexed true

achenes short

hooked

smooth

glabrous

2-3.5mm measure unknown

MultiFlora II: Combining Information Extraction and Knowledge Representation for Biodiversity Informatics

Department of Computer Science, University of Manchester: Mary McGee Wood, Susannah Lydon, Alan Rector

Department of Botany, Natural History Museum, London: Rob Huxley

Natural Language Processing Group, University of Sheffield: Hamish Cunningham, Valentin Tablan, Diana Maynard

Supported by the BBSRC Bioinformatics and E-science Programme, grant reference number 34/BEP17049

GATE II

“Ontology-based” Information Extraction

“Ontology” – classes of heads, properties, and features

Gazetteers – instances of these classes

(Lexicons – not currently used)

Ontology: Heads

Head categories

Specific plant parts:

Flower: Flower, floret, Fl

Leaf: leaf, leaves, Fronds

Petal: petal, honey-leaf, vexillum

Collective categories:

PlantSeparatablePart: appendage, glume, tuber

PlantUnseparatablePart: beak, lobe, segment

SpecificRegionOfWhole: apex, border, head

[Figure: ontology-heads.eps]

Ontology: Properties

Properties

2DShape: arching, linear, toothed

3DShape: branching, thickened, tube

Colour: glossy, golden, greenish

Count: numerous, several

Ontology: Features

Features

Habit: bush, shrub, succulent

MorphologicalProperty: dense, contiguous, separate

SurfaceProperty: pilose, pitted, rugose
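The division of labour described here (the “ontology” supplies classes, the gazetteers supply instances of those classes) can be sketched as a simple lookup. This is an illustrative assumption, not GATE's gazetteer implementation: the dictionary `GAZETTEER` holds only a few of the example terms from the slides, lower-cased, and `annotate` is a hypothetical helper.

```python
# Class -> example instances, copied from the slides and lower-cased.
# A real gazetteer would be far larger; this subset is for illustration only.
GAZETTEER = {
    "Flower": ["flower", "floret", "fl"],
    "Leaf": ["leaf", "leaves", "fronds"],
    "Petal": ["petal", "honey-leaf", "vexillum"],
    "PlantSeparatablePart": ["appendage", "glume", "tuber"],
    "Colour": ["glossy", "golden", "greenish"],
    "SurfaceProperty": ["pilose", "pitted", "rugose"],
}

# Invert the table so each term points at its class.
TOKEN_TO_CLASS = {tok: cls for cls, toks in GAZETTEER.items() for tok in toks}


def annotate(text):
    """Return (token, class) pairs for every token found in the gazetteer."""
    tokens = text.lower().replace(",", " ").replace(";", " ").split()
    return [(t, TOKEN_TO_CLASS[t]) for t in tokens if t in TOKEN_TO_CLASS]


print(annotate("Leaves pilose; petal golden, glossy"))
```

Keeping classes and instances apart means new terms can be added to a list without touching the class structure.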

More typical data

Perennial herb with overwintering lf-rosettes from the short oblique to erect premorse stock up to 5 cm, rarely longer and more rhizome-like; roots white, rather fleshy, little branched.

System output

Head Class | Head | Property | FeatClass | Feature
Plant | herb | hasLifeform | Lifeform | Perennial
Leaf | lf-rosettes | hasLifeform | Lifeform | overwintering
PlantSepPart | stock | hasRelProperty | RelProperty | short
PlantSepPart | stock | hasOrientation | Orientation | oblique to erect
PlantSepPart | stock | hasLength | Length | up to 5 cm
PlantSepPart | stock | hasRelProperty | RelProperty | rhizome-like
Root | roots | hasColour | Colour | white
Root | roots | hasShape3D | Shape3D | rather fleshy
Root | roots | hasShape3D | Shape3D | little branched
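Each output row is a five-field record, so the rows for one description can be folded into a per-head template. The sketch below assumes a hypothetical `Fill` record type and `by_head` helper (neither comes from the system itself) and uses a subset of the rows shown above.

```python
from collections import defaultdict
from typing import NamedTuple


class Fill(NamedTuple):
    """One output row: head class, head, property, feature class, feature."""
    head_class: str
    head: str
    prop: str
    feat_class: str
    feature: str


rows = [
    Fill("Plant", "herb", "hasLifeform", "Lifeform", "Perennial"),
    Fill("PlantSepPart", "stock", "hasRelProperty", "RelProperty", "short"),
    Fill("PlantSepPart", "stock", "hasLength", "Length", "up to 5 cm"),
    Fill("Root", "roots", "hasColour", "Colour", "white"),
    Fill("Root", "roots", "hasShape3D", "Shape3D", "rather fleshy"),
    Fill("Root", "roots", "hasShape3D", "Shape3D", "little branched"),
]


def by_head(fills):
    """Group features by head and then by property, keeping row order."""
    grouped = defaultdict(lambda: defaultdict(list))
    for f in fills:
        grouped[f.head][f.prop].append(f.feature)
    return {head: dict(props) for head, props in grouped.items()}


template = by_head(rows)
print(template["roots"])
```

A property such as hasShape3D can hold several features for the same head, which is why each slot is a list.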

Precision | R. acris | R. bulbosus | R. hederaceus | Avg
Single description, average | 78 | 60 | 83 | 74
Single description, average, for whole template | 78 | 60 | 83 | 74
Merged, for whole template | 63 | 58 | 69 | 63

Recall | R. acris | R. bulbosus | R. hederaceus | Avg
Single description, average | 70 | 55 | 74 | 66
Single description, average, for whole template | 22 | 18 | 26 | 22
Merged, for whole template | 69 | 61 | 82 | 71

F-measure | R. acris | R. bulbosus | R. hederaceus | Avg
Single description, average | 73.78 | 57.39 | 78.24 | 69.77
Single description, average, for whole template | 34.32 | 27.69 | 39.60 | 33.92
Merged, for whole template | 65.86 | 59.46 | 74.94 | 66.76
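The slides do not state which F-measure is used. The figures are consistent with the balanced F-measure (the harmonic mean of precision and recall), so the sketch below assumes that formula; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall (in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs for R. acris, taken from the tables above.
print(round(f_measure(78, 70), 2))  # single description, average
print(round(f_measure(78, 22), 2))  # single description, whole template
print(round(f_measure(63, 69), 2))  # merged, whole template
```

For R. acris this gives 73.78, 34.32 and 65.86, matching the reported values: merging trades some precision for a large gain in recall over the whole template.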

Information merging

Measure | R. acris | R. bulbosus | R. hederaceus | Avg
Of all instances of missed information, percentage compensated for by merging | 50 | 46 | 55 | 50
Of total number of slots in template, percentage where merging allowed compensation for missed information | 25 | 29 | 18 | 24

Information merging

These figures are based on human judgement

Automated “merging reasoner” under active construction
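The merging reasoner itself is not described in the slides. As a rough illustration of the underlying idea only, the sketch below takes the union of slot fills across several descriptions and measures how much of the merged template each single description covers. The slot names, the source labels `source_1` to `source_3`, the assignment of values to sources, and both helper functions are hypothetical; the values are petal data from the hand-parsed table.

```python
def merge_templates(templates):
    """Union slot fills across sources, recording which source gave each value."""
    merged = {}
    for source, template in templates.items():
        for slot, value in template.items():
            merged.setdefault(slot, []).append((source, value))
    return merged


def coverage(template, all_slots):
    """Percentage of the merged template's slots filled by one description."""
    return 100 * len(set(template) & all_slots) / len(all_slots)


# Illustrative only: which source supplies which value is an assumption.
templates = {
    "source_1": {"petal.number": "5", "petal.length": "9-13 mm"},
    "source_2": {"petal.length": "8-14 mm", "petal.shape": "broadly obovate"},
    "source_3": {"petal.colour": "yellow"},
}

merged = merge_templates(templates)
all_slots = set(merged)
for name, template in templates.items():
    print(name, coverage(template, all_slots))
```

Keeping the source alongside each value leaves room for a later step to reconcile overlapping fills such as the two petal lengths, rather than discarding one of them.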

Future work – short term

Fine-tuning to improve precision

(Semi-) automatic template correlation heuristics

(Semi-) automatic data correlation heuristics

Extend coverage and evaluation

Future targets

Techniques:

Merging reasoner

Temporal reasoner

Data types:

Large-scale legacy data in biodiversity studies

Free text annotations in Bioinformatics databases
