Issues in Learning an Ontology from Text

Christopher Brewster, Simon Jupp, Joanne Luciano, David Shotton, Robert Stevens, and Ziqi Zhang


Talk at bio-ontologies SIG at ISMB Toronto, 2008


The Use Case: Animal Behaviour

• Animal behaviour community recognises the need for an ontology, e.g. for video annotation/retrieval

• The community created an “Animal Behaviour Ontology” - 339 terms

• Can we (semi-)automatically build one from text?

Some Questions

• Do we get a “good ontology”?

• If not, is it useful?

• Is it low-effort?

• Should the result be “tidied up” or used as a donor?

Methodology: Dataset

• Journal “Animal Behaviour” from Elsevier

• 623 articles from Vol 71 (2006) - Vol 74 (2007)

• 2.2 million words

• Various formats - most usefully XML

We Want an Ontology of Green

• An ontology of “animal behaviours”

• Not an ontology of the corpus

We want the green terms in the ontology

Processing Steps (1)

1. Text extracted from XML, excluding affiliations, acknowledgements, bibliography (except for titles), etc.

2. Noise removed - person names, animal names, place names

3. Lemmatiser used to reduce data sparsity

4. Term extraction applied
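Step 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the tag names and the sample article are invented, since the talk does not describe the XML schema, and the sketch drops the bibliography entirely rather than keeping its titles.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag names: the real article schema is not given in the talk.
EXCLUDED_TAGS = {"affiliation", "acknowledgement", "bibliography"}


def extract_text(element):
    """Collect text fragments from an element, skipping excluded subtrees."""
    if element.tag in EXCLUDED_TAGS:
        return []
    parts = []
    if element.text and element.text.strip():
        parts.append(element.text.strip())
    for child in element:
        parts.extend(extract_text(child))
        # Tail text belongs to the parent, so keep it even if the child was skipped.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return parts


# Invented toy article, for illustration only.
sample = """<article>
  <title>Mate choice in reed buntings</title>
  <affiliation>University of Somewhere</affiliation>
  <body><para>Females showed courtship behaviour.</para></body>
  <acknowledgement>We thank the field team.</acknowledgement>
</article>"""

text = " ".join(extract_text(ET.fromstring(sample)))
print(text)
```

Only the title and body text survive; the affiliation and acknowledgement subtrees are dropped before any later step sees them.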

Processing Steps (2)

5. Term selection

Regular expression used to select terms ending in behaviour, display, construction, inspection plus generic -ing, -ism, etc.

Build hierarchies using String Inclusion

6. Top level terms filtered using “Hearst Patterns” to test if X ISA behaviour/activity/etc.

Example terms: Walking, Running, Jumping, Hunting, Pecking, Reed Bunting, Corn Bunting, Herring, Courtship, Studentship, Cannibalism, Dimorphism
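The suffix-based selection in step 5 might look like the sketch below. The exact expression used in the talk is not given, so this pattern is a guess assembled from the endings named on the slide; "-ship" is an assumption inferred from the example terms.

```python
import re

# Assumed pattern: specific head words plus generic endings.
# "-ship" is inferred from the example terms, not stated on the slide.
TERM_PATTERN = re.compile(
    r"(?:\b(?:behaviour|display|construction|inspection)|ing|ism|ship)$",
    re.IGNORECASE,
)

candidates = [
    "walking",
    "reed bunting",
    "herring",
    "courtship",
    "studentship",
    "cannibalism",
    "dimorphism",
    "grooming behaviour",
    "interaction",
]

# Keep every candidate whose final characters match one of the endings.
selected = [term for term in candidates if TERM_PATTERN.search(term)]
print(selected)
```

A purely orthographic filter keeps noise such as "reed bunting", "herring" and "studentship" alongside genuine behaviours, and it drops "interaction", which has none of the listed endings.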

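Applying String Inclusion/Rules to Terms

• Schematic: C is the parent of BC and AC; BC is the parent of ABC

• Example: Selection is the parent of Mate Selection and Natural Selection; Mate Selection is the parent of Female Mate Selection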
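A minimal sketch of string inclusion, assuming it means "a term's parent is the longest other term that it ends with, compared word by word". The function name is illustrative and the talk's actual rules may differ.

```python
def build_hierarchy(terms):
    """Map each term to its parent: the longest other term that it ends with."""
    term_set = set(terms)
    parents = {}
    for term in term_set:
        words = term.split()
        parent = None
        # Drop leading words one at a time; the first hit is the longest suffix.
        for i in range(1, len(words)):
            candidate = " ".join(words[i:])
            if candidate in term_set:
                parent = candidate
                break
        parents[term] = parent
    return parents


terms = ["selection", "mate selection", "natural selection", "female mate selection"]
parents = build_hierarchy(terms)

# Print shortest terms first so the tree reads top-down.
for term in sorted(parents, key=lambda t: (len(t.split()), t)):
    print(f"{term!r} -> {parents[term]!r}")
```

Here "female mate selection" attaches to "mate selection" rather than directly to "selection", because the longest matching suffix wins.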

Lexico-Syntactic Patterns

• X such as P, Q, R; X is a Y

• Grooming is a behaviour

• Copulation is an activity

• Dimorphism is a behaviour

• Calls such as trills, whistles, grunts
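The two patterns above, and their use in step 6 to filter top-level terms, can be approximated with plain regular expressions. This is a sketch under simplifying assumptions: single-word terms, comma-separated lists only, and no lemmatisation. The set of accepted parents and the candidate list are illustrative.

```python
import re

# "X such as P, Q, R": each of P, Q, R is read as a kind of X.
SUCH_AS = re.compile(r"(\w+) such as (\w+(?:, \w+)*)")
# "X is a Y" / "X is an Y": X is read as a kind of Y.
IS_A = re.compile(r"(\w+) is an? (\w+)")


def extract_isa(sentence):
    """Return (child, parent) pairs found by the two patterns."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        parent = match.group(1).lower()
        for child in match.group(2).split(", "):
            pairs.append((child.lower(), parent))
    for match in IS_A.finditer(sentence):
        pairs.append((match.group(1).lower(), match.group(2).lower()))
    return pairs


sentences = [
    "Grooming is a behaviour",
    "Copulation is an activity",
    "Dimorphism is a behaviour",
    "Calls such as trills, whistles, grunts",
]
pairs = [pair for sentence in sentences for pair in extract_isa(sentence)]

# Assumed filter: keep a top-level term only if the text asserts it ISA one of these.
BEHAVIOUR_PARENTS = {"behaviour", "activity"}
top_level = ["grooming", "copulation", "dimorphism", "herring"]
supported = {child for child, parent in pairs if parent in BEHAVIOUR_PARENTS}
kept = [term for term in top_level if term in supported]
print(kept)
```

The filter removes "herring", for which no supporting sentence exists, but keeps "dimorphism": the patterns report only what the text asserts, so a sentence like "Dimorphism is a behaviour" passes the test.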

Results

• 64,000 terms extracted

• The regexp selected 10,335 terms

• Step 6a resulted in an ontology with 17,776 classes and 1,295 top level classes

• Step 6b resulted in an ontology with 13,058 classes and 912 top level classes

Results (2) - Copulation Sub-tree

Results (3)

• Evaluation of terms excluded by regexp:

• 56,000 terms excluded

• Random sample of 3,140 terms evaluated by hand

• 7 verbs and 42 nouns should not have been excluded

• E.g., “interaction”

• A recall of 0.905

Discussion: The problem of focus

Other Issues

• More a vocabulary than an ontology

• SKOS-like rather than OWL-like

• Can deal with “selection”, “mate selection” and “natural selection”

• Highly compositional terms, e.g. “Adult male grooming behaviour”

• Cleanish list of top level terms: cannibalism, copulation, eating, foraging, fighting, grooming

Discussion: Is it useful?

• Answers to the earlier questions (good ontology? useful? low-effort? tidy up or donor?): no, yes, yes, donor

• Useful ontological fragments

• Bringing ontology to ontology learning is the research challenge

• Limitations: noise; the problem of focus; only taxonomic relations

• Advantages: speed; ease; a step towards formal ontologies