modeling textual definition contents with bfo 2.0 and creating related linguistic/nlp resources...
TRANSCRIPT
Modeling Textual Definition Contents With BFO 2.0
and Creating RelatedLinguistic/NLP Resources
Selja SeppäläOctober 29, 2014
NCBO Webinar Series
Objectives
Developing the specifications for a widely usable software to help in definition writing
Creating language- and domain-independent definition models (templates)Using the BFO categories and relationsCreating BFO-related linguistic/NLP resources for large-scale corpus analyses of definitions
2NCBO Webinar Series | October 29, 2014 | Selja Seppälä
The general idea
gen is_a OBJECT An achromatic cell
spe other: develops_from MATERIAL ENTITY
of the myeloid or lymphoid lineages
spe bearer_of DISPOSITION
capable of ameboid movement,
spe located_in MATERIAL ENTITY
found in blood or other tissue.
OBJECT
NCBO Webinar Series | October 29, 2014 | Selja Seppälä
object
s has_part objects has_part object aggregate
material entity
s bearer_of dispositions bearer_of qualitys contains processs contains process boundarys has_history processa has_part immaterial entitya has_part material entitya located_in independent continuants material_basis_of dispositiona occupies three-dimensional spatial regiona part_of immaterial entitya part_of material entitys participates_in process
3
leukocyte (CL_0000738)An achromatic cell of the myeloid or lymphoid lineages capable of ameboid movement, found in blood or other tissue.
12
3
Outline
• Background on definitions• Introducing the problem• Proposal: modeling definition contents
with BFO• Methodology• Creating BFO-related linguistic/NLP resources• Conclusion
4NCBO Webinar Series | October 29, 2014 | Selja Seppälä
Background on definitions
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 5
Object of study (1)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 6
Object of study (2)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 7
Object of study (3)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 8
Object of study (4)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 9
Functions of a definition
• Conveying meaning of domain-specific terms• Specifying terms’ intended meaning
Disambiguation function• Problems when– Information is not relevant, e.g. non-defining– No definition at all How to understand/use the term?
10NCBO Webinar Series | October 29, 2014 | Selja Seppälä
Definition of ‘definition’
• A single sentence • Composed of pieces of information, e.g. found
in experts’ texts• Which are relevant to the experts’ activity• About the types of things to which the defined
term refers• Has classical Aristotelian structure–One genus–One or more differentiae (specifiers)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 11
Example from the Uber Anatomy Ontology (UBERON)
ligamentDense regular connective tissue connecting two or more adjacent skeletal elements or supporting an organ.
NCBO Webinar Series | October 29, 2014 | Selja Seppälä
genus:type of thing
differentia:distinguishing characteristic
12
Example from theCell Type Ontology (CL)
leukocyteAn achromatic cell of the myeloid or lymphoid lineages capable of ameboid movement, found in blood or other tissue.
NCBO Webinar Series | October 29, 2014 | Selja Seppälä
genus:type of thing
differentiae:distinguishing characteristics
13
Definition writing rules, issues, and needs
• Subject to a few formal rules (punctuation, etc.)
• Only vague principles for selecting the right pieces of information
• Time-consuming, costly, and prone to a kinds of inconsistencies
➔ Create computer software to help authors– Improve quality of dictionaries & ontologies– Comply to the publisher’s formal rules (Seppälä 2006)
– Include relevant defining contentsNCBO Webinar Series | October 29, 2014 | Selja Seppälä 14
Definition writing rules, issues, and needs
• Subject to a few formal rules (punctuation, etc.)
• Only vague principles for selecting the right pieces of information
• Time-consuming, costly, and prone to a kinds of inconsistencies
➔ Create computer software to help authors– Improve quality of dictionaries & ontologies– Comply to the publisher’s formal rules (Seppälä 2006)
– Include relevant defining contentsNCBO Webinar Series | October 29, 2014 | Selja Seppälä 15
Introducing the problem
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 16
Example: compiling a corpus5.3.2 Net booms Netting can be used as a boom to collect viscous oils at sea. Vessels deploy the netting, manoeuvring it to surround and concentrate the oil for recovery. In inshore waters, it may be possible to use nets as a moored boom. Net booming is based on the principle that air and water will pass through the net but not viscous oil. Therefore, there is little or no downward current to carry oil underneath the net and the net is less prone to being blown flat. Because of the lower resistance of nets to water movement, it should also be possible in theory to deploy net booms in faster currents than is possible with conventional booms. Net booms have not yet been fully tested in an actual oil spill but results from field trials have been encouraging. A net boom consists of a long strip of netting, supported at frequent and regular intervals by poles with floats and weights attached to keep the netting upright (Figure 20).
(Source: http://www.dft.gov.uk/mca/chapter_5.pdf)
17NCBO Webinar Series | October 29, 2014 | Selja Seppälä
Example: extracting information5.3.2 Net booms Netting can be used as a boom to collect viscous oils at sea. Vessels deploy the netting, manoeuvring it to surround and concentrate the oil for recovery. In inshore waters, it may be possible to use nets as a moored boom. Net booming is based on the principle that air and water will pass through the net but not viscous oil. Therefore, there is little or no downward current to carry oil underneath the net and the net is less prone to being blown flat. Because of the lower resistance of nets to water movement, it should also be possible in theory to deploy net booms in faster currents than is possible with conventional booms. Net booms have not yet been fully tested in an actual oil spill but results from field trials have been encouraging. A net boom consists of a long strip of netting, supported at frequent and regular intervals by poles with floats and weights attached to keep the netting upright (Figure 20).
(Source: http://www.dft.gov.uk/mca/chapter_5.pdf)
18NCBO Webinar Series | October 29, 2014 | Selja Seppälä
Example: identifying potentially defining information
5.3.2 Net booms Netting can be used as a boom to collect viscous oils at sea. Vessels deploy the netting, manoeuvring it to surround and concentrate the oil for recovery. In inshore waters, it may be possible to use nets as a moored boom. Net booming is based on the principle that air and water will pass through the net but not viscous oil. Therefore, there is little or no downward current to carry oil underneath the net and the net is less prone to being blown flat. Because of the lower resistance of nets to water movement, it should also be possible in theory to deploy net booms in faster currents than is possible with conventional booms. Net booms have not yet been fully tested in an actual oil spill but results from field trials have been encouraging. A net boom consists of a long strip of netting, supported at frequent and regular intervals by poles with floats and weights attached to keep the netting upright (Figure 20).
(Source: http://www.dft.gov.uk/mca/chapter_5.pdf)
19NCBO Webinar Series | October 29, 2014 | Selja Seppälä
Example: selecting relevant information for defining the term
5.3.2 Net booms Netting can be used as a boom to collect viscous oils at sea. Vessels deploy the netting, manoeuvring it to surround and concentrate the oil for recovery. In inshore waters, it may be possible to use nets as a moored boom. Net booming is based on the principle that air and water will pass through the net but not viscous oil. Therefore, there is little or no downward current to carry oil underneath the net and the net is less prone to being blown flat. Because of the lower resistance of nets to water movement, it should also be possible in theory to deploy net booms in faster currents than is possible with conventional booms. Net booms have not yet been fully tested in an actual oil spill but results from field trials have been encouraging. A net boom consists of a long strip of netting, supported at frequent and regular intervals by poles with floats and weights attached to keep the netting upright (Figure 20).
(Source: http://www.dft.gov.uk/mca/chapter_5.pdf)
20NCBO Webinar Series | October 29, 2014 | Selja Seppälä
Example: composing the definition5.3.2 Net booms Netting can be used as a boom to collect viscous oils at sea. Vessels deploy the netting, manoeuvring it to surround and concentrate the oil for recovery. In inshore waters, it may be possible to use nets as a moored boom. Net booming is based on the principle that air and water will pass through the net but not viscous oil. Therefore, there is little or no downward current to carry oil underneath the net and the net is less prone to being blown flat. Because of the lower resistance of nets to water movement, it should also be possible in theory to deploy net booms in faster currents than is possible with conventional booms. Net booms have not yet been fully tested in an actual oil spill but results from field trials have been encouraging. A net boom consists of a long strip of netting, supported at frequent and regular intervals by poles with floats and weights attached to keep the netting upright (Figure 20).
(Source: http://www.dft.gov.uk/mca/chapter_5.pdf)
21NCBO Webinar Series | October 29, 2014 | Selja Seppälä
A content selection problem
• How to select the relevant kind of information?➔ Selection principles
• How to create rules to help authors select that the relevant information?➔ Content modeling
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 22
Proposal: modeling definition contents with BFO
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 23
Content selection principles
• Extensional dimension– Ontological factors
• Category of the referent (object, process, quality, etc.)• Type of things or particular (particle accelerator vs. the LHC)
• Contextual dimension– Conceptual system of the domain– Background knowledge of the user
• Communicative dimension– User needs– Communication situation (linguistic expression: quadruped
vs. four legged)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 24
A generic solution to content selection
• Extensional dimension– Ontological factors
• Category of the referent (object, process, quality, etc.)• Type of things or particular (particle accelerator vs. the LHC)
• Contextual dimension– Conceptual system of the domain– Background knowledge of the user
• Communicative dimension– User needs– Communication situation (linguistic expression: quadruped
vs. four legged)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 25
Solution to content representation➔ Represent contents with models based on
categories and relations (Sager 1990, Sager & L’Homme 1994)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä
leukocyteAn achromatic cell of the myeloid or lymphoid lineages capable of ameboid movement, found in blood or other tissue.
is_a OBJECT
capable_of PROCESS
develops_from MATERIAL ENTITY
located_in MATERIAL ENTITY
26
Questions
• What kinds of entities are there?• What are their characteristics?
➔Use a realist upper level ontology: the Basic Formal Ontology (BFO)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 27
Proposal
• Create definition models based on the BFO categories and their characteristics
• See to what extent these models predict the contents of definitions
➔Ontological analysis framework
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 28
Methodology
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 29
Example: modeling definitions of objectsligament (UBERON_0000211)
gen is_a OBJECT Dense regular connective tissue
spe bearer_of FUNCTION
connecting two or more adjacent skeletal elements or supporting an organ.
leukocyte (CL_0000738)
gen is_a OBJECT An achromatic cell
spe other: develops_from MATERIAL ENTITY
of the myeloid or lymphoid lineages
spe bearer_of DISPOSITION
capable of ameboid movement,
spe located_in MATERIAL ENTITY
found in blood or other tissue.
OBJECT
NCBO Webinar Series | October 29, 2014 | Selja Seppälä
object
s has_part objects has_part object aggregate
material entity
s bearer_of dispositions bearer_of qualitys contains processs contains process boundarys has_history processa has_part immaterial entitya has_part material entitya located_in independent continuants material_basis_of dispositiona occupies three-dimensional spatial regiona part_of immaterial entitya part_of material entitys participates_in process
30
Creating definition models
• Use BFO specifications, OWL file & FOL axioms
• To analyze BFO categories as relational configurations (RCs)
‘category + relation + category’
E.g.: some object has_part objectsome object participates_in process
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 31
material entities in BFO 2.0
NCBO Webinar Series | October 29, 2014 | Selja Seppälä
object aggregatea has_member_part object
fiat object parta part_of object
material entitys bearer_of dispositions bearer_of qualitys contains processs contains process boundarys has_history processa has_part immaterial entitya has_part material entitya located_in independent continuants material_basis_of dispositiona occupies three-dimensional spatial regiona part_of immaterial entitya part_of material entitys participates_in process
objects has_part objects has_part object aggregate
32
is_ais_a
is_a
objects has_part objects has_part object aggregate
material entitys bearer_of dispositions bearer_of qualitys contains processs contains process boundarys has_history processa has_part immaterial entitya has_part material entitya located_in independent continuants material_basis_of dispositiona occupies three-dimensional spatial regiona part_of immaterial entitya part_of material entitys participates_in process
The model for object Creating relational models with proper and inherited RCs
Relations characterizing the category object
Relations inherited from the category material entity
OBJECT
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 33
Testing the models
Applying the models to multilingual & multi-domain textual definitions corpora•Segmenting: GEN + SPE•Annotating: with relational models•Statistical analyses to see – Which relations are more relevant for defining– If new ones can be added to the models
Relevant models from generic ones
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 34
Annotation example (1)<FICHE langue="en”>
<NI>EN_AB_11</NI><CM>
<DOMAINE>protein biosynthesis in eukaryotic cells</DOMAINE>
<SS-DOM1>cell</SS-DOM1><SS-DOM2>protein</SS-DOM2>
</CM><VE>protein</VE><DF>
<GEN REFrelGEN="is_a OBJECT" relationVE="GENRE_PROCH">A molecule
</GEN><SPE REFrelSPE="has_part OBJECT AGGREGATE">consisting of different amino acids linked to each other in a specific sequence by peptide bonds.</SPE></DF></FICHE>NCBO Webinar Series | October 29, 2014 | Selja Seppälä 35
Annotation example (2)...
<VE>protein</VE><DF><GEN REFrelGEN="is_a OBJECT"
relationVE="GENRE_PROCH">A molecule</GEN><SPE REFrelSPE="has_part OBJECT AGGREGATE">consisting of different amino acids linked to each other in a specific sequence by peptide bonds.</SPE>
</DF>...
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 36
Results grid for OBJECT (BFO 1.0)
NCBO Webinar Series | October 29, 2014 | Selja Seppälä
object normal categ. other total %
has_part object 15 1 1626
participates_in process 3 6 9
bearer_of quality19 3 22
60
bearer_of realizable entity31 1 32
located_at temporal region0 0 0
located_in site 2 2 4
other 14 14 14
total 70 13 14 97 100
37
Creating BFO-relatedlinguistic/NLP resources
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 38
Creating input resources• Corpora of existing definitions– Multi-domain & multilingual– Checked for adequcy of form (spelling, punctuation, etc.)– May be syntactically parsed
• BFO-Definition-Models– XML format– Encoded: full/short name for relations, synonyms, sources,
purl, comments, modality (all, some, no)• Programs for automating– Genus segmentation & tagging– Genus annotation with BFO categoryMore systematic & increased reliabilityPre-annotation of large-scale corpora
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 39
Genus tagging: segmenting
A. Create processing resources1. Morpho-sytactic rules to get SPE preceding the GEN2. List of terms from dictionary or ontology3. Morpho-syntactic rules to find end of GEN
B. For each definition1. Apply rules in A.1 to tag beginning of definitions with
a premodifying <SPE> tag2. Search for terms at the beginning of definition or
after the first <SPE> tag using list in A.23. If term found, tag with <GEN> tag4. Else, apply rules in A.3 to tag end of GEN that does
not contain a termNCBO Webinar Series | October 29, 2014 | Selja Seppälä 40
Segmenting examples
ligament<GEN REFrelGEN="???" relationVE="GENRE_PROCH">
Dense regular connective tissue</GEN><SPE>connecting two or more adjacent skeletal elements or supporting an organ.
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 41
morpho-syntactic rule:
</GEN><SPE>.*ing
Genus tagging: annotating
• No existing mappings of linguistic expressions to BFO categories
• Tests to create such reseources usingA. Terms in BFO-compliant ontologies (e.g., mid-level)
B. WordNet-KYOTO-DOLCE-Lite-BFO mappings• Processing– Get single/complex terms – Get headword of complex terms– Create synsets of single & complex terms linked to
headwords A. Map to BFO categories via terms from BFO-compliant ontologiesB. Map single terms/headwords to WN synsets search synsets’ ids in
KYOTO get DOLCE categories map to BFO categoriesNCBO Webinar Series | October 29, 2014 | Selja Seppälä 42
conclusion
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 43
Output resources
• Bilingual multi-domain training corpora (comparable & parallel corpora)
• From corpora– Examples for BFO categories– Lexicons mapped to BFO categories & relations
• Annotation manual• Annotation and validation interfaces
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 44
Future work
• Conduct large-scale corpus studies➔ Annotation campaign➔ Larger training corpora for machine learning
• Possible use of the produced linguistic resources to automatically generate textual definitions from logical ones, and vice versa
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 45
Thank [email protected]://seljaseppala.wordpress.com
46NCBO Webinar Series | October 29, 2014 | Selja Seppälä
References• Juan Sager. 1990. A practical course in terminology processing. John
Benjamins, Amsterdam, Philadelphia.• Juan Sager and Marie-Claude L’Homme. 1994. A model for the definition
of concepts: Rules for analytical definitions in terminological databases. Terminology, 1(2):351–373.
• Selja Seppälä. 2006. Semi-automatic checking of terminographic definitions. In International Workshop on Termi- nology design: quality criteria and evaluation methods (TermEval) — LREC 2006, 22–27, Genoa, Italy, 2006.
• Selja Seppälä. 2010. Automatiser la rédaction de définitions terminographiques : questions et traitements. In RECITAL 2010, Montréal, Canada, 19–22 juillet.
• Selja Seppälä. 2012. Contraintes sur la sélection des informations dans les définitions terminographiques: vers des modèles relationnels génériques pertinents. PhD thesis, Département de traitement informatique multilingue (TIM), Faculté de traduction et d’interprétation, Université de Genève.
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 47
Outcomes
• Pilot study suggests that BFO-based models are predictive of definitions’ contents
• When not predictive, BFO categories and relations can be used as a fixed metalanguage
• Methodology may reveal ‘typical features’ for defining each category weighted differentia
• Creation of lingustic resources for a better integraton of BFO to NLP processing
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 48
Advantages• (Semi-)formal structuring of textual definitions
Homogeneity Internal structure linked to the ontology
• No complex BFO labels displayed (only in metadata) Increased readability & user-friendliness
• Allows more complex querying via the definition (through metadata AND text)
• Domain-adaptable through corpus analysis• Methodology applicable to– More specific categories from ontologies extending BFO– Other upper-level ontologies
NCBO Webinar Series | October 29, 2014 | Selja Seppälä 49