Corpus based Creation and Extension of Domain-Specific Resources
Manuela Kunze, Dietmar Rösner
University of Magdeburg
Overview
Background: Corpus Characteristics
Experiment 1: Context-related Derivation of Concepts
Experiment 2: Clustering of Values
Corpus: Forensic Autopsy Protocols
different document parts:
• findings
• histological findings
• background
• discussion
• …
Autopsy Protocols: Findings
short linguistic structures
typical attribute-value structures, expressed by:
• noun phrases:
  Unterblutung des Gewebes. / Bleeding of tissue.
  Oberlippenbart. / Upper lip beard.
• noun phrases + verb/adjective/noun phrase:
  Mund geschlossen. / Mouth closed.
  Nebennieren ohne Besonderheiten. / Adrenal glands without anomalies.
Usable for the extension of the resources in combination with GermaNet?
Corpus
400 protocols, parsed with a context-free grammar (ca. 40 rules)
focus of the analyses:
• complex noun phrases: derivation of concepts
• attribute-value structures: clustering of values
Overview
Corpus Characteristics
Experiment 1: Context-related Derivation of Concepts
Experiment 2: Clustering of Values
Approach
analysis of high-frequency complex noun phrases
example: Bruch des/der … (fracture of …)
• occurrences: 749
• types: 93
  known (31): Rippe/rib (254), Brustbein/sternum (65), Wirbelsäule/spine (58), Schambein/pubic bone (30), Schulterblatt/omoplate (23), …
  unknown (62): Schädeldach/calvarium (43), Oberschenkelknochen/femur (37), Schädelbasis/base of the skull (34), Schlüsselbein/clavicle (33), Brustwirbelsäule/thoracic spine (28), Halswirbelsäule/cervical spine (26), …
Idea: Analysis of Complex Noun Phrases
pattern: keyword of complement
in corpus:
• fracture of <known>
• fracture of <unknown>
in GermaNet:
• class of <known>
deduce: class of <unknown> == class of <known>
Approach
1. known complement types of a keyword (31 complement types)
   known (31): Rippe/rib (254), Brustbein/sternum (65), Wirbelsäule/spine (58), Schambein/pubic bone (30), Schulterblatt/omoplate (23), …
2. collect all (GermaNet) senses (36 senses)
   …
   <nomen.Koerper>Finger <nomen.Koerper>=> Gliedmaße, Extremität
   <nomen.Artefakt>Finger <nomen.Artefakt>=> Computerprogramm, Programm
   <nomen.Koerper>Rippe <nomen.Koerper>=> Knochen, Gebein
   …
3. determine the most frequent top level category T
   [Chart: high-frequency top level categories (as percentage); legend: noun.body, noun.artifact, noun.quantity, noun.food; values: 3, 16.5, 75, 5.5]
   top level category: noun.body
4. remove senses which are not assigned to the preferred top level category (36 senses, 27 senses remaining)
5. collect all semantic classes from the hypernym graph for each sense (22 different semantic classes)
   … <nomen.Koerper>Rippe, <nomen.Koerper>=> Knochen, Gebein, <nomen.Koerper>=> Hornsubstanz, <nomen.Koerper>=> Körpersubstanz, <nomen.Substanz>=> Stoff1, Substanz, Materie, <nomen.Tops>=> Objekt, <nomen.Koerper>=> Hornsubstanz, <nomen.Koerper>=> Körpersubstanz, <nomen.Substanz>=> Stoff1, Substanz, Materie, <nomen.Tops>=> Objekt, …
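The sense collection and top-level filtering steps of this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `SENSES` table is a hand-written stand-in for GermaNet queries (its three entries mirror the output quoted on the slide), and the function names `preferred_category` and `filter_senses` are hypothetical.

```python
from collections import Counter

# Toy sense inventory (assumption): word -> list of (top level category, hypernym).
# A real run would obtain these senses from GermaNet instead.
SENSES = {
    "Finger": [("nomen.Koerper", "Gliedmaße"), ("nomen.Artefakt", "Computerprogramm")],
    "Rippe": [("nomen.Koerper", "Knochen")],
}


def preferred_category(senses_by_word):
    """Return the top level category that occurs most often over all senses."""
    counts = Counter(cat for senses in senses_by_word.values() for cat, _ in senses)
    return counts.most_common(1)[0][0]


def filter_senses(senses_by_word, category):
    """Keep only the senses assigned to the preferred top level category."""
    return {
        word: [sense for sense in senses if sense[0] == category]
        for word, senses in senses_by_word.items()
    }


top = preferred_category(SENSES)
kept = filter_senses(SENSES, top)
print(top)
print(kept)
```

In this toy table the artifact reading of Finger is discarded because nomen.Koerper is the most frequent category, which is the effect the slide describes when the sense count is reduced.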
Approach
collect all semantic classes from the hypernym graph (22 different semantic classes)
for each semantic class sc:
• determine the level in the hypernym tree (f_sc)
• count occurrences (n_sc)
select the maximum of (f_sc * n_sc) / N
N: number of all semantic classes
most specific semantic class: Knochen
… <nomen.Koerper>Rippe, <nomen.Koerper>=> Knochen, Gebein, <nomen.Koerper>=> Hornsubstanz, <nomen.Koerper>=> Körpersubstanz, <nomen.Substanz>=> Stoff1, Substanz, Materie, <nomen.Tops>=> Objekt, <nomen.Koerper>=> Hornsubstanz, <nomen.Koerper>=> Körpersubstanz, <nomen.Substanz>=> Stoff1, Substanz, Materie, <nomen.Tops>=> Objekt, …
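A small sketch of the selection by (f_sc * n_sc) / N may help. It rests on two assumptions that the slide does not state: the level f_sc is counted from the root downwards (so deeper classes score higher), and N is the number of distinct semantic classes. The first two paths follow the hypernym chain quoted for Rippe; the third path is shortened and purely illustrative, and the function name `most_specific_class` is hypothetical.

```python
from collections import Counter


def most_specific_class(paths):
    """Score each semantic class by level * occurrences / N and return the best one.

    paths: one hypernym path per sense, ordered from the root down to the
    direct semantic class of the word (assumed orientation).
    """
    level = {}
    count = Counter()
    for path in paths:
        for depth, sc in enumerate(path, start=1):
            level[sc] = max(level.get(sc, 0), depth)  # f_sc: deepest level seen
            count[sc] += 1                            # n_sc: occurrences
    n = len(level)                                    # N: distinct classes (assumption)
    scores = {sc: level[sc] * count[sc] / n for sc in level}
    best = max(scores, key=scores.get)
    return best, scores


# Toy input (assumption): two senses share the chain quoted for Rippe,
# the third path is invented for contrast.
PATHS = [
    ["Objekt", "Substanz", "Körpersubstanz", "Hornsubstanz", "Knochen"],
    ["Objekt", "Substanz", "Körpersubstanz", "Hornsubstanz", "Knochen"],
    ["Objekt", "Gliedmaße"],
]

best, scores = most_specific_class(PATHS)
print(best)
```

Because N is a constant, it does not change which class wins; it only normalizes the scores. Frequent but shallow classes such as Objekt lose against classes that are both deep and shared by several senses.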
Results
• 85 % correct assignments (types)
• 94 % correct assignments (tokens)
erroneous cases:
• correct assignments to wrong complements
• wrong assignments to correct complements
Results: Erroneous Cases
correct assignments to wrong complements:
• misspelled tokens: „Oberschenkelknorren“
• erroneous fragments from the handling of German truncations: „Bruch des Ober- und Unterarmes“
• erroneous syntactic analysis of the second NP: „Bruch der Wandung der …“
wrong assignments to correct complements:
• (complex) systems of bones, cartilages, connective tissues: „elbow joint“
Overview
Corpus Characteristics
Experiment 1: Context-related Derivation of Concepts
Experiment 2: Clustering of Values
Clustering of Values
conceptual analysis of linguistic structures:
• Mund geschlossen. / Mouth closed.
• Rachenschleimhaut duesterrot. / Mucosa of fauces dark red.
• Beckengeruest festgefuegt und unversehrt. / Pelvis closely joined and entire.
• Herzohren frei, ovales Vorhoffenster geschlossen. / Auricles of heart clear, oval atrium closed.
• Brustbein, Rippen und Wirbelsaeule intakt. / Sternum, ribs and spine intact.
• Brustkorb sehr schmal und leicht eindrueckbar. / Thorax very narrow and easily compressible.
• Nebennieren ohne Besonderheiten. / Adrenal glands without anomalies.
• …
1908 concepts:
• Mund / mouth
• Rachenschleimhaut / mucosa of fauces
• Beckengeruest / pelvis
• Herzohren, Vorhoffenster / auricles of heart, atrium
• Brustbein, Rippen, Wirbelsaeule / sternum, ribs, spine
• Brustkorb / thorax
• Nebennieren / adrenal glands
2098 different (linguistic) values:
• geschlossen / closed
• duesterrot / dark red
• festgefuegt, unversehrt / closely joined, entire
• frei, geschlossen / clear, closed
• intakt / intact
• sehr schmal, leicht eindrueckbar / very narrow, easily compressible
• ohne Besonderheiten / without anomalies
Do similar concepts have the same attributes?
What are the values for an attribute?
Relations Between Values
Do the values describe different attributes (color, shape etc.)?
If not, are the values
• paraphrases/synonyms?
• antonyms?
• values of an ‚open‘ range?
Which lexical or conceptual relations exist between the values, e.g. synonyms, antonyms etc.?
clustering of values
Examples
Mund/mouth:
• deutlich geoeffnet
• fischmaulartig geoeffnet
• schlotartig geoeffnet
• ruesselartig geoeffnet
• froschmaulartig geoeffnet
• ovalaer geoeffnet
• geoeffnet
• spaltfoermig geoeffnet
• geschlossen
different kinds of 'opened' vs. closed
Examples
Milzgewebe/spleen tissue:
• nicht sehr blutreich
• fest
• deutlich gelockert
• stark gelockert
• relativ gelockert
• verhaertet
• gelockert
• leicht gelockert
• blutreich
• sehr blutarm
• faeulnisbedingt gelockert
• etwas faeulnisbedingt aufgelockert
• sehr blutreich
attributes described by these values:
• concentration of blood
• consistency, form of tissue
Examples
Wirbelsaeule/spine:
• ebenfalls unversehrt
• ebenfalls intakt
• intakt
• unversehrt
• ohne Besonderheiten
• ohne Verletzungen
same findings
Approach
comparison of values of a concept: 33670 comparisons
comparison in several steps:
1. character-based: via bigrams
2. lexical-conceptual relations: available information in GermaNet
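The character-based step can be illustrated with a short sketch. The slides do not name the similarity measure, so the choice here is an assumption: a Dice coefficient over the distinct character bigrams of two values (spaces included). It does, however, reproduce several of the scores listed on the later slide "Results with GermaNet", for example 0.4286 for blutarm vs. blutreich. The function names `bigrams` and `dice` are hypothetical.

```python
def bigrams(value):
    """Return the set of distinct character bigrams of a value (spaces included)."""
    return {value[i:i + 2] for i in range(len(value) - 1)}


def dice(a, b):
    """Dice coefficient over the bigram sets of two values (assumed measure)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


print(round(dice("blutarm", "blutreich"), 4))              # 0.4286
print(round(dice("sehr gross", "sehr weit"), 4))           # 0.4706
print(round(dice("muskelkraeftig", "muskelschwach"), 4))   # 0.4167
print(round(dice("feucht", "sehr trocken"), 4))            # 0.0
```

The last pair shows why a second, lexical step is needed: feucht and sehr trocken share no bigram at all, although they are clearly related.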
Approach
values of a concept
1. removing negations
   negations: 'kein', 'nicht', …
2. removing modifiers
   particles: sehr, sonst, ebenfalls
   adjectives with suffixes: ‚-artig‘, ‚-lich‘, ‚-ig‘
   example: 'sonst unauffällig' → 'unauffällig'
3. 'corrected' values
4. comparison of the 'corrected' values:
   bigrams of values
   compound?
   lexical/conceptual relations in GermaNet?
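The normalization into 'corrected' values could look roughly like this. It is a sketch under assumptions: the negation list on the slide is elided, so only the two listed forms are used; suffix-based modifiers are only stripped when they precede the last token, so that a head such as unauffaellig survives; and the function name `normalize` is hypothetical.

```python
# Word lists taken from the slide; the negation list is elided there ("…"),
# so only the two forms that are actually given are used here.
NEGATIONS = {"kein", "nicht"}
PARTICLES = {"sehr", "sonst", "ebenfalls"}
MODIFIER_SUFFIXES = ("artig", "lich", "ig")


def normalize(value):
    """Strip negations, particles and suffix-marked modifiers from a value."""
    tokens = [t for t in value.lower().split()
              if t not in NEGATIONS and t not in PARTICLES]
    if not tokens:
        return ""
    # Assumption: the last token is the head and is never removed.
    *modifiers, head = tokens
    kept = [t for t in modifiers if not t.endswith(MODIFIER_SUFFIXES)]
    return " ".join(kept + [head])


print(normalize("sonst unauffaellig"))
print(normalize("nicht sehr blutreich"))
print(normalize("froschmaulartig geoeffnet"))
```

Keeping the head token untouched is a design choice of this sketch: without it, the suffix rule for ‚-ig‘ would also delete values like unauffaellig themselves.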
Results: Character-based Analysis
similar values with modifications (particles) and negations
clusters:
• selbst unauffaellig, sonst unauffaellig, unauffaellig
• glaenzend, nicht glaenzend
• geoeffnet, leicht geoeffnet, rundlich geoeffnet, spaltfoermig geoeffnet, spaltweit geoeffnet, froschmaulartig geoeffnet, … geoeffnet
• sehr muskelkraeftig, nicht sehr muskelstark, muskelkraeftig, nicht sehr muskelkraeftig, nicht muskelkraeftig
• blutreich, nicht-sehr-blutreich, sehr-blutreich
• blutarm, relativ-blutarm
• muskelschwach, sehr-muskelschwach
• geschlossen, spaltfoermig-geschlossen
Integration of GermaNet
search for relations between
• two tokens
• parts of tokens
queries about:
• coordinate terms
• synonyms, hypernyms, hyponyms
• antonyms
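Searching for relations between parts of tokens can be sketched with a toy lookup. Everything here is illustrative: the `ANTONYMS` table stands in for real GermaNet queries and contains only the two antonym pairs named on the results slide (arm vs. reich, kraeftig vs. schwach), and splitting a token into its suffixes is an assumed, simplistic way of finding candidate parts. The function names are hypothetical.

```python
# Toy relation table (assumption): replaces antonym queries against GermaNet.
ANTONYMS = {frozenset({"arm", "reich"}), frozenset({"kraeftig", "schwach"})}


def suffix_parts(token, min_len=3):
    """All suffixes of a token with at least min_len characters, longest first."""
    return [token[i:] for i in range(len(token) - min_len + 1)]


def find_antonym_parts(a, b):
    """Return the first pair of token parts that are antonyms, or None."""
    for part_a in suffix_parts(a):
        for part_b in suffix_parts(b):
            if frozenset({part_a, part_b}) in ANTONYMS:
                return part_a, part_b
    return None


print(find_antonym_parts("blutarm", "blutreich"))
print(find_antonym_parts("muskelkraeftig", "muskelschwach"))
print(find_antonym_parts("glaenzend", "geoeffnet"))
```

Since the whole token is its own longest suffix, the same function also covers the case of a relation between two complete tokens.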
Results with GermaNet
• sehr muskelkraeftig/very strong muscle vs. sehr muskelschwach/very weak muscle
  bigrams: 0.5882, 0.4167; GermaNet: antonym: kraeftig vs. schwach
• blutarm/bloodless vs. blutreich/bloodrich
  bigrams: 0.4286; GermaNet: antonym: arm vs. reich
• feucht/wet vs. sehr trocken/very dry
  bigrams: 0.0000; GermaNet: coordinate terms, antonym
• sehr gross/very great vs. sehr weit/very broad
  bigrams: 0.4706; GermaNet: hypernym
• frei/free vs. größtenteils vorhanden/mostly existent
  bigrams: 0.0833; GermaNet: coordinate terms
• keine Schwellung/no swelling vs. keine Verletzung/no trauma
  bigrams: 0.42, 0.4; GermaNet: hypernym
Results: Character-based + GermaNet
clusters:
• selbst unauffaellig, sonst unauffaellig, unauffaellig
• glaenzend, nicht glaenzend
• blutreich, nicht-sehr-blutreich, sehr-blutreich, blutarm, relativ-blutarm
• sehr muskelkraeftig, nicht sehr muskelstark, muskelkraeftig, nicht sehr muskelkraeftig, nicht muskelkraeftig, muskelschwach, sehr-muskelschwach
• geoeffnet, leicht geoeffnet, rundlich geoeffnet, spaltfoermig geoeffnet, spaltweit geoeffnet, froschmaulartig geoeffnet, … geoeffnet, geschlossen, spaltfoermig-geschlossen
Problem: Paraphrases
Wirbelsaeule/spine:
• intakt
• unversehrt
• ohne Besonderheiten
• ohne Verletzungen
same findings
future work
Idea: Detection of Paraphrases/Synonyms
document information + corpus information
• analyse the value sets of a document
• compare the value sets of a concept described in different documents
• values which are synonyms or antonyms do not occur together in one document
Example: Spine closely joined and entire.
closely joined, entire: different attributes
Idea: Detection of Paraphrases/Synonyms
collect all values for a concept: candidates
values for the concept 'spine' in the autopsy protocols AP#1, AP#2, AP#3, AP#4, AP#5, …, AP#n (one value set per protocol):
• entire, closely joined
• entire, closely joined
• …
• broken
• intact
• intact
• entire
candidates: intact == broken == entire/closely joined == entire ?
Idea: Detection of Paraphrases/Synonyms
[Chart: values intact, bleedings, closely joined, entire, without anomalies, without bleedings, without pathological findings; axis from 0 to 180]
removing of candidates:
• only one paraphrase
• bleedings or without bleedings: antonyms
• closely joined vs. entire: occur in the same document (for a concept)
  prefer: entire (number of occurrences)
  assumption: closely joined is an 'additional' attribute
selection of candidates (restrictions):
• only frequent values
• similar number of occurrences?
verification of results:
• obtain value sets of other concepts which have similar values
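The candidate selection and removal steps can be sketched as a filter over per-document value sets. This is only an illustration of the idea: the `PROTOCOLS` data is invented (the values come from the 'spine' example, but their assignment to protocol IDs is made up), the frequency threshold `min_count` is an assumed parameter, and the function name `paraphrase_candidates` is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Toy data (assumption): values recorded for the concept 'spine' per protocol.
PROTOCOLS = {
    "AP#1": {"entire", "closely joined"},
    "AP#2": {"entire", "closely joined"},
    "AP#3": {"broken"},
    "AP#4": {"intact"},
    "AP#5": {"intact"},
    "AP#6": {"entire"},
}


def paraphrase_candidates(protocols, min_count=2):
    """Pair up frequent values that never occur together in one document."""
    freq = Counter(v for values in protocols.values() for v in values)
    frequent = sorted(v for v, n in freq.items() if n >= min_count)
    cooccurring = {
        frozenset(pair)
        for values in protocols.values()
        for pair in combinations(sorted(values), 2)
    }
    return [
        (a, b)
        for a, b in combinations(frequent, 2)
        if frozenset((a, b)) not in cooccurring
    ]


print(paraphrase_candidates(PROTOCOLS))
```

With the toy data, entire and closely joined are rejected as paraphrases because they co-occur, while both still pair with intact. A filter of this kind cannot tell paraphrases from antonyms: a frequent value such as broken would pass the same test against intact.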
Problems: Detection of Paraphrases
• a value can be expressed by more than one value: 'value 1' == 'value 2' + 'value 3'
• result (set of paraphrases for a value) can contain antonyms
Detection of Paraphrases/Synonyms
solutions?
• integration of other resources: UMLS
• extension of GermaNet
1 sense of unversehrt
Sense 1
<adj.Koerper>unverletzt, unversehrt
<adj.Koerper>=> heil
<adj.Koerper>=> gesund
<adj.Koerper>=> ?krankheitsspezifisch
<adj.Koerper>=> ?körperzustandsspezifisch
<adj.Koerper>=> ?körperspezifisch
1 sense of intakt
Sense 1
<adj.Relation>intakt, ganz1, funktionstüchtig, funktionsfähig
<adj.Relation>=> ?funktionalitätsspezifisch
<adj.Relation>=> ?relationsspezifisch
same meaning?
Conclusion
experiments about corpus-based semiautomatic extension of GermaNet
• analysis of complex noun phrases: detection and transfer of GermaNet classes
• clustering of values: bigrams, using GermaNet information