IMS Universität Stuttgart
Off-line (and On-line) Text Analysis for Computational Lexicography
Hannah Kermes
Introduction
• Motivation
  • computational lexicography
  • corpus linguistics
• Approaches to text analysis
  • symbolic vs. probabilistic approaches
  • hand-written vs. learned
  • on-line queries vs. chunking vs. full parsing
• Requirements
  • for the extraction tool
  • for the corpus annotation
• classical chunking
Motivation
• maintenance of consistency and completeness within lexica → computer-assisted methods
• lexical engineering → scalable lexicographic work process; processes reproducible on large amounts of text
• statistical tools (PoS tagging etc.) and traditional chunkers do not provide enough information for corpus linguistic research
• full parsers are not robust enough
→ need for analysis tools that meet the specific needs of corpus linguistic studies
Dictionaries
• for human use
  • printed monolingual dictionaries
  • electronic dictionaries
• machine readable dictionaries for NLP applications
Printed monolingual dictionaries
• intend to cover the most important semantic and syntactic aspects
• maintenance of consistency and completeness is a problem:
  • information is missing
  • entries are incomplete
  • information is not consistent
  • language changes have to be covered
Electronic dictionaries
• enormous amounts of information can be stored in a compact format
• search engines allow for easy and fast access to desired data
• users can choose how much and what kind of information they are interested in
• reference corpus as additional knowledge source
Machine readable dictionaries
• NLP applications need detailed and consistent information about words
  • detailed morphological information
  • subcategorization frames of verbs, adjectives, nouns
  • specific syntactic information
  • selectional preferences
  • collocations
  • idiomatic usage
Information needed
• syntactic information
  • subcategorization patterns
• semantic information
  • selectional preferences, collocations
  • synonyms
  • multi-word units
  • lexical classes
• morphological information
  • case, number, gender
  • compounding and derivation
Requirements for the tool
• it has to work on unrestricted text
• shortcomings in the grammar should not lead to a complete failure to parse
• no manual checking should be required
• should provide a clearly defined interface
• annotation should follow linguistic standards
Requirements for the annotation
• head lemma
• morpho-syntactic information
• lexical-semantic information
• structural and textual information
• hierarchical representation
A corpus linguistic approach
Hypothesis
The better and more detailed the off-line annotation, the better and faster the on-line extraction.
However, the more detailed the off-line annotation, the more complex the grammar, the more time-consuming and difficult the grammar development, and the slower the parsing process.
Three different dimensions
• type of grammar
  • symbolic grammar
  • probabilistic grammar
• type of grammar development
  • hand-written grammar
  • learning methods
• depth of analysis
  • analysis on token level only
  • full parsing
  • partial parsing
Symbolic approaches
+ precise rules can be formulated
+ lexical knowledge can be included
+ results can be predicted and controlled
- sometimes not sufficient to solve ambiguities
- only phenomena which are explicit in the grammar can be dealt with
Unification-based grammars
• usually complex grammars
• model the hierarchical structure of language
• handle attachment ambiguities
• determine relations among constituents and their grammatical function
• extensive use of lexical information
• richness and complexity of rules do not only solve ambiguities, but produce them as well
• usually a large number of possible analyses
Context-free Grammars (CFG)
• formal grammars consisting of a set of recursive rewriting rules
• small and modular grammar
• minimal interaction among rules
• parsing process usually fast
• covers only basic aspects of language
• robustness rules are used to overcome shortcomings in the grammar
Probabilistic approaches
+ supervised or unsupervised training of rules
+ all possible analyses are produced
+ no need for comprehensive lexical or linguistic knowledge
+ rules can be left underspecified
- depend on the training corpus
- high-frequency phenomena are preferred over low-frequency phenomena
Probabilistic context-free grammar
• CFG rules enriched by probability
• make use of underspecification
• not as fast as CFG
• special case: head-lexicalized context-free grammar
  • unsupervised
  • grammar rules are indexed by the lemma of the syntactic head
  • extraction is performed on the rule set rather than on the annotated corpus
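
A minimal sketch of what "CFG rules enriched by probability" can look like: each rule carries a probability, the probabilities of all rules with the same left-hand side sum to 1, and a derivation is scored by the product of its rule probabilities. The rules and numbers below are invented for illustration only; they are not taken from any grammar mentioned in this talk.

```python
from math import prod

# Hypothetical toy grammar: (left-hand side, right-hand side) -> probability.
# For each left-hand side the probabilities sum to 1.
RULE_PROB = {
    ("NP", ("Det", "N")): 0.6,
    ("NP", ("N",)): 0.4,
    ("PP", ("P", "NP")): 1.0,
}


def derivation_probability(rules):
    """Score a derivation as the product of the probabilities of its rules."""
    return prod(RULE_PROB[rule] for rule in rules)


# A PP built from a preposition and a determiner + noun NP.
pp_derivation = [("PP", ("P", "NP")), ("NP", ("Det", "N"))]
score = derivation_probability(pp_derivation)
```

When several analyses compete, a probabilistic parser would typically prefer the one with the highest score.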
Hand-written rules
+ good control of the rule system
+ negative evidence can be taken into account
- depends heavily on the expertise of the grammar writer
Learning grammar rules
+ infer grammar from text corpora
+ extensional syntactic descriptions (annotations) are turned into intensional descriptions (rules)
+ optimal or suboptimal training data
+ new resources in the form of text corpora can be exploited
+ more or less independent of the knowledge of the grammar developer
- depends heavily on the learning corpus
- needs an annotated, well-balanced corpus
Memory-based learning
• special case of learning
• most prominent is data-oriented parsing (DOP)
  • fragments are stored and as such replace the grammar
  • language generation and analysis is performed by combining the memorized fragments
  • needs structurally annotated corpus
  • the training corpus has great impact on the performance of the system
  • highly sensitive to suboptimal data
  • needs large storage capacity
Annotation on token level
+ usually a form of pattern matching
+ completely flexible
+ does not depend on previous syntactic analysis
+ easily adaptable to different text types
- full syntactic analysis has to be performed by extraction queries
- queries can become rather complex
- often restricted to simple contexts
Full Parsing
+ provides rich and detailed information about structures, relations and functions
+ extraction queries simply have to collect the annotated information
- slow parsing speed
- lack of robustness
- depends heavily on prerequisite lexical information
- ambiguous output
Chunking
+ relatively simple grammar rules
+ no need for extensive linguistic and lexicographic information
+ robust
- usually non-hierarchical and non-recursive structures
- annotated structures are simple and convey less information
Classical chunk definition
• Abney 1991: The typical chunk consists of a single content word surrounded by a constellation of function words, matching a fixed template
• Abney 1996: a non-recursive core of an intra-clausal constituent, extending from the beginning of the constituent to its head
State-of-the-art systems
• CASS parser
  • finite-state cascades
  • flat, non-recursive structures
  • small lexicon (tag-fixes)
  • information about the head is given as an attribute
• Conexor
  • symbolic constraint grammar parser
  • full-fledged grammar for English (ENGCG)
  • German:
    • simple, non-recursive structure
    • no lexical information available
    • head lemma indicated by a special tag
State-of-the-art systems
• KaRoParse
  • top-down bottom-up parser
  • includes recursion
  • internal structure is flat and non-hierarchical
  • no agreement or lexical information
• Schiehlen's chunker
  • symbolic context-free grammar
  • recursion
  • no head lemma or lexical-semantic information
  • needs optimally tokenized text (including MWL recognition)
State-of-the-art systems
• Chunkie
  • uses TnT-tagger to assign tree fragments to sequences of PoS-tags
  • recursion in pre-head position (maximal depth of three)
  • head lemma information, yet no agreement or lexical information
• Cascaded Markov Models
  • stochastic context-free grammar rules
  • several layers, each layer serving as input to the next
  • hierarchical phrases, including complex recursion
  • head lemma information, yet no agreement or lexical information
Problems for extraction
• Kübler and Hinrichs (2001): "… focused on the recognition of partial constituent structures at the level of individual chunks […], little or no attention has been paid to the question of how such partial analysis can be combined into larger structures for complete utterances."
An example
1. [PC mit kleinen ], [PC über die Köpfe ] [NC der Apostel ] [NC gesetzten Flammen ]
   with small above the heads the apostles set flames
2. [PP mit [NP [AP kleinen ], [AP über [NP die Köpfe [NP der Apostel ] ] gesetzten ] Flammen ] ]
   with small above the heads the apostles set flames
'with small flames set above the heads of the apostles'
Problems for extraction
• four NCs instead of only one NP
• AN-pair:
  + gesetzten + Flammen
  - kleine + Flammen
• NN-pair Köpfe + Apostel needs agreement information
• VN-pair setzen + Flammen needs information about the deverbal character of gesetzten
→ a more complex analysis is needed
→ PCs and NCs need to be combined
Simple solution
PP → PC (PC|NC)*
• theoretical motivation?
• rule covers this particular example, other examples might need additional rules
• rule is vague and largely underspecified → not very reliable
• internal structure is mainly left opaque
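
To make the weakness of the simple rule concrete, here is a minimal sketch that applies PP → PC (PC|NC)* to a flat sequence of chunk labels. The function name and the list-of-labels representation are assumptions made for this illustration; the point is only that the rule accepts any run of PCs and NCs after an initial PC and records nothing about how they relate to each other.

```python
def match_simple_pp(labels, start=0):
    """Return the end index of a PC (PC|NC)* match at `start`, or None."""
    if start >= len(labels) or labels[start] != "PC":
        return None
    end = start + 1
    # Greedily absorb every following PC or NC, whatever its internal role.
    while end < len(labels) and labels[end] in {"PC", "NC"}:
        end += 1
    return end


# The four flat chunks of the example 'mit kleinen, über die Köpfe der Apostel gesetzten Flammen'.
example = ["PC", "PC", "NC", "NC"]
end = match_simple_pp(example)   # the whole sequence is covered
```

The match tells us where the PP ends, but not that the second PC and the first NC belong inside an adjectival phrase, which is exactly the information an extraction query would need.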
Complex solution
1. NP → NC NCgen
2. PP → preposition NP
3. AP → PP adjective
4. NP → AP* noun
Complex solution
• solution for this particular example only
• large number of rules needed
• rules have to be repeated for every instance of a complex phrase
→ in order to support extractions, the classic chunk concept has to be extended
Conclusion
Chunking
• flat non-recursive structures
• simple grammar
• robust and efficient
• non-ambiguous output
Full Parsing
• full hierarchical representation
• complex grammar
• not very robust
• ambiguous output
YAC
Conclusion
• recursive chunking: a workable compromise between depth of analysis and robustness
• extracted data show correlation between
  • collocational preference
  • subcategorization frames
  • semantic classes of adjectives
  • to a certain extent distributional preferences
General Concept
• a recursive chunker for unrestricted German text
• technical framework
  • CWB
  • CQP
  • output formats
  • advantages of the architecture
• general framework of YAC
  • linguistic coverage
  • feature annotation
  • chunking process
A recursive chunker for unrestricted German text
• recursive chunker for unrestricted German text
• fully automatic analysis
• main goal: provide a useful basis for extraction of linguistic as well as lexicographic information from corpora
General aspects
• based on a symbolic regular expression grammar
• grammar rules written in CQP
• basis:
  • tokenization
  • PoS-tagging
  • lemmatization
  • agreement information
• resources named on the slide: Tree Tagger, IMSLex
A typical chunker
• robust: works on unrestricted text
• works fully automatically
• does not provide full but partial analysis of text
• no highly ambiguous attachment decisions are made
YAC goes beyond
• extends the chunk definition of Abney
  1. recursive embedding
  2. post-head embedding
• provides additional information about annotated chunks
  1. head lemma
  2. agreement information
  3. lexical-semantic and structural properties
Extended chunk definition
A chunk is a continuous part of an intra-clausal constituent including recursion and pre-head as well as post-head modifiers, but no PP-attachment or sentential elements.
Technical Framework
[Diagram components: corpus; grammar rules; lexicon; Perl-Scripts; rule application; post-processing; annotation of results]
Technical framework - CQP
• regular expression matching on token and annotation strings
• tests for membership in user-specific word lists
• feature set operations
• constraints to specify dependencies
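
The first two capabilities can be illustrated with a small Python sketch. This is not CQP syntax and not how CQP is implemented; it only mimics the idea of matching a sequence of per-token constraints, where each constraint is a regular expression on an annotation or a membership test against a word list. The token dictionaries, the STTS-style tags, and the word list are assumptions chosen for the example.

```python
import re


def token_matches(token, constraint):
    """True if every (attribute, test) pair of the constraint holds for the token.

    A test is either a regex string (matched against the whole value)
    or a set of strings (membership in a user-supplied word list).
    """
    for attribute, test in constraint.items():
        value = token[attribute]
        if isinstance(test, (set, frozenset)):
            if value not in test:
                return False
        elif re.fullmatch(test, value) is None:
            return False
    return True


def find_matches(tokens, pattern):
    """Return (start, end) spans where consecutive tokens satisfy the pattern."""
    width = len(pattern)
    return [
        (start, start + width)
        for start in range(len(tokens) - width + 1)
        if all(token_matches(t, c) for t, c in zip(tokens[start:start + width], pattern))
    ]


tokens = [
    {"word": "eine", "pos": "ART", "lemma": "eine"},
    {"word": "verständliche", "pos": "ADJA", "lemma": "verständlich"},
    {"word": "Sprache", "pos": "NN", "lemma": "Sprache"},
]
# Attributive adjective followed by a common or proper noun.
adj_noun = [{"pos": "ADJA"}, {"pos": "N[NE]"}]
# Same pattern, but the noun lemma must be in a (hypothetical) word list.
noun_list = {"Sprache", "Flamme"}
adj_listed_noun = [{"pos": "ADJA"}, {"pos": "NN", "lemma": noun_list}]
```

Calling `find_matches(tokens, adj_noun)` returns the span of "verständliche Sprache"; swapping in a different word list changes the hits without touching the matching code.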
Perl-Scripts
• invocation of CQP
• processing of the results
• annotation of the results into the corpus
Postprocessing
• values can be checked
• values can be changed
• values can be compared
• range of structures can be changed
Output formats
• CQP format, used for:
  • interactive grammar development
  • parsing
  • extraction
• an XML format, used for:
  • hierarchy building
  • extraction
  • data exchange
Advantages of the system
• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules
Linguistic coverage
• Adverbial phrases (AdvP)
a) schön stark (beautifully strong)
b) daher (from there); irgendwoher (from anywhere)
c) heim (home); querfeldein (cross-country)
d) innen (inside); überall (everywhere)
e) "sehr bald" (very soon)
f) jetzt (now); damals (at that time)
Linguistic coverage
• Adjectival phrases (AP)
a) möglich (possible)
b) schreiend lila (screamingly purple)
c) rund zwei Meter hohe
   around two meter high
d) über die Köpfe der Apostel gesetzten
   above the heads of the apostles set
   'set above the heads of the apostles'
Linguistic coverage
• Noun phrases (NP)
a) Oktober (October); er (he)
b) 4,9 Milliarden Euro
   4.9 billion Euros
c) "Frankensteins Fluch"
   "Frankenstein's curse"
d) kleine, über die Köpfe der Apostel gesetzten Flammen
   small, above the heads of the apostles set flames
   'small flames set above the heads of the apostles'
Linguistic coverage
• Prepositional phrases (PP)
a) davon (thereof)
b) zwischen Basel und St. Moritz
   between Basel and St. Moritz
c) mit kleinen, über die Köpfe der Apostel gesetzten Flammen
   with small, above the heads of the apostles set flames
   'with small flames set above the heads of the apostles'
Linguistic coverage
• Verbal complexes (VC)
a) gemunkelt (rumored)
b) muß gerechnet werden
   has counted to be
   'has to be counted'
c) zu bekommen
   to get
d) bekommen zu haben
   gotten to have
   'to have gotten'
Linguistic coverage
• Clauses (CL)
a) … , daß selbst Ravel sich amüsiert hätte.
   … , that even Ravel himself enjoyed had.
   '… , that even Ravel would have enjoyed.'
b) … , die man in der griechischen Tragödie findet.
   … , which one in the Greek tragedy finds.
   '… , which one finds in the Greek tragedy.'
Linguistic coverage
• Clauses (CL)
a) … , Instrumente selbst zu bauen.
   … , instruments oneself to build.
   '… , to build instruments oneself.'
b) … , um einen Kaffee zu trinken.
   … , in order a coffee to drink.
   '… , in order to drink a coffee.'
Feature annotation
• head lemma
• morpho-syntactic information
• lexical-semantic properties
Feature annotation
feature value        AdvP  AP  NP  PP  VC  CL
lexical-semantic     X     X   X   X   X   X
head lemma           X     X   X   X   X   X
agreement info       X X X  [three categories marked; column positions not recoverable]
verbal head lemma    X      [one category marked; column position not recoverable]
Head lemma
• lemma attribute at the head position
• normally a single token
• multi-word proper nouns have a multi-token head lemma
• a separated verbal prefix is included in the head lemma of the VC
  kommt … an → ankommen (arrive)
• head lemma of PP: preposition:noun
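
The two composite cases can be sketched in a few lines. The function names are hypothetical and the sketch assumes that the lemmas of the relevant tokens are already known; it only shows how the pieces are put together.

```python
def vc_head_lemma(verb_lemma, separated_prefix=None):
    """Head lemma of a verbal complex; a separated prefix is re-attached."""
    return (separated_prefix or "") + verb_lemma


def pp_head_lemma(preposition_lemma, noun_lemma):
    """Head lemma of a PP in the form preposition:noun."""
    return f"{preposition_lemma}:{noun_lemma}"


arrive = vc_head_lemma("kommen", separated_prefix="an")   # kommt … an
plain = vc_head_lemma("bekommen")                         # no separated prefix
over_heads = pp_head_lemma("über", "Kopf")                # über die Köpfe
```

Keeping the preposition in the PP head lemma means that extraction queries can group PPs by preposition and noun at once.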
Morpho-syntactic information
• intersection of the morpho-syntactic information of relevant elements
• invariant elements are not considered
• no guessing involved to solve ambiguities
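
A minimal sketch of the intersection step, using the noun phrase "den vierten Platz" and the |-delimited Case:Gender:Number:Definiteness notation of the agreement annotation. The helper names are assumptions, and the feature list for "vierten" is shortened here to keep the example readable.

```python
def parse_agreement(value):
    """Split a |-delimited feature string into a set of feature bundles."""
    return {item for item in value.split("|") if item}


def intersect_agreement(*feature_strings):
    """Intersect the feature sets of all relevant elements (at least one)."""
    sets = [parse_agreement(s) for s in feature_strings]
    return set.intersection(*sets)


den = "|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|"
# Shortened: the adjective form is compatible with many more readings.
vierten = "|Akk:M:Sg:Def|Akk:M:Sg:Ind|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|Dat:M:Sg:Def|"
platz = "|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|"

np_agreement = intersect_agreement(den, vierten, platz)
```

Only the reading shared by determiner, adjective, and noun survives; if several readings were shared, all of them would be kept, which matches the "no guessing" principle.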
Agreement Information
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|Akk:M:Pl:Ind|
Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Gen:F:Pl:Def|Gen:F:Pl:Ind|Gen:F:Sg:Def|Gen:F:Sg:Ind|Gen:M:Pl:Def|Gen:M:Pl:Ind|Gen:M:Sg:Def|Gen:M:Sg:Ind|Gen:M:Sg:Nil|Gen:N:Pl:Def|Gen:N:Pl:Ind|Gen:N:Sg:Def|Gen:N:Sg:Ind|Gen:N:Sg:Nil|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>
Agreement Information
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|
Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|Nom:F:Pl:Def|Nom:F:Pl:Ind|Nom:M:Pl:Def|Nom:M:Pl:Ind|Nom:N:Pl:Def|Nom:N:Pl:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|Nom:M:Sg:Def|Nom:M:Sg:Ind|Nom:M:Sg:Nil|>Platz</nc_agr>
Agreement Information
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|
Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>
Agreement Information
<np_agr |Akk:M:Sg:Def|>
den/|Akk:M:Sg:Def|Dat:F:Pl:Def|Dat:M:Pl:Def|Dat:N:Pl:Def|
<ap_agr |Akk:F:Pl:Def|Akk:F:Pl:Ind|Akk:M:Pl:Def|
Akk:M:Pl:Ind|Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Akk:N:Pl:Def|Akk:N:Pl:Ind|Dat:F:Pl:Def|Dat:F:Pl:Ind|Dat:F:Pl:Nil|Dat:F:Sg:Def|Dat:F:Sg:Ind|Dat:M:Pl:Def|Dat:M:Pl:Ind|Dat:M:Pl:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:N:Pl:Def|Dat:N:Pl:Ind|Dat:N:Pl:Nil|Dat:N:Sg:Def|Dat:N:Sg:Ind|>vierten</ap_agr>
<nc_agr |Akk:M:Sg:Def|Akk:M:Sg:Ind|Akk:M:Sg:Nil|Dat:M:Sg:Def|Dat:M:Sg:Ind|Dat:M:Sg:Nil|>Platz</nc_agr>
</np_agr>
<np_agr |Akk:M:Sg:Def|>
Lexical-semantic properties
• important for parsing as well as for extraction
• properties can be triggers for specific internal structures, functions, and usages
• properties inherent in the corpus
  • PoS-tags
    Johann Sebastian Bach
    NE NE NE
  • text markers
    "Wilhelm Meisters Lehrjahre"
    NE NN NN
Lexical-semantic properties
• properties determined by external knowledge sources (lexica, ontologies, word lists)
  • locality:
    hier (here); dort (there); Stuttgart
  • temporality:
    Jahr (year); damals (at that time)
  • derivation:
    gesetzten (set) → deverbal adjective
Lexical-semantic properties
• structural information
  • complex embeddings
    [AP [PP über die Köpfe der Apostel ] gesetzten ]
    above the heads of the apostles set
    'set above the heads of the apostles'
    [AP [NP der "Inkatha"-Partei ] angehörenden ]
    to the Inkatha-party belonging
    'belonging to the Inkatha-party'
Some properties of NPs
card cardinal noun
meas measure noun
ne named entity
quot NP in quotation marks
street street address
temp temporal noun
date date
pron pronominal NP
Other lexical-semantic properties
• VC with separated prefix: pref
  Er kommt an (he arrives)
• PP with contracted preposition and article: fus
  am Bahnhof (at the station)
• complex APs embedding PPs: pp
  über die Köpfe der Apostel gesetzten
  above the heads of the apostles set
  'set above the heads of the apostles'
• AP with deverbal adjectives: vder
Chunking process
[Diagram: the corpus is processed by the First Level, the Second Level, and the Third Level in turn; a Lexicon is shown as an additional resource]
First level
• basic (non-recursive) chunks
• chunks with specific internal structure
  a) Ende September (end of September)
  b) Jahre später (years later)
  c) 21. Juli 2003
  d) Johann Sebastian Bach
• lexical information is introduced
  • within the rules themselves
  • within the Perl-scripts
Advantages
• specific rules do not interact with main parsing rules
• additional (e.g. domain-specific) rules can be included easily
• main parsing rules can be kept simple
• number of main parsing rules can be kept small
Second level
• main parsing level
• relatively simple and general rules
  a) AP → AdvP? (PP|NP)* AC
  b) NP → Determiner? Cardinal? AP* NC
  c) PP → Preposition (NP|AdvP)
• complex (recursive) structures are built in several iterations
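
How three simple rules can yield deeply embedded phrases when applied repeatedly can be sketched as follows. This is a chart-style illustration written for this transcript, not YAC's CQP implementation: every rule application adds a labelled span, later passes may build on spans found earlier, and the loop stops when a pass adds nothing new. The rule encoding and function names are assumptions.

```python
from collections import defaultdict

# Each rule: (result label, [(allowed labels, quantifier), ...]);
# quantifier is "1" (exactly one), "?" (optional) or "*" (zero or more).
RULES = [
    ("AP", [({"AdvP"}, "?"), ({"PP", "NP"}, "*"), ({"AC"}, "1")]),
    ("NP", [({"Determiner"}, "?"), ({"Cardinal"}, "?"), ({"AP"}, "*"), ({"NC"}, "1")]),
    ("PP", [({"Preposition"}, "1"), ({"NP", "AdvP"}, "1")]),
]


def ends(elements, pos, by_start):
    """Yield every position where `elements` can finish when started at `pos`."""
    if not elements:
        yield pos
        return
    (labels, quantifier), rest = elements[0], elements[1:]
    if quantifier in "?*":                      # the element may be skipped
        yield from ends(rest, pos, by_start)
    for label, end in by_start[pos]:
        if label in labels:
            # "*" may match again; "1" and "?" move on to the next element
            following = elements if quantifier == "*" else rest
            yield from ends(following, end, by_start)


def chunk(labels):
    """Apply RULES to a label sequence until no new spans are found."""
    spans = {(label, i, i + 1) for i, label in enumerate(labels)}
    changed = True
    while changed:                              # one pass = one iteration
        changed = False
        by_start = defaultdict(list)
        for label, start, end in spans:
            by_start[start].append((label, end))
        for result, elements in RULES:
            for start in range(len(labels)):
                for end in ends(elements, start, by_start):
                    if (result, start, end) not in spans:
                        spans.add((result, start, end))
                        changed = True
    return spans


# eine für den Anwender verständliche Sprache
sequence = ["Determiner", "Preposition", "Determiner", "NC", "AC", "NC"]
spans = chunk(sequence)
```

For this input the result contains, among others, the PP "für den Anwender", the AP "für den Anwender verständliche", and the NP covering the whole sequence. The sketch deliberately overgenerates (smaller and competing spans are kept as well), so a later selection step is still needed to pick the intended structure.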
Rule blocks
Second level
• complexity of phrases is achieved by the embedding of complex structures rather than by complex rules
a) [NP eine [AP verständliche ] Sprache ]
   an understandable language
b) [NP eine [AP für den Anwender verständliche ] Sprache ]
   a for the user understandable language
   'a language understandable for the user'
Second level
a) [PP auf [NP dem Giebel ] ]
   on top of the gable
b) [PP auf [NP dem westwärts gerichteten Giebel des heute im barocken Gewande erscheinenden Gotteshauses ] ]
   on top of the westwards pointed gable of the today in baroque garment appearing Lord's house
Third level
• chunks of related but different categories can be subsumed under one category
  • NPs with determiner (NP)
  • NPs without determiner (NCC)
  • base noun chunks (NC)
  → NP
• coordination of maximal chunks
• decisions are made which need full recursive chunks
  • adverbially and predicatively used adjectives can only be differentiated by the actual usage
    adverbially used AP → AdvP
Hierarchy building
• resulting structures of all parsing stages are collected and stored in XML-files
• after the parsing process collected structures are combined into a hierarchical structure
• only the largest instance of a structure (sharing the same head) is taken into account
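
The last two points can be sketched as a small selection-and-nesting step. The tuple layout (label, start, end, head position) and the function names are assumptions for this illustration, and the sketch assumes that the surviving chunks nest properly, that is, they never cross.

```python
from collections import defaultdict


def largest_per_head(chunks):
    """Keep only the longest chunk for each (label, head position) pair."""
    best = {}
    for label, start, end, head in chunks:
        key = (label, head)
        if key not in best or end - start > best[key][2] - best[key][1]:
            best[key] = (label, start, end, head)
    # Outer chunks first: earlier start, then longer span.
    return sorted(best.values(), key=lambda c: (c[1], -c[2]))


def bracket(tokens, chunks):
    """Render properly nested chunks as a bracketed string."""
    opens, closes = defaultdict(list), defaultdict(int)
    for label, start, end, _ in chunks:
        opens[start].append(label)
        closes[end] += 1
    out = []
    for i, token in enumerate(tokens):
        out.extend(f"[{label}" for label in opens[i])
        out.append(token)
        out.extend("]" * closes[i + 1])
    return " ".join(out)


tokens = ["eine", "gewisse", "Faszination", "des", "Schattens"]
# Candidate NPs collected during parsing; heads are token positions.
candidates = [
    ("NP", 2, 3, 2),   # Faszination
    ("NP", 1, 5, 2),   # gewisse Faszination des Schattens
    ("NP", 0, 5, 2),   # eine gewisse Faszination des Schattens
    ("NP", 3, 5, 4),   # des Schattens
]
hierarchy = bracket(tokens, largest_per_head(candidates))
```

Three of the four candidates share the head "Faszination", so only the longest one is kept, and the NP headed by "Schattens" is nested inside it.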
Hierarchy building
a) [NP Faszination ]
   fascination
b) [NP gewisse Faszination des Schattens ]
   certain fascination of the shadow
c) [NP eine gewisse Faszination des Schattens ]
   a certain fascination of the shadow
d) [NP des Schattens ]
   of the shadow
e) [NP eine gewisse Faszination [NP des Schattens ] ]
   a certain fascination of the shadow
Evaluation on automatic PoS-tags
      all chunks           maximal chunks
      precision  recall    precision  recall
NP    89.93      91.67     89.43      91.68
PP    94.05      89.67     94.04      89.65
AP    84.24      89.25     83.67      89.59
VC    -          -         97.72      96.62
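
The slides report precision and recall separately. As a reading aid only, the two can be combined into the balanced F-score (harmonic mean); this measure is not reported in the talk, and the helper below is a generic sketch.

```python
def f_score(precision, recall):
    """Balanced F-score (harmonic mean) of precision and recall, in the same unit."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# NP, all chunks, automatic PoS-tags: precision 89.93, recall 91.67
np_f = f_score(89.93, 91.67)   # roughly 90.79
```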
Evaluation on ideal PoS-tags
      all chunks           maximal chunks
      precision  recall    precision  recall
NP    96.36      96.51     95.55      96.47
PP    98.08      96.51     98.07      96.50
AP    96.39      97.50     96.12      97.45
VC    -          -         99.01      98.59
Extraction
• Advantages of the system
• Goal
• Sample Extraction
Advantages of the system
• efficient work even with large corpora
• modular query language
• interactive grammar development
• powerful post-processing of rules
Goal
• provide a fine-grained syntactic classification of the extracted data at the level of
  • subcategorization
  • scrambling
• adjectives subcategorizing clauses
  • combinatory preferences with verbs
  • syntactic behavior
Target data
• predicative(-like) constructions
  Es war klar, daß ...
  It was clear, that ...
• ... with adverbial pronoun
  Er ist davon überzeugt, daß ...
  He is of it convinced, that ...
• ... with reflexive pronoun
  Es zeigt sich deutlich, daß ...
  It shows itself clear, that ...
Target data
• ... with infinitival clauses
  Es ist möglich, ihn zu besuchen.
  It is possible, him to visit.
• ... with clause in topicalized position
  Daß ..., ist klar.
  That ..., is clear.
  Ihn zu besuchen, ist möglich.
  Him to visit, is possible.
Sample query
adjective + verb + finite clause, refined step by step:
1. VC AP CL
2. VC APpred CLfin
3. VC Adjuncts* APpred CLfin
4. VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin
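
A minimal sketch of how the final query pattern, VC (AdvP|PP|NPtemp|CLrel)* APpred CLfin, could be run over a chunk-annotated sentence and turned into adjective/verb counts. The representation of a sentence as a list of (category, head lemma) pairs and the function name are assumptions for this illustration; the actual extraction is done with CQP queries on the annotated corpus.

```python
from collections import Counter

ADJUNCTS = {"AdvP", "PP", "NPtemp", "CLrel"}


def adjective_verb_pairs(chunks):
    """Return (adjective lemma, verb lemma) for each match of the pattern.

    `chunks` is a list of (category, head lemma) pairs in sentence order.
    """
    pairs = []
    for i, (category, verb) in enumerate(chunks):
        if category != "VC":
            continue
        j = i + 1
        while j < len(chunks) and chunks[j][0] in ADJUNCTS:   # skip adjuncts
            j += 1
        if (j + 1 < len(chunks)
                and chunks[j][0] == "APpred"
                and chunks[j + 1][0] == "CLfin"):
            pairs.append((chunks[j][1], verb))
    return pairs


sentences = [
    # Es war klar, daß ...
    [("NP", "es"), ("VC", "sein"), ("APpred", "klar"), ("CLfin", "daß")],
    # Er ist davon überzeugt, daß ...
    [("NP", "er"), ("VC", "sein"), ("PP", "davon"), ("APpred", "überzeugt"), ("CLfin", "daß")],
]
counts = Counter(pair for s in sentences for pair in adjective_verb_pairs(s))
```

Counting the pairs over a whole corpus gives frequency tables of adjectives against the verbs they combine with.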
adjective + verb + finite clause
            sein   bleiben  machen  werden
fraglich    326    34       -       3
unklar      320    103      -       -
klar        225    -        41      30
offen       228    40       -       -
möglich     160    -        30      2
wichtig     180    -        -       2
deutlich    5      -        97      34
total       1500   177      168     75
Topicalized finite clause
adjective + verb + finite clause
CLfin VC (AdvP|PP|NPtemp|CLrel)* APpred
adjective + verb + finite clause
            fincl_ex  fincl_top  total
fraglich    91        335        426
unklar      13        413        426
klar        221       159        380
offen       19        266        285
möglich     207       4          211
wichtig     192       9          201
deutlich    139       22         161
adjective + verb + infinitival clause
              sein   fallen  haben  werden  machen
bereit        431    -       4      6       -
schwer        162    221     108    33      26
möglich       532    -       -      40      35
schwierig     245    -       -      93      12
leicht        120    59      31     8       16
nötig         112    -       48     2       7
erforderlich  102    -       -      1       15
total         1708   280     195    183     111
low-frequency adjective + verb + infinitival clause
stehen bringen haben sein
frei 35 4
satt 19 10
fertig 24 1
low-frequency adjective + verb + clause
stehen bringen haben sein
frei 37 6
satt 27 11
fertig 26 1
adjective subcategorization
• APs with PP complements embedded in NPs
  Die [AP dafür erforderlichen] 300 000 Mark
  The for this needed 300 000 Marks
  'The 300 000 Marks needed for this'
  Der [AP auf Sport spezialisierte] Journalist
  The on sports specialised journalist
  'The journalist specialising in sports'
multiword units and abbreviations
• chunks/phrases in brackets or quotes
  • multiword units
    „Teenage Mutant Hero Turtle“
    (FC Italia Frankfurt)
  • abbreviations
    Deutscher Aktienindex (Dax)
    Stickstoffdioxyd (NO2)
Conclusion
• recursive chunking: a workable compromise between depth of analysis and robustness
• extracted data show correlation between
  • collocational preference
  • subcategorization frames
  • semantic classes of adjectives
  • to a certain extent distributional preferences