Olga Pustylnikov, Alexander Mehler
Bielefeld University
A Unified Database of Dependency Treebanks
Integrating, Quantifying & Evaluating Dependency Data
SFB 673
Motivation
Exploring similarities among languages by means of syntactic treebanks
We collected a database covering 11 languages
Treebanks have been developed separately by different research projects
Quantitative investigations across these treebanks -> need for unification
Motivation: demands on the unified format of treebanks
(+) generic: able to represent as many treebanks as possible
(+) extensible to new treebanks
(+) complete: preserving all corpus-specific information
(+) transferable to other kinds of corpora
(–) complex: exhibiting minimal complexity
-> graph representations
Motivation
The Graph eXchange Language (GXL) is a graph model that represents corpora as graphs
[Diagram labels: XML, GXL, Wiki, Multimodal Data, Treebanks, Tools]
GXL (Holt et al., 2006)
GXL can be applied to any kind of corpus (see e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008)).
Treebanks: eGXL
Agenda
1. eGXL
2. Data
3. Complexity Evaluation
4. Application
5. Conclusion
eGXL
[Diagram labels: Sentences graph, Types graph, IDREF]
<graph id="Types">
  <node id="POS" />
  <node id="t245" name="VERB" />
  ...
</graph>
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>
2-level data model
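The two-level model can be illustrated with a short Python sketch that follows each token's pos reference into the Types graph. This is only a reading of the fragment above, not the authors' tooling: the wrapper element egxl and the function name resolve_pos are assumptions made here, and the elided parts ("...") are simply left out, so t151 has no entry and resolves to None.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in a hypothetical <egxl> root element so that
# it parses as one document. Elided content ("...") is omitted, not guessed.
EGXL = """
<egxl>
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
  </graph>
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
    </graph>
  </graph>
</egxl>
"""


def resolve_pos(xml_text):
    """Return (word form, POS name) pairs by following pos references."""
    root = ET.fromstring(xml_text)
    types = root.find("graph[@id='Types']")
    sentences = root.find("graph[@id='Sentences']")
    # Level 1: map type ids to their names (None if a node has no name).
    names = {node.get("id"): node.get("name") for node in types.findall("node")}
    pairs = []
    # Level 2: every nested graph is one sentence, every node one token.
    for sentence in sentences.findall("graph"):
        for token in sentence.findall("node"):
            pairs.append((token.get("form"), names.get(token.get("pos"))))
    return pairs


tagged = resolve_pos(EGXL)
print(tagged)  # [('Detta', None), ('vill', 'VERB')]
```

Because tokens only carry references, a treebank's tagset lives in one place (the Types graph) and can differ between corpora without changing how sentences are encoded.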
The eGXL Sentences-graph
[Figure: dependency tree of a Swedish sentence; root: vill; other tokens: Detta, bestämt, jag, bemöta, .]
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>
node: each token of a treebank
form: word form
pos: an IDREF to the POS-node of the Types-graph
rel: a (syntactic) relation
relend direction="in": from (e.g. a head verb)
relend direction="out": to (e.g. a dependent argument)
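How a rel encodes a dependency can be sketched in Python as follows. The sketch assumes, based on the order of the annotations above, that the relend with direction="in" names the head ("from") and the one with direction="out" names the dependent ("to"); the function name dependency_edges is introduced here for illustration only.

```python
import xml.etree.ElementTree as ET

# One sentence graph from the slide, with the elided parts ("...") left out.
SENTENCE = """
<graph id="g8">
  <node id="s8_1" form="Detta" pos="t151" />
  <node id="s8_2" form="vill" pos="t245" />
  <rel>
    <relend direction="in" target="s8_2" />
    <relend direction="out" target="s8_1" />
  </rel>
</graph>
"""


def dependency_edges(sentence_xml):
    """Return (head form, dependent form) pairs for every rel in a sentence."""
    graph = ET.fromstring(sentence_xml)
    forms = {node.get("id"): node.get("form") for node in graph.findall("node")}
    edges = []
    for rel in graph.findall("rel"):
        # Assumption: "in" marks the head end, "out" the dependent end.
        ends = {end.get("direction"): end.get("target") for end in rel.findall("relend")}
        edges.append((forms[ends["in"]], forms[ends["out"]]))
    return edges


edges = dependency_edges(SENTENCE)
print(edges)  # [('vill', 'Detta')]
```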
11 Dependency Treebanks
7 different formats
Input vs. Output Formats
Examples from Dutch, Swedish, Italian treebanks
Unification is possible…
… due to the separation of the core from the secondary parts
<graph id="Types">
  <node id="POS" />
  <node id="t245" name="VERB" />
  ...
</graph>
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>
Types graph: diversity
Sentences graph: commonality
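The separation of core and secondary parts suggests how a converter might be organized. The Python sketch below is not the authors' converter: it assumes a hypothetical tabular input of (form, POS tag, head index) rows, a hypothetical egxl wrapper element, generated type ids (t1, t2, ...) rather than the ids shown on the slides, and the reading that direction="in" marks the head. The sample row tags are illustrative and not taken from any of the 11 treebanks.

```python
import xml.etree.ElementTree as ET


def to_egxl(sentence_id, rows):
    """Build an eGXL-like tree from rows of (form, pos_tag, head).

    head is the 1-based index of the governing token, or 0 for the root.
    """
    root = ET.Element("egxl")  # hypothetical wrapper element
    types = ET.SubElement(root, "graph", id="Types")
    ET.SubElement(types, "node", id="POS")
    sentences = ET.SubElement(root, "graph", id="Sentences")
    sent = ET.SubElement(sentences, "graph", id=f"g{sentence_id}")

    type_ids = {}
    for i, (form, tag, _head) in enumerate(rows, start=1):
        # Corpus-specific part: each distinct tag becomes one Types node.
        if tag not in type_ids:
            type_ids[tag] = f"t{len(type_ids) + 1}"
            ET.SubElement(types, "node", id=type_ids[tag], name=tag)
        # Common core: one node per token, pointing at its type.
        ET.SubElement(sent, "node", id=f"s{sentence_id}_{i}",
                      form=form, pos=type_ids[tag])

    for i, (_form, _tag, head) in enumerate(rows, start=1):
        if head == 0:
            continue  # the root token has no governing relation
        rel = ET.SubElement(sent, "rel")
        ET.SubElement(rel, "relend", direction="in",
                      target=f"s{sentence_id}_{head}")
        ET.SubElement(rel, "relend", direction="out",
                      target=f"s{sentence_id}_{i}")
    return root


# Illustrative input only.
rows = [("Detta", "PRON", 2), ("vill", "VERB", 0)]
tree = to_egxl(8, rows)
print(ET.tostring(tree, encoding="unicode"))
```

Only the loop that fills the Types graph depends on a treebank's tagset; the token and relation encoding stays the same for every input format.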
The TreebankWiki
http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/
Complexity of eGXL
Logical Scaling Factor (LSF): number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, POS, etc.)
[Chart: LSF for node and rel units, eGXL vs. other formats]
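One possible way to compute such a factor is sketched below in Python. The slides do not spell out the counting scheme, so this is an assumption: a unit's cost is taken to be the XML element itself plus all of its descendants, averaged over all units of that kind in a sentence. The function names logical_elements and lsf are introduced here for illustration.

```python
import xml.etree.ElementTree as ET

# One sentence graph from the slides, with the elided parts ("...") left out.
SENTENCE = """
<graph id="g8">
  <node id="s8_1" form="Detta" pos="t151" />
  <node id="s8_2" form="vill" pos="t245" />
  <rel>
    <relend direction="in" target="s8_2" />
    <relend direction="out" target="s8_1" />
  </rel>
</graph>
"""


def logical_elements(element):
    """Count the element itself plus all of its descendants."""
    return sum(1 for _ in element.iter())


def lsf(sentence_xml, unit_tag):
    """Average number of XML elements used per unit of the given tag."""
    graph = ET.fromstring(sentence_xml)
    units = graph.findall(unit_tag)
    if not units:
        return 0.0
    return sum(logical_elements(unit) for unit in units) / len(units)


node_lsf = lsf(SENTENCE, "node")
rel_lsf = lsf(SENTENCE, "rel")
print(node_lsf, rel_lsf)  # 1.0 3.0
```

Under this counting, a token costs one element and a relation three (the rel plus its two relend children); the same function could be run on another format's XML to compare the two.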
DTDB
Conclusions
a database covering 11 languages
eGXL: a generic XML graph model adapted to syntactic treebanks
use of treebanks within a single application (Ariadne)
Thank you for your attention!