Olga Pustylnikov, Alexander Mehler Bielefeld University A Unified Database of Dependency Treebanks Integrating, Quantifying & Evaluating Dependency Data

Post on 25-Jan-2016

Page 1: A Unified Database of Dependency Treebanks Integrating, Quantifying & Evaluating Dependency Data

Olga Pustylnikov, Alexander Mehler

Bielefeld University

A Unified Database of Dependency Treebanks

Integrating, Quantifying & Evaluating Dependency Data

Page 2

SFB 673

Motivation

Exploring similarities among languages by means of syntactic treebanks

We collected a database covering 11 languages

Treebanks have been developed separately by different research projects

quantitative investigations on these treebanks -> the need for unification

Page 3

Motivation: Demands on the unified format of treebanks

(+) generic: able to represent as many treebanks as possible

(+) extensible to new treebanks

(+) complete: preserving all corpus-specific information

(+) transferable to other kinds of corpora

(–) complex: exhibiting minimal complexity

-> graph representations

Page 4

Motivation

The Graph eXchange Language (GXL; Holt et al., 2006) is a graph model that represents corpora in terms of graphs.

GXL can be applied to any kind of corpus (see e.g. Mehler and Gleim (2005), Ferrer i Cancho et al. (2007), Pustylnikov and Mehler (2008)).

[Diagram with the labels: XML, GXL, WIKI, Multimodal Data, Treebanks, TOOLS, eGXL]

Page 5

Agenda

1. eGXL

2. Data

3. Complexity Evaluation

4. Application

5. Conclusion

Page 6

eGXL

2-level data model: a Types-graph and a Sentences-graph, linked via IDREF

<graph id="Types">
  <node id="POS" />
  <node id="t245" name="VERB" />
</graph>
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>

Page 7

The eGXL Sentences-graph

[Figure: dependency tree of a Swedish example sentence, with "vill" at the top and the tokens "Detta", "bestämt", "jag", "bemöta", "." below it]

<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>

Annotations on the excerpt:

node: each token of a treebank

form: word form

pos: an IDREF to the POS-node of the Types-graph

rel: a (syntactic) relation

relend direction="in": from (e.g. a head verb)

relend direction="out": to (e.g. a dependent argument)
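The reading of the Sentences-graph above can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the function name `dependency_edges` is made up here, the excerpt is closed off so that it parses, and the interpretation of `direction="in"` as the head and `direction="out"` as the dependent follows the slide's callouts.

```python
import xml.etree.ElementTree as ET

# The slide's excerpt, with the elided parts dropped so that it parses.
EGXL_SNIPPET = """
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
  </graph>
</graph>
"""


def dependency_edges(sentence_graph):
    """Return (head form, dependent form) pairs for one sentence graph.

    Assumption: the relend with direction="in" names the head and the
    relend with direction="out" names the dependent.
    """
    forms = {n.get("id"): n.get("form") for n in sentence_graph.findall("node")}
    edges = []
    for rel in sentence_graph.findall("rel"):
        ends = {e.get("direction"): e.get("target") for e in rel.findall("relend")}
        edges.append((forms[ends["in"]], forms[ends["out"]]))
    return edges


root = ET.fromstring(EGXL_SNIPPET)
sentence = root.find("graph[@id='g8']")
edges = dependency_edges(sentence)
print(edges)
```

Each `rel` is turned into one head-dependent pair, so the single relation in the excerpt links "vill" to "Detta".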


Page 9

11 Dependency Treebanks

7 different formats

Page 10

Input vs. Output Formats

Examples from Dutch, Swedish, Italian treebanks

Page 11

Unification is possible…

… due to the separation of the core from the secondary parts

<graph id="Types">
  <node id="POS" />
  <node id="t245" name="VERB" />
</graph>
<graph id="Sentences">
  <graph id="g8">
    <node id="s8_1" form="Detta" pos="t151" />
    <node id="s8_2" form="vill" pos="t245" />
    ...
    <rel>
      <relend direction="in" target="s8_2" />
      <relend direction="out" target="s8_1" />
    </rel>
    ...
  </graph>
  ...
</graph>

[Figure labels: diversity, commonality]
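How the two graphs fit together can be illustrated by resolving each token's `pos` IDREF against the Types-graph. The sketch below is hypothetical: the wrapping `<gxl>` root element and the function name `resolve_pos` are assumptions made for the example, and since the slide does not show a Types-node for `t151`, that reference is simply left unresolved.

```python
import xml.etree.ElementTree as ET

# Both graphs from the slide, wrapped in an assumed <gxl> root element.
DOC = """
<gxl>
  <graph id="Types">
    <node id="POS" />
    <node id="t245" name="VERB" />
  </graph>
  <graph id="Sentences">
    <graph id="g8">
      <node id="s8_1" form="Detta" pos="t151" />
      <node id="s8_2" form="vill" pos="t245" />
    </graph>
  </graph>
</gxl>
"""


def resolve_pos(root):
    """Pair each word form with the name of the type node its pos refers to."""
    types_graph = root.find("graph[@id='Types']")
    type_names = {n.get("id"): n.get("name") for n in types_graph.findall("node")}
    tokens = root.findall("graph[@id='Sentences']/graph/node")
    # IDREFs without a matching type node in this excerpt resolve to None.
    return [(t.get("form"), type_names.get(t.get("pos"))) for t in tokens]


resolved = resolve_pos(ET.fromstring(DOC))
print(resolved)
```

The token nodes carry only a reference, so corpus-specific tag sets can differ in the Types-graph while the Sentences-graph keeps the same shape.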

Page 12

The TreebankWiki

http://ariadne.coli.uni-bielefeld.de/wikis/treebankwiki/


Page 14

Complexity of eGXL

Logical Scaling Factor (LSF): the number of logical elements (e.g. XML elements) required to represent a treebank unit (e.g. a word form, a POS tag, etc.)

[Chart comparing eGXL with the other formats for node and rel]
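One possible reading of the LSF definition can be sketched as follows. This is an assumption, not the authors' published procedure: here a unit's logical elements are taken to be the XML elements of its subtree, and the LSF of a unit type is the average of that count over all units of the type. The function names are made up for the example.

```python
import xml.etree.ElementTree as ET

# One sentence graph from the eGXL excerpt, without the elided parts.
SENTENCE = """
<graph id="g8">
  <node id="s8_1" form="Detta" pos="t151" />
  <node id="s8_2" form="vill" pos="t245" />
  <rel>
    <relend direction="in" target="s8_2" />
    <relend direction="out" target="s8_1" />
  </rel>
</graph>
"""


def element_count(elem):
    """Count the XML elements in a subtree, including its root element."""
    return sum(1 for _ in elem.iter())


def logical_scaling_factor(sentence_graph, unit_tag):
    """Average number of XML elements per unit with the given tag (assumed reading)."""
    units = sentence_graph.findall(unit_tag)
    return sum(element_count(u) for u in units) / len(units)


graph = ET.fromstring(SENTENCE)
lsf_node = logical_scaling_factor(graph, "node")
lsf_rel = logical_scaling_factor(graph, "rel")
print(lsf_node, lsf_rel)
```

Under this reading a token costs one element in eGXL, while a relation costs three (the `rel` plus its two `relend` children).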


Page 16

DTDB


Page 18

Conclusions

a database covering 11 languages

eGXL: a generic XML graph model adapted to syntactic treebanks

use of treebanks within a single application (Ariadne)

Thank you for your attention!