treebank mining with gretel - vub artificial intelligence lab · treebank mining with gretel...

Post on 25-Apr-2020

3 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Treebank mining with GrETEL

Liesbeth Augustinus Frank Van Eynde

GrETEL tutorial - 27 March, 2015

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Search engine for treebanks

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Search engine for treebanks

• Treebank = syntactically annotated corpus

o Penn Treebank (English)

o TüBa (German)

o LASSY, CGN, SoNaR (Dutch)

NEDERBOOMS • Exploitation of Dutch treebanks for research in linguistics

• CLARIN project

• October, 2010 – February, 2012

• Goals:

o User-friendly tools

o Fast and accurate

• Result:

o GrETEL 1.0

o http://nederbooms.ccl.kuleuven.be

GrETEL 2.0 • Update of GrETEL 1.0

• CLARIN project

• June, 2013 – July, 2014

• Goals:

o Improve GUI

o Make more data accessible

• Result:

o GrETEL 2.0

o http://gretel.ccl.kuleuven.be

TREEBANKS

CGN treebank LASSY small Spoken Dutch Written Dutch

Stylistic & regional differences

conversations vs read texts NL vs VL

Stylistic differences

Wikipedia vs legal texts

± 1M words ± 1M words

130k sentences 65k sentences

Manually corrected Manually corrected

TREEBANKS

SoNaR Written Dutch

Stylistic differences

Wikipedia vs legal texts

± 500M words

41M sentences

Not corrected

GrETEL • Greedy Extraction of Trees for Empirical Linguistics

• Search engine for treebanks

• Treebank = syntactically annotated corpus

o Penn Treebank (English)

o TüBa (German)

o LASSY, CGN, SoNaR (Dutch)

• Parser

o E.g. Alpino (Van Noord 2006)

ALPINO PARSER

Dit is een zin. >> ALPINO parser >> ‘This is a sentence.’

ALPINO PARSER

Dit is een zin. >> ALPINO parser >> ‘This is a sentence.’

XML trees

Query language: XPath

XPATH

//node[@cat="smain" and

node[@rel="su" and

@pt="vnw" and @lemma="dit"]

and node[@rel="hd" and

@pt="ww" and @lemma="zijn"]

and node[@rel="predc" and

@cat="np" and

node[@rel="det" and

@pt="lid" and @lemma="een"]

and node[@rel="hd" and

@pt="n" and @lemma="zin"]]]

XPATH

//node[@cat="smain" and

node[@rel="su" and

@pt="vnw" and @lemma="dit"]

and node[@rel="hd" and

@pt="ww" and @lemma="zijn"]

and node[@rel="predc" and

@cat="np" and

node[@rel="det" and

@pt="lid" and @lemma="een"]

and node[@rel="hd" and

@pt="n" and @lemma="zin"]]]

XPATH

//node[@cat="smain" and

node[@rel="su" and

@pt="vnw" and @lemma="dit"]

and node[@rel="hd" and

@pt="ww" and @lemma="zijn"]

and node[@rel="predc" and

@cat="np" and

node[@rel="det" and

@pt="lid" and @lemma="een"]

and node[@rel="hd" and

@pt="n" and @lemma="zin"]]]

XPATH

GrETEL 2 search modes:

o Example-based search

o XPath search

GrETEL 2 search modes:

o Example-based search

advantage: no or limited knowledge of data structure and/or formal query languages needed

o XPath search

GrETEL 1. Example sentence

2. Inspect parse

3. Indicate relevant items of the sentence

4. Select treebank

5. (Adapt XPath)

6. Inspect results

• Parser (Alpino)

• Automatically generate XPath expression

• Present results

the user

OUTLINE • GrETEL in a nutshell

• GrETEL demo

o Case study

o Search options

• Conclusions

CASE STUDY Infinitivus Pro Participio (IPP) constructions in Dutch

Hij heeft Marie horen zingen.

‘He has heard Mary sing.’

… dat Jan niet is kunnen komen.

‘… that Jan was not able to come.’

CASE STUDY Infinitivus Pro Participio (IPP) constructions in Dutch

Hij heeft Marie horen/*gehoord zingen.

‘He has heard Mary sing.’

… dat Jan niet is kunnen/*gekund komen.

‘… that Jan was not able to come.’

GrETEL ONLINE

INPUT

INPUT PARSE

SELECTION MATRIX

SELECTION GUIDELINES

TREEBANK SELECTION

TREEBANK SELECTION

QUERY OVERVIEW

RESULTS IPP constructions in CGN

Hij heeft Marie horen zingen.

‘He has heard Mary sing.’

344 hits

RESULTS

RESULTS: table

RESULTS: data

RESULTS: data

“greedy” search

RESULTATEN: trees

RESULTS IPP constructions in CGN

Hij heeft Marie horen zingen.

‘He has heard Mary sing.’

344 hits

… dat Jan niet is kunnen komen.

‘… that Jan was not able to come.’

24 hits

MORE RESULTS

Option 1: Use different queries

Hij heeft Marie horen zingen.

‘He has heard Mary sing.’

344 hits

… dat Jan niet is kunnen komen.

‘… that Jan was not able to come.’

24 hits

TOTAL: 567 hits

… dat hij Marie heeft horen zingen.

‘… that he has heard Mary sing.’

79 hits

Jan is niet kunnen komen.

‘Jan was not able to come.’

120 hits

MORE RESULTS

Option 2: Adapt query (via “XPath Search”)

//node[@cat="smain" and node[@rel="hd" and @pt="ww" and

@lemma="hebben"] and node[@rel="vc" and @cat="inf" and

node[@rel="hd" and @pt="ww"] and node[@rel="vc" and

@cat="inf" and node[@rel="hd" and @pt="ww"]]]]

//node[(@cat="smain" or @cat="ssub") and node[@rel="hd"

and (@lemma="hebben" or @lemma="zijn")] and

node[@rel="vc" and @cat="inf" and node[@rel="hd" and

@pt="ww"] and node[@rel="vc" and @cat="inf" and

node[@rel="hd" and @pt="ww"]]]]

MORE RESULTS

MORE RESULTS

Option 2: Adapt query (via “XPath Search”)

MORE RESULTS

Option 2: Adapt query (via “XPath Search”)

//node[@cat="smain" and node[@rel="hd" and @pt="ww" and

@lemma="hebben"] and node[@rel="vc" and @cat="inf" and

node[@rel="hd" and @pt="ww"] and node[@rel="vc" and

@cat="inf" and node[@rel="hd" and @pt="ww"]]]]

//node[(@cat="smain" or @cat="ssub") and node[@rel="hd"

and (@lemma="hebben" or @lemma="zijn")] and

node[@rel="vc" and @cat="inf" and node[@rel="hd" and

@pt="ww"] and node[@rel="vc" and @cat="inf" and

node[@rel="hd" and @pt="ww"]]]]

566 hits (one sentence matches twice: fva400364__10)

OUTLINE • GrETEL in a nutshell

• GrETEL demo

o Case study

o Search options

• Conclusions

ADVANCED SEARCH

ADVANCED SEARCH

ADVANCED SEARCH

ADVANCED SEARCH

SEARCH OPTIONS

Below annotation matrix

WORD ORDER PP-over-V

o V + PP

o … dat hij opstond met een kater. ‘... that he woke up with a hangover.’

o PP + V

o … dat hij met een kater opstond. … that he with a hangover woke-up ‘... that he woke up with a hangover.’

WORD ORDER PP-over-V in LASSY small

o V + PP

o … dat hij opstond met een kater. ‘... that he woke up with a hangover.’

o

2,890 hits in 2,764 sentences

But: results include PP + V as well!

WORD ORDER PP-over-V in LASSY small

o V + PP + word order option

o … dat hij opstond met een kater. ‘... that he woke up with a hangover.’

787 hits in 775 sentences

Results only include V + PP

IGNORE TOP NODE

CONTEXT

CONTEXT

OUTLINE • GrETEL in a nutshell

• GrETEL demo

o Case study

o Search options

• Conclusions

CONCLUSIONS • GrETEL: search engine for Dutch treebanks

• Input = natural language example

• Output = sample of similar sentences

• Syntactic concordancer

• Available online (via Mozilla Firefox)

• No installation required

Try it yourself! http://gretel.ccl.kuleuven.be

Thanks for your attention!

top related