Treebank mining with GrETEL
Liesbeth Augustinus, Frank Van Eynde
GrETEL tutorial - 27 March, 2015
GrETEL • Greedy Extraction of Trees for Empirical Linguistics
• Search engine for treebanks
• Treebank = syntactically annotated corpus
o Penn Treebank (English)
o TüBa (German)
o LASSY, CGN, SoNaR (Dutch)
NEDERBOOMS • Exploitation of Dutch treebanks for research in linguistics
• CLARIN project
• October, 2010 – February, 2012
• Goals:
o User-friendly tools
o Fast and accurate
• Result:
o GrETEL 1.0
o http://nederbooms.ccl.kuleuven.be
GrETEL 2.0 • Update of GrETEL 1.0
• CLARIN project
• June, 2013 – July, 2014
• Goals:
o Improve GUI
o Make more data accessible
• Result:
o GrETEL 2.0
o http://gretel.ccl.kuleuven.be
TREEBANKS
             CGN treebank                  LASSY small                SoNaR
Modality     Spoken Dutch                  Written Dutch              Written Dutch
Variation    Stylistic and regional        Stylistic                  Stylistic
Examples     conversations vs read texts;  Wikipedia vs legal texts   Wikipedia vs legal texts
             Netherlands (NL) vs
             Flanders (VL)
Words        ± 1M                          ± 1M                       ± 500M
Sentences    130k                          65k                        41M
Annotation   Manually corrected            Manually corrected         Not corrected
GrETEL • Greedy Extraction of Trees for Empirical Linguistics
• Parser
o E.g. Alpino (Van Noord 2006)
ALPINO PARSER
Dit is een zin. ‘This is a sentence.’ >> ALPINO parser >> XML trees
Query language: XPath
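To make the parser output concrete, the sketch below queries a hand-written, simplified approximation of an Alpino XML tree for the example sentence. The tree is an assumption made for illustration (real Alpino output carries many more attributes), and Python's standard ElementTree module supports only a subset of XPath, so only simple attribute tests are used here.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written approximation of an Alpino parse of "Dit is een zin."
# Only the attributes used in this tutorial are kept; this is not real parser output.
PARSE = """
<alpino_ds>
  <node cat="top" rel="top" begin="0" end="5">
    <node cat="smain" rel="--" begin="0" end="4">
      <node rel="su" pt="vnw" lemma="dit" word="Dit" begin="0" end="1"/>
      <node rel="hd" pt="ww" lemma="zijn" word="is" begin="1" end="2"/>
      <node rel="predc" cat="np" begin="2" end="4">
        <node rel="det" pt="lid" lemma="een" word="een" begin="2" end="3"/>
        <node rel="hd" pt="n" lemma="zin" word="zin" begin="3" end="4"/>
      </node>
    </node>
    <node rel="--" pt="let" lemma="." word="." begin="4" end="5"/>
  </node>
  <sentence>Dit is een zin .</sentence>
</alpino_ds>
"""

root = ET.fromstring(PARSE)

# Every head verb in the tree; chained predicates act as a logical "and".
verbs = [n.get("word") for n in root.findall(".//node[@rel='hd'][@pt='ww']")]

# The noun phrase functioning as predicative complement, read off its leaves.
np = root.find(".//node[@rel='predc'][@cat='np']")
np_words = [leaf.get("word") for leaf in np.iter("node") if leaf.get("word")]

print(verbs)     # ['is']
print(np_words)  # ['een', 'zin']
```

Phrases carry a `cat` attribute and words carry `pt`, `lemma` and `word`; every node has a dependency relation in `rel`.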
XPATH
//node[@cat="smain"
  and node[@rel="su" and @pt="vnw" and @lemma="dit"]
  and node[@rel="hd" and @pt="ww" and @lemma="zijn"]
  and node[@rel="predc" and @cat="np"
    and node[@rel="det" and @pt="lid" and @lemma="een"]
    and node[@rel="hd" and @pt="n" and @lemma="zin"]]]
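The query above reads as a nested pattern: a node with certain attributes that has, for each bracketed `node[...]`, at least one child satisfying it. The sketch below spells out that reading as a small recursive matcher in plain Python. It is an illustration of the query's semantics, not how GrETEL or an XPath engine evaluates it; the `PATTERN` dictionary layout and the two tiny trees are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

def matches(node, pattern):
    """True if `node` has the pattern's attributes and, for every child
    pattern, at least one direct child that matches it (order is ignored)."""
    attrs_ok = all(node.get(k) == v for k, v in pattern["attrs"].items())
    return attrs_ok and all(
        any(matches(child, sub) for child in node.findall("node"))
        for sub in pattern.get("children", [])
    )

def search(root, pattern):
    """Mimic the leading `//`: test every node anywhere in the tree."""
    return [n for n in root.iter("node") if matches(n, pattern)]

# The XPath expression from the slide, written as nested dictionaries.
PATTERN = {
    "attrs": {"cat": "smain"},
    "children": [
        {"attrs": {"rel": "su", "pt": "vnw", "lemma": "dit"}},
        {"attrs": {"rel": "hd", "pt": "ww", "lemma": "zijn"}},
        {"attrs": {"rel": "predc", "cat": "np"},
         "children": [
             {"attrs": {"rel": "det", "pt": "lid", "lemma": "een"}},
             {"attrs": {"rel": "hd", "pt": "n", "lemma": "zin"}},
         ]},
    ],
}

def tiny_tree(noun_lemma):
    """Hand-written, simplified tree for 'Dit is een <noun>' (hypothetical data)."""
    return ET.fromstring(f"""
    <node cat="top">
      <node cat="smain" rel="--">
        <node rel="su" pt="vnw" lemma="dit"/>
        <node rel="hd" pt="ww" lemma="zijn"/>
        <node rel="predc" cat="np">
          <node rel="det" pt="lid" lemma="een"/>
          <node rel="hd" pt="n" lemma="{noun_lemma}"/>
        </node>
      </node>
    </node>""")

hits = search(tiny_tree("zin"), PATTERN)
misses = search(tiny_tree("boek"), PATTERN)
print(len(hits), len(misses))  # 1 0
```

Because each child condition is only an existence test, the pattern says nothing about the linear order of the matched words.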
GrETEL 2 search modes:
o Example-based search
advantage: no or limited knowledge of data structure and/or formal query languages needed
o XPath search
GrETEL
The user:
1. Example sentence
2. Inspect parse
3. Indicate relevant items of the sentence
4. Select treebank
5. (Adapt XPath)
6. Inspect results
GrETEL:
• Parser (Alpino)
• Automatically generate XPath expression
• Present results
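The step "automatically generate XPath expression" can be pictured with the minimal sketch below. It is not GrETEL's actual implementation: the function names `build` and `generate_xpath`, the `selection` dictionary (word form mapped to the attributes the user ticked) and the simplified parse are all assumptions made for illustration. Keying the selection by word form would break for sentences that repeat a word; a real tool would key on word position.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for the Alpino parse of "Dit is een zin."
PARSE = ET.fromstring("""
<node cat="smain" rel="--">
  <node rel="su" pt="vnw" lemma="dit" word="Dit"/>
  <node rel="hd" pt="ww" lemma="zijn" word="is"/>
  <node rel="predc" cat="np">
    <node rel="det" pt="lid" lemma="een" word="een"/>
    <node rel="hd" pt="n" lemma="zin" word="zin"/>
  </node>
</node>
""")

def build(node, selection, is_top=False):
    """Return an XPath step for `node`, or None if nothing below it is selected."""
    if len(node) == 0:  # a leaf, i.e. a word
        attrs = selection.get(node.get("word"))
        if attrs is None:
            return None  # the user did not mark this word as relevant
        conds = [f'@{a}="{node.get(a)}"' for a in attrs]
    else:  # a phrase: keep it only if it dominates a selected word
        kids = [build(child, selection) for child in node]
        kids = [k for k in kids if k is not None]
        if not kids:
            return None
        names = ["cat"] if is_top else ["rel", "cat"]
        conds = [f'@{a}="{node.get(a)}"' for a in names] + kids
    return "node[" + " and ".join(conds) + "]"

def generate_xpath(tree, selection):
    step = build(tree, selection, is_top=True)
    return None if step is None else "//" + step

ALL = ["rel", "pt", "lemma"]
full = generate_xpath(PARSE, {w: ALL for w in ["Dit", "is", "een", "zin"]})
verb_only = generate_xpath(PARSE, {"is": ["rel", "pt"]})

print(full)
print(verb_only)  # //node[@cat="smain" and node[@rel="hd" and @pt="ww"]]
```

Selecting every word with relation, part of speech and lemma reproduces the XPath expression shown earlier for "Dit is een zin."; selecting fewer words or fewer attributes gives a more general query.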
OUTLINE • GrETEL in a nutshell
• GrETEL demo
o Case study
o Search options
• Conclusions
CASE STUDY Infinitivus Pro Participio (IPP) constructions in Dutch
Hij heeft Marie horen/*gehoord zingen.
‘He has heard Mary sing.’
… dat Jan niet is kunnen/*gekund komen.
‘… that Jan was not able to come.’
GrETEL ONLINE
[Screenshots: input, input parse, selection matrix, selection guidelines, treebank selection, query overview]
RESULTS
[Screenshots: results table, results data (“greedy” search), result trees]
RESULTS IPP constructions in CGN
Hij heeft Marie horen zingen.
‘He has heard Mary sing.’
344 hits
… dat Jan niet is kunnen komen.
‘… that Jan was not able to come.’
24 hits
MORE RESULTS
Option 1: Use different queries
Hij heeft Marie horen zingen.
‘He has heard Mary sing.’
344 hits
… dat Jan niet is kunnen komen.
‘… that Jan was not able to come.’
24 hits
… dat hij Marie heeft horen zingen.
‘… that he has heard Mary sing.’
79 hits
Jan is niet kunnen komen.
‘Jan was not able to come.’
120 hits
TOTAL: 567 hits
MORE RESULTS
Option 2: Adapt query (via “XPath Search”)
Original query:
//node[@cat="smain" and node[@rel="hd" and @pt="ww" and
@lemma="hebben"] and node[@rel="vc" and @cat="inf" and
node[@rel="hd" and @pt="ww"] and node[@rel="vc" and
@cat="inf" and node[@rel="hd" and @pt="ww"]]]]
Adapted query:
//node[(@cat="smain" or @cat="ssub") and node[@rel="hd"
and (@lemma="hebben" or @lemma="zijn")] and
node[@rel="vc" and @cat="inf" and node[@rel="hd" and
@pt="ww"] and node[@rel="vc" and @cat="inf" and
node[@rel="hd" and @pt="ww"]]]]
566 hits (one sentence matches twice: fva400364__10)
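The adaptation replaces single attribute tests with alternatives joined by `or`. A minimal sketch of composing such a query as a string is given below; the helper names `any_of`, `node` and `ipp_query` are hypothetical and only illustrate the structure of the adapted query, they are not part of GrETEL.

```python
def any_of(attr, values):
    """One attribute test, or a parenthesised `or` over several values."""
    tests = [f'@{attr}="{v}"' for v in values]
    return tests[0] if len(tests) == 1 else "(" + " or ".join(tests) + ")"

def node(*conditions):
    """A `node[...]` step whose conditions are joined with `and`."""
    return "node[" + " and ".join(conditions) + "]"

# Two stacked infinitival verbal complements, as in "heeft ... horen zingen".
inner_vc = node('@rel="vc"', '@cat="inf"', node('@rel="hd"', '@pt="ww"'))
outer_vc = node('@rel="vc"', '@cat="inf"', node('@rel="hd"', '@pt="ww"'), inner_vc)

def ipp_query(clause_types, auxiliaries):
    """IPP pattern for the given clause categories and auxiliary lemmas."""
    head = node('@rel="hd"', any_of("lemma", auxiliaries))
    return "//" + node(any_of("cat", clause_types), head, outer_vc)

adapted = ipp_query(["smain", "ssub"], ["hebben", "zijn"])
print(adapted)
```

With both clause types and both auxiliaries, the printed string is the adapted query above on a single line. Like the adapted query, the helper leaves out the `@pt="ww"` test on the auxiliary that the original query contains.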
ADVANCED SEARCH
SEARCH OPTIONS
Below annotation matrix
WORD ORDER PP-over-V
o V + PP
  … dat hij opstond met een kater.
  ‘... that he woke up with a hangover.’
o PP + V
  … dat hij met een kater opstond.
  … that he with a hangover woke-up
  ‘... that he woke up with a hangover.’
WORD ORDER PP-over-V in LASSY small
o V + PP
  … dat hij opstond met een kater.
  ‘... that he woke up with a hangover.’
2,890 hits in 2,764 sentences
But: results include PP + V as well!
WORD ORDER PP-over-V in LASSY small
o V + PP + word order option
  … dat hij opstond met een kater.
  ‘... that he woke up with a hangover.’
787 hits in 775 sentences
Results only include V + PP
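The difference between the two searches can be sketched as follows. A structural query only asks whether a clause contains a verb and a PP, so both orders match; a word order constraint additionally compares positions. The sketch assumes that each node records its start position in a `begin` attribute, and the two clauses are hypothetical, heavily simplified trees. How GrETEL itself encodes the word order option may differ.

```python
import xml.etree.ElementTree as ET

def clause(verb_begin, pp_begin):
    """Hypothetical, simplified subordinate clause containing a verb and a PP."""
    return ET.fromstring(
        '<node cat="ssub">'
        '<node rel="su" pt="vnw" begin="1"/>'
        f'<node rel="hd" pt="ww" begin="{verb_begin}"/>'
        f'<node rel="mod" cat="pp" begin="{pp_begin}"/>'
        '</node>'
    )

# Word positions: dat(0) hij(1) opstond(2) met(3) een(4) kater(5)
V_PP = clause(verb_begin=2, pp_begin=3)
# Word positions: dat(0) hij(1) met(2) een(3) kater(4) opstond(5)
PP_V = clause(verb_begin=5, pp_begin=2)

def verb_and_pp(c):
    """Head verb and PP among the direct children of the clause node."""
    return c.find("node[@rel='hd'][@pt='ww']"), c.find("node[@cat='pp']")

def matches_unordered(c):
    verb, pp = verb_and_pp(c)
    return verb is not None and pp is not None

def matches_v_before_pp(c):
    verb, pp = verb_and_pp(c)
    return (verb is not None and pp is not None
            and int(verb.get("begin")) < int(pp.get("begin")))

print(matches_unordered(V_PP), matches_unordered(PP_V))      # True True
print(matches_v_before_pp(V_PP), matches_v_before_pp(PP_V))  # True False
```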
IGNORE TOP NODE
CONTEXT
CONCLUSIONS • GrETEL: search engine for Dutch treebanks
• Input = natural language example
• Output = sample of similar sentences
• Syntactic concordancer
• Available online (via Mozilla Firefox)
• No installation required
Try it yourself! http://gretel.ccl.kuleuven.be
Thanks for your attention!