![Page 1: Supporting e-learning with automatic glossary extraction Experiments with Portuguese Rosa Del Gaudio, António Branco RANLP, Borovets 2007](https://reader035.vdocuments.net/reader035/viewer/2022062515/56649d2a5503460f949fefeb/html5/thumbnails/1.jpg)
Supporting e-learning with automatic glossary extraction
Experiments with Portuguese
Rosa Del Gaudio, António Branco
RANLP, Borovets 2007
Presentation Plan
● LT4eL project
● ILIAS
● Corpus
● Tool
● Grammars
  ● Copula
  ● Other Verbs
  ● Punctuation
● Results
● Conclusion
LT4eL
● Improve retrieval and accessibility of learning objects (LO) in learning management systems
● Employ language technology resources and tools for the semi-automatic generation of descriptive metadata
● Develop new functionalities, such as a keyword extractor, a glossary candidate detector, and semantic search, tuned for the various languages addressed in the project (Bulgarian, Czech, Dutch, English, German, Maltese, Polish, Portuguese, Romanian)
ILIAS
Objective
● Build a glossary automatically to support the e-learning process. In practice, this means extracting definitions from unstructured text (scientific papers, encyclopedias, web pages)
● Better access to information for students
● Accelerate the work of the tutor
ILIAS: Glossary Candidate Detector
The Corpus
• 274,000 tokens
• Tutorials
• PhD theses
• Scientific papers
• 3 domains, evenly represented
  • e-learning
  • Technology for non-experts
  • Calimera
XML format
<!-- Gloss: "An intranet is a network developed for processing information in a company or organization." -->
<definingText continue="y" def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" class="word" ctag="PNM" id="t9032" sp="y">Intranet</tok>
  </markedTerm>
  <tok base="ser" class="word" ctag="V" id="t9033" msd="pi-3s" sp="y">é</tok>
  <tok base="uma" class="word" ctag="UM" id="t9034" msd="fs" sp="y">uma</tok>
  <tok base="rede" class="word" ctag="CN" id="t9035" msd="fs" sp="y">rede</tok>
  <tok base="desenvolver,desenvolvido" class="word" ctag="PPA" id="t9036" msd="fs" sp="y">desenvolvida</tok>
  <tok base="para" class="word" ctag="PREP" id="t9037" sp="y">para</tok>
  <tok base="processamento" class="word" ctag="CN" id="t9038" msd="ms" sp="y">processamento</tok>
  <tok base="de" class="word" ctag="PREP" id="t9039" sp="y">de</tok>
  <tok base="informação" class="word" ctag="CN" id="t9040" msd="fp" sp="y">informações</tok>
  <tok base="em" class="word" ctag="PREP" id="t9041" sp="y">em</tok>
  <tok base="uma" class="word" ctag="UM" id="t9042" msd="fs" sp="y">uma</tok>
  <tok base="empresa" class="word" ctag="CN" id="t9043" msd="fs" sp="y">empresa</tok>
  <tok base="ou" class="word" ctag="CJ" id="t9044" sp="y">ou</tok>
  <tok base="organização" class="word" ctag="CN" id="t9045" msd="fs">organização</tok>
  <tok class="punctuation" ctag="PNT" id="t9046" sp="y">.</tok>
</definingText>
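An annotation of this kind can be read back with any XML parser. The following is a minimal sketch, assuming Python's standard `xml.etree.ElementTree` and a shortened copy of the sentence; the helper name `split_definition` is hypothetical and is not part of the LT4eL tools.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the annotated sentence (attributes and tokens trimmed for brevity).
SAMPLE = """
<definingText def="m147" def_type1="is_def" id="d5">
  <markedTerm dt="y" id="m147" kw="y">
    <tok base="intranet" ctag="PNM" id="t9032">Intranet</tok>
  </markedTerm>
  <tok base="ser" ctag="V" id="t9033" msd="pi-3s">é</tok>
  <tok base="uma" ctag="UM" id="t9034" msd="fs">uma</tok>
  <tok base="rede" ctag="CN" id="t9035" msd="fs">rede</tok>
  <tok ctag="PNT" id="t9046">.</tok>
</definingText>
"""


def split_definition(xml_text):
    """Return (definition type, defined term, tokenized sentence) for one definingText element."""
    root = ET.fromstring(xml_text)
    # Tokens inside <markedTerm> form the term being defined.
    term = " ".join(tok.text for tok in root.find("markedTerm").iter("tok"))
    # iter() walks the tree in document order, so nested term tokens come first.
    sentence = " ".join(tok.text for tok in root.iter("tok"))
    return root.get("def_type1"), term, sentence


print(split_definition(SAMPLE))
```

The `def_type1` attribute carries the definition type (`is_def` here), which is what the type-specific grammars later in the talk are evaluated against.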
LxTransduce
• Input: plain text or XML
• Regular expressions
• Substitution and markup
• Output: the same file with the changes applied
• Match tree using elements
• Quick
• Unicode friendly
• Freeware
• Easy to integrate into other tools (Java)
Rules in lxtransduce
<rule name="Conj">
  <query match="tok[@ctag = 'CJ']"/>
</rule>
<rule name="Coor"> <!-- Conjunctions or comma -->
  <first>
    <query match="tok[. = ',']"/>
    <ref name="Conj" mult="+"/>
  </first>
</rule>
<rule name="PARopen">
  <query match="tok[.~'^\($']"/>
</rule>
<rule name="PARcl">
  <query match="tok[.~'^\)$']"/>
</rule>
<rule name="parenthetic">
  <seq>
    <ref name="PARopen"/>
    <repeat-until name="tok">
      <ref name="PARcl"/>
    </repeat-until>
    <ref name="PARcl"/>
  </seq>
</rule>
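To make the intent of the `parenthetic` rule concrete, here is a rough Python sketch of the same idea over a plain list of token strings: start at an opening parenthesis, consume tokens until the next closing parenthesis, and include it. This is an illustration only, not how lxtransduce evaluates rules; the function name and the sample tokens are made up.

```python
def find_parenthetic(tokens):
    """Return (start, end) index pairs of spans that open with "(" and end at the next ")"."""
    spans = []
    start = None
    for i, tok in enumerate(tokens):
        if tok == "(" and start is None:
            start = i                    # corresponds to PARopen
        elif tok == ")" and start is not None:
            spans.append((start, i))     # corresponds to the closing PARcl
            start = None
    return spans


# Made-up example sentence, already tokenized.
tokens = ["Intranet", "(", "rede", "interna", ")", "é", "uma", "rede"]
print(find_parenthetic(tokens))
```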
First development phase
● Less than 50% of the corpus
● Focus on the verb
● Precision: correct automatic / all automatic
● Recall: correct automatic / manually marked
● F2: 3 * (precision * recall) / (2 * precision + recall)
        Precision  Recall  F2
Gr 00   0.14       0.44    0.26
Gr 01   0.31       0.20    0.22
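The measures on this slide can be checked with a few lines of Python. The sketch below reads the slide's F2 as 3 * P * R / (2 * P + R) and recomputes it from the rounded precision and recall values reported for Gr 00 (0.14, 0.44) and Gr 01 (0.31, 0.20). Function names are illustrative. Because the inputs are already rounded, the recomputed scores may differ from the reported ones in the second decimal.

```python
def precision(correct, retrieved):
    """Correctly extracted definitions divided by all automatically extracted ones."""
    return correct / retrieved if retrieved else 0.0


def recall(correct, relevant):
    """Correctly extracted definitions divided by all manually marked ones."""
    return correct / relevant if relevant else 0.0


def f_score(p, r):
    """F-measure as written on the slide: 3 * P * R / (2 * P + R)."""
    denom = 2 * p + r
    return 3 * p * r / denom if denom else 0.0


for name, p, r in [("Gr 00", 0.14, 0.44), ("Gr 01", 0.31, 0.20)]:
    print(name, round(f_score(p, r), 3))
```

This weighting favors recall over precision, which fits a tool whose output is reviewed by a tutor: a missed definition is harder to recover than a false candidate is to discard.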
Second development phase
• 75% of the corpus for development
• 25% of the corpus for testing
• Specific grammar/rules for each definition type
Copula baseline grammar
<rule name="euristic">
  <seq>
    <repeat-until name="tok">
      <ref name="SERdef" mult="+"/>
    </repeat-until>
    <ref name="SERdef" mult="+"/>
    <not><ref name="PPA"/></not>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>
Verb “to be”, third person singular or plural, present indicative:
<rule name="SERdef">
  <best>
    <ref name="Ser3"/>
    <ref name="PoderSer"/>
  </best>
</rule>
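Read informally, the baseline accepts a sentence when its first third-person present form of "ser" is not immediately followed by a past participle (`PPA`). A rough Python sketch of that reading follows, with tokens represented as dictionaries of the attributes used in the corpus. It leaves out the `PoderSer` alternative, and the function names are hypothetical.

```python
def is_ser3(tok):
    """True for a third-person present-indicative form of the verb 'ser'."""
    return (
        tok.get("ctag") == "V"
        and tok.get("base") == "ser"
        and tok.get("msd", "").startswith("pi-3")
    )


def baseline_candidate(tokens):
    """Accept the sentence if its first 'ser' form is not followed by a past participle."""
    for i, tok in enumerate(tokens):
        if is_ser3(tok):
            j = i + 1
            # Skip any further consecutive copula tokens (mult="+" in the rule).
            while j < len(tokens) and is_ser3(tokens[j]):
                j += 1
            return j >= len(tokens) or tokens[j].get("ctag") != "PPA"
    return False


definition = [
    {"base": "intranet", "ctag": "PNM"},
    {"base": "ser", "ctag": "V", "msd": "pi-3s"},
    {"base": "uma", "ctag": "UM"},
    {"base": "rede", "ctag": "CN"},
]
passive = [
    {"base": "rede", "ctag": "CN"},
    {"base": "ser", "ctag": "V", "msd": "pi-3s"},
    {"base": "desenvolver,desenvolvido", "ctag": "PPA"},
]
print(baseline_candidate(definition), baseline_candidate(passive))
```

Excluding the participle filters out passive constructions such as "é desenvolvida", which contain the copula but do not define anything.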
Copula base result
• Sentence level results
• Problem with precision
Copula Grammar
Rules for is_type
<!-- To be, 3rd person plural and singular -->
<rule name="SERdef">
  <query match="tok[@ctag = 'V' and @base = 'ser' and (@msd[starts-with(., 'fi-3')] or @msd[starts-with(., 'pi-3')])]"/>
</rule>
....
<rule name="copula1">
  <seq>
    <ref name="SERdef"/>
    <best>
      <seq>
        <ref name="Art"/>
        <ref name="adj|adv|prep|" mult="*"/>
        <ref name="Noun" mult="+"/>
      </seq>
      ....
    </best>
    <ref name="tok" mult="*"/>
    <end/>
  </seq>
</rule>
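The part of `copula1` that follows the verb (an article, optional modifiers, then one or more nouns) can be approximated as a regular expression over the sequence of part-of-speech tags. This is only a sketch: `UM`, `CN`, `PNM` and `PREP` appear in the corpus sample, while `DA`, `ADJ` and `ADV` are assumed tag names, and the elided alternatives of the original rule are not reproduced.

```python
import re

# Article, optional modifiers, one or more nouns; tags are space-separated.
# DA, ADJ and ADV are assumed tag names, not confirmed by the slides.
AFTER_COPULA = re.compile(r"(?:UM|DA)(?: (?:ADJ|ADV|PREP))*(?: (?:CN|PNM))+(?: |$)")


def matches_copula1(ctags_after_verb):
    """True if the tags that follow the copula start with Art (adj|adv|prep)* Noun+."""
    return AFTER_COPULA.match(" ".join(ctags_after_verb)) is not None


# Tags following "é" in "Intranet é uma rede desenvolvida para ..."
print(matches_copula1(["UM", "CN", "PPA", "PREP", "CN"]))
```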
Comparing Results
Include the patterns that were excluded
Try to gather the syntactic patterns of non-definitions and compare them with the syntactic patterns of definitions.
Other_Verbs grammar
• Collect verbs in a lexicon
• Three different categories: reflexive, active, passive
• 22 different verbs
<lex word="chamar"><cat>ref</cat></lex>
<lex word="chamar,chamado"><cat>pas</cat></lex>
<rule name="Vpas">
  <seq>
    <ref name="tok"/>
    <not><ref name="not"/></not>
    <ref name="tok" mult="?"/>
    <query match="tok[mylex(@base) and (@ctag='PPA')]" constraint="mylex(@base)/cat='pas'"/>
  </seq>
</rule>
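The lexicon lookup in the `Vpas` query amounts to checking that a token is a past participle whose base form is listed with the category `pas`. A minimal Python sketch of that check follows, using only the two lexicon entries shown on the slide; the dictionary and function names are illustrative.

```python
# The two entries shown on the slide; the full lexicon has 22 verbs.
LEXICON = {
    "chamar": "ref",          # reflexive use
    "chamar,chamado": "pas",  # passive use (past participle)
}


def is_passive_definition_verb(tok):
    """True if the token is a past participle listed in the lexicon as passive."""
    return tok.get("ctag") == "PPA" and LEXICON.get(tok.get("base")) == "pas"


print(is_passive_definition_verb({"base": "chamar,chamado", "ctag": "PPA"}))
```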
Results for verb_type
• Analyze each verb separately, as with is_type
• Richer syntactic patterns
Punctuation Grammar
<rule name="punct_def">
  <seq>
    <start/>
    <ref name="CompmylexSN" mult="+"/>
    <query match="tok[.~'^:$']"/>
    <ref name="tok" mult="+"/>
    <end/>
  </seq>
</rule>
● Preliminary work
● Definitions introduced by a colon (the most frequent case)
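As a simplified illustration of the colon pattern, the sketch below splits a tokenized sentence at its first colon and treats the left side as the term and the right side as the definition. Unlike the grammar rule, it does not check that the left side is a noun phrase, so it would over-generate on real text. The function name and sample tokens are made up.

```python
def split_on_colon(tokens):
    """Return (term tokens, definition tokens) around the first colon, or None."""
    if ":" not in tokens:
        return None
    i = tokens.index(":")
    term, definition = tokens[:i], tokens[i + 1:]
    # Both sides must be non-empty, mirroring mult="+" on each side of the colon.
    if not term or not definition:
        return None
    return term, definition


print(split_on_colon(["Intranet", ":", "rede", "interna"]))
```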
All-in-one
• Combination of the previous grammars
• The definition type is not taken into account when calculating precision and recall
Conclusions and Future Work
• Overall results: Recall 86%, Precision 14%
• Differences among domains: the style of a document influences the result
• Improve the rules for verb_type and punc_type
• Combine with other techniques, such as machine learning (ML)