SOFTWARE AND TOOLS FOR CORPUS PATTERN ANALYSIS
Vít Baisa, Ismaïl El Maarouf, Adam Rambousek, Pavel Rychlý
OUTLINE
Corpus Pattern Analysis
Annotation in Sketch Engine
CPA editor
Public access
SemEval 2015
LEMON API
INTRODUCTION
tools and datasets to support the Pattern Dictionary of English Verbs (PDEV)
since 2006
DVC project 2012–2015
CPA
associating word meaning with word use by an analysis of phraseological patterns and collocations
meaning is associated with prototypical sentence contexts
concordance lines are grouped into semantically motivated syntagmatic patterns
hard problem: granularity
Subj, Obj, Complement, Adverbial, Indirect Obj
British National Corpus (written part)
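The grouping of concordance lines into patterns can be sketched in a few lines of Python. This is only an illustration, not the Sketch Engine implementation: the function names `group_by_pattern` and `pattern_shares`, the integer pattern labels, and the example sentences are all assumptions made up for this sketch (the sentences are not taken from the BNC).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_pattern(lines: List[Tuple[str, int]]) -> Dict[int, List[str]]:
    """Group (concordance line, pattern number) pairs by pattern number."""
    groups: Dict[int, List[str]] = defaultdict(list)
    for text, pattern in lines:
        groups[pattern].append(text)
    return dict(groups)


def pattern_shares(groups: Dict[int, List[str]]) -> Dict[int, float]:
    """Return the fraction of annotated lines that falls under each pattern."""
    total = sum(len(texts) for texts in groups.values())
    if total == 0:
        return {}
    return {pattern: len(texts) / total for pattern, texts in groups.items()}


# Invented example lines, labelled by hand for illustration only.
sample = [
    ("They reaped the harvest in August.", 1),
    ("He will reap the whirlwind.", 2),
    ("Farmers reap the harvest early.", 1),
]

groups = group_by_pattern(sample)
shares = pattern_shares(groups)
```

Here `shares` maps each pattern number to its proportion of the annotated sample, which is the kind of per-pattern figure a dictionary entry could report.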
CPA II
determiners: take place vs. take his place
semantic types: build [[Machine]] vs. build [[Relationship]]
contextual roles: [[Human = Director]] shoot vs. [[Human = Sports Player]] shoot
lexical sets: reap {the whirlwind} vs. reap {the harvest}
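The bracket notation above is regular enough to be read by a small tokenizer. The following is a minimal sketch, assuming only the three forms shown on this slide: `[[Type]]`, `[[Type = Role]]` and `{lexical set}`. The `Slot` class and `parse_pattern` function are hypothetical names for this illustration and are not part of the CPA editor or PDEV code.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    kind: str                   # "semantic_type", "lexical_set", or "word"
    value: str
    role: Optional[str] = None  # contextual role, only for semantic types


# Alternatives are tried in order: [[...]], then {...}, then a plain word.
TOKEN = re.compile(
    r"\[\[(?P<stype>[^\]]+)\]\]|\{(?P<lexset>[^}]+)\}|(?P<word>\S+)"
)


def parse_pattern(text: str) -> List[Slot]:
    """Split a pattern string into semantic types, lexical sets and words."""
    slots: List[Slot] = []
    for match in TOKEN.finditer(text):
        if match.group("stype") is not None:
            # "Human = Director" -> type "Human", role "Director"
            name, _, role = match.group("stype").partition("=")
            slots.append(Slot("semantic_type", name.strip(), role.strip() or None))
        elif match.group("lexset") is not None:
            slots.append(Slot("lexical_set", match.group("lexset").strip()))
        else:
            slots.append(Slot("word", match.group("word")))
    return slots


director = parse_pattern("[[Human = Director]] shoot")
harvest = parse_pattern("reap {the harvest}")
```

The sketch expects whitespace between a word and a bracketed slot; the full PDEV notation has more constructs than these three and would need a real grammar.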
ANNOTATION IN SKETCH ENGINE
not so well-known feature of SkE
no paper published (until now)
not documented :)
commit 7b682e7473d935b14f48b7b5352838318bfda523
Author: pary
Date: Sat Sep 2 22:34:10 2006 +0000
[bonito2 @ 2006-09-02 22:34:10 by pary] added linegroup/annotconc
ANNOTATION IN SKETCH ENGINE II
features for lexicographers
annotation with word sketches
bootstrapping of partial annotation
automatic patterns
training mode
multi-line labelling
custom labels
ANNOTATION IN SKETCH ENGINE III
synchronization with CPA editor
basic statistics
CPA EDITOR
JavaScript (jQuery), standalone
connected to SkE and DEB server
creating and managing PDEV entries, patterns, ontology
code used by Ken Litkowski for PDEP
CPA PUBLIC ACCESS
simplified interface
live data, complete verbs
SEMEVAL DATASET
NAACL 2015
three tasks
CPA parsing
CPA clustering
CPA pattern editing
Microcheck, Wingspread
auto-cpa user
LEMON API
an official release of PDEV as linked data
RDF scheme, used by WordNet, DBpedia, ...
17,634 triples
CONCLUSION, FUTURE WORK
a reference for future articles
consolidation of code (merge)
other projects are planned (CPA for nouns, adjectives)
linking English, Italian, Spanish pattern dictionaries (EURALEX)
full CPA bibliography in the proceedings