Post on 12-Jan-2016
Pragmatic Annotation & Analysis in DART
Martin Weisser
School of English & Education
Guangdong University of Foreign Studies
weissermar@gmail.com
martinweisser.org
Outline
• Getting DART
• Design Background
• DART Annotation Scheme
• Basic Automated Annotation
• Speech-Act Analysis
• N-Gram Analysis
• Creating & Editing Resources
Getting DART
• go to http://martinweisser.org/ling_soft.html#DART
• download & run installer (currently 64bit Win only)
Design Background (1)
• 1997–1998: Expert Advisory Group on Language Engineering Standards (EAGLES) WP4
– guidelines for the representation and annotation of dialogue
• 2001–2002: SPAAC (A Speech-Act Annotated Corpus of Dialogues) Project
– annotation of some 1,200 task-oriented dialogue files (Trainline + BT)
– need to annotate and post-edit corpus efficiently and consistently on multiple levels → SPAACy
Design Background (2)
colour coding helps to identify syntactic patterns
post-processing constrained through fixed options
resources loaded automatically
Design Background (3)
• flaws in SPAACy
– monolithic, i.e. no separation of ‘linguistic intelligence’ & output display → hard to improve linguistic analysis
– processing & editing of single files only
– other interface issues, e.g. too many buttons, etc.
• development of DART
– modularisation
– strict separation of processing and linguistic analysis routines
– enhanced options for analysis and creation of resources
DART Annotation Scheme (1) – Basic Input Format
optional stylesheet reference
text with optional punctuation ‘tags’ or embedded comments
basic skeleton can be created via ‘File→New’ (Ctrl + n)
DART Annotation Scheme (2) – Output Format
syntactic category
mode = semantico-pragmatic markers/‘IFIDs’
topic = semantic info
(surface) polarity
speech act(s)
speech act generally inferred from combination of syntax + mode
Basic Automated Annotation
input files workspace
output files workspace
to load single file, press Ctrl + a (for whole directory, Ctrl + d)
single file loaded; to pre-edit, click hyperlink; to annotate pragmatically, press Ctrl + a
debugging output; ignore if annotation completes successfully
single file processed; to post-edit, click hyperlink
Speech-Act Analysis
• generate frequency list of syntactic category + speech act(s) from ‘Analysis→Speech act stats’
• click hyperlinked speech act (combination) to prime concordancer
• investigate results
• if necessary, correct speech act tag by clicking the hyperlink to the file and editing it
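The frequency list described above can be pictured as a simple tally over (syntactic category, speech act(s)) pairs. The following is a minimal sketch of that idea only, not DART's own routine: the function name `speech_act_stats`, the in-memory input, and the category and speech-act labels in the sample data are all hypothetical.

```python
from collections import Counter


def speech_act_stats(units):
    """Count combinations of syntactic category and speech act(s).

    `units` is an iterable of (category, [speech_act, ...]) pairs.
    Returns (category, speech_acts, count) rows, most frequent first.
    """
    counts = Counter((category, tuple(acts)) for category, acts in units)
    return [
        (category, "+".join(acts), n)
        for (category, acts), n in counts.most_common()
    ]


# Hypothetical sample data; the labels are illustrative, not DART's tagset.
sample_units = [
    ("decl", ["state"]),
    ("q-yn", ["reqInfo"]),
    ("decl", ["state"]),
    ("decl", ["state", "answer"]),
]

for row in speech_act_stats(sample_units):
    print(row)
```

Each row corresponds to one entry of the frequency list; in the tool itself the speech act (combination) is a hyperlink that primes the concordancer.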
N-Gram Analysis
• useful for determining formulaic expressions for modes or topic patterns (or in general)
• predefined options for uni- to tri-grams
• optionally also freely definable n-grams
• frequency lists display abs. & rel. frequencies
• hyperlink again primes concordancer
– for all n>1 with interpolated optional fillers
– due to accommodating mixed-case data, sometimes ‘case insensitive’ flag required
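To make the two ideas on this slide concrete, here is a rough sketch of an n-gram frequency list with absolute and relative frequencies, plus a search pattern that allows optional filler words between the words of an n-gram. It illustrates the general technique under assumptions: the function names, the whitespace tokenisation, and the `max_fillers` parameter are hypothetical and do not describe how DART builds its own concordance queries.

```python
import re
from collections import Counter


def ngram_frequencies(tokens, n):
    """Map each n-gram to (absolute frequency, relative frequency)."""
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    total = len(grams)
    return {gram: (count, count / total) for gram, count in counts.items()}


def concordance_pattern(ngram, max_fillers=1, ignore_case=False):
    """Build a regex for the n-gram's words, in order, allowing up to
    `max_fillers` extra words between neighbouring words."""
    filler = r"(?:\s+\S+){0,%d}?" % max_fillers
    body = (filler + r"\s+").join(re.escape(word) for word in ngram)
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(r"\b" + body + r"\b", flags)


tokens = "i would like to book a ticket i would like a return".split()
bigrams = ngram_frequencies(tokens, 2)
print(bigrams[("i", "would")])

# Mixed-case data: the pattern only matches "I would" with ignore_case=True.
pattern = concordance_pattern(("i", "would"), ignore_case=True)
print(bool(pattern.search("I would like a ticket")))
```

The `ignore_case` switch mirrors the point about the ‘case insensitive’ flag: an n-gram list built from lower-cased tokens will not match capitalised text unless the search ignores case.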
Creating & Editing Resources (1)
• mostly done via ‘Edit resources’ menu…
• … apart from creating new files
• to create new corpus
– choose ‘Edit configuration’
– click ‘Add corpus entry’
– fill in corpus, lexicon, and topic file name (usually identical, apart from extension)
– click ‘Save configuration’
• new resources created
– data folder for corpus
– three subfolders: ‘info’, ‘notes’, and ‘stats’
– dummy lexicon & topics files (in relevant program folders)
Creating & Editing Resources (2)
• existing resources can be edited…
– generally via relevant entry in the ‘Edit resources’ menu
– lexica & topic files via hyperlinks in configuration editor
• safest to edit only dialogue, lexica & topic files…
• … unless you really know what you’re doing
• lexica can also be ‘synthesised’ from corpus data
Creating & Editing Resources (3) – Lexica
• very simple format
– word (base form) + space + tag + optional comment (preceded by #)
– special DART tagset
• allows for lexical polysemy
– uppercase tag name = unambiguous
– lowercase tag name = predominantly tag X
• tooltips on tag buttons provide explanations while editing
• synthesising lexicon works by
– creating word list from corpus
– ‘subtracting’ items from general lexicon
– suggesting possible candidates after morphological analysis
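The lexicon format and the first two synthesis steps can be sketched as follows. This is an illustration under assumptions, not DART's code: the function names are hypothetical, the tags `N` and `v` in the sample entries are placeholders rather than members of the DART tagset, single-word entries are assumed, and the morphological-analysis step is left out entirely.

```python
def parse_lexicon_line(line):
    """Split 'word tag # comment' into (word, tag, comment).

    Returns None for blank lines and lines that hold only a comment.
    """
    entry, _, comment = line.partition("#")
    parts = entry.split()
    if len(parts) < 2:
        return None
    word, tag = parts[0], parts[1]
    return word, tag, comment.strip() or None


def is_unambiguous(tag):
    """Uppercase tag name = unambiguous; lowercase = predominantly that tag."""
    return tag.isupper()


def synthesis_candidates(corpus_tokens, general_lexicon):
    """Create a word list from the corpus and 'subtract' the general lexicon.

    The remaining words are what would still need a tag suggestion.
    """
    words = {token.lower() for token in corpus_tokens if token.isalpha()}
    return sorted(words - set(general_lexicon))


# Hypothetical entries; tag names are placeholders.
lines = ["ticket N # travel noun", "book v", "# comment only", ""]
entries = [e for e in map(parse_lexicon_line, lines) if e is not None]
print(entries)
print(synthesis_candidates(["A", "return", "ticket", "please"], {"a", "ticket"}))
```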
Creating & Editing Resources (4) – Topic Files
• syntax more complex than for lexica
• combination of topic labels, space, double colon, space, associated (representative) patterns
• patterns expressed as
– regexes
– individual sub-patterns separated by 3 underscores
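A topic-file line of the shape described above could be read roughly like this. The sketch makes several assumptions: the function names and the sample entry are hypothetical, one label per line is assumed, Python's regex syntax stands in for whatever regex flavour DART uses, and the sub-patterns are treated as alternatives (any one of them signals the topic), which the slide does not state.

```python
import re


def parse_topic_line(line):
    """Split 'label :: pattern___pattern' into (label, [compiled regexes])."""
    label, separator, patterns = line.rstrip("\n").partition(" :: ")
    if not separator:
        raise ValueError("expected 'label :: patterns', got: %r" % line)
    sub_patterns = [p for p in patterns.split("___") if p]
    return label.strip(), [re.compile(p) for p in sub_patterns]


def topics_for(text, entries):
    """Return the labels whose sub-patterns match the text.

    Assumption: sub-patterns are alternatives, so one match is enough.
    """
    return [
        label
        for label, regexes in entries
        if any(regex.search(text) for regex in regexes)
    ]


# Hypothetical entry: label, ' :: ', two sub-patterns joined by '___'.
sample_line = r"time :: \b(?:morning|afternoon)\b___\bo'clock\b"
entries = [parse_topic_line(sample_line)]
print(topics_for("leaving at ten o'clock", entries))
```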