Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study
Posted on 27-Jan-2016
Training Statistical Language Models from Grammar-Generated Data: A Comparative Case-Study
Manny Rayner, Geneva University
(joint work with Beth Ann Hockey and Gwen Christian)
Structure of talk
Background: Regulus and MedSLT
Grammar-based language models and statistical language models
What is MedSLT?
Open Source medical speech translation system for doctor-patient dialogues
Medium-vocabulary (400-1500 words)
Grammar-based: uses Regulus platform
Multilingual: translate through interlingua
MedSLT
Open Source medical speech translator for doctor-patient examinations
Main system unidirectional (patient answers non-verbally, e.g. nods or points)
– Also experimental bidirectional system
Two main purposes
– Potentially useful (could save lives!)
– Vehicle for experimenting with underlying Regulus spoken dialogue engineering toolkit
Regulus: central goals
Reusable grammar-based language models
– Compile into recognisers
Infrastructure for using them in applications
– Speech translation
– Spoken dialogue
Multilingual
Efficient development environment
Open Source
$25 (paperback edition) from amazon.com
The full story…
What kind of applications?
Grammar-based is
– Good on in-coverage data
– Good for complex, structured utterances
Users need to
– Know what they can say
– Be concerned about accuracy
Good target applications
– Safety-critical
– Medium vocabulary (~200 – 2000 words)
In particular…
Clarissa
– NASA procedure assistant for astronauts
– ~250 word vocabulary, ~75 command types
MedSLT
– Multilingual medical speech translator
– ~400 – ~1000 words, ~30 question types
SDS
– Experimental in-car system from Ford Research
– First prize, Ford internal demo fair, 2007
– ~750 words
Key technical ideas
Reusable grammar resources
Use grammars for multiple purposes
– Parsing
– Generation
– Recognition
Appropriate use of statistical methods
Reusable grammar resources
Building a good grammar from scratch is very challenging
Need a methodology for rational reuse of existing grammar structure
Use small corpus of examples to extract structure from a large resource grammar
The Regulus picture
[Diagram: a general unification grammar (UG), lexicon, training corpus and operationality criteria feed the EBL specialization step, producing an application-specific UG; the UG-to-CFG compiler yields a CFG grammar, the CFG-to-PCFG compiler (using the training corpus) a PCFG grammar, and the (P)CFG-to-recogniser compiler a Nuance recognizer.]
The general English grammar
Loosely based on SRI Core Language Engine grammar
Compositional semantics (4 different versions)
~200 unification grammar rules
~75 features
Core lexicon, ~450 words
(Also resource grammars for French, Spanish, Catalan, Japanese, Arabic, Finnish, Greek)
General grammar → domain-specific grammar
“Macro-rule learning”
Corpus-based process
Remove unused rules and lexicon items
Flatten parsed examples to remove structure
Simpler structure → less ambiguity → smaller search space
EBL example (1)
[Parse tree assigned by the general grammar to “when do you get headaches”: leaves PP, V, PRO, V, N under NBAR, NP and VBAR nodes, several stacked VP and S levels, and UTTERANCE at the root.]
EBL example (2)
[Same parse tree as in example (1), with the constituents selected by the operationality criteria highlighted.]
EBL example (3)
[Flattened parse tree for “when do you get headaches”: leaves PP, V, PRO, V, N, with only the NP and VBAR constituents retained below S.]
Main new rules:
S → PP VBAR NP VBAR
NP → N
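The flattening step can be sketched in a few lines of Python. The tree encoding, the cut categories and the example structure below are toy assumptions for illustration, not the actual Regulus operationality machinery:

```python
# A toy sketch of EBL-style grammar flattening (not the real Regulus code).
# A parse tree is (category, children); a leaf is (category, word-string).
# Nodes whose category is in CUT become daughters of the flattened rule;
# intermediate nodes are spliced out.

CUT = {"PP", "VBAR", "NP"}  # hypothetical operationality criteria

def flatten(tree):
    """Return the daughter categories of the flattened top-level rule."""
    cat, body = tree
    if isinstance(body, str):                 # lexical leaf
        return [cat]
    daughters = []
    for child in body:
        ccat, _ = child
        if ccat in CUT:
            daughters.append(ccat)            # cut here: child is a daughter
        else:
            daughters.extend(flatten(child))  # splice intermediate node out
    return daughters

# Toy structure for "when do you get headaches"
tree = ("S",
        [("PP", [("P", "when")]),
         ("VBAR", [("V", "do")]),
         ("NP", [("PRO", "you")]),
         ("VP", [("VBAR", [("V", "get"),
                           ("NP", [("N", "headaches")])])])])

print("S ->", " ".join(flatten(tree)))
```

The VP wrapper is removed and the embedded VBAR surfaces directly under S, which is exactly the "simpler structure, less ambiguity" effect described above.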
Using grammars for multiple purposes
Parsing
– Surface words → logical form
Generation
– Logical form → surface words
Recognition
– Speech → surface words
Building a speech translator
Combine Regulus-based components
– Source-language recognizer (speech → words)
– Source-language parser (words → logical form)
– Transfer from source to target, via interlingua (logical form → logical form)
– Target-language generator (logical form → words)
– (3rd party text-to-speech)
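The component chain above can be sketched as straight function composition. Every function below is a hypothetical stub standing in for a real Regulus module; the representations are invented for illustration:

```python
# Sketch of a MedSLT-style translation pipeline as function composition.
# All components are toy stand-ins, not the actual Regulus implementations.

def recognize(audio):      # speech -> source-language words
    return "where is the pain"

def parse(words):          # words -> source logical form
    return ("whq", "location", "pain")

def transfer(source_lf):   # source LF -> interlingua -> target LF
    interlingua = {"utterance_type": "whq", "symptom": "pain",
                   "query": "location"}
    return interlingua     # toy: target LF == interlingua here

def generate(target_lf):   # target LF -> target-language words
    return "ou avez-vous mal"

def translate(audio):
    return generate(transfer(parse(recognize(audio))))

print(translate(b"<audio>"))
```

A third-party text-to-speech engine would then render the generated words as audio.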
Adding statistical methods
Two different ways to use statistical methods:
Statistical tuning of grammar
Intelligent help system
Impact of statistical tuning
(Regulus book, chapter 11)
Base recogniser
– MedSLT with English recogniser
– Training corpus: 650 utterances
– Vocabulary: 429 surface words
Test data:
– 801 spoken and transcribed utterances
Vary vocabulary size
Add lexical items (11 different versions)
Total vocabulary 429 – 3788 surface words
New vocabulary not used in test data
Expect degradation in performance
– Larger search space
– New possibilities just a distraction
Impact of statistical tuning for different vocabulary sizes
[Chart: semantic error rate (0–25%) against vocabulary size (429, 1392, 2096, 2698, 3266, 3788 words) for the CFG and PCFG recognisers.]
Intelligent help system
Need robustness somewhere
Add a backup statistical recogniser
Use it to advise the user
– Approximate match with in-coverage examples
– Show user similar things they could say
Original paper: Gorrell, Lewin and Rayner, ICSLP 2002
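The approximate-match step might look something like this sketch, with bag-of-words overlap as a stand-in similarity measure; the example sentences and the metric are assumptions, not the system's actual ones:

```python
# Sketch of the help-system idea: match the backup recogniser's hypothesis
# against known in-coverage sentences and show the user the closest ones.
# Bag-of-words overlap is a toy similarity metric for illustration.

IN_COVERAGE = [
    "do you get headaches",
    "when do you get headaches",
    "is the pain severe",
    "does bright light cause the attacks",
]

def similar_examples(hypothesis, k=2):
    hyp = set(hypothesis.split())
    ranked = sorted(IN_COVERAGE,
                    key=lambda s: len(hyp & set(s.split())),
                    reverse=True)
    return ranked[:k]

# Out-of-coverage phrasing -> suggest similar in-coverage questions
print(similar_examples("you getting headaches when"))
```

The real system would draw `IN_COVERAGE` from the grammar's generated examples rather than a hand-written list.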
MedSLT experiments
(Chatzichrisafis et al, HLT workshop 2006)
French → English version of system
Basic questions
– How quickly do novices become experts?
– Can people adapt to limited coverage?
Let subjects use system several times, and track performance
Experimental Setup
Subjects
– 8 medical students, no previous knowledge of system
Scenario
– Experimenter simulates headache
– Subject must diagnose it
– 3 sessions, 3 tasks per session
Instruction
– ~20 min instructions & video (headset, push-to-talk)
– All other instruction from help system
Results – # Interactions
[Chart: number of interactions per session (Session 1: 98.6, Session 2: 63.4, Session 3: 53.9).]
Results – Time/Diagnosis
[Chart: time per diagnosis (minutes) across Sessions 1–3, broken down by Diagnosis 1, 2 and 3.]
Questionnaire results
I quickly learned how to use the system. 4.4
System response times were generally satisfactory. 4.5
When the system did not understand me, the help system usually showed me another way to ask the question. 4.6
When I knew what I could say, the system usually recognized me correctly. 4.3
I was often unable to ask the questions I wanted. 3.8
I could ask enough questions that I was sure of my diagnosis. 4.3
This system is more effective than non-verbal communication using gestures. 4.3
I would use this system again in a similar situation. 4.1
Summary
After 1.5 hours of use, subjects complete task in average of 4 minutes
– System implementers average 3 minutes
All coverage learned from help system
Subjects’ impressions very positive
A few words about interlingua
Coverage in different languages diverges if left to itself
– Want to enforce uniform coverage
Many-to-many translation
– “N²” problem
Solution: translate through interlingua
– Tight interlingua definition
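The arithmetic behind the "N²" problem is easy to make concrete; this sketch just counts the translation components needed under each architecture:

```python
# Why translate through an interlingua: with N languages, direct translation
# needs one translator per ordered language pair, while an interlingua needs
# only one analysis step and one generation step per language.

def direct_pairs(n):
    return n * (n - 1)   # ordered pairs: grows quadratically

def via_interlingua(n):
    return 2 * n         # analysis + generation per language: grows linearly

for n in (3, 6, 10):
    print(n, direct_pairs(n), via_interlingua(n))
```

Already at six languages the direct architecture needs 30 translation directions against 12 interlingua components.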
Interlingua grammar
Think of interlingua as a language
Define using Regulus
– Mostly for constraining representations
– Also get a surface form
“Semantic grammar”
– Not linguistic, all about domain constraints
Example of interlingua
Surface form: “YN-QUESTION pain become-better sc-when [ you sleep PRESENT ] PRESENT”
Representation:
[[utterance_type, ynq], [symptom, pain], [event, become_better], [tense, present],
 [sc, when], [clause, [[utterance_type, dcl], [pronoun, you], [action, sleep], [tense, present]]]]
Constraints from interlingua
Source language sentences licensed by grammar may not produce valid interlingua
Interlingua can act as a knowledge source to improve language modelling
Structure of talk
Background: Regulus and MedSLT
Grammar-based language models and statistical language models
Language models
Two kinds of language models
Statistical (SLM)
– Trainable, robust
– Require a lot of corpus data
Grammar-based (GLM)
– Require little corpus data
– Brittle
Compromises between SLM and GLM
Put weights on GLM (CFG → PCFG)
– Powerful technique, see earlier
– Doesn’t address robustness
Put GLMs inside SLMs (Wang et al, 2002)
Use GLM to generate training data for SLM (Jurafsky et al 1995, Jonson 2005)
Generating SLM training data with a GLM
Optimistic view
– Need only small seed corpus, to build GLM
– Will be robust, since finally an SLM
Pessimistic view
– “Something for nothing”
– Data for GLM could be used directly to build an SLM
Hard to decide
– Don’t know what data went into GLM
– Often just in grammar writer’s head
Regulus permits comparison
Use Regulus to build GLM
Data-driven process with explicit corpus
Same corpus can be used to build SLM
Comparison is meaningful
Two ways to build SLM
Direct
– Seed corpus → SLM
Indirect
– Seed corpus → GLM → corpus → SLM
Parameters for indirect method
Size of generated corpus
– Can generate any amount of data
Method of generating corpus
– CFG versus PCFG
Filtering
– Use interlingua to filter generated corpus
CFG versus PCFG generation
CFG
– Use plain GLM to do random generation
PCFG
– Use seed corpus to weight GLM rules
– Weights then used in random generation
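The contrast between the two generation modes can be sketched as follows. The toy grammar, the rule encoding and the seed-corpus counts are all invented for illustration, not taken from Regulus:

```python
import random
from collections import Counter

# Sketch of CFG vs PCFG random generation. In CFG mode every expansion of a
# category is equally likely; in PCFG mode expansions are weighted by how
# often the seed corpus used them. Grammar and counts are toy examples.

RULES = {
    "S":  [("do you get", "NP"), ("is the", "NP", "severe")],
    "NP": [("headaches",), ("pain",), ("attacks",)],
}

# Hypothetical rule-usage counts extracted from parses of a seed corpus:
SEED_COUNTS = Counter({("S", 0): 8, ("S", 1): 2,
                       ("NP", 0): 5, ("NP", 1): 4, ("NP", 2): 1})

def expand(cat, weighted, rng):
    if cat not in RULES:                      # terminal string
        return cat
    options = RULES[cat]
    if weighted:                              # PCFG: corpus-weighted choice
        w = [SEED_COUNTS[(cat, i)] for i in range(len(options))]
        rhs = rng.choices(options, weights=w)[0]
    else:                                     # CFG: uniform choice
        rhs = rng.choice(options)
    return " ".join(expand(sym, weighted, rng) for sym in rhs)

rng = random.Random(0)
print(expand("S", weighted=True, rng=rng))    # PCFG-style generation
print(expand("S", weighted=False, rng=rng))   # CFG-style generation
```

Weighting shifts the generated distribution toward frequent seed-corpus constructions, which is why the PCFG-generated data below looks much more natural than the CFG-generated data.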
Interlingua filtering
Impossible to make GLM completely tight
Many in-coverage sentences make no sense
Some of these don’t produce valid interlingua
Use interlingua grammar as filter
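Filtering might be sketched like this, with a toy attribute-value check standing in for the real interlingua grammar; the allowed attributes and the candidate representations are assumptions for illustration:

```python
# Sketch of interlingua filtering: keep a generated sentence only if its
# semantic representation passes the interlingua check. Here the "grammar"
# is a toy whitelist of attribute-value pairs; the real system runs the
# Regulus interlingua grammar over the representation.

ALLOWED = {"utterance_type": {"ynq", "whq", "dcl"},
           "symptom": {"pain", "headache", "nausea"},
           "event": {"become_better", "become_worse"},
           "tense": {"present", "past"}}

def valid_interlingua(lf):
    return all(key in ALLOWED and value in ALLOWED[key]
               for key, value in lf)

candidates = [
    ("does the pain become better", [("utterance_type", "ynq"),
                                     ("symptom", "pain"),
                                     ("event", "become_better"),
                                     ("tense", "present")]),
    ("are there its cigarettes",    [("utterance_type", "ynq"),
                                     ("symptom", "cigarettes"),
                                     ("tense", "present")]),
]

kept = [sent for sent, lf in candidates if valid_interlingua(lf)]
print(kept)   # only the sentence with a valid representation survives
```

Nonsense sentences like the second candidate are generated by the grammar but rejected because their representations fall outside the interlingua.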
Example: CFG generated data
what attacks of them 're your duration all day
have a few sides of the right sides regularly frequently hurt
where 's it increased
what previously helped this headache
have not any often ever helped
are you usually made drowsy at home
what sometimes relieved any gradually during its night
's this severity frequently increased before helping
when are you usually at home
how many kind of changes in temperature help a history
Example: PCFG generated data
does bright light cause the attacks
are there its cigarettes
does a persistent pain last several hours
is your pain usually the same before
were there them when this kind of large meal helped joint pain
do sudden head movements usually help to usually relieve the pain
are you thirsty
does nervousness aggravate light sensitivity
is the pain sometimes in the face
is the pain associated with your headaches
Example: PCFG generated data with interlingua filtering
does a persistent pain last several hours
do sudden head movements usually help to usually relieve the pain
are you thirsty
does nervousness aggravate light sensitivity
is the pain sometimes in the face
have you regularly experienced the pain
do you get the attacks hours
is the headache pain better
are headaches worse
is neck trauma unchanging
Experiments
Start with same English seed corpus
– 948 utterances
Generate GLM recogniser
Generate different types of training corpus
– Train SLM from each corpus
Compare recognition performance
– Word Error Rate (WER)
– Sentence Error Rate (SER)
McNemar sign test on SER to get significance
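The evaluation metrics can be sketched directly: WER is word-level edit distance over the reference, SER is the exact-match failure rate, and a two-sided sign test over discordant sentence pairs approximates the McNemar-style significance test mentioned above. All numbers below are illustrative, not the experiment's data:

```python
import math

# Sketch of the evaluation machinery: word error count via edit distance,
# plus a two-sided sign test on per-sentence win/loss counts between two
# recognisers. Toy inputs only.

def word_errors(ref, hyp):
    """Minimum substitutions + insertions + deletions between word lists."""
    r, h = ref.split(), hyp.split()
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(h) + 1)]
         for i in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (r[i - 1] != h[j - 1]))
    return d[-1][-1]

def sign_test(b, c):
    """Two-sided binomial sign test on b vs c discordant pairs."""
    n, k = b + c, min(b, c)
    p = 2 * sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, p)

print(word_errors("when do you get headaches", "when do you get a headache"))
print(sign_test(3, 12))   # 3 sentences where A loses, 12 where A wins
```

SER itself is just the fraction of test sentences where `word_errors(ref, hyp) > 0`.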
Experiment 1: different methods
Version               Corpus size   WER      SER
GLM                   948           21.96%   50.62%
SLM, seed corpus      948           27.74%   58.40%
SLM, CFG, no filter   4281          49.00%   88.40%
SLM, CFG, filter      4281          44.68%   85.68%
SLM, PCFG, no filter  4281          25.98%   65.31%
SLM, PCFG, filter     4281          25.81%   63.70%
Experiment 1: significant differences
GLM >> all SLMs
Seed corpus >> all generated corpora
PCFG generation >> CFG generation
Filtered > not filtered
However, generated corpora are small…
Experiment 2: different sizes of corpus
Version               Corpus size   WER      SER
GLM                   948           21.96%   50.62%
SLM, seed corpus      948           27.74%   58.40%
SLM, PCFG, no filter  16,619        24.84%   62.47%
SLM, PCFG, filter     16,619        23.80%   59.51%
SLM, PCFG, no filter  497,798       24.38%   59.88%
SLM, PCFG, filter     497,798       23.76%   57.16%
Experiment 2: significant differences
GLM >> all SLMs
Large corpus > small corpus
Large unfiltered generated corpus ~ seed corpus
– SER for large unfiltered corpus about the same
Large filtered generated corpus ~/> seed corpus
– SER for large filtered corpus better, but not significant
Filtered > not filtered
Experiment 3: like 2, but only in-coverage data
Version               Corpus size   WER      SER
GLM                   948           7.00%    22.37%
SLM, seed corpus      948           14.40%   42.02%
SLM, PCFG, no filter  16,619        14.13%   46.11%
SLM, PCFG, filter     16,619        12.76%   40.86%
SLM, PCFG, no filter  497,798       12.35%   40.66%
SLM, PCFG, filter     497,798       11.25%   36.19%
Experiment 3: significant differences
GLM >> all SLMs
Large corpus > small corpus
Large unfiltered generated corpus ~/> seed corpus
– SER for large unfiltered corpus better, not significant
Large filtered generated corpus > seed corpus
Filtered > not filtered
Using GLMs to make SLMs: conclusions
Regulus lets us evaluate fairly
Indirect method for building SLM only slightly better than direct one
GLM better than all SLM variants
– Especially clear on in-coverage data
PCFG generation much better than CFG
Summary
MedSLT
– Potentially useful tool for doctors in future
– Good test-bed for research now
Using GLMs to build SLMs
– Example of how Regulus lets us evaluate a grammar-based method objectively
For more information
Regulus websites
http://sourceforge.net/projects/regulus/
http://www.issco.unige.ch/projects/regulus/
Rayner, Hockey and Bouillon, “Putting Linguistics Into Speech Recognition” (CSLI Press, June 2006)