IBM
NSF Workshop Atlanta, Georgia 7-8 Oct 2003
Speech Recognition: It Takes a Village to Raise a Child
Michael Picheny
Human Language Technologies Group, IBM Thomas J. Watson Research Center
Special thanks to: Stan Chen, Yuqing Gao, Ramesh Gopinath, Makis Potamianos, Bhuvana Ramabhadran, Bowen Zhou, and Geoff Zweig
RT-03 Spring STT evaluation results (NIST): http://www.nist.gov/speech/tests/rt/rt2003/spring/presentations/rt03s-stt-results-v9.pdf
Developmental Factors
[Timeline figure, 1988-2003, tagging each year's main advances by category]
1988-1995: CD HMMs, multiple codebooks, tied mixtures, double deltas, PTMs, STMs, MLLR, SAT, PLP, more data (categories: modeling, signal processing, adaptation, data)
1996-2003: multiple models, VTLN, MLLT, BIC, ROVER, fMLLR-SAT, MMI, FSTs, MPE, more data (categories: modeling, adaptation, decoding, training, data)

• Bulk of improvements from better modeling and more data, closely followed by adaptation (a form of modeling)
Continue the Basics: Advances in Gaussian Modeling

• Most current systems model speech as a mixture of diagonal Gaussians, but there is a nagging suspicion that full-covariance models would be better.
• Try to approximate full-covariance models with a controlled increase in the number of parameters (Axelrod, 2003):
EMLLT: precision matrices built from a shared basis of rank-one matrices,
$$P_g = \sum_{k=1}^{D} \lambda_{gk}\, a_k a_k^{T}, \qquad d \le D \le \frac{d(d+1)}{2}$$

PCGMM: precision matrices built from a shared basis of symmetric matrices,
$$P_g = \sum_{k=1}^{D} \lambda_{gk}\, S_k, \qquad d \le D \le \frac{d(d+1)}{2}$$
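One computational attraction of such basis-constrained precision models falls directly out of the formula: in the EMLLT case the Gaussian quadratic form decomposes as $x^T P_g x = \sum_k \lambda_{gk} (a_k^T x)^2$, so the basis projections $a_k^T x$ can be computed once per frame and shared across all Gaussians. A minimal sketch (toy dimensions and values, not from the talk):

```cpp
// Sketch of EMLLT-style evaluation: with P_g = sum_k lambda_gk * a_k a_k^T,
// x^T P_g x = sum_k lambda_gk * (a_k^T x)^2, so the projections a_k^T x
// are computed once and reused for every Gaussian g.
#include <cstdio>
#include <vector>

// Projections p_k = a_k^T x, shared across all Gaussians.
std::vector<double> project(const std::vector<std::vector<double>>& basis,
                            const std::vector<double>& x) {
    std::vector<double> p(basis.size(), 0.0);
    for (size_t k = 0; k < basis.size(); ++k)
        for (size_t i = 0; i < x.size(); ++i)
            p[k] += basis[k][i] * x[i];
    return p;
}

// Per-Gaussian quadratic form using the precomputed projections.
double quad_form(const std::vector<double>& lambda,  // lambda_gk for one g
                 const std::vector<double>& p) {     // a_k^T x
    double q = 0.0;
    for (size_t k = 0; k < lambda.size(); ++k)
        q += lambda[k] * p[k] * p[k];
    return q;
}

int main() {
    // Toy example: d = 2, D = 3 basis vectors (hypothetical values).
    std::vector<std::vector<double>> basis = {{1, 0}, {0, 1}, {1, 1}};
    std::vector<double> x = {0.5, -1.0};
    std::vector<double> p = project(basis, x);
    std::vector<double> lambda = {0.8, 1.2, 0.3};  // one Gaussian's weights
    std::printf("x^T P x = %f\n", quad_form(lambda, p));
    return 0;
}
```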
Advances in Gaussian Modeling

D (d = 52)   EMLLT   PCGMM
d            2.67    1.96
2d           2.04    1.75
8d           1.65    1.64
26.5d        1.58    1.58

(For d = 52, D = 26.5d = d(d+1)/2, i.e., the full-covariance limit.)
Advances in Gaussian Modeling

nGauss    Diagonal   Full
5000      3.48       1.83
10000     2.68       1.56
42993     2.00       1.35
142622    1.74       1.54
350286    1.68       (forget it)
609100    1.65       (really forget it)

• 10k FC model better than the 600k diagonal model, with 20% of the parameters
• FC models clearly prone to overtraining. PCGMM helps, but still increases the number of parameters
• Clearly need lots more acoustic data to train even PCGMM models, much less FC models
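A rough parameter count (assuming d = 52, as on the preceding slide, and counting only means plus covariance/precision terms) is consistent with the first bullet:

$$10{,}000 \times \Bigl(52 + \frac{52 \cdot 53}{2}\Bigr) \approx 14.3\text{M} \qquad \text{vs.} \qquad 609{,}100 \times (52 + 52) \approx 63.3\text{M},$$

i.e., the 10k full-covariance system carries roughly 20-25% of the 600k diagonal system's Gaussian parameters.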
Teens: Utilizing Linguistic Information in ASR

• Standard LVCSR does not explicitly use linguistic information
• Past history is littered with failures
• Over the last few years the area has begun to show signs of life
Syntactic Structured LM
Exploiting syntactic dependencies (Chelba ACL98, Wu ICASSP00)
[Figure: partial parse of "The contract ended with a loss of 7 cents after", with POS tags DT NN VBD IN DT NN IN CD NNS and exposed headwords h_{-1} = ended (VP), h_{-2} = contract (NP) used to predict the next word w_i]

$$P(w_i \mid W_{i-1}) = \sum_{T_{i-1} \in S_{i-1}} P(w_i \mid W_{i-1}, T_{i-1})\, P(T_{i-1} \mid W_{i-1})$$

$$P(w_i \mid W_{i-1}, T_{i-1}) = P(w_i \mid w_{i-1}, w_{i-2}, h_{-1}, h_{-2}, nt_{i-1}, nt_{i-2})$$
• Observed performance improvements of ~1% absolute on SWB/BN
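A minimal sketch (toy probabilities; the type and function names are hypothetical stand-ins, not the authors' code) of the marginalization over parse hypotheses in the first equation above:

```cpp
// Structured-LM word prediction:
// P(w_i | W_{i-1}) = sum over parse hypotheses T of
//                    P(w_i | heads/tags of T) * P(T | W_{i-1}).
#include <cstdio>
#include <string>
#include <vector>

struct ParseHypothesis {
    std::string h1, h2;  // exposed headwords, e.g. "ended", "contract"
    double prob;         // P(T | W_{i-1}), normalized over the hypothesis list
};

// Toy stand-in for the trained head-conditioned word model P(w | h1, h2, ...).
double word_given_parse(const std::string& w, const ParseHypothesis& t) {
    if (t.h1 == "ended" && w == "with") return 0.4;  // hypothetical value
    return 0.01;                                     // flat fallback
}

// Marginalize the word probability over the surviving parse hypotheses.
double next_word_prob(const std::string& w,
                      const std::vector<ParseHypothesis>& parses) {
    double p = 0.0;
    for (const auto& t : parses)
        p += word_given_parse(w, t) * t.prob;
    return p;
}

int main() {
    std::vector<ParseHypothesis> parses = {
        {"ended", "contract", 0.7},  // "The contract ended ..." parse
        {"contract", "The", 0.3},    // a competing partial parse
    };
    std::printf("P(with | ...) = %f\n", next_word_prob("with", parses));
    return 0;
}
```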
Semantic Structured LM
Exploiting semantic dependencies (Erdogan ICSLP02)
[Figure: semantic parse of "I want to book a one way ticket to Houston Texas for tomorrow morning": concept nodes BOOK, RT-OW, LOC-TO (city, state), DATE (day), TIME (timerng) under a SEGMENT root, with null labels on filler words]

$$p(w_j \mid W_{j-1}) = p(w_j \mid w_{j-1}, w_{j-2}, p_j, g_j, c_j)$$

where $w_j$ is predicted from the two previous words together with labels $p_j$, $g_j$, $c_j$ drawn from the semantic parse.
• Reductions in error rate of ~20% for limited-domain tasks
Super-Structured LM for LVCSR
[Figure: a "super-structured" LM combining the word sequence W_1, ..., W_N with a syntactic parser, semantic parser, dialogue state, named entities, world knowledge, and speaker information (turn, gender, ID)]

• Such an LM would clearly require substantially more annotated data than is currently available
Nutrition: "There's no data like more data" -- Robert L. Mercer

Training (hrs)   141    297    602    843
WER (%)          17.2   15.4   14.7   14.5

(RT03 Workshop (BBN); LIMSI: Lamel (2002))
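Read as relative improvements, the table shows the diminishing returns directly:

$$\frac{17.2 - 15.4}{17.2} \approx 10.5\%\ \text{relative (141 to 297 hrs)}, \qquad \frac{14.7 - 14.5}{14.7} \approx 1.4\%\ \text{relative (602 to 843 hrs)}.$$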
Nutrition: "There is no data like more data"

• A "Super-Structured LM" would probably need 100x the current data
• A large amount of linguistic knowledge sources is now available:
  – Brown corpus, Penn Treebank (syntactically & semantically annotated)
  – WordNet and FrameNet, Cyc ontologies, online dictionaries and thesauri
  – Text data from the WWW
• How to provide the necessary annotation at reasonable cost? May require a community effort.
Speed, Data & Computing
• Decoding today is fast
  – ML training even faster
  – Discriminative training same cost as decoding: ~5-10x RT for numerous iterations
• But
  – The data is growing, e.g. the EARS program is aiming for:
    • 2000 hrs/year telephony
    • 5000 hrs/year BN
    • ~10x increase from current levels
  – Evidence suggests that new & costlier algorithms are necessary to exploit more data
• So
  – Need a minimum 10x increase in compute power just to track the data
  – 100x to run 10x RT programs rather than 1x RT programs

[Figure: decoding pipeline from acoustic signal to words, 0.85x RT total: speech/non-speech segmentation (0.01x RT) → speaker-independent decoding (0.11x RT) → adaptive transforms (0.1x RT) → speaker-adaptive decoding (0.63x RT)]
The BlueGene Frontier
• 200 TeraFlop computer
  – Combined power of the top 500 supercomputers in 2001
  – 65,000 processors
    • 2GB per processor
    • ~1GHz clock
    • 3D torus interconnection
    • 2 nodes per card; 16 cards per board; 16 boards per plane; 2 planes per rack
  – Pieces beginning to be tested
  – Intended for molecular dynamics, but available for other uses
• Potential ASR applications
  – Physics-based articulatory modeling
  – Brute-force parameter adjustment to minimize WER
  – Large-scale neural network modeling
  – Incorporation of visual processing
The Village: Collaborative Paradigms
• Originally progress in ASR was haphazard: no way to compare results, which generated a lot of skepticism because of the NIH ("not invented here") syndrome
• Evaluation-driven ASR programs (prior DARPA, DOD paradigms)
  – Provided a common metric to compare algorithms
  – Funding based on relative performance of each site
    • Sites hope not only to do well, but for other sites to do badly
    • Discourages free exchange of resources between sites
    • Large portion of each site's effort spent replicating other sites' algorithms + data
• Non-evaluation-driven programs: NGSW, MALACH
• How to encourage collaboration while retaining the motivation of competition?
  – While also retaining objective evaluation of progress
• Recent EARS program a strong step in this direction
• Even broader collaboration possible through sharing resources
Encouraging Collaboration
• Decompose ASR systems into modules
  – Front end; acoustic model; pronunciation model; language model; adaptation; search; etc.
• Sites collaborate to create a single ASR system rather than one per site
  – Each site works on writing better ASR modules rather than complete ASR systems
  – Each module (e.g., MMIE, VTLN, etc.) need only be implemented once across all sites
  – Progress measured and credit assigned to sites based on how modules affect the WER of the global system
Existence Proof: UIMA (Unstructured Information Management Architecture)
http://www.research.ibm.com/people/a/aspector/presentations/www2000f.pdf
Charts courtesy of David Ferrucci
• Accelerate Progress in Search and Analysis
  – Reuse across teams
  – Ease of experimentation
  – Combination Hypothesis
Text analysis through a series of annotators
[Figure: a chain of annotators; each annotator analyzes, recognizes & labels specific semantic content for the next consumer:
– Detagger: document with HTML tags identified and content extracted
– Tokenizer: document with tokens (e.g., words) identified
– Language Identification (document level): document labeled with the language of the text
– Part of Speech (word level): each word labeled with its part of speech
– Named Entities (phrase level): document with names identified
– Semantic Classes: semantic classes identified]
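To make the chaining concrete, here is a minimal C++ sketch (an illustration under stated assumptions, not the UIMA API; all class and field names are hypothetical) of annotators that each add labels to a shared document before handing it on:

```cpp
// Annotator-pipeline sketch: each stage reads the document plus prior
// annotations and adds its own labels for the next consumer in the chain.
#include <cstdio>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Document {
    std::string text;
    std::multimap<std::string, std::string> annotations;  // type -> value
};

class Annotator {
public:
    virtual ~Annotator() = default;
    virtual void process(Document& doc) = 0;
};

class LanguageId : public Annotator {
public:
    void process(Document& doc) override {
        // Toy heuristic standing in for a real language classifier.
        doc.annotations.insert({"language", "en"});
    }
};

class Tokenizer : public Annotator {
public:
    void process(Document& doc) override {
        std::string tok;
        for (char c : doc.text + " ") {
            if (c == ' ') {
                if (!tok.empty()) doc.annotations.insert({"token", tok});
                tok.clear();
            } else {
                tok += c;
            }
        }
    }
};

int main() {
    Document doc{"the contract ended", {}};
    std::vector<std::unique_ptr<Annotator>> pipeline;
    pipeline.push_back(std::make_unique<LanguageId>());
    pipeline.push_back(std::make_unique<Tokenizer>());
    for (auto& a : pipeline) a->process(doc);  // run annotators in sequence
    std::printf("tokens: %zu\n", doc.annotations.count("token"));
    return 0;
}
```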
[Figure: UIMA architecture overview. A UIM application is served by a Collection Processing Manager; Crawlers handle acquisition of unstructured information (text); document-level (Text) Analysis Engines and collection-level analyses perform the unstructured information analysis; structured results feed Indices and a Semantic Search Engine; a Document, Collection & Metadata Store and Knowledge Source Adapters provide access to documents, collections, metadata, and knowledge & databases; Analysis Engine and Knowledge Source Adapter directories support component discovery for the client/user.]
Possible Collaboration Discussion Points
• Sharing data and object files seems reasonable
• Speech community needs to design:
  – Standard file formats supporting rich annotation
  – Stable, general, open-source C++ interfaces for front-end modules, acoustic models, LMs, etc. (see the sketch after this list)
  – Rich tool set
• Port a competitive trainer, decoder, and adaptation into this framework; create basic file manipulation tools
• Can we ride on top of existing architectures such as UIMA?
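For concreteness, a minimal sketch of what such interfaces might look like (purely illustrative; the class and method names are assumptions, not an existing standard):

```cpp
// Sketch of stable C++ interfaces for swappable ASR modules, so that a
// front end, acoustic model, or LM written at one site plugs into a
// shared system built by the community.
#include <string>
#include <vector>

using Features = std::vector<std::vector<float>>;  // frames x dims

class FrontEnd {
public:
    virtual ~FrontEnd() = default;
    // Raw samples in, feature frames (e.g., PLP or MFCC) out.
    virtual Features extract(const std::vector<short>& samples) = 0;
};

class AcousticModel {
public:
    virtual ~AcousticModel() = default;
    // Log-likelihoods of each state for one feature frame.
    virtual std::vector<float> score(const std::vector<float>& frame) = 0;
};

class LanguageModel {
public:
    virtual ~LanguageModel() = default;
    // Log-probability of a word given its history.
    virtual double logProb(const std::vector<std::string>& history,
                           const std::string& word) = 0;
};
```

With interfaces at roughly this grain, a site's MMIE trainer or VTLN front end could be swapped into the shared system and credited by its effect on the global system's WER, as proposed on the earlier slide.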
Summary
• To help our "speech recognition" child develop, continue the basic successful approaches of the past:
  – Better modeling
  – More data
• The increasing difficulty of the problem requires a focus on community-wide collaboration, for both algorithms and resources