IBM
NSF Workshop Atlanta, Georgia 7-8 Oct 2003
Speech Recognition: It Takes a Village to Raise a Child
Michael Picheny
Human Language Technologies Group, IBM Thomas J. Watson Research Center
Special thanks to: Stan Chen, Yuqing Gao, Ramesh Gopinath, Makis Potamianos, Bhuvana Ramabhadran, Bowen Zhou, and Geoff Zweig
RT-03 Spring STT evaluation results (NIST): http://www.nist.gov/speech/tests/rt/rt2003/spring/presentations/rt03s-stt-results-v9.pdf
Developmental Factors
[Timeline figure, 1988-2003, tagging each year's main advances by category]
1988-1995: CD HMMs, multiple codebooks, tied mixtures, double deltas, PTMs, STMs, MLLR, SAT, PLP, more data (categories: modeling, signal processing, adaptation, data)
1996-2003: multiple models, VTLN, MLLT, BIC, ROVER, fMLLR-SAT, MMI, FSTs, MPE, more data (categories: modeling, adaptation, decoding, training, data)

• Bulk of improvements from better modeling and more data, closely followed by adaptation (a form of modeling)
Continue the Basics: Advances in Gaussian Modeling

• Most current systems model speech as a mixture of diagonal Gaussians, but there is a nagging suspicion that full-covariance models would be better.
• Try to approximate full-covariance models with a controlled increase in the number of parameters (Axelrod, 2003):
EMLLT: precision matrices built from a shared basis of rank-one matrices,
$$P_g = \sum_{k=1}^{D} \lambda_{gk}\, a_k a_k^{T}, \qquad d \le D \le \frac{d(d+1)}{2}$$

PCGMM: precision matrices built from a shared basis of symmetric matrices,
$$P_g = \sum_{k=1}^{D} \lambda_{gk}\, S_k, \qquad d \le D \le \frac{d(d+1)}{2}$$
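One computational attraction of such basis-constrained precision models falls directly out of the formula: in the EMLLT case the Gaussian quadratic form decomposes as $x^T P_g x = \sum_k \lambda_{gk} (a_k^T x)^2$, so the basis projections $a_k^T x$ can be computed once per frame and shared across all Gaussians. A minimal sketch (toy dimensions and values, not from the talk):

```cpp
// Sketch of EMLLT-style evaluation: with P_g = sum_k lambda_gk * a_k a_k^T,
// x^T P_g x = sum_k lambda_gk * (a_k^T x)^2, so the projections a_k^T x
// are computed once and reused for every Gaussian g.
#include <cstdio>
#include <vector>

// Projections p_k = a_k^T x, shared across all Gaussians.
std::vector<double> project(const std::vector<std::vector<double>>& basis,
                            const std::vector<double>& x) {
    std::vector<double> p(basis.size(), 0.0);
    for (size_t k = 0; k < basis.size(); ++k)
        for (size_t i = 0; i < x.size(); ++i)
            p[k] += basis[k][i] * x[i];
    return p;
}

// Per-Gaussian quadratic form using the precomputed projections.
double quad_form(const std::vector<double>& lambda,  // lambda_gk for one g
                 const std::vector<double>& p) {     // a_k^T x
    double q = 0.0;
    for (size_t k = 0; k < lambda.size(); ++k)
        q += lambda[k] * p[k] * p[k];
    return q;
}

int main() {
    // Toy example: d = 2, D = 3 basis vectors (hypothetical values).
    std::vector<std::vector<double>> basis = {{1, 0}, {0, 1}, {1, 1}};
    std::vector<double> x = {0.5, -1.0};
    std::vector<double> p = project(basis, x);
    std::vector<double> lambda = {0.8, 1.2, 0.3};  // one Gaussian's weights
    std::printf("x^T P x = %f\n", quad_form(lambda, p));
    return 0;
}
```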
Advances in Gaussian Modeling

D (d = 52)   EMLLT   PCGMM
d            2.67    1.96
2d           2.04    1.75
8d           1.65    1.64
26.5d        1.58    1.58

(For d = 52, D = 26.5d = d(d+1)/2, i.e., the full-covariance limit.)
Advances in Gaussian Modeling

nGauss    Diagonal   Full
5000      3.48       1.83
10000     2.68       1.56
42993     2.00       1.35
142622    1.74       1.54
350286    1.68       (forget it)
609100    1.65       (really forget it)

• 10k FC model better than the 600k diagonal model, with 20% of the parameters
• FC models clearly prone to overtraining. PCGMM helps, but still increases the number of parameters
• Clearly need lots more acoustic data to train even PCGMM models, much less FC models
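A rough parameter count (assuming d = 52, as on the preceding slide, and counting only means plus covariance/precision terms) is consistent with the first bullet:

$$10{,}000 \times \Bigl(52 + \frac{52 \cdot 53}{2}\Bigr) \approx 14.3\text{M} \qquad \text{vs.} \qquad 609{,}100 \times (52 + 52) \approx 63.3\text{M},$$

i.e., the 10k full-covariance system carries roughly 20-25% of the 600k diagonal system's Gaussian parameters.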
Teens: Utilizing Linguistic Information in ASR

• Standard LVCSR does not explicitly use linguistic information
• Past history is littered with failures
• Over the last few years the area has begun to show signs of life
Syntactic Structured LM
Exploiting syntactic dependencies (Chelba ACL98, Wu ICASSP00)
[Figure: partial parse of "The contract ended with a loss of 7 cents after", with POS tags DT NN VBD IN DT NN IN CD NNS and exposed headwords h_{-1} = ended (VP), h_{-2} = contract (NP) used to predict the next word w_i]

$$P(w_i \mid W_{i-1}) = \sum_{T_{i-1} \in S_{i-1}} P(w_i \mid W_{i-1}, T_{i-1})\, P(T_{i-1} \mid W_{i-1})$$

$$P(w_i \mid W_{i-1}, T_{i-1}) = P(w_i \mid w_{i-1}, w_{i-2}, h_{-1}, h_{-2}, nt_{i-1}, nt_{i-2})$$
• Observed performance improvements of ~1% absolute on SWB/BN
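A minimal sketch (toy probabilities; the type and function names are hypothetical stand-ins, not the authors' code) of the marginalization over parse hypotheses in the first equation above:

```cpp
// Structured-LM word prediction:
// P(w_i | W_{i-1}) = sum over parse hypotheses T of
//                    P(w_i | heads/tags of T) * P(T | W_{i-1}).
#include <cstdio>
#include <string>
#include <vector>

struct ParseHypothesis {
    std::string h1, h2;  // exposed headwords, e.g. "ended", "contract"
    double prob;         // P(T | W_{i-1}), normalized over the hypothesis list
};

// Toy stand-in for the trained head-conditioned word model P(w | h1, h2, ...).
double word_given_parse(const std::string& w, const ParseHypothesis& t) {
    if (t.h1 == "ended" && w == "with") return 0.4;  // hypothetical value
    return 0.01;                                     // flat fallback
}

// Marginalize the word probability over the surviving parse hypotheses.
double next_word_prob(const std::string& w,
                      const std::vector<ParseHypothesis>& parses) {
    double p = 0.0;
    for (const auto& t : parses)
        p += word_given_parse(w, t) * t.prob;
    return p;
}

int main() {
    std::vector<ParseHypothesis> parses = {
        {"ended", "contract", 0.7},  // "The contract ended ..." parse
        {"contract", "The", 0.3},    // a competing partial parse
    };
    std::printf("P(with | ...) = %f\n", next_word_prob("with", parses));
    return 0;
}
```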
Semantic Structured LM
Exploiting semantic dependencies (Erdogan ICSLP02)
[Figure: semantic parse of "I want to book a one way ticket to Houston Texas for tomorrow morning": concept nodes BOOK, RT-OW, LOC-TO (city, state), DATE (day), TIME (timerng) under a SEGMENT root, with null labels on filler words]

$$p(w_j \mid W_{j-1}) = p(w_j \mid w_{j-1}, w_{j-2}, p_j, g_j, c_j)$$

where $w_j$ is predicted from the two previous words together with labels $p_j$, $g_j$, $c_j$ drawn from the semantic parse.
• Reductions in error rate of ~20% for limited-domain tasks
Super-Structured LM for LVCSR
[Figure: a "super-structured" LM combining the word sequence W_1, ..., W_N with a syntactic parser, semantic parser, dialogue state, named entities, world knowledge, and speaker information (turn, gender, ID)]

• Such an LM would clearly require substantially more annotated data than is currently available
Nutrition: "There's no data like more data" -- Robert L. Mercer

Training (hrs)   141    297    602    843
WER (%)          17.2   15.4   14.7   14.5

(RT03 Workshop (BBN); LIMSI: Lamel (2002))
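Read as relative improvements, the table shows the diminishing returns directly:

$$\frac{17.2 - 15.4}{17.2} \approx 10.5\%\ \text{relative (141 to 297 hrs)}, \qquad \frac{14.7 - 14.5}{14.7} \approx 1.4\%\ \text{relative (602 to 843 hrs)}.$$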
Nutrition: "There is no data like more data"

• A "Super-Structured LM" would probably need 100x the current data
• A large amount of linguistic knowledge sources is now available:
  – Brown corpus, Penn Treebank (syntactically & semantically annotated)
  – WordNet and FrameNet, Cyc ontologies, online dictionaries and thesauri
  – Text data from the WWW
• How to provide the necessary annotation at reasonable cost? May require a community effort.
Speed, Data & Computing
• Decoding today is fast
  – ML training even faster
  – Discriminative training same cost as decoding: ~5-10x RT for numerous iterations
• But
  – The data is growing, e.g. the EARS program is aiming for:
    • 2000 hrs/year telephony
    • 5000 hrs/year BN
    • ~10x increase from current levels
  – Evidence suggests that new & costlier algorithms are necessary to exploit more data
• So
  – Need a minimum 10x increase in compute power just to track the data
  – 100x to run 10x RT programs rather than 1x RT programs

[Figure: decoding pipeline from acoustic signal to words, 0.85x RT total: speech/non-speech segmentation (0.01x RT) → speaker-independent decoding (0.11x RT) → adaptive transforms (0.1x RT) → speaker-adaptive decoding (0.63x RT)]
The BlueGene Frontier
• 200 TeraFlop computer
  – Combined power of the top 500 supercomputers in 2001
  – 65,000 processors
    • 2GB per processor
    • ~1GHz clock
    • 3D torus interconnection
    • 2 nodes per card; 16 cards per board; 16 boards per plane; 2 planes per rack
  – Pieces beginning to be tested
  – Intended for molecular dynamics, but available for other uses
• Potential ASR applications
  – Physics-based articulatory modeling
  – Brute-force parameter adjustment to minimize WER
  – Large-scale neural network modeling
  – Incorporation of visual processing
The Village: Collaborative Paradigms
• Originally progress in ASR was haphazard: no way to compare results, which generated a lot of skepticism because of the NIH ("not invented here") syndrome
• Evaluation-driven ASR programs (prior DARPA, DOD paradigms)
  – Provided a common metric to compare algorithms
  – Funding based on relative performance of each site
    • Sites hope not only to do well, but for other sites to do badly
    • Discourages free exchange of resources between sites
    • Large portion of each site's effort spent replicating other sites' algorithms + data
• Non-evaluation-driven programs: NGSW, MALACH
• How to encourage collaboration while retaining the motivation of competition?
  – While also retaining objective evaluation of progress
• Recent EARS program a strong step in this direction
• Even broader collaboration possible through sharing resources
Encouraging Collaboration
• Decompose ASR systems into modules
  – Front end; acoustic model; pronunciation model; language model; adaptation; search; etc.
• Sites collaborate to create a single ASR system rather than one per site
  – Each site works on writing better ASR modules rather than complete ASR systems
  – Each module (e.g., MMIE, VTLN, etc.) need only be implemented once across all sites
  – Progress measured and credit assigned to sites based on how modules affect the WER of the global system
Existence Proof: UIMA (Unstructured Information Management Architecture)
http://www.research.ibm.com/people/a/aspector/presentations/www2000f.pdf
Charts courtesy of David Ferrucci
• Accelerate Progress in Search and Analysis
  – Reuse across teams
  – Ease of experimentation
  – Combination Hypothesis
Text analysis through a series of annotators
[Figure: a chain of annotators; each annotator analyzes, recognizes & labels specific semantic content for the next consumer:
– Detagger: document with HTML tags identified and content extracted
– Tokenizer: document with tokens (e.g., words) identified
– Language Identification (document level): document labeled with the language of the text
– Part of Speech (word level): each word labeled with its part of speech
– Named Entities (phrase level): document with names identified
– Semantic Classes: semantic classes identified]
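To make the chaining concrete, here is a minimal C++ sketch (an illustration under stated assumptions, not the UIMA API; all class and field names are hypothetical) of annotators that each add labels to a shared document before handing it on:

```cpp
// Annotator-pipeline sketch: each stage reads the document plus prior
// annotations and adds its own labels for the next consumer in the chain.
#include <cstdio>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct Document {
    std::string text;
    std::multimap<std::string, std::string> annotations;  // type -> value
};

class Annotator {
public:
    virtual ~Annotator() = default;
    virtual void process(Document& doc) = 0;
};

class LanguageId : public Annotator {
public:
    void process(Document& doc) override {
        // Toy heuristic standing in for a real language classifier.
        doc.annotations.insert({"language", "en"});
    }
};

class Tokenizer : public Annotator {
public:
    void process(Document& doc) override {
        std::string tok;
        for (char c : doc.text + " ") {
            if (c == ' ') {
                if (!tok.empty()) doc.annotations.insert({"token", tok});
                tok.clear();
            } else {
                tok += c;
            }
        }
    }
};

int main() {
    Document doc{"the contract ended", {}};
    std::vector<std::unique_ptr<Annotator>> pipeline;
    pipeline.push_back(std::make_unique<LanguageId>());
    pipeline.push_back(std::make_unique<Tokenizer>());
    for (auto& a : pipeline) a->process(doc);  // run annotators in sequence
    std::printf("tokens: %zu\n", doc.annotations.count("token"));
    return 0;
}
```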
[Figure: UIMA architecture overview. A UIM application is served by a Collection Processing Manager; Crawlers handle acquisition of unstructured information (text); document-level (Text) Analysis Engines and collection-level analyses perform the unstructured information analysis; structured results feed Indices and a Semantic Search Engine; a Document, Collection & Metadata Store and Knowledge Source Adapters provide access to documents, collections, metadata, and knowledge & databases; Analysis Engine and Knowledge Source Adapter directories support component discovery for the client/user.]
Possible Collaboration Discussion Points
• Sharing data and object files seems reasonable
• Speech community needs to design:
  – Standard file formats supporting rich annotation
  – Stable, general, open-source C++ interfaces for front-end modules, acoustic models, LMs, etc. (see the sketch after this list)
  – Rich tool set
• Port a competitive trainer, decoder, and adaptation into this framework; create basic file manipulation tools
• Can we ride on top of existing architectures such as UIMA?
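For concreteness, a minimal sketch of what such interfaces might look like (purely illustrative; the class and method names are assumptions, not an existing standard):

```cpp
// Sketch of stable C++ interfaces for swappable ASR modules, so that a
// front end, acoustic model, or LM written at one site plugs into a
// shared system built by the community.
#include <string>
#include <vector>

using Features = std::vector<std::vector<float>>;  // frames x dims

class FrontEnd {
public:
    virtual ~FrontEnd() = default;
    // Raw samples in, feature frames (e.g., PLP or MFCC) out.
    virtual Features extract(const std::vector<short>& samples) = 0;
};

class AcousticModel {
public:
    virtual ~AcousticModel() = default;
    // Log-likelihoods of each state for one feature frame.
    virtual std::vector<float> score(const std::vector<float>& frame) = 0;
};

class LanguageModel {
public:
    virtual ~LanguageModel() = default;
    // Log-probability of a word given its history.
    virtual double logProb(const std::vector<std::string>& history,
                           const std::string& word) = 0;
};
```

With interfaces at roughly this grain, a site's MMIE trainer or VTLN front end could be swapped into the shared system and credited by its effect on the global system's WER, as proposed on the earlier slide.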
Summary
• To help our "speech recognition" child develop, continue the basic successful approaches of the past:
  – Better modeling
  – More data
• The increasing difficulty of the problem requires a focus on community-wide collaboration, for both algorithms and resources