
2007 STARC Forum

Towards Speech Recognition in Silicon: The Carnegie Mellon In Silico Vox Project


Rob A. Rutenbar, Professor, Electrical & Computer Engineering, [email protected]

Slide 2: Speech Recognition Today

[Screenshots contrast two kinds of products: quality = OK with a large vocabulary, versus quality = poor with a small vocabulary.]

Commonality: all are software apps.

Slide 3: Today's Best Software Speech Recognizers

Best-quality recognition is computationally hard for speaker-independent, large-vocabulary, continuous speech.

The 1-10-100-1000 rule: for a ~1X real-time recognition rate at ~10% word error rate (90% accuracy), you need a ~100 MB memory footprint, ~100 W of power, and a ~1000 MHz CPU.

This proves to be very limiting…

Slide 4: The Carnegie Mellon In Silico Vox Project

The thesis: It’s time to liberate speech recognition from the unreasonable limitations of software

The solution: Speech recognition in silicon

Slide 5: Aside: About the Name "In Silico Vox"

In Vivo (Latin): an experiment done in a living organism.

In Vitro (Latin): an experiment done in an artificial lab environment.

In Silico (not real Latin): an experiment done via computation only.

Vox (Latin): voice, or word.

Slide 6: About This Talk

Some philosophy: Why silicon? Why now? Why us (CMU)?

A quick tour of how speech recognition works: what happens in a recognizer.

An SoC architecture: stripping away all the CPU stuff we don't need, focusing on the essentials.

Results: ASIC version (simulation results); FPGA version (live, running hardware-based recognizer).

Slide 7: About This Talk (outline reprise)

Slide 8: Why Silicon? Why Now?

Why? Two reasons:

History: we have some successful historical examples of this migration.

Performance: tomorrow's compelling apps need 100X-1000X more performance, and that is not going to happen in software.

Slide 9: History: Graphics Engines

Nobody paints pixels in software anymore! Too limiting in max performance, too inefficient in power. True on the desktop (and laptop), and on your cellphone too.

(Images: http://www.nvidia.com, http://www.mtekvision.com)

Slide 10: Performance: Next-Gen Compelling Applications

Audio mining: very fast recognizers, much faster than real time. App: search large media streams (DVDs) quickly; FIND: "Hasta la vista, baby!" in Terminator 2.

Hands-free appliances: very portable recognizers; a high-quality result on << 1 watt. App: interfaces to small devices, cellphone dictation ("send email to arnold: let's do lunch…").

Slide 11: Silicon Solution: Speed and Power Wins

A famous graph from Prof. Bob Brodersen of Berkeley: the study looked at 20 designs published at ISSCC from 1997-2002, in technologies slightly older than today's (180nm-250nm). Dedicated designs achieve up to 10,000X better energy efficiency (MOPS/mW).

[Chart: energy efficiency in million ops per milliwatt (MOPS/mW) for ~20 ISSCC chips, on a log scale from 0.01 to 1000. Microprocessors (PowerPC, SPARC, x86, Alpha), general-purpose DSPs, and dedicated designs (MPEG, encryption, multi-user detection, 802.11a) form three clusters, each separated from the next by roughly two orders of magnitude.]

Slide 12: Recent Example: Parallel Radio Baseband DSP

90nm CMOS: adaptive DSP for a multipath MIMO channel. Power efficiency = 2.1 GOPS/mW; area efficiency = 20 GOPS/mm². Data rate up to 250 Mbps over 16 sub-carriers; measured 34 mW @ VDD = 385 mV. (Source: Prof. Dejan Markovic, UCLA)

Technology: 90nm CMOS. Core area: 1.9 × 1.9 mm. Die area: 2.3 × 2.3 mm. Pad count: 120 IO/core. VDD: 1 V / 0.4 V. Cell count: 420,304. Frequency: 100 MHz. Power (active/leakage): 30 mW / 4 mW. Efficiency: 2.1 GOPS/mW.

[Diagram: Tx array to Rx array over two paths, α1 = 1 and α2 = 0.6.]

Slide 13: Why Us...?

1 site (Carnegie Mellon), 3 sets of world-class experts; it is impossible to do projects like this without cross-area linkages:

Computer Science: SPHINX speech recognition group.
Electrical & Computer Engineering: silicon system implementation group.
Electrical & Computer Engineering: media / DSP group.

[Photo: Carnegie Mellon campus, www.cmu.edu]

Slide 14: Us: the CMU In Silico Vox Team

From left: Kai Yu, Rob Rutenbar, Edward Lin, Richard Stern, Tsuhan Chen, Patrick Bourke (not shown: Jeff Johnston).

Slide 15: About This Talk (outline reprise)

Slide 16: How Speech Recognition Works

[Diagram: the recognition pipeline. Sampling (ADC) feeds feature extraction (a DSP filterbank producing feature vectors x1…xn), which feeds feature scoring of acoustic units (e.g., "ao"), then HMM/Viterbi search over words (e.g., "Rob": R AO B, vs. "Bob": B AO B), then language-model search over word sequences ("Rob says…"). Adaptation to the environment/speaker feeds back into the acoustic models. Three blocks: Acoustic Frontend → Scoring → Backend Search (HMM/Viterbi search + language model search).]

Slide 17: (1) Acoustic Frontend

The frontend is all DSP. A discrete Fourier transform (DFT) gives us the spectra. We combine and logarithmically transform the spectra in ways motivated by the physiology of human ears, then combine these with estimates of the 1st and 2nd time derivatives.

[Spectrogram: one feature frame per time step across a few seconds of speech; color shows energy in the transformed spectra (green = low, red = high).]
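To make this concrete, here is a minimal sketch of this style of frontend in Python/NumPy. It is illustrative, not the Sphinx frontend: a real system uses a mel-spaced filterbank and a cosine transform to get ~13 cepstra per frame (39 dimensions with the derivative estimates), while the crude band pooling, window sizes, and band count below are assumptions.

```python
import numpy as np

def frontend(samples, rate=16000, frame_ms=10, window_ms=25, n_bands=40):
    hop = rate * frame_ms // 1000      # advance 10 ms per frame
    win = rate * window_ms // 1000     # 25 ms analysis window
    frames = []
    for start in range(0, len(samples) - win, hop):
        x = samples[start:start + win] * np.hamming(win)
        power = np.abs(np.fft.rfft(x)) ** 2              # DFT -> power spectrum
        bands = np.array_split(power, n_bands)           # crude filterbank pooling
        frames.append(np.log(np.array([b.sum() for b in bands]) + 1e-10))
    feats = np.array(frames)                             # (n_frames, n_bands)
    d1 = np.gradient(feats, axis=0)                      # 1st time-derivative estimate
    d2 = np.gradient(d1, axis=0)                         # 2nd time-derivative estimate
    return np.hstack([feats, d1, d2])                    # one feature vector per frame
```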

[Diagram: the frontend portion of the pipeline: sampling (ADC), DSP filterbank, feature extraction, feature vector, adaptation.]

Slide 18: (2) Scoring Stage

Each feature is a point in high-dimensional space, but each "atomic sound" is a region of this space. Score each atomic sound with Probability(sound matches feature).

Note: (sounds) × (dimensions) × (Gaussians) = BIG.

Each sound is a Gaussian mixture over the feature vector $X = (x_1, x_2, \ldots, x_n)$; that is, each sound is approximated as a set of high-dimensional (39-dim) diagonal Gaussian densities, weighted and summed:

$$\mathrm{SCORE}_s(X) \;=\; \sum_{j} w_{s,j} \prod_{i=1}^{39} \frac{1}{\sqrt{2\pi\,\sigma_{s,j,i}^{2}}}\, \exp\!\left(-\frac{(x_i-\mu_{s,j,i})^{2}}{2\,\sigma_{s,j,i}^{2}}\right)$$
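A hedged sketch of this scoring computation, done in the log domain as real decoders do (the formula above is in the probability domain); the mixture size and array shapes are illustrative:

```python
import numpy as np

def senone_log_score(x, w, mu, var):
    """x: (39,) feature; w: (M,) mixture weights; mu, var: (M, 39) per-Gaussian."""
    # log of each diagonal Gaussian evaluated at x
    log_gauss = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)
    # stable log of the weighted sum over the M Gaussians (log-sum-exp)
    a = np.log(w) + log_gauss
    m = a.max()
    return m + np.log(np.exp(a - m).sum())
```

Evaluating this for every atomic sound, every 10 ms, is exactly where (sounds) × (dimensions) × (Gaussians) bites.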

Slide 19: (3) Search: Speech Models are Layered Models

[Figure: waveform, pitch, power, and spectrum tracks over ~0.9 s of speech for YES (/Y/ /EH/ /S/) and NO (/N/ /OW/).]

Layered search: language × words × acoustic. Words break into "acoustic units", which break into "sub-acoustic units", down to a single frame of sampled sound. The search uses classical methods (HMMs, Viterbi), plus idiosyncrasies.
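As a concrete anchor for "classical methods (HMMs, Viterbi)", here is a minimal, illustrative Viterbi time-step over one left-to-right HMM in the log domain; the state count and score tables are assumptions for illustration, not the Sphinx data structures:

```python
import numpy as np

NEG_INF = -1e30

def viterbi_step(prev, log_a, log_b):
    """prev: (S,) best log-score per state at t-1;
       log_a: (S, S) transition log-probs; log_b: (S,) emission log-scores at t."""
    S = len(prev)
    cur = np.full(S, NEG_INF)
    back = np.zeros(S, dtype=int)
    for j in range(S):
        cand = prev + log_a[:, j]          # extend every path into state j
        back[j] = int(np.argmax(cand))     # remember the best predecessor
        cur[j] = cand[back[j]] + log_b[j]  # add this frame's acoustic score
    return cur, back
```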

Slide 20: Context Matters: At Bottom, Triphones

English has ~50 atomic sounds (phones), but we recognize ~50 × 50 × 50 context-dependent triphones, because the "I" sound in "five" is different than the "I" in "nine":

Five: F(-, I) cross-word; I(F, V) word-internal; V(I, -) cross-word
Nine: N(-, I) cross-word; I(N, N) word-internal; N(I, -) cross-word

"I" in "five" ≠ "I" in "nine". A sketch of this expansion follows.
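A small illustrative sketch of the expansion, using the slide's PHONE(left, right) notation with "-" marking a cross-word context to be filled in from the neighboring word:

```python
def to_triphones(phones):
    """Expand a word's phone string into context-dependent triphones."""
    tri = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "-"                   # cross-word at start
        right = phones[i + 1] if i < len(phones) - 1 else "-"    # cross-word at end
        tri.append(f"{p}({left},{right})")
    return tri

print(to_triphones(["F", "I", "V"]))   # ['F(-,I)', 'I(F,V)', 'V(I,-)']
print(to_triphones(["N", "I", "N"]))   # ['N(-,I)', 'I(N,N)', 'N(I,-)']
```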

Slide 21: Similar for Other Languages, like Japanese

…but with different basic building blocks (different phones).

Japanese phonemes; source: Microsoft Speech API 5.3, http://msdn2.microsoft.com/en-us/library/ms720568.aspx

Slide 22: Example of Different Phone Building Blocks

Let’s use my name as an example

English: /r/ /OO/ /t/ /n/ /b/ /ar/

Japanese: /ル/ /ー/ /テ/ /ン/ /バ/ /ー/

Aside: Japanese has some reputation as being an "easier" language for automatic recognition, since the mapping from basic sounds (mora) to words is simpler than in English.

Slide 23: Also Context at Top: N-gram Language Model

Suppose we have a vocabulary { W1, W2, W3, W4, … }.

[Diagram: a unigram → bigram → trigram tree: unigram node "W1" branches into bigram nodes ("W1 W2", …), each of which branches into trigram nodes ("W1 W2 W3", …), with termination.]

This lets us calculate the likelihood of word W3 after W2 after W1.
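In practice this lookup is a backoff: if the trigram was never seen, fall back to a weighted bigram probability, and then to the unigram. A minimal sketch, with illustrative (hypothetical) table layouts holding log-probabilities and Katz-style backoff weights:

```python
def log_p(w3, w2, w1, tri, bi, uni, bo_bi, bo_uni):
    """P(w3 | w1 w2) with backoff. tri/bi/uni map word tuples to log-probs;
       bo_bi/bo_uni map contexts to backoff log-weights."""
    if (w1, w2, w3) in tri:
        return tri[(w1, w2, w3)]                       # seen trigram
    if (w2, w3) in bi:
        return bo_bi.get((w1, w2), 0.0) + bi[(w2, w3)] # back off to bigram
    return bo_uni.get(w2, 0.0) + uni.get(w3, -99.0)    # back off to unigram
```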

Slide 24: Good Speech Models are BIG

This is the ~64K-word "Broadcast News" task. Unfortunately, there are many idiosyncratic details in how the layers of the model are traversed.

[Figure: the waveform/pitch/power/spectrum tracks again, now annotated with the model sizes at each layer.]

For each 10 ms time frame:
5,156 scores
111,593 triphones
64,001 unigrams
9,382,014 bigrams
13,459,879 trigrams
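A quick back-of-the-envelope on what those numbers imply for the scoring stage alone. The GMM shape (8 Gaussians, 39 dimensions, ~3 ops per dimension) is an assumption for illustration, not a figure from the slide:

```python
scores_per_frame = 5156        # acoustic scores needed per 10 ms frame (slide)
gaussians, dims = 8, 39        # assumed mixture size and feature dimension
ops_per_gaussian = 3 * dims    # ~ subtract, square/scale, accumulate per dim
frames_per_sec = 100           # one frame per 10 ms

ops = scores_per_frame * gaussians * ops_per_gaussian * frames_per_sec
print(f"~{ops / 1e9:.1f} GOPS just to score every sound every frame")  # ~0.5
```

And that is before any search over the 111K triphones and ~23M n-grams.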

Slide 25: Where Does Software Spend its Time? (CPU time for CMU Sphinx 3.0)

Prior studies targeted less capable versions (v1, v2). Tools: SimpleScalar & Intel VTune. Benchmark: the 64K-word "Broadcast News" task.

[Pie chart laid over the recognition pipeline: slices of 33%, 27%, 19%, 15%, and 6% cover the scoring and search stages; the acoustic frontend takes ~0% of the time!]

So: it's all backend.

Slide 26: Memory Usage? SPHINX 3.0 vs SPEC CPU2000

Simulated cache sizes: L1: 64 KB, direct-mapped; DL1: 64 KB, direct-mapped; UL2: 512 KB, 4-way set-associative.

                        SPHINX 3.0   Gcc     Gzip     Equake
Cycles                  53 B         55 B    15 B     23 B
IPC                     0.69         0.29    1.05     0.7
Loads (instr. mix)      0.27         0.25    0.2      0.27
Stores (instr. mix)     0.05         0.15    0.09     0.08
Branches (instr. mix)   0.14         0.2     0.17     0.12
Branch mispredict rate  0.025        0.07    0.08     0.02
DL1 miss rate           0.04         0.02    0.02     0.03
L2 miss rate            0.48         0.06    0.03     0.30
Memory footprint        64 MB        24 MB   186 MB   42 MB

So… Terrible locality (no surprise: graph search + huge datasets). Load-dominated (no surprise: reads a lot, computes a little). Not an insignificant footprint.

Slide 27: About This Talk (outline reprise)

Slide 28: This Talk: How to Get to Fast…

(Reprise of Slide 10's compelling applications, audio mining and hands-free appliances; the slides that follow focus on getting to fast.)

Slide 29: Speech: Complex Task to do in Silicon

[Diagram: the recognition pipeline annotated with per-stage resource needs.]

Frontend: ops low; KB of SRAM (constants).

Scoring: ops high; KB of SRAM (communication); 10-100 MB of DRAM (constants).

Backend: ops medium; MBs of SRAM (active recognition); 10-100 MB of DRAM (models & active recognition).

Slide 30: A Silicon Architecture: Breakdowns

Per stage (Acoustic Frontend / Gaussian Scoring / Backend Search):

Computations (ops):   Low   / High         / Medium
SRAM (size):          Small / Small        / Large
DRAM (size):          --    / Medium-Large / Large
DRAM (bandwidth):     --    / High         / High

Slide 31: Essential Implementation Ideas

Custom precision, everywhere: every bit counts, no extras, no floating point; all fixed point (see the sketch below).

(Almost) no caching: like graphics chips, fetch from SDRAM and do careful data placement. (A little bit of caching for bandwidth filtering on big language models.)

Aggressive pipelining: if we can possibly overlap computations, we try to do so.

Algorithm transformation: some software computations are just bad news for hardware; substitute some "deep computation" with hardware-friendly versions.
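A minimal sketch of the custom-precision idea: quantize to a signed fixed-point format and rescale after multiplies. The 18-bit width and 12 fractional bits here are illustrative, not the widths used in the design:

```python
def to_fixed(x, frac_bits=12, total_bits=18):
    """Quantize a float to signed fixed point, with saturation."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, int(round(x * scale))))

def fixed_mul(a, b, frac_bits=12):
    return (a * b) >> frac_bits        # rescale the product back to the format

x = to_fixed(-1.37)                    # -1.37 * 4096 -> -5612
y = to_fixed(0.25)                     # 1024
print(fixed_mul(x, y) / (1 << 12))     # ~ -0.3425, i.e. -1.37 * 0.25
```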

Slide 32: Example: Aggressive Pipelining

[Timing diagram: Fetch-Word, Fetch-HMM/Viterbi, Transition/Prune/Writeback, and Language-Model stages for Word #1, Word #2, and HMMs #1-#3, overlapped in time. Three overlaps are exploited: the Get-HMM/Viterbi and Transition stages are pipelined; the Get-Word and Get-HMM stages are pipelined; and the non-language-model and language-model stages are pipelined.]

Slide 33: Example: Algorithmic Changes (acoustic-level pruning threshold)

Software uses the best score of the current frame (after Viterbi on the active HMMs); silicon uses the best score of the previous frame, which nixes a big temporal bottleneck. Tradeoffs: less memory bandwidth and the stages can be pipelined, at the cost of slightly pessimistic scores (see the sketch below).
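A sketch of the transformed pruning loop under these assumptions (the beam width and helper names are illustrative): the beam threshold is fixed from the previous frame's best score before the frame starts, so scoring and pruning no longer serialize on the current frame's maximum.

```python
BEAM = 200.0   # log-domain beam width (illustrative value)

def prune_frame(active_hmms, score_of, prev_best):
    """score_of(hmm) returns the HMM's Viterbi score for this frame."""
    threshold = prev_best - BEAM            # known before the frame starts
    survivors, cur_best = [], float("-inf")
    for hmm in active_hmms:
        s = score_of(hmm)
        if s >= threshold:                  # compare against last frame's best
            survivors.append(hmm)
        cur_best = max(cur_best, s)         # only needed for the *next* frame
    return survivors, cur_best
```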

[Flowcharts compare the per-frame loops. Sphinx 3.0 software: Start Frame → Initialize Frame → Fetch Active Word → Fetch Active HMM → Viterbi → Transition/Prune/Writeback → Language Model → Done. Silicon: the same stages, with the HMM writeback decoupled from Transition/Prune and the loop structured around "done all active HMMs" so the stages can be overlapped.]

Slide 34: About This Talk (outline reprise)

Slide 35: Design Flow: C++ Cycle Simulator → Verilog

2006 benchmark: 5K-word "Wall Street Journal" task. Cycle-simulation results:

No accuracy loss; not quite 2X real time at a 125 MHz ASIC clock. Backend search needs ~1.5 MB SRAM and ~30 MB DRAM.

Word error rate by recognizer engine:
Hardware: our proposed recognizer: 6.725% WER (0.125 GHz clock, 1.67X real time)
Software: Sphinx 3.0 (single CPU): 6.707% WER (2.8 GHz, 0.59X real time)
Software: Sphinx 4 (single CPU): 6.97% WER
Software: Sphinx 4 (dual CPU): 6.97% WER
Software: Sphinx 3.3 (fast decoder): 7.32% WER

The three 1 GHz software configurations ran at 1.05X, 0.82X, and 0.74X real time (bigger is better).

Slide 36: Aside: Bit-Level Verification Hurts (A Lot)

We have newfound sympathy for others doing silicon designs that handle large media streams. Generating these sorts of tradeoff curves takes CPU days to weeks:

[Plot: speedup vs. software (over 330 WSJ utterances) as a function of time in 10 ms frames, for three memory configurations: all SRAM, 1.5 MB SRAM + 30 MB DRAM, and all DRAM, against the SPHINX 3.0 software baseline.]

Slide 37: Aside: Pieces of Design = Great Class Projects

CMU student team: Patrick Chiu, David Fu, Mark McCartney, Ajay Panagariya, Chris Thomas.

[Images: floorplan, final layout, and final stats of the class-project chip.]

Slide 38: A Complete Live Recognizer: FPGA Demo

In any "system design" research, you reach a point where you just want to see it work, for real.

Goal: a full recognizer on 1 FPGA + 1 DRAM.

A benchmark that fits on the chip: the 1000-word "Resource Management" task. Slightly simplified: no trigrams. Slower: not real time, ~2.3X slower. Resource-limited: slices, memory bandwidth.

[Diagram: the full recognition pipeline (frontend, scoring, backend search) as mapped into the FPGA demo.]

Xilinx XC2VP30 FPGA (13,969 slices / 2,448 Kb of Block RAM). Utilization: 99% of slices, 45% of Block RAM; ~3 MB DDR DRAM; 50 MHz clock; ~200 Mb/s IO.

Slide 39: FPGA Experimental Results

Aside: as far as we know, this is the most complex recognizer architecture ever fully mapped into a running, hardware-only form.

Slide 40: Summary

Software is too constraining for speech recognition.

The evolution of graphics chips suggests the alternative: do it in silicon. There are compelling performance and power reasons for silicon speech recognition.

Several "in silico vox" architectures are in design: SoC and FPGA versions, with a ~10X-real-time-speedup architecture in progress at CMU.

Reflections: some of the most interesting experiences happen when you get people from very different backgrounds (silicon + speech) on the same team.

Slide 41: Acknowledgements

Work supported by:
US National Science Foundation (www.nsf.gov)
Semiconductor Research Corporation (www.src.org)
FCRP Focus Research Center for Circuit & System Solutions (www.fcrp.org, www.c2s2.org)

We are grateful for the advice and speech recognition expertise shared with us by Richard M. Stern (CMU), Arthur Chan (CMU), and Mosur K. Ravishankar (CMU).