DTW for Speech Recognition

J.-S. Roger Jang ( 張智星 )

[email protected]

http://www.cs.nthu.edu.tw/~jang

MIR Lab (Multimedia Information Retrieval Laboratory)

CS Department, Tsing Hua University

Dynamic Time Warping (DTW)

Characteristics:
- Pattern-matching-based approach
- Requires less memory and computation
- Suitable for speaker-dependent recognition
- Suitable for small- to medium-vocabulary tasks
- Suitable for microprocessor/chip implementation

Applications:
- Speaker identification and verification for surveillance
- Voice commands for mobile phones and toys

Dynamic Time Warping: Type 1

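t: input MFCC matrix (each column is one frame's feature vector)
r: reference MFCC matrix
Local paths: 27-45-63 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\; D(i-1,j-1),\; D(i-2,j-1)\}$$

[Figure: DTW grid with input frames t(i-1), t(i) on the i axis and reference frames r(j-1), r(j) on the j axis, marking the cell D(i,j).]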
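The type-1 recurrence can be sketched in a few lines of Python. This is a minimal illustration under several assumptions, not the author's implementation: each utterance is a list of frames (one list of floats per MFCC column), the frame distance is Euclidean, and both corners of the grid are fixed. The names `frame_dist` and `dtw_type1` are hypothetical.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two frame feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type1(t, r):
    """Type-1 DTW (27-45-63 degree local paths) with both corners fixed.

    t, r: lists of frames; each frame is a list of floats.
    Returns the total distance, or math.inf if no admissible path exists.
    """
    m, n = len(t), len(r)
    INF = math.inf
    # Two layers of padding, so the cell for (i, j) lives at D[i + 2][j + 2]
    # and D(i-2, .) / D(., j-2) can be read without boundary checks.
    D = [[INF] * (n + 2) for _ in range(m + 2)]
    for i in range(m):
        for j in range(n):
            d = frame_dist(t[i], r[j])
            if i == 0 and j == 0:
                D[2][2] = d                  # the path starts at the first corner
                continue
            best = min(D[i + 1][j],          # D(i-1, j-2)
                       D[i + 1][j + 1],      # D(i-1, j-1)
                       D[i][j + 1])          # D(i-2, j-1)
            D[i + 2][j + 2] = d + best
    return D[m + 1][n + 1]
```

With toy one-dimensional features, `dtw_type1([[0], [1], [2]], [[0], [0], [1], [1], [2]])` follows two 63-degree steps and gives a total distance of 0.0, while an input of one frame against a reference of three frames has no admissible path and returns infinity.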

Dynamic Time Warping: Type 2

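t: input MFCC matrix (each column is one frame's feature vector)
r: reference MFCC matrix
Local paths: 0-45-90 degrees

DTW recurrence:

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\; D(i-1,j-1),\; D(i,j-1)\}$$

[Figure: DTW grid with input frames t(i-1), t(i) on the i axis and reference frames r(j-1), r(j) on the j axis, marking the cell D(i,j).]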
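A minimal Python sketch of the type-2 recurrence follows. It assumes each utterance is a list of frames (one list of floats per frame), a Euclidean frame distance, and fixed corners; the function names are hypothetical and chosen only for this illustration.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two frame feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2(t, r):
    """Type-2 DTW (0-45-90 degree local paths) with both corners fixed.

    t, r: non-empty lists of frames. Returns the total distance.
    """
    m, n = len(t), len(r)
    INF = math.inf
    # One layer of padding; D[0][0] = 0 seeds the path at the first corner.
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = frame_dist(t[i - 1], r[j - 1])
            D[i][j] = d + min(D[i - 1][j],      # 0-degree step along i
                              D[i - 1][j - 1],  # 45-degree step
                              D[i][j - 1])      # 90-degree step along j
    return D[m][n]
```

Because horizontal and vertical steps are allowed, any two non-empty sequences can be aligned. For example, `dtw_type2([[1], [2], [3]], [[2], [2], [2]])` gives 2.0: the two fixed corners each cost 1, and the middle input frame matches at no cost.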

Local Path Constraints

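Type 1: 27-45-63 local paths

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\; D(i-1,j-1),\; D(i-2,j-1)\}$$

[Figure: local paths into D(i,j) from D(i-1,j-2), D(i-1,j-1) and D(i-2,j-1).]

Type 2: 0-45-90 local paths

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\; D(i-1,j-1),\; D(i,j-1)\}$$

[Figure: local paths into D(i,j) from D(i-1,j), D(i-1,j-1) and D(i,j-1).]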
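The angle names of the two path types follow from the step sizes. As a small worked check, the snippet below computes the angle of each local step, measured from the i axis; the tuple convention (di, dj) and the names used are assumptions made for this illustration.

```python
import math

# Each step (di, dj) advances di input frames and dj reference frames.
TYPE1_STEPS = [(2, 1), (1, 1), (1, 2)]
TYPE2_STEPS = [(1, 0), (1, 1), (0, 1)]

def step_angles(steps):
    """Angle of each step in degrees, measured from the i axis."""
    return [math.degrees(math.atan2(dj, di)) for di, dj in steps]

print([round(a, 1) for a in step_angles(TYPE1_STEPS)])  # [26.6, 45.0, 63.4]
print([round(a) for a in step_angles(TYPE2_STEPS)])     # [0, 45, 90]
```

The type-1 angles are arctan(1/2), arctan(1) and arctan(2), which round to the 27, 45 and 63 degrees used in the slide titles.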

Path Penalty for Type-1 DTW

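Path penalty:
- No penalty for the 45-degree path
- Some penalty η for paths that deviate from 45 degrees

$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2) + \eta,\; D(i-1,j-1),\; D(i-2,j-1) + \eta\}$$

[Figure: the three local paths into D(i,j) from D(i-2,j-1), D(i-1,j-1) and D(i-1,j-2), with penalty 0 on the 45-degree path.]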
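The effect of a path penalty can be sketched by adding a constant to the two off-diagonal predecessors of the type-1 recurrence. The parameter name `eta`, the use of one shared penalty for both off-diagonal paths, and the Euclidean frame distance are assumptions of this sketch.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two frame feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type1_penalty(t, r, eta=0.0):
    """Type-1 DTW where the 27- and 63-degree steps each cost an extra eta."""
    m, n = len(t), len(r)
    INF = math.inf
    # Two layers of padding; the cell for (i, j) lives at D[i + 2][j + 2].
    D = [[INF] * (n + 2) for _ in range(m + 2)]
    for i in range(m):
        for j in range(n):
            d = frame_dist(t[i], r[j])
            if i == 0 and j == 0:
                D[2][2] = d
                continue
            best = min(D[i + 1][j] + eta,    # D(i-1, j-2) + eta
                       D[i + 1][j + 1],      # D(i-1, j-1), no penalty
                       D[i][j + 1] + eta)    # D(i-2, j-1) + eta
            D[i + 2][j + 2] = d + best
    return D[m + 1][n + 1]
```

For an input of three frames matched against a reference of five frames, the only admissible path takes two off-diagonal steps, so a perfect match that scores 0.0 with `eta=0.0` scores 1.0 with `eta=0.5`. A purely diagonal alignment is unaffected by the penalty.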

DTW Paths of “Match Corners”

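We assume the speaking rate of the user's acoustic input is between 1/2 and 2 times that of the intended sentence. (The 27-63 degree local paths limit the slope of the warping path to this range.)

Both corners are fixed, so end-point detection is critical.

Suitable for voice command applications

[Figure: DTW path on the i-j grid with both corners anchored.]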
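With both corners fixed and type-1 local paths, the speed assumption becomes a simple length check that can reject a reference before any distance is computed. The helper below is a hypothetical sketch of that check, assuming m input frames and n reference frames.

```python
def corners_reachable(m, n):
    """True if a type-1 path can join cell (1, 1) to cell (m, n).

    Every local step advances (1, 2), (1, 1) or (2, 1) frames, so the
    offsets m - 1 and n - 1 must each be at most twice the other.
    """
    if m < 1 or n < 1:
        return False
    di, dj = m - 1, n - 1
    return di <= 2 * dj and dj <= 2 * di

print(corners_reachable(50, 99))   # True: 98 <= 2 * 49
print(corners_reachable(50, 100))  # False: 99 > 2 * 49
```

Skipping references that fail this check avoids filling a DTW table whose final cell can never be reached.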

DTW Paths of “Match Anywhere”

No fixed anchored positions

Suitable for retrieval of personal spoken documents

[Figure: DTW paths on the i-j grid with no anchored start or end position.]
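One common way to realize "match anywhere" is to let the path start and end at any frame of the longer recording while the query is still matched in full. The sketch below takes that reading and uses type-2 local paths for brevity; both choices, along with the names `dtw_match_anywhere`, `query` and `doc`, are assumptions rather than the method behind these slides.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two frame feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_match_anywhere(query, doc):
    """Align all of `query` with the best-matching stretch of `doc`.

    Returns (total_distance, end), where `end` is the 0-based index of the
    doc frame at which the best path finishes.
    """
    m, n = len(query), len(doc)
    INF = math.inf
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    for j in range(n + 1):
        D[0][j] = 0.0                  # free start: begin at any doc frame
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = frame_dist(query[i - 1], doc[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    # Free end: take the cheapest cell in the last query row.
    end = min(range(1, n + 1), key=lambda j: D[m][j])
    return D[m][end], end - 1
```

For the toy query `[[5], [6], [7]]` and document `[[0], [1], [5], [6], [7], [2]]`, the call returns `(0.0, 4)`: the query is found with zero distance, ending at document frame 4.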

Other Variants

Local constraints

Start/ending area

Implementation Issues

To save memory:
- Use a 2-column table for type-1 DTW
- Use a 1-column table for type-2 DTW

To avoid too many if-then statements:
- Pad type-1 DTW with two-layer padding
- Pad type-2 DTW with one-layer padding

To find a suitable path:
- Minimize total distance
- Minimize average distance
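The memory and padding points can be illustrated for type-2 DTW. The sketch below keeps a single padded column that is overwritten in place, so memory grows with the reference length only. It assumes a Euclidean frame distance and fixed corners, returns the total distance, and uses hypothetical names.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two frame feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2_one_column(t, r):
    """Type-2 DTW total distance using one padded column of length len(r) + 1."""
    m, n = len(t), len(r)
    INF = math.inf
    # col[j] holds D(i, j) for the current i; col[0] is the padding cell.
    col = [0.0] + [INF] * n            # padded line for i = 0
    for i in range(1, m + 1):
        diag = col[0]                  # D(i-1, 0)
        col[0] = INF                   # padding cell for the current i
        for j in range(1, n + 1):
            up = col[j]                # D(i-1, j), not yet overwritten
            left = col[j - 1]          # D(i, j-1), already updated
            col[j] = frame_dist(t[i - 1], r[j - 1]) + min(up, diag, left)
            diag = up                  # becomes D(i-1, j-1) for the next j
    return col[n]
```

The padding cell removes the boundary tests from the inner loop. The trade-off is that the warping path cannot be backtracked from a single column, which is acceptable when only the distance is needed for recognition.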

DTW Path of “Match Corners”

DTW Path of “Match Anywhere”

[Figure: “match anywhere” DTW paths locating the short spoken query 清華大學 (“Tsing Hua University”) inside the longer utterance 我今天很高興來到清華大學進行演講 (“I am very happy to come to Tsing Hua University today to give a talk”). Axes are frame indices. DTW total distance = 304.957.]

DTW for Spoken Document Retrieval

Applications:
- Voice-based audio/video retrieval

Issues in SDR using DTW:
- Speaker normalization
  - Vocal tract length normalization (VTLN)
  - Frequency warping
- Efficiency

DTW for Speaker-independent Voice Command Recognition

Applications:
- Digit recognition

Technical highlights:
- Extensive recordings
- Clustering within each command
- Some indexing methods for DTW
- Suitable for small-vocabulary applications
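A template-based recognizer of this kind can be sketched as a nearest-template search. In this illustration each command keeps several reference utterances (for example, one representative per cluster of recordings), and the input is assigned to the command whose closest template has the smallest DTW distance. The data layout, the use of type-2 DTW, and all names here are assumptions.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two frame feature vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2(t, r):
    """Type-2 DTW total distance with both corners fixed."""
    m, n = len(t), len(r)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = frame_dist(t[i - 1], r[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[m][n]

def recognize(utterance, templates):
    """templates: dict mapping a command label to a list of reference utterances.

    Returns (best_label, best_distance) over all templates of all commands.
    """
    best_label, best_dist = None, math.inf
    for label, refs in templates.items():
        for ref in refs:
            dist = dtw_type2(utterance, ref)
            if dist < best_dist:
                best_label, best_dist = label, dist
    return best_label, best_dist

# Toy one-dimensional "features" standing in for MFCC frames.
templates = {
    "zero": [[[0], [0], [1]]],
    "one": [[[5], [6], [7]], [[5], [7]]],
}
print(recognize([[5], [6], [6], [7]], templates))  # ('one', 0.0)
```

The cost of this search grows with the number of templates, which is one reason such an approach fits small vocabularies and benefits from indexing methods that prune the candidates.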
