DTW for Speech Recognition
J.-S. Roger Jang ( 張智星 )
http://www.cs.nthu.edu.tw/~jang
MIR Lab ( 多媒體資訊檢索實驗室 )
CS, Tsing Hua Univ. ( 清華大學 資工系 )
-2-
Dynamic Time Warping (DTW)
Characteristics:
  Pattern-matching-based approach
  Requires less memory/computation
  Suitable for speaker-dependent recognition
  Suitable for small to medium vocabulary
  Suitable for microprocessor/chip implementation
Applications:
  Speaker identification & verification for surveillance
  Voice commands for mobile phones, toys
-3-
Dynamic Time Warping: Type 1
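t: input MFCC matrix (each column is a frame's feature vector)
r: reference MFCC matrix
Local paths: 27-45-63 degrees
DTW recurrence:
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$
[Figure: grid over input frames t(i-1), t(i) and reference frames r(j-1), r(j), showing the three local paths into D(i, j)]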
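The type-1 recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the function names `frame_dist` and `dtw_type1`, the Euclidean frame distance, and the storage of each frame as a plain list of floats are all assumptions made for the example. Both corners of the path are fixed.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type1(t, r):
    """Type-1 DTW (27-45-63 degree local paths), both corners fixed.

    t, r: non-empty lists of frames; each frame is a list of floats.
    Returns the total distance, or math.inf if no admissible path exists.
    """
    m, n = len(t), len(r)
    INF = math.inf
    # Two layers of padding: cell D(i, j) is stored at D[i + 2][j + 2],
    # so D(i-2, j-1) and D(i-1, j-2) never fall outside the table.
    D = [[INF] * (n + 2) for _ in range(m + 2)]
    D[2][2] = frame_dist(t[0], r[0])          # fixed start corner
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best = min(D[i + 1][j],           # D(i-1, j-2)
                       D[i + 1][j + 1],       # D(i-1, j-1)
                       D[i][j + 1])           # D(i-2, j-1)
            D[i + 2][j + 2] = frame_dist(t[i], r[j]) + best
    return D[m + 1][n + 1]                    # fixed end corner
```

Because each step advances by at most two frames on either axis, an input more than twice as long (or less than half as long) as the reference has no admissible path, and the function returns infinity.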
-4-
Dynamic Time Warping: Type 2
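t: input MFCC matrix (each row is a frame's feature vector)
r: reference MFCC matrix
Local paths: 0-45-90 degrees
DTW recurrence:
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$
[Figure: grid over input frames t(i-1), t(i) and reference frames r(j-1), r(j), showing the three local paths into D(i, j)]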
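For comparison, the type-2 recurrence can be sketched the same way. Again this is only an illustration under assumptions: `frame_dist` and `dtw_type2` are hypothetical names, the frame distance is Euclidean, and frames are plain lists of floats.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2(t, r):
    """Type-2 DTW (0-45-90 degree local paths), both corners fixed.

    t, r: non-empty lists of frames; each frame is a list of floats.
    """
    m, n = len(t), len(r)
    INF = math.inf
    # One layer of padding: cell D(i, j) is stored at D[i + 1][j + 1].
    D = [[INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0  # virtual start so that D(0, 0) = |t(0) - r(0)|
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(D[i - 1][j],       # D(i-1, j)
                       D[i - 1][j - 1],   # D(i-1, j-1)
                       D[i][j - 1])       # D(i, j-1)
            D[i][j] = frame_dist(t[i - 1], r[j - 1]) + best
    return D[m][n]
```

Unlike type 1, the 0-degree and 90-degree steps let one frame align with arbitrarily many frames of the other sequence, so a path always exists.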
-5-
Local Path Constraints
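Type 1: 27-45-63 local paths
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$
Type 2: 0-45-90 local paths
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\}$$
[Figure: local paths into D(i, j). Type 1 comes from D(i-1, j-2), D(i-1, j-1), and D(i-2, j-1); type 2 comes from D(i-1, j), D(i-1, j-1), and D(i, j-1).]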
-6-
Path Penalty for Type-1 DTW
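Path penalty:
  No penalty for the 45-degree path
  Some penalty for paths that deviate from 45 degrees
$$D(i,j) = |t(i) - r(j)| + \min\{D(i-1,j-2),\ D(i-1,j-1),\ D(i-2,j-1)\}$$
[Figure: the three local paths from D(i-1, j-2), D(i-1, j-1), and D(i-2, j-1) into D(i, j); the 45-degree path is labeled with penalty 0.]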
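One way to realize the path penalty is to add a constant to the two non-45-degree candidates before taking the minimum. The sketch below is an assumption-laden illustration: the parameter name `eta`, the use of the same penalty for both off-diagonal paths, the function names, and the Euclidean frame distance are not taken from the slide.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type1_penalized(t, r, eta=0.0):
    """Type-1 DTW where the 27- and 63-degree paths cost an extra `eta`.

    `eta` is a hypothetical parameter; the 45-degree path is not penalized.
    """
    m, n = len(t), len(r)
    INF = math.inf
    # Two-layer padding: cell D(i, j) is stored at D[i + 2][j + 2].
    D = [[INF] * (n + 2) for _ in range(m + 2)]
    D[2][2] = frame_dist(t[0], r[0])
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best = min(D[i + 1][j] + eta,    # D(i-1, j-2), penalized
                       D[i + 1][j + 1],      # D(i-1, j-1), no penalty
                       D[i][j + 1] + eta)    # D(i-2, j-1), penalized
            D[i + 2][j + 2] = frame_dist(t[i], r[j]) + best
    return D[m + 1][n + 1]
```

With `eta=0.0` this reduces to the plain type-1 recurrence; larger values bias the alignment toward the diagonal.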
-7-
DTW Paths of “Match Corners”
We assume the speed of the user's acoustic input falls between 1/2 and 2 times that of the intended sentence.
Both corners are fixed. (Endpoint detection is critical.)
Suitable for voice command applications
[Figure: DTW path on the i-j grid with both corners fixed]
-8-
DTW Paths of “Match Anywhere”
No fixed anchored positions
Suitable for retrieval of personal spoken documents
[Figure: DTW path on the i-j grid with no fixed anchor positions]
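A "match anywhere" search can be sketched by letting the path start and end at any position of the longer recording. The version below is only one possible reading: it assumes type-2 local paths, a short query matched inside a longer document, a Euclidean frame distance, and the hypothetical names `frame_dist` and `dtw_match_anywhere`.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_match_anywhere(query, doc):
    """Align the whole query against any contiguous part of doc.

    Returns (total distance, index of the doc frame where the match ends).
    """
    m, n = len(query), len(doc)
    INF = math.inf
    # Row 0 is all zeros: the path may start at any doc position for free.
    prev = [0.0] * (n + 1)
    for i in range(1, m + 1):
        cur = [INF] * (n + 1)
        for j in range(1, n + 1):
            d = frame_dist(query[i - 1], doc[j - 1])
            cur[j] = d + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    # Free end: take the best cell in the last query row.
    end = min(range(1, n + 1), key=lambda j: prev[j])
    return prev[end], end - 1
```

For example, the query `[[5.0], [6.0]]` inside the document `[[0.0], [1.0], [5.0], [6.0], [9.0]]` matches with distance 0.0 and ends at document frame 3.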
-9-
Other Variants
Local constraints
Start/ending area
-10-
Implementation Issues
To save memory:
  Use a 2-column table for type-1 DTW
  Use a 1-column table for type-2 DTW
To avoid too many if-then statements:
  Pad type-1 DTW with two-layer padding
  Pad type-2 DTW with one-layer padding
To find a suitable path:
  Minimizing total distance
  Minimizing average distance
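The memory and padding ideas can be illustrated together for type-2 DTW. The sketch below keeps a single column that is overwritten in place, plus one padding cell set to infinity so the inner loop needs no boundary tests. The names `frame_dist` and `dtw_type2_one_column` and the Euclidean frame distance are assumptions for the example.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2_one_column(t, r):
    """Type-2 DTW total distance using O(len(r)) memory."""
    n = len(r)
    INF = math.inf
    col = [INF] * (n + 1)   # col[0] is the one-layer padding cell
    col[0] = 0.0            # virtual start, used only by the first input frame
    for i in range(len(t)):
        diag = col[0]       # holds D(i-1, j-1) for the next j
        col[0] = INF        # padding for the current input frame
        for j in range(1, n + 1):
            up = col[j]     # D(i-1, j), not yet overwritten
            col[j] = frame_dist(t[i], r[j - 1]) + min(up, diag, col[j - 1])
            diag = up
    return col[n]
```

This returns only the total distance. Recovering the path itself, or normalizing by path length to compare average distances, needs extra bookkeeping that this sketch omits.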
-11-
DTW Path of “Match Corners”
-12-
DTW Path of “Match Anywhere”
-13-
DTW Path of “Match Anywhere”
[Figure: "match anywhere" DTW path aligning the spoken query 清華大學 ("Tsing Hua University") with the spoken sentence 我今天很高興來到清華大學進行演講 ("I am very happy to come to Tsing Hua University today to give a talk"). DTW total distance = 304.957]
-14-
DTW for Spoken Document Retrieval
Applications:
  Voice-based audio/video retrieval
Issues in SDR using DTW:
  Speaker normalization
    Vocal tract length normalization (VTLN)
    Frequency warping
  Efficiency
-15-
DTW for Speaker-independent Voice Command Recognition
Applications:
  Digit recognition
Technical highlights:
  Extensive recordings
  Clustering within each command
  Some indexing methods for DTW
  Suitable for small-vocabulary applications
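A simple way to use several recordings per command is nearest-template matching: compare the input with every stored template and return the command of the closest one. The sketch below assumes type-2 DTW and a Euclidean frame distance; the names `frame_dist`, `dtw_type2`, and `recognize`, and the dictionary layout of the templates, are hypothetical. It does not show clustering or indexing.

```python
import math

def frame_dist(a, b):
    """Euclidean distance between two feature vectors (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_type2(t, r):
    """Type-2 DTW total distance with both corners fixed."""
    m, n = len(t), len(r)
    D = [[math.inf] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = frame_dist(t[i - 1], r[j - 1]) + min(
                D[i - 1][j], D[i - 1][j - 1], D[i][j - 1])
    return D[m][n]

def recognize(utterance, templates):
    """templates: dict mapping a command label to a list of reference utterances.

    Returns (best label, its DTW distance).
    """
    best_label, best_dist = None, math.inf
    for label, refs in templates.items():
        for ref in refs:
            d = dtw_type2(utterance, ref)
            if d < best_dist:
                best_label, best_dist = label, d
    return best_label, best_dist
```

Clustering within each command would reduce the number of templates per label that this loop has to visit.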