1
Dynamic Time Warping and Minimum Distance Paths for Speech Recognition
Isolated word recognition:
• Task:
• Want to build an isolated ‘word’ recogniser, e.g. voice dialling on mobile phones
• Method:
1. Record, parameterise and store vocabulary of reference words
2. Record test word to be recognised and parameterise
3. Measure distance between test word and each reference word
4. Choose reference word ‘closest’ to test word
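The four steps above can be sketched as a nearest-template search. This is a minimal illustration, not the deck's implementation: the function name `recognise`, the toy one-number "words" and the absolute-difference distance are all assumptions made for the example.

```python
def recognise(test_word, references, distance):
    """Return the label of the stored reference closest to test_word.

    references: dict mapping a word label to its parameterised template.
    distance:   any function distance(test, reference) -> number,
                where a smaller value means a closer match.
    """
    return min(references,
               key=lambda label: distance(test_word, references[label]))


# Toy vocabulary (hypothetical): each "word" is reduced to a single number.
references = {"yes": 2.0, "no": 9.0}


def toy_distance(test, reference):
    # Stand-in for a real word-to-word distance such as DTW.
    return abs(test - reference)


closest = recognise(3.5, references, toy_distance)
print(closest)
```

In a real recogniser the `distance` argument would compare two sequences of frames rather than two numbers.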
2
Words are parameterised on a frame-by-frame basis
Choose frame length, over which speech remains reasonably stationary
Overlap frames e.g. 40ms frames, 10ms frame shift
We want to compare frames of test and reference words i.e. calculate distances between them
[Figure: overlapping frames, labelled 40ms and 20ms]
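Splitting a recording into overlapping frames might look like the sketch below. The function name `split_into_frames` and the 8000 Hz sampling rate are assumptions for illustration; the 40ms frame length and 10ms shift are the values quoted on the slide.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Cut samples into overlapping frames.

    frame_length and frame_shift are given in samples. A trailing
    stretch too short to fill a whole frame is dropped.
    """
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


sample_rate = 8000                        # assumed; not stated in the slides
frame_length = 40 * sample_rate // 1000   # 40ms -> 320 samples
frame_shift = 10 * sample_rate // 1000    # 10ms -> 80 samples

one_second = list(range(sample_rate))     # dummy signal: sample indices
frames = split_into_frames(one_second, frame_length, frame_shift)
print(len(frames), len(frames[0]))
```

Each frame would then be parameterised separately, giving one feature vector per frame.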
3
Calculating Distances
• Easy:
Sum differences between corresponding frames
• Problem:
Number of frames won’t always correspond
4
• Solution 1: Linear Time Warping
Stretch shorter sound
• Problem?
Some sounds stretch more than others
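Linear time warping can be sketched as follows. This is an illustrative version only: frames are assumed to be single numbers, the stretch repeats frames by integer index mapping, and the name `linear_warp_distance` is hypothetical. The two sequences are the uni-dimensional values used in the deck's DTW example.

```python
def linear_warp_distance(a, b):
    """Stretch the shorter sequence uniformly to the longer length,
    then sum absolute differences between corresponding frames."""
    longer, shorter = (a, b) if len(a) >= len(b) else (b, a)
    n, m = len(longer), len(shorter)
    # Frame i of the longer word is paired with frame i*m//n of the shorter,
    # so every part of the shorter word is stretched by the same factor.
    stretched = [shorter[i * m // n] for i in range(n)]
    return sum(abs(x - y) for x, y in zip(longer, stretched))


test = [5, 3, 9, 7, 3]
reference = [4, 7, 4]
linear_distance = linear_warp_distance(test, reference)
print(linear_distance)
```

Because the stretch is uniform, it cannot follow a word in which one sound has been lengthened more than the others.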
5
• Solution 2:
Dynamic Time Warping (DTW)
Test: 5 3 9 7 3
Reference: 4 7 4
Using a dynamic alignment, make the most similar frames correspond
Find the distance between the two utterances using these corresponding frames
6
Digression: Dynamic Programming
• The shortest route from Dublin to Limerick goes through:
– Kildare
– Monasterevin
– Portlaoise
– Mountrath
– Roscrea
– Nenagh
• Now consider the shortest route from Dublin to Nenagh
– What towns does the route go through?
7
Intercity Example
8
9
Place the distance between frame r of Test and frame c of Reference in cell (r,c) of the distance matrix:
Test  r
   3  5 |   1   4   1
   7  4 |   3   0   3
   9  3 |   5   2   5
   3  2 |   1   4   1
   5  1 |   1   2   1
      c |   1   2   3
Reference   4   7   4
We can also find the path through the grid that minimises the total cost of the path
Compute the minimum distance to each point and place it in the mindist matrix:
Test  r
   3  5 |  11   8   5
   7  4 |  10   4   7
   9  3 |   7   4   9
   3  2 |   2   5   4
   5  1 |   1   3   4
      c |   1   2   3
Reference   4   7   4
mindist(5,3) = min{1 + mindist(5,2),
1 + mindist(4,2),
1 + mindist(4,3)}
= min{1 + 8, 1 + 4, 1 + 7} = 5
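The grid computation on this slide can be sketched as below. It assumes uni-dimensional frames with absolute difference as the frame distance, and allows the three moves used in the mindist(5,3) example (from the left, from below, and diagonally). The function name `dtw` and its return format are choices made for this illustration.

```python
def dtw(test, reference):
    """Return (mindist matrix, total distance, path) for two sequences.

    Row r of the matrices corresponds to test frame r+1 and column c to
    reference frame c+1. The returned path uses the 1-based (r, c)
    numbering of the slides.
    """
    rows, cols = len(test), len(reference)
    dist = [[abs(t - r) for r in reference] for t in test]
    mindist = [[0] * cols for _ in range(rows)]

    for r in range(rows):
        for c in range(cols):
            previous = []
            if c > 0:
                previous.append(mindist[r][c - 1])      # came from the left
            if r > 0:
                previous.append(mindist[r - 1][c])      # came from below
            if r > 0 and c > 0:
                previous.append(mindist[r - 1][c - 1])  # came diagonally
            mindist[r][c] = dist[r][c] + (min(previous) if previous else 0)

    # Trace back from the final cell, always stepping to the cheapest
    # allowed predecessor, to recover the minimum distance path.
    r, c = rows - 1, cols - 1
    path = [(r, c)]
    while (r, c) != (0, 0):
        steps = []
        if r > 0 and c > 0:
            steps.append((mindist[r - 1][c - 1], r - 1, c - 1))
        if r > 0:
            steps.append((mindist[r - 1][c], r - 1, c))
        if c > 0:
            steps.append((mindist[r][c - 1], r, c - 1))
        _, r, c = min(steps)
        path.append((r, c))
    path.reverse()

    return mindist, mindist[-1][-1], [(r + 1, c + 1) for r, c in path]


mindist, total, path = dtw([5, 3, 9, 7, 3], [4, 7, 4])
print(total)
print(path)
```

The first row of the returned `mindist` corresponds to r = 1, so it prints bottom-up relative to the grid drawn on the slide.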
10
Examples so far are uni-dimensional
Speech is multi-dimensional
e.g. two dimensions, using points (4,3) and (5,2)
[Figure: the points (4,3) and (5,2) plotted on axes numbered 1 to 5]
Distance equation for 2 dimensions:
$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$
Distance equation for multi-dimensional:
$d = \sqrt{(x_{t1} - x_{r1})^2 + (x_{t2} - x_{r2})^2 + \dots + (x_{tn} - x_{rn})^2}$
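As a small worked example, the distance between two multi-dimensional frames might be computed like this. The sketch assumes the Euclidean distance (square root of the summed squared differences); the function name `frame_distance` is hypothetical.

```python
import math


def frame_distance(test_frame, reference_frame):
    """Euclidean distance between two frames with the same number of parameters."""
    if len(test_frame) != len(reference_frame):
        raise ValueError("frames must have the same number of dimensions")
    return math.sqrt(sum((t - r) ** 2
                         for t, r in zip(test_frame, reference_frame)))


# The two-dimensional example from the slide: points (4,3) and (5,2).
d = frame_distance((4, 3), (5, 2))
print(d)  # square root of 2
```

A function of this kind would replace the absolute difference used in the uni-dimensional grid examples.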
11
Constraints
• Global
– Endpoint detection
– Path should be close to diagonal
• Local
– Must always travel upwards or eastwards
– No jumps
– Slope weighting
– Consecutive moves upwards/eastwards
12
Global Constraints
13
Local Constraints
[Figure: mindist(r,c) is reached from mindist(r,c-1), mindist(r-1,c) or mindist(r-1,c-1); the three moves carry weights 1, 1 and 2]
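Slope weighting can be sketched by multiplying the local distance by a weight that depends on the move taken. The transcript does not make clear which move carries which weight, so the assignment below (1 for an eastward move, 1 for an upward move, 2 for a diagonal move) is an assumption, as are the function name `weighted_dtw`, the unweighted starting cell and the uni-dimensional frames.

```python
def weighted_dtw(test, reference, w_east=1, w_up=1, w_diag=2):
    """Total DTW distance with a weight applied to each type of move.

    Rows follow the test frames (moving up) and columns follow the
    reference frames (moving east).
    """
    rows, cols = len(test), len(reference)
    mindist = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            d = abs(test[r] - reference[c])
            if r == 0 and c == 0:
                mindist[r][c] = d          # starting cell, no move made yet
                continue
            options = []
            if c > 0:
                options.append(mindist[r][c - 1] + w_east * d)
            if r > 0:
                options.append(mindist[r - 1][c] + w_up * d)
            if r > 0 and c > 0:
                options.append(mindist[r - 1][c - 1] + w_diag * d)
            mindist[r][c] = min(options)
    return mindist[-1][-1]


test = [5, 3, 9, 7, 3]
reference = [4, 7, 4]
unweighted = weighted_dtw(test, reference, 1, 1, 1)
weighted = weighted_dtw(test, reference)
print(unweighted, weighted)
```

With all three weights set to 1 the function reduces to the plain minimum distance computation of the earlier grid example.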
14
Points to Note
• DTW really only suitable for small vocabularies and/or speaker dependent recognition
• Should normalise for reference length
• Can use multiple utterances and cluster them
• Poor performance if recording environment changes
• High computation cost
15
Evaluation
• Performance of designs only comparable by evaluation
• Use a test set
• For single word recognition we can simply quote % accuracy:
$\text{Accuracy} = \frac{\text{No. of words correct}}{\text{No. of test words}} \times 100\%$
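The accuracy figure can be computed directly from the recogniser's outputs on a test set. The sketch below uses a made-up four-word test set purely for illustration; the function name `accuracy` and the data are not from the slides.

```python
def accuracy(recognised, spoken):
    """Percentage of test words whose recognised label matches the spoken word."""
    if len(recognised) != len(spoken) or not spoken:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for rec, truth in zip(recognised, spoken) if rec == truth)
    return 100.0 * correct / len(spoken)


# Hypothetical test set: what was said, and what the recogniser returned.
spoken = ["yes", "no", "yes", "no"]
recognised = ["yes", "no", "no", "no"]
result = accuracy(recognised, spoken)
print(result)  # 3 of 4 correct
```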
In error analysis, it can be helpful to use a confusion matrix
16
Confusion Matrix
                     references
test tokens        yes    no
yes                 24     2
no                   3    21
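A confusion matrix like this one can be tallied from (test token, chosen reference) pairs, and the accuracy read off its diagonal. The sketch below rebuilds the counts shown on the slide; treating rows as test tokens and columns as references is an assumption about the layout, and the function names are hypothetical.

```python
from collections import Counter


def build_confusion_matrix(pairs):
    """Count how often each (test token, chosen reference) pair occurs."""
    return Counter(pairs)


def accuracy_from_confusion(matrix):
    """Percentage of tokens on the diagonal, i.e. recognised correctly."""
    correct = sum(n for (token, ref), n in matrix.items() if token == ref)
    return 100.0 * correct / sum(matrix.values())


# Recreate the slide's counts as individual recognition results.
pairs = ([("yes", "yes")] * 24 + [("yes", "no")] * 2 +
         [("no", "yes")] * 3 + [("no", "no")] * 21)
matrix = build_confusion_matrix(pairs)
overall = accuracy_from_confusion(matrix)
print(overall)  # 45 of the 50 tokens lie on the diagonal
```

The off-diagonal cells show which words are being confused with each other, which a single accuracy figure hides.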