

A Metric Approach to Isolated Word Recognition

by

Raj Verma

Department of Computer Science University of Toronto

Toronto, Ontario, Canada

A Thesis submitted in accordance with the requirements for the Degree of Master of Science

© Copyright by Raj Verma 1991


Abstract

Some of the best known approaches to Isolated Word Recognition (IWR) are based on the Dynamic Time Warping (DTW) distance function. However, because this distance function is not defined w.r.t. an inner product, the recognition decisions in these approaches must be made with nonparametric classification algorithms like the k-nearest-neighbor search method. Although such methods can achieve a high degree of accuracy when sufficient numbers of training samples are present, they require significantly more memory and processing time than vector-based approaches, which can exploit the geometry of the representation space to model the decision regions much more efficiently.

In this thesis, we show how a new metric approach to pattern recognition can be used to construct an efficient vector representation of speech patterns which geometrically preserves the DTW interdistances. This approach, which was first conceived by Goldfarb [10] as a way of unifying the classical structural and vector based methods of pattern recognition, achieves its efficiency by mapping the distance information from an abstract metric space into a low dimensional pseudo-Euclidean vector space. Once this is done, the decision regions of each class can be modeled efficiently using various parametric classification methods like those based on Multi-Layered Perceptrons (MLP). In our case, the metric space will be defined by a standard DTW distance function and a set of isolated word samples. This metric space will then be mapped into the vector space in two stages. In the first stage, we use the Goldfarb embedding algorithm to incrementally construct a vector representation w.r.t. the leading principal axes. From this process, the intrinsic dimension of the metric space will be analyzed. Our studies will show that the intrinsic dimension of the DTW metric space is closely related to the number of classes in the metric space rather than the number of training samples. Furthermore, the convergence to the intrinsic dimension appears to happen at an exponential rate. Once a low dimensional representation is determined, we will use the second stage to construct a more efficient mapping function from the metric space to the vector space. This will be done with a projection formula which uses a set of original input speech samples to act as basis samples in the low dimensional vector space. Obtaining a projection this way is mathematically possible because there is a well-defined relationship between the DTW distance in the metric space and inner products in the pseudo-Euclidean vector space. The basis samples will be selected with a greedy search algorithm that exploits the statistical properties of the vector representation computed by the embedding algorithm.

In our implementation of this approach, using an MLP to analyze the decision regions and testing with monosyllabic digits from the TI 20-word database, we were able to achieve a recognition accuracy of 98.6% on speaker-independent test samples. This score was comparable to the results of the brute-force nearest-neighbor classifier in the DTW metric space; however, our results were obtained using only 16 basis samples in the classifier instead of 320 training samples. Consequently, we were able to reduce the total number of distance computations by nearly 95%, and the total memory requirements by nearly 90%.


Acknowledgements

I would like to first thank my thesis supervisor Ken Sevcik for his support, guidance, and most of all, his patience. I would also like to thank my second reader Geoff Hinton for his comments and suggestions. A special thanks goes to Lev Goldfarb for his contribution to this thesis - this research would not have been possible without his assistance. I am also grateful to Evan Steeg and Timothy Horton for helping out in the implementation and proofreading of this work. Finally, I would like to thank my parents for all their love and support.


Contents

1 Introduction
  1.1 Representing Speech in a DTW Metric Space
  1.2 Representing Speech in a Vector Space
  1.3 Goldfarb's Approach
  1.4 Organization of the Thesis

2 The Metric Approach to Pattern Recognition
  2.1 Introduction
  2.2 Definition of a Metric Space
  2.3 Vector Representation of a Finite Metric Space
    2.3.1 The Multidimensional Scaling Approach
    2.3.2 Goldfarb's Approach
    2.3.3 The Main Embedding Algorithm
    2.3.4 Principal Component Analysis
  2.4 Metric Projections
    2.4.1 A Metric Projection Algorithm
  2.5 On-line Classification

3 Dynamic Time Warping and the Metric Approach to Isolated Word Recognition
  3.1 Introduction
  3.2 The Basic Concept of Dynamic Time Warping
  3.3 Implementation Details of the Dynamic Time Warping Metric
    3.3.1 Representing Spectral Features
    3.3.2 The Dynamic Time Warping Procedure
  3.4 Systolic Array Implementation of DTW
  3.5 The Metric Approach to Isolated Word Recognition
    3.5.1 K-Nearest Neighbor Classifier
    3.5.2 Gaussian Classifier
    3.5.3 Neural Network Classifier

4 Evaluation of the Metric Approach to Word Recognition
  4.1 Introduction
  4.2 Description of the Speech Database
  4.3 Vector Representation of the DTW Metric Space
  4.4 Metric Projection Analysis
  4.5 Analysis of the Recognition Performance
    4.5.1 The Metric Approach vs. the DTW KNN Classifier
  4.6 The Metric MLP vs. Lippmann and Gold's MLP

5 Discussion and Conclusions


Chapter 1

Introduction

1.1 Representing Speech in a DTW Metric Space

Currently, some of the most reliable approaches to Isolated Word Recognition (IWR) are based on the Dynamic Time Warping (DTW) procedure [37]. This procedure uses a dynamic programming algorithm to compute a distance measure between two input signals, and it is typically incorporated into nearest-neighbor classifiers. From a similarity measurement point of view, there are two important advantages one gains from using the DTW distance instead of the more conventional Euclidean distance. First, because the DTW metric does not restrict the inputs to fixed-length vectors, it is possible to preserve the natural time-varying structure of speech in the input representation. Second, the DTW metric can measure the similarity between two common speech utterances far more accurately than the Euclidean metric because it employs operations which account for the elastic nature of speech. These two properties of the DTW metric are important because changing speaking rates can cause temporal variations to appear in the input samples. As a result, an acoustic feature at time t in one word utterance may not necessarily align with the feature at time t in a different utterance of the same word. Furthermore, variation in the speaking rate can cause individual acoustic features in one word to correspond to several features in another word. The Euclidean distance function assumes that there is a one-to-one correspondence between the inputs and consequently, it cannot measure the similarity between two words as accurately.

The DTW procedure overcomes the time variation problem by using a dynamic programming algorithm to map acoustic features in one word to those of another word in a manner which yields the minimum possible distance between the two input samples. Strong constraints are placed on the mapping to prevent unrealistic time alignments from being selected. Once the optimal mapping is found, the global distance between the words is determined by accumulating the local distances between corresponding features. Generally speaking, a nonlinear mapping is required because the temporal structure of speech changes nonuniformly. This was confirmed in a study done by White and Neely [49], who showed that a significant reduction in the recognition performance resulted when distances were computed on a linear time scale rather than a nonlinear time scale using the DTW function.
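To make the procedure concrete, the sketch below computes a basic DTW distance between two feature sequences of possibly different lengths. The symmetric step pattern, the Euclidean local distance, and the path-length normalization are illustrative assumptions only; the actual warping constraints and local distance used in this thesis are defined in chapter 3.

import numpy as np

def dtw_distance(a, b):
    """Basic DTW distance between two sequences of feature vectors.

    a, b : arrays of shape (Ta, d) and (Tb, d); the lengths Ta and Tb may
    differ, which is the point of using DTW instead of the Euclidean
    distance.  A simple symmetric step pattern is assumed here.
    """
    Ta, Tb = len(a), len(b)
    # local distances between every pair of frames
    local = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)
    # accumulated distance table, filled by dynamic programming
    acc = np.full((Ta + 1, Tb + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            best_prev = min(acc[i - 1, j],      # stretch sequence a
                            acc[i, j - 1],      # stretch sequence b
                            acc[i - 1, j - 1])  # advance both sequences
            acc[i, j] = local[i - 1, j - 1] + best_prev
    # normalize by path length so longer words are not penalized
    return acc[Ta, Tb] / (Ta + Tb)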

Although temporal variations in speech can be managed reliably with the DTW procedure, it is difficult to design the classifier efficiently because the only decision algorithms one can actually use are those based on the nearest-neighbor search algorithm. The brute-force nearest-neighbor classifier in a DTW metric space has a search and memory complexity which is linear with the size of the training set. Thus, much of the research in DTW-based approaches to IWR has been directed at heuristic search methods which reduce this complexity.

Rabiner [38, 25, 39, 50] has proposed the use of clustering techniques which locate groups of speech samples with small DTW interdistances. These groups are replaced by an "average" word sample (see [36]) in order to reduce the entire search space. Although a sizable reduction can be achieved this way, they have found that as many as 12 "templates" per class are often required for reliable speaker-independent word recognition. Moreover, the clustering approach cannot deal with outliers efficiently. Another problem with using templates is that important acoustic features associated with a cluster can be "smeared" away when samples are averaged together. As a result, lower accuracy and robustness can result (in comparison to nonclustering approaches), as was shown by Gupta et al. [15].

Vidal et al. [45] exploit the triangular inequality (TI) property of the DTW metric (see [44, 4]) to reduce, at each step of a nearest-neighbor search, the number of remaining samples which need to be checked (see also [6]). Although an impressive reduction in the number of distance computations has been achieved, there are several limitations with this type of approach. The most serious problem is the space requirement of the classifier. To facilitate a reduction in the number of distance computations, a matrix of n(n-1)/2 interdistances (all distances between the training sample pairs) must be stored in the classifier in addition to the n training inputs. This matrix is used during on-line classification to eliminate samples which fall outside the "search area" whose region is determined by the TI condition between the unknown sample and previously tested samples. Consequently, the reduction in the number of distance computations is achieved at the expense of more processing time between each distance computation and added storage requirements.1 Another weakness with this approach is that it is highly sensitive to the degree to which the TI condition is satisfied. Vidal et al. found that when the threshold which maintained the TI condition was set too low, significantly more distance computations were required to converge to the nearest neighbor. On the other hand, if the threshold was set too high, a much higher error rate resulted.
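The following sketch illustrates the general idea behind this kind of search (it is not the exact algorithm of Vidal et al.): because the distance function is a metric, |d(q, p_j) - d(p_j, p_i)| is a lower bound on d(q, p_i) for every previously tested sample p_j, so a candidate whose bound already exceeds the best distance found so far can be rejected without computing its DTW distance.

import numpy as np

def ti_pruned_nearest_neighbor(query, samples, dist, D):
    """Nearest-neighbor search that skips candidates ruled out by the
    triangle inequality.

    samples : stored training samples
    dist    : the (expensive) distance function, e.g. the DTW metric
    D       : precomputed matrix of training-sample interdistances
    """
    best_idx, best_d = None, np.inf
    computed = {}                       # distances already evaluated: j -> dist(query, j)
    for i, cand in enumerate(samples):
        # lower bound on dist(query, cand) from every tested sample j:
        # |dist(query, j) - D[j, i]| <= dist(query, cand)
        bound = max((abs(dq - D[j, i]) for j, dq in computed.items()),
                    default=0.0)
        if bound >= best_d:
            continue                    # pruned: cannot beat the current best
        d = dist(query, cand)           # expensive DTW computation
        computed[i] = d
        if d < best_d:
            best_idx, best_d = i, d
    return best_idx, best_d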

1.2 Representing Speech in a Vector Space

Much of the search and memory complexity inherent in the nearest-neighbor classification algorithm (and the related clustering methods) in a metric space could be lowered significantly if the speech utterances were represented in a vector space where distances were measured w.r.t. an inner product (the Euclidean distance, for example). That is because the well-defined geometric properties of an inner product space can be exploited to obtain a much more efficient classification model. Statistical classification models based on the Euclidean vector representation are well developed in the area of pattern recognition [5, 47]. Generally speaking, efficiency in these models is achieved by storing only the decision regions of each class rather than all of the training samples. By storing the decision regions, it is possible to design a classifier which has a search and memory complexity independent of the size of the training set; instead, the complexity can be made a function of the number of classes and the dimension of the vector space. A key advantage of formulating the complexity in terms of the dimension size is that the dimension can often be reduced using techniques like Principal Component Analysis [30] when the classes are well-separated.

Decision regions in a vector space can be modeled with probability distribution functions [5, ch 2], or with various linear and nonlinear discriminant functions [5, ch 4]. A noteworthy class of nonlinear discriminants are those based on multilayered neural networks [31]. Recent studies have shown that these discriminants can outperform traditional classifiers and are capable of forming unconnected and nonconvex decision surfaces in a Euclidean vector space [19]. Neural networks are also attractive because the parameters needed to model the decision regions can be stored efficiently in a set of weight matrices, and most of the operations required to make a classification decision involve simple matrix calculations which can be done in parallel.

However, to use vector classification models like those described above, the time varying input representation of speech must be preprocessed into a fixed length (ie. n-tuple) representation. Simple preprocessing methods like linear time warping, padding with silence, or extracting fixed sets of input features can be used, but these methods destroy the natural time alignment properties inherent in the DTW metric space representations. This can result in a loss of discriminating information which can make the design of the classifier more complex and less robust.

1.3 Goldfarb's Approach

In this thesis, we will show how the metric approach to pattern recognition (Goldfarb [10]) can be used to generate an efficient vector space representation of speech which geometrically preserves the distance information contained in a DTW metric space. By obtaining this type of representation, we will be able to take advantage of efficient vector classification methods without losing the discriminating information contained in the DTW metric space.

The vector representation will be computed in two stages. The first stage involves the sampling of class variations in a finite metric space using the DTW metric and a set of training speech samples. This metric space is represented in a symmetric distance matrix which contains all pairs of DTW distances between the training samples. The resulting metric space is then transformed onto the leading principal axes of a pseudo-Euclidean vector space at increasingly higher dimensions using an embedding algorithm. During this transformation, the dimensionality of the vector space is analyzed to find a low dimensional representation of the metric space. Once this representation is determined, the second stage is used to construct an efficient formula which approximates the vector representation of an arbitrary sample. This is done using original speech samples as basis vectors for the vector space. The basis samples are incorporated into a standard projection formula which involves the solution of a simple linear system of equations. However, instead of calculating the inner products in the formula, we calculate the DTW distance between the sample being projected and the basis samples. This can be done because there is a well-defined relationship between the distances in a metric space and the inner products in a pseudo-Euclidean vector space. Given the vector projections of a set of training samples, the design of the recognition system is reduced to a classical problem in pattern recognition which involves the modeling of the decision regions.

An early version of this approach to Pattern Recognition was recently applied successfully in an application involving the classification of neurological signals [26]. However, the projection algorithm in this study was not implemented efficiently. Other applications of this approach involving the recognition of geometric shapes and strings have also shown promising results [9, 2, 12]. However, there is still no published work which systematically evaluates issues related to the dimensionality and the projection algorithm. Therefore, a major goal of this thesis is to provide this analysis in the context of a speech recognition application.

1.4 Organization of the Thesis

The organization of this thesis is as follows. In chapter 2 we present the main technical details of the metric approach in a general setting. Then in chapter 3, we define the DTW distance function that we used in our implementation, and we show how it was incorporated into the classification system. This discussion will include a description of the classifiers that were used to analyze the decision regions. We then go on to chapter 4, where we provide both the details of the experimental conditions and the results of our analysis. Finally, we conclude in chapter 5 with a summary of our results and some remarks on how this work can be extended.



Chapter 2

The Metric Approach to Pattern Recognition

2.1 Introduction

This chapter describes the main stages of the metric approach to Pattern Recognition. We begin by first defining the concept of a metric space. We then consider the problem of mapping a metric space into a vector space. A solution to this problem using the multidimensional scaling technique [23] will be considered first, and its weaknesses will be pointed out. Following this, the approach proposed by Goldfarb will be presented in detail. We will include the main definitions and mathematical results of this approach along with a formulation of the embedding algorithm. We will also show how one can study the accuracy vs. dimensionality tradeoff during the transformation by incorporating Principal Component Analysis (PCA) directly into the embedding process. Finally, we show how the vector representation in a low dimension can be computed much more efficiently by using a projection formula.

2.2 Definition of a Metric Space

A metric space (P, δ) is defined formally as any set P having an interdistance relationship measured by a metric function

$$\delta: P \times P \to \mathbb{R}^{+}$$

which satisfies the following three conditions (see [3]):

$$1)\;\; \forall p_1, p_2 \in P: \quad \delta(p_1, p_2) = \delta(p_2, p_1) \qquad \text{(symmetry)}$$
$$2)\;\; \forall p_1, p_2 \in P: \quad \delta(p_1, p_2) \geq 0 \;\text{ and }\; [\,\delta(p_1, p_2) = 0 \Leftrightarrow p_1 = p_2\,] \qquad \text{(positivity)}$$
$$3)\;\; \forall p_1, p_2, p_3 \in P: \quad \delta(p_1, p_2) + \delta(p_2, p_3) \geq \delta(p_1, p_3) \qquad \text{(triangular inequality)}$$

The number δ(p_1, p_2) is called the distance between p_1 and p_2. If the third condition is eliminated, then the distance function is called a pseudo-metric. Given a finite set of samples P = (p_1, ..., p_k) (ie. a finite metric space), one can represent a metric space in a symmetric k-by-k distance matrix D by assigning D[i,j] = δ(p_i, p_j).
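As a small illustration, the distance matrix for a finite sample set can be tabulated as follows; the function delta stands for any metric (for speech it would be the DTW distance of chapter 3), and the symmetry condition is used so that each pair is evaluated only once.

import numpy as np

def distance_matrix(samples, delta):
    """Symmetric k-by-k matrix D with D[i, j] = delta(samples[i], samples[j])."""
    k = len(samples)
    D = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            d = delta(samples[i], samples[j])
            D[i, j] = D[j, i] = d      # symmetry of the metric
    return D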

It is important to note that the definition of a metric space does not require the exact form of the input samples P to be specified. Any convenient input representation (string, tree, or matrix, for example) can be used. In addition to this, the metric function can use any set of operations (such as insertions and deletions) as long as it satisfies the conditions of a metric. These are the two key advantages of a metric space representation which are not supported by the conventional Euclidean vector space R^n representation. In fact, the Euclidean space is actually a very special case of a metric space where the inputs are specified by fixed length ordered n-tuples (v_1, v_2, ..., v_n) ∈ R^n. Using this representation, the distance d(v, u) between any two inputs v, u ∈ R^n must be computed with a very rigid set of operations based on the standard Euclidean inner product ⟨·, ·⟩:

$$d(v, u) = \sqrt{\langle v - u,\; v - u \rangle} = \left[\sum_{i=1}^{n} (v_i - u_i)^2\right]^{1/2}$$

There are two main problems with directly using the Euclidean representation in pattern recognition problems. First, one cannot represent high-order structural relationships which exist between the primitives in the inputs - one can only list the primitives in a certain order. This limitation complicates the design of the classifier because often the features one needs to distinguish one class from another are contained not only in the primitives, but also in the way the primitives connect together. Indeed, the need to base classification decisions on structure led to the development of syntactic approaches to pattern recognition [8, 33].

The second problem with the Euclidean representation is that the operations used in the distance function may not necessarily produce the minimum possible distance between similar patterns in certain domains. One can show, for example, that in the domain of speech, the DTW function produces relatively smaller intradistances within a class than the Euclidean function which uses linearly scaled inputs. The reason the DTW function performs better is that it optimizes over many different possible alignments and selects the one which minimizes the total distance instead of just considering one-to-one alignments.

[Figure 2.1: Illustration of the vector representation problem. A metric space (P, δ) is summarized by a distance matrix D[i,j] = δ(p_i, p_j), which is then mapped to a vector space representation v_i = f(p_i).]

Although a metric space based on non-Euclidean operations can have a more favorable distance configuration, the Euclidean representation is far more appealing because there are many efficient algorithms one can use in the design of the classifier. The efficiency is achieved by exploiting the geometry (ie. inner product) of the representation space in a manner which makes it possible to model only the decision surfaces of each class. In a general metric space, there is no inner product defined; thus, one is forced to use nonparametric methods like the nearest-neighbor search algorithm. It is therefore of interest to ask the following question: can the distance information in a metric space be transformed into an inner product space (that is preferably Euclidean)? If this transformation were possible, one could then model the "decision regions" in a metric space far more efficiently without losing important discriminating information.

2.3 Vector Representation of a Finite Metric Space

Given a finite metric space (P, δ), one way the above question can be stated formally is as follows: is there a way to construct a set of vectors S = (v_1, ..., v_k) where the inner product based distance matrix D[i,j] = ⟨v_i - v_j, v_i - v_j⟩^{1/2} equals the distance matrix in the corresponding metric space (see figure 2.1)? To answer this question, we need to consider the following issues. First, we have to establish whether such a mapping is possible. If equality between the two distance matrices is not possible (or necessary) then it may be useful to consider a set of vectors S where the error between the two matrices is as small as possible. The mapping from the metric space to the vector space will be called isometric if the error can be set to zero.

Second, it is important that the metric points be mapped into the lowest dimensional vector space possible so that the decision regions can be modeled efficiently. Therefore, we would like to have a way of analyzing the trade-off between accuracy and dimensionality during the construction of the vector representation.

2.3.1 The Multidimensional Scaling Approach

One possible solution to the mapping problem can be obtained using the so-called Multidimensional Scaling technique [23, 51, 22]. In this approach we start with some initial configuration of vector points S = (v_1, v_2, ..., v_k), v_i ∈ R^n, in a Euclidean vector space of some chosen dimension n. Then using a sum of the squares error function E_n like the one below

$$E_n(S) = \sum_{i<j} \left( d(v_i, v_j) - D[i,j] \right)^2 \qquad (2.1)$$

we measure the degree to which the current configuration of points S represents the desired matrix of interdistances D in the metric space. From the gradient of this function (the gradient of d(v_i, v_j) with respect to v_i is the unit vector in the direction of v_i - v_j), we can figure out the direction each point v_k ∈ S must be moved in order to minimize the total error. Given this, we can use a standard gradient-descent procedure to successively work towards the desired configuration.
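A minimal sketch of this procedure is given below, assuming a fixed step size and the sum-of-squares error of equation (2.1); practical multidimensional scaling programs use more elaborate step-size and convergence rules.

import numpy as np

def mds_gradient_descent(D, n, steps=500, lr=0.01, seed=0):
    """Find a configuration S of k points in R^n whose Euclidean
    interdistances approximate the target matrix D, by gradient descent
    on the sum-of-squares error of equation (2.1)."""
    rng = np.random.default_rng(seed)
    k = D.shape[0]
    S = rng.normal(size=(k, n))                    # initial configuration
    for _ in range(steps):
        diff = S[:, None, :] - S[None, :, :]       # pairwise difference vectors
        d = np.linalg.norm(diff, axis=2)           # current interdistances
        np.fill_diagonal(d, 1.0)                   # avoid division by zero
        # gradient of (d_ij - D_ij)^2 w.r.t. point i is
        # 2 (d_ij - D_ij) times the unit vector in the direction v_i - v_j
        grad = 2.0 * ((d - D) / d)[:, :, None] * diff
        S -= lr * grad.sum(axis=1)                 # move each point downhill
    return S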

Although this technique appears to offer a reasonable solution, it has several technical problems, as pointed out by Goldfarb [10], which limit its effectiveness:

1. There is no guarantee of convergence to the optimal configuration since the algorithm may get stuck in a local minimum. Furthermore, convergence to any solution could take an excessive amount of time.

2. There is no reliable way to determine what the dimension of the representation should be (ie., how much accuracy is lost by using dimension k-1 rather than k). The measurement E_{k-1}(S) - E_k(S) is not reliable because optimal convergence cannot be assured.

3. In general, a finite metric space cannot be isometrically represented in a Euclidean vector space of any dimension. (This will be shown in the next section. Also see Goldfarb [10].)

2.3.2 Goldfarb's Approach

It turns out that a finite metric space can be mapped isometrically into a vector space if one considers a non-Euclidean inner product space. Moreover, there is a very efficient algorithm available to do this which overcomes the above-mentioned problems of Multidimensional Scaling. The definition of this vector space and the main result which establishes the existence of this algorithm are described next.

Definition of a pseudo-Euclidean Vector Space

A pseudo-Euclidean (or Minkowski) vector space R^(n+,n-) of signature (n+, n-) is a real vector space of dimension n = n+ + n- where the inner product ⟨v, w⟩ between two vectors v = (v_1, ..., v_{n+}, v_{n+ +1}, ..., v_n) and w = (w_1, ..., w_{n+}, w_{n+ +1}, ..., w_n), v, w ∈ R^(n+,n-), is calculated as follows:

$$\langle v, w \rangle = \sum_{i=1}^{n^+} v_i w_i \;-\; \sum_{i=n^+ + 1}^{n} v_i w_i$$

Note that this vector space is a generalization of the Euclidean space R^n (ie. R^(n+,n-) is Euclidean whenever n- = 0).
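As a small illustration (not part of the original development), the signature inner product and the corresponding squared vector length can be computed as follows; note that the squared length may be zero or negative for a nonzero vector, which is what distinguishes the pseudo-Euclidean case from the Euclidean one.

import numpy as np

def pe_inner(v, w, n_pos):
    """Pseudo-Euclidean inner product in R^(n+, n-): the first n_pos
    coordinates contribute positively, the remaining ones negatively."""
    v, w = np.asarray(v, float), np.asarray(w, float)
    return np.dot(v[:n_pos], w[:n_pos]) - np.dot(v[n_pos:], w[n_pos:])

def pe_squared_length(v, n_pos):
    """Squared length <v, v>; may be negative or zero for v != 0."""
    return pe_inner(v, v, n_pos)

# Example: an event on the light cone of R^(2,1) has squared length
# x^2 + y^2 - (ct)^2 = 0.
print(pe_squared_length([3.0, 4.0, 5.0], n_pos=2))   # prints 0.0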

One very interesting example of a pseudo-Euclidean vector space is the 4 dimensional vector space R^(3,1). This vector space, which is historically linked to the work of Lorentz, Minkowski and Einstein, describes the basic relationship between space and time in special relativity theory (see [7]). This relationship is illustrated in figure 2.2 for a two-dimensional world (ie. the subspace R^(2,1)), and it provides some intuition into the geometry of pseudo-Euclidean spaces. The diagram shows the space coordinates of light at different instants in time. Note that c·t, rather than t, is used to describe the time scale; thus each axis measures distance.1 In this diagram, the events e = (x, y, ct) which lie on the so-called "light cone" are those which have a vector length (distance) of zero relative to an observer positioned at the origin (these events are all connected by a common light signal):2

$$\|e\|^2 = \langle e, e \rangle = x^2 + y^2 - (ct)^2 = 0$$

Events which lie inside the cone in the positive time region occur in the future relative to the observer (ie. the distance to the observer yields negative time); while events inside the negative part of the cone occur in the past (ie. the distance to the observer yields positive time). Events outside the cone cannot exist because this would require light to travel faster than c.

[Figure 2.2: The space-time geometry of a two-dimensional world.]

1 The variable c refers to the speed of light.
2 The vectors on the "light cone" are sometimes called isotropic vectors (see [14] and [10]).

The Main Embedding Theorem [Goldfarb 85, ch. 5]

For any finite pseudo-metric space (P, δ), there exists an isometric mapping

$$f: (P, \delta) \to R^{(n^+, n^-)}$$

such that for any pair of metric samples p_i, p_j ∈ P

$$\delta(p_i, p_j) = \| f(p_i) - f(p_j) \|$$

where ||v - w|| (defined as ⟨v - w, v - w⟩^{1/2}) is the distance between the vectors v and w in the pseudo-Euclidean space. Furthermore, the algorithm which implements this transformation is guaranteed to map the metric samples into the smallest dimensional vector space possible.

This theorem tells us that for every sample p_i ∈ P in a finite metric space (P, δ), there exists a corresponding pseudo-Euclidean vector in R^(n+,n-) which can be constructed by the mapping function f. The mapping is done in a manner where the pseudo-Euclidean distance between any two vectors v = f(p_i), w = f(p_j) is equal to the distance between p_i and p_j in the metric space. In other words, the mapping function f constructs a set of vectors S for which the sum of the square error E(S) in (2.1) is zero. The theorem also tells us that the metric does not have to satisfy the triangular inequality (ie. it can be a pseudo-metric).

2.3.3 The Main Embedding Algorithm

A detailed proof of the embedding theorem can be found in Goldfarb [10]. However, because the proof is of a constructive nature, it is useful to take a close look at it to understand how the mapping function f works.3

We start by recalling a well-known algebraic result about symmetric matrices, which asserts that any symmetric matrix H can be orthogonally diagonalized (into Λ) by using an orthogonal matrix Q (ie. Q^{-1} = Q^t) whose columns are the set of eigenvectors of H (see for example [14]):

$$H = Q \Lambda Q^{t}$$

The diagonal elements Λ = diag(λ_1, λ_2, ..., λ_n) are the corresponding eigenvalues of H. If we denote Λ+ = diag(λ_1, λ_2, ..., λ_{n+}) and Λ- = diag(λ_{n+ +1}, λ_{n+ +2}, ..., λ_n) as the diagonal matrices corresponding to the sets of positive and negative eigenvalues respectively, then we can rearrange the columns (Q_1, Q_2, ..., Q_n) in Q so that the form of the above expression is as follows:

$$H = Q \begin{bmatrix} \Lambda^{+} & 0 \\ 0 & \Lambda^{-} \end{bmatrix} Q^{t}$$

If we now move the square root of each eigenvalue out of the middle matrix above by setting the matrix V as follows:

$$V = Q \sqrt{\Lambda}, \qquad \sqrt{\Lambda} = \mathrm{diag}\left(\sqrt{|\lambda_1|}, \ldots, \sqrt{|\lambda_n|}\right)$$

then, we can rewrite H as follows:

$$H = V \begin{bmatrix} I^{+} & 0 \\ 0 & I^{-} \end{bmatrix} V^{t}$$

where I+ and I- are the positive and negative identity matrices of size n+ by n+ and n- by n- respectively.

From this expression one can see that each entry H[i,j] in the symmetric matrix is equal to the pseudo-Euclidean inner product between the two vectors v_i, v_j ∈ R^(n+,n-), where vector v_i corresponds to the i-th row in the matrix V:

$$H[i,j] = \langle v_i, v_j \rangle = v_{i,1} v_{j,1} + \ldots + v_{i,n^+} v_{j,n^+} - v_{i,n^+ +1} v_{j,n^+ +1} - \ldots - v_{i,n} v_{j,n}$$

This shows that any k-by-k symmetric matrix H is in fact a matrix of inner products of k pseudo-Euclidean vectors in R^(n+,n-) which can be obtained from a diagonalization process.

3 This description is based in part on a result shown to me by J.J. Seidel (CSC2414 course notes, Fall 1986).

Although the matrix D corresponding to a metric space is symmetric, it is not a matrix of inner products, but rather a matrix of interdistances. Thus, to use the above result we need to convert each distance entry D[i,j] into its corresponding inner product value H[i,j] = ⟨v_i, v_j⟩ to get the decomposition to work correctly. If this is done we will end up with vectors v_i, v_j ∈ R^(n+,n-) having a pseudo-Euclidean distance ⟨v_i - v_j, v_i - v_j⟩^{1/2} equal to the value D[i,j]. To do this, we simply need to solve for ⟨v_i, v_j⟩ in the equation D[i,j] = ⟨v_i - v_j, v_i - v_j⟩^{1/2}, which can be done as follows:


$$D^2[i,j] = \delta^2(p_i, p_j) = \langle v_i - v_j,\; v_i - v_j \rangle$$
$$= \langle v_i, v_i \rangle - 2\langle v_i, v_j \rangle + \langle v_j, v_j \rangle$$
$$= \|v_i - v_0\|^2 - 2\langle v_i, v_j \rangle + \|v_j - v_0\|^2$$
$$= \delta^2(p_i, p_0) - 2\langle v_i, v_j \rangle + \delta^2(p_j, p_0)$$

$$\langle v_i, v_j \rangle = \frac{1}{2}\left[ \delta^2(p_i, p_0) + \delta^2(p_j, p_0) - \delta^2(p_i, p_j) \right] \qquad (2.2)$$

The sample p_0 above is simply the metric sample designated as the origin v_0 in R^(n+,n-). (In the next section we show more precisely how we designate this origin.) Thus, if we convert each entry D[i,j] in the distance matrix into the corresponding inner product value H[i,j] = ⟨v_i, v_j⟩ using the r.h.s. of (2.2), and then decompose this new matrix H into pseudo-Euclidean vectors using the diagonalization technique described above, the result will be vectors v_i and v_j (rows i and j in V) which have a pseudo-Euclidean distance equal to the original distance between samples p_i and p_j.

The above embedding process not only shows how to construct a vector representation, it also shows that a finite metric space can only be mapped isometrically into a Euclidean vector space when the corresponding matrix of inner products H is non-negative definite. One can also see that the dimension of the isometric representation is equal to the number of nonzero eigenvalues. The time complexity of this embedding algorithm is dominated by the computation time required to diagonalize a symmetric matrix, which is of order O(k^3) [13]. However, as we shall show in the next section, one can reduce this complexity by embedding (nonisometrically) into a lower dimensional vector space.

2.3.4 Principal Component Analysis

Although the embedding algorithm does find the lowest dimensional vector representation that precisely preserves the interdistances in a metric space, practical considerations will make it necessary to work with a still lower dimensional representation. The number of non-zero eigenvalues n = n+ + n- (ie. the dimension size) constructed by the diagonalization process is usually as large as the number of samples |P| in the finite metric space (P, δ). However, empirical studies [9, 2, 12] have shown that when good class separability is achieved in the metric space, the intrinsic dimension of the vector space tends to be related to the number of classes in the metric space rather than the number of samples. In this context, the intrinsic dimension refers to the lowest dimension where the error in the distance matrix becomes insignificant to the recognition performance. One way to see how the class separation affects the dimension size is to imagine the situation where the average interdistance within a class incrementally approaches zero for c classes in Euclidean n-space with n > c. As the average distance within a class approaches zero, the upper bound on the dimension size must approach c - 1 (ie. when each class becomes a point, we can pass a c - 1 dimensional hyperplane through the c points). A similar argument can be used for points represented in a metric space. If we have a situation where the average interdistance is zero in each class, we would only need to represent c points in the metric space - one for each class. These points could be represented isometrically in a pseudo-Euclidean vector space of dimension no greater than c because the maximum number of eigenvalues we would obtain from the c-by-c interdistance matrix is c.

One way to approximate the intrinsic dimensionality of a metric space is to use the estimate proposed by Pettis et al. [34]. This measurement uses nearest-neighbor information in a finite metric space to measure the minimum number of parameters that is necessary to represent the input data. Using a DTW metric space consisting of as many as 200 classes, Baydal et al. [1] concluded that, based on the Pettis et al. estimate, 13 parameters would be sufficient to represent each of their speech samples. It is interesting to note that this number is fairly consistent with the number of distance computations that were needed by the Vidal et al. [45] nearest-neighbor algorithm. However, because this estimate is not based on a vector representation, it is not a very reliable way to measure the true intrinsic dimensionality of a metric space.

A far more accurate method is to analyze the corresponding pseudo-Euclidean vector representation w.r.t. the principal components. The standard way to perform PCA is to first obtain the covariance matrix from the vector points, and then to rotate the axes to the principal axes of the covariance matrix. Figure 2.3 shows the geometrical meaning of the principal components in a two dimensional space. First, the origin of the vector representation is shifted to the mean vector. Then an orthogonal rotation of the axes occurs (from the x-y to the x'-y' coordinate system) so that each axis lies successively in the direction of greatest variance while maintaining orthogonality to the existing axes. These directions are determined by the eigenvectors of the covariance matrix, and the corresponding eigenvalues measure the amount of variance along each axis. One can see in figure 2.3 that if the distance information (ie. variance) is contained intrinsically in a low dimension, we can obtain a low dimensional representation by simply removing the principal axes which have relatively small amounts of variance, without distorting the global metric structure.

[Figure 2.3: Geometric interpretation of the principal components.]

The definition of a covariance matrix in a pseudo-Euclidean space is actually a generalization of the classical one (see Goldfarb [10, ch. 5]). For the vectors contained in V, it can be computed as follows. First the origin is shifted to the mean vector v̄ = (1/k) Σ_{i=1}^{k} v_i, k = |P|. That is, we substitute each row v_i, i = 1, ..., k in V with v_i - v̄. Then we compute the covariance matrix as follows:

$$V^{t} V J \qquad \text{where} \qquad J = \begin{bmatrix} I^{+} & 0 \\ 0 & I^{-} \end{bmatrix}$$

The eigenvectors of this matrix define the principal axes in R^(n+,n-), and the magnitudes of the corresponding eigenvalues are proportional (by a factor of 1/k) to the variance along each axis [10]. These eigenvalues can be used to measure the (relative) amount of distance information that is contained along each principal axis.

Thus, given an isometric pseudo-Euclidean representation of a metric space, one way to obtain a low dimensional representation in R^(m+,m-) with m = m+ + m- < n is as follows. First compute the covariance matrix using all of the pseudo-Euclidean vectors as described above. Then diagonalize the covariance matrix to determine the eigenvectors and the eigenvalues, and retain the leading m eigenvectors, which correspond to the m largest eigenvalues (according to their magnitudes). These eigenvectors are then used to transform the original vector representation to the leading m principal components. This is done by simply forming a transition matrix whose rows are the eigenvectors. If one simply wants a low dimensional Euclidean representation, then only the eigenvectors associated with the same signature (ie. in the set Λ+ or Λ-) would be retained in the transition matrix. The choice of m can be made using many different approaches. For example, one can incrementally transform the vectors into higher dimensions and measure the quality of the representation w.r.t. various criteria. Reasonable criteria include the scatter ratio, the average distance between the classes, or the sum of the square error defined by equation (2.1). Once one of these measurements reaches an acceptable value (ie. when the separability in the vector space approaches the separability in the metric space), one could terminate the search process. In chapter 4, an empirical study will be done to investigate this issue further.
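A simple criterion of this kind is sketched below. It assumes that the embedding stage has already produced the coordinate matrix V and the eigenvalues of the leading principal axes (ordered by magnitude, with the sign of each eigenvalue giving the signature of its axis); the scatter ratio or the average between-class distance could be computed in the same manner.

import numpy as np

def squared_distance_error(V, eigvals, D, m):
    """Error between the original squared interdistances D**2 and the
    squared pseudo-Euclidean interdistances recovered from the leading
    m principal axes.

    V       : k-by-n matrix of embedded coordinates, columns ordered by
              eigenvalue magnitude
    eigvals : the corresponding eigenvalues (their signs give the signature)
    """
    Vm = V[:, :m]
    sign = np.sign(eigvals[:m])
    diff = Vm[:, None, :] - Vm[None, :, :]
    d2 = (sign * diff ** 2).sum(axis=2)      # pseudo-Euclidean squared distances
    return np.sum((d2 - D ** 2) ** 2)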

However, the problem with using this approach is that we have to first construct a full dimensional representation in order to determine the covariance matrix and the transition matrix. Fortunately, it is possible to program the embedding process so that it produces a vector representation that is already transformed to the principal axes. Consequently, a full dimensional representation does not have to be constructed - we can simply program the embedding process to terminate after a sufficient number of principal components have been generated.

To do this, we need to make sure the origin is already positioned at the mean vector v̄, instead of an arbitrary sample p_0, when we form the matrix of inner products H. For this to happen, the sample p_0 representing the origin in (2.2) must be selected with the property f(p_0) = v̄. If this is done (later we show how), then the eigenvalues of H will be the same as the eigenvalues of the covariance matrix V^t V J, as shown below (in the formulation below, √Λ denotes the diagonal matrix diag[√|λ_1|, √|λ_2|, ..., √|λ_n|]):

$$V^{t} V J = (Q\sqrt{\Lambda})^{t} (Q\sqrt{\Lambda}) J = \sqrt{\Lambda}\, Q^{t} Q\, \sqrt{\Lambda}\, J = \sqrt{\Lambda}\, I\, \sqrt{\Lambda}\, J = \Lambda \qquad (2.3)$$


This important result by Goldfarb [10] tells us that when H is computed with the mean positioned at the origin, the resulting vectors (rows in V = Q√Λ) from the embedding process will be given w.r.t. the principal axes. That is, if we computed the covariance matrix V^t V J using the rows of V from the embedding process, then we would end up with the diagonal matrix Λ, which is the set of eigenvalues computed from H. Consequently, the vectors in V must already exist w.r.t. the principal axes. Furthermore, the eigenvalues of H measure the amount of variance along each principal axis. This means that the analysis of the intrinsic dimensionality can be done during the embedding process. For each new eigenvector and eigenvalue obtained during the embedding process, a new principal axis (ie. column in V) can be added to the representation. Thus, to obtain an m dimensional representation, only the leading m eigenvectors and eigenvalues need to be computed.

Obtaining the principal components directly from the embedding process has important practical implications for the implementation of the embedding algorithm. Diagonalization algorithms like those based on Householder reduction and the QL transformation [13] can be used to incrementally generate more eigenvalues and eigenvectors. These algorithms are numerically stable, and they can be programmed to terminate when an acceptable criterion is satisfied in the vector representation as described earlier, or when the (relative) size of the eigenvalue becomes sufficiently small. If we terminate the diagonalization process at some dimension m, then the representation that results is the orthogonal projection of the isometric representation on the subspace spanned by the leading m principal axes.

However, in order to perform PCA during the embedding process, we need to be able to find a sample with the property f(p_0) = v̄. In general such a "mean" sample is not likely to exist in P, and even if it did, there is no way of knowing this without actually obtaining a full isometric representation first. However, Goldfarb [10] has shown that one can get around this problem algebraically by using the well-defined relationship between distances and inner products (see the formulation of (2.2)):

$$\delta^2(p_i, \bar p) = \langle v_i - \bar v,\; v_i - \bar v \rangle$$
$$= \langle v_i, v_i \rangle + \langle \bar v, \bar v \rangle - 2 \langle v_i, \bar v \rangle$$
$$= \langle v_i, v_i \rangle + \frac{1}{k^2}\sum_{x=1}^{k}\sum_{y=1}^{k} \langle v_x, v_y \rangle - \frac{2}{k}\sum_{x=1}^{k} \langle v_i, v_x \rangle$$
$$= \delta^2(p_i, p_0) + \frac{1}{2k^2}\sum_{x=1}^{k}\sum_{y=1}^{k}\left[\delta^2(p_x, p_0) + \delta^2(p_y, p_0) - \delta^2(p_x, p_y)\right] - \frac{1}{k}\sum_{x=1}^{k}\left[\delta^2(p_x, p_0) + \delta^2(p_i, p_0) - \delta^2(p_i, p_x)\right]$$
$$= \frac{1}{k}\sum_{x=1}^{k} D[x,i]^2 \;-\; \frac{1}{2k^2}\sum_{x=1}^{k}\sum_{y=1}^{k} D[x,y]^2 \qquad (2.4)$$

This result shows that the squared distance between any metric sample p_i and the pre-image of the mean vector p̄ = f^{-1}(v̄) can actually be determined directly from the interdistance matrix D without first embedding. Taking the r.h.s. of this equation and substituting it for δ^2(p_i, p_0) in (2.2) ensures that the matrix of inner products H has its mean vector already positioned at the origin.

A full summary of the embedding algorithm is provided below.

The main algorithm.

1. Let D[i,j] = δ(p_i, p_j), 1 ≤ i,j ≤ k. Compute the symmetric matrix H[i,j], 1 ≤ i,j ≤ k:

$$H[i,j] = \frac{1}{2}\left[\frac{1}{k}\left(\sum_{x=1}^{k} D[x,j]^2 + \sum_{x=1}^{k} D[i,x]^2\right) - \left(\frac{1}{k^2}\sum_{x=1}^{k}\sum_{y=1}^{k} D[x,y]^2\right) - D[i,j]^2\right]$$

2. Find the eigenvalues of H:

$$\lambda_1, \lambda_2, \ldots, \lambda_{n^+}, \lambda_{n^+ +1}, \ldots, \lambda_n, \underbrace{0, 0, \ldots, 0}_{k-n}$$

where n = n+ + n-. The number of positive eigenvalues is n+ and the number of negative eigenvalues is n-. Now form the diagonal matrix √Λ:

$$\sqrt{\Lambda} = \mathrm{diag}\left(\sqrt{|\lambda_1|}, \sqrt{|\lambda_2|}, \ldots, \sqrt{|\lambda_n|}\right)$$

3. Find the corresponding orthonormal eigenvectors Q_1, Q_2, ..., Q_n of matrix H, and form the matrix Q whose i-th column is Q_i.

4. Compute the matrix V = Q·√Λ, and set f(p_i) = v_i, where the coordinates of v_i are the first n elements of the i-th row of V. The basis vectors for this representation will then be the principal axes of the corresponding covariance matrix of the resulting vectors in R^(n+,n-).
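The four steps translate almost directly into the sketch below. A standard symmetric eigensolver is used here in place of the incremental Householder/QL routine discussed earlier, so this version computes the full spectrum and then keeps the leading axes by eigenvalue magnitude; the row and column means of the squared distance matrix implement step 1.

import numpy as np

def embed_metric_space(D, m=None):
    """Goldfarb-style embedding of a finite (pseudo-)metric space.

    D : k-by-k matrix of interdistances.
    m : number of leading principal axes to keep (default: all nonzero).
    Returns (V, eigvals) where row i of V is the vector representation of
    sample i w.r.t. the leading principal axes (mean placed at the origin).
    """
    D2 = D ** 2
    row = D2.mean(axis=0)                       # (1/k) * sum_x D[x, j]^2
    total = D2.mean()                           # (1/k^2) * sum_xy D[x, y]^2
    # Step 1: inner products with the mean vector at the origin
    H = 0.5 * (row[None, :] + row[:, None] - total - D2)
    # Steps 2 and 3: eigenvalues and orthonormal eigenvectors of H
    eigvals, Q = np.linalg.eigh(H)
    order = np.argsort(-np.abs(eigvals))        # sort by magnitude
    eigvals, Q = eigvals[order], Q[:, order]
    if m is None:                               # drop the (near-)zero eigenvalues
        m = int(np.sum(np.abs(eigvals) > 1e-9 * np.abs(eigvals[0])))
    eigvals, Q = eigvals[:m], Q[:, :m]
    # Step 4: V = Q * sqrt(|Lambda|); positive eigenvalues give Euclidean
    # axes, negative ones give the pseudo-Euclidean part of the signature
    V = Q * np.sqrt(np.abs(eigvals))
    return V, eigvals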

2.4 Metric Projections

So far, we have discussed the first stage of the metric approach, which involves the transformation of a metric space representation into a vector space, during which the intrinsic dimension is determined. This low dimensional representation in R^(m+,m-) will be referred to as the principal axis representation of a metric space.

In the next stage of the metric approach we address the following question. Given a principal axis representation of a metric space, how do we determine the principal axis representation of an arbitrary sample p* in the metric space without repeating the entire embedding process for (P ∪ {p*}, δ)? In other words, how do we compute the vector representation efficiently?

It is done as follows. We know that v* = f(p*) corresponds to some point in R^(n+,n-). The precise point could be obtained by embedding (P ∪ {p*}, δ) into R^(n+,n-) as described in the previous section. We would like to find the orthogonal projection u* = π(v*) of v* on the subspace R^(m+,m-) spanned by the m = m+ + m- leading principal axes representing the metric space (see figure 2.4).

[Figure 2.4: Illustration of the concept of a metric projection.]

Let (b_1, b_2, ..., b_m), m ≤ n, be any set of basis vectors b_i ∈ R^(n+,n-) which span this space. If we knew the value of the inner products ⟨v*, b_i⟩ and ⟨b_i, b_j⟩ for all i,j = 1, ..., m, then we could determine u* by simply solving a system of m linear equations. These equations would be formed as follows. We use the orthogonality condition between the vector (v* - u*) and each basis vector b_i to give us m equations:

$$\langle v^* - u^*, b_i \rangle = 0 \qquad \forall i = 1 \ldots m \qquad (2.5)$$

The coordinates of u* w.r.t. the basis vectors are some set of constants (c_1, ..., c_m) such that u* = Σ_{i=1}^{m} c_i b_i. Thus if we substitute this expression for u* into (2.5) above and expand, we can form the following set of linear equations:

$$\begin{array}{ccc}
\langle b_1, b_1 \rangle c_1 + \langle b_2, b_1 \rangle c_2 + \ldots + \langle b_m, b_1 \rangle c_m & = & \langle v^*, b_1 \rangle \\
\vdots & & \vdots \\
\langle b_1, b_m \rangle c_1 + \langle b_2, b_m \rangle c_2 + \ldots + \langle b_m, b_m \rangle c_m & = & \langle v^*, b_m \rangle
\end{array}$$

Denoting b as the column-matrix with b[i] = ⟨v*, b_i⟩ and G as the m-by-m (Gram) matrix with G[i,j] = ⟨b_i, b_j⟩, we can easily obtain the coordinates of u* by computing G^{-1}·b. This gives the vector representation w.r.t. the chosen basis vectors. To find out what the coordinates of u* are w.r.t. the principal axes (ie. the eigenvectors of the covariance matrix), we simply have to multiply the coordinates of u* by the transition matrix B whose i-th column is the basis vector b_i. Therefore, the projection of p* can be calculated as follows:

$$\pi(p^*) = B \cdot G^{-1} \cdot b \qquad (2.6)$$

However, to evaluate this projection formula, we need to be able to compute the column matrix b for some set of basis vectors. This can be done without knowing the full vector representation of p* by using the relationship between inner products and distances given in equation (2.2). If we use real metric samples from P as our basis, then (2.2) can be used to measure the inner products between v* = f(p*) and any vector f(p_i) where p_i ∈ P.

To be specific, if we denote the subset of metric samples P_b = (p_{b_1}, p_{b_2}, ..., p_{b_m}) as our basis, the entries b[i] = ⟨v*, b_i⟩ can be computed as follows (the Gram matrix can also be computed from (2.2) in a similar manner):

$$b[i] = \langle v^*, b_i \rangle = \frac{1}{2}\left[\delta^2(p^*, p_0) + \delta^2(p_{b_i}, p_0) - \delta^2(p^*, p_{b_i})\right]$$

where p_0 is the designated origin sample. Actually, to obtain a true principal axis representation of p*, we need to designate the origin as the mean vector, which means that in the above formula, the r.h.s. of (2.4) needs to be used in place of δ^2(p*, p_0) and δ^2(p_{b_i}, p_0), as was done in the main embedding algorithm. Although there is no significant problem in determining δ^2(p_{b_i}, p_0) this way, since all of the distances required are available in D and the computation can be done off-line, the computation of δ^2(p*, p_0) is clearly too expensive for on-line applications because the distance between p* and each of p_i ∈ P needs to be computed before a projection can be obtained. However, this overhead can be eliminated if we designate a real sample p_0 as the origin instead of the mean point. This simply corresponds to a parallel translation of the vector representation; consequently, no metric information is lost (provided we also shift the training samples in the same manner after the embedding is completed). If we do this, then only m + 1 new distance computations need to be done to obtain a projection: one between p* and p_0, and one between p* and each sample p_{b_i}, i = 1, ..., m. In the next section, we describe a method for selecting the origin sample and the basis samples.
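Assuming the origin sample, the basis samples, and their coordinates on the leading principal axes are already available from the off-line stage, the projection formula (2.6) can be sketched as follows (the names and argument layout are illustrative, not taken from the thesis implementation); the only on-line work is the m + 1 distance computations and the solution of a small m-by-m linear system.

import numpy as np

def project_sample(p_new, origin, basis_samples, d_basis_origin, B, axis_signs, delta):
    """Project a new metric sample onto the subspace spanned by the basis.

    origin         : the metric sample p0 designated as the origin
    basis_samples  : the m metric samples chosen as basis
    d_basis_origin : precomputed distances delta(p_b_i, p0), i = 1..m
    B              : m-by-n matrix; row i holds the embedded coordinates of
                     basis sample i (relative to the same origin)
    axis_signs     : +1 for Euclidean axes, -1 for pseudo-Euclidean axes
    delta          : the metric, e.g. the DTW distance function
    """
    d0 = delta(p_new, origin) ** 2                      # 1 on-line distance
    b = np.array([0.5 * (d0 + d_basis_origin[i] ** 2
                         - delta(p_new, s) ** 2)        # m on-line distances
                  for i, s in enumerate(basis_samples)])
    G = (B * axis_signs) @ B.T                          # Gram matrix <b_i, b_j>
    c = np.linalg.solve(G, b)                           # coordinates w.r.t. the basis
    return B.T @ c                                      # coordinates on the principal axes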



Picking the Origin and the Basis Samples

In our description of the projection formula, we conveniently made the assumption that there existed a subset P_b ∪ {p_0} ⊂ P which would span the same subspace as that spanned by the leading m principal axes. In other words, we assumed that the full vector representation (ie. the representation in R^(n+,n-)) of these samples was located precisely on the subspace R^(m+,m-) and that linear independence between the samples existed. In general, however, there is no guarantee that such a set exists. Furthermore, even if this set did exist, we would have no tractable way of knowing this unless we obtained the complete vector representation of (P, δ).

Thus, we would like to find, in a tractable manner, a subset of samples P_b ∪ {p_0} which spans a subspace that is as close as possible to the subspace spanned by the m leading principal axes. The brute-force solution to this problem is to consider all possible ordered subsets of size (m + 1) from P. For each (m + 1)-subset, all samples in P can be projected w.r.t. this set using (2.6), and the error between this representation and the original principal axis representation can be measured using some suitable criterion (like the sum-of-square-error (2.1), for example). The subset which we would eventually select is the one which results in the minimum error.

Obviously this search method is not practical, as there are |P|!/(|P| - (m+1))! possible ordered (m + 1)-subsets.4

Nevertheless, if we relax the optimality condition, and use the statistical properties of the principal axis representation computed by the embedding algorithm along with the information stored in the distance matrix D, then a reliable solution can be determined in a tractable manner.

2.4.1 A Metric Projection Algorithm

The solution that we propose is based on the greedy search method and is illustrated in figure 2.5. We start by first picking a sample to represent the origin. In our solution, we use the nearest-neighbor to the mean vector in R^(n+,n-). Fortunately, we do not have to obtain a



Figure 2.5: The search space of the basis samples and origin considered in the Metric Projection formula.

full dimensional representation of (P, δ) to do this. The nearest-neighbor can be determined directly from the distance matrix D by using the following result by Goldfarb [10]: the nearest-neighbor to the mean vector in R^(n+,n-) corresponds to the metric sample p ∈ P which has the minimum accumulated distance Σ_{i=1}^{|P|} δ(p, p_i) to all other samples in P. In other words, the index of this sample is the row in D which has the minimum row sum.
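In code, this rule reduces to a single row sum over the precomputed distance matrix; a minimal sketch, assuming D is a NumPy array holding the DTW interdistances, is:

    import numpy as np

    def pick_origin(D):
        # The origin sample is the one with the minimum accumulated distance
        # to all other samples, ie. the row of D with the minimum row sum.
        return int(np.argmin(D.sum(axis=1)))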

Once the origin sample is determined, we successively find the samples p_bi ∈ P to represent each principal axis in R^(m+,m-), starting with the axis a_1 associated with the largest (in magnitude) eigenvalue and ending with the axis a_m associated with the smallest (in magnitude) eigenvalue. Let the set β = (p_b1, p_b2, ..., p_b(i-1)) be the basis selected for the first leading i-1 axes. The next axis, a_i, is determined by first locating the maximum projections v_l and v_r on both the left and right side of the axis (see figure 2.5). Then, the k-nearest-neighbors P_l = (p_l1, ..., p_lk) and P_r = (p_r1, ..., p_rk) of both these points are located from the reduced vector representation in R^(m+,m-). For each of these points p_j ∈ P_l ∪ P_r, we project all of the samples in P w.r.t. the basis set β ∪ {p_j} and then we compute some criterion which describes the error of the resulting representation. (In chapter 4, we will define the measurement which was used in our implementation.) Among the samples in P_l ∪ P_r, we select the sample that minimizes the criterion. This sample is then appended to the basis set β and the greedy search continues, looking for a sample to represent axis a_(i+1). The process finally stops when axis a_m is reached.
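A sketch of this greedy selection loop is given below. The helpers project_all() and representation_error() are hypothetical placeholders for the metric projection of the training set w.r.t. a candidate basis and for the error criterion defined in chapter 4, and the nearest-neighbor search uses a Euclidean norm, which is only appropriate when the leading axes are Euclidean (as they are in the experiments of chapter 4).

    import numpy as np

    def greedy_basis_selection(V, m, k, project_all, representation_error):
        # V: n x m principal-axis representation produced by the embedding algorithm
        # m: number of basis samples to select (one per retained principal axis)
        # k: number of candidates examined at each end of an axis
        basis = []
        for i in range(m):
            axis = V[:, i]
            # samples achieving the maximum projections on the left and right of axis a_i
            ends = (int(np.argmin(axis)), int(np.argmax(axis)))
            candidates = set()
            for e in ends:
                # k nearest neighbours of each end point in the reduced representation
                dist = np.linalg.norm(V[:, :i + 1] - V[e, :i + 1], axis=1)
                candidates.update(np.argsort(dist)[:k].tolist())
            candidates -= set(basis)

            # keep the candidate whose projections best reproduce the leading i+1 axes
            best, best_err = None, np.inf
            for c in candidates:
                U = project_all(basis + [c])
                err = representation_error(U, V, i + 1)
                if err < best_err:
                    best, best_err = c, err
            basis.append(best)
        return basis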

There are several variations of this algorithm one can consider. For example, since

the first few leading principal components are the most critical because they represent the

greatest amount of variance, one may consider using a backtracking search for these first

few dimensions and then continue with a greedy search for the rest of the representation.

Another possibility is to use a different selection method. For example, instead of using the


k-nearest-neighbors of v_l and v_r at the i-th step, one could use the k-maximum projections on both the left and right side of axis a_i. This possibility will be investigated in chapter 4 along with the k-nearest-neighbor selection method.

The reason we restrict our search to the k-nearest-neighbors (or k-maximum projections) for each dimension is that this method avoids cases which could lead to linear dependence between the basis samples (ie. a high condition number for the corresponding Gram matrix G). This restriction can eliminate better solutions from the search space, but as one will see in chapter 4, our empirical studies indicated that such losses are not likely to be very significant.

2.5 On-line Classification

Once the origin and a reliable basis set (p_b1, p_b2, ..., p_bm) have been determined, we are ready to make use of them in the design of the classifier. To account for any errors in the representation that may result from using an imperfect set of basis samples, we must recompute the vector representation of the entire training set P w.r.t. the newly selected basis set before it is used in determining decision regions.

Given the vector representation of a set of training samples, we can apply various classical methods to find the decision regions of each class [5]. However, if the representation we are using is pseudo-Euclidean, then the inner product underlying these methods must be changed (ie. generalized) in order to accommodate a pseudo-Euclidean inner product (see Goldfarb [10] for examples of how this is done). Once the decision regions are determined, an unknown sample p* can be classified by first obtaining its metric projection u* using (2.6), then assigning the unknown to the class associated with the decision region into which the vector u* falls.


Chapter 3

Dynamic Time Warping and the

Metric Approach to Isolated

Word Recognition

3.1 Introduction

In the previous chapter we presented a general description of the metric approach to pattern

recognition. In this chapter we will show how we used this approach to solve an isolated

word recognition problem efficiently. To use the metric approach, we needed to first define

a distance function in order to obtain a metric space representation of speech. We therefore

begin the chapter with an overview of the main factors which motivated the design of

our distance function. Then we describe the details of how we implemented the distance

function. The implementation is presented in two stages. First, we describe the spectral

analysis that was used to represent the speech signals; then we describe the algorithm that

was used to measure the distance between speech signals in this representation. Both a

serial and parallel version of this algorithm will be described.

We then show the main components of the word recognition system that was used in our

implementation. Included in this discussion will be a description of the various classifiers

that were used to analyze the decision regions. In the next chapter, a comprehensive

performance analysis of this system will be provided.


3.2 The Basic Concept of Dynamic Time Warping

The distance function which we used in our implementation is commonly referred to as the Dynamic Time Warping (DTW) function [46, 42, 28]. This distance has been used routinely over the last 20 years in template-based approaches to speech recognition [37]. It has achieved a high degree of success mainly because it can compare speech signals of varying temporal length in an accurate manner.

To understand how the function works, it is useful to look at the basic nature of a speech

signal first. Speech is produced by exciting the vocal tract with a pulse of air generated

from the sub-glottal system. The resulting flow of air, which is transmitted to a listener

in an acoustic wave, causes a certain sound to be produced. By varying the shape and

diameter of the vocal system, a speaker can generate different sounds while he is speaking.

Each particular state of the vocal system can be distinguished by its frequency response

(ie. formant frequencies). One can obtain these frequency components by first sampling the

output acoustic waveform and then converting the samples to the frequency domain using

standard digital signal processing techniques [35]. Because there are physical limitations on

how fast the state of the vocal system can change, the main spectral properties of speech

generally remain fixed for 10 to 30 ms during an utterance. Therefore, a reliable digital

representation of speech can usually be obtained by using a sequence of feature vectors

which describe the changing spectral properties over a 10 to 30 ms. ("short-time") interval.

Some examples of spectral properties that are most often used in practice include mel-scale,

cepstral, LPC, and filter bank coefficients [35].

With this type of representation, there are two types of variations that can occur. First,

the individual spectral feature vectors can vary depending on the characteristics of the vocal

system during an utterance. Second, the duration of the sequence can vary as a result of

different speaking rates. Both these situations must be accounted for in a distance function.

This can be done by first determining a meaningful alignment between the input sequences.

Once an alignment is determined, the total difference can be obtained in terms of the

individual differences between corresponding feature vectors. To ensure that interdistances

within a class are as small as possible, it is important to consider alignments which minimize

the total difference as much as possible.

The DTW function finds the alignment which minimizes the total distance between two


Figure 3.1: Illustration of the Concept of Time Warping


Figure 3.2: Violations of the local continuity constraint.

sequences

A = (a_1, a_2, ..., a_N),   B = (b_1, b_2, ..., b_M)                (3.1)

by searching for a mapping (see figure 3.1)

(i(k), j(k)),   k = 1, 2, ..., K

which leads to the smallest cumulative local distance

Σ_{k=1}^{K} d(a_i(k), b_j(k)).

To avoid unnatural alignments, the search is restricted to mappings which satisfy two

strong constraints. The first is the endpoint constraint. This is used to ensure that the

beginning points and ending points of both sequences align together.

i(1) = 1, j(1) = 1   (beginning point)

i(K) = N, j(K) = M   (ending point).

The second constraint is the local continuity constraint. This is used to preserve three basic

properties in the mapping:


1. Temporal ordering. For any mapping (a_i(k), b_j(k)), we must have i(k + 1) ≥ i(k) and j(k + 1) ≥ j(k). This guards against the mapping shown in figure 3.2(a).

2. No skipped feature vectors, as shown in figure 3.2(b).

3. No excessive occurrences of one-to-many correspondences, as shown in figure 3.2(c). This avoids the unrealistic mapping of a single acoustic feature onto a long acoustic segment. In our case, we will allow at most a 2-to-1 mapping.

The solution to the sequence alignment problem which satisfies the above constraints

can be computed very efficiently using the technique of Dynamic Programming [18]. In

the following section, we will show how this was done in our implementation of the DTW

function.

3.3 Implementation Details of the Dynamic Time Warping

Metric

3.3.1 Representing Spectral Features

Spectral features of the raw input speech waveform (sampled at 12.5 KHz with 12-bit quantization), as shown in figure 3.3(a), were analyzed using the bandpass liftering technique proposed by Juang et al. [21]. In this analysis, overlapping short-time speech segments (ie. frames of speech) were obtained every 10 ms. Each segment was obtained using a 512-point Hamming window (see figure 3.3(b)). A 12-th order linear predictive analysis [27] was then applied to this segment (see figure 3.3(c)). The resulting coefficients (a_1, a_2, ..., a_12) were then converted to the cepstral domain (see figure 3.3(d)) using the following well-known recursive formula (see Rabiner [35]):

c_1 = a_1                                                          (3.2)

c_k = a_k + (1/k) Σ_{i=1}^{k-1} (k - i) c_{k-i} a_i,   k = 2, ..., 12.        (3.3)

The cepstral coefficients were then smoothed with the following window, as suggested by Juang et al. [21] (this smoothing process is known as bandpass liftering):

w(k) = 1 + 6 sin(πk / 12),   k = 1, ..., 12.

As shown by Juang et al. in figure 3.4, the bandpass liftering process smooths the


Figure 3.3: Illustration of the spectral feature analysis.


Figure 3.4: (a) A sequence of log LPC spectra. (b) The corresponding bandpass liftered LPC spectra. (Taken from [20]).


sharp spectral peaks in the LPC log spectrum without distorting the fundamental formant

structure of the speech. This helps remove noiselike statistical variability from the feature

set.
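A minimal sketch of this analysis step, assuming the 12 LPC coefficients of a single Hamming-windowed frame have already been computed, is:

    import numpy as np

    def lpc_to_liftered_cepstrum(a, Q=12, L=12):
        # a: LPC coefficients a_1, ..., a_Q of one frame (a[0] holds a_1)
        a = np.asarray(a, dtype=float)
        c = np.zeros(Q)
        c[0] = a[0]                                   # c_1 = a_1, equation (3.2)
        for k in range(2, Q + 1):                     # c_k, equation (3.3)
            acc = sum((k - i) * c[k - i - 1] * a[i - 1] for i in range(1, k))
            c[k - 1] = a[k - 1] + acc / k
        # bandpass lifter w(k) = 1 + 6 sin(pi k / 12)
        w = 1.0 + (L / 2.0) * np.sin(np.pi * np.arange(1, Q + 1) / L)
        return w * c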

3.3.2 The Dynamic Time Warping Procedure

The dynamic time warping procedure we implemented was a version of the one used in Myers et al. [28]. An example of it is shown graphically in figure 3.5. The axes of the grid represent the time scales of the two words; each time slot corresponds to a frame of liftered cepstral coefficients. In our implementation, we made use of the normalize/warp technique as suggested by Myers et al. [28]. This meant that before we time warped, each word pattern, represented as a sequence of spectral frames (c_1, c_2, ..., c_M), was first decimated (normalized) to a fixed-length linear time scale (c̄_1, c̄_2, ..., c̄_m) using the following formulae [48]:

c̄_i = (1 - s) c_n + s c_{n+1},   i = 1, ..., m

where

n = floor( (i - 1)(M - 1)/(m - 1) + 1 )

s = (i - 1)(M - 1)/(m - 1) + 1 - n.                                (3.4)

m was set to the average length over all words used in the training set (in our case m = 45).

The normalization was done to accommodate the slope constraint defined below.
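The interpolation of (3.4) can be sketched as follows; frame indices are 1-based in the formula and are converted to 0-based array indices here.

    import numpy as np

    def normalize_length(frames, m=45):
        # frames: M x 12 array of liftered cepstral frames for one word (M varies)
        # m:      fixed output length (the average word length, 45 in our case)
        frames = np.asarray(frames, dtype=float)
        M = len(frames)
        out = np.empty((m, frames.shape[1]))
        for i in range(1, m + 1):
            t = (i - 1) * (M - 1) / (m - 1) + 1       # position on the original time scale
            n = int(np.floor(t))
            s = t - n
            n_next = min(n + 1, M)                    # guard at the final frame (s = 0 there)
            out[i - 1] = (1 - s) * frames[n - 1] + s * frames[n_next - 1]
        return out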

To obtain the DTW distance measure, the local distance at point ( i,j) in the grid

was determined using a Euclidean distance between the corresponding feature vectors. The

relative value of the local distance is shown graphically by the magnitude of the shaded box.

The DTW algorithm can be viewed as a search for a warp path which starts at point (1, 1) and ends at point (n, n). Each grid point passed over in the path corresponds to a specific

mapping of two feature vectors. DTW seeks the path which has the minimum accumulated

local distance along the path from point (1, 1) to (n, n). Local continuity constraints are

maintained by making sure that no path reaches point ( i, j) unless it comes from one of the

following three points: (i - 2, j - 1), (i - 1, j - 1), (i - 1, j - 2), as shown in figure 3.6.

The resulting slope constraint on the DTW path restricts the search space to the region


Figure 3.5: Graphical illustration of the DTW process.



Figure 3.6: Local continuity constraints.

Figure 3.7: Search region defined by the local continuity constraints.


shown in figure 3.7. Because the inputs were normalized using equation (3.4), the search space was guaranteed not to be empty. This particular local constraint was originally suggested by Myers et al. [28], who found it to have near optimal performance among various alternatives. The dynamic programming procedure associated with the constraint is given by:

D(i, j) = min { D(i - 2, j - 1) + ½[d(i - 1, j) + d(i, j)],
                D(i - 1, j - 1) + d(i, j),
                D(i - 1, j - 2) + ½[d(i, j - 1) + d(i, j)] }

D(1, 1) = d(1, 1)

where d(i, j) is the Euclidean distance between frames i and j, and D(i, j) is the corresponding global distance. The nondiagonal warp paths were weighted by ½ to ensure that each path to (i, j) had the same effective number of local distances; otherwise the algorithm would have favored the diagonal paths.
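A direct sketch of this recursion (without the additional search-region restriction of figure 3.7) is:

    import numpy as np

    def dtw_distance(A, B):
        # A, B: sequences of feature frames (arrays of shape n x 12 and m x 12)
        n, m = len(A), len(B)
        d = np.array([[np.linalg.norm(A[i] - B[j]) for j in range(m)] for i in range(n)])
        D = np.full((n, m), np.inf)
        D[0, 0] = d[0, 0]
        for i in range(n):
            for j in range(m):
                if i == 0 and j == 0:
                    continue
                best = np.inf
                if i >= 1 and j >= 1:
                    best = min(best, D[i - 1, j - 1] + d[i, j])
                if i >= 2 and j >= 1:                 # nondiagonal paths weighted by 1/2
                    best = min(best, D[i - 2, j - 1] + 0.5 * (d[i - 1, j] + d[i, j]))
                if i >= 1 and j >= 2:
                    best = min(best, D[i - 1, j - 2] + 0.5 * (d[i, j - 1] + d[i, j]))
                D[i, j] = best
        return D[n - 1, m - 1]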

3.4 Systolic Array Implementation of DTW

The DTW procedure we described in the previous section has a time complexity proportional to N², where N is the length of the two input sequences. However, because many

of the local decisions in the dynamic programming algorithm can be made in parallel, it is

possible to reduce the time complexity down to O(N). This can be done using a hexagonal

cellular systolic array (see figure 3.8) as suggested by Bhavsar et al. [2]. At each step

k = 1, ... , N, the systolic array computes in parallel the local distance at each processing

node (i,j) for which i + j = k. When the N-th step is complete, the global distance will

be available in processing node (N, N). There are two types of processing nodes which are

used in figure 3.8: one is a hexagonal cellular processor (figure 3.9(a)) and the other is a square cellular processor (figure 3.9(b)). The hexagonal processor contains three table-lookup memory cells. Two of the memory cells, labelled in figure 3.9(a) as cells a_i and b_j, are used to store the input feature vectors a_i and b_j in processor node (i, j). The other cell, labelled cell D in figure 3.9(b), is used to store the local distance at processor node (i, j). At each step k, the hexagonal processor (i, j) performs three basic operations. First, it transfers the memory contents of cell a from processor (i - 1, j) into the memory cell a


Figure 3.8: A Systolic Array for DTW (taken from [2]).



Figure 3.9: Parts of the hexagonal processor used in the systolic array.

in processor (i, j). While this is being done, the memory contents of cell b are simultaneously transferred from processor (i, j - 1) over to cell b in processor (i, j) (see figure 3.9(a)). Second, the contents of cell D are read from processors (i - 1, j), (i - 1, j - 1), (i, j - 1) and the minimum value is stored in cell D of processor (i, j) (see figure 3.9(b)). Finally, the Euclidean distance between a and b in processor (i, j) is computed and added to cell D.

The other processing node shown in figure 3.8 is a square cellular processor. In this processor, there are only two memory cells. One is used to store an input feature vector; the other simply indicates an "infinitely" large local distance. The circle processor at (0,0) in figure 3.8 is simply used to store a zero local distance which is read in from processor (1,1).

This particular implementation of the DTW function does not place a slope constraint

on the warp path; however, regions of the systolic array can be eliminated from the search

path by setting the local distance cell D in undesired areas to an "infinitely" large value.

3.5 The Metric Approach to Isolated Word Recognition

The design of the isolated word recognition system involved four main stages as shown in figure 3.10. In the first stage (see figure 3.10(a)), a metric space representation of speech was obtained using a set of training samples and the DTW distance function described in section 3.3.2. Then in the second stage, the leading principal components of this DTW metric space were computed using the main embedding algorithm described in chapter 2


Figure 3.10: The 4 stages involved in the design of the isolated word recognition system: (a) metric space representation, (b) vector space representation, (c) metric projections, (d) decision region analysis.

Figure 3.11: The components of the isolated word recognition system.

(see figure 3.10(b)). Using this representation in the third stage, we determined a set of basis samples using the search algorithm described in chapter 2. Once the basis samples were located, they were incorporated into a metric projection formula and the vector representation of all training samples was then recomputed w.r.t. this projection formula (see figure 3.10(c)). Finally, in the fourth stage, the projections of the training samples were used to analyze the decision regions of each class using three different classifiers as described in the next section (see figure 3.10(d)).

Once the four stages were completed, the class label of an arbitrary speech sample

was computed as shown in figure 3.11. This figure shows the main components of the

speech recognition system. The basis samples computed in stage 3 are used to compute

a projection of the unknown input using the metric projection formula. Then the class

label of the resulting vector was obtained from the classifier that was created in stage 4.

There were three different classifiers which were considered in our implementation: (1) a

k-nearest-neighbor classifier, (2) a Gaussian classifier, and (3) a Neural Network classifier.


These classifiers are described below.

3.5.1 K-Nearest Neighbor Classifier

The k-nearest neighbor (KNN) classifier is an example of a nonparametric classification procedure [5, ch. 3]. In other words, the classifier does not use a functional description of the classes; instead, the decision regions are modeled directly with a set of stored samples. Given a set of n labelled training samples {v_1, v_2, ..., v_n}, we classify an unknown x by first locating the k nearest samples to x w.r.t. the distance measure d, ie. the k samples v_i with the smallest values of d(x, v_i). Then, we classify x with the class label which appears most frequently in this set; however, if a tie results, we simply use the class label of the nearest neighbor (ie. k = 1).

This procedure has a search and memory complexity which is linear w.r.t. the size of

the training set. Although clustering methods or the use of the triangular inequality can

be used to reduce this complexity, we did not consider these techniques in this thesis.

The KNN classifier was used on both the DTW metric space representation of speech

(ie. d was measured with the DTW metric), and the vector space representation (ie. d

was measured with the pseudo-Euclidean metric). By doing this, we were able to make a

meaningful comparison between the two representations.
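A minimal sketch of this rule, parameterized by the distance function so that it can be applied to either representation, is:

    import numpy as np
    from collections import Counter

    def knn_classify(x, samples, labels, dist, k=5):
        # dist: either the DTW metric or the pseudo-Euclidean distance
        d = np.array([dist(x, s) for s in samples])
        order = np.argsort(d)[:k]
        votes = Counter(labels[i] for i in order).most_common()
        if len(votes) > 1 and votes[0][1] == votes[1][1]:
            return labels[order[0]]                   # tie: fall back to the 1-NN label
        return votes[0][0]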

3.5.2 Gaussian Classifier

The Gaussian classifier [5, ch. 2] was also considered in our experiments. In this classifier, the decision regions are characterized by separate probability distribution models for each class. We assumed that the samples in each class ω_i, i = 1, ..., c were random vectors whose position at location x depended on the multivariate normal density function:

p(x|ω_i) = (2π)^(-m/2) |Σ_i|^(-1/2) exp( -½ (x - μ_i) Σ_i^(-1) (x - μ_i)^t )

where m is the dimension of the vector space, μ_i is the mean vector of class ω_i, Σ_i is the m-by-m covariance matrix of class ω_i, and |Σ_i| is the determinant of Σ_i.

The decision regions formed by this density function manifest themselves as hyperellipsoids.


The position and shape of each hyperellipsoid are determined by μ and Σ respectively. These two parameters were estimated for each class using the maximum likelihood method [5]:

Σ = (1/n) Σ_{i=1}^{n} (v_i - μ)^t (v_i - μ)

where {v_1, ..., v_n} are the training samples of the class and μ is their sample mean. After obtaining parameters for each class, we classified an unknown x as a member of class ω_i using a Bayesian-like decision rule:

x ∈ ω_i   if   p(x|ω_i) = max_j p(x|ω_j),   j = 1, ..., c.

The main advantage of this classifier over the KNN is that the computation time and memory requirement is independent of the size of the training set. Instead, the complexity of the classifier is a function of the dimension size m and the number of classes c. During on-line classification, most of the computation time is dominated by the Mahalanobis distance computation (x - μ) Σ^(-1) (x - μ)^t, and therefore the time complexity of the classifier is O(c·m²). As for memory, one only needs to store the parameters of the distribution model for each class; thus, the memory complexity is also O(c·m²).
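A sketch of this classifier, with maximum-likelihood parameter estimates and the maximum-likelihood decision rule (ie. equal class priors), is:

    import numpy as np

    def fit_gaussian_classifier(V, labels):
        # V: n x m matrix of training vectors; labels: NumPy array of class labels
        params = {}
        for w in np.unique(labels):
            X = V[labels == w]
            mu = X.mean(axis=0)
            diff = X - mu
            sigma = diff.T @ diff / len(X)
            params[w] = (mu, np.linalg.inv(sigma), np.linalg.slogdet(sigma)[1])
        return params

    def gaussian_classify(x, params):
        best, best_ll = None, -np.inf
        for w, (mu, sigma_inv, logdet) in params.items():
            diff = x - mu
            ll = -0.5 * (diff @ sigma_inv @ diff + logdet)   # log p(x|w) up to a constant
            if ll > best_ll:
                best, best_ll = w, ll
        return best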

3.5.3 Neural Network Classifier

The final classifier that was used in our tests was the Multi-Layered Perceptron (MLP) [31]. This classifier is more flexible than the Gaussian classifier because it is capable of forming arbitrary decision regions, as was shown by Huang and Lippmann [19]. The MLP that we used can be viewed as a generalization of a linear machine [5]. A linear machine uses c linear discriminant functions of the form

g_i(x) = w_i · x + θ_i,   i = 1, ..., c

and the following classification rule

if g_i(x) ≥ g_j(x) for all j ≠ i, then x ∈ ω_i

to partition the vector space into c piecewise linear decision regions R_i, i = 1, ..., c. In this case, all of the vector points x for which g_i(x) is maximum lie in region R_i. The


boundary between two adjacent regions R_i, R_j is a portion of the hyperplane defined by g_i(x) - g_j(x) = 0.

As with the Gaussian classifier, the parameters of the linear machine are estimated from the training set. In this case we need to determine the weight vector w_i and threshold value θ_i. These values can be found by minimizing a cost function of the form

E = Σ_{i=1}^{c} Σ_{v ∈ ω_i} || g(v) - I[i] ||²

where g(v) is the vector [g_1(v), g_2(v), ..., g_c(v)] and I[i] is an index vector which equals the i-th row of a c-by-c identity matrix. A solution to this minimization problem can be obtained using various well-known gradient descent procedures [5]. Among these, there are two types of procedures. One converges when a solution exists (ie. perceptron procedures) but oscillates otherwise, and the other always finds a solution, but not necessarily a correct one, even when the classes are linearly separable (ie. Least Mean Squares solutions using the Widrow-Hoff rule or the pseudoinverse function) [5].

The main limitation of a linear machine is that it only works well when the decision regions are singly-connected convex regions. In such cases, the Gaussian classifier is likely to be sufficient. The MLP extends the capability of a linear machine by using hidden layers of nonlinear discriminants between the input vector x and the output layer g(x). Much like the output layer of a linear machine, the i-th hidden layer of the network consists of a vector of individual linear discriminant functions {g_1^i, g_2^i, ..., g_{n_i}^i}. However, to make the discriminant more powerful, the outputs of these functions are passed through a nonlinear differentiable function which is of the following sigmoid form:

f(x) = 1 / (1 + e^(-x)).

The input vector to these nonlinear discriminants is the output vector at the previous layer (i - 1), or simply the input sample x if i = 1. One can show that if a multilayered linear discriminant does not use a nonlinear function between the layers, it can always be collapsed to a single-layer linear machine (see [20] for example). Hinton [17] has shown that the use of so-called "bottle-neck" hidden layers can help to transform the input representation into a vector representation which is linearly separable. We will show in the next chapter that the learning complexity of such transformations is related to the degree of class separation


Figure 3.12: Neural Network architecture used in the isolated word recognition system (an input layer of metric projections, two hidden layers, and an output layer).

of the input vectors. It is therefore possible to reduce the complexity of learning in these

networks by preprocessing the inputs using the metric approach with a distance function

which improves the separation of the inputs. Also, by preprocessing data this way, it is

possible to use more flexible input representations in the classification system.

The particular architecture of the MLP that was used in our implementation is shown

in figure 3.12. We selected this architecture to get a meaningful comparison to the results

obtained by Lippmann and Gold [32]. In that particular study, Lippmann and Gold found

that this network gave the optimal performance among various alternatives when tested on

the data set that was also used in our study. (Details of the data set will be described in

the next chapter).

The network was trained using a standard back-propagation learning algorithm [41].

The specific details of this learning algorithm that we used in our implementation were as

follows. We used the acceleration method described by Hinton [17] to update the weights at each iteration t as shown below:

Δw_ij(t) = -ε ∂E/∂w_ij + α Δw_ij(t - 1)

Initially, we set ε = 0.03 and α = 0.5 for t = 1, ..., 100 (ie. the first 100 iterations). Then, to facilitate faster learning, these parameters were changed to ε = 0.05 and α = 0.9 for the rest of the training period. Convergence of the net occurred when the error dropped below a tolerance of 0.1. The weights were updated using a "batch" mode of gradient descent by accumulating ∂E/∂w_ij over all the input-output cases first, and then adjusting w_ij by an amount proportional to this sum. This method of gradient descent had the advantage of

not being sensitive to the order in which the weights were updated. Initially, all the weights were set randomly in the range -1.0 to +1.0. To avoid driving the weights to very large values during training, outputs of 0.8 and 0.2 were used instead of 1 and 0 respectively. After convergence, we ran the learning procedure an extra 500 iterations using a weight-decay mode. In this mode, the magnitude of each weight was reduced by 0.53 after each weight was updated. Consequently, only those weights which helped reduce the total error were able to stay active. This process facilitated generalization by forcing the network to learn the regularities which defined a class instead of overfitting the sampling error of the training data (which was a potential problem given the large number of weights in the network).
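A sketch of the weight-update step with this schedule is shown below; the accumulated gradient (computed by standard back-propagation) and the weight-decay factor are assumed to be supplied by the caller.

    def update_weights(W, dW_prev, grad, t, decay=0.0):
        # grad: dE/dW accumulated over all input-output cases of the batch
        # t:    iteration number; epsilon and alpha change after iteration 100
        eps, alpha = (0.03, 0.5) if t <= 100 else (0.05, 0.9)
        dW = -eps * grad + alpha * dW_prev            # acceleration (momentum) update
        W_new = (W + dW) * (1.0 - decay)              # optional weight-decay shrink
        return W_new, dW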


Chapter 4

Evaluation of the Metric

Approach to Word Recognition

4.1 Introduction

Several experiments were done to evaluate the isolated word recognition system described

in the previous chapter. In this chapter we will present the results of this evaluation. Our

evaluation will be organized in three sections. First, the principal components of the DTW

metric space will be analyzed at varying dimension sizes. This analysis will be used to

determine the intrinsic dimension of the metric space. In the second section, the projection

formula proposed in chapter 2 will be tested. The results of this test will show us how well

the projections preserve the original vector representation. Finally, having constructed a

projection formula, the results of our word recognition system will be presented for each of

the three classifiers described in the previous chapter.

4.2 Description of the Speech Database

The speech data that was used in all of our experiments was obtained from the Texas Instruments 20 word database [24]. These samples were digitized at a rate of 12.5 KHz using a 12-bit A/D converter. In our evaluation, we used 8 monosyllabic digits (digits "one" through "nine" excluding "seven") as our vocabulary. Our training set consisted of 320 samples in total, obtained from 8 speakers (4 males and 4 females) using 5 samples of each word per speaker. For testing purposes, we used a set of 1024 speaker dependent (SD)


samples and a set of 1024 speaker independent (SI) samples. In both cases, the samples were obtained from 8 speakers (4 males and 4 females) using 16 samples of each word per speaker (in the case of the SD set, the speakers were the same as in the training set, but the samples were different). All of the training samples came from the same speaking session; however, the test samples were generated over 8 different speaking sessions (2 samples per session) which were separated in time by at least one day.

4.3 Vector Representation of the DTW Metric Space

Using the above training samples and the DTW distance function described in section 3.3.2, we pre-computed the 320 × 320 interdistance matrix representing the corresponding DTW metric space.¹ We then used the main embedding algorithm (in chapter 2) to obtain a

vector representation w.r.t. the leading principal axes. The eigenvalues associated with

each principal axis are listed in Table 4.1 in descending order (according to magnitudes).

Each magnitude reflects the relative amount of perturbation that would result to the finite

metric space representation if the corresponding axis were removed from the isometric vector

representation. One can see in figure 4.1 that the magnitudes quickly decay toward zero

in an exponential-like manner. The rapid descent towards zero indicates that a relatively

small perturbation should result in the vector representation if all but the first few leading

principal axes are removed. Supporting evidence for this is available in figure 4.2. This plot

shows the vector representation along the first two leading principal axes; note that these

two principal axes span a Euclidean representation because the corresponding eigenvalues

are both positive (see table 4.1). One can see that despite the low dimension, enough DTW

interdistance information is preserved to roughly distinguish most of the classes. Also,

consistent with our intuition, we see that the interdistances between words that sound

similar like "one" and "nine" are much smaller than the interdistances between words that

sound quite different like "six" and "four".

A more precise view of the perturbations can be determined by examining different

statistics of the vector representation at successively higher dimensions. This was done

from dimension 1 up to 320 by adding back more principal components to the representation

¹The average distance computation required approximately 45 msec cpu time using a Silicon Graphics 4D/240 Superworkstation (or approximately 38 minutes for all of the 320·319/2 entries in the symmetric distance matrix).


Table 4.1: The eigenvalues corresponding to the principal axes of the DTW metric space.


Figure 4.1: A plot of the magnitudes of the eigenvalues corresponding to the principal components of the DTW metric space (eigenvalue magnitude ×10⁶ versus dimension).


Figure 4.2: The vector representation of the training samples w.r.t. the 2 leading principal axes.


starting with the first principal axis. For each k = 1, ..., 320 dimensional representation, we measured the quality of the representation w.r.t. two different criteria. The first one was the sum of the square error E(V) between the interdistances in the metric space and the corresponding interdistances in the k dimensional vector space. This quantity, which was computed as described in section 2.3.1 of chapter 2, allowed us to see the degree of error present in the vector representation at different dimension sizes.

The second criterion measured was the ratio of the average between-class distance to the average within-class distance. This measurement, denoted by S(V), was computed in the following manner. First, we computed the average between-class distance B(V_x, V_y) for each pair of classes, where V_x = {v_1^x, v_2^x, ..., v_40^x} denotes the set of vectors in class x, x = 1, ..., 8. Next, we measured the average within-class distance W(V_x) of each class. Finally, we combined these two functions together and computed the average scatter ratio S(V) of the entire vector representation V = ∪_{x=1}^{8} V_x. We used this measurement to see how the class separation changed as the dimensions increased.
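A between-to-within scatter ratio of this kind can be computed as sketched below. The exact normalization used for S(V) is an assumption here (mean pairwise between-class distance over mean pairwise within-class distance), and a Euclidean norm is assumed, which is appropriate for the Euclidean leading axes used in this chapter.

    import numpy as np

    def scatter_ratio(V, labels):
        # V: n x k matrix of vectors; labels: class label of each row
        within, between = [], []
        n = len(V)
        for i in range(n):
            for j in range(i + 1, n):
                d = np.linalg.norm(V[i] - V[j])
                (within if labels[i] == labels[j] else between).append(d)
        return np.mean(between) / np.mean(within)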

The graph in figure 4.3 shows a plot of the values of E(V) for the increasing dimension

sizes 1 through 320. One can see that E(V) decays sharply towards zero in an exponential

manner as the dimension size increases. This behavior is consistent with the way the

magnitudes of the eigenvalues decline in figure 4.1. Because the eigenvalues reflect the

amount of metric information (ie. variance) that is added back into the representation, we

see that the error in the vector representation approaches zero at approximately the same

point where the eigenvalues become insignificant. Furthermore, the plot clearly shows t hat

essentially all of the distance information is contained intrinsically in the first few leading


Figure 4.3: A plot of the vector representation error E(V) at varying dimension sizes.



Figure 4.4: A plot of the average between-to-within class scatter ratio at varying dimension sizes.

principal axes; at approximately twice the number of classes (that is, 16 dimensions), we

essentially have an isometric representation of the DTW metric space. This supports the

hypothesis made by Goldfarb [11], that when classes are reasonably separated (as is apparent

in figure 4.2), the intrinsic dimension of our DTW metric space is closely related (by a small

constant) to the number of classes (in our case 8), and not the size of the training set.

The diminishing effects of the higher dimensions can also be seen in terms of the scatter

ratio. In figure 4.4 we show how the ratio S(V) changes as more principal components

are added to the vector representation. In this plot we again see a sharp exponential

drop which is similar to figure 4.1 and figure 4.3. At first this plot seems to indicate that

the class separation weakens as more dimensions are added to the representation since we



Figure 4.5: A plot of the average distance between the classes at varying dimension sizes.

would expect to see the between-to-within class scatter ratio rise as more of the original metric information is brought back into the representation. However, if we examine the numerator of the ratio, which measures the average between-class distance B(V_x, V_y), we see (in figure 4.5) that the distances between the classes do increase (in a manner inversely related to the scatter ratio). Thus, the decrease in the scatter ratio simply occurs because the average distance within a class (the denominator of S(V)) rises faster than the average distance between the classes. The key observation one should make from figure 4.4 and figure 4.5 is that after the first few principal axes, the class separation levels off and essentially equals the separation in the DTW metric space. This means we are not likely to gain much in the recognition performance if we analyze the decision regions in dimensions


beyond this point. Note that the point where the class separation levels off is approximately

the dimension where both the eigenvalues and the error measure E(V) approach zero. This

means that basically any one of these measurements could be used to analyze the intrinsic

dimension of the metric space.

4.4 Metric Projection Analysis

In this section we present the results of our analysis of the metric projection algorithm

described in chapter 2. There were two specific aspects of the projection algorithm we were

interested in analyzing. First, we wanted to see how accurately the metric projections of

the training samples represented the original principal components as more basis samples

were added into the metric projection formula. Second, we wanted to see what differences

existed between the two versions of the greedy search procedures proposed in chapter 2.

To answer these questions, we proceeded in the following manner. First, we set the dimension size of the search space to 16 (ie. we only used the leading 16 principal components of the training samples while searching for the basis samples). This dimension size was selected on the basis of the results obtained from the previous section and because the leading 16 dimensions were Euclidean (see table 4.1), which simplified our analysis of the decision regions in the next section. The 16 basis samples {b_1, b_2, ..., b_16} were found from the search space using a greedy search algorithm. Given i - 1 basis samples, basis sample b_i was selected using two separate methods, called the k-MAX and the k-NN projection methods, as outlined in chapter 2. In the k-NN method, we computed the maximum projection on both ends of the i-th leading principal axis. Then, the k-nearest neighbors {x_1, x_2, ..., x_2k} of the two "end" vector points were located and each of these candidate basis samples was tested as follows. Let {u_1^j, u_2^j, ..., u_320^j} be the metric projections of the training samples w.r.t. the basis set {b_1, b_2, ..., b_(i-1), x_j} and let {v_1^i, v_2^i, ..., v_320^i} be the original vector representation of the training set w.r.t. the i leading principal axes. We tested x_j for each j = 1, ..., 2k by measuring the pseudo-Euclidean distance d between u_l^j and v_l^i for l = 1, ..., 320; the test criterion e(x_j) was obtained by accumulating these distances over all 320 training samples.


Figure 4.6: The value of the test criterion e for each dimension (for the k-MAX and k-NN selection methods).

This measurement was used, rather than E(V), because it reflected the deviation from the original representation much more accurately. Using this test criterion, we selected the basis sample b_i in the following manner:

b_i = arg min e(x_j),   j = 1, ..., 2k.

For the k-MAX method, we used the same general search strategy to select the basis

sample b_i; however, instead of selecting among the k-nearest neighbors as described above,

we used the k-maximum projections onto the i-th principal axis. In both cases, we set k

equal to 10.

Figure 4.6 shows how the value of e(b_i), i = 1, ..., 16, changed as the number of basis

samples increased for both selection methods. In this figure, several important characteris­

tics of the projection algorithm can be observed. If we compare the two search methods, we

can see that both methods start out with the same error. This occurred because the same

sample was selected by both methods at the first dimension. However, after this point the

k-MAX method consistently found samples which resulted in a lower error (except at the

third dimension where e was the same). This difference indicates that at certain points in

the greedy search, some of the samples yielding the maximum projections were not among


the k-nearest neighbors of the endpoints. (If they had been, they would have been selected

and no differences in the two approaches would have been evident.) In such cases, we see

that the k-MAX approach leads to more reliable projections than the k-NN approach. The likely reason for this is that the vectors in the k-MAX set have a greater chance of being closer (in terms of distance) to the true subspace spanned by the leading (k = 1, ..., n) principal axes than the k-NN set. That is because the k-MAX set has the greatest influence

on the direction of the principal axis (being considered) since these samples make the largest

contributions to the variance on the axis. The diagonalization process tries to maximize

this variance; thus, it is best for it to position the axis as close as possible to these samples.

If one looks closely at the derivation of the projection formula (see 2.5), one can see why

closeness to the true subspace is important. When the basis samples do not lie precisely

on the true subspace (that spanned by the eigenvectors), they span a space which has a

certain (affine) angle to the true subspace. The greater the distance that any basis sample

has to the true subspace, the greater the angle becomes and this in turn translates to higher

projection errors.

Another interesting property of our projection algorithm which is evident in figure 4.6 is

the way the error changes as more basis samples are added to the projection formula. One

can see that the error tends to get smaller and level off as the dimensions increase, and at

some dimensions the error even drops. For instance, going from dimension 7 to dimension

8 using the k-MAX search method, we can see that the error drops from approximately 125

to 115. Clearly, for this to happen we must have gained some improvements in the vector

representation at the lower dimensions. An extreme example showing the improvements in

the lower dimensions is illustrated in figure 4.7 and figure 4.8. In these two figures, which

relate to the k-MAX and k-NN search methods respectively, the projections along the two

leading principal axes are shown when two basis samples (the upper graph) and 16 basis

samples (the lower graph) are used in the projection formula. If we compare these figures

with figure 4.2, we not only see the extent of the improvements, we also see that the projection with 16 basis samples is a very good approximation of the original representation (in 2-d).

Another way to see the improvements that added basis samples can have on the projec­

tions is by examining the between-class separation B and the representation error E. This

is shown in figure 4.10 and figure 4.9 respectively for both projection methods used and also

the original vector representation. First of all, we can see that in both projection methods,


Figure 4.7: The k-MAX projections along the two leading principal axes when (a) 2 basis samples and (b) 16 basis samples are used in the projection formula.


Figure 4.8: The k-NN projections along the two leading principal axes when (a) 2 basis samples and (b) 16 basis samples are used in the projection formula.


[Plot: projection error E(V) versus dimension, including curves for the original DTW representation, the k-MAX projections, and the k-NN projections.]

Figure 4.9: The sum of squared errors of the projection for the leading 16 dimensions.

these two criteria incrementally approach the original vector representation as

the number of dimensions increases. However, consistent with figure 4.6, we see that the

k-MAX method is able to get much closer to the true representation. One can also see that

the k-MAX method essentially converges to the ideal representation once the dimension

approaches the intrinsic dimension of the metric space. Thus, even though the projection error e is nonzero, our projection algorithm is still able to generate a representation which

preserves most of the original interdistance matrix. The metric projection error of around

115 (see figure 4.6) occurs most likely because the entire representation is slightly "tilted"

from the subspace spanned by the eigenvectors.

4.5 Analysis of the Recognition Performance

4.5.1 The Metric Approach vs. the DTW KNN Classifier

In the final section of this chapter we present the results of our recognition tests. The

tests were set up using three different vector classifiers as described in the previous chapter.

All projections were calculated using the basis samples selected from the k-MAX search

method because that method gave the best results as was shown in the previous section.


[Plot: projection class separation M(V) versus dimension, including curves for the k-MAX and k-NN projections and the original vector representation.]

Figure 4.10: The average between-class distance of the projection for the leading 16 dimensions.

To evaluate the recognition results at different dimensions, we incremented the number of

basis samples in the projection formula from 2 up to 16 (in increments of 2) using the

strategy in the previous section. In the case of the MLP classifier, all of the learning

parameters described in the previous chapter were held fixed at each dimension considered

during training. Consequently, we could not get the network to converge for dimension sizes

lower than 6.

Our recognition tests were done using both the speaker dependent and speaker inde­

pendent test samples described at the beginning of this chapter. (Note that neither of these sample sets was used in the training process.) A 2-d plot of these two sample sets w.r.t. the leading two basis samples is shown in figure 4.11(a) and figure 4.11(b) respectively. These

projections were obtained using all 16 basis samples in the projection formula. The results

of our tests are shown in figure 4.12 and figure 4.13.

Comparing the three classifiers, we see that the MLP achieved the best recognition re­

sults overall; however, at 16 dimensions, the results of our KNN classifier were comparable

to the MLP. Both the MLP and the vector KNN classifiers outperformed the Gaussian

classifier, indicating that the decision regions were somewhat complex and/or most likely

contained outliers. The main result in figures 4.12 and 4.13 however, is that at 16 dimen-


Figure 4.11: 2-d Projections of the (a) speaker dependent test samples and the (b) speaker independent test samples.


[Plot: speaker dependent recognition accuracy (%) versus dimension, including curves for the vector KNN, Gaussian, and MLP classifiers and the DTW KNN baseline.]

Figure 4.12: The recognition scores using the speaker dependent test samples.


[Plot: speaker independent recognition accuracy (%) versus dimension, including curves for the MLP, KNN, and Gaussian classifiers and the DTW KNN baseline.]

Figure 4.13: The recognition scores using the speaker independent test samples.


sions (the intrinsic dimension of our DTW metric space), our system was able to achieve

classification results comparable to the brute-force k-nearest-neighbor classifier in the DTW

metric space. What is clearly impressive about our approach is that we achieved these re­

sults using nearly 95% fewer distance computations (using 16 basis samples and one origin

sample instead of 320 samples). Additional processing time was required to make a clas­

sification decision; however, in the case of the MLP most of the operations could be done

in parallel and thus in a real implementation this would represent a very small portion of

the total computation time (i.e. the activity at each layer can be computed in parallel and this only needs to be done iteratively for 3 layers). The projections can also be determined

in a parallel fashion because each of the 17 distances can be computed simultaneously and

furthermore, if the DTW procedure were implemented with systolic arrays (as described in

chapter 3), then most of the operations in the DTW functions could also be computed in

parallel.

Unlike the approach taken by Vidal et al. [45], our reduction in the number of distance computations was achieved without incurring any significant increase in the memory requirement. For the 16-128-16-8 MLP, a total of 2(16 × 128) + 128 = 4,224 weights and 17 samples × 45 frames × 12 cepstral coefficients = 9,180 spectral features (numbers) needed to be stored in the classifier. But in the DTW KNN classifier, a total of 320 samples × 540 cepstral coefficients = 172,800 spectral features was needed. (In Vidal et al.'s approach, an additional 320 × 319/2 ≈ 51,000 numbers would also be needed in the classifier to store the distance matrix.) Thus, we also achieved significant savings in memory (of nearly 90% when compared to the DTW KNN classifier).
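Spelling out the arithmetic behind these two reductions (the weight total is taken from the formula quoted above):

    17 / 320 ≈ 5.3% of the brute-force DTW distance computations per test utterance (a reduction of about 95%);
    (4,224 + 9,180) / 172,800 ≈ 7.8% of the numbers stored by the DTW KNN classifier (a reduction of roughly 90%).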

It is also useful to compare the results of our metric KNN classifier with those of the

DTW KNN classifier. By comparing the recognition scores of these two classifiers at varying numbers of dimensions, we can get a good idea of how much meaningful information is

actually lost in the (projected) vector representation at the lower dimensions. As one

can see in both figure 4.12 and figure 4.13, as the number of dimensions increases the

recognition scores quickly begin to approach that of the DTW KNN classifier. The rate at

which this convergence occurs is consistent (inversely) with the representation errors shown

in figure 4.9, and, to some extent, the projection class separation shown in figure 4.10.

This close relationship means that one can approximate what percentage of the DTW KNN

classification results one is likely to achieve at lower dimensions, by scaling with either

64

Page 69: A Metric Approach to Isolated Word Recognition · Abstract Some of the best known approaches to Isolated Word Recognition (IWR) are based on the Dynamic Time Warping (DTW) distance

.;~'

the separation measure or the representation error measure. It is not surprising to see at

16 dimensions the KNN vector classifier achieving results comparable to the DTW KNN

classifier because, as shown in figure 4.9, there is essentially no error in the interdistance

matrix at this dimension.

4.6 The Metric MLP vs. Lippmann and Gold's MLP

One can gain further insight into the advantages of the metric approach by comparing the

results of our MLP with results obtained by Lippmann and Gold [32]. This is an interesting

comparison because in both cases the same network architecture, learning procedure, speech

database, vocabulary, and spectral analysis were used. In fact, the only difference between

the two approaches was the way the inputs were determined. In the Lippmann and Gold

case, no time alignment analysis was done. Instead, the frame of cepstral coefficients which

had the largest energy value was used directly along with the neighboring frame. Since

there were 11 cepstral coefficients in each frame, a total of 22 values were used as inputs

to the MLP. In figure 4.13 the results of this approach can be compared to ours for all 3

classifiers. One can clearly see that the vectors generated with the projection formula led

to significantly better performance on all of the classifiers (in a much lower dimension).

This is not surprising since the Euclidean vectors in our approach had DTW "features"

inherently incorporated into the representation.

The benefits of using DTW projections can also be seen in terms of the learning time

of the MLP. In figure 4.14, we show how the learning time changed as more basis samples

were added to the projections. We have also included the learning time that was required

by Lippmann and Gold. One can see, first of all, that the learning time dropped as more

basis samples were added to the projections. The gains in the learning speed are likely due

to the improvements in the class separation (see figure 4.10). We can also see that even at 6

dimensions, our learning time was significantly faster than Lippmann and Gold's time, using

about one quarter as many inputs. This again reflects the benefits of incorporating DTW

information into the representation. Although not surprising, our results on the learning

time are important because they clearly show that one way to reduce the complexity of

learning in an MLP is to preprocess the data using a metric function in a manner which

separates the classes as much as possible. This is particularly important for input patterns


[Plot: MLP learning time in epochs versus dimension, with curves for the DTW representation and the L&G representation.]

Figure 4.14: The learning time required by the MLP neural net at different dimensions.


as in speech, which contain a lot of discriminating information in the structural organization

of the primitives. Because it is difficult to directly model structural relationships in the

inputs of an MLP, learning can be inherently hard. As we have demonstrated, a key strength

of the metric approach is that it is capable of incorporating relevant structural features into

the representation in a highly efficient manner.


Chapter 5

Discussion and Conclusions

In this thesis, we have demonstrated that the metric approach to pattern recognition can

facilitate a highly accurate and efficient implementation of an isolated word recognition

system. In our evaluation of this system, we showed that results comparable to the brute­

force KNN classifier in a DTW metric space could be achieved using substantially fewer

distance computations. Furthermore, unlike other fast approaches based on the DTW

metric, we required only a small set of training samples to be stored in the classifier. Testing

the system on monosyllabic digits from the TI 20 word database, we observed a reduction

of almost 95% in the total number of distance computations and a total storage reduction of almost 90% when compared to the DTW KNN classifier.

Efficiency in this approach was achieved by mapping the speech samples from a DTW

metric space into a low dimensional pseudo-Euclidean vector space in a manner which pre­

served the pairwise distances. By representing speech this way, it was possible to store only

the decision regions of each class in the classifier using parametric classification methods.

Among the parametric methods we tested, we found the MLP gave the best recognition

performance. We also saw that incorporating DTW features into the vector representation

led to improvements in the MLP's learning time. The MLP is an attractive classifier be-

cause it can deal with outliers efficiently, and because most of the operations in the classifier

can be done in parallel. Further parallelism could be incorporated into our system if we

used systolic arrays to compute the DTW distances in the projection formula.

The vector representation was obtained in two stages. The first stage involved a trans­

formation of a finite DTW metric space onto the principal axes of a pseudo-Euclidean vector


space. This was done using an embedding algorithm which essentially required the orthog­

onal diagonalization of a symmetric matrix. Obtaining a vector representation with this

approach has several important advantages over the multidimensional scaling approach.

Firstly, it is possible to get an isometric representation, and this representation can be

constructed in a dimension size which is guaranteed to be minimal. Secondly, because the

main computational task involves the diagonalization of a symmetric matrix, there are no

convergence problems with the embedding algorithm. In fact, diagonalization algorithms

for this problem are some of the most stable numerical methods available. Finally, the

embedding process can incrementally generate the vector representation w.r.t. the leading principal axes. Thus, the accuracy vs. dimensionality trade-off can be analyzed iteratively

for each dimension during the construction of the representation.
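As an illustration of this first stage, the sketch below shows one standard way such an embedding can be computed from a distance matrix. It follows the construction implied by the description above (distances converted to inner products relative to an origin sample, followed by orthogonal diagonalization of the resulting symmetric matrix) and is not a transcription of the algorithm in chapter 2; in particular, the choice of origin sample and the ordering of axes by |eigenvalue| are assumptions.

import numpy as np

def pseudo_euclidean_embedding(D, n_dims, origin=0):
    """Embed a finite metric space given by a symmetric (n x n) distance matrix D.

    S_ij = (d(x_i,o)^2 + d(x_j,o)^2 - d(x_i,x_j)^2) / 2 plays the role of an
    inner-product matrix relative to the origin sample o.  Diagonalizing S and
    scaling the leading eigenvectors gives the coordinates; negative eigenvalues
    are kept, and their axes carry a negative signature, which is what makes the
    space pseudo-Euclidean rather than Euclidean.
    """
    d0 = D[origin]
    S = 0.5 * (d0[:, None] ** 2 + d0[None, :] ** 2 - D ** 2)
    evals, evecs = np.linalg.eigh(S)              # orthogonal diagonalization
    order = np.argsort(-np.abs(evals))[:n_dims]   # leading principal axes
    lam, U = evals[order], evecs[:, order]
    X = U * np.sqrt(np.abs(lam))                  # sample coordinates (n x n_dims)
    signature = np.sign(lam)                      # +1 / -1 per axis
    return X, signature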

In our evaluation, we studied this trade-off w.r.t. the representation error and the class

separation. We found that as the dimension size increased, the representation error decayed

sharply towards zero, and the class separation quickly approached the separation in the metric space. Both these progressions tended to level off at the same point in an exponential-

like fashion, and the changes from one dimension to the next tended to be consistent with

the changes in the magnitudes of the eigenvalues. The sharp leveling off effect of the

representation error, class separation and eigenvalue magnitudes led us to conclude that the intrinsic dimension of the metric space was closely related (by a small constant) to the number of classes rather than the number of training samples. This relationship is most

likely influenced by how tightly clustered the classes are. In the most favorable situation

where the distance function returns zero for samples in the same class, we would have an

upper bound of m on the dimension size for an m-class problem. For the DTW metric space

considered in this thesis, we found the intrinsic dimension to be essentially twice the number

of classes. It would be interesting to see if this intrinsic dimension could be reduced further

by employing a weighting scheme in the DTW metric which further reduced the cluster sizes

of the classes.

An interesting study worth pursuing along these lines is to see how changes in the scatter

ratio of the metric space affect the intrinsic dimensionality. One way this study could be

done is by optimizing a weighting scheme in the DTW metric w.r.t. the scatter ratio. Then,

one could study the vector representation of the metric space as the weights incrementally

approached the optimal scatter ratio. The weighting scheme which produced the optimal


scatter ratio would likely highlight interesting features which define each class. This idea

of using parametric distance functions in the metric approach has recently been proposed

formally by Goldfarb [11] in the context of machine learning. According to this theory of

learning, a compact description of a class would be obtained by interpreting the weights at

the convergence point. This description could then be used to redesign the metric function

more efficiently, making it possible to achieve further improvements in the design of the

classifier.

In the second stage of the embedding process, the representation generated in the first

stage was used to select basis samples for the vector representation. Modeling the represen-

tation w.r.t. basis samples made it possible to get the vector representation of an arbitrary

metric sample efficiently. For an n-dimensional representation, only n + 1 distance computa-

tions need to be performed: one to the origin sample and one to each basis sample. This was

done using a standard projection formula which involved the solution of a simple system

of linear equations. However, instead of using inner products in the formula, we used the

DTW distances to the basis samples. This worked mathematically because a key theorem

by Goldfarb assures us that there exists a well-defined relationship (mapping) between the

(DTW) distances in a metric space and the inner products in the pseudo-Euclidean vector

space.
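A minimal sketch of this second-stage projection, under the same assumptions as the embedding sketch above, is given below. It takes the basis-sample coordinates and axis signature from the embedding stage, converts the n + 1 DTW distances of a new utterance into pseudo-Euclidean inner products, and solves the resulting linear system; the variable names are illustrative, and the DTW distances themselves would come from the procedure of chapter 3.

import numpy as np

def project_new_sample(d_basis, d_origin, d_basis_origin, B, signature):
    """Vector representation of one new utterance from n + 1 DTW distances.

    d_basis        : (n,) DTW distances from the new sample to the n basis samples
    d_origin       : DTW distance from the new sample to the origin sample
    d_basis_origin : (n,) DTW distances from each basis sample to the origin
    B              : (n, n) coordinates of the basis samples (from the embedding)
    signature      : (n,) array of +1/-1, the pseudo-Euclidean signature of the axes
    """
    # <b_k, x> = (d(b_k,o)^2 + d(x,o)^2 - d(b_k,x)^2) / 2
    p = 0.5 * (d_basis_origin ** 2 + d_origin ** 2 - d_basis ** 2)
    # Solve (B J) y = p, where J = diag(signature), for the coordinates y.
    return np.linalg.solve(B * signature, p)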

We proposed a greedy search algorithm to select the basis samples. Our analysis of this

algorithm led to some important observations. First, we saw that the selection method

used to pick the basis samples was important. By using the statistical properties of the

principal axis representation intelligently, we found more accurate projections could be

achieved. Secondly, we observed improvements in the accuracy of the projections as more

basis samples were added to the projection formula. Finally, the projections tended to

converge towards the original representation as we approached the intrinsic dimension of

the metric space.

A relatively small number of classes were used in our study. Thus it remains to be seen

how well the metric approach will scale on larger problems. Even if there is a close rela­

tionship between the number of classes and the intrinsic dimensionality of a metric space, our

current approach is likely to become unmanageable for problems involving large vocabulary

sizes. But instead of solving a large problem in one stage, we could use the metric

approach in several stages, partitioning the problem into manageable subclasses at each


stage as was suggested by Goldfarb and Verma [12]. In this approach, a hierarchical deci­

sion tree would be used to organize the speech data efficiently. Starting at the root node,

different samples from a large vocabulary would be selected and vector representations of

these samples would be obtained as was done in our implementation. Then a coarse cluster

analysis would be done to determine groups of similar sounding words. To keep the dimen­

sion size low, the data would be partitioned into a small set of clusters which would most

likely reflect very high-level features. The decision regions of these clusters could then be

determined and stored using an MLP or some other parametric classification method. Each

cluster would correspond to a node in the decision tree. At the next level, a similar analysis

would be done separately on each partition (i.e. cluster); however, the cluster analysis would

be tuned to look for much finer features. This would be done hierarchically, until a node

in the tree contained a manageable set of primitive classes (e.g. 10 classes). At

this point, we would have a problem similar to the one considered in this thesis. Using this

approach the entire search complexity could be reduced by a log factor. Note also that the

storage requirements are reasonable because at each node we only have to store a small set

of basis samples and the decision regions (using a MLP for example). In this implemen­

tation, the path to a classification decision involves a traversal though increasingly smaller

groups of similar sounding words until one finally focuses upon an understandable word at

a leaf node.

Note that the hierarchical decision tree described above could be used to organize and

search any database containing complex patterns, not just speech utterances. One particular

application where this approach may be well suited is in the area of DNA and RNA sequence

analysis. By their very nature, different genetic sequences are organized hierarchically into

subclasses in a manner reflecting the evolutionary distances between the classes. These

distances are currently used in very inefficient ways to solve classification problems (see

[43]). The metric approach offers great potential to reduce not only the search complexity,

but also the massive memory requirements in such applications.

A final issue which was not explicitly considered in this thesis is whether one could

achieve comparable results to our system using other forms of vector representations based

on the DTW function. In the pseudo-Euclidean representation, one aims to preserve the

original DTW distances in the vector representation; however, one could also use the DTW

distance information to construct another type of vector representation which would not


necessarily preserve interdistances. One interesting possibility suggested by Hinton [16] is a

Gaussian Radial Basis Function [40, 29]. In this representation, a vector would be formed by

first computing the squared DTW distances to a set of reference samples, and then applying

a Gaussian function to these distances. This representation has the useful property of pulling

neighboring samples to each reference sample away from the rest of the sample points.

Thus, provided that one selects at least one reference sample from each class, it is possible

to create a vector representation in a Euclidean space which separates the classes, but does

not preserve the interdistances. However, if one selects reference samples which lie too close

to the boundary of two or more classes, then good class separation will not be achieved in

the corresponding dimensions. Because the selection of useful samples is very difficult, most

implementations use large numbers of randomly selected basis samples. This highlights an

important advantage of our metric representation over the RBF representation: in the

metric approach, there is a systematic and efficient method for picking the dimension size

and basis samples. Despite the difficulties of picking samples in the RBF representation,

a meaningful comparison could still be made in terms of the recognition performance by

simply using the basis samples from our projection algorithm as the reference samples in

the RBF representation. Such a comparison would at least show whether the preservation

of interdistances was essential for the good results which we observed, or whether other

methods using DTW but not preserving interdistances might perform comparably well. We

plan to provide the results of this experiment in a forthcoming technical report.
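For reference, the RBF-style alternative described above would take roughly the following form; the width parameter sigma is a free parameter, and the exact normalization used in [40, 29] may differ.

import numpy as np

def rbf_representation(dtw_dists, sigma):
    """Gaussian RBF coordinates from DTW distances (one row per utterance).

    dtw_dists : (n_samples, n_refs) DTW distances to the reference samples.
    Each coordinate is exp(-d^2 / (2 sigma^2)), so samples close to a reference
    are pulled away from the rest; interdistances are not preserved.
    """
    return np.exp(-(dtw_dists ** 2) / (2.0 * sigma ** 2))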


Bibliography

[1] E. Baydal, G. Andreu, and E. Vidal. Estimating the intrinsic dimensionality of discrete utterances. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(5):755-757, May 1989.

[2] V.C. Bhavsar, T.Y.T. Chan, and L. Goldfarb. Parallel implementations for the metric approach to pattern recognition. In Proceedings of 1985 IEEE Computer Society Workshop on Computer Architectures for Pattern Analysis and Image Database Management, pages 126-136, 1985.

[3] Victor Bryant. Metric Spaces: Iteration and Application. Cambridge University Press, 1985.

[4] F. Casacuberta, E. Vidal, and H. Rulot. On the metric properties of dynamic time warping. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(11):1631-1633, November 1987.

[5] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. Wiley and Son, 1973.

[6] D. Feustel and L. G. Shapiro. The nearest neighbor problem in an abstract metric space. Pattern Recognition Letters, 1:125-128, 1982.

[7] A. P. French. Special Relativity. W.W. Norton & Company, New York, 1968.

[8] K.S. Fu. Syntactic Methods in Pattern Recognition. Academic Press, 1974.

[9] L. Goldfarb. A new approach to pattern recognition. Pattern Recognition, 17(5):575-82, 1983.

[10] L. Goldfarb. A new approach to pattern recognition. In L. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition 2, pages 241-402. North-Holland, 1985.

[11] L. Goldfarb. On the foundations of intelligent processes - I: An evolving model for pattern learning. Pattern Recognition, 23(6):595-616, 1990.

[12] L. Goldfarb and R. Verma. Hybrid associate memories and metric data models. In Digital and Optical Shape Representation and Pattern Recognition, Orlando, FL, 1988. SPIE 1988 Technical Symposium on Optics, Electro-Optics, and Sensors.

[13] G. H. Golub and C. F. Van Loan, editors. Matrix Computations. Johns Hopkins University Press, Baltimore, 1983.

[14] W. Greub. Linear Algebra. Springer, 1974.


[15] V. N. Gupta, M. Lennig, and P. Mermelstein. Decision rules for speaker-independent isolated word recognition. In Proceedings of the International Conf. on ASSP, pages 9.2.1-9.2.4, 1984.

[16] G. Hinton. Personal Communication.

[17] G. E. Hinton. Learning translation invariant recognition in a massively parallel network. In Proc. Conf. Parallel Architectures and Languages Europe, Eindhoven, The Netherlands, 1987.

[18] E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. Computer Science Press, 1984.

[19] W. Y. Huang and R. P. Lippmann. Comparisons between neural net and conventional classifiers. In Proceedings ICNN, San Diego, June 1987.

[20] M. I. Jordan. An introduction to linear algebra in parallel distributed processing. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume I. Bradford Books, Cambridge, MA, 1986.

[21] B. H. Juang, L. R. Rabiner, and J. G. Wilpon. On the use of bandpass liftering in speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(7):947-953, July 1987.

[22] J. B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psychome­trika, 29:115-129, June 1964.

[23] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, London, 1978.

[24] G. R. Leonard. A database for speaker-independent digit recognition. In Proceedings of ICASSP-84, pages 42.11.1-42.11.4, San Diego, CA, 1984.

[25] S. E. Levinson, L. R. Rabiner, A. E. Rosenberg, and J. G. Wilpon. Interactive cluster­ing techniques for selecting speaker independent reference templates for isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2):134-141, April 1979.

[26] J. L. Liang and V. H. Clarson. A new approach to classification of brainwaves. Pattern Recognition, 22(6):767-774, 1989.

[27] J. Makhoul. Linear prediction: A tutorial review. Proc. IEEE, 63:561-580, 1975.

[28] C. Myers, L. R. Rabiner, and A. E. Rosenberg. Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(6):623-635, December 1980.

[29] M. Niranjan and F. Fallside. Neural networks and radial basis functions in classify­ing static speech patterns. Cambridge University Engineering Dept. technical report CUED/F-INFENG/TR.22, 1988.

[30] E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, 1983.


[31] R. P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, pages 4-22, April 1987.

[32] R. P. Lippmann and B. Gold. Neural-net classifiers useful for speech recognition. In Proceedings ICNN, San Diego, June 1987.

[33] T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, 1977.

[34] Pettis et al. An intrinsic dimensionality estimator from nearest-neighbor information. IEEE Trans. Pattern Anal. and Machine Intell., PAMI-1:41, 1979.

[35] L. R. Rabiner. Digital Processing of Speech Signals. Prentice-Hall, 1978.

[36] L. R. Rabiner. On creating reference templates for speaker independent recognition of isolated words. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):34-42, February 1978.

[37] L. R. Rabiner and S. E. Levinson. Isolated and connected word recognition-theory and selected applications. IEEE Transactions on Communications, 29(5):621-659, May 1981.

[38] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon. Speaker independent recognition of isolated words using clustering techniques. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(4):336-349, August 1979.

[39] L. R. Rabiner and J. G. Wilpon. Considerations in applying clustering techniques to speaker independent word recognition. J. Acoust. Soc. Amer., 66(3):134-141, September 1979.

[40] S. Renals and R. Rohwer. Phoneme classification experiments using radial basis func­tions. Proceedings of IEEE ICASSP-89, pages 1-461- 1-466, 1989.

[41] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume I. Bradford Books, Cambridge, MA, 1986.

[42] H. Sakoe and S. Chiba. A dynamic programming approach to continuous speech recognition. In Proceedings of the International Congress on Acoustics, Budapest, Hungary, 1971.

[43] D. Sankoff and J. B. Kruskal, editors. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.

[44] E. Vidal, H. Casacuberta, et al. On the verification of the triangle inequality by DTW dissimilarity measures. Speech Communication, 7(1):67-79, November 1988.

[45] E. Vidal, H. M. Rulot, F. Casacuberta, and J. Benedi. On the use of a metric-space search algorithm (AESA) for fast DTW-based recognition of isolated words. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(5):651-660, May 1988.

[46] T. K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52-57, 1968.

[47] S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley and Son, 1985.


[48] J. R. Welch. Combination of linear and nonlinear time normalization for isolated word recognition. J. Acoust. Soc. Amer., 67, 1980.

[49] G. M. White and R. B. Neely. Speech-recognition experiments with linear predic­tion, bandpass filtering, and dynamic programming. IEEE Transactions on Acoustics, Speech and Signal Processing, 24:183-188, 1976.

[50] J. G. Wilpon and L. R. Rabiner. A modified k-means clustering algorithm for use in isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Pro­cessing, 33(3):587-594, June 1985.

[51] M. Wish and J. D. Carroll. Multidimensional scaling and its applications. In P. Krishnaiah and L. N. Kanal, editors, Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality. North-Holland, 1982.
