A Metric Approach to Isolated Word Recognition
by
Raj Verma
Department of Computer Science University of Toronto
Toronto, Ontario, Canada
A Thesis submitted in accordance with the requirements for the Degree of Master of Science
© Copyright by Raj Verma 1991
Abstract
Some of the best known approaches to Isolated Word Recognition (IWR) are based on the Dynamic Time Warping (DTW) distance function. However, because this distance
function is not defined w.r.t. an inner product, the recognition decisions in these approaches must be made with nonparametric classification algorithms like the k-nearest-neighbor search method. Although such methods can achieve a high degree of accuracy when sufficient numbers of training samples are present, they require significantly more memory and processing time than vector based approaches which can exploit the geometry of the representation space to model the decision regions much more efficiently.
In this thesis, we show how a new metric approach to pattern recognition can be used to construct an efficient vector representation of speech patterns which geometrically preserves the DTW interdistances. This approach, which was first conceived by Goldfarb [10] as a way of unifying the classical structural and vector based methods of pattern recognition, achieves its efficiency by mapping the distance information from an abstract metric space into a low dimensional pseudo-Euclidean vector space. Once this is done, the decision regions of each class can be modeled efficiently using various parametric classification methods like those based on Multi-Layered Perceptrons (MLP). In our case, the metric space will be defined by a standard DTW distance function and a set of isolated word samples. This metric space will then be mapped into the vector space in two stages. In the first stage, we use the Goldfarb embedding algorithm to incrementally construct a vector representation w.r.t. the leading principal axes. From this process, the intrinsic dimension of the metric space will be analyzed. Our studies will show that the intrinsic dimension of the DTW metric space is closely related to the number of classes in the metric space rather than the number of training samples. Furthermore, the convergence to the intrinsic dimension appears to happen at an exponential rate. Once a low dimensional representation is determined, we will use the second stage to construct a more efficient mapping function from the metric space to the vector space. This will be done with a projection formula which uses a set of original input speech samples to act as basis samples in the low dimensional vector space. Obtaining a projection this way is mathematically possible because there is a well-defined relationship between the DTW distance in the metric space and inner products in the pseudo-Euclidean vector space.
The basis samples will be selected with a greedy search algorithm that exploits the statistical properties of the vector representation computed by the embedding algorithm.
In our implementation of this approach, using a MLP to analyze the decision regions and testing with monosyllabic digits from the TI 20-word database, we were able to achieve a recognition accuracy of 98.6% on speaker-independent test samples. This score was comparable to the results of the brute-force nearest-neighbor classifier in the DTW metric space; however, our results were obtained using only 16 basis samples in the classifier instead of 320 training samples. Consequently, we were able to reduce the total number of distance computations by nearly 95%, and the total memory requirements by nearly 90%.
Acknowledgements
I would like to first thank my thesis supervisor Ken Sevcik for his support, guidance,
and most of all, his patience. I would also like to thank my second reader Geoff Hinton for
his comments and suggestions. A special thanks goes to Lev Goldfarb for his contribution
to this thesis - this research would not have been possible without his assistance. I am
also grateful to Evan Steeg and Timothy Horton for helping out in the implementation and
proofreading of this work. Finally, I would like to thank my parents for all their love and
support.
Contents
1 Introduction
  1.1 Representing Speech in a DTW Metric Space
  1.2 Representing Speech in a Vector Space
  1.3 Goldfarb's Approach
  1.4 Organization of the Thesis

2 The Metric Approach to Pattern Recognition
  2.1 Introduction
  2.2 Definition of a Metric Space
  2.3 Vector Representation of a Finite Metric Space
    2.3.1 The Multidimensional Scaling Approach
    2.3.2 Goldfarb's Approach
    2.3.3 The Main Embedding Algorithm
    2.3.4 Principal Component Analysis
  2.4 Metric Projections
    2.4.1 A Metric Projection Algorithm
  2.5 On-line Classification

3 Dynamic Time Warping and the Metric Approach to Isolated Word Recognition
  3.1 Introduction
  3.2 The Basic Concept of Dynamic Time Warping
  3.3 Implementation Details of the Dynamic Time Warping Metric
    3.3.1 Representing Spectral Features
    3.3.2 The Dynamic Time Warping Procedure
  3.4 Systolic Array Implementation of DTW
  3.5 The Metric Approach to Isolated Word Recognition
    3.5.1 K-Nearest Neighbor Classifier
    3.5.2 Gaussian Classifier
    3.5.3 Neural Network Classifier

4 Evaluation of the Metric Approach to Word Recognition
  4.1 Introduction
  4.2 Description of the Speech Database
  4.3 Vector Representation of the DTW Metric Space
  4.4 Metric Projection Analysis
  4.5 Analysis of the Recognition Performance
    4.5.1 The Metric Approach vs. the DTW KNN Classifier
  4.6 The Metric MLP vs. Lippmann and Gold's MLP

5 Discussion and Conclusions
Chapter 1
Introduction
1.1 Representing Speech in a DTW Metric Space
Currently, some of the most reliable approaches to Isolated Word Recognition (IWR) are
based on the Dynamic Time Warping (DTW) procedure [37]. This procedure uses a dynamic
programming algorithm to compute a distance measure between two input signals and it
is typically incorporated into nearest-neighbor classifiers. From a similarity measurement
point of view, there are two important advantages one gains from using the DTW distance
instead of the more conventional Euclidean distance. First, because the DTW metric does
not restrict the inputs to fixed-length vectors, it is possible to preserve the natural time
varying structure of speech in the input representation. Second, the DTW metric can
measure the similarity between two common speech utterances far more accurately than
the Euclidean metric because it employs operations which account for the elastic nature of
speech. These two properties of the DTW metric are important because changing speaking
rates can cause temporal variations to appear in the input samples. As a result, an acoustic
feature at time t in one word utterance may not necessarily align with the feature at time t
in a different utterance of the same word. Furthermore, variation in the speaking rate can
cause individual acoustic features in one word to correspond to several features in another
word. The Euclidean distance function assumes that there is a one-to-one correspondence
between the inputs and consequently, it cannot measure the similarity between two words
as accurately.
The DTW procedure overcomes the time variation problem by using a dynamic programming
algorithm to map acoustic features in one word to those of another word in a manner
which yields the minimum possible distance between the two input samples. Strong
constraints are placed on the mapping to prevent unrealistic time alignments from being
selected. Once the optimal mapping is found, the global distance between the words is
determined by accumulating the local distances between corresponding features. Generally
speaking, a nonlinear mapping is required because the temporal structure of speech changes
nonuniformly. This was confirmed in a study done by White and Neely [49] who showed
that a significant reduction in the recognition performance resulted when distances were
computed on a linear time scale rather than a nonlinear time scale using the DTW function.
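The alignment idea described above can be sketched with the standard dynamic-programming recurrence. The following is a minimal illustration only, not the constrained implementation discussed in chapter 3: it assumes an unconstrained symmetric step pattern and a Euclidean local distance, both of which are our own simplifying choices.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences.

    a, b: 2-D arrays of shape (frames, features); the number of frames
    may differ between the two inputs, which is the point of DTW.
    Minimal textbook sketch: no slope constraints or adjustment window.
    """
    n, m = len(a), len(b)
    # D[i, j] = minimum accumulated distance aligning a[:i] with b[:j]
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(a[i - 1] - b[j - 1])  # local frame distance
            # allow diagonal, vertical, and horizontal alignment steps
            D[i, j] = local + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m]
```

Note how a sequence and a time-stretched copy of it (an extra repeated frame) obtain a distance of zero, whereas a fixed-length Euclidean comparison could not even be applied to inputs of different lengths.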
Although temporal variations in speech can be managed reliably with the DTW procedure,
it is difficult to design the classifier efficiently because the only decision algorithms one
can actually use are those based on the nearest-neighbor search algorithm. The brute-force
nearest-neighbor classifier in a DTW metric space has a search and memory complexity
which is linear with the size of the training set. Thus, much of the research in DTW
based approaches to IWR has been directed at heuristic search methods which reduce this
complexity.
Rabiner [38, 25, 39, 50] has proposed the use of clustering techniques which locate
groups of speech samples that have small DTW interdistances. These groups are replaced
by an "average" word sample (see [36]) in order to reduce the entire search space. Although
a sizable reduction can be achieved this way, they have found that as many as 12 "templates"
per class are often required for reliable speaker-independent word recognition. Moreover, the
clustering approach cannot deal with outliers efficiently. Another problem with using
templates is that important acoustic features associated with a cluster can be "smeared"
away when samples are averaged together. As a result, lower accuracy and robustness can
result (in comparison to nonclustering approaches) as was shown by Gupta et al. [15].
Vidal et al. [45] exploit the triangular inequality (TI) property of the DTW metric (see
[44, 4]) to reduce, at each step of a nearest-neighbor search, the number of remaining samples
which need to be checked (see also [6]). Although an impressive reduction in the number
of distance computations has been achieved, there are several limitations with this type of
approach. The most serious problem is the space requirements of the classifier. To facilitate
a reduction in the number of distance computations, a matrix of n * (n-1)/2 interdistances
(all distances between the training sample pairs) must be stored in the classifier in addition
to the n training inputs. This matrix is used during on-line classification to eliminate
samples which fall outside the "search area" whose region is determined by the TI condition
between the unknown sample and previously tested samples. Consequently, the reduction
in the number of distance computations is achieved at the expense of more processing time
between each distance computation and added storage requirements.¹ Another weakness
with this approach is that it is highly sensitive to the degree to which the TI condition is
satisfied. Vidal et al. found that when the threshold which maintained the TI condition
was set too low, significantly more distance computations were required to converge to the
nearest neighbor. On the other hand, if the threshold was set too high, a much higher error
rate resulted.
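The pruning principle behind this family of methods can be illustrated in a few lines. The sketch below uses only the lower bound |δ(q, t) − δ(t, p)| ≤ δ(q, p) obtainable from the stored interdistance matrix; it is a simplified illustration of the idea, not Vidal et al.'s actual algorithm, and the linear scan order and function names are our own assumptions.

```python
def nn_search_with_ti(query, samples, dist, D):
    """Nearest-neighbor search pruned with the triangle inequality.

    samples: list of training samples; D[i][j] holds the precomputed
    interdistance dist(samples[i], samples[j]).
    For any already-evaluated sample t, |dist(query, t) - D[t][i]| is a
    lower bound on dist(query, samples[i]), so sample i can be skipped
    whenever that bound already exceeds the best distance found so far.
    """
    best_i, best_d = None, float("inf")
    evaluated = []  # (index, distance-to-query) pairs computed so far
    for i in range(len(samples)):
        # cheapest available lower bound from previously computed distances
        lb = max((abs(dq - D[t][i]) for t, dq in evaluated), default=0.0)
        if lb >= best_d:
            continue  # pruned: cannot beat the current nearest neighbor
        d = dist(query, samples[i])
        evaluated.append((i, d))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, best_d
```

The trade-off discussed in the text is visible here: each skipped distance computation is bought with extra bookkeeping over the `evaluated` list and with the O(n²) storage of D.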
1.2 Representing Speech in a Vector Space
Much of the search and memory complexity inherent in the nearest-neighbor classification
algorithm (and the related clustering methods) in a metric space could be lowered significantly
if the speech utterances were represented in a vector space where distances were
measured w.r.t. an inner product (the Euclidean distance, for example). That is because
the well-defined geometric properties of an inner product space can be exploited to obtain
a much more efficient classification model. Statistical classification models based on the
Euclidean vector representation are well developed in the area of pattern recognition [5, 47].
Generally speaking, efficiency in these models is achieved by storing only the decision regions
of each class rather than all of the training samples. By storing the decision regions,
it is possible to design a classifier which has a search and memory complexity independent
of the size of the training set; instead, the complexity can be made a function of the number
of classes and the dimension of the vector space. A key advantage of formulating the
complexity in terms of the dimension size is that the dimension can often be reduced using
techniques like Principal Component Analysis [30] when the classes are well-separated.
Decision regions in a vector space can be modeled with probability distribution functions
[5, ch 2], or with various linear and nonlinear discriminant functions [5, ch 4]. A noteworthy
class of nonlinear discriminants are those based on multilayered neural networks [31]. Recent
studies have shown that these discriminants can outperform traditional classifiers and are
¹ It is important to note, however, that the added processing time between each distance computation drops exponentially as one converges towards the nearest-neighbor.
capable of forming unconnected and nonconvex decision surfaces in a Euclidean vector
space [19]. Neural networks are also attractive because the parameters needed to model
the decision regions can be stored efficiently in a set of weight matrices, and most of the
operations required to make a classification decision involve simple matrix calculations
which can be done in parallel.
However, to use vector classification models like those described above, the time varying
input representation of speech must be preprocessed into a fixed-length (ie. n-tuple) representation.
Simple preprocessing methods like linear time warping, padding with silence, or
extracting fixed sets of input features can be used, but these methods destroy the natural
time alignment properties inherent in the DTW metric space representations. This can
result in a loss of discriminating information which can make the design of the classifier
more complex and less robust.
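For concreteness, the simplest of these preprocessing methods, linear time warping, amounts to resampling every utterance to a common number of frames. A small sketch (the frame count and the linear interpolation scheme are arbitrary choices of ours):

```python
import numpy as np

def linear_time_warp(seq, target_frames):
    """Resample a variable-length feature sequence to a fixed length.

    Each output frame is linearly interpolated along the input time
    axis. This is exactly the kind of rigid, uniform rescaling that
    discards the nonlinear alignment information DTW preserves.
    """
    seq = np.asarray(seq, float)
    src = np.linspace(0.0, 1.0, len(seq))       # normalized input times
    dst = np.linspace(0.0, 1.0, target_frames)  # normalized output times
    # interpolate each feature dimension independently
    return np.stack(
        [np.interp(dst, src, seq[:, d]) for d in range(seq.shape[1])], axis=1
    )
```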
1.3 Goldfarb's Approach
In this thesis, we will show how the metric approach to pattern recognition (Goldfarb [10])
can be used to generate an efficient vector space representation of speech which geometrically
preserves the distance information contained in a DTW metric space. By obtaining this
type of representation, we will be able to take advantage of efficient vector classification
methods without losing the discriminating information contained in the DTW metric space.
The vector representation will be computed in two stages. The first stage involves the
sampling of class variations in a finite metric space using the DTW metric and a set of
training speech samples. This metric space is represented by a symmetric distance matrix
which contains all pairs of DTW distances between the training samples. The resulting
metric space is then transformed onto the leading principal axes of a pseudo-Euclidean
vector space at increasingly higher dimensions using an embedding algorithm. During this
transformation, the dimensionality of the vector space is analyzed to find a low dimensional
representation of the metric space. Once this representation is determined, the second
stage is used to construct an efficient formula which approximates the vector representation
of an arbitrary sample. This is done using original speech samples as basis vectors for
the vector space. The basis samples are incorporated into a standard projection formula
which involves the solution of a simple linear system of equations. However, instead of
calculating the inner products in the formula, we calculate the DTW distance between the
sample being projected and the basis samples. This can be done because there is a well-defined
relationship between the distances in a metric space and the inner products in a
pseudo-Euclidean vector space. Given the vector projections of a set of training samples,
the design of the recognition system is reduced to a classical problem in pattern recognition
which involves the modeling of the decision regions.
An early version of this approach to Pattern Recognition was recently applied successfully
in an application involving the classification of neurological signals [26]. However, the
projection algorithm in this study was not implemented efficiently. Other applications of
this approach involving the recognition of geometric shapes and strings have also shown
promising results [9, 2, 12]. However, there is still no published work which systematically
evaluates issues related to the dimensionality and the projection algorithm. Therefore, a
major goal of this thesis is to provide this analysis in the context of a speech recognition
application.
1.4 Organization of the Thesis
The organization of this thesis is as follows. In chapter 2 we present the main technical
details of the metric approach in a general setting. Then in chapter 3, we define the DTW
distance function that we used in our implementation, and we show how it was incorporated
into the classification system. This discussion will include a description of the classifiers
that were used to analyze the decision regions. We then move to chapter 4, where we provide
both the details of the experimental conditions and the results of our analysis. Finally, we
conclude in chapter 5 with a summary of our results and we provide some remarks on how
this work can be extended.
Chapter 2
The Metric Approach to Pattern
Recognition
2.1 Introduction
This chapter describes the main stages of the metric approach to Pattern Recognition. We
begin by first defining the concept of a metric space. We then consider the problem of
mapping a metric space into a vector space. A solution to this problem using the multidimensional
scaling technique [23] will be considered first, and its weaknesses will be pointed
out. Following this, the approach proposed by Goldfarb will be presented in detail. We
will include the main definitions and mathematical results of this approach along with a
formulation of the embedding algorithm. We will also show how one can study the accuracy
vs. dimensionality tradeoff during the transformation by incorporating Principal
Component Analysis (PCA) directly into the embedding process. Finally, we show how the
vector representation in a low dimension can be computed much more efficiently by using
a projection formula.
2.2 Definition of a Metric Space
A metric space (P, o) is defined formally as any set P having an interdistance relationship
measured by a metric function
    δ : P × P → R⁺

which satisfies the following three conditions (see [3]):

    1) ∀ p1, p2 ∈ P:      δ(p1, p2) = δ(p2, p1)                                (symmetry)
    2) ∀ p1, p2 ∈ P:      δ(p1, p2) ≥ 0 and [δ(p1, p2) = 0 ⇔ p1 = p2]          (positivity)
    3) ∀ p1, p2, p3 ∈ P:  δ(p1, p2) + δ(p2, p3) ≥ δ(p1, p3)                    (triangular inequality)
The number δ(p1, p2) is called the distance between p1 and p2. If the third condition
is eliminated, then the distance function is called a pseudo-metric. Given a finite set of
samples P = (p1, ..., pk) (ie. a finite metric space), one can represent a metric space in a
symmetric k-by-k distance matrix D by assigning D[i, j] = δ(pi, pj).
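The three conditions above can be checked mechanically on any finite sample set. A brute-force sketch (the numerical tolerance, and the assumption that identical samples compare equal with `==`, are our own):

```python
import itertools

def is_metric(points, d, tol=1e-9):
    """Check the three metric axioms on a finite sample set.

    Verifies symmetry and positivity over all pairs, and the triangular
    inequality over all triples, exactly as in the definition above.
    """
    for p, q in itertools.product(points, repeat=2):
        if abs(d(p, q) - d(q, p)) > tol:                      # symmetry
            return False
        if d(p, q) < -tol or ((d(p, q) <= tol) != (p == q)):  # positivity
            return False
    for p, q, r in itertools.product(points, repeat=3):
        if d(p, q) + d(q, r) < d(p, r) - tol:                 # triangular inequality
            return False
    return True
```

For example, the absolute difference on real numbers passes all three checks, while the squared difference fails the triangular inequality and is therefore not a metric.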
It is important to note that the definition of a metric space does not require the exact
form of the input samples P to be specified. Any convenient input representation (string,
tree, or matrix, for example) can be used. In addition to this, the metric function can use
any set of operations (such as insertions and deletions) as long as it satisfies the conditions
of a metric. These are the two key advantages of a metric space representation which
are not supported by the conventional Euclidean vector space Rn representation. In fact,
the Euclidean space is actually a very special case of a metric space where the inputs are
specified by fixed-length ordered n-tuples (v1, v2, ..., vn) ∈ Rⁿ. Using this representation, the
distance d(v, u) between any two inputs v, u ∈ Rⁿ must be computed with a very rigid set
of operations based on the standard Euclidean inner product ⟨·, ·⟩:

    d(v, u) = ⟨v − u, v − u⟩^(1/2) = [ (v1 − u1)² + ... + (vn − un)² ]^(1/2)
There are two main problems with directly using the Euclidean representation in pattern
recognition problems. First, one cannot represent high-order structural relationships which
exist between the primitives in the inputs - one can only list the primitives in a certain order.
This limitation complicates the design of the classifier because often the features one needs
to distinguish one class from another are contained not only in the primitives, but also in
the way the primitives connect together. Indeed, the need to base classification decisions
on structure led to the development of syntactic approaches to pattern recognition [8, 33].
The second problem with the Euclidean representation is that the operations used in
the distance function may not necessarily produce the minimum possible distance between
similar patterns in certain domains. One can show for example, that in the domain of
speech, the DTW function produces relatively smaller intradistances within a class than
[Figure: Metric Space (P, δ)  →  Distance Matrix D  →  Vector Space Representation]
Figure 2.1: Illustration of the vector representation problem.
the Euclidean function which uses linearly scaled inputs. The reason the DTW function
performs better is because it optimizes over many different possible alignments and selects
the one which minimizes the total distance instead of just considering one-to-one alignments.
Although a metric space based on non-Euclidean operations can have a more favorable
distance configuration, the Euclidean representation is far more appealing because there are
many efficient algorithms one can use in the design of the classifier. The efficiency is achieved
by exploiting the geometry (ie. inner product) of the representation space in a manner
which makes it possible to model only the decision surfaces of each class. In a general metric
space, there is no inner product defined; thus, one is forced to use nonparametric methods
like the nearest-neighbor search algorithm. It is therefore of interest to ask the following
questions: can the distance information in a metric space be transformed into an inner
product space (that is preferably Euclidean)? If this transformation were possible, one
could then model the "decision regions" in a metric space far more efficiently without losing
important discriminating information.
2.3 Vector Representation of a Finite Metric Space
Given a finite metric space (P, δ), one way the above question can be stated formally is
as follows: is there a way to construct a set of vectors S = (v1, ..., vk) where the inner
product based distance matrix D̂[i, j] = ⟨vi − vj, vi − vj⟩^(1/2) equals the distance matrix
in the corresponding metric space (see figure 2.1)? To answer this question, we need to
consider the following issues. First, we have to establish whether such a mapping is possible.
If equality between the two distance matrices is not possible (or necessary) then it may be
useful to consider a set of vectors S where the error between the two matrices is as small as
possible. The mapping from the metric space to the vector space will be called isometric if
the error can be set to zero.
Second, it is important that the metric points be mapped into the lowest dimensional
vector space possible so that the decision regions can be modeled efficiently. Therefore, we
would like to have a way of analyzing the trade-off between accuracy and dimensionality
during the construction of the vector representation.
2.3.1 The Multidimensional Scaling Approach
One possible solution to the mapping problem can be obtained using the so-called Multidimensional
Scaling technique [23, 51, 22]. In this approach we start with some initial
configuration of vector points S = (v1, v2, ..., vk) in a Euclidean vector space V ⊆ Rⁿ of
some chosen dimension n. Then, using a sum-of-squares error function En like the one
below,

    En(S) = Σ_{i<j} [ d(vi, vj) − D[i, j] ]²                                   (2.1)

we measure the degree to which the current configuration of points S represents the desired
matrix of interdistances D in the metric space. From the gradient of this function (the
gradient of d(vi, vj) with respect to vi is the unit vector in the direction of vi − vj),
we can determine the direction each point vk ∈ S must be moved in order to minimize the
total error. Given this, we can use a standard gradient-descent procedure to successively
work towards the desired configuration.
Although this technique appears to offer a reasonable solution, it has several technical
problems, as pointed out by Goldfarb [10), which limit its effectiveness:
1. There is no guarantee of convergence to the optimal configuration since the algorithm may get stuck in a local minimum. Furthermore, convergence to any solution could take an excessive amount of time.
2. There is no reliable way to determine what the dimension of the representation should be (ie., how much accuracy is lost by using dimension k − 1 rather than k). The measurement E_{k−1}(S) − E_k(S) is not reliable because optimal convergence cannot be assured.
3. In general, a finite metric space cannot be isometrically represented in a Euclidean vector space of any dimension. (This will be shown in the next section. Also see Goldfarb [10].)
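To make the scheme concrete, a gradient-descent version of (2.1) can be written directly. The learning rate, iteration count, and random initialization below are arbitrary assumptions of ours, and, as point 1 above warns, nothing guarantees convergence to the optimal configuration:

```python
import numpy as np

def mds_embed(D, dim=2, lr=0.05, iters=3000, seed=0):
    """Gradient-descent multidimensional scaling on the error of eq. (2.1).

    D: k-by-k target interdistance matrix. Returns k points in R^dim
    whose pairwise Euclidean distances approximate D. Constant factors
    in the gradient are absorbed into the learning rate.
    """
    rng = np.random.default_rng(seed)
    k = D.shape[0]
    V = rng.standard_normal((k, dim))       # arbitrary initial configuration
    for _ in range(iters):
        diff = V[:, None, :] - V[None, :, :]      # pairwise vi - vj
        dist = np.linalg.norm(diff, axis=-1)      # current d(vi, vj)
        np.fill_diagonal(dist, 1.0)               # avoid divide-by-zero
        err = dist - D                            # residual per pair
        np.fill_diagonal(err, 0.0)
        grad = (err / dist)[:, :, None] * diff    # unit vectors scaled by error
        V -= lr * grad.sum(axis=1)                # move each point downhill
    return V

def embedding_error(V, D):
    """Sum-of-squares error of eq. (2.1) for a configuration V."""
    dist = np.linalg.norm(V[:, None, :] - V[None, :, :], axis=-1)
    return ((dist - D) ** 2).sum() / 2.0
```

On a small distance matrix that actually comes from points in the plane the error typically descends close to zero, but from an unlucky initialization the same code can stall in a local minimum, which is precisely the weakness noted above.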
2.3.2 Goldfarb's Approach
It turns out that a finite metric space can be mapped isometrically into a vector space if one
considers a non-Euclidean inner product space. Moreover, there is a very efficient algorithm
available to do this which overcomes the above-mentioned problems of Multidimensional
Scaling. The definition of this vector space and the main result which establishes the
existence of this algorithm are described next.
Definition of a pseudo-Euclidean Vector Space
A pseudo-Euclidean (or Minkowski) vector space R(n⁺,n⁻) of signature (n⁺, n⁻) is a real
vector space of dimension n = n⁺ + n⁻ where the inner product ⟨v, w⟩ between two vectors
v = (v1, ..., v_{n⁺}, v_{n⁺+1}, ..., vn) and w = (w1, ..., w_{n⁺}, w_{n⁺+1}, ..., wn), v, w ∈ R(n⁺,n⁻),
is calculated as follows:

    ⟨v, w⟩ = v1·w1 + ... + v_{n⁺}·w_{n⁺} − v_{n⁺+1}·w_{n⁺+1} − ... − vn·wn

Note that this vector space is a generalization of the Euclidean space Rⁿ (ie. R(n⁺,n⁻) is
Euclidean whenever n⁻ = 0).
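The signature rule is trivial to compute; a small sketch of the inner product above:

```python
def pe_inner(v, w, n_plus):
    """Pseudo-Euclidean inner product on R(n+, n-).

    The first n_plus coordinates contribute positively and the
    remaining coordinates negatively, as in the definition above.
    """
    plus = sum(vi * wi for vi, wi in zip(v[:n_plus], w[:n_plus]))
    minus = sum(vi * wi for vi, wi in zip(v[n_plus:], w[n_plus:]))
    return plus - minus
```

For instance, in R(3,1) an event e = (x, y, z, ct) with x² + y² + z² = (ct)² has ⟨e, e⟩ = 0, which is the "light cone" condition discussed next.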
One very interesting example of a pseudo-Euclidean vector space is the 4-dimensional
vector space R(3,1). This vector space, which is historically linked to the work of Lorentz,
Minkowski and Einstein, describes the basic relationship between space and time in special
relativity theory (see [7]). This relationship is illustrated in figure 2.2 for a two-dimensional
world (ie. the subspace R(2,1)), and it provides some intuition into the geometry of pseudo-Euclidean
spaces. The diagram shows the space coordinates of light at different instants
in time. Note that c · t, rather than t, is used to describe the time scale; thus each axis
Figure 2.2: The space-time geometry of a two-dimensional world.
measures distance.¹ In this diagram, the events e = (x, y, ct) which lie on the so-called
"light cone" are those which have a vector length (distance) of zero relative to an observer
positioned at the origin (these events are all connected by a common light signal):²

    ‖e‖² = ⟨e, e⟩ = x² + y² − (ct)² = 0
Events which lie inside the cone in the positive time region occur in the future relative to
the observer (ie. the distance to the observer yields negative time); while events inside the
negative part of the cone occur in the past (ie. the distance to the observer yields positive
time). Events outside the cone cannot exist because this would require light to travel faster
than c.
The Main Embedding Theorem [Goldfarb 85, ch. 5]

For any finite pseudo-metric space (P, δ), there exists an isometric mapping

    f : (P, δ) → R(n⁺,n⁻)

such that for any pair of metric samples pi, pj ∈ P

    ‖f(pi) − f(pj)‖ = δ(pi, pj)

where ‖v − w‖ =def ⟨v − w, v − w⟩^(1/2) is the distance between the vectors v and w in the pseudo-Euclidean space.
Furthermore, the algorithm which implements this transformation is guaranteed to map
¹ The variable c refers to the speed of light.
² The vectors on the "light cone" are sometimes called isotropic vectors (see [14] and [10]).
the metric samples into the smallest dimensional vector space possible.
This theorem tells us that for every sample pi ∈ P in a finite metric space (P, δ), there
exists a corresponding pseudo-Euclidean vector in R(n⁺,n⁻) which can be constructed by
the mapping function f. The mapping is done in a manner where the pseudo-Euclidean
distance between any two vectors v = f(pi), w = f(pj) is equal to the distance between
pi and pj in the metric space. In other words, the mapping function f constructs a set of
vectors S for which the sum of the square error E(S) in (2.1) is zero. The theorem also
tells us that the metric does not have to satisfy the triangular inequality (ie. it can be a
pseudo-metric).
2.3.3 The Main Embedding Algorithm
A detailed proof of the embedding theorem can be found in Goldfarb [10]. However, because
the proof is of a constructive nature, it is useful to take a close look at it to understand
how the mapping function f works.³
We start by recalling a well-known algebraic result about symmetric matrices, which
asserts that for any symmetric matrix H, one can orthogonally diagonalize it (into Λ) by
using an orthogonal matrix Q (ie. Q⁻¹ = Qᵗ) whose columns are the set of eigenvectors of
H (see for example [14]):

    H = Q Λ Qᵗ

The diagonal elements Λ = diag(λ1, λ2, ..., λn) are the corresponding eigenvalues of H. If
we denote Λ⁺ = diag(λ1, λ2, ..., λ_{n⁺}) and Λ⁻ = diag(λ_{n⁺+1}, λ_{n⁺+2}, ..., λn) as the diagonal
matrices corresponding to the sets of positive and negative eigenvalues respectively, then we
can rearrange the columns (q1, q2, ..., qn) in Q so that the form of the above expression
is as follows:

    H = Q · diag(Λ⁺, Λ⁻) · Qᵗ

If we now move the square root of each eigenvalue out of the middle matrix above by setting
³ This description is based in part on a result shown to me by J. J. Seidel (CSC2414 course notes, Fall 1986).
the matrix V as follows:

    V = Q · diag( (Λ⁺)^(1/2), (−Λ⁻)^(1/2) )

then we can rewrite H as follows:

    H = V · diag(I⁺, I⁻) · Vᵗ

where I⁺ and I⁻ are the positive and negative identity matrices of size n⁺-by-n⁺ and
n⁻-by-n⁻ respectively.
From this expression one can see that each entry H[i,j] in the symmetric matrix is
equal to the pseudo-Euclidean inner product between the two vectors vi, vj ∈ R(n⁺,n⁻), where
vector vi corresponds to the i-th row in the matrix V:
    H[i, j] = ⟨vi, vj⟩ = v_{i,1}·v_{j,1} + ... + v_{i,n⁺}·v_{j,n⁺} − v_{i,n⁺+1}·v_{j,n⁺+1} − ... − v_{i,n}·v_{j,n}
This shows that any k-by-k symmetric matrix H is in fact a matrix of inner products of k
pseudo-Euclidean vectors in R(n⁺,n⁻) which can be obtained from a diagonalization process.
Although the matrix D corresponding to a metric space is symmetric, it is not a matrix
of inner products, but rather a matrix of interdistances. Thus, to use the above result
we need to convert each distance entry D[i, j] into its corresponding inner product value
H[i, j] = ⟨vi, vj⟩ to get the decomposition to work correctly. If this is done we will end up
with vectors vi, vj ∈ R(n⁺,n⁻) having a pseudo-Euclidean distance ⟨vi − vj, vi − vj⟩^(1/2)
equal to the value D[i, j]. To do this, we simply need to solve for ⟨vi, vj⟩ in the equation
D[i, j] = ⟨vi − vj, vi − vj⟩^(1/2), which can be done as follows:
    D²[i, j] = δ²(pi, pj) = ⟨vi − vj, vi − vj⟩
             = ⟨vi, vi⟩ − 2⟨vi, vj⟩ + ⟨vj, vj⟩
             = ‖vi − v0‖² − 2⟨vi, vj⟩ + ‖vj − v0‖²
             = δ²(pi, p0) − 2⟨vi, vj⟩ + δ²(pj, p0)

    ⟨vi, vj⟩ = (1/2) [ δ²(pi, p0) + δ²(pj, p0) − δ²(pi, pj) ]                  (2.2)
The sample p0 above is simply the metric sample designated as the origin v0 in R(n+,n-). (In the next section we show more precisely how we designate this origin.) Thus, if we convert each entry D[i,j] in the distance matrix into the corresponding inner product value H[i,j] = <vi, vj> using the r.h.s. of (2.2), and then decompose this new matrix H into pseudo-Euclidean vectors using the diagonalization technique described above, the result will be vectors vi and vj (rows i and j in V) which have a pseudo-Euclidean distance equal to the original distance between samples pi and pj.
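The conversion in (2.2) is simple to state in code. The following is a minimal sketch, assuming numpy; the function name and the choice of index 0 for the origin sample p0 are illustrative, not part of the thesis:

```python
import numpy as np

def inner_products_from_distances(D, origin=0):
    """Convert a matrix of interdistances D into the matrix of inner
    products H of equation (2.2), with the sample indexed by `origin`
    playing the role of p0:
        <v_i, v_j> = (1/2)[d^2(p_i,p0) + d^2(p_j,p0) - d^2(p_i,p_j)]"""
    D2 = D ** 2
    return 0.5 * (D2[:, [origin]] + D2[[origin], :] - D2)
```

For a Euclidean distance matrix, the resulting H is exactly the Gram matrix of the points shifted so that p0 sits at the origin.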
The above embedding process not only shows how to construct a vector representation, it also shows that a finite metric space can be mapped isometrically into a Euclidean vector space only when the corresponding matrix of inner products H is positive semidefinite (ie. has no negative eigenvalues). One can also see that the dimension of the isometric representation is equal to the number of nonzero eigenvalues. The time complexity of this embedding algorithm is dominated by the computation time required to diagonalize a symmetric matrix, which is of order O(k³) [13]. However, as we shall show in the next section, one can reduce this complexity by embedding (nonisometrically) into a lower dimensional vector space.
2.3.4 Principal Component Analysis
Although the embedding algorithm does find the lowest dimensional vector representation that precisely preserves the interdistances in a metric space, practical considerations will make it necessary to work with a still lower dimensional representation. The number of non-zero eigenvalues n = n+ + n- (ie. the dimension size) constructed by the diagonalization process is usually as large as the number of samples |P| in the finite metric space (P, δ). However, empirical studies [9, 2, 12] have shown that when good class separability is achieved in the metric space, the intrinsic dimension of the vector space tends to be related to the number of classes in the metric space rather than the number of samples. In
this context, the intrinsic dimension refers to the lowest dimension where the error in the distance matrix becomes insignificant to the recognition performance. One way to see how the class separation affects the dimension size is to imagine the situation where the average interdistance within a class incrementally approaches zero for c classes in Euclidean n-space with n > c. As the average distance within a class approaches zero, the upper bound on the dimension size must approach c - 1 (ie. when each class becomes a point, we can pass a c - 1 dimensional hyperplane through the c points). A similar argument can be used for points represented in a metric space. If we have a situation where the average interdistance is zero in each class, we would only need to represent c points in the metric space - one for each class. These points could be represented isometrically in a pseudo-Euclidean vector space of dimension no greater than c because the maximum number of eigenvalues we would obtain from the c-by-c interdistance matrix is c.
One way to approximate the intrinsic dimensionality of a metric space is to use the estimate proposed by Pettis et al. [34]. This measurement uses nearest-neighbor information in a finite metric space to measure the minimum number of parameters that is necessary to represent the input data. Using a DTW metric space consisting of as many as 200 classes, Baydal et al. [1] concluded that, based on the Pettis et al. estimate, 13 parameters would be sufficient to represent each of their speech samples. It is interesting to note that this number is fairly consistent with the number of distance computations that were needed by the Vidal et al. [45] nearest-neighbor algorithm. However, because this estimate is not based on a vector representation, it is not a very reliable way to measure the true intrinsic dimensionality of a metric space.
A far more accurate method is to analyze the corresponding pseudo-Euclidean vector representation w.r.t. the principal components. The standard way to perform PCA is to first obtain the covariance matrix from the vector points, and then to rotate the axes to the principal axes of the covariance matrix. Figure 2.3 shows the geometrical meaning of the principal components in a two dimensional space. First, the origin of the vector representation is shifted to the mean vector. Then an orthogonal rotation of the axes occurs (from the x-y to the x'-y' coordinate system) so that each axis lies successively in the direction of greatest variance while maintaining orthogonality to the existing axes. These directions are determined by the eigenvectors of the covariance matrix, and the corresponding eigenvalues measure the amount of variance along each axis. One can see in figure 2.3 that
Figure 2.3: Geometric interpretation of the principal components.
if the distance information (ie. variance) is contained intrinsically in a low dimension, we
can obtain a low dimensional representation by simply removing the principal axes which
have relatively small amounts of variance without distorting the global metric structure.
The definition of a covariance matrix in a pseudo-Euclidean space is actually a generalization of the classical one (see Goldfarb [10, ch. 5]). For the vectors contained in V, it can be computed as follows. First the origin is shifted to the mean vector v̄ = (1/k)·Σ_{i=1}^{k} vi, k = |P|. That is, we substitute each row vi, i = 1, ..., k in V with vi - v̄. Then we compute the covariance matrix as follows:

C = V^t·V·J

where

J = [ I+   0  ]
    [ 0   -I- ]
The eigenvectors of this matrix define the principal axes in R(n+,n-), and the magnitudes of the corresponding eigenvalues are proportional (by a factor of 1/k) to the variance along each axis [10]. These eigenvalues can be used to measure the (relative) amount of distance information that is contained along each principal axis.
Thus, given an isometric pseudo-Euclidean representation of a metric space, one way to obtain a low dimensional representation in R(m+,m-) with m = m+ + m- < n is as follows. First compute the covariance matrix using all of the pseudo-Euclidean vectors as described above. Then diagonalize the covariance matrix to determine the eigenvectors and the eigenvalues, and retain the leading m eigenvectors, which correspond to the m largest eigenvalues (according to the magnitudes). These eigenvectors are then used to transform
the original vector representation to the leading m principal components. This is done by simply forming a transition matrix whose rows are the eigenvectors. If one simply wants a low dimensional Euclidean representation, then only the eigenvectors associated with the same signature (ie. in the set λ+ or λ-) would be retained in the transition matrix. The choice of m can be made using many different approaches. For example, one can incrementally transform the vectors into higher dimensions and measure the quality of the representation w.r.t. various criteria. Reasonable criteria include the scatter ratio, the average distance between the classes, or the sum-of-square error defined by equation (2.1). Once one of these measurements reaches an acceptable value (ie. when the separability in the vector space approaches the separability in the metric space), one could terminate the search process. In chapter 4, an empirical study will be done to investigate this issue further.
However, the problem with using this approach is that we have to first construct a full
dimensional representation in order to determine the covariance matrix and the transition
matrix. Fortunately, it is possible to program the embedding process so that it produces
a vector representation that is already transformed to the principal axes. Consequently, a
full dimensional representation does not have to be constructed - we can simply program
the embedding process to terminate after a sufficient number of principal components have
been generated.
To do this, we need to make sure the origin is already positioned at the mean vector v̄, instead of an arbitrary sample p0, when we form the matrix of inner products H. For this to happen, the sample p0 representing the origin in (2.2) must be selected with the property f(p0) = v̄. If this is done (later we show how), then the eigenvalues of H will be the same as the eigenvalues of the covariance matrix V^t·V·J, as shown below (in the formulation below, √Λ denotes the diagonal matrix diag[√|λ1|, √|λ2|, ..., √|λn|]):

V^t·V·J = (Q·√Λ)^t·(Q·√Λ)·J
        = √Λ·Q^t·Q·√Λ·J
        = √Λ·I·√Λ·J
        = Λ    (2.3)
This important result by Goldfarb [10] tells us that when H is computed with the mean positioned at the origin, the resulting vectors (rows in V = Q·√Λ) from the embedding process will be given w.r.t. the principal axes. That is, if we computed the covariance matrix V^t·V·J using the rows of V from the embedding process, then we would end up with the diagonal matrix Λ, which is the set of eigenvalues computed from H. Consequently, the vectors in V must already exist w.r.t. the principal axes. Furthermore, the eigenvalues of H measure the amount of variance along each principal axis. This means that the analysis of the intrinsic dimensionality can be done during the embedding process. For each new eigenvector and eigenvalue obtained during the embedding process, a new principal axis (ie. column in V) can be added to the representation. Thus, to obtain an m dimensional representation, only the leading m eigenvectors and eigenvalues need to be computed.
Obtaining the principal components directly from the embedding process has important practical implications for the implementation of the embedding algorithm. Diagonalization algorithms like those based on Householder reduction and the QL transformation [13] can be used to incrementally generate more eigenvalues and eigenvectors. These algorithms are numerically stable, and they can be programmed to terminate when an acceptable criterion is satisfied in the vector representation as described earlier, or when the (relative) size of the eigenvalue becomes sufficiently small. If we terminate the diagonalization process at some dimension m, then the representation that results is the orthogonal projection of the isometric representation on the subspace spanned by the leading m principal axes.
However, in order to perform PCA during the embedding process, we need to be able to find a sample with the property f(p0) = v̄. In general such a "mean" sample is not likely to exist in P, and even if it did, there is no way of knowing this without actually obtaining a full isometric representation first. However, Goldfarb [10] has shown that one can get around this problem algebraically by using the well-defined relationship between distances and inner products (see the formulation of (2.2)):
δ²(pi, p̄) = <vi - v̄, vi - v̄>
          = <vi,vi> + <v̄,v̄> - 2·<vi,v̄>
          = <vi,vi> + (1/k²)·Σ_{x=1}^{k} Σ_{y=1}^{k} <vx,vy> - (2/k)·Σ_{x=1}^{k} <vi,vx>
          = δ²(pi,p0) + (1/(2k²))·Σ_{x=1}^{k} Σ_{y=1}^{k} [δ²(px,p0) + δ²(py,p0) - δ²(px,py)]
            - (1/k)·Σ_{x=1}^{k} [δ²(px,p0) + δ²(pi,p0) - δ²(pi,px)]
          = (1/k)·Σ_{x=1}^{k} D[x,i]² - (1/(2k²))·Σ_{x=1}^{k} Σ_{y=1}^{k} D[x,y]²    (2.4)
This result shows that the squared distance between any metric sample pi and the pre-image of the mean vector, p̄ = f⁻¹(v̄), can actually be determined directly from the interdistance matrix D without first embedding. Taking the r.h.s. of this equation and substituting it for δ²(pi, p0) in (2.2) ensures that the matrix of inner products H has its mean vector already positioned at the origin.
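Equation (2.4) is itself easy to evaluate. A minimal sketch, assuming numpy; the function name is illustrative:

```python
import numpy as np

def sq_dist_to_mean(D):
    """Equation (2.4): the squared distance between each sample p_i and
    the pre-image of the mean vector, computed directly from the
    interdistance matrix D:
        (1/k) sum_x D[x,i]^2 - (1/(2k^2)) sum_x sum_y D[x,y]^2"""
    D2 = D ** 2
    k = D.shape[0]
    return D2.mean(axis=0) - D2.sum() / (2 * k * k)
```

For a Euclidean distance matrix, this reproduces the squared distance of each point to the centroid of the point set.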
A full summary of the embedding algorithm is provided below:
The main algorithm.
1. Let D[i,j] = δ(pi,pj), 1 ≤ i,j ≤ k. Compute the symmetric matrix H[i,j], 1 ≤ i,j ≤ k:

   H[i,j] = (1/2)·[ (1/k)·(Σ_{x=1}^{k} D[x,j]² + Σ_{x=1}^{k} D[i,x]²) - (1/k²)·Σ_{x=1}^{k} Σ_{y=1}^{k} D[x,y]² - D[i,j]² ]
2. Find the eigenvalues of H:

   λ1, λ2, ..., λn+, λn++1, ..., λn, 0, 0, ..., 0

   where the number of trailing zeros is k - n and n = n+ + n-. The number of positive eigenvalues is n+ and the number of negative eigenvalues is n-. Now form the diagonal matrix √Λ:

   √Λ = diag[√|λ1|, √|λ2|, ..., √|λn|, 0, ..., 0]
3. Find the corresponding orthonormal eigenvectors q1, q2, ..., qk of the matrix H, and form the matrix Q whose i-th column is qi.
4. Compute the matrix V = Q·√Λ, and set f(pi) = vi, where the coordinates of vi are the first n elements of the i-th row of V. The basis vectors for this representation will then be the principal axes of the corresponding covariance matrix of the resulting vectors in R(n+,n-).
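The four steps above can be sketched compactly in numpy. This sketch uses a full eigendecomposition in place of the incremental Householder/QL scheme discussed earlier, and the function name is mine:

```python
import numpy as np

def embed(D):
    """The main embedding algorithm: form H with the mean at the origin
    (step 1), eigendecompose the symmetric matrix H (steps 2-3), and
    scale each eigenvector by sqrt(|eigenvalue|) to obtain the
    pseudo-Euclidean coordinates (step 4).  Returns the coordinate
    matrix V (rows = samples, columns = axes sorted by |eigenvalue|)
    and the eigenvalues, whose signs give the signature (n+, n-)."""
    k = D.shape[0]
    D2 = D ** 2
    # Step 1: H[i,j] = (1/2)[(1/k)(col sums + row sums) - grand mean - D^2].
    H = 0.5 * (D2.mean(axis=0)[None, :] + D2.mean(axis=1)[:, None]
               - D2.sum() / k**2 - D2)
    # Steps 2-3: eigenvalues and orthonormal eigenvectors of H.
    lam, Q = np.linalg.eigh(H)
    order = np.argsort(-np.abs(lam))          # leading axes first
    lam, Q = lam[order], Q[:, order]
    # Step 4: V = Q * sqrt(|Lambda|).
    V = Q * np.sqrt(np.abs(lam))[None, :]
    return V, lam
```

For a Euclidean distance matrix, all significant eigenvalues are positive and the rows of V reproduce the original interdistances exactly; the number of significant eigenvalues is the intrinsic dimension.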
2.4 Metric Projections
So far, we have discussed the first stage of the metric approach, which involves the transformation of a metric space representation into a vector space, during which the intrinsic dimension is determined. This low dimensional representation in R(m+,m-) will be referred to as the principal axis representation of a metric space.

In the next stage of the metric approach we address the following question. Given a principal axis representation of a metric space, how do we determine the principal axis representation of an arbitrary sample p* in the metric space without repeating the entire embedding process for (P ∪ {p*}, δ)? In other words, how do we compute the vector representation efficiently?

It is done as follows. We know that v* = f(p*) corresponds to some point in R(n+,n-). The precise point could be obtained by embedding (P ∪ {p*}, δ) into R(n+,n-) as described in the previous section. We would like to find the orthogonal projection u* = π(v*) of v* on the subspace R(m+,m-) spanned by the m = m+ + m- leading principal axes representing
Figure 2.4: Illustration of the concept of a metric projection.
the metric space (see figure 2.4).
Let (b1, b2, ..., bm), m ≤ n, be any set of basis vectors bi ∈ R(n+,n-) which span this subspace. If we knew the values of the inner products <v*, bi> and <bi, bj> for all i,j = 1, ..., m, then we could determine u* by simply solving a system of m linear equations. These equations would be formed as follows. We use the orthogonality condition between the vector (v* - u*) and each basis vector bi to give us m equations:

<v* - u*, bi> = 0,  for all i = 1, ..., m    (2.5)

The coordinates of u* w.r.t. the basis vectors are some set of constants (c1, ..., cm) such that u* = Σ_{i=1}^{m} ci·bi. Thus, if we substitute this expression for u* into (2.5) above and expand, we can form the following set of linear equations:

<b1,b1>·c1 + <b1,b2>·c2 + ... + <b1,bm>·cm = <v*,b1>
    ...
<bm,b1>·c1 + <bm,b2>·c2 + ... + <bm,bm>·cm = <v*,bm>
Denoting b as the column matrix with b[i] = <v*, bi> and G as the m-by-m (Gram) matrix with G[i,j] = <bi, bj>, we can easily obtain the coordinates of u* by computing G⁻¹·b. This gives the vector representation w.r.t. the chosen basis vectors. To find out what the coordinates of u* are w.r.t. the principal axes (ie. the eigenvectors of the covariance matrix), we simply have to multiply the coordinates of u* by the transition
matrix B whose i-th column is the basis vector bi. Therefore, the projection of p* can be calculated as follows:

π(p*) = B·G⁻¹·b    (2.6)
However, to evaluate this projection formula, we need to be able to compute the column matrix b for some set of basis vectors. This can be done without knowing the full vector representation of p* by using the relationship between inner products and distances given in equation (2.2). If we use real metric samples from P as our basis, then (2.2) can be used to measure the inner products between v* = f(p*) and any vector f(pi), where pi ∈ P. To be specific, if we denote the subset of metric samples Pb = (pb1, pb2, ..., pbm) as our basis, the entries b[i] = <v*, bi> can be computed as follows (the Gram matrix can also be computed from (2.2) in a similar manner):

b[i] = <v*, bi> = (1/2)·[δ²(p*, p0) + δ²(pbi, p0) - δ²(p*, pbi)]
where p0 is the designated origin sample. Actually, to obtain a true principal axis representation of p*, we need to designate the origin as the mean vector, which means that in the above formula the r.h.s. of (2.4) needs to be used in place of δ²(p*, p0) and δ²(pbi, p0), as was done in the main embedding algorithm. Although there is no significant problem in determining δ²(pbi, p0) this way, since all of the distances required are available in D and the computation can be done off-line, the computation of δ²(p*, p0) is clearly too expensive for on-line applications because the distance between p* and each of the pi ∈ P needs to be computed before a projection can be obtained. However, this overhead can be eliminated if we designate a real sample p0 as the origin instead of the mean point. This simply corresponds to a parallel translation of the vector representation; consequently, no metric information is lost (provided we also shift the training samples in the same manner after the embedding is completed). If we do this, then only m + 1 new distance computations need to be done to obtain a projection: one between p* and p0, and one between p* and each sample pbi, i = 1, ..., m. In the next section, we describe a method for selecting the origin sample and the basis samples.
Picking the Origin and the Basis Samples
In our description of the projection formula, we conveniently made the assumption that there existed a subset Pb ∪ {p0} ⊂ P which would span the same subspace as that spanned by the leading m principal axes. In other words, we assumed that the full vector representations (ie. the representations in R(n+,n-)) of these samples were located precisely on the subspace R(m+,m-) and that linear independence between the samples existed. In general, however, there is no guarantee that such a set exists. Furthermore, even if this set did exist, we would have no tractable way of knowing this unless we obtained the complete vector representation of (P, δ).
Thus, we would like to find, in a tractable manner, a subset of samples Pb ∪ {p0} which spans a subspace that is as close as possible to the subspace spanned by the m leading principal axes. The brute-force solution to this problem is to consider all possible ordered subsets of size (m + 1) from P. For each (m + 1)-subset, all samples in P can be projected w.r.t. this set using (2.6), and the error between this representation and the original principal axis representation can be measured using some suitable criterion (like the sum-of-square error (2.1), for example). The subset which we would eventually select is the one which results in the minimum error.

Obviously this search method is not practical, as there are |P|!/(|P| - (m + 1))! possible ordered (m + 1)-subsets.⁴
Nevertheless, if we relax the optimality condition, and use the statistical properties of
the principal axis representation computed by the embedding algorithm along with the
information stored in the distance matrix D, then a reliable solution can be determined in
a tractable manner.
2.4.1 A Metric Projection Algorithm
The solution that we propose is based on the greedy search method and is illustrated in figure 2.5. We start by first picking a sample to represent the origin. In our solution, we use the nearest-neighbor to the mean vector in R(n+,n-). Fortunately, we do not have to obtain a
⁴ Goldfarb's work appears to be the first to address this search problem, and a reliable (polynomial time) solution to this problem has yet to be found in the areas of computer science, statistics, or relevant areas of pattern recognition. In the book "Subspace Methods of Pattern Recognition", Oja [30] does address this problem for Euclidean metric spaces, but the solution he provides does not restrict the search to the original pattern samples.
Figure 2.5: The search space of the basis samples and origin considered in the Metric Projection formula.
full dimensional representation of (P, δ) to do this. The nearest-neighbor can be determined directly from the distance matrix D by using the following result by Goldfarb [10]: the nearest-neighbor to the mean vector in R(n+,n-) corresponds to the metric sample p ∈ P which has the minimum accumulated distance Σ_{i=1}^{k} δ(p, pi) to all other samples in P. In other words, the index of this sample is the row in D which has the minimum row sum.
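This origin selection rule is a one-liner. A minimal sketch, assuming numpy; the function name is mine:

```python
import numpy as np

def origin_sample(D):
    """Index of the metric sample nearest to the mean vector: the row
    of the distance matrix D with the minimum row sum (Goldfarb [10])."""
    return int(np.argmin(D.sum(axis=1)))
```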
Once the origin sample is determined, we successively find the samples pbi ∈ P to represent each principal axis in R(m+,m-), starting with the axis a1 associated with the largest (in magnitude) eigenvalue and ending with the axis am associated with the smallest (in magnitude) eigenvalue. Let the set β = (pb1, pb2, ..., pbi-1) be the basis selected for the first leading i-1 axes. The next axis, ai, is determined by first locating the maximum projections vl and vr on both the left and right side of the axis (see figure 2.5). Then, the k-nearest-neighbors Pl = (pl1, ..., plk) and Pr = (pr1, ..., prk) of both these points are located from the reduced vector representation in R(m+,m-). For each of these points pi ∈ Pl ∪ Pr, we project all of the samples in P w.r.t. the basis set β ∪ {pi}, and then we compute some criterion which describes the error of the resulting representation. (In chapter 4, we will define the measurement which was used in our implementation.) Among the samples in Pl ∪ Pr, we select the sample that minimizes the criterion. This sample is then appended to the basis set β, and we continue the greedy search looking for a sample to represent axis ai+1. This process finally stops when axis am is reached.
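To make the greedy search concrete, the following is a deliberately simplified sketch, assuming numpy and a Euclidean metric space. It departs from the algorithm above in two labeled ways: candidates are drawn from all remaining samples rather than from the k-nearest-neighbors of the extreme projections vl and vr, and the criterion is the sum-of-square error between the projected and the original interdistances. The function name and both simplifications are mine:

```python
import numpy as np

def greedy_basis(D, m, origin):
    """Greedily select m basis samples from the distance matrix D,
    minimizing the sum-of-square interdistance error of the projected
    representation at each step (Euclidean case)."""
    D2 = D ** 2
    # <v_i, v_j> w.r.t. the origin sample, per equation (2.2).
    H = 0.5 * (D2[:, [origin]] + D2[[origin], :] - D2)
    basis = []
    remaining = [i for i in range(D.shape[0]) if i != origin]
    for _ in range(m):
        best, best_err = None, np.inf
        for cand in remaining:
            idx = basis + [cand]
            G = H[np.ix_(idx, idx)]            # Gram matrix of the basis
            C = np.linalg.solve(G, H[:, idx].T).T   # projection coordinates
            Hp = C @ G @ C.T                   # inner products of projections
            # squared interdistances of the projected samples
            d2 = np.diag(Hp)[:, None] + np.diag(Hp)[None, :] - 2 * Hp
            err = ((d2 - D2) ** 2).sum()
            if err < best_err:
                best, best_err = cand, err
        basis.append(best)
        remaining.remove(best)
    return basis
```

Restricting the candidate pool to the k-nearest-neighbors, as in the algorithm above, would shrink the inner loop without changing its structure.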
There are several variations of this algorithm one can consider. For example, since the first few leading principal components are the most critical, because they represent the greatest amount of variance, one may consider using a backtracking search for these first few dimensions and then continue with a greedy search for the rest of the representation. Another possibility is to use a different selection method. For example, instead of using the k-nearest-neighbors to vl and vr at the i-th step, one could use the k-maximum projections on both the left and right side of axis ai. This possibility will be investigated in chapter 4 along with the k-nearest-neighbor selection method.
The reason we restrict our search to the k-nearest-neighbors (or k-maximum projections) for each dimension is that this method avoids cases which could lead to linear dependence between the basis samples (ie. a high condition number for the corresponding Gram matrix G). This restriction can eliminate better solutions from the search space, but as one will see in chapter 4, our empirical studies indicated that such losses are not likely to be very significant.
2.5 On-line Classification
Once the origin and a reliable basis set (pb1, pb2, ..., pbm) have been determined, we are ready to make use of them in the design of the classifier. To account for any errors in the representation that may result from using an imperfect set of basis samples, we must recompute the vector representation of the entire training set P w.r.t. the newly selected basis set before it is used in determining decision regions.

Given the vector representation of a set of training samples, we can apply various classical methods to find the decision regions of each class [5]. However, if the representation we are using is pseudo-Euclidean, then the inner product underlying these methods must be changed (ie. generalized) in order to accommodate a pseudo-Euclidean inner product (see Goldfarb [10] for examples of how this is done). Once the decision regions are determined, an unknown sample p* can be classified by first obtaining its metric projection u* using (2.6), then assigning the unknown to the class associated with the decision region into which the vector u* falls.
Chapter 3

Dynamic Time Warping and the Metric Approach to Isolated Word Recognition
3.1 Introduction
In the previous chapter we presented a general description of the metric approach to pattern
recognition. In this chapter we will show how we used this approach to solve an isolated
word recognition problem efficiently. To use the metric approach, we needed to first define
a distance function in order to obtain a metric space representation of speech. We therefore
begin the chapter with an overview of the main factors which motivated the design of
our distance function. Then we describe the details of how we implemented the distance
function. The implementation is presented in two stages. First, we describe the spectral
analysis that was used to represent the speech signals; then we describe the algorithm that
was used to measure the distance between speech signals in this representation. Both a
serial and parallel version of this algorithm will be described.
We then show the main components of the word recognition system that was used in our
implementation. Included in this discussion will be a description of the various classifiers
that were used to analyze the decision regions. In the next chapter, a comprehensive
performance analysis of this system will be provided.
3.2 The Basic Concept of Dynamic Time Warping
The distance function which we used in our implementation is commonly referred to as
the Dynamic Time Warping (DTW) function [46, 42, 28). This distance has been used
routinely over the last 20 years in template based approaches of speech recognition [37].
It has achieved a high degree of success mainly because it can compare speech signals of
varying temporal length in an accurate manner.
To understand how the function works, it is useful to look at the basic nature of a speech
signal first. Speech is produced by exciting the vocal tract with a pulse of air generated
from the sub-glottal system. The resulting flow of air, which is transmitted to a listener
in an acoustic wave, causes a certain sound to be produced. By varying the shape and
diameter of the vocal system, a speaker can generate different sounds while he is speaking.
Each particular state of the vocal system can be distinguished by its frequency response
(ie. formant frequencies). One can obtain these frequency components by first sampling the
output acoustic waveform and then converting the samples to the frequency domain using
standard digital signal processing techniques [35]. Because there are physical limitations on
how fast the state of the vocal system can change, the main spectral properties of speech
generally remain fixed for 10 to 30 ms. during an utterance. Therefore, a reliable digital
representation of speech can usually be obtained by using a sequence of feature vectors
which describe the changing spectral properties over a 10 to 30 ms. ("short-time") interval.
Some examples of spectral properties that are most often used in practice include mel-scale,
cepstral, LPC, and filter bank coefficients [35].
With this type of representation, there are two types of variations that can occur. First,
the individual spectral feature vectors can vary depending on the characteristics of the vocal
system during an utterance. Second, the duration of the sequence can vary as a result of
different speaking rates. Both these situations must be accounted for in a distance function.
This can be done by first determining a meaningful alignment between the input sequences.
Once an alignment is determined, the total difference can be obtained in terms of the
individual differences between corresponding feature vectors. To ensure that interdistances
within a class are as small as possible, it is important to consider alignments which minimize
the total difference as much as possible.
The DTW function finds the alignment which minimizes the total distance between two
Figure 3.1: Illustration of the Concept of Time Warping
Figure 3.2: Violations of the local continuity constraint.
sequences

A = (a1, a2, ..., aN) and B = (b1, b2, ..., bM)    (3.1)

by searching for a mapping (see figure 3.1)

k → (i(k), j(k)),  k = 1, ..., K

which leads to the smallest cumulative local distance

Σ_{k=1}^{K} d(a_i(k), b_j(k)).
To avoid unnatural alignments, the search is restricted to mappings which satisfy two strong constraints. The first is the endpoint constraint. This is used to ensure that the beginning points and ending points of both sequences align together:

i(1) = 1, j(1) = 1,  beginning point
i(K) = N, j(K) = M,  ending point.

The second constraint is the local continuity constraint. This is used to preserve three basic properties in the mapping:
1. Temporal ordering. For any mapping (a_i(k), b_j(k)), we must have i(k+1) ≥ i(k) and j(k+1) ≥ j(k). This guards against the mapping shown in figure 3.2(a).

2. No skipped feature vectors, as shown in figure 3.2(b).

3. No excessive occurrences of one-to-many correspondences, as shown in figure 3.2(c). This avoids the unrealistic mapping of a single acoustic feature onto a long acoustic segment. In our case, we will allow at most a 2-to-1 mapping.
The solution to the sequence alignment problem which satisfies the above constraints
can be computed very efficiently using the technique of Dynamic Programming [18]. In
the following section, we will show how this was done in our implementation of the DTW
function.
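The dynamic programming recurrence under these constraints can be sketched as follows, assuming numpy. The Euclidean local distance and the (i-2,j-1), (i-1,j-1), (i-1,j-2) transitions anticipate the implementation described in section 3.3.2; the function name is mine:

```python
import numpy as np

def dtw(A, B):
    """DTW distance between feature-vector sequences A (N x d) and
    B (M x d): minimum cumulative Euclidean local distance over warp
    paths from (1,1) to (N,M), where a path may reach (i,j) only from
    (i-2,j-1), (i-1,j-1) or (i-1,j-2) (at most a 2-to-1 mapping)."""
    N, M = len(A), len(B)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)  # local distances
    g = np.full((N, M), np.inf)        # accumulated distances
    g[0, 0] = d[0, 0]                  # endpoint constraint: start at (1,1)
    for i in range(1, N):
        for j in range(1, M):
            prev = g[i - 1, j - 1]
            if i >= 2:
                prev = min(prev, g[i - 2, j - 1])
            if j >= 2:
                prev = min(prev, g[i - 1, j - 2])
            g[i, j] = d[i, j] + prev
    return float(g[N - 1, M - 1])      # endpoint constraint: end at (N,M)
```

Note that the transitions never move horizontally or vertically, so sequences whose lengths differ by more than the 2-to-1 slope allows receive an infinite distance, which is the intended effect of the slope constraint.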
3.3 Implementation Details of the Dynamic Time Warping
Metric
3.3.1 Representing Spectral Features
Spectral features of the raw input speech waveform (sampled at 12.5 kHz with 12-bit quantization), as shown in figure 3.3(a), were analyzed using the bandpass liftering technique proposed by Juang et al. [21]. In this analysis, overlapping short-time speech segments (ie. frames of speech) were obtained every 10 ms. Each segment was obtained using a 512-point Hamming window (see figure 3.3(b)). A 12-th order linear predictive analysis [27] was then applied to this segment (see figure 3.3(c)). The resulting coefficients (a1, a2, ..., a12) were then converted to the cepstral domain (see figure 3.3(d)) using the following well-known recursive formula (see Rabiner [35]):

c1 = a1    (3.2)

ck = ak + (1/k)·Σ_{i=1}^{k-1} (k-i)·c_{k-i}·a_i,  k = 2, ..., 12    (3.3)
The cepstral coefficients were then smoothed with the following window, as suggested by Juang et al. [21] (this smoothing process is known as bandpass liftering):

w(k) = 1 + 6·sin(πk/12),  k = 1, ..., 12
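The recursion (3.2)-(3.3) and the lifter can be sketched together, assuming numpy; the function name is mine:

```python
import numpy as np

def lpc_to_liftered_cepstrum(a, Q=12):
    """Convert LPC coefficients a = (a_1, ..., a_p) to cepstral
    coefficients with the recursion (3.2)-(3.3), then apply the
    bandpass lifter w(k) = 1 + (Q/2) sin(pi k / Q), Q = 12."""
    p = len(a)
    c = np.zeros(p)
    c[0] = a[0]                                     # c_1 = a_1
    for k in range(2, p + 1):                       # c_k, k = 2..p
        acc = sum((k - i) * c[k - i - 1] * a[i - 1] for i in range(1, k))
        c[k - 1] = a[k - 1] + acc / k
    w = 1.0 + (Q / 2.0) * np.sin(np.pi * np.arange(1, p + 1) / Q)
    return w * c
```

For instance, with only two coefficients the recursion gives c2 = a2 + a1²/2 before liftering.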
Figure 3.3: Illustration of the spectral feature analysis.

Figure 3.4: (a) A sequence of log LPC spectra. (b) The corresponding bandpass liftered LPC spectra. (Taken from [20]).

As shown by Juang et al. in figure 3.4, the bandpass liftering process smoothens the
sharp spectral peaks in the LPC log spectrum without distorting the fundamental formant
structure of the speech. This helps remove noiselike statistical variability from the feature
set.
3.3.2 The Dynamic Time Warping Procedure
The dynamic time warping procedure we implemented was a version of the one used in
Myers et. al. [28). An example of it is shown graphically in figure 3.5. The axis of the grid
is used to represent the time scale of each word; each time slot corresponds to a frame of
liftered cepstral coefficients. In our implementation, we made use of the normalize/warp
technique as suggested by Myers et. aL [28). This meant that before we time warped, each
frame of spectral coefficients ( c1 , c2, ... , Cm) was first decimated (normalized) to a fixed
length linear time scale (c1 , c2, •.• , cm.) using the following formulae [48):
c̄i = (1-s)·cn + s·cn+1,  i = 1, ..., m̄

where

n = floor((i-1)·(m-1)/(m̄-1)) + 1
s = (i-1)·(m-1)/(m̄-1) + 1 - n    (3.4)

m̄ was set to the average length over all words used in the training set (in our case m̄ = 45). The normalization was done to accommodate the slope constraint defined below.
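The interpolation in (3.4) can be sketched as follows, assuming numpy; the function name and the right-endpoint guard are mine:

```python
import numpy as np

def normalize_length(frames, m_bar=45):
    """Linearly resample a sequence of m feature frames to a fixed
    length m_bar, per the interpolation formula (3.4)."""
    m = len(frames)
    out = []
    for i in range(1, m_bar + 1):
        t = (i - 1) * (m - 1) / (m_bar - 1) + 1   # position on the 1..m scale
        n = int(np.floor(t))
        s = t - n
        # guard the right endpoint, where n == m and s == 0
        nxt = frames[n] if n < m else frames[n - 1]
        out.append((1 - s) * frames[n - 1] + s * nxt)
    return np.array(out)
```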
To obtain the DTW distance measure, the local distance at point (i,j) in the grid was determined using a Euclidean distance between the corresponding feature vectors. The relative value of the local distance is shown graphically by the magnitude of the shaded box. The DTW algorithm can be viewed as a search for a warp path which starts at point (1,1) and ends at point (n,n). Each grid point passed over in the path corresponds to a specific mapping of two feature vectors. DTW seeks the path which has the minimum accumulated local distance along the path from point (1,1) to (n,n). Local continuity constraints are maintained by making sure that no path reaches point (i,j) unless it comes from one of the following three points: (i-2, j-1), (i-1, j-1), (i-1, j-2), as shown in figure 3.6. The resulting slope constraint on the DTW path restricts the search space to the region
Figure 3.5: Graphical illustration of the DTW process.
Figure 3.6: Local continuity constraints.
Figure 3.7: Search region defined by the local continuity constraints.
shown in figure 3.7. Because the inputs were normalized using equation (3.4), the search
space was guaranteed not to be empty. This particular local constraint was originally
suggested by Myers et al. [28] who found it to have near optimal performance among
various alternatives. The dynamic programming procedure associated with the constraint
is given by:
    D(i, j) = min { D(i-2, j-1) + (1/2)[d(i-1, j) + d(i, j)],
                    D(i-1, j-1) + d(i, j),
                    D(i-1, j-2) + (1/2)[d(i, j-1) + d(i, j)] }

    D(1, 1) = d(1, 1)

where d(i, j) is the Euclidean distance between frame i and frame j, and D(i, j) is the corresponding
global distance. The nondiagonal warp paths were weighted by 1/2 to ensure that each path
to (i, j) had the same effective number of local distances; otherwise the algorithm would
have favored the diagonal paths.
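The recursion above translates directly into a dynamic-programming table. The following is an illustrative O(n²) implementation of the distance computation (the naming is ours, not the thesis code); cells unreachable under the slope constraint simply stay at infinity:

```python
import math

def dtw_distance(A, B):
    """DTW with the (i-2,j-1)/(i-1,j-1)/(i-1,j-2) local continuity
    constraint; nondiagonal steps weight their extra local distance
    by 1/2.  A and B are lists of feature vectors, both already
    time-normalized to the same length n."""
    n = len(A)
    INF = float('inf')

    def d(i, j):  # Euclidean local distance, 1-based frame indices
        return math.dist(A[i - 1], B[j - 1])

    D = [[INF] * (n + 1) for _ in range(n + 1)]
    D[1][1] = d(1, 1)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i == 1 and j == 1:
                continue
            best = INF
            if i >= 3 and j >= 2:
                best = min(best, D[i-2][j-1] + 0.5 * (d(i-1, j) + d(i, j)))
            if i >= 2 and j >= 2:
                best = min(best, D[i-1][j-1] + d(i, j))
            if i >= 2 and j >= 3:
                best = min(best, D[i-1][j-2] + 0.5 * (d(i, j-1) + d(i, j)))
            D[i][j] = best
    return D[n][n]
```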
3.4 Systolic Array Implementation of DTW
The DTW procedure we described in the previous section has a time complexity proportional
to N² where N is the length of the two input sequences. However, because many
of the local decisions in the dynamic programming algorithm can be made in parallel, it is
possible to reduce the time complexity down to O(N). This can be done using a hexagonal
cellular systolic array (see figure 3.8) as suggested by Bhavsar et al. [2]. At each step
k, the systolic array computes in parallel the local distance at each processing
node (i, j) for which i + j = k; after O(N) such steps, the global distance is
available in processing node (N, N). There are two types of processing nodes which are
used in figure 3.8: one is a hexagonal cellular processor (figure 3.9(a)) and the other is
a square cellular processor (figure 3.9(b)). The hexagonal processor contains three table-lookup
memory cells. Two of the memory cells, labelled in figure 3.9(a) as cells a_i and b_j,
are used to store the input feature vectors a_i and b_j in processor node (i, j). The other
cell, labelled cell D in figure 3.9(b), is used to store the local distance at processor node
(i, j). At each step k, the hexagonal processor (i, j) performs three basic operations. First,
it transfers the memory contents of cell a from processor (i-1, j) into the memory cell a
Figure 3.8: A Systolic Array for DTW (taken from [2]).
Figure 3.9: Parts of the hexagonal processor used in the systolic array.
in processor (i, j). While this is being done, the memory contents of cell b are simultaneously
transferred from processor (i, j-1) over to cell b in processor (i, j) (see figure 3.9(a)).
Second, the contents of cell D are read from processors (i-1, j), (i-1, j-1), (i, j-1) and the
minimum value is stored in cell D of processor (i, j) (see figure 3.9(b)). Finally, the Euclidean
distance between a and b in processor (i, j) is computed and added to cell D.
The other processing node shown in figure 3.8 is a square cellular processor. In this
processor, there are only two memory cells. One is used to store an input feature vector,
the other simply indicates an "infinitely" large local distance. The circle processor at (0,0)
in figure 3.8 is simply used to store a zero local distance which is read by processor
(1, 1).
This particular implementation of the DTW function does not place a slope constraint
on the warp path; however, regions of the systolic array can be eliminated from the search
path by setting the local distance cell D in undesired areas to an "infinitely" large value.
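The behavior of the array can be imitated sequentially by sweeping anti-diagonal wavefronts: every cell on a wavefront depends only on earlier wavefronts, so all of its cells could be computed in parallel. A sketch (the naming is ours, and we assume the unconstrained three-predecessor DTW recursion realized by the array):

```python
import math

def systolic_dtw(A, B):
    """Wavefront simulation of the unconstrained-slope DTW computed by
    the hexagonal systolic array: at step k every cell (i, j) with
    i + j = k is updated, so each wavefront is parallelizable."""
    N, M = len(A), len(B)
    INF = float('inf')
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    for k in range(2, N + M + 1):                # anti-diagonal wavefronts
        for i in range(max(1, k - M), min(N, k - 1) + 1):
            j = k - i
            local = math.dist(A[i - 1], B[j - 1])
            if i == 1 and j == 1:
                D[i][j] = local                  # boundary cell reads the stored zero
            else:
                D[i][j] = local + min(D[i-1][j], D[i-1][j-1], D[i][j-1])
    return D[N][M]
```

Slope constraints can be imposed, as in the array itself, by presetting D in undesired cells to an "infinitely" large value.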
3.5 The Metric Approach to Isolated Word Recognition
The design of the isolated word recognition system involved four main stages as shown in
figure 3.10. In the first stage (see figure 3.10(a)), a metric space representation of speech
was obtained using a set of training samples and the DTW distance function described in
section 3.3.2. Then in the second stage, the leading principal components of this DTW
metric space were computed using the main embedding algorithm described in chapter 2
(a) Metric Space; (b) Vector Space Representation; (c) Metric Projections; (d) Decision Region Analysis
Figure 3.10: The 4 stages involved in the design of the isolated word recognition system.
Figure 3.11: The components to the isolated word recognition system.
(see figure 3.10(b)). Using this representation in the third stage, we determined a set of
basis samples using the search algorithm described in chapter 2. Once the basis samples
were located, they were incorporated into a metric projection formula and the vector
representation of all training samples was then recomputed w.r.t. this projection formula (see
figure 3.10(c)). Finally, in the fourth stage, the projections of the training samples were
used to analyze the decision regions of each class using three different classifiers as described
in the next section (see figure 3.10(d)).
Once the four stages were completed, the class label of an arbitrary speech sample
was computed as shown in figure 3.11. This figure shows the main components of the
speech recognition system. The basis samples computed in stage 3 are used to compute
a projection of the unknown input using the metric projection formula. Then the class
label of the resulting vector is obtained from the classifier that was created in stage 4.
There were three different classifiers considered in our implementation: (1) a
k-nearest-neighbor classifier, (2) a Gaussian classifier, and (3) a Neural Network classifier.
These classifiers are described below.
3.5.1 K-Nearest Neighbor Classifier
The k-nearest neighbor (KNN) classifier is an example of a nonparametric classification
procedure [5, ch. 3]. In other words, the classifier does not use a functional description of
the classes; instead, the decision regions are modeled directly with a set of stored samples.
Given a set of n labelled training samples {v_1, v_2, ..., v_n}, we classify an unknown x by first
locating the k nearest samples to x w.r.t. the distance measure:

    {v_i | d(x, v_i) is among the k smallest distances, i = 1, ..., n}

Then, we classify x with the class label which appears most frequently in this set; however,
if a tie results, we simply use the class label of the nearest neighbor (i.e. k = 1).
This procedure has a search and memory complexity which is linear w.r.t. the size of
the training set. Although clustering methods or the use of the triangular inequality can
be used to reduce this complexity, we did not consider these techniques in this thesis.
The KNN classifier was used on both the DTW metric space representation of speech
(i.e. d was measured with the DTW metric), and the vector space representation (i.e. d
was measured with the pseudo-Euclidean metric). By doing this, we were able to make a
meaningful comparison between the two representations.
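A sketch of this decision rule (the naming is ours; `dist` stands in for either the DTW metric or the pseudo-Euclidean vector-space metric):

```python
from collections import Counter

def knn_classify(x, samples, labels, dist, k=5):
    """k-nearest-neighbor rule with the tie-break used here: fall back
    to the 1-NN label when the class vote among the k neighbors is tied."""
    ranked = sorted(range(len(samples)), key=lambda i: dist(x, samples[i]))
    top = [labels[i] for i in ranked[:k]]          # labels of the k nearest
    counts = Counter(top).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return top[0]                              # tie: nearest neighbor decides
    return counts[0][0]
```

Both the search time and the memory are linear in the training-set size, which is the cost the metric approach is designed to avoid.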
3.5.2 Gaussian Classifier
The Gaussian classifier [5, ch. 2] was also considered in our experiments. In this classifier,
the decision regions are characterized by separate probability distribution models for each
class. We assumed that the samples in each class ω_i, i = 1, ..., c were random vectors whose
position at location x depended on the multivariate normal density function:

    p(x|ω_i) = (2π)^{-m/2} |Σ_i|^{-1/2} exp( -(1/2)(x - μ_i)^t Σ_i^{-1} (x - μ_i) )

where m is the dimension size of the vector space, μ_i is the mean vector of class ω_i, Σ_i is
the m-by-m covariance matrix of class ω_i, and |Σ_i| is the determinant of Σ_i.
The decision regions formed by this density function manifest themselves as hyperellipsoids.
The position and shape of each hyperellipsoid are determined by μ_i and Σ_i respectively.
These two parameters were estimated using the maximum likelihood method [5]:

    μ = (1/n) sum_{i=1..n} v_i
    Σ = (1/n) sum_{i=1..n} (v_i - μ)^t (v_i - μ)
After obtaining parameters for each class, we classified an unknown x as a member of class
ω_i using a Bayesian-like decision rule:

    x ∈ ω_i   if   p(x|ω_i) = max_j p(x|ω_j),   j = 1, ..., c.
The main advantage of this classifier over the KNN is that the computation time and
memory requirement is independent of the size of the training set. Instead, the complexity
of the classifier is a function of the dimension size m and the number of classes c. During
on-line classification, most of the computation time is dominated by the Mahalanobis distance
computation (x - μ)^t Σ^{-1} (x - μ), and therefore the time complexity of the classifier is O(c ·
m²). As for memory, one only needs to store the parameters of the distribution model for
each class; thus, the memory complexity is also O(c · m²).
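A sketch of such a classifier (the class and method names are ours; comparing log-densities instead of densities is a numerical-stability detail not discussed above):

```python
import numpy as np

class GaussianClassifier:
    """Maximum-likelihood Gaussian classifier: one mean vector and one
    full covariance matrix per class, with a max-density decision rule."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.params = {}
        for c in self.classes:
            V = np.asarray([x for x, t in zip(X, y) if t == c], dtype=float)
            mu = V.mean(axis=0)
            sigma = (V - mu).T @ (V - mu) / len(V)        # ML estimate (1/n)
            self.params[c] = (mu, np.linalg.inv(sigma),
                              np.linalg.slogdet(sigma)[1])
        return self

    def log_density(self, x, c):
        mu, inv, logdet = self.params[c]
        maha = (x - mu) @ inv @ (x - mu)                  # Mahalanobis term
        return -0.5 * (len(mu) * np.log(2 * np.pi) + logdet + maha)

    def predict(self, x):
        x = np.asarray(x, dtype=float)
        return max(self.classes, key=lambda c: self.log_density(x, c))
```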
3.5.3 Neural Network Classifier
The final classifier that was used in our tests was the Multi-Layered Perceptron (MLP) [31].
This classifier is more flexible than the Gaussian classifier because it is capable of forming
arbitrary decision regions, as was shown by Huang and Lippmann [19]. The MLP that we
used can be viewed as a generalization of a linear machine [5]. A linear machine uses c
linear discriminant functions of the form

    g_i(x) = w_i · x + θ_i,    i = 1, ..., c

and the following classification rule

    if g_i(x) ≥ g_j(x), ∀ j ≠ i   then   x ∈ ω_i

to partition the vector space into c piecewise linear decision regions R_i, i = 1, ..., c. In
this case, all of the vector points x for which g_i(x) is maximum lie in region R_i. The
boundary between two adjacent regions R_i, R_j is a portion of the hyperplane defined by
g_i(x) - g_j(x) = 0.
As with the Gaussian classifier, the parameters of the linear machine are estimated from
the training set. In this case we need to determine the weight vector w_i and threshold value
θ_i. These values can be found by minimizing a cost function of the form

    E = sum_{i=1..c} sum_{v ∈ ω_i} ||g(v) - I[i]||²

where g(v) is the vector [g_1(v), g_2(v), ..., g_c(v)] and I[i] is an index vector which equals
the i-th row of a c-by-c identity matrix. A solution to this minimization problem can be
obtained using various well-known gradient descent procedures [5]. Among these, there are
two types of procedures. One converges when a solution exists (i.e. perceptron procedures)
but oscillates otherwise, and the other always finds a solution, though not necessarily a
correct one even when the classes are linearly separable (i.e. Least Mean Squares solutions
using the Widrow-Hoff rule or the pseudoinverse function) [5].
The main limitation of a linear machine is that it only works well when the decision
regions are singly-connected convex regions. In such cases, the Gaussian classifier is likely
to be sufficient. The MLP extends the capability of a linear machine by using hidden layers
of nonlinear discriminants between the input vector x and the output layer g(x). Much
like the output layer of a linear machine, the i-th hidden layer of the network consists of
a vector of individual linear discriminant functions {g_1^i, g_2^i, ..., g_n^i}. However, to make the
discriminants more powerful, the outputs of these functions are passed through a nonlinear
differentiable function which is of the following sigmoid form:

    f(x) = 1 / (1 + e^{-x})

The input vector to these nonlinear discriminants is the output vector at the previous layer
(i - 1), or simply the input sample x if i = 1. One can show that if a multilayered linear
discriminant does not use a nonlinear function between the layers, it can always be collapsed
to a single-layer linear machine (see [20] for example). Hinton [17] has shown that the use
of so-called "bottle-neck" hidden layers can help to transform the input representation into
a vector representation which is linearly separable. We will show in the next chapter that
the learning complexity of such transformations is related to the degree of class separation
Figure 3.12: Neural Network architecture used in the isolated word recognition system.
of the input vectors. It is therefore possible to reduce the complexity of learning in these
networks by preprocessing the inputs using the metric approach with a distance function
which improves the separation of the inputs. Also, by preprocessing data this way, it is
possible to use more flexible input representations in the classification system.
The particular architecture of the MLP that was used in our implementation is shown
in figure 3.12. We selected this architecture to get a meaningful comparison to the results
obtained by Lippmann and Gold [32]. In that particular study, Lippmann and Gold found
that this network gave the optimal performance among various alternatives when tested on
the data set that was also used in our study. (Details of the data set will be described in
the next chapter).
The network was trained using a standard back-propagation learning algorithm [41].
The specific details of this learning algorithm that we used in our implementation were as
follows. We used the acceleration method described by Hinton [17] to update the weights
at each iteration t as shown below:

    Δw_ij(t) = -ε ∂E/∂w_ij + α Δw_ij(t - 1)

Initially, we set ε = 0.03 and α = 0.5 for t = 1, ..., 100 (i.e. the first 100 iterations). Then,
to facilitate faster learning, these parameters were changed to ε = 0.05 and α = 0.9 for
the rest of the training period. Convergence of the net occurred when the error dropped
below a tolerance of .1. The weights were updated using a "batch" mode of gradient descent
by accumulating ∂E/∂w_ij over all the input-output cases first, and then adjusting w_ij by an
amount proportional to this sum. This method of gradient descent had the advantage of
not being sensitive to the order in which the training cases were presented. Initially, all the weights
were set randomly between the range of -1.0 and +1.0. To avoid driving the weights to very
large values during training, outputs of .8 and .2 were used instead of 1 and 0 respectively.
After convergence, we ran the learning procedure an extra 500 iterations using a weight
decay mode. In this mode, the magnitude of each weight was reduced by 0.5% after each
weight update. Consequently, only those weights which helped reduce the total error
were able to stay active. This process facilitated generalization by forcing the network to
learn the regularities which defined a class instead of overfitting the sampling error of the
training data (which was a potential problem given the large number of weights in the
network).
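The per-iteration update just described, with batch-accumulated gradients, the momentum term, and multiplicative weight decay, can be sketched as follows (the names and the exact decay parameterization are ours):

```python
def update_weights(weights, velocity, grads, eps, alpha, decay=0.0):
    """One batch-mode weight update with the acceleration (momentum)
    term and optional weight decay:
        dw(t) = -eps * dE/dw + alpha * dw(t-1)
    `grads` holds dE/dw accumulated over all input-output cases."""
    for i in range(len(weights)):
        velocity[i] = -eps * grads[i] + alpha * velocity[i]
        weights[i] += velocity[i]
        weights[i] *= (1.0 - decay)   # e.g. decay=0.005 shrinks |w| by 0.5% per update
    return weights, velocity
```

Switching from (eps, alpha) = (0.03, 0.5) to (0.05, 0.9) after the first 100 iterations, as above, amounts to calling this update with different parameters for the two phases.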
Chapter 4
Evaluation of the Metric Approach to Word Recognition
4.1 Introduction
Several experiments were done to evaluate the isolated word recognition system described
in the previous chapter. In this chapter we will present the results of this evaluation. Our
evaluation will be organized in three sections. First, the principal components of the DTW
metric space will be analyzed at varying dimension sizes. This analysis will be used to
determine the intrinsic dimension of the metric space. In the second section, the projection
formula proposed in chapter 2 will be tested. The results of this test will show us how well
the projections preserve the original vector representation. Finally, having constructed a
projection formula, the results of our word recognition system will be presented for each of
the three classifiers described in the previous chapter.
4.2 Description of the Speech Database
The speech data that was used in all of our experiments was obtained from the Texas
Instruments 20 word database [24]. These samples were digitized at a rate of 12.5 kHz
using a 12-bit A/D converter. In our evaluation, we used 8 monosyllabic digits (digits
"one" through "nine" excluding "seven") in our vocabulary. Our training set consisted of
320 samples in total which was obtained from 8 speakers (4 males and 4 females) using 5
samples per speaker. For testing purposes, we used a set of 1024 speaker dependent (SD)
samples and a set of 1024 speaker independent (SI) samples. In both cases, the samples
were obtained from 8 speakers (4 males and 4 females) using 16 samples per speaker (in
the case of the SD set, the speakers were the same, but the samples were different). All of
the training samples came from the same speaking session; however, the test samples were
generated over 8 different speaking sessions (2 samples per session) which were separated
in time by at least one day.
4.3 Vector Representation of the DTW Metric Space
Using the above training samples and the DTW distance function described in section 3.3.2,
we pre-computed the 320 × 320 interdistance matrix representing the corresponding DTW
metric space.¹ We then used the main embedding algorithm (in chapter 2) to obtain a
vector representation w.r.t. the leading principal axes. The eigenvalues associated with
each principal axis are listed in Table 4.1 in descending order (according to magnitudes).
Each magnitude reflects the relative amount of perturbation that would result in the finite
metric space representation if the corresponding axis were removed from the isometric vector
representation. One can see in figure 4.1 that the magnitudes quickly decay toward zero
in an exponential-like manner. The rapid descent towards zero indicates that a relatively
small perturbation should result in the vector representation if all but the first few leading
principal axes are removed. Supporting evidence for this is available in figure 4.2. This plot
shows the vector representation along the first two leading principal axes; note that these
two principal axes span a Euclidean representation because the corresponding eigenvalues
are both positive (see table 4.1). One can see that despite the low dimension, enough DTW
interdistance information is preserved to roughly distinguish most of the classes. Also,
consistent with our intuition, we see that the interdistances between words that sound
similar like "one" and "nine" are much smaller than the interdistances between words that
sound quite different like "six" and "four".
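The main embedding algorithm of chapter 2 builds the principal-axis representation incrementally; the batch sketch below conveys the same idea via double centering of the squared interdistances (an assumption on our part, not the thesis algorithm). Negative eigenvalues correspond to pseudo-Euclidean axes:

```python
import numpy as np

def embed_from_distances(D, dims):
    """Embed an n-by-n interdistance matrix into vectors w.r.t. the
    `dims` leading principal axes.  Coordinates are scaled by
    sqrt(|lambda|); the eigenvalues are returned as well so the caller
    can track which axes are Euclidean (positive) or not."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ (np.asarray(D, float) ** 2) @ J   # inner-product matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(-np.abs(vals))[:dims]         # leading |eigenvalue|s
    coords = vecs[:, order] * np.sqrt(np.abs(vals[order]))
    return coords, vals[order]
```

Applied to the 320 × 320 DTW matrix, the eigenvalue magnitudes returned here are the quantities plotted in figure 4.1.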
A more precise view of the perturbations can be determined by examining different
statistics of the vector representation at successively higher dimensions. This was done
from dimension 1 up to 320 by adding back more principal components to the representation
1 The average distance computation required approximately 45 msec of cpu time using a Silicon Graphics 4D/240 Superworkstation (or approximately 38 minutes for all of the (320 · 319)/2 = 51,040 entries in the symmetric distance matrix).
[Table 4.1 lists the 320 eigenvalues in descending order of magnitude, from roughly 1.1 × 10⁷ down to values near zero.]

Table 4.1: The eigenvalues corresponding to the principal axes of the DTW metric space.
Figure 4.1: A plot of the magnitudes of the eigenvalues corresponding to the principal components of the DTW metric space.
Figure 4.2: The vector representation of the training samples w.r.t. the 2 leading principal axes.
starting with the first principal axis. For each k = 1, ..., 320 dimensional representation, we
measured the quality of the representation w.r.t. two different criteria. The first one was
the sum of the square error E(V) between the interdistances in the metric space and the
corresponding interdistances in the k dimensional vector space. This quantity, which was
computed as described in section 2.3.1 of chapter 2, allowed us to see the degree of error
present in the vector representation at different dimension sizes.
The second criterion measured was the ratio of the average between-class distance to the
average within-class distance. This measurement, denoted by S(V), was computed in the
following manner. First, we computed the average between-class distance B(V_x, V_y), where
V_x = {v_1^x, v_2^x, ..., v_40^x} denotes the set of vectors in class V_x, x = 1, ..., 8:

    B(V_x, V_y) = (1/40²) sum_{i=1..40} sum_{j=1..40} d(v_i^x, v_j^y),    x ≠ y

Next, we measured the average within-class distance W(V_x) as follows:

    W(V_x) = (2/(40 · 39)) sum_{i=1..39} sum_{j=i+1..40} d(v_i^x, v_j^x)

Finally, we combined these two functions together and computed the average scatter ratio
S(V) of the entire vector representation V = ∪_{x=1..8} V_x as follows:

    S(V) = (1/8) sum_{x=1..8} [ (1/7) sum_{y ≠ x} B(V_x, V_y) ] / W(V_x)
We used this measurement to see how the class separation changed as the dimensions
increased.
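Under one plausible reading of the definitions above, S(V) can be computed as follows (the function and argument names are ours):

```python
import numpy as np
from itertools import combinations

def scatter_ratio(classes):
    """Average between-to-within class scatter ratio S(V); `classes`
    maps each class label to an array of that class's sample vectors."""
    def B(X, Y):   # average between-class distance
        return np.mean([np.linalg.norm(x - y) for x in X for y in Y])
    def W(X):      # average within-class distance
        return np.mean([np.linalg.norm(a - b) for a, b in combinations(X, 2)])
    ratios = []
    for x, Vx in classes.items():
        between = np.mean([B(Vx, Vy) for y, Vy in classes.items() if y != x])
        ratios.append(between / W(Vx))
    return float(np.mean(ratios))
```

Evaluating this at each dimension size k, on the k-dimensional coordinates, reproduces the kind of curve shown in figure 4.4.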
The graph in figure 4.3 shows a plot of the values of E(V) for the increasing dimension
sizes 1 through 320. One can see that E(V) decays sharply towards zero in an exponential
manner as the dimension size increases. This behavior is consistent with the way the
magnitudes of the eigenvalues decline in figure 4.1. Because the eigenvalues reflect the
amount of metric information (ie. variance) that is added back into the representation, we
see that the error in the vector representation approaches zero at approximately the same
point where the eigenvalues become insignificant. Furthermore, the plot clearly shows that
essentially all of the distance information is contained intrinsically in the first few leading
Figure 4.3: A plot of the vector representation error at varying dimension sizes.
Figure 4.4: A plot of the average between-to-within class scatter ratio at varying dimension sizes.
principal axes; at approximately twice the number of classes (that is, 16 dimensions), we
essentially have an isometric representation of the DTW metric space. This supports the
hypothesis made by Goldfarb [11], that when classes are reasonably separated (as is apparent
in figure 4.2), the intrinsic dimension of our DTW metric space is closely related (by a small
constant) to the number of classes (in our case 8), and not the size of the training set.
The diminishing effects of the higher dimensions can also be seen in terms of the scatter
ratio. In figure 4.4 we show how the ratio S(V) changes as more principal components
are added to the vector representation. In this plot we again see a sharp exponential
drop which is similar to figure 4.1 and figure 4.3. At first this plot seems to indicate that
the class separation weakens as more dimensions are added to the representation since we
Figure 4.5: A plot of the average distance between the classes at varying dimension sizes.
would expect to see the between-to-within class scatter ratio rise as more of the original
metric information is brought back into the representation. However, if we examine the
numerator of the ratio, which measures the average between-class distance B(V_x, V_y),
we see (in figure 4.5) that the distances between the classes do increase (in a manner
inversely related to the scatter ratio). Thus, the decrease in the scatter ratio simply occurs
because the average distance within a class (the denominator of S(V)) rises faster than the
average distance between the classes. The key observation one should make from figure 4.4
and figure 4.5 is that after the first few principal axes, the class separation levels off and
essentially equals the separation in the DTW metric space. This means we are not likely to
gain much in the recognition performance if we analyze the decision regions in dimensions
beyond this point. Note, the point where the class separation levels off is approximately
the dimension where both the eigenvalues and the error measure E(V) approach zero. This
means that basically any one of these measurements could be used to analyze the intrinsic
dimension of the metric space.
4.4 Metric Projection Analysis
In this section we present the results of our analysis of the metric projection algorithm
described in chapter 2. There were two specific aspects of the projection algorithm we were
interested in analyzing. First, we wanted to see how accurately the metric projections of
the training samples represented the original principal components as more basis samples
were added into the metric projection formula. Second, we wanted to see what differences
existed between the two versions of the greedy search procedures proposed in chapter 2.
To answer these questions, we proceeded in the following manner. First, we set the
dimension size of the search space to 16 (i.e., we only used the leading 16 principal
components of the training samples while searching for the basis samples). This dimension size
was selected on the basis of the results obtained from the previous section and because the
leading 16 dimensions were Euclidean (see table 4.1), which simplified our analysis of the
decision regions in the next section. The 16 basis samples {b_1, b_2, ..., b_16} were found from
the search space using a greedy search algorithm. Given i - 1 basis samples, basis sample
b_i was selected using two separate methods, called the k-MAX and the k-NN projection
methods, as outlined in chapter 2. In the k-NN method, we computed the maximum
projection on both ends of the i-th leading principal axis. Then, the k-nearest neighbors
{x_1, x_2, ..., x_2k} of the two "end" vector points were located and each of these candidate
basis samples was tested as follows. Let {u_1^j, u_2^j, ..., u_320^j} be the metric projections of the
training samples w.r.t. the basis set {b_1, b_2, ..., b_{i-1}, x_j}, and let {v_1, v_2, ..., v_320} be
the original vector representation of the training set w.r.t. the i leading principal axes.
We tested each x_j, j = 1, ..., 2k, by measuring the pseudo-Euclidean distance between
u_l^j and v_l for l = 1, ..., 320 and accumulating these distances into the error value e(x_j).
[Plot: "Metric Projection Error (e)" for the k-MAX and k-NN selection methods vs. dimension]
Figure 4.6: The value of the test criterion e for each dimension.
This measurement was used, rather than E(V), because it reflected the deviation from the
original representation much more accurately. Using this test criterion, we selected the basis
sample b_i in the following manner:

    b_i = argmin_{j = 1, ..., 2k} e(x_j).
For the k-MAX method, we used the same general search strategy to select the basis
sample bi; however, instead of selecting among the k-nearest neighbors as described above,
we used the k-maximum projections onto the i-th principal axis. In both cases, we set k
equal to 10.
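The greedy selection loop just described can be sketched as follows. This is a schematic reconstruction, not the thesis implementation: the error function e is passed in as a black box (in our experiments it measured the pseudo-Euclidean deviation of the metric projections from the true representation), the names are hypothetical, and the k-NN variant would differ only in how the candidate set for each axis is generated.

```python
def k_max_candidates(coords, axis, k):
    """Indices of the k samples with the largest |projection| on the given axis."""
    order = sorted(range(len(coords)), key=lambda i: abs(coords[i][axis]), reverse=True)
    return order[:k]

def greedy_select_basis(coords, n_basis, k, error):
    """Greedy k-MAX basis selection.

    coords : principal-axis coordinates of the training samples
    error  : error(basis_indices) -> deviation of the metric projection
             from the true representation (the criterion e in the text)
    """
    basis = []
    for axis in range(n_basis):
        # Candidates for axis i are the k largest projections on that axis.
        candidates = [c for c in k_max_candidates(coords, axis, k) if c not in basis]
        best = min(candidates, key=lambda c: error(basis + [c]))
        basis.append(best)
    return basis
```

The k-NN variant would instead gather the nearest neighbors of the two extreme points on each axis before applying the same error test.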
Figure 4.6 shows how the value of e(b_i), i = 1, ..., 16, changed as the number of basis
samples increased for both selection methods. In this figure, several important characteristics
of the projection algorithm can be observed. If we compare the two search methods, we
can see that both methods start out with the same error. This occurred because the same
sample was selected by both methods at the first dimension. However, after this point the
k-MAX method consistently found samples which resulted in a lower error (except at the
third dimension where e was the same). This difference indicates that at certain points in
the greedy search, some of the samples yielding the maximum projections were not among
the k-nearest neighbors of the endpoints. (If they had been, they would have been selected
and no differences in the two approaches would have been evident.) In such cases, we see
that the k-MAX approach leads to more reliable projections than the k-NN approach. The
likely reason for this is that the vectors in the k-MAX set have a greater chance of being
closer (in terms of distance) to the true subspace spanned by the leading
principal axes than the k-NN set. That is because the k-MAX set has the greatest influence
on the direction of the principal axis (being considered) since these samples make the largest
contributions to the variance on the axis. The diagonalization process tries to maximize
this variance; thus, it is best for it to position the axis as close as possible to these samples.
If one looks closely at the derivation of the projection formula (see 2.5), one can see why
closeness to the true subspace is important. When the basis samples do not lie precisely
on the true subspace (that spanned by the eigenvectors), they span a space which has a
certain (affine) angle to the true subspace. The greater the distance that any basis sample
has to the true subspace, the greater the angle becomes and this in turn translates to higher
projection errors.
Another interesting property of our projection algorithm which is evident in figure 4.6 is
the way the error changes as more basis samples are added to the projection formula. One
can see that the error tends to get smaller and level off as the dimensions increase, and at
some dimensions the error even drops. For instance, going from dimension 7 to dimension
8 using the k-MAX search method, we can see that the error drops from approximately 125
to 115. Clearly, for this to happen we must have gained some improvements in the vector
representation at the lower dimensions. An extreme example showing the improvements in
the lower dimensions is illustrated in figure 4.7 and figure 4.8. In these two figures, which
relate to the k-MAX and k-NN search methods respectively, the projections along the two
leading principal axes are shown when two basis samples (the upper graph) and 16 basis
samples (the lower graph) are used in the projection formula. If we compare these figures
with figure 4.2, we not only see the extent of the improvements, we also see that our 16-d
projection is a very good approximation of the original representation (in 2-d).
Another way to see the improvements that added basis samples can have on the projections
is by examining the between-class separation B and the representation error E. This
is shown in figure 4.10 and figure 4.9 respectively for both projection methods used and also
the original vector representation. First of all, we can see that in both projection methods,
Figure 4.7: The k-MAX projections along the 2 leading principal axes when (a) 2 basis samples and (b) 16 basis samples are used in the projection formula.
Figure 4.8: The k-NN projections along the 2 leading principal axes when (a) 2 basis samples and (b) 16 basis samples are used in the projection formula.
[Plot: "Projection Error (E)" vs. dimension for the DTW representation, the k-MAX projections, and the k-NN projections]
Figure 4.9: The sum of squared errors of the projection for the leading 16 dimensions.
these criteria measurements incrementally approach the original vector representation as
the number of dimensions increases. However, consistent with figure 4.6, we see that the
k-MAX method is able to get much closer to the true representation. One can also see that
the k-MAX method essentially converges to the ideal representation once the dimension
approaches the intrinsic dimension of the metric space. Thus, even though the projection
error e is nonzero, our projection algorithm is still able to generate a representation which
preserves most of the original interdistance matrix. The metric projection error of around
115 (see figure 4.6) occurs most likely because the entire representation is slightly "tilted"
from the subspace spanned by the eigenvectors.
4.5 Analysis of the Recognition Performance
4.5.1 The Metric Approach vs. the DTW KNN Classifier
In the final section of this chapter we present the results of our recognition tests. The
tests were set up using three different vector classifiers as described in the previous chapter.
All projections were calculated using the basis samples selected from the k-MAX search
method because that method gave the best results as was shown in the previous section.
[Plot: "Projection Class Separation M(V)" vs. dimension]
Figure 4.10: The average between-class distance of the projection for the leading 16 dimensions.
To evaluate the recognition results at different dimensions, we incremented the number of
basis samples in the projection formula from 2 up to 16 (in increments of 2) using the
strategy in the previous section. In the case of the MLP classifier, all of the learning
parameters described in the previous chapter were held fixed at each dimension considered
during training. Consequently, we could not get the network to converge for dimension sizes
lower than 6.
Our recognition tests were done using both the speaker dependent and speaker independent
test samples described at the beginning of this chapter. (Note that neither of these sample
sets was used in the training process.) A 2-d plot of these two sample sets w.r.t. the
leading two basis samples is shown in figure 4.11(a) and figure 4.11(b) respectively. These
projections were obtained using all 16 basis samples in the projection formula. The results
of our tests are shown in figure 4.12 and figure 4.13.
Comparing the three classifiers, we see that the MLP achieved the best recognition results
overall; however, at 16 dimensions, the results of our KNN classifier were comparable
to the MLP. Both the MLP and the vector KNN classifiers outperformed the Gaussian
classifier indicating that the decision regions were somewhat complex and/or most likely
contained outliers.

[Plot (a)/(b): 2-d projections of the two test sample sets]
Figure 4.11: 2-d projections of the (a) speaker dependent test samples and the (b) speaker independent test samples.

[Plot: "Speaker Dependent Recognition Results", accuracy (%) vs. dimensions, for the KNN, MLP, Gaussian, and DTW-KNN classifiers, with Lippmann and Gold's (L&G) results]
Figure 4.12: The recognition scores using the speaker dependent test samples.

[Plot: "Speaker Independent Recognition Results", accuracy (%) vs. dimensions, for the same classifiers]
Figure 4.13: The recognition scores using the speaker independent test samples.

The main result in figures 4.12 and 4.13, however, is that at 16 dimensions
(the intrinsic dimension of our DTW metric space), our system was able to achieve
classification results comparable to the brute-force k-nearest-neighbor classifier in the DTW
metric space. What is clearly impressive about our approach is that we achieved these
results using nearly 95% fewer distance computations (using 16 basis samples and one origin
sample instead of 320 samples). Additional processing time was required to make a
classification decision; however, in the case of the MLP most of the operations could be done
in parallel, and thus in a real implementation this would represent a very small portion of
the total computation time (i.e., the activity at each layer can be computed in parallel and
this only needs to be done iteratively for 3 layers). The projections can also be determined
in a parallel fashion because each of the 17 distances can be computed simultaneously and
furthermore, if the DTW procedure were implemented with systolic arrays (as described in
chapter 3), then most of the operations in the DTW functions could also be computed in
parallel.
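For reference, the core dynamic-programming recurrence behind a DTW distance can be sketched as follows. This is the textbook symmetric form with unweighted local steps; the distance function actually used in this thesis (chapter 3) adds slope constraints and path weights that are omitted here, and the names are our own.

```python
import math

def dtw_distance(a, b, frame_dist=math.dist):
    """Dynamic-programming DTW between two frame sequences a and b.

    Local path moves are insertion, deletion, and match, each weighted
    by the frame distance at the matched cell."""
    INF = float("inf")
    n, m = len(a), len(b)
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_dist(a[i - 1], b[j - 1])
            # Best warping path into cell (i, j).
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Identical sequences, and sequences that differ only by repeated (time-stretched) frames, both yield a distance of zero, which is exactly the time-alignment property the recognizer relies on.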
Unlike the approach taken by Vidal et al. [45], our reduction in the number of distance
computations was achieved without incurring any significant increase in the memory
requirement. For the 16-128-16-8 MLP, a total of 2(16 × 128) + 128 = 4224 weights and (17
samples × 45 frames × 12 cepstral coefficients) = 9180 spectral features (numbers) needed
to be stored in the classifier. But in the DTW KNN classifier, a total of (320 samples ×
540 cepstral coefficients) = 172,800 spectral features was needed. (In Vidal et al.'s approach,
an additional 319 × 318 numbers would also be needed in the classifier to store the distance
matrix.) Thus, we also achieved significant savings in memory (of nearly 90% when
compared to the DTW KNN classifier).
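The 95% figure can be checked directly: each query needs one DTW distance to the origin sample and one to each of the 16 basis samples, versus one distance to each of the 320 training samples in the brute-force classifier.

```python
full_pass = 320            # DTW distances per query in the brute-force KNN classifier
per_query = 16 + 1         # 16 basis samples plus one origin sample
reduction = 1 - per_query / full_pass
print(f"{reduction:.1%}")  # → 94.7%, i.e. nearly 95%
```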
It is also useful to compare the results of our metric KNN classifier with those of the
DTW KNN classifier. By comparing the recognition scores of these two classifiers at varying
number of dimensions, we can get a good idea of how much meaningful information is
actually lost in the (projected) vector representation at the lower dimensions. As one
can see in both figure 4.12 and figure 4.13, as the number of dimensions increases the
recognition scores quickly begin to approach that of the DTW KNN classifier. The rate at
which this convergence occurs is consistent (inversely) with the representation errors shown
in figure 4.9, and, to some extent, the projection class separation shown in figure 4.10.
This close relationship means that one can approximate what percentage of the DTW KNN
classification results one is likely to achieve at lower dimensions, by scaling with either
the separation measure or the representation error measure. It is not surprising to see at
16 dimensions the KNN vector classifier achieving results comparable to the DTW KNN
classifier because, as shown in figure 4.9, there is essentially no error in the interdistance
matrix at this dimension.
4.6 The Metric MLP vs. Lippmann and Gold's MLP
One can gain further insight into the advantages of the metric approach by comparing the
results of our MLP with results obtained by Lippmann and Gold [32]. This is an interesting
comparison because in both cases the same network architecture, learning procedure, speech
database, vocabulary, and spectral analysis were used. In fact, the only difference between
the two approaches was the way the inputs were determined. In the Lippmann and Gold
case, no time alignment analysis was done. Instead, the frame of cepstral coefficients which
had the largest energy value was used directly along with the neighboring frame. Since
there were 11 cepstral coefficients in each frame, a total of 22 inputs were used as inputs
to the MLP. In figure 4.13 the results of this approach can be compared to ours for all 3
classifiers. One can clearly see that the vectors generated with the projection formula lead
to significantly better performance on all of the classifiers (in a much lower dimension).
This is not surprising since the Euclidean vectors in our approach had DTW "features"
inherently incorporated into the representation.
The benefits of using DTW projections can also be seen in terms of the learning time
of the MLP. In figure 4.14, we show how the learning time changed as more basis samples
were added to the projections. We have also included the learning time that was required
by Lippmann and Gold. One can see, first of all, that the learning time dropped as more
basis samples were added to the projections. The gains in the learning speed are likely due
to the improvements in the class separation (see figure 4.10). We can also see that even at 6
dimensions, our learning time was significantly faster than Lippmann and Gold's time, using
about one quarter as many inputs. This again reflects the benefits of incorporating DTW
information into the representation. Although not surprising, our results on the learning
time are important because they clearly show that one way to reduce the complexity of
learning in a MLP is to preprocess the data using a metric function in a manner which
separates the classes as much as possible. This is particularly important for input patterns
as in speech, which contain a lot of discriminating information in the structural organization
of the primitives. Because it is difficult to directly model structural relationships in the
inputs of a MLP, learning can be inherently hard. As we have demonstrated, a key strength
of the metric approach is that it is capable of incorporating relevant structural features into
the representation in a highly efficient manner.

[Plot: "MLP Learning Speed", training epochs vs. dimensions, for the DTW representation and the L&G representation]
Figure 4.14: The learning time required by the MLP neural net at different dimensions.
Chapter 5
Discussion and Conclusions
In this thesis, we have demonstrated that the metric approach to pattern recognition can
facilitate a highly accurate and efficient implementation of an isolated word recognition
system. In our evaluation of this system, we showed that results comparable to the brute
force KNN classifier in a DTW metric space could be achieved using substantially fewer
distance computations. Furthermore, unlike other fast approaches based on the DTW
metric, we required only a small set of training samples to be stored in the classifier. Testing
the system on monosyllabic digits from the TI 20 word database, we observed a reduction
of almost 953 in the total number of distance computations and a total storage reduction
of almost 903 when compared to the DTW KNN classifier.
Efficiency in this approach was achieved by mapping the speech samples from a DTW
metric space into a low dimensional pseudo-Euclidean vector space in a manner which
preserved the pairwise distances. By representing speech this way, it was possible to store only
the decision regions of each class in the classifier using parametric classification methods.
Among the parametric methods we tested, we found the MLP gave the best recognition
performance. We also saw that incorporating DTW features into the vector representation
led to improvements in the MLP's learning time. The MLP is an attractive classifier
because it can deal with outliers efficiently, and because most of the operations in the classifier
can be done in parallel. Further parallelism could be incorporated into our system if we
used systolic arrays to compute the DTW distances in the projection formula.
The vector representation was obtained in two stages. The first stage involved a
transformation of a finite DTW metric space onto the principal axes of a pseudo-Euclidean vector
space. This was done using an embedding algorithm which essentially required the
orthogonal diagonalization of a symmetric matrix. Obtaining a vector representation with this
approach has several important advantages over the multidimensional scaling approach.
Firstly, it is possible to get an isometric representation, and this representation can be
constructed in a dimension size which is guaranteed to be minimal. Secondly, because the
main computational task involves the diagonalization of a symmetric matrix, there are no
convergence problems with the embedding algorithm. In fact, diagonalization algorithms
for this problem are some of the most stable numerical methods available. Finally, the
embedding process can incrementally generate the vector representation w.r.t. the leading
principal axes. Thus, the accuracy vs. dimensionality trade-off can be analyzed iteratively
for each dimension during the construction of the representation.
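The first-stage embedding can be illustrated with a classical-MDS-style sketch: double-center the matrix of squared interdistances to obtain the symmetric matrix to be diagonalized, then read coordinates off its leading eigenpairs. The power iteration below recovers only the leading axis and is purely illustrative; the thesis's incremental algorithm, and its handling of the negative eigenvalues that make the space pseudo-Euclidean, are more involved, and all names here are our own.

```python
def gram_from_distances(D):
    """Double-center the squared interdistance matrix: the result is the
    symmetric matrix whose eigendecomposition yields the embedding."""
    n = len(D)
    sq = [[D[i][j] ** 2 for j in range(n)] for i in range(n)]
    row = [sum(sq[i]) / n for i in range(n)]
    tot = sum(row) / n
    return [[-0.5 * (sq[i][j] - row[i] - row[j] + tot) for j in range(n)]
            for i in range(n)]

def leading_axis(G, iters=200):
    """Power iteration for the leading eigenpair of G.  Coordinates on the
    first principal axis are sqrt(|eigenvalue|) times the eigenvector
    entries; a negative eigenvalue would mark a pseudo-Euclidean axis."""
    n = len(G)
    v = [1.0] + [0.0] * (n - 1)   # generic start vector
    lam = 0.0
    for _ in range(iters):
        w = [sum(G[i][j] * v[j] for j in range(n)) for i in range(n)]
        lam = sum(v[i] * w[i] for i in range(n))   # Rayleigh quotient v.Gv
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return lam, v
```

For three collinear points with interdistances 1, 2, and 3, a single axis already reproduces every distance exactly, mirroring how the leading axes of the DTW metric space capture most of the interdistance matrix.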
In our evaluation, we studied this trade-off w.r.t. the representation error and the class
separation. We found that as the dimension size increased, the representation error decayed
sharply towards zero, and the class separation quickly approached the separation in the
metric space. Both these progressions tended to level off at the same point in an exponential-
like fashion, and the changes from one dimension to the next tended to be consistent with
the changes in the magnitudes of the eigenvalues. The sharp leveling-off of the
representation error, class separation, and eigenvalue magnitudes led us to conclude that
the intrinsic dimension of the metric space was closely related (by a small constant) to the
number of classes rather than the number of training samples. This relationship is most
likely influenced by how tightly clustered the classes are. In the most favorable situation
where the distance function returns zero for samples in the same class, we would have an
upper bound of m on the dimension size for an m-class problem. For the DTW metric space
considered in this thesis, we found the intrinsic dimension to be essentially twice the number
of classes. It would be interesting to see if this intrinsic dimension could be reduced further
by employing a weighting scheme in the DTW metric which further reduced the cluster sizes
of the classes.
An interesting study worth pursuing along these lines is to see how changes in the scatter
ratio of the metric space affect the intrinsic dimensionality. One way this study could be
done is by optimizing a weighting scheme in the DTW metric w.r.t. the scatter ratio. Then,
one could study the vector representation of the metric space as the weights incrementally
approached the optimal scatter ratio. The weighting scheme which produced the optimal
scatter ratio would likely highlight interesting features which define each class. This idea
of using parametric distance functions in the metric approach has recently been proposed
formally by Goldfarb [11] in the context of machine learning. According to this theory of
learning, a compact description of a class would be obtained by interpreting the weights at
the convergence point. This description could then be used to redesign the metric function
more efficiently, making it possible to achieve further improvements in the design of the
classifier.
In the second stage of the embedding process, the representation generated in the first
stage was used to select basis samples for the vector representation. Modeling the
representation w.r.t. basis samples made it possible to get the vector representation of an arbitrary
metric sample efficiently. For an n-dimensional representation, only n + 1 distance
computations need to be performed: one to the origin sample and one to each basis sample. This was
done using a standard projection formula which involved the solution of a simple system
of linear equations. However, instead of using inner products in the formula, we used the
DTW distances to the basis samples. This worked mathematically because a key theorem
by Goldfarb assures us that there exists a well-defined relationship (mapping) between the
(DTW) distances in a metric space and the inner products in the pseudo-Euclidean vector
space.
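The substitution of distances for inner products rests on the polarization identity <x - o, b - o> = (d(x, o)^2 + d(b, o)^2 - d(x, b)^2) / 2. The sketch below illustrates the mechanics in the plain Euclidean case; the names and the toy Gaussian-elimination solver are our own, and the pseudo-Euclidean version would additionally carry the signature of the space through the Gram matrix.

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def metric_projection(d, sample, origin, basis):
    """Coordinates of `sample` w.r.t. the basis vectors (b_i - origin).

    Only n + 1 evaluations of the metric d involve the new sample (one to
    the origin, one per basis sample); the basis Gram matrix could be
    precomputed once."""
    ip = lambda u, v: 0.5 * (d(u, origin) ** 2 + d(v, origin) ** 2 - d(u, v) ** 2)
    G = [[ip(bi, bj) for bj in basis] for bi in basis]   # <b_i - o, b_j - o>
    g = [ip(sample, bi) for bi in basis]                 # <x - o, b_i - o>
    return solve(G, g)
```

With d the ordinary Euclidean distance, projecting the point (3, 5) onto the basis {(1, 0), (1, 1)} with origin (0, 0) recovers the coefficients (-2, 5), since (3, 5) = -2(1, 0) + 5(1, 1); in our system d would instead be the DTW distance.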
We proposed a greedy search algorithm to select the basis samples. Our analysis of this
algorithm led to some important observations. First, we saw that the selection method
used to pick the basis samples was important. By using the statistical properties of the
principal axis representation intelligently, we found more accurate projections could be
achieved. Secondly, we observed improvements in the accuracy of the projections as more
basis samples were added to the projection formula. Finally, the projections tended to
converge towards the original representation as we approached the intrinsic dimension of
the metric space.
A relatively small number of classes were used in our study. Thus it remains to be seen
how well the metric approach will scale on larger problems. Even if there is a close
relationship between the number of classes and the intrinsic dimensionality of a metric space, our
current approach is likely to become unmanageable for problems involving large vocabulary
sizes. But instead of solving a large problem in one stage, we could use the metric
approach in several stages, partitioning the problem into manageable subclasses at each
stage, as was suggested by Goldfarb and Verma [12]. In this approach, a hierarchical
decision tree would be used to organize the speech data efficiently. Starting at the root node,
different samples from a large vocabulary would be selected and vector representations of
these samples would be obtained as was done in our implementation. Then a coarse cluster
analysis would be done to determine groups of similar sounding words. To keep the
dimension size low, the data would be partitioned into a small set of clusters which would most
likely reflect very high-level features. The decision regions of these clusters could then be
determined and stored using a MLP or some other parametric classification method. Each
cluster would correspond to a node in the decision tree. At the next level, a similar analysis
would be done separately on each partition (i.e., cluster); however, the cluster analysis would
be tuned to look for much finer features. This would be done hierarchically, until a node
in the tree contained a manageable set of primitive classes (e.g., 10 classes). At
this point, we would have a problem similar to the one considered in this thesis. Using this
approach the entire search complexity could be reduced by a log factor. Note also that the
storage requirements are reasonable because at each node we only have to store a small set
of basis samples and the decision regions (using a MLP for example). In this
implementation, the path to a classification decision involves a traversal through increasingly smaller
groups of similar sounding words until one finally focuses upon an understandable word at
a leaf node.
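The traversal through such a tree is simple to sketch; everything below is hypothetical structure, since no hierarchical system was implemented in this thesis. Each internal node would hold a coarse classifier (in practice, a small MLP over that node's own metric projection) that routes a sample to one child, and each leaf holds a flat classifier over its small set of words.

```python
class Node:
    """One node of the hypothetical hierarchical recognizer."""
    def __init__(self, classify, children=None):
        self.classify = classify          # sample -> child index, or word label at a leaf
        self.children = children or []    # an empty list marks a leaf

def recognize(node, sample):
    # Walk from the root toward increasingly fine clusters; the number of
    # classifier evaluations grows with the tree depth, i.e. logarithmically
    # in the vocabulary size rather than linearly.
    while node.children:
        node = node.children[node.classify(sample)]
    return node.classify(sample)
```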
Note that the hierarchical decision tree described above could be used to organize and
search any database containing complex patterns, not just speech utterances. One particular
application where this approach may be well suited is in the area of DNA and RNA sequence
analysis. By their very nature, different genetic sequences are organized hierarchically into
subclasses in a manner reflecting the evolutionary distances between the classes. These
distances are currently used in very inefficient ways to solve classification problems (see
[43]). The metric approach offers great potential to reduce not only the search complexity,
but also the massive memory requirements in such applications.
A final issue which was not explicitly considered in this thesis is whether one could
achieve comparable results to our system using other forms of vector representations based
on the DTW function. In the pseudo-Euclidean representation, one aims to preserve the
original DTW distances in the vector representation; however, one could also use the DTW
distance information to construct another type of vector representation which would not
necessarily preserve interdistances. One interesting possibility suggested by Hinton [16] is a
Gaussian Radial Basis Function [40, 29]. In this representation, a vector would be formed by
first computing the squared DTW distances to a set of reference samples, and then applying
a Gaussian function to these distances. This representation has the useful property of pulling
neighboring samples to each reference sample away from the rest of the sample points.
Thus, provided that one selects at least one reference sample from each class, it is possible
to create a vector representation in a Euclidean space which separates the classes, but does
not preserve the interdistances. However, if one selects reference samples which lie too close
to the boundary of two or more classes, then good class separation will not be achieved in
the corresponding dimensions. Because the selection of useful samples is very difficult, most
implementations use large numbers of randomly selected basis samples. This highlights an
important advantage of our metric representation over the RBF representation: in the
metric approach, there is a systematic and efficient method for picking the dimension size
and basis samples. Despite the difficulties of picking samples in the RBF representation,
a meaningful comparison could still be made in terms of the recognition performance by
simply using the basis samples from our projection algorithm as the reference samples in
the RBF representation. Such a comparison would at least show whether the preservation
of interdistances was essential for the good results which we observed, or whether other
methods using DTW but not preserving interdistances might perform comparably well. We
plan to provide the results of this experiment in a forthcoming technical report.
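The RBF construction described above can be sketched in a few lines. This is a hedged illustration, not an implementation from the thesis: the `toy_dtw` function is a trivial stand-in for a real DTW distance, and the width parameter `sigma` is an assumed free parameter.

```python
import math

def rbf_representation(sample, references, dtw_distance, sigma=1.0):
    """Map a pattern to a vector of Gaussian activations, one coordinate
    per reference sample, using squared distances as in a Gaussian RBF."""
    return [math.exp(-dtw_distance(sample, r) ** 2 / (2 * sigma ** 2))
            for r in references]

# Toy stand-in for DTW: absolute difference of sequence sums. A real
# system would substitute the actual DTW distance function here.
def toy_dtw(a, b):
    return abs(sum(a) - sum(b))

refs = [[0.0, 0.0], [5.0, 5.0]]   # ideally one reference per class
vec = rbf_representation([0.5, 0.2], refs, toy_dtw, sigma=2.0)
# A sample near a reference yields an activation near 1 in that
# coordinate and near 0 elsewhere, so the classes separate even though
# the original interdistances are not preserved.
```

Running the projection-algorithm basis samples through `references` in place of random selections would realize the comparison proposed above.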
Bibliography
[1] E. Baydal, G. Andreu, and E. Vidal. Estimating the intrinsic dimensionality of discrete utterances. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(5):755-757, May 1989.
[2] V.C. Bhavsar, T.Y.T. Chan, and L. Goldfarb. Parallel implementations for the metric approach to pattern recognition. In Proceedings of the 1985 IEEE Computer Society Workshop on Computer Architectures for Pattern Analysis and Image Database Management, pages 126-136, 1985.
[3] Victor Bryant. Metric Spaces: Iteration and Application. Cambridge University Press, 1985.
[4] F. Casacuberta, E. Vidal, and H. Rulot. On the metric properties of dynamic time warping. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(11):1631-1633, November 1987.
[5] R.O. Duda and P.E. Hart. Pattern Classification And Scene Analysis. Wiley and Sons, 1973.
[6] D. Feustel and L. G. Shapiro. The nearest neighbor problem in an abstract metric space. Pattern Recognition Letters, 1:125-128, 1982.
[7] A. P. French. Special Relativity. W.W. Norton & Company, New York, 1968.
[8] K.S. Fu. Syntactic Methods in Pattern Recognition. Academic Press, 1974.
[9] L. Goldfarb. A new approach to pattern recognition. Pattern Recognition, 17(5):575-82, 1983.
[10] L. Goldfarb. A new approach to pattern recognition. In L. Kanal and A. Rosenfeld, editors, Progress in Pattern Recognition 2, pages 241-402. North-Holland, 1985.
[11] L. Goldfarb. On the foundations of intelligent processes - i. an evolving model for pattern learning. Pattern Recognition, 23(6):595-616, 1990.
[12] L. Goldfarb and R. Verma. Hybrid associate memories and metric data models. In Digital and Optical Shape Representation and Pattern Recognition, Orlando, FL, 1988. SPIE 1988 Technical Symposium on Optics, Electro-Optics, and Sensors.
[13] G. H. Golub and C. F. Van Loan, editors. Matrix Computations. Johns Hopkins University Press, Baltimore, 1983.
[14] W. Greub. Linear Algebra. Springer, 1974.
[15] V. N. Gupta, M. Lennig, and P. Mermelstein. Decision rules for speaker-independent isolated word recognition. In Proceedings of the International Conf. on ASSP, pages 9.2.1-9.2.4, 1984.
[16] G. Hinton. Personal Communication.
[17] G. E. Hinton. Learning translation invariant recognition in a massively parallel network. In Proc. Conf. Parallel Architectures and Languages Europe, Eindhoven, The Netherlands, 1987.
[18] E. Horowitz and S. Sahni. Fundamentals of Computer Algorithms. Computer Science Press, 1984.
[19] W. Y. Huang and R. P. Lippmann. Comparisons between neural net and conventional classifiers. In Proceedings ICNN, San Diego, June 1987.
[20] M. I. Jordan. An introduction to linear algebra in parallel distributed processing. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume I. Bradford Books, Cambridge, MA, 1986.
[21] B. H. Juang, L. R. Rabiner, and J. G. Wilpon. On the use of bandpass liftering in speech recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 35(7):947-953, July 1987.
[22] J. B. Kruskal. Nonmetric multidimensional scaling: A numerical method. Psychometrika, 29:115-129, June 1964.
[23] J. B. Kruskal and M. Wish. Multidimensional Scaling. Sage Publications, London, 1978.
[24] G. R. Leonard. A database for speaker-independent digit recognition. In Proceeding of ICASSP-84, pages 42.11.1-42.11.4, San Diego, CA, 1984.
[25] S. E. Levinson, L. R. Rabiner, A. E. Rosenberg, and J. G. Wilpon. Interactive clustering techniques for selecting speaker independent reference templates for isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(2):134-141, April 1979.
[26] J. L. Liang and V. H. Clarson. A new approach to classification of brainwaves. Pattern Recognition, 22(6):767-774, 1989.
[27] J. Makhoul. Linear prediction: A tutorial review. Proc. IEEE, 63:561-580, 1975.
[28] C. Myers, L. R. Rabiner, and A. E. Rosenberg. Performance tradeoffs in dynamic time warping algorithms for isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 28(6):623-635, December 1980.
[29] M. Niranjan and F. Fallside. Neural networks and radial basis functions in classifying static speech patterns. Cambridge University Engineering Dept. technical report CUED/F-INFENG/TR.22, 1988.
[30] E. Oja. Subspace Methods of Pattern Recognition. Research Studies Press, 1983.
[31] R. P. Lippmann. An introduction to computing with neural nets. IEEE ASSP Magazine, 4(2):4-22, April 1987.
[32] R. P. Lippmann and B. Gold. Neural-net classifiers useful for speech recognition. In Proceedings ICNN, San Diego, June 1987.
[33] T. Pavlidis. Structural Pattern Recognition. Springer-Verlag, 1977.
[34] Pettis et al. An intrinsic dimensionality estimator from nearest-neighbor information. IEEE Trans. Pattern Anal. and Machine Intell., PAMI-1:41, 1979.
[35] L. R. Rabiner. Digital Processing of Speech Signals. Prentice-Hall, 1978.
[36] L. R. Rabiner. On creating reference templates for speaker independent recognition of isolated words. IEEE Transactions on Acoustics, Speech and Signal Processing, 26(1):34-42, February 1978.
[37] L. R. Rabiner and S. E. Levinson. Isolated and connected word recognition-theory and selected applications. IEEE Transactions on Communications, 29(5):621-659, May 1981.
[38] L. R. Rabiner, S. E. Levinson, A. E. Rosenberg, and J. G. Wilpon. Speaker independent recognition of isolated words using clustering techniques. IEEE Transactions on Acoustics, Speech and Signal Processing, 27(4):336-349, August 1979.
[39] L. R. Rabiner and J. G. Wilpon. Considerations in applying clustering techniques to speaker independent word recognition. J. Acoust. Soc. Amer., 66(3):134-141, September 1979.
[40] S. Renals and R. Rohwer. Phoneme classification experiments using radial basis functions. Proceedings of IEEE ICASSP-89, pages I-461-I-466, 1989.
[41] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume I. Bradford Books, Cambridge, MA, 1986.
[42] H. Sakoe and S. Chiba. A dynamic programming approach to continuous speech recognition. In Proceedings of the International Congress on Acoustics, Budapest, Hungary, 1971.
[43] D. Sankoff and J. B. Kruskal, editors. Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison. Addison-Wesley, 1983.
[44] E. Vidal, F. Casacuberta, et al. On the verification of the triangle inequality by DTW dissimilarity measures. Speech Communication, 7(1):67-79, November 1988.
[45] E. Vidal, H. M. Rulot, F. Casacuberta, and J. Benedi. On the use of a metric-space search algorithm (AESA) for fast DTW-based recognition of isolated words. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(5):651-660, May 1988.
[46] T. K. Vintsyuk. Speech discrimination by dynamic programming. Cybernetics, 4(1):52-57, 1968.
[47] S. Watanabe. Pattern Recognition: Human and Mechanical. Wiley and Sons, 1985.
[48] J. R. Welch. Combination of linear and nonlinear time normalization for isolated word recognition. J. Acoust. Soc. Amer., 67, 1980.
[49] G. M. White and R. B. Neely. Speech-recognition experiments with linear prediction, bandpass filtering, and dynamic programming. IEEE Transactions on Acoustics, Speech and Signal Processing, 24:183-188, 1976.
[50] J. G. Wilpon and L. R. Rabiner. A modified k-means clustering algorithm for use in isolated word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 33(3):587-594, June 1985.
[51] M. Wish and J. D. Carroll. Multidimensional scaling and its applications. In P. Krishnaiah and L. N. Kanai, editors, Handbook of Statistics 2: Classification, Pattern Recognition and Reduction of Dimensionality. North-Holland, 1982.