Nonlinear Dynamical Invariants for Speech Recognition S. Prasad, S. Srinivasan, M. Pannuri, G. Lazarou and J. Picone Department of Electrical and Computer Engineering Mississippi State University URL: http://www.ece.msstate.edu/research/isip/publications/conferences/interspeech/2006/dynamical_invariants/


Page 1: Nonlinear Dynamical Invariants for Speech Recognition

Nonlinear Dynamical Invariants

for Speech Recognition

S. Prasad, S. Srinivasan, M. Pannuri, G. Lazarou and J. Picone
Department of Electrical and Computer Engineering

Mississippi State University
URL: http://www.ece.msstate.edu/research/isip/publications/conferences/interspeech/2006/dynamical_invariants/

Page 2: Nonlinear Dynamical Invariants for Speech Recognition

Page 2 of 23: Nonlinear Dynamical Invariants for Speech Recognition

• State-of-the-art speech recognition systems relying on linear acoustic models suffer from robustness problems.

• Our goal: to study and use new features for speech recognition that do not rely on traditional measures of the first- and second-order moments of the signal.

• Why nonlinear features?

Nonlinear dynamical invariants may be more robust (invariant) to noise.

Speech signals have both periodic-like and noise-like segments – similar to chaotic signals arising from nonlinear systems.

The motivation behind studying such invariants: to capture the relevant nonlinear dynamical information from the time series – something that is ignored in conventional spectral analysis.

Motivation

Page 3: Nonlinear Dynamical Invariants for Speech Recognition


Attractors for Dynamical Systems

• System Attractor: Trajectories approach a limit with increasing time, irrespective of the initial conditions within a region

• Basin of Attraction: Set of initial conditions converging to a particular attractor

• Attractors: non-chaotic (point, limit cycle, or torus), or chaotic (strange attractors)

• Example: point and limit cycle attractors of a logistic map (a discrete nonlinear chaotic map)
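The logistic-map example can be made concrete numerically; a minimal sketch (the helper name `logistic_orbit` and the specific values of r are my own choices, not from the slides) showing a point attractor and a period-2 limit cycle:

```python
def logistic_orbit(r, x0=0.2, n_transient=500, n_keep=8):
    """Iterate the logistic map x_{n+1} = r * x_n * (1 - x_n),
    discard the transient, and return the orbit the trajectory settles onto."""
    x = x0
    for _ in range(n_transient):
        x = r * x * (1 - x)
    orbit = []
    for _ in range(n_keep):
        x = r * x * (1 - x)
        orbit.append(round(x, 6))
    return orbit

# r = 2.8: every value equals the fixed point 1 - 1/r (a point attractor).
print(logistic_orbit(2.8))
# r = 3.2: values alternate between two levels (a period-2 limit cycle).
print(logistic_orbit(3.2))
```

Different initial conditions x0 in (0, 1) land on the same orbit, which is exactly the basin-of-attraction idea above.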

Page 4: Nonlinear Dynamical Invariants for Speech Recognition


Strange Attractors

• Strange Attractors: attractors whose shapes are neither points nor limit cycles. They typically have a fractal structure (i.e., they have dimensions that are not integers but fractional)

• Example: a Lorenz system with parameters σ = 16.0, r = 40.0, and b = 4.0

Page 5: Nonlinear Dynamical Invariants for Speech Recognition


Characterizing Chaos

• Exploit geometrical (self-similar structure) aspects of an attractor or the temporal evolution for system characterization

• Geometry of a Strange Attractor:

Most strange attractors show a similar structure at various scales, i.e., parts are similar to the whole.

Fractal dimensions can be used to quantify this self-similarity.

e.g., Hausdorff, correlation dimensions.

• Temporal Aspect of Chaos:

Characteristic exponents or Lyapunov exponents (LEs) capture the rate of divergence (or convergence) of nearby trajectories;

the correlation entropy also captures similar information.

• Any characterization presupposes that the phase space is available.

What if only one scalar time series measurement of the system (and not its actual phase space) is available?

Page 6: Nonlinear Dynamical Invariants for Speech Recognition


Reconstructed Phase Space (RPS): Embedding

• Embedding: A mapping from a one-dimensional signal to an m-dimensional signal

• Takens' Theorem:

Can reconstruct a phase space “equivalent” to the original phase space by embedding with m ≥ 2d+1 (d is the system dimension)

• Embedding Dimension: m ≥ 2d + 1 is a theoretically sufficient bound; in practice, embedding with a smaller dimension is often adequate.

• Equivalence:

means the system invariants characterizing the attractor are the same

does not mean reconstructed phase space (RPS) is exactly the same as original phase space

• RPS Construction: techniques include differential embedding, integral embedding, time delay embedding, and SVD embedding

Page 7: Nonlinear Dynamical Invariants for Speech Recognition


Reconstructed Phase Space (RPS): Time Delay Embedding

• Uses delayed copies of the original time series as components of RPS to form a matrix

m: embedding dimension, τ: delay parameter

• Each row of the matrix is a point in the RPS

$$X = \begin{bmatrix}
x_0 & x_\tau & \cdots & x_{(m-1)\tau} \\
x_1 & x_{1+\tau} & \cdots & x_{1+(m-1)\tau} \\
x_2 & x_{2+\tau} & \cdots & x_{2+(m-1)\tau} \\
\vdots & \vdots & & \vdots
\end{bmatrix}$$
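Time delay embedding maps directly to code; a minimal sketch (the helper name `time_delay_embed` is my own, not from the presentation):

```python
import numpy as np

def time_delay_embed(x, m, tau):
    """Return the RPS matrix X whose n-th row is the delay vector
    [x_n, x_{n+tau}, ..., x_{n+(m-1)tau}]."""
    x = np.asarray(x)
    n_points = len(x) - (m - 1) * tau
    if n_points <= 0:
        raise ValueError("time series too short for this (m, tau)")
    return np.column_stack([x[i * tau: i * tau + n_points] for i in range(m)])

# Each row of X is one point in the m-dimensional reconstructed phase space.
x = np.sin(2 * np.pi * np.arange(100) / 25)   # toy periodic signal
X = time_delay_embed(x, m=3, tau=10)
print(X.shape)   # (80, 3)
```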

Page 8: Nonlinear Dynamical Invariants for Speech Recognition


Reconstructed Phase Space (RPS)

Time Delay Embedding of a Lorenz time series

Page 9: Nonlinear Dynamical Invariants for Speech Recognition


Lyapunov Exponents

• Quantifies the separation in time between trajectories, assuming the rate of growth (or decay) is exponential in time, as:

$$\lambda_i = \lim_{n \to \infty} \frac{1}{n} \ln \left| \operatorname{eig}_i \left( \prod_{p=0}^{n-1} J(p) \right) \right|$$

where $J(p)$ is the Jacobian matrix at point $p$.

• Captures sensitivity to initial conditions.

• Analyzes the separation in time of two trajectories with close initial points:

$$dx(t) = \frac{d f^{N}}{dx}\bigg|_{x(0)} \, dx(0)$$

where $f$ is the system's evolution function.
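For a one-dimensional map the Jacobian is just the scalar f'(x), so the largest Lyapunov exponent reduces to the average of ln|f'(x_n)| along the orbit. A sketch for the logistic map (the function name and settings are my own; for r = 4 the exact value is ln 2):

```python
import math

def lyapunov_logistic(r, x0=0.3, n=100000, transient=1000):
    """Largest Lyapunov exponent of the logistic map: average the log of the
    local Jacobian, f'(x) = r * (1 - 2x), along a long trajectory."""
    x = x0
    for _ in range(transient):
        x = r * x * (1 - x)
    total = 0.0
    for _ in range(n):
        total += math.log(abs(r * (1 - 2 * x)))
        x = r * x * (1 - x)
    return total / n

print(lyapunov_logistic(4.0))   # positive (chaotic), close to ln 2 ~ 0.693
print(lyapunov_logistic(2.8))   # negative (trajectories converge to a point)
```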

Page 10: Nonlinear Dynamical Invariants for Speech Recognition


Correlation Integral

• Measures the number of points within a neighborhood of radius $\varepsilon$, averaged over the entire attractor, as:

$$C(\varepsilon) = \frac{2}{N(N-1)} \sum_{i=1}^{N} \sum_{j=i+1}^{N} \Theta\left(\varepsilon - \lVert x_i - x_j \rVert\right)$$

where the $x_i$ are points on the attractor (which has $N$ such points) and $\Theta$ is the Heaviside step function.

• Theiler's correction: used to prevent temporal correlations in the time series from producing an underestimated dimension.

• The correlation integral is used in the computation of both the correlation dimension and the Kolmogorov entropy.
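The correlation integral, with Theiler's correction, can be computed by brute force over point pairs. A minimal sketch (the helper name `correlation_integral` is my own); with a nonzero Theiler window the count is normalized by the number of pairs actually retained rather than N(N-1)/2:

```python
import numpy as np

def correlation_integral(X, eps, theiler=0):
    """Fraction of point pairs (i, j), with j > i + theiler, whose distance
    is below eps. Skipping temporally close pairs (Theiler's correction)
    keeps sampling-rate correlations from biasing the estimate."""
    N = len(X)
    count, pairs = 0, 0
    for i in range(N):
        for j in range(i + 1 + theiler, N):
            pairs += 1
            if np.linalg.norm(X[i] - X[j]) < eps:
                count += 1
    return count / pairs

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))        # points filling a 2-D square
print(correlation_integral(X, 0.1))
```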

Page 11: Nonlinear Dynamical Invariants for Speech Recognition


Fractal Dimension

• Fractals: objects which are self-similar at various resolutions

• Correlation dimension: a popular choice for numerically estimating the fractal dimension of the attractor.

• Captures the power-law relation between the correlation integral of an attractor and the neighborhood radius of the analysis hyper-sphere, as:

$$D = \lim_{\varepsilon \to 0} \lim_{N \to \infty} \frac{\ln C(\varepsilon)}{\ln \varepsilon}$$

where $C(\varepsilon)$ is the correlation integral.
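In practice, D is read off as the slope of ln C(ε) versus ln ε over a scaling region. A vectorized sketch (the function name and test data are my own; for points filling a plane the estimate should come out close to 2):

```python
import numpy as np

def correlation_dimension(X, eps_values):
    """Estimate D as the slope of ln C(eps) vs. ln eps over the given radii."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)[np.triu_indices(len(X), k=1)]
    C = np.array([(dists < e).mean() for e in eps_values])
    slope, _ = np.polyfit(np.log(eps_values), np.log(C), 1)
    return slope

rng = np.random.default_rng(1)
X = rng.uniform(size=(500, 2))        # a 2-D set: expect D close to 2
print(correlation_dimension(X, np.linspace(0.05, 0.2, 8)))
```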

Page 12: Nonlinear Dynamical Invariants for Speech Recognition


Kolmogorov-Sinai Entropy

• Entropy: a well-known measure used to quantify the amount of disorder in a system.

• Numerically, the Kolmogorov entropy can be estimated as the second-order Renyi entropy ($K_2$) and can be related to the correlation integral of the reconstructed attractor as:

$$C_d(\varepsilon) \sim \varepsilon^{D} \exp(-d\,\tau K_2), \quad \varepsilon \to 0$$

where $D$ is the fractal dimension of the system's attractor, $d$ is the embedding dimension, and $\tau$ is the time delay used for attractor reconstruction.

• This leads to the relation:

$$K_2 \sim \frac{1}{\tau} \lim_{\varepsilon \to 0} \lim_{d \to \infty} \ln \frac{C_d(\varepsilon)}{C_{d+1}(\varepsilon)}$$

• In a practical situation, the values of $\varepsilon$ and $d$ are restricted by the resolution of the attractor and the length of the time series.
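This relation suggests a direct numerical recipe: embed the series at dimensions d and d + 1, form the two correlation sums, and take the scaled log ratio. A rough sketch (all names and settings are my own; for a purely periodic signal the estimate should be near zero):

```python
import numpy as np

def corr_sum(Y, eps):
    """Fraction of point pairs in Y closer than eps."""
    dists = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
    return (dists[np.triu_indices(len(Y), k=1)] < eps).mean()

def k2_estimate(x, d, tau, eps):
    """K2 ~ (1/tau) * ln(C_d(eps) / C_{d+1}(eps)), with tau in samples."""
    def embed(m):
        n = len(x) - (m - 1) * tau
        return np.column_stack([x[i * tau: i * tau + n] for i in range(m)])
    return float(np.log(corr_sum(embed(d), eps) / corr_sum(embed(d + 1), eps)) / tau)

# A noiseless sine is deterministic and periodic, so K2 should be ~0.
x = np.sin(2 * np.pi * np.arange(800) / 40)
print(k2_estimate(x, d=5, tau=10, eps=0.5))
```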

Page 13: Nonlinear Dynamical Invariants for Speech Recognition


Kullback-Leibler Divergence for Invariants

• Measures discrimination information between two statistical models.

• We measured invariants for each phoneme using a sliding window, and built an accumulated statistical model over each such utterance.

• The discrimination information between a pair of models $p_i(x)$ and $p_j(x)$ is given by:

$$J(i,j) = \int_x p_i(x) \ln \frac{p_i(x)}{p_j(x)} \, dx + \int_x p_j(x) \ln \frac{p_j(x)}{p_i(x)} \, dx$$

• $J(i,j)$ provides a symmetric divergence measure between two populations from an information-theoretic perspective.

• We use $J$ as the metric for quantifying the amount of discrimination information across dynamical invariants extracted from different broad phonetic classes.
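The slides do not state the form of the accumulated statistical models; assuming a single 1-D Gaussian per class (an assumption of this sketch, not of the original work), the symmetric divergence J(i, j) has a closed form:

```python
import math

def symmetric_kl_gauss(mu1, var1, mu2, var2):
    """J(i, j) = KL(p_i || p_j) + KL(p_j || p_i) for two 1-D Gaussians."""
    kl_ij = 0.5 * (math.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1)
    kl_ji = 0.5 * (math.log(var1 / var2) + (var2 + (mu2 - mu1) ** 2) / var1 - 1)
    return kl_ij + kl_ji

# e.g., hypothetical invariant distributions for two broad phonetic classes
print(symmetric_kl_gauss(0.1, 0.02, 0.9, 0.05))   # well separated: large J
print(symmetric_kl_gauss(0.1, 0.02, 0.1, 0.02))   # identical models: J = 0
```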

Page 14: Nonlinear Dynamical Invariants for Speech Recognition


Experimental Setup

• Collected artificially elongated pronunciations of several vowels and consonants from 4 male and 3 female speakers

• Each speaker produced sustained sounds (4 seconds long) for three vowels (/aa/, /ae/, /eh/), two nasals (/m/, /n/) and three fricatives (/f/, /sh/, /z/).

• The data was sampled at 22,050 Hz.

• For this preliminary study, we wanted to avoid artifacts introduced by coarticulation.

• Acoustic data to reconstructed phase space: time delay embedding with a delay of 10 samples. (This delay was selected as the first local minimum of the auto-mutual information vs. delay curve, averaged across all phones.)

• Window Size: 1500 samples.
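The delay-selection rule above (first local minimum of auto-mutual information vs. delay) can be sketched as follows; the histogram estimator, bin count, and test signal are my own choices, not the slides':

```python
import numpy as np

def auto_mutual_information(x, lag, bins=16):
    """Histogram estimate of the mutual information I(x_t ; x_{t+lag}) in nats."""
    pxy, _, _ = np.histogram2d(x[:-lag], x[lag:], bins=bins)
    pxy /= pxy.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0
    return float((pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz])).sum())

def first_minimum_delay(x, max_lag=40):
    """Embedding delay: the first local minimum of AMI vs. lag."""
    ami = [auto_mutual_information(x, lag) for lag in range(1, max_lag + 1)]
    for k in range(1, len(ami) - 1):
        if ami[k] < ami[k - 1] and ami[k] <= ami[k + 1]:
            return k + 1          # lags are 1-based
    return max_lag

t = np.arange(4000)
x = np.sin(2 * np.pi * t / 40) + 0.1 * np.random.default_rng(2).normal(size=4000)
print(first_minimum_delay(x))     # near a quarter period for a sinusoid
```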

Page 15: Nonlinear Dynamical Invariants for Speech Recognition


Experimental Setup (Tuning Algorithmic Parameters)

• Experiments were performed to optimize the parameters of the estimation algorithms, by varying each parameter and choosing the value at which the estimate converges.

• Embedding dimension for LE and correlation dimension: 5

• For Lyapunov exponent:

number of nearest neighbors: 30,

evolution step size: 5,

number of sub-groups of neighbors: 15.

•  For Kolmogorov entropy:

Embedding dimension of 15

Page 16: Nonlinear Dynamical Invariants for Speech Recognition


Tuning Results: Lyapunov Exponents

[Plots: Lyapunov exponent estimates vs. embedding dimension for vowel /ae/, nasal /m/, and fricative /sh/]

• In all three cases, the positive LE stabilizes at an embedding dimension of 5.

• Positive LE much higher for fricative than nasals and vowels.

Page 17: Nonlinear Dynamical Invariants for Speech Recognition


Tuning Results: Kolmogorov Entropy

[Plots: Kolmogorov entropy estimates vs. embedding dimension for vowel /ae/, nasal /m/, and fricative /sh/]

• For vowels and nasals: stable behavior at embedding dimensions around 12-15.

• For fricatives: Entropy estimate consistently increases with embedding dimension.

Page 18: Nonlinear Dynamical Invariants for Speech Recognition


Tuning Results: Correlation Dimension

[Plots: correlation dimension estimates for vowel /ae/, nasal /m/, and fricative /sh/]

• For vowels and nasals: clear scaling region at ε = 0.75; less sensitive to variations in embedding dimension from 5 to 8.

• For fricatives: No clear scaling region; more sensitive to variations in embedding dimension.

Page 19: Nonlinear Dynamical Invariants for Speech Recognition


Experimental Results : KL Divergence - LE

• Discrimination information for:

vowels-fricatives: higher

nasals-fricatives: higher

vowels-nasals: lower

Page 20: Nonlinear Dynamical Invariants for Speech Recognition


Experimental Results : KL Divergence – Kolmogorov Entropy

• Discrimination information for:

vowels-fricatives: higher

nasals-fricatives: higher

vowels-nasals: lower

Page 21: Nonlinear Dynamical Invariants for Speech Recognition


Experimental Results : KL Divergence – Correlation Dimension

• Discrimination information for:

vowels-fricatives: higher

nasals-fricatives: higher

vowels-nasals: lower

Page 22: Nonlinear Dynamical Invariants for Speech Recognition


Summary and Future Work

Conclusions:

• Reconstructed phase-space from speech data using Time Delay Embedding.

• Extracted three nonlinear dynamical invariants (LE, Kolmogorov entropy, and Correlation Dimension) from embedded speech data.

• Demonstrated the between-class separation of these invariants across 8 phonetic sounds.

• Encouraging results for speech recognition applications.

Future Work:

• Study speaker variability with the hope that variations in the vocal tract response across speakers will result in different attractor structures.

• Add these invariants as features for speech and speaker recognition.

Page 23: Nonlinear Dynamical Invariants for Speech Recognition


Resources

• Pattern Recognition Applet: compare popular linear and nonlinear algorithms on standard or custom data sets

• Speech Recognition Toolkits: a state-of-the-art ASR toolkit for testing the efficacy of these algorithms on recognition tasks

• Foundation Classes: generic C++ implementations of many popular statistical modeling approaches

Page 24: Nonlinear Dynamical Invariants for Speech Recognition


References

1. Kumar, A. and Mullick, S.K., "Nonlinear Dynamical Analysis of Speech," Journal of the Acoustical Society of America, vol. 100, no. 1, pp. 615-629, July 1996.

2. Banbrook, M., "Nonlinear Analysis of Speech from a Synthesis Perspective," Ph.D. Thesis, The University of Edinburgh, Edinburgh, UK, 1996.

3. Kokkinos, I. and Maragos, P., "Nonlinear Speech Analysis Using Models for Chaotic Systems," IEEE Transactions on Speech and Audio Processing, pp. 1098-1109, Nov. 2005.

4. Eckmann, J.P. and Ruelle, D., "Ergodic Theory of Chaos and Strange Attractors," Reviews of Modern Physics, vol. 57, pp. 617-656, July 1985.

5. Kantz, H. and Schreiber, T., Nonlinear Time Series Analysis, Cambridge University Press, UK, 2003.

6. Campbell, J.P., "Speaker Recognition: A Tutorial," Proceedings of the IEEE, vol. 85, no. 9, pp. 1437-1462, Sept. 1997.