speech & nlp (fall 2014): basics of phonology & audio processing, zero crossing rate,...

Speech & NLP

Basics of Phonology & Audio Processing,

Zero Crossing Rate,

Dynamic Time Warping

Vladimir Kulyukin

www.vkedco.blogspot.com

http://www.vkedco.blogspot.com/

Outline

Phonology Basics

Audio Processing

Zero Crossing Rate


Phonology Basics

International Phonetic Alphabet

http://en.wikipedia.org/wiki/International_Phonetic_Alphabet

http://en.wikipedia.org/wiki/International_Phonetic_Alphabet

Phones, Allophones, Phonemes

Phone is a unit of speech sound

Allophone is a member of a set of phones used

to pronounce a phoneme

For example, in English, /p/ is a phoneme with

two allophones [ph] and [p]

[ph] is an aspirated (breath is used) allophone

of /p/

[p] is an unaspirated (breath is not used)

allophone of /p/

Intonation, Tone, & Prosody

Text To Speech

TTS Engine Anatomy

Audio Processing

Samples

Samples are successive snapshots of a

specific signal

Audio files are samples of sound waves

Microphones convert acoustic signals into

analog electrical signals and then analog-to-

digital converter transform analog signals

into digital samples

Digital Audio Signal

time

Sound

pressure

Amplitude

Amplitude (in audio processing) is a

measure of sound pressure

Amplitude is measured at a specific rate

Amplitude measures result in digital

samples

Some samples have positive values

Some samples have negative values

Digital Approximation Accuracy

Any digitization of analog signals carries some

inaccuracy

Approximation accuracy depends on two

factors: 1) sampling rate and 2) resolution

In audio processing, sampling is reduction of

continuous signal to discrete signal

Sampling rate is the number of samples per unit

of time

Resolution is the size of a sample (e.g., the

number of bits)

Sampling Rate & Resolution

Sampling rate is measured in Hertz (Hz)

Hz are measured in samples per second

For example, if the audio is sampled at a

rate of 44,100 samples per second, then its

sampling rate is 44,100Hz

Typical resolutions (sample lengths) are 8

bits, 16 bits, and 32 bits

Nyquist-Shannon Sampling Theorem

This theorem states that perfect reconstruction of

a signal is possible if the sampling frequency is

greater than two times the maximum frequency of

the signal being sampled

For example, if a signal has a maximum frequency

of 50Hz, then it can, theoretically, be

reconstructed if sampled at a rate of 100Hz and

avoid aliasing (aka the effect of indistinguishable

sounds)

Audio File Formats

WAVE (WAV) is often associated with Windows but are

now implemented on other platforms

AIFF is common on Mac OS

AU is common on Unix/Linux

These are similar formats that vary in how they

represent data, pack samples (e.g., little-endian vs. big-

endian), etc.

Some Java examples of how to manipulate Wav files are

WavFileManip.java; if the link does not work, the url is https://github.com/VKEDCO/AudioTrials/blob/master/org.vkedco.nlp.audiotrials/WavFileManip.java

https://github.com/VKEDCO/AudioTrials/blob/master/org.vkedco.nlp.audiotrials/WavFileManip.java
















Zero Crossing Rate

What is Zero Crossing Rate (ZCR)?

Zero Crossing Rate (ZCR) is a measure of the

number of times, in a given sample, when

amplitude crosses the horizontal line at 0

ZCR can be used to detect silence vs. non-

silence, voice vs. unvoiced, speaker’s identity,

etc.

ZCR is essentially the count of successive

samples changing algebraic signs

ZCR Source

public class ZeroCrossingRate {

public static double computeZCR01(double[] signals, double normalizer)

{

long numZC = 0;

for(int i = 1; i < signals.length; i++) {

if ( (signals[i] >= 0 && signals[i-1] < 0) ||

(signals[i] < 0 && signals[i-1] >= 0) ) {

numZC++;

}

}

return numZC/normalizer;

}

}

ZCR in Voiced vs. Unvoiced Speech

Voiced speech is produced when vowels are spoken

Voiced speech is characterized of constant

frequency tones of some duration

Unvoiced speech is produced when consonants are

spoken

Unvoiced speech is non-periodic, random-like

because air passes through a narrow constriction of

the vocal tract

ZCR in Voiced vs. Unvoiced Speech

Phonetic theory states that voiced speech

has a smooth air flow through the vocal tract

whereas unvoiced speech has a turbulent air

flow that produces noise

Thus, voiced speech should have a low ZCR

whereas unvoiced speech should have a high

ZCR

Amplitude of Voiced vs. Unvoiced Speech

Amplitude of unvoiced speech tends to be

lower

Amplitude of voiced speech tends to be

higher

Given a digital sample, we can use average

amplitude as a measure of the sample’s

energy

This can be used to classify samples as

vowels and consonants

ZCR & Amplitude of Voiced & Unvoiced Speech

ZCR Amplitude

Voiced LOW HIGH

Unvoiced HIGH LOW

Detection of Silence & Non-Silence

silence_buffer = [];

non_silence_buffer = [];

buffer = [];

while ( there are still frames left ) {

Read a specific number of frames into buffer;

Compute ZCR and average amplitude of buffer;

if ( ZCR and average amplitude are below specific thresholds ) {

add the buffer to silence_buffer;

}

else {

add the buffer to non_silence_buffer;

}

}

source code in https://github.com/VKEDCO/AudioTrials/blob/master/org.vkedco.nlp.audiotrials/WavFileManip.java



Introduction

Dynamic Time Warping (DTW) is a method to

find an optimal alignment between two time-

dependent sequences (series)

DTW aligns (“warps”) two sequences in a non-

linear way to match each other

DTW has been successfully used in automatic

speech recognition (ASR), bioinformatics

(genetic sequence matching), and video

analysis

Sample Sequences

Sample Alignment

Basic Definitions

There are two sequences:

𝑋 = 𝑥1, … , 𝑥𝑁 and 𝑌 = 𝑦1, … , 𝑦𝑀

There is a feature space F such that:

𝑥𝑖 ∈ 𝐹 & 𝑦𝑗 ∈ 𝐹 where 1 ≤ 𝑖 ≤ 𝑁, 1 ≤ 𝑗 ≤ 𝑀

There is a local cost measure mapping 2-

tuples of features to non-negative reals:

𝑐: 𝐹 x 𝐹 → 𝑅 ≥ 0

X

Cost Matrix DTW(N, M)

Y

1 2 …. i … N

M

1

2

…

𝑑𝑡𝑤 𝑖, 𝑗 is the cost of warping X[1:i] with Y[1:j]

j

…

X and Y are sequences X[1:N] and Y[1:M]

Warping Path

𝑃 = 𝑝1, … , 𝑝𝐿 , where 𝑝 = 𝑛𝑗 , 𝑚𝑗 ∈ 1, 𝑁 × [1,𝑀] and

𝑗 ∈ 1, 𝐿 is a warping path if

1) 𝑝1 = 1,1 and 𝑝𝐿 = 𝑁,𝑀 2) 𝑛1 ≤ 𝑛2 ≤ … ≤ 𝑛𝑁 and 𝑚1 ≤ 𝑚2 ≤ … ≤ 𝑚𝑀

3) 𝑝𝑙+1 − 𝑝𝑙 ∈ 1, 0 , 0, 1 , 1, 1 , 1 ≤ 𝑙 ≤ 𝐿 − 1

Valid Warping Path

1 2 3 4

1

2

3

4

5

𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, 𝑝6 , where

𝑝1 = 1, 1 , 𝑝2 = 1, 2 , 𝑝3 = 2, 3 , 𝑝4 = 2, 4 , 𝑝5 = 3, 5 , 𝑝6 = (4, 5)

𝑝1

𝑝2

𝑝3

𝑝4

𝑝5 𝑝6

Invalid Warping Path

1 2 3 4

1

2

3

4

5

𝑝1 ≠ 1, 1 so constraint 1 is not satisfied

𝑝1 𝑝2

𝑝3

𝑝4

𝑝5 𝑝6


1 2 3 4

1

2

3

4

5

𝑝3 = 3, 3 , 𝑝4 = 2, 4 , 3 > 2 so 2nd constraint is not satisfied

𝑝1

𝑝2

𝑝3

𝑝4

𝑝5 𝑝6


1 2 3 4

1

2

3

4

5

𝑝2 = 2, 2 , 𝑝3 = 3, 4 , 𝑝3 − 𝑝2 = 3,4 − 2,2 = 1, 2 ∉1, 0 , 0, 1 , 1, 1 so 3rd condition is not satisfied

𝑝1

𝑝2

𝑝3

𝑝4 𝑝5

Total Cost of a Warping Path

𝑃 = 𝑝1, … , 𝑝𝐿 , is a warping path between sequences X

and Y, then its total cost is

𝑐𝑝 𝑋, 𝑌 = 𝑐(𝑥𝑛𝑗 , 𝑦𝑚𝑗)

𝐿

𝑗=1

Example

1 2 3 4

1

2

3

4

5 Assume that 𝑃 = 𝑝1, 𝑝2, 𝑝3, 𝑝4, 𝑝5, 𝑝6 , where 𝑝1 = 1, 1 , 𝑝2 = 1, 2 , 𝑝3 =2, 3 , 𝑝4 = 2, 4 , 𝑝5 = 3, 5 , 𝑝6 = 4, 5 ,

is a warping path b/w X[1:4] and Y[1:5].

Then the total cost of P is

𝑐 𝑥1, 𝑦1 + 𝑐 𝑥1, 𝑦2 + 𝑐 𝑥2, 𝑦3 +𝑐 𝑥2, 𝑦4 + 𝑐 𝑥3, 𝑦5 + 𝑐 𝑥4, 𝑦5 .

This notation 𝑐 𝑥𝑖 , 𝑦𝑗 can be simplified

to read 𝑐(𝑖, 𝑗) or 𝑐 𝑋 𝑖 , 𝑌 𝑗 .

𝑝1

𝑝2

𝑝3

𝑝4

𝑝5 𝑝6

X

Y

DTW(X, Y) – Cost of an Optimal Warping Path

𝐷𝑇𝑊 𝑋, 𝑌 = min 𝑐𝑝 𝑋, 𝑌 𝑝 is a warping path}

Remarks on DTW(X, Y)

There may be several warping paths of the

same warping cost, i.e., DTW(X, Y)

DTW(X, Y) is symmetric whenever the local

cost measure is symmetric, i.e., DTW(X, Y) =

DTW(Y, X)

DTW(X, Y) does not necessarily satisfy the

triangle inequality (the sum of the lengths of

two sides is greater than the length of the

remaining side)

X

DTW Equations: Base Cases

Y

1 2 …. i … N

M

1

2

…

Initial condition: 𝑑𝑡𝑤 1,1 = 𝑐(1,1)

j

…

1st Row: 𝑑𝑡𝑤 𝑖, 1 = 𝑑𝑡𝑤 𝑖 − 1,1 + 𝑐(𝑖, 1)

1st Column:

𝑑𝑡𝑤 1, 𝑗 = 𝑑𝑡𝑤 1, 𝑗 − 1 +𝑐(1, 𝑗)

X

DTW Equations: Recursion

Y

1 2 … i … N

M

1

2

…

j

…

Inner Cell: 𝑑𝑡𝑤 𝑖, 𝑗 = min 𝑑𝑡𝑤 𝑖 − 1, 𝑗 , 𝑑𝑡𝑤 𝑖 − 1, 𝑗 − 1 , 𝑑𝑡𝑤 𝑖, 𝑗 − 1 + 𝑐(𝑖, 𝑗)

Interpretation: Cost of

warping X[1:i] with Y[1:J] is

the cost of warping X[i] with

Y[j] plus the minimum of

the following three costs: 1)

the cost of warping X[1:i-1]

with Y[1:j]; 2) the cost of

warping X[1:i-1] with Y[1:j-

1]; 3) the cost of warping

X[1:i] with Y[1:j-1]

DTW(X, Y) Examples

Sample Feature Space & Sequences

Let the sequences be:

𝑋 = 𝑎, 𝑏, 𝑔 𝑌 = 𝑎, 𝑏, 𝑏, 𝑔 𝑍 = (𝑎, 𝑔, 𝑔)

Let the feature space 𝐹 = 𝑎, 𝑏, 𝑔 .

Let the local cost measure be

defined as follows:

𝑐 𝑥, 𝑦 = 0 𝑖𝑓 𝑥 = 𝑦1 𝑖𝑓 𝑥 ≠ 𝑦

Let us compute dtw(X,Y), dtw(Y,Z), and dtw(X, Z).

Work it out on paper.

DTW(X, Y) = DTW((a, b, g), (a, b, b, g))

Example: DTW(1,1)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0

Example: DTW(2,1)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,1 = 𝑐 2,1 + 𝑑𝑡𝑤 1,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

1

Example: DTW(3,1)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,1 = 𝑐 3,1 + 𝑑𝑡𝑤 2,1= 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

1 2

Example: DTW(1,2)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,2 = 𝑐 1,2 + 𝑑𝑡𝑤 1,1= 𝑐 𝑎, 𝑏 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

1 2

1

Example: DTW(1,3)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,3 = 𝑐 1,3 + 𝑑𝑡𝑤 1,2= 𝑐 𝑎, 𝑏 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

1 2

1

2

Example: DTW(1,4)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 1,4 = 𝑐 1,4 + 𝑑𝑡𝑤 1,3= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,3= 1 + 2 = 3

1 2

1

2

3

Example: DTW(2,2)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,2= 𝑐 2,2

+ min𝑑𝑡𝑤 1,2 ,𝑑𝑡𝑤 1,1 ,𝑑𝑡𝑤 2,1

= 𝑐 𝑏, 𝑏 + min 1,0,1 = 0 + 0= 0 1 2

1

2

3

0

Example: DTW(3,2)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,2= 𝑐 3,2


= 𝑐 𝑔, 𝑏 + min 0,1,2 = 1 + 0= 1 1 2

1

2

3

0 1

Example: DTW(2,3)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,3= 𝑐 2,2


= 𝑐 𝑏, 𝑏 + min 2,1,0 = 0 + 0= 0 1 2

1

2

3

0 1

0

Example: DTW(3,3)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,3= 𝑐 3,3


= 𝑐 𝑔, 𝑏 + min 0,0,1 = 1 + 0= 1 1 2

1

2

3

0 1

0 1

Example: DTW(2,4)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 2,4= 𝑐 2,4


= 𝑐 𝑏, 𝑔 + min 3,2,0 = 1 + 0= 1 1 2

1

2

3

0 1

0 1

1

Example: DTW(3,4)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

𝑑𝑡𝑤 3,4= 𝑐 3,4


= 𝑐 𝑔, 𝑔 +min 1,0,1 = 0 + 0= 0

So DTW(X,Y) = 0

1 2

1

2

3

0 1

0 1

1 0

Example: DTW(3,4)

a 𝑏 𝑔

𝑎

𝑏

𝑔

0

Y

X

𝑏

1 2 3

4

3

2

1

DTW(X, Y) = 0.

Optimal Warping Path

(OWP) P can be found by

chasing pointers (red

arrows): P = ((1,1), (2, 2),

(2, 3), (3, 4)). 1 2

1

2

3

0 1

0 1

1 0

DTW(Y, Z) = DTW((a, b, b, g), (a, g, g))

DTW(1, 1)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0 Z

DTW(2, 1)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 2,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

Z

1

DTW(3, 1)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 3,1= 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

Z

1 2

DTW(4, 1)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 4,1= 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 3,1= 1 + 2 = 3

Z

1 2 3

DTW(1, 2)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 1,2= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

Z

1 2 3

1

DTW(1, 3)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 1,3= 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

Z

1 2 3

1

2

DTW(2, 2)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 2,2= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 1,2 ,

𝑑𝑡𝑤 1,1 , 𝑑𝑡𝑤 2,1 }

= 1 + 0 = 1

Z

1 2 3

1

2

1

DTW(3, 2)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 3,2= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 2,2 ,

𝑑𝑡𝑤 2,1 , 𝑑𝑡𝑤 3,1 }

= 1 +min 1,1,2 = 1 + 1 = 2

Z

1 2 3

1

2

1 2

DTW(4, 2)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 4,2= 𝑐 𝑔, 𝑔+ min {𝑑𝑡𝑤 3,2 ,

𝑑𝑡𝑤 3,1 , 𝑑𝑡𝑤 4,1 }

= 0 +min 2,2,3 = 0 + 2 = 2

Z

1 2 3

1

2

1 2 2

DTW(2, 3)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 2,3= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 1,3 ,

𝑑𝑡𝑤 1,2 , 𝑑𝑡𝑤 2,2 }

= 1 +min 2,1,1 = 1 + 1 = 2

Z

1 2 3

1

2

1 2 2

2

DTW(3, 3)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 3,3= 𝑐 𝑏, 𝑔+ min {𝑑𝑡𝑤 2,3 ,

𝑑𝑡𝑤 2,2 , 𝑑𝑡𝑤 3,2 }

= 1 +min 2,1,2 = 1 + 1 = 2

Z

1 2 3

1

2

1 2 2

2 2

DTW(4, 3)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

𝑑𝑡𝑤 4,3= 𝑐 𝑔, 𝑔+ min {𝑑𝑡𝑤 3,4 ,

𝑑𝑡𝑤 3,2 , 𝑑𝑡𝑤 4,2 }

= 0 +min 2,2,2 = 0 + 2 = 2

Z

1 2 3

1

2

1 2 2

2 2 2

DTW(Y, Z)

a 𝑏 𝑏 𝑔

𝑎

𝑔

0

Y

𝑔

1 2 3 4

3

2

1

DTW(Y, Z) = 2.

Optimal Warping Path (OWP) P

can be found by chasing pointers

(red arrows): P = ((1,1), (2, 2), (3,

2), (4, 3)).

Z

1 2 3

1

2

1 2 2

2 2 2

DTW(X, Z) = DTW((a, b, g), (a, g, g))

DTW(1, 1)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 1,1 = 𝑐 𝑎, 𝑎 = 0

0

DTW(2, 1)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 2,1 = 𝑐 𝑏, 𝑎 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

0 1

DTW(3, 1)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 3,1 = 𝑐 𝑔, 𝑎 + 𝑑𝑡𝑤 2,1= 1 + 1 = 2

0 1 2

DTW(1, 2)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 1,2 = 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,1= 1 + 0 = 1

0 1 2

1

DTW(1, 3)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 1,3 = 𝑐 𝑎, 𝑔 + 𝑑𝑡𝑤 1,2= 1 + 1 = 2

0 1 2

1

2

DTW(2, 2)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 2,2= 𝑐 𝑏, 𝑔


= 1 +min 1,0,1= 1 + 0 = 1

0 1 2

1

2

1

DTW(3, 2)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 3,2= 𝑐 𝑔, 𝑔

+min𝑑𝑡𝑤 2,2 ,𝑑𝑡𝑤 2,1 ,𝑑𝑡𝑤 3,1

= 0 +min 1,1,2= 0 + 1 = 1

0 1 2

1

2

1 1

DTW(2, 3)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 2,3= 𝑐 𝑏, 𝑔


= 1 +min 2,1,1= 1 + 1 = 2

0 1 2

1

2

1 1

2

DTW(3, 3)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1

𝑑𝑡𝑤 3,3= 𝑐 𝑔, 𝑔

+min𝑑𝑡𝑤 2,3 ,𝑑𝑡𝑤 2,2 ,𝑑𝑡𝑤 3,2

= 0 +min 2,1,2= 0 + 1 = 1

0 1 2

1

2

1 1

2 1

DTW(X, Z)

a 𝑏 𝑔

𝑎

𝑔 Z

X

𝑔

1 2 3

3

2

1 0 1 2

1

2

1 1

2 1 DTW(X, Z) = 1.

Optimal Warping Path (OWP)

P can be found by chasing

pointers (red arrows): P =

((1,1), (2, 2), (3, 3)).

DTW Optimizations

Window Optimization

The computation of DTW can be optimized so that only the

cells within a specific window are considered

Smaller Matrix Optimization

You may have realized by now that if we care

only about the total cost of warping sequence X

with sequence Y, we do not need to compute

the entire N x M cost matrix – we need only two

columns

The storage savings are huge, but the running

time remains the same – O(N x M)

We can also normalize the DTW cost by N x M

to keep it low

References

M. Muller. “Information Retrieval for Music and

Motion,”, Ch.04. Springer, ISBN 978-3-540-74047-6

Bachu, R. G., et al. “Separation of Voiced and

Unvoiced using Zero Crossing Rate and Energy of

the Speech Signal." American Society for Engineering

Education (ASEE) Zone Conference Proceedings. 2008.

speech & nlp (fall 2014): basics of phonology & audio processing, zero crossing rate,...

Science