Speech & NLP Basics of Phonology & Audio Processing, Zero Crossing Rate, Dynamic Time Warping Vladimir Kulyukin www.vkedco.blogspot.com

Upload: vladimir-kulyukin

Post on 15-Dec-2014

Page 1: Speech & NLP (Fall 2014): Basics of Phonology & Audio Processing, Zero Crossing Rate, Dynamic Time Warping

Speech & NLP

Basics of Phonology & Audio Processing,

Zero Crossing Rate,

Dynamic Time Warping

Vladimir Kulyukin

www.vkedco.blogspot.com


Outline

Phonology Basics

Audio Processing

Zero Crossing Rate

Dynamic Time Warping


Phonology Basics


International Phonetic Alphabet

http://en.wikipedia.org/wiki/International_Phonetic_Alphabet

Phones, Allophones, Phonemes

A phone is a unit of speech sound

A phoneme is the smallest unit of sound that distinguishes one word from another in a given language

An allophone is a member of the set of phones used to pronounce a phoneme

For example, in English, /p/ is a phoneme with two allophones, [pʰ] and [p]

[pʰ] is the aspirated allophone of /p/ (released with a puff of breath), as in "pin"

[p] is the unaspirated allophone of /p/ (no puff of breath), as in "spin"


Intonation, Tone, & Prosody


Text To Speech


TTS Engine Anatomy


Audio Processing

Samples

Samples are successive snapshots of a specific signal

Audio files store samples of sound waves

Microphones convert acoustic signals into analog electrical signals; an analog-to-digital converter (ADC) then transforms the analog signals into digital samples

Digital Audio Signal

[Figure: waveform of a digital audio signal; horizontal axis: time, vertical axis: sound pressure]

Amplitude

Amplitude (in audio processing) is a measure of sound pressure

Amplitude is measured at a specific rate (the sampling rate)

Each amplitude measurement yields one digital sample

Some samples have positive values and some have negative values, depending on whether the pressure is above or below its resting level

Digital Approximation Accuracy

Any digitization of an analog signal carries some inaccuracy

Approximation accuracy depends on two factors: 1) sampling rate and 2) resolution

In audio processing, sampling is the reduction of a continuous signal to a discrete signal

Sampling rate is the number of samples per unit of time

Resolution is the size of a sample (e.g., the number of bits)

Sampling Rate & Resolution

Sampling rate is measured in Hertz (Hz)

For sampling, 1 Hz means one sample per second

For example, if the audio is sampled at a rate of 44,100 samples per second, then its sampling rate is 44,100 Hz

Typical resolutions (sample lengths) are 8 bits, 16 bits, and 32 bits
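As a worked example of how sampling rate and resolution together determine data volume, the sketch below computes the size of raw (uncompressed) audio. The class and method names are hypothetical, and the sketch assumes that the resolution is a whole number of bytes.

```java
final class PcmSize {
    /** Bytes needed for raw audio; assumes bitsPerSample is a multiple of 8. */
    static long rawPcmBytes(int sampleRateHz, int bitsPerSample, int channels, int seconds) {
        long bytesPerSample = bitsPerSample / 8;
        return (long) sampleRateHz * bytesPerSample * channels * seconds;
    }

    public static void main(String[] args) {
        // One second of mono audio at 44,100 Hz with 16-bit samples.
        System.out.println(rawPcmBytes(44_100, 16, 1, 1));   // 88200
        // One minute of stereo audio with the same rate and resolution.
        System.out.println(rawPcmBytes(44_100, 16, 2, 60));  // 10584000
    }
}
```

Doubling either the sampling rate or the resolution doubles the storage, which is why both are chosen no higher than the application needs.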

Nyquist-Shannon Sampling Theorem

This theorem states that perfect reconstruction of a signal is possible if the sampling frequency is greater than two times the maximum frequency of the signal being sampled

For example, if a signal has a maximum frequency of 50 Hz, then it can, theoretically, be reconstructed if it is sampled at a rate above 100 Hz

Sampling at a lower rate causes aliasing: different frequencies produce identical samples and become indistinguishable
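Aliasing can be seen numerically. The sketch below (hypothetical class and method names, with illustrative frequencies that are not from the slides) samples a 70 Hz sine and a 30 Hz sine at 100 Hz. Since 100 Hz is less than twice 70 Hz, the 70 Hz tone is undersampled, and its samples turn out to be exactly the negated samples of the 30 Hz tone.

```java
final class AliasingDemo {
    /** Returns count samples of sin(2*pi*freqHz*t), taken at t = n / sampleRateHz. */
    static double[] sampleSine(double freqHz, double sampleRateHz, int count) {
        double[] out = new double[count];
        for (int n = 0; n < count; n++) {
            out[n] = Math.sin(2.0 * Math.PI * freqHz * n / sampleRateHz);
        }
        return out;
    }

    /** Largest |a[n] + b[n]|; a value near 0 means a is the sign-flipped copy of b. */
    static double maxAbsSum(double[] a, double[] b) {
        double max = 0.0;
        for (int n = 0; n < a.length; n++) {
            max = Math.max(max, Math.abs(a[n] + b[n]));
        }
        return max;
    }

    public static void main(String[] args) {
        double rate = 100.0; // too low for a 70 Hz tone, which needs more than 140 Hz
        double[] tone70 = sampleSine(70.0, rate, 50);
        double[] tone30 = sampleSine(30.0, rate, 50);
        // Up to rounding error, the two sample sequences differ only in sign.
        System.out.println(maxAbsSum(tone70, tone30) < 1e-9); // true
    }
}
```

From the samples alone, the 70 Hz tone cannot be told apart from a 30 Hz tone, which is the indistinguishability that the theorem's rate condition rules out.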

Audio File Formats

WAVE (WAV) is often associated with Windows but is now supported on other platforms

AIFF is common on Mac OS

AU is common on Unix/Linux

These are similar formats that vary in how they represent data, pack samples (e.g., little-endian vs. big-endian), etc.

Java examples of how to manipulate WAV files are in WavFileManip.java: https://github.com/VKEDCO/AudioTrials/blob/master/org.vkedco.nlp.audiotrials/WavFileManip.java
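To illustrate what sample packing means, the sketch below decodes signed 16-bit samples from raw bytes in either byte order. It is not taken from WavFileManip.java; the class and method names are hypothetical, and header parsing is deliberately left out. WAV files normally store PCM samples little-endian, while AIFF stores them big-endian.

```java
final class Pcm16Decoder {
    /** Decodes signed 16-bit samples from raw bytes in the given byte order. */
    static short[] decode(byte[] data, boolean bigEndian) {
        short[] samples = new short[data.length / 2];
        for (int i = 0; i < samples.length; i++) {
            // Mask with 0xFF so that bytes are treated as unsigned values.
            int first = data[2 * i] & 0xFF;
            int second = data[2 * i + 1] & 0xFF;
            int value = bigEndian ? (first << 8) | second
                                  : (second << 8) | first;
            samples[i] = (short) value; // the cast restores the sign
        }
        return samples;
    }

    public static void main(String[] args) {
        byte[] raw = {0x34, 0x12, (byte) 0xFF, (byte) 0xFF};
        System.out.println(decode(raw, false)[0]); // 4660  (0x1234)
        System.out.println(decode(raw, true)[0]);  // 13330 (0x3412)
    }
}
```

Reading the same bytes with the wrong byte order yields valid-looking but wrong amplitudes, so the byte order must be taken from the file format.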


Zero Crossing Rate

What is Zero Crossing Rate (ZCR)?

Zero Crossing Rate (ZCR) is a measure of the number of times, in a given frame of samples, that the amplitude crosses the horizontal line at 0

ZCR can be used to detect silence vs. non-silence, voiced vs. unvoiced speech, speaker’s identity, etc.

ZCR is essentially the count of successive samples that change algebraic sign

ZCR Source

public class ZeroCrossingRate {

    // Counts sign changes between successive samples and divides the
    // count by a caller-supplied normalizer.
    public static double computeZCR01(double[] signals, double normalizer) {
        long numZC = 0;
        for (int i = 1; i < signals.length; i++) {
            // A crossing occurs when one sample is negative and its
            // neighbor is non-negative; 0 counts as non-negative.
            if ((signals[i] >= 0 && signals[i-1] < 0) ||
                (signals[i] < 0 && signals[i-1] >= 0)) {
                numZC++;
            }
        }
        return numZC / normalizer;
    }
}
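A small usage sketch may help here. It repeats the counting logic of computeZCR01 so that it is self-contained; the class name, the test signals, and the choice of normalizer (the number of adjacent sample pairs) are illustrative assumptions rather than part of the original source.

```java
final class ZcrDemo {
    /** Same counting rule as computeZCR01: sign changes divided by normalizer. */
    static double zcr(double[] signals, double normalizer) {
        long numZC = 0;
        for (int i = 1; i < signals.length; i++) {
            if ((signals[i] >= 0 && signals[i - 1] < 0)
                    || (signals[i] < 0 && signals[i - 1] >= 0)) {
                numZC++;
            }
        }
        return numZC / normalizer;
    }

    public static void main(String[] args) {
        double[] alternating = {1, -1, 1, -1}; // changes sign at every step
        double[] rising = {1, 2, 3, 4};        // never changes sign
        // Normalizing by the number of adjacent pairs keeps the rate in [0, 1].
        System.out.println(zcr(alternating, alternating.length - 1)); // 1.0
        System.out.println(zcr(rising, rising.length - 1));           // 0.0
    }
}
```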

ZCR in Voiced vs. Unvoiced Speech

Voiced speech is produced when the vocal cords vibrate, as in vowels

Voiced speech is characterized by roughly constant-frequency tones of some duration

Unvoiced speech is produced without vocal cord vibration, as in consonants such as /s/ and /f/

Unvoiced speech is non-periodic and random-like because air passes through a narrow constriction of the vocal tract

ZCR in Voiced vs. Unvoiced Speech

Phonetic theory states that voiced speech has a smooth air flow through the vocal tract, whereas unvoiced speech has a turbulent air flow that produces noise

Thus, voiced speech should have a low ZCR, whereas unvoiced speech should have a high ZCR

Amplitude of Voiced vs. Unvoiced Speech

Amplitude of unvoiced speech tends to be lower

Amplitude of voiced speech tends to be higher

Given a frame of digital samples, we can use the average absolute amplitude as a measure of the frame’s energy

This can be used to classify frames as voiced (vowel-like) or unvoiced (consonant-like)
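A minimal sketch of the amplitude measure follows. It assumes that "average amplitude" means the mean of the absolute sample values, since positive and negative samples would otherwise cancel out; the class and method names are hypothetical.

```java
final class FrameEnergy {
    /** Mean of the absolute sample values; returns 0 for an empty frame. */
    static double averageAbsAmplitude(double[] frame) {
        if (frame.length == 0) {
            return 0.0;
        }
        double sum = 0.0;
        for (double sample : frame) {
            sum += Math.abs(sample);
        }
        return sum / frame.length;
    }

    public static void main(String[] args) {
        double[] loud = {1, -1, 2, -2};
        double[] quiet = {0.01, -0.02, 0.01, -0.02};
        System.out.println(averageAbsAmplitude(loud));                              // 1.5
        System.out.println(averageAbsAmplitude(loud) > averageAbsAmplitude(quiet)); // true
    }
}
```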

ZCR & Amplitude of Voiced & Unvoiced Speech

            ZCR     Amplitude
Voiced      LOW     HIGH
Unvoiced    HIGH    LOW

Detection of Silence & Non-Silence

silence_buffer = [];
non_silence_buffer = [];
buffer = [];
while ( there are still frames left ) {
    Read a specific number of frames into buffer;
    Compute ZCR and average amplitude of buffer;
    if ( ZCR and average amplitude are below specific thresholds ) {
        add the buffer to silence_buffer;
    }
    else {
        add the buffer to non_silence_buffer;
    }
}

source code in https://github.com/VKEDCO/AudioTrials/blob/master/org.vkedco.nlp.audiotrials/WavFileManip.java
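The pseudocode above can be sketched in Java as follows. This is not the code from WavFileManip.java: the class name, the chunk size, both thresholds, and the use of mean absolute amplitude are assumptions made for illustration, and the signal is taken as an in-memory array rather than read from a file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SilenceSplitter {
    final List<double[]> silence = new ArrayList<>();
    final List<double[]> nonSilence = new ArrayList<>();

    /** Sign changes per adjacent sample pair, in [0, 1]. */
    static double zcr(double[] chunk) {
        if (chunk.length < 2) {
            return 0.0;
        }
        int crossings = 0;
        for (int i = 1; i < chunk.length; i++) {
            if ((chunk[i] >= 0) != (chunk[i - 1] >= 0)) {
                crossings++;
            }
        }
        return (double) crossings / (chunk.length - 1);
    }

    /** Mean of the absolute sample values. */
    static double averageAbsAmplitude(double[] chunk) {
        double sum = 0.0;
        for (double sample : chunk) {
            sum += Math.abs(sample);
        }
        return chunk.length == 0 ? 0.0 : sum / chunk.length;
    }

    /** Splits signal into chunks and sorts each chunk into one of the two buffers. */
    static SilenceSplitter split(double[] signal, int chunkSize,
                                 double zcrThreshold, double ampThreshold) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        SilenceSplitter result = new SilenceSplitter();
        for (int start = 0; start < signal.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, signal.length);
            double[] chunk = Arrays.copyOfRange(signal, start, end);
            // Silence only if BOTH measures are below their thresholds.
            boolean quiet = zcr(chunk) < zcrThreshold
                    && averageAbsAmplitude(chunk) < ampThreshold;
            (quiet ? result.silence : result.nonSilence).add(chunk);
        }
        return result;
    }
}
```

For example, `SilenceSplitter.split(signal, 4, 0.5, 0.1)` on a signal of four zeros followed by `0.9, -0.9, 0.9, -0.9` puts the first chunk into `silence` and the second into `nonSilence`. Suitable thresholds depend on the recording and have to be tuned.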


Dynamic Time Warping

Introduction

Dynamic Time Warping (DTW) is a method to find an optimal alignment between two time-dependent sequences (series)

DTW aligns (“warps”) two sequences in a non-linear way to match each other

DTW has been successfully used in automatic speech recognition (ASR), bioinformatics (genetic sequence matching), and video analysis


Sample Sequences


Sample Alignment

Basic Definitions

There are two sequences:

$X = (x_1, \ldots, x_N)$ and $Y = (y_1, \ldots, y_M)$

There is a feature space $F$ such that:

$x_i \in F$ and $y_j \in F$, where $1 \le i \le N$ and $1 \le j \le M$

There is a local cost measure mapping 2-tuples of features to non-negative reals:

$c: F \times F \to \mathbb{R}_{\ge 0}$

Cost Matrix DTW(N, M)

[Figure: N x M grid of cells; horizontal axis: X positions 1, 2, ..., i, ..., N; vertical axis: Y positions 1, 2, ..., j, ..., M]

X and Y are sequences X[1:N] and Y[1:M]

$\mathrm{dtw}(i, j)$ is the cost of warping X[1:i] with Y[1:j]

Warping Path

$P = (p_1, \ldots, p_L)$, where $p_j = (n_j, m_j) \in [1, N] \times [1, M]$ and $j \in [1, L]$, is a warping path if

1) $p_1 = (1, 1)$ and $p_L = (N, M)$

2) $n_1 \le n_2 \le \ldots \le n_L$ and $m_1 \le m_2 \le \ldots \le m_L$

3) $p_{l+1} - p_l \in \{(1, 0), (0, 1), (1, 1)\}$ for $1 \le l \le L - 1$

Valid Warping Path

[Figure: 4 x 5 grid with the cells $p_1, \ldots, p_6$ marked]

$P = (p_1, p_2, p_3, p_4, p_5, p_6)$, where $p_1 = (1, 1)$, $p_2 = (1, 2)$, $p_3 = (2, 3)$, $p_4 = (2, 4)$, $p_5 = (3, 5)$, $p_6 = (4, 5)$

Invalid Warping Path

[Figure: 4 x 5 grid with the cells $p_1, \ldots, p_6$ marked]

$p_1 \ne (1, 1)$, so constraint 1 is not satisfied

Invalid Warping Path

[Figure: 4 x 5 grid with the cells $p_1, \ldots, p_6$ marked]

$p_3 = (3, 3)$, $p_4 = (2, 4)$, and $3 > 2$, so the 2nd constraint is not satisfied

Invalid Warping Path

[Figure: 4 x 5 grid with the cells $p_1, \ldots, p_5$ marked]

$p_2 = (2, 2)$, $p_3 = (3, 4)$, $p_3 - p_2 = (3, 4) - (2, 2) = (1, 2) \notin \{(1, 0), (0, 1), (1, 1)\}$, so the 3rd constraint is not satisfied
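The three warping path conditions can be checked mechanically. The sketch below uses hypothetical names and represents a path as an array of {n, m} pairs with 1-based positions. It checks conditions 1 and 3 explicitly; condition 2 then holds automatically because every allowed step is non-decreasing in both coordinates.

```java
final class WarpingPathCheck {
    /** path[k] = {n_k, m_k}; n and m are the lengths of X and Y. */
    static boolean isWarpingPath(int[][] path, int n, int m) {
        if (path.length == 0) {
            return false;
        }
        int last = path.length - 1;
        // Condition 1: the path starts at (1, 1) and ends at (N, M).
        if (path[0][0] != 1 || path[0][1] != 1) {
            return false;
        }
        if (path[last][0] != n || path[last][1] != m) {
            return false;
        }
        // Condition 3: every step is (1, 0), (0, 1), or (1, 1).
        for (int k = 0; k < last; k++) {
            int dn = path[k + 1][0] - path[k][0];
            int dm = path[k + 1][1] - path[k][1];
            boolean allowed = (dn == 1 && dm == 0)
                    || (dn == 0 && dm == 1)
                    || (dn == 1 && dm == 1);
            if (!allowed) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[][] valid = {{1, 1}, {1, 2}, {2, 3}, {2, 4}, {3, 5}, {4, 5}};
        int[][] badStep = {{1, 1}, {2, 2}, {3, 4}, {4, 5}}; // contains a (1, 2) step
        System.out.println(isWarpingPath(valid, 4, 5));   // true
        System.out.println(isWarpingPath(badStep, 4, 5)); // false
    }
}
```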

Total Cost of a Warping Path

If $P = (p_1, \ldots, p_L)$ is a warping path between sequences X and Y, then its total cost is

$c_P(X, Y) = \sum_{j=1}^{L} c(x_{n_j}, y_{m_j})$

Example

[Figure: 4 x 5 grid for X and Y with the cells $p_1, \ldots, p_6$ marked]

Assume that $P = (p_1, p_2, p_3, p_4, p_5, p_6)$, where $p_1 = (1, 1)$, $p_2 = (1, 2)$, $p_3 = (2, 3)$, $p_4 = (2, 4)$, $p_5 = (3, 5)$, $p_6 = (4, 5)$, is a warping path between X[1:4] and Y[1:5].

Then the total cost of P is

$c(x_1, y_1) + c(x_1, y_2) + c(x_2, y_3) + c(x_2, y_4) + c(x_3, y_5) + c(x_4, y_5)$.

The notation $c(x_i, y_j)$ can be simplified to read $c(i, j)$ or $c(X[i], Y[j])$.
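The total cost of a given path is a plain sum over its cells, as the sketch below shows. The names are hypothetical, the features are assumed to be real numbers, and the local cost is passed in by the caller; the example values and the absolute-difference cost are illustrative only.

```java
import java.util.function.DoubleBinaryOperator;

final class PathCost {
    /** Sums c(x[n_j], y[m_j]) over all cells {n_j, m_j} of the path (1-based). */
    static double totalCost(double[] x, double[] y, int[][] path, DoubleBinaryOperator c) {
        double total = 0.0;
        for (int[] cell : path) {
            total += c.applyAsDouble(x[cell[0] - 1], y[cell[1] - 1]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {1, 1, 2, 2, 4};
        int[][] path = {{1, 1}, {1, 2}, {2, 3}, {2, 4}, {3, 5}, {4, 5}};
        // Only the cell (3, 5) contributes: |3 - 4| = 1.
        System.out.println(totalCost(x, y, path, (a, b) -> Math.abs(a - b))); // 1.0
    }
}
```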

DTW(X, Y): Cost of an Optimal Warping Path

$\mathrm{DTW}(X, Y) = \min \{ c_P(X, Y) \mid P \text{ is a warping path} \}$

Remarks on DTW(X, Y)

There may be several optimal warping paths, i.e., several paths whose total cost equals DTW(X, Y)

DTW(X, Y) is symmetric whenever the local cost measure is symmetric, i.e., DTW(X, Y) = DTW(Y, X)

DTW(X, Y) does not necessarily satisfy the triangle inequality, i.e., DTW(X, Z) ≤ DTW(X, Y) + DTW(Y, Z) can fail

DTW Equations: Base Cases

[Figure: N x M cost matrix; horizontal axis: X positions 1, 2, ..., i, ..., N; vertical axis: Y positions 1, 2, ..., j, ..., M]

Initial condition: $\mathrm{dtw}(1, 1) = c(1, 1)$

1st Row: $\mathrm{dtw}(i, 1) = \mathrm{dtw}(i - 1, 1) + c(i, 1)$ for $2 \le i \le N$

1st Column: $\mathrm{dtw}(1, j) = \mathrm{dtw}(1, j - 1) + c(1, j)$ for $2 \le j \le M$

DTW Equations: Recursion

[Figure: N x M cost matrix; horizontal axis: X positions 1, 2, ..., i, ..., N; vertical axis: Y positions 1, 2, ..., j, ..., M]

Inner Cell: $\mathrm{dtw}(i, j) = \min \{ \mathrm{dtw}(i - 1, j), \mathrm{dtw}(i - 1, j - 1), \mathrm{dtw}(i, j - 1) \} + c(i, j)$

Interpretation: The cost of warping X[1:i] with Y[1:j] is the cost of warping X[i] with Y[j] plus the minimum of the following three costs: 1) the cost of warping X[1:i-1] with Y[1:j]; 2) the cost of warping X[1:i-1] with Y[1:j-1]; 3) the cost of warping X[1:i] with Y[1:j-1]
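The base cases and the recursion translate directly into a table-filling loop. The following is a minimal sketch under stated assumptions: the names are hypothetical, features are real numbers, the local cost is supplied by the caller, and array cell [i-1][j-1] holds dtw(i, j) because Java arrays are 0-based.

```java
import java.util.function.DoubleBinaryOperator;

final class Dtw {
    /** Fills the full N x M cost matrix; cell [i-1][j-1] holds dtw(i, j). */
    static double[][] costMatrix(double[] x, double[] y, DoubleBinaryOperator c) {
        if (x.length == 0 || y.length == 0) {
            throw new IllegalArgumentException("sequences must be non-empty");
        }
        double[][] dtw = new double[x.length][y.length];
        for (int i = 0; i < x.length; i++) {
            for (int j = 0; j < y.length; j++) {
                double local = c.applyAsDouble(x[i], y[j]);
                if (i == 0 && j == 0) {
                    dtw[i][j] = local;                  // initial condition
                } else if (j == 0) {
                    dtw[i][j] = dtw[i - 1][0] + local;  // 1st row: dtw(i, 1)
                } else if (i == 0) {
                    dtw[i][j] = dtw[0][j - 1] + local;  // 1st column: dtw(1, j)
                } else {
                    double best = Math.min(dtw[i - 1][j],
                            Math.min(dtw[i - 1][j - 1], dtw[i][j - 1]));
                    dtw[i][j] = best + local;           // inner cell
                }
            }
        }
        return dtw;
    }

    /** DTW(X, Y) is the value in the last cell, dtw(N, M). */
    static double distance(double[] x, double[] y, DoubleBinaryOperator c) {
        return costMatrix(x, y, c)[x.length - 1][y.length - 1];
    }

    public static void main(String[] args) {
        DoubleBinaryOperator absDiff = (a, b) -> Math.abs(a - b);
        System.out.println(distance(new double[] {1, 2, 3}, new double[] {1, 2, 2, 3}, absDiff)); // 0.0
        System.out.println(distance(new double[] {1, 2, 3}, new double[] {1, 3, 3}, absDiff));    // 1.0
    }
}
```

Each cell is computed once from three already-computed neighbors, so the running time is O(N x M).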


DTW(X, Y) Examples

Sample Feature Space & Sequences

Let the sequences be:

$X = (a, b, g)$, $Y = (a, b, b, g)$, $Z = (a, g, g)$

Let the feature space be $F = \{a, b, g\}$.

Let the local cost measure be defined as follows:

$c(x, y) = \begin{cases} 0 & \text{if } x = y \\ 1 & \text{if } x \ne y \end{cases}$

Let us compute DTW(X, Y), DTW(Y, Z), and DTW(X, Z). Work it out on paper.


DTW(X, Y) = DTW((a, b, g), (a, b, b, g))

Example: DTW(1,1)

$\mathrm{dtw}(1,1) = c(a, a) = 0$

Example: DTW(2,1)

$\mathrm{dtw}(2,1) = c(2,1) + \mathrm{dtw}(1,1) = c(b, a) + \mathrm{dtw}(1,1) = 1 + 0 = 1$

Example: DTW(3,1)

$\mathrm{dtw}(3,1) = c(3,1) + \mathrm{dtw}(2,1) = c(g, a) + \mathrm{dtw}(2,1) = 1 + 1 = 2$

Example: DTW(1,2)

$\mathrm{dtw}(1,2) = c(1,2) + \mathrm{dtw}(1,1) = c(a, b) + \mathrm{dtw}(1,1) = 1 + 0 = 1$

Example: DTW(1,3)

$\mathrm{dtw}(1,3) = c(1,3) + \mathrm{dtw}(1,2) = c(a, b) + \mathrm{dtw}(1,2) = 1 + 1 = 2$

Example: DTW(1,4)

$\mathrm{dtw}(1,4) = c(1,4) + \mathrm{dtw}(1,3) = c(a, g) + \mathrm{dtw}(1,3) = 1 + 2 = 3$

Example: DTW(2,2)

$\mathrm{dtw}(2,2) = c(2,2) + \min\{\mathrm{dtw}(1,2), \mathrm{dtw}(1,1), \mathrm{dtw}(2,1)\} = c(b, b) + \min\{1, 0, 1\} = 0 + 0 = 0$

Example: DTW(3,2)

$\mathrm{dtw}(3,2) = c(3,2) + \min\{\mathrm{dtw}(2,2), \mathrm{dtw}(2,1), \mathrm{dtw}(3,1)\} = c(g, b) + \min\{0, 1, 2\} = 1 + 0 = 1$

Example: DTW(2,3)

$\mathrm{dtw}(2,3) = c(2,3) + \min\{\mathrm{dtw}(1,3), \mathrm{dtw}(1,2), \mathrm{dtw}(2,2)\} = c(b, b) + \min\{2, 1, 0\} = 0 + 0 = 0$

Example: DTW(3,3)

$\mathrm{dtw}(3,3) = c(3,3) + \min\{\mathrm{dtw}(2,3), \mathrm{dtw}(2,2), \mathrm{dtw}(3,2)\} = c(g, b) + \min\{0, 0, 1\} = 1 + 0 = 1$

Example: DTW(2,4)

$\mathrm{dtw}(2,4) = c(2,4) + \min\{\mathrm{dtw}(1,4), \mathrm{dtw}(1,3), \mathrm{dtw}(2,3)\} = c(b, g) + \min\{3, 2, 0\} = 1 + 0 = 1$

Example: DTW(3,4)

$\mathrm{dtw}(3,4) = c(3,4) + \min\{\mathrm{dtw}(2,4), \mathrm{dtw}(2,3), \mathrm{dtw}(3,3)\} = c(g, g) + \min\{1, 0, 1\} = 0 + 0 = 0$

So DTW(X, Y) = 0

Example: DTW(3,4)

Completed cost matrix (columns: positions in X; rows: positions in Y, with position 1 at the bottom):

             X: a (1)   b (2)   g (3)
Y: g (4)        3       1       0
Y: b (3)        2       0       1
Y: b (2)        1       0       1
Y: a (1)        0       1       2

DTW(X, Y) = 0.

An Optimal Warping Path (OWP) P can be found by chasing pointers (red arrows in the figure) back from cell (3, 4): P = ((1,1), (2, 2), (2, 3), (3, 4)).


DTW(Y, Z) = DTW((a, b, b, g), (a, g, g))

DTW(1, 1)

$\mathrm{dtw}(1,1) = c(a, a) = 0$

DTW(2, 1)

$\mathrm{dtw}(2,1) = c(b, a) + \mathrm{dtw}(1,1) = 1 + 0 = 1$

DTW(3, 1)

$\mathrm{dtw}(3,1) = c(b, a) + \mathrm{dtw}(2,1) = 1 + 1 = 2$

DTW(4, 1)

$\mathrm{dtw}(4,1) = c(g, a) + \mathrm{dtw}(3,1) = 1 + 2 = 3$

DTW(1, 2)

$\mathrm{dtw}(1,2) = c(a, g) + \mathrm{dtw}(1,1) = 1 + 0 = 1$

DTW(1, 3)

$\mathrm{dtw}(1,3) = c(a, g) + \mathrm{dtw}(1,2) = 1 + 1 = 2$

DTW(2, 2)

$\mathrm{dtw}(2,2) = c(b, g) + \min\{\mathrm{dtw}(1,2), \mathrm{dtw}(1,1), \mathrm{dtw}(2,1)\} = 1 + \min\{1, 0, 1\} = 1 + 0 = 1$

DTW(3, 2)

$\mathrm{dtw}(3,2) = c(b, g) + \min\{\mathrm{dtw}(2,2), \mathrm{dtw}(2,1), \mathrm{dtw}(3,1)\} = 1 + \min\{1, 1, 2\} = 1 + 1 = 2$

DTW(4, 2)

$\mathrm{dtw}(4,2) = c(g, g) + \min\{\mathrm{dtw}(3,2), \mathrm{dtw}(3,1), \mathrm{dtw}(4,1)\} = 0 + \min\{2, 2, 3\} = 0 + 2 = 2$

DTW(2, 3)

$\mathrm{dtw}(2,3) = c(b, g) + \min\{\mathrm{dtw}(1,3), \mathrm{dtw}(1,2), \mathrm{dtw}(2,2)\} = 1 + \min\{2, 1, 1\} = 1 + 1 = 2$

DTW(3, 3)

$\mathrm{dtw}(3,3) = c(b, g) + \min\{\mathrm{dtw}(2,3), \mathrm{dtw}(2,2), \mathrm{dtw}(3,2)\} = 1 + \min\{2, 1, 2\} = 1 + 1 = 2$

DTW(4, 3)

$\mathrm{dtw}(4,3) = c(g, g) + \min\{\mathrm{dtw}(3,3), \mathrm{dtw}(3,2), \mathrm{dtw}(4,2)\} = 0 + \min\{2, 2, 2\} = 0 + 2 = 2$

DTW(Y, Z)

Completed cost matrix (columns: positions in Y; rows: positions in Z, with position 1 at the bottom):

             Y: a (1)   b (2)   b (3)   g (4)
Z: g (3)        2       2       2       2
Z: g (2)        1       1       2       2
Z: a (1)        0       1       2       3

DTW(Y, Z) = 2.

An Optimal Warping Path (OWP) P can be found by chasing pointers (red arrows in the figure) back from cell (4, 3): P = ((1,1), (2, 2), (3, 2), (4, 3)).


DTW(X, Z) = DTW((a, b, g), (a, g, g))

DTW(1, 1)

$\mathrm{dtw}(1,1) = c(a, a) = 0$

DTW(2, 1)

$\mathrm{dtw}(2,1) = c(b, a) + \mathrm{dtw}(1,1) = 1 + 0 = 1$

DTW(3, 1)

$\mathrm{dtw}(3,1) = c(g, a) + \mathrm{dtw}(2,1) = 1 + 1 = 2$

DTW(1, 2)

$\mathrm{dtw}(1,2) = c(a, g) + \mathrm{dtw}(1,1) = 1 + 0 = 1$

DTW(1, 3)

$\mathrm{dtw}(1,3) = c(a, g) + \mathrm{dtw}(1,2) = 1 + 1 = 2$

DTW(2, 2)

$\mathrm{dtw}(2,2) = c(b, g) + \min\{\mathrm{dtw}(1,2), \mathrm{dtw}(1,1), \mathrm{dtw}(2,1)\} = 1 + \min\{1, 0, 1\} = 1 + 0 = 1$

DTW(3, 2)

$\mathrm{dtw}(3,2) = c(g, g) + \min\{\mathrm{dtw}(2,2), \mathrm{dtw}(2,1), \mathrm{dtw}(3,1)\} = 0 + \min\{1, 1, 2\} = 0 + 1 = 1$

DTW(2, 3)

$\mathrm{dtw}(2,3) = c(b, g) + \min\{\mathrm{dtw}(1,3), \mathrm{dtw}(1,2), \mathrm{dtw}(2,2)\} = 1 + \min\{2, 1, 1\} = 1 + 1 = 2$

DTW(3, 3)

$\mathrm{dtw}(3,3) = c(g, g) + \min\{\mathrm{dtw}(2,3), \mathrm{dtw}(2,2), \mathrm{dtw}(3,2)\} = 0 + \min\{2, 1, 1\} = 0 + 1 = 1$

DTW(X, Z)

Completed cost matrix (columns: positions in X; rows: positions in Z, with position 1 at the bottom):

             X: a (1)   b (2)   g (3)
Z: g (3)        2       2       1
Z: g (2)        1       1       1
Z: a (1)        0       1       2

DTW(X, Z) = 1.

An Optimal Warping Path (OWP) P can be found by chasing pointers (red arrows in the figure) back from cell (3, 3): P = ((1,1), (2, 2), (3, 3)).
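The three hand-computed results can be cross-checked with a few lines of code. The sketch below (hypothetical names) encodes each sequence as a string of feature symbols and uses the 0/1 local cost defined for this example. It also shows that these three sequences violate the triangle inequality, since DTW(Y, Z) = 2 exceeds DTW(Y, X) + DTW(X, Z) = 0 + 1.

```java
final class DtwExampleCheck {
    /** DTW cost with local cost 0 for equal symbols and 1 otherwise. */
    static int dtw(String x, String y) {
        int n = x.length();
        int m = y.length();
        int[][] d = new int[n][m];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                int local = x.charAt(i) == y.charAt(j) ? 0 : 1;
                if (i == 0 && j == 0) {
                    d[i][j] = local;
                } else if (j == 0) {
                    d[i][j] = d[i - 1][0] + local;
                } else if (i == 0) {
                    d[i][j] = d[0][j - 1] + local;
                } else {
                    d[i][j] = local + Math.min(d[i - 1][j],
                            Math.min(d[i - 1][j - 1], d[i][j - 1]));
                }
            }
        }
        return d[n - 1][m - 1];
    }

    public static void main(String[] args) {
        String x = "abg";
        String y = "abbg";
        String z = "agg";
        System.out.println(dtw(x, y)); // 0
        System.out.println(dtw(y, z)); // 2
        System.out.println(dtw(x, z)); // 1
        // Triangle inequality fails: 2 > 0 + 1.
        System.out.println(dtw(y, z) > dtw(y, x) + dtw(x, z)); // true
    }
}
```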


DTW Optimizations

Window Optimization

The computation of DTW can be optimized so that only the cells within a specific window are considered
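One common window shape is a band of fixed width around the diagonal (often called a Sakoe-Chiba band). The slide does not say which window shape it has in mind, so the band, the parameter w, and all names in the sketch below are assumptions. Cells outside the band keep an infinite cost and are never chosen.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

final class WindowedDtw {
    /** DTW restricted to cells (i, j) with |i - j| <= w (widened to reach (N, M)). */
    static double distance(double[] x, double[] y, int w, DoubleBinaryOperator c) {
        int n = x.length;
        int m = y.length;
        // The band must be at least |N - M| wide, or cell (N, M) is unreachable.
        int band = Math.max(w, Math.abs(n - m));
        // Row 0 and column 0 are padding, so the base cases need no special code.
        double[][] dtw = new double[n + 1][m + 1];
        for (double[] row : dtw) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        dtw[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            int lo = Math.max(1, i - band);
            int hi = Math.min(m, i + band);
            for (int j = lo; j <= hi; j++) {
                double best = Math.min(dtw[i - 1][j],
                        Math.min(dtw[i - 1][j - 1], dtw[i][j - 1]));
                dtw[i][j] = best + c.applyAsDouble(x[i - 1], y[j - 1]);
            }
        }
        return dtw[n][m];
    }

    public static void main(String[] args) {
        DoubleBinaryOperator absDiff = (a, b) -> Math.abs(a - b);
        double[] x = {0, 0, 0, 1};
        double[] y = {0, 1, 1, 1};
        System.out.println(distance(x, y, 3, absDiff)); // 0.0 (window covers the whole matrix)
        System.out.println(distance(x, y, 0, absDiff)); // 2.0 (only the diagonal is allowed)
    }
}
```

The second call shows the trade-off: a narrow window saves work but can exclude the optimal path, so the result is an upper bound on the unrestricted DTW cost.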

Smaller Matrix Optimization

You may have realized by now that if we care only about the total cost of warping sequence X with sequence Y, we do not need to store the entire N x M cost matrix; we need only two columns, the current one and the previous one

The storage savings are huge, but the running time remains the same: O(N x M)

We can also normalize the DTW cost by N x M to keep it low
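A minimal sketch of the two-column idea follows, with hypothetical names and real-valued features assumed. Each column holds dtw(i, 1..M) for one position i of X; after a column is finished it becomes the "previous" column and the other array is reused. The sketch also includes the N x M normalization mentioned above.

```java
import java.util.function.DoubleBinaryOperator;

final class TwoColumnDtw {
    /** DTW cost using two arrays of length M instead of an N x M matrix. */
    static double distance(double[] x, double[] y, DoubleBinaryOperator c) {
        int n = x.length;
        int m = y.length;
        if (n == 0 || m == 0) {
            throw new IllegalArgumentException("sequences must be non-empty");
        }
        double[] prev = new double[m]; // dtw(i - 1, .)
        double[] curr = new double[m]; // dtw(i, .)
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                double local = c.applyAsDouble(x[i], y[j]);
                if (i == 0 && j == 0) {
                    curr[j] = local;
                } else if (i == 0) {
                    curr[j] = curr[j - 1] + local;
                } else if (j == 0) {
                    curr[j] = prev[0] + local;
                } else {
                    curr[j] = local + Math.min(prev[j], Math.min(prev[j - 1], curr[j - 1]));
                }
            }
            // Swap roles: the finished column becomes the previous one.
            double[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[m - 1];
    }

    /** DTW cost divided by N x M. */
    static double normalizedDistance(double[] x, double[] y, DoubleBinaryOperator c) {
        return distance(x, y, c) / ((double) x.length * y.length);
    }
}
```

For example, with the absolute difference as local cost, `TwoColumnDtw.distance(new double[] {1, 2, 3}, new double[] {1, 3, 3}, (a, b) -> Math.abs(a - b))` gives 1.0, the same value a full matrix would produce. The price of the saved memory is that the optimal warping path can no longer be recovered by chasing pointers.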
