audio featuresluthuli.cs.uiuc.edu/~daf/courses/cs-498-daf-ps/lecture 8 - audio... · today’s...

Audio Features

CS498

Today’s lecture

•  Audio Features

•  How we hear sound

•  How we represent sound –  In the context of this class

Why features?

•  Features are a very important area –  Bad features make problems unsolvable – Good features make problems trivial

•  Learning how to pick features is the key –  So is understanding what they mean

A simple example

•  Compare two numbers:

x,y = {3,3} x,z = {3,100}

A simple example

•  Compare two numbers:

–  x,y similar but x,z not so much

•  Best way to represent a number is itself!

x −y = 0 x − z = 97

0 1 2 3 4 5 60

0.5

1

1.5

0 1 2 3 4 5 60

0.5

1

1.5

0 1 2 3 4 5 60

0.5

1

1.5

0 1 2 3 4 5 60

0.5

1

1.5

Moving up a level

•  Compare two vectors:

x,y x,z

Moving up a level

•  Compare two vectors:

–  Simply generalizing numbers concept

∠x,y = 0.03 rad ∠x,z = 0.7 radx−y = 0.16 x− z = 1.07

Moving up again

•  Compare two longer vectors:

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

1.5

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

1.5

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

1.5

0 10 20 30 40 50 60 70 80 90 1000

0.5

1

1.5

Look similar but are not!

•  Oops! ∠x,y = 1.57 rad, x−y = 7.64

How about this?

•  Are these two vectors the same?

– Not if you look at their norm or angle …

1 2 3 4 5 6 7x 104

−0.6−0.4−0.2

00.20.40.60.8

1 2 3 4 5 6 7x 104

−1

−0.5

0

0.5

Data norms won’t get you far!

•  You need to articulate what matters –  You need to know what matters

•  Features are the means to do so

•  Let’s examine what matters to our ears – Our bodies sorta know best

Hearing

•  Sounds and hearing

•  Human hearing aspects –  Physiology and psychology

•  Lessons learned

The hardware (outer/middle ear)

Outer ear Middle ear

Pinna

Ear canal

Ear drum

•  The pinna (auricle) –  Aids sound collection –  Does directional filtering –  Holds earrings, etc …

•  The ear canal –  About 25mm x 7mm –  Amplifies sound at ~3kHz by ~10dB –  Helps clarify a lot of sounds!

•  Ear drum –  End of middle ear, start of inner ear –  Transmits sound as a vibration to the inner ear

More hardware (inner ear)

•  Ear drum (tympanum) –  Excites the ossicles (ear bones)

•  Ossicles –  Malleus (hammer), incus (anvil), stapes (stirrup) –  Transfers vibrations from ear drum to the oval window –  Amplify sound by ~14dB (peak at ~1kHz) –  Muscles connected to ossicles control the acoustic

reflex (damping in presence of loud sounds)

•  The oval window –  Transfers vibrations to the cochlea

•  Eustachian tube –  Used for pressure equalization Ear drum

Eustachian tube

Ossicles

Oval window

Cochlea

Auditory nerve

The cochlea •  The “A/D converter”

–  Translates oval window vibrations to a neural signal

–  Fluid filled with the basilar membrane in the middle

–  Each section of the basilar membrane resonates with a different sound frequency

–  Vibrations of the basilar membrane move sections of hair cells which send off neural signals to the brain

•  The cochlea acts like the equalizer display in your stereo

–  Frequency domain decomposition •  Neural signals from the hair cells go to

the auditory nerve

Microscope photograph of hair cells (yellow)

Masking & Critical bands •  When two different sounds excite the same

section of the basilar membrane one is masked •  This is observed at the micro-level

–  E.g. two tones at 150Hz and 170Hz, if one tone is loud enough the other will be inaudible

–  A tone can also hide a noise band when loud enough

•  There are 24 distinct bands throughout the cochlea

–  a.k.a critical bands –  Simultaneous excitation on a band by multiple

sources results in a single source percept •  There is also some temporal masking

–  Preceding sounds mask what’s next •  This is a feature which is taken into advantage

by a lot of audio compression –  Throws away stuff you won’t hear due to masking Masking for close

frequency tones vs distant tones

The neural pathways •  A series of neural stops •  Cochlear nuclei

–  Prepping/distribution of neural data from cochlea •  Superior Olivary Complex

–  Coincidence detection across ear signals –  Localization functions

•  Inferior Colliculus –  Last place where we have most original data –  Probably initiates first auditory images in brain

•  Medial Geniculate Body –  Relays various sound features (frequency, intensity,

etc) to the auditory cortex •  Auditory Cortex

–  Reasoning, recognition, identification, etc –  High-level processing

Superior olivary

complex

Cochlear nuclei

Inferior colliculus

Medial geniculate body

Auditory cortex

Stream of conciousness …

Cochleas

?

Ears

The limits of hearing •  Frequency

–  20Hz to 20kHz (upper limit decreases with age/trauma)

–  Infrasound (< 20Hz) can be felt through skin, also as events

–  Ultrasound (> 20kHz) can be “emotionally” perceived (discomfort, nausea, etc)

•  Loudness –  Low limit is 2x10-10 atm –  0dB SPL to 130dB SPL (but also

frequency dependent) •  A dynamic range of 3x106 to 1!

–  130dB SPL threshold of pain"–  194dB SPL is definition of a shock

wave, sounds stops!"

16 315 53 125 250 5000 1000 2000 4000 8000 16000

Frequency (Hz)

Inte

nsity

(dB

)

-10

0 10

20

30 4

0 50

60

70 8

0 90

100

110

120

130

Speech Music

Audible sounds

Pain!

Inaudibility

Tones at various frequencies, how high can you hear?

Perception of loudness •  Loudness is subjective

–  Perceived loudness changes with frequency

–  Perception of “twice as loud” is not really that!

–  Ditto for equal loudness •  Fletcher-Munson curves

–  Equal loudness perception curves through frequencies

•  Just noticeable difference is about 1dB SLP

•  1kHz to 5kHz are the loudest heard frequencies

–  What the ear canal and ossicles amplify!

•  Low limit shifts up with age!

Perception of pitch •  Pitch is another subjective

(and arbitrary) measure!•  Perception of pitch doubling

doesn’t imply doubling of Hz!–  Mel scale is the perceptual

pitch scale!–  Twice as many Mels

correspond to a perceived pitch doubling!

•  Musically useful range varies from 30Hz to 4kHz!

•  Just noticeable difference is about 0.5% of frequency!–  Varies with training though!

“Pitch is that attribute of !auditory sensation in terms !of which sounds may be !ordered from low to high”! - American National Standards Institute!

Perception of timbre •  Timbre is what distinguishes sounds

outside of loudness & pitch –  Another bogus ANSI description

•  Timbre is dynamical and can have many facets which can often include pitch and loudness variations

–  E.g. music instrument identification is guided largely by intensity fluctuations through time

•  There is not a coherent body of literature examining human timbre perception

–  But there is a huge bibliography on computational timbre perception!

Examples of successive timbre changes. Loudness and pitch

are constant

Gray’s timbre space of musical instruments

So how to we use all that?

•  All these processes are meaningful –  They encapsulate statistics of sounds –  They suggest features to use

•  To make machines that cater to our needs – We need to learn from our perception

A lesson from the cochlea

•  Sounds are not vectors

•  Sounds are “frequency ensembles”

•  That’s the “perceptual feature” we care about

Like this!

–  But how do we get this?

The “simplest” sound

•  Sinusoids are special –  Simplest waveform –  An isolated frequency

•  A sinusoid has three parameters

–  Frequency, amplitude & phase •  s(t) = a(t) sin( f t + φ)

•  This simplicity makes sinusoids an excellent building block for most of time series

0 10 20 30 40 50 60 70 80 90 100 -1 -0.5

0 0.5

1

Making a square wave with sines

Frequency domain representation

•  Time series can be decomposed in terms of “sinusoid presence”

–  See how many sinusoids you can add up to get to a good approximation

–  Informally called the spectrum •  No temporal information in this

representation, only frequency information

–  So a sine with a changing frequency is a smeared spike

•  Not that great of a representation for dynamically changing sounds

0 50 100 -1 0 1

0 20 40 60 0 10 20

20 40 60 80 100 -1 0 1

0 20 40 60 0 2 4 6

0 50 100 -1 0 1

0 20 40 60 0 1 2

Time series Spectrum

Time/frequency representation

•  Many names/varieties –  Spectrogram, sonogram,

periodogram, … •  A time ordered series of

frequency compositions –  Can help show how things move

in both time and frequency •  The most useful representation

so far! –  Reveals information about the

frequency content without sacrificing the time info

0 50 100 -1 0 1

Time

Freq

uenc

y

0 100 200 300 0 0.5

1

Time

Freq

uenc

y

0 100 200 300 0 0.5

1

20 40 60 80 100 -1 0 1

0 50 100 -1 0 1

Time

Freq

uenc

y

0 100 200 300 400 0 0.5

1

Time series Time/Frequency

A real example

0 0.5 1 1.5 2 2.5 3 -1 -0.5

0 0.5

1

Tim

e do

mai

n

Time

Freq

uenc

y

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 0 2000 4000 6000 8000

•  Time domain –  We can see the events –  We don’t know how they

sound like though!

•  Spectrum –  We can see a lot of bass

and few middle freqs –  But where in time are they?

•  Spectrogram –  We can “see” each

individual sound –  And we know how it

sounds like!

0 100 200 300 400 500 600 0 0.5

1 1.5

2

Freq

uenc

y do

mai

n

The Discrete Fourier Transform

•  So how do we get from time domain to frequency domain?

–  It is a matrix multiplication (a rotation in fact)

•  The Fourier matrix is square, orthogonal and has complex-valued elements

•  Multiply a vectorized time-series with the Fourier matrix and voila!

The Fourier matrix (real part)

�

Fj,k =1Neijk2πN =

1N

cosjk2πN

+ isin jk2πN

⎛ ⎝

⎞ ⎠

How does the DFT work?

•  Multiplying with the Fourier matrix –  We dot product each Fourier row vector

with the input –  If two vectors point the same way their

dot product is maximized •  Each Fourier row picks out a single

sinusoid from the signal –  In fact a complex sinusoid –  Since all the Fourier sinusoids are

orthogonal there is no overlap •  The resulting vector contains how

much of each Fourier sinusoid the original vector had in it

The DFT in a little more detail •  The DFT features complex numbers

–  Doesn’t have to, but it is convenient for other things

•  The DFT result for real signals is conjugate symmetric

–  The middle value is the highest frequency (Nyquist)

–  Working towards the edges we traverse all frequencies downwards

–  The two sides are mutually conjugate complex numbers

•  The interesting parts of the DFT are the magnitude and the phase

–  Abs( F) = || F || –  Arg( F) = ∡ F

•  To go back we apply the DFT again (with some scaling)

Real and imaginary parts of the DFT of a sine

Corresponding magnitude and phase

100 200 300 400 500 600 700 800 900 1000 100 200 300

100 200 300 400 500 600 700 800 900 1000 -2 0 2

100 200 300 400 500 600 700 800 900 1000 -200 0

200

100 200 300 400 500 600 700 800 900 1000 -100

0 100

Size of a DFT

•  The bigger the DFT input the more frequency resolution

–  But the more data we need!

•  Zero padding helps –  Stuff a lot of zeros at the

end of the input to make up for few data

–  But we don’t really infuse any more information we just make prettier plots

From the DFT to a spectrogram

•  The spectrogram is a series of consecutive magnitude DFTs on a signal

–  This series is taken off consecutive segments of the input

•  It is best to taper the ends of the segments –  This reduces “fake” broadband noise

estimates •  It is wise to make the segments overlap

–  Due to windowing •  The parameters to use are

–  The DFT size –  The overlap amount –  The windowing function

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 -1 -0.5

0 0.5

1

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 20 40 60 80 100 120

…

2 4 6 8 10 12 14 16 18 20 40 60 80 100 120

Magnitude of DFT of every segment (segments can overlap)

Looks nicer as an image

Tim

e se

ries

of

mag

nitu

de s

pect

ra In

put s

ound

S

pect

rogr

am

Time Fr

eque

ncy

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0 1 2 x 10 4

Time

Freq

uenc

y

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 0 1 2 x 10 4

Why window?

•  Discontinuities at ends cause noise

–  Start and end point must taper to zero

•  Windowing –  Eliminates the sharp edges

that cause broadband noise •  Overlap

–  Since we have windowed we need to take overlapping segments to make up for the attenuated parts of the input

200 400 600 800 1000 1200 -0.06 -0.04 -0.02 0 0.02 0.04

200 400 600 800 1000 1200 -0.04 -0.02 0 0.02 0.04

Non-existent broadband content

Win

dow

ed

Not

win

dow

ed

Not

win

dow

ed

Win

dow

ed

Nasty sharp edges

Time/Frequency tradeoff

•  Heisenberg’s uncertainty principle –  We can’t accurately know both the

frequency and the time position of a wave

–  Also in particle physics with speed and position of a particle

•  Spectrogram problems –  Big DFTs sacrifice temporal resolution –  Small DFTs have lousy frequency

resolution •  We can use a denser overlap to

compensate –  Ok solution, not great

The Fast Fourier Transform (FFT) •  The Fourier matrix is special

–  Many repeating values –  Unique repeating structure

•  We can decompose a Fourier transform to two Fourier transforms of half the size

–  Also includes some twiddling with the data –  Two Fourier smaller transforms are faster

than one big one –  We keep decomposing it until we have a

very small DFT •  This results into a really fast algorithm that

has driven communications forward! –  The constraint is that the transform size is

best if a power of two so that we can decompose it repeatedly

The Fourier matrix, N = 32

Example FFT, N = 8

Emulating the cochlea

•  Using the time/frequency domain

1 2 3 4 5 6 7 8x 104

−0.6−0.4−0.2

00.20.40.60.8

Time (1k samples)

Freq

uenc

y

26 51 77 102 128 154 179 205

•  Take successive Fourier transforms

•  Keep their magnitude

•  Stack them in time

•  Now you can visually compare sounds!

Back to our example

1 2 3 4 5 6 7x 104

−0.6−0.4−0.2

00.20.40.60.8

1 2 3 4 5 6 7x 104

−1

−0.5

0

0.5

Corresponding spectrograms

Amplitude

Time

Spectrogram

5 10 15 20 25 30 35 40 45 50 55

100

200

300

400

500

Amplitude

Time

Spectrogram

5 10 15 20 25 30 35 40 45 50 55

100

200

300

400

500

A lesson from loudness perception

•  We don’t perceive loudness linearly

•  How much louder is the second “test”?

•  The magnitude we plot should be logarithmic, not linear

Log spectrograms

Ampl

itude

Time

Log spectrogram

5 10 15 20 25 30 35 40 45 50 55

100

200

300

400

500Am

plitu

de

Time

Log spectrogram

5 10 15 20 25 30 35 40 45 50 55

100

200

300

400

500

A lesson from pitch perception

•  Frequencies are not “linear” –  Perceived scale is called mel

•  Use that spacing instead –  i.e. warp the frequency axis

“Mel spectra”

Ampl

itude

Time

Log mel spectrogram

5 10 15 20 25 30 35 40 45 50 55

5

10

15

20

25

30

35Am

plitu

de

Time

Log mel spectrogram

5 10 15 20 25 30 35 40 45 50 55

5

10

15

20

25

30

35

One more trick

•  Mel cepstra –  Smooth the log mel spectra using one more

frequency transform (the DCT) Am

plitu

de

Time

Mel cepstra

5 10 15 20 25 30 35 40 45 50 55

5

10

15

20

Ampl

itude

Time

Mel cepstra

5 10 15 20 25 30 35 40 45 50 55

5

10

15

20

Adding some temporal info

•  Deltas and delta-deltas –  In sounds order is important – Using “delta features” we pay attention to change

Coe

ffici

ent

Time

Mel cepstra

5 10 15 20 25 30 35 40 45 50 55

5

10

15

20

25

30

35

Coe

ffici

ent

Time

Mel cepstra

5 10 15 20 25 30 35 40 45 50 55

5

10

15

20

25

30

35

What more is there?

•  Tons! –  Spectral features – Waveform features – Higher level features –  Perceptual parameter features – …

Sound recap

•  Go to time/frequency domain – We do so in the cochlea

•  Frequencies are not linear – We perceive them in another scale

•  Amplitude is not linear either – Use log scale instead

•  Resulting features are used a lot –  Further minor tweaks exist (more later)

Next lecture

•  Principal Component Analysis

•  How to find features automatically

•  How to “compress” data without info loss

audio featuresluthuli.cs.uiuc.edu/~daf/courses/cs-498-daf-ps/lecture 8 - audio... · today’s...

Documents