
Page 1: Speech and Audio Processing  and Coding (cont.)


Speech and Audio Processing and Coding (cont.)

Dr Wenwu Wang

Centre for Vision Speech and Signal Processing

Department of Electronic Engineering

[email protected]

http://personal.ee.surrey.ac.uk/Personal/W.Wang/teaching.html

Page 2: Speech and Audio Processing  and Coding (cont.)

Psychoacoustics

Psychoacoustics is the study of how humans perceive sound, such as:

o Perception of loudness

o Pitch perception

o Space perception

References

o B. C. J. Moore, An Introduction to the Psychology of Hearing, Academic Press, 1995.

o D.M. Howard and J. Angus, Acoustics and Psychoacoustics, Focal Press, 1996.

o W. A. Yost, Fundamentals of Hearing: an Introduction, Academic Press, 1994.

o R. M. Warren, Auditory Perception, Cambridge Univ. Press, 1999.

Page 3: Speech and Audio Processing  and Coding (cont.)

Inner Ear Function

The inner ear consists of the cochlea, which has a snail-like structure.

o It transfers mechanical vibrations into movement of the basilar membrane, which is then converted into nerve firings by the organ of Corti (a structure consisting of a number of hair cells).

o The basilar membrane carries out a frequency analysis of input sounds: it responds best to high frequencies at the (narrow and thin) base end and to low frequencies at the (wide and thick) apex end.

Page 4: Speech and Audio Processing  and Coding (cont.)

Inner Ear Function

(a) The spiral nature of the cochlea

(b) The cochlea unrolled

(c) Vertical cross-section through the cochlea

(d) Detailed view of the cochlea tube

From: (Howard & Angus, 1996)

Page 5: Speech and Audio Processing  and Coding (cont.)

Basilar Membrane

From: (Howard & Angus, 1996)

Idealised shape of unrolled basilar membrane

Page 6: Speech and Audio Processing  and Coding (cont.)

Displacement of Basilar Membrane

From: (Howard & Angus, 1996)

Idealised envelope of basilar membrane movement to sounds at five different frequencies

Page 7: Speech and Audio Processing  and Coding (cont.)

‘Place’ Theory of Hearing

The displacement pattern of the basilar membrane changes as the frequency of the input sound changes.

The basilar membrane is stimulated from the base end, which responds best to high frequencies. It is important to note that its envelope of movement for a pure tone (or an individual component of a complex sound) is not symmetrical: it tails off less rapidly towards higher frequencies than towards lower frequencies.

The linear distance measured from the apex to the place of the maximum basilar membrane displacement is directly proportional to the logarithm of the input frequency.

Page 8: Speech and Audio Processing  and Coding (cont.)

Critical Bands

An illustration of the perceptual changes when two tones are played simultaneously, with the frequency of one pure tone (F1) fixed and that of the other (F2) varied.

From: (Howard & Angus, 1996)

Page 9: Speech and Audio Processing  and Coding (cont.)

Critical Bands (cont)

The discrimination between two frequencies depends on whether or not their basilar membrane displacements are separated.

The frequency difference at which a listener’s perception of two simultaneous pure tones changes from rough and separate to smooth and separate is known as the ‘critical bandwidth’ (CB).

“The critical bandwidth is that bandwidth at which subjective responses rather abruptly change.” (Scharf, 1970)

The ‘equivalent rectangular bandwidth’ (ERB) was proposed as a practical way of applying the notion of critical bandwidth (Moore and Glasberg, 1983):

$\mathrm{ERB} = \{[6.23 \times 10^{-6} f_c^2] + [93.39 \times 10^{-3} f_c] + 28.52\}\ \mathrm{Hz}$, where $f_c$ is the filter centre frequency in Hz.
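As a quick numerical check, here is a minimal Python sketch of the ERB formula above (the function name is ours, not from the slides):

```python
def erb_1983(fc_hz):
    """Equivalent rectangular bandwidth (Hz) of the auditory filter
    centred at fc_hz, after Moore & Glasberg (1983)."""
    return (6.23e-6 * fc_hz ** 2) + (93.39e-3 * fc_hz) + 28.52

# The auditory filter centred at 1 kHz is roughly 128 Hz wide:
print(erb_1983(1000.0))  # ~128.14
```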

Page 10: Speech and Audio Processing  and Coding (cont.)

Critical Bands (cont)

The relationship between the ERB and the filter centre frequency (Howard & Angus, 1996)

Page 11: Speech and Audio Processing  and Coding (cont.)

Critical Bands (cont)

Semitone: the smallest musical interval between notes, defined as the interval between two adjacent notes in a 12-tone scale (e.g. from C to C#). Hence, a semitone equals 100 cents (i.e. a twelfth of an octave).

Octave: the interval between two pitches where one has double the frequency of the other. In other words, the frequency of one note is 12 semitones higher or lower than that of the other. For example, the note A4 is one octave higher than A3, but one octave lower than A5.
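These relationships are easy to express in code. Below is a minimal Python sketch of 12-tone equal temperament, assuming the standard A4 = 440 Hz tuning convention (the function name is ours):

```python
def semitone_ratio(n):
    """Frequency ratio of an interval of n semitones
    in 12-tone equal temperament (100 cents per semitone)."""
    return 2.0 ** (n / 12.0)

A4 = 440.0
print(A4 * semitone_ratio(12))   # A5 = 880 Hz (one octave up)
print(A4 * semitone_ratio(-12))  # A3 = 220 Hz (one octave down)
print(A4 * semitone_ratio(1))    # one semitone (100 cents) above A4: ~466.16 Hz
```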

Page 12: Speech and Audio Processing  and Coding (cont.)

Loudness Perception

The ear's sensitivity to sounds of different frequencies varies over a wide range of sound pressure level (SPL). The minimum SPL that can be detected by the human hearing system, at around 4 kHz, is approximately $10^{-5}$ Pa, while the maximum SPL (i.e. the threshold of pain) is about 20 Pa.

For convenience, in practice, SPL is usually expressed in decibels (dB) relative to a reference pressure of $2 \times 10^{-5}$ Pa.

$\mathrm{SPL\,(dB)} = 20 \log_{10}\!\left(\frac{P_m}{P_r}\right)$

where $P_m$ is the measured sound pressure and $P_r = 2 \times 10^{-5}$ Pa is the reference pressure.

For example, the threshold of hearing at 1 kHz is, in fact, $2 \times 10^{-5}$ Pa, which in dB equals

$20 \log_{10}\!\left(\frac{2 \times 10^{-5}}{2 \times 10^{-5}}\right) = 0\ \mathrm{dB}$

while the threshold of pain is 20 Pa, which in dB equals

$20 \log_{10}\!\left(\frac{20}{2 \times 10^{-5}}\right) = 120\ \mathrm{dB}$
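A minimal Python sketch of the dB SPL conversion above, reproducing both worked examples (the function name is ours):

```python
import math

P_REF = 2e-5  # reference pressure: 2 x 10^-5 Pa (20 micropascals)

def spl_db(p_measured_pa):
    """Sound pressure level in dB relative to 2 x 10^-5 Pa."""
    return 20.0 * math.log10(p_measured_pa / P_REF)

print(spl_db(2e-5))  # threshold of hearing at 1 kHz: 0 dB
print(spl_db(20.0))  # threshold of pain: 120 dB
```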

Page 13: Speech and Audio Processing  and Coding (cont.)

Loudness Perception (cont.)

The perceived loudness of an acoustic sound is related to its amplitude (though not by a simple one-to-one relationship), as well as to the context and nature of the sound.

As the sensitivity of our hearing system varies as the frequency changes, it is possible for a sound with a larger pressure amplitude to be heard as quieter than a sound with a lower pressure amplitude (for example, if they are at different frequencies). [recall the equal loudness contour of the human auditory system shown in the first lecture]

Page 14: Speech and Audio Processing  and Coding (cont.)

Demos for Loudness Perception

Decibels vs Loudness

o A 440 Hz tone (i.e. note A4) is played, then reduced by 1 dB at each step

o A 440 Hz tone (i.e. note A4) is played, then reduced by 3 dB at each step

o A 440 Hz tone (i.e. note A4) is played, then reduced by 5 dB at each step

Intensity vs Loudness

o Various frequencies played at a constant SPL

o A reference tone is played, then the same tone 5 dB higher; then the reference tone again, followed by the tone 8 dB higher; and finally the reference tone followed by the tone 10 dB higher

Resources: Audio Box CD from Univ. of Victoria

Page 15: Speech and Audio Processing  and Coding (cont.)

Pitch Perception

Page 16: Speech and Audio Processing  and Coding (cont.)

Pitch

What is pitch? Pitch:

• is “the attribute of auditory sensation in terms of which sounds may be ordered on a musical scale extending from low to high” (American Standard Association, 1960)

• is a “subjective” attribute, and cannot be measured directly. A specific pitch value is therefore usually given as the frequency of a pure tone that has the same subjective pitch as the sound. In other words, the measurement of pitch requires a human listener (the “subject”) to make a perceptual judgement. This is in contrast to the measurement in the laboratory of, for example, the fundamental frequency of a complex tone, which is an “objective” measurement. (Howard & Angus, 1996)

• is related to the repetition rate of the waveform of a sound, so it corresponds to the frequency of a pure tone and to the fundamental frequency of a complex tone. In general, sounds having a periodic acoustic pressure variation with time are perceived as pitched sounds, while sounds with a non-periodic acoustic pressure waveform are perceived as non-pitched. (Howard & Angus, 1996)

Page 17: Speech and Audio Processing  and Coding (cont.)

Pitch

Comparison of pitched and non-pitched sounds (Howard & Angus, 1996):

o Waveform (time domain): pitched sounds are periodic (regular repetitions); non-pitched sounds are non-periodic (no regular repetitions)

o Spectrum (frequency domain): pitched sounds have a line spectrum (harmonic components); non-pitched sounds have a continuous spectrum (no harmonic components)

Page 18: Speech and Audio Processing  and Coding (cont.)

Pitch

Examples of pitched sounds (see the figures in “Musical Note and its Fundamental Frequency”) and non-pitched sounds (see the figure below: the waveform and spectrum of a drum being brushed; Howard & Angus, 1996).

Page 19: Speech and Audio Processing  and Coding (cont.)

Existing Pitch Perception Theories

‘Place’ theory

Spectral analysis is performed on the stimulus in the inner ear: different frequency components of the input sound excite different places (positions) along the basilar membrane, and hence neurones with different centre frequencies.

‘Temporal’ theory

Pitch corresponds to the time pattern of the neural impulses evoked by the stimulus. Nerve firings tend to occur at a particular phase of the stimulating waveform, so the intervals between successive neural impulses approximate integer multiples of the period of the stimulating waveform.

Page 20: Speech and Audio Processing  and Coding (cont.)

Place Theory

Three methods are commonly used for finding the value of f0 based on a place analysis of the frequency components of the input sound:

Method 1: locate the f0 component itself.

Method 2: find the minimum frequency difference between adjacent harmonics, i.e. (n+1)*f0 – n*f0 = f0.

Method 3: find the highest common factor of the frequency components that are present in the input sound.
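A minimal Python sketch of Method 3, assuming idealised integer component frequencies in Hz (real measured components would need a tolerance):

```python
from functools import reduce
from math import gcd

def f0_highest_common_factor(components_hz):
    """Method 3: estimate f0 as the highest common factor
    of the component frequencies (idealised integers, in Hz)."""
    return reduce(gcd, components_hz)

print(f0_highest_common_factor([100, 200, 300]))  # 100 Hz: f0 present
print(f0_highest_common_factor([200, 300, 400]))  # 100 Hz: f0 absent, pitch unchanged
```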

Page 21: Speech and Audio Processing  and Coding (cont.)

Place Theory (cont) Method 1:

• Suggests that the pitch of a sound corresponds to the place stimulated by the lowest frequency component, i.e. fundamental frequency f0.

• Assumes that f0 is always present in the sound. For example, as stated by Ohm: “a pitch corresponding to a certain frequency can only be heard if the acoustic wave contains power at that frequency”.

Exceptional case:

• Schouten (1940) demonstrated that even when f0 was removed from a pulse wave, its pitch remained the same.

• Therefore, f0 doesn’t have to be present for pitch perception. Also, the lowest frequency component is not the basis for pitch perception.

Page 22: Speech and Audio Processing  and Coding (cont.)

Place Theory (cont) Method 2:

• Suggests that whether or not the fundamental frequency f0 is present, some adjacent harmonics, provided that they exist, should be used as a basis for pitch perception.

• For most musical sounds, adjacent harmonics are indeed present.

Exceptional case:

• As shown in the figure below, when f0 is present (or absent), the differences between adjacent frequency components are f0, 2f0, 2f0, etc. (or 3f0, 2f0, 2f0, etc.), yet the perceived pitch does not change.

(Howard & Angus, 1996)

Page 23: Speech and Audio Processing  and Coding (cont.)

Place Theory Method 3:

• The highest common factor is the highest value appearing in every row of the place analysis table below, where, as an example, f0 = 100 Hz.

• It can address the exceptional cases in both Method 1 and Method 2.

(Howard & Angus, 1996)

Page 24: Speech and Audio Processing  and Coding (cont.)

Place Theory Method 3:

• Another example from Schouten uses the analysis table to interpret pitch perception for a non-harmonic sound. For a sound whose component frequencies were 1040 Hz, 1240 Hz and 1440 Hz, the perceived pitch was found to be approximately 207 Hz. Using Method 2, the pitch would be the spacing between these components, and hence 200 Hz.

• Using the processing table (shown on the next page), the highest common factor would be approximately 207 Hz, an average of 208 Hz, 207 Hz and 206 Hz, for which the components are the 5th, 6th and 7th harmonics respectively. The pitch perceived in such a situation is referred to as “residue pitch”, “pitch of the residue”, or “virtual pitch”. In fact, the true fundamental frequency of these components is 40 Hz, of which they are the 26th, 31st and 36th harmonics respectively. It seems that the pitch found by the auditory system is based on treating the components as adjacent harmonics.
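A minimal numerical check of the residue-pitch example above in Python, treating the components as the 5th, 6th and 7th harmonics:

```python
components = [1040.0, 1240.0, 1440.0]
harmonic_numbers = [5, 6, 7]  # adjacent harmonic numbers that best fit

# Each component implies a fundamental of f / n:
implied_f0 = [f / n for f, n in zip(components, harmonic_numbers)]
print(implied_f0)                         # ~[208.0, 206.7, 205.7] Hz
print(sum(implied_f0) / len(implied_f0))  # ~206.8 Hz, close to the ~207 Hz heard
```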

Page 25: Speech and Audio Processing  and Coding (cont.)

Place Theory

(Howard & Angus, 1996)

Page 26: Speech and Audio Processing  and Coding (cont.)

Problems with the Place Theory

Although it provides a basis for understanding how f0 is found in terms of frequency analysis, it does not explain (Howard & Angus, 1996):

• The discrimination of frequency differences in pitch perception. [To discuss]

• The pitch perception of sounds with frequency components that cannot be resolved by the place mechanism of the basilar membrane. [In general, no harmonic above about the 5th to 7th is resolved for any fundamental frequency, because the critical bandwidth at the centre frequencies of these harmonics is wider than the fundamental frequency.]

• The pitch perceived for some sounds which have non-harmonic (i.e. continuous) spectra. [For example, most listeners rate the ‘ss’ in “sea” as having a higher pitch than the ‘sh’ in “shell”, as the energy of ‘sh’ is biased more towards the lower frequencies, with a peak around 2.5 kHz, compared with a peak around 5 kHz for ‘ss’. See the figure on the next page.]

• Pitch perception for sounds with a fundamental frequency below 50 Hz. [This is because the pattern of vibration on the basilar membrane does not seem to change in that region.]

Page 27: Speech and Audio Processing  and Coding (cont.)

‘ss’ versus ‘sh’

Page 28: Speech and Audio Processing  and Coding (cont.)

Frequency Discrimination

The frequency difference limen (DL), sometimes called the just noticeable difference (JND), is the smallest detectable change in frequency. Two methods have been used to measure the DL:

DLF - The subject is asked to judge which of two tones has the higher pitch. This method was used by Henning (1970), Moore (1973), etc. It was found that, expressed in Hz, the DLF is smallest at low frequencies and increases monotonically with increasing frequency; expressed as a proportion of centre frequency, it tends to be smallest for middle frequencies and larger for very high and very low frequencies.

FMDL - Tones which are frequency modulated (FM) at a low rate (typically 2-4 Hz) are used for the measurement. This method was used by Shower & Biddulph (1931). FMDLs seem to vary less with frequency than DLFs, and both get smaller as the sound level increases.

Page 29: Speech and Audio Processing  and Coding (cont.)

Frequency Discrimination

The frequency discrimination thresholds change with centre frequency, plotted below as log(threshold) versus the square root of centre frequency:

Frequency discrimination thresholds measured by several different authors; all measured DLFs except S & B, who measured FMDLs (figure first published by Wier et al., 1977, and reproduced in Moore, 1995)

Page 30: Speech and Audio Processing  and Coding (cont.)

Temporal Theory

This theory is based on the fact that the waveform of an acoustic signal with a strong pitch is periodic.

The theory suggests that it is the detailed nature of the actual waveform exciting the different places along the basilar membrane that matters. Pitch therefore depends on the timing of the neural firings generated in the organ of Corti in response to vibrations of the basilar membrane.

It can be simulated by a bank of band-pass filters whose centre frequencies and bandwidths vary according to the critical bandwidth of the human hearing system (a minimal simulation is sketched below).

Nerve fibres fire at all places along the basilar membrane, and a given nerve fibre may only fire at one phase or instant in each cycle of the stimulating waveform. This process is known as phase locking.

Due to phase locking, the time between firings for any particular nerve will always be an integer multiple of the period of the stimulus. At each place, a number of nerves are involved.

(Howard & Angus, 1996)
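A minimal Python sketch of such a filterbank, assuming SciPy is available and reusing the ERB formula given earlier (function names and parameter choices are ours, not from the slides):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def erb_1983(fc_hz):
    """ERB (Hz) of the auditory filter centred at fc_hz (Moore & Glasberg, 1983)."""
    return (6.23e-6 * fc_hz ** 2) + (93.39e-3 * fc_hz) + 28.52

def auditory_filterbank(x, fs, centre_freqs_hz):
    """Filter x (sampled at fs Hz) through band-pass filters whose
    bandwidths follow the ERB at each centre frequency."""
    outputs = []
    for fc in centre_freqs_hz:
        bw = erb_1983(fc)
        sos = butter(4, [fc - bw / 2.0, fc + bw / 2.0],
                     btype="bandpass", fs=fs, output="sos")
        outputs.append(sosfiltfilt(sos, x))
    return outputs

# Example: a crude stand-in for violin note C4 (f0 = 261.6 Hz),
# built from its first six harmonics.
fs = 16000
t = np.arange(fs) / fs
x = sum(np.sin(2 * np.pi * 261.6 * k * t) for k in range(1, 7))
bands = auditory_filterbank(x, fs, [261.6 * k for k in range(1, 7)])
```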

Page 31: Speech and Audio Processing  and Coding (cont.)

Simulation of Temporal Theory

(Howard & Angus, 1996)

Band-pass filtering of note C4 played on a violin, whose f0 is 261.6Hz.

Page 32: Speech and Audio Processing  and Coding (cont.)

Simulation of Temporal Theory

(Howard & Angus, 1996)

The first six harmonics (around 260, 520, 780, 1040, 1300 and 1560Hz) are well resolved by the band-pass filters, and therefore can be explained by the place theory.

The output waveforms of filters with centre frequencies above the sixth harmonic are not sinusoidal, since those harmonics are not resolved individually: the filter bandwidth is wider than the fundamental frequency.

When two components close in frequency are combined, they produce a beat waveform if both components are harmonics of some fundamental frequency. The beat frequency is equal to f0, as shown in the filter outputs above 1.5 kHz in the figure on the previous page.

The minimum time between firings (i.e. one period of the stimulus) can be inferred from the filter output (it is the period of the lower harmonics and the period of the input wave itself).

Note that, although a nerve does not necessarily fire in every cycle, and the cycle in which it fires tends to be random, due to phase locking the time between firings for any particular nerve will always be an integer multiple of the period of the stimulating waveform.

Page 33: Speech and Audio Processing  and Coding (cont.)

Nerve Firing

An illustration of nerve firing along the basilar membrane for the first 16 harmonics of an input sound.

(Howard & Angus, 1996)

Page 34: Speech and Audio Processing  and Coding (cont.)

Problems with Temporal Theory

Although it provides a basis for understanding how the fundamental period could be found from an analysis of the timing of nerve firings across all places on the basilar membrane, it cannot explain the following:

• Pitch perception of sounds whose f0 is higher than 5 kHz. [This is because phase locking breaks down above 5 kHz.]

• In practice, this also means that for such sounds only approximately two harmonics are available for analysis, due to the upper frequency limit of the human hearing system (about 20 kHz).

(Howard & Angus, 1996)

Page 35: Speech and Audio Processing  and Coding (cont.)

Contemporary Theory

Neither theory on its own fully explains the mechanism of human pitch perception. A combination of both theories benefits the analysis of pitch perception, as in the model proposed by Moore (1982) for complex tones, shown below. (Howard & Angus, 1996)

Page 36: Speech and Audio Processing  and Coding (cont.)

Musical Intervals (Melody)

While one tone evokes a pitch, a sequence of tones with appropriate frequencies can evoke the perception of a musical interval (or melody).

A sequence of tones below 5kHz evokes a sense of melody, while a sequence of tones above 5kHz does not evoke a clear sense of melody, although different frequencies can be heard. (Moore, 1989)

For example, two tones which are separated in frequency by an interval of one octave (i.e. one has twice the frequency of the other) sound similar. Hence, they are judged to have the same name on the musical scale (for example, C or D).

It appears that the musical interval of an octave is only clearly perceived when both tones are below 5 kHz. Above 5 kHz, a sequence of pure tones does not produce a clear sense of melody, as shown by Attneave and Olson, 1971.

Page 37: Speech and Audio Processing  and Coding (cont.)

Pitch versus Sound Level

The pitch of a pure tone is determined mainly by its frequency, but also, slightly, by its sound level.

On average, the pitch of tones below about 2kHz decreases with increasing sound level, while the pitch of tones above about 4kHz increases with increasing sound level. (Moore, 1989)

For tones between 1 and 2kHz, changes in pitch with level are generally less than 1%, while for tones of lower and higher frequencies, the changes can be larger (up to 5%). (Verschuure and van Meeteren, 1975; Moore, 1989)

Page 38: Speech and Audio Processing  and Coding (cont.)

Musical Notes

Notes played by musical instruments have different pitches.

As the sensitivity of our hearing system varies as the frequency changes, it is possible for a sound with a larger pressure amplitude to be heard as quieter than a sound with a lower pressure amplitude (for example, if they are at different frequencies). [recall the equal loudness contour of the human auditory system shown in the first lecture]

Page 39: Speech and Audio Processing  and Coding (cont.)

Musical Note and its Fundamental Frequency (Waveform of A4)

$f_0 = \frac{1}{T} = \frac{1}{2.28\ \mathrm{ms}} \approx 440\ \mathrm{Hz}$ (note A4)

(Howard & Angus, 1996)

Page 40: Speech and Audio Processing  and Coding (cont.)

Musical Note and its Harmonics (Spectrum of A4)

Page 41: Speech and Audio Processing  and Coding (cont.)

Musical Note and its Harmonics

The shape of the waveform and of the spectrum of each of the notes played by the four different instruments shown on the previous page is different, even though they are all perceived as note A4 (i.e. they have the same fundamental frequency). It is the so-called “timbre” that distinguishes the four different musical instruments.

The frequency components of notes produced by any pitched instrument are called harmonics, and they are integer multiples of the fundamental frequency f0. The first harmonic is therefore the fundamental frequency f0, the 2nd harmonic is 2f0, the third 3f0, etc.

Another term also used by many authors is “overtones”. The first overtone refers to the first frequency component above f0, which is the second harmonic, i.e. 2f0. For example, for the note A4 played by a violin with f0 = 440.5 Hz, the first harmonic is 440.5 Hz and the first overtone is 881.0 Hz.

Page 42: Speech and Audio Processing  and Coding (cont.)

Demos for Pitch Perception

Resources: Audio Box CD from Univ. of Victoria

These three demos show how pitch is perceived for signals of different durations. In each track, short bursts of sound are played. Three different pitches are played across the three tracks.

Page 43: Speech and Audio Processing  and Coding (cont.)

Space Perception

Page 44: Speech and Audio Processing  and Coding (cont.)

Sound Localisation

Sound localisation refers to judgements of the direction and distance of a sound source, usually achieved through the use of two ears (binaural hearing).

It helps humans and animals locate the sounds of threats and avoid them.

It helps humans and animals direct visual attention.

It helps humans and animals focus attention on sounds from specific directions by excluding other interfering sounds in noisy and reverberant environments.

Blind people, in particular, can use information from echoes and reflections to estimate the distance of sound sources.

Although binaural hearing is crucial for sound localisation, monaural perception is similarly effective in some cases, such as in the detection of signals in quiet, intensity discrimination, and frequency discrimination.

Page 45: Speech and Audio Processing  and Coding (cont.)

Localisation Cues

There are two important cues that enable us to localise sounds:

interaural time difference

interaural intensity difference

Page 46: Speech and Audio Processing  and Coding (cont.)

Interaural Time Difference (ITD)

The two ears are separated by the width of the head. For an average head, the distance between the ears is about 18 cm. As a result, there is a time difference between the sound reaching the ear nearer the source and the one further away. This difference is called the interaural time difference (ITD).

A simple, rough model for calculating the ITD is given below; it assumes that the travel of the sound around the head can be ignored:

$\Delta t = \frac{d \sin(\theta)}{c}$

where $\Delta t$ is the ITD (in s), $d$ the distance between the ears (in m), $\theta$ the angle of arrival of the sound from the median plane (in radians), and $c$ the speed of sound (in m/s).
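A minimal Python sketch of this rough model, with d = 0.18 m and c = 344 m/s taken from the slides:

```python
import math

def itd_simple(theta_rad, d=0.18, c=344.0):
    """Rough ITD model ignoring travel around the head:
    delta_t = d * sin(theta) / c."""
    return d * math.sin(theta_rad) / c

print(itd_simple(math.radians(90)))  # ~5.2e-4 s for a source at 90 degrees
```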

Page 47: Speech and Audio Processing  and Coding (cont.)

Interaural Time Difference (ITD)

(Howard & Angus, 1996)

Page 48: Speech and Audio Processing  and Coding (cont.)

Interaural Time Difference (ITD)

However, in reality the sound has to travel around the head in order to reach the further ear.

A more accurate model for calculating the ITD, which assumes that the head is spherical, is given below. From this equation it can be shown that, for the average head, the maximum ITD occurs at 90 degrees:

$\Delta t = \frac{r(\theta + \sin(\theta))}{c}$

where $\Delta t$ is the ITD (in s), $r$ half the distance between the ears (in m), $\theta$ the angle of arrival of the sound from the median plane (in radians), and $c$ the speed of sound (in m/s).

$\Delta t_{\max} = 6.73 \times 10^{-4}\ \mathrm{s} = 673\ \mu\mathrm{s}$
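A minimal Python sketch of the spherical-head model, reproducing the maximum ITD (r = 0.09 m and c = 344 m/s taken from the slides):

```python
import math

def itd_spherical(theta_rad, r=0.09, c=344.0):
    """Spherical-head ITD model: delta_t = r * (theta + sin(theta)) / c."""
    return r * (theta_rad + math.sin(theta_rad)) / c

print(itd_spherical(math.pi / 2))  # ~6.73e-4 s: the maximum ITD, at 90 degrees
```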

Page 49: Speech and Audio Processing  and Coding (cont.)

Interaural Time Difference (ITD)

(Howard & Angus, 1996)

Page 50: Speech and Audio Processing  and Coding (cont.)

ITD as a Function of Angle

(Howard & Angus, 1996)

Page 51: Speech and Audio Processing  and Coding (cont.)

ITD and IPD

The ear appears to use the interaural phase difference (IPD), caused by the ITD between the two waves, to resolve the sound direction.

The phase difference is given by:

$\varphi = \frac{2\pi f\, r(\theta + \sin(\theta))}{c}$

where $\varphi$ is the phase difference between the two ears (in radians), $r$ half the distance between the ears (in m), $\theta$ the angle of arrival of the sound from the median plane (in radians), $f$ the frequency (in Hz), and $c$ the speed of sound (in m/s).

When the phase difference is greater than 180 degrees, there is an unresolvable ambiguity in the sound direction, as the source could lie either to the left or to the right.
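A minimal Python sketch of the IPD and its ambiguity condition (r = 0.09 m and c = 344 m/s as before):

```python
import math

def ipd(f_hz, theta_rad, r=0.09, c=344.0):
    """Interaural phase difference: phi = 2*pi*f * ITD(theta)."""
    return 2.0 * math.pi * f_hz * r * (theta_rad + math.sin(theta_rad)) / c

# Beyond pi radians (180 degrees) the direction becomes ambiguous:
print(ipd(500.0, math.pi / 2) > math.pi)   # False: unambiguous at 500 Hz
print(ipd(1000.0, math.pi / 2) > math.pi)  # True: ambiguous at 1 kHz
```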

Page 52: Speech and Audio Processing  and Coding (cont.)

ITD and IPD (cont)

The maximum frequency without phase ambiguity, at a particular angle, is given by:

$f_{\max} = \frac{1}{2\,\Delta t(\theta)} = \frac{c}{2 r(\theta + \sin(\theta))}$

For an angle of 90° and an average-sized head ($r = 0.09$ m):

$f_{\max} = \frac{344}{2 \times 0.09 \times (\pi/2 + \sin(\pi/2))} \approx 743\ \mathrm{Hz}$

The ambiguous frequency limit is higher at smaller angles. For frequencies above this maximum, other cues, such as the interaural intensity difference (IID), are used by the human auditory system to resolve the direction of sound sources.
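The same worked example in Python, confirming the 743 Hz figure:

```python
import math

def f_max(theta_rad, r=0.09, c=344.0):
    """Highest frequency free of phase ambiguity at a given angle:
    f_max = c / (2 * r * (theta + sin(theta)))."""
    return c / (2.0 * r * (theta_rad + math.sin(theta_rad)))

print(f_max(math.pi / 2))  # ~743 Hz for a source at 90 degrees
```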

Page 53: Speech and Audio Processing  and Coding (cont.)

Interaural Intensity Difference (IID)

Due to the shading effect of the head, the intensity of the sound reaching each ear is also different. This difference is called the interaural intensity difference (IID).

When the sound source is on the median plane, the sound level at each ear is equal; as the source moves away from the median plane, the level at one ear progressively reduces while the level at the other increases.

The shading effect of the head is difficult to calculate; however, experiments seem to show that the intensity ratio between the two ears varies sinusoidally with the sound direction angle, from 0 dB up to 20 dB, for various frequencies.

The shading effect is not significant unless the head spans about one third of a wavelength or more. For a head with a diameter of 18 cm, this corresponds to a minimum frequency (Howard & Angus, 1996) of:

$f_{\min} = \frac{c}{3d} = \frac{344\ \mathrm{m/s}}{3 \times 0.18\ \mathrm{m}} \approx 637\ \mathrm{Hz}$
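The corresponding one-line check in Python (d = 0.18 m, c = 344 m/s from the slides):

```python
def f_min_shading(d=0.18, c=344.0):
    """Lowest frequency at which head shading becomes significant,
    i.e. where the head spans about one third of a wavelength."""
    return c / (3.0 * d)

print(f_min_shading())  # ~637 Hz for an 18 cm head
```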

Page 54: Speech and Audio Processing  and Coding (cont.)

Shading Effect in IID


(Howard & Angus, 1996)

Page 55: Speech and Audio Processing  and Coding (cont.)

IID as a Function of Angle and Frequency


(Data from Gulick, 1971, reproduced from Howard & Angus, 1996)

Page 56: Speech and Audio Processing  and Coding (cont.)

ITD and IID Trading


Both ITD and IID are used for the perception of sound source direction, although in fact one cue can be confused with (or cancelled by) the other. This is known as ITD and IID trading.

The time-delay versus intensity trading is effective over the range of delay times up to the maximum interaural time delay of 0.673 ms.

For delays between 0.673 ms and 30 ms, small intensity differences will not alter the perceived direction of the sound source. However, if the delayed sound's intensity is more than 12 dB greater than that of the earlier-arriving sound, we perceive the direction of the sound to be towards the delayed sound.

For delays of more than 30 ms, the delayed sound is perceived as an echo.

Therefore, it is possible to determine the direction of the sound source based purely on either ITD or IID.
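A minimal Python encoding of the trading regions described above (the thresholds are from the slides; the category labels are ours):

```python
def trading_effect(delay_ms, level_diff_db):
    """Rough classification of the ITD/IID trading regions:
    delay of the later sound vs. its level advantage in dB."""
    if delay_ms <= 0.673:
        return "trading region: delay and intensity cues interact"
    if delay_ms <= 30.0:
        if level_diff_db > 12.0:
            return "direction pulled towards the delayed sound"
        return "small level differences do not alter perceived direction"
    return "delayed sound heard as a separate echo"

print(trading_effect(10.0, 15.0))  # direction pulled towards the delayed sound
```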

Page 57: Speech and Audio Processing  and Coding (cont.)

ITD and IID Trading


(Data from Madsen, 1990, reproduced from Howard & Angus, 1996)