
Physics of Ventriloquism
or
Can you really speak with your stomach?

Jörg Metzner*, Marcel Schmittfull**

June 2005

Contents

1 Introduction
2 The formation of human speech sounds
  2.1 Physical-acoustic description of glottis and vocal tract
  2.2 Formants and comparison of sounds
  2.3 Ventriloquism
3 Vocal tract models
  3.1 Simple tube model
    3.1.1 Short description
    3.1.2 Calculating the transfer function
    3.1.3 Geometrically different tube configurations which are perceived as the same
  3.2 Birkholz’ 3D model
4 Model-based simulation of substitute sounds
  4.1 The plosives [b] and [p]
  4.2 The nasal [m]
  4.3 The fricatives [f] and [v]
5 Simulation of sounds in a real experiment with the help of a plaster model
6 Analysis of recordings of ventriloquists
  6.1 The sound transitions [ba] with lips and [b’a] without lips
  6.2 The sound transitions [pa] with lips and [p’a] without lips
  6.3 Comparison with sound transitions produced by the model and error analysis
7 Conclusion and outlook

* Friedrichsdorf, Germany. Mail: [email protected]
** Geldersheim, Germany. Mail: [email protected]
The latest version can be obtained from http://japtik.sf.net/bauchreden.


1 Introduction

Ventriloquism may be thought of as a form of speech without lip or jaw movement. The missing lip movements are compensated for by adroit movements of other articulators, such as the tongue, in the vocal tract¹. Particular difficulties arise with sounds which require the lips to be closed or almost closed; in German these labial sounds are [b], [p], [m], [f] and [v]. Ventriloquists replace them with substitute sounds which are as similar as possible to those desired.

In this project we investigate this fascinating art further, firstly by raising the general question of how, and indeed whether, perceptively equal sounds² may be produced in different positions in the vocal tract. We then reproduce the substitute sounds as authentically as possible using an articulatory vocal tract model and a real plaster model. Finally, we compare these with recordings of a professional ventriloquist.

The last time ventriloquism was examined physically was in the 1920s, and the only objective of these examinations was to show that ventriloquism (etymologically, speaking from the stomach) is independent of the stomach. A precise analysis of the vocal tract was not carried out. The only literature on ventriloquism up to now has been in the form of popular science, usually only available from second-hand bookshops.³

The subject matter belongs to the acoustics branch of articulatory speech synthesis, which examines how speech sounds are produced. Our examination is primarily focused on basic research on articulatory synthesis, but other applications are conceivable: for ventriloquists (in particular for optimising substitute sounds), or to help with conditions which restrict lip movement, such as dysarthria deriving from ALS or Parkinson's disease, or burn injuries.

2 The formation of human speech sounds

In the production of sounds, we must distinguish between producing sounds at the vocal folds and modifying these sounds in the remainder of the vocal tract.

Air rises from the lungs towards the vocal folds. The pressure built up is so strong that it causes a brief opening of the vocal folds; the opening between them is referred to as the glottis. During the opening, the air escapes into the vocal tract and the pressure on the vocal folds is reduced, so that the glottis closes again until the pressure once more becomes sufficient to open it. Thus, the glottis opens and closes periodically, and this produces a sound, the excitation.

The various articulators (tongue, velum⁴, jaw, lips) determine the geometry of the vocal tract. The further modification of the glottal excitation sound depends on this geometry. In order to observe the influence of vocal tract geometry on sound formation, various models can be used. In Sections 3.1 and 3.2 we will refer in detail to the simple tube model and the enhanced Mermelstein model as used by Birkholz.

¹ The vocal tract designates the cavity between the vocal folds and the lips.
² Perceptively equal means that the human ear perceives two sounds as being almost identical.
³ The only exception to this, according to our research and that of Patrick Martin, a ventriloquist of many years' standing, is [Boc95]. [Vox] is not available in bookshops but only from a casino in Switzerland (or in our case via a university association's interlibrary loans).
⁴ The velum, also known as the soft palate, is found in the area connecting the nasal and vocal tracts. For nasal sounds it opens the nasal tract towards the vocal tract. For non-nasal sounds it blocks this opening.


2.1 Physical-acoustic description of glottis and vocal tract

If the glottis' excitation function is regarded as the source and the transfer function of the vocal tract is regarded as a filter, the overall system can be seen as a source-filter model; see Fig. 1.

Fig. 1: Source-Filter Model, see [PM95].

Glottis

The sound waves produced by the periodic opening and closing of the glottis have an acoustic pressure p, which reflects the local change in air pressure as compared with normal pressure. In addition to acoustic pressure, the sound wave can also be described by the sound velocity v (the particle velocity), which indicates the oscillation of the individual air particles. If v is multiplied by the cross-sectional area A of the tube, this yields the sound flow or volume velocity u = v·A. The ratio of acoustic pressure to volume velocity, p/u, is called the acoustic impedance.

Vocal tract

A signal x(t) proceeds from the glottis to the vocal tract. This incoming signal can be described by a linear combination of Dirac impulses δ(t). The response of the vocal tract to a single such impulse is the system's impulse response h(t). The outgoing signal y(t) is then obtained mathematically as the convolution y(t) = x(t) * h(t) of the incoming signal with the impulse response (see [Wer00]). It is particularly important that the impulse response completely describes the acoustic system of the vocal tract.
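As a minimal sketch of the convolution above, the discrete counterpart of y(t) = x(t) * h(t) can be tried out in Python with NumPy. The impulse train and the three-sample impulse response are made-up toy values, and the function name `apply_impulse_response` is chosen only for this illustration.

```python
import numpy as np

def apply_impulse_response(x, h):
    """Discrete counterpart of y(t) = x(t) * h(t): full linear convolution."""
    return np.convolve(x, h)

# Hypothetical toy excitation: three unit impulses, five samples apart.
x = np.zeros(12)
x[[0, 5, 10]] = 1.0

# Hypothetical toy impulse response: a short decaying sequence.
h = np.array([1.0, 0.5, 0.25])

y = apply_impulse_response(x, h)
```

Each impulse of the toy excitation is replaced by a copy of h in the output, which is what the description of the input as a linear combination of impulses suggests.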

While the impulse response describes the output y(t) in the time domain, the so-called transfer function H(ω) describes it in the frequency domain, i.e. as a function of ω: applied to the incoming signal X(ω), it yields the output Y(ω). For a computer analysis the continuous output y(t) must be sampled using a particular sampling frequency $f_A$, i.e. y(t) becomes a discrete function with values at time points spaced $T_A = 1/f_A$ apart. According to the sampling theorem (see [Hes98], [Wer00]), the sampling frequency must be at least double the highest frequency occurring in the signal to be analysed in order to ensure that the frequencies are unambiguous. With a sampling rate of e.g. 44,100 Hz (CD quality), only frequencies up to a maximum of 22,050 Hz can be represented.
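The ambiguity that the sampling theorem guards against can be illustrated with a short sketch. The helper `apparent_frequency` and the 30 kHz test tone are assumptions made for this example only; they do not come from the project's own analysis.

```python
import numpy as np

def apparent_frequency(f, fs):
    """Frequency in [0, fs/2] at which a sinusoid of frequency f shows up after sampling at fs."""
    f_mod = f % fs
    return min(f_mod, fs - f_mod)

fs = 44100                      # sampling frequency in Hz (CD quality)
n = np.arange(200)              # sample indices

# A tone above fs/2 = 22050 Hz and the lower tone it is mistaken for.
tone_30k = np.cos(2 * np.pi * 30000 * n / fs)
tone_alias = np.cos(2 * np.pi * apparent_frequency(30000, fs) * n / fs)

# After sampling, the two sequences are numerically indistinguishable.
samples_coincide = np.allclose(tone_30k, tone_alias)
```

In this sketch the 30,000 Hz tone yields the same samples as a 14,100 Hz tone, so only components below half the sampling frequency can be identified unambiguously.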

In order to analyse a signal, the frequency-domain representation X(ω) is usually used. This representation is referred to as the signal's spectrum, since, in visual terms, it shows how strongly the individual frequencies are present in the signal. This leads to the representation of a periodic signal x(t) as a Fourier series ([Pet04], [Mil99])

$$x(t) = \sum_{k=-\infty}^{\infty} \alpha_k\, e^{jk\omega_0 t}, \qquad \alpha_k = \frac{1}{T}\int_0^T x(t)\, e^{-jk\omega_0 t}\, dt, \tag{2.1}$$

with the fundamental frequency $\omega_0$. The harmonics $e^{jk\omega_0 t} = \cos k\omega_0 t + j \sin k\omega_0 t$ form a complete orthogonal system. The lower limit of the integral is arbitrary as long as the integration covers one full period $T = 2\pi/\omega_0$. The phase of the complex $\alpha_k$ describes the possible phase displacements of the basis functions $e^{jk\omega_0 t}$ in (2.1), while the absolute values $|\alpha_k|$ stand for the amplitudes. The values of $|\alpha_k|$ are shown in a line spectrum, usually on the logarithmic decibel scale, plotted against the individual ω.
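The coefficients of (2.1) and the resulting line spectrum can be checked numerically. The following is only a sketch: the test signal, its 10 ms period and the function name `fourier_coefficient` are hypothetical choices, and the integral is approximated by a Riemann sum over one period.

```python
import numpy as np

def fourier_coefficient(x, k, T, n=1024):
    """Approximate alpha_k = (1/T) * integral_0^T x(t) exp(-j k w0 t) dt by a Riemann sum."""
    t = np.arange(n) * T / n
    w0 = 2 * np.pi / T
    return np.mean(x(t) * np.exp(-1j * k * w0 * t))

T = 0.01                        # hypothetical period: 10 ms (100 Hz fundamental)
w0 = 2 * np.pi / T

def x(t):
    # Hypothetical test signal with known coefficients:
    # alpha_0 = 1, alpha_{+1} = alpha_{-1} = 1, alpha_3 = -0.5j, alpha_{-3} = +0.5j
    return 1.0 + 2.0 * np.cos(w0 * t) + np.sin(3 * w0 * t)

alphas = {k: fourier_coefficient(x, k, T) for k in range(-3, 4)}

# Line spectrum in decibels; numerically vanishing lines are left out.
levels_db = {k: 20 * np.log10(abs(a)) for k, a in alphas.items() if abs(a) > 1e-9}
```

The phase of each entry in `alphas` corresponds to the phase displacement of the respective harmonic, and `levels_db` holds the heights of the lines in the line spectrum.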

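In order to convert aperiodic functions x(t) to the frequency domain, the period T in the expression for the coefficients $\alpha_k$ in (2.1) is taken to the limit $T \to \infty$. The distance between two lines in the frequency spectrum is $\omega_0 = \Delta\omega = 2\pi/T$. Thus, for $T \to \infty$, $\Delta\omega \to d\omega$ and $k\,\Delta\omega \to \omega$. If $-T/2$ is used as the lower and $T/2$ as the upper integration limit in (2.1), then with $T \to \infty$ and $1/T = \Delta\omega/2\pi$ equation (2.1) becomes

$$\alpha_k = \frac{\Delta\omega}{2\pi}\,\underbrace{\int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt}_{X(\omega)} = \frac{\Delta\omega}{2\pi}\, X(\omega) \;\to\; \frac{d\omega}{2\pi}\, X(\omega). \tag{2.2}$$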

In the Fourier series (2.1), the sum becomes an integral and $k\omega_0 \to \omega$. If (2.2) is inserted into (2.1), with $\Delta\omega \to d\omega$, the equation for the so-called Fourier synthesis or inverse Fourier transformation (IFT) follows, i.e. the transformation of X(ω) into x(t). The inverse operation, from the time domain x(t) to the frequency domain X(ω), is called the Fourier transformation (FT):

$$\text{IFT:}\quad x(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} X(\omega)\, e^{j\omega t}\, d\omega, \qquad \text{FT:}\quad X(\omega) = \int_{-\infty}^{\infty} x(t)\, e^{-j\omega t}\, dt. \tag{2.3}$$

For a more detailed discussion of the Fourier transformation please refer to [Opp89] and [Mil99].

With the help of the Fourier transformation, signals can be described either by x(t) in the time domain or by X(ω) in the frequency domain:

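$$\begin{array}{lccc}
\text{Time domain:} & x(t) & \xrightarrow{\;h(t)\;} & y(t) \\
 & \downarrow \text{FT} & \downarrow & \downarrow \\
\text{Frequency domain:} & X(\omega) & \xrightarrow{\;H(\omega)\;} & Y(\omega)
\end{array}$$

A means of representation which combines both forms of description is offered by the spectrogram, in which the absolute value $|\alpha_i(t)|$ for frequency ω at time point t is indicated by colours in a diagram with a time axis and a frequency axis.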

2.2 Formants and comparison of sounds

The formants of a sound play a very important role in phonetics. They are the positions of the maxima of the spectral envelope of the transfer function H(ω). Thus the formants indicate frequencies which are particularly boosted by the vocal tract. These frequencies, which are also called resonances, are largely responsible for the recognition of a voice sound. It has often been shown in experiments that only the first two formants are decisive in the recognition of a sound. From the third formant onwards, the characteristics of a sound are only altered insignificantly as far as recognition is concerned. A possible explanation for this, in our opinion, lies in the fact that the human cochlea only displays an anatomically linear structure for frequencies up to 1 kHz, and above this the frequency distribution is logarithmic.


2.3 Ventriloquism

In ventriloquism, it is important to find substitutes for the labial sounds (i.e. those involving the lips, see supplement) whose first two formants are as similar as possible to those of the sounds to be replaced. The formants of the sounds to be replaced are well known from the literature [Sch03]; however, this does not allow any conclusions to be drawn about the vocal tract (see [PM95], [Hes98]). This means that the individual vocal tract parameters cannot be uniquely deduced from the formants.

The only possibility for us to find the vocal tract configuration at which the substitute sounds are produced is to use books for ventriloquists (e.g. [Vox] and [Boc95]) and hints from ventriloquists about their technique in order to simulate the geometry of the vocal tract using computers. In so doing, we took great care to match the formants of the sounds to be substituted as precisely as possible. To model the substitute sounds for the critical sounds with labial constriction or occlusion, we used the software tractsyn [Bir02]; see Sections 3.2 and 4.

In the next section we show in formal terms that it is generally possible to reproduce the first two formants with the articulators in a different position. This is done using a simple tube model.

3 Vocal tract models

3.1 Simple tube model

3.1.1 Short description

The geometry, and thus the transfer function, of the vocal tract may be approximated by a sequence of discrete cylinder sections with varying diameters, which are assumed to be lossless. If a signal excites the tube, this signal will be modified in accordance with the transfer function. If a very large number of discrete cylinders is used, a sound will be produced whose resonances are very close to those of the continuous vocal tract. Neglecting losses affects the results only slightly and makes it possible to calculate the transfer function explicitly for a particular tube configuration. This will be shown in the next section.

3.1.2 Calculating the transfer function

The vocal tract is modelled as a sequence of N cylinders with cross-sectional areas $A_i$. What we are looking for is the transfer function H(ω), with which the output signal of the tube may be calculated as Y(ω) = X(ω)·H(ω). The following derivation is based on [Sch96], without the help of z-transformations.

While no modification of the sound wave occurs inside a cylinder $Z_i$ with constant cross-sectional area $A_i$, at a junction between cylinders $Z_i$ and $Z_{i+1}$ with different cross-sectional areas $A_i \neq A_{i+1}$ the sound wave is split into a transmitted and a reflected part. This means that sound waves propagate through the tube both in the original direction (unit vector $\vec{e}^{\,+}$) and in the opposite direction (unit vector $\vec{e}^{\,-}$).

Looking at a junction between cylinders $Z_i$ and $Z_{i+1}$, we assume that the pressure change $p_i \to p_{i+1}$ and the change in volume velocity $\vec{u}_i \to \vec{u}_{i+1}$ are continuous (assumption of continuity: see for example [Rab78]). Then, in the discrete cylinders, $p_i$ or $\vec{u}_i$ on the right edge of $Z_i$ is equal to $p_{i+1}$ or $\vec{u}_{i+1}$ on the left edge of $Z_{i+1}$:

$$p_i = p_{i+1} \quad\text{or}\quad p_i^+ + p_i^- = p_{i+1}^+ + p_{i+1}^-, \tag{3.1}$$

$$\vec{u}_i = \vec{u}_{i+1} \quad\text{or}\quad \vec{u}_i^{\,+} + \vec{u}_i^{\,-} = \vec{u}_{i+1}^{\,+} + \vec{u}_{i+1}^{\,-}. \tag{3.2}$$

On the right side of equation (3.1), the total pressure is expressed as $p = p^+ + p^-$, i.e. as the sum of the acoustic pressure of the wave travelling in direction + and the acoustic pressure of the wave travelling in direction −. The total flow may be expressed by analogy as $\vec{u} = \vec{u}^{\,+} + \vec{u}^{\,-}$. If the flows are represented as $\vec{u}^{\,+} = u^+ \vec{e}^{\,+}$ and $\vec{u}^{\,-} = u^- \vec{e}^{\,-}$, then with $\vec{e}^{\,+} = -\vec{e}^{\,-}$ equation (3.2) results in

$$u_i^+ - u_i^- = u_{i+1}^+ - u_{i+1}^-. \tag{3.3}$$

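In order to express equation (3.3) in terms of the pressure p and the cross-sectional area A, the particle velocity $\vec{v}^{\,\pm} = v^\pm \vec{e}^{\,\pm}$ in $\vec{u}^{\,\pm} = \vec{v}^{\,\pm} A \Rightarrow u^\pm = v^\pm A$ is related to the pressure p by the impedance

$$\frac{p^+}{v^+} = \frac{p^-}{v^-} = \rho c \quad\Rightarrow\quad u^+ = v^+ A = \frac{p^+ A}{\rho c}, \qquad u^- = v^- A = \frac{p^- A}{\rho c}, \tag{3.4}$$

where ρ is the density of the medium and c is the sound velocity in this medium, e.g. $c_{\text{air},\,37^\circ\text{C}} \approx 350\ \text{m/s}$. Now equation (3.3) results in

$$\frac{p_i^+ A_i}{\rho c} - \frac{p_i^- A_i}{\rho c} = \frac{p_{i+1}^+ A_{i+1}}{\rho c} - \frac{p_{i+1}^- A_{i+1}}{\rho c} \quad\Leftrightarrow\quad A_i\,(p_i^+ - p_i^-) = A_{i+1}\,(p_{i+1}^+ - p_{i+1}^-). \tag{3.5}$$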

If (3.1) is solved for $p_i^-$ the result is

$$p_i^- = p_{i+1}^+ + p_{i+1}^- - p_i^+.$$

Solving (3.5) for $p_{i+1}^+$, inserting and simplifying yields

$$p_i^- = \frac{A_i - A_{i+1}}{A_i + A_{i+1}}\, p_i^+ + \frac{2A_{i+1}}{A_i + A_{i+1}}\, p_{i+1}^-. \tag{3.6}$$

Without a change in cross-section, a wave with $p_i^+$ would move in direction + in $Z_i$ and a wave with $p_{i+1}^-$ would move in direction − in $Z_{i+1}$, so that $p_i^- = p_{i+1}^-$ would apply in $Z_i$ for the pressure in direction −. If the wave $p_{i+1}^-$ is regarded as an incoming wave, however, equation (3.6) states that only a certain part of the pressure $p_{i+1}^-$ of the wave travelling in direction − is still contained in $p_i^-$, i.e. only a portion of this wave is transmitted. At the same time, (3.6) contains a part of the pressure $p_i^+$ of the wave travelling in direction +, i.e. a portion of this wave is reflected into direction −. According to (3.6), the factor $T^-$ for transmission of the wave travelling in direction − and the factor $R^+$ for reflection of the wave travelling in direction + are

$$\text{reflection}\quad R^+ = \frac{A_i - A_{i+1}}{A_i + A_{i+1}}, \qquad \text{transmission}\quad T^- = \frac{2A_{i+1}}{A_i + A_{i+1}} = 1 - R^+. \tag{3.7}$$
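A small numerical sketch of (3.7) follows. The function names and the example areas are hypothetical and serve only to illustrate the formula; note that the factors depend on the ratio of the two areas, not on their absolute size.

```python
def reflection_factor(a_i, a_next):
    """R+ from (3.7) for the junction between cross-sectional areas a_i and a_next."""
    return (a_i - a_next) / (a_i + a_next)

def transmission_factor(a_i, a_next):
    """T- from (3.7); equal to 1 - R+."""
    return 2.0 * a_next / (a_i + a_next)

# Hypothetical junction: the tube widens from 1 to 3 area units.
r = reflection_factor(1.0, 3.0)
t = transmission_factor(1.0, 3.0)

# Scaling both areas by the same factor leaves the reflection factor unchanged.
r_scaled = reflection_factor(10.0, 30.0)

# Without a change in cross-section nothing is reflected.
r_uniform = reflection_factor(2.0, 2.0)
```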

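If equations (3.1) and (3.5) are solved for $p_{i+1}^+$ instead of $p_i^-$ as above, this gives

$$\begin{aligned}
p_{i+1}^+ &= \frac{A_{i+1} - A_i}{A_i + A_{i+1}}\, p_{i+1}^- + \frac{2A_i}{A_i + A_{i+1}}\, p_i^+ \\
&= R^-\, p_{i+1}^- + T^+\, p_i^+ \\
&= -R^+\, p_{i+1}^- + (1 + R^+)\, p_i^+.
\end{aligned} \tag{3.8}$$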


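Equations (3.6) and (3.8) may be written compactly using $R = R^+$ as:

$$\begin{pmatrix} p_{i+1}^+ \\ p_i^- \end{pmatrix} = \begin{pmatrix} -R & 1+R \\ 1-R & R \end{pmatrix} \begin{pmatrix} p_{i+1}^- \\ p_i^+ \end{pmatrix}. \tag{3.9}$$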

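In order to express the wave quantities $p_i$ in terms of $p_{i+1}$, (3.9) is rearranged as:

$$\begin{pmatrix} p_i^+ \\ p_i^- \end{pmatrix} = \frac{1}{1+R} \begin{pmatrix} 1 & R \\ R & 1 \end{pmatrix} \begin{pmatrix} p_{i+1}^+ \\ p_{i+1}^- \end{pmatrix}. \tag{3.10}$$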
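The rearrangement of (3.9) into (3.10) can be spelled out in two steps. The following is a sketch that uses only the two rows of (3.9): the first row is solved for $p_i^+$, and the result is inserted into the second row.

```latex
\begin{align*}
p_i^+ &= \frac{p_{i+1}^+ + R\,p_{i+1}^-}{1+R}
  && \text{(first row of (3.9) solved for } p_i^+ \text{)} \\
p_i^- &= (1-R)\,p_{i+1}^- + R\,p_i^+
  && \text{(second row of (3.9))} \\
      &= \frac{(1-R^2)\,p_{i+1}^- + R\,p_{i+1}^+ + R^2\,p_{i+1}^-}{1+R} \\
      &= \frac{R\,p_{i+1}^+ + p_{i+1}^-}{1+R}
\end{align*}
```

Together, the first and last lines are the two rows of (3.10).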

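Since the sound wave requires the time⁵ $\tau/2 = d/c$ to travel the distance d between two junction points, the waves are displaced in time relative to each other. Since it follows from (2.3) that

$$x(t-\tau) \overset{\text{FT}}{=} \frac{1}{2\pi}\int_{-\infty}^{\infty} X(\omega)\, e^{j\omega(t-\tau)}\, d\omega = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-j\omega\tau}\, X(\omega)\, e^{j\omega t}\, d\omega,$$

a displacement $x(t) \to x(t - \tau/2)$ in the time domain corresponds to the operation $X(\omega) \to X(\omega)\, e^{-j\omega\tau/2}$ in the frequency domain. In order to make it clear that we are operating in the frequency domain, we will use P instead of p in the following. In (3.10), the vector $(P_i^+, P_i^-)^T$ is calculated from $(P_{i+1}^+, P_{i+1}^-)^T$, i.e. in $(P_{i+1}^+, P_{i+1}^-)^T$ the + wave $P_{i+1}^+$ travelling to the right⁶ must be moved to the left by $\tau/2$, i.e. $P_{i+1}^+ \to P_{i+1}^+\, e^{+\frac{1}{2}j\omega\tau}$. By analogy, the − wave $P_{i+1}^-$ propagating to the left must be displaced to the right by $\tau/2$, i.e. $P_{i+1}^- \to P_{i+1}^-\, e^{-\frac{1}{2}j\omega\tau}$. Thus:

$$\begin{pmatrix} P_{i+1}^+ \\ P_{i+1}^- \end{pmatrix} \to \begin{pmatrix} P_{i+1}^+\, e^{+\frac{1}{2}j\omega\tau} \\ P_{i+1}^-\, e^{-\frac{1}{2}j\omega\tau} \end{pmatrix} = \begin{pmatrix} e^{\frac{1}{2}j\omega\tau} & 0 \\ 0 & e^{-\frac{1}{2}j\omega\tau} \end{pmatrix} \begin{pmatrix} P_{i+1}^+ \\ P_{i+1}^- \end{pmatrix} = e^{\frac{1}{2}j\omega\tau} \begin{pmatrix} 1 & 0 \\ 0 & e^{-j\omega\tau} \end{pmatrix} \begin{pmatrix} P_{i+1}^+ \\ P_{i+1}^- \end{pmatrix}.$$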
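The correspondence between a time displacement and a phase factor can be checked numerically with the discrete Fourier transform. This sketch is the discrete, periodic analogue of the statement above (the delay wraps around at the end of the signal); the function name `delay_via_spectrum` and the random test signal are assumptions made for the illustration.

```python
import numpy as np

def delay_via_spectrum(x, shift):
    """Delay x circularly by `shift` samples by multiplying its spectrum with exp(-j*omega*shift)."""
    omega = 2 * np.pi * np.fft.fftfreq(len(x))   # angular frequency in rad/sample
    return np.fft.ifft(np.fft.fft(x) * np.exp(-1j * omega * shift)).real

rng = np.random.default_rng(0)
x = rng.standard_normal(64)          # arbitrary test signal

delayed = delay_via_spectrum(x, 3)   # phase factor applied in the frequency domain
shifted = np.roll(x, 3)              # the same delay applied directly to the samples
```

Both routes give the same sequence up to rounding, which is the property used to insert the factors $e^{\pm\frac{1}{2}j\omega\tau}$.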

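Thus, equation (3.10) becomes

$$\begin{pmatrix} P_i^+ \\ P_i^- \end{pmatrix} = \frac{e^{\frac{1}{2}j\omega\tau}}{1+R} \begin{pmatrix} 1 & R \\ R & 1 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & e^{-j\omega\tau} \end{pmatrix} \begin{pmatrix} P_{i+1}^+ \\ P_{i+1}^- \end{pmatrix} = \frac{e^{\frac{1}{2}j\omega\tau}}{1+R} \begin{pmatrix} 1 & R\,e^{-j\omega\tau} \\ R & e^{-j\omega\tau} \end{pmatrix} \begin{pmatrix} P_{i+1}^+ \\ P_{i+1}^- \end{pmatrix}. \tag{3.11}$$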

If the whole tube model consists of N cylinders $Z_i$, $i = 1..N$, then $R_i$, $i = 1..N$, denotes the reflection factor at the junction between cylinders $Z_i$ and $Z_{i+1}$, while at the right edge of the whole tube $R_{N+1} \approx -0.95$ is set, since this describes the transition from the lips into the outside area, which corresponds to a large change in cross-section. If the cylinder lengths $d_i$ are taken to be equal, $d_i = \text{const.} \Rightarrow \tau_i/2 = d_i/c = \tau/2 = \text{const.}$, pressure $P_1$ is obtained from $P_{N+1}$ or, by analogy, a signal $X_1$ is obtained from $X_{N+1}$ by multiplying the matrices in (3.11) as follows:

$$\begin{pmatrix} X_1^+ \\ X_1^- \end{pmatrix} = \prod_{i=1}^{N} \underbrace{\begin{pmatrix} 1 & R_i\, e^{-j\omega\tau} \\ R_i & e^{-j\omega\tau} \end{pmatrix}}_{M_i} \cdot \begin{pmatrix} X_{N+1}^+ \\ X_{N+1}^- \end{pmatrix}. \tag{3.12}$$

⁵ The time displacement is chosen as $\tau/2$ so that later, in equation (3.11), the factor $e^{\frac{1}{2}j\omega\tau}$ can be pulled out and the matrix simplified.
⁶ “To the right” corresponds to the direction +, which in turn corresponds to the direction of rising cylinder indices $Z_i \to Z_{i+1}$.

The terms $\prod_i e^{\frac{1}{2}j\omega\tau} = e^{\frac{N}{2}j\omega\tau}$ and $\prod_i \frac{1}{1+R_i}$ in (3.11) can be left aside for the moment, since they represent only a total time displacement by $N\tau/2$ without changing the absolute value ($|e^{\frac{N}{2}j\omega\tau}| = 1$) and a total gain (making louder or quieter) by the factor $\prod_i \frac{1}{1+R_i}$, neither of which has any effect on the relative frequency spectra or formants. In (3.12), $\prod_i M_i$ is a 2×2 matrix of the form

$$\prod_{i=1}^{N} M_i = M = \begin{pmatrix} M_{11} & M_{12} \\ M_{21} & M_{22} \end{pmatrix}. \tag{3.13}$$

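With $X_{N+1}^- = R_{N+1} \cdot X_{N+1}^+$, it follows from (3.12) and (3.13) for $X_1^+$:

$$X_1^+ = M_{11}\, X_{N+1}^+ + R_{N+1} \cdot M_{12}\, X_{N+1}^+. \tag{3.14}$$

From $X_{N+1}^+(\omega) = H(\omega) \cdot X_1^+(\omega)$, it follows for the transfer function H(ω):

$$H(\omega) = \frac{X_{N+1}^+(\omega)}{X_1^+(\omega)} = \frac{1}{M_{11} + R_{N+1} \cdot M_{12}}. \tag{3.15}$$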

3.1.3 Geometrically different tube configurations which are perceived as the same

In ventriloquism, the vocal tract configurations of the critical sounds with labial occlusion or constriction must be reproduced by substitute positions, i.e. it must be possible for different vocal tract geometries to sound the same or similar to the human ear. In this section, an example is given as physical proof of the fact that two different tubes A and B can produce two sounds that are almost identical to the human ear, i.e. whose first two formants (see Section 2.2) are almost identical. The positions of the first two formants can be determined by calculating the transfer function with the aid of equation (3.15).

Since only one example needs to be shown to prove the existence of two perceptively equal tubes A and B, for the sake of simplicity two tubes with similar sound qualities have been selected and optimized by hand. The two tubes A and B are determined by the following cross-sectional areas $A_i$ and $B_i$ (see Figs. 2(a) and 2(b)). The units of the cross-sectional areas may be disregarded, since they have no influence on the reflection factors.

Fig. 2: The two tubes A (a) and B (b). The numbers represent the cross-sectional areas $A_i$ and $B_i$.

In order to calculate the transfer functions, equation (3.15) is applied step by step, i.e. in pseudocode:

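1. (3.7): $R1[i] = \dfrac{A_i - A_{i+1}}{A_i + A_{i+1}}$, $\quad R2[i] = \dfrac{B_i - B_{i+1}}{B_i + B_{i+1}}$, with $i = 1..9$

2. (3.12): $M1[i](\omega) = \begin{pmatrix} 1 & R1[i]\, e^{-j\omega\tau} \\ R1[i] & e^{-j\omega\tau} \end{pmatrix}$, $\quad M2[i](\omega) = \begin{pmatrix} 1 & R2[i]\, e^{-j\omega\tau} \\ R2[i] & e^{-j\omega\tau} \end{pmatrix}$, with $i = 1..8$, $\tau = 1.0$

3. (3.13): $N1(\omega) = \prod_{i=1}^{8} M1[i](\omega)$, $\quad N2(\omega) = \prod_{i=1}^{8} M2[i](\omega)$

4. (3.15): $H1(\omega) = \dfrac{1}{N1_{11}(\omega) + R1[9] \cdot N1_{12}(\omega)}$, $\quad H2(\omega) = \dfrac{1}{N2_{11}(\omega) + R2[9] \cdot N2_{12}(\omega)}$, with $\omega = 0..\pi$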
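The four steps can be sketched in Python with NumPy as follows. This is an illustration under stated assumptions, not the Maple worksheet used for the project: the function names are invented here, and the area values are hypothetical placeholders, since the numbers shown in Fig. 2 are not reproduced in the text. As in the pseudocode, ten areas give nine reflection factors, of which the first eight enter the matrix product and the ninth terminates the tube.

```python
import numpy as np

def reflection_factors(areas):
    """Step 1, eq. (3.7): R[i] = (A_i - A_{i+1}) / (A_i + A_{i+1})."""
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

def transfer_function(areas, omega, tau=1.0):
    """Steps 2-4: H(omega) = 1 / (N11 + R_last * N12), eqs. (3.12), (3.13), (3.15)."""
    r = reflection_factors(areas)
    h = np.empty(len(omega), dtype=complex)
    for n, w in enumerate(omega):
        z = np.exp(-1j * w * tau)
        m = np.eye(2, dtype=complex)
        for r_i in r[:-1]:                       # all junctions except the last one
            m = m @ np.array([[1.0, r_i * z], [r_i, z]])
        h[n] = 1.0 / (m[0, 0] + r[-1] * m[0, 1])  # last factor terminates the tube
    return h

omega = np.arange(0, 501) * np.pi / 500          # omega = 0..pi in steps of pi/500

# HYPOTHETICAL areas, chosen only to make the sketch runnable.
areas_a = [1.0, 2.0, 4.0, 6.0, 4.0, 2.0, 1.0, 2.0, 3.0, 30.0]
level_db_a = 20 * np.log10(np.abs(transfer_function(areas_a, omega)))
```

Calling `transfer_function` with a second list of areas gives the curve for the other tube; plotting both levels over `omega` would correspond to the comparison of the two tubes.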

If the step width $\Delta\omega = \pi/500$ is selected and the amplitudes H1 and H2 are converted to decibels, i.e. $H1' = 20 \cdot \log_{10}|H1|$ and $H2' = 20 \cdot \log_{10}|H2|$, Maple yields the two transfer functions $H1'(\omega')$ and $H2'(\omega')$ plotted in Fig. 3. Here $\omega' = \omega\tau$ is normalized in such a way that in a 17 cm long tube with 10 cylinders $\omega' = \pi$ corresponds approximately to ω = 10 kHz.

From Fig. 3 it can be seen that the first two local maxima, i.e. formants, of $H1'$ and $H2'$ are almost identical. The displacement of the first two formants can hardly be perceived by the human ear. The displacement of the third formant is already somewhat greater, but as explained above this has hardly any effect on the perception of the sound.
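Reading the formants off such curves amounts to locating local maxima and comparing their positions. The sketch below does this for two synthetic stand-in curves; the curves, the peak positions and the function name `local_maxima` are hypothetical and do not represent the data of Fig. 3.

```python
import numpy as np

def local_maxima(omega, level_db):
    """Return the omega values at which level_db has an interior local maximum."""
    level_db = np.asarray(level_db)
    inner = level_db[1:-1]
    is_peak = (inner > level_db[:-2]) & (inner >= level_db[2:])
    return np.asarray(omega)[1:-1][is_peak]

omega = np.linspace(0.0, np.pi, 501)

# HYPOTHETICAL smooth curves with two bumps each, standing in for H1' and H2'.
curve_a = -((omega - 0.50) ** 2) * ((omega - 1.50) ** 2)
curve_b = -((omega - 0.52) ** 2) * ((omega - 1.47) ** 2)

peaks_a = local_maxima(omega, curve_a)
peaks_b = local_maxima(omega, curve_b)

# Displacement of corresponding maxima between the two curves.
shifts = np.abs(peaks_a - peaks_b)
```

The resolution of the result is limited by the step width of `omega`, so displacements smaller than one grid step cannot be resolved with this simple approach.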

In this way, it was shown that the two tubes above, despite their different geometry, produce sounds which the human ear can tell apart only with great difficulty. We also looked for substitute positions for the sounds critical to ventriloquism, which must be as similar as possible to the sounds they are replacing. A condition on this substitute vocal tract geometry is that there can be no labial occlusion.


Fig. 3: Transfer functions $H1'(\omega)$ for tube A (light) and $H2'(\omega)$ for tube B (dark); vertical axis in dB, horizontal axis omega. Maple plot for $\Delta\omega = \pi/500$, $\tau = 1.0$ and $A_i$, $B_i$ as set out in the table. The first two formants (local maxima) are almost identical.

Fig. 4: Birkholz’ model: (a) Interlinking of vocal tract with bars for upper and lower areas and tongue. (b) Tube model with nasal tract.

3.2 Birkholz’ 3D model

In the following we examine whether the substitute sounds formed by a ventriloquist can be described using a physical model of the vocal tract. For this, an articulatory model described by Peter Birkholz [Bir02] is used, which is implemented in the tractsyn software. The main advantages of tractsyn are that it is user-friendly and flexible and that it includes the nasal tract⁷, which is absolutely necessary for the formation of the nasal sounds [m] and [n].

Birkholz’ model is an extension of Mermelstein's model. The vocal tract is represented three-dimensionally by three surface bars: one each for the upper and lower areas of the vocal tract and one for the tongue (see Fig. 4(a)). The geometry of these bars was determined using X-ray photographs, e.g. those of Fant and some more recent ones.

In accordance with the input parameters, which determine the geometry of the bars, the programme calculates the cross-sectional areas at each point of the discretised vocal tract (i.e. split up into bars). By putting together the individual cross-sections, a tube model with varying diameters is produced, and the nasal tract tube is linked to the vocal tract tube via the velum (see Fig. 4(b)). Using this tube system, the sound produced may be calculated in a manner similar to that used in Section 3.1 and output through a loudspeaker.

⁷ The nasal tract refers to the full nasal cavity with the paranasal sinuses, from the velum to the nostrils. In combination with the vocal tract, it forms the speech tract.

4 Model-based simulation of substitute sounds

In the following, all the problem sounds of lip-fixed speech are reproduced in tractsyn by substitute sounds without closing the lips. These are analysed and compared with the problem sounds in their original form. We look first at the plosives with bilabial occlusion, [b] and [p], then at the nasal with bilabial occlusion, [m], and finally at the fricatives with labio-dental constriction, [f] and [v].

4.1 The plosives [b] and [p]

According to [Sto96], the positioning in the vocal tract of [p] and [b] is so similar that no significant differences can be established on the MRI recordings on which the simulation is also based. The only important difference between [p] and [b] is that [p] is voiceless and [b] is voiced.

[b] normal

The normal [b] is a plosive with bilabial occlusion, i.e. the air is blocked by an initial closure of the lips and then released by an abrupt opening. Since this is a non-stationary sound, a so-called “phone chain” must be produced in tractsyn, with a closed mouth at the beginning ([b]) and then the abrupt release, e.g. onto an [a]. Using the rough guideline of the tube model and the position of the formants of the sounds [b] and [a] from [PM95] or [Sch03], [b] and [a] are entered into tractsyn in such a way that a [ba] sound chain is produced.

For the initial position of [b] in the vocal tract (see Fig. 7(a)) the formants reproduced in Fig. 7(e) occur. The important first two formants are at $F_1^{[b]} = 450$ Hz and $F_2^{[b]} = 1050$ Hz. Consistent with the optimal case according to [PM95], the first formant's amplitude lies slightly above the second formant's amplitude.

The sound chain produced by tractsyn was recorded and analysed in Matlab using a spectrogram, i.e. the signal is divided into overlapping segments, each of which is analysed by the Fourier transformation, so that the frequency spectrum is obtained at several different time points. Fig. 8(a) shows the spectrogram for the synthetically produced [ba]; the time axis is horizontal and the frequency axis is vertical; the colour indicates the amplitude of the frequencies occurring. In the first 0.05 time units, the first four formants (dark colour) lie at approximately 0.5 kHz, 1.0 kHz, 2.25 kHz and 4.0 kHz. At approximately 0.05 time units the first three formants “jump” to a somewhat higher frequency. This result coincides with the formant jumps in [PM95] which are indicated in Fig. 8(d).
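The segment-wise analysis described above can be sketched in Python with NumPy. This is not the Matlab analysis used for the recordings: the frame length, hop size, Hann window and the two-tone test signal are assumptions chosen for the illustration, and the function name `spectrogram_db` is invented here.

```python
import numpy as np

def spectrogram_db(signal, fs, frame_len=256, hop=128):
    """Split the signal into overlapping windowed frames and Fourier-transform each one."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, axis=1)
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)           # frequency axis in Hz
    times = (np.arange(n_frames) * hop + frame_len / 2) / fs  # frame centres in s
    level_db = 20 * np.log10(np.abs(spectrum) + 1e-12)        # small offset avoids log(0)
    return times, freqs, level_db

# Hypothetical test signal: a 500 Hz tone that jumps to 1000 Hz after 0.1 s.
fs = 8000
t = np.arange(1600) / fs
signal = np.where(t >= 0.1, np.sin(2 * np.pi * 1000 * t), np.sin(2 * np.pi * 500 * t))

times, freqs, level_db = spectrogram_db(signal, fs)
first_peak = freqs[np.argmax(level_db[0])]     # strongest frequency in the first frame
last_peak = freqs[np.argmax(level_db[-1])]     # strongest frequency in the last frame
```

Plotting `level_db` as colour over `times` and `freqs` gives a picture of the kind described for Fig. 8, with the jump of the test tone visible as a step in frequency.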

[b’] substituted

A ventriloquist will avoid closing his lips by using a substitute sound [b’]. In order to replace the plosive character of [b], the tip of the tongue is first pressed towards the front teeth (similar to the alveolar occlusion of [d], but closer to the teeth, see Fig. 9) until the tongue is withdrawn with a sudden movement⁹. The vocal tract, which is otherwise similar to that of [b], is thus shortened by the distance between the lips and the teeth; see Fig. 7(c). As may be seen in Fig. 7(e), this leads to a very good approximation of the first two formants, which are at $F_1^{[b']} = 500$ Hz and $F_2^{[b']} = 1100$ Hz, with the first formant's amplitude lying considerably above the second formant's amplitude.

The spectrogram for [b’a] is shown in Fig. 8(b). The first two formants lie at approximately 500 Hz and 1000 Hz in the first 0.06 time units. At approximately 0.06 time units the formants are displaced to somewhat higher frequencies and also show the typical formant displacement depicted in Fig. 8(d). By contrast with [ba], the third formant of [b’a] departs from the pattern in Fig. 8(d), according to which there should be a sharp tilt upwards, not downwards. Since one can hardly make out a difference between [ba] and [b’a], it can be confirmed that only the lower spectral range is relevant for the recognition of sounds by the human ear.

Comparison of [b’a] with [ba]

The two formant curves in Fig. 7(e) display great similarities for the initial positions of [b] and [b’]. The first two formants, which are important for sound recognition (see Section 2.2), have almost the same characteristics and position, since $F_1^{[b']} - F_1^{[b]} = 50$ Hz and $F_2^{[b']} - F_2^{[b]} = 50$ Hz. Seen over time as well, the first two formants of [b’a] and [ba] are practically identical, since both show an equally strong upward change at the same point and thus both share the characteristics of the [ba] sound chain.

Comparison of [b’a] with [da]

Fig. 9, which shows the positions of [b’] and [d] according to [Vox], as well as the vocal tract position for [b’] in Fig. 7(c), could suggest that [b’] simply corresponds to a normal [d]. To disprove this, a [da] was produced using tractsyn and its spectrogram computed in Matlab, see Fig. 8(c). The formants are approximately the same as those represented in Fig. 8(e) for a [da]. While [da] and [b’a] resemble each other in the first formant, after 0.05 time units the second and third formants of [da] suddenly tilt downwards, whereas in [b’a] the second formant tilts upwards (as in [ba]) and the third tilts downwards. Since the second formant is far more important for the sound’s characteristics, [b’a] is significantly more similar to [ba] than to [da]. This means that the model was able to reproduce the substitute sound [b’] used by ventriloquists and to show that it is authentic.

4.2 The nasal [m]

[m] is a nasal with a bilabial occlusion, i.e. the lips must be closed and, unlike the plosive sounds, the sound escapes through the nasal tract, which the air enters at the velum. Two variants for substitute sounds without closed lips are given in [Vox]. In the first variant, which we shall call [m’], the closing of the lips is replaced by pressing the tongue as close as possible to the front teeth so that the sound waves can only escape through the nasal tract. After that, the tongue is released from the teeth again (see Fig. 9(c)). The second variant [m”] involves pressing the back of the tongue against the velum, thereby forcing the sound waves exclusively into the nasal tract (see Fig. 9(d)). Since the termination of the sound occurs directly at the velum, the similarity of [m”] to the sound to be substituted, [m], can mostly be established at a perceptual level when the sound is produced. However, one can still find certain similarities in the spectra.

For [m], [m’] and [n], the oral cavity acts as a resonating body. The different nasals differ in the size (particularly the length) of this resonating body. Some of the sound waves coming from the glottis reach the nasal tract directly, while others enter the oral cavity, where they are reflected and radiated back. When they arrive back at the velum, where the nasal and oral cavities separate, a superposition occurs. If there is destructive interference for certain wavelengths at the velum, then these wavelengths and their corresponding frequencies will not be found in the final sound, i.e. the transfer function $H(\omega)$ has so-called antiresonances, also called zeros.

Fig. 5: Three-way model for the production of nasal sounds.

Two factors which determine the spectrum of nasals must be taken into account: on the one hand, the (non-modifiable) geometry of the pharynx from the glottis to the velum and of the nasal tract and, on the other hand, the length of the resonating body (oral cavity). The exact geometry of the oral cavity acting as a resonating body may be disregarded, since the reflection that matters for sound formation mostly occurs at the occlusion of the oral cavity, and thus the length $l$ of the oral cavity plays the major role in the articulation of nasal sounds.8

Since the wave coming from the velum is reflected back at the lips, it has to travel twice the distance $l_m$ to reach the velum again. Thus, the following condition for destructive interference, i.e. an antiresonance, may be derived for $n \in \mathbb{N}$:

$$2 \cdot l_m = (2n+1) \cdot \frac{\lambda}{2} = (2n+1) \cdot \frac{c}{2f} \;\Rightarrow\; f = (2n+1) \cdot \frac{c}{4 l_m}$$

The lengths $l_m = 7.53$ cm and $l_{m'} = 6.89$ cm of the resonating body for [m] and [m’] follow from the difference between the total length of the vocal tract and the distance between the glottis and the velum, which can be obtained from [Sto96] and the simulation used. The approximate frequencies of the antiresonances of [m] and [m’] can thus be calculated:

$$f_{[m]} = \{1160, 3480, 5800, \ldots\}\ \text{Hz}, \qquad f_{[m']} = \{1270, 3700, 6240, \ldots\}\ \text{Hz}.$$

The spectrum of the articulatory model for [m] (Fig. 10(c)) shows that the first two antiresonances for [m] are at approximately 1000 Hz and 3500 Hz, i.e. for the first antiresonance the calculated frequency is slightly above the model’s frequency, while for the second antiresonance calculation and model agree almost exactly.
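The antiresonance condition $f = (2n+1) \cdot c / (4 l)$ can be evaluated numerically, as in the following minimal Python sketch. The speed of sound `c = 350.0` m/s is an assumption (the text does not state the value used), and the function name `antiresonance_frequencies` is hypothetical; the index $n$ is taken to start at 0.

```python
def antiresonance_frequencies(length_m, n_terms=3, c=350.0):
    """Return f_n = (2n + 1) * c / (4 * l) in Hz for n = 0 .. n_terms - 1.

    length_m: length of the resonating oral cavity in metres.
    c:        speed of sound in m/s (assumed value, not given in the text).
    """
    return [(2 * n + 1) * c / (4.0 * length_m) for n in range(n_terms)]

f_m = antiresonance_frequencies(0.0753)      # [m],  l = 7.53 cm
f_m_sub = antiresonance_frequencies(0.0689)  # [m'], l = 6.89 cm

print([round(f) for f in f_m])
print([round(f) for f in f_m_sub])
```

With this assumed speed of sound, the first antiresonance comes out close to 1160 Hz for [m] and close to 1270 Hz for [m’]. The higher antiresonances depend noticeably on the exact value of `c` and on rounding.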

The first two antiresonances for [m’] (Fig. 11(c)) are at approximately 1250 Hz and 2600 Hz, so the first antiresonance is almost identical with the calculation but the second is strongly displaced. This error in the position of the second antiresonance may be due to the fact that antiresonances can be shifted by superposition with antiresonances and formants from the nasal tract. Another reason for the displaced position of the second antiresonance could be the imprecision in the length of the vocal tract for [m’]. The position of the first antiresonance, however, corresponds very closely to the calculation. Measurements on test speakers in [Del93] show that the first antiresonance should lie between 750 Hz and 1250 Hz for [m], depending on the speaker, while for [n] it should lie between 1500 Hz and 2200 Hz. Since the first antiresonance of [m’] is at approximately 1250 Hz, it follows that the substitute sound [m’] is considerably more similar to [m] than to [n]. This means that [m’] is a newly discovered sound whose vocal tract position resembles that of [n] but whose sound characteristics are far closer to [m] than to [n]. A hearing test using [m’] also clearly demonstrates this.

8 Further anti-formants are produced by the paranasal sinuses, but due to their short length these generally concern only the higher frequencies and affect all sounds equally. However, they can be recognised by a vibration when you put a hand on your head while correctly pronouncing nasals.

In addition to the antiresonances, the total course of the spectrum in the frequencies from0 to 1 kHz of [m’] is very similar to [m]. This once more confirms the high level of similarityin perceptual terms of the substitute sound [m’] with [m].

4.3 The fricatives [f] and [v]

After looking at [b], [p] and [m], we turn now to the problematic fricatives [f] and [v],9 whose constriction must be displaced further back in the mouth, in a manner not dissimilar to [b]. In principle, producing fricatives with an articulatory synthesis model requires that an unvoiced excitation due to turbulence is available. As in the case of [b] and [p], the excitation of [v] differs from that of [f] in that [v], in contrast to [f], is voiced.

Since [f] and [f’] are fricatives with a constriction in the front part of the vocal tract, excitation occurs not at the glottis, as in the previous cases, but through turbulence at the constriction. Thus, for an analysis of [f] and [f’] that goes beyond perception, we cannot use the transfer function but have to examine the spectrogram of a sound chain such as [fa] and [f’a].

[f] normal

In the labio-dental [f], the constriction occurs between the upper teeth and the lower lip. The relevant vocal tract position is shown in Fig. 12(a). The sound chain [fa] is synthesised in tractsyn; its spectrogram is shown in Fig. 12(c). [f] lasts up to approximately 0.075 time units and is then followed by the transition to [a]. As one can see, the energy of [f] is concentrated in one specific spectral range. This begins at approximately 1500 Hz, reaches its maximum at 3000 Hz and gradually becomes weaker towards higher frequencies. The transition to [a] is very abrupt but follows an “incline”, i.e. first the lower frequencies of [a] are formed, then the higher frequencies follow.

[f’] substituted

The substitute sound [f’] is similar to the English [th] as in “the”, although the tongue is further back in the mouth. The labio-dental constriction in [f] is replaced by a very brief alveolar constriction between the tongue and the oral cavity. The vocal tract geometry for [f’] in Fig. 13(a) was adjusted to match the perception of [fa], giving rise to the spectrogram shown in Fig. 13(c). The reinforced spectral range looks very similar to that of [fa] and begins at approximately 1500–1750 Hz. Just as in [fa], this range reaches a maximum somewhat below 3000 Hz and drops off for higher frequencies. The transition to [a] at approximately 0.07 time units corresponds almost exactly to that of [fa].

9 The phonetic transcription [v] represents, for example, the initial sound in “Weinflasche”, i.e. it corresponds to a German ‘w’. In English it represents e.g. the initial sound in “ventriloquism”.

The great similarity of the spectrograms of [fa] and [f’a] leads to a very similar perception, as can be confirmed by listening to the sound chains. Thus, substitute sounds have been found for all sounds that are problematic for ventriloquists, and they are astonishingly similar to the problem sounds in their original form.

[v] normal and [v’] substituted

[v] is a labio-dental sound, see Fig. 6(b). As with [b’], [m’] and [f’], the constriction of the substitute sound [v’] is formed using the tongue. The vocal tract configurations for [v] and [v’] are shown in Figures 14(a) and 14(c) respectively. The corresponding spectra are given in Fig. 14(e). From the comparison of the first two formants in Fig. 14(e) as well as from the perceptual similarity it follows that [v’] is an authentic substitute sound for [v].

5 Simulation of sounds in a real experiment with the help of a plaster model

In order to verify the tube model used in the previous sections, a real model was built which reproduces the tube model of the speech tract geometry for certain sounds.

The positive of the model was formed out of ordinary dough, with the geometry of the cross-sectional areas taken from MRI images of the speech tract for [m] (see [Sto96]). The model as such (the negative) was constructed out of several layers of plaster. By placing barriers and thereby reducing the oral resonating cavity, the vocal tract can be changed from the geometry of [m] to that of [m’] and [n].

Two variants were used to excite the plaster model:

• Auditory examination (perception), residual signal: the excitation is obtained from a recording of a voiced sound. The signal in this recording was spectrally filtered in such a way that all resonances were removed from it and the spectral envelope is constant. So that the signal resembles the actual voiced excitation by the glottis as closely as possible, it is additionally low-pass filtered. With this excitation, the plaster model’s filter function yields an output signal that is once more perceived as a natural-sounding nasal.

• Spectral analysis (antiresonance frequencies), peak excitation: an impulse train containing its fundamental frequency and the accompanying harmonics is emitted from a special loudspeaker. At the front, the loudspeaker is connected to a plastic tube fixed directly onto the model. Screening the loudspeaker with sound-insulating plates ensures that the sound given off by the loudspeaker’s wooden case does not affect the analysis. The recordings are again made with a condenser microphone and analysed in Matlab. The advantage of this excitation is that an antiresonance can be identified immediately by the absence of harmonics in the spectrum. However, it must be noted that attaching the loudspeaker to the model by means of a funnel and hose may create unintended antiresonances. By analysing the spectrum of the signal escaping directly from the hose, these erroneous antiresonances, which are not produced by the plaster model itself, can easily be identified.
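The idea behind the peak excitation, namely that an antiresonance shows up as a missing harmonic of the impulse train, can be illustrated with a toy simulation. The sketch below is not the plaster-model measurement: it assumes a single reflected wave that is added to the direct wave with a fixed delay, and the sampling rate, fundamental frequency and delay are hypothetical values chosen so that the cancellation falls on a harmonic.

```python
import numpy as np

fs = 16000   # sampling rate in Hz (assumed)
f0 = 125     # fundamental frequency of the impulse train in Hz (assumed)
delay = 8    # round-trip delay of the reflected wave in samples (assumed)

# Impulse train: one unit pulse per fundamental period, 1 second long.
x = np.zeros(fs)
x[:: fs // f0] = 1.0

# Direct wave plus delayed reflection. The two cancel wherever the delay
# equals an odd number of half periods, i.e. at f = (2n + 1) * fs / (2 * delay).
y = x.copy()
y[delay:] += x[:-delay]

spectrum = np.abs(np.fft.rfft(y))

# Amplitude of each harmonic of f0 up to 2 kHz.
harmonics = np.arange(f0, 2001, f0)
bins = harmonics * len(y) // fs          # FFT bin index of each harmonic
amps = spectrum[bins]

missing = harmonics[np.argmin(amps)]     # harmonic that has (almost) vanished
print(missing)
```

In this toy setup the first cancellation lies at `fs / (2 * delay)`, i.e. 1000 Hz, so the eighth harmonic disappears from the spectrum while its neighbours remain. In a real measurement the dip would be less sharp, and spurious dips from the coupling hose would have to be excluded as described above.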

Figures 15(a) and 15(b) show the plaster model’s spectra for [m] and [m’] with peak excitation. The first antiresonance caused by the speech tract can be identified in both spectra by the absence of a peak at the relevant frequencies; it lies at 1000 Hz for [m] and at approximately 1250 Hz for [m’]. As discussed above, the literature [Del93] places the anti-formant for [m] between 750 Hz and 1250 Hz. The real experiment thus also established that the first antiresonance of [m’] lies between the typical antiresonances for [n] and [m] but that, at 1250 Hz, it is considerably closer to [m] than to [n]. Thus [m’] is perceived as [m]. Excitation with the residual signal confirms this result particularly well at an auditory level.

6 Analysis of recordings of ventriloquists

The analysis of sounds as actually spoken by ventriloquists allows us to compare the substitute sounds produced by the model with sounds which are really spoken. We tried to analyse recordings from the internet, which unfortunately were not of sufficiently good quality for our purpose.

The only possibility was to make our own recordings of the original and the substitute sounds of a ventriloquist who had to be as good as possible. The ventriloquist Patrick Martin, who has been making a living from his professional appearances for about ten years, allowed us to record him speaking a number of test sounds and sentences with a high-quality condenser microphone. With these recordings it was possible to produce a spectrogram for individual sound transitions. The frequency is plotted on the horizontal axis, the amplitude in dB on the vertical. The various curves represent the spectrum at a sequence of points in time. The highest frequency, which follows from the sampling rate of the recording, was 11025 Hz in our recordings.10 The approximate courses of the formants were drawn in by hand; the first recognisable resonance is not a formant of the speech tract but comes from the excitation.

6.1 The sound transitions [ba] with lips and [b’a] without lips

The rise in the first two formants, which are important for recognising sounds, is recognisable in both [ba] and [b’a]. This means that the perception of the sound [b’a] corresponds to that of [ba], which can be unambiguously confirmed from both the spectrum and the sound reproduction. The ventriloquist succeeds not only in optimising the first two formants; the remaining formants are also very similar up to approximately 3000 Hz. Yet the vocal tract geometry is not at all the same, as is clear in particular from the higher formants. Where the lips are used, the third formant is somewhat higher. The differences become far more obvious in the fifth formant, which is higher in the substitute sound and in fact appears to split into two separate formants. The next two formants are very similar again. In the higher frequencies further parallels can be established. (This shows that the vocal tract geometry was different in the recorded sound but that this is hardly noticeable in terms of the sounds’ characteristics.)

10 In accordance with the sampling theorem, the recorded bandwidth is 11025 Hz, which corresponds to half of the sampling rate of 22050 Hz indicated by the recording software used, Audacity.


6.2 The sound transitions [pa] with lips and [p’a] without lips

Here too it can be seen that the course of the first three formants is similar in both spectrograms. The tilt of the fourth formant is considerably greater in the substitute sound than in the sound using the lips. After that there are parallels between the fifth and seventh formants, whose courses are almost identical. Between these two, the sixth formant tilts upwards in the sound with lips, in contrast to the substitute sound. The range above the seventh formant displays hardly any similarities at all. This sound is clearly perceived as [pa], and while the limitations in the sound’s characteristics should, at least theoretically, be more obvious here, this is not the case in perceptual terms. Since in our simulation we only managed to optimise the first two formants, which corresponds to approximately 2.5 kHz, here too it can be established that the quality of the ventriloquist’s substitute sounds exceeds that of the substitute sounds in the simulation.

6.3 Comparison with sound transitions produced by the model and error analysis

The analyses of the recorded speech signals above confirm, in short, the principle that the lower formants of substitute sounds and normal sounds are similar while the upper formants display large differences. However, the form of the deviations and the formant above which deviations occur differ between the simulation and the recordings.

It must be remembered that the recordings depend on the speaker in question and that ventriloquists’ techniques may vary slightly from one to another. For example, ventriloquists could form little dimples in their cheeks or train their tongue to move into different positions. Another reason for the difference might lie in the difficulties already mentioned of the fast plosive-vowel transitions and of speaking at a higher fundamental frequency.

7 Conclusion and outlook

It was shown that it is generally possible to produce a perceptually equivalent signal with a different arrangement of tubes. This makes ventriloquism physically explicable and possible.

The investigations show that only the spectral range relevant for the recognition of speech sounds is important for ventriloquists, i.e. it is mainly the first formants of the sounds to be substituted that are approximated. Interestingly, this works astonishingly well for all critical labial sounds [b], [p], [m], [f] and [v], in practice as well as in theory with the articulatory speech model. The analysis of the plaster model confirms this result.

Using the visual representation of the substitute sounds in the simulation software, it would be conceivable to teach trainee ventriloquists to form the substitute sounds. Using the analytical method it would then be possible to assess the quality of the pronunciation objectively.

These training possibilities could also be used for medical purposes. Patients suffering from dysarthria who can no longer move particular articulators in the vocal tract could independently improve their comprehensibility by forming articulatory substitute sounds; for example, if the mobility of the lips is reduced, substitute sounds could be formed using the tongue.11 Given the long progression of diseases such as ALS and Parkinson’s disease, it is possible to start teaching these patients, in the early stages of their disease, perceptually equivalent substitute sounds which they will still be able to articulate as the disease progresses, thereby replacing the sounds which become problematic for them and retaining better comprehensibility.

11 Prof. Kroger of University Hospital Aachen even told us of a patient who, due to insufficient mobility of the central part of the vocal tract, made “compensatory articulatory-phonetic sounds” with the lips and the other articulators that still worked. Unfortunately, he could not go into the details of these substitute sounds since the possibility of examining this patient was very limited. In theory, this dysarthria could also present itself in such a form that, rather than the central part of the vocal tract being out of action, the lips would lose mobility.


Appendix

Acknowledgements

We would like to thank the Institute of Applied Physics at the Goethe University in Frankfurt am Main for making their laboratories and technical support available to us for carrying out our experiments, the Institute of Phonetics for their patience in answering all our questions, and the Senckenberg-Museum for providing special plaster and advice for making the plaster model. In particular, we would like to thank our supervisor Dr. Karl Schnell and Professor Lacroix’ working group Digital Systems, Language Synthesis and Signal Processors.

References

[Bir02] Peter Birkholz. Entwicklung eines dreidimensionalen Artikulatormodells für die Sprachsynthese. http://wwwicg.informatik.uni-rostock.de/piet/speak_main.html, Rostock, 2002.

[Boc95] Elke Bockamp. Bauchreden - spielend lernen. Edition Aragon, Moers, 1995.

[Del93] Deller. Discrete-Time Processing of Speech Signals. Wiley-IEEE Press, New York, 1993.

[Hes98] Vary Heute Hess. Digitale Sprachsignalverarbeitung. B.G. Teubner, Stuttgart, 1998.

[Mil99] Otto Mildenberger. Informationstechnik kompakt. Vieweg, 1999.

[Opp89] A. V. Oppenheim. Signale und Systeme. Prentice-Hall, Cambridge, MA, 1989.

[Pet04] Thomas Peters. Fourier-Reihen. www.mathe-seiten.de, 2004.

[PM95] Bernd Pompino-Marschall. Einführung in die Phonetik. Gruyter, Berlin, 1995.

[Rab78] Schafer Rabiner. Digital Processing of Speech Signals. Prentice-Hall, London, 1978.

[Sch96] Karl Schnell. Sprachsynthese mit erweiterten Rohrmodellen. Diplomarbeit, Frankfurt am Main, 1996.

[Sch03] Karl Schnell. Parameterbestimmung für Rohrmodelle aus Sprachsignalen für die Sprachproduktion. Dissertation, Frankfurt am Main, 2003.

[Sto96] Titze Story. Vocal tract area functions from magnetic resonance imaging. J.A.S.A. Vol. 100, pp. 537-554, 1996.

[Vox] Valentine Vox. I can see your lips moving. Retonios Magic, Casino, Schweiz.

[Wer00] Martin Werner. Signale und Systeme. Vieweg und Sohn, Braunschweig, 2000.

Illustrations


(a) (b)

Fig. 6: a) Selection of designations of constriction points in the vocal tract (see [Sch03]). b) Classification of consonants in the IPA consonant table.


(a) vocal tract b (b) tube model b

(c) vocal tract b’ (d) tube model b’

(e) formants b, b’

Fig. 7: Modelling the sounds [b] and [b’]. a) Vocal tract of [b], the lips form an occlusion. b) Tube model of [b]. c) Vocal tract of [b’], the occlusion is created by the tongue being pressed upwards. d) Tube model of [b’]. e) Formants of [b] (blue) and [b’] (black); the first two formants of [b] are at $F_1^{[b]} = 450$ Hz and $F_2^{[b]} = 1050$ Hz, the first two formants of [b’] are at $F_1^{[b']} = 500$ Hz and $F_2^{[b']} = 1100$ Hz.


(a) ba spectrogram (b) b’a spectrogram

(c) da spectrogram

Fig. 8: (a-c) Spectrograms for the non-stationary sounds modelled in tractsyn: [ba], [b’a] and [da]. The time axis is horizontal and the frequency axis vertical. (d) and (e) Theoretical course of the first three formants over time, according to [PM95], for comparison, for the sounds [ba] and [da].


(a) b’ (b) d (c) m’ (d) m”

Fig. 9: Vocal tract positions for ventriloquists, according to [Vox]: (a) substitute sound [b’] without closing the lips, (b) for comparison, a normally pronounced [d], (c) substitute sound [m’], similar to [n], with occlusion using the tongue, (d) substitute sound [m”], similar to [ng], with uvular occlusion; in particular, the nasal tract acts as a resonating body.

(a) vocal tract (b) tube model

(c) spectrum

Fig. 10: The sound [m]. (a) The vocal tract ends with an occlusion at the lips. The velum is open, so that sound waves enter the nasal tract. (b) Tube model. (c) Opening the velum gives rise to antiresonances in the spectrum at approximately 1000 Hz and 3500 Hz.


(a) vocal tract (b) tube model

(c) spectrum

Fig. 11: The substitute sound [m’]. (a) The vocal tract resembles [n], since the occlusion occurs using the tongue instead of the lips. The velum is opened as in [m]. (b) Tube model. (c) There are antiresonances in the spectrum.


(a) vocal tract (b) tube model

(c) spectrogram

Fig. 12: The sound [f]. (a) The constriction is labio-dental, the tongue is already making the transition to [a]. (b) Tube model. (c) Spectrogram for [fa] with reinforced spectral range from approximately 1500 Hz. The maximum is somewhat under 3000 Hz, the amplitude becomes weaker for higher frequencies.


(a) vocal tract (b) tube model

(c) spectrogram

Fig. 13: The substitute sound [f’]. (a) The constriction is alveolar. (b) Tube model. (c) Spectrogram for [f’a] with reinforced spectral range from approximately 1500–1750 Hz. The maximum is somewhat under 3000 Hz.


(a) vocal tract v (b) tube model v (c) vocal tract v’

(d) tube model v’ (e) formants v, v’

Fig. 14: Modelling the sounds [v] and [v’]. a) Vocal tract of [v]. b) Tube model of [v]. c) Vocal tract of [v’]. d) Tube model of [v’]. e) Formants of [v] (blue) and [v’] (black): $F_1^{[v]} \approx 500$ Hz, $F_2^{[v]} \approx 1100$ Hz, $F_3^{[v]} \approx 1950$ Hz; $F_1^{[v']} \approx 450$ Hz, $F_2^{[v']} \approx 1100$ Hz, $F_3^{[v']} \approx 2500$ Hz.

(a) m (b) m’

Fig. 15: Time evolution of spectrum envelopes of the plaster model’s signal, which was excited by an impulse train: (a) plaster model for [m] with antiresonance (missing harmonic) at approximately 1000 Hz, (b) plaster model for [m’] with antiresonance at approximately 1250 Hz.


(a) ba with lips (b) ba without lips

Fig. 16: Spectra of Patrick Martin speaking the sound transitions [ba] and [b’a] with and without lips in relation to time. The frequency is indicated on the horizontal axis, the amplitude in dB on the vertical. The various curves represent the envelope at various time points in the spectra.

(a) pa with lips (b) pa without lips

Fig. 17: Spectra of the recorded sound transitions [pa] and [p’a] with and without lips. The different curves represent the envelope at various time points in the spectra.