Haskins Laboratories Status Report on Speech Research 1993, SR-113, 51-62
Dynamics and Articulatory Phonology*
Catherine P. Browman and Louis Goldstein†
1. INTRODUCTION
Traditionally, the study of human speech and its patterning has been approached in two different ways. One way has been to consider it as mechanical or bio-mechanical activity (e.g., of articulators or air molecules or cochlear hair cells) that changes continuously in time. The other way has been to consider it as a linguistic (or cognitive) structure consisting of a sequence of elements chosen from a closed inventory. Development of the tools required to describe speech in one or the other of these approaches has proceeded largely in parallel, with one hardly informing the other at all (some notable exceptions are discussed below). As a result, speech has been seen as having two structures, one considered physical, and the other cognitive, where the relation between the two structures is generally not an intrinsic part of either description. From this perspective, a complete picture requires 'translating' between the intrinsically incommensurate domains (as argued by Fowler, Rubin, Remez, & Turvey, 1980).
The research we have been pursuing (Browman & Goldstein, 1986; 1989; 1990a,b; 1992) ('articulatory phonology') begins with the very different assumption that these apparently different domains are, in fact, the low and high dimensional descriptions of a single (complex) system. Crucial to this approach is identification of phonological units with dynamically specified units of articulatory action, called gestures. Thus an utterance is described as an act that can be decomposed into a small number of primitive units (a low dimensional description), in a particular spatio-temporal configuration.
*This work was supported by NSF grant DBS-9112198 and NIH grants HD-01994 and DC-00121 to Haskins Laboratories. Thanks to Alice Faber and Jeff Shaw for comments on an earlier version.
The same description also provides an intrinsic specification of the high dimensional properties of the act (its various mechanical and bio-mechanical consequences).
In this chapter, we will briefly examine the nature of the low and high dimensional descriptions of speech, and contrast the dynamical perspective that unifies these to other approaches in which they are separated as properties of mind and body. We will then review some of the basic assumptions and results of developing a specific model incorporating dynamical units, and illustrate how it provides both low and high dimensional descriptions.
2. DIMENSIONALITY OF DESCRIPTION
Human speech events can be seen as quite complex, in the sense that an individual utterance follows a continuous trajectory through a space defined by a large number of potential degrees of freedom, or dimensions. This is true whether the dimensions are neural, articulatory, acoustic, aerodynamic, auditory, or other ways of describing the event. The fundamental insight of phonology, however, is that the pronunciations of the words in a given language may differ from (that is, contrast with) each other in only a restricted number of ways: the number of degrees of freedom actually employed in this contrastive behavior is far fewer than the number that is mechanically available. This insight has taken the form of the hypothesis that words can be decomposed into a small number of primitive units (usually far fewer than one hundred in a given language) which can combine in different ways to form the large number of words required in human lexicons. Thus, as argued by Kelso, Saltzman, and Tuller (1986), human speech is characterized not only by a high number of potential (microscopic) degrees of freedom, but also by a low dimensional (macroscopic) form. This macroscopic form is usually called the 'phonological' form. As will be suggested below, this collapse of degrees of freedom can possibly be understood as an instance of the kind of self-organization found in other complex systems in nature (Haken, 1977; Kauffman, 1991; Kugler & Turvey, 1987; Madore & Freedman, 1987; Schöner & Kelso, 1988).
Historically, however, the gross differences between the macroscopic and microscopic scales of description have led researchers to ignore one or the other description, or to assert its irrelevance, and hence to generally separate the cognitive and the physical. Anderson (1974) describes how the development of tools in the nineteenth and early twentieth centuries led to the quantification of more and more details of the speech signal, but "with such increasingly precise description, however, came the realization that much of it was irrelevant to the central tasks of linguistic science" (p. 4). Indeed, the development of many early phonological theories (e.g., those of Saussure, Trubetzkoy, Sapir, Bloomfield) proceeded largely without any substantive investigation of the measurable properties of the speech event at all (although Anderson notes Bloomfield's insistence that the smallest phonological units must ultimately be defined in terms of some measurable properties of the speech signal). In general, what was seen as important about phonological units was their function, their ability to distinguish utterances.
A particularly telling insight into this view of the lack of relation between the phonological and physical descriptions can be seen in Hockett's (1955) familiar Easter egg analogy. The structure serving to distinguish utterances (for Hockett, a sequence of letter-sized phonological units called phonemes) was viewed as a row of colored, but unboiled, Easter eggs on a moving belt. The physical structure (for Hockett, the acoustic signal) was imagined to be the result of running the belt through a wringer, effectively smashing the eggs and intermixing them. It is quite striking that, in this analogy, the cognitive structure of the speech event cannot be seen in the gooey mess itself. For Hockett, the only way the hearer can respond to the event is to infer (on the basis of obscured evidence, and knowledge of possible egg sequences) what sequence of eggs might have been responsible for the mess. It is clear that in this view, the relation between cognitive and physical descriptions is neither systematic nor particularly interesting. The descriptions share color as an important attribute, but beyond that there is little relation.
A major approach that did take seriously the goal of unifying the cognitive and physical aspects of speech description was that in the Sound Pattern of English (Chomsky & Halle, 1968), including the associated work on the development of the theory of distinctive features (Jakobson, Fant, & Halle, 1951) and the quantal relations that underlie them (Stevens, 1972, 1989). In this approach, an utterance is assigned two representations: a 'phonological' one, whose goal is to describe how the utterance functions with respect to contrast and patterns of alternation, and a 'phonetic' one, whose goal is to account for the grammatically determined physical properties of the utterance. Crucially, however, the relation between the representations is quite constrained: both descriptions employ exactly the same set of dimensions (the features). The phonological representation is coarser in that features may take on only binary values, while the phonetic representation is more fine-grained, with the features having scalar values. However, a principled relation between the binary values and the scales is also provided: Stevens' quantal theory attempts to show how the potential continuum of scalar feature values can be intrinsically partitioned into categorical regions, when the mapping from articulatory dimensions to auditory properties is considered. Further, the existence of such quantal relations is used to explain why languages employ these particular features in the first place.
Problems raised with this approach to speech description soon led to its abandonment, however. One problem is that its phonetic representations were shown to be inadequate to capture certain systematic physical differences between utterances in different languages (Keating, 1985; Ladefoged, 1980; Port, 1981). The scales used in the phonetic representations are themselves of reduced dimensionality, when compared to a complete physical description of utterances. Chomsky and Halle hypothesized that such further details could be supplied by universal rules. However, the above authors (also Browman & Goldstein, 1986) argued that this would not work: the same phonetic representation (in the Chomsky and Halle sense) can have different physical properties in different languages. Thus, more of the physical detail (and particularly details having to do with timing) would have to be specified as part of the description of a particular language. Ladefoged's (1980) argument cut even deeper. He argued that there is a system of scales that is useful for characterizing the measurable articulatory and acoustic properties of utterances, but that these scales are very different from the features proposed by Chomsky and Halle.
One response to these failings has been to hypothesize that descriptions of speech should include, in addition to phonological rules of the usual sort, rules that take (cognitive) phonological representations as input and convert them to physical parameterizations of various sorts. These rules have been described as rules of 'phonetic implementation' (e.g., Keating, 1985; Keating, 1990; Klatt, 1976; Liberman & Pierrehumbert, 1984; Pierrehumbert, 1990; Port, 1981). Note that in this view, the description of speech is divided into two separate domains, involving distinct types of representations: the phonological or cognitive structure and the phonetic or physical structure. This explicit partitioning of the speech side of linguistic structure into separate phonetic and phonological components, which employ distinct data types that are related to one another only through rules of phonetic implementation (or 'interpretation'), has stimulated a good deal of research (e.g., Cohn, 1990; Coleman, 1992; Fourakis & Port, 1986; Keating, 1988; Liberman & Pierrehumbert, 1984). However, there is a major price to be paid for drawing such a strict separation: it becomes very easy to view phonetic and phonological (physical and cognitive) structures as essentially independent of one another, with no interaction or mutual constraint. As Clements (1992) describes the problem: "The result is that the relation between the phonological and phonetic components is quite unconstrained. Since there is little resemblance between them, it does not matter very much for the purposes of phonetic interpretation what the form of the phonological input is; virtually any phonological description can serve its purposes equally well" (p. 192). Yet, there is a constrained relation between the cognitive and physical structures of speech, which is what drove the development of feature theory in the first place.
In our view, the relation between the physical and cognitive, i.e., phonetic and phonological, aspects of speech is inherently constrained by their being simply two levels of description (the microscopic and macroscopic) of the same system. Moreover, we have argued that the relation between microscopic and macroscopic properties of speech is one of mutual or reciprocal constraint (Browman & Goldstein, 1990b). As we elaborated there, the existence of such reciprocity is supported by two different lines of research. One line has attempted to show how the macroscopic properties of contrast and combination of phonological units arise from, or are constrained by, the microscopic, i.e., the detailed properties of speech articulation and the relations among speech articulation, aerodynamics, acoustics, and audition (e.g., Lindblom, MacNeilage, & Studdert-Kennedy, 1983; Ohala, 1983; Stevens, 1972; 1989). A second line has shown that there are constraints running in the opposite direction, such that the (microscopic) detailed articulatory or acoustic properties of particular phonological units are determined, in part, by the macroscopic system of contrast and combination found in a particular language (e.g., Keating, 1990; Ladefoged, 1982; Manuel & Krakow, 1984; Wood, 1982). The apparent existence of this bi-directionality is of considerable interest, because recent studies of the generic properties of complex physical systems have demonstrated that reciprocal constraint between macroscopic and microscopic scales is a hallmark of systems displaying 'self-organization' (Kugler & Turvey, 1987; see also discussions by Langton in Lewin, 1992 [pp. 12-14; 188-191], and work on the emergent properties of "co-evolving" complex systems: Hogeweg, 1989; Kauffman, 1989; Kauffman & Johnsen, 1991; Packard, 1989).

Such self-organizing systems (hypothesized as underlying such diverse phenomena as the construction of insect nests and evolutionary and ecological dynamics) display the property that the 'local' interactions among a large number of microscopic system components can lead to emergent patterns of 'global' organization and order. The emergent global organization also places constraints on the components and their local interactions. Thus, self-organization provides a principled linkage between descriptions of different dimensionality of the same system: the high-dimensional description (with many degrees of freedom) of the local interactions and the low-dimensional description (with few degrees of freedom) of the emergent global patterns. From this point of view, then, speech can be viewed as a single complex system (with low-dimensional macroscopic and high-dimensional microscopic properties) rather than as two distinct components.
A different recent attempt to articulate the nature of the constraints holding between the cognitive and physical structures can be found in Pierrehumbert (1990), in which the relation between the structures is argued to be a 'semantic' one, parallel to the relation that obtains between concepts and their real world denotations. In this view, macroscopic structure is constrained by the microscopic properties of speech and by the principles guiding human cognitive category-formation. However, the view fails to account for the apparent bi-directionality of the constraints. That is, there is no possibility of constraining the microscopic properties of speech by its macroscopic properties in this view. (For a discussion of possible limitations to a dynamic approach to phonology, see Pierrehumbert & Pierrehumbert, 1990.)

The 'articulatory phonology' that we have been developing (e.g., Browman & Goldstein, 1986, 1989, 1992) attempts to understand phonology (the cognitive) as the low-dimensional macroscopic description of a physical system. In this work, rather than rejecting Chomsky and Halle's constrained relation between the physical and cognitive, as the phonetic implementation approaches have done, we have, if anything, increased the hypothesized tightness of that relation by using the concept of different dimensionality. We have surmised that the problem with the program proposed by Chomsky and Halle was instead in their choice of the elementary units of the system. In particular, we have argued that it is wrong to assume that the elementary units are (1) static, (2) neutral between articulation and acoustics, and (3) arranged in non-overlapping chunks. Assumptions (1) and (3) have been argued against by Fowler et al. (1980), and (3) has also been rejected by most of the work in 'nonlinear' phonology over the past 15 years. Assumption (2) has been, at least partially, rejected in the 'active articulator' version of 'feature geometry' (Halle, 1982; Sagey, 1986; McCarthy, 1988).

3. GESTURES

Articulatory phonology takes seriously the view that the units of speech production are actions, and therefore that (1) they are dynamic, not static. Further, since articulatory phonology considers phonological functions such as contrast to be low dimensional, macroscopic descriptions of such actions, the basic units are (2) not neutral between articulation and acoustics, but rather are articulatory in nature. Thus, in articulatory phonology, the basic phonological unit is the articulatory gesture, which is defined as a dynamical system specified with a characteristic set of parameter values (see Saltzman, in press). Finally, because the tasks are distributed across the various articulator sets of the vocal tract (the lips, tongue, glottis, velum, etc.), an utterance is modeled as an ensemble, or constellation, of a small number of (3) potentially overlapping gestural units.

As will be elaborated below, contrast among utterances can be defined in terms of these gestural constellations. Thus, these structures can capture the low-dimensional properties of utterances. In addition, because each gesture is defined as a dynamical system, no rules of implementation are required to characterize the high-dimensional properties of the utterance. A time-varying pattern of articulator motion (and its resulting acoustic consequences) is lawfully entailed by the dynamical systems themselves: they are self-implementing. Moreover, these time-varying patterns automatically display the property of context dependence (which is ubiquitous in the high dimensional description of speech) even though the gestures are defined in a context-independent fashion. The nature of the articulatory dimensions along which the individual dynamical units are defined allows this context dependence to emerge lawfully.

The articulatory phonology approach has been incorporated into a computational system being developed at Haskins Laboratories (Browman & Goldstein, 1990a,c; Browman, Goldstein, Kelso, Rubin, & Saltzman, 1984; Saltzman, 1986; Saltzman & Munhall, 1989). In this system, illustrated in Figure 1, utterances are organized ensembles (or constellations) of units of articulatory action called gestures. Each gesture is modeled as a dynamical system that characterizes the formation (and release) of a local constriction within the vocal tract (the gesture's functional goal or 'task'). For example, the word "ban" begins with a gesture whose task is lip closure.

Figure 1. Computational system for generating speech using dynamically-defined articulatory gestures. [Diagram: intended utterance → LINGUISTIC GESTURAL MODEL → TASK DYNAMIC MODEL → VOCAL TRACT MODEL → output speech.]

The formation of this constriction entails a change in the distance between upper and lower lips (or Lip Aperture) over time. This change is modeled using a second order system (a 'point attractor,' Abraham & Shaw, 1982), specified with particular values for the equilibrium position and stiffness parameters. (Damping is, for the most part, assumed to be critical, so that the system approaches its equilibrium position and doesn't overshoot it.) During the activation interval for this gesture, the equilibrium position for Lip Aperture is set to the goal value for lip closure; the stiffness setting, combined with the damping, determines the amount of time it will take for the system to get close to the goal of lip closure.
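To make the point-attractor behavior concrete, the critically damped second-order system can be sketched in a few lines of code. This is our illustration, not the Haskins implementation; the numerical values (unit mass, the two stiffness settings, a lip aperture moving from 10 mm open to a -3.5 mm closure goal) are hypothetical, chosen only to show how the stiffness setting determines how quickly the tract variable settles at its equilibrium position.

```python
import math

def simulate_gesture(x0, goal, stiffness, mass=1.0, dt=0.001, duration=0.3):
    """Critically damped second-order system:
         mass * x'' + damping * x' + stiffness * (x - goal) = 0,
       with damping fixed at the critical value 2*sqrt(stiffness*mass),
       so the tract variable approaches the goal without overshooting."""
    damping = 2.0 * math.sqrt(stiffness * mass)
    x, v = x0, 0.0
    trajectory = [x]
    for _ in range(int(duration / dt)):
        a = (-damping * v - stiffness * (x - goal)) / mass  # acceleration
        v += a * dt                                         # semi-implicit Euler step
        x += v * dt
        trajectory.append(x)
    return trajectory

# Hypothetical lip-aperture closure: from 10 mm open to a -3.5 mm goal.
stiff = simulate_gesture(x0=10.0, goal=-3.5, stiffness=1000.0)
lax = simulate_gesture(x0=10.0, goal=-3.5, stiffness=250.0)
```

With the higher stiffness the aperture settles near the goal well within the 300 ms window, while the low-stiffness gesture is still farther from the goal at any given moment; neither trajectory overshoots, which is the signature of critical damping.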
The set of task or tract variables currently implemented in the computational model is listed at the top left of Figure 2, and the sagittal vocal tract shape below illustrates their geometric definitions. This set of tract variables is hypothesized to be sufficient for characterizing most of the gestures of English (exceptions involve the details of characteristic shaping of constrictions, see Browman & Goldstein, 1989). For oral gestures, two paired tract variable regimes are specified, one controlling the constriction degree of a particular structure, the other its constriction location (a tract variable regime consists of a set of values for the dynamic parameters of stiffness, equilibrium position, and damping ratio). Thus, the specification for an oral gesture includes an equilibrium position, or goal, for each of two tract variables, as well as a stiffness (which is currently yoked across the two tract variables). Each functional goal for a gesture is achieved by the coordinated action of a set of articulators, that is, a coordinative structure (Fowler et al., 1980; Kelso, Saltzman, & Tuller, 1986; Saltzman, 1986; Turvey, 1977); the sets of articulators used for each of the tract variables are shown on the top right of Figure 2, with the articulators indicated on the outline of the vocal tract model below. Note that the same articulators are shared by both of the paired oral tract variables, so that altogether there are five distinct articulator sets, or coordinative structure types, in the system.

      tract variable                     articulators involved
LP    lip protrusion                     upper & lower lips, jaw
LA    lip aperture                       upper & lower lips, jaw
TTCL  tongue tip constrict location      tongue tip, tongue body, jaw
TTCD  tongue tip constrict degree        tongue tip, tongue body, jaw
TBCL  tongue body constrict location     tongue body, jaw
TBCD  tongue body constrict degree       tongue body, jaw
VEL   velic aperture                     velum
GLO   glottal aperture                   glottis

[Sagittal vocal tract outline indicating the velum, tongue body center, tongue tip, upper lip, lower lip, jaw, and glottis.]

Figure 2. Tract variables and their associated articulators.
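The parameterization just described (paired tract-variable regimes for an oral gesture, with stiffness yoked across the pair) can be summarized in a small data-structure sketch. This is our schematic rendering, not code from the Haskins system; the class and field names are invented for illustration, and the stiffness value is hypothetical (the -3.5 mm and 56 degree goals are the example values given later in the text for the tongue tip {clo alv} gesture).

```python
from dataclasses import dataclass

@dataclass
class Regime:
    """Dynamic parameters controlling one tract variable."""
    equilibrium: float          # goal value (e.g., constriction degree in mm)
    stiffness: float
    damping_ratio: float = 1.0  # 1.0 = critically damped

@dataclass
class OralGesture:
    """An oral gesture pairs a constriction-degree regime with a
    constriction-location regime; stiffness is yoked across the pair."""
    degree: Regime
    location: Regime

    def __post_init__(self):
        # yoking: the location regime inherits the degree regime's stiffness
        self.location.stiffness = self.degree.stiffness

# Tongue-tip alveolar closure: {clo} = -3.5 mm degree, {alv} = 56 degrees
# location; stiffness value is hypothetical.
tt_clo_alv = OralGesture(
    degree=Regime(equilibrium=-3.5, stiffness=1000.0),
    location=Regime(equilibrium=56.0, stiffness=0.0),
)
```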
In the computational system the articulators are those of a vocal tract model (Rubin, Baer, & Mermelstein, 1981) that can generate speech waveforms from a specification of the positions of individual articulators. When a dynamical system (or pair of them) corresponding to a particular gesture is imposed on the vocal tract, the task-dynamic model (Saltzman, in press; Saltzman, 1986; Saltzman & Kelso, 1987; Saltzman & Munhall, 1989) calculates the time-varying trajectories of the individual articulators comprising that coordinative structure, based on the information about values of the dynamic parameters, etc., contained in its input. These articulator trajectories are input to the vocal tract model, which then calculates the resulting global vocal tract shape, area function, transfer function, and speech waveform (see Figure 1).
Defining gestures dynamically can provide a principled link between macroscopic and microscopic properties of speech. To illustrate some of the ways in which this is true, consider the example of lip closure. The values of the dynamic parameters associated with a lip closure gesture are macroscopic properties that define it as a phonological unit and allow it to contrast with other gestures, such as the narrowing gesture for [w]. These values are definitional, and remain invariant as long as the gesture is active. At the same time, however, the gesture intrinsically specifies the (microscopic) patterns of continuous change that the lips can exhibit over time. These changes emerge as the lawful consequences of the dynamical system, its parameters, and the initial conditions. Thus, dynamically defined gestures provide a lawful link between macroscopic and microscopic properties.
While tract variable goals are specified numerically, and in principle could take on any real value, the actual values used to specify the gestures of English in the model cluster in narrow ranges that correspond to contrastive categories: for example, in the case of constriction degree, different ranges are found for gestures that correspond to what are usually referred to as stops, fricatives, and approximants. Thus, paradigmatic comparison (or a density distribution) of the numerical specifications of all English gestures would reveal a macroscopic structure of contrastive categories. The existence of such narrow ranges is predicted by approaches such as the quantal theory (e.g., Stevens, 1989) and the theory of adaptive dispersion (e.g., Lindblom, MacNeilage, & Studdert-Kennedy, 1983), although the dimensions investigated in those approaches are not identical to the tract variable dimensions. These approaches can be seen as accounting for how microscopic continua are partitioned into a small number of macroscopic categories.
The physical properties of a given phonological unit vary considerably depending on its context (e.g., Kent & Minifie, 1977; Öhman, 1966; Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). Much of this context dependence emerges lawfully from the use of task dynamics. An example of this kind of context dependence in lip closure gestures can be seen in the fact that the three independent articulators that can contribute to closing the lips (upper lip, lower lip, and jaw) do so to different extents as a function of the vowel environment in which the lip closure is produced (Macchi, 1988; Sussman, MacNeilage, & Hanson, 1973). The value of lip aperture achieved, however, remains relatively invariant no matter what the vowel context. In the task-dynamic model, the articulator variation results automatically from the fact that the lip closure gesture is modeled as a coordinative structure that links the movements of the three articulators in achieving the lip closure task. The gesture is specified invariantly in terms of the tract variable of lip aperture, but the closing action is distributed across component articulators in a context-dependent way. For example, in an utterance like [ibi], the lip closure is produced concurrently with the tongue gesture for a high front vowel. This vowel gesture will tend to raise the jaw, and thus less activity of the upper and lower lips will be required to effect the lip closure goal than in an utterance like [aba]. These microscopic variations emerge lawfully from the task dynamic specification of the gestures, combined with the fact of overlap (Kelso, Saltzman, & Tuller, 1986; Saltzman & Munhall, 1989).
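The way a coordinative structure distributes an invariant tract-variable task across redundant articulators can be illustrated with a toy calculation. In this sketch (ours, not the task-dynamic model itself), lip aperture is simplified to LA = upper lip - (jaw + lower-lip offset), and a commanded change in LA is spread across the three articulators by a weighted minimum-norm rule; all weights and positions are hypothetical.

```python
def distribute_closure(la_goal, upper_lip, lower_lip_offset, jaw,
                       weights=(1.0, 1.0, 4.0)):
    """Split the lip-aperture change needed to reach la_goal across
    upper lip, lower lip, and jaw: the weighted minimum-norm solution
    of J @ dq = d_la for J = [1, -1, -1] (larger weight = less movement)."""
    la_now = upper_lip - (jaw + lower_lip_offset)
    d_la = la_goal - la_now
    w_ul, w_ll, w_jaw = weights
    denom = 1.0 / w_ul + 1.0 / w_ll + 1.0 / w_jaw
    d_ul = (1.0 / w_ul) * d_la / denom       # upper lip lowers
    d_ll = -(1.0 / w_ll) * d_la / denom      # lower lip raises
    d_jaw = -(1.0 / w_jaw) * d_la / denom    # jaw raises (penalized by weight)
    return d_ul, d_ll, d_jaw

# The same lip-closure task (LA goal = 0) in two vowel contexts:
low_jaw = distribute_closure(0.0, upper_lip=12.0, lower_lip_offset=2.0, jaw=0.0)   # [aba]-like
high_jaw = distribute_closure(0.0, upper_lip=12.0, lower_lip_offset=2.0, jaw=6.0)  # [ibi]-like
```

In the high-jaw context the lips move less, yet the achieved lip aperture is identical in both contexts, mirroring the [ibi] vs. [aba] observation: invariant task, context-dependent articulator contributions.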
4. GESTURAL STRUCTURES

During the act of talking, more than one gesture is activated, sometimes sequentially and sometimes in an overlapping fashion. Recurrent patterns of gestures are considered to be organized into gestural constellations. In the computational model (see Figure 1), the linguistic gestural model determines the relevant constellations for any arbitrary input utterance, including the phasing of the gestures. That is, a constellation of gestures is a set of gestures that are coordinated with one another by means of phasing, where for this purpose (and this purpose only), the dynamical regime for each gesture is treated as if it were a cycle of an undamped system with the same stiffness as the actual regime. In this way, any characteristic point in the motion of the system can be identified with a phase of this virtual cycle. For example, the movement onset of a gesture is at phase 0 degrees, while the achievement of the constriction goal (the point at which the critically damped system gets sufficiently close to the equilibrium position) occurs at phase 240 degrees. Pairs of gestures are coordinated by specifying the phases of the two gestures that are synchronous. For example, two gestures could be phased so that their movement onsets are synchronous (0 degrees phased to 0 degrees), or so that the movement onset of one is phased to the goal achievement of another (0 degrees phased to 240 degrees), etc. Generalizations that characterize some phase relations in the gestural constellations of English words are proposed in Browman and Goldstein (1990c). As is the case for the values of the dynamic parameters, values of the synchronized phases also appear to cluster in narrow ranges, with onset of movement (0 degrees) and achievement of goal (240 degrees) being the most common (Browman & Goldstein, 1990a).
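The virtual-cycle bookkeeping can be made concrete with a short sketch (ours, with hypothetical stiffness values): an undamped system with the gesture's stiffness k and unit mass cycles at ω = √k rad/s, so a phase of φ degrees corresponds to a time of φ (in radians) / ω after movement onset, and synchronizing specified phase points of two gestures fixes their relative onset times.

```python
import math

def time_at_phase(phase_deg, stiffness, mass=1.0):
    """Time after movement onset at which the virtual undamped cycle
    (omega = sqrt(stiffness / mass)) reaches the given phase."""
    omega = math.sqrt(stiffness / mass)
    return math.radians(phase_deg) / omega

def onset_of_a(phase_a_deg, stiffness_a, phase_b_deg, stiffness_b, onset_b=0.0):
    """Onset time for gesture A such that A's phase_a point is
    synchronous with gesture B's phase_b point."""
    sync_time = onset_b + time_at_phase(phase_b_deg, stiffness_b)
    return sync_time - time_at_phase(phase_a_deg, stiffness_a)

# Movement onset (0 degrees) of a tongue-tip closure phased to the goal
# achievement (240 degrees) of a velic gesture; stiffnesses hypothetical.
tt_onset = onset_of_a(0.0, 1000.0, 240.0, 400.0, onset_b=0.0)
```

Phasing two movement onsets to each other (0 degrees to 0 degrees) simply makes the onsets simultaneous, whatever the stiffnesses; 0 degrees phased to 240 degrees delays the second gesture's onset by the time the first takes to reach its goal.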
An example of a gestural constellation (for the word "pawn" as pronounced with the back unrounded vowel characteristic of much of the U.S.) is shown in Figure 3a, which gives an idea of the kind of information contained in the gestural dictionary. Each row, or tier, shows the gestures that control the distinct articulator sets: velum, tongue tip, tongue body, lips, and glottis. The gestures are represented here by descriptors, each of which stands for a numerical equilibrium position value assigned to a tract variable. In the case of the oral gestures, there are two descriptors, one for each of the paired tract variables. For example, for the tongue tip gesture labelled {clo alv}, {clo} stands for -3.5 mm (negative value indicates compression of the surfaces), and {alv} stands for 56 degrees (where 90 degrees is vertical and would correspond to a midpalatal constriction). The association lines connect gestures that are phased with respect to one another. For example, the tongue tip {clo alv} gesture and the velum {wide} gesture (for nasalization) are phased such that the point indicating 0 degrees (onset of movement) of the tongue tip closure gesture is synchronized with the point indicating 240 degrees (achievement of goal) of the velic gesture.
Each gesture is assumed to be active for a fixed proportion of its virtual cycle (the proportion is different for consonant and vowel gestures). The linguistic gestural model uses this proportion, along with the stiffness of each gesture and the phase relations among the gestures, to calculate a gestural score that specifies the temporal activation intervals for each gesture in an utterance. One form of this gestural score for "pawn" is shown in Figure 3b, with the horizontal extent of each box indicating its activation interval, and the lines between boxes indicating which gesture is phased with respect to which other gesture(s), as before. Note that there is substantial overlap among the gestures. This kind of overlap can result in certain types of context dependence in the articulatory trajectories of the invariantly specified gestures. In addition, overlap can cause the kinds of acoustic variation that have been traditionally described as allophonic variation. For example, in this case, note the substantial overlap between the velic lowering gesture (velum {wide}) and the gesture for the vowel (tongue body {narrow phar}). This will result in an interval of time during which the velo-pharyngeal port is open and the vocal tract is in position for the vowel, that is, a nasalized vowel. Traditionally, the fact of nasalization has been represented by a rule that changes an oral vowel into a nasalized one before a (final) nasal consonant. But viewed in terms of gestural constellations, this nasalization is just the lawful consequence of how the individual gestures are coordinated. The vowel gesture itself hasn't changed in any way: it has the same specification in this word and in the word "pawed" (which is not nasalized).
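The consequence of overlap just described can be read directly off a gestural score treated as a set of labeled activation intervals. In the sketch below the tiers and descriptors follow the "pawn" example, but the millisecond values are invented for illustration (in the model they are derived from the stiffnesses and phase relations, not stipulated).

```python
def overlap(g1, g2):
    """Return the interval during which two gestures are both active, or None."""
    start = max(g1["start"], g2["start"])
    end = min(g1["end"], g2["end"])
    return (start, end) if start < end else None

# Schematic gestural score for "pawn"; all times (ms) are hypothetical.
score_pawn = [
    {"tier": "LIPS",        "goal": "clo lab",     "start":   0, "end": 120},
    {"tier": "GLOTTIS",     "goal": "wide",        "start":   0, "end": 140},
    {"tier": "TONGUE BODY", "goal": "narrow phar", "start":  60, "end": 320},
    {"tier": "VELUM",       "goal": "wide",        "start": 180, "end": 360},
    {"tier": "TONGUE TIP",  "goal": "clo alv",     "start": 260, "end": 380},
]

velum = next(g for g in score_pawn if g["tier"] == "VELUM")
vowel = next(g for g in score_pawn if g["tier"] == "TONGUE BODY")
nasalized = overlap(velum, vowel)  # open velo-pharyngeal port during the vowel
```

With these illustrative times, `nasalized` is the interval (180, 320): the stretch traditionally transcribed as a nasalized vowel, obtained with no change to the vowel gesture's own specification.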
The parameter value specifications and activation intervals from the gestural score are input to the task dynamic model (Figure 1), which calculates the time-varying response of the tract variables and component articulators to the imposition of the dynamical regimes defined by the gestural score. Some of the time-varying responses are shown in Figure 3c, along with the same boxes indicating the activation intervals for the gestures. Note that the movement curves change over time even when a tract variable is not under the active control of some gesture. Such motion can be seen, for example, in the LIPS panel, after the end of the box for the lip closure gesture. This motion results from one or both of two sources. (1) When an articulator is not part of any active gesture, the articulator returns to a neutral position. In the example, the upper lip and the lower lip articulators both are returning to a neutral position after the end of the lip closure gesture. (2) One of the articulators linked to the inactive tract variable may also be linked to some active tract variable, and thus cause passive changes in the inactive tract variable.
Figure 3. Various displays from the computational model for "pawn." (a) Gestural descriptors and association lines. (b) Gestural descriptors and association lines plus activation boxes. (c) Gestural descriptors and activation boxes plus generated movements of (from top to bottom): Velic Aperture, vertical position of the Tongue Tip (with respect to the fixed palate/teeth), vertical position of the Tongue Body (with respect to the fixed palate/teeth), Lip Aperture, Glottal Aperture. [Each panel shows the VELUM, TONGUE TIP, TONGUE BODY, LIPS, and GLOTTIS tiers, over a time scale of roughly 0-400 ms.]
In the example, the jaw is part of the coordinative structure for the tongue body vowel gesture, as well as part of the coordinative structure for the lip closure gesture. Therefore, even after the lip closure gesture becomes inactive, the jaw is affected by the vowel gesture, and its lowering for the vowel causes the lower lip to lower passively as well.
The gestural constellations not only characterize the microscopic properties of the utterances, as discussed above; systematic differences among the constellations also define the macroscopic property of phonological contrast in a language. Given the nature of gestural constellations, the possible ways in which they may differ from one another are, in fact, quite constrained. In other papers (e.g., Browman & Goldstein, 1986; 1989; 1992) we have begun to show that gestural structures are suitable for characterizing phonological functions such as contrast, and what the relation is between the view of phonological structure implicit in gestural constellations and that found in other contemporary views of phonology (see also Clements, 1992 for a discussion of these relations). Here we will simply give some examples of how the notion of contrast is defined in a system based on gestures, using the schematic gestural scores in Figure 4.

One way in which constellations may differ is in the presence vs. absence of a gesture. This kind of difference is illustrated by two pairs of subfigures in Figure 4: (4a) vs. (4b), and (4b) vs. (4d). (4a) "pan" differs from (4b) "ban" in having a glottis {wide} gesture (for voicelessness), while (4b) "ban" differs from (4d) "Ann" in having a labial closure gesture (for the initial consonant). Constellations may also differ in the particular tract variable/articulator set controlled by a gesture within the constellation, as illustrated by (4a) "pan" vs. (4c) "tan," which differ in terms of whether it is the lips or the tongue tip that performs the initial closure. A further way in which constellations may differ is illustrated by comparing (4e) "sad" to (4f) "shad," in which the value of the constriction location tract variable for the initial tongue tip constriction is the only difference between the two utterances. Finally, two constellations may contain the same gestures and differ simply in how they are coordinated, as can be seen in (4g) "dab" vs. (4h) "bad."
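The four kinds of contrast just listed can be made concrete by treating each constellation as a set of gestures and comparing sets. The sketch below is purely illustrative: the tuples, descriptor strings, and onset values are invented for the example and are not the computational model's actual representation.

```python
# Schematic sketch: gestural constellations as sets of gestures, so that
# the contrast types of Figure 4 fall out of simple set comparisons.
# Each gesture: (tract variable, constriction degree, location, onset step).
# All values below are invented for illustration.

def diff(a, b):
    """Gestures present in exactly one of two constellations."""
    return (a - b) | (b - a)

pan = {("GLOTTIS", "wide", None, 0),
       ("LIPS", "clo", "lab", 0),
       ("TONGUE BODY", "wide", "phar", 0),
       ("VELUM", "wide", None, 2),
       ("TONGUE TIP", "clo", "alv", 2)}

# Presence vs. absence: "ban" lacks only the glottal gesture.
ban = pan - {("GLOTTIS", "wide", None, 0)}

# Different articulator set: "tan" makes the initial closure with the
# tongue tip rather than the lips.
tan = (pan - {("LIPS", "clo", "lab", 0)}) | {("TONGUE TIP", "clo", "alv", 0)}

# Same gestures, different coordination: "dab" vs. "bad" share gestures
# but swap the onsets of the two consonant closures.
dab = {("TONGUE TIP", "clo", "alv", 0),
       ("TONGUE BODY", "wide", "phar", 0),
       ("LIPS", "clo", "lab", 2)}
bad = {("LIPS", "clo", "lab", 0),
       ("TONGUE BODY", "wide", "phar", 0),
       ("TONGUE TIP", "clo", "alv", 2)}

def gestures_ignoring_timing(constellation):
    """Project away coordination, keeping only the gestures themselves."""
    return {g[:3] for g in constellation}
```

Under this toy encoding, "pan" and "ban" differ by exactly one gesture, while "dab" and "bad" contain identical gestures and differ only in timing, mirroring the constrained space of contrasts described above.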
[Figure 4 appears here: eight schematic gestural scores, panels (a)-(h), each with the five tiers VELUM, TONGUE TIP, TONGUE BODY, LIPS, and GLOTTIS, showing gestural descriptors such as clo lab, clo alv, crit alv, crit alvpal, wide phar, and glottis wide.]
Figure 4. Schematic gestural scores exemplifying contrast. (a) "pan" (b) "ban" (c) "tan" (d) "Ann" (e) "sad" (f) "shad" (g) "dab" (h) "bad."
This chapter described an approach to the description of speech in which both the cognitive and physical aspects of speech are captured by viewing speech as a set of actions, or dynamic tasks, that can be described using different dimensionalities: low dimensional or macroscopic for the cognitive, and high dimensional or microscopic for the physical. A computational model that instantiates this approach to speech was briefly outlined. It was argued that this approach to speech, which is based on dynamical description, has several advantages. First, it captures both the phonological (cognitive) and physical regularities that minimally must be captured in any description of speech. Second, it does so in a way that unifies the two descriptions as descriptions with different dimensionality of a single complex system. The latter attribute means that this approach provides a principled view of the reciprocal constraints that the physical and phonological aspects of speech exhibit.
REFERENCES
Abraham, R. H., & Shaw, C. D. (1982). Dynamics—The geometry of behavior. Santa Cruz, CA: Aerial Press.
Anderson, S. R. (1974). The organization of phonology. New York, NY: Academic Press.
Browman, C. P., & Goldstein, L. (1986). Towards an articulatory phonology. Phonology Yearbook, 3, 219-252.
Browman, C. P., & Goldstein, L. (1989). Articulatory gestures as phonological units. Phonology, 6, 201-251.
Browman, C. P., & Goldstein, L. (1990a). Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, 18, 299-320.
Browman, C. P., & Goldstein, L. (1990b). Representation and reality: Physical systems and phonological structure. Journal of Phonetics, 18, 411-424.
Browman, C. P., & Goldstein, L. (1990c). Tiers in articulatory phonology, with some implications for casual speech. In J. Kingston & M. E. Beckman (Eds.), Papers in laboratory phonology I: Between the grammar and physics of speech (pp. 341-376). Cambridge: Cambridge University Press.
Browman, C. P., & Goldstein, L. (1992). Articulatory phonology: An overview. Phonetica, 49, 155-180.
Browman, C. P., Goldstein, L., Kelso, J. A. S., Rubin, P., & Saltzman, E. (1984). Articulatory synthesis from underlying dynamics. Journal of the Acoustical Society of America, 75, S22-S23(A).
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York, NY: Harper & Row.
Clements, G. N. (1992). Phonological primes: Features or gestures? Phonetica, 49, 181-193.
Cohn, A. C. (1990). Phonetic and phonological rules of nasalization. UCLA WPP, 76.
Coleman, J. (1992). The phonetic interpretation of headed phonological structures containing overlapping constituents. Phonology, 9, 1-44.
Fourakis, M., & Port, R. (1986). Stop epenthesis in English. Journal of Phonetics, 14, 197-221.
Fowler, C. A., Rubin, P., Remez, R. E., & Turvey, M. T. (1980). Implications for speech production of a general theory of action. In B. Butterworth (Ed.), Language production (pp. 373-420). New York, NY: Academic Press.
Haken, H. (1977). Synergetics: An introduction. Heidelberg: Springer-Verlag.
Halle, M. (1982). On distinctive features and their articulatory implementation. Natural Language and Linguistic Theory, 1, 91-105.
Hockett, C. (1955). A manual of phonology. Chicago: University of Chicago Press.
Hogeweg, P. (1989). MIRROR beyond MIRROR, puddles of LIFE. In C. Langton (Ed.), Artificial life (pp. 297-316). New York: Addison-Wesley.
Jakobson, R., Fant, C. G. M., & Halle, M. (1951). Preliminaries to speech analysis: The distinctive features and their correlates. Cambridge, MA: MIT Press.
Kauffman, S. (1989). Principles of adaptation in complex systems. In D. Stein (Ed.), Sciences of complexity (pp. 619-711). New York: Addison-Wesley.
Kauffman, S. (1991). Antichaos and adaptation. Scientific American, 265, 78-84.
Kauffman, S., & Johnsen, S. (1991). Co-evolution to the edge of chaos: Coupled fitness landscapes, poised states, and co-evolutionary avalanches. In C. Langton, C. Taylor, J. D. Farmer, & S. Rasmussen (Eds.), Artificial life II (pp. 325-369). New York: Addison-Wesley.
Keating, P. A. (1985). CV phonology, experimental phonetics, and coarticulation. UCLA WPP, 62, 1-13.
Keating, P. A. (1988). Underspecification in phonetics. Phonology, 5, 275-292.
Keating, P. A. (1990). Phonetic representations in a generative grammar. Journal of Phonetics, 18, 321-334.
Kelso, J. A. S., Saltzman, E. L., & Tuller, B. (1986). The dynamical perspective on speech production: Data and theory. Journal of Phonetics, 14, 29-59.
Kent, R. D., & Minifie, F. D. (1977). Coarticulation in recent speech production models. Journal of Phonetics, 5, 115-133.
Klatt, D. H. (1976). Linguistic uses of segmental duration in English: Acoustic and perceptual evidence. Journal of the Acoustical Society of America, 59, 1208-1221.
Kugler, P. N., & Turvey, M. T. (1987). Information, natural law, and the self-assembly of rhythmic movement. Hillsdale, NJ: Lawrence Erlbaum Associates.
Ladefoged, P. (1980). What are linguistic sounds made of? Language, 56, 485-502.
Ladefoged, P. (1982). A course in phonetics (2nd ed.). New York, NY: Harcourt Brace Jovanovich.
Lewin, R. (1992). Complexity. New York, NY: Macmillan.
Liberman, A. M., Cooper, F. S., Shankweiler, D. P., & Studdert-Kennedy, M. (1967). Perception of the speech code. Psychological Review, 74, 431-461.
Liberman, M., & Pierrehumbert, J. (1984). Intonational invariance under changes in pitch range and length. In M. Aronoff, R. T. Oehrle, F. Kelley, & B. Wilker Stephens (Eds.), Language sound structure (pp. 157-233). Cambridge, MA: MIT Press.
Lindblom, B., MacNeilage, P., & Studdert-Kennedy, M. (1983). Self-organizing processes and the explanation of phonological universals. In B. Butterworth, B. Comrie, & O. Dahl (Eds.), Explanations of linguistic universals (pp. 181-203). The Hague: Mouton.
Macchi, M. (1988). Labial articulation patterns associated with segmental features and syllable structure in English. Phonetica, 45, 109-121.
Madore, B. F., & Freedman, W. L. (1987). Self-organizing structures. American Scientist, 75, 252-259.
Manuel, S. Y., & Krakow, R. A. (1984). Universal and language particular aspects of vowel-to-vowel coarticulation. Haskins Laboratories Status Report on Speech Research, SR-77/78, 69-78.
McCarthy, J. J. (1988). Feature geometry and dependency: A review. Phonetica, 45, 84-108.
Ohala, J. (1983). The origin of sound patterns in vocal tract constraints. In P. F. MacNeilage (Ed.), The production of speech (pp. 189-216). New York, NY: Springer-Verlag.
Ohman, S. E. G. (1966). Coarticulation in VCV utterances: Spectrographic measurements. Journal of the Acoustical Society of America, 39, 151-168.
Packard, N. (1989). Intrinsic adaptation in a simple model for evolution. In C. Langton (Ed.), Artificial life (pp. 141-155). New York: Addison-Wesley.
Pierrehumbert, J. (1990). Phonological and phonetic representation. Journal of Phonetics, 18, 375-394.
Pierrehumbert, J. B., & Pierrehumbert, R. T. (1990). On attributing grammars to dynamical systems. Journal of Phonetics, 18.
Port, R. F. (1981). Linguistic timing factors in combination. Journal of the Acoustical Society of America, 69, 262-274.
Rubin, P. E., Baer, T., & Mermelstein, P. (1981). An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America, 70, 321-328.
Sagey, E. C. (1986). The representation of features and relations in nonlinear phonology. Doctoral dissertation, MIT.
Saltzman, E. (1986). Task dynamic coordination of the speech articulators: A preliminary model. In H. Heuer & C. Fromm (Eds.), Generation and modulation of action patterns (pp. 129-144). Berlin/Heidelberg: Springer-Verlag.
Saltzman, E. (in press). Dynamics and coordinate systems in skilled sensorimotor activity. In T. van Gelder & B. Port (Eds.), Mind as motion. Cambridge, MA: MIT Press.
Saltzman, E., & Kelso, J. A. S. (1987). Skilled actions: A task dynamic approach. Psychological Review, 94, 84-106.
Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1, 333-382.
Schoner, G., & Kelso, J. A. S. (1988). Dynamic pattern generation in behavioral and neural systems. Science, 239, 1513-1520.
Stevens, K. N. (1972). The quantal nature of speech: Evidence from articulatory-acoustic data. In E. E. David & P. B. Denes (Eds.), Human communication: A unified view (pp. 51-66). New York, NY: McGraw-Hill.
Stevens, K. N. (1989). On the quantal nature of speech. Journal of Phonetics, 17, 3-45.
Sussman, H. M., MacNeilage, P. F., & Hanson, R. J. (1973). Labial and mandibular dynamics during the production of bilabial consonants: Preliminary observations. Journal of Speech and Hearing Research, 16, 397-420.
Turvey, M. T. (1977). Preliminaries to a theory of action with reference to vision. In R. Shaw & J. Bransford (Eds.), Perceiving, acting and knowing: Toward an ecological psychology. Hillsdale, NJ: Lawrence Erlbaum Associates.
Wood, S. (1982). X-ray and model studies of vowel articulation. Working Papers, Lund University, 23.

FOOTNOTES
*In T. van Gelder & B. Port (Eds.), Mind as motion. Cambridge, MA: MIT Press (in press).
†Also Department of Linguistics, Yale University.