Dept. for Speech, Music and Hearing
Quarterly Progress and Status Report

The KTH rule system for singing synthesis
Berndtsson, G.

journal: STL-QPSR, volume: 36, number: 1, year: 1995, pages: 001-022
http://www.speech.kth.se/qpsr



The KTH rule system for singing synthesis

Gunilla Berndtsson

Abstract

This article contains a description of rules controlling the singing synthesis at the Department of Speech Communication and Music Acoustics at the Royal Institute of Technology (KTH) in Stockholm. In our research, synthesis of singing has played an important part for a long period of time. The rules controlling the singing synthesiser MUSSE DIG are implemented in a programming environment originally developed for a text-to-speech system. There are context dependent rules for the pronunciation of vowels and consonants, as well as rules for musical performance. The latter rules create crescendos, tempo and vibrato changes etc., depending on the musical context as defined by a score file. The rules were developed using an analysis-by-synthesis strategy, i.e., vocal performances are synthesised, the result is analysed, and then the rules, which control the synthesis, are improved accordingly. In this article, musical rules, general rules for consonants and vowels, and rules for some special singing techniques are described.

Introduction

Synthesis of singing is a helpful tool in the research on singing and has been frequently used at KTH (Sundberg, 1987a,b, 1989). To synthesise a song, it is necessary to continuously control a great number of parameters such as sound level, formant frequencies and bandwidths, fundamental frequency (F0), source characteristics, and vibrato extent. As this cannot be done by hand in real time, and as exact control over the performance is important for research purposes, a rule system was constructed. A starting point for the synthesis work was the RULSYS text-to-speech system, developed and adapted for musical purposes by Rolf Carlson and Björn Granström (Carlson & Granström, 1975; Carlson et al., 1982, 1991).

Speech differs from singing in many important and characteristic ways, as illustrated in Figure 1. In singing, duration and fundamental frequency are quasi-quantized and voice timbre is a factor of prime relevance. In addition, the vibrato and pitch changes are important expressive variables obeying rules quite different from those used in speech. Hence, the text-to-speech synthesis offered invaluable tools, but the rule system needed major revisions to produce acceptable singing synthesis.

Another project of direct relevance to the singing synthesis is the Director Musices program, developed in parallel for automatic synthesis of instrumental music performance. There are, however, striking differences between singing and the performance of instrumental music, as illustrated in the same figure. Apart from the absence of consonants and vowels, instrument playing is characterised by much less flexible sound patterns. As can be seen in the figure, these differences are quite substantial even in the case of violin performance, although the violin is one of the least restricted instruments with respect to time-varying factors such as vibrato, sound level, and pitch. Although this system is capable of generating musically realistic instrumental performances, it cannot be directly adopted to produce acceptable


singing synthesis, and a large number of rules specially tailored for singing had to be added.


Fig. 1. Spectrograms of instrumental performance, singing performance, and synthesised speech. a) and b) illustrate the first phrase of Schubert's Ave Maria. In a), a professional violinist plays the excerpt. In b), the phrase is sung by a professional singer. In c), a speech synthesiser developed in our department speaks the words Ave Maria in German. A narrow bandwidth is used (45 Hz). The time scale is compressed 10 times in a) and b) as compared to the scale in c). The frequency scale is the same for all the spectrograms.


Our present system for singing synthesis can be used for analysis-by-synthesis of sung performance. The method makes it possible to test a variety of hypotheses regarding sung performance, for example, the means by which singers make the performance musically convincing and exciting. The analysis procedure is to synthesise sung performances of musical excerpts, to evaluate these, and to revise the synthesis strategy accordingly.

The MUSSE and MUSSE DIG singing synthesisers

The KTH singing synthesis has been developed over a long period. The first system used an analogue vowel synthesiser, MUSSE (Music and Singing Synthesis Equipment), built by Björn Larsson (1977). It was controlled by an Eclipse minicomputer (Malmgren, 1978) and was later complemented by a unit for synthesis of consonants (Ponteus, 1979).

In 1989-90, the present digital singing synthesiser, MUSSE DIG, was developed. It is installed in a portable PC containing a TMS320C30 floating-point digital signal processor (Carlsson & Neovius, 1990). To synthesise vowels, the resonances of the human vocal tract, the formants, must be accurately simulated. Both synthesisers achieve this by means of a set of second-order resonance filters in cascade, with variable frequencies and bandwidths. The rule system allows variation of the transition speed of the formants.

As MUSSE was basically a synthesiser for vocalic sounds, it contained only one branch of formant filters. This branch was also used for producing consonants. For instance, nasalisation was achieved by manipulating formants and bandwidths (Zera et al., 1984).

The strategy for producing consonants is somewhat different in MUSSE DIG, as can be seen in Figure 2. MUSSE DIG contains two branches, one for vowels and a second one with two resonance filters for non-aspirated consonants. The vowel branch is used both for aspirated sounds and for nasals.


Fig. 2. Block scheme of the MUSSE DIG singing synthesis model. The voice source is filtered by the vowel branch (upper) and the noise source is filtered by the fricative branch (lower) or the vowel branch. The symbols F1-F7 correspond to formant filters. The triangles symbolise amplifiers.

The voice source in MUSSE DIG is different from that in the MUSSE synthesiser. While the latter used a simple sawtooth signal with variable offset, the former uses


pulses similar to those of the human voice. The model for the glottal pulses, devised by Sten Ternström, is based on a phase-modulated cosine function. To produce a glottal pulse, the phase increment of the cosine is modulated, so that a local speeding up or slowing down of the time scale is accomplished. This method, used in the VOSIM model presented by Kaegi & Tempelaars (1978), was suggested by Peter Pabon (1994).
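As an editorial illustration, the phase-modulation idea can be sketched as follows; the modulation shape and the `mod_depth` parameter are assumptions made for the sketch, not details of the Ternström model.

```python
import math

def glottal_pulse(n_samples, mod_depth=0.5):
    """Sketch of a glottal-like pulse from a phase-modulated cosine.

    The phase increment of a single cosine cycle is modulated so that
    the time scale is locally sped up and slowed down, which skews the
    pulse shape.  `mod_depth` (0..1) is an illustrative parameter.
    """
    samples = []
    phase = 0.0
    nominal = 2.0 * math.pi / n_samples  # increment for one cycle
    for i in range(n_samples):
        # Speed the phase up early in the cycle, slow it down later.
        warp = 1.0 + mod_depth * math.sin(2.0 * math.pi * i / n_samples)
        phase += nominal * warp
        samples.append(0.5 * (1.0 - math.cos(phase)))
    return samples
```

Because the modulation integrates to zero over one cycle, the pulse still closes exactly after `n_samples` samples; only its shape is skewed.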

In the human voice, as well as in most musical instruments, an increase in sound level is associated with a decrease in spectrum tilt. As a result, the higher spectrum partials gain more in amplitude during a crescendo than the lower partials. Also, depending on glottal adduction, the amplitude of the voice source fundamental can be varied, such that an increased adduction is associated with a decrease of this amplitude. The MUSSE DIG voice source offers the possibility of varying these two parameters. One parameter controls the spectrum slope and another the balance between the voice source fundamental and the overtones. Thus, these parameters can be continuously changed by rules during the performance. Voice source amplitude is changed only during the closed phase of the glottal pulse, in order to prevent discontinuities in the source waveform.

For synthesising opera singers' voices, a sinusoidal vibrato is used, with variable frequency (number of undulations per second) and amplitude (the magnitude of the departures from the mean frequency). For singers who do not use vibrato, a random pitch variation (flutter) is used, reflecting the typical random variations of F0 (Ternström & Friberg, 1989). The flutter is implemented as white noise, which is added to the F0 parameter value before it is processed by a smoothing filter. In this way the flutter is controlled by three parameters: the flutter amplitude, and the frequency and the bandwidth of the filter smoothing F0.
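A minimal sketch of such a flutter generator, assuming a second-order resonant smoothing filter; the filter design, the update rate, and all parameter names are illustrative, since the text only specifies the three controlling parameters.

```python
import math
import random

def flutter_f0(f0_hz, n_samples, amp_hz=2.0, center_hz=5.0,
               bandwidth_hz=8.0, rate_hz=100.0, seed=1):
    """White noise added to F0 and smoothed by a resonant filter.

    `center_hz` and `bandwidth_hz` play the role of the smoothing
    filter's frequency and bandwidth; `amp_hz` is the flutter
    amplitude.  All values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Second-order resonator, updated at `rate_hz` frames per second.
    r = math.exp(-math.pi * bandwidth_hz / rate_hz)
    theta = 2.0 * math.pi * center_hz / rate_hz
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    gain = 1.0 - r          # rough amplitude normalisation
    y1 = y2 = 0.0
    out = []
    for _ in range(n_samples):
        x = rng.uniform(-1.0, 1.0)          # white noise
        y = a1 * y1 + a2 * y2 + gain * x    # smoothing filter
        y2, y1 = y1, y
        out.append(f0_hz + amp_hz * y)      # perturbed F0 track
    return out
```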

The MUSSE DIG synthesiser is usually controlled automatically by a score file processed by pronunciation and performance rules, but it can also be controlled interactively. The interactive mode is well suited for demonstrating the individual sound properties in the complex sound of a singing voice. The voice parameters can be changed manually from a control panel on the computer screen, or from external MIDI devices (Carlsson et al., 1991). These devices have been connected to MUSSE DIG by Sten Ternström, who also programmed the graphical panel on the computer screen in the Microsoft Windows environment, and the PC-DSP communication.

Rule system configuration

The rule system controlling the singing synthesis was first developed for the analogue MUSSE synthesiser. Descriptions of this rule system can be found in Zera et al. (1984) and in Carlsson (1988). Later, the rule system was adapted to the digital MUSSE DIG synthesiser. Adjustments were made to the new synthesis model, which contained a fricative branch for consonants, a different voice source, and refined pitch quantisation. Nevertheless, many performance rules from the original MUSSE system are still used.

The rule system for singing synthesis consists of three main types of files: a synthesiser file specifying the smoothing of control parameters, a definition file for


defining phoneme characteristics, and a rule file containing pronunciation and performance rules.

Synthesiser files define the way in which changes in control parameters are realised by means of programmable filters. For example, the formant frequencies are smoothed by a second order filter and the sound level is smoothed by a cosine filter.

Definition files reflect much of the personal voice characteristics, so that different voice types, for example, baritones or sopranos, have different definition files. These files define the phonetic features for each phoneme, and the default values for formant frequencies and bandwidths etc.

Rule files contain pronunciation rules modelling context dependent events such as formant transitions, and voice source and noise characteristics, etc. These files also include performance rules for musical expression. If a song is realised as nominally described in the score file, without any kind of expressive deviations from the score, the result sounds mechanical. Small, context dependent deviations can make the performance musical and interesting. The performance rules seem to help the listeners identify structural elements by enhancing differences in duration and pitch, and by grouping tones that belong together musically. These rules act on parameters like sound level, tone duration, F0, and vibrato amplitude.

Some of the performance rules were originally developed by Anders Friberg, Johan Sundberg and Lars Frydén for music played on instruments (Sundberg et al., 1991a; Friberg, 1991), and were later adapted to singing synthesis. Performance rules have also been created directly for singing in co-operation with Lars Frydén, a violinist and conservatory music teacher. He has suggested how the synthesised performance could be improved, almost as if the computer was one of his music students. Thus, the musical rules reflect in a quantitative form much of his professional competence. In addition, listening experiments using the performance rules have been carried out, in which musicians and non-musicians compared different synthesised performances (Thompson et al., 1989; Friberg & Sundberg, 1994a,b; Sundberg et al., 1991a,b; Friberg et al., 1987b). In some cases, measurements on real performances have either verified existing rules or resulted in the formulation of new rules. The rules are still under development. There are many ways to achieve a musically acceptable performance, and as yet the rules only cover some aspects.

This article describes the rules for musical performance used in the singing synthesis and also some special singing techniques. Also, the general rules for consonants and vowels are presented, while context-dependent rules for how specific consonants and vowels are pronounced are not described in this paper.

Synthesis procedure

The system has been extensively used for analysis by synthesis. Normally, the first attempt contains salient imperfections, which are dealt with according to their salience.

In tuning pronunciation rules it is mostly very helpful to analyse the result with a parallel Sonagraph display of the synthesis and a real singer's rendering of the same passage (Fant, 1959). The practicalities of the synthesis procedure can be described as follows:


The lyrics and the corresponding notes are first typed into a score file. The lyrics are written in a type of phonetic transcription containing information on vowel length, etc. The metronome value is given, and the notes are specified in terms of pitch name, octave number, and nominal duration. This specification can be complemented by chord symbols, phrase and subphrase markers, symbols for stressed and unstressed note position, and tie marks. An example of the first part of a score file is given in Example 1. It is an excerpt from the tenor part in the Kyrie of Johann Sebastian Bach's Mass in B minor.

Example 1. The metronome value is given after "MM". Chords are marked as "Qx", where "x" is the number of semitones from the root of the tonic chord. Minor tonality is marked by "-"; hence, Q0- refers to a minor tonic chord. Each note is represented by its vowel with preceding consonants. Long vowels are marked ":". After the vowel symbol follows information on pitch and duration. Sharps, naturals, and flats are symbolised "#", "=", and "b". The octaves are numbered such that A4 = 440 Hz. Note values are given by numbers, 1 = whole note, 2 = half note, 4 = quarter note, 8 = eighth note, etc., and dotting by ":" after the note value digit.

MM100 Q0- KY:=B34: RI:=B38 E=B34

The score file is processed by the rule system. The default values for each phoneme are taken from the definition file. They can be changed according to the context by the various rules. Then the parameters are smoothed by the synthesiser file in the RULSYS system. The result is a file containing all necessary control parameter values updated every 10 ms. This parameter file is fed to the MUSSE DIG synthesiser, which computes the sound waveform (Fig. 3).
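The score notation of Example 1 can be illustrated with a small, hypothetical parser; the token grammar is inferred from the description above, and the regular expression, the function names, and the assumption that the metronome value counts quarter notes are all the editor's, not the article's.

```python
import re

# Hypothetical token grammar inferred from Example 1: syllable text
# (with optional ":" for a long vowel), accidental "#"/"="/"b",
# pitch name, one octave digit, note-value digits, optional ":" dot.
TOKEN = re.compile(r'^([A-Z]+:?)([#=b])([A-G])(\d)(\d{1,2})(:?)$')

def parse_note(token, mm=100):
    m = TOKEN.match(token)
    if m is None:
        raise ValueError(f"unparsable token: {token}")
    syllable, accidental, pitch, octave, value, dot = m.groups()
    # Assumption: the MM value gives quarter notes per minute, so a
    # quarter note (value 4) lasts 60/MM seconds nominally.
    quarter_s = 60.0 / mm
    dur_s = quarter_s * 4.0 / int(value)
    if dot:                      # a dot lengthens the note by half
        dur_s *= 1.5
    return {"syllable": syllable, "accidental": accidental,
            "pitch": pitch, "octave": int(octave),
            "duration_s": dur_s}
```

For instance, the first token of Example 1, `KY:=B34:`, parses as the long syllable "KY:" sung on B natural in octave 3, a dotted quarter note.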

Fig. 3. Block scheme of the note-to-tone conversion of the singing synthesis. The score file is processed using the rule system, which is implemented in the RULSYS environment. Then the resulting synthesis parameters are fed to the MUSSE DIG synthesiser.

The RULSYS programming language

As mentioned, the programming language in the RULSYS system was originally developed for a rule-oriented description of speech. The notation used is close to that of generative phonology containing, for instance, distinctive features to describe the


properties of the phonemes. The distinctive features presently used in the singing synthesis system are listed in Table 1.

Table 1. Distinctive features used in the singing synthesis system.

ANT     Anterior, front articulation
BACK    Back tongue body
CONS    Consonantic
CONT    Continuant
COR     Coronal, raised tongue-tip
FRIC    Fricative
HIGH    High tongue body
LOW     Low tongue body
NAS     Nasal
OBST    Sound produced with a major obstruction in the vocal tract
ROUND   Rounded lips
SEG     Phonemes
TENSE   Long vowel
VOICE   Voiced sound
VOC     Vocalic

A rule adds one or more properties to an item, for instance, a vowel, provided it appears in a specific context. The property added can be computed from a note close to, or far ahead of, or far behind the note itself. For example, the last note of the melody can influence all notes preceding it.

Rule syntax:

X → Y / A & B

X is a description of the items concerned by the rule. Y expresses the change in X. A together with B constitute the context description. The & sign marks the place where X occurs in the context. If the strings A and B are empty, i.e., if no context is defined, the rule will be applied to all X, regardless of context.

The rules are applied according to the order in which they are listed in the rule file. Thus, the first rule processes the entire string of notes in the score file before the second rule is applied.

The rules are additive. This implies that each note may be processed by several rules and modifications will be added successively to the parameters of that note. It is possible to activate or deactivate each rule.
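The ordered, additive application of rules can be sketched as follows; the rule names, the dictionary representation of notes, and the activation mechanism are illustrative, not the RULSYS internals.

```python
# Minimal sketch of ordered, additive rule application.  Each rule
# scans the whole note sequence before the next rule runs, and its
# modifications add to those of earlier rules.
def apply_rules(notes, rules, active=None):
    for rule in rules:
        if active is not None and rule.__name__ not in active:
            continue                 # rules can be deactivated
        for i in range(len(notes)):  # whole string of notes per rule
            rule(notes, i)
    return notes

def lengthen_all(notes, i):
    # Illustrative rule: every note gains 10 ms of duration.
    notes[i]["dDR"] = notes[i].get("dDR", 0.0) + 10.0

def louder_final_tone(notes, i):
    # Illustrative rule: the last note of the melody gains 2 dB.
    if i == len(notes) - 1:
        notes[i]["dL"] = notes[i].get("dL", 0.0) + 2.0
```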

The system allows modifications of a number of synthesis parameters. Those relevant to the performance rules considered here are listed in Table 2.


Table 2. Synthesis parameters used in the described rules

Duration                                    DR [ms]
Voice source amplitude level (sound level)  L [dB]
Fundamental frequency                       F0 [Hz]
Formant frequencies                         F1-F5 [Hz]
Vibrato amplitude                           VA [%]
Vibrato frequency                           VF [Hz]

Performance rules

The performance rules described in this article are listed in Table 3. They are divided into three categories. One category contains musical rules which concern musical expression. Rules 1-10 originate from the Director Musices program developed for synthesis of instrumental performance. The corresponding rules used for singing synthesis are identical with these, except the first two rules, which induce larger effects on sound level, duration and vibrato amplitude. The second category of rules concerns the synthesis of consonants and vowels, including coarticulation effects. In the table, only rules with a general applicability have been included, while rules for how specific consonants and vowels are pronounced have been omitted. The third category contains rules for special singing techniques.

Table 3. Overview of the rules described in this article.

Rules                                              Affected synthesis parameters

A. Musical rules
1. Melodic charge                                  DR, L, VA
2. Harmonic charge                                 DR, L
3. Melodic intonation                              F0
4. The higher, the sharper                         F0
5. The higher, the louder                          L
6. Phrasing                                        DR
7. Inégales                                        DR
8. Durational contrast                             DR, L
9. Double duration                                 DR
10. Synchronisation rules for polyphonic singing   DR, L
11. Tone before and after a rest                   L
12. Repetition of tone                             L
13. Vibrato tail                                   VF
14. Swell on long tones (Optional)                 L, VA
15. Marcato (Optional)                             F0, L
16. Bull's roaring onset (Optional)                F0, L

Page 11: The KTH rule system for singing synthesis · The KTH rule system for singing synthesis Berndtsson, G ... Another project of direct relevance to the ... It was controlled by an Eclipse

STL-QPSR 111 995

B. Consonants and vowels
1. Duration of consonants and vowels
2. Timing of pitch change
3. Sound level of consonants
4. Diphthongs

C. Special singing techniques
1. Coloratura                                      F0, L
2. F0-tracking formants                            F1, F2, F3, F4
3. Overtone singing (Optional)                     F1, F2, F3

A. Musical rules

1. Melodic charge
In a given harmonic context the scale tones differ with respect to "remarkableness". For example, the root of the chord is an unremarkable scale tone, while the scale tone a minor second above the root is highly remarkable. A tone's remarkableness is reflected in terms of its melodic charge, using the root of the prevailing chord as the reference, see Figure 4. This figure also lists the absolute values of the melodic charge (CM) for tones in a C major or C minor context.


Fig. 4. Melodic charge of the scale tones in tonal music. To the right the absolute values are listed, and to the left the melodic charge values are defined by means of the circle of fifths. Note that the values are asymmetrically distributed around the reference, which is the root of the prevailing chord. The charge is negative in the left, subdominant half of the circle in the figure. In the melodic charge rule the absolute values are used. (After Friberg et al., 1987a).


a) Extra sound level, ΔL, and duration, ΔDR, are added in proportion to the tone's melodic charge, CM.

ΔL = 0.4 * CM [dB] (maximum 2.6 dB)
ΔDR = 4 * CM / 3 [%] (maximum 8.7 % of DR)

b) The ΔL distributed in 1a is smoothed for a tone initiating a major or a minor second, provided that the tone has a duration shorter than 500 ms and is equally long as the following tone. Let the index n denote the current tone, the index n-1 the preceding tone, and n+1 the following tone.

If the ΔL(n+1) distributed in 1a is less than 75 % of ΔL(n), then ΔL(n+1) = 0.75 * ΔL(n). If ΔL(n+1) < 0.5 * ΔL(n), then ΔL(n+1) = 0.55 * ΔL(n).

c) Extra vibrato amplitude, ΔVA, is added in proportion to CM. For tones with DR > 1000 ms,

ΔVA = 0.28 * CM [%] (maximum +1.8 %).

It is applied at the time 200 - (35.9 * CM) [ms]. Typically the vibrato amplitude reaches its maximum in the beginning of the tone, but for short tones with high CM the maximum may be reached somewhat earlier than the tone onset. To maintain the resulting VA value for the largest part of the tone, an identical VA value is inserted at 0.66 * DR.

In the Director Musices program the melodic charge values are the same as here, while the ΔL and ΔDR values in 1a are half as high. In Director Musices, the vibrato increase is ΔVA = 0.15 * ΔL [%], while in our rule 1c, ΔVA = 0.28 * CM, or ΔVA = 0.7 * ΔL, which is considerably greater. In addition, the vibrato onset time depends on CM in the singing synthesis program.
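The arithmetic of rules 1a and 1c can be collected in a short sketch; the function name is illustrative, and the CM values themselves come from Fig. 4.

```python
# Sketch of the melodic charge arithmetic (rules 1a and 1c) as stated
# in the text: level and duration increments proportional to CM, and
# extra vibrato amplitude for tones longer than 1000 ms.
def melodic_charge_deltas(cm, dr_ms):
    dL = min(0.4 * cm, 2.6)             # dB, capped at 2.6 dB
    dDR = min(4.0 * cm / 3.0, 8.7)      # % of DR, capped at 8.7 %
    dVA = min(0.28 * cm, 1.8) if dr_ms > 1000 else 0.0   # % vibrato
    vib_onset_ms = 200.0 - 35.9 * cm    # time at which dVA is applied
    return dL, dDR, dVA, vib_onset_ms
```

Note that the three stated maxima (2.6 dB, 8.7 %, 1.8 %) all correspond to CM = 6.5, the largest value on the melodic charge scale of Fig. 4.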

2. Harmonic charge
The remarkableness of chords in a given harmonic context is quantified in terms of the harmonic charge (CH). It is computed from the melodic charge of the chord tones as related to the root of the tonic. The chord's root, third, and fifth are denoted by the roman numerals I, III, and V.

a) The total sound level change during crescendo and decrescendo mirrors the difference in harmonic charge at the chord change.

Sound level increase ΔL at chord change:

ΔL = 2.25 * (CH)^(1/2) [dB]

To avoid too fast crescendos, only a portion of the ΔL is added when the crescendo time is short. Thus, if the time from one chord to the next, Q-length C, is shorter than 2500 ms, the ΔL is reduced by a factor C/2500.

Intermediate notes are given intermediate amplitudes. The level changes do not start earlier than 3 s ahead of chord change. The level decreases start immediately after the chord change and end at the next chord change.
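A sketch of rule 2a; the exponent in the printed formula is not legible in this copy, and a square root is assumed here, so the first line is an assumption while the C/2500 reduction follows the text.

```python
import math

def harmonic_charge_dL(ch, chord_gap_ms):
    """Sound-level change at a chord change (rule 2a, sketch).

    The exponent of CH is assumed to be 1/2 (illegible in this copy).
    The reduction for chord gaps shorter than 2500 ms follows the
    C/2500 factor stated in the text.
    """
    dL = 2.25 * math.sqrt(ch)           # dB, assumed exponent 1/2
    if chord_gap_ms < 2500:
        dL *= chord_gap_ms / 2500.0     # avoid too fast crescendos
    return dL
```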


4. The higher, the sharper
The fine tuning of all tones is stretched by 4 cents/octave relative to the equally tempered tuning. Similar stretching also occurs in, for instance, piano tuning.

5. The higher, the louder
The sound level of a tone is increased according to its pitch. The amount is 3 dB/octave.
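Rules 4 and 5 amount to simple per-octave offsets; the sketch below assumes A4 = 440 Hz as the reference tone, which the text does not specify.

```python
import math

REF_HZ = 440.0  # illustrative reference tone (assumption)

def higher_sharper_louder(f0_hz):
    """Sketch of rules 4 and 5: 4 cents/octave tuning stretch and
    3 dB/octave level increase relative to an assumed reference."""
    octaves = math.log2(f0_hz / REF_HZ)
    cents_offset = 4.0 * octaves        # rule 4: stretched tuning
    dL = 3.0 * octaves                  # rule 5: level increase [dB]
    f0_stretched = f0_hz * 2.0 ** (cents_offset / 1200.0)
    return f0_stretched, dL
```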

6. Phrasing
These rules mark subphrases, phrases, and the final tone of the melody. They operate on signs which can be added in the score file.

a) The last tone in a phrase is lengthened by 40 ms. The level starts to fall towards zero at 0.8 * DRM, where DRM is the modified duration.

b) The final tone of the piece is lengthened by an additional 40 ms.

c) The last tone in a subphrase is ended with a micropause, the level starting to fall towards zero at 80 ms before the end of the tone.

The perceptual effect of these phrase markers was studied in a listening test on instrumental music (Friberg et al., 1987b). The results showed that at a confidence level of 95% there was a significant preference for using the phrase rule in four out of six examples.


7. Inégales
This rule lengthens stressed and shortens unstressed beats in paired sequences of relatively short notes of equal nominal duration (< 400 ms). The duration of the stressed notes may be lengthened by 22 %. If so, the following unstressed note is shortened by the same number of ms. Similarly, the first note in the sequence is shortened, provided it appears in a metrically unstressed position. The metrically stressed positions are marked in the input notation with "+" before the first stressed note.

The term inégales originates from French baroque music. Inégales are also used in jazz music (Rose, 1989). According to results from a listening experiment, many music listeners prefer a tempo-dependent percentage (Friberg et al., 1994).

8. Durational contrast
Tones with 30 < DR < 600 ms are shortened, and their sound levels are decreased, depending on their durations. The values are listed in Table 5. As can be seen in the graphical representation in Figure 5, the values follow two breakpoint functions.

Table 5. Duration and sound level decreases depending on tone duration.

DR [ms]:     30      200      400     600
ΔDR [ms]:     0    -16.5    -10.5      0
ΔL [dB]:      0   -0.825   -0.525      0
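Assuming linear interpolation between the Table 5 breakpoints (the two breakpoint functions of Figure 5), the rule can be sketched as:

```python
# Breakpoints from Table 5: (DR [ms], dDR [ms], dL [dB]).
BREAKPOINTS = [(30, 0.0, 0.0), (200, -16.5, -0.825),
               (400, -10.5, -0.525), (600, 0.0, 0.0)]

def durational_contrast(dr_ms):
    """Duration and level decrease for a tone of nominal duration
    dr_ms; tones outside 30-600 ms are unaffected.  Linear
    interpolation between breakpoints is an assumption."""
    if not 30 < dr_ms < 600:
        return 0.0, 0.0
    for (x0, d0, l0), (x1, d1, l1) in zip(BREAKPOINTS, BREAKPOINTS[1:]):
        if x0 <= dr_ms <= x1:
            t = (dr_ms - x0) / (x1 - x0)
            return d0 + t * (d1 - d0), l0 + t * (l1 - l0)
```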


Fig. 5. Duration and amplitude changes according to the durational contrast rule. The duration decreases are marked on the left axis and the sound level decreases on the right axis.

9. Double duration
The rule reduces the durational contrast between adjacent notes in the durational context DR(n) = 0.5 * DR(n-1), DR(n+1) > DR(n), where index n refers to the short tone, index n-1 to the preceding tone, and n+1 to the following. In this context, a tone with DR(n) < 1000 ms is lengthened by 12 %, and the preceding tone is shortened by the same amount in ms:

ΔDR = 0.12 * DR(n) [ms]
ΔDR(n) = ΔDR [ms]
ΔDR(n-1) = -ΔDR [ms]
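A sketch of the double duration rule as described above; the tolerance used to detect the 2:1 context is an implementation assumption.

```python
# Sketch of rule 9: in the context DR(n) = 0.5 * DR(n-1) with
# DR(n+1) > DR(n), the short tone is lengthened by 12 % and the
# preceding tone shortened by the same number of ms.
def double_duration(dr):
    dr = list(dr)
    for n in range(1, len(dr) - 1):
        in_context = (abs(dr[n] - 0.5 * dr[n - 1]) < 1e-6  # 2:1 ratio
                      and dr[n + 1] > dr[n]
                      and dr[n] < 1000)
        if in_context:
            d = 0.12 * dr[n]
            dr[n] += d          # lengthen the short tone
            dr[n - 1] -= d      # shorten the preceding tone
    return dr
```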

The double duration rule has been corroborated by measurements of real performances in several analyses (Henderson, 1937; Gabrielsson, 1987).

10. Synchronisation rules for synthesising polyphonic singing
The MUSSE DIG synthesiser is monophonic, so there are no rules for polyphonic singing. On the other hand, the synthesiser has occasionally been used also for ensemble performances (Berndtsson & Sundberg, 1994). In such cases, the ensemble synchronisation rule of the Director Musices was used (Sundberg et al., 1989; Friberg, 1991). Therefore, this rule will be presented here.

As the rules alter the durations of the notes according to the musical context, various voices will receive different overall durations. Hence, a synchronisation rule is needed. A synchronisation voice is defined, constituted by the shortest note that occurs in the score at each instant. If two or more notes compete, the note having the highest melodic charge is chosen. The synchronisation voice is processed by all rules affecting tone duration, and then all voices are synchronised with it.


To create polyphonic singing with the singing synthesiser, the Director Musices program makes synchronised input files for the separate voices. These input files specify the duration of the tones in ms and the amplitude patterns for the tones, while the rest of the musical rules and the pronunciation rules are applied as usual. The separate voices are synthesised one by one using MUSSE DIG, and later added together in sample files.

11. Tones before and after a rest
A tone followed by a rest receives its default sound level at 60 ms before the end of the tone. Thus, in cases where the sound level has been increased during the tone, the sound level returns to the original value at DR - 60 [ms]. At the very end, L is set to zero.

A tone following a non-subphrase-ending rest receives a start value of 0.5 * L. The original L is restored at 0.05 * DR [ms].

12. Tone repetition
The sound level is reduced at the note boundary by a dip value (DV), depending on the duration (DR) in ms of the tone before the repeated tone.

If DR < 1700: DV = 4.5 * log(DR) [dB]
If DR > 1700: DV = 15 [dB]
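Read with a single 1700 ms breakpoint and a base-10 logarithm (both assumptions, since the printed thresholds disagree and the logarithm base is not stated), the dip value becomes:

```python
import math

def repetition_dip(dr_ms):
    """Sketch of rule 12: sound-level dip at a repeated note.

    The 1700 ms breakpoint and log10 are readings of a partly
    garbled passage, not certainties.
    """
    if dr_ms < 1700:
        return 4.5 * math.log10(dr_ms)   # dB
    return 15.0                          # dB
```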

13. Vibrato tail
The vibrato frequency (VF) is speeded up towards the end of a tone. This is referred to as the vibrato tail (Prame, 1994). The vibrato tail is modelled after measurements.

For tones with DR > 1200 ms:
VF is increased by 4 % at DR - 800,
VF is increased by 8 % at DR - 600,
VF is increased by 13 % at DR - 400,
VF is increased by 20 % at DR - 200.

14. Swell on long tones (Optional)
For tones with DR > 1200 ms, L increases to a maximum of L + 4 dB at 0.8 * DR. At the same time the vibrato amplitude increases by an additional ΔVA = 0.28 * (L - 45) [%].

15. Marcato (Optional)
Marcato is a kind of accent occurring as an ornament in singing. Acoustically, it corresponds to overshoots in downward pitch changes. Thus, the frequency drops too far, passes the target, and then slides up to the target. These events are typically accompanied by sound-level changes.

The first frequency value of a tone completing a descending interval is set to two semitones below the target, 60 ms before the tone onset. The nominal frequency is reached at the onset of the tone.

The sound level first increases to L+2 dB at 0.25 * DR and then decreases again by the same amount at 0.5 * DR.
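The marcato pitch and level contours can be sketched as two breakpoint lists; the function name and the two-list representation are our own:

```python
def marcato_breakpoints(f0_target_hz, dr_ms, level_db):
    """Rule 15 sketch: F0 starts two semitones below target 60 ms before
    onset and reaches the target at onset; the level rises to L + 2 dB
    at 0.25 * DR and falls back at 0.5 * DR."""
    undershoot = f0_target_hz * 2 ** (-2 / 12)   # two equal-tempered semitones down
    pitch = [(-60.0, undershoot), (0.0, f0_target_hz)]
    level = [(0.25 * dr_ms, level_db + 2.0), (0.5 * dr_ms, level_db)]
    return pitch, level
```

For a 440 Hz, 1000 ms tone at 50 dB this starts the glide near 392 Hz, with the level peaking at 52 dB at 250 ms.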


16. A bull's roaring onset (Optional)
Quick ascending pitch changes at certain tone onsets are sometimes used as an ornament in singing. We have used the semi-serious term bull's roaring onset for this expressive effect.

At the first note after a rest, MUSSE DIG starts the vowel onset 11 semitones below the target, which is reached after 50 ms. Simultaneously, the sound level is increased from zero to target.
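The onset glide amounts to a short two-point F0 ramp; a sketch, with the breakpoint representation and function name our own:

```python
def bulls_roaring_onset(f0_target_hz):
    """Rule 16 sketch: at the first note after a rest, the vowel starts
    11 semitones below the target and reaches it after 50 ms, while the
    sound level (not modelled here) rises from zero."""
    start_hz = f0_target_hz * 2 ** (-11 / 12)
    return [(0.0, start_hz), (50.0, f0_target_hz)]
```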

B. Consonants and vowels

1. Consonant and vowel durations
A basic aspect of synthesising singing is the timing of consonants. Experience from synthesis of singing has demonstrated that, to avoid rhythm errors, it is necessary to subtract the duration of all consonants from that of the preceding vowel (Sundberg, 1994). Thus, as might be expected, the same principle is applied in singing as in scanned reading of poems (Rapp-Holmgren, 1971).

A consonant after a rest or a short vowel is lengthened. The lengthening amounts to 0.1 * DR of that vowel or rest. The rule is not applied to the consonants p, t, or k. If there is more than one consonant after the short vowel, the extra DR is shared by all the consonants in proportion to their original lengths.
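The proportional sharing of the extra duration can be sketched as follows; a minimal illustration in Python, with names of our own choosing (the p/t/k exemption is left to the caller):

```python
def lengthen_consonants(consonant_durs_ms, prev_dur_ms):
    """Sketch of the lengthening rule: the extra duration 0.1 * DR of
    the preceding rest or short vowel is distributed over the following
    consonants in proportion to their original lengths."""
    extra = 0.1 * prev_dur_ms
    total = sum(consonant_durs_ms)
    return [d + extra * d / total for d in consonant_durs_ms]
```

For a 200 ms short vowel followed by consonants of 60 and 40 ms, the 20 ms of extra duration is split 12/8, giving 72 and 48 ms.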

Consonants in a cluster are shortened according to

where DR_K is the duration of the consonant cluster and DR_V(n-1) is the duration of the preceding vowel.

2. Timing of pitch change
If the pitch changes are not completed at vowel onsets, the synthesis sounds strange. By giving a consonant the same F0 value as the succeeding vowel, the new pitch is approached during the consonant and the target pitch is reached at vowel onset.

3. Sound level of consonants
If a voiced consonant is succeeded by a vowel, it receives the first sound-level value of that vowel. Otherwise the consonant is given the last sound-level value of the preceding vowel.

4. Diphthongs
The first vowel receives 65%, and the second vowel 35%, of the duration of the diphthong. These relations may be style dependent.
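Since the text notes the ratio may be style dependent, a sketch can keep the first vowel's share as a parameter (the function name and parameterisation are ours):

```python
def diphthong_split(dr_ms, first_share=0.65):
    """Rule 4 sketch: split a diphthong's duration between its two
    vowels, defaulting to the 65% / 35% relation given in the text."""
    first = dr_ms * first_share
    return first, dr_ms - first
```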


C. Special singing techniques

1. Coloratura
Coloratura is a term used for a rapid succession of short notes sung without interspersed consonants. Physiologically, it is performed by means of synchronised undulations of fundamental frequency and pressure (Leandersson et al., 1987). Acoustically, the fundamental frequency (F0) encircles the target values by a rising-falling curve (Fig. 6).

The rules are applied to sequences of notes of almost equal duration (0.8 < DR_n / DR_(n-1) < 1.2), where DR_n and DR_(n-1) are the durations of two consecutive tones. Other conditions are that DR < 160 ms, and no interspersed consonants are allowed. The first F0 value is set a semitone above the target. If the note is succeeded by a higher note, F0 is set to one semitone below target at 0.5 * DR. If the note is succeeded by a lower note, F0 is set to one semitone below the next tone's target at 0.5 * DR. This note may also be the first note after the coloratura sequence.

If a note is both preceded and succeeded by a note with lower pitch, i.e., if it is the turning point of an ascending/descending melodic line, the first F0 value is set two semitones above the target. In the opposite case, at the turning point of a descending/ascending melodic contour, the first F0 value is set two semitones below the target. The amplitude is reduced by 2 dB at the beginning of a tone and is back to normal at 0.5 * DR. At the end it is set to 2 dB less than the amplitude of the following tone.
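The choice of the first F0 value for a note inside a coloratura sequence can be sketched as follows; the three-argument interface and function name are our own:

```python
SEMITONE = 2 ** (1 / 12)  # equal-tempered semitone ratio

def coloratura_first_f0(prev_hz, target_hz, next_hz):
    """Coloratura sketch: first F0 value for a note in the sequence.
    Turning points get a two-semitone offset; otherwise the note starts
    one semitone above its target."""
    if prev_hz is not None and next_hz is not None:
        if prev_hz < target_hz and next_hz < target_hz:
            return target_hz * SEMITONE ** 2   # upper turning point
        if prev_hz > target_hz and next_hz > target_hz:
            return target_hz / SEMITONE ** 2   # lower turning point
    return target_hz * SEMITONE                # default case
```

The mid-note dip to one semitone below target at 0.5 * DR, and the 2 dB amplitude reduction, would be added as further breakpoints in a full implementation.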

Fig. 6. The principle used for F0 control in synthesising coloratura singing. The dashed lines show the fundamental frequency control signal (before smoothing). The horizontal straight lines represent the target F0 values. The solid curve approximates the curve after smoothing (Sundberg, 1987b).

2. F0-tracking formants
In high tones, F0 may surpass the first formant frequency (F1). By varying appropriate articulatory parameters, such as increasing the jaw opening, F1 can be raised to a frequency slightly higher than F0 (Sundberg, 1975).


For tones with F1 < F0 + 1 semitone, F1 is raised to F0 + 1 semitone. A back vowel with less than an octave between F1 and F2 gets an F2 value one octave above F1. For F0 above the pitch of G3, a vowel not possessing the feature "BACK" receives a reduction of F2 according to Figure 7. Also, when F0 is above G4, F3 is reduced and F4 is increased according to the same figure.
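The F1/F2 part of the rule can be sketched as below; the pitch-dependent F2-F4 adjustments read from Figure 7 are omitted, and the names are our own:

```python
SEMITONE = 2 ** (1 / 12)  # equal-tempered semitone ratio

def track_formants(f0_hz, f1_hz, f2_hz, is_back_vowel):
    """F0-tracking sketch: keep F1 at least one semitone above F0, and
    for a back vowel keep F2 at least an octave above F1."""
    f1 = max(f1_hz, f0_hz * SEMITONE)
    if is_back_vowel and f2_hz < 2 * f1:
        f2_hz = 2 * f1
    return f1, f2_hz
```

For a back vowel with F1 = 500 Hz and F2 = 900 Hz sung at F0 = 700 Hz, F1 is raised to about 742 Hz and F2 to an octave above it; at low F0 the formants pass through unchanged.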

Fig. 7. F0-tracking formants. Formant frequencies estimated from a professional soprano's singing of the vowels indicated, plotted against phonation frequency (Hz). The lines represent an idealised approximation of the data. The leftmost vowel symbols refer to the subject's speech. (After Sundberg, 1987a).

3. Overtone singing (Optional)
Overtone singing is a special singing technique used in some parts of Asia (especially Mongolia and Tibet) and, for instance, by the Harmonic Choir from New York (Ellington, 1970; Smith et al., 1967). A particular single overtone can be made clearly audible by tuning the 2nd and 3rd formants to a frequency very close to that of the overtone. In the synthesis, a useful distance between these formants has been found to be about 100 Hz. For enhancing low overtones of a tone with a fundamental near 300 Hz, it is appropriate to cluster F1 and F2 instead. The acoustics and perception of overtone singing have been examined by Bloothooft et al. (1992).
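The formant clustering can be sketched as placing two formants about 100 Hz apart around the partial to be enhanced; the symmetric placement and the names are our assumptions:

```python
def overtone_formant_pair(f0_hz, partial_number, spacing_hz=100.0):
    """Overtone-singing sketch: return the two formant frequencies
    (normally F2/F3; F1/F2 for low partials of a ~300 Hz fundamental)
    straddling the selected partial at the given spacing."""
    partial_hz = f0_hz * partial_number
    return partial_hz - spacing_hz / 2, partial_hz + spacing_hz / 2
```

Enhancing the 10th partial of a 150 Hz fundamental, for example, places the formant pair at 1450 and 1550 Hz.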


Discussion
Obviously, both the human voice and musical instruments can be used to produce musical performances. Hence, some general rules in the singing program are exactly the same as those used in the Director Musices program for instrumental music. However, there are great differences between the voice and musical instruments and how they are used in performances. These differences necessitate different rules. For instance, the rule for tone repetition in the Director Musices program inserts a micropause, while the singing synthesis uses a reduction of the sound level. The extreme acoustic flexibility of the singing voice is used not only for pronunciation of the text, but also for musical expressivity. Even though the overall goal, in developing both the Director Musices and the singing synthesis programs, was to produce a suitable amount of deviations from the score, some rule quantities needed to be larger in the singing synthesis than in the Director Musices program. Also, in sung performances, additional rules have been developed for special singing techniques.

In the present version of Director Musices, an additional rule is included for phrasing, using arched tempo curves. This seems relevant also to singing (Sundberg et al., 1995). Therefore, this phrasing rule will be implemented in the singing synthesis program.

In the case of the Director Musices program, several listener experiments have been carried out concerning both rule thresholds and preferred rule quantities (Sundberg et al., 1991b). Also, some rules, for instance double duration, have been independently verified by measurements. For the singing synthesis, all rule quantities have been determined by informal listening tests, except for the vibrato tail rule (Prame, 1994). Therefore, the rule quantities must be considered preliminary. Nevertheless, formal listening experiments concerning singing have been carried out for the purpose of testing different hypotheses, such as the significance of the centre frequency of the singer's formant to voice classification, and the acceptability of matching formants to partials in scales (Carlsson & Sundberg, 1992; Berndtsson & Sundberg, in print).

Performances of the same piece may differ considerably. Different singers may use quite different expressive means, depending on personal taste, but many of these differences can be explained if musicians vary rule quantities. Also, performance differences may result from the performance conditions. For instance, the amount of melodic intonation used is likely to depend on the accompaniment; for songs accompanied by an instrument with fixed pitches, such as a piano, the melodic intonation would be less exaggerated than in unaccompanied singing.

The present rules are mostly made for classical singing, and other musical styles may require different rules. For instance, for fundamental frequency variations some singers use vibrato, while other singers use a random pitch variation, or flutter. Diphthongs seem to be produced differently by opera and, for example, rock singers; the second vowel seems to be shorter in opera singing. Also, classically trained singers usually sing shorter consonants than other singers, such as jazz singers, who often sing on voiced consonants.


Even though the same voice organ produces both speech and singing, there are great differences between them. For speech synthesis, intelligibility is much more important than timbral naturalness. For singing synthesis, on the other hand, the timbre is a prime concern. Unnaturalness cannot be accepted in synthesised singing. Therefore, it has been necessary to spend a great amount of work on timbre and naturalness in developing the singing synthesis. In particular, the higher formants are much more significant in singing. Different voice classifications and singer categories use different formant frequencies. For instance, male opera singers and altos use a singer's formant. The centre frequency of this formant cluster differs depending on voice classification. The wide pitch range used in singing entails another important difference between synthesis of speech and singing; at high pitches, singers tend to increase the first formant to a frequency above F0. Other differences between singing and speech are that vibrato is used in singing, but not in speech, and that the duration of the vowels is usually longer in singing. For singing, the pitches and the durations of the notes are given in the score, and the articulation of the lyrics has to be adjusted to the given conditions. All these factors indicate the need for different rules for speech and singing.

The conditions for developing singing synthesis are heavily dependent on the analysis means available. Sonagraph analysis allows quick comparison of synthesis output and a real singer's voice. By careful matching, a quite realistic quality can mostly be achieved. As a consequence, our singing synthesis system represents a valuable tool for investigations of perceptual aspects of singing.

The work with synthesising singing has been carried out in our department by many researchers over a long period of time. An incoherent collection of rules resulted, which were sometimes even in conflict with each other. A major part of the author's work with synthesising singing has been to organise the rules into a coherent system. To facilitate understanding and implementation of future improvements, comments have been inserted. Rules have been provided with switches, which allow easy activation and deactivation of rules. In some rules, faucets have been implemented, which make it easy to change rule quantities.

Nevertheless, the synthesis can of course be improved in many respects. The generality of pronunciation rules should be increased by trying them on new examples. A more subtle control of the voice source could be implemented to portray different phonation modes of probable relevance to musical expression. For synthesis of other singing techniques, such as jazz or kolning (Johnsson et al., 1985), the rule system needs to be expanded.

The two programs for synthesis of musical performance are presently radically different with regard to implementation and computer environment. The singing synthesis is implemented in RULSYS on a PC, while the Director Musices is implemented in Common Lisp on a Macintosh. These differences of course lead to some disadvantages in the development and implementation of new rules. At the same time, it is also advantageous to retain the RULSYS environment for the singing synthesis, which permits synergy effects with the speech synthesis work carried out at the department.


The singing synthesis system is a model of a performing singer, thus allowing certain tentative conclusions regarding principles applied in musical communication. Such principles seem to include the performers' facilitation of the listener's differentiation of tone categories and identification of groups of tones. This has previously been discussed in relation to synthesis of instrumental performances (Sundberg et al., 1991a). These principles seem to be valid also for singing. Thus, our experiences from synthesising singing have provided reasons to assume that singers and instrument players alike are required to help the listeners' processing of the sound flow, by facilitating differentiation of tone categories, and identification of groups of tones and sounds.

Acknowledgements
Johan Sundberg has contributed with editorial assistance for this manuscript.

This work has been supported by the Swedish Research Council for Engineering Sciences (TFR).

References
Berndtsson G & Sundberg J (1994). The MUSSE DIG singing synthesis, KTH, Stockholm (Baritone and Bass). In: Friberg A, Iwarsson J, Jansson E & Sundberg J, eds, Proc of Stockholm Music Acoustics Conference, SMAC 93, Stockholm: Royal Swedish Academy of Music No 79: 279-281.

Berndtsson G & Sundberg J. Perceptual significance of the centre frequency of singer's formant, Scandinavian Journal of Logopedics and Phoniatrics. (In print).

Bloothooft G, Bringmann E, van Cappelen M, van Luipen JB & Thomassen KP (1992). Acoustics and perception of overtone singing, The Journal of the Acoustical Society of America, 92/4: 1827-1836.

Carlsson (now Berndtsson) G (1988). The KTH program for synthesis of singing. M Sc Thesis, Dept of Speech Communication and Music Acoustics, KTH, Stockholm.

Carlsson (now Berndtsson) G & Neovius L (1990). Implementations of synthesis models for speech and singing. STL-QPSR, KTH, 2-3: 63-67.

Carlsson (now Berndtsson) G, Ternstrom S, Sundberg J & Ungvary T (1991). A new digital system for singing synthesis allowing expressive control. In: Proc of the International Computer Music Conference, Montreal, 315-318.


Carlsson (now Berndtsson) G & Sundberg J (1992). Formant frequency tuning in singing. Journal of Voice 6/3: 256-260.

Carlson R & Granstrom B (1975). A phonetically oriented programming language for rule description of speech. In: Fant G, ed, Speech Communication, Stockholm: Almqvist & Wiksell, 2: 245-253.

Carlson R, Granstrom B & Hunnicutt S (1982). A multi-language text-to-speech module. In: Proc of ICASSP-Paris, 3: 1604-1607.


Carlson R, Granstrom B & Hunnicutt S (1991). Multilingual text-to-speech development and applications. In: Ainsworth AW, ed, Advances in speech, hearing and language processing, London: JAI Press, UK.

Ellington T (1970). The technique of chordal singing in the Tibetan style. Am Anthropologist, 72/4: 826-831.

Fant G (1959). Acoustic analysis and synthesis of speech with applications to Swedish. Ericsson Technics 15/1: 1-106.

Friberg A, Sundberg J, Fryden L (1987a). Rules for automatized performance of ensemble music. STL-QPSR, KTH, 4: 57-78.

Friberg A, Sundberg J, Fryden L (1987b). How to terminate a phrase. An analysis-by- synthesis experiment on a perceptual aspect of music performance. In: Action and Perception in Rhythm and Music. Stockholm: Royal Swedish Academy of Music No 55: 49-55.

Friberg A (1991). Generative rules for music performance: A formal description of a rule system. Computer Music Journal, 15/2: 56-71.

Friberg A & Sundberg J (1994a). Just noticeable difference in duration, pitch and sound level in a musical context. In: Deliège I, ed, Proc of 3rd International Conference for Music Perception and Cognition, Liège, 1994, 339-340.

Friberg A & Sundberg J (1994b). Perception of just noticeable time displacement of a tone presented in a metrical sequence at different tempos. In: Friberg A, Iwarsson J, Jansson E & Sundberg J, eds, Proc of Stockholm Music Acoustics Conference, SMAC 93, Stockholm: Royal Swedish Academy of Music No 79: 39-43.

Friberg A, Sundberg J & Fryden L (1994). Recent musical performance research at KTH. In: Sundberg J, ed, Proc of the Aarhus symposium on generative grammars for music performance. Stockholm: Dept of Speech Communication and Music Acoustics, 7-11.

Gabrielsson A (1987). Once again: The theme from Mozart's piano sonata in A major (K.331). In: Action and Perception in Rhythm and Music. Stockholm: Royal Swedish Academy of Music No 55: 49-55.

Henderson MT (1937). Rhythmic organization in artistic piano performance. In: Studies in the Psychology of Music, Iowa City, Iowa: University of Iowa Studies, IV: 281-306.

Johnsson A, Sundberg J & Wilbrand H (1985). "Kolning". Study of phonation and articulation in a type of Swedish herding song. In: Askenfelt A, Felicetti S, Jansson E & Sundberg J, eds, Proc of SMAC 83, Stockholm: Royal Swedish Academy of Music, No 46/1, 187-202.

Kaegi W & Templaars S (1978). VOSIM - A new sound synthesis system. Journal of the Audio Engineering Society, 26: 418-425.

Larsson B (1977). Music and singing synthesis equipment (MUSSE). STL-QPSR, KTH, 1: 38-40.

Leandersson R, Sundberg J & von Euler C (1987). Role of diaphragmatic activity during singing: A study of transdiaphragmatic pressures. Journal of Applied Physiology 62/1: 259-270.

Malmgren J (1978). PIGG, Musse interface unit. M Sc Thesis, Dept of Speech Communication, KTH, Stockholm.


Pabon P (1994). A real time voice synthesizer (Alto). In: Friberg A, Iwarsson J, Jansson E & Sundberg J, eds, Proc of Stockholm Music Acoustics Conference, SMAC 93, Stockholm: Royal Swedish Academy of Music No 79, 288-293.

Ponteus J (1979). MIMMI. M Sc Thesis. Dept of Speech Communication, KTH, Stockholm.

Prame E (1994). Measurements of the vibrato rate of ten singers. The Journal of the Acoustical Society of America, 96: 1979-1984.

Rapp-Holmgren K (1971). A study of syllable timing. STL-QPSR, KTH, 1: 14-19.

Rose RF (1989). An analysis of timing in jazz rhythm section performance. Dissertation at the University of Texas at Austin.

Smith H, Stevens KN & Tomlinson RS (1967). On an unusual mode of chanting by certain Tibetan lamas. The Journal of the Acoustical Society of America, 41/5: 1262-1264.

Sundberg J (1975). Formant technique in a professional female singer. Acustica 32: 89-96.

Sundberg J (1987a). The science of the singing voice. Decalb, Illinois: Northern Illinois University Press.

Sundberg J (1987b). Synthesis of singing. In: Acreman C, ed, Musica e technologia: Industria e cultura per lo sviluppo del mezzogiorno, 145-161.

Sundberg J (1989). Synthesis of singing by rule. In: Mathews M & Pierce J, ed, Current Directions in Computer Music Research, Massachusetts: MIT Press, 45-55.

Sundberg J, Friberg A & Fryden L (1989). Rules for automated performance of ensemble music. Contemporary Music Review, 3: 89-110.

Sundberg J, Friberg A & Fryden L (1991a). Common secrets of musicians and listeners: An analysis-by-synthesis study of musical performance. In: Howell P, West R & Cross I, eds, Representing Musical Structure, London: Academic Press, 161-197.

Sundberg J, Friberg A & Fryden L (1991b). Threshold and preferred quantities of rules for music performance, Music Perception, 9/1: 71-92.

Sundberg J (1994). Information Technology and Music. CD ROM, Royal Swedish Academy of Engineering Sciences and KTH, Stockholm.

Sundberg J, Iwarsson J & Hagegård H (1995). A singer's expression of emotions in sung performance. In: Fujimura O & Hirano M, eds, Vocal Fold Physiology: Voice Quality Control, San Diego: Singular Press, 217-229.

Ternstrom S & Friberg A (1989). Analysis and simulation of small variations in the fundamental frequency of sustained vowels. STL-QPSR, KTH, 3: 1-14.

Thompson WF, Sundberg J, Friberg A & Fryden L (1989). The use of rules for expression in the performance of melodies. Psychology of Music, 17: 63-82.


Zera J, Gauffin J & Sundberg J (1984). Synthesis of selected VCV-syllables in singing. In: Buxton W, ed, Proc of the International Computer Music Conference, IRCAM, Paris, 83-86.