
VocaRefiner: An Interactive Singing Recording System with Integration of Multiple Singing Recordings

Tomoyasu Nakano, Masataka Goto
National Institute of Advanced Industrial Science and Technology (AIST), Japan

{t.nakano, m.goto}[at]aist.go.jp

ABSTRACT

This paper presents a singing recording system, VocaRefiner, that enables a singer to make a better singing recording by integrating multiple recordings of a song he or she has sung repeatedly. It features a function called clickable lyrics, with which the singer can click a word in the displayed lyrics to start recording from that word. Clickable lyrics facilitate efficient multiple recordings because the singer can easily and quickly repeat recordings of a phrase until satisfied. Each of the recordings is automatically aligned to the music-synchronized lyrics for comparison by using a phonetic alignment technique. Our system also features a function, called three-element decomposition, that analyzes each recording to decompose it into three essential elements: F0, power, and spectral envelope. This enables the singer to select good elements from different recordings and use them to synthesize a better recording by taking full advantage of the singer's ability. Pitch correction and time stretching are also supported so that singers can overcome limitations in their singing skills. VocaRefiner was implemented by combining existing signal processing methods with new estimation methods for achieving high-accuracy robust F0 and group delay, which we propose to improve the synthesized quality.

1. INTRODUCTION

When singers perform live in front of an audience, they only have one chance. If they forget the lyrics or sing out of time with the accompaniment, these mistakes cannot be corrected, though singing out of tune could be fixed by using real-time pitch correction (e.g., Auto-tune or [1]). However, when vocals are recorded in a studio setting, the situation is quite different. Many attempts, or "takes", at singing the entire song, or sections within it, can be recorded. Indeed, if time and cost are not an issue, this process can continue until either the singer or someone else (e.g., a producer or recording engineer) is completely satisfied with the performance. The vocal track which eventually appears on the final recording is often reconstituted from different sections of various takes and, to a greater and greater degree, subjected to automatic pitch correction

Copyright: © 2013 Tomoyasu Nakano et al. This is an open-access article distributed under the terms of the Creative Commons Attribution 3.0 Unported License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Figure 1. A comparison of VocaRefiner with the standard recording and editing procedure. (Left, standard: search for and decide when to start singing by listening and seeing waveforms; store multiple recordings; choose the recording used to build up the singing recording; integrate by cut-and-pasting waveforms, with pitch correction and time-stretching; all recordings which are not chosen are simply discarded. Right, VocaRefiner: record with clickable lyrics; visualize singing properties via phonetic alignment; integrate by choosing the F0, power, and timbre element at each phoneme and synthesizing (recomposition), with pitch correction and time-stretching.)

(e.g., Auto-tune) to "fix" any notes which are sung out of tune. What is left over at the end of this process is simply discarded, as it is of no further use. This "standard" process of recording the singing voice is summarized in the left side of Figure 1.

Although this procedure for recording and editing vocals is widespread, it has some drawbacks. First, it is extremely time-consuming to manually listen through multiple takes and subjectively determine the "best" parts to be saved for the final version. Second, the manipulation of multi-track waveforms through "cut-and-paste" and the use of pitch correction software require specialist technical knowledge which may be too complex for the amateur singer recording music at home.

To address these shortcomings, we have developed an interactive singing recording system called VocaRefiner, which lets a singer make multiple recordings interactively and edit them while visualizing an analysis of the recordings. VocaRefiner has three functions (shown in the right side of Figure 1) which are specialized for recording, editing, and processing singing recordings.

1. Interactive recording with clickable lyrics: This allows a singer to immediately navigate to the part of the song he or she wants to sing without the need to visually inspect the audio waveform.

2. Visualization of singing analysis: This enables the singer to see an analysis of the recorded singing which captures three essential elements of the singing voice: F0 (pitch), power (loudness), and spectral envelope (voice timbre).

3. Integration by recomposition and manipulation: This allows the singer to select elements among multiple recordings at the phoneme level and recombine them to synthesize an integrated result. In addition to the direct recombination of phonemes, VocaRefiner also has pitch-correction and time-stretching functionality to give the user even more control over their performance.

The use of these three functions draws out the latent potential of existing singing recordings to the greatest degree possible, and enables the amateur singer to use advanced technologies in a manner which is intuitive and which enhances the creative possibilities of music creation through singing.

The remainder of this paper is structured as follows. In Section 2 we present an overview of the main motivation, the target users for VocaRefiner, and the originality of this study. In Section 3 we describe VocaRefiner's functionality and usage. The signal processing back-end which drives VocaRefiner is described in Section 4, along with results on the performance of the F0 detection method. In Section 5 we discuss the role and potential impact of VocaRefiner in the wider context of singing, and finally, in Section 6 we summarize the key outcomes of the paper.

We provide a website with video demonstrations of VocaRefiner at http://staff.aist.go.jp/t.nakano/VocaRefiner/.

2. VOCAREFINER: AN INTERACTIVE SINGING-RECORDING SYSTEM

This section describes the goal of our system and the shortcomings of standard approaches. To achieve the goal and overcome these shortcomings, we then propose the original solutions of VocaRefiner.

2.1 Goal of VocaRefiner

The aim of this study is to enable amateur singers recording music in their homes to create high-quality singing recordings efficiently and effectively. Many amateur singers have recently started making personal recordings of songs and have uploaded them to video- and audio-sharing services on the web. For example, over 600,000 music video clips including singing recordings by amateur singers have been uploaded to the most popular Japanese video-sharing service, Nico Nico Douga (http://www.nicovideo.jp). There are many listeners who enjoy such amateur singing, as illustrated by the fact that, as of April 2013 on Nico Nico Douga, over 4,250 video clips by amateur singers had received over one hundred thousand page views, over 190 video clips had more than one million page views, and the top five video clips had more than five million page views.

In Japanese culture, it is common for singers not to show their faces in video clips. In this way, their recordings can be appreciated purely on the quality of the singing. In fact, some amateur singers have become very well known just by their voices and have released commercially available compact discs through recording companies. This is a new kind of culture for music creation and appreciation, driven by the massive influx of user-generated content (UGC) on web services like Nico Nico Douga.

This creates a need and demand for making personal singing recordings at home. Most amateur singers record their singing voice at home without help from other people (e.g., studio engineers). To fully produce the recordings, they must complete the entire process shown in the left side of Figure 1 by themselves. To create high-quality singing recordings, singers typically use traditional recording software or a digital audio workstation on a personal computer to record multiple takes of their singing, again and again until they are satisfied. They then cut-and-paste multi-track waveforms and sometimes use pitch correction software (e.g., Auto-tune). This traditional approach is inefficient and time-consuming, and requires specialist technical knowledge which may be a barrier for some would-be singers. We therefore study a novel recording system specialized for personal singing recording. Our eventual goal with this work is to facilitate and encourage even greater numbers of singers to create vocal recordings with better control, and to actively participate in UGC music culture.

2.2 Originality of VocaRefiner

In this paper we present an alternative to the standard approach of recording the singing voice by providing a novel interactive singing recording system, VocaRefiner. It has an original, efficient, and effective interface based on visualizing an analysis of the singing voice and driven by signal processing technologies. We propose a novel use of the lyrics to specify when to start the singing recording, and also propose an interactive visualization and integration of multiple recordings.

Although lyrics have already been clickable on some music players [2], they have only allowed users to change the playback position for listening. VocaRefiner presents a novel use of lyrics alignment for recording purposes.

Multiple recordings have also not been fully utilized for integration into the final high-quality recording, with most recordings being simply discarded if they are not explicitly selected. For example, recordings with good lyrics but incorrect pitch, and recordings with correct pitch but a mistake in the lyrics, generally cannot be used in the final recording. However, VocaRefiner can make full use of bad recordings that would otherwise be discarded in the standard approach.

Although there has not been much research into assistance for singing recording, some studies exist on visualizing an analysis of singing voices to improve singing skills [3, 4].


Figure 2. An example VocaRefiner screen. The recordings are displayed as rectangles. (Ⓐ: recording/integration mode selection; Ⓑ: clickable lyrics; Ⓒ: key transposition and the play-rec/play button; Ⓓ: recordings, e.g., the 1st, 6th, and 11th, accumulated with time synchronization to the accompaniment, with zoom and viewpoint controls; Ⓔ: F0, power, and phonetic alignment of each recording, with voice timbre changes and the elements selected for integration; additional controls load and save data, play a recording, and synthesize a recording by integration.)

Singing analysis has also been used for other purposes, such as time-stretching based on the phase vocoder [5], voice conversion [6], and voice morphing [7]. However, we believe that no other research currently exists which deals with both the analysis and integration of multiple singing recordings as in VocaRefiner.

3. INTERFACE OF VOCAREFINER

The VocaRefiner system, shown in Figure 2, is built around components which encapsulate the following three functions:

1. Interactive recording by clickable lyrics

2. Visualization by automatic singing analysis

3. Integration by recomposition and manipulation

These functions can be used within the two main modes of VocaRefiner, "recording mode" and "integration mode", which are selected using button Ⓐ in Figure 2. In recording mode, the user first selects the target lyrics of the song they wish to sing (which can currently be in English or Japanese, marked Ⓑ) and loads the musical accompaniment.

To facilitate the alignment of lyrics with music and the clickable-lyrics functionality, the representation of the lyrics must be richer than a simple text file containing the words of the song. It must also contain timing information, where each word has an associated onset time, and the lyrics must also include the pronunciation of each word.

It is possible to estimate this information automatically; however, this process can produce some errors which require manual correction. Given a normal text file of lyrics, we therefore automatically convert it into the VocaRefiner format and then manually correct any errors.

The accompaniment can include a synthesized guide melody or vocal (e.g., prepared by a singing synthesis system) to make it easier for the user to sing along with the lyrics. In the case where the user is recording a cover version of an existing song, they can include the original vocal of the song for this purpose.

If the user is unable to sing the song in the original key, they can make use of a transposition function (marked Ⓒ) to shift the accompaniment to a more comfortable range.
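As an illustration, one entry of such a timing- and pronunciation-annotated lyrics representation could look like the following minimal sketch (the field names and layout are our assumptions, not VocaRefiner's actual file format; the onset times are invented):

```python
from dataclasses import dataclass

@dataclass
class LyricWord:
    """One word of the lyrics, enriched with timing and pronunciation."""
    text: str            # the word as displayed to the singer
    onset_sec: float     # onset time within the accompaniment
    phonemes: list[str]  # phoneme-level pronunciation

# "I think ..." annotated for clickable lyrics and phonetic alignment.
lyrics = [
    LyricWord("I",     12.40, ["ay"]),
    LyricWord("think", 12.85, ["th", "ih", "ng", "k"]),
]
```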

3.1 Interactive Recording with Clickable Lyrics

The clickable lyrics function, which is built around the time-synchronization of lyrics to audio (described in Section 4.1), enables a singer who makes a mistake in the pitch or lyrics to start singing that part again immediately. Such seamless re-recording can offer a new avenue for recording singing, in particular for the amateur singer recording at home. One case where this could be particularly useful is when attempting to sing the first note of a song, where it can be hard to hit the right note straight away. Using clickable lyrics, the singer can repeat the phrase they wish to sing, recording each version until they are happy they have it right. By recording vocals in this way, a singer could also easily try different styles of singing the same phrase (storing each one aligned to the accompaniment), which could help them to experiment more in their singing style.

Because the lyrics and music are synchronized in time, when the singer clicks the lyrics, the accompaniment is played back over headphones (to prevent recording the accompaniment along with the vocal) from the specified time, and the voice sung by the user is recorded in time with the accompaniment. In addition, if the singer only wants to sing a particular section of the song, this section can be selected using the mouse.

The recording process can also be started by clicking the "play-rec" button indicated by the red triangle (close to Ⓒ) or by using the mouse to drag the slider located to the right of the button.

With this functionality, the clickable lyrics component can facilitate the efficient recording of multiple takes, where a singer can repeat an individual phrase over and over until they are satisfied. In this way, our work extends existing work on lyrics-audio synchronization [2], which has, up until now, only been applied to playback systems that cannot record and align singing input.
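For illustration, the core click-to-record behavior might be sketched as follows (building on the `lyrics` structure above; `play_accompaniment`, `record_microphone`, and `store_take` are hypothetical functions standing in for an audio I/O layer whose real API is not described in the paper):

```python
def on_lyric_click(word_index: int, lyrics: list) -> None:
    """Start accompaniment playback and recording at the clicked word."""
    start = lyrics[word_index].onset_sec
    # The accompaniment goes to headphones only, so the microphone
    # captures just the voice, time-aligned to the accompaniment.
    play_accompaniment(from_sec=start, output="headphones")
    take = record_microphone()            # runs until the singer stops
    store_take(take, offset_sec=start)    # every take is kept, time-aligned
```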

3.2 Visualization by Automatic Singing Analysis

Two types of visualization are implemented in VocaRefiner. The first addresses the timing information of multiple recordings. Each separate recording is indicated by a rectangle displayed at Ⓓ in Figure 2, whose length indicates its duration. The rectangles of multiple recordings, which appear stacked on top of one another, can be used to see which parts of a song were sung many times, and can be useful for singers to find challenging parts requiring additional practice.


Figure 3. Selecting voice elements to integrate. (For each phoneme, the F0, power, or timbre trajectory of any take can be selected by dragging: e.g., the F0 at each phoneme of a take sung with good F0 is combined with the power and timbre at each phoneme of a take containing F0 mistakes.)

The second visualization shows the results of analyzing the singing recordings. This analysis takes place immediately after each recording has been made. First, the recording is automatically aligned to the lyrics via the pronunciation and timing using a phonetic alignment technique. VocaRefiner then estimates and displays three elements of each recording: F0, power, and spectral envelope, using the techniques described in Section 4.2. These elements are used later for the recomposition of recordings from multiple takes.

An example of the analysis is shown at the point marked Ⓔ in Figure 2. The location of the rectangles in Figure 2 shows the onset and offset time of each phoneme. The blue line, the light green line, and the darker green line indicate the trajectories of the parts selected for integration of F0, power, and voice timbre changes, respectively. The superimposed gray lines (which correspond to other recordings) are parts not selected for integration.

Such superimposed views are useful for seeing differences between the recordings without the need for repeated playback. In particular, this can highlight recordings where a wrong note has been sung (without the need to listen back to the recording), and also show the singer the points where the timbre of their voice has changed.

3.3 Integration by Recomposition and Manipulation

The integration is achieved by two main methods, "recomposition" and "manipulation", along with an additional technique for error repair. Their operation within VocaRefiner is described in the following subsections, and the technology behind them in Section 4.3.

3.3.1 Recomposition

The recomposition process involves direct interaction from the user, where the elements they wish to use at each phoneme are selected with the mouse. These selected elements are used for synthesizing the recording. In the situation where multiple recordings have been made for a particular section, VocaRefiner assumes that the most recently recorded take will be of good quality, and therefore selects it by default.

Figure 4. Time-stretching a phoneme by drag-and-drop. The length of the final phoneme /u/ is extended, and its F0, power, and voice timbre are also stretched accordingly.

Figure 5. F0 and power can be adjusted by drag-and-drop using the mouse.


3.3.2 Manipulation

Two modes of manipulation are available to the user: one modifies the phoneme timing, and the other modifies the singing style. The modification of phoneme timing changes the phoneme onset and duration (via time-stretching), and the manipulation of singing style is achieved through changes to the F0 and power.

A common situation requiring timing manipulation occurs when a phoneme is too short and needs to be lengthened. Figure 4 shows that when the length of the final phoneme /u/ is extended, the F0, power, and spectral envelope of the phoneme are also stretched accordingly. Onset times can also be adjusted without the need for time-stretching.

Figure 5 shows that F0 and power can be independently adjusted using the mouse. In addition to these local changes, the overall key of the recording can also be changed (Fig. 6) by global transposition.
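As a sketch of the timing manipulation, lengthening a phoneme amounts to resampling its per-frame element trajectories (F0, power, and each spectral-envelope dimension) onto a longer time base. A minimal illustration with numpy, assuming the 1 ms frame step of Section 4 (linear interpolation is our simplification):

```python
import numpy as np

def stretch_trajectory(values: np.ndarray, new_len: int) -> np.ndarray:
    """Resample a per-frame trajectory (e.g., F0 or power) to new_len frames."""
    old_t = np.linspace(0.0, 1.0, num=len(values))
    new_t = np.linspace(0.0, 1.0, num=new_len)
    return np.interp(new_t, old_t, values)

# Lengthen the final phoneme /u/ from 120 ms to 200 ms (1 ms per frame).
f0_u = np.full(120, 329.6)                    # toy F0 trajectory in Hz
f0_u_stretched = stretch_trajectory(f0_u, 200)
```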

3.3.3 Error Repair

Because occasional errors are unavoidable when recomposition and manipulation are based on the results of automatic analysis, it is important to recognize this possibility and provide the singer with the means for correcting mistakes. The most critical errors that could require correction relate to the F0 estimation and the phonetic alignment. Such errors can be easily fixed through a simple interaction, as shown in Figure 7.

When an octave error occurs in F0 estimation, it can be repaired by dragging the mouse to specify the correct time-frequency range; octave errors can thus be eliminated by specifying the desired time-frequency range after recording. The more recordings of the same phrase there are, the easier it is to determine the correct time-frequency range, because the singer can make a judgement from the many F0 trajectories, most of which have been correctly analysed.

Phonetic alignment errors are repaired by dragging the mouse to change the estimated phonetic boundaries.


Figure 6. Example of a shift to a higher key.

Figure 7. Error repair. An F0 error (an estimate an octave too high) is repaired by dragging the mouse to specify the correct time-frequency range (the red rectangle), and then a new F0 trajectory is estimated from this range. A phonetic alignment error (a wrong boundary in "d o m a r u") is repaired by dragging the boundary, or by copy-and-pasting results from other recordings.

Figure 7 also shows the correction of the wrong duration of a phoneme /o/. Moreover, estimation results from other recordings can be used to correct errors by a simple copy-and-paste process. This function can be used when the alignment of a recording has many errors, for example, when a singer chose to hum the melody instead of singing the lyrics.

4. SIGNAL PROCESSING FOR THE IMPLEMENTATION

The functionality of VocaRefiner is built around advanced signal processing techniques for the estimation of F0, power, spectral envelope, and group delay in the singing voice. While we make use of some standard techniques for this analysis, e.g., F0 [8, 9], spectral envelope [8], and group delay [10], and build upon our own previous work in this area [11, 12], we also present novel contributions to F0 and group delay estimation to meet the need for very high-accuracy frequency and phase estimation in VocaRefiner. In evaluating the new F0 detection method for singing voice (in Section 4.4), we demonstrate that our method exceeds the current state of the art.

Throughout this paper, singing samples are monaural solo vocal recordings digitised at 16 bit / 44.1 kHz. The discrete analysis time step (1 frame-time) is 1 ms, and time t is measured in frame-time units. All spectral envelopes and group delays are represented by 4097 frequency bins (8192-point FFT length).

4.1 Signal Processing For Interactive Recording

Methods for estimating pronunciation and timing information, and for transposing the key of the accompaniment, are required for interactive recording. Phoneme-level pronunciation of English lyrics is determined using the CMU Pronouncing Dictionary¹, and the pronunciation of Japanese lyrics is estimated by using the Japanese-language morphological analyzer MeCab².

1 http://www.speech.cs.cmu.edu/cgi-bin/cmudict

Figure 8. Overview of F0-adaptive multi-frame integration analysis. (A periodic signal, such as speech, singing, or an instrument sound, is windowed with F0-adaptive Gaussian windows reaching from t − 1/(2·F0) to t + 1/(2·F0) around each analysis time t; the FFT spectra of neighboring frames are then integrated into the spectral envelope.)

Timing information is estimated by first having the singer sing the target song once. The system then synchronizes the phoneme-level pronunciation of the lyrics with the recording. This synchronization is called phonetic alignment and is estimated through Viterbi alignment with a monophone hidden Markov model (HMM). Two HMMs were trained, on English and Japanese songs respectively. The English songs came from the RWC Music Database (Popular Music [13], Music Genre [14], and Royalty-Free Music [13]) and the Japanese songs are in the RWC Music Database (Popular Music [13]).

When a singer wishes to transpose the key of the accompaniment in VocaRefiner, we use a well-known phase vocoder technique [5], which operates offline.
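As an illustration of the pronunciation lookup, the CMU Pronouncing Dictionary is a plain-text file with one "WORD PHONES" entry per line; a minimal reader might look like the following sketch (not VocaRefiner's code; the file path is a placeholder, and alternate pronunciations such as "WORD(1)" are skipped for brevity):

```python
def load_cmudict(path: str) -> dict[str, list[str]]:
    """Parse CMU dict entries like 'THINK  TH IH1 NG K' into word -> phonemes."""
    pron = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line.strip() or line.startswith(";;;"):  # skip comments
                continue
            word, *phones = line.split()
            if "(" in word:              # skip alternate pronunciations
                continue
            # Strip stress digits: TH IH1 NG K -> th ih ng k.
            pron[word.lower()] = [p.rstrip("012").lower() for p in phones]
    return pron

# pron = load_cmudict("cmudict-0.7b")   # pron["think"] -> ['th','ih','ng','k']
```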

4.2 Signal Processing For Visualizing

A phonetic alignment method and a three-element decomposition method are required to implement this function. The phonetic alignment method is the same as that described above. The system estimates the fundamental frequency (F0), power, and spectral envelope of each recording.

F0(t) values are estimated using the method of Goto et al. [11]; they are linear-scale frequency values (Hz) estimated by applying a Hanning window whose length is 1024 samples (about 64 ms) after resampling to 16 kHz.

Spectral envelopes are estimated using F0-adaptive multi-frame integration analysis [12]. This method can estimate spectral envelopes with appropriate shape and high temporal resolution. Figure 8 shows an overview of the analysis. First, F0-adaptive Gaussian windows are used for spectrum analysis (F0-adaptive analysis). Then neighborhood frames are integrated to estimate the target spectral envelope (multi-frame integration analysis).

2 http://mecab.sourceforge.net/


Figure 9. Iterative F0 results estimated by the harmonic GMM. (The GMM is fitted to the F0-adaptive FFT spectrum over the range up to (m⁽⁰⁾ × K) + m⁽⁰⁾/2; EM iterations starting from different initial values m⁽⁰⁾, e.g., 329.4483 Hz, 349.4483 Hz, and 309.4483 Hz, converge toward the ground-truth F0 of 329.2376 Hz; with the 1st F0 m⁽⁰⁾ = 329.4483 Hz, the new F0 is 329.4894 Hz.)

The power is calculated from the spectral envelope by summation over the frequency axis at each time frame.
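For example, with a spectral envelope stored as a (frames × bins) array, the per-frame power could be computed as follows (a sketch; whether the summation is over linear amplitudes or squared magnitudes is not specified in the paper, so squared magnitudes are assumed here):

```python
import numpy as np

def frame_power(envelope: np.ndarray) -> np.ndarray:
    """Per-frame power from a spectral envelope of shape (n_frames, 4097)."""
    return np.sum(envelope ** 2, axis=1)
```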

4.3 Signal Processing For Integration

For high-quality resynthesis, the three elements should be estimated accurately and with high temporal resolution. For this purpose we propose a new F0 re-estimation technique, called the F0-adaptive F0 estimation method, which is highly accurate and has the requisite high temporal resolution. To generate the phase spectrum used in resynthesis, we also propose a new method for estimating group delay [10].

4.3.1 F0-adaptive F0 estimation method

Using the technique in [11] we perform an initial estimate of the F0, which we call the 1st F0, and use this as input to the F0-adaptive F0 estimation method. The basic idea behind our new method is that high temporal resolution can be obtained by shortening the analysis window length for F0 estimation as much as possible. Moreover, we exploit the knowledge that harmonic components at lower frequencies of the FFT amplitude spectrum can be used to estimate F0 accurately, as they contain relatively reliable information, whereas aperiodic components are often dominant at higher frequencies.

To obtain high accuracy and high temporal resolution, we propose a harmonic GMM (Gaussian mixture model). We fit the GMM to the FFT spectrum estimated by an F0-adaptive analysis that uses F0-adaptive Gaussian windows, with the 1st F0 used as an initial value. Hereafter, the 1st F0 is denoted m⁽⁰⁾.

We designed the F0-adaptive window by using a Gaussian function. Let w(τ) be a Gaussian window function of time τ defined as follows, where σ(t) is the standard deviation of the Gaussian distribution and F0(t) is the fundamental frequency for analysis time t:

\[ w(\tau) = \exp\left(-\frac{\tau^2}{2\sigma(t)^2}\right) \tag{1} \]

\[ \sigma(t) = \frac{\alpha}{F_0(t)} \times \frac{1}{3} \tag{2} \]
To set the value of α, we follow the approach for high-accuracy spectral envelope estimation in [15] and assign α = 2.5.
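As a sketch, the F0-adaptive Gaussian window of equations (1) and (2) can be realized as follows (truncating the window at ±3σ is our assumption; the paper does not state the truncation point):

```python
import numpy as np

def f0_adaptive_window(f0_hz: float, fs: int, alpha: float = 2.5) -> np.ndarray:
    """Gaussian window with sigma = (alpha / F0) * (1/3) seconds (eqs. (1)-(2))."""
    sigma = (alpha / f0_hz) / 3.0            # standard deviation in seconds
    half = int(round(3.0 * sigma * fs))      # truncate at +/- 3 sigma
    tau = np.arange(-half, half + 1) / fs    # time axis tau in seconds
    return np.exp(-tau ** 2 / (2.0 * sigma ** 2))

# win = f0_adaptive_window(f0_hz=329.45, fs=44100)
```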

Figure 10. Overlapped STFT results showing the maximum envelope (top) and the corresponding group delays (bottom). (Amplitude spectra of neighboring frames are overlapped for a voice with F0 = 318.6284 Hz, i.e., a fundamental period of 1/318.6284 = 0.0031 s; the group delay corresponding to the maximum envelope is selected.)

A harmonic GMM G(f; m, ωk, σk) for frequency f is designed as follows:

\[ G(f; m, \omega_k, \sigma_k) = \sum_{k=0}^{K} \frac{\omega_k}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{\left(f - (m \times k)\right)^2}{2\sigma_k^2}\right) \tag{3} \]

where K is the number of harmonics, for which K = 10 was found to provide high-quality output. The Gaussian function parameters m, ωk, and σk can be estimated using the well-known expectation-maximization (EM) algorithm, fitting the model to the F0-adaptive FFT spectrum in the frequency range [0, (K × m⁽⁰⁾) + m⁽⁰⁾/2]. In the iteration process of the EM algorithm, σk is constrained to the range [ε, m], where ε = 2.2204 × 10⁻¹⁶. The estimated m is used as the new estimate of F0(t).
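A compact EM sketch for fitting the harmonic GMM of equation (3) to an amplitude spectrum might look like the following (treating normalized amplitudes as weights on the frequency bins; the initialization of ωk and σk and the fixed iteration count are our assumptions, since the paper does not specify them):

```python
import numpy as np

def fit_harmonic_gmm(freqs, amps, m0, K=10, n_iter=15, eps=2.2204e-16):
    """Refine an initial F0 estimate m0 by EM-fitting the harmonic GMM (eq. (3))."""
    sel = freqs <= K * m0 + m0 / 2           # fitting range [0, K*m0 + m0/2]
    f, w = freqs[sel], amps[sel] / np.sum(amps[sel])
    k = np.arange(K + 1)                     # harmonic indices 0..K
    m, omega = m0, np.full(K + 1, 1.0 / (K + 1))
    sigma = np.full(K + 1, m0 / 10.0)        # initial std devs (assumption)
    for _ in range(n_iter):
        # E-step: responsibility of harmonic k for each frequency bin.
        g = omega / (np.sqrt(2 * np.pi) * sigma) \
            * np.exp(-(f[:, None] - m * k) ** 2 / (2 * sigma ** 2))
        r = g / np.maximum(g.sum(axis=1, keepdims=True), eps)
        wr = w[:, None] * r                  # amplitude-weighted responsibilities
        # M-step: shared F0 m, mixture weights, and per-harmonic variances.
        m = np.sum(wr * k * f[:, None] / sigma ** 2) \
            / np.maximum(np.sum(wr * (k ** 2) / sigma ** 2), eps)
        omega = wr.sum(axis=0)
        var = np.sum(wr * (f[:, None] - m * k) ** 2, axis=0) / np.maximum(omega, eps)
        sigma = np.clip(np.sqrt(var), eps, m)  # range constraint [eps, m]
    return m                                  # the new F0 estimate
```

The update for m is the weighted least-squares solution for a shared fundamental whose component means are constrained to m × k, which is what locks the fit onto the harmonic spacing.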

4.3.2 Normalized Group Delay Estimation Method Based on F0-Adaptive Multi-Frame Integration Analysis

To enable the estimation of the phase spectrum for resynthesis, we propose a robust group delay estimation method. Although the previous method [12] relied upon pitch marks to estimate the group delay, the proposed method is more robust because it does not require them. The basic idea of this estimation is to use an F0-adaptive multi-frame integration analysis based on the spectral envelope estimation approach in [12]. To estimate group delay, the F0-adaptive analysis and a multi-frame integration analysis are conducted. In the integration, maximum envelopes are selected and their corresponding group delays are used as the target group delays. The group delay at each time can be estimated by using the method described in [10]. Figure 10 shows an example of extracting the maximum envelopes and corresponding group delays.

The estimated group delay has discontinuities along the frequency axis caused by the fundamental period. The group delay g(f, t) is therefore normalized to the range (−π, π] and represented by sin and cos functions as follows:

\[ g(f,t) = \frac{\mathrm{mod}\left(g(f,t) - g(\beta \times F_0(t),\, t),\ 1/F_0(t)\right)}{1/F_0(t)} \tag{4} \]

\[ g_\pi(f,t) = \left(g(f,t) \times 2\pi\right) - \pi \tag{5} \]

\[ g_x(f,t) = \cos\left(g_\pi(f,t)\right) \tag{6} \]

\[ g_y(f,t) = \sin\left(g_\pi(f,t)\right) \tag{7} \]

Here mod(x, y) denotes the residue of x modulo y. The g(f, t) − g(β × F0(t), t) term is used to eliminate an offset of the analysis time, and β is set to 1.5 (an intermediate frequency between the first and second harmonics), since setting β = 1.0 (the fundamental frequency) allowed undesirable fluctuations to remain.


Figure 11. Singing synthesis by the F0-synchronous overlap-and-add method from the spectral envelope and group delay. (The normalized group delay and the spectral envelope are adapted to the fundamental period for synthesis; the phase spectrum of each synthesis unit is derived from the group delay, and the units are overlap-added to produce the synthesized waveform.)

There are also discontinuities along the time axis. These are smoothed along both the time and frequency directions using a 2-dimensional FIR low-pass filter. Since the estimated group delay of frequency bins below F0 is known to be unreliable, we finally smooth the group delay of bins below F0 so that it takes the same value as the group delay at F0.
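A sketch of the normalization of equations (4)-(7) for one analysis frame (assuming a per-bin group delay `g` in seconds on a frequency axis `freqs`; evaluating g at β·F0 by interpolation is an implementation detail we assume):

```python
import numpy as np

def normalize_group_delay(g, freqs, f0, beta=1.5):
    """Normalize group delay (seconds) to cos/sin form per eqs. (4)-(7)."""
    period = 1.0 / f0
    g_ref = np.interp(beta * f0, freqs, g)        # offset term g(beta*F0, t)
    g_norm = np.mod(g - g_ref, period) / period   # eq. (4), in [0, 1)
    g_pi = g_norm * 2.0 * np.pi - np.pi           # eq. (5), in (-pi, pi]
    return np.cos(g_pi), np.sin(g_pi)             # eqs. (6)-(7)
```

Representing the angle by its cosine and sine avoids wrap-around artifacts when the values are later smoothed across time and frequency.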

4.3.3 Singing Synthesis Using Normalized Group Delay

The singing-synthesis method used to make the final recording needs to reflect the integrating and editing results. Our implementation of singing synthesis from spectral envelopes and group delays is based on the well-known F0-synchronous overlap-and-add method (Fig. 11). The normalized group delays gx(f, t) and gy(f, t) are adapted to the synthesized fundamental period 1/F0(t)syn as follows:

\[ g(f,t) = \frac{1}{F_0(t)_{\mathrm{syn}}} \times \frac{g_\pi(f,t) + \pi}{2\pi} \tag{8} \]

\[ g_\pi(f,t) = \begin{cases} \tan^{-1}\left(\dfrac{g_y(f,t)}{g_x(f,t)}\right) & (g_x(f,t) > 0) \\ \tan^{-1}\left(\dfrac{g_y(f,t)}{g_x(f,t)}\right) + \pi & (g_x(f,t) < 0) \\ (3\pi)/2 & (g_y(f,t) < 0,\ g_x(f,t) = 0) \\ \pi/2 & (g_y(f,t) > 0,\ g_x(f,t) = 0) \end{cases} \tag{9} \]

The phase spectrum used to generate each synthesis unit is then computed from the adapted group delay; it can be obtained by integration of the group delay, as in [10].
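The case analysis of equation (9) recovers the angle from its cosine and sine, which is what the two-argument arctangent computes (up to a 2π branch choice, i.e., a shift by one fundamental period). A sketch of the adaptation and the subsequent phase computation (the cumulative-sum integration over frequency is our simplification of the method in [10]):

```python
import numpy as np

def group_delay_to_phase(gx, gy, f0_syn, bin_hz):
    """Adapt normalized group delay to 1/F0_syn (eqs. (8)-(9)) and
    integrate it over frequency into a phase spectrum."""
    g_pi = np.arctan2(gy, gx)                             # eq. (9)
    g = (1.0 / f0_syn) * (g_pi + np.pi) / (2.0 * np.pi)   # eq. (8), seconds
    # phi(f) = -2*pi * integral of the group delay over frequency,
    # approximated here by a cumulative sum over the FFT bins.
    return -2.0 * np.pi * np.cumsum(g) * bin_hz
```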

4.4 Experiments and Results

To evaluate the effectiveness of the iterative F0 estimation method, we examine its use as a secondary processing stage on three well-known existing F0 methods: Goto [11]³, SWIPE [9], and STRAIGHT [8]. In each case we provide our iterative F0 estimation method with the initial output from these systems and derive a new F0 result. The frequency range [100, 700] Hz is used for all the methods.

3 The 1st author reimplemented Goto’s method for speech signals.

Figure 12. Estimation accuracies (mean error value in semitones) of the proposed re-estimation method (denoted "2nd") compared with those of Goto [11], SWIPE [9], and STRAIGHT [8], for a female singing voice (RWC-MDB-P-2001 No. 7, verse A: 6.2 s) and a male singing voice (RWC-MDB-G-2001 No. 91, verse A: 6.6 s).

Estimation accuracy is determined by finding the mean error value, εf, defined by

\[ \varepsilon_f = \frac{1}{T_f} \sum_t \left| f_g(t) - f_n(t) \right| \tag{10} \]

\[ f_n(t) = 12 \times \log_2 \frac{F_0(t)}{440} + 69 \tag{11} \]

where Tf is the number of voiced frames and fg(t) is the ground-truth value. fn(t) and fg(t) are log-scale frequency values relative to the MIDI note number.

To compare the performance of the algorithms, we use synthesized and resynthesized natural sound examples from the RWC Music Database (Popular Music [13] and Music Genre [14]). To prepare the ground truth, fg(t), we used singing voices resynthesized from natural singing examples using the STRAIGHT algorithm [8].

The results in Figure 12 show that the F0 estimation across each of the methods is highly accurate, with very low εf for both male and female singing voices. Furthermore, we can see that, for each of the three algorithms, the inclusion of our iterative estimation method improves performance. In this way, our iterative method could be applied to any F0 estimation algorithm as an additional processing step to increase accuracy.

Regarding the estimation of spectral envelope and group delay, it is not feasible to perform a similar objective analysis. Therefore, in Figure 13 we present a comparison between the spectral envelope and group delay estimated from a singing recording and from a synthesized singing voice. By inspection it is clear that both the spectral envelope and group delay of the two signals are highly similar, which indicates the robustness of our method.
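The metric of equations (10)-(11) is straightforward to compute; a sketch, assuming frame-aligned estimated and ground-truth F0 arrays with 0 marking unvoiced frames:

```python
import numpy as np

def mean_semitone_error(f0_est: np.ndarray, f0_gt: np.ndarray) -> float:
    """Mean absolute F0 error in semitones over voiced frames (eqs. (10)-(11))."""
    voiced = (f0_est > 0) & (f0_gt > 0)
    to_midi = lambda f0: 12.0 * np.log2(f0 / 440.0) + 69.0   # eq. (11)
    return float(np.mean(np.abs(to_midi(f0_gt[voiced]) - to_midi(f0_est[voiced]))))
```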

5. DISCUSSION

There are two ways to make high-quality singing content, now and in the future. One way is for singers to improve their voices by training with a professional teacher or by using singing-training software; this can be considered the "traditional" way. The alternative is to improve one's singing "expression" skill by editing and integrating, i.e., through practice and training with software tools. This paper presented a system for expanding the possibilities of this new, emerging second way. We recognise that these two ways can be used for different purposes and have different qualities of pleasantness. We also believe that, in the future, they could become equally important.


Figure 13. Examples of estimated spectral envelope and group delay for a singing recording, and of analysis results for a synthesized singing voice (frequency range 0 to 22.05 kHz; time range 0 to 0.5 s).

A high-quality singing recording produced in the traditional way can create an emotional response in listeners who appreciate the "physical control" of the singer. On the other hand, a high-quality singing recording improved using a tool like VocaRefiner can reach listeners in a different way: they can appreciate the level of expression within a kind of "singing representation" created through skilled technical manipulation. In both cases, there is a shared common purpose of vocal expression and of reaching listeners on a personal and emotional level.

The standard function of recording vocals has focused only on the acquisition of the vocal signal using microphones, pre-amps, digital audio workstations, etc. In this paper, however, we explore a new paradigm for recording, in which the process becomes interactive. Allowing a singer to record their voice with a lyrics-based recording system opens new possibilities for interactive sound recording, which could change how music is recorded in the future, e.g., when applied to recording other instruments such as drums, guitars, and piano.

6. CONCLUSIONS

In this paper we present an interactive singing recording system called VocaRefiner to help amateur singers make high-quality vocal recordings at home. VocaRefiner comes with a suite of powerful tools driven by advanced signal processing techniques for voice analysis (including robust F0 and group delay estimation), which allow for easy recording, editing, and manipulation of recordings. In addition, VocaRefiner has the unique ability to integrate the "best parts" from different takes, even down to the phoneme level. By selecting between takes and correcting errors in pitch and timing, an amateur singer can create recordings which capture the full potential of their voice, or even go beyond it. Furthermore, the ability to visually inspect objective information about their singing (e.g., pitch, loudness, and timbre) could help singers better understand their voices and encourage them to experiment more with their singing style. Hence VocaRefiner can also act as an educational tool.

In future work we intend to further improve the synthesis quality and to implement other music-understanding functions, including beat tracking and structure visualization [16], towards a more complete interactive recording environment.

Acknowledgments

We would like to thank Matthew Davies (CREST/AIST) for proofreading. This research utilized the RWC Music Database "RWC-MDB-P-2001" (Popular Music), "RWC-MDB-G-2001" (Music Genre), and "RWC-MDB-R-2001" (Royalty-Free Music). This research was supported in part by OngaCrest, CREST, JST.

7. REFERENCES

[1] K. Nakano, M. Morise, and T. Nishiura, "Vocal manipulation based on pitch transcription and its application to interactive entertainment for karaoke," in LNCS: Haptic and Audio Interaction Design, vol. 6851, 2011, pp. 52–60.

[2] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, "LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics," IEEE Journal of Selected Topics in Signal Processing, 2011, pp. 311–316.

[3] D. Hoppe, M. Sadakata, and P. Desain, "Development of real-time visual feedback assistance in singing training: a review," Journal of Computer Assisted Learning, vol. 22, pp. 308–316, 2006.

[4] T. Nakano, M. Goto, and Y. Hiraga, "MiruSinger: A singing skill visualization interface using real-time feedback and music CD recordings as referential data," in Proc. ISMW 2007, 2008, pp. 75–76.

[5] U. Zolzer and X. Amatriain, DAFX - Digital Audio Effects. Wiley, 2002.

[6] T. Toda, A. Black, and K. Tokuda, "Voice conversion based on maximum likelihood estimation of spectral parameter trajectory," IEEE Trans. ASLP, vol. 15, no. 8, pp. 2222–2235, 2007.

[7] H. Kawahara, R. Nisimura, T. Irino, M. Morise, T. Takahashi, and H. Banno, "Temporally variable multi-aspect auditory morphing enabling extrapolation without objective and perceptual breakdown," in Proc. ICASSP 2009, 2009, pp. 3905–3908.

[8] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, pp. 187–207, 1999.

[9] A. Camacho, SWIPE: A Sawtooth Waveform Inspired Pitch Estimator for Speech and Music. Ph.D. thesis, University of Florida, 2007.

[10] H. Banno, L. Jinlin, S. Nakamura, K. Shikano, and H. Kawahara, "Efficient representation of short-time phase based on group delay," in Proc. ICASSP 1998, 1998, pp. 861–864.

[11] M. Goto, K. Itou, and S. Hayamizu, "A real-time filled pause detection system for spontaneous speech recognition," in Proc. Eurospeech '99, 1999, pp. 227–230.

[12] T. Nakano and M. Goto, "A spectral envelope estimation method based on F0-adaptive multi-frame integration analysis," in Proc. SAPA-SCALE Conference 2012, 2012, pp. 11–16.

[13] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Popular, classical, and jazz music databases," in Proc. ISMIR 2002, 2002, pp. 287–288.

[14] ——, "RWC music database: Music genre database and musical instrument sound database," in Proc. ISMIR 2003, 2003, pp. 229–230.

[15] H. Kawahara and M. Morise, "Technical foundations of TANDEM-STRAIGHT, a speech analysis, modification and synthesis framework," Sadhana: Academy Proceedings in Engineering Sciences, vol. 36, no. 5, pp. 713–727, 2011.

[16] M. Goto, K. Yoshii, H. Fujihara, M. Mauch, and T. Nakano, "Songle: A web service for active music listening improved by user contributions," in Proc. ISMIR 2011, 2011, pp. 311–316.
