Soundscape Generation for Virtual Environments using Community-Provided Audio Databases

Nathaniel Finney, Jordi Janer∗

Abstract

This research focuses on the generation of soundscapes from unstructured sound databases for the sonification of virtual environments. The design methodology incorporates concatenative synthesis to construct a sound environment from online community-provided sonic material, and an application of this methodology is described in which sound environments are generated for Google Street View using the online sound database Freesound. Furthermore, the model allows for the creation of augmented soundscapes using parameterization models as an input to the resynthesis paradigm, which incorporates multiple source and textural layers. A subjective evaluation of this application was performed to compare the immersive properties of the generated soundscapes with those of real recordings, and the results suggest a general preference for generated soundscapes that incorporate sound design principles. The potential for further research and future applications in the area of augmented reality is discussed with reference to emerging web media technologies.

Introduction

The advancement of online virtual reality and augmented reality technologies, in conjunction with the increase in community-provided sonic material, encourages the development of autonomous soundscape generation tools that enhance the immersive experience of the user.

With unstructured databases as the audio input to the soundscape generator, such a model may be considered both a framework for community interaction and an ecological acoustics preservation tool, while taking advantage of the growing databases of sonic material [10]. However, unstructured databases pose challenges to the autonomous generation of immersive sonic environments due to the widely varying characteristics of the material. Differences in sound quality and reverberation, for example, may result in an incongruent sonic ambience and thereby reduce the immersive quality of the soundscape. Synthesis methods such as wavelet analysis and resynthesis and large-grain concatenative synthesis are proposed to alleviate the effect of these variations and to allow for parameterization-based sound design algorithms.
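To make the large-grain concatenation concrete, the following is a minimal sketch of grain joining with equal-power crossfades, one plausible way to smooth mismatches in quality and reverberation between grains drawn from heterogeneous recordings. The fade length and the use of Python/NumPy are illustrative assumptions; the implementation described later in this paper uses ActionScript 3.0 and MATLAB.

```python
import numpy as np

# Sketch: concatenate large grains with an equal-power crossfade so
# that perceived level stays roughly constant across each boundary.
# The fade length (2048 samples) is an assumed parameter.
def concatenate_grains(grains: list[np.ndarray], fade: int = 2048) -> np.ndarray:
    t = np.linspace(0.0, np.pi / 2.0, fade)
    fade_out, fade_in = np.cos(t), np.sin(t)  # powers sum to one (equal power)
    out = grains[0].astype(float).copy()
    for g in grains[1:]:
        g = g.astype(float)
        out[-fade:] = out[-fade:] * fade_out + g[:fade] * fade_in
        out = np.concatenate([out, g[fade:]])
    return out

mix = concatenate_grains([np.random.randn(44100) for _ in range(3)])
print(len(mix))  # 3*44100 - 2*2048 samples
```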

Applying parameterization to the synthesis paradigm furthermore allows for augmentation of the soundscape according to either predefined or user-defined criteria, and thereby expands the potential for adaptive and creative soundscape design for virtual environments and augmented reality.

The design of sonic environments requires a fundamental classification strategy for both soundscapes and sound sources. The classification of soundscapes is most widely performed using R. Murray Schafer's referential taxonomy, which incorporates socio-cultural attributes and ecological acoustics [12]. In the work of Birchfield et al., the generation of the soundscape is based on a probabilistic selection of audio tracks separated according to Schafer's concepts of keynote sounds and signal sounds, where the material has been retrieved prior to generation with a lexical search using WordNet [8]. Birchfield et al. incorporate dynamically changing probabilities for triggering audio tracks in response to user behavior, resulting in an evolving sound environment that they propose reflects the sonic diversity of the equivalent natural environment [7]. Probabilistic models for soundscape generation demonstrate the notion that a generated soundscape is greatly enhanced by variety and evolution, which research has shown to be a contributing factor to the sense of presence in VEs [13], [15].

The sense of presence is defined as the feeling of being situated in an environment despite being physically situated in another [14]. In quantifying and/or qualifying the sense of presence, or the similar term immersion, different authors have proposed and evaluated various components that make up this highly subjective sense [14], [15], [13]. In order to experience some degree of presence in a VE, the psychological states of involvement and immersion are necessary precursors. Involvement is defined as the focus on a coherent set of stimuli, and immersion is the actual perception of envelopment in an environment [14].

∗Music Technology Group, Universitat Pompeu Fabra, Barcelona. This work was partially supported by the ITEA2 Metaverse1 Project (http://www.metaverse1.org).

Figure 1: Overview of the system functionality with given inputs (green circles), analysis blocks (diamonds), sound groups (rectangles) and processes (rectangles with soft corners).

The model proposed in this research applies these fundamental soundscape design principles to a generation methodology and is evaluated according to the resulting sense of presence of the user. The application developed in this research includes only a portion of the applicable soundscape design principles described by previous authors, which suggests that, as web audio generation technologies advance, this methodology carries a large potential for creating immersive online media.

Design Methodology

The principles of sound design and soundscape characterization as laid out by R. Murray Schafer [12] form the underlying hypothesis of the design methodology described in this section: that treating the soundscape in terms of two layers, textures (background) and objects (foreground), establishes a framework for algorithmic design and interaction principles better suited to information delivery and perceptual comfort than a unified montage of sound sources.

The textural elements and object elements are segmented from sound files retrieved from a database and categorized into one of these two layers using their semantic identifiers, which are extracted either from tags or from a recognition model. Objects are those sounds that are meant or expected to draw attention from the user, and may include indicators, soundmarks or informational content such as church bells, or non-diegetic sounds such as narration. Textural elements are those sources that form the ambience, such as birds and wind; they tend to be more stochastic in nature while drawing minimal focus from the user. A sketch of such a categorization step is given below.
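As an illustration only: a minimal sketch of tag-based layer assignment. The tag vocabularies and the simple overlap-count rule are hypothetical placeholders, not the actual semantic identifiers or recognition model used in this work.

```python
# Hypothetical tag vocabularies for the two layers (placeholders).
OBJECT_TAGS = {"church-bell", "siren", "footsteps", "car-horn", "speech"}
TEXTURE_TAGS = {"birds", "wind", "traffic", "crowd", "rain"}

def classify_layer(tags: set[str]) -> str:
    """Assign a sound to the object or texture layer by tag overlap."""
    object_hits = len(tags & OBJECT_TAGS)
    texture_hits = len(tags & TEXTURE_TAGS)
    if object_hits > texture_hits:
        return "object"
    # Ambiguous or unknown sounds fall back to the ambience layer.
    return "texture"

print(classify_layer({"church-bell", "city"}))  # -> object
print(classify_layer({"wind", "trees"}))        # -> texture
```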

Figure 1 shows the overview of the system functionality, where the two sound groups, textures and objects, are handled separately until the final mixing and spatialization. Solid arrows signify transfer of audio content, while dashed arrows are informational streams supplied to the various function blocks. A detailed description of the functionality of the individual blocks may be found in [9].

The blocks Textural Parameters and Playback Strategy contain models for parameterizing the synthesis of the soundscape, and thereby allow for manipulation and augmentation toward a desired resultant soundscape. These functions may be considered the design portion of the model, and may be predefined or open to user or contextual manipulation. One illustration of a playback strategy follows.
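As one illustration of what a playback strategy block might contain, the sketch below draws segments for a textural layer according to precomputed selection probabilities, so the texture evolves rather than looping; the file names and weights are placeholders, standing in for the per-segment probabilities the implementation computes offline.

```python
import random

# Placeholder segments and offline-derived selection probabilities.
segments = ["birds_a.wav", "birds_b.wav", "birds_c.wav"]
weights = [0.5, 0.3, 0.2]

def next_segment() -> str:
    """Draw the next grain for concatenation, avoiding static repetition."""
    return random.choices(segments, weights=weights, k=1)[0]

playlist = [next_segment() for _ in range(8)]
print(playlist)
```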

For the concatenation of the original audio files, an MFCC-based Bayesian information criterion (BIC) segmentation procedure is used to scan a recording for optimal segmentation points [6]. The MFCC calculation takes into account the spectral properties of the signal and is based on the Mel scale, which correlates the frequency spectrum of the signal with the perceptual attribute of timbre. A minimum segmentation length is selected depending on the signal classification, based on Schafer's taxonomy, and grain sizes longer than that length are chosen according to the optimal segmentation points from the MFCC-BIC calculation.
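The sketch below illustrates the MFCC-BIC idea in simplified form (it is not the exact procedure of [6]): a boundary is accepted when modelling a window of MFCC frames as two Gaussians yields a positive BIC gain over a single Gaussian. The fixed mid-window candidate, the penalty weight, and the use of librosa are assumptions made for brevity.

```python
import numpy as np
import librosa  # assumed available for MFCC extraction

def delta_bic(X: np.ndarray, t: int, lam: float = 1.0) -> float:
    """BIC gain for splitting MFCC frames X (d x N) at frame t;
    positive values favour a segment boundary at t."""
    d, N = X.shape
    def logdet_cov(M):
        cov = np.cov(M) + 1e-6 * np.eye(d)  # regularized covariance
        return np.linalg.slogdet(cov)[1]
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return (0.5 * N * logdet_cov(X)
            - 0.5 * t * logdet_cov(X[:, :t])
            - 0.5 * (N - t) * logdet_cov(X[:, t:])
            - lam * penalty)

def segment_points(path: str, min_len: int = 200) -> list[int]:
    """Scan a recording for segmentation points via MFCC + BIC."""
    y, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    bounds, start = [], 0
    while start + 2 * min_len < mfcc.shape[1]:
        window = mfcc[:, start:start + 2 * min_len]
        if delta_bic(window, min_len) > 0:  # candidate at mid-window
            bounds.append(start + min_len)
        start += min_len
    return bounds
```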

Current Implementation

As an initial implementation of the autonomous soundscape generation model, a static photorealistic environment was chosen in order to evaluate the use of recorded sounds from unstructured databases in an application that must be consistent with the natural environment. The Google Street View application contains panoramic images from discrete locations within a city, in which the user is free to rotate and can translate to different locations.

Figure 2: GUI.

Table 1: Questionnaire for subjective evaluation

1. How well could you identify individual sound sources in the soundscape?

2. How well could you actively localize individual sound sources in the soundscape?

3. How compelling was your sense of objects moving through space?

4. How much did the auditory aspects of the environment involve you?

5. How compelling was your sense of movement (turning) inside the virtual environment?

6. How much did your experiences in the virtual environment seem consistent with your real-world experiences?

The sound database used to feed the soundscape generator was The Freesound Project [4], and as many samples as possible were selected with origins in Barcelona and Spain for purposes of authenticity. For this study involving the Street View application, the sound objects were retrieved manually in order to focus on the autonomous generation aspect; an automatic retrieval implementation is reserved as an avenue for future work, sketched below.
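As a pointer toward that future work, the following is a hypothetical sketch of automating retrieval with a text search against the Freesound API; the endpoint, parameter names and token-based authentication shown here are assumptions and should be checked against the current freesound.org API documentation.

```python
import requests

def search_freesound(query: str, api_key: str, page_size: int = 15) -> list[dict]:
    """Hypothetical text search; endpoint and parameters are assumed."""
    resp = requests.get(
        "https://freesound.org/apiv2/search/text/",  # assumed endpoint
        params={"query": query, "page_size": page_size, "token": api_key},
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

# e.g. search_freesound("church bells barcelona", api_key="...")
```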

The application includes dynamic changes of audio level and panning in the individual source and texture layers according to the location and user orientation. The application uses ActionScript 3.0 for synthesis and user interaction, and pre-analyzes the audio files offline using MATLAB to determine segmentation points and individual segment selection probabilities. The user interface is shown in Figure 2, and the demo application is available online [1].
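A minimal sketch of this kind of per-source level and panning logic is given below: equal-power stereo gains derived from the angle between the user's view direction and the source bearing, with a simple 1/r distance attenuation. The geometry conventions and constants are illustrative and do not reproduce the application's ActionScript code.

```python
import math

def pan_gains(source_bearing_deg: float, view_deg: float,
              distance_m: float) -> tuple[float, float]:
    """Stereo (left, right) gains for one source, given user orientation."""
    rel = math.radians(source_bearing_deg - view_deg)
    pan = math.sin(rel)                  # -1 (hard left) .. +1 (hard right)
    theta = (pan + 1.0) * math.pi / 4.0  # equal-power crossfade angle
    level = 1.0 / max(distance_m, 1.0)   # simple 1/r distance attenuation
    return level * math.cos(theta), level * math.sin(theta)

# A source 90 degrees to the right of the view direction, 2 m away:
print(pan_gains(90.0, 0.0, 2.0))  # energy mostly in the right channel
```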

A subjective evaluation of the application was performed using eight participants and three generated soundscapes, compared with three quadraphonic recordings made at the relevant locations. The participants were able to control the selected location and orientation freely, and were not informed of which soundscapes were generated and which were real recordings. Questions considered relevant for soundscapes in photorealistic environments were selected from the questionnaire proposed by Singer and Witmer for evaluating the sense of presence in virtual environments [14] (see Table 1), where the user was asked to rate each response on a five-point scale from Not at all to Very well.

The results of the analysis showed that the subjects generally rated the generated soundscapes higher than the recorded soundscapes for each of the individual questions, by a margin of 15–20%. While there were not enough subjects to draw statistically significant conclusions, the results demonstrate that the generated soundscapes are at least acceptable in comparison with actual recordings. The normalized responses to the individual questions for the three locations are shown in Figure 3.

As many of the questions were related to the identification of sources, localization and dynamism of sound objects, the spatialization and mixing of the separate layers seem to have had a pronounced effect. The participants were more easily able to distinguish and localize objects in the sound scene, and perceived a higher degree of movement of objects, with the generated soundscapes using separated source layers and individual layer spatialization.

Figure 3: Normalized responses averaged over all participants in the subjective evaluation for each question.

Discussion

The results of the evaluation of the design methodology employed in this study suggest that the separation of object and textural layers, together with level and panning algorithms based on user orientation, heightens the user's sense of presence in the virtual environment in comparison with recordings of an entire soundscape. The use of soundscape design principles and parameterization models to augment the generated soundscape with reference to user orientation and locality attributes will be studied further in subsequent implementations of the model, with a focus on developing applications that may be integrated into online virtual environments.

In the current application, soundscape generation and real-time manipulation were performed with ActionScript 3.0, and offline analyses for segmentation were conducted using MATLAB. Retrieval of audio material from the online community database was performed manually in this study. Future implementations of the model are to be developed with the intention of incorporating sonification of virtual environments directly in the web browser with technologies such as the new extension to the HTML5 media API [2], which allows for manipulating raw audio data.

The current implementation of the soundscape generation model has relatively modest real-time audio requirements, primarily consisting of the reproduction of pre-segmented audio files with spatialization and dynamic level adjustments applied to the individual sound objects and textural layers. With the capability to apply dynamic filtering to the individual sound objects and textural layers, the sound design potential of the model would be greatly improved. Furthermore, as many augmented reality applications target portable devices, in which the audio is reproduced through headphones, a binaural spatialization algorithm using HRTFs [5], [3] would likely benefit the user's sense of presence. Such spectral filtering models could be implemented as time-domain filters within the proposed HTML5 API [2]. However, if many sources are present, methods for optimizing binaural 3D rendering may require further development [11].
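For reference, a minimal sketch of binaural rendering for a single source: the mono signal is convolved with a left/right head-related impulse response (HRIR) pair, as measured in databases such as CIPIC [5] or IRCAM Listen [3]. The placeholder impulse responses below, and the omitted selection of the HRIR pair nearest the source direction, are simplifications.

```python
import numpy as np

def binauralize(mono: np.ndarray, hrir_left: np.ndarray,
                hrir_right: np.ndarray) -> np.ndarray:
    """Return a (2, N) binaural signal from a mono source."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    out = np.zeros((2, max(len(left), len(right))))
    out[0, :len(left)] = left
    out[1, :len(right)] = right
    return out

# Placeholder HRIRs (a measured HRIR is typically ~200 taps at 44.1 kHz).
hl, hr = np.array([0.9, 0.1]), np.array([0.4, 0.3])
rendered = binauralize(np.random.randn(44100), hl, hr)
print(rendered.shape)  # (2, 44101)
```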

Such advances in web media technology are likely to provide a means for further developing the functionality and efficacy of the soundscape generation model, and to allow for increased abilities to augment the soundscapes according to desired criteria.

References

[1] http://dev.mtg.upf.edu/soundscape/media/StreetView/streetViewSoundscaper2_0.html.

[2] https://wiki.mozilla.org/Audio_Data_API.

[3] http://recherche.ircam.fr/equipes/salles/listen.

[4] Freesound.org. http://www.freesound.org.

[5] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano. The CIPIC HRTF database. In IEEE Workshop on Applications of Signal Processing to Audio and Electroacoustics, pages 99–102, 2001.

[6] X. Anguera and J. Hernando. XBIC: Real-time cross probabilities measure for speaker segmentation. Univ. of California Berkeley, ICSI Berkeley Tech. Rep., 2005.

[7] D. Birchfield, N. Mattar, and H. Sundaram. Design of a generative model for soundscape creation. In International Computer Music Conference, Barcelona, Spain, 2005.

[8] C. Fellbaum et al. WordNet: An Electronic Lexical Database. MIT Press, Cambridge, MA, 1998.

[9] N. Finney. Autonomous generation of soundscapes using unstructured sound databases. Master's thesis, Universitat Pompeu Fabra, 2009.

[10] J. Janer, N. Finney, G. Roma, S. Kersten, and X. Serra. Supporting soundscape design in virtual environments with content-based audio retrieval. Journal of Virtual Worlds Research, 2(3), 2009.

[11] M. Noisternig, T. Musil, A. Sontacchi, and R. Holdrich. 3D binaural sound reproduction using a virtual ambisonic approach. In IEEE International Symposium on Virtual Environments, 2003.

[12] R. Murray Schafer. The Soundscape: Our Sonic Environment and the Tuning of the World. Destiny Books, 1994.

[13] S. Serafin. Sound design to enhance presence in photorealistic virtual reality. In Proceedings of the 2004 International Conference on Auditory Display, pages 6–9, 2004.

[14] M. J. Singer and B. G. Witmer. Measuring presence in virtual environments: A presence questionnaire. Presence, 7:225–240, 1998.

[15] P. Turner, I. McGregor, S. Turner, and F. Carroll. Evaluating soundscapes as a means of creating a sense of place. In International Conference on Auditory Display, 2003.