
Echo Wall: A Sound-driven Media-art

Xiaojie Chen¹, Yuanchun Shi¹, Zhiyong Fu²
¹Department of Computer Science and Technology, Tsinghua University, Beijing, P. R. China

chen-xj@mails.tsinghua.edu.cn, shiyc@tsinghua.edu.cn
²Academy of Arts and Design, Tsinghua University, Beijing, P. R. China

infoart@tsinghua.edu.cn

Abstract

Echo Wall is an interactive artwork that enables people to communicate with the environment and with other people in everyday settings. The installation is set up on a regular wall in a public space where everyone can participate in the interaction. When someone speaks or makes noise in front of the wall, the projected animation on the wall changes according to the location and the voice features of that person. People can communicate with each other by generating different animations. The microphone array technology used gives people the opportunity to experience the interaction naturally, without any complicated equipment.

Keywords: Interactive Art, Voice Control, Microphone Array, Sound Localization.

1. Introduction

The concept of creating innovative works of art through sound and voice has been an important field explored by both artists and technologists. Human sound and voice often communicate rich messages about our feelings, emotions, thoughts and our interaction with the surrounding environment. By extracting features from human sound, these messages can be transformed into painting, drawing, animation, synthesized music and other art forms.

In recent years, a number of researchers have shown significant interest in sound-driven art and have developed several artworks that transform sound and voice into graphical feedback. Hidden Worlds [1], developed by Levin and Lieberman, is an interactive audiovisual installation that makes the voices of people visible in the form of graphic figurations. Players can see their voices, presented as 3D noodle-like figures, through data glasses. The length of a figure is controlled by the duration of the voice, while its diameter is controlled by volume. Organum [2], a multiplayer game developed by Niemeyer at Berkeley, enables three players to collaboratively navigate through a model of the human voice box, using their voices as a joystick. The players' voices, captured by five microphones, are used to control the direction and speed of movement in 3D space. Sing Pong [3], a voice-controlled version of "Pong", allows players to move the paddles using their voices and shadows. A paddle's height is mapped to the volume of the voice, while its position is mapped to the position of the player's shadow on the projected screen.

These installations share the basic concept that the messages carried by the human voice can be represented in a variety of forms to explore the relations between sound and the visual art language.

In this paper, we describe a real-time sound-driven media-art installation, Echo Wall, which incorporates human-computer interaction into day-to-day life. The interface works out the location of the sound source and extracts features from the human voice, enabling users to control the animations projected on the wall with their voices.

Since sound capturing is achieved by a microphone array embedded in the wall, users are free from head-mounted microphones and other sound capture devices. They can express themselves freely and interact instinctively with the surrounding environment. By controlling the location and the volume of their sound, users can create splendid 3D drawings and paintings on the wall without any prior training.

2. System design overview



Echo Wall, shown in Figure 2, is set up on a real wall that is 2.5 meters high, 8 meters in diameter and 6.8 meters in arc length. The whole system consists of two main sections: one is the voice processing section, which includes a microphone array, a multi-channel soundcard and a computer working as a voice processor; the other is the animation generating section, which includes two projectors and a computer working as an animation generator. These two sections are connected by the network to transfer information.

Figure 2. Setup of Echo Wall

Figure 3. Architecture of Echo Wall

Human voices are captured by the 14-element microphone array embedded in the wall. From the fourteen streams of synchronized audio signals, the user's position in front of the wall is calculated and the sound features are extracted and analyzed by the voice processor. The analyzed results, including a number of numerical measurements relating to the user's voice, are then sent to the animation generating section via the network. The animation generator translates these measurements into different rendering attributes, which determine the color and location of the animations. Finally, two projectors, covering the left and right halves of the wall respectively, project the animations back onto the wall in synchronization with the user's voice.

The voice processing section and the animation generating section work in parallel, each with powerful computing capability, so users can interact with the virtual representations in real time.
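The paper does not specify the wire format of this link; as a rough illustration, each frame's result could be packed into a small datagram. The UDP transport, address and field layout below are assumptions made for the sketch, not the authors' implementation.

```python
# Hypothetical voice-processor -> animation-generator link; the paper
# only says measurements are sent "via network", so everything concrete
# here (UDP, port, field layout) is assumed for illustration.
import socket
import struct

ANIMATION_GENERATOR = ("192.168.0.2", 9000)  # hypothetical address/port

def send_measurement(sock, x_cm, y_cm, energy_level):
    """Send one frame's result: source position (cm) and energy level (1-7)."""
    packet = struct.pack("!ffB", x_cm, y_cm, energy_level)
    sock.sendto(packet, ANIMATION_GENERATOR)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_measurement(sock, 310.0, 150.0, 4)  # example frame
```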

2.1. Sound localization and analysis

In the voice processing section, we use a microphone array system to calculate the location of the sound source in the two-dimensional plane parallel to the ground and to extract sound features. The microphone array consists of fourteen small microphones embedded in the wall at a height of 160 cm, with an inter-sensor separation of 40 cm. The advantage of using a microphone array is that it frees users from head-mounted microphones and other complicated equipment, so they can use the system more naturally.

There are many different techniques to detect the location of a sound source, such as steered beamformers [4], high-resolution techniques [5], and time-difference-of-arrival (TDOA) procedures [6]. However, each of them has its own limitations. The steered beamformer is highly dependent on the spectral content of the source signal and extremely sensitive to the initial search location. High-resolution techniques are conventionally designed for narrowband signals, but human voices are wideband signals, so sophisticated generalizations are needed, which extend the computational requirements considerably. Although TDOA procedures possess a significant computational advantage, the time delay estimation is not robust in a room environment.

In our system, we choose the SRP-PHAT [7] algorithm, which combines the simplicity of the steered beamformer approach with the robustness offered by the phase transform (PHAT) [8] weighting. We suppose that only one user at a time speaks in front of the wall and that the farthest position a user can stand is 4 meters from the wall, so the effective interaction area is about 24 square meters. The area is first split into small regions, each 10 cm wide and 10 cm long. Then, the theoretical delays from each possible exploration region to each microphone pair are pre-computed and stored. When a user speaks or makes noise in the interaction area, the steered response power (SRP) with PHAT weighting is calculated for each exploration region using the pre-computed delays. In this way, we obtain a sound map like the one shown in Figure 4. Finally, the peak of the sound map is selected as the estimated position of the sound source.


Figure 4. Example of a sound map obtained

The SRP-PHAT process is computed every 32 ms, so the position of the sound source can be updated in real time and users will not be aware of the latency. The system can be extended to calculate the height of the sound source as well, provided additional microphones are set up on the wall in vertical lines.
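To make the procedure concrete, the following is a minimal numpy sketch of the grid search under stated assumptions: the fourteen-microphone geometry, 10 cm grid and 4 m depth follow the paper, while the 16 kHz sample rate, 512-sample frame (about 32 ms) and speed of sound are assumed values.

```python
# Minimal SRP-PHAT sketch. Array geometry (14 mics, 40 cm spacing) and
# the 10 cm exploration grid follow the paper; sample rate, frame length
# and speed of sound are assumptions.
import numpy as np

FS = 16000                 # sample rate in Hz (assumed)
FRAME = 512                # ~32 ms per frame at 16 kHz
C = 343.0                  # speed of sound in m/s
MICS = np.array([[0.4 + 0.4 * i, 0.0] for i in range(14)])  # (x, y) in m, wall at y = 0

xs = np.arange(0.05, 6.0, 0.1)      # along the wall
ys = np.arange(0.05, 4.0, 0.1)      # distance from the wall (max 4 m)
GRID = np.array([(x, y) for x in xs for y in ys])
PAIRS = [(i, j) for i in range(14) for j in range(i + 1, 14)]

# Theoretical inter-microphone delays (in samples) for every grid region,
# pre-computed once and stored, as described in the text.
DISTS = np.linalg.norm(GRID[:, None, :] - MICS[None, :, :], axis=2)
LAGS = {(i, j): np.round((DISTS[:, i] - DISTS[:, j]) / C * FS).astype(int)
        for (i, j) in PAIRS}

def gcc_phat(a, b):
    """PHAT-weighted cross-correlation of two frames (circular, via FFT)."""
    A, B = np.fft.rfft(a, 2 * FRAME), np.fft.rfft(b, 2 * FRAME)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12        # phase transform weighting
    return np.fft.irfft(cross)

def locate(frames):
    """frames: (14, FRAME) array of synchronized samples, one row per mic."""
    power = np.zeros(len(GRID))
    for (i, j) in PAIRS:
        cc = gcc_phat(frames[i], frames[j])
        power += cc[LAGS[(i, j)] % (2 * FRAME)]   # steered response power
    return GRID[np.argmax(power)]                 # peak of the sound map
```

Pre-computing the delay table once is what keeps the per-frame cost low enough for the 32 ms update cycle.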

After working out the position of the sound source, the theoretical delays from that position to each microphone pair can be acquired. Then, a delay-and-sum beamformer [4] is used to apply time shifts to the array signals to compensate for the propagation delays in the arrival of the source signal at each microphone. Once these audio signals are time-aligned, they are summed together to form a single output signal, which we can analyze to extract a variety of sound features. In the current version of our system, only the energy of the sound, which reflects the volume, is calculated; it is quantized into seven levels. The energy measurement and the location information are then sent to the animation generator over the local network.
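A companion sketch of the delay-and-sum step and the seven-level energy quantization follows; the sample rate, the speed of sound and the 40 dB quantization range are assumptions, since the paper does not give these values.

```python
# Delay-and-sum alignment and seven-level energy quantization (sketch).
# FS, C and the 40 dB quantization range are assumed values.
import numpy as np

FS, C = 16000, 343.0   # assumed sample rate (Hz) and speed of sound (m/s)

def delay_and_sum(frames, src, mics):
    """Time-align each channel to the estimated source position and sum.

    frames: (n_mics, n_samples), mics: (n_mics, 2) positions in m,
    src: (2,) estimated source position in m.
    """
    dists = np.linalg.norm(mics - src, axis=1)
    shifts = np.round((dists - dists.min()) / C * FS).astype(int)
    out = np.zeros(frames.shape[1])
    for channel, s in zip(frames, shifts):
        out += np.roll(channel, -s)    # advance later arrivals into alignment
    return out / len(frames)

def energy_level(signal, levels=7):
    """Quantize frame energy into seven levels over an assumed 40 dB range."""
    db = 10 * np.log10(np.mean(signal ** 2) + 1e-12)
    return int(np.clip(np.ceil((db + 40.0) / 40.0 * levels), 1, levels))
```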

2.2. Virtual representation

The goal of the animation generating section is to generate splendid animations according to the numerical measurements received from the voice processing section via the network. The animation is created using the Virtools platform [9], a solution for developing and deploying interactive 3D experiences on personal computers.

If no user interacts with the installation, the picture shown on the wall is a virtual lotus pool on a peaceful summer night. Moonlight cascades like water over the lotus leaves, which tremble slightly in the breeze. Several fireflies fly freely above the pool like glimmering stars in an azure sky. However, no lotus flower blooms in this beautiful scene at first; it is the user's role to create lotus flowers using voice and sound.

When a user speaks or makes any noise in front of the wall, a lotus bud appears right in front of him or her. The horizontal position of the bud is determined by the position of the sound source, while the vertical position is randomly assigned. Although the microphone array obtains the two-dimensional location of the sound source in the horizontal plane, we do not use the distance between the user and the wall. The lotus bud then blooms into a full lotus flower over the following several seconds, as shown in Figure 5. At the same time, the moonlight turns to cascade over that area, making the lotus flower brighter and more radiant.

Figure 5. Blooming process of the animated lotus flower

Once a lotus flower finishes the blooming process, it splits into numerous petals, which spread throughout the whole wall before vanishing. A number of lotus flowers can appear and bloom simultaneously in the scene. When a user roams and speaks at the same time, the trajectory of the user is shown as a string of flowers on the wall.

The energy measurement of the sound, quantized into seven levels, is mapped to the color of the lotus flower. Users can change the color of the lotus by controlling their voices or the noises they make.
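As an illustration of this mapping, the sketch below converts one frame's measurements into hypothetical rendering parameters. The palette values are invented for the sketch; the actual color assignments and rendering live inside the Virtools scene.

```python
# Hypothetical measurement-to-rendering mapping; the seven palette
# colors are invented, and the real rendering is done in Virtools.
import random

WALL_WIDTH_CM = 680    # arc length of the wall (6.8 m)
PALETTE = ["#f8e1e7", "#f4bccb", "#ef93ac", "#e96a8d",
           "#d94f7e", "#c03a72", "#a22767"]   # one color per energy level

def lotus_params(x_cm, energy_level):
    """Map a frame's measurements to the parameters of a new lotus bud."""
    return {
        "x": x_cm / WALL_WIDTH_CM,           # horizontal: follows the source
        "y": random.random(),                # vertical: randomly assigned
        "color": PALETTE[energy_level - 1],  # energy level 1-7 -> color
    }
```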

3. Interaction process

Echo Wall enables people to interact with the installation directly using their voices. They can make whatever sounds they like to control the projected animation, ranging from speaking, singing and shouting to whispering, humming and imitating animals. They can even play musical instruments and watch the animation change along with the music.

Only the volume of a user's voice and the location of the user are tied to the animation control; the content and other parameters, such as pitch and duration, relate only to individual expression. Users may therefore treat the artwork as a good opportunity to release their energy, express their emotions, and impress or grab others' attention. Some users may enjoy shouting and attracting the audience's attention.


They do not need to face the audience directly; instead, they use the projected animation to communicate with the audience. This may eliminate the users' shyness and encourage them to express themselves more freely. The microphone array technology is another factor that frees the users, since they are not encumbered by any complicated devices: they can sing while dancing or shout while running.

Several users can participate in the interaction process together. Although they cannot speak or make noise at the same time, they can take turns. The voice processing section switches from one user to another so quickly that the users will not feel any latency. The lotus flowers generated by one user convey messages about the interaction process across the whole wall by splitting into numerous petals, so every user gets the messages no matter where they are standing. The rest of the users can then reply by creating their own lotus flowers with particular colors. In this way, a group of users can communicate with each other. The communication is not speech based, so the users are not restricted by limited languages or particular accents. They can use their own words, styles and characters; they can even invent a temporary language that does not actually exist.

The audience can also actively participate in the interaction process. Some audience members may find it interesting to interfere with the users by making noises; when voices overlap, the lotus flowers will appear in the wrong place. In fact, there is no significant difference between the users and the audience: they can swap roles as quickly as they wish, since no prior training is needed for a general audience to interact with Echo Wall. This can increase the audience's willingness to try new interactive experiences.

4. Conclusion

Echo Wall was exhibited at the Second Art and Science International Exhibition, held at Tsinghua University in November 2006, where it was well received. During the exhibition, numerous users participated in the interaction. They made a wide variety of sounds, including shouting, speaking and singing. Some users even used their mobile phones as sound sources.

Echo Wall can be applied in diverse domains. It provides a unique sound-to-visual transformation experience. Singers and instrument players may find it helpful for improving their performance by transforming their playing into beautiful visual aesthetics. Echo Wall is also a great tool for encouraging teamwork. A single user may find it hard to generate a splendid picture alone in front of such a large wall, since he or she has to run while speaking in order to keep lotus flowers blooming at different parts of the wall. In this situation, most users would choose either to recruit friends or to ask strangers to play. Through this communication and collaboration, a teamwork spirit can be cultivated unconsciously.

Future development of Echo Wall will focus on expanding the mapping between human sound and the visual output. More features can be extracted from the human sound, such as pitch, duration and the emotion expressed. By transforming these new features in a proper way, the interaction process can be significantly enriched. Furthermore, we can introduce feedback beyond visual aesthetics, such as music. An enhanced version of Echo Wall will help us better explore the relationship between human sound and the virtual world.
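As one example of such an extension, a pitch feature could be estimated from the beamformer output with a simple autocorrelation method. This is a speculative sketch of future work under an assumed 80-400 Hz search band, not part of the current system.

```python
# Speculative pitch extraction for a future version; the search band
# and sample rate are assumptions.
import numpy as np

def estimate_pitch(frame, fs=16000, fmin=80, fmax=400):
    """Return an estimated fundamental frequency in Hz, or None."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    if hi >= len(ac):
        return None
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag if ac[lag] > 0 else None
```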

Acknowledgement

This research is supported by the Program for New Century Excellent Talents in University, No. NCET-04-0079, and a Tsinghua University Research Grant.

References

[1] G. Levin and Z. Lieberman, "In-Situ Speech Visualization in Real-Time Interactive Installation and Performance", Proceedings of the 3rd International Symposium on Non-photorealistic Animation and Rendering, Annecy, France, 2004, pp. 7-14.
[2] G. Niemeyer, D. Perkel, R. Shaw and J. McGonigal, "Organum: individual presence through collaborative play", Proceedings of the 13th Annual ACM International Conference on Multimedia, Singapore, 2005, pp. 594-597.
[3] Sama'a Al Hashimi and G. Davies, "Vocal telekinesis: physical control of inanimate objects with minimal paralinguistic voice input", Proceedings of the 14th Annual ACM International Conference on Multimedia, Santa Barbara, USA, 2006, pp. 813-814.
[4] G. Clifford Carter, "Variance bounds for passively locating an acoustic source with a symmetric line array", Journal of the Acoustical Society of America, 1977, Vol. 62(4), pp. 922-926.
[5] S. Haykin, Adaptive Filter Theory (3rd ed.), Prentice Hall, 1996.
[6] Michael S. Brandstein and Harvey F. Silverman, "A practical methodology for speech source localization with microphone arrays", Computer Speech and Language, 1997, Vol. 11(2), pp. 91-126.
[7] Joseph Hector DiBiase, A High-Accuracy, Low-Latency Technique for Talker Localization in Reverberant Environments Using Microphone Arrays, Ph.D. thesis, Brown University, 2000.
[8] Charles H. Knapp and G. Clifford Carter, "The generalized correlation method for estimation of time delay", IEEE Transactions on Acoustics, Speech, and Signal Processing, 1976, Vol. 24(4), pp. 320-327.
[9] The Virtools platform.