
Computational Visual Media, Vol. 4, No. 3, September 2018, 197–208. https://doi.org/10.1007/s41095-018-0115-y

Research Article

Dance to the beat: Synchronizing motion to audio

Rachele Bellini¹ (✉), Yanir Kleiman¹, and Daniel Cohen-Or¹

© The Author(s) 2018. This article is published with open access at Springerlink.com

Abstract In this paper we introduce a video post-processing method that enhances the rhythm of a dancing performance, in the sense that the dancing movements are more in time to the beat of the music. The dancing performance as observed in a video is analyzed and segmented into motion intervals delimited by motion beats. We present an image-space method to extract the motion beats of a video by detecting frames at which there is a significant change in direction or motion stops. The motion beats are then synchronized with the music beats such that as many beats as possible are matched with as little time-warping distortion to the video as possible. We show two applications for this cross-media synchronization: one where a given dance performance is enhanced to be better synchronized with its original music, and one where a given dance video is automatically adapted to be synchronized with different music.

Keywords video processing; synchronization; motion segmentation; video analysis

1 Introduction

Dancing is a universal phenomenon, which crosses cultures, gender, and age. Dancing is even observed in some animals in the wild. We all appreciate and enjoy good dancing; however, an interesting question is what makes a dancing performance look good, and can we enhance a dancing performance as observed in a video? One critical aspect of a good dance performance is that the movement is in time, i.e., with good synchronization between the dancing movement and the music [1]. In this paper, we introduce a video post-processing technique that enhances the

1 Tel Aviv University, Tel Aviv 6997801, Israel. E-mail: R. Bellini, [email protected] (✉); Y. Kleiman, [email protected]; D. Cohen-Or, [email protected].

Manuscript received: 2018-02-03; accepted: 2018-03-27

rhythm of a dancing performance so that the dancing movements are in better time to the music.

Studies dealing with dance analysis such as Kim et al. [2], Shiratori et al. [3], and most recently Chu and Tsai [4] agree that key dance poses occur at stops or turns in the performer's movement trajectories. Furthermore, dance movements, and human movements in general, can be segmented into primitive motions [5]. These key poses in the dance motion are said to be the motion beats. In the presence of music, the motion beats should be in synchrony with the rhythm of the music as defined by the music beats [2, 3, 6] (Fig. 1).

In the last decades, interest in techniques that post-process images and videos has increased, particularly in techniques that attempt to enhance a subject captured in an image or a video: see, e.g., the works of Leyvand et al. [7] and Zhou et al. [8]. In our work, dance performance enhancement is applied at the frame level without manipulating the context of the frames. The dancing performance as observed in a video is analyzed and segmented into motion intervals delimited by the motion beats. The rhythm of the dance is enhanced by manipulating the temporal domain of each motion interval using a dynamic time-warping technique similar to that used in Zhou et al. [9, 10]. The challenge is twofold: (i) extracting the motion beats from the video sequence; and (ii) defining and optimizing an objective function for synchronizing the motion beats with the music beats.

To extract the motion beats from the video sequence, we analyze the optical flow in the frames of the video at the pixel level; unlike the work of Chu and Tsai [4], where motion beats are tracked by object-space analysis, here the analysis is applied in image space, bypassing the difficulties incurred in tracking objects in videos. We analyze the optical flow across a range of frames to detect significant changes in


Fig. 1 A music signal (above), and a sequence of frames from both the input video (middle) and output video (below); the music beat is marked by a red frame. The output video is synchronized with the music beat, while in the input video, the motion beat, marked by a blue frame, occurs 3 frames before the music beat.

directions, which indicate locations of key poses in the video. Our method is capable of detecting fast changes in direction, as well as gradual changes, and long pauses in the dance followed by a change in direction.

Given the music and motion beat streams, we find a mapping between motion beats and music beats that minimizes the time-warping distortion of the video. We introduce an objective function that aims to maximize the number of good matches between motion beats and music beats. This optimization is constrained in the sense that matches are not enforced; only good matches are allowed, which may leave some motion beats unmatched. Using the mapping between motion beats and music beats, we adapt the video to the music using a time-warping technique, stretching or contracting each segment of the video sequence to align the motion beats with the music beats to which they are mapped.

We show two applications for this cross-media synchronization: in one, a given dance performance is enhanced to be better synchronized with its original music, and in the other, a given dance video is automatically adapted to be in synchrony with different music. A blind user study was conducted, in which users were requested to compare the input and output videos, and decide which was better synchronized with the music. The results show improvement for most videos tested, as explained in detail later. All of our input and output videos can be found in the Electronic Supplementary Material (ESM) and the online project page of this paper at http://sites.google.com/site/dancetothebeatresults.

2 Related work

2.1 Synchronizing video and music

Recent years have seen a vast increase in the availability of digital video and audio. This has led to a rising interest in video editing techniques for researchers and end users alike. Several recent works have been dedicated to the concept of synchronizing video files with audio files to create music videos or to enhance the entertainment value of a video. Yoon et al. [11] create a music video by editing together matching pairs of segmented video clips and music segments in a sequence. Matching is performed using features such as brightness and tempo which are extracted from the video and audio signals. Their previous work [12] matches a given video with generated MIDI music that matches the motion features of the video. The motion features are tracked with the help of user interaction. Jehan et al. [13] present an application that creates a music video for given music by using preprocessed video clips with global speed adjustment as required to match the tempo of the music.

The more recent work of Chu and Tsai [4] extracts the rhythm of audio and video segments in order to replace the background music or to create a matching music video. After the rhythm is extracted from the audio and video segments, each audio segment is matched to a video segment shifted by the constant time offset that produces the best match for the music. This method works well for the creation of a music video where a large set of video clips or audio clips is available. However, if only a single video clip and a single audio clip are available, matching is


inexact. In contrast, our method processes a single video segment by locally warping its speed along the video, to produce a natural looking video that matches the given music beat by beat.

In very recent work, Suwajanakorn et al. [14] present a method to morph facial movements, matching them to an audio track. A neural network is trained to find the best match between mouth shapes and phonemes, while the head movements are warped using dynamic programming. While the results are impressive, their goal and end results are quite different to those of the work presented here, which is focused on matching a given video to a soundtrack without altering its content.

Other works involve synchronization of a pair of video sequences or motion capture data [9, 10, 15–17]. They extract a feature vector from each frame and align the dense signals, as opposed to the sparse stream of motion beats used in our work. Zhou and De la Torre [10] allow the alignment of feature sequences of different modalities, such as video and motion capture; however, their technique assumes a strong correlation between the sequences, e.g., a pair of sequences describing the same actions in video and motion capture streams. To synchronize the two streams, these methods allow the alteration of the speed of both sequences. Humans tend to be less tolerant of irregularities in music rhythm than in video appearance; therefore in our method, the music stream is kept intact to avoid irregular and unnatural sounding audio. Furthermore, we enforce additional constraints on the video to ensure noticeable changes do not occur. Wang et al. [17] find the optimal matching between two or more videos based on SIFT features, warping segments of the videos as necessary. While such features produce good results when aligning videos with similar content, they cannot be used in video and music synchronization, since they do not capture temporal events.

Lu et al. [18] present a method to temporally move objects in a video. Objects of interest are segmented in the video, and their new temporal location is found by optimizing the user's input with some visual constraints. This kind of approach, however, is focused on the overall movement of a single object, approximating it to a rigid body. In our method, instead, we aim to process complex movements, in which different parts of objects have different motions (e.g., a person dancing).

2.2 Beat extraction

Our method is based on synchronizing motion beats to music beats. Firstly, we detect motion beats from the video and music beats from the audio. There is a body of work regarding motion beat detection from motion capture data; works such as Kim et al. [2] and Shiratori et al. [3, 19] detect motion beats from motion capture data to analyze dance motion structures, in order to synthesize a novel dance or to create an archive of primitive dance moves.

Detecting motion beats from a video is a more challenging task, especially from amateur or home-made videos, which typically are contaminated with significant noise, moving background objects, camera movement, or incomplete views of the character. Chu and Tsai [4] track feature points in each frame and build trajectories of feature points over time. Motion beats are then detected as stops or changes in trajectory directions. Denman et al. [20] find motion clusters and use them to detect local minima in the amount of motion between frames.

Extraction of music beats, or beat tracking, is a well-researched problem in audio and signal processing. The aim of beat tracking is to identify instants of the music at which a human listener would tap his foot or clap his hands. McKinney et al. [21] provide an overview and evaluation of several state-of-the-art methods for beat tracking. These methods usually discover the tempo of the music, and use it to extract music beats that are well matched to the tempo.

The work of Ellis [22] provides a simple approach that, after discovering the tempo, uses dynamic programming to produce a series of beats which best match the accent of the music. We use the code provided by the author for our music beat tracking.
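
For readers who wish to reproduce this step without the authors' MATLAB setup, a minimal sketch follows using the librosa Python library, whose beat tracker is likewise based on Ellis's dynamic-programming formulation; the file name and frame rate are illustrative stand-ins:

import librosa

# Load the soundtrack and track its beats (tempo estimation followed by
# dynamic programming over onset strength, following Ellis [22]).
y, sr = librosa.load("music.wav")
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)  # in seconds

# Convert beat times to video frame indices (assuming, e.g., 25 fps),
# giving the music beat sequence A_i used in Section 4.
fps = 25
A = (beat_times * fps).round().astype(int)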

3 Motion-beat extraction

3.1 Outline

The motion beats of a dance are defined as frames in which a significant change in direction occurs. Often, there is a motion stop for a few frames in between a significant change in direction. Therefore, to detect a motion beat, we consider a non-infinitesimal time range that consists of a number of frames. A direction-change score is defined for every frame, and computed over multiple time ranges to detect direction changes of various speeds. A low score indicates a strong


change in direction. Motion beats are then detected as local temporal minima of the frame-level score, ensuring detection of the most prominent changes in direction over time.

3.2 Super-pixel optical flow

Extracting the motion beats from a large variety of videos is a challenging task. A video can be of low quality and low resolution, such as one captured by a smartphone. We assume that the input video consists of fast motions of an object over a natural background. However, the motions of the object do not necessarily follow a fixed rhythm. Since the moving object is also not necessarily human, we do not operate in object space, but in image space.

Our technique is based on analyzing the optical flow of the video. In particular, we use the MATLAB implementation of the Horn–Schunck optical flow technique. Since the video may be noisy, we first apply a median filter and consolidate the optical flow by considering spatio-temporal super-pixels. Each super-pixel is a 5 × 5 block of pixels over five frames. We assign to each pixel the mean pixel-level optical flow over the spatio-temporal super-pixel centered around it. These super-pixel optical flow values are considerably less noisy, yet still fine enough to faithfully measure the local motion direction. Note that the motion analysis is meant to identify the motion beats at the frame level rather than at the pixel level.
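
A sketch of this consolidation step, assuming the per-pixel flow has already been computed (e.g., by a Horn–Schunck implementation) as an array of shape (T, H, W, 2) holding the x and y flow components per frame; the median-filter kernel size is illustrative:

import numpy as np
from scipy.ndimage import median_filter, uniform_filter

def superpixel_flow(flow: np.ndarray) -> np.ndarray:
    """flow: (T, H, W, 2) per-pixel optical flow."""
    # Median filter to suppress impulsive noise, per frame and component.
    flow = median_filter(flow, size=(1, 3, 3, 1))
    # Mean over the 5-frame x 5x5-pixel super-pixel centered at each
    # pixel, applied independently to the two flow components.
    return uniform_filter(flow, size=(5, 5, 5, 1))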

The super-pixel optical flow of four frames is illustrated in Fig. 2(b), where each color depicts a different direction: blue for left and down, cyan for left and up, red for right and down, and yellow for right and up.

3.3 Frame-level motion beats

Frame-level motion beat extraction is applied at the window level. Frames are subdivided into a grid of 8 × 8 windows; we assume that in each window only one object is moving. Larger windows, or the entire frame for that matter, may contain multiple objects with opposing directions, for example a symmetrical movement of the hands. Such motion would nullify the average direction of movement. Within a small window only one object is moving, and the average direction has a meaningful value. The average directions of the windows are shown in Fig. 2(c), using the same color encoding as in Fig. 2(b).

Fig. 2 Motion-beat extraction. The original video (a) is displayed along with (b) its super-pixel optical flow, (c) the average motion direction for each window, and (d) the average direction change. The detected motion beat is marked with a red outline.

We define the direction-change score as the dot product of the normalized average motion vectors of two frames; the dot product is maximal when the motion vectors have the same direction and minimal when they have exactly opposite directions. We compute the score function at the window level. To reduce temporal noise in the direction of motion, the average direction is first smoothed over time using a Gaussian filter. Since a change in direction in natural movement is often gradual, we identify changes in direction that occur at varying speeds. We compare pairs of frames at various intervals centered around each frame in the video. The score for every frame is the minimum score among all intervals centered around that frame. Figure 3 shows a temporal window around frame i. Frame (i − 6) is compared to frame (i + 6), frame (i − 5) to frame (i + 5), and so on. The dot product for each pair of frames is shown above the video frames. The direction-change score for frame i is the minimum over all pairs, which is reached in this example for the pair {i − 4, i + 4}.
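
The following sketch computes this score for a single window, given its per-frame average motion vector; the Gaussian smoothing width and the maximum half-window of 6 frames (as in Fig. 3) are illustrative parameters:

import numpy as np
from scipy.ndimage import gaussian_filter1d

def direction_change_score(v: np.ndarray, max_half: int = 6) -> np.ndarray:
    """v: (T, 2) average motion vector of one window per frame."""
    v = gaussian_filter1d(v, sigma=1.0, axis=0)   # smooth direction over time
    u = v / np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-8)
    T = len(u)
    score = np.ones(T)                            # 1 = no change in direction
    for i in range(T):
        for r in range(1, max_half + 1):          # compare pairs (i-r, i+r)
            if i - r < 0 or i + r >= T:
                break
            # Dot product: +1 for identical directions, -1 for opposite ones;
            # the frame's score is the minimum over all interval widths.
            score[i] = min(score[i], float(u[i - r] @ u[i + r]))
    return score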

The measurements described above provide a score for the change of direction for every spatial window; a low value suggests a strong change. The frame-level score is simply the minimal score over all windows,


Fig. 3 Temporal window in which pairs of frames are compared. White arrows mark the approximate direction in which the parrot's head is moving. The dot product is computed for each pair of frames and the minimum is chosen as the direction-change score for that frame; the dot product is actually computed separately for each window in the 8 × 8 grid.

detecting local changes of direction over all areas of the frame. The global score for every frame in the video is displayed in Fig. 4, along with the detected local minima of the score, marked by the green vertical lines. The local minima are computed by finding the strongest negative peaks of the function that are at least five frames apart from each other. Figure 2(d) shows the color-coded score of the spatial windows in four frames. The negative values are shown in red, where a stronger color indicates a lower score, closer to −1. A motion beat is detected in the second frame from the left, which has a local temporal minimum of the frame-level change-of-direction score. Note that these scores are computed using a larger temporal window which is not shown in the figure.
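
In code, given one score curve per window, the frame-level score and the motion beats could be obtained as follows (a sketch; the five-frame minimum spacing comes from the text above):

import numpy as np
from scipy.signal import find_peaks

def detect_motion_beats(window_scores: np.ndarray) -> np.ndarray:
    """window_scores: (num_windows, T) direction-change scores."""
    frame_score = window_scores.min(axis=0)  # strongest change over windows
    # Negative peaks of the score (peaks of its negation) at least five
    # frames apart; these frame indices are the motion beats V_j.
    beats, _ = find_peaks(-frame_score, height=0, distance=5)
    return beats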

4 Synchronization

4.1 Approach

The detected motion beats effectively divide the video into a series of motion intervals, each starting one

frame after a motion beat and ending at the next motion beat. Music intervals can be defined similarly using the music beats. When a motion beat occurs on the same frame as a music beat, the dancing object appears to be in synchrony with the music. Our goal is thus to adjust the speed of each interval such that as many motion beats as possible occur on the same frame as music beats, with as little distortion as possible of each interval.

The output of the audio and video analysis is two sequences: the music beat times A_i and the motion beat times V_j. When two beats are matched, the relevant motion interval is stretched or contracted so that the motion beat occurs simultaneously with the music beat. The music signal is fixed and does not change; if the movements of the performer become slightly faster or slower, the final video can still be perceived as natural, while even the smallest modification to the music may be perceived as unpleasant. Respecting these asymmetric properties of the video and music streams, we differ from

Fig. 4 Frame-level direction-change score for each frame in the video, along with detected local minima of the score (marked by green vertical lines).


previous solutions offered to related tasks by works such as Zhou et al. [9, 10], and solve an optimization problem which better fits the task at hand.

To compute the optimal mapping between motion beats and music beats, a time-warping distortion of every matched pair is measured. Naturally, not every motion beat has a matching music beat, and vice versa. However, the mapping is monotonic, i.e., every matched motion beat is at a later time than the previous matched motion beat, and likewise for music beats. Therefore, to compute the time-warping distortion incurred by a matched pair, it is necessary to accumulate all motion intervals and music intervals, respectively, between the previously matched beats and the currently matched beats. We compute a compatibility score for the accumulated intervals, which is high when the time-warping distortion between the accumulated intervals is low. For further details see Section 4.2.

The global optimization problem is to find the mapping which maximizes the total compatibility score over all pairs. This encourages as many matched beats as possible, as long as they do not force a large time-warping distortion for other matched pairs. Local time-warping distortions between successive pairs are also subject to user-defined constraints. The user can define separate constraints for speeding up the video and for slowing it down, since different videos tolerate different amounts of speed change before losing their natural appearance, depending on their content. Once the beats have been analyzed and matched, the video is processed by contracting or stretching motion intervals so that the motion beats occur simultaneously with their matched music beats. The output is a time-warped version of the original video that now fits the given music.
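
As an illustration of this final warping step, the sketch below maps every output frame to an input frame so that each matched motion beat V_j lands on its music beat A_i. It assumes integer beat positions in frames, strictly increasing and starting after frame 0, nearest-neighbor frame selection, and an unmatched tail that keeps its original speed; all of these are simplifications of the paper's procedure:

import numpy as np

def timewarp_indices(num_frames: int, V: np.ndarray, A: np.ndarray) -> np.ndarray:
    """V, A: matched motion/music beat positions in frames, increasing."""
    # Anchors: input frame V_j must appear at output frame A_i; the tail
    # after the last matched beat is kept at its original speed.
    src = np.concatenate(([0], V, [num_frames - 1]))
    dst = np.concatenate(([0], A, [A[-1] + (num_frames - 1 - V[-1])]))
    # Piecewise-linear stretching/contraction of each motion interval.
    out = np.interp(np.arange(dst[-1] + 1), dst, src)
    return out.round().astype(int)  # input frame index per output frame

The output video is then assembled by emitting the input frames in the order given by timewarp_indices.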

4.2 Compatibility score

The compatibility score is defined for every possible monotonic mapping between the motion beats and music beats. It measures the total amount of distortion over all pairs of matched beats. A high compatibility score means low distortion, so maximization of the compatibility score encourages as many matching beats as possible, as explained above.

Formally, for every music beat A_i and motion beat V_j we define the lengths in frames of their corresponding intervals:

l_a(i) = A_i − A_{i−1},   l_v(j) = V_j − V_{j−1}

Throughout this section, i is the index of the current audio beat and j is the index of the current video beat. The initial intervals are

l_a(1) = A_1,   l_v(1) = V_1

Every mapping is a sequence of pairs {i(m), j(m)}, 1 ≤ m ≤ k, where for every m the music beat A_{i(m)} is matched with the motion beat V_{j(m)}. For clarity, we define the aggregations of several intervals between two beats as

L_a(i_1, i_2) = Σ_{i_1 < i ≤ i_2} l_a(i),   L_v(j_1, j_2) = Σ_{j_1 < j ≤ j_2} l_v(j)

Note that i_1 and j_1 can be zero, in which case the aggregation starts from the first interval. The compatibility score C for an arbitrary pair of sequences starting from {i_1, j_1} and ending at {i_2, j_2} is then:

C(i_1, i_2, j_1, j_2) = min{L_a(i_1, i_2), L_v(j_1, j_2)} / max{L_a(i_1, i_2), L_v(j_1, j_2)}

The score C(i_1, i_2, j_1, j_2) for two pairs of beats {i_1, j_1}, {i_2, j_2} indicates the distortion incurred by this sequence of intervals, which includes no other matching pair in between. Scores lie within the range [0, 1]. A high value indicates low distortion; a perfect match, in which the video is neither stretched nor contracted, has a score of one. We can now solve the global optimization problem by maximizing the total score over all pairs associated with a mapping.

4.3 Dynamic programming solution

The sequence with maximal score up to any pair of beats {A_i, V_j} can be computed regardless of any matched pairs that are selected afterwards. We can thus solve the optimization problem by use of dynamic programming.

We define a global score function F(i, j) that aggregates the values of the compatibility score for sequences up to pair {A_i, V_j}. The maximization problem for each pair is then

F(i, j) = max over 1 ≤ k ≤ i, 1 ≤ l ≤ j of {F(i − k, j − l) + C(i − k, i, j − l, j)}

with initial values

F(0, j) = 0,   F(i, 0) = 0

During the computation of each pair, the distortion constraints are tested, and pairs that violate the constraints are discarded. Since the maximization of F at each step depends on i × j values, the theoretical complexity of the computation is O(n_a² n_v²), where n_a is the number of music beats and n_v is the number of motion beats, or O(n⁴) if n_a and n_v are both about n.

However, many possibilities can be ignored in practice. Between two consecutive matched pairs, it is unlikely that several beats in a row will be left unmatched for both music beats and motion beats. It is possible that a corresponding pair of motion beat and music beat are both left unmatched if their distortion score violates the constraints. However, due to the regularity of the music beats and the minimum distance between motion beats, it is improbable that the difference between consecutive matched pairs will be several motion beats and several music beats. It is far more likely that some motion beats which lie between two consecutive music beats are left unmatched, or vice versa. Hence, it is sufficient to maximize F(i, j) over the narrow band {i′, j − 1}, {i′, j − 2}, {i − 1, j′}, {i − 2, j′}, where 1 ≤ i′ < i and 1 ≤ j′ < j. This has a time complexity of O(n_a + n_v) for each pair {i, j}. The total complexity of such a solution is O(n_a n_v (n_a + n_v)), or O(n³) if n_a and n_v are both about n.
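A sketch of this narrow-band dynamic program follows, inlining the compatibility score defined above. The constraint ratios are illustrative user parameters, and the beat positions are assumed strictly increasing so that all interval sums are positive:

import numpy as np

def best_mapping(la, lv, max_speedup=1.25, max_slowdown=1.5):
    """la, lv: interval lengths of music/motion beats (0-indexed)."""
    la, lv = np.asarray(la), np.asarray(lv)
    na, nv = len(la), len(lv)
    F = np.full((na + 1, nv + 1), -np.inf)
    F[0, :] = 0.0                       # F(0, j) = 0
    F[:, 0] = 0.0                       # F(i, 0) = 0
    back = {}
    for i in range(1, na + 1):
        for j in range(1, nv + 1):
            # Narrow band: {i', j-1}, {i', j-2}, {i-1, j'}, {i-2, j'}.
            cands = {(ip, jp) for ip in range(i) for jp in (j - 1, j - 2) if jp >= 0}
            cands |= {(ip, jp) for jp in range(j) for ip in (i - 1, i - 2) if ip >= 0}
            for ip, jp in cands:
                La, Lv = la[ip:i].sum(), lv[jp:j].sum()
                # Discard pairs violating the user's distortion constraints.
                if Lv / La > max_speedup or La / Lv > max_slowdown:
                    continue
                c = min(La, Lv) / max(La, Lv)
                if F[ip, jp] + c > F[i, j]:
                    F[i, j] = F[ip, jp] + c
                    back[(i, j)] = (ip, jp)
    # Trace back from the best-scoring final pair.
    i, j = map(int, np.unravel_index(np.argmax(F), F.shape))
    pairs = []
    while (i, j) in back:
        pairs.append((i, j))            # music beat i matched to motion beat j
        i, j = back[(i, j)]
    return pairs[::-1]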

An example of the synchronization process is shown in Fig. 5; it is shown as a sequential process for clarity. The top row shows the original music beats and motion beats for the video sequence. Music beats marked in grey squares indicate beats that are already matched; those in black squares are yet unmatched. Motion beats in the video are marked by red highlights. In the second row, the second music beat at frame 20 is matched with the second

Fig. 5 Synchronization. See Section 4.3 for details.

motion beat at frame 22. The video is thus pushed backwards by two frames and the position of the next motion beat is modified accordingly. Note that the score of each matched pair is based on the distance between two consecutive beats and not on the absolute position of the beat. The fourth motion beat is now at frame 54, which is closer to 50 than to 60. The score for matching the beat to frame 50 is also a little higher. However, matching the motion beat to frame 50 would require shortening the video segment by 40%. The user chose to constrain the speed ratio of the video to no more than a 25% increase in speed, so this potential match is discarded. On the other hand, the constraint for stretching the video is looser, since slowing down the video by a certain ratio looks more natural to the viewer than speeding it up by the same ratio. Therefore, the beat is matched to the music beat at frame 60 and the video is stretched accordingly. If the user wished to also constrain stretching of the video to a lower ratio, this motion beat would not be matched to a music beat, and the next motion beat would be considered.

5 Experimental results

5.1 Setting

The performance of the presented method was evaluated on a number of videos representing a variety of types of dancing, motions, rhythms, environments, and subjects. The videos also varied in quality: some videos were of commercial production level, while others were of home video quality, captured with a hand-held camera or even a smartphone, and contained significant noise. Figure 6 shows a few frames from each video to capture its essence. For further evaluation, we refer the reader to the accompanying videos in the ESM and the online project page mentioned above, where all of the examples can be found and evaluated. We experimented with two types of application, as described in the following.

5.2 Performance enhancement

Here we consider a video containing a dance or rhythmical movements, which does not match the background music perfectly. The original video is then edited by our method, and the dancing enhanced to better fit the beats of the original music: the motion beats are adjusted to match the music beats.


Fig. 6 Typical frames from each input video; the center frame shown for each video occurs at a motion beat. Note that we can handle videos with a moving camera (such as the first video) and significant motion blur (such as the parrot videos).

Firstly, we present two rather simple examples, in which the enhancement can be clearly appreciated. The performing subject in the first example is a well-known parrot named Snowball, a sulphur-crested cockatoo whose renowned dancing skills have been studied and described in several academic papers [23–25]. Although the parrot has an extraordinary ability to move according to the music beat, its performance

is still imprecise. Snowball is famous enough to have been cast for a Taco Bell commercial in 2009, where he dances along with the song "Escape (the Pina Colada song)" by Rupert Holmes. Some of the movements of the parrot in the video are irregular, so some motion beats are not synchronized with the music beats. As can be observed, our method modifies the video so that the movements become more rhythmical and better synchronized with the given song.

The subject of the second video is another type of parrot, a bare-eyed cockatoo, which dances to the song "Shake a Tail Feather" by Ray Charles. The parrot does not have a constant rhythm, and misses the music beats quite often. Our method greatly enhances its performance.

The third example is of a human dancer performing in front of a live audience. The segment is a part of a performance by Judson Laipply called "The Evolution of the Touchdown Dance", which includes memorable NFL touchdown dances. The dancer mostly keeps the beat quite well, but in the second part of the video segment the contribution of our method is clear. It should be noted that our method does not interfere with the original video where it is well synchronized. Therefore, in this particular example, the first half of the output video looks almost the same as the input, while in the second half there is a noticeable difference.

5.3 Dance transfer

The second set of experiments transfers a dance to new background music. The input dance is modified to match the new rhythm and audio beats of the new music. Again, we begin with a few simple and clear examples. The first is a video of a man doing push-ups. After a while he gets tired and does not keep a constant rhythm in the original video. This is particularly evident around the 15th second. We synchronized it to the song "Where is the Love?" by The Black Eyed Peas. The song has a slow rhythm, so the beginning is slowed down: the input video shows ten push-ups in the first ten seconds, while the output shows only eight. However, the same rhythm is kept until the end of the video, and the man does not seem to get tired as in the original video. When synchronizing the same input video to the song "Billie Jean" by Michael Jackson, the results differ significantly. The rhythm of the song is somewhat faster, and thus the push-ups happen at


a faster rate of ten push-ups in the first ten seconds. The rhythm of the push-ups is again more regular than in the input video. Note that since the rate of the push-ups is slower as the video progresses, a music beat is skipped once in a while, and a motion beat is matched with the next music beat, while still maintaining the constant rhythm.

Another clear example is given by the same Taco Bell commercial used for the first type of experiment. This time the background song is changed to the song "Trashin' the Camp" by Phil Collins. In the original version the movements of the parrot do not fit with the rhythm of the music, since they are shifted in time and their rhythm is rather faster than the music rhythm. In the output video the movements are slowed down a little and time-shifted to match the new music.

We also show a few experiments with videos of professional human dancers. The first experiment with a human dancer uses a video of Caren Calder, a dance instructor, performing a traditional West African dance. We synchronize her dance to "Oh! Darling" by The Beatles, which is a famous song from a very different culture. In the original video the dancer's hand clapping and music beats are not matched, while in the new version her movements and hand clapping occur to the beat of the music.

Another dance experiment is conducted again with footage from the performance of Judson Laipply, "The Evolution of the Touchdown Dance", using two different segments of this performance. One has been synchronized to "Billie Jean" by Michael Jackson, and the other to "Pantala Naga Pampa" by The Dave Matthews Band. With the first song the changes are very clear, since the original video does not match the music. With "Pantala Naga Pampa", as in the experiment using the same video segment with the original music, the dancer's movements match the music quite well, and the changes in some parts of the dance are not easily noticeable. However, there are a few changes: at the very beginning, in the original video, the first jump occurs within the silence before the beginning of the song, while in the new version the first jump occurs on the first beat of the song. In the second part of the video the dancer jumps a few times with irregular rhythm. This part is significantly changed to better fit the new music.

The last video we present here is a jazz-funk dance online video of the Dr Pepper Cherry YouTube Dance Studio. The original background music is replaced and the dance is synchronized with the song "Hayling" by FC Kahuna. The original dance is quite fast, while the music we chose has very few beats per minute. In this case, since the dancers keep a rather constant rhythm, the original performance fits almost perfectly to the new music. Thus, the changes are minute and hard to perceive without careful observation of the input and output videos side by side. The user study shows most people believe the output video is better synchronized with the music, even though the differences between them are small.

We refer the reader again to the ESM or the projectpage to view the example videos described above.

5.4 User study

A blind user study was conducted for each of the videos described above, to which 26 users replied. Users were presented with pairs of videos, and were requested to rate which of the videos better matched the music, without knowing which video was the enhanced version and which was the original. The results of the user study are displayed in Fig. 7. The first three videos, labeled Parrot 1, Parrot 2, and Touchdown, were examples of performance enhancement of two different parrots and a segment from "The Evolution of the Touchdown Dance". In these videos the background music remained the same as in the original video. The remaining videos were examples of music replacement.

Fig. 7 User preferences when comparing the original version to the enhanced version of each video.


As can be seen from the figure, the enhanced videos were usually preferred to the corresponding input videos. Our most successful examples were ones in which the dance performance deviated significantly from the rhythm of the song. The African dance video has a slow rhythm, and so does the song we matched it to. Therefore users can easily identify where a music beat is missed in the original performance, and it is easier to evaluate the differences between the videos. In the push-up videos, which also received positive reviews, the person working out becomes tired and does not keep a constant rhythm, which again enables the user to clearly see where a music beat is missed.

Conversely, for video segments which maintain a fast fixed rhythm, it is harder for the human mind to distinguish between different motion beats and separate them from the rhythm of the attached music. Even a professional video editor may require several viewings to determine whether a certain motion beat occurs simultaneously with a music beat. However, it can often be felt subconsciously whether a certain video is in sync with the music or not. An interesting example is the jazz dance performance; the rhythm of the dance is very fast, and the replaced music matches the original video quite well. Only minor corrections are applied by our synchronization algorithm, and the output video looks almost identical to the original video: the differences between the videos are only noticeable when the videos are viewed simultaneously side by side. Yet in our user study, more than 60% of the users preferred the enhanced video to the original.

A noticeable failure shows that sometimes keeping the beat is not enough to present a good dance performance. In the video of the parrot Snowball, the head motions do not follow the beat particularly well. However, the parrot changes his dancing style noticeably as the chorus of the song begins. In our enhanced version, the individual motions of the parrot's head are better synchronized with the music, but the change of dancing style occurs a few music beats later than the beginning of the chorus. The users did not react well to that change and preferred the original performance of the parrot. A possible improvement to our method in such cases is to synchronize motion beats to music beats only within segmented non-overlapping sections of video.

5.5 Implementation

Our algorithm was implemented in MATLAB. We ran all experiments on an Apple MacBook Pro with a 2.4 GHz Intel Core 2 Duo processor and 4 GB of RAM, running the Mac OS X 10.6.8 operating system. The experiments were mostly run on videos of about 500 frames, or 20 seconds, at 640 × 360 resolution. The bottleneck in our implementation is the video editing itself, i.e., reading and writing video frames, and video compression. This could be vastly improved by using other video compression methods (e.g., an image sequence) and faster editing methods. We estimate that the algorithm itself, without the video editing, i.e., detecting the motion beats and music beats and finding an optimal mapping between them, requires about two minutes for each video above.

6 Conclusions and future work

We have presented a method to enhance a video of a dancing performance. The key idea is to synchronize the motion beats to the music beats so that the dance rhythm is better correlated with the given music. Detected motion beats are used to delimit motion intervals. However, while music beats are well understood, the notion of motion beats is not yet fully established; there are other possible definitions for motion beats, e.g., using the centers, rather than the ends, of motion segments. How to segment motion in video remains an interesting open problem. Clearly, segmenting motion of articulated bodies can be performed better in object space. However, this requires identifying and tracking the skeleton of the dancing body, which is known to be a difficult task. In our work, we analyze the video frames directly, detecting motion beats in image space. The main advantage of our approach is that it is not limited to human bodies, or tailored to a particular subject known a priori. However, in cases of extreme camera movements, some motion beats may not be detected. As mentioned in Section 3.2, we assume that the foreground objects in the input videos move more quickly than the background. In future, we aim to consider integrating object-space analysis, to see whether it can significantly improve the segmentation of the dancing subject.

Another interesting direction for future work isintra-frame enhancement of videos containing more


than a single dancing performer. Here we can use the extracted motion beats, and synchronize two or more motions in the video with each other as well as with the background music. In this case, the correct synchronization will be more evident than in a cross-media synchronization.

The motion beats we detect can also be helpful outside the task of synchronization; one such possible application is global time remapping for a video. When stretching or contracting a video, deformation can be concentrated around the motion beats. Video segments where continuous motion occurs would retain their natural speed, while motion stops, or segments where there is a change in direction, could be prolonged or shortened. Since the speed of motion is low in such segments, the change of speed would be less noticeable and more natural looking.

Electronic Supplementary Material Supplementary material is available in the online version of this article at https://doi.org/10.1007/s41095-018-0115-y.

References

[1] Repp, B. H. Musical synchronization. In: Music, Motor Control and the Brain. Altenmuller, E.; Wiesendanger, M.; Keselring, J. Eds. Oxford University Press, 55–76, 2006.

[2] Kim, T.-h.; Park, S. I.; Shin, S. Y. Rhythmic-motion synthesis based on motion-beat analysis. ACM Transactions on Graphics Vol. 22, No. 3, 392–401, 2003.

[3] Shiratori, T.; Nakazawa, A.; Ikeuchi, K. Dancing-to-music character animation. Computer Graphics Forum Vol. 25, No. 3, 449–458, 2006.

[4] Chu, W.-T.; Tsai, S.-Y. Rhythm of motion extraction and rhythm-based cross-media alignment for dance videos. IEEE Transactions on Multimedia Vol. 14, No. 1, 129–141, 2012.

[5] Flash, T.; Hogan, N. The coordination of arm movements: An experimentally confirmed mathematical model. Journal of Neuroscience Vol. 5, No. 7, 1688–1703, 1985.

[6] Jones, M. R.; Boltz, M. Dynamic attending and responses to time. Psychological Review Vol. 96, No. 3, 459–491, 1989.

[7] Leyvand, T.; Cohen-Or, D.; Dror, G.; Lischinski, D. Digital face beautification. In: Proceedings of the ACM SIGGRAPH 2006 Sketches, Article No. 169, 2006.

[8] Zhou, S.; Fu, H.; Liu, L.; Cohen-Or, D.; Han, X. Parametric reshaping of human bodies in images. ACM Transactions on Graphics Vol. 29, No. 4, Article No. 126, 2010.

[9] Zhou, F.; Torre, F. Canonical time warping for alignment of human behavior. In: Proceedings of the Advances in Neural Information Processing Systems, 2286–2294, 2009.

[10] Zhou, F.; De la Torre, F. Generalized time warping for multi-modal alignment of human motion. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1282–1289, 2012.

[11] Yoon, J.-C.; Lee, I.-K.; Byun, S. Automated music video generation using multi-level feature-based segmentation. In: Handbook of Multimedia for Digital Entertainment and Arts. Furht, B. Ed. Springer, 385–401, 2009.

[12] Yoon, J.-C.; Lee, I.-K.; Lee, H.-C. Feature-based synchronization of video and background music. In: Advances in Machine Vision, Image Processing, and Pattern Analysis. Lecture Notes in Computer Science, Vol. 4153. Zheng, N.; Jiang, X.; Lan, X. Eds. Springer Berlin Heidelberg, 205–214, 2006.

[13] Jehan, T.; Lew, M.; Vaucelle, C. Cati dance: Self-edited, self-synchronized music video. In: Proceedings of the ACM SIGGRAPH 2003 Sketches & Applications, 1–1, 2003.

[14] Suwajanakorn, S.; Seitz, S. M.; Kemelmacher-Shlizerman, I. Synthesizing Obama: Learning lip sync from audio. ACM Transactions on Graphics Vol. 36, No. 4, Article No. 95, 2017.

[15] Caspi, Y.; Irani, M. Spatio-temporal alignment of sequences. IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 24, No. 11, 1409–1424, 2002.

[16] Slaney, M.; Covell, M. FaceSync: A linear operator for measuring synchronization of video facial images and audio tracks. In: Proceedings of the 13th International Conference on Neural Information Processing Systems, 784–790, 2000.

[17] Wang, O.; Schroers, C.; Zimmer, H.; Gross, M.; Sorkine-Hornung, A. VideoSnapping: Interactive synchronization of multiple videos. ACM Transactions on Graphics Vol. 33, No. 4, Article No. 77, 2014.

[18] Lu, S.-P.; Zhang, S.-H.; Wei, J.; Hu, S.-M.; Martin, R. R. Timeline editing of objects in video. IEEE Transactions on Visualization and Computer Graphics Vol. 19, No. 7, 1218–1227, 2013.

[19] Shiratori, T.; Nakazawa, A.; Ikeuchi, K. Detecting dance motion structure through music analysis. In: Proceedings of the 6th IEEE International Conference on Automatic Face and Gesture Recognition, 857–862, 2004.

[20] Denman, H.; Doyle, E.; Kokaram, A.; Lennon, D.; Dahyot, R.; Fuller, R. Exploiting temporal discontinuities for event detection and manipulation in video streams. In: Proceedings of the 7th ACM SIGMM International Workshop on Multimedia Information Retrieval, 183–192, 2005.

[21] McKinney, M. F.; Moelants, D.; Davies, M. E. P.; Klapuri, A. Evaluation of audio beat tracking and music tempo extraction algorithms. Journal of New Music Research Vol. 36, No. 1, 1–16, 2007.

[22] Ellis, D. P. W. Beat tracking by dynamic programming. Journal of New Music Research Vol. 36, No. 1, 51–60, 2007.

[23] Patel, A. D.; Iversen, J. R.; Bregman, M. R.; Schulz, I. Experimental evidence for synchronization to a musical beat in a nonhuman animal. Current Biology Vol. 19, No. 10, 827–830, 2009.

[24] Patel, A. D.; Iversen, J. R.; Bregman, M. R.; Schulz, I. Studying synchronization to a musical beat in nonhuman animals. Annals of the New York Academy of Sciences Vol. 1169, No. 1, 459–469, 2009.

[25] Patel, A. D.; Iversen, J. R.; Bregman, M. R.; Schulz, I.; Schulz, C. Investigating the human-specificity of synchronization to music. In: Proceedings of the 10th International Conference on Music Perception and Cognition, 100–104, 2008.

Rachele Bellini received her B.Sc. degree cum laude in digital communication in 2012 and her M.Sc. degree cum laude in computer science in 2015, both from the University of Milan. Since 2014 she has been collaborating with Cohen-Or's group at Tel Aviv University on texturing, image processing, and video analysis. She is currently working as a VFX Developer at Pixomondo LLC (Los Angeles).

Yanir Kleiman obtained his Ph.D. degree from Tel Aviv University in 2016, under the supervision of Prof. Daniel Cohen-Or. He was a postdoctoral researcher at the Ecole Polytechnique in France in 2016 and 2017. He is currently a graphics software developer at Double Negative Visual Effects. His research focuses on shape analysis, including shape similarity, correspondence, and segmentation, and has included work in other domains such as image synthesis, crowd sourcing, and deep learning. Prior to his Ph.D., Yanir had more than 10 years of professional experience as a software developer and visual effects artist.

Daniel Cohen-Or is a professor at the School of Computer Science, Tel Aviv University. He received his B.Sc. (cum laude) degree in mathematics and computer science and his M.Sc. (cum laude) degree in computer science, both from Ben-Gurion University, in 1985 and 1986, respectively. He received his Ph.D. degree from the Department of Computer Science at the State University of New York at Stony Brook in 1991. He received the 2005 Eurographics Outstanding Technical Contributions Award. In 2015, he was named a Thomson Reuters Highly Cited Researcher. Currently, his main interests are in image synthesis, analysis and reconstruction, motion and transformations, and shapes and surfaces.

Open Access The articles published in this journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Other papers from this open access journal are available free of charge from http://www.springer.com/journal/41095. To submit a manuscript, please go to https://www.editorialmanager.com/cvmj.