
Page 1

Automatic Musical Video Creation with Media Analysis

2004/02/16

Student: Chen-Hsiu Huang
Advisor: Prof. Ja-Ling Wu

Page 2

Outline

Problem Formulation
Current Solutions
Our Goal
Gory Details
Performance Evaluation
What's Next?
Questions and Discussion

Page 3

Problem Formulation

Digital video capture devices such as DV camcorders have become affordable for end users.

Shooting videos is fun, but editing them is frustrating.

There is still a tremendous barrier between amateur (home) users and powerful video editing software.

As a result, people leave their precious shots in piles of DV tapes, unedited and unmanaged.

Page 4

According to a survey on DVworld*, the relation between video length and how many times users review the footage in the days afterwards is:

Video length     Review times
>= 1 hr          1 or 0
30 min ~ 1 hr    2 ~ 3
15 ~ 30 min      5 ~ 10
5 ~ 15 min       >= 10
<= 5 min         You take it out and watch it whenever it comes to mind!

Video clips of no more than 5 minutes are best for holding the viewer's concentration.

*http://www.DVworld.com.tw/

Page 5

Facts about Musical Video

People are impatient with videos that have no storyline or voice-over, especially those without music.

Improved soundtrack quality improves the perceived quality of the video image.

Synchronizing video and audio segments enhances the perception of both.

One study at MIT showed that listeners judge identical video images to be of higher quality when accompanied by higher-fidelity audio.

Page 6

Home videos can be roughly classified by their nature:

Causal: shots within the video are causally related; changing the order of shots may confuse the viewer.

Non-causal: shots are not causally related; it is OK to reorder them.

Recreational: the video conveys a kind of emotion or enjoyment.

Memorial: videos of events such as weddings or graduation ceremonies; each shot should be preserved properly.

Four profiles are proposed to deal with videos of these different natures.

Page 7

Current Solutions

A consumer product called "muvee autoProducer" has been released to ease the burden of professional video editing.

Its application scenario is quite simple:

Pick up your video

Choose your favorite music

Select profiles to apply

Produce a quality musical video

Page 8

Our Goal

Although there are commercial products on the market, there are only a few related academic publications:

Jonathan Foote, Matthew D. Cooper, Andreas Girgensohn, "Creating music videos using automatic media analysis," ACM Multimedia 2002, pp. 553-560.

Content-analysis technologies have been developed for years; can we adopt them to help automate the creation of musical videos?

Goal: to achieve comparable or better quality in a similar application scenario, using the content-analysis technologies developed in the multimedia domain.

Page 9

Proposed Framework

[Framework diagram: The input video passes through shot/scene change detection and feature extraction (human face, flashlight, motion strength, color variance, camera operation, ...), followed by scene selection and key-shot selection. The input music passes through audio segment cutting based on volume, ZCR, brightness, and bandwidth. The two streams are then aligned by synchronizing the audio rhythm with the video motion/color, producing the output video.]

Page 10

Audio Analysis

We cut the input audio into several clips according to its audio features.

Frame-level features:
Volume: defined as the root mean square (RMS) of the audio samples in each frame.
ZCR: the number of times the audio waveform crosses the zero axis in each frame.

Spectral features:
Brightness: the centroid of the frequency spectrum.
Bandwidth: the standard deviation of the frequency spectrum.
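To make the four features concrete, here is a minimal NumPy sketch that computes them per frame; the frame size and the exact normalizations are assumptions for illustration, not the parameters used in this work.

```python
# Per-frame audio features: RMS volume, zero-crossing rate, spectral
# centroid (brightness) and spectral spread (bandwidth).
# FRAME is an assumed analysis window; samples is a mono PCM float array.
import numpy as np

FRAME = 1024  # samples per analysis frame (assumed)

def frame_features(samples, sr):
    feats = []
    for start in range(0, len(samples) - FRAME, FRAME):
        frame = samples[start:start + FRAME].astype(np.float64)
        volume = np.sqrt(np.mean(frame ** 2))                     # RMS volume
        zcr = np.count_nonzero(np.diff(np.sign(frame))) / FRAME   # zero-crossing rate
        power = np.abs(np.fft.rfft(frame)) ** 2
        freqs = np.fft.rfftfreq(FRAME, d=1.0 / sr)
        total = power.sum() + 1e-12
        brightness = (freqs * power).sum() / total                # spectral centroid
        bandwidth = np.sqrt((((freqs - brightness) ** 2) * power).sum() / total)
        feats.append((volume, zcr, brightness, bandwidth))
    return np.array(feats)
```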

Page 11

Generally the brightness curve has almost the same shape as the ZCR curve, so here we use the ZCR feature only.

Bandwidth is an important audio feature, but we cannot easily tell its real physical meaning in music when the bandwidth reaches a high or low value.

Furthermore, the relation between musical perception and bandwidth values is neither clear nor regular.

[Figure: brightness, ZCR, volume, and bandwidth curves of a sample clip]

Page 12

Audio Segmentation

First, we cut the input audio into clips where the volume changes dramatically.

For each clip, we define a burst of ZCR as an "attack", which may be a beat of the bass drum or the singer's voice.

A_cut(i) = 1, if F_cut(i) > VCut_th; 0 otherwise

F_cut(i) = | (1/w) * Σ_{j=i..i+w-1} v_j  -  (1/w) * Σ_{j=i-w..i-1} v_j |

VCut_th = max_i(v_i) / 10

A_attack(i) = 1, if F_attack(i) > 2 * std(z); 0 otherwise

F_attack(i) = | (1/w) * Σ_{j=i..i+w-1} z_j  -  (1/w) * Σ_{j=i-w..i-1} z_j |

Page 13

The dramatic volume change defines the audio clip boundary, while the burst of ZCR (attack) in each clip defines the granular sub-segment within it.

[Figure: audio waveform of 綠光 by 孫燕姿 (Stefanie Sun), with clip boundaries and attacks marked as sub-clip separations]

Here we define the dynamic of each clip as:

dynamic(i) = Σ_{j in clip i} A_attack(j) / len(i)

The dynamic feature can later serve as a good reference for video/audio synchronization.
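A minimal sketch of the per-clip dynamic value, assuming `attacks` is the boolean per-frame array from the attack detector and `boundaries` lists the clip start indices (including 0 and the total frame count); both names are illustrative.

```python
def clip_dynamic(attacks, boundaries):
    # dynamic(i) = number of attacks inside clip i / length of clip i
    dynamics = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        length = max(end - start, 1)
        dynamics.append(sum(attacks[start:end]) / length)
    return dynamics
```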

Page 14

Video Analysis

First, we apply shot change detection to segment the video into scenes.

Here we use a combination of the pixel MAD and pixel histogram methods to perform shot change detection.

V_shot(i) = 1, if S_MAD(i) = 1 and S_HIST(i) = 1; 0 otherwise

                     Dhist < Th_hist     Dhist > Th_hist
Dcolor < Th_color    nothing
Dcolor > Th_color    unsuitable!         shot change!
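A small sketch of this combined test and the decision table; the thresholds and the 64-bin histogram are illustrative values, not those used in the thesis.

```python
import numpy as np

TH_COLOR = 30.0   # pixel MAD threshold (assumed)
TH_HIST = 0.4     # normalized histogram-difference threshold (assumed)

def frame_diffs(prev, curr):
    """prev, curr: grayscale frames as uint8 arrays of equal shape."""
    d_color = np.mean(np.abs(curr.astype(np.int16) - prev.astype(np.int16)))
    h1, _ = np.histogram(prev, bins=64, range=(0, 256))
    h2, _ = np.histogram(curr, bins=64, range=(0, 256))
    d_hist = np.abs(h1 - h2).sum() / prev.size
    return d_color, d_hist

def classify_transition(prev, curr):
    d_color, d_hist = frame_diffs(prev, curr)
    if d_color > TH_COLOR and d_hist > TH_HIST:
        return "shot change"
    if d_color > TH_COLOR and d_hist < TH_HIST:
        return "unsuitable frame"   # see the decision table above
    return "nothing"
```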

Page 15

Flashlight detection: a flashlight event can be falsely detected as a shot change. When a shot change is found, we check the flashlight condition below; if it holds, the frame is a flashlight event and should not be treated as a shot change.

Sub-shot segmentation: we use the MPEG-7 ColorLayout descriptor to measure each frame's similarity. The first frame in each shot is selected as the basis, and each consecutive frame is compared with the basis. If the distance to the basis exceeds the threshold in the second condition below, we say that a sub-shot occurs at frame i.

Flashlight condition:
LMean(i) - LMean(i-1) > Flash_th  and  LMean(i) - LMean(i+1) > Flash_th

Sub-shot condition (F_k is the basis frame of the shot, dist is the ColorLayout distance):
D(i) = dist(F_k, F_i) > SubScene_th,  for i = k+1, k+2, ...
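As an illustration of the sub-shot test, the sketch below compares each frame with the shot's basis frame; an 8x8 grid of mean colors stands in for the MPEG-7 ColorLayout descriptor, and the threshold and the restart of the basis after each sub-shot are assumptions.

```python
import numpy as np

SUBSHOT_TH = 25.0   # distance threshold (assumed)

def color_layout(frame, grid=8):
    """frame: HxWx3 uint8 array; returns the mean color of each grid cell."""
    h, w, _ = frame.shape
    cropped = frame[:h - h % grid, :w - w % grid]
    desc = cropped.reshape(grid, h // grid, grid, w // grid, 3).mean(axis=(1, 3))
    return desc.ravel()

def subshot_boundaries(frames):
    """frames: list of frames belonging to one shot."""
    basis = color_layout(frames[0])
    boundaries = []
    for i in range(1, len(frames)):
        if np.linalg.norm(color_layout(frames[i]) - basis) > SUBSHOT_TH:
            boundaries.append(i)
            basis = color_layout(frames[i])   # restart from the new sub-shot (assumed)
    return boundaries
```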

Page 16

Camera Operation

Camera operations such as pan or zoom are widely used in amateur home videos. Detecting those camera operations can help capture the video taker's intention.

Our camera operation detection is performed based on the MPEG video's motion vectors in P-frames.

Pan:   | Σ_i v_i | / Σ_i | v_i |  >  1/3

Zoom:  | Σ_i v_i | / Σ_i | v_i |  <  1/3

This method is simple and efficient, yet it performs well in detecting camera operations.
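A compact sketch of this test on one P-frame's motion vectors; the 1/3 ratio follows the reconstructed formula above and should be read as an assumed threshold.

```python
import numpy as np

def classify_camera_op(vectors):
    """vectors: Nx2 array of (dx, dy) motion vectors from a P-frame.
    Aligned vectors -> |sum| is large relative to the sum of magnitudes (pan);
    vectors that cancel each other out -> small ratio (zoom)."""
    v = np.asarray(vectors, dtype=np.float64)
    mags = np.linalg.norm(v, axis=1).sum()
    if mags < 1e-6:
        return "static"
    ratio = np.linalg.norm(v.sum(axis=0)) / mags
    return "pan" if ratio > 1.0 / 3.0 else "zoom"
```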

Page 17

Video Features

Frame-level features:
The presence of human faces, using the OpenCV library as the face detection module.
Motion intensity.
Flashlight detection.
Mean and standard deviation of the luminance plane; (Dcolor(i) > Th_color && Dhist(i) < Th_hist) defines the unsuitable frames.

Shot-level features:
Number and types of camera operations in each shot.
Number of faces and flashlight events in each shot.
The accumulated distance between each frame and the first frame can be used to describe the shot's homogeneity.
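The slides only state that OpenCV serves as the face detection module; as one possible realization, the sketch below uses OpenCV's Haar cascade detector (a present-day equivalent of the detector available at the time) to compute the face-area ratio later used as R_face. The cascade file and parameters are assumed defaults.

```python
import cv2

# Standard frontal-face cascade shipped with opencv-python (assumed choice).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_area_ratio(frame_bgr):
    """Fraction of the frame area covered by detected faces."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    h, w = gray.shape
    return sum(fw * fh for (_, _, fw, fh) in faces) / float(w * h)
```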

Page 18

Importance Measure

Frame-level score function:

Score(i) = α * (R_face + E_flash) + β * S_a * (R_motion + Camera_op) + γ * (Mean + Std) * (130/256)

R_face = Area_face / (W × H),  E_flash ∈ {0, 1}
R_motion = Motion_i / max(Motion),  Camera_op ∈ {0, 1, 2}

α = 0.5, β = 0.3, γ = 0.2

The face and flashlight events have the highest weighting.
Camera operations and higher motion intensity represent the video taker's intention, so they are more important.
Frames with higher luminance and a larger standard deviation are more suitable. The penalty for unsuitable frames will be discussed later.
S_a is a scaling coefficient determined by the synchronized audio clip's features.
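Since the score formula above is a best-effort reconstruction of a garbled slide, the sketch below should be read as illustrative only; the argument names mirror the reconstructed symbols.

```python
def frame_score(r_face, e_flash, r_motion, camera_op, mean_y, std_y,
                s_audio=1.0, alpha=0.5, beta=0.3, gamma=0.2):
    """r_face: face-area ratio; e_flash in {0, 1};
    r_motion: motion intensity normalized by its maximum; camera_op in {0, 1, 2};
    mean_y, std_y: luminance mean and standard deviation;
    s_audio: scaling coefficient from the synchronized audio clip."""
    return (alpha * (r_face + e_flash)
            + beta * s_audio * (r_motion + camera_op)
            + gamma * (mean_y + std_y) * (130.0 / 256.0))
```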

Page 19

The shot-level importance is motivated by the following observations:
Shots with larger motion intensity deserve a longer duration.
The presence of a face attracts the viewer.
Shots of higher heterogeneity can take a longer playing time.
Shots with more camera operations are more important.
Of course, shots that are longer in the original video are more important.

Shot-level importance:

IMP = Len × ( Num_face / Len + Camera_op / Len + Motion / Len + Diff / Len )

The shot-level importance function is used in the medium profile to reassign each shot’s length according to its importance.

Static shots take a shorter time, while dynamic shots can take longer, which gives better results after editing.

“muvee autoProducer” does not reassign each shot’s length!

Page 20

Example 1

六福村之旅 (Trip to Leofoo Village, 31:55)

Music: SHE / 美麗新世界
Length: 4:25

Profile: Sequential Medium

Page 21

Proposed Profiles

The usage of profiles allows users to customize their videos according to the content's properties and the users' preferences in an easy way.

We said that home videos have four types: causal, non-causal, recreational, and memorial.

For causal or non-causal videos, the sequential or non-sequential parameter is used. For memorial or recreational videos, the rhythmic or medium parameter is developed.

In rhythmic, the music tempo/rhythm is better preserved, while some shots of the video will be neglected.

In medium, the accompaniment of the music tempo/rhythm is not as clear as in rhythmic, but most of the shots are guaranteed to be shown. The medium parameter preserves the original video the most.

Page 22

Thus we have four profiles: Sequential Rhythmic, Sequential Medium, Non-Sequential Rhythmic, and Non-Sequential Medium.

Sequential Rhythmic: the time sequence of shots will be preserved, with the rhythmic parameter.
Non-Sequential Rhythmic: with the rhythmic parameter, but the original order of shots will be changed.
Sequential Medium: the time sequence of shots will be preserved, with the medium parameter.
Non-Sequential Medium: with the medium parameter, but the original order of shots will be changed.

Page 23

Rhythmic vs. Medium

The video is segmented according to the audio clips and sub-clips. After projecting onto the video time-line, we search within the video range for the video segment with the highest score and the same length as the audio segment.

Finally, all the selected segments are concatenated.

[Diagram: audio clips on the AudioTrack projected onto the VideoTrack; for each audio segment, an equally long video segment with the highest score is selected]

Page 24

Each shot is reassigned a new length according to its shot importance; shots may become longer or shorter in proportion to the total length.

After projection onto the video space, the length budget is calculated according to the reduction rate; the budget is then allocated to the inner shots in proportion to their lengths.

If an allocated shot length is too short (< 30 frames), its budget is transferred to neighbouring shots.

[Diagram: shot lengths on the VideoTrack reassigned to fit the AudioTrack budget]
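A minimal sketch of this budget allocation, assuming the budget and the shot lengths are given in frames; the rule for handing a too-short share to a neighbouring shot is a simplification of what the slide describes.

```python
def allocate_budget(shot_lengths, budget, min_len=30):
    """Distribute 'budget' frames over the shots in proportion to their
    original lengths; shares below min_len are merged into a neighbour."""
    total = sum(shot_lengths)
    alloc = [int(round(budget * length / total)) for length in shot_lengths]
    for i, share in enumerate(alloc):
        if 0 < share < min_len:
            target = i - 1 if i > 0 else i + 1
            if 0 <= target < len(alloc):
                alloc[target] += share
            alloc[i] = 0
    return alloc
```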

Page 25

However, there are some issues: a fast-tempo audio clip may be aligned to a static video shot, which is annoying for the viewer, and a slow audio clip may be aligned to a dynamic video shot.

We therefore apply an audio scaling coefficient in the synchronization stage: the weight of a video shot's motion intensity is decreased when it is aligned with a slow audio clip, and nearly preserved when it is synchronized with a fast audio clip.

Another issue arises when the media lengths differ:

[Diagram: a VideoTrack and an AudioTrack of different lengths]

It’s unavoidable when the sequential policy is enforced.

Page 26

For some video sources, the order of shots is not so important, and reordering the shots will not degrade the original.

If we allow the input video shots to be reordered, things may be better:

[Diagram: a permutation of the VideoTrack shots aligned to the AudioTrack]

It sounds simple and intuitive, but finding such a permutation efficiently is not an easy problem.

Furthermore, a "best" solution may not exist, and the optimal solution may not be a single permutation.

Page 27

Non-Sequential Permutation

So we developed a randomized algorithm that finds a "not-bad" solution within predictable computation time:
First, randomly permute the video shots.
Then compute R_avc, the "audio-to-video coverage", on the corresponding time-line for each shot.

[Diagram: permuted video shots over the audio time-line, with coverage ratios R_avc1, R_avc2, R_avc3 for the individual shots]

Then we calculate the average R_avc; each permutation has its own average R_avc.

After many iterations we keep the permutation with the minimal average R_avc. In this way we can approach the optimal solution efficiently and predictably, depending only on how many iterations we perform.
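The sketch below shows the shape of this randomized search. The slides do not spell out how R_avc is computed, so the coverage measure used here (the total span of audio clips overlapping a shot, divided by the shot's length) is an assumption.

```python
import random

def avg_ravc(order, shot_lens, clip_bounds):
    """order: permutation of shot indices; shot_lens: shot lengths in seconds;
    clip_bounds: list of (start, end) audio clip intervals on the time-line."""
    pos, total, count = 0.0, 0.0, 0
    for idx in order:
        start, end = pos, pos + shot_lens[idx]
        # total span of audio clips overlapping this shot
        span = sum(b1 - b0 for b0, b1 in clip_bounds if b1 > start and b0 < end)
        total += span / (end - start)
        count += 1
        pos = end
    return total / max(count, 1)

def best_permutation(shot_lens, clip_bounds, iters=10000, seed=0):
    """Keep the random permutation with the smallest average R_avc."""
    rng = random.Random(seed)
    order = list(range(len(shot_lens)))
    best, best_score = order[:], float("inf")
    for _ in range(iters):
        rng.shuffle(order)
        score = avg_ravc(order, shot_lens, clip_bounds)
        if score < best_score:
            best, best_score = order[:], score
    return best, best_score
```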

Page 28

For example, when 10000 iterations are performed:

Permutation Minimal Ravc

7 5 8 11 3 14 13 1 2 0 9 6 12 4 10 1.455571

11 14 2 10 1 3 9 6 4 0 12 13 7 8 5 1.482213

9 7 13 1 14 6 2 10 8 0 11 4 12 3 5 1.508536

7 3 5 11 12 8 0 13 1 2 14 10 6 4 9 1.425809

13 5 2 10 3 12 7 11 0 14 9 6 8 4 1 1.453530

We can get a better solution with more iterations, but experiments show that 10000 iterations are quite enough and are not a burden for our computation power (in fact, it is really fast).

Because of its random nature, each synchronization result will be different. But as discussed before, it is normal for many solutions to exist.

Page 29

Example 2

吉魯巴 (Jitterbug, 19:08)

Music: 製造浪漫
Length: 4:25

Profile: Sequential Medium and

Non-Sequential Medium

Page 30

Performance Evaluation

Development environment: AMD Duron 1.2 GHz with 386 MB RAM.

Analysis complexity:
For video, about 1.2~1.3 : 1 compared to the original video running time.
For audio, about 2 minutes for a 5-minute clip; if the spectral analysis is performed, 4~5 minutes are needed.
The audio/video analysis results are saved as description files, so the analysis is required only once.
The synchronization can be regarded as O(n) complexity.
During analysis, usually less than 20 MB of RAM is required (depending on how many shots the video contains).
The synchronization result is saved as an AviSynth script; we then use VirtualDub to encode the produced musical video.

Page 31

Sample Videos

六福村之旅 (Trip to Leofoo Village, 31:55)

烏來採蜂蜜 (Honey Harvesting in Wulai, 60:34)

聖淘沙海底世界 (Sentosa Underwater World, 17:59)

littleco 演唱會 (littleco concert, 20:22)

吉魯巴 (Jitterbug, 19:08)

結婚典禮 (Wedding Ceremony, 43:42)

Page 32

What's Next?

How should the evaluation be designed? The subjective test should not over-burden the viewer.

Adding shot transition effects, such as dissolve, fade in, and fade out? I have tried, but it is not as easy as I thought.

The automatic approach may not always produce a satisfactory result, and the experience is highly subjective and differs from person to person. Semi-automatic is probably the best compromise: the automatic result serves only as a pre-processing basis and a labor-saving tool. But a video editing tool is hard to develop, and I doubt whether it is necessary to develop one from scratch for the purpose of a thesis.

Page 33

Questions and Discussion

Any comments are welcome.

Acknowledgments:

Special thanks to Mr. 劉嘉倫 for his videos and suggestions. Thanks to friends at DVworld who provided lots of ideas and comments. Thanks to Chih-Hao Shen for his dancing video.