[ieee 2012 international conference on data science & engineering (icdse) - cochin, kerala,...

Spatial Localization and Color Histograms based Insertion of Virtual Advertisements in Videos

Akshay Hegde1,3, Deepa Anil Kumar2,3, Gowri Srinivasa2,3

1Dept of Computer Science and Engineering,

2Dept of Information Science and Engineering

3Center for Pattern Recognition,

PES Institute of Technology, Bangalore South Campus, Bangalore

[email protected], [email protected], [email protected]

Abstract— This paper presents a location-sensitive algorithm for

the automated insertion of virtual advertisements in videos.

Virtual advertisements are objects (text messages, pictures, an

animation or video sequence) that are introduced noninvasively

into a video. Typically, this is done manually. Most automated

approaches involve insertion of virtual advertisements in hard-

coded, specific locations of videos that could be intrusive. In this

paper we propose an algorithm based on colour histogram

processing that works with an inter-frame difference based

classifier. Along with the spatial localization input (based on the

type of video), this ensures a virtual advertisement is inserted in

a manner that it is noticed and yet appears at a location that does

not obfuscate a critical portion of the video. We demonstrate the

efficacy of this method through experimental results.

Keywords— Color histogram, inter-frame difference,

thresholding, spatial localization, virtual advertisement

I. INTRODUCTION

In recent years, easy access to broadband internet, a sharp rise in the number of connections, and advancements in imaging and video technology have led to a sharp rise in multimedia content on the web. With this rise, together with

customized entertainment options, research in video analysis

and processing has gained immense importance, particularly

for applications such as advertising. ‘Advertisement’ in our

paper refers to displaying images, text or a video pertaining to

a product to the customer or any end-user. Avenues for this

form of advertising include newspapers, television and the

internet; we focus primarily on video processing for television.

Our goal in this work is to design an algorithm to

automatically insert an advertisement in an existing video in a

minimally intrusive fashion. The audience for a show, particularly an event such as a reality show or a sporting event, is invariably an interest group amenable to targeted advertising.

The outreach to such an audience is particularly significant in

live streaming of video for popular events. However, frequently interrupting a program may not be desirable for the viewers. Embedding virtual advertisements, such as text messages, logos, video clips, etc., within the program being relayed allows sponsors to reach the viewers seamlessly

without interrupting the program. We are already seeing a rise

in such embedded advertisements on our television programs.

However, these are manually inserted. In this paper, we

propose a method for the automated insertion of

advertisements in a minimally invasive fashion by analysing

the color histogram of image frames of videos.

The work is organized as follows. In Section II we discuss related work and possible approaches in the context of the

application. Section III details the proposed algorithm together

with the rationale behind each step. We present some

experimental results to demonstrate the efficacy of the method

in Section IV and conclude in Section V.

II. RELATED WORK

Only a few algorithms exist for the automated insertion of advertisements in videos, and most of them rely on detecting edges or specific objects (such as geometric shapes) to place an advertisement in a pre-defined slot. An algorithm for ball detection using a modified Hough transform, which replaces the detected circle with the advertiser’s logo, has been proposed in [1]. Eight edge points are detected using the Hough transform, and the centre and radius of the circle are computed and used to position the logo. An ant-based approach is used in [2] to detect closed shapes, that is, shapes whose outlines have no breakpoints. Although the concept is novel, it fails to detect shapes that are partially occluded. In [3], an approach for fast logo insertion into video is proposed. It makes use of compressed-domain logo insertion into in-content regions (such as grass) of the video. The authors

of this paper have also proposed a method of inserting

advertisements based on the detection of rectangles in videos

in [4]. However, most of these methods incur a computational delay. Moreover, they cannot be used in videos where such shapes are intrinsically absent. Often, the slot in which a message or animation of specified dimensions can be inserted is pre-determined by the channel (such as one of the corners or the lower end of the display); such a fixed slot does not adapt to the content. This work therefore aims to fill this gap using methods that are easy and fast to compute

for the insertion of virtual advertisements in videos.

III. PROPOSED ALGORITHM

A block diagram of the proposed algorithm is shown in Fig.

1. The input is a video (a sequence of frames) into which we

would like to embed virtual advertisements. The core of the

algorithm is a frame processor. First, statistics from color

histograms are computed for a specific channel (for instance,

the green channel in videos of lawn sports) or in the

composite color space. If the histogram-based classifier labels the frame “non-key” (amenable to editing), an inter-frame difference based classifier then verifies that decision. The steps are detailed below.

978-1-4673-2149-5/12/$31.00 ©2012 IEEE

Figure 1. A block diagram of the spatial-localization and color-histogram based system for the automated insertion of virtual advertisements in videos.

A. Input

The input to the frame processor is a frame extracted from the video. The output is the frame with or without an advertisement inserted, depending on the estimated content of the input frame. The controls specify the position of the advertisement, which is either at the top or at the bottom of the frame.

B. Compute Histogram

This block scans the pixels in the frame and generates the brightness histogram for a particular channel or for the

composite color image. The color space (or channel) for

which the histogram is computed depends on the application.

We will discuss the algorithm for videos pertaining to lawn

sports and hence focus on the histogram of the green channel.
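As an illustration, the histogram computation for a single channel can be sketched as follows. This is a minimal sketch, not the implementation used in this work: it assumes 8-bit RGB frames stored as nested lists of (R, G, B) tuples, and the name channel_histogram is hypothetical.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255 (assumed 8-bit frames)


def channel_histogram(frame: Sequence[Sequence[Pixel]], channel: int = 1) -> List[int]:
    """Count how many pixels take each brightness level 0..255 in one channel.

    channel=1 selects green, the channel of interest for lawn sports.
    """
    hist = [0] * 256
    for row in frame:
        for pixel in row:
            hist[pixel[channel]] += 1
    return hist


# Toy 2x2 frame: three pixels with green = 200 and one with green = 10.
frame = [[(0, 200, 0), (5, 200, 5)],
         [(9, 200, 9), (80, 10, 80)]]
hist = channel_histogram(frame)
print(hist[200], hist[10])  # 3 1
```

For a composite-colour histogram, the same loop could be run on a per-pixel brightness value instead of a single channel.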

C. Classifiers

The frame processor relies on two classifiers.

1) Histogram-based classifier: This is the first, and relatively simple, classifier; it distinguishes a key frame from a

non-key frame based on statistics computed from color

histograms of the frame. A key frame is one whose contents

are important and hence, is not amenable to editing. A non-

key frame, on the other hand, is one in which an

advertisement may be included. As histogram measures may

mislead (see Section IV), a non-key frame passes through a

second classifier.

2) Inter-frame difference based classifier: This is the second classifier, which measures the extent of change over a specified time interval. If the change is small, the frame is labelled non-key and an advertisement may be inserted.
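The way the two classifiers are chained can be sketched as follows. This is a minimal sketch under the assumption that each classifier is available as a function returning True for a key frame; the names is_non_key, histogram_says_key and difference_says_key are hypothetical.

```python
from typing import Callable, TypeVar

Frame = TypeVar("Frame")


def is_non_key(frame: Frame,
               histogram_says_key: Callable[[Frame], bool],
               difference_says_key: Callable[[Frame], bool]) -> bool:
    """Return True only if both stages agree that the frame may be edited."""
    if histogram_says_key(frame):
        return False  # stage 1 says "key": the frame is never edited
    # Stage 2 re-checks every frame that stage 1 labelled "non-key".
    return not difference_says_key(frame)


# Stub classifiers standing in for the real ones.
print(is_non_key("frame", lambda f: False, lambda f: False))  # True
print(is_non_key("frame", lambda f: False, lambda f: True))   # False
```

Running the second classifier only on frames that pass the first keeps the extra cost confined to candidate frames.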

D. Advertisement Insertion

This block inserts the advertisement at the position chosen by the user, provided the frame is classified as a non-key frame.
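A minimal sketch of the insertion step is given below. It assumes the advertisement is an opaque image as wide as the frame that simply replaces the top or bottom rows; blending, scaling and animation are left out, and the name insert_ad is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def insert_ad(frame: Image, ad: Image, position: str = "bottom") -> Image:
    """Return a copy of `frame` with `ad` pasted over its top or bottom rows."""
    if position not in ("top", "bottom"):
        raise ValueError("position must be 'top' or 'bottom'")
    if len(ad) > len(frame) or any(len(row) != len(frame[0]) for row in ad):
        raise ValueError("ad must be as wide as the frame and no taller")
    out = [list(row) for row in frame]  # copy, so the input frame is untouched
    start = 0 if position == "top" else len(frame) - len(ad)
    for i, ad_row in enumerate(ad):
        out[start + i] = list(ad_row)
    return out


# Toy example: a 3x2 black frame and a 1x2 white advertisement strip.
black, white = (0, 0, 0), (255, 255, 255)
frame = [[black, black] for _ in range(3)]
ad = [[white, white]]
edited = insert_ad(frame, ad, "bottom")
```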

E. Working of the Frame Processor

The histogram of a frame gives the distribution of different

brightness levels associated with a colour. For the present

discussion, the input could be videos of sporting events such

as cricket, tennis, soccer or golf. The histogram data is used to

decide if the content in a frame holds significance for viewers

and thereby decide whether an advertisement can be inserted

or not. A combination of statistical measures, such as the median, mean and variance, is used to decide whether the frame contains a key event. The mean is computed as the average of the brightness values. Likewise, the standard deviation is computed as the square root of the sample variance of the brightness values of the pixels in the video frame. The variance measures how far the brightness values spread around their mean. The median is the middle brightness level when the pixel values are sorted (it should not be confused with the mode, the most frequently occurring brightness level in a given frame). In videos where there is a

predominant colour (such as green in lawn sporting events),

the histogram of a frame when the camera pans to show the

audience is relatively flat. This helps us determine whether a

frame is a “key” frame (one that contains the sporting event)

or a “non-key” frame (one that is focusing on the audience).

Figure 2 shows the input, histogram and output frames for

both “key” and “non-key” frames. As we can see, for a “key”

frame, the distribution of green content is sharp and peaks

around the mean while for the “non-key” frame the

distribution has a large spread. Having computed the mean

and standard deviation for the frame histogram, we make use

of these values to decide if the frame should be tagged as a

key frame or not. This is done based on thresholds. The

selection of the thresholds requires an analysis of the colour

content with respect to the rest of the content in that video.

This requires some training data initially. It turns out the

thresholds are fairly consistent across videos once the nature

of events is fixed. If the computed statistics exceed the

precomputed thresholds, a frame is considered to be a “key”

frame, which implies an advertisement cannot be inserted. If a

frame has been classified as “non-key”, the advertisement

may be inserted in a suitable portion of the frame, either at

the top or lower end, so as to be minimally invasive.
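The statistics and the threshold test described above can be sketched as follows. This is a sketch under stated assumptions, not the exact rule used in this work: the decision rule (a key frame has a high mean and a small spread in the dominant channel), the threshold values 120 and 40, and the function names are all hypothetical, and the variance is computed in its population form (dividing by the pixel count).

```python
import math
from typing import List, Tuple


def histogram_stats(hist: List[int]) -> Tuple[float, float, int]:
    """Mean, standard deviation and median brightness from a 256-bin histogram."""
    n = sum(hist)
    if n == 0:
        raise ValueError("empty histogram")
    mean = sum(level * count for level, count in enumerate(hist)) / n
    # Population variance (divides by n, not n - 1).
    var = sum(count * (level - mean) ** 2 for level, count in enumerate(hist)) / n
    # Median: first level at which the cumulative count reaches half the pixels.
    cumulative, median = 0, 0
    for level, count in enumerate(hist):
        cumulative += count
        if 2 * cumulative >= n:
            median = level
            break
    return mean, math.sqrt(var), median


def histogram_says_key(hist: List[int], min_mean: float = 120.0,
                       max_std: float = 40.0) -> bool:
    """Assumed rule: sharp, bright peak in the dominant channel -> key frame."""
    mean, std, _ = histogram_stats(hist)
    return mean >= min_mean and std <= max_std


# A sharply peaked histogram (play in progress) versus a flat one (audience).
peaked = [0] * 256
peaked[150] = 100
flat = [1] * 256
print(histogram_says_key(peaked), histogram_says_key(flat))  # True False
```

In practice the two thresholds would be chosen from training videos of the same kind of event.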

However, depending on the video, a threshold based on

statistics computed from the histogram may not suffice. For

instance, Fig. 3 shows frames from a clay court tennis

tournament. In this, a threshold might be set based on shades

of brown (for the clay court) to determine whether a frame is

“key” (whether the match is in progress) or not. However, as is evident from Fig. 3(d), a frame that does not contain any

brown can also be a key frame for the aficionado tracing a

certain shot, or watching a replay that has zoomed in on a

player. It would be counterproductive to introduce a message,

logo or animation, as it would be distracting and may even

obscure the player’s footwork or a portion of the stroke.

Thus, instead of relying on just one histogram-based

classifier, any frame that is determined as “non-key” is passed

through a second classifier that checks for the nature of the

content. This is based on inter-frame differencing (that

computes the extent of change between two frames in a given

time). Any classifier that attempts to analyse the entire image

would add to the computational load of the algorithm and

hence introduce large time delays. Since we take into account

spatial localization, we analyse the extent of change in regions,

such as a bar on the top or the bottom of the frame (much like

the subtitles in a movie), which are amenable for insertion of

virtual advertisements. It is quite possible that a change of camera angle or lighting introduces spurious differences between a stack of frames in the chosen time interval. To circumvent this problem, as a first step, we select n frames in an interval t. For instance, we may choose to study 1 frame per second. We compare these with the corresponding n frames in the preceding time interval. A pixel with value p is considered to be the same as its corresponding pixel in another frame if the latter’s value lies within p ± p0, which accounts for transient variations such as a change in lighting conditions. The tolerance p0 is determined either empirically (based on parameters such as the nature of the event, geographical location, weather and time of day) or fixed with the help of user input. If more than a certain fraction of the pixels of one frame differ from those of the frame in its preceding time interval, the frame is labelled a “key” frame. If not, it is labelled “non-key” and can accommodate virtual advertisements. The subsequent

section discusses the efficacy of this method.
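The localized inter-frame difference test can be sketched as follows. This is a minimal sketch assuming grayscale frames stored as nested lists and a comparison between one pair of frames; the default tolerance p0 = 10 is a hypothetical value, the 10% fraction is the threshold reported in Section IV, and the function names are illustrative only.

```python
from typing import List

Gray = List[List[int]]  # one brightness value per pixel


def changed_fraction(prev: Gray, curr: Gray, band_rows: int,
                     p0: int, position: str = "bottom") -> float:
    """Fraction of pixels in the top/bottom band that differ by more than p0."""
    height = len(curr)
    if not 0 < band_rows <= height:
        raise ValueError("band_rows must be between 1 and the frame height")
    rows = range(band_rows) if position == "top" else range(height - band_rows, height)
    changed = total = 0
    for r in rows:
        for a, b in zip(prev[r], curr[r]):
            total += 1
            if abs(a - b) > p0:  # within p +/- p0 counts as "same"
                changed += 1
    return changed / total


def difference_says_key(prev: Gray, curr: Gray, band_rows: int, p0: int = 10,
                        max_fraction: float = 0.10,
                        position: str = "bottom") -> bool:
    """Key frame if more than max_fraction of the band has changed."""
    return changed_fraction(prev, curr, band_rows, p0, position) > max_fraction


# Toy 4x4 frames: only the bottom row changes, and only one pixel exceeds p0.
prev = [[100] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[3] = [100, 105, 180, 100]
print(changed_fraction(prev, curr, band_rows=1, p0=10))  # 0.25
```

Restricting the comparison to the band keeps the cost proportional to the size of the insertion region rather than the whole frame.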

IV. EXPERIMENTAL RESULTS

We obtained a number of video clips of various lawn and clay sporting events such as cricket and tennis. The input videos were in various formats such as mp4, flv, etc. As the video streamed, our system split the input into its

constituent frames. For each frame, we computed the

histogram and applied the precomputed thresholds to decide if

a frame is key or not. The thresholds do not necessarily require labelled training data: it turns out that key and non-key frames form fairly distinct classes. Since the standard deviation is larger for non-key frames than for key frames, it is a fairly accurate measure for identifying most key frames. Moreover, suppose the frame rate is f frames per second. If a frame classified as “non-key” has at least ceil(f/5) key frames in its f-neighbourhood (before and after it in the sequence of frames), it is reclassified as a key frame, regardless of the threshold values. This is because persistence of vision would not allow us to perceive such an isolated non-key frame, so inserting an advertisement in it is of no particular use. While user input is not necessary, there is provision for a user to manually indicate a candidate slot for insertion of the advertisement. Insertions are typically made at the top or lower end of the display so as not to disrupt the centre of focus.

Figure 2. Experimental results of using histogram-based thresholding. (a) Input frame 1. (b) Histogram of the green channel of frame 1. (c) Output frame, left unchanged because the frame is classified as a “key” frame. (d) Input frame 2. (e) Histogram of the green channel of frame 2. (f) Output showing a virtual advertisement (message) inserted in the “non-key” frame.
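The neighbourhood rule based on ceil(f/5) can be sketched as follows. This is a sketch of one possible reading: it assumes the f frames before and the f frames after are counted together, and that the count uses the labels produced by the classifiers rather than labels already changed by this rule; the name smooth_labels is hypothetical.

```python
import math
from typing import List


def smooth_labels(is_key: List[bool], fps: int) -> List[bool]:
    """Relabel a non-key frame as key if enough of its neighbours are key."""
    need = math.ceil(fps / 5)
    out = list(is_key)
    for i, key in enumerate(is_key):
        if key:
            continue
        lo, hi = max(0, i - fps), min(len(is_key), i + fps + 1)
        # Count key frames among up to fps frames before and fps frames after.
        neighbours = sum(is_key[lo:i]) + sum(is_key[i + 1:hi])
        if neighbours >= need:
            out[i] = True
    return out


# With fps = 5 a single key neighbour suffices (ceil(5/5) = 1).
labels = [True, False, False, False, False, False, False, False]
smoothed = smooth_labels(labels, fps=5)
print(smoothed)  # [True, True, True, True, True, True, False, False]
```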

Fig. 2 shows the computation of histograms for key and non-key frames. We notice that the spread of the histogram is more significant for a non-key frame, whereas it is far more concentrated for a key frame. The key advantage of this method over edge-detection or shape-detection methods such as those in [1-6] is that it does not presume there are regular geometric shapes in the input video. The histogram test alone is not foolproof, however, since a classification that relies merely on histograms is relatively crude (see Fig. 3(d)).

Figure 3. Experimental results demonstrating the need for a second classifier. (a) A frame classified as a “key” frame based on color histogram processing. (b), (c) Frames classified as “non-key” based on the low content of “brown” (clay court) in these images. (d) Close-up of a player, falsely classified as “non-key” due to the absence of the clay court. The inter-frame differencing operation correctly classifies this frame as a “key” frame.

Studying the inter-frame difference, particularly in the

regions slated for inclusion of the virtual advertisement, helps

determine whether the frame, classified as “non-key” by the

histogram, is indeed amenable for editing. Fig. 4 shows

frames from a reality show. The frames in each row are 1 s apart (sampled from a video with a frame rate of 30 frames per second). Inter-frame differences for the two pairs (computed in a region of 40 × 320 pixels at the bottom edge of frames of size 240 × 320 pixels) are reported in Table 1. The threshold is set at 10%; thus, more than 1280 of the 12,800 pixels in the “ad insertion” region must differ for a frame to be classified as “key”.

TABLE 1. Inter-frame difference between frames that are 1 s apart

Frames in Fig. 4    Inter-frame difference    Observation
(a) and (b)         285 pixels (2.23%)        Non-key
(c) and (d)         1901 pixels (14.85%)      Key
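The percentages and labels follow from the pixel counts and the size of the insertion region, as the short check below illustrates. It uses only the region size, the 10% threshold and the two pixel counts reported in this section; the function name classify_pair is hypothetical.

```python
from typing import Tuple

REGION_PIXELS = 40 * 320                       # 12800 pixels in the bottom band
THRESHOLD_PIXELS = REGION_PIXELS * 10 // 100   # 10% of the band = 1280 pixels


def classify_pair(changed: int) -> Tuple[float, str]:
    """Return the changed share (in percent) and the resulting label."""
    share = 100 * changed / REGION_PIXELS
    label = "Key" if changed > THRESHOLD_PIXELS else "Non-key"
    return share, label


for pair, changed in (("(a) and (b)", 285), ("(c) and (d)", 1901)):
    share, label = classify_pair(changed)
    print(f"{pair}: {changed} px = {share:.2f}% -> {label}")
# (a) and (b): 285 px = 2.23% -> Non-key
# (c) and (d): 1901 px = 14.85% -> Key
```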

The numbers are consistent with the visual content of the images in Fig. 4, and the same holds for all classes of videos considered in this study. Thus, the inter-frame difference is a reliable second-level classifier for confirming “non-key” frames.

Figure 4. Frames that are 1 s apart, demonstrating the utility of the inter-frame difference technique. Frames (a) and (b) are considered to have the same content and are classified as “non-key”, whereas frames (c) and (d) are classified as “key” as the inter-frame difference between them is significant.

V. CONCLUSIONS

This paper presented a procedure to classify video frames

as “key” and “non-key” based on color histogram processing

and relative change in the frames’ contents. Advertisements,

such as a logo, animation or text message, may be inserted in

“non-key” frames of a video in suitable locations so as not to

obfuscate critical content in the video frame. As with most

automated processing techniques, the algorithms presented

here work particularly well for certain classes of videos. In

particular, videos in which there is an arena with a homogeneous texture or dominant color (such as a field or court), or a relatively constant background or slow-moving foreground (such as in music concerts, cooking shows, talk shows, etc.), are most amenable to these processing methods. The method presented here has an advantage over most prevalent methods in that it is easy and fast to compute and does not depend on the detection of edges or specific geometric shapes in the videos in order to insert advertisements.

REFERENCES

[1] A. Patel and P. Patel, “Virtual Advertisement with Live Entertainment Program for Real-Time Video Application”, Proc. of IEEE International Conference on Advances in Computer Engineering, 2010.

[2] A. Acharya, K. Chattopadhyay, D. Maiti and A. Konar, “An Artificial Ant Based Novel and Efficient Approach of Regular Geometric Shape Detection from Digital Image”, Proc. of IEEE 11th International Conference on Computer and Information Technology, 2008.

[3] Y. Nakajima, A. Yoneyama and Y. Hatori, “A Fast Logo Insertion Algorithm for MPEG Compressed Video”, Proc. of IEEE International Conference on Consumer Electronics, 2003.

[4] D. Anilkumar, A. Hegde and G. Srinivasa, “Contour-Tracing Based Detection of Regular Geometric Shapes for the Insertion of Virtual Advertisements”, Proc. National Conference on Recent Trends in Computer Technology, Bangalore, 2011.

[5] S. C. Pei and Y. Z. Chou, “Scene-Effect Detection and Insertion MPEG Encoding Scheme for Video Browsing and Error Concealment”, IEEE Transactions on Multimedia, Vol. 7, Issue 4, August 2005.

[6] M. Dorigo, M. Birattari and T. Stutzle, “Ant Colony Optimization: Artificial Ants as a Computational Intelligence Technique”, IEEE Computational Intelligence Magazine, November 2006.

