IEEE 2012 International Conference on Data Science & Engineering (ICDSE), Cochin, Kerala, India, 18-20 July 2012
Spatial Localization and Color Histograms based
Insertion of Virtual Advertisements in Videos
Akshay Hegde1,3, Deepa Anil Kumar2,3, Gowri Srinivasa2,3
1Dept of Computer Science and Engineering,
2Dept of Information Science and Engineering
3Center for Pattern Recognition,
PES Institute of Technology, Bangalore South Campus, Bangalore
[email protected], [email protected], [email protected]
Abstract— This paper presents a location-sensitive algorithm for
the automated insertion of virtual advertisements in videos.
Virtual advertisements are objects (text messages, pictures, an
animation or video sequence) that are introduced noninvasively
into a video. Typically, this is done manually. Most automated
approaches involve insertion of virtual advertisements in hard-
coded, specific locations of videos that could be intrusive. In this
paper we propose an algorithm based on colour histogram
processing that works with an inter-frame difference based
classifier. Along with the spatial localization input (based on the
type of video), this ensures a virtual advertisement is inserted in
a manner that it is noticed and yet appears at a location that does
not obfuscate a critical portion of the video. We demonstrate the
efficacy of this method through experimental results.
Keywords— Color histogram, inter-frame difference,
thresholding, spatial localization, virtual advertisement
I. INTRODUCTION
In recent years, easy access to broadband internet, a sharp
rise in the number of connections and advancements in
imaging and video technology have led to a sharp rise in
multimedia content on the web. With this rise, together with
customized entertainment options, research in video analysis
and processing has gained immense importance, particularly
for applications such as advertising. ‘Advertisement’ in our
paper refers to displaying images, text or a video pertaining to
a product to the customer or any end-user. Avenues for this
form of advertising include newspapers, television and the
internet; we focus primarily on video processing for television.
Our goal in this work is to design an algorithm to
automatically insert an advertisement in an existing video in a
minimally intrusive fashion. The audience for a show,
particularly an event such as a reality show or a sport, is
invariably an interest group amenable to targeted advertising.
The outreach to such an audience is particularly significant in
live streaming of video for popular events. However,
interrupting a program often may not be desirable for the
viewers. Embedding virtual advertisements, such as text
messages, logos, video clips, etc. within the program being
relayed, allows for sponsors to reach the viewers seamlessly
without interrupting the program. We are already seeing a rise
in such embedded advertisements on our television programs.
However, these are manually inserted. In this paper, we
propose a method for the automated insertion of
advertisements in a minimally invasive fashion by analysing
the color histogram of image frames of videos.
The work is organized as follows. In Section II we discuss related work and possible approaches in the context of the
application. Section III details the proposed algorithm together
with the rationale behind each step. We present some
experimental results to demonstrate the efficacy of the method
in Section IV and conclude in Section V.
II. RELATED WORK
Only a few algorithms exist for the automated insertion of
advertisements in videos, and most of these algorithms are
based on the detection of edges or specific objects (such as
geometric shapes) to insert an advertisement in a
pre-defined slot. An algorithm for ball detection using a modified
Hough transform, which replaces the circle with the advertiser’s
logo, has been proposed in [1]. Eight edge points are detected
using the Hough transform, and the centre and radius of the circle
are found and then used to replace the circle with a logo. An
ant-based approach is used in [2] to detect shapes that are
closed in nature, without any breakpoints in the shape being
detected. Although the concept is novel, it fails to detect
shapes if there are any occlusions in the shape to be detected.
In [3], an approach for fast logo insertion into video is
proposed; it makes use of compressed data domain logo
insertion with an in-content (grass) in the video. The authors
of this paper have also proposed a method of inserting
advertisements based on the detection of rectangles in videos
in [4]. However, most of these methods entail a
computational delay. Moreover, these methods cannot be used
in videos where such shapes are intrinsically absent.
Oftentimes, the slot in which a message or animation of specified
dimensions can be inserted is pre-determined based on the
channel (such as one of the corners or the lower end of the
display). Such a fixed slot does not adapt to the content of the
video. Thus, this work aims to
fill the void using methods that are easy and fast to compute
for the insertion of virtual advertisements in videos.
III. PROPOSED ALGORITHM
A block diagram of the proposed algorithm is shown in Fig.
1. The input is a video (a sequence of frames) into which we
would like to embed virtual advertisements. The core of the
algorithm is a frame processor. First, statistics from color
histograms are computed for a specific channel (for instance,
the green channel in videos of lawn sports) or in the
composite color space. If the frame is classified “non-key”
(amenable for editing), an inter-frame difference based
classifier checks the decision. The steps are detailed below.
978-1-4673-2149-5/12/$31.00 ©2012 IEEE 90
Figure 1. A block diagram of the spatial-localization and color-histogram based system for the automated insertion of virtual advertisements in videos.
A. Input
The input to the frame processor is a frame extracted from
the video. The output is the frame with or without an advertisement
inserted, depending on the estimated content of the input
frame. The controls specify the position of the advertisement,
which is either at the top or the bottom of the frame.
B. Compute Histogram
This block scans the pixels of the frame and generates
the brightness histogram for a particular channel or for the
composite color image. The color space (or channel) for
which the histogram is computed depends on the application.
We will discuss the algorithm for videos pertaining to lawn
sports and hence focus on the histogram of the green channel.
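As an illustration of this block, the following is a minimal sketch of a single-channel histogram. It assumes 8-bit RGB pixels stored as nested Python lists, which keeps the example dependency-free; the names `channel_histogram`, `Pixel` and `Frame` and the tiny test frame are hypothetical and not part of our system.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]        # (R, G, B), each assumed to lie in 0..255
Frame = Sequence[Sequence[Pixel]]   # a frame as rows of pixels


def channel_histogram(frame: Frame, channel: int = 1, levels: int = 256) -> List[int]:
    """Count how many pixels take each brightness level in one channel.

    channel=1 selects green, the channel of interest for lawn sports.
    """
    hist = [0] * levels
    for row in frame:
        for pixel in row:
            hist[pixel[channel]] += 1
    return hist


# Hypothetical 2 x 3 frame: mostly "grass" with one bright pixel.
frame = [
    [(10, 200, 10), (12, 200, 9), (11, 198, 10)],
    [(10, 200, 12), (240, 240, 240), (9, 198, 10)],
]
hist = channel_histogram(frame)
```

For real video frames an array library would be the practical choice; the nested loops here only make the counting explicit.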
C. Classifiers
The frame processor relies on two classifiers.
1) Histogram-based classifier: This is the first, and
relatively simple, classifier that delineates a key frame from a
non-key frame based on statistics computed from color
histograms of the frame. A key frame is one whose contents
are important and hence is not amenable to editing. A
non-key frame, on the other hand, is one in which an
advertisement may be included. As histogram measures may
mislead (see Section IV), a non-key frame passes through a
second classifier.
2) Inter-frame difference based classifier: This is the
second classifier, which measures the extent of change in a
specified time interval. If the change is small, the frame is
delineated as non-key and an advertisement may be inserted.
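The way the two classifiers combine can be sketched as a small cascade. This is only an illustration: the function name `classify_frame` and the use of callables for the two tests are assumptions made for the example, so that the second test runs only when the first one reports a non-key candidate.

```python
from typing import Callable


def classify_frame(
    histogram_says_key: Callable[[], bool],
    difference_says_key: Callable[[], bool],
) -> str:
    """Two-stage cascade: the second test runs only for non-key candidates."""
    if histogram_says_key():
        # The histogram already marks the content as important.
        return "key"
    # Histogram measures may mislead, so the candidate is checked again.
    return "key" if difference_says_key() else "non-key"


# Hypothetical outcomes of the two tests for one frame.
result = classify_frame(lambda: False, lambda: False)   # "non-key"
```

Passing the tests as callables means the inter-frame comparison is skipped for frames the histogram already labels as key.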
D. Advertisement Insertion
This block inserts the advertisement, according to the user’s
choice, if the frame is classified as a non-key frame.
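A minimal sketch of such an insertion step is given below, assuming the advertisement is a banner as wide as the frame that simply overwrites the top or bottom rows. The function `insert_banner` and its arguments are hypothetical; blending, scaling and animated advertisements are left out.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def insert_banner(
    frame: List[List[Pixel]],
    banner: List[List[Pixel]],
    position: str = "bottom",
) -> List[List[Pixel]]:
    """Return a copy of the frame with the banner pasted at the top or bottom."""
    if position not in ("top", "bottom"):
        raise ValueError("position must be 'top' or 'bottom'")
    if len(banner) > len(frame) or any(len(r) != len(frame[0]) for r in banner):
        raise ValueError("banner must be as wide as the frame and no taller")
    out = [list(row) for row in frame]          # leave the input frame untouched
    start = 0 if position == "top" else len(frame) - len(banner)
    for i, banner_row in enumerate(banner):
        out[start + i] = list(banner_row)
    return out
```

The function would be called only for frames labelled non-key; key frames pass through unchanged.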
E. Working of the Frame Processor
The histogram of a frame gives the distribution of different
brightness levels associated with a colour. For the present
discussion, the input could be videos of sporting events such
as cricket, tennis, soccer or golf. The histogram data is used to
decide if the content in a frame holds significance for viewers
and thereby decide whether an advertisement can be inserted
or not. A combination of statistical measures, such as the
median, mean and variance, is used to decide whether the
frame contains a key event or not. The mean of a histogram is
computed as the average of the brightness values. Likewise, the
standard deviation is computed as the square root of the
sample variance of the brightness values of the pixels in the
video frame. The variance measures how widely the brightness
values are spread around their mean. The median is
the middle value of the sorted brightness levels in a
given frame (the most frequently occurring level is the mode). In videos of applications where there is a
predominant colour (such as green in lawn sporting events),
the histogram of a frame when the camera pans to show the
audience is relatively flat. This helps us determine whether a
frame is a “key” frame (one that contains the sporting event)
or a “non-key” frame (one that is focusing on the audience).
Figure 2 shows the input, histogram and output frames for
both “key” and “non-key” frames. As we can see, for a “key”
frame, the distribution of green content is sharp and peaks
around the mean while for the “non-key” frame the
distribution has a large spread. Having computed the mean
and standard deviation for the frame histogram, we make use
of these values to decide if the frame should be tagged as a
key frame or not. This is done based on thresholds. The
selection of the thresholds requires an analysis of the colour
content with respect to the rest of the content in that video.
This requires some training data initially. It turns out the
thresholds are fairly consistent across videos once the nature
of events is fixed. If the computed statistics exceed the
precomputed thresholds, a frame is considered to be a “key”
frame, which implies an advertisement cannot be inserted. If a
frame has been classified as “non-key”, the advertisement
may be inserted in a suitable portion of the frame, either at
the top or the lower end, so as to be minimally invasive.
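The statistics and the threshold test described above can be sketched as follows. The threshold values and the direction of each comparison are assumptions made for the example: a frame is taken to be key when its mean is at least `mean_min` and its standard deviation is at most `std_max`, which matches the sharp peak described for key frames in Fig. 2. The names `histogram_stats` and `histogram_says_key` are hypothetical.

```python
import math
from typing import List, Tuple


def histogram_stats(hist: List[int]) -> Tuple[float, float, int]:
    """Mean, sample standard deviation and median brightness of a histogram."""
    n = sum(hist)
    if n < 2:
        raise ValueError("need at least two pixels")
    mean = sum(level * count for level, count in enumerate(hist)) / n
    var = sum(count * (level - mean) ** 2 for level, count in enumerate(hist)) / (n - 1)
    # Median: first level at which the cumulative count reaches half the
    # pixels (the lower median when the pixel count is even).
    cumulative, median = 0, 0
    for level, count in enumerate(hist):
        cumulative += count
        if 2 * cumulative >= n:
            median = level
            break
    return mean, math.sqrt(var), median


def histogram_says_key(hist: List[int], mean_min: float, std_max: float) -> bool:
    """Assumed rule: key frames are bright in the channel and tightly peaked."""
    mean, std, _ = histogram_stats(hist)
    return mean >= mean_min and std <= std_max


# Hypothetical histograms: a sharp peak (playing field) and a flat spread (audience).
peaked = [0] * 256
peaked[190], peaked[200], peaked[210] = 5, 90, 5
flat = [0] * 256
for level in (0, 50, 100, 150, 200, 250):
    flat[level] = 10

peaked_is_key = histogram_says_key(peaked, mean_min=150, std_max=20)
flat_is_key = histogram_says_key(flat, mean_min=150, std_max=20)
```

In practice the two thresholds would come from an initial analysis of videos of the same kind of event, as noted above.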
However, depending on the video, a threshold based on
statistics computed from the histogram may not suffice. For
instance, Fig. 3 shows frames from a clay court tennis
tournament. In this, a threshold might be set based on shades
of brown (for the clay court) to determine whether a frame is
“key” (whether the match is in progress) or not. However, as
is evident from Fig. 3(d), a frame that does not contain any
brown can also be a key frame for the aficionado tracing a
certain shot, or watching a replay that has zoomed in on a
player. It would be counterproductive to introduce a message,
logo or animation, as it would be distracting and may even
obscure the player’s footwork or a portion of the stroke.
Thus, instead of relying on just one histogram-based
classifier, any frame that is determined as “non-key” is passed
through a second classifier that checks for the nature of the
content. This is based on inter-frame differencing (that
computes the extent of change between two frames in a given
time). Any classifier that attempts to analyse the entire image
would add to the computational load of the algorithm and
hence introduce large time delays. Since we take into account
spatial localization, we analyse the extent of change in regions,
such as a bar on the top or the bottom of the frame (much like
the subtitles in a movie), which are amenable for insertion of
virtual advertisements. It is quite possible that a change in camera
angle or lighting introduces spurious differences
between a stack of frames in the chosen time interval. To
circumvent this problem, as a first step, we select n frames in
an interval t. For instance, we may choose to study 1 frame
per second. We compare these with the corresponding n frames
in the preceding time interval. A pixel value p is considered to be
the same as its corresponding pixel in another frame if the latter
lies within p ± p0, to account for transient variations such as a
variation in lighting conditions. The maximum value of the
tolerance p0 is either determined empirically (based on parameters
such as the nature of the event, geographical location, weather and
time of day) or fixed with the help of user input. If
more than a certain fraction of the pixels of a frame differ
from those of the frame in its preceding time interval, the frame is
delineated as a “key” frame. If not, it is labelled “non-key”
and can accommodate virtual advertisements. The subsequent
section discusses the efficacy of this method.
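The region-restricted comparison with tolerance p0 can be sketched as below. For brevity the sketch compares a single pair of frames in one channel, whereas the procedure above samples n frames per interval; the names `changed_pixels`, `difference_says_key` and `max_fraction` are hypothetical.

```python
from typing import Sequence

Channel = Sequence[Sequence[int]]   # rows of brightness values in one channel


def changed_pixels(current: Channel, previous: Channel, rows: range, p0: int) -> int:
    """Count pixels in the given rows whose values differ by more than p0."""
    count = 0
    for r in rows:
        for a, b in zip(current[r], previous[r]):
            if abs(a - b) > p0:     # within p +/- p0 counts as unchanged
                count += 1
    return count


def difference_says_key(
    current: Channel, previous: Channel, rows: range, p0: int, max_fraction: float
) -> bool:
    """Key if more than max_fraction of the region's pixels have changed."""
    region_size = len(rows) * len(current[0])
    return changed_pixels(current, previous, rows, p0) > max_fraction * region_size


# Hypothetical 4 x 4 frames; the candidate ad region is the bottom two rows.
previous = [[100] * 4 for _ in range(4)]
current = [[100] * 4 for _ in range(4)]
current[0][0] = 255      # change outside the region: ignored
current[2][1] = 103      # within the tolerance: treated as unchanged
current[3][2] = 150      # genuine change inside the region
region = range(2, 4)

n_changed = changed_pixels(current, previous, region, p0=5)
```

Restricting the loop to the candidate rows is what keeps the cost low compared with analysing the whole image.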
IV. EXPERIMENTAL RESULTS
We obtained a number of video clips of various lawn
and clay sporting events such as cricket and tennis. The input
videos were in various formats such as mp4 and flv. As the
video streamed, our system split the input into its
constituent frames. For each frame, we computed the
histogram and applied the precomputed thresholds to decide whether
the frame is key or not. Thresholds do not necessarily require
labelled training data, as key and non-key frames
form fairly distinct classes. Since the standard deviation is
large for non-key frames compared to key frames, it is a fairly
accurate measure to determine the content of most key frames.
Moreover, if the frame rate is f frames per second and
a frame classified as “non-key” has at least ceil(f/5) key
frames in its f-neighbourhood before and after in the dynamic
sequence of frames, it is reclassified as a key frame, regardless
of the threshold values. This is because persistence of vision
would not allow us to perceive the non-key frame, and inserting
an advertisement in such a frame is of no particular use. While
user input is not necessary, there is provision for a user to
manually indicate a candidate slot for insertion of the
advertisement. Insertions are typically made at the top or
lower end of the display so as not to disrupt the centre of
focus.
Figure 2. Experimental results of using histogram-based thresholding. (a) Original image or input frame 1. (b) Histogram of the green channel. (c) The output frame, showing no change or insertion of an advertisement, as the frame is classified as a “key frame”. (d) Original image or input frame 2. (e) Histogram of the green channel. (f) Output showing a virtual advertisement (message) inserted in the “non-key” frame.
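The neighbourhood rule with ceil(f/5) can be sketched as a pass over the per-frame labels. The sketch rests on one assumed reading of the f-neighbourhood: the f frames before and the f frames after a frame, with key frames counted over both sides together. The function name `suppress_isolated_non_key` is hypothetical.

```python
import math
from typing import List


def suppress_isolated_non_key(is_key: List[bool], f: int) -> List[bool]:
    """Relabel non-key frames that have enough key frames around them."""
    need = math.ceil(f / 5)
    out = list(is_key)
    for i, key in enumerate(is_key):
        if key:
            continue
        before = is_key[max(0, i - f):i]        # up to f frames before
        after = is_key[i + 1:i + 1 + f]         # up to f frames after
        # Counts use the original labels, so relabelling does not cascade.
        if sum(before) + sum(after) >= need:
            out[i] = True
    return out


# Hypothetical label sequences at f = 5 frames per second.
isolated = suppress_isolated_non_key([True, True, True, False, True, True, True], f=5)
long_run = suppress_isolated_non_key([False] * 20, f=5)
```

Under this reading a lone non-key frame inside a run of key frames is relabelled, while a long uninterrupted non-key run is left as it is.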
Fig. 2 shows the computation of histograms for key and
non-key frames. We notice that the spread of the histogram is
larger for a non-key frame, whereas it is far more
concentrated for a key frame. The key advantage of this
method over edge-detection or shape-detection methods such
as those in [1-6] is that it does not presume there are regular
geometric shapes in the input video. The technique is not
foolproof, however, as a classification into key and non-key frames is
rather simplistic when it relies on histograms alone (see
Fig. 3(d)).
Figure 3. Experimental results demonstrating the need for a second classifier. (a) A frame classified as a “key frame” based on color histogram processing. Subfigures (b) and (c) are classified as non-key frames based on the low content of “brown” (clay court) in these images. (d) Close-up of a player, falsely classified as “non-key” due to the absence of the clay court. The inter-frame differencing operation accurately classifies this frame as a “key frame”.
Studying the inter-frame difference, particularly in the
regions slated for inclusion of the virtual advertisement, helps
determine whether the frame, classified as “non-key” by the
histogram, is indeed amenable for editing. Fig. 4 shows
frames from a reality show. The frames in each row are 1 s
apart (sampled from a video with a frame rate of 30 frames per second).
Inter-frame differences for the two pairs (computed in a region of
40 × 320 pixels at the bottom edge of frames of size 240 ×
320 pixels) are reported in Table 1. The threshold is set at
10%; thus, more than 1280 of the 12,800 pixels in the region of “ad
insertion” must differ for a frame to be classified as “key”.
TABLE 1. Inter-frame difference between frames that are 1 s apart
Frames in Fig. 4 | Inter-frame difference | Observation
(a) and (b) | 285 pixels (0.22%) | Non-key
(c) and (d) | 1901 pixels (14.85%) | Key
The numbers are consistent with the visual content of the images
in Fig. 4, and this holds for all classes of videos we consider
in this study. Thus, inter-frame difference is a reliable
second-level classifier for determining “non-key” frames.
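The threshold arithmetic behind this experiment can be checked with a few lines. The region size, the 10% threshold and the two pixel counts are the values reported above; the variable and function names are hypothetical.

```python
REGION_HEIGHT, REGION_WIDTH = 40, 320     # bottom strip of a 240 x 320 frame
THRESHOLD_PERCENT = 10

region_pixels = REGION_HEIGHT * REGION_WIDTH                 # 12800
min_changed = region_pixels * THRESHOLD_PERCENT // 100       # 1280


def label(changed: int) -> str:
    """A pair is key when more than 10% of the region's pixels differ."""
    return "key" if changed > min_changed else "non-key"


# Pixel counts reported for the two frame pairs of Fig. 4.
labels = {pair: label(n) for pair, n in {"(a)-(b)": 285, "(c)-(d)": 1901}.items()}
```

Integer arithmetic is used for the threshold so that the comparison against 1280 is exact.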
Figure 4. Frames that are 1 s apart to demonstrate the utility of the inter-frame difference technique. Frames (a) and (b) are considered to have the same content and are classified as “non-key”, whereas frames (c) and (d) are classified as “key” as the inter-frame difference between them is significant.
V. CONCLUSIONS
This paper presented a procedure to classify video frames
as “key” and “non-key” based on color histogram processing
and relative change in the frames’ contents. Advertisements,
such as a logo, animation or text message, may be inserted in
“non-key” frames of a video in suitable locations so as not to
obfuscate critical content in the video frame. As with most
automated processing techniques, the algorithms presented
here work particularly well for certain classes of videos. In
particular, videos in which there is an arena with a
homogeneous texture or dominant color (such as a field or
court), or a relatively constant background or slow-moving
foreground (such as in music concerts, cooking shows, talk
shows, etc.), are most amenable to these processing methods.
The method presented here has an advantage over most
prevalent methods in that it is easy and fast to compute and
does not depend on the detection of edges or certain geometric
shapes in the videos to be able to insert advertisements.
REFERENCES
[1] A. Patel and P. Patel, “Virtual Advertisement with Live Entertainment
Program for Real-Time Video Application”, Proc. of IEEE International
Conference on Advances in Computer Engineering, 2010.
[2] A. Acharya, K. Chattopadhyay, D. Maiti and A. Konar, “An Artificial Ant
Based Novel and Efficient Approach of Regular Geometric Shape Detection
from Digital Image”, Proc. of IEEE 11th International Conference on
Computer and Information Technology, 2008.
[3] Y. Nakajima, A. Yoneyama and Y. Hatori, “A Fast Logo Insertion Algorithm
for MPEG Compressed Video”, Proc. of IEEE International Conference on
Consumer Electronics, 2003.
[4] D. Anilkumar, A. Hegde and G. Srinivasa, “Contour-Tracing Based Detection
of Regular Geometric Shapes for the Insertion of Virtual Advertisements”,
Proc. National Conference on Recent Trends in Computer Technology, Bangalore, 2011.
[5] S. C. Pei and Y. Z. Chou, “Scene-Effect Detection and Insertion MPEG
Encoding Scheme for Video Browsing and Error Concealment”, IEEE
Transactions on Multimedia, Vol. 7, Issue 4, August 2005.
[6] M. Dorigo, M. Birattari and T. Stützle, “Ant Colony Optimization: Artificial
Ants as a Computational Intelligence Technique”, IEEE Computational
Intelligence Magazine, November 2006.