

Automatic Subtitles Localization through Speaker Identification in Multimedia System

Seung-Bo Park, Kyung-Jin Oh, Heung-Nam Kim, Geun-Sik Jo Department of Computer & Information Engineering, Inha University, Incheon, Korea

{molaal, okjkillo, nami}@eslab.inha.ac.kr, [email protected]

Abstract

With the increasing popularity of online video, efficient captioning and display of the captioned text (subtitles) have become an accessibility concern. In most cases, however, subtitles are shown in a separate area below the screen. As a result, some viewers miss condensed information about the contents of the video. To improve viewers' readability and visibility, in this paper we present a framework for displaying synchronized text around the speaker in a video. The proposed approach first identifies speakers using face detection technologies and subsequently detects a subtitles region. In addition, we adopt DFXP, the interoperable timed text format of the W3C, to support interchange with existing legacy systems. To achieve smooth playback of multimedia presentations such as SMIL and DFXP, a prototype system, namely MoNaPlayer, has been implemented. Our case studies show that the proposed system is feasible for several multimedia applications.

1. Introduction

The prevalence of digital devices and the development of Internet technologies and services enable end-users to be producers as well as consumers of media content, including digital video. Even in a single day, an enormous amount of digital video is generated on the web, and thus video now plays an important role in education, entertainment, and diverse multimedia applications. In this environment, efficient captioning and display of the captioned text (subtitles) have also become an issue.

Subtitles/captions, textual versions of the dialog in videos, films, and television programs, provide condensed information about the contents of the video to viewers who are hearing impaired or inexperienced with the spoken language [12]. Though captioning (often called closed captioning) was historically intended for the hearing impaired, it has also been used to help viewers who may not be fluent in the spoken word. In addition, it is often used to translate dialog from a foreign language into the viewer's native language. Foreign-language subtitles assume that the viewer can hear but cannot understand the language, so the subtitles provide a form of written translation of dialog in a foreign language [4].

Because of the growing need for access to subtitles, numerous multimedia players, such as QuickTime, RealPlayer, and Windows Media Player, can display video with subtitles, even though each media player adopts a different subtitles format. Additionally, many users easily create subtitles files (e.g., QuickTime Text [7], RealText [3], or SAMI [8]), especially foreign-language subtitles, using captioning software and distribute them to foreign-language viewers. However, in many cases, subtitles are shown in a separate area below the screen. This decreases readability, because viewers must watch scenes, read subtitles, and, at times, listen to the actors' voices simultaneously. As a result, they sometimes lose important information about the contents of the video.

In order to increase viewers' readability, we propose an efficient subtitles platform and develop a new media player, namely MoNaPlayer, which displays subtitles near the speakers. Our motivations and objectives are summarized as follows: (1) Many subtitles contents and formats already exist. Rather than designing a new format, we promote the reusability and accessibility of existing subtitles technologies. (2) Adding positioning information, in addition to timing information, to subtitles requires considerable manual effort, and generating high-quality subtitles takes a long time and costs a great deal. Therefore, we develop a module that automatically transforms existing timed subtitles into subtitles enriched with positioning information, so that they can be displayed around the speakers. (3) We aim to improve viewers' readability and visibility in multimedia applications that focus on special use cases rather than general ones (e.g., generating multimedia subtitling for foreign-language education).

The subsequent sections of this paper are organized as follows: The next section contains a brief overview of existing timed text formats and some related studies. In Section 3, we describe the details of the proposed system, including speaker identification using face recognition, subtitles region detection, and automatic subtitles transformation with spatio-temporal information. Section 4 presents our preliminary prototype system and case studies. Finally, we summarize the paper and present future work.

2. XML-based Timed Text

Timed text refers to the presentation of text media synchronized with audio and video. Typical applications of timed text are the real-time subtitling of foreign-language movies on the Web and captioning for people lacking audio devices or having hearing impairments [13]. Today, a number of timed text content formats exist (e.g., SMIL, RealText, DFXP, SAMI, QText [1, 2, 3, 7, 8]), because each popular media player handles subtitles individually. For example, SAMI (Synchronized Accessible Media Interchange) is used by Windows Media Player users, whereas RealPlayer users need RealText for subtitles, which is incompatible with SAMI. This issue has led to the development of interoperable timed text formats such as DFXP.

This section briefly explains the XML-based languages and W3C technologies related to timed text formats for subtitles.

2.1. Synchronized Multimedia Integration Language (SMIL)

SMIL, a W3C Recommendation, is an XML-based language that integrates streaming audio and video with images, text, or any other media type. In addition, SMIL allows its syntax and semantics to be reused in other XML-based languages, in particular those that need to represent timing and synchronization [10]. The current recommendation version is SMIL 2.1 [1]. Since SMIL was developed, many end-user products have been implemented to support the SMIL standard (e.g., RealPlayer, QuickTime, AMBULANT [5], GRiNS [6]).

SMIL defines media object elements referenced by URLs. The <text> or <textstream> elements can be used to display a video object with subtitles. For example:

<smil ...>
  ...
  <par>
    <video src="video.avi" dur="40s" ... />
    <textstream src="subtitles.xml" dur="40s" ... />
  </par>
  ...
</smil>

Here, the <par> element schedules the two media objects, the video and its subtitles, in parallel.

While SMIL provides extensive styling and timing synchronization, it is by itself limited in its support for foreign-language subtitles [4].

2.2. Distribution Format Exchange Profile (DFXP)

The W3C has published the DFXP (Distribution Format Exchange Profile) specification as a Candidate Recommendation for a timed text authoring format. It provides a standardized representation of a particular subset of textual information with which stylistic, layout, and timing semantics are associated by an author or an authoring system for the purpose of interchange and potential presentation [2]. Although DFXP was not expressly designed for direct integration into a SMIL document, the semantics of its core elements and attributes are based on SMIL 2.1 [4].

The following example shows the elements and attributes most closely related to our research.

<tt xmlns="http://www.w3.org/2006/10/ttaf1" xmlns:tts="http://www.w3.org/2006/10/ttaf1#style">

<head> <layout> <region xml:id="koCaption">

<style tts:backgroundColor="transparent"/> <style tts:extent="400px 150px"/> <style tts:origin="10px 100px"/> <style tts:overflow="visible"/> ....

</region> <region xml:id="default">

<style tts:backgroundColor="transparent"/> <style tts:extent="400px 150px"/> <style tts:origin="0px 300px"/> <style tts:overflow="visible"/> ....

</region> </layout>

</head> <body>

<div xml:lang="en"> <p begin="00:01.7" end="00:05.0" region="default">English subtitles</p>

</div> <div xml:lang="ko">

<p begin="00:01.7" end="00:05.0"

167

Page 3: [IEEE 2008 IEEE International Workshop on Semantic Computing and Applications (IWSCA) - Incheon, Korea (South) (2008.07.10-2008.07.11)] 2008 IEEE International Workshop on Semantic

region="koCaption" >Korean subtitles</p> </div>

</body> </tt>

The <style> element's tts:origin attribute specifies the x and y coordinates of the origin of a region area, which can position text anywhere, whereas the tts:extent attribute specifies the width and height of a region area. The begin and end attributes specify when text blocks are displayed and removed. Note that in this example, the subtitles are displayed in a different region depending on the selected language (i.e., English or Korean). Detailed information about DFXP can be found in [2].
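To make the relationship between regions and timed paragraphs concrete, the following is a minimal sketch (not part of the original system) of how the layout and timing attributes above could be read with Python's standard xml.etree.ElementTree. The namespace URIs and element names are taken from the DFXP excerpt shown here; anything beyond that excerpt is an assumption.

import xml.etree.ElementTree as ET

# Namespaces as used in the DFXP example above.
NS = {"tt": "http://www.w3.org/2006/10/ttaf1",
      "tts": "http://www.w3.org/2006/10/ttaf1#style"}

def read_dfxp(path):
    """Collect region layouts and timed paragraphs from a DFXP document."""
    root = ET.parse(path).getroot()
    xml_id = "{http://www.w3.org/XML/1998/namespace}id"
    xml_lang = "{http://www.w3.org/XML/1998/namespace}lang"

    regions = {}
    for region in root.findall(".//tt:layout/tt:region", NS):
        styles = {}
        for style in region.findall("tt:style", NS):
            for name, value in style.attrib.items():
                styles[name.split("}")[-1]] = value   # e.g. origin, extent
        regions[region.get(xml_id)] = styles

    cues = []
    for div in root.findall(".//tt:body/tt:div", NS):
        lang = div.get(xml_lang)
        for p in div.findall("tt:p", NS):
            cues.append({"lang": lang,
                         "region": p.get("region"),
                         "begin": p.get("begin"),
                         "end": p.get("end"),
                         "text": (p.text or "").strip()})
    return regions, cues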

3. Automatic Subtitles Localization through Speaker Identification

Figure 1. System Overview. (Subtitles contents in existing timed text formats such as SAMI, QText, RealText, and DFXP are handled by a timed text processing step that extracts timing information; the speaker identification module combines video frame analysis, face and mouth detection, and sound analysis of the video source; the region detection module performs speaker moving trace and space detection; spatio-temporal text generation and transformation publish DFXP; and SMIL generation produces a SMIL document that overlaps the subtitles and video via their URLs for display in the player.)

In this section, we describe an automatic subtitles localization system that provides not only timing but also positioning information in subtitles. Note that we utilize subtitles already created for video synchronization rather than creating new subtitles content.

The proposed system is divided into three main types of tasks: (a) extracting descriptive context (e.g., text styling and the timing model) and content from existing timed text formats; (b) analyzing the video source for subtitles region detection and speaker identification; and (c) transforming the timed text into a new spatio-temporal text and generating a SMIL document that integrates the video and the subtitles. Figure 1 illustrates a brief overview of the system.
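As a small illustration of how these tasks connect, the following sketch (ours, not the authors' code) threads the timing cues from task (a) through the analysis of task (b), producing the spatio-temporal tuples that task (c) serializes. The identify_speaker and detect_region callables are placeholders, passed in as parameters precisely because their behavior is described later in Sections 3.1 and 3.2.

def attach_positions(cues, identify_speaker, detect_region):
    """cues: [(begin_s, end_s, text), ...] from an existing timed text file.
    identify_speaker(begin, end) -> face box or None (Section 3.1).
    detect_region(begin, end, face_box) -> region or None for the default (Section 3.2).
    Returns [(begin_s, end_s, text, region), ...] ready for DFXP generation (Section 3.3)."""
    positioned = []
    for begin, end, text in cues:
        face_box = identify_speaker(begin, end)
        region = detect_region(begin, end, face_box) if face_box else None
        positioned.append((begin, end, text, region))
    return positioned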

3.1. Identification of Speaker and Position

The existing timed texts contain a number of auxiliary attributes so that subtitles can be synchronized with the video. In RealText, the temporal presentation is expressed by the begin attribute, which specifies the start of a temporal interval, and the end attribute, which specifies its end, as follows:

<time begin="3" end="7"/> There's nothing to tell.

Likewise, SAMI defines the <SYNC Start="elapsed time in milliseconds"> tag to support synchronization.

Prior to identifying a speaker, we first extract timing information, such as the begin/end times of the synchronized texts and the explicit/implicit durations of the text appearances. Thereafter, the person speaking at the synchronized time is identified.
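As a rough sketch of this extraction step (ours, not the authors' module), the following code pulls begin/end times from RealText <time> tags and start times from SAMI <SYNC Start=...> tags with regular expressions. Real RealText and SAMI files contain more structure than these two patterns cover, and the end of a SAMI cue is taken implicitly as the start of the next one.

import re

def realtext_cues(src: str):
    """Extract (begin_s, end_s, text) from RealText markup such as
    <time begin="3" end="7"/> There's nothing to tell."""
    pattern = re.compile(r'<time\s+begin="([\d.]+)"\s+end="([\d.]+)"\s*/>([^<]*)')
    return [(float(b), float(e), t.strip()) for b, e, t in pattern.findall(src)]

def sami_cues(src: str):
    """Extract (begin_s, end_s, text) from SAMI <SYNC Start=ms> blocks.
    A cue's end time is implicit: it is the start of the next SYNC block."""
    pattern = re.compile(r'<SYNC\s+Start=(\d+)[^>]*>(.*?)(?=<SYNC|$)',
                         re.IGNORECASE | re.DOTALL)
    blocks = [(int(ms) / 1000.0, re.sub(r'<[^>]+>', ' ', body).strip())
              for ms, body in pattern.findall(src)]
    cues = []
    for (begin, text), (next_begin, _) in zip(blocks, blocks[1:]):
        if text and text != '&nbsp;':      # skip the empty "clear screen" blocks
            cues.append((begin, next_begin, text))
    return cues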

The speaker identification module consists of two analyzers: one analyzes the video frames of a video source and the other analyzes the audio recorded in it. For video with multichannel audio, the audio channels are analyzed to determine where the speech sound is located (e.g., left surround, right surround, or back surround). Video frame analysis involves a multi-step process. First, it extracts all frames during the active duration of the text. In the RealText example above, the text is displayed at 3 seconds and disappears at 7 seconds, so the duration of the displayed text is 4 seconds; hence, the video frames for the active duration (i.e., between 3 and 7 seconds) are extracted. Second, several of the leading frames are analyzed to recognize the faces of the persons appearing in them. Third, the facial areas appearing in those frames are detected, and the candidate face is selected by analyzing the mouth features of the detected faces, since the speaker's mouth is usually the one that is moving. Finally, the person speaking at that moment is identified by integrating the sound analysis and the video analysis.

The entire process of speaker identification is illustrated in Figure 2.


Figure 2. Flow of Speaker Identification. (From the timed text processing step, the sound channels are analyzed and the frames are extracted for the time duration; faces are detected in the front frames and their mouth features are analyzed, while the speech sound location is identified; combining these results identifies the speaker and position, which is passed on to region detection.)
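A minimal sketch of the flow in Figure 2 is given below. It is a simplification under our own assumptions: it uses OpenCV's bundled Haar cascades for frontal-face and approximate mouth detection in place of the Luxand and VeriLook SDKs used by the prototype (Section 4), it omits the audio channel analysis, and the "moving mouth" heuristic is only an illustration of the idea.

import cv2

FACE = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
MOUTH = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_smile.xml")

def identify_speaker(video_path, begin_s, end_s, frame_step=5):
    """Return the bounding box (x, y, w, h) of the face whose mouth region
    changes the most during [begin_s, end_s], as a crude moving-mouth cue."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    start, stop = int(begin_s * fps), int(end_s * fps)

    best_box, best_score = None, 0.0
    prev_mouths = {}
    for idx in range(start, stop, frame_step):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, w, h) in FACE.detectMultiScale(gray, 1.2, 5):
            lower_face = gray[y + h // 2:y + h, x:x + w]        # mouth lives here
            mouths = MOUTH.detectMultiScale(lower_face, 1.5, 11)
            key = (x // 40, y // 40)                            # coarse per-face identity
            score = abs(len(mouths) - prev_mouths.get(key, 0))  # change as a movement proxy
            prev_mouths[key] = len(mouths)
            if score > best_score:
                best_score, best_box = score, (x, y, w, h)
    cap.release()
    return best_box    # None if no moving mouth was found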

3.2. Detection of Subtitles Region

Table 1. Three cases according to sound and speaker

            Sound   Speaker
  Case 1      O        O
  Case 2      O        X
  Case 3      X        X

The regions for displaying subtitles on the video are determined according to the result of the speaker identification. We define the following three cases, as given in Table 1:

1. Case 1: the normal case, in which the person speaking in the video appears on the screen and can be identified. The subtitles regions near the speaker are determined by the region detection module. If no proper space for displaying the text is detected, the subtitles are placed in the default region (e.g., the bottom of the screen).

2. Case 2: the person speaking in the video does not appear in the scene (e.g., a story narration) even though the sound is present, or the speaker cannot be identified reliably for particular reasons (e.g., dull lighting or a non-frontal face) even though he or she is visible. In these cases, the subtitles are placed in the default region or in the direction of the speech sound.

3. Case 3: open captions, recorded in the video itself to explain the scene or situation, even though there is no sound and no speaker. The subtitles are displayed in the default region (e.g., the bottom of the screen).

The primary role of the region detection module is to discover a proper region among the spaces near the speaker. Once a speaker is identified, the module traces him/her in the extracted frames and calculates the area he/she moves through during the active duration, so that the text block does not overlap with the face area. After that, the blank spaces near the speaker are detected, and the region with a relatively low dispersion of colors is selected. Figure 3 summarizes the major steps of the subtitles region detection.

Figure 3. Flow of Subtitles Region Detection. (If speaker identification succeeds, the speaker's movement is traced for the time duration, the regions near the speaker are detected, and the dispersion of each nearby region is calculated; a region for displaying the subtitles is then selected, using the caption rules, or the default region when no speaker is identified, and handed to the transformation step.)
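The following is a small sketch, under our own assumptions, of the region scoring described above: candidate rectangles around the speaker's face are ranked by the colour variance of their pixels, and candidates that intersect the area the face moves through are discarded. The box sizes and offsets are illustrative only, not values prescribed by the system.

import numpy as np

def candidate_boxes(face_box, frame_w, frame_h, text_w=300, text_h=40):
    """Propose text rectangles to the left, right, above, and below the face."""
    x, y, w, h = face_box
    raw = [(x - text_w - 10, y, text_w, text_h),        # left of the face
           (x + w + 10, y, text_w, text_h),             # right of the face
           (x, y - text_h - 10, text_w, text_h),        # above the face
           (x, y + h + 10, text_w, text_h)]             # below the face
    return [(bx, by, bw, bh) for bx, by, bw, bh in raw
            if bx >= 0 and by >= 0 and bx + bw <= frame_w and by + bh <= frame_h]

def overlaps(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def select_region(frame, face_box, moving_area):
    """frame: H x W x 3 colour array. Pick the candidate box with the lowest
    colour dispersion that does not cross the speaker's moving area;
    return None to fall back to the default region."""
    h, w = frame.shape[:2]
    best, best_var = None, float("inf")
    for box in candidate_boxes(face_box, w, h):
        if overlaps(box, moving_area):
            continue
        bx, by, bw, bh = box
        patch = frame[by:by + bh, bx:bx + bw]
        var = float(np.var(patch.reshape(-1, patch.shape[-1]), axis=0).mean())
        if var < best_var:
            best, best_var = box, var
    return best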

3.3. Automatic Transformation of Subtitles

In the last step, the information obtained in the previous steps is used to generate the final subtitles.

Sometimes, well-trained users add subtitles directly into the video itself using video editing tools (producing so-called open captions), and thereby display the subtitles at an intended position on the screen. However, this work can be very time consuming and expensive. Therefore, we automatically transform existing timed texts into new subtitles enriched with the rendering area of the subtitles.

To provide flexibility in the transformation, we adopt DFXP for the subtitling functions. DFXP is intended to be transformable into one or more legacy timed text formats (e.g., SAMI, QText, or RealText), or vice versa [2]. In addition, DFXP content may be referenced directly using a <text> or <textstream> media object element in SMIL.

From a simple original SAMI file such as the following, an enriched DFXP file can be generated.

<SAMI>
  <head>
    <Style type="text/css"><!--
      P { font-size:20pt; font-family:arial; color:white; }
      .ENCC { Name:English; lang:en-US; SAMIType:CC; }
    --></Style>
  </head>
  <body>
    ...
    <SYNC Start=55422><P Class=ENCC> There's nothing to tell.
    <SYNC Start=59256><P Class=ENCC>&nbsp;
    <SYNC Start=61759><P Class=ENCC> You're going out with the guy.
    ...
  </body>
</SAMI>

The following example shows the transformed DFXP containing the regions detected in Section 3.2.

<?xml version="1.0" encoding="UTF-8"?> <tt xmlns="http://www.w3.org/2006/10/ttaf1" xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">

… <region xml:id="a1">

<style tts:extent="300px 40px"/> <style tts:origin="10px 200px"/>

… </region> <region xml:id="a2">

<style tts:extent="100px 120px"/> <style tts:origin="0px 120px"/>

… </region>

… <body>

<div xml:lang=”en” style=”subtitles> … <p region=”a1” begin=”55.442s” end=”59.256s”> There's nothing to tell. </p> <p region=”a2” begin=”61.759s” end=”64.389s”> You're going out with the guy. </p> …

</div> </body>

</tt> 4. Preliminary Prototype
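A compact sketch of such a transformation, under our own assumptions, is shown below: it takes the (begin, end, text) cues extracted from the SAMI file (Section 3.1) and the regions chosen in Section 3.2, and emits a DFXP skeleton like the one above. The namespace URIs are copied from the examples; the default region placement is an illustrative value, and error handling and styling are omitted.

from xml.sax.saxutils import escape

DEFAULT_REGION = ('      <region xml:id="default">\n'
                  '        <style tts:extent="400px 60px"/>\n'   # illustrative default placement
                  '        <style tts:origin="0px 300px"/>\n'
                  '      </region>')

def cues_to_dfxp(cues, regions, lang="en"):
    """cues: [(begin_s, end_s, text), ...] extracted from SAMI or RealText;
    regions: parallel list of ((x, y), (w, h)) tuples, or None for the default
    region. Returns a DFXP document as a string."""
    region_defs, paragraphs = [DEFAULT_REGION], []
    for i, ((begin, end, text), region) in enumerate(zip(cues, regions), start=1):
        rid = "default"
        if region is not None:
            (ox, oy), (w, h) = region
            rid = "a%d" % i
            region_defs.append(
                '      <region xml:id="%s">\n'
                '        <style tts:extent="%dpx %dpx"/>\n'
                '        <style tts:origin="%dpx %dpx"/>\n'
                '      </region>' % (rid, w, h, ox, oy))
        paragraphs.append('      <p region="%s" begin="%.3fs" end="%.3fs">%s</p>'
                          % (rid, begin, end, escape(text)))
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<tt xmlns="http://www.w3.org/2006/10/ttaf1"\n'
            '    xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">\n'
            '  <head>\n    <layout>\n' + "\n".join(region_defs) + '\n    </layout>\n  </head>\n'
            '  <body>\n    <div xml:lang="' + lang + '">\n' + "\n".join(paragraphs) +
            '\n    </div>\n  </body>\n</tt>\n')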

4. Preliminary Prototype

In this section, we present our experiences implementing a preliminary system prototype and our case studies. While several players, particularly those based on Adobe Flash Player, support the DFXP caption format, none of the existing players supports the richer parts of the DFXP specification, such as the tts:origin and tts:extent attributes of the styling element. In order to achieve smooth playback of SMIL and DFXP presentations, we implemented a prototype player, MoNaPlayer, using Delphi 7 as the development tool, DSPack 2.3.3 as the multimedia component based on MS DirectShow and DirectX technologies, and TECXMLParser 1.3 as the XML component. In addition, we use the Luxand FaceSDK for analyzing mouth features and the VeriLook Face Identification SDK for face detection and recognition.

4.1. Case Studies

In the case studies, we considered two existing timed text formats, SAMI and RealText, because the two most popular players, Windows Media Player (denoted as WMP) and RealPlayer (denoted as RP), support them. Before running the test cases, we created an ASX file (.asx) and a SAMI file (.smi) for WMP, and a SMIL file (.smil) and a RealText file (.rt) for RP, using the MAGpie captioning software [9]. The ASX file combines the SAMI subtitles with a test video, whereas the SMIL file integrates the RealText subtitles with the video. Figure 4 (a) and (b) show snapshots of a simple test case in WMP and RP, respectively.

In the case of MoNaPlayer, we first extracted timing information from the SAMI file and then transformed it into DFXP enriched with subtitles regions according to the speaker positions. We then integrated the video and the DFXP using a <textstream> media object element in SMIL, as noted in Section 2.1. In the SMIL document, we declared the same rendering space for the video and the subtitles. Similar processing was performed for the RealText file.
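To make this last step concrete, a minimal sketch of the SMIL wrapper generation is given below (our own illustration, not the prototype's code). It places the video and the DFXP <textstream> in the same <par> block and the same rendering region; the file names, duration, and dimensions are placeholder values.

def make_smil(video_href, dfxp_href, dur="40s", width=640, height=360):
    """Return a SMIL document that plays the video and the DFXP subtitles in
    parallel within one shared rendering region (placeholder sizes)."""
    return ('<smil xmlns="http://www.w3.org/2005/SMIL21/Language">\n'
            '  <head>\n'
            '    <layout>\n'
            '      <root-layout width="%d" height="%d"/>\n'
            '      <region id="screen" left="0" top="0" width="%d" height="%d"/>\n'
            '    </layout>\n'
            '  </head>\n'
            '  <body>\n'
            '    <par>\n'
            '      <video src="%s" region="screen" dur="%s"/>\n'
            '      <textstream src="%s" region="screen" dur="%s"/>\n'
            '    </par>\n'
            '  </body>\n'
            '</smil>\n' % (width, height, width, height, video_href, dur, dfxp_href, dur))

# Example usage with placeholder file names:
# open("presentation.smil", "w").write(make_smil("video.avi", "subtitles.dfxp.xml"))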

Figure 4 (c) shows the result of the SMIL presentation with DFXP in MoNaPlayer. The subtitles are displayed around the speaker, who is detected by analyzing mouth features, as shown on the right side of MoNaPlayer.


Figure 4. Snapshots of playback. (a) SAMI integrated into an ASX (via Windows Media Player). (b) RealText integrated into a SMIL (via RealPlayer). (c) DFXP integrated into a SMIL (via MoNaPlayer).

4.2. Discussion

While it performs an acceptable function for multimedia applications, the preliminary prototype has some limitations. We examine some of these limitations and discuss the research issues that need to be explored in order to overcome them.

Recognition of non-frontal faces. The current prototype system only detects frontal faces. There are many approaches to detecting a face in a scene, such as finding faces by motion, by color, or in images with a controlled background [11]. Further study is therefore needed on techniques that combine several good approaches.

Segmentation of moving faces from video sequences. When a face is moving, it is necessary to calculate the moving area. However, the system runs into difficulties in detecting subtitle regions when there are multiple moving objects. In addition, when a speaker moves widely while speaking during the time the subtitles are displayed, the subtitles overlap with the speaker's face. More precise video-based object segmentation techniques need to be studied and developed.

Generation of numerous region styles. When existing timed texts are transformed into DFXP, the current system defines regions separately for each time interval. Although this is unavoidable in the proposed system, generating numerous region styles is inefficient. Some regions could be reused even though the position of an identified speaker differs slightly. Therefore, an efficient method for detecting reusable regions, such as template-based region selection, needs to be developed.
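One possible direction, sketched below under our own assumptions, is to snap region coordinates to a coarse grid so that cues whose speakers are at nearly the same position share a single region definition; the grid size is an illustrative parameter, not something prescribed by the system.

def dedupe_regions(regions, grid=40):
    """Map each ((x, y), (w, h)) region to a shared id, merging regions whose
    coordinates fall into the same coarse grid cell.
    Returns (ids, region_table) where region_table is a list of (id, region)."""
    table, ids = {}, []
    for (x, y), (w, h) in regions:
        key = (x // grid, y // grid, w // grid, h // grid)
        if key not in table:
            table[key] = ("a%d" % (len(table) + 1), ((x, y), (w, h)))
        ids.append(table[key][0])
    return ids, list(table.values())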

Although the current system still needs improvement, we believe that this new way of displaying subtitles is useful not only for certain viewers (such as foreign-language learners, beginning readers, and hard-of-hearing viewers) but also for certain multimedia applications (such as foreign-language videoconferencing).

5. Summary

In this paper, we described how subtitles can be integrated with video, in particular by displaying the subtitles near the speakers, even though generating synchronized multimedia presentations with timed text is not novel in itself. We have also developed a prototype supporting automatic subtitles region detection and speaker identification by analyzing facial features. Our aim is to promote the reusability and accessibility of video, to improve viewers' readability and visibility, and to enable subtitles to be transformed without difficulty in multimedia environments. As presented in our case studies, the proposed system is encouraging for several multimedia applications, such as foreign-language and visual learning, beginning readers, and videoconferencing. However, as discussed in Section 4.2, there remain limitations and research issues in the current system that need to be overcome.

A research area currently attracting attention is video retrieval based on the analysis of captioned information. We therefore plan to further study techniques related to face recognition and to implement an enhanced prototype that provides capabilities for retrieving relevant video or subtitles.

6. References

[1] D. Bulterman et al., Synchronized Multimedia Integration Language (SMIL 2.1), W3C Recommendation, Dec. 2005; http://www.w3.org/TR/2005/REC-SMIL2-20051213/.
[2] G. Adams, Timed Text (TT) Authoring Format 1.0 – Distribution Format Exchange Profile (DFXP), W3C Candidate Recommendation, Nov. 2006; http://www.w3.org/TR/2006/CR-ttaf1-dfxp-20061116/.


[3] RealText Authoring Guide, RealNetworks Inc., Jul. 2004; http://service.real.com/help/library/guides/ProductionGuide/prodguide/realpgd.htm.
[4] D. C. A. Bulterman, A. J. Jansen, P. Cesar, and S. Cruz-Lara, "An efficient, streamable text format for multimedia captions and subtitles," Proc. ACM Symposium on Document Engineering, ACM, 2007, pp. 101-110.
[5] D. C. A. Bulterman, J. Jansen, K. Kleanthous, K. Blom, and D. Benden, "Ambulant: a fast, multi-platform open source SMIL player," Proc. 12th ACM Int'l Conf. on Multimedia, ACM, 2004, pp. 492-495.
[6] GRiNS Player for SMIL 2.0, Oratrix; http://www.oratrix.com/Products/G2P.
[7] QuickTime Text Descriptors, Apple Inc.; http://www.apple.com/quicktime/tutorials/textdescriptors.html.
[8] Understanding SAMI 1.0, Microsoft Corp., Oct. 2001; http://msdn2.microsoft.com/en-us/library/ms971327.aspx.
[9] Media Access Generator (MAGpie), Media Access Group, WGBH, Jun. 2007; http://ncam.wgbh.org/webaccess/magpie/magpie_help/.
[10] P. Hoschka, "An Introduction to the Synchronized Multimedia Integration Language," IEEE MultiMedia, 1998, pp. 84-88.
[11] W. Zhao, R. Chellappa, A. Rosenfeld, and P. J. Phillips, "Face Recognition: A Literature Survey," ACM Computing Surveys, Dec. 2003, pp. 399-458.
[12] Wikipedia, Subtitle (captioning); http://en.wikipedia.org/wiki/Subtitles.
[13] T. Michel, Timed Text, World Wide Web Consortium (W3C); http://www.w3.org/AudioVideo/TT/.
