
    R&D White Paper

    WHP 065

    July 2003

    Speech Recognition in Assisted and Live Subtitling for Television

    M.J. Evans

    Research & Development BRITISH BROADCASTING CORPORATION

    © BBC 2002. All rights reserved.

    BBC Research & Development White Paper WHP 065

    Speech Recognition in Assisted and Live Subtitling for Television

    M.J. Evans

    Abstract

    For several years the BBC has been using speaker-independent speech recognition to assist television subtitling and dramatically reduce the time taken for a skilled subtitler to complete a programme. This paper describes recent developments in this technology. Our new system for robust generation of subtitles for live television is also introduced. This system uses speaker-specific speech recognition and novel networking software to allow subtitlers to work from home or other remote locations whilst permitting straightforward handovers between subtitlers or television services.

    This was an invited paper published by the Telecommunications Advancement Organization (TAO) of Japan in the Proceedings of the TAO Workshop on Universal Design for Information, Communication and Broadcasting Technologies, Tokyo, 7 June 2003.

    Key words: speech recognition, television subtitling,

    © BBC 2002. All rights reserved. Except as provided below, no part of this document may be reproduced in any material form (including photocopying or storing it in any medium by electronic means) without the prior written permission of BBC Research & Development except in accordance with the provisions of the (UK) Copyright, Designs and Patents Act 1988.

    The BBC grants permission to individuals and organisations to make copies of the entire document (including this copyright notice) for their own internal use. No copies of this document may be published, distributed or made available to third parties whether by paper, electronic or other means without the BBC's prior written permission. Where necessary, third parties should be directed to the relevant page on BBC's website at http://www.bbc.co.uk/rd/pubs/whp for a copy of this document.

    White Papers are distributed freely on request.

    Authorisation of the Chief Scientist is required for publication.

    Speech Recognition in Assisted and Live Subtitling for Television

    DR. MICHAEL J. EVANS

    Research and Development Department, British Broadcasting Corporation, Kingswood Warren, Surrey KT20 6NP, United Kingdom

    For several years the BBC has been using speaker-independent speech recognition to assist television subtitling and dramatically reduce the time taken for a skilled subtitler to complete a programme. This paper describes recent developments in this technology. Our new system for robust generation of subtitles for live television is also introduced. This system uses speaker-specific speech recognition and novel networking software to allow subtitlers to work from home or other remote locations whilst permitting straightforward handovers between subtitlers or television services.

    0 INTRODUCTION

    The BBC is committed to subtitling all of its television programmes by 2008. This represents a huge amount of programming, as well as a huge range of types of programming; from coverage of live events, to drama programmes scripted and recorded several months in advance of broadcast. Subtitles for BBC programmes are produced by skilled subtitlers, who compose sensibly condensed, helpfully coloured and well-formatted subtitles to accompany the programmes. Subtitles are delivered using teletext for analogue services and DVB for digital TV.

    In the year 2001/02, the BBC broadcast approximately 44,000 hours of TV, of which about two-thirds was subtitled. Within our multichannel services, the proportion ranged from 73.7% of programmes on the general interest, multiplatform services BBC1 and BBC2 to 26.6% of the digital-only service BBC News 24. The BBC Parliament service was completely unsubtitled during this period. Increasing these subtitling rates to 100% across the board places great demands on the BBC's subtitling capacity.

    To support this increase in subtitling without compromising quality and, in particular, the BBC's commitment to use skilled subtitlers, BBC Research and Development has developed a range of software tools to allow subtitlers to work more effectively. These tools take a number of different forms, to meet the range of styles of programmes that they support. In each case a key technology is speech recognition. The BBC, however, does not use speech recognition to automatically generate a transcript of a programme's soundtrack, and use this as the basis of subtitles. The multitude of different voices in a typical programme's soundtrack would require the use of speaker-independent recognition, and the transcription accuracy of such systems remains low and overheads in post-correction of errors high. Also, part of the skill of the subtitler is the condensation of programme dialogue, to moderate the time taken by viewers to read the subtitle.

    The type of speech recognition system, and the manner in which it is applied in aiding subtitling, are dependent upon the availability of particular resources; namely a transcript of the programme and a recording of it. The implications are summarised in the table below:

    Recording Available?   Transcript Available?   Type of Speech Recognition
    Yes                    Yes                     speaker-independent
    Yes                    No                      speaker-specific followed by speaker-independent
    No                     No                      speaker-specific
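    For illustration only (this mapping is ours, not code from any BBC system), the table reduces to a three-entry lookup:

```python
# Illustrative restatement of the table above; not from any BBC system.
def recognition_type(recording_available: bool, transcript_available: bool) -> str:
    table = {
        (True, True): "speaker-independent",                          # Assisted Subtitling
        (True, False): "speaker-specific, then speaker-independent",  # Script Capture
        (False, False): "speaker-specific",                           # Live Subtitling
    }
    # The table does not cover a transcript without a recording.
    return table[(recording_available, transcript_available)]

print(recognition_type(True, False))
```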

    This paper describes three systems developed by BBC Research and Development to increase the effectiveness of skilled subtitlers. Each uses speech recognition:

    Assisted Subtitling uses a speaker-independent recogniser together with a transcript and recording of a programme to produce an initial set of subtitles for the subtitler to amend as necessary.

    Script Capture uses a speaker-specific recogniser to allow pre-recorded programmes that do not have a transcript to be used with the Assisted Subtitling system.

    Live Subtitling allows subtitlers to use speaker-specific recognition to provide real-time subtitles for live programmes and programmes edited close to transmission.

    1 ASSISTED SUBTITLING

    1.1 Overview

    Assisted Subtitling (or AS) is a software system allowing an operator to produce an initial set of subtitles from a recording of a programme and a textual transcript. The initial subtitles formed by AS can then be modified by a skilled subtitler. Overall, this process is much less time-consuming than having the skilled subtitler produce the subtitles without any automation.

    The AS system consists of a sequence of functional modules. These extract relevant information from the recording and transcript of the programme, identify the time at which each word in the programme is spoken, and assign distinct subtitle colours to speakers. The final module produces well-timed, well-formed subtitles for the programme, corresponding to the stylistic preferences set up by the user. Figure 1 shows a modular overview of the AS system.

    An XML-based format called AML (Audio Markup Language) has been developed for Assisted Subtitling. AML files consist of sections which comprise, on completion of processing, a character list with assigned colours, a timed version of the script, a shot change list and the initial subtitles themselves.
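    The paper does not publish the AML schema; purely as a sketch, with invented element and attribute names, an AML-like file covering those four sections could be read as follows:

```python
# Illustrative only: the real AML schema is not given in this paper, so
# the element names below are invented to mirror the four sections above.
import xml.etree.ElementTree as ET

AML_EXAMPLE = """\
<aml programme="Example Documentary">
  <characters>
    <character name="NARRATOR" colour="yellow"/>
    <character name="ANNA" colour="white"/>
  </characters>
  <script>
    <line speaker="NARRATOR" start="12.10">Good evening and welcome</line>
  </script>
  <shotChanges><cut time="14.20"/></shotChanges>
  <subtitles>
    <subtitle start="12.00" end="14.20">Good evening and welcome</subtitle>
  </subtitles>
</aml>
"""

root = ET.fromstring(AML_EXAMPLE)
for c in root.find("characters"):
    print(c.get("name"), "->", c.get("colour"))
```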

    1.2 Script Recognition and Analysis

    Assisted Subtitling requires a transcription of the dialogue in a programme. Often, this transcription exists in the form of a verbatim post-production script. In general, the script consists not only of dialogue, but also of stage directions, descriptions of music cues and timing information. Extraction of the relevant information from the script is also complicated by the fact that scripts are typed up using a huge variety of different formatting styles.

    Script Recognition extracts speaker names, spoken words and scene boundaries from post-production script documents produced in Microsoft Word, Rich Text Format (RTF) or plain text. Any additional text in the document is ignored. (Text can also be originated from existing subtitles, typically from an acquired programme; in which case the AS system is used to reversion the subtitles.) The system learns new script formats by means of a high-level graphical user interface which allows an operator to manually specify the nature (dialogue, stage direction etc.) of blocks of text from a small section of an example script. After manually marking up 5-20% of a script, the system has usually learnt enough about the formatting conventions to recognise and interpret the contents of any script document sharing the same format, with a very high degree of reliability. Scripts with recognisable formats can then be directly converted into an unambiguous AML representation by the AS system.
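    The actual learning mechanism is not described here; one toy sketch of the idea (entirely our assumption) keys block labels on simple formatting features taken from the operator's marked-up sample:

```python
# Toy sketch only, not the AS method: classify script blocks by simple
# formatting features, learnt from the operator's manual mark-up.
def features(block):
    first = block.splitlines()[0]
    return (
        len(first) - len(first.lstrip()),  # indentation depth
        first.strip().isupper(),           # e.g. speaker names in capitals
        first.rstrip().endswith(":"),      # e.g. "JOHN:" dialogue headers
    )

def learn(marked_up):
    """marked_up: list of (block_text, label) pairs from the operator."""
    return {features(text): label for text, label in marked_up}

def classify(rules, block, default="unknown"):
    return rules.get(features(block), default)

rules = learn([("    JOHN:", "speaker"),
               ("        I never said that.", "dialogue"),
               ("INT. KITCHEN - DAY", "scene")])
print(classify(rules, "    MARY:"))  # -> 'speaker'
```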

    1.3 Alignment using Speaker-Independent Speech Recognition

    Alignment is the process of determining timing information for each word spoken in the programme. A programme's soundtrack is processed by a speech recognition system. The recogniser does not need to produce a transcription of the programme, since speaker and dialogue information has already been extracted from the script document. Instead, we simply require the recogniser to provide an accurate estimate of the time at which each word is spoken. This process is considerably more reliable than trying to produce a transcript using a speaker-independent recogniser.

    Figure 1: Modular structure of Assisted Subtitling. [Block diagram: the transcript feeds script analysis, and the recording supplies audio to alignment using speech recognition and video to shot change detection; automatic colouring and subtitle formation then produce the subtitles.]

    Assisted Subtitling's alignment module uses the Aurix speech recognition software from 20/20 Speech Ltd [1]. Its accuracy in determining the timing of words spoken is normally well within the acceptable range for a subtitle. Alignment is the most time-consuming stage of the AS process. On a high-end PC, alignment is generally about 4-5 times faster than real time; i.e. a 60-minute programme takes about 12-15 minutes to align.
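    The Aurix interface itself is not documented in this paper; a hypothetical shape for the alignment output, which the later modules consume, might be a timed word list:

```python
# Hypothetical alignment output (the real Aurix API is not shown here):
# the recogniser, constrained by the known script, returns an estimated
# (start, end) time in seconds for each script word.
script_words = ["Good", "evening", "and", "welcome"]
word_times = [(12.10, 12.42), (12.42, 12.95), (13.05, 13.20), (13.20, 13.71)]

timed_script = [
    {"word": w, "start": s, "end": e}
    for w, (s, e) in zip(script_words, word_times)
]
print(timed_script[1])  # {'word': 'evening', 'start': 12.42, 'end': 12.95}
```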

    1.4 Automatic Speaker Colouring

    Subtitles on the BBC, and in Europe generally, make use of different colours to help viewers distinguish between the words spoken by different people. AS has a colouring module which automatically generates an optimum assignment of colours to speakers, minimising the incidence of interacting speakers sharing the same colour. The system supports different styles of subtitle colouring, including manual or exclusive assignment of some colours, different numbers and preference orders of colour, restrictions on the colours of interacting speakers, and whether a speaker must retain the same colour for the duration of the programme. For example, the general BBC subtitling style states that:

    White, yellow, cyan and green are the available colours - with green the least preferred;

    Only white can be assigned to more than one speaker in a given scene;

    The last speaker in a scene cannot have the same non-white colour as the first speaker in the next scene;

    Colour assignments persist for the duration of the programme;

    In documentaries, yellow is (usually) assigned exclusively to the Narrator.

    For a typical programme, automatically colouring to this specification takes a fraction of a second, whereas colouring manually can often be very time-consuming, and still fail to yield an optimal colour scheme.
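    The paper gives no algorithmic detail; as a minimal sketch of the core idea only (greedy colouring of a speaker-interaction graph, ignoring the scene-boundary and narrator rules above), one might write:

```python
# Minimal sketch, not the AS algorithm: greedy colouring of a graph in
# which speakers who share a scene are neighbours. Only the preference
# order and the "share white only" rule are modelled.
from collections import defaultdict

PREFERENCE = ["white", "yellow", "cyan", "green"]  # green least preferred

def colour_speakers(scenes):
    """scenes: list of lists of speaker names who interact."""
    neighbours = defaultdict(set)
    for scene in scenes:
        for a in scene:
            neighbours[a].update(b for b in scene if b != a)
    colours = {}
    # Colour the most-connected speakers first.
    for speaker in sorted(neighbours, key=lambda s: -len(neighbours[s])):
        taken = {colours[n] for n in neighbours[speaker] if n in colours}
        free = [c for c in PREFERENCE if c not in taken]
        colours[speaker] = free[0] if free else "white"  # white is shareable
    return colours

print(colour_speakers([["ANNA", "BEN"], ["BEN", "CLAIRE", "DAN"], ["ANNA", "DAN"]]))
```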

    1.5 Shot Change Detection and Subtitle Formation

    Skilled subtitlers take careful note of shot changes when producing subtitles, to avoid distracting the viewers. Generally, subtitles should not be displayed during a shot change, as viewers tend to look away from the text at such instants. Similarly, subtitles should neither start nor end within a second of a shot change, as this is distracting. Starting or ending subtitles in synchrony with shot changes is preferable.

    A shot change detector is included in Assisted Subtitling, analysing the video of the programme and producing a list of the times of each shot change. These times, together with the timed words determined by the alignment module and the colours assigned by the colouring module, feed the final module: subtitle generation.

    The words from the script are concatenated to form subtitles, but a large number of user-configurable subtitle preferences are also applied, so that the subtitles have the best possible overall appearance. The task is to optimise the start and end times of each subtitle, together with the grouping of words between adjacent subtitles, and between lines in the current subtitle. Line and subtitle breaks can interrupt the grammar and slow reading, but subtitle timing cannot drift very far from the timing of the dialogue, subtitles must be displayed long enough to be read and, as already discussed, care must be taken around shot changes. The subtitle generation module in AS manages this complicated balance between timing, synchronisation and appearance to produce as good a set of subtitles as possible. The user-configurable preferences specify the relative importance of particular attributes in the finished subtitles. These subtitles are output in the form of a standard EBU format subtitle file. Figure 2 shows subtitles produced by the AS system.
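    As an isolated sketch of just the shot-change rule described above (the real module optimises this jointly with reading speed and line grouping, and the thresholds here are assumptions), a single subtitle boundary could be adjusted like this:

```python
# Sketch of the shot-change rule only; SNAP is an assumed threshold and
# the real AS optimiser balances this against many other preferences.
SNAP = 0.4    # assumed: within 0.4 s of a cut, synchronise with it
CLEAR = 1.0   # from the text: otherwise stay a second clear of a cut

def adjust_boundary(t, shot_changes):
    """Move a subtitle start/end time t (seconds) to respect nearby cuts."""
    for cut in shot_changes:
        if abs(t - cut) <= SNAP:
            return cut                                      # snap onto the cut
        if abs(t - cut) < CLEAR:
            return cut + CLEAR if t > cut else cut - CLEAR  # push a second clear
    return t

print(adjust_boundary(10.3, [10.0, 14.2]))  # -> 10.0 (snapped to the cut)
print(adjust_boundary(14.9, [10.0, 14.2]))  # -> 15.2 (pushed clear of the cut)
```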

    1.6 Impact

    The subtitles produced automatically by the AS system will not necessarily be ready for broadcast. Normally, the quality of the subtitles can be significantly improved by having a skilled subtitler use his/her experience in abbreviating the text and adjusting the timing and formatting of the subtitles. However, the AS system considerably reduces the time taken to subtitle a programme. Instead of between 12 and 16 hours' subtitling time per hour of programme, the skilled subtitler might now be required for as little as 8 hours. Indeed, for some programmes, such as narrated documentaries, with very simple timing and speaker patterns, the time saving can be even more dramatic. Subtitlers, therefore, can increase the number of programmes they can work on in a week. Within the BBC, Assisted Subtitling has been in continuous use for three years and programmes processed using the system are broadcast daily.

    2 SCRIPT CAPTURE FOR ASSISTED SUBTITLING

    2.1 Overview

    Assisted Subtitling is dependent upon an accurate transcript of the programme's dialogue. However, verbatim post-production scripts are not always available for a single programme. A fast and reliable method for capturing a transcript of a programme is required. As discussed, applying speaker-independent speech recognition to the programme's soundtrack will not produce a sufficiently accurate transcript. However, we can employ a speech recogniser which has been trained specifically to recognise the words spoken by an operator. The operator re-speaks the programme's dialogue, flagging changes of speaker and scene as he/she goes. The result is an accurate transcript of the programme which, by virtue of being respoken alongside playback of the programme, is already partially synchronised with the programme's soundtrack.

    2.2 Script Capture using Speaker-Specific Speech Recognition

    Script Capture currently uses ViaVoice from IBM [2], although a number of other suitable recognisers are available. ViaVoice is a speaker-specific voice recognition package in which a user profile is built up during a training period lasting approximately 2 hours, and optimised during subsequent use. Experienced ViaVoice users can achieve transcription accuracy in excess of 95%.

    The Script Capture application consists of an MPEG player for playback of the programme, as well as an editing window for the compilation of the script. The operator controls the playback of the programme using shuttle controls on the MPEG player. He/she re-speaks the programme's dialogue whilst the programme is running. Alongside this, the operator can also use the keyboard and mouse to flag scene breaks, select speaker names and insert synchronisation points. An illustration of Script Capture in use is shown in Figure 3.

    A Script Capture session generally involves the operator specifying the MPEG file containing the programme and a text file containing the names of its characters. The operator re-speaks the dialogue, selecting the appropriate speaker from the list as he/she progresses. (Controls are available on the MPEG player to shuttle the recording backwards and forwards, to identify characters and their entrances and exits, and to correct errors.) The transcript of the programme is gradually built up in the editing window. Whenever the operator indicates that the character who is speaking has changed, a sync marker is inserted into the transcript. Therefore, the transcript contains the approximate start time of each piece of dialogue. This will eventually aid the exact speech alignment of the script to the soundtrack.
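    A hypothetical sketch of the captured data structure (the real sync-point representation inside AML is not given in the paper):

```python
# Hypothetical sketch only: each speaker change starts a new block
# stamped with the current playback time, so every piece of dialogue
# carries the approximate start time later used to seed alignment.
from dataclasses import dataclass, field

@dataclass
class DialogueBlock:
    speaker: str
    sync_time: float              # seconds into the MPEG recording
    words: list = field(default_factory=list)

transcript = []

def on_speaker_change(speaker, playback_time):
    """Operator selects a new speaker from the character list."""
    transcript.append(DialogueBlock(speaker, playback_time))

def on_recognised_text(text):
    """Recogniser returns a phrase of re-spoken dialogue."""
    transcript[-1].words.extend(text.split())

on_speaker_change("NARRATOR", 12.4)
on_recognised_text("The story begins in a small village")
print(transcript[0])
```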

    The software ensures that the transcript is formatted in a formal and consistent manner. In this way, when the capture of a script is completed, the transcript is used to form an AML file for the programme without the need to use the probabilistic script analysis techniques generally used by the Assisted Subtitling system. An AML file produced by Script Capture consists primarily of character list and script sections, including sync points. This file, along with the MPEG file containing the programme's recording, can then be imported into the Assisted Subtitling application. Alignment (using Aurix's speaker-independent recogniser), speaker colouring, shot change detection and subtitle generation are then carried out as usual. The result is an initial, easily correctable set of subtitles obtained in a quick and reliable manner for a programme for which no post-production script was available.

    Figure 2: Examples of subtitles formed using Assisted Subtitling

    3 LIVE SUBTITLING

    3.1 Background

    The commitment to subtitle 100% of TV output encompasses a large amount of programming for which no pre-recording or transcript exists; i.e. live programmes or programmes edited very close to the time of transmission. Such programmes include news reports, sport, parliamentary coverage and special events (such as the Queen Mother's funeral) and are subtitled live. Live subtitling requires a very fast and accurate text input mechanism, to allow subtitles to be created without excessive delay.

    Specialised keyboard input and Computer-Aided Transcription (CAT) are often used, and recent years have seen the effective use of speech recognition to create the text.

    As with Script Capture, live subtitling by speech requires a speaker-specific recogniser, trained to the individual voice of the subtitler. The subtitler re-speaks the live programme's dialogue, condensing and rephrasing the text as required. The BBC has been using such a live subtitling technique for three years and has found it to be extremely effective. However, the equipment cost associated with scaling up the amount of live subtitling the BBC needs to perform to meet our commitment is prohibitive.

    Live subtitling through re-speaking has proven effective for BBC programmes. However, to meet its needs, the BBC requires a flexible, simple and low-cost live subtitling system, in which the subtitler's spoken (or CAT) input of text is formed into subtitles with the minimum of delay. For live programmes, optimal editing and correction of the subtitles would generally lead to an excessive lag between dialogue and subtitle. We want the subtitler's spoken input to go directly to air.

    Figure 3: Script Capture application in use

    An additional consideration is the multichannel nature of the BBC's television output. There are up to 14 BBC TV services on air at any moment. Conceivably, any or all of these may require live subtitling at some point. As many services and live programmes include regional opt-outs - for example a regional news update within a national news programme - some subtitlers are generally required at the regional broadcast centres to cover what might often be a very short broadcast. For flexibility, the BBC requires its subtitle production to be networked in such a way that geography is not a significant factor. Subtitlers can provide subtitles from any regional centre, or indeed from any location with the appropriate network connection. This allows the subtitlers to work from home, when necessary.

    3.2 The BBC Live Subtitling System

    The BBC's new live subtitling system is called KLive and allows a distributed team of skilled subtitlers to provide live subtitles for any of our television services from anywhere that they can be connected to the system's IP (Internet Protocol) network. PC-based subtitling workstation software can be used at regional broadcast centres or in the subtitlers' homes. The software allows subtitlers to hand over responsibility for services between each other. For example, Subtitler A - at BBC Glasgow - can relinquish responsibility for BBC1 to Subtitler B, just joining the system from home. Subtitler A then takes a break before requesting handover of responsibility for BBC Parliament from Subtitler C, working at BBC Bristol.
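    The handover protocol itself is not described in the paper; the following is purely our illustration of the idea that a name server tracks which workstation currently holds each service:

```python
# Purely illustrative; not the KLive protocol. A name server maps each
# television service to the workstation currently responsible for it,
# and a handover re-points that mapping with the holder's agreement.
class NameServer:
    def __init__(self):
        self.owner = {}  # service name -> workstation id

    def take_over(self, service, new_owner, agreed_by=None):
        current = self.owner.get(service)
        if current is not None and agreed_by != current:
            raise PermissionError(f"{service} is held by {current}")
        self.owner[service] = new_owner

ns = NameServer()
ns.take_over("BBC1", "subtitler_A@glasgow")   # Subtitler A goes on air
ns.take_over("BBC1", "subtitler_B@home", agreed_by="subtitler_A@glasgow")
print(ns.owner)  # {'BBC1': 'subtitler_B@home'}
```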

    Figure 4 shows a simple block diagram for theKLive system.

    3.3 The Real-Time Subtitling Workstation

    Figure 5 shows the KLive workstation application in use. When subtitling live, the subtitler watches the programme (usually the live on-air broadcast) and uses IBM ViaVoice (or CAT) to enter the subtitle text corresponding to the dialogue they hear. The workstation also allows a very small amount of subtitle formatting, including positioning and colour selection. However, essentially the text from ViaVoice or the CAT keyboard is divided directly into subtitles without further processing and delay.

    The workstation application runs on a PC connected to the IP network. The output of the workstation is a sequence of subtitles which are sent over the IP network to the subtitle transmission gateway.
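    The wire format is not specified in the paper; as an assumption-laden sketch, the workstation-to-gateway link could be as simple as newline-delimited JSON over TCP:

```python
# Assumed message format, host and port, for illustration only; the
# real KLive subtitle stream format is not given in the paper.
import json
import socket

GATEWAY = ("klive-gateway.example", 9150)   # hypothetical host and port

def send_subtitle(sock, text, colour="white", row=22):
    message = {"service": "BBC1", "text": text, "colour": colour, "row": row}
    sock.sendall((json.dumps(message) + "\n").encode("utf-8"))

with socket.create_connection(GATEWAY) as sock:
    send_subtitle(sock, "Good evening, and welcome to the news.")
```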

    3.4 KLive Name Server and Gateway

    Subtitles are inserted into the on-air signal via a subtitle inserter. In the live subtitling system, each inserter is coupled with a live subtitling gateway. The gateway is a piece of software which receives the IP data containing the live subtitle stream from the workstation and extracts the subtitles. These are passed to the inserter. Each television service with live subtitles requires a gateway-inserter pair. It is also prudent to include some redundancy. Normally there will be main and reserve broadcast chains, each with its own subtitle insertion point. The system is

    Figure 4: Block diagram of the KLive system. [Recovered block labels: speech input and CAT input feed ViaVoice subtitle creation in the KLive workstation, whose network interface connects over TCP/IP to the name server and to the KLive gateway; the gateway's network interface and subtitle extraction feed the inserter, which places subtitles in the broadcast signal.]