
TRANSPORT OF CONTEXT-BASED INFORMATION IN DIGITAL AUDIO DATA

Natalie Packham and Frank Kurth
Dept. of Computer Science V, University of Bonn, Römerstraße 164, 53117 Bonn, Germany
e-mail: [email protected], [email protected]

Abstract - In this paper we introduce a technique to embed context-based information into an audio signal and to extract this information from the signal during playback. Context-based information can be any kind of data related to the audio signal such as the lyrics of a song or the score of a piece of music. A typical application will display the extracted information. Techniques from lossy audio compression, especially MPEG-1-Layer-II, are used to ensure inaudibility of the noise introduced into the audio signal by the embedded data. Another aspect of our work deals with synchronized processing of the context-based information. Applications for embedding text information and for extracting this information during playback have been implemented and are presented in this paper. A variety of tests has been conducted to study the embedding capacity of various audio signals and to verify inaudibility of the embedded information.

I. INTRODUCTION

During playback of an audio signal it is often desirable to have additional information which belongs to the audio signal available. This additional information, in the following referred to as context-based information, can be any kind of data related to the audio signal such as the text of a song or the score of a piece of music. Typically, the context-based information is processed in some way during playback, for example by displaying it.

A convenient way of handling context-based information is to embed it directly into the audio signal. Once the information is extracted during playback it is ready for further processing. This method guarantees the availability of the information during playback of the audio signal without having to transmit it explicitly each time the audio signal is transmitted over a network or within a file system.

Several methods exist that carry text information along with digital audio data. One of them is CD-Text [1] which was designed for use with the CD audio format. An additional data stream contains the text information and a special CD player reads both the audio and the data stream. In view of a portable data format, the most severe disadvantage of CD-Text is the dependence on the CD audio format. The audio format for the Digital Compact Cassette (DCC) provides a similar approach for storing text information [2].

Steganographic methods circumvent the problem of dependency on a specific data format by integrating the data stream into the audio data. An example for such a steganographic method is the Audio Mole [3, 4], proposed by the Atlantic project, which uses the least significant bit of each audio sample for additional information. A major drawback of the Audio Mole is the introduction of noise that may create audible artifacts in the signal.

Previously known methods that use a psychoacoustic approach to ensure inaudibility of the embedded data typically allow only for very little data to be embedded. Mostly, such methods have been developed for watermarking purposes where robustness is more important than capacity [5]. To increase the embedding capacity significantly, our new embedding method incorporates some features of today's high quality, low bit rate audio codecs.

With the help of a psychoacoustic model of the human auditory system (HAS) the masking threshold of the signal (in the frequency domain) is computed. Frequency components whose sound pressure level lies below the masking threshold are not perceivable by the HAS. A common technique in lossy audio compression (as in ISO/MPEG 1) is the quantization of audio samples with the objective of keeping the quantization noise below the masking threshold. In a similar way we use a modified re-quantization algorithm to embed the context-based information. The amount of information that can be embedded lies in the range of the compression factor of the underlying audio coder, minus some bits required for the detection of the information and some error correction.


The paper is organized as follows: Section 2 gives an overview of MPEG audio coding and explains how the number of bits available for embedding the context-based information is calculated on the basis of the quantization information and the scale factor information of an underlying MPEG-1-Layer-II encoder.

Another objective in our work is the synchronized processing of the context-based information with playback of the audio signal. In section 3 we introduce concepts for the synchronized display of the text of a song and for musical scores. We investigate several techniques to overcome problems with the embedding capacity.

As a result of this work a wave player has evolved which extracts text information, displays it, and highlights each word as it is sung. Furthermore a tool has been developed to embed the context-based information interactively. Both tools are based on an implementation of an ISO/MPEG-1-Layer-II codec [6]. In section 4 we discuss the technical issues in realizing those applications. Some arrangements had to be made to provide for error detection and error correction.

The results of extensive studies are presented in section 5. The studies are divided as follows: The embedding capacity has been analyzed for a variety of audio signals. In this paper we present the results for pop music songs (section 5.1). The quality of the modified audio signal has been studied using both objective (section 5.2) and subjective measures (section 5.3). In section 6 a summary of the presented work is given.

Further applications of the proposed technique are synchronous storage of simultaneous translations/multilingual translations, synchronous embedding of score data such as musical notes etc., or storage of navigation information for applications with hypertext systems.

II. EMBEDDING CONTEXT-BASED INFORMATION WITH RESPECT TO PSYCHOACOUSTIC CONSTRAINTS

Overview of ISO/MPEG-1-Layer-II coding

[Figure 1: ISO/MPEG-1-Layer-II codec. Top: encoder, bottom: decoder. Encoder: PCM input signal → analysis filterbank; windowed FFT → masking thresholds/SMR → bit allocation; scaling and quantization → MUX → channel. Decoder: decoding of scale factors and bit allocation → rescaling and dequantization → synthesis filterbank → PCM output signal.]

Figure 1 shows an MPEG-1-Layer-II codec. The input signal is split into frames of 1152 samples per channel. Each frame is transformed by an analysis filterbank with 32 equally spaced bands to the subband domain. Next, the 32 subbands are split


into scale blocks, each containing 12 subband samples. Using a set of 63 predefined scale factors the subband samples of each scale block are normalized with a scale factor to optimize the interval in which the samples are represented.

At the same time, the input signal is transformed to the frequency domain by a windowed Fourier transform. The resulting spectrum is fed into the psychoacoustic model which calculates the masking threshold of the spectrum. With the minimal masking threshold in the frequency interval of a subband and the total energy of the spectral lines in the corresponding frequency interval, the signal-to-mask ratio (SMR) for each subband is calculated. Using the SMR it is possible to determine the amount of permissible quantization noise with respect to audibility of the noise. More formally, if the signal-to-noise ratio (SNR) of the noise resulting from quantization with an n-bit quantizer, SNR(n), is greater than the SMR of the corresponding subband, the quantization noise is not audible.

The bit allocation routine distributes the bits available for quantization in the best way according to perceptual aspects. The total number of bits available for the encoded bitstream depends on the chosen bitrate of the coder. Furthermore, the bits needed for header information and for side information (scaling and quantization information) have to be taken into account in determining the number of bits available for quantization.

The subband samples are scaled and quantized, and finally, the header information, the side information, and the quantized samples are merged into a serial stream.

The decoder separates the stream into header information, side information, and quantized samples. The quantized samples are dequantized and rescaled. The resulting subband samples are fed into the synthesis filterbank giving a PCM signal.

For detailed information regarding audio compression using MPEG 1 the reader is referred to [7, 8, 9, 10, 11, 12, 13].
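To make the rule concrete, the following minimal C sketch picks the smallest quantizer word length whose SNR exceeds a given SMR. It uses the common rule of thumb SNR(n) ≈ 6.02·n dB rather than the SNR table of the standard, and the function and example values are ours, for illustration only.

```c
#include <stdio.h>

/* Minimal sketch (not the ISO reference code): choose the smallest
 * quantizer word length n such that the quantization SNR exceeds the
 * subband's SMR.  We use the rule of thumb SNR(n) ~= 6.02*n dB; the
 * real Layer II encoder reads SNR(n) from a table in the standard. */
static int bits_for_subband(double smr_db)
{
    int n = 0;
    while (6.02 * n < smr_db && n < 16)
        n++;
    return n;               /* 0 means: subband needs no bits at all */
}

int main(void)
{
    double smr[] = { -3.0, 12.5, 40.2, 75.0 };   /* example SMR values, dB */
    for (int i = 0; i < 4; i++)
        printf("SMR %6.1f dB -> %2d bits\n", smr[i], bits_for_subband(smr[i]));
    return 0;
}
```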

Calculating the number of bits available for context-based information

Using the quantization information of the MPEG coder the number of bits available for embedding the context-based information is obtained. To assure preservation of the audio quality of the underlying codec the scaling information has to be taken into account as well. Furthermore, subbands which are not assigned any bits by the bit allocation algorithm are not used in the embedding process. The number of bits available for embedding context-based information in one sample will be referred to as the embedding bitwidth in the rest of the paper.

The scale factors are given by

    sf_n := 2^(1 - n/3),  0 ≤ n < 63.    (1)

The samples s_i, 0 ≤ i < 12, of a scale block are normalized as follows:

    s̃_i := s_i / sf,    (2)

where the scale factor sf of the scale block is the minimum of all scale factors greater than the absolute maximum of the samples s_i, 0 ≤ i < 12. For scale blocks with sf < 1, the interval in which the samples are represented is enlarged. Thus, scaling the samples before quantization decreases the quantization error.

The embedding bitwidth b_{j,k} of scale block (j, k), 0 ≤ j < 32, 0 ≤ k < 3, is calculated from

- the quantization information q_j (from the bit allocation algorithm) and
- the scale factor

    sf_{j,k} = sf_n = 2^(1 - n/3),  0 ≤ n < 63,    (3)

using the following equations:

    b_{j,k} := 16 - q_j,  0 ≤ n < 3,    (4)

and

    b_{j,k} := 16 - min(16, q_j - ⌊log_2 sf_{j,k}⌋)    (5)
             = 16 - min(16, q_j + ⌈n/3⌉ - 1),  3 ≤ n < 63.    (6)
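A minimal C sketch of this computation follows; q is the quantizer word length from the bit allocation and n the scale factor index from (1). The helper is illustrative, not the paper's implementation, and the example inputs are made up.

```c
#include <stdio.h>

/* Sketch of the embedding-bitwidth computation of equations (4)-(6):
 * q is the quantizer word length from the bit allocation, n the scale
 * factor index with sf_n = 2^(1 - n/3).  Names are ours. */
static int embedding_bitwidth(int q, int n)
{
    if (q == 0)
        return 0;                      /* subband got no bits: not used   */
    if (n < 3)                         /* sf >= 1: no shift gain, eq. (4) */
        return 16 - q;
    /* eq. (5)/(6): -floor(log2 sf_n) = ceil(n/3) - 1 for n >= 3 */
    int shift = (n + 2) / 3 - 1;       /* integer ceil(n/3) - 1           */
    int used  = q + shift;             /* bits occupied by the signal     */
    if (used > 16)
        used = 16;                     /* min(16, ...) caps the result    */
    return 16 - used;
}

int main(void)
{
    printf("q=8,  n=0  -> b=%d\n", embedding_bitwidth(8, 0));
    printf("q=8,  n=12 -> b=%d\n", embedding_bitwidth(8, 12));
    printf("q=15, n=30 -> b=%d\n", embedding_bitwidth(15, 30));
    return 0;
}
```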


[Figure 2: Synchronized display of lyrics of several vocal parts ("I'm the first in line," / "take a chance") placed on a timescale (frames 80-140). A locator marks the frame that is being played.]

III. SYNCHRONIZING THE CONTEXT-BASED INFORMATION AND THE AUDIO SIGNAL

For many applications dealing with context-based information it is desirable that the context-based information be processed (e.g. displayed) at a certain predefined moment in time which corresponds to some event in the audio signal. The synchronization of the context-based information with the audio signal is an essential part of our work.

First, we introduce a concept for synchronized display of the text of a song with one vocal part only. Next, we extend this concept to the text of several vocal parts, and finally, we give an idea how music scores can be embedded and synchronized.

A note on the precision of synchronization: In our work we use frames as a time measure for synchronization (assuming a sampling rate of 44.1 kHz a frame lasts for approximately 26 ms). Other measures could be the time elapsed since the beginning of the song in ms or fractions of frames. It is not feasible, however, to realize such precise measures without great effort since on a real system between five and ten frames are sent to the audio device in one call. The calling program does not have full control over the exact moment in time when processing takes place. This said, it is sensible to use frame numbers as a measure for synchronization.

Synchronization of text information

As the display of a whole line of text generally does not correspond to an event in the audio signal, in our approach each word is synchronized separately. A typical application will display a whole textline five to ten frames before the first word of the textline is sung and highlight each word when it is sung.

The text information that is embedded into an audio signal with one vocal part is given by an ASCII string defined by the following format:

word_0 word_1 ... word_{n-1}
@ mark_0 mark_1 ... mark_{n-1} '0xff'

Using C-style notationthe symbol’0xff’ representsa byte with all bits set. The first line containsthe text that will bedisplayed.Thesecondline, beginningwith thesymbol@ containstheframenumbersfor highlighting,wheremark

<indicates

theframenumberwhenword<will behighlighted.Theframenumbersaresubjectto thefollowing constraints:dfe,gih `kj !D dfe,gih < ' d=e,gih <Tl � nmG!1#*>o#3pq'*%r dfe,gih < � F � smIp�'*>o'*%r- (7)

A value of -1 indicates that no further highlighting will be done in the corresponding textline. An example for the actual string that will be embedded is the following textline:

People always told me
@ 18 31 46 57 '0xff'

Figure 8 contains an example for synchronized display of simple text.

The above concept for synchronized display of unison text information can be adapted for text information with several vocal parts in the following way (using the same notation as above):


word_{0,0} word_{0,1} ... word_{0,n_0 - 1}
@ mark_{0,0} mark_{0,1} ... mark_{0,n_0 - 1}
...
word_{m-1,0} word_{m-1,1} ... word_{m-1,n_{m-1} - 1}
@ mark_{m-1,0} mark_{m-1,1} ... mark_{m-1,n_{m-1} - 1} '0xff'

Here, each line containing text information corresponds to one vocal part of the song and each line containing frame numbers (starting with the symbol '@') indicates the frame numbers for highlighting. The symbol '0xff' indicates that the text lines are ready for processing. In analogy to (7) the frame numbers have to satisfy

    mark_{i,0} > 0 for 0 ≤ i < m;  mark_{i,k} < mark_{i,k+1} for 0 ≤ i < m, 0 ≤ k < n_i - 1.    (8)

A valid frame number has to be specified for each word due to the different method of display: Once the byte '0xff' is read, the words are placed on a timescale in accordance with their corresponding frame number and a locator marks the frame being played during playback (see Fig. 2 for an example).
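A small C sketch of a serializer for this format (single vocal part) follows. Only the '@' marker and the terminating byte 0xff are taken from the format above; the exact byte layout, e.g. whether a newline separates the two lines, is an assumption for illustration.

```c
#include <stdio.h>

/* Sketch: serialize one textline with its highlighting frame numbers
 * into the format of section 3.1 (words, then '@' and one mark per
 * word, terminated by the byte 0xff).  Buffer handling is simplified. */
static size_t make_textline(char *out, size_t cap,
                            const char **words, const int *marks, int n)
{
    size_t len = 0;
    for (int i = 0; i < n; i++)
        len += (size_t)snprintf(out + len, cap - len, "%s%s",
                                words[i], i + 1 < n ? " " : "\n");
    len += (size_t)snprintf(out + len, cap - len, "@");
    for (int i = 0; i < n; i++)
        len += (size_t)snprintf(out + len, cap - len, " %d", marks[i]);
    if (len < cap)
        out[len++] = (char)0xff;       /* terminator: byte with all bits set */
    return len;
}

int main(void)
{
    const char *words[] = { "People", "always", "told", "me" };
    const int   marks[] = { 18, 31, 46, 57 };
    char buf[256];
    size_t len = make_textline(buf, sizeof buf, words, marks, 4);
    fwrite(buf, 1, len, stdout);       /* last byte is 0xff, not text */
    putchar('\n');
    return 0;
}
```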

Synchronization of a musical score

Two additional things are required for embedding and synchronizing the score of a piece of music:

- a concept for description of notes and other musical elements,
- a concept for serialization of the music information.

We may achieve both requirements using the notation of the Standard MIDI File¹ (SMF). With MIDI, musical information is represented in a series of events. For the score of a piece of music this means that one has to transform the score (being a representation of musical information in notes and other symbols) to events. There exist a variety of programs for accomplishing this.

In addition to coding the music information in so-called MIDI events the SMF allows for meta-events such as the text event and the lyric event. These events offer a straightforward way for embedding text information in addition to the music information.
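For illustration, here is a C sketch that emits a single SMF lyric meta-event (status byte 0xFF, type 0x05) with its variable-length delta time and length fields. A complete file additionally needs the MThd/MTrk chunk headers, which are omitted; the delta time and lyric text are made-up examples.

```c
#include <stdio.h>
#include <string.h>

/* Sketch: emit one SMF lyric meta-event (FF 05 len text) preceded by a
 * variable-length delta time, as used when carrying song text inside a
 * Standard MIDI File track. */
static size_t write_varlen(unsigned char *out, unsigned long v)
{
    unsigned char tmp[4];
    size_t n = 0;
    do { tmp[n++] = v & 0x7f; v >>= 7; } while (v);
    for (size_t i = 0; i < n; i++)          /* high bits first, bit 7 set */
        out[i] = tmp[n - 1 - i] | (i + 1 < n ? 0x80 : 0x00);
    return n;
}

static size_t lyric_event(unsigned char *out, unsigned long delta,
                          const char *text)
{
    size_t n = write_varlen(out, delta);
    out[n++] = 0xff;                        /* meta event                 */
    out[n++] = 0x05;                        /* type: lyric                */
    n += write_varlen(out + n, strlen(text));
    memcpy(out + n, text, strlen(text));
    return n + strlen(text);
}

int main(void)
{
    unsigned char buf[64];
    size_t n = lyric_event(buf, 96, "take ");
    for (size_t i = 0; i < n; i++)
        printf("%02x ", buf[i]);
    putchar('\n');
    return 0;
}
```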

Overcoming embedding capacity bottlenecks

We have investigated several techniques to overcome capacity bottlenecks. Two sources for this problem have to be distinguished:

- There is not enough embedding capacity in the whole signal.
- Two separate items of context-based information (such as two text lines) collide, i.e. the frame where the embedding of one item starts is already occupied with another item.

Depending on the application the following techniques can be applied to solve the above problems, where techniques 1-4 will increase the embedding capacity in general and techniques 5 and 6 will distribute the bits available in a favourable way:

1. Compression of the context-based information: Using a lossless compression scheme can result in a more compact representation of the context-based information. Depending on the application, however, the format of the data which will be embedded may be chosen compact already and further compression will have only minor effects on the size of the data.

2. Making use of the subbands at higher frequencies: In MPEG coding at high sampling rates such as 44.1 kHz, the subbands with the highest frequencies are not assigned any bits by the bit allocation algorithm. In other terms, these subbands are quantized with zero bits. To guarantee that the embedding noise is of the same order as the quantization noise, these subbands are not used for embedding context-based information (consider a signal band-limited to f < 20 kHz; the quantization noise of the two subbands with highest frequencies is always zero). However, since any signal located at these frequencies is not perceivable by the HAS, these subbands can be used safely for embedding context-based information.

¹MIDI, Musical Instrument Digital Interface


3. Making use of the absolute threshold in quiet: In signal parts with silence there is no embedding capacity since all subband samples are "quantized" with zero bits. It makes sense, however, to choose the embedding bitwidth in such a way that the resulting embedding noise is below the absolute threshold of audibility. This technique guarantees a minimal constant rate of embedding capacity.

4. Reducing the bitrate of the encoder: The number of bits used for quantizing the samples of a subband depends on various factors, one of them being the bitrate which is fixed in the case of an MPEG-1-Layer-II encoder. A smaller bitrate reduces the number of bits available for quantizing the subband samples and, following the computation of the embedding bitwidth in (4) and (5), increases the embedding bitwidth.

Applying this technique may, of course, result in perceptual deterioration of the signal.

5. Using a bit reservoir: The MPEG-1-Layer-III encoder uses a so-called bit reservoir to make up for temporary shortage of bits needed for quantization. A frame which has bits left over after quantizing in such a way that the quantization noise is below the masking threshold can "put" these remaining bits in the bit reservoir. Another frame, at a later point in time, which needs more bits than it has available for quantizing to ensure that the quantization noise is not perceptible, can "take" bits from the reservoir. Bits remain in the reservoir for a fixed time interval and the encoding and decoding processes are delayed by this interval.

The use of a bit reservoir in the same way can prevent temporary shortage of embedding capacity (see the sketch after this list). We describe the realization of an embedding bit reservoir technique in a companion paper [14].

6. Relocating an item of context-based information: A straightforward way to deal with the collision of two items of context-based information is to relocate the first item to an earlier point in time. However, this can initiate a "domino effect" where the relocated item collides with the item before, and so on.
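The following C sketch illustrates only the bookkeeping idea behind technique 5; the actual embedding bit reservoir is the subject of the companion paper [14], and the capacity/demand numbers below are made up.

```c
#include <stdio.h>

/* Minimal sketch of bit-reservoir bookkeeping: frames with spare
 * capacity deposit bits, frames with a shortage withdraw them. */
struct reservoir { long bits; long cap; };

static long reconcile(struct reservoir *r, long have, long need)
{
    if (have >= need) {                    /* surplus: deposit leftover  */
        r->bits += have - need;
        if (r->bits > r->cap) r->bits = r->cap;
        return need;
    }
    long take = need - have;               /* shortage: withdraw         */
    if (take > r->bits) take = r->bits;
    r->bits -= take;
    return have + take;                    /* bits actually usable       */
}

int main(void)
{
    struct reservoir r = { 0, 4000 };
    long have[] = { 1200, 1500, 300, 100 }; /* per-frame capacity        */
    long need[] = {  800,  900, 700, 600 }; /* per-frame demand          */
    for (int i = 0; i < 4; i++)
        printf("frame %d: usable %ld (reservoir %ld)\n",
               i, reconcile(&r, have[i], need[i]), r.bits);
    return 0;
}
```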

IV. IMPLEMENTATION

[Figure 3: Embedding procedure. PCM input signal → analysis filterbank → embedding algorithm → synthesis filterbank → advance (481 samples) → PCM output signal; a windowed FFT feeds masking thresholds/SMR, scale factor information and bit allocation into the embedding algorithm.]

The embedding procedure is shown in Fig. 3. The input signal is split into frames of 1152 samples per channel and each frame is processed separately. The scale factor information and quantization information are computed in the same way as in the MPEG codec described in section 2.1. Using this, the embedding algorithm computes the embedding bitwidth and embeds the context-based information. This step is described in detail in section 4.2. The synthesis filterbank produces a PCM signal. To


account for the overall delay of the filterbank (481 samples) the signal is advanced by 481 samples. This step is necessary to make sure that the samples that build up a frame are the same for the embedding process and the extraction process.

Structure of an embedding block

[Figure 4: Embedding block: 12 subband samples of 16 bits each. The most significant bits of each sample carry the signal part; the least significant bits carry the signature (first samples) or the bits for context-based information plus the bits for error correction.]

Embedding takes place scale block-wise since the embedding bitwidth is computed scale block-wise. A scale block which is used for embedding context-based information is called an embedding block.

Figure 4 shows an embedding block. The block contains 12 subband samples. A sample may contain more than 16 bits since the analysis filterbank transforms the input signal from 16-bit integer values to floating point values. Only the 16 most significant bits are relevant for our application, though, as they will be transformed back to 16-bit integer values by the synthesis filterbank.

Of the embedding bitwidth b as computed in 2.2, only w := b - 3 bits are used to transport information. The 3 remaining bits are used for a forward error correction (FEC) which will be described later in this section. Scale blocks with an embedding bitwidth of b < 6 are not used for embedding information.

The first five samples of each embedding block are not used for context-based information. They carry a signature instead which is used for detecting the context-based information during the extraction process. Furthermore, the signature bits contain information about the embedding bitwidth so that the extraction algorithm knows how many bits to read from the following samples. The remaining seven samples are used for the context-based information. Of the 7·b bits available for embedding only 7·(b - 3) bits are used for the context-based information.

Between the embedding and the extraction process, errors may occur in the values of the subband samples. Those are due to

- the filterbank which satisfies only the NPR property (NPR, near perfect reconstruction),
- the system architecture, and
- transformation of integer values into floating point values and vice versa.

To prevent corruption of the context-based information, a FEC based on arithmetic coding is used [15, 16]. A number n ∈ ℕ is mapped to 8n + 3. In this way the binary representation of n is extended by 3 error correction bits and is made robust against errors in the interval [-3, 3].

We describe the binary representation of the 16 most significant bits of a subband sample with s = (s_15, ..., s_0). Let v = (v_{w-1}, ..., v_0) be the binary representation of the signature bits or the context-based information, depending on the sample number, and w := b - 3, where b is the embedding bitwidth for the embedding block. The embedding pattern p = (p_{b-1}, ..., p_0) consists of the information v to be embedded and the bits for error correction:

    p_0 := 1,  p_1 := 1,  p_2 := 0,  p_k := v_{k-3} for 3 ≤ k < b.    (9)

The embedding pattern p is embedded into sample s using the following function:

    f((s_15, ..., s_0), (p_{b-1}, ..., p_0)) := (s_15, ..., s_b, p_{b-1}, ..., p_0).    (10)
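A compact C sketch of equations (9) and (10): an information word v is mapped to the pattern p = 8v + 3, and the pattern replaces the b least significant of the sample's 16 bits. Extraction divides by 8, which corrects additive errors of the low bits in the interval [-3, 3]. All identifiers and example values are ours.

```c
#include <stdio.h>
#include <stdint.h>

/* Sketch of the FEC and the embedding function: p = 8*v + 3 (the three
 * low bits are the arithmetic-code redundancy), spliced into the b
 * least significant of a sample's 16 bits per equation (10). */
static uint16_t fec_encode(uint16_t v) { return (uint16_t)(8 * v + 3); }
static uint16_t fec_decode(uint16_t p) { return (uint16_t)(p / 8); }

static uint16_t embed(uint16_t sample, uint16_t p, int b)
{
    uint16_t keep = (uint16_t)(sample & ~((1u << b) - 1u)); /* s15..sb   */
    return (uint16_t)(keep | p);                            /* pb-1..p0  */
}

int main(void)
{
    int b = 7;                          /* embedding bitwidth, so w = 4  */
    uint16_t v = 0xA;                   /* 4 information bits            */
    uint16_t p = fec_encode(v);         /* 7-bit embedding pattern       */
    uint16_t s = embed(0xBEEF, p, b);
    /* a small disturbance of the low bits is corrected on decoding */
    printf("v=%x  extracted=%x  with error=%x\n",
           (unsigned)v,
           (unsigned)fec_decode((uint16_t)(s & ((1u << b) - 1u))),
           (unsigned)fec_decode((uint16_t)((s + 2) & ((1u << b) - 1u))));
    return 0;
}
```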

The embedding algorithm

1. Input:
   - Frame F_t consisting of 96 scale blocks (B_0, ..., B_95), where B_k, 0 ≤ k < 96, contains 12 subband samples.
   - Scale factor information and bit allocation information for each scale block in F_t: (sf_k, q_k), 0 ≤ k < 96.
   - Previous frame F_{t-1}, if F_t is not the first frame of the signal.

2. Using (sf_k, q_k) compute the embedding bitwidth b_k for each scale block B_k, 0 ≤ k < 96, using (4) and (5).

3. Remove falsely detected signatures.

4. For all B_k, 0 ≤ k < 96, do:
   - If there is context-based information to be embedded and 6 ≤ b_k < 16, then embed the signature and context-based information of bitwidth b_k - 3 plus 3 error correction bits into the samples of the block.

5. Carry out a final error correction on frame F_{t-1}.

6. Output: Frame F_t and frame F_{t-1}, if it exists.

Figure 5: Embedding algorithm.

The embedding algorithm is shown in Fig. 5. The computation of the embedding bitwidth is carried out as described in section 2.2 (step 2).

To make sure that no other scale blocks than the embedding blocks carry a valid signature, all scale blocks are checked. If a signature is detected it is removed by changing some bits in the corresponding samples (step 3).

After embedding the context-based information (step 4) a final error correction has to be performed. This is due to the fact that an error which occurs in a sample of a scale block (not an embedding block) can change the binary representation in such a way that the whole scale block contains a signature during extraction. To prevent this, all steps up to extraction are performed and any information extracted is matched against the information embedded. Because of the delay of the filterbank this step can only be carried out with a delay of one frame.

The extraction algorithm

The extraction algorithm has been realized for simple text information, i.e. song texts with one vocal part only. The algorithm is given in Fig. 6.


1. Input:
   - Frame F consisting of 1152 PCM samples.
   - Buffer T containing extracted information which has not been displayed yet.
   - Synchronization buffer S = (s_0, ..., s_{w-1}), w ∈ ℕ, with frame numbers corresponding to the textline which is currently displayed.
   - Display buffer D = (d_0, ..., d_{m-1}), m ∈ ℕ, with frame numbers for the display of the textlines in T.
   - Frame number c of F.

2. Play F.

3. If c = s_k, 0 ≤ k < w, then highlight the next word in the display.

4. If c = d_0 then display the next textline in T, delete d_0 from D and the textline from T.

5. Carry out the analysis filterbank on F. The resulting 1152 subband samples are split into 96 scale blocks B_0, ..., B_95 containing 12 samples each.

6. For all B_k, 0 ≤ k < 96, do:
   - If (B_{k,0}, ..., B_{k,4}) contains a valid signature, then
     - determine the embedding bitwidth, read out the context-based information I from (B_{k,5}, ..., B_{k,11}) and concatenate I with T.
     - If '0xff' ∈ I, then compute the frame d_m when the textline will be displayed as follows: let n be the frame number when the first word in the textline will be highlighted. If c < n - δ, then d_m := n - δ, else d_m := c + 1.

7. Output: Possibly modified buffers T, S and D.

Figure 6: The extraction algorithm for simple text information.

Implementation details

The algorithms described in sections 4.2 and 4.3 have been implemented for audio data in the Waveform Audio File Format (Wave format) which supports PCM audio data. The audio data should contain 16-bit samples, sampled at a rate of 44.1 kHz. The whole software runs under Windows and Linux; the GUI parts were realized using the GUI toolkit FLTK [17]. The software is split into the following components:

- mbed gui: This is a GUI application which supports the user in generating valid context-based information using the format described in section 3.1. This application computes the frame numbers when the embedding process for each textline should start to ensure that the text information is displayed before highlighting starts. Shortage of embedding capacity can be handled in various ways using some of the techniques described in 3.3.
- mbed: This application implements the embedding algorithm from section 4.2. It can be called directly from the previous application.
- karaoke: This GUI application implements the extraction algorithm from section 4.3.

If there is enough embedding capacity, it is possible to use only one channel for embedding the context-based information. This reduces the running time of the algorithm by nearly one half.


[Figure 7: The embedding software mbed gui.]

[Figure 8: The playback software karaoke.]


V. TESTS

A series of tests was conducted to evaluate the quality of our embedding method. As the main objective of our work is to present an embedding method for lyrics of songs and scores of music, we present the results of our studies for songs with accompanying music. We have studied other types of signals as well, such as speech signals and vocals without music. For testing purposes we distinguish regular embedding and maximum embedding, where the embedding noise is maximized.

With our studies we attempt to answer the following questions:

- Is there typically sufficient embedding capacity for our purposes?
- What is the ratio of embedding noise and the global masking threshold?
- Is the regular embedding of text information perceptible?
- How is regular embedding into one channel evaluated compared to regular embedding into both channels?

The testing material was mainly picked from the EBU/SQAM test CD [18].

Embedding capacity

[Figure 9: Embedding capacity of two pop songs. Panels plot gross and net embedding capacity (bits per frame) for v17_2.wav (one channel and both channels, 384 kbps) and v21_2.wav (one channel and both channels, 384 kbps), together with the corresponding signal waveforms.]

v17_2, duration: 9 s, 348 frames

                            384 kbps, one channel   384 kbps, both channels
  capacity, gross           111300                  217452
  capacity, net             34118                   66598
  # frames, cap. > 0 (= 0)  203 (145)               225 (123)

v21_2, duration: 19 s, 731 frames

                            384 kbps, one channel   384 kbps, both channels
  capacity, gross           789924                  1619640
  capacity, net             240247                  493248
  # frames, cap. > 0 (= 0)  686 (45)                705 (26)

Table 1: Embedding data of the same pop songs as in Fig. 9.

Figure 9 shows the embedding capacity for two pop songs. Table 1 contains some data regarding the embedding capacity of the signals. Both signals contain enough capacity for text information. Two textlines of lengths 58 bytes (464 bits) and 66 bytes (528 bits) are to be embedded into the first signal. Highlighting of the first word takes place at frame numbers 33 and 217, respectively. Up to frame number 27 there are 3164 bits available. The second textline will be embedded into frames 203-208. For the second signal there is plenty of embedding capacity as well.

In general, all pop music songs that we studied had enough embedding capacity for text information.

Comparison of embedding noise and the masking threshold

We define the ratio of embedding noise E and masking threshold thr as follows:

    EMR_n := 10 log_10 (E_n / thr_n) dB,  0 ≤ n < 32,    (11)

where E_n is the average energy of the embedding noise of the three scale blocks of subband n and thr_n is the energy value of the global masking threshold for subband n as given by the psychoacoustic model. A value of 0 dB represents a level in the threshold calculation of 96 dB below the energy of a sine wave of amplitude ±32760.

Using this, the energy E_{n,m}, 0 ≤ n < 32, 0 ≤ m < 3, is computed as follows:

    E_{n,m} := 2^(2 b_{n,m} - 1),  0 ≤ n < 32, 0 ≤ m < 3,    (12)

where b_{n,m} is the embedding bitwidth of scale block m of subband n.

The EMR_n value can be interpreted as follows: If EMR_n < 0 then the embedding noise lies below the masking threshold and is therefore not audible. In the case where EMR_n > 0 the embedding noise may be perceptible by the HAS.
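A C sketch of this computation follows; the masking-threshold energy and the embedding bitwidths are made-up example inputs, and the noise-energy model follows equation (12) as stated above.

```c
#include <math.h>
#include <stdio.h>

/* Sketch of the EMR of equation (11): per subband, the average
 * embedding-noise energy of the three scale blocks is compared with
 * the masking-threshold energy from the psychoacoustic model, using
 * the noise-energy model E = 2^(2b - 1) of equation (12). */
static double emr_db(const int bw[3], double thr_energy)
{
    double e = 0.0;
    for (int m = 0; m < 3; m++)
        e += pow(2.0, 2.0 * bw[m] - 1.0) / 3.0;  /* average energy */
    return 10.0 * log10(e / thr_energy);
}

int main(void)
{
    int bitwidth[3] = { 5, 6, 4 };     /* embedding bitwidths, one per
                                          scale block (example values) */
    double thr = 1.0e5;                /* masking-threshold energy     */
    printf("EMR = %.2f dB\n", emr_db(bitwidth, thr));
    return 0;
}
```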

[Figure 10: EMR values for a pop music piece, shown per subband (1-32) and frame for v11_2.wav with maximum embedding at 384 kbps (top) and at 128 kbps (bottom); the color scale ranges from -5 dB to 20 dB. The maximum of all EMR values using regular embedding at a bitrate of 384 kbps (not shown here) lies below 0 dB.]

Figure 10 shows the EMR values for a pop music piece. Embedding text causes no audible noise in the signal since all EMR values lie below zero. If the embedding noise is maximized (top picture in Fig. 10) only a handful of EMR values lie above the masking threshold. Because of the small number of such EMR values we can safely assume that they do not cause severe artifacts in the signal. Using a different bitrate, however, the signal's quality will deteriorate severely, as can be seen in the figure. The EMR values for other pop songs are comparable to the values shown.

Results of a listening test

We tested our method in a listening test with eleven people. Figure 11 shows the results for relative comparisons of pop music songs. Two versions (original and modified) of the same song were played subsequently and the test persons were asked to judge the difference in quality of the second version with respect to the first version on a scale ranging from -3 to 3.


[Figure 11: Results of listening tests. For each piece of music (105rabit, v11_2, v12_3, v14_1, v17_2, v21_2, v9_4, v9_5, v9_6) the average evaluation and the standard deviation are shown. Left: relative comparison, pop music, original and regular embedding, 384 kbps. Right: relative comparison, pop music, regular embedding in both channels and in one channel, 384 kbps.]

In the figure a value of -3 means that the quality of the modified signal is much worse than the quality of the original, a value of 0 indicates no difference in the two signals' quality, and a value of +3 states that the quality of the modified signal was much better than the quality of the original.

The comparisons between the original and the modified signals show that no obvious distortions were introduced into the signals by our embedding method. The second test compares regular embedding in both channels to embedding in one channel only. Again, no obvious preference can be detected.

Our studies show that our embedding method is very well suited to embedding text information into songs. Because of the high embedding capacity it is reasonable to assume that other types of context-based information can be embedded as well.

VI. CONCLUSION

In this paper, we introduced a technique to embed context-based information into audio signals. The embedding is suitable for online extraction and synchronous presentation of text and audio. We described MPEG-1-Layer-II based embedding and playback systems for audio pieces with one vocal part. Such systems may be used for karaoke-like applications. We furthermore discussed possible extensions for the support of multiple vocal parts and the embedding of score information.

Although the present realization of our system is based on the audio coding techniques described above, there is much room for simplifications. Conceptually, we only need suitable (fast) transforms and psychoacoustic models to realize our embedding methods. Especially an optimized spectral transform is likely to allow for a significant extraction speed-up.

VII. ACKNOWLEDGEMENTS

This work is part of the MiDiLiB project of our group [19]. It was supported in part by Deutsche Forschungsgemeinschaft under grants CL 64/3-1 and CL 64/3-4.

VIII. REFERENCES

[1] B. Behr and C. Wiedenhoff. Kuck mal, was da spielt. c't Magazin für Computertechnik, pages 218–222, April 1999.

[2] Abraham Hoogendoorn. Digital compact cassette. In IEEE Proc., volume 82, pages 1479–1489, October 1994.

[3] A. J. Mason. Audio Mole, coder control data to be embedded in decoded audio PCM.


[4] A. J. Mason, A. K. McParland, and N. J. Yeadon. The Atlantic audio demonstration equipment. In 106th Convention. AES, May 1999.

[5] Laurence Boney, Ahmed H. Tewfik, and Khaled N. Hamdy. Digital watermarks for audio signals. In 1996 IEEE Int. Conf. on Multimedia Computing and Systems, pages 473–480, Hiroshima, Japan, 1996.

[6] MPEG/audio Software Simulation Group. ISO/IEC 11172-3 Layer 1 and Layer 2 Software Package, 1993.

[7] ISO/IEC 11172-3: Coding of moving pictures and associated audio for digital storage media at up to 1.5 Mbit/s - audio part. Technical report, International Standard, 1992.

[8] K. Brandenburg and G. Stoll. ISO/MPEG-1 Audio: A generic standard for coding of high quality digital audio. J. Audio Eng. Soc., 42(10):780–792, October 1994.

[9] P. Noll. MPEG digital audio coding. IEEE Signal Processing Magazine, pages 59–81, 1997.

[10] Davis Pan. A tutorial on MPEG/Audio compression. IEEE Multimedia Journal, 1995.

[11] Davis Yen Pan. Digital audio compression. Digital Technical Journal of Digital Equipment Corporation, 5(2):28–40, 1993.

[12] S. Shlien. Guide to MPEG-1 audio standard. IEEE Transactions on Broadcasting, 40(4):206–281, December 1994.

[13] Udo Zölzer. Digitale Audiosignalverarbeitung. Teubner, Stuttgart, 1996.

[14] Frank Kurth and Viktor Hassenrik. A Dynamic Embedding Codec for Multiple Generations Compression. In Proc. 109th AES Convention, Los Angeles, USA, 2000.

[15] Frank Kurth. Vermeidung von Generationseffekten in der Audiocodierung. PhD thesis, Institut für Informatik V, Universität Bonn, 1999.

[16] W. Wesley Peterson and E. J. Weldon, Jr. Error-correcting codes. MIT Press, Cambridge, Mass., 2nd edition, 1988.

[17] Bill Spitzak et al. FLTK - The Fast Light Tool Kit. URL: http://www.fltk.org.

[18] Technical Centre of the European Broadcasting Union. Sound quality assessment material recordings for subjective tests, 1988.

[19] MiDiLiB. Content-based Indexing, Retrieval, and Compression of Data in Digital Music Libraries. http://leon.cs.uni-bonn.de/forschungprojekte/midilib/english/.
