TRANSPORT OF CONTEXT-BASED INFORMATION IN DIGITAL AUDIO DATA
Natalie Packham and Frank Kurth
Dept. of Computer Science V, University of Bonn, Römerstraße 164, 53117 Bonn, Germany
e-mail: [email protected], [email protected]
Abstract - In this paper we introduce a technique to embed context-based information into an audio signal and to extract this information from the signal during playback. Context-based information can be any kind of data related to the audio signal, such as the lyrics of a song or the score of a piece of music. A typical application will display the extracted information. Techniques from lossy audio compression, especially MPEG-1 Layer II, are used to ensure inaudibility of the noise introduced into the audio signal by the embedded data. Another aspect of our work deals with synchronized processing of the context-based information. Applications for embedding text information and for extracting this information during playback have been implemented and are presented in this paper. A variety of tests has been conducted to study the embedding capacity of various audio signals and to verify inaudibility of the embedded information.
I. INTRODUCTION
During playback of an audio signal it is often desirable to have additional information which belongs to the audio signal available. This additional information, in the following referred to as context-based information, can be any kind of data related to the audio signal, such as the text of a song or the score of a piece of music. Typically, the context-based information is processed in some way during playback, for example by displaying it.
A convenient way of handling context-based information is to embed it directly into the audio signal. Once the information is extracted during playback it is ready for further processing. This method guarantees the availability of the information during playback of the audio signal without having to transmit it explicitly each time the audio signal is transmitted over a network or within a file system.
Several methods exist that carry text information along with digital audio data. One of them is CD-Text [1], which was designed for use with the CD audio format. An additional data stream contains the text information, and a special CD player reads both the audio and the data stream. In view of a portable data format, the most severe disadvantage of CD-Text is its dependence on the CD audio format. The audio format for the Digital Compact Cassette (DCC) provides a similar approach for storing text information [2].
Steganographic methods circumvent the problem of dependency on a specific data format by integrating the data stream into the audio data. An example of such a steganographic method is the Audio Mole [3, 4], proposed by the Atlantic project, which uses the least significant bit of each audio sample for additional information.
A major drawback of the Audio Mole is the introduction of noise that may create audible artifacts in the signal. Previously known methods that use a psychoacoustic approach to ensure inaudibility of the embedded data typically allow only very little data to be embedded. Mostly, such methods have been developed for watermarking purposes, where robustness is more important than capacity [5].
To increase the embedding capacity significantly, our new embedding method incorporates some features of today's high-quality, low-bit-rate audio codecs. With the help of a psychoacoustic model of the human auditory system (HAS), the masking threshold of the signal (in the frequency domain) is computed. Frequency components whose sound pressure level lies below the masking threshold are not perceivable by the HAS. A common technique in lossy audio compression (as in ISO/MPEG1) is the quantization of audio samples with the objective of keeping the quantization noise below the masking threshold. In a similar way we use a modified re-quantization algorithm to embed the context-based information. The amount of information that can be embedded lies in the range of the compression factor of the underlying audio coder, minus some bits required for the detection of the information and some error correction.
AES 109th CONVENTION, LOS ANGELES, 2000 SEPTEMBER 22-25
PACKHAM AND KURTH TRANSPORT OF CONTEXT-BASED INFORMATION
The paper is organized as follows: Section 2 gives an overview of MPEG audio coding and explains how the number of bits available for embedding the context-based information is calculated on the basis of the quantization information and the scale factor information of an underlying MPEG-1-Layer-II encoder.
Another objective in our work is the synchronized processing of the context-based information with playback of the audio signal. In section 3 we introduce concepts for the synchronized display of the text of a song and for musical scores. We investigate several techniques to overcome problems with the embedding capacity.
As a result of this work a wave player has evolved which extracts text information, displays it, and highlights each word as it is sung. Furthermore, a tool has been developed to embed the context-based information interactively. Both tools are based on an implementation of an ISO/MPEG-1-Layer-II codec [6]. In section 4 we discuss the technical issues in realizing those applications. Some arrangements had to be made to provide for error detection and error correction.
The results of extensive studies are presented in section 5. The studies are divided as follows: The embedding capacity has been analyzed for a variety of audio signals. In this paper we present the results for pop music songs (section 5.1). The quality of the modified audio signal has been studied using both objective (section 5.2) and subjective measures (section 5.3). In section 6 a summary of the presented work is given.
Further applications of the proposed technique are synchronous storage of simultaneous/multilingual translations, synchronous embedding of score data such as musical notes, or storage of navigation information for applications with hypertext systems.
II. EMBEDDING CONTEXT-BASED INFORMATION WITH RESPECT TO PSYCHOACOUSTIC CONSTRAINTS
Overview of ISO/MPEG-1-Layer-II coding
[Figure 1: ISO/MPEG-1-Layer-II codec. Top: encoder (PCM input signal, analysis filterbank, windowed FFT, masking thresholds/SMR, scaling information, scaling and quantization, bit allocation, MUX); bottom: decoder (DEMUX, decoding of scale factors and bit allocation, rescaling and dequantization, synthesis filterbank, PCM output signal).]

Figure 1 shows an MPEG-1-Layer-II codec. The input signal is split into frames of 1152 samples per channel. Each frame
is transformed by an analysis filterbank with 32 equally spaced bands into the subband domain. Next, the 32 subbands are split
into scaleblocks, each containing 12 subband samples. Using a set of 63 predefined scale factors, the subband samples of each scaleblock are normalized with a scale factor to optimize the interval in which the samples are represented.
At the same time, the input signal is transformed to the frequency domain by a windowed Fourier transform. The resulting spectrum is fed into the psychoacoustic model, which calculates the masking threshold of the spectrum. With the minimal masking threshold in the frequency interval of a subband and the total energy of the spectral lines in the corresponding frequency interval, the signal-to-mask ratio (SMR) for each subband is calculated. Using the SMR it is possible to determine the amount of permissible quantization noise with respect to audibility of the noise. More formally, if the signal-to-noise ratio SNR(m) of the noise resulting from quantization with an m-bit quantizer is greater than the SMR of the corresponding subband, the quantization noise is not audible.
The bit allocation routine distributes the bits available for quantization in the best way according to perceptual aspects. The total number of bits available for the encoded bitstream depends on the chosen bitrate of the coder. Furthermore, the bits needed for header information and for side information (scaling and quantization information) have to be taken into account in determining the number of bits available for quantization.
The subband samples are scaled and quantized, and finally, the header information, the side information, and the quantized samples are merged into a serial stream.
The decoder separates the stream into header information, side information, and quantized samples. The quantized samples are dequantized and rescaled. The resulting subband samples are fed into the synthesis filterbank, giving a PCM signal. For detailed information regarding audio compression using MPEG1 the reader is referred to [7, 8, 9, 10, 11, 12, 13].
Calculating the number of bits available for context-based information
Using the quantization information of the MPEG coder, the number of bits available for embedding the context-based information is obtained. To assure preservation of the audio quality of the underlying codec, the scaling information has to be taken into account as well. Furthermore, subbands which are not assigned any bits by the bit allocation algorithm are not used in the embedding process.
The number of bits available for embedding context-based information in one sample will be referred to as the embedding bitwidth in the rest of the paper.
The scale factors are given by

    sf(k) := 2^(2 - k/3),   0 <= k < 63.   (1)
The samples x_n, 0 <= n < 12, of a scaleblock are normalized as follows:

    x̂_n := x_n / sf,   (2)
where the scale factor sf of the scaleblock is the minimum of all scale factors greater than the absolute maximum of the samples x_n, 0 <= n < 12. For scaleblocks with sf < 1, the interval in which the samples are represented is enlarged. Thus, scaling the samples before quantization decreases the quantization error.
The embedding bitwidth w_{i,j} of scaleblock S_{i,j}, 0 <= i < 32, 0 <= j < 3, is calculated from

- the quantization information q_i (from the bit allocation algorithm) and
- the scale factor

    sf_{i,j} ∈ { 2^(2 - k/3) : 0 <= k < 63 }   (3)

using the following equations:

    w_{i,j} := 16 - q_i,   0 <= k < 3,   (4)

and

    w_{i,j} := 16 - min(16, q_i - ⌊log2 sf_{i,j}⌋)   (5)
             = 16 - min(16, q_i + ⌈k/3⌉ - 2),   3 <= k < 63.   (6)
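The scale-factor mechanism of (1) and (2) can be sketched directly: build the 63-entry table of (1) and, per scaleblock, pick the smallest factor greater than the absolute maximum of the samples. The function below is an illustrative sketch, not the reference encoder.

```python
SCALEFACTORS = [2.0 ** (2 - k / 3) for k in range(63)]  # eq. (1), decreasing in k

def normalize_scaleblock(samples):
    """Normalize one scaleblock of subband samples, cf. eq. (2).

    Returns (scale-factor index k, normalized samples), where
    SCALEFACTORS[k] is the minimum of all scale factors greater
    than max|x_n|.
    """
    peak = max(abs(x) for x in samples)
    # the table decreases with k, so the smallest admissible factor
    # is the one with the largest index still exceeding the peak
    k = max(i for i, sf in enumerate(SCALEFACTORS) if sf > peak)
    return k, [x / SCALEFACTORS[k] for x in samples]
```

For a scaleblock peaking at 0.9, the factor chosen is sf = 1.0 (k = 6), so all normalized samples lie strictly inside (-1, 1).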
[Figure 2: Synchronized display of lyrics of several vocal parts ("I'm the first in line", "take take take a chance, chance, chance") placed on a timescale of frames (80-140). A locator marks the frame that is being played.]
III. SYNCHRONISING THE CONTEXT-BASED INFORMATION AND THE AUDIO SIGNAL
For many applications dealing with context-based information it is desirable that the context-based information be processed (e.g. displayed) at a certain predefined moment in time which corresponds to some event in the audio signal. The synchronization of the context-based information with the audio signal is an essential part of our work.
First, we introduce a concept for synchronized display of the text of a song with one vocal part only. Next, we extend this concept to the text of several vocal parts, and finally, we give an idea how music scores can be embedded and synchronized.
A note on the precision of synchronization: In our work we use frames as a time measure for synchronization (assuming a sampling rate of 44.1 kHz, a frame lasts approximately 26 ms). Other measures could be the time elapsed since the beginning of the song in ms, or fractions of frames. It is not feasible, however, to realize such precise measures without great effort, since on a real system between five and ten frames are sent to the audio device in one call. The calling program does not have full control over the exact moment in time when processing takes place. This said, it is sensible to use frame numbers as a measure for synchronization.
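The 26 ms figure follows directly from the frame size: 1152 samples at 44.1 kHz. A small, hypothetical helper for mapping playback time to frame numbers:

```python
SAMPLES_PER_FRAME = 1152
SAMPLE_RATE_HZ = 44100
FRAME_MS = SAMPLES_PER_FRAME / SAMPLE_RATE_HZ * 1000  # ~26.12 ms per frame

def ms_to_frame(ms):
    """Frame number that contains the given playback time (ms)."""
    return int(ms // FRAME_MS)
```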
Synchronization of text information
As the display of a whole line of text generally does not correspond to an event in the audio signal, in our approach each word is synchronized separately. A typical application will display a whole textline five to ten frames before the first word of the textline is sung and highlight each word when it is sung.
The text information that is embedded into an audio signal with one vocal part is given by an ASCII string defined by the following format:
    word_0 word_1 ... word_{n-1}
    @ mark_0 mark_1 ... mark_{n-1} '0xff'
Using C-style notation, the symbol '0xff' represents a byte with all bits set. The first line contains the text that will be displayed. The second line, beginning with the symbol '@', contains the frame numbers for highlighting, where mark_j indicates the frame number at which word_j will be highlighted. The frame numbers are subject to the following constraints:

    mark_0 >= 0,   mark_j < mark_{j+1} for 0 <= j < p < n,   mark_j = -1 for p < j < n.   (7)
A value of -1 indicates that no further highlighting will be done in the corresponding textline. An example of the actual string that will be embedded is the following textline:
    People always told me
    @ 18 31 46 57 '0xff'
Figure 8 contains an example of synchronized display of simple text.
The above concept for synchronized display of unison text information can be adapted for text information with several vocal parts in the following way (using the same notation as above):
    word_{0,0} word_{0,1} ... word_{0,n_0 - 1}
    @ mark_{0,0} mark_{0,1} ... mark_{0,n_0 - 1}
    ...
    word_{V-1,0} word_{V-1,1} ... word_{V-1,n_{V-1} - 1}
    @ mark_{V-1,0} mark_{V-1,1} ... mark_{V-1,n_{V-1} - 1} '0xff'
Here, each line containing text information corresponds to one vocal part of the song, and each line containing frame numbers (starting with the symbol '@') indicates the frame numbers for highlighting. The symbol '0xff' indicates that the text lines are ready for processing. In analogy to (7) the frame numbers have to satisfy

    mark_{v,0} >= 0,   mark_{v,j} < mark_{v,j+1},   0 <= v < V,   0 <= j < n_v - 1.   (8)
A valid frame number has to be specified for each word due to the different method of display: Once the byte '0xff' is read, the words are placed on a timescale in accordance with their corresponding frame numbers, and a locator marks the frame being played during playback (see Fig. 2 for an example).
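The single-vocal-part format of section 3.1 can be parsed in a few lines. The routine below is an illustrative sketch; treating a newline as the separator between the word line and the '@' line is our assumption about the serialized layout.

```python
def parse_textline(payload: bytes):
    """Split one embedded textline into (words, highlight marks).

    Assumed layout (cf. section 3.1):
        word_0 ... word_{n-1} "\\n" "@ " mark_0 ... mark_{n-1} 0xff
    A mark of -1 means no further highlighting in this textline.
    """
    if not payload.endswith(b"\xff"):
        raise ValueError("missing 0xff terminator")
    text_line, mark_line = payload[:-1].decode("ascii").split("\n")
    if not mark_line.startswith("@"):
        raise ValueError("missing '@' mark line")
    return text_line.split(), [int(m) for m in mark_line[1:].split()]
```

Applied to the example above, this yields the four words and the marks [18, 31, 46, 57].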
Synchronization of a musical score
Two additional things are required for embedding and synchronizing the score of a piece of music:

- a concept for the description of notes and other musical elements,
- a concept for the serialization of the music information.
We may achieve both requirements using the notation of the Standard MIDI File¹ (SMF). With MIDI, musical information is represented as a series of events. For the score of a piece of music this means that one has to transform the score (being a representation of musical information in notes and other symbols) into events. There exist a variety of programs for accomplishing this. In addition to coding the music information in so-called MIDI events, the SMF allows for meta-events such as the text event and the lyric event. These events offer a straightforward way of embedding text information in addition to the music information.
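As a sketch of what such a meta-event looks like on the byte level: an SMF lyric event is the byte pair FF 05 followed by a variable-length length field and the text (the delta-time that precedes every SMF event is omitted here for brevity):

```python
def lyric_event_bytes(text: str) -> bytes:
    """Serialize an SMF 'lyric' meta-event body: FF 05 <var-len> <text>."""
    data = text.encode("ascii")
    # variable-length quantity: 7 bits per byte, MSB set on all but the last
    n, vlq = len(data), b""
    while True:
        vlq = bytes([(n & 0x7F) | (0x80 if vlq else 0)]) + vlq
        n >>= 7
        if n == 0:
            break
    return b"\xff\x05" + vlq + data
```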
Overcoming embedding capacity bottlenecks
We have investigated several techniques to overcome capacity bottlenecks. Two sources of this problem have to be distinguished:

- There is not enough embedding capacity in the whole signal.
- Two separate items of context-based information (such as two text lines) collide, i.e. the frame where the embedding of one item starts is already occupied by another item.
Depending on the application, the following techniques can be applied to solve the above problems, where techniques 1-4 will increase the embedding capacity in general and techniques 5 and 6 will distribute the available bits in a favourable way:
1. Compression of the context-based information: Using a lossless compression scheme can result in a more compact representation of the context-based information. Depending on the application, however, the format of the data which will be embedded may be chosen compact already, and further compression will have only minor effects on the size of the data.
2. Making use of the subbands at higher frequencies: In MPEG coding at high sampling rates such as 44.1 kHz, the subbands with the highest frequencies are not assigned any bits by the bit allocation algorithm. In other terms, these subbands are quantized with zero bits. To guarantee that the embedding noise is of the same order as the quantization noise, these subbands are not used for embedding context-based information (consider a band-limited signal with a bandwidth below 20 kHz; the quantization noise of the two subbands with the highest frequencies is always zero). However, since any signal located at these frequencies is not perceivable by the HAS, these subbands can be used safely for embedding context-based information.
¹MIDI: Musical Instrument Digital Interface
3. Making use of the absolute threshold in quiet: In signal parts with silence there is no embedding capacity, since all subband samples are "quantized" with zero bits. It makes sense, however, to choose the embedding bitwidth in such a way that the resulting embedding noise is below the absolute threshold of audibility. This technique guarantees a minimal constant rate of embedding capacity.
4. Reducing the bitrate of the encoder: The number of bits used for quantizing the samples of a subband depends on various factors, one of them being the bitrate, which is fixed in the case of an MPEG-1-Layer-II encoder. A smaller bitrate reduces the number of bits available for quantizing the subband samples and, following the computation of the embedding bitwidth in (4) and (5), increases the embedding bitwidth. Applying this technique may, of course, result in perceptual deterioration of the signal.
5. Using a bit reservoir: The MPEG-1-Layer-III encoder uses a so-called bit reservoir to make up for temporary shortages of bits needed for quantization. A frame which has bits left over after quantizing in such a way that the quantization noise is below the masking threshold can "put" these remaining bits into the bit reservoir. Another frame, at a later point in time, which needs more bits than it has available for quantizing to ensure that the quantization noise is not perceptible, can "take" bits from the reservoir. Bits remain in the reservoir for a fixed time interval, and the encoding and decoding processes are delayed by this interval. The use of a bit reservoir in the same way can prevent temporary shortages of embedding capacity. We describe the realization of an embedding bit reservoir technique in a companion paper [14].
6. Relocating an item of context-based information: A straightforward way to deal with the collision of two items of context-based information is to relocate the first item to an earlier point in time. However, this can initiate a "domino effect" where the relocated item collides with the item before it, and so on.
IV. IMPLEMENTATION
[Figure 3: Embedding procedure. The PCM input signal passes through the analysis filterbank; a windowed FFT, masking thresholds/SMR, bit allocation, and scale factor information feed the embedding algorithm; the synthesis filterbank produces the PCM output signal, which is advanced by 481 samples.]
The embedding procedure is shown in Fig. 3. The input signal is split into frames of 1152 samples per channel and each frame is processed separately. The scale factor information and quantization information are computed in the same way as in the MPEG codec described in section 2.1. Using this, the embedding algorithm computes the embedding bitwidth and embeds the context-based information. This step is described in detail in section 4.2. The synthesis filterbank produces a PCM signal. To
account for the overall delay of the filterbank (481 samples), the signal is advanced by 481 samples. This step is necessary to make sure that the samples that build up a frame are the same for the embedding process and the extraction process.
Structure of an embeddingblock
[Figure 4: Embedding block. A signal part of 12 subband samples of 16 bits each (MSB to LSB); the first samples carry a signature, the remaining samples carry bits for error correction and bits for context-based information.]
Embedding takes place scaleblock-wise, since the embedding bitwidth is computed scaleblock-wise. A scaleblock which is used for embedding context-based information is called an embedding block.
Figure 4 shows an embedding block. The block contains 12 subband samples. A sample may contain more than 16 bits, since the analysis filterbank transforms the input signal from 16-bit integer values to floating point values. Only the 16 most significant bits are relevant for our application, though, as they will be transformed back to 16-bit integer values by the synthesis filterbank.
Of the embedding bitwidth w as computed in 2.2, only b := w - 3 bits are used to transport information. The 3 remaining bits are used for a forward error correction (FEC), which will be described later in this section. Scaleblocks with an embedding bitwidth of w < 6 are not used for embedding information.
The first five samples of each embedding block are not used for context-based information. They carry a signature instead, which is used for detecting the context-based information during the extraction process. Furthermore, the signature bits contain information about the embedding bitwidth, so that the extraction algorithm knows how many bits to read from the following samples. The remaining seven samples are used for the context-based information. Of the 7·w bits available for embedding, only 7·(w - 3) bits are used for the context-based information.
Between the embedding and the extraction process, errors may occur in the values of the subband samples. These are due to

- the filterbank, which satisfies only the NPR property (NPR: near perfect reconstruction),
- the system architecture, and
- the transformation of integer values into floating point values and vice versa.
To prevent crippling of the context-based information, an FEC based on arithmetic coding is used [15, 16]. A number n, 0 <= n < 2^b, is mapped to 8n + 4. In this way the binary representation of n is extended by 3 error correction bits and is made robust against additive errors of magnitude less than 4.
We describe the binary representation of the 16 most significant bits of a subband sample as s = (s_15, ..., s_0). Let m = (m_{b-1}, ..., m_0) be the binary representation of the signature bits or of the context-based information, depending on the sample number, and b := w - 3, where w is the embedding bitwidth of the embedding block. The embedding pattern e = (e_{w-1}, ..., e_0)
consists of the information m to be embedded and the bits for error correction:

    (e_2, e_1, e_0) := (1, 0, 0),   e_j := m_{j-3},   3 <= j < w.   (9)
The embedding pattern e is embedded into sample s using the following function:

    f(e, (s_15, ..., s_0)) := (s_15, ..., s_w, e_{w-1}, ..., e_0).   (10)
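Read this way, (9) and (10) amount to mapping the payload m to the pattern e = 8m + 4 and splicing e into the w least significant of the sample's 16 relevant bits; extraction discards the three low bits again, which absorbs small additive errors. The constants below follow our reconstruction of the degraded formulas and should be treated as assumptions:

```python
def embed_pattern(sample16, payload, w):
    """Splice the embedding pattern into the w LSBs of a 16-bit value.

    Pattern e = 8*payload + 4: payload in the upper w-3 bits, the
    fixed bits (1, 0, 0) of eq. (9) in the lower three.
    """
    e = (payload << 3) | 0b100
    return (sample16 & ~((1 << w) - 1)) | e

def extract_payload(sample16, w):
    """Recover the payload; additive errors of magnitude < 4 cancel."""
    return (sample16 & ((1 << w) - 1)) >> 3
```

Since the embedded value is 8m + 4, any perturbed value 8m + 4 + err with |err| <= 3 still yields m after the final 3-bit shift.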
The embeddingalgorithm
1. Input:
   - Frame F_t consisting of 96 scaleblocks S = (S_0, ..., S_95), where S_j, 0 <= j < 96, contains 12 subband samples.
   - Scale factor information and bit allocation information for each scaleblock in F_t: (sf_j, q_j), 0 <= j < 96.
   - Previous frame F_{t-1}, if F_t is not the first frame of the signal.
2. Using (sf_j, q_j), compute the embedding bitwidth w_j for each scaleblock S_j, 0 <= j < 96, using (4) and (5).
3. Remove falsely detected signatures.
4. For all S_j, 0 <= j < 96, do:
   - If there is context-based information to be embedded and 6 <= w_j < 16, then embed the signature and context-based information of bitwidth w_j - 3, plus 3 error correction bits, into the samples of the block.
5. Carry out a final error correction on frame F_{t-1}.
6. Output: Frame F_t and frame F_{t-1}, if it exists.

Figure 5: Embedding algorithm.
The embedding algorithm is shown in Fig. 5. The computation of the embedding bitwidth is carried out as described in section 2.2 (step 2).
To make sure that no scaleblocks other than the embedding blocks carry a valid signature, all scaleblocks are checked. If a signature is detected, it is removed by changing some bits in the corresponding samples (step 3).
After embedding the context-based information (step 4), a final error correction has to be performed. This is due to the fact that an error which occurs in a sample of a scaleblock (not an embedding block) can change the binary representation in such a way that the whole scaleblock contains a signature during extraction. To prevent this, all steps up to extraction are performed and any information extracted is matched against the information embedded. Because of the delay of the filterbank, this step can only be carried out with a delay of one frame.
The extraction algorithm
The extraction algorithm has been realized for simple text information, i.e. song texts with one vocal part only. The algorithm is given in Fig. 6.
1. Input:
   - Frame F consisting of 1152 PCM samples.
   - Buffer T containing extracted information which has not been displayed yet.
   - Synchronization buffer M = (m_0, ..., m_{k-1}), k ∈ N, with frame numbers corresponding to the textline which is currently displayed.
   - Display buffer D = (d_0, ..., d_{l-1}), l ∈ N, with frame numbers for the display of the textlines in T.
   - Frame number c of F.
2. Play F.
3. If c = m_j, 0 <= j < k, then highlight the next word in the display.
4. If c = d_0, then display the next textline in T, delete d_0 from D and the textline from T.
5. Carry out the analysis filterbank on F. The resulting 1152 subband samples are split into 96 scaleblocks (S_0, ..., S_95) containing 12 samples each.
6. For all S_j, 0 <= j < 96, do:
   - If (s_{j,0}, ..., s_{j,4}) contains a valid signature, then
     - determine the embedding bitwidth of S_j, read out the context-based information I from (s_{j,5}, ..., s_{j,11}) and concatenate I with T.
     - If '0xff' ∈ I, then compute the frame d_l at which the textline will be displayed as follows: let h be the frame number at which the first word in the textline will be highlighted, and let δ be the fixed display advance (cf. section 3.1). If c < h - δ, then d_l := h - δ, else d_l := c + 1.
7. Output: Possibly modified buffers T, M and D.

Figure 6: The extraction algorithm for simple text information.
Implementation details
The algorithms described in sections 4.2 and 4.3 have been implemented for audio data in the Waveform Audio File Format (Wave format), which supports PCM audio data. The audio data should contain 16-bit samples, sampled at a rate of 44.1 kHz. The whole software runs under Windows and Linux; the GUI parts were realized using the GUI toolkit FLTK [17]. The software is split into the following components:

- mbed gui: This is a GUI application which supports the user in generating valid context-based information using the format described in section 3.1. This application computes the frame numbers at which the embedding process for each textline should start, to ensure that the text information is displayed before highlighting starts. Shortage of embedding capacity can be handled in various ways using some of the techniques described in 3.3.
- mbed: This application implements the embedding algorithm from section 4.2. It can be called directly from the previous application.
- karaoke: This GUI application implements the extraction algorithm from section 4.3.
If there is enough embedding capacity, it is possible to use only one channel for embedding the context-based information. This reduces the running time of the algorithm by nearly one half.
Figure 7: The embedding software mbed gui.

Figure 8: The playback software karaoke.
V. TESTS
A series of tests was conducted to evaluate the quality of our embedding method. As the main objective of our work is to present an embedding method for lyrics of songs and scores of music, we present the results of our studies for songs with accompanying music. We have studied other types of signals as well, such as speech signals and vocals without music. For testing purposes we distinguish regular embedding and maximum embedding, where the embedding noise is maximized.
With our studies we attempt to answer the following questions:

- Is there typically sufficient embedding capacity for our purposes?
- What is the ratio of embedding noise and the global masking threshold?
- Is the regular embedding of text information perceptible?
- How is regular embedding into one channel evaluated compared to regular embedding into both channels?
The testing material was mainly picked from the EBU/SQAM test CD [18].
Embedding capacity
[Figure 9: Embedding capacity of two pop songs (v17_2.wav and v21_2.wav) at 384 kbps: gross and net embedding capacity in bits per frame, for one channel and for both channels, shown together with the signal waveforms.]
                               384 kbps, one channel   384 kbps, both channels
    v17_2 (9 s, 348 frames)
      capacity, gross (bits)   111300                  217452
      capacity, net (bits)     34118                   66598
      # frames with cap. > 0   203 (145)               225 (123)
    v21_2 (19 s, 731 frames)
      capacity, gross (bits)   789924                  1619640
      capacity, net (bits)     240247                  493248
      # frames with cap. > 0   686 (45)                705 (26)

Table 1: Embedding data of the same pop songs as in Fig. 9.
Figure 9 shows the embedding capacity for two pop songs. Table 1 contains some data regarding the embedding capacity of the signals. Both signals contain enough capacity for text information. Two textlines of lengths 58 bytes (464 bits) and 66 bytes
(528 bits) are to be embedded into the first signal. Highlighting of the first word takes place at frame numbers 33 and 217, respectively. Up to frame number 27 there are 3164 bits available. The second textline will be embedded into frames 203-208. For the second signal there is plenty of embedding capacity as well.
In general, all pop music songs that we studied had enough embedding capacity for text information.
Comparison of embedding noise and the masking threshold
We define the ratio of embedding noise and masking threshold, the embedding-noise-to-mask ratio (EMR), as follows:

    EMR(b) := 10 · log10( E_b / LTmin(b) ) dB,   0 <= b < 32,   (11)

where E_b is the average energy of the embedding noise over the three scaleblocks of subband b, and LTmin(b) is the energy value of the global masking threshold for subband b as given by the psychoacoustic model. A value of 0 dB represents a level in the threshold calculation of 96 dB below the energy of a sine wave of amplitude 32768.
Using this, the energy E_{b,s}, 0 <= b < 32, 0 <= s < 3, is computed as follows:

    E_{b,s} := 2^(2(w_{b,s} - 16)),   0 <= b < 32,   0 <= s < 3.   (12)
The EMR(b) value can be interpreted as follows: If EMR(b) < 0, then the embedding noise lies below the masking threshold and is therefore not audible. In the case where EMR(b) > 0, the embedding noise may be perceptible by the HAS.
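Equation (11) is a plain energy ratio expressed in dB; a minimal sketch (the 96 dB reference normalization of the psychoacoustic model is glossed over here):

```python
import math

def emr_db(noise_energy, mask_energy):
    """Embedding-noise-to-mask ratio for one subband, cf. eq. (11).

    Negative values: the embedding noise stays below the global
    masking threshold and is therefore inaudible.
    """
    return 10 * math.log10(noise_energy / mask_energy)
```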
[Figure 10: EMR values per subband and frame for a pop music piece (v11_2.wav), maximum embedding, at 384 kbps (top) and 128 kbps (bottom); the color scale ranges from -5 to 20 dB. The maximum of all EMR values using regular embedding at a bitrate of 384 kbps (not shown here) lies below 0 dB.]
Figure 10 shows the EMR values for a pop music piece. Embedding text causes no audible noise in the signal, since all EMR values lie below zero. If the embedding noise is maximized (top picture in Fig. 10), only a handful of EMR values lie above the masking threshold. Because of the small number of such EMR values we can safely assume that they do not cause severe artifacts in the signal. Using a different bitrate, however, the signal's quality will deteriorate severely, as can be seen in the figure. The EMR values for other pop songs are comparable to the values shown.
Results of a listening test
We tested our method in a listening test with eleven people. Figure 11 shows the results for relative comparisons of pop music songs. Two versions (original and modified) of the same song were played in succession, and the test persons were asked to judge the difference in quality of the second version with respect to the first version on a scale ranging from -3 to 3. In the
[Figure 11: Results of listening tests for the pieces 105rabit, v11_2, v12_3, v14_1, v17_2, v21_2, v9_4, v9_5, and v9_6, on a scale from -3 to 3: relative comparison of the original and regular embedding at 384 kbps (left), and of regular embedding in both channels versus one channel at 384 kbps (right). For each piece of music the average evaluation and the standard deviation are shown.]
figure, a value of -3 means that the quality of the modified signal is much worse than the quality of the original, a value of 0 indicates no difference in the two signals' quality, and a value of +3 states that the quality of the modified signal was much better than the quality of the original.
The comparisons between the original and the modified signals show that no obvious distortions were introduced into the signals by our embedding method. The second test compares regular embedding in both channels to embedding in one channel only. Again, no obvious preference can be detected.
Our studies show that our embedding method is very well suited for embedding text information into songs. Because of the high embedding capacity it is reasonable to assume that other types of context-based information can be embedded as well.
VI. CONCLUSION
In this paper, we introduced a technique to embed context-based information into audio signals. The embedding is suitable for online extraction and synchronous presentation of text and audio. We described MPEG-1-Layer-II based embedding and playback systems for audio pieces with one vocal part. Such systems may be used for karaoke-like applications. We furthermore discussed possible extensions for the support of multiple vocal parts and the embedding of score information.
Although the present realization of our system is based on the audio coding techniques described above, there is much room for simplifications. Conceptually, we only need suitable (fast) transforms and psychoacoustic models to realize our embedding methods. Especially an optimized spectral transform is likely to allow for a significant extraction speed-up.
VII. ACKNOWLEDGEMENTS
This work is part of the MiDiLiB project of our group [19]. It was supported in part by Deutsche Forschungsgemeinschaft under grants CL 64/3-1 and CL 64/3-4.
VIII. REFERENCES
[1] B. Behr and C. Wiedenhoff. Kuck mal, was da spielt. c't Magazin für Computertechnik, pages 218-222, April 1999.

[2] Abraham Hoogendoorn. Digital compact cassette. Proceedings of the IEEE, 82:1479-1489, October 1994.

[3] A. J. Mason. Audio Mole: coder control data to be embedded in decoded audio PCM.
[4] A. J. Mason, A. K. McParland, and N. J. Yeadon. The Atlantic audio demonstration equipment. In 106th Convention. AES, May 1999.

[5] Laurence Boney, Ahmed H. Tewfik, and Khaled N. Hamdy. Digital watermarks for audio signals. In 1996 IEEE Int. Conf. on Multimedia Computing and Systems, pages 473-480, Hiroshima, Japan, 1996.

[6] MPEG/audio Software Simulation Group. ISO/IEC 11172-3 Layer 1 and Layer 2 Software Package, 1993.

[7] ISO/IEC 11172-3: Coding of moving pictures and associated audio for digital storage media at up to 1.5 Mbit/s - audio part. Technical report, International Standard, 1992.

[8] K. Brandenburg and G. Stoll. ISO/MPEG-1 Audio: A generic standard for coding of high quality digital audio. J. Audio Eng. Soc., 42(10):780-792, October 1994.

[9] P. Noll. MPEG digital audio coding. IEEE Signal Processing Magazine, pages 59-81, 1997.

[10] Davis Pan. A tutorial on MPEG/audio compression. IEEE Multimedia Journal, 1995.

[11] Davis Yen Pan. Digital audio compression. Digital Technical Journal of Digital Equipment Corporation, 5(2):28-40, 1993.

[12] S. Shlien. Guide to MPEG-1 audio standard. IEEE Transactions on Broadcasting, 40(4):206-281, December 1994.

[13] Udo Zölzer. Digitale Audiosignalverarbeitung. Teubner, Stuttgart, 1996.

[14] Frank Kurth and Viktor Hassenrik. A dynamic embedding codec for multiple generations compression. In Proc. 109th AES Convention, Los Angeles, USA, 2000.

[15] Frank Kurth. Vermeidung von Generationseffekten in der Audiocodierung. PhD thesis, Institut für Informatik V, Universität Bonn, 1999.

[16] W. Wesley Peterson and E. J. Weldon, Jr. Error-Correcting Codes. MIT Press, Cambridge, Mass., 1988.

[17] Bill Spitzak et al. FLTK - The Fast Light Tool Kit. URL: http://www.fltk.org.

[18] Technical Centre of the European Broadcasting Union. Sound quality assessment material recordings for subjective tests, 1988.

[19] MiDiLiB. Content-based Indexing, Retrieval, and Compression of Data in Digital Music Libraries. http://leon.cs.uni-bonn.de/forschungprojekte/midilib/english/.