ac-3: flexible perceptual coding for audio transmission … · 2 technology to a two channel coder....

16

Click here to load reader

Upload: dangdung

Post on 27-Jul-2018

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

1

AC-3: Flexible Perceptual Coding for AudioTransmission and Storage

Craig C. Todd, Grant A. Davidson, Mark F. Davis,Louis D. Fielder, Brian D. Link, Steve Vernon

Dolby LaboratoriesSan Francisco

0. Abstract

Dolby AC-3 is a flexible audio data compression technology capable of encoding a range ofaudio channel formats into a low rate bit stream. Channel formats range from monophonicto 5.1 channels, and may include a number of associated audio services. Based on atransform filter bank and psychoacoustics, AC-3 includes the novel features of transmissionof a variable frequency resolution spectral envelope and hybrid backward/forward adaptivebit allocation.

1. Introduction

The genesis of the AC-3 technology came from a desire to provide superior multi-channelsound localization for High Definition Television sound. In the United States, the HighDefinition Television (HDTV) standardization process formally began in 1987 with theformation, by the Federal Communications Commission (FCC), of the Advisory Committeeon Advanced Television Service (ACATS). The initial HDTV system proposals called foranalog based picture transmission and digital sound transmission. The initially proposedsound systems matrix-encoded a multi-channel audio source into a stereo pair, which wasthen digitally encoded with Dolby AC-1, a low cost delta-modulation based codingtechnology. At the receiver, the two audio channels would optionally be decoded into fourchannels using a matrix decoder. This proposal basically used the 4-2-4 multi-channelmatrix system as a 2-1 bit-rate reduction system, since the matrix technology reduced thebit-rate required to convey four channel audio by a factor of 1/2. By 1989, advances inaudio coding technology and DSP hardware made it possible to move from the AC-1 audiocoding technology to the transform based AC-2 coding technology, simultaneously raisingaudio quality while reducing bit-rate. The use of the multi-channel matrix technologyremained.

By 1990, there were suggestions by some that the limitations of the 4-2-4 matrix technologyshould perhaps be avoided in HDTV, and that four discrete audio channels should betransmitted. Others felt that the small gain in multi-channel performance was not worth therequired doubling of the bit-rate. It was at this point that the concept of AC-3 was born: afundamentally multi-channel audio coder, operating at approximately the same bit-raterequired by two independently coded audio channels, while offering multi-channel audioperformance without the limitations of the 4-2-4 channel matrix system. Basically, it wasrealized that the bit-rate reduction process that was being performed by the 4-2-4 matrixsystem could be better performed in the coder itself, and not by an addition of the matrix

Page 2: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

2

technology to a two channel coder. If successful, the concept of AC-3 would allow HDTVto gain the benefit of going to discrete audio transmission, without paying the price of acorresponding doubling of the required bit-rate.

While conceived in response to a need for HDTV, the concept of AC-3 was first realized inresponse to a similar need in the cinema. By 1990, a 4-2-4 matrix based analog soundsystem had been in place in the 35 mm cinema for some 14 years, and interest was growingto offer the cinema industry new digital sound technology. In 1989, a SMPTE subgroupstudied the issue of how many sound channels a new digital film sound system should offer.The conclusion was that 5.1 channels should be provided (left, center, right, left surround,right surround, subwoofer), which is identical to the 70 mm split surround format which hasbeen in use in the cinema since 1979. In order to place digital sound data on film reliably,and yet not interfere with either the picture or analog sound area of the film, the availabledata rate is limited. It was determined that 320 kb/s of error corrected audio data could bereliably placed and extracted from the film area between the sprocket hole perforations onone side of the 35 mm film. All that was required was to realize the concept of AC-3 as a5.1 channel audio coder operating at 320 kb/s.

Due to the rapid development schedule required to quickly realize a commercial cinemaproduct, AC-3 was first implemented as a real-time system running on multiple DSP cards.The first cinema industry demonstrations began in May of 1991. By Dec. of 1991, the firstAC-3 coded digital film, Star Trek VI, played in three theatres. The formal roll out of theDolby SR•D system (as it is called) was in June of 1992, with release of the film BatmanReturns.

In mid 1991 the existence of the AC-3 audio coding system was publicly disclosed, and waseagerly embraced by the HDTV audio community in the United States. The ITU BR(formerly CCIR) Task Group 10-1 met in June of 1991 and accepted the basic 5 channelaudio format, making it the basis for a recommendation. In February of 1992 the U.S.Advanced Televisions Systems Committee released a document formally recommendingcomposite coded 5.1 channel audio for the U.S. HDTV service. In Oct. of 1992, TG 10-1accepted the 0.1 low frequency channel and modified the draft recommendationaccordingly. In 1993 the AC-3 system underwent subjective testing in the United States inorder to evaluate its suitability for inclusion in the HDTV system being proposed by theGrand Alliance (a consortium of the remaining U.S. HDTV proponents which has beenauthorized to collaborate on the U.S. HDTV broadcast system). In Oct. 1993 the GrandAlliance recommended the use of Dolby AC-3. In Nov. 1993, the full ACATS committeeformally approved the use of AC-3 by the Grand Alliance HDTV system. Final systemtests are expected to be completed by the end of 1994, final FCC approval should occur in1995, and some initial broadcasts may begin in 1996. While the U.S. HDTV process willtake some additional time to complete, AC-3 is expected to begin a widespread roll out incable television and direct broadcast satellite equipment by the end of 1994.

The requirements which AC-3 had to meet to satisfy the cinema application arestraightforward. AC-3 only had to support one bit-rate (320 kb/s) and audio coding mode(5.1 channels). All cinemas in which AC-3 is used have a full complement of loudspeakerchannels, so mixdown to fewer than 5.1 channels is not an issue. Audio levels arestandardized in the cinema industry, and all playback is done with the full intended dynamicrange.

Page 3: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

3

On the other hand, there are a diverse set of requirements for a coder intended forwidespread application. Our intent was to make AC-3 into a universally applicable low bitrate coder by satisfying many diverse requirements. This allows AC-3 to deliver an audiosignal in a form usable by an entire audience, even though different members of theaudience have different needs. While the most critical members of the audience may beanticipated to have complete multi-channel reproduction systems, most of the audience maybe listening in mono or stereo. Some of the audience may have matrix based multi-channelreproduction equipment without a discrete input, thus requiring a 2-channel matrix encoded(and generally not mono-compatible) output from the AC-3 decoder. Most of the audiencewelcomes a restricted dynamic range reproduction, while a few in the audience will wish toexperience the full dynamic range of the original signal. The visually and hearing impairedwish to be served. All of these (and other) diverse needs were considered early in thedesign of the AC-3 coding technology. Techniques to satisfy these needs have beendesigned in from the beginning, and not added on as an afterthought.

2. Bit Allocation Philosophy

The challenge of bit-rate reduction of audio signals is, of course, to remove bits from arepresentation of the audio signal in a manner that is inaudible (to humans). Due to thefrequency masking properties of human hearing, the best representation of audio to use is afrequency domain representation, obtained by the use of a sub-band or transform filter bank.Audio coding technology has developed to the point that the issue is not really what bitsmay be removed, but rather which few bits to leave in (or allocate to) the various frequencycomponents of the audio signal. Since the philosophy behind bit allocation is sofundamental to the audio coder design, we begin our exploration of the AC-3 coder with adiscussion of the philosophy behind its bit allocation algorithm. This algorithm, along withsimple economics, helps to determine the optimum filter bank. There are two broad classesof bit allocation strategy: forward adaptive; and backwards adaptive. Forward adaptive(fig. 1) refers to the method where the encoder calculates the bit allocation and explicitlycodes the allocation into the coded bit stream. In theory, this method allows for the mostaccurate allocation since the encoder has full knowledge of the input signal and, inprinciple, may be of unlimited complexity. The encoder may precisely calculate anoptimum bit allocation within the limits of the psychoacoustic model employed. Anattractive feature of forward adaptation is that since the psychoacoustic model is residentonly in the encoder, the model may be changed at any time with no impact on an installedbase of decoders.

While forward adaptation has some attractive features, there are practical limitations to theperformance which can be obtained with this technique. The performance limitationscomes from the need to utilize a portion of the available bit-rate to deliver the explicit bitallocation to the decoder. For instance, the ISO MPEG1 layer II audio coder requires a datarate of nearly 4 kb/s/ch to transmit the bit allocation with time resolution of 24 msec andfrequency resolution of 750 Hz. During transient conditions it is beneficial to be able toalter the bit allocation with a much finer time resolution but, of course, that would require asignificantly greater allocation data rate (nearly 12 kb/s/ch for 8 msec time resolution in thecase of the LII coder).

Page 4: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

4

During steady state conditions it is useful to be able to allocate bits with much greaterfrequency resolution. For example, in the case of a signal with a spectrum which containsspectral lines in every 750 Hz frequency band, all bands would have to be allocated bits andvery limited bit-rate reduction may be obtained before audible degradation occurs. If bitscan be allocated on a finer frequency grid, then even spectrally dense signals may have bitsremoved at frequencies between the spectral lines, and more useful bit-rate reduction maybe obtained. Like an improvement in time resolution, an improvement in the frequencyresolution of the bit allocation would also require a substantial increase in the allocationdata rate (an increase by a factor of 8 in order to improve the frequency resolution by afactor of 8). While attractive in theory, forward adaptive bit allocation clearly does imposesignificant practical limitations on performance at very low bit-rates.

Backward adaptive bit allocation (fig. 2) refers to the creation of the bit allocation from thecoded audio data itself, without explicit information from the encoder. The advantage ofthis approach is that none of the available data rate is used to deliver the allocationinformation to the decoder, and thus all of the bits are available to be allocated to codingaudio. The allocation may have time or frequency resolution equal to the information usedto generate the allocation. Backward adaptive systems are thus more efficient intransmission, and allow the bit allocation to have superior time and frequency resolution.The disadvantages of backward adaptive allocation come from the fact that the bit allocation

CodedBit-StreamInput Signal(s)

Bit Allocator

Filter Bank Quantize

Multiplex

Bit Allocation

Encoder

Signal(s)

De-Multiplex

De-Quantize Filter BankQuantizedValues

BitAllocation

Output

CodedBit-Stream

Decoder

Figure 1: Forward Adaptive Bit Allocation

Page 5: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

5

must be computed in the decoder from information contained in the bit stream. The bitallocation is computed from information with limited accuracy, and may contain smallerrors. Since the bit allocation is intended to be able to be performed in a low-cost decoder,the computation can not be overly complex, or else decoder cost would not be low. Sincethe bit allocation algorithm available to the encoder is fixed once decoders are deployedinto the field, the psychoacoustic model may not be updated.

The AC-3 coder makes use of hybrid backward/forward adaptive bit allocation, in whichmost of the disadvantages of backward adaptation are removed. This method involves acore backward adaptive bit allocation routine which runs in both the encoder and thedecoder. The core routine is relatively simple, but is based on a psychoacoustic model and,in general, is quite accurate. The input to the core routine is the spectral envelope, which ispart of the encoded audio data delivered to the decoder. With AC-3, the spectral envelopehas time and frequency resolution equal to that of the analysis/synthesis filter bank.

There are two aspects to the forward adaptation: psychoacoustic model parameteradjustment, and delta bit allocation. The core bit allocation routine employs apsychoacoustic model which has certain assumptions about the masking properties ofsignals. Certain parameters of the model are explicitly sent within the AC-3 bit stream.Thus, the details of the actual psychoacoustic model implemented in the decoder may beadjusted by the encoder. The encoder can perform a bit allocation based on any

Encoder

EncodeSpectralEnvelope

Filter BankInputSignal(s)

Quantize

Bit Allocator

Multiplex

CodedBit-StreamQuantized

Samples

EnvelopeSpectralEncoded

Decoder

De-Multiplex

DecodeSpectralEnvelope

De-Quantize

Bit Allocator

Filter BankCodedBit-Stream Output

Signal(s)

Figure 2: Backward Adaptive Bit Allocation

Page 6: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

6

psychoacoustic model of any complexity, and compare the results to the bit allocation basedon the core routine used by the decoder. If a better match can be made to an idealallocation by altering some of the parameters used by the core routine, the encoder can doso. If there is a condition where it is not possible to approach the ideal allocation by meansof parameter variation, the encoder can then explicitly send some allocation information.Since the allocation computed by the core routine should be very close to ideal, only smalldeltas to the default allocation are ever required. The AC-3 data syntax allows the encoderto explicitly send delta bit allocation information which will cause the bit allocation in smallfrequency regions to be increased or decreased. Since the final bit allocation used by theencoder and decoder must be identical the final allocationactually used by the encoder isthe decoder core routine with whatever parameter variations and delta bit allocation are ineffect.

3. Filter Bank

The choice of analysis/synthesis filter bank to employ in an audio coder is a trade-offbetween frequency resolution, time resolution, and cost, where cost is measured in randomaccess memory bits (RAM) and multiply/accumulate cycles (MACs). Steady state audiosignals benefit from finer frequency resolution, whereas transient signals require finer timeresolution. It is not possible to simultaneously achieve high time and frequency resolution,so a compromise must be reached. Fine frequency resolution has a very real cost, in thatlonger blocks of audio must be buffered which requires larger amounts of RAM. Cost ofRAM dominates in the single chip decoders which are demanded by the market. Very fine

.

Encoder

EncodeSpectralEnvelope

Filter Bank

Bit AllocatorCore BitAllocator

Quantize

Multiplex

InputSignal(s)

Bit AllocationSide Information Coded

Bit-Stream

Decoder

De-Multiplex De-Quantize

Core BitAllocator

DecodeSpectralEnvelope

Filter Bank

InputSignal(s)

OutputSignals

Figure 3: Hybrid Backward/Forward Adaptive Bit Allocation

Page 7: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

7

frequency resolution can allow somewhat higher coding gains to be achieved, but requiressignificantly more costly RAM.

AC-3 makes use of the oddly stacked time-division aliasing cancellation filter bankdescribed by Princen and Bradley. Overlapping blocks of 512 windowed samples aretransformed into 256 frequency domain points. The filter bank is critically sampled, and oflow complexity, requiring only 13 MACs for computation. A proprietary 512 point Fielderwindow is used to achieve the best trade-off between close-in frequency selectivity and far-away rejection. Each transform block is formed from audio representing 10.66 msec (at48 kHz sample rate), although the transforms are performed every 5.33 msec. The audioblock rate is thus 187.5 Hz. During transient conditions where finer time resolution isuseful, the block size is halved so that transforms occur every 2.67 msec. The low 13 MACcomputation rate is maintained during transient block size switches. The frequencyresolution of the filter bank is 93.75 Hz. The minimum time resolution is 2.67 msec. Thefull resolution of the filter bank is used; the individual filters are not combined into widerbands (or critical bands), except during a portion of the core bit allocation routine. Bitallocation can occur down to the individual transform coefficient level, with neighboringcoefficients receiving different allocations.

4. Spectral Envelope

Each individual transform coefficient is coded into an exponent and a mantissa. Theexponent allows for a wide dynamic range, while the mantissa is coded with a limitedprecision which results in quantizing noise. The synthesis filter bank in the decoderconstrains the quantizing noise to be at nearly the same frequency as the quantized signal.

0 5000 10000 15000 20000-20

-18

-16

-14

-12

-10

-8

-6

-4

-2

0

Spectral Envelope Coding, D15 Mode, Triangle Signal

Original Spectrum Original Exponent Decoded Exponent

Exp

onen

t (6

dB)

Frequency, Hz

Figure 4: D15 Spectral Envelope (Exponent) Coding

Page 8: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

8

The set of coded exponents forms a representation of the overall signal spectrum, and isreferred to as the spectral envelope.

The filters of the transform filter bank are not perfect brick wall filters, but have gentleslopes. In general, the response falls off at approximately 12 dB per adjacent filter, or per93.75 Hz. If the signal spectrum is analyzed with the filter bank, the level in any twoadjacent filters seldom exceeds 12 dB. This fact can be used to advantage in coding thespectral envelope. The AC-3 coder encodes the spectral envelope differentially infrequency. Since deltas of at most ±2 (a delta of 1 represents a 6 dB level change) arerequired, each exponent can be coded as one of 5 changes from the previous (lower infrequency) exponent: +2, +1, 0, -1, -2. The first exponent (D.C. term) is sent as anabsolute, and the rest of the exponents are sent as differentials. Groups of threedifferentials are coded into a 7 bit word. Each exponent thus requires approximately 2.33bits to code. This method of transmitting the exponents is referred to as D15. Whenemployed, the D15 coding provides a very accurate spectral envelope, as indicated in fig. 4.

Coding all exponents with 2.33 bits each for every individual audio block would beextravagant. However, fine frequency resolution is only required for relatively steadysignals, and for these signals the spectral envelope will remain relatively constant overmany blocks. The fine resolution D15 spectral envelope is only sent when the spectrum isrelatively stable, and in that case the estimate may be sent only occasionally. In the typicalcase, the spectral envelope is sent once every 6 audio blocks (32 msec), in which case thedata rate required is < 0.39 bits / exponent. Since each individual frequency point has anexponent, and there is one frequency sample for each time sample (the TDAC filter bank is

0 5000 10000 15000 20000-20

-18

-16

-14

-12

-10

-8

-6

-4

-2

0

Spectral Envelope Coding, D25 Mode, Triangle Signal

Original Spectrum Original Exponent Coded Exponent

Exp

onen

t (6

dB)

Frequency, Hz

Figure 5: D25 Spectral Envelope (Exponent) Coding

Page 9: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

9

critically sampled), the D15 high resolution spectral envelope requires < 0.39 bits per audiosample.

When the spectrum of the signal is not stable, it is beneficial to send a spectral estimatemore often. In order to keep the data overhead for the spectral estimate from becomingexcessive, the spectral estimate may be coded with less frequency resolution. Twoadditional methods are available. The medium frequency resolution method is D25, wherea delta is transmitted for every other frequency coefficient. This method requires half of thedata rate of the D15 method (or 1.l6 bits / exponent), and has a factor of two worsefrequency resolution. The D25 method is typically used when the spectrum is relativelystable over 2-3 audio blocks and then undergoes a significant change. Use of the D25method does not allow the spectral envelope to accurately follow all of the troughs in a verytonal spectrum. The coded spectral envelope is, of course, forced to cover all of the peaks.This is illustrated in fig. 5, which shows the result of applying the D25 method to the sametriangle signal shown in fig. 4 (although with this type of signal the D25 method wouldgenerally not be used).

The final method is D45, where a delta is transmitted for every 4 frequency coefficients.This method requires one fourth of the data rate of the D15 method (or 0.58 bits /exponent), and is typically used during transients for single audio blocks (5.3 msec). Thetransmitted spectral envelope thus has very fine frequency resolution for steady state (orslowly changing signals), and has fine time resolution for transient signals. Typically,transient signals do not require fine frequency resolution since by their nature transients are

0 5000 10000 15000 20000-20

-18

-16

-14

-12

-10

-8

-6

-4

-2

0

Spectral Envelope Coding, D45 Mode, Castanet Signal

Original Spectrum Original Exponent Decoded Exponent

Exp

onen

t (6

dB)

Frequency, Hz

Figure 6: D45 Spectral Envelope (Exponent) Coding

Page 10: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

10

wide band signals. Figure 6 shows the result of coding a castanet transient with the D45method.

The AC-3 encoder is responsible for selecting which exponent coding method to use for anyaudio block. Each coded audio block contains a 2-bit field called exponent strategy. Thefour possible strategies are: D15, D25, D45, and REUSE. For most signal conditions, aD15 coded exponent set is sent during the first audio block in a frame, and the following5 audio blocks reuse the same exponent set. During transient conditions, exponents are sentmore often. The encoder routine responsible for choosing the optimum exponent strategymay be improved or updated at any time. Since the exponent strategy is explicitly codedinto the bit stream, all decoders will track any change in the encoder.

5. Bit Allocation Details

The AC-3 core bit allocation routine begins with the decoded spectral envelope, orexponents, which are considered to be the power spectral density (psd) of the signal. Theremay be as many as 252 psd values (the number can vary depending on the number ofexponents which have been sent, which in turn depends on the desired audio bandwidth andthe audio sampling rate). The major portion of the bit allocation routine is the convolutionof a spreading function matching the ear's masking curve, against the power spectraldensity. In order to reduce the computational load, the original psd array is converted to asmaller banded psd array. At low frequencies, the band size is 1, and at high frequenciesthe band size is 16. The bands increase in size proportional to the widening of the ear'scritical bands. The initial (up to) 252 psd values are combined to form 64 banded psdvalues.

A simplified technique has been developed to perform step of convolving the spreadingfunction against the banded psd. The spreading function of the ear is approximated by threecurves: a very steeply decaying downwards masking curve; a fast decaying upwardsmasking curve; and a slowly decaying upwards masking curve which is offset downwards inlevel. The technique ignores the downwards masking curve for simplicity at the expense ofoccasional over allocation. A simplified convolution of the two different sloping upwardsmasking curves against the banded psd is performed. The calculation begins at the lowest

Prototype Masking Curve

Freq.

DownwardMasking

UpwardsMaskingFast

Slow

Signal

Figure 7: Prototype Masking Curve

Page 11: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

11

frequency of the banded psd array and moves upwards in frequency. Two runningcomputations are performed, one for each of the fast decaying and slowly decaying maskingcurves, which we refer to as the fast leak and the slow leak. The calculation is performed inthe log domain, where a decay (leak) can be implemented as a simple decrement. As thecalculation is moved up in frequency band by band, each new banded psd is examined. Ifthe psd in the new band is significant with respect to the current leak values, the new psd iscombined into the leak terms which are increased in value. If the new psd is insignificant,the old leak terms are decremented, or allowed to leak down. The largest of each leak termat each frequency is held. The result of this calculation is an array indicating the predictedmasking, band by band.

This curve is compared against a hearing threshold, and the larger of the two values is held.The final step is to subtract the predicted masking curve from the original (unbanded) psdarray to determine the desired SNR for each individual transform coefficient. This is shownin figure 8. The array of SNR values is converted to an array of bit allocation pointers(baps). The bap array values point to the quantizers to be used for each transformcoefficient mantissa. At this point the encoder does a bit count to determine if the bitallocation has used up the available number of bits. All available bits are from a commonbit pool, available to all of the channels. If more bits are available, the individual mantissaSNR's may be increased, until all of the bits are used up. If too many bits have beenallocated, the individual mantissa SNR's may be decreased, and/or coupling (see nextsection) may be invoked. In the example shown (a difficult multi-tone test signal with tonesat all odd multiples of 750 Hz) the AC-3 coder requires approximately 74 kb/s (including allside information) to code the signal to near transparency. This low bit-rate is achievedbecause the frequency resolution of the filter bank, along with the backward/forward bitallocation, is able to remove bits from in between the tones. As a matter of comparison, thistest signal would fully occupy all bands of the LII filter bank, and the LII coder wouldrequire more than twice the bit-rate of AC-3 to code this signal to the same quality level.

5000 10000 15000 20000

10 dB

Level (dB)

Predicted Masking

Transmitted ExponentValues

SNR via Bit AllocationPointer

Frequency (Hz)Figure 8: Bit Allocation Computation

Page 12: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

12

6. Coupling

Even though the coding techniques employed by AC-3 are very powerful, when the coder isoperated at very low bit rates there are signal conditions under which the coder would runout of bits. When this occurs, a technique which we refer to as coupling is invoked.Coupling takes advantage of the way in which the ear determines directionality for veryhigh frequency signals, in order to allow a reduction in the amount of data necessary tocode a high quality multi-channel audio signal.

At high audio frequencies (above approximately 2 kHz) the ear is physically unable todetect individual cycles of an audio waveform, and instead responds to the envelope of thewaveform. Directionality is determined by the inter-aural time delay of the signal envelope,and by the perceived frequency response which is affected by head shadowing and the ear'spinnae. Coupling takes advantage of the fact that the ear is not able to independently detectthe direction of two high frequency signals which are very closely spaced in frequency.

When the AC-3 coder becomes starved for bits, channels may be selectively coupled at highfrequencies. The frequency at which coupling begins is called the coupling frequency.Above the coupling frequency the coupled channels are combined into a coupling (orcommon) channel. Care is taken with the phase of the signals to be combined so as to avoidsignal cancellations. The encoder measures the original signal power of the individual inputchannels in narrow frequency bands, as well as the power in the coupled channel in thesame frequency bands. The encoder generates coupling coordinates for each individualchannel which indicate the ratio of the original signal power within a band to the couplingchannel power in the band.

The coupling channel is encoded in the same manner as the individual channels; there is aspectral envelope of coded exponents and a set of quantized mantissas. The channels whichare included in coupling are sent discretely up to the coupling frequency. Above thecoupling frequency, only the coupling coordinates are sent for the coupled channels. Thedecoder multiplies the individual channel coupling coordinates by the coupling channelcoefficients to regenerate the high frequency coefficients of the coupled channels.

The coupling process is audibly successful because the reproduced sound field is a closepower match to the original. Within each narrow frequency band, the reproduced powerlevel out of each loudspeaker matches the original signal power in that loudspeaker.Additionally, the level within each narrow frequency band of each source channel isreproduced at the same overall power in the sound field, even though the power may beshared amongst several of the loudspeaker channels. The coupling coordinates are encodedwith an accuracy of < 0.25 dB. This fine resolution not only allows a good power match tobe obtained but, more importantly, allows small changes to be made smoothly. (This is incontrast to techniques employed by other coding technologies to combine high frequenciestogether, where the minimum gain step is 2 dB.)

The AC-3 encoder is responsible for determining the coupling strategy. The encodercontrols which of the audio channels are to be included in coupling, and which remaincompletely independent. The encoder controls at what frequency coupling begins, thecoupling band structure (the bandwidths of the coupled bands), and when new couplingcoordinates are sent. The coupling strategy routine may be altered or improved at any timeand, since the coupling strategy information is explicit in the encoded bit stream, alldecoders will follow the changes. For example, early AC-3 encoders often coupled

Page 13: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

13

channels with a coupling frequency as low as 3.5 kHz. As AC-3 encoding techniques haveimproved, the typical coupling frequency has increased to 10 kHz.

7. Bit Stream Syntax

The fundamental time unit for AC-3 is related to the transform block size. While eachtransformed block is 512 samples long, the 100% overlap/add of the TDAC transformmeans blocks are transformed every 256 samples, or every 5.33 msec for a 48 kHz samplingrate. The transform block rate is 187.5 Hz. The AC-3 syntax groups 6 transform blocksinto an AC-3 frame. The frame rate is 31.25 Hz, and each frame lasts 32 msec. Each AC-3frame begins with a 16 bit sync word. Following the sync word are 8 bits of informationwhich indicate sampling rate and frame size. We refer to the first 3 bytes of the AC-3 frameas sycn info because the information in these bytes is used to acquire and maintainsynchronization with the AC-3 frames.

Following sync info is a set of data we call bit stream info, or BSI. The BSI data containsinformation about the number of channels which are coded, dialogue level, language code,information about associated services, etc. All of the BSI data is essentially static, and isdescriptive about the data in the audio blocks which follow.

Following BSI are 6 coded audio blocks. The first block always contains a complete refreshof exponents, coupling coordinates, and all other information which is conditionallytransmitted. The following 5 blocks may or may not contain information beyond quantizedmantissas. Unused data at the end of the AC-3 frame may be considered aux data.

8. Features

A number of features have been designed into AC-3 in order to make it widely applicable.The goal is to have a coded audio signal which is usable by as wide an audience as possible.The potential audience may range from patrons of a commercial cinema or home theatreenthusiasts who wish to enjoy the full sound experience, to the occupant of a quiet hotelroom listening to a mono TV set at a low volume who nevertheless wishes to hear all of theprogram content. Two of the major concerns of the AC-3 coder design were mixdown, andloudness control.

Very early in the development of AC-3, it was decided to encode the channels discretely,without the use of intermediate matrixes (as have been employed in other multi-channelcoders which offer backwards compatibility with some existing two-channel decoders). Itwas felt that backwards compatibility with existing 2-channel decoders was of little value,since very few decoders were in the field to be backwards compatible with. We did,however, seriously consider the need to be backwards compatible with a variety ofreproduction systems, such as mono, stereo, and matrix surround. Avoiding compatibilitymatrixing allowed AC-3 development to concentrate on fundamental coding techniques, andnot merely on techniques which only mitigate the negative effects of compatibilitymatrixing.

Without the use of compatibility matrixing, it was first thought that a decoder intended toreproduce a 2-channel output from a 5-channel coded signal would have to be 2.5 times ascomplicated as a decoder meant for decoding a 2-channel output from a 2-channel codedsignal. This was because the 2-channel decoder is required to produce a mixed-downversion of the 5-channel signal, and thus must have all 5 channels decoded in order to do so.To solve this dilemma, a technique was developed which requires only partial decoding of

Page 14: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

14

the 5 channels. All channels are decoded into their frequency domain representation. As analternative to being transformed into the time domain for mixdown, the mixdown isperformed directly in the frequency domain. Only the mixed-down signals are thentransformed into the time domain. This mixdown technique reduces the 2-channel decodercomplexity significantly, since the most complex portion of the algorithm, the inverse filterbank, only has to be performed on 2 channels. It is the fact that the frequency-to-timetransformation is a linear process that allows the rearrangement of the inverse filtering andmixdown steps. There is a modest increase in the amount of input buffer memory for thisdownmix technique compared to the use of compatibility matrixing (where only twochannels of input data must be buffered by a two-channel decoder). There is also the needto perform the bit allocation on 5 instead of only 2 channels. This modest increase incomplexity is thought to be an excellent tradeoff in providing a single signal which canequally well serve a wide variety of decoders with differing numbers of reproductionchannels and different down-mix optimizations.

Since AC-3 delivers all channels to the decoder coded discretely, the decoder has fullflexibility to mixdown the channels as appropriate to the listening situation. Differinglistening situations require differing mixdown coefficients. For example, a stereo listenermay wish the surround signals to be mixed out-of-phase so that they don't localize well intwo loudspeaker reproduction. A second listener may wish the same mixdown because thislistener has a current home theatre matrix decoder which requires a 2-channel surroundmatrix encoded signal (which has the surround signal out-of-phase) as an input in order toreproduce surround. A third listener may wish a mono mix-down. This third listener couldnot simply combine the 2-channel mix enjoyed by the first two listeners, because then thesurround signal would be lost. (The currently used Dolby Surround matrix system doessuffer from the problem that the surround signal is lost in monophonic reproduction. Thisfact is known by programme mixers who consciously avoid the placement of essentialprogram content solely in the surround channel. One of the goals for AC-3 was to eliminatethis and other limitations of the matrix surround system. It is essential that a new multi-channel coder allow all listeners to hear all of the sound from all of the channels, so thereare no constraints on programme production.) The mono listener thus requires a differentdownmix. Only by allowing the listener to select the desired type of downmix can allaudiences be served. In most cases the listener will not be required to make any consciousdecision about the desired downmix as products will be sesigned such that the AC-3decoder will automatically make the choice depending on information such as the numberof loudspeakers available.

In the United States (and perhaps elsewhere) there is a problem with the loudness ofdifferent broadcast programmes. Many broadcasters highly compress the dynamic range ofthe audio, and fully modulate the audio channel much of the time. In this case, there is littleheadroom. Sometimes the entertainment program will have a more natural dynamic rangewith some headroom, but commercial messages (attempting to sound loud) may not. Theresult is that there are significant level differences between programme segments on aparticular broadcast channel, as well as between broadcast channels. Two of the designgoals for AC-3 were: to eliminate the apparent level differences between broadcastchannels; and allow broadcasters to compress the dynamic range of the programming formost listeners, while allowing other listeners (who so choose) to enjoy the full originaldynamic range of the programme.

Page 15: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

15

Loudness uniformity is achieved by determining the subjective level of normal spokendialogue, and explicitly coding this level into the data stream as a dialogue level controlword. The AC-3 decoder may then interact with the system playback level. Differingprogrammes may have differing dialogue levels which simply mean that they have differingamounts of headroom available for dramatic effect. When the listener adjusts the volumecontrol, the level of reproduced normal dialogue (i.e. not shouts or whispers) will be set tothe desired subjective sound pressure level. When a new programme segment begins, withdialogue encoded at a different level (i.e. with a different amount of headroom), thereproduction system may use the coded dialogue level control word to make acorresponding adjustment to the system playback volume. The result is that for allprogrammes and all channels, the reproduced level of dialogue will be uniform.

The AC-3 coder contains an integral dynamic range control system. During encoding, or atany point thereafter, dynamic range control words may be placed into the AC-3 bit stream.These control words are used by the decoder to alter the level of the decoded audio on ablock basis (every 5.3 msec). The control word may indicate that the decoder gain be raisedor lowered. The control words are generated by a level compression algorithm which maybe resident in the AC-3 encoder, or in a subsequent bit-stream processor. The controlwords have a resolution of < 0.25 dB per block. The block-to-block gain variations arefurther smoothed by the gentle overlap add process. Gradual gain changes are free fromgain stepping artifacts. For programme audio levels above dialogue level, the dynamicrange control words will indicate level reduction. For audio levels below dialogue level, thecontrol words will indicate a level increase. The default for the decoder is to use the controlwords which will result in the reproduction of the audio programme with a compresseddynamic range. The exact nature of the compression is determined by the algorithm whichgenerates the control words, but in general the compression is such that headroom isreduced (loud sounds are brought down towards dialogue level) and quiet sounds are mademore audible (brought up towards dialogue level). It is a decode option to reduce the effectof the control words of either polarity. If the control words which indicate that gain shouldbe increased are not used, then low level sounds will retain their proper dynamics and onlyloud sounds will be compressed downwards. If only the control words which indicate gainreduction are ignored, the low level sounds will be brought up in level, but loud sounds willreproduce naturally. If all control words are ignored, the original signal dynamic range willbe reproduced. It is also possible to partially use the control words for either polarity. Thusthe listener may instruct the decoder to partially or fully remove the dynamic rangecompression which has been applied to either the loud (above dialogue level) or soft (belowdialogue level) sounds. (Actually the audio is coded without any level alteration, and levelalterations occur at the decoder. However, this is irrelevant to the listener, since the defaultfor the listener is to reproduce the programme with the compression characteristic whichhas been intended by the programme originator. The listener must take some action toremove the compression.)

The AC-3 syntax has been defined to support the coding of one main audio service withfrom 1 to 5.1 channels. Additionally, associated services may be embedded into the AC-3bit stream. Associated service types include: visually impaired (a verbal description of thevisual scene), hearing impaired (dialogue with enhanced intelligibility), commentary,dialogue, and second stereo programme. All services may be tagged with a code to indicatelanguage. Single channel services may have a bit-rate as low as 32 kb/s. The overall datastream is defined for bit rates ranging from 32 kb/s up to 640 kb/s. The higher rates are

Page 16: AC-3: Flexible Perceptual Coding for Audio Transmission … · 2 technology to a two channel coder. If successful, the concept of AC-3 would allow HDTV to gain the benefit of going

16

intended to allow the basic audio quality to exceed audiophile expectations, as well as allowthe incorporation of associated services without severely restricting the bit-rate (and thusquality) of the main audio service.

9. Conclusion

AC-3, with its high resolution spectral envelope coding and hybrid forward/backwardadaptive bit allocation offers very high coding gain at modest complexity. Bit starvation isavoided during extreme signal demands by invoking the technique of coupling, whichavoids any need to restrict the coded audio bandwidth. Fully discrete signal transmissionavoids the need to increase coder complexity and compromise coded audio quality in anattempt to avoid matrixing artifacts. Decoder downmixing allows every listener to obtainthe optimum downmix for his listening situation. The quality level of AC-3 is not limited tothat obtainable with currently available encoders; future encoders are expected to achieveimproved results by means of additional complexity and sophistication. All decoders in thefield will automatically benefit from future encoding improvements. The first integratedcircuit designed for AC-3 (the Zoran ZR38000) became available in 1993, and several moreare expected to become available in 1994. AC-3 has been selected for use in the UnitedStates HDTV system for reasons of: high audio quality; advanced state of development;and full provision of all necessary features. AC-3 is currently being designed into consumerelectronics equipment for cable television, direct broadcast satellite, and pre-recordedmedia.

This paper was presented at the 96th Convention of the Audio Engineering Society, February 26 – March 1, 1994, asPreprint 3796. Reprinted by permission of the AES.