adpcm on tensilica xiaoling xu and fan mo eecs, uc berkeley

ADPCM ON TENSILICA

Xiaoling Xu and Fan Mo

EECS, UC Berkeley

DESIGN GOAL

• Basic Algorithm

• Two Streams Approach

• Make use of Tensilica’s Special Features

• Results

• Conclusion

ADPCM ENCODER

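[Block diagram: the estimate X(n-1) of the last input sample is subtracted from the input sample X(n) to give the difference d(n). The Encoder block quantizes d(n) with step size ss(n) into the 4-bit ADPCM output sample L(n). A local Decoder block reconstructs the X(n) estimate, which a Z^-1 delay turns into X(n-1). The Step Size Calculation block produces the adjusted step size ss(n+1), which a second Z^-1 delay feeds back as ss(n).]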

ADPCM DECODER

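[Block diagram: the Decoder block turns the 4-bit ADPCM input sample L(n) and the step size ss(n) into the difference d(n), which is added to X(n-1) to give the output sample X(n). A Z^-1 delay feeds X(n) back as X(n-1). The Step Size Calculation block produces the adjusted step size ss(n+1), which a second Z^-1 delay feeds back as ss(n).]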

ENCODING ALGORITHM

Encoding(*input) {
  loop(number of samples) {
    X = *input++;
    D = X - Xprev;                 /* Xprev: estimate of the last input sample, X(n-1) */
    S = StepsizeTable[Index];
    Xa = |D|; Code = 0; Dq = 0;    /* Dq: quantized difference */
    if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }
    S /= 2;
    if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
    S /= 2;
    if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
    Code[3] = (D>0) ? 0 : 1;
    Xprev = (D>0) ? Xprev+Dq : Xprev-Dq;
    if (Xprev>32767) Xprev = 32767;
    if (Xprev<-32768) Xprev = -32768;
    Index += IndexTable[Code];
    if (Index>88) Index = 88;
    if (Index<0) Index = 0;
    *output++ = Code;
  }
}

StepsizeTable[89] = { 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767};

IndexTable[16] = { -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8};
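The encoding loop above can be sketched in C for a single sample. This is a minimal sketch that follows the simplified steps on the slide, not a bit-exact implementation of any standard codec. The names `adpcm_state` and `adpcm_encode_sample` are hypothetical, and a zero difference is treated as positive here.

```c
#include <stdint.h>

/* Step sizes and index adjustments, copied from the tables above. */
static const int step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544,
    598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707,
    1878, 2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871,
    5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
static const int index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8
};

/* Hypothetical state record: previous estimate X(n-1) and table index. */
typedef struct {
    int prev;
    int index;
} adpcm_state;

/* Encode one 16-bit sample into a 4-bit code (bit 3 is the sign). */
static unsigned adpcm_encode_sample(adpcm_state *st, int16_t x)
{
    int step = step_table[st->index];
    int diff = (int)x - st->prev;
    unsigned code = 0;
    int dq = 0;                      /* quantized difference */

    if (diff < 0) { code = 8; diff = -diff; }
    if (diff > step) { code |= 4; diff -= step; dq += step; }
    step >>= 1;
    if (diff > step) { code |= 2; diff -= step; dq += step; }
    step >>= 1;
    if (diff > step) { code |= 1; dq += step; }

    /* Update the estimate exactly as the decoder will, then clamp. */
    st->prev += (code & 8) ? -dq : dq;
    if (st->prev > 32767) st->prev = 32767;
    if (st->prev < -32768) st->prev = -32768;

    /* Adapt the step size for the next sample. */
    st->index += index_table[code];
    if (st->index > 88) st->index = 88;
    if (st->index < 0) st->index = 0;
    return code;
}
```

Starting from `adpcm_state st = {0, 0};`, each call consumes one sample and returns one 4-bit code; two codes would typically be packed per output byte.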

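DECODING ALGORITHM

Decoding(*Code) {
  C = *Code++;
  S = StepsizeTable[Index];      /* step-size lookup, as in the encoder */
  D = 0;
  if (C[2]==1) D += S;
  S /= 2;
  if (C[1]==1) D += S;
  S /= 2;
  if (C[0]==1) D += S;
  if (C[3]==1) X = Xprev-D;      /* Xprev: previous output sample, X(n-1) */
  else X = Xprev+D;
  if (X>32767) X = 32767;
  if (X<-32768) X = -32768;
  Index += IndexTable[C];        /* index update, as in the encoder */
  if (Index>88) Index = 88;
  if (Index<0) Index = 0;
  *output++ = X;
  Xprev = X;
}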
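The decoding steps can be sketched in C in the same way. This is a minimal sketch under the same assumptions as the slide (no rounding term, step size reloaded from the table for every sample); `adpcm_state` and `adpcm_decode_sample` are hypothetical names.

```c
#include <stdint.h>

/* Same tables as used by the encoder. */
static const int step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31,
    34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143,
    157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544,
    598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707,
    1878, 2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871,
    5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899,
    15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
static const int index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8
};

/* Hypothetical state record: previous output X(n-1) and table index. */
typedef struct {
    int prev;
    int index;
} adpcm_state;

/* Decode one 4-bit code (bit 3 is the sign) into a 16-bit sample. */
static int16_t adpcm_decode_sample(adpcm_state *st, unsigned code)
{
    int step = step_table[st->index];
    int d = 0;
    int x;

    code &= 15u;
    if (code & 4) d += step;
    step >>= 1;
    if (code & 2) d += step;
    step >>= 1;
    if (code & 1) d += step;

    /* The sum can need 17 bits before it is clamped. */
    x = (code & 8) ? st->prev - d : st->prev + d;
    if (x > 32767) x = 32767;
    if (x < -32768) x = -32768;

    st->index += index_table[code];
    if (st->index > 88) st->index = 88;
    if (st->index < 0) st->index = 0;

    st->prev = x;
    return (int16_t)x;
}
```

Because the encoder updates its estimate with the same arithmetic, a decoder started from the same state reproduces the encoder's estimates sample by sample.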


ALTERNATIVE APPROACHES

USING MULTIPLICATION

Multiplier is there. Why not use it?

Original:

  if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }    /* Dq: quantized difference */
  S /= 2;
  if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
  S /= 2;
  if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
  Code[3] = (D>0) ? 0 : 1;

With multiplication:

  Code[2:0] = Xa/S*4 = Xa*(1/S)*4;
  Dq = Code[2:0]*S/4;

(1/S) is stored in a table.

USING MORE TABLES

Build tables for all possible paths.

Original: the same three compare-and-subtract steps as above.

With tables:

  Xa -= S;
  Code[2] = ~MSB(Xa);
  Xa -= S’[Code];
  Code[1] = ~MSB(Xa);
  Xa -= S’’[Code];
  Code[0] = ~MSB(Xa);

Table tree: S at the root; S’ has one entry per value of Code[2] (0XX, 1XX); S’’ has one entry per value of Code[2:1] (00X, 01X, 10X, 11X).

E.g. S’[0XX] = -S+S/2; S’[1XX] = S/2;
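The table idea can be checked with a small C sketch that compares the branchy quantizer with a branch-free one. It rests on several assumptions: the function names are hypothetical, the correction terms are computed on the fly here instead of being read from prebuilt tables, and both versions compare with `>=` because the inverted sign bit of `Xa - S` is 1 exactly when `Xa >= S`. In this sketch the entries reached through a 0 bit carry the restoring term (for example `-S + S/2`), since a 0 bit means the previous subtraction overshot.

```c
#include <stdint.h>

/* Reference: three compare-and-subtract steps. */
static unsigned quantize_branchy(int32_t xa, int32_t s)
{
    unsigned code = 0;
    if (xa >= s) { code |= 4; xa -= s; }
    s >>= 1;
    if (xa >= s) { code |= 2; xa -= s; }
    s >>= 1;
    if (xa >= s) { code |= 1; }
    return code;
}

/* 1 if v is non-negative, 0 otherwise: the inverted sign bit, ~MSB(v). */
static unsigned nonneg(int32_t v)
{
    return ((uint32_t)v >> 31) ^ 1u;
}

/* Branch-free variant: always subtract, then read the sign bit.
 * s1 and s2 stand in for the S' and S'' tables of one step size. */
static unsigned quantize_tables(int32_t xa, int32_t s)
{
    const int32_t half = s >> 1, quarter = s >> 2;
    const int32_t s1[2] = { -s + half, half };            /* 0XX, 1XX */
    const int32_t s2[4] = { -half + quarter, quarter,     /* 00X, 01X */
                            -half + quarter, quarter };   /* 10X, 11X */
    unsigned b2, b1, b0;

    xa -= s;
    b2 = nonneg(xa);
    xa -= s1[b2];
    b1 = nonneg(xa);
    xa -= s2[(b2 << 1) | b1];
    b0 = nonneg(xa);
    return (b2 << 2) | (b1 << 1) | b0;
}
```

In a real implementation the `s1` and `s2` values would be stored per step-size index, which is where the extra table storage comes from.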

BUT...

• Earlier experiments showed that neither approach gives a big improvement.

WHY?

• Multiplication takes many cycles.

• Too many tables cause many cache misses.

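UNIQUE OPERATIONS

Encoding(*input) {
  loop(number of samples) {
    X = *input++;
    D = X - Xprev;                 /* Xprev: estimate of the last input sample, X(n-1) */
    /* step-size lookup left out */
    Xa = |D|; Code = 0; Dq = 0;    /* Dq: quantized difference */
    if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }
    S /= 2;
    if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
    S /= 2;
    if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
    Code[3] = (D>0) ? 0 : 1;
    Xprev = (D>0) ? Xprev+Dq : Xprev-Dq;
    if (Xprev>32767) Xprev = 32767;
    if (Xprev<-32768) Xprev = -32768;
    /* index update left out */
    *output++ = Code;
  }
}

Decoding(*Code) {
  C = *Code++;
  /* step-size lookup left out */
  D = 0;
  if (C[2]==1) D += S;
  S /= 2;
  if (C[1]==1) D += S;
  S /= 2;
  if (C[0]==1) D += S;
  if (C[3]==1) X = Xprev-D;
  else X = Xprev+D;
  if (X>32767) X = 32767;
  if (X<-32768) X = -32768;
  /* index update left out */
  *output++ = X;
  Xprev = X;
}

Unique operations: IF (…) … ELSE … and CLAMP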

UNIQUE DATA STRUCTURE

• Most data are 16 bits or shorter.

• Since registers are 32 bits wide, why not put two data items in one register?

• But in some places a 17th bit is required to store intermediate results.

Register layout: StreamA Data in bits 31..16 | StreamB Data in bits 15..0

if (C[3]==1) X = Xprev-D;      /* Xprev: previous output sample, X(n-1) */
else X = Xprev+D;
if (X>32767) X = 32767;
if (X<-32768) X = -32768;

X has to be 17-bit.

WHY NOT TWO STREAMS?

[Diagram: DUALSTREAM ENCODER and DUALSTREAM DECODER]

Difficult?

FIRST APPROACH:

• Control-oriented applications are hard to parallelize.

• Modify the algorithm into a more computation-oriented approach by using multiply.

– Speedup

• 10% for single stream

• 0% for two streams due to high cache misses.

– Why?

• A 16-bit multiplication produces a 32-bit result.

ANOTHER APPROACH

• Keep Control-Oriented Approach:

1. How to block the carry/borrow between bit16 and bit15?

2. How to carry out two “If (..) ..” in one instruction?

3. How to encapsulate two 17-bit data in a 32-bit register?

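[Diagram: packed add in one 32-bit register. XA(n-1) | XB(n-1) plus SA | SB, with stream A in bits 31..16 and stream B in bits 15..0.]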
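Question 1 (blocking the carry between bit 15 and bit 16) has a well-known software answer, sketched below in C. This is a generic packed-arithmetic trick and not necessarily what the authors' TIE instruction does; a hardware instruction could simply cut the carry chain between the two halves. The name `packed_add16` is hypothetical.

```c
#include <stdint.h>

/* Add two pairs of 16-bit lanes held in 32-bit words so that a carry out
 * of the lower lane never reaches the upper lane. Each lane wraps modulo
 * 2^16 on its own. */
static uint32_t packed_add16(uint32_t a, uint32_t b)
{
    const uint32_t top = 0x80008000u;          /* top bit of each lane */
    /* Add the low 15 bits of each lane; a carry can reach only the lane's
     * own top bit, which was cleared beforehand. */
    uint32_t sum = (a & ~top) + (b & ~top);
    /* Fold in the top bits with XOR, which adds without a carry out. */
    return sum ^ ((a ^ b) & top);
}
```

For example, adding 1 to a lower lane holding 0xFFFF wraps that lane to 0 and leaves the upper lane unchanged.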

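TIE Instruction

1. How to carry out two “If (..) ..” in one instruction?

if (data1>bound) data1=bound;
if (data2>bound) data2=bound;

becomes

if (data2|data1 > bound|bound) data2|data1 = bound|bound;

[Datapath diagram: the packed word data2|data1 (data2 in the upper half, data1 in the lower half; bit positions 31, 30, 15 and 0 are marked) has bound|bound subtracted from it, and two 2:1 muxes, one per half, select between the data and bound to produce data2|data1.]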
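The behaviour of such a dual clamp can be modelled in plain C as a sketch. The function and helper names are hypothetical, the two halves are treated as signed 16-bit values, and the model only shows the intended result, not how the TIE datapath computes it.

```c
#include <stdint.h>

/* Sign-extend the upper or lower 16-bit lane of a packed word. */
static int lane_hi(uint32_t w)
{
    int v = (int)(w >> 16);
    return (v & 0x8000) ? v - 0x10000 : v;
}
static int lane_lo(uint32_t w)
{
    int v = (int)(w & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Pack two values into one word, keeping the low 16 bits of each. */
static uint32_t pack16(int hi, int lo)
{
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}

/* Model of the dual clamp: each lane is compared with the bound and
 * replaced by it if larger, as the two 2:1 muxes would do in parallel. */
static uint32_t dual_clamp_max(uint32_t packed, int bound)
{
    int hi = lane_hi(packed);
    int lo = lane_lo(packed);
    if (hi > bound) hi = bound;
    if (lo > bound) lo = bound;
    return pack16(hi, lo);
}
```

In software this costs two compares and two selects; the point of the TIE instruction is that both lanes are handled by one instruction.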

TIE Instructions

2. How to encapsulate two 17-bit data in a 32-bit register?

data1 += diff1; data2 += diff2;
if (data1 > 32767) data1 = 32767;
if (data2 > 32767) data2 = 32767;

becomes

data2|data1 += diff2|diff1;

[Datapath diagram: data2|data1 and diff2|diff1 (upper half in bits 31..16, lower half in bits 15..0) feed an adder that produces result2|result1; two 2:1 muxes, one per half, select between the result and 32767 to produce data2|data1.]
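A C model of the fused add-and-saturate makes the 17-bit point concrete: each lane's sum is formed in a wider intermediate and clamped before it is written back, so the stored value always fits in 16 bits. This is a sketch with hypothetical names; it also clamps at -32768, which the algorithm needs but the slide's example does not show.

```c
#include <stdint.h>

/* Sign-extend one 16-bit lane (lane 1 is bits 31..16, lane 0 is bits 15..0). */
static int lane(uint32_t w, int which)
{
    int v = (int)((which ? (w >> 16) : w) & 0xFFFFu);
    return (v & 0x8000) ? v - 0x10000 : v;
}

/* Pack two values into one word, keeping the low 16 bits of each. */
static uint32_t pack16(int hi, int lo)
{
    return (((uint32_t)hi & 0xFFFFu) << 16) | ((uint32_t)lo & 0xFFFFu);
}

/* Saturate a wide intermediate to the signed 16-bit range. */
static int sat16(int v)
{
    if (v > 32767) return 32767;
    if (v < -32768) return -32768;
    return v;
}

/* Model of data2|data1 += diff2|diff1 with per-lane saturation. The sums
 * live in plain ints here, standing in for the 17-bit adder outputs. */
static uint32_t dual_add_sat(uint32_t data, uint32_t diff)
{
    int r2 = lane(data, 1) + lane(diff, 1);   /* may need 17 bits */
    int r1 = lane(data, 0) + lane(diff, 0);   /* may need 17 bits */
    return pack16(sat16(r2), sat16(r1));
}
```

For instance, 30000 + 5000 = 35000 does not fit in 16 bits, so that lane is written back as 32767.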

CONSTANT TABLES

• A lot of table lookup instructions in the original algorithm.

• Accessing a constant table through the cache is slow.

– Increases the cache miss rate

– Increases the number of memory access instructions

• Use constant tables!

– Tensilica has tables that come with the processor.

– Almost no extra cost to access the tables.

CONSTANT TABLES

[Bar chart: Dcache Access and Dcache Miss (y-axis: # of Instructions, 0 to 140000) for Encoding w/o Table, Encoding w/ Table, Decoding w/o Table and Decoding w/ Table.]

TWO STREAM RESULTS

[Bar chart: Encoding Rate (y-axis 0.00E+00 to 4.50E+06) for OldE_Rate, NewE_Rate and NewET_Rate.]

TWO STREAM RESULTS

[Bar chart: Decoding Rate (y-axis 0.00E+00 to 1.40E+07) for OldD_Rate, NewD_Rate and NewDT_Rate.]

COMPARISON

Processor         Encoding   Decoding
R4000 Indigo      1.1M       1.7M
R3000 Indigo      410K       850K
Sun SLC           250K       420K
Mac-IIsi          21K        35K
486/DX2-33 SCO    550K       865K
486/33 linux      278K       464K
386/33 gcc        117K       168K
Tensilica/141     4.23M      12.1M

CONCLUSION

• TIE extensions and improved code efficiency resulted in an order of magnitude improvement from our original

• Constant tables help to decrease cache accesses and cache misses.

• Tensilica is also able to handle control-oriented applications.
