ADPCM on Tensilica. Xiaoling Xu and Fan Mo, EECS, UC Berkeley
Post on 21-Dec-2015
DESIGN GOAL
• Basic Algorithm
• Two Streams Approach
• Make use of Tensilica’s Special Features
• Results
• Conclusion
ADPCM ENCODER
[Block diagram. Blocks: Encoder, Decoder, Step Size Calculation, two Z^-1 delays, one subtractor (+/-).
Signals: X(n), input sample; X(n-1), estimate of last input sample; d(n), difference; ss(n), step size; ss(n+1), adjusted step size; L(n), ADPCM output sample, 4 bits; X(n) estimate.]
ADPCM DECODER
[Block diagram. Blocks: Decoder, Step Size Calculation, two Z^-1 delays, one adder (+).
Signals: L(n), ADPCM input sample, 4 bits; d(n), difference; ss(n), step size; ss(n+1), adjusted step size; X(n-1); X(n), output sample.]
ENCODING ALGORITHM
Encoding(*input) {
  loop(number of samples) {
    X=*input++;
    D=X-Xprev;                    /* Xprev: estimate of the last sample, X(n-1) */
    S=StepsizeTable[Index];
    Xa=|D|; Dq=0; Code=0;         /* Dq: quantized difference */
    if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }
    S/=2;
    if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
    S/=2;
    if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
    Code[3]=(D>0)?0:1;
    Xprev=(D>0)?Xprev+Dq:Xprev-Dq;
    if (Xprev>32767) Xprev=32767;
    if (Xprev<-32768) Xprev=-32768;
    Index+=IndexTable[Code];
    if (Index>88) Index=88;
    if (Index<0) Index=0;
    *output++=Code;
  }
}
StepsizeTable[89] = { 7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066, 2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894, 6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289, 16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767};
IndexTable[16] = { -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8};
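The encoding loop above can be sketched in plain C. This is a minimal sketch that follows the slide's pseudocode (strict `>` comparisons, quantized difference accumulated and then applied with the sign bit); the names `adpcm_state`, `clamp32`, `adpcm_encode_sample` and `adpcm_encode` are hypothetical, and other ADPCM codecs may differ in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static const int16_t step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37,
    41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173,
    190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658,
    724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894,
    6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289,
    16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
static const int8_t index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8
};

/* Hypothetical state holder: previous estimate X(n-1) and table index. */
typedef struct { int32_t xprev; int32_t index; } adpcm_state;

static int32_t clamp32(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Encode one 16-bit sample into a 4-bit code; bit 3 is the sign. */
static uint8_t adpcm_encode_sample(adpcm_state *st, int16_t x)
{
    assert(st->index >= 0 && st->index <= 88);
    int32_t d  = (int32_t)x - st->xprev;   /* difference to the estimate */
    int32_t s  = step_table[st->index];
    int32_t xa = d < 0 ? -d : d;           /* magnitude of the difference */
    int32_t dq = 0;                        /* quantized difference */
    uint8_t code = (d > 0) ? 0 : 8;        /* sign test as on the slide */

    if (xa > s) { code |= 4; xa -= s; dq += s; }
    s /= 2;
    if (xa > s) { code |= 2; xa -= s; dq += s; }
    s /= 2;
    if (xa > s) { code |= 1; xa -= s; dq += s; }

    /* Update the estimate exactly as the decoder will, then clamp. */
    st->xprev = clamp32((code & 8) ? st->xprev - dq : st->xprev + dq,
                        -32768, 32767);
    st->index = clamp32(st->index + index_table[code], 0, 88);
    return code;
}

/* Encode n samples, one 4-bit code per output byte (unpacked). */
static void adpcm_encode(adpcm_state *st, const int16_t *in,
                         uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = adpcm_encode_sample(st, in[i]);
}
```

Storing one code per byte keeps the sketch short; a real stream would pack two 4-bit codes into each byte.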
DECODING ALGORITHM
Decoding(*Code) {
  loop(number of samples) {
    C=*Code++;
    S=StepsizeTable[Index];
    D=0;
    if (C[2]==1) D+=S;
    S/=2;
    if (C[1]==1) D+=S;
    S/=2;
    if (C[0]==1) D+=S;
    if (C[3]==1) X=Xprev-D;       /* Xprev: previous output sample, X(n-1) */
    else X=Xprev+D;
    if (X>32767) X=32767;
    if (X<-32768) X=-32768;
    Index+=IndexTable[C];
    if (Index>88) Index=88;
    if (Index<0) Index=0;
    *output++=X;
    Xprev=X;
  }
}
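A matching C sketch of the decoding step, under the same assumptions as the pseudocode above. The names `adpcm_state`, `clamp32`, `adpcm_decode_sample` and `adpcm_decode` are hypothetical, and each input byte is assumed to hold a single 4-bit code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static const int16_t step_table[89] = {
    7, 8, 9, 10, 11, 12, 13, 14, 16, 17, 19, 21, 23, 25, 28, 31, 34, 37,
    41, 45, 50, 55, 60, 66, 73, 80, 88, 97, 107, 118, 130, 143, 157, 173,
    190, 209, 230, 253, 279, 307, 337, 371, 408, 449, 494, 544, 598, 658,
    724, 796, 876, 963, 1060, 1166, 1282, 1411, 1552, 1707, 1878, 2066,
    2272, 2499, 2749, 3024, 3327, 3660, 4026, 4428, 4871, 5358, 5894,
    6484, 7132, 7845, 8630, 9493, 10442, 11487, 12635, 13899, 15289,
    16818, 18500, 20350, 22385, 24623, 27086, 29794, 32767
};
static const int8_t index_table[16] = {
    -1, -1, -1, -1, 2, 4, 6, 8, -1, -1, -1, -1, 2, 4, 6, 8
};

/* Hypothetical state holder: previous output X(n-1) and table index. */
typedef struct { int32_t xprev; int32_t index; } adpcm_state;

static int32_t clamp32(int32_t v, int32_t lo, int32_t hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Decode one 4-bit code into a 16-bit sample. */
static int16_t adpcm_decode_sample(adpcm_state *st, uint8_t code)
{
    assert(st->index >= 0 && st->index <= 88);
    code &= 15;
    int32_t s = step_table[st->index];
    int32_t d = 0;                         /* reconstructed difference */

    if (code & 4) d += s;
    s /= 2;
    if (code & 2) d += s;
    s /= 2;
    if (code & 1) d += s;

    /* The sum can need a 17th bit before it is clamped. */
    int32_t x = (code & 8) ? st->xprev - d : st->xprev + d;
    x = clamp32(x, -32768, 32767);

    st->index = clamp32(st->index + index_table[code], 0, 88);
    st->xprev = x;
    return (int16_t)x;
}

/* Decode n codes, one code per input byte (unpacked). */
static void adpcm_decode(adpcm_state *st, const uint8_t *in,
                         int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = adpcm_decode_sample(st, in[i]);
}
```

The decoder only ever sees the codes, so its output tracks the estimate the encoder keeps, not the original samples.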
ALTERNATIVE APPROACHES
if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }    /* Dq: quantized difference applied to X(n-1) */
S/=2;
if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
S/=2;
if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
Code[3]=(D>0)?0:1;                        /* D: difference between X and X(n-1) */

Code[2:0]=Xa/S*4=Xa*(1/S)*4;
Dq=Code[2:0]*S/4;
(1/S) is stored in a table.
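A minimal C sketch of the multiply-based quantizer, assuming the reciprocal 1/S is kept as a Q16 fixed-point value. The fixed-point format and the names `recip_q16`, `quantize_mul` and `dequantize_mul` are assumptions, not taken from the slides. Because the reciprocal is truncated, the result can differ by one from the compare-and-subtract version for some inputs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical table entry: 1/S in Q16 fixed point. */
static uint32_t recip_q16(uint32_t s)
{
    assert(s >= 1);
    return (1u << 16) / s;
}

/* Code[2:0] = Xa * (1/S) * 4, limited to the 3 magnitude bits. */
static uint32_t quantize_mul(uint32_t xa, uint32_t recip)
{
    assert(xa <= 65535u && recip <= 65536u);   /* product fits in 32 bits */
    uint32_t code = (xa * recip) >> 14;        /* >>16 for Q16, <<2 for *4 */
    return code > 7 ? 7 : code;
}

/* Quantized difference rebuilt from the code: Code[2:0] * S / 4. */
static uint32_t dequantize_mul(uint32_t code, uint32_t s)
{
    return code * s / 4;
}
```

For S = 16 and Xa = 20 this gives code 5 and a quantized difference of 20.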
USING MULTIPLICATION
Multiplier is there. Why not use it?
if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }    /* Dq: quantized difference applied to X(n-1) */
S/=2;
if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
S/=2;
if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
Code[3]=(D>0)?0:1;                        /* D: difference between X and X(n-1) */

Xa-=S;
Code[2]=~MSB(Xa);
Xa-=S'[Code];
Code[1]=~MSB(Xa);
Xa-=S''[Code];
Code[0]=~MSB(Xa);
USING MORE TABLES
Build tables for all possible paths.
[Tree diagram of the decision paths. Level S: root. Level S': 0XX, 1XX. Level S'': 00X, 01X, 10X, 11X.]
E.g. S'[0XX]=S/2; S'[1XX]=-S+S/2;
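The sign-bit idea can be sketched in C as follows. Each step subtracts a value chosen by the previous code bit, so a failed subtraction is undone and the next threshold applied in one operation. This is a sketch under assumptions: the two small arrays are computed on the fly rather than precomputed per step size, the entries are worked out from the arithmetic and indexed by the code bit just produced (the slide's example labels the two entries the other way around, which may reflect a different indexing), and a sign-bit test corresponds to a `>=` comparison rather than `>`. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: three compare-and-subtract steps, using >= to match a sign test. */
static uint32_t quantize_branch(int32_t xa, int32_t s)
{
    uint32_t code = 0;
    if (xa >= s) { code |= 4; xa -= s; }
    s /= 2;
    if (xa >= s) { code |= 2; xa -= s; }
    s /= 2;
    if (xa >= s) { code |= 1; }
    return code;
}

/* ~MSB(t): 1 when t is non-negative, 0 when it is negative. */
static uint32_t not_msb(int32_t t)
{
    return ((uint32_t)t >> 31) ^ 1u;
}

/* Same result without branches: the next subtrahend comes from a table. */
static uint32_t quantize_tables(int32_t xa, int32_t s)
{
    assert(xa >= 0 && xa <= 65535 && s >= 1 && s <= 32767);
    const int32_t s1[2] = { s / 2 - s,     s / 2 };  /* indexed by Code[2] */
    const int32_t s2[2] = { s / 4 - s / 2, s / 4 };  /* indexed by Code[1] */

    int32_t t = xa - s;
    uint32_t b2 = not_msb(t);
    t -= s1[b2];          /* undo the first subtraction if it failed */
    uint32_t b1 = not_msb(t);
    t -= s2[b1];
    uint32_t b0 = not_msb(t);
    return (b2 << 2) | (b1 << 1) | b0;
}
```

Both functions return the same 3-bit magnitude for any magnitude and step size in the asserted ranges; the table version trades the branches for two extra lookups per sample.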
BUT...
• Earlier experiments showed that neither approach gives a big improvement.
WHY?
• Multiplication takes many cycles.
• Too many tables cause many cache misses.
UNIQUE OPERATIONS
Encoding(*input) {
  loop(number of samples) {
    X=*input++;
    D=X-Xprev;                    /* Xprev: estimate of the last sample, X(n-1) */
    S=StepsizeTable[Index];
    Xa=|D|; Dq=0; Code=0;         /* Dq: quantized difference */
    if (Xa>S) { Code[2]=1; Xa-=S; Dq+=S; }
    S/=2;
    if (Xa>S) { Code[1]=1; Xa-=S; Dq+=S; }
    S/=2;
    if (Xa>S) { Code[0]=1; Xa-=S; Dq+=S; }
    Code[3]=(D>0)?0:1;
    Xprev=(D>0)?Xprev+Dq:Xprev-Dq;
    if (Xprev>32767) Xprev=32767;
    if (Xprev<-32768) Xprev=-32768;
    Index+=IndexTable[Code];
    if (Index>88) Index=88;
    if (Index<0) Index=0;
    *output++=Code;
  }
}
Decoding(*Code) {
  loop(number of samples) {
    C=*Code++;
    S=StepsizeTable[Index];
    D=0;
    if (C[2]==1) D+=S;
    S/=2;
    if (C[1]==1) D+=S;
    S/=2;
    if (C[0]==1) D+=S;
    if (C[3]==1) X=Xprev-D;       /* Xprev: previous output sample, X(n-1) */
    else X=Xprev+D;
    if (X>32767) X=32767;
    if (X<-32768) X=-32768;
    Index+=IndexTable[C];
    if (Index>88) Index=88;
    if (Index<0) Index=0;
    *output++=X;
    Xprev=X;
  }
}
Unique operations highlighted in the code:
• IF (…) … ELSE …
• CLAMP
UNIQUE DATA STRUCTURE
• Most data are 16 bits or shorter.
• Since a register is 32 bits wide, why not put two data items in one register?
• But in some places a 17th bit is required to store intermediate results.
Register layout: bits 31..16 hold StreamA data, bits 15..0 hold StreamB data.
Example where X has to be 17-bit:
if (Code[3]==1) X=X(n-1)-D;
else X=X(n-1)+D;
if (X>32767) X=32767;
if (X<-32768) X=-32768;
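The register layout and the 17th-bit problem can be illustrated with a small C sketch. The helper names `pack2`, `lane_a`, `lane_b` and `to_s16` are hypothetical; StreamA is placed in bits 31..16 and StreamB in bits 15..0, as in the layout above.

```c
#include <assert.h>
#include <stdint.h>

/* Reinterpret the low 16 bits of u as a signed value, portably. */
static int16_t to_s16(uint32_t u)
{
    return (int16_t)((int32_t)((u & 0xFFFFu) ^ 0x8000u) - 0x8000);
}

/* StreamA goes into bits 31..16, StreamB into bits 15..0. */
static uint32_t pack2(int16_t a, int16_t b)
{
    return (((uint32_t)a & 0xFFFFu) << 16) | ((uint32_t)b & 0xFFFFu);
}

static int16_t lane_a(uint32_t r) { return to_s16(r >> 16); }
static int16_t lane_b(uint32_t r) { return to_s16(r); }
```

With a plain 32-bit add, `pack2(0, -1) + pack2(0, 1)` carries out of StreamB and turns StreamA into 1, and `pack2(0, 30000) + pack2(0, 10000)` leaves StreamB at -25536 because the true sum, 40000, needs a 17th bit before it can be clamped to 32767.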
FIRST APPROACH:
• A control-oriented application is hard to parallelize.
• Modify the algorithm into a more computation-oriented approach by using multiplication.
– Speedup
• 10% for single stream
• 0% for two streams due to high cache misses.
– Why?
• A 16-bit multiplication produces a 32-bit result.
ANOTHER APPROACH
• Keep Control-Oriented Approach:
1. How to block the carry/borrow between bit16 and bit15?
2. How to carry out two “If (..) ..” in one instruction?
3. How to encapsulate two 17-bit data in a 32-bit register?
[Diagram: packed addition in one 32-bit register. Bits 31..16: XA(n-1) + SA. Bits 15..0: XB(n-1) + SB.]
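For the first question, a well-known software technique adds two 16-bit halves in one 32-bit operation while keeping the carry from crossing bit 15 into bit 16. The sketch below is a plain C emulation for illustration only; the slides solve the problem with custom TIE instructions instead, and the name `add2_nocarry` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Add the two 16-bit halves of x and y independently (each modulo 2^16). */
static uint32_t add2_nocarry(uint32_t x, uint32_t y)
{
    const uint32_t msb = 0x80008000u;          /* top bit of each half */
    /* Add the low 15 bits of each half; the sums cannot leave their half. */
    uint32_t low = (x & ~msb) + (y & ~msb);
    /* Fold the top bits back in with XOR, which produces no carry out. */
    return low ^ ((x ^ y) & msb);
}
```

For example, `add2_nocarry(0x0000FFFF, 0x00000001)` returns 0: the lower half wraps around and the upper half is untouched.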
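TIE Instruction
1. How to carry out two “If (..) ..” in one instruction?
if (data1>bound) data1=bound;
if (data2>bound) data2=bound;
In one instruction:
if (data2|data1 > bound) data2|data1 = bound|bound
[Datapath diagram: packed subtraction data2|data1 - bound|bound (bit positions 31, 30, 15 and 0 marked), feeding two 2:1 muxes that select between data and bound for each half; the result is data2|data1.]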
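A behavioral C model of such a dual clamp may make the datapath easier to follow. It is a sketch under assumptions: each half is treated as a signed 16-bit value, both halves share one bound, and the names `s16_of`, `pack_halves` and `clamp2_max` are hypothetical. It models what the instruction computes, not how TIE describes it.

```c
#include <assert.h>
#include <stdint.h>

/* Signed value of the low 16 bits of u. */
static int32_t s16_of(uint32_t u)
{
    return (int32_t)((u & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* data2 goes into the upper half, data1 into the lower half. */
static uint32_t pack_halves(int32_t data2, int32_t data1)
{
    return (((uint32_t)data2 & 0xFFFFu) << 16) | ((uint32_t)data1 & 0xFFFFu);
}

/* Both halves are clamped to the same upper bound in one call. */
static uint32_t clamp2_max(uint32_t data, int16_t bound)
{
    int32_t d2 = s16_of(data >> 16);
    int32_t d1 = s16_of(data);
    /* The sign of (data - bound) drives each 2:1 mux. */
    int32_t r2 = (d2 - bound > 0) ? bound : d2;
    int32_t r1 = (d1 - bound > 0) ? bound : d1;
    return pack_halves(r2, r1);
}
```

With a bound of 88 this clamps the step-size indices of two streams at once: `clamp2_max(pack_halves(100, 90), 88)` gives `pack_halves(88, 88)`.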
TIE Instructions
2. How to encapsulate two 17-bit data in a 32-bit register?
data1 += diff1; data2 += diff2;
if (data1 > 32767) data1 = 32767;
if (data2 > 32767) data2 = 32767;
In one instruction:
data2|data1 += diff2|diff1;
[Datapath diagram: packed addition data2|data1 + diff2|diff1 (bits 31..16 and 15..0) gives result2|result1; two 2:1 muxes select between each result and 32767 to produce data2|data1.]
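The combined add-and-clamp can be modeled in C as below. The 17-bit intermediate only exists inside the operation, so the stored halves stay 16 bits wide. This is a sketch under assumptions: the names `s16_of`, `pack_halves` and `add2_sat` are hypothetical, and the model saturates at both 32767 and -32768 (as the decoder's clamp does), whereas the slide shows only the upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* Signed value of the low 16 bits of u. */
static int32_t s16_of(uint32_t u)
{
    return (int32_t)((u & 0xFFFFu) ^ 0x8000u) - 0x8000;
}

/* data2 goes into the upper half, data1 into the lower half. */
static uint32_t pack_halves(int32_t data2, int32_t data1)
{
    return (((uint32_t)data2 & 0xFFFFu) << 16) | ((uint32_t)data1 & 0xFFFFu);
}

/* Saturate a wide intermediate result to the signed 16-bit range. */
static int32_t sat16(int32_t v)
{
    return v > 32767 ? 32767 : (v < -32768 ? -32768 : v);
}

/* data2|data1 += diff2|diff1, each half saturated independently. */
static uint32_t add2_sat(uint32_t data, uint32_t diff)
{
    int32_t r2 = s16_of(data >> 16) + s16_of(diff >> 16);  /* up to 17 bits */
    int32_t r1 = s16_of(data) + s16_of(diff);
    return pack_halves(sat16(r2), sat16(r1));
}
```

For example, adding `pack_halves(10000, -10000)` to `pack_halves(30000, -30000)` yields `pack_halves(32767, -32768)`.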
CONSTANT TABLES
• A lot of table lookup instructions in the original algorithm.
• Accessing a constant table through the cache is slow.
– Increases the cache miss rate
– Increases the number of memory access instructions
• Use constant tables!
– Tensilica has tables that come with the processor.
– Almost no extra cost to access the tables.
CONSTANT TABLES
[Bar chart. Y axis: # of Instructions, 0 to 140000. Categories: Encoding w/o Table, Encoding w/ Table, Decoding w/o Table, Decoding w/ Table. Series: Dcache Access, Dcache Miss.]
TWO STREAM RESULTS
[Bar chart: Encoding Rate, axis 0.00E+00 to 4.50E+06. Bars: OldE_Rate, NewE_Rate, NewET_Rate.]
TWO STREAM RESULTS
[Bar chart: Decoding Rate, axis 0.00E+00 to 1.40E+07. Bars: OldD_Rate, NewD_Rate, NewDT_Rate.]
COMPARISON
Processor          Encoding   Decoding
R4000 Indigo       1.1M       1.7M
R3000 Indigo       410K       850K
Sun SLC            250K       420K
Mac-IIsi           21K        35K
486/DX2-33 SCO     550K       865K
486/33 linux       278K       464K
386/33 gcc         117K       168K
Tensilica/141      4.23M      12.1M