AUDIO
COMPRESSION
Project Report
UVic Department of Engineering
Shengzong He V00785261 [email protected]
Chenchen Guo V00788009 [email protected]
Submission date: Friday, August 5, 2016
INTRODUCTION
This report begins with a brief description of the components of sound signals and
their importance. It then focuses on the basic pulse code modulation (PCM) method,
which encodes such signals into the digital domain.
An audio signal representing sound typically has frequencies in the range of 20 to
20,000 Hz (the limits of human hearing) and is usually represented by an electrical
voltage. In the analog world, an audio signal is a measure of the pressure of the
sound wave, expressed as an instantaneous voltage over the time domain.
To capture analog audio signals in the digital world, pulse-code modulation
algorithms are used. The main focus of this report is to introduce PCM algorithms
such as the µ-law algorithm and to use them to compress uniform PCM codes.
In the uniform PCM quantization process, an error is introduced when estimating the
sample amplitude. The signal-to-noise ratio is good for high-level signals but poor
for low-level signals. Hence, non-uniform PCM encoding algorithms such as A-law and
µ-law were introduced.
In this project, the main focus is to use the µ-law compression method to encode
uniform PCM samples from a 16-bit representation to an 8-bit representation. Since
only the sample data shrinks, the compression ratio is approximately 2:1, so the
compressed output should be about 50% smaller than the original copy. After the
initial implementation, a series of optimization techniques will be applied on top of
the base version. The goal of this project is to apply software optimization
techniques to increase the performance of the µ-law compression algorithm.
The project follows six steps: implementing the µ-law algorithm, using integer
arithmetic (piecewise logarithm), optimizing the C code, replacing subroutines with
ARM assembly, a 2-slot machine implementation, and an automaton-based FPGA solution.
The testing device for this project is a single-board computer with an ARM-v7
architecture chipset on board.
TABLE OF CONTENTS
Introduction ........................................................................................................................ 1
Theoretical background ...................................................................................................... 3
Pulse Code Modulation Process ..................................................................................... 3
Non-uniform pulse code encoding: the µ-law algorithm and the A Law algorithm ...... 3
Design process ................................................................................................................... 4
Project Requirements ..................................................................................................... 4
Development of the Audio Compression Software Framework .................................... 4
Prototype Software Solutions ......................................................................................... 4
Determine the Bottleneck of the Application ................................................................. 5
Using Piecewise linear approximation instead of Log(x) .............................................. 6
Implementing Piecewise Linear Approximation Log2(x) Using Integer Arithmetic .... 8
Optimization of C Software Routines ............................................................................ 9
Optimization of ARM assembly code .......................................................................... 10
Implementation of 2-slot Machine Firmware .............................................................. 12
Multithreading Solution ............................................................................................... 13
Custom Hardware VHDL Solution .............................................................................. 15
Performance/cost evaluation ............................................................................................ 17
Conclusion ........................................................................................................................ 19
Bibliography ..................................................................................................................... 20
THEORETICAL BACKGROUND
Pulse Code Modulation Process
The PCM encoding method includes three major processes: sampling, quantization, and
coding. In the sampling process, the magnitude of the analog signal is sampled
regularly at uniform intervals; the obtained values are called samples. For a 4 kHz
voice channel, the sampling rate is 8000 Hz, which means the audio signal is sampled
8000 times per second. The quantization process then converts the obtained samples
into uniform discrete digital values. In uniform PCM encoding, the quantized
amplitude range is divided into uniformly spaced steps, called quantization steps.
The standard input values are divided into 16 steps.
However, the quantization process introduces an error, since the real amplitude of a
sample is replaced by an approximated value; this is known as quantization
distortion. For high-level signals (high voltage), the quantization distortion is
low, but for low-level signals this error is significant.
Non-uniform pulse code encoding: the µ-law algorithm and the A Law algorithm
To further minimize such quantization distortion, non-uniform PCM encoding
algorithms like the µ-law algorithm and the A-law algorithm were introduced. These
methods make the PCM quantization step smaller for low-level signals and larger for
high-level signals, decreasing the quantization error and increasing the
signal-to-noise ratio.
During the encoding step of the µ-law algorithm, uniform PCM codes are compressed
from a 16-bit representation to an 8-bit representation. The reduction in size
allows a limited-bandwidth channel to transfer more data, effectively speeding up
the signal transfer process.
Since the µ-law algorithm is the standard in North America, in this project we chose
to implement the µ-law algorithm instead of the A-law algorithm. Also, because A-law
only uses a 12-bit magnitude of the audio samples while µ-law uses a 13-bit
magnitude [2], the compressed audio samples of A-law will be less accurate than the
output of µ-law.
DESIGN PROCESS
Project Requirements
1. Build and implement audio compression prototype using the µ-law algorithm
2. Use Piecewise linear approximation to replace log(x).
3. Implement piecewise linear approximation using integer arithmetic for log2(x) in
Software (C routines)
4. Optimization of C software routines
5. Optimization of ARM assembly code
6. Implement 2-slot machine solutions
7. Custom hardware solutions (VHDL)
8. Estimate the performance improvement for each implementation of piecewise
log2(x) based on software optimization
Development of the Audio Compression Software Framework
To simplify the problem, we decided that our audio compression software would only
support WAV files as input. As we are not planning to implement a resampling
algorithm, the compressed audio file will keep the original sampling rate. The only
difference is that the bit depth of each sample decreases from 16 bits to 8 bits. As
a result, the file size will be reduced to 50% of the original, and the WAV format
will change from linear PCM to µ-law compression. Designing the skeleton of our
application involved completing the following components: a WAV header parser, a
data chunk locator, a WAV header modifier, and a WAV file writer. We also wrote the
corresponding code in the main function to accept arguments, read the WAV file,
print file information, call the compression function, and write the compressed file
back to disk.
Since the key point of this report is software optimization, the development process
of the software skeleton is omitted to conserve space. For more details, please read
the attached source code.
Prototype Software Solutions
In the first step of this project, a software implementation of the µ-law algorithm
was developed using C routines. Simply following the algorithm and normalizing the
input, the audio file was compressed from 581 MB to 291 MB. In this code, the input
audio sample x is a value in the range -2^15 <= x < 2^15; for the computation, the
input value is scaled to [0, 1].
inline unsigned char flin2mu(int16_t input_frame) {
    uint8_t sign_bit = ((uint16_t)input_frame) >> 15;
    uint16_t magnitude = (sign_bit) ? -input_frame : input_frame;
    if (magnitude > 32767)
        magnitude = 32767;
    double x = magnitude / 32767.0;
    double result = log(1.0 + MU * x) / log(1.0 + MU);
    uint8_t return_value = rint(result * 127.0);
    return_value = return_value | (sign_bit << 7);
    return return_value;
}
Determine the Bottleneck of the Application
To optimize our software, it is essential to know where the bottleneck of our
application is. Although it is reasonable to assume that the log() function is the
most time-consuming part of the entire program, a profiling report is still helpful.
Instead of using the “gprof” tool to profile our application, we decided to use the
system-level “perf” command to get the most accurate and complete information about
our application.
sudo perf record ./a.out sengRadio.wav
The result obtained is as follows.
Loading WAV file: sengRadio.wav...
File loaded into memory.
No. of channels: 2
RIFF chunk size: 580532454
FMT chunk size: 18
Format code: 1
Sample rate: 44100
Byte rate: 176400 Bps
Bit rate: 1411 kbps
Bits per sample: 16
Block size: 4
Sample size is signed 16 bits
Sample number: 290266208
Compression started at:
Figure 1. Original audio signals
Figure 2. Compressed output file
2016-08-05 11:38:56.771413
Compression finished at:
2016-08-05 11:41:26.524838
Time elapsed: 149.753425
[ perf record: Woken up 96 times to write data ]
[ perf record: Captured and wrote 24.217 MB perf.data (634644 samples) ]

Samples: 634K of event 'cycles:ppp', Event count (approx.): 140893513459
Overhead  Command  Shared Object      Symbol
 58.74%   a.out    libm-2.19.so       [.] __log_finite
 22.35%   a.out    a.out              [.] flin2mu_encode
  6.83%   a.out    libm-2.19.so       [.] __rintl
  5.34%   a.out    libm-2.19.so       [.] __logl
  0.83%   a.out    [kernel.kallsyms]  [k] mmiocpy
  0.64%   a.out    a.out              [.] 0x00000718
  0.63%   a.out    a.out              [.] __libc_start_main@plt
  0.31%   a.out    [kernel.kallsyms]  [k] v7_flush_kern_dcache_area
  0.24%   a.out    [kernel.kallsyms]  [k] get_page_from_freelist
  0.22%   a.out    a.out              [.] 0x00000710
  0.21%   a.out    a.out              [.] malloc@plt
  0.21%   a.out    a.out              [.] puts@plt
  0.21%   a.out    a.out              [.] 0x00000714
  0.21%   a.out    [kernel.kallsyms]  [k] __memzero
......
For a higher level overview, try: perf report --sort comm,dso
As the report indicates, the library function __log_finite(), which is invoked by
the log() function in the math library, has an overhead of 58.74%. In addition, the
floating-point arithmetic in the flin2mu_encode() function produces another 22.35%
overhead. It is clear that we need to optimize these two functions to improve the
overall performance of our application. According to the U.S. standard, the value of
µ used in µ-law is 255. Thus we can reduce the original formula to the following
one, which uses the log2() function instead of the log() function.
ln(1 + µx) / ln(1 + µ) = ln(1 + 255x) / ln(256)
                       = ln(1 + 255x) / (8 ln 2)
                       = (1/8) log2(1 + 255x)
It is also important to note that the transformed formula has a “divide by 8”
operation. For unsigned integers, this is equivalent to a logical shift right.
Therefore, it is highly likely that converting floating-point operations to
fixed-point operations could further improve the performance.
Using Piecewise linear approximation instead of Log(x)
Since computing log(x) is expensive, an alternative for increasing the performance
is to approximate the value of this function. Because log2(x) also allows for
simpler calculation, and using the change-of-base formula for logarithms [1],
log2(x) was used instead of the natural log in the algorithm. The result is as
follows:

F(x) = sgn(x) * (1/8) log2(1 + 255|x|)
Using the Taylor series expansion technique discussed in class, we can get the following
function.
float fpwlog2(float x) {
    if (x < 1.0)   return -1.0;
    if (x < 2.0)   return x - 1.0;
    if (x < 4.0)   return 1.0 + (x - 2.0) / 2.0;
    if (x < 8.0)   return 2.0 + (x - 4.0) / 4.0;
    if (x < 16.0)  return 3.0 + (x - 8.0) / 8.0;
    if (x < 32.0)  return 4.0 + (x - 16.0) / 16.0;
    if (x < 64.0)  return 5.0 + (x - 32.0) / 32.0;
    if (x < 128.0) return 6.0 + (x - 64.0) / 64.0;
    if (x < 256.0) return 7.0 + (x - 128.0) / 128.0;
    return -1;
}

result = 0.125 * fpwlog2(1 + 255.0 * x);
To make sure the result is accurate, we measured the error of the fpwlog2() function
for inputs x ranging from 1.0 to 255.0. Figure 3 plots the error: the x-axis is the
input value, and the y-axis is the error value calculated as
Err = log2(x) - fpwlog2(x).
Figure 3. Error Graph
The maximum measured error is 0.086071491. For our application, this accuracy is tolerable.
[Plot: Y = log2(X) - fpwlog2(X) over the input range 1 to 255]
The result of the fpwlog2() implementation is as follows.
Compression started at:  2016-08-05 12:45:17.963338
Compression finished at: 2016-08-05 12:46:08.396555
Time elapsed:            50.43321
Compared to the raw implementation that uses the library log() function, the
fpwlog2() solution achieved a 196.93% performance gain.
Implementing Piecewise Linear Approximation Log2(x) Using Integer Arithmetic
Although the natural logarithm in the µ-law algorithm was replaced with the linear
approximation, the function still computes with floating-point numbers. Hence, to
further improve the performance of the software routines, an integer arithmetic
version of the linear approximation is necessary.
We know the input of the pwlog2() function ranges from 1.0 to 256.0. Based on the
fact that it accepts and returns a 16-bit unsigned integer, it is reasonable to
choose 2^8 as the scale factor. The resulting pwlog2() function is as follows.
uint16_t pwlog2(uint16_t X) {
    if (X < (1 << 8))  return (-1);
    if (X < (1 << 9))  return (X - (1 << 8));
    if (X < (1 << 10)) return (X >> 1);
    if (X < (1 << 11)) return ((X >> 2) + (1 << 8));
    if (X < (1 << 12)) return ((X >> 3) + (1 << 9));
    if (X < (1 << 13)) return ((X >> 4) + 768);
    if (X < (1 << 14)) return ((X >> 5) + (1 << 10));
    if (X < (1 << 15)) return ((X >> 6) + 1280);
    if (X < (1 << 16)) return ((X >> 7) + 1536);
    return (-1); // It should never reach this point
}
We also need to modify the corresponding code in the lin2mu_encode() function to
eliminate all the floating-point operations. Together with operator strength
reduction, we get the following function.
void lin2mu_encode(int16_t* restrict input_samples,
                   uint8_t* restrict output_samples,
                   uint32_t sample_number) {
    register uint32_t i;
    register int16_t* input_sample_pointer = input_samples;
    register uint8_t* output_sample_pointer = output_samples;
    register int16_t sample;
    register uint32_t sign_bit, magnitude, x, result;
    for (i = sample_number; i > 0; i--) {
        sample = *input_sample_pointer;
        sign_bit = (((uint16_t)sample) >> 15) << 7;
        magnitude = (sign_bit) ? -sample : sample;
        if (magnitude > 32767) magnitude -= 1;
        // x = 1 + mu * magnitude; mu == 255
        x = (((magnitude << 8) - magnitude) >> 15) + 1; // No rounding
        // result = 1/8 * log_2(1 + mu * x) * 127.0
        result = (pwlog2(x << 8) >> 4);
        *output_sample_pointer = ~(result | sign_bit);
        input_sample_pointer++;
        output_sample_pointer++;
    }
}
After implementing the integer arithmetic version of the piecewise linear
approximation, the total elapsed time for computing the compressed audio file was
significantly reduced.
Compression started at:  2016-08-05 13:03:26.470366
Compression finished at: 2016-08-05 13:03:36.336902
Time elapsed:            9.866536
This time, it achieved a 411.15% performance gain.
Optimization of C Software Routines
To further reduce compression time, software optimization techniques were introduced
and tested. The methodologies include loop unrolling, grafting, and software
pipelining. Unfortunately, as the compiler had already optimized the code, neither
loop unrolling nor grafting gave any performance gain. The grafting solution even
reduced performance by interfering with the automatic compiler optimization.
After inspecting the assembly code generated by the compiler, we noticed that it
performs at least the following optimization techniques: constant folding, operator
strength reduction, function inlining, and loop condition optimization. Therefore,
the only thing left for us to do is software pipelining. Loading and storing data
before a branch is likely to be beneficial. During the optimization process, we
attempted to use temporary variables to reduce true dependencies between
instructions. However, the extra pressure on the register file significantly reduced
the performance. Eventually, we arrived at the following solution.
void lin2mu_encode(int16_t* restrict input_samples,
                   uint8_t* restrict output_samples,
                   uint32_t sample_number) {
    register uint32_t i;
    register int16_t* input_sample_pointer = input_samples;
    register uint8_t* output_sample_pointer = output_samples;
    register int16_t sample;
    register uint32_t sign_bit, magnitude, x, result;
    sample = *input_sample_pointer++; // prologue
    for (i = sample_number; i > 0; i--) {
        sign_bit = (((uint16_t)sample) >> 15) << 7;
        magnitude = (sign_bit) ? -sample - 1 : sample; // modified
        // if-statement removed
        x = (((magnitude << 8) - magnitude) >> 15) + 1;
        sample = *input_sample_pointer++; // rearranged
        result = (pwlog2(x << 8) >> 4);
        *output_sample_pointer++ = ~(result | sign_bit);
    }
}
To avoid a segmentation fault, one extra element is allocated at the end of the
input_samples array.
The performance of the optimized C code solution is as follows.
Compression started at:  2016-08-05 14:01:49.288499
Compression finished at: 2016-08-05 14:01:57.446712
Time elapsed:            8.158213
Here, the C code optimization gave the application another 20.94% performance boost.
Optimization of ARM assembly code
Although the compiler already applies a variety of optimization techniques, there is
still something that can be done through human intervention. Even the most advanced
compiler does not know the distribution of the input data, so the order of the
branches is not always optimal. To optimize the order of the branches in the
pwlog2() function, it is essential to know the probability of a sample occurring in
each range. As an example, we profiled a 60-minute software engineering radio file.
Here is the statistical data.
Range         Number
1.0~2.0       88312443
2.0~4.0       51280930
4.0~8.0       48564421
8.0~16.0      48079378
16.0~32.0     35881654
32.0~64.0     15112524
64.0~128.0     2957250
Table 1. pwlog2() Function Input Range
Figure 4. pwlog2() Function Input Statistic
In the file we tested, there are 290266208 samples in total. As the bar chart above
shows, the number of occurrences strictly follows descending order, which means a
switch-case or function-pointer solution will not increase the performance. However,
at the assembly level, we discovered that the compiler does not know that the input
value of the pwlog2() function will never reach the first case (X < 256). If we
optimize this out, we remove one branch for 88312443 samples (the majority). Here is
the modification we made.
+--751 lines: .arch armv6 ----------------------------------------------
        ldrh    r10, [lr], #2       @ sample
        rsb     r3, r0, r0, asl #8  @ magnitude
        mov     r3, r3, lsr #15
        add     r3, r3, #1          @ x
        mov     r3, r3, asl #8
        uxth    r3, r3
        cmp     r3, #255
        movls   ip, #0
        bls     .L158
        cmp     r3, r4
        bls     .L171
        cmp     r3, r5
        bls     .L173
        cmp     r3, r6
        bls     .L174
        cmp     r3, r7
        bls     .L175
+--651 lines: cmp r3, r8 -----------------------------------------------
Before modification

+--751 lines: .arch armv6 ----------------------------------------------
        ldrh    r10, [lr], #2       @ sample
        rsb     r3, r0, r0, asl #8  @ magnitude
        mov     r3, r3, lsr #15
        add     r3, r3, #1          @ x
        mov     r3, r3, asl #8
        uxth    r3, r3
        cmp     r3, r4
        mvnls   ip, ip
        uxtbls  ip, ip
        bls     .L158
        cmp     r3, r5
        bls     .L173
        cmp     r3, r6
        bls     .L174
        cmp     r3, r7
        bls     .L175
+--651 lines: cmp r3, r8 -----------------------------------------------
After modification
The result of the execution is as follows.
Compression started at:  2016-08-05 14:35:47.554919
Compression finished at: 2016-08-05 14:35:55.461069
Time elapsed:            7.906150
Although the improvement is small compared to the C code optimization, it still gave
our program a 3.188% performance gain.
Implementation of 2-slot Machine Firmware
Due to the large number of true dependencies and branches, it is highly unlikely
that a 2-slot machine could improve the performance of the pwlog2() function. Here
is our implementation of the firmware using ARM-A7 style microcode for 2-slot
machines.
pwlog2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        cmp    r0, #255          | nop
        bls    .L1               | nop
        cmpcc  r0, #512          | nop
        subcc  r0, r0, #256      | nop
        uxthcc r0, r0            | nop
        bxcc   lr                | nop
        cmp    r0, #1024         | nop
        bcc    .L2               | nop
        cmp    r0, #2048         | nop
        bcc    .L3               | nop
        cmp    r0, #4096         | nop
        bcc    .L4               | nop
        cmp    r0, #8192         | nop
        bcc    .L5               | nop
        cmp    r0, #16384        | nop
        bcc    .L6               | nop
        tst    r0, #32768        | nop
        moveq  r0, r0, lsr #6    | movne r0, r0, lsr #7
        addeq  r0, r0, #1280     | addne r0, r0, #1536
        bx     lr                | nop
.L1:    ldr    r0, .L8           | nop
        bx     lr                | nop
.L2:    mov    r0, r0, lsr #1    | nop
        bx     lr                | nop
.L3:    mov    r0, r0, lsr #2    | nop
        add    r0, r0, #256      | nop
        bx     lr                | nop
.L4:    mov    r0, r0, lsr #3    | nop
        add    r0, r0, #512      | nop
        bx     lr                | nop
.L5:    mov    r0, r0, lsr #4    | nop
        add    r0, r0, #768      | nop
        bx     lr                | nop
.L6:    mov    r0, r0, lsr #5    | nop
        add    r0, r0, #1024     | nop
        bx     lr                | nop
.L7:    .align 2
.L8:    .word  65535
        .size  pwlog2, .-pwlog2
According to the statistical data obtained earlier, the chance that the input of the
pwlog2() function falls into the range 64.0~256.0 is 1.045543%. In that case, the
firmware solution is 2 cycles faster, assuming the execution time for both the
addition and the move instruction is 1 cycle.
Multithreading Solution
Since an embedded programmer is typically exposed to the hardware itself, it is
beneficial to take advantage of this extra information. The SoC we are using is a
Broadcom BCM2836, which has a 900 MHz quad-core ARM Cortex-A7 processor. The
original program is single-threaded and can only occupy 25% of the cores; that is,
we wasted 75% of the computing resources. To make use of the extra cores, we need to
modify our program so it creates more than one thread (in this case, 4 threads) to
process the data simultaneously. Fortunately, there is no race condition in our
audio compression application, so we do not need mutexes or semaphores to
synchronize the threads.
First we need to define the data structure for parameter passing.
typedef struct {
    int id;
    int16_t* input_samples;
    uint8_t* output_samples;
    uint32_t sample_number;
} WorkerParm;
Then we can modify the function signature of the lin2mu_encode() function so it
matches the pthread convention.
void* lin2mu_encode(void* _myParm) {
    WorkerParm* myParm = (WorkerParm*) _myParm;
    register uint32_t i;
    register int16_t* input_sample_pointer = myParm->input_samples;
    register uint8_t* output_sample_pointer = myParm->output_samples;
    ...
    for (i = myParm->sample_number; i > 0; i--) {
        ...
    }
    return NULL;
}
Since all the threads perform their computation simultaneously, we can no longer
overlap the sample source and sample output buffers in memory.
uint8_t* output_samples = (uint8_t*) malloc(sample_number);
if (output_samples == NULL) {
    printf("Failed allocating memory\n");
    exit(1);
}
Finally, we can set the parameters and create the threads.
WorkerParm* threadParms = (WorkerParm*) malloc(sizeof(WorkerParm) * 4);
uint32_t remaining_samples = sample_number;
uint32_t sample_for_each_thread = sample_number / 4;

threadParms[0].id = 0;
threadParms[0].input_samples = (int16_t*)samples;
threadParms[0].output_samples = (uint8_t*)output_samples;
threadParms[0].sample_number = sample_for_each_thread;
remaining_samples -= sample_for_each_thread;

threadParms[1].id = 1;
threadParms[1].input_samples = ((int16_t*)samples) + sample_for_each_thread;
threadParms[1].output_samples = ((uint8_t*)output_samples) + sample_for_each_thread;
threadParms[1].sample_number = sample_for_each_thread;
remaining_samples -= sample_for_each_thread;

threadParms[2].id = 2;
threadParms[2].input_samples = ((int16_t*)samples) + 2 * sample_for_each_thread;
threadParms[2].output_samples = ((uint8_t*)output_samples) + 2 * sample_for_each_thread;
threadParms[2].sample_number = sample_for_each_thread;
remaining_samples -= sample_for_each_thread;

threadParms[3].id = 3;
threadParms[3].input_samples = ((int16_t*)samples) + 3 * sample_for_each_thread;
threadParms[3].output_samples = ((uint8_t*)output_samples) + 3 * sample_for_each_thread;
threadParms[3].sample_number = remaining_samples;

pthread_t thread1, thread2, thread3, thread4;
printf("Creating threads\n");
gettimeofday(&tvBegin, NULL);
printf("Compression started at: \n");
timeval_print(&tvBegin);
pthread_create(&thread1, NULL, lin2mu_encode, (void*)(threadParms));
pthread_create(&thread2, NULL, lin2mu_encode, (void*)(threadParms + 1));
pthread_create(&thread3, NULL, lin2mu_encode, (void*)(threadParms + 2));
pthread_create(&thread4, NULL, lin2mu_encode, (void*)(threadParms + 3));
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);
pthread_join(thread4, NULL);
Here’s the result.
Compression started at:  2016-08-05 15:05:19.794080
Compression finished at: 2016-08-05 15:05:22.297059
Time elapsed:            2.502979
Not surprisingly, we got a 215.8% performance gain.
Custom Hardware VHDL Solution
Having done all the optimizations we can in software and firmware, to further
improve the performance we need to build specialized hardware. As a proof of
concept, we built a hardware pwlog2 computing unit on an FPGA. However, since the
FPGA is connected to the CPU over a 40 MHz SPI (Serial Peripheral Interface), the
maximum data rate is only around 4 MB/s, given that a FIFO and DMA are utilized;
compared to the 331.78 MB/s I/O speed of the multithreading solution, this is much
too slow. Therefore, the result here only proves that this solution would work if we
redesigned the CPU and extended its instruction set.
Here’s our Automata design.
Figure 5. Automata Design
For the actual implementation, we utilized the Wishbone driver provided by
ValentF(X) [3]. The result is computed when the FPGA receives a register write
operation; the CPU can then fetch the data by reading the same register. As the CPU
frequency is much higher than the FPGA's, we need to let the FPGA block read
requests until the computation is complete.
The implementation details are omitted to conserve space; please inspect the
attached code for the full implementation. Here is the testing result.
Compression started at:  2016-08-05 15:32:13.720360
Compression finished at: 2016-08-05 15:32:20.623885
Time elapsed:            6.903525
The actual I/O speed between the CPU and the FPGA is 124 KB/s. Packing multiple data
words into one transmission or using DMA could bring the transmission rate close to
4 MB/s, but that is still much slower than 331.78 MB/s. The SPI interface is the
bottleneck of the current hardware solution.
Currently, for the multithreading solution, the computation time per sample is 8.62
ns, while the propagation delay estimated by the Timing Analyzer is 19.68 ns.
Therefore, it is unlikely that an FPGA-based custom computing unit could bring any
performance improvement.
PERFORMANCE/COST EVALUATION
During testing we implemented and compared four versions of the µ-law compression
algorithm. The first version was a straightforward implementation of the µ-law
encoding method following the compression algorithm. Second, we replaced the natural
log function with a piecewise linear approximation of the logarithm; the results
improved significantly. Next, we changed the piecewise linear approximation from
floating-point arithmetic to fixed-point arithmetic. This step eliminates the
complicated calculations required by floating point, reducing the program's
arithmetic cost and achieving a performance gain. The final step was to replace the
logarithm output with a lookup table; the function then uses only bit-wise
operations, without multiplication or division, and the performance increased
slightly.
Although the output of the first version was correct, the elapsed time for the
conversion was about 145 seconds. This shows that even with a 900 MHz ARM processor
and optimized compiled software, the µ-law compression was still time consuming. The
main reason is that the algorithm was implemented with floating-point arithmetic:
this not only caused the program to allocate more memory to store the floating-point
values, it also took the processor longer to compute multiplications with
floating-point numbers. The next step was to benchmark the µ-law implementation with
integer arithmetic. The following tables were obtained without and with the
compiler's automatic optimization option.
Mu-law Encoding          Original Size (MB)   Compressed Size (MB)   Elapsed Time (s)
Float Point Arithmetic          581                  291                145.348462
Piecewise Log2(x)               581                  291                 50.950101
Integer Arithmetic Log          581                  291                 22.094284
Bit-wise Operations             581                  291                 16.577726

Table 2. Time obtained without auto optimization.
Mu-law Encoding          Original Size (MB)   Compressed Size (MB)   Elapsed Time (s)
Float Point Arithmetic          581                  291                 50.950101
Piecewise Log2(x)               581                  291                 29.460849
Integer Arithmetic Log          581                  291                  9.869736
Bit-wise Operations             581                  291                  8.868392

Table 3. Time obtained with auto optimization.
Comparing the two tables above, we can see that with software optimization the
compression takes significantly less time than without it. This again indicates that
software optimization techniques are effective and save plenty of processing time.
Stage                    #Sample    Time (s)    Samples/s      Rel. Perf. Gain (%)
Initial Implementation   2.9E+08    149.7534    1938294.286      0.00000%
fpwlog2()                2.9E+08     50.43322   5755456.582    196.93409%
pwlog2() with Integer    2.9E+08      9.866536  29419262.04    411.15427%
C Code Optimization      2.9E+08      8.158213  35579630.98     20.93992%
Assembly Optimization    2.9E+08      7.90615   36713976.84      3.18819%
Multi Threading          2.9E+08      2.502979  115968295.4    215.86961%
FPGA                     220500       6.903525  31940.20446    -99.97246%

Table 4. Relative Performance Gain
The table above lists the relative performance gain of each stage compared to the
previous stage. All data was measured with -O3 compiler optimization enabled. It is
easy to tell that converting to integer arithmetic is the most effective change. For
any embedded systems programmer familiar with integer arithmetic, it will not take
long to convert a program from floating-point arithmetic to integer arithmetic, and
this alone already gives the program a 608.08% performance boost. If the target
platform has more than one CPU core, making the program multithreaded should be the
next priority. Since compilers are becoming smarter, hand-optimizing the C code
should have a lower priority (though one should still write good, fast C code in the
first place). As for assembly-level code, tracing compiler-optimized assembly is
extremely time consuming; in our case, the reward of the assembly-level optimization
did not match the cost. Therefore, unless you are an experienced programmer,
optimizing assembly code should be the lowest priority.
CONCLUSION
It is safe to say that software optimization is essential for computational
processes. From the benchmarking and testing of the different cases, there is clear
evidence that software optimization significantly reduces computation time.
With µ-law compression, the compressed audio output has a lower signal-to-noise
ratio than the 16-bit original, which means the output contains more static noise
than the original file. Although the compressed audio file may sound less pleasant
than the original copy, it is 50% smaller than the uncompressed copy. This reduction
allows the audio signals to be used in limited-bandwidth applications. For example,
in telephone communication channels the bandwidth is 3100 Hz; using µ-law, the
algorithm samples these audio signals at 8000 Hz with an 8-bit representation,
resulting in a 64 kbit/s bit rate [2].
It is also important to note that the performance difference between fixed-point and
floating-point arithmetic is significant when processing a large data file.
Comparing the time for both methods on the same test file, floating-point arithmetic
took about 51 seconds to compute the compressed file, while fixed-point arithmetic
required only about 22 seconds. In other words, the integer version ran in roughly
43% of the time, more than twice as fast as the floating-point version for this
compression.
In the future, more varieties of test audio files, such as GSM recordings, could be
included in the benchmarking. Since GSM files are mostly recorded by cellular/mobile
devices, we could hear what the compression sounds like in real-world conversations
over the cellular network.
BIBLIOGRAPHY
[1] Mathwords. Change of Base Formula, 2016. [Online]. Available:
http://www.mathwords.com/c/change_of_base_formula.htm. [Accessed: 05 - Aug-
2016].
[2] ITU-T, Geneva, Switzerland, ITU-T G.711.1 - Wideband embedded extension for
G.711 pulse code modulation (pre-published), 2008.
[3] Valentfx.com. (2016). LOGI - Wishbone - Project - ValentFx Wiki. [online]
Available at: http://valentfx.com/wiki/index.php?title=LOGI_-_Wishbone_-_Project
[Accessed 5 Aug. 2016].