AUDIO
COMPRESSION
Project Report
UVic Department of Engineering
Shengzong He V00785261 [email protected]
Chenchen Guo V00788009 [email protected]
Submission date: Friday, August 5, 2016
INTRODUCTION
This report begins with a brief description of the components of sound signals and
their importance. It then focuses on the basic pulse code modulation (PCM) method,
which encodes such signals into the digital domain.
An audio signal representing sound typically has frequencies in the range of 20 to
20,000 Hz (the limits of human hearing) and is usually represented by an electrical
voltage. In the analog world, an audio signal is a measure of the pressure of the
sound wave, expressed as an instantaneous voltage over the time domain.
To capture analog audio signals in the digital world, pulse-code modulation
algorithms are used. The main focus of this report is to introduce PCM algorithms
such as the µ-law algorithm and to use them to compress uniform PCM codes.
In the uniform PCM quantization process, an error is introduced when estimating the
sample amplitude. The signal-to-noise ratio is good for high-level signals but poor
for low-level signals. Hence, non-uniform PCM encoding algorithms such as A-law and
µ-law were introduced.
In this project, the main focus is to use the µ-law compression method to encode
uniform PCM samples from a 16-bit representation to an 8-bit representation. Since
only the sample data shrinks, the compression ratio is approximately 2:1, so the
compressed output should be about 50% smaller than the original copy. After the
initial implementation, a series of optimization techniques will be applied on top of
the base version. The goal of this project is to apply software optimization
techniques to increase the performance of the µ-law compression algorithm.
The project follows six steps: implementing the µ-law algorithm, using integer
arithmetic (piecewise logarithm), optimizing the C code, replacing subroutines with
ARM assembly, a 2-slot machine implementation, and an automaton-based FPGA solution.
The testing device for this project is a single-board computer with an ARM-v7
architecture chipset on board.
TABLE OF CONTENTS
Introduction ........................................................................................................................ 1
Theoretical background ...................................................................................................... 3
Pulse Code Modulation Process ..................................................................................... 3
Non-uniform pulse code encoding: the µ-law algorithm and the A Law algorithm ...... 3
Design process ................................................................................................................... 4
Project Requirements ..................................................................................................... 4
Development of the Audio Compression Software Framework .................................... 4
Prototype Software Solutions ......................................................................................... 4
Determine the Bottleneck of the Application ................................................................. 5
Using Piecewise linear approximation instead of Log(x) .............................................. 6
Implementing Piecewise Linear Approximation Log2(x) Using Integer Arithmetic .... 8
Optimization of C Software Routines ............................................................................ 9
Optimization of ARM assembly code .......................................................................... 10
Implementation of 2-slot Machine Firmware .............................................................. 12
Multithreading Solution ............................................................................................... 13
Custom Hardware VHDL Solution .............................................................................. 15
Performance/cost evaluation ............................................................................................ 17
Conclusion ........................................................................................................................ 19
Bibliography ..................................................................................................................... 20
THEORETICAL BACKGROUND
Pulse Code Modulation Process
The PCM encoding method includes three major processes: sampling, quantization, and
coding. In the sampling process, the magnitude of the analog signal is sampled
regularly at uniform intervals; the obtained values are called samples. For a 4 kHz
voice channel, the sampling rate is 8000 Hz, which means the audio signal is sampled
8000 times per second. The quantization process then converts the obtained samples
into uniform discrete digital values. In uniform PCM encoding, the quantized
amplitude range is divided into uniformly spaced steps, called quantization steps.
The standard input values are divided into 16 steps.
However, the quantization process introduces an error, since the real amplitude of a
sample is replaced by an approximated value; this is known as quantization
distortion. For high-level signals (high voltage), the quantization distortion is
low, but for low-level signals this error is significant.
Non-uniform pulse code encoding: the µ-law algorithm and the A Law algorithm
To further minimize such quantization distortion, non-uniform PCM encoding
algorithms like the µ-law algorithm and the A-law algorithm were introduced. These
methods make the PCM quantization step smaller for low-level signals and larger for
high-level signals, decreasing the quantization error and increasing the
signal-to-noise ratio.
During the encoding step of the µ-law algorithm, uniform PCM codes are compressed
from a 16-bit representation to an 8-bit representation. The reduction in size
allows a limited-bandwidth channel to transfer more data, effectively speeding up
the signal transfer process.
Since the µ-law algorithm is the standard in North America, in this project we chose
to implement the µ-law algorithm instead of the A-law algorithm. Also, because A-law
only uses a 12-bit magnitude of the audio samples while µ-law uses a 13-bit
magnitude [2], the compressed audio samples of A-law will be less accurate than the
output of µ-law.
DESIGN PROCESS
Project Requirements
1. Build and implement audio compression prototype using the µ-law algorithm
2. Use Piecewise linear approximation to replace log(x).
3. Implement piecewise linear approximation using integer arithmetic for log2(x) in
Software (C routines)
4. Optimization of C software routines
5. Optimization of ARM assembly code
6. Implement 2-slot machine solutions
7. Custom hardware solutions (VHDL)
8. Estimate the performance improvement for each implementation of piecewise
log2(x) based on software optimization
Development of the Audio Compression Software Framework
To simplify the problem, we decided that our audio compression software would only
support WAV files as input. As we are not planning to implement a resampling
algorithm, the compressed audio file will keep the original sampling rate. The only
difference is that the bit depth of each sample decreases from 16 bits to 8 bits. As
a result, the file size will be reduced to 50% of the original, and the WAV format
will change from linear PCM to µ-law compression. Designing the skeleton of our
application involved completing the following components: a WAV header parser, a
data chunk locator, a WAV header modifier, and a WAV file writer. We also wrote the
corresponding code in the main function to accept arguments, read the WAV file,
print file information, call the compression function, and write the compressed file
back to disk.
Since the key point of this report is software optimization, the development process
of the software skeleton is omitted to conserve space. For more details, please read
the attached source code.
Prototype Software Solutions
In the first step of this project, a software implementation of the µ-law algorithm
was developed using C routines. Simply following the algorithm and normalizing the
input, the audio file was compressed from 581 MB to 291 MB. In this code, the input
audio sample x is a value in the range -2^15 <= x < 2^15; for the computation, the
input value is scaled to [0, 1].
inline unsigned char flin2mu(int16_t input_frame) {
    uint8_t sign_bit = ((uint16_t)input_frame) >> 15;
    uint16_t magnitude = (sign_bit) ? -input_frame : input_frame;
    if (magnitude > 32767)
        magnitude = 32767;
    double x = magnitude / 32767.0;
    double result = log(1.0 + MU * x) / log(1.0 + MU);
    uint8_t return_value = rint(result * 127.0);
    return_value = return_value | (sign_bit << 7);
    return return_value;
}
Determine the Bottleneck of the Application
To optimize our software, it is essential to know where the bottleneck of our
application is. Although it is reasonable to assume that the log() function is the
most time-consuming part of the entire program, a profiling report is still helpful.
Instead of using the “gprof” tool to profile our application, we decided to use the
system-level “perf” command to get the most accurate and complete information about
our application.
sudo perf record ./a.out sengRadio.wav
The result obtained is as follows.
Loading WAV file: sengRadio.wav...
File loaded into memory.
No. of channels: 2
RIFF chunk size: 580532454
FMT chunk size: 18
Format code: 1
Sample rate: 44100
Byte rate: 176400 Bps
Bit rate: 1411 kbps
Bits per sample: 16
Block size: 4
Sample size is signed 16 bits
Sample number: 290266208
Compression started at:
Figure 1. Original audio signals
Figure 2. Compressed output file
2016-08-05 11:38:56.771413
Compression finished at:
2016-08-05 11:41:26.524838
Time elapsed: 149.753425
[ perf record: Woken up 96 times to write data ]
[ perf record: Captured and wrote 24.217 MB perf.data (634644 samples) ]

Samples: 634K of event 'cycles:ppp', Event count (approx.): 140893513459
Overhead  Command  Shared Object      Symbol
 58.74%   a.out    libm-2.19.so       [.] __log_finite
 22.35%   a.out    a.out              [.] flin2mu_encode
  6.83%   a.out    libm-2.19.so       [.] __rintl
  5.34%   a.out    libm-2.19.so       [.] __logl
  0.83%   a.out    [kernel.kallsyms]  [k] mmiocpy
  0.64%   a.out    a.out              [.] 0x00000718
  0.63%   a.out    a.out              [.] __libc_start_main@plt
  0.31%   a.out    [kernel.kallsyms]  [k] v7_flush_kern_dcache_area
  0.24%   a.out    [kernel.kallsyms]  [k] get_page_from_freelist
  0.22%   a.out    a.out              [.] 0x00000710
  0.21%   a.out    a.out              [.] malloc@plt
  0.21%   a.out    a.out              [.] puts@plt
  0.21%   a.out    a.out              [.] 0x00000714
  0.21%   a.out    [kernel.kallsyms]  [k] __memzero
......
For a higher level overview, try: perf report --sort comm,dso
As the report indicates, the library function __log_finite(), which is invoked by
the log() function in the math library, has an overhead of 58.74%. In addition, the
floating-point arithmetic in the flin2mu_encode() function produces another 22.35%
overhead. It is clear that we need to optimize these two functions to improve the
overall performance of our application. According to the U.S. standard, the value of
µ used in µ-law is 255. Thus we can reduce the original formula to the following
one, which uses the log2() function instead of the log() function.
ln(1 + µx) / ln(1 + µ) = ln(1 + 255x) / ln(256)
                       = ln(1 + 255x) / (8 ln 2)
                       = (1/8) log2(1 + 255x)
It is also important to note that the transformed formula has a “divide by 8”
operation. For unsigned integers, this is equivalent to a logical shift right.
Therefore, it is highly likely that converting floating-point operations to
fixed-point operations could further improve the performance.
Using Piecewise linear approximation instead of Log(x)
Since computing log(x) is expensive, an alternative for increasing the performance
is to approximate the value of this function. Because log2(x) also allows for
simpler calculation, and using the change-of-base formula for logarithms [1],
log2(x) was used instead of the natural log in the algorithm. The result is as
follows:

F(x) = sgn(x) * (1/8) log2(1 + 255|x|)
Using the Taylor series expansion technique discussed in class, we can get the following
function.
float fpwlog2(float x) {
    if (x < 1.0)   return -1.0;
    if (x < 2.0)   return x - 1.0;
    if (x < 4.0)   return 1.0 + (x - 2.0) / 2.0;
    if (x < 8.0)   return 2.0 + (x - 4.0) / 4.0;
    if (x < 16.0)  return 3.0 + (x - 8.0) / 8.0;
    if (x < 32.0)  return 4.0 + (x - 16.0) / 16.0;
    if (x < 64.0)  return 5.0 + (x - 32.0) / 32.0;
    if (x < 128.0) return 6.0 + (x - 64.0) / 64.0;
    if (x < 256.0) return 7.0 + (x - 128.0) / 128.0;
    return -1;
}

result = 0.125 * fpwlog2(1 + 255.0 * x);
To make sure the result is accurate, we measured the error of the fpwlog2() function
for inputs x ranging from 1.0 to 255.0. Figure 3 plots the error: the x-axis is the
input value, and the y-axis is the error value calculated as
Err = log2(x) - fpwlog2(x).
Figure 3. Error Graph
The maximum measured error is 0.086071491. For our application, this accuracy is tolerable.
[Plot: Y = log2(X) - fpwlog2(X) over the input range 1 to 255]
The result of the fpwlog2() implementation is as follows.
Compression started at:  2016-08-05 12:45:17.963338
Compression finished at: 2016-08-05 12:46:08.396555
Time elapsed:            50.43321
Compared to the raw implementation that uses the library log() function, the
fpwlog2() solution achieved a 196.93% performance gain.
Implementing Piecewise Linear Approximation Log2(x) Using Integer Arithmetic
Although the natural logarithm in the µ-law algorithm was replaced with the linear
approximation, the function still computes with floating-point numbers. Hence, to
further improve the performance of the software routines, an integer arithmetic
version of the linear approximation is necessary.
We know the input of the pwlog2() function ranges from 1.0 to 256.0. Based on the
fact that it accepts and returns a 16-bit unsigned integer, it is reasonable to
choose 2^8 as the scale factor. The resulting pwlog2() function is as follows.
uint16_t pwlog2(uint16_t X) {
    if (X < (1 << 8))  return (-1);
    if (X < (1 << 9))  return (X - (1 << 8));
    if (X < (1 << 10)) return (X >> 1);
    if (X < (1 << 11)) return ((X >> 2) + (1 << 8));
    if (X < (1 << 12)) return ((X >> 3) + (1 << 9));
    if (X < (1 << 13)) return ((X >> 4) + 768);
    if (X < (1 << 14)) return ((X >> 5) + (1 << 10));
    if (X < (1 << 15)) return ((X >> 6) + 1280);
    if (X < (1 << 16)) return ((X >> 7) + 1536);
    return (-1); // It should never reach this point
}
We also need to modify the corresponding code in the lin2mu_encode() function to
eliminate all the floating-point operations. Together with operator strength
reduction, we get the following function.
void lin2mu_encode(int16_t* restrict input_samples,
                   uint8_t* restrict output_samples,
                   uint32_t sample_number) {
    register uint32_t i;
    register int16_t* input_sample_pointer = input_samples;
    register uint8_t* output_sample_pointer = output_samples;
    register int16_t sample;
    register uint32_t sign_bit, magnitude, x, result;
    for (i = sample_number; i > 0; i--) {
        sample = *input_sample_pointer;
        sign_bit = (((uint16_t)sample) >> 15) << 7;
        magnitude = (sign_bit) ? -sample : sample;
        if (magnitude > 32767) magnitude -= 1;
        // x = 1 + mu * magnitude; mu == 255
        x = (((magnitude << 8) - magnitude) >> 15) + 1; // No rounding
        // result = 1/8 * log_2(1 + mu * x) * 127.0
        result = (pwlog2(x << 8) >> 4);
        *output_sample_pointer = ~(result | sign_bit);
        input_sample_pointer++;
        output_sample_pointer++;
    }
}
After implementing the integer arithmetic version of the piecewise linear
approximation, the total elapsed time for computing the compressed audio file was
significantly reduced.
Compression started at:  2016-08-05 13:03:26.470366
Compression finished at: 2016-08-05 13:03:36.336902
Time elapsed:            9.866536
This time, it achieved a 411.15% performance gain.
Optimization of C Software Routines
To further reduce compression time, software optimization techniques were introduced
and tested. The methodologies include loop unrolling, grafting, and software
pipelining. Unfortunately, as the compiler had already optimized the code, neither
loop unrolling nor grafting gave any performance gain. The grafting solution even
reduced performance by interfering with the automatic compiler optimization.
After inspecting the assembly code generated by the compiler, we noticed that it
performs at least the following optimization techniques: constant folding, operator
strength reduction, function inlining, and loop condition optimization. Therefore,
the only thing left for us to do is software pipelining. Loading and storing data
before a branch is likely to be beneficial. During the optimization process, we
attempted to use temporary variables to reduce true dependencies between
instructions. However, the extra pressure on the register file significantly reduced
the performance. Eventually, we arrived at the following solution.
void lin2mu_encode(int16_t* restrict input_samples,
                   uint8_t* restrict output_samples,
                   uint32_t sample_number) {
    register uint32_t i;
    register int16_t* input_sample_pointer = input_samples;
    register uint8_t* output_sample_pointer = output_samples;
    register int16_t sample;
    register uint32_t sign_bit, magnitude, x, result;
    sample = *input_sample_pointer++; // prologue
    for (i = sample_number; i > 0; i--) {
        sign_bit = (((uint16_t)sample) >> 15) << 7;
        magnitude = (sign_bit) ? -sample - 1 : sample; // modified
        // if-statement removed
        x = (((magnitude << 8) - magnitude) >> 15) + 1;
        sample = *input_sample_pointer++; // rearranged
        result = (pwlog2(x << 8) >> 4);
        *output_sample_pointer++ = ~(result | sign_bit);
    }
}
To avoid a segmentation fault, one extra element is allocated at the end of the
input_samples array.
The performance of the optimized C code solution is as follows.
Compression started at:  2016-08-05 14:01:49.288499
Compression finished at: 2016-08-05 14:01:57.446712
Time elapsed:            8.158213
Here, the C code optimization gave the application another 20.94% performance boost.
Optimization of ARM assembly code
Although the compiler already applies a variety of optimization techniques, there is
still something that can be done through human intervention. Even the most advanced
compiler does not know the distribution of the input data, so the order of the
branches is not always optimal. To optimize the order of the branches in the
pwlog2() function, it is essential to know the probability of a sample occurring in
each range. As an example, we profiled a 60-minute software engineering radio file.
Here is the statistical data.
Range         Number
1.0~2.0       88312443
2.0~4.0       51280930
4.0~8.0       48564421
8.0~16.0      48079378
16.0~32.0     35881654
32.0~64.0     15112524
64.0~128.0     2957250
Table 1. pwlog2() Function Input Range
Figure 4. pwlog2() Function Input Statistic
In the file we tested, there are 290266208 samples in total. As the bar chart above
shows, the number of occurrences strictly follows descending order, which means a
switch-case or function-pointer solution will not increase the performance. However,
at the assembly level, we discovered that the compiler does not know that the input
value of the pwlog2() function will never reach the first case (X < 256). If we
optimize this out, we remove one branch for 88312443 samples (the majority). Here is
the modification we made.
+--751 lines: .arch armv6 ----------------------------------------------
        ldrh    r10, [lr], #2       @ sample
        rsb     r3, r0, r0, asl #8  @ magnitude
        mov     r3, r3, lsr #15
        add     r3, r3, #1          @ x
        mov     r3, r3, asl #8
        uxth    r3, r3
        cmp     r3, #255
        movls   ip, #0
        bls     .L158
        cmp     r3, r4
        bls     .L171
        cmp     r3, r5
        bls     .L173
        cmp     r3, r6
        bls     .L174
        cmp     r3, r7
        bls     .L175
+--651 lines: cmp r3, r8 -----------------------------------------------
Before modification

+--751 lines: .arch armv6 ----------------------------------------------
        ldrh    r10, [lr], #2       @ sample
        rsb     r3, r0, r0, asl #8  @ magnitude
        mov     r3, r3, lsr #15
        add     r3, r3, #1          @ x
        mov     r3, r3, asl #8
        uxth    r3, r3
        cmp     r3, r4
        mvnls   ip, ip
        uxtbls  ip, ip
        bls     .L158
        cmp     r3, r5
        bls     .L173
        cmp     r3, r6
        bls     .L174
        cmp     r3, r7
        bls     .L175
+--651 lines: cmp r3, r8 -----------------------------------------------
After modification
The result of the execution is as follows.
Compression started at:  2016-08-05 14:35:47.554919
Compression finished at: 2016-08-05 14:35:55.461069
Time elapsed:            7.906150
Although the improvement is small compared to the C code optimization, it still gave
our program a 3.188% performance gain.
Implementation of 2-slot Machine Firmware
Due to the large number of true dependencies and branches, it is highly unlikely
that a 2-slot machine could improve the performance of the pwlog2() function. Here
is our implementation of the firmware using ARM-A7 style microcode for 2-slot
machines.
pwlog2:
        @ args = 0, pretend = 0, frame = 0
        @ frame_needed = 0, uses_anonymous_args = 0
        @ link register save eliminated.
        cmp    r0, #255          | nop
        bls    .L1               | nop
        cmpcc  r0, #512          | nop
        subcc  r0, r0, #256      | nop
        uxthcc r0, r0            | nop
        bxcc   lr                | nop
        cmp    r0, #1024         | nop
        bcc    .L2               | nop
        cmp    r0, #2048         | nop
        bcc    .L3               | nop
        cmp    r0, #4096         | nop
        bcc    .L4               | nop
        cmp    r0, #8192         | nop
        bcc    .L5               | nop
        cmp    r0, #16384        | nop
        bcc    .L6               | nop
        tst    r0, #32768        | nop
        moveq  r0, r0, lsr #6    | movne r0, r0, lsr #7
        addeq  r0, r0, #1280     | addne r0, r0, #1536
        bx     lr                | nop
.L1:    ldr    r0, .L8           | nop
        bx     lr                | nop
.L2:    mov    r0, r0, lsr #1    | nop
        bx     lr                | nop
.L3:    mov    r0, r0, lsr #2    | nop
        add    r0, r0, #256      | nop
        bx     lr                | nop
.L4:    mov    r0, r0, lsr #3    | nop
        add    r0, r0, #512      | nop
        bx     lr                | nop
.L5:    mov    r0, r0, lsr #4    | nop
        add    r0, r0, #768      | nop
        bx     lr                | nop
.L6:    mov    r0, r0, lsr #5    | nop
        add    r0, r0, #1024     | nop
        bx     lr                | nop
.L7:    .align 2
.L8:    .word  65535
        .size  pwlog2, .-pwlog2
According to the statistical data obtained earlier, the chance that the input of the
pwlog2() function falls into the range 64.0~256.0 is 1.045543%. In that case, the
firmware solution is 2 cycles faster, assuming the execution time for both the
addition and the move instruction is 1 cycle.
Multithreading Solution
Since an embedded programmer is typically exposed to the hardware itself, it is
beneficial to take advantage of this extra information. The SoC we are using is a
Broadcom BCM2836, which has a 900 MHz quad-core ARM Cortex-A7 processor. The
original program is single-threaded and can only occupy 25% of the cores; that is,
we wasted 75% of the computing resources. To make use of the extra cores, we need to
modify our program so it creates more than one thread (in this case, 4 threads) to
process the data simultaneously. Fortunately, there is no race condition in our
audio compression application, so we do not need mutexes or semaphores to
synchronize the threads.
First we need to define the data structure for parameter passing.
typedef struct {
    int id;
    int16_t* input_samples;
    uint8_t* output_samples;
    uint32_t sample_number;
} WorkerParm;
Then we can modify the function signature of the lin2mu_encode() function so it
matches the pthread convention.
void* lin2mu_encode(void* _myParm) {
    WorkerParm* myParm = (WorkerParm*) _myParm;
    register uint32_t i;
    register int16_t* input_sample_pointer = myParm->input_samples;
    register uint8_t* output_sample_pointer = myParm->output_samples;
    ...
    for (i = myParm->sample_number; i > 0; i--) {
        ...
    }
    return NULL;
}
Since all the threads perform their computation simultaneously, we can no longer
overlap the sample source and sample output buffers in memory.
uint8_t* output_samples = (uint8_t*) malloc(sample_number);
if (output_samples == NULL) {
    printf("Failed allocating memory\n");
    exit(1);
}
Finally, we can set the parameters and create the threads.
WorkerParm* threadParms = (WorkerParm*) malloc(sizeof(WorkerParm) * 4);
uint32_t remaining_samples = sample_number;
uint32_t sample_for_each_thread = sample_number / 4;

threadParms[0].id = 0;
threadParms[0].input_samples = (int16_t*)samples;
threadParms[0].output_samples = (uint8_t*)output_samples;
threadParms[0].sample_number = sample_for_each_thread;
remaining_samples -= sample_for_each_thread;

threadParms[1].id = 1;
threadParms[1].input_samples = ((int16_t*)samples) + sample_for_each_thread;
threadParms[1].output_samples = ((uint8_t*)output_samples) + sample_for_each_thread;
threadParms[1].sample_number = sample_for_each_thread;
remaining_samples -= sample_for_each_thread;

threadParms[2].id = 2;
threadParms[2].input_samples = ((int16_t*)samples) + 2 * sample_for_each_thread;
threadParms[2].output_samples = ((uint8_t*)output_samples) + 2 * sample_for_each_thread;
threadParms[2].sample_number = sample_for_each_thread;
remaining_samples -= sample_for_each_thread;

threadParms[3].id = 3;
threadParms[3].input_samples = ((int16_t*)samples) + 3 * sample_for_each_thread;
threadParms[3].output_samples = ((uint8_t*)output_samples) + 3 * sample_for_each_thread;
threadParms[3].sample_number = remaining_samples;

pthread_t thread1, thread2, thread3, thread4;
printf("Creating threads\n");
gettimeofday(&tvBegin, NULL);
printf("Compression started at: \n");
timeval_print(&tvBegin);
pthread_create(&thread1, NULL, lin2mu_encode, (void*)(threadParms));
pthread_create(&thread2, NULL, lin2mu_encode, (void*)(threadParms + 1));
pthread_create(&thread3, NULL, lin2mu_encode, (void*)(threadParms + 2));
pthread_create(&thread4, NULL, lin2mu_encode, (void*)(threadParms + 3));
pthread_join(thread1, NULL);
pthread_join(thread2, NULL);
pthread_join(thread3, NULL);
pthread_join(thread4, NULL);
Here’s the result.
Compression started at:  2016-08-05 15:05:19.794080
Compression finished at: 2016-08-05 15:05:22.297059
Time elapsed:            2.502979
Not surprisingly, we got a 215.8% performance gain.
Custom Hardware VHDL Solution
Having done all the optimizations we can in software and firmware, to further
improve the performance we need to build specialized hardware. As a proof of
concept, we built a hardware pwlog2 computing unit on an FPGA. However, since the
FPGA is connected to the CPU over a 40 MHz SPI (Serial Peripheral Interface), the
maximum data rate is only around 4 MB/s, given that a FIFO and DMA are utilized;
compared to the 331.78 MB/s I/O speed of the multithreading solution, this is much
too slow. Therefore, the result here only proves that this solution would work if we
redesigned the CPU and extended its instruction set.
Here’s our Automata design.
Figure 5. Automata Design
For the actual implementation, we utilized the Wishbone driver provided by
ValentF(X) [3]. The result is computed when the FPGA receives a register write
operation; the CPU can then fetch the data by reading the same register. As the CPU
frequency is much higher than the FPGA's, we need to let the FPGA block read
requests until the computation is complete.
The implementation details are omitted to conserve space; please inspect the
attached code for the full implementation. Here is the testing result.
Compression started at:  2016-08-05 15:32:13.720360
Compression finished at: 2016-08-05 15:32:20.623885
Time elapsed:            6.903525
The actual I/O speed between the CPU and the FPGA is 124 KB/s. Packing multiple data
words into one transmission or using DMA could bring the transmission rate close to
4 MB/s, but that is still much slower than 331.78 MB/s. The SPI interface is the
bottleneck of the current hardware solution.
Currently, for the multithreading solution, the computation time per sample is 8.62
ns, while the propagation delay estimated by the Timing Analyzer is 19.68 ns.
Therefore, it is unlikely that an FPGA-based custom computing unit could bring any
performance improvement.
PERFORMANCE/COST EVALUATION
During testing we implemented and compared four versions of the µ-law compression
algorithm. The first version was a straightforward implementation of the µ-law
encoding method following the compression algorithm. Second, we replaced the natural
log function with a piecewise linear approximation of the logarithm; the results
improved significantly. Next, we changed the piecewise linear approximation from
floating-point arithmetic to fixed-point arithmetic. This step eliminates the
complicated calculations required by floating point, reducing the program's
arithmetic cost and achieving a performance gain. The final step was to replace the
logarithm output with a lookup table; the function then uses only bit-wise
operations, without multiplication or division, and the performance increased
slightly.
Although the output of the first version was correct, the elapsed time for the
conversion was about 145 seconds. This shows that even with a 900 MHz ARM processor
and optimized compiled software, the µ-law compression was still time consuming. The
main reason is that the algorithm was implemented with floating-point arithmetic:
this not only caused the program to allocate more memory to store the floating-point
values, it also took the processor longer to compute multiplications with
floating-point numbers. The next step was to benchmark the µ-law implementation with
integer arithmetic. The following tables were obtained without and with the
compiler's automatic optimization option.
Mu-law Encoding          Original Size (MB)   Compressed Size (MB)   Elapsed Time (s)
Float Point Arithmetic          581                  291                145.348462
Piecewise Log2(x)               581                  291                 50.950101
Integer Arithmetic Log          581                  291                 22.094284
Bit-wise Operations             581                  291                 16.577726

Table 2. Time obtained without auto optimization.
Mu-law Encoding          Original Size (MB)   Compressed Size (MB)   Elapsed Time (s)
Float Point Arithmetic          581                  291                 50.950101
Piecewise Log2(x)               581                  291                 29.460849
Integer Arithmetic Log          581                  291                  9.869736
Bit-wise Operations             581                  291                  8.868392

Table 3. Time obtained with auto optimization.
Comparing the two tables above, we can see that with software optimization the
compression takes significantly less time than without it. This again indicates that
software optimization techniques are effective and save plenty of processing time.
Stage                    #Sample    Time (s)    Samples/s      Rel. Perf. Gain (%)
Initial Implementation   2.9E+08    149.7534    1938294.286      0.00000%
fpwlog2()                2.9E+08     50.43322   5755456.582    196.93409%
pwlog2() with Integer    2.9E+08      9.866536  29419262.04    411.15427%
C Code Optimization      2.9E+08      8.158213  35579630.98     20.93992%
Assembly Optimization    2.9E+08      7.90615   36713976.84      3.18819%
Multi Threading          2.9E+08      2.502979  115968295.4    215.86961%
FPGA                     220500       6.903525  31940.20446    -99.97246%

Table 4. Relative Performance Gain
The table above lists the relative performance gain of each stage compared to the
previous stage. All data was measured with -O3 compiler optimization enabled. It is
easy to tell that converting to integer arithmetic is the most effective change. For
any embedded systems programmer familiar with integer arithmetic, it will not take
long to convert a program from floating-point arithmetic to integer arithmetic, and
this alone already gives the program a 608.08% performance boost. If the target
platform has more than one CPU core, making the program multithreaded should be the
next priority. Since compilers are becoming smarter, hand-optimizing the C code
should have a lower priority (though one should still write good, fast C code in the
first place). As for assembly-level code, tracing compiler-optimized assembly is
extremely time consuming; in our case, the reward of the assembly-level optimization
did not match the cost. Therefore, unless you are an experienced programmer,
optimizing assembly code should be the lowest priority.
CONCLUSION
It is safe to say that software optimization is essential for computational
processes. From the benchmarking and testing of the different cases, there is clear
evidence that software optimization significantly reduces computation time.
With µ-law compression, the compressed audio output has a lower signal-to-noise
ratio than the 16-bit original, which means the output contains more static noise
than the original file. Although the compressed audio file may sound less pleasant
than the original copy, it is 50% smaller than the uncompressed copy. This reduction
allows the audio signals to be used in limited-bandwidth applications. For example,
in telephone communication channels the bandwidth is 3100 Hz; using µ-law, the
algorithm samples these audio signals at 8000 Hz with an 8-bit representation,
resulting in a 64 kbit/s bit rate [2].
It is also important to note that the performance difference between fixed-point and
floating-point arithmetic is significant when processing a large data file.
Comparing the time for both methods on the same test file, floating-point arithmetic
took about 51 seconds to compute the compressed file, while fixed-point arithmetic
required only about 22 seconds. In other words, the integer version ran in roughly
43% of the time, more than twice as fast as the floating-point version for this
compression.
In the future, more varieties of test audio files, such as GSM recordings, could be
included in the benchmarking. Since GSM files are mostly recorded by cellular/mobile
devices, we could hear what the compression sounds like in real-world conversations
over the cellular network.
BIBLIOGRAPHY
[1] Mathwords. Change of Base Formula, 2016. [Online]. Available:
http://www.mathwords.com/c/change_of_base_formula.htm. [Accessed: 05 - Aug-
2016].
[2] ITU-T, Geneva, Switzerland, ITU-T G.711.1 - Wideband embedded extension for
G.711 pulse code modulation (pre-published), 2008.
[3] Valentfx.com. (2016). LOGI - Wishbone - Project - ValentFx Wiki. [online]
Available at: http://valentfx.com/wiki/index.php?title=LOGI_-_Wishbone_-_Project
[Accessed 5 Aug. 2016].