A Multiprocessor Implementation for the GSM Algorithm

by

Jennifer C. Kleiman

Submitted to the Department of Electrical Engineering and Computer Science

in Partial Fulfillment of the Requirements for the Degrees of

Bachelor of Science in Electrical Engineering and Computer Science

and Master of Engineering in Electrical Engineering and Computer Science

at the Massachusetts Institute of Technology

May 21, 1999

© Copyright Jennifer C. Kleiman 1999. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis

and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science

May 21, 1999

Certified by: Dr. Christopher J. Terman

Thesis Supervisor

Accepted by: Prof. Arthur C. Smith

Chairman, Department Committee on Graduate Theses

A Multiprocessor Implementation for the GSM Algorithm

by

Jennifer C. Kleiman

Submitted to the Department of Electrical Engineering and Computer Science

May 21, 1999

In Partial Fulfillment of the Requirements for the Degrees of Bachelor of Science in Electrical Engineering and Computer Science

and Master of Engineering in Electrical Engineering and Computer Science

ABSTRACT

Telecommunications or simply communications play an important role in the computer industry. At the core of this industry lies the digital signal processor. Moreover, many communication technologies rely principally on signal processing. At both the system and component level, redundancy is present in most of these applications. Therefore, an opportunity exists to optimize these technologies using parallel processing. Specifically, the DSPs in these applications may be designed in a parallel configuration to achieve higher performance and a reduction of dedicated hardware.

GSM, a mobile communications system, displays redundancy at core processing nodes in its network as well as in its fundamental speech processing algorithm, thereby making it an optimal choice for this implementation. This thesis describes the design methodology for this implementation and evaluates several different configurations. As a result, a new multiprocessor is proposed.

Thesis Supervisor: Christopher J. Terman

Title: Senior Lecturer, Dept. of Electrical Engineering and Computer Science

Acknowledgments

First, I would like to thank God for giving me the patience, endurance, and motivation to complete this thesis.

I would like to thank my advisor, Chris Terman, for providing a wealth of knowledge and assistance throughout my thesis endeavor. I am thankful for the opportunity to have worked with him.

To my parents, I would like to give my deepest gratitude and honor. They have been a constant source of love, support and encouragement since the moment I came to MIT. I thank them for the sacrifices they made in order to provide me with an excellent education. "Keep Going"

Contents

1 Introduction
    1.1 Background
    1.2 Digital Signal Processors
    1.3 Network Architectures

2 Technologies of Interest
    2.1 ADSL
    2.2 MPEG
    2.3 GSM

3 The Bulk DSP Architecture
    3.1 Parallel Computing
    3.2 Processing Elements
    3.3 Memory Hierarchy
    3.4 GSM Implementation
    3.5 Encoding and Decoding

4 Scheduling
    4.1 Design Methodology
    4.2 GSM's Computational Modules
    4.3 Evaluation of Architectures

5 Conclusion
    5.1 Summary
    5.2 Future Work
    5.3 Final Thoughts

A Software Tools Used

B GSM Resources on the Web

List of Figures

2-1 ADSL Network Connection

2-2 MPEG Encoding Algorithm

2-3 GSM Network Architecture

3-1 GSM Encoding Algorithm

4-1 GSM Decoding Algorithm

4-2 LongTermSynthesisFiltering Module

4-3 Preliminary Building Block of Bulk DSP

4-4 Segment of Schedule for Proposed Architecture

4-5 Final Bulk DSP Architecture

List of Tables

4.1 Computational Module Parameters

4.2 Comparison of Results with Best Architecture

4.3 Final Results

Chapter 1

Introduction

This thesis identifies the utility of a parallel multiprocessor for use in communication

technologies. In order to demonstrate this, several applications are explored, and the

basic architecture of the new parallel multiprocessor is presented. A specific technology

is then chosen as a vehicle to implement this multiprocessor, and the design process is

described. In addition, several architectures are analyzed based on experiments done

utilizing a set of software modules, and the performance results are presented. Finally,

some conclusions are drawn for the final implementation.

1.1 Background

Virtually everyone today uses some type of telephone service. Whether it is a 'plain old

telephone', a wireless cellular phone, or a modem in a computer, people heavily depend

on telephone systems as their primary source of communication. By the year 2001, it is

estimated that there will be 1 billion phone lines worldwide and 580 million cell phone

subscribers [8]. Communication, however, does not just encompass spoken conversation

between people. Technological advances have enabled communications or

telecommunications to also include transmitting sound, video, and digital data across

telephone lines, radio frequencies, and cable lines. Because these types of

communication have expanded so rapidly, there are now telephone systems in almost

every area of the world. Even so, there still remains a large demand for global

connectivity; the most prominent example of this today is the Internet. This demand

drives telecommunication technologies to provide an infrastructure, which includes the

hardware, software, and network topology to further connect people around the world

while delivering as much information as possible. In addition, these communication

systems need to service millions of customers simultaneously and efficiently.

1.2 Digital Signal Processors

Communication technology relies on specialized hardware to perform digital

signal processing computations. In general, the computations performed are part or all of

a signal processing algorithm. Specifically, the numerous computations include

receiving, decoding, directing, encoding, and transmitting data. All of these operations

are done digitally by the DSPs. The benefits for communication applications reaped from

operating in the digital realm include reliability in the transmission process as well as

compression in data size, which makes transmission faster and easier. These procedures

are carried out on small portions of the data, known as frames, on each of the subscriber lines

in order to achieve "seamless" point-to-point communication. The hardware components

that perform these functions are known as digital signal processors (DSPs). These

processors have been designed to handle complex arithmetic computations precisely for

speech and data processing applications.

Digital signal processing applications tend to be repetitive in nature. Many signal

processing functions rely on the execution of the same computation on each byte of data

in order to complete the desired operation. In fact, many of these algorithms consist

of a relatively small set of instructions that are executed over and over again in loop

configurations. Because of this inherent repetition, the overall signal processing algorithm

can be broken down into smaller computing modules, which are then just applied

repeatedly throughout the algorithm.
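
As an illustration of this repetition, the following is a minimal sketch of a finite impulse response (FIR) filter, a common signal processing operation that is not taken from this thesis. The function name `fir_filter` and the two-tap averaging coefficients are assumptions chosen for the example. The point is that the inner loop is a single multiply-accumulate step repeated for every tap of every sample.

```python
def fir_filter(samples, taps):
    """Apply an FIR filter: each output is a multiply-accumulate over the taps."""
    history = [0.0] * len(taps)  # most recent input sample first
    output = []
    for x in samples:
        # Shift the newest sample into the delay line.
        history = [x] + history[:-1]
        acc = 0.0
        for h, s in zip(taps, history):
            acc += h * s  # the same instruction, repeated for every tap
        output.append(acc)
    return output


# Hypothetical input: a two-tap moving average over four samples.
smoothed = fir_filter([1.0, 3.0, 5.0, 7.0], [0.5, 0.5])
print(smoothed)
```

The filter body could serve as one computing module in the sense described above: a short instruction sequence applied repeatedly across the data.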

1.3 Network Architectures

The network topologies of these communication systems vary somewhat depending on

the particular type of data they transmit and the communication medium used. Although

the current trend is to build networks that can transmit all types of data, there still exist

several methods by which to transmit this data (e.g., physical lines vs. wireless). In

general, the basic foundations of these networks include a backbone network that

connects everything in the network. There are also several central locations in the

network, which bring together all the separate "channels" or subscriber lines in order to

process the data. Usually, these channels carry multiple simultaneous calls, and thus,

these central locations receive and transmit hundreds of channels at any given time.

Moreover, the central office or location is equipped with many computers whose tasks

are only to process these incoming and outgoing channels of communication. Because all

the channels come through a central office, the majority or bulk of the processing in the

network occurs at these central locations. The other major components of the network

infrastructure handle other operations such as transmitting and receiving the channels.

Since the central offices control much of the computation in a communication system,

they can be considered the "core processing nodes" of the network. The latency of these

core processing nodes, however, is much larger than anywhere else in the network, and

as a result much of the total computation time is spent here. Therefore, the efficiency of

these nodes plays a significant role in the overall network performance.

As stated, much of the processing in many communication networks occurs in

central locations. Present day infrastructures tend to use a single DSP processor for a

small number of channels routed through these points. Since there are usually many

channels transmitted through these locations, there are many DSPs located at these spots.

Ideally, a single DSP chip would process a large number of the channels at these locations,

which can number anywhere from 100 to 1000 depending on the technology. This would

immensely reduce the number of processors in the network. Logically, the next step

towards achieving such an improvement is to design a DSP or multiprocessor that could

process much more than just a few channels. By taking advantage of the instruction level

parallelism (ILP) among the channels, a parallel computation architecture such as a

SIMD (Single Instruction stream, Multiple Data stream) vector processor might be used to

design a more efficient implementation. SIMD refers to an architecture in which the

same instruction is executed by multiple processors on different data streams. In

particular, for communication applications, a multiprocessor or 'Bulk DSP' might be

designed to contain many simple processors, such that each processor executes the same

instruction or set of instructions in parallel. This design would enable the Bulk DSP to

control and process many channels. The reduction in hardware would result in

significant cost savings for the given infrastructure. In addition, if successful, the Bulk

DSP would exceed the performance of existing communication networks, thereby

meeting the growing demands of communication technologies.
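
The SIMD idea can be sketched in a few lines of software. This is only a sequential emulation under stated assumptions, not the Bulk DSP design itself: the names `simd_step` and `run_program`, the three-instruction program, and the four channel values are all hypothetical. Each instruction is issued once and applied to every channel's current value before the next instruction is issued.

```python
def simd_step(operation, lanes):
    """Apply one instruction to every lane (one lane per channel)."""
    return [operation(value) for value in lanes]


def run_program(program, channel_samples):
    """Issue each instruction once; all channels advance in lockstep."""
    lanes = list(channel_samples)
    for operation in program:
        lanes = simd_step(operation, lanes)
    return lanes


# Hypothetical three-instruction program: scale, offset, clamp at zero.
program = [lambda x: x * 2, lambda x: x + 1, lambda x: max(x, 0)]

# Hypothetical samples, one per channel.
result = run_program(program, [3, -4, 0, 10])
print(result)
```

In this emulation the number of instructions issued depends only on the length of the program, not on the number of channels, which is the property a Bulk DSP would exploit in hardware.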

Chapter 2

Technologies of Interest

Of the many communication and digital signal processing applications, the three that will

be considered are ADSL (Asymmetric Digital Subscriber Line), MPEG (Moving

Picture Experts Group video compression), and GSM (Global System for Mobile

Communications). These technologies were chosen based on their significance to the

communications world as well as the similarities in their computation structure. They

each embody several key characteristics, which demonstrate the utility of a Bulk DSP as

their core processor.

2.1 ADSL

Asymmetric Digital Subscriber Line refers to a communication technology

implemented on the copper telephone lines found in all homes and businesses. ADSL is

an enhanced version of the basic phone line that provides much more data bandwidth to

the subscriber. It transfers voice, data and video at significantly higher data rates.

The main point of attraction that drives this technology is a faster Internet connection. A

salient detail of this technology lies in the data rate transmitted. Much more bandwidth is

sent downstream (as from the Internet to one's computer) than in the reverse direction;

thus the name Asymmetric DSL. In fact, ADSL communication can theoretically reach

data rates of 8 Mbps (megabits per second) downstream and 1 Mbps upstream. ADSL

communication exists on the same twisted-pair wire as the telephone line and transmits

simultaneously with existing phone services without interruption or interference.

Because of this, no new phone lines need to be installed to implement ADSL, making it

an attractive choice of communication. Once companies start deploying ADSL,

customers need only to acquire an ADSL modem for their computers and to subscribe for

ADSL service in order to start using ADSL.

The ADSL network consists of the central office that contains the ADSL Modem

Rack, the phone line connection, a POTS (Plain Old Telephone System) splitter, and the

user-end ADSL modems. Figure 2-1 shows the details of the core network and how it

interfaces with the actual phone lines and other types of networks [2].

As shown by the diagram, the central office in the network manages all the

communication via phone lines to and from personal computers and corporate networks.

When a call comes into the central office, it is first passed through a POTS splitter, which

"splits" the call into voice and data signals and then directs them to the appropriate

device. A voice call will proceed to the Public Switched Telephone Network while a data

call will go to the ADSL Modem Rack. The Modem Rack consists of many line cards or

ATU-Cs (ADSL Transceiver Unit, Central Office) and is the key device of interest since

it uses digital signal processors. The ATU-C receives data from the access module and

converts the data into analog signals. It also receives and decodes data from customers

sent by the ATU-R (remote or end-user modem). Presently, each ATU-C can

accommodate up to 3 ADSL circuits, which means it can serve up to 3 individual phone

lines. These ADSL circuits are implemented with several integrated circuits such as a core DSP to perform

Discrete Multi-Tone (DMT) technology functions, a line driver/receiver, a general

purpose DSP, and an ASIC to perform all the analog and mixed-signal operations as well

as the modem configuration software.

Figure 2-1: ADSL Network Connection (diagram labels: phone or fax, PC or network computer, ADSL modem, POTS splitter, central office with public switch, POTS splitter, and ADSL modem rack, Internet backbone, WWW or video server, Ethernet, Frame Relay, ATM)

The ADSL technology rests largely in its transmission methodology. As

mentioned, ADSL transmits far more information downstream than it does upstream;

upstream and downstream refer to different "channels" which transmit information at

different frequencies. These channels are created by Discrete Multi-Tone (DMT). This

technique falls under the category of Frequency Division Multiplexing (FDM) and is a

multi-carrier modulation technology. Basically, it takes a band of frequencies (input

data) and divides it into separate "channels" so that the channels have the same bandwidth

but different center frequencies. This allows the channels to be coded individually and

independently from each other. The DMT transmitter relies on the efficiency of the IFFT

(Inverse Fast Fourier Transform) to create these channels while the receiver uses the FFT

(Fast Fourier Transform) to do the complementary operation. This transform pair represents

a key digital signal processing technique used by the ADSL technology. DMT uses the

band from 26KHz to 134KHz for the upstream channel and 138KHz to 1.1MHz for

downstream. DMT reserves a 4KHz band (0 to 4KHz) for POTS to accommodate the

ordinary phone line on the same copper wire [14]. ADSL modems also implement error

correction algorithms in order to reduce errors that occur on a network line such as

impulse noise or continuous noise coupled into a line. These operations, performed by

DSPs, combine the channels into blocks and use error correction codes on each block of

data. This method allows for effective and accurate transmission of both data and video

signals on the wire.
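
The transform pair at the heart of DMT can be illustrated with a small sketch. Several simplifications are assumed here: a naive O(N^2) discrete Fourier transform stands in for the FFT and IFFT, the eight subchannels and their four-point symbol values are hypothetical, and the step that makes the transmitted signal real-valued is omitted. The sketch only shows that placing one symbol on each subchannel, applying the inverse transform at the transmitter, and applying the forward transform at the receiver recovers each subchannel independently.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (stand-in for the FFT/IFFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = 0j
        for t, v in enumerate(values):
            total += v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(total / n if inverse else total)
    return out


# Hypothetical data: one complex symbol per subchannel.
symbols = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j, 1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]

time_signal = dft(symbols, inverse=True)  # transmitter side (IFFT role)
recovered = dft(time_signal)              # receiver side (FFT role)

max_error = max(abs(a - b) for a, b in zip(symbols, recovered))
print(max_error)
```

Because each recovered value depends only on its own subchannel's symbol, the subchannels can be coded individually, as described above.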

The ADSL network implements a high-speed transmission technology on normal

copper phone lines. Because it uses the phone lines, it does not require much equipment

from the customer and is easy and inexpensive to use. In addition, it meets or exceeds

customer requirements with respect to Internet access. Examination of the ADSL

network reveals the importance of DSPs to the basic functionality of this technology. In

fact, DSPs coordinate and perform the main computations in the transmission technology.

These DSPs are found on the ATU-C located at the central office in the ADSL network.

Because the ATU-C contains up to 3 ADSL circuits, and in turn each ADSL circuit has at

least 2 DSPs, the minimum number of DSPs on each ATU-C is 6. This corresponds to 2

DSPs for each phone line. Presently 560 million copper phone lines exist worldwide.

Therefore, if each of these phone lines subscribed to ADSL, over a billion DSPs would

be needed in this network alone!
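
The arithmetic behind this estimate can be checked directly. The sketch below uses only the figures quoted in the text; the constant names are chosen for the example.

```python
ADSL_CIRCUITS_PER_ATU_C = 3           # each ATU-C holds up to 3 ADSL circuits
DSPS_PER_CIRCUIT = 2                  # at least 2 DSPs per ADSL circuit
COPPER_LINES_WORLDWIDE = 560_000_000  # figure quoted in the text

# Each ADSL circuit serves one phone line, so DSPs per line equals DSPs per circuit.
dsps_per_atu_c = ADSL_CIRCUITS_PER_ATU_C * DSPS_PER_CIRCUIT
dsps_needed = COPPER_LINES_WORLDWIDE * DSPS_PER_CIRCUIT

print(dsps_per_atu_c)  # minimum DSPs on one ATU-C
print(dsps_needed)     # DSPs needed if every line subscribed
```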

2.2 MPEG

Although this next application is not the same type of communication technology as the

other two studied, it exhibits some of the same characteristics, such as the need for many

repetitive DSP computations. This is the MPEG (Moving Picture Experts Group)

standard, which describes a compression technology for video. MPEG compresses video

data into a smaller format so that more information can fit on a storage disk or more data

can transfer across a network. The compression ratio achieved with MPEG ranges from

30:1 to 8:1, depending on the complexity of the video. One of the most popular

applications today that employs MPEG compression is DVD (Digital Versatile Disk).

This technology stores video information on DVDs similar to VHS video tapes and is

played on special DVD players (like VCRs). Each disk can hold up to 17 Gigabytes of

information. That's a lot of data! A second, very popular application that employs

MPEG compression is video conferencing. This application relies heavily on the

transmission of data across different types of networks. Thus, using the MPEG algorithm

to compress data enables video conferencing applications to achieve real-time point-to-

point communication. With MPEG compression, other technologies such as High

Definition Television (HDTV) can be transmitted at 24 frames per second while movies

and live broadcasts are transmitted at 30 frames per second in order to produce high-resolution

pictures. An update to the standard, MPEG-2, adds the functionality to transmit high-quality

broadcast video. The main difference between the standards is the data rate at

which they can transmit video sequences. The MPEG-1 standard targets a data rate of

1.5Mbps, which transmits over most transmission links that support MPEG format,

namely the Internet, cable networks, and ADSL networks; MPEG-2 transmits at a data

rate of 4-8Mbps. MPEG-2 supports a broader range of applications including digital TV

and coding of interlaced video, retaining all of the MPEG-1 syntax and functionality.

The MPEG compression algorithm depends largely on motion compensation and

estimation. A block diagram of the algorithm is shown in Figure 2-2 [2]. It first takes

low resolution video and converts the images to YUV space. In this domain, the U and V

(color) components can be compressed to a greater degree than the Y component without

affecting the picture quality. Video pictures characteristically do not contain much

movement, and in many cases the movement can be predicted if done in an

intelligent manner. MPEG compression does this prediction by estimation and

interpolative algorithms. Specifically, these techniques perform inter-frame coding

which means motion is predicted from frame to frame in the temporal direction. The

MPEG video stream consists of three types of frames. These frames are defined based on

whether their spatial or temporal redundancy is eliminated. They are also grouped

together to form GOPs (Groups of Pictures) or the MPEG bit stream. The I (Intra-coded)

frames are coded by eliminating spatial redundancy using a technique derived from JPEG

compression and serve as a reference point for the sequence. I frames originate as a

sequence of raw images and are then split into three 8x8 blocks of pixels (one block for

luminance and the other two for chrominance). These blocks then pass through a

Discrete Cosine Transform (DCT), are quantized, and finally proceed through an

entropy encoder which transforms the images into an MPEG bit stream (see Figure 2-2).

The second type of frame is the P (Predictive-coded) frame, which is coded using motion

estimation and depends on the preceding I or P frame. In addition to the motion

estimation and compensation operations, P frames require a DCT computation as well.

The DCTs performed on I and P frames serve to eliminate the spatial redundancy found

in these frames. Finally, the B (Bidirectionally predictive-coded) frames are predicted

based on the two closest P or I frames and are the smallest frames in the sequence.

Although this type of coding exploits similarities with future images and reduces

temporal redundancy, it still introduces a large delay in the overall algorithm [5].

Figure 2-2: MPEG Encoding Algorithm (block labels: low-resolution input, DCT, Q, compressed data, IQ, IDCT, filter, M.C., motion estimation)
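
The transform and quantization steps applied to each 8x8 block can be sketched as follows. This is an illustration under stated assumptions rather than the MPEG procedure itself: it uses a naive orthonormal two-dimensional DCT-II, a single uniform quantizer step of 16 in place of a full quantization matrix, and a hypothetical flat block in which every pixel has the value 128. The names `dct_2d` and `quantize` are chosen for the example.

```python
import math

N = 8  # block size used for I and P frame blocks


def dct_2d(block):
    """Naive orthonormal 2-D DCT-II of an N x N block."""
    def scale(u):
        return math.sqrt(1 / N) if u == 0 else math.sqrt(2 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantizer (assumption: one step size for all coefficients)."""
    return [[round(c / step) for c in row] for row in coeffs]


# Hypothetical flat block: no spatial detail at all.
flat_block = [[128] * N for _ in range(N)]
coeffs = dct_2d(flat_block)
levels = quantize(coeffs, step=16)
nonzero = sum(1 for row in levels for value in row if value != 0)
print(levels[0][0], nonzero)
```

For a block with no spatial detail, all of the energy lands in the single top-left coefficient and every other quantized value is zero, which is the sense in which the DCT removes spatial redundancy before entropy coding.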

Because images are not compressed as a single frame, an MPEG bit stream

usually consists of thousands of blocks, which together represent a single image. In essence

these blocks are just smaller images and are encoded as described above using the MPEG

algorithm. Consequently, the process to compress an image is a highly repetitive one

since the same operations are executed on each block. In addition, these computations

are independent of each other and require digital signal processing. Specifically, DSPs

are used to compute DCTs as well as the other signal processing operations required by

the MPEG algorithm. A Bulk DSP implementation for this algorithm could reduce the

overall compression time by performing many of these operations in parallel.

2.3 GSM

Mobile communications technology has undergone a major change in the last several

years. The mobile or cellular world has moved from the analog to the digital domain.

Previously, cellular phones used a strictly analog protocol to transmit signals. However,

with the increased number of cellular users, the push for faster data rates, and the need for

better service, cellular technology has moved into the digital realm. GSM (Global

System for Mobile Communications), a digital cellular radio network, relies on digital

cellular technology. It has been widely used in Europe for several years and is gaining

popularity in the US. GSM implements a Personal Communication System (PCS), which

delivers more than just a wireless phone service. PCS incorporates the transfer of calls,

voice mail, and other data transfers anywhere, anytime. In fact, each GSM phone has a

personal identifier, which is unique to the phone and identifies itself on the GSM network

from any location. PCS also includes the ability to connect a GSM phone to a laptop

or other computer in order to send and receive faxes and email or to connect to the Internet. GSM

has stepped to the forefront in mobile communications and provides its services in over

200 countries worldwide [9].

The GSM network architecture consists of three main functional entities that

interface with each other to provide end-to-end communication. The subsystems are the

Base Station Subsystem, the Network and Switching Subsystem, and the Operation and

Management Subsystem (see Figure 2-3).

Figure 2-3: GSM Network Architecture (diagram labels: BSS containing BTSs and BSCs, NSS, ISDN, OMCs, NMC)

GSM subscribers connect to the GSM network via a radio link from their phone

(the Mobile Station) to the Base Station Subsystem (BSS). The BSS is actually

composed of multiple Base Transceiver Stations (BTSs) and a Base Station Controller

(BSC). The BTS includes all the transmission and reception equipment such as the

antennas and transceivers in order to conduct the radio protocols and signal processing

over the radio link. The BSC controls the set of BTSs in its service area and controls

radio-channel setup, frequency administration, and handovers for each call. In addition,

the multiplexing of speech data is performed by the Transcoder Rate Adaptation Unit

(TRAU) which is located at either the BTS, BSC, or the MSC (Mobile Service Switching

Center) depending on the configuration. The BSS interfaces with the Network

and Switching Subsystem; specifically, the Base Station Controller connects to the main

component of the Network and Switching Subsystem, the MSC. The MSC manages

communication with other fixed telecommunication networks such as ISDN (Integrated

Services Digital Network) and PSTN (Public Switched Telephone Network), and it also

performs paging, resource allocation, location registration, authentication, and encryption

functionality required to service a mobile subscriber. Finally, all the equipment in the

BSC and the Switching System connect to the Operational and Maintenance Center

(OMC) which includes the operation and maintenance of GSM equipment and support of

the operator network interface. The OMC performs mostly administrative functions such

as billing within a region. Depending on the size of the network, there may only be one

OMC in a country in which case the OMC is responsible for the network administration

in the entire country [13].

GSM's technology allocates a range of frequencies to a GSM system and divides

that band of frequencies into individual simultaneous data channels. Each GSM system

has a bandwidth of 25MHz that allows for 124 carriers with a bandwidth of 200KHz

each. There are 8 users per carrier and as a result approximately 1000 total speech or

data channels. The maximum speech rate on the channel (known as full-rate speech) is

13Kbps (Kilobits/sec) and the maximum data rate is 9.6Kbps.
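
The channel count follows from the figures above, as the short calculation below shows. The constant names are chosen for the example. Note that 25 MHz divided by 200 KHz gives 125 slots while the text states 124 carriers; the text does not say why one slot goes unused, so the sketch simply takes the stated figure of 124.

```python
SYSTEM_BANDWIDTH_HZ = 25_000_000  # 25 MHz allocated to a GSM system
CARRIER_BANDWIDTH_HZ = 200_000    # 200 KHz per carrier
CARRIERS = 124                    # carrier count as stated in the text
USERS_PER_CARRIER = 8

# Raw number of 200 KHz slots that fit in the allocated band.
slots_in_band = SYSTEM_BANDWIDTH_HZ // CARRIER_BANDWIDTH_HZ

# Simultaneous speech or data channels, using the stated carrier count.
total_channels = CARRIERS * USERS_PER_CARRIER

print(slots_in_band, total_channels)
```

The result, 992 channels, is the "approximately 1000" quoted above.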

GSM's main purpose is to transmit information (either speech or data) reliably in

wireless form from one location to another. The following explanation will describe the

full-rate speech transmission in order to highlight the main details of GSM. The process

begins when the mobile station (the GSM phone) receives an audio signal (speech)

through a microphone. This signal must first be converted from an analog to a digital

signal before processing begins. This occurs by first filtering the signal so that it only

contains frequency components below 4KHz. This frequency characterizes baseband

voice signals and is the minimum bandwidth necessary to accurately recognize a voice.

Once filtered, the signal is sampled at a rate of 8000 samples per second (8KHz), which

corresponds to the minimum sampling frequency needed in order not to lose any

information. As the signal is sampled, it is quantized into 13-bit words. Thus, the

output of this analog to digital converter is a bit stream of 104Kbps (13 x 8000) which

then becomes the input to the GSM speech codec. The speech codec's job is to reduce

this data rate to a size more appropriate for radio transmission. In essence, it removes all

the redundant information in the data stream. The codec uses the Linear Predictive

Coding (LPC) and Regular Pulse Excitation (RPE) algorithms to perform this function

and executes at a bit rate of 6.5Kbps. GSM's codec collects segments of the data stream

every 20ms and produces speech frames of 260 bits every 20ms. This corresponds to a

speech rate of 13Kbps. From there, the data is transmitted via the radio link to the Base

Transceiver Station. The next step in the process occurs at the Base Station where the

BTS receives the signal and proceeds to extract the signal and recover the modulation.

The signal gets directed and further transmitted by the Base Station Controller to the

MSC where the GSM transcoder (speech encoder and decoder) converts the GSM

formatted encoding into either a speech format for the PSTN or to 13Kbps data for GSM

mobile station functions.
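
The data rates quoted for full-rate speech can be verified with a short calculation. All figures come from the description above; the constant names and the derived quantities (samples per frame and the resulting reduction ratio) are computed here for illustration.

```python
SAMPLE_RATE_HZ = 8000   # samples per second after the 4 KHz filter
BITS_PER_SAMPLE = 13    # each sample is a 13-bit word
FRAME_MS = 20           # the codec emits one speech frame every 20 ms
FRAME_BITS = 260        # bits per encoded speech frame

# Bit rate entering the speech codec from the analog to digital converter.
input_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE

# Raw samples and bits covered by one 20 ms frame.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000
input_bits_per_frame = samples_per_frame * BITS_PER_SAMPLE

# Bit rate leaving the codec, and the resulting reduction.
output_rate_bps = FRAME_BITS * 1000 // FRAME_MS
reduction_ratio = input_bits_per_frame / FRAME_BITS

print(input_rate_bps, samples_per_frame, output_rate_bps, reduction_ratio)
```

Each 20 ms frame therefore carries 160 samples, or 2080 input bits, which the codec reduces to 260 bits: a factor of eight between the 104 Kbps input and the 13 Kbps output.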

The essential part of the GSM technology depends on digital signal processing to

encode and decode bits of information into the GSM format. Specifically, DSPs

perform the speech, modem, and channel coding, as well as decoding

operations. The DSP operation of interest computes the encoding and decoding part of

the GSM algorithm and is performed by the GSM transcoder or codec. There are many

of these DSPs in the network located in the mobile station as well as in the BTS, BSC, or

MSC depending on the network configuration. The BSC and MSC can be considered

central processing locations in the GSM infrastructure since most of the phone calls are

routed through these units. Effectively the transcoders here encode and decode

individual phone calls where the likely configuration is one transcoder per channel, thus

operating on one channel at a time. As with the ADSL network, GSM uses DSPs to

perform key processing operations on each channel of communication. So again there is

a central place in the network where "bulk processing" occurs and at which repetitive

computations are executed among its processors.

The ADSL, MPEG, and GSM technologies share similarities at two different

levels; first in their governing algorithms and second, in their system architectures. At

the algorithm level, they each require a lot of digital signal processing which

characterizes most of their computations. As stated, DSP operations tend to be composed


of a relatively small set of instructions, which are executed repetitively. Thus, each DSP

operation in the algorithm can be treated as a separate computational module. If the

algorithm is subdivided into these modules then the steps of the algorithm are easily

identified and instruction-level parallelism (ILP) results. GSM, ADSL, and MPEG

exemplify this level of parallelism in their algorithms. The second similarity exists at the

system level and also exemplifies a type of ILP. Each technology operates on multiple

independent data streams in parallel; thus, there is an inherent repetition among the

computations performed. The system architectures of each technology dedicate multiple

processors to work on the data even though they are all essentially doing the same thing.

Therefore the utility of a Bulk DSP is evident. This multiprocessor could take advantage

of this inherent repetitive scheme by increasing computing power and thereby exceeding

the performance of modern day microprocessors. Indeed, one would assume that if the

DSP were designed with "N" simple processors, the improvement in performance would

equal that of "N" present day DSP processors. However, if the Bulk DSP were designed

to optimize a particular algorithm, one could imagine exceeding a factor of "N"

improvement in performance with the use of "N" processors. Therefore, a single

application has been chosen and a Bulk DSP designed in an effort to achieve this type of

improvement.


Chapter 3

The Bulk DSP Architecture

3.1 Parallel Computing

Parallel processing describes a style of computation suited to applications

which exhibit some type of parallel algorithmic behavior. These applications usually

consist of small computational modules or modules that are used throughout the

algorithm. Given that there is a set number of transistors available to design a parallel

multiprocessor, the question is how best to utilize these transistors to maximize

performance. In order to answer this question, performance-critical aspects of the

algorithm must be considered; they include the amount of data processed, the load

balance among the computational modules, the parallel structure, the distribution of data,

and the spatial and temporal access patterns to memory of the algorithm [10]. These

factors determine the design parameters of the multiprocessor such as the architecture of

the simple processing elements, the allocation of memory resources, the communication

protocol, and ultimately the number of simple processors. With the exception of the

communication protocol, all of these parameters were considered in this design process.

Bulk DSP is one instance of this type of processor architecture. Bulk DSP aims at

connecting many simple processors on a single chip instead of designing one large


complex processor. The gain in performance stems from using these simple processors in

a parallel structure. Bulk DSP differs from modern day microprocessors in that its basic

building block consists of simple hardware and a reduced instruction set. Unlike Intel's

Pentium processor, which incorporates branch prediction and multiple instructions per

clock cycle, Bulk DSP relies on a simple set of instructions using a RISC-like structure

[11]. Also, in contrast to the Pentium, the Bulk DSP does not include an extensive

memory hierarchy. The memory components of the Bulk DSP consist of a simple

instruction cache and data cache. The instruction cache does not have to be large due to

the small number of instructions utilized by the algorithm; the main part of memory is

dedicated to the data cache that serves as a buffer memory between the modules of

computation. This architecture also differs from another type of modern day processor

named IRAM (Intelligent RAM), which was designed at the University of California,

Berkeley [12]. This processor relies on the ability to place 1 billion transistors on a

single chip, made possible by advances in integrated circuit technology. In having such a

large transistor budget, IRAM is able to allocate a large portion of its transistors to

memory, specifically on-chip DRAM. Its main purpose is to diminish the gap between

microprocessor performance and the latency of main memory accesses. Although the

Bulk DSP would also rely on being able to integrate a large number of transistors on a

chip, the Bulk DSP allocates these resources to computing or processing power rather

than memory. Instead of dedicating 80% of the transistor budget to memory as IRAM

does, the Bulk DSP might dedicate this percentage to processors. Bulk DSP applications

require a lot of arithmetic computations and thus would benefit from more processing

power than memory. Both of these processors, the Pentium and IRAM, are beneficial for


certain classes of applications. The Pentium is designed for general-purpose applications

that don't necessarily exemplify a specific type of algorithmic structure while the IRAM

targets memory-intensive applications such as database and multimedia programs. The

Bulk DSP targets neither of these areas; instead, it aims at improving applications

which require a lot of parallel signal processing. Thus, in comparison, an architecture

such as the Bulk DSP would be expected to outperform a Pentium or

IRAM processor for this class of parallel applications. Additionally, the Bulk DSP does

better from a cost standpoint; the cost to have many simple processors on a chip is less

than the cost of a lot of DRAM or other specialized hardware characteristic of modern

day microprocessors.

3.2 Processing Elements

The architecture and organization of the Bulk DSP's simple processors model the

processing elements used in SIMD parallel processing. A SIMD multiprocessor usually

contains many simple processors called processing elements and a single control unit

with only one instruction and data memory resource. These processing elements are

characterized by their simplicity. Their main function is to execute the instructions given

to them by a control unit that distributes the instructions to all processors. The ILP

present in these programs implies that short instruction sequences will be carried out in

parallel. Because each simple processor only carries out the given instructions, it

contains minimal control logic. In fact, these simple processors have a RISC

architecture, which basically just fetches an instruction and data, executes the assigned

computation, outputs the new data, and fetches the next instruction in a continuous cycle.


Essentially, these simple processors contain only basic hardware components and do not

require a lot of complexity; therefore, they are inexpensive and easy to replicate on a

chip. The number of simple processors used in this Bulk DSP will be discussed in the

scheduling section.

3.3 Memory Hierarchy

The memory resources of the Bulk DSP play a large part in the design process. For this

multiprocessor, a single 2KB (kilobyte) cache for both instructions and data will be

allocated to each simple processor. The caches will be subdivided into 256 byte sections,

which can be designated to either instruction or data memory. The instruction size of the

computing module(s) in each processor determines the portion of the cache used for

instructions, and the number of input and output bytes for each module determines the

amount for data cache. Because the algorithm will be subdivided into computational

modules among the processors, data will need to propagate from one processor to

another. This means that each processor will have to both read data from and write to

other processor memories. This data movement can be set up in such a way that the

movement occurs in the "background." Consequently, the processors will not have to

wait for data and no cycles will be wasted on data movement. This concept will be

enabled by the buffer memories between the computing modules. For each processor,

there will essentially be four buffer memories: two on the input "side" and two on the

output "side." One buffer memory on each side will be dedicated to the current set of

data being processed; this will give the processor a place from which to read current data


and another place to write out current data. The other two memories associated with each

processor are for other processors to write to or read from while the processor associated

with those buffer memories is busy working on the current set of data.
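The buffer arrangement described above resembles what is commonly called double (or "ping-pong") buffering. The sketch below shows one side of a processor under that reading; the type and function names are hypothetical, and the 256-byte buffer size is an assumption borrowed from the cache section size rather than a figure given for the buffers themselves.

```c
#include <assert.h>
#include <string.h>

#define BUF_BYTES 256  /* assumed buffer size, one cache section */

/* One side (input or output) of a processor: two buffers that swap roles. */
typedef struct {
    unsigned char buf[2][BUF_BYTES];
    int active;  /* index of the buffer the owning processor works on */
} BufferPair;

/* Buffer the owning processor uses for the current set of data. */
unsigned char *active_buffer(BufferPair *p) { return p->buf[p->active]; }

/* Buffer a neighbouring processor fills or drains in the background. */
unsigned char *background_buffer(BufferPair *p) { return p->buf[1 - p->active]; }

/* A neighbouring processor deposits n bytes without disturbing the owner. */
void background_write(BufferPair *p, const unsigned char *src, size_t n)
{
    assert(n <= BUF_BYTES);
    memcpy(background_buffer(p), src, n);
}

/* At a frame boundary the two buffers exchange roles; no data is copied. */
void swap_buffers(BufferPair *p) { p->active = 1 - p->active; }
```

A processor would hold one such pair on its input side and one on its output side, giving the four buffer memories mentioned in the text. Because the swap only flips an index, the owning processor never waits for a copy.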

The remainder of the investigation explores how best to implement a

Bulk DSP. GSM serves as an excellent application for the Bulk DSP and will be the

main application of the designed processor. This is due to several reasons. First, the

Base Transceiver Station in the GSM network acts as a core processing node at which

many DSP computations take place. Second, a software library implementing the GSM

algorithm was found and proved useful for this investigation. Third, GSM is a popular

mobile cellular system, which has gained acceptance around the world thereby making it

a very relevant and useful technology to explore.

3.4 GSM Implementation

Because GSM is a cellular phone network, human speech encompasses the

majority of the information transmitted across the network. As mentioned, the speech

compression algorithm used in GSM is Regular Pulse Excitation with Long-Term

Prediction (RPE-LTP), specified in the GSM 06.10 standard [9]. A block diagram of the

encoding algorithm is shown in Figure 2-1. This algorithm is executed in the GSM codec

and serves as its primary functionality. The input frames to the codec consist of 160

signed 13-bit linear PCM values sampled at 8 kHz. They come from

either the audio part of the mobile station or from the PSTN. These frames last for 20ms,

and thus, cover about one glottal period of a very low voice or 10 periods for a very high


voice. Because this is a relatively short period of time, the speech wave does not change

much and thus the algorithm will not lose any information by dividing up the speech

signal as such. The encoder divides the input speech samples into a short-term

predictable part, a long-term predictable part, and the rest into the residual pulse. Then, it

encodes and quantizes the residual pulse and the parameters for the two predictors. The

decoder applies the long-term predictor to the residual pulse in order to reconstruct

the speech and then passes the filtered output through the short-term predictor [6].

[Block diagram. Legible labels: Input Signal [0..159], Pre-Process, Short-Term LPC Analysis, Short-Term Analysis, Long-Term (LTP) Analysis, RPE Grid; outputs: Log Area Ratios, LTP Parameters, RPE Parameters. Numbered signals: (1) Short Term Residual, (2) Long Term Residual, (3) Short Term Residual Estimate, (4) Reconstructed Short Term Residual, (5) Quantized Long Term Residual]

Figure 2-1: GSM Encoding Algorithm

3.5 Encoding and Decoding

The first step in the encoding algorithm consists of preprocessing the samples to produce

an offset-free signal and then passing them through a first-order preemphasis filter. The

resulting 160 samples are then analyzed to determine the coefficients of the short-term

analysis filter. This short-term linear-predictive filter (LPC analysis) is the first stage of

compression and the last stage of decompression. The speech compression in this

algorithm is achieved by modeling the human-speech system with two filters and an

initial excitation, of which LPC is the first filter. In this process, the short-term filter acts

as the human vocal and nasal tract such that, when excited by a mixture of glottal wave

and noise, it produces speech that is ideally similar to the speech being compressed. This

is done using the set of coefficients determined from the preprocessed signal and using

them as well as the 160 samples to produce a weighted sum of the previous output, which

is termed the short-term residual signal. In addition, the filter parameters, named the

reflection coefficients, are transformed into log-area-ratios (LARs) before transmission

since they will be used for the short-term synthesis filter in the decoder. The next stage

in processing involves the long-term analysis where the main computation is the long-term

prediction filter. Before filtering, the speech frame is subdivided into four 40-sample

blocks of the short-term residual signal. Also, the parameters of the long-term analysis

filter, the LTP lag which describes the source of the copy in time and the LTP gain, a

scaling factor, are estimated and updated in the LTP analysis block. Both of these

prediction parameters are calculated based on the current sub-block and the previous 120

reconstructed short-term residual samples. With these parameters, an estimate of the

short-term residual signal is found via the long-term prediction filter. Then, the last stage

of this section subtracts the estimated short-term residual signal from the actual short-term

signal to produce the long-term residual signal. With each 40-sample sub-block iteration,

56 bits of the GSM encoded frame are produced. The resulting 40 samples of the long-

term signal are then passed to the regular pulse excitation analysis for the primary


compression operation of the algorithm. Here, each sub-segment of the residual signal is

filtered by an FIR (Finite Impulse Response) filter and then down-sampled by a

factor of 3. This yields four candidate sequences of length 13. The sequence with the

most energy is chosen and the 13 samples are quantized by block adaptive PCM

(APCM) encoding. The result is passed on to the decoder along with a 2-bit grid selection.

Lastly, the encoder updates the reconstructed short-term residual in order to prepare the

next LTP analysis. In summary, the speech encoder compresses an input of 160

samples into an output frame of 260 bits every 20ms. Therefore, one can see that one

second of speech equals 1625 bytes and one megabyte of compressed data holds about 10

minutes of speech [6].
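The grid selection step described above (down-sampling by 3 and keeping the candidate with the most energy) can be sketched as follows. This is a simplified illustration under stated assumptions, not the reference implementation: it omits the FIR weighting filter and the APCM quantization, uses plain integer arithmetic instead of the fixed-point macros of the GSM library, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SUBBLOCK  40  /* samples per sub-block */
#define GRID_LEN  13  /* samples kept per candidate sequence */
#define DECIMATE  3   /* down-sampling factor */
#define NUM_GRIDS 4   /* candidate grid positions, encoded in 2 bits */

/* Picks the grid position (0..3) whose 13 decimated samples carry the most
 * energy and copies those samples into out. Ties go to the lower position. */
int select_rpe_grid(const int x[SUBBLOCK], int out[GRID_LEN])
{
    assert(x != NULL && out != NULL);

    long long best_energy = -1;
    int best = 0;

    for (int m = 0; m < NUM_GRIDS; m++) {
        long long energy = 0;
        for (int i = 0; i < GRID_LEN; i++) {
            long long v = x[m + DECIMATE * i];  /* highest index is 3 + 36 = 39 */
            energy += v * v;
        }
        if (energy > best_energy) {
            best_energy = energy;
            best = m;
        }
    }

    for (int i = 0; i < GRID_LEN; i++)
        out[i] = x[best + DECIMATE * i];
    return best;
}
```

The returned position is what the 2-bit grid selection would carry to the decoder, which uses it to place the 13 samples back into a 40-sample sub-block.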

The decoder mirrors many of the encoding computations. Decoding occurs when

a call is received from the PSTN or from the Mobile Station (the cellular phone) at the

Base Station. The decoding algorithm begins by multiplying the 13 3-bit samples by the

scaling factor and expanding them back to 40-sample sub-blocks. This residual signal

passes through the long-term predictor, which consists of a feedback loop similar to the

one in the encoder. The long-term synthesis filter takes 40 samples of the previously

estimated short-term signal, scales them by the LTP gain, and adds them to the incoming pulse.

This new short-term residual becomes part of the source for the next three predictions. In

addition, these samples are applied to the short-term synthesis filter, which uses the

reflection coefficients calculated by the LPC module. Finally, the de-emphasis filter

processes the samples whose output should resemble the original speech signal.


Chapter 4

Scheduling

In an effort to design a Bulk DSP that will optimize the performance of the GSM

algorithm, different organizations of the algorithm's computational modules were

arranged and considered for the building block of the multiprocessor. These architectures

differ based on the number of computing modules grouped together in a single processor

and the schedule in which the modules are executed by the control unit. Changing the

architecture based on these parameters allowed the designer to explore the parallel

structure already present in the GSM algorithm.

4.1 Design Methodology

At first, the "best architecture" for this Bulk DSP might seem to be a set of simple

processors with a fixed memory resource each assigned to process the entire decoding

algorithm. This architecture results in each simple processor working on an entire frame

at a time. The benefit from this solution is that each processor will continuously process

data. The only exception, or idle time induced, would occur the first time an instruction

is called within a frame; this results in some memory access time to fetch the instruction

into the cache. Due to the limited memory resource, the entire decoding program cannot


fit into the on-chip cache; hence the idle cycles while waiting for memory access. This

design represents the scheme where a factor of "N" improvement is achieved by simply

replicating "N" simple processors on chip. However, this architecture does not

take advantage of any instruction-level parallelism present in the algorithm, and

therefore, the idea is that a more efficient scheme using the same amount of hardware

exists. Thus, given that this Bulk DSP is composed of simple processors with 2KB of

memory each, what is the best organization of these resources? In order to best address

this question, careful study of each computational module is required. Accordingly, a

discussion of the computational modules will follow.

4.2 GSM's Computational Modules

The GSM algorithm consists of two main operations: encoding and decoding. Each of

these operations can be easily subdivided into a set of independent computing modules.

This modularity allows for the flexibility in organizing the simple processors. Here, we

will focus on the decoding part of the algorithm in the design of the Bulk DSP. Because

many of the modules are the same for decoding as they are for encoding, a similar

approach as the one taken here may also serve to design a multiprocessor for the encoder.

In order to subdivide the algorithm, specific functions within the overall computation

were identified (see Figure 4-1). Ten independent modules were distinguished differing

in instruction and data size. Due to the nature of the GSM algorithm, several of these

modules can be executed in parallel thus providing a means to optimize the architectural

organization of the algorithm. The number of instructions executed characterizes each

module. There are four important parameters that determine the above information for


each module. They are the numbers of Loads, Stores, Arithmetics, and Shifts encountered

in the instruction set of the module. A Load represents a processor reading from memory

(this could be either instructions or data). Specifically, a Load fetches two bytes of

information at a time. Stores represent the times the processor writes to memory.

[Block diagram. Legible labels: RPE Grid Position, Inverse APCM, APCM Quantization, LTP, Long-Term Synthesis, LAR, Decoding of LARs, LAR-to-RP Coefficient, Short-Term Synthesis, De-Emphasis, Signal [0..159]]

Figure 4-1: GSM Decoding Algorithm

Arithmetics are the actual computations executed by the processor, and Shifts

correspond to the computation of indexing a data array. Another aspect of these

modules is the presence of loops in their structure. As noted earlier, DSP computations

require many of the same operations repetitively which accounts for the large number of

loops found in these modules. An example of this is demonstrated in Figure 4-2, which

shows the code for one of the computing modules, LongTermSynthesisFiltering. An

example of the Load (L), Store (S), Arithmetic (A), and Shift accounting is also

demonstrated. So, to determine the number of instructions executed by this


computational module, the sum of Loads, Stores, Arithmetics, and Shifts was calculated

without regard to the loops. This information designates the number of bytes stored in

the instruction cache (I-cache). One assumption made here is that each instruction equals

4 bytes of memory. This is a typical number for most modern day instruction sets.

    void Gsm_Long_Term_Synthesis_Filtering (
        struct gsm_state * S,
        word Ncr,
        word bcr,
        register word * erp,
        register word * drp)
    {
        register longword ltmp;
        register int k;
        word brp, drpp, Nr;

        Nr = Ncr < 40 || Ncr > 120 ? S->nrp : Ncr;
        S->nrp = Nr;
        assert(Nr >= 40 && Nr <= 120);

        brp = gsm_QLB[ bcr ];
        assert(brp != MIN_WORD);

        for (k = 0; k <= 39; k++) {
            drpp = GSM_MULT_R( brp, drp[ k - Nr ] );
            drp[k] = GSM_ADD( erp[k], drpp );
        }

        for (k = 0; k <= 119; k++) drp[ -120 + k ] = drp[ -80 + k ];
    } //8L, 3S, 5A, 2 Shifts

Figure 4-2: LongTermSynthesisFiltering Module

The second calculation counts the same parameters, but this time including the loops. The

sum of these parameters represents the total number of

operations executed by the processor in a module. Operations process two bytes of data

at a time, thus processing 16-bit operands. The number of total operations will also be

used as the main factor to study the relative length of time the modules require to execute

all their operations. Lastly, one final calculation was done to determine the number of

data bytes that "enter and exit" each module. Basically, the input data and output data indicate


the number of bytes processed by the module in addition to revealing the data movement

to and from the modules (see Table 4.1).
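The static accounting described above can be sketched in a few lines. The code below is a minimal illustration, assuming the 4-bytes-per-instruction figure stated in the text and the 256-byte cache sections of Section 3.3; the struct and function names are hypothetical.

```c
#include <assert.h>

#define BYTES_PER_INSTRUCTION 4    /* assumption stated in the text */
#define CACHE_SECTION_BYTES   256  /* cache section size from Section 3.3 */

/* Tally of the four instruction classes counted for a piece of code. */
typedef struct {
    int loads;
    int stores;
    int arithmetics;
    int shifts;
} OpCounts;

/* Number of instructions represented by a tally. */
int total_count(OpCounts c)
{
    return c.loads + c.stores + c.arithmetics + c.shifts;
}

/* Instruction-cache bytes for a static (loop-free) instruction count. */
int icache_bytes(int static_instructions)
{
    assert(static_instructions >= 0);
    return static_instructions * BYTES_PER_INSTRUCTION;
}

/* Number of 256-byte cache sections needed to hold the given bytes. */
int cache_sections(int bytes)
{
    assert(bytes >= 0);
    return (bytes + CACHE_SECTION_BYTES - 1) / CACHE_SECTION_BYTES;
}
```

For example, the tally annotated in Figure 4-2 (8 Loads, 3 Stores, 5 Arithmetics, 2 Shifts) amounts to 18 instructions, and the 92 instructions listed for GsmLongTermSynthesisFiltering in Table 4.1 would occupy 368 bytes, or two cache sections.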

Computational Module                      In/Out (bytes)   Operations   Instructions
GsmAPCM-quantization xmaxc to-exp-mant    4/2              436          53
APCM inverse-quantization                 30/26            745          109
RPE-grid-positioning                      28/80            699          65
GsmLongTermSynthesisFiltering             328/82           3451         92
GsmLong_TermAdd                           320/320          920          16
Decoding-of the codedLogAreaRatios        16/16            334          334
Coefficients                              16/16            248          122
LARtorp                                   16/16            242          34
ShortTermSynthesisFiltering (k=13)        50/26            6680         90
ShortTermSynthesisFiltering (k=14)        54/28            7193         90
ShortTermSynthesisFiltering (k=120)       240/272          61571        90
Postprocessing                            42/26            4650         39

Table 4.1: Computational Module Parameters

4.3 Evaluation of the Architectures

There are two main aspects to the design of the multiprocessor: first, the organization, which

includes the number of simple processors and the assignment of computational modules

to each processor, and second, the schedule of the modules, which represents the order

and time the modules are called. For each organization, a schedule was constructed and

the time it would take for the organization to produce 1, 10, and 25 frames was

calculated. Frames represent segments of individual phone calls.

Several approaches were taken in order to reach the optimal design for the

building block of this multiprocessor. The first, perhaps an obvious choice, involved

assigning each module to its own processor and arranging them as a pipeline in the order


in which they are executed by the algorithm. Thus, the scheduler for this organization

simply called each module as the previous one completed. Because of the buffer

memories associated with each processor, data movement between modules essentially

occurred in the "background," and consequently, the calls to the modules could be made

as soon as the previous one finished. Again, the number used as the time to complete a

particular module was the total number of operations for that module. The next

modification was made after observing the instruction-level parallel structure present in

the algorithm. While maintaining the one or two modules per processor approach, the

scheduler was modified to call the modules in parallel wherever possible as shown by

Figure 4-3 where 'M' denotes the module.

[Diagram of modules M1 through M10. Legible groupings: M1-M2, M3, M4; M5, M9, M10; M6, M7, M8]

Figure 4-3: Preliminary Building Block of Bulk DSP

This change, of course, decreased the total computation time in comparison to the

previous organizations. These designs were fairly straightforward, though it was evident

that better performance could be achieved due to the large number of idle cycles incurred.

The next approach included looking more closely at the length of time it took each

processor to execute its module. Unfortunately, there exists great disparity among the

modules with respect to the computation time. As a result, more processors were


dedicated to modules with the largest number of computations. Many of these modules are

called four times each (four times per frame), so it was easy to assign four processors to

one module without having to break up the module. This also seemed to be a good idea

since the modules could be executed in parallel and as a result performance improved.

At this stage in the design, however, an important issue arose. How were these

architectures comparing to our best case scenario? There was basically no metric used to

see if the increase in performance was really significant or if it still lagged the one-

processor-per-frame scheme. Up until then, it was thought that the total time it took each

organization to compute 1, 10 or 25 frames could be used to compare the different

organizations. But this number does not account for the number of wasted cycles

incurred in each type of architecture. As stated, the idle time in the one-processor-per-

phone-call scenario derives from the limitation of the 2KB cache, which does not hold

the entire program for the algorithm. Here, the minimum number of memory accesses

each processor exercises equals the number of instructions in the entire algorithm. In

contrast, in the schemes which hold just a few computational modules, the 2KB cache is

sufficient to hold the instructions for the individual modules. However, in these

architectures, there is idle time in the latency between processors since the number of

operations varies greatly among the modules. In order to determine the time for memory

access, a few assumptions were made. First, each access to memory takes 20 cycles and

second, each access to memory fetches 4 instructions. Hence, the number of wasted

cycles due to memory access equates to the total number of instructions multiplied by 5.

The latency between modules was simply calculated from the scheduler. Due to the fact

that certain modules take longer than others, it was often the case where some processors


would have to wait a long time until they could start their computation because they were waiting

for input from another processor. This is shown in Figure 4-4, which demonstrates a

segment of the schedule for one of the proposed architectures. The numbered 'P's denote

the simple processors while the numbers represent the computation time for the module.
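The memory-access estimate described above can be written out directly. The sketch below assumes, as the text does, 20 cycles per memory access and 4 instructions fetched per access, and that each instruction is fetched once; the function names are hypothetical.

```c
#include <assert.h>

#define CYCLES_PER_ACCESS       20  /* assumption stated in the text */
#define INSTRUCTIONS_PER_ACCESS 4   /* assumption stated in the text */

/* Sum of per-module instruction counts. */
long sum_instructions(const int *counts, int n)
{
    assert(counts != 0 && n >= 0);
    long total = 0;
    for (int i = 0; i < n; i++)
        total += counts[i];
    return total;
}

/* Cycles spent waiting on memory when every instruction is fetched once:
 * 20 cycles per access / 4 instructions per access = 5 cycles each. */
long fetch_stall_cycles(long instruction_count)
{
    assert(instruction_count >= 0);
    return instruction_count * CYCLES_PER_ACCESS / INSTRUCTIONS_PER_ACCESS;
}
```

As a consistency check, summing the Instructions column of Table 4.1 with ShortTermSynthesisFiltering counted once gives 954 instructions, and 954 multiplied by 5 is 4770 cycles, which appears to match the idle-cycle figure reported for the best architecture in Table 4.2. Counting that module once is an assumption made here, not something the text states.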

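[Schedule diagram with rows for processors P1, P2, and P3 showing per-module computation times. Legible entries include P2: 61571, 7193, 6680, 6680 and P3: 920; the remaining entries are not recoverable]

Figure 4-4: Segment of Schedule for Proposed Architecture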

The idle time is denoted as the time in between processors. This turned out to be a major

problem in all of the architectures considered thus far. The number of wasted cycles due

to latency was calculated for each organization and compared to the idle time (memory

accesses) for the first pass "best architecture" to see if performance was better.

Unfortunately, the added hardware was not being used efficiently and the number of

wasted cycles outweighed any gain in performance. These results are summarized in

Table 4.2. Architecture 1 represents the building block shown in Figure 4-3 while

Architecture 2 represents the architecture from Figure 4-4, which includes a total of 4

simple processors. The performance of the best architecture is measured based on the

number of simple processors in each proposed scenario.


                     Frames Produced   Avg. Idle Cycles/Proc
Best Architecture    96                4770
Scenario 1           423               71526
Best Architecture    64                4770
Scenario 2           26                56495

Table 4.2: Comparison of Results with Best Architecture

Another approach was attempted, but this time careful attention was paid to the

number of instructions and data for each module. The goal was to pack as many modules

into one processor as would fit into the 2KB cache in an effort to keep each processor

busy all the time, thus reducing idle time. The previous considerations such as

computation time per module were also considered. It was noted that grouping modules

together which did not need to execute orthogonally to each other worked better.

Moreover, modules that shared no data dependency such as Postprocessing and the

LongTermSynthesisFiltering were grouped together. This reduced the idle time for

each processor. It soon became apparent, though, that no organization could eliminate

the large number of idle cycles in these architectures. The problem stems from the

disparity of one particular module's computation time (ShortTermSynthesisFiltering)

in comparison to all the others (see Table 4.1). This module was broken into four sub-

modules based on the four times it is called in the frame. Three of these iterations had

computational lengths on the same order; however, the last sub-module is about ten times

longer than the first two. This was the major source of processor idle time since

essentially all the other processors had to wait until this computation was done before the

next frame could be processed. A completely new approach was needed in order to

exceed the performance of the first proposed architecture.
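The effect of one long module on the others can be illustrated with a simple model. The sketch below assumes one processor per stage, one operation per cycle, and a steady state in which every stage waits for the slowest one before the next frame starts; these assumptions and the function names are illustrative and are not taken from the scheduler used in this investigation.

```c
#include <assert.h>

/* Operation count of the slowest stage, which sets the frame period. */
long bottleneck_ops(const long *ops, int n)
{
    assert(ops != 0 && n > 0);
    long max = ops[0];
    for (int i = 1; i < n; i++)
        if (ops[i] > max)
            max = ops[i];
    return max;
}

/* Total cycles per frame that stages spend waiting for the slowest stage. */
long idle_cycles_per_frame(const long *ops, int n)
{
    long max = bottleneck_ops(ops, n);
    long idle = 0;
    for (int i = 0; i < n; i++)
        idle += max - ops[i];
    return idle;
}
```

Using the three ShortTermSynthesisFiltering entries of Table 4.1 (6680, 7193, and 61571 operations) as three stages, the two shorter stages would together sit idle for 109269 cycles per frame under this model, which shows how a single long module dominates the idle time.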


The final iteration necessary to produce a better, more efficient architecture than

the original one entailed completely re-thinking the approach to the problem. This time,

the design was done without a fixed cache size for the processors. So instead of trying to

group the modules together based on their instruction sizes, the modules were grouped

sequentially, and the algorithm was divided up evenly based on the number of total

operations. Four processors were chosen as the number of processors needed to perform

the complete decoding algorithm. This number was selected based on the results of the

previous iterations and the number of buffer memories needed in this new scheme.

Essentially, more buffer memories were required because the longest module,

ShortTermSynthesisFiltering, was broken into four parts and, thus, information

regarding the state within a loop needed to pass from one processor to the next. In

essence, the four processors executed approximately the same number of operations and

required a total of 12KB of cache altogether. Figure 4-5 shows a picture of the final Bulk

DSP including all the processing elements, memory and control units.
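Dividing a sequential list of modules evenly by total operations can be sketched as follows. This is a minimal illustration of the idea, not the procedure actually used to derive the final architecture; the function names and the example numbers are hypothetical. A processor is allowed to start part-way through a module, which corresponds to a long module being split across processors.

```c
#include <assert.h>

/* Total operations across n sequential modules. */
long total_ops(const long *ops, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += ops[i];
    return total;
}

/* Finds where processor p (0-based) of num_procs starts when the work is
 * split into equal shares by operation count. On return, *module is the
 * index of the module containing the start point and *offset is the number
 * of that module's operations already assigned to earlier processors. */
void partition_start(const long *ops, int n, int num_procs, int p,
                     int *module, long *offset)
{
    assert(n > 0 && num_procs > 0 && p >= 0 && p < num_procs);

    long start = total_ops(ops, n) * p / num_procs;
    int m = 0;
    while (m < n - 1 && start >= ops[m]) {
        start -= ops[m];
        m++;
    }
    *module = m;
    *offset = start;
}
```

With three modules of 100, 100, and 600 operations and four processors, the first processor takes the two short modules and the remaining three each take a 200-operation slice of the long module. State that lives across the split, such as loop variables, would have to be passed along through the buffer memories, as the text notes.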

[Block diagram showing processors P1 through P4, Memory, and Control Logic. Module assignment: P1: M1-M8, M9; P2: M9; P3: M9; P4: M9-M10]

Figure 4-5: Final Bulk DSP Architecture


In order to accurately compare this architecture to the original one proposed it was

necessary to modify the first one to equate the hardware resources allocated to each.

Thus, 3KB of cache was allotted to each processor in the original architecture, so that

evaluating the performance of four of these processors would accurately compare to the

performance of four processors in the final architecture. Interestingly enough though,

adding 1 KB of cache to the first architecture does not really affect its number of idle

cycles, since 3KB of memory is not enough to store the entire program. However, the

present cache scheme assumed for this architecture is simply a least recently used (LRU)

method, which evicts the least recently used data in the cache when it needs to store

new data, and is not the most efficient way to utilize a cache. So in order to significantly

reduce this idle time, a new cache scheme would need to be implemented. One can

imagine reducing this number by two if half of the instruction set was four processors.

Because the percentage of idle time in the original architecture represents such a small

part (4%) of the overall computation time, it is difficult to design an architecture with a

significant increase in performance over the original architecture.
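
The observation that a slightly larger cache leaves the idle cycles unchanged can be illustrated with a small LRU simulation. The sketch below is not the model used in this investigation; it assumes a hypothetical program that loops over five instruction blocks, and the block counts and capacities are chosen only for illustration. When the loop does not fit in the cache, LRU evicts each block just before it is needed again, so every access misses regardless of how close the capacity is to the program size.

```python
from collections import OrderedDict


def lru_misses(accesses, capacity):
    """Count cache misses for a sequence of block addresses under LRU replacement."""
    cache = OrderedDict()
    misses = 0
    for block in accesses:
        if block in cache:
            cache.move_to_end(block)  # mark block as most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used block
            cache[block] = True
    return misses


# Hypothetical program: a loop over 5 instruction blocks, executed 100 times.
PROGRAM_BLOCKS = 5
loop = list(range(PROGRAM_BLOCKS)) * 100

misses_2 = lru_misses(loop, capacity=2)  # cache much smaller than the program
misses_3 = lru_misses(loop, capacity=3)  # one extra block of cache
misses_5 = lru_misses(loop, capacity=5)  # cache holds the entire program

print(misses_2, misses_3, misses_5)
```

Under these assumptions the two undersized caches miss on every access, while the cache that holds the whole loop misses only on the first pass. A different replacement policy, rather than a marginally larger cache, would be needed to change this behavior.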

                     Frames Produced   Avg. Idle Cycles/Proc
Best Architecture    64                4770
Final Architecture   70                2.5

Table 4.3: Final Results

Chapter 5

Conclusion

In this chapter, a summary of the work completed will be presented as well as several

suggestions for future work. Lastly, some final thoughts on this investigation will

conclude the thesis.

5.1 Summary

Recognizing the improvements achieved with parallel processing technologies motivated

the idea for a Bulk DSP. This multiprocessor is intended for technologies which

demonstrate repetition at two levels: system and algorithmic. The former is characteristic

of communication technologies which generally entail some type of core processing node

in their networks, and the latter of DSP technologies which exhibit repetitive instructions

in their algorithms. This investigation aimed to research several technologies that

demonstrate these characteristics and applying the concepts of parallel processing to

design a multiprocessor implementation. As a result, a specific technology was chosen,

GSM, and a design methodology carried out. This process was described in order to

present some of the architectures designed and analyzed for this investigation. Finally,

the most prominent architecture was deemed the best implementation of a GSM Bulk

DSP based on the increase in performance it would bring to the existing technology.

5.2 Future Work

The final architecture proposed only succeeded in achieving marginal improvements over

the N' DSP factor. Thus, there might yet be another way to realize a larger improvement

in performance. As stated, the limiting factor in utilizing the presence of multiple

processors was the modules' disparity in computation length. If one modified the

implementation of the decoding algorithm to reduce this disparity, the added processors

might be used more efficiently. Specifically, the long sub-module of the

ShortTermSynthesisFiltering could be divided into smaller, equal portions, which

match the length of the other sub-modules. This would eliminate the wasted cycles due

to latency between the processors in the organizations which grouped only a few modules

together in an effort to take advantage of the instruction-level parallelism present in the

algorithm.
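
The effect of module length disparity can be sketched with a simple pipeline model. The code below is only an illustration under stated assumptions: each processor is treated as one stage of a linear pipeline, frames flow through the stages in order, and the cycle counts are hypothetical values, not measurements from this investigation. In such a pipeline the longest stage sets the rate at which frames complete, so the shorter stages wait.

```python
def pipeline_stats(stage_cycles, frames):
    """Return (total_cycles, utilization) for a linear pipeline with one stage per processor.

    The first frame needs the sum of all stage times; every later frame
    completes one bottleneck interval after the previous one.
    """
    bottleneck = max(stage_cycles)
    total = sum(stage_cycles) + (frames - 1) * bottleneck
    busy = frames * sum(stage_cycles)  # cycles spent on useful work
    utilization = busy / (len(stage_cycles) * total)
    return total, utilization


FRAMES = 100

# Hypothetical cycle counts: one long module dominates the other three.
unbalanced = [100, 100, 400, 100]
# Same total work, with the long module divided into four equal portions.
balanced = [100] * 7

total_before, util_before = pipeline_stats(unbalanced, FRAMES)
total_after, util_after = pipeline_stats(balanced, FRAMES)

print(total_before, round(util_before, 3))
print(total_after, round(util_after, 3))
```

With these assumed numbers, the unbalanced pipeline keeps its processors busy less than half the time, while the balanced one approaches full utilization. The model ignores communication and buffer costs between processors, which a real design would have to account for.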

A second way to improve the design might be to consider incorporating more of

the overall GSM algorithm that gets executed at the Base Station into the Bulk DSP.

This idea is based on the fact that the encoding and decoding instruction set is not

very large and, therefore, does not require a lot of computing power. If more functionality

were required of the Bulk DSP, its parallel structure and added processing power might

be efficiently used to achieve greater performance. The channel encoding/decoding, as

well as error protection and encryption of the radio channel are all parts of the GSM

system that occur at the Base Station and would be likely candidates to incorporate into

the Bulk DSP.

Finally, the majority of the work done in this investigation occurred at a

theoretical level, based on "paper calculations." Although the results gained from these

calculations are extremely revealing and necessary, there is another level of investigation

needed before the multiprocessor can actually be implemented in hardware. The next

step in design is to simulate the proposed organizations at both the algorithmic level and

the hardware level. The software analysis can be done by taking the best architectures

proposed and using a GSM decoding library to simulate these configurations. Actual

speech or GSM encoded data can be used as test vectors to see how long it takes each

architecture to process a given number of frames or data inputs. These results should

confirm which architecture is indeed the best to implement. Finally, the hardware should

be specified, which consists of designing and figuring out the exact hardware components

needed to execute GSM's decoding algorithm along with what control logic and exact

memory hierarchy will be used. There are several well-known hardware description languages that can aid in

this part of the design, such as VHDL or Verilog. These languages provide a way to

implement the architecture at the circuit level as well as a method to simulate and verify

functionality.
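
The software analysis described above could begin with a timing harness along the following lines. This is a minimal sketch, not part of the work completed: `placeholder_decode` is a hypothetical stand-in for a real GSM decoding routine, and the test vector is synthetic. The only fixed facts used are that a GSM 06.10 full-rate frame occupies 33 bytes and decodes to 160 samples.

```python
import time

FRAME_BYTES = 33         # size of one GSM 06.10 full-rate frame
SAMPLES_PER_FRAME = 160  # samples produced by decoding one frame


def split_frames(data):
    """Split encoded data into whole frames, dropping any trailing partial frame."""
    count = len(data) // FRAME_BYTES
    return [data[i * FRAME_BYTES:(i + 1) * FRAME_BYTES] for i in range(count)]


def placeholder_decode(frame):
    """Hypothetical stand-in for a real decoder: returns 160 zero samples."""
    return [0] * SAMPLES_PER_FRAME


def time_decoding(frames, decode):
    """Decode every frame and return (samples_produced, elapsed_seconds)."""
    start = time.perf_counter()
    samples = 0
    for frame in frames:
        samples += len(decode(frame))
    elapsed = time.perf_counter() - start
    return samples, elapsed


# Synthetic test vector: 70 whole frames plus 5 stray bytes.
test_vector = bytes(FRAME_BYTES * 70 + 5)
frames = split_frames(test_vector)
samples, elapsed = time_decoding(frames, placeholder_decode)

print(len(frames), samples)
```

Replacing `placeholder_decode` with a call into an actual decoding library, and the synthetic vector with real encoded speech, would give the per-configuration timings suggested above. The elapsed time depends on the machine and is therefore not shown.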

5.3 Final Thoughts

The main conclusions drawn from this investigation are threefold.

1.) The GSM encoding/decoding algorithm is simply not complex enough to take

advantage of parallel processing. It contains only a small set of instructions, which does

not allow for efficient use of replicated processing power. The added processing power

is essentially lost in the latency that stems from a difference in the processing lengths of

its computing modules.

2.) Parallel processing involves much more than simply replicating processors on a

chip. In fact, when replicating N' processors on a chip, it is not always the case that this

will result in a factor of N' improvement. The effect of parallel processing largely

depends on the algorithm that it tries to optimize. The number of instructions in the

algorithm plays an important role, as does the schedule in which these instructions are

executed. As seen from the stated results, many instructions are necessary in order to

benefit from the increased number of processors. The optimal program or algorithm for

parallel processing is one that contains a lot of instructions, including large decision trees.

This implies that the added processors would actually be used since only a really large

cache could store all the instructions. Without the large cache present, cache misses

would claim a significant amount of the total computation time. In this scenario, adding

processors would effectively take the place of idle time.

3.) Because the Bulk DSP is essentially a parallel multiprocessor, it will only benefit

technologies with "optimal" parallel properties. In this study, it was specifically applied

to the GSM decoding algorithm and did not prove extremely beneficial. However, the

Bulk DSP could still be the most desirable processor for the other technologies

researched, such as MPEG and ADSL, since their specific algorithms were neither identified

nor considered. Furthermore, a GSM implementation that incorporates more functionality

might also result in a more beneficial Bulk DSP.

Appendix A

Software Tools Used

Jutta Degener and Carsten Bormann, from the Technical University of Berlin, developed

the GSM 06.10 software used to research and study the encoding and decoding

algorithms. The software consists of a C library and a stand-alone program. It was first

designed for a UNIX-like environment, although the library has been ported to VMS and

DOS environments, which was the implementation used in this investigation. Several

other tools were used to test and use the library. The code was run using the Microsoft

Visual C++ compiler. A digital audio editor named Cool Edit was used to convert GSM

files to raw PCM format in order to test the encoding modules. Syntrillium Software

developed this program.

Appendix B

GSM Resources on the Web

Many web sites containing both official and unofficial information about GSM and the

telecommunications industry were found throughout this research. A list of the most

helpful and relevant sites follows.

* Dr. Dobb's Journal--A good technical explanation of GSM encoding/decoding: http://www.ddj.com/articles/1994/9412/9412b/9412b.htm

* GSM Encoding/Decoding C library--Jutta Degener and Carsten Bormann: http://www.kbs.cs.tu-berlin.de/~jutta/toast.html

* GSM Information Network: http://www.gin.nl/

* GSMag International: http://www.gsmag.com/

* GSM Online Journal: http://www.gsmdata.com/today.html

* GSM Specification--ETSI: http://www.etsi.org/

* GSM Streaming Audio for the Web: http://itre.ncsu.edu/gsm/

* GSM World--GSM MOU Association: http://www.gsmworld.com/

* Intel's GSM Data Knowledge Site: http://www.gsmdata.com/

* International Telecommunication Union: http://www.itu.int/

* Audio Clips in different formats: http://www.geek-girl.com/audioclips.html

* Total Telecom: http://www.totaltele.com/

* Universal Mobile Telecommunications: http://www.umts-forum.org/
