fast prototyping of posix based applications on a multiprocessor soc


May 2007 (vol. 8, no. 5), art. no. 0704-o5002, 1541-4922 © 2007 IEEE. Published by the IEEE Computer Society

Rapid System Prototyping

Prototyping Multiprocessor System-on-Chip Applications: A Platform-Based Approach

Benaoumeur Senouci, Aimen Bouchhima, Frédéric Rousseau, and Frédéric Pétrot, TIMA Laboratory
Ahmed Jerraya, CEA-LETI Minatec

A new multiprocessor system-on-chip prototyping flow based on the Portable Operating System Interface (Posix) standard and a multiprocessor hardware platform lets you quickly prototype Posix-based applications.

Modern MPSoCs (multiprocessor systems on chip) contain a huge amount of software and rely on complex hardware components. As application complexity grows, programmable multiprocessor platforms are becoming more desirable. In fact, chips with several processors (such as general-purpose processors and digital signal processors) are emerging in the industry, both for low-end applications such as audio codecs and for high-end applications such as video encoders.

Reconfigurable hardware platforms recently emerged as effective solutions to validate and prototype MPSoC designs early in a design flow. Such prototyping platforms make simultaneous hardware and software development possible and enable early software design and debugging,1-3 thus allowing for early software and hardware integration, which is the critical step in MPSoC system designs. Applications running on these new multiprocessor platforms usually require sophisticated multitasking operating systems to execute the system parts mapped to software. These operating systems provide a suitable abstraction, allowing easy development of application software.4 Using a standard API at this level makes this process even more effective and enhances software portability and reuse across different operating systems. However, the same portability doesn't apply from a hardware perspective, where changes to the underlying configurable hardware architecture are still seen as a major source of nonportability and usually lead to long, tedious redesign cycles.

MPSoC designers have recently introduced the concept of hardware-dependent software (HdS) to tackle such strong coupling between hardware and software within the lower system software layers.1

Here, we describe our experience in MPSoC prototyping using a multiprocessor operating system kernel that implements the Portable Operating System Interface (Posix) API standard on top of a reconfigurable multiprocessor platform using the HdS concept. (We chose Posix as the API owing to its wide acceptance and availability in many runtime environments.5) Investigating the complexity of hardware and software integration in multiprocessor system design, a process still not completely mastered in MPSoC design flows, helped us understand the duality between the low-level software layer (HdS) and the underlying hardware platform in the context of MPSoC design.

Hardware-software boundary: Defining HdS

When introducing standards such as Posix threads in embedded software development, the aim is to make the applications portable from one platform to another.2 However, it still isn't always obvious whether software running on one platform will run on another. Different platform specificities (such as memory maps, the processor's family, multiprocessor booting strategies, or on-chip commutation) might require different tuning or tradeoffs, forcing designers to redesign a major part of their embedded software.

IEEE Distributed Systems Online (vol. 8, no. 5), art. no. 0704-o5002 1



The HdS concept aims to tackle the disadvantages of such low-level programming practices by dividing the embedded software code into two parts: code that depends on the hardware architecture (the HdS) and code that is implementation independent (the hardware-independent software). The exact meaning of HdS depends on the context in which you use it, but HdS generally includes those low-level software functionalities whose implementations depend directly on the underlying hardware architecture.1,6 This includes, for instance, device drivers, digital signal processor-specific algorithms, and parts of the operating system (interrupt management, context-related operations, semaphores, and so on). The hardware-independent software comprises application, middleware, and operating system software (see figure 1). We assume that the application software comprises a set of concurrent tasks and the middleware software represents dedicated communication libraries. The operating system software provides a useful abstraction interface between applications and target architectures by simplifying the control code required to coordinate processes.

Figure 1. The different layers of typical multiprocessor system-on-chip embedded software.

A generic MPSoC prototyping flow

Using an HdS-based approach and the Posix standard, we propose a generic MPSoC prototyping flow, as figure 2 shows. The shaded parts present the steps in which we're interested: in particular, extracting platform specifications for redesigning certain parts of the HdS code.

Figure 2. A generic MPSoC prototyping flow. The shaded parts highlight the steps in which we're interested.




Kernel protection and thread migration

Processors access the scheduler through critical sections (that is, through common or shared operating system resources) and under the protection of locks. Lock granularity largely determines the balance between the overhead introduced by the locking mechanism and the opportunity to increase parallelism among different processors.

The SMP version of Mutek allows thread migration. Intuitively, when a CPU finishes the threads currently allocated to it for scheduling, it can resume executing a preempted thread that was previously executed on another processor. In that way, the system is dynamically balanced, reducing the mean response time.
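As a rough illustration of this idea (not Mutek's actual scheduler), the following C sketch uses POSIX threads to stand in for CPUs that share one ready queue under a single coarse lock. All names (sched_pick, cpu_loop, NUM_CPUS, NUM_TASKS) are hypothetical, and the sketch models neither preemption nor priorities; it only shows that a task is not tied to any particular CPU.

```c
#include <assert.h>
#include <pthread.h>

#define NUM_CPUS  4    /* hypothetical: one worker thread per simulated CPU */
#define NUM_TASKS 16   /* hypothetical number of ready tasks */

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
static int next_task;              /* head of the shared ready queue */
static int ran_on[NUM_TASKS];      /* CPU that executed each task, or -1 */

/* Critical section: pick the next ready task, or -1 if none is left. */
static int sched_pick(void)
{
    pthread_mutex_lock(&sched_lock);
    int t = (next_task < NUM_TASKS) ? next_task++ : -1;
    pthread_mutex_unlock(&sched_lock);
    return t;
}

/* Each simulated CPU keeps taking work from the shared queue. */
static void *cpu_loop(void *arg)
{
    int cpu = *(int *)arg;
    int t;
    while ((t = sched_pick()) >= 0)
        ran_on[t] = cpu;           /* "run" the task on this CPU */
    return NULL;
}

/* Returns the number of tasks that were executed by some CPU. */
int run_shared_queue(void)
{
    pthread_t th[NUM_CPUS];
    int ids[NUM_CPUS];
    int done = 0;

    next_task = 0;
    for (int i = 0; i < NUM_TASKS; i++)
        ran_on[i] = -1;
    for (int i = 0; i < NUM_CPUS; i++) {
        ids[i] = i;
        int rc = pthread_create(&th[i], NULL, cpu_loop, &ids[i]);
        assert(rc == 0);
    }
    for (int i = 0; i < NUM_CPUS; i++)
        pthread_join(th[i], NULL);
    for (int i = 0; i < NUM_TASKS; i++)
        done += (ran_on[i] >= 0);
    return done;
}
```

Calling run_shared_queue() returns the number of tasks that ran. With one coarse lock, every pick is serialized, which is the overhead side of the granularity tradeoff; a finer-grained design would pay for more parallelism with more locking logic.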

Thread synchronization

When multiple processors require access to shared data, synchronization among threads is required. Mutek performs this synchronization using different primitives:

• Mutual exclusion locks (mutex). A mutex allows exclusive access to shared resources such as global data. Threads attempting to access an object locked by a mutex will be blocked until the thread holding the object releases it.

• Condition variables. Using condition variables, a thread can wait until (or indicate that) a predicate becomes true. A condition variable requires a mutex to protect the data associated with the predicate.

• Semaphores. In Posix.1b, named and unnamed semaphores have been tuned specifically for threads. The semaphore is initialized to a certain value and decremented. Threads may wait to acquire a semaphore. If the semaphore's current value is greater than 0, it is decremented and the wait call returns. If the value is 0 or less, the thread is blocked until the semaphore is available.
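The three primitives above can be sketched with the standard Posix calls. This is a minimal host-side example, assuming a Posix system with pthreads and unnamed semaphores; the variable and function names (lock, ready_cv, slots, sync_demo) are illustrative and are not taken from Mutek.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

/* Hypothetical shared state for this sketch. */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t ready_cv = PTHREAD_COND_INITIALIZER;
static int ready;          /* predicate, protected by `lock` */
static int shared_total;   /* shared data, protected by `lock` */
static sem_t slots;        /* counting semaphore, starts at 0 */

static void *producer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);         /* exclusive access to shared data */
    shared_total += 41;
    ready = 1;                         /* the predicate becomes true */
    pthread_cond_signal(&ready_cv);    /* wake a waiting thread */
    pthread_mutex_unlock(&lock);
    sem_post(&slots);                  /* increment: one unit available */
    return NULL;
}

static void *consumer(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!ready)                     /* re-check after every wakeup */
        pthread_cond_wait(&ready_cv, &lock);  /* unlocks while sleeping */
    shared_total += 1;
    pthread_mutex_unlock(&lock);
    sem_wait(&slots);                  /* blocks while the value is 0 */
    return NULL;
}

/* Runs one producer and one consumer; returns the final total. */
int sync_demo(void)
{
    pthread_t p, c;

    ready = 0;
    shared_total = 0;
    int rc = sem_init(&slots, 0, 0);
    assert(rc == 0);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    sem_destroy(&slots);
    return shared_total;
}
```

The consumer tests the predicate in a loop because pthread_cond_wait may return without the predicate being true; the mutex is what makes the test and the wait safe against a concurrent update.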

Memory coherency

If the architecture's interconnect is a shared bus, using a snoopy cache algorithm is sufficient to ensure cache coherence. This has the advantage of avoiding any processor slowdown owing to memory traffic.

Processor identification number

Processors generally provide a specialized register allowing their identification within the system. Each processor is assigned a number at boot time. This identification number is needed because some start-up actions, such as clearing the BSS (Block Started by Symbol) section and creating the scheduler, should occur only once.

Implementation

Here we describe the configurable multiprocessor platform we used to implement our MPSoC prototyping flow and explain how HdS let us port the kernel to the target hardware platform.

The prototyping platform

We used the ARM (Advanced RISC Machines) Integrator/AP prototyping platform (see www.arm.com/documentation/Boards_and_Firmware/index.html), which consists of three main parts.

The motherboard. The motherboard is composed of four core processor modules (which are mountable on a stack), one logic module (an FPGA, or field-programmable gate array), and a system bus implemented as an AMBA/AHB (Advanced Microcontroller Bus Architecture/Advanced High-Performance Bus). Figure 4 gives a simplified block diagram of the Integrator platform.




Figure 4. The ARM Integrator platform.

Core modules. Each core module is built around one ARM core processor without caches, and each contains 256 Kbytes of local synchronous static RAM (SSRAM), which only that ARM processor can access. An adjacent 128 Mbytes of synchronous dynamic RAM (SDRAM) can be directly accessed by all master processors on Integrator via the system bus. A system-bus bridge adapts each ARM core interface to the AHB protocol and lets the processor access the system bus.

Memory architecture. The ARM Integrator platform implements a distributed shared memory (DSM) architecture. Two memory types exist in each core module: a static memory of limited capacity (the SSRAM) and a dynamic one of greater capacity (the SDRAM). Only the local processor can access the SSRAM, while all bus masters (processors) can access the local SDRAM (on the same core module) at alias address locations.

HdS adaptation layer for Mutek porting

As we mentioned earlier, the gap between the embedded software and the Integrator platform brings about the need for a software adaptation layer (HdS), as figure 5 shows. Here we detail the techniques we used to tune Mutek kernel specificities to the Integrator platform constraints. More particularly, we focus on multiprocessor booting, memory mapping, synchronization, and context switching.

Figure 5. HdS adaptation for the Mutek operating system.




Multiprocessor booting. As a multiprocessor kernel, the Mutek boot code should pay close attention to the synchronization of the different concurrent processors to ensure the coherency of the operating system's initialization phase. At this stage, the operating system should allocate and initialize its different vital data structures (including its task queue). It should also clear the BSS section (noninitialized global and static C variables) according to the American National Standards Institute's C standard. Given that the system needs only one processor to perform this initialization process, the other processors must wait until the process ends. Therefore, the system needs an identification mechanism that can differentiate the running processors.

In Mutek, the get_proc_id function provides the processor identification mechanism by returning the current processor's ID. Implementing this function depends on the hardware architecture and is thus part of the HdS layer. In the case of the Integrator platform, the processor ID is available via a special status register (CM_STAT) on each core module that appears on the same shared logical address (0x10000010).

Figure 6 shows the specific implementation of get_proc_id for the Integrator platform. The kernel assumes that the processors are labeled from 0 to (n - 1), where n is the number of available processors. The figure also shows the algorithm used to ensure synchronization among different processors. Only the master processor (ID = 0) is allowed to carry out kernel initialization. The other processors (core_modules 1, 2, and 3) enter a busy loop waiting for the master processor to finish. The shared variable scheduler_created is declared as volatile to ensure that the compiler won't optimize away the waiting loop.

/* get_proc_id function */
unsigned int get_proc_id (void) {
    unsigned int r;
    /* byte read of the (CM_STAT) register; volatile because it is memory mapped */
    r = *((volatile unsigned char *) 0x10000010);
    return r;
}

/*********************************************************************************/

if (get_proc_id() == 0) {
    /* core_module 0 (processor 0) does the kernel initialization:
       scheduler and main thread creation (application entry) */
    scheduler_created = 1;
} else {
    /* get_proc_id() != 0: the other core_modules (1, 2, 3) wait for the
       master processor (core_module 0) to create the scheduler and set
       scheduler_created = 1 */
    while (scheduler_created == 0)
        ;
    /* then wait for a thread to be reenabled */
}

Figure 6. Multiprocessor booting.
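The boot protocol in figure 6 can be tried out on a host machine with threads standing in for core modules. The sketch below is only a simulation under stated assumptions: the processor ID is passed as an argument instead of being read from CM_STAT, fake_bss is a stand-in for the real BSS section, and C11 atomics replace the volatile flag, because a hosted compiler and a cached CPU need stronger guarantees than volatile gives. All names except scheduler_created are hypothetical.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <string.h>

#define NUM_CPUS 4                    /* four simulated core modules */

static atomic_int scheduler_created;  /* set once by the master */
static atomic_int init_runs;          /* how many times init executed */
static char fake_bss[64];             /* stand-in for the BSS section */

static void *boot(void *arg)
{
    unsigned int id = *(unsigned int *)arg;   /* simulated get_proc_id() */

    if (id == 0) {
        /* Master: one-time initialization. */
        memset(fake_bss, 0, sizeof fake_bss); /* clear the "BSS" */
        atomic_fetch_add(&init_runs, 1);
        atomic_store(&scheduler_created, 1);  /* release the others */
    } else {
        /* Other processors: busy-wait until the master has finished. */
        while (atomic_load(&scheduler_created) == 0)
            ;
    }
    return NULL;
}

/* Boots all simulated CPUs; returns how many times init ran. */
int boot_demo(void)
{
    pthread_t th[NUM_CPUS];
    unsigned int ids[NUM_CPUS];

    atomic_store(&scheduler_created, 0);
    atomic_store(&init_runs, 0);
    memset(fake_bss, 0xFF, sizeof fake_bss);  /* pretend RAM is dirty */
    for (unsigned int i = 0; i < NUM_CPUS; i++) {
        ids[i] = i;
        int rc = pthread_create(&th[i], NULL, boot, &ids[i]);
        assert(rc == 0);
    }
    for (unsigned int i = 0; i < NUM_CPUS; i++)
        pthread_join(th[i], NULL);
    return atomic_load(&init_runs);
}
```

Whatever order the threads start in, boot_demo() reports a single initialization, because only the thread with ID 0 enters the first branch and every other thread spins until the flag is set.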

Memory mapping. Figure 7 shows the global memory mapping system. After reset, all processors must jump to a specific shared address alias in the DSM address space. This address is set when the first initialization routine (__init) starts inside the Mutek kernel. The binary image (*.axf) is physically loaded into the local SDRAM of core module 0 (CM_0) according to this specific shared address (using the Multi-ICE connector).




Figure 7. Project memory mapping.

Multiprocessor synchronization. Mutek relies heavily on low-level semaphore primitives to synchronize the different concurrent processors that compete for common operating system resources. From an implementation viewpoint, we can distinguish between two semaphore implementations: CPU based (software) and FPGA based (hardware).

For a CPU-based implementation, binary semaphores on general-purpose CPUs are based on an atomic read and (conditional) write of a shared variable.7 These existing mechanisms can be integrated in shared-memory multiprocessors (SMP) to synchronize applications running on multiple homogeneous CPUs. Figure 8 shows the implementation of the two functions (SEM_LOCK and SEM_UNLOCK) using the processor-specific atomic swap (SWP) instruction. SEM_LOCK tests and acquires the semaphore variable, and SEM_UNLOCK releases it.

void SEM_LOCK (unsigned int semaddr) {
    __asm {
        Tryagain
            Request the Semaphore (SWP)
            Is it free?
                YES (we have the Lock)
                NO (Branch: Tryagain)
    };
}

void SEM_UNLOCK (unsigned int semaddr) {
    __asm {
        Release the Semaphore; Semaddr
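The swap-based lock above is given only as pseudocode. A portable analogue, offered here as a sketch rather than as the authors' code, replaces the SWP instruction with the C11 atomic_exchange operation, which likewise writes a new value and returns the old one in a single atomic step. The lowercase names (sem_lock, sem_unlock, sem_word, swap_lock_demo) are hypothetical and chosen to avoid confusion with the figure's SEM_LOCK and SEM_UNLOCK.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_uint sem_word;      /* 0 = free, 1 = taken */
static unsigned long counter;     /* shared data protected by sem_word */

static void sem_lock(atomic_uint *sem)
{
    /* Atomically store 1 and fetch the previous value, like a swap. */
    while (atomic_exchange_explicit(sem, 1u, memory_order_acquire) != 0u)
        ;   /* it was already taken: try again */
}

static void sem_unlock(atomic_uint *sem)
{
    atomic_store_explicit(sem, 0u, memory_order_release);
}

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000; i++) {
        sem_lock(&sem_word);
        counter++;                /* critical section */
        sem_unlock(&sem_word);
    }
    return NULL;
}

/* Four threads increment the counter under the lock. */
unsigned long swap_lock_demo(void)
{
    pthread_t th[4];

    counter = 0;
    atomic_store(&sem_word, 0u);
    for (int i = 0; i < 4; i++) {
        int rc = pthread_create(&th[i], NULL, worker, NULL);
        assert(rc == 0);
    }
    for (int i = 0; i < 4; i++)
        pthread_join(th[i], NULL);
    return counter;
}
```

If the lock works, no increment is lost and swap_lock_demo() returns 4 x 10,000. Note that every failed attempt is still a read-modify-write on the shared word, so a contended swap lock keeps generating bus traffic.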



We can use the platform's FPGA to implement more efficient synchronization mechanisms that are independent of the CPU family (the semaphore engine). As such, these new mechanisms are easily portable across shared and distributed memory multiprocessor configurations. Thus, the architecture can implement semaphores that don't lock the system bus, which leaves other processors or threads free to access the memory system.7

The semaphore engine uses a standard read of a memory-mapped register, Sem_addr. We define a simple control structure within the FPGA (logic module) that updates the register after a read operation. Figure 9 shows the semaphore engine's implementation on the ARM platform.

Figure 9. The semaphore engine for Mutek.

The basic semantics of the SEM_LOCK and SEM_UNLOCK APIs for accessing the lock are implemented identically for all system processors (see figure 10).

void SEM_LOCK (unsigned int semaddr) {
    while (Sem_addr != 0);
}

void SEM_UNLOCK (unsigned int semaddr) {
    Sem_addr = 0;
}

Figure 10. A CPU-independent semaphore API.
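To make the read side effect concrete, here is a small software model of one semaphore-engine register. It rests on an assumption that the text does not spell out: that the control logic marks the register as taken (nonzero) right after any read, so that only the reader that saw 0 owns the lock. The type and function names (sem_reg_t, engine_read, engine_write, sem_try_lock, sem_release, engine_demo) are hypothetical.

```c
#include <assert.h>

/* Hypothetical software model of one semaphore-engine register. */
typedef struct {
    unsigned int value;   /* 0 = free, nonzero = taken */
} sem_reg_t;

/* Models a bus read: return the current value, then mark it taken
   (the side effect that the FPGA control logic is assumed to perform). */
static unsigned int engine_read(sem_reg_t *reg)
{
    unsigned int old = reg->value;
    reg->value = 1;
    return old;
}

/* Models a bus write: the register simply takes the written value. */
static void engine_write(sem_reg_t *reg, unsigned int v)
{
    reg->value = v;
}

/* One lock attempt: succeeds only if the read returned 0 (free). */
static int sem_try_lock(sem_reg_t *reg)
{
    return engine_read(reg) == 0;
}

static void sem_release(sem_reg_t *reg)
{
    engine_write(reg, 0);
}

/* Walks through request, contended request, release, request. */
int engine_demo(void)
{
    sem_reg_t reg = { 0 };
    int first = sem_try_lock(&reg);    /* free: obtained */
    int second = sem_try_lock(&reg);   /* taken: refused */
    sem_release(&reg);
    int third = sem_try_lock(&reg);    /* free again: obtained */

    assert(first == 1 && second == 0 && third == 1);
    return first && !second && third;
}
```

In this model the CPU side needs only an ordinary load and an ordinary store, which is why the API can be identical on every processor family; the atomicity comes from the bus serializing accesses to the register, not from a special instruction.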




Context management. The context switch code, written in assembly language, switches the processors between threads. A context switch stores the current processor state (general-purpose registers and status register) in a memory location (on the stack) and loads a new processor context from another location in the memory that corresponds to the new thread to be executed. This context switch is a part of the HdS layer. In Mutek, the scheduler_commute kernel routine is responsible for thread commutation.

This routine performs two different low-level calls to the commute function, which depends on the target processor; the function's implementation, which is part of the HdS layer, varies accordingly. Figure 11 shows an example of context switch implementation for two different processors: the ARM architecture and the MIPS R3000 architecture.

/* Context switch routine for ARM architecture */
STMIA R0!, {R0-R14}    ; save the old context registers
MRS R5, cpsr           ; we get the cpsr
MRS R4, spsr           ; and the spsr

LDMIA R1, {R0-R14}     ; load the new context registers

MOV PC, LR             ; and we branch

(a)

/* Context switch routine for MIPS R3000 architecture */
SW $at, 4*1($a0)
...                    # save the old context registers
SW $ra, 4*31($a0)
LW $at, 4*1($a1)
...                    # load the new context registers
LW $ra, 4*31($a1)

(b)

Figure 11. An example of context switch implementation for two different processors: (a) the ARM architecture and (b) the MIPS R3000 architecture.
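The save-then-load pattern of figure 11 can also be observed from portable C on systems that provide the <ucontext.h> interface (getcontext, makecontext, swapcontext). This is a different mechanism from the hand-written assembly above and is shown only as a hedged illustration; the names thread_body, thread_stack, and context_switch_demo are hypothetical.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, thread_ctx;
static char thread_stack[64 * 1024];   /* private stack for the thread */
static int trace[3];                   /* order in which code ran */
static int n;

static void thread_body(void)
{
    trace[n++] = 2;                       /* running on thread_stack */
    swapcontext(&thread_ctx, &main_ctx);  /* save self, resume main */
}

/* Switches to a second context and back; returns the trace as digits. */
int context_switch_demo(void)
{
    n = 0;
    int rc = getcontext(&thread_ctx);     /* start from a valid context */
    assert(rc == 0);
    thread_ctx.uc_stack.ss_sp = thread_stack;
    thread_ctx.uc_stack.ss_size = sizeof thread_stack;
    thread_ctx.uc_link = &main_ctx;       /* where to go if the body returns */
    makecontext(&thread_ctx, thread_body, 0);

    trace[n++] = 1;
    swapcontext(&main_ctx, &thread_ctx);  /* save main, load the thread */
    trace[n++] = 3;                       /* back in main's context */
    return trace[0] * 100 + trace[1] * 10 + trace[2];
}
```

Each swapcontext call plays the role of one commute: it stores the caller's registers and stack pointer in the first argument and loads those held in the second, which is what the STMIA/LDMIA and SW/LW sequences do by hand.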

Environment setting. We used ADS (ARM Developer Suite v1.2) to build our project, using the ARM CC compiler and ARM Link as the linker. The output file is a .axf file targeted at the ARM platform.

Validation: M-JPEG video decoder

We validated our approach using a video decoder for a stream of JPEG images (known as M-JPEG or Motion JPEG). Figure 12 shows its task graph, which is composed of eight parallel threads communicating with each other using hardware or software channels.




Figure 12. M-JPEG task graph (the circles are the threads).

Kahn network communication layer. The video decoder application is a graph of communicating threads in the form of a Kahn process network. In this formalism, the threads communicate with each other via circular first-in, first-out (FIFO) channels (C0-C10). Our implementation of this communication library provides for different communication schemes (software-software, software-hardware, hardware-software, and hardware-hardware). In this case, we use software-software communication, building FIFOs on top of the Posix standard and protecting them using semaphores.
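A software-software channel of this kind can be sketched as a circular buffer guarded by two Posix semaphores. The code below is an assumption-laden illustration, not the authors' communication library: the names (fifo_t, fifo_write, fifo_read, FIFO_DEPTH) are hypothetical, tokens are plain integers, and each channel is assumed to have exactly one writer and one reader, as in a Kahn process network, so the head and tail indices need no extra mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <semaphore.h>

#define FIFO_DEPTH 4          /* hypothetical channel depth */

typedef struct {
    int buf[FIFO_DEPTH];
    int head, tail;           /* read and write positions */
    sem_t free_slots;         /* counts empty entries */
    sem_t used_slots;         /* counts filled entries */
} fifo_t;

static void fifo_init(fifo_t *f)
{
    f->head = f->tail = 0;
    sem_init(&f->free_slots, 0, FIFO_DEPTH);
    sem_init(&f->used_slots, 0, 0);
}

static void fifo_write(fifo_t *f, int v)   /* blocks while full */
{
    sem_wait(&f->free_slots);
    f->buf[f->tail] = v;
    f->tail = (f->tail + 1) % FIFO_DEPTH;
    sem_post(&f->used_slots);
}

static int fifo_read(fifo_t *f)            /* blocks while empty */
{
    sem_wait(&f->used_slots);
    int v = f->buf[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    sem_post(&f->free_slots);
    return v;
}

static void *producer(void *arg)
{
    fifo_t *f = arg;
    for (int i = 1; i <= 100; i++)
        fifo_write(f, i);
    return NULL;
}

/* One producer thread feeds 1..100 to the caller; returns their sum. */
int fifo_demo(void)
{
    fifo_t f;
    pthread_t p;
    int sum = 0;

    fifo_init(&f);
    int rc = pthread_create(&p, NULL, producer, &f);
    assert(rc == 0);
    for (int i = 0; i < 100; i++)
        sum += fifo_read(&f);
    pthread_join(p, NULL);
    sem_destroy(&f.free_slots);
    sem_destroy(&f.used_slots);
    return sum;
}
```

Blocking reads on an empty channel are what give a Kahn network its deterministic behavior; the blocking write on a full channel is a practical bound on memory rather than part of the formalism.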

Application mapping. The designer maps the parallelized code onto the given multiprocessor architecture. This includes mapping the software parts (the concurrent threads, operating system, and HdS) onto the system memory and mapping the concurrent threads onto the multiprocessor platform architecture.

In platform-based design approaches, mapping the different functions (threads) onto the hardware architecture is the key process that correlates the function to the architecture. In our case (as figure 13 shows), the abstract architecture model consists of four ARM CPU/SMP_OS units, an AMBA bus unit, and a global memory (DSM). The software architecture is built around an OS/SMP, allowing a dynamic thread-scheduling policy. The four ARM CPUs share a common view of the DSM via the system bus. Several software threads can share a CPU unit and request services from it.

Figure 13. M-JPEG threads mapping using the dynamic scheduler.




A thread can be executed on any CPU of the platform (thread migration), allowing an efficient load balance of the different software tasks. This helps balance the system, reducing its mean response time.

Experimentation results

In this SMP configuration, the operating system kernel's memory footprint is approximately 11 Kbytes. This was the result of compiling approximately 100 C source files using the ARM CC compiler, with -O3 as the optimization option. Table 1 shows the software's code size.

Table 1. The software's executable code size.

Code                     Size
HdS                      472 bytes
Mutek kernel             11 Kbytes per symmetric multiprocessor
Multiprocessor boot      360 bytes per processor
Communication library    896 bytes

Compared to our previous experience with custom and application-specific operating system generation, Mutek has a larger footprint (four times larger). However, when the number of processors scales up, the memory footprint of the application-specific operating system (which implements distributed scheduling only) increases accordingly.

Table 2 shows the number of cycles necessary to perform typical Posix functions obtained from our implementation.

Table 2. The number of cycles necessary to perform typical Posix operations.

Operations           No. of cycles
Context switch       1,462
Thread creation      2,750
Semaphore request    162
Semaphore release    78

In addition to functional tests, we also performed tests to quantify the Semaphore Engine's performance to address our concern that the hardware implementation's speed could lead to better performance than a software one (see table 3). The test sequence was a SEM_LOCK (request) and SEM_UNLOCK (release).

Table 3. Clock cycles for semaphore request and release.

Operation     CPU-based cycles (software)    FPGA-based cycles (hardware)
SEM_LOCK      162                            54
SEM_UNLOCK    78                             41

The software semaphore implementation's average access time was 162 clock cycles, compared to 54 clock cycles for the hardware Semaphore Engine, yielding a 3x average access-time ratio. The Semaphore Engine's performance is quantified by accounting for just one processor on the system bus.




Using a reconfigurable hardware platform (the ARM Integrator) and Posix threads for application development let us validate several multiprocessor applications by developing a prototype for each application. The validation process's critical step is the HdS design process, which depends on the platform's configuration instance.

We estimate that developing and debugging the new HdS layer for the Integrator platform required approximately three designer-months. Note that this is only for a particular configuration of the programmable hardware platform. Of course, for subsequent configurations of the same platform, the effort should be considerably less because the designers will be able to reuse the predesigned parts. Also, they won't have the same learning curve the second time around.

Creating an operating system service (semaphore) based on an FPGA, and thus eliminating the difference between a CPU and an FPGA from the developer's viewpoint, requires codesigning the operating system's hardware and software to extend system services across the FPGA-CPU boundary. An attractive goal of such hardware-software codesign is improving application portability across several mixed multiprocessor platforms. This FPGA-based design of certain CPU-dependent operating system services, such as synchronization, processor identification, and interrupt control, promises to overcome HdS design problems.

Future work will focus on developing methods and tools that can automate this design step to further shorten design and validation time and enable effective design space exploration.

References

1. B. Senouci et al., "Fast Prototyping of Posix Based Applications on a Multiprocessor SoC Architecture: Hardware Dependant Software Oriented Approach," (http://doi.ieeecomputersociety.org/10.1109/RSP.2006.17), Proc. Workshop on Rapid System Prototyping, IEEE CS Press, 2006, pp. 69-75.

2. K. Keutzer et al., "System Level Design: Orthogonalization of Concerns and Platform-Based Design," IEEE Trans. Computer-Aided Design of Circuits and Systems, vol. 19, no. 12, 2000, pp. 1523-1543.

3. N. Ohba and K. Takano, "An SoC Design Methodology Using FPGAs and Embedded Microprocessors," Proc. 41st Design Automation Conf., DAC, 2004, pp. 747-752.

4. V. Mooney III and J. Lee, "Hardware/Software Partitioning of Operating Systems: Focus on Deadlock Detection and Avoidance," IEE Proc. Computers and Digital Techniques, vol. 152, no. 2, 2005, pp. 167-182.

5. I. Augé et al., "Platform Based Design from Parallel C Specifications," IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 12, 2005, pp. 1811-1826.

6. S. Yoo and A.A. Jerraya, "Introduction to Hardware Abstraction Layers for SoC," Embedded Software for SoC, A.A. Jerraya et al., ed., Kluwer Academic, 2003, pp. 179-186.

7. D. Andrews, D.L. Neihaus, and D. Ashenden, "Programming Models for Hybrid FPGA/CPU Computational Components," (http://doi.ieeecomputersociety.org/10.1109/MC.2004.1260732), Computer, Jan. 2004, pp. 118-120.

Benaoumeur Senouci is a PhD student at the TIMA Laboratory, working with the System Level Synthesis group. His research interests concern hardware-platform-based design and prototyping of multiprocessor systems on chip with a particular focus on MPSoC embedded software. He received his Master 2 Research degree in computer science and integrated system design from the Institut National Polytechnique de Grenoble. Contact him at TIMA Laboratory, SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Aimen Bouchhima is a postdoctoral researcher at the TIMA Laboratory. His research interests include embedded software design and validation, high-level hardware and software modeling, and simulation and multiprocessor system-on-chip design flows. He received his PhD in microelectronics from the Institut National Polytechnique de Grenoble. Contact him at TIMA Laboratory, SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Frédéric Rousseau is an assistant professor at the University of Grenoble and a researcher at the TIMA Laboratory. His research interest concerns system-on-chip design and architecture, in particular the design and validation of hardware and software interfaces. He received his PhD in computer science from the University of Evry. Contact him at TIMA Laboratory, SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Frédéric Pétrot is a professor of computer architecture at the Institut National Polytechnique de Grenoble. His main research interests concern computer-aided design of VLSI circuits and system architecture, with a particular emphasis on system integration, kernels, and multiprocessor systems on chip. He received his PhD in computer science from Université Pierre et Marie Curie, Paris. Contact him at TIMA Laboratory, SLS Group, 46 Ave. Félix Viallet, 38031 Grenoble Cedex, France; [email protected].

Ahmed Amine Jerraya is the head of Design Programs for the Design and System Division of CEA/LETI (Commissariat à l'Énergie Atomique / Laboratoire d'Électronique et de Technologie de l'Information). He received his Docteur d'Etat degree in computer science from the University of Grenoble. Contact him at CEA/LETI/DCIS, Minatec, 17 rue des Martyrs, 38054 Grenoble, France; [email protected]; www-leti.cea.fr.




Related Links

• "Energy-Efficient Thread-Level Speculation," IEEE Micro (http://doi.ieeecomputersociety.org/10.1109/MM.2006.11)

• "Cross Layer Design to Multi-thread a Data-Pipelining Application on a Multi-processor on Chip," Proc. ASAP 06 (http://doi.ieeecomputersociety.org/10.1109/ASAP.2006.24)

• "Automatic Phase Detection for Stochastic On-Chip Traffic Generation," Proc. CODES+ISSS 06 (http://doi.ieeecomputersociety.org/10.1145/1176254.1176277)

Cite this article:

Benaoumeur Senouci, Aimen Bouchhima, Frédéric Rousseau, Frédéric Pétrot, and Ahmed Jerraya, "Prototyping Multiprocessor System-on-Chip Applications: A Platform-Based Approach," IEEE Distributed Systems Online, vol. 8, no. 5, 2007, art. no. 0705-o5002.

