shumcos a rtos using multitask model to reduce migration cost

Upload: anbuembedded

Post on 07-Apr-2018


  • 8/4/2019 SHUMCOS a RTOS Using Multitask Model to Reduce Migration Cost

    1/6

    The 9th International Conference on Computer Supported Cooperative Work in Design Proceedings

    SHUM-uCOS: A RTOS Using Multi-task Model to Reduce Migration Cost between SW/HW Tasks

    Bo Zhou, Weidong Qiu, Yan Chen, Chenglian Peng
    Department of Computer and Information Technology, Fudan University, Shanghai, China
    allenzhou@xasumail. orn

    Abstract

    The design of embedded systems has become more complex than ever, and design quality depends increasingly on the cooperation of multidisciplinary design teams, in general hardware engineers and software engineers. However, because these teams lack a uniform programming model and common system components, the cost of migrating a function from software to hardware is high. Such migrations are nevertheless necessary in the hardware-software partitioning of embedded systems, especially in prototype designs. To cope with this problem, we adopt a uniform multi-task model and implement an RTOS (Real-Time Operating System), called SHUM-uCOS, which treats hardware functions the same as software tasks. This RTOS uses uCOSII as its prototype and traces and manages the states of reconfigurable resources (FPGAs), which allows hardware tasks to execute in a true multitasking manner. Moreover, SHUM-uCOS defines a standard hardware-task interface, which supports a share-bus protocol. Experiments show that SHUM-uCOS can shorten the migration time from software implementations to hardware implementations while improving performance.

    Keywords: Reconfigurable Computing System, RTOS, multi-task Model, uCOS

    1. Introduction

    Embedded systems have expanded considerably in the last few years. With advances in silicon technology, more powerful devices (e.g., higher-frequency CPUs and larger memories) have become available. At the same time, design complexity has increased dramatically, and design quality depends more on the effective cooperation of multidisciplinary design teams, in general hardware engineers and software engineers.

    But how should the designer decide where to place the dividing line between the work of software engineers and that of hardware engineers? This is a well-known unsolved problem in embedded systems, called hardware-software partitioning. Currently, the dividing line is drawn by hand. An experienced system analyst attempts to let hardware engineers implement the time-consuming components, thus maximizing execution speed.

    To determine which part is the performance bottleneck, we often need several product prototypes with different hardware-software dividing lines that realize the same functions in different ways. We then find the proper boundary between software and hardware by comparison. This procedure involves many migrations between software and hardware. However, due to the lack of a uniform programming model and system components for these different implementation methods, the cost of migrating a function implementation from software to hardware is normally high. Even a small task migration needs excessive modification, because it involves both design teams. Recent developments in configurable devices, however, have increasingly blurred the traditional line between hardware and software. Exploiting this characteristic, it appears that the migration cost can be reduced greatly.

    The operating system is a reasonable place for a solution because it is the traditional boundary between hardware and software. Although commercial RTOSs available for popular embedded processors provide significant reductions in design time, they typically do not take advantage of the intrinsic parallelism of hardware tasks, probably because FPGAs and ASICs have historically been treated as hardware accelerators, for which the operating system provides only device drivers.

    To cope with this problem, we have adopted a uniform multi-task (thread) model and implemented an RTOS, with the uCOSII [1] RTOS as its prototype, called Software Hardware Uniform Management uCOS (SHUM-uCOS). The basic concept of the multi-thread model was first discussed in [2], where it is proposed for hybrid chips containing both CPU and FPGA components in one chip. We extend this model to embedded system designs composed of a host processor and several reconfigurable devices. This programming model allows hardware tasks on reconfigurable devices to execute in a truly parallel multitasking manner, organized like software tasks, and substantially decreases the time needed to migrate a task from a SW implementation to a HW implementation.

    Authorized licensed use limited to: Bharat University. Downloaded on August 09,2010 at 08:25:53 UTC from IEEE Xplore. Restrictions apply.

    Sumanth Donthi [3] classifies FPGAs into two categories. If only a portion of the chip is modified while the remaining logic operates normally without any disruption, the device is partially reconfigurable. If the whole chip is modified at once, with a total loss of the previous configuration and the state of the flip-flops, it is fully reconfigurable.

    The main functions of SHUM-uCOS are task and resource management. Several recent publications deal with task and resource management problems, e.g., [4] and [5], especially the problem of finding placements for hardware tasks on a reconfigurable surface, e.g., [6] and [7]. However, their discussions mainly focus on partially reconfigurable FPGAs. Little attention seems to have been paid in operating systems to fully reconfigurable FPGAs, which currently hold a large share of the FPGA market. SHUM-uCOS deals with these devices and uses a preconfiguration table to increase the utilization of reconfigurable resources.

    2. SHUM-uCOS Framework

    SHUM-uCOS is an extended version of uCOSII that widens its management range by adding extra functions. It retains most of the data structures of uCOSII and its priority-based scheduling policy. When dealing with software tasks only, SHUM-uCOS is almost the same as uCOSII. When hardware tasks are involved, SHUM-uCOS adopts the uniform multi-task model to manage them, as shown in Figure 1.

    The whole model is divided into three parts: the CPU, the hardware-task manager and the reconfigurable devices. Software tasks execute on the CPU and hardware tasks run on the FPGAs. The software part of SHUM-uCOS includes the software task interface, the task scheduler and the resource manager. The hardware part of SHUM-uCOS is called the hardware-task manager; it is usually implemented in the FPGAs and includes the communication controller, the standard hardware-task interface, the configuration interface and the hardware-task configuration controller.

    Figure 1. SHUM-uCOS framework

    In detail, SHUM-uCOS is composed of the following parts.

    Software task interface: a set of API functions. Designers interact with the operating system through these functions by calling system services, e.g., creating semaphores and mutexes.

    Hardware task preconfiguration table: to reduce the configuration cost at runtime, the configuration sequences of the configurable devices can be obtained by analyzing the task graph statically. The scheduler uses this data to configure devices before the hardware tasks run.

    Scheduler: the core of the RTOS. It is responsible for managing the states of tasks (HW and SW) and handling synchronous and asynchronous events, such as the scheduling of software tasks, the configuration of hardware tasks, and the synchronization between tasks.

    Resource manager: because hardware tasks are created and deleted dynamically, the usage of the reconfigurable resources changes continually. The resource manager traces and records these changes, providing the scheduler with the information it needs to configure hardware tasks.

    Communication controller: this module handles the low-level communication details and translates commands into binary signals according to the application, e.g. the count of hardware tasks.

    Hardware task configuration database: this database contains all the hardware-task configuration data, synthesized in advance.

    Hardware-task configuration controller: after receiving a configuration command from the scheduler, the controller retrieves the corresponding configuration data from the database and configures the devices. Because the workload is light, a 4-bit or 8-bit microcontroller can serve as the configuration controller.

    Hardware task interface: it supplies the communication controller with the standard signals and protocols.

    Hardware task implementation: it includes all the function modules in the FPGAs, which are described in Section 3.3.

    3. The implementation of SHUM-uCOS

    3.1 Preconfiguration table generation

    Many embedded applications can be represented by data flow graphs (DFG). A DFG is a directed acyclic graph (DAG).

    The problem of generating the hardware-task preconfiguration table can be viewed as two separate problems: 1. From the spatial point of view, hardware tasks can be organized as task groups, and the total area of each task group must be smaller than that of the configurable device into which the group is put. 2. From the temporal point of view, we must schedule the task groups so that they need the minimum number of reconfigurable devices. The grouping and the scheduling of a DAG are both NP-complete problems [8][9][10].

    Paper [9] discusses the problem of task-group partitioning in detail and proposes two algorithms: a level based partitioning algorithm and a clustering based partitioning algorithm. The former mainly exposes the parallelism hidden in the graph nodes; the aim of the latter is to decrease the communication overhead, i.e., the number of terminal edges resulting from partitioning.

    In the multiprocessor field, there are already many discussions about how to obtain parallelism by analyzing the task graph statically. Correspondingly, numerous methods have been proposed, such as the MCP algorithm and the DCP algorithm [10].

    With the above considerations, the basic idea of generating the preconfiguration table is as follows: first, divide the hardware tasks into task groups that fit into the reconfigurable devices; then view the configuration procedures as tasks with deadlines; finally, obtain the preconfiguration table by scheduling these tasks. The following steps describe this procedure in detail (an example can be seen in Figure 2).

    1. Remove the software-task nodes from the original task graph G1 to obtain a task graph G2 that contains only hardware tasks. The precedence relations between hardware tasks in G2 must be kept the same as in G1. For example, in Figure 2(a) there are three tasks, T4, T8 and T12, where each task depends on the one before it, and T8 is the only software task. Thus, we must remove T8 and keep T12 depending on T4 in Figure 2(b).

    2. Replace each hardware task node Ti with a configuration task node Ci. The deadline of Ci equals the arriving time of Ti minus the configuration time.

    3. According to the level based partitioning algorithm [9], get the task groups under the area constraint.

    4. Merge each task group into one configuration node.

    5. Use the DCP algorithm [10] to schedule the configuration nodes; the result is the preconfiguration table.

    Figure 2. The generation of preconfiguration table
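Steps 1 and 2 can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the adjacency-matrix representation, the bound `MAX_TASKS`, and the names `TaskGraph`, `remove_software_tasks` and `config_deadline` are all hypothetical. Removing a software node bridges each of its predecessors to each of its successors, so that precedence among the remaining hardware tasks is preserved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_TASKS 16  /* hypothetical bound chosen for this sketch */

typedef struct {
    int  n;                           /* number of task nodes in use */
    bool is_hw[MAX_TASKS];            /* true for hardware tasks */
    bool edge[MAX_TASKS][MAX_TASKS];  /* edge[i][j]: task j depends on task i */
} TaskGraph;

/* Step 1: isolate every software node in place, bridging each of its
 * predecessors to each of its successors. Software nodes stay in the
 * arrays but end up with no edges. */
void remove_software_tasks(TaskGraph *g)
{
    assert(g != NULL && g->n >= 0 && g->n <= MAX_TASKS);
    for (int s = 0; s < g->n; s++) {
        if (g->is_hw[s])
            continue;
        for (int p = 0; p < g->n; p++) {
            if (!g->edge[p][s])
                continue;
            for (int q = 0; q < g->n; q++)
                if (g->edge[s][q] && p != q)
                    g->edge[p][q] = true;   /* bridge p -> q around s */
        }
        for (int k = 0; k < g->n; k++) {    /* drop all edges touching s */
            g->edge[k][s] = false;
            g->edge[s][k] = false;
        }
    }
}

/* Step 2: the deadline of configuration node Ci is the arriving time
 * of Ti minus the configuration time. */
long config_deadline(long arrival_time, long config_time)
{
    return arrival_time - config_time;
}
```

For the chain T4 -> T8 -> T12 with T8 as the only software task, `remove_software_tasks` leaves a single edge from T4 to T12, which matches the example of Figure 2(b). The partitioning and DCP scheduling of steps 3 to 5 are not sketched here.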

    The procedure demonstrated in Figure 2 generates the preconfiguration table, but the result is not guaranteed to be optimal. However, the focus of SHUM-uCOS is not on optimization. In any case, the reconfigurations of the reconfigurable devices are always beneficial to the execution of hardware tasks.

    3.2 Reconfigurable resources management

    SHUM-uCOS uses the RCB (Resource Control Block) structure to trace and control the usage of reconfigurable resources. An RCB is a data structure as follows:

        typedef struct os_rcb {
            INT8U  ResourceArea;        // the area of the resource
            INT16U ResourceNo;          // the unique ID of the resource
            INT8U  ActiveTaskCount;     // the count of sleeping tasks in the resource
            struct os_rcb *OSRCBNext;   // pointer to the next RCB
            struct os_hcb *OSHCBFirst;  // the pointer to the first task in this resource
        } OS_RCB;

    In SHUM-uCOS, a reconfigurable resource is always in one of four states: used, preconfiguration, blank or configuring, as shown in Figure 3. SHUM-uCOS maintains four chains, one for each state.

    Used state: the resource has been configured with a task group, and at least one task in the group is controlled by the scheduler.

    Preconfiguration state: the resource has been configured with a task group, but all the tasks of the group are in the sleeping state, waiting for activation.

    Blank state: there is no task group in the resource, or the resource is going to be reconfigured with a new task group.

    Configuring state: the configuration procedure on the resource is ongoing.
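The four chains can be pictured as four singly linked lists of RCBs, with a resource moving from one list to another when its state changes. The sketch below is only an illustration: the fixed-width typedefs, the `RcbState` enum, the `rcb_chain` array and the helpers `rcb_add`, `rcb_move` and `rcb_count` are assumptions introduced here, and the `OSHCBFirst` pointer is left out to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint8_t  INT8U;   /* assumed widths for the uCOSII-style type names */
typedef uint16_t INT16U;

typedef enum {
    RCB_USED, RCB_PRECONFIG, RCB_BLANK, RCB_CONFIGURING, RCB_STATE_COUNT
} RcbState;

typedef struct os_rcb {
    INT8U  ResourceArea;
    INT16U ResourceNo;
    INT8U  ActiveTaskCount;
    struct os_rcb *OSRCBNext;
} OS_RCB;

static OS_RCB *rcb_chain[RCB_STATE_COUNT];  /* one chain head per state */

/* Push an RCB onto the front of the chain for state st. */
void rcb_add(OS_RCB *rcb, RcbState st)
{
    assert(rcb != NULL);
    rcb->OSRCBNext = rcb_chain[st];
    rcb_chain[st] = rcb;
}

/* Unlink rcb from chain `from` and push it onto chain `to`.
 * Returns 0 on success, -1 if rcb is not on chain `from`. */
int rcb_move(OS_RCB *rcb, RcbState from, RcbState to)
{
    OS_RCB **pp = &rcb_chain[from];
    assert(rcb != NULL);
    while (*pp != NULL && *pp != rcb)
        pp = &(*pp)->OSRCBNext;
    if (*pp == NULL)
        return -1;
    *pp = rcb->OSRCBNext;   /* unlink from the old chain */
    rcb_add(rcb, to);       /* link into the new chain */
    return 0;
}

/* Count the RCBs currently on the chain for state st. */
int rcb_count(RcbState st)
{
    int n = 0;
    for (const OS_RCB *p = rcb_chain[st]; p != NULL; p = p->OSRCBNext)
        n++;
    return n;
}
```

Keeping one chain per state lets the scheduler find, for example, a blank resource by looking at a single list head instead of scanning every RCB.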

    If a task in the preconfiguration state is activated by the scheduler within a given time interval, we call it a preconfiguration hit; otherwise, a preconfiguration miss.

    To reduce the cost of resource configuration, when a resource no longer contains any active task, the scheduler sets its state to preconfiguration instead of putting it into the blank state directly. If a preconfiguration miss subsequently occurs, the resource is moved to the blank state. This approach inserts the preconfiguration state between the used state and the blank state when recycling reconfigurable resources, so that resource recycling works much like a cache for memory. As a result, it improves preconfiguration efficiency.

    Figure 3. The state graph of configurable resources
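The recycling policy can be read as a small state machine. The sketch below encodes the transitions stated in the text (used to preconfiguration when no active task remains, preconfiguration to used on a hit, preconfiguration to blank on a miss). The event names are hypothetical, and the transitions blank to configuring and configuring to preconfiguration are an assumption drawn from the state definitions, since the state graph itself is not reproduced here.

```c
#include <assert.h>

typedef enum { ST_USED, ST_PRECONFIG, ST_BLANK, ST_CONFIGURING } ResState;

typedef enum {
    EV_LAST_TASK_SLEEPS,  /* no active task remains in the resource */
    EV_TASK_ACTIVATED,    /* preconfiguration hit */
    EV_HIT_TIMEOUT,       /* preconfiguration miss: interval expired */
    EV_CONFIG_START,      /* assumed: scheduler starts a configuration */
    EV_CONFIG_DONE        /* assumed: configuration has finished */
} ResEvent;

/* Return the state that follows s when event e occurs.
 * Events that do not apply to s leave the state unchanged. */
ResState res_next_state(ResState s, ResEvent e)
{
    switch (s) {
    case ST_USED:
        if (e == EV_LAST_TASK_SLEEPS) return ST_PRECONFIG;
        break;
    case ST_PRECONFIG:
        if (e == EV_TASK_ACTIVATED) return ST_USED;
        if (e == EV_HIT_TIMEOUT)    return ST_BLANK;
        break;
    case ST_BLANK:
        if (e == EV_CONFIG_START) return ST_CONFIGURING;
        break;
    case ST_CONFIGURING:
        if (e == EV_CONFIG_DONE) return ST_PRECONFIG;
        break;
    default:
        assert(!"unknown resource state");
        break;
    }
    return s;
}
```

The cache analogy shows up in the `ST_PRECONFIG` case: a resource whose tasks have all gone to sleep keeps its configuration, like a cache line that is kept until it must be evicted, and is only given up on a miss.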

    3.3 Hardware-task implementation

    In SHUM-uCOS, the hardware-task implementation is divided into three layers: 1) the timing-convert layer, whose main function is to convert other timings to standard memory timing, e.g., CAN or I2C timing to memory timing; the aim of this layer is to reduce the usage of precious FPGA pins; 2) the primitive layer, which is responsible for managing the states of hardware tasks and providing the synchronization mechanisms; 3) the function entity layer, which implements the user's functions.

    Figure 4. Hardware-task implementation

    The first two layers belong to SHUM-uCOS and are provided as IP (Intellectual Property). The timings between layers are all standard memory timing.

    SHUM-uCOS provides two methods for intertask communication: global variables and message passing, where message passing includes mutex, semaphore and message box. At present there is no semaphore queue or message-box queue support for hardware tasks.

    The hardware-task implementation uses a share-bus protocol: a hardware task can access the main memory if the shared bus is available. If the timing of the main memory is standard memory timing, the timing-convert layer is not needed.

    Four parts compose the primitive operation layer in the standard hardware-task implementation:

    The data pathway: connected with the main memory; it allows the function entity to access the data stored in memory.

    The control pathway: connected with the DMA (direct memory access) signals of the CPU or bus arbiter; it handles bus requests and releases.

    The initialization pathway: connected with the hardware-task controller. It is used to initialize the internal registers of the primitive layer after the creation of hardware tasks.

    Hardware state controller: the core of the primitive operation layer. It interprets CPU commands, controls the hardware task state and reports the task status.

    There are no local registers or memory in the primitive layer; all the data is stored in the main memory. Each hardware task has a Task Interface Control Block (TICB) data structure that defines its control registers, which are mapped into the main memory.

        typedef struct os_ticb {
            INT32U Receive_Cmd;   // command received from CPU
            INT32U Send_Req;      // request sent to CPU
            INT32U Return_Code;   // the result code
            INT32U Param_Reg;     // command parameter
            INT32U Pointer_Reg;   // the pointer to data frame
            INT32U Len_Reg;       // the length of data frame
        } OS_TICB;

    After a hardware task is created, the task ID and the start address of the TICB are saved into registers of the hardware task. These parameters are apt to change at runtime.
    Only with the start address of the TICB can the task state controller access the memory data.

    If a command needs to be sent to a hardware task, the CPU first writes the command into the memory location of the Receive_Cmd parameter in the TICB, then sets Cmd_Acquire in the TICB to tell the hardware task that there is a new command. Finally, the hardware requests the bus and obtains the data.

    If a hardware task asks for a CPU service, it writes the service type into the memory location of the Send_Req parameter in the TICB, then raises an interrupt to notify the CPU. Finally, according to the Send_Req parameter in the TICB, the CPU selects the proper service function.
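The two handshakes can be simulated in plain C to make the order of operations concrete. Everything beyond the six TICB fields is an assumption made for this sketch: the text names a Cmd_Acquire flag but the listing does not show where it lives, so it is modeled here as an extra field, and the service codes and the function names `cpu_send_cmd`, `hw_poll_cmd`, `hw_request_service` and `cpu_service_isr` are hypothetical. Bus arbitration and the interrupt line are not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t INT32U;  /* assumed width for the uCOSII-style type name */

typedef struct os_ticb {
    INT32U Receive_Cmd;   /* command received from CPU */
    INT32U Send_Req;      /* request sent to CPU */
    INT32U Return_Code;   /* the result code */
    INT32U Param_Reg;     /* command parameter */
    INT32U Pointer_Reg;   /* the pointer to data frame */
    INT32U Len_Reg;       /* the length of data frame */
    INT32U Cmd_Acquire;   /* assumed placement of the new-command flag */
} OS_TICB;

enum { SVC_NONE = 0, SVC_SEM_POST = 1, SVC_MBOX_POST = 2 };  /* hypothetical codes */

/* CPU side: write the command first, then raise the flag. */
void cpu_send_cmd(volatile OS_TICB *t, INT32U cmd, INT32U param)
{
    assert(t != NULL);
    t->Param_Reg   = param;
    t->Receive_Cmd = cmd;
    t->Cmd_Acquire = 1;
}

/* Hardware side (simulated): fetch a pending command, if any.
 * Returns 1 and fills cmd/param when a command was pending, else 0. */
int hw_poll_cmd(volatile OS_TICB *t, INT32U *cmd, INT32U *param)
{
    if (!t->Cmd_Acquire)
        return 0;
    *cmd   = t->Receive_Cmd;
    *param = t->Param_Reg;
    t->Cmd_Acquire = 0;
    return 1;
}

/* Hardware side (simulated): post a service request for the CPU. */
void hw_request_service(volatile OS_TICB *t, INT32U service)
{
    t->Send_Req = service;
}

/* CPU side: select the service according to Send_Req and report a result.
 * Returns the service code that was handled. */
INT32U cpu_service_isr(volatile OS_TICB *t)
{
    INT32U svc = t->Send_Req;
    t->Send_Req = SVC_NONE;
    switch (svc) {
    case SVC_SEM_POST:
    case SVC_MBOX_POST:
        t->Return_Code = 0;  /* request accepted */
        break;
    default:
        t->Return_Code = 1;  /* unknown request */
        break;
    }
    return svc;
}
```

Writing `Receive_Cmd` before setting the flag mirrors the order given in the text: the hardware task only looks at the command after it sees the flag, so it never reads a half-written command.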


    4. Experiment Results

    4.1 OS performance evaluation

    Operating systems using a uniform multi-task model are a rather new line of research, and so far there are no explicit numerical results to compare with. To demonstrate the quality of the proposed operating system, we evaluate the performance of SHUM-uCOS using the Rhealstone benchmark [11] and compare it with uCOS.

    Rhealstone is a well-known benchmark for real-time operating systems. It identifies the execution times (or time delays) associated with six operations that are vital indicators of real-time multitasking system performance: task-switch time, preemption time, interrupt latency, semaphore-shuffle time, deadlock-break time and intertask message latency. Rhealstone is intended to be independent of the CPU architecture, and it adopts a small Whetstone benchmark as the workload of each task. Because the tasks in uCOS are all software tasks, we do not implement the Whetstone in hardware in SHUM-uCOS, but wait for the same time as the software execution to keep the result independent of the task workload.

    In Rhealstone, the task-switch time is defined as the average time to switch between two active tasks of equal priority. But SHUM-uCOS inherits the priority-based scheduling policy of uCOS and likewise does not allow tasks with equal priority. Thus, we have skipped the first experiment.

    We use a platform composed of four Altera FPGAs. The experiment environment is as follows: 1) FPGA: EP1C6-8 of the Altera Cyclone series [12]. 2) CPU: NIOSII standard version at 50MHz [13], a soft-core CPU from Altera Corporation. 3) Benchmark: Rhealstone benchmark; 10 bytes are sent each time the message box is used. 4) Targets for test: SHUM-uCOS Ver1.0 and uCOSII Ver2.76. 5) System tick period: 1ms. 6) Main memory: IDT71V416 SRAM.

    Table 1. The Rhealstone benchmark results (unit: us)

    | SHUM-uCOS | uCOS | Remark |

    (SWT: Software task; HWT: Hardware task)


    4.2 Case study

    SHUM-uCOS has been used in a VOIP terminal. For this project, the most important part is voice compression and decompression, which greatly affects system performance because of its heavy computation load. To demonstrate the performance difference between the two implementations, we migrate the ADPCM compression (decompression) from a software implementation to a hardware implementation. The hardware task communicates with the CPU through a message box. We choose the ITU G.726 standard for the voice compression (decompression) and increase the workload of the system by increasing the compression ratio. To make the final result distinct, we set the frequency of the NIOS to 15MHz, which is much lower than usual. If the CPU is still busy with the older frame, the new frame is discarded. We evaluate the performance through the frame-lost ratio. The result is shown in Table 2.
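The frame-lost metric can be illustrated with a small model. The sketch assumes frames arrive at a fixed period and each accepted frame keeps the CPU busy for a fixed service time; a frame that arrives while the CPU is still busy is discarded. This timing model, the function names and the numbers used in the note below are illustrative assumptions, not measurements from the case study.

```c
#include <assert.h>

/* Count the frames discarded out of n_frames when a frame arrives every
 * period_us and each accepted frame occupies the CPU for service_us. */
int count_lost_frames(int n_frames, long period_us, long service_us)
{
    long busy_until = 0;  /* time at which the CPU becomes free again */
    int  lost = 0;

    assert(n_frames >= 0 && period_us > 0 && service_us >= 0);
    for (int i = 0; i < n_frames; i++) {
        long arrival = (long)i * period_us;
        if (arrival < busy_until)
            lost++;                             /* CPU busy: discard frame */
        else
            busy_until = arrival + service_us;  /* accept and process */
    }
    return lost;
}

/* Frame-lost ratio in the range 0.0 to 1.0. */
double frame_lost_ratio(int n_frames, long period_us, long service_us)
{
    if (n_frames == 0)
        return 0.0;
    return (double)count_lost_frames(n_frames, period_us, service_us) / n_frames;
}
```

In this model, a service time that is 1.5 times the frame period causes every second frame to be dropped, so `frame_lost_ratio(10, 10, 15)` is 0.5, while any service time no longer than the period gives a ratio of 0. Moving the codec into a hardware task corresponds to shrinking the CPU-side service time.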

    Table 2. Frame-lost ratios for various implementations

    | Compression/decompression standard | Software task | Hardware task |

    Table 2 shows that the frame-lost ratio decreases dramatically after the compression (decompression) task migrates from software to hardware. It is true that any migration from SW to HW is apt to increase system performance, and the more important the migrated function is, the more benefit we get. However, with SHUM-uCOS, this kind of migration is more natural and affects the other parts less. In this case study, we changed only 13 locations to migrate the compression/decompression functions from software to hardware successfully, which is even beyond our expectation.

    5. Conclusion

    We implemented an RTOS based on the multi-task model. The aim of this approach is to provide a uniform platform for both software and hardware engineers and to reduce the migration cost in embedded system designs, where migration is a time-consuming step in the whole design flow. SHUM-uCOS traces and manages the states of reconfigurable resources (FPGAs), allowing hardware tasks to execute in a true multitasking manner. The Rhealstone benchmarks show that SHUM-uCOS has almost the same performance as uCOSII when dealing with software tasks only. Furthermore, it can also handle hardware tasks. The thirteen modifications in our VOIP case study show that SHUM-uCOS can greatly shorten the migration time while improving performance.

    References

    [1] Jean J. Labrosse. MicroC/OS-II: The Real-Time Kernel, Second Edition. CMP Books, 2002.
    [2] David Andrews and Douglas Niehaus. Programming Models for Hybrid FPGA-CPU Computational Components: A Missing Link. IEEE Micro, Volume 24, Issue 4, July-Aug. 2004, pp. 42-53.
    [3] S. Donthi and R.L. Haggard. A Survey of Dynamically Reconfigurable FPGA Devices. Proceedings of the 35th Southeastern Symposium on System Theory, 16-18 March 2003, pp. 422-426.
    [4] O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt. Dynamic scheduling of tasks on partially reconfigurable FPGAs. In IEE Proceedings on Computers and Digital Techniques, volume 147, pages 181-188, May 2000.
    [5] Katherine Compton, James Cooley, Stephen Knol, and Scott Hauck. Configuration Relocation and Defragmentation for Reconfigurable Computing. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM). IEEE CS Press, April 2003.
    [6] Kiarash Bazargan, Ryan Kastner, and Majid Sarrafzadeh. Fast Template Placement for Reconfigurable Computing Systems. In IEEE Design and Test of Computers, volume 17, pages 68-83, 2000.
    [7] Herbert Walder, Christoph Steiger, and Marco Platzner. Fast Online Task Placement on FPGAs: Free Space Partitioning and 2D-Hashing. In Proceedings of the 10th Reconfigurable Architectures Workshop (RAW). IEEE CS Press, April 2003.
    [8] Thomas H. Cormen and Charles E. Leiserson. Introduction to Algorithms. The MIT Press, 2001, pp. 1043-1054.
    [9] Karthikeya M. Gajjala Purna and Dinesh Bhatia. Temporal Partitioning and Scheduling Data Flow Graphs for Reconfigurable Computers. IEEE Transactions on Computers, 1999, pp. 579-590.
    [10] Kwok YK, Ahmad I. Dynamic critical-path scheduling: An effective technique for allocating task graphs to multiprocessors. IEEE Trans. on Parallel and Distributed Systems, 1996, 7(5): 506-521.
    [11] Rabindra P. Kar. Implementing the Rhealstone Real-time Benchmark. Dr. Dobb's Journal, Sep. 1990.
    [12] Altera Corporation. Cyclone Programmable Logic Device Family Datasheet, 2003, http://www.altera.com.
    [13] Peng Cheng-lian and Zhou Bo. SOPC design and practice using NIOS. Beijing, Tsinghua Press, 2004.
