
Scheduling and Resource Sharing (on FPGAs)
Digital Systems Design

José M. Leitão, Marcos Antunes, Sérgio Paiágua

November 2012

1 Introduction

Much of the complexity involved in designing a digital system stems from the usually very stringent constraints involved. These usually include limited resources, a certain target throughput or even limitations on power dissipation. Each of these parameters must then be taken into account during the design process, so that the circuit is tailored right from the early stages with these requirements in mind.

In this second project for the course Digital Systems Design, an image processing algorithm is implemented with very limited hardware resources, so that the concepts of Scheduling and Resource Sharing in the context of digital circuit design are explored. In particular, a Sobel-like operator is to be implemented. This operator computes an approximation of the gradient of an image's intensity function. The two directions of the gradient vector are obtained by performing a convolution between the source image and two convolution kernels of size 3 × 3, with parametrizable entries.

This report is organized as follows. Section 2 presents the data flow graph corresponding to one inner loop iteration of the algorithm; within that section, ASAP and ALAP scheduling are applied, in addition to list scheduling. In section 3, the circuit that implements the algorithm is described in detail, focusing on its two main elements, the datapath and the control unit. The following section discusses the obtained circuit based on the resource utilization and the obtained performance, i.e., maximum operating frequency and throughput. Finally, section 5 concludes the report by discussing to what extent the objectives set forth in the beginning were accomplished and how the circuit could be improved in terms of throughput, with or without increasing the available resources.

2 Scheduling

As described in the introduction, the Sobel operator relies on the convolution of an image with two convolution kernels of size 3 × 3, which are presented below:

\[
G_x = \begin{bmatrix} -a & 0 & +a \\ -b & 0 & +b \\ -c & 0 & +c \end{bmatrix}
\qquad
G_y = \begin{bmatrix} -a & -b & -c \\ 0 & 0 & 0 \\ +a & +b & +c \end{bmatrix}
\]



The constants a, b and c are positive numbers that can be adjusted to obtain different filtering effects. For simplicity, the image to process is composed of 10 × 20 pixels with values ranging from 0 to 255, i.e., the image is in grayscale format. When the kernel is being applied to a given pixel S(0,0), the following notation is employed to describe the adjacent pixels:

\[
\begin{bmatrix}
S_{(-1,-1)} & S_{(-1,0)} & S_{(-1,1)} \\
S_{(0,-1)} & S_{(0,0)} & S_{(0,1)} \\
S_{(1,-1)} & S_{(1,0)} & S_{(1,1)}
\end{bmatrix}
\]

In the following subsections, the scheduling of the hardware utilization is devised only for one inner loop iteration of the algorithm. As such, the expression to implement is the following:

\[
\begin{cases}
g_x = -a \times S_{(-1,-1)} - b \times S_{(0,-1)} - c \times S_{(1,-1)} + a \times S_{(-1,1)} + b \times S_{(0,1)} + c \times S_{(1,1)} \\
g_y = -a \times S_{(-1,-1)} - b \times S_{(-1,0)} - c \times S_{(-1,1)} + a \times S_{(1,-1)} + b \times S_{(1,0)} + c \times S_{(1,1)}
\end{cases}
\]
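As a quick illustration, the expression above can be exercised with a small software reference model. The sketch below is written in Python and is purely illustrative (it is not the VHDL implementation described later); the neighborhood indexing mirrors the S(i,j) notation.

```python
# Software reference model of one inner-loop iteration: computes (gx, gy) for a
# single pixel from its 3x3 neighborhood and the coefficients a, b and c.
def sobel_like_pixel(S, a, b, c):
    """S maps (row_offset, col_offset) -> pixel value in the range 0..255."""
    gx = (-a * S[(-1, -1)] - b * S[(0, -1)] - c * S[(1, -1)]
          + a * S[(-1, 1)] + b * S[(0, 1)] + c * S[(1, 1)])
    gy = (-a * S[(-1, -1)] - b * S[(-1, 0)] - c * S[(-1, 1)]
          + a * S[(1, -1)] + b * S[(1, 0)] + c * S[(1, 1)])
    return gx, gy
```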

2.1 Scheduling with no area restrictions

From the equation above, and considering that each operation takes one clock cycle to be performed, it is trivial to obtain the corresponding data flow graph (DFG). This type of graph exposes the data dependencies that exist between the various operations and the level of mobility associated with each operation, i.e., how large the temporal window is within which the node can be positioned without increasing the latency of the circuit. When all the nodes, or operations, have zero mobility, techniques such as ASAP (as soon as possible) and ALAP (as late as possible) scheduling offer no freedom, since both produce the same schedule.

In figure 2.1, the result of the application of both of the previously mentioned techniques is shown, considering that no hardware restrictions exist, meaning that, as long as there are no data dependencies, all operations can run in parallel. It should be noted that the DFG shown already includes an optimization, which arises from the observation that S(−1,−1) and S(1,1) are multiplied by the same constants, a and c respectively, in both expressions.

From the analysis of the diagram, it is clear that only two of the nodes have non-zero mobility and, as such, are scheduled differently depending on the approach taken. However, as none of these operations are part of the critical path, their position within the schedule does not impact the latency of the complete circuit. In either case, whether an ALAP or an ASAP approach is taken, the values of gx and gy for each processed pixel are obtained within 4 clock cycles.
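For reference, ASAP, ALAP and the resulting mobility of each node can be computed mechanically from the DFG. The sketch below (Python, illustrative only; the node names are arbitrary and do not follow the numbering of figure 2.1) assumes single-cycle operations.

```python
def asap_alap(deps):
    """deps maps each node to the list of its predecessor nodes."""
    asap = {}
    def asap_of(n):
        if n not in asap:
            asap[n] = 1 + max((asap_of(p) for p in deps[n]), default=0)
        return asap[n]
    for n in deps:
        asap_of(n)

    latency = max(asap.values())
    succs = {n: [m for m in deps if n in deps[m]] for n in deps}

    alap = {}
    def alap_of(n):
        if n not in alap:
            alap[n] = min((alap_of(s) - 1 for s in succs[n]), default=latency)
        return alap[n]
    for n in deps:
        alap_of(n)

    mobility = {n: alap[n] - asap[n] for n in deps}
    return asap, alap, mobility

# Example: two independent multiplications feeding one addition.
asap, alap, mob = asap_alap({"mul1": [], "mul2": [], "add1": ["mul1", "mul2"]})
```

Nodes with zero mobility form the critical path; the two mobile nodes mentioned above are exactly those for which alap − asap > 0.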

2.2 Scheduling with area restrictions

Since the previous analysis did not take into account the restriction on the number of occupied resources, the obtained scheduling is far from optimal. The specified maximum number of units for each type of processing element is listed in table 2.1.



Figure 2.1: ASAP and ALAP scheduling diagram, ignoring the hardware restrictions and considering single-cycle operations.

In order to meet the project requirements with minimal impact on the overall throughput, the mathematical expression for computing a result (gx, gy) was rearranged, using the distributive property of multiplication, as follows:

\[
\begin{cases}
g_x = -a \times S_{(-1,-1)} - b \times S_{(0,-1)} - c \times S_{(1,-1)} + a \times S_{(-1,1)} + b \times S_{(0,1)} + c \times S_{(1,1)} \\
g_y = -a \times S_{(-1,-1)} - b \times S_{(-1,0)} - c \times S_{(-1,1)} + a \times S_{(1,-1)} + b \times S_{(1,0)} + c \times S_{(1,1)}
\end{cases}
\]
\[
\Leftrightarrow
\begin{cases}
g_x = a \times [S_{(-1,1)} - S_{(-1,-1)}] + b \times [S_{(0,1)} - S_{(0,-1)}] + c \times [S_{(1,1)} - S_{(1,-1)}] \\
g_y = a \times [S_{(1,-1)} - S_{(-1,-1)}] + b \times [S_{(1,0)} - S_{(-1,0)}] + c \times [S_{(1,1)} - S_{(-1,1)}]
\end{cases}
\]

The new expression is advantageous for two main reasons. Firstly, it requires fewer arithmetic operations, which by itself decreases the significance of the scheduling limitations imposed by the project's hardware requirements. Secondly, it balances the number of occurrences of each arithmetic operation, which happens to be in tune with the number of available resources, i.e., the scarcest arithmetic unit corresponds to the least frequent operation.
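A quick numeric check of the rearrangement is shown below (an illustrative Python sketch, not part of the project deliverables; the coefficient values are arbitrary examples).

```python
# On a random 3x3 neighborhood the rearranged expression must match the original
# one; it uses 6 multiplications, 6 subtractions and 4 additions per (gx, gy) pair.
import random

S = {(i, j): random.randint(0, 255) for i in (-1, 0, 1) for j in (-1, 0, 1)}
a, b, c = 1, 2, 1   # example coefficients; the report keeps them parametrizable

gx_orig = (-a*S[(-1,-1)] - b*S[(0,-1)] - c*S[(1,-1)]
           + a*S[(-1,1)] + b*S[(0,1)] + c*S[(1,1)])
gy_orig = (-a*S[(-1,-1)] - b*S[(-1,0)] - c*S[(-1,1)]
           + a*S[(1,-1)] + b*S[(1,0)] + c*S[(1,1)])

gx_new = a*(S[(-1,1)] - S[(-1,-1)]) + b*(S[(0,1)] - S[(0,-1)]) + c*(S[(1,1)] - S[(1,-1)])
gy_new = a*(S[(1,-1)] - S[(-1,-1)]) + b*(S[(1,0)] - S[(-1,0)]) + c*(S[(1,1)] - S[(-1,1)])

assert (gx_orig, gy_orig) == (gx_new, gy_new)
```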

Processing Element    Maximum number of units
Multiplier            2
Subtractor            2
Adder                 1

Table 2.1: Area restrictions.

Figure 2.2 depicts the ASAP and ALAP scheduling diagrams that were obtained for this new formula, ignoring both the memory access delays and the number of available resources. Taking those diagrams as a starting point, it was straightforward to extract a scheduling solution which accounts for the hardware limitations.

Figure 2.2: ASAP and ALAP scheduling diagrams for the rearranged formula, ignoring the hardware restrictions and considering single-cycle operations.

Although the assumption of single-cycle operations does not represent a real constraint, it remains useful for scheduling purposes, since it provides a clear visualization of the presumable critical path. The scheduling presented in table 2.2 was created using the critical-path priority principle and taking advantage of the mobility of nodes 11 and 12. Note that the scheduling list itself does not assume single-cycle operations, since this project was not supposed to make use of a pipelined datapath.

Considering that the only restriction is on the number of arithmetic units populating the circuit, a maximum throughput of 0.25 pixels per cycle (ppc) could be achieved using the proposed scheduling. The restriction to a single shared adder was found to be the limiting factor, since the four addition operations necessarily require four clock cycles to be computed.



Cycle   Nodes              × (≤ 2)   − (≤ 2)   + (≤ 1)
1       1, 3, 7, 9, 13        2         2         1
2       2, 4, 8, 10, 14       2         2         1
3       5, 6, 11, 12, 15      2         2         1
4       16                    0         0         1

Table 2.2: Scheduling list solution limited by the resource availability.
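The same kind of schedule can be reproduced with a generic resource-constrained list scheduler. The sketch below (Python, illustrative; it is not the tool used in the project, and the actual DFG and node numbering of figure 2.2 would have to be supplied to it) uses the longest path to a sink as the priority function.

```python
def list_schedule(ops, deps, limits):
    """ops: node -> unit type ('*', '-', '+'); deps: node -> predecessor list;
    limits: unit type -> number of units available per cycle."""
    memo = {}
    def depth(n):                       # longest path from n to any sink
        if n not in memo:
            succ = [m for m in deps if n in deps[m]]
            memo[n] = 1 + max((depth(s) for s in succ), default=0)
        return memo[n]

    schedule, done, cycle = {}, set(), 0
    while len(done) < len(ops):
        cycle += 1
        used = {u: 0 for u in limits}
        # nodes whose predecessors have all been scheduled in earlier cycles
        ready = sorted((n for n in ops if n not in done and
                        all(p in done and schedule[p] < cycle for p in deps[n])),
                       key=depth, reverse=True)
        for n in ready:
            if used[ops[n]] < limits[ops[n]]:
                schedule[n] = cycle
                used[ops[n]] += 1
        done.update(schedule)
    return schedule

# With limits {'*': 2, '-': 2, '+': 1} and the DFG of figure 2.2, this should
# yield a 4-cycle schedule equivalent to the one in table 2.2.
```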

3 Hardware Design

As specified by the project guidelines, the hardware was structured using the FSMD paradigm. In this way, a Control Unit was designed to control the flow of the data that is to be passed through the Datapath, where all the arithmetic operations related to the image filter convolution take place. Given the strict requirements in terms of the number of processing elements available, the bottleneck for the processor throughput is more likely to be in the Datapath than in the Control Unit, which is why the hardware design effort was mostly focused on devising a suitable architecture for the former.

3.1 Datapath

In section 2.2, a scheduling scheme was derived for a single pixel computation considering the specified area restrictions. However, some additional constraints need to be taken into account before mapping it into a valid hardware implementation, namely regarding the input memory accesses.

As stated in the project guidelines, the developed circuit can access three in-line pixels per cycle, thus resulting in the memory reading sequence represented in figure 3.1. Pi(j) corresponds to the pixel read at the i-th memory output j cycles ago.

Figure 3.1: Input memory reading sequence.



If the pixels read in each cycle are stored for further processing, the previous scheduling solution, shown in table 2.2, turns out to be easily adaptable. For instance, in cycle 2 the available pixels can be matched with the previous terminology:

\[
\begin{matrix}
S_{(-1,-1)} \to P_1(2) & S_{(-1,0)} \to P_2(2) & S_{(-1,1)} \to P_3(2) \\
S_{(0,-1)} \to P_1(1) & S_{(0,0)} \to P_2(1) & S_{(0,1)} \to P_3(1) \\
S_{(1,-1)} \to P_1 & S_{(1,0)} \to P_2 & S_{(1,1)} \to P_3
\end{matrix}
\]

Replacing the variables’ names in the formula of the filter yields the following expression:

\[
\begin{cases}
g_x = a \times [P_3(2) - P_1(2)] + b \times [P_3(1) - P_1(1)] + c \times [P_3 - P_1] \\
g_y = a \times [P_1 - P_1(2)] + b \times [P_2 - P_2(2)] + c \times [P_3 - P_3(2)]
\end{cases}
\]
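Behaviorally, this amounts to pushing the three memory outputs into short delay lines and combining current and delayed samples. The following Python sketch is an assumption-laden software model of that behavior (it is not the VHDL datapath; the first two outputs correspond to the transient phase in which the delay registers still hold zeros).

```python
from collections import deque

class DelayLine:
    """Small shift register: index j returns the value pushed j cycles ago."""
    def __init__(self, depth):
        self.reg = deque([0] * depth, maxlen=depth)
    def push(self, value):
        self.reg.appendleft(value)
    def __call__(self, j):          # P(j); P(0) is the current sample
        return self.reg[j]

def process_column(stream, a, b, c):
    """stream yields (P1, P2, P3) tuples, one per clock cycle."""
    P1, P2, P3 = DelayLine(3), DelayLine(3), DelayLine(3)
    for p1, p2, p3 in stream:
        P1.push(p1); P2.push(p2); P3.push(p3)
        gx = a*(P3(2) - P1(2)) + b*(P3(1) - P1(1)) + c*(P3(0) - P1(0))
        gy = a*(P1(0) - P1(2)) + b*(P2(0) - P2(2)) + c*(P3(0) - P3(2))
        yield gx, gy
```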

In spite of the correctness of the previous expression, it requires some rescheduling in order to distribute the operations over the several cycles, thus complying with the imposed area restrictions. Applying a certain time shift to each pixel enables the necessary computation to be spread over four clock cycles, noting that there is a first hold cycle (0) where no operation is performed, which still adds to the overall datapath latency.

\[
g_x = \overbrace{a \times [P_3(1) - P_1(1)] + b \times [P_3 - P_1]}^{\text{Cycle 1}} \;\; \overbrace{+\; c \times [P_3 - P_1]}^{\text{Cycle 2}}
\]
\[
g_y = \underbrace{a \times [P_1 - P_1(2)]}_{\text{Cycle 2}} \;\; \underbrace{+}_{\text{Cycle 4}} \;\; \underbrace{b \times [P_2(1) - P_2(3)] + c \times [P_3(1) - P_3(3)]}_{\text{Cycle 3}}
\]

The rescheduled algorithm can then be summarized in the following table:

Cycle   Computed terms                                                         × (≤ 2)   − (≤ 2)   + (≤ 1)
0       -                                                                          -         -         -
1       gx_1 = a × [S(−1,1) − S(−1,−1)] + b × [S(0,1) − S(0,−1)]                   2         2         1
2       gx = gx_1 + c × [S(1,1) − S(1,−1)] ;  gy_1 = a × [S(1,−1) − S(−1,−1)]      2         2         1
3       gy_2 = b × [S(1,0) − S(−1,0)] + c × [S(1,1) − S(−1,1)]                     2         2         1
4       gy = gy_1 + gy_2                                                           0         0         1

Table 3.1: Scheduling list for the rescheduled solution considering both area and memory constraints.

The algorithm was then translated into a hardware implementation of the datapath, as depicted in figure 3.2. The devised circuit depends on eleven control signals to ensure the desired functionality is achieved. A description of these signals can be found in section 3.2, where the inner workings of the Control Unit are detailed.

Figure 3.2: Datapath block diagram.

Despite the first hold cycle, the obtained circuit is able to reach the maximum throughput of 0.25 ppc if Cycle 4 is overlapped with Cycle 0 of the next iteration. Assuming that is the case, only the first and last occurrences will contribute to the datapath latency, since each one of the remaining pixels is computed in four cycles (T_Conv). Considering one extra cycle for reading the filter constants (a, b, c) from the input memory, and another one to activate the done flag after the last result is written to memory, the total overhead (T_OH) for the convolution of one complete line will be four clock cycles. Hence, the general formula for the number of cycles required to compute a line of the filtered image, which can be seen as a single instruction, is given by:

\[
\text{CPI} = T_{OH} + T_{Conv} \times (N - 2)
\]

In this particular case, considering images 10 pixels wide, CPI = 4 + 4 × (10 − 2) = 36, which was confirmed against the simulation.

3.2 Control Unit

The limited resources available to implement the convolution operation dictate that the functional units present in the Datapath must be shared and time-multiplexed. Thus, it is the task of the Control Unit to effectively coordinate the access to these resources along the multiple cycles required to perform a convolution centered on a given pixel.

As made clear in the previous section, the devised Datapath relies on the timely activation and deactivation of 11 control signals, each of them controlling either multiplexers or registers' write enables. To properly compute the values of gx and gy, the sequence of signals shown in figure 3.3 is required. The signal names presented there are based on those of figure 3.2, although the actual signal names used within the hardware description of the unit are shorthand versions of these.

Figure 3.3: Sequence of control signals needed to compute gx and gy.

For simplicity, the 11 control signals were combined into a 13-bit control word, where the two extra bits are due to the 2-bit width of the signals rsub2 and gy_sel. This control word is then routed to the Datapath block, where it is separated into the necessary control signals. Although the Datapath also requires an abc_we signal to trigger the storage of the three constants, this signal was not included in the control word mentioned, as it is only used once after the dedicated processor is reset, given that the a, b and c coefficients do not change during the processing of an image. In addition, the Control Unit is also responsible for managing the input and output memories, generating their read and write addresses, respectively, as well as the write-enable signal for the latter. Finally, the done signal is also generated within this unit, after a full line of the image has been processed.
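To make the packing concrete, the sketch below shows one possible way to assemble and split such a control word (Python, purely illustrative; apart from rsub2 and gy_sel, the field names and their order are hypothetical placeholders, since the actual layout is defined in the VHDL).

```python
FIELDS = [            # (name, width): 11 signals, 13 bits in total
    ("rsub2", 2), ("gy_sel", 2),
    ("mux_a", 1), ("mux_b", 1), ("mux_c", 1), ("rsub1_we", 1), ("racc_we", 1),
    ("rgx_we", 1), ("rgy_we", 1), ("out_we", 1), ("rin_we", 1),
]

def pack(values):
    """values: dict name -> integer; returns the 13-bit control word."""
    word, offset = 0, 0
    for name, width in FIELDS:
        word |= (values.get(name, 0) & ((1 << width) - 1)) << offset
        offset += width
    return word

def unpack(word):
    """Splits a control word back into its individual fields."""
    out, offset = {}, 0
    for name, width in FIELDS:
        out[name] = (word >> offset) & ((1 << width) - 1)
        offset += width
    return out
```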

The signal sequence of figure 3.3 was implemented directly as an FSM, leaving all possible optimizations to the synthesis process, so that the cross-checking between the table and the respective hardware implementation could be done trivially, simply by inspecting the VHDL source. This proved to be a good decision, as the effort that would have been put towards optimizing the generation of these signals would be made redundant by the action of the synthesis tool.

The complete finite-state machine is composed of 9 states and follows a Moore approach, as the output of each state is uniquely a function of the internal state of the machine and not a combination of the state with the external inputs, as is the case for Mealy machines. The diagram representing the FSM is depicted in figure 3.4, where it should be noted that all the output signals default to logic value 0 in each state, unless otherwise specified.

Figure 3.4: Finite-state machine that generates the control signals for the datapath.

While the rst signal is high, the state machine is held in HOLD_STATE, which generates the address corresponding to the position where the three constants are kept in the input memory. In addition, the address of the output memory is set to zero. As soon as the reset is released, the FSM iterates through the following states:

• GET_ABC: In this state, the abc_we signal is asserted, in order to store the values of the three coefficients. The address of the input memory is also set to one, to initiate the reading of the image, three pixels at a time, which will start taking place in the next state.

• STATE_0 → STATE_4: In this sequence of states, the input memory address is incremented by 10 three consecutive times, after which it is decreased by 29, which corresponds to shifting the convolution window by one position to the right (a short software sketch of this address pattern is given after this list). During these cycles, the outputs are those presented in the control table at the beginning of this section. Once STATE_1 is reached, the write address is incremented by one unit and the write operation is enabled.

• STATE_1_Init: This state, similarly to STATE_0, is part of the transient phase of the processing and, therefore, only occurs once after every reset of the dedicated processor, due to the necessity of filling the array of input registers of the datapath before the processing can start occurring at a constant pace. After STATE_1 is reached for the first time, the sequence of states will be STATE_1 → STATE_2 → STATE_3 → STATE_4, thus generating a value for gx and gy every four clock cycles.



• STATE_Done: This state is reached after a full line of pixels has been written to the output memory, i.e., the output memory address has reached position 7. End of operation is signalled by asserting the Done signal, and the next state is set to be HOLD_STATE, as the task of the processor is to operate on a single image line.
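As referred to above, the input-address pattern of STATE_0 through STATE_4 can be summarized by a small generator (Python, illustrative; the starting address and the number of iterations are assumptions, not values taken from the VHDL).

```python
def input_addresses(start=1, windows=8):
    """Yields one input-memory read address per processing cycle: the address is
    incremented by 10 three consecutive times and then decreased by 29, so the
    net effect of each 4-cycle group is to move the 3x3 window one column right."""
    addr = start
    for _ in range(windows):
        for step in (10, 10, 10, -29):
            yield addr
            addr += step
```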

4 Resource Utilization and Performance Results

Complete documentation of a given digital circuit must always include the number of occupied resources, as well as the maximum clock frequency at which the circuit functions correctly. The latter, together with the Cycles per Instruction (CPI), provides a precise measurement of the overall performance.

Table 4.1 displays the absolute and relative occupation of the main types of cells featured in the targeted FPGA device. The reported Post-Place and Route results were obtained separately for the Datapath and the Control Unit, in addition to the values for the final Image Processor.

Circuit                    Slice Flip Flops   Slices      LUTs       Multipliers (DSP)
Datapath                   125 (6%)           120 (12%)   146 (7%)   2 (50%)
Control Unit               29 (1%)            28 (1%)     32 (2%)    0
Image Processor (D+CU)     154 (8%)           153 (15%)   186 (9%)   2 (50%)

Table 4.1: Resource utilization of the Image Processor on the targeted XC3S100E device.

The synthesis report confirmed compliance with the constraints, as 3 adders/subtractors and 2 multipliers were inferred by the tool for the Datapath. As expected, the core components of the Control Unit were also detected, such as the finite-state machine and the pair of adders used to compute the addresses for the input and output memories.

Regarding the performance, a maximum operating frequency of 65.4 MHz was achieved, which, combined with the CPI derived in section 3.1, leads to an average throughput of

\[
\frac{N_{pixels}}{\text{CPI}} \times f = \frac{8\ \text{pixels}}{36} \times 65.4\,\text{MHz} \approx 14.5\ \text{Mpixels/s},
\]

or about 1.81 M instructions (image lines) per second. As expected, the critical path is the one that traverses the serial connection of the subtractor, multiplier and adder.
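The reported numbers are easy to re-derive; the following is a trivial Python check of the arithmetic above.

```python
f_clk = 65.4e6           # maximum operating frequency, in Hz
cpi = 36                 # cycles per instruction (one 10-pixel image line)
pixels_per_line = 8      # a 10-pixel-wide line minus the two border pixels

lines_per_second = f_clk / cpi                    # ~1.81 million lines/s
throughput = pixels_per_line * lines_per_second   # ~14.5 Mpixels/s
```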

Given the provided specifications, the developed circuit stands as an efficient solution, since it provides a relatively high throughput using a fair amount of resources. This means there is still margin for improvement in terms of performance, if one revisits the classical area/throughput tradeoff.

5 Conclusion and Future Work

The dedicated processor described in this report managed to achieve all the required specifications in terms of hardware resource utilization while obtaining a fair throughput, producing one pixel every four clock cycles, neglecting the initial overheads that quickly become irrelevant when computing the convolution operation over a full image.

However, there is still room for improvement, both in the pixel production rate and in the maximum operating frequency. The latter can easily be achieved by employing pipelining techniques. In particular, given that the critical path was determined to go through the branch where the subtractor, multiplier and adder lie in series, introducing a register at the output of both multipliers (allowing the synthesis tool to place it inside the multipliers themselves) would effectively reduce the length of this path, therefore allowing an increase in the operating frequency. The downside to this approach would be the increase in the latency of the circuit, which would only have an impact on the initial overhead. Again, considering the processing of a full image, this increased time needed to "fill" the pipeline would be negligible.

The rate at which new pixels are made available to the datapath is determined by the input memories, which, in this case, provide 3 pixels per cycle. This immediately sets a maximum pixel production rate of one pixel every three clock cycles, the time it takes to read the 9 pixels needed to perform a local convolution. By inspecting table 3.1 once again, it is clear that the throughput of the dedicated processor is limited by the existence of a single adder. By relaxing this hardware restriction and adding a second adder, it is possible to perform two additions in cycle 3, instead of waiting one extra cycle to perform the final addition. The updated scheduling is represented in table 5.1.

Cycle   Computed terms                                                         × (≤ 2)   − (≤ 2)   + (≤ 2)
0       -                                                                          -         -         -
1       gx_1 = a × [S(−1,1) − S(−1,−1)] + b × [S(0,1) − S(0,−1)]                   2         2         1
2       gx = gx_1 + c × [S(1,1) − S(1,−1)] ;  gy_1 = a × [S(1,−1) − S(−1,−1)]      2         2         1
3       gy = gy_1 + b × [S(1,0) − S(−1,0)] + c × [S(1,1) − S(−1,1)]                2         2         2

Table 5.1: Result of a list scheduling limited to 2 multipliers, 2 subtractors and 2 adders.

If, once again, cycle 3 is overlapped with cycle 0 after the first local convolution, the result is a throughput of 1 pixel every 3 clock cycles, which corresponds to the maximum achievable performance for an input of 3 pixels per cycle. Finally, the revised datapath reflecting this enhancement is presented in figure 5.1.



Figure 5.1: Revised Datapath block diagram to include pipelining and an additional adder.
