
GCD FPGA-Based Design Ibrahim Hazmi - V00835716

Design and Implementation of the Euclidean Algorithm for Computing the Greatest Common Divisor using

ELEC569A Project Final Report (Fall, 2014) Supervised by Dr. Mihai Sima

ELEC569A – PROJECT REPORT GCD FPGA-BASED DESIGN FALL 2014

Contents

GCD FPGA-Based Design
Contents
List of Figures
List of Tables
Visual Executive Summary
Introduction
    Background of GCD and Euclidean Algorithm
    Overview of Spartan-6 FPGA
    Project Description and Milestones
Detailed Description of the Design
    Behavioural Level: WHILE/FOR LOOP
    Behavioural Level: From ASM to FSM (ASM2FSM)
    Structural level: GCD Data-Path and Control Units (GCD2SUB)
    Structural level: GCD with Sum of Absolute Difference (GCDSAD)
Overview of the Results
    Summary of the results for the different architectures
    The results in a Chart
    The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD)
Summary and Conclusion
Final Thoughts and Suggestions
Bibliography

IBRAHIM HAZMI - [email protected] V00835716


List of Figures

Fig.1 Prime Factorization method for finding the GCD of two integers
Fig.2 Euclidean Algorithm
Fig.3 Simplified Euclidean GCD Algorithm
Fig.4 XC6SLX25 Floor-plan View in PlanAhead
Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead
Fig.6 The Design Strategy Window from Xilinx Project Navigator
Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm
Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP
Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model
Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model
Fig.11 From ASM GCD to Finite State Diagram
Fig.12 The Reduced Finite State Diagram with VHDL Code
Fig.13 The Behavioural Simulation of ASM2FSM Model
Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model
Fig.15 Block Diagram of the “Original” GCD Data-Path
Fig.16 Block Diagram of the Modified GCD Data-Path
Fig.17 The Control Unit (FSM) with VHDL Code
Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)
Fig.18.b Primitives: CARRY4 Fast Carry-Chain
Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model
Fig.20 GCD with Sum of Absolute Difference (GCDSAD)
Fig.21 Carry-out Generation Functions for SAD
Fig.22 Results in a Chart (FOR_LOOP dominated)
Fig.23 The Area-Delay Product




List of Tables

Table 1: Xilinx Spartan-6 FPGA Feature Summary [4], [5]
Table 2: Slice features of Spartan-6 Family (including XC6SLX25) [5]
Table 3: Project Milestones
Table 4: Mapping Report of the FOR_LOOP GCD
Table 5: Synthesis and Timing Report of the FOR_LOOP GCD
Table 6: Mapping Report: ASM2FSM GCD vs. FOR_LOOP GCD
Table 7: Synthesis and Timing Report - ASM2FSM GCD vs. FOR_LOOP GCD
Table 9: Mapping Report: Optimized vs. Simple GCD2SUB
Table 10: Synthesis and Timing Report - Optimized vs. Simple GCD2SUB
Table 11: Mapping Report: GCDSAD
Table 12: Synthesis and Timing Report - GCDSAD
Table 13: Overview of The results (O; Optimized with Primitives)
Table 14: Comparison Summary between Architectures




Visual Executive Summary


The main idea of this project is to design a digital circuit that calculates the GCD of two 16-bit unsigned integers using the Euclidean Algorithm and to implement it on a Xilinx Spartan-6 FPGA using different techniques/architectures. The first attempt was to see how far the compiler goes with the behavioural loop that represents the Euclidean Algorithm. Because the tools replicated the hardware inside the loop once per iteration, a massive area of the FPGA was occupied and the number of iterations was limited. Thus, an RTL behavioural architecture was implemented, in which only one iteration runs per clock cycle. The compiler still has the freedom of placement and routing with the aid of “Design and Goals Strategies”. Then, the design was built structurally, by port-mapping all functions of the previous design as components, to see how the compiler would utilize the FPGA differently from the behavioural one. The structural model consists of two parts: a GCD data-path unit and a GCD control unit (FSM). Another version of the structural design was created as an attempt to adapt the idea of the “Sum of Absolute Difference (SAD)” in order to have only one subtraction instead of two. Finally, Spartan-6 Primitives and Macros were utilized to reduce the Area-Delay product of the design, and the optimized GCD with two subtractors gave the minimum Area-Delay product among all the design architectures.



Introduction

Implementing mathematical calculations on hardware platforms such as FPGAs is considerably more challenging than performing them in a software environment, where the hardware itself is already equipped with the data-path and control unit for an almost unlimited number of algorithms and arithmetic operations. Behind this pain of hardware implementation is a priceless gain in performance, as there is a great opportunity to utilize a smaller area, obtain higher speed, consume less power, or get a reasonable combination of all of these. Calculating the greatest common divisor (GCD) is one of the problems that needs a number of steps in order to be solved correctly. These steps can be transformed into an iterative algorithm such as the Euclidean algorithm, which makes the computation understandable and traceable. This section is divided into three parts: a brief mathematical background on GCD computation, an overview of the Xilinx Spartan-6 FPGA, and an outline of the project description highlighting its objective and milestones.

Background of GCD and Euclidean Algorithm

The greatest common divisor (GCD) of two positive integers is the largest integer that divides both numbers without a remainder [2]. It is also known as the Greatest Common Factor (GCF), Greatest Common Measure (GCM), Highest Common Divisor (HCD), or Highest Common Factor (HCF) [1]. The GCD can be computed by determining the prime factors of both numbers and then multiplying the common prime factors. In practice, this method is not feasible for large numbers. (Fig. 1) shows an example of how the prime factorization method works.

Fig.1 Prime Factorization method for finding the GCD of two integers
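The prime factorization method described above can be sketched in a few lines of Python. This is only an illustration and not part of the project code: the function names `prime_factors` and `gcd_by_factorization` are hypothetical, and trial division is assumed as the factoring step, which is exactly why the method does not scale to large numbers.

```python
from collections import Counter


def prime_factors(n: int) -> Counter:
    """Factor n by trial division and return {prime: exponent}."""
    factors = Counter()
    p = 2
    while p * p <= n:
        while n % p == 0:
            factors[p] += 1
            n //= p
        p += 1
    if n > 1:
        # Whatever is left after trial division is itself prime.
        factors[n] += 1
    return factors


def gcd_by_factorization(a: int, b: int) -> int:
    """Multiply the prime factors that a and b have in common."""
    # Counter '&' keeps each shared prime with the smaller exponent.
    common = prime_factors(a) & prime_factors(b)
    result = 1
    for prime, exponent in common.items():
        result *= prime ** exponent
    return result


print(gcd_by_factorization(48, 180))
```

For 48 = 2^4 * 3 and 180 = 2^2 * 3^2 * 5, the shared factors are 2^2 and 3, so the call prints 12.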




An efficient method for solving GCD problems is the Euclidean algorithm, which is based on the fact that the GCD of two numbers also divides the remainder of the division between them:

$$\gcd(a,b) = \gcd(b,r), \quad \text{where } a = qb + r$$

It is an iterative process (Fig. 2) that takes a number of cycles to compute the GCD. Divisions are done iteratively until $r_n = 0$ is obtained; then the GCD is $r_{n-1}$:

$$\gcd(a,b) = \gcd(b,r_1), \quad \gcd(b,r_1) = \gcd(r_1,r_2), \quad \ldots$$

Fig.2 Euclidean Algorithm

As division can be carried out by repeated subtraction, it was observed that the GCD of two numbers also divides their difference [1], which makes the design and implementation of the circuit easier:

$$\gcd(a,b) = \gcd(b,\,a-b) \ \text{for } a > b, \qquad \gcd(a,b) = \gcd(a,\,b-a) \ \text{for } b > a$$

The flowchart in (Fig. 3) illustrates this simple computation process of the GCD. It follows that the circuit should include subtraction and comparison units, in addition to registers and multiplexers for updating the data in each iteration cycle.

Fig.3 Simplified Euclidean GCD Algorithm
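As a software reference for the two formulations above, the following Python sketch shows the division-based and the subtraction-based Euclidean algorithm side by side. The function names are hypothetical, and the step counter is added only to illustrate how many iterations the subtraction form may need.

```python
def gcd_division(a: int, b: int) -> int:
    """Division form: gcd(a, b) = gcd(b, r) with a = q*b + r."""
    while b != 0:
        a, b = b, a % b
    return a


def gcd_subtraction(a: int, b: int):
    """Subtraction form of Fig. 3; returns (gcd, number of subtractions).

    Assumes a > 0 and b > 0, otherwise the loop would never terminate.
    """
    if a <= 0 or b <= 0:
        raise ValueError("both operands must be positive")
    steps = 0
    while a != b:
        if a > b:
            a -= b
        else:
            b -= a
        steps += 1
    return a, steps


print(gcd_division(48, 18))      # division form
print(gcd_subtraction(48, 18))   # same GCD, plus the subtraction count
print(gcd_subtraction(511, 2))   # an unbalanced pair needs many subtractions
```

The pair (511, 2) needs 256 subtractions, which is why the iteration count matters for the hardware versions discussed later in the report.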



Overview of Spartan-6 FPGA

Following from the previous part, the chosen FPGA should have some properties that accommodate the design units efficiently. For example, subtraction/addition may take advantage of dedicated components in the FPGA slices, such as the ripple carry-chain or the DSP. The Spartan-6 FPGA family from Xilinx provides designers with such components, which help a great deal in designing the GCD circuit at different levels. “The thirteen-member family delivers expanded densities ranging from 3,840 to 147,443 logic cells, with half the power consumption of previous Spartan families, and faster, more comprehensive connectivity,” [4]. (TABLE 1) shows a feature summary of some devices from this family: the smallest (XC6SLX4), the largest (XC6SLX150T), and the choice of this project (XC6SLX25), which was the smallest member of the family able to accommodate the first design, i.e., the reference one. More about this device and the reference design is presented later in this report.

TABLE 1: XILINX SPARTAN-6 FPGA FEATURE SUMMARY [4], [5]

| Device | Logic Cells | CLB Slices | CLB FFs | CLB RAM (kb) | LUT6 | DSP Slices | RAM Blocks | User I/O |
|---|---|---|---|---|---|---|---|---|
| XC6SLX4 | 3,840 | 600 | 4,800 | 75 | 2,400 | 8 | 12 | 132 |
| XC6SLX25 | 24,051 | 3,758 | 30,064 | 229 | 15,032 | 38 | 52 | 226 |
| XC6SLX150T | 147,443 | 23,038 | 184,304 | 1,355 | 92,152 | 180 | 268 | 540 |

The floor-plan views in PlanAhead for the XC6SLX25 device (Fig. 4 and Fig. 5) help to illustrate the internal construction of the chosen device. (Fig. 4) is a full-scale floor-plan view that shows the device layout, indicating some important elements such as IOB cells, CLBs, DSP, and block RAM columns.




Fig.4 XC6SLX25 Floor-plan View in PlanAhead

In (Fig. 5), a closer view of the layout reveals the three different slice types inside the Configurable Logic Blocks (CLBs) surrounding a DSP block. It is clear from the figure that each CLB contains two slices: one of them is a SLICEX and the other is either a SLICEL or a SLICEM. (TABLE 2) presents important slice features of the XC6SLX25 device.

Fig.5 Closer View of the XC6SLX25 Floor-plan View in PlanAhead


[Fig. 4 labels: Memory Controller Block, Block RAM Column, Clock Management Tile Column, DSP Column, CLB Cell, IOB Cells]



[Fig. 5 labels: DSP, Flip-Flops, Carry-Chain, Storage LUTs, LUT6]

TABLE 2: SLICE FEATURES OF SPARTAN-6 FAMILY (INCLUDING XC6SLX25) [5]

| Feature | SLICEX | SLICEL | SLICEM |
|---|---|---|---|
| 6-Input LUTs | √ | √ | √ |
| 8 Flip-flops | √ | √ | √ |
| Wide Multiplexers | | √ | √ |
| Carry Logic | | √ | √ |
| Distributed RAM | | | √ |
| Shift Registers | | | √ |


Project Description and Milestones

The objective of this project is to design a digital circuit that calculates the GCD of two 16-bit unsigned integers using the Euclidean Algorithm and to implement it on a Xilinx Spartan-6 FPGA. In this project, the Euclidean GCD circuit was implemented using different architectures in order to examine the tradeoff between area and speed, i.e., the Area-Delay product, and to decide which design is more implementable in terms of the dedicated configuration components inside the FPGA. The first step was to implement a simple behavioural loop, i.e., a direct interpretation of the Euclidean GCD Algorithm, using a FOR_LOOP to see how the compiler would represent a large number of iterations. Taking this design as the reference, the next step was to implement the Euclidean GCD circuit at the following levels:

A. RTL Behavioural level, where the design is simply a Finite State Machine (FSM) that performs the GCD calculation sequentially, as a lower-level interpretation of the Algorithm. In this case, the compiler was free to translate the operations into different units/components, place all these components, and route all the connections with the aid of the “Design and Goals Strategies” optimization.

B. Structural level, where the data flow of the GCD Algorithm is transformed into an arithmetic circuit, i.e., a data-path unit (DP), and the iteration process is handled by a simple control unit (CU), basically an FSM. The design was built by transforming all functions, such as comparison, subtraction, and data transfer, into components and port-mapping them in a top-level entity, to see how the compiler would utilize the FPGA differently from the behavioural implementation. At this level, a “Sum of Absolute Difference” (SAD) circuit was introduced in order to replace the two subtraction units with one computation unit. Finally, some functions were designed utilizing Primitives, e.g., Look-Up-Tables (LUTs), Carry-Chains, Flip-Flops, Exclusive-ORs, and/or DSPs, and Macros, e.g., the ADDSUB macro, with which the calculation unit was optimized in terms of occupied area inside the FPGA.



(TABLE 3) is an outline of the project milestones and approximately how much of each was achieved during the ELEC569A course.

TABLE 3: PROJECT MILESTONES

| Description | Done (%) |
|---|---|
| Design the Simple Behavioural Loop and examine its aspects and limitations | 100 |
| Design the Behavioural FSM Model and test its features and margins | 100 |
| Design the direct Structural Model (DP+CU) and compare it with the behavioural | 100 |
| Design the Optimized Structural Model (SAD) and compare it with the direct structural | 100 |
| Get into Primitive Level and utilize the dedicated elements for faster computation | 100 |
| Report the Area/Delay Comparison between all the architectures and propose suggestions | 100 |

In the next section, all of the above stages are presented and discussed in sequence. At this point it is helpful to mention that the target is to obtain the minimum Area-Delay product, which can be achieved by reducing the area and/or the time delay of the circuit. By setting the optimization goal to area (Fig. 6), a smaller area of the FPGA is utilized in order to reduce Area*Delay. At the same time, this might also lead to higher speed, on the assumption that the smaller the area obtained, the fewer jumps through interconnections are needed.

Fig.6 The Design Strategy Window from Xilinx Project Navigator



Detailed Description of the Design

As mentioned in the previous section, a simple behavioural loop was implemented first, in order to see how the compiler understands loops and how it deals with a large number of iterations. Then the results of this design, taken as a reference, motivated trying different architectures in order to obtain a smaller area and fewer jumps through the interconnections.

Behavioural Level: WHILE/FOR LOOP

Starting with the direct WHILE_LOOP that represents the Euclidean GCD Algorithm (Fig. 7), the result tells a lot about how the system treats loops. It was clear that the compiler simply copies the corresponding circuit once per iteration until the loop ends. It was not too surprising that the compiler did not synthesize the (While) construct: its iteration count depends on the input data and is not known at synthesis time, so the number of circuit copies would be unbounded. In the hardware world, the number of copies must be finite.

Fig.7 WHILE_LOOP translations of the Simplified Euclidean GCD Algorithm


-- Direct (non-synthesizable) translation of the simplified algorithm
While (A /= B) Loop
    If (A > B) Then
        A := A - B;
    Else
        B := B - A;
    End If;
End Loop;
GCD <= B;



Thus, the transition to the finite FOR_LOOP (Fig. 8) was obvious, where the maximum number of iterations must be defined from the beginning. In fact, fixing the number of iterations before even starting to compute the GCD limits the design: every input pair that needs more iterations than the chosen maximum returns zero.

Fig.8 Finite FOR_LOOP as a replacement for the Infinite WHILE_LOOP

Before going through the design and implementation results of this model and proceeding to the other levels, it is essential to point out that the WHILE_LOOP Model is a perfect transformation of the Euclidean GCD Algorithm. Therefore, all the following design effort can be considered an attempt to produce a synthesizable version of the WHILE_LOOP Model.

The number of iterations in the FOR_LOOP Model was defined as 100, which means that for any two numbers that require more than a hundred iterations to compute their GCD (e.g., 511 and 2), the result will be zero. The behavioural and Post-Route simulations of this design are shown in (Fig. 9), including the delay for some input examples.


For i in 1 to 100 Loop
    If (A /= B) Then
        If (A > B) Then
            A := A - B;
        Else
            B := B - A;
        End If;
    Else
        GCD <= B;
    End If;
End Loop;
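The effect of the fixed iteration limit can be reproduced with a small behavioural model in Python. This is a sketch under stated assumptions, not the project code: the function name `gcd_for_loop` is hypothetical, and the output is assumed to default to zero when the loop never reaches A = B, as the text above describes.

```python
def gcd_for_loop(a: int, b: int, max_iterations: int = 100) -> int:
    """Behavioural model of the bounded FOR_LOOP GCD.

    Assumption: the output defaults to 0 and is only written once
    A equals B inside the loop, mirroring the structure of Fig. 8.
    """
    gcd = 0
    for _ in range(max_iterations):
        if a != b:
            if a > b:
                a -= b
            else:
                b -= a
        else:
            gcd = b
    return gcd


print(gcd_for_loop(48, 18))        # converges well within 100 iterations
print(gcd_for_loop(511, 2))        # needs 256 subtractions, so the result is 0
print(gcd_for_loop(511, 2, 257))   # a larger bound recovers the GCD
```

Note that one extra iteration is needed after the last subtraction to write the output, so in this model a pair that needs exactly 100 subtractions still returns zero.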



Fig.9 The Behavioural and Post-Route Simulation of FOR_LOOP Model

Furthermore, the complexity of the generated circuit was very high, as the system converted the loop into a massive number of components. The system simply copied the comparators, subtractors, and multiplexers a hundred times (i.e., with no registers at all). (Fig. 10) shows the RTL, Technology Schematic, and Floor-plan (from PlanAhead) of the FOR_LOOP Model.

Fig.10 RTL, Technology Schematic, and Floor-plan of FOR_LOOP Model


[Fig. 9 panels: Behavioural Simulation; Post-Route Simulation]



(TABLE 4) highlights the huge number of units that are mapped to satisfy the FOR_LOOP design requirements, whereas (TABLE 5) summarizes the Synthesis report including Timing.

TABLE 4: MAPPING REPORT OF THE FOR_LOOP GCD

| Hardware Statistics | XC6SLX25 Total | Used | % |
|---|---|---|---|
| # Slices | 3,758 | 2,697 | 71.77% |
| # LUTs | 15,032 | 8,776 | 58.38% |
| # MUXCYs | 7,516 | 4,760 | 63.33% |
| # Registers | 30,064 | 0 | 0.00% |
| # DSP | 38 | 0 | 0.00% |
| # IOBs | 226 | 48 | 21.24% |

TABLE 5: SYNTHESIS AND TIMING REPORT OF THE FOR_LOOP GCD

| Macros Statistics | Count |
|---|---|
| # 16-bit Add/Sub/Acc | 198 |
| # Registers | 0 |
| # 16-bit Comparators (=,>) | 199 |
| # 2-1 Multiplexers (16-bit) | 298 |
| # XOR | 0 |
| # DSP | 0 |
| # FSM | 0 |

| Time Element | ns |
|---|---|
| Register to Register Paths | 0 |
| Input to Register Paths | 0 |
| Register to Out-pad Paths | 0 |
| In-pad to Out-pad Paths | 478 |
| Total Time Delay | 478 |

In this model, nothing further can be done except changing the maximum number of iterations, which affects the performance (i.e., a higher maximum number of iterations gives a longer latency). In principle the design should be fast (see the behavioural simulation), since it is a purely parallel design with no registers. Yet the huge circuitry forces signals to jump through many interconnections. Finally, this model works faster in larger devices such as the XC6SLX150.



Behavioural Level: From ASM to FSM (ASM2FSM)

Recalling the “Simplified Euclidean GCD Algorithm” in (Fig. 3), it can be considered as an Algorithmic State Machine (ASM) that describes the behaviour of the GCD circuit. The three-state FSM is then an RTL implementation of the ASM circuit (Fig. 11).

Fig.11 From ASM GCD to Finite State Diagram

Using the basic “States Reduction” rule, states S1 and S2 are merged (S1 => S2). The new FSM and a sample of the code are shown in (Fig. 12).

Fig.12 The Reduced Finite State Diagram with VHDL Code


WHEN S0 =>
    IF (Start = '1') THEN
        NextState <= S1;
    ELSE
        NextState <= S0;
    END IF;
WHEN S1 =>
    AM <= AS; BM <= BS;
    IF (AR = BR) THEN
        GCD <= BR;
        NextState <= S0;
    ELSIF (AR > BR) THEN
        EnA <= '1';          -- update register A only
        EnB <= '0';
        NextState <= S1;
    ELSE
        EnA <= '0';
        EnB <= '1';          -- update register B only
        NextState <= S1;
    END IF;

-- Subtractions feeding the register inputs
AS <= AR - BR;
BS <= BR - AR;
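To show how the reduced two-state machine iterates one subtraction per clock cycle, here is a cycle-level behavioural model in Python. It is a sketch only: the function name `run_gcd_fsm` and the cycle counting are hypothetical, and the registers AR and BR are assumed to be loaded with the operands before Start is asserted.

```python
S0, S1 = "S0", "S1"


def run_gcd_fsm(a: int, b: int, max_cycles: int = 1 << 17):
    """Simulate the reduced FSM; returns (gcd, clock cycles used).

    Assumptions: operands are positive 16-bit values already held in
    AR/BR, and Start is asserted during the first cycle.
    """
    if a <= 0 or b <= 0:
        raise ValueError("both operands must be positive")
    state, ar, br = S0, a, b
    start = 1
    for cycle in range(1, max_cycles + 1):
        if state == S0:
            state = S1 if start else S0
            start = 0
        elif ar == br:
            return br, cycle          # GCD <= BR; back to S0
        elif ar > br:
            ar = ar - br              # EnA = '1', EnB = '0'
        else:
            br = br - ar              # EnA = '0', EnB = '1'
    raise RuntimeError("did not converge within max_cycles")


print(run_gcd_fsm(48, 18))
print(run_gcd_fsm(511, 2))
```

Unlike the FOR_LOOP model, the number of iterations here is limited only by how long the machine is allowed to run, at the cost of one clock cycle per subtraction.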



(Fig. 13) highlights the Behavioural Simulation results of the ASM2FSM Model, while

(Fig. 14) shows RTL, Technology Schematic, and Floor-plan (from PlanAhead).

Fig.13 The Behavioural Simulation of ASM2FSM Model

Fig.14 RTL, Technology Schematic, and Floor-plan of ASM2FSM Model




Again, (TABLE 6 & 7) present the Mapping and Synthesis reports including Timing, and indicate a dramatic improvement in terms of both area and speed.

TABLE 6: MAPPING REPORT: ASM2FSM GCD VS. FOR_LOOP GCD

| HW Statistics | Total | FOR_LOOP | FOR_LOOP % | ASM2FSM | ASM2FSM % |
|---|---|---|---|---|---|
| # Slices | 3,758 | 2,697 | 71.77% | 21 | 0.56% |
| # LUTs | 15,032 | 8,776 | 58.38% | 58 | 0.39% |
| # MUXCYs | 7,516 | 4,760 | 63.33% | 48 | 0.64% |
| # Registers | 30,064 | 0 | 0.00% | 33 | 0.11% |
| # DSP | 38 | 0 | 0.00% | 0 | 0.00% |
| # IOBs | 226 | 48 | 21.24% | 52 | 23.01% |

TABLE 7: SYNTHESIS AND TIMING REPORT - ASM2FSM GCD VS. FOR_LOOP GCD

| Macros Statistics | FOR | ASM |
|---|---|---|
| # 16-bit Add/Sub/Acc | 198 | 2 |
| # Registers | 0 | 33 |
| # 16-bit Comparators (=,>) | 199 | 2 |
| # 2-1 Multiplexers (1, 16-bit) | 298 | 8 |
| # XOR | 0 | 0 |
| # DSP | 0 | 0 |
| # FSM | 0 | 1 |

| Time Element (ns) | FOR | ASM |
|---|---|---|
| Register to Register Paths | 0 | 5.07 |
| Input to Register Paths | 0 | 2.96 |
| Register to Out-pad Paths | 0 | 6.59 |
| In-pad to Out-pad Paths | 478 | 0 |
| Total Time Delay | 478 | 14.62 |

The results obtained with the ASM2FSM model are far better than those of the FOR_LOOP model, considering the huge area saving and the ability to compute the GCD with a very large number of iterations.



Structural level: GCD Data-Path and Control Units (GCD2SUB)

The next step was to build a Data-Path for the computation unit, which could be as

shown in the block diagram in (Fig. 15).

Fig.15 Block Diagram of the “Original” GCD Data-Path

Because the comparison unit can be implemented as a subtractor, it was natural to use the CARRY_OUT signals of the subtractors in (Fig. 15) as the AGB and ALB signals (Fig. 16).

Fig.16 Block Diagram of the Modified GCD Data-Path




The “FSM block” in (Fig. 16) refers to the Control Unit (Fig. 17) that drives the control signals of the GCD data-path (i.e., the registers’ enable signals). It is important to note that the MUXs’ select signals are driven by the signals AGB and AEB directly, whereas for the REGs’ enable signals, smaller MUXs (i.e., 1-bit) were built by the control unit.

Fig.17 The Control Unit (FSM) with VHDL Code

In this model, the subtractors are the bottleneck of the design, as they combine subtraction and comparison at the same time. They need to be fast, because their results must be ready before the next clock edge. Therefore, the fast CARRY4 primitive (Fig. 18.b), which utilizes the dedicated Carry-Chain in SLICEL and SLICEM inside the Spartan-6 FPGA, was adopted in the design to perform faster subtraction. Furthermore, LUT2 and LUT3 Macros were used to accommodate some logic functions such as AND, XOR, and multiplexer (Fig. 18.a).

Fig.18.a Utilization of Primitive (FDCE), and Macros (LUT2 & LUT3)


-- Control unit (Fig. 17)
AEB <= AGB NOR ALB;
Finish <= AEB;
WHEN S0 =>
    IF (Start = '1') THEN
        NextState <= S1;
    ELSE
        NextState <= S0;
    END IF;
WHEN S1 =>
    EnA <= AGB;
    EnB <= ALB;
    IF (AEB = '1') THEN
        NextState <= S0;   -- A = B: done, return to idle
    ELSE
        NextState <= S1;   -- keep iterating
    END IF;

-- Primitive and macro instantiations (Fig. 18.a)
MUX2x1_inst : LUT3                       -- MUX 2x1
    GENERIC MAP (INIT => X"AC")
    PORT MAP (O => O, I2 => S, I1 => A, I0 => B);
XOR_inst : LUT2                          -- A XOR B
    GENERIC MAP (INIT => X"6")
    PORT MAP (O => P, I1 => A, I0 => B);
AND_inst : LUT2                          -- A AND B
    GENERIC MAP (INIT => X"8")
    PORT MAP (O => P, I1 => A, I0 => B);
FF_inst : FDCE                           -- Flip-Flop
    GENERIC MAP (INIT => '0')
    PORT MAP (Q => Q, C => C, CE => CE, CLR => CLR, D => D);
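The idea of reusing the subtractors as comparators can be illustrated with a short Python model of one data-path iteration. This is a sketch under assumptions rather than the project's implementation: the names `sub_with_borrow`, `datapath_step`, and `gcd2sub` are hypothetical, and the flags AGB and ALB are assumed to be derived from the borrow of B - A and A - B respectively.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def sub_with_borrow(x: int, y: int):
    """16-bit x - y; borrow is 1 exactly when x < y."""
    raw = x - y
    return raw & MASK, (raw >> WIDTH) & 1


def datapath_step(ar: int, br: int):
    """One clock cycle of the two-subtractor data-path."""
    a_minus_b, alb = sub_with_borrow(ar, br)   # borrow of A - B means A < B
    b_minus_a, agb = sub_with_borrow(br, ar)   # borrow of B - A means A > B
    aeb = 1 - (agb | alb)                      # AEB <= AGB NOR ALB
    if agb:                                    # EnA <= AGB
        ar = a_minus_b
    if alb:                                    # EnB <= ALB
        br = b_minus_a
    return ar, br, aeb


def gcd2sub(a: int, b: int) -> int:
    """Iterate the data-path until the equality flag is raised."""
    if not (0 < a <= MASK and 0 < b <= MASK):
        raise ValueError("operands must be positive 16-bit values")
    ar, br, aeb = a, b, 0
    while not aeb:
        ar, br, aeb = datapath_step(ar, br)
    return br


print(gcd2sub(48, 18))
```

No separate comparator appears in the model: both the select and the enable decisions come from the two subtraction flags, which is the saving the modified data-path aims for.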



Fig.18.b Primitives: CARRY4 Fast Carry-Chain

(Fig. 19) shows RTL, Technology Schematic, and Floor-plan of GCD2SUB.

Fig.19 RTL, Technology Schematic, and Floor-plan of GCD2SUB Model


CARRY4_inst : CARRY4 PORT MAP (
    CO     => CO,    -- 4-bit carry out
    O      => Sub,   -- 4-bit carry chain XOR data out
    CI     => '1',   -- 1-bit carry cascade input
    CYINIT => '1',   -- 1-bit carry initialization
    DI     => A,     -- 4-bit carry-MUX data in
    S      => P);    -- 4-bit carry-MUX select input
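The way a carry chain performs subtraction and comparison together can be checked with a bit-level Python model. This is an illustrative sketch: the function name `carry_chain_subtract` is hypothetical, the select input is assumed to be P = A XOR (NOT B), and the carry-in is assumed to be 1 so that the chain computes A + NOT B + 1.

```python
WIDTH = 16
MASK = (1 << WIDTH) - 1


def carry_chain_subtract(a: int, b: int, width: int = WIDTH):
    """Model A - B on a MUXCY/XORCY carry chain.

    Returns (difference, carry_out); carry_out is 1 exactly when A >= B.
    """
    carry = 1                      # carry-in of 1 completes the two's complement of B
    diff = 0
    for i in range(width):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p_i = a_i ^ (b_i ^ 1)      # select input S: A XOR (NOT B)
        o_i = p_i ^ carry          # XOR stage: difference bit
        carry = carry if p_i else a_i   # carry MUX: propagate carry or take DI = A
        diff |= o_i << i
    return diff, carry


print(carry_chain_subtract(48, 18))
print(carry_chain_subtract(18, 48))
```

The final carry doubles as the comparison result, which is why no separate comparator is needed once the subtractor sits on the carry chain.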



The Mapping and Synthesis reports (TABLE 9 & 10) confirm the expected results of the design: it was clearly both faster and smaller.

TABLE 9: MAPPING REPORT: OPTIMIZED VS. SIMPLE GCD2SUB

| HW Statistics | Total | Simple GCD2SUB | Simple % | Optimized GCD2SUB | Optimized % |
|---|---|---|---|---|---|
| # Slices | 3,758 | 16 | 0.43% | 15 | 0.40% |
| # LUTs | 15,032 | 53 | 0.35% | 51 | 0.34% |
| # MUXCYs | 7,516 | 52 | 0.69% | 32 | 0.43% |
| # Registers | 30,064 | 33 | 0.11% | 51 | 0.17% |
| # DSP | 38 | 0 | 0.00% | 0 | 0.00% |
| # IOBs | 226 | 52 | 23.01% | 52 | 23.01% |

TABLE 10: SYNTHESIS AND TIMING REPORT - OPTIMIZED VS. SIMPLE GCD2SUB

| Macros Statistics | S2S | O2S |
|---|---|---|
| # 16-bit Add/Sub/Acc | 2 | 0 |
| # Registers | 33 | 33 |
| # 16-bit Comparators (=,>) | 2 | 0 |
| # 2-1 Multiplexers (1, 16-bit) | 3 | 3 |
| # XOR | 0 | 0 |
| # DSP | 0 | 0 |
| # FSM | 1 | 1 |

| Time Element (ns) | S2S | O2S |
|---|---|---|
| Register to Register Paths | 5.14 | 3.31 |
| Input to Register Paths | 3.07 | 2.96 |
| Register to Out-pad Paths | 6.72 | 3.67 |
| In-pad to Out-pad Paths | 0 | 0 |
| Total Time Delay | 14.93 | 9.94 |

The comparison was between two versions of GCD2SUB: the Optimized version (O2S) using primitives and macros, and a Simple version (S2S) with high-level components (i.e., “-” for subtraction, “Select” for the multiplexer, etc.; even the comparator was defined in this version). It was clear that although the tool is capable of optimizing Macros well, the designer can utilize the dedicated Primitives and Macros for more efficient optimization.



Structural level: GCD with Sum of Absolute Difference (GCDSAD)

The Sum of Absolute Difference (SAD) circuit replaces the two subtractors using a Carry-Out Generation Function (Fig. 20 & 21). It was expected to give a better result than GCD2SUB, as it uses fewer components and produces fewer outputs.

Fig.20 GCD with Sum of Absolute Difference (GCDSAD)

Fig.21 Carry-out Generation Functions for SAD
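At the algorithm level, the single-subtraction idea can be sketched as follows. This is one plausible reading and not the project's implementation: the function name `gcd_sad` is hypothetical, and it is assumed that one absolute-difference unit feeds both registers while the comparison only decides which register is updated.

```python
def gcd_sad(a: int, b: int) -> int:
    """GCD using a single |A - B| computation per iteration.

    Assumption: the larger operand is replaced by the absolute
    difference, so only one subtraction result is ever needed.
    """
    if a <= 0 or b <= 0:
        raise ValueError("both operands must be positive")
    while a != b:
        d = abs(a - b)     # the one shared computation unit
        if a > b:
            a = d          # register A takes the difference
        else:
            b = d          # register B takes the difference
    return a


print(gcd_sad(48, 18))
```

The sequence of register values is the same as in the two-subtractor version; only the amount of arithmetic hardware implied by the description changes.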


-- 4-bit block propagate/generate
GPBLOCK: FOR i IN 1 TO (N/4) GENERATE
    PB4(i) <= P(4*i-1) AND P(4*i-2) AND P(4*i-3) AND P(4*i-4);
    GB4(i) <= G(4*i-1) OR (G(4*i-2) AND P(4*i-1))
              OR (G(4*i-3) AND P(4*i-1) AND P(4*i-2))
              OR (G(4*i-4) AND P(4*i-1) AND P(4*i-2) AND P(4*i-3));
END GENERATE GPBLOCK;

-- Higher-level propagate/generate and carry-out
PB8 <= PB4(1) AND PB4(2);
GB8 <= GB4(2) OR (GB4(1) AND PB4(2));
GN  <= G(3) OR (G(2) AND P(3))
       OR (G(1) AND P(3) AND P(2))
       OR (G(0) AND P(3) AND P(2) AND P(1));
CO  <= GB4(4) OR (PB4(4) AND C12);
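The block generate/propagate equations can be exercised with a small Python model that computes only the carry-out, which is all a comparison needs. This is a sketch under assumptions: the function names are hypothetical, G and P are assumed to be the usual bitwise AND and XOR of the two adder inputs, and the comparison is assumed to be done as A + NOT B + 1.

```python
def lookahead_carry_out(x: int, y: int, carry_in: int, width: int = 16) -> int:
    """Carry-out of x + y + carry_in from 4-bit block generate/propagate terms."""
    g = [((x >> i) & (y >> i)) & 1 for i in range(width)]   # bit generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(width)]   # bit propagate
    carry = carry_in
    for block in range(width // 4):
        k = 4 * block
        pb = p[k + 3] & p[k + 2] & p[k + 1] & p[k]
        gb = (g[k + 3]
              | (g[k + 2] & p[k + 3])
              | (g[k + 1] & p[k + 3] & p[k + 2])
              | (g[k] & p[k + 3] & p[k + 2] & p[k + 1]))
        carry = gb | (pb & carry)   # same form as CO <= GB4 OR (PB4 AND C_in)
    return carry


def a_greater_equal_b(a: int, b: int, width: int = 16) -> int:
    """A >= B is the carry-out of A + NOT B + 1."""
    mask = (1 << width) - 1
    return lookahead_carry_out(a, ~b & mask, 1, width)


print(a_greater_equal_b(48, 18), a_greater_equal_b(18, 48))
```

Only the carry path is evaluated here; no difference bits are produced, which reflects the motivation for generating the carry-out separately from the subtraction itself.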



Before implementing the primitive-based GCDSAD circuit (Optimized GCDSAD), there was an attempt to try a function called (ABS), which does the same job as GCDSAD, in order to see how the compiler accommodates such a function at the hardware level. Also, the simple GCDSAD was designed using high-level component definitions. ABSGCD gave a significantly better result in terms of speed, while the simple GCDSAD was slightly better in terms of area. (TABLE 11 & 12) compare ABSGCD, Simple GCDSAD (SGCDSAD), and Optimized GCDSAD (OGCDSAD).

TABLE 11: MAPPING REPORT: GCDSAD

| HW Stat. | Total | ABSGCD | ABSGCD % | SGCDSAD | SGCDSAD % | OGCDSAD | OGCDSAD % |
|---|---|---|---|---|---|---|---|
| # Slices | 3,758 | 22 | 0.59% | 20 | 0.53% | 18 | 0.48% |
| # LUTs | 15,032 | 73 | 0.49% | 62 | 0.41% | 59 | 0.39% |
| # MUXCYs | 7,516 | 52 | 0.69% | 16 | 0.21% | 16 | 0.21% |
| # Registers | 30,064 | 34 | 0.11% | 33 | 0.11% | 41 | 0.14% |
| # DSP | 38 | 0 | 0.00% | 0 | 0.00% | 0 | 0.00% |
| # IOBs | 226 | 52 | 23.01% | 52 | 23.01% | 52 | 23.01% |

TABLE 12: SYNTHESIS AND TIMING REPORT - GCDSAD

| Macros Statistics | ABS | SSAD | OSAD |
|---|---|---|---|
| # 16-bit Add/Sub/Acc | 2 | 1 | 0 |
| # Registers | 33 | 33 | 33 |
| # 2-1 MUX (1, 16-bit) | 5 | 5 | 3 |
| # XOR | 15 | 33 | 0 |
| # DSP | 0 | 0 | 0 |
| # FSM | 1 | 1 | 0 |

| Time Element (ns) | ABS | SSAD | OSAD |
|---|---|---|---|
| R to R Paths | 5.40 | 10.92 | 8.40 |
| In to R Paths | 3.19 | 3.19 | 2.96 |
| R to Out Paths | 5.80 | 13.11 | 3.63 |
| In to Out Paths | 0 | 0 | 0 |
| Total Time Delay | 14.39 | 27.22 | 14.99 |



Overview of the Results

Recalling all the design architectures and their area/time figures, this section presents the conclusion in numbers and charts (TABLE 13, Fig. 22, and Fig. 23).

Summary of the results for the different architectures

TABLE 13: OVERVIEW OF THE RESULTS (O; OPTIMIZED WITH PRIMITIVES)

| HW Stat. | Total | FOR | FOR % | ASM | ASM % | OGCD2SUB | OGCD2SUB % | OGCDSAD | OGCDSAD % |
|---|---|---|---|---|---|---|---|---|---|
| # Slices | 3,758 | 2,697 | 71.77% | 21 | 0.56% | 15 | 0.40% | 18 | 0.48% |
| # LUTs | 15,032 | 8,776 | 58.38% | 58 | 0.39% | 51 | 0.34% | 59 | 0.39% |
| # MUXCYs | 7,516 | 4,760 | 63.33% | 48 | 0.64% | 32 | 0.43% | 16 | 0.21% |
| # Registers | 30,064 | 0 | 0.00% | 33 | 0.11% | 51 | 0.17% | 41 | 0.14% |
| # IOBs | 226 | 48 | 21.24% | 52 | 23.01% | 52 | 23.01% | 52 | 23.01% |

| Macros Statistics | FOR | ASM | OGCD2SUB | OGCDSAD |
|---|---|---|---|---|
| # 16-bit Add/Sub/Acc | 198 | 2 | 0 | 0 |
| # Registers | 0 | 33 | 33 | 33 |
| # 16-bit Comparators (=,>) | 199 | 2 | 0 | 0 |
| # 2-1 MUX (1, 16-bit) | 298 | 8 | 3 | 3 |

| Time Element (ns) | FOR | ASM | OGCD2SUB | OGCDSAD |
|---|---|---|---|---|
| Register to Register Paths | 0.00 | 5.07 | 3.31 | 8.40 |
| Input to Register Paths | 0.00 | 2.96 | 2.96 | 2.96 |
| Register to Out-pad Paths | 0.00 | 6.59 | 3.67 | 3.63 |
| In-pad to Out-pad Paths | 478.00 | 0.00 | 0.00 | 0.00 |
| Total Time Delay | 478.00 | 14.62 | 9.94 | 14.99 |



The results in a Chart:

Fig.22 Results in a Chart (FOR_LOOP dominated)

The Area-Delay Product (ASM, PGCD2SUB, and PGCDSAD)

Fig.23 The Area-Delay Product
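The Area-Delay figures used in this report can be reproduced from the slice counts and total time delays reported in the mapping and timing tables. The short Python sketch below does this; the dictionary layout and the function name `area_delay` are illustrative choices, while the numbers are taken from Tables 4 to 12.

```python
# (slices, total time delay in ns) as reported in the mapping and timing tables
designs = {
    "FOR_LOOP": (2697, 478.00),
    "ASM2FSM": (21, 14.62),
    "Simple GCD2SUB": (16, 14.93),
    "Optimized GCD2SUB": (15, 9.94),
    "ABSGCD": (22, 14.39),
    "Simple GCDSAD": (20, 27.22),
    "Optimized GCDSAD": (18, 14.99),
}


def area_delay(slices: int, delay_ns: float) -> float:
    """Area-Delay product, with area measured in occupied slices."""
    return slices * delay_ns


# Rank the architectures from the smallest (best) product to the largest.
ranking = sorted(designs, key=lambda name: area_delay(*designs[name]))
for rank, name in enumerate(ranking, start=1):
    print(rank, name, round(area_delay(*designs[name]), 2))
```

Sorting by the product places the Optimized GCD2SUB first and the FOR_LOOP model last, matching the ordering discussed in the summary.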


[Chart legend (Fig. 22 and Fig. 23): #Slices, Time Delay, Area*Delay]



Summary and Conclusion

(TABLE 14) summarizes the work that has been done and compares all the versions of the Euclidean GCD design and their implementation on the Xilinx Spartan-6 FPGA.

TABLE 14: COMPARISON SUMMARY BETWEEN ARCHITECTURES

The overall results show that the optimized GCD2SUB design has the lowest Area-Delay product among the models in this project, which means it provides fast computation of the Euclidean GCD Algorithm while saving a great deal of area. Apart from the slow, limited, and area-consuming FOR_LOOP GCD model, the other architectures were not too far from the GCD2SUB model, especially the Simple GCD2Sub and the Optimized GCDSAD. However, the Optimized GCDSAD could become better than the Simple GCD2Sub because of its full control over placement, which might make its Area-Delay product significantly better. Furthermore, some components, such as the FSM Flip-Flops and MUXs, could be implemented using primitives and placed efficiently in order to reduce the Area-Delay product further.

| Level | Architecture | Rank | Area*Delay | Macros & Primitives | Design features & Synthesizability |
|---|---|---|---|---|---|
| Behavioural | For Loop | 7 | 1289166 | 198 16bit Sub, 199 16bit Comp, 298 16bit MUX-2x1 | Depends on max. #iterations; works faster in larger devices |
| Behavioural | ASM2FSM | 4 | 307.02 | 2 16bit Loadable Accumulators, 33 Registers, 2 16bit Comp, 2 16bit and 6 1-bit MUX-2x1 | FSM is the Top Entity; depends on compiler Macros; still no control over placement |
| Simple Structural | GCD2Sub | 2 | 238.88 | 2 16bit Loadable Accumulators, 33 Registers, 2 16bit Comp, 3 1-bit MUX-2x1 | Datapath & FSM Control Units; depends on components def.; still no control over placement |
| Simple Structural | GCDSAD | 6 | 544.4 | 16bit Add (w Cin), 33 Registers, 2 16bit and 3 1-bit MUX-2x1, 1 16bit and 32 1-bit XOR | DP & CU; dep. on components; still no control over placement; utilizes SAD circuits (1 ADD) |
| Simple Structural | GCDABS | 5 | 316.58 | 16bit Sub, 16bit Add, 15 XOR, 33 Registers, 2 16bit Comp, 2 16bit and 3 1-bit MUX-2x1 | DP & CU; dep. on components and the compiler Macros; still no control over placement |
| Optimized Structural | GCD2Sub | 1 | 149.1 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU; utilizes Primitives (LUT2,3 & CARRY4); fast; full control over placement |
| Optimized Structural | GCDSAD | 3 | 269.82 | 33 Registers, 3 1-bit MUX-2x1 | DP & CU; utilizes Primitives (LUT2,3 & CARRY4); smart; full control over placement |




Final Thoughts and Suggestions

The Euclidean GCD Algorithm design journey has brought great experience: from the infinite loop to loop limitations, then through the RTL behavioural architecture to the structural architecture, and finally to the optimized design, where Primitives and Macros were utilized to reduce the Area-Delay product of the design. The pros and cons of the main architectures can be summarized as follows:

Behavioural

- Apart from “Design Strategies & Goals,” there is no control at any level over the implemented circuit or the placement and routing of the design.

✤ It is a high-level coding approach, which is easier to write and manage.

Structural

- The design can be much more complex than the behavioural one, especially with Primitives.

✤ By utilizing Primitives & Macros efficiently, full control over the placement is gained.

✤ Having the data-path and control units separated allows for better optimization.

It is important to note that utilizing primitives and macros efficiently helps to reduce the jumps through interconnections and to maintain a logical and consistent data flow in the design. For instance, a 16-bit Carry-Look-Ahead Subtractor (CLASub) is assumed to be faster than the ripple carry. However, utilizing the CARRY4 primitive to benefit from the dedicated Carry-chain, with the help of the propagation function (i.e., Half-Adder SUM - XOR), gives an Area-Delay product about 10 times better than using a full CLASub with primitives.

Finally, it is fair to mention that ASM2FSM, Simple GCD2SUB, and Simple GCDSAD were all also implemented with the use of the DSP enforced as a primitive. The time delay in all cases was not promising, and the occupied area inside the FPGA was greater than when using the CLB slices. However, it seems possible to utilize the DSP itself in order to benefit from its features to perform the whole computation of the Euclidean GCD Algorithm. This might be a reasonable suggestion for future work related to GCD design on FPGA, in addition to learning more about the tools and their helpful features.




Bibliography

1. Wikipedia, (2014). Greatest common divisor. [online] Available at: http://en.wikipedia.org/wiki/Greatest_common_divisor [Accessed 15 Dec. 2014].

2. EE254L – GCD (University of Southern California): Subject Lab Manual: http://www-classes.usc.edu/engr/ee-s/254/ee254l_lab_manual/.

3. Lesson 93 - Example 63: GCD Algorithm - VHDL while Statement [A tutorial on datapaths and state machines for computing the GCD / While Loops accompanies the book Digital Design Using Digilent FPGA Boards]. (Nov 2012). LBEbooks. Retrieved from https://www.youtube.com/watch?v=DMSaYhD1GkM.

4. “Spartan-6 Family Overview (v2.0),” Xilinx, 2011. http://www.xilinx.com/support/documentation/data_sheets/ds160.pdf.

5. “Spartan-6 FPGA Configurable Logic Block, UG384 (v1.1),” Xilinx, 2010. http://www.xilinx.com/support/documentation/user_guides/ug384.pdf.

6. “Spartan-6 Libraries Guide for HDL Designs, UG615 (v 14.1),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.

7. “XST User Guide for Virtex-6, Spartan-6, and 7 Series Devices, UG687 (v 13.4),” Xilinx, 2012. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/xst_v6s6.pdf.

8. “ISE In-Depth Tutorial, UG695 (v 12.1),” Xilinx, 2009. http://www.xilinx.com/support/documentation/sw_manuals/xilinx14_1/spartan6_hdl.pdf.

9. Sima, M. (2014). ELEC669 'Reconfigurable Computing' [Lecture Notes].

10. Devi, R., Singh, J. and Singh, M. (2011). VHDL Implementation of GCD Processor with Built in Self Test Feature. International Journal of Computer Applications, 25(2), pp.50-54.

11. C.P, N. and M. Ravi Kumar, K. (2014). Efficient Comparator based Sum of Absolute Differences Architecture for Digital Image Processing Applications. International Journal of Computer Applications, 96(4), pp.17-24.

12. TechOnlineIndia, (2014). An introduction to FPGA timing analysis. [online] Available at: http://www.techonlineindia.com/techonline/news_and_analysis/170126/introduction-fpga-timing-analysis.


