
IET Computers & Digital Techniques

Research Article

Throughput/area optimised pipelined architecture for elliptic curve crypto processor

ISSN 1751-8601
Received on 25th April 2018
Revised 17th November 2018
Accepted on 26th February 2019
doi: 10.1049/iet-cdt.2018.5056
www.ietdl.org

Malik Imran1, Muhammad Rashid2, Atif Raza Jafri1, Muhammad Kashif3
1Department of Electrical Engineering, Bahria University, Islamabad, Pakistan
2Department of Computer Engineering, Umm Al-Qura University, Makkah, Saudi Arabia
3Department of Electronics and Computer Engineering, Istanbul Sehir University, Istanbul, Turkey

E-mail: [email protected]

Abstract: A pipelined architecture is proposed in this work to speed up point multiplication in elliptic curve cryptography (ECC). This is achieved, first, by pipelining the arithmetic unit to reduce the critical path delay and, second, by reducing the number of clock cycles (latency) through careful scheduling of the computations involved in point addition and point doubling. Together, these two factors reduce the time for one point multiplication. In addition, the small area overhead of the design gives a higher throughput/area ratio. The proposed architecture is synthesised on different FPGAs for comparison with the state-of-the-art. The synthesis results over GF(2^m) show that the proposed design can operate at frequencies of up to 369, 357 and 337 MHz when implemented for m = 163, 233 and 283 bit key lengths, respectively, on a Virtex-7 FPGA. The corresponding throughput/slice figures are 42.22, 12.37 and 9.45, which outperform existing implementations.

1 Introduction

Security networks frequently use public key cryptographic algorithms such as Rivest–Shamir–Adleman (RSA) [1] and elliptic curve cryptography (ECC) [2]. ECC, however, is becoming more popular than RSA owing to advantages such as shorter key lengths, lower hardware cost for an equivalent security level and lower power consumption [3–7]. These advantages make ECC usable in both high-speed and low-resource applications.

The typical hierarchy of ECC contains four layers, as shown in Fig. 1, and different operations are performed at each layer. These operations are arithmetic (addition, multiplication, squaring and inversion), point addition (PA) and point doubling (PD), point multiplication (PM) (also called scalar multiplication) and protocols. Arithmetic operations are performed at layer 1, while PA and PD operations are computed at layer 2. The PM is the core operation in ECC and is computed at layer 3. Finally, protocols (layer 4 operations) are the sets of rules that govern data encryption and decryption.

To implement the PM operation, the National Institute of Standards and Technology (NIST) has proposed a number of standard elliptic curves over the prime field GF(p) and the binary extension field GF(2^m) [3]. The GF(2^m) field is commonly used for efficient hardware implementations [4–15]. Furthermore, field-programmable-gate-array (FPGA) based designs of ECC are gaining popularity owing to their reconfigurability, availability (FPGAs are readily available on the market) and shorter development time scales.

1.1 Related work

Several FPGA-based ECC architectures, either for high-speed [4–11] or low-resource applications [12–15], are available in the literature.

1.1.1 High-speed applications: In real-time applications, such as IP security (IPsec) and secure socket layer (SSL), a high-speed implementation of an asymmetric cryptosystem is important [4]. The conventional practices for optimising high-speed ECC architectures involve (a) reducing the number of clock cycles (CCs) (latency) required for one PM computation and (b) increasing the clock frequency.

To reduce the number of CCs, various techniques have been employed in [4–11]. For example, the work in [4] presents three complex instructions, while instruction level parallelism is used in [6, 8] to reduce the number of required CCs. Similarly, the effect of various digit sizes for digit serial finite field (FF) multipliers is explored in [7]. Moreover, the work presented in [7, 9, 11] duplicates multiple arithmetic blocks (such as the adder, multiplier and squarer) to exploit parallelism.

Fig. 1 Layer model of ECC

IET Comput. Digit. Tech. © The Institution of Engineering and Technology 2019

To optimise the operational frequency and reduce the critical path delay, pipelining is frequently implemented. The architectures presented in [4–11] require 3010, 1446, 1428, 2751, 1091, 3379, 780 and 450 CCs, respectively. The achieved frequencies (MHz) in [4–11] are 154 (on Virtex 4), 143 (on Virtex 4), 185 (on Virtex 4), 250 (on Virtex 4), 121 (on Virtex 4), 262 (on Virtex 5), 153 (on Virtex 5) and 159 (LLECC_3M architecture on Virtex 7), respectively. The time for computing one PM is determined by dividing the number of CCs by the operational frequency. The time (in µs) required for one PM in [4–11] is 19.5, 10, 7.7, 9.6, 9.0, 12.9, 5.1 and 2.83, respectively. Although the architectures reported in [4–11] achieve higher speed (i.e. require less computational time for one PM), they utilise more hardware resources in terms of FPGA slices, i.e. 16,209, 24,363, 20,807, 17,929, 10,417, 6536, 10,363 and 11,657, respectively. Such high hardware resource usage is not suitable for constrained (low area) applications.
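As a quick illustration of the timing relation quoted above (time for one PM equals the number of CCs divided by the clock frequency), the following Python sketch recomputes a few of the listed figures. The function name `pm_time_us` and the dictionary layout are ours and are used for illustration only.

```python
def pm_time_us(clock_cycles: int, freq_mhz: float) -> float:
    """Time for one point multiplication in microseconds.

    Dividing a cycle count by a frequency in MHz yields microseconds directly.
    """
    return clock_cycles / freq_mhz


# (clock cycles, frequency in MHz) as quoted in the text for [4], [6] and [11]
QUOTED = {"[4]": (3010, 154), "[6]": (1428, 185), "[11]": (450, 159)}

for ref, (ccs, freq) in QUOTED.items():
    print(ref, round(pm_time_us(ccs, freq), 2), "us")
```

Because the frequency is expressed in MHz, no explicit unit conversion is needed to obtain microseconds.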

1.1.2 Hardware architectures for low area applications: Low area implementations of asymmetric cryptosystems are important for embedded systems applications, such as connected vehicles, smart cards and smart cities [12–15]. In [12], the critical path delay has been reduced by implementing a four-stage pipelined FF multiplier. In [13], a bit-serial multiplier is used to reduce the hardware complexity at the cost of a higher total number of CCs. Low-cost implementations of ECC using single adder, multiplier and squarer blocks are available in [14, 15]. The most relevant architectures, reported in [12–15], require 1397, 52,012, 2,438,675 and 3426 CCs, respectively. The achieved frequencies (MHz) in [12–15] are 147 (on Virtex 5), 550 (on Virtex 5), 12.5 (on Spartan 6) and 135 (on Virtex 7). Similarly, the time (in µs) required for one PM computation in [12–15] is 9.5, 94.6, 195,094 and 25.3, respectively. The architectures reported in [12–15] therefore require more computational time for one PM, but they consume fewer hardware resources (FPGA slices): 3513, 4815, 1844 and 3657, respectively.

1.2 Importance of throughput/area

Section 1.1 reveals that state-of-the-art hardware architectures [4–15] are implemented to optimise either throughput or area, without paying due attention to the overall performance in terms of the throughput/area metric. Performance evaluation in terms of throughput/area is useful where both constraints (throughput and area) must be fulfilled at the same time. Furthermore, it has been argued in [11] that the main motivation behind the use of ECC is its suitability for applications that require high throughput and low area at the same time.

Therefore, performance in terms of throughput/area is desirable in many real-time applications such as ambient intelligence and internet-based applications [16, 17], cloud computing [18], and banking and other security applications (i.e. e-commerce, e-banking) [19]. For example, in ambient intelligence applications, a ubiquitous sensor network is deployed. The deployed sensors face constantly increasing computational demands to process data and provide various services to the end-users.

Similarly, high throughput/area implementations are also important for network-based applications such as the SSL and IPsec protocols, which are commonly used today in over-the-web transactions [19]. Consequently, it is critical to achieve the desired throughput in a reasonable time with lower hardware resource utilisation. A comprehensive study of various design constraints in multiple applications is provided in [20].

1.3 Our contributions

Motivated by the discussion in Section 1.2, we propose a high throughput/area pipelined ECC architecture for the NIST curves [3] over GF(2^m) with m = 163, 233 and 283. The data path of the proposed architecture depends upon the size of the underlying field (m). The proposed design is synthesised on different FPGA devices for performance estimation (Virtex 7) and for comparison with existing solutions (Virtex 4 and Virtex 5).

Previously, we have proposed a throughput/area processor for binary Huff curves [21]. In this paper, we target a pipelined architecture of ECC for those applications where throughput/area is critical, such as internet-based applications [16, 17], cloud computing [18] and banking applications [19]. The contributions of this paper are listed below:

• A digit parallel multiplier is presented with the optimal digit size of 32 bits to reduce the latency as well as the critical path delay.

• For PM computation in state-of-the-art architectures [4–15], the required arithmetic operations are addition, multiplication, squaring and inversion. Separate squarer and multiplier blocks are generally useful when multiple CCs are needed to perform one FF multiplication. However, our digit-parallel multiplier, mentioned in the previous point, is capable of producing the result of each FF multiplication in one CC. In other words, multiplication and squaring have the same computational cost in our proposed design. Therefore, squaring instructions can be performed by providing the same input to both operands of the multiplier block. This has allowed us to reduce the overall area of the design.

• To optimise the clock frequency and to reduce the critical path delay, pipeline registers have been used at the input of the arithmetic unit (AU). Moreover, the PA and PD instructions have been scheduled efficiently with pipeline hazards, such as read after write (RAW), taken into account.

• Finally, a dedicated finite-state-machine (FSM) based control block has been used to speed up the control functionalities.

The remainder of this paper is organised as follows. In Section 2, preliminaries related to PM computation on ECC over GF(2^m) are presented. The proposed efficient throughput/area architecture for ECC is discussed in Section 3. Section 4 presents the synthesis and performance results of the proposed hardware architecture along with a comparison with the state-of-the-art. Finally, Section 5 concludes the paper.

2 Setting the stage

The introductory part of this paper mentions two different fields for the computation of PM in ECC: the prime field GF(p) and the binary extension field GF(2^m). The prime field is suitable for software implementations, while the binary field is useful for hardware implementations [14]. Furthermore, each of these fields may be used with either simple affine coordinates or projective coordinates.

In a simple affine coordinate system, an FF inversion operation must be performed during each PA and PD computation [9–11]. For example, for an ‘m’ bit key length, ‘m’ inversion operations are required. However, the number of required inversion operations can be reduced by using a projective coordinate system, where only two inversion operations are required to compute the PM [22]. In addition to reducing the number of inversions, projective coordinates are better suited than affine coordinates to achieving efficient throughput/area ECC designs [23].

For each coordinate system, two types of field representation are available, i.e. normal basis and polynomial basis. The normal basis representation is useful where frequent squaring operations are involved; however, for efficient FF multiplications, the polynomial basis representation is used [4, 14, 15].

Based on the above considerations, we have used the binary extension field with the projective (Lopez Dahab) coordinate system. The Lopez Dahab projective coordinate system requires a lower number of field multiplications for PM computation [22]. Moreover, for field representation, we have selected the polynomial basis owing to its efficient FF multiplications.

2.1 Point multiplication on ECC over GF(2^m)

For GF(2^m), the projective (Lopez Dahab) form of the elliptic curve is defined as the set of points P(X : Y : Z) satisfying the following equation:

$$E : Y^2 + XYZ = X^3 Z + aX^2Z^2 + bZ^4 \qquad (1)$$

In (1), the variables ‘X’, ‘Y’ and ‘Z’ are the Lopez Dahab projective elements of the point P(X : Y : Z), where Z ≠ 0, and ‘a’ and ‘b’ are the curve constants with b ≠ 0. The core operation in ECC is PM. Consider a base point ‘P’ and a large integer ‘k’ of the size of the underlying field ‘m’; the PM is then the addition of ‘k’ copies of the point ‘P’, i.e. Q = k·P = P + P + ⋯ + P, where ‘Q’ is the resulting point on the defined elliptic curve. To compute the PM, we have used the Montgomery algorithm [23], presented as Algorithm 1.

The Montgomery algorithm for PM takes a scalar multiplier ‘k’ and the initial point ‘P’ with coordinates (xp, yp) as input and produces the (xq, yq) coordinates of the final point ‘Q’ as output. The Montgomery algorithm contains three steps:

• Step 1 is the initialisation, where the affine to projective (Lopez Dahab) conversions are performed.

• Step 2 computes the PM by performing point addition (P = P + Q) and point doubling (P = 2P) operations, based on the value of the scalar multiplier bit (ki).

• Finally, step 3 performs the projective to affine conversions (reconversion step).

It is important to mention here that this work handles side channel and power attacks at the algorithmic level through the use of the Montgomery algorithm. Resistance against side channel and power attacks is an inherent feature of the Montgomery algorithm (Algorithm 1). Therefore, we have used it for the computation of point multiplication (consisting of PA and PD).

In the Montgomery algorithm, the number of arithmetic operations required for the PA and PD steps is independent of the value of the inspected key bit (i.e. ki). In other words, the same number of arithmetic operations is required, irrespective of the value of the key bit (scalar multiplier). These arithmetic operations are six multiplications, five squarings and three additions, as shown in Algorithm 1.

Because the same number of arithmetic operations is required in the PA and PD steps for either key-bit value, the Montgomery algorithm provides resistance against simple power and side channel attacks. Moreover, in our design, we have ensured that the sequence of these arithmetic operations remains the same during the execution of PA and PD. Therefore, the inherent feature of the Montgomery algorithm (independence of the arithmetic operations from the value of the key bit) remains unaffected.
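The PA and PD steps of Algorithm 1 can be sketched in software as follows. This is a minimal, unoptimised Python model for GF(2^163) and not the hardware design: the helper names (`fmul`, `fsq`, `finv`, `madd`, `mdouble`, `ladder_x`) are ours, the inversion uses a plain Fermat exponentiation rather than the Itoh Tsujii method used in the paper, only the x-coordinate is recovered, and `XP` and `B` are arbitrary illustrative values that are not claimed to be NIST parameters.

```python
M = 163
# Reduction polynomial x^163 + x^7 + x^6 + x^3 + 1 (NIST binary field, m = 163)
POLY = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def fmul(a: int, b: int) -> int:
    """Carry-less product of two field elements, reduced modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    for i in range(r.bit_length() - 1, M - 1, -1):  # clear bits above x^(M-1)
        if (r >> i) & 1:
            r ^= POLY << (i - M)
    return r


def fsq(a: int) -> int:
    """Squaring done on the multiplier with equal operands, as in the paper."""
    return fmul(a, a)


def finv(a: int) -> int:
    """Inverse as a^(2^M - 2) (Fermat); an assumption made here for brevity."""
    r, t = 1, a
    for _ in range(M - 1):
        t = fsq(t)          # t = a^(2^i) for i = 1 .. M-1
        r = fmul(r, t)
    return r


def madd(X1, Z1, X2, Z2, xp):
    """PA steps 1-7 of Algorithm 1; the result overwrites (X1, Z1)."""
    Z1 = fmul(X2, Z1)
    X1 = fmul(X1, Z2)
    T = X1 ^ Z1
    X1 = fmul(X1, Z1)
    Z1 = fsq(T)
    T = fmul(xp, Z1)
    return X1 ^ T, Z1


def mdouble(X2, Z2, b):
    """PD steps 1-7 of Algorithm 1; the result overwrites (X2, Z2)."""
    Z2 = fsq(Z2)
    T = fmul(b, fsq(Z2))
    X2 = fsq(X2)
    Z2 = fmul(X2, Z2)
    return fsq(X2) ^ T, Z2


def ladder_x(k: int, xp: int, b: int) -> int:
    """x-coordinate of k*P; the same 14 operations run for either key-bit value."""
    X1, Z1 = xp, 1
    X2, Z2 = fsq(fsq(xp)) ^ b, fsq(xp)
    for i in range(k.bit_length() - 2, -1, -1):
        if (k >> i) & 1:
            X1, Z1 = madd(X1, Z1, X2, Z2, xp)
            X2, Z2 = mdouble(X2, Z2, b)
        else:
            X2, Z2 = madd(X2, Z2, X1, Z1, xp)
            X1, Z1 = mdouble(X1, Z1, b)
    return fmul(X1, finv(Z1))


XP = 0x123456789ABCDEF0123456789ABCDEF   # arbitrary non-zero x-coordinate
B = 0xFEDCBA9876543210FEDCBA987          # arbitrary non-zero curve constant b
print(hex(ladder_x(15, XP, B)))
```

Both branches call one `madd` and one `mdouble`, which is the operation-count independence discussed above; a software model like this one does not by itself demonstrate resistance to timing or power analysis.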

3 Proposed pipelined architecture

Fig. 2 shows the proposed pipelined hardware architecture, which consists of (a) a register file (RF), (b) routing networks (RNs), (c) an efficient AU, (d) pipeline registers and (e) a dedicated control unit (CU). The placement of the pipeline registers is not shown in Fig. 2; however, it is discussed in Section 3.4. The initial curve parameters (xp, yp and b) for the proposed design have been selected from NIST [3].

3.1 Register file

The RF of the proposed design contains a register array of size ‘8 × m’, as shown in Fig. 2. The value of ‘m’ specifies the width of each location and depends upon the size of the field (163, 233 and 283). The main purpose of the RF unit is to store the intermediate results (X1, X2, Z1, Z2, T1, T2, T3 and T4) while implementing the PM algorithm (Montgomery in our case) for the corresponding ECC curve. Furthermore, it contains two multiplexers (Mux M1 and Mux M2), which are used to fetch the operands (OP1 and OP_2) from the RF unit, and a single de-multiplexer (Dmux) to update the RF contents (Mplex_out).

3.2 Routing networks

The proposed design contains two RNs, Mux M3 and Mux M4, as shown in Fig. 2. The inputs to Mux M3 are the curve parameters and an operand from the RF (OP1). The output of Mux M3 is an operand (OP_1) to the AU. The inputs to Mux M4 are the output of the AU and the output of Mux M3 (OP_1), and its output goes to the input of the RF unit.

3.3 Arithmetic unit

The AU of the proposed crypto processor contains adder and multiplier blocks/units, as shown in Fig. 2. The adder is implemented through bitwise exclusive-OR gates. Polynomial squaring is implemented by providing the same input to both operands of the multiplier unit. For the multiplication of two ‘m’ bit polynomials (A(x) × B(x)), we have implemented a parallel Least Significant Digit (LSD) multiplier with a digit size of d = 32 bits. The polynomial B(x) is split into digits of d = 32 bits, and each ‘d’ bit digit is multiplied in parallel with the ‘m’ bit polynomial A(x) to generate the partial products. For further mathematical formulations and an algorithmic overview of digit level multipliers, interested readers can consult [24].

To compute FF multiplication operations over GF(2^163), a total of six digits (B1 to B6) are required (32 + 32 + 32 + 32 + 32 + 3). Of these six digits, five (B1 to B5) are 32 bits wide, whereas the sixth digit (B6) is only 3 bits wide. Similarly, for GF(2^233), a total of eight digits (B1 to B8) are required (32 + 32 + 32 + 32 + 32 + 32 + 32 + 9). Of these eight digits, seven (B1 to B7) are 32 bits wide, while the eighth digit (B8) is 9 bits wide. Moreover, for GF(2^283), a total of nine digits are required (32 + 32 + 32 + 32 + 32 + 32 + 32 + 32 + 27), as shown in Fig. 2. Of these nine digits, eight (B1 to B8) are 32 bits wide and the last digit (B9) is 27 bits wide. Parallel multiplication of each digit B1 to B9 with the ‘m’ bit polynomial A(x) results in polynomials of ‘d + m − 1’ bits, which are represented as C1 to C9 in Fig. 2. Once the multiplication of each ‘d’ bit digit with the ‘m’ bit polynomial is completed, the final resultant polynomial D(x) of size ‘2 × m − 1’ bits is created by XOR and shift operations on C1 to C9.
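The digit decomposition and the shift-and-XOR accumulation described above can be modelled with the following Python sketch. It is a behavioural illustration only: the function names are ours, and the loop over digits runs serially in software, whereas the hardware computes all partial products C1 to C9 in parallel.

```python
def digit_widths(m: int, d: int = 32) -> list:
    """Widths of the digits of an m-bit operand, least significant first."""
    full, rest = divmod(m, d)
    return [d] * full + ([rest] if rest else [])


def split_digits(b: int, m: int, d: int = 32) -> list:
    """Split the m-bit polynomial B(x) into d-bit digits B1, B2, ..."""
    mask = (1 << d) - 1
    return [(b >> (i * d)) & mask for i in range(len(digit_widths(m, d)))]


def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials stored as integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def digit_parallel_mul(a: int, b: int, m: int, d: int = 32) -> int:
    """Unreduced product D(x) = A(x) * B(x) assembled from digit products."""
    partial = [clmul(a, digit) for digit in split_digits(b, m, d)]  # C1, C2, ...
    result = 0
    for i, c in enumerate(partial):
        result ^= c << (i * d)  # align each partial product, then XOR-accumulate
    return result


for m in (163, 233, 283):
    print(m, digit_widths(m))
```

The result still has up to 2 × m − 1 bits, so a field reduction has to follow before it can be written back as a field element.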

To summarise, the multiplication of two ‘m’ bit polynomials produces a resultant polynomial of size ‘2 × m − 1’ bits. Consequently, an FF reduction is required after each field multiplication. The reduction operations are performed by implementing the NIST reduction algorithms over GF(2^163), GF(2^233) and GF(2^283), as described in Algorithm 2.41, Algorithm 2.42 and Algorithm 2.43 of [22], respectively. To compute an inversion over the field GF(2^m), the square Itoh Tsujii algorithm [25] has been implemented using the multiplier block.

Algorithm 1: Montgomery algorithm [23] over GF(2^m)

Input: k = (k_{n−1}, …, k_1, k_0) with k_{n−1} = 1, P = (xp, yp) ∈ GF(2^m)
Output: Q(xq, yq) = k·P
Initialisations: X1 = xp, Z1 = 1, X2 = xp^4 + b, Z2 = xp^2
Point multiplication: for (i from n − 2 down to 0) do
  if (ki = 1)
    P = P + Q (PA)          P = 2P (PD)
    1. Z1 = X2·Z1           1. Z2 = Z2^2
    2. X1 = X1·Z2           2. T = Z2^2
    3. T = X1 + Z1          3. T = b·T
    4. X1 = X1·Z1           4. X2 = X2^2
    5. Z1 = T^2             5. Z2 = X2·Z2
    6. T = xp·Z1            6. X2 = X2^2
    7. X1 = X1 + T          7. X2 = X2 + T
    Return: P(X1, Z1)       Return: Q(X2, Z2)
  else
    P = P + Q (PA)          P = 2P (PD)
    1. Z2 = X1·Z2           1. Z1 = Z1^2
    2. X2 = X2·Z1           2. T = Z1^2
    3. T = X2 + Z2          3. T = b·T
    4. X2 = X2·Z2           4. X1 = X1^2
    5. Z2 = T^2             5. Z1 = X1·Z1
    6. T = xp·Z2            6. X1 = X1^2
    7. X2 = X2 + T          7. X1 = X1 + T
    Return: P(X2, Z2)       Return: Q(X1, Z1)
  end if
end for
Reconversion: xq = X1/Z1, yq = (xp + X1/Z1)[(X1 + xp·Z1)(X2 + xp·Z2) + (xp^2 + yp)(Z1·Z2)](xp·Z1·Z2)^−1 + yp
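The reduction and inversion steps can be illustrated with the short Python model below. Two assumptions should be noted: the reduction is a plain bit-serial reduction by the NIST polynomials, which gives the same result as the word-level Algorithms 2.41 to 2.43 of [22] but is not those algorithms, and the inversion is a common textbook form of the Itoh Tsujii addition chain, which may differ in detail from the variant implemented in the paper. All function names are ours.

```python
# NIST reduction polynomials for the three binary fields used in this work
NIST_POLY = {
    163: (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1,
    233: (1 << 233) | (1 << 74) | 1,
    283: (1 << 283) | (1 << 12) | (1 << 7) | (1 << 5) | 1,
}


def reduce_mod(c: int, m: int) -> int:
    """Bit-serial reduction of a polynomial of up to 2m - 1 bits to m bits."""
    poly = NIST_POLY[m]
    for i in range(c.bit_length() - 1, m - 1, -1):
        if (c >> i) & 1:
            c ^= poly << (i - m)
    return c


def fmul(a: int, b: int, m: int) -> int:
    """Field multiplication: carry-less product followed by reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return reduce_mod(r, m)


def fsq(a: int, m: int) -> int:
    """Squaring performed on the multiplier with identical operands."""
    return fmul(a, a, m)


def itoh_tsujii_inv(a: int, m: int) -> int:
    """Inverse a^(2^m - 2) computed as (a^(2^(m-1) - 1))^2.

    beta holds a^(2^k - 1); k grows to m - 1 by following the bits of m - 1.
    """
    beta, k = a, 1
    for bit in bin(m - 1)[3:]:            # bits after the leading 1
        t = beta
        for _ in range(k):                # t = beta^(2^k)
            t = fsq(t, m)
        beta = fmul(t, beta, m)           # a^(2^(2k) - 1)
        k *= 2
        if bit == "1":
            beta = fmul(fsq(beta, m), a, m)   # a^(2^(k+1) - 1)
            k += 1
    return fsq(beta, m)


for m in (163, 233, 283):
    x = 0x1234567 ^ (1 << (m - 1))
    print(m, fmul(x, itoh_tsujii_inv(x, m), m))
```

The addition chain needs only a handful of multiplications in addition to the m − 1 squarings, which is why the method maps well onto a design that has a single multiplier block.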

3.4 Inclusion of pipeline registers

To achieve optimal throughput, the first step is to evaluate the available options for pipelining. The circuit can be divided into three parts: (a) M1, M2 and M3, used for the read operation [R]; (b) the AU, used for the execute operation [E]; and (c) the combination of M4 and Demux, used for the write-back operation [WB].

With the above circuit partitioning, the three possible solutions are:

• The first option is to use no pipeline registers in the architecture. The read [R], execute [E] and write-back [WB] operations are then performed in a single CC.

• The second option is to place pipeline registers at the input of the AU. This results in a two-stage pipelined architecture, i.e. the read operation [R] in one CC and [E, WB] in the second CC.

• Finally, the last option is to place pipeline registers at both the input and the output of the AU. This results in a three-stage pipelined architecture, with [R], [E] and [WB] in three separate CCs.

The Montgomery algorithm, presented in Section 2 of this article, may cause read after write (RAW) hazards in the context of pipelining. Therefore, before developing the control section in Section 3.5, it is necessary to generate the instruction sequence of the Montgomery algorithm for the appropriate placement of pipeline registers. The sequences of instructions and the corresponding actions performed in the different pipeline stages are provided in Table 1 for three cases: (1) no pipeline registers, (2) two-stage pipelining and (3) three-stage pipelining.

The first column of Table 1 presents the CCs, whereas the second column shows the sequence of instructions with no pipeline registers. The placement of pipeline registers can cause different data hazards such as RAW, write after read (WAR) and write after write (WAW) [21]. A hazard prevents the next instruction from executing until the read/WB operation of the previous instruction is completed. The third column shows the corresponding RAW hazards. Finally, the fourth and fifth (last) columns present the proposed scheduling of the PA and PD instructions with two-stage and three-stage pipelines, respectively.

As shown in Table 1, the sequence of instructions without a pipeline requires a total of 14 CCs. For the two-stage and three-stage pipeline instruction scheduling, a total of 17 and 20 CCs are needed, respectively. Furthermore, to compute the PA and PD operations of the Montgomery algorithm (presented in Algorithm 1) for an m bit key length, a total of 14 × m, 17 × m and 20 × m CCs are required with the no-pipeline, two-stage pipeline and three-stage pipeline architectures, respectively. The addition of a third pipeline stage for WB is therefore not efficient: it adds more CCs (a total of 20 CCs) owing to RAW hazards, whereas the increase in frequency is not high enough to give an overall throughput higher than that of the two-stage pipelined architecture. Moreover, the addition of registers at the output of the AU further reduces the overall throughput/area performance. Therefore, the subsequent sections describe only the two-stage pipelined architecture.
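The two-stage hazard column of Table 1 and the 17-cycle schedule can be reproduced with a small Python model. The model rests on our own simplifying assumptions: an instruction reads its operands one cycle after it is issued, the previous instruction writes back in that same cycle, each RAW hazard costs exactly one bubble, and the constants xp and b are not treated as register operands. The code only counts hazards; it does not check that a reordering preserves the data dependences of the algorithm.

```python
# Destination and source registers of Inst1..Inst14 (k_i = 1 case of Table 1)
INSTS = {
    1: ("Z1", ("X2", "Z1")),
    2: ("X1", ("X1", "Z2")),
    3: ("T1", ("X1", "Z1")),
    4: ("X1", ("X1", "Z1")),
    5: ("Z1", ("T1",)),
    6: ("T1", ("Z1",)),        # xp is a constant input
    7: ("X1", ("X1", "T1")),
    8: ("Z2", ("Z2",)),
    9: ("T1", ("Z2",)),
    10: ("T1", ("T1",)),       # b is a constant input
    11: ("X2", ("X2",)),
    12: ("Z2", ("X2", "Z2")),
    13: ("X2", ("X2",)),
    14: ("X2", ("X2", "T1")),
}

PROGRAM_ORDER = list(range(1, 15))
# Issue order read off the two-stage scheduling column of Table 1
PROPOSED_ORDER = [1, 2, 8, 3, 4, 5, 11, 6, 7, 9, 12, 10, 13, 14]


def raw_hazards(order):
    """Map each instruction to the register it reads too early (two-stage)."""
    hazards = {}
    for prev, cur in zip(order, order[1:]):
        dest = INSTS[prev][0]
        if dest in INSTS[cur][1]:
            hazards[cur] = dest
    return hazards


def cycles_two_stage(order):
    """One [R] slot per instruction, one bubble per hazard, one final [E, WB]."""
    return len(order) + len(raw_hazards(order)) + 1


print(raw_hazards(PROGRAM_ORDER))
print(cycles_two_stage(PROGRAM_ORDER), cycles_two_stage(PROPOSED_ORDER))
```

Under these assumptions, issuing the instructions in program order would incur seven bubbles, while the reordered sequence leaves only two (before Inst7 and before Inst14), which matches the 17 CCs listed in Table 1.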

3.5 Dedicated CU

An FSM-based dedicated CU is designed to perform the control functionalities. The CU generates the control signals for the components of the RNs as well as the read and write addresses for the RF unit. The control signals are shown as red dotted lines in Fig. 2, whereas the FSM that generates these signals is shown in Fig. 3.

To implement the Montgomery algorithm for ECC, the FSM incorporates a total of 121 states for a two-stage pipelined architecture:

• St: 0 is an idle state, while during St: 1 to St: 6, the control signals for the affine to projective conversions are generated.

• The proposed scheduling of PA and PD for the PM step of the Montgomery algorithm, as shown in Table 1 (Section 3.4), requires a total of 35 states. Of these 35 states, 17 states are for PA and PD when the inspected key bit is ‘1’ (shown in Table 1), and a further 17 states are required for PA and PD if the inspected key bit is ‘0’. The remaining state, St: 7, is a conditional state: it is used to count the number of points on the specified ECC curve by using the count signal, and it is also responsible for checking the inspected key bit. The count has an initial value of ‘m − 1’. St: 8 to St: 24 (17 states) are used if the inspected key bit is ‘1’ (if part of the Montgomery algorithm). St: 25 to St: 41 (17 states) are used if the inspected key bit is ‘0’ (else part of the Montgomery algorithm).

• Finally, the reconversion step of the Montgomery algorithm requires two FF inversion (Inv) operations. To track the status of the required inversion operations, a single-bit ‘inverse_1’ signal is used. This signal is checked at the last state of the inversion operation, i.e. at St: 87, to select the next state, either St: 88 or St: 117. When the ‘inverse_1’ signal is ‘0’, the next state after the completion of the first inversion is St: 88; otherwise, the next state is St: 117. In addition to the inversion operations, a further 34 cycles are required to complete the reconversion step (Recon). Each inversion uses St: 42 to St: 87, St: 116 generates the addresses for the second inversion, and the remaining states (St: 88 to St: 115 and St: 117 to St: 120) perform the remaining operations of the projective to affine conversion step of the Montgomery algorithm.

Fig. 2 Proposed two-stage pipeline ECC processor

Table 1 Proposed instruction scheduling for PA and PD operations
CCs | Inst_i/operation | RAW hazards due to two-stage pipeline | RAW hazards due to three-stage pipeline | Proposed instruction scheduling with two-stage pipeline | Proposed instruction scheduling with three-stage pipeline
1 | Inst1: Z1 = X2·Z1 | - | - | Inst1[R] | Inst1[R]
2 | Inst2: X1 = X1·Z2 | - | - | Inst1[E, WB], Inst2[R] | Inst1[E], Inst2[R]
3 | Inst3: T1 = X1 + Z1 | RAW: X1 | RAW: Z1, RAW: X1 | Inst2[E, WB], Inst8[R] | Inst1[WB], Inst2[E], Inst8[R]
4 | Inst4: X1 = X1·Z1 | - | RAW: X1 | Inst8[E, WB], Inst3[R] | Inst2[WB], Inst8[E], Inst3[R]
5 | Inst5: Z1 = T1^2 | - | RAW: T1 | Inst3[E, WB], Inst4[R] | Inst8[WB], Inst3[E], Inst4[R]
6 | Inst6: T1 = xp·Z1 | RAW: Z1 | RAW: Z1 | Inst4[E, WB], Inst5[R] | Inst3[WB], Inst4[E], Inst5[R]
7 | Inst7: X1 = X1 + T1 | RAW: T1 | RAW: X1, RAW: T1 | Inst5[E, WB], Inst11[R] | Inst4[WB], Inst5[E], Inst11[R]
8 | Inst8: Z2 = Z2^2 | - | - | Inst11[E, WB], Inst6[R] | Inst5[WB], Inst11[E], Inst6[R]
9 | Inst9: T1 = Z2^2 | RAW: Z2 | RAW: Z2 | Inst6[E, WB] | Inst11[WB], Inst6[E]
10 | Inst10: T1 = b·T1 | RAW: T1 | RAW: T1 | Inst7[R] | Inst6[WB]
11 | Inst11: X2 = X2^2 | - | - | Inst7[E, WB], Inst9[R] | Inst7[R]
12 | Inst12: Z2 = X2·Z2 | RAW: X2 | RAW: X2 | Inst9[E, WB], Inst12[R] | Inst7[E], Inst9[R]
13 | Inst13: X2 = X2^2 | - | RAW: X2 | Inst12[E, WB], Inst10[R] | Inst7[WB], Inst9[E], Inst12[R]
14 | Inst14: X2 = X2 + T1 | RAW: X2 | RAW: X2 | Inst10[E, WB], Inst13[R] | Inst9[WB], Inst12[E], Inst10[R]
15 | - | - | - | Inst13[E, WB] | Inst12[WB], Inst10[E], Inst13[R]
16 | - | - | - | Inst14[R] | Inst13[E]
17 | - | - | - | Inst14[E, WB] | Inst13[WB]
18 | - | - | - | - | Inst14[R]
19 | - | - | - | - | Inst14[E]
20 | - | - | - | - | Inst14[WB]

Fig. 3 FSM-based CU



The total number of CCs for the proposed architecture can be calculated by using (2). The CC information for the proposed architecture with different key lengths is further provided in Table 2. In (2), the term ‘Initial’ denotes the initialisation part of Algorithm 1, ‘m’ denotes the key length and ‘Inv’ denotes the inversion operation required in the reconversion part of Algorithm 1. Similarly, in Table 2, the first column shows the key length, whereas the CCs required for the initialisation part of Algorithm 1 (Initial) are presented in the second column. The third column shows the CCs for the PA and PD computations of Algorithm 1. The CCs required for each inversion (Inv) and for the reconversion (Recon) part of Algorithm 1 are presented in the fourth and fifth columns, respectively. Finally, the total CCs for implementing Algorithm 1 are presented in the last column of Table 2.

$$\text{clock cycles} = \text{Initial} + 18\,(m - 1) + 2\,\text{Inv} + 34 \qquad (2)$$
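Equation (2) can be checked against the totals of Table 2 with a few lines of Python. The function and dictionary names are ours; the per-inversion cycle counts are taken from the ‘Inv’ column of Table 2.

```python
# Cycles per inversion ('Inv' column of Table 2), indexed by field size m
INV_CYCLES = {163: 502, 233: 709, 283: 867}


def total_cycles(m: int, initial: int = 6, per_bit: int = 18, recon: int = 34) -> int:
    """Clock cycles for one PM according to (2)."""
    return initial + per_bit * (m - 1) + 2 * INV_CYCLES[m] + recon


for m in (163, 233, 283):
    print(m, total_cycles(m))
```

The 18 cycles per key bit correspond to the 17 scheduled PA/PD states plus the conditional state St: 7 described in Section 3.5.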

4 Implementation and discussion of results

As shown in Section 3.4, a two-stage pipeline architecture provides higher performance in terms of the throughput/area ratio than a three-stage pipeline architecture. Therefore, we have implemented and presented the results for a two-stage pipeline architecture only. However, to perform a fair comparison with the state-of-the-art, it is necessary to first define the performance metric. Consequently, Section 4.1 elaborates the target performance metric (throughput/area). The implementation results for the proposed two-stage pipeline architecture are given in Section 4.2. Finally, Section 4.3 provides a comprehensive comparison of the proposed architecture with existing solutions.

4.1 Performance metric

To analyse the performance of the proposed PM architecture with different key sizes on FPGAs, a throughput over area metric (throughput/slices) is considered in this work, as given in (3). A simplified form of (3) is given in (4).

$$\frac{\text{throughput}}{\text{area}} = \frac{\text{throughput}\,(Q = k \cdot P)}{\text{slices}} \qquad (3)$$

$$\frac{\text{throughput}}{\text{area}} = \frac{10^{6} / \text{time}\,(Q = k \cdot P \text{ in } \mu\text{s})}{\text{slices}} \qquad (4)$$

In (3) and (4), the throughput is the number of PM computations (Q = k·P) per second, i.e. the reciprocal of the time required for one PM, which is calculated by using (5). Similarly, ‘Q’ is the final point on the elliptic curve, ‘k’ is the scalar multiplier, ‘P’ is the initial point on the elliptic curve and the term slices refers to the utilised area on the selected FPGA device. The factor 10^6 in (4) converts the time for one PM from microseconds to seconds.

$$\text{time}\,(\mu\text{s}) = \frac{\text{clock cycles (CCs)}}{\text{frequency (MHz)}} \qquad (5)$$

4.2 Implementations results

The proposed two-stage pipelined architecture for the NIST recommended binary elliptic curves over GF(2^163), GF(2^233) and GF(2^283) is implemented (synthesised) on a Virtex-7 FPGA (V7-XC7VX690T) using the Xilinx ISE (14.2) design suite. The synthesis and performance (throughput/slice) results of the proposed design are given in Table 3.

The first column of Table 3 shows the key sizes ‘m’ of 163, 233 and 283. The second, third and fourth columns present the FPGA area information in terms of slices, look-up tables (LUTs) and flip-flops (FFs), respectively. The fifth column provides the operational frequency (Freq., MHz). The time required for the computation of one point multiplication (in µs) is presented in the sixth column. Finally, the last column presents the achieved results in terms of the above performance metric (throughput/slices).

As shown in Table 3, the proposed architecture consumes a total of 2207 slices, 9965 LUTs and 1981 flip-flops over GF(2^163). For GF(2^233), the proposed architecture utilises 5120 slices, 18,953 LUTs and 2764 flip-flops. Similarly, for GF(2^283), the proposed architecture consumes 5207 slices, 20,202 LUTs and 3210 flip-flops. The low hardware resource usage (in terms of FPGA slices, LUTs and flip-flops) is achieved by placing a single finite-field multiplier in the AU for the computation of both squaring and multiplication instructions. Furthermore, the proposed architecture can operate at a maximum frequency of 369, 357 and 337 MHz for GF(2^163), GF(2^233) and GF(2^283), respectively. The high frequency is achieved by placing pipeline registers at the input of the AU. Owing to these frequencies, the proposed architecture requires only 10.73, 15.78 and 20.32 µs for one PM computation over GF(2^163), GF(2^233) and GF(2^283), respectively. This results in throughput/slices values of 42.22, 12.37 and 9.45, respectively, when synthesised on the Virtex 7 (XC7VX690T) FPGA device, as shown in Table 3. Consequently, the combination of high frequencies and low hardware resource utilisation gives the proposed architecture a high throughput/slices ratio.
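The figures of Table 3 can be recomputed from the cycle counts of Table 2 with the short Python sketch below. The function names are ours. Because the sketch works from unrounded intermediate values, its results may differ from the tabulated figures in the last digit.

```python
def pm_time_us(clock_cycles: int, freq_mhz: float) -> float:
    """Time for one PM in microseconds (cycles divided by MHz)."""
    return clock_cycles / freq_mhz


def throughput_per_slice(clock_cycles: int, freq_mhz: float, slices: int) -> float:
    """PM operations per second divided by the number of occupied slices."""
    throughput = 1e6 / pm_time_us(clock_cycles, freq_mhz)
    return throughput / slices


# m: (total cycles from Table 2, frequency in MHz and slices from Table 3)
RESULTS = {163: (3960, 369, 2207), 233: (5634, 357, 5120), 283: (6850, 337, 5207)}

for m, (ccs, freq, slices) in RESULTS.items():
    print(m, round(pm_time_us(ccs, freq), 2), round(throughput_per_slice(ccs, freq, slices), 2))
```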

4.3 Performance comparison with state-of-the-art

Section 4.2 provides the implementation results on the Virtex-7 FPGA (V7-XC7VX690T). However, to perform a fair comparison with the most relevant existing works over GF(2^163), the proposed design is also implemented on Virtex-4 (V4-XC4VLX100) and Virtex-5 (V5-XC5VFX200T) devices. The comparison results are summarised in Table 4.

Our previous low-area implementations of ECC architectures, presented in [14, 15], are the best-reported implementations in terms of area optimisation on Spartan 6 (XC6SLX16) and Virtex 7 (XC7VX690T) devices, respectively. The throughput/slice optimised ECC processor presented in this paper over GF(2^163) on Virtex 7 achieves an 80% higher throughput/slices value than our previous works in [14, 15].

On Virtex 4, the previous best-reported architecture in terms of throughput/slices (6.24) is presented in [6]; it consumes 20,807 slices to compute one PM in 7.7 µs using three 82-bit parallel multiplier cores. The proposed implementation on Virtex 4 shows a 19% higher throughput/slice figure (7.69) and consumes 64% fewer FPGA slices (7519) than the work in [6].

Table 2 CCs information
GF(2^m)   Initial   PA + PD = 18 × (m − 1)   Inv   Recon   Total cycles
163       6         2916                     502   1038    3960
233       6         4176                     709   1452    5634
283       6         5076                     867   1768    6850

Table 3 Implementation results over GF(2^m) on V7 FPGA
m     Slices   LUTs     FFs    Freq., MHz   Time, µs   10^6/s/slices
163   2207     9965     1981   369          10.73      42.22
233   5120     18,953   2764   357          15.78      12.37
283   5207     20,202   3210   337          20.32      9.45
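The total cycle counts listed in Table 2 can be reproduced with a short calculation. The sketch below assumes that the total is the sum of the initial cycles, the 18 × (m − 1) cycles of the PA + PD loop and the reconstruction cycles. The Inv column is not added separately: the totals only balance if the inversion cycles are already counted inside the Recon column, which is an inference from the numbers rather than something stated in this section. The names `pm_cycles` and `RECON_CYCLES` are illustrative.

```python
# Reconstruction cycle counts copied from the Recon column of Table 2.
RECON_CYCLES = {163: 1038, 233: 1452, 283: 1768}


def pm_cycles(m, recon, initial=6, per_iteration=18):
    """Total CCs for one PM: initial + 18 * (m - 1) + reconstruction.

    Assumption: the inversion cycles (Inv column) are part of `recon`.
    """
    return initial + per_iteration * (m - 1) + recon


if __name__ == "__main__":
    for m, recon in RECON_CYCLES.items():
        print(m, pm_cycles(m, recon))
```

With this reading, the three field sizes give 3960, 5634 and 6850 cycles, matching the Total cycles column.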

In [4], a seven-stage pipelined architecture is presented that uses 16,209 slices and results in a throughput/slice ratio of 3.16 on Virtex 4. Our two-stage pipelined architecture on Virtex 4 consumes 54% less area and shows a 12% speed improvement compared with the work in [4]. Therefore, our work achieves a 59% higher throughput/slice ratio than the solution proposed in [4]. This is due to the placement of pipeline registers only at the input of the AU, whereas in [4] multiple registers are used in the data path. Using multiple registers in the data path increases hardware resources, which lowers the overall throughput/area ratio.

In [5], a high-speed design is presented which utilises 24,363 slices to compute one PM in 10 µs and therefore achieves a throughput/slice ratio of 4.10. Our proposed work on Virtex 4 consumes 70% fewer slices and shows a 47% better throughput/slice figure than the work presented in [5]. In [7], multiple arithmetic blocks (i.e. three FF multipliers connected serially and four FF squarers connected in parallel) are used to compute one PM in 9.6 µs while consuming 17,929 slices. Our proposed work utilises only 7519 slices, which is 59% lower than [7]. Additionally, the proposed architecture provides a 25% higher throughput/slice ratio than [7].

To achieve higher performance while utilising fewer hardware resources, the work presented in [8] employs a Karatsuba multiplier with no idle CC. Other arithmetic instructions (i.e. addition and squaring) are performed in parallel with the Karatsuba multiplier in [8]. The proposed design in this article utilises 28% fewer slices and shows a 19% higher throughput/slice ratio than [8]. To achieve high performance in [9], parallelisation at the hardware level is obtained by using two multipliers, two adders and two squarer blocks. Our proposed work utilises 42% fewer hardware resources in terms of slices and achieves a 41% better throughput/slice ratio than the parallelised architecture in [9]. Pipelining in [8, 9] is achieved by placing registers inside the FF multiplier, which increases the number of CCs required to perform one multiplication. On the other hand, the two-stage pipelined architecture in this paper performs one FF multiplication in one CC. Moreover, the proposed architecture achieves 33, 39, 19, 48 and 15% improvements in clock frequency over [4–6, 8, 9], respectively.
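For readers less familiar with the FF multiplication discussed above, the following is a bit-level software model of a digit-serial, least-significant-digit-first multiplication in GF(2^163); the conclusions describe the multiplier used in this work as LSD-based. It is only a sketch under stated assumptions: the reduction polynomial is taken to be the NIST polynomial x^163 + x^7 + x^6 + x^3 + 1, the digit size of 41 bits is an arbitrary illustrative choice, and the helper names are hypothetical. The loop here processes the digits one after another, whereas a parallel hardware multiplier would evaluate them within a single CC.

```python
M = 163
# Assumed reduction polynomial (NIST): x^163 + x^7 + x^6 + x^3 + 1.
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def reduce_mod_f(v, m=M, f=F):
    """Reduce a binary polynomial (stored as int bits) modulo f."""
    while v.bit_length() > m:
        # Cancel the current leading term with a shifted copy of f.
        v ^= f << (v.bit_length() - 1 - m)
    return v


def clmul(a, b):
    """Carry-less (polynomial) product of a and b over GF(2)."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc


def gf_mul_lsd(a, b, digit=41, m=M, f=F):
    """Multiply a and b in GF(2^m), taking digits of b from the LSD upwards.

    The digit size is an illustrative assumption, not a value from the paper.
    """
    mask = (1 << digit) - 1
    acc = 0
    while b:
        acc ^= clmul(a, b & mask)           # partial product for this digit
        a = reduce_mod_f(a << digit, m, f)  # a <- a * x^digit mod f
        b >>= digit
    return reduce_mod_f(acc, m, f)
```

As a quick sanity check, multiplying x by x^162 yields x^163, which reduces to x^7 + x^6 + x^3 + 1 under the assumed polynomial, and the result is independent of the chosen digit size.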

On Virtex 5, the best-reported throughput/slice result over GF(2^163) is 29.96, which is achieved by implementing four-stage pipelining in [12]. It consumes a total of 3513 FPGA slices and requires 1397 CCs. The proposed two-stage pipelined architecture requires only 2027 slices, which is 43% lower than [12], and shows a 19% higher throughput/slice ratio. In comparison with [9, 10], our architecture shows throughput/slice improvements of 69 and 49%, respectively.

The most recent solution, presented in [11], implements two different architectures, i.e. one for high-performance ECC (HPECC) and one for low-latency ECC (LLECC). The proposed two-stage pipelined architecture outperforms the LLECC architecture presented in [11] on Virtex 5 as well as Virtex 7. On Virtex 5, a 41% improvement in throughput/slice ratio is observed, while the improvement on Virtex 7 is 29%. Finally, the proposed architecture achieves a 94% higher throughput/slice ratio than the solution in [13] on Virtex 5 as well as Virtex 7 devices.
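The percentage figures in this comparison can be related back to the entries of Table 4. The sketch below rests on an inference from the numbers, not on a definition given in the text: slice savings appear to be normalised by the slice count of the reference design, whereas throughput/slice gains appear to be normalised by the figure of the proposed design. The function names are illustrative.

```python
def slice_saving(ref_slices, our_slices):
    """Percentage of slices saved, relative to the reference design."""
    return 100.0 * (ref_slices - our_slices) / ref_slices


def ratio_gain(ref_ratio, our_ratio):
    """Throughput/slice gain in percent, relative to the proposed design.

    Assumption: the gain is normalised by our_ratio, not by ref_ratio.
    """
    return 100.0 * (our_ratio - ref_ratio) / our_ratio


if __name__ == "__main__":
    # Comparison with [6] on Virtex 4, values taken from Table 4.
    print(round(slice_saving(20807, 7519)))  # slices: 20,807 vs 7519
    print(round(ratio_gain(6.24, 7.69)))     # throughput/slice: 6.24 vs 7.69
```

Under this convention the comparison with [6] gives 64% and 19%, as quoted, and several of the other quoted figures are reproduced to within about one percentage point. If the gain were instead normalised by the reference value, the improvement over [6] would be about 23%.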

5 Conclusions

This paper presents a pipelined architecture for point multiplication on FPGA over GF(2^163) to GF(2^283), which outperforms other solutions in terms of throughput/area (area in FPGA slices) for high-performance applications. The key contributions include: (i) an efficient parallel LSD-based FF multiplier to perform field multiplication in a single CC, (ii) the placement of pipeline registers at the input of the AU to reduce the critical path and (iii) the efficient scheduling of point addition and point doubling operations to reduce the number of required CCs. The proposed design for GF(2^163) provides a throughput/slice figure of 42.22 on Virtex 7, which is higher than the relevant state-of-the-art solutions. Furthermore, our architectures outperform others in terms of FPGA area (slices) as well as operating frequency.

6 References

[1] Rivest, R.L., Shamir, A., Adleman, L.: ‘A method for obtaining digital signatures and public-key cryptosystems’, Commun. ACM, 1978, 21, (2), pp. 120–126

[2] Koblitz, N.: ‘Elliptic curve cryptosystems’, Math. Comput., 1987, 48, (177), pp. 203–209

[3] National Institute of Standards and Technology (NIST): ‘Recommended elliptic curves for federal government use’, July 1999. Available at http://csrc.nist.gov/CryptoToolkit/dss/ecdsa/NISTReCur.pdf, accessed April 2018

[4] Chelton, W.N., Benaissa, M.: ‘Fast elliptic curve cryptography on FPGA’, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2008, 16, (2), pp. 198–205

[5] Kim, C.H., Kwon, S., Hong, C.P.: ‘FPGA implementation of high performance elliptic curve cryptographic processor over GF(2^163)’, J. Syst. Archit., 2008, 54, (10), pp. 893–900

[6] Zhang, Y., Chen, D., Choi, Y., et al.: ‘A high performance ECC hardware implementation with instruction-level parallelism over GF(2^163)’, Microprocess. Microsyst., 2010, 34, pp. 228–236

[7] Mahdizadeh, H., Masoumi, M.: ‘Novel architecture for efficient FPGA implementation of elliptic curve cryptographic processor over GF(2^163)’, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2013, 21, (12), pp. 2330–2333

Table 4 Comparison with state-of-the-art over GF(2^163)
Ref.       FPGA        Slices   Freq., MHz   CCs      Time, µs   10^6/s/slices
[4]        Virtex 4    16,209   154          3010     19.5       3.16
[5]        Virtex 4    24,363   143          1446     10         4.10
[6]        Virtex 4    20,807   185          1428     7.7        6.24
[7]        Virtex 4    17,929   250          2751     9.6        5.80
[8]        Virtex 4    10,417   121          1091     9.0        6.19
[9]        Virtex 4    12,834   196          3372     17.2       4.53
[9]        Virtex 5    6536     262          3379     12.9       11.86
[10]       Virtex 5    10,363   153          780      5.1        18.92
[11]       Virtex 5    11,777   113          450      3.9        21.77
[12]       Virtex 5    3513     147          1397     9.5        29.96
[13]       Virtex 5    4815     550          52,012   94.6       2.19
[11]       Virtex 7    11,657   159          450      2.8        30.31
[13]       Virtex 7    4665     800          52,012   65.0       3.29
[14]       Spartan 6   1844     12           9755     780.4      0.69
[15]       Virtex 7    3657     135          3426     25.3       10.80
proposed   Virtex 4    7519     229          3960     17.29      7.69
proposed   Virtex 5    2027     298          3960     13.28      37.14
proposed   Virtex 7    2207     369          3960     10.73      42.22



[8] Liu, S., Ju, L., Cai, X., et al.: ‘High performance FPGA implementation of elliptic curve cryptography over binary fields’. 2014 IEEE 13th Int. Conf. on Trust, Security and Privacy in Computing and Communications (TrustCom), Beijing, China, 2014, pp. 148–155

[9] Azarderakhsh, R., Reyhani-Masoleh, A.: ‘Efficient FPGA implementations of point multiplication on binary Edwards and generalized Hessian curves using Gaussian normal basis’, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2012, 20, (8), pp. 1453–1466

[10] Khan, Z.U.A., Benaissa, M.: ‘High speed ECC implementation on FPGA over GF(2^m)’. IEEE Proc. of 25th Int. Conf. on Field-Programmable Logic Applications, London, UK, 2015, pp. 1–6

[11] Khan, Z.U.A., Benaissa, M.: ‘High-speed and low-latency ECC processor implementation over GF(2^m) on FPGA’, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2017, 25, (1), pp. 165–176

[12] Roy, S.S., Rebeiro, C., Mukhopadhyay, D.: ‘Theoretical modeling of elliptic curve scalar multiplier on LUT-based FPGAs for area and speed’, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., 2013, 21, (5), pp. 901–909

[13] Nguyen, T.T., Lee, H.: ‘Efficient algorithm and architecture for elliptic curve cryptographic processor’, J. Semicond. Technol. Sci., 2016, 16, (1), pp. 118–125

[14] Imran, M., Kashif, M., Rashid, M.: ‘Hardware design and implementation of scalar multiplication in elliptic curve cryptography (ECC) over GF(2^163) on FPGA’. IEEE Proc. of 6th Int. Conf. on Information and Communication Technologies (ICICT), Karachi, Pakistan, December 2015, pp. 1–4

[15] Imran, M., Shafi, I., Jafri, A.R., et al.: ‘Hardware design and implementation of ECC based crypto processor for low-area-applications on FPGA’. IEEE Proc. of 11th Int. Conf. on Open Source Systems and Technologies (ICOSST), Lahore, Pakistan, December 2017, pp. 54–59

[16] Shafique, M., Theocharides, T., Bouganis, C., et al.: ‘An overview of next-generation architectures for machine learning: roadmap, opportunities and challenges in the IoT era’. Proc. of DATE Conf., Dresden, Germany, 2018, pp. 827–832

[17] Jumaa, N.K.: ‘Survey: internet of thing using FPGA’, Iraq J. Electr. Electron. Eng., 2017, 13, (1), pp. 38–45

[18] Sareen, P.: ‘Cloud computing: types, architecture, applications, concerns, virtualization and role of IT governance in cloud’, Int. J. Adv. Res. Comput. Sci. Softw. Eng., 2013, 3, (3), pp. 533–538

[19] Rashidi, B., Sayedi, S.M., Reza, F.R.: ‘High-speed hardware architecture of scalar multiplication for binary elliptic curve cryptosystems’, Microelectron. J., 2016, 52, pp. 49–65

[20] Rashid, M., Imran, M., Jafri, A.R., et al.: ‘Flexible architectures for cryptographic algorithms - a systematic literature review’, J. Circuits Syst. Comput., 2019, 28, pp. 1930003-1–1930003-35

[21] Jafri, A.R., Islam, M.N., Imran, M., et al.: ‘Towards an optimized architecture for unified binary Huff curves’, J. Circuits Syst. Comput., 2017, 26, (11), pp. 1–14

[22] Hankerson, D., Menezes, A., Vanstone, S.: ‘Guide to elliptic curve cryptography’ (Springer-Verlag, New York, 2004, 1st edn.), pp. 1–311

[23] Montgomery, P.L.: ‘Speeding the Pollard and elliptic curve methods of factorization’, Math. Comput., 1987, 48, (177), pp. 243–264

[24] Imran, M., Rashid, M.: ‘Architectural review of polynomial bases finite field multipliers over GF(2^m)’. IEEE Int. Conf. on Communication, Computing and Digital Systems (C-CODE), Islamabad, Pakistan, May 2017, pp. 331–336

[25] Itoh, T., Tsujii, S.: ‘A fast algorithm for computing multiplicative inverses in GF(2^m) using normal bases’, J. Inf. Comput., 1988, 78, (3), pp. 171–177

