
www.elsevier.com/locate/compeleceng

Computers and Electrical Engineering 31 (2005) 345–360

Hardware implementation analysis of SHA-256 and SHA-512 algorithms on FPGAs

Imtiaz Ahmad *, A. Shoba Das

Department of Computer Engineering, Kuwait University, P.O. Box 5969, Safat 13060, Kuwait

Received 2 May 2005; received in revised form 1 June 2005; accepted 14 July 2005
Available online 18 October 2005

Abstract

Hash functions are common and important cryptographic primitives that are critical for data integrity assurance and data origin authentication security services. Field programmable gate arrays (FPGAs), being reconfigurable, flexible and physically secure, are a natural choice for implementing hash functions in a broad range of applications with different area-performance requirements. In this paper, we explore alternative architectures for implementing the hash algorithms of the secure hash standards SHA-256 and SHA-512 on FPGAs and study their area-performance trade-offs. As several 64-bit adders are needed in SHA-512 hash value computation, the new architectures proposed in this paper implement modulo-64 addition as modulo-32, modulo-16 and modulo-8 additions with a view to reducing the chip area. The SHA-512 hash function is implemented in different FPGA families of ALTERA to compare performance metrics such as area, memory, latency, clocking frequency and throughput, in order to guide a designer in selecting the most suitable FPGA for an application. In addition, a common architecture is designed for implementing the SHA-256 and SHA-512 algorithms.
© 2005 Elsevier Ltd. All rights reserved.

0045-7906/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compeleceng.2005.07.001

This research is funded by Kuwait University Grant EO 04/03.
* Corresponding author. Tel.: +965 4985849; fax: +965 4839461.
E-mail address: [email protected] (I. Ahmad).



1. Introduction

Data integrity assurance and data origin authentication are essential security services in financial transactions, electronic commerce, electronic mail, software distribution, data storage and so on. Cryptographic hash functions are used to provide these security services. The purpose of a hash function is to produce a "fingerprint" of a file, message, or other block of data. A hash value h is generated by a function H of the form h = H(M), where M is a variable-length message and H(M) is the fixed-length hash value. In a cryptographic hash function, a message of arbitrary length is padded, broken into blocks and input sequentially to a compression function, which converts a fixed-length input (the current message block) to a fixed-length output (the hash value). The hash values of individual blocks are used iteratively by the compression function to find the final hash value, referred to as the message digest. A hash function provides an effectively unique relationship between the input message and the hash value and hence represents a longer message in a concise way. Therefore, computing a digital signature over a large document (message) can be replaced by applying the cryptographic processing to the document's hash value, which is much smaller than the document [1]. Other popular applications of hash functions include digital signature schemes in public-key cryptosystems recommended in RFC 2437 [19], password storage and verification specified in RFC 2289 [20], and pseudo-random number generation. The hash function is also the building block of secret-key message authentication codes (MACs) [2] used in two popular security protocols, namely, secure sockets layer (SSL), provided in RFC 2246 [21], and IP Security, described in RFC 2404 [3].
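The relation h = H(M) can be illustrated in software. The following is a minimal sketch that uses Python's standard hashlib module, which is not part of the hardware designs in this paper; the helper name fingerprint is only illustrative. It shows that messages of very different lengths map to hash values of the same fixed length.

```python
import hashlib

def fingerprint(message: bytes, algorithm: str = "sha256") -> bytes:
    """Return the fixed-length hash value h = H(M) of a variable-length message M."""
    return hashlib.new(algorithm, message).digest()

short_digest = fingerprint(b"abc")
long_digest = fingerprint(b"abc" * 100_000)

# Both digests have the same length although the messages differ greatly in size.
print(len(short_digest) * 8, len(long_digest) * 8)
print(len(fingerprint(b"abc", "sha512")) * 8)
```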

A cryptographic hash function should be hard to invert, i.e., given a hash value h, it should be computationally infeasible to find some input M such that H(M) = h, and collision-free, i.e., it should be computationally infeasible to find two messages M1 and M2 such that H(M1) = H(M2). Most of the secure hash functions in use today have an iterative structure. The motivation for the iterative structure stems from the fact that the compression function, which generates a hash value for the current message block using the hash value of the preceding block, actually combines two or more inputs to produce an output in which each output bit is a different complex non-linear function of all the input bits. This makes the resultant hash function collision resistant. High-performance cryptographic hardware systems typically require an extra module for hash function calculation to reduce the workload of the main microprocessor.

Hash functions that have a "dedicated" design are fast and have a considerable advantage over algorithms that are based on block ciphers. Dedicated hash functions suitable for both software and hardware implementation have been proposed and are now widely used in real-world applications. Some of the most widely used dedicated hash functions are the message-digest algorithm MD5 [4] and the secure hash algorithm SHA-1 [5]. The complexity of the best attack on SHA-1 is $2^{80}$, which no longer matches the security guaranteed by the new secret-key encryption standard, AES (Advanced Encryption Standard), which uses one key for both encryption and decryption with key sizes of 128, 192 and 256 bits [6]. Therefore, three new hash functions (SHA-256, SHA-384 and SHA-512), referred to as SHA-2, whose security matches that of AES, with complexities of the best attack of $2^{128}$, $2^{192}$ and $2^{256}$, respectively, have been announced by the National Institute of Standards and Technology (NIST) [7]. More recently, a new standard, SHA-224, has been announced by NIST [8]. The functional characteristics of the SHA functions differ and are presented in Table 1.

Table 1
Comparison of functional characteristics of hash functions

Hash functions                  SHA-1       SHA-224     SHA-256     SHA-384     SHA-512
Size of hash value              160         224         256         384         512
Complexity of the best attack   $2^{80}$    $2^{112}$   $2^{128}$   $2^{192}$   $2^{256}$
Message size                    $<2^{64}$   $<2^{64}$   $<2^{64}$   $<2^{128}$  $<2^{128}$
Message block size              512         512         512         1024        1024
Word size                       32          32          32          64          64
Number of words                 5           8           8           8           8
Number of digest rounds         80          64          64          80          80
Number of constants             4           64          64          80          80
Round-dependent operations      $f_t$       None        None        None        None


A hardware implementation of a cryptographic hash function is inherently more physically secure, as it is physically separate from the main processor, and it has higher performance than a software implementation. Moreover, reconfigurable hardware devices such as field programmable gate arrays (FPGAs) are well suited to the implementation of cryptographic hash functions, as they are flexible and easily upgradeable. In the implementation of hash functions on FPGAs, area and performance are two of the most important design criteria. Hash functions have a broad range of applications, and hence their area-performance requirements may differ from one application to another. In some applications, such as smart cards, area is of concern, whereas in storage area networks (SANs) and virtual private networks (VPNs) performance is critical. Other applications, such as digital video recorders, require an optimization of the performance/area ratio. Therefore, different architectures can be used for SHA function implementation, and it is necessary to evaluate alternative architectures on the basis of their area-performance characteristics.

In cryptographic hash functions, a common sequence of operations is called a digest round, and the compression function produces a hash value by subjecting a block of the message to several digest rounds. The number of digest rounds differs among the SHA functions, as shown in Table 1. In many applications, an improvement in the performance of these basic cryptographic primitives is directly reflected in an overall improvement of the system performance. Among the operations of a digest round in SHA-2 functions, the addition of several operands occupies a major part of the chip area when implemented in FPGAs. Multi-operand addition also dictates the critical path delay in the computation of the hash value [17]. Hence, we focused on the design of adders while proposing new architectures. This paper deals with three issues, namely, proposing different architectures for the implementation of a hash function on an FPGA, comparing the performance metrics of different FPGAs that implement a SHA-2 function, and single-chip implementation of SHA-2 family hash functions. As the performance metrics of FPGAs of different families, even from the same manufacturer, are not identical, an evaluation of FPGAs on the basis of performance metrics helps in selecting an appropriate FPGA for an application. Moreover, since the hash functions of the SHA-2 family have identical operations in a digest round, we were motivated to design a common architecture for these functions.

The remainder of the paper is organized as follows. An overview of previous work is given in Section 2. The prelude in Section 3 discusses the SHA-256 and SHA-512 algorithms. In Section 4, the philosophy behind the reduced word length implementation of the SHA-512 function and the single-chip implementation of SHA-256 and SHA-512 is explained, and in Section 5 the implementation is detailed. Results are discussed in Section 6 and the paper is concluded in Section 7.

2. Previous work

Many studies have been carried out on the implementation of cryptographic hash functions [9–17]. Bosselaers et al. [9,10] reported software performance evaluations of the message-digest algorithm MD5 and the secure hash standard algorithm SHA-1 on a Pentium processor. Nakajima and Matsui [11] reported a software performance analysis of the newly proposed hash function SHA-512 on a Pentium III processor. Hash function applications demand hardware implementations to meet the performance requirements of high-speed networks. Dominikus [12] reported an FPGA implementation of the MD5 hash algorithm. McLoone and McCanny proposed a single-chip FPGA solution for SHA-384 and SHA-512 [13]. Kang et al. [14] reported the implementation of MD5 and SHA-1 on an Altera FPGA. Grembowski et al. [15] reported a comparative analysis of the hardware implementations of SHA-1 and the newly proposed hash function SHA-512 on a Xilinx Virtex FPGA. A common architecture for the implementation of the SHA-2 family is reported in [16]. An elegant application-specific integrated circuit (ASIC) implementation of SHA-512 that makes use of delay balancing and pipelining was recently reported by Dadda and Macchetti [17].

To the best of our knowledge, a study of area-performance metrics in the FPGA implementation of SHA-2 functions to suit different applications has not been done so far. In a digest round of SHA-512, several 64-bit operands are added and logic operations are performed on them. This paper explores alternative architectures for SHA-512 implementation on an FPGA using 8-, 16- and 32-bit adder/logic circuits and compares their area-performance trade-offs. Performance metrics such as area, memory, latency, clocking frequency and throughput of FPGAs of different ALTERA families are evaluated for the implementation of SHA-512, and finally a single-chip implementation of SHA-256 and SHA-512 with 32-bit adders is presented.

3. Prelude

In this section, the SHA-256 and SHA-512 algorithms are discussed in detail. When a message of any length $<2^{64}$ bits (for SHA-256) or $<2^{128}$ bits (for SHA-512) is input, the hash functions SHA-256 and SHA-512 compute a condensed representation of the message, referred to as the message digest.

The message digests generated by SHA-256 and SHA-512 are 256 and 512 bits long, respectively. The algorithm for generating the message digest is identical for SHA-256 and SHA-512; only the constants and functions used differ, and hence SHA-256 and SHA-512 are discussed together in this section. The procedure consists of two stages, namely, preprocessing and hash computation. In the preprocessing stage, the message is padded and parsed into m-bit blocks, and the initialization values to be used in the hash computation are set. A Message Scheduler (MS) divides the m-bit block into 16 words and prepares a message schedule by passing one word at a time. A series of hash values is generated iteratively from functions, constants, and word operations, and the final hash value is the message digest. The message digest generation technique is shown in Fig. 1.

Fig. 1. Message digest generation. (Block diagram; blocks: Padder, Message Scheduler, ROM holding the hash constants, Iterative Processing Unit, Modulo Adder. Signals: Message, $W_t$, $K_t$, $a_{in}$ - $h_{in}$, $a_{out}$ - $h_{out}$, $a_i$ - $h_i$, $H_0^i$ - $H_7^i$, $H_0^{i+1}$ - $H_7^{i+1}$, Message Digest.)

The operations performed in the two stages are listed below:

Preprocessing:

• Padding the message to a multiple of 512 or 1024 bits.
• Parsing the padded message into N message blocks $B^1, B^2, \ldots, B^N$, where the block size is 512 or 1024 bits.
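The preprocessing stage can be sketched in software as follows. The paper does not spell out the padding rule; this sketch assumes the rule of the secure hash standard (a single '1' bit, zero fill, then the message length in bits as a 64-bit or 128-bit big-endian field) and a byte-oriented message. The function name pad_and_parse is hypothetical.

```python
def pad_and_parse(message: bytes, block_bits: int = 512) -> list:
    """Pad a message and split it into blocks of block_bits (512 or 1024) bits."""
    if block_bits not in (512, 1024):
        raise ValueError("block size must be 512 or 1024 bits")
    block_bytes = block_bits // 8
    length_bytes = block_bytes // 8        # 8 bytes for SHA-256, 16 bytes for SHA-512
    bit_length = len(message) * 8

    padded = message + b"\x80"             # a single '1' bit followed by seven '0' bits
    # Zero fill so that the length field ends exactly on a block boundary.
    padded += b"\x00" * (-(len(padded) + length_bytes) % block_bytes)
    padded += bit_length.to_bytes(length_bytes, "big")

    # Parse the padded message into N blocks.
    return [padded[i:i + block_bytes] for i in range(0, len(padded), block_bytes)]

blocks = pad_and_parse(b"abc")
print(len(blocks), len(blocks[0]) * 8)
```

A 3-byte message fits in one 512-bit block, whereas a 56-byte message no longer leaves room for the length field and therefore needs two blocks.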

Hash computation:

• The message blocks $B^i$ are processed in order. A word (32 or 64 bits wide) of a message block $B^i$ is referred to as $B_t^i$, and each block contains 16 such words.
• For each message block i in the range 1 to N, starting from the message schedule $W_t$, the following steps (1–4) are repeated to compute the hash values $H_0^i$ to $H_7^i$ for the ith block.

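Step 1: $W_t$ is computed by an identical procedure for SHA-256 and SHA-512; only the logic functions $\sigma_0$ and $\sigma_1$ differ.

SHA-256:

\[
W_t =
\begin{cases}
B_t^i, & 0 \le t \le 15 \\
\sigma_1^{256}(W_{t-2}) + W_{t-7} + \sigma_0^{256}(W_{t-15}) + W_{t-16}, & 16 \le t \le 63
\end{cases}
\]

where, as defined in the standard [7],

\[
\begin{aligned}
\sigma_0^{256}(x) &= \mathrm{ROTR}^{7}(x) \oplus \mathrm{ROTR}^{18}(x) \oplus \mathrm{SHR}^{3}(x) \\
\sigma_1^{256}(x) &= \mathrm{ROTR}^{17}(x) \oplus \mathrm{ROTR}^{19}(x) \oplus \mathrm{SHR}^{10}(x)
\end{aligned}
\]

SHA-512:

\[
W_t =
\begin{cases}
B_t^i, & 0 \le t \le 15 \\
\sigma_1^{512}(W_{t-2}) + W_{t-7} + \sigma_0^{512}(W_{t-15}) + W_{t-16}, & 16 \le t \le 79
\end{cases}
\]

where, as defined in the standard [7],

\[
\begin{aligned}
\sigma_0^{512}(x) &= \mathrm{ROTR}^{1}(x) \oplus \mathrm{ROTR}^{8}(x) \oplus \mathrm{SHR}^{7}(x) \\
\sigma_1^{512}(x) &= \mathrm{ROTR}^{19}(x) \oplus \mathrm{ROTR}^{61}(x) \oplus \mathrm{SHR}^{6}(x)
\end{aligned}
\]

$\mathrm{ROTR}^{n}(x)$ is a circular rotation of a variable x by n positions to the right and $\mathrm{SHR}^{n}(x)$ is a shift of a variable x by n positions to the right.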
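Step 1 can be sketched in software as follows. This is an illustrative model, not the hardware message scheduler of this paper; the rotation and shift amounts are those of the secure hash standard, and the names SIGMA, ROUNDS and message_schedule are hypothetical.

```python
def rotr(x, n, w):
    """Circular rotation of the w-bit word x by n positions to the right."""
    return ((x >> n) | (x << (w - n))) & ((1 << w) - 1)

# (rotation, rotation, shift) amounts of sigma0 and sigma1, keyed by word size
SIGMA = {32: ((7, 18, 3), (17, 19, 10)),
         64: ((1, 8, 7), (19, 61, 6))}
ROUNDS = {32: 64, 64: 80}

def small_sigma(x, amounts, w):
    r1, r2, s = amounts
    return rotr(x, r1, w) ^ rotr(x, r2, w) ^ (x >> s)

def message_schedule(block_words, w=32):
    """Expand the 16 words of one block into W_0..W_63 (w=32) or W_0..W_79 (w=64)."""
    if len(block_words) != 16:
        raise ValueError("a message block consists of 16 words")
    mask = (1 << w) - 1
    s0, s1 = SIGMA[w]
    W = list(block_words)                      # W_t = B_t for 0 <= t <= 15
    for t in range(16, ROUNDS[w]):
        W.append((small_sigma(W[t - 2], s1, w) + W[t - 7]
                  + small_sigma(W[t - 15], s0, w) + W[t - 16]) & mask)
    return W

# The single padded block of the 3-byte message "abc" for SHA-256
abc_block = [0x61626380] + [0] * 14 + [0x18]
W = message_schedule(abc_block)
print(len(W), hex(W[16]), hex(W[17]))
```

The four-operand addition inside the loop is the one that the later sections implement with carry save adders.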

The block diagram of the SHA-256/SHA-512 algorithm is shown in Fig. 2.

Step 2: The hash values $H_0^{i-1}$ to $H_7^{i-1}$ are assigned to the variables a, b, c, d, e, f, g, h. The eight initial hash values, which are 32 or 64 bits wide, are shown in Table 2.

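Fig. 2. Block diagram of SHA-256/SHA-512. (Message scheduler: sixteen word registers 0 to 15 fed by the padded message, with the $\sigma_0$ and $\sigma_1$ blocks, producing $W_t$. Iterative processing unit: registers a to h with the Maj(a, b, c), Ch(e, f, g), $\Sigma_0$ and $\Sigma_1$ blocks, inputs $W_t$ and $K_t$, and the sums $T_1$ and $T_1 + T_2$.)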

Table 2
Initial hash values of SHA-256 and SHA-512

                 SHA-256    SHA-512
$H_0^0 \to a$    6a09e667   6a09e667 f3bcc908
$H_1^0 \to b$    bb67ae85   bb67ae85 84caa73b
$H_2^0 \to c$    3c6ef372   3c6ef372 fe94f82b
$H_3^0 \to d$    a54ff53a   a54ff53a 5f1d36f1
$H_4^0 \to e$    510e527f   510e527f ade682d1
$H_5^0 \to f$    9b05688c   9b05688c 2b3e6c1f
$H_6^0 \to g$    1f83d9ab   1f83d9ab fb41bd6b
$H_7^0 \to h$    5be0cd19   5be0cd19 137e2179


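• A sequence of 64 constant 32-bit words, $K_t^{256}$, or 80 constant 64-bit words, $K_t^{512}$, is used by the processing unit.

• The processing unit uses four logic functions: Ch, Maj, $\Sigma_0$ and $\Sigma_1$. The logic functions Ch and Maj are identical for SHA-256 and SHA-512.

\[
\begin{aligned}
Ch(x, y, z) &= (x \wedge y) \oplus (\neg x \wedge z) \\
Maj(x, y, z) &= (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z)
\end{aligned}
\]

SHA-256:

\[
\begin{aligned}
\Sigma_0(x) &= \mathrm{ROTR}^{2}(x) \oplus \mathrm{ROTR}^{13}(x) \oplus \mathrm{ROTR}^{22}(x) \\
\Sigma_1(x) &= \mathrm{ROTR}^{6}(x) \oplus \mathrm{ROTR}^{11}(x) \oplus \mathrm{ROTR}^{25}(x)
\end{aligned}
\]

SHA-512:

\[
\begin{aligned}
\Sigma_0(x) &= \mathrm{ROTR}^{28}(x) \oplus \mathrm{ROTR}^{34}(x) \oplus \mathrm{ROTR}^{39}(x) \\
\Sigma_1(x) &= \mathrm{ROTR}^{14}(x) \oplus \mathrm{ROTR}^{18}(x) \oplus \mathrm{ROTR}^{41}(x)
\end{aligned}
\]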

Step 3: The processing unit performs this step 64 or 80 times on a 512- or 1024-bit block.

\[
\begin{aligned}
T_1 &= h + \Sigma_1(e) + Ch(e, f, g) + K_t + W_t \\
T_2 &= \Sigma_0(a) + Maj(a, b, c) \\
h &= g \\
g &= f \\
f &= e \\
e &= d + T_1 \\
d &= c \\
c &= b \\
b &= a \\
a &= T_1 + T_2
\end{aligned}
\]

The variables used in the above equations refer to the respective values for SHA-256 and SHA-512.


Step 4: The ith intermediate hash values $H_0^i$ to $H_7^i$ are computed by modulo-32 or modulo-64 bit adders after the iterations.

\[
\begin{aligned}
H_0^i &= a + H_0^{i-1}, & H_1^i &= b + H_1^{i-1}, & H_2^i &= c + H_2^{i-1}, & H_3^i &= d + H_3^{i-1}, \\
H_4^i &= e + H_4^{i-1}, & H_5^i &= f + H_5^{i-1}, & H_6^i &= g + H_6^{i-1}, & H_7^i &= h + H_7^{i-1}
\end{aligned}
\]

• The message digest is computed as $H_0^N \| H_1^N \| H_2^N \| H_3^N \| H_4^N \| H_5^N \| H_6^N \| H_7^N$ after processing all the N blocks in the message.
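Steps 1 to 4 can be put together in a short software reference model, parameterized by the word size. This is a sketch for checking the equations above, not a description of the hardware architecture. Two things are taken from the secure hash standard rather than from this paper: the padding rule, and the fact that the initial hash values and the constants $K_t$ are the leading fractional bits of the square roots and cube roots of the first primes, which is used here to avoid typing the constant tables. The names PARAMS, primes, icbrt and sha2 are hypothetical.

```python
import hashlib
import math

PARAMS = {  # word size -> (rounds, Sigma0, Sigma1, sigma0, sigma1); last sigma entry is a shift
    32: (64, (2, 13, 22), (6, 11, 25), (7, 18, 3), (17, 19, 10)),
    64: (80, (28, 34, 39), (14, 18, 41), (1, 8, 7), (19, 61, 6)),
}

def primes(count):
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

def icbrt(n):
    """Integer cube root (floor), found by bisection."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def rotr(x, n, w):
    return ((x >> n) | (x << (w - n))) & ((1 << w) - 1)

def sha2(message, w=32):
    """SHA-256 (w=32) or SHA-512 (w=64) digest of a byte string."""
    rounds, S0, S1, s0, s1 = PARAMS[w]
    mask, wb = (1 << w) - 1, w // 8
    H = [math.isqrt(p << (2 * w)) & mask for p in primes(8)]       # initial hash values
    K = [icbrt(p << (3 * w)) & mask for p in primes(rounds)]       # round constants K_t
    big = lambda x, a: rotr(x, a[0], w) ^ rotr(x, a[1], w) ^ rotr(x, a[2], w)
    small = lambda x, a: rotr(x, a[0], w) ^ rotr(x, a[1], w) ^ (x >> a[2])

    # Preprocessing: pad to a multiple of 16 words, then parse block by block.
    block = 16 * wb
    data = message + b"\x80"
    data += b"\x00" * (-(len(data) + 2 * wb) % block)
    data += (8 * len(message)).to_bytes(2 * wb, "big")

    for i in range(0, len(data), block):
        # Step 1: message schedule
        W = [int.from_bytes(data[i + j * wb:i + (j + 1) * wb], "big") for j in range(16)]
        for t in range(16, rounds):
            W.append((small(W[t - 2], s1) + W[t - 7] + small(W[t - 15], s0) + W[t - 16]) & mask)
        # Step 2: load the working variables
        a, b, c, d, e, f, g, h = H
        # Step 3: digest rounds
        for t in range(rounds):
            T1 = (h + big(e, S1) + ((e & f) ^ (~e & g)) + K[t] + W[t]) & mask
            T2 = (big(a, S0) + ((a & b) ^ (a & c) ^ (b & c))) & mask
            a, b, c, d, e, f, g, h = (T1 + T2) & mask, a, b, c, (d + T1) & mask, e, f, g
        # Step 4: intermediate hash value
        H = [(x + y) & mask for x, y in zip((a, b, c, d, e, f, g, h), H)]
    return b"".join(x.to_bytes(wb, "big") for x in H)

print(sha2(b"abc").hex() == hashlib.sha256(b"abc").hexdigest())
print(sha2(b"abc", 64).hex() == hashlib.sha512(b"abc").hexdigest())
```

The comparison against hashlib is only a consistency check of the model; the same function covers both algorithms because, as noted above, only the constants and the rotation amounts differ.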
4. Reduced word length implementations

FPGAs are well suited to the implementation of cryptographic hash functions, as they meet the speed requirements and are reconfigurable. It is clear from Section 3 that the message schedule $W_t = \sigma_1^{256}(W_{t-2}) + W_{t-7} + \sigma_0^{256}(W_{t-15}) + W_{t-16}$ requires a four-operand addition, and that the intermediate value $a = T_1 + T_2$, where $T_1 = h + \Sigma_1(e) + Ch(e, f, g) + K_t + W_t$ and $T_2 = \Sigma_0(a) + Maj(a, b, c)$, requires a six-operand addition. The ith intermediate hash values $H_0^i$ to $H_7^i$ are computed by modulo-64 bit adders after the iterations, and hence an additional eight modulo-64 adders are required to find the final hash value for the block. Multi-operand addition is the most problematic part of the implementation of hash functions. Hence, reducing the size of these adders reduces the number of logic elements required in the FPGA, thereby reducing the overall area of the final circuit.

The implementation of multi-operand 64-bit adders on FPGAs demands the selection of a proper scheme for performing the addition, as both speed and area are of concern. It is well known that carry look-ahead adders (CLAs) are faster than conventional carry propagate adders, but carry save adders (CSAs), also referred to as redundant adders, are faster and have a smaller area than a CLA. For an n-bit adder with each module handling m bits, the delay is proportional to $\log_m n$ for a one-level CLA, whereas redundant adders have a constant delay [18]. Assuming a complexity of $k_m$ for implementing a 1-bit module, the area of a CLA has a complexity proportional to $k_m n$, and a redundant adder has an area proportional to n. In a CSA, an array of full adders (FAA) is used to add three binary vectors without propagating the carries, and two binary vectors, the pseudo-sum and the carry, are generated. As the carry output of the ith full adder has weight i + 1, no full adder produces a carry of weight 0, i.e., $vc_0 = 0$. Hence, the carry-in ($c_{in}$) can be included in the place of $vc_0$, as shown in Fig. 3.
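The behaviour of a CSA can be sketched bit-parallel in software. This is an illustrative model under the description above, with the hypothetical function name carry_save_add; it is not the FPGA implementation.

```python
def carry_save_add(x, y, z, n, cin=0):
    """Reduce three n-bit vectors to a pseudo-sum vs and a carry vector vc.

    No carry is propagated between bit positions; each position is an
    independent full adder.
    """
    mask = (1 << n) - 1
    vs = (x ^ y ^ z) & mask                  # sum output of every full adder
    carries = (x & y) | (x & z) | (y & z)    # carry output of every full adder
    vc = ((carries << 1) | cin) & mask       # carry i has weight i + 1; cin fills vc0
    cout = (carries >> (n - 1)) & 1          # carry out of the most significant full adder
    return vs, vc, cout

vs, vc, cout = carry_save_add(0b0010, 0b1001, 0b1010, 4)
print(format(vs, "04b"), format(vc, "04b"), cout)
```

A single carry propagate or carry look-ahead addition of vs and vc then gives the sum of the three operands modulo $2^n$, which is why only the last stage of a multi-operand adder needs a CLA.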

Fig. 3. Carry save adder. (A full adder FA with inputs $x_0$, $y_0$, $z_0$ and $c_{in}$ and outputs $vs_0$ and $vc_1$; an n-bit CSA with inputs X, Y, Z and $c_{in}$ and outputs vs, vc and $c_{out}$, where $c_{in}$ takes the place of $vc_0$.)

Several schemes exist for the implementation of multi-operand addition. Using a network of full adders, p operands, each n bits wide, can be added using an array of [p:2] adders, referred to as reduction by rows, or using an array of (p:q] counters, referred to as reduction by columns. The arrays can be linear or tree arrays, and the same number of adders is used by both schemes. The reduction-by-rows technique involves an array of full adders and a CLA in the last stage to add the final pseudo-sum and pseudo-carry. For large n, the number of groups in a one-level CLA will be large, resulting in slow operation. If multiple levels are used, the maximum number of levels is $L = \log_m n$, the number of modules $N_{max}$ with the maximum number of levels is $(n-1)/(m-1)$, and the delay is proportional to $2\log_m n$ [18]. It can be seen that the selection of the group size m is an important factor, which in turn affects the delay and the number of modules. An optimum size for m is 4, which gives rise to 21 modules for the 3-level implementation of a 64-bit adder.
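As a check on these figures, substituting n = 64 and m = 4 into the expressions above gives

\[
L = \log_m n = \log_4 64 = 3, \qquad
N_{max} = \frac{n-1}{m-1} = \frac{64-1}{4-1} = 21 .
\]

With the same group size, the same expressions give $L = \log_4 16 = 2$ levels and $N_{max} = 15/3 = 5$ modules for a 16-bit adder, which illustrates how both quantities shrink with the word length.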

Reducing the word length of the operands input to the adders from 64 bits to smaller sizes, namely 32, 16 and 8 bits, reduces the value of n, which in turn reduces L, $N_{max}$ and the delay. The size of the adders and the overall size of the hash function circuit are thereby reduced. If a 32-bit adder is used instead of a 64-bit adder, the area of the CSA is reduced to one proportional to n = 32. The addition of three 8-bit operands, P, Q and R, using a 4-bit CSA is illustrated in Fig. 4. Initially, the lower nibbles of the operands are added, and the same adder is then used to add the higher nibbles.

The carry bit $vc_4 = 1$ generated while adding bits (3-0) of the operands in the 4-bit CSA is stored in a flip-flop, and $vc_0$ is assigned the value '0'. The pseudo-sum $vs_{3-0}$ and the pseudo-carry $vc_{3-0}$ are added in a 4-bit CLA to get the final sum $VS_{3-0} = 0101$. The higher nibbles of the operands are added by the same CSA; the carry produced, $vc_4 = 0$, is ignored, and $vc_0$ is assigned the value '1', which is the carry from the lower-nibble addition saved in the flip-flop. In the same 4-bit CLA, the pseudo-sum $vs_{3-0}$ and the pseudo-carry $vc_{3-0}$ are added to get the final sum $VS_{7-4} = 0110$. The same logic is used for the carry generated from the lower-nibble addition in the CLA, and the modulo-8 sum is 01100101.
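The nibble-serial addition described above can be traced with a small software model. This is a sketch under the stated scheme (one narrow CSA and one narrow CLA reused for every slice, with one stored carry for each); the function name serial_csa_add and the variable names are hypothetical.

```python
def serial_csa_add(p, q, r, total_bits=8, slice_bits=4):
    """Add three operands modulo 2**total_bits, one slice at a time."""
    mask = (1 << slice_bits) - 1
    csa_ff = 0     # flip-flop holding the CSA carry out (vc of weight slice_bits)
    cla_ff = 0     # flip-flop holding the CLA carry out of the previous slice
    result = 0
    for shift in range(0, total_bits, slice_bits):   # lower slice first
        x, y, z = ((v >> shift) & mask for v in (p, q, r))
        vs = x ^ y ^ z                               # pseudo-sum
        carries = (x & y) | (x & z) | (y & z)
        vc = ((carries << 1) | csa_ff) & mask        # stored carry enters at vc0
        csa_ff = carries >> (slice_bits - 1)         # saved for the next slice
        total = vs + vc + cla_ff                     # the narrow CLA
        cla_ff = total >> slice_bits
        result |= (total & mask) << shift
    return result                                    # carries out of the last slice are ignored

print(format(serial_csa_add(0x22, 0x79, 0xCA), "08b"))
```

For P = 22, Q = 79 and R = CA (hexadecimal) the model reproduces the sum 01100101 of the example. Calling it with total_bits=64 and slice_bits=32 corresponds to the word lengths used for SHA(32)-512 in Section 5.1.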

Let P = 22, Q = 79, R = CA (hexadecimal)

             Bits (3-0)    Bits (7-4)
X3-0    =    0 0 1 0       0 0 1 0
Y3-0    =    1 0 0 1       0 1 1 1
Z3-0    =    1 0 1 0       1 1 0 0
vs3-0   =    0 0 0 1       1 0 0 1
vc4-1   =    1 0 1 0       0 1 1 0
             vc4 = 1       vc4 is ignored

Fig. 4. Addition of 8-bit operands using 4-bit CSA.

Moreover, the logic functions Ch(e, f, g) and Maj(a, b, c) are such that they can be performed on reduced-size operands. The designs of SHA-512 with 64/32/16/8-bit adders and logic circuits will be referred to as SHA(64)-512, SHA(32)-512, SHA(16)-512 and SHA(8)-512, respectively, in the rest of the paper.

5. Implementation of hash functions with scaled down adders and logic circuits

In the block diagram of SHA(64)-512, which is shown in Fig. 2 in Section 3, the message scheduler is implemented with sixteen 64-bit registers, and in the Iterative Processing Unit (IPU), a to h are 64-bit registers. The additions of operands in the message scheduler and the processing unit are implemented using a network of 64-bit CSAs, and the reduction is done by rows. The final pseudo-sum and carry vectors are added using a 64-bit CLA.

5.1. SHA-512 implementation with 32 bit adders

The message scheduler (MS) and the iterative processing unit (IPU) implemented with 32-bit CSAs are shown in Figs. 5 and 6, respectively. The sixteen registers in the MS and the eight registers in the IPU are 64-bit registers, but each is split and used as two 32-bit registers. The suffixes U and L refer to bit vectors (63-32) and (31-0), respectively. A selector unit SEL is used to select the U or L register, and DSEL performs the opposite function of SEL. The SEL unit consists of two tristate buffers, which are driven by the select lines S0 and S1. A positive-edge-triggered T flip-flop generates S0, and the select signal S1 is generated by inverting S0. Data is always shifted from one half of a register to the corresponding half of the next register in the path. The transfer of data between the lower halves of registers takes place when S0 is asserted, and between the upper halves when S1 = 1. The shift logic functions $\sigma_0$ and $\sigma_1$ in the MS need both halves of a 64-bit operand together; therefore, the two halves of the operands are latched into these blocks at the negative edge of S1, and the shift logic is performed by combinational circuits. The same technique is used for $\Sigma_0$ and $\Sigma_1$ in the IPU.

The logic functions Ch(e, f, g) and Maj(a, b, c) are such that they can be performed on the two halves of the operands independently; hence, only 32-bit circuits are used for this logic. This in turn reduces the size of the overall circuit.
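The reason Ch and Maj can be evaluated on the halves independently, while the rotation-based functions need the whole 64-bit operand, is that Ch and Maj are purely bitwise. The following sketch checks this on three 64-bit sample values (the SHA-512 initial values of e, f and g from Table 2); the helper names halves, join and by_halves are hypothetical.

```python
def ch(x, y, z, width):
    """Ch(x, y, z) = (x AND y) XOR (NOT x AND z) on width-bit words."""
    return (x & y) ^ (~x & z & ((1 << width) - 1))

def maj(x, y, z):
    return (x & y) ^ (x & z) ^ (y & z)

def rotr(x, n, width):
    return ((x >> n) | (x << (width - n))) & ((1 << width) - 1)

def halves(x):
    """Split a 64-bit word into its U (63-32) and L (31-0) halves."""
    return x >> 32, x & 0xFFFFFFFF

def join(upper, lower):
    return (upper << 32) | lower

def by_halves(func, *operands):
    """Apply a 32-bit function to the upper and lower halves separately."""
    uppers, lowers = zip(*(halves(v) for v in operands))
    return join(func(*uppers), func(*lowers))

e, f, g = 0x510E527FADE682D1, 0x9B05688C2B3E6C1F, 0x1F83D9ABFB41BD6B

ch_split = by_halves(lambda x, y, z: ch(x, y, z, 32), e, f, g)
maj_split = by_halves(maj, e, f, g)
rot_split = by_halves(lambda x: rotr(x, 14, 32), e)

print(ch_split == ch(e, f, g, 64))      # bitwise function: halves are independent
print(maj_split == maj(e, f, g))        # bitwise function: halves are independent
print(rot_split == rotr(e, 14, 64))     # rotation moves bits across the halves
```

The last comparison fails for these values, which is consistent with the two halves being latched together before the shift logic is applied.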

Fig. 5. SHA(32)-512: message scheduler. (Sixteen registers 0 to 15, each split into U and L halves, with SEL and DSEL units, the $\sigma_0$ and $\sigma_1$ blocks, FAAs and a CLA; input: padded message; output: $W_t$; data paths are 32 bits wide except for the 64-bit inputs of $\sigma_0$ and $\sigma_1$.)

Fig. 6. SHA(32)-512: iterative processing unit. (Registers a to h, each split into U and L halves, with SEL and DSEL units, the Maj(a, b, c), Ch(e, f, g), $\Sigma_0$ and $\Sigma_1$ blocks, FAAs and CLAs; inputs: $W_t$ and $K_t$.)


The addition of three 64-bit operands using a 32-bit FAA is shown in Fig. 7. The SEL circuits are included to show the selection of one half of each operand, and the output is two 32-bit vectors, namely, Sum and Carry. The bit $vc_{32}$, which is shown as $c_{out}$ in Fig. 7, is stored in a D flip-flop when S0 is asserted, as this is the carry generated from the addition of bits (31-0) of the operands. Bit $vc_0 = 0$ when S0 is asserted, and $vc_0 = c_{out}$ of the lower-half addition when S1 is asserted. The carry from the higher-half addition is ignored, as the hash functions require modulo-64 addition. The 64-bit constant $K_t$ is input 32 bits at a time from a ROM. In order to store the eighty constants, a (160 × 32) ROM is used. Message digest computation is done by 32-bit CLAs, which add the hash values of the preceding iteration to the contents of registers a to h; hence, eight CLAs are used to perform this addition. The carries from the lower-order words are stored and used as $c_{in}$ while the higher-order words are added, as shown in Fig. 7 for FAAs.

5.2. SHA-512 implementation with 16 and 8 bit adders

Fig. 7. 32-bit FAA. (Inputs X, Y and Z, each selected as bits (31-0) by S0 or bits (63-32) by S1; outputs Sum $vs_{31-0}$ and Carry $vc_{31-1}$; $c_{out}$ is stored in a D flip-flop and fed back as $vc_0$, which is '0' when S0 is asserted.)

The design methodology of the SHA-512 function using 16- or 8-bit adders and logic circuits is identical to that of the 32-bit version. The 64-bit registers are split to suit the respective word lengths. For the 16-bit version, the selection is done by a counter-decoder (2 × 4) circuit, and a (3 × 8) decoder is used for the 8-bit version. Accordingly, the sizes of the SEL and DSEL circuits, the FAAs, the Ch and Maj functions and the ROM are also changed to operate on 16-bit or 8-bit operands.

5.3. Single chip implementation of SHA-512 and SHA-256

Implementation of SHA-512 using 32-bit adders and logic circuits facilitates the implementation of SHA-256 on the same chip, as SHA-256 performs its operations on 32-bit operands. The algorithm is identical for the SHA-512 and SHA-256 functions, and the user can select the algorithm by asserting an input line. One initial hash value and one $K_t$ value are shown in Table 3. It is clear from the table that the initial hash values and the constants $K_t$ of SHA-256 are exactly the upper 32-bit halves of those of SHA-512. Therefore, the ROM is organized as two banks of 80 words, with each bank handling one half of the constants. Combinational logic selects the appropriate banks depending on the algorithm.

The ROM banks and their associated logic are shown in Fig. 8. An 80 × 32 ROM bank (KH32) stores the $K_t$ constants of SHA-256, which are also the higher words of the SHA-512 constants, and another ROM bank (KL32) of the same capacity stores the lower halves of the constants. For SHA-256, the contents of KH32 have to be selected at every clock pulse, whereas for SHA-512, where the same contents are the higher words of the constants, they have to be selected at alternate clock pulses when passed on to a 32-bit adder. The associated logic shown in Fig. 8 selects the contents of KH32 either at every clock when SHA-256 is being computed or at alternate clocks for SHA-512.
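The two-bank organization can be modelled in software as follows. The constants are generated from the cube roots of the first 80 primes, as defined in the secure hash standard, rather than typed in. The addressing in rom_word is an assumption made for illustration only (lowest address bit selects the bank, lower word first); the actual selection logic is the one shown in Fig. 8. The names primes, icbrt and rom_word are hypothetical.

```python
def primes(count):
    found, n = [], 2
    while len(found) < count:
        if all(n % p for p in found):
            found.append(n)
        n += 1
    return found

def icbrt(n):
    """Integer cube root (floor), found by bisection."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

# 64-bit SHA-512 constants: first 64 fractional bits of the cube roots of the primes
K512 = [icbrt(p << 192) & ((1 << 64) - 1) for p in primes(80)]

KH32 = [k >> 32 for k in K512]            # bank KH32: higher words (SHA-256 constants)
KL32 = [k & 0xFFFFFFFF for k in K512]     # bank KL32: lower words

def rom_word(adr, s512):
    """32-bit word delivered for address adr (assumed addressing, see lead-in)."""
    if s512:
        # SHA-512: two accesses per constant, lower word first
        return KH32[adr >> 1] if adr & 1 else KL32[adr >> 1]
    # SHA-256: one access per constant, bank KH32 only
    return KH32[adr]

print(format(KH32[0], "08x"), format(KL32[0], "08x"))
```

The first entries of the two banks agree with the $K_0$ row of Table 3, and the first 64 entries of KH32 coincide with the SHA-256 constants.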

The logic functions $\sigma_0$, $\sigma_1$, $\Sigma_0$ and $\Sigma_1$ involve rotation and shifting and differ between the two algorithms; hence, separate logic circuits are designed for SHA-256 and SHA-512 to handle the respective functions.

Table 3
Relationship between constants of SHA-256 and SHA-512

                 SHA-256    SHA-512
$H_0^0 \to a$    6a09e667   6a09e667 f3bcc908
$K_0$            428a2f98   428a2f98 d728ae22

Fig. 8. ROM banks with selection logic. (Two 80 × 32 ROM banks with address lines Adr7-1, Adr0, Adr6-0 and HAdr6-0, the select signal S512, and the 32-bit output $K_t$.)


6. Experimental results

The SHA-512 and SHA-256 algorithms were designed and tested using the Altera Quartus II design software, version 4.0. Altera is the programmable logic performance leader across all platforms, as reported in http://www.altera.com/products/devices/performance/per-index.html, and provides a complete multi-platform design environment to suit specific design needs. The designs were written in Verilog HDL and VHDL, analyzed and synthesized, and placed and routed in Altera FPGAs of the APEX II, Stratix and Mercury families. Five performance metrics, namely area (a), memory (l), latency (k), clocking frequency (f) and throughput (d), were computed. APEX II FPGAs have up to 67,200 logic elements (LEs) and 1.1 Mbits of embedded RAM, and these devices offer abundant logic resources and remarkable I/O performance. High-speed compute-intensive data path functions can be easily implemented with one or multiple APEX II devices. Mercury family FPGAs typically have up to 14,400 LEs with a maximum of 114,688 RAM bits. The FPGAs of the Stratix family contain 10,570 to 79,040 LEs and up to 7,427,520 RAM bits (928,440 bytes) without reducing logic resources. These FPGAs provide high-speed differential I/O support on up to 116 channels, with up to 80 channels optimized for 840 megabits per second (Mbps).

The resources used, in terms of the number of logic elements needed to implement the algorithm, are referred to as the area. A memory segment consists of a bit-slice of a memory that is implemented in a single embedded cell. Each embedded cell implements one output of the memory, and multiple memory segments may be needed to create a single memory block. Latency is defined as the number of rounds in a loop, and the clock period as the minimum operating clock period. The throughput (d) is computed as d = message block size / (clock period × latency).
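The throughput definition can be evaluated directly. The sketch below is only an illustration of the formula: the function name throughput_mbits and the blocks parameter (the number of blocks processed in parallel) are hypothetical, the latency of 160 for SHA(32)-512 is the value given later in this section, and the latency of 80 for SHA(64)-512 is an assumption based on its number of digest rounds. The clock frequencies are those of Table 4.

```python
def throughput_mbits(block_bits, clock_mhz, latency, blocks=1):
    """d = message block size / (clock period * latency), in Mbits/s."""
    clock_period_us = 1.0 / clock_mhz            # microseconds per clock
    return blocks * block_bits / (clock_period_us * latency)

# SHA(64)-512 on Stratix: 47.9 MHz, assumed latency 80, two blocks in parallel
print(round(throughput_mbits(1024, 47.9, 80, blocks=2), 1))
# SHA(32)-512 on Stratix: 46.5 MHz, latency 160, three blocks in parallel
print(round(throughput_mbits(1024, 46.5, 160, blocks=3), 1))
```

Because the block size is in bits and the clock period in microseconds, the result is in Mbits/s without further scaling. Halving the word length doubles the latency, so the single-block throughput roughly halves at a similar clock frequency.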

The designs were simulated for a 1024-bit block of padded message. SHA(64)-512 and SHA(32)-512 were designed and placed on the Stratix family FPGA EP1S10F484C5, and their performance metrics are presented in Table 4. The SHA(32)-512 design occupies 2800 logic elements, whereas SHA(64)-512 occupies 4229 logic elements. As SHA(32)-512 occupies only 26% of the chip area to handle one 1024-bit block of padded message, three blocks of message can be authenticated at a time by the chip if the blocks are pipelined. This in turn occupies only 78% of the chip area, and the rest can be used for implementing encryption logic. SHA(64)-512 can handle only two blocks at a time and leaves only 20% of the chip area for other purposes. Moreover, the area used by one block of the SHA(32)-512 design is only 66% of that of SHA(64)-512, while its throughput for the maximum number of blocks is almost 73% of that of SHA(64)-512.

In Table 5, a similar SHA(64)-512 design implemented on a Xilinx Virtex-E XCV600E-8 [13] is compared with our design implemented on the Mercury family FPGA EPM120F484C5. The operating frequency reported in [13] was 38 MHz, whereas our design has an operating frequency of 43.7 MHz. The throughput of our design is also higher than that of [13].

Table 4
Synthesis results of SHA(64)-512 and SHA(32)-512 on Stratix

Design        Area (LEs)  Chip area  Memory      Memory on chip  Clock      Throughput (Mbits/s)
              (a)         used (%)   (bits) (l)  used (%)        (MHz) (f)  (Max block)
SHA(64)-512   4229        40         9216        <1              47.9       1226.2
SHA(32)-512   2800        26         8448        <1              46.5       892.8


Table 5
Comparison of our SHA(64)-512 design with design of [13]

Design           Clock (MHz) (f)   Throughput (Mbits/s)
Design of [13]   38                479
SHA(64)-512      43.7              560


The different word length versions, namely, SHA(64)-512, SHA(32)-512, SHA(16)-512 and SHA(8)-512, were synthesized on the Mercury family FPGA EPM120F484C5, and their areas were compared on the basis of logic element count. The designs were optimized for area, and the comparison chart is shown in Fig. 9. Taking SHA(64) as the base, SHA(8) occupies 27.1% less area, followed by SHA(16) with 24.3% and SHA(32) with 16.5%. It is clear from Fig. 9 that smaller word length implementations are suitable for applications where area is of concern.

In order to evaluate devices belonging to different Altera families, the SHA(32)-512 design was implemented on devices belonging to three different families, and the performance metrics area, memory, throughput and operating frequency are listed in Table 6. It can be seen that the hash algorithm synthesized on the Stratix device occupies less area than on the other two FPGAs listed in Table 6. Moreover, since one block occupies only 26% of the chip area, three 1024-bit blocks of padded message can be handled by the Stratix device, whereas only two blocks can be handled by the Apex II device and one block by the Mercury device. The maximum possible throughput is listed in the last column.

Finally, the synthesis results of SHA(32)-512 and SHA-256 on a single chip are given in Table 7. Both algorithms use the same area and memory, but the throughput differs, since the block size and latency are 512 and 64, respectively, for SHA-256, whereas for SHA(32)-512 the block size is 1024 and the latency is 160.

[Figure: bar chart of logic element count (y-axis, 0 to 4000) per design. SHA(8): 2705, SHA(16): 2810, SHA(32): 3101, SHA(64): 3711.]

Fig. 9. Area comparison: 8/16/32/64 bit versions.

Table 6
Synthesis results of SHA(32)-512 on different FPGAs

FPGA                      Area       Chip area  Memory      Memory    Clock      Throughput (Mbits/s)  Throughput (Mbits/s)
                          (LEs) (a)  used (%)   (bits) (l)  used (%)  (MHz) (f)  (1 block)             (Max block)
Stratix EP1S10F484C5      2794       26         8448        <1        45.8       292.8                 878.4
Apex II EP20K200EFC484-1  2867       34         15,360      14        24.86      159.1                 318.2
Mercury EPM120F484C5      3775       78         15,360      31        48.7       311.9                 311.9

Table 7
Single chip implementation of SHA(32)-512 and SHA-256

Design       Area      Memory    Clock      Throughput (Mbits/s)  Throughput (Mbits/s)
             used (%)  used (%)  (MHz) (f)  (1 block)             (Max block)
SHA-256      32        1         41.97      335.9                 1007.7
SHA(32)-512  32        1         41.97      268.7                 806.1


7. Conclusions

Secure hash algorithms SHA-256 and SHA-512 are versatile algorithms deployed in a broad range of applications with different area-performance requirements. Implementing a SHA-512 hash function in FPGAs requires several 64-bit adders, which consume the bulk of the chip area. In this paper, we explored alternative adder architectures for implementing SHA-512 in FPGAs with reduced size operands and studied their area-performance trade-offs. Our results showed that the chip area on the FPGA decreased as the operand size was reduced, but the throughput suffered due to increased latency. The architectures were synthesized in different FPGA families of ALTERA and their performance metrics such as area, memory, latency, clocking frequency and throughput were compared. SHA-256 and SHA-512 were also implemented using a common architecture. The performance metrics shed light on the possibility of synthesizing multiple blocks on a single chip, which in turn would increase the throughput. The Stratix family of FPGAs offered the best performance metrics. Future work will be directed towards error analysis and error detection procedures for the hardware implementation of hash functions.

References

[1] Kaufman C, Perlman R, Speciner M. Network security: private communication in a public world. 2nd ed. Prentice-Hall; 2002.
[2] FIPS Publication 198. The Keyed-hash message authentication code (HMAC). US Doc/NIST, March 6, 2002.
[3] Madson C, Glenn R. The use of HMAC-SHA-1-96 within ESP and AH. RFC 2404, November 1998.
[4] Rivest RL. The MD5 message digest algorithm. RFC 1321, April 1992.
[5] FIPS Publication 180-1. Secure hash standard (SHS). US Doc/NIST, April 17, 1995.
[6] FIPS Publication 197. Advanced encryption standard (AES). US Doc/NIST, November 26, 2001.
[7] FIPS Publication 180-2. Secure hash standard (SHS). US Doc/NIST, May 30, 2001.
[8] FIPS Publication 180-2. Secure hash standard (SHS) change notice 1. US Doc/NIST, February 2004.
[9] Bosselaers A, Govaerts R, Vandewalle J. Fast hashing on Pentium. Proceedings of Crypto'96, LNCS 1109. Springer-Verlag; 1996. p. 298–312.
[10] Bosselaers A, Govaerts R, Vandewalle J. SHA: a design for parallel architectures? Proceedings of the EUROCRYPT'97, LNCS 1233. Springer-Verlag; 1997. p. 348–62.
[11] Nakajima J, Matsui M. Performance analysis and parallel implementation of dedicated hash functions on Pentium III. IEICE Transactions on Fundamentals 2003;E86-A(1):54–63.
[12] Dominikus S. A hardware implementation of MD5-family hash algorithm. In: Proceedings of the international conference on electronics circuits and systems, Dubrovnik, Croatia, September 15–18, 2002. p. 1143–6.
[13] McLoone M, McCanny JV. Efficient single-chip implementation of SHA-384 and SHA-512. In: Proceedings of the IEEE international conference on field-programmable technology (FPT), Hong Kong, July 2002. p. 311–4.
[14] Kang YK, Kim DW, Kwon TW, Choi JR. An efficient implementation of hash function processor for IPSEC. In: Proceedings of the Asia–Pacific conference on ASICs, August 2002. p. 93–6.
[15] Grembowski T, Lien R, Gaj K, Nguyen N, Bellows P, Flidr J, et al. Comparative analysis of the hardware implementation of hash functions SHA-1 and SHA-512. Proceedings of the 5th international conference on information security (ISC'2002), LNCS 2433. Springer-Verlag; 2002. p. 75–89.
[16] Sklavos N, Koufopavlou O. On the hardware implementation of the SHA-2 (256, 384, 512) hash functions. In: Proceedings of the IEEE international symposium on circuits and systems, vol. 5, May 2003. p. 153–6.
[17] Dadda L, Macchetti M. The design of a high speed ASIC unit for the hash functions SHA-256 (384, 512). In: Proceedings of the design, automation and test in Europe conference (DATE'04), February 16–20, 2004.
[18] Ercegovac MD, Lang T. Digital arithmetic. Morgan Kaufmann Publishers; 2004.
[19] Kaliski B, Staddon J. RSA cryptography specifications Version 2.0. RFC 2437, October 1998.
[20] Haller N, Metz C, Nesser P, Straw M. A one-time password system. RFC 2289, February 1998.
[21] Dierks T, Allen C. The TLS protocol Version 1.0. RFC 2246, January 1999.

Imtiaz Ahmad received his B.Sc. in Electrical Engineering from University of Engineering and Technology, Lahore, Pakistan, an M.Sc. in Electrical Engineering from King Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, and a Ph.D. in Computer Engineering from Syracuse University, Syracuse, New York, in 1984, 1988 and 1992, respectively. Since September 1992, he has been with the Department of Computer Engineering at Kuwait University, Kuwait, where he is currently a professor. His research interests include design automation of digital systems, high-level synthesis, and parallel and distributed computing.

A. Shoba Das received the B.E. degree from Guindy College of Engineering, Madras University, India and the M.E. degree from PSG College of Technology, Madras University, India. She has held various teaching assignments in India since 1982 and is presently working as a scientific assistant at Kuwait University. Her research interests include optimal design of sequential machines and testing of communication systems.
