The Implementation and Analysis of the
ECDSA on the Motorola StarCore SC140 DSP
Primarily Targeting Portable Devices
by
Eric W. Smith
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Applied Science
in
Electrical and Computer Engineering
Waterloo, Ontario, Canada, 2002
© Eric W. Smith 2002
I hereby declare that I am the sole author of this thesis.
I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the
purpose of scholarly research.
I further authorize the University of Waterloo to reproduce this thesis by photocopying or by other
means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly
research.
The University of Waterloo requires the signatures of all persons using or photocopying this thesis.
Please sign below, and state an address and date.
Abstract
The viability of the Elliptic Curve Digital Signature Algorithm (ECDSA) on portable devices is
important due to the growing wireless communications industry, which has inherent insecurities. The
StarCore SC140 DSP (SC140) targets portable devices, and therefore is a prime candidate to study the
viability of the ECDSA on such devices. The ECDSA was implemented on the SC140 using a Koblitz
curve over GF(2^163). The τ-adic representation of polynomials involved in the elliptic curve point-
multiplication is exploited to achieve superior performance. The ECDSA was implemented and
optimized in C and assembly, and verified in hardware. The performance of the C and assembly
implementations is analyzed and compared to previously published results. The ability of the compiler
to generate efficient cryptographic related code and the SC140 to perform efficient operations is
discussed. Numerous compiler optimization improvements that considerably enhance the performance
of the generated assembly are suggested. Coding guidelines that state simple measures to improve the
performance of the implementation and help to achieve efficient C and assembly are listed. Finally,
security issues with respect to the implementation, focusing on side-channel attacks (SCAs), are investigated, including estimated performance penalties due to adding resiliency. Two SCA
countermeasures specific to the implementation are also described. In summary, the implemented
ECDSA signature generation and verification processes require 4.43 ms and 8.63 ms, respectively, when the SC140 operates at 300 MHz. Methods of optimizing the implementation to further reduce execution times are
also presented.
Acknowledgements
The author would like to thank his supervisor, Professor Catherine Gebotys, for her aid and direction
throughout the development of the thesis, as well as the use of computing resources and the StarCore
SC140 Software Development Platform (SDP). He would also like to thank friends and family for
their support, without which the completion of the thesis would not be possible.
The author is extremely grateful for the financial support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), Motorola, his supervisor and the Department of Electrical
and Computer Engineering at the University of Waterloo. Financial support was provided by the listed
entities through various scholarships, which allowed the author to focus more thoroughly on his
research and studies.
Contents
1 Introduction
   1.1 DSPs and Embedded Systems Security Requirements
   1.2 Thesis Objective
   1.3 Thesis Overview
2 Public-Key Cryptosystems and the StarCore SC140 DSP
   2.1 Public-Key Cryptosystems
   2.2 ECC Background
      2.2.1 Comparison to Other Cryptographic Techniques
   2.3 Digital Signature Schemes
   2.4 StarCore SC140 DSP Processor Description
   2.5 Previous Cryptographic and DSP Research
3 The ECDSA Algorithm and Implementation Philosophy
   3.1 The ECDSA
   3.2 Finite Field and Large Integer Arithmetic
      3.2.1 Basic Operations
      3.2.2 Finite Field Multiplication
      3.2.3 Finite Field Squaring
      3.2.4 Finite Field Inversion
      3.2.5 Large Integer Operations
   3.3 Elliptic Curve Arithmetic
      3.3.1 Elliptic Curve Point Addition and Subtraction
      3.3.2 Elliptic Curve Point Representation
      3.3.3 Elliptic Curve Point-Multiplication
         3.3.3.1 Non-Adjacent Format
         3.3.3.2 Reduced TNAF Representation
         3.3.3.3 TNAF Point-Multiplication
         3.3.3.4 Width-w TNAF Representation
         3.3.3.5 TNAFw Point-Multiplication
      3.3.4 Simultaneous Multiple Point-Multiplication
   3.4 Implementation and Integration Philosophy
4 Implementation Analysis and Performance Results
   4.1 C Data Structures
   4.2 Finite Field Operations
      4.2.1 Finite Field Addition (c = a ⊕ b)
      4.2.2 Finite Field Reduction (c = a mod f)
      4.2.3 Finite Field Multiplication (c = a ⋅ b)
      4.2.4 Finite Field Squaring (c = a^2)
      4.2.5 Finite Field Inversion (c = a^(-1) mod f)
   4.3 Large Integer Operations
      4.3.1 Large Integer Addition and Subtraction (c = a + b; c = a - b)
      4.3.2 Large Integer Multiplication (c = a ⋅ b)
      4.3.3 Large Integer Division (c = a / b)
      4.3.4 Large Integer Inversion (c = a^(-1) mod f)
   4.4 Elliptic Curve Operations
      4.4.1 TNAF Conversion (k → k_TNAF)
      4.4.2 Partial Reduction - Partmod δ (k′ = k partmod δ)
      4.4.3 TNAF Point-Multiplication (Q = k_TNAF ⋅ P)
      4.4.4 TNAFw Conversion (k → k_TNAFw)
      4.4.5 TNAFw Point-Multiplication (Q = k_TNAFw ⋅ P)
      4.4.6 Simultaneous Multiple Point-Multiplication (R = k ⋅ P + l ⋅ Q)
5 Implementation Comparison and Coding Guidelines
   5.1 Performance Comparison with Previous Published Results
      5.1.1 Low-Level Performance Comparison
      5.1.2 High-Level Performance Comparison
   5.2 Guidelines for Writing Efficient C Code for Cryptographic Applications
   5.3 Guidelines for Writing Efficient Assembly Code for Cryptographic Applications
   5.4 Hand-Written and Compiler-Generated Assembly Comparison
      5.4.1 Low-Level Performance Comparison
      5.4.2 High-Level Performance Comparison
   5.5 Memory Requirements Comparison
6 SC140 and Compiler Analysis for Cryptographic Applications
   6.1 Analysis of the SC140 for Elliptic Curve Cryptographic Applications
      6.1.1 SC140 Cryptographic Pros
      6.1.2 SC140 Cryptographic Cons
   6.2 Compiler Optimization Improvements
   6.3 Compiler Anomalies
      6.3.1 Compiler Anomaly A
      6.3.2 Compiler Anomaly B
7 Side-Channel Attack Security Issues
   7.1 Timing Attacks
   7.2 Simple Power Attacks
   7.3 Differential Power Analysis
   7.4 SCA Countermeasures Specific to Koblitz Curves and the SC140
      7.4.1 Parallel Processing Countermeasure
      7.4.2 Koblitz Curve Specific Countermeasure
8 Discussion and Conclusions
   8.1 Thesis Summary
   8.2 Limitations of the Research and Implementation
   8.3 Conclusions
   8.4 Future Work
Appendix A – Koblitz Curve Parameters
Bibliography
List of Acronyms
AAU  Address Arithmetic Unit
AGU  Address Generation Unit
AIA  Almost Inverse Algorithm
ALU  Arithmetic Logic Unit
ASL  Arithmetic Shift Left (by one bit)
ASLL  Arithmetic Shift Left (by multiple bits)
ASM  Assembly Language Code
ASR  Arithmetic Shift Right (by one bit)
ASRR  Arithmetic Shift Right (by multiple bits)
BF  Branch if False
BFU  Bit Field Unit
BT  Branch if True
CA  Certificate Authority
CGA  Compiler-Generated Assembly
CLB  Count Leading Bits
CP  Critical Path
DALU  Data Arithmetic Logic Unit
DL  Discrete Logarithm
DLP  Discrete Logarithm Problem
DPA  Differential Power Analysis
DSA  Digital Signature Algorithm
DSP  Digital Signal Processor
EC  Elliptic Curve
ECC  Elliptic Curve Cryptography
ECDLP  Elliptic Curve Discrete Logarithm Problem
ECDSA  Elliptic Curve Digital Signature Algorithm
EEA  Extended Euclidean Algorithm
FF  Finite Field
GUI  Graphical User Interface
HWA  Hand-Written Assembly
IDE  Integrated Development Environment
IF  Integer Factorization
IFA  IF Always
IFF  IF False
IFT  IF True
JF  Jump if False
JT  Jump if True
LSL  Logical Shift Left (by one bit)
LSLL  Logical Shift Left (by multiple bits)
LSR  Logical Shift Right (by one bit)
LSRR  Logical Shift Right (by multiple bits)
LUT  Look-Up Table
MAC  Multiply and Accumulate
MIPS  Million Instructions Per Second
NAF  Non-Adjacent Format
NB  Normal Basis
NIST  National Institute of Standards and Technology
NOP  No Operation
PB  Polynomial Basis
PDA  Personal Digital Assistant
RRK  Random Rotation of Key
SCA  Side Channel Attack
SC140  StarCore SC140 DSP
SPA  Simple Power Attacks
SMPM  Simultaneous Multiple Point-Multiplication
SRAM  Static Random Access Memory
TA  Timing Attack
TNAF  τ-adic NAF
TNAFw  Width-w TNAF
VLES  Variable Length Execution Set
VLIW  Very Long Instruction Word
XXX(A)  XXX and XXXA instructions
List of Algorithms
Algorithm 3-1. ECDSA Signature Generation [30]
Algorithm 3-2. ECDSA Signature Verification [30]
Algorithm 3-3. Finite Field Reduction (c = a mod f) [19]
Algorithm 3-4. Finite Field Multiplication (c = a⋅b) [39]
Algorithm 3-5. Finite Field Squaring (c = a^2) [19]
Algorithm 3-6. Finite Field Inversion (b = a^(-1) mod f) [20]
Algorithm 3-7. Elliptic Curve Point Addition (P3 = P1 + P2) [38]
Algorithm 3-8. TNAF Conversion (k_TNAF = r0 + r1⋅τ) [61]
Algorithm 3-9. Partmod δ Reduction (r0 + r1⋅τ := k partmod δ) [61]
Algorithm 3-10. TNAF Point-Multiplication (Q = k_TNAF⋅P) [19]
Algorithm 3-11. TNAFw Conversion (k_TNAFw = r0 + r1⋅τ) [61]
Algorithm 3-12. TNAFw Point-Multiplication (Q = k_TNAFw⋅P) [61]
Algorithm 3-13. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q) [19]
Algorithm 4-1. Improved Finite Field Squaring (c = a^2)
Algorithm 4-2. Improved Finite Field Inversion (c = a^(-1) mod f)
Algorithm 4-3. Integer Coefficient to Binary Representation Conversion
Algorithm 7-1. TA Resistant TNAF Point-Multiplication (Q[0] = k_TNAF⋅P) [22]
Algorithm 7-2. Proposed DPA Resistant τ-adic Point-Multiplication
List of Tables
Table 2-1. Current Estimated Memory Requirement Comparison [6]
Table 3-1. Elliptic Curve Coordinate System Comparison [19]
Table 4-1. Finite Field Reduction Performance
Table 4-2. Single and Multiple Bit-Shifting Function Comparison
Table 4-3. Finite Field Multiplication Performance
Table 4-4. Finite Field Squaring Performance Comparison
Table 4-5. Finite Field Inversion Bit-Shift Distribution
Table 4-6. Finite Field Inversion Performance
Table 4-7. TNAF Point-Multiplication Performance
Table 4-8. TNAFw Point-Multiplication Performance Comparison
Table 4-9. Simultaneous Multiple Point-Multiplication Performance Comparison
Table 5-1. Estimated Finite Field Operation Cycle Count Comparison
Table 5-2. Estimated Elliptic Curve Operation Cycle Count Comparison
Table 5-3. Estimated Signature Generation and Verification Cycle Count Comparison
Table 5-4. Low-Level CGA and HWA Performance Comparison (input independent routines)
Table 5-5. Low-Level CGA and HWA Performance Comparison (input dependent routines)
Table 5-6. Computational Reduction of the Signature Generation Process due to HWA Routines
Table 5-7. High-Level CGA and HWA Performance Comparison
Table 5-8. Estimated Permanent Storage Requirements
Table 6-1. Assembly Symbolic Description
Table 7-1. Estimated TA Resistant Performance Penalties
Table 7-2. Estimated SPA Resistant Signature Generation Performance Penalty
Table 7-3. Estimated Sample Entropy and Overhead of Algorithm 7-2
1 Introduction
The ECDSA is a cryptographic tool that can provide security to systems when implemented correctly.
The algorithm defines a method of achieving data integrity, data origin authentication and non-
repudiation. The ability to efficiently implement the ECDSA will determine its usefulness in the growing
wireless and wireline communications industries.
It is difficult to argue for the usefulness of the ECDSA without proof that it can be efficiently implemented on a wide range of target processors. Furthermore, it is difficult to convey the threat of attackers to users who have not personally experienced a digital security breach, because such users do not easily tolerate large computational delays for tasks they deem unimportant.
The purpose of the thesis is to study the performance of the ECDSA on the SC140. Analysis
of the implementation, the benefits of an assembly implementation, and the strength of the compiler to
produce efficient cryptographic applications are all included as part of the study. There have been
several documented implementations of the ECDSA on general-purpose processors and on the extremely resource-limited processors present on smart cards, but a limited number of implementations of the
ECDSA, or more generally Elliptic Curve Cryptography (ECC), on processors targeting portable
devices. The few documented implementations of ECC on DSPs have involved prime fields. ECC
using binary fields is also a viable option, which may be more attractive due to the numerous bit-
manipulating instructions common to DSPs.
Due to the decreased power consumption of DSPs relative to general-purpose processors, and
the limited battery lifespan of portable devices, DSPs are an excellent candidate for the primary
computing core of portable devices. Furthermore, due to the inherent insecurities of wireless
communications that threaten portable devices, the implementation of security measures is of utmost
importance. The performance of security measures, including the ECDSA, must be studied on such
devices. By studying the performance of the ECDSA on the SC140, its viability with respect to portable devices, and to security on those devices, can be determined.
1.1 DSPs and Embedded Systems Security Requirements
The employment of adequate security systems was overlooked during the incredible growth the
communications industry underwent over the past two decades. Systems were introduced without
adequate security measures in place. The combination of recent world events and the sudden decline in the communications industry’s growth has led to the realization that many current network security measures are inadequate.
Furthermore, the rapidly expanding wireless communications industry is increasing the demand on network security. The current trend in the communications industry is toward increased wireless services as 3rd generation cellular systems become a reality. The services that cell phones, personal
digital assistants (PDAs) and other portable handheld devices provide are ever increasing. The new
services require more bandwidth and greater processing capabilities. Examples of the introduced
services are email and streaming multimedia.
As the communications industry expands, and more information is transmitted via wireless and
wireline connections, the inherent requirement for security measures increases. The SC140 targets
several communication applications that all require certain levels of security. It is therefore important
to study the SC140 to determine if it is a viable processor to implement the required security measures.
Handheld devices are powered by several different processing units, including DSPs. The
deployment of DSPs is widespread. They have lower power dissipation than general-purpose
processors, and are less costly than specialty processors. DSPs are currently present in network and
data communications, and several other devices throughout the communications industry. High-end
DSPs control network traffic on high-speed backbones, and will likely be deployed in future handheld
devices. Handheld devices are often part of extensive wireless networks that are naturally insecure,
and are extremely susceptible to security risks such as impersonation attacks.
Digital signatures provide services such as data integrity, data origin authentication and non-repudiation. The importance of the integrity and origin of data is heightened in a wireless network, which is much more susceptible to impersonation attacks and modification of transmitted data because of the ease with which the transmission medium is accessed.
It is important to study security on DSPs because of their widespread deployment. If DSPs are
a viable target for implementation of security measures, security can be added to systems with simple
software add-ons or upgrades. The cost of adding the security related services is greatly decreased
because new hardware is not required. Furthermore, expensive processing units for cryptographic
applications are not required by new devices, maintaining their affordability.
1.2 Thesis Objective
The objective of the thesis is to study the performance of ECC, and more precisely the ECDSA, on a
DSP targeting portable devices. The ECDSA is implemented on the StarCore SC140 DSP. The
implementation is examined thoroughly, and optimized to improve its performance with respect to
execution time and code size. The compiler and associated optimizer are examined to determine if
efficient implementation is possible in the C programming language, or if assembly language coding is
required to achieve the necessary performance. The execution time of the implementation should be
comparable to current published results, and must yield delays that are acceptable, and ideally unnoticeable, to the average user when the digital signature technique is utilized by a practical application. Furthermore,
while maintaining acceptable execution times, the code size of the compiled application must be
suitable for portable devices, where memory is an expensive and limited resource.
1.3 Thesis Overview
In chapter 2, a brief description of public-key cryptosystems, focusing on ECC, and the StarCore
SC140 DSP is presented. The ECDSA and the algorithms utilized to implement the required finite
field and elliptic curve operations are outlined in chapter 3, along with the implementation philosophy.
The implementation and performance analysis of the finite field and elliptic curve operations are
presented in chapter 4. Chapter 5 analyzes the performance of the implementation. A comparison of
the performance of the implementation with previously published results is presented. In addition, the
performance of the hand-written and compiler-generated assembly is contrasted. Several coding
guidelines to follow, which aid in the development of efficient assembly and C code, are included. To
conclude chapter 5, the memory requirements are presented and compared. An analysis of the SC140
and the associated compiler is presented in chapter 6. The analysis is based on implementing
cryptographic applications on the SC140, and the ability of the compiler to optimize cryptographic
related code. Security issues that arise due to side-channel attacks and methods of avoidance are
presented in chapter 7. Finally, chapter 8 presents a thesis summary, limitations of the study, a
conclusion, as well as future work to be done in this area of research.
2 Public-Key Cryptosystems and the StarCore SC140 DSP
This chapter introduces the concept of public-key cryptosystems, providing several examples. The
implemented public-key cryptosystem is explained and compared to alternatives, and a description of
the SC140 is included.
2.1 Public-Key Cryptosystems
Public-key cryptosystems were invented by Whitfield Diffie and Martin Hellman in 1976 [6]. They are
asymmetric cryptosystems, which are based on the concept of using different keys for the encryption
and decryption processes. For the cryptosystem to be useful, the two keys must appear unrelated, such that the encryption key E, or public key, can be put in the public domain without compromising the decryption key D. The decryption key D is also known as the private or secret key.
Consider two entities, such as people or computer nodes, that want to communicate. Each
entity individually develops a public and private key. The two keys are inverses of each other,
described by Formula 2.1 [9]. In the formula, M is the message, and the functions D() and E() represent transformation of the data using the private and public keys, respectively.
M = D(E(M)) = E(D(M)) (2.1)
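As a concrete illustration of Formula 2.1, the following C sketch demonstrates the key-inverse property using textbook RSA with deliberately tiny, insecure parameters; the values (p = 61, q = 53, e = 17, d = 2753) are illustrative assumptions and not part of the implementation described in this thesis.

/* Toy demonstration of M = D(E(M)) = E(D(M)) with textbook RSA.
 * n = 61 * 53 = 3233, and e*d = 1 mod 3120, so the two exponents
 * are inverses of each other. Illustrative only; not secure. */
#include <stdio.h>
#include <stdint.h>

static uint64_t modpow(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

int main(void) {
    const uint64_t n = 3233;  /* modulus: 61 * 53 */
    const uint64_t e = 17;    /* public exponent */
    const uint64_t d = 2753;  /* private exponent */
    uint64_t M = 65;          /* sample message */

    uint64_t ded = modpow(modpow(M, e, n), d, n); /* D(E(M)) */
    uint64_t edm = modpow(modpow(M, d, n), e, n); /* E(D(M)) */
    printf("M = %llu, D(E(M)) = %llu, E(D(M)) = %llu\n",
           (unsigned long long)M, (unsigned long long)ded,
           (unsigned long long)edm);
    return 0;
}

Running the sketch prints the same value three times, showing that applying the two keys in either order recovers the message.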
A trusted third party is required by public-key cryptosystems. The trusted third party, also known as a Certificate Authority (CA), is in charge of the storage and distribution of the domain parameters and public keys of entities. Entities transmit their public key to a CA in a secure manner.
Public-key cryptosystems are capable of providing authentication, secrecy, or both between
communicating parties. Authentication is achieved by encrypting messages with one’s private key
before transmission. The target entity uses the public key of the sender, obtained from a CA, to decrypt
the transmitted message. Authentication is inherent when the message is successfully decrypted.
Secure communication is achieved by encrypting a message with the target’s public key. The
communication is secure because the target’s private key, which is only known by the target, is
required to decrypt the message. An authenticated and secure communication is achieved by first using
the target’s public key to encrypt a message, then by using one’s private key to encrypt the encrypted
message before transmission to the target. In this case, the target first authenticates the transmission by
decrypting the received message with the sender’s public key. Then the original message is revealed
by decrypting the authenticated message with their private key.
There are currently three secure and efficient public-key cryptosystems. They are based on
Integer Factorization (IF), the Discrete Logarithm (DL) and Elliptic Curves (EC) [6]. Each system is based on a mathematical problem that is difficult relative to its input size [6], requiring a great deal of time to solve.
The most common public-key cryptosystem is RSA, named after its developers Rivest, Shamir and Adleman [6]. It is an IF system based on large prime integers. The difficult problem associated
with the system is the factorization of large numbers. Both encryption and digital signature schemes
have been developed using RSA.
The Digital Signature Algorithm (DSA), a digital signature scheme, is an example of a DL cryptosystem [6]. Like all DL cryptosystems, the DSA is based on the Discrete Logarithm Problem (DLP). Encryption is also possible using the DLP, but is not commonly used due to the
associated large overhead.
Both encryption and digital signature schemes have been developed using ECC. It is based on
an extension of the DLP, rightfully named the Elliptic Curve DLP (ECDLP). The ECDSA, a digital signature scheme very similar to the DSA, is an example of an ECC scheme. ECC is presently the
most promising public-key cryptosystem because of the high security-per-bit ratio it provides. Further
detail of the cryptosystem is given in the following section.
2.2 ECC Background
In 1985, Neal Koblitz and Victor Miller independently proposed the use of elliptic curves for a public-
key cryptosystem [2]. Both encryption and digital signature techniques have been developed using the
cryptosystem. The public-key cryptosystem is based on the manipulation of points on an elliptic curve,
defined modulo f, where P = (x, y) is a point on the curve. For cryptographic applications, each
coordinate belongs to a prime or binary finite field, defined by f. The generalized equation of an
elliptic curve for cryptography is presented as Formula 2.2.
y^2 + x⋅y = x^3 + a⋅x^2 + b (mod f) (2.2)
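A minimal C sketch of Formula 2.2 follows, using the toy field GF(2^4) with reduction polynomial f(x) = x^4 + x + 1 so the arithmetic is small enough to check by hand. The field size, reduction polynomial and curve coefficients (a = b = 1) are illustrative assumptions, not the parameters of the thesis implementation, which uses GF(2^163).

/* Enumerate the points of y^2 + x*y = x^3 + a*x^2 + b over GF(2^4).
 * Field elements are 4-bit masks; gf_mul is a carry-free multiply
 * followed by reduction modulo f(x) = x^4 + x + 1. */
#include <stdio.h>
#include <stdint.h>

#define GF_BITS 4
#define GF_POLY 0x13u  /* x^4 + x + 1 */

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint16_t acc = 0;
    for (int i = 0; i < GF_BITS; i++)         /* carry-free multiply */
        if (b & (1u << i))
            acc ^= (uint16_t)a << i;
    for (int i = 2 * GF_BITS - 2; i >= GF_BITS; i--)  /* reduce mod f */
        if (acc & (1u << i))
            acc ^= GF_POLY << (i - GF_BITS);
    return (uint8_t)acc;
}

int main(void) {
    const uint8_t a = 1, b = 1;  /* assumed curve coefficients */
    for (uint8_t x = 0; x < 16; x++) {
        for (uint8_t y = 0; y < 16; y++) {
            uint8_t lhs = gf_mul(y, y) ^ gf_mul(x, y);
            uint8_t x2  = gf_mul(x, x);
            uint8_t rhs = gf_mul(x2, x) ^ gf_mul(a, x2) ^ b;
            if (lhs == rhs)
                printf("point on curve: (0x%X, 0x%X)\n", x, y);
        }
    }
    return 0;
}

The same structure, with word arrays in place of single bytes, carries over to cryptographically sized fields.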
Several parameters define an elliptic curve, including but not limited to, a and b. There are
classes of curves that are defined by specific sets of parameters. These classes have special properties
associated with them, making them more or less attractive for cryptographic applications. For
example, anomalous binary curves, more commonly known as Koblitz curves, are used and assumed in
the scope of the thesis. They have properties, explained in §3.3.3, that allow for efficient point-multiplication.
Associated with elliptic curves are point addition, doubling, negating, subtracting and
multiplication operations. Each of the elliptic curve operations is defined by a sequence of finite field operations, as described in §3.3. Similar to standard mathematics, point-multiplication is based on
a series of point additions. Because point-multiplication is expensive, algorithms have been developed to improve its computation.
The ECDLP is the difficult mathematical problem of reversing the point-multiplication operation. The problem, an analogue of the DLP in the elliptic curve domain, is to solve for k knowing P and Q in the point-multiplication formula Q = k⋅P. The problem is computationally expensive and cannot be solved in a reasonable
amount of time as long as the finite field associated with the curve is large enough. As expected, the
finite field size required to make the problem computationally infeasible is directly related to current
processing trends. As the average computing power of devices increases, larger finite fields are required
to maintain security levels [30].
Elliptic curves for cryptographic applications can be defined over prime or binary fields. In
general, prime fields tend to outperform binary fields because most processors are designed to favor the execution of integer arithmetic rather than the carry-free arithmetic of binary polynomials. However, binary fields were chosen for implementation to determine how well they perform on the SC140, which has an extended and less computationally costly set of logic instructions compared to general-purpose processors. Binary finite
fields are assumed throughout the thesis unless otherwise stated.
The Polynomial Basis (PB) was selected to represent binary finite field elements. Unless
otherwise stated, the PB is used throughout the thesis. The use of an alternate basis, such as the
Normal Basis (NB), does have its benefits. For example, squaring field elements is simplified when employing the NB, but other operations become more complex. It is believed by the writer that the drawbacks of
alternative representations outweigh the benefits for this implementation.
The binary finite field GF(2^m), where m = 163, is used and assumed throughout the project. The field size of 163 bits was selected to provide the current acceptable security level [6]. When
implementing Koblitz curves, all of the elliptic curve parameters are fixed after the finite field size is
selected, except for C. The parameter C and its value are further explained in §3.3.3.2. The Koblitz
curve parameters used for implementation, with the PB over GF(2^163), are listed in Appendix A.
Some general finite field terminology must be defined. The terms polynomial and finite field
element are used interchangeably throughout the thesis. The degree of a polynomial is the position of
the most significant coefficient (where the first coefficient position is position zero), and the Hamming
weight of a polynomial is the number of nonzero coefficients in its representation.
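To make the two definitions concrete, the following C sketch computes the degree and Hamming weight of a polynomial, assuming a field element is stored least-significant word first in an array of 32-bit words (six words suffice for GF(2^163)). The word size, ordering and function names are illustrative assumptions rather than the actual data structures of the implementation (described in §4.1).

/* Degree and Hamming weight of a binary polynomial stored as
 * an array of 32-bit words, least significant word first. */
#include <stdint.h>

#define FF_WORDS 6  /* ceil(163 / 32) */

/* Degree: position of the most significant nonzero coefficient. */
int ff_degree(const uint32_t a[FF_WORDS]) {
    for (int w = FF_WORDS - 1; w >= 0; w--) {
        if (a[w] != 0) {
            int bit = 31;
            while (!(a[w] & (1u << bit)))
                bit--;
            return 32 * w + bit;
        }
    }
    return -1;  /* the zero polynomial has no degree */
}

/* Hamming weight: number of nonzero coefficients. */
int ff_weight(const uint32_t a[FF_WORDS]) {
    int count = 0;
    for (int w = 0; w < FF_WORDS; w++) {
        uint32_t v = a[w];
        while (v) {
            v &= v - 1;  /* clear the lowest set bit */
            count++;
        }
    }
    return count;
}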
In the following section, ECC is compared to two other public-key cryptosystems. The
positive and negative aspects of the cryptosystems are compared to give some background on why
ECC was selected for implementation over other possibilities.
2.2.1 Comparison to Other Cryptographic Techniques
ECC was selected as the public-key cryptosystem for implementation on the SC140 for several
reasons. The comparison below briefly develops and states the grounds for selecting ECC over the
other public-key cryptosystems. Further comparison of the public-key cryptosystems can be found in
[6] and [29].
There are both encryption and digital signature schemes associated with ECC, DLP and RSA.
The underlying mathematical problems associated with the cryptosystems are identical for both
encryption and digital signature schemes. Therefore, after one scheme is implemented, much of the
implementation can be reused with the other scheme. There are two types of digital signature techniques: those with and those without appendices. In a technique without appendix, the digital signature is logically combined with the message, whereas in a technique with appendix, the digital signature is appended to the message.
RSA refers to both encryption and digital signature schemes. It includes a digital signature
scheme without appendix. ElGamal proposed digital signature and encryption schemes based on the DLP [6]. Later, the DSA was developed, which is an improvement on the digital signature scheme proposed by ElGamal. The ECDSA is the digital signature algorithm associated with ECC, and
encryption is simply referred to as ECC. The ECDSA and DSA are both digital signature schemes with
appendix. Digital signature schemes with and without appendices are explained in §2.3.
Assuming that an elliptic curve is selected that does not have negative security implications,
and that each public-key cryptosystem is implemented correctly without any security loopholes or
backdoors, the per-bit security of ECC is far superior to that of RSA and DSA. As stated in [6], and
depicted in Figure 2-1, the current acceptable security level is 10^12 MIPS years, leading to a 160-bit modulus for ECC, and a 1024-bit modulus for both RSA and DSA [6]. In addition, the figure shows the
expected growth of the modulus size for each cryptosystem. The modulus sizes for RSA and DSA
grow exponentially versus an exponential growth in MIPS years, whereas modulus sizes for ECC
experience approximately linear growth [6]. The per-bit security of ECC is far greater than that of RSA and DSA, making it far more attractive for portable devices with limited resources, assuming the execution
times are similar. In the future, RSA and DSA modulus sizes are expected to grow exponentially,
resulting in an unacceptable amount of overhead.
Figure 2-1. Modulus Size Comparison of Public-Key Cryptosystems [2] (modulus size in bits, 0 to 6000, plotted against the time to break the cryptosystem, 1.E+04 to 1.E+36 MIPS years; one curve for ECC and one for RSA and DSA, with the current acceptable security level marked)
In general, the key size can be assumed identical to the modulus size for each system. The
total size of the system parameters and key pairs for RSA and DSA are much larger than with ECC.
They currently differ by a factor of four, which will become larger as the current acceptable security
level increases. The encrypted message and signature sizes for RSA are much larger than with ECC,
and only the signature size of DSA is the same as ECC. A comparison of the current estimated sizes of
parameters, keys, signatures and encrypted messages is presented in Table 2-1. With respect to the
table values, the signature sizes stated are for large messages, i.e., 2000 bits, and the original size of the encrypted message is 100 bits.
Larger keys, parameters, signatures, and encrypted messages require more memory for
storage and more bandwidth to transmit, both of which are scarce resources when dealing with portable
devices. Moreover, even with non-portable devices, there is no reason to unnecessarily squander
resources. As depicted in Figure 2-1 and Table 2-1, ECC provides equivalent security levels, requiring
fewer resources than alternative public-key cryptosystems.
Table 2-1. Current Estimated Memory Requirement Comparison [6]

Public-Key Cryptosystem | System Parameters (bits) | Public Key (bits) | Private Key (bits) | Signature Size (bits) | Encrypted Message (bits)
RSA                     | N/A                      | 1088              | 2048               | 1024                  | 1024
DLP (DSA, ElGamal)      | 2208                     | 1024              | 160                | 320                   | 2048
ECC (ECDSA)             | 481                      | 161               | 160                | 320                   | 321
When comparing cryptosystems, the computational overhead must be investigated. Focusing
on the computational expenses within a single cryptosystem, DLP and ECC systems behave similarly to each other, and opposite to RSA. The signature generation process for ECDSA and DSA is faster than the
verification process, whereas the verification of RSA signatures is less computationally expensive.
Decryption is slower than encryption using RSA, whereas the opposite is true for ECC.
Overall, ECC is proven to require less computational overhead. After all the techniques used to increase the performance of each cryptosystem are implemented, ECC is found to be ten times faster than
RSA and DSA [6]. Both the execution times and memory requirements of ECC are less than those
associated with RSA and DSA, making it superior to the alternatives.
2.3 Digital Signature Schemes
The concept of a digital signature is very powerful, but difficult to achieve. This section gives a brief
overview of the two types of digital signature schemes and their capabilities. Digital signatures are
designed to be similar to, and more compelling than handwritten signatures, and target digital data [30].
Digital signatures are based on the data being signed, M, and a private key only known by the signer.
Digital signatures are powerful because they provide data integrity, data origin authentication
and non-repudiation [30]. After data has been signed, all these services are achieved by the signature
verification process. There is no privacy associated with digital signatures. Transmitted data can be
easily intercepted and interpreted by eavesdroppers. To achieve confidentiality between
communicating parties, an encryption scheme must be employed.
Signatures that are verified with a sender’s public key guarantee the integrity of the
transmission because the signature is based on the original message. Messages cannot be intercepted and modified without detection, because the signature on a modified message will not verify. Without the private key of the
sender, the correct signature cannot be computed. An entity cannot impersonate another because each
entity has a unique private key. The private key is known only by the owner, and is required to
compute the digital signature of each message. Lastly, an entity cannot deny knowledge of a message
containing their signature. The entity is the only one with knowledge of their private key, and therefore
is the only one able to compute the signature of a message.
There are two types of digital signature schemes. They are schemes with and without
appendix. In the case of a digital signature scheme without appendix, the digital signature is the only
data transmitted. The transmitted data contains the original message. The signature verification
process results in the computation of the original message. It is impossible to determine the original
message without signature verification. If the verification process fails, the receiver is left with a
garbled message, and the original message cannot be determined. In the case of a digital signature with
appendix, the digital signature is computed and concatenated onto the message. The message along
with the concatenated digital signature is transmitted. It is possible for an attacker to modify the
original message and concatenate an incorrect signature. In this case, the signature will not be verified
by the receiver. Therefore, the receiver will know the transmitted data has been modified. Since the
message and signature are separate in the digital signature with appendix scheme, the verification
process is technically optional, and is left up to the receiver.
The specific type of digital signature implemented, the ECDSA, is described in §3.1. Further
details of the implemented algorithm are provided, as well as a depiction of a digital signature scheme
with appendix.
2.4 StarCore SC140 DSP Processor Description
The StarCore SC140 DSP is a high performance processor that can be clocked at a maximum of 300
MHz [46]. With its many assets that include high performance and low power consumption, the
processor targets computationally intensive communication applications [46]. The SC140 has several
features that allow for efficient digital signal processing, which are also useful for cryptographic
applications. These features are examined in §6.1.1.
The SC140 targets a wide range of communication applications. Some examples of the target
markets include wireless Internet and multimedia, network and data communications, 3rd generation
wireless handset systems with wideband data services, wireless and wireline base stations and the
corresponding infrastructure [46].
The high-performance SC140 is designed to have a large data throughput of 4.8 GBytes/sec.
The processor uses a 32-bit unified program and data address space, which is byte addressable. It is
designed to allow significant parallelism. The SC140 can include a very large on-chip zero-wait Static Random Access Memory (SRAM). The SRAM allows for efficient execution of
applications, by reducing the cost of fetching instructions from memory. The cost of reads and writes to and from memory is reduced as well.
The Data Arithmetic Logic Unit (DALU) of the SC140 performs arithmetic and logical
instructions with four parallel Arithmetic Logic Units (ALUs). Each ALU has access to the sixteen 40-bit data registers, which form the DALU register file. Each ALU contains a Multiply and Accumulate (MAC) unit and a Bit-Field Unit (BFU). The MAC unit is capable of a multiplication of two 16-bit values and an accumulate every clock cycle. The BFU contains a 40-bit bi-directional barrel shifter. It is capable of single-bit and multiple-bit arithmetic and logical shifts, as well as logical, bit-masking and bit-extraction operations.
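As a hedged illustration, the following C kernel is the kind of code the DALU favors: a dot product of 16-bit values whose inner step is one 16x16 multiply-accumulate, several of which the compiler can schedule per cycle across the parallel MAC units. The function name and data layout are illustrative, not taken from the thesis implementation.

/* 16-bit dot product: each loop iteration is one 16x16 multiply
 * plus accumulate, the operation a MAC unit performs per cycle. */
#include <stdint.h>

int64_t dot_product_q15(const int16_t *a, const int16_t *b, int n) {
    int64_t acc = 0;                  /* wide accumulator (40-bit on SC140) */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];  /* one multiply-accumulate step */
    return acc;
}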
The Address Generation Unit (AGU) of the SC140 performs address manipulation and limited
arithmetic instructions with two parallel Address Arithmetic Units (AAUs). It contains its own register
file and operates in parallel with the DALU. The register file consists of sixteen 32-bit registers.
The SC140 employs a Variable Length Execution Set (VLES), which allows the execution of up to six instructions in a single clock cycle, fully utilizing the processing capabilities of the SC140. The combination of instructions allowed in a VLES is limited by a set of rules, but VLESs greatly reduce the overall code size. Within a VLES, NOPs are implied, eliminating the need to define an instruction for each processing
unit per clock cycle. Similar DSPs with parallel processing capabilities that do not have VLESs have a
fixed-size instruction word, referred to as a Very Long Instruction Word (VLIW). VLIWs lead to large
program sizes that have a low code density [44]. Large programs are inefficient, and attempts must be
made to avoid them when dealing with portable devices with limited memory, bandwidth and power
resources.
The SC140 has zero-overhead hardware loops that can be highly beneficial when used
correctly. The hardware loops allow for up to four levels of nesting, and provide a means of reducing
repetitive code with a minimal execution cost. By reducing repetitive code, the code size of
applications can be decreased. Hardware loops are further explained in §6.1.1.
The SC140 also has several unique addressing modes that allow for efficient execution of
repetitive algorithms. There are four addressing modes that include register direct, address register
indirect, PC relative, and special [46]. The register direct, PC relative and special address modes are
general addressing techniques that are common to most processors.
The address register indirect addressing mode is the most interesting and beneficial addressing
method. It allows several techniques of addressing memory, including methods of modifying address
registers. Post-increment and post-decrement addressing can be specified. In each case, the address
register is modified by the memory access width defined by the instruction. There is also a post-
increment by offset addressing method, where the address register is modified by the memory access
width multiplied by a control register. There is no cycle penalty for any of the post-modification addressing methods. When properly used, these addressing modes lead to tight loops with minimal wasted clock cycles when implementing repetitive algorithms.
Indirect addressing modes allow addressing by offsets. The offset can be another address
register, a short or long displacement, or a control register. For more addressing methods with
complete descriptions, refer to the SC140 DSP Core Reference Manual [46].
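The following hedged C fragment illustrates how the hardware loop and addressing features described above surface at the source level: a counted loop is a candidate for a zero-overhead hardware loop, and each pointer post-increment corresponds to the no-penalty post-increment addressing mode. The function is illustrative only; finite field addition over a binary field (§4.2.1) is simply a word-wise XOR.

/* Finite field addition over GF(2^m): c = a XOR b, word by word.
 * The counted loop maps to a hardware loop, and *p++ maps to
 * post-increment addressing on the AGU. */
#include <stdint.h>

void ff_add(uint32_t *c, const uint32_t *a, const uint32_t *b, int n) {
    for (int i = 0; i < n; i++)
        *c++ = *a++ ^ *b++;
}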
The power management control features of the SC140 further increase the processor’s
attractiveness for portable devices. The processor can be put into a wait or stop state, which are both
low power consumption modes. In these modes, the functionality of the processor greatly decreases
while waiting for an event to occur. The low power consumption modes conserve energy, in addition to that saved by the processor's low-voltage operation.
There is a wide variety of software tools available to develop applications for the SC140,
including the Metrowerks CodeWarrior Integrated Development Environment (IDE). It provides an environment where applications for the SC140 can be written, edited, compiled, assembled, simulated,
analyzed, and tested in software and hardware. Projects can be developed with the IDE using either or
both C and assembly source code. The parallelism, efficiency and size of the compiled application are controlled by compiler optimization levels selected by the developer at compile time.
The optimization techniques used by the compiler include various levels of scheduling, pipelining,
bundling, global register allocation, as well as global and space optimization.
The IDE includes a fully functional source code browser and editor, which provide developers
with an easy-to-use graphical user interface. They allow the addition and removal of files from
projects, as well as providing a means of navigating a project’s source code, thus simplifying the
development of large and complex multi-file applications.
Projects can be simulated in software or executed in hardware using the IDE debugging tool.
Software simulation is slow for complex applications, whereas hardware debugging of projects is much
faster but requires the attachment of a development board to the host or a networked computer. All of
the debugging options are available independent of simulation target. An important fact when
simulating, is code simulated and verified in software is not guaranteed to execute identically in
hardware. Some code and functionality restrictions present in hardware are not implemented by the
software simulator.
Breakpoints can be set within the C and/or assembly code for debugging purposes. The
position of breakpoints is limited when debugging optimized code. To aid in the debugging process,
commands such as step into, step over, step out, and run to cursor can be issued when the execution
sequence is paused. Stack variables, memory addresses, and internal register values can all be viewed
and modified during debugging.
A very useful component of the IDE is the profiler. The profiler can be run during the
debugging process to record statistics describing the execution sequence. It records several useful
statistical values that aid in weighing the performance of an application and individual functions.
Values including function call counts, function cycle counts, function and descendant function cycle
counts, as well as minimum, maximum and average cycle counts are recorded for each C function
executed during the simulation. A function call tree is also recorded and useful during analysis.
An easy-to-use GUI associated with the profiler allows a developer to analyze each function
within an application. The developer can navigate the function call tree, investigate the performance of each function, and view the performance and call counts of parent and descendant functions.
The profiler is an excellent tool to use during analysis of the optimization process. The
performance of individual functions can be easily analyzed. All the C functions in a project can be
sorted by statistics including total call count, total function cycle count, total function and descendant
cycle count, average function cycle count, and average function and descendant cycle count to quickly
determine the functions that consume the most execution time, and therefore most likely require
optimizations.
The data recorded by the profiler is best viewed and analyzed with the IDE and profiler, but
can be exported to other formats including HTML, XML and a tab delimited file. The alternative
formats allow the sharing of data with colleagues, who can navigate and analyze the data on computers
that do not have the IDE tool installed.
2.5 Previous Cryptographic and DSP Research
There has been a significant amount of research done on ECC. Several resources, which are referenced
in §2.2.1, compare ECC with other public-key cryptosystems. Some papers present the theory behind
ECC and the ECDSA. Others state and develop various algorithms for implementing the required ECC
operations.
The papers that state and develop various algorithms, along with ones that implement and compare the performance of the algorithms, were thoroughly investigated before the implementation
process. References [19], [20], [38], [40] and [61] present important algorithms and/or performance
results that influenced the algorithms implemented in the thesis.
Most of the work in ECC-related resources revolves around the elliptic curve point-multiplication operation because it accounts for nearly the entire execution time of the encryption and
signature processes. A significant breakthrough is presented by Solinas in [61]. He presents a
technique of reducing the execution time of the point-multiplication operation on Koblitz curves. The
modified point-multiplication algorithms presented by Solinas that use the technique, which are
presented in §3.3 and implemented in §4.4, are shown to outperform other methods in [19].
The majority of recent ECC papers focus on SCAs. They are a new class of attacks that use
timing and power analysis to break cryptosystems. Kocher first described the attacks against RSA, DSS and others in [33], and Coron later generalized the technique to include ECC [10]. Actual power
traces of elliptic curve point-multiplication were published in [13], illustrating resistance to power
analysis attacks on the SC140. However, prime fields, and not binary fields, were used in the
implementation. SCAs are examined in chapter 7, focusing on the implementation. The three types of
SCAs and countermeasures for each are presented, as well as some alternative techniques, developed by the writer, that may foil SCAs.
Most cryptographic research to date, including both symmetric and asymmetric-key
cryptosystems, involves general-purpose processors. A minimal amount of research has been done
involving DSP implementations of cryptography. For example, a very broad view of the secure
communication issues with respect to DSPs is presented in [12], implementation of AES on DSPs is
investigated in [64], and efficient implementations of 1024-bit RSA, 1024-bit DSA and ECDSA using
a 160-bit prime field are described in [26]. Finally, power-attacks on the elliptic curve point-
multiplication operation using prime fields are investigated in [13]. Obviously, further cryptographic
research involving DSPs is required.
3 The ECDSA Algorithm and Implementation Philosophy
This chapter describes the ECDSA protocol and its implementation. A thorough research
process led to a set of efficient algorithms that compute operations required by the ECDSA. The
algorithms used to implement the various finite field, large integer and elliptic curve operations are
described and listed. To conclude, a section that describes the implementation and integration
philosophy used is included. It explains terminology used and provides some vital implementation
information, both of which are required to fully understand chapter 4.
3.1 The ECDSA
The ECDSA is an asymmetric digital signature scheme with appendix that is an analogue to the DSA
[30]. The main difference between the two techniques is the problem each is based on. The ECDSA is based on the ECDLP, which is believed to be a more difficult problem than the DLP, on which the DSA is based.
The ECDSA is asymmetric, meaning different keys are used to generate and verify the digital
signature. The key used in the generation process, referred to as the private key, is kept secret by the
signing entity. Other entities must not know the private key for the signature scheme to function
correctly. The public key, which is used in the signature verification process, is stored by a CA. The
CA distributes the key to any entity that requests it, making it publicly known.
The ECDSA is a digital signature scheme with appendix. The digital signature is appended to
the original message, leaving the message in the clear. Figure 3-1 is a depiction of a digital signing
process for digital signatures with appendices. The original message may already be encrypted.
[Figure 3-1. Digital Signature with Appendix: beginning with the original message, the digital signature is computed from it, and the original message is transmitted with the signature appended.]
The ECDSA signature generation algorithm used to sign a message M is listed as Algorithm
3-1. It uses domain parameters, which provide cryptosystem details such as the base point G, the curve
order n, and the reduction polynomial f. The sender, who generates the signature, has a private and
public key, d and Q respectively. SHA-1() is a standardized function that is used to compute a hash of
the original message. The value k, also known as a nonce, is uniquely computed for each digital
signature. The signature generation algorithm computes r and s, which form the digital signature or
appendix.
Algorithm 3-1. ECDSA Signature Generation [30]
Input: d, f, n, G, M
Output: r, s
1. Select a random or pseudorandom integer k, 1 ≤ k ≤ n-1.
2. Compute k⋅G = (x1, y1) and convert x1 to an integer.
3. Compute r = x1 mod n. If (r = 0) then go to step 1.
4. Compute k^-1 mod n.
5. Compute SHA-1(M) and convert this bit string to an integer e.
6. Compute s = k^-1⋅(e + d⋅r) mod n. If (s = 0) then go to step 1.
7. The signature for message M is (r, s).
The signature verification process is listed as Algorithm 3-2. To verify an elliptic curve digital
signature, the receiver must first obtain a verified copy of the signer’s domain parameters and Q. The
values of all variables listed in the verification algorithm below are assumed to be the same as those listed in the generation algorithm. This assumption may be violated due to transmission error(s) and/or third parties. When the values of the variables differ, the signature verification process is almost
guaranteed to fail.
Algorithm 3-2. ECDSA Signature Verification [30]
Input: f, n, r, s, G, Q, M
Output: signature verification or rejection
1. Verify that r and s are integers in the interval [1, n-1].
2. Compute SHA-1(M) and convert this bit string to an integer e.
3. Compute w = s^-1 mod n.
4. Compute u1 = e⋅w mod n and u2 = r⋅w mod n.
5. Compute X = u1⋅G + u2⋅Q.
6. If X = O (the point at infinity), then reject the signature. Otherwise, convert the x-coordinate x1 of X to an integer, and compute v = x1 mod n.
7. Accept the signature if and only if v = r.
The elliptic curve point-multiplications in steps 2 and 5 of the signature generation and
verification algorithms, respectively, are by far the most time-consuming operations. As previously
stated, point-multiplication involves several point additions and possibly subtractions, which each
involve finite field multiplications and a finite field inversion. Other steps in the signature generation
and verification processes only require single finite field multiplication and inversion operations.
For an optimum implementation of the ECDSA, the most efficient algorithms that implement
each finite field, large integer and elliptic curve operation must be employed. Inefficient algorithms are
detrimental to the overall performance. For example, inefficient finite field algorithms require more
computations, and therefore more clock cycles to execute. This increases the computational cost of the
implementation significantly because finite field operations are commonly used within the ECDSA.
The decrease in performance is further impacted when inefficient elliptic curve multiplication
algorithms are implemented. These algorithms require several more finite field operations than
superior algorithms, resulting in an even further loss in performance. It is imperative to use the most
efficient algorithms possible because the overall performance is limited by the underlying algorithms
used to implement each operation.
Inefficient implementations of finite field operations are especially damaging to the overall performance of the ECDSA, because each elliptic curve operation consists of several finite field operations.
By slightly improving implementations of finite field operations, the overall performance of the
ECDSA can be significantly improved. In §4.2 and §4.3, it is shown how even slight improvements
and optimizations to implemented operations have a large impact on performance figures.
3.2 Finite Field and Large Integer Arithmetic
The following sections present an overview of the finite field, or polynomial, and large integer
operations required by the ECDSA. Finite field operations, which are the foundation of ECC, were
thoroughly researched in an attempt to obtain optimum algorithms. Alternative large integer operations
were not investigated because of their limited effect on execution times. Each section briefly
describes the operations and presents the algorithm selected for implementation during the research process.
3.2.1 Basic Operations
Several basic finite field and large integer functions are required by the ECDSA. The theory behind
these operations, including finite field addition and reduction, and large integer addition and
subtraction is not described. Algorithms for the operations are not stated because of their simplicity. If
they are not already known, algorithms to implement the operations can be found in [19], [30] and [38].
A specialized algorithm that implements finite field reduction exists. The algorithm is field
dependent, meaning the details of the algorithm depend on the reduction polynomial. A description of
how to formulate the algorithm for a specific reduction polynomial is presented in [19]. The algorithm
formulated for the reduction polynomial f = x^163 + x^7 + x^6 + x^3 + 1, which defines the binary finite field used
for the scope of the thesis, is presented in the paper and below as Algorithm 3-3. The assumption that
32-bit registers are used to store finite field elements, c and a, is made. Therefore, c[i] refers to the ith
32-bit register. Furthermore, ⊕ represents the exclusive-or operation, and >> and << represent right
and left bit-shifts of the corresponding variable respectively.
Algorithm 3-3. Finite Field Reduction (c = a mod f) [19]
Input: a = (a324, a323, a322,…, a0)
Output: c = (c162, c161, c160,…, c0)
1. c = a
2. For i = 10 to 6 do
2.1. T = c[i]
2.2. c[i-6] = c[i-6] ⊕ (T << 29)
2.3. c[i-5] = c[i-5] ⊕ (T << 4) ⊕ (T << 3) ⊕ T ⊕ (T >> 3)
2.4. c[i-4] = c[i-4] ⊕ (T >> 28) ⊕ (T >> 29)
3. T = c[5] & (0xFFFFFFF8)
4. c[0] = c[0] ⊕ (T << 4) ⊕ (T << 3) ⊕ T ⊕ (T >> 3)
5. c[1] = c[1] ⊕ (T >> 28) ⊕ (T >> 29)
6. c[5] = c[5] & (0x00000007)
7. Return(c[5], c[4], c[3], c[2], c[1], c[0])
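Algorithm 3-3 translates almost directly into C. The following sketch, which assumes 32-bit words with word 0 holding the least significant bits, illustrates the translation; the function name matches the reduction163 routine described in §4.2.2, but the code is an illustrative sketch rather than the thesis implementation.

    #include <stdint.h>

    /* Reduce a polynomial of degree <= 324 (11 32-bit words, c[0] least
       significant) modulo f = x^163 + x^7 + x^6 + x^3 + 1. The reduced
       result occupies words c[0]..c[5]. */
    void reduction163(uint32_t c[11])
    {
        uint32_t T;
        int i;

        for (i = 10; i >= 6; i--) {                  /* step 2 */
            T = c[i];
            c[i-6] ^= (T << 29);
            c[i-5] ^= (T << 4) ^ (T << 3) ^ T ^ (T >> 3);
            c[i-4] ^= (T >> 28) ^ (T >> 29);
        }
        T = c[5] & 0xFFFFFFF8;                       /* steps 3-6 */
        c[0] ^= (T << 4) ^ (T << 3) ^ T ^ (T >> 3);
        c[1] ^= (T >> 28) ^ (T >> 29);
        c[5] &= 0x00000007;
    }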
3.2.2 Finite Field Multiplication
Several algorithms exist to perform finite field multiplication. The complexity and performance of
each vary considerably. A thorough investigation of the algorithms and performance figures presented
in several papers led to the conclusion that a multiplication technique, the left-to-right comb method with windows, is the most efficient algorithm [19].
Presented as Algorithm 3-4, the left-to-right comb method uses a Look-Up Table (LUT) to
reduce repetitive computations. A window of variable width w determines the computed set of
multiples of the element b. The multiples of b are stored in a LUT, and later indexed by sets of bits
from element a. The indexed values are added to the running sum, c, of the multiplication until all of
the sets of bits from element a are used. Then, to complete the operation, c is reduced. The algorithm
presents an efficient method of computing the result of a finite field multiplication of two elements.
To further improve the performance of Algorithm 3-4, the window width w must be fixed. By
doing so, the code becomes less general. Assumptions can be made because of the fixed window
width, increasing the performance of the operation.
Algorithm 3-4. Finite Field Multiplication (c = a⋅b) [39]
Input: a = (am-1,am-2,…,a0), b = (bm-1,bm-2,…,b0), w
Output: c = (cm-1,cm-2,…,c0)
1. Compute bu = u⋅b for all polynomials u of degree less than w
2. c = 0
3. For k = (W/w) - 1 downto 0 do, where W = 32 is the word width
3.1. For j = 0 to t-1 do, where t = ⌈m/W⌉ is the number of words per element
3.1.1. c = c ⊕ (bu << W⋅j), where u = (a(W⋅j+w⋅k+w-1), …, a(W⋅j+w⋅k+1), a(W⋅j+w⋅k))
3.2. If (k ≠ 0) then c = c << w
4. c = c mod f
5. Return(c)
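A minimal C sketch of the comb method follows, assuming 32-bit words, w = 4 and the six-word elements of the 163-bit field used in the thesis. The names poly_mul_comb and tab are illustrative (the thesis routine is poly_mul_win, described in §4.2.3), and the unreduced product must still be reduced, for example with Algorithm 3-3.

    #include <stdint.h>
    #include <string.h>

    #define NUMWORD 6                /* words per 163-bit element */
    #define W       4                /* window width w */

    /* c (2*NUMWORD words) = a*b as polynomials over GF(2), not yet reduced. */
    void poly_mul_comb(uint32_t c[2*NUMWORD],
                       const uint32_t a[NUMWORD], const uint32_t b[NUMWORD])
    {
        uint32_t tab[1 << W][NUMWORD];   /* tab[u] = u(x)*b(x), deg u < W */
        int u, i, j, k;

        /* Step 1: pre-compute the LUT; tab[u] = (tab[u>>1] << 1) ^ (u&1 ? b : 0). */
        memset(tab[0], 0, sizeof tab[0]);
        memcpy(tab[1], b, sizeof tab[1]);
        for (u = 2; u < (1 << W); u++) {
            for (i = NUMWORD - 1; i > 0; i--)
                tab[u][i] = (tab[u >> 1][i] << 1) | (tab[u >> 1][i-1] >> 31);
            tab[u][0] = tab[u >> 1][0] << 1;
            if (u & 1)
                for (i = 0; i < NUMWORD; i++)
                    tab[u][i] ^= b[i];
        }

        /* Steps 2-3: accumulate window multiples, shifting by W bits per pass. */
        memset(c, 0, 2 * NUMWORD * sizeof c[0]);
        for (k = (32 / W) - 1; k >= 0; k--) {
            for (j = 0; j < NUMWORD; j++) {
                u = (a[j] >> (W * k)) & ((1 << W) - 1);
                for (i = 0; i < NUMWORD; i++)
                    c[j + i] ^= tab[u][i];          /* add tab[u] at word j */
            }
            if (k != 0) {                           /* c = c << W */
                for (i = 2*NUMWORD - 1; i > 0; i--)
                    c[i] = (c[i] << W) | (c[i-1] >> (32 - W));
                c[0] <<= W;
            }
        }
    }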
3.2.3 Finite Field Squaring
Finite field squaring is a special case of multiplication, where the two elements involved are equal.
Due to the equality, the multiplication operation becomes much simpler. As described in [19], when an
element is represented using the PB, squaring is equivalent to expanding its bit representation by
inserting a zero bit between each pair of consecutive bits. After the expansion, the element must be
reduced using Algorithm 3-3. Algorithm 3-5, which squares an element, is presented below.
Algorithm 3-5. Finite Field Squaring (c = a^2) [19]
Input: a = (am-1, am-2, …, a0)
Output: c = (cm-1, cm-2, …, c0)
1. For v = (v3, v2, v1, v0) = 0 to 15 do
1.1. T(v) = (0, v3, 0, v2, 0, v1, 0, v0)
2. c = 0
3. For i = 0 to t-1 do, where t = ⌈m/4⌉
3.1. c = c ⊕ (T(a4i+3, a4i+2, a4i+1, a4i) << 8⋅i)
4. c = c mod f
5. Return(c)
A LUT is the most efficient method of achieving the element expansion. For example, i bits of the original element are used to index a 2i-bit value in the LUT, which is the expansion of those i bits.
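A C sketch of the expansion step follows, using the 4-bit LUT of Algorithm 3-5. The table is simply T(v) from step 1 written out as bytes; the names are illustrative, and the expanded result must still be reduced (for example with reduction163) to complete the squaring.

    #include <stdint.h>

    #define NUMWORD 6   /* words per 163-bit element */

    /* T(v) from Algorithm 3-5: a zero bit inserted between each bit of v. */
    static const uint8_t sqr_tab[16] = {
        0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
        0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55
    };

    /* c (2*NUMWORD words) = bit expansion of a; reduce c to finish squaring. */
    void poly_sqr_expand(uint32_t c[2*NUMWORD], const uint32_t a[NUMWORD])
    {
        int i, j;
        for (j = 0; j < NUMWORD; j++) {
            uint32_t lo = 0, hi = 0;
            for (i = 0; i < 4; i++) {
                lo |= (uint32_t)sqr_tab[(a[j] >> (4*i)) & 0xF] << (8*i);
                hi |= (uint32_t)sqr_tab[(a[j] >> (16 + 4*i)) & 0xF] << (8*i);
            }
            c[2*j]   = lo;      /* expansion of the low 16 bits of a[j]  */
            c[2*j+1] = hi;      /* expansion of the high 16 bits of a[j] */
        }
    }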
3.2.4 Finite Field Inversion
Inversion is the most computationally expensive finite field operation required by the ECDSA.
Therefore, it is extremely important to implement the best algorithm when execution time is a concern.
There are two principal algorithms, the Extended Euclidean Algorithm (EEA) and Almost Inverse
Algorithm (AIA), which efficiently compute the inverse of a finite field element.
The AIA eliminates nonzero bits of the input element from right to left, and computes a value
that requires a reduction before the inverse element is obtained. The AIA is less intuitive than the
EEA, and is expected to require fewer loop iterations [19]. Nevertheless, extensive research showed that the EEA outperforms the AIA in many papers, including [19] and [20].
A variation of the EEA, developed and published by Hasan in [20], is shown to outperform the
original algorithm while requiring a smaller memory footprint. The inversion algorithm is presented
below as Algorithm 3-6.
Algorithm 3-6. Finite Field Inversion (b = a^-1 mod f) [20]
Input: a = (am-1, am-2, …, a0)
Output: b = (bm-1, bm-2, …, b0)
1. r(-1) = f, r(0) = a, u(-1) = 0, u(0) = 1
2. deg_r(-1,0) = m, deg_r(-1) = m, deg_r(0) = degree(r(0))
3. d(1,0) = m – deg_r(0), i = 0
4. Do
4.1. i = i + 1, j = 0
4.2. r(i-2,0) = r(i-2), u(i-2,0) = u(i-2)
4.3. While (d(i, j) ≥ 0) do
4.3.1. r(i-2,j+1) = r(i-2,j) ⊕ r(i-1)⋅x^d(i,j), u(i-2,j+1) = u(i-2,j) ⊕ u(i-1)⋅x^d(i,j)
4.3.2. j = j + 1, deg_r(i-2,j) = degree(r(i-2,j)), d(i, j) = deg_r(i-2,j) – deg_r(i-1)
4.4. r(i) = r(i-2,j), u(i) = u(i-2,j)
4.5. deg_r(i) = deg_r(i-2,j)
4.6. d(i+1,0) = -d(i,j)
5. While (r(i) ≠ 0)
6. Return (u(i-1))
Algorithm 3-6 was developed by Hasan with the aid of custom hardware that further improves
performance. The custom hardware allows certain obscure steps and functionality only required when
inverting finite field elements to be implemented with minimal execution time cost. For the thesis
implementation, the absence of specialized functionality required the software emulation of some tasks
at the cost of execution time and/or memory storage.
3.2.5 Large Integer Operations
Large integer operations, including multiplication, inversion and division, are also required when
implementing the ECDSA. Efficient algorithms for implementation of the operations were not
researched, primarily because only a minute portion of the ECDSA involves large integer operations.
Therefore, the impact of optimal algorithms for the operations is lessened significantly.
Large integer inversion is extremely similar to finite field inversion. Algorithms that are
designed for either operation can be very easily converted to the other by replacing basic operations.
For example, Algorithm 3-6 can be converted to invert large integers by replacing all finite field
additions with large integer subtractions. Some other modifications must be made to handle inequalities that do not arise with finite field elements. Before each subtraction, measures should be added to ensure the result is positive. These measures are not strictly necessary, but omitting them requires other special cases and modifications to the algorithm.
Neither large integer multiplication nor division is a common operation in the ECDSA. Therefore, basic algorithms, for example long division in the case of large integer division, can be used
to implement the operations. The performance of such algorithms has minimal impact on the overall
execution times of the signature generation and verification processes.
3.3 Elliptic Curve Arithmetic
Several elliptic curve operations are required when implementing the ECDSA. The following sections describe the operations required by the signature generation and verification processes, briefly discuss an alternative representation of finite field elements that allows superior point-multiplication computations, and present efficient algorithms for implementation.
3.3.1 Elliptic Curve Point Addition and Subtraction
Elliptic curve addition and subtraction are very similar point manipulations that consist of several finite
field operations. First, the point addition operation is described and defined. Later, point subtraction is
defined. It is performed by computing the negative of a point and then executing point addition.
Elliptic curve point addition is defined as the summation of two points, P3 = P1 + P2. There
are four different cases for point addition. Each case is individually explained in the following
paragraphs.
First, an easy case of point addition is defined. When either point involved in the addition is the point at infinity, designated O, the result is the other point involved in the summation, as shown by Formula 3.1.
P + O = O + P = P for all points P on an elliptic curve (3.1)
The second case of point addition is when a point and its negative counterpart are involved.
For a point P = (x, y), a point is called the negative of P, denoted -P, if it has the coordinates (x, x + y).
The result of such an addition is defined as the point at infinity, O. Formula 3.2 presents this case of point addition.
P + (-P) = (-P) + P = O for all points P on an elliptic curve (3.2)
The third case of point addition is the general case of the operation, where the sum of two points P1 = (x1, y1) and P2 = (x2, y2) is calculated, where P1 ≠ ±P2 and neither is equal to O. The operation results in the point P3 = (x3, y3), where Formula 3.3 holds. The variable a is a domain parameter associated with the elliptic curve.
λ = (y1 + y2) / (x1 + x2)
x3 = λ^2 + λ + x1 + x2 + a
y3 = λ⋅(x1 + x3) + x3 + y1 (3.3)
where Pi = (xi, yi) and P3 = P1 + P2
The above formulas for elliptic curve point addition define a complex relationship between the
coordinates of the resultant and original points. All operations within the formulas are finite field
operations. The general case of elliptic curve point addition is an expensive operation consisting of one
finite field inversion, two multiplications, one squaring and numerous additions.
Finally, the last case of point addition, where P1=P2, is also known as point doubling. The
result of point doubling, P3 = (x3, y3) = P1 + P2 = 2P1, where P1 = (x1, y1) is calculated using Formula
3.4. Elliptic curve point doubling is slightly less expensive than point addition. It requires one finite
field inversion, two multiplications, two squarings and numerous additions. Similar to a, the variable
b is a domain parameter associated with the elliptic curve involved.
x3 = x1^2 + b/x1^2
y3 = x1^2 + (x1 + y1/x1)⋅x3 + x3 (3.4)
where Pi = (xi, yi) and P3 = P1 + P1 = 2⋅P1
The basic algorithm for point addition that implements all four point addition cases is stated below as Algorithm 3-7. In the algorithm, division of finite field elements is achieved by multiplying the numerator of the expression by the inverse of the denominator.
Algorithm 3-7. Elliptic Curve Point Addition (P3 = P1 + P2) [38]
Input: P1 = (x1, y1), P2 = (x2, y2)
Output: P3 = (x3, y3)
1. If (P1 = O) then Return(P3 = P2)
2. If (P2 = O) then Return(P3 = P1)
3. If (x1 = x2) then
3.1. If (y1 = y2) then
3.1.1. λ = x1 + y1/x1
3.1.2. x3 = λ^2 + λ + a
3.2. Else Return(P3 = O)
4. Else
4.1. λ = (y1 + y2) / (x1 + x2)
4.2. x3 = λ^2 + λ + x1 + x2 + a
5. y3 = λ⋅(x1 + x3) + x3 + y1
6. Return(P3 = (x3, y3))
The point addition algorithm shown above describes a technique of performing the operation.
No optimizations to the algorithm are possible without changing the coordinate system. The point
addition algorithm is based on an affine coordinate system, which is the basic coordinate system used
with elliptic curve operations. There are other coordinate systems, referred to as projective coordinate
systems, which define modified addition algorithms. Projective coordinate systems are examined in
§3.3.2.
Elliptic curve point subtraction, P3 = P1 - P2, is achieved by computing the negative of P2, and
then summing the result of the negation with P1. The formula for computing the negative of a point, which can be very easily derived from its definition, is presented as Formula 3.5.
xj = xi, yj = xi + yi, where Pi = (xi, yi) and Pj = -Pi (3.5)
The purpose of projective coordinate systems is to avoid the costly finite field inversion
operation required in each point addition. These coordinate systems define points with three
coordinates, and essentially replace inversion operations with several finite field multiplications. The
coordinate systems are beneficial when the cost of finite field inversion is substantially larger than
multiplication. The different coordinate systems found by research efforts are compared in the
following section.
3.3.2 Elliptic Curve Point Representation
Finite field inversion is the most expensive finite field operation involved in the ECDSA. To avoid
inversion, several types of projective coordinates have been researched. Projective coordinate systems
use three coordinates (x, y, z). Affine points are converted into projective coordinate points at the start
of a point-multiplication operation. The operation is carried out, and the result is then converted back
into an affine point.
The purpose of changing coordinate systems is to eliminate finite field inversions from the
several point addition operations within a point-multiplication operation. Effectively, projective
coordinate systems provide a means of computing a point addition, where the single inverse operation
is exchanged for numerous multiplication operations. Since multiplication is a less costly finite field
operation, projective coordinates can be computationally beneficial. It should be noted that inversion
operations are required when converting between the affine and projective coordinate systems.
Coordinate conversion is required before and after each point-multiplication operation.
Several papers including [15], [19] and [26] show that the use of projective coordinates over
affine is beneficial. The advantage of a projective coordinate system is that the most expensive finite
field operation, inversion, is replaced with several less expensive finite field multiplication operations.
A comparison of projective coordinate systems with the affine coordinate system shows that, unless
finite field inversions are at least ten times as expensive as multiplications, the use of projective
coordinate systems is not beneficial.
Three alternatives to the affine coordinate system are the Standard projective, Jacobian projective and Projective coordinate systems. The following table shows the number of multiplications and inversions required for general point addition and point doubling in each coordinate system.
Table 3-1. Elliptic Curve Coordinate System Comparison [19]
Coordinate System General Addition Doubling
Affine (x, y) 1 inversion, 2 multiplications 1 inversion, 2 multiplications
Standard projective (x/z, y/z) 13 multiplications 7 multiplications
Jacobian projective (x/z2, y/z3) 14 multiplications 5 multiplications
Projective (x/z, y/z2) 14 multiplications 4 multiplications
Table 3-1 also shows the tradeoff between inversion and multiplication that is possible when a
projective coordinate system is employed. When focusing primarily on point doubling, projective
coordinates are beneficial. By using one of the projective coordinate systems listed, an inversion can
be traded for two to five multiplications. This is definitely a reasonable tradeoff, since [19], [38] and [40] report the cost of a single inversion as approximately ten multiplications.
Alternatively, speedups of 30-70% are predicted for elliptic curve point addition when
Algorithm 3-6 is used for finite field inversion [20], compared to results similar to those from [19], [38]
and [40]. Since the only performance enhancement is from the inversion operation, it is reasonable to
assume that the performance of the inversion operation is proportional to much less than ten
multiplications. With the performance of inversions being less than ten multiplications, projective
coordinate systems are not as attractive. Point doubling may still be slightly faster using projective
coordinates, but general addition, where an inversion is traded for over ten multiplications, is definitely less efficient. Only the point doubling operation benefits when using a projective coordinate system
and efficient inversion is possible.
The primary advantage of using Koblitz curves is that point-multiplication algorithms that do not require any point doublings can be exploited [19]. In the subsequent section, the benefits of such algorithms are examined and found superior. Therefore, point doublings can be eliminated, and thus they do
not play a factor in the choice of a coordinate system. Without point doublings involved in the elliptic
curve point-multiplication operation, and a finite field inversion algorithm that is expected to
outperform ten multiplications, projective coordinates are not computationally beneficial, and an
affine coordinate system is superior for the implementation of ECDSA.
3.3.3 Elliptic Curve Point-Multiplication
There are several ways of representing a finite field element. In general, each representation has
positive and negative aspects. For example, the NB representation of a finite field element simplifies the squaring operation significantly. When an element is represented using the NB, the squaring operation reduces to a cyclic left shift. Unfortunately, other operations become more expensive when using the
NB [21]. The following sections present some alternative representations of k from its original binary
integer representation, which improve the performance of the elliptic curve point-multiplication
operation. The sections also analyze the benefits of the alternative representations of k, and the point-
multiplication algorithms associated with each representation.
3.3.3.1 Non-Adjacent Form
A Non-Adjacent Form (NAF) is a beneficial representation of k that can be exploited when performing point-multiplication to increase performance. By definition, each coefficient of the NAF representation of a finite field element
belongs to the set {-1, 0, +1}, and no two consecutive coefficients are nonzero. When comparing NAF
and binary integer representations, there is at most one additional coefficient required to represent an
element. The benefit of using the NAF representation of k when performing point-multiplications is
the reduced Hamming weight of k, and therefore the reduced number of point addition or subtraction
operations required.
Each coefficient in the NAF representation belongs to the set {–1, 0, +1}. This means that for
every 0 coefficient of k, as with the binary integer representation, a point addition operation is not
required. For every –1 coefficient of k, a point subtraction operation is required. The point subtraction
operation is very similar to point addition. There is no substantial difference in computational costs
between the two operations. For every +1, as with the binary integer representation, a point addition
operation is required.
Consider an average random finite field element k, which is m bits long. The binary integer representation of k requires m bits, on average half of which are zeros and the other half ones, so point-multiplication requires m/2 point additions. In comparison, an average random finite field element represented using NAF has a Hamming weight of m/3 [61]. Therefore, the point-multiplication only requires m/3 point additions or subtractions, reducing the number of point additions and subtractions by
m/6. This is a significant improvement considering the cost of point additions and subtractions.
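For example, with the m = 163 field used in the thesis, the binary integer representation leads to roughly 163/2 ≈ 81 point additions per point-multiplication, while the NAF representation leads to roughly 163/3 ≈ 54 point additions or subtractions, a saving of approximately 27 point operations.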
When the NAF technique of point-multiplication is combined with windowing techniques,
further benefits arise. The equivalent number of pre-computed values in the LUT is halved when using
the NAF representation, because only the positive values are required. The negative values can be
calculated during execution using the corresponding positive value, because point negation is inexpensive.
Algorithms that compute the NAF representation of a finite field element, and perform NAF
point-multiplication are presented by Solinas in [61]. For the NAF point-multiplication algorithm to be
beneficial, the cost of converting k from binary integer to NAF must be less than the cost of m/6 point
additions, where m-bits are required to represent k. As shown in several papers, the NAF point-
multiplication algorithm is beneficial, compared to binary integer point-multiplication methods, for
current and future finite field sizes [19].
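To make the conversion concrete, the following C sketch computes the NAF digits of a small nonnegative integer, least significant digit first. The function name is illustrative, and a real implementation operates on multi-word large integers rather than a long long.

    /* NAF digits of k >= 0, least significant first; returns the digit count. */
    int naf_convert(long long k, int digits[], int max_digits)
    {
        int n = 0;
        while (k != 0 && n < max_digits) {
            int u = 0;
            if (k & 1) {
                u = 2 - (int)(k & 3);   /* u in {-1, +1}; k - u divisible by 4 */
                k -= u;
            }
            digits[n++] = u;
            k >>= 1;
        }
        return n;
    }

For instance, k = 7 yields the digits (1, 0, 0, -1) read from most to least significant, i.e. 7 = 8 - 1, with Hamming weight two instead of the three of its binary representation.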
3.3.3.2 Reduced TNAF Representation
Solinas presented a method of exploiting the τ-adic representation of a finite field element to reduce the
computational cost of elliptic curve point-multiplication over Koblitz curves [61]. The benefits of the
point-multiplication technique described are dependent on the implementation and other factors.
Supporting Solinas's paper, the significant reduction in computational cost documented in [19] is verified in §5.1.2. The τ-adic representation is only valid for Koblitz curves, which is the type of elliptic curve
implemented in the thesis. Point-multiplication is the most time consuming operation associated with
the ECDSA, making Solinas’ findings significant.
The benefit of using the τ-adic representation of an element is derived directly from the
formula describing the base of the representation, τ. Formula 3.6 describes the complex number τ, and
includes a variable µ, which is a Koblitz curve parameter and takes on the value of ±1 [61]. With some
algebraic manipulation, the Frobenius map presented as Formula 3.7 can be derived from Formula 3.6.
The mapping holds for all points P on the Koblitz curve, where P = (x, y) [61].
τ^2 + 2 = µ⋅τ (3.6)
τ⋅(x, y) := (x^2, y^2) (3.7)
Solinas presents an algorithm that exploits Formula 3.7. The algorithm is very similar to the
NAF point-multiplication algorithm. The τ-adic representation of k is required by the algorithm, where
k is the finite field element involved in the point-multiplication. The τ-adic representation of k requires
several more coefficients than the binary integer representation, but the inefficiency due to the larger
representation is resolved by Algorithm 3-9. By utilizing Formula 3.7, every point doubling operation
in the point-multiplication algorithm is replaced by the squaring of the individual coordinates of the
point, which is much less computationally expensive.
Solinas adds two other improvements to the τ-adic representation before stating a point-
multiplication algorithm. The τ-adic representation is combined with a NAF representation, so the
finite field element is represented in τ-adic NAF (TNAF) form. The modification combines the
benefits of both NAF and τ-adic representations. The Hamming weight of the element in TNAF form is
reduced from the τ-adic representation, and all point doublings are eliminated. Algorithm 3-8
computes the TNAF representation of a finite field element.
In the following algorithms, subscripts such as 2, TNAF and Width-w TNAF (TNAFw)
symbolize the representation of the large integer k. Subscript 2 symbolizes a base2 binary integer
representation, whereas TNAF and TNAFw symbolize TNAF and TNAFw representations respectively.
Algorithm 3-8. TNAF Conversion (kTNAF = r0 + r1⋅τ) [61]
Input: integers r0, r1 where k2 = r0 + r1⋅τ
Output: kTNAF
1. k = {}
2. While (r0 ≠ 0) or (r1 ≠ 0)
2.1. If (r0 odd) then
2.1.1. u = 2 - ((r0 - 2⋅r1) mod 4)
2.1.2. r0 = r0 – u
2.2. Else
2.2.1. u = 0
2.3. Prepend u to k
2.4. t = r0/2, r0 = r1 + µ⋅t
2.5. r1 = -t
3. Return (k)
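As with the NAF conversion, Algorithm 3-8 can be illustrated with a small-integer C sketch, where mu is the curve parameter µ from Formula 3.6. The name tnaf_convert is illustrative; the thesis implementation operates on multi-word large integers.

    /* TNAF digits of r0 + r1*tau, least significant first; mu is +1 or -1. */
    int tnaf_convert(long long r0, long long r1, int mu,
                     int digits[], int max_digits)
    {
        int n = 0;
        while ((r0 != 0 || r1 != 0) && n < max_digits) {
            int u = 0;
            if (r0 & 1) {                                  /* r0 odd */
                u = 2 - (int)(((r0 - 2*r1) % 4 + 4) % 4);  /* u in {-1, +1} */
                r0 -= u;
            }
            digits[n++] = u;
            long long t = r0 / 2;     /* exact division: r0 is now even */
            r0 = r1 + mu * t;         /* simultaneous update of (r0, r1) */
            r1 = -t;
        }
        return n;
    }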
As previously stated, the number of coefficients required to represent k in TNAF form is greater than with the binary integer representation. The TNAF representation can be reduced, so that the same number of coefficients is required as with the binary integer representation, by using Formula 3.8 and Formula 3.9. The reduction of the polynomial, mod (τ^m - 1), results in an equivalent and minimal τ-adic representation. The minimal, or reduced, TNAF representation of a finite field element has a length ≤ m + a, where a is an elliptic curve parameter, and an average density of 1/3 + o(1) [61]. Solinas thoroughly describes the theory behind the reduction and how it is performed. However, the reduction mod (τ^m - 1) requires a division of large integers that results in a real number, which is not practical for
implementation. He later states a method that allows efficient implementation of an approximation of
the reduction.
τ^m⋅(x, y) = (x^(2^m), y^(2^m)) = (x, y) (3.8)
(τ^m - 1)⋅(x, y) = O (3.9)
The modified reduction algorithm, referred to as partmod δ, approximates the reduction mod (τ^m - 1), which guarantees the minimal TNAF representation of a finite field element. The partmod δ
algorithm, presented as Algorithm 3-9, calculates a reduced TNAF representation of the finite field
element [61]. Associated with the partmod δ is a probability of the reduction resulting in the minimal
TNAF representation of the element. To implement the reduction, the finite field element is passed to
the modified reduction algorithm, which computes inputs for the TNAF conversion algorithm. The
conversion results in a reduced TNAF representation of the original finite field element.
The modified reduction algorithm utilizes a parameter C, which determines the probability that
the reduced representation of the element is minimal. The inequality from [61] that describes the
probability is presented as Formula 3.10. The value of C corresponds to the number of significant
figures kept during the division of two integers. It also determines the computational cost of the
algorithm. For example, at the cost of computational time, an extremely large value of C can be used,
guaranteeing the minimal representation of the finite field element. For implementation on processors,
the register width should be taken into account when determining an appropriate C value.
Probability of not being minimal < 2^-(C - 5) (3.10)
Solinas presents the partmod δ reduction algorithm in several parts. First, the original
reduction algorithm is stated in a form difficult to implement. Then steps in the original algorithm are
broken down so they are efficiently realizable. The realizable methods approximate the division of
large integers from the original algorithm. Algorithm 3-9 performs the partmod δ reduction, combining
formulas and algorithms listed by Solinas.
Algorithm 3-9. Partmod δ Reduction (r0 + r1⋅τ := k2 partmod δ) [61]
Curve Parameters: a, m, r, s0, s1, Vm
Input: k2, C
Output: integers r0, r1
1. d0 = s0 + µ⋅s1
2. K = (m + 5)/2 + C
3. k' = ⌊k / 2^(m - K + 2 + a)⌋
4. g0 = s0⋅k', g1 = s1⋅k'
5. h0 = ⌊g0/2^m⌋, h1 = ⌊g1/2^m⌋
6. j0 = Vm⋅h0, j1 = Vm⋅h1
7. l0 = ⌊(g0 + j0)/2^(K - C) + ½⌋, l1 = ⌊(g1 + j1)/2^(K - C) + ½⌋
8. λ0 = l0/2^C, λ1 = l1/2^C
9. f0 = ⌊λ0 + ½⌋, f1 = ⌊λ1 + ½⌋
10. η0 = λ0 - f0, η1 = λ1 - f1
11. h0 = 0, h1 = 0
12. η = 2⋅η0 + µ⋅η1
13. If (η ≥ 1) then
13.1. If (η0 - 3⋅µ⋅η1 < -1) then h1 = µ
13.2. Else h0 = 1
14. Else If (η0 + 4⋅µ⋅η1 ≥ 2) then h1 = µ
15. If (η < -1) then
15.1. If (η0 - 3⋅µ⋅η1 ≥ 1) then h1 = -µ
15.2. Else h0 = -1
16. Else If (η0 +4⋅µ⋅η1 < -2 ) then h1 = -µ
17. q0 = f0 + h0, q1 = f1 + h1
18. r0 = k – d0⋅q0 – 2⋅s1⋅q1
19. r1 = s1⋅q0 - s0⋅q1
20. Return (r0, r1)
When comparing the binary integer and TNAF point-multiplication algorithms, either a point doubling operation or two finite field squarings are required in each loop iteration. The computational costs of a point doubling operation and of two finite field squarings are far from comparable. Each
point doubling operation consists of an inversion, two multiplications and several nominal finite field
operations, making point doubling an expensive operation. Relative to the finite field operations
multiplication and inversion, squaring is considered insignificant, making the cost of the squarings required in each loop iteration of the τ-adic point-multiplication algorithm minimal.
3.3.3.3 TNAF Point-Multiplication
The algorithm defined in [61] for TNAF point-multiplication is extremely similar to the NAF point-
multiplication algorithm. As previously mentioned, the point doubling operation present in the point-
multiplication algorithm is replaced by two finite field squaring operations, because the input element k
is in a TNAF representation. The TNAF point-multiplication is presented as Algorithm 3-10.
Algorithm 3-10. TNAF Point-Multiplication (Q = kTNAF⋅P) [19]
Input: k = (km-1, km-2, …, k1, k0)TNAF, P
Output: Q = (x, y)
1. Q = O
2. For i = (m-1) downto 0
2.1. Q = τ⋅Q, i.e. x = x^2, y = y^2
2.2. If (ki = 1) then Q = Q + P
2.3. If (ki = -1) then Q = Q – P
3. Return(Q)
Similar to finite field multiplication, by employing a LUT, the performance of an operation is
often improved. The only cost of the improved performance is an increase in the dynamic memory
footprint. As expected, this is also the case with TNAF point-multiplication. By employing a LUT to
reduce repetitive computations, execution time of the operation can be reduced. LUTs reduce
execution time by eliminating duplicate computational sequences whose results are pre-computed.
The TNAFw technique of the point-multiplication operation, which employs a LUT, was
investigated to improve performance. The following two sections describe the new functions required
to implement the so-called TNAFw point-multiplication. First, the TNAFw representation of k must be
computed.
3.3.3.4 Width-w TNAF Representation
To increase the performance of point-multiplication further, a width-w technique can be combined with
the TNAF point-multiplication algorithm. The technique requires a width-w representation of the
polynomial k, which guarantees at least w - 1 zero coefficients between consecutive nonzero coefficients, leading to an average Hamming weight of m/(w+1). The reduced Hamming weight results in a reduction in the number of point additions and subtractions required.
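For illustration, with m = 163 and w = 4, the expected number of point additions or subtractions in a point-multiplication drops to roughly 163/5 ≈ 33, compared to roughly 54 with the plain TNAF representation.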
The width-w technique requires the computation of functions of the point P involved in the
point-multiplication. They are stored in a LUT, and used in the operation. The number of functions of
P is determined by the width-w used. An optimum width-w can be estimated in theory, and then
confirmed during implementation, so the maximum performance of the algorithm is achieved. The
functions of P that are pre-computed depend directly on the Koblitz curve and width-w used in the
implementation. Refer to [61] for the formulas that describe the functions and other background information. Furthermore, the theory behind the width-w technique and other useful information required for implementation is explained there in detail.
The TNAFw representation of k must be computed before performing the point-multiplication.
This requires the implementation of an algorithm similar to the TNAF conversion algorithm. The
TNAFw representation of a finite field element is computed using Algorithm 3-11.
Algorithm 3-11. TNAFw Conversion (kTNAFw = r0 + r1⋅τ) [61]
Parameters: a, w, tw, αu := βu + γu⋅τ for u = 1, 3, …, 2^(w-1) - 1
Input: integers r0, r1 where k2 = r0 + r1⋅τ
Output: kTNAFw
1. k = {}
2. While (r0 ≠ 0) or (r1 ≠ 0)
2.1. If (r0 odd) then
2.1.1. u = r0 - r1⋅tw (mod 2^w)
2.1.2. If (u > 0) then ξ = 1
2.1.3. Else
2.1.3.1. ξ = -1
2.1.3.2. u = -u
2.1.4. r0 = r0 - ξ⋅βu
2.1.5. r1 = r1 - ξ⋅γu
2.1.6. Prepend ξ⋅αu to k
2.2. Else Prepend 0 to k
2.3. t = r0/2, r0 = r1 + µ⋅t
2.4. r1 = -t
3. Return (k)
Similar to the NAF and TNAF representations, the guarantee that the TNAFw representation of
a finite field element consists of coefficients that are mostly zero greatly reduces the computational cost
of the point-multiplication. This is because, as shown in Algorithm 3-12 in the next section, point
addition and subtraction operations are only called when the corresponding coefficient is nonzero.
When the coefficient is zero, only the two finite field squaring operations required in every loop iteration are executed.
3.3.3.5 TNAFw Point-multiplication
Once the TNAFw representation of k is computed, the TNAFw point-multiplication can be performed.
The algorithm to complete the multiplication operation is presented as Algorithm 3-12. The TNAFw
point-multiplication algorithm is very similar to the previous multiplication algorithms. The only differences are that a LUT containing functions of the point P is pre-computed, and the pre-computed
values are used by the point addition and subtraction algorithms instead of the original point P.
The performance of the point-multiplication algorithm below is superior to that of the other methods [19].
Therefore, TNAFw point-multiplication is implemented for the thesis and is used in the signature
generation and verification processes of the ECDSA. Conversely, for reasons stated, the TNAF point-
multiplication method is used in the implementation of the simultaneous multiple point-multiplication.
Algorithm 3-12. TNAFw Point-Multiplication (Q = kTNAFw⋅P) [61]
Input: k = (km-1, km-2, …, k1, k0)TNAFw, P
Output: Q = (x, y)
1. Compute Pu = αu⋅P, for u ∈ {1, 3, 5, …, 2^(w-1) - 1}
2. Q = O
3. For i = (m-1) downto 0
3.1. Q = τ⋅Q, i.e. x = x^2, y = y^2
3.2. u = | ki |
3.3. If (ki > 0) then Q = Q + Pu
3.4. If (ki < 0) then Q = Q – Pu
4. Return(Q)
3.3.4 Simultaneous Multiple Point-Multiplication
Within the signature verification process of the ECDSA, the sum of two point-multiplications must be
calculated. To improve the performance of this set of operations, the point-multiplications can be
simultaneously computed. This is called simultaneous multiple point-multiplication, and is known as
Shamir’s Trick [19].
Simultaneous multiple point-multiplication is based on a windowing technique similar to that used in the finite field multiplication operation. Unlike the finite field operation, there are two point-multiplications, so the LUT involved is two-dimensional, and consists of elliptic curve points. Sums of multiples of the two points are computed and stored in the LUT. In the main loop of
the algorithm, the LUT is indexed by a set of bits from the two finite field elements k and l. The
algorithm improves the computation of the sum of two point-multiplications [19]. The simultaneous
multiple point-multiplication computation is presented as Algorithm 3-13, and is specific to the
ECDSA signature verification process.
Algorithm 3-13. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q) [19]
Input: k = (km-1, km-2, …, k1, k0)2, l = (lm-1, lm-2, …, l1, l0)2, P, Q
Output: R
1. Compute i⋅P + j⋅Q for all i, j ∈ [0, 2^w - 1]
2. R = O
3. For i = (m/w - 1) downto 0
3.1. R = 2^w⋅R
3.2. R = R + (k'⋅P + l'⋅Q), where k' = (k(i⋅w+w-1), …, k(i⋅w))2 and l' = (l(i⋅w+w-1), …, l(i⋅w))2
4. Return(R)
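The tradeoff in Algorithm 3-13 is explicit: step 1 pre-computes a LUT of 2^(2w) points, one for each pair (i, j), while the main loop executes m/w iterations, each consisting of w point doublings (the computation of 2^w⋅R) and at most one point addition. Increasing w therefore trades LUT memory and pre-computation time against main-loop point additions.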
3.4 Implementation and Integration Philosophy
C source code that is available on the Internet was used as a starting point for the thesis. The original
code, referred to as Rosing code throughout the thesis, is associated with the book written by Michael
Rosing, “Implementing Elliptic Curve Cryptography” [56]. The Rosing code defines an excellent
starting point for ECC, as well as a learning environment and testing atmosphere for the thesis.
After a strong knowledge base of the theory of finite fields and elliptic curves was built, more
efficient algorithms that perform finite field, large integer and elliptic curve operations were studied,
implemented, tested and integrated into the existing Rosing code.
Whenever possible, the operations are written to be versatile and reusable. Commonly used
values are standardized with #define statements. The code is written such that only the defined values
must be modified to change the performance and cryptographic strength of the source code. For
example, the finite field size used is identified with a #define statement. By modifying the #define
statement, the finite field size, and therefore the encryption strength is changed.
Other values that are defined to add to the versatility of the code include data structure sizes, the width-w value and the window widths used in the implementation of efficient finite field and
elliptic curve operation algorithms. Originally, all the window widths used were variable. Later, some
of them were fixed at optimum values to increase the efficiency of the implementation.
A set of basic routines is written in both C and assembly using identical algorithms for each.
The two sets of routines are written such that Compiler-Generated Assembly (CGA) from C source
code, and the Hand-Written Assembly (HWA) can be compared. The comparison mentioned is
presented in a later section. The CGA and HWA routines have identical inputs and outputs, only their
names differ by ending in _cga and _hwa respectively. In the implementation and analysis sections of
the thesis, the function name extensions _cga and _hwa, are left out, but both are implied. The set of
routines written in both C and assembly is defined in §5.4.1.
The routines included during compilation depend on whether GCC or SC140 is defined. Only
one of these values can be defined at compile time. If GCC is defined, the set of C functions ending in
_cga are used, and the code is compatible with a generic C compiler and any processor. If SC140 is
defined, the handwritten assembly functions ending in _hwa are included. The code is specialized and
only compatible with the SC140.
Almost all of the Rosing code implementations of finite field and elliptic curve operations were
based on non-optimal algorithms. Therefore, the operations were replaced with implementations of
more efficient algorithms. The only operations not replaced are elliptic curve addition and subtraction.
All of the large integer operations were entirely replaced, except for large integer multiplication. Only
an assembly language version of the large integer multiplication operation, which is employed as an
HWA routine, was integrated. The original large integer multiplication from the Rosing code is used
when GCC, and not SC140, is defined. After each operation was implemented, thoroughly tested, and
integrated, obsolete Rosing code was removed from the source code when possible.
Furthermore, a new method of point-multiplication, improved over that of the Rosing code, was implemented. The Rosing code includes a NAF point-multiplication operation. TNAF and TNAFw
point-multiplication operations, which are superior to the NAF method, were implemented.
Each newly implemented function was tested using pseudorandom data versus existing
functionality to verify its correctness. Special cases for each function that test input data boundaries
were also investigated to ensure accurate results. For example, all the possible combinations of
positive and negative large integer inputs were tested with the large integer function
field_mult_wrapper.
After the implemented function was tested versus existing functionality, it was integrated into
the ECDSA code. To further ensure its correctness, the signature generation and verification processes
were executed. Variables are output at important points in the ECDSA during execution. The
correctness of the output variables was tested after each function was integrated to ensure the code
functions properly, and does not produce faulty results. Variables including the signature value were observed and verified to ensure that null signatures, which might be incorrectly verified by the implementation, are not generated.
The following section, which is organized by the type of operation, analyzes the implemented
operations, briefly describes optimization techniques, and lists the results obtained.
4 Implementation Analysis and Performance Results
First, low-level finite field and large integer operation routines were implemented and integrated with
the Rosing code. The implemented functions, which are the most basic operations involved in the
ECDSA, replace inferior Rosing source code. They are based on superior algorithms, and therefore,
outperform the routines present in the Rosing code.
The course of action, implementing low-level operations first, was taken to force familiarization with the data structures defined by the Rosing code. Furthermore, the functions are simple to understand, write and test. Several of the operations are so simple that their correctness can be determined using hand-calculated results, which greatly simplifies the debugging process. The
low-level operations provide an excellent starting point for implementation. They generally do not call
existing functions, thus limiting the possible erroneous code.
Later, functions that are more complex were implemented. Some of the functions implement finite field and large integer operations, including inversion, division and multiplication. Other complex routines, including elliptic curve operations, are also
described and analyzed.
As previously stated, the PB was selected to represent finite field elements for the scope of the thesis; this is assumed throughout the remainder of the project. The following section describes the
data structures used in the implementation process. Most data structures were originally defined in the
Rosing code, and were not modified. Some other data structures were modified or defined because
they are required by algorithms not implemented by the Rosing code.
4.1 C Data Structures
Most of the data structures used in the implementation of the ECDSA were originally defined in the
Rosing code. A majority of the data structures were never modified from the Rosing code, some
required minor modifications, and others had not been implemented in the Rosing code, and therefore
were defined.
All of the data structures are based on the FIELD2N structure defined in the Rosing code. The
FIELD2N structure is an array of ELEMENT, or unsigned long, variables that are used to store binary
finite field elements. The size of the array depends on the finite field size. The polynomial basis is
used to represent finite field elements. Each of the m least significant bits of a FIELD2N element represents a coefficient, where m defines the finite field size, GF(2^m).
The structure DBLFIELD is used to store finite field elements that have not been reduced.
structure is also an array of unsigned longs, but can store twice as many coefficients as a FIELD2N.
The structure is temporarily used when multiplying and squaring elements.
The two structures, FIELD2N and DBLFIELD, the second being an array of twice as many
elements as a FIELD2N, are used by large integer functions. The large integers involved in the
ECDSA are the same size as the finite field elements, so the identical data structure can be used to
represent both types of data. Originally, the Rosing code used a BIGINT structure to represent large
integers. The structure and large integer algorithms are highly inefficient. The structure is an array of
unsigned longs, but only the least significant half of each array element is used to store the large
integer. Furthermore, the structure is twice the size of a large integer to avoid overflows during
multiplication.
The structure TNAF_FIELD is used to store TNAF and TNAFw representations of finite field
elements. Similar to previous structures, it consists of an array of unsigned longs. The size of the array
is larger than the reduced TNAFw representation of any finite field element. The size depends on the finite
field size, m, and the width-w used in the TNAFw representation. Each array element in a
TNAF_FIELD structure is used to store several TNAF or TNAFw coefficients. The number of
coefficients per array element depends on the number of bits required to represent each coefficient.
There are also other data structures, which were defined in the Rosing code, such as POINT,
CURVE, SIGNATURE and EC_PARAMETER. The structures consist of various combinations of the
FIELD2N, CURVE and POINT structures, which define them in the elliptic curve domain.
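For concreteness, the layout of the principal structures is approximately as follows. The constant names and exact definitions are illustrative rather than a verbatim copy of the Rosing code.

    #define WORDSIZE 32
    #define NUMBITS  163                             /* field size m */
    #define NUMWORD  ((NUMBITS + WORDSIZE - 1) / WORDSIZE)

    typedef unsigned long ELEMENT;                   /* one 32-bit word */

    typedef struct { ELEMENT e[NUMWORD];   } FIELD2N;   /* one field element  */
    typedef struct { ELEMENT e[2*NUMWORD]; } DBLFIELD;  /* unreduced product  */
    typedef struct { FIELD2N x, y;         } POINT;     /* affine curve point */
    typedef struct { FIELD2N r, s;         } SIGNATURE; /* ECDSA (r, s) pair  */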
4.2 Finite Field Operations
The following sections describe the finite field operations implemented, including methods to optimize
the operations. The terms finite field element and polynomial are used interchangeably.
4.2.1 Finite Field Addition (c = a ⊕ b)
The first finite field operation implemented is polynomial addition. Only one algorithm is used to implement the operation because of its simplicity. All of the bits of the subject polynomials, a and b, are combined using the exclusive-or operation to form c.
The function, poly_add, which implements polynomial addition, was written in both C and
assembly. The functions were thoroughly tested and integrated with the Rosing code. Refer to Table
5-4 for the performance of the routine. Often within the ECDSA code, the polynomial addition
function is not called because of its simplicity. Instead, the operation is inlined. Inlining the
operation increases the performance of the parent function because the overhead of a function call is
eliminated.
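A sketch of the C version, using structures as in §4.1, is simply a word-by-word exclusive-or loop; the argument order shown is illustrative.

    /* c = a + b in GF(2^m): exclusive-or of corresponding words. */
    void poly_add(const FIELD2N *a, const FIELD2N *b, FIELD2N *c)
    {
        int i;
        for (i = 0; i < NUMWORD; i++)
            c->e[i] = a->e[i] ^ b->e[i];
    }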
4.2.2 Finite Field Reduction (c = a mod f)
Next to finite field addition, reduction is the simplest operation. Finite field reduction is equivalent to
dividing a finite field element a, by the reduction polynomial f and returning the remainder c. The
reduction polynomial defines the finite field.
Several different algorithms accomplish finite field reduction. The simplest, which is inefficient and used by the Rosing code, is basic division using base-2 arithmetic. This algorithm requires the finite field elements f and a as inputs, and returns c. The
algorithm is inefficient in the case of finite field reduction because it does not take advantage of the fact
that f is fixed.
A superior algorithm to that stated previously is Algorithm 3-3. The algorithm takes advantage
of the fact that f is fixed and has a low hamming weight. By doing so, the algorithm is much more
efficient at performing finite field reductions, and has a reduced number of iterations compared to the
basic division algorithm. It also does not require f as an argument, which further improves its
performance.
The reduction algorithm, Algorithm 3-3, was implemented in the C programming language,
and named reduction163. It was thoroughly tested versus the reduction function defined by the Rosing
code. Once it was determined that the function implements the reduction operation correctly, the
previous reduction function was removed from the source code and all calls to it were modified to
target reduction163. The performance of the reduction function is presented in Table 4-1.
Table 4-1. Finite Field Reduction Performance
Code Description Cycle Count
FF Reduction 124
The implementation of the reduction algorithm was simple because of its form. The algorithm
defines the individual operations performed on each 32-bit word of the element, simplifying its
conversion to C code.
4.2.3 Finite Field Multiplication (c = a ⋅ b)
The polynomial multiplication function, poly_mul_win, was written to replace the Rosing code version
of the operation. The function performs polynomial multiplication using a windowing technique and a LUT. The algorithm is based on the left-to-right comb method with windows, and is presented as Algorithm 3-4.
The LUT is calculated during execution by the function poly_mul_win_LUT. Originally, as with poly_sqr, the size of the LUT and the associated window width were modifiable. This was done
so that testing could later reveal the optimum window width. After a small amount of testing and
intuition, the window width was fixed at the optimum value of 4-bits, and the function was modified to
enhance performance for the specific window width.
The function, poly_mul_win_LUT, became highly optimized after the window width was fixed
at 4-bits, improving its overall performance. The function calculates all of the 4-bit multiples of the
first polynomial involved in the multiplication, and stores the reduced results in the LUT. Shifted
versions of the polynomials in the LUT are then added to a running sum, and reduced at the end of the
function. The result of the multiplication is temporarily stored in a structure, DBLFIELD, which is
twice the finite field size. The extra space is required to store the polynomial, because it is not reduced
until the end of the multiplication operation.
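To make the windowing idea concrete, the following is a minimal C sketch of a width-4 left-to-right
comb multiplication in the style of Algorithm 3-4, assuming 6-word inputs and the reduction163
sketch above; the thesis code differs in detail (for instance, it stores the running sum in a DBLFIELD
structure and keeps its LUT entries reduced).

/* Sketch of poly_mul_win: c = a*b mod f using a 16-entry LUT of the
   4-bit multiples of a, combed over the words of b. */
void poly_mul_win(uint32_t *c, const uint32_t *a, const uint32_t *b)
{
    uint32_t B[16][6] = {{0}};     /* B[u] = u(x)*a(x), built on the fly */
    uint32_t d[12] = {0};          /* double-width running sum           */
    for (int i = 0; i < 6; i++)
        B[1][i] = a[i];
    for (int u = 2; u < 16; u += 2) {
        for (int i = 5; i > 0; i--)          /* B[u] = B[u/2] << 1 */
            B[u][i] = (B[u / 2][i] << 1) | (B[u / 2][i - 1] >> 31);
        B[u][0] = B[u / 2][0] << 1;
        for (int i = 0; i < 6; i++)          /* B[u+1] = B[u] ^ a  */
            B[u + 1][i] = B[u][i] ^ a[i];
    }
    for (int k = 7; k >= 0; k--) {           /* eight 4-bit windows per word */
        for (int j = 0; j < 6; j++) {
            uint32_t u = (b[j] >> (4 * k)) & 0xF;
            for (int i = 0; i < 6; i++)
                d[j + i] ^= B[u][i];         /* add B[u] at word offset j */
        }
        if (k) {                             /* d <<= 4 between windows   */
            for (int i = 11; i > 0; i--)
                d[i] = (d[i] << 4) | (d[i - 1] >> 28);
            d[0] <<= 4;
        }
    }
    reduction163(c, d);                      /* single reduction at the end */
}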
The polynomial multiplication function was thoroughly optimized to decrease execution time.
A function was written specifically for poly_mul_win that shifts a polynomial by a specified number of
bits, which in this case is the window width. The specialized shifting function, shift_left, was
implemented and integrated because it is more efficient than shifting the polynomial by a single bit
multiple times. The shifting function is also used when implementing the finite field squaring and
inversion operations. The following table compares a specialized multiple bit-shifting function,
shift_left_cga, with multiple calls to a single bit-shifting function, dbl_shift_left, with respect to
machine cycle and instruction counts.
The C code results were recorded using the SC100 simulator at the maximum compiler
optimization level. The pattern shown in Table 4-2 continues, with some minor discrepancies, for all
bit-shifts. The multiple bit-shifting function outperforms, or at least equals, the single bit-shifting
function for all nonzero bit-shifts, and therefore was integrated into all benefiting functions.
Table 4-2. Single and Multiple Bit-Shifting Function Comparison

Number of Shifts    Cycle Count (Instruction Count)
(bit-shifts)        Single Bit-Shifting Function    Multiple Bit-Shifting Function
1                   108 (77)                        108 (92)
2                   215 (154)                       108 (92)
3                   322 (231)                       108 (92)
4                   429 (308)                       108 (92)
5                   536 (385)                       108 (92)
Average Change      107 (77)                        0 (0)
After optimizations were applied to poly_mul_win, and the function was thoroughly verified
for correctness, the performance of the operation was recorded. The performance was measured in
terms of machine cycles, or cycle counts. The following results were obtained using the SC100
simulator, when specifying the maximum compiler optimization level. Table 4-3 presents the
performance of the implemented code when employing CGA and HWA routines. The cycle counts in
the table clearly illustrate the performance benefit achieved by partial assembly implementation.
Table 4-3. Finite Field Multiplication Performance
Code Description Cycle Count
FF Multiplication - CGA 5,595
FF Multiplication - HWA 3,475
4.2.4 Finite Field Squaring (c = a2)
The function, poly_sqr, was written to implement finite field squaring. Since τ-adic point-
multiplication was selected for implementation, a specialized function for the operation is highly
beneficial and is required for efficient computation. The Rosing code does not have a specialized
function for polynomial squaring.
The squaring algorithm originally implemented is Algorithm 3-5. The function was first
written so that the window width used with the LUT could be easily modified. Similar to polynomial
multiplication, the window width was modifiable so that the optimum bit-width could be determined
through testing.
Intuitively, the optimum window width, also referred to as the bit-width, used with the LUT
divides the register width of 32 bits. With such bit-widths, special cases where a set of bits spans two
registers are eliminated, allowing greater optimization of the code. There is a memory versus
execution time tradeoff associated with the bit-width and LUT size. Since the LUT is fixed, and not
calculated during execution, the tradeoff favors a larger LUT size than with the multiplication
operation. The LUT requires twice as much storage space each time the bit-width is increased by one,
while the execution time diminishes as the bit-width is increased. In this case, the memory versus
execution time tradeoff is affected by the fact that polynomial squaring is a very common operation
within the point-multiplication algorithm implemented. The tendency is to choose a large bit-width,
because it has a large effect on execution time at a relatively small memory price.
Several different bit-widths for the LUT were investigated using identical C source code; only
the definition of the bit-width and the LUT used were modified. Table 4-4 shows the performance of
the finite field squaring function for the tested bit-widths, along with the associated LUT sizes. For
each bit-width, the same fifty pseudorandom elements were used to obtain the cycle counts presented.
At the cost of memory, the performance of the operation can be increased by increasing the bit-
width. However, memory requirements quickly become unreasonable for larger bit-widths because
the LUT grows exponentially. The memory requirements
are important, especially when targeting portable devices.
The computational cost of the polynomial squaring function roughly doubles when CGA
routines are used instead of HWA routines. This is expected, given the superior performance of the
individual HWA routines shown in §5.4.1.
Table 4-4. Finite Field Squaring Performance Comparison

                     Bit-Width
Description          4      5      6      7      8      9      8*
Cycle Count - CGA    7,821  6,301  5,379  4,615  4,021  3,660  212
Cycle Count - HWA    3,838  3,110  2,683  2,315  2,018  1,855  212
LUT Size (bytes)     64     128    256    512    1,024  2,048  1,024
* - Source code does not include any CGA or HWA routines
The last column in the table, marked with an asterisk, shows the performance of the poly_sqr
function after the bit-width is fixed at eight. When this is done, assumptions within the routine are
possible. These assumptions are exploited to increase the performance of poly_sqr significantly. The
algorithm used is slightly different than that previously implemented, and is presented as Algorithm
4-1.
Recall that bit-widths that divide the register width are preferred because they allow the most
optimizations; this results in an optimum bit-width of 8 bits being selected. The selected bit-width
requires a LUT consisting of 256 16-bit binary values. The bit-width was selected primarily because
the execution time is too large with 4 bits and the memory requirements are too large with 16 bits.
When comparing bit-widths of four and eight, Table 4-4 clearly shows eight outperforming four by a
factor of almost two. The table also shows that a bit-width of eight requires a reasonable amount of
memory.
The memory footprint of a bit-width of eight is considered reasonable relative to the total memory
requirement of the signature generation and/or verification source code, which is presented in §5.5.
The next window width that divides the register width is 16 bits. The performance of this bit-
width is assumed to be superior to that of eight, most likely by a factor of two, which is the
improvement recorded when doubling the window width from four to eight. However, the memory
requirements of such a bit-width, 262,144 bytes, are unreasonable compared to the total memory
requirements, especially for portable devices where memory is an expensive resource and must be
conserved.
The bit-width was fixed at 8 bits, and the function was modified so that it follows a slightly
different algorithm, presented as Algorithm 4-1. Table 4-4 shows that the execution time of the
squaring operation improves drastically when the window width is fixed. Versatile, easily modifiable
code is far less efficient than code with fixed parameters: certain assumptions can be made when
parameters are fixed, leading to greater optimizations and significant execution time improvements.
The assumptions made in the squaring implementation significantly change both the performance of
the operation and the algorithm itself. The algorithm allows
the use of greater parallelism, primarily because the window width divides the register width. In the
algorithm below, the assumption that 32-bit registers are used to store finite field elements a and c is
made. Therefore, c[i] refers to the ith 32-bit register.
Algorithm 4-1. Improved Finite Field Squaring (c = a2)
Input: a = (a[5], a[4], a[3], a[2], a[1], a[0])
Output: c = (c[5], c[4], c[3], c[2], c[1], c[0])
1. For v = (v7, v6, v5, v4, v3, v2, v1, v0) = 0 to 255 do
1.1. T(v) = (0, v7, 0, v6, 0, v5, 0, v4, 0, v3, 0, v2, 0, v1, 0, v0)
2. c = 0
3. For i = 0 to 4 do
3.1. c[2⋅i] = (T((a[i] >> 8) & 0xFF) << 16) ⊕ T(a[i] & 0xFF)
3.2. c[2⋅i + 1] = (T((a[i] >> 24) & 0xFF) << 16) ⊕ T((a[i] >> 16) & 0xFF)
4. c[10] = T(a[5] & 0xFF)
5. c = c mod f
6. Return(c[5], c[4], c[3], c[2], c[1], c[0])
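A minimal C rendering of Algorithm 4-1 follows; the reduction163 call is the one sketched in §4.2.2,
and the word layout and names are assumptions rather than the thesis source.

static uint16_t T[256];            /* 8-bit -> 16-bit "spread" LUT */

/* Build T once: bit i of the index moves to bit 2i (zeros interleaved). */
static void init_sqr_table(void)
{
    for (int v = 0; v < 256; v++) {
        uint16_t s = 0;
        for (int i = 0; i < 8; i++)
            if (v & (1 << i))
                s |= (uint16_t)(1 << (2 * i));
        T[v] = s;
    }
}

/* Sketch of poly_sqr: each 32-bit word of a expands into two words of
   the double-width result, which is reduced once at the end. */
void poly_sqr(uint32_t *c, const uint32_t *a)
{
    uint32_t d[11];
    for (int i = 0; i < 5; i++) {
        d[2 * i]     = ((uint32_t)T[(a[i] >> 8) & 0xFF] << 16) | T[a[i] & 0xFF];
        d[2 * i + 1] = ((uint32_t)T[a[i] >> 24] << 16) | T[(a[i] >> 16) & 0xFF];
    }
    d[10] = T[a[5] & 0xFF];        /* a[5] holds only bits 160..162 */
    reduction163(c, d);
}

The thesis version uses a fixed, precomputed table rather than building it at startup, consistent with the
description above.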
4.2.5 Finite Field Inversion (c = a-1 mod f)
The last and most expensive finite field operation described is polynomial inversion. The function,
poly_inv_eff, was written to perform the operation. The implemented algorithm is based on Algorithm
3-6. Performance-increasing modifications were made to the algorithm by exploiting the functionality
of the SC140. The modified algorithm that was implemented is listed as Algorithm 4-2. The following
paragraphs describe functions that were called by poly_inv_eff and later inlined to improve
performance, as well as the performance of the implementation.
Two functions were written to calculate and return the degree of a polynomial. The functions,
MSB_degree and MSB_degree1, were both initially written in C, and then in assembly. The main
difference between the two functions is that MSB_degree1 requires the previous degree of the
polynomial, making it slightly faster. The assembly versions of the degree-computing functions use the
CLB instruction, which requires a single clock cycle to find the leading nonzero bit. This is much more
efficient than the C functions, which check each bit individually and therefore require several clock
cycles to perform the identical task.
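As a sketch of what the degree functions compute, assuming a 6-word element; the GCC-style
__builtin_clz below merely stands in for the role the single-cycle CLB instruction plays in the
assembly version:

/* Degree of a 6-word polynomial, -1 for the zero polynomial. */
int MSB_degree(const uint32_t *r)
{
    for (int i = 5; i >= 0; i--)
        if (r[i])
            return 32 * i + 31 - __builtin_clz(r[i]);
    return -1;
}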
Algorithm 4-2. Improved Finite Field Inversion (c = a-1 mod f)
Input: a = (am-1, am-2, …, a0)
Output: b = (bm-1, bm-2, …, b0)
1. r0 = f, r1 = a, b = 0, u1 = 1
2. deg_r0 = m, deg_r1 = degree(r1), deg10 = deg_r0 – deg_r1
3. Do
3.1. If (deg10 = 0)
3.1.1. r0 = r0 ⊕ r1, b = b ⊕ u1
3.1.2. deg_r0 = degree(r0)
3.1.3. deg10 = deg_r0 – deg_r1
3.1.4. If (deg_r0 = -1) b = u1, Return (b)
3.2. If (deg10 > 0)
3.2.1. r0 = r0 ⊕ (r1 << deg10), b = b ⊕ (u1 << deg10)
3.2.2. deg_r0 = degree(r0)
3.2.3. deg10 = deg_r0 – deg_r1
3.2.4. If (deg_r0 = -1) b = u1, Return (b)
3.3. If (deg10 < 0)
3.3.1. r1 = r1 ⊕ (r0 << -deg10), u1 = u1 ⊕ (b << -deg10)
3.3.2. deg_r1 = degree(r1)
3.3.3. deg10 = deg_r0 – deg_r1
3.3.4. If (deg_r1 = -1) Return (b)
4. While (true)
A function was written specifically for the inversion operation that adds a shifted version of
one polynomial to another. This function, shift_and_add, requires two input polynomials and an integer
that defines the number of bits by which to shift the second polynomial. The first polynomial is
overwritten with the result of the polynomial addition. This function is written in both C and assembly.
In a further attempt to improve the performance of the inversion operation, the distribution of
bit-shifts involved in the shift and add was investigated. Forty inversions were performed on
pseudorandom polynomials to determine the approximate distribution of bit-shifts. The exploitation of
the distribution leads to a performance improvement. Table 4-5 shows the distribution results obtained.
Table 4-5 shows that occurrences of bit-shifts decrease by approximately fifty percent per
additional bit beyond one. Therefore, it is beneficial to have a shift and add function that favors small
bit-shifts. In addition, since shifting by zero bits is common, it is likely beneficial to check for these
occurrences and process them differently.
Table 4-5. Finite Field Inversion Bit-Shift Distribution

Bit-Shift (bits)          0      1      2      3     4     5     6     7     8     9     10    11    12
Number of Occurrences     1,586  2,427  1,138  623   320   152   78    39    28    12    5     1     2
Percentage of Total (%)   24.74  37.86  17.75  9.72  4.99  2.37  1.22  0.61  0.44  0.19  0.08  0.02  0.03
The performance of different bit-shifting options was investigated. The shift and add code was
modified so that the shift_and_add function is only called for nonzero bit-shifts; for zero-bit shifts, the
polynomial addition function is used. Given the distribution of bit-shifts, the performance of the
modified code is superior to that of the previous version.
A second version of the shift_and_add function, shift_and_add2, was written to further
improve efficiency. This version exploits the observation that there are always two pairs of
polynomials requiring a shift and add in the inversion algorithm, and that the bit-shifts in each case are
always equal. The function, shift_and_add2, performs the two shift and add operations in parallel,
which is slightly more efficient than two consecutive calls to the shift_and_add function.
A similar premise was used with the implementation of poly_add2. The two calls to
polynomial addition are combined so the operations are computed in parallel. Again, the resulting code
is slightly more efficient than performing two consecutive polynomial additions. C and assembly
versions of poly_add2 and shift_and_add2 were written, tested and integrated.
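A minimal sketch of the shift and add idea, assuming 6-word polynomials and a nonzero bit-shift
s < 32 (as described above, zero-bit shifts are routed to plain polynomial addition, and the observed
shifts in Table 4-5 were at most 12 bits):

/* Sketch of shift_and_add: c ^= (b << s) for 1 <= s <= 31. */
void shift_and_add(uint32_t *c, const uint32_t *b, unsigned s)
{
    uint32_t carry = 0;
    for (int i = 0; i < 6; i++) {
        c[i] ^= (b[i] << s) | carry;
        carry = b[i] >> (32 - s);   /* bits shifted into the next word */
    }
}

shift_and_add2 fuses two such updates (r0 ^= r1 << s and b ^= u1 << s share the same s in
Algorithm 4-2) into one loop body, giving the SC140 more independent work to schedule in parallel.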
The implementation of Algorithm 4-2 led to the impressive performance figures in Table 4-6.
The most significant difference between Algorithm 4-2 and Algorithm 3-6, besides fixing the number
of polynomials employed, is eliminating the need to exchange polynomials. After the algorithm is
modified to only require the current four polynomial values, r0, r1, b and u1, an exchange of the
polynomials is occasionally required so that the degree of r0 remains larger than that of r1. To
eliminate the exchange, and improve the performance of the operation, code was added that allows
either polynomial to have the larger degree. This is clearly illustrated in Algorithm 4-2.
As expected, the performance of the optimal implementation of the inversion operation, shown
in Table 4-6, is comparable to the performance of the optimal implementation of approximately five
finite field multiplications. Therefore, the selection of the affine coordinate system instead of
projective coordinates is correct and beneficial. For this implementation, a projective coordinate
system is not beneficial with respect to computational performance due to the ratio of the
computational costs of the finite field multiplication and inversion operations.
Table 4-6. Finite Field Inversion Performance

                           Cycle Count
Code Description      Minimum    Average    Maximum
FF Inversion - CGA    15,950     17,590     18,730
FF Inversion - HWA    14,390     16,730     18,290
4.3 Large Integer Operations
The focus of the thesis is to implement and optimize efficient finite field and elliptic curve operations.
During the testing and simulation process, however, the performance of the large integer operations
was unacceptable: progress was slowed considerably, as was the performance of the signature
generation and verification processes. Therefore, the large integer operations defined by the Rosing
code were improved upon. Algorithms to implement these operations were not thoroughly researched,
so it is assumed that further improvements to the implemented operations are possible. The purpose of
implementing these operations, which are listed in the following sections, is to raise the performance
of the large integer operations, and of the signature generation and verification processes, to more
acceptable levels.
Only large integer addition and subtraction were implemented in both C and assembly. Large
integer inversion and division operations were written in C, and employ both C and assembly addition
and subtraction routines. A partial assembly implementation of large integer multiplication was
integrated. The multiplication operation is from an external source.
The cycle counts for the more complicated large integer operations are not provided because
the implementations were not optimized and they do not significantly affect the overall performance.
The large integer multiplication function was not implemented by the writer, and it is not commonly
called within the ECDSA. The performance of the large integer inversion and division functions is
input dependent, and these functions are not commonly called either.
4.3.1 Large Integer Addition and Subtraction (c = a + b; c = a - b)
Large integer addition and subtraction are the simplest large integer operations implemented. The two
operations are implemented separate from each other, in both C and assembly. They were thoroughly
tested before being integrated with the Rosing code.
The implementations of large integer addition and subtraction are great improvements over the
Rosing code. The Rosing code uses an inefficient algorithm for addition, and subtraction is
implemented as an extension of addition with the aid of a negate function. The operations are
inefficient because they are based on the BIGINT structure, which only uses half of each element for
data storage. The only plausible reason for the inefficiency of the BIGINT structure is to simplify the
large integer multiplication operation.
The implementations of addition and subtraction, add_int and sub_int respectively, do not use
the BIGINT structure. Unlike BIGINT, all of the bits within the data structure, FIELD2N, are used to
store the large integer, and a different structure is used to store the result of multiplications. Thus, the
data structure is a quarter of the size, and the addition and subtraction operations are performed more
efficiently.
The implementation of the two operations is almost identical. Both assembly and C versions of
the operations use the same basic algorithm, except when computing carries and borrows. The C code
computes the carry and borrow bits with boolean expressions, whereas the assembly code exploits the
carry bit in the status register. Therefore, the assembly version of the operations is more efficient. The
performance of both large integer operations is presented in Table 5-4.
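A minimal sketch of the C carry handling described above, assuming 6-word little-endian integers; the
assembly version replaces the boolean expression with the status-register carry bit:

/* Sketch of add_int: c = a + b over 6-word unsigned integers. */
void add_int(uint32_t *c, const uint32_t *a, const uint32_t *b)
{
    uint32_t carry = 0;
    for (int i = 0; i < 6; i++) {
        uint32_t s = a[i] + b[i] + carry;
        /* carry out iff the 32-bit sum wrapped around */
        carry = (s < a[i]) || (s == a[i] && carry);
        c[i] = s;
    }
}

Subtraction (sub_int) is analogous, with a borrow flag taking the place of the carry.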
4.3.2 Large Integer Multiplication (c = a ⋅ b)
Only an assembly language version of the large integer multiplication operation, int_mult, was
integrated with the existing code. The assembly multiplication function was previously written for a
different application and is only called when both C and assembly source code are compiled [13].
When only C source is compiled, the original large integer multiplication from the Rosing code, which
uses the BIGINT structure, is called.
The large integer multiplication function, which is written in assembly language, was actually
written to multiply 192-bit positive integers, producing a 384-bit result. The two sizes of integers
correspond to the implemented structures FIELD2N and DBL_FIELD2N respectively. The function
can be used in the scope of the thesis because the finite fields, and therefore the large integers, are
163 bits in size.
However, both large integers must be positive for int_mult. Therefore, a wrapper function is
required that ensures the inputs to the function are positive and restores the correct sign after the
multiplication is complete. The wrapper function is named field_mult_wrapper.
Within this implementation of the ECDSA, the result of a large integer multiplication is
guaranteed to require fewer than 163 bits. Therefore, no reduction is necessary after a large integer
multiplication, allowing the result to be copied straight to a FIELD2N, ignoring the most significant
bits of the large integer.
The integration process for the multiplication operation was simple because the function being
integrated had been previously tested. However, a problem was encountered with the compiler. The
compiler produces incorrect assembly when the highest level of optimizations is selected. This
problem is thoroughly examined in §6.3.2.
4.3.3 Large Integer Division (c = a / b)
The two most inefficient large integer operations implemented by the Rosing code, large integer
inversion and division, were replaced with much more efficient functions. The new functions are based
on better algorithms and do not use the BIGINT structure.
The large integer division function, large_int_div, is based on the long division algorithm. The
implementation of the algorithm is straightforward, exploiting previously written functions such as
shift_left, sub_int and MSB_degree. The long division algorithm implemented subtracts the
denominator multiplied by powers of two from the numerator. The process is complete once the
numerator is less than the denominator.
Only the performance of the finite field and elliptic curve operations was thoroughly
investigated, so the large integer division function can likely be improved. A superior algorithm may
also exist.
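For illustration, a self-contained C sketch of the shift-and-subtract long division just described,
assuming 6-word unsigned integers and b not equal to zero; names and helpers here are assumptions,
not the thesis source (the real code reuses shift_left, sub_int and MSB_degree).

#include <stdint.h>
#include <string.h>

#define NW 6   /* 32-bit words per large integer, matching FIELD2N */

/* Index of the highest set bit, -1 for zero (what MSB_degree returns). */
static int degree(const uint32_t *x)
{
    for (int i = NW - 1; i >= 0; i--)
        if (x[i])
            return 32 * i + 31 - __builtin_clz(x[i]);
    return -1;
}

static void shl(uint32_t *dst, const uint32_t *src, int k)  /* dst = src << k */
{
    int w = k / 32, s = k % 32;
    for (int i = NW - 1; i >= 0; i--) {
        uint32_t hi = (i >= w) ? src[i - w] : 0;
        uint32_t lo = (i > w) ? src[i - w - 1] : 0;
        dst[i] = s ? (hi << s) | (lo >> (32 - s)) : hi;
    }
}

static int geq(const uint32_t *x, const uint32_t *y)        /* x >= y */
{
    for (int i = NW - 1; i >= 0; i--)
        if (x[i] != y[i])
            return x[i] > y[i];
    return 1;
}

static void sub(uint32_t *x, const uint32_t *y)             /* x -= y */
{
    uint64_t borrow = 0;
    for (int i = 0; i < NW; i++) {
        uint64_t d = (uint64_t)x[i] - y[i] - borrow;
        x[i] = (uint32_t)d;
        borrow = (d >> 32) & 1;
    }
}

/* q = a / b, r = a mod b: subtract b*2^k from the remainder while it fits. */
void large_int_div(uint32_t *q, uint32_t *r, const uint32_t *a, const uint32_t *b)
{
    uint32_t t[NW];
    memset(q, 0, sizeof t);
    memcpy(r, a, sizeof t);
    for (int k = degree(r) - degree(b); k >= 0; k--) {
        shl(t, b, k);
        if (geq(r, t)) {
            sub(r, t);
            q[k / 32] |= 1u << (k % 32);
        }
    }
}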
4.3.4 Large Integer Inversion (c = a-1 mod f)
The function, large_int_inv, was written to implement large integer inversion. The function inverts a
large integer with respect to an input prime large integer. The Rosing code’s large integer inversion
function is extremely inefficient. It utilizes the already inefficient large integer division function and
BIGINT structure.
The improved large integer inversion function utilizes a more efficient structure, FIELD2N. It
is based on the previously implemented finite field inversion routine, which simplified its realization.
The two functions are almost identical. The only difference is that the large integer inversion function
uses large integer addition instead of polynomial addition. No substantial problems were encountered
during the implementation of this operation because it was based so closely on the previously
implemented and thoroughly tested finite field inversion function.
The performance of the large integer inversion function is assumed to be near optimal, with
only minimal improvements possible. This is assumed because it is based on the already thoroughly
optimized finite field inversion function.
4.4 Elliptic Curve Operations
After finite field and large integer operations, elliptic curve operations were implemented. The
following sections list the operations and implemented functions associated with the elliptic curve
operations. The performance of the implementations is presented, and then compared to previously
published results in §5.1.2.
Following preliminary research, it was decided to use a Koblitz curve for the implementation of
the ECDSA. Koblitz curves allow the use of several specialized algorithms for point-multiplication,
which are computationally beneficial. In particular, Koblitz curves allow the use of the TNAF point-
multiplication method.
Researching elliptic curve multiplication algorithms led to the conclusion that the TNAF and
TNAFw methods of point-multiplication, proposed by Solinas in [61], are the most efficient techniques
of performing elliptic curve point-multiplication. Results presented in [19] had a significant impact on
the decision.
To implement the TNAF method for point-multiplication outlined by Solinas, several functions
are required. The existing elliptic curve addition and subtraction functions were kept because no
significant improvements can be made to them. The functions that were implemented are listed and
described in the following sections.
4.4.1 TNAF Conversion (k → kTNAF)
The first function implemented, get_TNAF_rep, calculates the TNAF representation of a
polynomial and writes it to a TNAF_FIELD structure. The function is based on Algorithm 3-8. The
implementation of the algorithm is straightforward, given that functions performing large integer
manipulation are available.
There are three possible TNAF coefficients. Therefore, two bits are required to represent all
possibilities. The binary values that represent the TNAF coefficients -1, 0 and 1 are 11, 00 and 01
respectively. These values are arbitrary, provided they are used consistently within the conversion and
point-multiplication functions. After calculating coefficient values using large integer operations and
calculating remainders, each coefficient must be shifted by the correct number of bits so that it is
written to the proper location. This requires careful programming.
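A sketch of the two-bit packing, with the exact layout assumed (sixteen coefficients per 32-bit word,
coefficient i at bit offset 2(i mod 16)):

#define TNAF_NEG1 0x3u   /* binary 11 */
#define TNAF_ZERO 0x0u   /* binary 00 */
#define TNAF_POS1 0x1u   /* binary 01 */

/* Write coefficient i of a packed TNAF representation. */
static void put_coeff(uint32_t *t, int i, uint32_t d)
{
    int w = i / 16, s = 2 * (i % 16);
    t[w] = (t[w] & ~(0x3u << s)) | (d << s);
}

/* Read coefficient i back out. */
static uint32_t get_coeff(const uint32_t *t, int i)
{
    return (t[i / 16] >> (2 * (i % 16))) & 0x3u;
}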
The function, get_TNAF_rep, was difficult to test and verify. Initially, the only verification
done was with small binary values whose representation can be hand calculated. After the completion
of the point-multiplication function, which is described in a following section, the conversion function
was thoroughly tested and verified with larger binary values.
4.4.2 Partial Reduction - Partmod δ (k′ = k partmod δ)
According to Solinas [61], to fully take advantage of the TNAF and TNAFw techniques of point-
multiplication, the input must be reduced so that the representation decreases in size. The
reduction decreases the Hamming weight of the representation, thereby decreasing the number of loop
iterations and the execution time of the point-multiplication.
Solinas states that it is possible to guarantee the minimal representation, but it is too expensive
to achieve in practice. Instead, a partial reduction, partmod δ, is defined that guarantees a minimal
TNAF or TNAFw representation with a certain probability. The probability is related to C, which is a
modifiable parameter. It is up to the implementer to define the value of C, and therefore the probability
that the TNAF and TNAFw representation is minimal. There is a tradeoff associated with the value of
C. As C increases, the probability of the TNAF or TNAFw representation being minimal increases, but
so does the execution time of the partial reduction.
For implementation, a C value of sixteen was selected for two reasons. First, according to
Formula 3.10, the probability that the TNAF or TNAFw representation is minimal is then over 99.95%,
so the minimal representation is almost guaranteed. The second reason for choosing a C value of
sixteen is the register width. Due to the nature of the algorithm, the cost of the algorithm increases
every time C grows past a multiple of sixteen. C values of sixteen and less have equal computational
costs, which are less than the computational costs for C values of seventeen and above. Since the
minimal representation is guaranteed with such a high probability, it was determined that the
processing cost of increasing C beyond sixteen outweighs the benefits.
The function, TNAF_partmod_delta, implements the partial reduction presented as Algorithm
3-9. The function calculates the value of two polynomials, which are then used to compute the TNAF
or TNAFw representation of the polynomial. Two other functions were defined for the partial
reduction process. The two functions, div2_ceil and multiple_div2, are described in the following
paragraphs.
The function, multiple_div2, divides a large integer by a power of two, always rounding down.
Effectively, the function is the opposite of shift_left. It shifts a large integer to the right by the
specified number of bits. The operation, which is very similar to shift_left, is written in both C and
assembly source code.
The function, div2_ceil, which is usually used in conjunction with a multiple_div2 call, divides
a large integer by two, always rounding up. The operation is implemented in both C and assembly, and
is achieved by first adding one to the least significant word, and then adding the propagating carry bit
while performing a single bit shift.
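Minimal sketches of the two helpers, assuming 6-word little-endian integers and, for multiple_div2, a
shift amount below 32:

/* multiple_div2: c = a >> s (floor division by 2^s), 0 <= s <= 31. */
void multiple_div2(uint32_t *c, const uint32_t *a, unsigned s)
{
    uint32_t carry = 0;                /* bits falling in from above */
    for (int i = 5; i >= 0; i--) {
        uint32_t v = a[i];
        c[i] = (v >> s) | carry;
        carry = s ? v << (32 - s) : 0;
    }
}

/* div2_ceil: c = ceil(a / 2), i.e. add one, then shift right one bit. */
void div2_ceil(uint32_t *c, const uint32_t *a)
{
    uint32_t t[6];
    uint64_t sum, carry = 1;           /* the "+1" before halving */
    for (int i = 0; i < 6; i++) {
        sum = (uint64_t)a[i] + carry;
        t[i] = (uint32_t)sum;
        carry = sum >> 32;
    }
    multiple_div2(c, t, 1);
}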
The implementation of Algorithm 3-9 was straightforward after support functions were
available that encapsulate the required operations. The implementation was tested after the point-
multiplication of the following section was written. No implementation errors were found during the
testing of the reduction function.
4.4.3 TNAF Point-Multiplication (Q = kTNAF ⋅ P)
The next function implemented performs the elliptic curve point-multiplication, using the TNAF
representation of k. The function, TNAF_point_mul, is based on Algorithm 3-10. The TNAF point-
multiplication algorithm is extremely similar to the point-multiplication algorithm in the Rosing code,
which implements the NAF point-multiplication technique. The difference between the two techniques
is that k is in a TNAF representation, which replaces the point doubling operation with two polynomial
squaring operations (an application of the Frobenius map τ).
The point-multiplication function follows Algorithm 3-10 very closely. First, the resultant
point Q is set to the point at infinity. Then, if k is zero, the function terminates, returning the point at
infinity as the result of the operation.
Otherwise, variables are set to select the most significant nonzero coefficient of k, and the main
loop of the algorithm, in which each nonzero coefficient results in a point addition or subtraction, is
entered. In addition to the possible point addition or subtraction, two polynomial squaring operations
(applied to the coordinates of Q) and the manipulation of coefficient-selecting variables are executed
within the main loop. The loop iterates through each coefficient, from most significant to least
significant, before the point-multiplication is complete.
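The shape of the main loop is sketched below; Q, P and the point routines are assumed helpers, and
get_coeff is the two-bit accessor sketched in §4.4.1. On a Koblitz curve the doubling of a binary
method is replaced by the Frobenius map τ(x, y) = (x^2, y^2), i.e. two polynomial squarings.

/* Sketch of the TNAF_point_mul main loop (msd = position of the most
   significant nonzero coefficient). */
for (int i = msd; i >= 0; i--) {
    poly_sqr(Q.x, Q.x);                 /* Q = tau(Q) */
    poly_sqr(Q.y, Q.y);
    uint32_t d = get_coeff(k_tnaf, i);  /* 01, 00 or 11 */
    if (d == TNAF_POS1)
        point_add(&Q, &Q, &P);
    else if (d == TNAF_NEG1)
        point_sub(&Q, &Q, &P);
}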
The implementation of the TNAF point-multiplication was simplified by following the basic
structure of the Rosing code point-multiplication function. The only difficulty was encountered with
coefficient values. The Rosing code uses a large array to store the NAF representation of k, where each
array element contains a single coefficient. To reduce memory requirements, several TNAF
coefficients are stored in each array element of a TNAF_FIELD structure. Bit-shifting and bit masks
are required to select individual TNAF coefficients. Several definitions are used to ensure consistency
and avoid errors when referencing the coefficients.
Table 4-7. TNAF Point-Multiplication Performance

Code Description             CGA Cycle Count    HWA Cycle Count
Partmod δ Reduction          8,700              4,800
TNAF Point-Multiplication    2,033,000          1,670,000
Table 4-7 shows the performance of the TNAF point-multiplication function implemented.
The individual HWA routines are shown to outperform their CGA counterparts in §5.4.1. Therefore, as
expected, the point-multiplication employing HWA routines outperforms the CGA alternative. The
performance of the TNAF_partmod_delta function is also presented in the table. The cycle counts of
the partial reduction function are insignificant compared to those of the point-multiplication.
4.4.4 TNAFw Conversion (k → kTNAFw)
The TNAF conversion function was modified to calculate the TNAFw representation of a polynomial.
The algorithm used for this function is Algorithm 3-11. The algorithms used to calculate the TNAF
and TNAFw representation of polynomials are very similar, which simplified the implementation of
get_TNAFw_rep. There are three main modifications made to the TNAF conversion function so that it
calculates the TNAFw representation of finite field elements. The modifications are described in the
following paragraphs.
First, there are more coefficients in the TNAFw representation, so the statement that calculates
the value of the coefficients is modified; the equation used to calculate each coefficient is much more
complicated. Also, associated with the calculation of coefficient values, definitions of variables that
are width-w specific were added to the source code so that changes to the width-w value are
simplified. Consistency throughout the TNAFw conversion and point-multiplication functions is
achieved by employing several definitions in the source code. The definitions also allow the width-w
to be changed easily without modifying the actual function.
Second, a coefficient mapping to a bit representation is required. The possible coefficients
belong to the set {-(2^(w-1) - 1), -(2^(w-1) - 3), ..., -1, 0, 1, ..., 2^(w-1) - 3, 2^(w-1) - 1}, where w is
the width-w value. There are exactly 2^(w-1) + 1 coefficients. Therefore, w bits are required to fully
represent each possibility.
The goal of the coefficient mapping is to maintain simplicity, and to allow employment of a
general algorithm, which efficiently maps the coefficients to their binary representations. Two other
mapping stipulations are that it must not be width-w dependent, and it must be easily reversible. A
coefficient representation that is easily mapped to the index used when addressing the LUT in the
TNAFw point-multiplication is also beneficial. The mapping implemented simplifies the addressing of
the LUT. Algorithm 4-3 defines the mapping that converts integer coefficients to their binary
representation. In the algorithm, coeff is the integer representation of the coefficient, w is the width-w
value, and brep is the binary representation of the coefficient.
With two exceptions, the index used to address the LUT in the TNAFw point-multiplication
is the binary representation of the TNAFw coefficient. First, 0 is represented by all bits being set.
Second, the sign bit of the binary representation must be masked off to determine the index used to
address the LUT in the point-multiplication algorithm. Thus, because of the chosen TNAFw coefficient
representations, the determination of the LUT index is greatly simplified.
Algorithm 4-3. Integer Coefficient to Binary Representation Conversion
Input: coeff, w
Output: brep
1. If (coeff > 0) then
1.1. brep = coeff >> 1
2. Else if (coeff < 0) then
2.1. brep = ( | coeff | ) >> 1
2.2. brep = brep + 2^(w-1) (sets the sign bit)
3. Else brep = 2^w - 1
4. Return (brep)
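A hypothetical C rendering of the mapping and its use for LUT indexing (the names are illustrative,
not the thesis source):

/* Algorithm 4-3: odd digits map to |coeff| >> 1 with the sign in bit
   w-1; the zero digit is encoded as all w bits set. */
unsigned tnafw_encode(int coeff, unsigned w)
{
    if (coeff > 0)
        return (unsigned)coeff >> 1;
    if (coeff < 0)
        return ((unsigned)(-coeff) >> 1) | (1u << (w - 1));
    return (1u << w) - 1;
}

/* Recovering the LUT index only requires masking off the sign bit. */
unsigned tnafw_index(unsigned brep, unsigned w)
{
    return brep & ((1u << (w - 1)) - 1);
}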
The last modification to the TNAF conversion function is required because of a problem that
arises from the nature of the mapping. Normally, all the bits in the structure are initially cleared, zero
coefficients are represented by cleared bits, and the most significant coefficient is easily
distinguishable. Here, however, a set of zero bits does not represent a zero coefficient; it represents a
coefficient of 1, making the most significant coefficient of a TNAFw representation difficult to find.
To combat this problem, at the end of the conversion process, a single zero coefficient is placed in the
position after the most significant coefficient. The added zero coefficient marks the most significant
coefficient in the representation. Therefore, in the point-multiplication algorithm, the most significant
coefficient is easily found by searching for the extra zero coefficient, starting at the most significant
end of the TNAFw representation.
The following section describes the implementation of the TNAFw point-multiplication. It
also includes performance values of the TNAFw conversion for different width-w values.
4.4.5 TNAFw Point-Multiplication (Q = kTNAFw ⋅ P)
The implementation of the TNAFw point-multiplication function is also based on the point-
multiplication function from the Rosing code. Modifications were made so that the function utilizes a
LUT, and so that a TNAFw representation is used to define the polynomial k.
The function, TNAF_precompute, was written and integrated into the TNAFw point-
multiplication function, TNAFw_point_mul. TNAF_precompute calculates the LUT used in the
multiplication process. Equations for the points in the LUT were developed using formulas in [61]. A
different set of equations was developed for each width-w. The width-w values implemented and
tested are four, five and six.
Similar to previous functions, the width-w used in the TNAFw point-multiplication can be
modified, but in this case, modifications are more difficult. The equations that define the points used in
the LUT are width-w specific. Therefore, general code that correctly computes the LUT for all width-
w values cannot be written. The function TNAF_precompute must be modified entirely in addition to
changing definitions related to the TNAFw representation when the width-w value is changed. In fact,
three different versions of the pre-computing function were written. Each function calculates the LUT
for a specific width-w value. During the testing and analysis phase, the proper pre-computing function
was compiled with the source code so the multiplication operation executed correctly.
After implementing the pre-computing function, Algorithm 3-12 was implemented to perform
the point-multiplication. The function is similar to the TNAF point-multiplication function, besides
employing a LUT. Minor modifications were made to the previous point-multiplication function,
which result in a working implementation of the TNAFw point-multiplication algorithm.
Table 4-8 summarizes the performance results for the three width-w values tested. The cycle
counts required for converting from binary integer to TNAFw, pre-computing LUT values and
executing the entire TNAFw point-multiplication are all included for a width-w of five. All values are
from the signature generation process. TNAF_precompute is called within TNAFw_point_mul,
whereas get_TNAFw_rep is not; the point-multiplication cycle counts in the table therefore include the
cost of pre-computing the LUT. As shown in the table, the optimum width-w value is five: the
surrounding width-w values result in greater point-multiplication execution times. A width-w of five is
assumed for all presented results in the remainder of the thesis unless otherwise stated. The conversion
and pre-computing costs for width-w values of four and six were not recorded because those widths
are non-optimal.
Table 4-8. TNAFw Point-Multiplication Performance Comparison

Cycle Counts (from signature generation process)

Width-w   get_TNAFw_rep        TNAF_precompute       TNAFw_point_mul
(w)       CGA       HWA        CGA        HWA        CGA          HWA
4         -         -          -          -          2,624,000    2,140,000
5         77,570    27,030     368,800    406,700    1,494,000    1,193,000
6         -         -          -          -          2,014,000    1,660,000
In many implementations, the base point is fixed, resulting in identical pre-computed LUTs. In
these cases, the LUT does not have to be recalculated for every point-multiplication. Instead, the LUT
can be stored permanently in memory, reducing the cost of the point-multiplication operation. When
the LUT is permanently stored in memory, a larger width-w becomes beneficial with respect to
performance, assuming the memory penalty is acceptable; a larger width-w may be optimal, as implied
in [19]. A memory versus execution time tradeoff is created when the base point is fixed: the
execution time of the point-multiplication operation can be reduced at the expense of memory space.
Referring to the results in Table 4-8, a width-w of six then likely becomes superior to five. The trend
continues as the width-w is increased, but the memory required to store the LUT quickly becomes
unreasonable. A maximum width-w value of six is recommended due to the memory requirements of
the LUT. The memory required by the LUT for this implementation of TNAFw point-multiplication is
described by Formula 4.1.
LUT Memory Requirement = 384⋅(2^(w-1) - 1) bytes (4.1)
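Working directly from Formula 4.1, a width-w of five gives 384⋅(2^4 - 1) = 5,760 bytes, and a
width-w of six gives 384⋅(2^5 - 1) = 11,904 bytes.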
The TNAFw point-multiplication function is used in the signature verification process. An
algorithm presented in [19] was implemented to enhance verification performance, but the
implementation was found not to be beneficial. The implementation and performance of the algorithm
are discussed in the following section. As explained there, the TNAF point-multiplication algorithm,
and not the TNAFw algorithm, is utilized in the simultaneous multiple point-multiplication, and the
rationale for this choice is stated.
4.4.6 Simultaneous Multiple Point-Multiplication (R = k ⋅ P + l ⋅ Q)
Also known as Shamir's trick, simultaneous multiple point-multiplication can reduce the execution
time of the signature verification process in the ECDSA [19]. The principle is based on a windowing
technique similar to the one used in the finite field multiplication operation. The trick is to compute
the two point-multiplications simultaneously by employing a two-dimensional LUT.
The implementation of simultaneous multiple point-multiplication, sim_multiple_pnt_mul, is
based on Algorithm 3-13. The first step in the implementation process was to write code to compute
the LUT in an efficient manner. The LUT construction technique employed minimizes the number of
point-multiplication operations required, since point-multiplications are the most expensive operations
involved. Constructing the LUT using point additions whenever possible therefore minimizes the
execution time.
Most of the pre-computed LUT does not require point-multiplications. The first row and
column are computed first because they depend on only one of the points and can then be used to
construct the remainder of the LUT. Within the first row and column, entries whose indices are powers
of two are computed using the TNAF point-multiplication function, and all entries with other indices
are computed by adding two previously computed LUT points. All remaining points in the LUT are
then computed, using a simple for loop, by adding points from the first row and column.
After the LUT construction was implemented, the primary loop of the simultaneous multiple
point-multiplication was written. The loop uses pre-computed row and column indices to address the
LUT. It also uses a pre-computed TNAF representation of 2^w, where w is the window width, when
performing the required point-multiplication operation.
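For intuition, the classic one-coefficient-per-iteration form of Shamir's trick is sketched below in the
style of the earlier algorithms; the thesis variant instead processes window-width groups of
coefficients, replacing the doubling with a point-multiplication by the pre-computed TNAF
representation of 2^w.

Sketch. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q), bit-at-a-time
1. R = point at infinity
2. For i = t-1 down to 0 do
2.1. R = 2⋅R
2.2. If (ki, li) ≠ (0, 0) then R = R + LUT[ki][li]
3. Return (R)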
Throughout sim_multiple_pnt_mul, TNAF point-multiplication is employed because of the
nature of the finite field elements involved in the multiplication. The elements always have small
degrees. Most of the bits in their binary integer representation are zero, and only a few of the least-
significant bits are set. Therefore, the LUT in the TNAFw point-multiplication operation is not fully
used, and pre-computing it is a waste of execution time.
Table 4-9 presents the performance of the simultaneous multiple point-multiplication
operation. Both CGA and HWA results for window widths of two, three and four are included. A
window width of three is optimal, and the operations utilizing HWA routines outperform their CGA
counterparts. The superior performance of the HWA version is expected because the individual HWA
routines outperform their CGA counterparts, as examined in §5.4.1.
Table 4-9. Simultaneous Multiple Point-Multiplication Performance Comparison

Window          Cycle Count
Width           CGA           HWA
2               6,226,000     5,198,000
3               6,139,000     5,043,000
4               11,820,000    9,933,000
5 Implementation Comparison and Coding Guidelines
This chapter compares the performance of the implementation and states the coding guidelines
that were followed. First, comparisons between some implemented operations and previously
published results are presented. Then, the coding guidelines that were followed during implementation
are listed. The guidelines specify techniques that result in the generation of efficient assembly and C
code. Next, a comparison of the performance of the CGA and HWA routines is presented, at both the
routine level and the signature generation and verification level. Finally, a comparison of the memory
requirements is provided, based on the CGA and HWA routines; it includes the memory requirements
of the data, of the actual CGA and HWA routines, and of the signature generation and verification
processes.
5.1 Performance Comparison with Previous Published Results
The following two sections compare the operation performances presented in the previous chapter
with published results. A comparison of the implemented finite field squaring, multiplication and
inversion operations is presented in the following section. Then, a performance comparison of the
point-multiplication operations, the simultaneous multiple point-multiplication, and the signature
generation and verification processes is offered in §5.1.2.
5.1.1 Low-Level Performance Comparison
The results of the finite field operations implemented are compared with figures presented in three
papers. In [19], results from implementations of the NIST-recommended elliptic curves on a Pentium
II 400 MHz processor are presented; the NIST-recommended curve for GF(2^163) is the Koblitz curve
implemented in the thesis. In [40], ECC results from several papers are summarized. The results are
from implementations using various elliptic curves, finite fields and platforms. In [39], the
performance of a point-multiplication algorithm is presented on various processors. Only the relevant
results are included in the comparison. The tables present estimated values because the published
results were converted from execution times to cycle counts.
Table 5-1 presents a summary comparison of the implementation's finite field operation
performance with other published results. In the table, the Hankerson implementation is shown to be
superior. The thesis implementation results are not as good as the Hankerson results, but are better
than the Lopez results for the inversion and squaring operations. The fact that a random elliptic curve
is used by Lopez does not affect the finite field results; the type of elliptic curve used only affects the
performance of elliptic curve operations.
It is interesting to note the multiplication-to-inversion performance ratio in each case.
Hankerson and Lopez both present ratios near ten, whereas the thesis implementation achieved a ratio
of less than five. The ratios differ considerably, which suggests several possibilities. It is likely that
the Pentium II and UltraSparc processors favor the multiplication operation, or that the SC140 favors
the inversion operation. Another possibility is that the Hankerson and Lopez inversion operations
and/or the thesis multiplication operation can be improved upon. Finally, the different implementations
may use highly optimized versions of similar algorithms, as presented in [20] and [22], leading to
performance enhancements of the multiplication and inversion operations.
Table 5-1. Estimated Finite Field Operation Cycle Count Comparison

Description                 Elliptic Curve        Target Processor      Squaring   Multiplication   Inversion
Hankerson, et al - C [19]   Koblitz - GF(2^163)   Pentium II 400 MHz    160        1,200            12,400
Lopez, Dahab - C [39]       GF(2^163)             Pentium 233 MHz       -          2,346            -
Lopez, Dahab - C [39]       GF(2^163)             Pentium II 400 MHz    -          1,188            -
Lopez, Dahab - C++ [40]     Random - GF(2^163)    UltraSparc 300 MHz    690        3,150            28,860
Lopez, Dahab - C [39]       GF(2^163)             UltraSparc 450 MHz    -          1,134            -
Thesis - C and ASM          Koblitz - GF(2^163)   SC140 300 MHz         212        3,475            16,730
Some factors that explain the difference in performance shown in the above table include code
size, target processor and optimizing strength of the compiler. The factors and how they can affect
performance figures are explained in the following paragraphs.
It is possible to improve the performance of operations by increasing the code size. For
example, the inversion performance was increased by removing a polynomial exchange and adding an
extra loop case. By adding the extra loop case, the code size is increased. Therefore, discrepancies in
the performance of operations can be due to code size.
Code that is written to be versatile, allowing for example the modification of the finite field
size, is likely to perform worse. By fixing the poly_sqr window width, for instance, the performance of
the operation was significantly improved. Certain assumptions can be made that improve performance
when parameters are fixed. The performance of the thesis implementation of the multiplication
operation, among others, can likely be improved by fixing the window width and finite field size used.
The target processor can have a large effect on the performance of an application. Different
processors have different instruction sets, data path widths and numbers of registers. The data path
width and number of registers can significantly affect the performance of applications that are
computationally intensive. Furthermore, specialized instructions, such as the CLB instruction of the
SC140, that are not present in most processor instruction sets can have a large effect on the
performance of implementations.
Lastly, mature processors are advantageous for implementations because of the strength of
their compilers. Mature processors have been more thoroughly studied, and therefore superior compiler
optimization techniques are known for them. Compilers that target a mature processor are likely to
generate better-performing applications.
5.1.2 High-Level Performance Comparison
In this section, the performance of the point-multiplication operation, and the signature generation and
verification processes are compared to previously published results. Similar to the previous section, the
results used in the comparison are from [5], [19] and [40]. The tables present estimated values because
the published results were converted from execution times to cycle counts.
Table 5-2 compares the implemented point-multiplication results with other published figures.
For the thesis implementation, two TNAFw point-multiplications are less computationally expensive
than the simultaneous multiple point-multiplication operation presented in §4.4.6, so the execution time
of two TNAFw point-multiplications and a point addition is included in the table. The thesis
implementation results are inferior to the Hankerson results and superior to the Lopez results. Some
reasons for the difference in performance are presented in the preceding finite field performance
comparison. The Lopez results are less impressive because a random curve is used; random curve
implementations cannot benefit from τ-adic point-multiplication algorithms, and are therefore more
computationally expensive.
The Hankerson cycle count for the simultaneous multiple point-multiplication is based on two
TNAFw point-multiplications, one of which exploits a pre-computed LUT. Therefore, the memory-
constrained result of 1,588,400 cycles presents a more accurate comparison [19]. Either way, the
Hankerson implementation outperforms the thesis implementation. This is expected because of the
comparison in §5.1.1: since elliptic curve operations are based on the execution of several finite field
operations, the performance of elliptic curve operations is limited by the performance of the underlying
finite field operations.
Table 5-2. Estimated Elliptic Curve Operation Cycle Count Comparison

                                                                    Point-Multiplication Implementation
Description                 Elliptic Curve        Target Processor      TNAF        TNAFw       Simultaneous Multiple
                                                                                                Point-Multiplication
Hankerson, et al - C [19]   Koblitz - GF(2^163)   Pentium II 400 MHz    778,400     576,800     1,080,800
Lopez, Dahab - C++ [40]     Random - GF(2^163)    UltraSparc 300 MHz    4,050,000   -           -
Thesis - C and ASM          Koblitz - GF(2^163)   SC140 300 MHz         1,670,000   1,193,000   2,450,000
Table 5-3 presents a comparison of the performance of the signature generation and
verification processes. The table includes more results than the previous comparisons, so the elliptic
curve and target processor in the table must be noted in each case. The Smart implementation results
are similar to those obtained in the thesis implementation, but use a random elliptic curve.
Furthermore, a direct comparison between cycle counts corresponding to 16-bit and 32-bit processors
is invalid because of the difference in data path width.
Table 5-3. Estimated Signature Generation and Verification Cycle Count Comparison

                                                             Signature Process Implementation
Description                 Elliptic Curve        Target Processor       Generation    Verification
Brown, et al - C++ [5]      Koblitz - GF(2^163)   DragonBall 16 MHz      28,688,000    52,208,000
Brown, et al - C++ [5]      Koblitz - GF(2^163)   Intel 386 10 MHz       10,011,000    18,260,000
Brown, et al - C++ [5]      Koblitz - GF(2^163)   Pentium II 400 MHz     844,000       1,636,000
Certicom - C [40]           Koblitz - GF(2^163)   UltraSparc 167 MHz     634,600       1,786,900
Daswani, Boneh - C [40]     Koblitz - GF(2^163)   DragonBall 15 MHz      12,000,000    35,100,000
Smart - C++ [40]            Random - GF(2^161)    Pentium Pro 334 MHz    1,336,000     6,346,000
Thesis - C and ASM          Koblitz - GF(2^163)   SC140 300 MHz          1,329,000     2,590,000
The results in the table that use Koblitz curves vary greatly. The target processor accounts for
some of the variation. Other factors that may contribute to the varied performance figures obtained for
the signature processes and the point-multiplication operation are explained in the comparison of
§5.1.1.
One of the intentions of the thesis is to determine whether the performance of the signature
generation and verification processes on the SC140 is acceptable. Converting the thesis cycle counts
from Table 5-3 leads to execution times of approximately 4.43 and 8.63 milliseconds respectively.
These execution times are reasonable: the delay caused by executing the processes is insignificant and
would be unnoticeable to a user. Therefore, implementing the ECDSA on the SC140 is practical.
Specialized processing units are not required to execute ECC-based security protocols; the protocols
can be designed for and executed on the SC140, eliminating the need for a specialized cryptographic
processor in a portable device built around the SC140.
5.2 Guidelines for Writing Efficient C Code for Cryptographic
Applications
Several techniques were found to improve the performance of the compiler-generated code during the
implementation and optimization of the ECDSA. C coding guidelines that have the most significant
influences on the performance of the compiled application are presented in the following points. To
further clarify the techniques, examples are provided when necessary.
1. Pass small variables by value whenever possible. This simplifies the generated assembly because
the variable does not have to be read from memory several times. Instead, it is read once from the
stack, and the value can then be kept in a register and used throughout the function without being
re-read, because no store to memory can change it.
2. Use pointers to arrays, and add offsets, or increase pointer values, instead of indexing array
elements. The resultant C code is harder to follow, but the compiler-generated code is more
efficient when this technique is used.
int i, array[10], *pointer;
for(i=0, pointer = array; i < 10;i++, pointer++) *pointer = 0;
3. Define and use temporary variables within functions when a calculated value is used several times.
This ensures that the compiler does not generate code that recalculates the value each time it is
used. The compiler can also more easily map the temporary variable to a single register.
4. To further reuse code, generalize functions that perform the same operation so that they can handle
variable size arrays. The required arguments to the function are a pointer to the start of each array,
and the size of the array, which is passed by value.
void null(ELEMENT *input, ELEMENT count);
5. When optimizing code for speed, inline small function calls within larger ones. Inlining is then
guaranteed, rather than being an optimization that the compiler may or may not employ. This
avoids the overhead associated with calling functions and can significantly decrease execution
time.
6. Write code that can be reordered when possible. The compiler has more freedom when optimizing
the generated assembly if the code can be reordered. The compiler can reorder instructions to
result in greater parallelism and possibly reduce memory moves.
c1 = LUT[1].e; c2 = LUT[2].e; c3 = LUT[3].e; c4 = LUT[4].e;
c5 = LUT[5].e; c6 = LUT[6].e; c7 = LUT[7].e; c8 = LUT[8].e;
for(i = 0; i <= NUMWORD; i++, c1++, c2++, c3++, c4++, c5++, c6++, c7++, c8++) {
*c3 = *c2 ^ *c1; *c5 = *c4 ^ *c1;
*c6 = *c4 ^ *c2; *c7 = *c4 ^ *c3; }
7. Use temporary variables for loop counting in for loops, instead of calculating the end address of
pointers. Temporary variables require less resources, are simpler for the processor to handle, and
can often be hard-coded.
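A hedged sketch of the guideline (names hypothetical): a simple down-counting temporary controls the
loop instead of a computed end address:

    /* preferred: a counter maps naturally to a hardware loop count */
    void clear_words(ELEMENT *p, int n)
    {
        while (n--)
            *p++ = 0;
    }
    /* avoided: for (end = p + n; p < end; p++) ... requires the end
       address to be calculated and kept in a register               */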
8. When calling functions to perform simple tasks, combine multiple calls so that loops that require
the same number of iterations can be merged, as sketched below. This reduces the code size, and decreases
execution time in many cases because greater parallelism and pipelining can be achieved. In
addition, combining several loops that require the same number of iterations reduces the overhead
associated with looping.
9. Use alternative operations that achieve identical results when they execute faster. For example,
when performing remainder and division calculations by powers of two, the & and >> C operators
produce the same results as % and / respectively. The & and >> operators result in better
performing code because they map directly to assembly instructions.

mod = num % 8; ↔ mod = num & 7;
div = num / 8; ↔ div = num >> 3;
5.3 Guidelines for Writing Efficient Assembly Code for Cryptographic
Applications
There are several coding techniques used that result in optimized assembly code. The techniques
maximize the parallelism of the code, leading to a reduction in execution time. The guidelines used
during the implementation of assembly routines that had the greatest performance effect are listed.
Performance refers foremost to execution time, but power consumption is also considered. When
necessary, examples are provided to further clarify the techniques.
1. For implementation on processors with limited memory bandwidth such as the SC140, it is
important to give memory moves the highest priority during assembly coding. In most cases,
memory moves limit the performance of the code. Therefore, the optimum organization of MOVE
instructions should be determined. Then, other instructions should be organized in parallel with the
MOVE instructions in a manner that least hinders performance.
2. Avoid writing values to memory whenever possible. Updating variables stored in memory should
only be carried out near the end of a routine, after the variable is no longer modified within the
routine. The exception to this guideline is when a child routine is called that accesses the
variables from memory. In many cases, when implementing simple routines, there is enough
register file space to store all the required values that are used throughout the routine, so the
number of memory moves can be minimized. Values that are only required for a small portion of
the routine can be read from memory before their use and, if they are modified, written back to memory
after they are no longer required within the routine. This reduces the number of memory moves,
decreasing the execution time and power consumption of the implementation.
3. An attempt to minimize stack space usage, and only allocate stack space if necessary, should be
made. In many cases, stack space is not required because of the size of the register files. Registers
can be used to store local variables instead of stack space. This further reduces the amount of
memory moves, decreasing the execution time and power consumption of the routine.
4. Instructions that belong to the Critical Path (CP) should be given the highest execution priority.
The CP is the set of instructions that determine the execution time of a routine. In other words,
when organizing instructions that are performed in parallel, the set of instructions that are involved
in the CP should be placed before other instructions whenever possible. Cases arise when there are
several instructions that can be executed during a certain clock cycle, but because of processor or
VLES limitations, only a subset of the instructions can be executed. The instructions that belong to
the CP should always be included in the executed subset ahead of other, less performance-critical
instructions.
5. When the data structure located in memory is known, pipelining techniques can be employed that
lead to greater parallelism and tighter loops. The order in which memory reads and writes can be
modified or parallelized because the memory addresses involved are guaranteed not to overlap.
Rearranging the order of memory accesses often results in performance improvements.
Before:
loopstart3
MOVE.L (r3)+, r4
MOVE.L r4, (r1)+
loopend3

After (the load for the next iteration is moved ahead of the loop and overlapped with the store):
MOVE.L (r0)+, r3
loopstart3
MOVE.L r3, (r1)+    MOVE.L (r0)+, r3
loopend3
6. Advanced pipelining techniques that use registers to temporarily store data reduce the length of
loops. In many cases, by reading data into registers for temporary storage and later transferring it
to other registers for use, loops become tighter and therefore more efficient.
Before:
loopstart3
MOVE.L (r1)+, d2
MOVE.L (r0), d0
TFR d2, d1
LSRR d7, d2
MOVE.L d0, (r0)+
loopend3

After:
MOVE.L (r3)+, d5
MOVE.L (r2), d3
TFR d5, d4
LSRR d7, d5
MOVE.L d3, (r2)+
LSLL d6, d1
ZXT.L d2
EOR d1, d0
EOR d2, d0
LSLL d6, d4
ZXT.L d5
EOR d4, d3
EOR d5, d3
7. The reordering of instructions can improve performance. In cases where the same data is required
to compute two different variables, the order in which the instructions are executed may affect the
performance of the code. See 12 in §6.2 for an example.
8. In several cases, to comply with StarCore programming rules, NOPs must be added before and
after hardware loops. NOPs surrounding loops can often be avoided by a minor unrolling of the
loop. By doing so, clock cycles are saved each time the routine is called. The overall
performance of an application improves when this is applied to commonly called routines. See 17
in §6.2 for an example.
9. Cycles are saved when branch instructions are replaced with return from subroutine instructions
when appropriate. This rule is also presented as a suggested compiler optimization improvement.
See 21 in §6.2 for an example.
10. Use delay instructions whenever possible. The instructions, such as BFD, BRAD, BSRD, BTD,
JMPD, JSRD, RTED, RTSD and RTSKD, save clock cycles when properly implemented. The
effective cycle cost of the execution sequence is reduced by one clock cycle by executing the
following VLES during the delay instruction.
Before:
MOVE.L d1, (r4)
MOVE.L d3, (r5)
RTS

After:
RTSD
MOVE.L d1, (r4)    MOVE.L d3, (r5)
5.4 Hand-Written and Compiler-Generated Assembly Comparison
Several routines, which implement operations or functionality required by operations, were written in
assembly code in an attempt to improve the performance of the signature generation and verification
processes. In each case, an HWA and CGA version of the routine was implemented. The two versions
of identical functionality were implemented to determine the effect and benefits of HWA and the
strength of the C compiler.
The C code used in the comparison, and throughout the entire thesis, does not take advantage
of special optimizing techniques. In other words, no definitions were implemented that aid the
compiler's understanding of the code by allowing assumptions and better optimized code. In
addition, intrinsic functions were not used in the C code, although they can result in superior
compiler-generated assembly because they allow the programmer to specify individual assembly
instructions.
The routines that were chosen for dual implementation are primarily simple routines that are
commonly called. Routines that could greatly benefit from the added functionality HWA offers were
also implemented in both source code languages and compared. The corresponding HWA and CGA
routines have identical parameter lists. All of the functions require a count argument, which defines the
size of the input data structures. The count argument is used to determine the number of loop iterations
to perform. A fixed structure size would improve the performance of both the HWA and CGA
routines, but it reduces the reusability of the routines. For example, the same routine is used to copy
FIELD2N and DBLFIELD structures. Fixing structure sizes also further complicates modifying the
finite field size.
At the start of the implementation process, importance was placed on the reusability of the
code, both to reduce the application size and to allow for the easy modification of the security strength
by changing the finite field size. For these reasons, the structure size is an input to the HWA and CGA
routines. The following two sections compare the HWA and CGA routines with respect to low-level
and high-level execution times respectively.
5.4.1 Low-Level Performance Comparison
Identical algorithms were used when implementing the HWA and CGA routines whenever possible.
The implemented algorithms only differ in cases where functionality is not available in the specific
source code language. The MSB_degree, MSB_degree1, add_int and sub_int routines are
examples of routines where the assembly and C algorithms differ.
The two degree-computing routines are MSB_degree and MSB_degree1. The CLB
instruction was used in the HWA versions of these routines. This functionality is not available at the C
level, and therefore had to be simulated at the expense of execution time, leading to greater
computational costs.
The add_int and sub_int HWA routines benefit from the carry bit of the SC140. The carry bit
is used to record overflows and underflows during the addition and subtraction of integers. There are
instructions, ADC and SBC, which include the carry bit when adding or subtracting two registers.
These instructions allow efficient implementation of large integer addition and subtraction, where the
large integers span more than one register, by including carry and borrow bits in future instructions
without a cycle penalty. In comparison, a complex boolean expression was defined to compute the
carry and borrow bits in the C routine. Employing boolean expressions is an inefficient method of
simulating the carry bit, resulting in less efficient code, as sketched below.
The results presented in this section are from C level CGA and HWA routine calls. All of the
routines are implemented in the form of C or assembly functions. The input finite field elements and
large integers are all FIELD2N structures consisting of six elements unless otherwise stated. The
performance of the HWA and CGA routines, which is for the most part independent of the input
parameters, is presented in Table 5-4. Both the cycle and instruction counts of each HWA and CGA
routine are provided, along with a brief description of the inputs involved. To obtain the results, each
routine was run several times, using identical inputs for both HWA and CGA routines.
The routines were named according to the task they perform. The routines, add_int and
sub_int, add and subtract large integers respectively. The routine neg computes the negative of a large
integer. Rounding up of a large integer after dividing it by two is achieved by div2_ceil. The routines,
null and copy, clear and copy the input structures respectively. Polynomial addition is performed by
poly_add, and poly_add2 performs two polynomial additions. Shifting a structure to the left by a
specified number of bits less than the register width of 32 is performed by shift_left. The two routines,
convert_to_larger and convert_to_smaller, copy data from a structure to a larger and smaller structure
respectively. The sizes of the two structures are arguments to the routines. The structures can be of
equal size, resulting in each routine performing a copy operation. In convert_to_larger, all the data
from one structure is copied to the least significant array elements of the other, and the remaining array
elements are cleared. No reduction is performed by the convert_to_smaller routine. The least
significant array elements are copied to the other structure, ignoring any overflowing array elements.
Table 5-4 shows that the HWA routines outperform the CGA routines in each case. The cycle
performance comparison column shows the superior performance of the HWA as a percentage. The
routine that benefits the most from hand-written assembly is add_int. A performance improvement of
over 530% is observed when comparing the HWA and CGA routines. As previously mentioned, this is
most likely due to the use of the carry bit and ADC instruction in the HWA. This functionality is not
available at the C level without the use of intrinsic functions.
Table 5-4. Low-Level CGA and HWA Performance Comparison (input independent routines)
Routine               CGA Cycles (Instr)   HWA Cycles (Instr)   Cycle Performance Comparison (%)   Description
add_int               218 (172)            41 (28)              532                                Random large integer
sub_int               170 (130)            41 (28)              415                                Random large integer
neg                   125 (89)             37 (27)              338                                Random large integer
neg                   125 (89)             37 (27)              338                                Large integer = 0x43
neg                   120 (99)             37 (27)              324                                Large integer = -0x43
div2_ceil             139 (102)            54 (34)              257                                Random large integer
div2_ceil             139 (117)            54 (34)              257                                Large integer = 0
div2_ceil             139 (114)            54 (34)              257                                Large integer = 1
div2_ceil             145 (112)            54 (34)              269                                Large integer = -1
null                  63 (40)              20 (12)              315                                Random polynomials
copy                  74 (49)              27 (16)              274                                Random polynomials
convert_to_smaller    88 (61)              30 (19)              293                                Emulate copy
convert_to_larger     95 (65)              37 (23)              257                                Emulate copy
convert_to_smaller    89 (62)              31 (20)              287                                DBLFIELD[11] to FIELD2N[6]
convert_to_larger     131 (91)             40 (29)              328                                FIELD2N[6] to DBLFIELD[11]
poly_add              87 (59)              36 (24)              242                                Random polynomials
poly_add2             107 (79)             46 (33)              233                                Random polynomials
shift_left            119 (93)             38 (27)              313                                4 shifts, random polynomials
shift_left            119 (93)             38 (27)              313                                8 shifts, random polynomials
The performance of the routines, shown in the table, was measured using inputs with six array
elements. As the finite field size and large integers involved increase, the HWA routines outperform
the CGA routines by larger factors. This is due to tighter hardware loops in the HWA routines.
Large integer addition, which is implemented by the add_int routine, is not commonly used in
the ECDSA. Therefore, the performance benefits of other HWA routines must be the primary focus of
the comparison. Routines that implement finite field element manipulation, such as null, copy,
poly_add, poly_add2 and shift_left, have a greater effect on the overall performance of the signature
generation and verification processes. Compared to their CGA counterparts, a minimum performance
enhancement of 230% is achieved by these routines.
Several other routines were also implemented in both C and hand-written assembly. Their
performance figures are recorded in Table 5-5. The performance of each routine depends on the
number of shifts required, or the degree of the input polynomial.
The performance of the CGA version of the first routine listed, MSB_degree, completely
depends on the degree of the input polynomial. The performance follows a pattern that is clarified by
organizing results into groups. The groups correspond to results where the degree of the polynomial
divides 32, which is the register width, by the same integer. The grouping also corresponds to the
representation of the polynomials. The MSB of the polynomial is located in the same array element of
the FIELD2N structure for each group. As shown by the bracketed values in Table 5-5, the CGA code
requires an additional twelve clock cycles for each degree of a polynomial within a single grouping.
In comparison, by exploiting the CLB instruction, the HWA code requires the same number of
cycles to calculate the degree of any polynomial in a group. Furthermore, the HWA routine only
requires four extra cycles to compute the degree of a polynomial in a subsequent group, whereas the
CGA routine requires an additional twelve cycles for corresponding degrees in subsequent groups.
Due to the additional functionality available at the assembly level, and the superior assembly coding
techniques compared to the compiler, the HWA routine outperforms the CGA routine. The HWA is an
improvement over the CGA routine of at least 130%, and of at most 1550%. The benefits of
the HWA routine MSB_degree lessen when the previous degree of the polynomial is provided as an
argument.
When the previous degree of the polynomial is exploited by the degree calculating routine,
which is the case with MSB_degree1, the performance of the CGA routine significantly improves,
while the HWA falters. This is seen when comparing the performances of MSB_degree and
MSB_degree1. On average, the CGA version of MSB_degree1 outperforms MSB_degree, whereas the
opposite is true with the HWA versions. In the case of MSB_degree1, the previous degree of the
polynomial is passed to the routine. The value is used as a starting point for the calculation of the
current degree. The routine calculates the degree correctly as long as the previous degree is valid. It
must be greater than the current degree. To collect the performance values in Table 5-5, the previous
degree value used was the actual degree plus one, except with the null polynomial.
Table 5-5. Low-Level CGA and HWA Performance Comparison (input dependent routines)
Function         CGA Cycles     CGA Instr     HWA Cycles   HWA Instr   Cycle Perf. Comp (%)   Description
MSB_degree       378-403 (12)   279-297 (9)   26           13          1454-1550              Degrees 162-160, ( ) = change per degree
MSB_degree       40-410 (~12)   24-304 (~9)   30           17          133-1367               Degrees 159-128, ( ) = change per degree
MSB_degree       52-422 (~12)   33-313 (~9)   34           21          153-1241               Degrees 127-96, ( ) = change per degree
MSB_degree       64-433 (~12)   42-322 (~9)   38           25          168-1139               Degrees 95-64, ( ) = change per degree
MSB_degree       76-446 (~12)   51-331 (~9)   42           29          181-1062               Degrees 63-32, ( ) = change per degree
MSB_degree       88-458 (~12)   60-340 (~9)   46           33          191-996                Degrees 31-0, ( ) = change per degree
MSB_degree       101            73            48           35          210                    Degree -1
MSB_degree1      55             38            39 (43)      23 (27)     141                    ( ) = degree = 32*i-1
MSB_degree1      83             63            41           25          202                    Degree -1, prev_degree = 2
shift_and_add    135            94            49           37          276                    Shifts 1-31
shift_and_add    117            82            46           34          254                    Shifts 32-63
shift_and_add    99             70            43           31          230                    Shifts 64-95
shift_and_add    81             58            40           28          203                    Shifts 96-127
shift_and_add    63             46            37           25          170                    Shifts 128-159
shift_and_add    50             34            35           20          143                    Shifts 160-162
shift_and_add2   195            138           72           56          271                    Shifts 1-31
shift_and_add2   168            119           66           50          255                    Shifts 32-63
shift_and_add2   141            100           60           44          235                    Shifts 64-95
shift_and_add2   114            81            54           38          211                    Shifts 96-127
shift_and_add2   87             62            48           32          181                    Shifts 128-159
shift_and_add2   62             42            44           24          141                    Shifts 160-162
multiple_div2    125 (124)      96 (99)       49           35          255                    Shifts 1-31, ( ) = negative large int
multiple_div2    113 (116)      90 (94)       49           37          231                    Shifts 32-63, ( ) = negative large int
multiple_div2    107 (109)      84 (87)       47           35          228                    Shifts 64-95, ( ) = negative large int
multiple_div2    101 (102)      78 (80)       45           33          224                    Shifts 96-127, ( ) = negative large int
multiple_div2    95             72 (73)       46           31          207                    Shifts 128-159, ( ) = negative large int
multiple_div2    94 (93)        66            47           32          200                    Shifts 160-162, ( ) = negative large int
The CGA routine MSB_degree1 calculates the degree of the input polynomial in fixed time.
On average, it is much faster than the previous degree calculating routine, but the HWA routine still
outperforms the CGA implementation by 141%. It is interesting to note that the overhead associated
with passing the previous degree to the HWA implementation is actually detrimental. On average, the
HWA MSB_degree routine outperforms the HWA MSB_degree1 routine. By averaging the figures in
the table, the HWA versions of MSB_degree and MSB_degree1 require 37.77 and 39.15 cycles
respectively. For this reason, the CGA version of MSB_degree1 and the HWA version of MSB_degree
are integrated into the finite field inversion function.
The routines, shift_and_add, shift_and_add2, and multiple_div2 follow the same performance
pattern. Similar to MSB_degree, the routines are grouped by multiples of the register width. Within
each group, the routines perform identically because they employ instructions capable of shifting
registers by a variable number of bits in a single clock cycle.
The CGA and HWA routines of shift_and_add and shift_and_add2 are very similar. The first
adds a shifted version of a polynomial to another. The second routine, which makes the first obsolete,
is equivalent to calling shift_and_add twice. Two shifted polynomials are added to two other
polynomials in parallel. The routines perform tasks specific to the finite field inversion operation.
The multiple_div2 routine is equivalent to the shift_left routine described previously, except that it
shifts a structure to the right. It is actually more powerful than the shift_left routine, in that the shifting
value is not limited to less than the register width of 32 bits. The routine was defined specifically
for the partmod δ reduction; thus, it targets large integers, but it can also be used with finite field
elements.
The performance of the HWA routines is superior to that of the CGA routines. The HWA
shift_and_add, shift_and_add2 and multiple_div2 routines result in at least a 141% increase in
performance, and on average, the increase exceeds 210%.
Overall, a significant increase in performance is achieved by implementation using assembly
code instead of C. Only simple routines were implemented and compared, but significant decreases in
cycle counts for all the implemented routines were recorded. Similar benefits are expected when
complex functions and operations are entirely implemented in assembly. The performance
enhancements likely accumulate so that an even greater savings in execution time is achieved.
The performance discrepancy between the CGA and HWA routines clearly demonstrates the
inability of the compiler to generate efficient code. The CGA routines are significantly outperformed
by the HWA routines, which shows that the SC140 compiler does not produce optimal assembly
for cryptographic applications. Assembly code implementations of cryptographic applications on the
SC140 are far superior to those written in C.
5.4.2 High-Level Performance Comparison
A high-level performance comparison of the CGA and HWA routines was achieved by integrating the routines into
the ECDSA source code. Definitions were used to determine which set of routines, either CGA or HWA,
was included in the execution sequence.
In an attempt to estimate the reduction in computational costs due to employing HWA routines
in place of CGA routines, the number of times each routine is called was recorded. An estimation of
the computational cost reduction due to employing HWA routines is achieved by combining call counts
with the cycle performance comparison from the previous section. The minimum and maximum
computational reduction due to the most commonly called HWA routines within the signature
generation process is presented in Table 5-6. CGA and HWA routines that are not commonly called
are not included in the table because they do not have a significant effect on the performance of the
process. The minimum and maximum reductions are computed because several of the routines listed
execute in data-dependent time, and without the inputs of each routine call, it is very difficult to
estimate the actual computational reduction achieved.
Table 5-6. Computational Reduction of the Signature Generation Process due to HWA Routines
Routine              Call Count   Cycle Reduction per Call   Total Cycle Reduction
                                  Minimum      Maximum       Minimum      Maximum
add_int              469          177          177           83,010       83,010
convert_to_larger    183          91           91            16,650       16,650
copy                 857          47           47            40,280       40,280
MSB_degree           292          10           412           2,920        120,300
multiple_div2        235          47           76            11,050       17,860
neg                  183          83           88            15,190       16,100
null                 189          43           43            8,127        8,127
shift_and_add        4674         15           86            70,110       402,000
shift_left           511          81           81            41,390       41,390
sub_int              303          129          129           39,090       39,090
Total                                                        327,800      784,800
The cycle count of the signature generation process that employs HWA routines should be at
least 327,800 cycles less than the cycle count of the process that employs CGA routines. The actual
cycle count reduction should be larger than this minimum, and should not exceed the maximum
computational reduction in Table 5-6, because the worst-case performance of the CGA routines is used
in the maximum's calculation. The cycle count reduction of the signature verification process is expected to be
approximately twice that of the signature generation process. This is because the verification process
requires two point-multiplications, which account for most of the computational cost of the process,
whereas the signature generation process only requires one point-multiplication.
The ECDSA source code was modified so that it is possible to measure the duration of the
signature generation and verification processes. Code was written that set the digital signature to the
correct value, and that compared the signature against hard-coded values, so that the generation and
verification processes could be bypassed. The modifications make it possible to selectively execute the
signature generation and verification processes. The results obtained from selectively executing the
processes and measuring the cycle counts are presented in Table 5-7.
The signature verification results in Table 5-7 were recorded using two TNAFw point-
multiplication functions. Two TNAFw point-multiplication operations, along with a point addition
operation are found to outperform the simultaneous multiple point-multiplication operation in the
signature verification process. This is due to the efficiency of the TNAFw representation. The
TNAFw representation guarantees an average Hamming weight of m/(w+1), where w = 5 is optimal for
this implementation; with m = 163, this gives roughly 163/6 ≈ 27 nonzero coefficients on average. Due
to the small number of nonzero coefficients, fewer point additions are required in the TNAFw
point-multiplication operation.
Table 5-7. High-Level CGA and HWA Performance Comparison
Description              CGA Cycle Count   HWA Cycle Count   Actual Cycle Count Reduction
Signature Generation     1,819,000         1,329,000         429,000 (23.6%)
Signature Verification   3,393,000         2,590,000         803,000 (23.7%)
The signature processes that include HWA routines outperform their CGA counterparts, and
the cycle count reduction is within the maximum and minimum values estimated in Table 5-6. An
actual reduction of 429,000 clock cycles is achieved by employing HWA routines instead of CGA
counterparts within the signature generation process. The computational cost of signature verification
is reduced by 803,000 clock cycles when employing HWA routines. As expected, the cycle count
reduction of the signature verification process is approximately twice that of the signature generation
process.
Due to the significant computational cost reduction illustrated in Table 5-7, it is recommended
that assembly implementations be employed. In the thesis, only a small amount of basic functionality
was implemented in assembly. As presented in Table 5-4 and Table 5-5, the implemented assembly
routines greatly outperform their counterparts, which were written in C. Furthermore, the routines
implemented in assembly translate to significant computational cost reductions at higher levels.
Computational cost reductions are easily achieved by implementing commonly executed low-level
routines in assembly, and even greater computational cost reductions are expected when higher-level
routines are also implemented in assembly.
The following section examines the memory requirements of the implementation. The
memory requirements of the signature generation, signature verification, and entire signature
generation and verification processes are presented, as well as the memory requirements of the CGA
and HWA routines.
5.5 Memory Requirements Comparison
When targeting portable devices, the memory requirements of the application become more important.
Applications targeting portable devices must adhere to the limited processing and storage resources
present. Throughout the implementation process, the focus was on the computing time of operations
and processes; an attempt was made to minimize computing time at the cost of memory.
However, some decisions were made because of memory limitations. For example, the bit
width used in the polynomial squaring function was selected primarily due to a decrease in execution
time, but the size of the LUT, which is permanently stored in memory, grows exponentially with the bit
width. Larger bit widths reduce the computational cost of the polynomial squaring operation, but their
LUTs require substantial amounts of storage space, and the memory requirements quickly become
impractical.
When selecting the window width and width-w for the polynomial and point-multiplication
operations respectively, the memory requirements were considered. After determining the optimum
window width with respect to computational costs, the memory requirements were investigated. Both
functions use temporary LUTs that are computed during execution. The LUTs are dynamic, and only
present when executing the appropriate operation. It was ensured that the LUTs required by the
operations are of reasonable size. In each case, the size of the LUT was compared to the memory
requirements of the entire application. The memory requirements of the operations seemed reasonable,
but for systems with extremely limited memory resources, the requirements may be too large.
The memory requirements of the LUT in the point-multiplication operation are considerable.
The LUT is 6,144 bytes, which may be too large for certain portable devices. By employing the TNAF
point-multiplication operation instead of the TNAFw version, the LUT is eliminated at the cost of
approximately 500,000 cycles. This translates to an additional 1.67 milliseconds per point-
multiplication operation when the SC140 is operating at 300 MHz. The dynamic memory requirements
of the implementation are not included in Table 5-8.
The permanent storage requirements of the implementation are presented in Table 5-8. The
table presents the memory requirements divided into several categories, while contrasting compilations
including either CGA or HWA routines. The memory requirements of the data, CGA and HWA
routines, general code other than the CGA and HWA routines, the entire ECDSA, and the signature
generation and verification processes are provided. All values are approximate because it is difficult to
determine the exact size of each category: several of the categories overlap and affect each other, and
there are no distinct divisions within the code that allow the memory requirements to be easily
separated into categories. Furthermore, the dynamic memory requirements of LUTs that are calculated
on the fly are not included in the table; only the memory requirements of the LUT used in the finite
field squaring operation are included, because that LUT is not calculated on the fly.
Table 5-8. Estimated Permanent Storage Requirements
Description                       CGA (bytes)     HWA (bytes)
Data                              1,232           1,232
Routines                          3,300 (2,200)   1,200
General Code (not CGA or HWA)     32,300          32,300
ECDSA (total)                     37,800          35,700
Signature Generation              34,200          32,300
Signature Verification            33,000          31,100
The memory requirements of the HWA routines are significantly smaller than the CGA
routines; they differ by nearly a factor of three. The routine memory requirements are slightly
misleading due to inline functions: there are several copies of identical routines in the compilation
that includes the CGA routines, whereas there are no inline HWA routines, so their memory
requirements are smaller. The value in brackets is the size of the compiled CGA routines,
excluding inline functions; this value is more accurate for comparison purposes. Even then, the
memory requirements of the CGA routines are still much larger than those of the HWA routines,
differing by almost a factor of two. By focusing on the CGA and HWA routines, it is shown that the memory requirements
can be significantly reduced by writing routines in assembly. As presented in §5.4.1, the performance
of the routines is also far superior.
Table 5-8 also presents the memory requirements of the signature generation and verification
processes. All of the finite field, large integer and elliptic curve operations are required by both
signature processes, so there is a significant amount of overlap: most of the code is shared between
the two processes rather than being exclusive to either one. As a result, the memory requirements of
the signature generation and verification processes are almost identical, and each is approximately
equal to the requirements of the entire ECDSA.
The table does not show the memory requirements of implementing the SHA-1 hash function.
The hash function accounts for approximately 13,000 bytes of each of the memory requirements in the
table (excluding data and routines). This is a significant amount of the total memory required. An
attempt to improve the implementation of the hash function was not made. Instead, the original
implementation was used. It may be possible to implement the hash function more efficiently,
reducing the memory footprint.
6 SC140 and Compiler Analysis for Cryptographic
Applications
The following sections analyze the target processor and compiler, and state compiler optimization
improvements. An analysis of the SC140 for cryptographic applications and guidelines for writing
efficient C and assembly source code are presented. In addition, compiler improvement
recommendations are made based on the comparison of CGA and HWA routines. Finally, compiler
anomalies encountered are included.
6.1 Analysis of the SC140 for Elliptic Curve Cryptographic
Applications
There have been few documented cryptographic implementations on DSPs, which leads to the
question of whether DSPs are suitable for cryptographic applications. One of the purposes of the thesis
is to show that a DSP, and more specifically the SC140, is a viable target for cryptographic
applications. The performance of the ECDSA implementation was shown previously to be adequate.
DSPs are currently present in several computing environments. Therefore, if they are suitable
for cryptographic applications, security measures can easily be added to the environments, without
upgrading hardware with specialized cryptographic processors. The following sections bring forth both
positive and negative aspects of implementing cryptographic applications on the SC140. Several of the
aspects are relevant to all DSPs. For simplicity, instructions such as ADD and ADDA, which result in
the same operation executed by the DALU or ALU respectively, are written as ADD(A) throughout the
section.
6.1.1 SC140 Cryptographic Pros
The following sections describe the notable properties of the SC140 that have a positive effect on the
execution of cryptographic applications.
1. Variable Length Execution Set
A VLES is a set of instructions that is executed in a single clock cycle. Up to six instructions can
be grouped into one VLES on the SC140, leading to substantial parallelism. Since polynomials
involved in ECC span several SC140 registers, a large amount of parallelism is possible; most
finite field algorithms allow several instructions to be grouped so that they execute in a single
clock cycle, reducing execution times significantly.
VLESs increase code density and reduce power consumption. Code density is significantly
increased because instructions for unused processing units do not have to be defined. A VLIW,
a fixed-length set of instructions that defines an instruction for each processing unit per clock
cycle, requires more memory storage than a VLES. Processors that require VLIWs have
corresponding applications with larger memory footprints, which is problematic for portable
devices. Generally, several of the instructions within a VLIW are NOPs, which are a waste of
memory space. Alternatively, NOP instructions are assumed unless otherwise stated within a
VLES. By eliminating the NOPs, and slightly limiting the combination of instructions that are
allowed within a single VLES, the code size of the application is greatly reduced.
In addition, less code directly leads to less power consumption, which is also beneficial to
portable devices. First, less power is consumed because less memory is required to define each
clock cycle’s execution set. Less bandwidth is required to transfer the VLES, thereby reducing
power consumption. Secondly, the total amount of memory required to store an application using
VLESs is reduced. Therefore, the amount of flash memory, or ROM, a device requires can be
reduced. Smaller flash or ROM sizes require less operating power, thus decreasing the total
amount of power consumed by the system.
2. Loop Control Instructions (BREAK, DOENSHn, …)
The instructions BREAK, DOENSHn, DOENn, DOSETUPn and SKIPLS are hardware loop
control instructions. Hardware loops are the most efficient method of repeatedly executing
instructions on the SC140. As long as the StarCore programming rules are followed, there is no
penalty for returning to the first VLES of a hardware loop, resulting in minimal overhead.
The instructions DOENSHn and DOENn enable short and long loops respectively. The
DOSETUPn instruction sets the starting address of long loops, and is not required with short loops.
The BREAK and SKIPLS instructions are used to exit a hardware loop, and to exit a hardware loop
if the loop counter is less than or equal to zero respectively.
Branch instructions are expensive: several cycles, four on the StarCore, are wasted each
time a branch is executed. The cost of using branches therefore increases execution time
significantly, especially considering that several of the implemented hardware loops consist of only one
or two VLESs. The overhead introduced by the branch instruction is larger than the cost of
executing the desired instructions.
Repetitive code is an inefficient waste of memory that can be avoided by looping
instructions. An efficient means of looping that does not hamper execution efficiency, and aids in
reducing code size, is extremely beneficial for portable devices.
The use of hardware loops saves memory space with minimal cost, because there is no cost
associated with returning to the start of the loop. In the case of finite field arithmetic, several
registers are required to store one polynomial, so the identical operation(s) is performed on several
sets of data. By employing hardware loops, the operations are performed efficiently with address
pointers. Without hardware loops, branch instructions or repetitive code is required, resulting in
inefficient execution or large applications.
3. Conditional Instructions (BC, IFC, JC)
The conditional instructions of the SC140 allow for greater parallelism and instruction hiding.
Conditional instructions are all based on the true bit of the status register. The set of conditional
instructions includes branch (BF, BT), jump (JF, JT), and the more general if (IFA, IFF, IFT).
The instructions IFA, IFF and IFT allow a VLES to be divided into two subsets of
instructions. The instructions that are actually executed depend on the true bit in the status register.
A single VLES is allowed to contain a maximum of two of the three conditional if instructions.
The instructions grouped with an IFT or IFF are only executed if the true bit is true or false
respectively, while the instructions grouped with an IFA are always executed.
The if instructions allow for greater parallelism. They are extremely useful with respect to
ECC because there are several instances when either one set of instructions or another is
performed. There are also cases when one set of instructions is always performed, and another
may be performed. In these instances, each set of instructions can be grouped with a conditional
instruction, thereby computing them in parallel and reducing execution time.
Furthermore, conditional instructions along with the parallel processing capabilities of the
SC140 make it possible to hide the actual instructions executed, and fix the execution times of
either condition. For example, the conditional branch instructions can be executed in parallel, thus
hiding the branch taken. Identical sets of instructions can be grouped with the IFF and IFT
instructions, only exchanging the targets of the instructions grouped with IFT. Thus, it is
extremely difficult, if not impossible, to determine the set of instructions executed by the processor. The
strength of the conditional instructions is further explored in §7.4.1.
4. Polynomial Degree Calculation Instructions (CLB, TSTEQ)
The CLB instruction returns a value corresponding to the number of equal most significant bits in a
register. The TSTEQ instruction is used to compare a register with zero. These two instructions
aid in the efficient calculation of the degree of a polynomial.
The CLB and TSTEQ instructions are very useful when computing the inverse of a finite
field element or large integer. Inversion of an element is the most expensive finite field operation,
and becomes even more expensive if an efficient method of calculating the degree of a polynomial
is not employed.
When calculating the degree of a polynomial, the TSTEQ is used to check if a register,
which contains a subset of a polynomial’s bits, is zero. Once a nonzero register is found, the CLB
instruction is used to calculate the number of leading zero bits the register contains. The
exploitation of the two instructions, along with control logic and other computations, allows the
efficient determination of the degree of a polynomial. The efficient method of calculating the
degree of a polynomial reduces the cost of the finite field inversion operation.
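The algorithm can be sketched in C as follows; clz32() is a hypothetical stand-in for the CLB-based
leading-zero count, and the words are assumed to be stored least significant first:

    extern int clz32(unsigned long w);      /* stand-in for CLB */

    int degree(const unsigned long *poly, int nwords)
    {
        int i;
        for (i = nwords - 1; i >= 0; i--)   /* TSTEQ: skip zero words      */
            if (poly[i] != 0)               /* CLB on first nonzero word   */
                return 32 * i + 31 - clz32(poly[i]);
        return -1;                          /* null polynomial             */
    }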
5. Intrinsic Functions
There are several functions, called intrinsic functions, defined at the C language level specific to
the SC140. They allow programmers to employ individual assembly instructions from a C
function. The functions specify assembly instructions from a higher level of abstraction, allowing
the designer to specify individual assembly instructions, resulting in more efficient execution
sequences.
Although intrinsic functions were not used in the implementation process, they promise to
result in applications that are more efficient. They allow programmers to specify a single assembly
instruction, simplifying the compiling process and allowing the programmer to use processor
functionality that may not be available at the C level.
6. Dual Task, Single Cycle Instructions
ADDL1A, ADDL2A, DECEQ(A) and post-address updating instructions perform two tasks in a
single cycle. They combine two instructions that are commonly found consecutively in assembly
code, and execute them simultaneously.
The instructions ADDL1A and ADDL2A add a shifted version (by 1 and 2 bits
respectively) of an address register to another address register. These instructions are very useful when
an algorithm must start at the least significant bits of a polynomial. In this case, the address of
most significant bits, and the number of registers the polynomial spans are known. By using the
ADDL2A instruction, the address of the least significant bits can be calculated, from the address of
the most significant bits and the number of registers the polynomial spans, in a single clock cycle.
The DECEQ(A) instruction decreases a register by one, compares the result to zero, and
sets the true bit in the status register accordingly. The instruction is useful because it is very
common to decrease a register by one and then check if it is zero. For example, the last iteration of
a loop is often different from previous iterations. Before the loop is entered, the DECEQ(A)
instruction is used to control the entrance of the loop. If the number of loop iterations is one, the
loop is never entered, and a branch is taken to the last iteration of the loop.
Post-address updating is a powerful feature of the SC140, which was introduced in §2.4.
An address register can be adjusted by a value after it is used in an instruction without any clock
cycle penalty. This is extremely useful within hardware loops when accessing a polynomial stored
in consecutive memory addresses.
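As a hedged illustration (names hypothetical), the single-cycle computation that ADDL2A performs
corresponds to scaling a word count into a byte offset and adding it to a base address:

    /* ADDL2A computes r_dst += (r_src << 2) in one clock cycle */
    unsigned char *span_end(unsigned char *msb_addr, long nwords)
    {
        return msb_addr + (nwords << 2);    /* words scaled to bytes */
    }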
7. Shifting Instructions (ASL, ASLL, …)
The instructions ASL, ASLL, ASR, ASRR, LSL, LSLL, LSR and LSRR are for single or multiple,
arithmetic or logical shifts of registers to the right or left. These instructions are useful when
implementing ECC because in many cases polynomials and large integers are shifted to the left or
right by a single or multiple bits. Polynomials are shifted by a multiple number of bits throughout
the finite field inversion operation, and whenever an operation's algorithm employs a windowing
method.
These instructions are also useful when calculating the TNAF or TNAFw representation of
a polynomial. When this representation is computed, large integers are divided by powers of two,
i.e. 2^i, which is the same as shifting the large integer to the right by i bits. The instructions that
achieve multiple-bit shifts allow division by powers of two to be performed efficiently.
The instructions that perform arithmetic and logical shifts by multiple bits are important to
ECC applications. Without them, registers must be shifted by a single bit multiple times, which is
extremely inefficient and significantly increases the computational costs associated with
performing finite field and elliptic curve operations. The benefits of employing windowing
techniques would be significantly reduced, if not eliminated, without multiple-bit shifting
instructions. A sketch of such a multi-word shift follows.
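A hedged C sketch of the kind of operation these instructions accelerate: dividing a large integer by
2^i as a multi-word right shift, assuming 32-bit words stored least significant first and 0 < i < 32:

    void div_pow2(unsigned long *x, int nwords, int i)
    {
        int k;
        for (k = 0; k < nwords - 1; k++)        /* pull the low bits of  */
            x[k] = (x[k] >> i)                  /* the next word into    */
                 | (x[k + 1] << (32 - i));      /* the vacated positions */
        x[nwords - 1] >>= i;
    }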
8. Logic Instructions (AND, EOR, …)
The instructions AND, EOR, NEG, NOT and OR perform and, exclusive-or, arithmetic negation,
logical negation and or respectively. The instructions listed are common to most processors.
ECC is based on binary finite field arithmetic. This arithmetic is slightly different from
integer arithmetic. It requires logical instructions. The instructions AND, EOR, NEG, NOT and
OR are all very useful when performing finite field arithmetic, or clearing and masking consecutive
bits of a finite field element. The clearing and masking of consecutive bits of a finite field element
is a common task when advanced algorithms are used that employ windowing techniques.
The NEG instruction results in zero minus the operand, which is very similar to logically
negating all the bits of a register, which the NOT instruction performs. The instruction is useful
when negating data that spans several registers. The negation of a register and a carry from a less
significant register can be performed in a single cycle using the NEG instruction.
9. Arithmetic Instructions (ADC, ADD, …)
The instructions ADC, ADD, SBC and SUB perform register addition and subtraction, excluding
or including the carry bit of the status register. They are common instructions that most processors
are capable of performing.
ECC requires large integer arithmetic. The SC140 provides instructions such as ADD(A),
SUB(A), ADC and SBC that allow for efficient large integer arithmetic. Large integers refer to
integers that are greater than 32-bits, and therefore span more than one register in the processor.
The instructions ADC and SBC add and subtract the operands, including the carry bit in the
operation. By including the carry bit, at least one additional instruction is avoided when values that
span several registers are involved.
6.1.2 SC140 Cryptographic Cons
The following sections describe the aspects of the StarCore SC140 DSP that have a negative effect on
the execution of cryptographic applications, and some of the properties that a specialized cryptographic
processor would be designed to possess to aid in the efficient execution of cryptographic applications.
1. Register Size
The SC140 registers are 40 bits wide, but MOVE instructions only access 32 bits of memory at a
time. Since the effective size of the registers is only 32 bits, polynomials of GF(2^163) span six
registers, making a simple operation performed on a polynomial require six moves from memory,
six executions of the same or similar instructions to modify the polynomial, and six moves back to
memory to store the result.
As larger polynomials are required to maintain security levels, more registers and
instructions are required to store and perform operations. Specialized cryptographic processors are
designed to have register sizes that match their application. For example, processors designed to
implement ECC have registers large enough to store a finite field element. This may not be the
optimal design because the processor becomes obsolete when the finite field size is increased to
maintain security levels. Furthermore, different applications require different encryption strengths,
so it may be beneficial to design a processor that can implement several different encryption
strengths, i.e. a processor that can handle different sizes of polynomials.
As the computational power of devices increases, stronger encryption is
required. To maintain security levels, the finite field size used in the encryption process is
increased. When the size of the polynomial increases, processors designed for a fixed size
polynomial become obsolete, whereas processors that can implement different sizes of polynomials
are not. They only require new software to implement the stronger encryption. New software is
both less costly and easier to deploy than new hardware.
The suggestion is to design specialized cryptographic processors that are able to handle
different sizes of polynomials so that they avoid becoming obsolete and can provide several
different levels of security. The processors will be slightly less efficient than ones designed for a
specific polynomial size, but will not become obsolete and require replacement as the polynomial
size used in cryptographic applications increases.
Processors that are designed to have registers of width 64 or 96 bits are presently beneficial for
reducing register and instruction counts. With larger registers, fewer move and modify
instructions are required to perform operations, leading to reduced execution times.
Furthermore, the processor does not become obsolete when larger finite field sizes are required to
increase security levels, and it is easily able to provide different levels of security that depend on
the data involved.
2. Memory Bandwidth Limitation (memory moves per cycle)
The maximum amount of data that can be moved between memory and registers is one of the main
performance limiting factors of ECC on the SC140. The maximum throughput of the SC140 is
128-bits per cycle, with some limitations. Only two memory moves are allowed per clock cycle.
To achieve maximum throughput, two moves of 64-bits each must be performed. For a 64-bit
move, the origin and target of the move must be two consecutive 32-bit data locations. The origin
and target must be consecutive memory addresses and registers. In general, the maximum
achieved throughput of the processor is actually 64-bits.
Since the current standard size of a finite field element in ECC is 163 bits, each
polynomial requires six 32-bit memory locations, or six registers, to be stored. This means that the
processor requires at least three clock cycles to transfer a polynomial from memory to registers,
and vice versa. To perform an operation, more clock cycles are required to move data, because
generally, an operation involves more than one polynomial. It would be highly beneficial if the
throughput of the SC140 were larger. With a larger throughput, polynomials could be transferred
between memory and registers much faster, reducing execution times.
The memory bandwidth limitation is significant only because other throughputs of the DSP
are larger. The DALU and AGU of the SC140 are able to perform four and two instructions per
clock cycle respectively. They can perform instructions much faster than data can be moved from
memory. Due to this fact, the SC140’s limitation is the memory bandwidth, and not DALU and
AGU throughput.
3. Specifying Unique Processor Functionality (ex. CLB)
Improved techniques to allow the use of specialized processor functionality from high-level
languages would be beneficial for cryptographic applications. For example, there should be a
means of forcing specialized processor functionality at the C language level. Intrinsic functions
allow the programmer to specify instructions such as ADD, ADC, etc., but not unique processor
functionality.
A means of specifying unique processor functionality is much more important than more
general instructions such as ADD, ADC, etc. The ADD instruction is a general function that most,
if not all, processors are capable of performing. It also has a counterpart in high-level languages,
so the mapping from high-level language to assembly is quite simple. The mapping should be
easily recognized by the compiler because of its simplicity and frequency of use.
However, the CLB instruction is a highly specialized instruction that few processors are
capable of performing. Furthermore, there is no operation in most high-level languages similar to
CLB that a high-level to assembly compiler mapping can be derived from. As a result, there
should be a way of specifying the CLB instruction, and other similar specialized instructions
common to the SC140, from the high-level C language.
6.2 Compiler Optimization Improvements
In §5.4, the performance of the HWA and CGA routines is compared, and some compiler anomalies
are presented. During the comparison process, a set of compiler optimization rules was developed.
The rules attempt to improve the compiler-generated assembly of the CGA routines so that they
resemble the superior performing HWA routines. The rules are suggestions to improve the generated
assembly. Some of the more obvious rules stated in this section may already be implemented by the
compiler. In some cases, applying the suggested rules may violate assembly instruction restrictions
defined in [46]. The rules should not be applied when they lead to invalid assembly source code.
The set of rules define methods to improve the compiler optimizations. Several rules describe
instructions or sets of instructions that were found in the assembly of the CGA routines, and can easily
be improved upon. Other rules are broader; when they are followed, they result in a
direct or indirect performance enhancement. Performance enhancements include a decreased execution
time via increased parallelism, or a reduction in power consumption and more efficient resource
management that may lead to decreased execution time.
Examples of the rules are provided when appropriate. They are taken from the CGA routines,
and the operands involved are generalized. In the examples, assembly instructions presented on a
single line are part of one VLES. Several symbols are used to generalize the assembly code.
Descriptions of each symbol are provided in Table 6-1.
Table 6-1. Assembly Symbolic Description
Symbol    Description
Dn#       Data registers (d0-d15)
Dx#       Data registers (d0-d7)
Rn#       Address registers (r0-r15)
Rx#       Address registers (r0-r7)
SP        Stack pointer
DR#       Data or address registers (d0-d15, r0-r15)
VLES#     Variable Length Execution Set
XXX(A)    DALU or equivalent AGU instruction
x         Integer value less than 32
y         Integer value less than 2^32
In the compiler optimization improvements, the DALU and AGU instructions and registers are
considered different domains. For example, a value in an address register is in a different domain than
a value in a data register.
1. If the instruction space permits, moves of certain fixed values can be avoided, as shown in the example below. The modification saves a MOVE, which is expensive with respect to power and quite often limits the amount of parallelism, because only two MOVE instructions are allowed per clock cycle. The modification should be made as long as it does not affect the execution time. The replacement instructions accomplish the same result while reducing memory bandwidth: they only require DALU bandwidth, which is less costly in terms of bandwidth and power consumption. When a value less than thirty-two is required, the ASLL instruction is not needed, and when a value of zero is required, neither the ADD nor the ASLL instruction is needed.
Before:
VLES1
VLES2
VLES3
VLES4
MOVE.L #<1334, Dn1

After:
VLES1 CLR Dn1
VLES2 ADD #<21, Dn1
VLES3 ASLL #<6, Dn1
VLES4
2. Do not allocate storage space on the stack in a function if the storage space is never used; allocate only the required stack space. Changing the stack space allocated in a function may require the
modification of instructions that write to and read from the stack, and of instructions that read arguments or argument addresses from the stack. The instructions possibly affected are MOVE.L, ADDA and SUBA, as well as any other instruction involving the stack pointer. The two forms affected are shown below.
ADDA #<x, SP
…
SUBA #x, SP
…

or

MOVE.L #y, Rn1
NOP {AGU Stall}
ADDA Rn1, SP
…
MOVE.L #y, Rn1
NOP {AGU Stall}
SUBA Rn1, SP
…
3. Priorities must be assigned in register allocation, so that the optimum register is selected when one is required. Register priorities should follow the rules below, applied in descending order.
3.1. Within a parent function, the registers {d0, d1, r0, r1} should be given the least priority. For example, use {d2-d15, r2-r15} before {d0, d1, r0, r1}, because the latter set of registers is often used to pass arguments to child functions. By using other registers, greater parallelism may be achieved by reducing the setup time required before function calls.
3.2. Within any function, the registers {d6, d7, r6, r7} should be given the least priority in an
attempt to avoid push and pop instructions at the start and end of the function. For example,
use {d0-d5, d8-d15, r0-r5, r8-r15} before the use of {d6, d7, r6, r7}.
3.3. A higher priority must be given to the registers {d0-d7, r0-r7} than to the upper registers {d8-d15, r8-r15}, because use of the upper registers increases instruction sizes, and some instructions cannot involve upper registers.
3.4. The highest priority should be given to the register that already contains the data.
4. Reuse fixed values that are written to registers. This rule is subject to a reaching definition test and
can reduce the number of registers required within a function.
Before:
MOVE.W #<x, DR1
DR1 used as operand
MOVE.W #<x, DR2
DR2 used as operand

After:
MOVE.W #<x, DR1
DR1 used as operand
DR2 instances changed to DR1

DR1 and DR2 must be in the same domain.
5. There should be no fixed preference between AGU and DALU instructions that perform identical tasks (TSTEQ↔TSTEQA, INC↔INCA, SUB↔SUBA, etc.); several tasks can be performed by both the AGU and DALU. Instead, precedence should be given to the domain that contains the operands. If the operands are originally located in both domains, precedence should be given to the domain the result must be in. If the result can be in either domain, precedence should be given to the DALU because it has a larger throughput. The modification can reduce AGU stalls, which occur because a one-cycle delay is required between when a value is moved to the AGU register file and when it is used as a memory address. This rule can also help avoid MOVE.L instructions from data to address registers or vice versa, thus avoiding the creation of extra copies of variables.
Before:
MOVE.L Dn1, Rn1 CLR Dn2
MAX Dn2, Dn1 TSTEQA Rn1
BT <L4

After:
CLR Dn2
MAX Dn2, Dn1 TSTEQ Dn1
BT <L4
6. A conditional branch instruction can be slightly modified and then combined with the subsequent VLES. Depending on the value of the true bit at execution time, the modification may reduce the cycle count.
Before:
BT <L4
DOENSH3 Dn1

After:
IFT BRA <L4 IFF DOENSH3 Dn1
7. Avoid moves when possible. First, minimize MOVE instructions involving memory, then minimize MOVE instructions involving only registers. Copy propagation should be used to accomplish the improvement: use existing copies of data instead of creating redundant ones. As in the example below, to increase the effectiveness of copy propagation analysis, it can be assumed that MOVE instructions addressed with Rn# do not overwrite local variables on the stack, unless the SP is moved into Rn# at some point.
Before:
MOVE.L (SP-20), DR1
… {DR1 not modified}
MOVE.L YYY, Rn1
… {DR1 not modified}
MOVE.L (SP-20), DR2

After:
MOVE.L (SP-20), DR1
… {DR1 not modified}
MOVE.L YYY, Rn1
… {DR1 not modified}
MOVE.L DR1, DR2

YYY – any register or value. DR1 and DR2 are in opposite domains; when DR1 and DR2 are in the same domain, MOVE.L DR1, DR2 becomes TFR(A) DR1, DR2.
8. In some cases, two instructions can be combined into a single, more efficient instruction. The modification should be made as long as negative side effects are avoided. An example involving ADDL2A is shown below; similar optimizations involving, but not limited to, ASL1A and DECEQ(A) are also possible.
Before:
ASL2A Rx1
ADDA Rx1, Rx2

After:
ADDL2A Rx1, Rx2
9. Use fixed values in instructions, avoiding the use of temporary registers whenever possible. This
optimization reduces computational costs, register usage, power consumption and memory
bandwidth.
Before:
MOVE.L #<x, Dn1
ADD Dn1, Dn2, Dn2

After:
ADD #<x, Dn2
10. When a value is used throughout a function, a register should be allocated to store the value for the entire function, eliminating multiple copies of the value and inefficient use of registers. Furthermore, re-allocating registers once the contained data is no longer used in the function allows for greater parallelism.
11. Instructions that belong to the CP should be given the highest priority. The set of instructions that can be included in a VLES is limited; when more instructions are eligible for a VLES than the rules allow, instructions belonging to the CP should be included first.
12. Rearranging move and transfer statements can reduce execution times. By changing the target of a
MOVE instruction, greater parallelism is achieved in some cases where more than one copy of the
same data is required.
Before:
MOVE.L (Rn1)+, Dn1
TFR Dn1, Dn2
ZXT.L Dn2

After:
MOVE.L (Rn1)+, Dn1
ZXT.L Dn1
TFR Dn1, Dn2
13. IFC statements can be combined when simple if statements are implemented. The following optimization works for simple if and else clauses. The first example is generalized for boolean expressions xx1, xx2, xx3, …, and if and else clauses of yy and zz respectively, where the execution of yy is not cumulative. If the execution of yy is cumulative and zz is not, instances of yy and zz and instances of IFT and IFF can be exchanged for proper execution. If the executions of yy and zz are both cumulative, the second example applies, where yy is an addition.
Non-cumulative yy:

if (xx1|xx2|xx3…) yy
else zz

TSTEQ xx1
IFT yy IFF TSTEQ xx2
IFT yy IFF TSTEQ xx3
…
IFT yy IFF zz
Cumulative yy and zz:

if (xx1|xx2|xx3…) yy
else zz

TSTEQ xx1
IFT ADD Dn1, Dn2, Dn2 CLR Dn1 IFF TSTEQ xx2
IFT ADD Dn1, Dn2, Dn2 CLR Dn1 IFF TSTEQ xx3
…
IFT ADD Dn1, Dn2, Dn2 MOVE.L #<x, Dn1 IFF zz
14. Upper registers can be used within functions to store variables instead of the stack. This reduces the number of MOVE instructions executed; MOVE instructions become transfers, which are less expensive with respect to power consumption and memory bandwidth. Greater parallelism can be achieved with transfers than with MOVE instructions, because only two MOVE instructions are allowed per VLES. Furthermore, by employing the improvement, it may be possible to avoid allocating space on the stack for local variables. The improvement can be used when arguments
are read from the stack several times within a single function. It is beneficial to store the
argument in an upper register, and then use transfers from the upper register, instead of several
moves from the stack. In many cases, when moves are changed to transfers, later instructions can
be simplified, eliminating the need for a transfer by directly referencing the origin of the transfer.
15. Allow instructions originally situated after a loop to be placed before the loop, as long as they are independent of the loop. This optimization may already be implemented. It can increase the amount of parallelism within the generated code, and therefore reduce computational costs.
Before:
L4
BT <L4
DOENSH3 Dn1
NOP
NOP
loopstart3
MOVE.L Dn2, (Rn1)+
loopend3
SUBA #x, SP
POP D6
POP D7

After:
L4
BT <L4
DOENSH3 Dn1
NOP
SUBA #x, SP
loopstart3
MOVE.L Dn2, (Rn1)+
loopend3
POP D6
POP D7
16. Removing repetitive instructions that are executed in the same or both domains results in superior assembly code. The improvement is achieved by postponing MOVE and TFR(A) instructions so they are located after other instructions involving the same operand.
Before:
MOVE.L Dn1, Rn1
SUB x, Dn1
SUBA x, Rn1

After:
SUB x, Dn1
MOVE.L Dn1, Rn1
17. There are several hardware loop restrictions defined in the SC140 Reference Manual; rules L.D.2, L.D.3 and L.D.9 specify hardware loop restrictions [46]. Depending on the implementation, a minimum number of cycles is required between certain hardware loop instructions, and to comply with the rules, NOPs are commonly added by the compiler. The addition of NOPs because of the listed rules can be avoided or reduced: decrease the number of iterations of the hardware loop by one, unroll the loop by replacing the NOPs with the first VLES(s) from the loop, and add the remaining VLES(s) after the loop. The modification assumes that it is possible to reduce the loop count without delaying the start of the loop, or that there
are at least two NOPs involved, so that the modification reduces the number of clock cycles in the sequence. The re-ordering of the loop must not violate any other rule.
Before:
VLES0
DOEN Dn1
NOP
NOP
loopstart3
VLES1
VLES2
VLES3
loopend3
VLES4

After:
VLES0 SUB #<1, Dn1
DOEN Dn1
VLES1
VLES2
loopstart3
VLES3
VLES1
VLES2
loopend3
VLES3
VLES4
18. By improving the allocation of registers that are used temporarily within a function, greater parallelism can be achieved: a different register can be used temporarily so that sets of instructions can be parallelized. Blocks in the examples below represent sets of instructions. In the first example, at the start of each block, the register(s) associated with the block is used to store a value totally unrelated to its previous contents. In the second example, the value in register DR1, which is related in Block2a and Block2b, is unrelated to the value in Block1, and register DR2 is only used in Block2b. In both examples, by using DR2 instead of DR1 in Block1, it is possible to parallelize Block1 and Block2.
First example:

Before:
Block1 - DR1
Block2 - DR1
Block3 - DR1 and DR2

After:
Block1 - DR2
Block2 - DR1
Block3 - DR1 and DR2

Second example:

Before:
Block1 - DR1
Block2a - DR1
Block2b - DR1 (not redefined) and DR2

After:
Block1 - DR2
Block2a - DR1
Block2b - DR1 (not redefined) and DR2
DR1 and DR2 must be in the same domain
19. In some cases, obvious simplifications are possible because of repetitive instructions, copy
propagation or rewriting a target without using the original value. The improvement can also apply
to instances of TFR(A), AND, OR, MAX and several other instructions. The first example is a
case of repetitive instructions, where TSTEQ(A) does not accomplish anything, because the true bit is already set by the previous instruction. The second example is a case where the values of Rn1 and Rn2 are found to be equal using copy propagation; therefore, the second MOVE instruction does not accomplish anything.
First example:

Before:
DECEQ(A) DR1
TSTEQ(A) DR1

After:
DECEQ(A) DR1

DR1 and DR2 must be in the same domain.

Second example:

Before:
MOVE.L Rn1, Dn1
MOVE.L Rn2, Dn1

After:
MOVE.L Rn1, Dn1
20. Assembly that computes a value and then compares the result with another value can be improved; the simplification eliminates a clock cycle. Furthermore, the instruction producing the calculated value should be omitted if the result is never used after the comparison.
Before:
SUB(A) DR1, DR2, DR3
TSTEQ(A) DR3

After:
SUB(A) DR1, DR2, DR3 CMPEQ(A) DR1, DR2

DR1 and DR2 must be in the same domain.
21. By analyzing branch targets, some branches can be avoided. It is possible to avoid branching to the last few VLESs of a function by duplicating the last few instructions of the function. The improvement eliminates a branch or conditional branch instruction, which requires four clock cycles to execute. The IFT and IFF instructions in the following example can be exchanged if appropriate.
Before:
IFT BRA L_exit IFF VLES1
VLES2
VLES3
…
L_exit
VLES4
…
RTS

After:
IFT VLES4 IFF VLES1
IFT … IFF VLES2
IFT RTS IFF VLES3
…
L_exit
VLES4
…
RTS
6.3 Compiler Anomalies
The following two sections detail compiler anomalies encountered during the implementation process. There were several unexplained occurrences during implementation: problems with software and hardware simulations were encountered, as well as problems with the compiled code. The first anomaly was discovered when studying the CGA and HWA routines. The second anomaly focuses on a compiler problem where the compiled assembly is incorrect when the highest level of compiler optimization is selected.
6.3.1 Compiler Anomaly A
The compiler produces unexpected assembly when compiling the CGA routines. The generated
assembly code is not a correct implementation of the C code, and appears to be out of order. At the
start of the first hardware loop in each of the CGA routines that were analyzed, there is some
manipulation of the loop count. The C code and compiled assembly are presented as Example 6.1 and
Example 6.2.
Example 6.1. Original C code

ELEMENT i;
for(i = 0; i < count; i++, input++, output++) {

Example 6.2. Erroneous Generated Assembly

MOVE.L (sp-28), r2 MOVE.L (sp-28), d0
MOVE.W #<0, d4
MAX d0, d4 TSTEQA r2
BT <L2
DOENSH3 d4
In the code, the value in (sp-28), which is the loop count, is an unsigned integer. The hardware loop should be enabled to execute (sp-28) times. The generated assembly does not reflect the C code.
Consider the instance when the number of loop iterations is extremely large, requiring all 32 bits. When the 32-bit value is treated as signed by MAX, the maximum of d0 and d4 is d4, i.e. zero. By default, hardware loops must iterate at least once after they are enabled. Therefore, the hardware
loop is executed once. This is not desired, or described, by the C code. Fortunately, with respect to
this application, the loop count is always a small positive integer, so the described case never occurs.
Nevertheless, the generated assembly is incorrect.
When the loop count is assumed signed, which is an incorrect assumption from the C code, the
generated assembly is still incorrect. The hardware loop should be skipped when the loop count is less
than zero. In the assembly code, the hardware loop is executed once. The result of MAX d0, d4 should
determine if the branch to <L2 is taken, not the value in r2.
An alternative to testing the loop count before the hardware loop is enabled is to place a SKIPLS <L2 instruction after the hardware loop is enabled. The SKIPLS instruction eliminates the need for the MOVE.L (sp-28), r2, MOVE.W #<0, d4, TSTEQA r2 and BT <L2 instructions. The only problem with the SKIPLS instruction is that it has some restrictions that usually require the addition of NOPs.
Next, consider the instance when the number of loop iterations is a small positive number, or
zero. In this case, the generated assembly executes as desired. Either the branch is taken to skip the
hardware loop, or the hardware loop executes the desired number of times. For the scope of the thesis,
the number of loop iterations is always a small positive integer. Therefore, the generated assembly
leads to correct results. Nonetheless, the generated assembly does not correctly implement the C code
it is derived from.
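The consequence of the signed treatment can be modelled in plain C. The sketch below is illustrative only: it mimics the effect of the MAX d0, d4 clamp and the iterate-at-least-once property of enabled hardware loops, and shows how an extremely large unsigned count would collapse to a single iteration. Two's complement wrapping is assumed for the signed cast.

#include <stdio.h>
#include <stdint.h>

/* Model of the erroneous code generation: the unsigned loop count is
 * clamped as if it were signed (MAX against zero), the branch skips the
 * loop only when the original count is zero, and an enabled hardware
 * loop always iterates at least once. Illustrative model only. */
static uint32_t iterations_executed(uint32_t count)
{
    int32_t clamped = (int32_t)count;  /* MAX treats the value as signed;
                                          two's complement wrap assumed */
    if (clamped < 0)
        clamped = 0;                   /* MAX d0, d4 with d4 = 0 */
    if (count == 0)
        return 0;                      /* TSTEQA r2 / BT <L2 skips the loop */
    return (clamped == 0) ? 1u : (uint32_t)clamped;  /* DOENSH3: at least one pass */
}

int main(void)
{
    /* A count with the top bit set clamps to zero, so the loop body runs
     * once instead of 2,147,483,648 times. */
    printf("%lu\n", (unsigned long)iterations_executed(0x80000000UL)); /* 1 */
    printf("%lu\n", (unsigned long)iterations_executed(5));            /* 5 */
    printf("%lu\n", (unsigned long)iterations_executed(0));            /* 0 */
    return 0;
}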
The generated code can be greatly improved using the rules stated in §6.2. First, there is no reason to load the loop count into r2; each instance of r2 can be replaced with d0, which eliminates a move and reduces the number of registers required. The MOVE.W #<0, d4 instruction can be replaced by a CLR d4, and the resulting CLR d4 instruction can be combined with instructions in the previous VLES. Finally, the last two VLESs can be combined using IFC instructions. The following shows the assembly code improvements described.
Example 6.3. Improved Assembly Code
MOVE.L (sp-28), d0 CLR d4
MAX d0, d4 TSTEQ d0
IFT BRA <L2 IFF DOENSH3 d4
The suggested compiler improvements enhance the assembly so that it requires only three clock cycles instead of the original five. After the improvements are applied, the same results are achieved as previously. Furthermore, it is easier to understand the assembly, and to realize that
to correctly implement what is outlined by the C code, the TSTEQ d0 instruction must be changed to TSTEQ d4 and placed on a separate line after MAX d0, d4. Then, if the loop count is interpreted as negative, the hardware loop is skipped and not enabled. The result of the change is a more accurate compilation of the original C code.
6.3.2 Compiler Anomaly B
Two other instances were encountered where the compiler generated incorrect assembly; unlike the first anomaly, these resulted in negative side effects. Both instances occurred when compiler optimizations were specified. Without compiler optimizations, the generated assembly was correct. The following describes the format of source code that caused the anomaly.
In general, the compiler has difficulty optimizing code that results in several branch statements. Successive C language control statements, including if statements, can lead to problems. The problems are related to failures to update the register, stack and memory instances of variables. In the compiler-generated assembly, several copies of the same variable are made, but when compiling with optimizations, successive branch statements seem to cause erroneous or out-of-date copies of variables.
The first example of this compiler anomaly was never solved, and a method to work around the
problem, remaining at the C level, was never discovered. The problem occurred in the function
field_mult_wrapper. The large integer multiplication functions require positive integers. Therefore, a
wrapper function was implemented to convert all negative large integers, call the multiplication
function, and then convert the appropriate large integers back to their original values.
Initially, flags were employed to keep track of the large integers that require a sign conversion
after the multiplication is performed. For an unknown reason, the compiler generates assembly that
effectively does not convert one of the input large integers back to its original value. This was
concluded after several print statements were added to the source code. The statements output the
parameters at the start and end of the function. The incorrect assembly results in problems when that large integer is used in future calculations. The result of the first large integer multiplication operation is correct, but the incorrect sign of the one input integer propagates: future results of multiplication operations that involve the integer with the incorrect sign also have the incorrect sign.
Several methods of restructuring the wrapper function were investigated. A series of if statements with a separate case for each of the four possible combinations of large integer signs
was attempted. The series of if statements did not work; it also resulted in erroneous assembly that produces identical results.
Assembly, generated with and without the error causing optimization, was analyzed to
determine where the error is introduced. The process of finding the problem in the source code was
simplified because of the lack of complexity of the function involved and the thorough problem
definition.
As shown in Example 6.4, a register is not set to the proper value before a subroutine call. The erroneous assembly is missing a MOVEU.L #<6, d1, which is present in Example 6.5, immediately before the neg_hwa routine is first called. The instruction is very important because d1 is the register in which the count argument is passed to neg_hwa. The count argument defines the size of the structure, and therefore the number of loop iterations required to complete the negate operation. During the optimization of the generated assembly, the compiler must have incorrectly determined that the move instruction is not required.
Example 6.4. neg_hwa Assembly (Erroneous)

…
JSRD _intmult
TFRA r7, r1
MOVEU.L #<6,d3
MOVE.L d3, (sp-8) MOVE.L (sp-100), r1
MOVEU.L #<12, d1
JSRD _convert_to_smaller_hwa
MOVE.L d1, (sp-4) ADDA #>-64, sp, r0
JSRD _neg_hwa
MOVEU.L #<6,d1 TFRA r6, r0
JSRD _neg_hwa
…

Example 6.5. neg_hwa Assembly (Errorless)

…
JSRD _intmult
MOVE.L r2, (sp-4) MOVE.L (sp-72), r0
MOVEU.L #<6,d0
MOVE.L d0, (sp-8) MOVE.L (sp-100), r1
MOVEU.L #<12,d2
JSRD _convert_to_smaller_hwa
MOVE.L d2, (sp-4) ADDA #>-64, sp, r0
MOVEU.L #<6,d1
JSRD _neg_hwa
MOVE.L (sp-72), r0
MOVEU.L #<6,d1
JSRD _neg_hwa
…
Due to the missing instruction, neg_hwa uses the incorrect value of twelve loop iterations; the negating function assumes the size of the input structure is twelve elements instead of the correct value of six. The neg_hwa routine manipulates the set of memory addresses specified by its parameters. That memory region contains both input large integers of the large integer multiplication function, because they are located consecutively. The result of the single neg_hwa execution is almost
equivalent to changing the sign of both integers. The negate function is called again in a later
instruction, passing the correct arguments to the routine. This changes the sign of one of the large
integers for a second time.
The compiler generates incorrect assembly for the large integer multiplication wrapper
function. By inserting the instruction MOVEU.L #<6,d1 before the first neg_hwa call, the assembly
code is modified to work correctly.
The second example of the anomaly is a combination of inefficient and poor coding that leads to erroneous assembly. The source code for which the compiler produces erroneous assembly is semantically correct, but redundant and inefficient. The error-causing and corrected C code are presented as Example 6.6 and Example 6.7.
Example 6.6. get_TNAFw_rep C Code (Error Causing)

…
*t1_ptr = (*r0_ptr + *r1_ptr * TNAF_TW) % TNAF_TWOW;
if (*t1_ptr > TNAF_TWOW/2) *t1_ptr -= TNAF_TWOW;
if (*t1_ptr & MSB)
{
    /* t1.e[NUMWORD] is negative */
    *t1_ptr = -*t1_ptr;
    curbit = (*t1_ptr>>1) | TNAF_SB;
…

Example 6.7. get_TNAFw_rep C Code (Not Error Causing)

…
*t1_ptr = (*r0_ptr + *r1_ptr * TNAF_TW) % TNAF_TWOW;
if (*t1_ptr > TNAF_SB)
{
    /* t1.e[NUMWORD] is negative */
    *t1_ptr = TNAF_TWOW - *t1_ptr;
    curbit = (*t1_ptr>>1) | TNAF_SB;
…
Two consecutive if statements are used to control the execution sequence in the error-causing example, but the if statements are effectively equivalent. Restructuring the C code leads to improved source code that the compiler correctly converts into assembly. The two versions of C code are shown above. Good coding practices avoid the problematic C code that is incorrectly compiled. Nonetheless, both versions of the code are semantically correct; therefore, in either case, the compiler should produce errorless assembly.
When analyzing the generated assembly, the root of the problem was more difficult to determine. The order of function calls within get_TNAFw_rep and the calculated coefficient values were required to trace the execution sequence and determine the exact problem. By comparing correct output sequences with the incorrect ones, the problem was found: all nonzero coefficients were being set to one. This meant the error in the assembly was in the calculation or the storing of the coefficient.
The erroneous and errorless assembly code that calculates the bit value and the branches taken were studied. It was found that when a specific sequence of branches is executed, an out-of-date instance
of a variable is used in an instruction. By updating the instance of the variable used in the instruction, the assembly code correctly performs the desired task.
The two examples stated detail compiler errors. In the first example, erroneous assembly is
generated by the compiler. Several variations of C code were written that implement the desired task,
but each resulted in incorrect assembly generated by the compiler. An incorrect assumption by the
compiler led to erroneous assembly code. In the second example, poor C coding techniques led to a
compiler error. Nonetheless, the C code is semantically correct. Therefore, it should not have resulted
in erroneous compiler generated assembly.
In both cases, the compiler-generated assembly is incorrect. The use of optimizations when
compiling caused the generation of erroneous assembly. Optimization techniques should not affect the
results obtained when the compiled code is executed. Compiler optimizations should only affect the
computational cost of generated code.
7 Side-Channel Attack Security Issues
SCAs are a set of cryptographic attacks that attempt to break cryptosystems by analyzing execution times, processor power consumption and other leaked information. Recently, much effort has been devoted to better understanding their capabilities and to developing strategies and algorithms to resist them. SCAs are generally used to determine private keys, but can also be used to determine other pertinent information that can be used to break a cryptosystem. The three types of SCAs focused on are the Timing Attack (TA), the Simple Power Attack (SPA) and Differential Power Analysis (DPA).
A TA is relevant to algorithms that operate in data-dependent time. The time required by certain algorithms depends on inputs such as a private key and/or message. A TA exploits such algorithms by collecting timing information and using it to determine the algorithm inputs. As described in [24] and [33], it is possible to determine pertinent information with TAs. §7.1 further describes TAs and states the vulnerabilities of the implementation to this attack. It also estimates the computational penalty suffered if the implementation were modified to resist the attack.
Power consumption based attacks include SPA and DPA. Instructions require different
amounts of power during execution. Power attacks generally determine execution sequences from the
power trace of a processor [10]. SPA exploits algorithms whose execution sequence depends on
pertinent information. The sequence of operations executed by a processor may be determined by
analyzing its power trace. SPA, a technique to foil the attack and a resistant algorithm are presented in
§7.2. DPA is a more powerful attack than SPA, and is based on a statistical analysis of power traces.
In particular, it uses a correlation between power traces and bits of the private key, or other pertinent
data, at specific points of an algorithm. DPA is presented in §7.3.
TA and SPA attempts can break the ECDSA. The attacks can be used to determine several bits
of the nonce. By observing the generation of signatures, and determining the corresponding nonces,
the private key can be determined, because the ECDLP is reduced to a variant of the hidden number
problem [49]. Any leakage of information about the nonces used in the signature process could prove
dramatic [49], and lead to insecurities. Furthermore, the nonce used in the signature generation process
must be generated by a cryptographically secure and unbiased pseudo-random number generator to
eliminate possible insecurities [49].
The ECDSA is not subject to DPA because a random or pseudorandom nonce, k, is involved in
the point-multiplication operation [10]. The private key is involved in a finite field addition, two
multiplications and a reduction, but because of the presence of the nonce, the ECDSA is naturally
resistant to the attack. However, ECC encryption is subject to DPA. §7.3 is directed towards ECC
encryption, and how the implemented finite field and elliptic curve operations must be modified to
become resistant to DPA attacks when used for ECC encryption.
Finally, a different view of DPA, and other techniques developed by the author that disrupt and possibly prevent SCA techniques, are presented in §7.4.
7.1 Timing Attacks
The basic theory behind TAs is that the execution time is dependent on input parameters [24].
Therefore, by recording the execution time required to perform operations, inputs to the operation can
be determined. By modifying operations so that they execute in fixed time, TA attempts are thwarted.
In general, it is very difficult to modify functions so that they operate in fixed time. Implementations
may be TA resistant on one processor, but not on another [24].
The point-multiplication operation must be made TA resistant even though a nonce is involved.
This is because even partially known nonces can lead to security risks. Any knowledge of information
related to the value of the nonces used in the ECDSA can lead to security issues [49]. Below is a brief
description of the modifications required to achieve TA resistance. The actual implementation details
are not included, only the computational penalty is focused on.
With respect to the implemented operations, two main functions are subject to TAs. Some less
expensive operations may be subject to TAs, but their cycle counts are negligible relative to overall
cycle counts, and therefore are not analyzed. The two operations, which must be modified to execute
in fixed time so they foil TA efforts, are finite field inversion and elliptic curve point-multiplication.
First, the implemented finite field inversion operation executes in data dependent time, and
must be modified to execute in fixed time. To avoid TA attempts, the inversion operation must be
modified to require the worst-case inversion time [24], independent of the input polynomial. Table 7-1
states the original and TA resistant case, along with the performance penalty incurred. The original
case in the table is actually the average cycle count of the inversion operation.
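As an aside, the padding idea can be sketched in C with a toy routine whose natural running time is data dependent. The example below is not the thesis inversion code; it simply pads a bit-normalization loop to a fixed iteration count, which is the same structural change the inversion operation requires. Real constant-time code must also ensure the dummy work has the same cost profile as the genuine work and is not removed by the compiler.

#include <stdio.h>
#include <stdint.h>

static volatile uint32_t scratch;    /* volatile so the dummy work survives optimization */

/* Toy fixed-time routine: always performs exactly 32 loop steps, doing
 * dummy shifts once the data-dependent work is finished. */
static uint32_t normalize_fixed_time(uint32_t x)
{
    int i, done = 0;
    for (i = 0; i < 32; i++) {
        if (!done && (x & 0x80000000UL))
            done = 1;                /* real work is complete */
        if (done)
            scratch <<= 1;           /* dummy step to pad the running time */
        else
            x <<= 1;                 /* real step */
    }
    return x;
}

int main(void)
{
    printf("%08lx\n", (unsigned long)normalize_fixed_time(0x00000001UL)); /* 80000000 */
    return 0;
}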
The other implemented operation that a TA can target is elliptic curve point-multiplication. Both the TNAF and TNAFw algorithms execute in data-dependent time. The estimated performance of the TA resistant point-multiplication operations is presented in Table 7-1. To fix the time required by both algorithms, an addition must be executed in every loop iteration. An example of a modified τ-adic algorithm, taken from [22], is listed as Algorithm 7-1. It is TA resistant when implemented
correctly. The implemented TNAFw point-multiplication can be made both TA and SPA resistant by
employing an SPA resistant algorithm presented in [22]. The estimated performance penalty incurred
by employing such an algorithm is also included in Table 7-1.
Table 7-1. Estimated TA Resistant Performance Penalties (cycle counts)

Description        Original Case   TA Resistant Case   Performance Penalty
poly_inv_eff       16,730          18,290              1,560 (9.3%)
TNAF_point_mul     1,670,000       4,499,000           2,829,000 (169.4%)
TNAFw_point_mul    1,193,000       4,729,000           3,536,000 (296.4%)
The estimated performance penalty and TA resistant computational cost of the TNAFw point-multiplication function are larger than those of the TNAF function because of pre-computations. The pre-computations do not have to be TA resistant, but because both TNAF and TNAFw representations require m bits, m point additions are required during the double and add portion of the algorithm. In other words, the decreased Hamming weight of the TNAFw representation of k is no longer beneficial: the Hamming weight does not affect the performance of the operation, only the length of the representation does. Due to the pre-computed LUT required by the TNAFw point-multiplication operation, its computational cost exceeds that of the TNAF point-multiplication.
The estimated signature generation performance penalty incurred due to modifying the
implementation to be TA resistant is identical to the values presented in Table 7-2 of §7.2.
Algorithm 7-1. TA Resistant TNAF Point-Multiplication (Q[0] = k_TNAF⋅P) [22]
Input: k = (k_(m-1), k_(m-2), …, k_1, k_0)_τ-adic, P
Output: Q[0] = (x, y)
1. Q[0] = ∞ (the point at infinity)
2. For i = (m-1) downto 0
2.1. Q[0] = τ⋅Q[0], i.e. x = x^2, y = y^2
2.2. Q[1] = Q[0] + P
2.3. Q[0] = Q[k_i]
3. Return (Q[0])
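The control flow of Algorithm 7-1 can be sketched in C using plain integers as a toy stand-in for elliptic curve points: τ is modelled as doubling and point addition as integer addition, so the routine computes k⋅P numerically. The digits are restricted to {0, 1} as in the algorithm statement; what carries over is that the τ step, the addition and the selection execute on every iteration, independent of the key bits.

#include <stdio.h>

/* Toy always-add multiplication mirroring Algorithm 7-1's structure. */
static unsigned long always_add_multiply(unsigned long k, unsigned long p, int bits)
{
    unsigned long q[2];
    int i;
    q[0] = 0;                        /* Q[0] = identity element */
    for (i = bits - 1; i >= 0; i--) {
        q[0] = q[0] * 2;             /* stand-in for Q[0] = tau(Q[0]) */
        q[1] = q[0] + p;             /* Q[1] = Q[0] + P, computed every pass */
        q[0] = q[(k >> i) & 1];      /* Q[0] = Q[k_i], a selection only */
    }
    return q[0];
}

int main(void)
{
    printf("%lu\n", always_add_multiply(23, 7, 8)); /* prints 161 = 23*7 */
    return 0;
}

Note that the data-dependent array index in the final step must itself be implemented without a branch; the masking technique of §7.4.1 is one way to do that.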
7.2 Simple Power Attacks
For an implementation to be SPA resistant, the instructions executed by the processor cannot depend on
the input parameters [10]. Any dependence between the input parameters and the executed instructions
is visible in the power trace of the processor and can be exploited.
A detailed analysis of the power trace may reveal the instructions executed by the processor. It
is assumed that the attacker can trace the power consumption and therefore can determine the executed
instructions. Furthermore, the attacker is able to determine the input parameter if a dependency is
present. The TNAF and TNAFw point-multiplication operations implemented are susceptible to SPAs.
Algorithm 7-1, which is presented in the previous section, is an example of an SPA resistant point-
multiplication algorithm. As previously mentioned, an SPA resistant point-multiplication algorithm
that can be used to modify the TNAFw point-multiplication operation is presented in [22]. The
performance loss due to modifying either point-multiplication operation implemented is identical to the
TA related estimations presented in Table 7-1.
The computational cost of the signature generation process increases severely when SPA resiliency is required. Computing a point addition in every loop iteration significantly increases the running time of the point-multiplication operation, and therefore of the signature generation process. Table 7-2 shows the computational cost of modifying the signature generation process so that it is SPA resistant, as well as TA resistant. Signature generation processes employing both the TNAF and TNAFw point-multiplication operations are presented, because the TNAF-based signature generation process becomes the less computationally expensive of the two when SPA resiliency is added. The TNAFw technique normally outperforms the TNAF technique, but the opposite is true when TA and SPA resistant measures are employed, because the Hamming weight no longer affects the computational cost and the TNAF technique does not require a LUT.
Table 7-2. Estimated SPA Resistant Signature Generation Performance Penalty (cycle counts)

Code Description (Point-Multiplication)   Original Case   SPA Resistant Case   Performance Penalty
Signature Generation (TNAF)               1,806,000       4,634,000            2,828,000 (156.6%)
Signature Generation (TNAFw)              1,329,000       4,864,000            3,535,000 (266.0%)
Coron states that for an algorithm to be SPA resistant, there should be no branch instructions that depend on the input parameters [10]. This statement is not entirely true. It is proposed that by employing the techniques described in §7.4.1, and maintaining identical execution sequences for each branch, SPA resistance is maintained. SPA efforts targeting the execution sequence after the branch is taken are thwarted because the sequences are identical, which leaves the implementation of the branch to either if case crucial in maintaining SPA resistance. If the branch is not implemented correctly, the attacker is able to determine the input parameters and break the cryptosystem.
Consider the algorithms being implemented on the SC140. By employing the parallel
processing capabilities of the processor, it is proposed that the attacker is unable to determine the
branch the processor takes. The input parameters can be masked from the attacker by executing
branches in parallel. This is achieved with conditional instructions illustrated in §6.1.1, which are
conditionally executed, or can be used to define a subset of instructions to be executed within a VLES.
Providing the subsets of instructions are identical, but refer to different addresses, registers and/or data,
the attacker is unable to determine the subset executed. For a detailed explanation and example, see
§7.4.1.
The parallel processing capabilities of the SC140 mask the instructions executed. Provided care is taken, the attacker is unable to determine the exact instructions executed, and therefore the branches taken and the bit-values of pertinent data.
7.3 Differential Power Analysis
As stated previously, the ECDSA is not subject to DPA type attacks because of the nonce used in the
algorithm. However, other ECC cryptosystems such as encryption are subject to this type of attack.
SPA resistant algorithms reduce the dependency of power traces on bit values of the private
key or other pertinent data, such that they are hidden from attackers. Unlike SPA, DPA is a statistically
based attack that requires several samples to break a cryptosystem. Several power traces are used to
increase the correlation between power traces and bit values, allowing the determination of bit values
by the attacker. An excellent description of the attack on the point-multiplication, and several DPA
countermeasures specifically geared towards Koblitz curves are provided in [22].
DPA attacks are foiled by adding some type of randomness, or uncertainty, to the operation in question, in an attempt to randomize power traces and eliminate correlations between
accumulated power traces and the private key or other pertinent data. The countermeasures listed
below focus on the τ-adic representation of k in the point-multiplication operation, Q = k⋅P.
The first DPA countermeasure stated in [22], key masking with localized operations, is based
on Formula 3.6. Basic algebra can be used to formulate a set of functions that include several powers
of τ, and are equated to zero. By adding a randomly selected function from the set to randomly
selected consecutive coefficients of k, k′ is computed. The two polynomials, k and k′, are equivalent.
Therefore, k′ can be used to compute Q. Each time the point-multiplication operation is executed, a
new k′ is computed, thus adding randomness to the power traces and foiling DPA attacks.
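The equivalence that the masking relies on can be checked numerically. The sketch below assumes Formula 3.6 is the standard Koblitz curve identity τ^2 - μτ + 2 = 0, so that the pattern (-2, μ, -1) scaled by any s and added at any position i contributes exactly zero; it evaluates the digit string at the complex root τ before and after masking. The digit values, s and i are arbitrary illustrations.

#include <stdio.h>
#include <complex.h>

/* Evaluate a tau-adic digit string at a complex value of tau. */
static double complex eval(const int *d, int len, double complex tau)
{
    double complex v = 0, t = 1;
    int i;
    for (i = 0; i < len; i++) {
        v += d[i] * t;
        t *= tau;
    }
    return v;
}

int main(void)
{
    int d[8] = {1, 0, -1, 0, 0, 1, 0, 0};
    int mu = 1, s = 2, i = 3;                    /* arbitrary example values */
    double complex tau = (mu + csqrt(-7.0 + 0.0 * I)) / 2.0; /* root of t^2 - mu*t + 2 = 0 */
    double complex before = eval(d, 8, tau);
    /* Add s * tau^i * (-tau^2 + mu*tau - 2) = 0 to the representation. */
    d[i + 2] += -1 * s;
    d[i + 1] += mu * s;
    d[i]     += -2 * s;
    printf("difference = %g\n", cabs(eval(d, 8, tau) - before)); /* ~0 */
    return 0;
}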
The second DPA countermeasure, Random Rotation of Key (RRK), is based on Formula 3.8.
By exploiting the formula, and the simplicity of multiplying points by τ, randomness can be added to
the point-multiplication operation. A random integer r, where 0 ≤ r ≤ m-1, is selected. Then a
modified base point, τr⋅P, and a version of the key cyclically shifted by r-bits are used in the point-
multiplication operation. The operation results in the point Q, where Q = k⋅P. The security of this
countermeasure is transferred to the computation of the modified base point and cyclically shifted key.
A secure computational method is required to compute the values. Otherwise, the implementation
remains subject to DPA attacks.
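The rotation property can likewise be illustrated with a toy model in which τ is multiplication by 2 modulo 2^m - 1, so that τ^m = 1, mirroring the order-m property that Formula 3.8 presumably provides for the Frobenius map. Rotating the key digits by r while pre-multiplying the base point by τ^r leaves the product unchanged. The values of m, r, the key and the point below are arbitrary; real RRK operates on curve points, not residues.

#include <stdio.h>

#define M   16
#define MOD 65535UL                  /* 2^16 - 1, so that "tau"^M = 1 */

/* Toy point-multiplication: Q = sum of k_i * tau^i * P, with tau = x2 mod MOD. */
static unsigned long point_mul(const int *k, unsigned long p)
{
    unsigned long q = 0, t = p;
    int i;
    for (i = 0; i < M; i++) {
        if (k[i])
            q = (q + t) % MOD;
        t = (t * 2) % MOD;           /* advance t to tau^(i+1) * P */
    }
    return q;
}

int main(void)
{
    int k[M] = {1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,1}, kr[M];
    unsigned long p = 12345, pr = p;
    int r = 5, i;
    for (i = 0; i < M; i++)
        kr[i] = k[(i + r) % M];      /* key cyclically rotated by r digits */
    for (i = 0; i < r; i++)
        pr = (pr * 2) % MOD;         /* modified base point tau^r * P */
    printf("%lu %lu\n", point_mul(k, p), point_mul(kr, pr)); /* two equal values */
    return 0;
}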
A third countermeasure is presented in [28]. The countermeasure modifies the reduction of the TNAF representation. Joye and Tymen propose that by reducing the TNAF representation of k, involved in the point-multiplication Q = k⋅P, using ρ⋅(τ^m - 1), where ρ is randomly selected, DPA attacks are thwarted. The suggested range of ρ for m = 163 leads to a TNAF representation of k involving 200 coefficients, resulting in approximately a 25% increase in the computational cost. The implementation of the countermeasure is simple because no additional routines are required [28]. Furthermore, ρ affects the entire representation of k [28]; thus the countermeasure is much more successful in thwarting DPA attacks.
7.4 SCA Countermeasures specific to Koblitz Curves and the SC140
Algorithms that are resistant to SPA are well known and documented, whereas techniques that foil DPA are not, and are more easily scrutinized. Furthermore, SPA resistant algorithms are easily modified to protect operations involving private keys and other pertinent information in different cryptosystems. DPA countermeasures are more implementation specific. They are generally based on the
implementation and specific properties of the underlying problem, where some degree of randomness,
or uncertainty, can be added without large performance penalties. Countermeasures for a single
cryptosystem are not applicable to all implementations. Two examples of countermeasures that do not
apply to the thesis implementation are provided.
First, a fast point-multiplication method that is immune to DPA is proposed in [51]. The
algorithm is based on Montgomery’s method. For ECC implementations that use the point addition
and doubling technique of point-multiplication, the proposed DPA resistant method cannot be
employed.
The second countermeasure is from [10], where Coron presents three countermeasures to DPA
attacks, two of which are proven vulnerable to DPA in [51]. The remaining DPA countermeasure is
based on randomizing the base point of the point-multiplication. The method is only valid for ECC
implementations using projective coordinates. Implementations that use the affine coordinate system
cannot employ the countermeasure.
DPA countermeasures are generally based on a strength of the given implementation. For example, Koblitz curves are attractive because of the τ-adic representation, which allows point doubling to be replaced with the much less computationally expensive execution of two finite field squarings. Another strength of Koblitz curves is that equivalent τ-adic representations can be computed very easily by exploiting Formula 3.6. DPA countermeasures that use Koblitz curve properties are presented in the previous section. Two proposed SCA countermeasures are presented below. They exploit strengths of the SC140 and Koblitz curves respectively to foil attacks. A notion of sample entropy is also introduced.
7.4.1 Parallel Processing Countermeasure
Parallel processing capabilities of processors may foil SPA and DPA attacks. It is proposed that by
executing instructions in parallel, and by exploiting the functionality of the SC140, the actual
instructions performed are masked. Parallelism of the SC140 can be used at pertinent points in the
execution of the point-multiplication operation to foil SPA, and possibly eliminate correlations between
accumulated power traces and pertinent information.
Several arrangements of parallel instructions achieve identical goals. As a simple example, consider implementing an if statement where an address depends on the value of a coefficient. The assembly implementation, presented as Example 7.1, is proposed to be SPA resistant when care is
exercised. In the example, the data register d0 contains the value of the current coefficient ki. The
address registers, r1 and r2, contain the two possible addresses, and the address register r3 is a dummy
register. The two possible addresses designate the locations of P and –P, which are involved in a point-
multiplication operation. The goal of the example assembly is to securely transfer the correct address
to register r0.
Example 7.1. Parallel Assembly Implementation
…
CMPEQ.W #<0, d0
IFT TFRA r1, r0 IFF TFRA r1, r3
IFT TFRA r2, r3 IFF TFRA r2, r0
…
In the example, the address transferred into r0 depends on the current coefficient located in d0.
In either case, the same instruction is executed, and only the target of the instruction depends on the
true bit of the status register. Assuming there is no inherent processor preference between different
operands of the TFRA instruction, and instruction sets grouped with IFF and IFT, the power trace of
the code is independent of the true bit. It is proposed that the implementation of the if statement is TA
and SPA resistant.
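The same select-without-branching idea can be expressed at the C level with masking. The sketch below is a generic constant-time pointer select, not the thesis code: both candidate addresses enter the computation, and only a mask derived from the coefficient bit decides which one is returned.

#include <stdio.h>
#include <stdint.h>

/* Constant-time select: returns a when bit is 1, b when bit is 0, with
 * no data-dependent branch. Analogous in spirit to Example 7.1. */
static uintptr_t ct_select(uintptr_t a, uintptr_t b, uint32_t bit)
{
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)(bit & 1); /* all ones iff bit == 1 */
    return (a & mask) | (b & ~mask);
}

int main(void)
{
    int p = 1, minus_p = -1;   /* stand-ins for the addresses of P and -P */
    int *sel = (int *)ct_select((uintptr_t)&p, (uintptr_t)&minus_p, 0);
    printf("%d\n", *sel);      /* prints -1: minus_p was selected */
    return 0;
}

As with the assembly version, what the compiler actually emits must be inspected; a sufficiently clever compiler may reintroduce a branch.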
A more sensitive analysis of the assembly is required when considering DPA attacks. It is
assumed that the implementation of the transfer statements is DPA resistant, because the power traces
are unlikely to be different in either case. However, DPA attacks may be successful against the
comparison instruction preceding the register transfers. Furthermore, the remaining part of the
algorithm, where either P or -P is used in the point addition, may also be subject to a DPA attack.
It is proposed that branching while maintaining SPA resiliency is possible. By employing the same technique as that used in Example 7.1, branch instructions that target different memory addresses can be grouped with IFT and IFF instructions. By doing so, and assuming there is no inherent processor preference between different operands of a branch instruction, or between instruction sets grouped with IFF and IFT, attackers are unable to determine the branch taken.
The proposed theory of exploiting parallel processing to mask instructions likely foils TA and
SPA attacks. The resiliency of the countermeasure against DPA attacks must be studied further before
its effectiveness is determined.
7.4.2 Koblitz Curve Specific Countermeasure
Koblitz curve implementations are attractive because of the superior performance of the point-
multiplication operation. By using τ-adic representations, point doubling is replaced, resulting in a
significant reduction in computational costs. Similar to the countermeasures presented in [22], the countermeasure below exploits a property of Koblitz curves and is proposed to foil DPA attacks. It exploits the inexpensiveness of multiplying elliptic curve points by τ.
The countermeasure is loosely based on a finite field exponentiation algorithm. It is preferable
that the τ-adic coefficients representing k are evenly distributed. When performing a point-
multiplication operation, the τ-adic representation of k is divided into r groups of g coefficients, where
g = m/r. The point-multiplication between each group and base point P is performed in a random
order. Then, the resulting points are multiplied by the corresponding power of τ, and summed,
resulting in the point Q.
The algorithm applies to point-multiplications over Koblitz curves. It can be generalized to apply to other elliptic curves, but the computational cost becomes extremely large: each pair of polynomial squaring operations would be converted to a point doubling operation, which is much more computationally expensive, so the performance overhead would most likely be unacceptable. A τ-adic representation of the polynomial k is required by the algorithm. It is suggested that a NAF-related representation of the polynomial be avoided; otherwise the distribution of coefficients favors zero, which may lead to lessened security and possible attacks because of the structure of NAF representations. Furthermore, if an SPA resistant algorithm is employed, a NAF representation is of no advantage.
Algorithm 7-2. Proposed DPA Resistant τ-adic Point-Multiplication
Input: k = (k_(m-1), k_(m-2), …, k_1, k_0)_τ-adic, P
Output: Q = k⋅P
1. Compute k^i, for 0 ≤ i ≤ r-1,
   where k = (k^(r-1), k^(r-2), …, k^1, k^0)_τ-adic and k^i = (k_((i+1)⋅g-1), k_((i+1)⋅g-2), …, k_(i⋅g+1), k_(i⋅g))_τ-adic
2. Compute k^i⋅P, for 0 ≤ i ≤ r-1, following a random sequence of i values.
3. Compute Q = Σ τ^(i⋅g)⋅k^i⋅P, for 0 ≤ i ≤ r-1
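The structure of Algorithm 7-2 is sketched below with the same toy model used earlier in this chapter, where τ is multiplication by 2 modulo 2^m - 1: the digits of k are split into r groups of g, the group products are computed in a (here hard-coded) stand-in for a random order, and step 3 then shifts and sums the partial results. Only the control flow is meaningful; the arithmetic is not curve arithmetic.

#include <stdio.h>

#define M   16
#define R   4
#define G   (M / R)                  /* g = m / r */
#define MOD 65535UL                  /* 2^16 - 1 */

/* Multiply one group of G digits by P: sum of d_j * tau^j * P. */
static unsigned long group_mul(const int *d, unsigned long p)
{
    unsigned long q = 0, t = p;
    int j;
    for (j = 0; j < G; j++) {
        if (d[j])
            q = (q + t) % MOD;
        t = (t * 2) % MOD;
    }
    return q;
}

int main(void)
{
    int k[M] = {1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,1};
    int order[R] = {2, 0, 3, 1};     /* stand-in for a random sequence of i */
    unsigned long p = 12345, part[R], q = 0, t;
    int i, j;
    for (j = 0; j < R; j++) {        /* step 2: k^i * P in random order */
        i = order[j];
        part[i] = group_mul(&k[i * G], p);
    }
    for (i = 0; i < R; i++) {        /* step 3: Q = sum tau^(i*g) * k^i * P */
        t = part[i];
        for (j = 0; j < i * G; j++)
            t = (t * 2) % MOD;
        q = (q + t) % MOD;
    }
    printf("%lu\n", q);              /* equals the direct computation of k*P */
    return 0;
}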
The crucial step of the algorithm is computing k^i⋅P following a random sequence of i-values. It is speculated that this step adds randomness to the power trace of the algorithm, foiling DPA attacks.
To further analyze the proposed countermeasure, a concept of sample entropy is introduced.
A new concept may play a role in the comparison of DPA countermeasures. As DPA attacks
mature, requiring fewer samples to be effective, an idea of sample entropy, or uncertainty per sample,
may become important when comparing the strength of different DPA countermeasures. The amount
of entropy introduced to an attacker monitoring a cryptosystem is important.
For example, consider a cryptosystem that is SPA resistant and employs the RRK DPA countermeasure from [22]. Kerckhoffs' principle, the assumption that the attacker knows everything about the cryptosystem except the key, applies. The sample entropy of the RRK DPA countermeasure is fixed at m, where the cryptosystem uses GF(2^m). Currently, the standard m-value is 163. Alternatively, the sample entropy and overhead of Algorithm 7-2 are controllable.
The sample entropy introduced by Algorithm 7-2 varies with r and g. For small r-values, the probability of k^i = k^j for i ≠ j and 0 ≤ i, j ≤ r-1, referred to as a k^i collision, is approximately zero, resulting in a sample entropy of r! (for example, r = 8 gives 8! = 40,320). The estimated sample entropy and overhead of the algorithm are controlled by selecting various r-values, and are stated in Table 7-3.
The computational overhead of Algorithm 7-2 depends on the implementation of step 3 and the
selection of r. The number of point addition and polynomial squaring operations can be estimated
using Table 7-3, assuming the optimum technique of performing the step is employed. The number of
overhead point addition operations grows linearly with r. The number of polynomial squaring
operations is not of great importance, because the operation is computationally inexpensive.
Table 7-3. Estimated Sample Entropy and Overhead of Algorithm 7-2

Sample Entropy
  Small r, P(k^i collision) = 0:  r!
  Large r, P(k^i collision) = 1:  r! / ((r/2^g)!)^(2^g)

Computational and Memory Overhead of Algorithm 7-2
  Point addition operations:      r - 1
  FF squaring operations:         2⋅g⋅(r - 1)
  Memory requirement (points):    r
The memory overhead of the algorithm, excluding the point-multiplication operations, is
reasonable for implementations without strict memory restrictions. As presented in Table 7-3, an
additional r points must be stored in memory. The number of points cannot be reduced by combining
steps 2 and 3, because this leads to DPA vulnerabilities.
Several modifications can be made to the algorithm to increase its performance. For example,
a single LUT can be used during all of the point-multiplications in step 2. The LUT can be computed
for the first point-multiplication, and used in all remaining operations. The expense of pre-computing
the LUT is therefore reduced because it is used multiple times.
Algorithm 7-2 is a proposed DPA countermeasure; steps must be taken to resist TA and SPA attacks as well. A resistant point-multiplication algorithm, such as Algorithm 7-1, is required to foil those attacks. The sample entropy can be increased by selecting large r-values, but it cannot be used to avoid TA and SPA attempts, and large r-values result in unacceptable computational and memory overheads. Moreover, the sample entropy cannot grow larger than the key entropy of 2^m. For large r, e.g. 163, 82 and 55, the sample entropy can be estimated using Table 7-3; the maximum sample entropy possible is one-sixteenth the key entropy of 2^m.
The proposed DPA countermeasure, presented as Algorithm 7-2, takes advantage of the
Frobenius mapping, stated in Formula 3.7. The randomness introduced to the power trace by the
countermeasure is easily controllable, at the expense of computational and memory overhead. It is
proposed that the randomness introduced may foil DPA attacks, but this must be investigated further.
Additional analysis of the algorithm is required to determine the effectiveness of the proposed
technique in resisting DPA attacks.
8 Discussion and Conclusions
The thesis presents the implementation, optimization and analysis of the ECDSA on the StarCore SC140 DSP. The ECDSA and the algorithms used to implement each of the finite field, large integer and elliptic curve operations are presented in chapter 3. Following that, the implementation and performance of the operations are described. The results are compared to previously published results, and the performance of the hand-written and compiler-generated assembly is compared. The memory requirements of the implementation are examined. The SC140 is analyzed for cryptographic applications, as is the ability of the SC140 compiler to generate efficient assembly. Furthermore, several optimization improvements that the compiler could employ are stated. Finally, security issues are examined, focusing on resisting side-channel attacks and proposing two possible countermeasures to the attacks.
8.1 Thesis Summary
A Koblitz curve over GF(2^163) was selected and used to implement the ECDSA. The focus of the
implementation is to minimize execution time, while targeting portable devices by maintaining an
acceptable code size and minimizing power consumption. Optimal finite field and elliptic curve
algorithms were sought for implementation. The algorithms are listed and described in chapter 3, as
well as the implementation philosophy.
First, a working version of the ECDSA was obtained. Then, inefficient operations were
methodically replaced with superior performing and thoroughly tested operations. The implementation
and integration of the finite field, large integer and elliptic curve operations is outlined in chapter 4.
The performance of the implemented operations is compared with published results in chapter 5. The execution times of the finite field operations, the elliptic curve operations, and the signature generation and verification processes are presented and compared. The performance comparison of the signature generation and verification processes shows that the performance of the implementation is adequate, and that the processes result in acceptable delays.
Coding guidelines that were used when implementing the assembly and C code are listed. The
guidelines are a set of suggestions that result in computationally and memory efficient hand-written
assembly and compiler-generated code. The performance of the implementation, compiled with CGA
and HWA routines, is presented and compared. The performance of the code shows the advantage
realized by assembly implementations instead of C.
The SC140 and associated compiler are analyzed with respect to cryptographic applications in
chapter 6. The pros and cons of the processor are listed and described in detail. By studying the HWA
and CGA routines, a list of suggested compiler optimization improvements was gathered. The
improvements are specific to the CGA routines, and state rules that if employed, significantly improve
the compiler-generated assembly. To conclude chapter 6, two compiler anomalies encountered during
the implementation process are stated.
Security issues due to side-channel attacks are investigated in chapter 7. TA, SPA and DPA attacks are
briefly described, and the implemented operations that are susceptible to the attacks are identified. An
algorithm that resists SPA and TA attempts is included, as well as estimated performance penalties for
all susceptible operations. Several countermeasures that attempt to foil DPA attacks are presented,
along with two SCA countermeasures that exploit strengths specific to the SC140 and Koblitz curves.
Further analysis of these countermeasures is required to determine their true effectiveness against such
attacks.
8.2 Limitations of the Research and Implementation
The primary limitations of the thesis stem from the curve implemented and the target processor. They
are explained in the following paragraphs.
The first limitation is that only a single finite field size, GF(2^163), was investigated. This is the
currently standardized finite field size, so the choice is presently valid, but the execution times
associated with larger finite fields must be investigated as well. According to Moore's law, which has
held with surprising accuracy for thirty years, average computing power doubles every eighteen
months. This exponential growth requires increasing cryptographic strength to maintain acceptable
security, and with respect to ECC, strength is increased by employing larger finite fields. The
implemented code is written so that alternative finite field sizes can be tested with reduced difficulty,
although exercising this versatility has a negative effect on performance. By testing larger finite field
sizes, the viability of implementing higher-security ECC and the ECDSA on the SC140 can be
determined.
A second limitation of the research is that only one elliptic curve was investigated. There are several
types of elliptic curves, each with positive and negative aspects. Koblitz curves are attractive for
implementation because specific properties allow efficient point-multiplication, and most other curves
do not perform as well. However, as stated in [62], the same properties of Koblitz curves may lead to
efficient attacks that are not possible on other curves, so it is important to investigate alternative elliptic
curves.
The final limitation of the research is the target processor: only the SC140 was used for
implementation. The target processor affects the performance of the application because the
computational costs of the implementation are constrained by the instruction set and architecture of
the SC140, which have both positive and negative properties with respect to cryptographic
applications. Alternative high-end DSPs, with slightly different instruction sets and architectures, will
perform differently and may be better suited to executing cryptographic applications.
8.3 Conclusions
The ECDSA is an efficient digital signature technique, and it was implemented on the SC140.
Previous implementations of the ECDSA generally target general-purpose processors. The SC140 is an
interesting target processor because of its intended applications: it targets third-generation wireless
and wireline communication devices, which can benefit from digital signatures.
The implementation employs optimal algorithms for the finite field and elliptic curve operations in an
attempt to minimize the computational costs of the signature generation and verification processes.
The implementation was done primarily in the C programming language, with some basic routines
written in both C and assembly.
In most cases, the computational costs of the implemented operations and of the signature generation
and verification processes are greater than other published results. However, the costs are comparable,
and they can be improved by fixing parameters such as window widths, the width-w value and the
finite field size, as well as by implementing additional functionality in assembly. The performance of
the signature generation and verification processes is adequate: when the SC140 operates at its
maximum clock speed of 300 MHz, the processes lead to delays of approximately 4.43 and 8.63
milliseconds respectively. Delays of this magnitude are acceptable and would not be noticed by a user.
A comparison of the C and assembly routines was completed, focusing on the performance
enhancement and memory requirement reduction achieved by implementation at the assembly level.
The HWA routines recorded significant computational cost reductions, with speedups that vary
depending on the input values and the task. For tasks whose execution times depend on their inputs,
the recorded speedups range widely, from 141% to 1550%. A more consistent range was achieved for
routines whose execution times are independent of their inputs, with speedups between 233% and
532%. Furthermore, the memory requirements of the more computationally efficient HWA routines
are at most half those of the CGA routines.
High-level operations also benefit from HWA routines: signature generation and verification
computation reductions of 23.6% and 23.7% respectively were achieved by employing HWA routines
instead of their less efficient CGA counterparts. Superior-performing low-level HWA routines thus
translate into reduced high-level computational costs.
The benefits of assembly implementation include reductions in both computational costs and memory
requirements. It is assumed that implementing larger tasks, or even the entire signature generation and
verification processes, in assembly would lead to speedups and memory reductions comparable to
those achieved for the low-level routines; assembly implementation of the ECDSA is therefore deemed
beneficial. For target devices with extremely tight memory restrictions, the memory requirement of
the implementation, which is around 36,000 bytes, may be unacceptable. By implementing the
ECDSA entirely in assembly, the memory requirements are reduced, possibly halved, and the
computational cost decreases significantly; the reduced memory requirements should then meet the
limitations of target devices. Where it is only desired to write a limited number of simple functions in
assembly, as was done in the thesis, computational costs and memory requirements can still be
reduced, but the projected reduction is not as significant as that achieved with a full assembly
implementation.
Several guidelines were found during the implementation process. The guidelines aid in the creation
of efficient hand-written assembly and of C code that compiles to superior-performing assembly, and
they list several techniques that were found to improve performance.
The significant performance benefits of assembly implementation point to possible compiler
optimization improvements. Analysis of the compiler-generated assembly led to several significant
documented improvements, including improved register allocation, improved copy propagation
techniques and analysis, minimization of moves from memory, and simplification of multiple
instructions.
Lastly, improper implementation can lead to insecurities, and insecurities due to SCA attacks are
currently of great interest. Algorithms that are TA- and SPA-resistant should be employed in practice;
such algorithms impose significant computational costs, but without them the implementation is
insecure. An illustrative sketch of one well-known SPA-resistant pattern is given below.
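The pattern sketched here is the double-and-add-always method of Coron [10]; it is not necessarily
the exact algorithm evaluated in chapter 7, and ec_point, ec_infinity(), ec_double() and ec_add() are
hypothetical stand-ins for the implementation's elliptic curve types and routines.

    #include <stdint.h>

    /* Hypothetical elliptic curve type and primitives, standing in for
       the implementation's point representation and routines. */
    typedef struct { uint32_t x[6], y[6]; } ec_point;
    ec_point ec_infinity(void);
    ec_point ec_double(ec_point p);
    ec_point ec_add(ec_point p, ec_point q);

    /* Double-and-add-always: a doubling and an addition are performed for
       every key bit, so the power/timing profile no longer reveals the
       bit pattern of the scalar k. */
    ec_point ec_mul_always(const uint8_t *k_bits, int n, ec_point p)
    {
        ec_point q[2];
        q[0] = ec_infinity();
        for (int i = n - 1; i >= 0; i--) {
            q[0] = ec_double(q[0]);   /* always doubled             */
            q[1] = ec_add(q[0], p);   /* always added               */
            q[0] = q[k_bits[i]];      /* key bit selects the result */
        }
        return q[0];
    }

The uniform operation sequence is what buys the resistance; the cost is roughly one extra point
addition per key bit, in line with the performance penalties estimated in chapter 7.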
The ECDSA is naturally immune to DPA attacks because of the use of a nonce, but ECC encryption is
vulnerable. Several countermeasures exist that foil DPA attacks, generally by adding randomness to
the implementation. Most countermeasures are implementation specific, limiting the techniques for
thwarting DPA attacks that can be employed. Two proposed countermeasures specific to the SC140
and Koblitz curves are presented. The first exploits the parallel processing capabilities of the SC140 to
mask branch targets and executed instructions, likely foiling TA and SPA attacks. The second adds a
controllable amount of randomness to the elliptic curve point-multiplication operation and is proposed
to foil DPA attacks. A sketch of one published randomizing countermeasure is given below for
comparison.
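The countermeasure sketched here is Coron's randomized projective coordinates [10], in which the
projective representation of a point is re-randomized so that intermediate values differ from run to run.
This is a minimal sketch assuming standard (homogeneous) projective coordinates; gf_elem, gf_mul()
and gf_random_nonzero() are hypothetical stand-ins for the implementation's GF(2^163) routines.

    #include <stdint.h>

    typedef struct { uint32_t w[6]; } gf_elem;  /* 163 bits in six words */
    gf_elem gf_mul(gf_elem a, gf_elem b);       /* field multiplication  */
    gf_elem gf_random_nonzero(void);            /* random r != 0         */

    /* (X : Y : Z) and (rX : rY : rZ) denote the same affine point
       (X/Z, Y/Z), so multiplying all coordinates by a fresh random r
       changes the values a power analyser observes without changing
       the computed result. */
    void randomize_projective(gf_elem *x, gf_elem *y, gf_elem *z)
    {
        gf_elem r = gf_random_nonzero();
        *x = gf_mul(*x, r);
        *y = gf_mul(*y, r);
        *z = gf_mul(*z, r);
    }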
8.4 Future Work
The performance of the thesis implementation of the ECDSA on the SC140 can be improved upon;
other published works report signature process performance superior by as much as a factor of two.
More specifically, improvements to the finite field and elliptic curve operations will lead to increased
performance of the signature generation and verification processes.
To improve the performance of the ECDSA, it may be necessary to implement a larger set of functions
in assembly. The hand-written assembly routines are shown to significantly outperform the
compiler-generated assembly, and it is assumed that implementing a larger set of functions in
assembly would lead to further computational cost and memory requirement reductions, although the
actual benefits are unknown.
Furthermore, an increase in performance was achieved by fixing the window width of the finite field
squaring operation; a sketch of this technique is given after this paragraph. The remaining source code
is written to maintain versatility by allowing easy modification of window widths, the finite field size
of GF(2^163), and other parameters. Fixing such parameters yields a significant overall performance
gain. The versatility reduces the cost of changing finite field sizes, but field sizes are not commonly
changed and the versatility has a negative effect on performance. The results obtained are adequate,
and implementing the ECDSA in a system would not impose significant delays, but the delays can be
reduced further by reducing the versatility of the code.
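For context, squaring in GF(2^163) with a polynomial basis only interleaves zero bits between the
coefficient bits, since (Σ a_i·x^i)^2 = Σ a_i·x^(2i) over GF(2). Fixing the window width allows the
bit-spreading to be performed with a compile-time lookup table. The following is a minimal sketch
assuming a fixed 8-bit window; the names and word layout are illustrative, and the reduction step that
would follow is omitted.

    #include <stdint.h>

    /* sq_tab[b] holds the bits of byte b interleaved with zeros (bit i of
       b moves to bit 2i), because squaring over GF(2) maps x^i to x^(2i). */
    static uint16_t sq_tab[256];

    static void init_sq_tab(void)
    {
        for (int b = 0; b < 256; b++) {
            uint16_t s = 0;
            for (int i = 0; i < 8; i++)
                if (b & (1 << i))
                    s |= (uint16_t)1 << (2 * i);
            sq_tab[b] = s;
        }
    }

    /* Square a six-word (163-bit) element a into a twelve-word product c:
       each byte of a expands to sixteen bits of c. Reduction modulo f(x)
       would follow. */
    static void gf163_square_nored(const uint32_t a[6], uint32_t c[12])
    {
        for (int i = 0; i < 6; i++) {
            c[2 * i]     = sq_tab[a[i] & 0xff]
                         | ((uint32_t)sq_tab[(a[i] >> 8) & 0xff] << 16);
            c[2 * i + 1] = sq_tab[(a[i] >> 16) & 0xff]
                         | ((uint32_t)sq_tab[(a[i] >> 24) & 0xff] << 16);
        }
    }

Fixing the window at eight bits makes the table size and the unrolled byte extraction known at compile
time, which the compiler can exploit when scheduling the loop.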
Research must also be done with larger finite field sizes. As average computing power increases, the
need for greater security leads to the implementation of larger finite fields. The implemented code is
written to simplify modification of the finite field size, but the performance of larger finite fields was
not investigated. It must be, both to determine how the field size affects performance and to ensure
that reasonable execution times remain achievable.
The effects of increasing the finite field size should be researched on a per-operation basis, since each
finite field and elliptic curve operation is affected differently. The performance of finite field squaring
and inversion is linearly related to the finite field size, while other operations have polynomial
relationships to it. For those operations, alternative algorithms should be investigated or developed in
an attempt to maintain near-linear relationships between the finite field size and the performance of
ECC and the ECDSA.
Implementation of the ECDSA on other DSPs that target similar systems should be investigated, since
DSPs with different instruction sets may be advantageous. Multiprocessor organizations may also be
advantageous: for example, the Motorola MSC8101 comprises four SC140 cores that operate in
parallel. The ECDSA may be well suited to parallel processors, with each core operating on a specific
portion of a finite field element, resulting in significant computational speedups.
Security issues with respect to the ECDSA and the SC140 must be investigated more closely. The
thesis only presents SCA-resistant algorithms and countermeasures and estimates the performance
penalty of implementing them. Implementation of the resistant algorithms is required to determine the
actual performance penalty, and attempts to break the implementation are then required to confirm
their effectiveness against an attacker.
Lastly, the proposed algorithms specific to the SC140 and Koblitz curves that attempt to foil SCA
techniques must be studied further. Compared to other options, the proposed algorithms may offer
performance benefits, reducing the penalty of resisting SCA attacks.
In general, there is a lack of research on the implementation of cryptosystems on DSPs: limited work
exists with respect to symmetric key cryptosystems on DSPs, and even less with respect to asymmetric
cryptosystems. DSPs are present in many systems, and their usefulness for implementing
cryptography should therefore be investigated. Furthermore, information on implementing Koblitz
curves is more difficult to find than expected, which is surprising given their benefits.
Further research in the area the thesis touches on is important because of the usefulness of digital
signatures. Digital signatures are valuable in many computing environments and are especially
beneficial in wireless networks, which are growing in popularity.
Finally, cost-benefit studies are required to compare DSPs with specialized processors for
implementing cryptographic techniques, and to determine whether DSPs are a viable, less expensive
alternative to dedicated processing units for cryptographic implementations.
Appendix A – Koblitz Curve Parameters
The following is a list of the Koblitz curve parameters used in the implementation. They are specific
to the PB and GF(2^163), and were found in [25].
Reduction polynomial: f(x) = x^163 + x^7 + x^6 + x^3 + 1
Elliptic curve equation: y^2 + x·y = x^3 + a·x^2 + b (mod f), where a = 1, b = 1
Base point order: r = 5846006549323611672814741753598448348329118574063
Base point coordinates (hexadecimal):
x = 2 FE13C053 7BBC11AC AA07D793 DE4E6D5E 5C94EEE8
y = 2 89070FB0 5D38FF58 321F2E80 0536D538 CCDAA3D9
Koblitz curve parameters:
µ = 1
C = 16
s0(163) = 2579386439110731650419537
s1(163) = -755360064476226375461594
V(163) = -4845466632539410776804317
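To illustrate how the reduction polynomial is applied, the following is a minimal, unoptimized
bit-level sketch of reduction modulo f(x), using the identity x^163 ≡ x^7 + x^6 + x^3 + 1 (mod f(x)).
The word-level routines used in the implementation are considerably faster; the helper macros and
word layout here are illustrative only.

    #include <stdint.h>

    /* Bit helpers over an array of 32-bit words (bit i lives in word i/32). */
    #define GET_BIT(a, i)  (((a)[(i) >> 5] >> ((i) & 31)) & 1u)
    #define FLIP_BIT(a, i) ((a)[(i) >> 5] ^= 1u << ((i) & 31))

    /* Reduce a polynomial product c(x) of degree at most 324 modulo
       f(x) = x^163 + x^7 + x^6 + x^3 + 1, folding each high bit x^i into
       the low part as x^(i-163) * (x^7 + x^6 + x^3 + 1). */
    void gf163_reduce(uint32_t c[12])
    {
        for (int i = 324; i >= 163; i--) {
            if (GET_BIT(c, i)) {
                FLIP_BIT(c, i);            /* clear x^i           */
                FLIP_BIT(c, i - 163 + 7);  /* add x^(i-163) * x^7 */
                FLIP_BIT(c, i - 163 + 6);  /* add x^(i-163) * x^6 */
                FLIP_BIT(c, i - 163 + 3);  /* add x^(i-163) * x^3 */
                FLIP_BIT(c, i - 163);      /* add x^(i-163)       */
            }
        }
    }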
Bibliography
[1] G. Agnew, R. C. Mullin and S. A. Vanstone, "An implementation of elliptic curve cryptosystems over F_2^155", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 5, pp. 804-813, 1993.
[2] G. Agnew, ECE 628 Lecture Slides – Computer Network Security, Department of Electrical and Computer Engineering, University of Waterloo, 2002.
[3] M. Aydos, T. Yanik and C. Koc, "High-speed implementation of an ECC-based wireless authentication protocol on an ARM microprocessor", IEE Proceedings – Communications, Vol. 148, No. 5, pp. 273-279, October 2001.
[4] D. Brown, "The exact security of ECDSA", Technical report CORR 2000-54, Department of Combinatorics & Optimization, University of Waterloo, 2000. Available at http://www.cacr.uwaterloo.ca
[5] M. Brown, D. Cheung, D. Hankerson, J. Lopez Hernandez, M. Kirkup and A. Menezes, "PGP in constrained wireless devices", Proceedings of the 9th USENIX Security Symposium, The USENIX Association, 2000. Available at http://www.usenix.org
[6] Certicom, "Current Public-Key Cryptographic Systems", 2000. Available at http://www.certicom.com
[7] Certicom, "Remarks on the Security of the Elliptic Curve Cryptosystem", 2000. Available at http://www.certicom.com
[8] C. Clavier and M. Joye, "Universal exponentiation algorithm: a first step towards provable SPA-resistance", in Workshop on Cryptographic Hardware and Embedded Systems – CHES 2001, LNCS 2162, pp. 300-308, Springer-Verlag, 2001.
[9] T. Cormen, C. Leiserson and R. Rivest, Introduction to Algorithms, The MIT Press, Cambridge, Massachusetts, 1999.
[10] J.-S. Coron, "Resistance against differential power analysis for elliptic curve cryptosystems", in Workshop on Cryptographic Hardware and Embedded Systems, LNCS 1717, pp. 292-302, Springer-Verlag, 1999.
[11] E. De Win, A. Bosselaers and S. Vandenberghe, "A fast software implementation for arithmetic operations in GF(2^n)", Advances in Cryptology, Proc. Asiacrypt '96, LNCS 1163, pp. 65-76, Springer-Verlag, 1996.
[12] H. Eisenbise, "Embedded cryptography: secure communications with digital signal processors", 2001. Available at http://www.rit.edu/~hje3479/cryptography.html
[13] C. H. Gebotys and R. J. Gebotys, "Secure elliptic curve implementations: an analysis of resistance to power-attacks in a DSP processor", in Workshop on Cryptographic Hardware and Embedded Systems – CHES 2002, 2002.
[14] J. Goodman and A. Chandrakasan, "An energy-efficient reconfigurable public-key cryptographic processor", IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, 2001.
[15] J. Guajardo and C. Paar, "Efficient algorithms for elliptic curve cryptosystems", Advances in Cryptology, Proc. Crypto '97, LNCS 1294, pp. 342-356, Springer-Verlag, 1997.
[16] J. Guajardo and C. Paar, "Itoh-Tsujii inversion in standard basis and its application in cryptographic codes", Kluwer Academic Publishers, 2001.
[17] J. Guajardo, R. Blumel, U. Krieger and C. Paar, "Efficient implementation of elliptic curve cryptosystems on the TI MSP430x33x family of microcontrollers", in Proceedings of PKC 2001, LNCS 1992, pp. 365-382, Springer-Verlag, 2001.
[18] P. Hamalainen, M. Hannikainen, T. Hamalainen and J. Saarinen, "Configurable hardware implementation of triple-DES encryption algorithm for wireless local area network", IEEE, 0-7803-7041-4, 2001.
[19] D. Hankerson, J. Hernandez and A. Menezes, "Software implementation of elliptic curve cryptography over binary fields", in Proceedings of CHES 2000, 2000.
[20] M. Hasan, "Efficient computation of multiplicative inverses for cryptographic applications", 15th IEEE Symposium on Computer Arithmetic, pp. 66-72, 2001.
[21] M. Hasan, ECE 720 (Topic 2) Lecture Slides – Selected Topics in Cryptographic Computations, Department of Electrical and Computer Engineering, University of Waterloo, 2001.
[22] M. A. Hasan, "Look-up table-based large finite field multiplication in memory constrained cryptosystems", IEEE Transactions on Computers, LNCS 1746, pp. 749-758, July 2000.
[23] M. Hasan, "Power analysis attacks and algorithmic approaches to their countermeasures for Koblitz curve cryptosystems", in Cryptographic Hardware and Embedded Systems – CHES 2000, LNCS 1965, pp. 93-108, Springer-Verlag, 2000.
[24] E. Hess, N. Janssen, B. Meyer and T. Schütze, "Information leakage attacks against smart card implementations of cryptographic algorithms and countermeasures". Available at http://infilsec.com/papers/dpa/
[25] IEEE P1363, Standard Specifications for Public-Key Cryptography, 2000.
[26] K. Itoh, M. Takenaka, N. Torii, S. Temma and Y. Kurihara, "Fast implementation of public-key cryptography on a DSP TMS320C6201", in Proceedings of the First Workshop on Cryptographic Hardware and Embedded Systems (CHES '99), LNCS 1717, pp. 61-72, Springer-Verlag, 1999.
[27] T. Izu and T. Takagi, "A fast parallel elliptic curve multiplication resistant against side channel attacks", Technical report, CACR, University of Waterloo, 2001. Available at http://www.math.uwaterloo.ca/
[28] M. Joye and C. Tymen, "Protections against differential analysis for elliptic curve cryptography", in Cryptographic Hardware and Embedded Systems – CHES 2001, LNCS 2162, pp. 377-390, Springer-Verlag, 2001.
[29] D. Johnson, "ECC, future resiliency and high security systems", Certicom, 1999. Available at http://www.certicom.com
[30] D. Johnson and A. Menezes, "The elliptic curve digital signature algorithm (ECDSA)", Technical report CORR 99-06, Department of Combinatorics & Optimization, University of Waterloo, 1999. Available at http://www.cacr.math.uwaterloo.ca/
[31] N. Koblitz, A. J. Menezes and S. Vanstone, "The state of elliptic curve cryptography", Designs, Codes and Cryptography, 19, pp. 173-193, 2000.
[32] K. Koyama and Y. Tsuruoka, "Speeding up elliptic cryptosystems by using a signed binary window method", in Advances in Cryptology – CRYPTO '92, LNCS 740, pp. 345-357, Springer-Verlag, 1992.
[33] P. Kocher, "Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems", in Advances in Cryptology – CRYPTO '96, LNCS, pp. 104-113, Springer-Verlag, 1996.
[34] P. Kocher, J. Jaffe and B. Jun, "Differential power analysis", in Advances in Cryptology – CRYPTO '99, LNCS, pp. 388-397, Springer-Verlag, 1999.
[35] N. Kanayama, T. Kobayashi, T. Saito and S. Uchiyama, "Remarks on elliptic curve discrete logarithm problem", IEICE Trans. Fundamentals, Vol. E83-A, No. 1, 2000.
[36] N. Koblitz, "CM-curves with good cryptographic properties", in Advances in Cryptology – CRYPTO '91, LNCS 576, pp. 279-287, Springer-Verlag, 1992.
[37] C. H. Lim and P. J. Lee, "Security of interactive DSA batch verification", Electronics Letters, No. 19941112, 1994.
[38] J. Lopez and R. Dahab, "An overview of elliptic curve cryptography", Technical report IC-00-10, 2000. Available at http://www.dcc.unicamp.br/ic-main/publications-e.html
[39] J. Lopez and R. Dahab, "High-speed software multiplication in F_2^m", Technical report IC-00-09, 2000. Available at http://www.dcc.unicamp.br/ic-main/publications-e.html
[40] J. Lopez and R. Dahab, "Performance of elliptic curve cryptosystems", Technical report IC-00-08, May 2000. Available at http://www.dcc.unicamp.br/ic-main/publications-e.html
[41] Metrowerks Corporation, CodeWarrior IDE, version 4.1, build 696, 2000.
[42] Metrowerks Corporation, CodeWarrior® Metrowerks Enterprise C Compiler User's Manual, 1999. Available at http://www.metrowerks.com
[43] Metrowerks Corporation, StarCore C Compiler, vMtwk.Production 1.1, build 050901-1936, 2000.
[44] C. Moerman and E. Lambers, "Optimizing DSP: low power by architecture", Adelante Technologies. Available at http://www.techonline.com
[45] B. Möller, "Securing elliptic curve point multiplication against side-channel attacks", in Information Security – 4th International Conference, ISC 2001, LNCS 2200, pp. 324-334, Springer-Verlag, 2001.
[46] Motorola, SC140 DSP Core Reference Manual, Motorola and Lucent Technologies Inc., Rev. 1, 2000. Available at http://www.motorola.com
[47] National Institute of Standards and Technology, "Digital Signature Standard (DSS)", FIPS Publication 186-2, February 2000. Available at http://csrc.nist.gov/fips
[48] National Institute of Standards and Technology, "Secure Hash Standard (SHS)", FIPS Publication 180-1, April 1995. Available at http://csrc.nist.gov/fips
[49] P. Nguyen and I. Shparlinski, "The insecurity of the elliptic curve digital signature algorithm with partially known nonces", submitted to Designs, Codes and Cryptography, 2001.
[50] K. Okeya and K. Sakurai, "On insecurity of the side channel attack countermeasure using addition-subtraction chains under distinguishability between addition and doubling", LNCS 2384, pp. 420-435, Springer-Verlag, 2002.
[51] K. Okeya and K. Sakurai, "Power analysis breaks elliptic curve cryptosystems even secure against the timing attacks", in Progress in Cryptology – Indocrypt 2000, LNCS 1977, pp. 178-190, Springer-Verlag, 2000.
[52] G. Orlando and C. Paar, "A high-performance reconfigurable elliptic curve processor for GF(2^m)", Workshop on Cryptographic Hardware and Embedded Systems (CHES 2000), LNCS 1965, Springer-Verlag, 2000.
[53] G. Orlando and C. Paar, "An efficient architecture for GF(2^m) and its applications in cryptographic systems", Electronics Letters, Vol. 36, No. 13, pp. 1116-1117, 2000.
[54] S. Ravi, A. Raghunathan and N. Potlapally, "Securing wireless data: architecture challenges", ISSS '02, ACM 1-58113-576-9, 2002.
[55] L. Reyzin and B. Kaliski, "Storage-efficient basis conversion techniques", IEEE, 2000.
[56] M. Rosing, Implementing Elliptic Curve Cryptography, Manning Publications, Greenwich, CT, 1999.
[57] E. Roy and D. Crawford, "Introduction to the StarCore SC140 tools: an approach in nine exercises", Motorola, AN2009/D, Rev. 1, 2001. Available at http://www.motorola.com
[58] Z. Rozenshein, D. Halahmi, A. Mordoh and Y. Ronen, "Speed and code-size trade-off with the StarCore SC140", Motorola, AN1838/D, Rev. 0, 2000. Available at http://www.motorola.com
[59] B. Schneier, "Cryptographic design vulnerabilities", IEEE, 0018-9162, 1998.
[60] G. Seroussi, "Compact representation of elliptic curve points over F_2^n", Hewlett-Packard Company, HPL-98-94 (R.1), 1998.
[61] J. Solinas, "Efficient arithmetic on Koblitz curves", Designs, Codes and Cryptography, 19, pp. 195-249, 2000.
[62] M. Wiener and R. Zuccherato, "Faster attacks on elliptic curve cryptosystems", Selected Areas in Cryptography '98, LNCS 1556, pp. 190-200, Springer-Verlag, 1998.
[63] E. Witzke and L. Pierson, "Key management for large scale end-to-end encryption", IEEE, 0-7803-1479-4, 1994.
[64] T. Wollinger, M. Wang, J. Guajardo and C. Paar, "How well are high-end DSPs suited for the AES algorithms? AES algorithms on the TMS320C6x DSP", AES Candidate Conference 2000, pp. 94-105, 2000.
[65] C. Zamfirescu and E. Madve, "Stack measurement for the StarCore SC140 core", Motorola, AN2267/D, Rev. 1, 2002. Available at http://www.motorola.com