The Implementation and Analysis of the
ECDSA on the Motorola StarCore SC140 DSP
Primarily Targeting Portable Devices
by
Eric W. Smith
A thesis
presented to the University of Waterloo
in fulfillment of the
thesis requirement for the degree of
Master of Applied Science
in
Electrical and Computer Engineering
Waterloo, Ontario, Canada, 2002
© Eric W. Smith 2002
I hereby declare that I am the sole author of this thesis.
I authorize the University of Waterloo to lend this thesis to other institutions or individuals for the
purpose of scholarly research.
I further authorize the University of Waterloo to reproduce this thesis by photocopying or by other
means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly
research.
The University of Waterloo requires the signatures of all persons using or photocopying this thesis.
Please sign below, and state an address and date.
Abstract
The viability of the Elliptic Curve Digital Signature Algorithm (ECDSA) on portable devices is
important due to the growing wireless communications industry, which has inherent insecurities. The
StarCore SC140 DSP (SC140) targets portable devices, and therefore is a prime candidate to study the
viability of the ECDSA on such devices. The ECDSA was implemented on the SC140 using a Koblitz
curve over GF(2^163). The τ-adic representation of polynomials involved in the elliptic curve point-
multiplication is exploited to achieve superior performance. The ECDSA was implemented and
optimized in C and assembly, and verified in hardware. The performance of the C and assembly
implementations is analyzed and compared to previously published results. The ability of the compiler
to generate efficient cryptographic related code and the SC140 to perform efficient operations is
discussed. Numerous compiler optimization improvements that considerably enhance the performance
of the generated assembly are suggested. Coding guidelines that state simple measures to improve the
performance of the implementation and help to achieve efficient C and assembly are listed. Finally,
security issues with respect to the implementation, focusing on side-channel attacks (SCAs), are investigated, including estimated performance penalties due to adding resiliency. Two SCA
countermeasures specific to the implementation are also described. In summary, the implemented
ECDSA signature generation and verification processes require 4.43 ms and 8.63 ms, respectively, when the SC140 operates at 300 MHz. Methods of optimizing the implementation to further reduce execution times are
also presented.
Acknowledgements
The author would like to thank his supervisor, Professor Catherine Gebotys, for her aid and direction
throughout the development of the thesis, as well as the use of computing resources and the StarCore
SC140 Software Development Platform (SDP). He would also like to thank friends and family for
their support, without which the completion of the thesis would not be possible.
The author is extremely grateful for the financial support provided by the Natural Sciences and Engineering Research Council of Canada (NSERC), Motorola, his supervisor and the Department of Electrical
and Computer Engineering at the University of Waterloo. Financial support was provided by the listed
entities through various scholarships, which allowed the author to focus more thoroughly on his
research and studies.
Contents
1 Introduction
   1.1 DSPs and Embedded Systems Security Requirements
   1.2 Thesis Objective
   1.3 Thesis Overview
2 Public-Key Cryptosystems and the StarCore SC140 DSP
   2.1 Public-Key Cryptosystems
   2.2 ECC Background
      2.2.1 Comparison to Other Cryptographic Techniques
   2.3 Digital Signature Schemes
   2.4 StarCore SC140 DSP Processor Description
   2.5 Previous Cryptographic and DSP Research
3 The ECDSA Algorithm and Implementation Philosophy
   3.1 The ECDSA
   3.2 Finite Field and Large Integer Arithmetic
      3.2.1 Basic Operations
      3.2.2 Finite Field Multiplication
      3.2.3 Finite Field Squaring
      3.2.4 Finite Field Inversion
      3.2.5 Large Integer Operations
   3.3 Elliptic Curve Arithmetic
      3.3.1 Elliptic Curve Point Addition and Subtraction
      3.3.2 Elliptic Curve Point Representation
      3.3.3 Elliptic Curve Point-Multiplication
         3.3.3.1 Non-Adjacent Format
         3.3.3.2 Reduced TNAF Representation
         3.3.3.3 TNAF Point-Multiplication
         3.3.3.4 Width-w TNAF Representation
         3.3.3.5 TNAFw Point-Multiplication
      3.3.4 Simultaneous Multiple Point-Multiplication
   3.4 Implementation and Integration Philosophy
4 Implementation Analysis and Performance Results
   4.1 C Data Structures
   4.2 Finite Field Operations
      4.2.1 Finite Field Addition (c = a ⊕ b)
      4.2.2 Finite Field Reduction (c = a mod f)
      4.2.3 Finite Field Multiplication (c = a ⋅ b)
      4.2.4 Finite Field Squaring (c = a^2)
      4.2.5 Finite Field Inversion (c = a^(-1) mod f)
   4.3 Large Integer Operations
      4.3.1 Large Integer Addition and Subtraction (c = a + b; c = a - b)
      4.3.2 Large Integer Multiplication (c = a ⋅ b)
      4.3.3 Large Integer Division (c = a / b)
      4.3.4 Large Integer Inversion (c = a^(-1) mod f)
   4.4 Elliptic Curve Operations
      4.4.1 TNAF Conversion (k → k_TNAF)
      4.4.2 Partial Reduction - Partmod δ (k′ = k partmod δ)
      4.4.3 TNAF Point-Multiplication (Q = k_TNAF ⋅ P)
      4.4.4 TNAFw Conversion (k → k_TNAFw)
      4.4.5 TNAFw Point-Multiplication (Q = k_TNAFw ⋅ P)
      4.4.6 Simultaneous Multiple Point-Multiplication (R = k ⋅ P + l ⋅ Q)
5 Implementation Comparison and Coding Guidelines
   5.1 Performance Comparison with Previous Published Results
      5.1.1 Low-Level Performance Comparison
      5.1.2 High-Level Performance Comparison
   5.2 Guidelines for Writing Efficient C Code for Cryptographic Applications
   5.3 Guidelines for Writing Efficient Assembly Code for Cryptographic Applications
   5.4 Hand-Written and Compiler-Generated Assembly Comparison
      5.4.1 Low-Level Performance Comparison
      5.4.2 High-Level Performance Comparison
   5.5 Memory Requirements Comparison
6 SC140 and Compiler Analysis for Cryptographic Applications
   6.1 Analysis of the SC140 for Elliptic Curve Cryptographic Applications
      6.1.1 SC140 Cryptographic Pros
      6.1.2 SC140 Cryptographic Cons
   6.2 Compiler Optimization Improvements
   6.3 Compiler Anomalies
      6.3.1 Compiler Anomaly A
      6.3.2 Compiler Anomaly B
7 Side-Channel Attack Security Issues
   7.1 Timing Attacks
   7.2 Simple Power Attacks
   7.3 Differential Power Analysis
   7.4 SCA Countermeasures Specific to Koblitz Curves and the SC140
      7.4.1 Parallel Processing Countermeasure
      7.4.2 Koblitz Curve Specific Countermeasure
8 Discussion and Conclusions
   8.1 Thesis Summary
   8.2 Limitations of the Research and Implementation
   8.3 Conclusions
   8.4 Future Work
Appendix A – Koblitz Curve Parameters
Bibliography
List of Acronyms
AAU  Address Arithmetic Unit
AGU  Address Generation Unit
AIA  Almost Inverse Algorithm
ALU  Arithmetic Logic Unit
ASL  Arithmetic Shift Left (by one bit)
ASLL  Arithmetic Shift Left (by multiple bits)
ASM  Assembly Language Code
ASR  Arithmetic Shift Right (by one bit)
ASRR  Arithmetic Shift Right (by multiple bits)
BF  Branch if False
BFU  Bit Field Unit
BT  Branch if True
CA  Certificate Authority
CGA  Compiler-Generated Assembly
CLB  Count Leading Bits
CP  Critical Path
DALU  Data Arithmetic Logic Unit
DL  Discrete Logarithm
DLP  Discrete Logarithm Problem
DPA  Differential Power Analysis
DSA  Digital Signature Algorithm
DSP  Digital Signal Processor
EC  Elliptic Curve
ECC  Elliptic Curve Cryptography
ECDLP  Elliptic Curve Discrete Logarithm Problem
ECDSA  Elliptic Curve Digital Signature Algorithm
EEA  Extended Euclidean Algorithm
FF  Finite Field
GUI  Graphical User Interface
HWA  Hand-Written Assembly
IDE  Integrated Development Environment
IF  Integer Factorization
IFA  IF Always
IFF  IF False
IFT  IF True
JF  Jump if False
JT  Jump if True
LSL  Logical Shift Left (by one bit)
LSLL  Logical Shift Left (by multiple bits)
LSR  Logical Shift Right (by one bit)
LSRR  Logical Shift Right (by multiple bits)
LUT  Look-Up Table
MAC  Multiply and Accumulate
MIPS  Million Instructions Per Second
NAF  Non-Adjacent Format
NB  Normal Basis
NIST  National Institute of Standards and Technology
NOP  No Operation
PB  Polynomial Basis
PDA  Personal Digital Assistant
RRK  Random Rotation of Key
SCA  Side Channel Attack
SC140  StarCore SC140 DSP
SPA  Simple Power Attacks
SMPM  Simultaneous Multiple Point-Multiplication
SRAM  Static Random Access Memory
TA  Timing Attack
TNAF  τ-adic NAF
TNAFw  Width-w TNAF
VLES  Variable Length Execution Set
VLIW  Very Long Instruction Word
XXX(A)  XXX and XXXA instructions
List of Algorithms
Algorithm 3-1. ECDSA Signature Generation [30]
Algorithm 3-2. ECDSA Signature Verification [30]
Algorithm 3-3. Finite Field Reduction (c = a mod f) [19]
Algorithm 3-4. Finite Field Multiplication (c = a⋅b) [39]
Algorithm 3-5. Finite Field Squaring (c = a^2) [19]
Algorithm 3-6. Finite Field Inversion (b = a^(-1) mod f) [20]
Algorithm 3-7. Elliptic Curve Point Addition (P3 = P1 + P2) [38]
Algorithm 3-8. TNAF Conversion (k_TNAF = r0 + r1⋅τ) [61]
Algorithm 3-9. Partmod δ Reduction (r0 + r1⋅τ := k partmod δ) [61]
Algorithm 3-10. TNAF Point-Multiplication (Q = k_TNAF⋅P) [19]
Algorithm 3-11. TNAFw Conversion (k_TNAFw = r0 + r1⋅τ) [61]
Algorithm 3-12. TNAFw Point-Multiplication (Q = k_TNAFw⋅P) [61]
Algorithm 3-13. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q) [19]
Algorithm 4-1. Improved Finite Field Squaring (c = a^2)
Algorithm 4-2. Improved Finite Field Inversion (c = a^(-1) mod f)
Algorithm 4-3. Integer Coefficient to Binary Representation Conversion
Algorithm 7-1. TA Resistant TNAF Point-Multiplication (Q[0] = k_TNAF⋅P) [22]
Algorithm 7-2. Proposed DPA Resistant τ-adic Point-Multiplication
List of Tables
Table 2-1. Current Estimated Memory Requirement Comparison [6]
Table 3-1. Elliptic Curve Coordinate System Comparison [19]
Table 4-1. Finite Field Reduction Performance
Table 4-2. Single and Multiple Bit-Shifting Function Comparison
Table 4-3. Finite Field Multiplication Performance
Table 4-4. Finite Field Squaring Performance Comparison
Table 4-5. Finite Field Inversion Bit-Shift Distribution
Table 4-6. Finite Field Inversion Performance
Table 4-7. TNAF Point-Multiplication Performance
Table 4-8. TNAFw Point-Multiplication Performance Comparison
Table 4-9. Simultaneous Multiple Point-Multiplication Performance Comparison
Table 5-1. Estimated Finite Field Operation Cycle Count Comparison
Table 5-2. Estimated Elliptic Curve Operation Cycle Count Comparison
Table 5-3. Estimated Signature Generation and Verification Cycle Count Comparison
Table 5-4. Low-Level CGA and HWA Performance Comparison (input independent routines)
Table 5-5. Low-Level CGA and HWA Performance Comparison (input dependent routines)
Table 5-6. Computational Reduction of the Signature Generation Process due to HWA Routines
Table 5-7. High-Level CGA and HWA Performance Comparison
Table 5-8. Estimated Permanent Storage Requirements
Table 6-1. Assembly Symbolic Description
Table 7-1. Estimated TA Resistant Performance Penalties
Table 7-2. Estimated SPA Resistant Signature Generation Performance Penalty
Table 7-3. Estimated Sample Entropy and Overhead of Algorithm 7-2
1 Introduction
The ECDSA is a cryptographic tool that can provide security to systems when implemented correctly.
The algorithm defines a method of achieving data integrity, data origin authentication and non-
repudiation. The ability to efficiently implement the ECDSA will determine its usefulness in the growing
wireless and wireline communications industries.
It is difficult to argue for the usefulness of the ECDSA without proof that it can be efficiently implemented on a wide range of target processors. Furthermore, it is difficult to convey the threat of attackers to users who have not personally experienced a digital security breach, because such users do not easily tolerate large computational delays for tasks they deem unimportant.
The purpose of the thesis is to study the performance of the ECDSA on the SC140. Analysis
of the implementation, the benefits of an assembly implementation, and the strength of the compiler to
produce efficient cryptographic applications are all included as part of the study. There have been
several documented implementations of the ECDSA on general-purpose processors and on the extremely resource-limited processors present on smart cards, but a limited number of implementations of the
ECDSA, or more generally Elliptic Curve Cryptography (ECC), on processors targeting portable
devices. The few documented implementations of ECC on DSPs have involved prime fields. ECC
using binary fields is also a viable option, which may be more attractive due to the numerous bit-
manipulating instructions common to DSPs.
Due to the decreased power consumption of DSPs relative to general-purpose processors, and
the limited battery lifespan of portable devices, DSPs are an excellent candidate for the primary
computing core of portable devices. Furthermore, due to the inherent insecurities of wireless
communications that threaten portable devices, the implementation of security measures is of utmost
importance. The performance of security measures, including the ECDSA, must be studied on such
devices. By studying the performance of the ECDSA on the SC140, its viability with respect to portable devices, and to security on those devices, can be determined.
1.1 DSPs and Embedded Systems Security Requirements
The employment of adequate security systems was overlooked during the incredible growth the
communications industry underwent over the past two decades. Systems were introduced without
adequate security measures in place. The combination of recent world events and the sudden decline in the communications industry’s growth has led to the realization that many current network security measures are inadequate.
Furthermore, the rapidly expanding wireless communications industry is increasing the demand on network security. The current trend in the communications industry is toward increased wireless services as 3rd generation cellular systems become a reality. The services that cell phones, personal
digital assistants (PDAs) and other portable handheld devices provide are ever increasing. The new
services require more bandwidth and greater processing capabilities. Examples of the introduced
services are email and streaming multimedia.
As the communications industry expands, and more information is transmitted via wireless and
wireline connections, the inherent requirement for security measures increases. The SC140 targets
several communication applications that all require certain levels of security. It is therefore important
to study the SC140 to determine if it is a viable processor to implement the required security measures.
Handheld devices are powered by several different processing units, including DSPs. The
deployment of DSPs is widespread. They have lower power dissipation than general-purpose
processors, and are less costly than specialty processors. DSPs are currently present in network and
data communications, and several other devices throughout the communications industry. High-end
DSPs control network traffic on high-speed backbones, and will likely be deployed in future handheld
devices. Handheld devices are often part of extensive wireless networks that are naturally insecure,
and are extremely susceptible to security risks such as impersonation attacks.
Digital signatures provide services such as data integrity, data origin authentication and non-repudiation. The importance of the integrity and origin of data is heightened in a wireless network, which is much more susceptible to impersonation attacks and modification of transmitted data because of the ease with which the transmission medium is accessed.
It is important to study security on DSPs because of their widespread deployment. If DSPs are
a viable target for implementation of security measures, security can be added to systems with simple
software add-ons or upgrades. The cost of adding the security related services is greatly decreased
because new hardware is not required. Furthermore, expensive processing units for cryptographic
applications are not required by new devices, maintaining their affordability.
1.2 Thesis Objective
The objective of the thesis is to study the performance of ECC, and more precisely the ECDSA, on a
DSP targeting portable devices. The ECDSA is implemented on the StarCore SC140 DSP. The
implementation is examined thoroughly, and optimized to improve its performance with respect to
execution time and code size. The compiler and associated optimizer are examined to determine if
efficient implementation is possible in the C programming language, or if assembly language coding is
required to achieve the necessary performance. The execution time of the implementation should be
comparable to current published results, and must yield delays that are acceptable, and ideally unnoticeable, to the average user when the digital signature technique is utilized by a practical application. Furthermore,
while maintaining acceptable execution times, the code size of the compiled application must be
suitable for portable devices, where memory is an expensive and limited resource.
1.3 Thesis Overview
In chapter 2, a brief description of public-key cryptosystems, focusing on ECC, and the StarCore
SC140 DSP is presented. The ECDSA and the algorithms utilized to implement the required finite
field and elliptic curve operations are outlined in chapter 3, along with the implementation philosophy.
The implementation and performance analysis of the finite field and elliptic curve operations are
presented in chapter 4. Chapter 5 analyzes the performance of the implementation. A comparison of
the performance of the implementation with previously published results is presented. In addition, the
performance of the hand-written and compiler-generated assembly is contrasted. Several coding
guidelines to follow, which aid in the development of efficient assembly and C code, are included. To
conclude chapter 5, the memory requirements are presented and compared. An analysis of the SC140
and the associated compiler is presented in chapter 6. The analysis is based on implementing
cryptographic applications on the SC140, and the ability of the compiler to optimize cryptographic
related code. Security issues that arise due to side-channel attacks and methods of avoidance are
presented in chapter 7. Finally, chapter 8 presents a thesis summary, limitations of the study, a
conclusion, as well as future work to be done in this area of research.
2 Public-Key Cryptosystems and the StarCore SC140 DSP
This chapter introduces the concept of public-key cryptosystems, providing several examples. The
implemented public-key cryptosystem is explained and compared to alternatives, and a description of
the SC140 is included.
2.1 Public-Key Cryptosystems
Public-key cryptosystems were invented by Whitfield Diffie and Martin Hellman in 1976 [6]. They are
asymmetric cryptosystems, which are based on the concept of using different keys for the encryption
and decryption processes. For the cryptosystem to be useful, the two keys must appear unrelated, such that the encryption key E, or public key, can be put in the public domain without compromising the decryption key D. The decryption key D is also known as the private or secret key.
Consider two entities, such as people or computer nodes, that want to communicate. Each
entity individually develops a public and private key. The two keys are inverses of each other,
described by Formula 2.1 [9]. In the formula, M is the message, and the functions D() and E() represent transformation of the data using the private and public keys, respectively.
M = D(E(M)) = E(D(M)) (2.1)
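As a concrete illustration of Formula 2.1, the following C sketch demonstrates the key-inverse property using textbook RSA with deliberately tiny, insecure parameters; the values (p = 61, q = 53, e = 17, d = 2753) are illustrative assumptions and not part of the implementation described in this thesis.

/* Toy demonstration of M = D(E(M)) = E(D(M)) with textbook RSA.
 * n = 61 * 53 = 3233, and e*d = 1 mod 3120, so the two exponents
 * are inverses of each other. Illustrative only; not secure. */
#include <stdio.h>
#include <stdint.h>

static uint64_t modpow(uint64_t base, uint64_t exp, uint64_t mod) {
    uint64_t result = 1;
    base %= mod;
    while (exp > 0) {
        if (exp & 1)
            result = (result * base) % mod;
        base = (base * base) % mod;
        exp >>= 1;
    }
    return result;
}

int main(void) {
    const uint64_t n = 3233;  /* modulus: 61 * 53 */
    const uint64_t e = 17;    /* public exponent */
    const uint64_t d = 2753;  /* private exponent */
    uint64_t M = 65;          /* sample message */

    uint64_t ded = modpow(modpow(M, e, n), d, n); /* D(E(M)) */
    uint64_t edm = modpow(modpow(M, d, n), e, n); /* E(D(M)) */
    printf("M = %llu, D(E(M)) = %llu, E(D(M)) = %llu\n",
           (unsigned long long)M, (unsigned long long)ded,
           (unsigned long long)edm);
    return 0;
}

Running the sketch prints the same value three times, showing that applying the two keys in either order recovers the message.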
A trusted third party is required by public-key cryptosystems. The trusted third party, also known as a Certificate Authority (CA), is in charge of the storage and distribution of the domain parameters and public keys of entities. Entities transmit their public key to a CA in a secure manner.
Public-key cryptosystems are capable of providing authentication, secrecy, or both between
communicating parties. Authentication is achieved by encrypting messages with one’s private key
before transmission. The target entity uses the public key of the sender, obtained from a CA, to decrypt
the transmitted message. Authentication is inherent when the message is successfully decrypted.
Secure communication is achieved by encrypting a message with the target’s public key. The
communication is secure because the target’s private key, which is only known by the target, is
required to decrypt the message. An authenticated and secure communication is achieved by first using
the target’s public key to encrypt a message, then by using one’s private key to encrypt the encrypted
message before transmission to the target. In this case, the target first authenticates the transmission by
decrypting the received message with the sender’s public key. Then the original message is revealed
by decrypting the authenticated message with their private key.
There are currently three secure and efficient public-key cryptosystems. They are based on
Integer Factorization (IF), the Discrete Logarithm (DL) and Elliptic Curves (EC) [6]. Each system is based on a mathematical problem that is difficult relative to its input size [6], requiring a great deal of time to solve.
The most common public-key cryptosystem is RSA, named after its developers Rivest, Shamir and Adleman [6]. It is an IF system based on large prime integers. The difficult problem associated
with the system is the factorization of large numbers. Both encryption and digital signature schemes
have been developed using RSA.
The Digital Signature Algorithm (DSA), a digital signature scheme, is an example of a DL cryptosystem [6]. Like all DL cryptosystems, the DSA is based on the Discrete Logarithm Problem (DLP). Encryption is also possible using the DLP, but is not commonly used due to the
associated large overhead.
Both encryption and digital signature schemes have been developed using ECC. It is based on
an extension of the DLP, rightfully named the Elliptic Curve DLP (ECDLP). The ECDSA, a digital signature scheme very similar to the DSA, is an example of an ECC scheme. ECC is presently the
most promising public-key cryptosystem because of the high security-per-bit ratio it provides. Further
detail of the cryptosystem is given in the following section.
2.2 ECC Background
In 1985, Neal Koblitz and Victor Miller independently proposed the use of elliptic curves for a public-
key cryptosystem [2]. Both encryption and digital signature techniques have been developed using the
cryptosystem. The public-key cryptosystem is based on the manipulation of points on an elliptic curve,
defined modulo f, where P = (x, y) is a point on the curve. For cryptographic applications, each
coordinate belongs to a prime or binary finite field, defined by f. The generalized equation of an
elliptic curve for cryptography is presented as Formula 2.2.
y^2 + x⋅y = x^3 + a⋅x^2 + b (mod f) (2.2)
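A minimal C sketch of Formula 2.2 follows, using the toy field GF(2^4) with reduction polynomial f(x) = x^4 + x + 1 so the arithmetic is small enough to check by hand. The field size, reduction polynomial and curve coefficients (a = b = 1) are illustrative assumptions, not the parameters of the thesis implementation, which uses GF(2^163).

/* Enumerate the points of y^2 + x*y = x^3 + a*x^2 + b over GF(2^4).
 * Field elements are 4-bit masks; gf_mul is a carry-free multiply
 * followed by reduction modulo f(x) = x^4 + x + 1. */
#include <stdio.h>
#include <stdint.h>

#define GF_BITS 4
#define GF_POLY 0x13u  /* x^4 + x + 1 */

static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint16_t acc = 0;
    for (int i = 0; i < GF_BITS; i++)         /* carry-free multiply */
        if (b & (1u << i))
            acc ^= (uint16_t)a << i;
    for (int i = 2 * GF_BITS - 2; i >= GF_BITS; i--)  /* reduce mod f */
        if (acc & (1u << i))
            acc ^= GF_POLY << (i - GF_BITS);
    return (uint8_t)acc;
}

int main(void) {
    const uint8_t a = 1, b = 1;  /* assumed curve coefficients */
    for (uint8_t x = 0; x < 16; x++) {
        for (uint8_t y = 0; y < 16; y++) {
            uint8_t lhs = gf_mul(y, y) ^ gf_mul(x, y);
            uint8_t x2  = gf_mul(x, x);
            uint8_t rhs = gf_mul(x2, x) ^ gf_mul(a, x2) ^ b;
            if (lhs == rhs)
                printf("point on curve: (0x%X, 0x%X)\n", x, y);
        }
    }
    return 0;
}

The same structure, with word arrays in place of single bytes, carries over to cryptographically sized fields.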
Several parameters define an elliptic curve, including but not limited to, a and b. There are
classes of curves that are defined by specific sets of parameters. These classes have special properties
associated with them, making them more or less attractive for cryptographic applications. For
example, anomalous binary curves, more commonly known as Koblitz curves, are used and assumed in
the scope of the thesis. They have properties, explained in §3.3.3, that allow for efficient point-multiplication.
Associated with elliptic curves are point addition, doubling, negating, subtracting and
multiplication operations. Each of the elliptic curve operations is defined by a sequence of finite field operations, as described in §3.3. Similar to standard mathematics, point-multiplication is based on
a series of point additions. Because point-multiplication is expensive, algorithms have been developed to improve its computation.
The ECDLP is the difficult mathematical problem of reversing the point-multiplication operation. The problem, an analogue of the DLP in the elliptic curve domain, is to solve for k knowing P and Q in the point-multiplication formula Q = k⋅P. The problem is computationally expensive and cannot be solved in a reasonable
amount of time as long as the finite field associated with the curve is large enough. As expected, the
finite field size required to make the problem computationally infeasible is directly related to current
processing trends. As the average computing power of devices increases, larger finite fields are required
to maintain security levels [30].
Elliptic curves for cryptographic applications can be defined over prime or binary fields. In
general, prime fields tend to outperform binary fields because most processors are designed to favor the execution of integer arithmetic rather than the carry-free arithmetic of binary polynomials. However, binary fields were chosen for implementation to determine how well they perform on the SC140, which has an extended and less computationally costly set of logic instructions compared to general-purpose processors. Binary finite
fields are assumed throughout the thesis unless otherwise stated.
The Polynomial Basis (PB) was selected to represent binary finite field elements. Unless
otherwise stated, the PB is used throughout the thesis. The use of an alternate basis, such as the
Normal Basis (NB), does have its benefits. For example, squaring field elements is simplified when employing the NB, but other operations become more complex. It is believed by the writer that the drawbacks of
alternative representations outweigh the benefits for this implementation.
The binary finite field GF(2^m), where m = 163, is used and assumed throughout the project. The field size of 163 bits was selected to provide the current acceptable security level [6]. When
implementing Koblitz curves, all of the elliptic curve parameters are fixed after the finite field size is
selected, except for C. The parameter C and its value are further explained in §3.3.3.2. The Koblitz
curve parameters used for implementation, with the PB over GF(2^163), are listed in Appendix A.
Some general finite field terminology must be defined. The terms polynomial and finite field
element are used interchangeably throughout the thesis. The degree of a polynomial is the position of
the most significant coefficient (where the first coefficient position is position zero), and the Hamming
weight of a polynomial is the number of nonzero coefficients in its representation.
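To make the two definitions concrete, the following C sketch computes the degree and Hamming weight of a polynomial, assuming a field element is stored least-significant word first in an array of 32-bit words (six words suffice for GF(2^163)). The word size, ordering and function names are illustrative assumptions rather than the actual data structures of the implementation (described in §4.1).

/* Degree and Hamming weight of a binary polynomial stored as
 * an array of 32-bit words, least significant word first. */
#include <stdint.h>

#define FF_WORDS 6  /* ceil(163 / 32) */

/* Degree: position of the most significant nonzero coefficient. */
int ff_degree(const uint32_t a[FF_WORDS]) {
    for (int w = FF_WORDS - 1; w >= 0; w--) {
        if (a[w] != 0) {
            int bit = 31;
            while (!(a[w] & (1u << bit)))
                bit--;
            return 32 * w + bit;
        }
    }
    return -1;  /* the zero polynomial has no degree */
}

/* Hamming weight: number of nonzero coefficients. */
int ff_weight(const uint32_t a[FF_WORDS]) {
    int count = 0;
    for (int w = 0; w < FF_WORDS; w++) {
        uint32_t v = a[w];
        while (v) {
            v &= v - 1;  /* clear the lowest set bit */
            count++;
        }
    }
    return count;
}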
In the following section, ECC is compared to two other public-key cryptosystems. The
positive and negative aspects of the cryptosystems are compared to give some background on why
ECC was selected for implementation over other possibilities.
2.2.1 Comparison to Other Cryptographic Techniques
ECC was selected as the public-key cryptosystem for implementation on the SC140 for several
reasons. The comparison below briefly develops and states the grounds for selecting ECC over the
other public-key cryptosystems. Further comparison of the public-key cryptosystems can be found in
[6] and [29].
There are both encryption and digital signature schemes associated with ECC, DLP and RSA.
The underlying mathematical problems associated with the cryptosystems are identical for both
encryption and digital signature schemes. Therefore, after one scheme is implemented, much of the
implementation can be reused with the other scheme. There are two types of digital signature techniques: those with and those without appendices. In a technique without appendix, the digital signature is logically combined with the message, whereas in a technique with appendix, the digital signature is appended to the message.
RSA refers to both encryption and digital signature schemes. It includes a digital signature
scheme without appendix. ElGamal proposed digital signature and encryption schemes based on the DLP [6]. Later, the DSA was developed, which is an improvement on the digital signature scheme proposed by ElGamal. The ECDSA is the digital signature algorithm associated with ECC, and
encryption is simply referred to as ECC. The ECDSA and DSA are both digital signature schemes with
appendix. Digital signature schemes with and without appendices are explained in §2.3.
Assuming that an elliptic curve is selected that does not have negative security implications,
and that each public-key cryptosystem is implemented correctly without any security loopholes or
backdoors, the per-bit security of ECC is far superior to that of RSA and DSA. As stated in [6], and
depicted in Figure 2-1, the current acceptable security level is 10^12 MIPS years, leading to a 160-bit modulus for ECC, and a 1024-bit modulus for both RSA and DSA [6]. In addition, the figure shows the
expected growth of the modulus size for each cryptosystem. The modulus sizes for RSA and DSA
grow exponentially versus an exponential growth in MIPS years, whereas modulus sizes for ECC
experience approximately linear growth [6]. The per-bit security of ECC is far greater than that of RSA and DSA, making it far more attractive for portable devices with limited resources, assuming the execution
times are similar. In the future, RSA and DSA modulus sizes are expected to grow exponentially,
resulting in an unacceptable amount of overhead.
Figure 2-1. Modulus Size Comparison of Public-Key Cryptosystems [2] (modulus size in bits, 0 to 6000, plotted against the time to break the cryptosystem, 1.E+04 to 1.E+36 MIPS years; one curve for ECC and one for RSA and DSA, with the current acceptable security level marked)
In general, the key size can be assumed identical to the modulus size for each system. The
total size of the system parameters and key pairs for RSA and DSA are much larger than with ECC.
They currently differ by a factor of four, which will become larger as the current acceptable security
level increases. The encrypted message and signature sizes for RSA are much larger than with ECC,
and only the signature size of DSA is the same as ECC. A comparison of the current estimated sizes of
parameters, keys, signatures and encrypted messages is presented in Table 2-1. With respect to the
table values, the signature sizes stated are for large messages, i.e., 2000 bits, and the original size of the encrypted message is 100 bits.
Larger keys, parameters, signatures, and encrypted messages require more memory for
storage and more bandwidth to transmit, both of which are scarce resources when dealing with portable
devices. Moreover, even with non-portable devices, there is no reason to unnecessarily squander
resources. As depicted in Figure 2-1 and Table 2-1, ECC provides equivalent security levels, requiring
fewer resources than alternative public-key cryptosystems.
Table 2-1. Current Estimated Memory Requirement Comparison [6]

Public-Key Cryptosystem | System Parameters (bits) | Public Key (bits) | Private Key (bits) | Signature Size (bits) | Encrypted Message (bits)
RSA                     | N/A                      | 1088              | 2048               | 1024                  | 1024
DLP (DSA, ElGamal)      | 2208                     | 1024              | 160                | 320                   | 2048
ECC (ECDSA)             | 481                      | 161               | 160                | 320                   | 321
When comparing cryptosystems, the computational overhead must be investigated. Focusing
on the computational expenses within a single cryptosystem, DLP and ECC systems behave similarly to each other, and opposite to RSA. The signature generation process for ECDSA and DSA is faster than the
verification process, whereas the verification of RSA signatures is less computationally expensive.
Decryption is slower than encryption using RSA, whereas the opposite is true for ECC.
Overall, ECC is proven to require less computational overhead. After all the techniques used to increase the performance of each cryptosystem are implemented, ECC is found to be ten times faster than
RSA and DSA [6]. Both the execution times and memory requirements of ECC are less than those
associated with RSA and DSA, making it superior to the alternatives.
2.3 Digital Signature Schemes
The concept of a digital signature is very powerful, but difficult to achieve. This section gives a brief
overview of the two types of digital signature schemes and their capabilities. Digital signatures are
designed to be similar to, and more compelling than handwritten signatures, and target digital data [30].
Digital signatures are based on the data being signed, M, and a private key only known by the signer.
Digital signatures are powerful because they provide data integrity, data origin authentication
and non-repudiation [30]. After data has been signed, all these services are achieved by the signature
verification process. There is no privacy associated with digital signatures. Transmitted data can be
easily intercepted and interpreted by eavesdroppers. To achieve confidentiality between
communicating parties, an encryption scheme must be employed.
Signatures that are verified with a sender’s public key guarantee the integrity of the
transmission because the signature is based on the original message. Messages cannot be intercepted and modified without detection, because the signature on a modified message will not verify. Without the private key of the
sender, the correct signature cannot be computed. An entity cannot impersonate another because each
entity has a unique private key. The private key is known only by the owner, and is required to
compute the digital signature of each message. Lastly, an entity cannot deny knowledge of a message
containing their signature. The entity is the only one with knowledge of their private key, and therefore
is the only one able to compute the signature of a message.
There are two types of digital signature schemes. They are schemes with and without
appendix. In the case of a digital signature scheme without appendix, the digital signature is the only
data transmitted. The transmitted data contains the original message. The signature verification
process results in the computation of the original message. It is impossible to determine the original
message without signature verification. If the verification process fails, the receiver is left with a
garbled message, and the original message cannot be determined. In the case of a digital signature with
appendix, the digital signature is computed and concatenated onto the message. The message along
with the concatenated digital signature is transmitted. It is possible for an attacker to modify the
original message and concatenate an incorrect signature. In this case, the signature will not be verified
by the receiver. Therefore, the receiver will know the transmitted data has been modified. Since the
message and signature are separate in the digital signature with appendix scheme, the verification
process is technically optional, and is left up to the receiver.
The specific type of digital signature implemented, the ECDSA, is described in §3.1. Further
details of the implemented algorithm are provided, as well as a depiction of a digital signature scheme
with appendix.
2.4 StarCore SC140 DSP Processor Description
The StarCore SC140 DSP is a high performance processor that can be clocked at a maximum of 300
MHz [46]. With its many assets that include high performance and low power consumption, the
processor targets computationally intensive communication applications [46]. The SC140 has several
features that allow for efficient digital signal processing, which are also useful for cryptographic
applications. These features are examined in §6.1.1.
The SC140 targets a wide range of communication applications. Some examples of the target
markets include wireless Internet and multimedia, network and data communications, 3rd generation
wireless handset systems with wideband data services, wireless and wireline base stations and the
corresponding infrastructure [46].
The high-performance SC140 is designed to have a large data throughput of 4.8 GBytes/sec.
The processor uses a 32-bit unified program and data address space, which is byte addressable. It is
designed to allow significant parallelism. The SC140 can include a very large on-chip zero-wait Static Random Access Memory (SRAM). The SRAM allows for efficient execution of
applications, by reducing the cost of fetching instructions from memory. The cost of reads and writes to and from memory is reduced as well.
The Data Arithmetic Logic Unit (DALU) of the SC140 performs arithmetic and logical
instructions with four parallel Arithmetic Logic Units (ALUs). Each ALU has access to the sixteen 40-bit data registers, which form the DALU register file. Each ALU contains a Multiply and Accumulate (MAC) unit and a Bit-Field Unit (BFU). The MAC unit is capable of a multiplication of two 16-bit values and an accumulate every clock cycle. The BFU contains a 40-bit bi-directional barrel shifter. It is capable of single-bit and multiple-bit arithmetic and logical shifts, as well as logical, bit-masking and bit-extraction operations.
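As a hedged illustration, the following C kernel is the kind of code the DALU favors: a dot product of 16-bit values whose inner step is one 16x16 multiply-accumulate, several of which the compiler can schedule per cycle across the parallel MAC units. The function name and data layout are illustrative, not taken from the thesis implementation.

/* 16-bit dot product: each loop iteration is one 16x16 multiply
 * plus accumulate, the operation a MAC unit performs per cycle. */
#include <stdint.h>

int64_t dot_product_q15(const int16_t *a, const int16_t *b, int n) {
    int64_t acc = 0;                  /* wide accumulator (40-bit on SC140) */
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];  /* one multiply-accumulate step */
    return acc;
}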
The Address Generation Unit (AGU) of the SC140 performs address manipulation and limited
arithmetic instructions with two parallel Address Arithmetic Units (AAUs). It contains its own register
file and operates in parallel with the DALU. The register file consists of sixteen 32-bit registers.
The SC140 employs a Variable Length Execution Set (VLES), which allows the execution of up to six instructions in a single clock cycle, fully utilizing the processing capabilities of the SC140. The combination of instructions allowed in a VLES is limited by a set of rules, but VLESs greatly reduce the overall code size. Within a VLES, NOPs are implied, eliminating the need to define an instruction for each processing
unit per clock cycle. Similar DSPs with parallel processing capabilities that do not have VLESs have a
fixed-size instruction word, referred to as a Very Long Instruction Word (VLIW). VLIWs lead to large
program sizes that have a low code density [44]. Large programs are inefficient, and attempts must be
made to avoid them when dealing with portable devices with limited memory, bandwidth and power
resources.
The SC140 has zero-overhead hardware loops that can be highly beneficial when used
correctly. The hardware loops allow for up to four levels of nesting, and provide a means of reducing
repetitive code with a minimal execution cost. By reducing repetitive code, the code size of
applications can be decreased. Hardware loops are further explained in §6.1.1.
The SC140 also has several unique addressing modes that allow for efficient execution of
repetitive algorithms. There are four addressing modes that include register direct, address register
indirect, PC relative, and special [46]. The register direct, PC relative and special address modes are
general addressing techniques that are common to most processors.
The address register indirect addressing mode is the most interesting and beneficial addressing
method. It allows several techniques of addressing memory, including methods of modifying address
registers. Post-increment and post-decrement addressing can be specified. In each case, the address
register is modified by the memory access width defined by the instruction. There is also a post-
increment by offset addressing method, where the address register is modified by the memory access
width multiplied by a control register. There is no cycle penalty for any of the post-modification addressing methods. When properly used, these addressing modes lead to tight loops with minimal wasted clock cycles when implementing repetitive algorithms.
Indirect addressing modes allow addressing by offsets. The offset can be another address
register, a short or long displacement, or a control register. For more addressing methods with
complete descriptions, refer to the SC140 DSP Core Reference Manual [46].
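The following hedged C fragment illustrates how the hardware loop and addressing features described above surface at the source level: a counted loop is a candidate for a zero-overhead hardware loop, and each pointer post-increment corresponds to the no-penalty post-increment addressing mode. The function is illustrative only; finite field addition over a binary field (§4.2.1) is simply a word-wise XOR.

/* Finite field addition over GF(2^m): c = a XOR b, word by word.
 * The counted loop maps to a hardware loop, and *p++ maps to
 * post-increment addressing on the AGU. */
#include <stdint.h>

void ff_add(uint32_t *c, const uint32_t *a, const uint32_t *b, int n) {
    for (int i = 0; i < n; i++)
        *c++ = *a++ ^ *b++;
}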
The power management control features of the SC140 further increase the processor’s
attractiveness for portable devices. The processor can be put into a wait or stop state, which are both
low power consumption modes. In these modes, the functionality of the processor greatly decreases
while waiting for an event to occur. The low power consumption modes conserve energy, in addition to that saved by the processor's low-voltage operation.
There is a wide variety of software tools available to develop applications for the SC140,
including the Metrowerks CodeWarrior Integrated Development Environment (IDE). It provides an environment where applications for the SC140 can be written, edited, compiled, assembled, simulated,
analyzed, and tested in software and hardware. Projects can be developed with the IDE using either or
both C and assembly source code. The parallelism, efficiency and size of the compiled application are controlled by compiler optimization levels selected by the developer at compile time.
The optimization techniques used by the compiler include various levels of scheduling, pipelining,
bundling, global register allocation, as well as global and space optimization.
The IDE includes a fully functional source code browser and editor, which provide developers
with an easy-to-use graphical user interface. They allow the addition and removal of files from
projects, as well as providing a means of navigating a project’s source code, thus simplifying the
development of large and complex multi-file applications.
Projects can be simulated in software or executed in hardware using the IDE debugging tool.
Software simulation is slow for complex applications, whereas hardware debugging of projects is much
faster but requires the attachment of a development board to the host or a networked computer. All of
the debugging options are available independent of simulation target. An important fact when
simulating, is code simulated and verified in software is not guaranteed to execute identically in
hardware. Some code and functionality restrictions present in hardware are not implemented by the
software simulator.
Breakpoints can be set within the C and/or assembly code for debugging purposes. The
position of breakpoints is limited when debugging optimized code. To aid in the debugging process,
commands such as step into, step over, step out, and run to cursor can be issued when the execution
sequence is paused. Stack variables, memory addresses, and internal register values can all be viewed
and modified during debugging.
A very useful component of the IDE is the profiler. The profiler can be run during the
debugging process to record statistics describing the execution sequence. It records several useful
statistical values that aid in weighing the performance of an application and individual functions.
Values including function call counts, function cycle counts, function and descendant function cycle
counts, as well as minimum, maximum and average cycle counts are recorded for each C function
executed during the simulation. A function call tree is also recorded and useful during analysis.
An easy-to-use GUI associated with the profiler allows a developer to analyze each function
within an application. The developer can navigate the function call tree, investigate the performance of each function, and view the performance and call counts of parent and descendant functions.
The profiler is an excellent tool to use during analysis of the optimization process. The
performance of individual functions can be easily analyzed. All the C functions in a project can be
sorted by statistics including total call count, total function cycle count, total function and descendant
cycle count, average function cycle count, and average function and descendant cycle count to quickly
determine the functions that consume the most execution time, and therefore most likely require
optimizations.
The data recorded by the profiler is best viewed and analyzed with the IDE and profiler, but
can be exported to other formats including HTML, XML and a tab delimited file. The alternative
formats allow the sharing of data with colleagues, who can navigate and analyze the data on computers
that do not have the IDE tool installed.
2.5 Previous Cryptographic and DSP Research
There has been a significant amount of research done on ECC. Several resources, which are referenced
in §2.2.1, compare ECC with other public-key cryptosystems. Some papers present the theory behind
ECC and the ECDSA. Others state and develop various algorithms for implementing the required ECC
operations.
The papers that state and develop various algorithms, along with ones that implement and compare the performance of the algorithms, were thoroughly investigated before the implementation
process. References [19], [20], [38], [40] and [61] present important algorithms and/or performance
results that influenced the algorithms implemented in the thesis.
Most of the work in ECC-related resources revolves around the elliptic curve point-multiplication operation because it accounts for nearly the entire execution time of the encryption and
signature processes. A significant breakthrough is presented by Solinas in [61]. He presents a
technique of reducing the execution time of the point-multiplication operation on Koblitz curves. The
modified point-multiplication algorithms presented by Solinas that use the technique, which are
presented in §3.3 and implemented in §4.4, are shown to outperform other methods in [19].
The majority of recent ECC papers focus on SCAs. They are a new class of attacks that use
timing and power analysis to break cryptosystems. Kocher first described the attacks against RSA, DSS and others in [33], and Coron later generalized the technique to include ECC [10]. Actual power
traces of elliptic curve point-multiplication were published in [13], illustrating resistance to power
analysis attacks on the SC140. However, prime fields, and not binary fields, were used in the
implementation. SCAs are examined in chapter 7, focusing on the implementation. The three types of
SCAs and countermeasures for each are presented, as well as some alternative techniques, developed by the writer, that may foil SCAs.
Most cryptographic research to date, including both symmetric and asymmetric-key
cryptosystems, involves general-purpose processors. A minimal amount of research has been done
involving DSP implementations of cryptography. For example, a very broad view of the secure
communication issues with respect to DSPs is presented in [12], implementation of AES on DSPs is
investigated in [64], and efficient implementations of 1024-bit RSA, 1024-bit DSA and ECDSA using
a 160-bit prime field are described in [26]. Finally, power-attacks on the elliptic curve point-
multiplication operation using prime fields are investigated in [13]. Obviously, further cryptographic
research involving DSPs is required.
3 The ECDSA Algorithm and Implementation Philosophy
This chapter describes the ECDSA protocol and its implementation. A thorough research
process led to a set of efficient algorithms that compute operations required by the ECDSA. The
algorithms used to implement the various finite field, large integer and elliptic curve operations are
described and listed. To conclude, a section that describes the implementation and integration
philosophy used is included. It explains terminology used and provides some vital implementation
information, both of which are required to fully understand chapter 4.
3.1 The ECDSA
The ECDSA is an asymmetric digital signature scheme with appendix that is an analogue to the DSA
[30]. The main difference between the two techniques is the problem each is based on. The ECDSA is based on the ECDLP, which is believed to be a more difficult problem than the DLP, on which the DSA is based.
The ECDSA is asymmetric, meaning different keys are used to generate and verify the digital
signature. The key used in the generation process, referred to as the private key, is kept secret by the
signing entity. Other entities must not know the private key for the signature scheme to function
correctly. The public key, which is used in the signature verification process, is stored by a CA. The
CA distributes the key to any entity that requests it, making it publicly known.
The ECDSA is a digital signature scheme with appendix. The digital signature is appended to
the original message, leaving the message in the clear. Figure 3-1 is a depiction of a digital signing
process for digital signatures with appendices. The original message may already be encrypted.
[Figure 3-1. Digital Signature with Appendix: beginning with the original message, the digital signature is computed from it, and the original message is transmitted with the signature appended.]
The ECDSA signature generation algorithm used to sign a message M is listed as Algorithm
3-1. It uses domain parameters, which provide cryptosystem details such as the base point G, the curve
order n, and the reduction polynomial f. The sender, who generates the signature, has a private and
public key, d and Q respectively. SHA-1() is a standardized function that is used to compute a hash of
the original message. The value k, also known as a nonce, is uniquely computed for each digital
signature. The signature generation algorithm computes r and s, which form the digital signature or
appendix.
Algorithm 3-1. ECDSA Signature Generation [30]
Input: d, f, n, G, M
Output: r, s
1. Select a random or pseudorandom integer k, 1 ≤ k ≤ n-1.
2. Compute k⋅G = (x1, y1) and convert x1 to an integer.
3. Compute r = x1 mod n. If (r = 0) then go to step 1.
4. Compute k^-1 mod n.
5. Compute SHA-1(M) and convert this bit string to an integer e.
6. Compute s = k^-1⋅(e + d⋅r) mod n. If (s = 0) then go to step 1.
7. The signature for message M is (r, s).
The signature verification process is listed as Algorithm 3-2. To verify an elliptic curve digital
signature, the receiver must first obtain a verified copy of the signer’s domain parameters and Q. The
values of all variables listed in the verification algorithm below are assumed to be the same as those listed in the generation algorithm. This assumption may be violated due to transmission error(s) and/or third parties. When the values of the variables differ, the signature verification process is almost
guaranteed to fail.
Algorithm 3-2. ECDSA Signature Verification [30]
Input: f, n, r, s, G, Q, M
Output: signature verification or rejection
1. Verify that r and s are integers in the interval [1, n-1].
2. Compute SHA-1(M) and convert this bit string to an integer e.
3. Compute w = s^-1 mod n.
4. Compute u1 = e⋅w mod n and u2 = r⋅w mod n.
5. Compute X = u1⋅G + u2⋅Q.
6. If X = O (the point at infinity), then reject the signature. Otherwise, convert the x-coordinate x1 of X to an integer, and compute v = x1 mod n.
7. Accept the signature if and only if v = r.
The elliptic curve point-multiplications in steps 2 and 5 of the signature generation and
verification algorithms, respectively, are by far the most time-consuming operations. As previously
stated, point-multiplication involves several point additions and possibly subtractions, which each
involve finite field multiplications and a finite field inversion. Other steps in the signature generation
and verification processes only require single finite field multiplication and inversion operations.
For an optimum implementation of the ECDSA, the most efficient algorithms that implement
each finite field, large integer and elliptic curve operation must be employed. Inefficient algorithms are
detrimental to the overall performance. For example, inefficient finite field algorithms require more
computations, and therefore more clock cycles to execute. This increases the computational cost of the
implementation significantly because finite field operations are commonly used within the ECDSA.
The decrease in performance is further impacted when inefficient elliptic curve multiplication
algorithms are implemented. These algorithms require several more finite field operations than
superior algorithms, resulting in an even further loss in performance. It is imperative to use the most
efficient algorithms possible because the overall performance is limited by the underlying algorithms
used to implement each operation.
Inefficient implementations of finite field operations are especially damaging to the overall performance of the ECDSA, because each elliptic curve operation consists of several finite field operations.
By slightly improving implementations of finite field operations, the overall performance of the
ECDSA can be significantly improved. In §4.2 and §4.3, it is shown how even slight improvements
and optimizations to implemented operations have a large impact on performance figures.
3.2 Finite Field and Large Integer Arithmetic
The following sections present an overview of the finite field, or polynomial, and large integer
operations required by the ECDSA. Finite field operations, which are the foundation of ECC, were
thoroughly researched in an attempt to obtain optimum algorithms. Alternative large integer operations
were not investigated because of their limited effect on execution times. Each section briefly
describes the operations and presents the algorithm selected for implementation during the research process.
3.2.1 Basic Operations
Several basic finite field and large integer functions are required by the ECDSA. The theory behind
these operations, including finite field addition and reduction, and large integer addition and
subtraction is not described. Algorithms for the operations are not stated because of their simplicity. If
they are not already known, algorithms to implement the operations can be found in [19], [30] and [38].
A specialized algorithm that implements finite field reduction exists. The algorithm is field
dependent, meaning the details of the algorithm depend on the reduction polynomial. A description of
how to formulate the algorithm for a specific reduction polynomial is presented in [19]. The algorithm
formulated for the reduction polynomial f = x^163 + x^7 + x^6 + x^3 + 1, which defines the binary finite field used
for the scope of the thesis, is presented in the paper and below as Algorithm 3-3. The assumption that
32-bit registers are used to store finite field elements, c and a, is made. Therefore, c[i] refers to the ith
32-bit register. Furthermore, ⊕ represents the exclusive-or operation, and >> and << represent right
and left bit-shifts of the corresponding variable respectively.
Algorithm 3-3. Finite Field Reduction (c = a mod f) [19]
Input: a = (a324, a323, a322,…, a0)
Output: c = (c162, c161, c160,…, c0)
1. c = a
2. For i = 10 to 6 do
2.1. T = c[i]
2.2. c[i-6] = c[i-6] ⊕ (T << 29)
2.3. c[i-5] = c[i-5] ⊕ (T << 4) ⊕ (T << 3) ⊕ T ⊕ (T >> 3)
2.4. c[i-4] = c[i-4] ⊕ (T >> 28) ⊕ (T >> 29)
3. T = c[5] & (0xFFFFFFF8)
4. c[0] = c[0] ⊕ (T << 4) ⊕ (T << 3) ⊕ T ⊕ (T >> 3)
5. c[1] = c[1] ⊕ (T >> 28) ⊕ (T >> 29)
6. c[5] = c[5] & (0x00000007)
7. Return(c[5], c[4], c[3], c[2], c[1], c[0])
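Algorithm 3-3 translates almost directly into C. The following sketch, which assumes 32-bit words with word 0 holding the least significant bits, illustrates the translation; the function name matches the reduction163 routine described in §4.2.2, but the code is an illustrative sketch rather than the thesis implementation.

    #include <stdint.h>

    /* Reduce a polynomial of degree <= 324 (11 32-bit words, c[0] least
       significant) modulo f = x^163 + x^7 + x^6 + x^3 + 1. The reduced
       result occupies words c[0]..c[5]. */
    void reduction163(uint32_t c[11])
    {
        uint32_t T;
        int i;

        for (i = 10; i >= 6; i--) {                  /* step 2 */
            T = c[i];
            c[i-6] ^= (T << 29);
            c[i-5] ^= (T << 4) ^ (T << 3) ^ T ^ (T >> 3);
            c[i-4] ^= (T >> 28) ^ (T >> 29);
        }
        T = c[5] & 0xFFFFFFF8;                       /* steps 3-6 */
        c[0] ^= (T << 4) ^ (T << 3) ^ T ^ (T >> 3);
        c[1] ^= (T >> 28) ^ (T >> 29);
        c[5] &= 0x00000007;
    }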
3.2.2 Finite Field Multiplication
Several algorithms exist to perform finite field multiplication. The complexity and performance of
each vary considerably. A thorough investigation of the algorithms and performance figures presented
in several papers led to the conclusion that a multiplication technique, the left-to-right comb method with windows, is the most efficient algorithm [19].
Presented as Algorithm 3-4, the left-to-right comb method uses a Look-Up Table (LUT) to
reduce repetitive computations. A window of variable width w determines the computed set of
multiples of the element b. The multiples of b are stored in a LUT, and later indexed by sets of bits
from element a. The indexed values are added to the running sum, c, of the multiplication until all of
the sets of bits from element a are used. Then, to complete the operation, c is reduced. The algorithm
presents an efficient method of computing the result of a finite field multiplication of two elements.
To further improve the performance of Algorithm 3-4, the window width w must be fixed. By
doing so, the code becomes less general. Assumptions can be made because of the fixed window
width, increasing the performance of the operation.
Algorithm 3-4. Finite Field Multiplication (c = a⋅b) [39]
Input: a = (am-1,am-2,…,a0), b = (bm-1,bm-2,…,b0), w
Output: c = (cm-1,cm-2,…,c0)
1. Compute bu = u⋅b for all polynomials u of degree less than w
2. c = 0
3. For k = (W/w) - 1 downto 0 do, where W = 32 is the word width
3.1. For j = 0 to t-1 do, where t = ⌈m/W⌉ is the number of words per element
3.1.1. c = c ⊕ (bu << W⋅j), where u = (a(W⋅j+w⋅k+w-1), …, a(W⋅j+w⋅k+1), a(W⋅j+w⋅k))
3.2. If (k ≠ 0) then c = c << w
4. c = c mod f
5. Return(c)
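A minimal C sketch of the comb method follows, assuming 32-bit words, w = 4 and the six-word elements of the 163-bit field used in the thesis. The names poly_mul_comb and tab are illustrative (the thesis routine is poly_mul_win, described in §4.2.3), and the unreduced product must still be reduced, for example with Algorithm 3-3.

    #include <stdint.h>
    #include <string.h>

    #define NUMWORD 6                /* words per 163-bit element */
    #define W       4                /* window width w */

    /* c (2*NUMWORD words) = a*b as polynomials over GF(2), not yet reduced. */
    void poly_mul_comb(uint32_t c[2*NUMWORD],
                       const uint32_t a[NUMWORD], const uint32_t b[NUMWORD])
    {
        uint32_t tab[1 << W][NUMWORD];   /* tab[u] = u(x)*b(x), deg u < W */
        int u, i, j, k;

        /* Step 1: pre-compute the LUT; tab[u] = (tab[u>>1] << 1) ^ (u&1 ? b : 0). */
        memset(tab[0], 0, sizeof tab[0]);
        memcpy(tab[1], b, sizeof tab[1]);
        for (u = 2; u < (1 << W); u++) {
            for (i = NUMWORD - 1; i > 0; i--)
                tab[u][i] = (tab[u >> 1][i] << 1) | (tab[u >> 1][i-1] >> 31);
            tab[u][0] = tab[u >> 1][0] << 1;
            if (u & 1)
                for (i = 0; i < NUMWORD; i++)
                    tab[u][i] ^= b[i];
        }

        /* Steps 2-3: accumulate window multiples, shifting by W bits per pass. */
        memset(c, 0, 2 * NUMWORD * sizeof c[0]);
        for (k = (32 / W) - 1; k >= 0; k--) {
            for (j = 0; j < NUMWORD; j++) {
                u = (a[j] >> (W * k)) & ((1 << W) - 1);
                for (i = 0; i < NUMWORD; i++)
                    c[j + i] ^= tab[u][i];          /* add tab[u] at word j */
            }
            if (k != 0) {                           /* c = c << W */
                for (i = 2*NUMWORD - 1; i > 0; i--)
                    c[i] = (c[i] << W) | (c[i-1] >> (32 - W));
                c[0] <<= W;
            }
        }
    }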
3.2.3 Finite Field Squaring
Finite field squaring is a special case of multiplication, where the two elements involved are equal.
Due to the equality, the multiplication operation becomes much simpler. As described in [19], when an
element is represented using the PB, squaring is equivalent to expanding its bit representation by
inserting a zero bit between each pair of consecutive bits. After the expansion, the element must be
reduced using Algorithm 3-3. Algorithm 3-5, which squares an element, is presented below.
Algorithm 3-5. Finite Field Squaring (c = a^2) [19]
Input: a = (am-1, am-2, …, a0)
Output: c = (cm-1, cm-2, …, c0)
1. For v = (v3, v2, v1, v0) = 0 to 15 do
1.1. T(v) = (0, v3, 0, v2, 0, v1, 0, v0)
2. c = 0
3. For i = 0 to t-1 do, where t = ⌈m/4⌉
3.1. c = c ⊕ (T(a4i+3, a4i+2, a4i+1, a4i) << 8⋅i)
4. c = c mod f
5. Return(c)
A LUT is the most efficient method of achieving the element expansion. For example, i bits of the original element are used to index a 2i-bit value in the LUT, which is the expansion of those i bits.
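A C sketch of the expansion step follows, using the 4-bit LUT of Algorithm 3-5. The table is simply T(v) from step 1 written out as bytes; the names are illustrative, and the expanded result must still be reduced (for example with reduction163) to complete the squaring.

    #include <stdint.h>

    #define NUMWORD 6   /* words per 163-bit element */

    /* T(v) from Algorithm 3-5: a zero bit inserted between each bit of v. */
    static const uint8_t sqr_tab[16] = {
        0x00, 0x01, 0x04, 0x05, 0x10, 0x11, 0x14, 0x15,
        0x40, 0x41, 0x44, 0x45, 0x50, 0x51, 0x54, 0x55
    };

    /* c (2*NUMWORD words) = bit expansion of a; reduce c to finish squaring. */
    void poly_sqr_expand(uint32_t c[2*NUMWORD], const uint32_t a[NUMWORD])
    {
        int i, j;
        for (j = 0; j < NUMWORD; j++) {
            uint32_t lo = 0, hi = 0;
            for (i = 0; i < 4; i++) {
                lo |= (uint32_t)sqr_tab[(a[j] >> (4*i)) & 0xF] << (8*i);
                hi |= (uint32_t)sqr_tab[(a[j] >> (16 + 4*i)) & 0xF] << (8*i);
            }
            c[2*j]   = lo;      /* expansion of the low 16 bits of a[j]  */
            c[2*j+1] = hi;      /* expansion of the high 16 bits of a[j] */
        }
    }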
3.2.4 Finite Field Inversion
Inversion is the most computationally expensive finite field operation required by the ECDSA.
Therefore, it is extremely important to implement the best algorithm when execution time is a concern.
There are two principal algorithms, the Extended Euclidean Algorithm (EEA) and Almost Inverse
Algorithm (AIA), which efficiently compute the inverse of a finite field element.
The AIA eliminates nonzero bits of the input element from right to left, and computes a value
that requires a reduction before the inverse element is obtained. The AIA is less intuitive than the
EEA, and is expected to require fewer loop iterations [19]. Nevertheless, extensive research showed that the EEA outperforms the AIA in many papers, including [19] and [20].
A variation of the EEA, developed and published by Hasan in [20], is shown to outperform the
original algorithm while requiring a smaller memory footprint. The inversion algorithm is presented
below as Algorithm 3-6.
Algorithm 3-6. Finite Field Inversion (b = a^-1 mod f) [20]
Input: a = (am-1, am-2, …, a0)
Output: b = (bm-1, bm-2, …, b0)
1. r(-1) = f, r(0) = a, u(-1) = 0, u(0) = 1
2. deg_r(-1,0) = m, deg_r(-1) = m, deg_r(0) = degree(r(0))
3. d(1,0) = m – deg_r(0), i = 0
4. Do
4.1. i = i + 1, j = 0
4.2. r(i-2,0) = r(i-2), u(i-2,0) = u(i-2)
4.3. While (d(i, j) ≥ 0) do
4.3.1. r(i-2,j+1) = r(i-2,j) ⊕ r(i-1)⋅x^d(i,j), u(i-2,j+1) = u(i-2,j) ⊕ u(i-1)⋅x^d(i,j)
4.3.2. j = j + 1, deg_r(i-2,j) = degree(r(i-2,j)), d(i, j) = deg_r(i-2,j) – deg_r(i-1)
4.4. r(i) = r(i-2,j), u(i) = u(i-2,j)
4.5. deg_r(i) = deg_r(i-2,j)
4.6. d(i+1,0) = -d(i,j)
5. While (r(i) ≠ 0)
6. Return (u(i-1))
Algorithm 3-6 was developed by Hasan with the aid of custom hardware that further improves
performance. The custom hardware allows certain obscure steps and functionality only required when
inverting finite field elements to be implemented with minimal execution time cost. For the thesis
implementation, the absence of specialized functionality required the software emulation of some tasks
at the cost of execution time and/or memory storage.
3.2.5 Large Integer Operations
Large integer operations, including multiplication, inversion and division, are also required when
implementing the ECDSA. Efficient algorithms for implementation of the operations were not
researched, primarily because only a minute portion of the ECDSA involves large integer operations.
Therefore, the impact of optimal algorithms for the operations is lessened significantly.
Large integer inversion is extremely similar to finite field inversion. Algorithms that are
designed for either operation can be very easily converted to the other by replacing basic operations.
For example, Algorithm 3-6 can be converted to invert large integers by replacing all finite field
additions with large integer subtractions. Some other modifications must be made to handle inequalities that do not arise with finite field elements. Before each subtraction, measures should be added to ensure the result is positive. These measures are not strictly necessary, but omitting them requires other special cases and modifications to the algorithm.
Neither large integer multiplication nor division is a common operation in the ECDSA. Therefore, basic algorithms, for example long division in the case of large integer division, can be used
to implement the operations. The performance of such algorithms has minimal impact on the overall
execution times of the signature generation and verification processes.
3.3 Elliptic Curve Arithmetic
Several elliptic curve operations are required when implementing the ECDSA. The following sections describe the operations required by the signature generation and verification processes, briefly discuss an alternative representation of finite field elements that allows superior point-multiplication computations, and present efficient algorithms for implementation.
3.3.1 Elliptic Curve Point Addition and Subtraction
Elliptic curve addition and subtraction are very similar point manipulations that consist of several finite
field operations. First, the point addition operation is described and defined. Later, point subtraction is
defined. It is performed by computing the negative of a point and then executing point addition.
Elliptic curve point addition is defined as the summation of two points, P3 = P1 + P2. There
are four different cases for point addition. Each case is individually explained in the following
paragraphs.
First, an easy case of point addition is defined. When either point involved in the addition is the point at infinity, designated O, the result is the other point involved in the summation, as shown by Formula 3.1.
P + O = O + P = P for all points P on an elliptic curve (3.1)
The second case of point addition is when a point and its negative counterpart are involved.
For a point P = (x, y), a point is called the negative of P, denoted -P, if it has the coordinates (x, x + y).
The result of such an addition is defined as the point at infinity, O. Formula 3.2 presents this case of point addition.
P + (-P) = (-P) + P = O for all points P on an elliptic curve (3.2)
The third case of point addition is the general case of the operation, where the sum of two points P1 = (x1, y1) and P2 = (x2, y2) is calculated, where P1 ≠ ±P2 and neither is equal to O. The operation results in the point P3 = (x3, y3), where Formula 3.3 holds. The variable a is a domain parameter associated with the elliptic curve.
λ = (y1 + y2) / (x1 + x2)
x3 = λ^2 + λ + x1 + x2 + a
y3 = λ⋅(x1 + x3) + x3 + y1 (3.3)
where Pi = (xi, yi) and P3 = P1 + P2
The above formulas for elliptic curve point addition define a complex relationship between the
coordinates of the resultant and original points. All operations within the formulas are finite field
operations. The general case of elliptic curve point addition is an expensive operation consisting of one
finite field inversion, two multiplications, one squaring and numerous additions.
Finally, the last case of point addition, where P1=P2, is also known as point doubling. The
result of point doubling, P3 = (x3, y3) = P1 + P2 = 2P1, where P1 = (x1, y1) is calculated using Formula
3.4. Elliptic curve point doubling is slightly less expensive than point addition. It requires one finite
field inversion, two multiplications, two squarings and numerous additions. Similar to a, the variable
b is a domain parameter associated with the elliptic curve involved.
x3 = x1^2 + b/x1^2
y3 = x1^2 + (x1 + y1/x1)⋅x3 + x3 (3.4)
where Pi = (xi, yi) and P3 = P1 + P1 = 2⋅P1
The basic algorithm for point addition that implements all four point addition cases is stated below as Algorithm 3-7. In the algorithm, division of finite field elements is achieved by multiplying the numerator of the expression by the inverse of the denominator.
Algorithm 3-7. Elliptic Curve Point Addition (P3 = P1 + P2) [38]
Input: P1 = (x1, y1), P2 = (x2, y2)
Output: P3 = (x3, y3)
1. If (P1 = O) then Return(P3 = P2)
2. If (P2 = O) then Return(P3 = P1)
3. If (x1 = x2) then
3.1. If (y1 = y2) then
3.1.1. λ = x1 + y1/x1
3.1.2. x3 = λ^2 + λ + a
3.2. Else Return(P3 = O)
4. Else
4.1. λ = (y1 + y2) / (x1 + x2)
4.2. x3 = λ^2 + λ + x1 + x2 + a
5. y3 = λ⋅(x1 + x3) + x3 + y1
6. Return(P3 = (x3, y3))
The point addition algorithm shown above describes a technique of performing the operation.
No optimizations to the algorithm are possible without changing the coordinate system. The point
addition algorithm is based on an affine coordinate system, which is the basic coordinate system used
with elliptic curve operations. There are other coordinate systems, referred to as projective coordinate
systems, which define modified addition algorithms. Projective coordinate systems are examined in
§3.3.2.
Elliptic curve point subtraction, P3 = P1 - P2, is achieved by computing the negative of P2, and
then summing the result of the negation with P1. The formula for computing the negative of a point, which can be very easily derived from its definition, is presented as Formula 3.5.
xj = xi, yj = xi + yi, where Pi = (xi, yi) and Pj = -Pi (3.5)
The purpose of projective coordinate systems is to avoid the costly finite field inversion
operation required in each point addition. These coordinate systems define points with three
coordinates, and essentially replace inversion operations with several finite field multiplications. The
coordinate systems are beneficial when the cost of finite field inversion is substantially larger than
multiplication. The different coordinate systems found by research efforts are compared in the
following section.
3.3.2 Elliptic Curve Point Representation
Finite field inversion is the most expensive finite field operation involved in the ECDSA. To avoid
inversion, several types of projective coordinates have been researched. Projective coordinate systems
use three coordinates (x, y, z). Affine points are converted into projective coordinate points at the start
of a point-multiplication operation. The operation is carried out, and the result is then converted back
into an affine point.
The purpose of changing coordinate systems is to eliminate finite field inversions from the
several point addition operations within a point-multiplication operation. Effectively, projective
coordinate systems provide a means of computing a point addition, where the single inverse operation
is exchanged for numerous multiplication operations. Since multiplication is a less costly finite field
operation, projective coordinates can be computationally beneficial. It should be noted that inversion
operations are required when converting between the affine and projective coordinate systems.
Coordinate conversion is required before and after each point-multiplication operation.
Several papers including [15], [19] and [26] show that the use of projective coordinates over
affine is beneficial. The advantage of a projective coordinate system is that the most expensive finite
field operation, inversion, is replaced with several less expensive finite field multiplication operations.
A comparison of projective coordinate systems with the affine coordinate system shows that, unless
finite field inversions are at least ten times as expensive as multiplications, the use of projective
coordinate systems is not beneficial.
Three alternatives to the affine coordinate system are the Standard projective, Jacobian projective and Projective coordinate systems. The following table shows the number of multiplications and inversions required for general point addition and point doubling in each coordinate system.
Table 3-1. Elliptic Curve Coordinate System Comparison [19]
Coordinate System General Addition Doubling
Affine (x, y) 1 inversion, 2 multiplications 1 inversion, 2 multiplications
Standard projective (x/z, y/z) 13 multiplications 7 multiplications
Jacobian projective (x/z2, y/z3) 14 multiplications 5 multiplications
Projective (x/z, y/z2) 14 multiplications 4 multiplications
Table 3-1 also shows the tradeoff between inversion and multiplication that is possible when a
projective coordinate system is employed. When focusing primarily on point doubling, projective
coordinates are beneficial. By using one of the projective coordinate systems listed, an inversion can
be traded for two to five multiplications. This is definitely a reasonable tradeoff, since [19], [38] and [40] report the cost of a single inversion as approximately ten multiplications.
Alternatively, speedups of 30-70% are predicted for elliptic curve point addition when
Algorithm 3-6 is used for finite field inversion [20], compared to results similar to those from [19], [38]
and [40]. Since the only performance enhancement is from the inversion operation, it is reasonable to
assume that the performance of the inversion operation is proportional to much less than ten
multiplications. With the performance of inversions being less than ten multiplications, projective
coordinate systems are not as attractive. Point doubling may still be slightly faster using projective
coordinates, but general addition, where an inversion is traded for over ten multiplications, is definitely less efficient. Only the point doubling operation benefits when using a projective coordinate system
and efficient inversion is possible.
The primary advantage of using Koblitz curves is that point-multiplication algorithms that do not require any point doublings can be exploited [19]. In the subsequent section, the benefits of such algorithms are examined and found superior. Therefore, point doublings can be eliminated, and thus they do
not play a factor in the choice of a coordinate system. Without point doublings involved in the elliptic
curve point-multiplication operation, and a finite field inversion algorithm that is expected to
outperform ten multiplications, projective coordinates are not computationally beneficial, and an
affine coordinate system is superior for the implementation of ECDSA.
3.3.3 Elliptic Curve Point-Multiplication
There are several ways of representing a finite field element. In general, each representation has
positive and negative aspects. For example, the NB representation of a finite field element simplifies the squaring operation significantly. When an element is represented using the NB, the squaring operation reduces to a cyclic left shift. Unfortunately, other operations become more expensive when using the
NB [21]. The following sections present some alternative representations of k from its original binary
integer representation, which improve the performance of the elliptic curve point-multiplication
operation. The sections also analyze the benefits of the alternative representations of k, and the point-
multiplication algorithms associated with each representation.
3.3.3.1 Non-Adjacent Form
A Non-Adjacent Form (NAF) is a beneficial representation of k that can be exploited when performing point-multiplication to increase performance. By definition, each coefficient of the NAF representation of a finite field element
belongs to the set {-1, 0, +1}, and no two consecutive coefficients are nonzero. When comparing NAF
and binary integer representations, there is at most one additional coefficient required to represent an
element. The benefit of using the NAF representation of k when performing point-multiplications is
the reduced Hamming weight of k, and therefore the reduced number of point addition or subtraction
operations required.
Each coefficient in the NAF representation belongs to the set {–1, 0, +1}. This means that for
every 0 coefficient of k, as with the binary integer representation, a point addition operation is not
required. For every –1 coefficient of k, a point subtraction operation is required. The point subtraction
operation is very similar to point addition. There is no substantial difference in computational costs
between the two operations. For every +1, as with the binary integer representation, a point addition
operation is required.
Consider an average random finite field element k, which is m bits long. The binary integer representation of k requires m bits, on average half of which are zeros and the other half ones, so point-multiplication requires m/2 point additions. In comparison, an average random finite field element represented using NAF has a Hamming weight of m/3 [61]. Therefore, the point-multiplication only requires m/3 point additions or subtractions, reducing the number of point additions and subtractions by
m/6. This is a significant improvement considering the cost of point additions and subtractions.
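For example, with the m = 163 field used in the thesis, the binary integer representation leads to roughly 163/2 ≈ 81 point additions per point-multiplication, while the NAF representation leads to roughly 163/3 ≈ 54 point additions or subtractions, a saving of approximately 27 point operations.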
When the NAF technique of point-multiplication is combined with windowing techniques,
further benefits arise. The equivalent number of pre-computed values in the LUT is halved when using
the NAF representation, because only the positive values are required. The negative values can be
calculated during execution using the corresponding positive value, because point negation is inexpensive.
Algorithms that compute the NAF representation of a finite field element, and perform NAF
point-multiplication are presented by Solinas in [61]. For the NAF point-multiplication algorithm to be
beneficial, the cost of converting k from binary integer to NAF must be less than the cost of m/6 point
additions, where m-bits are required to represent k. As shown in several papers, the NAF point-
multiplication algorithm is beneficial, compared to binary integer point-multiplication methods, for
current and future finite field sizes [19].
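To make the conversion concrete, the following C sketch computes the NAF digits of a small nonnegative integer, least significant digit first. The function name is illustrative, and a real implementation operates on multi-word large integers rather than a long long.

    /* NAF digits of k >= 0, least significant first; returns the digit count. */
    int naf_convert(long long k, int digits[], int max_digits)
    {
        int n = 0;
        while (k != 0 && n < max_digits) {
            int u = 0;
            if (k & 1) {
                u = 2 - (int)(k & 3);   /* u in {-1, +1}; k - u divisible by 4 */
                k -= u;
            }
            digits[n++] = u;
            k >>= 1;
        }
        return n;
    }

For instance, k = 7 yields the digits (1, 0, 0, -1) read from most to least significant, i.e. 7 = 8 - 1, with Hamming weight two instead of the three of its binary representation.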
3.3.3.2 Reduced TNAF Representation
Solinas presented a method of exploiting the τ-adic representation of a finite field element to reduce the
computational cost of elliptic curve point-multiplication over Koblitz curves [61]. The benefits of the
point-multiplication technique described are dependent on the implementation and other factors.
Supporting Solinas's paper, the significant reduction in computational cost documented in [19] is verified in §5.1.2. The τ-adic representation is only valid for Koblitz curves, which is the type of elliptic curve
implemented in the thesis. Point-multiplication is the most time consuming operation associated with
the ECDSA, making Solinas’ findings significant.
The benefit of using the τ-adic representation of an element is derived directly from the
formula describing the base of the representation, τ. Formula 3.6 describes the complex number τ, and
includes a variable µ, which is a Koblitz curve parameter and takes on the value of ±1 [61]. With some
algebraic manipulation, the Frobenius map presented as Formula 3.7 can be derived from Formula 3.6.
The mapping holds for all points P on the Koblitz curve, where P = (x, y) [61].
τ^2 + 2 = µ⋅τ (3.6)
τ⋅(x, y) := (x^2, y^2) (3.7)
Solinas presents an algorithm that exploits Formula 3.7. The algorithm is very similar to the
NAF point-multiplication algorithm. The τ-adic representation of k is required by the algorithm, where
k is the finite field element involved in the point-multiplication. The τ-adic representation of k requires
several more coefficients than the binary integer representation, but the inefficiency due to the larger
representation is resolved by Algorithm 3-9. By utilizing Formula 3.7, every point doubling operation
in the point-multiplication algorithm is replaced by the squaring of the individual coordinates of the
point, which is much less computationally expensive.
Solinas adds two other improvements to the τ-adic representation before stating a point-
multiplication algorithm. The τ-adic representation is combined with a NAF representation, so the
finite field element is represented in τ-adic NAF (TNAF) form. The modification combines the
benefits of both NAF and τ-adic representations. The Hamming weight of the element in TNAF form is
reduced from the τ-adic representation, and all point doublings are eliminated. Algorithm 3-8
computes the TNAF representation of a finite field element.
In the following algorithms, subscripts such as 2, TNAF and Width-w TNAF (TNAFw)
symbolize the representation of the large integer k. Subscript 2 symbolizes a base2 binary integer
representation, whereas TNAF and TNAFw symbolize TNAF and TNAFw representations respectively.
Algorithm 3-8. TNAF Conversion (kTNAF = r0 + r1⋅τ) [61]
Input: integers r0, r1 where k2 = r0 + r1⋅τ
Output: kTNAF
1. k = {}
2. While (r0 ≠ 0) or (r1 ≠ 0)
2.1. If (r0 odd) then
2.1.1. u = 2 - ((r0 - 2⋅r1) mod 4)
2.1.2. r0 = r0 – u
2.2. Else
2.2.1. u = 0
2.3. Prepend u to k
2.4. t = r0/2, r0 = r1 + µ⋅t
2.5. r1 = -t
3. Return (k)
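As with the NAF conversion, Algorithm 3-8 can be illustrated with a small-integer C sketch, where mu is the curve parameter µ from Formula 3.6. The name tnaf_convert is illustrative; the thesis implementation operates on multi-word large integers.

    /* TNAF digits of r0 + r1*tau, least significant first; mu is +1 or -1. */
    int tnaf_convert(long long r0, long long r1, int mu,
                     int digits[], int max_digits)
    {
        int n = 0;
        while ((r0 != 0 || r1 != 0) && n < max_digits) {
            int u = 0;
            if (r0 & 1) {                                  /* r0 odd */
                u = 2 - (int)(((r0 - 2*r1) % 4 + 4) % 4);  /* u in {-1, +1} */
                r0 -= u;
            }
            digits[n++] = u;
            long long t = r0 / 2;     /* exact division: r0 is now even */
            r0 = r1 + mu * t;         /* simultaneous update of (r0, r1) */
            r1 = -t;
        }
        return n;
    }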
As previously stated, the number of coefficients required to represent k in TNAF form is greater than with the binary integer representation. The TNAF representation can be reduced, so that the same number of coefficients is required as with the binary integer representation, by using Formula 3.8 and Formula 3.9. The reduction of the polynomial, mod (τ^m - 1), results in an equivalent and minimal τ-adic representation. The minimal, or reduced, TNAF representation of a finite field element has a length ≤ m + a, where a is an elliptic curve parameter, and an average density of 1/3 + o(1) [61]. Solinas thoroughly describes the theory behind the reduction and how it is performed. However, the reduction mod (τ^m - 1) requires a division of large integers that results in a real number, which is not practical for
implementation. He later states a method that allows efficient implementation of an approximation of
the reduction.
τ^m⋅(x, y) = (x^(2^m), y^(2^m)) = (x, y) (3.8)
(τ^m - 1)⋅(x, y) = O (3.9)
The modified reduction algorithm, referred to as partmod δ, approximates the reduction mod (τ^m - 1), which guarantees the minimal TNAF representation of a finite field element. The partmod δ
algorithm, presented as Algorithm 3-9, calculates a reduced TNAF representation of the finite field
element [61]. Associated with the partmod δ is a probability of the reduction resulting in the minimal
TNAF representation of the element. To implement the reduction, the finite field element is passed to
the modified reduction algorithm, which computes inputs for the TNAF conversion algorithm. The
conversion results in a reduced TNAF representation of the original finite field element.
The modified reduction algorithm utilizes a parameter C, which determines the probability that
the reduced representation of the element is minimal. The inequality from [61] that describes the
probability is presented as Formula 3.10. The value of C corresponds to the number of significant
figures kept during the division of two integers. It also determines the computational cost of the
algorithm. For example, at the cost of computational time, an extremely large value of C can be used,
guaranteeing the minimal representation of the finite field element. For implementation on processors,
the register width should be taken into account when determining an appropriate C value.
Probability of not being minimal < 2^-(C - 5) (3.10)
Solinas presents the partmod δ reduction algorithm in several parts. First, the original
reduction algorithm is stated in a form difficult to implement. Then steps in the original algorithm are
broken down so they are efficiently realizable. The realizable methods approximate the division of
large integers from the original algorithm. Algorithm 3-9 performs the partmod δ reduction, combining
formulas and algorithms listed by Solinas.
Algorithm 3-9. Partmod δ Reduction (r0 + r1⋅τ := k2 partmod δ) [61]
Curve Parameters: a, m, r, s0, s1, Vm
Input: k2, C
Output: integers r0, r1
1. d0 = s0 + µ⋅s1
2. K = (m + 5)/2 + C
3. k' = ⌊k / 2^(m - K + 2 + a)⌋
4. g0 = s0⋅k', g1 = s1⋅k'
5. h0 = ⌊g0/2^m⌋, h1 = ⌊g1/2^m⌋
6. j0 = Vm⋅h0, j1 = Vm⋅h1
7. l0 = ⌊(g0 + j0)/2^(K - C) + ½⌋, l1 = ⌊(g1 + j1)/2^(K - C) + ½⌋
8. λ0 = l0/2^C, λ1 = l1/2^C
9. f0 = ⌊λ0 + ½⌋, f1 = ⌊λ1 + ½⌋
10. η0 = λ0 - f0, η1 = λ1 - f1
11. h0 = 0, h1 = 0
12. η = 2⋅η0 + µ⋅η1
13. If (η ≥ 1) then
13.1. If (η0 - 3⋅µ⋅η1 < -1) then h1 = µ
13.2. Else h0 = 1
14. Else If (η0 + 4⋅µ⋅η1 ≥ 2) then h1 = µ
15. If (η < -1) then
15.1. If (η0 - 3⋅µ⋅η1 ≥ 1) then h1 = -µ
15.2. Else h0 = -1
16. Else If (η0 +4⋅µ⋅η1 < -2 ) then h1 = -µ
17. q0 = f0 + h0, q1 = f1 + h1
18. r0 = k – d0⋅q0 – 2⋅s1⋅q1
19. r1 = s1⋅q0 - s0⋅q1
20. Return (r0, r1)
When comparing the binary integer and TNAF point-multiplication algorithms, either a point doubling operation or two finite field squarings are required in each loop iteration. The computational costs of a point doubling operation and of two finite field squarings are far from comparable. Each
point doubling operation consists of an inversion, two multiplications and several nominal finite field
operations, making point doubling an expensive operation. Relative to the finite field operations
multiplication and inversion, squaring is considered insignificant, making the cost of the squarings required in each loop iteration of the τ-adic point-multiplication algorithm minimal.
3.3.3.3 TNAF Point-Multiplication
The algorithm defined in [61] for TNAF point-multiplication is extremely similar to the NAF point-
multiplication algorithm. As previously mentioned, the point doubling operation present in the point-
multiplication algorithm is replaced by two finite field squaring operations, because the input element k
is in a TNAF representation. The TNAF point-multiplication is presented as Algorithm 3-10.
Algorithm 3-10. TNAF Point-Multiplication (Q = kTNAF⋅P) [19]
Input: k = (km-1, km-2, …, k1, k0)TNAF, P
Output: Q = (x, y)
1. Q = O
2. For i = (m-1) downto 0
2.1. Q = τ⋅Q, i.e. x = x^2, y = y^2
2.2. If (ki = 1) then Q = Q + P
2.3. If (ki = -1) then Q = Q – P
3. Return(Q)
Similar to finite field multiplication, by employing a LUT, the performance of an operation is
often improved. The only cost of the improved performance is an increase in the dynamic memory
footprint. As expected, this is also the case with TNAF point-multiplication. By employing a LUT to
reduce repetitive computations, execution time of the operation can be reduced. LUTs reduce
execution time by eliminating duplicate computational sequences whose results are pre-computed.
The TNAFw technique of the point-multiplication operation, which employs a LUT, was
investigated to improve performance. The following two sections describe the new functions required
to implement the so-called TNAFw point-multiplication. First, the TNAFw representation of k must be
computed.
3.3.3.4 Width-w TNAF Representation
To increase the performance of point-multiplication further, a width-w technique can be combined with
the TNAF point-multiplication algorithm. The technique requires a width-w representation of the
polynomial k, which guarantees at least w - 1 zero coefficients between consecutive nonzero coefficients, leading to an average Hamming weight of m/(w+1). The reduced Hamming weight results in a reduction in the number of point additions and subtractions required.
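For illustration, with m = 163 and w = 4, the expected number of point additions or subtractions in a point-multiplication drops to roughly 163/5 ≈ 33, compared to roughly 54 with the plain TNAF representation.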
The width-w technique requires the computation of functions of the point P involved in the
point-multiplication. They are stored in a LUT, and used in the operation. The number of functions of
P is determined by the width-w used. An optimum width-w can be estimated in theory, and then
confirmed during implementation, so the maximum performance of the algorithm is achieved. The
functions of P that are pre-computed depend directly on the Koblitz curve and width-w used in the
implementation. Refer to [61] for the formulas that describe the functions and other background information. Furthermore, the theory behind the width-w technique and other useful information required for implementation is explained there in detail.
The TNAFw representation of k must be computed before performing the point-multiplication.
This requires the implementation of an algorithm similar to the TNAF conversion algorithm. The
TNAFw representation of a finite field element is computed using Algorithm 3-11.
Algorithm 3-11. TNAFw Conversion (kTNAFw = r0 + r1⋅τ) [61]
Parameters: a, w, tw, αu := βu + γu⋅τ for u = 1, 3, …, 2^(w-1) - 1
Input: integers r0, r1 where k2 = r0 + r1⋅τ
Output: kTNAFw
1. k = {}
2. While (r0 ≠ 0) or (r1 ≠ 0)
2.1. If (r0 odd) then
2.1.1. u = r0 - r1⋅tw (mod 2^w)
2.1.2. If (u > 0) then ξ = 1
2.1.3. Else
2.1.3.1. ξ = -1
2.1.3.2. u = -u
2.1.4. r0 = r0 - ξ⋅βu
2.1.5. r1 = r1 - ξ⋅γu
2.1.6. Prepend ξ⋅αu to k
2.2. Else Prepend 0 to k
2.3. t = r0/2, r0 = r1 + µ⋅t
2.4. r1 = -t
3. Return (k)
Similar to the NAF and TNAF representations, the guarantee that the TNAFw representation of
a finite field element consists of coefficients that are mostly zero greatly reduces the computational cost
of the point-multiplication. This is because, as shown in Algorithm 3-12 in the next section, point
addition and subtraction operations are only called when the corresponding coefficient is nonzero.
When the coefficient is zero, only the two finite field squaring operations required in every loop iteration are executed.
3.3.3.5 TNAFw Point-multiplication
Once the TNAFw representation of k is computed, the TNAFw point-multiplication can be performed.
The algorithm to complete the multiplication operation is presented as Algorithm 3-12. The TNAFw
point-multiplication algorithm is very similar to the previous multiplication algorithms. The only differences are that a LUT containing functions of the point P is pre-computed, and the pre-computed
values are used by the point addition and subtraction algorithms instead of the original point P.
The performance of the point-multiplication algorithm below is superior to that of the other methods [19].
Therefore, TNAFw point-multiplication is implemented for the thesis and is used in the signature
generation and verification processes of the ECDSA. Conversely, for reasons stated, the TNAF point-
multiplication method is used in the implementation of the simultaneous multiple point-multiplication.
Algorithm 3-12. TNAFw Point-Multiplication (Q = kTNAFw⋅P) [61]
Input: k = (km-1, km-2, …, k1, k0)TNAFw, P
Output: Q = (x, y)
1. Compute Pu = αu⋅P, for u ∈ {1, 3, 5, …, 2^(w-1) - 1}
2. Q = O
3. For i = (m-1) downto 0
3.1. Q = τ⋅Q, i.e. x = x^2, y = y^2
3.2. u = | ki |
3.3. If (ki > 0) then Q = Q + Pu
3.4. If (ki < 0) then Q = Q – Pu
4. Return(Q)
3.3.4 Simultaneous Multiple Point-Multiplication
Within the signature verification process of the ECDSA, the sum of two point-multiplications must be
calculated. To improve the performance of this set of operations, the point-multiplications can be
simultaneously computed. This is called simultaneous multiple point-multiplication, and is known as
Shamir’s Trick [19].
Simultaneous multiple point-multiplication is based on a windowing technique similar to that used in the finite field multiplication operation. Unlike the finite field operation, there are two point-multiplications, so the LUT involved is two-dimensional, and consists of elliptic curve points. Sums of multiples of the two points are computed and stored in the LUT. In the main loop of
the algorithm, the LUT is indexed by a set of bits from the two finite field elements k and l. The
algorithm improves the computation of the sum of two point-multiplications [19]. The simultaneous
multiple point-multiplication computation is presented as Algorithm 3-13, and is specific to the
ECDSA signature verification process.
Algorithm 3-13. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q) [19]
Input: k = (km-1, km-2, …, k1, k0)2, l = (lm-1, lm-2, …, l1, l0)2, P, Q
Output: R
1. Compute i⋅P + j⋅Q for all i, j ∈ [0, 2^w - 1]
2. R = O
3. For i = (m/w - 1) downto 0
3.1. R = 2^w⋅R
3.2. R = R + (k'⋅P + l'⋅Q), where k' = (k(i⋅w+w-1), …, k(i⋅w))2 and l' = (l(i⋅w+w-1), …, l(i⋅w))2
4. Return(R)
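The tradeoff in Algorithm 3-13 is explicit: step 1 pre-computes a LUT of 2^(2w) points, one for each pair (i, j), while the main loop executes m/w iterations, each consisting of w point doublings (the computation of 2^w⋅R) and at most one point addition. Increasing w therefore trades LUT memory and pre-computation time against main-loop point additions.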
3.4 Implementation and Integration Philosophy
C source code that is available on the Internet was used as a starting point for the thesis. The original
code, referred to as Rosing code throughout the thesis, is associated with the book written by Michael
Rosing, “Implementing Elliptic Curve Cryptography” [56]. The Rosing code defines an excellent
starting point for ECC, as well as a learning environment and testing atmosphere for the thesis.
After a strong knowledge base of the theory of finite fields and elliptic curves was built, more
efficient algorithms that perform finite field, large integer and elliptic curve operations were studied,
implemented, tested and integrated into the existing Rosing code.
Whenever possible, the operations are written to be versatile and reusable. Commonly used
values are standardized with #define statements. The code is written such that only the defined values
must be modified to change the performance and cryptographic strength of the source code. For
example, the finite field size used is identified with a #define statement. By modifying the #define
statement, the finite field size, and therefore the encryption strength is changed.
Other values that are defined to add to the versatility of the code include data structure sizes, the width-w value and the window widths used in the implementation of efficient finite field and
elliptic curve operation algorithms. Originally, all the window widths used were variable. Later, some
of them were fixed at optimum values to increase the efficiency of the implementation.
A set of basic routines is written in both C and assembly using identical algorithms for each.
The two sets of routines are written such that Compiler-Generated Assembly (CGA) from C source
code, and the Hand-Written Assembly (HWA) can be compared. The comparison mentioned is
presented in a later section. The CGA and HWA routines have identical inputs and outputs, only their
names differ by ending in _cga and _hwa respectively. In the implementation and analysis sections of
the thesis, the function name extensions _cga and _hwa, are left out, but both are implied. The set of
routines written in both C and assembly is defined in §5.4.1.
The routines included during compilation depend on whether GCC or SC140 is defined. Only
one of these values can be defined at compile time. If GCC is defined, the set of C functions ending in
_cga are used, and the code is compatible with a generic C compiler and any processor. If SC140 is
defined, the handwritten assembly functions ending in _hwa are included. The code is specialized and
only compatible with the SC140.
Almost all of the Rosing code implementations of finite field and elliptic curve operations were
based on non-optimal algorithms. Therefore, the operations were replaced with implementations of
more efficient algorithms. The only operations not replaced are elliptic curve addition and subtraction.
All of the large integer operations were entirely replaced, except for large integer multiplication. Only
an assembly language version of the large integer multiplication operation, which is employed as an
HWA routine, was integrated. The original large integer multiplication from the Rosing code is used
when GCC, and not SC140, is defined. After each operation was implemented, thoroughly tested, and
integrated, obsolete Rosing code was removed from the source code when possible.
Furthermore, a new method of point-multiplication, improved over that of the Rosing code, was implemented. The Rosing code includes a NAF point-multiplication operation. TNAF and TNAFw
point-multiplication operations, which are superior to the NAF method, were implemented.
Each newly implemented function was tested using pseudorandom data versus existing
functionality to verify its correctness. Special cases for each function that test input data boundaries
were also investigated to ensure accurate results. For example, all the possible combinations of
positive and negative large integer inputs were tested with the large integer function
field_mult_wrapper.
After the implemented function was tested versus existing functionality, it was integrated into
the ECDSA code. To further ensure its correctness, the signature generation and verification processes
were executed. Variables are output at important points in the ECDSA during execution. The
correctness of the output variables was tested after each function was integrated to ensure the code
functions properly, and does not produce faulty results. Variables including the signature value were observed and verified to ensure that null signatures, which might be incorrectly verified by the implementation, are not generated.
The following section, which is organized by the type of operation, analyzes the implemented
operations, briefly describes optimization techniques, and lists the results obtained.
4 Implementation Analysis and Performance Results
First, low-level finite field and large integer operation routines were implemented and integrated with
the Rosing code. The implemented functions, which are the most basic operations involved in the
ECDSA, replace inferior Rosing source code. They are based on superior algorithms, and therefore,
outperform the routines present in the Rosing code.
The course of action, implementing low-level operations first, was taken to force familiarization with the data structures defined by the Rosing code. Furthermore, the functions are simple to understand, write and test. Several of the operations are so simple that their correctness can be determined using hand-calculated results, which greatly simplifies the debugging process. The
low-level operations provide an excellent starting point for implementation. They generally do not call
existing functions, thus limiting the possible erroneous code.
Later, functions that are more complex were implemented. Some of the functions implement finite field and large integer operations, including inversion, division and multiplication. Other complex routines, including elliptic curve operations, are also
described and analyzed.
As previously stated, the PB was selected to represent finite field elements for the scope of the thesis; this is assumed throughout the remainder of the project. The following section describes the
data structures used in the implementation process. Most data structures were originally defined in the
Rosing code, and were not modified. Some other data structures were modified or defined because
they are required by algorithms not implemented by the Rosing code.
4.1 C Data Structures
Most of the data structures used in the implementation of the ECDSA were originally defined in the
Rosing code. A majority of the data structures were never modified from the Rosing code, some
required minor modifications, and others had not been implemented in the Rosing code, and therefore
were defined.
All of the data structures are based on the FIELD2N structure defined in the Rosing code. The
FIELD2N structure is an array of ELEMENT, or unsigned long, variables that are used to store binary
finite field elements. The size of the array depends on the finite field size. The polynomial basis is
used to represent finite field elements. Each of the m least significant bits of a FIELD2N element represents a coefficient, where m defines the finite field size, GF(2^m).
The structure DBLFIELD is used to store finite field elements that have not been reduced.
structure is also an array of unsigned longs, but can store twice as many coefficients as a FIELD2N.
The structure is temporarily used when multiplying and squaring elements.
The two structures, FIELD2N and DBLFIELD, the second being an array of twice as many
elements as a FIELD2N, are used by large integer functions. The large integers involved in the
ECDSA are the same size as the finite field elements, so the identical data structure can be used to
represent both types of data. Originally, the Rosing code used a BIGINT structure to represent large
integers. The structure and large integer algorithms are highly inefficient. The structure is an array of
unsigned longs, but only the least significant half of each array element is used to store the large
integer. Furthermore, the structure is twice the size of a large integer to avoid overflows during
multiplication.
The structure TNAF_FIELD is used to store TNAF and TNAFw representations of finite field
elements. Similar to previous structures, it consists of an array of unsigned longs. The size of the array
is larger than the reduced TNAFw representation of any finite field element. The size depends on the finite
field size, m, and the width-w used in the TNAFw representation. Each array element in a
TNAF_FIELD structure is used to store several TNAF or TNAFw coefficients. The number of
coefficients per array element depends on the number of bits required to represent each coefficient.
There are also other data structures, which were defined in the Rosing code, such as POINT,
CURVE, SIGNATURE and EC_PARAMETER. The structures consist of various combinations of the
FIELD2N, CURVE and POINT structures, which define them in the elliptic curve domain.
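For concreteness, the layout of the principal structures is approximately as follows. The constant names and exact definitions are illustrative rather than a verbatim copy of the Rosing code.

    #define WORDSIZE 32
    #define NUMBITS  163                             /* field size m */
    #define NUMWORD  ((NUMBITS + WORDSIZE - 1) / WORDSIZE)

    typedef unsigned long ELEMENT;                   /* one 32-bit word */

    typedef struct { ELEMENT e[NUMWORD];   } FIELD2N;   /* one field element  */
    typedef struct { ELEMENT e[2*NUMWORD]; } DBLFIELD;  /* unreduced product  */
    typedef struct { FIELD2N x, y;         } POINT;     /* affine curve point */
    typedef struct { FIELD2N r, s;         } SIGNATURE; /* ECDSA (r, s) pair  */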
4.2 Finite Field Operations
The following sections describe the finite field operations implemented, including methods to optimize
the operations. The terms finite field element and polynomial are used interchangeably.
4.2.1 Finite Field Addition (c = a ⊕ b)
The first finite field operation implemented is polynomial addition. Only one algorithm is used to implement the operation because of its simplicity. All of the bits of the subject polynomials, a and b, are combined using the exclusive-or operation to form c.
The function, poly_add, which implements polynomial addition, was written in both C and
assembly. The functions were thoroughly tested and integrated with the Rosing code. Refer to Table
5-4 for the performance of the routine. Often within the ECDSA code, the polynomial addition
function is not called because of its simplicity. Instead, the operation is inlined. Inlining the
operation increases the performance of the parent function because the overhead of a function call is
eliminated.
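A sketch of the C version, using structures as in §4.1, is simply a word-by-word exclusive-or loop; the argument order shown is illustrative.

    /* c = a + b in GF(2^m): exclusive-or of corresponding words. */
    void poly_add(const FIELD2N *a, const FIELD2N *b, FIELD2N *c)
    {
        int i;
        for (i = 0; i < NUMWORD; i++)
            c->e[i] = a->e[i] ^ b->e[i];
    }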
4.2.2 Finite Field Reduction (c = a mod f)
Next to finite field addition, reduction is the simplest operation. Finite field reduction is equivalent to
dividing a finite field element a, by the reduction polynomial f and returning the remainder c. The
reduction polynomial defines the finite field.
Several different algorithms accomplish finite field reduction. The simplest, which is inefficient and used by the Rosing code, is basic division using base-2 arithmetic. This algorithm requires the finite field elements f and a as inputs, and returns c. The
algorithm is inefficient in the case of finite field reduction because it does not take advantage of the fact
that f is fixed.
A superior algorithm to that stated previously is Algorithm 3-3. The algorithm takes advantage
of the fact that f is fixed and has a low hamming weight. By doing so, the algorithm is much more
efficient at performing finite field reductions, and has a reduced number of iterations compared to the
basic division algorithm. It also does not require f as an argument, which further improves its
performance.
The reduction algorithm, Algorithm 3-3, was implemented in the C programming language,
and named reduction163. It was thoroughly tested versus the reduction function defined by the Rosing
code. Once it was determined that the function implements the reduction operation correctly, the
previous reduction function was removed from the source code and all calls to it were modified to
target reduction163. The performance of the reduction function is presented in Table 4-1.
Table 4-1. Finite Field Reduction Performance
Code Description Cycle Count
FF Reduction 124
The implementation of the reduction algorithm was simple because of its form. The algorithm
defines the individual operations performed on each 32-bit word of the element, simplifying its
conversion to C code.
4.2.3 Finite Field Multiplication (c = a ⋅ b)
The polynomial multiplication function, poly_mul_win, was written to replace the Rosing code version
of the operation. The function performs polynomial multiplication using a windowing technique and a LUT. The algorithm is based on the left-to-right comb method with windows, and is presented as Algorithm 3-4.
The LUT is calculated during execution by the function poly_mul_win_LUT. Originally, as with poly_sqr, the size of the LUT and the associated window width were modifiable. This was done
so that testing could later reveal the optimum window width. After a small amount of testing and
intuition, the window width was fixed at the optimum value of 4-bits, and the function was modified to
enhance performance for the specific window width.
The function, poly_mul_win_LUT, became highly optimized after the window width was fixed
at 4-bits, improving its overall performance. The function calculates all of the 4-bit multiples of the
first polynomial involved in the multiplication, and stores the reduced results in the LUT. Shifted
versions of the polynomials in the LUT are then added to a running sum, and reduced at the end of the
function. The result of the multiplication is temporarily stored in a structure, DBLFIELD, which is
twice the finite field size. The extra space is required to store the polynomial, because it is not reduced
until the end of the multiplication operation.
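To make the windowing idea concrete, the following is a minimal C sketch of a width-4 left-to-right
comb multiplication in the style of Algorithm 3-4, assuming 6-word inputs and the reduction163
sketch above; the thesis code differs in detail (for instance, it stores the running sum in a DBLFIELD
structure and keeps its LUT entries reduced).

/* Sketch of poly_mul_win: c = a*b mod f using a 16-entry LUT of the
   4-bit multiples of a, combed over the words of b. */
void poly_mul_win(uint32_t *c, const uint32_t *a, const uint32_t *b)
{
    uint32_t B[16][6] = {{0}};     /* B[u] = u(x)*a(x), built on the fly */
    uint32_t d[12] = {0};          /* double-width running sum           */
    for (int i = 0; i < 6; i++)
        B[1][i] = a[i];
    for (int u = 2; u < 16; u += 2) {
        for (int i = 5; i > 0; i--)          /* B[u] = B[u/2] << 1 */
            B[u][i] = (B[u / 2][i] << 1) | (B[u / 2][i - 1] >> 31);
        B[u][0] = B[u / 2][0] << 1;
        for (int i = 0; i < 6; i++)          /* B[u+1] = B[u] ^ a  */
            B[u + 1][i] = B[u][i] ^ a[i];
    }
    for (int k = 7; k >= 0; k--) {           /* eight 4-bit windows per word */
        for (int j = 0; j < 6; j++) {
            uint32_t u = (b[j] >> (4 * k)) & 0xF;
            for (int i = 0; i < 6; i++)
                d[j + i] ^= B[u][i];         /* add B[u] at word offset j */
        }
        if (k) {                             /* d <<= 4 between windows   */
            for (int i = 11; i > 0; i--)
                d[i] = (d[i] << 4) | (d[i - 1] >> 28);
            d[0] <<= 4;
        }
    }
    reduction163(c, d);                      /* single reduction at the end */
}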
The polynomial multiplication function was thoroughly optimized to decrease execution time.
A function was written specifically for poly_mul_win that shifts a polynomial by a specified number of
bits, which in this case is the window width. The specialized shifting function, shift_left, was
implemented and integrated because it is more efficient than shifting the polynomial by a single bit
multiple times. The shifting function is also used when implementing the finite field squaring and
inversion operations. The following table compares a specialized multiple bit-shifting function,
shift_left_cga, with multiple calls to a single bit-shifting function, dbl_shift_left, with respect to
machine cycle and instruction counts.
The C code results were recorded using the SC100 simulator at the maximum compiler
optimization level. The pattern shown in Table 4-2 continues, with some minor discrepancies, for all
bit-shifts. The multiple bit-shifting function outperforms, or at least equals, the single bit-shifting
function for all nonzero bit-shifts, and therefore was integrated into all benefiting functions.
Table 4-2. Single and Multiple Bit-Shifting Function Comparison

Number of Shifts    Cycle Count (Instruction Count)
(bit-shifts)        Single Bit-Shifting Function    Multiple Bit-Shifting Function
1                   108 (77)                        108 (92)
2                   215 (154)                       108 (92)
3                   322 (231)                       108 (92)
4                   429 (308)                       108 (92)
5                   536 (385)                       108 (92)
Average Change      107 (77)                        0 (0)
After optimizations were applied to poly_mul_win, and the function was thoroughly verified
for correctness, the performance of the operation was recorded. The performance was measured in
terms of machine cycles, or cycle counts. The following results were obtained using the SC100
simulator, when specifying the maximum compiler optimization level. Table 4-3 presents the
performance of the implemented code when employing CGA and HWA routines. The cycle counts in
the table clearly illustrate the performance benefit achieved by partial assembly implementation.
Table 4-3. Finite Field Multiplication Performance
Code Description Cycle Count
FF Multiplication - CGA 5,595
FF Multiplication - HWA 3,475
4.2.4 Finite Field Squaring (c = a2)
The function, poly_sqr, was written to implement finite field squaring. Since τ-adic point-
multiplication was selected for implementation, a specialized function for the operation is highly
beneficial and is required for efficient computation. The Rosing code does not have a specialized
function for polynomial squaring.
The squaring algorithm originally implemented is Algorithm 3-5. The function was first
written so that the window width used with the LUT could be easily modified. Similar to polynomial
multiplication, the window width was modifiable so that the optimum bit-width could be determined
through testing.
Intuitively, the optimum window width, also referred to as the bit-width, used with the LUT
divides the register width of 32 bits. With such bit-widths, special cases where a set of bits spans two
registers are eliminated, allowing greater optimization of the code. There is a memory versus
execution time tradeoff associated with the bit-width and LUT size. Since the LUT is fixed, and not
calculated during execution, the tradeoff favors a larger LUT size than with the multiplication
operation. The LUT requires twice as much storage space each time the bit-width is increased by one,
while the execution time diminishes as the bit-width is increased. In this case, the memory versus
execution time tradeoff is affected by the fact that polynomial squaring is a very common operation
within the point-multiplication algorithm implemented. The tendency is to choose a large bit-width,
because it has a large effect on execution time at a relatively small memory price.
Several different bit-widths for the LUT were investigated using identical C source code; only
the definition of the bit-width and the LUT used were modified. Table 4-4 shows the performance of
the finite field squaring function for the tested bit-widths, along with the associated LUT sizes. For
each bit-width, the same fifty pseudorandom elements were used to obtain the cycle counts presented.
At the cost of memory, the performance of the operation can be increased by increasing the bit-
width. However, memory requirements quickly become unreasonable for larger bit-widths because
the LUT grows exponentially. The memory requirements
are important, especially when targeting portable devices.
The computational cost of the polynomial squaring function roughly doubles when CGA
routines are used instead of HWA routines. This is expected, given the superior performance of the
individual HWA routines shown in §5.4.1.
Table 4-4. Finite Field Squaring Performance Comparison

                     Bit-Width
Description          4      5      6      7      8      9      8*
Cycle Count - CGA    7,821  6,301  5,379  4,615  4,021  3,660  212
Cycle Count - HWA    3,838  3,110  2,683  2,315  2,018  1,855  212
LUT Size (bytes)     64     128    256    512    1,024  2,048  1,024
* - Source code does not include any CGA or HWA routines
The last column in the table, marked with an asterisk, shows the performance of the poly_sqr
function after the bit-width is fixed at eight. When this is done, assumptions within the routine are
possible. These assumptions are exploited to increase the performance of poly_sqr significantly. The
algorithm used is slightly different than that previously implemented, and is presented as Algorithm
4-1.
Recall that bit-widths that divide the register width are preferred because they allow the most
optimizations; this results in an optimum bit-width of 8 bits being selected. The selected bit-width
requires a LUT consisting of 256 16-bit binary values. The bit-width was selected primarily because
the execution time is too large with 4 bits and the memory requirements are too large with 16 bits.
When comparing bit-widths of four and eight, Table 4-4 clearly shows eight outperforming four by a
factor of almost two. The table also shows that a bit-width of eight requires a reasonable amount of
memory.
The memory footprint of a bit-width of eight is considered reasonable relative to the total memory
requirement of the signature generation and/or verification source code, which is presented in §5.5.
The next window width that divides the register width is 16 bits. The performance of this bit-
width is assumed to be superior to that of eight, most likely by a factor of two, which is the
improvement recorded when doubling the window width from four to eight. However, the memory
requirements of such a bit-width, 262,144 bytes, are unreasonable compared to the total memory
requirements, especially for portable devices where memory is an expensive resource and must be
conserved.
The bit-width was fixed at 8 bits, and the function was modified so that it follows a slightly
different algorithm, presented as Algorithm 4-1. Table 4-4 shows that the execution time of the
squaring operation improves drastically when the window width is fixed. Versatile, easily modifiable
code is far less efficient than code with fixed parameters: certain assumptions can be made when
parameters are fixed, leading to greater optimizations and significant execution time improvements.
The assumptions made in the squaring implementation significantly change both the performance of
the operation and the algorithm itself. The algorithm allows
the use of greater parallelism, primarily because the window width divides the register width. In the
algorithm below, the assumption that 32-bit registers are used to store finite field elements a and c is
made. Therefore, c[i] refers to the ith 32-bit register.
Algorithm 4-1. Improved Finite Field Squaring (c = a2)
Input: a = (a[5], a[4], a[3], a[2], a[1], a[0])
Output: c = (c[5], c[4], c[3], c[2], c[1], c[0])
1. For v = (v7, v6, v5, v4, v3, v2, v1, v0) = 0 to 255 do
1.1. T(v) = (0, v7, 0, v6, 0, v5, 0, v4, 0, v3, 0, v2, 0, v1, 0, v0)
2. c = 0
3. For i = 0 to 4 do
3.1. c[2⋅i] = (T((a[i] >> 8) & 0xFF) << 16) ⊕ T(a[i] & 0xFF)
3.2. c[2⋅i + 1] = (T((a[i] >> 24) & 0xFF) << 16) ⊕ T((a[i] >> 16) & 0xFF)
4. c[10] = T(a[5] & 0xFF)
5. c = c mod f
6. Return(c[5], c[4], c[3], c[2], c[1], c[0])
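A minimal C rendering of Algorithm 4-1 follows; the reduction163 call is the one sketched in §4.2.2,
and the word layout and names are assumptions rather than the thesis source.

static uint16_t T[256];            /* 8-bit -> 16-bit "spread" LUT */

/* Build T once: bit i of the index moves to bit 2i (zeros interleaved). */
static void init_sqr_table(void)
{
    for (int v = 0; v < 256; v++) {
        uint16_t s = 0;
        for (int i = 0; i < 8; i++)
            if (v & (1 << i))
                s |= (uint16_t)(1 << (2 * i));
        T[v] = s;
    }
}

/* Sketch of poly_sqr: each 32-bit word of a expands into two words of
   the double-width result, which is reduced once at the end. */
void poly_sqr(uint32_t *c, const uint32_t *a)
{
    uint32_t d[11];
    for (int i = 0; i < 5; i++) {
        d[2 * i]     = ((uint32_t)T[(a[i] >> 8) & 0xFF] << 16) | T[a[i] & 0xFF];
        d[2 * i + 1] = ((uint32_t)T[a[i] >> 24] << 16) | T[(a[i] >> 16) & 0xFF];
    }
    d[10] = T[a[5] & 0xFF];        /* a[5] holds only bits 160..162 */
    reduction163(c, d);
}

The thesis version uses a fixed, precomputed table rather than building it at startup, consistent with the
description above.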
4.2.5 Finite Field Inversion (c = a-1 mod f)
The last and most expensive finite field operation described is polynomial inversion. The function,
poly_inv_eff, was written to perform the operation. The implemented algorithm is based on Algorithm
3-6. Performance-increasing modifications were made to the algorithm by exploiting the functionality
of the SC140. The modified algorithm that was implemented is listed as Algorithm 4-2. The following
paragraphs describe functions that were called by poly_inv_eff and later inlined to improve
performance, as well as the performance of the implementation.
Two functions were written to calculate and return the degree of a polynomial. The functions,
MSB_degree and MSB_degree1, were both initially written in C, and then in assembly. The main
difference between the two functions is that MSB_degree1 requires the previous degree of the
polynomial, making it slightly faster. The assembly versions of the degree-computing functions use the
CLB instruction, which requires a single clock cycle to find the leading nonzero bit. This is much more
efficient than the C functions, which check each bit individually and therefore require several clock
cycles to perform the identical task.
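As a sketch of what the degree functions compute, assuming a 6-word element; the GCC-style
__builtin_clz below merely stands in for the role the single-cycle CLB instruction plays in the
assembly version:

/* Degree of a 6-word polynomial, -1 for the zero polynomial. */
int MSB_degree(const uint32_t *r)
{
    for (int i = 5; i >= 0; i--)
        if (r[i])
            return 32 * i + 31 - __builtin_clz(r[i]);
    return -1;
}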
Algorithm 4-2. Improved Finite Field Inversion (c = a-1 mod f)
Input: a = (am-1, am-2, …, a0)
Output: b = (bm-1, bm-2, …, b0)
1. r0 = f, r1 = a, b = 0, u1 = 1
2. deg_r0 = m, deg_r1 = degree(r1), deg10 = deg_r0 – deg_r1
3. Do
3.1. If (deg10 = 0)
3.1.1. r0 = r0 ⊕ r1, b = b ⊕ u1
3.1.2. deg_r0 = degree(r0)
3.1.3. deg10 = deg_r0 – deg_r1
3.1.4. If (deg_r0 = -1) b = u1, Return (b)
3.2. If (deg10 > 0)
3.2.1. r0 = r0 ⊕ (r1 << deg10), b = b ⊕ (u1 << deg10)
3.2.2. deg_r0 = degree(r0)
3.2.3. deg10 = deg_r0 – deg_r1
3.2.4. If (deg_r0 = -1) b = u1, Return (b)
3.3. If (deg10 < 0)
3.3.1. r1 = r1 ⊕ (r0 << -deg10), u1 = u1 ⊕ (b << -deg10)
3.3.2. deg_r1 = degree(r1)
3.3.3. deg10 = deg_r0 – deg_r1
3.3.4. If (deg_r1 = -1) Return (b)
4. While (true)
A function was written specifically for the inversion operation that adds a shifted version of
one polynomial to another. This function, shift_and_add, requires two input polynomials and an integer
that defines the number of bits by which to shift the second polynomial. The first polynomial is
overwritten with the result of the polynomial addition. This function is written in both C and assembly.
In a further attempt to improve the performance of the inversion operation, the distribution of
bit-shifts involved in the shift and add was investigated. Forty inversions were performed on
pseudorandom polynomials to determine the approximate distribution of bit-shifts. The exploitation of
the distribution leads to a performance improvement. Table 4-5 shows the distribution results obtained.
Table 4-5 shows that occurrences of bit-shifts decrease by approximately fifty percent per
additional bit beyond one. Therefore, it is beneficial to have a shift and add function that favors small
bit-shifts. In addition, since shifting by zero bits is common, it is likely beneficial to check for these
occurrences and process them differently.
Table 4-5. Finite Field Inversion Bit-Shift Distribution

Bit-Shift (bits)          0      1      2      3     4     5     6     7     8     9     10    11    12
Number of Occurrences     1,586  2,427  1,138  623   320   152   78    39    28    12    5     1     2
Percentage of Total (%)   24.74  37.86  17.75  9.72  4.99  2.37  1.22  0.61  0.44  0.19  0.08  0.02  0.03
The performance of different bit-shifting options was investigated. The shift and add code was
modified so that the shift_and_add function is only called for nonzero bit-shifts; for zero-bit shifts, the
polynomial addition function is used. Given the distribution of bit-shifts, the performance of the
modified code is superior to that of the previous version.
A second version of the shift_and_add function, shift_and_add2, was written to further
improve efficiency. This version exploits the observation that there are always two pairs of
polynomials requiring a shift and add in the inversion algorithm, and that the bit-shifts in each case are
always equal. The function, shift_and_add2, performs the two shift and add operations in parallel,
which is slightly more efficient than two consecutive calls to the shift_and_add function.
A similar premise was used with the implementation of poly_add2. The two calls to
polynomial addition are combined so the operations are computed in parallel. Again, the resulting code
is slightly more efficient than performing two consecutive polynomial additions. C and assembly
versions of poly_add2 and shift_and_add2 were written, tested and integrated.
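A minimal sketch of the shift and add idea, assuming 6-word polynomials and a nonzero bit-shift
s < 32 (as described above, zero-bit shifts are routed to plain polynomial addition, and the observed
shifts in Table 4-5 were at most 12 bits):

/* Sketch of shift_and_add: c ^= (b << s) for 1 <= s <= 31. */
void shift_and_add(uint32_t *c, const uint32_t *b, unsigned s)
{
    uint32_t carry = 0;
    for (int i = 0; i < 6; i++) {
        c[i] ^= (b[i] << s) | carry;
        carry = b[i] >> (32 - s);   /* bits shifted into the next word */
    }
}

shift_and_add2 fuses two such updates (r0 ^= r1 << s and b ^= u1 << s share the same s in
Algorithm 4-2) into one loop body, giving the SC140 more independent work to schedule in parallel.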
The implementation of Algorithm 4-2 led to the impressive performance figures in Table 4-6.
The most significant difference between Algorithm 4-2 and Algorithm 3-6, besides fixing the number
of polynomials employed, is eliminating the need to exchange polynomials. After the algorithm is
modified to only require the current four polynomial values, r0, r1, b and u1, an exchange of the
polynomials is occasionally required so that the degree of r0 remains larger than that of r1. To
eliminate the exchange, and improve the performance of the operation, code was added that allows
either polynomial to have the larger degree. This is clearly illustrated in Algorithm 4-2.
As expected, the performance of the optimal implementation of the inversion operation, shown
in Table 4-6, is comparable to the performance of the optimal implementation of approximately five
finite field multiplications. Therefore, the selection of the affine coordinate system instead of
projective coordinates is correct and beneficial. For this implementation, a projective coordinate
system is not beneficial with respect to computational performance due to the ratio of the
computational costs of the finite field multiplication and inversion operations.
Table 4-6. Finite Field Inversion Performance

                           Cycle Count
Code Description      Minimum    Average    Maximum
FF Inversion - CGA    15,950     17,590     18,730
FF Inversion - HWA    14,390     16,730     18,290
4.3 Large Integer Operations
The focus of the thesis is to implement and optimize efficient finite field and elliptic curve operations.
During the testing and simulation process, however, the performance of the large integer operations
was unacceptable: progress was slowed considerably, as was the performance of the signature
generation and verification processes. Therefore, the large integer operations defined by the Rosing
code were improved upon. Algorithms to implement these operations were not thoroughly researched,
so it is assumed that further improvements to the implemented operations are possible. The purpose of
implementing these operations, which are listed in the following sections, is to raise the performance
of the large integer operations, and of the signature generation and verification processes, to more
acceptable levels.
Only large integer addition and subtraction were implemented in both C and assembly. Large
integer inversion and division operations were written in C, and employ both C and assembly addition
and subtraction routines. A partial assembly implementation of large integer multiplication was
integrated. The multiplication operation is from an external source.
The cycle counts for the more complicated large integer operations are not provided because
the implementations were not optimized and they do not significantly affect the overall performance.
The large integer multiplication function was not implemented by the writer, and it is not commonly
called within the ECDSA. The performance of the large integer inversion and division functions is
input dependent, and these functions are not commonly called either.
4.3.1 Large Integer Addition and Subtraction (c = a + b; c = a - b)
Large integer addition and subtraction are the simplest large integer operations implemented. The two
operations are implemented separate from each other, in both C and assembly. They were thoroughly
tested before being integrated with the Rosing code.
The implementations of large integer addition and subtraction are great improvements over the
Rosing code. The Rosing code uses an inefficient algorithm for addition, and subtraction is
implemented as an extension of addition with the aid of a negate function. The operations are
inefficient because they are based on the BIGINT structure, which only uses half of each element for
data storage. The only plausible reason for the inefficiency of the BIGINT structure is to simplify the
large integer multiplication operation.
The implementations of addition and subtraction, add_int and sub_int respectively, do not use
the BIGINT structure. Unlike BIGINT, all of the bits within the data structure, FIELD2N, are used to
store the large integer, and a different structure is used to store the result of multiplications. Thus, the
data structure is a quarter of the size, and the addition and subtraction operations are performed more
efficiently.
The implementation of the two operations is almost identical. Both assembly and C versions of
the operations use the same basic algorithm, except when computing carries and borrows. The C code
computes the carry and borrow bits with boolean expressions, whereas the assembly code exploits the
carry bit in the status register. Therefore, the assembly version of the operations is more efficient. The
performance of both large integer operations is presented in Table 5-4.
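A minimal sketch of the C carry handling described above, assuming 6-word little-endian integers; the
assembly version replaces the boolean expression with the status-register carry bit:

/* Sketch of add_int: c = a + b over 6-word unsigned integers. */
void add_int(uint32_t *c, const uint32_t *a, const uint32_t *b)
{
    uint32_t carry = 0;
    for (int i = 0; i < 6; i++) {
        uint32_t s = a[i] + b[i] + carry;
        /* carry out iff the 32-bit sum wrapped around */
        carry = (s < a[i]) || (s == a[i] && carry);
        c[i] = s;
    }
}

Subtraction (sub_int) is analogous, with a borrow flag taking the place of the carry.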
4.3.2 Large Integer Multiplication (c = a ⋅ b)
Only an assembly language version of the large integer multiplication operation, int_mult, was
integrated with the existing code. The assembly multiplication function was previously written for a
different application and is only called when both C and assembly source code are compiled [13].
When only C source is compiled, the original large integer multiplication from the Rosing code, which
uses the BIGINT structure, is called.
The large integer multiplication function, which is written in assembly language, was actually
written to multiply 192-bit positive integers, producing a 384-bit result. The two sizes of integers
correspond to the implemented structures FIELD2N and DBL_FIELD2N respectively. The function
can be used in the scope of the thesis because the finite fields, and therefore the large integers, are
163 bits in size.
However, both large integers must be positive for int_mult. Therefore, a wrapper function is
required that ensures the inputs to the function are positive and restores the correct sign after the
multiplication is complete. The wrapper function is named field_mult_wrapper.
Within this implementation of the ECDSA, the result of a large integer multiplication is
guaranteed to require fewer than 163 bits. Therefore, no reduction is necessary after a large integer
multiplication, allowing the result to be copied straight to a FIELD2N, ignoring the most significant
bits of the large integer.
The integration process for the multiplication operation was simple because the function being
integrated had been previously tested. However, a problem was encountered with the compiler. The
compiler produces incorrect assembly when the highest level of optimizations is selected. This
problem is thoroughly examined in §6.3.2.
4.3.3 Large Integer Division (c = a / b)
The two most inefficient large integer operations implemented by the Rosing code, large integer
inversion and division, were replaced with much more efficient functions. The new functions are based
on better algorithms and do not use the BIGINT structure.
The large integer division function, large_int_div, is based on the long division algorithm. The
implementation of the algorithm is straightforward, exploiting previously written functions such as
shift_left, sub_int and MSB_degree. The long division algorithm implemented subtracts the
denominator multiplied by powers of two from the numerator. The process is complete once the
numerator is less than the denominator.
Only the performance of the finite field and elliptic curve operations was thoroughly
investigated, so the large integer division function can likely be improved. A superior algorithm may
also exist.
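For illustration, a self-contained C sketch of the shift-and-subtract long division just described,
assuming 6-word unsigned integers and b not equal to zero; names and helpers here are assumptions,
not the thesis source (the real code reuses shift_left, sub_int and MSB_degree).

#include <stdint.h>
#include <string.h>

#define NW 6   /* 32-bit words per large integer, matching FIELD2N */

/* Index of the highest set bit, -1 for zero (what MSB_degree returns). */
static int degree(const uint32_t *x)
{
    for (int i = NW - 1; i >= 0; i--)
        if (x[i])
            return 32 * i + 31 - __builtin_clz(x[i]);
    return -1;
}

static void shl(uint32_t *dst, const uint32_t *src, int k)  /* dst = src << k */
{
    int w = k / 32, s = k % 32;
    for (int i = NW - 1; i >= 0; i--) {
        uint32_t hi = (i >= w) ? src[i - w] : 0;
        uint32_t lo = (i > w) ? src[i - w - 1] : 0;
        dst[i] = s ? (hi << s) | (lo >> (32 - s)) : hi;
    }
}

static int geq(const uint32_t *x, const uint32_t *y)        /* x >= y */
{
    for (int i = NW - 1; i >= 0; i--)
        if (x[i] != y[i])
            return x[i] > y[i];
    return 1;
}

static void sub(uint32_t *x, const uint32_t *y)             /* x -= y */
{
    uint64_t borrow = 0;
    for (int i = 0; i < NW; i++) {
        uint64_t d = (uint64_t)x[i] - y[i] - borrow;
        x[i] = (uint32_t)d;
        borrow = (d >> 32) & 1;
    }
}

/* q = a / b, r = a mod b: subtract b*2^k from the remainder while it fits. */
void large_int_div(uint32_t *q, uint32_t *r, const uint32_t *a, const uint32_t *b)
{
    uint32_t t[NW];
    memset(q, 0, sizeof t);
    memcpy(r, a, sizeof t);
    for (int k = degree(r) - degree(b); k >= 0; k--) {
        shl(t, b, k);
        if (geq(r, t)) {
            sub(r, t);
            q[k / 32] |= 1u << (k % 32);
        }
    }
}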
4.3.4 Large Integer Inversion (c = a-1 mod f)
The function, large_int_inv, was written to implement large integer inversion. The function inverts a
large integer with respect to an input prime large integer. The Rosing code’s large integer inversion
function is extremely inefficient. It utilizes the already inefficient large integer division function and
BIGINT structure.
The improved large integer inversion function utilizes a more efficient structure, FIELD2N. It
is based on the previously implemented finite field inversion routine, which simplified its realization.
The two functions are almost identical. The only difference is that the large integer inversion function
uses large integer addition instead of polynomial addition. No substantial problems were encountered
during the implementation of this operation because it was based so closely on the previously
implemented and thoroughly tested finite field inversion function.
The performance of the large integer inversion function is assumed to be near optimal, with
only minimal improvements possible. This is assumed because it is based on the already thoroughly
optimized finite field inversion function.
4.4 Elliptic Curve Operations
After finite field and large integer operations, elliptic curve operations were implemented. The
following sections list the operations and implemented functions associated with the elliptic curve
operations. The performance of the implementations is presented, and then compared to previously
published results in §5.1.2.
Following preliminary research, it was decided to use a Koblitz curve for the implementation of
the ECDSA. Koblitz curves allow the use of several specialized algorithms for point-multiplication,
which are computationally beneficial. In particular, Koblitz curves allow the use of the TNAF point-
multiplication method.
Researching elliptic curve multiplication algorithms led to the conclusion that the TNAF and
TNAFw methods of point-multiplication, proposed by Solinas in [61], are the most efficient techniques
of performing elliptic curve point-multiplication. Results presented in [19] had a significant impact on
the decision.
To implement the TNAF method for point-multiplication outlined by Solinas, several functions
are required. The existing elliptic curve addition and subtraction functions were kept because no
significant improvements can be made to them. The functions that were implemented are listed and
described in the following sections.
4.4.1 TNAF Conversion (k → kTNAF)
The first function implemented, get_TNAF_rep, calculates the TNAF representation of a
polynomial and writes it to a TNAF_FIELD structure. The function is based on Algorithm 3-8. The
implementation of the algorithm is straightforward, given that functions performing large integer
manipulation are available.
There are three possible TNAF coefficients. Therefore, two bits are required to represent all
possibilities. The binary values that represent the TNAF coefficients -1, 0 and 1 are 11, 00 and 01
respectively. These values are arbitrary, provided they are used consistently within the conversion and
point-multiplication functions. After calculating coefficient values using large integer operations and
calculating remainders, each coefficient must be shifted by the correct number of bits so that it is
written to the proper location. This requires careful programming.
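A sketch of the two-bit packing, with the exact layout assumed (sixteen coefficients per 32-bit word,
coefficient i at bit offset 2(i mod 16)):

#define TNAF_NEG1 0x3u   /* binary 11 */
#define TNAF_ZERO 0x0u   /* binary 00 */
#define TNAF_POS1 0x1u   /* binary 01 */

/* Write coefficient i of a packed TNAF representation. */
static void put_coeff(uint32_t *t, int i, uint32_t d)
{
    int w = i / 16, s = 2 * (i % 16);
    t[w] = (t[w] & ~(0x3u << s)) | (d << s);
}

/* Read coefficient i back out. */
static uint32_t get_coeff(const uint32_t *t, int i)
{
    return (t[i / 16] >> (2 * (i % 16))) & 0x3u;
}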
The function, get_TNAF_rep, was difficult to test and verify. Initially, the only verification
done was with small binary values whose representation can be hand calculated. After the completion
of the point-multiplication function, which is described in a following section, the conversion function
was thoroughly tested and verified with larger binary values.
4.4.2 Partial Reduction - Partmod δ (k′ = k partmod δ)
According to Solinas [61], to fully take advantage of the TNAF and TNAFw techniques of point-
multiplication, the input must be reduced so that the representation decreases in size. The
reduction decreases the Hamming weight of the representation, thereby decreasing the number of loop
iterations and the execution time of the point-multiplication.
Solinas states that it is possible to guarantee the minimal representation, but it is too expensive
to achieve in practice. Instead, a partial reduction, partmod δ, is defined that guarantees a minimal
TNAF or TNAFw representation with a certain probability. The probability is related to C, which is a
modifiable parameter. It is up to the implementer to define the value of C, and therefore the probability
that the TNAF and TNAFw representation is minimal. There is a tradeoff associated with the value of
C. As C increases, the probability of the TNAF or TNAFw representation being minimal increases, but
so does the execution time of the partial reduction.
For implementation, a C value of sixteen was selected for two reasons. First, according to
Formula 3.10, the probability that the TNAF or TNAFw representation is minimal is then over 99.95%,
so the minimal representation is almost guaranteed. The second reason for choosing a C value of
sixteen is the register width. Due to the nature of the algorithm, the cost of the algorithm increases
every time C grows past a multiple of sixteen. C values of sixteen and less have equal computational
costs, which are less than the computational costs for C values of seventeen and above. Since the
minimal representation is guaranteed with such a high probability, it was determined that the
processing cost of increasing C beyond sixteen outweighs the benefits.
The function, TNAF_partmod_delta, implements the partial reduction presented as Algorithm
3-9. The function calculates the value of two polynomials, which are then used to compute the TNAF
or TNAFw representation of the polynomial. Two other functions were defined for the partial
reduction process. The two functions, div2_ceil and multiple_div2, are described in the following
paragraphs.
The function, multiple_div2, divides a large integer by a power of two, always rounding down.
Effectively, the function is the opposite of shift_left. It shifts a large integer to the right by the
specified number of bits. The operation, which is very similar to shift_left, is written in both C and
assembly source code.
The function, div2_ceil, which is usually used in conjunction with a multiple_div2 call, divides
a large integer by two, always rounding up. The operation is implemented in both C and assembly, and
is achieved by first adding one to the least significant word, and then adding the propagating carry bit
while performing a single bit shift.
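Minimal sketches of the two helpers, assuming 6-word little-endian integers and, for multiple_div2, a
shift amount below 32:

/* multiple_div2: c = a >> s (floor division by 2^s), 0 <= s <= 31. */
void multiple_div2(uint32_t *c, const uint32_t *a, unsigned s)
{
    uint32_t carry = 0;                /* bits falling in from above */
    for (int i = 5; i >= 0; i--) {
        uint32_t v = a[i];
        c[i] = (v >> s) | carry;
        carry = s ? v << (32 - s) : 0;
    }
}

/* div2_ceil: c = ceil(a / 2), i.e. add one, then shift right one bit. */
void div2_ceil(uint32_t *c, const uint32_t *a)
{
    uint32_t t[6];
    uint64_t sum, carry = 1;           /* the "+1" before halving */
    for (int i = 0; i < 6; i++) {
        sum = (uint64_t)a[i] + carry;
        t[i] = (uint32_t)sum;
        carry = sum >> 32;
    }
    multiple_div2(c, t, 1);
}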
The implementation of Algorithm 3-9 was straightforward after support functions were
available that encapsulate the required operations. The implementation was tested after the point-
multiplication of the following section was written. No implementation errors were found during the
testing of the reduction function.
4.4.3 TNAF Point-Multiplication (Q = kTNAF ⋅ P)
The next function implemented performs the elliptic curve point-multiplication, using the TNAF
representation of k. The function, TNAF_point_mul, is based on Algorithm 3-10. The TNAF point-
multiplication algorithm is extremely similar to the point-multiplication algorithm in the Rosing code,
which implements the NAF point-multiplication technique. The difference between the two techniques
is that k is in a TNAF representation, which replaces the point doubling operation with two polynomial
squaring operations (an application of the Frobenius map τ).
The point-multiplication function follows Algorithm 3-10 very closely. First, the resultant
point Q is set to the point at infinity. Then, if k is zero, the function terminates, returning the point at
infinity as the result of the operation.
Otherwise, variables are set to select the most significant nonzero coefficient of k, and the main
loop of the algorithm, in which each nonzero coefficient results in a point addition or subtraction, is
entered. In addition to the possible point addition or subtraction, two polynomial squaring operations
(applied to the coordinates of Q) and the manipulation of coefficient-selecting variables are executed
within the main loop. The loop iterates through each coefficient, from most significant to least
significant, before the point-multiplication is complete.
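The shape of the main loop is sketched below; Q, P and the point routines are assumed helpers, and
get_coeff is the two-bit accessor sketched in §4.4.1. On a Koblitz curve the doubling of a binary
method is replaced by the Frobenius map τ(x, y) = (x^2, y^2), i.e. two polynomial squarings.

/* Sketch of the TNAF_point_mul main loop (msd = position of the most
   significant nonzero coefficient). */
for (int i = msd; i >= 0; i--) {
    poly_sqr(Q.x, Q.x);                 /* Q = tau(Q) */
    poly_sqr(Q.y, Q.y);
    uint32_t d = get_coeff(k_tnaf, i);  /* 01, 00 or 11 */
    if (d == TNAF_POS1)
        point_add(&Q, &Q, &P);
    else if (d == TNAF_NEG1)
        point_sub(&Q, &Q, &P);
}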
The implementation of the TNAF point-multiplication was simplified by following the basic
structure of the Rosing code point-multiplication function. The only difficulty was encountered with
coefficient values. The Rosing code uses a large array to store the NAF representation of k, where each
array element contains a single coefficient. To reduce memory requirements, several TNAF
coefficients are stored in each array element of a TNAF_FIELD structure. Bit-shifting and bit masks
are required to select individual TNAF coefficients. Several definitions are used to ensure consistency
and avoid errors when referencing the coefficients.
Table 4-7. TNAF Point-Multiplication Performance

Code Description             CGA Cycle Count    HWA Cycle Count
Partmod δ Reduction          8,700              4,800
TNAF Point-Multiplication    2,033,000          1,670,000
Table 4-7 shows the performance of the TNAF point-multiplication function implemented.
The individual HWA routines are shown to outperform their CGA counterparts in §5.4.1. Therefore, as
expected, the point-multiplication employing HWA routines outperforms the CGA alternative. The
performance of the TNAF_partmod_delta function is also presented in the table. The cycle counts of
the partial reduction function are insignificant compared to those of the point-multiplication.
4.4.4 TNAFw Conversion (k → kTNAFw)
The TNAF conversion function was modified to calculate the TNAFw representation of a polynomial.
The algorithm used for this function is Algorithm 3-11. The algorithms used to calculate the TNAF
and TNAFw representation of polynomials are very similar, which simplified the implementation of
get_TNAFw_rep. There are three main modifications made to the TNAF conversion function so that it
calculates the TNAFw representation of finite field elements. The modifications are described in the
following paragraphs.
First, there are more coefficients in the TNAFw representation, so the statement that calculates
the value of the coefficients is modified; the equation used to calculate each coefficient is much more
complicated. Also, associated with the calculation of coefficient values, definitions of variables that
are width-w specific were added to the source code so that changes to the width-w value are
simplified. Consistency throughout the TNAFw conversion and point-multiplication functions is
achieved by employing several definitions in the source code. The definitions also allow the width-w
to be changed easily without modifying the actual function.
Second, a coefficient mapping to a bit representation is required. The possible coefficients
belong to the set {-(2^(w-1) - 1), -(2^(w-1) - 3), ..., -1, 0, 1, ..., 2^(w-1) - 3, 2^(w-1) - 1}, where w is
the width-w value. There are exactly 2^(w-1) + 1 coefficients. Therefore, w bits are required to fully
represent each possibility.
The goal of the coefficient mapping is to maintain simplicity, and to allow employment of a
general algorithm, which efficiently maps the coefficients to their binary representations. Two other
mapping stipulations are that it must not be width-w dependent, and it must be easily reversible. A
coefficient representation that is easily mapped to the index used when addressing the LUT in the
TNAFw point-multiplication is also beneficial. The mapping implemented simplifies the addressing of
the LUT. Algorithm 4-3 defines the mapping that converts integer coefficients to their binary
representation. In the algorithm, coeff is the integer representation of the coefficient, w is the width-w
value, and brep is the binary representation of the coefficient.
With two exceptions, the index used to address the LUT in the TNAFw point-multiplication
is the binary representation of the TNAFw coefficient. First, 0 is represented by all bits being set.
Second, the sign bit of the binary representation must be masked off to determine the index used to
address the LUT in the point-multiplication algorithm. Thus, because of the chosen TNAFw coefficient
representations, the determination of the LUT index is greatly simplified.
Algorithm 4-3. Integer Coefficient to Binary Representation Conversion
Input: coeff, w
Output: brep
1. If (coeff > 0) then
1.1. brep = coeff >> 1
2. Else if (coeff < 0) then
2.1. brep = ( | coeff | ) >> 1
2.2. brep = brep + 2^(w-1) (sets the sign bit)
3. Else brep = 2^w - 1
4. Return (brep)
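A hypothetical C rendering of the mapping and its use for LUT indexing (the names are illustrative,
not the thesis source):

/* Algorithm 4-3: odd digits map to |coeff| >> 1 with the sign in bit
   w-1; the zero digit is encoded as all w bits set. */
unsigned tnafw_encode(int coeff, unsigned w)
{
    if (coeff > 0)
        return (unsigned)coeff >> 1;
    if (coeff < 0)
        return ((unsigned)(-coeff) >> 1) | (1u << (w - 1));
    return (1u << w) - 1;
}

/* Recovering the LUT index only requires masking off the sign bit. */
unsigned tnafw_index(unsigned brep, unsigned w)
{
    return brep & ((1u << (w - 1)) - 1);
}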
The last modification to the TNAF conversion function is required because of a problem that
arises from the nature of the mapping. Normally, all the bits in the structure are initially cleared, zero
coefficients are represented by cleared bits, and the most significant coefficient is easily
distinguishable. Here, however, a set of zero bits does not represent a zero coefficient; it represents a
coefficient of 1, making the most significant coefficient of a TNAFw representation difficult to find.
To combat this problem, at the end of the conversion process, a single zero coefficient is placed in the
position after the most significant coefficient. The added zero coefficient marks the most significant
coefficient in the representation. Therefore, in the point-multiplication algorithm, the most significant
coefficient is easily found by searching for the extra zero coefficient, starting at the most significant
end of the TNAFw representation.
The following section describes the implementation of the TNAFw point-multiplication. It
also includes performance values of the TNAFw conversion for different width-w values.
4.4.5 TNAFw Point-Multiplication (Q = kTNAFw ⋅ P)
The implementation of the TNAFw point-multiplication function is also based on the point-
multiplication function from the Rosing code. Modifications were made so that the function utilizes a
LUT, and so that a TNAFw representation is used to define the polynomial k.
The function, TNAF_precompute, was written and integrated into the TNAFw point-
multiplication function, TNAFw_point_mul. TNAF_precompute calculates the LUT used in the
multiplication process. Equations for the points in the LUT were developed using formulas in [61]. A
different set of equations was developed for each width-w. The width-w values implemented and
tested are four, five and six.
Similar to previous functions, the width-w used in the TNAFw point-multiplication can be
modified, but in this case, modifications are more difficult. The equations that define the points used in
the LUT are width-w specific. Therefore, general code that correctly computes the LUT for all width-
w values cannot be written. The function TNAF_precompute must be modified entirely in addition to
changing definitions related to the TNAFw representation when the width-w value is changed. In fact,
three different versions of the pre-computing function were written. Each function calculates the LUT
for a specific width-w value. During the testing and analysis phase, the proper pre-computing function
was compiled with the source code so the multiplication operation executed correctly.
After implementing the pre-computing function, Algorithm 3-12 was implemented to perform
the point-multiplication. The function is similar to the TNAF point-multiplication function, besides
employing a LUT. Minor modifications were made to the previous point-multiplication function,
which result in a working implementation of the TNAFw point-multiplication algorithm.
Table 4-8 summarizes the performance results for the three width-w values tested. The cycle
counts required for converting from binary integer to TNAFw, pre-computing LUT values and
executing the entire TNAFw point-multiplication are all included for a width-w of five. All values are
from the signature generation process. TNAF_precompute is called within TNAFw_point_mul,
whereas get_TNAFw_rep is not; the point-multiplication cycle counts in the table therefore include the
cost of pre-computing the LUT. As shown in the table, the optimum width-w value is five: the
surrounding width-w values result in greater point-multiplication execution times. A width-w of five is
assumed for all presented results in the remainder of the thesis unless otherwise stated. The conversion
and pre-computing costs for width-w values of four and six were not recorded because those widths
are non-optimal.
Table 4-8. TNAFw Point-Multiplication Performance Comparison

Cycle Counts (from signature generation process)

Width-w   get_TNAFw_rep        TNAF_precompute       TNAFw_point_mul
(w)       CGA       HWA        CGA        HWA        CGA          HWA
4         -         -          -          -          2,624,000    2,140,000
5         77,570    27,030     368,800    406,700    1,494,000    1,193,000
6         -         -          -          -          2,014,000    1,660,000
In many implementations, the base point is fixed, resulting in identical pre-computed LUTs. In
these cases, the LUT does not have to be recalculated for every point-multiplication. Instead, the LUT
can be stored permanently in memory, reducing the cost of the point-multiplication operation. When
the LUT is permanently stored in memory, a larger width-w becomes beneficial with respect to
performance, assuming the memory penalty is acceptable; a larger width-w may be optimal, as implied
in [19]. A memory versus execution time tradeoff is created when the base point is fixed: the
execution time of the point-multiplication operation can be reduced at the expense of memory space.
Referring to the results in Table 4-8, a width-w of six then likely becomes superior to five. The trend
continues as the width-w is increased, but the memory required to store the LUT quickly becomes
unreasonable. A maximum width-w value of six is recommended due to the memory requirements of
the LUT. The memory required by the LUT for this implementation of TNAFw point-multiplication is
described by Formula 4.1.
LUT Memory Requirement = 384⋅(2^(w-1) - 1) bytes (4.1)
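Working directly from Formula 4.1, a width-w of five gives 384⋅(2^4 - 1) = 5,760 bytes, and a
width-w of six gives 384⋅(2^5 - 1) = 11,904 bytes.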
The TNAFw point-multiplication function is used in the signature verification process. An
algorithm presented in [19] was implemented to enhance verification performance, but the
implementation was found not to be beneficial. The implementation and performance of the algorithm
are discussed in the following section. As explained there, the TNAF point-multiplication algorithm,
and not the TNAFw algorithm, is utilized in the simultaneous multiple point-multiplication, and the
rationale for this choice is stated.
4.4.6 Simultaneous Multiple Point-Multiplication (R = k ⋅ P + l ⋅ Q)
Also known as Shamir's trick, simultaneous multiple point-multiplication can reduce the execution
time of the signature verification process in the ECDSA [19]. The principle is based on a windowing
technique similar to the one used in the finite field multiplication operation. The trick is to compute
the two point-multiplications simultaneously by employing a two-dimensional LUT.
The implementation of simultaneous multiple point-multiplication, sim_multiple_pnt_mul, is
based on Algorithm 3-13. The first step in the implementation process was to write code to compute
the LUT in an efficient manner. The LUT construction technique employed minimizes the number of
point-multiplication operations required, since point-multiplications are the most expensive operations
involved. Constructing the LUT using point additions whenever possible therefore minimizes the
execution time.
Most of the pre-computed LUT does not require point-multiplications. The first row and
column are computed first because they depend on only one of the points and can then be used to
construct the remainder of the LUT. Within the first row and column, entries whose indices are powers
of two are computed using the TNAF point-multiplication function, and all entries with other indices
are computed by adding two previously computed LUT points. All remaining points in the LUT are
then computed, using a simple for loop, by adding points from the first row and column.
After the LUT construction was implemented, the primary loop of the simultaneous multiple
point-multiplication was written. The loop uses pre-computed row and column indices to address the
LUT. It also uses a pre-computed TNAF representation of 2^w, where w is the window width, when
performing the required point-multiplication operation.
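For intuition, the classic one-coefficient-per-iteration form of Shamir's trick is sketched below in the
style of the earlier algorithms; the thesis variant instead processes window-width groups of
coefficients, replacing the doubling with a point-multiplication by the pre-computed TNAF
representation of 2^w.

Sketch. Simultaneous Multiple Point-Multiplication (R = k⋅P + l⋅Q), bit-at-a-time
1. R = point at infinity
2. For i = t-1 down to 0 do
2.1. R = 2⋅R
2.2. If (ki, li) ≠ (0, 0) then R = R + LUT[ki][li]
3. Return (R)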
Throughout sim_multiple_pnt_mul, TNAF point-multiplication is employed because of the
nature of the finite field elements involved in the multiplication. The elements always have small
degrees. Most of the bits in their binary integer representation are zero, and only a few of the least-
significant bits are set. Therefore, the LUT in the TNAFw point-multiplication operation is not fully
used, and pre-computing it is a waste of execution time.
Table 4-9 presents the performance of the simultaneous multiple point-multiplication
operation. Both CGA and HWA results for window widths of two, three and four are included. A
window width of three is optimal, and the operations utilizing HWA routines outperform their CGA
counterparts. The superior performance of the HWA version is expected because the individual HWA
routines outperform their CGA counterparts, as examined in §5.4.1.
Table 4-9. Simultaneous Multiple Point-Multiplication Performance Comparison

Window          Cycle Count
Width           CGA           HWA
2               6,226,000     5,198,000
3               6,139,000     5,043,000
4               11,820,000    9,933,000
5 Implementation Comparison and Coding Guidelines
This chapter compares the performance of the implementation and states the coding guidelines
that were followed. First, comparisons between some implemented operations and previously
published results are presented. Then, the coding guidelines that were followed during implementation
are listed. The guidelines specify techniques that result in the generation of efficient assembly and C
code. Next, a comparison of the performance of the CGA and HWA routines is presented, at both the
routine level and the signature generation and verification level. Finally, a comparison of the memory
requirements is provided, based on the CGA and HWA routines; it includes the memory requirements
of the data, of the actual CGA and HWA routines, and of the signature generation and verification
processes.
5.1 Performance Comparison with Previous Published Results
The following two sections compare the operation performances presented in the previous chapter
with published results. A comparison of the implemented finite field squaring, multiplication and
inversion operations is presented in the following section. Then, a performance comparison of the
point-multiplication operations, the simultaneous multiple point-multiplication, and the signature
generation and verification processes is offered in §5.1.2.
5.1.1 Low-Level Performance Comparison
The results of the finite field operations implemented are compared with figures presented in three
papers. In [19], results from implementations of the NIST-recommended elliptic curves on a Pentium
II 400 MHz processor are presented; the NIST-recommended curve for GF(2^163) is the Koblitz curve
implemented in the thesis. In [40], ECC results from several papers are summarized. The results are
from implementations using various elliptic curves, finite fields and platforms. In [39], the
performance of a point-multiplication algorithm is presented on various processors. Only the relevant
results are included in the comparison. The tables present estimated values because the published
results were converted from execution times to cycle counts.
Table 5-1 presents a summary comparison of the implementation's finite field operation
performance with other published results. In the table, the Hankerson implementation is shown to be
superior. The thesis implementation results are not as good as the Hankerson results, but are better
than the Lopez results for the inversion and squaring operations. The fact that a random elliptic curve
is used by Lopez does not affect the finite field results; the type of elliptic curve used only affects the
performance of elliptic curve operations.
It is interesting to note the multiplication-to-inversion performance ratio in each case.
Hankerson and Lopez both present ratios near ten, whereas the thesis implementation achieved a ratio
of less than five. The ratios differ considerably, which suggests several possibilities. It is likely that
the Pentium II and UltraSparc processors favor the multiplication operation, or that the SC140 favors
the inversion operation. Another possibility is that the Hankerson and Lopez inversion operations
and/or the thesis multiplication operation can be improved upon. Finally, the different implementations
may use highly optimized versions of similar algorithms, as presented in [20] and [22], leading to
performance enhancements of the multiplication and inversion operations.
Table 5-1. Estimated Finite Field Operation Cycle Count Comparison

Description                 Elliptic Curve        Target Processor      Squaring   Multiplication   Inversion
Hankerson, et al - C [19]   Koblitz - GF(2^163)   Pentium II 400 MHz    160        1,200            12,400
Lopez, Dahab - C [39]       GF(2^163)             Pentium 233 MHz       -          2,346            -
Lopez, Dahab - C [39]       GF(2^163)             Pentium II 400 MHz    -          1,188            -
Lopez, Dahab - C++ [40]     Random - GF(2^163)    UltraSparc 300 MHz    690        3,150            28,860
Lopez, Dahab - C [39]       GF(2^163)             UltraSparc 450 MHz    -          1,134            -
Thesis - C and ASM          Koblitz - GF(2^163)   SC140 300 MHz         212        3,475            16,730
Some factors that explain the difference in performance shown in the above table include code
size, target processor and optimizing strength of the compiler. The factors and how they can affect
performance figures are explained in the following paragraphs.
It is possible to improve the performance of operations by increasing the code size. For
example, the inversion performance was increased by removing a polynomial exchange and adding an
extra loop case. By adding the extra loop case, the code size is increased. Therefore, discrepancies in
the performance of operations can be due to code size.
Code that is written to be versatile, allowing for example the modification of the finite field
size, is likely to perform worse. By fixing the poly_sqr window width, for instance, the performance of
the operation was significantly improved. Certain assumptions can be made that improve performance
when parameters are fixed. The performance of the thesis implementation of the multiplication
operation, among others, can likely be improved by fixing the window width and finite field size used.
The target processor can have a large effect on the performance of an application. Different
processors have different instruction sets, data path widths and numbers of registers. The data path
width and number of registers can significantly affect the performance of applications that are
computationally intensive. Furthermore, specialized instructions, such as the CLB instruction of the
SC140, that are not present in most processor instruction sets can have a large effect on the
performance of implementations.
Lastly, mature processors are advantageous for implementations because of the strength of
their compilers. Mature processors have been more thoroughly studied, and therefore superior compiler
optimization techniques are known for them. Compilers that target a mature processor are likely to
generate better-performing applications.
5.1.2 High-Level Performance Comparison
In this section, the performance of the point-multiplication operation, and the signature generation and
verification processes are compared to previously published results. Similar to the previous section, the
results used in the comparison are from [5], [19] and [40]. The tables present estimated values because
the published results were converted from execution times to cycle counts.
Table 5-2 compares the implemented point-multiplication results with other published figures.
For the thesis implementation, two TNAFw point-multiplications are less computationally expensive
than the simultaneous multiple point-multiplication operation presented in §4.4.6, so the execution time
of two TNAFw point-multiplications and a point addition is included in the table. The thesis
implementation results are inferior to the Hankerson results and superior to the Lopez results. Some
reasons for the difference in performance are presented in the preceding finite field performance
comparison. The Lopez results are less impressive because a random curve is used; random curve
implementations cannot benefit from τ-adic point-multiplication algorithms, and are therefore more
computationally expensive.
The Hankerson cycle count for the simultaneous multiple point-multiplication is based on two
TNAFw point-multiplications, one of which exploits a pre-computed LUT. Therefore, the memory-
constrained result of 1,588,400 cycles presents a more accurate comparison [19]. Either way, the
Hankerson implementation outperforms the thesis implementation. This is expected because of the
comparison in §5.1.1: since elliptic curve operations are based on the execution of several finite field
operations, the performance of elliptic curve operations is limited by the performance of the underlying
finite field operations.
Table 5-2. Estimated Elliptic Curve Operation Cycle Count Comparison

                                                                    Point-Multiplication Implementation
Description                 Elliptic Curve        Target Processor      TNAF        TNAFw       Simultaneous Multiple
                                                                                                Point-Multiplication
Hankerson, et al - C [19]   Koblitz - GF(2^163)   Pentium II 400 MHz    778,400     576,800     1,080,800
Lopez, Dahab - C++ [40]     Random - GF(2^163)    UltraSparc 300 MHz    4,050,000   -           -
Thesis - C and ASM          Koblitz - GF(2^163)   SC140 300 MHz         1,670,000   1,193,000   2,450,000
Table 5-3 presents a comparison of the performance of the signature generation and
verification processes. The table includes more results than the previous comparisons, so the elliptic
curve and target processor in the table must be noted in each case. The Smart implementation results
are similar to those obtained in the thesis implementation, but use a random elliptic curve.
Furthermore, a direct comparison between cycle counts corresponding to 16-bit and 32-bit processors
is invalid because of the difference in data path width.
Table 5-3. Estimated Signature Generation and Verification Cycle Count Comparison

                                                             Signature Process Implementation
Description                 Elliptic Curve        Target Processor       Generation    Verification
Brown, et al - C++ [5]      Koblitz - GF(2^163)   DragonBall 16 MHz      28,688,000    52,208,000
Brown, et al - C++ [5]      Koblitz - GF(2^163)   Intel 386 10 MHz       10,011,000    18,260,000
Brown, et al - C++ [5]      Koblitz - GF(2^163)   Pentium II 400 MHz     844,000       1,636,000
Certicom - C [40]           Koblitz - GF(2^163)   UltraSparc 167 MHz     634,600       1,786,900
Daswani, Boneh - C [40]     Koblitz - GF(2^163)   DragonBall 15 MHz      12,000,000    35,100,000
Smart - C++ [40]            Random - GF(2^161)    Pentium Pro 334 MHz    1,336,000     6,346,000
Thesis - C and ASM          Koblitz - GF(2^163)   SC140 300 MHz          1,329,000     2,590,000
The results in the table that use Koblitz curves vary greatly. The target processor accounts for
some of the variation. Other factors that may contribute to the varied performance figures obtained for
the signature processes and the point-multiplication operation are explained in the comparison of
§5.1.1.
One of the intentions of the thesis is to determine whether the performance of the signature
generation and verification processes on the SC140 is acceptable. Converting the thesis cycle counts
from Table 5-3 leads to execution times of approximately 4.43 and 8.63 milliseconds respectively.
These execution times are reasonable: the delay caused by executing the processes is insignificant and
would be unnoticeable to a user. Therefore, implementing the ECDSA on the SC140 is practical.
Specialized processing units are not required to execute ECC-based security protocols; the protocols
can be designed for and executed on the SC140, eliminating the need for a specialized cryptographic
processor in a portable device built around the SC140.
5.2 Guidelines for Writing Efficient C Code for Cryptographic
Applications
Several techniques were found to improve the performance of the compiler-generated code during the
implementation and optimization of the ECDSA. C coding guidelines that have the most significant
influences on the performance of the compiled application are presented in the following points. To
further clarify the techniques, examples are provided when necessary.
1. Pass small variables by value whenever possible. This simplifies the generated assembly because
the variable does not have to be read from memory several times. Instead, it is read once from the
stack, and the value can then be kept in a register and used throughout the function without being
re-read, because no store to memory can change it.
2. Use pointers to arrays, and add offsets, or increase pointer values, instead of indexing array
elements. The resultant C code is harder to follow, but the compiler-generated code is more
efficient when this technique is used.
int i, array[10], *pointer;
for(i=0, pointer = array; i < 10;i++, pointer++) *pointer = 0;
3. Define and use temporary variables within functions when a calculated value is used several times.
This ensures that the compiler does not generate code that recalculates the value each time it is
used. The compiler can also more easily map the temporary variable to a single register.
4. To further reuse code, generalize functions that perform the same operation so that they can handle
variable size arrays. The required arguments to the function are a pointer to the start of each array,
and the size of the array, which is passed by value.
void null(ELEMENT *input, ELEMENT count);
5. When optimizing code for speed, inline small function calls within larger ones. Inlining is then
guaranteed, rather than being an optimization that the compiler may or may not employ. This
avoids the overhead associated with calling functions and can significantly decrease execution
time.
6. Write code that can be reordered when possible. The compiler has more freedom when optimizing
the generated assembly if the code can be reordered. The compiler can reorder instructions to
result in greater parallelism and possibly reduce memory moves.
c1 = LUT[1].e; c2 = LUT[2].e; c3 = LUT[3].e; c4 = LUT[4].e;
c5 = LUT[5].e; c6 = LUT[6].e; c7 = LUT[7].e; c8 = LUT[8].e;
for(i = 0; i <= NUMWORD; i++, c1++, c2++, c3++, c4++, c5++, c6++, c7++, c8++) {
*c3 = *c2 ^ *c1; *c5 = *c4 ^ *c1;
*c6 = *c4 ^ *c2; *c7 = *c4 ^ *c3; }
7. Use temporary variables for loop counting in for loops, instead of calculating the end address of
pointers. Temporary variables require less resources, are simpler for the processor to handle, and
can often be hard-coded.
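A hedged sketch of the guideline (names hypothetical): a simple down-counting temporary controls the
loop instead of a computed end address:

    /* preferred: a counter maps naturally to a hardware loop count */
    void clear_words(ELEMENT *p, int n)
    {
        while (n--)
            *p++ = 0;
    }
    /* avoided: for (end = p + n; p < end; p++) ... requires the end
       address to be calculated and kept in a register               */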
8. When calling functions to perform simple tasks, combine multiple calls so that loops that require
the same number of iterations can be merged, as sketched below. This reduces the code size, and decreases
execution time in many cases because greater parallelism and pipelining can be achieved. In
addition, combining several loops that require the same number of iterations reduces the overhead
associated with looping.
9. Use alternative operations that achieve identical results when they execute faster. For example,
when performing remainder and division calculations by powers of two, the & and >> C operators
produce the same results as % and / respectively. The & and >> operators result in better
performing code because they map directly to assembly instructions.

mod = num % 8; ↔ mod = num & 7;
div = num / 8; ↔ div = num >> 3;
5.3 Guidelines for Writing Efficient Assembly Code for Cryptographic
Applications
There are several coding techniques used that result in optimized assembly code. The techniques
maximize the parallelism of the code, leading to a reduction in execution time. The guidelines used
during the implementation of assembly routines that had the greatest performance effect are listed.
Performance refers foremost to execution time, but power consumption is also considered. When
necessary, examples are provided to further clarify the techniques.
1. For implementation on processors with limited memory bandwidth such as the SC140, it is
important to give memory moves the highest priority during assembly coding. In most cases,
memory moves limit the performance of the code. Therefore, the optimum organization of MOVE
instructions should be determined. Then, other instructions should be organized in parallel with the
MOVE instructions in a manner that least hinders performance.
2. Avoid writing values to memory whenever possible. Updating variables stored in memory should
only be carried out near the end of a routine, after the variable is no longer modified within the
routine. The exception to this guideline is when a child routine is called that accesses the
variables from memory. In many cases, when implementing simple routines, there is enough
register file space to store all the required values that are used throughout the routine, so the
number of memory moves can be minimized. Values that are only required for a small portion of
the routine can be read from memory before their use and, if they are modified, written back to memory
after they are no longer required within the routine. This reduces the number of memory moves,
decreasing the execution time and power consumption of the implementation.
3. An attempt to minimize stack space usage, and only allocate stack space if necessary, should be
made. In many cases, stack space is not required because of the size of the register files. Registers
can be used to store local variables instead of stack space. This further reduces the amount of
memory moves, decreasing the execution time and power consumption of the routine.
4. Instructions that belong to the Critical Path (CP) should be given the highest execution priority.
The CP is the set of instructions that determine the execution time of a routine. In other words,
when organizing instructions that are performed in parallel, the set of instructions that are involved
in the CP should be placed before other instructions whenever possible. Cases arise when there are
several instructions that can be executed during a certain clock cycle, but because of processor or
VLES limitations, only a subset of the instructions can be executed. The instructions that belong to
the CP should always be included in the executed subset ahead of other, less performance-critical
instructions.
5. When the data structure located in memory is known, pipelining techniques can be employed that
lead to greater parallelism and tighter loops. The order in which memory reads and writes can be
modified or parallelized because the memory addresses involved are guaranteed not to overlap.
Rearranging the order of memory accesses often results in performance improvements.
Before:
loopstart3
MOVE.L (r3)+, r4
MOVE.L r4, (r1)+
loopend3

After (the load for the next iteration is moved ahead of the loop and overlapped with the store):
MOVE.L (r0)+, r3
loopstart3
MOVE.L r3, (r1)+    MOVE.L (r0)+, r3
loopend3
6. Advanced pipelining techniques that use registers to temporarily store data reduce the length of
loops. In many cases, by reading data into registers for temporary storage and later transferring it
to other registers for use, loops become tighter and therefore more efficient.
Before:
loopstart3
MOVE.L (r1)+, d2
MOVE.L (r0), d0
TFR d2, d1
LSRR d7, d2
MOVE.L d0, (r0)+
loopend3

After:
MOVE.L (r3)+, d5
MOVE.L (r2), d3
TFR d5, d4
LSRR d7, d5
MOVE.L d3, (r2)+
LSLL d6, d1
ZXT.L d2
EOR d1, d0
EOR d2, d0
LSLL d6, d4
ZXT.L d5
EOR d4, d3
EOR d5, d3
7. The reordering of instructions can improve performance. In cases where the same data is required
to compute two different variables, the order in which the instructions are executed may affect the
performance of the code. See 12 in §6.2 for an example.
8. In several cases, to comply with StarCore programming rules, NOPs must be added before and
after hardware loops. NOPs surrounding loops can often be avoided by a minor unrolling of the
loop. By doing so, clock cycles are saved each time the routine is called. The overall
performance of an application improves when this is applied to commonly called routines. See 17
in §6.2 for an example.
9. Cycles are saved when branch instructions are replaced with return from subroutine instructions
when appropriate. This rule is also presented as a suggested compiler optimization improvement.
See 21 in §6.2 for an example.
10. Use delay instructions whenever possible. The instructions, such as BFD, BRAD, BSRD, BTD,
JMPD, JSRD, RTED, RTSD and RTSKD, save clock cycles when properly implemented. The
effective cycle cost of the execution sequence is reduced by one clock cycle by executing the
following VLES during the delay instruction.
Before:
MOVE.L d1, (r4)
MOVE.L d3, (r5)
RTS

After:
RTSD
MOVE.L d1, (r4)    MOVE.L d3, (r5)
5.4 Hand-Written and Compiler-Generated Assembly Comparison
Several routines, which implement operations or functionality required by operations, were written in
assembly code in an attempt to improve the performance of the signature generation and verification
processes. In each case, an HWA and CGA version of the routine was implemented. The two versions
of identical functionality were implemented to determine the effect and benefits of HWA and the
strength of the C compiler.
The C code used in the comparison, and throughout the entire thesis, does not take advantage
of special optimizing techniques. In other words, no definitions were implemented that aid the
compiler's understanding of the code by allowing assumptions and better optimized code. In
addition, intrinsic functions were not used in the C code, although they can result in superior
compiler-generated assembly because they allow the programmer to specify individual assembly
instructions.
The routines that were chosen for dual implementation are primarily simple routines that are
commonly called. Routines that could greatly benefit from the added functionality HWA offers were
also implemented in both source code languages and compared. The corresponding HWA and CGA
routines have identical parameter lists. All of the functions require a count argument, which defines the
size of the input data structures. The count argument is used to determine the number of loop iterations
to perform. A fixed structure size would improve the performance of both the HWA and CGA
routines, but it reduces the reusability of the routines. For example, the same routine is used to copy
FIELD2N and DBLFIELD structures. Fixing structure sizes also further complicates modifying the
finite field size.
At the start of the implementation process, importance was placed on the reusability of the
code, both to reduce the application size and to allow for the easy modification of the security strength
by changing the finite field size. For these reasons, the structure size is an input to the HWA and CGA
routines. The following two sections compare the HWA and CGA routines with respect to low-level
and high-level execution times respectively.
5.4.1 Low-Level Performance Comparison
Identical algorithms were used when implementing the HWA and CGA routines whenever possible.
The implemented algorithms only differ in cases where functionality is not available in the specific
source code language. The MSB_degree, MSB_degree1, add_int and sub_int routines are
examples of routines where the assembly and C algorithms differ.
The two degree-computing routines are MSB_degree and MSB_degree1. The CLB
instruction was used in the HWA versions of these routines. This functionality is not available at the C
level, and therefore had to be simulated at the expense of execution time, leading to greater
computational costs.
The add_int and sub_int HWA routines benefit from the carry bit of the SC140. The carry bit
is used to record overflows and underflows during the addition and subtraction of integers. There are
instructions, ADC and SBC, which include the carry bit when adding or subtracting two registers.
These instructions allow efficient implementation of large integer addition and subtraction, where the
large integers span more than one register, by including carry and borrow bits in future instructions
without a cycle penalty. In comparison, a complex boolean expression was defined to compute the
carry and borrow bits in the C routine. Employing boolean expressions is an inefficient method of
simulating the carry bit, resulting in less efficient code, as sketched below.
The results presented in this section are from C level CGA and HWA routine calls. All of the
routines are implemented in the form of C or assembly functions. The input finite field elements and
large integers are all FIELD2N structures consisting of six elements unless otherwise stated. The
performance of the HWA and CGA routines, which is for the most part independent of the input
parameters, is presented in Table 5-4. Both the cycle and instruction counts of each HWA and CGA
routine are provided, along with a brief description of the inputs involved. To obtain the results, each
routine was run several times, using identical inputs for both HWA and CGA routines.
The routines were named according to the task they perform. The routines, add_int and
sub_int, add and subtract large integers respectively. The routine neg computes the negative of a large
integer. Rounding up of a large integer after dividing it by two is achieved by div2_ceil. The routines,
null and copy, clear and copy the input structures respectively. Polynomial addition is performed by
poly_add, and poly_add2 performs two polynomial additions. Shifting a structure to the left by a
specified number of bits less than the register width of 32 is performed by shift_left. The two routines,
convert_to_larger and convert_to_smaller, copy data from a structure to a larger and smaller structure
respectively. The sizes of the two structures are arguments to the routines. The structures can be of
equal size, resulting in each routine performing a copy operation. In convert_to_larger, all the data
from one structure is copied to the least significant array elements of the other, and the remaining array
elements are cleared. No reduction is performed by the convert_to_smaller routine. The least
significant array elements are copied to the other structure, ignoring any overflowing array elements.
Table 5-4 shows that the HWA routines outperform the CGA routines in each case. The cycle
performance comparison column shows the superior performance of the HWA as a percentage. The
routine that benefits the most from hand-written assembly is add_int. A performance improvement of
over 530% is observed when comparing the HWA and CGA routines. As previously mentioned, this is
most likely due to the use of the carry bit and ADC instruction in the HWA. This functionality is not
available at the C level without the use of intrinsic functions.
Table 5-4. Low-Level CGA and HWA Performance Comparison (input independent routines)
Routine               CGA Cycles (Instr)   HWA Cycles (Instr)   Cycle Performance Comparison (%)   Description
add_int               218 (172)            41 (28)              532                                Random large integer
sub_int               170 (130)            41 (28)              415                                Random large integer
neg                   125 (89)             37 (27)              338                                Random large integer
neg                   125 (89)             37 (27)              338                                Large integer = 0x43
neg                   120 (99)             37 (27)              324                                Large integer = -0x43
div2_ceil             139 (102)            54 (34)              257                                Random large integer
div2_ceil             139 (117)            54 (34)              257                                Large integer = 0
div2_ceil             139 (114)            54 (34)              257                                Large integer = 1
div2_ceil             145 (112)            54 (34)              269                                Large integer = -1
null                  63 (40)              20 (12)              315                                Random polynomials
copy                  74 (49)              27 (16)              274                                Random polynomials
convert_to_smaller    88 (61)              30 (19)              293                                Emulate copy
convert_to_larger     95 (65)              37 (23)              257                                Emulate copy
convert_to_smaller    89 (62)              31 (20)              287                                DBLFIELD[11] to FIELD2N[6]
convert_to_larger     131 (91)             40 (29)              328                                FIELD2N[6] to DBLFIELD[11]
poly_add              87 (59)              36 (24)              242                                Random polynomials
poly_add2             107 (79)             46 (33)              233                                Random polynomials
shift_left            119 (93)             38 (27)              313                                4 shifts, random polynomials
shift_left            119 (93)             38 (27)              313                                8 shifts, random polynomials
The performance of the routines, shown in the table, was measured using inputs with six array
elements. As the finite field size and large integers involved increase, the HWA routines outperform
the CGA routines by larger factors. This is due to tighter hardware loops in the HWA routines.
Large integer addition, which is implemented by the add_int routine, is not commonly used in
the ECDSA. Therefore, the performance benefits of other HWA routines must be the primary focus of
the comparison. Routines that implement finite field element manipulation, such as null, copy,
poly_add, poly_add2 and shift_left, have a greater effect on the overall performance of the signature
generation and verification processes. Compared to their CGA counterparts, a minimum performance
enhancement of 230% is achieved by these routines.
Several other routines were also implemented in both C and hand-written assembly. Their
performance figures are recorded in Table 5-5. The performance of each routine depends on the
number of shifts required, or the degree of the input polynomial.
The performance of the CGA version of the first routine listed, MSB_degree, completely
depends on the degree of the input polynomial. The performance follows a pattern that is clarified by
organizing results into groups. The groups correspond to results where the degree of the polynomial
divides 32, which is the register width, by the same integer. The grouping also corresponds to the
representation of the polynomials. The MSB of the polynomial is located in the same array element of
the FIELD2N structure for each group. As shown by the bracketed values in Table 5-5, the CGA code
requires an additional twelve clock cycles for each degree of a polynomial within a single grouping.
In comparison, by exploiting the CLB instruction, the HWA code requires the same number of
cycles to calculate the degree of any polynomial in a group. Furthermore, the HWA routine only
requires four extra cycles to compute the degree of a polynomial in a subsequent group, whereas the
CGA routine requires an additional twelve cycles for corresponding degrees in subsequent groups.
Due to the additional functionality available at the assembly level, and the superior assembly coding
techniques compared to the compiler, the HWA routine outperforms the CGA routine. The HWA is an
improvement over the CGA routine of at least 130%, and of at most 1550%. The benefits of
the HWA routine MSB_degree lessen when the previous degree of the polynomial is provided as an
argument.
When the previous degree of the polynomial is exploited by the degree calculating routine,
which is the case with MSB_degree1, the performance of the CGA routine significantly improves,
while the HWA falters. This is seen when comparing the performances of MSB_degree and
MSB_degree1. On average, the CGA version of MSB_degree1 outperforms MSB_degree, whereas the
opposite is true with the HWA versions. In the case of MSB_degree1, the previous degree of the
polynomial is passed to the routine. The value is used as a starting point for the calculation of the
current degree. The routine calculates the degree correctly as long as the previous degree is valid. It
must be greater than the current degree. To collect the performance values in Table 5-5, the previous
degree value used was the actual degree plus one, except with the null polynomial.
Table 5-5. Low-Level CGA and HWA Performance Comparison (input dependent routines)
Function         CGA Cycles     CGA Instr     HWA Cycles   HWA Instr   Cycle Perf. Comp (%)   Description
MSB_degree       378-403 (12)   279-297 (9)   26           13          1454-1550              Degrees 162-160, ( ) = change per degree
MSB_degree       40-410 (~12)   24-304 (~9)   30           17          133-1367               Degrees 159-128, ( ) = change per degree
MSB_degree       52-422 (~12)   33-313 (~9)   34           21          153-1241               Degrees 127-96, ( ) = change per degree
MSB_degree       64-433 (~12)   42-322 (~9)   38           25          168-1139               Degrees 95-64, ( ) = change per degree
MSB_degree       76-446 (~12)   51-331 (~9)   42           29          181-1062               Degrees 63-32, ( ) = change per degree
MSB_degree       88-458 (~12)   60-340 (~9)   46           33          191-996                Degrees 31-0, ( ) = change per degree
MSB_degree       101            73            48           35          210                    Degree -1
MSB_degree1      55             38            39 (43)      23 (27)     141                    ( ) = degree = 32*i-1
MSB_degree1      83             63            41           25          202                    Degree -1, prev_degree = 2
shift_and_add    135            94            49           37          276                    Shifts 1-31
shift_and_add    117            82            46           34          254                    Shifts 32-63
shift_and_add    99             70            43           31          230                    Shifts 64-95
shift_and_add    81             58            40           28          203                    Shifts 96-127
shift_and_add    63             46            37           25          170                    Shifts 128-159
shift_and_add    50             34            35           20          143                    Shifts 160-162
shift_and_add2   195            138           72           56          271                    Shifts 1-31
shift_and_add2   168            119           66           50          255                    Shifts 32-63
shift_and_add2   141            100           60           44          235                    Shifts 64-95
shift_and_add2   114            81            54           38          211                    Shifts 96-127
shift_and_add2   87             62            48           32          181                    Shifts 128-159
shift_and_add2   62             42            44           24          141                    Shifts 160-162
multiple_div2    125 (124)      96 (99)       49           35          255                    Shifts 1-31, ( ) = negative large int
multiple_div2    113 (116)      90 (94)       49           37          231                    Shifts 32-63, ( ) = negative large int
multiple_div2    107 (109)      84 (87)       47           35          228                    Shifts 64-95, ( ) = negative large int
multiple_div2    101 (102)      78 (80)       45           33          224                    Shifts 96-127, ( ) = negative large int
multiple_div2    95             72 (73)       46           31          207                    Shifts 128-159, ( ) = negative large int
multiple_div2    94 (93)        66            47           32          200                    Shifts 160-162, ( ) = negative large int
The CGA routine MSB_degree1 calculates the degree of the input polynomial in fixed time.
On average, it is much faster than the previous degree calculating routine, but the HWA routine still
outperforms the CGA implementation by 141%. It is interesting to note that the overhead associated
with passing the previous degree to the HWA implementation is actually detrimental. On average, the
HWA MSB_degree routine outperforms the HWA MSB_degree1 routine. By averaging the figures in
the table, the HWA versions of MSB_degree and MSB_degree1 require 37.77 and 39.15 cycles
respectively. For this reason, the CGA version of MSB_degree1 and the HWA version of MSB_degree
are integrated into the finite field inversion function.
The routines, shift_and_add, shift_and_add2, and multiple_div2 follow the same performance
pattern. Similar to MSB_degree, the routines are grouped by multiples of the register width. Within
each group, the routines perform identically because they employ instructions capable of shifting
registers by a variable number of bits in a single clock cycle.
The CGA and HWA routines of shift_and_add and shift_and_add2 are very similar. The first
adds a shifted version of a polynomial to another. The second routine, which makes the first obsolete,
is equivalent to calling shift_and_add twice. Two shifted polynomials are added to two other
polynomials in parallel. The routines perform tasks specific to the finite field inversion operation.
The multiple_div2 routine is equivalent to the shift_left routine described previously, except that it
shifts a structure to the right. It is actually more powerful than the shift_left routine, in that the shifting
value is not limited to less than the register width of 32 bits. The routine was defined specifically
for the partmod δ reduction; thus, it targets large integers, but it can also be used with finite field
elements.
The performance of the HWA routines is superior to that of the CGA routines. The HWA
shift_and_add, shift_and_add2 and multiple_div2 routines result in at least a 141% increase in
performance, and on average, the increase exceeds 210%.
Overall, a significant increase in performance is achieved by implementation using assembly
code instead of C. Only simple routines were implemented and compared, but significant decreases in
cycle counts for all the implemented routines were recorded. Similar benefits are expected when
complex functions and operations are entirely implemented in assembly. The performance
enhancements likely accumulate so that an even greater savings in execution time is achieved.
The performance discrepancy between the CGA and HWA routines clearly demonstrates the
inability of the compiler to generate efficient code. The CGA routines are significantly outperformed
by the HWA routines, which shows that the SC140 compiler does not produce optimal assembly
for cryptographic applications. Assembly code implementations of cryptographic applications on the
SC140 are far superior to those written in C.
5.4.2 High-Level Performance Comparison
A high-level performance comparison of the CGA and HWA routines was achieved by integrating the routines into
the ECDSA source code. Definitions were used to determine which set of routines, either CGA or HWA,
was included in the execution sequence.
In an attempt to estimate the reduction in computational costs due to employing HWA routines
in place of CGA routines, the number of times each routine is called was recorded. An estimation of
the computational cost reduction due to employing HWA routines is achieved by combining call counts
with the cycle performance comparison from the previous section. The minimum and maximum
computational reduction due to the most commonly called HWA routines within the signature
generation process is presented in Table 5-6. CGA and HWA routines that are not commonly called
are not included in the table because they do not have a significant effect on the performance of the
process. The minimum and maximum reductions are computed because several of the routines listed
execute in data-dependent time, and without the inputs of each routine call, it is very difficult to
estimate the actual computational reduction achieved.
Table 5-6. Computational Reduction of the Signature Generation Process due to HWA Routines
Routine              Call Count   Cycle Reduction per Call   Total Cycle Reduction
                                  Minimum      Maximum       Minimum      Maximum
add_int              469          177          177           83,010       83,010
convert_to_larger    183          91           91            16,650       16,650
copy                 857          47           47            40,280       40,280
MSB_degree           292          10           412           2,920        120,300
multiple_div2        235          47           76            11,050       17,860
neg                  183          83           88            15,190       16,100
null                 189          43           43            8,127        8,127
shift_and_add        4674         15           86            70,110       402,000
shift_left           511          81           81            41,390       41,390
sub_int              303          129          129           39,090       39,090
Total                                                        327,800      784,800
The cycle count of the signature generation process that employs HWA routines should be at
least 327,800 cycles less than the cycle count of the process that employs CGA routines. The actual
cycle count reduction should be larger than this minimum, and should not exceed the maximum
computational reduction in Table 5-6, because the worst-case performance of the CGA routines is used
in the maximum's calculation. The cycle count reduction of the signature verification process is expected to be
approximately twice that of the signature generation process. This is because the verification process
requires two point-multiplications, which account for most of the computational cost of the process,
whereas the signature generation process only requires one point-multiplication.
The ECDSA source code was modified so that it is possible to measure the duration of the
signature generation and verification processes. Code was written that set the digital signature to the
correct value, and that compared the signature against hard-coded values, so that the generation and
verification processes could be bypassed. The modifications make it possible to selectively execute the
signature generation and verification processes. The results obtained from selectively executing the
processes and measuring the cycle counts are presented in Table 5-7.
The signature verification results in Table 5-7 were recorded using two TNAFw point-
multiplication functions. Two TNAFw point-multiplication operations, along with a point addition
operation are found to outperform the simultaneous multiple point-multiplication operation in the
signature verification process. This is due to the efficiency of the TNAFw representation. The
TNAFw representation guarantees an average Hamming weight of m/(w+1), where w = 5 is optimal for
this implementation; with m = 163, this gives roughly 163/6 ≈ 27 nonzero coefficients on average. Due
to the small number of nonzero coefficients, fewer point additions are required in the TNAFw
point-multiplication operation.
Table 5-7. High-Level CGA and HWA Performance Comparison
Description              CGA Cycle Count   HWA Cycle Count   Actual Cycle Count Reduction
Signature Generation     1,819,000         1,329,000         429,000 (23.6%)
Signature Verification   3,393,000         2,590,000         803,000 (23.7%)
The signature processes that include HWA routines outperform their CGA counterparts, and
the cycle count reduction is within the maximum and minimum values estimated in Table 5-6. An
actual reduction of 429,000 clock cycles is achieved by employing HWA routines instead of CGA
counterparts within the signature generation process. The computational cost of signature verification
is reduced by 803,000 clock cycles when employing HWA routines. As expected, the cycle count
reduction of the signature verification process is approximately twice that of the signature generation
process.
Due to the significant computational cost reduction illustrated in Table 5-7, it is recommended
that assembly implementations be employed. In the thesis, only a small amount of basic functionality
was implemented in assembly. As presented in Table 5-4 and Table 5-5, the implemented assembly
routines greatly outperform their counterparts, which were written in C. Furthermore, the routines
implemented in assembly translate to significant computational cost reductions at higher levels.
Computational cost reductions are easily achieved by implementing commonly executed low-level
routines in assembly, and even greater computational cost reductions are expected when higher-level
routines are also implemented in assembly.
The following section examines the memory requirements of the implementation. The
memory requirements of the signature generation, signature verification, and entire signature
generation and verification processes are presented, as well as the memory requirements of the CGA
and HWA routines.
5.5 Memory Requirements Comparison
When targeting portable devices, the memory requirements of the application become more important.
Applications targeting portable devices must adhere to the limited processing and storage resources
present. Throughout the implementation process, the focus was on the computing time of operations
and processes; an attempt was made to minimize computing time at the cost of memory.
However, some decisions were made because of memory limitations. For example, the bit
width used in the polynomial squaring function was selected primarily due to a decrease in execution
time, but the size of the LUT, which is permanently stored in memory, grows exponentially with the bit
width. Larger bit widths reduce the computational cost of the polynomial squaring operation, but their
LUTs require substantial amounts of storage space, and the memory requirements quickly become
impractical.
When selecting the window width and width-w for the polynomial and point-multiplication
operations respectively, the memory requirements were considered. After determining the optimum
window width with respect to computational costs, the memory requirements were investigated. Both
functions use temporary LUTs that are computed during execution. The LUTs are dynamic, and only
present when executing the appropriate operation. It was ensured that the LUTs required by the
operations are of reasonable size. In each case, the size of the LUT was compared to the memory
requirements of the entire application. The memory requirements of the operations seemed reasonable,
but for systems with extremely limited memory resources, the requirements may be too large.
The memory requirements of the LUT in the point-multiplication operation are considerable.
The LUT is 6,144 bytes, which may be too large for certain portable devices. By employing the TNAF
point-multiplication operation instead of the TNAFw version, the LUT is eliminated at the cost of
approximately 500,000 cycles. This translates to an additional 1.67 milliseconds per point-
multiplication operation when the SC140 is operating at 300 MHz. The dynamic memory requirements
of the implementation are not included in Table 5-8.
The permanent storage requirements of the implementation are presented in Table 5-8. The
table presents the memory requirements divided into several categories, while contrasting compilations
including either CGA or HWA routines. The memory requirements of the data, CGA and HWA
routines, general code other than the CGA and HWA routines, the entire ECDSA, and the signature
generation and verification processes are provided. All values are approximate because it is difficult to
determine the exact size of each category: several of the categories overlap and affect each other, and
there are no distinct divisions within the code that allow the memory requirements to be easily
separated into categories. Furthermore, the dynamic memory requirements of LUTs that are calculated
on the fly are not included in the table; only the memory requirements of the LUT used in the finite
field squaring operation are included, because that LUT is not calculated on the fly.
Table 5-8. Estimated Permanent Storage Requirements
Description                       CGA (bytes)     HWA (bytes)
Data                              1,232           1,232
Routines                          3,300 (2,200)   1,200
General Code (not CGA or HWA)     32,300          32,300
ECDSA (total)                     37,800          35,700
Signature Generation              34,200          32,300
Signature Verification            33,000          31,100
The memory requirements of the HWA routines are significantly smaller than the CGA
routines; they differ by nearly a factor of three. The routine memory requirements are slightly
misleading due to inline functions: there are several copies of identical routines in the compilation
that includes the CGA routines, whereas there are no inline HWA routines, so their memory
requirements are smaller. The value in brackets is the size of the compiled CGA routines,
excluding inline functions; this value is more accurate for comparison purposes. Even then, the
memory requirements of the CGA routines are still much larger than those of the HWA routines,
differing by almost a factor of two. By focusing on the CGA and HWA routines, it is shown that the memory requirements
can be significantly reduced by writing routines in assembly. As presented in §5.4.1, the performance
of the routines is also far superior.
Table 5-8 also presents the memory requirements of the signature generation and verification
processes. All of the finite field, large integer and elliptic curve operations are required by both
signature processes, so there is a significant amount of overlap: most of the code is shared between
the two processes rather than being exclusive to either one. As a result, the memory requirements of
the signature generation and verification processes are almost identical, and each is approximately
equal to the requirements of the entire ECDSA.
The table does not show the memory requirements of implementing the SHA-1 hash function.
The hash function accounts for approximately 13,000 bytes of each of the memory requirements in the
table (excluding data and routines). This is a significant amount of the total memory required. An
attempt to improve the implementation of the hash function was not made. Instead, the original
implementation was used. It may be possible to implement the hash function more efficiently,
reducing the memory footprint.
6 SC140 and Compiler Analysis for Cryptographic
Applications
The following sections analyze the target processor and compiler, and state compiler optimization
improvements. An analysis of the SC140 for cryptographic applications and guidelines for writing
efficient C and assembly source code are presented. In addition, compiler improvement
recommendations are made based on the comparison of CGA and HWA routines. Finally, compiler
anomalies encountered are included.
6.1 Analysis of the SC140 for Elliptic Curve Cryptographic
Applications
There have been few documented cryptographic implementations on DSPs, which leads to the
question of whether DSPs are suitable for cryptographic applications. One of the purposes of the thesis
is to show that a DSP, and more specifically the SC140, is a viable target for cryptographic
applications. The performance of the ECDSA implementation was shown previously to be adequate.
DSPs are currently present in several computing environments. Therefore, if they are suitable
for cryptographic applications, security measures can easily be added to the environments, without
upgrading hardware with specialized cryptographic processors. The following sections bring forth both
positive and negative aspects of implementing cryptographic applications on the SC140. Several of the
aspects are relevant to all DSPs. For simplicity, instructions such as ADD and ADDA, which result in
the same operation executed by the DALU or ALU respectively, are written as ADD(A) throughout the
section.
6.1.1 SC140 Cryptographic Pros
The following sections describe the notable properties of the SC140 that have a positive effect on the
execution of cryptographic applications.
1. Variable Length Execution Set
A VLES is a set of instructions that is executed in a single clock cycle. Up to six instructions can
be grouped into one VLES on the SC140, leading to substantial parallelism. Since polynomials
involved in ECC span several SC140 registers, a large amount of parallelism is possible; most
finite field algorithms allow several instructions to be grouped so that they execute in a single
clock cycle, reducing execution times significantly.
VLESs increase code density and reduce power consumption. Code density is significantly
increased because instructions for unused processing units do not have to be defined. A VLIW,
a fixed-length set of instructions that defines an instruction for each processing unit per clock
cycle, requires more memory storage than a VLES. Processors that require VLIWs have
corresponding applications with larger memory footprints, which is problematic for portable
devices. Generally, several of the instructions within a VLIW are NOPs, which are a waste of
memory space. Alternatively, NOP instructions are assumed unless otherwise stated within a
VLES. By eliminating the NOPs, and slightly limiting the combination of instructions that are
allowed within a single VLES, the code size of the application is greatly reduced.
In addition, less code directly leads to less power consumption, which is also beneficial to
portable devices. First, less power is consumed because less memory is required to define each
clock cycle’s execution set. Less bandwidth is required to transfer the VLES, thereby reducing
power consumption. Secondly, the total amount of memory required to store an application using
VLESs is reduced. Therefore, the amount of flash memory, or ROM, a device requires can be
reduced. Smaller flash or ROM sizes require less operating power, thus decreasing the total
amount of power consumed by the system.
2. Loop Control Instructions (BREAK, DOENSHn, …)
The instructions BREAK, DOENSHn, DOENn, DOSETUPn and SKIPLS are hardware loop
control instructions. Hardware loops are the most efficient method of repeatedly executing
instructions on the SC140. As long as the StarCore programming rules are followed, there is no
penalty for returning to the first VLES of a hardware loop, resulting in minimal overhead.
The instructions DOENSHn and DOENn enable short and long loops respectively. The
DOSETUPn instruction sets the starting address of long loops, and is not required with short loops.
The BREAK and SKIPLS instructions are used to exit a hardware loop, and to exit a hardware loop
if the loop counter is less than or equal to zero respectively.
Branch instructions are expensive: several cycles, four on the StarCore, are wasted each
time a branch is executed. The cost of using branches therefore increases execution time
significantly, especially considering that several of the implemented hardware loops consist of only one
or two VLESs. The overhead introduced by the branch instruction is larger than the cost of
executing the desired instructions.
Repetitive code is an inefficient waste of memory that can be avoided by looping
instructions. An efficient means of looping that does not hamper execution efficiency, and aids in
reducing code size, is extremely beneficial for portable devices.
The use of hardware loops saves memory space with minimal cost, because there is no cost
associated with returning to the start of the loop. In the case of finite field arithmetic, several
registers are required to store one polynomial, so the identical operation(s) is performed on several
sets of data. By employing hardware loops, the operations are performed efficiently with address
pointers. Without hardware loops, branch instructions or repetitive code is required, resulting in
inefficient execution or large applications.
3. Conditional Instructions (BC, IFC, JC)
The conditional instructions of the SC140 allow for greater parallelism and instruction hiding.
Conditional instructions are all based on the true bit of the status register. The set of conditional
instructions includes branch (BF, BT), jump (JF, JT), and the more general if (IFA, IFF, IFT).
The instructions IFA, IFF and IFT allow a VLES to be divided into two subsets of
instructions. The instructions that are actually executed depend on the true bit in the status register.
A single VLES is allowed to contain a maximum of two of the three conditional if instructions.
The instructions grouped with an IFT or IFF are only executed if the true bit is true or false
respectively, while the instructions grouped with an IFA are always executed.
The if instructions allow for greater parallelism. They are extremely useful with respect to
ECC because there are several instances when either one set of instructions or another is
performed. There are also cases when one set of instructions is always performed, and another
may be performed. In these instances, each set of instructions can be grouped with a conditional
instruction, thereby computing them in parallel and reducing execution time.
Furthermore, conditional instructions along with the parallel processing capabilities of the
SC140 make it possible to hide the actual instructions executed, and fix the execution times of
either condition. For example, the conditional branch instructions can be executed in parallel, thus
hiding the branch taken. Identical sets of instructions can be grouped with the IFF and IFT
instructions, only exchanging the targets of the instructions grouped with IFT. Thus, it is
extremely difficult, if not impossible, to determine the set of instructions executed by the processor. The
strength of the conditional instructions is further explored in §7.4.1.
4. Polynomial Degree Calculation Instructions (CLB, TSTEQ)
The CLB instruction returns a value corresponding to the number of equal most significant bits in a
register. The TSTEQ instruction is used to compare a register with zero. These two instructions
aid in the efficient calculation of the degree of a polynomial.
The CLB and TSTEQ instructions are very useful when computing the inverse of a finite
field element or large integer. Inversion of an element is the most expensive finite field operation,
and becomes even more expensive if an efficient method of calculating the degree of a polynomial
is not employed.
When calculating the degree of a polynomial, the TSTEQ is used to check if a register,
which contains a subset of a polynomial’s bits, is zero. Once a nonzero register is found, the CLB
instruction is used to calculate the number of leading zero bits the register contains. The
exploitation of the two instructions, along with control logic and other computations, allows the
efficient determination of the degree of a polynomial. The efficient method of calculating the
degree of a polynomial reduces the cost of the finite field inversion operation.
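The algorithm can be sketched in C as follows; clz32() is a hypothetical stand-in for the CLB-based
leading-zero count, and the words are assumed to be stored least significant first:

    extern int clz32(unsigned long w);      /* stand-in for CLB */

    int degree(const unsigned long *poly, int nwords)
    {
        int i;
        for (i = nwords - 1; i >= 0; i--)   /* TSTEQ: skip zero words      */
            if (poly[i] != 0)               /* CLB on first nonzero word   */
                return 32 * i + 31 - clz32(poly[i]);
        return -1;                          /* null polynomial             */
    }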
5. Intrinsic Functions
There are several functions, called intrinsic functions, defined at the C language level specific to
the SC140. They allow programmers to employ individual assembly instructions from a C
function. The functions specify assembly instructions from a higher level of abstraction, allowing
the designer to specify individual assembly instructions, resulting in more efficient execution
sequences.
Although intrinsic functions were not used in the implementation process, they promise to
result in applications that are more efficient. They allow programmers to specify a single assembly
instruction, simplifying the compiling process and allowing the programmer to use processor
functionality that may not be available at the C level.
6. Dual Task, Single Cycle Instructions
ADDL1A, ADDL2A, DECEQ(A) and post-address updating instructions perform two tasks in a
single cycle. They combine two instructions that are commonly found consecutively in assembly
code, and execute them simultaneously.
The instructions ADDL1A and ADDL2A add a shifted version (by 1 and 2 bits
respectively) of an address register to another address register. These instructions are very useful when
an algorithm must start at the least significant bits of a polynomial. In this case, the address of
most significant bits, and the number of registers the polynomial spans are known. By using the
ADDL2A instruction, the address of the least significant bits can be calculated, from the address of
the most significant bits and the number of registers the polynomial spans, in a single clock cycle.
The DECEQ(A) instruction decreases a register by one, compares the result to zero, and
sets the true bit in the status register accordingly. The instruction is useful because it is very
common to decrease a register by one and then check if it is zero. For example, the last iteration of
a loop is often different from previous iterations. Before the loop is entered, the DECEQ(A)
instruction is used to control the entrance of the loop. If the number of loop iterations is one, the
loop is never entered, and a branch is taken to the last iteration of the loop.
Post-address updating is a powerful feature of the SC140, which was introduced in §2.4.
An address register can be adjusted by a value after it is used in an instruction without any clock
cycle penalty. This is extremely useful within hardware loops when accessing a polynomial stored
in consecutive memory addresses.
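As a hedged illustration (names hypothetical), the single-cycle computation that ADDL2A performs
corresponds to scaling a word count into a byte offset and adding it to a base address:

    /* ADDL2A computes r_dst += (r_src << 2) in one clock cycle */
    unsigned char *span_end(unsigned char *msb_addr, long nwords)
    {
        return msb_addr + (nwords << 2);    /* words scaled to bytes */
    }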
7. Shifting Instructions (ASL, ASLL, …)
The instructions ASL, ASLL, ASR, ASRR, LSL, LSLL, LSR and LSRR are for single or multiple,
arithmetic or logical shifts of registers to the right or left. These instructions are useful when
implementing ECC because in many cases polynomials and large integers are shifted to the left or
right by a single or multiple bits. Polynomials are shifted by a multiple number of bits throughout
the finite field inversion operation, and whenever an operation's algorithm employs a windowing
method.
These instructions are also useful when calculating the TNAF or TNAFw representation of
a polynomial. When this representation is computed, large integers are divided by powers of two,
i.e. 2^i, which is the same as shifting the large integer to the right by i bits. The instructions that
achieve multiple-bit shifts allow division by powers of two to be performed efficiently.
The instructions that perform arithmetic and logical shifts by multiple bits are important to
ECC applications. Without them, registers must be shifted by a single bit multiple times, which is
extremely inefficient and significantly increases the computational costs associated with
performing finite field and elliptic curve operations. The benefits of employing windowing
techniques would be significantly reduced, if not eliminated, without multiple-bit shifting
instructions. A sketch of such a multi-word shift follows.
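A hedged C sketch of the kind of operation these instructions accelerate: dividing a large integer by
2^i as a multi-word right shift, assuming 32-bit words stored least significant first and 0 < i < 32:

    void div_pow2(unsigned long *x, int nwords, int i)
    {
        int k;
        for (k = 0; k < nwords - 1; k++)        /* pull the low bits of  */
            x[k] = (x[k] >> i)                  /* the next word into    */
                 | (x[k + 1] << (32 - i));      /* the vacated positions */
        x[nwords - 1] >>= i;
    }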
8. Logic Instructions (AND, EOR, …)
The instructions AND, EOR, NEG, NOT and OR perform and, exclusive-or, arithmetic negation,
logical negation and or respectively. The instructions listed are common to most processors.
ECC is based on binary finite field arithmetic. This arithmetic is slightly different from
integer arithmetic. It requires logical instructions. The instructions AND, EOR, NEG, NOT and
OR are all very useful when performing finite field arithmetic, or clearing and masking consecutive
bits of a finite field element. The clearing and masking of consecutive bits of a finite field element
is a common task when advanced algorithms are used that employ windowing techniques.
The NEG instruction results in zero minus the operand, which is very similar to logically
negating all the bits of a register, which the NOT instruction performs. The instruction is useful
when negating data that spans several registers. The negation of a register and a carry from a less
significant register can be performed in a single cycle using the NEG instruction.
9. Arithmetic Instructions (ADC, ADD, …)
The instructions ADC, ADD, SBC and SUB perform register addition and subtraction, excluding
or including the carry bit of the status register. They are common instructions that most processors
are capable of performing.
ECC requires large integer arithmetic. The SC140 provides instructions such as ADD(A),
SUB(A), ADC and SBC that allow for efficient large integer arithmetic. Large integers refer to
integers that are greater than 32-bits, and therefore span more than one register in the processor.
The instructions ADC and SBC add and subtract the operands, including the carry bit in the
operation. By including the carry bit, at least one additional instruction is avoided when values that
span several registers are involved.
6.1.2 SC140 Cryptographic Cons
The following sections describe the aspects of the StarCore SC140 DSP that have a negative effect on
the execution of cryptographic applications, and some of the properties that a specialized cryptographic
processor would be designed to possess to aid in the efficient execution of cryptographic applications.
1. Register Size
The SC140 registers are 40 bits wide, but MOVE instructions only access 32 bits of memory at a
time. Since the effective size of the registers is only 32 bits, polynomials of GF(2^163) span six
registers, making a simple operation performed on a polynomial require six moves from memory,
six executions of the same or similar instructions to modify the polynomial, and six moves back to
memory to store the result.
As larger polynomials are required to maintain security levels, more registers and
instructions are required to store and perform operations. Specialized cryptographic processors are
designed to have register sizes that match their application. For example, processors designed to
implement ECC have registers large enough to store a finite field element. This may not be the
optimal design because the processor becomes obsolete when the finite field size is increased to
maintain security levels. Furthermore, different applications require different encryption strengths,
so it may be beneficial to design a processor that can implement several different encryption
strengths, i.e. a processor that can handle different sizes of polynomials.
As the computational power of devices increases, stronger encryption is
required. To maintain security levels, the finite field size used in the encryption process is
increased. When the size of the polynomial increases, processors designed for a fixed size
polynomial become obsolete, whereas processors that can implement different sizes of polynomials
are not. They only require new software to implement the stronger encryption. New software is
both less costly and easier to deploy than new hardware.
The suggestion is to design specialized cryptographic processors that are able to handle
different sizes of polynomials so that they avoid becoming obsolete and can provide several
different levels of security. The processors will be slightly less efficient than ones designed for a
specific polynomial size, but will not become obsolete and require replacement as the polynomial
size used in cryptographic applications increases.
Processors that are designed to have registers of width 64 or 96 bits are presently beneficial for
reducing register and instruction counts. With larger registers, fewer move and modify
instructions are required to perform operations, leading to reduced execution times.
Furthermore, the processor does not become obsolete when larger finite field sizes are required to
increase security levels, and it is easily able to provide different levels of security that depend on
the data involved.
2. Memory Bandwidth Limitation (memory moves per cycle)
The maximum amount of data that can be moved between memory and registers is one of the main
performance limiting factors of ECC on the SC140. The maximum throughput of the SC140 is
128-bits per cycle, with some limitations. Only two memory moves are allowed per clock cycle.
To achieve maximum throughput, two moves of 64-bits each must be performed. For a 64-bit
move, the origin and target of the move must be two consecutive 32-bit data locations. The origin
and target must be consecutive memory addresses and registers. In general, the maximum
achieved throughput of the processor is actually 64-bits.
Since the current standard size of a finite field element in ECC is 163 bits, each
polynomial requires six 32-bit memory locations, or six registers, to be stored. This means that the
processor requires at least three clock cycles to transfer a polynomial from memory to registers,
and vice versa. To perform an operation, more clock cycles are required to move data, because
generally, an operation involves more than one polynomial. It would be highly beneficial if the
throughput of the SC140 were larger. With a larger throughput, polynomials could be transferred
between memory and registers much faster, reducing execution times.
The memory bandwidth limitation is significant only because other throughputs of the DSP
are larger. The DALU and AGU of the SC140 are able to perform four and two instructions per
clock cycle respectively. They can perform instructions much faster than data can be moved from
memory. Due to this fact, the SC140’s limitation is the memory bandwidth, and not DALU and
AGU throughput.
3. Specifying Unique Processor Functionality (ex. CLB)
Improved techniques to allow the use of specialized processor functionality from high-level
languages would be beneficial for cryptographic applications. For example, there should be a
means of forcing specialized processor functionality at the C language level. Intrinsic functions
allow the programmer to specify instructions such as ADD, ADC, etc., but not unique processor
functionality.
A means of specifying unique processor functionality is much more important than more
general instructions such as ADD, ADC, etc. The ADD instruction is a general function that most,
if not all, processors are capable of performing. It also has a counterpart in high-level languages,
so the mapping from high-level language to assembly is quite simple. The mapping should be
easily recognized by the compiler because of its simplicity and frequency of use.
However, the CLB instruction is a highly specialized instruction that few processors are
capable of performing. Furthermore, there is no operation in most high-level languages similar to
CLB that a high-level to assembly compiler mapping can be derived from. As a result, there
should be a way of specifying the CLB instruction, and other similar specialized instructions
common to the SC140, from the high-level C language.
6.2 Compiler Optimization Improvements
In §5.4, the performance of the HWA and CGA routines is compared, and some compiler anomalies
are presented. During the comparison process, a set of compiler optimization rules was developed.
The rules attempt to improve the compiler-generated assembly of the CGA routines so that they
resemble the superior performing HWA routines. The rules are suggestions to improve the generated
assembly. Some of the more obvious rules stated in this section may already be implemented by the
compiler. In some cases, applying the suggested rules may violate assembly instruction restrictions
defined in [46]. The rules should not be applied when they lead to invalid assembly source code.
The set of rules define methods to improve the compiler optimizations. Several rules describe
instructions or sets of instructions that were found in the assembly of the CGA routines, and can easily
be improved upon. Other rules are broader; when they are followed, they result in a
direct or indirect performance enhancement. Performance enhancements include a decreased execution
time via increased parallelism, or a reduction in power consumption and more efficient resource
management that may lead to decreased execution time.
Examples of the rules are provided when appropriate. They are taken from the CGA routines,
and the operands involved are generalized. In the examples, assembly instructions presented on a
single line are part of one VLES. Several symbols are used to generalize the assembly code.
Descriptions of each symbol are provided in Table 6-1.
Table 6-1. Assembly Symbolic Description
Symbol    Description
Dn#       Data registers (d0-d15)
Dx#       Data registers (d0-d7)
Rn#       Address registers (r0-r15)
Rx#       Address registers (r0-r7)
SP        Stack pointer
DR#       Data or address registers (d0-d15, r0-r15)
VLES#     Variable Length Execution Set
XXX(A)    DALU or equivalent AGU instruction
x         Integer value less than 32
y         Integer value less than 2^32
In the compiler optimization improvements, the DALU and AGU instructions and registers are
considered different domains. For example, a value in an address register is in a different domain than
a value in a data register.
1. If the instruction space permits, moves of certain fixed values can be avoided, as shown in the example below. The modification saves a MOVE, which is expensive with respect to power and quite often limits the amount of parallelism, because only two MOVE instructions are allowed per clock cycle. The modification should be made as long as it does not affect the execution time. The replacement instructions accomplish the same result while reducing memory bandwidth: they only require DALU bandwidth, which is less costly in terms of bandwidth and power consumption. When a value less than thirty-two is required, the ASLL instruction is not needed, and when a value of zero is required, neither the ADD nor the ASLL instruction is needed.
Before:
VLES1
VLES2
VLES3
VLES4
MOVE.L #<1334, Dn1

After:
VLES1 CLR Dn1
VLES2 ADD #<21, Dn1
VLES3 ASLL #<6, Dn1
VLES4
2. Do not allocate storage space on the stack in a function if the storage space is never used; allocate only the required stack space. Changing the stack space allocated in a function may require the
modification of instructions that write to and read from the stack, and of instructions that read arguments or argument addresses from the stack. The instructions possibly affected are MOVE.L, ADDA and SUBA, as well as any other instruction involving the stack pointer. The two forms affected are shown below.
ADDA #<x, SP
…
SUBA #x, SP
…

or

MOVE.L #y, Rn1
NOP {AGU Stall}
ADDA Rn1, SP
…
MOVE.L #y, Rn1
NOP {AGU Stall}
SUBA Rn1, SP
…
3. Priorities must be assigned in register allocation, so that the optimum register is selected when one is required. Register priorities should follow the rules below, applied in descending order.
3.1. Within a parent function, the registers {d0, d1, r0, r1} should be given the least priority. For example, use {d2-d15, r2-r15} before {d0, d1, r0, r1}, because the latter set of registers is often used to pass arguments to child functions. By using other registers, greater parallelism may be achieved by reducing the setup time required before function calls.
3.2. Within any function, the registers {d6, d7, r6, r7} should be given the least priority in an
attempt to avoid push and pop instructions at the start and end of the function. For example,
use {d0-d5, d8-d15, r0-r5, r8-r15} before the use of {d6, d7, r6, r7}.
3.3. A higher priority must be given to the registers {d0-d7, r0-r7} than to the upper registers {d8-d15, r8-r15}, because use of the upper registers increases instruction sizes, and some instructions cannot involve upper registers.
3.4. The highest priority should be given to the register that already contains the data.
4. Reuse fixed values that are written to registers. This rule is subject to a reaching definition test and
can reduce the number of registers required within a function.
Before:
MOVE.W #<x, DR1
DR1 used as operand
MOVE.W #<x, DR2
DR2 used as operand

After:
MOVE.W #<x, DR1
DR1 used as operand
DR2 instances changed to DR1

DR1 and DR2 must be in the same domain.
5. There should be no fixed preference between AGU and DALU instructions that perform identical tasks (TSTEQ↔TSTEQA, INC↔INCA, SUB↔SUBA, etc.); several tasks can be performed by both the AGU and DALU. Instead, precedence should be given to the domain that contains the operands. If the operands are originally located in both domains, precedence should be given to the domain the result must be in. If the result can be in either domain, precedence should be given to the DALU because it has a larger throughput. The modification can reduce AGU stalls, which occur because a one-cycle delay is required between when a value is moved to the AGU register file and when it is used as a memory address. This rule can also help avoid MOVE.L instructions from data to address registers or vice versa, thus avoiding the creation of extra copies of variables.
Before:
MOVE.L Dn1, Rn1 CLR Dn2
MAX Dn2, Dn1 TSTEQA Rn1
BT <L4

After:
CLR Dn2
MAX Dn2, Dn1 TSTEQ Dn1
BT <L4
6. A conditional branch instruction can be slightly modified and then combined with the subsequent VLES. Depending on the value of the true bit at execution time, the modification may reduce the cycle count.
Before:
BT <L4
DOENSH3 Dn1

After:
IFT BRA <L4 IFF DOENSH3 Dn1
7. Avoid moves when possible. First, minimize MOVE instructions involving memory, then minimize MOVE instructions involving only registers. Copy propagation should be used to accomplish the improvement: use existing copies of data instead of creating redundant ones. As in the example below, to increase the effectiveness of copy propagation analysis, it can be assumed that MOVE instructions addressed with Rn# do not overwrite local variables on the stack, unless the SP is moved into Rn# at some point.
Before:
MOVE.L (SP-20), DR1
… {DR1 not modified}
MOVE.L YYY, Rn1
… {DR1 not modified}
MOVE.L (SP-20), DR2

After:
MOVE.L (SP-20), DR1
… {DR1 not modified}
MOVE.L YYY, Rn1
… {DR1 not modified}
MOVE.L DR1, DR2

YYY – any register or value. DR1 and DR2 are in opposite domains; when DR1 and DR2 are in the same domain, MOVE.L DR1, DR2 becomes TFR(A) DR1, DR2.
8. In some cases, two instructions can be combined into a single, more efficient instruction. The modification should be made as long as negative side effects are avoided. An example involving ADDL2A is shown below; similar optimizations involving, but not limited to, ASL1A and DECEQ(A) are also possible.
Before:
ASL2A Rx1
ADDA Rx1, Rx2

After:
ADDL2A Rx1, Rx2
9. Use fixed values in instructions, avoiding the use of temporary registers whenever possible. This
optimization reduces computational costs, register usage, power consumption and memory
bandwidth.
Before:
MOVE.L #<x, Dn1
ADD Dn1, Dn2, Dn2

After:
ADD #<x, Dn2
10. When a value is used throughout a function, a register should be allocated to store the value for the entire function, eliminating multiple copies of the value and inefficient use of registers. Furthermore, re-allocating registers once the contained data is no longer used in the function allows for greater parallelism.
11. Instructions that belong to the CP should be given the highest priority. The set of instructions that can be included in a VLES is limited; when more instructions are eligible for a VLES than the rules allow, instructions belonging to the CP should be included first.
12. Rearranging move and transfer statements can reduce execution times. By changing the target of a
MOVE instruction, greater parallelism is achieved in some cases where more than one copy of the
same data is required.
Before:
MOVE.L (Rn1)+, Dn1
TFR Dn1, Dn2
ZXT.L Dn2

After:
MOVE.L (Rn1)+, Dn1
ZXT.L Dn1
TFR Dn1, Dn2
13. IFC statements can be combined when simple if statements are implemented. The following optimization works for simple if and else clauses. The first example is generalized for boolean expressions xx1, xx2, xx3, …, and if and else clauses of yy and zz respectively, where the execution of yy is not cumulative. If the execution of yy is cumulative and zz is not, instances of yy and zz and instances of IFT and IFF can be exchanged for proper execution. If the executions of yy and zz are both cumulative, the second example applies, where yy is an addition.
Non-cumulative yy:

if (xx1|xx2|xx3…) yy
else zz

TSTEQ xx1
IFT yy IFF TSTEQ xx2
IFT yy IFF TSTEQ xx3
…
IFT yy IFF zz
Cumulative yy and zz:

if (xx1|xx2|xx3…) yy
else zz

TSTEQ xx1
IFT ADD Dn1, Dn2, Dn2 CLR Dn1 IFF TSTEQ xx2
IFT ADD Dn1, Dn2, Dn2 CLR Dn1 IFF TSTEQ xx3
…
IFT ADD Dn1, Dn2, Dn2 MOVE.L #<x, Dn1 IFF zz
14. Upper registers can be used within functions to store variables instead of the stack. This reduces the number of MOVE instructions executed; MOVE instructions become transfers, which are less expensive with respect to power consumption and memory bandwidth. Greater parallelism can be achieved with transfers than with MOVE instructions, because only two MOVE instructions are allowed per VLES. Furthermore, by employing the improvement, it may be possible to avoid allocating space on the stack for local variables. The improvement can be used when arguments
are read from the stack several times within a single function. It is beneficial to store the
argument in an upper register, and then use transfers from the upper register, instead of several
moves from the stack. In many cases, when moves are changed to transfers, later instructions can
be simplified, eliminating the need for a transfer by directly referencing the origin of the transfer.
15. Allow instructions originally situated after a loop to be placed before the loop, as long as they are independent of the loop. This optimization may already be implemented. It can increase the amount of parallelism within the generated code, and therefore reduce computational costs.
Before:
L4
BT <L4
DOENSH3 Dn1
NOP
NOP
loopstart3
MOVE.L Dn2, (Rn1)+
loopend3
SUBA #x, SP
POP D6
POP D7

After:
L4
BT <L4
DOENSH3 Dn1
NOP
SUBA #x, SP
loopstart3
MOVE.L Dn2, (Rn1)+
loopend3
POP D6
POP D7
16. Removing repetitive instructions that are executed in the same or both domains results in superior assembly code. The improvement is achieved by postponing MOVE and TFR(A) instructions so they are located after other instructions involving the same operand.
Before:
MOVE.L Dn1, Rn1
SUB x, Dn1
SUBA x, Rn1

After:
SUB x, Dn1
MOVE.L Dn1, Rn1
17. There are several hardware loop restrictions defined in the SC140 Reference Manual; rules L.D.2, L.D.3 and L.D.9 specify hardware loop restrictions [46]. Depending on the implementation, a minimum number of cycles is required between certain hardware loop instructions, and to comply with the rules, NOPs are commonly added by the compiler. The addition of NOPs because of the listed rules can be avoided or reduced: decrease the number of iterations of the hardware loop by one, unroll the loop by replacing the NOPs with the first VLES(s) from the loop, and add the remaining VLES(s) after the loop. The modification assumes that it is possible to reduce the loop count without delaying the start of the loop, or that there
are at least two NOPs involved, so that the modification reduces the number of clock cycles in the sequence. The re-ordering of the loop must not violate any other rule.
Before:
VLES0
DOEN Dn1
NOP
NOP
loopstart3
VLES1
VLES2
VLES3
loopend3
VLES4

After:
VLES0 SUB #<1, Dn1
DOEN Dn1
VLES1
VLES2
loopstart3
VLES3
VLES1
VLES2
loopend3
VLES3
VLES4
18. By improving the allocation of registers that are used temporarily within a function, greater parallelism can be achieved: a different register can be used temporarily so that sets of instructions can be parallelized. Blocks in the examples below represent sets of instructions. In the first example, at the start of each block, the register(s) associated with the block is used to store a value totally unrelated to its previous contents. In the second example, the value in register DR1, which is related in Block2a and Block2b, is unrelated to the value in Block1, and register DR2 is only used in Block2b. In both examples, by using DR2 instead of DR1 in Block1, it is possible to parallelize Block1 and Block2.
First example:

Before:
Block1 - DR1
Block2 - DR1
Block3 - DR1 and DR2

After:
Block1 - DR2
Block2 - DR1
Block3 - DR1 and DR2

Second example:

Before:
Block1 - DR1
Block2a - DR1
Block2b - DR1 (not redefined) and DR2

After:
Block1 - DR2
Block2a - DR1
Block2b - DR1 (not redefined) and DR2
DR1 and DR2 must be in the same domain
19. In some cases, obvious simplifications are possible because of repetitive instructions, copy
propagation or rewriting a target without using the original value. The improvement can also apply
to instances of TFR(A), AND, OR, MAX and several other instructions. The first example is a
case of repetitive instructions, where TSTEQ(A) does not accomplish anything, because the true bit is already set by the previous instruction. The second example is a case where the values of Rn1 and Rn2 are found to be equal using copy propagation; therefore, the second MOVE instruction does not accomplish anything.
First example:

Before:
DECEQ(A) DR1
TSTEQ(A) DR1

After:
DECEQ(A) DR1

DR1 and DR2 must be in the same domain.

Second example:

Before:
MOVE.L Rn1, Dn1
MOVE.L Rn2, Dn1

After:
MOVE.L Rn1, Dn1
20. Assembly that computes a value and then compares the result with another value can be improved; the simplification eliminates a clock cycle. Furthermore, the instruction producing the calculated value should be omitted if the result is never used after the comparison.
Before:
SUB(A) DR1, DR2, DR3
TSTEQ(A) DR3

After:
SUB(A) DR1, DR2, DR3 CMPEQ(A) DR1, DR2

DR1 and DR2 must be in the same domain.
21. By analyzing branch targets, some branches can be avoided. It is possible to avoid branching to the last few VLESs of a function by duplicating the last few instructions of the function. The improvement eliminates a branch or conditional branch instruction, which requires four clock cycles to execute. The IFT and IFF instructions in the following example can be exchanged if appropriate.
Before:
IFT BRA L_exit IFF VLES1
VLES2
VLES3
…
L_exit
VLES4
…
RTS

After:
IFT VLES4 IFF VLES1
IFT … IFF VLES2
IFT RTS IFF VLES3
…
L_exit
VLES4
…
RTS
6.3 Compiler Anomalies
The following two sections detail compiler anomalies encountered during the implementation process. There were several unexplained occurrences during implementation: problems with software and hardware simulations were encountered, as well as problems with the compiled code. The first anomaly was discovered when studying the CGA and HWA routines. The second anomaly focuses on a compiler problem where the compiled assembly is incorrect when the highest level of compiler optimization is selected.
6.3.1 Compiler Anomaly A
The compiler produces unexpected assembly when compiling the CGA routines. The generated
assembly code is not a correct implementation of the C code, and appears to be out of order. At the
start of the first hardware loop in each of the CGA routines that were analyzed, there is some
manipulation of the loop count. The C code and compiled assembly are presented as Example 6.1 and
Example 6.2.
Example 6.1. Original C code

ELEMENT i;
for(i = 0; i < count; i++, input++, output++) {

Example 6.2. Erroneous Generated Assembly

MOVE.L (sp-28), r2 MOVE.L (sp-28), d0
MOVE.W #<0, d4
MAX d0, d4 TSTEQA r2
BT <L2
DOENSH3 d4
In the code, the value in (sp-28), which is the loop count, is an unsigned integer. The hardware loop should be enabled to execute (sp-28) times. The generated assembly does not reflect the C code.
Consider the instance when the number of loop iterations is extremely large, requiring all 32 bits. When the 32-bit value is treated as signed by MAX, the maximum of d0 and d4 is d4, i.e. zero. By default, hardware loops must iterate at least once after they are enabled. Therefore, the hardware
loop is executed once. This is not desired, or described, by the C code. Fortunately, with respect to
this application, the loop count is always a small positive integer, so the described case never occurs.
Nevertheless, the generated assembly is incorrect.
When the loop count is assumed signed, which is an incorrect assumption from the C code, the
generated assembly is still incorrect. The hardware loop should be skipped when the loop count is less
than zero. In the assembly code, the hardware loop is executed once. The result of MAX d0, d4 should
determine if the branch to <L2 is taken, not the value in r2.
An alternative to testing the loop count before the hardware loop is enabled is to place a SKIPLS <L2 instruction after the hardware loop is enabled. The SKIPLS instruction eliminates the need for the MOVE.L (sp-28), r2, MOVE.W #<0, d4, TSTEQA r2 and BT <L2 instructions. The only problem with the SKIPLS instruction is that it has some restrictions that usually require the addition of NOPs.
Next, consider the instance when the number of loop iterations is a small positive number, or
zero. In this case, the generated assembly executes as desired. Either the branch is taken to skip the
hardware loop, or the hardware loop executes the desired number of times. For the scope of the thesis,
the number of loop iterations is always a small positive integer. Therefore, the generated assembly
leads to correct results. Nonetheless, the generated assembly does not correctly implement the C code
it is derived from.
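The consequence of the signed treatment can be modelled in plain C. The sketch below is illustrative only: it mimics the effect of the MAX d0, d4 clamp and the iterate-at-least-once property of enabled hardware loops, and shows how an extremely large unsigned count would collapse to a single iteration. Two's complement wrapping is assumed for the signed cast.

#include <stdio.h>
#include <stdint.h>

/* Model of the erroneous code generation: the unsigned loop count is
 * clamped as if it were signed (MAX against zero), the branch skips the
 * loop only when the original count is zero, and an enabled hardware
 * loop always iterates at least once. Illustrative model only. */
static uint32_t iterations_executed(uint32_t count)
{
    int32_t clamped = (int32_t)count;  /* MAX treats the value as signed;
                                          two's complement wrap assumed */
    if (clamped < 0)
        clamped = 0;                   /* MAX d0, d4 with d4 = 0 */
    if (count == 0)
        return 0;                      /* TSTEQA r2 / BT <L2 skips the loop */
    return (clamped == 0) ? 1u : (uint32_t)clamped;  /* DOENSH3: at least one pass */
}

int main(void)
{
    /* A count with the top bit set clamps to zero, so the loop body runs
     * once instead of 2,147,483,648 times. */
    printf("%lu\n", (unsigned long)iterations_executed(0x80000000UL)); /* 1 */
    printf("%lu\n", (unsigned long)iterations_executed(5));            /* 5 */
    printf("%lu\n", (unsigned long)iterations_executed(0));            /* 0 */
    return 0;
}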
The generated code can be greatly improved using the rules stated in §6.2. First, there is no reason to load the loop count into r2; each instance of r2 can be replaced with d0, which eliminates a move and reduces the number of registers required. The MOVE.W #<0, d4 instruction can be replaced by a CLR d4, and the resulting CLR d4 instruction can be combined with instructions in the previous VLES. Finally, the last two VLESs can be combined using IFC instructions. The following shows the assembly code improvements described.
Example 6.3. Improved Assembly Code
MOVE.L (sp-28), d0 CLR d4
MAX d0, d4 TSTEQ d0
IFT BRA <L2 IFF DOENSH3 d4
The suggested compiler improvements enhance the assembly so that it requires only three clock cycles instead of the original five. After the improvements are applied, the same results are achieved as previously. Furthermore, it is easier to understand the assembly, and to realize that
to correctly implement what is outlined by the C code, the TSTEQ d0 instruction must be changed to TSTEQ d4 and placed on a separate line after MAX d0, d4. Then, if the loop count is interpreted as negative, the hardware loop is skipped and not enabled. The result of the change is a more accurate compilation of the original C code.
6.3.2 Compiler Anomaly B
Two other instances were encountered where the compiler generated incorrect assembly; unlike the first anomaly, these resulted in negative side effects. Both instances occurred when compiler optimizations were specified. Without compiler optimizations, the generated assembly was correct. The following describes the format of source code that caused the anomaly.
In general, the compiler has difficulty optimizing code that results in several branch statements. Successive C language control statements, including if statements, can lead to problems. The problems are related to failures to update the register, stack and memory instances of variables. In the compiler-generated assembly, several copies of the same variable are made, but when compiling with optimizations, successive branch statements seem to cause erroneous or out-of-date copies of variables.
The first example of this compiler anomaly was never solved, and a method to work around the
problem, remaining at the C level, was never discovered. The problem occurred in the function
field_mult_wrapper. The large integer multiplication functions require positive integers. Therefore, a
wrapper function was implemented to convert all negative large integers, call the multiplication
function, and then convert the appropriate large integers back to their original values.
Initially, flags were employed to keep track of the large integers that require a sign conversion
after the multiplication is performed. For an unknown reason, the compiler generates assembly that
effectively does not convert one of the input large integers back to its original value. This was
concluded after several print statements were added to the source code. The statements output the
parameters at the start and end of the function. The incorrect assembly results in problems when that large integer is used in future calculations. The result of the first large integer multiplication operation is correct, but the incorrect sign of the one input integer propagates: future results of multiplication operations that involve the integer with the incorrect sign also have the incorrect sign.
Several methods of restructuring the wrapper function were investigated. A series of if statements with a separate case for each of the four possible combinations of large integer signs
was attempted. The series of if statements did not work; it also resulted in erroneous assembly that produces identical results.
Assembly, generated with and without the error causing optimization, was analyzed to
determine where the error is introduced. The process of finding the problem in the source code was
simplified because of the lack of complexity of the function involved and the thorough problem
definition.
As shown in Example 6.4, a register is not set to the proper value before a subroutine call. The erroneous assembly is missing a MOVEU.L #<6, d1, which is present in Example 6.5, immediately before the neg_hwa routine is first called. The instruction is very important because d1 is the register in which the count argument is passed to neg_hwa. The count argument defines the size of the structure, and therefore the number of loop iterations required to complete the negate operation. During the optimization of the generated assembly, the compiler must have incorrectly determined that the move instruction is not required.
Example 6.4. neg_hwa Assembly (Erroneous)

…
JSRD _intmult
TFRA r7, r1
MOVEU.L #<6,d3
MOVE.L d3, (sp-8) MOVE.L (sp-100), r1
MOVEU.L #<12, d1
JSRD _convert_to_smaller_hwa
MOVE.L d1, (sp-4) ADDA #>-64, sp, r0
JSRD _neg_hwa
MOVEU.L #<6,d1 TFRA r6, r0
JSRD _neg_hwa
…

Example 6.5. neg_hwa Assembly (Errorless)

…
JSRD _intmult
MOVE.L r2, (sp-4) MOVE.L (sp-72), r0
MOVEU.L #<6,d0
MOVE.L d0, (sp-8) MOVE.L (sp-100), r1
MOVEU.L #<12,d2
JSRD _convert_to_smaller_hwa
MOVE.L d2, (sp-4) ADDA #>-64, sp, r0
MOVEU.L #<6,d1
JSRD _neg_hwa
MOVE.L (sp-72), r0
MOVEU.L #<6,d1
JSRD _neg_hwa
…
Due to the missing instruction, neg_hwa uses the incorrect value of twelve loop iterations; the negating function assumes the size of the input structure is twelve elements instead of the correct value of six. The neg_hwa routine manipulates the set of memory addresses specified by its parameters. That memory region contains both input large integers of the large integer multiplication function, because they are located consecutively. The result of the single neg_hwa execution is almost
equivalent to changing the sign of both integers. The negate function is called again in a later
instruction, passing the correct arguments to the routine. This changes the sign of one of the large
integers for a second time.
The compiler generates incorrect assembly for the large integer multiplication wrapper
function. By inserting the instruction MOVEU.L #<6,d1 before the first neg_hwa call, the assembly
code is modified to work correctly.
The second example of the anomaly is a combination of inefficient and poor coding that leads to erroneous assembly. The source code for which the compiler produces erroneous assembly is semantically correct, but redundant and inefficient. The error-causing and corrected C code are presented as Example 6.6 and Example 6.7.
Example 6.6. get_TNAFw_rep C Code (Error Causing)

…
*t1_ptr = (*r0_ptr + *r1_ptr * TNAF_TW) % TNAF_TWOW;
if (*t1_ptr > TNAF_TWOW/2) *t1_ptr -= TNAF_TWOW;
if (*t1_ptr & MSB)
{
    /* t1.e[NUMWORD] is negative */
    *t1_ptr = -*t1_ptr;
    curbit = (*t1_ptr>>1) | TNAF_SB;
…

Example 6.7. get_TNAFw_rep C Code (Not Error Causing)

…
*t1_ptr = (*r0_ptr + *r1_ptr * TNAF_TW) % TNAF_TWOW;
if (*t1_ptr > TNAF_SB)
{
    /* t1.e[NUMWORD] is negative */
    *t1_ptr = TNAF_TWOW - *t1_ptr;
    curbit = (*t1_ptr>>1) | TNAF_SB;
…
Two consecutive if statements are used to control the execution sequence in the error-causing example, but the if statements are effectively equivalent. Restructuring the C code leads to improved source code that the compiler correctly converts into assembly. The two versions of C code are shown above. Good coding practices avoid the problematic C code that is incorrectly compiled. Nonetheless, both versions of the code are semantically correct; therefore, in either case, the compiler should produce errorless assembly.
When analyzing the generated assembly, the root of the problem was more difficult to determine. The order of function calls within get_TNAFw_rep and the calculated coefficient values were required to trace the execution sequence and determine the exact problem. By comparing correct output sequences with the incorrect ones, the problem was found: all nonzero coefficients were being set to one. This meant the error in the assembly was in the calculation or the storing of the coefficient.
The erroneous and errorless assembly code that calculates the bit value and the branches taken were studied. It was found that when a specific sequence of branches is executed, an out-of-date instance
of a variable is used in an instruction. By updating the instance of the variable used in the instruction, the assembly code correctly performs the desired task.
The two examples stated detail compiler errors. In the first example, erroneous assembly is
generated by the compiler. Several variations of C code were written that implement the desired task,
but each resulted in incorrect assembly generated by the compiler. An incorrect assumption by the
compiler led to erroneous assembly code. In the second example, poor C coding techniques led to a
compiler error. Nonetheless, the C code is semantically correct. Therefore, it should not have resulted
in erroneous compiler generated assembly.
In both cases, the compiler-generated assembly is incorrect. The use of optimizations when
compiling caused the generation of erroneous assembly. Optimization techniques should not affect the
results obtained when the compiled code is executed. Compiler optimizations should only affect the
computational cost of generated code.
7 Side-Channel Attack Security Issues
SCAs are a set of cryptographic attacks that attempt to break cryptosystems by analyzing execution times, processor power consumption and other leaked information. Recently, much effort has been devoted to better understanding their capabilities and to developing strategies and algorithms to resist them. SCAs are generally used to determine private keys, but can also be used to determine other pertinent information that can be used to break a cryptosystem. The three types of SCAs focused on are the Timing Attack (TA), the Simple Power Attack (SPA) and Differential Power Analysis (DPA).
A TA is relevant to algorithms that operate in data-dependent time. The time required by certain algorithms depends on inputs such as a private key and/or message. A TA exploits such algorithms by collecting timing information and using it to determine the algorithm inputs. As described in [24] and [33], it is possible to determine pertinent information with TAs. §7.1 further describes TAs and states the vulnerabilities of the implementation to this attack. It also estimates the computational penalty suffered if the implementation were modified to resist the attack.
Power consumption based attacks include SPA and DPA. Instructions require different
amounts of power during execution. Power attacks generally determine execution sequences from the
power trace of a processor [10]. SPA exploits algorithms whose execution sequence depends on
pertinent information. The sequence of operations executed by a processor may be determined by
analyzing its power trace. SPA, a technique to foil the attack and a resistant algorithm are presented in
§7.2. DPA is a more powerful attack than SPA, and is based on a statistical analysis of power traces.
In particular, it uses a correlation between power traces and bits of the private key, or other pertinent
data, at specific points of an algorithm. DPA is presented in §7.3.
TA and SPA attempts can break the ECDSA. The attacks can be used to determine several bits
of the nonce. By observing the generation of signatures, and determining the corresponding nonces,
the private key can be determined, because the ECDLP is reduced to a variant of the hidden number
problem [49]. Any leakage of information about the nonces used in the signature process could prove
dramatic [49], and lead to insecurities. Furthermore, the nonce used in the signature generation process
must be generated by a cryptographically secure and unbiased pseudo-random number generator to
eliminate possible insecurities [49].
The ECDSA is not subject to DPA because a random or pseudorandom nonce, k, is involved in
the point-multiplication operation [10]. The private key is involved in a finite field addition, two
multiplications and a reduction, but because of the presence of the nonce, the ECDSA is naturally
resistant to the attack. However, ECC encryption is subject to DPA. §7.3 is directed towards ECC
encryption, and how the implemented finite field and elliptic curve operations must be modified to
become resistant to DPA attacks when used for ECC encryption.
Finally, a different view of DPA, and other techniques developed by the author that disrupt and possibly prevent SCA techniques, are presented in §7.4.
7.1 Timing Attacks
The basic theory behind TAs is that the execution time is dependent on input parameters [24].
Therefore, by recording the execution time required to perform operations, inputs to the operation can
be determined. By modifying operations so that they execute in fixed time, TA attempts are thwarted.
In general, it is very difficult to modify functions so that they operate in fixed time. Implementations
may be TA resistant on one processor, but not on another [24].
The point-multiplication operation must be made TA resistant even though a nonce is involved.
This is because even partially known nonces can lead to security risks. Any knowledge of information
related to the value of the nonces used in the ECDSA can lead to security issues [49]. Below is a brief
description of the modifications required to achieve TA resistance. The actual implementation details
are not included, only the computational penalty is focused on.
With respect to the implemented operations, two main functions are subject to TAs. Some less
expensive operations may be subject to TAs, but their cycle counts are negligible relative to overall
cycle counts, and therefore are not analyzed. The two operations, which must be modified to execute
in fixed time so they foil TA efforts, are finite field inversion and elliptic curve point-multiplication.
First, the implemented finite field inversion operation executes in data dependent time, and
must be modified to execute in fixed time. To avoid TA attempts, the inversion operation must be
modified to require the worst-case inversion time [24], independent of the input polynomial. Table 7-1
states the original and TA resistant case, along with the performance penalty incurred. The original
case in the table is actually the average cycle count of the inversion operation.
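As an aside, the padding idea can be sketched in C with a toy routine whose natural running time is data dependent. The example below is not the thesis inversion code; it simply pads a bit-normalization loop to a fixed iteration count, which is the same structural change the inversion operation requires. Real constant-time code must also ensure the dummy work has the same cost profile as the genuine work and is not removed by the compiler.

#include <stdio.h>
#include <stdint.h>

static volatile uint32_t scratch;    /* volatile so the dummy work survives optimization */

/* Toy fixed-time routine: always performs exactly 32 loop steps, doing
 * dummy shifts once the data-dependent work is finished. */
static uint32_t normalize_fixed_time(uint32_t x)
{
    int i, done = 0;
    for (i = 0; i < 32; i++) {
        if (!done && (x & 0x80000000UL))
            done = 1;                /* real work is complete */
        if (done)
            scratch <<= 1;           /* dummy step to pad the running time */
        else
            x <<= 1;                 /* real step */
    }
    return x;
}

int main(void)
{
    printf("%08lx\n", (unsigned long)normalize_fixed_time(0x00000001UL)); /* 80000000 */
    return 0;
}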
The other implemented operation that a TA can target is elliptic curve point-multiplication. Both the TNAF and TNAFw algorithms execute in data-dependent time. The estimated performance of the TA resistant point-multiplication operations is presented in Table 7-1. To fix the time required by both algorithms, an addition must be executed in every loop iteration. An example of a modified τ-adic algorithm, taken from [22], is listed as Algorithm 7-1. It is TA resistant when implemented
correctly. The implemented TNAFw point-multiplication can be made both TA and SPA resistant by
employing an SPA resistant algorithm presented in [22]. The estimated performance penalty incurred
by employing such an algorithm is also included in Table 7-1.
Table 7-1. Estimated TA Resistant Performance Penalties (cycle counts)

Description        Original Case   TA Resistant Case   Performance Penalty
poly_inv_eff       16,730          18,290              1,560 (9.3%)
TNAF_point_mul     1,670,000       4,499,000           2,829,000 (169.4%)
TNAFw_point_mul    1,193,000       4,729,000           3,536,000 (296.4%)
The estimated performance penalty and TA resistant computational cost of the TNAFw point-multiplication function are larger than those of the TNAF function because of pre-computations. The pre-computations do not have to be TA resistant, but because both TNAF and TNAFw representations require m bits, m point additions are required during the double and add portion of the algorithm. In other words, the decreased Hamming weight of the TNAFw representation of k is no longer beneficial: the Hamming weight does not affect the performance of the operation, only the length of the representation does. Due to the pre-computed LUT required by the TNAFw point-multiplication operation, its computational cost exceeds that of the TNAF point-multiplication.
The estimated signature generation performance penalty incurred due to modifying the
implementation to be TA resistant is identical to the values presented in Table 7-2 of §7.2.
Algorithm 7-1. TA Resistant TNAF Point-Multiplication (Q[0] = k_TNAF⋅P) [22]
Input: k = (k_(m-1), k_(m-2), …, k_1, k_0)_τ-adic, P
Output: Q[0] = (x, y)
1. Q[0] = ∞ (the point at infinity)
2. For i = (m-1) downto 0
2.1. Q[0] = τ⋅Q[0], i.e. x = x^2, y = y^2
2.2. Q[1] = Q[0] + P
2.3. Q[0] = Q[k_i]
3. Return (Q[0])
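The control flow of Algorithm 7-1 can be sketched in C using plain integers as a toy stand-in for elliptic curve points: τ is modelled as doubling and point addition as integer addition, so the routine computes k⋅P numerically. The digits are restricted to {0, 1} as in the algorithm statement; what carries over is that the τ step, the addition and the selection execute on every iteration, independent of the key bits.

#include <stdio.h>

/* Toy always-add multiplication mirroring Algorithm 7-1's structure. */
static unsigned long always_add_multiply(unsigned long k, unsigned long p, int bits)
{
    unsigned long q[2];
    int i;
    q[0] = 0;                        /* Q[0] = identity element */
    for (i = bits - 1; i >= 0; i--) {
        q[0] = q[0] * 2;             /* stand-in for Q[0] = tau(Q[0]) */
        q[1] = q[0] + p;             /* Q[1] = Q[0] + P, computed every pass */
        q[0] = q[(k >> i) & 1];      /* Q[0] = Q[k_i], a selection only */
    }
    return q[0];
}

int main(void)
{
    printf("%lu\n", always_add_multiply(23, 7, 8)); /* prints 161 = 23*7 */
    return 0;
}

Note that the data-dependent array index in the final step must itself be implemented without a branch; the masking technique of §7.4.1 is one way to do that.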
7.2 Simple Power Attacks
For an implementation to be SPA resistant, the instructions executed by the processor cannot depend on
the input parameters [10]. Any dependence between the input parameters and the executed instructions
is visible in the power trace of the processor and can be exploited.
A detailed analysis of the power trace may reveal the instructions executed by the processor. It
is assumed that the attacker can trace the power consumption and therefore can determine the executed
instructions. Furthermore, the attacker is able to determine the input parameter if a dependency is
present. The TNAF and TNAFw point-multiplication operations implemented are susceptible to SPAs.
Algorithm 7-1, which is presented in the previous section, is an example of an SPA resistant point-
multiplication algorithm. As previously mentioned, an SPA resistant point-multiplication algorithm
that can be used to modify the TNAFw point-multiplication operation is presented in [22]. The
performance loss due to modifying either point-multiplication operation implemented is identical to the
TA related estimations presented in Table 7-1.
The computational cost of the signature generation process increases severely when SPA resiliency is required. Computing a point addition in every loop iteration significantly increases the running time of the point-multiplication operation, and therefore of the signature generation process. Table 7-2 shows the computational cost of modifying the signature generation process so that it is SPA resistant, as well as TA resistant. Signature generation processes employing both the TNAF and TNAFw point-multiplication operations are presented, because the TNAF-based signature generation process becomes the less computationally expensive of the two when SPA resiliency is added. The TNAFw technique normally outperforms the TNAF technique, but the opposite is true when TA and SPA resistant measures are employed, because the Hamming weight no longer affects the computational cost and the TNAF technique does not require a LUT.
Table 7-2. Estimated SPA Resistant Signature Generation Performance Penalty (cycle counts)

Code Description (Point-Multiplication)   Original Case   SPA Resistant Case   Performance Penalty
Signature Generation (TNAF)               1,806,000       4,634,000            2,828,000 (156.6%)
Signature Generation (TNAFw)              1,329,000       4,864,000            3,535,000 (266.0%)
Coron states that for an algorithm to be SPA resistant, there should be no branch instructions that depend on the input parameters [10]. This statement is not entirely true. It is proposed that by employing the techniques described in §7.4.1, and maintaining identical execution sequences for each branch, SPA resistance is maintained. SPA efforts targeting the execution sequence after the branch is taken are thwarted because the sequences are identical, which leaves the implementation of the branch to either if case crucial in maintaining SPA resistance. If the branch is not implemented correctly, the attacker is able to determine the input parameters and break the cryptosystem.
Consider the algorithms being implemented on the SC140. By employing the parallel
processing capabilities of the processor, it is proposed that the attacker is unable to determine the
branch the processor takes. The input parameters can be masked from the attacker by executing
branches in parallel. This is achieved with conditional instructions illustrated in §6.1.1, which are
conditionally executed, or can be used to define a subset of instructions to be executed within a VLES.
Providing the subsets of instructions are identical, but refer to different addresses, registers and/or data,
the attacker is unable to determine the subset executed. For a detailed explanation and example, see
§7.4.1.
The parallel processing capabilities of the SC140 mask the instructions executed. Provided care is taken, the attacker is unable to determine the exact instructions executed, and therefore the branches taken and the bit-values of pertinent data.
7.3 Differential Power Analysis
As stated previously, the ECDSA is not subject to DPA type attacks because of the nonce used in the
algorithm. However, other ECC cryptosystems such as encryption are subject to this type of attack.
SPA resistant algorithms reduce the dependency of power traces on bit values of the private
key or other pertinent data, such that they are hidden from attackers. Unlike SPA, DPA is a statistically
based attack that requires several samples to break a cryptosystem. Several power traces are used to
increase the correlation between power traces and bit values, allowing the determination of bit values
by the attacker. An excellent description of the attack on the point-multiplication, and several DPA
countermeasures specifically geared towards Koblitz curves are provided in [22].
DPA attacks are foiled by adding some type of randomness, or uncertainty, to the operation in question, in an attempt to randomize power traces and eliminate correlations between
accumulated power traces and the private key or other pertinent data. The countermeasures listed
below focus on the τ-adic representation of k in the point-multiplication operation, Q = k⋅P.
The first DPA countermeasure stated in [22], key masking with localized operations, is based
on Formula 3.6. Basic algebra can be used to formulate a set of functions that include several powers
of τ, and are equated to zero. By adding a randomly selected function from the set to randomly
selected consecutive coefficients of k, k′ is computed. The two polynomials, k and k′, are equivalent.
Therefore, k′ can be used to compute Q. Each time the point-multiplication operation is executed, a
new k′ is computed, thus adding randomness to the power traces and foiling DPA attacks.
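The equivalence that the masking relies on can be checked numerically. The sketch below assumes Formula 3.6 is the standard Koblitz curve identity τ^2 - μτ + 2 = 0, so that the pattern (-2, μ, -1) scaled by any s and added at any position i contributes exactly zero; it evaluates the digit string at the complex root τ before and after masking. The digit values, s and i are arbitrary illustrations.

#include <stdio.h>
#include <complex.h>

/* Evaluate a tau-adic digit string at a complex value of tau. */
static double complex eval(const int *d, int len, double complex tau)
{
    double complex v = 0, t = 1;
    int i;
    for (i = 0; i < len; i++) {
        v += d[i] * t;
        t *= tau;
    }
    return v;
}

int main(void)
{
    int d[8] = {1, 0, -1, 0, 0, 1, 0, 0};
    int mu = 1, s = 2, i = 3;                    /* arbitrary example values */
    double complex tau = (mu + csqrt(-7.0 + 0.0 * I)) / 2.0; /* root of t^2 - mu*t + 2 = 0 */
    double complex before = eval(d, 8, tau);
    /* Add s * tau^i * (-tau^2 + mu*tau - 2) = 0 to the representation. */
    d[i + 2] += -1 * s;
    d[i + 1] += mu * s;
    d[i]     += -2 * s;
    printf("difference = %g\n", cabs(eval(d, 8, tau) - before)); /* ~0 */
    return 0;
}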
The second DPA countermeasure, Random Rotation of Key (RRK), is based on Formula 3.8.
By exploiting the formula, and the simplicity of multiplying points by τ, randomness can be added to
the point-multiplication operation. A random integer r, where 0 ≤ r ≤ m-1, is selected. Then a
modified base point, τr⋅P, and a version of the key cyclically shifted by r-bits are used in the point-
multiplication operation. The operation results in the point Q, where Q = k⋅P. The security of this
countermeasure is transferred to the computation of the modified base point and cyclically shifted key.
A secure computational method is required to compute the values. Otherwise, the implementation
remains subject to DPA attacks.
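The rotation property can likewise be illustrated with a toy model in which τ is multiplication by 2 modulo 2^m - 1, so that τ^m = 1, mirroring the order-m property that Formula 3.8 presumably provides for the Frobenius map. Rotating the key digits by r while pre-multiplying the base point by τ^r leaves the product unchanged. The values of m, r, the key and the point below are arbitrary; real RRK operates on curve points, not residues.

#include <stdio.h>

#define M   16
#define MOD 65535UL                  /* 2^16 - 1, so that "tau"^M = 1 */

/* Toy point-multiplication: Q = sum of k_i * tau^i * P, with tau = x2 mod MOD. */
static unsigned long point_mul(const int *k, unsigned long p)
{
    unsigned long q = 0, t = p;
    int i;
    for (i = 0; i < M; i++) {
        if (k[i])
            q = (q + t) % MOD;
        t = (t * 2) % MOD;           /* advance t to tau^(i+1) * P */
    }
    return q;
}

int main(void)
{
    int k[M] = {1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,1}, kr[M];
    unsigned long p = 12345, pr = p;
    int r = 5, i;
    for (i = 0; i < M; i++)
        kr[i] = k[(i + r) % M];      /* key cyclically rotated by r digits */
    for (i = 0; i < r; i++)
        pr = (pr * 2) % MOD;         /* modified base point tau^r * P */
    printf("%lu %lu\n", point_mul(k, p), point_mul(kr, pr)); /* two equal values */
    return 0;
}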
A third countermeasure is presented in [28]. The countermeasure modifies the reduction of the TNAF representation. Joye and Tymen propose that by reducing the TNAF representation of k, involved in the point-multiplication Q = k⋅P, using ρ⋅(τ^m - 1), where ρ is randomly selected, DPA attacks are thwarted. The suggested range of ρ for m = 163 leads to a TNAF representation of k involving 200 coefficients, resulting in approximately a 25% increase in the computational cost. The implementation of the countermeasure is simple because no additional routines are required [28]. Furthermore, ρ affects the entire representation of k [28]; thus the countermeasure is much more successful in thwarting DPA attacks.
7.4 SCA Countermeasures specific to Koblitz Curves and the SC140
Algorithms that are resistant to SPA are well known and documented, whereas techniques that foil DPA are not, and are more easily scrutinized. Furthermore, SPA resistant algorithms are easily modified to protect operations involving private keys and other pertinent information in different cryptosystems. DPA countermeasures are more implementation specific. They are generally based on the
implementation and specific properties of the underlying problem, where some degree of randomness,
or uncertainty, can be added without large performance penalties. Countermeasures for a single
cryptosystem are not applicable to all implementations. Two examples of countermeasures that do not
apply to the thesis implementation are provided.
First, a fast point-multiplication method that is immune to DPA is proposed in [51]. The
algorithm is based on Montgomery’s method. For ECC implementations that use the point addition
and doubling technique of point-multiplication, the proposed DPA resistant method cannot be
employed.
The second countermeasure is from [10], where Coron presents three countermeasures to DPA
attacks, two of which are proven vulnerable to DPA in [51]. The remaining DPA countermeasure is
based on randomizing the base point of the point-multiplication. The method is only valid for ECC
implementations using projective coordinates. Implementations that use the affine coordinate system
cannot employ the countermeasure.
DPA countermeasures are generally based on a strength of the given implementation. For example, Koblitz curves are attractive because of the τ-adic representation, which allows point doubling to be replaced with the much less computationally expensive execution of two finite field squarings. Another strength of Koblitz curves is that equivalent τ-adic representations can be computed very easily by exploiting Formula 3.6. DPA countermeasures that use Koblitz curve properties are presented in the previous section. Two proposed SCA countermeasures are presented below. They exploit strengths of the SC140 and Koblitz curves respectively to foil attacks. A notion of sample entropy is also introduced.
7.4.1 Parallel Processing Countermeasure
Parallel processing capabilities of processors may foil SPA and DPA attacks. It is proposed that by
executing instructions in parallel, and by exploiting the functionality of the SC140, the actual
instructions performed are masked. Parallelism of the SC140 can be used at pertinent points in the
execution of the point-multiplication operation to foil SPA, and possibly eliminate correlations between
accumulated power traces and pertinent information.
Several arrangements of parallel instructions achieve identical goals. As a simple example, consider implementing an if statement where an address depends on the value of a coefficient. The assembly implementation, presented as Example 7.1, is proposed to be SPA resistant when care is
exercised. In the example, the data register d0 contains the value of the current coefficient ki. The
address registers, r1 and r2, contain the two possible addresses, and the address register r3 is a dummy
register. The two possible addresses designate the locations of P and –P, which are involved in a point-
multiplication operation. The goal of the example assembly is to securely transfer the correct address
to register r0.
Example 7.1. Parallel Assembly Implementation
…
CMPEQ.W #<0, d0
IFT TFRA r1, r0 IFF TFRA r1, r3
IFT TFRA r2, r3 IFF TFRA r2, r0
…
In the example, the address transferred into r0 depends on the current coefficient located in d0.
In either case, the same instruction is executed, and only the target of the instruction depends on the
true bit of the status register. Assuming there is no inherent processor preference between different
operands of the TFRA instruction, and instruction sets grouped with IFF and IFT, the power trace of
the code is independent of the true bit. It is proposed that the implementation of the if statement is TA
and SPA resistant.
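The same select-without-branching idea can be expressed at the C level with masking. The sketch below is a generic constant-time pointer select, not the thesis code: both candidate addresses enter the computation, and only a mask derived from the coefficient bit decides which one is returned.

#include <stdio.h>
#include <stdint.h>

/* Constant-time select: returns a when bit is 1, b when bit is 0, with
 * no data-dependent branch. Analogous in spirit to Example 7.1. */
static uintptr_t ct_select(uintptr_t a, uintptr_t b, uint32_t bit)
{
    uintptr_t mask = (uintptr_t)0 - (uintptr_t)(bit & 1); /* all ones iff bit == 1 */
    return (a & mask) | (b & ~mask);
}

int main(void)
{
    int p = 1, minus_p = -1;   /* stand-ins for the addresses of P and -P */
    int *sel = (int *)ct_select((uintptr_t)&p, (uintptr_t)&minus_p, 0);
    printf("%d\n", *sel);      /* prints -1: minus_p was selected */
    return 0;
}

As with the assembly version, what the compiler actually emits must be inspected; a sufficiently clever compiler may reintroduce a branch.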
A more sensitive analysis of the assembly is required when considering DPA attacks. It is
assumed that the implementation of the transfer statements is DPA resistant, because the power traces
are unlikely to be different in either case. However, DPA attacks may be successful against the
comparison instruction preceding the register transfers. Furthermore, the remaining part of the
algorithm, where either P or -P is used in the point addition, may also be subject to a DPA attack.
It is proposed that branching while maintaining SPA resiliency is possible. By employing the same technique as that used in Example 7.1, branch instructions that target different memory addresses can be grouped with IFT and IFF instructions. By doing so, and assuming there is no inherent processor preference between different operands of a branch instruction, or between instruction sets grouped with IFF and IFT, attackers are unable to determine the branch taken.
The proposed theory of exploiting parallel processing to mask instructions likely foils TA and
SPA attacks. The resiliency of the countermeasure against DPA attacks must be studied further before
its effectiveness is determined.
7.4.2 Koblitz Curve Specific Countermeasure
Koblitz curve implementations are attractive because of the superior performance of the point-
multiplication operation. By using τ-adic representations, point doubling is replaced, resulting in a
significant reduction in computational costs. Similar to the countermeasures presented in [22], the countermeasure below exploits a property of Koblitz curves and is proposed to foil DPA attacks. It exploits the inexpensiveness of multiplying elliptic curve points by τ.
The countermeasure is loosely based on a finite field exponentiation algorithm. It is preferable
that the τ-adic coefficients representing k are evenly distributed. When performing a point-
multiplication operation, the τ-adic representation of k is divided into r groups of g coefficients, where
g = m/r. The point-multiplication between each group and base point P is performed in a random
order. Then, the resulting points are multiplied by the corresponding power of τ, and summed,
resulting in the point Q.
The algorithm applies to point-multiplications over Koblitz curves. It can be generalized to apply to other elliptic curves, but the computational cost becomes extremely large: each pair of polynomial squaring operations would be converted to a point doubling operation, which is much more computationally expensive, so the performance overhead would most likely be unacceptable. A τ-adic representation of the polynomial k is required by the algorithm. It is suggested that a NAF-related representation of the polynomial be avoided; otherwise the distribution of coefficients favors zero, which may lead to lessened security and possible attacks because of the structure of NAF representations. Furthermore, if an SPA resistant algorithm is employed, a NAF representation is of no advantage.
Algorithm 7-2. Proposed DPA Resistant τ-adic Point-Multiplication
Input: k = (k_(m-1), k_(m-2), …, k_1, k_0)_τ-adic, P
Output: Q = k⋅P
1. Compute k^i, for 0 ≤ i ≤ r-1,
   where k = (k^(r-1), k^(r-2), …, k^1, k^0)_τ-adic and k^i = (k_((i+1)⋅g-1), k_((i+1)⋅g-2), …, k_(i⋅g+1), k_(i⋅g))_τ-adic
2. Compute k^i⋅P, for 0 ≤ i ≤ r-1, following a random sequence of i values.
3. Compute Q = Σ τ^(i⋅g)⋅k^i⋅P, for 0 ≤ i ≤ r-1
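The structure of Algorithm 7-2 is sketched below with the same toy model used earlier in this chapter, where τ is multiplication by 2 modulo 2^m - 1: the digits of k are split into r groups of g, the group products are computed in a (here hard-coded) stand-in for a random order, and step 3 then shifts and sums the partial results. Only the control flow is meaningful; the arithmetic is not curve arithmetic.

#include <stdio.h>

#define M   16
#define R   4
#define G   (M / R)                  /* g = m / r */
#define MOD 65535UL                  /* 2^16 - 1 */

/* Multiply one group of G digits by P: sum of d_j * tau^j * P. */
static unsigned long group_mul(const int *d, unsigned long p)
{
    unsigned long q = 0, t = p;
    int j;
    for (j = 0; j < G; j++) {
        if (d[j])
            q = (q + t) % MOD;
        t = (t * 2) % MOD;
    }
    return q;
}

int main(void)
{
    int k[M] = {1,0,1,1,0,0,1,0,1,1,0,1,0,0,0,1};
    int order[R] = {2, 0, 3, 1};     /* stand-in for a random sequence of i */
    unsigned long p = 12345, part[R], q = 0, t;
    int i, j;
    for (j = 0; j < R; j++) {        /* step 2: k^i * P in random order */
        i = order[j];
        part[i] = group_mul(&k[i * G], p);
    }
    for (i = 0; i < R; i++) {        /* step 3: Q = sum tau^(i*g) * k^i * P */
        t = part[i];
        for (j = 0; j < i * G; j++)
            t = (t * 2) % MOD;
        q = (q + t) % MOD;
    }
    printf("%lu\n", q);              /* equals the direct computation of k*P */
    return 0;
}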
The crucial step of the algorithm is computing k^i⋅P following a random sequence of i-values. It is speculated that this step adds randomness to the power trace of the algorithm, foiling DPA attacks.
To further analyze the proposed countermeasure, a concept of sample entropy is introduced.
A new concept may play a role in the comparison of DPA countermeasures. As DPA attacks
mature, requiring fewer samples to be effective, an idea of sample entropy, or uncertainty per sample,
may become important when comparing the strength of different DPA countermeasures. The amount
of entropy introduced to an attacker monitoring a cryptosystem is important.
For example, consider a cryptosystem that is SPA resistant and employs the RRK DPA countermeasure from [22]. Kerckhoffs' principle, the assumption that the attacker knows everything about the cryptosystem except the key, applies. The sample entropy of the RRK DPA countermeasure is fixed at m, where the cryptosystem uses GF(2^m). Currently, the standard m-value is 163. Alternatively, the sample entropy and overhead of Algorithm 7-2 are controllable.
The sample entropy introduced by Algorithm 7-2 varies with r and g. For small r-values, the probability of k^i = k^j for i ≠ j and 0 ≤ i, j ≤ r-1, referred to as a k^i collision, is approximately zero, resulting in a sample entropy of r! (for example, r = 8 gives 8! = 40,320). The estimated sample entropy and overhead of the algorithm are controlled by selecting various r-values, and are stated in Table 7-3.
The computational overhead of Algorithm 7-2 depends on the implementation of step 3 and the
selection of r. The number of point addition and polynomial squaring operations can be estimated
using Table 7-3, assuming the optimum technique of performing the step is employed. The number of
overhead point addition operations grows linearly with r. The number of polynomial squaring
operations is not of great importance, because the operation is computationally inexpensive.
Table 7-3. Estimated Sample Entropy and Overhead of Algorithm 7-2

Sample Entropy
  Small r, P(k^i collision) = 0:  r!
  Large r, P(k^i collision) = 1:  r! / ((r/2^g)!)^(2^g)

Computational and Memory Overhead of Algorithm 7-2
  Point addition operations:      r - 1
  FF squaring operations:         2⋅g⋅(r - 1)
  Memory requirement (points):    r
The memory overhead of the algorithm, excluding the point-multiplication operations, is
reasonable for implementations without strict memory restrictions. As presented in Table 7-3, an
additional r points must be stored in memory. The number of points cannot be reduced by combining
steps 2 and 3, because this leads to DPA vulnerabilities.
Several modifications can be made to the algorithm to increase its performance. For example,
a single LUT can be used during all of the point-multiplications in step 2. The LUT can be computed
for the first point-multiplication, and used in all remaining operations. The expense of pre-computing
the LUT is therefore reduced because it is used multiple times.
Algorithm 7-2 is a proposed DPA countermeasure; steps must be taken to resist TA and SPA attacks as well. A resistant point-multiplication algorithm, such as Algorithm 7-1, is required to foil those attacks. The sample entropy can be increased by selecting large r-values, but it cannot be used to avoid TA and SPA attempts, and large r-values result in unacceptable computational and memory overheads. Moreover, the sample entropy cannot grow larger than the key entropy of 2^m. For large r, e.g. 163, 82 and 55, the sample entropy can be estimated using Table 7-3; the maximum sample entropy possible is one-sixteenth the key entropy of 2^m.
The proposed DPA countermeasure, presented as Algorithm 7-2, takes advantage of the
Frobenius mapping, stated in Formula 3.7. The randomness introduced to the power trace by the
countermeasure is easily controllable, at the expense of computational and memory overhead. It is
proposed that the randomness introduced may foil DPA attacks, but this must be investigated further.
Additional analysis of the algorithm is required to determine the effectiveness of the proposed
technique in resisting DPA attacks.
8 Discussion and Conclusions
The thesis presents the implementation, optimization and analysis of the ECDSA on the StarCore SC140 DSP. The ECDSA and the algorithms used to implement each of the finite field, large integer and elliptic curve operations are presented in chapter 3. Following that, the implementation and performance of the operations are described. The results are compared to previously published results, and the performance of the hand-written and compiler-generated assembly is compared. The memory requirements of the implementation are examined. The SC140 is analyzed for cryptographic applications, as is the ability of the SC140 compiler to generate efficient assembly. Furthermore, several optimization improvements that the compiler could employ are stated. Finally, security issues are examined, focusing on resisting side-channel attacks and proposing two possible countermeasures to the attacks.
8.1 Thesis Summary
A Koblitz curve over GF(2^163) was selected and used to implement the ECDSA. The focus of the
implementation is to minimize execution time, while targeting portable devices by maintaining an
acceptable code size and minimizing power consumption. Optimal finite field and elliptic curve
algorithms were sought for implementation. The algorithms are listed and described in chapter 3, as
well as the implementation philosophy.
First, a working version of the ECDSA was obtained. Then, inefficient operations were
methodically replaced with superior performing and thoroughly tested operations. The implementation
and integration of the finite field, large integer and elliptic curve operations is outlined in chapter 4.
The performance of the implemented operations is compared with published results in chapter 5. The execution times of the finite field operations, the elliptic curve operations, and the signature generation and verification processes are presented and compared. The performance comparison of the signature generation and verification processes shows that the performance of the implementation is adequate, and that the processes result in acceptable delays.
Coding guidelines that were used when implementing the assembly and C code are listed. The
guidelines are a set of suggestions that result in computationally and memory efficient hand-written
assembly and compiler-generated code. The performance of the implementation, compiled with CGA
and HWA routines, is presented and compared. The performance of the code shows the advantage
realized by assembly implementations instead of C.
The SC140 and associated compiler are analyzed with respect to cryptographic applications in
chapter 6. The pros and cons of the processor are listed and described in detail. By studying the HWA
and CGA routines, a list of suggested compiler optimization improvements was gathered. The
improvements are specific to the CGA routines, and state rules that if employed, significantly improve
the compiler-generated assembly. To conclude chapter 6, two compiler anomalies encountered during
the implementation process are stated.
Security issues due to side-channel attacks are investigated in chapter 7. TA, SPA and DPA attacks are
briefly described, and the implemented operations that are susceptible to the attacks are identified. An
algorithm that resists SPA and TA attempts is included, as well as estimated performance penalties for
all susceptible operations. Several countermeasures that attempt to foil DPA attacks are presented,
along with two SCA countermeasures that exploit strengths specific to the SC140 and Koblitz curves.
Further analysis of these countermeasures is required to determine their true effectiveness against such
attacks.
8.2 Limitations of the Research and Implementation
The primary limitations of the thesis stem from the curve implemented and the target processor. They
are explained in the following paragraphs.
The first limitation is that only a single finite field size, GF(2^163), was investigated. This is the
currently standardized finite field size, so the choice is presently valid, but the execution times
associated with larger finite fields must be investigated as well. According to Moore's law, which has
held with surprising accuracy for thirty years, average computing power doubles every eighteen
months. This exponential growth requires increasing cryptographic strength to maintain acceptable
security, and with respect to ECC, strength is increased by employing larger finite fields. The
implemented code is written so that alternative finite field sizes can be tested with reduced difficulty,
although exercising this versatility has a negative effect on performance. By testing larger finite field
sizes, the viability of implementing higher-security ECC and the ECDSA on the SC140 can be
determined.
A second limitation of the research is that only one elliptic curve was investigated. There are several
types of elliptic curves, each with positive and negative aspects. Koblitz curves are attractive for
implementation because specific properties allow efficient point-multiplication, and most other curves
do not perform as well. However, as stated in [62], the same properties of Koblitz curves may lead to
efficient attacks that are not possible on other curves, so it is important to investigate alternative elliptic
curves.
The final limitation of the research is the target processor: only the SC140 was used for
implementation. The target processor affects the performance of the application because the
computational costs of the implementation are constrained by the instruction set and architecture of
the SC140, which have both positive and negative properties with respect to cryptographic
applications. Alternative high-end DSPs, with slightly different instruction sets and architectures, will
perform differently and may be better suited to executing cryptographic applications.
8.3 Conclusions
The ECDSA is an efficient digital signature technique, and it was implemented on the SC140.
Previous implementations of the ECDSA generally target general-purpose processors. The SC140 is an
interesting target processor because of its intended applications: it targets third-generation wireless
and wireline communication devices, which can benefit from digital signatures.
The implementation employs optimal algorithms for the finite field and elliptic curve operations in an
attempt to minimize the computational costs of the signature generation and verification processes.
The implementation was done primarily in the C programming language, with some basic routines
written in both C and assembly.
In most cases, the computational costs of the implemented operations and of the signature generation
and verification processes are greater than other published results. However, the costs are comparable,
and they can be improved by fixing parameters such as window widths, the width-w value and the
finite field size, as well as by implementing additional functionality in assembly. The performance of
the signature generation and verification processes is adequate: when the SC140 operates at its
maximum clock speed of 300 MHz, the processes lead to delays of approximately 4.43 and 8.63
milliseconds respectively. Delays of this magnitude are acceptable and would not be noticed by a user.
A comparison of the C and assembly routines was completed, focusing on the performance
enhancement and memory requirement reduction achieved by implementation at the assembly level.
The HWA routines recorded significant computational cost reductions, with speedups that vary
depending on the input values and the task. For tasks whose execution times depend on their inputs,
the recorded speedups range widely, from 141% to 1550%. A more consistent range was achieved for
routines whose execution times are independent of their inputs, with speedups between 233% and
532%. Furthermore, the memory requirements of the more computationally efficient HWA routines
are at most half those of the CGA routines.
High-level operations also benefit from HWA routines: signature generation and verification
computation reductions of 23.6% and 23.7% respectively were achieved by employing HWA routines
instead of their less efficient CGA counterparts. Superior-performing low-level HWA routines thus
translate into reduced high-level computational costs.
The benefits of assembly implementation include reductions in both computational costs and memory
requirements. It is assumed that implementing larger tasks, or even the entire signature generation and
verification processes, in assembly would lead to speedups and memory reductions comparable to
those achieved for the low-level routines; assembly implementation of the ECDSA is therefore deemed
beneficial. For target devices with extremely tight memory restrictions, the memory requirement of
the implementation, which is around 36,000 bytes, may be unacceptable. By implementing the
ECDSA entirely in assembly, the memory requirements are reduced, possibly halved, and the
computational cost decreases significantly; the reduced memory requirements should then meet the
limitations of target devices. Where it is only desired to write a limited number of simple functions in
assembly, as was done in the thesis, computational costs and memory requirements can still be
reduced, but the projected reduction is not as significant as that achieved with a full assembly
implementation.
Several guidelines were found during the implementation process. The guidelines aid in the creation
of efficient hand-written assembly and of C code that compiles to superior-performing assembly, and
they list several techniques that were found to improve performance.
The significant performance benefits of assembly implementation point to possible compiler
optimization improvements. Analysis of the compiler-generated assembly led to several significant
documented improvements, including improved register allocation, improved copy propagation
techniques and analysis, minimization of moves from memory, and simplification of multiple
instructions.
Lastly, improper implementation can lead to insecurities, and insecurities due to SCA attacks are
currently of great interest. Algorithms that are TA- and SPA-resistant should be employed in practice;
such algorithms impose significant computational costs, but without them the implementation is
insecure. An illustrative sketch of one well-known SPA-resistant pattern is given below.
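The pattern sketched here is the double-and-add-always method of Coron [10]; it is not necessarily
the exact algorithm evaluated in chapter 7, and ec_point, ec_infinity(), ec_double() and ec_add() are
hypothetical stand-ins for the implementation's elliptic curve types and routines.

    #include <stdint.h>

    /* Hypothetical elliptic curve type and primitives, standing in for
       the implementation's point representation and routines. */
    typedef struct { uint32_t x[6], y[6]; } ec_point;
    ec_point ec_infinity(void);
    ec_point ec_double(ec_point p);
    ec_point ec_add(ec_point p, ec_point q);

    /* Double-and-add-always: a doubling and an addition are performed for
       every key bit, so the power/timing profile no longer reveals the
       bit pattern of the scalar k. */
    ec_point ec_mul_always(const uint8_t *k_bits, int n, ec_point p)
    {
        ec_point q[2];
        q[0] = ec_infinity();
        for (int i = n - 1; i >= 0; i--) {
            q[0] = ec_double(q[0]);   /* always doubled             */
            q[1] = ec_add(q[0], p);   /* always added               */
            q[0] = q[k_bits[i]];      /* key bit selects the result */
        }
        return q[0];
    }

The uniform operation sequence is what buys the resistance; the cost is roughly one extra point
addition per key bit, in line with the performance penalties estimated in chapter 7.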
The ECDSA is naturally immune to DPA attacks because of the use of a nonce, but ECC encryption is
vulnerable. Several countermeasures exist that foil DPA attacks, generally by adding randomness to
the implementation. Most countermeasures are implementation specific, limiting the techniques for
thwarting DPA attacks that can be employed. Two proposed countermeasures specific to the SC140
and Koblitz curves are presented. The first exploits the parallel processing capabilities of the SC140 to
mask branch targets and executed instructions, likely foiling TA and SPA attacks. The second adds a
controllable amount of randomness to the elliptic curve point-multiplication operation and is proposed
to foil DPA attacks. A sketch of one published randomizing countermeasure is given below for
comparison.
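The countermeasure sketched here is Coron's randomized projective coordinates [10], in which the
projective representation of a point is re-randomized so that intermediate values differ from run to run.
This is a minimal sketch assuming standard (homogeneous) projective coordinates; gf_elem, gf_mul()
and gf_random_nonzero() are hypothetical stand-ins for the implementation's GF(2^163) routines.

    #include <stdint.h>

    typedef struct { uint32_t w[6]; } gf_elem;  /* 163 bits in six words */
    gf_elem gf_mul(gf_elem a, gf_elem b);       /* field multiplication  */
    gf_elem gf_random_nonzero(void);            /* random r != 0         */

    /* (X : Y : Z) and (rX : rY : rZ) denote the same affine point
       (X/Z, Y/Z), so multiplying all coordinates by a fresh random r
       changes the values a power analyser observes without changing
       the computed result. */
    void randomize_projective(gf_elem *x, gf_elem *y, gf_elem *z)
    {
        gf_elem r = gf_random_nonzero();
        *x = gf_mul(*x, r);
        *y = gf_mul(*y, r);
        *z = gf_mul(*z, r);
    }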
8.4 Future Work
The performance of the thesis implementation of the ECDSA on the SC140 can be improved upon;
other published works report signature process performance superior by as much as a factor of two.
More specifically, improvements to the finite field and elliptic curve operations will lead to increased
performance of the signature generation and verification processes.
To improve the performance of the ECDSA, it may be necessary to implement a larger set of functions
in assembly. The hand-written assembly routines are shown to significantly outperform the
compiler-generated assembly, and it is assumed that implementing a larger set of functions in
assembly would lead to further computational cost and memory requirement reductions, although the
actual benefits are unknown.
Furthermore, an increase in performance was achieved by fixing the window width of the finite field
squaring operation; a sketch of this technique is given after this paragraph. The remaining source code
is written to maintain versatility by allowing easy modification of window widths, the finite field size
of GF(2^163), and other parameters. Fixing such parameters yields a significant overall performance
gain. The versatility reduces the cost of changing finite field sizes, but field sizes are not commonly
changed and the versatility has a negative effect on performance. The results obtained are adequate,
and implementing the ECDSA in a system would not impose significant delays, but the delays can be
reduced further by reducing the versatility of the code.
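For context, squaring in GF(2^163) with a polynomial basis only interleaves zero bits between the
coefficient bits, since (Σ a_i·x^i)^2 = Σ a_i·x^(2i) over GF(2). Fixing the window width allows the
bit-spreading to be performed with a compile-time lookup table. The following is a minimal sketch
assuming a fixed 8-bit window; the names and word layout are illustrative, and the reduction step that
would follow is omitted.

    #include <stdint.h>

    /* sq_tab[b] holds the bits of byte b interleaved with zeros (bit i of
       b moves to bit 2i), because squaring over GF(2) maps x^i to x^(2i). */
    static uint16_t sq_tab[256];

    static void init_sq_tab(void)
    {
        for (int b = 0; b < 256; b++) {
            uint16_t s = 0;
            for (int i = 0; i < 8; i++)
                if (b & (1 << i))
                    s |= (uint16_t)1 << (2 * i);
            sq_tab[b] = s;
        }
    }

    /* Square a six-word (163-bit) element a into a twelve-word product c:
       each byte of a expands to sixteen bits of c. Reduction modulo f(x)
       would follow. */
    static void gf163_square_nored(const uint32_t a[6], uint32_t c[12])
    {
        for (int i = 0; i < 6; i++) {
            c[2 * i]     = sq_tab[a[i] & 0xff]
                         | ((uint32_t)sq_tab[(a[i] >> 8) & 0xff] << 16);
            c[2 * i + 1] = sq_tab[(a[i] >> 16) & 0xff]
                         | ((uint32_t)sq_tab[(a[i] >> 24) & 0xff] << 16);
        }
    }

Fixing the window at eight bits makes the table size and the unrolled byte extraction known at compile
time, which the compiler can exploit when scheduling the loop.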
Research must also be done with larger finite field sizes. As average computing power increases, the
need for greater security leads to the implementation of larger finite fields. The implemented code is
written to simplify modification of the finite field size, but the performance of larger finite fields was
not investigated. It must be, both to determine how the field size affects performance and to ensure
that reasonable execution times remain achievable.
The effects of increasing the finite field size should be researched on a per-operation basis, since each
finite field and elliptic curve operation is affected differently. The performance of finite field squaring
and inversion is linearly related to the finite field size, while other operations have polynomial
relationships to it. For those operations, alternative algorithms should be investigated or developed in
an attempt to maintain near-linear relationships between the finite field size and the performance of
ECC and the ECDSA.
Implementation of the ECDSA on other DSPs that target similar systems should be investigated, since
DSPs with different instruction sets may be advantageous. Multiprocessor organizations may also be
advantageous: for example, the Motorola MSC8101 comprises four SC140 cores that operate in
parallel. The ECDSA may be well suited to parallel processors, with each core operating on a specific
portion of a finite field element, resulting in significant computational speedups.
Security issues with respect to the ECDSA and the SC140 must be investigated more closely. The
thesis only presents SCA-resistant algorithms and countermeasures and estimates the performance
penalty of implementing them. Implementation of the resistant algorithms is required to determine the
actual performance penalty, and attempts to break the implementation are then required to confirm
their effectiveness against an attacker.
Lastly, the proposed algorithms specific to the SC140 and Koblitz curves that attempt to foil SCA
techniques must be studied further. Compared to other options, the proposed algorithms may offer
performance benefits, reducing the penalty of resisting SCA attacks.
In general, there is a lack of research on the implementation of cryptosystems on DSPs: limited work
exists with respect to symmetric key cryptosystems on DSPs, and even less with respect to asymmetric
cryptosystems. DSPs are present in many systems, and their usefulness for implementing
cryptography should therefore be investigated. Furthermore, information on implementing Koblitz
curves is more difficult to find than expected, which is surprising given their benefits.
Further research in the area the thesis touches on is important because of the usefulness of digital
signatures. Digital signatures are valuable in many computing environments and are especially
beneficial in wireless networks, which are growing in popularity.
Finally, cost-benefit studies are required to compare DSPs with specialized processors for
implementing cryptographic techniques, and to determine whether DSPs are a viable, less expensive
alternative to dedicated processing units for cryptographic implementations.
Appendix A – Koblitz Curve Parameters
The following is a list of the Koblitz curve parameters used in the implementation. They are specific
to the PB and GF(2^163), and were found in [25].
Reduction polynomial: f(x) = x^163 + x^7 + x^6 + x^3 + 1
Elliptic curve equation: y^2 + x·y = x^3 + a·x^2 + b (mod f), where a = 1, b = 1
Base point order: r = 5846006549323611672814741753598448348329118574063
Base point coordinates (hexadecimal):
x = 2 FE13C053 7BBC11AC AA07D793 DE4E6D5E 5C94EEE8
y = 2 89070FB0 5D38FF58 321F2E80 0536D538 CCDAA3D9
Koblitz curve parameters:
µ = 1
C = 16
s0(163) = 2579386439110731650419537
s1(163) = -755360064476226375461594
V(163) = -4845466632539410776804317
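To illustrate how the reduction polynomial is applied, the following is a minimal, unoptimized
bit-level sketch of reduction modulo f(x), using the identity x^163 ≡ x^7 + x^6 + x^3 + 1 (mod f(x)).
The word-level routines used in the implementation are considerably faster; the helper macros and
word layout here are illustrative only.

    #include <stdint.h>

    /* Bit helpers over an array of 32-bit words (bit i lives in word i/32). */
    #define GET_BIT(a, i)  (((a)[(i) >> 5] >> ((i) & 31)) & 1u)
    #define FLIP_BIT(a, i) ((a)[(i) >> 5] ^= 1u << ((i) & 31))

    /* Reduce a polynomial product c(x) of degree at most 324 modulo
       f(x) = x^163 + x^7 + x^6 + x^3 + 1, folding each high bit x^i into
       the low part as x^(i-163) * (x^7 + x^6 + x^3 + 1). */
    void gf163_reduce(uint32_t c[12])
    {
        for (int i = 324; i >= 163; i--) {
            if (GET_BIT(c, i)) {
                FLIP_BIT(c, i);            /* clear x^i           */
                FLIP_BIT(c, i - 163 + 7);  /* add x^(i-163) * x^7 */
                FLIP_BIT(c, i - 163 + 6);  /* add x^(i-163) * x^6 */
                FLIP_BIT(c, i - 163 + 3);  /* add x^(i-163) * x^3 */
                FLIP_BIT(c, i - 163);      /* add x^(i-163)       */
            }
        }
    }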
Bibliography
[1] G. Agnew, R. C. Mullin and S. A. Vanstone, "An implementation of elliptic curve cryptosystems over F_2^155", IEEE Journal on Selected Areas in Communications, Vol. 11, No. 5, pp. 804-813, 1993.
[2] G. Agnew, ECE 628 Lecture Slides – Computer Network Security, Department of Electrical and Computer Engineering, University of Waterloo, 2002.
[3] M. Aydos, T. Yanik and C. Koc, "High-speed implementation of an ECC-based wireless authentication protocol on an ARM microprocessor", IEE Proceedings – Communications, Vol. 148, No. 5, pp. 273-279, October 2001.
[4] D. Brown, "The exact security of ECDSA", Technical report CORR 2000-54, Department of Combinatorics & Optimization, University of Waterloo, 2000. Available at http://www.cacr.uwaterloo.ca
[5] M. Brown, D. Cheung, D. Hankerson, J. Lopez Hernandez, M. Kirkup and A. Menezes, "PGP in constrained wireless devices", Proceedings of the 9th USENIX Security Symposium, The USENIX Association, 2000. Available at http://www.usenix.org
[6] Certicom, "Current Public-Key Cryptographic Systems", 2000. Available at http://www.certicom.com
[7] Certicom, "Remarks on the Security of the Elliptic Curve Cryptosystem", 2000. Available at http://www.certicom.com
[8] C. Clavier and M. Joye, "Universal exponentiation algorithm: a first step towards provable SPA-resistance", in Workshop on Cryptographic Hardware and Embedded Systems – CHES 2001, LNCS 2162, pp. 300-308, Springer-Verlag, 2001.
[9] T. Cormen, C. Leiserson and R. Rivest, Introduction to Algorithms, The MIT Press, Cambridge, Massachusetts, 1999.
[10] J.-S. Coron, "Resistance against differential power analysis for elliptic curve cryptosystems", in Workshop on Cryptographic Hardware and Embedded Systems, LNCS 1717, pp. 292-302, Springer-Verlag, 1999.
[11] E. De Win, A. Bosselaers and S. Vandenberghe, "A fast software implementation for arithmetic operations in GF(2^n)", Advances in Cryptology, Proc. Asiacrypt '96, LNCS 1163, pp. 65-76, Springer-Verlag, 1996.
[12] H. Eisenbise, "Embedded cryptography: secure communications with digital signal processors", 2001. Available at http://www.rit.edu/~hje3479/cryptography.html
[13] C. H. Gebotys and R. J. Gebotys, "Secure elliptic curve implementations: an analysis of resistance to power-attacks in a DSP processor", in Workshop on Cryptographic Hardware and Embedded Systems – CHES 2002, 2002.
[14] J. Goodman and A. Chandrakasan, "An energy-efficient reconfigurable public-key cryptographic processor", IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, 2001.
[15] J. Guajardo and C. Paar, "Efficient algorithms for elliptic curve cryptosystems", Advances in Cryptology, Proc. Crypto '97, LNCS 1294, pp. 342-356, Springer-Verlag, 1997.
[16] J. Guajardo and C. Paar, "Itoh-Tsujii inversion in standard basis and its application in cryptographic codes", Kluwer Academic Publishers, 2001.
[17] J. Guajardo, R. Blumel, U. Krieger and C. Paar, "Efficient implementation of elliptic curve cryptosystems on the TI MSP430x33x family of microcontrollers", in Proceedings of PKC 2001, LNCS 1992, pp. 365-382, Springer-Verlag, 2001.
[18] P. Hamalainen, M. Hannikainen, T. Hamalainen and J. Saarinen, "Configurable hardware implementation of triple-DES encryption algorithm for wireless local area network", IEEE, 0-7803-7041-4, 2001.
[19] D. Hankerson, J. Hernandez and A. Menezes, "Software implementation of elliptic curve cryptography over binary fields", in Proceedings of CHES 2000, 2000.
[20] M. Hasan, "Efficient computation of multiplicative inverses for cryptographic applications", 15th IEEE Symposium on Computer Arithmetic, pp. 66-72, 2001.
[21] M. Hasan, ECE 720 (Topic 2) Lecture Slides – Selected Topics in Cryptographic Computations, Department of Electrical and Computer Engineering, University of Waterloo, 2001.
[22] M. A. Hasan, "Look-up table-based large finite field multiplication in memory constrained cryptosystems", IEEE Transactions on Computers, LNCS 1746, pp. 749-758, July 2000.
[23] M. Hasan, "Power analysis attacks and algorithmic approaches to their countermeasures for Koblitz curve cryptosystems", in Cryptographic Hardware and Embedded Systems – CHES 2000, LNCS 1965, pp. 93-108, Springer-Verlag, 2000.
[24] E. Hess, N. Janssen, B. Meyer and T. Schütze, "Information leakage attacks against smart card implementations of cryptographic algorithms and countermeasures". Available at http://infilsec.com/papers/dpa/
[25] IEEE P1363, Standard Specifications for Public-Key Cryptography, 2000.
[26] K. Itoh, M. Takenaka, N. Torii, S. Temma and Y. Kurihara, "Fast implementation of public-key cryptography on a DSP TMS320C6201", in Proceedings of the First Workshop on Cryptographic Hardware and Embedded Systems (CHES '99), LNCS 1717, pp. 61-72, Springer-Verlag, 1999.
[27] T. Izu and T. Takagi, "A fast parallel elliptic curve multiplication resistant against side channel attacks", Technical report, CACR, University of Waterloo, 2001. Available at http://www.math.uwaterloo.ca/
[28] M. Joye and C. Tymen, "Protections against differential analysis for elliptic curve cryptography", in Cryptographic Hardware and Embedded Systems – CHES 2001, LNCS 2162, pp. 377-390, Springer-Verlag, 2001.
[29] D. Johnson, "ECC, future resiliency and high security systems", Certicom, 1999. Available at http://www.certicom.com
[30] D. Johnson and A. Menezes, "The elliptic curve digital signature algorithm (ECDSA)", Technical report CORR 99-06, Department of Combinatorics & Optimization, University of Waterloo, 1999. Available at http://www.cacr.math.uwaterloo.ca/
[31] N. Koblitz, A. J. Menezes and S. Vanstone, "The state of elliptic curve cryptography", Designs, Codes and Cryptography, 19, pp. 173-193, 2000.
[32] K. Koyama and Y. Tsuruoka, "Speeding up elliptic cryptosystems by using a signed binary window method", in Advances in Cryptology – CRYPTO '92, LNCS 740, pp. 345-357, Springer-Verlag, 1992.
[33] P. Kocher, "Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems", in Advances in Cryptology – CRYPTO '96, LNCS, pp. 104-113, Springer-Verlag, 1996.
[34] P. Kocher, J. Jaffe and B. Jun, "Differential power analysis", in Advances in Cryptology – CRYPTO '99, LNCS, pp. 388-397, Springer-Verlag, 1999.
[35] N. Kanayama, T. Kobayashi, T. Saito and S. Uchiyama, "Remarks on elliptic curve discrete logarithm problem", IEICE Trans. Fundamentals, Vol. E83-A, No. 1, 2000.
[36] N. Koblitz, "CM-curves with good cryptographic properties", in Advances in Cryptology – CRYPTO '91, LNCS 576, pp. 279-287, Springer-Verlag, 1992.
[37] C. H. Lim and P. J. Lee, "Security of interactive DSA batch verification", Electronics Letters, No. 19941112, 1994.
[38] J. Lopez and R. Dahab, "An overview of elliptic curve cryptography", Technical report IC-00-10, 2000. Available at http://www.dcc.unicamp.br/ic-main/publications-e.html
[39] J. Lopez and R. Dahab, "High-speed software multiplication in F_2^m", Technical report IC-00-09, 2000. Available at http://www.dcc.unicamp.br/ic-main/publications-e.html
[40] J. Lopez and R. Dahab, "Performance of elliptic curve cryptosystems", Technical report IC-00-08, May 2000. Available at http://www.dcc.unicamp.br/ic-main/publications-e.html
[41] Metrowerks Corporation, CodeWarrior IDE, version 4.1, build 696, 2000.
[42] Metrowerks Corporation, CodeWarrior® Metrowerks Enterprise C Compiler User's Manual, 1999. Available at http://www.metrowerks.com
[43] Metrowerks Corporation, StarCore C Compiler, vMtwk.Production 1.1, build 050901-1936, 2000.
[44] C. Moerman and E. Lambers, "Optimizing DSP: low power by architecture", Adelante Technologies. Available at http://www.techonline.com
[45] B. Möller, "Securing elliptic curve point multiplication against side-channel attacks", in Information Security – 4th International Conference, ISC 2001, LNCS 2200, pp. 324-334, Springer-Verlag, 2001.
[46] Motorola, SC140 DSP Core Reference Manual, Motorola and Lucent Technologies Inc., Rev. 1, 2000. Available at http://www.motorola.com
[47] National Institute of Standards and Technology, "Digital Signature Standard (DSS)", FIPS Publication 186-2, February 2000. Available at http://csrc.nist.gov/fips
[48] National Institute of Standards and Technology, "Secure Hash Standard (SHS)", FIPS Publication 180-1, April 1995. Available at http://csrc.nist.gov/fips
[49] P. Nguyen and I. Shparlinski, "The insecurity of the elliptic curve digital signature algorithm with partially known nonces", submitted to Designs, Codes and Cryptography, 2001.
[50] K. Okeya and K. Sakurai, "On insecurity of the side channel attack countermeasure using addition-subtraction chains under distinguishability between addition and doubling", LNCS 2384, pp. 420-435, Springer-Verlag, 2002.
[51] K. Okeya and K. Sakurai, "Power analysis breaks elliptic curve cryptosystems even secure against the timing attacks", in Progress in Cryptology – Indocrypt 2000, LNCS 1977, pp. 178-190, Springer-Verlag, 2000.
[52] G. Orlando and C. Paar, "A high-performance reconfigurable elliptic curve processor for GF(2^m)", Workshop on Cryptographic Hardware and Embedded Systems (CHES 2000), LNCS 1965, Springer-Verlag, 2000.
[53] G. Orlando and C. Paar, "An efficient architecture for GF(2^m) and its applications in cryptographic systems", Electronics Letters, Vol. 36, No. 13, pp. 1116-1117, 2000.
[54] S. Ravi, A. Raghunathan and N. Potlapally, "Securing wireless data: architecture challenges", ISSS '02, ACM 1-58113-576-9, 2002.
[55] L. Reyzin and B. Kaliski, "Storage-efficient basis conversion techniques", IEEE, 2000.
[56] M. Rosing, Implementing Elliptic Curve Cryptography, Manning Publications, Greenwich, CT, 1999.
[57] E. Roy and D. Crawford, "Introduction to the StarCore SC140 tools: an approach in nine exercises", Motorola, AN2009/D, Rev. 1, 2001. Available at http://www.motorola.com
[58] Z. Rozenshein, D. Halahmi, A. Mordoh and Y. Ronen, "Speed and code-size trade-off with the StarCore SC140", Motorola, AN1838/D, Rev. 0, 2000. Available at http://www.motorola.com
[59] B. Schneier, "Cryptographic design vulnerabilities", IEEE, 0018-9162, 1998.
[60] G. Seroussi, "Compact representation of elliptic curve points over F_2^n", Hewlett-Packard Company, HPL-98-94 (R.1), 1998.
[61] J. Solinas, "Efficient arithmetic on Koblitz curves", Designs, Codes and Cryptography, 19, pp. 195-249, 2000.
[62] M. Wiener and R. Zuccherato, "Faster attacks on elliptic curve cryptosystems", Selected Areas in Cryptography '98, LNCS 1556, pp. 190-200, Springer-Verlag, 1998.
[63] E. Witzke and L. Pierson, "Key management for large scale end-to-end encryption", IEEE, 0-7803-1479-4, 1994.
[64] T. Wollinger, M. Wang, J. Guajardo and C. Paar, "How well are high-end DSPs suited for the AES algorithms? AES algorithms on the TMS320C6x DSP", AES Candidate Conference 2000, pp. 94-105, 2000.
[65] C. Zamfirescu and E. Madve, "Stack measurement for the StarCore SC140 core", Motorola, AN2267/D, Rev. 1, 2002. Available at http://www.motorola.com