Fast Architectures for the ηT Pairing over Small-Characteristic Supersingular Elliptic Curves

Jean-Luc Beuchat, Jérémie Detrey, Nicolas Estibals, Eiji Okamoto, Member, IEEE, and Francisco Rodríguez-Henríquez
Abstract—This paper is devoted to the design of fast parallel accelerators for the cryptographic ηT pairing on supersingular elliptic curves over finite fields of characteristics two and three. We propose here a novel hardware implementation of Miller's algorithm based on a parallel pipelined Karatsuba multiplier. After a short description of the strategies that we considered to design our multiplier, we point out the intrinsic parallelism of Miller's loop and outline the architecture of coprocessors for the ηT pairing over F_{2^m} and F_{3^m}. Thanks to a careful choice of algorithms for the tower field arithmetic associated with the ηT pairing, we manage to keep the pipelined multiplier at the heart of each coprocessor busy. A final exponentiation is still required to obtain a unique value, which is desirable in most cryptographic protocols. We supplement our pairing accelerators with a coprocessor responsible for this task. An improved exponentiation algorithm allows us to save hardware resources. According to our place-and-route results on Xilinx FPGAs, our designs improve both the computation time and the area–time trade-off compared to previously published coprocessors.

Index Terms—Tate pairing, ηT pairing, elliptic curve, finite field arithmetic, Karatsuba multiplier, hardware accelerator, FPGA.
1 INTRODUCTION
IN 2000, Mitsunari et al. [36], Sakai et al. [41], and Joux [25] independently showed how to use bilinear pairings defined over algebraic curves to solve long-standing cryptographic problems. This discovery ignited intensive research that, to this day, has produced an impressive number of pairing-based cryptographic protocol proposals [13]. Practice has shown that one of the most efficient options for computing bilinear pairings is to resort to the Tate pairing operating on supersingular elliptic curves of low embedding degree.
Back in 1986, Miller [33], [34] presented an iterative
algorithm that can be adapted to compute the Tate pairing
with linear complexity with respect to the size of the input.
Since then, significant improvements of Miller’s algorithm
were independently proposed in 2002 by Barreto et al. [4]
and Galbraith et al. [17]. One year later, Duursma and Lee
presented a radix-3 variant of Miller’s algorithm especially
targeted at the case of characteristic three [14]. In 2004, Barreto et al. [3] introduced the ηT approach, which further shortens the loop of Miller's algorithm. More recently, Hess et al. generalized these results to ordinary curves [22], [23], [46].
We extend here the work presented in [10] and propose novel hardware architectures for computing the ηT pairing over binary and ternary fields, based on parallel pipelined Karatsuba multipliers and enhanced unified arithmetic operators. We stress that the modified Tate pairing can be directly computed from the reduced ηT pairing at almost no extra cost [7]. Our hardware accelerators are able to compute the ηT pairing on supersingular elliptic curves defined over F_{2^691} and F_{3^313} in just 18.8 and 16.9 µs, respectively (Table 5). We note that these field sizes enjoy an associated security equivalent to that of 105- and 109-bit symmetric-key cryptosystems, respectively (Table 4).
The main strategies considered to design our parallel pipelined multiplier are described in Section 2. They are included in a VHDL code generator that allows us to experiment with a wide range of operators as well as a variety of design parameters. Thanks to a judicious choice of algorithms for performing tower field arithmetic and a careful analysis of the scheduling, we managed to keep our pipelined units always busy. This allows us to compute one iteration of Miller's algorithm over ternary and binary fields in only 17 and 7 clock cycles, respectively (Sections 3 and 4). We summarize the results obtained from our FPGA implementation and provide the reader with a thorough comparison against previously published coprocessors in Section 5.
For the sake of concision, we are forced to skip the description of many important concepts of elliptic curve theory. We refer the interested reader to [44], [47] for an in-depth coverage of this topic.

266 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 2, FEBRUARY 2011

. J.-L. Beuchat and E. Okamoto are with the Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki 305-8573, Japan. E-mail: [email protected], [email protected].
. J. Detrey and N. Estibals are with the CARAMEL Project-Team, INRIA Nancy–Grand Est, Bâtiment A, 615 rue du Jardin Botanique, 54602 Villers-lès-Nancy Cedex, France. E-mail: {jeremie.detrey, nicolas.estibals}@loria.fr.
. F. Rodríguez-Henríquez is with the Computer Science Department, Centro de Investigación y de Estudios Avanzados del IPN, Av. Instituto Politécnico Nacional No. 2508, 07300 Mexico City, Mexico. E-mail: [email protected].

Manuscript received 19 Aug. 2009; revised 27 Mar. 2010; accepted 19 Apr. 2010; published online 24 June 2010. Recommended for acceptance by J. Bruguera, M. Cornea, and D. Das Sarma. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TCSI-2009-08-0395. Digital Object Identifier no. 10.1109/TC.2010.163.

0018-9340/11/$26.00 © 2011 IEEE. Published by the IEEE Computer Society
2 PARALLEL KARATSUBA MULTIPLIERS
Before delving into the specifics of our pairing coprocessor
architectures, we first detail here the Karatsuba multipliers
on which they extensively rely.

We define the p-ary extension field F_{p^m} as F_p[x]/(f(x)), where f is an irreducible degree-m monic polynomial over F_p. The product of two arbitrary elements of F_{p^m}, represented as p-ary polynomials of degree at most m − 1, is computed as the polynomial multiplication of the two elements modulo f. Carefully selecting an irreducible polynomial with low Hamming weight (i.e., a trinomial, tetranomial, etc.) and low subdegree allows for a simple modular reduction step.

In this work, due to its subquadratic space complexity, we opted for a variant of the classical Karatsuba multiplier to implement the polynomial product, while a few extra adders and subtracters over F_p are dedicated to performing the final reduction modulo f.
2.1 Variations on the Karatsuba Algorithm
The Karatsuba multiplication [27] is based on the observation that the polynomial product c = a · b, for a and b ∈ F_{p^m}, can be computed as

c = a_L b_L + [(a_H + a_L)(b_L + b_H) − (a_H b_H + a_L b_L)] x^n + a_H b_H x^{2n},

where n = ⌈m/2⌉, a = a_L + x^n a_H, and b = b_L + x^n b_H.

Note that since we are working with polynomials, there
is no carry propagation. This allows one to split the
operands in a slightly different way: for instance, Hanrot and Zimmermann [21] suggested splitting them into odd- and even-degree parts. This technique was adapted to multiplication over F_{2^m} by Fan et al. [15]. Since there is no overlap between the odd and even parts at the reconstruction step, this method of splitting saves approximately m additions over F_p during the reconstruction of the product.

Another natural way to generalize the Karatsuba multiplication is to split the operands into three or more parts, in
a classical way (i.e., splitting each operand into contiguous
parts from the lowest to the highest powers of x) or using a
generalized odd/even split (i.e., according to the degree
modulo the number of split parts). By applying this strategy
recursively, in each iteration, each polynomial multiplica-
tion is transformed into three or more subproducts of
smaller degree, until all the polynomial operands are
reduced to single coefficients. Nevertheless, practice has
shown that it is better to prune the recursion earlier,
performing the lowest-level multiplications using alterna-
tive techniques that are more compact and/or faster for
low-degree operands, such as the so-called schoolbook
method with quadratic complexity, which has been selected
for this work.
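As an illustration of the recursive splitting described above, the following is a minimal software sketch (ours, not the authors' VHDL generator) of the Karatsuba product over F_p, with the split a = a_L + x^n a_H for n = ⌈m/2⌉ and the recursion pruned to a quadratic schoolbook base case; the function names and the threshold value are our own choices.

```python
def schoolbook(a, b, p):
    """Quadratic schoolbook product of two coefficient lists over F_p."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] = (c[i + j] + ai * bj) % p
    return c

def padd(a, b, p):
    """Coefficient-wise addition over F_p (operands of equal length)."""
    return [(x + y) % p for x, y in zip(a, b)]

def karatsuba(a, b, p, threshold=4):
    """Karatsuba product of two equal-length coefficient lists over F_p.
    The recursion is pruned at `threshold`, where the schoolbook method
    takes over, mirroring the pruning strategy of Section 2.1."""
    if len(a) <= threshold:
        return schoolbook(a, b, p)
    n = (len(a) + 1) // 2                                 # n = ceil(m/2)
    aL, aH = a[:n], a[n:] + [0] * (2 * n - len(a))        # a = aL + x^n * aH
    bL, bH = b[:n], b[n:] + [0] * (2 * n - len(b))
    low = karatsuba(aL, bL, p)                            # aL * bL
    high = karatsuba(aH, bH, p)                           # aH * bH
    mid = karatsuba(padd(aL, aH, p), padd(bL, bH, p), p)  # (aL+aH)(bL+bH)
    mid = [(x - y) % p for x, y in zip(mid, padd(low, high, p))]
    c = [0] * (4 * n - 1)
    for i, v in enumerate(low):
        c[i] = (c[i] + v) % p
    for i, v in enumerate(mid):
        c[i + n] = (c[i + n] + v) % p
    for i, v in enumerate(high):
        c[i + 2 * n] = (c[i + 2 * n] + v) % p
    return c[: len(a) + len(b) - 1]

# The two methods agree, e.g., over F_3:
a = [1, 2, 0, 1, 2, 2, 0, 1, 2]
b = [2, 1, 1, 0, 2, 1, 2, 0, 1]
assert karatsuba(a, b, 3) == schoolbook(a, b, 3)
```

Each recursion level trades one product of size m for three products of size roughly m/2, which is the source of the subquadratic space complexity mentioned above.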
2.2 A Pipelined Architecture for the Karatsuba Multiplier
We pipelined our multiplier architecture by means of optional registers inserted between the computations of the required subproducts, where the depth of the pipeline can be adjusted according to the complexity of the application at hand. This approach allows us to split the critical path of the whole multiplier structure and, therefore, increase its operating frequency.
In order to study a wide range of implementation strategies, we wrote a VHDL code generator, which automatically produces the description of different variants of Karatsuba multipliers according to several parameters (field extension degree, irreducible polynomial, splitting method, etc.). Our automatic tool was extremely useful for selecting the operator that showed the highest clock frequency, the smallest area, or a good trade-off between them.
3 REDUCED ηT PAIRING IN CHARACTERISTIC THREE
In the following, we consider the computation of the reduced ηT pairing in characteristic three. Table 1 summarizes the parameters of the algorithm and of the supersingular curves. We refer the reader to [3], [8] for more details about the computation of the ηT pairing. Recall that a final exponentiation is required to obtain a unique value, which is desirable in the context of cryptographic protocols. As pointed out by Beuchat et al. [9], the computations of the nonreduced pairing (i.e., Miller's algorithm) and of the final exponentiation do not share the same datapath, and it seems judicious to pipeline these two tasks using two distinct coprocessors in order to reduce the computation time and increase the throughput.
3.1 Computation of Miller’s Algorithm
We rewrote in Algorithm 1 the reversed-loop algorithm in characteristic three described in [8], denoting each iteration with parenthesized indices in superscript in order to emphasize the intrinsic parallelism of the ηT pairing. At each iteration of Miller's algorithm, two tasks are performed in parallel, namely, a sparse multiplication over F_{3^{6m}} (lines 6 and 7) and the computation of the coefficients for the next sparse operation (lines 8–10). We say that an operand in F_{3^{6m}} is sparse when some of its coefficients are trivial (i.e., either zero, one, or minus one).
3.1.1 Sparse Multiplication over F_{3^{6m}}
The intermediate result R^(i−1) is multiplied by the sparse operand S^(i) (Algorithm 1, lines 6 and 7). This operation is easier than a standard multiplication over F_{3^{6m}}, but the choice of the sparse multiplication algorithm requires careful attention. Bertoni et al. [6] and Gorla et al. [18] took advantage of Karatsuba multiplication and Lagrange interpolation, respectively, to reduce the number of multiplications over F_{3^m} at the expense of several additions. (Note that Gorla et al. study standard multiplication over F_{3^{6m}} in [18], but extending their approach to sparse multiplication is straightforward.) In order to keep the pipeline of a Karatsuba multiplier busy, we would have to embed in our processor a large multioperand adder (up to 12 operands for the scheme proposed by Gorla et al.) and several multiplexers to deal with the irregular datapath. This would negatively impact the area and the clock frequency, and we prefer the algorithm discussed by Beuchat et al. in [11], which gives a better trade-off between the number of multiplications and additions over the underlying field (Algorithm 2): it involves 17 multiplications and 29 additions over F_{3^m} to compute S^(i) and R^(i−1) · S^(i).

We suggest taking advantage of a parallel Karatsuba multiplier with seven pipeline stages to implement Miller's algorithm. Since the algorithm we selected for sparse multiplication over F_{3^{6m}} requires at most the addition of four elements of F_{3^m}, it suffices to complement the multiplier with a four-operand adder to compute s_3^(i), a_j^(i), and r_j^(i), 0 ≤ j ≤ 5, as shown in Fig. 1. The second loop of Algorithm 2 (lines 13–16) requires a small amount of additional hardware. Since the first two multiplications of the loop involve r_{2j}^(i−1) and r_{2j+1}^(i−1), respectively, we compute s_j^(i) on-the-fly by means of an accumulator. Furthermore, it seems convenient to store p_6^(i), p_7^(i), and s_3^(i) in a circular shift register.
We managed to find a scheduling that allows us to start a new multiplication over F_{3^m} at each clock cycle, thus keeping the pipeline busy and computing an iteration of Miller's algorithm in 17 clock cycles, as depicted in Fig. 2. It is worth noticing that the cost of additions over F_{3^m} is hidden and that the number of clock cycles depends only on the number of multiplications over F_{3^m}. We easily identify five datapaths (denoted by the numerals 1 to 5 in Figs. 1 and 2) between the output of the four-operand adder and the inputs of the parallel multiplier.
TABLE 1. Supersingular Curves over F_{3^m} and F_{2^m}
Specific attention is needed to design the register file storing the coefficients of R^(i) and the intermediate variables a_j^(i), 0 ≤ j ≤ 6, of the sparse multiplication algorithm. According to our scheduling scheme, we have to read simultaneously up to three variables from the register file. Thus, we decided to implement it by means of two blocks of Dual-Ported RAM (DPRAM):
. The first one is connected to input M0 of the parallel multiplier and input A0 of the four-operand adder, and stores the coefficients of R^(i).
. According to our scheduling (Fig. 2), the second DPRAM block provides the four-operand adder with its fourth input, namely, a_j^(i), 0 ≤ j ≤ 5, and r_j^(i−1), 0 ≤ j ≤ 2.
3.1.2 Computation of the Sparse Operand
The second task consists in computing the coefficients of the sparse operand S^(i+1) required for the next iteration of Miller's algorithm (Algorithm 1, lines 8–10). Two cubings and an addition over F_{3^m} allow us to update the coordinates of point P and to determine the coefficient t^(i+1) of the sparse operand S^(i+1), respectively.
Recall that the ηT pairing over F_{3^m} comes in two flavors: the original one involves a cubing over F_{3^{6m}} after each sparse multiplication. Barreto et al. [3] explained how to get rid of that cubing at the price of two cube roots over F_{3^m} to update the coordinates of point Q. It is essential to consider such an algorithm here, as an extra cubing over F_{3^{6m}} would put even more strain on the first task (which is already the most expensive one). According to our results, the critical path of the circuit is never located in a cube root operator when pairing-friendly irreducible trinomials or pentanomials [2], [20] are used to define F_{3^m}. If such polynomials happen not to be available for the considered extension of F_3 and the critical path is in the cube root, it is always possible to pipeline this operation. Therefore, the cost of cube roots is hidden by the first task.

The hardware implementation is rather straightforward
(Fig. 1): four registers, a cubing operator, and a cube root operator allow us to store and update the coordinates of points P and Q. Then, a two-operand adder computes the sum of x_P^(i) and x_Q^(i), and the result t^(i) is stored in a fifth register. Multiplexers select the inputs of the parallel multiplier according to our scheduling.
3.1.3 Initialization
The initialization step of the ηT pairing (Algorithm 1, lines 1 and 2) involves a small amount of specific hardware in order to compute x_P^(0), y_P^(0), x_Q^(0), and y_Q^(0). Note that we are able to send t^(i), λy_P^(i), and −λy_Q^(i) to input M1 of the parallel multiplier (Fig. 1). Assuming that the constants 0, 1, and 2 are stored in the DPRAM block connected to input M0, we compute the coefficients of R^(−1) by means of six multiplications over F_{3^m}:

r_0^(−1) = t^(0) · λy_P^(0);            r_3^(−1) = 0 · t^(0) = 0;
r_1^(−1) = 1 · (−λy_Q^(0));             r_4^(−1) = 0 · t^(0) = 0; and
r_2^(−1) = 2 · λy_P^(0) = −λy_P^(0);    r_5^(−1) = 0 · t^(0) = 0.

Loading the coordinates of points P and Q and performing the initialization step involve 17 clock cycles (i.e., exactly the same number of clock cycles as an iteration of Miller's algorithm). Therefore, our coprocessor returns R^((m−1)/2) after 17 · (m + 3)/2 clock cycles.
3.2 Final Exponentiation
The second and last stage in the computation of the ηT pairing is the final exponentiation, where the result of Miller's algorithm, R^((m−1)/2) = ηT(P, Q), is raised to the Mth power (Algorithm 1, line 12). This exponentiation is necessary since the nonreduced pairing ηT(P, Q) is only defined up to Nth powers in F*_{3^{6m}}.
3.2.1 Improved Algorithm
In order to compute this final exponentiation, we use the algorithm presented by Beuchat et al. [8]. This method exploits the special form of the exponent M (see Table 1) to achieve better performance than a classical square-and-multiply algorithm. Among other computations, this final exponentiation involves raising an element of F*_{3^{6m}} to the power 3^((m+1)/2), which Beuchat et al. [8] perform by computing (m + 1)/2 successive cubings over F*_{3^{6m}}. Since each of these cubings requires six cubings and six additions over F_{3^m}, the total cost of this step is 3m + 3 cubings and 3m + 3 additions.
We present here a new method for computing U^(3^((m+1)/2)) for U = u_0 + u_1σ + u_2ρ + u_3σρ + u_4ρ² + u_5σρ² ∈ F*_{3^{6m}}, by exploiting the linearity of the Frobenius map (i.e., cubing in characteristic three) to reduce the number of additions. Indeed, noting that σ^(3^i) = (−1)^i σ, ρ^(3^i) = ρ + ib, and (ρ²)^(3^i) = ρ² − ibρ + i², we obtain the following formula for U^(3^i), depending on the value of i:

U^(3^i) = (u_0 − ν_1 u_2 + ν_2 u_4)^(3^i) + ν_3 (u_1 − ν_1 u_3 + ν_2 u_5)^(3^i) σ + (u_2 + ν_1 u_4)^(3^i) ρ + ν_3 (u_3 + ν_1 u_5)^(3^i) σρ + u_4^(3^i) ρ² + ν_3 u_5^(3^i) σρ²,

with ν_1 = −ib mod 3, ν_2 = i² mod 3, and ν_3 = (−1)^i. Thus, according to the value of (m + 1)/2 modulo 6, the computation of U^(3^((m+1)/2)) will still require 3m + 3 cubings but at most only six additions or subtractions over F_{3^m}.

Fig. 1. Coprocessor for the ηT pairing in characteristic three. (N.B. All control bits c_i belong to {0, 1}.)
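The Frobenius action underlying this formula can be checked numerically in the smallest instance of the tower. The sketch below (ours, not part of the paper's design) works in F_27 = F_3[ρ]/(ρ³ − ρ − b) with b = 1, and verifies that ρ^(3^i) = ρ + ib and (ρ²)^(3^i) = ρ² − ibρ + i² (using b² = 1).

```python
# Elements of F_27 = F_3[x]/(x^3 - x - B), B = 1, as coefficient
# triples [c0, c1, c2] representing c0 + c1*rho + c2*rho^2.
B = 1  # curve coefficient b, with b^2 = 1

def mul27(u, v):
    """Product in F_3[x]/(x^3 - x - B)."""
    c = [0] * 5
    for i in range(3):
        for j in range(3):
            c[i + j] = (c[i + j] + u[i] * v[j]) % 3
    # reduction: x^3 = x + B and x^4 = x^2 + B*x
    return [(c[0] + B * c[3]) % 3,
            (c[1] + c[3] + B * c[4]) % 3,
            (c[2] + c[4]) % 3]

def cube27(u):
    """Frobenius map (cubing) in characteristic three."""
    return mul27(mul27(u, u), u)

rho, rho2 = [0, 1, 0], [0, 0, 1]
x, y = rho, rho2
for i in range(1, 7):
    x, y = cube27(x), cube27(y)
    assert x == [(i * B) % 3, 1, 0]             # rho^(3^i) = rho + i*b
    assert y == [(i * i) % 3, (-i * B) % 3, 1]  # rho^2 - i*b*rho + i^2
```

Since ρ^(3^i) depends on i only modulo 3 and σ^(3^i) on i only modulo 2, the constants ν_1, ν_2, ν_3 repeat with period 6, which is why only the value of (m + 1)/2 modulo 6 matters above.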
3.2.2 Hardware Implementation
Our first attempt at computing the final exponentiation was to use the unified arithmetic operator introduced by Beuchat et al. [8]. Unfortunately, due to the sequential scheduling inherent to this operator, it turned out that the final exponentiation algorithm required more clock cycles than the computation of Miller's algorithm by our coprocessor. We, therefore, had to consider a slightly more parallel architecture.
Noticing that the critical operations in the final exponentiation algorithm were multiplications and long sequences of cubings over F_{3^m}, we designed the coprocessor for arithmetic over F_{3^m} depicted in Fig. 3. Besides a register file implemented by means of DPRAM, our coprocessor embeds a parallel–serial multiplier [45] processing D coefficients of an operand at each clock cycle (typically, D = 13 or 14), along with a novel unified operator supporting addition, subtraction, accumulation, Frobenius map (i.e., cubing), and double Frobenius map (i.e., raising to the ninth power). This architecture allowed us to efficiently implement the final exponentiation algorithm described, for instance, in [8], while taking advantage of the improvement proposed above.
3.2.3 Using Inverse Frobenius Maps
We adapt here the idea behind the square-root-and-multiply algorithm for exponentiation over binary finite fields given by Rodríguez-Henríquez et al. in [37].

From the final exponentiation algorithm given in [8], it can be noticed that Frobenius maps over F_{3^m} (i.e., cubings) are needed only to perform an inversion over F*_{3^m} [8, Algorithms 10 and 11] and for raising an element of F*_{3^{6m}} to the power 3^((m+1)/2), as already discussed in Section 3.2.1.

As far as the inversion is concerned, note that the first two lines of [8, Algorithm 10] are dedicated to computing u = d^((3^m − 3)/2) thanks to successive cubings and multiplications, where d ∈ F*_{3^m} is the element to be inverted. It is actually also possible to compute that same element by means of only cube roots and multiplications by making use of the identity z^(1/3^i) = z^(3^(m−i)) for all z ∈ F_{3^m}, derived from Fermat's little theorem. Indeed, considering the family of elements (w_i)_{0 ≤ i < m} such that w_i = d^((3^m − 3^(m−i))/2), we note that w_1 = d^(3^(m−1)) = d^(1/3) and u = d^((3^m − 3)/2) = w_{m−1}. Since, for any two integers n and n′, we also have

w_{n+n′} = d^((3^m − 3^(m−n))/2) · d^((3^(m−n) − 3^(m−n−n′))/2) = w_n · (w_{n′})^(1/3^n),

it follows that, given an addition chain of length l for m − 1, we can compute u = w_{m−1} in l multiplications and m − 1 cube roots over F_{3^m}. This has to be compared to the l multiplications and m − 1 cubings required in [8, Algorithm 10] to obtain the same u.

Fig. 2. Scheduling of Miller's algorithm in characteristic three. (N.B. The numerals 1 to 5 refer to the datapaths between the output of the four-operand adder and the inputs of the parallel multiplier in Fig. 1.)
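The exponent bookkeeping behind this identity can be verified with plain integer arithmetic, since taking a 3^n-th root in F*_{3^m} amounts to multiplying the exponent by 3^(m−n) modulo the group order 3^m − 1. The following check is ours and uses an arbitrary field size m = 97 for illustration.

```python
# With w_i = d^((3^m - 3^(m-i))/2), check that the exponent of
# w_n * (w_k)^(1/3^n) matches that of w_{n+k}, where the 3^n-th root
# acts as raising to the power 3^(m-n) (Fermat's little theorem).
m = 97
M = 3**m - 1                     # order of the multiplicative group F*_{3^m}

def e(i):
    """Exponent of w_i = d^((3^m - 3^(m-i))/2)."""
    return (3**m - 3**(m - i)) // 2

for n in range(1, 20):
    for k in range(1, m - n):
        assert (e(n) + e(k) * pow(3, m - n, M)) % M == e(n + k) % M

# Likewise, raising to the power 3^((m+1)/2) is undone by (m-1)/2 more
# cubings, so it equals (m-1)/2 successive cube roots:
assert (pow(3, (m + 1) // 2, M) * pow(3, (m - 1) // 2, M)) % M == 1
```

The last assertion is the exponent form of the trade discussed in Section 3.2.3: z^(3^((m+1)/2)) can be obtained from z by (m − 1)/2 cube roots instead of (m + 1)/2 cubings.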
As for raising an element of F*_{3^{6m}} to the power 3^((m+1)/2), also part of the final exponentiation algorithm, we simply apply Fermat's little theorem once more to see that z^(3^((m+1)/2)) = z^(1/3^((m−1)/2)) for all z ∈ F_{3^m}. Thus, we can directly trade the 3m + 3 required cubings (as explained in the analysis given in Section 3.2.1) for 3m − 3 cube roots over F_{3^m}.
Hence, from the previous considerations, it is possible to replace all Frobenius maps (cubings) by inverse Frobenius maps (cube roots) in the final exponentiation. This is particularly interesting since the irreducible polynomial used to represent F_{3^m} was carefully chosen to allow for low-complexity cubings and cube roots, as both are required for the computation of the nonreduced ηT pairing. Furthermore, it appears that, for the considered irreducible trinomials, the complexity of the cube root is always lower than that of the cubing. This is shown in Table 2, where the third column reports the total number of required additions/subtractions over F_3, and the fourth column indicates the largest number of elements of F_3 that need to be added/subtracted to one another to compute a coefficient of the result (for instance, cubing over F_{3^97} requires summing at most four elements of F_3 at a time).

In order to assess the impact of replacing the Frobenius
and double-Frobenius operators by inverse-Frobenius (cube root) and double-inverse-Frobenius (ninth root) operators in the architecture presented in Fig. 3, we implemented the different variants on Xilinx Virtex-4 LX FPGAs (xc4vlx40-11). The place-and-route results, reported in the last two columns of Table 2, show that the use of cube roots usually shortens the critical path, even though the circuits are then slightly larger, as the cubing formulas generally involve more common subexpressions, which can then share the same logic and decrease the total resource usage. All in all, it appears that relying on inverse Frobenius maps to compute the final exponentiation is by and large an effective optimization.
4 REDUCED ηT PAIRING IN CHARACTERISTIC TWO
An approach similar to that of characteristic three allowed us to design a parallel coprocessor for the reduced ηT pairing in characteristic two, as depicted in Fig. 4. The supersingular curves and the parameters of the algorithm are summarized in Table 1.

Fig. 3. Coprocessor for the final exponentiation of the ηT pairing in characteristic three. (N.B. All control bits c_i belong to {0, 1}.)
4.1 Computation of Miller’s Algorithm
Applying the strategy that we used for characteristic three to the case of characteristic two, we adopted the reversed-loop algorithm described in [7], which we recall here in Algorithm 3. However, the scheduling turns out to be slightly more difficult than in characteristic three, since we have to perform three tasks in parallel at each iteration of Miller's algorithm.
4.1.1 Sparse Multiplication over F_{2^{4m}}
The intermediate result F^(i−1) computed during the previous iteration is multiplied by the sparse operand G^(i) = g_0^(i) + g_1^(i) s + t by means of a parallel Karatsuba multiplier (Algorithm 3, lines 13 and 14). This operation is easier than the standard multiplication over F_{2^{4m}} and requires only 6 multiplications and 14 additions over the base field F_{2^m}, as explained in Algorithm 4. Thanks to the careful scheduling given in Fig. 5, the intermediate variables a_0^(i), a_1^(i), and a_2^(i) (Algorithm 4, lines 1 and 2) are generated on-the-fly by means of two accumulators. (Note that in order to share the control bits c15 and c16 between both accumulators, we compute a_0^(i) twice at each iteration of Miller's algorithm.) The addition of at most four operands belonging to F_{2^m} allows for computing f_j^(i), 0 ≤ j ≤ 3 (Algorithm 4, lines 6–9).
It is worth mentioning that the datapath between the output of the four-operand adder and the parallel multiplier is much simpler than in characteristic three: it suffices to delay f_j^(i), 0 ≤ j ≤ 3, by one clock cycle and there is, therefore, no need for a memory block to store the operands of the multiplier. Dealing with inputs A0 and A1 of the four-operand adder is unfortunately more difficult because of data dependencies between the coefficients of F^(i−1) and F^(i) in Algorithm 4. According to our scheduling, we update the coefficients of F^(i) in the following order: f_0^(i), f_1^(i), f_2^(i), and, eventually, f_3^(i). Since f_2^(i) and f_3^(i) depend on f_0^(i−1) and f_1^(i−1), respectively, we have to keep a copy of those values until the end of the ith iteration of Miller's algorithm. Instead of including a DPRAM block in our design, we propose a solution based on two small FIFOs (see Fig. 6 for details). An advantage of characteristic two over characteristic three is that the register file is smaller in terms of circuit area and requires fewer control bits.
4.1.2 Computation of g0 and g1
Both coefficients g_0^(i+1) and g_1^(i+1) (Algorithm 3, lines 15 and 16) are involved in the next sparse multiplication and, thus, have to be computed beforehand. The product u^(i+1) · v^(i+1) is evaluated by means of our parallel Karatsuba multiplier. Then, two-operand adders allow for working out g_0^(i+1) and g_1^(i+1).
4.1.3 Computation of u, v, and w
The three values u^(i+2), v^(i+2), and w^(i+2) (Algorithm 3, lines 17–20) will be required to calculate g_0^(i+2) and g_1^(i+2) during the next iteration of Miller's algorithm. Here, we have to work with the coordinates of points P and Q, which are stored in a FIFO and updated by means of squaring and square root operators, respectively (see Section 4.1.6). Then, an accumulator allows for computing w^(i+2) in two clock cycles. Depending on the values of two curve-dependent parameters (see Table 1), 1-bit adders (i.e., XOR gates) are necessary to calculate the least significant bit of u^(i+2), v^(i+2), and w^(i+2).
4.1.4 Choosing the Adequate Karatsuba Multiplier
In these settings, a Karatsuba multiplier with five pipeline stages can be kept busy during the computation of the main loop, as shown in Fig. 5. Since we have to carry out seven multiplications over F_{2^m} at each iteration, the calculation time for the full loop is equal to 7 · (m + 1)/2 clock cycles. It is again crucial to consider an algorithm with inverse Frobenius maps (i.e., square roots) in order to avoid squaring F^(i) at each iteration of Miller's algorithm (see, for instance, [7] for a survey of algorithms for the Tate pairing over supersingular curves in characteristic two). Such an operation would lengthen the computation time, and pipeline bubbles would be inserted in the multiplier.
TABLE 2. Frobenius versus Inverse Frobenius Maps in Characteristic Three and Their Influence on the Coprocessor of Fig. 3
4.1.5 Initialization
The initialization step requires specific attention. In order to start multiplying u^(0) by v^(0) as soon as possible (Algorithm 3, line 5), we load the coordinates of points P and Q in the following order: x_P, x_Q, y_P, and y_Q. Thus, u^(0) and v^(0) are available after two clock cycles. Thanks to this scheduling, we complete the initialization step in 15 clock cycles.
4.1.6 Irreducible Pentanomials Suitable for Low-Complexity Square Root Computation
Fig. 4. Coprocessor for the ηT pairing in characteristic two. (N.B. All control bits c_i belong to {0, 1}.)

Although irreducible trinomials allowing for simple computations of squarings and square roots exist in some binary finite fields, as detailed in [37], this was not the case for several fields considered in this work (see Table 3). To tackle this issue, we present here a novel family of irreducible square-root-friendly pentanomials that, to the best of our
knowledge, has not been proposed before in the literature.

Let m and d be two odd positive integers with d < m/2 and such that the degree-m monic polynomial

f(x) = x^m + x^(m−d) + x^(m−2d) + x^d + 1

is irreducible over F_2. We then represent the binary extension field F_{2^m} as F_2[x]/(f(x)).

Note that, reducing modulo f, we have x^(m+2d+1) + x^(m+1) + x^(m−2d+1) + x^(3d+1) = x, where all the exponents on the left-hand side are even. It then follows that

√x = x^((m+2d+1)/2) + x^((m+1)/2) + x^((m−2d+1)/2) + x^((3d+1)/2).

Therefore, using this expression for √x, we can compute the square root of an element a ∈ F_{2^m} as [16]

√a = Σ_{i=0}^{(m−1)/2} a_{2i} x^i + √x · Σ_{i=0}^{(m−3)/2} a_{2i+1} x^i mod f(x).

Furthermore, one can show that if (2m − 1)/7 ≤ d ≤ (2m + 1)/5, then the complete computation of the square root (i.e., including the reduction modulo f) will require the addition of at most three elements of F_2 at a time for each coefficient of the result. Finally, choosing d ≥ (m − 1)/6 ensures that a squaring will involve only additions of at most four operands.

As reported in Table 3, three pentanomials of this family have been selected to represent the finite fields F_{2^557}, F_{2^613}, and F_{2^691}, with d = 197, 185, and 243, respectively.
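The reduction identity underlying the √x expression can be checked mechanically for the three selected pentanomials. The sketch below is ours: it encodes binary polynomials as Python integers (bit i standing for x^i) and verifies the congruence and the evenness of the left-hand-side exponents.

```python
# Check of the identity behind the square-root formula: for
# f(x) = x^m + x^(m-d) + x^(m-2d) + x^d + 1 over F_2,
# x^(m+2d+1) + x^(m+1) + x^(m-2d+1) + x^(3d+1) is congruent to x mod f.

def pmod(a, f):
    """Remainder of a modulo f in F_2[x] (polynomials as bit masks)."""
    df = f.bit_length() - 1
    while a.bit_length() - 1 >= df:
        a ^= f << (a.bit_length() - 1 - df)
    return a

for m, d in [(557, 197), (613, 185), (691, 243)]:
    f = (1 << m) | (1 << (m - d)) | (1 << (m - 2 * d)) | (1 << d) | 1
    lhs = ((1 << (m + 2 * d + 1)) ^ (1 << (m + 1))
           ^ (1 << (m - 2 * d + 1)) ^ (1 << (3 * d + 1)))
    assert pmod(lhs, f) == 2  # the polynomial x
    # all exponents on the left-hand side are even, as claimed
    assert all(v % 2 == 0 for v in (m + 2*d + 1, m + 1, m - 2*d + 1, 3*d + 1))
```

Since m and d are both odd, every exponent in the left-hand side is even, so each term has an exact square root in F_2[x], which is what makes the closed-form expression for √x possible.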
4.2 Final Exponentiation
The final exponentiation (Algorithm 3, line 22) is carried out according to the algorithms proposed in [7] and [42], [43], for the two values (1 and −1, respectively) of the curve-dependent sign parameter. We took advantage of the algorithm introduced in [12] when raising to the (2^m + 1)st power over F_{2^{4m}}. Here, again, the linearity of the Frobenius map allows us to reduce the number of additions when computing U^(2^((m+1)/2)) for U = u_0 + u_1 s + u_2 t + u_3 st ∈ F*_{2^{4m}}. Noting that s^(2^i) = s + ν_1 and t^(2^i) = t + ν_1 s + ν_2, where ν_1 = i mod 2 and ν_2 = ⌊i/2⌋ mod 2, we obtain the following formula for U^(2^i), depending on the value of i modulo 4:

U^(2^i) = (u_0 + ν_1 u_1 + ν_2 u_2 + ν_3 u_3)^(2^i) + (u_1 + ν_1 u_2 + ν_2 u_3)^(2^i) s + (u_2 + ν_1 u_3)^(2^i) t + u_3^(2^i) st,

where ν_3 = 1 when i mod 4 = 1, and ν_3 = 0 otherwise. According to the value of (m + 1)/2 mod 4, the computation of U^(2^((m+1)/2)) requires 2m + 2 squarings and at most four additions over F_{2^m}.

Here, again, a hybrid arithmetic operator, similar to the one used in the case of characteristic three (see Fig. 3), allows us to perform the final exponentiation in slightly fewer clock cycles than Miller's algorithm without impacting too much on the resource usage. The architecture is very similar to that of characteristic three, except that we removed the multiplications by 1 and −1, which are useless in characteristic two, and replaced the double cubing by a triple-squaring operator, to accommodate the longer chains of successive Frobenius maps in the final exponentiation algorithm. The parallel–serial multiplier processes here between D = 15 and D = 17 coefficients of its second operand per clock cycle.
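The tower relations used in the derivation above can be checked directly. The sketch below is ours: it exploits the F_2-linearity of squaring, so the map is determined by its action on the basis {1, s, t, st} of the tower defined by s² = s + 1 and t² = t + s (namely 1 → 1, s → s + 1, t → t + s, st → st + t + 1), and verifies the constants ν_1 and ν_2.

```python
# Elements of the tower are 4-tuples (c0, c1, c2, c3) = c0 + c1*s + c2*t + c3*st.

def sq(c):
    """Squaring, via its action on the basis: 1->1, s->s+1, t->t+s, st->st+t+1."""
    c0, c1, c2, c3 = c
    return (c0 ^ c1 ^ c3, c1 ^ c2, c2 ^ c3, c3)

s, t = (0, 1, 0, 0), (0, 0, 1, 0)
x, y = s, t
for i in range(1, 9):
    x, y = sq(x), sq(y)
    nu1, nu2 = i % 2, (i // 2) % 2
    assert x == (nu1, 1, 0, 0)    # s^(2^i) = s + nu1
    assert y == (nu2, nu1, 1, 0)  # t^(2^i) = t + nu1*s + nu2
```

Both sequences repeat with period 4, which is why only the value of (m + 1)/2 modulo 4 matters in the operation count above.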
It is worth noting that the trick of trading Frobenius maps for
inverse Frobenius maps, used for the final exponentiation in
characteristic three, can also be applied to the case of
characteristic two. Indeed, as reported in Table 3, the
complexity of the square root is always lower than that of the
squaring over the considered finite fields.
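Why the square root (inverse Frobenius) is cheap can be seen directly: writing a = a_even(x^2) + x · a_odd(x^2), the linearity of the square root gives √a = a_even(x) + √x · a_odd(x), where √x = x^{2^{m−1}} mod f is a fixed and typically sparse constant. A minimal sketch, again over the illustrative NIST B-163 pentanomial rather than a field of Table 3, with hypothetical helper names:

```python
M, F = 163, (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1  # NIST B-163 pentanomial

def clmul(a, b):  # carry-less (polynomial) product over F_2
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def red(a):  # reduction modulo f
    for i in range(a.bit_length() - 1, M - 1, -1):
        if (a >> i) & 1:
            a ^= F << (i - M)
    return a

def sq(a):
    return red(clmul(a, a))

# Precompute sqrt(x) = x^(2^(M-1)) mod f, once per field.
SQRT_X = 2  # the element x
for _ in range(M - 1):
    SQRT_X = sq(SQRT_X)

def sqrt(a):
    # De-interleave a into its even- and odd-indexed coefficients...
    ae = ao = 0
    for i in range((a.bit_length() + 1) // 2):
        ae |= ((a >> (2 * i)) & 1) << i
        ao |= ((a >> (2 * i + 1)) & 1) << i
    # ...then sqrt(a) = ae + sqrt(x) * ao, by linearity of the square root.
    return ae ^ red(clmul(SQRT_X, ao))

a = 0x0123456789ABCDEF0123456789ABCDEF
assert sq(sqrt(a)) == a and sqrt(sq(a)) == a
print("square-root model OK")
```

In hardware, the de-interleaving is free (wiring), so the whole map reduces to one multiplication by a sparse constant, which is consistent with the square root being cheaper than the squaring in Table 3.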
However, putting this optimization into practice happens
to be more complex than in characteristic three. Apart from
the Frobenius maps required for the inversion over F*_{2^m}
and for the raising to the power 2^{(m+1)/2} over F*_{2^{4m}},
further squarings are actually necessary to raise elements
of F*_{2^{4m}} to the (2^{2m} − 1)-st and (2^m + 1)-st powers, as per
[7, Algorithm 3] and [12, Algorithm 3], respectively.
These few squarings could be replaced by actual
multiplications, which would then slightly increase the
number of clock cycles required to compute the final
exponentiation. Alternatively, we could take the fourth root
of the result, which, by linearity of the square root, would
cancel all the extra Frobenius maps; but we would then
end up computing a fixed power of the ηT pairing and
not the ηT pairing itself.
However, observing that, in characteristic two, the critical
path of the whole pairing accelerator lies in the nonreduced-
pairing coprocessor and not in the final-exponentiation one,
this is actually a moot point: there is no use in trying to
shorten the critical path of the final exponentiation any further.
BEUCHAT ET AL.: FAST ARCHITECTURES FOR THE ηT PAIRING OVER SMALL-CHARACTERISTIC SUPERSINGULAR ELLIPTIC CURVES 275
Fig. 5. Scheduling of Miller’s algorithm in characteristic two.
We, therefore, decided against using this optimization
altogether in the case of characteristic two.
5 RESULTS AND COMPARISONS
5.1 Comparison with Previous Works
Thanks to our automatic VHDL code generator, we
designed several versions of the proposed architectures
and prototyped our coprocessors on Xilinx Virtex-II Pro and
Virtex-4 LX FPGAs with average speedgrade. Table 4 details
the specifics of the considered supersingular curves, while
Table 5 provides the reader with a comparison between our
work and accelerators for the Tate and ηT pairings over
supersingular (hyper)elliptic curves published in the open
literature. (Note that our comparison remains fair since the
Tate pairing can be computed from the �T pairing at no
extra cost [7].) Finally, these results are summarized in
Fig. 7, where post-place-and-route computation time and
area–time product estimations are plotted against the
achieved level of security.
In the presented benchmarks, the logic resource usage is
given in terms of slices, which is the usual metric on Xilinx
FPGAs. Each slice comprises two four-input lookup tables
and two 1-bit flip-flops. Furthermore, it is worth noting
that even though our coprocessors also make use of some
embedded memory blocks as register files, they are by far
not a critical resource and are, therefore, not reported in
the benchmarks.
Our architectures are also much faster than software
implementations. Mitsunari wrote a very careful multi-
threaded implementation of the ηT pairing over F_{3^97} and
F_{3^193} [35]. He reported computation times of 92 and 553 μs,
respectively, on an Intel Core 2 Duo processor (2.66 GHz).
276 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 2, FEBRUARY 2011
Fig. 6. Contents of the register file of our coprocessor for the ηT pairing in characteristic two.
TABLE 3. Frobenius versus Inverse Frobenius Maps in Characteristic Two
Interestingly enough, his software library outperforms
several hardware architectures proposed by other researchers
for low levels of security. When we compare his results
with our work, we note that the gap between software and
hardware increases when considering larger values of m.
The computation of the ηT pairing over F_{3^193} on a Virtex-4
LX FPGA with a medium speedgrade is, for instance,
roughly 50 times faster than in software. This speedup
justifies the use of large FPGAs, which are now available
in servers and supercomputers such as the SGI Altix 4700
platform.
Kammler et al. [26] reported the first hardware
implementation of the Optimal Ate pairing [46] over a
Barreto–Naehrig (BN) curve [5], that is, an ordinary curve
defined over a prime field F_p with embedding degree
k = 12. The proposed design is implemented with a 130 nm
standard-cell library and computes a pairing in 15.8 ms
over a 256-bit BN curve. It is, however, difficult to make a
fair comparison between our respective works, since the
level of security and the target technology are not the same.
5.2 Characteristic Two versus Characteristic Three
It is worth noting that, in order to achieve the same level of
security for the ηT pairing over supersingular curves in
characteristics two and three, the extension degree m of F_{2^m}
has to be larger than that of F_{3^{m'}}. More precisely, we have
the ratio

   m / m' = (3 log 3) / (2 log 2) ≈ 2.4,

since the embedding degree is six in characteristic three,
against four in characteristic two. This ratio also applies
asymptotically to the number of iterations in Miller's
algorithm, which is (m + 1)/2 and (m' + 1)/2, respectively.

However, the arithmetic over F_{2^{4m}} required for the
computation of the pairing in characteristic two is much
simpler than the arithmetic over F_{3^{6m'}}: one iteration of
Miller's algorithm requires only 7 multiplications over F_{2^m},
against 17 multiplications over F_{3^{m'}} in the case of
characteristic three. Coincidentally, the ratio between the
two is also 17/7 ≈ 2.4.
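The two ratios are easy to check numerically; equal security means matching the sizes of the full extension fields F_{2^{4m}} and F_{3^{6m'}}, i.e., 4m log 2 = 6m' log 3:

```python
from math import log

# Extension-degree ratio m/m' = 6 log 3 / (4 log 2) = 3 log 3 / (2 log 2)
degree_ratio = 3 * log(3) / (2 * log(2))
# Ratio of base-field multiplications per Miller iteration (17 vs. 7)
mult_ratio = 17 / 7

assert abs(degree_ratio - 2.38) < 0.01
assert abs(mult_ratio - 2.43) < 0.01
print(f"m/m' = {degree_ratio:.2f}, 17/7 = {mult_ratio:.2f}")  # both ~2.4
```

The near coincidence of the two values is what makes the overall base-field multiplication counts of the two characteristics almost identical.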
Thus, although necessitating 2.4 times as many iterations
as in characteristic three, the ηT pairing over F_{2^m} requires
almost exactly as many products over the base field as the
ηT pairing over F_{3^{m'}}. Furthermore, a smaller extension
degree m' compensates for the arithmetic over F_3 being
more expensive than that over F_2.
That close similarity in terms of performance between
characteristics two and three at a constant level of security,
as hinted at by this short analysis, can actually be observed
in the place-and-route results of our coprocessors (Fig. 7),
even though characteristic two appears to have a slight
advantage at low security levels.
6 CONCLUSION
We proposed novel architectures based on a parallel
pipelined Karatsuba multiplier for the ηT pairing in
characteristics two and three. The main design challenge
we faced was to keep the pipeline continuously busy.
Accordingly, we modified the scheduling of Miller's
algorithm in order to introduce more parallelism in the pairing
computation. We also presented a faster way to perform
the final exponentiation by exploiting the linearity of the
Frobenius map and/or taking advantage of a simpler inverse
Frobenius map in certain cases. Both software and hardware
implementations can benefit from these techniques.
To our knowledge, the implementation of our designs on
several Xilinx FPGA devices improved both the computation
time and the area–time trade-off of all the hardware pairing
coprocessors previously published in the open literature [1],
[7], [11], [19], [24], [28], [29], [30], [31], [38], [39], [40], [42], [43].

TABLE 4. Supersingular Elliptic Curves Considered in This Paper
However, as of today, the design of pairing accelerators
providing a level of security equivalent to that of AES-128
remains a problem of major interest. Although Kammler et al.
[26] proposed a first solution over a Barreto–Naehrig curve,
several questions remain open. For instance, is it possible to
achieve such a level of security in hardware with
supersingular (hyper)elliptic curves at a reasonable cost in terms of
computation time and circuit area? Since several protocols
rely on such curves, it seems crucial to us to address this topic
in the near future.
Another interesting direction for further work is to
investigate the use of the hybrid operator (Fig. 3) to compute
the complete Tate pairing and not only the final
exponentiation. From our experiments, this operator should offer a
competitive balance between the area-efficient unified
operators of [7] and the latency-oriented architectures
presented here.
TABLE 5. Hardware Accelerators for the Tate and ηT Pairings
ACKNOWLEDGMENTS
The authors would like to thank Nidia Cortez-Duarte along with the anonymous referees for their valuable comments. This work was supported by the Japan-France Integrated Action Program (Ayame Junior/Sakura). The authors would also like to express their warmest thanks to Tonito for his tasty contribution to this paper. Many thanks for your delicious tacos!
REFERENCES
[1] A. Barenghi, G. Bertoni, L. Breveglieri, and G. Pelosi, "A FPGA Coprocessor for the Cryptographic Tate Pairing over F_p," Proc. Fourth Int'l Conf. Information Technology: New Generations (ITNG '08), 2008.
[2] P.S.L.M. Barreto, "A Note on Efficient Computation of Cube Roots in Characteristic 3," Report 2004/305, Cryptology ePrint Archive, 2004.
[3] P.S.L.M. Barreto, S.D. Galbraith, C. Ó hÉigeartaigh, and M. Scott, "Efficient Pairing Computation on Supersingular Abelian Varieties," Designs, Codes and Cryptography, vol. 42, pp. 239-271, 2007.
[4] P.S.L.M. Barreto, H.Y. Kim, B. Lynn, and M. Scott, "Efficient Algorithms for Pairing-Based Cryptosystems," Advances in Cryptology—CRYPTO '02, M. Yung, ed., pp. 354-368, 2002.
[5] P.S.L.M. Barreto and M. Naehrig, "Pairing-Friendly Elliptic Curves of Prime Order," Proc. Int'l Workshop Selected Areas in Cryptography (SAC '05), B. Preneel and S. Tavares, eds., pp. 319-331, 2006.
[6] G. Bertoni, L. Breveglieri, P. Fragneto, and G. Pelosi, "Parallel Hardware Architectures for the Cryptographic Tate Pairing," Proc. Third Int'l Conf. Information Technology: New Generations (ITNG '06), 2006.
[7] J.-L. Beuchat, N. Brisebarre, J. Detrey, E. Okamoto, and F. Rodríguez-Henríquez, "A Comparison between Hardware Accelerators for the Modified Tate Pairing over F_{2^m} and F_{3^m}," Proc. Int'l Conf. Pairing-Based Cryptography—Pairing '08, S.D. Galbraith and K.G. Paterson, eds., pp. 297-315, 2008.
[8] J.-L. Beuchat, N. Brisebarre, J. Detrey, E. Okamoto, M. Shirase, and T. Takagi, "Algorithms and Arithmetic Operators for Computing the ηT Pairing in Characteristic Three," IEEE Trans. Computers, vol. 57, no. 11, pp. 1454-1468, Nov. 2008.
[9] J.-L. Beuchat, N. Brisebarre, M. Shirase, T. Takagi, and E. Okamoto, "A Coprocessor for the Final Exponentiation of the ηT Pairing in Characteristic Three," Proc. Int'l Workshop Arithmetic of Finite Fields (WAIFI '07), C. Carlet and B. Sunar, eds., pp. 25-39, 2007.
[10] J.-L. Beuchat, J. Detrey, N. Estibals, E. Okamoto, and F. Rodríguez-Henríquez, "Hardware Accelerator for the Tate Pairing in Characteristic Three Based on Karatsuba-Ofman Multipliers," Proc. Workshop Cryptographic Hardware and Embedded Systems (CHES '09), C. Clavier and K. Gaj, eds., pp. 225-239, 2009.
[11] J.-L. Beuchat, H. Doi, K. Fujita, A. Inomata, P. Ith, A. Kanaoka, M. Katouno, M. Mambo, E. Okamoto, T. Okamoto, T. Shiga, M. Shirase, R. Soga, T. Takagi, A. Vithanage, and H. Yamamoto, "FPGA and ASIC Implementations of the ηT Pairing in Characteristic Three," Computers and Electrical Eng., vol. 36, no. 1, pp. 73-87, Jan. 2010.
Fig. 7. Performance comparison in terms of (a) Pairing computation time (top) and (b) area–time (AT) product (bottom) between the proposed
architectures and the coprocessors published in the literature, on Virtex-II Pro (left) and Virtex-4 (right) FPGAs.
[12] J.-L. Beuchat, E. López-Trejo, L. Martínez-Ramos, S. Mitsunari, and F. Rodríguez-Henríquez, "Multi-Core Implementation of the Tate Pairing over Supersingular Elliptic Curves," Proc. Int'l Conf. Cryptology and Network Security (CANS '09), J.A. Garay, A. Miyaji, and A. Otsuka, eds., pp. 413-432, 2009.
[13] R. Dutta, R. Barua, and P. Sarkar, "Pairing-Based Cryptographic Protocols: A Survey," Report 2004/64, Cryptology ePrint Archive, 2004.
[14] I. Duursma and H.S. Lee, "Tate Pairing Implementation for Hyperelliptic Curves y^2 = x^p − x + d," Advances in Cryptology—ASIACRYPT '03, C.S. Laih, ed., pp. 111-123, 2003.
[15] H. Fan, J. Sun, M. Gu, and K.-Y. Lam, "Overlap-Free Karatsuba-Ofman Polynomial Multiplication Algorithm," Report 2007/393, Cryptology ePrint Archive, 2007.
[16] K. Fong, D. Hankerson, J. López, and A. Menezes, "Field Inversion and Point Halving Revisited," IEEE Trans. Computers, vol. 53, no. 8, pp. 1047-1059, Aug. 2004.
[17] S.D. Galbraith, K. Harrison, and D. Soldera, "Implementing the Tate Pairing," Proc. Int'l Symp. Algorithmic Number Theory—ANTS V, C. Fieker and D.R. Kohel, eds., pp. 324-337, 2002.
[18] E. Gorla, C. Puttmann, and J. Shokrollahi, "Explicit Formulas for Efficient Multiplication in F_{3^{6m}}," Proc. Int'l Workshop Selected Areas in Cryptography (SAC '07), C. Adams, A. Miri, and M. Wiener, eds., pp. 173-183, 2007.
[19] P. Grabher and D. Page, "Hardware Acceleration of the Tate Pairing in Characteristic Three," Proc. Workshop Cryptographic Hardware and Embedded Systems (CHES '05), J.R. Rao and B. Sunar, eds., pp. 398-411, 2005.
[20] D. Hankerson, A. Menezes, and M. Scott, Software Implementation of Pairings, chapter 12, pp. 188-206, IOS Press, 2009.
[21] G. Hanrot and P. Zimmermann, "A Long Note on Mulders' Short Product," J. Symbolic Computation, vol. 37, no. 3, pp. 391-401, 2004.
[22] F. Hess, "Pairing Lattices," Proc. Int'l Conf. Pairing-Based Cryptography—Pairing '08, S.D. Galbraith and K.G. Paterson, eds., pp. 18-38, 2008.
[23] F. Hess, N. Smart, and F. Vercauteren, "The Eta Pairing Revisited," IEEE Trans. Information Theory, vol. 52, no. 10, pp. 4595-4602, Oct. 2006.
[24] J. Jiang, "Bilinear Pairing (Eta_T Pairing) IP Core," technical report, Dept. of Computer Science, City Univ. of Hong Kong, May 2007.
[25] A. Joux, "A One Round Protocol for Tripartite Diffie-Hellman," Proc. Int'l Symp. Algorithmic Number Theory—ANTS IV, W. Bosma, ed., pp. 385-394, 2000.
[26] D. Kammler, D. Zhang, P. Schwabe, H. Scharwaechter, M. Langenberg, D. Auras, G. Ascheid, R. Leupers, R. Mathar, and H. Meyr, "Designing an ASIP for Cryptographic Pairings over Barreto-Naehrig Curves," Report 2009/056, Cryptology ePrint Archive, 2009.
[27] A. Karatsuba and Y. Ofman, "Multiplication of Multidigit Numbers on Automata," Soviet Physics Doklady (English Translation), vol. 7, no. 7, pp. 595-596, Jan. 1963.
[28] M. Keller, T. Kerins, F. Crowe, and W.P. Marnane, "FPGA Implementation of a GF(2^m) Tate Pairing Architecture," Proc. Int'l Workshop Applied Reconfigurable Computing (ARC '06), K. Bertels, J.M.P. Cardoso, and S. Vassiliadis, eds., pp. 358-369, 2006.
[29] M. Keller, R. Ronan, W.P. Marnane, and C. Murphy, "Hardware Architectures for the Tate Pairing over GF(2^m)," Computers and Electrical Eng., vol. 33, nos. 5/6, pp. 392-406, 2007.
[30] T. Kerins, W.P. Marnane, E.M. Popovici, and P.S.L.M. Barreto, "Efficient Hardware for the Tate Pairing Calculation in Characteristic Three," Proc. Int'l Workshop Cryptographic Hardware and Embedded Systems (CHES '05), J.R. Rao and B. Sunar, eds., pp. 412-426, 2005.
[31] G. Komurcu and E. Savas, "An Efficient Hardware Implementation of the Tate Pairing in Characteristic Three," Proc. Third Int'l Conf. Systems (ICONS '08), E. Prasolova-Førland and M. Popescu, eds., pp. 23-28, 2008.
[32] H. Li, J. Huang, P. Sweany, and D. Huang, "FPGA Implementations of Elliptic Curve Cryptography and Tate Pairing over a Binary Field," J. Systems Architecture, vol. 54, pp. 1077-1088, 2008.
[33] V.S. Miller, "Short Programs for Functions on Curves," http://crypto.stanford.edu/miller, 1986.
[34] V.S. Miller, "The Weil Pairing, and Its Efficient Calculation," J. Cryptology, vol. 17, no. 4, pp. 235-261, 2004.
[35] S. Mitsunari, "A Fast Implementation of ηT Pairing in Characteristic Three on Intel Core 2 Duo Processor," Report 2009/032, Cryptology ePrint Archive, 2009.
[36] S. Mitsunari, R. Sakai, and M. Kasahara, "A New Traitor Tracing," IEICE Trans. Fundamentals, vol. 2, pp. 481-484, Feb. 2002.
[37] F. Rodríguez-Henríquez, G. Morales-Luna, and J. López, "Low-Complexity Bit-Parallel Square Root Computation over GF(2^m) for All Trinomials," IEEE Trans. Computers, vol. 57, no. 4, pp. 472-480, Apr. 2008.
[38] R. Ronan, C. Murphy, T. Kerins, C. Ó hÉigeartaigh, and P.S.L.M. Barreto, "A Flexible Processor for the Characteristic 3 ηT Pairing," Int'l J. High Performance Systems Architecture, vol. 1, no. 2, pp. 79-88, 2007.
[39] R. Ronan, C. Ó hÉigeartaigh, C. Murphy, M. Scott, and T. Kerins, "FPGA Acceleration of the Tate Pairing in Characteristic 2," Proc. IEEE Int'l Conf. Field Programmable Technology (FPT '06), pp. 213-220, 2006.
[40] R. Ronan, C. Ó hÉigeartaigh, C. Murphy, M. Scott, and T. Kerins, "Hardware Acceleration of the Tate Pairing on a Genus 2 Hyperelliptic Curve," J. Systems Architecture, vol. 53, pp. 85-98, 2007.
[41] R. Sakai, K. Ohgishi, and M. Kasahara, "Cryptosystems Based on Pairing," Proc. 2000 Symp. Cryptography and Information Security (SCIS '00), pp. 26-28, Jan. 2000.
[42] C. Shu, S. Kwon, and K. Gaj, "FPGA Accelerated Tate Pairing Based Cryptosystem over Binary Fields," Proc. IEEE Int'l Conf. Field Programmable Technology (FPT '06), pp. 173-180, 2006.
[43] C. Shu, S. Kwon, and K. Gaj, "Reconfigurable Computing Approach for Tate Pairing Cryptosystems over Binary Fields," IEEE Trans. Computers, vol. 58, no. 9, pp. 1221-1237, Sept. 2009.
[44] J.H. Silverman, The Arithmetic of Elliptic Curves, Springer-Verlag, 1986.
[45] L. Song and K.K. Parhi, "Low Energy Digit-Serial/Parallel Finite Field Multipliers," J. VLSI Signal Processing, vol. 19, no. 2, pp. 149-166, July 1998.
[46] F. Vercauteren, "Optimal Pairings," Report 2008/096, Cryptology ePrint Archive, 2008.
[47] L.C. Washington, Elliptic Curves—Number Theory and Cryptography, second ed., CRC Press, 2008.
Jean-Luc Beuchat received the MSc and PhD degrees in computer science from the Swiss Federal Institute of Technology, Lausanne, in 1997 and 2001, respectively. He is an associate professor in the Graduate School of Systems and Information Engineering at the University of Tsukuba. His current research interests include computer arithmetic and cryptography.
Jérémie Detrey received the MSc and PhD degrees in computer science from the LIP of ENS Lyon, France, in 2003 and 2007, respectively, under the supervision of Florent de Dinechin and Jean-Michel Muller. He is now a junior researcher in the CARAMEL Project-Team at INRIA in Nancy, France. His research interests cover the various hardware aspects of computer arithmetic, with a special focus on finite fields, elliptic curves, and their applications in cryptography.
Nicolas Estibals received the MSc degree in computer science from the ENS Lyon, France, in 2009. He is currently working toward the PhD degree at the CARAMEL Project-Team in Nancy, France, under the supervision of Jérémie Detrey and Pierrick Gaudry. His research interests are cryptography, computer arithmetic, and hardware development.
Eiji Okamoto received the BS, MS, and PhD degrees in electronics engineering from the Tokyo Institute of Technology in 1973, 1975, and 1978, respectively. He worked on communication theory and cryptography at NEC Central Research Laboratories from 1978. In 1991, he became a professor at the Japan Advanced Institute of Science and Technology, then at Toho University. He is now a professor in the Graduate School of Systems and Information Engineering, University of Tsukuba. His research interests are cryptography and information security. He is a member of the IEEE and a coeditor in chief of the International Journal of Information Security.
Francisco Rodríguez-Henríquez received the BSc degree in electrical engineering from the University of Puebla, Mexico, the MSc degree in electrical and computer engineering from the National Institute of Astrophysics, Optics and Electronics (INAOE), Mexico, and the PhD degree in electrical and computer engineering from Oregon State University, in 1989, 1992, and 2000, respectively. Currently, he is a professor (CINVESTAV-3B researcher) in the Computer Science Department, CINVESTAV-IPN, Mexico City, Mexico, which he joined in 2002. His major research interests are in cryptography, finite field arithmetic, and hardware implementation of cryptographic algorithms.