FPGA and ASIC Implementation of Rho and P-1 Methods of Factoring
Master’s Thesis Presentation
Ramakrishna Bachimanchi
Director: Dr. Kris Gaj
Contents
Introduction
Background
Hardware Architecture
FPGA and ASIC Design Flow
Results
Conclusions
RSA
In 1977, Ron Rivest, Adi Shamir and Leonard Adleman developed the first public-key cryptosystem, which they called RSA.
RSA
Public key: {e, N}    Private key: {d, P, Q}
Alice (Encryption, {e, N}) → Network → Bob (Decryption, {d, P, Q})
N = P·Q, where P and Q are large prime factors
$e \cdot d \equiv 1 \bmod ((P-1)(Q-1))$
Recommended key sizes for RSA
Old standard:
    Individual users: 512 bits (155 decimal digits)
New standard:
    Short-term use (up to 2010): 1024 bits
    Long-term use: 2048 bits
Size of the RSA key = size of N = P·Q
Factoring RSA
RSA-200 (663 bits) factored by Bahr, Boehm, Franke and Kleinjung
When? Dec 2003 – May 2005
Effort?
First stage: about 1 year on various machines, equivalent to 55 years on an Opteron 2.2 GHz CPU
Second stage: 3 months on a cluster of 80 2.2 GHz Opterons connected via a gigabit network
Number Field Sieve
Best Algorithm to Factor Large Numbers
Complexity: Sub-exponential time and memory
N = number to factor, k = number of bits of N
Polynomial function: $a \cdot k^m$
Exponential function: $e^k$
Sub-exponential function: $e^{k^{1/3} (\ln k)^{2/3}}$
Steps of Number Field Sieve (NFS)
Polynomial Selection
Relation Collection
    Sieving
    Mini-factoring (200-bit and 350-bit numbers): Pollard rho, p-1 method, ECM
Linear Algebra
Square Root
Birthday paradox: If more than 23 “random” people are in a room (or even if they aren't), there is a more than 50% probability that the birthdays of two of them fall on the same day of the year.
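The 50% threshold can be checked numerically. The following is a minimal sketch, assuming 365 equally likely birthdays; the function name is illustrative and not part of the presentation.

```python
def birthday_collision_probability(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming all `days` birthdays are equally likely."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


for n in (22, 23):
    print(n, round(birthday_collision_probability(n), 3))
```

The same effect is what makes the rho method practical: a collision modulo an unknown factor q is expected after roughly the square root of q steps, rather than after q steps.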
Pollard’s Rho Method
$x_{i+1} = x_i^2 + 1 \bmod N$,   N = 97 · 1889 = 183 233

i           0   1   2    3     4       5       6        7       8      9
x_i         2   5   26   677   91864   15449   102236   39678   5749   69062   …
x_i mod 97  2   5   26   95    5       26      95       5       26     95      …
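The table above can be reproduced with a few lines of code. This is a small sketch; the helper name `rho_sequence` is an assumption made for illustration.

```python
def rho_sequence(x0, n, count, a=1):
    """Return the first `count` terms of x_{i+1} = x_i^2 + a mod n."""
    xs = [x0]
    for _ in range(count - 1):
        xs.append((xs[-1] * xs[-1] + a) % n)
    return xs


N = 183233                      # 97 * 1889
xs = rho_sequence(2, N, 10)
print(xs)                       # sequence modulo N
print([x % 97 for x in xs])     # same sequence reduced modulo the factor 97
```

Reducing modulo 97 makes the short cycle 5, 26, 95 visible, even though the sequence modulo N has not yet repeated.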
Pollard's rho method - Example
[Figure: rho-shaped diagram for q = 97. Tail: x0 = 2. Cycle: x1 ≡ x4 ≡ x7 ≡ … ≡ 5, x2 ≡ x5 ≡ x8 ≡ … ≡ 26, x3 ≡ x6 ≡ x9 ≡ … ≡ 95 (mod q).]
x1 ≡ x4 (mod q)
q | (x1 – x4)
q | N
q | gcd(x1 – x4, N)
q = gcd(-91 859, 183 233) = 97
Pollard’s Rho Method
[Figure: rho-shaped diagram. The tail x0 mod q, x1 mod q, x2 mod q, x3 mod q, x4 mod q, … leads into a cycle xs mod q, xs+1 mod q, xs+2 mod q, …, xi mod q, xi+1 mod q, …, xe-1 mod q, which closes at xe mod q.]
xs ≡ xe (mod q)
xs+1 ≡ xe+1 (mod q)
…
xs+k ≡ xe+k (mod q)
period = e - s
Rho Algorithm - Floyd’s Version
Initialize $b = c = x_0$
1. Choose the polynomial as $f(x) = x^2 + a$
2. Calculate $b = f(b) \bmod n$ and $c = f(f(c)) \bmod n$
3. Compute $d = \gcd(b - c, n)$
4. If $1 < d < n$, a non-trivial factor of $n$ is found
5. If $d = 1$, go to step 2
   If $d = n$, change $a$ and go to step 1
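The steps above can be sketched in software as follows. This is an illustrative model, not the hardware implementation; the function name, the default values x0 = 2 and a = 1, and the `max_steps` cap are assumptions.

```python
from math import gcd


def pollard_rho_floyd(n, x0=2, a=1, max_steps=100000):
    """Floyd's version of Pollard's rho with one gcd per step.

    Returns a non-trivial factor of n, or None if d == n is reached
    (the caller should then retry with a different a) or if the
    step limit is exhausted.
    """
    def f(x):
        return (x * x + a) % n

    b = c = x0
    for _ in range(max_steps):
        b = f(b)           # b advances one step:  x_i
        c = f(f(c))        # c advances two steps: x_2i
        d = gcd(b - c, n)  # math.gcd accepts negative arguments
        if 1 < d < n:
            return d
        if d == n:
            return None
    return None


print(pollard_rho_floyd(183233))
```

For N = 183 233 the factor is found at the third step, when x3 and x6 coincide modulo 97.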
Rho Method - Floyd’s Version
x1-x2    x1-x3    x1-x4    x1-x5    x1-x6    …  x1-xi
x2-x3    x2-x4    x2-x5    x2-x6    x2-x7    …  x2-xi
x3-x4    x3-x5    x3-x6    x3-x7    x3-x8    …  x3-xi
x4-x5    x4-x6    x4-x7    x4-x8    x4-x9    …  x4-xi
x5-x6    x5-x7    x5-x8    x5-x9    x5-x10   …  x5-xi
x6-x7    x6-x8    x6-x9    x6-x10   x6-x11   x6-x12   …  x6-xi
x7-x8    x7-x9    x7-x10   x7-x11   x7-x12   x7-x13   x7-x14   …  x7-xi
x8-x9    x8-x10   x8-x11   x8-x12   x8-x13   x8-x14   x8-x15   x8-x16   …  x8-xi
…
xk-xk+1  xk-xk+2  xk-xk+3  …  xk-x2k  …  xk-xi
(Floyd’s version checks one difference per row: xk-x2k.)
Pollard’s Rho Algorithm - Floyd’s Version
f(x) = x^2 + a with a ∉ {-2, 0}
# iterations t < 100·√qmax (qmax is the maximum factor we expect to find using the rho method)
We choose a random x0 in the range (0, N-1) and x1 = f(x0)

V2 (↓ f(f()))    V1 (↓ f())    d
                 x0
                               d = 1
x2               x1            d = d*(x2 - x1)
x4               x2            d = d*(x4 - x2)
x6               x3            d = d*(x6 - x3)
…                …             …
xt               xt/2          d = d*(xt - xt/2)
xt+2             x(t+2)/2      d = d*(xt+2 - x(t+2)/2)
…                …             …
x2i              xi            d = d*(x2i - xi)
x2(i+1)          xi+1          d = d*(x2i+2 - xi+1)
…                …             …
x2t              xt            d = d*(x2t - xt)

* x2i+2 = f(f(x2i)), xi+1 = f(xi)
q = gcd(d, N)
Minimization for area and/or memory
Rho Algorithm - Floyd’s Version Contd.
Inputs: x0, a, f(x) = x^2 + a, N, t (even, t ≥ 2)
Outputs: q such that q | N
v1 = x1 = f(x0), v2 = x2 = f(x1), temp = v2 - v1 = x2 - x1, d = 1
for (i = 2; i ≤ t; i++)
{
    v2 = v2^2
    v2 = v2 + a
    v2 = v2^2
    v2 = v2 + a
    v1 = v1^2
    v1 = v1 + a
    temp = v2 - v1
    d = d * temp
}
q = gcd(d, N)
* all operations are done modulo N
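The batched form above replaces one gcd per step with one modular multiplication per step and a single gcd at the end. A minimal software sketch follows; the function name is an assumption, and the accumulator is seeded with x2 - x1 as in the sequence-of-operations table, which is one reading of the slide.

```python
from math import gcd


def rho_floyd_batched(n, x0, a, t):
    """Floyd's rho with the differences accumulated in a product d.

    After the loop v1 = x_t and v2 = x_2t; only one gcd is computed.
    The result may be 1 (no factor found) or n (all factors collided),
    in which case different parameters are needed.
    """
    def f(x):
        return (x * x + a) % n

    v1 = f(x0)             # x1
    v2 = f(v1)             # x2
    d = (v2 - v1) % n      # first difference, x2 - x1
    for _ in range(2, t + 1):
        v2 = f(f(v2))      # two squarings and two additions
        v1 = f(v1)         # one squaring and one addition
        d = (d * (v2 - v1)) % n
    return gcd(d, n)


print(rho_floyd_batched(183233, x0=2, a=1, t=4))
```

Each iteration costs three squarings, three additions, one subtraction and one multiplication modulo N, which matches the operation list in the pseudocode.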
Rho Method - Brent’s Version
x1-x2    x1-x3    x1-x4    x1-x5    x1-x6    …  x1-xi
x2-x3    x2-x4    x2-x5    x2-x6    x2-x7    …  x2-xi
x3-x4    x3-x5    x3-x6    x3-x7    x3-x8    …  x3-xi
x4-x5    x4-x6    x4-x7    x4-x8    x4-x9    …  x4-xi
x5-x6    x5-x7    x5-x8    x5-x9    x5-x10   …  x5-xi
x6-x7    x6-x8    x6-x9    x6-x10   x6-x11   x6-x12   …  x6-xi
x7-x8    x7-x9    x7-x10   x7-x11   x7-x12   x7-x13   x7-x14   …  x7-xi
x8-x9    x8-x10   x8-x11   x8-x12   x8-x13   x8-x14   x8-x15   x8-x16   …  x8-xi
…
xk-xk+1  xk-xk+2  xk-xk+3  …  x(2^k)-x(2^k+2^(k-1)+1)  …  x(2^k)-x(2^(k+1))
Rho Method - Brent’s Version
Sequence of Operations

v2     d                  v1
x2     d = 1              x2
x3
x4     d = d*(x4 - x2)    x4
x5
x6
x7     d*(x7 - x4)
x8     d*(x8 - x4)        x8
x9
x10
x11
x12
x13    d*(x13 - x8)
x14    d*(x14 - x8)
x15    d*(x15 - x8)
x16    d*(x16 - x8)       x16

Minimization for execution time: 24%
Rho Algorithm - Brent’s Version
Inputs: x0, a, f(x) = x^2 + a, N, t (even, t ≥ 2)
Outputs: q such that q | N
x1 = f(x0), v2 = v1 = x2 = f(x1), d = 1, k = 1
for (i = 3; i ≤ 2t; i++)
{
    v2 = f(v2)
    if (2^k + 2^(k-1) + 1 ≤ i ≤ 2^(k+1))
    {
        temp = v2 - v1
        d = d * temp
    }
    if (i = 2^(k+1))
    {
        v1 = v2
        k = k + 1
    }
}
q = gcd(d, N)
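Brent's variant can be modelled in the same way. The sketch below follows the pseudocode above, with d initialised to 1 as in the sequence-of-operations table; the function name is an assumption.

```python
from math import gcd


def rho_brent_batched(n, x0, a, t):
    """Brent's rho: v2 advances one step per iteration, v1 is a saved
    value x_(2^k), and differences are accumulated only in the second
    half of each interval (2^k, 2^(k+1)]."""
    def f(x):
        return (x * x + a) % n

    x1 = f(x0)
    v1 = v2 = f(x1)        # both start at x2
    d, k = 1, 1
    for i in range(3, 2 * t + 1):
        v2 = f(v2)         # v2 = x_i
        if 2**k + 2**(k - 1) + 1 <= i <= 2**(k + 1):
            d = (d * (v2 - v1)) % n
        if i == 2**(k + 1):
            v1 = v2        # save x_(2^(k+1)) as the new reference
            k += 1
    return gcd(d, n)


print(rho_brent_batched(183233, x0=2, a=1, t=4))
```

Only one evaluation of f is needed per iteration, compared with three in Floyd's version, which is where the execution-time saving comes from.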
p-1 Algorithm
Based on Fermat’s Little Theorem
$a^{p-1} \equiv 1 \pmod{p}$
$a^{m(p-1)} \equiv 1 \pmod{p}$
$a^{m(p-1)} - 1 \equiv 0 \pmod{p}$
N – number to be factored
a – any small integer
p – non-trivial factor of N
Choose a small number a, such that 1 < a < N
Choose a special number k
Compute $a^k \pmod{N} - 1$
Compute $\gcd(a^k \pmod{N} - 1, N)$
p-1 algorithm
Inputs :
N – number to be factored
a – arbitrary integer such that gcd(a, N)=1
B1 – smoothness bound for Phase 1
B2 – smoothness bound for Phase 2
Outputs:
q - factor of N, 1 < q ≤ N
or FAIL
p-1 algorithm – Phase 1
precomputations:
1: $k = \prod_i p_i^{e_i}$ such that $p_i$ - consecutive primes $\le B_1$,
   $e_i$ - largest exponent such that $p_i^{e_i} \le B_1$
main computations:
2: $q_0 = a^k \bmod N$
postcomputations:
3: $q = \gcd(q_0 - 1, N)$
4: if $q > 1$
5:     return $q$ (factor of $N$)
6: else
7:     go to Phase 2
8: end if
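Phase 1 can be sketched in software as below. The helper names are assumptions, and Python's built-in `pow` stands in for the hardware modular exponentiation.

```python
from math import gcd


def primes_up_to(limit):
    """Sieve of Eratosthenes; assumes limit >= 2."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, limit + 1, p):
                sieve[multiple] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]


def pminus1_phase1(n, a, b1):
    """Return (q, q0, k) for Phase 1 of the p-1 method."""
    # Precomputation: k = product of the largest prime powers <= B1.
    k = 1
    for p in primes_up_to(b1):
        prime_power = p
        while prime_power * p <= b1:
            prime_power *= p
        k *= prime_power
    q0 = pow(a, k, n)        # main computation
    q = gcd(q0 - 1, n)       # postcomputation
    return q, q0, k


q, q0, k = pminus1_phase1(1740719, 2, 20)
print(k, q)
```

A result of q = 1 means Phase 1 failed and Phase 2 should be tried; q = N means both factors were caught at once and a smaller bound or a different base is needed.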
p-1 algorithm – Phase 2
main computations:
9: $d = 1$
10: for each prime $p = B_1$ to $B_2$ do
11:     $d = d \cdot (q_0^p - 1) \pmod{N}$
12: end for
postcomputations:
13: $q = \gcd(d, N)$
14: if $q > 1$ then
15:     return $q$
16: else
17:     return FAIL
18: end if
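A direct software rendering of Phase 2 is sketched below. It takes q0 from Phase 1 as an input; the function name, the trial-division primality check, and the test modulus 10 669 = 47 · 227 are assumptions chosen so that Phase 1 with B1 = 20 fails while Phase 2 with B2 = 100 succeeds.

```python
from math import gcd


def pminus1_phase2(q0, n, b1, b2):
    """Accumulate (q0^p - 1) over primes p in (B1, B2], then one gcd.

    Returns a non-trivial factor of n or None (FAIL). Unlike the
    pseudocode, q == n is also reported as a failure here.
    """
    d = 1
    for p in range(b1 + 1, b2 + 1):
        # Trial division is enough for the small bounds used here.
        if all(p % r for r in range(2, int(p ** 0.5) + 1)):
            d = (d * (pow(q0, p, n) - 1)) % n
    q = gcd(d, n)
    return q if 1 < q < n else None


n = 47 * 227                       # 47 - 1 = 2 * 23, and 20 < 23 <= 100
q0 = pow(2, 232792560, n)          # Phase 1 result for B1 = 20
print(gcd(q0 - 1, n))              # 1: Phase 1 alone finds nothing
print(pminus1_phase2(q0, n, 20, 100))
```

Phase 2 succeeds because 47 - 1 has exactly one prime factor above B1, and that factor lies below B2.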
p-1 Phase 1 – Numerical example
N = 1 740 719 = 1279·1361
a = 2
B1 = 20
k = 2^4·3^2·5·7·11·13·17·19 = 232 792 560
q0 = a^k mod N = 2^(232 792 560) mod 1 740 719 = 1 003 058
q = gcd(1 003 058 - 1, 1 740 719) = 1361
Why did the method work?
q-1 = 1360 = 2^4·5·17 | k
a^k mod q = a^((q-1)·m) mod q = 1
q | a^k - 1
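The explanation above can be checked for both prime factors of N. This short sketch only uses the numbers from the example; it shows that q - 1 divides k for q = 1361 but not for the other factor, 1279.

```python
N, a = 1740719, 2
k = 2**4 * 3**2 * 5 * 7 * 11 * 13 * 17 * 19

for prime in (1279, 1361):
    divides = k % (prime - 1) == 0      # does (prime - 1) divide k?
    residue = pow(a, k, prime)          # a^k mod prime
    print(prime, divides, residue == 1)
```

Since 1278 = 2·3^2·71 contains the prime 71 > B1, the factor 1279 is not caught in Phase 1, which is why the gcd returns 1361 rather than N.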
Modular Exponentiation - Sliding Window Method
Input: $g$, $e = (e_t e_{t-1} \ldots e_1 e_0)_2$ with $e_t = 1$, and an integer $w \ge 1$
Output: $g^e$
1. Precomputation
   $g_1 = g$, $g_2 = g^2$
   For $i$ from 1 to $(2^{w-1} - 1)$ do: $g_{2i+1} = g_{2i-1} \cdot g_2$
2. $A = 1$, $i = t$
3. While $i \ge 0$ do the following:
   If $e_i = 0$ then do: $A = A^2$, $i = i - 1$
   Otherwise ($e_i \ne 0$), find the longest bitstring $e_i e_{i-1} \ldots e_l$ such that $i - l + 1 \le w$ and $e_l = 1$, and do the following:
   $A = A^{2^{i-l+1}} \cdot g_{(e_i e_{i-1} \ldots e_l)_2}$, $i = l - 1$
4. Return ($A$)
Sliding Window Method - Example
Calculating g^50, e = (110010)_2, window size 2
Pre-computations: g^3
Main computations: A ← 1
11 0010: window size = 2 and the value = “11” = 3; A ← (A)^4 · g^3 = g^3
11 0 010: A ← A^2 = g^6
110 0 10: A ← A^2 = g^12
1100 1 0: window size = 1 and the value = ‘1’ = 1; A ← (A)^2 · g^1 = g^25
11001 0: A ← A^2 = g^50
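The sliding-window method and the g^50 example above can be sketched as follows. This is a software illustration with an explicit modulus n added as a parameter; the function name and the dictionary of odd powers are assumptions.

```python
def sliding_window_pow(g, e, w, n):
    """Left-to-right sliding-window exponentiation: g**e mod n."""
    if e == 0:
        return 1 % n
    # Precompute the odd powers g^1, g^3, ..., g^(2^w - 1).
    g_squared = (g * g) % n
    odd_power = {1: g % n}
    for j in range(3, 1 << w, 2):
        odd_power[j] = (odd_power[j - 2] * g_squared) % n

    bits = bin(e)[2:]                 # most significant bit first
    A, i = 1, 0
    while i < len(bits):
        if bits[i] == "0":
            A = (A * A) % n
            i += 1
        else:
            # Longest window of at most w bits that ends in a 1.
            end = min(i + w, len(bits))
            while bits[end - 1] == "0":
                end -= 1
            for _ in range(end - i):
                A = (A * A) % n
            A = (A * odd_power[int(bits[i:end], 2)]) % n
            i = end
    return A


print(sliding_window_pow(3, 50, 2, 1000003) == pow(3, 50, 1000003))
```

For e = 50 and w = 2 the windows are "11" and "1", so the run uses the same two multiplications by precomputed powers as the example.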
Montgomery Multiplication
[Figure: block diagram of the Montgomery multiplier. Interface signals: A_M, B, write, A_M_Choice, start, read, clk, reset, C, done_mul; 32-bit data buses. Datapath: registers S1, S2, A (shift register), B and M; a CSR42 carry-save reduction block built from two CSAs producing SUM and CARRY; right shifts (>>1); signals Ai and qi; word widths w and ws; outputs S1out, S2out, Bout, Mout, data_out.]
Based on McIvor, McLoone, et al., Asilomar 2003: full-length CSAs, word-length CPAs
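For orientation, the arithmetic performed by a bit-serial Montgomery multiplier can be modelled in a few lines. This is a generic radix-2 sketch, not the carry-save architecture of the figure; the function name and parameters are assumptions, and the names a_i and q_i are only borrowed from the signal labels Ai and qi.

```python
def montgomery_multiply(a, b, m, k):
    """Return a * b * 2^(-k) mod m for odd m < 2^k and 0 <= a, b < m.

    One bit of a is consumed per iteration; q_i selects whether m is
    added so that the running sum becomes even before the right shift.
    """
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        s += a_i * b
        q_i = s & 1
        s = (s + q_i * m) >> 1
    return s - m if s >= m else s


m, k = 183233, 18                 # odd modulus, 2^18 > m
R = 1 << k
x, y = 12345, 67890
product = montgomery_multiply(x, y, m, k)
print((product * R) % m == (x * y) % m)
```

The result carries an extra factor of 2^(-k), so operands are normally kept in Montgomery form (multiplied by 2^k mod m) throughout a computation.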
Addition / Subtraction
[Figure: block diagram of the adder/subtractor. Interface signals: A_M, B, write, A_M_Choice, add_sub, read, clk, reset, C; 32-bit data buses. Datapath: 32 x 32 LUT memory with addr1, addr2 and WE; operands OP1 and OP2; 32-bit registers A and B (enables EA, EB); 32-bit adder with Cin, Cout, sum1, sum2; carry registers C1, C2 (enables EC1, EC2); sub, sign and Z signals; constants M and 2M (M << 1).]
Original design
Global Memory- Rho
[Figure: global memory map, 32-bit words (bits 31 to 0).]
n for unit 1
n for unit 2
. . .
n for unit m
x0, a, t (no. of iterations): same for all units
Local Memory- Rho
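[Figure: local memory of a rho unit, 64 words (addresses 0 to 63) of 32 bits (bits 31 to 0), holding M, temp, V1, V2, a and d. Ports: Aaddr and Baddr (6 bits each), WEA, data_in, data_out, A_M, B and C (32 bits each); other signals: Grei, g_l, u_l, Kout.]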
Computation Flow
MUL                              ADD/SUB
1 to 2t-1: v2 ← v2^2             cond1: temp ← (v2 - v1)
cond1: d ← d*temp                1 to 2t-1: v2 ← v2 + a
cond1: 2^k + 2^(k-1) + 1 ≤ i-1 ≤ 2^(k+1)
Global Memory – p-1
[Figure: global memory maps for p-1, 512 words (0 to 511) of 32 bits (bits 31 to 0).]
Phase 1: k, kN, g1, g2 (initial values for all units); N for unit 1, N for unit 2, . . ., N for unit m
Phase 2: Mmin, Mmax; GCD_table[1], . . ., GCD_table[GMAXD]; prime_table[1], prime_table[2], . . ., prime_table[PMAXD]
GCD_table: determines j such that 1 ≤ j ≤ D and gcd(j, D) = 1
prime_table: determines m, j such that P = m·D - j is a prime
Local Memory – p-1
[Figure: local memory maps, 512 words (0 to 511) of 32 bits (bits 31 to 0).]
a) Phase 1: N, g1, g2, g3, . . ., gs, d = g^e   (*s = 2k-1)
b) Phase 2: N, d, d^2, d^11, d^13, . . ., d^209, . . ., d^D, d^(m·D), d^(mD) - d^j, x
Control Unit
Phase 1: Memory Initialization, Modular Exponentiation, Reading Out Results
Phase 2: Memory Initialization, Pre-Computations, Main-Computations, Reading Out Results
Control Unit
Total 17 state machines with 140 states
5 state machines with 45 states in Rho
12 state machines with 103 states in P-1
5 Shift registers
9 Registers
13 Counters
22 Comparators
Original design
FPGA vs ASIC
FPGA
Field Programmable Gate Array
Array of logic blocks
Switchable interconnect resources
Final user can set switches
Immediate use (“Zero” fab time)
Not good for high volume applications
ASIC
Application Specific Integrated Circuit
Standard cells and Macros
Requires full manufacturing sequence
Good for high volume applications
FPGA Design Flow
Specification
RTL Description(VHDL / Verilog HDL)
Design Entry
Synthesis
Implementation
Configuration
Functional Simulation
Post-Synthesis Simulation
Timing Simulation
On Chip Testing
Design Verification
ASIC Design Flow
Front-End Design
    Synthesis
Back-End Design
    Floorplanning
    Placement
    Clock Tree Synthesis
    Routing
    Timing Analysis
    Design for Manufacturing
Tools: Design Analyzer, Primetime, Astro
Families of Xilinx FPGA Devices
Low-cost                 High-performance
Spartan 3 (< $130*)      Virtex II (< $2,700*)
Spartan 3E (< $35*)      Virtex 4 (< $3,000*)
*approximate cost of the largest device per unit for a batch of 10,000 units
FPGA Implementation of Single Units
Results                 Rho           P-1           Unified
Resources
- CLB Slices            1,680 (4%)    1,749 (5%)    2,042 (6%)
- LUTs                  2,714 (4%)    2,875 (4%)    3,451 (5%)
- FFs                   1,518 (2%)    1,645 (2%)    1,740 (2%)
- BRAMs                 0/144         2/144         2/144
Max. Clock Frequency    130 MHz       131 MHz       115 MHz
Target device is Virtex II XC2V6000-6
Number of unified units per FPGA
Device                     Class               Unified units
Spartan 3 XC3S5000-5       Low-cost            19
Virtex II XC2V6000-6       High-performance    21
Spartan 3E XC3S1600-5      Low-cost            8
Virtex 4 XC4VLX200-11      High-performance    42
Performance – Unified Operations per Second
Device                     Class               Operations per second
Spartan 3 XC3S5000-5       Low-cost            581
Virtex II XC2V6000-6       High-performance    819
Spartan 3E XC3S1600-5      Low-cost            290
Virtex 4 XC4VLX200-11      High-performance    2,262
Virtex II vs. Spartan 3: x 1.41
Virtex 4 vs. Spartan 3E: x 7.8
Performance to cost ratio
Unified Operations per second per $100
Device                     Class               Operations per second per $100
Spartan 3 XC3S5000-5       Low-cost            447
Virtex II XC2V6000-6       High-performance    30
Spartan 3E XC3S1600-5      Low-cost            828
Virtex 4 XC4VLX200-11      High-performance    75
Spartan 3 vs. Virtex II: x 14.9
Spartan 3E vs. Virtex 4: x 11
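The performance-to-cost figures follow from the operations-per-second results and the approximate unit costs quoted earlier. The sketch below assumes the quoted upper bounds ($130, $2,700, $35, $3,000) are used as the costs; the dictionary layout and function name are illustrative.

```python
# (unified operations per second, approximate unit cost in USD)
devices = {
    "Spartan 3 XC3S5000-5": (581, 130),
    "Virtex II XC2V6000-6": (819, 2700),
    "Spartan 3E XC3S1600-5": (290, 35),
    "Virtex 4 XC4VLX200-11": (2262, 3000),
}


def ops_per_100_dollars(ops_per_second, unit_cost_usd):
    """Unified operations per second obtained for each $100 spent."""
    return 100.0 * ops_per_second / unit_cost_usd


ratios = {name: ops_per_100_dollars(*v) for name, v in devices.items()}
for name, value in ratios.items():
    print(f"{name}: {value:.1f}")
```

The computed values agree with the chart to within rounding, and the low-cost devices come out more than ten times ahead of the high-performance ones.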
Results - ASIC Implementation
Operation                                  rho          p-1          Unified architecture
Area                                       1.15 mm^2    1.21 mm^2    1.8 mm^2
Max. Clock Frequency                       200 MHz      200 MHz      200 MHz
Time for execution                         3.52 ms      9.56 ms      13.1 ms
# of operations per second
(using maximum no. of units)               96,022       34,100       16,615
Core utilization ratio                     70%          70%          65%
Area of Virtex II FPGA is 19.68 x 19.8 mm2
(estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)
FPGA vs ASIC - Area
Number of units    Rho     P-1     Unified
ASIC               338     322     216
FPGA               20      23      21
ASIC / FPGA        x 17    x 14    x 10
ASIC 130 nm vs. Virtex II 6000 – rho (20 units)
Area of Virtex II 6000: 19.80 mm x 19.68 mm (estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)
Area of an ASIC with equivalent functionality: 2.7 mm x 2.82 mm
Area ratio: 51x
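The area figures can be cross-checked with simple arithmetic. The sketch below uses only the dimensions and per-unit areas quoted in the slides; reading the ASIC unit counts as "units that fit in the Virtex II die area" is an interpretation, not something the slides state explicitly.

```python
fpga_area = 19.80 * 19.68        # estimated Virtex II 6000 die area, mm^2
asic_area = 2.7 * 2.82           # ASIC with 20 rho units, mm^2

area_ratio = fpga_area / asic_area
print(round(area_ratio))         # area advantage of the ASIC

# Units of each type that fit in the FPGA die area, from per-unit ASIC areas.
unit_area = {"rho": 1.15, "p-1": 1.21, "unified": 1.8}
units_in_fpga_area = {name: int(fpga_area / a) for name, a in unit_area.items()}
print(units_in_fpga_area)
```

Under this reading the counts 338, 322 and 216 fall out directly, and dividing them by the FPGA unit counts (20, 23, 21) gives the factors of about 17, 14 and 10.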
ASICs vs. FPGAs
Source: I. Kuon, J. Rose (University of Toronto), “Measuring the Gap Between FPGAs and ASICs”, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, Feb 2007.
Contributions
Verified the VHDL code through functional and timing simulation by comparison with the operation of a test software implementation written in C.
Ported the VHDL code to 4 different families of FPGA devices and to a standard-cell ASIC based on the 130 nm TSMC library.
Conclusions
Low-cost FPGA devices, such as Spartan 3, outperformed high-performance devices, such as Virtex II, in terms of performance to cost ratio by a factor of 14.9.
The ASIC implementation outperforms the FPGA by a factor of 50* in terms of area and by a factor of 1.5 in terms of frequency.
*In the case of rho it is 50; for other architectures it may be less.
Conclusions
Low-cost FPGA devices Spartan 3 and Spartan 3E are suitable for code-breaking.
ASIC implementation is suitable when a large number of chips (>1,000,000) is considered.
Future Work
Implementation of Trial Division in Hardware
Implementation of ECM in Hardware using one multiplier and one adder/subtractor
Integrating Trial Division, Rho, P-1 and ECM to build a co-factoring machine
Experiments on COPACOBANA