FPGA and ASIC Implementation of Rho and P-1 Methods of Factoring

Master’s Thesis Presentation

Ramakrishna Bachimanchi

Director: Dr. Kris Gaj

Contents

Introduction

Background

Hardware Architecture

FPGA and ASIC Design Flow

Results

Conclusions

RSA

In 1977, Ron Rivest, Adi Shamir, and Leonard Adleman developed one of the first public-key cryptosystems, known as RSA.

RSA

Public key {e, N}          Private key {d, P, Q}

Alice (Encryption, {e, N})  ->  Network  ->  Bob (Decryption, {d, P, Q})

N = P · Q, where P and Q are large prime factors

$e \cdot d \equiv 1 \pmod{(P-1)(Q-1)}$
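The key relation above can be checked on a toy example. This is a minimal sketch, not part of the original slides: the primes 61 and 53, the exponent 17, and the function name `make_toy_keys` are all assumptions chosen for illustration, and real keys are far larger.

```python
from math import gcd

def make_toy_keys(P, Q, e):
    """Build a toy RSA key pair from two small primes (illustration only)."""
    N = P * Q
    phi = (P - 1) * (Q - 1)
    assert gcd(e, phi) == 1, "e must be invertible modulo (P-1)(Q-1)"
    d = pow(e, -1, phi)  # modular inverse, Python 3.8+
    return (e, N), (d, P, Q)

# Hypothetical toy parameters, far too small for real use.
public_key, private_key = make_toy_keys(61, 53, 17)
e, N = public_key
d, P, Q = private_key

message = 65
ciphertext = pow(message, e, N)    # encryption with the public key
recovered = pow(ciphertext, d, N)  # decryption with the private key
```

Anyone who can factor N into P and Q can recompute d the same way, which is why the rest of the presentation concentrates on factoring.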

Common Applications of RSA

S/MIME, PGP: Alice - Network - Bob

Secure WWW, SSL: Browser - Network - Web Server

Recommended key sizes for RSA

Old standard:
Individual users: 512 bits (155 decimal digits)

New standard:
Short-term use (up to 2010): 1024 bits
Long-term use: 2048 bits

Size of the RSA key = size of N = P · Q

Factoring RSA

RSA-200 (663 bits) factored by Bahr, Boehm, Franke and Kleinjung

When? Dec 2003 – May 2005

Effort?

First stage: about 1 year on various machines, equivalent to 55 years on an Opteron 2.2 GHz CPU

Second stage: 3 months on a cluster of 80 2.2 GHz Opterons connected via a gigabit network

Number Field Sieve

Best Algorithm to Factor Large Numbers

Complexity: sub-exponential time and memory

N = number to factor, k = number of bits of N

Polynomial function: $a \cdot k^m$

Exponential function: $e^k$

Sub-exponential function: $e^{k^{1/3} (\ln k)^{2/3}}$

Steps of Number Field Sieve (NFS)

Polynomial Selection

Relation Collection
    Sieving
    Mini-factoring of 200-bit and 350-bit numbers: Pollard rho, p-1 method, ECM

Linear Algebra

Square Root

Rho Algorithm

Birthday paradox: If more than 23 “random” people are in a room (or even if they aren't) there is a more than 50% probability that the birthdays of two of them fall on the same day of the year.
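The 50% threshold can be reproduced numerically. The sketch below assumes 365 equally likely birthdays; the function name `birthday_collision_probability` is introduced here for illustration.

```python
def birthday_collision_probability(people, days=365):
    """Probability that at least two of `people` share a birthday."""
    p_all_distinct = 1.0
    for i in range(people):
        # The (i+1)-th person must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

p22 = birthday_collision_probability(22)
p23 = birthday_collision_probability(23)
```

The rho method relies on the same effect: among values reduced modulo an unknown prime factor q, a repeat is expected after roughly the square root of q steps.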

Pollard’s Rho Method

$x_{i+1} = x_i^2 + 1 \bmod N$, with $N = 97 \cdot 1889 = 183\,233$

i             0    1    2     3      4       5        6       7      8       9   …
x_i           2    5    26    677    91864   15449    102236  39678  5749    69062 …
x_i mod 97    2    5    26    95     5       26       95      5      26      95  …

Pollard's rho method - Example

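[Figure: rho-shaped diagram for q = 97. The tail starts at x0 = 2 and enters the cycle 5, 26, 95.]

$x_1 \equiv x_4 \equiv x_7 \equiv \ldots \equiv 5 \pmod q$
$x_2 \equiv x_5 \equiv x_8 \equiv \ldots \equiv 26 \pmod q$
$x_3 \equiv x_6 \equiv x_9 \equiv \ldots \equiv 95 \pmod q$

$x_1 \equiv x_4 \pmod q$
$q \mid (x_1 - x_4)$
$q \mid N$
$q \mid \gcd(x_1 - x_4, N)$
$q = \gcd(-91\,859,\ 183\,233) = 97$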
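The numbers in this example can be recomputed directly. This is a small sketch using the slide's values (N = 183 233, x0 = 2, f(x) = x^2 + 1); the helper name `rho_sequence` is an assumption.

```python
from math import gcd

N = 183233  # 97 * 1889, as in the example

def rho_sequence(x0, n, count):
    """Return [x0, x1, ..., x_count] for x_{i+1} = x_i^2 + 1 mod n."""
    xs = [x0]
    for _ in range(count):
        xs.append((xs[-1] ** 2 + 1) % n)
    return xs

xs = rho_sequence(2, N, 9)
residues = [x % 97 for x in xs]   # the same sequence seen modulo q = 97
factor = gcd(xs[1] - xs[4], N)    # x1 and x4 collide modulo 97
```

In practice q is unknown, so the residues modulo q are never computed; the collision is detected only through the gcd with N.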

Pollard’s Rho Method

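[Figure: rho-shaped diagram. The tail $x_0, x_1, x_2, x_3, x_4, \ldots$ (mod q) leads into a cycle $x_s, x_{s+1}, x_{s+2}, \ldots, x_i, x_{i+1}, \ldots, x_{e-1}$ (mod q), which closes where $x_e \equiv x_s \pmod q$.]

$x_s \equiv x_e \pmod q$
$x_{s+1} \equiv x_{e+1} \pmod q$
...
$x_{s+k} \equiv x_{e+k} \pmod q$

period = e - s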

Rho Algorithm- Floyd’s Version

1. Initialize $b = c = x_0$, choose the polynomial as $f(x) = x^2 + a$
2. Calculate $b = f(b) \bmod n$ and $c = f(f(c)) \bmod n$
3. Compute $d = \gcd(b - c, n)$
4. If $1 < d < n$, a non-trivial factor of $n$ is found
5. If $d = 1$, go to step 2
6. If $d = n$, change $a$ and go to step 1
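The steps of Floyd's version can be sketched as follows. This is a minimal illustration, not the hardware design: the function name `rho_floyd` and the `max_steps` safety limit are assumptions, and the case d = n is reported as a failure (`None`) instead of retrying with a new a.

```python
from math import gcd

def rho_floyd(n, x0=2, a=1, max_steps=100000):
    """Pollard's rho with Floyd cycle detection; returns a factor or None."""
    def f(x):
        return (x * x + a) % n

    b = c = x0
    for _ in range(max_steps):
        b = f(b)        # slow pointer: one step
        c = f(f(c))     # fast pointer: two steps
        d = gcd(b - c, n)
        if 1 < d < n:
            return d    # non-trivial factor found
        if d == n:
            return None  # both factors collided; caller should change a
    return None

factor = rho_floyd(183233)
```

With the example values (x0 = 2, a = 1) the collision appears at the third step, where x3 and x6 agree modulo 97.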

Rho Method - Floyd’s Version

x1-x2  x1-x3  x1-x4  x1-x5  x1-x6   ...  x1-xi
x2-x3  x2-x4  x2-x5  x2-x6  x2-x7   ...  x2-xi
x3-x4  x3-x5  x3-x6  x3-x7  x3-x8   ...  x3-xi
x4-x5  x4-x6  x4-x7  x4-x8  x4-x9   ...  x4-xi
x5-x6  x5-x7  x5-x8  x5-x9  x5-x10  ...  x5-xi
x6-x7  x6-x8  x6-x9  x6-x10  x6-x11  x6-x12  ...  x6-xi
x7-x8  x7-x9  x7-x10  x7-x11  x7-x12  x7-x13  x7-x14  ...  x7-xi
x8-x9  x8-x10  x8-x11  x8-x12  x8-x13  x8-x14  x8-x15  x8-x16  ...  x8-xi
...
x_k - x_(k+1)  x_k - x_(k+2)  x_k - x_(k+3)  ...  x_k - x_(2k)  ...  x_k - x_i

Pollard’s Rho Algorithm - Floyd’s Version

$f(x) = x^2 + a$ with $a \notin \{-2, 0\}$

Number of iterations: $t < 100\sqrt{q_{max}}$ ($q_{max}$ is the maximum factor we expect to find using the rho method)

We choose a random $x_0$ in the range $(0, N-1)$ and $x_1 = f(x_0)$

V2            V1             d
(start from x0)              d = 1
x_2           x_1            d = d*(x_2 - x_1)
x_4           x_2            d = d*(x_4 - x_2)
x_6           x_3            d = d*(x_6 - x_3)
...           ...            ...
x_t           x_(t/2)        d = d*(x_t - x_(t/2))
x_(t+2)       x_((t+2)/2)    d = d*(x_(t+2) - x_((t+2)/2))
...           ...            ...
x_(2i)        x_i            d = d*(x_(2i) - x_i)
x_(2(i+1))    x_(i+1)        d = d*(x_(2i+2) - x_(i+1))
...           ...            ...
x_(2t)        x_t            d = d*(x_(2t) - x_t)

* x_(2i+2) = f(f(x_(2i))), x_(i+1) = f(x_i)

q = gcd(d, N)

Minimization for area and/or memory

Rho Algorithm - Floyd’s Version Contd.

Inputs: x0, a, f(x) = x^2 + a, N, t (even, t ≥ 2)
Outputs: q such that q | N

v1 = x1 = f(x0), v2 = x2 = f(x1), temp = v2 - v1 = x2 - x1, d = 1
for (i = 2; i ≤ t; i++)
{
    v2 = v2^2
    v2 = v2 + a
    v2 = v2^2
    v2 = v2 + a
    v1 = v1^2
    v1 = v1 + a
    temp = v2 - v1
    d = d * temp
}
q = gcd(d, N)

* All operations are done modulo N
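A software sketch of this product-accumulating variant is given below. It follows the sequence-of-operations table in starting the product with x2 - x1, which is an interpretation of the slide; the function name `rho_floyd_product` is an assumption. The point of the variant is that only one gcd is computed, after all t iterations.

```python
from math import gcd

def rho_floyd_product(n, x0, a, t):
    """Floyd's rho with the differences accumulated into one product."""
    v1 = (x0 * x0 + a) % n          # x1
    v2 = (v1 * v1 + a) % n          # x2
    d = (v2 - v1) % n               # first difference, x2 - x1
    for _ in range(2, t + 1):
        v2 = (v2 * v2 + a) % n      # two steps of f for the fast value
        v2 = (v2 * v2 + a) % n
        v1 = (v1 * v1 + a) % n      # one step of f for the slow value
        d = d * (v2 - v1) % n       # multiply in x_(2i) - x_i
    return gcd(d, n)                # single gcd at the very end

q = rho_floyd_product(183233, x0=2, a=1, t=4)
```

If the accumulated product becomes 0 modulo n, the gcd is n itself and the run has to be repeated with different parameters.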

Rho Method - Brent’s Version

x1-x2  x1-x3  x1-x4  x1-x5  x1-x6   ...  x1-xi
x2-x3  x2-x4  x2-x5  x2-x6  x2-x7   ...  x2-xi
x3-x4  x3-x5  x3-x6  x3-x7  x3-x8   ...  x3-xi
x4-x5  x4-x6  x4-x7  x4-x8  x4-x9   ...  x4-xi
x5-x6  x5-x7  x5-x8  x5-x9  x5-x10  ...  x5-xi
x6-x7  x6-x8  x6-x9  x6-x10  x6-x11  x6-x12  ...  x6-xi
x7-x8  x7-x9  x7-x10  x7-x11  x7-x12  x7-x13  x7-x14  ...  x7-xi
x8-x9  x8-x10  x8-x11  x8-x12  x8-x13  x8-x14  x8-x15  x8-x16  ...  x8-xi
...
x_k - x_(k+1)  x_k - x_(k+2)  x_k - x_(k+3)  ...  x_(2^k) - x_(2^k + 2^(k-1) + 1)  ...  x_(2^k) - x_(2^(k+1))

Rho Method - Brent’s Version: Sequence of Operations

v2      d                    v1
x2      d = 1                x2
x3
x4      d = d*(x4 - x2)      x4
x5
x6
x7      d*(x7 - x4)
x8      d*(x8 - x4)          x8
x9
x10
x11
x12
x13     d*(x13 - x8)
x14     d*(x14 - x8)
x15     d*(x15 - x8)
x16     d*(x16 - x8)         x16

Minimization for execution time

24%

Rho Algorithm - Brent’s Version

Inputs: x0, a, f(x) = x^2 + a, N, t (even, t ≥ 2)
Outputs: q such that q | N

x1 = f(x0), v2 = v1 = x2 = f(x1), k = 1, d = 1
for (i = 3; i ≤ 2t; i++)
{
    v2 = f(v2)
    if (2^k + 2^(k-1) + 1 ≤ i ≤ 2^(k+1))
    {
        temp = v2 - v1
        d = d * temp
    }
    if (i = 2^(k+1))
    {
        v1 = v2
        k = k + 1
    }
}
q = gcd(d, N)
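Brent's version can be sketched in the same style. The loop bounds and the initialization d = 1 follow the reconstruction of the slide and the sequence-of-operations table, so they should be read as assumptions; the function name `rho_brent_product` is also introduced here.

```python
from math import gcd

def rho_brent_product(n, x0, a, t):
    """Brent's rho variant with product accumulation and a single gcd."""
    def f(x):
        return (x * x + a) % n

    v1 = v2 = f(f(x0))   # both start at x2
    k = 1
    d = 1
    for i in range(3, 2 * t + 1):
        v2 = f(v2)       # v2 now holds x_i: one evaluation of f per step
        if 2**k + 2**(k - 1) + 1 <= i <= 2**(k + 1):
            d = d * (v2 - v1) % n   # compare x_i against the saved x_(2^k)
        if i == 2**(k + 1):
            v1 = v2      # save x_(2^(k+1)) as the new reference value
            k += 1
    return gcd(d, n)

q = rho_brent_product(183233, x0=2, a=1, t=4)
```

Compared with Floyd's version, each step evaluates f once instead of three times, at the cost of skipping some differences.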

p-1 Algorithm

p-1 Algorithm

Based on Fermat’s Little Theorem

$a^{p-1} \equiv 1 \pmod p$

$a^{m(p-1)} \equiv 1 \pmod p$

$a^{m(p-1)} - 1 \equiv 0 \pmod p$

N: number to be factored

a: any small integer

p: non-trivial factor of N

Choose a small number a, such that 1 < a < N

Choose a special number k

Compute $a^k \bmod N - 1$

Compute $\gcd(a^k \bmod N - 1,\ N)$

p-1 algorithm

Inputs :

N – number to be factored

a – arbitrary integer such that gcd(a, N)=1

B1 – smoothness bound for Phase1

B2 – smoothness bound for Phase2

Outputs:

q - factor of N, 1 < q ≤ N

or FAIL

p-1 algorithm – Phase 1

Precomputations:
1: $k = \prod_i p_i^{e_i}$ such that $p_i$ are consecutive primes, $p_i \le B_1$,
   and $e_i$ is the largest exponent such that $p_i^{e_i} \le B_1$

Main computations:
2: $q_0 = a^k \bmod N$

Postcomputations:
3: $q = \gcd(q_0 - 1, N)$
4: if $q > 1$
5:     return $q$ (factor of N)
6: else
7:     go to Phase 2
8: end if

p-1 algorithm – Phase 2

Main computations:
09: $d \leftarrow 1$
10: for each prime $p = B_1$ to $B_2$ do
11:     $d \leftarrow d \cdot (q_0^p - 1) \pmod N$
12: end for

Postcomputations:
13: $q \leftarrow \gcd(d, N)$
14: if $q > 1$ then
15:     return $q$
16: else
17:     return FAIL
18: end if
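Both phases can be sketched in a few lines. This is an illustrative software version, not the hardware algorithm: the helper `primes_up_to`, the function name `p_minus_1`, and the test composite 29 417 = 23 · 1279 are assumptions, and the sketch returns `None` for FAIL and also when the gcd equals N.

```python
from math import gcd

def primes_up_to(limit):
    """Sieve of Eratosthenes; assumes limit >= 1."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(limit ** 0.5) + 1):
        if sieve[p]:
            for multiple in range(p * p, limit + 1, p):
                sieve[multiple] = False
    return [p for p, is_prime in enumerate(sieve) if is_prime]

def p_minus_1(n, a, b1, b2):
    """Two-phase p-1 method; returns a non-trivial factor or None."""
    # Phase 1: k is the product of the largest prime powers not exceeding B1.
    k = 1
    for p in primes_up_to(b1):
        prime_power = p
        while prime_power * p <= b1:
            prime_power *= p
        k *= prime_power
    q0 = pow(a, k, n)
    q = gcd(q0 - 1, n)
    if 1 < q < n:
        return q
    # Phase 2: accumulate (q0^p - 1) for every prime p in (B1, B2].
    d = 1
    for p in primes_up_to(b2):
        if p > b1:
            d = d * (pow(q0, p, n) - 1) % n
    q = gcd(d, n)
    return q if 1 < q < n else None

phase1_factor = p_minus_1(1740719, a=2, b1=20, b2=100)
phase2_factor = p_minus_1(29417, a=2, b1=10, b2=50)
```

The second call succeeds only in Phase 2 because 23 - 1 = 2 · 11 contains the prime 11, which lies between B1 = 10 and B2 = 50.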

p-1 Phase 1 – Numerical example

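$N = 1\,740\,719 = 1279 \cdot 1361$

$a = 2$

$B_1 = 20$

$k = 2^4 \cdot 3^2 \cdot 5 \cdot 7 \cdot 11 \cdot 13 \cdot 17 \cdot 19 = 232\,792\,560$

$q_0 = a^k \bmod N = 2^{232\,792\,560} \bmod 1\,740\,719 = 1\,003\,058$

$q = \gcd(1\,003\,058 - 1,\ 1\,740\,719) = 1361$

Why did the method work?

$q - 1 = 1360 = 2^4 \cdot 5 \cdot 17 \mid k$

$a^k \bmod q = a^{(q-1) \cdot m} \bmod q = 1$

$q \mid a^k - 1$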
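The numerical example can be reproduced with Python's built-in modular exponentiation. The variable names below mirror the slide; nothing else is assumed.

```python
from math import gcd

N = 1740719            # 1279 * 1361
a = 2
k = 2**4 * 3**2 * 5 * 7 * 11 * 13 * 17 * 19   # exponent for B1 = 20

q0 = pow(a, k, N)      # a^k mod N
q = gcd(q0 - 1, N)

# The method works because q - 1 = 1360 divides k ...
divides_k = (k % (q - 1) == 0)
# ... while for the other factor, 1279 - 1 = 1278 = 2 * 3^2 * 71 does not.
other_divides_k = (k % (N // q - 1) == 0)
```

The other prime factor, 1279, is not found in Phase 1 because 1278 contains the prime 71, which exceeds B1 = 20.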

Modular Exponentiation - Sliding Window Method

Input: $g$, $e = (e_t e_{t-1} \ldots e_1 e_0)_2$ with $e_t = 1$, and an integer $w \ge 1$
Output: $g^e$

1. Precomputation
   $g_1 \leftarrow g$, $g_2 \leftarrow g^2$
   For $i$ from 1 to $(2^{w-1} - 1)$ do: $g_{2i+1} \leftarrow g_{2i-1} \cdot g_2$
2. $A \leftarrow 1$, $i \leftarrow t$
3. While $i \ge 0$ do the following:
   If $e_i = 0$ then do: $A \leftarrow A^2$, $i \leftarrow i - 1$
   Otherwise ($e_i \ne 0$), find the longest bitstring $e_i e_{i-1} \ldots e_l$ such that $i - l + 1 \le w$ and $e_l = 1$, and do the following:
   $A \leftarrow A^{2^{i-l+1}} \cdot g_{(e_i e_{i-1} \ldots e_l)_2}$, $i \leftarrow l - 1$
4. Return($A$)

Sliding Window Method - Example

Calculating $g^{50}$, $e = (110010)_2$, window size 2

Pre-computations: $g^3$

Main computations, $A \leftarrow 1$:

[11]0010: window size = 2 and the value = “11” = 3: $A \leftarrow (A)^4 \cdot g^3 = g^3$

11[0]010: $A \leftarrow A^2 = g^6$

110[0]10: $A \leftarrow A^2 = g^{12}$

1100[1]0: window size = 1 and the value = “1” = 1: $A \leftarrow (A)^2 \cdot g^1 = g^{25}$

11001[0]: $A \leftarrow A^2 = g^{50}$
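The sliding window method can be sketched in software as follows. This is an illustrative version working on the binary string of the exponent; the function name `sliding_window_pow` and the modulus used in the usage line are assumptions.

```python
def sliding_window_pow(g, e, n, w):
    """Compute g^e mod n with a sliding window of at most w bits (e >= 1)."""
    # Precompute the odd powers g^1, g^3, ..., g^(2^w - 1).
    g_squared = g * g % n
    odd_powers = {1: g % n}
    for i in range(1, 2 ** (w - 1)):
        odd_powers[2 * i + 1] = odd_powers[2 * i - 1] * g_squared % n

    bits = bin(e)[2:]          # most significant bit first
    A = 1
    i = 0
    while i < len(bits):
        if bits[i] == "0":
            A = A * A % n      # a zero bit costs one squaring
            i += 1
        else:
            # Longest window of at most w bits that starts and ends with 1.
            j = min(i + w, len(bits))
            while bits[j - 1] == "0":
                j -= 1
            value = int(bits[i:j], 2)
            for _ in range(j - i):
                A = A * A % n  # one squaring per bit in the window
            A = A * odd_powers[value] % n
            i = j
    return A

result = sliding_window_pow(3, 50, 1000003, w=2)
```

For e = 50 and w = 2 this performs the same steps as the example: five squarings and two multiplications by precomputed powers.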

Hardware Architecture

Top-level View

[Block diagram: a host computer with RAM connects through I/O to the FPGA / ASIC, which contains a Control Unit, a Global Memory, and the Rho, p-1, or unified units.]

Low Level Arithmetic Units

Montgomery Multiplication

[Block diagram: MULTIPLIER unit. Ports: A_M, B, write, A_M_Choice, start, C, read, clk, reset, done_mul; data buses are 32 bits wide. Datapath: registers S1 and S2 (sum and carry), shift register A, register B, modulus M, bit signals Ai and qi, and a CSR42 block (inputs U, V, W, Y; outputs S, C) built from two CSAs; operands are w bits wide and are read out in ws-bit words.]

Based on McIvor, McLoone, et al., Asilomar 2003: full-length CSAs, word-length CPAs
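A bit-serial Montgomery multiplication, of the kind the signals Ai and qi in the diagram suggest, can be sketched in software as follows. This is a plain radix-2 version using ordinary integers instead of the carry-save form (S1, S2) used in the hardware; the function name `montgomery_multiply` and its parameters are assumptions.

```python
def montgomery_multiply(a, b, m, bits):
    """Return a * b * 2^(-bits) mod m for odd m with 0 <= a, b < m < 2^bits."""
    s = 0
    for i in range(bits):
        a_i = (a >> i) & 1       # next bit of A, least significant first
        s += a_i * b
        q_i = s & 1              # chosen so that s + q_i * m is even
        s = (s + q_i * m) >> 1   # exact division by 2
    if s >= m:                   # s stays below 2m, so one subtraction suffices
        s -= m
    return s

m = 1740719                      # odd modulus taken from the p-1 example
bits = 21                        # 2^21 > m
R = 1 << bits
product = montgomery_multiply(123456, 654321, m, bits)
```

The result carries an extra factor of 2^(-bits) modulo m, so operands are normally converted to the Montgomery domain once before a long chain of multiplications.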

Addition / Subtraction

[Block diagram: ADDER/SUBTRACTOR unit. Ports: A_M, B, write, A_M_Choice, add_sub, C, read, clk, reset; data buses are 32 bits wide. Datapath: a 32 x 32 LUT memory (addr1, addr2, WE), 32-bit registers A and B, a 32-bit adder with carry in and carry out (Cin, Cout), sign and Z flags, and the inputs M and 2M (M << 1).]

Original design

Global Memory - Rho

[Memory map, 32 bits wide (bits 31..0):]

n for unit 1
n for unit 2
. . .
n for unit m
x0
a
t (no. of iterations)

Same for all units: x0, a, t

Local Memory- Rho

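[Block diagram: local memory of 64 words (addresses 0 to 63), each 32 bits wide, holding M, temp, V1, V2, a, and d. Ports: 6-bit addresses Aaddr and Baddr, write enable WEA, 32-bit outputs A_M and B, 32-bit inputs C and data_in, 32-bit data_out; control signals Grei, g_l, u_l, Kout.]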

Computation Flow

MUL                               ADD/SUB
1 to 2t-1: v2 ← v2^2              cond1: temp ← (v2 - v1)
cond1: d ← d*temp                 1 to 2t-1: v2 ← v2 + a

cond1: $2^k + 2^{k-1} + 1 \le i - 1 \le 2^{k+1}$

Control Unit - Rho

Memory Initialization

Main Computations

Reading Out Results

Global Memory – p-1

[Figure: global memory maps for Phase 1 and Phase 2, each 32 bits wide (bits 31..0) with addresses 0 to 511; the labels g1 and g2 also appear in the figure.]

Phase 1:
k
N for unit 1
N for unit 2
. . .
N for unit m
(k and N: initial values for all units)

Phase 2:
prime_table[1], prime_table[2], . . ., prime_table[PMAXD]: determines m, j such that P = m·D - j is a prime
GCD_table[1], . . ., GCD_table[GMAXD]: determines j such that 1 ≤ j ≤ D and gcd(j, D) = 1
Mmin
Mmax
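The two Phase 2 tables can be illustrated with a short sketch. The value D = 210 is an assumption (it is suggested by the entry d^209 in the local memory but is not stated on the slide), and the function names `gcd_table` and `split_prime` are hypothetical.

```python
from math import gcd

def gcd_table(D):
    """All j with 1 <= j <= D and gcd(j, D) = 1."""
    return [j for j in range(1, D + 1) if gcd(j, D) == 1]

def split_prime(p, D):
    """Write a prime p (not dividing D) as p = m*D - j with 1 <= j < D."""
    m = -(-p // D)     # ceiling of p / D
    j = m * D - p
    return m, j

D = 210                # assumed value: 2 * 3 * 5 * 7
table = gcd_table(D)
m, j = split_prime(211, D)
```

Because p is prime and does not divide D, the resulting j is always coprime to D, so only powers d^j with j from the GCD table need to be stored.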

Local Memory – p-1

[Memory maps, 32 bits wide (bits 31..0), addresses 0 to 511:]

a) Phase 1: N, g1, g2, g3, . . ., gs, d = g^e
* s = 2^k - 1

b) Phase 2: N, d, d^2, d^11, d^13, . . ., d^209, d^D, d^(m·D), d^(m·D) - d^j, x

Control Unit

Phase 1 Phase 2


Modular Exponentiation

Reading Out Results


Pre-Computations

Reading Out Results

Main-Computations

Unified Architecture

Control Unit

Global Memory

ADD/SUB

MUL

Local Memory for p-1

Local Memory for Rho

Control Unit


Rho-Computations

Reading Out Results

P-1 -Computations

Control Unit

Total 17 state machines with 140 states

5 state machines with 45 states in Rho

12 state machines with 103 states in P-1

5 Shift registers

9 Registers

13 Counters

22 Comparators

Original design

Design Flow

FPGA vs ASIC

FPGA

Field Programmable Gate Array

Array of logic blocks

Switchable interconnect resources

Final user can set switches

Immediate use (“Zero” fab time)

Not good for high volume applications

ASIC

Application Specific Integrated Circuit

Standard cells and Macros

Requires full manufacturing sequence

Good for high volume applications

FPGA Design Flow

Specification

RTL Description (VHDL / Verilog HDL)

Design Entry

Synthesis

Implementation

Configuration

Functional Simulation

Post-Synthesis Simulation

Timing Simulation

On Chip Testing

Design Verification

ASIC Design Flow

Front-End Design

Synthesis

Back-End Design

Floorplanning

Placement

Clock Tree Synthesis

Routing

Timing Analysis

Design for Manufacturing

Tools: Design Analyzer, Primetime, Astro

Results

Families of Xilinx FPGA Devices

Low-cost                    High-performance
Spartan 3 (< $130*)         Virtex II (< $2,700*)
Spartan 3E (< $35*)         Virtex 4 (< $3,000*)

* approximate cost of the largest device per unit for a batch of 10,000 units

FPGA Implementation of Single Units

Results                  Rho            P-1            Unified
Resources
- CLB Slices             1,680 (4%)     1,749 (5%)     2,042 (6%)
- LUTs                   2,714 (4%)     2,875 (4%)     3,451 (5%)
- FFs                    1,518 (2%)     1,645 (2%)     1,740 (2%)
- BRAMs                  0/144          2/144          2/144
Max. Clock Frequency     130 MHz        131 MHz        115 MHz

Target device is Virtex II XC2V6000-6

Number of unified units per FPGA

[Bar chart for Spartan 3 XC3S5000-5 (low-cost), Virtex II XC2V6000-6 (high-performance), Spartan 3E XC3S1600-5 (low-cost), and Virtex 4 XC4VLX200-11 (high-performance). Bar labels in extraction order: 21, 19, 42, 8.]

Performance – Unified Operations per Second

[Bar chart for Spartan 3 XC3S5000-5 (low-cost), Virtex II XC2V6000-6 (high-performance), Spartan 3E XC3S1600-5 (low-cost), and Virtex 4 XC4VLX200-11 (high-performance). Bar labels in extraction order: 581, 819, 290, 2,262. Annotated ratios: x 1.41, x 7.8.]

Performance to cost ratio

Unified Operations per second per $100

[Bar chart for Spartan 3 XC3S5000-5 (low-cost), Virtex II XC2V6000-6 (high-performance), Spartan 3E XC3S1600-5 (low-cost), and Virtex 4 XC4VLX200-11 (high-performance). Bar labels in extraction order: 447, 828, 3075. Annotated ratios: x 14.9, x 11.]

ASIC - Layout of p-1 - floorplanning

Layout of p-1 - placement

Layout of p-1 – clock tree synthesis

Layout of p-1 – Global Routing

Layout of p-1 – Detailed Routing

Results - ASIC Implementation

Operation                                   rho          p-1          Unified architecture
Area                                        1.15 mm2     1.21 mm2     1.8 mm2
Max. Clock Frequency                        200 MHz      200 MHz      200 MHz
Time for execution                          3.52 ms      9.56 ms      13.1 ms
# of operations per second
(using maximum no. of units)                96,022       34,100       16,615
Core utilization ratio                      70%          70%          65%

Area of Virtex II FPGA is 19.68 x 19.8 mm2 (estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)

FPGA vs ASIC - Area

[Bar chart, reconstructed as a table:]

            Rho       P-1       Unified
ASIC        338       322       216
FPGA        20        23        21
Ratio       x 17      x 14      x 10

Area of Virtex II FPGA is 19.68 x 19.8 mm2

(estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)

Rho in an ASIC 130 nm

[Layout figure; labeled regions: Local Memory, Global Memory]

ASIC 130 nm vs. Virtex II 6000 – rho (20 units)

Area of Virtex II 6000: 19.68 mm x 19.80 mm (estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)

Area of an ASIC with equivalent functionality: 2.7 mm x 2.82 mm

Area ratio: 51x

ASICs vs. FPGAs

Source: I. Kuon, J. Rose, University of Toronto, “Measuring the Gap Between FPGAs and ASICs,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, Feb 2007.

Contributions

Verified the VHDL code through functional and timing simulation by comparison with the operation of a test software implementation written in C.

Ported the VHDL code to 4 different families of FPGA devices and to a standard-cell ASIC based on the 130 nm TSMC library

Conclusions

Low-cost FPGA devices, such as Spartan 3, outperformed high-performance devices, such as Virtex II, in terms of performance to cost ratio by a factor of 14.9

The ASIC implementation outperforms the FPGA by a factor of 50* in terms of area and by a factor of 1.5 in terms of frequency.

*In the case of rho it is 50; for other architectures it may be less

Conclusions

Low-cost FPGA devices Spartan 3 and Spartan 3E are suitable for code-breaking

An ASIC implementation is suitable when a large number of chips (>1,000,000) is considered

Future Work

Implementation of Trial Division in Hardware

Implementation of ECM in Hardware using one multiplier and one adder/subtractor

Integrating Trial Division, Rho, P-1 and ECM to build a co-factoring machine

Experiments on COPACOBANA

Thank you!

Questions???
