montgomery multiplication

42
1 Montgomery Montgomery Multiplication Multiplication David Harris and Kyle Kelley Harvey Mudd College Claremont, CA 91711 {David_Harris, Kyle_Kelley}@hmc.edu

Upload: hamlin

Post on 25-Feb-2016

150 views

Category:

Documents


1 download

DESCRIPTION

Montgomery Multiplication. David Harris and Kyle Kelley Harvey Mudd College Claremont, CA 91711 {David_Harris, Kyle_Kelley}@hmc.edu. Outline. Cryptography Overview Finite Field Mathematics Montgomery Multiplication Tenca-Koç Montgomery Multiplier Improved Montgomery Multiplier - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Montgomery Multiplication

1

MontgomeryMontgomeryMultiplicationMultiplication

David Harris and Kyle KelleyHarvey Mudd CollegeClaremont, CA 91711

{David_Harris, Kyle_Kelley}@hmc.edu

Page 2: Montgomery Multiplication

2

OutlineOutline• Cryptography Overview• Finite Field Mathematics• Montgomery Multiplication• Tenca-Koç Montgomery Multiplier• Improved Montgomery Multiplier• Very High Radix• Implementation Results• Summary

Page 3: Montgomery Multiplication

3

Cryptography OverviewCryptography Overview

• Encryption has become essential– E-commerce (SSL)– Communications / network processors– Smart cards / digital cash– Military

• Two major classes of algorithms– Symmetric cryptosystems (e.g. DES)– Public key cryptosystems (e.g. RSA)

Page 4: Montgomery Multiplication

4

Cryptographic ProtocolsCryptographic Protocols

• Alice and Bob would like to communicate securely. Eve wants to listen in.– Symmetric key:

• Alice and Bob must share a key for encryption and decryption.

• If Eve hears it, she can read the messages.– Public key:

• Alice publishes her public key to the world.• Bob encrypts with Alice’s public key.• Alice can decrypt only with her private key.• Eve can’t decrypt with the public key.

Page 5: Montgomery Multiplication

5

Digital SignaturesDigital Signatures

• Alice wants to sign a contract in a way that only she can do.– Alice publishes her public key and

keeps the private key secret.– Encrypt the document with her secret

key.– Anyone can decrypt the document with

her public key– But nobody can forge her signature.

Page 6: Montgomery Multiplication

6

Key ExchangeKey Exchange

• Public key encryption is slow• Use it to share a symmetric key

– Use symmetric key to encrypt large blocks of data

Page 7: Montgomery Multiplication

7

RSA EncryptionRSA Encryption

• Most widely used public key system.– Good for encryption and signatures.– Invented by Rivest, Shamir, Adleman (1978)

• Public e and private d keys are long #s– n = 256-2048+ bits– Satisfy xde mod M = 1 for all x– Finding d from e is as hard as factoring M

• Encryption: B = Ae mod M• Decryption: C = Bd mod M = Aed = A

Page 8: Montgomery Multiplication

8

Modular ExponentiationModular Exponentiation

• Critical operation in RSA and for– Digital signature algorithm– Diffie-Hellman key exchange– Elliptic curve cryptosystems

• Done with 2n modular multiplications– Ex: A27 = ((((((A2) * A)2)2) * A)2) * A– Division required after each

multiplication to compute modulo

Page 9: Montgomery Multiplication

9

Finite Field MathematicsFinite Field Mathematics

• +, * modulo prime p form a finite field– p elements– Additive identity: 0– Multiplicitive identity: 1– Each nonzero number has a unique

inverse x-1

– Named GF(p)• For Evariste Galois, a 19th century number

theorist killed in a duel at age 20

Page 10: Montgomery Multiplication

10

Binary Extension FieldsBinary Extension Fields

• Building blocks are polynomials in x– Operations performed modulo some

irreducible polynomial f(x) of degree n– Arithmetic done modulo 2– Called GF(2n)

• Example: GF(23)

• Computation is the same as GF(p)– Except that no carries are propagated

Element Code 0 000 1 001 x 010 x+1 011x2 100x2 +1 101x2+x 110x2+x+1 111

Page 11: Montgomery Multiplication

11

Montgomery MultiplicationMontgomery Multiplication

• Faster way to do modular exponentation– Operate on Montgomery residues– Division becomes a simple shift– Requires conversion to and from

residues only once per exponentiation

Page 12: Montgomery Multiplication

12

Montgomery ResiduesMontgomery Residues

• Let the modulus M be an odd n-bit integer

• Define r = 2n

• Define the M-residue of an integer a < M as

• There is a one-to-one correspondence between integers and M-residues for 0 < a < M-1

nn M 22 1

Mara mod

Page 13: Montgomery Multiplication

13

MM-Residue Examples-Residue Examples

• M = 11, r = 16

611 mod 16*1010

111 mod 16*99

711 mod 16*88

211 mod 16*77811 mod 16*66

311 mod 16*55911 mod 16*44

411 mod 16*33

1011 mod 16*22

511 mod 16*11

011 mod 16*00

Page 14: Montgomery Multiplication

14

Montgomery MultiplicatonMontgomery Multiplicaton

• Define

• Where r-1 is the inverse of r mod M: – r-1r = 1 (mod M)

• This gives the Montgomery residue of – z = xy mod M

MryxyxMMz mod),( 1

MzrMxyr

Mryrxr

Mryxz

mod mod

mod))((

mod1

1

Page 15: Montgomery Multiplication

15

Mont. Multiplication ExampleMont. Multiplication Example

711 mod 9*7*5)7,5()111 mod 9*16( 91

MMr

• It may not be obvious that this is easier to do than regular modular multiplication.

Page 16: Montgomery Multiplication

16

Montgomery MultiplierMontgomery Multiplier

• MM is an easier operation that requires no hard division, just shifting

• In radix 2,

Z = 0for i = 0 to n-1Z = Z + xi•Yif Z is odd then Z = Z + MZ = Z/2

if Z ≥ M then Z = Z – M

Page 17: Montgomery Multiplication

17

ExampleExample

• X = 7 = 0111• Y = 5 = 0101• M = 11 = 1011

• Z initially 0– Z = (0 + 5 + 11) / 2 = 8– Z = (8 + 5 + 11) / 2 = 12– Z = (12 + 5 + 11) / 2 = 14– Z = (14 + 0) / 2 = 7 (final result)

Z = 0for i = 0 to n-1

Z = Z + xi•Yif Z is odd then Z

= Z + MZ = Z/2

if Z ≥ M then Z = Z – M

Page 18: Montgomery Multiplication

18

ConversionConversion

• Conversion of integers to/from Montgomery residues takes one MM operation (if r2 mod M is precomputed and saved):

• Modular exponentiation takes two conversion steps and 2n multiplication steps.

xMrxrMrxxMMx

MxrMrxrrxMMx

mod 1 mod1)1,(

mod mod),(11

122

Page 19: Montgomery Multiplication

19

Cryptography AcceleratorsCryptography Accelerators

• Hardware accelerators offer more speed at less power than software– Via announced x86 C5J core

Montgomery Multiply opcode (May 04)

3COM Router 5000 Series Encryption Accelerator

IBM PCI SSL Cryptography Accelerator

Page 20: Montgomery Multiplication

20

BreakBreak

Page 21: Montgomery Multiplication

21

BreakBreak

Page 22: Montgomery Multiplication

22

BreakBreak

Page 23: Montgomery Multiplication

23

Reconfigurable HardwareReconfigurable Hardware

• Building hardwired n-bit unit is limiting– Slow for large n– Not scalable to different n

• Better to design for w-bit words– Break n-bit operand into e w-bit words– This is called scalable

• Also handle both GF(p) and GF(2n)– Requires conditionally killing carries– Called unified

Page 24: Montgomery Multiplication

24

Unified Carry GateUnified Carry Gate

• Full adder modified for dual-field ops– fsel = 1: normal operation GF(p)– fsel = 0: kill carry GF(2n)

• Only changesmajority gate

• Sum remainsXOR

FSEL

FSEL

C S

P S

C

S

C

S C

PCout

Page 25: Montgomery Multiplication

25

Tenca-Koç Montgomery MultiplierTenca-Koç Montgomery Multiplier

Z = 0for i = 0 to n-1 (CA, Zw-1:0) = Zw-1:0 + Xi × Yw-1:0

reduce = Z0

(CB, Zw-1:0) = Zw-1:0 + reduce × Mw-1:0

for j = 1 to e+1 (CA, Z(j+1)w-1:jw) = Z(j+1)w-1:jw + Xi × Y(j+1)w-1:jw + CA

(CB, Z(j+1)w-1:jw) = Z(j+1)w-1:jw + reduce × M(j+1)w-1:jw + CB

Zjw-1:(j-1)w = (Zjw, Zjw-1:(j-1)w+1)

IEEE Transactions on Computers, Sept. 2003

Page 26: Montgomery Multiplication

26

Processing ElementsProcessing Elements

• Keep Z in carry-save redundant form• Simple processing element (PE)

3:2 CS

A

3:2 CS

A

(w)

Cin

Cb

x i

Zw-1:0

Yw-1:0

Mw-1:0

Ca

Cout

Cin

Cout

reset

(w)

Z0

Zw-1

Z0

Zw-1:0

Mw-1:0Yw-1:0

reduce

Page 27: Montgomery Multiplication

27

ParallelismParallelism

• Two dimensions of parallelism:– Width of processing element w– Number of pipelined PEs p

• Multiply takes k = n/p kernel cycles

FIFO

0YM

Mem

X Mem

PE1 PE2 PE3 PE p

SequenceControl

Result

Z

MYx

Kernel

Z’

Page 28: Montgomery Multiplication

28

Pipeline TimingPipeline Timing

Ker

nel C

ycle

1

Case I: e > 2p-1e = 4, p = 2

Case II: e < 2p-1e = 4, p = 4

time

spacePE1 PE2

1 x0

x1

x0

x1

x0

x1

x0

x1

x3

x3

x3

x3

x2

x2

x2

x2

23456789

10111213

Ker

nel C

ycle

2

PE1 PE2 PE3 PE4

KernelStall

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

x1MY5w-1:4wZ5w-2:4w-1

x0MY5w-1:4wZ5w-2:4w-1

x3MY5w-1:4wZ5w-2:4w-1

x2MY5w-1:4wZ5w-2:4w-1

x0

x0

x0

x0

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

x0MY5w-1:4wZ5w-2:4w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MY5w-1:4wZ5w-2:4w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MY5w-1:4wZ5w-2:4w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MY5w-1:4wZ5w-2:4w-1

x1

x2

x3

x1

x2

x3

x1

x2

x3

x1

x2

x3

x1

x2

x3

x0

x0

x0

x0

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

x0MY5w-1:4wZ5w-2:4w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MY5w-1:4wZ5w-2:4w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MY5w-1:4wZ5w-2:4w-1

MYw-1:0Zw-2:-1

MY2w-1:wZ2w-2:w-1

MY3w-1:2wZ3w-2:2w-1

MY4w-1:3wZ4w-2:3w-1

MY5w-1:4wZ5w-2:4w-1

x1

x2

x3

x1

x2

x3

x1

x2

x3

x1

x2

x3

x1

x2

x3

Ker

nel C

ycle

1K

erne

l Cyc

le 2

14151617181920

Page 29: Montgomery Multiplication

29

QueueQueue

• If full PEs cause stall, queue results• Convert back to nonredundant form

– Saves queue space– CPA needed for final result anyway

Z’ Z

ResultFIFO

(0 or more words)

firstcycle

bypass

1x0

0100w w

sum

carry CP

A

Page 30: Montgomery Multiplication

30

Improved DesignImproved Design

• Don’t wait two cycles for MSB• Kick off dependent operation right

away on the available bits• Take extra cycle(s) at the end to

handle the extra bits• For p processing elements, cycle

count reduces from 2p to p + (p/w)

Page 31: Montgomery Multiplication

31

Improved PEImproved PE

• Left-shift M and Y rather than right-shifting Z

• Same amount of hardware

3:2 CS

A

3:2 CS

A

(w)Cin

Cb

x i

reduce

Zw-1:0

Mw-1:0Yw-1:0

M-1

Zw-2:-1

Yw-2:-1

Mw-2:-1

Mw-1

Ca

Cout

Cin

Cout

reset

(w)

Yw-1

Y-1

Z0

Page 32: Montgomery Multiplication

32

Pipeline TimingPipeline Timing

Ker

nel C

ycle

1

Case I: e > p+1e = 4, p = 2

Case II: e < p+1e = 4, p = 4

time

spacePE1 PE2

1 MYw-1:0Zw-2:-1

x0

MYw-2:-1Zw-3:-2

x1MY2w-1:wZ2w-2:w-1

x0

MY2w-2:w-1Z2w-3:w-2

x1MY3w-1:2wZ3w-2:2w-1

x0

MY3w-2:2w-1Z3w-3:2w-2

x1MY4w-1:3wZ4w-2:3w-1

x0

MY4w-2:3w-1Z4w-3:3w-2

x1

x3

x3

x3

x3

x2

x2

x2

x2

2345678910111213

Ker

nel C

ycle

2

Ker

nel C

ycle

1

PE1 PE2x0

x1x0

x1x0

x1x0

x1

MYw-3:-2Zw-4:-3

x2

MY2w-3:w-2Z2w-4:w-3

x2

MY3w-3:2w-2Z3w-4:2w-3

x2

MY4w-3:3w-2Z4w-4:3w-3

x2

MYw-4:-3Zw-5:-4

x3

MY2w-4:w-3Z2w-5:w-4

x3

MY3w-4:2w-3Z3w-5:2w-4

x3

MY4w-4:3w-3Z4w-5:3w-4

x3

PE3 PE4

Ker

nel C

ycle

2 x4

x5x4

x4

x4

x6

x7

Kernel StallMYw-1:0Zw-2:-1

MYw-2:-1Zw-3:-2

MY2w-1:wZ2w-2:w-1

MY2w-2:w-1Z2w-3:w-2

MY3w-1:2wZ3w-2:2w-1

MY3w-2:2w-1Z3w-3:2w-2

MY4w-1:3wZ4w-2:3w-1

MY4w-2:3w-1Z4w-3:3w-2

MYw-1:0Zw-2:-1

MYw-2:-1Zw-3:-2

MY2w-1:wZ2w-2:w-1

MY2w-2:w-1Z2w-3:w-2

MY3w-1:2wZ3w-2:2w-1

MY3w-2:2w-1Z3w-3:2w-2

MY4w-1:3wZ4w-2:3w-1

MY4w-2:3w-1Z4w-3:3w-2

MYw-1:0Zw-2:-1

MYw-2:-1Zw-3:-2

MY2w-1:wZ2w-2:w-1

MY2w-2:w-1Z2w-3:w-2

MY3w-1:2wZ3w-2:2w-1

MY3w-2:2w-1Z3w-3:2w-2

MY4w-1:3wZ4w-2:3w-1

MY4w-2:3w-1Z4w-3:3w-2

MYw-3:-2Zw-4:-3

MY2w-3:w-2Z2w-4:w-3

MY3w-3:2w-2Z3w-4:2w-3

MY4w-3:3w-2Z4w-4:3w-3

MYw-4:-3Zw-5:-4

MY3w-4:2w-3Z3w-5:2w-4

MY4w-4:3w-3Z4w-5:3w-4

MY2w-4:w-3Z2w-5:w-4

x5

x6

x7

x5

x6

x7

x5

x6

x7

MY5w-1:4wZ5w-2:4w-1

x0

MY5w-2:4w-1Z5w-3:4w-2

x1

14

MY5w-1:4wZ5w-2:4w-1

x2

MY5w-2:4w-1Z5w-3:4w-2

x3

x0

MY5w-3:4w-2Z5w-4:4w-3

x2

MY5w-4:4w-3Z5w-5:4w-4

x3

MY5w-1:4wZ5w-2:4w-1

x1MY5w-2:4w-1Z5w-3:4w-2

x4

MY5w-3:4w-2Z5w-4:4w-3

x6

MY5w-4:4w-3Z5w-5:4w-4

x7

MY5w-1:4wZ5w-2:4w-1

MY5w-2:4w-1Z5w-3:4w-2

x5

Page 33: Montgomery Multiplication

33

LatencyLatency

• Tenca-Koç

• Improved Design

II) (Case1221

I) (Case111peepkpepek

II) (Case12212I) (Case12)1(2)1(

peepkpepek

Page 34: Montgomery Multiplication

34

Very High RadixVery High Radix

• These designs are Radix-2– 1 bit of x per PE

• Higher radix designs reduce latency– Process more bits of x per PE– Require integer multiplication instead of

AND gates

Page 35: Montgomery Multiplication

35

Montgomery’s AlgorithmMontgomery’s Algorithm

Multiply: Z = X × YReduce: reduce = Z × M’ mod R

Z = [Z + reduce × M] / RNormalize: if Z ≥ M then Z = Z – M

• M’ satisfies RR-1 – MM’ = 1– Drives LSBs to 0

Page 36: Montgomery Multiplication

36

Scalable Very High Radix AlgorithmScalable Very High Radix Algorithmw-bit words of M and Y e = n/wv-bit digits of X f = n/vradix = 2v

Z = 0for i = 0 to f-1

(CA, Zw-1:0) = Zw-1:0 + X(i+1)v-1:iv × Yw-1:0

reduce = (M'v-1:0 × Zw-1:0)v-1:0

(CB, Zw-1:0) = Zw-1:0 + reduce × Mw-1:0

for j = 1 to e+1

(CA, Z(j+1)w-1:jw) = Z(j+1)w-1:jw + X(i+1)v-1:iv × Y(j+1)w-1:jw + CA

(CB, Z(j+1)w-1:jw) = Z(j+1)w-1:jw + reduce × M(j+1)w-1:jw + CB

Zjw-1:(j-1)w = (Zjw+v-1:jw, Zjw-1:(j-1)w+v)

Page 37: Montgomery Multiplication

37

Very High Radix PEVery High Radix PE

Z

*

X

Y

reduce

M

M'

v

v+w01

*

1 0

CA CB

w

w

v

v v

w+1

w w

v+w

w+1

v+w

YMZ to next PE

Y

first

first to next PE

v+w

w w

vv

v

01

v

v

shift

MAC MAC

Page 38: Montgomery Multiplication

38

Very High Radix Pipeline TimingVery High Radix Pipeline Timing

Zw-1:0reduce

Xv-1:0

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Y4w-1:3w

Z4w-1:3w

Y5w-1:4w

Z5w-1:4w

Y6w-1:5w

Z6w-1:5w

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Y4w-1:3w

Z4w-1:3w

Y5w-1:4w

Z5w-1:4w

Y6w-1:5w

Z6w-1:5w

Zw-1:0reduce

X2v-1:v

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Y4w-1:3w

Z4w-1:3w

Y5w-1:4w

Z5w-1:4w

Y6w-1:5w

Z6w-1:5w

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Y4w-1:3w

Z4w-1:3w

Y5w-1:4w

Z5w-1:4w

Y6w-1:5w

Z6w-1:5w

Zw-1:0reduce

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Y4w-1:3w

Z4w-1:3w

Y5w-1:4w

Z5w-1:4w

Y6w-1:5w

Z6w-1:5w

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Y4w-1:3w

Z4w-1:3w

Y5w-1:4w

Z5w-1:4w

Y6w-1:5w

Z6w-1:5w

X3v-1:2v

Zw-1:0reduce

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Y4w-1:3w

Z4w-1:3w

Y5w-1:4w

Z5w-1:4w

Yw-1:0

Zw-1:0

Y2w-1:w

Z2w-1:w

Y3w-1:2w

Z3w-1:2w

Yw-1:0

Zw-1:0

Ker

nel S

tall

PE 1 PE 2 PE 3

Ker

nel C

ycle

1K

erne

l Cyc

le 2

X4v-1:3v

X5v-1:4v

… … … … … …

1

Cycle #

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

Page 39: Montgomery Multiplication

39

LatencyLatency

• Tenca-Koç: k = n/p

• Very High Radix: k = n/pv

3 4 1 2 4 2 (Case I)4 1 1 4 2 (Case II)

k e p e pk p e e p

II) (Case12212I) (Case12)1(2)1(

peepkpepek

Page 40: Montgomery Multiplication

40

ImplementationImplementation

• C and Verilog reference models– Parameterized by w, p, and v– Extensive testing up to n = 1024

• Synthesized Verilog onto FPGA– Xilinx Virtex II Pro XC2V2000-6

Page 41: Montgomery Multiplication

41

ResultsResultsDescription Technology Hardware Clock

Speed (MHz)

Scalable 256-bit time (ms)

1024-bit time (ms)

T-K p = 40 w=8 0.5 m CMOS synthesized

28 Kgates 80 Yes 3.8 88

Improved p = 16 w = 16 Xilinx Virtex II 1514 LUTs+ ~5n RAM

144 Yes 1.1 59

Improved p = 64 w = 16 Xilinx Virtex II 5598 LUTs+ ~5n RAM

144 Yes 1.0 16

p = 4 w = 16 v = 16very high radix

Xilinx Virtex II 780 LUTs + 8 mults+ ~5n RAM

102 Yes 0.45 22

p = 16 w = 16 v = 16very high radix

Xilinx Virtex II 2847 LUTs + 32 mults+ ~5n RAM

102 Yes 0.40 6.6

Page 42: Montgomery Multiplication

42

SummarySummary

• Modular exponentiation is key operation in cryptography

• Hardware accelerators getting popular– Reconfigurable in key length & field

• Developed improved MM– Half the latency for n ≤ w*p– Half the queue size

• Higher radix looks even better– Well-suited to FPGAs