
Software Speed Records for Lattice-Based Signatures

Tim Güneysu¹, Tobias Oder¹, Thomas Pöppelmann¹, Peter Schwabe²

¹ Horst Görtz Institute for IT-Security, Ruhr-University Bochum, Germany

² Digital Security Group, Radboud University Nijmegen, The Netherlands

June 5, 2013, PQCrypto 2013, Limoges, France

Outline

I Motivation

I Introduction of implemented signature scheme

I Optimizations and implementation techniques for high speed

  I Fast polynomial multiplication
  I Vector instructions (AVX)

I Results and comparison

I Outlook

Motivation

I Lattices are post-quantum

I Recent proposals have practical parameters

I How fast is ideal lattice-based cryptography on modern CPUs?

I How to speed up computation?

The [GLP’12] Signature Scheme

I Introduced at CHES’12 by Tim Güneysu, Vadim Lyubashevsky, Thomas Pöppelmann [GLP’12]¹

I Optimized version of [Lyu12]²

I Based on ideal lattices (lattices with additional algebraic structure)

I Practical parameters and adapted to needs of embedded hardware (no Gaussian sampling)

¹ Practical lattice-based cryptography: A signature scheme for embedded systems, Tim Güneysu, Vadim Lyubashevsky, Thomas Pöppelmann, CHES 2012

² Lattice signatures without trapdoors, Vadim Lyubashevsky, Eurocrypt 2012

Notation

I $n$ is a power of 2

I $p$ is a prime congruent to 1 modulo $2n$ (necessary for efficiency)

I $R$ is the ring $\mathbb{Z}_p[x]/\langle f = x^n + 1 \rangle$

I $R_k$ is the subset of $R$ with coefficients in $[-k, k]$

Hardness Assumptions

Standard lattice hardness assumption:

Definition (Decisional Ring-LWE)

Given $(a_1, t_1), \ldots, (a_m, t_m) \in R \times R$. Decide whether $t_i = a_i s + e_i$, where $s, e_1, \ldots, e_m \leftarrow D_\sigma$ and $a_i \xleftarrow{\$} R$, or whether the pairs are uniformly random from $R \times R$ ($D_\sigma$ denotes a Gaussian distribution).

More "aggressive" hardness assumption:

Definition (Decisional Compact Knapsack Problem)

Given $(a, t) \in R \times R$. Decide whether $(a, t = a s_1 + s_2)$, where $s_1, s_2 \xleftarrow{\$} R_1$ and $a \xleftarrow{\$} R$, or whether $(a, t)$ is uniformly random from $R \times R$.

Simple Signature Scheme

crypto_sign_keypair()

Signing Key: $s_1, s_2 \xleftarrow{\$} R_1$

Verification Key: $t \leftarrow a s_1 + s_2$

crypto_sign($\mu$, $s_1$, $s_2$)

1: $y_1, y_2 \xleftarrow{\$} R_k$
2: $c \leftarrow H(a y_1 + y_2, \mu)$
3: $z_1 \leftarrow s_1 c + y_1$, $z_2 \leftarrow s_2 c + y_2$
4: if $z_1$ or $z_2 \notin R_{k-32}$, goto step 1
5: output $(z_1, z_2, c)$

crypto_sign_open($\mu$, $z_1$, $z_2$, $c$, $t$)

1: Accept iff $z_1, z_2 \in R_{k-32}$ and $c = H(a z_1 + z_2 - t c, \mu)$

Comments:

I Secret key is two random polynomials $s_1, s_2$ with −1/0/1 coefficients

I Extracting the secret key from the public key is exactly solving the Search Compact Knapsack problem (SCK); $a$ is a global constant

I The message to be signed is $\mu$

I Pick two "masking values" $y_1, y_2$ from $R_k$ with coefficients somewhat smaller than $p$

I Bind the "nonce" $a y_1 + y_2$ to the message by hashing. Use a 160-bit hash output and transform it into a polynomial $c$ with only 32 coefficients being either −1 or 1

I Compute $z_1$ and $z_2$ by multiplying the sparse and small polynomial $c$ by the small (but not sparse) secret key polynomials $s_1, s_2$. Add the masking values

I Rejection sampling step testing whether $z_1, z_2$ are in the range $R_{k-32}$

I Signature is $(z_1, z_2, c)$

I Check size of coefficients to prevent attack

I Correctness: $a z_1 + z_2 - t c = a s_1 c + a y_1 + s_2 c + y_2 - a s_1 c - s_2 c = a y_1 + y_2$

Parameters

Implemented parameter set:

I $n = 512$, $p = 8383489$ (23-bit), $k = 2^{14}$

I |sig| = 8950 bits

I |sk| = 1620 bits

I |pk| = 11800 bits

I Parameter $a$ is chosen as a global constant and is not included in the public key size

I Security: 80 bits (revised down from 100 bits by a new Crypto 2013 result)

  I Parameters are chosen with the root-Hermite factor methodology
  I Solve underlying lattice problem (80 bits)
  I Find a preimage in the hash function (output size of 160 bits, chosen with a quantum computer in mind)

I On average 7 attempts needed (rejection sampling)
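The figure of roughly 7 attempts can be checked with a short calculation. The sketch below assumes $k = 2^{14}$ and that $c$ has 32 coefficients equal to ±1, so every coefficient of $s_i c$ has absolute value at most 32; the variable names are illustrative only.

```python
n, k, weight = 512, 2**14, 32

# Each coefficient of z = s*c + y is y_i shifted by at most `weight`, and y_i is
# uniform on [-k, k], so it lands in [-(k - weight), k - weight] with probability:
p_coeff = (2 * (k - weight) + 1) / (2 * k + 1)

# z1 and z2 together have 2n independently masked coefficients.
p_accept = p_coeff ** (2 * n)
expected_attempts = 1 / p_accept

print(f"acceptance probability per attempt: {p_accept:.4f}")
print(f"expected signing attempts: {expected_attempts:.2f}")
```

Under these assumptions the acceptance probability is about 0.135, which gives an expectation of about 7.4 attempts, consistent with the rounded value on the slide.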


Most Expensive Operations



Verification Key: $t \leftarrow a s_1 + s_2$

1: $y_1, y_2 \xleftarrow{\$} R_k$
2: $c \leftarrow H(a y_1 + y_2, \mu)$
3: $z_1 \leftarrow s_1 c + y_1$, $z_2 \leftarrow s_2 c + y_2$
4: if $z_1$ or $z_2 \notin R_{k-32}$, goto step 1
5: output $(z_1, z_2, c)$

1: Accept iff $z_1, z_2 \in R_{k-32}$ and $c = H(a z_1 + z_2 - t c, \mu)$

I Sampling: Sampling of $2n = 1024$ uniformly random values from the range $[-k, k]$

I Polynomial multiplication: Multiplication of 512 × 512-coefficient polynomials (sparse and non-sparse)

High Level Optimizations


High Level Optimization - Random Polynomials

Uniformly random sampling of $y_1$ and $y_2$ from $R_k$:

I Generate random bytes using the Salsa20 stream cipher (/dev/urandom just for the seed)

I We need rejection sampling to achieve a uniform distribution

  I For one random polynomial, sample $n + 16$ unsigned 32-bit integers
  I Discard coefficient $r_i$ if $r_i \geq (2k + 1) \cdot \lfloor 2^{32}/(2k + 1) \rfloor$
  I Compute $c_i = (r_i \bmod (2k + 1)) - k$
  I Probability to discard a coefficient is $4/2^{32} = 2^{-30}$
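The sampling steps above can be sketched as follows. This is a minimal illustration assuming $k = 2^{14}$; it draws 32-bit values from Python's `secrets` module as a stand-in for the Salsa20 keystream used in the talk, and `sample_poly` is a hypothetical helper name.

```python
import secrets

K = 2**14                           # coefficient bound, assuming k = 2^14
RANGE = 2 * K + 1                   # number of values in [-K, K]
LIMIT = RANGE * (2**32 // RANGE)    # largest multiple of RANGE not exceeding 2^32

def sample_poly(n=512, extra=16):
    """Return n coefficients uniform in [-K, K], or None if too many were rejected."""
    coeffs = []
    for _ in range(n + extra):
        r = secrets.randbits(32)    # stand-in for 4 bytes of Salsa20 output
        if r >= LIMIT:              # discard to avoid modulo bias
            continue
        coeffs.append(r % RANGE - K)
        if len(coeffs) == n:
            return coeffs
    return None                     # a caller would retry; this is extremely unlikely

rejection_probability = (2**32 - LIMIT) / 2**32
poly = sample_poly()
```

Sampling 16 spare integers per polynomial means a rejected value does not force a second pass through the generator, which keeps the running time practically independent of the rejections.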


High Level Optimization - Polynomial Multiplication

Computation of $a y_1$, $s_1 c$, $s_2 c$, $a z_1$, $t c$:

I Expensive: polynomials have 512 coefficients

I Schoolbook multiplier has complexity $O(n^2)$ and requires $n^2 = 262144$ multiplications

I Number Theoretic Transform (NTT) has complexity $O(n \log n)$

I Put simply, the NTT is an FFT in $\mathbb{Z}_p$

I For $a, b, d \in R$ the multiplication $d = a \cdot b$ corresponds to the negative wrapped convolution (modulus $x^n + 1$)

High Level Optimization - Polynomial Multiplication

Theorem (Wrapped Convolution)

Let $\omega$ be a primitive $n$-th root of unity in $\mathbb{Z}_p$ and $\psi^2 = \omega$. Let $d$ be the negative wrapped convolution of $a$ and $b$. Let $\bar{a}$, $\bar{b}$ and $\bar{d}$ be defined as $(a_0, \psi a_1, \ldots, \psi^{n-1} a_{n-1})$, $(b_0, \psi b_1, \ldots, \psi^{n-1} b_{n-1})$, and $(d_0, \psi d_1, \ldots, \psi^{n-1} d_{n-1})$. Then

$\bar{d} = \mathrm{NTT}_\omega^{-1}(\mathrm{NTT}_\omega(\bar{a}) \circ \mathrm{NTT}_\omega(\bar{b}))$.

Advantages:

I Reduction by $x^n + 1$ for free and no zero padding

I Store constants (e.g., $a$) in NTT representation

I Only $\frac{1}{2} n \log n$ multiplications for one NTT

I Constant time

Disadvantage:

I Storage for powers of $\omega$, $\psi$, $\omega^{-1}$, $\psi^{-1}$
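The theorem can be checked numerically with a small Python sketch. It uses $n = 512$ and $p = 8383489$ and compares the NTT route against a schoolbook product modulo $x^n + 1$. The recursive transform and the helper names (`find_psi`, `ntt`, `negacyclic_mul`, `schoolbook_mul`) are illustrative assumptions, not the optimized iterative code from the talk; `pow(x, -1, p)` needs Python 3.8 or later.

```python
import random

N, P = 512, 8383489            # P is congruent to 1 modulo 2N

def find_psi(n, p):
    """Search for psi with psi^n = -1 (mod p), i.e. a primitive 2n-th root of unity."""
    for g in range(2, p):
        psi = pow(g, (p - 1) // (2 * n), p)
        if pow(psi, n, p) == p - 1:
            return psi
    raise ValueError("no primitive 2n-th root of unity found")

def ntt(a, omega, p):
    """Recursive radix-2 NTT: evaluates a at the powers of omega."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], omega * omega % p, p)
    odd = ntt(a[1::2], omega * omega % p, p)
    out, w = [0] * n, 1
    for i in range(n // 2):
        t = w * odd[i] % p
        out[i] = (even[i] + t) % p
        out[i + n // 2] = (even[i] - t) % p   # omega^(n/2) = -1
        w = w * omega % p
    return out

def negacyclic_mul(a, b, psi, p):
    """Multiply a and b modulo x^n + 1 and p via scaling by powers of psi."""
    n = len(a)
    omega, psi_inv, n_inv = psi * psi % p, pow(psi, -1, p), pow(n, -1, p)
    a_bar = [x * pow(psi, i, p) % p for i, x in enumerate(a)]
    b_bar = [x * pow(psi, i, p) % p for i, x in enumerate(b)]
    prod = [x * y % p for x, y in zip(ntt(a_bar, omega, p), ntt(b_bar, omega, p))]
    d_bar = ntt(prod, pow(omega, -1, p), p)   # inverse transform, up to the factor 1/n
    return [x * n_inv % p * pow(psi_inv, i, p) % p for i, x in enumerate(d_bar)]

def schoolbook_mul(a, b, p):
    """Reference O(n^2) product modulo x^n + 1 and p."""
    n = len(a)
    d = [0] * n
    for i in range(n):
        for j in range(n):
            if i + j < n:
                d[i + j] = (d[i + j] + a[i] * b[j]) % p
            else:
                d[i + j - n] = (d[i + j - n] - a[i] * b[j]) % p
    return d

rng = random.Random(2013)
a = [rng.randrange(P) for _ in range(N)]
b = [rng.randrange(P) for _ in range(N)]
psi = find_psi(N, P)
print(negacyclic_mul(a, b, psi, P) == schoolbook_mul(a, b, P))
```

A real implementation would precompute the powers of $\psi$ and $\omega$ in tables, which is the storage cost listed as the disadvantage.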


Low Level Optimizations


Low Level Optimization - SIMD/AVX

Modern processors are equipped with powerful vector engines:

I Allow single instruction, multiple data (SIMD) operations

I Supported by Advanced Vector Extensions (AVX) extending the x86 instruction set

I Included in Intel Sandy Bridge, Intel Ivy Bridge, and AMD Bulldozer

General idea:

I Represent a 512-coefficient polynomial as an array of 512 double-precision floating-point values

I 4 double-precision floats fit into the 256-bit-wide AVX ymm vector registers

I Perform operations on four coefficients at the same time (e.g., addition, reduction)

I For example, AVX supports one double-precision vector multiplication and one addition every cycle

Low Level Optimization - Modular Reduction and Addition

Parallel modular reduction:

I Modular reduction mod $p$ is extremely important

I Multiply $x$ by the inverse: $c = x \cdot p^{-1}$ (vmulpd)

I Round $c$, multiply the rounded $c$ by $p$, and then subtract the product from $x$ (vroundpd/vsubpd)

I Work on four coefficients in parallel

I Lazy reduction: Reduce only when necessary ($p$ has 23 bits, the double-precision mantissa has 53 bits)

Parallel addition:

I Just requires 256 vector loads, 128 vector additions or subtractions, and 128 vector stores
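The floating-point reduction steps can be illustrated in scalar Python, whose `float` is an IEEE double. This is only a sketch of the arithmetic on one coefficient; the AVX code described above applies the same steps to four coefficients per instruction, and `reduce_mod_p` is a hypothetical name.

```python
P = 8383489
P_INV = 1.0 / P                    # precomputed once, stored as a double

def reduce_mod_p(x):
    """Map a double holding an integer to a congruent value near [-P/2, P/2]."""
    c = round(x * P_INV)           # multiply by the inverse, then round to nearest
    return x - c * P               # subtract the nearest multiple of P

# Product of two reduced coefficients: below 2^46, so it is exact in a double.
x = float(8383488 * 8383488)
r = reduce_mod_p(x)
print(r)
```

Because products of two 23-bit values stay below $2^{46}$, several of them can be accumulated before the 53-bit mantissa overflows, which is what makes lazy reduction possible.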


Low Level Optimization - Optimizing the NTT

I NTT is most speed-critical operation

I Adapted standard fast iterative algorithm

I log2 n = 9 levels of operations

I Additions (and subtractions) are a bottleneck

I Merging of levels to reduce loads and stores


Results

Cycle counts for Intel Ivy Bridge:

Operation                 Cycles   Op/s (@2 GHz)
crypto_sign_keypair        31140           64226
crypto_sign               634988            3149
crypto_sign_open           45036           44408

ntt                         4484          446030
poly_mul                   16096          124254
poly_mul_a                 11044          181093
poly_setrandom_maxk        10824          184774
poly_setrandom_max1         5464          366032

I Signing attempt takes 85384 cycles

I On average 7 attempts: 7 · 85384 = 597688 cycles + overhead

I NTT is even advantageous for sparse multiplication

I Timing variation is independent of secret data, which protects against timing attacks
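The arithmetic behind these figures can be reproduced with a few lines of Python. The sketch assumes the operations-per-second column is the 2 GHz clock divided by the cycle count and rounded down; the dictionary keys follow the eBACS-style function names.

```python
CLOCK_HZ = 2_000_000_000           # the 2 GHz clock from the table header

cycles = {
    "crypto_sign_keypair": 31140,
    "crypto_sign": 634988,
    "crypto_sign_open": 45036,
    "ntt": 4484,
}
ops_per_second = {name: CLOCK_HZ // c for name, c in cycles.items()}

attempt_cycles = 85384
expected_sign_cycles = 7 * attempt_cycles                   # seven attempts on average
overhead = cycles["crypto_sign"] - expected_sign_cycles     # remainder of the measured cost

for name, rate in ops_per_second.items():
    print(f"{name}: {rate} op/s")
print(f"7 attempts: {expected_sign_cycles} cycles, overhead: {overhead} cycles")
```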


Comparison - ECC/RSA

Note that the implementation is now in the eBACS benchmarking framework (lattisigns512).

Software                Security   Sign (cycles)   Verify (cycles)   pk (bytes)   sk (bytes)   sig (bytes)
Our work                80 bits           634988             45036         1536          256          1184
ed25519                 128 bits           67564            209328           32           64            64
ronald2048 (RSA-2048)   112 bits         5768360             77032          256         2048           256

Comparison - PQ

Software                 Security   Sign (cycles)   Verify (cycles)   pk (bytes)   sk (bytes)   sig (bytes)
Our work                 80 bits           634988             45036         1536          256          1184
XMSS                     82 bits          7261100            556600          912           19          2451
mqqsig160                80 bits             1996             33220       206112          401            20
rainbow binary16242020   80 bits            29364             17900       102912        94384            40

Lessons learned and outlook

Lessons learned:

I (Ideal) lattice-based cryptography is fast and parallelizable

I Results are applicable to other schemes (e.g., LWE-encryption)

I For high speed do not rely on high level libraries (e.g., NTL)

Outlook:

I Lattice Signatures and Bimodal Gaussians, Léo Ducas, Alain Durmus, Tancrède Lepoint, and Vadim Lyubashevsky, Crypto 2013, to appear

  I Signature size of 5600 bits providing 128-bit security
  I Analysis suggests only 80-bit security for the implemented scheme ([GLP’12] claim was 100)

I Different architectures, e.g., GPU or ARM with NEON

I Application of results to other schemes (e.g., LWE-encryption, homomorphic, IBE)

Thank you for your attention!

Any questions?

Results are online: http://cryptojedi.org/crypto/index.shtml#lattisigns

Backup


Security Proof

Sketch of the security proof:

I Choose an invalid public key $(a, t = a s'_1 + s'_2)$ with $s'_1, s'_2$ in $R_{k'}$ for sufficiently large $k'$. The invalid key is indistinguishable from a valid key.

I Deliver signatures by programming the random oracle (ROM), without knowing the private key

I When the adversary produces a forgery $(z_1, z_2, c)$, we can produce a second forgery $(z'_1, z'_2, c)$ by the Forking Lemma

I Therefore it holds that $H(a z_1 + z_2 - t c, m) = H(a z'_1 + z'_2 - t c, m)$. This allows us to extract small $u_1, u_2$ such that $a u_1 + u_2 = 0$, which allows us to solve the DCK problem
