Software Speed Records for Lattice-Based Signatures
Tim Güneysu¹, Tobias Oder¹, Thomas Pöppelmann¹, Peter Schwabe²
¹ Horst Görtz Institute for IT-Security, Ruhr-University Bochum, Germany
² Digital Security Group, Radboud University Nijmegen, The Netherlands
June 5, 2013
PQCrypto 2013, Limoges, France
Outline
I Motivation
I Introduction of implemented signature scheme
I Optimizations and implementation techniques for high speed
  I Fast polynomial multiplication
  I Vector instructions (AVX)
I Results and comparison
I Outlook
Motivation
I Lattices are post-quantum
I Recent proposals have practical parameters
I How fast is ideal lattice-based cryptography on modern CPUs?
I How to speed up computation?
The [GLP’12] Signature Scheme
I Introduced at CHES’12 by Tim Güneysu, Vadim Lyubashevsky, Thomas Pöppelmann [GLP’12]
I Optimized version of [Lyu12]
I Based on ideal lattices (lattices with additional algebraic structure)
I Practical parameters and adapted to needs of embedded hardware (no Gaussian sampling)
[GLP’12] Practical lattice-based cryptography: A signature scheme for embedded systems, Tim Güneysu, Vadim Lyubashevsky, Thomas Pöppelmann, CHES 2012
[Lyu12] Lattice signatures without trapdoors, Vadim Lyubashevsky, Eurocrypt 2012
Notation
I n is a power of 2
I p is a prime congruent to 1 modulo 2n (necessary for efficiency)
I $R$ is the ring $\mathbb{Z}_p[x]/\langle f = x^n + 1 \rangle$
I $R_k$ is the subset of $R$ with coefficients in $[-k, k]$
Hardness Assumptions
Standard lattice hardness assumption:
Definition (Decisional Ring-LWE)
Given $(a_1, t_1), \dots, (a_m, t_m) \in R \times R$. Decide whether $t_i = a_i s + e_i$, where $s, e_1, \dots, e_m \leftarrow D_\sigma$ and $a_i \xleftarrow{\$} R$, or whether the pairs are uniformly random from $R \times R$ ($D_\sigma$ denotes a Gaussian distribution).
More "aggressive" hardness assumption:
Definition (Decisional Compact Knapsack Problem)
Given $(a, t) \in R \times R$. Decide whether $(a, t = a s_1 + s_2)$, where $s_1, s_2 \xleftarrow{\$} R_1$ and $a \xleftarrow{\$} R$, or whether $(a, t)$ is uniformly random from $R \times R$.
Simple Signature Scheme
crypto_sign_keypair()
  Signing Key: $s_1, s_2 \xleftarrow{\$} R_1$
  Verification Key: $t \leftarrow a s_1 + s_2$
crypto_sign($\mu, s_1, s_2$)
  1: $y_1, y_2 \xleftarrow{\$} R_k$
  2: $c \leftarrow H(a y_1 + y_2, \mu)$
  3: $z_1 \leftarrow s_1 c + y_1$, $z_2 \leftarrow s_2 c + y_2$
  4: if $z_1$ or $z_2 \notin R_{k-32}$, goto step 1
  5: output $(z_1, z_2, c)$
crypto_sign_open($\mu, z_1, z_2, c, t$)
  1: Accept iff $z_1, z_2 \in R_{k-32}$ and $c = H(a z_1 + z_2 - t c, \mu)$
Comments:
I Secret key is two random polynomials $s_1, s_2$ with −1/0/1 coefficients
I Extracting the secret key from the public key is exactly solving the Search Compact Knapsack problem (SCK); $a$ is a global constant
I The message to be signed is $\mu$
I Pick two "masking values" $y_1, y_2$ from $R_k$ with coefficients somewhat smaller than $p$
I Bind "nonce" $a y_1 + y_2$ to the message by hashing. Use 160-bit hash output and transform it into a polynomial $c$ with only 32 coefficients being either −1 or 1
I Compute $z_1$ and $z_2$ by multiplying the sparse and small polynomial $c$ by the small (but not sparse) secret key $s$. Add the masking values
I Rejection sampling step testing whether $z_1, z_2$ are in the range $R_{k-32}$
I Signature is $(z_1, z_2, c)$
I Verification: check size of coefficients to prevent attack
I Correctness: $a z_1 + z_2 - t c = a s_1 c + a y_1 + s_2 c + y_2 - a s_1 c - s_2 c = a y_1 + y_2$
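The three operations of the scheme can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the authors' implementation: the parameters `N`, `P`, `K`, `W` are deliberately small hypothetical values (the hash output has `W = 4` nonzero coefficients, so the bound is `K - W` rather than k − 32), the constant `A` is freshly sampled instead of being a fixed global value, multiplication is the plain schoolbook method, and the hash-to-polynomial map `H` is an invented stand-in built on SHA-256.

```python
import hashlib
import random
import secrets

# Toy parameters (assumptions for illustration, NOT the parameter set from the slides).
N, P, K, W = 32, 12289, 512, 4

def mul(a, b):
    """Schoolbook product in Z_P[x]/(x^N + 1): terms that wrap around change sign."""
    d = [0] * N
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if i + j < N:
                d[i + j] = (d[i + j] + ai * bj) % P
            else:
                d[i + j - N] = (d[i + j - N] - ai * bj) % P
    return d

def add(a, b):
    return [(x + y) % P for x, y in zip(a, b)]

def sub(a, b):
    return [(x - y) % P for x, y in zip(a, b)]

def centered(a):
    """Map coefficients from [0, P) to the centered range around zero."""
    return [x - P if x > P // 2 else x for x in a]

def sample(bound):
    """Uniform polynomial with coefficients in [-bound, bound], stored mod P."""
    return [(secrets.randbelow(2 * bound + 1) - bound) % P for _ in range(N)]

def H(poly, msg):
    """Hypothetical hash-to-polynomial map: exactly W coefficients are +1 or -1."""
    data = b"".join(x.to_bytes(2, "little") for x in poly) + msg
    rng = random.Random(hashlib.sha256(data).digest())
    c = [0] * N
    for pos in rng.sample(range(N), W):
        c[pos] = rng.choice((1, P - 1))  # P - 1 represents -1
    return c

A = sample(P // 2)  # stands in for the global constant a (uniform over Z_P)

def in_range(z1, z2):
    return all(abs(x) <= K - W for x in centered(z1) + centered(z2))

def keypair():
    s1, s2 = sample(1), sample(1)
    return (s1, s2), add(mul(A, s1), s2)

def sign(msg, sk):
    s1, s2 = sk
    while True:  # rejection sampling: retry until z1, z2 are small enough
        y1, y2 = sample(K), sample(K)
        c = H(add(mul(A, y1), y2), msg)
        z1, z2 = add(mul(s1, c), y1), add(mul(s2, c), y2)
        if in_range(z1, z2):
            return z1, z2, c

def verify(msg, sig, t):
    z1, z2, c = sig
    return in_range(z1, z2) and c == H(sub(add(mul(A, z1), z2), mul(t, c)), msg)
```

Calling `sk, t = keypair()`, then `sig = sign(b"msg", sk)` and `verify(b"msg", sig, t)` exercises the correctness identity from the comments: the verifier recomputes $a y_1 + y_2$ without knowing $y_1, y_2$.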
Parameters
Implemented parameter set:
I $n = 512$, $p = 8383489$ (23-bit), $k = 2^{14}$
I |sig| = 8950 bits
I |sk| = 1620 bits
I |pk| = 11800 bits
I Parameter $a$ is chosen as global constant and not included in the public key size
I Security: 80 bits (originally claimed 100 bits; revised after a new Crypto 2013 result)
  I Parameters are chosen with the root-Hermite factor methodology
  I Solve underlying lattice problem (80 bits)
  I Find preimage in the hash function (output size of 160 bits) with a quantum computer
I On average 7 attempts needed (rejection sampling)
Most Expensive Operations
crypto_sign_keypair()
  Signing Key: $s_1, s_2 \xleftarrow{\$} R_1$
  Verification Key: $t \leftarrow a s_1 + s_2$
crypto_sign($\mu, s_1, s_2$)
  1: $y_1, y_2 \xleftarrow{\$} R_k$
  2: $c \leftarrow H(a y_1 + y_2, \mu)$
  3: $z_1 \leftarrow s_1 c + y_1$, $z_2 \leftarrow s_2 c + y_2$
  4: if $z_1$ or $z_2 \notin R_{k-32}$, goto step 1
  5: output $(z_1, z_2, c)$
crypto_sign_open($\mu, z_1, z_2, c, t$)
  1: Accept iff $z_1, z_2 \in R_{k-32}$ and $c = H(a z_1 + z_2 - t c, \mu)$
I Sampling: Sampling of $2n = 1024$ uniformly random values from range $[-k, k]$
I Polynomial multiplication: Multiplication of 512 × 512-coefficient polynomials (sparse and non-sparse)
High Level Optimizations
High Level Optimization - Random Polynomials
Uniformly random sampling of $y_1$ and $y_2$ from $R_k$:
I Generate random bytes using Salsa20 stream cipher (/dev/urandom just for the seed)
I We need rejection sampling to achieve uniform distribution
  I For one random polynomial sample $n + 16$ unsigned 32-bit integers
  I Discard coefficient $r_i$ if $r_i \geq (2k + 1) \cdot \lfloor 2^{32}/(2k + 1) \rfloor$
  I Compute $c_i = (r_i \bmod (2k + 1)) - k$
  I Probability to discard a coefficient is $4/2^{32} = 2^{-30}$
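The rejection rule above can be sketched as follows. This is a minimal sketch, assuming $k = 2^{14}$ from the parameter slide; `secrets.randbits` is only a stand-in for the Salsa20 keystream, and the function names are hypothetical. Words are drawn one at a time here rather than in a batch of $n + 16$.

```python
import secrets

K = 2**14                          # k from the parameter set
RANGE = 2 * K + 1                  # number of values in [-k, k]
LIMIT = RANGE * (2**32 // RANGE)   # largest multiple of RANGE not exceeding 2^32

def sample_coefficient(next_u32):
    """Draw one uniform coefficient in [-K, K] from a source of 32-bit words."""
    while True:
        r = next_u32()
        if r < LIMIT:              # otherwise discard r and draw again
            return r % RANGE - K

def sample_poly(n=512, next_u32=lambda: secrets.randbits(32)):
    """Sample n coefficients; the default word source is NOT Salsa20."""
    return [sample_coefficient(next_u32) for _ in range(n)]
```

Only the `2**32 - LIMIT` topmost words are rejected; for this k that is 4 out of $2^{32}$, which matches the discard probability of $2^{-30}$ stated on the slide.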
High Level Optimization - Polynomial Multiplication
Computation of $a y_1$, $s_1 c$, $s_2 c$, $a z_1$, $t c$:
I Expensive - polynomials have 512 coefficients
I Schoolbook multiplier has complexity $O(n^2)$ and requires $n^2 = 262144$ multiplications
I Number Theoretic Transform (NTT) has complexity $O(n \log n)$
I Put simply, the NTT is an FFT in $\mathbb{Z}_p$
I For $a, b, d \in R$ the multiplication $d = a \cdot b$ corresponds to the negative wrapped convolution (modulus $x^n + 1$)
High Level Optimization - Polynomial Multiplication
Theorem (Wrapped Convolution)
Let $\omega$ be a primitive $n$-th root of unity in $\mathbb{Z}_p$ and $\psi^2 = \omega$.
Let $d$ be the negative wrapped convolution of $a$ and $b$. Let $\bar{a}$, $\bar{b}$ and $\bar{d}$ be defined as $(a_0, \psi a_1, \dots, \psi^{n-1} a_{n-1})$, $(b_0, \psi b_1, \dots, \psi^{n-1} b_{n-1})$, and $(d_0, \psi d_1, \dots, \psi^{n-1} d_{n-1})$. Then
$\bar{d} = \mathrm{NTT}_\omega^{-1}(\mathrm{NTT}_\omega(\bar{a}) \circ \mathrm{NTT}_\omega(\bar{b}))$.
Advantages:
I Reduction by $x^n + 1$ for free and no zero padding
I Store constants (e.g., $a$) in NTT representation
I Only $\frac{1}{2} n \log n$ multiplications for one NTT
I Constant time
Disadvantage:
I Storage for powers of $\omega$, $\psi$, $\omega^{-1}$, $\psi^{-1}$
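The theorem can be checked numerically with a short Python sketch. It uses the modulus $p = 8383489$ from the parameter slide; everything else is an assumption for illustration: the helper names are hypothetical, $\psi$ is found by a simple search, and the transform is a plain recursive radix-2 NTT rather than the optimized iterative AVX version described in the talk.

```python
P = 8383489  # modulus from the parameter set; P - 1 is divisible by 2n for n = 512

def find_psi(n):
    """Search for psi with psi^n = -1 mod P, i.e. a primitive 2n-th root of unity."""
    for g in range(2, P):
        psi = pow(g, (P - 1) // (2 * n), P)
        if pow(psi, n, P) == P - 1:
            return psi
    raise ValueError("no primitive 2n-th root of unity found")

def ntt(a, omega):
    """Recursive radix-2 NTT: evaluates a at the powers of omega (len(a) a power of 2)."""
    n = len(a)
    if n == 1:
        return a[:]
    even = ntt(a[0::2], omega * omega % P)
    odd = ntt(a[1::2], omega * omega % P)
    out, w = [0] * n, 1
    for i in range(n // 2):        # n/2 multiplications per level
        t = w * odd[i] % P
        out[i] = (even[i] + t) % P
        out[i + n // 2] = (even[i] - t) % P
        w = w * omega % P
    return out

def negacyclic_mul_ntt(a, b):
    """d = a * b in Z_P[x]/(x^n + 1) via the wrapped-convolution theorem."""
    n = len(a)
    psi = find_psi(n)
    omega = psi * psi % P
    psi_inv, omega_inv, n_inv = (pow(x, P - 2, P) for x in (psi, omega, n))
    a_bar = [x * pow(psi, i, P) % P for i, x in enumerate(a)]
    b_bar = [x * pow(psi, i, P) % P for i, x in enumerate(b)]
    prod = [x * y % P for x, y in zip(ntt(a_bar, omega), ntt(b_bar, omega))]
    d_bar = [x * n_inv % P for x in ntt(prod, omega_inv)]   # inverse NTT
    return [x * pow(psi_inv, i, P) % P for i, x in enumerate(d_bar)]

def negacyclic_mul_schoolbook(a, b):
    """Reference O(n^2) product with explicit reduction modulo x^n + 1."""
    n = len(a)
    d = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            sign = 1 if i + j < n else -1
            d[(i + j) % n] = (d[(i + j) % n] + sign * ai * bj) % P
    return d
```

Both functions return coefficients in $[0, p)$, so their outputs can be compared directly. Scaling by powers of $\psi$ is what turns the cyclic convolution computed by the NTT into the negative wrapped one, which is why no zero padding or separate reduction by $x^n + 1$ is needed.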
Low Level Optimizations
Low Level Optimization - SIMD/AVX
Modern processors are equipped with powerful vector engines:
I Allows Single Instruction Multiple-Data (SIMD) operations
I Supported by Advanced Vector Extensions (AVX) extending the x86 instruction set
I Included in Intel Sandy Bridge, Intel Ivy Bridge, and AMD Bulldozer
General idea:
I Represent 512-coefficient polynomial as array of 512 double-precision floating-point values
I 4 double-precision floats fit into the 256-bit-wide AVX ymm vector registers
I Perform operations on four coefficients at the same time (e.g., addition, reduction)
I For example AVX supports one double-precision-vector multiplication and one addition every cycle
Low Level Optimization - Modular Reduction and Addition
Parallel modular reduction:
I Modular reduction mod $p$ extremely important
I Multiply $x$ by the precomputed inverse: $c = x \cdot p^{-1}$ (vmulpd)
I Round $c$, multiply the rounded $c$ by $p$ and then subtract the product from $x$ (vroundpd/vsubpd)
I Work on four coefficients in parallel
I Lazy reduction: Reduce only when necessary ($p$ has 23 bits, mantissa has 53 bits)
Parallel addition:
I Just requires 256 vector loads, 128 vector additions or subtractions, and 128 vector stores
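The floating-point reduction can be illustrated with ordinary Python floats, which are double-precision values. This is a scalar sketch under stated assumptions: it handles one coefficient at a time instead of four per AVX register, Python's `round` stands in for vroundpd, and the function names are hypothetical.

```python
P = 8383489.0      # modulus stored as a double, as in the coefficient representation
P_INV = 1.0 / P    # precomputed inverse of p

def reduce_mod_p(x):
    """Reduce a double holding an integer to a small representative near [-p/2, p/2]."""
    c = round(x * P_INV)   # nearest integer to x / p
    return x - c * P       # exact as long as all intermediate integers stay below 2^53

def mul_add_lazy(a, b, c, d):
    """Lazy reduction: accumulate two products, then reduce once at the end."""
    return reduce_mod_p(a * b + c * d)
```

Each product of two reduced coefficients stays below $2^{46}$, so several of them fit into the 53-bit mantissa before a reduction becomes necessary. For example, `mul_add_lazy(8383488.0, 8383487.0, 1.0, 5.0)` computes $(-1)(-2) + 5$ modulo $p$.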
Low Level Optimization - Optimizing the NTT
I NTT is most speed-critical operation
I Adapted standard fast iterative algorithm
I $\log_2 n = 9$ levels of operations
I Additions (and subtractions) are a bottleneck
I Merging of levels to reduce loads and stores
Results
Cycle counts for Intel Ivy Bridge:
Operation              Cycles   Op/s (@ 2 GHz)
crypto_sign_keypair     31140            64226
crypto_sign            634988             3149
crypto_sign_open        45036            44408
ntt                      4484           446030
poly_mul                16096           124254
poly_mul_a              11044           181093
poly_setrandom_maxk     10824           184774
poly_setrandom_max1      5464           366032

I Signing attempt takes 85384 cycles
I On average 7 attempts: 7 · 85384 = 597688 cycles + overhead
I NTT is even advantageous for sparse multiplication
I Timing variation is independent of secret data - protection against timing attacks
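The Op/s column and the signing estimate follow from simple arithmetic, sketched below. The cycle counts are copied from the table; the variable names are hypothetical, and rounding down to whole operations is an assumption that happens to reproduce the listed values.

```python
CLOCK_HZ = 2_000_000_000   # 2 GHz, as stated in the table header

cycles = {
    "crypto_sign_keypair": 31140,
    "crypto_sign": 634988,
    "crypto_sign_open": 45036,
}

# Operations per second, rounded down to whole operations.
ops_per_second = {name: CLOCK_HZ // c for name, c in cycles.items()}

# Signing cost: 7 attempts on average, 85384 cycles per attempt.
attempts_cycles = 7 * 85384
overhead_cycles = cycles["crypto_sign"] - attempts_cycles
```

Under these numbers, the gap between the measured `crypto_sign` count and the seven attempts is the overhead mentioned in the bullet above.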
Comparison - ECC/RSA
Note that the implementation is now in the eBACS benchmarking framework (lattisigns512).
Software               Security  Sign (cycles)  Verify (cycles)  pk (bytes)  sk (bytes)  sig (bytes)
Our work               80 bits          634988            45036        1536         256         1184
ed25519                128 bits          67564           209328          32          64           64
ronald2048 (RSA-2048)  112 bits        5768360            77032         256        2048          256
Comparison - PQ
Software               Security  Sign (cycles)  Verify (cycles)  pk (bytes)  sk (bytes)  sig (bytes)
Our work               80 bits          634988            45036        1536         256         1184
XMSS                   82 bits         7261100           556600         912          19         2451
mqqsig160              80 bits            1996            33220      206112         401           20
rainbowbinary16242020  80 bits           29364            17900      102912       94384           40
Lessons learned and outlook
Lessons learned:
I (Ideal) lattice-based cryptography is fast and parallelizable
I Results are applicable to other schemes (e.g., LWE-encryption)
I For high speed do not rely on high level libraries (e.g., NTL)
Outlook:
I Lattice Signatures and Bimodal Gaussians, Léo Ducas, Alain Durmus, Tancrède Lepoint, and Vadim Lyubashevsky, Crypto 2013, to appear
  I Signature size of 5600 bits providing 128-bit security
  I Analysis suggests only 80-bit security for implemented scheme ([GLP’12] claim was 100)
I Different architectures, e.g., GPU or ARM with NEON
I Application of results to other schemes (e.g., LWE-encryption, homomorphic, IBE)
Thank you for your attention!
Any questions?
Results are online: http://cryptojedi.org/crypto/index.shtml#lattisigns
Backup
Security Proof
Sketch of the security proof:
I Choose invalid public key $(a, t = a s'_1 + s'_2)$ with $s'_1, s'_2 \in R_{k'}$ for sufficiently large $k'$. Invalid key is indistinguishable from valid key.
I Deliver signatures by programming the random oracle (ROM), without knowing the private key
I When adversary produces forgery $(z_1, z_2, c)$, we can produce a second forgery $(z'_1, z'_2, c)$ by the Forking Lemma
I Therefore it holds that $H(a z_1 + z_2 - t c, m) = H(a z'_1 + z'_2 - t c, m)$. This allows us to extract small $u_1, u_2$ such that $a u_1 + u_2 = 0$, which allows us to solve the DCK problem