Intel’s New AES Instructions
Enhanced Performance and Security
Shay Gueron
- Intel Corporation, Israel Development Center, Haifa, Israel
- University of Haifa, Israel
1
Overview
• AES basics
• Performance hungry applications
• The security issue
• The AES instructions
• Performance scalability
• Basic usage
• Software flexibility
• Software tools
• Performance and optimizations
• More on software flexibility
• And more…
2
AES Basics
3
AES Overview
4
[Diagram: Plain Text → Add Round Key → 10-14 “Rounds” of SubByte (S-box), Shift Rows, Mix Columns, Add Round Key (with the round key) → Cipher Text. SubBytes and MixColumns are the slow steps in software; ShiftRows is fast.]
AES Transformations
• AddRoundKey — 128-bit XOR of State and round key
• SubBytes — nonlinear bytewise substitution (repeated 16x)
• ShiftRows — bytewise permutation
• MixColumns — matrix multiplication in GF(2^8)
• InvSubBytes, InvShiftRows, InvMixColumns — the inverse transformations
• SubWord – 4 x SubBytes
• RotWord – [a0, a1, a2, a3] → [a1, a2, a3, a0]
• Rcon – in round i equals [{02}^(i-1), {00}, {00}, {00}]
5
AES Encryption
Tmp = AddRoundKey (Data, Round_Key_Encrypt [0])
For round = 1-9 or 1-11 or 1-13:
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = MixColumns (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [round])
end loop
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [10 or 12 or 14])
Result = Tmp
6
40/48/56 steps
AES Decryption (Equivalent Inverse Cipher)
Tmp = AddRoundKey (Data, Round_Key_Decrypt [0])
For round = 1-9 or 1-11 or 1-13:
Tmp = InvShiftRows (Tmp)
Tmp = InvSubBytes (Tmp)
Tmp = InvMixColumns (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [round])
end loop
Tmp = InvShiftRows (Tmp)
Tmp = InvSubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Decrypt [10 or 12 or 14])
Result = Tmp
7
Equivalent Inverse Cipher
AES-128 Key Expansion
for (i = 0 .. 3) { w[i] = Cipher Key[i] }
for (i = 4 .. 43) {
temp = w[i-1]
if (i mod 4 = 0) {
temp = SubWord(RotWord(temp)) xor Rcon
}
w[i] = w[i-4] xor temp
}
8
AES-256 Key Expansion Encrypt
for (i = 0 .. 7) { w[i] = Cipher Key[i] }
for (i = 8 .. 59) {
temp = w[i-1]
if (i mod 8 = 0) {
temp = SubWord(RotWord(temp)) xor Rcon
}
else if (i mod 8 = 4) {
temp = SubWord(temp)
}
w[i] = w[i-8] xor temp
}
Preparing the decryption key schedule
9
[Diagram: the 44 expanded words K0-K43 grouped into Encrypt Round Keys Key0 (K0-K3) through Key10 (K40-K43); InvMixCols is applied to each of Key1-Key9 to produce the Decrypt Round Keys, while Key0 and Key10 are used unchanged.]
For the Equivalent Inverse cipher: apply InvMixCols to the Encrypt Round keys
Performance Hungry
Applications
10
Performance hungry AES usage models
• SSL/TLS for HTTPS
• IPSec
• OS Based Disk Encryption
– E.g., Microsoft Bitlocker
– Similar in Linux
• File encryption utilities
• Storage Encryption
• Voice Over IP Security (VOIP)
11
Relevant to client and server platforms
The Security Issue
12
CPU cache
Memory trades off capacity against latency (and cost)
Most instructions reference memory (loads and stores)
Cache = small and fast memory
• works close to the CPU’s frequency
• hides the latency of larger memories
• Speculative: holds the “next” required data
13
Problem: in a multitasking environment, memory accesses can become implicitly data-dependent
Cache-based attacks (among others)
Theoretical attacks first described by Page (2002); practical attacks followed:
• Time-driven: execution time as function of cache-hit/miss numbers
– 2003: Tsunoo et al. on DES
– 2004: Bernstein on first round of AES
– 2006: Neve et al. on first and second round of AES
• Trace-driven: sequence of cache-hit/miss
– 2005: Bertoni et al. on AES through SimpleScalar
– 2005: Lauradoux et al. on AES
– 2006: Acıiçmez et al. on AES
• Access-driven: cache line accesses of crypto process
– 2005: Percival on RSA with multithreaded processors
– 2005-06: Osvik, Shamir et al. on AES with multithreaded processors
– 2005-06: Neve and Seifert on AES with single-threaded processors and last round attack
14
Table based AES (e.g., OpenSSL)
Tables make the accesses and operations easier on a 32-bit processor
For AES encryption, 5 precomputed tables, each mapping [1 byte] → [4 bytes]
Composed from two tables S and S’, each mapping [1 byte] → [1 byte]
(S’ = {02}·S and SS’ = S xor S’ = {03}·S)
T0 = [S’,S,S,SS’]
T1 = [SS’,S’,S,S]
T2 = [S,SS’,S’,S]
T3 = [S,S,SS’,S’]
T4 = [S,S,S,S]
15
/* round 1: */
t0 = T0[s0 >> 24] ^ T1[(s1 >> 16) & 0xff] ^ T2[(s2 >> 8) & 0xff] ^ T3[s3 & 0xff] ^ rk[4];
t1 = T0[s1 >> 24] ^ T1[(s2 >> 16) & 0xff] ^ T2[(s3 >> 8) & 0xff] ^ T3[s0 & 0xff] ^ rk[5];
t2 = T0[s2 >> 24] ^ T1[(s3 >> 16) & 0xff] ^ T2[(s0 >> 8) & 0xff] ^ T3[s1 & 0xff] ^ rk[6];
t3 = T0[s3 >> 24] ^ T1[(s0 >> 16) & 0xff] ^ T2[(s1 >> 8) & 0xff] ^ T3[s2 & 0xff] ^ rk[7];
/* round 2: */
…
Table based AES
T4 is used for the last round (no MixColumns) and for Key Expansion
T4=
16
msb \ lsb 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 63 7c 77 7b f2 6b 6f c5 30 01 67 2b fe d7 ab 76
1 ca 82 c9 7d fa 59 47 f0 ad d4 a2 af 9c a4 72 c0
2 b7 fd 93 26 36 3f f7 cc 34 a5 e5 f1 71 d8 31 15
3 04 c7 23 c3 18 96 05 9a 07 12 80 e2 eb 27 b2 75
4 09 83 2c 1a 1b 6e 5a a0 52 3b d6 b3 29 e3 2f 84
5 53 d1 00 ed 20 fc b1 5b 6a cb be 39 4a 4c 58 cf
6 d0 ef aa fb 43 4d 33 85 45 f9 02 7f 50 3c 9f a8
7 51 a3 40 8f 92 9d 38 f5 bc b6 da 21 10 ff f3 d2
8 cd 0c 13 ec 5f 97 44 17 c4 a7 7e 3d 64 5d 19 73
9 60 81 4f dc 22 2a 90 88 46 ee b8 14 de 5e 0b db
10 e0 32 3a 0a 49 06 24 5c c2 d3 ac 62 91 95 e4 79
11 e7 c8 37 6d 8d d5 4e a9 6c 56 f4 ea 65 7a ae 08
12 ba 78 25 2e 1c a6 b4 c6 e8 dd 74 1f 4b bd 8b 8a
13 70 3e b5 66 48 03 f6 0e 61 35 57 b9 86 c1 1d 9e
14 e1 f8 98 11 69 d9 8e 94 9b 1e 87 e9 ce 55 28 df
15 8c a1 89 0d bf e6 42 68 41 99 2d 0f b0 54 bb 16
each value repeated 4x
Exploiting OS scheduling
AES rounds are short vs context switch frequency
Preemptive scheduling
ability of a process to yield the CPU before the end of its OS quantum
2 processes
• spy continuously watches the cache accesses
• crypto runs for small amounts of time
17
[Timeline: at the start of the OS quantum, the spy (re)loads the tables and waits; the crypto process then accesses the tables until the end of the OS quantum.]
Cache sharing leakages
Two processes on the same processor: crypto and spy
1. spy loads a (large) table
2. crypto runs on the processor
3. spy reloads and times each table line:
if the loading time is
• short → the cache line was not evicted
• long → the cache line was evicted
Mitigation
• There are ways to write AES software that avoid data-dependent memory accesses
– But they severely degrade performance
19
Intel’s AES Instructions
20
AES New Instructions (AES-NI)
• Will be introduced into the Intel instruction set starting from 2009
Four instructions to perform AES encryption and decryption
• AESENC – Perform one round encryption of AES
• AESENCLAST – Perform last round encryption of AES
• AESDEC – Perform one round decryption of AES
• AESDECLAST – Perform last round decryption of AES
Two instructions to perform AES Key Expansion
• AESKEYGENASSIST – Used for round key expansion
• AESIMC – convert encryption round keys to a form usable for decryption
• Intel’s architecture uses the equivalent inverse cipher
21
AES Data Structure
S(0,0) S(0,1) S(0,2) S(0,3)
S(1,0) S(1,1) S(1,2) S(1,3)
S(2,0) S(2,1) S(2,2) S(2,3)
S(3,0) S(3,1) S(3,2) S(3,3)
X0 = S(3,0) S(2,0) S(1,0) S(0,0)
X1 = S(3,1) S(2,1) S(1,1) S(0,1)
X2 = S(3,2) S(2,2) S(1,2) S(0,2)
X3 = S(3,3) S(2,3) S(1,3) S(0,3)
22
xmm1: [X3 (bits 127-96) | X2 (95-64) | X1 (63-32) | X0 (31-0)], msb to lsb
xmm2/m128: [X7 (bits 127-96) | X6 (95-64) | X5 (63-32) | X4 (31-0)]
The State (xmm0) in matrix representation
State and Round Key in xmm0 and xmm2/m128
The 4 AES Round Instructions
AESENC xmm0, xmm2/m128
Tmp:= xmm0;
Round Key:= xmm2/m128;
Tmp:= Shift Rows (Tmp);
Tmp:= Substitute Bytes (Tmp);
Tmp:= Mix Columns (Tmp);
xmm0:= Tmp xor Round Key
AESENCLAST xmm0, xmm2/m128
Tmp:= xmm0;
Round Key:= xmm2/m128;
Tmp:= Shift Rows (Tmp);
Tmp:= Substitute Bytes (Tmp);
xmm0:= Tmp xor Round Key
23
AESDEC xmm0, xmm2/m128
Tmp:= xmm0;
Round Key:= xmm2/m128;
Tmp:= Inverse Shift Rows (Tmp);
Tmp:= Inverse Substitute Bytes (Tmp);
Tmp:= Inverse Mix Columns (Tmp);
xmm0:= Tmp xor Round Key
AESDECLAST xmm0, xmm2/m128
State := xmm0;
Round Key := xmm2/m128
Tmp:= Inverse Shift Rows (State);
Tmp:= Inverse Substitute Bytes (Tmp);
xmm0:= Tmp xor Round Key
Two instructions for Key Expansion
AESIMC xmm0, xmm2/m128
RoundKey := xmm2/m128;
xmm0 := InvMixColumns (RoundKey)
AESKEYGENASSIST xmm0, xmm2/m128, imm8
Tmp := xmm2/m128
RCON[31-8] := 0; RCON[7-0] := imm8;
X3[31-0] := Tmp[127-96]; X2[31-0] := Tmp[95-64];
X1[31-0] := Tmp[63-32]; X0[31-0] := Tmp[31-0];
xmm0 := [RotWord (SubWord (X3)) XOR RCON, SubWord (X3), RotWord (SubWord (X1)) XOR RCON, SubWord (X1)]
24
AESKEYGENASSIST xmm0, xmm2/m128, imm8
25
[Dataflow diagram: from the source [X3, X2, X1, X0], the dwords X3 and X1 are duplicated; each copy passes through the S-box (SubWord); one copy of each is then rotated (RotWord) and XORed with RCON, yielding [RotWord(SubWord(X3)) XOR RCON, SubWord(X3), RotWord(SubWord(X1)) XOR RCON, SubWord(X1)].]
Performance
Scalability
26
Design for performance scalability
Tmp = AddRoundKey (Data, Round_Key_Encrypt [0])
For round = 1-9 or 1-11 or 1-13:
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = MixColumns (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [round])
end loop
Tmp = ShiftRows (Tmp)
Tmp = SubBytes (Tmp)
Tmp = AddRoundKey (Tmp, Round_Key_Encrypt [10 or 12 or 14])
Result = Tmp
27
Can control the last round via the immediate
Basic Usage
28
AES-128 Key Expansion
begin
word temp
for (i = 0 .. 3) {
w[i] = Initial Key[i]
}
for (i = 4 .. 43) {
temp = w[i-1]
if (i mod 4 = 0) {
temp = SubWord(RotWord(temp)) xor Rcon
}
w[i] = w[i-4] xor temp
}
end
29
AESKEYGENASSIST
AES-256 Key Expansion
word temp
for (i = 0 .. 7) { w[i] = Initial Key[i] }
for (i = 8 .. 59) {
temp = w[i-1]
if (i mod 8 = 0) {
temp = SubWord(RotWord(temp)) xor Rcon
}
else if (i mod 8 = 4) {
temp = SubWord(temp)
}
w[i] = w[i-8] xor temp
}
30
AESKEYGENASSIST
AESKEYGENASSIST
AESIMC xmm0, xmm2/m128
31
[Diagram: the Encrypt Round Keys Key0 (K0-K3) through Key10 (K40-K43); AESIMC applies InvMixCols to each of Key1-Key9 to produce the Decrypt Round Keys, while Key0 and Key10 are used unchanged.]
The Equivalent Inverse cipher requires applying InvMixCols to the Encrypt Round keys
AES-128 Key Expansion
AESKEYGENASSIST xmm2, xmm0, 0x1
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x2
call key_expand_128
AESKEYGENASSIST xmm2, xmm0, 0x4
call key_expand_128
…
…
AESKEYGENASSIST xmm2, xmm0, 0x36
call key_expand_128
32
key_expand_128:
pshufd xmm2, xmm2, 0xff
vpslldq xmm3, xmm0, 0x4
pxor xmm0, xmm3
vpslldq xmm3, xmm0, 0x4
pxor xmm0, xmm3
vpslldq xmm3, xmm0, 0x4
pxor xmm0, xmm3
pxor xmm0, xmm2
movdqu XMMWORD PTR [rcx], xmm0
add rcx, 0x10
ret
AES-192 Key Expansion
aeskeygenassist xmm2, xmm3, 0x1
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x2
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x4
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x8
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x10
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x20
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x40
call key_expansion_192
aeskeygenassist xmm2, xmm3, 0x80
call key_expansion_192
33
key_expansion_192:
pshufd xmm2, xmm2, 0x55
vpslldq xmm4, xmm0, 0x4
pxor xmm0, xmm4
pslldq xmm4, 0x4
pxor xmm0, xmm4
pslldq xmm4, 0x4
pxor xmm0, xmm4
pxor xmm0, xmm2
pshufd xmm2, xmm0, 0xff
vpslldq xmm4, xmm3, 0x4
pxor xmm3, xmm4
pxor xmm3, xmm2
movdqu XMMWORD PTR [rcx], xmm0
add rcx, 0x10
movdqu XMMWORD PTR [rcx], xmm3
add rcx, 0x8
ret
AES-256 Key Expansion
aeskeygenassist xmm2, xmm3, 0x1
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x2
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x4
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x8
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x10
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x20
call key_expansion_256
aeskeygenassist xmm2, xmm3, 0x40
call key_expansion_256
34
key_expansion_256:
pshufd xmm2, xmm2, 0xff
vpslldq xmm4, xmm0, 0x4
pxor xmm0, xmm4
pslldq xmm4, 0x4
pxor xmm0, xmm4
pslldq xmm4, 0x4
pxor xmm0, xmm4
pxor xmm0, xmm2
movdqu XMMWORD PTR [rcx], xmm0
add rcx, 0x10
aeskeygenassist xmm4, xmm0, 0
pshufd xmm2, xmm4, 0xaa
vpslldq xmm4, xmm3, 0x4
pxor xmm3, xmm4
pslldq xmm4, 0x4
pxor xmm3, xmm4
pslldq xmm4, 0x4
Encrypting with AES round instructions
AES-128 ECB mode example
Round keys already expanded
for i from 1 to N_BLOCKS do
xmm0 = BLOCK [i] // load the next data block
for j from 1 to 9 do
xmm0 = AESENC (xmm0, RK [j])
end
xmm0 = AESENCLAST (xmm0, RK [10])
store xmm0
end
35
AES-128 assembler (encryption and decryption)
36
(round keys assumed in xmm2 through xmm12)
AES-128 decryption
pxor xmm0, xmm12
AESDEC xmm0, xmm11
AESDEC xmm0, xmm10
AESDEC xmm0, xmm9
AESDEC xmm0, xmm8
AESDEC xmm0, xmm7
AESDEC xmm0, xmm6
AESDEC xmm0, xmm5
AESDEC xmm0, xmm4
AESDEC xmm0, xmm3
AESDECLAST xmm0, xmm2
Decryption Round Keys
AESIMC xmm3, xmm3
AESIMC xmm4, xmm4
AESIMC xmm5, xmm5
AESIMC xmm6, xmm6
AESIMC xmm7, xmm7
AESIMC xmm8, xmm8
AESIMC xmm9, xmm9
AESIMC xmm10, xmm10
AESIMC xmm11, xmm11
AES-128 encryption
pxor xmm0, xmm2
AESENC xmm0, xmm3
AESENC xmm0, xmm4
AESENC xmm0, xmm5
AESENC xmm0, xmm6
AESENC xmm0, xmm7
AESENC xmm0, xmm8
AESENC xmm0, xmm9
AESENC xmm0, xmm10
AESENC xmm0, xmm11
AESENCLAST xmm0, xmm12
Software Flexibility:
modes of operation
37
ECB (Encrypt)
38
get next plaintext block
AES encrypt
store result into memory as ciphertext block
more data YES NO
DONE
Use AES-NI
building blocks
CBC (Encrypt)
39
initialize feedback register with IV
get next plaintext block
XOR with feedback register
AES encrypt
store result into feedback register
store result into memory as ciphertext block
more data YES NO
DONE
Use AES-NI
building blocks
CTR (Encrypt)
40
initialize counter register with IV
get counter register
AES encrypt
XOR with next plaintext block
increment counter register
store result into memory as ciphertext block
more data YES NO
DONE
Use AES-NI
building blocks
GCM
41
[Diagram: AES in CTR mode encrypts data 1, 2, 3, … into ciphertext 1, 2, 3, …; in parallel the Galois hash is chained: hash 1 = (hash 0 ⊕ ciphertext 1) × hash key in GF(2^128), hash 2 = (hash 1 ⊕ ciphertext 2) × hash key in GF(2^128), etc…]
AES CTR
computation of the Galois hash
Use AES-NI
building blocks
Software Tools
42
Software Development tools
43
C/C++ program
icl /arch:AVX <filename>
Executable binary
Program output
sde -- <binary name>
Prior to silicon Program output
today
Software
Development
Emulator
Running the Basic Emulator
sde -- foo.exe <foo options>
For ease of use
• Special command window where every command is run on the emulator
% sde -help
Usage: sde [args] -- application [application-args]
-mix (run mix histogram tool)
-omix (set the output file name for mix. Implies -mix. Default is "mix.out")
-debugtrace (run mix debugtrace tool)
-odebugtrace (set the output file name for debugtrace. Implies -debugtrace. Default is "debugtrace.out")
-ast (run the AVX/SSE transition checker)
-oast (set the output file name for the AVX/SSE transition checker. Implies -ast. Default is "avx-sse-transition.out")
-no-avx (disable AVX emulation, just emulate AES+PCLMULQDQ+SSE4)
-no-aes (disable AES+PCLMULQDQ+AVX emulation, just emulate SSE4)
-pin-runtime (Use Pin's runtime libraries, required on some Linux* systems)
44
Compiler support for using AES-NI
Compiler support through
• Inline asm
• Intrinsics
45
extern __m128i __cdecl _mm_aesdec_si128(__m128i v, __m128i rkey);
…
extern __m128i __cdecl _mm_clmulepi64_si128(__m128i v1, __m128i v2, const int imm8);
wmmintrin.h
#include <ia32intrin.h>
__m128i x, y, z;
z = _mm_aesdec_si128(x, y);
User program
AES-128 CBC Encryption (Intrinsics)
void AES_128_CBC_Encrypt () {
 int i, j, k;
 __m128i tmp, feedback;
 __m128i RKEY [11];
 for (k=0; k<11; k++)
  RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Schedule [4*k]);
 feedback = _mm_load_si128 ((__m128i*)&IV [0]);
 for (i=0; i < NBLOCKS; i++) {
  tmp = _mm_load_si128 ((__m128i*)&PLAINTEXT [i*4]);
  tmp = _mm_xor_si128 (tmp, feedback);
  tmp = _mm_xor_si128 (tmp, RKEY [0]);
  for (j=1; j < 10; j++)
   tmp = _mm_aesenc_si128 (tmp, RKEY [j]);
  tmp = _mm_aesenclast_si128 (tmp, RKEY [10]);
  feedback = tmp;
  _mm_store_si128 ((__m128i*)&CIPHERTEXT [4*i], tmp);
 }
}
46
Performance
and optimizations
47
Parallelization
• All useful modes (except CBC encrypt) are parallelizable
• Blocks can be processed independently
– Can apply the loop reversal technique
• The only serial mode in use is CBC encrypt
– Leading usage model is Bitlocker (and disk encryption in general)
– CBC encrypt throughput is less performance-sensitive because
• Disk write latency is acceptable (CBC encrypt is serial)
• Disk read is sensitive (CBC decrypt is parallel)
• Efficient SW techniques can help squeeze a performance boost out of the existing architecture
• The latency of the AES-NI does not matter too much
– As long as #registers ≥ latency of the instruction
48
Straightforward AES
for i from 1 to N_BLOCKS do
xmm0 = BLOCK [i] // load
xmm0 = AESENC (xmm0, RK [1])
xmm0 = AESENC (xmm0, RK [2])
xmm0 = AESENC (xmm0, RK [3])
…
xmm0 = AESENC (xmm0, RK [9])
xmm0 = AESENCLAST (xmm0, RK [10])
store xmm0
end
49
Wait L cycles
Performance: (10 x Latency) cycles / 16B
Efficient Usage of AES-NI: Loop Reversal
for i from 0 to N_BLOCKS/8 -1 do
xmm0 = BLOCK [8*i+1], xmm2 = BLOCK [8*i+2]; … xmm8 = BLOCK [8*i+8]
xmm0 = AESENC (xmm0, RK [1])
xmm2 = AESENC (xmm2, RK [1])
xmm3 = AESENC (xmm3, RK [1])
…
xmm8 = AESENC (xmm8, RK [1])
xmm0 = AESENC (xmm0, RK [2])
xmm2 = AESENC (xmm2, RK [2])
…
xmm8 = AESENC (xmm8, RK [2])
…
xmm0 = AESENCLAST (xmm0, RK [10])
xmm2 = AESENCLAST (xmm2, RK [10])
…
xmm8 = AESENCLAST (xmm8, RK [10])
store xmm0; store xmm2; … store xmm8
end
50
L cycles elapse – ready
No need to wait
Scheduling the flow to space dependent AES-NI instructions by more than L cycles
Effectively, dispatch an AES-NI every cycle
Throughput: 80 cycles / (8*16B)
Gain speedup factor of L
Parallel modes of operation, together with a fully pipelined hardware implementation of AES-NI, allow re-scheduling the flow so that dependent AES-NI instructions are spaced far enough apart to hide the latency of one instruction
Parallelizing CBC encryption
void AES_128_CBC_Encrypt_Parallel_4_Blocks () {
 int i, j, k;
 __m128i feedback1, feedback2, feedback3, feedback4;
 __m128i tmp1, tmp2, tmp3, tmp4;
 __m128i RKEY [11];
 for (k=0; k<11; k++)
  RKEY [k] = _mm_load_si128 ((__m128i*)&Key_Sched [4*k]);
 feedback1 = _mm_load_si128 ((__m128i*)&IV1 [0]);
 feedback2 = _mm_load_si128 ((__m128i*)&IV2 [0]);
 feedback3 = _mm_load_si128 ((__m128i*)&IV3 [0]);
 feedback4 = _mm_load_si128 ((__m128i*)&IV4 [0]);
 for (i=0; i < NBLOCKS; i++) {
  tmp1 = _mm_load_si128 ((__m128i*)&PLAINTEXT1 [i*4]);
  tmp2 = _mm_load_si128 ((__m128i*)&PLAINTEXT2 [i*4]);
  tmp3 = _mm_load_si128 ((__m128i*)&PLAINTEXT3 [i*4]);
  tmp4 = _mm_load_si128 ((__m128i*)&PLAINTEXT4 [i*4]);
  tmp1 = _mm_xor_si128 (tmp1, feedback1);
  tmp2 = _mm_xor_si128 (tmp2, feedback2);
  tmp3 = _mm_xor_si128 (tmp3, feedback3);
  tmp4 = _mm_xor_si128 (tmp4, feedback4);
  tmp1 = _mm_xor_si128 (tmp1, RKEY [0]);
  tmp2 = _mm_xor_si128 (tmp2, RKEY [0]);
  tmp3 = _mm_xor_si128 (tmp3, RKEY [0]);
  tmp4 = _mm_xor_si128 (tmp4, RKEY [0]);
  for (j=1; j < 10; j++) {
   tmp1 = _mm_aesenc_si128 (tmp1, RKEY [j]);
   tmp2 = _mm_aesenc_si128 (tmp2, RKEY [j]);
   tmp3 = _mm_aesenc_si128 (tmp3, RKEY [j]);
   tmp4 = _mm_aesenc_si128 (tmp4, RKEY [j]);
  }
  tmp1 = _mm_aesenclast_si128 (tmp1, RKEY [10]);
  tmp2 = _mm_aesenclast_si128 (tmp2, RKEY [10]);
  tmp3 = _mm_aesenclast_si128 (tmp3, RKEY [10]);
  tmp4 = _mm_aesenclast_si128 (tmp4, RKEY [10]);
  feedback1 = tmp1; feedback2 = tmp2; feedback3 = tmp3; feedback4 = tmp4;
  _mm_store_si128 ((__m128i*)&CIPHERTEXT1 [4*i], tmp1);
  _mm_store_si128 ((__m128i*)&CIPHERTEXT2 [4*i], tmp2);
  _mm_store_si128 ((__m128i*)&CIPHERTEXT3 [4*i], tmp3);
  _mm_store_si128 ((__m128i*)&CIPHERTEXT4 [4*i], tmp4);
 }
}
51
Parallelization at a higher level: operate on multiple independent data streams in parallel
Performance projections
• Highly optimized software implementations of AES
– On today’s silicon: ~15 cycles/byte (OpenSSL)
– 18 cycles/byte from MSFT on a 2006 platform
• No side channel mitigation included
• Mitigation is costly (no known real “protected implementation”)
• With AES-NI:
– Side channel mitigation is built-in
– Significant speedup
• 2-3x for CBC encrypt in serial mode
• More than 10x for parallel modes of operation
52
More on
Software Flexibility
53
Rijndael-256 (256b block size)
VPBLENDVB xmm3, xmm2, xmm0, xmm5
VPBLENDVB xmm4, xmm0, xmm2, xmm5
PSHUFB xmm3, xmm8
PSHUFB xmm4, xmm8
AESENC xmm0, xmm6
AESENC xmm2, xmm7
54
“left” half of the RIJNDAEL input state (columns 0-3)
“right” half of the RIJNDAEL input state (columns 4-7)
“right” half of the RIJNDAEL round key
“left” half of the RIJNDAEL round key
Mask: 0x03020d0c0f0e09080b0a050407060100 (accounts for ShiftRows)
VPBLENDVB mask selects bytes 1-3, 6-7, 10-11, 15 from the 1st operand and all other bytes from the 2nd operand
Isolating the AES Transformations
• AES-NI perform bundled sequences of AES transformations
– But each one of these transformations can be isolated by a proper combination of the instructions, together with byte shuffling (the PSHUFB instruction).
• Motivation
– Constructing cipher variants
– Supporting possible future modifications in the AES standard
– Using the AES primitives as building blocks for ciphers and for cryptographic hash functions.
• Hashing: some of the new Secure Hash Function submissions to NIST’s SHA-3 competition use AES rounds and/or AES transformations.
– E.g., LANE, SHAMATA, SHAvite-3, and Vortex
55
Isolating the AES Transformations
Isolating ShiftRows
PSHUFB xmm0, 0x0b06010c07020d08030e09040f0a0500
Isolating InvShiftRows
PSHUFB xmm0, 0x0306090c0f0205080b0e0104070a0d00
Isolating MixColumns
AESDECLAST xmm0, 0x00000000000000000000000000000000
AESENC xmm0, 0x00000000000000000000000000000000
Isolating InvMixColumns
AESENCLAST xmm0, 0x00000000000000000000000000000000
AESDEC xmm0, 0x00000000000000000000000000000000
Isolating SubBytes
PSHUFB xmm0, 0x0306090c0f0205080b0e0104070a0d00
AESENCLAST xmm0, 0x00000000000000000000000000000000
Isolating InvSubBytes
PSHUFB xmm0, 0x0b06010c07020d08030e09040f0a0500
AESDECLAST xmm0, 0x00000000000000000000000000000000
56
AESDECLAST xmm0, 0:
Tmp := Inverse Shift Rows (State); Tmp := Inverse Substitute Bytes (Tmp); xmm0 := Tmp xor 0
AESENC xmm0, 0:
Round Key := 0; Tmp := Shift Rows (Tmp); Tmp := Substitute Bytes (Tmp); Tmp := Mix Columns (Tmp); xmm0 := Tmp xor 0
AES-NI and PCLMULQDQ: Latency and throughput
• Micro-architectural enhancements: latency and throughput improve across CPU generations
• (Latency and throughput are measured in cycles)
• AESENC/AESENCLAST, AESDEC/AESDECLAST: Latency/Throughput
• WSM: 7/2; SNB: 8/1; HSW: 7/1; BDW: 7/1; SKL: 4/1
• PCLMULQDQ: Latency/Throughput
• SNB: 14/8; HSW: 7/2; BDW: 7/1; SKL: 4/1
57
Architecture codenames: Westmere (WSM), Sandy Bridge (SNB), Haswell (HSW), Broadwell (BDW), Skylake (SKL)
Backup
58
References
• S. Gueron. Intel Advanced Encryption Standard (AES) Instructions Set, Rev 3.01. Intel Software Network.
https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf
• S. Gueron. Intel’s New AES Instructions for Enhanced Performance and Security. Fast Software Encryption, 16th International Workshop (FSE 2009), Lecture Notes in Computer Science 5665, pp. 51-66 (2009).
59