Running OpenSSL Crypto Algorithms in SimpleScalar
Piyush Ranjan Satapathy
Department of Computer Science & Engineering
University of California Riverside
04/19/23 CS213: "Parallel processing Architecture" By Dr Laxmi Narayan Bhuyan (Winter 2005) University of California Riverside
Outline
- What are crypto algorithms?
- Why do we need to run them on SimpleScalar?
- Any previous work on this?
- Introducing OpenSSL 0.9.7e
- Introducing SimpleScalar version 2.0
- Selecting the crypto algorithms from OpenSSL
- Simulation settings and parameters
- Results & discussions
- An interesting comparison
- Demo
- Conclusion
- Acknowledgement and references
- Q&A
What Are Crypto Algorithms?
Algorithms meant for network security:
1. Authentication
2. Secrecy
3. Nonrepudiation
4. Integrity control
Kinds of crypto algorithms that address the above:
1. Public-key algorithms (e.g., RSA, DSS, LUC, …)
2. Secret-key algorithms (e.g., AES, DES, RC4, SEAL, …)
3. Cryptographic hash functions (e.g., MD5, SHA1, …)
4. Random number generators (e.g., PGP, Noiz, SSH, …)
Secret-key algorithms:
1. Block ciphers (e.g., IDEA, DES, AES, Blowfish, …)
2. Stream ciphers (e.g., RC4, SEAL, A5)
Why Run on SimpleScalar?
Architectural analysis of crypto algorithms
- To arrive at a good network processor design, we need an architectural analysis of crypto algorithms at cycle-level accuracy.
SimpleScalar is easy to simulate with
- Fast, flexible and accurate simulation.
- SimpleScalar provides cycle-level simulation of a MIPS-like processor.
Not concerned with parallel programming
- Otherwise Simics could have been used.
Previous Work on Architectural Analysis of Crypto Algorithms:
- Analysis using widely available crypto algorithms, by Haiyong et al. (referred to as “Average” here).
- Analysis using SPECint & CommBench; performance of SSL crypto algorithms (Li Zhao et al.).
- But no architectural analysis of the OpenSSL crypto algorithms.
- OpenSSL has now become the standard benchmark for crypto engines.
- So an architectural analysis of these algorithms helps in understanding the needs of modern network processors that deal with cryptography.
Introducing OpenSSL 0.9.7e
Widely used open-source implementation of crypto algorithms (I have used a recent version).
OpenSSL is a cryptography toolkit implementing the Secure Sockets Layer (SSL v2/v3) and Transport Layer Security (TLS v1) network protocols and the related cryptography standards they require.
The openssl program is a command-line tool for using the various cryptography functions of OpenSSL's crypto library from the shell. It can be used for:
- Creation of RSA, DH and DSA key parameters
- Creation of X.509 certificates, CSRs and CRLs
- Calculation of message digests
- Encryption and decryption with ciphers
- SSL/TLS client and server tests
- Handling of S/MIME signed or encrypted mail
I have used the crypto library to port the crypto algorithms to SimpleScalar.
Introducing SimpleScalar 2.0
Compiling: sslittle-na-sstrix-gcc foo.c -o foo
Running: sim-outorder foo
Selecting OpenSSL Crypto Algorithms:
Private-key block ciphers
- AES (key length: 128 bits; block size: 16 bytes)
- DES (key length: 56 bits; block size: 8 bytes)
- 3DES (key length: 168 bits; block size: 8 bytes)
- IDEA (key length: 128 bits; block size: 8 bytes)
Stream cipher
- RC4 (key length: 128 bits)
Hash functions
- MD5 (block size: 512 bits; digest size: 128 bits)
- SHA1 (block size: 512 bits; digest size: 160 bits)
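The block and digest sizes listed for the two hash functions can be cross-checked with a short sketch. It uses Python's standard hashlib module rather than the OpenSSL C library ported in this work, so it only illustrates the parameters, not the simulated code; the helper name hash_parameters is made up for this example.

```python
import hashlib


def hash_parameters(name):
    """Return (block size, digest size) in bits for a hashlib algorithm name."""
    h = hashlib.new(name)
    # hashlib reports both sizes in bytes, so convert to bits.
    return h.block_size * 8, h.digest_size * 8


for name in ("md5", "sha1"):
    block_bits, digest_bits = hash_parameters(name)
    print(f"{name}: block {block_bits} bits, digest {digest_bits} bits")
```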
Simulation Settings & Parameters
Settings:
- A separate module was written for each algorithm using the crypto library.
- Each module was compiled with the SimpleScalar gcc 2.7.2.3 compiler, and the binary was run on the simulator with a file as input.
- Input file length varies from 1 byte to 256 KB.
- Most readings were taken with a 1-byte input file.
- Different SimpleScalar parameters were changed on the command line and the readings observed.
Parameters used:
Parameter | Values
ALU | 1, 2, 4, 8
IFQ size | 1, 2, 4, …, 32
ILP | ALU and IFQ changed at the same time
Branch prediction type | not taken, taken, 2lev, bimodal, combinational
L1I & L1D cache size | 4, 8, …, 256 KB
L1 line size | 8, 16, …, 64 bytes
L1 sets | 1, 2, 4, 8, 16
L1 replacement policy | l, r, f
Unified L2 (UL2) cache size | 4, 8, …, 2048 KB
UL2 replacement policy | l, f, r
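A parameter sweep like the one above could be scripted roughly as follows. This is a minimal sketch, not the scripts used for these slides: the option names (-res:ialu, -fetch:ifqsize, -bpred, -cache:il1, -cache:dl1) and the name:nsets:bsize:assoc:repl cache format are assumed from the sim-outorder help text and should be checked against the installed version. The binary name aes_test, the input file input.bin and the cache geometry are hypothetical.

```python
def cache_spec(name, size_kb, line_bytes, assoc, repl="l"):
    """Return '<name>:<nsets>:<bsize>:<assoc>:<repl>' for a cache of the given size.

    The number of sets is derived as size / (line size * associativity).
    """
    nsets, remainder = divmod(size_kb * 1024, line_bytes * assoc)
    if nsets == 0 or remainder != 0:
        raise ValueError("size must be a positive multiple of line size * associativity")
    return f"{name}:{nsets}:{line_bytes}:{assoc}:{repl}"


def sim_outorder_argv(binary, input_file, ialu, ifq, bpred, il1, dl1):
    """Assemble one sim-outorder command line as an argument list (nothing is executed)."""
    return [
        "sim-outorder",
        "-res:ialu", str(ialu),
        "-fetch:ifqsize", str(ifq),
        "-bpred", bpred,
        "-cache:il1", il1,
        "-cache:dl1", dl1,
        binary, input_file,
    ]


# Hypothetical ILP sweep: ALU count and IFQ size are changed together.
for width in (1, 2, 4, 8):
    argv = sim_outorder_argv(
        "aes_test", "input.bin", width, width, "bimod",
        cache_spec("il1", 128, 32, 2), cache_spec("dl1", 32, 32, 2),
    )
    print(" ".join(argv))
```

Returning an argument list instead of running the simulator keeps the sketch free of side effects; the list can be handed to subprocess.run once the options are confirmed.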
Results & Discussions: (1)
1. Instruction Set Characteristics:
- Comparison with Average, SPECint & CommBench
- “Average” represents Li’s work
- SSLCrypto represents the average over all the OpenSSL algorithms I considered.
Observations:
* SSLCrypto algorithms have a significant share of memory references (~40%)
* Arithmetic computation is intensive, but less so than in Average
[Figure: Instruction Mix Comparison. x-axis: SSLCrypto, Average, SPECint, CommBench; y-axis: Percentage (0-100%); series: load, store, uncond. branch, cond. branch, int computation]
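The instruction mix is each instruction class's share of all executed instructions. A minimal sketch of that arithmetic is shown here; the counts are hypothetical and are not the measurements behind the chart.

```python
def instruction_mix(counts):
    """Convert per-class instruction counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no instructions counted")
    return {cls: 100.0 * n / total for cls, n in counts.items()}


# Hypothetical counts for one run (illustration only).
example_counts = {
    "load": 300,
    "store": 100,
    "uncond. branch": 20,
    "cond. branch": 80,
    "int computation": 500,
}
mix = instruction_mix(example_counts)
memory_share = mix["load"] + mix["store"]  # memory references = loads + stores
print(f"memory references: {memory_share:.1f}%")
```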
Results & Discussions: (2,3)
2. Comparison of Instruction Mix:
- Plotted the instruction mix for all the block, stream and hash algorithms
Observations:
- DES and 3DES have a high share of memory references
- IDEA has a significant share of branch instructions
3. Cycles per Byte of Computation
- 3DES takes more cycles because it has to process the data 3 times with 3 different keys.
- Block ciphers require more cycles than stream ciphers and hash functions.
[Figure: Block, Stream & Hash Algorithm Instruction Mix. x-axis: AES, DES, 3DES, RC4, IDEA, MD5, SHA1; y-axis: Percentage (0-100%); series: Load, Store, Cond. Branch, Int Computation]
[Figure: Computational Complexity of the Algorithms per Byte. x-axis: AES, DES, 3DES, RC4, IDEA, MD5, SHA1; y-axis: Cycles (ticks 0-90, with a separate label reading 1000)]
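Cycles per byte is the simulated cycle count divided by the input length. The sketch below assumes the simulator prints statistics as "name value # description" lines and that the cycle counter is called sim_cycle; the sample text and its numbers are hypothetical, not results from this study.

```python
def parse_stats(text):
    """Parse 'name value # description' lines into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) == 2:
            try:
                stats[fields[0]] = float(fields[1])
            except ValueError:
                pass  # skip lines whose second field is not numeric
    return stats


def cycles_per_byte(stats, input_bytes):
    """Divide total simulated cycles by the number of input bytes processed."""
    return stats["sim_cycle"] / input_bytes


# Hypothetical simulator output for a 1024-byte input.
sample_output = """\
sim_num_insn                 120000 # total number of instructions committed
sim_cycle                     56320 # total simulation time in cycles
sim_IPC                      2.1307 # instructions per cycle
"""
sample_stats = parse_stats(sample_output)
print(cycles_per_byte(sample_stats, 1024))
```

For very small inputs, program start-up cycles dominate the total, so the per-byte figure is more meaningful for larger files.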
Results & Discussions (4,5)
4. IPC vs. ALU:
- IPC improves by 26%, 37% and 40% for block, stream and hash algorithms respectively when the number of ALUs increases from 1 to 2
- By 6%, 10% and 5% when the number of ALUs increases from 2 to 4
- With more than 4 ALUs, the number of instructions executed per cycle increases by less than 1%.
5. IPC vs. IFQ Size:
- IPC improves by 26%, 37% and 40% for block, stream and hash algorithms respectively when the size of the instruction fetch queue changes from 1 to 2
- By 6%, 10% and 5% when the IFQ size changes from 2 to 4
- After that it changes by less than 2%
[Figure: Average ILP with Different Numbers of Integer ALUs. x-axis: ALU (1, 2, 4, 8); y-axis: Instructions per Cycle (0-3); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
[Figure: Impact of Instruction Fetch Queue on Average ILP. x-axis: IFQ Size (1, 2, 4, 8, 16, 32); y-axis: Instructions per Cycle (0-3); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
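The percentages quoted for the ALU and IFQ sweeps are relative IPC gains between successive settings. A minimal sketch of that calculation follows; the IPC values are hypothetical, chosen only so that the steps come out at 26% and 6%.

```python
def relative_gain(before, after):
    """Percentage change in IPC when moving from one setting to the next."""
    return 100.0 * (after - before) / before


def stepwise_gains(ipc_by_width):
    """Map each pair of successive settings to the relative IPC gain between them."""
    widths = sorted(ipc_by_width)
    return {
        (a, b): relative_gain(ipc_by_width[a], ipc_by_width[b])
        for a, b in zip(widths, widths[1:])
    }


# Hypothetical IPC readings keyed by number of ALUs (illustration only).
ipc_by_alus = {1: 1.00, 2: 1.26, 4: 1.3356}
for (a, b), gain in stepwise_gains(ipc_by_alus).items():
    print(f"{a} -> {b} ALUs: {gain:.1f}% IPC gain")
```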
Results & Discussions: (6)
6. IPC vs. ILP:
- An ILP of 4 means 4 ALUs and an IFQ size of 4 (both are changed together)
- An ILP of 4 is enough to get the best instructions-per-cycle value.
[Figure: Overall Impact of ILP. x-axis: ILP (1, 2, 4, 8); y-axis: Instructions per Cycle (0-3); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
Results & Discussions: (7)
7. Branch Prediction Hit Rate:
- Bimodal and combinational predictors give a better hit rate
- The 2lev predictor gives almost as good a hit rate.
- Simple taken or not-taken prediction doesn’t do well.
- So more complex branch prediction schemes need to be considered.
[Figure: Branch Prediction Hit Rate. x-axis: AES, DES, 3DES, RC4, IDEA, MD5, SHA1; y-axis: Hit Rate % (0-100); series: Not Taken, 2 Lev, Taken, Comb, Bimod]
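Why a bimodal predictor can beat static taken or not-taken prediction can be seen on a toy trace. The sketch below is a textbook bimodal predictor (a table of 2-bit saturating counters indexed by branch address), not SimpleScalar's implementation; the synthetic trace, the table size and the function names are assumptions made for this example.

```python
def static_hit_rate(trace, predict_taken):
    """Hit rate of always predicting taken (True) or not taken (False)."""
    hits = sum(1 for _, taken in trace if taken == predict_taken)
    return hits / len(trace)


def bimodal_hit_rate(trace, table_size=16):
    """Hit rate of 2-bit saturating counters indexed by pc % table_size."""
    counters = [2] * table_size  # every counter starts at "weakly taken"
    hits = 0
    for pc, taken in trace:
        idx = pc % table_size
        predicted_taken = counters[idx] >= 2
        if predicted_taken == taken:
            hits += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        if taken:
            counters[idx] = min(3, counters[idx] + 1)
        else:
            counters[idx] = max(0, counters[idx] - 1)
    return hits / len(trace)


# Synthetic trace: an 8-iteration inner loop run 10 times. The loop branch
# (pc 0) is taken 7 times and then falls through; a second branch (pc 1)
# inside the loop body is never taken.
branch_trace = []
for _ in range(10):
    for i in range(8):
        branch_trace.append((1, False))
        branch_trace.append((0, i < 7))

print("always taken:", static_hit_rate(branch_trace, True))
print("never taken: ", static_hit_rate(branch_trace, False))
print("bimodal:     ", bimodal_hit_rate(branch_trace))
```

On this trace neither static choice suits both branches, while the per-branch counters settle on the dominant direction of each one.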
Results & Discussions: (8,9)
8. L1 Instruction Cache Size Behavior:
- Cache size varied with a fixed 64-byte line size, 4-way set associativity and LRU (l) replacement
- 128 KB is enough to reach the best performance level.
9. L1 Instruction Cache Line Size:
- Cache line size varied with a fixed 256 KB cache size, 4-way set associativity and LRU (l) replacement
- A 32-byte line size is enough to reach the lowest possible miss rate.
[Figure: Impact of Cache Size on Miss Rate. x-axis: Cache Size (KB): 4, 8, 16, 32, 64, 128, 256; y-axis: Miss Rate % (0-14); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
[Figure: Impact of Block Size on Miss Rate. x-axis: Block Size (Bytes): 8, 16, 32, 64; y-axis: Miss Rate % (0-45); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
Results & Discussions: (10,11)
10. L1 Instruction Cache Set Behavior:
- Set associativity varied with a fixed 256 KB cache size, 32-byte line size and LRU (l) replacement policy.
- 2-way set associativity is enough to reach a miss rate lower than 5%.
11. L1 Instruction Cache Replacement Policy Behavior:
- Replacement policy varied with a fixed 256 KB cache size, 32-byte line size and 4-way set associativity.
- LRU and FIFO give the same performance, so either one can be chosen.
[Figure: Impact of Set Associativity. x-axis: Number of Sets (1, 2, 4, 8, 16); y-axis: Miss Rate % (0-8); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
[Figure: Impact of Replacement Policy. x-axis: AES, DES, 3DES, RC4, IDEA, MD5, SHA1; y-axis: Miss Rate % (0-3.5); series: LRU, FIFO, Random]
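How cache size and replacement policy feed into the miss rate can be illustrated with a tiny set-associative cache model. This is a sketch under simplifying assumptions (a synthetic looping address trace, no timing, policy letters l/f/r borrowed from the parameter table); it is not the simulator's cache model and does not reproduce the measured curves.

```python
import random


def miss_rate(addresses, size_bytes, line_bytes, assoc, policy="l", seed=0):
    """Miss rate of a set-associative cache with LRU (l), FIFO (f) or random (r) replacement."""
    nsets = size_bytes // (line_bytes * assoc)
    sets = [[] for _ in range(nsets)]  # each list is ordered oldest -> newest
    rng = random.Random(seed)
    misses = 0
    for addr in addresses:
        block = addr // line_bytes
        lines = sets[block % nsets]
        if block in lines:
            if policy == "l":  # LRU refreshes a line on every hit; FIFO does not
                lines.remove(block)
                lines.append(block)
            continue
        misses += 1
        if len(lines) == assoc:  # set is full, so pick a victim
            victim = rng.randrange(assoc) if policy == "r" else 0
            lines.pop(victim)
        lines.append(block)
    return misses / len(addresses)


# Synthetic trace: a 1 KB loop of 4-byte accesses, executed 10 times.
trace = [addr for _ in range(10) for addr in range(0, 1024, 4)]

for size in (512, 2048):
    for policy in ("l", "f"):
        rate = miss_rate(trace, size, 32, 2, policy)
        print(f"{size} B, policy {policy}: miss rate {100 * rate:.2f}%")
```

Once the loop fits in the cache, only cold misses remain and no line is ever evicted, so LRU and FIFO cannot differ, which is in line with the observation that either policy can be chosen.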
Results & Discussions:(12,13)
12. L1 Data Cache Size Behavior:
- Cache size varied with a fixed 64-byte line size, 1-way set associativity and LRU (l) replacement
- 32 KB is enough to reach the best performance level.
13. L1 Data Cache Line Size:
- Cache line size varied with a fixed 256 KB cache size, 1-way set associativity and LRU (l) replacement
- A 32-byte line size is enough to reach the lowest possible miss rate.
[Figure: Impact of Cache Size on Miss Rate. x-axis: Cache Size (KB): 4, 8, 16, 32, 64, 128, 256; y-axis: Miss Rate % (0-25); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
[Figure: Impact of Block Size on Miss Rate. x-axis: Block Size (Bytes): 8, 16, 32, 64; y-axis: Miss Rate % (0-25); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
Results & Discussions: (14,15)
14. L1 Data Cache Set Behavior:
- Set associativity varied with a fixed 256 KB cache size, 32-byte line size and LRU (l) replacement policy.
- 2-way set associativity is enough for block and stream ciphers, but hash functions need 4-way.
15. L1 Data Cache Replacement Policy Behavior:
- Replacement policy varied with a fixed 256 KB cache size, 32-byte line size and 4-way set associativity.
- LRU and FIFO give the same performance, so either one can be chosen.
[Figure: Impact of Set Associativity. x-axis: Number of Sets (1, 2, 4, 8, 16); y-axis: Miss Rate % (0-7); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
[Figure: Impact of Replacement Policy. x-axis: AES, DES, 3DES, RC4, IDEA, MD5, SHA1; y-axis: Miss Rate % (0-6); series: LRU, FIFO, Random]
Results & Discussions: (16,17)
16. Unified L2 (UL2) Cache Size Behavior:
- Cache size varied with a fixed 64-byte line size, 1-way set associativity and LRU (l) replacement
- 512 KB is enough to reach the best performance level.
17. UL2 Cache Replacement Policy Behavior:
- Replacement policy varied with a fixed 512 KB cache size, 64-byte line size and 4-way set associativity.
- LRU and FIFO give the same performance, so either one can be chosen.
[Figure: Impact of Cache Size. x-axis: Cache Size (KB): 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048; y-axis: Miss Rate % (0-80); series: AES, DES, 3DES, RC4, IDEA, MD5, SHA1]
[Figure: Impact of Replacement Policy. x-axis: AES, DES, 3DES, RC4, IDEA, MD5, SHA1; y-axis: Miss Rate % (0-50); series: LRU, FIFO, Random]
An Interesting Comparison:
Observation | Li’s analysis (widely available crypto algorithms) | My analysis (OpenSSL crypto algorithms)
Instruction mix | 23% memory reference, 60% arithmetic computations | 40-45% memory reference, 68% arithmetic reference
Cycles per byte of computation | Block: 80, Stream: 20, Hash: 18 | Block: 55, Stream: 55, Hash: 30
ALU vs. IPC | Best with 4 ALUs | Best with 4 ALUs
IFQ vs. IPC | Best when IFQ is 4 | Best when IFQ is 4
ILP vs. IPC | Best when ILP is 4 | Best when ILP is 8
Branch prediction technique | Simple technique (taken or not taken) | Complex technique (bimodal or combinational)
L1 instruction cache parameters | 16 KB cache size, 8-byte line size, 4-way set, l replacement | 128 KB cache size, 32-byte line size, 2-way set, l replacement
L1 data cache parameters | 32 KB cache size, 8-byte line size, 2-way set, l replacement | 32 KB cache size, 64-byte line size, 2-way set, l replacement
UL2 unified cache parameters | 64 KB cache size, l replacement policy | 512 KB cache size, l replacement policy
Demo Time …………
Conclusion:
For better architectural performance, crypto engines using the OpenSSL crypto algorithms should have:
* 128 KB L1 instruction cache
* 32 KB L1 data cache
* 512 KB UL2 cache
* 2-way set associativity
* LRU (l) replacement policy
* ILP of 8
* Advanced branch prediction schemes
Acknowledgement & References:
A big thanks to Li Zhao.
References:
- SimpleScalar Tool Set, http://www.simplescalar.com
- OpenSSL, http://www.openssl.org
- Haiyong Xie et al., "Architectural Analysis of Cryptographic Applications for Network Processors".
- Li Zhao, Ravi Iyer, Srihari Makineni, Laxmi Bhuyan, "Anatomy and Performance of SSL Processing".
Q&A ????