PARALLEL ALGORITHMS FOR

SYMMETRIC KEY

INFRASTRUCTURE BASED

SECURITY TECHNIQUES

THESIS

Submitted

in fulfilment of the requirements of the degree of

DOCTOR OF PHILOSOPHY

By

Disha Handa

University Regd. No: PHDENG10023

Supervised by

Dr. Bhanu Kapoor,

Professor, Chitkara University, Himachal Pradesh

December, 2014

Department of Computer Science & Engineering

CHITKARA UNIVERSITY, HIMUDA EDUCATIONAL HUB,

SOLAN, HIMACHAL PRADESH-174103

CHITKARA UNIVERSITY, HIMACHAL PRADESH

DECLARATION BY THE STUDENT

I hereby certify that the work presented in this thesis entitled “Parallel Algorithms for Symmetric Key Infrastructure Based Security Techniques”, submitted in fulfilment of the requirement for the award of the Degree of Doctor of Philosophy in the Department of Computer Science and Engineering, Chitkara University, Barotiwala, Solan, Himachal Pradesh, is an authentic record of my own work carried out under the supervision of Dr. Bhanu Kapoor.

The work has not formed the basis for the award of any other degree or diploma, in this or any other Institution or University. In keeping with the ethical practice in reporting scientific information, due acknowledgements have been made wherever the findings of others have been cited.

(Signature)

(Disha Handa)

CHITKARA UNIVERSITY, HIMACHAL PRADESH

CERTIFICATE BY THE SUPERVISOR

This is to certify that the thesis entitled “Parallel Algorithms for Symmetric Key Infrastructure Based Security Techniques”, submitted by Disha Handa, Regd. No. PHDENG10023, to Chitkara University, Barotiwala, Solan, Himachal Pradesh in fulfilment of the requirements for the award of the degree of Doctor of Philosophy, is a bona fide record of research work carried out by her under my supervision. The contents of this thesis, in full or in part, have not been submitted to any other Institution or University for the award of any degree or diploma.

(Signature)

Dr. Bhanu Kapoor,

Professor, Chitkara University,

Himachal Pradesh, India

ACKNOWLEDGMENT

I would like to express my special appreciation and thanks to my advisor, Professor Dr. Bhanu Kapoor, who has been a great mentor to me. I would like to thank him for encouraging my research and for allowing me to grow as a research scientist. His advice on both my research and my career has been priceless. I would also like to thank my research colleagues, Ms. Neha Kishore, Ms. Sapna Saxena, and Ms. Tanu Sharma, for their valuable support. Special thanks go to Ms. Harpreet Kaur from the Electronics and Communication Department for her advice and support. Heartfelt thanks go to Ms. Isha Saluja and Ms. Padma Kala for helping me refine the language of this document.

I would equally like to thank my family for motivating and supporting me in this endeavor and for letting me spread my wings. My parents, my husband and my little daughter have been my backbone throughout. It is only because of them that I could sustain the hard work towards my goal.

Finally, I would like to thank the management of Chitkara University, with special thanks to Dr. Ashok Chitkara, Chancellor; Dr. Madhu Chitkara, Pro Chancellor; Brig. (Dr.) R.S. Grewal, Vice Chancellor; Dr. Sudhir Mahajan, Dean Research and Development, and his team; Dr. Rajnish Sharma, Dean Academics; Dr. Shaily Jain, Head of the Computer Science Department; and all the internal and external examiners for their valuable time and expert guidance during the various progress seminars and presentations, and for their suggestions, feedback and approvals. Once again, I would like to extend my deep gratitude to everyone who has helped me shape this dream and make it a reality.

Above all, I would like to thank the Almighty for giving me the inner strength and passion that drive me and help me keep going.

Disha Handa

LIST OF PUBLICATIONS

Published/Presented

Handa D and Kapoor B (2014) “State of the Art Realistic Cryptographic Approaches for RC4 Symmetric Stream Cipher”, IJCSA, vol. 4, pp. 27-37, DOI: 10.5121/ijcsa.2014.4403

Handa D and Kapoor B (2014) “Performance Analysis of PBlock Algorithm Implemented Using SIMD Model to Attain Parallelism”, Proceedings of the 49th Annual Convention of the Computer Society of India CSI - Emerging ICT for Bridging the Future, Volume 2, Springer, pp. 71-80, DOI: 10.1007/978-3-319-13731-5_9

Handa D and Kapoor B (2014) “PARC4: High performance implementation of RC4 cryptographic algorithm using parallelism”, Proceedings of the International Conference on Optimization, Reliability, and Information Technology (ICROIT), pp. 286-289, DOI: 10.1109/ICROIT.2014.6798339

Accepted

Handa D and Kapoor B (2015) “PARC4-I: Parallel Implementation of Enhanced RC4A using PASCS and Loop Unrolling Mechanism”, Computer Applications: International Journal, 2:2

Communicated

Handa D and Kapoor B (2015) “PBlock: an Energy Efficient Parallel Approach for Faster File Encryption using Parallel Independent Feistel Cipher Structure”, Asian Journal of Scientific Research

Handa D and Kapoor B (2015) “PARC4: An Energy Efficient, Parallel Implementation of RC4 Cipher using Parallel Additive Stream Cipher Structure”, International journal of emergent, parallel and distributed computing

ABBREVIATIONS

AES Advanced Encryption Standard

API Application Programming Interface

CPU Central Processing Unit

CUDA Compute Unified Device Architecture

CC-NUMA Cache-Coherent Non-Uniform Memory Access

DES Data Encryption Standard

DVFS Dynamic Voltage and Frequency Scaling

ECB Electronic Code Book

FDE File and Disk Encryption

FPGA Field-Programmable Gate Array

GCC GNU Compiler Collection

GPGPU General Purpose Graphics Processing Unit

HPC High Performance Computing

IC Integrated Circuit

IPC Inter Processor Communication

KSA Key Scheduling Algorithm

MIMD Multiple Instruction Stream and Multiple Data Stream

MISD Multiple Instruction Stream Single Data Stream

NUMA Non-Uniform Memory Access

OpenMP Open Multi-Processing

PASCS Parallel Additive Stream Cipher Structure

PARC4 Parallel Approach of RC4

PRGA Pseudo-Random Generation Algorithm

PIFNS Parallel Independent Feistel Network Structure

RC4 ARC Four

RAM Random Access Memory

ROM Read Only Memory

RSA Rivest Shamir Adleman

RAW Read after Write

SIMD Single Instruction Stream Multiple Data Stream

SISD Single Instruction Stream Single Data Stream

SMP Symmetric Multiprocessor Architecture

SSL Secure Sockets Layer

TLS Transport Layer Security

UMA Uniform Memory Access

VLSI Very Large Scale Integration

WEP Wired Equivalent Privacy

WAR Write after Read

WAW Write after Write

NOTATIONS

S Substitution Box

P1-P18 Array containing digits of Pi

Mod Modular arithmetic

Tp Execution time of parallel portion

Ts Execution time for serial portion

E(n) Efficiency with n processing elements

T(1) Execution time using single processing element

T(n) Execution time using n processing elements

⊕ XOR operation

Ω Standard asymptotic lower bound

Θ Standard asymptotic tight bound

𝑇𝑜 Total parallel overhead

X Number of times speedup

𝑝 Number of processors

n Number of blocks

µ muon

LIST OF TABLES

Table No. Title

Table 4-1: Time (in seconds) taken by RC4 to encrypt/decrypt large data files by uniprocessor

Table 4-2: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files using 2 cores


Table 4-3: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files using 4 cores

Table 4-4: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files using 6 cores

Table 4-5: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files using 8 cores

Table 4-6: Efficiency as a function of n and p for running n blocks on p processors to encrypt input stream

Table 4-7: Comparison between PARC4 and Multithreaded approach

Table 5.1: Time taken by RC4A to encrypt/decrypt large data files by uniprocessor system

Table 5.2: Time taken by PARC4-I to encrypt/decrypt large data files using 2 cores


Table 5.3: Time taken by PARC4-I to encrypt/decrypt large data files using 4 cores

Table 5.4: Time taken by PARC4-I to encrypt/decrypt large data files using 6 cores

Table 5.5: Time taken by PARC4-I to encrypt/decrypt large data files using 8 cores

Table 5.6: Comparison between PARC4 and PARC4-I

Table 7.1: Avalanche effect in Blowfish and PBlock: change in plaintext

Table 7.2: Avalanche effect in Blowfish and PBlock: change in key

Table 7.3: Time taken by Blowfish to encrypt/decrypt large data files by single processor

Table 7.4: Time taken by PBlock to encrypt/decrypt large data files using 2 cores


Table 7.5: Time taken by PBlock to encrypt/decrypt large data files using 4 cores

Table 7.6: Time taken by PBlock to encrypt/decrypt large data files using 6 cores

Table 7.7: Time taken by PBlock to encrypt/decrypt large data files using 8 cores

Table 7.8: Efficiency vs. number of processing elements for different file sizes

Table 7.9: Comparison between PBlock and Pipelined approach

Table 8.1: Calibrated and Non-calibrated specification

Table 8.2: Energy consumed by Blowfish and PBlock with system’s default frequency and voltage

Table 8.3: Energy consumed by existing and proposed parallel algorithms for stream cipher technique using system’s default frequency and voltage

Table 8.4: Low power states of AMD-8320 processor

LIST OF FIGURES

Figure No. Title

Fig. 1.1 Pictorial Representation of Cryptography

Fig. 1.2 Pictorial representation of Symmetric key infrastructure based security algorithms

Fig. 1.3 Stream Cipher Encryption Techniques

Fig. 1.4 Block Cipher Encryption Techniques

Fig. 2.1 Pictorial Representation of Data Parallel Model (Barney, 2010)

Fig. 2.2 Pictorial Representation of MPMD (Barney, 2010)

Fig. 2.3 Representation of Domain Decomposition (Barney, 2010)

Fig. 2.4 Pictorial Demonstration of Functional Decomposition (Barney, 2010)

Fig. 2.5 Multi-Core Processor Architecture

Fig. 3.1 Design of Parallel Additive Stream Cipher Structure

Fig. 4.1 Swapping between S[i] and S[j]

Fig. 4.2 (I) Sequential key generation; (II) formation of key stream for parallel framework

Fig. 4.3 Graphical Representation of Complete Flow and Model Used to Parallelize RC4

Fig. 4.4 Pictorial Representation of Input Data Decomposition Technique

Fig. 4.5 Speedup comparison of PARC4 using multiple cores

Fig. 4.6 Speedup for Constant Data using multiple cores

Fig. 4.7 Comparison for Throughput achieved using RC4 and PARC4

Fig. 5.1 Method used to implement PARC4-I on SMPs

Fig. 5.2 Pictorial representation of normal and unwinding loop

Fig. 5.3 Execution time of 1 GB of data file using PARC4-I

Fig. 5.4 Speedup comparison using multiple cores

Fig. 5.5 Graphical representation of parallel run time of PARC4 vs PARC4-I on eight cores

Fig. 5.6 Comparison between PARC4 and PARC4-I for speedup

Fig. 5.7 Comparison of PARC4 and PARC4-I for Efficiency

Fig. 5.8 Comparison between PARC4 and PARC4-I algorithms for throughput

Fig. 6.1 Structure of Sequential Feistel Network (William, 2006)

Fig. 6.2 Parallel Independent Feistel Network Structure

Fig. 7.1 Graphical representation of F function

Fig. 7.2 Speedup comparison of PBlock using multiple cores

Fig. 7.3 For constant file size speedup tends to saturate at specific point

Fig. 8.1 Power metering interface exposed by Joulemeter

Fig. 8.2 Data file of PARC4 consisting of joules consumed at each time stamp

Fig. 8.3 Comparison of serial, parallel and parallel with calibration Blowfish and PBlock for energy consumption using platform 1

Fig. 8.4 Comparison of serial, parallel and parallel with calibration Blowfish and PBlock for energy consumption using platform 2

Fig. 8.5 Serial and Parallel algorithms for stream ciphers technique with default and calibrated frequency using platform 1

Fig. 8.6 Serial and Parallel algorithms for stream ciphers technique with default and calibrated frequency using platform 2

LIST OF ALGORITHMS

Algorithm No. Title

Algorithm 1.1 The Key Scheduling Algorithm (KSA) (Schneier, 2008)

Algorithm 1.2 The Pseudo-Random Generation Algorithm (PRGA) (Schneier, 2008)

Algorithm 4.1 Steps to Implement PARC4

Algorithm 5.1 Enhanced pseudo-random generation algorithm (PRGA)

Algorithm 5.2 Method used to parallelize multiple data chunks using PARC4-I

Algorithm 7.1 Algorithmic steps for encryption process in PBlock

Algorithm 7.2 Algorithmic steps for parallel F function

CONTENTS

DECLARATION BY THE STUDENT

CERTIFICATE BY THE SUPERVISOR

ACKNOWLEDGMENT

LIST OF PUBLICATIONS

ABBREVIATIONS

NOTATIONS

List of Tables

List of Figures

List of Algorithms

Contents

Abstract

Chapter 1

Introduction

1.1 Technology enhancements

1.2 Cryptography

1.2.1 Classification of Cryptographic techniques

Blowfish

Cast

Data Encryption Standard (DES)

IDEA

RC4

Triple DES

1.3 Issues in Symmetric Key Infrastructure based Algorithms

1.4 Possible Solutions

1.5 Motivation of Research

1.6 Research Problem

1.7 Literature Review

1.7.1 Description of Blowfish

1.7.2 Description of RC4

1.7.3 Description of RC4A

1.8 Dissertation Contribution And Delineate

Chapter 2

Basic Ideology of Parallel Computing, Tools and Experimental Setup used for Research

2.1 Introduction

2.2 Essential notions of parallel programming used in Research

2.2.1 Identifying Parallel Region in Code

2.2.2 Type of Parallel Computer

2.2.3 Speed up calculation using Amdahl’s law

2.2.4 Parallel computing memory architecture

2.2.5 Parallel Programming Models

2.2.6 Partitioning

2.2.7 Synchronization

2.2.8 Mapping for Load Balancing

2.2.9 Granularity

2.3 Experimental setup

2.4 Tools used

2.4.1 Gprof

2.4.2 OpenMP

2.4.3 MinGW

2.4.4 CodeBlocks

2.4.5 Joulemeter

Conclusion

Chapter 3

Design of Parallel Additive Stream Cipher Structure

3.1 Introduction

3.2 Motivation For Parallel Architecture

3.3 Design of PASCS

Conclusion

Chapter 4

PARC4: Parallel approach for RC4 using PASCS

4.1 Introduction

4.2 Detection of Parallelism

4.3 Method for Adding Parallelism

4.3.1 Parallelization techniques

4.4 Security Analysis

4.4.1 Shannon’s entropy

4.5 Experimental Results

4.6 Performance and Scalability Analysis

4.6.1 Speedup

4.6.2 Efficiency

4.6.3 Complexity and Cost optimality

4.6.4 Scalability

4.6.5 Throughput

4.7 Comparative Analysis

4.7.1 Mapping and Load Balance

4.7.2 Modified Key stream

4.7.3 Energy Efficiency

Conclusion

Chapter 5

PARC4-I: Parallel RC4A using PASCS and loop unrolling mechanism

5.1 Introduction

5.2 Modified KSA and PRGA

5.3 Incorporating parallelism

5.3.1 Techniques to enhance benefits of parallelization

5.4 Experimental Results

5.5 Performance and Scalability Analysis

5.5.1 Parallel Run Time

5.5.2 Speedup

5.5.3 Efficiency

5.5.4 Scalability

5.6 Comparison between PARC4 and PARC4-I

5.6.1 Parallel Run time

5.6.2 Speedup

5.6.3 Efficiency

5.6.4 Loop overhead

5.6.5 Throughput

Conclusions

Chapter 6

Design of Parallel Independent Feistel Network

6.1 Introduction

6.2 Motivation for Parallel architecture

6.3 Design of Parallel Independent Feistel Network Structure

6.4 Application Area of PIFNS

Conclusion

Chapter 7

PBlock- Parallel approach for Blowfish cipher using PIFN

7.1 Introduction

7.2 Implementation of PBlock using PIFNS

7.2.1 Parallel Methodology

7.2.2 Design of Parallel F function

7.3 Security Analysis using Avalanche effect

7.4 Experimental Results

7.5 Performance and Scalability Analysis

7.5.1 Speedup

7.5.2 Efficiency

7.5.3 Complexity and Cost optimality

7.5.4 Scalability

7.6 Comparative analysis of PBlock and Blowfish using Pipeline approach

Conclusion

Chapter 8

Analysis of Energy Consumption by proposed parallel algorithms

8.1 Introduction

8.2 Motivation

8.3 Tools and Techniques used for energy measurement

8.4 How Joulemeter works to measure energy

8.5 Energy Measurement

8.5.1 Result and Analysis

Conclusion

Chapter 9

Conclusions and Future Scope

9.1 Thesis Contribution

9.2 Conclusions

9.3 Future Scope

References

Appendix-A

Appendix-B

Appendix-C

Appendix-D

Appendix-E

Appendix-F

Appendix-G

Appendix-H

Appendix-I

ABSTRACT

Parallel computing involves the simultaneous use of multiple compute resources to solve a large computational problem. In real-world applications, many tasks can be executed in parallel. Parallel computing is used in diverse areas that range from computational simulations for technical and engineering problems to commercial applications in transaction processing and data mining. The performance and energy benefits of parallelism are key drivers for the growth in parallel computing.

Data security is a critical issue for businesses and individual computer users. Client information, payment information, personal files and bank account details, all of which are typically needed in a commercial transaction, are potentially dangerous if they fall into the wrong hands. Various cryptographic algorithms are therefore used to secure data and information. These encryption algorithms are compute-intensive and tend to be slow as a result. They can benefit significantly from parallel implementations that utilize the multicore processors available today.

In this thesis, parallel symmetric-key based algorithms to encrypt and decrypt large sets of data are proposed. The design of the parallel algorithms targets improvements in speed and energy consumption. In recent times, much effort has gone into enhancing the speed of these algorithms using FPGA-based hardware implementations. This thesis instead proposes software-based parallel implementations of these security algorithms running on symmetric multiprocessing machines.

The performance of the proposed parallel algorithms has been compared with that of the existing sequential implementations. The results show that the proposed algorithms perform significantly better than the existing sequential algorithms. Apart from the speedup gained through parallel implementation, the energy efficiency of the algorithms has also been measured. Their energy efficiency makes the parallel algorithms suitable for use in handheld devices. The algorithms proposed in this thesis have high potential for adoption in full-disk encryption and other data-intensive encryption processes.


Chapter 1

Introduction


This chapter discusses security as one of the foremost concerns of today's computer-centric era and the requirements for security techniques, along with some historical background on cryptographic algorithms. The issues involved in sequential security algorithms and the motivation for the research are discussed, followed by an in-depth literature review.

To secure information communicated over a network, different encryption algorithms have been used over time. Encryption algorithms fall into two broad categories: symmetric and asymmetric (Menezes, 1996). In symmetric algorithms, the same key is used for both encryption and decryption. Symmetric block cipher algorithms include DES, 3-DES, AES and Blowfish, while RC4 is a symmetric stream cipher algorithm (Menezes, 1996). Asymmetric algorithms use two different keys for encryption and decryption. A public-key infrastructure based algorithm such as RSA is an example of an asymmetric encryption algorithm. Apart from the security level, the speed of an encryption algorithm is also a very important aspect in the cryptographic world. A slow algorithm can drastically reduce the speed of the entire application and diminish its effectiveness. Power consumption by electronic devices such as smartphones, tablets and other computing systems is another challenging issue that has become a significant concern at the individual level as well as the community level (the energy consumed and the heat dissipated by these electronic systems add to their environmental impact). In multicore processor models, applications can be executed on N cores, where N may vary, and these cores can operate at different frequencies (Roy, 2008, POWER, 2010, Vajda and Stenström, 2012). The overall performance and power cost of a parallel algorithm depend on parameters such as the number of cores the algorithm uses, the set of frequencies at which these cores operate and, last but not least, the structure of the parallel algorithm (Gepner and Kowalik, 2006).


Sequential security algorithms can be made faster through parallelization. With the advent of parallel processors, readily available means now exist to parallelize these algorithms and make them faster (Kumar V). Symmetric multiprocessors such as those from Intel and AMD can be used in conjunction with parallel programming APIs such as OpenMP to make security algorithms parallel and faster (Chapman B, 2008). It is feasible to use parallel algorithms for any of the cryptographic techniques currently in use. The basic motivation for this research is to observe whether complex security algorithms can break down their work into tasks that can be executed in parallel, leading to performance gains.

The chapter is organized as follows. Section 1.1 gives brief information about technology enhancements. Section 1.2 presents an overview of cryptographic techniques. Section 1.3 describes issues related to cryptographic algorithms, and Section 1.4 discusses possible solutions. Section 1.5 states the motivation for the research, Section 1.6 defines the research problem and Section 1.7 presents the literature review related to the research problem.

1.1 Technology enhancements

Cryptography is a scientific discipline in which extensive computation is required to secure information of any type (William, 2006, Diffie and Hellman, 1976). The information can be data transmitted over a channel or important file or disk data. The complexity of cryptographic algorithms has been increasing due to the heavy use of confusion and diffusion structures, variable numbers of rounds and complex Feistel functions, leading to longer execution times (Patidar et al., 2009, Kanda, 2001).

Developments in CMOS technology scaling increase performance, raise transistor density and reduce power consumption (Davari et al., 1995, Auth et al., 2012). A chip with billions of transistors was not unforeseen by the industry, as the trend closely follows Moore's Law, which states that “The number of transistors in an integrated circuit seems to be doubled every two years” (Moore, 1965). Gordon E. Moore, a co-founder of Intel Corporation, described this trend in 1965. His law continues to apply to CMOS technology. In 2006, Intel designed its first chip with more than one billion transistors. Such a large number of transistors on a single chip creates opportunities for new levels of computing capability.

However, enhancing a processor's performance without generating too much heat is a challenge, as the following quote illustrates: “Intel processors would soon be producing more heat per square centimeter than the surface of the sun, which is why the problem of heat is already setting hard limits to frequency (clock speed) increases” (Koch, 2005). Parallel programmers need to take frequency and voltage into account in order to reduce the power consumed by complex applications.

1.2 Cryptography

Over time, an elaborate set of rules and methods has been specified to address information security concerns. In the early days, the major purpose of cryptography was to achieve message confidentiality, that is, the transformation of messages from a clear form into an unintelligible one and back again at the other end. The basic idea behind this process is to make the information unreadable by unauthorized parties and so ensure privacy in communications. A pictorial representation of cryptography is shown in Fig. 1.1. Nowadays, the field has extended beyond confidentiality and includes procedures to ensure message integrity, digital signatures, authentication and secure computation. Cryptography now has the following objectives (William, 2006, Yu et al., 2010):

1) Confidentiality: It refers to limiting access to information and preventing access by unauthorized users.

2) Integrity: It refers to the trustworthiness and consistency of information resources.

3) Non-repudiation: It ensures that the source of the information cannot later deny having transmitted it.

4) Authentication: It refers to the process of confirming the identities of the sender and the receiver, along with the source and the destination of the information.

Fig. 1.1 Pictorial representation of Cryptography (Source:

http://www.onlinebusiness.newstipstricks.com/what-is-cryptography/)

Cryptography comprises diverse approaches, such as combining words with images, using microdots to hide information in transit and many more. Its basic purpose is to assure the sender and the receiver that the information cannot be retrieved by unauthorized parties. The most common traditional ciphers fall into two major categories:

Substitution techniques: In this technique, the letters or digits of the plaintext are replaced by other letters, digits or special symbols. The Caesar cipher, monoalphabetic ciphers, Hill ciphers and polyalphabetic ciphers belong to this category (Menezes, 1996).
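As a minimal illustration of a substitution technique, the following Python sketch implements the Caesar cipher named above. The function name `caesar_shift` and the shift value of 3 are choices made for this example only and do not come from the thesis.

```python
def caesar_shift(text: str, shift: int) -> str:
    """Replace each ASCII letter by the letter `shift` positions later in the alphabet."""
    result = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            base = ord("A") if ch.isupper() else ord("a")
            result.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            result.append(ch)  # digits, spaces and symbols are left unchanged
    return "".join(result)


ciphertext = caesar_shift("Go to the party", 3)
print(ciphertext)
print(caesar_shift(ciphertext, -3))  # shifting back recovers the plaintext
```

Because every letter is always mapped to the same replacement, the cipher has only 25 useful keys, which is why it serves as a teaching example rather than a practical cipher.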

Transposition techniques: In contrast to substitution, a different kind of mapping is achieved by permuting the plaintext according to some predefined function (Menezes, 1996). Such a cipher is called a transposition cipher. The rail fence technique is a common method of achieving this permutation. In this method, the plaintext is written down as a sequence of diagonals and then read off row by row. For example, the message “Go to the Party”, enciphered with a two-row rail fence, is written as:

G t t e a t

o o h p r y

Finally, the encrypted message is: “Gtteatoohpry”
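The two rows shown above hold the twelve letters of "Go to the party" written alternately on the upper and lower rail. A small Python sketch of this two-row rail fence is given below; the function names are illustrative, and the choice to drop spaces before encryption is an assumption made to match the example.

```python
def rail_fence_encrypt(plaintext: str) -> str:
    """Two-row rail fence: even-position letters form the top row, odd-position letters the bottom row."""
    letters = [ch for ch in plaintext if not ch.isspace()]
    return "".join(letters[0::2] + letters[1::2])


def rail_fence_decrypt(ciphertext: str) -> str:
    """Interleave the two rows again to restore the original letter order."""
    top_len = (len(ciphertext) + 1) // 2
    top, bottom = ciphertext[:top_len], ciphertext[top_len:]
    letters = []
    for index in range(len(ciphertext)):
        row = top if index % 2 == 0 else bottom
        letters.append(row[index // 2])
    return "".join(letters)


print(rail_fence_encrypt("Go to the party"))
print(rail_fence_decrypt("Gtteatoohpry"))
```

Decryption returns the letters without the original spaces, since the spaces were removed before the rows were built.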

To implement any cryptographic technique, the following key elements must be involved:

Plaintext - The original, intelligible message

Cipher text - The transformed message

Cipher - An algorithm for converting plaintext into cipher text by transposition and/or substitution methods

Key - Secret information used by the cipher to transform the text and known only to the sender and the receiver

The cipher, or algorithm, is the most important element, as all the functionality related to enciphering and deciphering resides in it. Another important element is the key: the longer and less predictable the key, the more time it takes to deduce it.

1.2.1 Classification of Cryptographic techniques

Broadly, cryptographic techniques are divided into two categories.

Asymmetric Infrastructure based techniques

Asymmetric cryptographic algorithms use two different keys for encryption and decryption (Salomaa, 1996). The key used for encryption is known as the public key and the key used for decryption is known as the private key. The sender therefore needs the receiver's public key to encrypt a message, and the receiver uses the corresponding private key to decrypt it. RSA and DSA are popular asymmetric algorithms, also known as public key infrastructure algorithms (Schoen and Boberski, 2002).

Symmetric Infrastructure based techniques

Symmetric algorithms, also known as private key based algorithms, use the same key for both encryption and decryption (Bellare and Yee, 2003). For a given key length, the keys used in symmetric-key cryptography are considerably more resistant to brute-force attacks than public keys of the same length, so a symmetric key can be much shorter than a public key that offers comparable security. Additionally, symmetric-key algorithms require less computing power than comparable public-key algorithms. Figure 1.2 is a pictorial representation of symmetric infrastructure based security techniques.


Fig. 1.2 Pictorial representation of Symmetric key infrastructure based

security algorithms. (Source: http://www.powayusd.com)

1.2.1.1 Classification of Symmetric Infrastructure based techniques

Symmetric algorithms are further divided into two categories: stream cipher and block cipher algorithms.

1.2.1.1.1 Stream ciphers

Stream ciphers are one of the common types of encryption algorithm. They encrypt the individual characters of a plaintext message one at a time, using an encryption transformation. Figure 1.3 shows the working of the stream cipher encryption technique.

Fig. 1.3 Stream Cipher Encryption Techniques


1.2.1.1.2 Block ciphers

A block cipher encrypts data in fixed-size blocks (commonly of 64 bits). The most commonly used block ciphers are DES, Triple DES, Blowfish and AES. Figure 1.4 demonstrates the working of the block cipher encryption technique.

Fig. 1.4 Block Cipher Encryption Techniques
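To make the idea of fixed-size blocks concrete, the sketch below cuts a message into 64-bit (8-byte) blocks before any cipher is applied. The padding rule, in which each padding byte stores the number of bytes added (PKCS#7 style), is an assumption made for this example and is not taken from the thesis. The function name is likewise illustrative.

```python
from typing import List

BLOCK_SIZE = 8  # bytes, i.e. the 64-bit block size mentioned above


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Pad the data to a multiple of the block size and cut it into equal blocks."""
    pad_len = block_size - (len(data) % block_size)
    padded = data + bytes([pad_len]) * pad_len  # assumed PKCS#7-style padding
    return [padded[i:i + block_size] for i in range(0, len(padded), block_size)]


blocks = split_into_blocks(b"Go to the party")
print(len(blocks), [block.hex() for block in blocks])
```

Each block produced this way can be handed to the block cipher independently, which is the property that chunk-level parallel schemes rely on.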

1.2.1.2 Symmetric key based algorithms

A brief overview of some commonly used stream and block symmetric algorithms is given in this section (Schneier, 2008).

Blowfish

Blowfish (Schneier) is a symmetric encryption algorithm designed by Bruce Schneier in 1993. It has a 64-bit block size and a variable key length that ranges from 32 bits to 448 bits. During key scheduling, it produces large pseudo-random lookup tables by performing many encryptions. All of these tables depend on the key supplied by the user. This design has been shown to be highly resistant to several attacks, such as differential and linear cryptanalysis. At the same time, it means that the algorithm cannot be used on systems where a large amount of memory is not available. Since its introduction, Blowfish has received considerable attention as a strong encryption algorithm. It is unpatented and license-free.


CAST

CAST is named after its inventors, Carlisle Adams and Stafford Tavares (Heys, 1994). It is also a popular 64-bit block cipher that belongs to the class of symmetric encryption algorithms.

Data Encryption Standard (DES)

The Data Encryption Standard (DES) was adopted in the United States as a federal standard in 1977 (Madson, 1998). DES uses a 56-bit key to encrypt and decrypt data in fixed-size blocks of 64 bits. The DES algorithm has 16 rounds, which means that the main algorithm is repeated 16 times to generate the cipher text. A brute-force attack must search a space of $2^{56}$ possible keys, and increasing the number of rounds makes cryptanalytic attacks on the cipher substantially harder.

IDEA

The International Data Encryption Algorithm, abbreviated as IDEA (Schneier, 2008), is a symmetric encryption algorithm developed by Dr. X. Lai and Prof. J. Massey. It was proposed as a replacement for the DES algorithm. It uses a 128-bit key. The size of the key makes it infeasible to break the cipher by simply trying every possible key.

RC4

The RC4 stream cipher was developed by Ron Rivest in 1987 (Schneier, 2008). The key size of the cipher is up to 2048 bits (256 bytes). The algorithm is extremely fast and, because of its speed, is used in many applications. It is divided into two sub-algorithms, one for key scheduling and the other for key stream generation. For encryption, the output of the generator is XORed with the data stream.

Triple DES

Triple DES (William, 2006) is a variant of the Data Encryption Standard (DES). Each of its keys is 64 bits long, of which 56 bits are effective key bits and 8 are parity bits. The block size of the algorithm is 8 bytes, that is, 64 bits. The idea behind Triple DES is to improve the security of DES by applying DES encryption three times using three different keys. The algorithm is considerably more secure than DES but very slow.

1.3 Issues in Symmetric Key Infrastructure based Algorithms

1. Apart from security, execution time or speed is also a very important aspect of a security algorithm. Recent efforts to enhance the speed of these algorithms have mostly been hardware-based FPGA implementations. This dissertation presents software-based parallel implementations of security algorithms that provide good speedup on a symmetric multiprocessor machine.

2. Another major challenge is to reduce the power consumed by software applications. These algorithms consume more energy on uniprocessor systems because of the massive calculations they perform. This thesis shows that the parallel implementations are more energy efficient.

1.4 Possible Solutions

1) Hardware based approaches: Three types of approach can be implemented at the hardware level.

1.1) FPGA Implementations: A field-programmable gate array (FPGA), as the name suggests, is an integrated circuit that is configured or programmed by the customer after manufacturing, using a hardware description language (HDL) (Zeidman, 1999). FPGAs have large RAM blocks, a large number of logic gates and very fast I/O and bidirectional data buses for implementing complex digital computations. FPGAs comprise programmable logic components termed "logic blocks" and a set of reconfigurable interconnects that allow different blocks to be wired together. Logic blocks can be configured to perform complex combinational functions or to act as simple logic gates such as AND, OR and XOR.

1.2) ASIC Implementations: An application-specific integrated circuit (ASIC) is an integrated circuit designed for a specific use rather than for general-purpose use (Sato et al., 1991). For example, a chip customized for use in a digital voice recorder is an ASIC. Application-specific standard products (ASSPs) are intermediate between ASICs and industry-standard ICs such as the 4000 or the 7400 series. Modern ASICs often contain complete microprocessors, memory blocks including ROM, RAM, EEPROM and flash memory, and other large building blocks. Such an ASIC is called a system-on-chip (SoC). A hardware description language such as Verilog or VHDL is used to describe the functionality of these ASICs (Palnitkar, 2003).

1.3) VLSI Implementations: VLSI is an acronym for very-large-scale integration, the process of creating an integrated circuit by combining a large number of transistors on a single chip (Mead and Conway, 1980). A circuit may comprise a CPU, RAM, ROM and other glue logic. This technology lets IC manufacturers implement all of these on a single small chip.

2) Software based approaches: Today, most cryptographic algorithms are implemented in software. Parallel computing is one possible solution to the issues mentioned above because it makes better use of the underlying parallel hardware. Current desktops and laptops are parallel in design, with multiple cores or processors. In general, serial programs executed on the latest computer architectures "waste" available computing power. Parallel software is explicitly designed for optimum utilization of parallel hardware with multiple cores.


1.5 Motivation of Research

Security has always been one of the biggest concerns for the computing world in terms of transmitting information and data across networks. Security algorithms are usually implemented serially, which makes them slow, as it takes time to perform the calculations for encryption as well as decryption. They can also require a large amount of memory, which a single processor cannot always provide.

Parallel computing is an emerging area that uses multicore processors for faster and more efficient execution of instructions. To achieve higher performance in the area of security, security algorithms can therefore be implemented in parallel.

Parallel implementations of security algorithms are also very important in mobile computing and high-end servers as a means of reaching high performance targets while maintaining acceptable power characteristics. Security algorithms are computation intensive, and implementing them in parallel helps to reduce power consumption, which is one of the most critical aspects in these areas.

1.6 Research Problem

Considering the above-mentioned issues with sequential cryptographic algorithms, this dissertation proposes parallel algorithms for symmetric-key based encryption methods, implemented on a symmetric multiprocessor machine using OpenMP, and analyzes the performance gained by parallelizing the algorithms through experiments over a large number of data sets. To achieve the desired outcome, the research problem is divided into the following objectives:

1) To study sequential algorithms that can be used to implement symmetric-key cryptography and to execute them.

2) To devise parallel algorithms for symmetric-key based encryption methods.

3) To implement the parallel algorithms on a multicore machine using OpenMP.

4) To analyze how much performance can be gained by parallelizing the security algorithms, through experiments over a large number of data sets and by varying the parameters of the algorithms.

5) To use parallelism to obtain more energy-efficient algorithms for intensive computations such as cryptographic algorithms, making them more applicable to real-time applications.

1.7 Literature Review

There are many security algorithms based on symmetric key infrastructure, as discussed in Section 1.2.1.2. This thesis proposes a general parallel framework for Feistel networks and one for stream ciphers. Three algorithms with Feistel network and stream cipher properties are therefore chosen to test the performance of the parallel Feistel framework and the parallel stream cipher framework. Blowfish (from the category of block ciphers) is a Feistel network based algorithm, and the thesis presents the performance of parallel Blowfish after applying the parallel Feistel framework. RC4 and RC4A (from the category of stream ciphers) are used to test the parallel stream cipher framework and the performance enhancement of the parallel algorithms based on it. This section presents the introduction and structure of the existing algorithms, along with the related work on them.

1.7.1 Description of Blowfish

Blowfish is a private key infrastructure based block cipher that uses a single key to encrypt and decrypt data (Schneier). It was designed by Bruce Schneier in 1993. The algorithm has a 64-bit block size and a variable key length from 32 bits to 448 bits. It is suitable only for applications in which the key does not change frequently, for example an automatic file encryptor or a communications link. The algorithm comprises two sub-algorithms: key expansion and data encryption. The key expansion function converts the key into several subkey arrays totalling 4168 bytes. Data encryption occurs through a 16-round Feistel network. Each round consists of a key-dependent permutation and a key- and data-dependent substitution. The algorithm uses simple operations: exclusive-or, addition modulo $2^{32}$ and table lookup. All of these operations are efficient on microprocessors. The following elements are involved in the two functions:

Key expansion function:

P-array: eighteen 32-bit subkeys, P1 to P18, which are XORed with the data during encryption.

S-boxes: four substitution boxes for the non-linear function, each an array of 256 32-bit entries. The P-array and the S-boxes are initialized with a fixed string, the hexadecimal digits of pi.

Blowfish uses a large number of subkeys. These subkeys can be precomputed for faster encryption or decryption.

Data Encryption function:

Feistel network: the 64-bit input block is divided into two 32-bit halves.

F function: This is the core function of Blowfish. It divides its 32-bit input into four 8-bit quarters. Each quarter is used as an index into one of the four S-boxes, and each S-box lookup returns a 32-bit value. The outputs of S-box 1 and S-box 2 are added first, and the result is XORed with the output of S-box 3. Finally, the output of S-box 4 is added to the result of the XOR operation, giving a 32-bit output.
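The F function described above can be sketched in a few lines of Python. This is only an illustration: the S-boxes below are filled with placeholder values from a seeded pseudo-random generator, whereas real Blowfish derives them from the digits of pi and the key. The names `blowfish_f` and `feistel_round` and the `subkey` argument are introduced here for the example.

```python
import random

MASK32 = 0xFFFFFFFF

# Placeholder S-boxes: four tables of 256 32-bit words. Real Blowfish fills
# them from the digits of pi and the key; random values are used here only
# so that the sketch runs.
_rng = random.Random(0)
S = [[_rng.getrandbits(32) for _ in range(256)] for _ in range(4)]


def blowfish_f(x: int) -> int:
    """Split the 32-bit input into bytes a, b, c, d and combine four S-box lookups."""
    a = (x >> 24) & 0xFF
    b = (x >> 16) & 0xFF
    c = (x >> 8) & 0xFF
    d = x & 0xFF
    h = (S[0][a] + S[1][b]) & MASK32   # addition modulo 2^32
    h ^= S[2][c]                       # XOR with the third S-box output
    return (h + S[3][d]) & MASK32      # final addition modulo 2^32


def feistel_round(left: int, right: int, subkey: int):
    """One Feistel round on two 32-bit halves; `subkey` stands for one P-array entry."""
    left ^= subkey
    right ^= blowfish_f(left)
    return right, left  # the halves are swapped for the next round


print(hex(blowfish_f(0x12345678)))
print([hex(v) for v in feistel_round(0x01234567, 0x89ABCDEF, 0xA5A5A5A5)])
```

Because the round only XORs one half with a function of the other, it can be undone by applying the same operations in reverse order, which is what makes the Feistel structure usable for both encryption and decryption.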

1.7.1.1 Related work

In the emerging era of data-intensive computing and low-cost internet connections for global data communications, there is a growing demand for data security and computational speedup. In recent years, successful studies have used hardware acceleration techniques, FPGA implementations and CUDA's GPGPU platform to speed up the execution of cryptographic algorithms. Liu et al. (Liu, 2012) presented an implementation method for power-efficient hardware acceleration of the RSA and Blowfish cryptographic algorithms. They were able to reduce energy consumption by 9.6% for RSA and 36% for Blowfish. However, their approach is based on a co-processor design on an FPGA platform.

Krishnamurthy G.N et al. (G.N, 2007) presented a performance enhancement of Blowfish that modifies its F function without violating the memory requirements, security and simplicity of the existing Blowfish algorithm. The modification is limited to a change in the implementation of the F function of the Feistel network. The existing Blowfish divides X1 into four eight-bit quarters a, b, c and d and computes $F(X_1) = ((S_{1,a} + S_{2,b} \bmod 2^{32}) \oplus S_{3,c}) + S_{4,d} \bmod 2^{32}$, whereas the modified function is $F(X_1) = (S_{1,a} + S_{2,b} \bmod 2^{32}) \oplus (S_{3,c} + S_{4,d} \bmod 2^{32})$. The modified form thus allows the two addition operations to be evaluated in parallel using threads. They were able to reduce the overall execution time by 14%.
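The difference between the two forms of F can be seen in a short Python sketch. It takes the four S-box outputs as plain arguments. The function names and the sample values are illustrative assumptions, and no threads are actually started here.

```python
MASK32 = 0xFFFFFFFF


def f_original(s1a: int, s2b: int, s3c: int, s4d: int) -> int:
    """((S1,a + S2,b) XOR S3,c) + S4,d: every step needs the previous result."""
    return ((((s1a + s2b) & MASK32) ^ s3c) + s4d) & MASK32


def f_modified(s1a: int, s2b: int, s3c: int, s4d: int) -> int:
    """(S1,a + S2,b) XOR (S3,c + S4,d): the two sums do not depend on each other."""
    left_sum = (s1a + s2b) & MASK32    # could be computed by one thread
    right_sum = (s3c + s4d) & MASK32   # could be computed concurrently by another
    return left_sum ^ right_sum


# Arbitrary 32-bit values standing in for the four S-box outputs.
values = (0xD1310BA6, 0x98DFB5AC, 0x2FFD72DB, 0xD01ADFB7)
print(hex(f_original(*values)), hex(f_modified(*values)))
```

In the original form the second addition must wait for the XOR, so the three operations form a chain. In the modified form only the final XOR depends on both sums.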

An ASIC implementation of a low-power, high-throughput Blowfish security algorithm was presented by P. Karthigai Kumara and K. Baskaran (P. Karthigai Kumara June 2010). The algorithm was prototyped in a 130 nm custom integrated circuit.

Krishnamurthy G.N et al. (G.N, March 2008) presented a performance enhancement of the Blowfish and CAST-128 algorithms and a security analysis of the improved Blowfish algorithm using the avalanche effect. Using a VHDL implementation, they observed a reduction of more than 12.5% in encryption and decryption time compared to the existing algorithm.

P. Karthigai Kumara and K. Baskaran (P. Karthigai kumar 2010) explored and presented a partially pipelined VLSI implementation of the Blowfish encryption/decryption algorithm. This implementation is a partially pipelined, robust hardware architecture for Blowfish. The proposed design attains a high encryption speed of 2670 MBits/sec and a decryption speed of 2642 MBits/sec.

The authors of a pipelined approach for high-performance implementation and evaluation of the Blowfish cryptographic algorithm on the Single-Chip Cloud Computer (SCC) (Kamak Ebadi, Dec 2012) presented a parallel approach to Blowfish on a special processor. The SCC is an experimental processor with a 48-core architecture created by Intel Labs for research projects. In this model, the input file is split into a number of small data chunks and each chunk undergoes a sequence of computations based on the Blowfish algorithm: each core performs a single round of computation and then passes the data to the next core for the next round. The authors report that this approach is 27X faster than the sequential one. However, in this pipelined model the use of large data chunks can cause bandwidth saturation and higher latency, which in turn leads to longer execution time.

1.7.2 Description of RC4

RC4 is one of the most widely used stream ciphers and is used in popular protocols such as Secure Sockets Layer (SSL) to protect web browsing and WEP to protect wireless networks (Schneier, 2008). Other applications of RC4 include Skype and the BitTorrent protocol. RC4 generates a key stream, which is a pseudo-random stream of bits. The key stream is combined with the plaintext using bit-wise XOR to generate the encrypted text. The algorithm has two main parts: the key scheduling algorithm (KSA) and the pseudo-random generation algorithm (PRGA).

The KSA is used to initialize the permutation in the ‘S’ array. The "key length" is the number of bytes in the key and ranges from 1 to 256.


For i = 0 to 255

    S[i] := i

End loop

j := 0

For i = 0 to 255

    j := (j + S[i] + key[i mod keylength]) mod 256

    Swap values of S[i] and S[j]

End loop

Algorithm 1.1 The Key Scheduling Algorithm (KSA) (Schneier, 2008)

As shown in Algorithm 1.1, the S array is first initialized with the values 0 to 255. Then, using the S array elements and the key bytes, the values of j are calculated, and S[i] and S[j] are swapped to generate a permuted array. The loop is executed 256 times to produce the key-dependent initial permutation from which the key stream is generated.
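The key scheduling steps above translate directly into Python. The sketch below is a plain sequential version for illustration; the function name `rc4_ksa` and the example key are assumptions of this example.

```python
def rc4_ksa(key: bytes) -> list:
    """Key scheduling algorithm: build the key-dependent permutation S."""
    if not 1 <= len(key) <= 256:
        raise ValueError("RC4 key length must be between 1 and 256 bytes")
    s = list(range(256))               # S[i] := i
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]        # swap S[i] and S[j]
    return s


state = rc4_ksa(b"Key")  # "Key" is an arbitrary example key
print(state[:8])
```

The result is always a permutation of the values 0 to 255, since the loop only swaps entries and never overwrites them.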

The number of iterations in the PRGA, shown in Algorithm 1.2, depends on the input size. In each iteration the index i takes a new value: it starts at 1 and is incremented modulo 256, so after 255 it wraps around to 0 and the process continues until the last input byte. In each iteration, the value of j is calculated, S[i] and S[j] are swapped, and the sum of S[i] and S[j] mod 256 is used as an index into the S array to return one byte. This byte is then XORed with one byte of the plaintext to convert it into cipher text.

Algorithm 1.2 The Pseudo-Random Generation Algorithm (PRGA) (Schneier, 2008)

i := 0

j := 0

While generating output:

    i := (i + 1) mod 256

    j := (j + S[i]) mod 256

    Swap values of S[i] and S[j]

    K := S[(S[i] + S[j]) mod 256]

    Output K

End loop
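Putting the KSA and the PRGA together gives a complete sequential RC4, sketched below in Python. The function names are illustrative. The key "Key" and message "Plaintext" are a commonly published RC4 example pair and are not taken from the thesis.

```python
def rc4_keystream(key: bytes, length: int) -> bytes:
    """Run the KSA, then generate `length` key stream bytes with the PRGA."""
    s = list(range(256))
    j = 0
    for i in range(256):                       # key scheduling (KSA)
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    i = j = 0
    stream = bytearray()
    for _ in range(length):                    # key stream generation (PRGA)
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        stream.append(s[(s[i] + s[j]) % 256])
    return bytes(stream)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """XOR the data with the key stream; the same call encrypts and decrypts."""
    stream = rc4_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, stream))


ciphertext = rc4_crypt(b"Key", b"Plaintext")
print(ciphertext.hex())
print(rc4_crypt(b"Key", ciphertext))  # applying the cipher again recovers the message
```

Each key stream byte depends on the state left by the previous iteration, which is why the PRGA loop itself is inherently sequential and parallel schemes have to work at a coarser level.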


1.7.2.1 Related work

Many researchers have worked on the parallelization of stream cipher security algorithms using hardware acceleration techniques. K.H. Tsoi et al. (Tsoi, 2002) presented a parallel FPGA implementation of the RC4 algorithm in 2002. FPGA designs employ parallelism at the logic level to increase the number of operations per cycle performed by the RC4 search engine. Their design uses on-chip memories to attain very high memory bandwidth, floor planning to reduce routing delays and multiple decryption units to achieve further parallelism. A total of 96 RC4 decryption engines were integrated on a single Xilinx Virtex XCV1000-E field programmable gate array (FPGA). The resulting design operates at a 50 MHz clock rate and achieves a search speed of $6.06 \times 10^{6}$ keys/second, which is a speedup of 58 over a 1.5 GHz Pentium 4 PC.

In 2009, Changxin Li, Hongwei Wu, Shifeng Chen, Xiaochao Li and Donghui Guo (Li, 2009, August) presented an efficient implementation of the MD5-RC4 encryption/decryption algorithm on an NVIDIA Graphics Processing Unit using the CUDA programming framework. The algorithm was implemented on an NVIDIA GeForce 9800GTX GPU and achieved a 3-5X speedup.

In 2012, T.D.B. Weerasinghe (Weerasinghe, August 2012) presented a Java-based multithreaded implementation of the RC4 algorithm on i3 and i7 series processors. The proposed method does not parallelize RC4 itself; instead, it shows how multithreading can be used to perform encryption and decryption when the message is a text file. The plaintext file is split into a number of small files, which are then encrypted separately using the RC4 cipher.

1.7.3 Description of RC4A

Souradyuti Paul and Bart Preneel have proposed an RC4 variant, which they call RC4A (Paul and Preneel, 2004). RC4A uses two state arrays S1 and S2, and two indices j1 and j2. Each time i is incremented, two bytes are generated. First, the basic RC4 algorithm is performed using S1 and j1, but in the last step, S1[i] + S1[j1] is looked up in S2. Second, the operation is repeated (without incrementing i again) on S2 and j2, and S1[S2[i] + S2[j2]] is output. The algorithm has two main parts: the key scheduling algorithm (KSA) and the pseudo-random generation algorithm (PRGA).

1.7.3.1 KSA

In this algorithm, the key stream is generated from a variable-length key and an internal state comprising the following elements:

1) Two 256-byte arrays, S1 and S2, each holding a permutation of the 256 possible byte values

2) Three index pointers, i, j1 and j2, which are used to point to elements in the S1 and S2 arrays

The algorithm starts by initializing the two arrays with the values 0 to 255, so that each value in an array is equal to its index. Once the arrays are initialized, the next step is to shuffle them into key-dependent permutations. For this, each array is iterated 256 times; the pointers j1 and j2 are computed with the formula j1 = (j1 + S1[i] + key[i mod key-length]) mod 256 (and likewise for j2 and S2), and the elements at positions i and j are swapped. Here key is the user's input value and "key length" is the number of bytes in the key, which ranges from 1 to 256.

Since the only operation performed on the S arrays is a swap, each array remains a permutation of the values 0 to 255.

1.7.3.2 PRGA

In this step, the generated key stream is XORed with the plaintext to produce the encrypted text as a sequence of bytes. All arithmetic is performed modulo 256. The number of iterations in the PRGA depends on the input size. The index i takes a different value from 0 to 255 in each of 256 consecutive iterations. If the input is longer than 255 bytes, i wraps around to 0 and the process continues until the last byte. In each iteration, the values of j1 and j2 are calculated, S1[i] and S1[j1] are swapped, and the sum of S1[i] and S1[j1] mod 256 is looked up in the S2 array to return one byte. The same operation is then applied to the S2 array, with the lookup made in S1. The returned bytes are then XORed with the plaintext, one byte at a time, to convert it into cipher text.
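The two-array generation step described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the description above does not say how the second array is keyed, so the example simply takes two separate keys, `key1` and `key2` (hypothetical parameters), and runs the RC4-style key schedule on each. The function names are also hypothetical.

```python
def ksa(key: bytes) -> list[int]:
    """RC4-style key schedule, applied here to each array separately."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s


def rc4a_keystream(key1: bytes, key2: bytes, n: int) -> bytes:
    """Generate n key stream bytes; two bytes are produced per increment of i."""
    s1, s2 = ksa(key1), ksa(key2)   # assumption: S2 is keyed with its own key
    i = j1 = j2 = 0
    out = bytearray()
    while len(out) < n:
        i = (i + 1) % 256
        j1 = (j1 + s1[i]) % 256
        s1[i], s1[j1] = s1[j1], s1[i]
        out.append(s2[(s1[i] + s1[j1]) % 256])   # index from S1, lookup in S2
        j2 = (j2 + s2[i]) % 256
        s2[i], s2[j2] = s2[j2], s2[i]
        out.append(s1[(s2[i] + s2[j2]) % 256])   # index from S2, lookup in S1
    return bytes(out[:n])


def rc4a(key1: bytes, key2: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt by XORing the data with the key stream."""
    ks = rc4a_keystream(key1, key2, len(data))
    return bytes(d ^ k for d, k in zip(data, ks))


secret = rc4a(b"first key", b"second key", b"parallel ciphers")
print(rc4a(b"first key", b"second key", secret))
```

The cross lookups (S1 indices read from S2 and vice versa) are what distinguish this generator from running two independent RC4 instances.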

1.7.3.3 Related Work

The authors of (Noman, 2009) present an efficient hardware implementation of the stream cipher RC4A. The proposed hardware implementation achieves a data throughput of up to 22.28 MB/sec at a frequency of 33.33 MHz, with a throughput-to-area ratio of 0.37. The implementation is also parameterized to support variable key lengths, from 8 bits to 512 bits. The cipher was designed using the Verilog hardware description language and implemented on a single Altera APEX 20K200E field programmable gate array (FPGA).

1.8 Dissertation Contribution And Delineate

In this dissertation, two major issues related to sequential cryptographic algorithms are addressed: the first is slow execution and the second is energy consumption (Noman, 2009). To address these two issues, a parallel stream cipher structure and a parallel Feistel network structure are proposed, which can be applied to stream ciphers and to any Feistel-based block cipher to make them parallel. The thesis presents three parallel algorithms based on RC4, RC4A and Blowfish. The thesis is organized into nine chapters.

Chapter 1 incorporates the major concerns of today’s computer centric

era, requirement of security techniques along with some historical

background of cryptographic algorithms, issues involved in sequential

security algorithms, and motivation for the research followed by literature

review.

In Chapter 2, the basic concepts of parallel programming are discussed, along with the techniques and experimental setup used for the research. These concepts are used throughout the thesis to parallelize the algorithms.

Chapter 3 briefly introduces stream ciphers and their types, the motivation for the design of the Parallel Additive Stream Cipher Structure (PASCS), and a description of the new architecture.

Chapter 4 describes the parallel Feistel network and the parallel F function.

Chapter 5 presents PARC4, the parallel approach for RC4 using PASCS, and the corresponding data parallel model, along with its experimental results and comparisons with the existing algorithm.

Chapter 6 introduces PARC4-1, a parallel implementation of RC4A (an RC4 variant) using PASCS and the loop unrolling method.

Chapter 7 describes PBlock, a parallel implementation of Blowfish using the parallel Feistel network, along with the results and comparisons.

Chapter 8 gives a thorough discussion of the energy measurements of all three parallel algorithms, along with a description of the low-power and high-power states of the processor used to find the power benefit. Chapter 9 presents the conclusions and future scope of the research.

Chapter 2

Basic Ideology Of Parallel

Computing, Tools And

Experimental Setup Used For

Research

Concurrent execution of tasks has been in use for many years, especially in high-performance computing, but interest in this area has grown recently because physical constraints now prevent further frequency scaling. Since the power consumption of computers has become a major concern in recent years, parallel computing has become the dominant paradigm in computer architecture. This chapter gives a brief introduction to each of the concepts used in this research, covering parallel programming along with its applications and fundamentals. The parallelization of the algorithms, described in detail in later chapters, builds on these concepts.

2.1 Introduction

Parallel computing is a form of computation in which many sub-processes are carried out concurrently on multiple cores. The concept is based on the principle that large problems can be divided into smaller ones, which are then solved simultaneously. Parallel computing provides different levels of parallelism: bit-level, instruction-level, data and task parallelism (Quinn, 1994, Almasi and Gottlieb, 1988).

Furthermore, parallel computers can be broadly categorized by the level at which the hardware supports concurrent execution of tasks: multi-processor and multi-core computers have several computing nodes within a single machine, whereas clusters and grids use a number of computers collectively to work on a shared task.

Parallel algorithms are more complex to write than sequential programs, since parallelism introduces several new classes of potential software bugs, of which synchronization errors, data locality problems and race conditions are the most common (Leighton, 1992). Inter-processor communication and synchronization among the different sub-processes are usually among the key obstacles to good performance.

Parallel computing has had a tremendous impact on a number of diverse areas, ranging from computational simulations for technical and engineering problems to commercial applications in transaction processing and data mining (Bo, 2009). The cost and energy benefits of parallelism, together with the performance requirements of applications, present a convincing argument in support of parallel computing. Although the scope of parallel computing is very wide, here its applications are divided into three common categories (Kumar et al., 1994):

1) Engineering and Design applications

Traditionally, parallel computing has been used in the design of airfoils, high-speed circuits, etc. Nowadays, the design of micro-electro-mechanical and nano-electro-mechanical systems has attracted significant attention.

2) Scientific applications

The past few years have seen a revolution in high-performance scientific computing applications. Advances in computational physics and chemistry have focused on understanding processes that range in scale from quantum phenomena to macromolecular structures. These have resulted in the design of new materials and more efficient processes. Bioinformatics and astrophysics are other areas that present many demanding problems involving the analysis of extremely large datasets.

3) Applications in computer systems

In this domain, computer security is the major concern, and within this area intrusion detection is a great challenge. For intrusion detection in a network, data is collected at distributed sites and must be analyzed rapidly to signal an intrusion. In the area of cryptography, some of the most impressive applications of Internet-based parallel computing have focused on factoring extremely large integers.

The chapter is organized as follows. Section 2.1 introduces parallel programming and its applications. Section 2.2 explains the concepts of parallel programming specifically used in this research, which serve as the baseline for parallelizing the security algorithms presented in the following chapters. This section covers the type of parallel computer used, how to identify parallel regions of the code, decomposition of the problem, mapping, load balancing and speedup calculation using Amdahl's law. The experimental setup used for the research is described in section 2.3. The tools used to parallelize the algorithms are briefly discussed in section 2.4.

2.2 Essential notions of parallel programming used in Research

There are some key concepts which serve as the basis of our

parallelization methodology. These concepts are:

1) Identifying portion of work that can be performed concurrently

2) Type of parallel computer

3) Decomposition technique and mapping method for load balancing.

2.2.1 Identifying Parallel Region in Code

There are many methods for distinguishing the parallel and strictly sequential regions of code. The most important one is to check the code for dependencies. If there is any type of data dependency in the code, it should be removed to make the code parallel. If, due to some constraint, the dependency cannot be removed, that portion of the code cannot be parallelized. The different types of dependencies are discussed next (Barney, 2010).

1) Flow dependency

Flow dependency is also known as read-after-write (RAW) dependency (Babb, 1984). It occurs when an instruction depends on the output of a previous instruction. For example:

1. A = 3

2. B = A

3. C = B

Here, the 3rd instruction depends on the 2nd instruction, because the final value of C depends on the instruction that updates B. The 2nd instruction depends on the 1st instruction, because the final value of B depends on the instruction that updates A. Since each instruction depends on the one before it, instruction-level parallelism is not an option in this example.

2) Anti-dependency

Anti-dependency is also known as write-after-read (WAR) dependency (Babb, 1984). It occurs when an instruction requires a value that is updated by a later instruction. In the following example, the 2nd instruction anti-depends on the 3rd instruction.

1. B = 4

2. A = B + 2

3. B = 8

The sequence of these instructions cannot be altered, nor can they be

implemented in parallel because it would affect the final outcome of A.

3) Output dependency

Output dependency is also known as write-after-write (WAW) dependency (Babb, 1984). It arises when the ordering of instructions affects the final value of a variable. In the example below, there is an output dependency between the 3rd instruction and the 1st instruction.

1. B = 5

2. A = B + 1

3. B = 10

Changing the execution order of these instructions would change the final value of B; therefore these statements cannot be executed in parallel.

4) Control Dependency

An instruction is said to be control dependent on a preceding instruction if the outcome of the preceding instruction determines whether it should be executed. In the example below, instruction I2 is control dependent on instruction I1. However, I3 is not control dependent on I1, as I3 is always executed regardless of the outcome of I1.

I1. If (x == y)

I2. x = x + y

I3. y = x + y
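The three data dependencies described in this section can be detected mechanically by comparing the variables that two instructions read and write. The sketch below is a simplified illustration, not a tool used in the thesis; the function name `classify` and the representation of an instruction as a pair of sets (variables written, variables read) are assumptions made for the example.

```python
def classify(first, second):
    """Return the data dependencies of `second` on the earlier `first`.

    Each instruction is given as (writes, reads), two sets of variable names.
    """
    writes1, reads1 = first
    writes2, reads2 = second
    found = set()
    if writes1 & reads2:
        found.add("RAW")   # flow dependency: second reads what first wrote
    if reads1 & writes2:
        found.add("WAR")   # anti-dependency: second overwrites what first read
    if writes1 & writes2:
        found.add("WAW")   # output dependency: both write the same variable
    return found


# A = 3 followed by B = A
flow = classify(({"A"}, set()), ({"B"}, {"A"}))
# A = B + 2 followed by B = 8
anti = classify(({"A"}, {"B"}), ({"B"}, set()))
# B = 5 followed by B = 10
output = classify(({"B"}, set()), ({"B"}, set()))
print(flow, anti, output)
```

Two instructions for which `classify` returns an empty set share no data dependency and are candidates for concurrent execution; control dependencies are not covered by this sketch.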

2.2.2 Type of Parallel Computer

There are different ways to classify parallel computers. According to Flynn's Taxonomy (Barney, 2010), multi-processor computer architectures are classified along two independent dimensions, Instruction Stream and Data Stream. Each of these dimensions can have only one of two possible states: Single or Multiple. This gives four possible classifications:

i. Single Instruction, Single Data (SISD):

A serial computer

Single Instruction: Only one instruction stream is executed by the CPU during any one clock cycle.

Single Data: Only one data stream is used as input during any one clock cycle.

Examples: mainframes, minicomputers and workstations.

ii. Single Instruction, Multiple Data (SIMD):

A form of parallel computer

Single Instruction: All processing elements execute the same instruction at any given clock cycle.

Multiple Data: Each processing element can operate on a different data element.

Examples: graphics processing units (GPUs) and the vector (SIMD) instruction units of AMD and Intel processors.

iii. Multiple Instructions, Single Data (MISD):

Multiple Instructions: Each core/processing unit operates on the data independently using a separate instruction stream.

Single Data: A single data stream, in the form of a sequence of bits/bytes, is fed to all processing units as input.

Some conceivable uses of this type of system are:

Multiple security algorithms attempting to break a single coded message.

Fault-tolerance purposes.

iv. Multiple Instructions, Multiple Data (MIMD):

Multiple Instructions: Each core/processing unit may execute a different instruction stream.

Multiple Data: Each core may work with a different data stream.

Currently, the most common type of parallel computer falls into this category.

Examples: supercomputers, multi-processor SMP computers and multi-core PCs.

2.2.3 Speed up calculation using Amdahl’s law

The speedup of a program using multiple cores or processors is limited by the time required to execute the sequential portion of the program. This is known as Amdahl's law (Hill and Marty, 2008). For example, if a program needs 10 hours on a single processor core, and the sequential part takes one hour while the remaining 9 hours (90%) can be executed in parallel, then irrespective of how many cores are dedicated to the parallel portion, the minimum time required to complete the whole task cannot be less than that one hour. Hence the speedup can be no more than 10. The following equation can be used to calculate speedup:

\[ \text{Speedup} = \frac{1}{S + \frac{P}{N}} \tag{2.1} \]

Where P = parallel fraction, N = number of processors and S = serial fraction.
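The effect of the serial fraction can be seen by evaluating the speedup for the 90% parallel example above with a growing number of processors. The snippet is a small illustrative sketch; the function name `amdahl_speedup` is an assumption, and the serial fraction is taken as S = 1 - P.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup = 1 / (S + P / N), with S = 1 - P."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


for n in (1, 2, 4, 8, 1_000_000):
    print(n, round(amdahl_speedup(0.9, n), 3))
```

For P = 0.9 the speedup with eight processors is about 4.7, and it approaches but never reaches the limit of 10 as N grows.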

2.2.4 Parallel computing memory architecture

The memory architecture of a parallel computer is based on either a shared memory or a distributed memory scheme.

Shared Memory: In parallel computers based on this type of memory architecture, all processors access a common memory as a global address space. Multiple cores or processors can work independently but share the same memory resources. If one processor changes a memory location, the change is visible to all other processors. Shared memory systems are further categorized as UMA and NUMA, based on memory access times.

1) Uniform Memory Access (UMA): Computer architectures in which every portion of main memory can be accessed with equal bandwidth and latency are known as Uniform Memory Access (UMA) systems. Nowadays, these architectures are typically represented by Symmetric Multiprocessor (SMP) machines with identical processors and equal access times to memory. UMA systems are also known as CC-UMA, that is, Cache Coherent UMA: if one processor updates a location in shared memory, all other processors know about the update. Cache coherency is implemented at the hardware level.

2) Non-Uniform Memory Access (NUMA): A NUMA system is a combination of two or more SMPs, each of which can directly access the memory of the others. The access time is not the same for all processors. If cache coherency is maintained, the system is called CC-NUMA, that is, Cache Coherent NUMA.

Distributed Memory: In this architecture the system requires a communication network to connect inter-processor memory. Each processor has its own private or local memory, so changes made by one processor to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply. When a processor needs to access data held by another processor, the programmer has to define the data communication explicitly. In this thesis, the shared memory architecture is used, as it is the architecture commonly used by SMPs.

2.2.5 Parallel Programming Models

Parallel programming models exist as an abstraction above hardware and memory architectures. These models are not specific to a particular type of memory or machine architecture. Essentially, the choice of programming model for a parallel implementation is based on the structure of the algorithm.

There are some commonly used parallel programming models:

1) Shared Memory Model

2) Distributed Memory / Message Passing Model

3) Data Parallel Model

4) Hybrid Model

5) Single Program Multiple Data (SPMD)

6) Multiple Program Multiple Data (MPMD)

In this thesis, Data Parallel and Multiple Program Multiple Data Models

are considered to implement security algorithms in parallel because these

two models are well suited for data oriented algorithms.

Data Parallel Model: It is also referred to as the Partitioned Global Address Space (PGAS) model. In this model, the global address space is common to all cores or processors. Most of the parallel work in this model focuses on performing operations on a data set, which is typically organized into a common structure such as an array or cube. A set of tasks works collectively on the same data structure, but each task works on a different partition of it. On shared memory systems, all tasks can access the data structure through global or common memory. On distributed memory systems, the data structure is divided and resides as small portions in the local memory of each task.

Fig. 2.1 Pictorial Representation of Data Parallel Model (Barney, 2010)
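The data parallel model can be illustrated with the operation that matters most for stream ciphers: XORing a message with a key stream. In the sketch below every task applies the same operation to a different slice of the same array. It is an illustration only and rests on several assumptions: the thesis implementation uses C with OpenMP rather than Python threads, the names `xor_chunk` and `parallel_xor` are hypothetical, the key stream is assumed to be at least as long as the data, and in CPython the threads show the structure of the model rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def xor_chunk(args):
    """One task: XOR the slice [start, stop) of the shared data."""
    data, keystream, start, stop = args
    return bytes(data[k] ^ keystream[k] for k in range(start, stop))


def parallel_xor(data: bytes, keystream: bytes, workers: int = 4) -> bytes:
    n = len(data)
    # Each worker receives a contiguous partition of the index range.
    bounds = [(n * w // workers, n * (w + 1) // workers) for w in range(workers)]
    tasks = [(data, keystream, lo, hi) for lo, hi in bounds]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(xor_chunk, tasks))   # map preserves task order
    return b"".join(parts)


message = bytes(range(100))
stream = bytes((7 * k + 3) % 256 for k in range(100))
print(parallel_xor(message, stream) == bytes(m ^ s for m, s in zip(message, stream)))
```

Because no slice depends on another, the partitions can be processed in any order and joined afterwards.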

Multiple Program Multiple Data (MPMD): Like SPMD, the MPMD model is a high-level programming model that can be built on any combination of the previously mentioned parallel programming models. In the MPMD model:

Multiple programs: Tasks may execute different programs concurrently. The programs can be threads, message passing, data parallel or hybrid.

Multiple data: All tasks may use different data.

Fig. 2.2 Pictorial Representation of MPMD (Barney, 2010)

MPMD applications are more appropriate for problems where functional decomposition is used instead of domain decomposition.

2.2.6 Partitioning

After identifying the type of parallel computer, the next step is the decomposition of the problem. For large data sets, it is important to decide how to decompose the data so that an algorithm can execute in parallel. The optimization objective of decomposition is to balance the work-load among processing units and to minimize inter-process communication. The number of data partitions generated by the partitioning step may not be equal to the number of processing units/cores, so a core may be idle or loaded with multiple processes. There are two techniques for decomposition: domain decomposition and functional decomposition.

Domain Decomposition: In this type of partitioning, the data associated

with an algorithm is decomposed. Following figure demonstrates it.

Fig. 2.3 Representation of Domain Decomposition (Barney, 2010)

Functional Decomposition: In functional partitioning, the computations involved in executing an algorithm, rather than the data, are decomposed among multiple cores.

Fig. 2.4 Pictorial Demonstration of Functional Decomposition (Barney, 2010)

2.2.7 Synchronization

Synchronization is one of the crucial problems in developing shared-memory parallel software. At the user level, shared resources generally take the form of shared variables, whereas at the machine level they may be registers, memory locations, status flags, etc. To increase the efficiency of parallel software, source languages should offer high-level constructs for synchronization to ease parallel programming, and compilers must provide correct and efficient implementations of such constructs.

Broadly, there are two types of synchronization:

Barrier

Usually involves all the sub-tasks that cooperate to accomplish a larger one. Each task performs its work until it reaches the barrier and then waits. When the last task reaches the barrier, all tasks are synchronized.

Lock / semaphore

Can involve any number of tasks. It is typically used to protect access to global data or a section of code. Only one task at a time may hold the lock / semaphore / flag.
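Both kinds of synchronization can be demonstrated with Python's standard `threading` module. The sketch below is illustrative only (the thesis itself relies on OpenMP constructs); the function `run` and its parameters are hypothetical names chosen for the example.

```python
import threading


def run(n_tasks: int = 4, increments: int = 1000):
    partial = [0] * n_tasks      # one slot per task, written in phase 1
    totals = [0] * n_tasks       # what each task sees after the barrier
    counter = [0]                # shared data protected by the lock
    barrier = threading.Barrier(n_tasks)
    lock = threading.Lock()

    def task(tid: int) -> None:
        partial[tid] = (tid + 1) * 10   # phase 1: write only this task's slot
        barrier.wait()                  # block until every task has arrived
        totals[tid] = sum(partial)      # phase 2: all slots are now filled
        for _ in range(increments):
            with lock:                  # one task at a time in this section
                counter[0] += 1

    threads = [threading.Thread(target=task, args=(t,)) for t in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals, counter[0]


totals, count = run()
print(totals, count)
```

The barrier guarantees that every task reads a complete `partial` array, and the lock guarantees that no increment of the shared counter is lost.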

2.2.8 Mapping for Load Balancing

After decomposing the data, the next step is load balancing. Load balancing refers to distributing approximately equal amounts of work among cores so that all cores/processing units are kept busy all of the time. The primary optimization objective of mapping is to balance the task load of the processing units/cores and to minimize the cost of inter-processor communication (IPC). In general, load balancing requires decomposition and mapping algorithms that achieve these objectives. Load balancing techniques can be broadly classified into two categories: static and dynamic. Static load balancing techniques assign processes to processors at compile time; this type of mapping is used when the data set is known in advance. Dynamic techniques bind processes to processors at run time; this approach is used for unknown data sets.
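A static mapping for a data set of known size can be as simple as a block partition in which no worker receives more than one item beyond any other. The helper below is an illustrative sketch; the name `block_partition` is an assumption and is not taken from the thesis.

```python
def block_partition(n_items: int, n_workers: int) -> list[range]:
    """Split indices 0..n_items-1 into contiguous, nearly equal blocks."""
    base, extra = divmod(n_items, n_workers)
    blocks = []
    start = 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)   # first `extra` workers get one more
        blocks.append(range(start, start + size))
        start += size
    return blocks


for worker, block in enumerate(block_partition(10, 4)):
    print(worker, list(block))
```

A dynamic scheme would instead let idle workers pull the next piece of work from a shared queue at run time, which costs more coordination but adapts to work of unknown or uneven size.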

2.2.9 Granularity

In parallel computing, granularity is a measure of the ratio of computation to communication. Phases of computation are separated from phases of communication by synchronization events. Granularity can be classified into two categories:

Fine-grain Parallelism:

• A comparatively small amount of computational work is done between communication events

• Low ratio of computation to communication

• Eases load balancing

• Implies high communication overhead and less opportunity for performance improvement

• If the granularity is too fine, the overhead required for communication and synchronization between tasks may take longer than the actual computation

Coarse-grain Parallelism:

• A large amount of computational work is done between communication and synchronization events

• High ratio of computation to communication

• Implies more opportunity for performance improvement

• Load balancing is harder to achieve

2.3 Experimental setup

A multi-core processor is a single computing component with two or more independent processing cores (Geer, 2005). These cores can read and execute tasks and instructions concurrently, increasing the overall speed of programs that are amenable to parallel computing. This architecture can also enhance performance and reduce power consumption, which in turn contributes to reducing greenhouse effects. Figure 2.5 shows the architecture of a multi-core processor.

For this research, the machine setup is done with the following

configuration:

Processor - AMD FX(tm) - 8320 , eight core processor running @ 3500

MHz

RAM - 8 GB

System Type - 64 bit operating system,x64- based processor

Operating System - Linux/Ubuntu 12.04 version

OpenMP - 4.0

Compiler- GCC

Programming Language - C with OpenMP

Fig. 2.5 Multi-Core Processor Architecture

2.4 Tools used

Several steps are involved in the parallelization of an algorithm. The first and foremost condition is that the algorithm must be designed in a way that supports parallelism. Before executing an algorithm in parallel, the following points must be considered:

•Which functions are time consuming in the sequential implementation?

•Is there any type of data dependency?

These questions can be answered with the help of a profiler tool. A code profiler reports which functions in the algorithm are compute intensive, which functions call other functions and how many times, and helps to locate the loops in which dependencies need to be examined. For this research, GNU's Gprof is used to profile the algorithm.

[Fig. 2.5 diagram labels: Core-1 to Core-4, each with private memory; shared memory; bus interface; chip boundary; off-chip components]

2.4.1 Gprof

Gprof is a profiler that collects and arranges statistics on programs (Graham et al., 2004, Fenlason and Stallman, 1988). When a program compiled with profiling enabled is run, it writes a “gmon.out” data file, which Gprof then analyzes to report details such as which functions are executed the greatest number of times. It provides many options for obtaining details about the program.

2.4.2 OpenMP

OpenMP is an application programming interface (API) for writing parallel programs based on a shared memory model (Jin et al., 1999). OpenMP is used in conjunction with C/C++ and FORTRAN (Dagum and Menon, 1998, Chandra, 2001). It gives programmers a manageable model for developing portable and scalable parallel algorithms. It comprises three components: environment variables, compiler directives and runtime library routines. These constructs extend a sequential programming language with single program multiple data (SPMD), synchronization and work-sharing constructs. OpenMP uses the fork-join model for parallel programs: a single master thread starts execution and, when it encounters a parallel region, it forks a team of threads and distributes the work among them according to the constructs and data in the region. At the end of the parallel region, the other threads terminate after completing their respective tasks and only the master thread continues execution until the next parallel region is encountered.

2.4.3 MinGW

MinGW (Minimalist GNU for Windows) is a free and open-source development environment for building native Microsoft Windows applications (Peters et al., 2010, Team, 2008). It can also be cross-hosted on GNU/Linux platforms. It consists of a port of the GNU Compiler Collection (GCC), GNU Binutils for Windows, a set of freely distributable Windows-specific header files and libraries that allow the use of the Windows API, and various utilities. MinGW supports almost all languages supported by GCC, including C, C++, Objective-C, Objective-C++, FORTRAN and Ada.

2.4.4 CodeBlocks

Code Blocks is a cross-platform IDE that supports compiling and running

multiple programming languages.

2.4.5 Joulemeter

Joulemeter is a project that develops methods to improve the energy efficiency of computing devices and infrastructure. It is a modeling tool for measuring the energy consumption of desktops, laptops, servers, virtual machines and even specific software applications running on a computer. The visibility provided by this tool is used to reduce energy consumption costs for data centers and to support desktop energy optimization and mobile battery management.

Conclusion

Programs are generally written using a sequential execution model, in which instructions execute one after another. To benefit from multi-core machine architectures, which can provide faster execution than sequential ones, the algorithm or program must be designed using parallel computing principles. Profiler tools can then be used to optimize the code. For implementation, different manual as well as directive-based approaches can be used. In this thesis, OpenMP is used to implement the programs in parallel.

Chapter 3

Design Of Parallel Additive

Stream Cipher Structure

Stream ciphers encrypt the individual characters of a plaintext one at a time. Various design methodologies for stream ciphers have been proposed and comprehensively studied. Linear feedback shift registers (LFSRs) are commonly used in key stream generators, but they are best suited to hardware implementations. For software implementations, additive or binary additive stream cipher structures are usually used. This chapter discusses stream ciphers and focuses on the design of the Parallel Additive Stream Cipher Structure (PASCS), which can be used to develop parallel stream cipher algorithms based on the synchronous additive stream cipher structure.

3.1 Introduction

Stream ciphers encrypt the individual characters of a plaintext one at a time, using an encryption transformation that varies with time (Mao, 2003). In hardware implementations, stream ciphers are usually faster than block ciphers and have simpler circuitry (Mao, 2003). In some cases the use of stream ciphers is mandatory. For example, in telecommunications applications where buffering is limited or where characters must be processed individually as they are received, only stream ciphers can be used. Stream ciphers can be further classified as:

a. One-time pad cipher

The encryption process for a binary alphabet using the Vernam cipher
(Salomon, 2003) is defined by:

C_i = M_i XOR K_i for i = 1, 2, 3, …, n,

where M_1, M_2, …, M_n are the plaintext digits, K_1, K_2, …, K_n are the key
stream bits and C_1, C_2, …, C_n are the cipher text digits. Decryption is the
inverse of the above equation and is defined by M_i = C_i XOR K_i. If the key
stream digits are generated independently and at random, the cipher is
called a one-time pad cipher.
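The two equations above can be illustrated with a minimal C sketch. The function name vernam_xor and the byte-wise (rather than bit-wise) processing are assumptions made for illustration only; for a one-time pad, the key stream bytes would additionally have to be generated independently and at random.

```c
#include <assert.h>
#include <stddef.h>

/* XOR each input byte with the matching key stream byte: C_i = M_i XOR K_i.
 * Calling the same function on the cipher text with the same key stream
 * restores the message, because (M_i XOR K_i) XOR K_i = M_i. */
static void vernam_xor(const unsigned char *in, const unsigned char *key,
                       unsigned char *out, size_t n)
{
    assert(in != NULL && key != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] ^ key[i];
    }
}
```

Calling vernam_xor(m, k, c, n) encrypts, and calling vernam_xor(c, k, d, n) with the same key stream k recovers the original message in d.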

b. Synchronous stream cipher

In this stream cipher, the key stream is generated independently of the
original message and of the cipher text (Fontaine, 2011). The complete
encryption process can be described by the following equations:

StateSpace_{i+1} = NextStateFunction(StateSpace_i, key),
Keystream_i = CalculateKeyStream(StateSpace_i, key),
Cipher_i = GenerateCipherText(Keystream_i, Message_i),

where StateSpace_0 is the initial state and can be determined from the
key, CalculateKeyStream is the function that produces the key stream
Keystream_i, and GenerateCipherText is the output function, which takes
Keystream_i and Message_i as input and produces Cipher_i.

Additive stream ciphers

The most commonly used synchronous stream ciphers are additive stream
ciphers (Cusick et al., 2004). In these ciphers, each key stream digit is
XORed with the corresponding plaintext digit. Decryption reverses the
process by XORing the cipher text with the same key stream.
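The three equations of the synchronous structure, combined with the additive (XOR) output function, can be sketched as follows. The next-state and key stream functions here are hypothetical toy functions (a linear congruential step) chosen only to keep the example short; they are not cryptographically secure and are not taken from any cipher discussed in this thesis.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy NextStateFunction (a linear congruential step). Hypothetical and
 * NOT secure; it only stands in for a real next-state function. */
static uint32_t next_state(uint32_t state, uint32_t key)
{
    return state * 1664525u + 1013904223u + key;
}

/* Toy CalculateKeyStream: take the top byte of the current state. */
static unsigned char key_stream_byte(uint32_t state)
{
    return (unsigned char)(state >> 24);
}

/* Additive output function applied to a whole buffer. The key stream
 * depends only on the key and the position, never on the message, so the
 * same call both encrypts and decrypts. */
static void sync_cipher(const unsigned char *in, unsigned char *out,
                        size_t n, uint32_t key)
{
    assert(in != NULL && out != NULL);
    uint32_t state = key ^ 0x9E3779B9u; /* StateSpace_0 derived from the key */
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] ^ key_stream_byte(state);
        state = next_state(state, key);
    }
}
```

Because the state never depends on the data, sender and receiver stay synchronized only as long as no digit is inserted or lost in transit.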

c. Asynchronous stream cipher

These ciphers are also known as self-synchronizing stream ciphers. In this
type of cipher, the key stream is generated as a function of the key and a
fixed number of preceding cipher-text digits (Robshaw, 1995). The
encryption function can be described by the following equations:

StateSpace_i = (c_{i-t}, c_{i-t+1}, …, c_{i-1}),
Keystream_i = CalculateKeyStream(StateSpace_i, key),
Cipher_i = GenerateCipherText(Keystream_i, Message_i),

where StateSpace_i = (c_{i-t}, c_{i-t+1}, …, c_{i-1}) is the state formed
from the t preceding cipher-text digits, CalculateKeyStream computes the
key stream digits from StateSpace_i and the key, and GenerateCipherText
computes the cipher text.
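A minimal sketch of the self-synchronizing structure is given below. The window length T_WINDOW, the all-zero initial window and the mixing function are assumptions made for illustration; the mixing function is a toy and is not secure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define T_WINDOW 4 /* t: number of preceding cipher-text bytes in the state */

/* Toy CalculateKeyStream: mixes the key with the last T_WINDOW cipher-text
 * bytes. Hypothetical and NOT secure; for illustration only. */
static unsigned char self_sync_key_byte(const unsigned char window[T_WINDOW],
                                        uint32_t key)
{
    uint32_t h = key;
    for (int k = 0; k < T_WINDOW; k++) {
        h = (h ^ window[k]) * 16777619u;
    }
    return (unsigned char)(h >> 24);
}

/* decrypt = 0 encrypts, decrypt = 1 decrypts. In both directions the state
 * is refilled with the cipher-text byte, never with the plaintext byte. */
static void self_sync_cipher(const unsigned char *in, unsigned char *out,
                             size_t n, uint32_t key, int decrypt)
{
    assert(in != NULL && out != NULL);
    unsigned char window[T_WINDOW] = {0}; /* assumed agreed initial state */
    for (size_t i = 0; i < n; i++) {
        unsigned char x = in[i];
        unsigned char y = (unsigned char)(x ^ self_sync_key_byte(window, key));
        unsigned char c = decrypt ? x : y; /* the cipher-text byte */
        out[i] = y;
        /* shift the window and append the newest cipher-text byte */
        for (int k = 0; k < T_WINDOW - 1; k++) {
            window[k] = window[k + 1];
        }
        window[T_WINDOW - 1] = c;
    }
}
```

Since the state is rebuilt from received cipher text, a corrupted cipher-text byte affects only its own position and the next T_WINDOW positions, after which decryption is correct again.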

The rest of the chapter is organized as follows: Section 3.2 states the
motivation for a parallel architecture. The design of the parallel
architecture is presented in Section 3.3, followed by the conclusions
drawn from the design.

3.2 Motivation For Parallel Architecture

In stream ciphers, the whole encryption process works bit by bit in a
sequential manner. This mechanism does not make use of parallel computing
or of the multi-core processors that are commonly available today. If
multiple bits can be processed at the same time to produce multiple cipher
text bits, the encryption process can be much faster than the traditional
one, in which only a single bit is encrypted at a time. Many software
applications encrypt data with stream ciphers based on the synchronous
additive stream cipher structure discussed earlier. This structure is
sequential in nature. A parallel framework for it will help to process many
bits concurrently and make the application faster.

3.3 Design of PASCS

In this framework, a key is supplied to the key stream generator, which
produces a random key stream. The plaintext is divided into fixed-size
blocks. The random key stream is supplied to each individual block so that
the plaintext blocks can be processed concurrently. The following figure
illustrates the complete process used in this framework.


Fig 3.1 Design of Parallel Additive Stream Cipher Structure

As shown in Fig. 3.1, there are n fixed-size data blocks. A random key
stream of the same length is supplied to each block, and each bit of the
plaintext block is XORed with the corresponding key bit to produce the
cipher text bit. This parallel structure can be used by any synchronous
stream cipher. The block size depends on the structure of the algorithm.
PASCS is based on the concept of the Vernam cipher, in which every bit of
plaintext has its own key bit. To preserve this property and maintain
randomness, each block should have a different key stream. Hence, the
architecture of a stream cipher algorithm must be modified before the
algorithm can be implemented on PASCS. In this thesis, PASCS is applied to
the RC4 and RC4A algorithms to analyze the impact of parallelization on
the speed of the cipher. RC4 and RC4A are discussed in later chapters.
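The block-level structure of PASCS can be sketched in C with an OpenMP work-sharing loop. This is a minimal sketch under stated assumptions: the block length of 256 bytes and the function names are chosen for illustration, and block_key_byte is a hypothetical placeholder that merely returns a different value for each block and offset; a real design would take these bytes from the key stream generator of the cipher.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 256 /* assumed block length in bytes */

/* Hypothetical per-block key stream: a placeholder that differs from block
 * to block. It is NOT a secure key stream generator. */
static unsigned char block_key_byte(size_t block, size_t offset)
{
    return (unsigned char)((block * 131u + offset * 17u + 7u) & 0xFFu);
}

/* XOR every block with its own key stream. Blocks are independent, so the
 * loop over blocks can be shared among threads. When the code is compiled
 * without OpenMP support, the loop simply runs sequentially. */
static void pascs_xor(const unsigned char *in, unsigned char *out,
                      size_t n_blocks)
{
    assert(in != NULL && out != NULL);
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long b = 0; b < (long)n_blocks; b++) {
        size_t start = (size_t)b * BLOCK_SIZE;
        for (size_t k = 0; k < BLOCK_SIZE; k++) {
            out[start + k] = in[start + k] ^ block_key_byte((size_t)b, k);
        }
    }
}
```

No thread writes to a location that another thread reads or writes, which is why the outer loop needs no locking.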


Conclusion

Linear feedback shift registers are used extensively today, but mainly in
hardware implementations. For software applications, binary additive
structures need to be redesigned to enable parallelism so that application
speed can be improved. PASCS is an effort in this direction. It is a
parallel framework in which multiple blocks of data, in the form of bits or
bytes, can be encrypted or decrypted concurrently to improve performance.


Chapter 4

PARC4: Parallel approach for

RC4 using PASCS


This chapter introduces a parallel stream cipher, PARC4, which is used to
encrypt large sets of data. The implementation of the algorithm is based on
the PASCS framework discussed in Chapter 3. This chapter focuses on the
methodology used to add parallelism and on the model used to map the
algorithm onto the PASCS architecture to gain performance benefits. Various
metrics used to measure the performance of the developed parallel algorithm
are also discussed.

4.1 Introduction

Different encryption algorithms are used to ensure the security of
confidential data. As discussed earlier, encryption algorithms are of two
types: symmetric and asymmetric. RC4, developed by Ron Rivest (Rivest), is
a very popular symmetric stream cipher. It secures the message by
processing it one byte at a time. Although it is faster than other
symmetric ciphers such as DES and 3DES (Elminaam et al., 2010), the
algorithm does not take advantage of today’s multiprocessing environments,
which support symmetric multicore programming. Moreover, if the performance
of an algorithm can be improved, the algorithm can be made more
energy-efficient at its original performance level; this is discussed later
in this thesis. To utilize all processing cores effectively, the structure
of the algorithm must support parallelism. Parallelism improves the speed
of both encryption and decryption, resulting in an overall speedup of the
application, which is important for security algorithms used within
software applications. Any compute-intensive algorithm, and security
algorithms are compute-intensive, can slow an application down and reduce
its effectiveness. With the adoption of Symmetric Multiprocessors (SMPs),
it has become possible to parallelize complex computational algorithms and
make them run faster (Keckler et al., 2009), (Chandra, 2001).

The rest of the chapter is organized as follows: Section 4.2 discusses RC4
and the extent to which it supports parallelism. Section 4.3 identifies the
parallelism in the algorithm and presents the parallel techniques used in
the implementation. Section 4.4 presents a security analysis to verify that
the modified algorithm is as secure as the original one. The results on a
large set of data files, along with speedup calculations, are discussed in
Section 4.5. In Section 4.6, the performance of the proposed algorithm is
measured using various metrics. PARC4 is compared with an existing
multithreaded approach in Section 4.7, followed by the conclusions from the
work.

4.2 Detection of Parallelism

As discussed in Chapter 3, the PASCS framework is applied to the RC4
algorithm to parallelize it. RC4 has two sub-algorithms: the KSA
(key-scheduling algorithm), which initializes the state array S from the
key, and the PRGA (pseudo-random generation algorithm), which generates the
key stream used for encryption and decryption. The KSA (Fluhrer et al.,
2001) performs a fixed number of iterations, whereas the number of times
the PRGA is invoked depends on the length of the input data. PASCS is
therefore used to parallelize the PRGA. However, the PRGA is based on the
exchange shuffle model, which is inherently sequential.

As shown in Fig. 4.1, the values of the S array change after each swap. The
procedure is repeated n times, where n is the size of the message/plaintext
in bytes. As a result, a functional decomposition of the algorithm with its
existing structure is not feasible.


Fig 4:1 Swapping between S[i] and S[j]

4.3 Method for Adding Parallelism

The input to the PASCS framework must be supplied as individual blocks of
fixed length. The input data is therefore first divided into fixed-size
blocks. Each block is then encrypted simultaneously using the same steps,
and finally the outputs of the blocks are concatenated to form the complete
cipher text. These operations are carried out in parallel by multiple cores
to improve performance. The next objective is to decide the block size. For
this, consider the following code snippet of the PRGA:

// State shared by successive calls; s is initialized by the KSA
unsigned char s[256];
int i = 0, j = 0;

// Output function: returns the next key stream byte, used for both
// encryption and decryption
unsigned char rc4_output()
{
    i = (i + 1) % 256;
    j = (j + s[i]) % 256;
    Swap(s, i, j);
    return s[(s[i] + s[j]) % 256];
}

// f_size is the size of the input.
// The output function is called once per input element, f_size times in total.
for (int x = 0; x < f_size; x++)
{
    enblock[x] = (memblock[x] ^ rc4_output());
}

[Fig 4:1 content: cells 1 to 8 of the S array with indices i = 0 to 7; if i = 2 and j = 7, the values at S[i] and S[j] interchange.]

The index i starts from 1, and the output function is called once for each
element of the input. At the 257th iteration, i repeats from 1 onwards,
since i = (256+1) mod 256 = 1, (257+1) mod 256 = 2 and so on, and this
repetition occurs at every multiple of 256. Thus, after every 256
iterations, i again starts from 1 when calculating the value of j, which
determines the swap that takes place in the S array. Moreover, the array
used to generate the key stream has a length of only 256. Hence, it makes
sense for the block size of PARC4 to be 256 bytes. If the length of the
input data is not a multiple of 256, the last block is padded with zeros to
make it 256 bytes long, and the padding is discarded during decryption. For
512 bytes of data, the values of i and j at each iteration are given in
Appendix-C for the reader’s reference.
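The block arithmetic described above can be written down as three small helper functions. The function names are hypothetical and the unit of 256 is taken to be bytes, as in the PRGA snippet; this is a sketch of the bookkeeping only, not the thesis implementation.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 256

/* Number of 256-byte blocks needed for an input of f_size bytes. */
static size_t block_count(size_t f_size)
{
    return (f_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Number of zero bytes appended to the last block. */
static size_t padding_length(size_t f_size)
{
    return block_count(f_size) * BLOCK_SIZE - f_size;
}

/* Value of the PRGA index i after `calls` calls to the output function,
 * starting from i = 0 and applying i = (i + 1) % 256 on every call. */
static unsigned prga_index_after(size_t calls)
{
    return (unsigned)(calls % 256);
}
```

For example, a 600-byte input needs three blocks with 168 bytes of padding, and the index i takes the value 1 at both the 1st and the 257th call.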

The next objective is to supply a random key stream to each block. For
this, the block id is added to each generated key stream value and the
result is reduced modulo 256 so that it stays within the range 0 to 255.
Because the block id differs from block to block, the key stream generated
for each block also differs. Algorithm 4.1 presents the complete procedure
for running PARC4 on SMPs, whereas Figure 4.2 is a pictorial representation
of the key stream generated for the parallel structure.


Algorithm 4-1: Steps to Implement PARC4

In Algorithm 4.1, the plaintext is declared as a shared variable because
every core needs to access this data in small chunks, and the block size
must be known to each core. It is declared as a shared variable in the
OpenMP looping construct. Furthermore, each core works on its own set of
data, which should be declared private to that core. Lines 1 and 7 delimit
the parallel region. Line 2 specifies that each block involves a fixed
number of iterations and that the blocks can be processed in a synchronized
manner, and Line 3 assigns the range of each block. Line 4 specifies the
loop, which executes a total of 256 iterations for each block sequentially.
To synchronize the work done by multiple cores, synchronization constructs
are used in the OpenMP implementation.

Procedure: Encryption
Model: Data Parallel Model with P processors [P = 2, 4, 6, 8]
Input: Plaintext in the form of small chunks [Chunk Size = 256], n = number of blocks
Output: Encrypted text
Declare: Plaintext and BlockID as shared variables, i as a private variable of each processing element

1. ParBegin
2.   For ALL BlockID: [0, n] IN SYNC
3.     Set Start = BlockID*256 and End = Start+256
4.     For i = Start to End-1 do
5.       Output = ((keystream_bit + BlockID) Mod 256) XOR plaintext_bit
6.     End for
7. ParEnd

Fig 4:2: Part I depicts sequential key generation, whereas Part II presents
the formation of the key stream for the parallel framework

The parallel implementation using PASCS requires a random key stream for
each block. Therefore, as shown in Fig. 4.2 II, the key generation process
takes the block index as an additional variable. Each block has a different
index, which is added to the key byte, and the final result is obtained by
applying the mod 256 operation:

(KeyByte + BlockID) Mod 256

Figure 4.3 represents the complete process of encryption and decryption in
PARC4. The input data is divided into fixed-size blocks of 256 bytes. Each
block is then encrypted using the modified key stream. For the reader’s
reference, the parallel implementation of PARC4 using OpenMP is given in
Appendix-C.
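The key stream modification can be sketched as follows. This is not the implementation from the appendix but an illustrative sketch under stated assumptions: a base key stream of 256 bytes is generated once with the standard RC4 KSA and PRGA, every block reuses that base key stream offset by its block index, and the input length is a whole number of blocks. The function names and the schedule(dynamic) clause are likewise assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 256

/* Standard RC4: key scheduling (KSA) followed by BLOCK_SIZE steps of the
 * PRGA, writing the key stream bytes into ks. */
static void rc4_base_keystream(const unsigned char *key, size_t key_len,
                               unsigned char ks[BLOCK_SIZE])
{
    unsigned char s[256];
    unsigned i = 0, j = 0;
    assert(key != NULL && key_len > 0);
    for (unsigned k = 0; k < 256; k++) {
        s[k] = (unsigned char)k;
    }
    for (unsigned k = 0; k < 256; k++) { /* KSA */
        j = (j + s[k] + key[k % key_len]) % 256;
        unsigned char t = s[k]; s[k] = s[j]; s[j] = t;
    }
    j = 0;
    for (size_t n = 0; n < BLOCK_SIZE; n++) { /* PRGA */
        i = (i + 1) % 256;
        j = (j + s[i]) % 256;
        unsigned char t = s[i]; s[i] = s[j]; s[j] = t;
        ks[n] = s[(s[i] + s[j]) % 256];
    }
}

/* Encrypts or decrypts n_blocks full blocks (XOR is its own inverse).
 * Assumption: each block uses (KeyByte + BlockID) Mod 256 on the same
 * base key stream. */
static void parc4_xor(const unsigned char *in, unsigned char *out,
                      size_t n_blocks, const unsigned char ks[BLOCK_SIZE])
{
    assert(in != NULL && out != NULL && ks != NULL);
#ifdef _OPENMP
#pragma omp parallel for schedule(dynamic)
#endif
    for (long b = 0; b < (long)n_blocks; b++) {
        size_t start = (size_t)b * BLOCK_SIZE;
        for (size_t k = 0; k < BLOCK_SIZE; k++) {
            unsigned char kb = (unsigned char)((ks[k] + (size_t)b) % 256);
            out[start + k] = in[start + k] ^ kb;
        }
    }
}
```

With block index 0 the modified key byte equals the base key byte, so the first block is encrypted exactly as in sequential RC4; the later blocks see shifted key bytes.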


Fig 4:3 Graphical Representation of Complete Flow and Model Used to

Parallelize RC4


4.3.1 Parallelization techniques

The PARC4 algorithm has been developed using the following parallelization
techniques:

1) Data decomposition: Data decomposition is a commonly used technique
for deriving concurrency in algorithms that operate on large data sets
(Kumar et al., 1994). In this implementation, the “input data partitioning”
method is used to decompose the input data. Figure 4.4 represents this
partitioning:

Fig 4:4 Pictorial Representation of Input Data Decomposition Technique

2) Mapping technique for load balancing: After the data has been
decomposed, the next step is to map each chunk of data onto a thread (each
thread runs on a processor) so that the whole task completes in parallel
and faster. As discussed in Chapter 2, dynamic mapping is used when the
data set is not known in advance. Dynamic mapping can be further classified
into two categories: centralized and distributed. In centralized dynamic
mapping, all executable tasks are maintained in a common task pool. A
master thread initiates the task, and each processor then independently
takes a portion of the work from the pool. In distributed dynamic mapping,
the executable tasks are distributed among processes, which exchange tasks
at run time to balance the work; each process can send work to or receive
work from any other process. In PARC4, the centralized dynamic mapping
scheme is applied so that the load is balanced evenly among the processing
elements.

3) Data Parallel Model: In this model, the complete data set is arranged into
a shared structure such as an array. The same set of instructions is
applied across the shared data structure, with each instance working on a
different portion of it.

4.4 Security Analysis

Because PARC4 uses a modified key stream, it must be verified that the
stream of bits supplied to the different blocks is unique and random. The
most popular measure of randomness in data is entropy. It was introduced by
Claude E. Shannon in 1948 and is also known as Shannon’s entropy, to
differentiate it from the entropy that appears in various forms in
different parts of physics (Bekenstein, 1973).

4.4.1 Shannon’s entropy

Shannon’s entropy is an important metric in information theory (Shannon,
1951). It measures the uncertainty associated with a random variable
(Rényi, 1961). Various tools are available to measure the entropy of random
numbers (Walker and ENT, 1998). For this research, PARC4 is tested for
randomness using Shannon’s entropy formula, which is based on the
probability distribution (Shannon, 1949, Shannon, 2001). It is calculated
using the formula:

H(X) = -\sum_{i} P_i \log_2 P_i    (4.1)

In Eq. 4.1, P_i is the probability of a given value. The logarithm to base
2 is used because the information is in binary form, so H(X) gives the
minimum number of bits per symbol required to encode the information.
Additionally, the metric entropy is calculated by dividing the entropy
value by the string length. It indicates the randomness of the data in the
message. The metric entropy takes values from 0 to 1, where 1 means equally
distributed random values. To measure entropy and metric entropy for RC4
and PARC4, two text files containing random key stream bytes in decimal
form were generated. The entropy was then calculated for both algorithms
using an online Shannon’s entropy tool (Shannon). The results show that
H(X) = 3.72 for PARC4 and H(X) = 3.14 for RC4, that is, roughly 4 bits per
symbol for PARC4 and 3 bits per symbol for RC4 are required to encode the
data values optimally. Similarly, the metric entropy is 0.0071 for PARC4
and 0.0052 for RC4. From these values, it is observed that the entropy
values of the two algorithms are close, which indicates that the changes
made in PARC4 to generate a random key stream for each block have not
weakened the security of the existing cipher. For the reader’s reference,
the key bytes generated using RC4 and PARC4 are given in Appendix-A and
Appendix-B.
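As a worked illustration of Eq. 4.1, consider two sources whose probabilities are assumed purely for the sake of the example and are not data from the experiments: a four-symbol source with probabilities 1/2, 1/4, 1/8 and 1/8, and a source in which all 256 byte values are equally likely.

```latex
% Illustrative four-symbol source with probabilities 1/2, 1/4, 1/8, 1/8
H(X) = -\sum_{i=1}^{4} P_i \log_2 P_i
     = \tfrac{1}{2}(1) + \tfrac{1}{4}(2) + \tfrac{1}{8}(3) + \tfrac{1}{8}(3)
     = 1.75 \text{ bits per symbol}

% Uniform source over the 256 byte values, P_i = 1/256
H(X) = -\sum_{i=1}^{256} \tfrac{1}{256} \log_2 \tfrac{1}{256}
     = \log_2 256
     = 8 \text{ bits per symbol}
```

The second case is the maximum entropy attainable for byte-valued symbols.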

4.5 Experimental Results

To study the performance improvements achieved through the parallelization
of the RC4 algorithm, the sequential RC4 algorithm was first executed to
measure its execution time in the given environment. The sequential results
serve as the baseline for comparison with the results of the parallel
algorithm, PARC4. All programs were compiled with the GCC compiler, using
-O3 to enable the third level of optimization and -march=native to enable
CPU-specific instructions. All tests were run on a server with the
configuration described in Section 2.3 of Chapter 2.

To assess the parallel framework, all tests use the text files from
t5-Corpus (Roussev, Roussev, 2011) with some changes (required sequence of
bits). Tables 4.1 to 4.5 show the time taken by RC4 and by PARC4 on
multiple cores. The first column of each table shows the size of the input
data used for encryption and decryption. The second and third columns show
the execution time (in seconds) of the encryption and decryption processes,
and the last column shows the overall time taken by both processes on an
AMD FX(tm)-8320 eight-core processor running at 3.5 GHz.

Table 4-1: Time (in seconds) taken by RC4 to encrypt/decrypt large data files on a
uniprocessor

Size of input data [GB]   Encryption   Decryption   Overall time
0.1                       1.31785      1.29868      2.61653
0.2                       2.64969      2.62575      5.27544
0.3                       3.87678      3.85207      7.72885
0.4                       5.29639      5.24420      10.54059
0.5                       6.46599      6.41855      12.88454
0.6                       7.73803      7.67800      15.41603
0.7                       9.02553      9.16311      18.18864
0.8                       10.58992     10.93706     21.52698
0.9                       11.59993     11.85782     23.45775
1.0                       12.89147     12.97361     25.86508

Table 4-2: Time (in seconds) taken by PARC4 to encrypt/decrypt large data
files using 2 cores

Size of input data [GB]   Encryption   Decryption   Overall time
0.1                       0.67892      0.67872      1.35764
0.2                       1.33886      1.33878      2.67764
0.3                       2.0177       2.0176       4.03528
0.4                       2.69656      2.69636      5.39292
0.5                       3.37532      3.37524      6.75056
0.6                       4.05413      4.05407      8.1082
0.7                       4.73296      4.73288      9.46584
0.8                       5.41179      5.41169      10.8235
0.9                       6.09061      6.09051      12.1811
1.0                       6.76942      6.76934      13.5388

Table 4-3: Time (in seconds) taken by PARC4 to encrypt/decrypt large data
files using 4 cores

Size of input data [GB]   Encryption   Decryption   Overall time
0.1                       0.399695     0.399295     0.79899
0.2                       0.799947     0.799943     1.59989
0.3                       1.199905     1.199903     2.39981
0.4                       1.640765     1.640725     3.28149
0.5                       1.99994      1.9999       3.99984
0.6                       2.362748     2.362742     4.72549
0.7                       2.797925     2.797885     5.59581
0.8                       3.299505     3.299465     6.59897
0.9                       3.649495     3.649455     7.29895
1.0                       3.974905     3.974865     7.94977


Table 4-4: Time (in seconds) taken by PARC4 to encrypt/decrypt large data
files using 6 cores

Size of input data [GB]   Encryption   Decryption   Overall time
0.1                       0.25999      0.25995      0.51994
0.2                       0.499937     0.499933     0.99987
0.3                       0.769528     0.76952      1.53905
0.4                       1.04601      1.04597      2.09198
0.5                       1.28548      1.28547      2.57095
0.6                       1.537027     1.537023     3.07405
0.7                       1.844995     1.844955     3.68995
0.8                       2.14746      2.14726      4.29472
0.9                       2.33866      2.3386       4.67726
1.0                       2.574919     2.574911     5.14983

Table 4-5: Time (in seconds) taken by PARC4 to encrypt/decrypt large data files
using 8 cores

Size of input data [GB]   Encryption   Decryption   Overall time
0.1                       0.18687      0.18683      0.3737
0.2                       0.37683      0.37677      0.7536
0.3                       0.55207      0.55203      1.1041
0.4                       0.75287      0.75283      1.5057
0.5                       0.92033      0.92027      1.8406
0.6                       1.10114      1.10106      2.2022
0.7                       1.29918      1.29912      2.5983
0.8                       1.53763      1.53757      3.0752
0.9                       1.67557      1.67553      3.3511
1.0                       1.84755      1.84745      3.695

The text encrypted and decrypted using PARC4 is given in Appendix-D for the
reader’s reference. The OpenMP implementation is given in Appendix E.

4.6 Performance and Scalability Analysis

To examine the benefits of parallelism, a number of metrics, such as
speedup, efficiency, complexity, and scalability, have been used to measure
the performance of the proposed algorithm. These metrics are discussed
next.

4.6.1 Speedup

A serial algorithm is typically assessed in terms of its execution time,
expressed as a function of its input size. In contrast, the execution time
of a parallel algorithm depends on the input size as well as on the
parallel architecture and the number of processing elements employed.
Speedup is defined as the ratio of the time taken to solve a problem on a
single processing element to the time required to solve the same problem on
a parallel computer with p identical processing cores. From Tables 4.1 to
4.5, it can be observed that PARC4 achieves a speedup that grows with the
number of cores used in the experiments. Fig. 4.5 shows the speedup
comparison for a ~1 GB data file using PARC4 on multiple cores.

Fig 4:5: Speedup comparison of PARC4 using multiple cores

It is visible from the graph above that the speedup increases as the number
of cores increases. However, according to Amdahl’s law, speedup tends to
saturate and efficiency drops at a point that depends on the sequential
portion of the executing code. As the number of processing elements and the
amount of problem data increase, the overhead of decomposing the tasks and
distributing them among the processors also increases.
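Amdahl’s law can be written out explicitly. In the sketch below, f denotes the fraction of the execution time that is inherently sequential; the value f = 0.02 is an assumed figure chosen only for illustration and is not a measured property of PARC4.

```latex
S(p) = \frac{1}{f + \dfrac{1 - f}{p}}, \qquad
\lim_{p \to \infty} S(p) = \frac{1}{f}

% Illustration with an assumed f = 0.02 and p = 8
S(8) = \frac{1}{0.02 + \dfrac{0.98}{8}} = \frac{1}{0.1425} \approx 7.02, \qquad
\lim_{p \to \infty} S(p) = 50
```

Under this assumption, adding cores beyond a certain point yields diminishing returns, because the speedup can never exceed 1/f.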

Similarly, for large input streams such as ~1 GB of data, the total time
for executing the algorithm on the complete data is a function of the
number of cores employed. It can be inferred that, for a constant file
size, doubling the number of cores roughly halves the execution time.
Figure 4.6 illustrates this scenario.
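The speedup for the 1.0 GB file can be recomputed directly from the overall times reported in Tables 4-1 to 4-5. The sketch below is only a convenience calculation; the function and constant names are hypothetical.

```c
#include <assert.h>

/* Speedup S(p) = T(1) / T(p). */
static double speedup(double t_serial, double t_parallel)
{
    assert(t_serial > 0.0 && t_parallel > 0.0);
    return t_serial / t_parallel;
}

/* Overall times in seconds for the 1.0 GB file, copied from
 * Table 4-1 (RC4) and Tables 4-2 to 4-5 (PARC4). */
static const double T_RC4_1GB = 25.86508;
static const int CORES[4] = {2, 4, 6, 8};
static const double T_PARC4_1GB[4] = {13.5388, 7.94977, 5.14983, 3.695};
```

Evaluating speedup(T_RC4_1GB, T_PARC4_1GB[k]) for k = 0 to 3 gives approximately 1.91, 3.25, 5.02 and 7.00 for 2, 4, 6 and 8 cores respectively.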


Fig 4:6: Speedup for constant data using multiple cores

4.6.2 Efficiency

Although speedup measures the performance gain of multiple cores over a
single core, it does not indicate whether the processing elements or cores
of the parallel computer are being used efficiently. The efficiency of a
given problem on n processing elements, E(n), is defined as the ratio of
the speedup attained to the number of processors used to attain it:

E(n) = \frac{S(n)}{n} = \frac{T(1)}{n \, T(n)}    (4.6)

In Equation 4.6, let T(1) = 25.8 seconds, T(n) = 3.6 seconds and n = 8 to
process ~1 GB of data. The efficiency of PARC4 is
E(8) = 25.8 / (8 × 3.6) ≈ 0.89. Since S(n) ≤ n, there is E(n) ≤ 1. PARC4
achieved an efficiency in that range, as 0.89 ≤ 1.


Table 4-6: Efficiency as a function of n and p for running n blocks on p
processors to encrypt input stream

Input size [GB]   P=4    P=6    P=8
0.6               0.81   0.83   0.87
0.7               0.82   0.84   0.9
0.8               0.83   0.85   0.87
0.9               0.81   0.84   0.89
1.0               0.81   0.84   0.89

It is visible from Table 4.6 that, for a given problem size, the overall
efficiency of the parallel system increases as the number of processing
elements increases. Secondly, the efficiency of the parallel system remains
almost constant if the problem size is increased while the number of
processing elements is kept constant. This is because the implementation of
PARC4 is based on the data parallel model, in which tasks from the
centralized task pool are distributed equally among the processing cores.
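The efficiency definition can be checked against the measured times with a one-line function. The function name is hypothetical, and the inputs used in the note below are the overall times for the 1.0 GB file from Tables 4-1, 4-3, 4-4 and 4-5.

```c
#include <assert.h>

/* Efficiency E(p) = S(p) / p = T(1) / (p * T(p)). */
static double efficiency(double t_serial, double t_parallel, int cores)
{
    assert(t_serial > 0.0 && t_parallel > 0.0 && cores > 0);
    return t_serial / (cores * t_parallel);
}
```

For the 1.0 GB file, efficiency(25.86508, 7.94977, 4) is approximately 0.81, efficiency(25.86508, 5.14983, 6) approximately 0.84 and efficiency(25.86508, 3.695, 8) approximately 0.875. These values are close to the last row of Table 4-6; small differences may stem from rounding of the inputs.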

4.6.3 Complexity and Cost optimality

The cost of running a program on an SMP is the product of the parallel
execution time and the number of cores or processing elements employed for
that program. It reflects the sum of the time that each core spends solving
the problem. The cost of solving a problem on a single core is the
execution time of the fastest known serial algorithm. “A parallel system is
said to be cost optimal if the cost of solving a problem on a parallel
computer has the same asymptotic growth as a function of the input size as
the fastest known sequential algorithm on a single processing element
(Kumar et al., 1994). Since efficiency is the ratio of sequential cost to
parallel cost, a cost optimal or pTp optimal system has an efficiency of
O(1). But due to parallel overhead involved, the efficiency of 1 is never
achieved” (Kumar et al., 1994).

Furthermore, if n is the number of data blocks and p is the number of
cores, where n > p, then n/p steps are required to execute the n blocks in
parallel, and there are 256*n iterations to execute in total. Thus the
overall parallel execution time for PARC4 is O((n*m)/p), where n is the
number of parallel blocks and m is 256. For example, if the input is 512
bytes then n = 2. With p = 2 cores, (n*256)/p = 256, which is half the
number of iterations of the original algorithm. Therefore, the parallel
execution time can be defined as:

T_p = O\left(\frac{n \cdot m}{p}\right)    (4.7)

Consequently, its cost is:

Cost = p \times T_p    (4.8)
     = p \times O\left(\frac{n \cdot m}{p}\right)    (4.9)
     = O\left(\frac{p \cdot n \cdot m}{p}\right)    (4.10)
     = O(n*m)    (4.11)
     = O(n)    (4.12)

In Equation 4.11, n is the number of blocks and is multiplied by m = 256,
which equals the number of iterations of the serial algorithm. Because m is
a constant, the cost remains linear in n, which makes PARC4 cost optimal.

4.6.4 Scalability

“The ability to maintain efficiency at a fixed value by concurrently

increasing the number of cores and the size of the problem is unveiled by

many SMPs. Such systems are scalable parallel systems(Kumar et al.,

1994)”. Table 4.6 shows that as the problem size and the number of cores
increase, the efficiency remains almost the same. This indicates that the
proposed algorithm is scalable, as it keeps the efficiency fixed while the
problem size and the number of processing elements are increased
simultaneously.


4.6.5 Throughput

Throughput is calculated as the size of the plaintext, in bytes or bits,
divided by the time taken to encrypt and decrypt that data. Figure 4.7
compares the throughput achieved by RC4 and PARC4 for ~0.5 GB of data.
Here, PARC4 is executed on eight cores. PARC4 provides much higher
throughput than RC4 running on a single core.
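The throughput definition can be applied to the 0.5 GB rows of Tables 4-1 and 4-5. In the sketch below, the function name is hypothetical, and treating 0.5 GB as 512 MB is an assumption about the units used in the tables.

```c
#include <assert.h>

/* Throughput in megabytes per second: data size divided by elapsed time. */
static double throughput_mb_per_s(double size_mb, double seconds)
{
    assert(size_mb >= 0.0 && seconds > 0.0);
    return size_mb / seconds;
}
```

With the overall times of 12.88454 s for RC4 and 1.8406 s for PARC4 on eight cores, throughput_mb_per_s(512.0, 12.88454) is roughly 39.7 MB/s and throughput_mb_per_s(512.0, 1.8406) is roughly 278 MB/s, about seven times higher.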

Fig 4:7: Comparison for throughput achieved using RC4 and PARC4

4.7 Comparative Analysis

As mentioned earlier, several hardware or FPGA-based implementations are
available to parallelize the RC4 stream cipher. Those implementations
cannot be compared with this technique because they rely on a different
technology. Thus, in this section, PARC4 is compared with a multithreaded
approach proposed by T.D.B Weerasinghe (Weerasinghe, 2014). The two
approaches differ considerably because of the platforms used: parallel
computing on symmetric multiprocessors offers wide scope for parallel
programming through APIs such as OpenMP, whereas multithreading takes
advantage of CPU idle time. Nevertheless, because they share similar
concepts and goals, the two techniques can be compared on the basis of a
few parameters:


4.7.1 Mapping and Load Balance

With threading in Java-like technologies, there is no assurance that all
available cores are used efficiently. Moreover, the JVM takes care of
creating and assigning threads, so the procedure is implicit. In contrast,
parallel programming explicitly breaks the task into smaller chunks, each
of which can be executed on an individual core. In this way, multiple parts
of the same program execute in parallel. As it is an explicit approach, the
programmer has to take care of the mapping between processes and cores.

4.7.2 Modified Key stream

In the multithreaded approach, the same key stream is used for all the file
chunks. This can affect the security of the cipher. In contrast, PARC4 uses
a modified key stream so that each individual block has a different key
stream.

4.7.3 Energy Efficiency

If the power consumption of an application needs to be reduced, each core
should be operated at a low frequency and voltage. A single core operated
at a low frequency and voltage, however, delivers lower performance. Thus,
multithreading on a single core performs well only at a high frequency and
voltage, which ultimately consumes more power. This is discussed in detail
in Chapter 8.

On the basis of the above parameters, it can be concluded that PARC4 uses a
better and more forward-looking approach than the multithreaded approach.
Table 4-7 compares the two approaches.

Table 4-7: Comparison between PARC4 and Multithreaded approach

Parameters             RC4 using Multithreading          PARC4
Processor/Technology   Core i3,                          AMD FX(tm) 8 core,
                       2 cores/4 threads                 8 cores/8 threads
Energy efficient       No                                Yes
Type of programming    Implicit (no intervention of      Explicit (the programmer can
                       the programmer to map the         map the processes onto
                       processes onto multiple           multiple cores according to
                       cores/threads; the programming    the requirement)
                       environment takes care of that)

Conclusion

This chapter introduced the PARC4 algorithm, a parallel approach to the
well-known RC4 cipher algorithm. The parallel implementation of the
algorithm is based on the PASCS framework, which can be used to implement
any stream cipher in parallel.

The total time for executing the algorithm on the complete data is a
function of the number of cores employed to complete the task. As the
number of processing elements increases, the overall efficiency of the
parallel system increases. Secondly, the efficiency of the parallel system
remains almost constant if the problem size is increased while the number
of processing elements is kept constant. Due to its high efficiency, PARC4
is also cost optimal. PARC4 provides much higher throughput than RC4
running on a single core.


Chapter 5

PARC4-I: Parallel RC4A

using PASCS and loop

unrolling mechanism


This chapter introduces a parallel algorithm to encrypt/decrypt large data
files and to secure communications over a channel. In this parallel
approach, the existing KSA and PRGA algorithms of RC4A are revised to
produce a random key stream and to generate more than one output byte in
each iteration, which enables parallelization. The PASCS framework is used
for parallelization, along with the loop unrolling technique for code
optimization. The revised parallel algorithm is termed PARC4-I.

5.1 Introduction

RC4A is one of the strongest alternatives to the RC4 algorithm. It was proposed by Paul and Preneel (Paul and Preneel, 2004). It has a modified key stream generator which provides stronger security than RC4, and most of the attacks on RC4 are less effective on RC4A. Moreover, RC4A requires fewer instructions per output byte, and it is feasible to exploit its inherent parallelism to get better performance.

The rest of the chapter is organized as follows: Section 5.2 describes the modified KSA and PRGA. The process of adding parallelism, along with the parallel techniques used, is discussed in Section 5.3. Section 5.4 presents the results on a set of data files of various sizes. The performance analysis, including the speedup calculations, is discussed in Section 5.5. In Section 5.6, PARC4-I is compared with PARC4 to observe speedup gains, followed by conclusions.

5.2 Modified KSA and PRGA

In the KSA algorithm, a randomly chosen key k1 is supplied to the key generator, which then produces three more keys k2, k3, and k4 using k1 as the seed. There are four S-boxes, S1, S2, S3 and S4, each initialized using a different key. As discussed in Chapter 2, S1 and S2 are two random permutations of {0, ..., N-1} (Abinash Roy, 2008). In this modified scheme there are four distinct S-boxes, all assumed to be uniformly distributed random permutations of {0, ..., N-1}.

Algorithm 5.1 shows the steps for generating the four distinct key bytes, which are then used to encrypt four plaintext bytes. All arithmetic operations are modulo N, where N is 256. The transition of the internal states of the four S-boxes is based on an exchange shuffle, as in RC4. In order to generate four bytes, four distinct variables j1, j2, j3 and j4, one for each S-box, have been introduced. The only modification is that the index pointer S1[i]+S1[j1] evaluated on S1 selects the output from S-box S2 and vice versa, and likewise for the pair S3 and S4, so that all four bytes are produced this way; see steps 1.4, 1.7, 1.10, and 1.13 of Algorithm 5.1. The next round starts after each output generation. PARC4-I, with its new KSA and PRGA schemes, uses fewer instructions per output byte than RC4. To generate four successive output bytes, the index pointer i is incremented once in PARC4-I, whereas it is incremented four times to produce as many output bytes in RC4. RC4A produces two output bytes at each iteration.

5.3 Incorporating parallelism

The input text is divided into fixed-size blocks of 256 bytes each. Multiple data blocks are then encrypted simultaneously using PRGA. As discussed in Section 5.2, PRGA generates four distinct bytes at each increment of the index pointer; therefore, four bytes of plaintext can be fetched together to encrypt or decrypt. Using the loop unrolling technique this process can be accomplished efficiently, and finally the output of each block is concatenated to form the complete cipher text. The overhead associated with function calls is also reduced by this method, because for every 64 bytes of data PRGA is executed only 16 times instead of 32 times in RC4A or 64 times in RC4.

Procedure: Pseudo random number generator
Input: Four S-Boxes: S1, S2, S3, S4
Output: Four distinct key bytes
Declare: i, j1, j2, j3, j4
Initialize: i, j1, j2, j3, j4 are set to 0
1. Repeat steps until i=255
1.1 i := i + 1
1.2 calculate j1 := j1 + S1[i]
1.3 swap values of S1[i] and S1[j1]
1.4 output S2[S1[i] + S1[j1]]
1.5 calculate j2 := j2 + S2[i]
1.6 swap values of S2[i] and S2[j2]
1.7 output S1[S2[i] + S2[j2]]
1.8 calculate j3 := j3 + S3[i]
1.9 swap values of S3[i] and S3[j3]
1.10 output S4[S3[i] + S3[j3]]
1.11 calculate j4 := j4 + S4[i]
1.12 swap values of S4[i] and S4[j4]
1.13 output S3[S4[i] + S4[j4]]
End

Algorithm 5.1 Enhanced pseudo-random generation algorithm (PRGA)

Figure 5.1 depicts the process of encryption/decryption in PARC4-I, and for the reader's reference the parallel implementation of PARC4-I using OpenMP is given in Appendix-D.
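One round of Algorithm 5.1 can be sketched in C as follows. This is a minimal illustration rather than the Appendix-D code: the type and function names (prga4_state, prga4_init_identity, prga4_round) are hypothetical, and the identity initialisation is only a placeholder for the KSA, which would shuffle each S-box with its own key.

```c
#include <stdint.h>

/* Generator state following Algorithm 5.1: four S-boxes, one shared index i,
   and one j index per S-box. Names are illustrative, not from the thesis. */
typedef struct {
    uint8_t s1[256], s2[256], s3[256], s4[256];
    uint8_t i, j1, j2, j3, j4;
} prga4_state;

/* Placeholder initialisation (identity permutations). The real KSA would
   shuffle each S-box with its own key k1..k4. */
void prga4_init_identity(prga4_state *st) {
    for (int k = 0; k < 256; k++) {
        st->s1[k] = st->s2[k] = st->s3[k] = st->s4[k] = (uint8_t)k;
    }
    st->i = st->j1 = st->j2 = st->j3 = st->j4 = 0;
}

/* Update j, swap inside `own`, and read the output byte from `other`.
   The uint8_t casts keep all index arithmetic modulo 256. */
static uint8_t step(uint8_t *own, const uint8_t *other, uint8_t i, uint8_t *j) {
    *j = (uint8_t)(*j + own[i]);
    uint8_t t = own[i];
    own[i] = own[*j];
    own[*j] = t;
    return other[(uint8_t)(own[i] + own[*j])];
}

/* One round: i advances once and four key bytes are produced. */
void prga4_round(prga4_state *st, uint8_t out[4]) {
    st->i = (uint8_t)(st->i + 1);
    out[0] = step(st->s1, st->s2, st->i, &st->j1);  /* steps 1.2 to 1.4 */
    out[1] = step(st->s2, st->s1, st->i, &st->j2);  /* steps 1.5 to 1.7 */
    out[2] = step(st->s3, st->s4, st->i, &st->j3);  /* steps 1.8 to 1.10 */
    out[3] = step(st->s4, st->s3, st->i, &st->j4);  /* steps 1.11 to 1.13 */
}
```

Because each call advances i only once, 64 key bytes need 16 calls, which matches the count given above.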

Fig. 5.1 Method used to implement PARC4-I on SMPs

The algorithm for PARC4-I is listed below:

Procedure: Encryption
Model: Data Parallel Model and loop unrolling with P processors [P=2, 4, 6, 8]
Input: Plaintext in the form of small chunks [Chunk Size = 256], Number of blocks
Output: Encrypted text
Declare: Plaintext, BlockID as shared variables, i as private variable to each processing core
Initialize: Number of blocks = Size of plaintext / 256
1. ParBegin
2. For ALL BlockID in [0, Number of blocks - 1] IN SYNC
2.1 Set Start = BlockID*256 and End = Start+256
2.1.1 For i = Start to End-1 step 4 do
2.1.1.1 Output-1 = ((key_byte-1 + block id[i]) Mod 256) XOR msg_byte-1
2.1.1.2 Output-2 = ((key_byte-2 + block id[i+1]) Mod 256) XOR msg_byte-2
2.1.1.3 Output-3 = ((key_byte-3 + block id[i+2]) Mod 256) XOR msg_byte-3
2.1.1.4 Output-4 = ((key_byte-4 + block id[i+3]) Mod 256) XOR msg_byte-4
2.1.2 End for
2.2 Concatenate all blocks to form the complete ciphertext corresponding to the plaintext
3. ParEnd

Algorithm 5.2 Method used to parallelize multiple data chunks using PARC4-I

5.3.1 Techniques to enhance benefits of parallelization

PARC4-I uses parallel techniques similar to those of PARC4. Additionally, it uses the loop unrolling method to further optimize performance. PARC4 cannot use this technique due to its intensive swap functionality and because it returns only one byte per call, whereas PARC4-I returns four distinct bytes. We next take a look at the loop unrolling method for optimizing the code.

Loop unrolling: Also known as loop unwinding (Krall and Lelait, 2000), this is a loop transformation technique that tries to optimize a program's execution speed at the cost of its code size. The transformation can be undertaken manually by the programmer. The objective of loop unwinding is to increase a program's speed by reducing the instructions that control the loop, for example the 'end of loop' test on every iteration, reducing branch penalties, and hiding latencies, in particular the delay in reading data from memory. To eliminate this overhead, loops can be rewritten as a repeated sequence of similar independent statements.
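The chunk-level loop of Algorithm 5.2 can be sketched in C with an OpenMP work-sharing loop. This is a hedged illustration, not the Appendix-D implementation: the function name xor_chunks and the pre-generated key-stream buffer ks are assumptions, and adding the low byte of the block number to each key byte is only one possible reading of the "(key_byte + block id) Mod 256" step.

```c
#include <stddef.h>
#include <stdint.h>

#define CHUNK 256  /* chunk size used in Algorithm 5.2 */

/* XORs `data` in place with key bytes, one 256-byte chunk per outer iteration.
   `ks` holds 256 key-stream bytes assumed to be generated beforehand.
   Mixing in the low byte of the block number is an assumption about the
   "(key_byte + block id) Mod 256" step. `len` is assumed to be a multiple
   of CHUNK. Calling the function twice restores the original data. */
void xor_chunks(uint8_t *data, size_t len, const uint8_t *ks) {
    long nblocks = (long)(len / CHUNK);

    /* Chunks are independent, so iterations can run on different cores.
       If OpenMP is not enabled the pragma is ignored and the loop is serial. */
    #pragma omp parallel for schedule(static)
    for (long b = 0; b < nblocks; b++) {
        uint8_t *p = data + (size_t)b * CHUNK;  /* private to this iteration */
        uint8_t id = (uint8_t)b;
        for (int k = 0; k < CHUNK; k += 4) {    /* body unrolled four times */
            p[k]     ^= (uint8_t)(ks[k]     + id);
            p[k + 1] ^= (uint8_t)(ks[k + 1] + id);
            p[k + 2] ^= (uint8_t)(ks[k + 2] + id);
            p[k + 3] ^= (uint8_t)(ks[k + 3] + id);
        }
    }
}
```

Since every chunk writes only to its own 256 bytes, no synchronization is needed inside the loop, and the output is already in place, so no separate concatenation step is required in this sketch.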

This implementation uses static loop unrolling, in which the programmer analyzes the loop and converts the iterations into a sequence of instructions that reduces the loop overhead.

A simple example of the static loop unrolling used in this implementation is the following. A function in a computer program adds 100 items from an array. This is usually done with a simple for-loop that calls the function add(item_number). If the loop runs 100 times, the function add is called 100 times, and there is loop iteration overhead each time. Loop unrolling can be used to reduce this overhead.

Normal loop

int i;
for (i = 0; i < 100; i++)
{
    add(i);
}

After loop unrolling

int i;
for (i = 0; i < 100; i += 5)
{
    add(i);
    add(i+1);
    add(i+2);
    add(i+3);
    add(i+4);
}

After this revision, the program makes just 20 iterations rather than 100. As a result, only 20% of the jumps and conditional branches are needed. To obtain the maximum benefit, no variable should be specified in the unrolled code that requires pointer arithmetic. This normally requires "base plus offset" addressing rather than indexed referencing.

On the other hand, manual loop unwinding expands the source code from 3 lines to 7, all of which have to be written, checked and debugged, and the compiler may have to allocate extra registers to hold variables in the expanded loop. Moreover, the loop control variables and the number of operations inside the unrolled body have to be chosen carefully so that the result is the same as in the original code. Fig. 5.2 depicts the process of loop unrolling.
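The 100-item example works because 100 is a multiple of the unrolling factor. When the trip count is not a multiple of the factor, a short clean-up loop is one common way to keep the result identical to the original loop. The sketch below illustrates this with a factor of four; the function name sum_unrolled4 is hypothetical and the code is not taken from the thesis implementation.

```c
#include <stddef.h>

/* Sums n integers with the loop body unrolled four times.
   A short clean-up loop handles the n % 4 leftover elements, so the result
   matches the plain one-element-per-iteration loop for every n. */
long sum_unrolled4(const int *a, size_t n) {
    long total = 0;
    size_t i = 0;
    size_t limit = n - (n % 4);   /* largest multiple of 4 not exceeding n */

    for (; i < limit; i += 4) {   /* one loop test per four additions */
        total += a[i];
        total += a[i + 1];
        total += a[i + 2];
        total += a[i + 3];
    }
    for (; i < n; i++) {          /* at most three leftover elements */
        total += a[i];
    }
    return total;
}
```

In PARC4-I the chunk size of 256 bytes is a multiple of four, so no clean-up loop would be needed inside a chunk.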

PARC4-I has been developed using the same framework that was used to develop PARC4, and a similar key generator is used in both parallel algorithms. Thus, PARC4-I can be considered as secure as the original algorithm.

Fig. 5.2 Pictorial representation of normal and unrolled loops

5.4 Results


To estimate the performance gains, the sequential RC4A algorithm was executed and its execution time measured. The sequential results serve as the baseline for comparison with the results of the improved parallel algorithm PARC4-I. The same compiler and the same configuration options were used to compile all programs. The compiler option -g0 is used to disable debugging information, -O3 to enable the third level of optimization, and -march=native to enable the use of CPU-specific instructions. All tests were run on a server with the configuration described in Section 2.3 of Chapter 2.

To assess the parallel framework, all of the tests use the text files from t5-Corpus11 (Roussev). Tables 5.1 to 5.5 show the execution time taken by RC4A on a single core and by PARC4-I on multiple cores. The first column of each table shows the size of the input data used for encryption and decryption. The second and third columns show the execution time, measured in seconds, of the encryption and decryption processes, and the last column shows the overall time taken by both processes.

Table 5.1 Time taken by RC4A to encrypt/decrypt large data files on a uniprocessor system

Data file size [GB]  Encryption time [s]  Decryption time [s]  Overall time [s]

0.1 1.73520 1.78436 1.35034

0.2 1.58461 1.5705 3.15511

0.3 2.52451 2.50427 5.02878

0.4 3.59566 3.57956 7.17522

0.5 4.77162 4.54654 9.31816

0.6 6.5028 6.21054 12.71334

0.7 7.8804 7.14834 15.02874

0.8 8.91011 8.77074 17.68085

0.9 9.43049 9.59899 19.02948

1.0 10.71573 10.49485 21.21058

Table 5.2 Time taken by PARC4-I to encrypt/decrypt large data files using 2 cores

Data file size [GB]  Encryption time [s]  Decryption time [s]  Overall time [s]

0.1 0.357585 0.337585 0.69517

0.2 0.809525 0.789525 1.59905

0.3 1.317195 1.297195 2.61439

0.4 1.903805 1.883805 3.78761

0.5 2.38954 2.36954 4.75908

0.6 3.238335 3.218335 6.45667

0.7 3.827185 3.797185 7.61437

0.8 4.432125 4.402125 8.82425

0.9 4.78737 4.75737 9.53474

1.0 5.317645 5.287645 10.61529

Table 5.3 Time taken by PARC4-I to encrypt/decrypt large data files using 4 cores

Data file size [GB]  Encryption time [s]  Decryption time [s]  Overall time [s]

0.1 0.17993 0.17989 0.35982

0.2 0.41564 0.41558 0.83123

0.3 0.66285 0.66277 1.3256

0.4 0.97849 0.97845 1.95694

0.5 1.24622 1.24616 2.49238

0.6 1.68213 1.68207 3.36421

0.7 2.01168 2.01162 4.0233

0.8 2.36608 2.36602 4.7321

0.9 2.50562 2.50559 5.0112

1.0 2.81159 2.81151 5.62311


Table 5.4 Time taken by PARC4-I to encrypt/decrypt large data files using 6 cores

Data file size [GB]  Encryption time [s]  Decryption time [s]  Overall time [s]

0.1 0.11847 0.11843 0.2369

0.2 0.2766 0.276 0.5526

0.3 0.4415 0.4407 0.8822

0.4 0.6297 0.6291 1.2588

0.5 0.8177 0.8171 1.6348

0.6 1.1157 1.1148 2.2304

0.7 1.31328 1.31322 2.6265

0.8 1.55089 1.55081 3.1017

0.9 1.6659 1.6655 3.3314

1.0 1.8606 1.8605 3.7211


Table 5.5 Time taken by PARC4-I to encrypt/decrypt large data files using 8 cores

Data file size [GB]  Encryption time [s]  Decryption time [s]  Overall time [s]

0.1 0.09248 0.09242 0.1849

0.2 0.21614 0.21606 0.4322

0.3 0.34438 0.34432 0.6887

0.4 0.49144 0.49136 0.9828

0.5 0.63823 0.63817 1.2764

0.6 0.87077 0.87073 1.7415

0.7 1.0294 1.0293 2.0587

0.8 1.21104 1.21097 2.422

0.9 1.30337 1.30333 2.6067

1.0 1.45277 1.45273 2.9055

The encrypted and decrypted text produced by PARC4-I is given in Appendix-D for the reader's reference. The OpenMP implementation is given in Appendix F.

5.5 Performance Analysis


Various metrics have been used to measure and analyse the performance benefit of the proposed algorithm over the serial algorithm.

5.5.1 Parallel Run Time

The parallel run time, denoted PT(n), is the execution time of a parallel program on a symmetric multiprocessor with n cores. PT(1) denotes the execution time of a serial program on a single processor. Figure 5.3 shows the parallel run time of PARC4-I for each number of cores.

Fig. 5.3 Execution time of 1 GB of data file using PARC4-I

The graph shows that, for a given data set, the execution time decreases steadily as the number of cores increases.

5.5.2 Speedup

Speedup is a quantitative measure of the performance gain achieved by a parallel algorithm executed on SMPs over a sequential implementation running on a single-processor system. To capture the relative benefit of running a program in parallel, the sequential baseline should be the fastest available sequential algorithm. From the results in Tables 5.1 to 5.5, it can be deduced that PARC4-I achieves a 7.3X speedup on eight cores and 5.7X on six cores. Figure 5.4 shows the speedup of PARC4-I on multiple cores.

Fig. 5.4 Speedup comparison using multiple cores

The graph shows that speedup increases steadily as the number of cores increases. However, as per Amdahl's law, speedup tends to saturate and efficiency can drop beyond some point, depending on the parallelism available in the program. The point where saturation occurs depends on the type of parallel execution model used to execute the program, the size of the data associated with the program, and the number of cores used to process the data.
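For reference, Amdahl's law can be written as below. The symbol f, the fraction of the run time that can be parallelized, is introduced here for illustration and does not appear in the thesis.

```latex
S(n) = \frac{1}{(1 - f) + \dfrac{f}{n}},
\qquad
\lim_{n \to \infty} S(n) = \frac{1}{1 - f}
```

As a rough illustration only, and assuming this simple model applies to the measurements, solving for f with the reported S(8) = 7.3 gives f = (1 - 1/7.3) / (1 - 1/8), which is approximately 0.986. Under that assumption the speedup could not exceed about 1/(1 - f), roughly 73, however many cores were added.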

Similarly, for large input streams such as 1 GB of data, the total time to process the complete data is a function of the number of cores employed to complete the task. It can thus be inferred that, for a given file size, the execution time continues to fall as cores are added, up to some point.

5.5.3 Efficiency

This metric reflects how effectively all processing elements work together; essentially, it is a function of load balancing. The efficiency of a given problem using n processing elements, E(n), is defined as the ratio of the speedup achieved to the number of processors used to achieve it:

$E(n) = \frac{S(n)}{n} = \frac{T(1)}{n \cdot T(n)}$ (5.1)

In equation 5.1, let T(1) = 21.21 seconds, T(n) = 2.9 seconds and n = 8 to process 934,159,360 bytes (~1 GB). The efficiency of PARC4-I is then E(8) = 21.21 / (8 × 2.9) ≈ 0.91. PARC4-I achieved an efficiency of about 0.9 for increasing p and n.
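The speedup and efficiency figures can be reproduced from the measured times with two small helper functions. The sketch below is illustrative; the function names speedup and efficiency are hypothetical, and the values quoted afterwards are the 1 GB overall times reported in the results tables.

```c
/* Speedup S(n) = T(1) / T(n): serial time divided by parallel time. */
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

/* Efficiency E(n) = S(n) / n, as in equation 5.1. */
double efficiency(double t_serial, double t_parallel, int n_cores) {
    return speedup(t_serial, t_parallel) / n_cores;
}
```

With the 1 GB overall times of 21.21058 s on one core and 2.9055 s on eight cores, speedup(21.21058, 2.9055) is about 7.3 and efficiency(21.21058, 2.9055, 8) is about 0.91, in line with the values stated in the text.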

5.5.4 Scalability

As stated above, PARC4-I maintained an efficiency of about 0.9 as the number of cores increased together with the data size. This property indicates that the proposed algorithm is scalable, because its efficiency stays fixed when the problem size and the number of processing elements are both increased.

5.6 Comparison between PARC4 and PARC4-I

The two implementations, PARC4 (Handa and Kapoor, 2014) and PARC4-I, have been compared on the basis of the parameters below.

5.6.1 Parallel Run time

Consider 934,159,360 bytes (~1 GB) processed on eight processing elements using PARC4 and PARC4-I. PARC4 takes 3.69 seconds, whereas PARC4-I takes 2.90 seconds. Figure 5.5 compares the two for all data sizes.

Fig. 5.5 Graphical representation of parallel run time of PARC4 vs PARC4-I on eight cores

5.6.2 Speedup

As PARC4-I uses the loop unrolling optimization technique for additional speedup, it achieves a higher speedup than PARC4. Figure 5.6 shows the speedup comparison between the two approaches.

Fig. 5.6 Comparison between PARC4 and PARC4-I for speedup

In the above figure, the speedup is calculated for the same input stream size (934,159,360 bytes, that is ~1 GB of data) using four, six and eight cores. It is worth noting that the speedup advantage of PARC4-I is slightly larger on four and six cores than on eight cores. This is because of the parallel overhead involved in distributing the task over multiple cores. Thus it can be concluded that PARC4-I is more effective than PARC4 when fewer cores are used.

5.6.3 Efficiency

There is only a slight difference between the efficiency achieved by PARC4 and PARC4-I, because both use the same concept to implement parallelism. The only difference between them is the use of the loop unrolling optimization technique. In PARC4 the loop has to execute 256 times, whereas in PARC4-I only 64 iterations are required. Figure 5.7 illustrates this.

Fig. 5.7 Comparison of PARC4 and PARC4-I for Efficiency

5.6.4 Loop overhead

As discussed under the efficiency metric, the loop overhead in PARC4-I is one fourth of that in PARC4. This is because of the loop unrolling method.

5.6.5 Throughput

Here, throughput is the total number of bytes the algorithm can process in a given time period; it is calculated as the total number of input bytes divided by the execution time. For 934,159,360 (~1 GB) input bytes processed using eight cores, Figure 5.8 compares the throughput achieved by PARC4-I and PARC4. It shows that PARC4-I has higher throughput than PARC4.
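The throughput calculation can be written as a one-line helper. This is a sketch under one assumption: 1 MB is taken as 10^6 bytes, since the thesis does not state which megabyte it uses. The function name throughput_mb_per_s is hypothetical.

```c
/* Throughput in MB/s: input bytes divided by execution time in seconds.
   1 MB is taken as 10^6 bytes here, which is an assumption. */
double throughput_mb_per_s(double n_bytes, double seconds) {
    return n_bytes / seconds / 1e6;
}
```

For 934,159,360 bytes processed in 2.90 s this gives about 322 MB/s, which agrees with the PARC4-I throughput reported in this chapter.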

Fig. 5.8 Comparison between PARC4 and PARC4-I algorithms for throughput

All of the above parameters indicate that the PARC4-I algorithm is faster than the PARC4 algorithm owing to its lower loop overhead. Table 5.6 summarizes the comparison between the two algorithms.

Table 5.6 Comparison between PARC4 and PARC4-I

Parameters | PARC4 | PARC4-I
Parallel run time [1 GB data file] | 25.8 | 21.2
Loop overhead | High | Low
Efficiency | 0.89 | 0.91
Speedup with eight cores | 7 | 7.3
Throughput [MB/s] | 252.8 | 322.12

Conclusions

This chapter introduces PARC4-I, a parallel approach to the well-known RC4A cipher algorithm. The basic idea behind this implementation is to use the loop unrolling optimization technique along with the parallel methodology to improve performance. The implementation shows promising results with the use of loop unrolling. The following conclusions are drawn from the discussion of this implementation. Using the PASCS framework together with loop unrolling yields better performance, along with gains in efficiency and throughput. In addition, the new algorithm is as scalable as the algorithm that does not use the optimization.

Chapter 6

Design of Parallel

Independent Feistel Network


A Feistel network is a symmetric structure used in the construction of block ciphers. Many popular block ciphers use this network, including the Data Encryption Standard (DES) and Blowfish algorithms. The Feistel structure has the advantage that the encryption and decryption operations are very similar, so the size of the code or circuitry required to implement such a cipher is nearly halved. This chapter focuses on the design of the Parallel Independent Feistel Network, which has been used to develop a Feistel network based parallel block cipher algorithm for faster execution.

6.1 Introduction

A block cipher encrypts or decrypts fixed-length units of data called blocks. Block ciphers are deterministic algorithms, which means that for a given input and key they always produce the same output. The majority of block ciphers are based on two principles:

Substitution: Each plaintext element or group of elements is uniquely replaced by a corresponding cipher text element or group of elements.

Permutation: A sequence of plaintext elements is replaced by a permutation of that sequence. That is, no elements are added, deleted or replaced in the sequence; rather, the order in which the elements appear in the sequence is changed.

Based on these two principles, there are two types of structure for encrypting blocks:

1) Feistel Network

2) Substitution-Permutation Network (Heys and Tavares, 1995)


Numerous block ciphers are based on the Feistel framework (Choy et al., 2009). Figure 6.1 depicts the concept of the Feistel structure. Such a framework consists of a number of identical rounds of processing. In each round, a substitution is performed on one half of the data being processed, followed by a transformation that swaps the two halves. The original key used in the enciphering process is expanded so that a different key is used for every round. With this process, the blocks of plaintext are processed one by one in sequence, which makes the whole encryption procedure very slow. In this chapter, a parallel independent Feistel network is proposed that encrypts multiple data blocks concurrently for faster execution.

Fig. 6.1 Structure of Sequential Feistel Network (William, 2006)

The rest of the chapter is organized as follows: Section 6.1 gives a brief introduction to block ciphers and their building blocks. Section 6.2 sets out the motivation and requirement for a parallel architecture. The PIFN design is presented in Section 6.3. The chapter then discusses the application areas of PIFN, followed by the conclusions.

6.2 Motivation for Parallel architecture

When data is encoded using block cipher algorithms, the entire process is based on block-by-block encryption. As shown in Fig. 6.1, one block of plaintext goes through the Feistel network to generate a cipher-text block of the same length. This sequential process does not take advantage of today's multi-core processors with large shared memory systems. The parallel Feistel network is designed around the basic idea of data parallelism. For example, 16,777,216 bytes of data with a block size of 64 bytes results in (16,777,216 / 64) or 262,144 blocks of data. These blocks can be processed concurrently to encrypt the entire 16,777,216 bytes of data. Parallelization helps block ciphers to process multiple data blocks simultaneously and provides better results in terms of execution time.

6.3 Design of Parallel Independent Feistel Network Structure

The essence of this approach is to develop a parallel Feistel network in which multiple blocks of the same length can be processed on multiple cores of a processor to produce the cipher text. This network is based on the Electronic Code Book (ECB) mode of operation, where each block is independent of the others, which supports parallelism. Fig. 6.2 shows the complete parallel process of encrypting multiple fixed-size blocks using PIFNS.

Fig. 6.2 Parallel Independent Feistel Network Structure

According to Fig. 6.2, the input to the network is the complete message to be encrypted. The message is divided into fixed-length blocks, and each plaintext block is further divided into two halves, left and right. Both halves pass through n rounds of processing and are then combined to produce cipher text blocks of equal length. Each round takes a left-side and a right-side input derived from the previous round, as well as a sub-key Ki derived from the overall key K. The structure can be used with any positive number of rounds. In each round, a substitution is performed on the left half of the data by applying the parallel round function F to the right half and then taking the exclusive-or of the result with the left half. The function F takes the right half of the block, of w bits, and a sub-key of y bits, and produces an output of w bits. Following this substitution, a permutation is performed that interchanges the two halves of the data.
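The round structure described above, together with the block-level parallelism of ECB mode, can be sketched in C as follows. This is a minimal illustration under stated assumptions: the round function toy_f is a stand-in chosen only to make the example self-contained (it is not secure and is not the PIFNS round function), the 64-bit block size and 16 rounds are illustrative, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define ROUNDS 16

/* Toy round function for illustration only; it is NOT a secure F. */
static uint32_t toy_f(uint32_t half, uint32_t subkey) {
    uint32_t x = half ^ subkey;
    x *= 0x9E3779B1u;
    return x ^ (x >> 15);
}

/* Runs the Feistel rounds on one 64-bit block held as two 32-bit halves.
   Using the sub-keys in reverse order (decrypt != 0) inverts the cipher. */
static void feistel_block(uint32_t *left, uint32_t *right,
                          const uint32_t *subkeys, int decrypt) {
    uint32_t l = *left, r = *right;
    for (int i = 0; i < ROUNDS; i++) {
        uint32_t k = subkeys[decrypt ? ROUNDS - 1 - i : i];
        uint32_t next_r = l ^ toy_f(r, k);  /* substitution on the left half */
        l = r;                              /* permutation: swap the halves */
        r = next_r;
    }
    *left = r;   /* undo the final swap so decryption mirrors encryption */
    *right = l;
}

/* ECB-style pass: every block is independent, so blocks can be processed
   concurrently. blocks[2*b] and blocks[2*b+1] are the halves of block b.
   If OpenMP is not enabled the pragma is ignored and the loop is serial. */
void feistel_ecb(uint32_t *blocks, size_t nblocks,
                 const uint32_t *subkeys, int decrypt) {
    #pragma omp parallel for schedule(static)
    for (long b = 0; b < (long)nblocks; b++) {
        feistel_block(&blocks[2 * b], &blocks[2 * b + 1], subkeys, decrypt);
    }
}
```

The same routine serves for encryption and decryption, which reflects the property of the Feistel structure noted at the start of this chapter. Because the blocks are independent, equal plaintext blocks produce equal cipher text blocks, which is a known characteristic of ECB mode.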

6.4 Application Area of PIFNS

All block ciphers of deterministic type can be parallelized with the PIFNS framework. These ciphers can be used for password management as well as for file/disk encryption. A significant use case for this framework is the encryption/decryption of large data files, where execution time can be a critical factor. In this thesis, the Blowfish block cipher algorithm is parallelized using PIFNS, and the resulting algorithm, named PBlock, is discussed in detail in the next chapter.

Conclusion

Feistel networks have been used extensively for encryption, but these networks can be very slow due to the sequential nature of their processing. Today's computing applications require encryption for tasks such as file encryption and complete disk encryption, and there is great demand for a parallel security structure that can process these tasks faster. PIFNS has been developed for this purpose. PIFNS is a parallel framework in which multiple data blocks can be encrypted or decrypted concurrently to achieve faster execution. In the next chapter, a parallel block cipher called PBlock, built on the PIFNS framework, is discussed.

Chapter 7

PBlock- Parallel approach for

Blowfish cipher using PIFN


In this chapter, we discuss and explain PBlock, a parallel block cipher algorithm.

The parallel implementation is based on Parallel Independent Feistel Network

Structure, discussed in Chapter 6. Various metrics have been applied on the

parallel algorithm to measure its efficiency, speedup, and cost optimality. The

results prove that the proposed algorithm is faster and can be used in many

typical applications such as file and disk encryption and for securing

communications over the Internet.

7.1 Introduction

There are many benefits of using parallel programming on symmetric multiprocessors (Keckler et al., 2009). Increasing the number of cores on a single chip can help improve performance and also reduce the energy consumption of the system. The idea that motivates this research is to observe whether complex cryptographic algorithms can be restructured to allow efficient parallel implementations. As a result, higher performance can be achieved, making these algorithms more applicable to long-running processes such as full disk encryption or backup software for networked computers. The focus of the study is to reduce the time taken by the block cipher encryption process for large data and to redesign the security algorithm so that it can utilize multiple cores if they are available in the computing device. With the advent of the parallel computing era, no extra hardware is required to benefit from parallelism. Also, almost all smart phones have multiple cores (Gonzalez et al., 2009), but a parallel algorithm implementation is required to utilize all these processing elements effectively.

Blowfish is a private-key infrastructure based block cipher that can be used as an alternative to DES or IDEA (Schneier, 1994). It has a variable-length key, ranging from 32 bits to 448 bits. It is suitable for both commercial and domestic use. It was designed by Bruce Schneier in 1993 as a fast substitute for existing security algorithms. Blowfish is used in many commercial products (Schneier). This chapter focuses on one major application area of Blowfish, namely file encryption.

The rest of the chapter is organized as follows: Section 7.2 explains the PBlock implementation with the PIFN framework, the design of the parallel Feistel network, and the parallel techniques used in the algorithm. In the next section, a security analysis is carried out to verify that the modified algorithm is as secure as the original one. The results on a large set of data files, along with the speedup calculations, are given in Section 7.4. In Section 7.5, performance analysis using various metrics is carried out. PBlock is compared with a pipelined approach in Section 7.5, followed by the conclusions from this work.

7.2 Implementation of PBlock using PIFNS

The Blowfish algorithm is made parallel using the PIFNS framework. Apart from the requirement of faster cipher execution, additional factors are vital and serve as motivation for this research. Possibly the most important of these is the ability of the memory system to feed data to the processor at the required rate. "There is a mismatch between processor speed and DRAM latency and is normally bridged by a hierarchy of successively faster memory devices called caches that rely on locality of data reference to deliver higher memory system performance (Kumar et al., 1994)". Parallel platforms typically yield better memory system performance than sequential ones, because they provide larger aggregate caches and higher aggregate bandwidth to the memory system, both typically linear in the number of processors. This argument extends to disks, where parallel platforms can be used to achieve high aggregate bandwidth to secondary storage. The other benefit of parallelism is energy efficiency. In this section, a brief introduction to the parallel methodology is given, along with the design of the parallel F function.

7.2.1 Parallel Methodology

The PBlock execution model is based on the data parallel model, which can easily be mapped onto the PIFNS framework. According to the model, "the tasks are statically mapped on to the processes and each task performs similar operations on different data (Kumar et al., 1994)". As all tasks carry out a similar set of computations, the decomposition of the problem into tasks is based on data partitioning, because a uniform partitioning of data followed by a static mapping is sufficient to guarantee load balance. Data parallel algorithms can be implemented in the message passing paradigm as well as in the shared address space paradigm. For this implementation, the shared address space paradigm has been used. An important characteristic of the data parallel model is that, for most problems, the degree of data parallelism increases with the size of the problem, making it possible to use more processes/cores to solve larger problems effectively. At the same time, reducing the interaction overhead between concurrent tasks is important for an efficient parallel program. The overhead that a parallel program incurs due to interaction among its processes depends on many factors, such as the volume of data exchanged during interactions, the frequency of interaction, and the spatial and temporal pattern of interactions. To reduce interaction overheads, the following techniques have been used (Kumar et al., 1994):

Maximizing data locality: The interaction overheads in a parallel program can be reduced by using methods that favour the use of local data or data that has been fetched recently.

Minimizing the volume of data exchange: Another important technique for reducing the interaction overhead is to minimize the overall amount of shared data that needs to be accessed by parallel processes. This is similar to maximizing temporal data locality.

Minimizing the frequency of interactions: Minimizing the frequency of interactions is important in reducing the interaction overheads in parallel programs because there is a comparatively high startup cost associated with each interaction on most architectures. Interaction frequency can be reduced by redesigning the algorithm so that the shared data is accessed and used in large chunks.

Consider the algorithmic steps of PBlock encryption process and parallel F

function to discuss all above factors in detail. For reader’s reference the

parallel implementation of PBlock using OpenMP has been given in

Appendix-E.

Procedure: Parallel Encryption

Input: Plaintext

Output: Cipher text

Declare: Y, LHalf, RHalf, i, Lblock, Rblock, pi

1. Plaintext and Block size (as shared variable)

2. Y, LHalf, RHalf as (Private variable to increase data locality)

3. Divide plaintext into 64 bit n number of blocks

4. ParBegin

4.1 For i=0 to n-1:

4.2 Declare LHalf=Y, RHalf=Y+32(To create specific chunk size)

4.3 Divide each block into 32-bit halves: Lblock, Rblock

4.3.1 For r=1 to 16 (r is the round counter, distinct from the block index i)

4.3.2 Lblock = Lblock XOR p[r]

4.3.3 Rblock =F (Lblock) XOR Rblock [Figure 7]

4.3.4 Swap Lblock and Rblock

4.4 End For

5. ParEnd

6. Swap Lblock and Rblock again to undo the last swap

7. Then, Rblock=Rblock XOR p17 and Lblock=Lblock XOR p18.

8. Recombine Lblock and Rblock to get the cipher text

9. Finally recombine the output of all blocks to get cipher text

Algorithm 7.1 Algorithmic steps for Encryption process in PBlock
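The steps of Algorithm 7.1 can be sketched as follows. This is a minimal illustration in Python, not the OpenMP implementation referred to in Appendix-E. The round function `toy_f`, the helper names and the chunking policy are assumptions introduced here, and the P-array is supplied by the caller rather than derived from a key. Python threads are used only to show the decomposition (each worker receives a contiguous chunk of independent ECB blocks and keeps its own working variables); they are not expected to reproduce the timings reported in this chapter.

```python
from concurrent.futures import ThreadPoolExecutor

MASK32 = 0xFFFFFFFF


def toy_f(x):
    # Placeholder round function (assumption): NOT the S-box based F of Blowfish.
    return ((x * 0x9E3779B1) ^ (x >> 15)) & MASK32


def encrypt_block(block, p, f=toy_f):
    """Process one 64-bit block (an int) with an 18-entry P-array of 32-bit words."""
    left, right = block >> 32, block & MASK32   # private to this call
    for r in range(16):
        left ^= p[r]
        right ^= f(left)
        left, right = right, left
    left, right = right, left                   # undo the last swap
    right ^= p[16]
    left ^= p[17]
    return (left << 32) | right


def encrypt_chunk(blocks, p):
    # One interaction per chunk: a worker handles many blocks before returning.
    return [encrypt_block(b, p) for b in blocks]


def parallel_ecb(blocks, p, workers=4):
    """Split the block list into contiguous chunks and process them concurrently."""
    size = max(1, -(-len(blocks) // workers))   # ceiling division
    chunks = [blocks[i:i + size] for i in range(0, len(blocks), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(encrypt_chunk, chunks, [p] * len(chunks))
    return [c for chunk in results for c in chunk]
```

Because the structure is a Feistel network, calling the same routine with the P-array reversed recovers the original blocks, which gives a convenient self-check for any choice of round function.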


Procedure: Feistel Function

Input: 32 bit Right Half

Output: 32 bit data

1. Declare a, b, c, d, y1, y2 and Message (32 bit input data which is

shared among each core)

2. ParBegin

2.1 Calculate a, b, c, d concurrently (private data to each core)

3. ParEnd

4. Synchronization constructs

5. ParBegin

5.1 calculate y1=s[0][a]+s[1][b] and y2= s[2][c]+s[3][d]

concurrently (private data to each core)

6. ParEnd

7. Synchronization constructs

8. calculate y3=y1^y2

Algorithm 7.2 Algorithmic steps for parallel F function

In Algorithm 7.1, line 1 declares two shared variables, the plaintext and

the block size. These are stored in the shared address space because every

core needs to access them; in particular, the block size must be known to

each core. Line 2 declares the private data members. Each core must have

its own iterations and its own local variables for its calculations, so Y,

LHalf and RHalf are declared as private data members. If these variables

were not private to each core, the cores would overwrite one another's

values and the data would be lost. Line 3 divides the plaintext into 64-bit

blocks, and from line 4 onwards multiple blocks are processed in parallel.

The implementation uses ECB mode, in which each block is independent of

the others. Synchronization constructs are used to coordinate the work done

by multiple cores. Similarly, in Algorithm 7.2, line 1 declares a, b, c, d,

y1 and y2 as private variables and Message as a variable shared by all

cores, so that inter-processor communication between the available cores is

reduced. Lines 2.1, 5.1 and 8 operate on private data members so that each

calculation is independent, which increases data locality.

7.2.2 Design of Parallel F function

The function takes 32-bit input data, which is divided into four eight-bit

quarters. Each quarter indexes an S-box, and each S-box entry outputs

32 bits of data. The outputs of S-box 1 and S-box 2 are added by one core,

and the outputs of S-box 3 and S-box 4 are added by another core. Finally,

an XOR operation is applied to the two sums to produce the 32-bit output.

In this method, the whole process is executed in three instructions, compared

with the sequential F function, which takes seven instructions for the

complete process. Figure 7.1 represents the functionality of the parallel F

function.

Fig. 7.1 Graphical representation of F function
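The data flow of Algorithm 7.2 can be sketched as below. This is an illustrative Python version under stated assumptions: the S-box contents are placeholder values generated from a seed (not Blowfish's real tables) and all function names are introduced here. For comparison, the sketch also includes the F function as published for Blowfish, ((S1[a] + S2[b]) XOR S3[c]) + S4[d] with additions modulo 2^32. That form chains its three operations, so it generally yields a different value from the two-sums-then-XOR form described in Algorithm 7.2.

```python
import random

MASK32 = 0xFFFFFFFF


def make_placeholder_sboxes(seed=0):
    """Four 256-entry tables of 32-bit words (placeholder values, an assumption)."""
    rng = random.Random(seed)
    return [[rng.getrandbits(32) for _ in range(256)] for _ in range(4)]


def split_quarters(x):
    """Split a 32-bit word into four 8-bit quarters a, b, c, d (a most significant)."""
    return (x >> 24) & 0xFF, (x >> 16) & 0xFF, (x >> 8) & 0xFF, x & 0xFF


def f_as_described(x, s):
    """F as laid out in Algorithm 7.2: two independent sums, then one XOR."""
    a, b, c, d = split_quarters(x)
    y1 = (s[0][a] + s[1][b]) & MASK32   # independent of y2, could run on one core
    y2 = (s[2][c] + s[3][d]) & MASK32   # independent of y1, could run on another
    return y1 ^ y2


def f_reference_blowfish(x, s):
    """F as published for Blowfish: each operation depends on the previous one."""
    a, b, c, d = split_quarters(x)
    return ((((s[0][a] + s[1][b]) & MASK32) ^ s[2][c]) + s[3][d]) & MASK32
```

Comparing the two functions on the same input and tables is a quick way to check whether a parallel F still matches the reference cipher before measuring its avalanche behaviour.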

7.3 Security Analysis using Avalanche effect

A single change in the plaintext or key should produce a change in

numerous bits of the cipher text. This property is called the avalanche

effect (Webster and Tavares, 1986). If too few bits change, an attacker may

be able to reduce the size of the key space or plaintext space to be

searched, which makes cryptanalysis much easier. A cryptographic algorithm

should therefore exhibit a strong avalanche effect to be considered secure,

and this is why the thesis uses the avalanche effect to make sure that the

parallel implementation has not compromised the security of the existing

Blowfish algorithm. If a single change is made in the plaintext, two bits in

the cipher text are affected with the existing algorithm, and the same

happens with PBlock because the security architecture of both algorithms is

the same. The text and tables below show that the avalanche effect of

Blowfish and that of PBlock closely match.

Modified plaintext and corresponding Cipher text generated by Blowfish:

Plaintext: “It is soon posted on the sci.crypt newsgroup, and from there to

many sites on the Internet. The leaked code was confirmed to b”

Cipher text:

”cc13b58d468422cfa4d491d475c8d78996b5db84a6a7b4be87469124801b2

edbbba75cc059712e6d5c10157aa52440ce85c6c9828de1581cd59d5c0d76c4

6826d616e79207369746573206f6e2074686520496e7465726e65742e20546

865206c65616b656420636f64652077617320636f6e6669726d656420746f2

062a0”

After changing 4 bit positions in plaintext: i.e “It is soon hosted on the

sci.crypt newsgroup, and from there to many sites on the Internet. The

peaked mode was confirmed to d”

The cipher text changes as follows:

“cc13b58d468422cfa4d4917575c8d78996b5db84a6a7b4be87469124801b2

edbbba75cc059712e6d5c1015a6a52440ce85c6c9828de1581cd59d5c0d76c4

6826d616e79207369746573206f6e2074686520496e7465726e65742e20546

865207065616b6564206d6f64652077617320636f6e6669726d656420746f2

064a0”


After changing 11 bit positions in plaintext: i.e “It is very hosted in the

sci.crypt newsgroup, and home there to nany sites on the Internet. The

peaked mode was confirmed to d”

Cipher text:


cc13b58d4684b575581f917575c8d78996b542d49697b4be87469124801b2e

dbbba75cc059713abe871315a6a52440ce85cea107c881581cd59d5c0d76c46

826e616e79207369746573206f6e2074686520496e7465726e65742e205468

65207065616b6564206d6f64652077617320636f6e6669726d656420746f20

64a0

Similarly for PBlock:

Plaintext: “It is soon posted on the sci.crypt newsgroup, and from there to

many sites on the Internet. The leaked code was confirmed to b”

Cipher text:

“497420697320736f6f6e20706f73746564206f6e20746865207363692e63727

97074206e65777367726f75702c20616e642066726f6d20746865726520746

f206d616e79207369746573206f6e2074686520496e7465726e65742e20546

865206c65616b656420636f64652077617320636f6e6669726d656420746f2

062a0”

After changing 4 bit positions in plaintext: i.e “It is soon hosted on the

sci.crypt newsgroup, and from there to many sites on the Internet. The

peaked mode was confirmed to d”

Cipher text:


“497420697320736f6f6e20686f73746564206f6e20746865207363692e6372

797074206e65777367726f75702c20616e642066726f6d2074686572652074

6f206d616e79207369746573206f6e2074686520496e7465726e65742e2054

6865207065616b6564206d6f64652077617320636f6e6669726d656420746f

2064a0”

After changing 11 bit positions in plaintext: i.e “It is very hosted in the

sci.crypt newsgroup, and home there to nany sites on the Internet. The

peaked mode was confirmed to d”

Cipher text:

“4974206973207665727920686f7374656420696e20746865207363692e637

2797074206e65777367726f75702c20616e6420686f6d65207468657265207

46f206e616e79207369746573206f6e2074686520496e7465726e65742e205

46865207065616b6564206d6f64652077617320636f6e6669726d656418746

f2064a0”

Table 7.1 Avalanche effect in Blowfish and PBlock: change in plaintext

Bits changed in plaintext | Bits changed in cipher text (Blowfish) | Bits changed in cipher text (PBlock)

4 | 5 | 5

11 | 31 | 32

19 | 66 | 68

Table 7.2 Avalanche effect in Blowfish and PBlock: change in key

Bits changed in key | Bits changed in cipher text (Blowfish) | Bits changed in cipher text (PBlock)

4 | 5 | 5

11 | 31 | 32

19 | 66 | 68
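Counts such as those in Tables 7.1 and 7.2 can be obtained by XOR-ing two equal-length cipher texts and counting the set bits. The following is a minimal sketch, assuming the cipher texts are available as hexadecimal strings with an even number of digits; the function names are introduced here for illustration and are not taken from the thesis.

```python
def bit_difference(hex_a, hex_b):
    """Number of bit positions in which two equal-length hex strings differ."""
    a = bytes.fromhex(hex_a)
    b = bytes.fromhex(hex_b)
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def avalanche_ratio(hex_a, hex_b):
    """Fraction of bits that changed (each hex digit carries 4 bits)."""
    return bit_difference(hex_a, hex_b) / (4 * len(hex_a))
```

For example, the plaintext change from "posted" to "hosted" replaces byte 0x70 with 0x68, so `bit_difference("70", "68")` counts two changed bits.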


To study the performance improvements achieved through the PBlock

algorithm, the sequential Blowfish cryptographic algorithm was first

executed to evaluate its execution time in the given environment. The same

compiler and the same configuration options have been used for every run.

To assess the parallel framework, all tests use the text files

from t5-Corpus11 (Roussev). Tables 7.3-7.7 show the time taken by

Blowfish and PBlock while executing on multiple cores. The first column

of each table shows the size of the input used for encryption and

decryption. The second and third columns show the execution time [in seconds]

for the encryption and decryption processes, and the last column shows the

overall time taken by both processes. All the data has been tested on a server

with the following configuration:

AMD FX(tm) - 8320, eight core processor running @ 3500 MHz, 64 bit

operating system with 8 GB RAM.

Table 7.3 Time taken by Blowfish to encrypt/decrypt large data files by single processor

File Size[ GB ] Encryption time Decryption time Overall time

0.1 20.06685 20.11274 40.17959

0.2 40.06112 40.18968 80.25080

0.3 60.19072 60.87597 121.0667

0.4 80.11484 81.17861 161.2935

0.5 100.14973 101.49825 201.6480

0.6 120.16994 121.77476 241.9447

0.7 140.20940 142.06767 282.2771

0.8 160.52404 161.84168 322.3657

0.9 180.31254 182.75779 363.0703

1.0 200.59846 201.02998 401.6284

Table 7.4 Time taken by PBlock to encrypt/decrypt large data files using 2 cores

File Size[ GB ] Encryption time Decryption time Overall time

0.1 10.5349 10.5549 21.0898

0.2 19.5527 19.5727 39.1254

0.3 30.16668 30.36668 60.53335

0.4 40.30338 40.34338 80.64675

0.5 50.312 50.512 100.824

0.6 60.47618 60.49618 120.9724

0.7 71.05928 71.07128 142.1386

0.8 80.58143 80.60143 161.1829

0.9 90.25758 90.28758 180.5352

1.0 99.8871 99.9271 199.8142



Table 7.5 Time taken by PBlock to encrypt/decrypt large data files using 4 cores

File Size[ GB ] Encryption time Decryption time Overall time

0.1 5.51086 5.53086 11.04172

0.2 11.00007 11.00607 22.00614

0.3 16.47954 16.49954 32.97908

0.4 22.22653 22.26653 44.49307

0.5 27.39367 27.41367 54.80734

0.6 32.95798 32.99798 65.95596

0.7 38.44913 38.48913 76.93826

0.8 44.02625 44.04625 88.07250

0.9 50.24433 50.26433 100.50866

1.0 54.85922 54.86122 109.72044

Table 7.6 Time taken by PBlock to encrypt/decrypt bits of input stream using 6 cores

File Size[ GB ] Encryption time Decryption time Overall time


0.1 3.82568 3.84568 7.67136

0.2 7.689945 7.709945 15.3999

0.3 11.50644 11.52644 23.0329

0.4 15.27489 15.29489 30.5698

0.5 19.12053 19.14053 38.2611

0.6 22.9221 22.9421 45.8642

0.7 26.66044 26.68044 53.3409

0.8 30.76716 30.78716 61.5543

0.9 34.45764 34.47764 68.9353

1.0 38.09243 38.11243 76.2049


Table 7.7 Time taken by PBlock to encrypt/decrypt bits of input stream using 8 cores

File Size[ GB ] Encryption time Decryption time Overall time


0.1 3.014415 3.016415 6.03083

0.2 5.981855 5.997855 11.9837

0.3 9.03536 9.05536 18.0907

0.4 12.15148 12.17148 24.323

0.5 15.23598 15.25598 30.492

0.6 18.22181 18.24181 36.4636

0.7 21.31759 21.33759 42.6552

0.8 24.25144 24.27144 48.5229

0.9 27.12253 27.14253 54.2651

1.0 30.0599 30.0799 60.1398

After obtaining results on different numbers of cores, the performance gains

are visible: adding cores reduces the execution time

substantially. The encrypted and decrypted text using PARC4 has

been given in Appendix-D for the reader’s reference. The OpenMP implementation

for the same has been given in Appendix G.


The following metrics have been used to examine the benefits of

parallelism of proposed algorithm.

7.5.1 Speedup

From Tables 7.3 to 7.7, it can be observed that PBlock gives a

6.6X speedup using eight cores and a 5.2X speedup using six cores. The following

figure shows the execution time taken by PBlock with different numbers of cores.

Fig. 7.2 Speedup comparison of PBlock using multiple cores
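The speedup figures quoted above follow from dividing the single-core time by the multi-core time. The sketch below is an illustrative calculation using the 1.0 GB encryption times from the tables in this chapter; the dictionary and function names are assumptions introduced here, and the 4-core entry is taken from the table that appears between Tables 7.4 and 7.6, which is assumed to be the four-core run.

```python
def speedup(t_serial, t_parallel):
    """Speedup S = T(1) / T(p)."""
    return t_serial / t_parallel


# Encryption time in seconds for the 1.0 GB file, keyed by number of cores.
ENCRYPT_1GB = {
    1: 200.59846,   # Table 7.3 (Blowfish, single processor)
    2: 99.8871,     # Table 7.4
    4: 54.85922,    # table between 7.4 and 7.6, assumed to be the 4-core run
    6: 38.09243,    # Table 7.6
    8: 30.0599,     # Table 7.7
}

# Speedup relative to the single-core run for each core count.
speedups = {p: speedup(ENCRYPT_1GB[1], t) for p, t in ENCRYPT_1GB.items()}
```

With these inputs the eight-core and six-core entries come out at roughly 6.7 and 5.3, in line with the 6.6X and 5.2X quoted in the text.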

It is clear from the graph that execution time increases as the file size

increases. On the other hand, for a constant data size, as the number of

cores increases the speedup tends to saturate and starts decreasing after a

specific point. As per the conclusions drawn from Amdahl’s law, once

speedup saturates, efficiency drops. The point at which this happens

depends on the type of parallel model used to execute the problem, the

size of the data associated with the problem and the number of cores used to

process that data. Figure 7.3 shows that the execution time decreases as

the number of cores increases but saturates at a specific point. For the 1 GB

data file, at most 16 cores are sufficient; beyond that the

improvement is very small and is negligible relative to the cost of the

extra cores.

Fig 7.3: For constant file size speedup tends to saturate at specific point

7.5.2 Efficiency

The same formula that was used to measure efficiency for

PARC4 and PARC4-I is used to measure the efficiency of the PBlock

algorithm. Let T(1) = 401.8 seconds, T(n) = 60.6 seconds and n = 8 to

process 1 GB of data. The efficiency of PBlock is $E = \frac{T(1)}{n \cdot T(n)} = 0.82$.

Since $0 < E \le 1$, PBlock achieved an efficiency in that range, as

$0 < 0.82 \le 1$. Table 7.8 shows that if the number of cores

increases for a constant file size the efficiency drops, but if the problem

size increases and the number of cores remains constant, the efficiency

remains constant.

Table 7.8 Efficiency Vs number of processing elements for different file size

File size [GB] P=4 P=6 P=8

0.6 0.91 0.88 0.83

0.7 0.91 0.88 0.83

0.8 0.91 0.88 0.83

0.9 0.91 0.87 0.82

1.0 0.91 0.87 0.82
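Efficiency is the speedup divided by the number of processing elements, E = T(1) / (p x T(p)). A minimal sketch of the calculation follows; the function name is an assumption introduced here. With the rounded timings quoted in the text (401.8 s and 60.6 s on eight cores) the formula evaluates to about 0.83, marginally above the 0.82 listed in Table 7.8, a difference that is presumably due to rounding or truncation of the reported values.

```python
def efficiency(t_serial, t_parallel, cores):
    """Parallel efficiency E = T(1) / (p * T(p)); at most 1 unless speedup is superlinear."""
    return t_serial / (cores * t_parallel)


# Rounded overall times for 1 GB quoted in Section 7.5.2.
e_8_cores = efficiency(401.8, 60.6, 8)
```

The same function can be applied row by row to the timing tables to regenerate an efficiency table for other file sizes and core counts.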


7.5.3 Complexity and Cost optimality

If there are n data blocks, where n is any positive number, and p is the number

of cores, where n > p, then n/p steps are required to execute the n blocks in

parallel. Thus the overall parallel execution time for PBlock is $\Theta(n/p)$,

where n is the number of blocks and p is the number of processing

elements. Therefore, parallel execution time can be defined as:

$T_p = \Theta(n/p)$ (7.1)

Consequently, its cost is:

Cost $= p \times T_p$ (7.2)

$= p \times \Theta(n/p)$ (7.3)

$= \Theta(n)$ (7.4)

The above equations show that the cost of the parallel algorithm is $\Theta(n)$,

which is the same as the serial runtime. Hence, PBlock is cost optimal.
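The cost argument can be checked numerically by counting block-steps. The sketch below is illustrative only: it assumes every block takes one unit of time and that blocks are dealt out evenly, and the function names are introduced here.

```python
import math


def parallel_steps(n_blocks, cores):
    """Steps needed when each core processes its share of the blocks in turn."""
    return math.ceil(n_blocks / cores)


def cost(n_blocks, cores):
    """Cost = number of cores x parallel time, measured in block-steps."""
    return cores * parallel_steps(n_blocks, cores)
```

The serial algorithm needs n block-steps, and `cost(n, p)` never exceeds n + p - 1, so for n > p the cost stays within a constant factor of the serial work, which is what cost optimality requires.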

7.5.4 Scalability

The ability to maintain efficiency at a fixed value by simultaneously

increasing the number of cores and the size of the problem is exhibited by

many SMPs. Such systems are scalable parallel systems (Kumar et al.,

1994). Table 7.8 shows that when four symmetric cores

process ~1 GB of data the efficiency is 0.91, whereas the same problem

processed using eight cores has an efficiency of 0.82. That is, efficiency

drops as the number of cores increases. But when the problem size and the

number of cores are increased together, the efficiency remains almost the

same. This indicates that the proposed algorithm is scalable, as it keeps the

efficiency fixed when the problem size and the number of processing elements

increase simultaneously.

7.6 Comparative analysis of PBlock and Blowfish using Pipeline

approach

As described in Chapter 1, Kamak Ebadi, Victor Pena and Chen Liu have

proposed a pipelined approach to parallelize the Blowfish block cipher.

The pipelined approach was implemented on the Single-Chip Cloud

Computer (SCC), a 48-core experimental processor

created by Intel Labs as a platform for many-core software research.

Although the two approaches are very different because of the different

platforms used, they share a similar concept and a common goal, so the two

techniques can be compared on the basis of a few parameters:

Table 7.9 Comparison between PBlock and Pipelined approach

Parameter | PBlock | Pipelined approach

Parallel Computing Model | Data Parallel Model | Pipeline Model

Processor type | Symmetric multiprocessor system | Single-Chip Cloud Computer, a 48-core experimental processor

Communication Overhead | No (each core performs the independent task assigned to it) | Yes (because the input data passes in turn through all the cores involved)

Suitability | For larger input data | For smaller files (due to the communication overhead and latency associated with the model)

Speedup | Attains good speedup for large input data due to domain decomposition. | Achieves ample speedup for very small files due to functional decomposition.

Conclusion

This chapter introduced PBlock, the data-parallel version of

Blowfish, and concluded that the parallel approach is much faster than the

sequential method, giving a 6.6X speedup using eight symmetric cores. The

parallel algorithm was also shown to be cost optimal, as its cost has the

same asymptotic complexity as the sequential algorithm.

The approach proposed in this research involves no communication

overhead. Thus, it is suitable for encrypting large data files.


CHAPTER 8

ANALYSIS OF ENERGY CONSUMPTION BY PROPOSED

PARALLEL ALGORITHMS

In today’s mobile computing era, a large number of battery-operated embedded

systems such as cell phones, smart cards, and health monitoring devices are used

to access, store, and manipulate complex and confidential data. Security

concerns in such systems range from user identification to secure software

execution and secure information storage. To implement security techniques in

these systems, cryptographic algorithms are used extensively. But as

discussed in previous chapters, these cryptographic algorithms perform

compute-intensive calculations to encrypt/decrypt data and consume a lot of

energy as a result. This chapter presents a detailed analysis of the energy

consumed by the serial and parallel algorithms. Some encouraging reductions in

energy relative to the sequential algorithms have been achieved in

experiments on an eight-core parallel machine, estimated using Joulemeter

(Microsoft’s research tool).

8.1 Introduction

Energy costs have become increasingly important to computing, since

they directly impact the power provisioning cost for computing

infrastructures, the operating expenses of both data centers and

enterprise infrastructures, and the battery life of laptops and

other mobile devices. Cryptographic algorithms are well known for performing

large and complex computations to protect important data files from

illegal or unauthorized access. Because of the heavy computation

intrinsic to encryption/decryption algorithms, they tend to consume a

significant amount of energy. As explained by Krishnamurthy (2003),

encrypting only 13.6 kilobytes of data using the Blowfish block cipher

on a mobile device drains about 75% of the battery power. Many

researchers have contributed to this area of key significance.

Various power management techniques to address power consumption

issues, such as power gating, adaptive voltage and frequency scaling, and

active body-bias, have been discussed in the literature (Kapoor and

Verma, 2011). In the article “Computational and Energy Costs of

Cryptographic Algorithms on Handheld Devices”, the authors carried out

an extensive analysis of the costs of private and public key

infrastructure based algorithms and hash functions, and compared them

with the costs of basic operating system functions. Their results show that

cryptographic energy costs are high and that such operations should be

limited in time (Rifa-Pous and Herrera-Joancomartí, 2011).

The rest of the chapter is organized as follows. Section 8.2 describes the

motivation for the study. The tools and techniques used for the

research are listed in Section 8.3. Joulemeter, a Microsoft tool,

has been used to measure application-level energy, and

Section 8.4 describes how the tool works. Experimental results

are given in Section 8.5 together with their discussion, followed by

the conclusion.

8.2 Motivation

There is a fundamental relationship between power and frequency

(Korthikanti and Agha, 2009). The shift to multicore processors is a result

of increasing power consumption in microprocessors. In a multicore

architecture, each core can be operated at a lower frequency, dividing

among the cores the power usually given to a single core while reducing the

overall power consumption. This is because the operating voltage can also

be lowered when the frequency is reduced, and power consumption

has a quadratic dependence on the supply voltage (Chandrakasan and

Brodersen, 1995, Chandrakasan et al., 1992). Symmetric multi-processor

architectures have been proposed as a method to increase computation

cycles with less energy consumption. As the relationship between power

and frequency of a core is non-linear, on a uniprocessor energy

consumption can be reduced by lowering the frequency at which it

operates. However, lowering the frequency of a single processor

decreases the performance of the algorithm. A parallel algorithm consists of

a few serial sub-computations, parallel computations, and communication

between the parallel sub-computations. Thus the performance and energy

cost of a parallel algorithm depend on two factors: the

number of cores and the frequency at which each core operates, and the

structure of the parallel algorithm. In the previous chapters, three different

parallel algorithms have been proposed. This chapter therefore focuses

on reducing the energy consumed by compute-intensive

processes/applications by lowering the frequency. This

change, however, affects the speed of the application. According to

(Krishnamurthy,2003), “if we increase clock frequency by 20 % of a

single core, it can provide a 13% performance increase, but it requires

approximately 73% greater power. On the other hand, if we decrease

clock frequency by 20% the reduction of power can be up to 49% but

causes only 13% performance loss. If another core is added into the

single core design, it results in a dual-core processor that at 20% reduced

clock frequency; this design can provide approximately 73% more

performance while using the same power as a single core processor at

maximum frequency” (Mani and Jee, 2007). This research also points out

that power consumption can be reduced while providing optimal performance to

the systems.
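The quadratic dependence on supply voltage mentioned above comes from the standard dynamic power relation for CMOS logic, P = a x C x V^2 x f, where a is the activity factor, C the switched capacitance, V the supply voltage and f the clock frequency. The sketch below is a simplified illustration under stated assumptions: it ignores static (leakage) power, treats a and C as constant, and uses the desktop's default state from Table 8.1 (1.332 V, 3500 MHz) as the reference. The function name and the choice of reference are introduced here and are not part of the thesis's measurements.

```python
def relative_dynamic_power(voltage, freq_mhz, ref_voltage=1.332, ref_freq_mhz=3500):
    """Dynamic power relative to a reference state, using P proportional to V^2 * f."""
    return (voltage / ref_voltage) ** 2 * (freq_mhz / ref_freq_mhz)


# Low power states listed in Table 8.4 as (voltage in V, frequency in MHz).
LOW_POWER_STATES = [(1.3, 2900), (1.1875, 2300), (1.0625, 1700), (0.95, 1400)]

# Estimated dynamic power of each state as a fraction of the default state.
relative_powers = [relative_dynamic_power(v, f) for v, f in LOW_POWER_STATES]
```

Under this simplified model the lowest state (0.95 V, 1400 MHz) draws roughly a fifth of the dynamic power of the default state while running at 40% of the clock rate, which is the kind of trade-off this section argues multicore execution can exploit.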

8.3 Tools and Techniques used for energy measurement

To measure and compare the energy consumed by the proposed parallel

algorithms, the following tools have been used (as discussed in Chapter 2):

Operating system: Windows 7

Framework: dot net service pack 1

Compiler : MinGW

Editor : Code blocks

Tools: Joulemeter


8.4 How Joulemeter works to measure energy

In Joulemeter, power data is shown for the computer as a whole as well as

for the key hardware components (Bekaroo et al.). Power data for a

specific application can also be tracked using this software tool. The data

can be stored periodically to a file if desired.

Fig 8.1-Power metering interface exposed by Joulemeter

Joulemeter estimates the power usage through a power model that relates

the computer resource usage and hardware power state (processor

frequency, processor utilization, screen intensity, monitor on/off state,

disk usage) to power drawn. This power model is derived using a process

called calibration. On laptops calibration can be performed without any

external power meter. For desktops, a Watts UP PRO power meter is

required. If such a meter is not available, approximate power data can be

monitored. In this thesis no external power meter is used to measure

the energy of the proposed parallel algorithms; thus the machine needs to

be calibrated with the values shown in Table 8.4.

After setting these values, we specify the application’s name on the Power

usage tab and then start the application. We then execute the program

using Code::Blocks and check Joulemeter to see the energy used by the

application at each time stamp. By adding up the energy consumed, in

joules, at every time stamp, we can compute the energy consumed by the

application over a period of time. Figure 8.2 shows the Excel file

generated by Joulemeter for PARC4.

Fig 8.2: Data file of PARC4 consisting joules consumed at each time stamp
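The summation described above, and the conversion to the µJ/B and MB/s units used in the result tables, can be sketched as follows. This is a minimal illustration: the sample layout (pairs of time stamp and joules) is a hypothetical stand-in for the Joulemeter log rather than its actual file format, and taking 1 MB as 2^20 bytes is an assumption.

```python
def total_energy_joules(samples):
    """Sum the per-time-stamp energy readings; samples are (timestamp, joules) pairs."""
    return sum(joules for _, joules in samples)


def microjoules_per_byte(total_joules, n_bytes):
    """Energy cost per processed byte in microjoules."""
    return total_joules * 1e6 / n_bytes


def throughput_mb_per_s(n_bytes, seconds):
    """Processing rate in MB/s, taking 1 MB = 2**20 bytes (assumption)."""
    return n_bytes / (1024 * 1024) / seconds
```

For example, readings of 1.5 J, 2.0 J and 0.5 J over three time stamps total 4.0 J; spread over 2,000,000 bytes that is 2.0 µJ/B.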

8.5 Energy Measurement

This section covers the comprehensive results and detailed analysis for

energy consumption by proposed algorithms (as discussed in Chapter 4, 5,

and 7) using the described experimental setup.

8.5.1 Result and Analysis

To measure the energy cost of the proposed algorithms versus the existing

algorithms, the following test environments have been used:

Platform 1: A laptop with an Intel Core 2 Duo T5270 CPU at 1.40 GHz,

2 GB RAM and the 32-bit Windows 7 operating system.


Platform 2: A Desktop having AMD FX (tm) - 8320 Eight - Core

Processor, 3.5 GHz frequency , 8 GB RAM and 64-bit Windows 7

Operating System.

On both test environments, Windows 7 with Joulemeter has been

installed to measure application-level energy consumption with the system’s

default settings for frequency and voltage. Table 8.1 specifies the

calibrated and non-calibrated states of each system.

Table 8.1: Calibrated and Non-calibrated specification

Laptop Desktop

Default / non-calibrated 1.2 GHz frequency and 1.2

voltage

3.5 GHz frequency and 1.332

voltage

Calibrated [operate processor

by using low frequency]


voltage


Voltage

The thesis proposes three different parallel algorithms for block cipher

and stream cipher. Energy characteristics for all of these algorithms have

been described in the following tables.

Table 8.2: Energy consumed by Blowfish and PBlock with system’s default frequency

and voltage

Platform 1 Platform 2

Algorithm µJ/B MB/s Algorithm µJ/B MB/s

Blowfish 12.21875 1.103448 Blowfish 0.1546875 1.855072464

PBlock 12.53125 1.939394 PBlock 0.28515625 5.333333333

Table 8.3: Energy consumed by existing and proposed parallel algorithms for stream

cipher technique using system’s default frequency and voltage

Platform 1 Platform 2

Algorithm µJ/B MB/s Algorithm µJ/B MB/s

RC4 0.283008 24.7343 RC4 0.036914063 32

PARC4 0.353125 40.99608 PARC4 0.058789063 128

RC4A 0.256953 30.08226 RC4A 0.031640625 39.38461538

PARC4-I 0.345371 51.71717 PARC4-I 0.067773438 131.2820513

Tables 8.2 and 8.3 give the results using the system’s default frequency

and voltage, i.e., using non-calibrated states. From these results, it can be

inferred that PBlock provides a 1.5X speedup over the serial version on

Platform 1 and approximately 2.5X on Platform 2, but at the same time the

parallel algorithms consume more energy than the sequential

algorithms. Similarly, PARC4 and PARC4-I are much faster than the

existing sequential algorithms but consume more µJ/B than

their sequential versions. The results on Platform 2 follow the same

pattern as those on Platform 1. The only difference is that the processor

runs at a low frequency and voltage on Platform 1, whereas on

Platform 2 it runs at a much higher frequency. That is why

the results on Platform 2 are better than those on Platform 1 even in

non-calibrated states. Serial algorithms are slower and consume less energy,

whereas parallel algorithms execute faster at the cost of more

energy. With parallel computing, each core can be operated at a low

frequency to reduce energy consumption while keeping

performance at the same level as the sequential methods. Thus, all experiments

have also been carried out for calibrated states of the processor. Table 8.4

specifies the low power states for Platform 2.
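The trade-off between speed and energy at default settings can be read directly from Table 8.2 by taking ratios of the parallel to the serial figures. The sketch below is an illustrative calculation; the dictionary layout and function name are assumptions introduced here, while the numbers are copied from Table 8.2. Computed this way, the throughput ratios come out at roughly 1.76 on Platform 1 and 2.88 on Platform 2, somewhat above the 1.5X and 2.5X quoted in the text, which may reflect different runs or rounding.

```python
# (energy in µJ/B, throughput in MB/s) at default frequency and voltage, Table 8.2.
TABLE_8_2 = {
    "Platform 1": {"Blowfish": (12.21875, 1.103448), "PBlock": (12.53125, 1.939394)},
    "Platform 2": {"Blowfish": (0.1546875, 1.855072464), "PBlock": (0.28515625, 5.333333333)},
}


def compare(platform, serial="Blowfish", parallel="PBlock", table=TABLE_8_2):
    """Return (throughput ratio, energy-per-byte ratio) of parallel over serial."""
    e_serial, t_serial = table[platform][serial]
    e_parallel, t_parallel = table[platform][parallel]
    return t_parallel / t_serial, e_parallel / e_serial
```

A throughput ratio above 1 together with an energy ratio above 1 is the pattern described above: the parallel version is faster but costs more energy per byte until the frequency is lowered.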

Table 8.4: Low power states of AMD-8320 processor

Voltage (V) Frequency (MHz)

1.3 2900

1.1875 2300

1.0625 1700

0.95 1400

After calibrating the system with the values given in Table 8.4, the

energy consumption of the parallel algorithms reduced drastically. Figure

8.4 shows that when the PBlock algorithm runs on multiple cores, each

operating at low frequency, the data rate is 1.18 MB/s and the

energy consumption is 7.1125 µJ/B.

Fig 8.3: Comparison of serial, parallel and parallel with calibration for energy

consumption using platform 1

From Figure 8.3, it is clear that the Blowfish algorithm consumes less

energy than the PBlock parallel algorithm at the system’s default

frequency and voltage, but PBlock is much faster than Blowfish.

On the other hand, when the frequency is reduced, the energy consumption of

PBlock falls drastically at the same performance level. Similarly, in

Fig. 8.4 both algorithms have been executed on Platform 2 after

scaling down the frequency. Again, the results show that at low frequency the

PBlock algorithm consumes less energy at the same performance level as

the sequential implementation.

Fig 8.4: Comparison of serial, parallel and parallel with calibration Blowfish and

PBlock for energy consumption using platform 2

Fig 8.5: Serial and Parallel algorithms for stream ciphers technique with default and

calibrated frequency using platform 1


Fig 8.6: Serial and Parallel algorithms for stream ciphers technique with default and

calibrated frequency using platform 2

Figures 8.5 and 8.6 show that at a lower frequency PARC4 and

PARC4-I consume less energy while providing high throughput on both

platforms. Thus, it has been observed that adding cores reduces

the computation carried out at each core, which improves

performance with respect to time. If the frequency is then lowered,

this gain in time is converted into a gain in energy. That means that, at the

same performance level, the parallel algorithm consumes less energy than

the sequential algorithm.

Conclusion

This chapter described and compared the energy cost of the proposed parallel

algorithms PARC4, PARC4-I and PBlock and of the serial algorithms RC4,

RC4A and Blowfish. The results show that the parallel algorithms are

much faster than the serial algorithms but consume more energy. SMPs

provide the option to reduce the frequency and voltage of the machine

through the dynamic voltage and frequency scaling technique. By using

this mechanism, parallel algorithms can become more energy efficient.

The analysis shows that the PBlock parallel algorithm consumes 58%

less energy than the Blowfish algorithm, and similarly PARC4 and

PARC4-I consume 63% and 54% less energy than RC4 and RC4A

respectively. On the other hand, speed has to be compromised:

the gain in time is traded for the gain in energy.

Overall, the study concluded that SMPs operating at low frequency give

promising results and can contribute significantly towards green

computing and, ultimately, towards society. Our results also indicate that

block ciphers consume more energy than stream ciphers while executing:

faster algorithms consume less energy because they operate at an

elevated level of power for less time, and stream ciphers are faster than

block ciphers.


CHAPTER 9

CONCLUSIONS AND FUTURE SCOPE

9.1 Thesis Contribution

Information security is the practice of protecting information from

unauthorized access, use, disclosure, disruption, modification, perusal,

inspection, recording, or destruction. The rapid growth and prevalent use

of electronic data processing and electronic business conducted through

the Internet, along with numerous occurrences of cyber-attacks, have fueled

the need for better methods of protecting computers and the information they

store, process and transmit. It is sensible to assume that anyone's

communication can be captured or altered over the network. Thus, to

safeguard the data from unauthorized use over the channel,

encryption and decryption are used.

Different encryption algorithms have been used to secure information

communications over the network. Along with the security of the

algorithms, there are two important aspects of these algorithms:

1. Speed:

Speed of encryption and decryption is an important aspect of security

algorithms. A slow cryptographic algorithm can slow down the speed of

an application and reduce its effectiveness.

2. Energy consumption

Apart from the speed, energy consumption of these cryptographic

algorithms is another crucial aspect due to the prevalent usage of mobile

devices today. As discussed in Chapter 8, to encrypt only 13.6 kilobytes

of data using Blowfish block cipher algorithm on a mobile device will use

up about 75% of its battery power.


This thesis has developed parallel algorithms for different symmetric
cryptographic algorithms. First, a parallel framework was designed for both
stream ciphers and block ciphers; it enables parallel algorithms to be
written without impairing their security. Next, PARC4, a parallel
implementation of the well-known RC4 stream cipher, was developed and
executed on an eight-core machine to measure its performance gains. PARC4
proved much faster than RC4, with approximately a 7X speedup over the
sequential implementation. RC4A was then implemented in parallel, and the
resulting algorithm is named PARC4-I. This implementation uses loop
unrolling as a key optimization technique and has proved better than PARC4.
It provides up to a 7.3X speedup and yields larger percentage gains than
PARC4 when fewer cores are used.

From the category of block ciphers, Blowfish was chosen for parallel
implementation because it is one of the latest cipher techniques using the
Feistel structure. The parallel algorithm is termed PBlock, an acronym for
Parallel Blocks. PBlock provides up to a 6.6X speedup when using eight
symmetric cores.

Power consumption for all parallel algorithms was measured using
Microsoft’s Joulemeter tool. It has been observed that, at the same
performance level, the parallel algorithms are more energy efficient than
the sequential algorithms.

9.2 Conclusions

The following conclusions are made:

The RC4 stream cipher can be parallelized with the data-parallel model; to
achieve this, suitable modifications must be made to the PRGA algorithm.
The resulting parallel algorithm, PARC4, achieves approximately a 7X
speedup and has been shown to be as secure as RC4. PARC4 uses more memory
than RC4, which is necessary to implement the parallel algorithm. It should
be applied to large data tasks to take full advantage of parallelism.
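For orientation, the sequential baseline can be sketched as follows. This is textbook RC4 (key scheduling followed by the PRGA), not the PARC4 algorithm itself; the function names are chosen here for illustration only. The sketch makes visible why the PRGA needs modification before it can be parallelized: each keystream byte depends on the state array S left behind by the previous step.

```python
def rc4_ksa(key: bytes) -> list:
    """Key-scheduling algorithm: build the initial 256-entry permutation S."""
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    return s


def rc4_keystream(key: bytes, n: int) -> bytes:
    """PRGA: emit n keystream bytes; every step mutates S, so steps are serial."""
    s = rc4_ksa(key)
    i = j = 0
    out = bytearray()
    for _ in range(n):
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]          # the swap determined by (i, j)
        out.append(s[(s[i] + s[j]) % 256])
    return bytes(out)


def rc4_crypt(key: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt: XOR the data with the keystream (same operation)."""
    keystream = rc4_keystream(key, len(data))
    return bytes(d ^ k for d, k in zip(data, keystream))


if __name__ == "__main__":
    ciphertext = rc4_crypt(b"Key", b"Plaintext")
    print(ciphertext.hex())
    print(rc4_crypt(b"Key", ciphertext))
```

Because encryption and decryption are the same XOR, applying `rc4_crypt` twice with the same key returns the original data.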


PARC4-I is faster than PARC4, but it takes up more space than PARC4 because
it uses additional lookup tables. Applications that have no memory
constraint and can benefit from the performance gains can therefore use
PARC4-I.

PBlock is the parallel approach for the Blowfish block cipher algorithm.
Blowfish is used in many products, for example for password encryption and
for file and disk encryption. In this research, the parallel approach was
tested in a file encryption scenario and showed a good speedup: PBlock
provides a 6.6X speedup.

Code optimization techniques have proved helpful in gaining additional
speedup. For example, in the implementation of PARC4-I, loop unrolling
worked well in conjunction with the data-parallel model because it reduces
loop overhead.
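The idea of loop unrolling can be illustrated with a minimal sketch. The code below is not taken from PARC4-I; it applies a four-way unrolled loop to the keystream XOR step, with function names invented for this example, and it assumes the keystream is at least as long as the data. The unrolled version performs a quarter of the loop-control steps (counter update and bound check) for the same amount of work.

```python
def xor_rolled(data: bytes, keystream: bytes) -> bytes:
    """Reference version: one XOR per loop iteration."""
    out = bytearray(len(data))
    for i in range(len(data)):
        out[i] = data[i] ^ keystream[i]
    return bytes(out)


def xor_unrolled4(data: bytes, keystream: bytes) -> bytes:
    """Four XORs per iteration, followed by a clean-up loop for the tail."""
    n = len(data)
    out = bytearray(n)
    main = n - (n % 4)                    # largest multiple of 4 not above n
    for i in range(0, main, 4):
        out[i] = data[i] ^ keystream[i]
        out[i + 1] = data[i + 1] ^ keystream[i + 1]
        out[i + 2] = data[i + 2] ^ keystream[i + 2]
        out[i + 3] = data[i + 3] ^ keystream[i + 3]
    for i in range(main, n):              # at most three leftover bytes
        out[i] = data[i] ^ keystream[i]
    return bytes(out)
```

In a compiled language the benefit comes from fewer branches and more room for instruction scheduling; in Python the sketch only shows the transformation, not a realistic performance gain.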

In each implementation, execution time was observed to grow in direct
proportion to the file size. In parallel systems, on the other hand, if the
number of processing cores is increased along with the problem size, the
execution time decreases and the speedup increases. However, if the problem
size is held constant while cores are added, the performance gains are
limited by the amount of sequential computation present in the problem
being addressed. The benefits of parallel computation therefore have a
ceiling, as stated by Amdahl’s Law.
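The ceiling can be made concrete with a short calculation. The sketch below uses the standard form of Amdahl's Law, S(p) = 1 / (f + (1 - f) / p), where f is the sequential fraction and p the number of cores. The serial fractions it derives from the reported speedups are implied by this idealized model only; they are not measurements from the thesis, and the function names are illustrative.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Speedup predicted by Amdahl's Law for a fixed problem size."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def implied_serial_fraction(speedup: float, cores: int) -> float:
    """Invert Amdahl's Law: the serial fraction that would yield this speedup."""
    return (1.0 / speedup - 1.0 / cores) / (1.0 - 1.0 / cores)


if __name__ == "__main__":
    # Speedups reported in this chapter on eight cores.
    reported = {"PARC4": 7.0, "PARC4-I": 7.3, "PBlock": 6.6}
    for name, speedup in reported.items():
        f = implied_serial_fraction(speedup, 8)
        ceiling = 1.0 / f                 # limit of the speedup as cores grow
        print(f"{name}: implied serial fraction {f:.4f}, ceiling {ceiling:.1f}X")
```

Under this model, a 7X speedup on eight cores corresponds to a serial fraction of 1/49 (about 2%), so no number of additional cores could push the speedup beyond 49X for the same problem size.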

When a parallel algorithm is executed on SMPs, energy can also be reduced
while operating at a performance level similar to that of the sequential
implementation. To reduce energy on a multicore architecture, the processor
cores have to be set to a low frequency and a low voltage using the dynamic
voltage and frequency scaling (DVFS) technique. The research concluded that
the PBlock parallel algorithm consumes 58% less energy than the Blowfish
algorithm and, similarly, that PARC4 and PARC4-I consume 63% and 54% less
energy, respectively, than their sequential versions at the same
performance level.
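Why lowering frequency and voltage saves energy at equal performance can be sketched with the usual first-order CMOS model, in which dynamic power is proportional to C·V²·f. The sketch below assumes ideal speedup (n cores at frequency f/n match the sequential runtime) and ignores static (leakage) power; the voltages in the usage example are hypothetical and are not the settings used in the thesis experiments.

```python
def dynamic_power(c_eff: float, voltage: float, frequency: float) -> float:
    """First-order dynamic power of one core: P = C * V^2 * f."""
    return c_eff * voltage ** 2 * frequency


def energy_ratio_same_performance(cores: int, v_seq: float, v_par: float,
                                  f_seq: float, c_eff: float = 1.0) -> float:
    """Parallel energy divided by sequential energy at equal runtime.

    Assumes ideal speedup, so `cores` cores clocked at f_seq / cores finish
    in the same time as one core clocked at f_seq.
    """
    runtime = 1.0                          # identical for both, by assumption
    e_seq = dynamic_power(c_eff, v_seq, f_seq) * runtime
    e_par = cores * dynamic_power(c_eff, v_par, f_seq / cores) * runtime
    return e_par / e_seq


if __name__ == "__main__":
    # Hypothetical operating points: 1.2 V at full speed, 0.8 V when slowed.
    ratio = energy_ratio_same_performance(8, v_seq=1.2, v_par=0.8, f_seq=2.0e9)
    print(f"parallel / sequential energy: {ratio:.2f}")
```

In this model the frequency terms cancel and the ratio reduces to (V_par / V_seq)², which is why the voltage reduction enabled by the lower frequency is the source of the saving.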


9.3 Future Scope

This thesis has concentrated on developing parallel cryptographic
algorithms for the symmetric-key approach. According to its results,
parallel algorithms for symmetric-key based security techniques are much
faster and more energy efficient than the existing sequential algorithms.
Some suggestions for future work follow:

The thesis has used the domain decomposition technique in almost all of
the parallel algorithms to divide the problem into subtasks and assign them
to different processes. Domain decomposition, or data decomposition,
normally works well either on large data sets or on data where similar
operations have to be performed on the complete data. In cases where the
data set is not that large and the type of operation differs for each task,
instruction-level parallelism can perform better. To obtain
instruction-level parallelism, however, the algorithms will in many cases
have to be redesigned completely.
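Domain decomposition as described above can be sketched as follows: the input is split into contiguous chunks aligned to the cipher's block size, each chunk is handed to a worker, and the results are concatenated in order. The per-block transform is a hypothetical placeholder, not Blowfish and not secure, and treating blocks independently is an assumption made for this illustration. Threads are used only to show the structure; in CPython they do not give a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 8  # bytes; Blowfish operates on 64-bit blocks


def toy_block_transform(block: bytes) -> bytes:
    """Hypothetical stand-in for a block cipher call; NOT Blowfish."""
    return bytes(b ^ 0x5A for b in block)


def split_into_chunks(data: bytes, workers: int) -> list:
    """Split data into at most `workers` contiguous, block-aligned chunks."""
    n_blocks = -(-len(data) // BLOCK_SIZE)             # ceiling division
    blocks_per_chunk = max(1, -(-n_blocks // workers))
    step = blocks_per_chunk * BLOCK_SIZE
    return [data[i:i + step] for i in range(0, len(data), step)]


def process_chunk(chunk: bytes) -> bytes:
    """Apply the same operation to every block of one chunk."""
    return b"".join(toy_block_transform(chunk[i:i + BLOCK_SIZE])
                    for i in range(0, len(chunk), BLOCK_SIZE))


def parallel_transform(data: bytes, workers: int = 8) -> bytes:
    """Domain decomposition: one chunk per worker, results joined in order."""
    chunks = split_into_chunks(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(process_chunk, chunks))


if __name__ == "__main__":
    message = bytes(range(100))
    transformed = parallel_transform(message)
    print(parallel_transform(transformed) == message)
```

The decomposition works here because every block receives the same operation and no block depends on another, which is exactly the situation in which the paragraph above says data decomposition performs well.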

Only PARC4, PARC4-I, and PBlock have been implemented using the PASCS and
PIFN frameworks. Other security algorithms fall under the same category and
can also be implemented and analyzed. Moreover, in this cloud computing
era, almost every electronic device, including commonly used smartphones,
has a multicore processor. These parallel algorithms can therefore be
applied to make such devices more efficient in terms of speed and energy.

Finally, the measurement of energy is itself an active area of research.
More accurate energy measurement techniques can be applied for more
detailed analysis as well as for measuring energy at a finer grain.


APPENDIX-A

//256 key bytes generated for a set of 256 plaintext bytes by RC4 algorithm

208 1 45 122 39 140 218 179 112 99

94 190 89 150 207 224 212 241 48 78

178 160 197 40 80 5 167 32 105 107

37 21 225 182 42 234 42 103 221 254

5 124 20 140 106 38 48 163 224 206

247 120 191 139 73 3 0 146 244 251

32 142 139 84 250 14 138 20 41 55

124 172 185 30 38 210 25 252 160 98

170 68 143 42 215 150 93 207 55 93

210 216 209 3 23 196 173 222 18 148

181 207 192 255 183 93 219 108 7 42

121 236 148 210 71 88 105 174 163 84

147 73 215 24 61 87 145 95 122 109

137 177 182 3 214 121 41 186 146 190

62 42 90 255 131 77 139 156 173 71

53 75 90 164 19 219 209 83 65 174

35 222 155 165 66 14 143 138 180 151

44 87 238 132 9 150 247 108 113 35

63 131 102 214 232 150 21 251 50 161

218 192 135 197 106 167 175 133 141 79

87 95 113 105 35 87 85 159 92 71

39 196 86 53 45 54 194 178 6 188

195 160 89 135 249 73 238 120 172 62

255 63 15 167 76 210 13 109 10 149

22 11 202 15 51 69 76 139 86 126

4 51 218 57 79 34


APPENDIX-B

//256 key bytes generated for a set of 256 plaintext bytes by PARC4

algorithm using PASCS framework

142 185 197 27 138 226 50 117 168 132

49 234 170 133 90 182 108 96 155 215

240 22 2 162 79 82 30 176 131 199

163 204 36 77 135 155 199 122 191 72

224 195 227 216 139 245 102 249 237 124

243 28 16 134 96 43 12 237 181 48

200 71 73 123 112 242 182 188 22 162

57 129 240 243 248 122 123 104 249 185

59 70 59 226 53 156 10 172 91 690

189 61 249 87 210 178 102 132 210 113

254 53 153 163 224 145 54 8 199 223

1 155 139 219 206 12 78 17 9 228

217 222 151 37 150 62 234 187 242 213

133 197 207 226 143 158 247 60 97 114

115 28 61 121 149 90 168 165 66 61

27 135 121 244 200 198 137 202 16 137

125 190 194 108 47 36 70 82 130 49

195 39 186 115 150 19 32 202 230 118

148 182 25 156 205 79 190 203 116 118

157 52 44 115 87 166 155 22 37 160

152 143 11 32 40 59 40 163 52 206

166 140 32 77 208 29 47 73 59 213

86 134 15 187 144 153 140 98 125 52

77 182 20 126 66 18 251 118 147 152

112 121 231 83 71 25


APPENDIX-C

// After 256 iterations are complete, i starts again from 1 to calculate the
// value of j, which determines the swap that takes place in the S array.

(i=1,j=183) (i=2,j=208) (i=3,j=17) (i=4,j=176) (i=5,j=239)

(i=6,j=195) (i=7,j=16) (i=8,j=133) (i=9,j=103) (i=10,j=250)

(i=11,j=154) (i=12,j=222) (i=13,j=114) (i=14,j=245) (i=15,j=157)

(i=16,j=234) (i=17,j=43) (i=18,j=81) (i=19,j=60) (i=20,j=223)

(i=21,j=86) (i=22,j=1) (i=23,j=44) (i=24,j=226) (i=25,j=25)

(i=26,j=96) (i=27,j=191) (i=28,j=221) (i=29,j=95) (i=30,j=139)

(i=31,j=219) (i=32,j=185) (i=33,j=5) (i=34,j=225) (i=35,j=83)

(i=36,j=77) (i=37,j=205) (i=38,j=143) (i=39,j=237) (i=40,j=72)

(i=41,j=228) (i=42,j=139) (i=43,j=204) (i=44,j=247) (i=45,j=6)

(i=46,j=133) (i=47,j=57) (i=48,j=126) (i=49,j=135) (i=50,j=2)

(i=51,j=131) (i=52,j=32) (i=53,j=91) (i=54,j=10) (i=55,j=49)

(i=56,j=191) (i=57,j=115) (i=58,j=47) (i=59,j=137) (i=60,j=116)

(i=61,j=128) (i=62,j=220) (i=63,j=69) (i=64,j=111) (i=65,j=190)

(i=66,j=179) (i=67,j=241) (i=68,j=150) (i=69,j=255) (i=70,j=9)

(i=71,j=185) (i=72,j=20) (i=73,j=70) (i=74,j=57) (i=75,j=181)

(i=76,j=62) (i=77,j=56) (i=78,j=79) (i=79,j=102) (i=80,j=209)

(i=81,j=247) (i=82,j=58) (i=83,j=172) (i=84,j=201) (i=85,j=197)

(i=86,j=60) (i=87,j=200) (i=88,j=224) (i=89,j=171) (i=90,j=197)

(i=91,j=0) (i=92,j=86) (i=93,j=159) (i=94,j=5) (i=95,j=135)

(i=96,j=206) (i=97,j=83) (i=98,j=116) (i=99,j=106) (i=100,j=27)

(i=101,j=228) (i=102,j=251) (i=103,j=221) (i=104,j=188) (i=105,j=146)

(i=106,j=136) (i=107,j=112) (i=108,j=51) (i=109,j=108) (i=110,j=254)

(i=111,j=40) (i=112,j=16) (i=113,j=167) (i=114,j=59) (i=115,j=239)

(i=116,j=16) (i=117,j=86) (i=118,j=57) (i=119,j=75) (i=120,j=214)

(i=121,j=39) (i=122,j=23) (i=123,j=50) (i=124,j=50) (i=125,j=228)

(i=126,j=41) (i=127,j=77) (i=128,j=89) (i=129,j=6) (i=130,j=57)

(i=131,j=186) (i=132,j=129) (i=133,j=0) (i=134,j=172) (i=135,j=46)

(i=136,j=36) (i=137,j=126) (i=138,j=246) (i=139,j=157) (i=140,j=159)

(i=141,j=119) (i=142,j=116) (i=143,j=54) (i=144,j=239) (i=145,j=244)

(i=146,j=202) (i=147,j=61) (i=148,j=30) (i=149,j=76) (i=150,j=241)

(i=151,j=81) (i=152,j=87) (i=153,j=44) (i=154,j=204) (i=155,j=61)

(i=156,j=43) (i=157,j=210) (i=158,j=58) (i=159,j=60) (i=160,j=171)

(i=161,j=146) (i=162,j=22) (i=163,j=57) (i=164,j=145) (i=165,j=162)

(i=166,j=150) (i=167,j=45) (i=168,j=171) (i=169,j=0) (i=170,j=149)

(i=171,j=19) (i=172,j=191) (i=173,j=70) (i=174,j=16) (i=175,j=77)

(i=176,j=236) (i=177,j=243) (i=178,j=32) (i=179,j=21) (i=180,j=182)

(i=181,j=50) (i=182,j=211) (i=183,j=138) (i=184,j=115) (i=185,j=35)

(i=186,j=164) (i=187,j=145) (i=188,j=112) (i=189,j=6) (i=190,j=85)

(i=191,j=1) (i=192,j=73) (i=193,j=156) (i=194,j=134) (i=195,j=90)

(i=196,j=8) (i=197,j=34) (i=198,j=186) (i=199,j=5) (i=200,j=145)


(i=201,j=174) (i=202,j=132) (i=203,j=214) (i=204,j=118) (i=205,j=246)

(i=206,j=61) (i=207,j=83) (i=208,j=108) (i=209,j=215) (i=210,j=126)

(i=211,j=31) (i=212,j=195) (i=213,j=154) (i=214,j=236) (i=215,j=87)

(i=216,j=82) (i=217,j=44) (i=218,j=231) (i=219,j=55) (i=220,j=147)

(i=221,j=117) (i=222,j=185) (i=223,j=92) (i=224,j=116) (i=225,j=80)

(i=226,j=6) (i=227,j=248) (i=228,j=170) (i=229,j=186) (i=230,j=111)

(i=231,j=42) (i=232,j=96) (i=233,j=107) (i=234,j=184) (i=235,j=216)

(i=236,j=42) (i=237,j=136) (i=238,j=248) (i=239,j=177) (i=240,j=31)

(i=241,j=196) (i=242,j=18) (i=243,j=25) (i=244,j=30) (i=245,j=161)

(i=246,j=33) (i=247,j=71) (i=248,j=183) (i=249,j=176) (i=250,j=67)

(i=251,j=90) (i=252,j=70) (i=253,j=144) (i=254,j=34) (i=255,j=139)

(i=0,j=224) (i=1,j=140) (i=2,j=7) (i=3,j=38) (i=4,j=156)

(i=5,j=231) (i=6,j=157) (i=7,j=24) (i=8,j=198) (i=9,j=208)

(i=10,j=127) (i=11,j=63) (i=12,j=207) (i=13,j=35) (i=14,j=34)

(i=15,j=68) (i=16,j=14) (i=17,j=205) (i=18,j=27) (i=19,j=153)

(i=20,j=244) (i=21,j=233) (i=22,j=109) (i=23,j=93) (i=24,j=216)

(i=25,j=223) (i=26,j=73) (i=27,j=151) (i=28,j=149) (i=29,j=132)

(i=30,j=137) (i=31,j=247) (i=32,j=36) (i=33,j=164) (i=34,j=163)

(i=35,j=247) (i=36,j=36) (i=37,j=89) (i=38,j=120) (i=39,j=201)

(i=40,j=243) (i=41,j=56) (i=42,j=138) (i=43,j=120) (i=44,j=82)

(i=45,j=233) (i=46,j=107) (i=47,j=39) (i=48,j=138) (i=49,j=177)

(i=50,j=45) (i=51,j=240) (i=52,j=213) (i=53,j=176) (i=54,j=114)

(i=55,j=194) (i=56,j=7) (i=57,j=42) (i=58,j=146) (i=59,j=38)

(i=60,j=40) (i=61,j=111) (i=62,j=248) (i=63,j=184) (i=64,j=187)

(i=65,j=93) (i=66,j=153) (i=67,j=44) (i=68,j=78) (i=69,j=19)

(i=70,j=255) (i=71,j=37) (i=72,j=146) (i=73,j=252) (i=74,j=117)

(i=75,j=135) (i=76,j=181) (i=77,j=242) (i=78,j=20) (i=79,j=113)

(i=80,j=77) (i=81,j=173) (i=82,j=135) (i=83,j=157) (i=84,j=112)

(i=85,j=191) (i=86,j=5) (i=87,j=112) (i=88,j=152) (i=89,j=205)

(i=90,j=228) (i=91,j=180) (i=92,j=87) (i=93,j=249) (i=94,j=69)

(i=95,j=78) (i=96,j=132) (i=97,j=72) (i=98,j=51) (i=99,j=240)

(i=100,j=170) (i=101,j=70) (i=102,j=31) (i=103,j=61) (i=104,j=169)

(i=105,j=120) (i=106,j=2) (i=107,j=132) (i=108,j=157) (i=109,j=33)

(i=110,j=217) (i=111,j=32) (i=112,j=139) (i=113,j=232) (i=114,j=170)

(i=115,j=147) (i=116,j=171) (i=117,j=36) (i=118,j=196) (i=119,j=156)

(i=120,j=107) (i=121,j=21) (i=122,j=164) (i=123,j=165) (i=124,j=192)

(i=125,j=137) (i=126,j=48) (i=127,j=223) (i=128,j=92) (i=129,j=35)

(i=130,j=6) (i=131,j=206) (i=132,j=80) (i=133,j=139) (i=134,j=117)

(i=135,j=79) (i=136,j=173) (i=137,j=118) (i=138,j=217) (i=139,j=20)

(i=140,j=192) (i=141,j=140) (i=142,j=116) (i=143,j=13) (i=144,j=87)

(i=145,j=227) (i=146,j=80) (i=147,j=57) (i=148,j=212) (i=149,j=210)

(i=150,j=198) (i=151,j=20) (i=152,j=60) (i=153,j=120) (i=154,j=79)

(i=155,j=194) (i=156,j=154) (i=157,j=179) (i=158,j=246) (i=159,j=109)

(i=160,j=56) (i=161,j=187) (i=162,j=204) (i=163,j=203) (i=164,j=90)

(i=165,j=91) (i=166,j=153) (i=167,j=19) (i=168,j=130) (i=169,j=238)

(i=170,j=176) (i=171,j=200) (i=172,j=86) (i=173,j=180) (i=174,j=209)


(i=175,j=245) (i=176,j=183) (i=177,j=222) (i=178,j=123) (i=179,j=148)

(i=180,j=242) (i=181,j=32) (i=182,j=177) (i=183,j=115) (i=184,j=51)

(i=185,j=119) (i=186,j=135) (i=187,j=10) (i=188,j=31) (i=189,j=204)

(i=190,j=64) (i=191,j=143) (i=192,j=59) (i=193,j=78) (i=194,j=193)

(i=195,j=101) (i=196,j=5) (i=197,j=94) (i=198,j=82) (i=199,j=184)

(i=200,j=208) (i=201,j=33) (i=202,j=48) (i=203,j=47) (i=204,j=220)

(i=205,j=17) (i=206,j=217) (i=207,j=105) (i=208,j=129) (i=209,j=158)

(i=210,j=156) (i=211,j=121) (i=212,j=20) (i=213,j=249) (i=214,j=152)

(i=215,j=158) (i=216,j=25) (i=217,j=225) (i=218,j=16) (i=219,j=116)

(i=220,j=33) (i=221,j=119) (i=222,j=158) (i=223,j=77) (i=224,j=162)

(i=225,j=106) (i=226,j=0) (i=227,j=140) (i=228,j=163) (i=229,j=59)

(i=230,j=146) (i=231,j=221) (i=232,j=58) (i=233,j=209) (i=234,j=16)

(i=235,j=119) (i=236,j=50) (i=237,j=116) (i=238,j=224) (i=239,j=137)

(i=240,j=70) (i=241,j=7) (i=242,j=101) (i=243,j=143) (i=244,j=234)

(i=245,j=14) (i=246,j=81) (i=247,j=165) (i=248,j=46) (i=249,j=19)

(i=250,j=177) (i=251,j=133) (i=252,j=239) (i=253,j=163) (i=254,j=189)

(i=255,j=169) (i=0,j=63)


APPENDIX-D

// Encrypted and Decrypted text using PARC4

4aedd1ab7ee3874055cce14283cdeb3fd24c22f7b887706bd127727fc3a3a6

83ab4123e2e9a61692776092b38cabeed57958a13817564eed72c84db44ad2

92d1e14d2d7cf36c357a1919f8c1f9698cd52715ec241f32ad833637adc5a9

039b59222c181519f5bfd8386e3536de7b067bbffb3ab2c3e756686b5bbfa5

6b65f84dfd2b6eaabbc96fdff9e52151537d4eafa39c1c436587fa7e9dbcae

a9a564e118cce2d43436820eb45ab4a9420f5694ebb8f13e6967df9be26d9a

51112bd76401c20c0f2654d80f9fc214185c4dcd51bcc7e0d3da56f5f2648b

076e763dcffebe516155961967d1036779512f6fc5011947326b50ccdc969f

e1ab7bfbd3b079ab975c56dded57cbcae132d9a41d89da23762cd347379c3e

1ad84a2436efcf4ad5bb07a7983a881b4ffd66948a5d97727ff4155e6c87d8

45ba682bf29cd4d476d75fe6d18390f144995ce546d4ec357f16cc22e6e2fd

653d4bb0db9e0ba178d16fec192fb5067ad8422dee0b3b86b267b62c5bbacb

847b7588ed296a4e7aafd93e5ed88521116547c521df929c0d22b5330a8139

bbda2e4af74ee5edeb04d514b2832ef46a15ad515e36b57bac73fa976ef5ab

70deab1c19f9156931b87fd794281ffe267449534f8441a7c2ad513cbe7c5c

2f51a577e67ed8b1f9e2dd413cd67c113760dc16e7ea1e1281353cb9458ec6

8a8ffce674e4d3a26fac855810c5f45a98c4ea72ebdaf3bc9b7d7684353153

dbe2a7c0aa6de5f2be1aa87a7fc2a496faecd7f9e811b877f7ea512587c9fd

841af692a11159bcad76ccc56afd29c9481849fde59295c9717ff628e3c642

8da4d9538a0dd1ee3f476f17efcd87b3596da08f64bdc998832e31676fc6bd

92fc42b82f8a9d80bfe2e7a196fec2955be7557f511fb73fc5c9374b6fe812

98bebbabad61f810a3adf45436d74f35ab1419c1aec6b4792c819fadf74eca

625a8a0582bfc5e401030cdef3843c7e9f8644c4f59ae75abfdfe84b3cba3f

50381db678ef74d3f7e9ef852463aca66ef3771dd19f9fd1d0802121b0e8fd

b8a88f0a978f4d8e87ffec554168be65899c8fa33d81bae8bfd36d6a85263d

4493e7adcfa14070cedb995da3647cc3a094beebd869ad0b9e7a33f911416d

9cdd5eb86a2d1017d5dad077d648f6929f959d4b9dcf4d655d8c59eb68cb7e

7e34b428dc35a1ddcf5f5156d5ff3d48ae05166e4b167f6e0bbb86a232a688

2af88fd46fc25cdfc99b7ffaebb96fa819b4c441456782512fd2fcac468306

bf8c98b8acacac7febefb3ac1b5a4a2675ec5ab24fd312fc603809a59f6d73

cfebc26a4bc1918d45e59137addf1395ccbfeb2654442504c867394faaf423

eb70543e5bbd79e6749ee7f4ea45f0453fdb7be42866de18f8ff5310853839

aa51d1da8b89e5e87ce1cea67ce787421cc8a84287dae17ed54cfaa9d47b6a

c13c244bc0efa689834b39e5faa21bb12b7d81b48ebbe19ceddb3ec985772e

317473090ca14ad6b2a1a1a93c8a1768b4df0868793d1a4c9cd158644d8f4d

c8639c7f6330b524933fb3d24ae9a5556357bdc688e15f7ea58271f6f9baf2

643d2b7980b8c2fa4ef32c899fb29cdae9b288f7829d32013574f275ded36c

989363370e9119da1b9adaf7ae9a1a1a91e405c7937ef5bb3589e16e9764b8

1831becdf7980a63ab6a75815d5505557bcbf37d46d6f5e76d4926c4fcf56b

7c3f55671b044402458b73fee33fdf8f2e70e84b399a79f16677df1df2f211

68a3b66af51c3c4898df2e97fffd9a552e6935f1781a95f81dbef31df14180

bd863b66c974364fc1edbec6b05d3bfac5ecbab2c6e8dadc8bcf793599d2c5

bd7235eab543194c850a86c21e65583d3a43bd74ce997949812f34d9fd0242

c238854d27bc316231b631de3da4c44eabe17775ffa8892ff4f65ec8b72c8f

592f3703f6b619bb797f447b5108ca899bfe3afa6c7f0c68f281191d45623c


f630dacc2e3148ac1f8aedaaaea870e0e7b1a61d19597925f759bb499f19f7

614a9182ff1d63281bb3ab7b5171fc75d11192f8bc175ac4f1e06357432159

c14fb68be54a72a14a516e21bd32e573e0d1dbc247e357369b7ff122722014

febd11d89783c924ad0d98292ebea

//Decrypted text

another algorithm. Schneier designed Blowfish as a general-

purpose algorithm, intended as an alternative to the aging DES

and free of the problems and constraints associated with other

algorithm. Schneier designed Blowfish as a general-purpose

algorithm, intended as an alternative to the aging DES and

free of the problems and constraints associated with other

free of the problems


APPENDIX-E

// Encrypted and Decrypted text using PARC4-I

Encrypted text:

b6589fc6ab0dc82cf12099d1c2d40ab994e8410cddfe163345d338193ac2bd

c183f8e9dcff904b43c4a2d99bc28d236098a095277b7eb0718d6be068c4b5

c86bd577da3d93fea7c89cba61c78b48e58911904a4e8b77f6242e2d288705

023adad00a9310fdf8bc5814536f66012884e146a8887a44709a56b7ed0881

90c204b31cd71484e6a1c538986b5f77ccaa8d8dcc7d030cd6a6768db81f90

d0ef976c3d9a7149a5a7786bb368e06d08c5d77774eb43a49e87acec17cd9d

cd20a716cc2cf67417b71c8a70167ed10e4a589c87f9e6a85c22e4b0c38ecf

5f50595dd23b67eb79211cfdddad518279291b117971d3d7c997c777b174bb

05faa82799526f12

Decrypted text

The incredible growth of the Internet has excited businesses and consumers alike with

its promise of changing the way we live and work. It's extremely easy to buy and sell

goods all over the world while sitting in front of a laptop. But security is a major

concern on the Internet, especially when you're using it to send sensitive information

between parties. Information security is provided on computers and over the Internet

by a variety of methods. A simple but straightforward security method is to only keep

sensitive information on removable storage media like portable flash memory drives or

external hard drives.

14 APPENDIX-D

// Encrypted and Decrypted text using PBlock

//Encrypted text:

45303030303034367c412d317c41317ca45303030303035307c412e412e4d2

e442e7c41414d447ca45303030303038317c414944532072656c6174656420

636f6d706c65787c414944532d72656c6174656420636f6d706c65787ca453

03030303039387c414e4c4c7c414e4c7ca45303030303134387c416365636c

6964696e7c416365636c6964696e657ca45303030303135347c6163747c416

3747ca45303030303135357c61637469766520766572746963616c20636f72

726563746f727c41637469766520566572746963616c20436f72726563746f

727ca45303030303135387c4164616d732053746f6b6573206469736561736

57c4164616d732d53746f6b657320646973656173657ca4530303030313633

7c61647269616d7963696e7c41647269616d7963696e7ca453030303031363

47c61647269616d7963696e6f6c7c41647269616d7963696e6f6c7ca453030

30303138317c416e63697374726f646f6e7c41676b697374726f646f6e7ca4

5303030303230327c616c6369616e20626c75657c416c6369616e20626c756

57ca45303030303231327c416c6578616e646572205472616c6c69616e7573

7c416c6578616e646572206f66205472616c6c65737ca45303030303232357

c416c6b2e2050686f732e7c416c6b2e2070686f732e7ca4530303030323235

7c616c6b2e2070686f732e7c416c6b2e2070686f732e7ca453030303032333

47c416c75497c416c7520497ca45303030303233397c616c7a6865696d6572

2d747970657c416c7a6865696d65722d747970657ca45303030303335337c4

16d65726963616e2d747970652063756c7475726520636f6c6c656374696f6

e7c416d65726963616e20747970652063756c7475726520636f6c6c6563746

96f6e7ca45303030303336307c416e616261656e617c416e6162656e617ca4

5303030303337397c616e676f72617c416e676f72617ca4530303030333831

7c616e746172637469637c416e746172637469637ca45303030303338327c6

16e746172637469637c416e746172637469637ca45303030303338397c6170

617274686569647c4170617274686569647ca45303030303430317c4170726

573736f6c696e657c41707265736f6c696e657ca45303030303432337c4172

61417c41726120417ca45303030303432337c61726120417c41726120417ca

45303030303432337c6172612d417c41726120417ca45303030303432337c6

17261417c41726120417ca45303030303432387c61726368626973686f707c

41726368626973686f707ca45303030303433327c6172656368696e657c417

2656368696e657ca45303030303433377c417267796c6c2d526f6265727473

6f6e20707570696c7c417267796c6c2052

// Text encryption using PARC4

//Encrypted text (256 bytes)

7bb88cfd22ec81414cab4558992b739d05d52abeec97233c17d362a80e2a59

af51028bfaff64adc637186a6d2eeb8c652cc894fc02421bf534a6fdfd754a

b76714816cad38572c15fe7c9c3cc184f5b9a8d5a355fdbcfe699e6362688e

bc96eead356bde743c92fb9485f363ff6e765ade9befe3a76643d86ecbdaf1

3f45adf8cc5b1e4f6abdbbcf8925d56114b255e1bf46c99c675579b34191fd

feb0fb21b04d8aa3584a1c6565b557f5158e41f33344ecde4ea3862cacff7c

dfaf1512ad44d4a6497ab7041c6a0ed621510a1c967f8c0ba167de12f1771f

b067b239daa8a1b4551c0798124472327cd14a4fd1440df6b76e210ce82d2c

ef1f62becdef12dbfdb7139fe43889eb5638e5557feed972034c033327e89e

0f0c0ae4679ecacfb4ba36c2080f2d3efb4925c2df1bcd2d75b744a34d9805

2fa2c731d46c2838476965bb5c6cccf191e5acc825938a876a5689838306c8

99ce3db18754e4e74b9aeadc582f75438ff8366a5e8bfab3f7f693dd4ecbcf

b14f459dd8bc7e7b1a4e7d2f3a799a5242422a5b1ea76bcc9222b24b119cdf

ea1b3ae27eb468aa05f2403c31b351f710de17f13747ead742ac827ef9fc63

ddfa4212fc214466996ff7142c5fdf166141841b93cfd9ebc437fe02997ddb

264e327dfa8a3ba049279d5244877719e47a1ff1010d0667fe5189b83d59da

5f4

//Decrypted text









05faa82799526f12

// Text encryption using PARC4-I (256 bytes)

//Encrypted text

2da3d9f37cba820429dbf78f99b23edda6a8ef9779679d6064428bb0fd93fe

122eb0fef04da06d21dbf781bcee95e9f8e18972822bd571b3e8cd21f27128

4d11918c8c7b975ee5c5c19e1f44fc38b83b9d552ae359c3e31648c59ce3bb

78cdb2e4423455ab97dbaa03faa8467aabfbbe46c61316984e3d5ab12a823d

98ec4e2b5f3e182a294cb85b151726594aaf6e9b9771592eb14ccef9feb9aa

23ea4ef8f4577426f64e751f718dd45f93a47d1db4ff38e2eabad68a5f9424

fa8017163fccfe2a1393a3a967411ea12c650e0cbec412ee47b47aaee6cb12

7dbabf9b75746502d8921b621219c1ba2fe45118a617ee51ace8586c9aca32

ef485f929bfdc348c9b068f9bbd3b8e5750a7b8ca7d6497666012d5b5af9fa

c1764b8fcae1ba13e7a80f3d3eab89dacbc149977976b06163a93894f8287a

4f4793df867f908b4c1c49d491bdc58c76e5cd2bab6b993d3f39a673c569b7

8058e6e51532cffc5d1f15d6eafd131fab6bfe039673d3886bf80a849a970d

edcc0bdbca9ead1f7c4c85e141132f261da03bcfcc777929ee1a91f9feeefe

72eab1f8a15b51173f6ce7df349d941fb331bd8da18a08f24aae167f4f6164

e9114b4739caa72f4dc6a8a43210485c199f4b09ab61461e219b7fabd3ab17

689f1a9b85de354328b20b320219c1daff94646d36373ec12d3d581c5a6a9

// Decrypted text









05faa82799526f12

15 APPENDIX-E

// Modified encryption function for PARC4

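unsigned char rc4_output()
{
    // Usual RC4 step: advance i1, update j1, swap the two state entries.
    i1 = (i1 + 1) % 256;
    j1 = (j1 + s[i1]) % 256;
    swap(s, i1, j1);

    // PARC4 modification: advance both indices once more before the lookup.
    // The increments wrap modulo 256 so that the indices stay inside the
    // 256-entry state table s.
    i1 = (i1 + 1) % 256;
    j1 = (j1 + 1) % 256;

    return s[(s[i1] + s[j1]) % 256];
}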
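For comparison with the modified output function above, the following is a minimal sketch of textbook RC4 (key scheduling followed by the pseudo-random generation step). It is not the thesis code: the names `rc4_state`, `rc4_init`, `rc4_next` and `rc4_crypt` are assumptions introduced only for this illustration, and the sketch omits the extra index increments that PARC4 adds.

```c
#include <stddef.h>

/* Textbook RC4 state: a 256-entry permutation plus two indices.
   All names here are hypothetical and chosen for this sketch only. */
typedef struct {
    unsigned char s[256];
    unsigned char i, j;
} rc4_state;

/* Key-scheduling algorithm (KSA): initialise the permutation from the key. */
void rc4_init(rc4_state *st, const unsigned char *key, size_t keylen)
{
    for (int k = 0; k < 256; k++)
        st->s[k] = (unsigned char)k;

    unsigned char j = 0;
    for (int k = 0; k < 256; k++) {
        j = (unsigned char)(j + st->s[k] + key[k % keylen]);
        unsigned char t = st->s[k];
        st->s[k] = st->s[j];
        st->s[j] = t;
    }
    st->i = 0;
    st->j = 0;
}

/* Pseudo-random generation step: returns one key-stream byte.
   The unsigned char indices wrap modulo 256 on their own. */
unsigned char rc4_next(rc4_state *st)
{
    st->i = (unsigned char)(st->i + 1);
    st->j = (unsigned char)(st->j + st->s[st->i]);
    unsigned char t = st->s[st->i];
    st->s[st->i] = st->s[st->j];
    st->s[st->j] = t;
    return st->s[(unsigned char)(st->s[st->i] + st->s[st->j])];
}

/* XOR the key stream into the data; the same call encrypts and decrypts,
   provided the state is re-initialised with the same key first. */
void rc4_crypt(rc4_state *st, const unsigned char *in,
               unsigned char *out, size_t len)
{
    for (size_t n = 0; n < len; n++)
        out[n] = in[n] ^ rc4_next(st);
}
```

Because the cipher is a plain XOR with the key stream, decryption is the same call after `rc4_init` has been run again with the same key.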

// Parallel region for encryption in PARC4
// By default all variables are shared; i is declared private so that each
// thread keeps its own index into the key stream.
#pragma omp parallel for default(shared) private(i)
for (int x = 0; x < block; x++)
{
    i = 0;
    int y = x * 256, end = y + 256;
    while (y < end)
    {
        // Adding the block number x to each key-stream byte gives every
        // 256-byte block a different stream.
        enblock[y] = (memblock[y] ^ ((s2[i++] + x) % 256));
        y++;
    }
}
// End of parallel region
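The parallel region above can be sketched as a self-contained function. This is only an illustration under stated assumptions: the function name `xor_blocks`, the constant `BLOCK_SIZE` and the parameter names are hypothetical, and the key stream `ks` is assumed to hold 256 precomputed bytes, as `s2` appears to in the listing.

```c
#include <stddef.h>

#define BLOCK_SIZE 256  /* block length used in the listing above */

/* XOR every 256-byte block with the key stream, offset by the block number.
   ks holds BLOCK_SIZE precomputed key-stream bytes (an assumption of this
   sketch); in and out each hold nblocks * BLOCK_SIZE bytes. */
void xor_blocks(const unsigned char *in, unsigned char *out,
                const unsigned char *ks, int nblocks)
{
    /* Iterations write to disjoint ranges of out, so they may run in any
       order. Without OpenMP support the pragma is ignored and the loop
       simply runs serially. */
    #pragma omp parallel for
    for (int x = 0; x < nblocks; x++) {
        size_t base = (size_t)x * BLOCK_SIZE;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            /* Per-block key-stream byte: (ks[i] + x) mod 256. */
            unsigned char k = (unsigned char)((ks[i] + x) % 256);
            out[base + i] = in[base + i] ^ k;
        }
    }
}
```

Calling `xor_blocks` a second time on its own output, with the same key stream, restores the original data. Declaring the inner index inside the loop body makes it private to each thread without a `private` clause.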

APPENDIX-F

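// Modified encryption method that returns four distinct bytes using PARC4-I

unsigned char * PARC4I_output()
{
    // One step on each of the four state tables s1..s4; the shared index i
    // advances once per table.
    i = (i + 1) % 256;
    j1 = (j1 + s1[i]) % 256;
    swap_s1(s1, i, j1);
    V1 = (s1[i] + s1[j1]) % 256;
    index1[0] = V1;

    i = (i + 1) % 256;
    j2 = (j2 + s2[i]) % 256;
    swap_s2(s2, i, j2);
    V2 = (s2[i] + s2[j2]) % 256;
    index1[1] = V2;

    i = (i + 1) % 256;
    j3 = (j3 + s3[i]) % 256;
    swap_s3(s3, i, j3);
    V3 = (s3[i] + s3[j3]) % 256;
    index1[2] = V3;

    i = (i + 1) % 256;
    j4 = (j4 + s4[i]) % 256;
    swap_s4(s4, i, j4);
    V4 = (s4[i] + s4[j4]) % 256;
    index1[3] = V4;

    // Advance each j index once more before the next call; the modulo in
    // the next update brings it back into range.
    j1++;
    j2++;
    j3++;
    j4++;

    return index1;
}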

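// Parallel region of PARC4-I to encrypt multiple data blocks simultaneously
#pragma omp parallel default(shared) private(i)
{
    #pragma omp for
    for (int x = 0; x < block; x++)
    {
        i = 0;
        int y = x * 256, end = y + 256;
        while (y < end)
        {
            // Four key-stream bytes per iteration, each offset by the
            // block number x.
            enblock[y]   = (memblock[y]   ^ ((temp[i]   + x) % 256));
            enblock[y+1] = (memblock[y+1] ^ ((temp[i+1] + x) % 256));
            enblock[y+2] = (memblock[y+2] ^ ((temp[i+2] + x) % 256));
            enblock[y+3] = (memblock[y+3] ^ ((temp[i+3] + x) % 256));
            i = i + 4;
            y = y + 4;
        }
    }
}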

APPENDIX-G

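// Parallel region for encryption/decryption in PBlock
#pragma omp parallel default(none) shared(ctx,enblock,memblock,block,size)
{
    // Encryption: each iteration handles the block at offset x*64, with
    // L1 and R1 marking the start of its left and right halves.
    #pragma omp for
    for (int x = 0; x < block; x++)
    {
        int y = x * 64;
        int L1 = y, R1 = y + 32;
        Blowfish_Encrypt(&ctx, L1, R1, memblock);
    }

    #pragma omp barrier

    // Decryption: the same block layout, processed after all blocks have
    // been encrypted.
    #pragma omp for
    for (int x = 0; x < block; x++)
    {
        int y = x * 64;
        int L1 = y, R1 = y + 32;
        Blowfish_Decrypt(&ctx, L1, R1, memblock);
    }
}
// Parallel region ends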
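The block-level parallelism used in the PBlock region can be illustrated with a small self-contained sketch. It does not implement Blowfish: a toy 16-round Feistel cipher with a made-up round function `toy_f` stands in for it, and all names (`toy_encrypt`, `toy_decrypt`, `crypt_blocks`, `ROUNDS`) are assumptions introduced for this example. The point is only that independent 64-bit blocks, each split into two 32-bit halves, can be processed by separate threads.

```c
#include <stdint.h>
#include <stddef.h>

#define ROUNDS 16

/* Hypothetical round function, a stand-in for Blowfish's S-box based F.
   It has no cryptographic strength and is for illustration only. */
static uint32_t toy_f(uint32_t half, uint32_t subkey)
{
    uint32_t v = half + subkey;
    return (v << 5 | v >> 27) ^ (v * 0x9E3779B1u);
}

/* Feistel encryption of one 64-bit block held as two 32-bit halves. */
void toy_encrypt(const uint32_t subkeys[ROUNDS], uint32_t *l, uint32_t *r)
{
    for (int k = 0; k < ROUNDS; k++) {
        uint32_t t = *r;
        *r = *l ^ toy_f(*r, subkeys[k]);
        *l = t;
    }
}

/* Decryption undoes the rounds in reverse subkey order. */
void toy_decrypt(const uint32_t subkeys[ROUNDS], uint32_t *l, uint32_t *r)
{
    for (int k = ROUNDS - 1; k >= 0; k--) {
        uint32_t t = *l;
        *l = *r ^ toy_f(*l, subkeys[k]);
        *r = t;
    }
}

/* Process nblocks blocks in place; data holds two uint32_t words per block.
   Blocks are independent, so each loop iteration can run on its own thread.
   Without OpenMP support the pragma is ignored and the loop runs serially. */
void crypt_blocks(const uint32_t subkeys[ROUNDS], uint32_t *data,
                  int nblocks, int decrypt)
{
    #pragma omp parallel for
    for (int x = 0; x < nblocks; x++) {
        uint32_t *l = &data[2 * (size_t)x];
        uint32_t *r = l + 1;
        if (decrypt)
            toy_decrypt(subkeys, l, r);
        else
            toy_encrypt(subkeys, l, r);
    }
}
```

Calling `crypt_blocks` with `decrypt` set to 0 and then to 1 on the same buffer restores the original data. The subkeys are only read inside the loop, so sharing them between threads needs no synchronisation.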
