“DATA AT REST” ENCRYPTION FOR HADOOP
SITI HANISAH BINTI KAMARUZAMAN
BACHELOR OF COMPUTER SCIENCE
(NETWORK SECURITY)
UNIVERSITI SULTAN ZAINAL ABIDIN
2017
“DATA AT REST” ENCRYPTION FOR HADOOP
SITI HANISAH BINTI KAMARUZAMAN
Bachelor of Computer Science (Network Security)
Faculty of Informatics and Computing
Universiti Sultan Zainal Abidin, Terengganu, Malaysia
MAY 2017
DECLARATION
I hereby declare that this report is based on my original work except for quotations
and citations, which have been duly acknowledged. I also declare that it has not been
previously or concurrently submitted for any other degree at Universiti Sultan Zainal
Abidin or other institutions.
________________________________
Name : ..................................................
Date : ..................................................
CONFIRMATION
This is to confirm that:
The research conducted and the writing of this report was under my supervision.
________________________________
Name : ..................................................
Date : ..................................................
DEDICATION
First and foremost, praise be to Allah, the Most Merciful, for blessing me and
giving me the opportunity to undertake and complete this final year project, “Data at
Rest” Encryption for Hadoop. I would like to express my gratitude to my supervisor,
Dr. Wan Nor Shuhadah Binti Wan Nik, for her guidance in completing this project and
for her ideas and suggestions; I am proud to have been supervised by her because of
her kindness.
Finally, I would like to thank all my family members and friends for the moral
support they gave me to finish the project. I would also like to thank the Faculty of
Informatics and Computing, and all its lecturers, for the opportunity to explore this
project and for their great support in completing it.
ABSTRACT
Trusted computing and the security of utility computing are among the most
challenging topics today; they lie at the core of cloud computing, which is currently
the focus of the international IT community. Moreover, the amount of data grows
drastically every second. Hadoop is an open-source software framework that supports
the storage and processing of large data sets in a distributed computing environment
and is a well-known implementation of MapReduce; it is widely used for big data
analysis. MapReduce is a common programming model for processing and handling
large amounts of big data. The Hadoop Distributed File System (HDFS) is a
distributed, scalable and portable file system written in Java for the Hadoop
framework. However, the main problem is that data at rest is not secure: intruders can
steal or alter the data. HDFS stores the data that has been analysed by Hadoop, and an
encryption method may be applied to the data in HDFS to secure it.
ABSTRAK
Pengkomputeran dipercayai dan keselamatan utiliti adalah salah satu topik
yang paling mencabar hari ini dan teknologi teras pengkomputeran awan yang pada
masa ini adalah fokus semesta IT antarabangsa. Walaupun dalam setiap saat, jumlah
data yang meningkat secara drastik pada masa kini. Hadoop adalah rangka kerja
perisian sumber terbuka yang menyokong penyimpanan set data yang besar dan
pemprosesan dalam persekitaran pengkomputeran teragih dan pelaksanaan terkenal
MapReduce. Hadoop digunakan untuk analisis data yang besar. MapReduce adalah
salah satu model pengaturcaraan biasa untuk memproses dan mengendalikan sejumlah
besar data yang besar. Sistem Fail Teragih Hadoop (HDFS) ialah, sistem fail berskala
dan mudah alih diedarkan yang ditulis dalam java untuk rangka kerja Hadoop. Walau
bagaimanapun, masalah utama adalah data yang berada dalam keadaan rehat tidak
selamat dimana penceroboh boleh mencuri atau menukarkan data atau maklumat
kami. Sistem Fail Teragih Hadoop (HDFS) menyimpan data yang telah dianalisis oleh
Hadoop dan kaedah penyulitan boleh dilaksanakan untuk data dalam HDFS untuk
menyelamatkan data.
CONTENTS

DECLARATION
CONFIRMATION
DEDICATION
ABSTRACT
ABSTRAK
CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS

CHAPTER 1 INTRODUCTION
1.1 Introduction
1.2 Background
1.3 Problem Statement
1.4 Objective
1.5 Scope
1.6 Activities and Milestones

CHAPTER 2 LITERATURE REVIEW
2.1 Introduction
2.2 Related Project and Article
2.3 Cryptography
2.4 Hadoop-Based on Cloud Data
2.5 Summary

CHAPTER 3 METHODOLOGY
3.1 Introduction
3.2 Analysis Study
3.3 Methodology Review
3.4 Method/Techniques
3.5 Framework of Project
3.6 System Requirement
3.6.1 Software Requirement
3.6.2 Hardware Requirement
3.7 Summary

REFERENCES
LIST OF TABLES
TABLE TITLE
1.1 First table in chapter 1
LIST OF FIGURES
FIGURE TITLE
3.3.1 First figure in chapter 3
3.3.2 Second figure in chapter 3
3.4.1 Third figure in chapter 3
3.5.1 Fourth figure in chapter 3
LIST OF ABBREVIATIONS / TERMS / SYMBOLS
HDFS Hadoop Distributed File System
AES Advanced Encryption Standard
DEA Data Encryption Algorithm
CHAPTER 1
INTRODUCTION
1.1 INTRODUCTION
The key aspects discussed in Chapter 1 include the background, problem
statement, objectives, scope, and activities and milestones of the project. Big data,
Hadoop, HDFS and the importance of encrypting data are described in the
background. Problems relating to the topic are stated in the problem statement, and
the purposes of the project are stated in the objectives. Furthermore, this chapter
discusses the scope of the project, which involves encrypting data at rest in Hadoop,
as well as the activities and milestones involved in completing the project.
1.2 BACKGROUND
Recently, trusted computing and the security of utility computing have become
among the most challenging topics. They are also at the core of cloud computing,
which is currently the focus of the international IT community. Moreover, the amount
of data increases drastically every second. In recent years, the rapid development of
the Internet, the Internet of Things and cloud computing has led to drastic growth of
data in almost every industry and business area [6]. The development of big data has
attracted attention from various fields around the world. Big data can be found in
three forms: structured, unstructured and semi-structured [6].
Besides that, Hadoop is used for big data analysis; it is an open-source
software framework that allows for the distributed processing of big data sets across
clusters of computers using a simple programming model [6]. It supports the storage
and processing of large data sets in a distributed computing environment and is a
well-known implementation of MapReduce. MapReduce is a common programming
model for processing and handling large amounts of big data.
In addition, the Hadoop Distributed File System (HDFS) is a distributed,
scalable and portable file system written in Java for the Hadoop framework. HDFS
stores the data that has been analysed by Hadoop. However, data at rest or in motion
is not secure: intruders can steal or alter the data. Therefore, encryption may be
applied to the data in HDFS to secure it. To preserve the confidentiality of data at
rest, an encryption method is important to keep the data safe from intruders.
1.3 PROBLEM STATEMENT
While data at rest is stored in the file system, there are several problems
because the data is not secure. Intruders can steal or alter our data. Moreover, an
attacker who can enter the data centre, either physically or electronically, can steal
the data they want, since the data is unencrypted and no authentication is enforced
for access [3].
1.4 OBJECTIVE
We have identified three main objectives for this project, as follows:
I. To study the architecture of Hadoop.
II. To implement the encryption technique for data at rest in HDFS using the AES
algorithm.
III. To test and evaluate the effectiveness of the AES algorithm in HDFS for data at
rest.
1.5 SCOPE
The data analysed by Hadoop will be stored in the Hadoop Distributed File
System (HDFS). This project will encrypt data at rest, meaning the data stored in
HDFS; the encryption of data in transit is out of scope. Further, the encryption of
data at rest covers only data in text form. To make data encryption possible in
Hadoop, adjustment of the Hadoop architecture is needed. This project will be run on
Linux.
1.6 ACTIVITIES AND MILESTONES
The project runs over a 15-week schedule; the original Gantt chart (Table 1.1) maps
each task to its weeks. The tasks are:

1. Topic discussion and determination with supervisor
2. Project title proposal and abstract submission
3. Proposal writing: introduction, background, problem statement, objective, scope
4. Proposal writing: literature review (research on related projects)
5. Proposal writing: literature review (continued)
6. Proposal progress presentation and evaluation (Presentation 1)
7. Discussion and correction of the proposal
8. Proposed solution: methodology (flow chart and AES technique for encryption)
9. Proposed solution: methodology (continued), understanding the AES technique
10. Proof of concept using Hadoop and the AES technique
11. Drafting the proposal report (Chapters 1, 2 and 3)
12. Submission of the proposal report (Chapters 1, 2 and 3)
CHAPTER 2
LITERATURE REVIEW
2.1 INTRODUCTION
This chapter discusses concepts and ideas from previous research and articles
related to this project. It is important for understanding the problem and suggesting
an appropriate solution.
2.2 RELATED PROJECT AND ARTICLE
In the research paper [3], the researchers state that Hadoop is a free,
Java-based programming framework that supports the processing of large data sets in
a distributed computing environment. Moreover, Hadoop allows applications to run on
systems with thousands of nodes handling thousands of terabytes of data [3]. The
Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop Distributed
File System (HDFS) and a number of related components such as Apache Hive,
HBase, Oozie, Pig and ZooKeeper [3]. Encryption ensures the confidentiality and
privacy of user information and secures sensitive data in Hadoop [3]. Hadoop did not
originally include basic controls for data protection, and most third-party tools could
not scale along with NoSQL, so they were of little use to developers [4].
According to the researchers, data at rest can be protected in two ways.
First, when a file is stored in Hadoop, the complete file can be encrypted first and
then stored. Second, encryption can be applied to data blocks once they are loaded
into the Hadoop system [3]. Based on this paper, HDFS supports AES and OS-level
encryption for data at rest. However, ZooKeeper, Oozie, Hive, HBase and Pig do not
offer a data-at-rest encryption solution; for these components, encryption can be
implemented via custom encryption techniques or third-party tools [3].
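The first approach, encrypting the complete file before it reaches Hadoop, can be sketched with the AES support built into the JDK (`javax.crypto`). This is an illustrative sketch, not code from the cited paper: the class name `FileAtRestCrypto` and its methods are hypothetical, and AES/GCM is assumed as the cipher mode. The returned blob (IV followed by ciphertext) would be stored in HDFS in place of the plaintext file.

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.security.SecureRandom;
import java.util.Arrays;

public class FileAtRestCrypto {
    private static final int IV_LEN = 12;    // 96-bit IV, recommended for GCM
    private static final int TAG_BITS = 128; // authentication tag length

    // Encrypt the complete file contents; returns IV || ciphertext.
    public static byte[] encrypt(byte[] plainFile, SecretKey key) throws Exception {
        byte[] iv = new byte[IV_LEN];
        new SecureRandom().nextBytes(iv);
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ct = c.doFinal(plainFile);
        byte[] blob = Arrays.copyOf(iv, IV_LEN + ct.length); // IV first
        System.arraycopy(ct, 0, blob, IV_LEN, ct.length);
        return blob;
    }

    // Reverse the transformation when the file is read back.
    public static byte[] decrypt(byte[] blob, SecretKey key) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(Cipher.DECRYPT_MODE, key,
               new GCMParameterSpec(TAG_BITS, Arrays.copyOf(blob, IV_LEN)));
        return c.doFinal(Arrays.copyOfRange(blob, IV_LEN, blob.length));
    }
}
```

A file's bytes would be passed through `encrypt` before an upload such as `hdfs dfs -put`, and through `decrypt` after retrieval; without the key, the stored blob is unreadable.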
In the research paper [1], the researchers compared Apache Spark and
Apache Hadoop. They concluded that Spark helps to simplify the challenging and
compute-intensive task of processing high volumes of real-time or archived data, both
structured and unstructured, seamlessly integrating complex capabilities such as
machine learning and graph algorithms. Besides, Spark brings big data processing to
the masses: running over hundreds, thousands, or even tens of thousands of machines
in a cluster is merely a configuration change. Hence Apache Spark is not a
replacement for Hadoop but one of the alternatives to it.
2.3 CRYPTOGRAPHY
Cryptography is one of the principal means of protecting information security.
Encryption is the process of encoding data in such a way that only authorized users
can decode and use it, which is self-defensive and enhances data security [1].
According to NIST's definition, information security is the practice of maintaining the
integrity, confidentiality and availability of data against malicious access and system
failure [1].
Modern cryptosystems can be classified into symmetric cryptosystems,
asymmetric cryptosystems and digital signatures [1]. In a symmetric cryptosystem, the
sender and receiver share an encryption and decryption key; these two keys are the
same or easy to deduce from each other [1]. Examples of symmetric cryptosystems
are DES (Data Encryption Standard) and AES (Advanced Encryption Standard).
In an asymmetric cryptosystem, the receiver possesses a public key and a
private key [1]. The public key can be published, but the private key should be kept
secret [1]. Examples of asymmetric cryptosystems are RSA (Rivest-Shamir-Adleman)
and ECC (Elliptic Curve Cryptosystem), while MD5 and SHA-1 are hash functions
commonly used in digital signature schemes [1].
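As a small illustrative sketch (the class name `RsaDemo` is assumed, using the JDK's built-in RSA support rather than anything from the cited paper), the asymmetric model can be shown directly: data encrypted under the published public key can be recovered only with the matching private key.

```java
import javax.crypto.Cipher;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class RsaDemo {
    // Encrypt with the public key, decrypt with the private key.
    public static String roundTrip(String message) throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();        // the public half may be published

        Cipher enc = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        enc.init(Cipher.ENCRYPT_MODE, pair.getPublic());
        byte[] ciphertext = enc.doFinal(message.getBytes());

        Cipher dec = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        dec.init(Cipher.DECRYPT_MODE, pair.getPrivate()); // the private half stays secret
        return new String(dec.doFinal(ciphertext));
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("secret message")); // prints "secret message"
    }
}
```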
2.4 HADOOP-BASED ON CLOUD DATA
Cloud computing is an emerging and increasingly popular computing
paradigm which provides users massive computing, storage and software resources on
demand [2]. With more and more cloud applications becoming available, data security
becomes an important issue in cloud computing. A security enhancement for Hadoop,
which provides strong mutual authentication using Kerberos, is presented in [2].
To ensure data security in Hadoop-based cloud data storage, a novel triple
encryption scheme is proposed and implemented: it combines HDFS file encryption
using DEA (Data Encryption Algorithm) with encryption of the data key using RSA,
and then encrypts the user's RSA private key using IDEA (International Data
Encryption Algorithm) [2].
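The core idea of the triple scheme, a symmetric data key that encrypts the file and is itself encrypted under RSA, can be sketched with standard JDK classes. DEA and IDEA are not available in the standard JDK, so AES stands in for the symmetric cipher here; this is an analogy to the scheme in [2], not an implementation of it, and the class name `EnvelopeDemo` is hypothetical.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.security.KeyPair;
import java.security.KeyPairGenerator;

public class EnvelopeDemo {
    // Wrap the symmetric data key under the user's RSA public key.
    public static byte[] wrapKey(SecretKey dataKey, KeyPair user) throws Exception {
        Cipher c = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        c.init(Cipher.WRAP_MODE, user.getPublic());
        return c.wrap(dataKey);
    }

    // Only the RSA private key can recover the data key (and hence the file).
    public static SecretKey unwrapKey(byte[] wrapped, KeyPair user) throws Exception {
        Cipher c = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        c.init(Cipher.UNWRAP_MODE, user.getPrivate());
        return (SecretKey) c.unwrap(wrapped, "AES", Cipher.SECRET_KEY);
    }

    public static void main(String[] args) throws Exception {
        SecretKey dataKey = KeyGenerator.getInstance("AES").generateKey();
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair user = gen.generateKeyPair();
        byte[] wrapped = wrapKey(dataKey, user);
        // The recovered key must match the original data key byte for byte.
        System.out.println(java.util.Arrays.equals(
                unwrapKey(wrapped, user).getEncoded(), dataKey.getEncoded()));
    }
}
```

In the full scheme of [2] the private key would itself be encrypted again (with IDEA) before being stored, adding the third layer.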
2.5 SUMMARY
This chapter provides an overview of the concepts underlying the project. The
study shows that a literature review is an important part of any research, as it helps
determine whether an idea or technique has been studied before. Every journal article
has its major points, which can be related to this project. The technique used here
was chosen based on previous research articles and journals, which were compared to
decide which technique would be selected.
CHAPTER 3
METHODOLOGY
3.1 INTRODUCTION
This chapter covers the detailed explanation of the methodology used in this
project. The methodology ensures that the implementation fulfils the objectives and
that the system or tool can be developed successfully. After considering the pros and
cons of several available process models, the iterative and incremental model was
chosen. The details of this iterative model are explained in this chapter.
3.2 ANALYSIS STUDY
The development and testing processes of the evolutionary/iterative
methodology are completed at each stage. This model is less costly when scopes and
requirements change, and it makes testing and debugging easier during small
iterations.
3.3 METHODOLOGY REVIEW
FIGURE 3.3.1 Flow chart of the project, proceeding from Start through the following
steps to End:
1. Choose a title for the project
2. Find the literature review
3. Identify the problem statement
4. Identify the objective of the project
5. Find the scope and limitation of work in the project
6. Install Hadoop and study the architecture of Hadoop
7. Study the AES encryption algorithm and find the location to implement the
algorithm in the Hadoop architecture
8. Test the success of the encryption
The diagram above shows the steps to complete the project. Step 1 starts
with brainstorming the idea and title of the project, which was approved by my
supervisor and Head of Department; I chose the title “Data at Rest” Encryption for
Hadoop, with the open-source framework running on a single server. After choosing
the title, articles and research related to it were sought out, and several articles were
selected for the literature review.
In step 3, the problem statement was identified from reading the articles, and
in step 4 the objective of the project was justified. In step 5, the scope and limitations
of the project were identified: the data analysed by Hadoop will be stored in the
Hadoop Distributed File System (HDFS), and this project will encrypt that data at
rest, covering only data in text form. To make data encryption possible in Hadoop,
adjustment of the Hadoop architecture is needed, and the project will be run on
Linux.
The next step is the installation of Hadoop in Oracle VM VirtualBox and the
study of the Hadoop architecture, which consists of the Hadoop kernel, MapReduce
and HDFS, plus a number of related components such as Apache Hive, HBase, Oozie,
Pig and ZooKeeper.
FIGURE 3.3.2 Architecture of Hadoop (within the Hadoop framework, HDFS forms
the base layer, with YARN and MapReduce above it and Hive, HBase, Pig and other
projects such as Avro and ZooKeeper on top)
HDFS is a highly fault-tolerant distributed file system responsible for storing
data on the cluster, while MapReduce is a powerful parallel programming technique
for the distributed processing of vast amounts of data on clusters. Besides that,
HBase is a column-oriented distributed NoSQL database for random read/write
access. Pig is a high-level data programming language for analysing data in Hadoop
computations. Hive is a data warehousing application that provides SQL-like access
and a relational model, while Sqoop is a project for transferring or importing data
between relational databases and Hadoop. Oozie provides orchestration and workflow
management for dependent Hadoop jobs.
In step 7, the data at rest in HDFS will be encrypted with the AES
encryption algorithm. A study of the AES algorithm is needed to find where to
implement it in the Hadoop architecture. The encrypted data can be decrypted into
plaintext only with the correct key, so an intruder without the key cannot read it.
The last step is to test the success of the encryption. Testing of the
framework must be performed to detect any defects that can only be found in the
operational environment. If everything functions smoothly without bugs or errors, the
framework becomes the final product; if bugs remain, the process goes back to
step 7.
3.4 METHOD/TECHNIQUES
FIGURE 3.4.1 Encryption and Decryption
In this project, an encryption technique is applied. Encryption is the process
of encoding data in such a way that only authorized users can decode and use it,
which is self-defensive and enhances data security [1]. It converts plaintext to
ciphertext, while decryption converts ciphertext back to plaintext [7].
Symmetric encryption is used to encrypt more than a small amount of data.
The same symmetric key is used during both encryption and decryption: the key used
to encrypt the data must also be used to decrypt that particular piece of
ciphertext [7]. The goal of every encryption algorithm is to make it as difficult as
possible to decrypt the generated ciphertext without the key [7].
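That requirement, that exactly the same key be used for decryption, can be demonstrated with a short sketch (the class name `SameKeyDemo` is assumed, and AES/GCM is assumed as the mode): decrypting with a different key does not silently return garbage but fails with an authentication error.

```java
import javax.crypto.AEADBadTagException;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

public class SameKeyDemo {
    // Run AES/GCM in the given mode with the given key and IV.
    static byte[] run(byte[] input, SecretKey key, int mode, byte[] iv) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        c.init(mode, key, new GCMParameterSpec(128, iv));
        return c.doFinal(input);
    }

    public static void main(String[] args) throws Exception {
        byte[] iv = new byte[12];                 // fixed IV for this demo only
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        SecretKey rightKey = kg.generateKey();
        SecretKey wrongKey = kg.generateKey();

        byte[] ct = run("plaintext".getBytes(), rightKey, Cipher.ENCRYPT_MODE, iv);
        System.out.println(new String(run(ct, rightKey, Cipher.DECRYPT_MODE, iv)));
        try {
            run(ct, wrongKey, Cipher.DECRYPT_MODE, iv); // a different key is rejected
        } catch (AEADBadTagException e) {
            System.out.println("decryption with the wrong key failed");
        }
    }
}
```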
3.5 FRAMEWORK OF PROJECT
FIGURE 3.5.1 Framework of the project (the admin accesses the server, a
Linux/Ubuntu machine on which Apache Hadoop is installed; within the Hadoop
framework, encryption is applied at the HDFS layer, beneath YARN/MapReduce and
components such as Hive, HBase, Pig, Avro and ZooKeeper)
Based on the framework of the project above, the admin accesses and
controls the server. The open-source Apache Hadoop software is installed on the
server, and the encryption method is implemented in the HDFS (Hadoop Distributed
File System) component. The data at rest in HDFS is encrypted with the AES
algorithm, and the study of AES is needed to find where to implement it in the
Hadoop architecture. The encrypted data can be decrypted into plaintext only with
the correct key.
3.6 SYSTEM REQUIREMENT
The framework requirements are needed to complete the system. The
hardware and software requirements are among the most important parts of a
successful project, because they influence its success; incomplete requirements may
cause the project to face problems.
3.6.1 Software Requirement
Microsoft Office PowerPoint 2016
Microsoft Word 2016
Windows 8.1 Single Language
Oracle VM VirtualBox
3.6.2 Hardware Requirement
HP laptop
Mouse
Printer
3.7 SUMMARY
In conclusion, in order to produce a complete project within the given time,
the selection of a suitable methodology is needed to ensure that the deployment of
the project is successful. A good methodology provides systematic steps in the
development of the project and minimises errors.
REFERENCES
[1] https://www.researchgate.net/publication/301887194_Efficient_Hybrid_MAES_Encryption_Algorithm_for_Mobile_Device_Data_Security_at_Rest_in_Cloud_Environment
[2] Yang, C., Lin, W., & Liu, M. (2013, September). A novel triple encryption scheme
for Hadoop-based cloud data security. In Emerging Intelligent Data and Web
Technologies (EIDWT), 2013 Fourth International Conference on (pp. 437-442). IEEE.
[3] Sharma, P. P., & Navdeti, C. P. (2014). Securing big data Hadoop: A review of
security issues, threats, and solution. Int. J. Comput. Sci. Inf. Technol., 5.
[4] https://securosis.com/assets/library/reports/Securing_Hadoop_Final_V2.pdf
[5] Padmavathi, B., & Kumari, S. R. (2013). A survey on performance analysis of
DES, AES and RSA algorithm along with LSB substitution. Int. J. Sci. Res., 2(4),
170-174.
[6] Pol, U. R. (2016). Big data analysis: Comparison of Hadoop MapReduce and
Apache Spark. International Journal of Engineering Science and Computing, 6(6),
6389-6391. https://doi.org/10.4010/2016.1535
[7] Microsoft. [Online]. Available:
https://msdn.microsoft.com/enus/library/windows/desktop/aa381939(v=vs.85).asp