
Decentralizing Big Data Processing

Author: Fong To    Supervisor: Dr. Xiwei Xu

Background and Motivation


The world of Big Data is growing exponentially, with at least 2.5 quintillion bytes of data generated globally every day. The MapReduce framework is the heart of Apache Hadoop, a well-known, open-source Big Data processing framework designed to allow fast distributed processing of large volumes of data on large clusters of commodity hardware in order to uncover hidden relationships in Big Data. The Hadoop Distributed File System (HDFS) is the de facto storage backend for Hadoop. However, easy and seamless sharing of input and output data sets across Hadoop clusters is difficult, because each HDFS cluster is generally managed in an isolated manner, which prevents different organizations from sharing data sets across clusters. Problems with HDFS:

1. Inefficient data ingestion and sharing process: Before Hadoop jobs can run, the input data must be transferred into HDFS if it is stored on another file system, or otherwise downloaded from external sources; similarly, MapReduce results must be transferred out of HDFS before they can be shared with others. These required transfers greatly reduce efficiency and can be time consuming.

2. Highly centralized data set hosting: Data that is made available for use by third parties in HDFS is typically hosted and controlled by one entity. This implies that if the original host of the dataset decides they no longer want to host the data, it will no longer be accessible by others, so the longevity of the data cannot be guaranteed.

3. No guarantee of data integrity and versioning: Data stored on HDFS is typically only available in one version; data tampering can therefore occur easily without anyone noticing, and there is no easy way to be certain that the version being accessed is the one that is expected.

These concerns will be addressed by decentralizing the Big Data processing framework, that is, by replacing HDFS with a decentralized file system: the InterPlanetary File System (IPFS), an open-source, content-addressable, globally distributed (peer-to-peer) file system for sharing large volumes of data with high throughput, which aims to provide a Permanent Web.

Aims and Objectives

1. Replace HDFS with IPFS by implementing Hadoop's generic FileSystem interface using the Java implementation of the HTTP IPFS API, allowing MapReduce tasks to be executed directly on data stored in IPFS.

2. Carry out experiments using data collected from Internet of Things (IoT) devices (sensors) to test and evaluate the performance differences between HDFS and IPFS for Hadoop MapReduce jobs.

Results

To compare the performance of HDFS and IPFS, the total time taken for the MapReduce job together with the data ingestion process was measured: 23 minutes and 4 seconds for HDFS versus 29 minutes and 53 seconds for IPFS, a difference of 6 minutes and 49 seconds. Although IPFS is considerably slower than HDFS at running MapReduce jobs on Hadoop, it offers functionality and benefits that HDFS does not provide:

• Direct access to input data and MapReduce results over the IPFS network

• High availability and longevity of data
• Guaranteed data integrity and versioning

Conclusion and Future Improvement

By implementing the FileSystem interface between Hadoop and IPFS, big data analytics can be performed over a decentralized file system. Although the performance of MapReduce jobs on an IPFS-based Hadoop cluster was considerably slower than that of an HDFS-based system, the primary goal of facilitating easy sharing of data was achieved.

With the rapid growth of the IoT, which connects anything, anyone, any service, any place, and any network, large volumes of real-time (and often sensitive) data are generated every second. We therefore propose to integrate the decentralized analytics system built using IPFS on Hadoop with EthDrive and to perform real-time data processing, in the hope of enabling effective sharing of real-time business intelligence and knowledge between organizations.

Hadoop IPFS Implementation

The IPFSFileSystem interface implements Hadoop's generic FileSystem interface using the HTTP IPFS API, which enables interoperability between IPFS and Hadoop.
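The poster does not include source code, so the following is a minimal sketch of what such a FileSystem binding could look like, assuming the java-ipfs-http-client library (io.ipfs.api.IPFS) as the Java implementation of the HTTP IPFS API. Only the read path (open) is shown; the class name, the fs.ipfs.api key, and the helper stream are illustrative and not taken from the author's implementation.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import io.ipfs.api.IPFS;
import io.ipfs.multihash.Multihash;

// Sketched as an abstract class so the FileSystem methods not shown here
// (create, listStatus, getFileStatus, rename, delete, mkdirs, ...) can stay
// unimplemented; a working version must provide all of them.
public abstract class IPFSFileSystemSketch extends FileSystem {

    private IPFS ipfs;   // HTTP client talking to a local IPFS daemon
    private URI uri;

    @Override
    public void initialize(URI name, Configuration conf) throws IOException {
        super.initialize(name, conf);
        this.uri = name;
        // 5001 is the default API port of the IPFS daemon
        this.ipfs = new IPFS(conf.get("fs.ipfs.api", "/ip4/127.0.0.1/tcp/5001"));
    }

    @Override
    public URI getUri() {
        return uri;
    }

    @Override
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        // Interpret the last path component as a base58 content hash and
        // fetch the whole object from the IPFS network.
        Multihash hash = Multihash.fromBase58(f.getName());
        byte[] data = ipfs.cat(hash);
        return new FSDataInputStream(new InMemoryInputStream(data));
    }

    // FSDataInputStream requires a seekable stream, so the fetched bytes are
    // wrapped in a tiny in-memory FSInputStream.
    private static final class InMemoryInputStream extends FSInputStream {
        private final byte[] buf;
        private int pos;

        InMemoryInputStream(byte[] buf) { this.buf = buf; }

        @Override public void seek(long newPos) { pos = (int) newPos; }
        @Override public long getPos() { return pos; }
        @Override public boolean seekToNewSource(long targetPos) { return false; }
        @Override public int read() { return pos < buf.length ? (buf[pos++] & 0xff) : -1; }
    }
}

Registering the concrete implementation under a fs.<scheme>.impl configuration key (Hadoop's standard convention for custom file systems) is what lets MapReduce address inputs with ipfs:// paths.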

IPFSOutputStream is an output stream used directly by the IPFSFileSystem interface to create files on IPFS. It accepts writes to a new file and, once the stream is flushed or closed, publishes the file contents to IPFS and updates the working directory via the InterPlanetary Naming System (IPNS), a name service that IPFS offers.
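A sketch of how such an output stream could be structured is shown below: writes are buffered locally and published only when the stream is closed. The exact java-ipfs-http-client calls, and in particular the IPNS publish step, are assumptions rather than the author's code.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import io.ipfs.api.IPFS;
import io.ipfs.api.MerkleNode;
import io.ipfs.api.NamedStreamable;

public class IPFSOutputStreamSketch extends OutputStream {

    private final IPFS ipfs;
    private final String fileName;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    public IPFSOutputStreamSketch(IPFS ipfs, String fileName) {
        this.ipfs = ipfs;
        this.fileName = fileName;
    }

    @Override
    public void write(int b) {
        buffer.write(b);   // writes are buffered locally...
    }

    @Override
    public void close() throws IOException {
        // ...and pushed to IPFS only when the stream is closed
        NamedStreamable file = new NamedStreamable.ByteArrayWrapper(fileName, buffer.toByteArray());
        List<MerkleNode> added = ipfs.add(file);
        // The last returned node is the root of the added content; publishing
        // its hash over IPNS keeps the working directory pointing at the
        // newest version (assumed API).
        ipfs.name.publish(added.get(added.size() - 1).hash);
    }
}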

CustomOutputCommitter replaces the default committers used in Hadoop, allowing a MapReduce job to use HDFS for storing the intermediate files created during the map phase and IPFS for storing the final result of the job produced during the reduce phase.
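The sketch below illustrates one way such a committer could be written: it reuses Hadoop's standard FileOutputCommitter, so task-level and intermediate handling stays on HDFS, and only adds an IPFS upload of the final part files at job-commit time. This is an illustration of the described behaviour, not the author's implementation.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

import io.ipfs.api.IPFS;
import io.ipfs.api.NamedStreamable;

public class CustomOutputCommitterSketch extends FileOutputCommitter {

    private final Path outputPath;

    public CustomOutputCommitterSketch(Path outputPath, TaskAttemptContext context) throws IOException {
        super(outputPath, context);   // task attempts and intermediate files stay on HDFS
        this.outputPath = outputPath;
    }

    @Override
    public void commitJob(JobContext context) throws IOException {
        super.commitJob(context);     // finalise the HDFS copy of the result first

        // Then push every part file of the final result to IPFS as well
        IPFS ipfs = new IPFS(context.getConfiguration().get("fs.ipfs.api", "/ip4/127.0.0.1/tcp/5001"));
        FileSystem hdfs = outputPath.getFileSystem(context.getConfiguration());
        for (FileStatus part : hdfs.listStatus(outputPath)) {
            if (!part.isFile()) {
                continue;
            }
            ByteArrayOutputStream data = new ByteArrayOutputStream();
            try (FSDataInputStream in = hdfs.open(part.getPath())) {
                IOUtils.copyBytes(in, data, 4096, false);
            }
            ipfs.add(new NamedStreamable.ByteArrayWrapper(part.getPath().getName(), data.toByteArray()));
        }
    }
}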


IPFS MapReduce Logical Data Flow

Figure 1: The data flow of a MapReduce job running on top of the IPFS system.

1. Metadata, along with the input dataset, is fetched from the IPFS network; the data is then split into smaller files and containers are scheduled.

2. The map-phase containers then perform computation on their own subsets of data fetched from the IPFS network, and the intermediate output created by the map phase is stored in HDFS.

3. The reduce-phase containers then fetch the intermediate output from HDFS and perform computation on it. The final results are stored on both HDFS and IPFS. (A hypothetical driver configuration that wires these steps together is sketched below.)
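The following hypothetical driver shows how a job could be configured to follow this flow: the input path uses the ipfs:// scheme mapped to the custom FileSystem, while the job output lands on HDFS and is also pushed to IPFS by the custom committer. The configuration value, class names, and paths are placeholders, not the author's exact setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureOverIpfsDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map the ipfs:// URI scheme onto the project's FileSystem implementation
        // (fully qualified class name is a placeholder, not the author's package).
        conf.set("fs.ipfs.impl", "ipfs.hadoop.IPFSFileSystem");

        Job job = Job.getInstance(conf, "max temperature over IPFS");
        job.setJarByClass(MaxTemperatureOverIpfsDriver.class);
        job.setMapperClass(MaxTemperatureMapper.class);     // sketched under Figure 3 below
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Step 1: the input dataset is addressed by its IPFS content hash.
        FileInputFormat.addInputPath(job, new Path("ipfs://QmExampleDatasetHash/temperature"));
        // Steps 2-3: intermediate map output stays on HDFS; the final result is
        // written to HDFS and also pushed to IPFS by the custom committer.
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///results/max-temperature"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}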

Experiment Set-up

Figure 2: The IoT data storage architecture. A sensor was used to collect air temperature data for the years 2016 and 2017 (49.4 GB in total).

Figure 3: A MapReduce program was implemented to find the highest global temperature recorded in 2016 and 2017.
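The poster does not reproduce the program itself; the sketch below shows a conventional max-temperature Mapper and Reducer pair (each in its own source file), under the assumption that the sensor data is stored as comma-separated lines of the form timestamp,stationId,temperature. The record format and class names are assumptions.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (year, temperature) for each sensor reading, e.g. "2016-07-21T14:00,station-42,31.5".
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            return;   // skip malformed lines
        }
        try {
            String year = fields[0].substring(0, 4);
            // store tenths of a degree so the value fits in an IntWritable
            int temperature = (int) Math.round(Double.parseDouble(fields[2]) * 10);
            context.write(new Text(year), new IntWritable(temperature));
        } catch (NumberFormatException | StringIndexOutOfBoundsException e) {
            // skip unparseable readings rather than failing the task
        }
    }
}

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Keeps the maximum temperature (in tenths of a degree) seen for each year.
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(year, new IntWritable(max));
    }
}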

Figure 4: The system structure for the Hadoop-on-HDFS experiment.

Figure 5: The system structure for the Hadoop-on-IPFS experiment.
