
Decentralizing Big Data Processing

Author: Fong To    Supervisor: Dr. Xiwei Xu

Background and Motivation


The world of Big Data is growing exponentially, with at least 2.5 quintillion bytes of data generated globally every day. The MapReduce framework is the heart of Apache Hadoop, a well-known, open-source Big Data processing framework designed to allow fast distributed processing of large volumes of data on large clusters of commodity hardware in order to uncover hidden relationships in Big Data. The Hadoop Distributed File System (HDFS) is the de facto storage backend for Hadoop. However, easy and seamless sharing of input and output data sets across Hadoop clusters is difficult, because each HDFS cluster is generally managed in an isolated manner, which prevents different organizations from sharing data sets across clusters. Problems with HDFS:

1. Inefficient data ingestion and sharing process: Before Hadoop jobs can run, the input data must be transferred into HDFS if it is stored on another file system, or otherwise downloaded from external sources; similarly, MapReduce results must be transferred out of HDFS before they can be shared with others. These required transfers greatly reduce efficiency and can be time consuming.

2. Highly centralized data set hosting: Data that is made available for use by third parties in HDFS is typically hosted and controlled by one entity. This implies that if the original host of the dataset decides they no longer want to host the data, it will no longer be accessible by others, so the longevity of the data cannot be guaranteed.

3. No guarantee of data integrity and versioning: Data stored on HDFS is typically only available in one version; data tampering can therefore occur easily without anyone noticing, and there is no easy way to be certain that the version being accessed is the one that is expected.

These concerns will be addressed by decentralizing the Big Data processing framework, that is, by replacing HDFS with a decentralized file system: the InterPlanetary File System (IPFS), an open-source, content-addressable, globally distributed (peer-to-peer) file system for sharing large volumes of data with high throughput, which aims to provide a Permanent Web.

Aims and Objectives

1. Replace HDFS with IPFS by implementing Hadoop's generic FileSystem interface using the Java implementation of the HTTP IPFS API, allowing MapReduce tasks to be executed directly on data stored in IPFS.

2. Carry out experiments using data collected from Internet of Things (IoT) devices (sensors) to test and evaluate the performance differences between HDFS and IPFS for Hadoop MapReduce jobs.

Results

To compare the performance of HDFS and IPFS, the total time taken for the MapReduce job together with the data ingestion process was measured: 23 minutes and 4 seconds for HDFS versus 29 minutes and 53 seconds for IPFS, a difference of 6 minutes and 49 seconds. Although IPFS is considerably slower than HDFS at running MapReduce jobs on Hadoop, it offers functionality and benefits that HDFS does not provide:

• Direct access to input data and MapReduce results over the IPFS network

• High availability and longevity of data
• Guaranteed data integrity and versioning

Conclusion and Future Improvement

By implementing the FileSystem interface between Hadoop and IPFS, big data analytics can be performed over a decentralized file system. Although the performance of MapReduce jobs on an IPFS-based Hadoop cluster was considerably slower than that of an HDFS-based system, the primary goal of facilitating easy sharing of data was achieved.

With the rapid growth of the IoT, which connects anything, anyone, any service, any place, and any network, large volumes of real-time (and often sensitive) data are generated every second. We therefore propose to integrate the decentralized analytics system built using IPFS on Hadoop with EthDrive and to perform real-time data processing, in the hope of enabling effective sharing of real-time business intelligence and knowledge between organizations.

Hadoop IPFS Implementation

The IPFSFileSystem interface implements Hadoop's generic FileSystem interface using the HTTP IPFS API, which enables interoperability between IPFS and Hadoop.
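The poster does not include source code, so the following is a minimal sketch of what such a FileSystem binding could look like, assuming the java-ipfs-http-client library (io.ipfs.api.IPFS) as the Java implementation of the HTTP IPFS API. Only the read path (open) is shown; the class name, the fs.ipfs.api key, and the helper stream are illustrative and not taken from the author's implementation.

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import io.ipfs.api.IPFS;
import io.ipfs.multihash.Multihash;

// Sketched as an abstract class so the FileSystem methods not shown here
// (create, listStatus, getFileStatus, rename, delete, mkdirs, ...) can stay
// unimplemented; a working version must provide all of them.
public abstract class IPFSFileSystemSketch extends FileSystem {

    private IPFS ipfs;   // HTTP client talking to a local IPFS daemon
    private URI uri;

    @Override
    public void initialize(URI name, Configuration conf) throws IOException {
        super.initialize(name, conf);
        this.uri = name;
        // 5001 is the default API port of the IPFS daemon
        this.ipfs = new IPFS(conf.get("fs.ipfs.api", "/ip4/127.0.0.1/tcp/5001"));
    }

    @Override
    public URI getUri() {
        return uri;
    }

    @Override
    public FSDataInputStream open(Path f, int bufferSize) throws IOException {
        // Interpret the last path component as a base58 content hash and
        // fetch the whole object from the IPFS network.
        Multihash hash = Multihash.fromBase58(f.getName());
        byte[] data = ipfs.cat(hash);
        return new FSDataInputStream(new InMemoryInputStream(data));
    }

    // FSDataInputStream requires a seekable stream, so the fetched bytes are
    // wrapped in a tiny in-memory FSInputStream.
    private static final class InMemoryInputStream extends FSInputStream {
        private final byte[] buf;
        private int pos;

        InMemoryInputStream(byte[] buf) { this.buf = buf; }

        @Override public void seek(long newPos) { pos = (int) newPos; }
        @Override public long getPos() { return pos; }
        @Override public boolean seekToNewSource(long targetPos) { return false; }
        @Override public int read() { return pos < buf.length ? (buf[pos++] & 0xff) : -1; }
    }
}

Registering the concrete implementation under a fs.<scheme>.impl configuration key (Hadoop's standard convention for custom file systems) is what lets MapReduce address inputs with ipfs:// paths.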

IPFSOutputStream is an output stream used directly by the IPFSFileSystem interface to create files on IPFS. It accepts writes to a new file and, once the stream is flushed or closed, publishes the file contents to IPFS and updates the working directory via the InterPlanetary Naming System (IPNS), a name service that IPFS offers.
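A sketch of how such an output stream could be structured is shown below: writes are buffered locally and published only when the stream is closed. The exact java-ipfs-http-client calls, and in particular the IPNS publish step, are assumptions rather than the author's code.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.List;

import io.ipfs.api.IPFS;
import io.ipfs.api.MerkleNode;
import io.ipfs.api.NamedStreamable;

public class IPFSOutputStreamSketch extends OutputStream {

    private final IPFS ipfs;
    private final String fileName;
    private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

    public IPFSOutputStreamSketch(IPFS ipfs, String fileName) {
        this.ipfs = ipfs;
        this.fileName = fileName;
    }

    @Override
    public void write(int b) {
        buffer.write(b);   // writes are buffered locally...
    }

    @Override
    public void close() throws IOException {
        // ...and pushed to IPFS only when the stream is closed
        NamedStreamable file = new NamedStreamable.ByteArrayWrapper(fileName, buffer.toByteArray());
        List<MerkleNode> added = ipfs.add(file);
        // The last returned node is the root of the added content; publishing
        // its hash over IPNS keeps the working directory pointing at the
        // newest version (assumed API).
        ipfs.name.publish(added.get(added.size() - 1).hash);
    }
}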

CustomOutputCommitter replaces the default committers used in Hadoop, allowing a MapReduce job to use HDFS for storing the intermediate files created during the map phase and IPFS for storing the final result of the job produced during the reduce phase.
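The sketch below illustrates one way such a committer could be written: it reuses Hadoop's standard FileOutputCommitter, so task-level and intermediate handling stays on HDFS, and only adds an IPFS upload of the final part files at job-commit time. This is an illustration of the described behaviour, not the author's implementation.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

import io.ipfs.api.IPFS;
import io.ipfs.api.NamedStreamable;

public class CustomOutputCommitterSketch extends FileOutputCommitter {

    private final Path outputPath;

    public CustomOutputCommitterSketch(Path outputPath, TaskAttemptContext context) throws IOException {
        super(outputPath, context);   // task attempts and intermediate files stay on HDFS
        this.outputPath = outputPath;
    }

    @Override
    public void commitJob(JobContext context) throws IOException {
        super.commitJob(context);     // finalise the HDFS copy of the result first

        // Then push every part file of the final result to IPFS as well
        IPFS ipfs = new IPFS(context.getConfiguration().get("fs.ipfs.api", "/ip4/127.0.0.1/tcp/5001"));
        FileSystem hdfs = outputPath.getFileSystem(context.getConfiguration());
        for (FileStatus part : hdfs.listStatus(outputPath)) {
            if (!part.isFile()) {
                continue;
            }
            ByteArrayOutputStream data = new ByteArrayOutputStream();
            try (FSDataInputStream in = hdfs.open(part.getPath())) {
                IOUtils.copyBytes(in, data, 4096, false);
            }
            ipfs.add(new NamedStreamable.ByteArrayWrapper(part.getPath().getName(), data.toByteArray()));
        }
    }
}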


IPFS MapReduce Logical Data Flow

Figure 1: The data flow of a MapReduce job running on top of the IPFS system.

1. Metadata, along with the input dataset, is fetched from the IPFS network; the data is then split into smaller files and containers are scheduled.

2. The map-phase containers then perform computation on their own subsets of data fetched from the IPFS network, and the intermediate output created by the map phase is stored in HDFS.

3. The reduce-phase containers then fetch the intermediate output from HDFS and perform computation on it. The final results are stored on both HDFS and IPFS. (A hypothetical driver configuration that wires these steps together is sketched below.)
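The following hypothetical driver shows how a job could be configured to follow this flow: the input path uses the ipfs:// scheme mapped to the custom FileSystem, while the job output lands on HDFS and is also pushed to IPFS by the custom committer. The configuration value, class names, and paths are placeholders, not the author's exact setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureOverIpfsDriver {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map the ipfs:// URI scheme onto the project's FileSystem implementation
        // (fully qualified class name is a placeholder, not the author's package).
        conf.set("fs.ipfs.impl", "ipfs.hadoop.IPFSFileSystem");

        Job job = Job.getInstance(conf, "max temperature over IPFS");
        job.setJarByClass(MaxTemperatureOverIpfsDriver.class);
        job.setMapperClass(MaxTemperatureMapper.class);     // sketched under Figure 3 below
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Step 1: the input dataset is addressed by its IPFS content hash.
        FileInputFormat.addInputPath(job, new Path("ipfs://QmExampleDatasetHash/temperature"));
        // Steps 2-3: intermediate map output stays on HDFS; the final result is
        // written to HDFS and also pushed to IPFS by the custom committer.
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///results/max-temperature"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}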

Experiment Set-up

Figure 2: The IoT data storage architecture. A sensor was used to collect air temperature data for the years 2016 and 2017 (49.4 GB in total).

Figure 3: A MapReduce program was implemented to find the highest global temperature recorded in 2016 and 2017.
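The poster does not reproduce the program itself; the sketch below shows a conventional max-temperature Mapper and Reducer pair (each in its own source file), under the assumption that the sensor data is stored as comma-separated lines of the form timestamp,stationId,temperature. The record format and class names are assumptions.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (year, temperature) for each sensor reading, e.g. "2016-07-21T14:00,station-42,31.5".
public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields.length < 3) {
            return;   // skip malformed lines
        }
        try {
            String year = fields[0].substring(0, 4);
            // store tenths of a degree so the value fits in an IntWritable
            int temperature = (int) Math.round(Double.parseDouble(fields[2]) * 10);
            context.write(new Text(year), new IntWritable(temperature));
        } catch (NumberFormatException | StringIndexOutOfBoundsException e) {
            // skip unparseable readings rather than failing the task
        }
    }
}

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Keeps the maximum temperature (in tenths of a degree) seen for each year.
public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text year, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            max = Math.max(max, value.get());
        }
        context.write(year, new IntWritable(max));
    }
}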

Figure 4: The system structure for the Hadoop-on-HDFS experiment.

Figure 5: The system structure for the Hadoop-on-IPFS experiment.
