Design of Big Data Processing System Architecture Based on Hadoop under Cloud Computing
Chunmei Duan, Heyuan Polytechnic, Heyuan, Guangdong, 517000, China
Keywords: cloud computing, Hadoop, big data, storage model, security framework.
Abstract. To address the limitations of traditional data processing technology in handling big data, a big data processing system architecture based on Hadoop is designed, exploiting the quantified, unstructured, and dynamic characteristics of cloud computing. HDFS is responsible for big data storage, MapReduce is responsible for big data computation, and HBase serves as the database for unstructured data; a storage model and a cloud computing security model are designed at the same time. The architecture achieves efficient storage, management, and retrieval of data, and can thus reduce construction cost while guaranteeing system stability, reliability, and security.
Introduction
In the Internet age, with the advent of the cloud era, big data attracts more and more attention. It can be said that big data opens a major change, moving our society from a computation-centric era into a data-centric era, and the changing times require us to obtain ever richer data at ever greater speed. But big data processing is not only about collecting large amounts of information; the data must also be handled with specialized technology. Therefore, whoever masters cloud computing and big data will shape the future.
Big data and cloud computing
Big data. Big data refers to data whose volume exceeds what current mainstream software tools can collect, manage, and process within a reasonable period of time. Scientific research, Internet applications, mobile device data, radio frequency and sensor data, and e-commerce have become the main sources of big data; enterprise data warehouses will become mainstream, and industry data and open government data will also become major data sources in the future. As data generation becomes automated and accelerates, effective treatment of the data must rely on effective data processing technology. Compared with traditional data warehouse applications, big data processing is characterized by complex query analysis. Big data processing technology generally includes massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and extensible storage systems.
Cloud computing is a feasible way of processing big data. Cloud computing is an emerging commercial computation model: it provides dynamically scalable, virtualized computing resources as services over the Internet, and is the product of combining distributed computing, parallel computing, utility computing, network storage technologies, virtualization, and load balancing.
A cloud computing platform connects a large number of concurrent network computations with services, effectively connects and integrates distributed computing and storage resources through the network, and provides super computing and storage capacity for users. The cloud computing architecture consists of three layers: infrastructure, platform resources, and application services. Its system structure diagram is shown in Figure 1.
Applied Mechanics and Materials Vols. 556-562 (2014), pp. 6302-6306. Online available since 2014/May/23 at www.scientific.net. © (2014) Trans Tech Publications, Switzerland. doi:10.4028/www.scientific.net/AMM.556-562.6302
Fig.1 Cloud computing system structure diagram
The cloud client is the man-machine interface presented directly to the customer. The principle of the "cloud" is that traditional software, "installed locally, run locally", is turned into a "take and use" service: the client manipulates a remote server cluster via the Internet or a LAN connection to complete business logic or computing tasks. The cloud platform provides the "cloud" services with which developers create applications, usually running those cloud applications on top of the cloud infrastructure. The cloud infrastructure layer provides virtual desktop systems, servers, virtual network services, and email services.
Hadoop. Hadoop is a distributed system infrastructure and software framework that realizes distributed processing of big data. Hadoop works in a parallel manner, speeding up processing through parallelism; it can handle petabytes of data and runs on commodity servers, so it is a low-cost cloud computing solution. The core of Hadoop includes a series of components: the Hadoop Distributed File System (HDFS), the MapReduce model, the HBase database, and the Hive data warehouse. The project diagram is shown in Figure 2.
Fig.2 Hadoop project diagram
HDFS is mainly responsible for large-scale data storage; it provides high-throughput data access and is well suited to applications with large data sets. MapReduce is a programming model mainly responsible for computation over big data. HBase is a database suited to storing unstructured data; it uses HDFS as its file storage system and uses MapReduce to process the large data it holds. Hive is a data warehouse tool that converts SQL statements into MapReduce tasks to run.
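The MapReduce model described above splits a computation into a map phase that emits key-value pairs and a reduce phase that aggregates them by key. The following is a minimal, pure-Python sketch of that idea using the classic word-count example; it simulates the model in memory and does not use the actual Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word, as a mapper would."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    """Shuffle and reduce steps: group pairs by key and sum the values."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

docs = ["big data needs big storage", "hadoop processes big data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(pairs)
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real Hadoop job, the map tasks run in parallel on the cluster nodes that hold the input blocks, and the framework performs the shuffle between the two phases; the in-memory loop above only illustrates the data flow.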
Hadoop is reliable, efficient, and scalable; it is widely deployed and is currently the most widely used mature open-source cloud computing platform for big data processing.
Design of big data processing system architecture based on Hadoop
The overall system architecture. Traditional data processing is based on the relational database, so big data processing built on it must rely on high-performance computers, with high cost, low efficiency, long processing times, and programs that are difficult to write. Most big data is non-relational, unstructured data that must be processed and queried in a timely way, so the relational database, which also has its own structural limitations, cannot adapt to big data analysis and processing. In order to deal with big data effectively, the cloud computing model of the Hadoop architecture is chosen; it processes big data well and is often used as the base software when building big data solutions. The diagram of the big data processing system architecture based on Hadoop is shown in Figure 3.
Fig.3 Big data processing system architecture diagram
Creating a unified big data processing platform helps enterprises obtain a wider data perspective, control data instances directly and effectively, and improve data monitoring and safety performance. But Hadoop alone cannot solve all big data problems; it needs close integration with other components to form a complete and effective data processing platform.
Big data storage processing model. Big data analysis requires capabilities beyond the typical storage paradigm, and traditional storage technology cannot process the terabytes and petabytes of unstructured data of the big data era. Successful big data analysis needs a new method of processing large-capacity data and, at the same time, a new storage platform. The Hadoop platform can solve the problems caused by big data.
HDFS is a subproject of Hadoop characterized by high fault tolerance. It accesses application data in a streaming access mode and provides high-throughput data access, making it well suited to applications with big data sets. The system platform therefore usually uses HDFS to store the big data sources.
At the same time, rather than adding more hardware to perform large-scale batch data analysis on top of a relational database, the MapReduce computation model of the Hadoop framework is adopted to process big data under normal circumstances. Finally, the HBase distributed database is used for storage in order to achieve big data storage management. The big data storage process model is shown in Figure 4.
Fig.4 Big data storage process model diagram
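The three stages of the storage process model just described (raw data written to HDFS in blocks, batch computation by MapReduce, results stored in HBase) can be illustrated with toy in-memory stand-ins. All names and the 16-byte block size below are illustrative choices, not the Hadoop API.

```python
# Toy stand-ins for the three stages of the storage process model:
# an HDFS-like block store, a MapReduce-like batch job, an HBase-like table.

BLOCK_SIZE = 16  # bytes per block here; real HDFS defaults to 128 MB

def hdfs_put(store, path, data):
    """Split a file into fixed-size blocks, as HDFS does on write."""
    store[path] = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def mapreduce_job(blocks):
    """Batch computation over the blocks: here, the size of each block."""
    return [len(b) for b in blocks]

def hbase_store(table, row_key, family, qualifier, value):
    """Store the result under (row key, column family:qualifier)."""
    table.setdefault(row_key, {})[f"{family}:{qualifier}"] = value

hdfs, hbase = {}, {}
hdfs_put(hdfs, "/logs/day1", b"a" * 40)          # 40 bytes -> 3 blocks
sizes = mapreduce_job(hdfs["/logs/day1"])
hbase_store(hbase, "day1", "stats", "block_sizes", sizes)
print(hbase["day1"]["stats:block_sizes"])        # [16, 16, 8]
```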
The big data storage system based on Hadoop distributed technology is able to store and manage big data efficiently. Within it, HDFS provides data storage, management, and processing functions for the Hadoop platform and is the base layer of the Hadoop framework. The HDFS architecture mainly comprises the NameNode and the DataNodes. The NameNode is the mediator and repository of all HDFS metadata: it manages the file system namespace and responds to read and write requests from file system clients. The DataNodes execute block create, delete, and copy operations under the guidance of the NameNode. The client obtains metadata information through the NameNode and interacts with the DataNodes to access the entire file system. The communications between them, including communication between the client and the HDFS NameNode server, are based on TCP/IP. The architecture diagram is shown in Figure 5.
Fig.5 HDFS architecture diagram
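The client/NameNode/DataNode interaction described above can be sketched as a small simulation: the client first asks the NameNode (metadata only) where the blocks of a file live, then fetches the actual block data from the DataNodes. The classes and block identifiers below are illustrative, not the real HDFS wire protocol.

```python
class NameNode:
    """Holds only metadata: which DataNode stores each block of each file."""
    def __init__(self):
        self.metadata = {}  # path -> list of (block_id, datanode)

    def add_file(self, path, placements):
        self.metadata[path] = placements

    def get_block_locations(self, path):
        return self.metadata[path]

class DataNode:
    """Holds actual block bytes and serves reads directly to clients."""
    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]

# Two DataNodes hold the blocks; the NameNode only knows where they are.
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hdfs"
nn = NameNode()
nn.add_file("/demo.txt", [("blk_1", dn1), ("blk_2", dn2)])

# Client read: metadata from the NameNode, then data from the DataNodes.
data = b"".join(dn.read_block(bid) for bid, dn in nn.get_block_locations("/demo.txt"))
print(data.decode())  # hello hdfs
```

The point of the split is visible even in this sketch: the NameNode never touches file contents, so bulk data traffic goes directly between clients and DataNodes.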
HBase is a distributed, column-oriented open-source database in Hadoop that supports random, real-time read and write access to big data. HBase differs from a general relational database: it overcomes the theoretical and implementation limitations that the traditional relational database meets when dealing with big data. In the Hadoop system architecture, HBase sits in the structured storage layer: HDFS provides highly reliable underlying storage support, MapReduce provides high-performance computing power, ZooKeeper provides stable service and a failover mechanism, Pig and Hive provide high-level language support for HBase and undertake data statistics, and Sqoop provides data import and export functions, making migration from a traditional database to HBase convenient. The HBase data model is shown in Figure 6.
Fig.6 Hbase data model diagram
Fig.7 The cloud computing framework security system diagram
HBase is composed of the client, the master, and the slaves: a master node coordinates and manages one or more subordinate RegionServer machines.
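The HBase data model referred to in Figure 6 (row key, column family:qualifier, and timestamped versions of each cell) can be sketched with nested dictionaries. This is a conceptual illustration of the model only; the class and method names are invented and do not reflect the HBase client API.

```python
class HBaseTable:
    """Toy HBase-style table: row key -> 'family:qualifier' -> {timestamp: value}."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value, timestamp):
        """Write a new version of a cell; old versions are retained."""
        self.rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, as an HBase get does by default."""
        versions = self.rows[row_key][column]
        return versions[max(versions)]

t = HBaseTable()
t.put("user001", "info:name", "Alice", timestamp=1)
t.put("user001", "info:name", "Alicia", timestamp=2)      # newer version
t.put("user001", "contact:email", "a@example.com", timestamp=1)
print(t.get("user001", "info:name"))  # Alicia
```

Note how rows are sparse: each row stores only the columns it actually has, which is what makes this model suitable for the unstructured data discussed above.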
System security architecture design. Security is very important for a big data storage and processing system platform. Because Hadoop is open source, its security is relatively weak. Therefore, in view of current risks and threats, and through the study of cloud security policy and network security, a cloud computing framework security system based on Hadoop is designed by combining information security technologies addressing authority management, network security, data privacy and integrity protection, data encryption, and so on. The security model adopts a hierarchical architecture, shown in Figure 7.
In the cloud computing security model, authority management and identity authentication must be carried out to ensure that only people with the relevant permissions can access the Hadoop data; network threat defense is set up to realize intrusion protection; data encryption is performed to prevent data theft; data integrity guarantees ensure the data cannot be tampered with; and virtualization technology is used to realize dynamic management of physical resources and protect the safety of the cloud environment. On top of these protection strategies, the cloud computing security policy model based on the Hadoop structure is obtained.
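Of the protections listed above, the integrity layer is commonly realized with a keyed hash: a tag is computed before the data is written and re-verified on every read, so any tampering is detected. The sketch below uses HMAC-SHA256 from the Python standard library; the key name and record are illustrative, and in practice the key would be managed by the platform, not hard-coded.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # hypothetical; a real deployment uses managed keys

def seal(data: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag before the data is written to storage."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).digest()

def verify(data: bytes, tag: bytes) -> bool:
    """Re-compute the tag on read; a mismatch means the data was altered."""
    return hmac.compare_digest(seal(data), tag)

record = b"sensor reading: 42"
tag = seal(record)
print(verify(record, tag))                 # True: data is intact
print(verify(b"sensor reading: 99", tag))  # False: data was tampered with
```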
Conclusions
For the diversity, large capacity, and high speed of big data, and the quantified, unstructured, and dynamic characteristics of cloud computing, a big data processing system architecture based on the Hadoop framework is designed in this paper, and the open-source cloud platform Hadoop is studied further. The storage mode of the system, from HBase down to HDFS, is derived through analysis, and a cloud computing security model is designed at the same time to guarantee the security of the system. The system can thus store, manage, and retrieve data efficiently, save construction cost, and guarantee system stability and security.