Design of Big Data Processing System Architecture Based on Hadoop under Cloud Computing
Chunmei Duan, Heyuan Polytechnic, Heyuan, Guangdong, 517000, China
Keywords: cloud computing, Hadoop, big data, storage model, security framework.
Abstract. To address the limitations of traditional data processing technology in handling big data, a big data processing system architecture based on Hadoop is designed, exploiting the quantified, unstructured, and dynamic characteristics of cloud computing. HDFS is responsible for big data storage, MapReduce is responsible for big data computation, and HBase serves as the database for unstructured data; a storage model and a cloud computing security model are designed at the same time. The architecture achieves efficient storage, management, and retrieval of data, and can thus reduce construction cost while guaranteeing system stability, reliability, and security.
Introduction
In the Internet age, with the advent of the cloud era, big data attracts more and more attention. It can be said that big data opens a major change, moving our society from a computation-centric era into a data-centric era, and the changing times require us to obtain ever richer data at ever greater speed. But big data processing is not only about collecting large amounts of information; the data must also be handled with specialized technology. Therefore, whoever masters cloud computing and big data will shape the future.
Big data and cloud computing
Big data. Big data refers to data whose volume exceeds what current mainstream software tools can collect, manage, and process within a reasonable period of time. Scientific research, Internet applications, mobile device data, radio frequency and sensor data, and e-commerce have become the main sources of big data; enterprise data warehouses will become mainstream, and industry data and open government data will also become major data sources in the future. As data generation becomes automated and accelerates, effective treatment of the data must rely on effective data processing technology. Compared with traditional data warehouse applications, big data processing is characterized by complex query analysis. Big data processing technology generally includes massively parallel processing (MPP) databases, data mining grids, distributed file systems, distributed databases, cloud computing platforms, the Internet, and extensible storage systems.
Cloud computing is a feasible way of processing big data. Cloud computing is an emerging commercial computation model: it provides dynamically scalable, virtualized computing resources as services over the Internet, and is the product of combining distributed computing, parallel computing, utility computing, network storage technologies, virtualization, and load balancing.
A cloud computing platform connects a large number of concurrent network computations with services, effectively connects and integrates distributed computing and storage resources through the network, and provides super computing and storage capacity for users. The cloud computing architecture consists of three layers: infrastructure, platform resources, and application services. Its system structure diagram is shown in Figure 1.
Applied Mechanics and Materials Vols. 556-562 (2014), pp. 6302-6306. Online available since 2014/May/23 at www.scientific.net. © (2014) Trans Tech Publications, Switzerland. doi:10.4028/www.scientific.net/AMM.556-562.6302
Fig.1 Cloud computing system structure diagram
The cloud client is the man-machine interface presented directly to the customer. The principle of the "cloud" is that traditional software, "installed locally, run locally", is turned into a "take and use" service: the client manipulates a remote server cluster via the Internet or a LAN connection to complete business logic or computing tasks. The cloud platform provides the "cloud" services with which developers create applications, usually running those cloud applications on top of the cloud infrastructure. The cloud infrastructure layer provides virtual desktop systems, servers, virtual network services, and email services.
Hadoop. Hadoop is a distributed system infrastructure and software framework that realizes distributed processing of big data. Hadoop works in a parallel manner, speeding up processing through parallelism; it can handle petabytes of data and runs on commodity servers, so it is a low-cost cloud computing solution. The core of Hadoop includes a series of components: the Hadoop Distributed File System (HDFS), the MapReduce model, the HBase database, and the Hive data warehouse. The project diagram is shown in Figure 2.
Fig.2 Hadoop project diagram
HDFS is mainly responsible for large-scale data storage; it provides high-throughput data access and is well suited to applications with large data sets. MapReduce is a programming model mainly responsible for computation over big data. HBase is a database suited to storing unstructured data; it uses HDFS as its file storage system and uses MapReduce to process the large data it holds. Hive is a data warehouse tool that converts SQL statements into MapReduce tasks to run.
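The MapReduce model described above splits a computation into a map phase that emits key-value pairs and a reduce phase that aggregates them by key. The following is a minimal, pure-Python sketch of that idea using the classic word-count example; it simulates the model in memory and does not use the actual Hadoop API.

```python
from collections import defaultdict

def map_phase(document):
    """Map step: emit a (word, 1) pair for every word, as a mapper would."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(pairs):
    """Shuffle and reduce steps: group pairs by key and sum the values."""
    grouped = defaultdict(int)
    for key, value in pairs:
        grouped[key] += value
    return dict(grouped)

docs = ["big data needs big storage", "hadoop processes big data"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(pairs)
print(counts["big"])   # 3
print(counts["data"])  # 2
```

In a real Hadoop job, the map tasks run in parallel on the cluster nodes that hold the input blocks, and the framework performs the shuffle between the two phases; the in-memory loop above only illustrates the data flow.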
Hadoop is reliable, efficient, and scalable; it is widely deployed and is currently the most widely used mature open-source cloud computing platform for big data processing.
Design of big data processing system architecture based on Hadoop
The overall system architecture. Traditional data processing is based on the relational database, so big data processing built on it must rely on high-performance computers, with high cost, low efficiency, long processing times, and programs that are difficult to write. Most big data is non-relational, unstructured data that must be processed and queried in a timely way, so the relational database, which also has its own structural limitations, cannot adapt to big data analysis and processing. In order to deal with big data effectively, the cloud computing model of the Hadoop architecture is chosen; it processes big data well and is often used as the base software when building big data solutions. The diagram of the big data processing system architecture based on Hadoop is shown in Figure 3.
Fig.3 Big data processing system architecture diagram
Creating a unified big data processing platform helps enterprises obtain a wider data perspective, control data instances directly and effectively, and improve data monitoring and safety performance. But Hadoop alone cannot solve all big data problems; it needs close integration with other components to form a complete and effective data processing platform.
Big data storage processing model. Big data analysis requires capabilities beyond the typical storage paradigm, and traditional storage technology cannot process the terabytes and petabytes of unstructured data of the big data era. Successful big data analysis needs a new method of processing large-capacity data and, at the same time, a new storage platform. The Hadoop platform can solve the problems caused by big data.
HDFS is a subproject of Hadoop characterized by high fault tolerance. It accesses application data in a streaming access mode and provides high-throughput data access, making it well suited to applications with big data sets. The system platform therefore usually uses HDFS to store the big data sources.
At the same time, rather than adding more hardware to perform large-scale batch data analysis on top of a relational database, the MapReduce computation model of the Hadoop framework is adopted to process big data under normal circumstances. Finally, the HBase distributed database is used for storage in order to achieve big data storage management. The big data storage process model is shown in Figure 4.
Fig.4 Big data storage process model diagram
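The three stages of the storage process model just described (raw data written to HDFS in blocks, batch computation by MapReduce, results stored in HBase) can be illustrated with toy in-memory stand-ins. All names and the 16-byte block size below are illustrative choices, not the Hadoop API.

```python
# Toy stand-ins for the three stages of the storage process model:
# an HDFS-like block store, a MapReduce-like batch job, an HBase-like table.

BLOCK_SIZE = 16  # bytes per block here; real HDFS defaults to 128 MB

def hdfs_put(store, path, data):
    """Split a file into fixed-size blocks, as HDFS does on write."""
    store[path] = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

def mapreduce_job(blocks):
    """Batch computation over the blocks: here, the size of each block."""
    return [len(b) for b in blocks]

def hbase_store(table, row_key, family, qualifier, value):
    """Store the result under (row key, column family:qualifier)."""
    table.setdefault(row_key, {})[f"{family}:{qualifier}"] = value

hdfs, hbase = {}, {}
hdfs_put(hdfs, "/logs/day1", b"a" * 40)          # 40 bytes -> 3 blocks
sizes = mapreduce_job(hdfs["/logs/day1"])
hbase_store(hbase, "day1", "stats", "block_sizes", sizes)
print(hbase["day1"]["stats:block_sizes"])        # [16, 16, 8]
```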
The big data storage system based on Hadoop distributed technology is able to store and manage big data efficiently. Within it, HDFS provides data storage, management, and processing functions for the Hadoop platform and is the base layer of the Hadoop framework. The HDFS architecture mainly comprises the NameNode and the DataNodes. The NameNode is the mediator and repository of all HDFS metadata: it manages the file system namespace and responds to read and write requests from file system clients. The DataNodes execute block create, delete, and copy operations under the guidance of the NameNode. The client obtains metadata information through the NameNode and interacts with the DataNodes to access the entire file system. The communications between them, including communication between the client and the HDFS NameNode server, are based on TCP/IP. The architecture diagram is shown in Figure 5.
Fig.5 HDFS architecture diagram
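The client/NameNode/DataNode interaction described above can be sketched as a small simulation: the client first asks the NameNode (metadata only) where the blocks of a file live, then fetches the actual block data from the DataNodes. The classes and block identifiers below are illustrative, not the real HDFS wire protocol.

```python
class NameNode:
    """Holds only metadata: which DataNode stores each block of each file."""
    def __init__(self):
        self.metadata = {}  # path -> list of (block_id, datanode)

    def add_file(self, path, placements):
        self.metadata[path] = placements

    def get_block_locations(self, path):
        return self.metadata[path]

class DataNode:
    """Holds actual block bytes and serves reads directly to clients."""
    def __init__(self):
        self.blocks = {}

    def read_block(self, block_id):
        return self.blocks[block_id]

# Two DataNodes hold the blocks; the NameNode only knows where they are.
dn1, dn2 = DataNode(), DataNode()
dn1.blocks["blk_1"] = b"hello "
dn2.blocks["blk_2"] = b"hdfs"
nn = NameNode()
nn.add_file("/demo.txt", [("blk_1", dn1), ("blk_2", dn2)])

# Client read: metadata from the NameNode, then data from the DataNodes.
data = b"".join(dn.read_block(bid) for bid, dn in nn.get_block_locations("/demo.txt"))
print(data.decode())  # hello hdfs
```

The point of the split is visible even in this sketch: the NameNode never touches file contents, so bulk data traffic goes directly between clients and DataNodes.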
HBase is a distributed, column-oriented open-source database in Hadoop that supports random, real-time read and write access to big data. HBase differs from a general relational database: it overcomes the theoretical and implementation limitations that the traditional relational database meets when dealing with big data. In the Hadoop system architecture, HBase sits in the structured storage layer: HDFS provides highly reliable underlying storage support, MapReduce provides high-performance computing power, ZooKeeper provides stable service and a failover mechanism, Pig and Hive provide high-level language support for HBase and undertake data statistics, and Sqoop provides data import and export functions, making migration from a traditional database to HBase convenient. The HBase data model is shown in Figure 6.
Fig.6 Hbase data model diagram
Fig.7 The cloud computing framework security system diagram
HBase is composed of the client, the master, and the slaves: a master node coordinates and manages one or more subordinate RegionServer machines.
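The HBase data model referred to in Figure 6 (row key, column family:qualifier, and timestamped versions of each cell) can be sketched with nested dictionaries. This is a conceptual illustration of the model only; the class and method names are invented and do not reflect the HBase client API.

```python
class HBaseTable:
    """Toy HBase-style table: row key -> 'family:qualifier' -> {timestamp: value}."""
    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value, timestamp):
        """Write a new version of a cell; old versions are retained."""
        self.rows.setdefault(row_key, {}).setdefault(column, {})[timestamp] = value

    def get(self, row_key, column):
        """Return the newest version of a cell, as an HBase get does by default."""
        versions = self.rows[row_key][column]
        return versions[max(versions)]

t = HBaseTable()
t.put("user001", "info:name", "Alice", timestamp=1)
t.put("user001", "info:name", "Alicia", timestamp=2)      # newer version
t.put("user001", "contact:email", "a@example.com", timestamp=1)
print(t.get("user001", "info:name"))  # Alicia
```

Note how rows are sparse: each row stores only the columns it actually has, which is what makes this model suitable for the unstructured data discussed above.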
System security architecture design. Security is very important for a big data storage and processing system platform. Because Hadoop is open source, its security is relatively weak. Therefore, in view of current risks and threats, and through the study of cloud security policy and network security, a cloud computing framework security system based on Hadoop is designed by combining information security technologies addressing authority management, network security, data privacy and integrity protection, data encryption, and so on. The security model adopts a hierarchical architecture, shown in Figure 7.
In the cloud computing security model, authority management and identity authentication must be carried out to ensure that only people with the relevant permissions can access the Hadoop data; network threat defense is set up to realize intrusion protection; data encryption is performed to prevent data theft; data integrity guarantees ensure the data cannot be tampered with; and virtualization technology is used to realize dynamic management of physical resources and protect the safety of the cloud environment. On top of these protection strategies, the cloud computing security policy model based on the Hadoop structure is obtained.
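Of the protections listed above, the integrity layer is commonly realized with a keyed hash: a tag is computed before the data is written and re-verified on every read, so any tampering is detected. The sketch below uses HMAC-SHA256 from the Python standard library; the key name and record are illustrative, and in practice the key would be managed by the platform, not hard-coded.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key"  # hypothetical; a real deployment uses managed keys

def seal(data: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag before the data is written to storage."""
    return hmac.new(SECRET_KEY, data, hashlib.sha256).digest()

def verify(data: bytes, tag: bytes) -> bool:
    """Re-compute the tag on read; a mismatch means the data was altered."""
    return hmac.compare_digest(seal(data), tag)

record = b"sensor reading: 42"
tag = seal(record)
print(verify(record, tag))                 # True: data is intact
print(verify(b"sensor reading: 99", tag))  # False: data was tampered with
```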
Conclusions
For the diversity, large capacity, and high speed of big data, and the quantified, unstructured, and dynamic characteristics of cloud computing, a big data processing system architecture based on the Hadoop framework is designed in this paper, and the open-source cloud platform Hadoop is studied further. The storage mode of the system, from HBase down to HDFS, is derived through analysis, and a cloud computing security model is designed at the same time to guarantee the security of the system. The system can thus store, manage, and retrieve data efficiently, save construction cost, and guarantee system stability and security.