Complexity International Journal (CIJ) Volume 24, Issue 01, Jan 2020
Impact Factor (2020): 5.6 ISSN: 1320-0682
http://cij.org.in/Currentvolumeissue2401.aspx
IMPLEMENTATION OF A PRIVACY BASED DEEP LEARNING
ALGORITHM FOR BIG DATA ANALYTICS
1Ramdas Vankdothu, 2Dr. Mohd Abdul Hameed
1Research Scholar, Department of Computer Science & Engineering, University College of Engineering (A),
Osmania University, Hyderabad, Telangana, India
2Assistant Professor, Department of Computer Science & Engineering, University College of
Engineering (A), Osmania University, Hyderabad, Telangana, India
Email: [email protected], [email protected]
ABSTRACT
The classification of big data is a demanding challenge among all research issues, since it
provides larger business value in any analytics environment. Classification is a mechanism
that labels data, enabling economical and effective performance in valuable analysis.
Research has indicated that feature quality can adversely affect classification performance.
This work explores the impact of privacy in the feature-selection process, since privacy is
mandatory when a sample feature is shared by a user for the choice of relevant features
from a databank and vice versa. The inclusion of a privacy-preserving mechanism should
also not affect classification performance. With these concerns in mind, this work
introduces an integrated mechanism named PPCS-MMDML for privacy-based feature
selection and effective classification of heterogeneous big-data image sets. PPCS, which
works well in a big-data environment with minimal running time, chooses a subset of
features while maintaining privacy between the user and the databank, and MMDML, a
deep learning method, classifies the big data efficiently. Qualitative assessment of the
proposed classification models and the privacy-preserving mechanism has been made with
classification accuracy and running time, respectively. Statistical analysis of accuracy
values and computational time shows that the proposed schemes provide promising
results over existing methods.
Keywords: PPCS, MMDML, big data.
1. INTRODUCTION
1.1 Big Data
The term big data refers to the ability to manage huge volumes of information with
analytical capabilities that overcome the limitations of existing data-processing
technologies [15]. The frequent and rapidly expanding use of sensors, the internet, heavy
machinery, etc., has driven an accelerating increase in data in today's digital world.
Big-data characteristics such as velocity and volume complicate the handling of data by
computing systems. The data management and warehousing techniques and systems
traditionally used for analysis fail to analyze this variety of data. To overcome this
complication, big-data storage is handled by a distributed-architecture file system.
The data types of big data are categorized as follows.
• Structured data
• Unstructured data
• Semi-structured data
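To make the three categories concrete, here is a small sketch with hypothetical records: a structured relational-style row, a semi-structured JSON document, and unstructured free text.

```python
import json

# Hypothetical records illustrating the three categories:
structured = {"id": 1, "sensor": "A", "reading": 23.5}  # fixed schema, fits a relational table
semi_structured = '{"id": 1, "tags": ["iot", "edge"], "meta": {"hz": 20}}'  # JSON: self-describing, flexible schema
unstructured = "Free-form log text, images, audio, and video carry no predefined schema."

# Semi-structured data can still be parsed and queried without a fixed schema.
doc = json.loads(semi_structured)
print(doc["meta"]["hz"])  # → 20
```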
1.2 Deep Learning
Deep learning naturally involves hierarchical representations and makes use of supervised
or unsupervised approaches for classification with deep architectures [15, 18]. Here,
upper-layer concepts can be used to characterize lower-layer concepts, and lower-layer
concepts can be used to characterize upper-layer concepts. Deep learning is a family of
machine learning techniques that depends on learning representations. For example, an
image can be described in multiple ways (e.g., as a vector of pixels), but only some
descriptions are useful in learning what the image is.
1.3 Need For Privacy
The exploding volume of data has multiplied the risk of privacy breaches rather than
simply making big data productively usable by humans. For example, Google, Amazon,
etc., learn our online purchasing choices and browsing behavior. Private information
about our lives and social relations is stored and analyzed by social networking websites
such as Facebook. Familiar video sites like YouTube suggest our favorite videos depending
on our search history. The ability of big data to gather, store, and reuse secret
information for business gain is a threat to private information.
2. LITERATURE SURVEY
Xiaojun Chen et al. have suggested a novel technique for feature-category weighting in
subspace clustering of high-dimensional data. High-dimensional data are partitioned into
feature categories established on the basis of their common aspects. They proposed two
kinds of weights to determine the effect of feature categories and of particular features
in every cluster simultaneously, and introduced an advanced optimization design to
describe the optimization activity.
Qinbao Song et al. suggested FAST, a feature-selection method strengthened with fast
clustering. It works in two steps. In the first step, the features are partitioned into
clusters using a graph-theoretic clustering technique. In the second step, the most
descriptive features, highly relevant to the target class, are chosen from every single
cluster to frame feature subsets.
Makoto Yamada et al. have suggested a technique named feature-wise kernelized lasso
for capturing non-linear input-output dependence. This method was introduced to
solve the inadequacy problem of the FVM. The benefit of this new construction is that
it effectively provides a globally optimal result and is extensible to high-dimensionality
features. This technique identifies features that are non-redundant
with strong statistical dependence on output values, as determined by kernel-based
independence measures such as HSIC, NOCCO, etc.
Zheng Zhao et al. performed a survey and observed that existing feature-selection
approaches mostly choose features that preserve sample similarity and can be unified
in a common framework. This framework could not deal with redundant features. With
these considerations, they introduced a technique named similarity-preserving feature
selection, formulated in an accurate and effective fashion.
Ciresan et al. [16] have suggested a GPU-based CNN technique trained by online
gradient descent. The CNN differs in the execution of the convolution and sub-sampling
layers and in the means used to train the network. The convolution layer uses various
filter maps of equal size to run the convolution operation, and the sub-sampling layers
pool pixels using max-pooling to minimize the size of the succeeding layer.
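The max-pooling step described above can be sketched with NumPy; this is an illustrative implementation, not Ciresan et al.'s GPU code:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling: keep the maximum of each size x size
    block, shrinking the feature map and thus the next layer's size."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # trim to a multiple of the block size
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
print(max_pool2d(fmap))  # 4x4 map reduced to 2x2: [[5 7] [13 15]]
```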
Rajat Raina et al. [66] have introduced a parallel technique to speed up the network
for efficient use in large-scale applications. The authors discuss the capacity of
advanced Graphics Processing Units (GPUs) to learn huge DBNs and sparse coding
simulations.
James Martens [56] developed a second-order optimization technique based on the
"Hessian-free" method and used it to train deep auto-encoders. This method can be
applied to huge models and large datasets to overcome the under-fitting
drawback observed while training deep auto-encoder neural networks.
3. EXISTING METHODS
The existing methods for providing data-privacy assurance and graph protection are
vast in number. A few of them are reviewed in this section.
3.1 Privacy-preserving aggregation
This technique, modeled with homomorphic encryption [63], is ideal for the
data-collection process. A common public key is used for encryption when a group of
individuals wishes to share their data together. The aggregation of ciphertexts from all
individuals included in the communication is then computed and shared. Using a
specific private key, the authorized user decrypts the data. This privacy-preserving
aggregation method preserves the privacy of individual data during both big-data
collection and big-data storage. The problem with this method is that ciphertext
aggregated for one objective cannot be utilized for other objectives. Hence the
traditional privacy-preserving aggregation method is ineffective for big-data analysis,
since it is objective-specific.
3.2 Operations over encrypted data
Individual data encryption is done for securing sensitive data and its relevant
keywords are preserved in third party storage [47]. Once the user finds a need to
read the data, it can be brought back from storage by executing the query. Data
privacy is assured by operations over encrypted data but the issue is that it requires
a long computing period. It is also a complicated process. Big data should be
processed in timely manner and hands large volume. So this method is inefficient for
big data analytics.
3.3 De-Identification
De-identification is a conventional method used to provide an assurance of privacy.
Data have to be sanitized, and minimal information should be revealed, to maintain an
individual's privacy. The de-identification method is more effective and adaptable to
data analytics when compared with operations over encrypted data and
privacy-preserving aggregation [11]. The issue is that, with de-identification, there is
the possibility of the attacker obtaining external information
in a big-data environment, which causes privacy issues. Hence this technique is also
unsatisfactory for providing privacy in big data.
4. PROPOSED METHODOLOGY
4.1 Mobile Big Data (MBD) Analytics Using Deep Learning
A. Problem Statement
Accelerometers are sensors which measure proper acceleration of an object due to
motion and gravitational force. Modern mobile devices are widely equipped with tiny
accelerometer circuits which are produced from electromechanically sensitive
elements and generate electrical signal in response to any mechanical motion. The
proper acceleration is distinctive from coordinate acceleration in classical mechanics.
The latter measures the rate of change of velocity while the former measures
acceleration relative to a free fall, i.e., the proper acceleration of an object in a free
fall is zero. Consider a mobile device with an embedded accelerometer sensor that
generates proper acceleration samples. Activity recognition is applied to time series
data frames which are formulated using a sliding and overlapping window. The
number of time-series samples depends on the accelerometer’s sampling frequency
(in Hertz) and windowing length (in seconds). At time t, the activity recognition
classifier f : xt → S matches the framed acceleration data xt with the most probable
activity label from the set of supported activity labels S = {1, 2, . . . , N}, where N is
the number of supported activities in the activity detection component. Conventional
approaches of recognizing activities require handcrafted features, e.g., statistical
features [3], which are expensive to design, require domain expert knowledge, and
generalize poorly to support more activities. To avoid this, a deep activity recognition
model learns not only the mapping between raw acceleration data and the
corresponding activity label, but also a set of meaningful features which are superior
to handcrafted features.
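The sliding, overlapping window described above can be sketched as follows; the 50% overlap and the list-based framing are illustrative assumptions, since the text does not fix the overlap:

```python
def frame_signal(samples, fs=20, window_s=10.0, overlap=0.5):
    """Frame a 1-D acceleration stream with a sliding, overlapping window.
    Frame length = fs * window_s samples (200 samples at 20 Hz and 10 s)."""
    frame_len = int(fs * window_s)
    step = int(frame_len * (1 - overlap))  # assumed 50% overlap -> step of 100
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]

frames = frame_signal(list(range(400)))  # 20 s of synthetic samples
print(len(frames), len(frames[0]))  # → 3 200
```

Each frame would then be passed to the classifier f : xt → S to obtain the most probable activity label.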
B. Experimental Setup
In this section, we use the Actitracker dataset, which includes accelerometer samples
of 6 conventional activities (walking, jogging, climbing stairs, sitting, standing, and
lying down) from 563 crowdsourcing users. Figure 4(a) plots the accelerometer signals
of the 6 different activities. Clearly, high-frequency signals are sampled for activities
with active body motion, e.g., walking, jogging, and climbing stairs. On the other
hand, low-frequency signals are collected during semi-static body motion, e.g.,
standing, sitting, and lying down. The data were collected using mobile phones at a
20 Hz sampling rate and contain labeled and unlabeled data of 2,980,765 and
38,209,772 samples, respectively. This is a real-world example of the limited number
of labeled data compared with unlabeled data, as data labeling requires manual human
intervention. The data are framed using a 10-sec windowing function, which generates
frames of 200 time-series samples. We first pre-train deep models on the unlabeled
samples only, and then fine-tune the models on the labeled dataset. To enhance
activity-recognition performance, we use the spectrogram of the acceleration signal as
the input of the deep models. Basically, different activities contain different frequency
contents, which reflect body dynamics and movements.
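The frequency-content idea can be illustrated with a plain FFT magnitude spectrum (a simpler stand-in for the spectrogram the text uses); the 2 Hz signal below is synthetic:

```python
import numpy as np

fs = 20                                   # 20 Hz sampling rate, as in the text
n = 200                                   # one 10-s frame -> 200 samples
frame = np.sin(2 * np.pi * 2.0 * np.arange(n) / fs)  # synthetic 2 Hz motion signal

# Magnitude spectrum of the frame: active motions (walking, jogging) put
# more energy at higher frequencies than semi-static postures do.
spectrum = np.abs(np.fft.rfft(frame))
peak_hz = np.fft.rfftfreq(n, d=1 / fs)[np.argmax(spectrum)]
print(peak_hz)  # → 2.0
```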
Fig. 4: Experimental analysis. (a) Accelerometer signal of different human activities.
(b) Recognition accuracy of deep learning models under different deep model setups.
(c) Speedup of learning deep models using the Spark-based framework under
different computing cores.
C. Experimental Results
1) The impact of deep models: The capacity of a deep model to capture MBD
structure increases when using deeper models with more layers and neurons.
Nonetheless, deeper models involve a significant increase in the learning algorithm's
computational burden and time.
TABLE I: Activity recognition error of deep learning and other conventional methods.
The conventional methods use handcrafted statistical features
2) The impact of computing cores:
The main performance metric of cluster-based computing is the task-speedup metric.
In particular, we compute the speedup efficiency as T8 / TM, where T8 is the
computing time of one machine with 8 cores, and TM is the computing time under
different computing power. Figure 4(c) shows the speedup in learning deep models
when the number of computing cores is varied. As the number of cores increases, the
learning time decreases. For example, a deep model of 5 layers with 2000 neurons
per layer can be trained in 3.63 hours with 6 Spark workers. This yields a speedup
efficiency of 4.1 compared with single-machine computing, which takes 14.91 hours.
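The speedup-efficiency figure quoted above follows directly from the definition T8 / TM:

```python
def speedup_efficiency(t_single: float, t_cluster: float) -> float:
    """Speedup efficiency T8 / TM: time on one 8-core machine over cluster time."""
    return t_single / t_cluster

# Figures from the text: 14.91 h on one machine vs 3.63 h with 6 Spark workers.
print(round(speedup_efficiency(14.91, 3.63), 1))  # → 4.1
```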
3) MBD veracity:
A normalized confusion matrix of a deep model is shown in Figure 5. This confusion
matrix shows the high performance of deep models on a per-activity basis (high
scores on the diagonal entries). The incorrect detection of the "sitting" activity
instead of the "lying down" activity is typically due to differences in how the
crowdsourcing users perform the activities. This gives a real-world example of
the "veracity" characteristic of MBD, i.e., uncertainty in MBD collection.
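Row-normalizing a confusion matrix, as in Figure 5, can be sketched as follows (the counts below are hypothetical, not the paper's results):

```python
import numpy as np

def normalize_confusion(cm):
    """Divide each row by its sum so entry (i, j) is the fraction of true
    class i predicted as class j; the diagonal gives per-activity accuracy."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

# Hypothetical counts: "lying down" (row 1) sometimes confused with "sitting" (col 0).
counts = [[90, 10],
          [30, 70]]
print(normalize_confusion(counts))  # rows sum to 1; diagonal = 0.9 and 0.7
```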
5. PPCS-MMDML ALGORITHM
5.1 PPCS Working
Feature selection is a process that determines the subclass of meaningful features for
establishing a classification layout. It rejects superfluous and mismatched features
with respect to the classification model. PPCS [52] achieves the same in an excellent
fashion while preserving privacy. This method measures the privacy-preserving cosine
similarity of the user's input image features with all image features available in the
databank without any privacy negotiation. In other words, private information is not
shared on either side while calculating the PPCS of a feature between the user and
the databank. The bounds of the PPCS calculation lie between -1 and 1, covering
cosine-similarity values for angular values from 0 to 180 degrees. PPCS similarity is
highest between the user's input image features and the databank image features
when the cosine angle between the image feature vectors is zero. After PPCS is
calculated for all pairs of features, the features with maximal similarity are collected
in an initially empty subset. This new subset derived from PPCS is the input to the
MMDML classification method. The steps of the PPCS algorithm are described in
Table 1.
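The ranking step can be pictured with plain cosine similarity; note this sketch deliberately omits the privacy-preserving protocol itself (no secure computation is performed here), and the 0.9 selection threshold is an assumption for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity in [-1, 1]; equals 1 when the angle between vectors is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_subset(user_feature, databank_features, threshold=0.9):
    """Collect databank features whose similarity to the user feature is maximal."""
    return [f for f in databank_features
            if cosine_similarity(user_feature, f) >= threshold]

user = [1.0, 0.0]
bank = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
print(len(select_subset(user, bank)))  # → 2
```

The resulting subset plays the role of the PPCS output that is fed to the MMDML classifier.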
Table 1 PPCS Algorithm for User and Databank
6. TOOLS AND TECHNIQUES
The proposed methodology is implemented using Matlab 2012a and Hadoop on an
Intel(R) Core(TM) i5-2410M CPU @ 2.30 GHz with 16 GB RAM. Privacy is maintained
using PPCS, and the Matlab 2012a framework is used for feature selection. Once the
features are selected, MMDML is designed through the Hadoop tool to minimize the
classification run time productively.
6.1 Dataset
The datasets gathered from the OSIRIX viewer [62] and the Mammographic Image
Analysis Society (MIAS) [74] are used to validate the performance of PPCS-MMDML.
The size of all images utilized for PPCS-MMDML is 1024 × 1024 pixels. The
performance of PPCS-MMDML is justified by parameters like runtime analysis and
classification accuracy for brain, bone, and breast cancer disease.
The objective depends on the hidden-layer outputs, which in turn depend on the
network parameters. Hence, to identify a locally optimal solution, an iterative
stochastic sub-gradient descent technique is used to compute the optimized parameter
values. The MMDML classifier is chosen over other conventional methods as it
considers the manifold margin, so that class-specific and discriminative analysis of
the data can be made efficiently.
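The iterative scheme can be sketched with a generic stochastic sub-gradient step on a hinge-style loss; this is not the MMDML objective itself (whose exact form is not reproduced here), only an illustration of sub-gradient descent on a non-smooth loss:

```python
import random

def stochastic_subgradient(data, lr=0.1, steps=200, seed=0):
    """Minimize a hinge-style loss max(0, 1 - y*(w*x + b)) on 1-D data by
    following a sub-gradient at randomly sampled points."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)          # label y is in {-1, +1}
        if y * (w * x + b) < 1:          # the loss is non-smooth at the hinge,
            w += lr * y * x              # so this update follows a sub-gradient
            b += lr * y
    return w, b

data = [(1.0, 1), (2.0, 1), (-1.0, -1), (-2.0, -1)]
w, b = stochastic_subgradient(data)
print(w > 0)  # → True
```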
7. RESULTS AND DISCUSSION
The computational analyses of the existing MMDML and the proposed PPCS-MMDML
for different datasets are shown in Table 2. The time needed by the PPCS-MMDML
method to execute the different datasets is lower than that of the existing MMDML
method.
Table 2 Running time analysis of PPCS-MMDML
The times taken by PPCS-MMDML are 1.4, 1.8, and 2.2 × 10^5 ms for the bone, brain,
and breast cancer datasets respectively, which are lower than those of the existing
MMDML. Figure 2 illustrates the runtime analysis of PPCS-MMDML and the existing
MMDML.
Figure 2 Running time analysis of PPCS-MMDML
Table 3 exhibits the performance analyses of the existing MMDML and the proposed
PPCS-MMDML on the three cancer-disease datasets. The accuracy figures of PPCS-
MMDML are 74.2%, 72.6%, and 77.8% for bone, brain, and breast cancer,
respectively. The comparative analysis shows that the proposed PPCS-MMDML gives
7.7%, 4.1%, and 6.2% improvement over the existing MMDML for classification of
the bone, brain, and breast datasets, respectively.
Table 3 Classification accuracy analysis of PPCS-MMDML
8. CONCLUSION
The focus of this research work is the need for a privacy-based deep learning
algorithm for big data analytics. The term big data requires the developed model to
deal with the volume, velocity, veracity, variety, and value characteristics. Keeping
these characteristics in mind, the implementation of a privacy-based deep learning
algorithm for big data analytics is performed through four works. The analytics task
used in this research is big data classification. Classifying data under the issues and
difficulties opened up by the big data environment is critical and challenging. Analysis
is carried out using the four works proposed to overcome the challenges of privacy
and classification in a big data environment.
The conclusion is that, with respect to both running time and classification accuracy,
the LNTP-MDBN deep learning scheme outperforms the other research works in a
big data environment. Also, PPCS with p-stability works effectively compared with
the existing privacy mechanisms in big data, and the inclusion of this privacy
mechanism does not affect classification performance in any manner.
9. FUTURE SCOPE
In future, this work can be extended in the following aspects:
➢ Integrating PPCS-LNTP-MDBN for effective privacy based deep learning
algorithm for big data analytics.
➢ Perform online learning for classification.
➢ Real time applications for different datasets.
REFERENCES
1. Z. Zhao, L. Wang, H. Liu and J. Ye, On similarity preserving feature selection, IEEE Trans. Knowledge and Data
Engineering, 25(3), (2013), 619–632.
2. M. Yamada, W. Jitkrittum, L. Sigal, E.P. Xing and M. Sugiyama, High-dimensional feature selection by feature-
wise kernelized lasso, Neural Computation, 26(1), (2014), 185–207.
3. J.Q. Gan, B.A.S. Hasan and C.S.L. Tsui, A filter-dominating hybrid sequential forward floating search method for
feature subset selection in high-dimensional space, Machine Learning and Cybernetics, 5(3), (2014), 413–423.
4. L. Deng and D. Yu, Deep Learning: Methods and Applications, NOW Publishers, (2013).
5. Q. Song, J. Ni and G. Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data,
IEEE Trans. Knowledge and Data Engineering, 25(1), (2013), 1–14.
6. D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella and J. Schmidhuber, Flexible high performance convolutional
neural networks for image classification, In Proc. Int. Conf. Artif. Intell., (2011), 1237–1242.
7. J. Martens, Deep learning via Hessian-free optimization, In Proc. Int. Conf. Mach. Learn., (2010), 735–742.
8. R. Raina, A. Madhavan and A. Ng, Large-scale deep unsupervised learning using graphics processors, In Proc.
Int. Conf. Mach. Learn., (2009), 873–880.
9. S. Suthaharan, Machine Learning Models and Algorithms for Big Data Classification, Springer, (2016).
10. B. Brahim and L. Mohamed, Robust ensemble feature selection for high dimensional data sets, In Proc. IEEE Int.
HPCS Conf., (2013), 151–157.