Complexity International Journal (CIJ) Volume 24, Issue 01, Jan 2020
Impact Factor (2020): 5.6 ISSN: 1320-0682
http://cij.org.in/Currentvolumeissue2401.aspx
IMPLEMENTATION OF A PRIVACY BASED DEEP LEARNING
ALGORITHM FOR BIG DATA ANALYTICS
1Ramdas Vankdothu, 2Dr. Mohd Abdul Hameed
1Research Scholar, Department of Computer Science & Engineering, University College of Engineering (A),
Osmania University, Hyderabad, Telangana, India
2Assistant Professor, Department of Computer Science & Engineering, University College of
Engineering (A), Osmania University, Hyderabad, Telangana, India
Email: [email protected], [email protected]
ABSTRACT
The classification of big data is a demanding challenge among all research issues, since it
provides larger business value in any analytics environment. Classification is a mechanism
that labels data, enabling economical and effective performance in valuable analysis.
Research has indicated that feature quality can adversely affect classification performance.
This work explores the impact of privacy in the feature-selection process, since privacy is
mandatory when a sample feature is shared by a user for the choice of relevant features
from a databank and vice versa. The inclusion of a privacy-preserving mechanism should
also not affect classification performance. With these concerns in mind, this work
introduces an integrated mechanism named PPCS-MMDML for privacy-based feature
selection and effective classification of heterogeneous big-data image sets. PPCS, which
works well in a big-data environment with minimal running time, chooses a subset of
features while maintaining privacy between the user and the databank, and MMDML, a
deep learning method, classifies the big data efficiently. Qualitative assessment of the
proposed classification models and the privacy-preserving mechanism has been made with
classification accuracy and running time, respectively. Statistical analysis of accuracy
values and computational time shows that the proposed schemes provide promising
results over existing methods.
Keywords: PPCS, MMDML, big data.
1. INTRODUCTION
1.1 Big Data
The term big data refers to the ability to manage huge volumes of information with
analytical capabilities that overcome the limitations of existing data-processing
technologies [15]. The frequent and rapidly expanding use of sensors, the internet, heavy
machinery, etc., has driven an accelerating increase in data in today's digital world.
Big-data characteristics such as velocity and volume complicate the handling of data by
computing systems. The data management and warehousing techniques and systems
traditionally used for analysis fail to analyze this variety of data. To overcome this
complication, big-data storage is handled by a distributed-architecture file system.
The data types of big data are categorized as follows.
• Structured data
• Unstructured data
• Semi-structured data
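To make the three categories concrete, here is a small sketch with hypothetical records: a structured relational-style row, a semi-structured JSON document, and unstructured free text.

```python
import json

# Hypothetical records illustrating the three categories:
structured = {"id": 1, "sensor": "A", "reading": 23.5}  # fixed schema, fits a relational table
semi_structured = '{"id": 1, "tags": ["iot", "edge"], "meta": {"hz": 20}}'  # JSON: self-describing, flexible schema
unstructured = "Free-form log text, images, audio, and video carry no predefined schema."

# Semi-structured data can still be parsed and queried without a fixed schema.
doc = json.loads(semi_structured)
print(doc["meta"]["hz"])  # → 20
```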
1.2 Deep Learning
Deep learning naturally involves hierarchical representations and makes use of supervised
or unsupervised approaches for classification with deep architectures [15, 18]. Here,
upper-layer concepts can be used to characterize lower-layer concepts, and lower-layer
concepts can be used to characterize upper-layer concepts. Deep learning is a family of
machine learning techniques that depends on learning representations. For example, an
image can be described in multiple ways (e.g., as a vector of pixels), but only some
descriptions are useful in learning what the image is.
1.3 Need For Privacy
The exploding volume of data has multiplied the risk of privacy breaches rather than
simply making big data productively usable by humans. For example, Google, Amazon,
etc., learn our online purchasing choices and browsing behavior. Private information
about our lives and social relations is stored and analyzed by social networking websites
such as Facebook. Familiar video sites like YouTube suggest our favorite videos depending
on our search history. The ability of big data to gather, store, and reuse secret
information for business gain is a threat to private information.
2. LITERATURE SURVEY
Xiaojun Chen et al. have suggested a novel technique for feature-category weighting in
subspace clustering of high-dimensional data. High-dimensional data are partitioned into
feature categories established on the basis of their common aspects. They proposed two
kinds of weights to determine the effect of feature categories and of particular features
in every cluster simultaneously, and introduced an advanced optimization design to
describe the optimization activity.
Qinbao Song et al. suggested FAST, a feature-selection method strengthened with fast
clustering. It works in two steps. In the first step, the features are partitioned into
clusters using a graph-theoretic clustering technique. In the second step, the most
descriptive features, highly relevant to the target class, are chosen from every single
cluster to frame feature subsets.
Makoto Yamada et al. have suggested a technique named feature-wise kernelized lasso
for capturing non-linear input-output dependence. This method was introduced to
solve the inadequacy problem of the FVM. The benefit of this new construction is that
it effectively provides a globally optimal result and is extensible to high-dimensionality
features. This technique identifies features that are non-redundant
with strong statistical dependence on output values, as determined by kernel-based
independence measures such as HSIC, NOCCO, etc.
Zheng Zhao et al. performed a survey and observed that existing feature-selection
approaches mostly choose features that preserve sample similarity and can be unified
in a common framework. This framework could not deal with redundant features. With
these considerations, they introduced a technique named similarity-preserving feature
selection, formulated in an accurate and effective fashion.
Ciresan et al. [16] have suggested a GPU-based CNN technique trained by online
gradient descent. The CNN differs in the execution of the convolution and sub-sampling
layers and in the means used to train the network. The convolution layer uses various
filter maps of equal size to run the convolution operation, and the sub-sampling layers
pool pixels using max-pooling to minimize the size of the succeeding layer.
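The max-pooling step described above can be sketched with NumPy; this is an illustrative implementation, not Ciresan et al.'s GPU code:

```python
import numpy as np

def max_pool2d(x, size=2):
    """Non-overlapping max-pooling: keep the maximum of each size x size
    block, shrinking the feature map and thus the next layer's size."""
    h, w = x.shape
    x = x[:h - h % size, :w - w % size]  # trim to a multiple of the block size
    return x.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
print(max_pool2d(fmap))  # 4x4 map reduced to 2x2: [[5 7] [13 15]]
```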
Rajat Raina et al. [66] have introduced a parallel technique to speed up the network
for efficient use in large-scale applications. The authors discuss the capacity of
advanced Graphics Processing Units (GPUs) to learn huge DBNs and sparse coding
simulations.
James Martens [56] developed a second-order optimization technique based on the
"Hessian-free" method and used it to train deep auto-encoders. This method can be
applied to huge models and large datasets to overcome the under-fitting
drawback observed while training deep auto-encoder neural networks.
3. EXISTING METHODS
The existing methods for providing data-privacy assurance and graph protection are
vast in number. A few of them are reviewed in this section.
3.1 Privacy-preserving aggregation
This technique, modeled with homomorphic encryption [63], is ideal for the
data-collection process. A common public key is used for encryption when a group of
individuals wishes to share their data together. The aggregation of ciphertexts from all
individuals included in the communication is then computed and shared. Using a
specific private key, the authorized user decrypts the data. This privacy-preserving
aggregation method preserves the privacy of individual data during both big-data
collection and big-data storage. The problem with this method is that ciphertext
aggregated for one objective cannot be utilized for other objectives. Hence the
traditional privacy-preserving aggregation method is ineffective for big-data analysis,
since it is objective-specific.
3.2 Operations over encrypted data
Individual data encryption is done for securing sensitive data and its relevant
keywords are preserved in third party storage [47]. Once the user finds a need to
read the data, it can be brought back from storage by executing the query. Data
privacy is assured by operations over encrypted data but the issue is that it requires
a long computing period. It is also a complicated process. Big data should be
processed in timely manner and hands large volume. So this method is inefficient for
big data analytics.
3.3 De-Identification
De-identification is a conventional method used to provide an assurance of privacy.
Data have to be sanitized, and minimal information should be revealed, to maintain an
individual's privacy. The de-identification method is more effective and adaptable to
data analytics when compared with operations over encrypted data and
privacy-preserving aggregation [11]. The issue is that, with de-identification, there is
the possibility of the attacker obtaining external information
in a big-data environment, which causes privacy issues. Hence this technique is also
unsatisfactory for providing privacy in big data.
4. PROPOSED METHODOLOGY
4.1 Mobile Big Data (MBD) Analytics Using Deep Learning
A. Problem Statement
Accelerometers are sensors which measure proper acceleration of an object due to
motion and gravitational force. Modern mobile devices are widely equipped with tiny
accelerometer circuits which are produced from electromechanically sensitive
elements and generate electrical signal in response to any mechanical motion. The
proper acceleration is distinctive from coordinate acceleration in classical mechanics.
The latter measures the rate of change of velocity while the former measures
acceleration relative to a free fall, i.e., the proper acceleration of an object in a free
fall is zero. Consider a mobile device with an embedded accelerometer sensor that
generates proper acceleration samples. Activity recognition is applied to time series
data frames which are formulated using a sliding and overlapping window. The
number of time-series samples depends on the accelerometer’s sampling frequency
(in Hertz) and windowing length (in seconds). At time t, the activity recognition
classifier f : xt → S matches the framed acceleration data xt with the most probable
activity label from the set of supported activity labels S = {1, 2, . . . , N}, where N is
the number of supported activities in the activity detection component. Conventional
approaches of recognizing activities require handcrafted features, e.g., statistical
features [3], which are expensive to design, require domain expert knowledge, and
generalize poorly to support more activities. To avoid this, a deep activity recognition
model learns not only the mapping between raw acceleration data and the
corresponding activity label, but also a set of meaningful features which are superior
to handcrafted features.
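The sliding, overlapping window described above can be sketched as follows; the 50% overlap and the list-based framing are illustrative assumptions, since the text does not fix the overlap:

```python
def frame_signal(samples, fs=20, window_s=10.0, overlap=0.5):
    """Frame a 1-D acceleration stream with a sliding, overlapping window.
    Frame length = fs * window_s samples (200 samples at 20 Hz and 10 s)."""
    frame_len = int(fs * window_s)
    step = int(frame_len * (1 - overlap))  # assumed 50% overlap -> step of 100
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, step)]

frames = frame_signal(list(range(400)))  # 20 s of synthetic samples
print(len(frames), len(frames[0]))  # → 3 200
```

Each frame would then be passed to the classifier f : xt → S to obtain the most probable activity label.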
B. Experimental Setup
In this section, we use the Actitracker dataset, which includes accelerometer samples
of 6 conventional activities (walking, jogging, climbing stairs, sitting, standing, and
lying down) from 563 crowdsourcing users. Figure 4(a) plots the accelerometer signals
of the 6 different activities. Clearly, high-frequency signals are sampled for activities
with active body motion, e.g., walking, jogging, and climbing stairs. On the other
hand, low-frequency signals are collected during semi-static body motion, e.g.,
standing, sitting, and lying down. The data were collected using mobile phones at a
20 Hz sampling rate and contain labeled and unlabeled data of 2,980,765 and
38,209,772 samples, respectively. This is a real-world example of the limited number
of labeled data compared with unlabeled data, as data labeling requires manual human
intervention. The data are framed using a 10-sec windowing function, which generates
frames of 200 time-series samples. We first pre-train deep models on the unlabeled
samples only, and then fine-tune the models on the labeled dataset. To enhance
activity-recognition performance, we use the spectrogram of the acceleration signal as
the input of the deep models. Basically, different activities contain different frequency
contents, which reflect body dynamics and movements.
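The frequency-content idea can be illustrated with a plain FFT magnitude spectrum (a simpler stand-in for the spectrogram the text uses); the 2 Hz signal below is synthetic:

```python
import numpy as np

fs = 20                                   # 20 Hz sampling rate, as in the text
n = 200                                   # one 10-s frame -> 200 samples
frame = np.sin(2 * np.pi * 2.0 * np.arange(n) / fs)  # synthetic 2 Hz motion signal

# Magnitude spectrum of the frame: active motions (walking, jogging) put
# more energy at higher frequencies than semi-static postures do.
spectrum = np.abs(np.fft.rfft(frame))
peak_hz = np.fft.rfftfreq(n, d=1 / fs)[np.argmax(spectrum)]
print(peak_hz)  # → 2.0
```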
Fig. 4: Experimental analysis. (a) Accelerometer signal of different human activities.
(b) Recognition accuracy of deep learning models under different deep model setups.
(c) Speedup of learning deep models using the Spark-based framework under
different computing cores.
C. Experimental Results
1) The impact of deep models: The capacity of a deep model to capture MBD
structure increases when using deeper models with more layers and neurons.
Nonetheless, deeper models involve a significant increase in the learning algorithm's
computational burden and time.
TABLE I: Activity recognition error of deep learning and other conventional methods.
The conventional methods use handcrafted statistical features
2) The impact of computing cores:
The main performance metric of cluster-based computing is the task-speedup metric.
In particular, we compute the speedup efficiency as T8 / TM, where T8 is the
computing time of one machine with 8 cores, and TM is the computing time under
different computing power. Figure 4(c) shows the speedup in learning deep models
when the number of computing cores is varied. As the number of cores increases, the
learning time decreases. For example, a deep model of 5 layers with 2000 neurons
per layer can be trained in 3.63 hours with 6 Spark workers. This yields a speedup
efficiency of 4.1 compared with single-machine computing, which takes 14.91 hours.
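The speedup-efficiency figure quoted above follows directly from the definition T8 / TM:

```python
def speedup_efficiency(t_single: float, t_cluster: float) -> float:
    """Speedup efficiency T8 / TM: time on one 8-core machine over cluster time."""
    return t_single / t_cluster

# Figures from the text: 14.91 h on one machine vs 3.63 h with 6 Spark workers.
print(round(speedup_efficiency(14.91, 3.63), 1))  # → 4.1
```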
3) MBD veracity:
A normalized confusion matrix of a deep model is shown in Figure 5. This confusion
matrix shows the high performance of deep models on a per-activity basis (high
scores on the diagonal entries). The incorrect detection of the "sitting" activity
instead of the "lying down" activity is typically due to differences in how the
crowdsourcing users perform the activities. This gives a real-world example of
the "veracity" characteristic of MBD, i.e., uncertainty in MBD collection.
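Row-normalizing a confusion matrix, as in Figure 5, can be sketched as follows (the counts below are hypothetical, not the paper's results):

```python
import numpy as np

def normalize_confusion(cm):
    """Divide each row by its sum so entry (i, j) is the fraction of true
    class i predicted as class j; the diagonal gives per-activity accuracy."""
    cm = np.asarray(cm, dtype=float)
    return cm / cm.sum(axis=1, keepdims=True)

# Hypothetical counts: "lying down" (row 1) sometimes confused with "sitting" (col 0).
counts = [[90, 10],
          [30, 70]]
print(normalize_confusion(counts))  # rows sum to 1; diagonal = 0.9 and 0.7
```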
5. PPCS-MMDML ALGORITHM
5.1 PPCS Working
Feature selection is a process that determines the subclass of meaningful features for
establishing a classification layout. It rejects superfluous and mismatched features
with respect to the classification model. PPCS [52] achieves the same in an excellent
fashion while preserving privacy. This method measures the privacy-preserving cosine
similarity of the user's input image features with all image features available in the
databank without any privacy negotiation. In other words, private information is not
shared on either side while calculating the PPCS of a feature between the user and
the databank. The bounds of the PPCS calculation lie between -1 and 1, covering
cosine-similarity values for angular values from 0 to 180 degrees. PPCS similarity is
highest between the user's input image features and the databank image features
when the cosine angle between the image feature vectors is zero. After PPCS is
calculated for all pairs of features, the features with maximal similarity are collected
in an initially empty subset. This new subset derived from PPCS is the input to the
MMDML classification method. The steps of the PPCS algorithm are described in
Table 1.
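The ranking step can be pictured with plain cosine similarity; note this sketch deliberately omits the privacy-preserving protocol itself (no secure computation is performed here), and the 0.9 selection threshold is an assumption for illustration:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity in [-1, 1]; equals 1 when the angle between vectors is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def select_subset(user_feature, databank_features, threshold=0.9):
    """Collect databank features whose similarity to the user feature is maximal."""
    return [f for f in databank_features
            if cosine_similarity(user_feature, f) >= threshold]

user = [1.0, 0.0]
bank = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
print(len(select_subset(user, bank)))  # → 2
```

The resulting subset plays the role of the PPCS output that is fed to the MMDML classifier.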
Table 1 PPCS Algorithm for User and Databank
6. TOOLS AND TECHNIQUES
The proposed methodology is implemented using Matlab 2012a and Hadoop on an
Intel(R) Core(TM) i5-2410M CPU @ 2.30 GHz with 16 GB RAM. Privacy is maintained
using PPCS, and the Matlab 2012a framework is used for feature selection. Once the
features are selected, MMDML is designed through the Hadoop tool to minimize the
classification run time productively.
6.1 Dataset
The datasets gathered from the OSIRIX viewer [62] and the Mammographic Image
Analysis Society (MIAS) [74] are used to validate the performance of PPCS-MMDML.
The size of all images utilized for PPCS-MMDML is 1024 × 1024 pixels. The
performance of PPCS-MMDML is justified by parameters like runtime analysis and
classification accuracy for brain, bone, and breast cancer disease.
The objective depends on the hidden-layer outputs, which in turn depend on the
network parameters. Hence, to identify a locally optimal solution, an iterative
stochastic sub-gradient descent technique is used to compute the optimized parameter
values. The MMDML classifier is chosen over other conventional methods as it
considers the manifold margin, so that class-specific and discriminative analysis of
the data can be made efficiently.
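The iterative scheme can be sketched with a generic stochastic sub-gradient step on a hinge-style loss; this is not the MMDML objective itself (whose exact form is not reproduced here), only an illustration of sub-gradient descent on a non-smooth loss:

```python
import random

def stochastic_subgradient(data, lr=0.1, steps=200, seed=0):
    """Minimize a hinge-style loss max(0, 1 - y*(w*x + b)) on 1-D data by
    following a sub-gradient at randomly sampled points."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(steps):
        x, y = rng.choice(data)          # label y is in {-1, +1}
        if y * (w * x + b) < 1:          # the loss is non-smooth at the hinge,
            w += lr * y * x              # so this update follows a sub-gradient
            b += lr * y
    return w, b

data = [(1.0, 1), (2.0, 1), (-1.0, -1), (-2.0, -1)]
w, b = stochastic_subgradient(data)
print(w > 0)  # → True
```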
7. RESULTS AND DISCUSSION
The computational analyses of the existing MMDML and the proposed PPCS-MMDML
for different datasets are shown in Table 2. The time needed by the PPCS-MMDML
method to execute the different datasets is lower than that of the existing MMDML
method.
Table 2 Running time analysis of PPCS-MMDML
The times taken by PPCS-MMDML are 1.4, 1.8, and 2.2 × 10^5 ms for the bone, brain,
and breast cancer datasets respectively, which are lower than those of the existing
MMDML. Figure 2 illustrates the runtime analysis of PPCS-MMDML and the existing
MMDML.
Figure 2 Running time analysis of PPCS-MMDML
Table 3 exhibits the performance analyses of the existing MMDML and the proposed
PPCS-MMDML on the three cancer-disease datasets. The accuracy figures of PPCS-
MMDML are 74.2%, 72.6%, and 77.8% for bone, brain, and breast cancer,
respectively. The comparative analysis shows that the proposed PPCS-MMDML gives
7.7%, 4.1%, and 6.2% improvement over the existing MMDML for classification of
the bone, brain, and breast datasets, respectively.
Table 3 Classification accuracy analysis of PPCS-MMDML
8. CONCLUSION
The focus of this research work is the need for a privacy-based deep learning
algorithm for big data analytics. The term big data requires the developed model to
deal with the volume, velocity, veracity, variety, and value characteristics. Keeping
these characteristics in mind, the implementation of a privacy-based deep learning
algorithm for big data analytics is performed through four works. The analytics task
used in this research is big data classification. Classifying data under the issues and
difficulties opened up by the big data environment is critical and challenging. Analysis
is carried out using the four works proposed to overcome the challenges of privacy
and classification in a big data environment.
The conclusion is that, with respect to both running time and classification accuracy,
the LNTP-MDBN deep learning scheme outperforms the other research works in a
big data environment. Also, PPCS with p-stability works effectively compared with
the existing privacy mechanisms in big data, and the inclusion of this privacy
mechanism does not affect classification performance in any manner.
9. FUTURE SCOPE
In future, this work can be extended in the following aspects:
➢ Integrating PPCS-LNTP-MDBN for effective privacy based deep learning
algorithm for big data analytics.
➢ Perform online learning for classification.
➢ Real time applications for different datasets.
REFERENCES
1. Z. Zhao, L. Wang, H. Liu and J. Ye, On similarity preserving feature selection, IEEE Trans. Knowledge and Data
Engineering, 25(3), (2013), 619–632.
2. M. Yamada, W. Jitkrittum, L. Sigal, E.P. Xing and M. Sugiyama, High-dimensional feature selection by feature-
wise kernelized lasso, Neural Computation, 26(1), (2014), 185–207.
3. J.Q. Gan, B.A.S. Hasan and C.S.L. Tsui, A filter-dominating hybrid sequential forward floating search method for
feature subset selection in high-dimensional space, Machine Learning and Cybernetics, 5(3), (2014), 413–423.
4. L. Deng and D. Yu, Deep Learning: Methods and Applications, NOW Publishers, (2013).
5. Q. Song, J. Ni and G. Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data,
IEEE Trans. Knowledge and Data Engineering, 25(1), (2013), 1–14.
6. D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella and J. Schmidhuber, Flexible high performance convolutional
neural networks for image classification, In Proc. Int. Conf. Artif. Intell., (2011), 1237–1242.
7. J. Martens, Deep learning via Hessian-free optimization, In Proc. Int. Conf. Mach. Learn., (2010), 735–742.
8. R. Raina, A. Madhavan and A. Ng, Large-scale deep unsupervised learning using graphics processors, In Proc.
Int. Conf. Mach. Learn., (2009), 873–880.
9. S. Suthaharan, Machine Learning Models and Algorithms for Big Data Classification, Springer, (2016).
10. B. Brahim and L. Mohamed, Robust ensemble feature selection for high dimensional data sets, In Proc. IEEE Int.
HPCS Conf., (2013), 151–157.