Protecting Sensory Data against Sensitive Inferences

Mohammad Malekzadeh, Queen Mary University of London, London, UK
Richard G. Clegg, Queen Mary University of London, London, UK
Andrea Cavallaro, Queen Mary University of London, London, UK
Hamed Haddadi, Imperial College London, London, UK

arXiv:1802.07802v4 [cs.LG] 20 Jun 2018

ABSTRACT
There is growing concern about how personal data are used when users grant applications direct access to the sensors of their mobile devices. In fact, high-resolution temporal data generated by motion sensors directly reflect the activities of a user and indirectly reflect physical and demographic attributes. In this paper, we propose a feature learning architecture for mobile devices that provides flexible and negotiable privacy-preserving sensor data transmission by appropriately transforming raw sensor data. The objective is to move from the current binary setting of granting or denying permission to an application, toward a model that allows users to grant each application permission over a limited range of inferences according to the provided services. The internal structure of each component of the proposed architecture can be flexibly changed, and the trade-off between privacy and utility can be negotiated between the constraints of the user and the underlying application. We validated the proposed architecture in an activity recognition application using two real-world datasets, with the objective of recognizing an activity without disclosing gender, as an example of private information. Results show that the proposed framework maintains the usefulness of the transformed data for activity recognition, with an average loss of only around three percentage points, while reducing the possibility of gender classification to around 50%, the target random guess, from more than 90% when using raw sensor data. We also present and distribute MotionSense, a new dataset for activity and attribute recognition collected from motion sensors.

CCS CONCEPTS
• Security and privacy; • Computing methodologies → Machine learning; Distributed computing methodologies;

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
W-P2DS'18, April 23–26, 2018, Porto, Portugal
© 2018 Copyright held by the owner/author(s). Publication rights licensed to the Association for Computing Machinery.
ACM ISBN 978-1-4503-5654-1/18/04...$15.00
https://doi.org/10.1145/3195258.3195260

KEYWORDS
Privacy, Sensor Data, Activity Recognition, Machine Learning, Time-Series Analysis

ACM Reference Format:
Mohammad Malekzadeh, Richard G. Clegg, Andrea Cavallaro, and Hamed Haddadi. 2018. Protecting Sensory Data against Sensitive Inferences. In W-P2DS'18: 1st Workshop on Privacy by Design in Distributed Systems, April 23–26, 2018, Porto, Portugal. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3195258.3195260

1 INTRODUCTION
Smartphones and wearable devices are equipped with sensors such as accelerometers, gyroscopes, barometers, and light sensors that are directly accessed by applications (apps) to provide, through a cloud service, analysis and statistics about, for example, the activities of the user. However, by granting these apps access to raw sensor data, users may unintentionally reveal information about gender, mood, or personality, which is unnecessary for the specific services.

To address this problem, we introduce the Guardian-Estimator-Neutralizer (GEN) framework that, instead of granting apps direct access to sensors, is designed to share only a transformed version of the sensor data, based on the functions and requirements of each application and on privacy considerations. The Guardian provides an inference-specific transformation, the Estimator guides the Guardian by estimating sensitive and non-sensitive information in the transformed data, and the Neutralizer is an optimizer that helps the Guardian converge to a near-optimal transformation function (see Figure 1).

Unlike privacy-preserving works that only hide users' identity by sharing population data using generative models for data synthesis [2, 9], our solution concerns sensitive information included in a single user's data. Whereas some methods transform only selected temporal sections of sensor data that correspond to predefined sensitive activities [11, 12], our framework concurrently eliminates private information from each section of data while keeping the utility of the shared data.

GEN is a feature learning and data reconstruction framework that helps to efficiently establish a trade-off between app utility and user privacy. Specifically, in this paper, we instantiate the framework for an activity recognition application based on data recorded by the accelerometer and gyroscope of a smartphone. In the context of this application, we categorize information that can be inferred from sensor data into two types: information about a predefined set




Figure 1: GEN Architecture. First, the Estimator is trained; then the Guardian is trained using the Estimator, with the help of the Neutralizer. [Diagram labels: Guardian, Estimator, Neutralizer; Training and Testing (in Use) phases; User Side and Apps Side.]

of activities of the user (non-sensitive inferences), and information about attributes of the user, such as gender, age, weight, and height (sensitive inferences).

Our goal is to establish a trade-off between the ability of the apps to accurately infer non-sensitive information, to maximize their utility, and the reduction of revealed sensitive information, to minimize the risk of privacy infringement. We show that GEN can accurately maintain the usefulness of the released (transformed) data for activity recognition while considerably reducing the risk of attribute recognition.¹

2 PROBLEM DEFINITION
Let $X(t) = \big(X_1(t), X_2(t), \ldots, X_m(t)\big)$ be the recorded values of the $m$ sensor-data components during a collection period of duration $T$, where $t \in \{1, 2, \ldots, T\}$. We assume the data to be synchronized and collected at the same frequency.

Let us consider a running window of duration $d$ that contains consecutive values of $X(t)$ from time $t$ to $t + d - 1$. Let $S_d(t)$ be the corresponding section of the time-series:

$$S_d(t) = X[t, t + d - 1] = \big(X(t), X(t + 1), \ldots, X(t + d - 1)\big),$$

where the value of $d$ should be chosen such that the running window is large enough for apps to make the desired inferences. However, to remain computationally efficient, $d$ should not be very large. For simplicity, we drop the index $t$ from $S_d(t)$ in the following.
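The running window defined above can be illustrated with a minimal NumPy sketch. It assumes the time-series is stored as an array of shape (T, m), uses 0-based indices (the text uses 1-based), and introduces a hypothetical `step` parameter for the stride between consecutive windows; none of these choices are stated in the paper.

```python
import numpy as np

def sections(x, d, step=1):
    """Cut a (T, m) time-series into sections S_d(t) of length d.

    Returns an array of shape (n, d, m), one section per start index.
    `step` is an illustrative stride, not a parameter from the paper.
    """
    x = np.asarray(x)
    T = x.shape[0]
    if d > T:
        raise ValueError("window length d must not exceed T")
    starts = range(0, T - d + 1, step)
    # Each section holds the d consecutive samples X(t), ..., X(t + d - 1).
    return np.stack([x[t:t + d] for t in starts])

# Toy series with T = 10 samples and m = 2 components.
x_demo = np.arange(20).reshape(10, 2)
s_demo = sections(x_demo, d=4)
```

With `step=1` there are T - d + 1 sections; a larger stride reduces overlap and the number of sections.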

We define two types of inference on each $S_d$: inference of sensitive information, $I_s(\cdot)$, and inference of non-sensitive information, $I_n(\cdot)$. Our goal is to find a transformation function $G^*(\cdot)$ such that, for the transformed data $S^*_d = G^*(S_d)$, $I_s(S^*_d)$ fails to reveal private information whereas $I_n(S^*_d)$ generates inferences that are as accurate as $I_n(S_d)$. Here, $S'_d$ is the transformation of the corresponding $S_d$, and $S^*_d$ is its optimal privacy-preserving transformation.

Figure 2: An instantiation of GEN for activity recognition from sensor data without revealing the gender information. The Guardian is an autoencoder. The Estimator is a multi-task ConvNet.

¹ The code and data used in this paper are publicly available at https://github.com/mmalekzadeh/motion-sense

3 LEARNING THE INFERENCE-SPECIFIC TRANSFORMATION
We present the proposed framework, which includes three components, the Guardian, the Estimator, and the Neutralizer (Figure 1), and discuss its instantiation for an activity recognition application (Figure 2).

The Guardian, which provides the inference-specific transformation, is a feature learning framework that recognizes and distinguishes discerning features from data. In the specific implementation of this paper, we use a deep autoencoder [16] as the Guardian. An autoencoder is a neural network that tries to reconstruct its input based on an objective function. Here, the autoencoder receives a section of an m-dimensional time-series of length d as input and produces a time-series with the same dimensionality as output, based on the Neutralizer's objective function, which is described below.

The Estimator quantifies how accurate an algorithm can be at making sensitive and non-sensitive inferences on the transformed data. In the specific implementation of this paper, we use a multi-task convolutional neural network (MTCNN) as the Estimator [17]. The shape of its input is similar to that of the Guardian, and the shape of its output depends on the number of activity classes. An MTCNN has the ability to share learned representations from the input between several tasks. More precisely, we try to simultaneously optimize a CNN with two types of loss function, one for sensitive inferences and another for non-sensitive ones. Consequently, the MTCNN will learn more generic features, which should be used for several tasks, at its earlier layers. Then, subsequent layers, which become progressively more specific to the details of the desired task, can be divided into multiple branches, each for a specific task.

The Neutralizer, the most important contribution of this paper, is an optimizer that helps the Guardian find the optimal $G^*(\cdot)$ for transforming each section $S_d$ into $S^*_d$, using as objective

$$G^*(\cdot) = \arg\min_{G(\cdot) \in F} \Big( p\big(I_s(G(S_d))\big) - p\big(I_n(G(S_d))\big) \Big),$$

where $p(I_s(\cdot))$ and $p(I_n(\cdot))$ are the probabilities of making sensitive and non-sensitive inferences, respectively, and $F$ is the set of all possible transformation functions for the Guardian. In the specific application of this paper, the Neutralizer is a multi-task objective function used by backpropagation to update the weights of the Guardian (autoencoder), and $F$ is the set of all possible weight matrices for the selected autoencoder.

                   MobiAct   MotionSense
Males              32        14
Females            16        10
Features (m)       9         12
Sample Rate (Hz)   20        50
Table 1: Details of the MobiAct and MotionSense datasets.

In particular, we aim to transform each section $S_d$ such that we can recognize an activity from $S'_d$ without revealing the gender of the user. For each section $S_d$, let $Y_a(S_d)$ and $\hat{Y}_a(S'_d)$ be the true and predicted class of activity, respectively, and let $\hat{Y}_g(S'_d)$ be the predicted gender class. We define the Neutralizer's objective function as

$$S^*_d = \arg\min_{S'_d} \Big( \big| 0.5 - \hat{Y}_g(S'_d) \big| - \sum_{i=1}^{c} -Y^i_a(S_d) \log \hat{Y}^i_a(S'_d) \Big), \qquad (1)$$

where $c$ is the number of activity classes. On the right-hand side of the equation, the first part is our custom gender-neutralizer loss function and the second part is a categorical cross-entropy. The constant 0.5 is the desired confidence for a gender predictor that will process the transformed data.
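To make the two parts of the Neutralizer's objective concrete, the following NumPy sketch computes each part on a batch of predictions. It is an illustration rather than the authors' implementation: the function names are hypothetical, the `eps` clipping is added only for numerical safety, and the two parts are added together on the assumption that training should push the gender confidence toward 0.5 while keeping the activity cross-entropy low.

```python
import numpy as np

def gender_neutralizer_term(y_g_pred):
    """Distance of the predicted gender confidence from the target 0.5."""
    return np.abs(0.5 - np.asarray(y_g_pred, dtype=float))

def categorical_cross_entropy(y_a_true, y_a_pred, eps=1e-12):
    """Cross-entropy between one-hot activity labels and predicted probabilities."""
    y_a_true = np.asarray(y_a_true, dtype=float)
    # Clip to avoid log(0); eps is an illustrative safeguard.
    y_a_pred = np.clip(np.asarray(y_a_pred, dtype=float), eps, 1.0)
    return -np.sum(y_a_true * np.log(y_a_pred), axis=-1)

def neutralizer_loss(y_a_true, y_a_pred, y_g_pred):
    """Batch mean of the two parts, added together (an assumed sign convention)."""
    total = gender_neutralizer_term(y_g_pred) + categorical_cross_entropy(y_a_true, y_a_pred)
    return float(np.mean(total))

# A perfectly "neutral" and perfectly accurate prediction gives a loss of zero.
loss_demo = neutralizer_loss([[1, 0, 0, 0]], [[1.0, 0.0, 0.0, 0.0]], [0.5])
```

A gender confidence of 0.9 contributes 0.4 to the loss, whereas a confidence of 0.5 contributes nothing, which is what drives the transformed data toward gender-neutral outputs.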

4 EXPERIMENTS
We validate the proposed framework on recognizing the following activities from smartphone motion sensors: Downstairs, Upstairs, Walking, Jogging. The non-sensitive inference, $I_n$, is the recognition of the activities, whereas the sensitive inference, $I_s$, is the recognition of gender.

We aim to measure the trade-off between the utility of data for activity recognition and privacy, e.g., keeping gender secret. To this end, we first compare the accuracy of activity recognition and gender classification when a trained MTCNN has access to the original data and to the corresponding transformed data. Then we try to measure the amount of sensitive information that is still available in the transformed data, using different methods.

Model   Layer (Neurons | Kernel | Chance)
MTCNN   Inp(m, d)
        Conv(50, 1×5), Conv(50, 1×3)
        Dense(50), MP(1×2), DO(0.2)
        Conv(40, 1×5)
        Dense(40), MP(1×3), DO(0.2)
        Conv(20, 1×3), DO(0.2)
        Flatten, Dense(400), DO(0.4)
        OutA = Softmax(4), OutG = Sigmoid
AE      Inp(|x|), Dense(|x|/2), Dense(|x|/4)
        Dense(|x|/8)
        Dense(|x|/4), Dense(|x|/2), Out(|x|)
Table 2: Structure of the hidden layers. The activation function for all the layers is "ReLU". Key: MP = MaxPooling, DO = DropOut, |x| = m × d.
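As a small worked example of the autoencoder (AE) rows in Table 2, the sketch below lists the layer widths for a flattened input of size |x| = m × d. The function name is hypothetical, integer division is an assumption for cases where |x| is not divisible by 8, and the window length d = 50 in the example is chosen only for illustration.

```python
def autoencoder_widths(m, d):
    """Layer widths of the AE in Table 2 for an input of size |x| = m * d."""
    x = m * d
    encoder = [x // 2, x // 4, x // 8]   # Dense(|x|/2), Dense(|x|/4), Dense(|x|/8)
    decoder = [x // 4, x // 2, x]        # Dense(|x|/4), Dense(|x|/2), Out(|x|)
    return [x] + encoder + decoder

# MotionSense has m = 12 features; d = 50 is an illustrative window length.
widths_demo = autoencoder_widths(12, 50)
```

The bottleneck is one eighth of the input size, so the Guardian must compress each section before reconstructing it.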

4.1 Datasets
We use two real-world datasets: MobiAct² and MotionSense³. The latter dataset is one of the contributions of this paper.

MobiAct [15] includes accelerometer, gyroscope, and orientation data (m = 9) from a smartphone, collected while data subjects performed 9 activities in 16 trials. A total of 67 participants, in a range of gender, age, weight, and height, collected the data with a Samsung Galaxy S3 smartphone (we use a subset of 48 subjects who have no missing data). Unlike other datasets, which require the smartphone to be rigidly placed on the human body with a specific orientation, MobiAct attempted to simulate everyday usage of mobile phones, where a smartphone is located with random orientation in a loose pocket chosen by the subject (Table 1).

MotionSense includes the accelerometer (acceleration and gravity), attitude (pitch, roll, yaw), and gyroscope data (m = 12) collected with an iPhone 6s kept in the participant's front pocket, using SensingKit [10]. A total of 24 participants, in a range of gender, age, weight, and height, performed 6 activities in 15 trials in the same environment and conditions: downstairs, upstairs, walking, jogging, sitting, and standing. With this dataset, we aim to look for personal attribute fingerprints in time-series of sensor data, i.e., attribute-specific patterns that can be used to infer physical and demographic attributes of the data subjects, in addition to their activities.

See http://github.com/mmalekzadeh/motion-sense for details on the methodology and the data (Table 1).

² Publicly available at http://www.bmi.teicrete.gr/index.php/research/mobiact
³ Publicly available at http://github.com/mmalekzadeh/motion-sense


Setting   Dataset       Inf.   S_d     S'_d
Trial     MotionSense   I_a    95.08   93.71
                        I_g    95.15   49.32
          MobiAct       I_a    94.31   90.46
                        I_g    93.74   49.83
Subject   MotionSense   I_a    86.33   85.19
                        I_g    75.35   52.16
          MobiAct       I_a    70.49   65.01
                        I_g    66.18   45.54
Table 3: Activity recognition (I_a) and gender classification (I_g) accuracy for original (S_d) and transformed (S'_d) data, in percent (%).

4.2 Experimental Setup
For each dataset, we consider two types of setting, namely Trial and Subject. In Trial, we keep 2/3 of the trials for training and 1/3 of them for testing. For example, if there are 3 walking trials per participant, we keep the first two trials for training and the last one for testing. In Subject, we keep the data of 75% of all subjects for training and the data of the remaining 25% of subjects for testing. In the Subject setting, we report the average results of four selections of the test dataset.
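The two settings can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, trials are assumed to be orderable identifiers, and the four Subject test selections are assumed here to be disjoint 25% groups, which the text does not specify.

```python
def trial_split(trial_ids):
    """Trial setting: first two thirds of the ordered trials train, the rest test."""
    ordered = sorted(trial_ids)
    cut = (2 * len(ordered)) // 3
    return ordered[:cut], ordered[cut:]

def subject_folds(subject_ids, n_folds=4):
    """Subject setting: yield (train, test) lists, holding out about 25% each time.

    Disjoint test groups are an assumption made for this sketch.
    """
    ordered = sorted(subject_ids)
    for k in range(n_folds):
        test = ordered[k::n_folds]
        train = [s for s in ordered if s not in test]
        yield train, test

train_trials, test_trials = trial_split([1, 2, 3])
folds_demo = list(subject_folds(range(24)))
```

Results in the Subject setting would then be averaged over the four (train, test) pairs.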

We train an MTCNN as the Estimator by considering two tasks: (i) activity recognition (4 classes) with a categorical cross-entropy loss function [4] and (ii) gender classification (2 classes) with a binary cross-entropy loss function [4]. After training the MTCNN, we freeze the weights of its layers and attach the output of a deep autoencoder (AE), acting as the Guardian, to the input of the MTCNN to build the GEN neural network. Finally, we compile GEN and set its loss function equal to the objective function of the Neutralizer in Equation (1). The deep network architectures are described in Table 2.

4.3 Transformation Efficiency
Table 3 shows that the Guardian produces time-series that keep the utility of non-sensitive inferences at a level comparable to the original ones (the average loss is three percentage points), while preventing sensitive inferences, as the gender classification accuracy decreases from more than 90% to near the target random guess (50%).

Cross-Dataset Validation. We also validate GEN in an ecosystem where edge users benefit from pre-trained models of a service provider. At the cloud side, the Estimator (MTCNN) is trained on a public dataset, the MobiAct dataset in our case. At the edge side, the Guardian receives the trained Estimator and uses its locally (personally) defined Neutralizer to transform the user's data, the MotionSense dataset in our case.

The results show that the accuracy of the Estimator on raw data for $I_a$ and $I_g$ is 93.67% and 92.80%, respectively, whereas on transformed data it is 90.92% and 51.93%, respectively. This shows an interesting property of GEN, which makes it more applicable to deploy in edge devices.

Figure 3: Error for gender is "classification error" and for the rest of the attributes is "mean absolute error". All the values are divided by the error of a random estimator on the MotionSense dataset. [Bar chart: Normalized Error (%) for Gender, Age, Height, and Weight; Original Data vs. Transformed Data.]

The only concern here is whether users trust the pre-trained Estimator received from an untrusted service provider. Users can verify the Estimator by running it on a publicly available dataset. We leave further investigation of this concern for future work.

4.4 Measuring Information Leakage
We aim to experimentally quantify the amount of information about the user's attributes that is still available in the transformed data.

Using Dynamic Time Warping. To measure the amount of residual attribute information in sensor data, we chose⁴ k-Nearest Neighbors (k-NN) with Dynamic Time Warping (DTW) [13]. We aim to verify whether a different algorithm will also fail to guess gender, even when adversaries get access to the entire time-series and not just a section of it. To this end, we build an $n \times n$ matrix $D_l$, where $n$ is the number of subjects in the dataset. For each activity $a_l \in \{downstairs, upstairs, walking, jogging\}$, let $d_l(i, j)$ be the distance between the time-series of users $u_i$ and $u_j$, calculated by FastDTW [13]. Then we calculate the final distance matrix $D$ as the element-wise average of all the matrices $D_l$: $d(i, j) = \frac{1}{4} \sum_{l} d_l(i, j)$.
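The construction of the averaged distance matrix can be sketched as below. Note the substitution: this sketch uses the classic exact DTW recurrence with a Euclidean local cost instead of FastDTW, which approximates it in linear time, and the function names and the nested-list input layout are assumptions made for illustration.

```python
import numpy as np

def dtw_distance(a, b):
    """Exact DTW between series of shape (Ta, m) and (Tb, m), Euclidean local cost."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = np.linalg.norm(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i, j] = step + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[-1, -1]

def average_distance_matrix(series_by_activity):
    """Element-wise average of the per-activity matrices D_l.

    series_by_activity[l][i] is the series of user i for activity l.
    """
    n = len(series_by_activity[0])
    D = np.zeros((n, n))
    for per_user in series_by_activity:
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] += dtw_distance(per_user[i], per_user[j])
    D = D + D.T  # DTW with a symmetric local cost is symmetric
    return D / len(series_by_activity)

a_demo = [[0.0], [1.0], [2.0]]
b_demo = [[0.0], [1.0], [1.0], [2.0]]   # same shape as a_demo, stretched in time
c_demo = [[1.0], [2.0], [3.0]]
D_demo = average_distance_matrix([[a_demo, c_demo], [a_demo, b_demo]])
```

In this toy case the stretched copy is at distance 0 from the original, which illustrates why DTW suits comparing recordings of different lengths.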

We calculate distance matrices $D$ and $D'$ for the original time-series and the transformed series (the output of the Guardian), respectively. Then we compare the estimation ability based on these matrices. For each user $u_i$, $i \in \{1, \ldots, n\}$ (one out-of-sample), we estimate the value of each attribute $v_a(u_i)$, $a \in \{gender, age, weight, height\}$, using distance-weighted k-NN based on the matrix $D$, where the weight is

$$w(i, j) = \frac{1}{d(i, j)^2}.$$
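A minimal sketch of the distance-weighted k-NN estimate with weights 1/d(i, j)² follows. The function name, the default k = 3, and the small `eps` that guards against division by zero are assumptions for illustration; the paper does not state the value of k it used.

```python
import numpy as np

def knn_estimate(D, values, i, k=3, eps=1e-12):
    """Estimate the attribute of user i from the k nearest other users.

    D is an (n, n) distance matrix and values holds one attribute per user.
    """
    d = np.asarray(D, dtype=float)[i].copy()
    d[i] = np.inf                      # leave user i out of its own estimate
    neighbours = np.argsort(d)[:k]
    w = 1.0 / (d[neighbours] ** 2 + eps)
    v = np.asarray(values, dtype=float)[neighbours]
    return float(np.sum(w * v) / np.sum(w))

D_knn = [[0, 1, 2, 10],
         [1, 0, 1, 10],
         [2, 1, 0, 10],
         [10, 10, 10, 0]]
heights = [170, 180, 160, 200]
estimate_demo = knn_estimate(D_knn, heights, i=0, k=2)
```

For a binary attribute such as gender, the weighted average can be thresholded at 0.5 to obtain a class.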

Figure 3 shows that the estimation error for gender classification approaches that of a random estimator after transformation. In this figure, the error of a random estimator for gender is $\frac{N_f}{N_f + N_m} = \frac{10}{24}$, and for the rest of the attributes it is taken as half of the variation interval in the dataset, e.g., $\frac{190 - 161}{2} = 14.5$ for height.

Figure 4: Dependencies between height and gender on the MotionSense and MobiAct datasets. A classification threshold of 172 cm predicts gender with 84% accuracy.

⁴ k-NN with DTW outperforms other methods in time-series classification, except when considerable computation and implementation cost is acceptable for very small improvements [1].
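The normalization used in Figure 3 can be reproduced with a few lines of arithmetic. The function names below are hypothetical, and the 7.25 cm error in the example is an invented input used only to show the calculation; it is not a result from the paper.

```python
def random_gender_error(n_female, n_male):
    """Error of the random gender estimator, N_f / (N_f + N_m)."""
    return n_female / (n_female + n_male)

def random_range_error(lowest, highest):
    """Error of the random estimator for a numeric attribute: half its range."""
    return (highest - lowest) / 2

def normalized_error(error, random_error):
    """Error expressed as a percentage of the random estimator's error."""
    return 100.0 * error / random_error

gender_baseline = random_gender_error(10, 14)     # MotionSense: 10 females, 14 males
height_baseline = random_range_error(161, 190)    # heights span 161 to 190 cm
example_normalized = normalized_error(7.25, height_baseline)  # illustrative input
```

A normalized error near 100% means the estimator is no better than the random baseline for that attribute.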

Thus, GEN eliminates similarities between same-gender time-series, and an attacker cannot confidently use distance measures to make inferences about gender. Interestingly, by eliminating gender information we also partially eliminate information on other attributes, as there are dependencies between attributes. For example, the estimation error for height and weight increases by nearly 25% and 20%, respectively.

Height is indeed highly correlated with gender in both datasets (Figure 4): the prediction accuracy of gender based on height only is 81%. However, gender prediction from both datasets using the MTCNN architecture is considerably better than that.

Using Supervised Learning. We explore learning gender-discriminative features from transformed data. Figure 5 shows the training and validation accuracy of activity recognition and gender classification using supervised learning on transformed data. Gender-discriminative features in the transformed data are rare, even with a large number of epochs, as in this experiment. GEN eliminates gender-related features and thus makes it difficult for a classifier to train on them, even when it has access to the labels of the transformed data.

Although the experiments in this section show acceptable efficiency in eliminating sensitive information, it is highly desirable to statistically prove the efficiency of the proposed solution. Generally, the high temporal granularity of time-series and the strong correlation between their samples make this task very challenging. We leave exploring this area to future research.

5 RELATED WORK AND DISCUSSION
Generative adversarial networks (GANs) [7] learn to capture the statistical distribution of data for synthesizing new samples from the learned distribution. In GANs, a discriminator model learns to determine whether a sample is from the model distribution (i.e., from the generator) or from the data distribution (i.e., from a real-world source). The discriminator aims to maximize an objective function in a minimax game that the generator aims to minimize. GANs have also been applied for enhancing privacy [9, 14]. For example, to protect health records, synthetic medical datasets can be published instead of the real ones, using generative models trained on sensitive real-world medical datasets [3, 6]. To provide a formal privacy guarantee, [2] trains GANs under the constraint of differential privacy [5] to protect against common privacy attacks.

Figure 5: Activity and gender classification accuracy on the MotionSense dataset in the Trial setting when the Estimator is trained on transformed data produced by the Guardian. Although activity features can be easily learned, there is no useful discerning information about gender. [Line plot with four curves: Activity Recognition (Training), Activity Recognition (Validation), Gender Classification (Training), Gender Classification (Validation).]

Although the architecture of our proposed framework looks similar to GANs, there are key structural and logical differences from existing frameworks. First, the focus of existing works is mainly on protecting users' privacy against membership attacks by releasing a synthetic dataset under differential privacy constraints. Instead, we consider a situation where a user wants to grant third parties access to sensor data that can be used to make both sensitive and non-sensitive inferences.

Second, the generator in GANs seeks to learn the underlying distribution of the data to produce realistic simulated samples from random vectors. Instead, the Guardian in GEN seeks to partition the underlying features of the data to reconstruct privacy-preserving outputs from real-world input vectors.

Finally, the minimax game in GANs is a two-player game between generator and discriminator (i.e., two models) that updates the weights of both models in each iteration. Instead, the minimax objective of GEN is a trade-off between utility and privacy that updates the weights of only one model (i.e., the Guardian) in each iteration.

Previous works on data collected from embedded sensors of personal devices, such as [11, 12], consider temporal inferences on different activities over time (i.e., some sections of time-series correspond to non-sensitive activities and some of them to sensitive ones). In this paper, for the first time, we concurrently consider both activity and attribute inferences on the same section of time-series.

Our framework is applicable in distributed environments: we have shown that the Estimator can be trained remotely (e.g., on a powerful system and with a large dataset), and edge devices just need to download the resulting trained model to use it as the Estimator part of their locally implemented GEN, under the user's control. For example, the Guardian can be trained on the user side using individuals' personal data processing platforms, like Databox [8].


6 CONCLUSION
We proposed the GEN framework for locally transforming sensor data on mobile edge devices to respect the functions and requirements of an application as well as user privacy. We evaluated the efficiency of the trade-off between utility and privacy that GEN provides on real-world datasets of motion data.

Open questions to be explored in future work include providing theoretical bounds on the amount of sensitive information leakage after transformation and exploring dependencies between different attributes, e.g., the co-dependence of gender and height. Finally, we will measure the costs and requirements for running GEN on edge devices.

ACKNOWLEDGMENTS
This work was kindly supported by the Life Sciences Initiative at Queen Mary University London and a Microsoft Azure for Research Award. Hamed Haddadi was partially funded by the EPSRC Databox grant (Ref: EP/N028260/1).

REFERENCES
[1] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, 2017.

[2] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, and C. S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. bioRxiv, page 159756, 2017.

[3] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun. Generating multi-label discrete electronic health records using generative adversarial networks. arXiv preprint arXiv:1703.06490, 2017.

[4] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.

[5] C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.

[6] C. Esteban, S. L. Hyland, and G. Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.

[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[8] H. Haddadi, H. Howard, A. Chaudhry, J. Crowcroft, A. Madhavapeddy, D. McAuley, and R. Mortier. Personal data: thinking inside the box. In Proceedings of The Fifth Decennial Aarhus Conference on Critical Alternatives, pages 29–32. Aarhus University Press, 2015.

[9] C. Huang, P. Kairouz, X. Chen, L. Sankar, and R. Rajagopal. Context-aware generative adversarial privacy. Entropy, 19(12):656, 2017.

[10] K. Katevas, H. Haddadi, and L. Tokarchuk. Poster: SensingKit: A multi-platform mobile sensing framework for large-scale experiments. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, pages 375–378. ACM, 2014.

[11] M. Malekzadeh, R. G. Clegg, and H. Haddadi. Replacement autoencoder: A privacy-preserving algorithm for sensory data analysis. The 3rd ACM/IEEE International Conference on Internet-of-Things Design and Implementation, 2018.

[12] N. Saleheen, S. Chakraborty, N. Ali, M. M. Rahman, S. M. Hossain, R. Bari, E. Buder, M. Srivastava, and S. Kumar. mSieve: differential behavioral privacy in time series of mobile sensor data. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 706–717, 2016.

[13] S. Salvador and P. Chan. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580, 2007.

[14] A. Tripathy, Y. Wang, and P. Ishwar. Privacy-preserving adversarial networks. arXiv preprint arXiv:1712.07008, 2017.

[15] G. Vavoulas, C. Chatzaki, T. Malliotakis, M. Pediaditis, and M. Tsiknakis. The MobiAct dataset: Recognition of activities of daily living using smartphones. In ICT4AgeingWell, pages 143–151, 2016.

[16] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.

[17] J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3995–4001, 2015.

  • Abstract
  • 1 Introduction
  • 2 Problem Definition
  • 3 Learning the Inference-Specific Transformation
  • 4 Experiments
    • 4.1 Datasets
    • 4.2 Experimental Setup
    • 4.3 Transformation Efficiency
    • 4.4 Measuring Information Leakage
  • 5 Related Work and Discussion
  • 6 Conclusion
  • Acknowledgments
  • References

Figure 1: GEN architecture (components: Guardian, Estimator, Neutralizer; phases: training and testing (in use); user side and apps side). First the Estimator is trained, then the Guardian is trained using the Estimator with the help of the Neutralizer.

of activities of the user (non-sensitive inferences) and information about attributes of the user, such as gender, age, weight, and height (sensitive inferences).

Our goal is to establish a trade-off between the ability of the apps to accurately infer non-sensitive information, to maximize their utility, and the reduction of revealed sensitive information, to minimize the risk of privacy infringement. We show that GEN can accurately maintain the usefulness of the released (transformed) data for activity recognition while considerably reducing the risk of attribute recognition.¹

2 PROBLEM DEFINITION
Let $X(t) = \big(X_1(t), X_2(t), \dots, X_m(t)\big)$ be the recorded values of the $m$ sensor-data components during a collection period of duration $T$, where $t \in \{1, 2, \dots, T\}$. We assume the data to be synchronized and collected at the same frequency.

Let us consider a running window of duration $d$ that contains consecutive values of $X(t)$ from time $t$ to $t + d - 1$. Let $S_d(t)$ be the corresponding section of the time-series:

\[ S_d(t) = X[t, t + d - 1] = \big(X(t), X(t+1), \dots, X(t+d-1)\big), \]

where the value of $d$ should be chosen such that the running window is large enough for apps to make the desired inferences. However, to remain computationally efficient, $d$ should not be very large. For simplicity, we drop the index $t$ from $S_d(t)$ in the following.
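The windowing defined above can be illustrated with a minimal sketch. It assumes the recording is held as a plain list of $T$ samples, each a list of $m$ values; the function name `sections` and the `step` parameter are hypothetical and not taken from the paper, and indices are 0-based here whereas the text counts $t$ from 1.

```python
from typing import List, Sequence


def sections(series: Sequence[Sequence[float]], d: int, step: int = 1) -> List[List[Sequence[float]]]:
    """Return every window S_d(t) = (X(t), ..., X(t + d - 1)) of a recording.

    series: T samples, each holding the m sensor components at one time step.
    d:      window length in samples.
    step:   stride between consecutive windows (hypothetical parameter).
    """
    if d < 1 or step < 1:
        raise ValueError("d and step must be positive")
    # The last valid start index is T - d, so that the window stays inside the recording.
    return [list(series[t:t + d]) for t in range(0, len(series) - d + 1, step)]


# Toy recording with T = 5 samples and m = 2 components.
recording = [[0.0, 1.0], [0.1, 1.1], [0.2, 1.2], [0.3, 1.3], [0.4, 1.4]]
windows = sections(recording, d=3)
```

With `step=1` a recording of length $T$ yields $T - d + 1$ sections; a larger stride trades overlap for fewer sections.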

We define two types of inference on each $S_d$: inference of sensitive information, $I_s(\cdot)$, and inference of non-sensitive information, $I_n(\cdot)$. Our goal is to find a transformation function $G^*(\cdot)$ such that, for the transformed data $S_d^* = G^*(S_d)$, $I_s(S_d^*)$ fails to reveal private information whereas $I_n(S_d^*)$ generates inferences that are as accurate as $I_n(S_d)$. Here, $\tilde{S}_d$ is a transformation of the corresponding $S_d$, and $S_d^*$ is its optimal privacy-preserving transformation.

Figure 2: An instantiation of GEN for activity recognition from sensor data without revealing the gender information. The Guardian is an autoencoder. The Estimator is a multi-task ConvNet.

¹ The code and data used in this paper are publicly available at https://github.com/mmalekzadeh/motion-sense

3 LEARNING THE INFERENCE-SPECIFIC TRANSFORMATION

We present the proposed framework, which includes three components, the Guardian, the Estimator, and the Neutralizer (Figure 1), and discuss its instantiation for an activity recognition application (Figure 2).

The Guardian, which provides the inference-specific transformation, is a feature learning framework that recognizes and distinguishes discerning features from data. In the specific implementation of this paper, we use a deep autoencoder [16] as the Guardian. An autoencoder is a neural network that tries to reconstruct its input based on an objective function. Here, the autoencoder receives a section of an $m$-dimensional time-series of length $d$ as input and produces a time-series with the same dimensionality as output, based on the Neutralizer's objective function, which is described below.

The Estimator quantifies how accurate an algorithm can be at making sensitive and non-sensitive inferences on the transformed data. In the specific implementation of this paper, we use a multi-task convolutional neural network (MTCNN) as the Estimator [17]. The shape of the input is the same as for the Guardian, and the shape of the output depends on the number of activity classes. An MTCNN can share learned representations of the input between several tasks. More precisely, we simultaneously optimize a CNN with two types of loss function, one for sensitive inferences and another for non-sensitive ones. Consequently, the MTCNN learns more generic features, which can be used for several tasks, at its earlier layers. Subsequent layers, which become progressively more specific to the details of the desired task, can then be divided into multiple branches, each for a specific task.

The Neutralizer, the most important contribution of this paper, is an optimizer that helps the Guardian find the optimal $G^*(\cdot)$ for transforming each section $S_d$ into $S_d^*$, using as objective

\[ G^*(\cdot) = \arg\min_{G(\cdot) \in F} \Big( p\big(I_s(G(S_d))\big) - p\big(I_n(G(S_d))\big) \Big), \]

where $p\big(I_s(\cdot)\big)$ and $p\big(I_n(\cdot)\big)$ are the probabilities of making sensitive and non-sensitive inferences, respectively, and $F$ is the set of all possible transformation functions for the Guardian. In the specific application of this paper, the Neutralizer is a multi-task objective function used by backpropagation to update the weights of the Guardian (autoencoder). Here, $F$ is the set of all possible weight matrices for the selected autoencoder.

Table 1: Details of the MobiAct and MotionSense datasets.

                   MobiAct   MotionSense
Males              32        14
Females            16        10
Features (m)       9         12
Sample rate (Hz)   20        50

In particular, we aim to transform each section $S_d$ such that we can recognize an activity from $\tilde{S}_d$ without revealing the gender of the user. For each section, let $Y_a(S_d)$ and $\hat{Y}_a(\tilde{S}_d)$ be the true and predicted class of activity, respectively, and $\hat{Y}_g(\tilde{S}_d)$ be the predicted gender class. We define the Neutralizer's objective function as

\[ S_d^* = \arg\min_{\tilde{S}_d} \Big( \big| 0.5 - \hat{Y}_g(\tilde{S}_d) \big| - \sum_{i=1}^{c} -Y_a^i(S_d) \log \hat{Y}_a^i(\tilde{S}_d) \Big), \tag{1} \]

where $c$ is the number of activity classes. On the right-hand side of the equation, the first part is our custom gender-neutralizer loss function and the second part is a categorical cross-entropy. The constant 0.5 is the desired confidence for a gender predictor that will process the transformed data.
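The two parts of the Neutralizer's objective can be evaluated for a single section as in the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function name and arguments are hypothetical, the gender predictor is assumed to output a probability in [0, 1], and the two non-negative parts (gender-neutralizer term and categorical cross-entropy) are assumed to be added so that minimizing the total keeps activity predictions accurate, which is one reading of how the signs in Equation (1) are meant.

```python
import math
from typing import Sequence


def neutralizer_loss(gender_prob: float,
                     activity_true: Sequence[float],
                     activity_pred: Sequence[float],
                     eps: float = 1e-12) -> float:
    """Loss for one transformed section.

    gender_prob:   predicted gender probability on the transformed section.
    activity_true: one-hot vector of the true activity (length c).
    activity_pred: predicted activity probabilities (length c).
    eps:           floor that avoids log(0) (hypothetical safeguard).
    """
    # Gender-neutralizer part: zero when the predictor is maximally unsure (0.5).
    gender_term = abs(0.5 - gender_prob)
    # Categorical cross-entropy between true and predicted activity.
    cross_entropy = -sum(y * math.log(max(p, eps))
                         for y, p in zip(activity_true, activity_pred))
    return gender_term + cross_entropy


confident_gender = neutralizer_loss(0.9, [0, 1, 0, 0], [0.05, 0.85, 0.05, 0.05])
neutral_gender = neutralizer_loss(0.5, [0, 1, 0, 0], [0.05, 0.85, 0.05, 0.05])
```

With identical activity predictions, the loss is lower when the gender prediction sits at 0.5, which is the behaviour the gender-neutralizer part is described as rewarding.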

4 EXPERIMENTS
We validate the proposed framework on recognizing the following activities from smartphone motion sensors: Downstairs, Upstairs, Walking, Jogging. The non-sensitive inference, $I_n$, is the recognition of the activities, whereas the sensitive inference, $I_s$, is the recognition of gender.

We aim to measure the trade-off between the utility of data for activity recognition and privacy, e.g. keeping gender secret. To this end, we first compare the accuracy of activity recognition and gender classification when a trained MTCNN has access to the original data and to the corresponding transformed data. Then, we measure the amount of sensitive information that is still available in the transformed data, using different methods.

Table 2: Structure of the hidden layers. The activation function for all the layers is "ReLU". Key: MP = MaxPooling, DO = DropOut, |x| = m × d.

Model   Layer (Neurons | Kernel | Chance)
MTCNN   Inp(m, d)
        Conv(50, 1 × 5), Conv(50, 1 × 3)
        Dense(50), MP(1 × 2), DO(0.2)
        Conv(40, 1 × 5)
        Dense(40), MP(1 × 3), DO(0.2)
        Conv(20, 1 × 3), DO(0.2)
        Flatten, Dense(400), DO(0.4)
        OutA = Softmax(4), OutG = Sigmoid
AE      Inp(|x|), Dense(|x|/2), Dense(|x|/4)
        Dense(|x|/8)
        Dense(|x|/4), Dense(|x|/2), Out(|x|)
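As a small worked example of the autoencoder row of Table 2, the sketch below lists the layer widths for a flattened input of size |x| = m × d, reading the entries as |x|/2, |x|/4, and |x|/8. The function name is hypothetical, integer division is an assumption for sizes that do not divide evenly, and the window length d = 50 used in the example is illustrative only, since this section does not state the value used in the experiments.

```python
from typing import List


def autoencoder_layer_sizes(m: int, d: int) -> List[int]:
    """Widths of the AE layers in Table 2 for an input of m components and d samples."""
    x = m * d  # |x| = m × d, the flattened section
    # Encoder halves the width down to |x|/8, decoder mirrors it back to |x|.
    return [x, x // 2, x // 4, x // 8, x // 4, x // 2, x]


# MotionSense has m = 12 features (Table 1); d = 50 is a hypothetical window length.
sizes = autoencoder_layer_sizes(m=12, d=50)
```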

4.1 Datasets
We use two real-world datasets: MobiAct² and MotionSense³. The latter dataset is one of the contributions of this paper.

MobiAct [15] includes accelerometer, gyroscope, and orientation data (m = 9) from a smartphone, collected when data subjects performed 9 activities in 16 trials. A total of 67 participants in a range of gender, age, weight, and height collected the data with a Samsung Galaxy S3 smartphone (we use a subset of 48 subjects who have no missing data). Unlike other datasets, which require the smartphone to be rigidly placed on the human body and with a specific orientation, MobiAct attempted to simulate every-day usage of mobile phones, where a smartphone is located with random orientation in a loose pocket chosen by the subject (Table 1).

MotionSense includes the accelerometer (acceleration and gravity), attitude (pitch, roll, yaw), and gyroscope data (m = 12) collected with an iPhone 6s kept in the participant's front pocket using SensingKit [10]. A total of 24 participants in a range of gender, age, weight, and height performed 6 activities in 15 trials in the same environment and conditions: downstairs, upstairs, walking, jogging, sitting, and standing. With this dataset, we aim to look for personal attribute fingerprints in time-series of sensor data, i.e. attribute-specific patterns that can be used to infer physical and demographic attributes of the data subjects in addition to their activities.

See http://github.com/mmalekzadeh/motion-sense for details on the methodology and the data (Table 1).

² Publicly available at http://www.bmi.teicrete.gr/index.php/research/mobiact
³ Publicly available at http://github.com/mmalekzadeh/motion-sense


Table 3: Activity recognition ($I_a$) and gender classification ($I_g$) accuracy for original ($S_d$) and transformed ($\tilde{S}_d$) data, in percent (%).

Setting   Dataset       Inf.    $S_d$    $\tilde{S}_d$
Trial     MotionSense   $I_a$   95.08    93.71
                        $I_g$   95.15    49.32
          MobiAct       $I_a$   94.31    90.46
                        $I_g$   93.74    49.83
Subject   MotionSense   $I_a$   86.33    85.19
                        $I_g$   75.35    52.16
          MobiAct       $I_a$   70.49    65.01
                        $I_g$   66.18    45.54

4.2 Experimental Setup
For each dataset, we consider two types of setting, namely Trial and Subject. In Trial, we keep 2/3 of the trials for training and 1/3 of them for testing. For example, if there are 3 walking trials per participant, we keep the first two trials for training and the last one for testing. In Subject, we keep the data of 75% of all subjects for training and the data of the remaining 25% of subjects for testing. In the Subject setting, we report the average results of four selections for the test dataset.
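The two settings can be sketched as follows. The function names are hypothetical, and the way the four Subject selections are drawn is an assumption: the text only says that four selections are averaged, so the sketch uses four disjoint held-out groups of roughly 25% each as one possible reading.

```python
from typing import Iterator, List, Sequence, Tuple


def trial_split(trials: Sequence[int], train_fraction: float = 2 / 3) -> Tuple[List[int], List[int]]:
    """Trial setting: the first 2/3 of a participant's trials train, the rest test."""
    n_train = round(len(trials) * train_fraction)
    return list(trials[:n_train]), list(trials[n_train:])


def subject_selections(subjects: Sequence[int], n_selections: int = 4) -> Iterator[Tuple[List[int], List[int]]]:
    """Subject setting: hold out about 25% of the subjects, four times.

    Assumption: the held-out groups are disjoint; the source does not say how
    the four selections were chosen.
    """
    groups = [list(subjects[k::n_selections]) for k in range(n_selections)]
    for test in groups:
        train = [s for s in subjects if s not in test]
        yield train, test


train_trials, test_trials = trial_split([1, 2, 3])
selections = list(subject_selections(list(range(24))))  # 24 MotionSense participants
```

Results in the Subject setting would then be averaged over the four `(train, test)` pairs.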

We train an MTCNN as the Estimator by considering two tasks: (i) activity recognition (4 classes) with a categorical cross-entropy loss function [4], and (ii) gender classification (2 classes) with a binary cross-entropy loss function [4]. After training the MTCNN, we freeze the weights of its layers and attach the output of a deep autoencoder (AE), acting as the Guardian, to the input of the MTCNN to build the GEN neural network. Finally, we compile GEN and set its loss function equal to the objective function of the Neutralizer in Equation (1). The deep network architectures are described in Table 2.

4.3 Transformation Efficiency
Table 3 shows that the Guardian produces time-series that keep the utility of non-sensitive inferences at a level comparable to the original ones (the average loss is three percentage points) while preventing sensitive inferences, as the gender classification accuracy decreases from more than 90% to near the target random guess (50%).
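The "three percentage points" figure can be checked directly from the activity-recognition rows of Table 3. The short computation below assumes the table entries are percentages with two decimals (e.g. 95.08) and uses hypothetical variable names.

```python
# Activity recognition accuracy (I_a) from Table 3: (original, transformed), in percent.
table3_activity = {
    ("Trial", "MotionSense"): (95.08, 93.71),
    ("Trial", "MobiAct"): (94.31, 90.46),
    ("Subject", "MotionSense"): (86.33, 85.19),
    ("Subject", "MobiAct"): (70.49, 65.01),
}

# Utility loss per setting and dataset, in percentage points.
drops = {key: original - transformed
         for key, (original, transformed) in table3_activity.items()}

# Average over the four rows: (1.37 + 3.85 + 1.14 + 5.48) / 4, i.e. about 2.96.
average_drop = sum(drops.values()) / len(drops)
```

The average is close to three percentage points, with the largest individual drop in the Subject setting on MobiAct.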

Cross-Dataset Validation. We also validate GEN in an ecosystem where edge users benefit from pre-trained models of a service provider. At the cloud side, the Estimator (MTCNN) is trained on a public dataset, the MobiAct dataset in our case. At the edge side, the Guardian receives the trained Estimator and uses its locally (personally) defined Neutralizer to transform the user's data, the MotionSense dataset in our case.

The results show that the accuracy of the Estimator on raw data for $I_a$ and $I_g$ is 93.67% and 92.80%, respectively, whereas on transformed data it is 90.92% and 51.93%, respectively. This shows an interesting property of GEN, which makes it more applicable for deployment on edge devices.

Figure 3: Normalized error (%) for the attributes Gender, Age, Height, and Weight on original data and transformed data. Error for gender is "classification error" and for the rest of the attributes is "mean absolute error". All the values are divided by the error of a random estimator on the MotionSense dataset.

The only concern here is whether users trust the pre-trained Estimator received from an untrusted service provider. Users can verify the Estimator by running it on a publicly available dataset. We leave further investigation of this concern for future work.

4.4 Measuring Information Leakage
We aim to experimentally quantify the amount of information about the user's attributes that is still available in the transformed data.

Using Dynamic Time Warping. To measure the amount of residual attribute information in sensor data, we chose⁴ k-Nearest Neighbors (k-NN) with Dynamic Time Warping (DTW) [13]. We aim to verify whether a different algorithm will also fail to guess gender, even when adversaries get access to the entire time-series and not just a section of it. To this end, we build an $n \times n$ matrix $D_l$, where $n$ is the number of subjects in the dataset. For each activity $a_l \in \{downstairs, upstairs, walking, jogging\}$, let $d_l(i, j)$ be the distance between the time-series of users $u_i$ and $u_j$, calculated by FastDTW [13]. Then we calculate the final distance matrix $D$ as the element-wise average of all the matrices $D_l$:

\[ d(i, j) = \frac{1}{4} \sum_{l} d_l(i, j). \]
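The construction of the averaged distance matrix can be sketched as below. This is a simplified stand-in, not the procedure used in the paper: it uses the exact quadratic-time DTW recurrence instead of FastDTW [13], works on scalar-valued series rather than multi-component sensor data, and all function and variable names are hypothetical.

```python
from typing import Dict, List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Exact DTW distance between two scalar series (stand-in for FastDTW)."""
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # cost row for the empty prefix of `a`
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            # Extend the cheapest of: insertion, deletion, or match.
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def average_distance_matrix(series_by_activity: Dict[str, List[Sequence[float]]]) -> List[List[float]]:
    """Element-wise average of the per-activity matrices D_l.

    series_by_activity maps each activity to one series per subject,
    with subjects listed in the same order for every activity.
    """
    activities = list(series_by_activity)
    n = len(series_by_activity[activities[0]])
    D = [[0.0] * n for _ in range(n)]
    for activity in activities:
        series = series_by_activity[activity]
        for i in range(n):
            for j in range(i + 1, n):
                D[i][j] += dtw_distance(series[i], series[j]) / len(activities)
                D[j][i] = D[i][j]  # DTW with a symmetric cost is symmetric
    return D


toy = {"walking": [[0.0], [1.0]], "jogging": [[0.0], [3.0]]}
D = average_distance_matrix(toy)
```

Running the same function on the original and on the transformed series would give the two matrices that are compared in the leakage analysis.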

We calculate distance matrices $D$ and $\tilde{D}$ for the original time-series and the transformed series (the output of the Guardian), respectively. Then we compare the ability of the estimation based on these matrices. For each user $u_i$, $i \in \{1, \dots, n\}$ (one out-of-sample), we estimate the value of each attribute $v_a(u_i)$, $a \in \{gender, age, weight, height\}$, using distance-weighted k-NN based on matrix $D$, where the weight is

\[ w(i, j) = \frac{1}{d(i, j)^2}. \]
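A leave-one-out, distance-weighted k-NN estimate with the weight $w(i, j) = 1 / d(i, j)^2$ can be sketched as follows. The function name, the default `k`, and the small floor on distances are assumptions made for illustration; the section does not state the value of k that was used.

```python
from typing import List, Sequence


def knn_estimate(D: Sequence[Sequence[float]], values: Sequence, i: int,
                 k: int = 3, classify: bool = False):
    """Estimate the attribute of user i from the k nearest other users.

    D:        n x n distance matrix between users.
    values:   attribute value of every user (numbers, or labels if classify=True).
    classify: weighted vote for categorical attributes, weighted mean otherwise.
    """
    # User i is left out, so only the other users can be neighbours.
    neighbours = sorted((j for j in range(len(D)) if j != i), key=lambda j: D[i][j])[:k]
    # w(i, j) = 1 / d(i, j)^2; the floor avoids division by zero (assumption).
    weights = {j: 1.0 / max(D[i][j], 1e-12) ** 2 for j in neighbours}
    if classify:
        votes = {}
        for j, w in weights.items():
            votes[values[j]] = votes.get(values[j], 0.0) + w
        return max(votes, key=votes.get)
    total = sum(weights.values())
    return sum(w * values[j] for j, w in weights.items()) / total


toy_D = [[0, 1, 2], [1, 0, 4], [2, 4, 0]]
height_estimate = knn_estimate(toy_D, [170, 160, 180], i=0, k=2)
gender_estimate = knn_estimate(toy_D, ["f", "m", "f"], i=0, k=2, classify=True)
```

In the toy matrix, user 1 is at distance 1 and user 2 at distance 2 from user 0, so user 1 carries four times the weight of user 2 in both the mean and the vote.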

Figure 3 shows that the estimation error for gender classification approaches that of a random estimator after transformation. In this figure, the error of a random estimator for gender is $\frac{N_f}{N_f + N_m} = \frac{10}{24}$, and for the rest of the attributes it is taken as half of the variation interval in the dataset, e.g. $\frac{190 - 161}{2} = 14.5$ for height.

Figure 4: Dependencies between height and gender on the MotionSense and MobiAct datasets. A classification threshold of 172 cm predicts gender with 84% accuracy.

⁴ k-NN with DTW outperforms other methods in time-series classification, except when considerable computation and implementation cost is acceptable for very small improvements [1].
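The random-estimator baselines used to normalize Figure 3 follow from simple arithmetic, sketched below with hypothetical function names. The female and male counts are the MotionSense values from Table 1, and the height range of 161 to 190 cm is the one quoted in the text.

```python
def random_gender_error(n_female: int, n_male: int) -> float:
    """Baseline error N_f / (N_f + N_m) of a random estimator for gender."""
    return n_female / (n_female + n_male)


def random_regression_error(lowest: float, highest: float) -> float:
    """Baseline error for a numeric attribute: half of its variation interval."""
    return (highest - lowest) / 2


def normalized_error(error: float, baseline: float) -> float:
    """Error expressed as a percentage of the random-estimator error."""
    return 100.0 * error / baseline


gender_baseline = random_gender_error(n_female=10, n_male=14)      # 10 / 24
height_baseline = random_regression_error(lowest=161, highest=190)  # 14.5 cm
```

An estimator whose error equals the baseline would therefore plot at 100% on the normalized axis.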

Thus, GEN eliminates similarities between same-gender time-series, and an attacker cannot confidently use distance measures to make inferences about gender. Interestingly, by eliminating gender information we also partially eliminate information on other attributes, as there are dependencies between attributes. For example, the estimation error for height and weight increases by nearly 25% and 20%, respectively.

Height is indeed highly correlated with gender in both datasets (Figure 4): the prediction accuracy of gender based on height only is 81%. However, gender prediction from both datasets using the MTCNN architecture is considerably better than that.

Using Supervised Learning. We explore learning gender-discriminative features from transformed data. Figure 5 shows the training and validation accuracy of activity recognition and gender classification using supervised learning on transformed data. Gender-discriminative features in the transformed data are rare, even with a large number of epochs as in this experiment. GEN eliminates gender-related features and thus makes it difficult for a classifier to train on them, even when it has access to the labels of the transformed data.

Although the experiments in this section show an acceptable efficiency in eliminating sensitive information, it is highly desirable to prove the efficiency of the proposed solution statistically. In general, the high temporal granularity of time-series and the strong correlation between their samples make this task very challenging. We leave exploring this area to future research.

5 RELATED WORK AND DISCUSSION
Generative adversarial networks (GANs) [7] learn to capture the statistical distribution of data for synthesizing new samples from the learned distribution. In GANs, a discriminator model learns to determine whether a sample is from the model distribution (i.e. from the generator) or from the data distribution (i.e. from a real-world source). The discriminator aims to maximize an objective function in a minimax game that the generator aims to minimize. GANs have also been applied for enhancing privacy [9, 14]. For example, to protect health records, synthetic medical datasets can be published instead of the real ones, using generative models trained on sensitive real-world medical datasets [3, 6]. To provide a formal privacy guarantee, [2] trains GANs under the constraint of differential privacy [5] to protect against common privacy attacks.

Figure 5: Activity and gender classification accuracy on the MotionSense dataset in the Trial setting when the Estimator is trained on transformed data produced by the Guardian (curves: activity recognition, training and validation; gender classification, training and validation). Although activity features can be easily learned, there is no useful discerning information about gender.

Although the architecture of our proposed framework looks similar to GANs, there are key structural and logical differences with other existing frameworks. First, the focus of existing works is mainly on protecting users' privacy against membership attacks by releasing a synthetic dataset under differential privacy constraints. Instead, we consider a situation where a user wants to grant third parties access to sensor data that can be used to make both sensitive and non-sensitive inferences.

Second, the generator in GANs seeks to learn the underlying distribution of the data to produce realistic simulated samples from random vectors. Instead, the Guardian in GEN seeks to partition the underlying features of the data to reconstruct privacy-preserving outputs from real-world input vectors.

Finally, the minimax game in GANs is a two-player game between generator and discriminator (i.e. two models) that updates the weights of both models in each iteration. Instead, the minimax objective of GEN is a trade-off between utility and privacy that updates the weights of only one model (i.e. the Guardian) in each iteration.

Previous works on data collected from embedded sensors of personal devices, such as [11, 12], consider temporal inferences on different activities over time (i.e. some sections of time-series correspond to non-sensitive activities and some of them to sensitive ones). In this paper, for the first time, we concurrently consider both activity and attribute inferences on the same section of time-series.

Our framework is applicable in distributed environments: we have shown that the Estimator can be trained remotely (e.g. on a powerful system and with a large dataset), and edge devices just need to download the resulting trained model to use it as the Estimator part of their locally implemented GEN, under the user's control. For example, the Guardian can be trained on the user side using individuals' personal data processing platforms like Databox [8].

Page 3: Protecting Sensory Data against Sensitive Inferences · 2018. 6. 21. · Protecting Sensory Data against Sensitive Inferences W-P2DS’18, April 23–26, 2018, Porto, Portugal MobiAct

Protecting Sensory Data against Sensitive Inferences W-P2DSrsquo18 April 23ndash26 2018 Porto Portugal

MobiAct MotionSense

Males 32 14

Females 16 10

Features (m) 9 12

SampleRate (Hz) 20 50Table 1 Details of the MobiAct and MotionSense datasets

transforming each section Sd into Slowastd using as objective

Glowast() = argminG()isinF

(p(Is

(G(Sd )

) )minus p

(In

(G(Sd )

) ))

where p(Is (middot)

)and p

(In (middot)

)are the probabilities of making sensitive

and non-sensitive inferences respectively and the F is the set of allpossible transformation functions for the Guardian In the specificapplication of this paper the Neutralizer is a multi-task objectivefunction used by backpropagation to update the weights of theGuardian (autoencoder) The F is also the set of all possible weightmatrices for the selected autoencoder

Particularly we aim to transform each section Sd such that wecan recognize an activity from Sd without revealing the gender ofthe user For each section Sd let Ya(Sd ) and Ya(Sd ) be the true andpredicted class of activity respectively and Yg(Sd ) be the predictedgender class We define the Neutralizerrsquos objective function as

Slowastd = argminSd

(| (05 minus Yg(Sd )) | minus

csumi=1

minusY ia(Sd ) logY ia(Sd )) (1)

where c is the number of activity classes In the rhs of the equationthe first part is our custom gender-neutralizer loss function andthe second part is a categorical cross entropy The constant 05 isthe desired confidence for a gender predictor that will process thetransformed data

4 EXPERIMENTSWe validate the proposed framework on recognizing the followingactivities from smartphone motion sensors Downstairs UpstairsWalking Jogging The non-sensitive inferences In is the recog-nition of the activities whereas the sensitive inference Is is therecognition of gender

We aim to measure the trade-off between the utility of data foractivity recognition and privacy eg keeping gender secret Tothis end we first compare the accuracy of activity recognition andgender classification when a trained MTCNN has access to originaldata and to the corresponding transformed data Then we try tomeasure the amount of sensitive information which is still availablein the transformed data using different methods

Model Layer (Neurons | Kernel | Chance)

Inp(md)

Conv(50 1 times 5) Conv(50 1 times 3)

Dense(50) MP(1 times 2) DO(02)

Conv(40 1 times 5)

MTCNN Dense(40) MP(1 times 3) DO(02)

Conv(20 1 times 3) DO(02)

Flatten Dense(400) DO(04)

OutA = Softmax(4) OutG = Sigmoid

Inp(|x |) Dense(|x |2) Dense(|x |4)

AE Dense(|x |8)

Dense(|x |4) Dense(|x |2) Out(|x |)Table 2 Structure of the hidden layers The activation func-tion for all the layers is ldquoReLUrdquo Key ndash MP MaxPooling DODropOut |x | =m times d

41 DatasetsWe use two real-world datasets MobiAct2 and MotionSense3 Thelatter dataset is one of the contributions of this paper

MobiAct [15] includes accelerometer gyroscope and orienta-tion data (m = 9) from a smartphone collected when data subjectsperformed 9 activities in 16 trials A total of 67 participants in arange of gender age weight and height collected the data witha Samsung Galaxy S3 smartphone (we use a subset of 48 subjectswho have no missing data) Unlike other datasets which requirethe smartphone to be rigidly placed on the human body and witha specific orientation MobiAct attempted to simulate every-dayusage of mobile phones where a smartphone is located with randomorientation in a loose pocket chosen by the subject (Table 1)

MotionSense includes the accelerometer (acceleration and grav-ity) attitude (pitch roll yaw) and gyroscope data (m = 12) collectedwith an iPhone 6s kept in the participantrsquos front pocket using Sens-ingKit [10] A total of 24 participants in a range of gender ageweight and height performed 6 activities in 15 trials in the sameenvironment and conditions downstairs upstairs walking joggingsitting and standing With this dataset we aim to look for personalattributes fingerprints in time-series of sensor data ie attribute-specific patterns that can be used to infer physical and demographicattributes of the data subjects in addition to their activities

See httpgithubcommmalekzadehmotion-sense for details onthe methodology and the data (Table 1)

2publicly available athttpwwwbmiteicretegrindexphpresearchmobiact3publicly available athttpgithubcommmalekzadehmotion-sense

W-P2DSrsquo18 April 23ndash26 2018 Porto Portugal Mohammad Malekzadeh Richard G Clegg Andrea Cavallaro and Hamed Haddadi

Setting Dataset Inf Sd Sd

Trial

MotionSenseIa 9508 9371

Ig 9515 4932

MobiActIa 9431 9046

Ig 9374 4983

Subject

MotionSenseIa 8633 8519

Ig 7535 5216

MobiActIa 7049 6501

Ig 6618 4554Table 3 Activity recognition Ia and gender classification Igaccuracy for original Sd and transformed Sd data in per-cent ()

42 Experimental SetupFor each dataset we consider two types of setting namely Trial andSubject In Trial we keep 23 of trials for training and 13 of themfor testing For example if there are 3 walking trials per participantwe keep the first two trials for training and the last one for testingIn Subject we keep data of 75 of all subjects for training and thedata of remaining 25 subjects for testing In the Subject settingwe report the average results of four selections for test dataset

We train an MTCNN as the Estimator by considering two tasks(i) activity recognition (4 classes) with categorical cross-entropy lossfunction [4] and (ii) gender classification (2 classes) with binarycross-entropy loss function [4] After training MTCNN we freezethe weights of the MTCNN layers and attach the output of a deepautoencoder (AE) as the Guardian to the input of the MTCNN tobuild the GEN neural network Finally we compile GEN and setits loss function equals to the objective function of the Neutralizerin Equation (1) The deep network architectures are described inTable 2

4.3 Transformation Efficiency
Table 3 shows that the Guardian produces time-series that keep the utility of non-sensitive inferences at a level comparable to the original ones (the average loss is three percentage points) while preventing sensitive inferences: in the Trial setting, the gender classification accuracy decreases from more than 90% to near the target random guess (50%).

Cross-Dataset Validation. We also validate GEN in an ecosystem where edge users benefit from pre-trained models of a service provider. At the cloud side, the Estimator (MTCNN) is trained on a public dataset, the MobiAct dataset in our case. At the edge side, the Guardian receives the trained Estimator and uses its locally (personally) defined Neutralizer to transform the user's data, the MotionSense dataset in our case.

The results show that the accuracy of the Estimator on raw data for $I_a$ and $I_g$ is 93.67% and 92.80%, respectively, whereas on transformed data it is 90.92% and 51.93%, respectively. This shows that GEN remains effective when the Estimator is pre-trained on a different dataset, a property that makes it more practical to deploy on edge devices.

[Figure 3 plot: normalized error (%) for Gender, Age, Height and Weight, comparing Original Data with Transformed Data.]

Figure 3: Error for gender is "classification error" and for the rest of the attributes is "mean absolute error". All the values are divided by the error of a random estimator on the MotionSense dataset.

The only concern here is whether users can trust a pre-trained Estimator received from an untrusted service provider. Users can verify the Estimator by running it on a publicly available dataset. We leave further investigation of this concern for future work.

4.4 Measuring Information Leakage
We aim to experimentally quantify the amount of information about the user's attributes that is still available in the transformed data.

Using Dynamic Time Warping. To measure the amount of residual attribute-information in sensor data, we chose k-Nearest Neighbors (k-NN) with Dynamic Time Warping (DTW) [13] (see footnote 4). We aim to verify whether a different algorithm will also fail to guess gender, even when adversaries get access to the entire time-series and not just a section of it. To this end, for each activity $a_l \in \{\text{downstairs}, \text{upstairs}, \text{walking}, \text{jogging}\}$ we build an $n \times n$ matrix $D_l$, where $n$ is the number of subjects in the dataset, and let $d_l(i, j)$ be the distance between the time-series of users $u_i$ and $u_j$ calculated by FastDTW [13]. Then we calculate the final distance matrix $D$ as the element-wise average of all the matrices $D_l$:

\[ d(i, j) = \frac{1}{4} \sum_{l} d_l(i, j) \]
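The construction of the averaged distance matrix can be sketched as below. Note the substitutions: the paper uses FastDTW [13] on multi-channel sensor data, whereas this sketch uses a plain exact DTW on 1-D sequences so that it stays self-contained. The function names and the toy data are hypothetical.

```python
def dtw_distance(a, b):
    """Exact DTW between two 1-D sequences with absolute-difference cost.

    Stand-in for FastDTW, which approximates this in linear time.
    """
    inf = float("inf")
    prev = [0.0] + [inf] * len(b)  # row 0 of the cumulative-cost table
    for x in a:
        curr = [inf] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[len(b)]

def average_distance_matrix(series_by_activity):
    """Element-wise average of the per-activity distance matrices D_l.

    series_by_activity maps an activity name to a list holding one
    time-series per user, in the same user order for every activity.
    """
    activities = list(series_by_activity)
    n = len(series_by_activity[activities[0]])
    D = [[0.0] * n for _ in range(n)]
    for activity in activities:
        users = series_by_activity[activity]
        for i in range(n):
            for j in range(i + 1, n):
                d = dtw_distance(users[i], users[j]) / len(activities)
                D[i][j] += d
                D[j][i] += d
    return D

# Toy data: 3 users and 2 activities (the paper averages over 4 activities).
toy = {
    "walking": [[0, 1, 2], [0, 1, 1, 2], [5, 6, 7]],
    "jogging": [[0, 2, 4], [0, 2, 4], [9, 9, 9]],
}
D = average_distance_matrix(toy)
```

In this toy example users 0 and 1 differ only by a repeated sample, which DTW absorbs, so their distance is zero while user 2 stays far from both.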

We calculate one such distance matrix for the original time-series and one for the transformed series (the output of the Guardian). Then we compare how well attributes can be estimated from each of these matrices. For each user $u_i$, $i \in \{1, \dots, n\}$ (one out-of-sample), we estimate the value of each attribute $v_a(u_i)$, $a \in \{\text{gender}, \text{age}, \text{weight}, \text{height}\}$, using distance-weighted k-NN based on the corresponding matrix, where the weight is

\[ w(i, j) = \frac{1}{d(i, j)^2} \]
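A minimal sketch of this estimator is given below, assuming a precomputed distance matrix. The value of k, the small `eps` guard against zero distances and the toy numbers are assumptions introduced for illustration; they are not taken from the paper.

```python
def knn_estimate(D, values, i, k=3, categorical=False):
    """Estimate the attribute of user i from the k nearest other users.

    D is an n x n distance matrix and values[j] is the known attribute of
    user j. User i is held out, so its own value is never used.
    """
    others = sorted((j for j in range(len(D)) if j != i), key=lambda j: D[i][j])
    neighbours = others[:k]
    eps = 1e-12  # assumption: avoids division by zero for identical series
    weights = {j: 1.0 / (D[i][j] ** 2 + eps) for j in neighbours}
    if categorical:
        # Weighted vote, e.g. for gender.
        votes = {}
        for j, w in weights.items():
            votes[values[j]] = votes.get(values[j], 0.0) + w
        return max(votes, key=votes.get)
    # Weighted mean, e.g. for age, height or weight.
    total = sum(weights.values())
    return sum(w * values[j] for j, w in weights.items()) / total

# Hypothetical 4-user example.
D = [
    [0.0, 1.0, 2.0, 4.0],
    [1.0, 0.0, 2.0, 4.0],
    [2.0, 2.0, 0.0, 1.0],
    [4.0, 4.0, 1.0, 0.0],
]
genders = ["f", "f", "m", "m"]
heights = [160.0, 162.0, 180.0, 182.0]

gender_of_0 = knn_estimate(D, genders, 0, k=3, categorical=True)
height_of_0 = knn_estimate(D, heights, 0, k=2)
```

Running the same loop over every user with each of the two matrices gives the per-attribute errors that are compared before and after transformation.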

4 k-NN with DTW outperforms other methods in time-series classification, except when considerable computation and implementation cost is acceptable for very small improvements [1].

Figure 4: Dependencies between height and gender on the MotionSense and MobiAct datasets. A classification threshold of 172 cm predicts gender with 84% accuracy.

Figure 3 shows that the estimation error for gender classification approaches that of a random estimator after transformation. In this figure, the error of a random estimator for gender is $\frac{N_f}{N_f + N_m} = \frac{10}{24}$, and for the rest of the attributes it is taken as half of the variation interval in the dataset, e.g. $\frac{190 - 161}{2} = 14.5$ for height.
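The random-estimator baselines used to normalize Figure 3 amount to the following arithmetic. The function names are hypothetical, the count of 14 male participants follows from the 24 participants and $N_f = 10$, and the measured error of 7.25 cm in the last line is an invented value used only to show the normalization.

```python
def random_gender_error(n_female, n_male):
    """Baseline error for gender, as given in the text: N_f / (N_f + N_m)."""
    return n_female / (n_female + n_male)

def random_regression_error(min_value, max_value):
    """Baseline error for a numeric attribute: half of the variation interval."""
    return (max_value - min_value) / 2

def normalized_error(error, random_error):
    """Error expressed as a percentage of the random-estimator error."""
    return 100.0 * error / random_error

gender_baseline = random_gender_error(10, 14)         # 10 / 24
height_baseline = random_regression_error(161, 190)   # 14.5 cm

# Hypothetical measured mean absolute error of 7.25 cm for height.
example = normalized_error(7.25, height_baseline)
```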

Thus, GEN eliminates similarities between same-gender time-series, and an attacker cannot confidently use distance measures to make inferences about gender. Interestingly, by eliminating gender information we also partially eliminate information on other attributes, as there are dependencies between attributes. For example, the estimation error for height and weight increases by nearly 25% and 20%, respectively.

Height is indeed highly correlated with gender in both datasets (Figure 4): the accuracy of predicting gender based on height only is 81%. However, gender prediction from both datasets using the MTCNN architecture is considerably better than that.
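A height-only gender predictor of the kind referred to here can be sketched as a single threshold. The heights and labels below are invented toy values, not data from MotionSense or MobiAct, and the rule "at or above the threshold means male" is an assumption about how such a threshold would be applied.

```python
def threshold_accuracy(heights_cm, genders, threshold_cm=172):
    """Accuracy of predicting 'm' at or above the threshold and 'f' below it."""
    predictions = ["m" if h >= threshold_cm else "f" for h in heights_cm]
    correct = sum(p == g for p, g in zip(predictions, genders))
    return correct / len(genders)

# Toy data for illustration only.
toy_heights = [158, 165, 170, 175, 168, 180, 185, 174]
toy_genders = ["f", "f", "f", "f", "m", "m", "m", "m"]

accuracy = threshold_accuracy(toy_heights, toy_genders)
```

Such a baseline indicates how much of the gender signal could be explained by height alone, which is the comparison the text makes against the MTCNN.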

Using Supervised Learning. We explore learning gender-discriminative features from transformed data. Figure 5 shows the training and validation accuracy of activity recognition and gender classification using supervised learning on transformed data. Gender-discriminative features in the transformed data are rare, even with a large number of epochs as in this experiment. GEN eliminates gender-related features and thus makes it difficult for a classifier to train on them, even when it has access to the labels of the transformed data.

Although the experiments in this section show an acceptable efficiency in eliminating sensitive information, it is highly desirable to prove the efficiency of the proposed solution statistically. In general, the high temporal granularity of time-series and the strong correlation between their samples make this task very challenging. We leave exploring this area to future research.

5 RELATED WORK AND DISCUSSION
Generative adversarial networks (GANs) [7] learn to capture the statistical distribution of data for synthesizing new samples from the learned distribution. In GANs, a discriminator model learns to determine whether a sample is from the model distribution (i.e. from the generator) or from the data distribution (i.e. from a real-world source). The discriminator aims to maximize an objective function in a minimax game that the generator aims to minimize. GANs have also been applied for enhancing privacy [9, 14]. For example, to protect health records, synthetic medical datasets can be published instead of the real ones using generative models trained on sensitive real-world medical datasets [3, 6]. To provide a formal privacy guarantee, [2] trains GANs under the constraint of differential privacy [5] to protect against common privacy attacks.

[Figure 5 plot. Legend: Activity Recognition (Training), Activity Recognition (Validation), Gender Classification (Training), Gender Classification (Validation).]

Figure 5: Activity and gender classification accuracy on the MotionSense dataset in the Trial setting when the Estimator is trained on transformed data produced by the Guardian. Although activity features can be easily learned, there is no useful discerning information about gender.

Although the architecture of our proposed framework looks similar to GANs, there are key structural and logical differences with other existing frameworks. First, the focus of existing works is mainly on protecting users' privacy against membership attacks by releasing a synthetic dataset under differential privacy constraints. Instead, we consider a situation where a user wants to grant third parties access to sensor data that can be used to make both sensitive and non-sensitive inferences.

Second, the generator in GANs seeks to learn the underlying distribution of the data to produce realistic simulated samples from random vectors. Instead, the Guardian in GEN seeks to partition the underlying features of the data to reconstruct privacy-preserving outputs from real-world input vectors.

Finally, the minimax game in GANs is a two-player game between generator and discriminator (i.e. two models) that updates the weights of both models in each iteration. Instead, the minimax objective of GEN is a trade-off between utility and privacy that updates the weights of only one model (i.e. the Guardian) in each iteration.

Previous works on data collected from embedded sensors of personal devices, such as [11, 12], consider temporal inferences on different activities over time (i.e. some sections of time-series correspond to non-sensitive activities and some of them to sensitive ones). In this paper, for the first time, we concurrently consider both activity and attribute inferences on the same section of time-series.

Our framework is applicable in distributed environments: we have shown that the Estimator can be trained remotely (e.g. on a powerful system and with a large dataset) and edge devices just need to download the resulting trained model to use it as the Estimator part of their locally implemented GEN under the user's control. For example, the Guardian can be trained on the user side using individuals' personal data processing platforms like Databox [8].


6 CONCLUSION
We proposed the GEN framework for locally transforming sensor data on mobile edge devices to respect the functions and requirements of an application as well as user privacy. We evaluated the trade-off between utility and privacy that GEN provides on real-world datasets of motion data.

Open questions to be explored in future work include providing theoretical bounds on the amount of sensitive information leakage after transformation and exploring dependencies between different attributes, e.g. the co-dependence of gender and height. Finally, we will measure the costs and requirements for running GEN on edge devices.

ACKNOWLEDGMENTS
This work was kindly supported by the Life Sciences Initiative at Queen Mary University London and a Microsoft Azure for Research Award. Hamed Haddadi was partially funded by EPSRC Databox grant (Ref: EP/N028260/1).

REFERENCES
[1] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, 2017.
[2] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, and C. S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. bioRxiv, page 159756, 2017.
[3] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun. Generating multi-label discrete electronic health records using generative adversarial networks. arXiv preprint arXiv:1703.06490, 2017.
[4] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[5] C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.
[6] C. Esteban, S. L. Hyland, and G. Rätsch. Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633, 2017.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[8] H. Haddadi, H. Howard, A. Chaudhry, J. Crowcroft, A. Madhavapeddy, D. McAuley, and R. Mortier. Personal data: thinking inside the box. In Proceedings of The Fifth Decennial Aarhus Conference on Critical Alternatives, pages 29–32. Aarhus University Press, 2015.
[9] C. Huang, P. Kairouz, X. Chen, L. Sankar, and R. Rajagopal. Context-aware generative adversarial privacy. Entropy, 19(12):656, 2017.
[10] K. Katevas, H. Haddadi, and L. Tokarchuk. Poster: SensingKit: A multi-platform mobile sensing framework for large-scale experiments. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, pages 375–378. ACM, 2014.
[11] M. Malekzadeh, R. G. Clegg, and H. Haddadi. Replacement autoencoder: A privacy-preserving algorithm for sensory data analysis. The 3rd ACM/IEEE International Conference on Internet-of-Things Design and Implementation, 2018.
[12] N. Saleheen, S. Chakraborty, N. Ali, M. M. Rahman, S. M. Hossain, R. Bari, E. Buder, M. Srivastava, and S. Kumar. mSieve: differential behavioral privacy in time series of mobile sensor data. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 706–717, 2016.
[13] S. Salvador and P. Chan. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580, 2007.
[14] A. Tripathy, Y. Wang, and P. Ishwar. Privacy-preserving adversarial networks. arXiv preprint arXiv:1712.07008, 2017.
[15] G. Vavoulas, C. Chatzaki, T. Malliotakis, M. Pediaditis, and M. Tsiknakis. The MobiAct dataset: Recognition of activities of daily living using smartphones. In ICT4AgeingWell, pages 143–151, 2016.
[16] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
[17] J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3995–4001, 2015.




Protecting Sensory Data against Sensitive Inferences W-P2DSrsquo18 April 23ndash26 2018 Porto Portugal

Figure 4 Dependencies between height and gender on theMotionSense and MobiAct datasets A classification thresh-old of 172cm predicts gender with 84 accuracy

and for the rest of attributes is considered as the half of the variationinterval in dataset eg 190minus1612 = 145 for height

Thus the GEN eliminates similarities between same-gender time-series and an attacker cannot confidently use distance measures tomake inference about gender Interestingly by eliminating genderinformation we also partially eliminate information on other at-tributes as there are dependencies between attributes For examplethe estimation error for height and weight increases by near 25and 20 respectively

Height is indeed highly correlated with gender in both datasets(Figure 4) the prediction accuracy of gender-based on height onlyis 81 However gender prediction from both datasets using theMTCNN architecture is considerably better than that

Using Supervised Learning We explore learning gender dis-criminative features from transformed data Figure 5 shows thetraining and validation accuracy of activity recognition and gen-der classification using supervised learning on transformed dataGender-discriminative features in the transformed data are rareeven with a large number of epochs as in this experiment GENeliminates gender-related features and thus makes it is difficult fora classifier to train on them even when it has access to the labels oftransformed data

Although with experiments in this section we have shown an ac-ceptable efficiency in eliminating sensitive information it is highlydesired to statistically prove the efficiency of the proposed solutionGenerally high temporal granularity of time-series and strong cor-relation between their samples make this task very challenging Weleave exploring this area to future research

5 RELATEDWORK AND DISCUSSIONGenerative adversarial networks (GANs) [7] learn to capture the sta-tistical distribution of data for synthesizing new samples from thelearned distribution In the GANs a discriminator model learns todetermine whether a sample is from themodel distribution (ie fromthe generator) or from the data distribution (ie from a real-worldsource) The discriminator aims to maximize an objective functionin minimax game that the generator aims to minimize GANs havealso been applied for enhancing privacy [9 14] For example toprotect health records synthetic medical datasets can be publishedinstead of the real ones using generative models training on sensi-tive real-world medical datasets [3 6] To provide a formal privacy

Activity Recognition (Training)Activity Recognition (Validation)Gender Classification (Training)Gender Classification (Validation)

100

95

90

85

80

75

70

65

60

550 50 100 150 200 250 300 350 400

Figure 5 Activity and gender classification accuracy on theMotionSense dataset in Trial setting when the Estimator istrained on transformed data produced by the Guardian Al-though activity-features can be easily learned there is nouseful discerning information about gender

guarantee [2] trains GANs under the constraint of differential pri-vacy [5] to protect against common privacy attacks

Although the architecture of our proposed framework lookssimilar to GANs there are key structural and logical differenceswith other existing frameworks First the focus of existing worksis mainly on protecting usersrsquo privacy against membership attackby releasing a synthetic dataset through differential privacy con-straints Instead we consider a situation where a user wants togrant third parties access to sensor data that can be used to makeboth sensitive and non-sensitive inferences

Second the generator in GANs seeks to learn the underlyingdistribution of the data to produce realistic simulated samples fromrandom vectors Instead the Guardian in GEN seeks to partition theunderlying features of the data to reconstruct privacy-preservingoutputs from real-world input vectors

Finally, the minimax game in GANs is a two-player game between generator and discriminator (i.e., two models) that updates the weights of both models in each iteration. Instead, the minimax objective of GEN is a trade-off between utility and privacy that updates the weights of only one model (i.e., the Guardian) in each iteration.
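The single-model update can be illustrated with a toy sketch. It is not the architecture used in our experiments: the two-feature inputs, the elementwise-gain Guardian, the fixed logistic readouts standing in for a pre-trained Estimator, the squared-distance-to-0.5 privacy term, and all names (`EST_ACT`, `EST_GEN`, `train_guardian`, `lam`) are assumptions made only for illustration. The point it shows is that the Estimator parameters are constants, and gradient steps change the Guardian alone.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Frozen Estimator (hypothetical): fixed logistic readouts that are never updated.
EST_ACT = (2.0, 0.0)  # activity readout looks at feature 0
EST_GEN = (0.0, 2.0)  # gender readout looks at feature 1


def estimator(x):
    """Return (activity probability, gender probability) for a transformed sample."""
    p_act = sigmoid(EST_ACT[0] * x[0] + EST_ACT[1] * x[1])
    p_gen = sigmoid(EST_GEN[0] * x[0] + EST_GEN[1] * x[1])
    return p_act, p_gen


def guardian(w, x):
    """Toy Guardian: one trainable gain per input feature."""
    return (w[0] * x[0], w[1] * x[1])


def train_guardian(data, lam=1.0, lr=0.5, epochs=200):
    """Minimize utility loss + lam * privacy loss with respect to the Guardian only.

    Utility loss: binary cross-entropy of the activity output.
    Privacy loss: squared distance of the gender output from 0.5 (random guess).
    """
    w = [1.0, 1.0]  # start from the identity transformation
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x, y_act in data:
            p_act, p_gen = estimator(guardian(w, x))
            # Derivatives of each loss with respect to the Estimator logits.
            d_act = p_act - y_act
            d_gen = 2.0 * (p_gen - 0.5) * p_gen * (1.0 - p_gen)
            for i in range(2):
                # Each logit is sum_i E[i] * w[i] * x[i], so d(logit)/d(w[i]) = E[i] * x[i].
                grad[i] += (d_act * EST_ACT[i] + lam * d_gen * EST_GEN[i]) * x[i]
        for i in range(2):
            w[i] -= lr * grad[i] / len(data)
    return w


# Toy samples: ((activity feature, gender feature), activity label).
DATA = [
    ((1.0, 1.0), 1.0),
    ((1.0, -1.0), 1.0),
    ((-1.0, 1.0), 0.0),
    ((-1.0, -1.0), 0.0),
]

if __name__ == "__main__":
    gains = train_guardian(DATA)
    print("Guardian gains:", gains)
    print("Estimator outputs on a transformed sample:", estimator(guardian(gains, DATA[0][0])))
```

In this toy setup the gain on the gender-carrying feature is driven toward zero while the gain on the activity-carrying feature grows, so the frozen gender readout moves toward 0.5 and the activity readout stays confident. The weight `lam` plays the role of the negotiable trade-off between utility and privacy.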

Previous works on data collected from the embedded sensors of personal devices, such as [11, 12], consider temporal inferences on different activities over time (i.e., some sections of a time series correspond to non-sensitive activities and others to sensitive ones). In this paper, for the first time, we concurrently consider both activity and attribute inferences on the same section of a time series.

Our framework is applicable in distributed environments: we have shown that the Estimator can be trained remotely (e.g., on a powerful system and with a large dataset), and edge devices just need to download the resulting trained model to use it as the Estimator part of their locally implemented GEN, under the user's control. For example, the Guardian can be trained on the user side using individuals' personal data processing platforms such as Databox [8].


6 CONCLUSION

We proposed the GEN framework for locally transforming sensor data on mobile edge devices to respect the functions and requirements of an application as well as user privacy. We evaluated the efficiency of the trade-off between utility and privacy that GEN provides on real-world datasets of motion data.

Open questions to be explored in future work include providing theoretical bounds on the amount of sensitive information leakage after transformation and exploring dependencies between different attributes, e.g., co-dependence of gender and height. Finally, we will measure the costs and requirements for running GEN on edge devices.

ACKNOWLEDGMENTS

This work was kindly supported by the Life Sciences Initiative at Queen Mary University of London and a Microsoft Azure for Research Award. Hamed Haddadi was partially funded by the EPSRC Databox grant (Ref: EP/N028260/1).

REFERENCES

[1] A. Bagnall, J. Lines, A. Bostrom, J. Large, and E. Keogh. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Data Mining and Knowledge Discovery, 31(3):606–660, 2017.
[2] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, and C. S. Greene. Privacy-preserving generative deep neural networks support clinical data sharing. bioRxiv, page 159756, 2017.
[3] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun. Generating multi-label discrete electronic health records using generative adversarial networks. arXiv preprint arXiv:1703.06490, 2017.
[4] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[5] C. Dwork. Differential privacy: A survey of results. In International Conference on Theory and Applications of Models of Computation, pages 1–19. Springer, 2008.
[6] C. Esteban, S. L. Hyland, and G. Rätsch. Real-valued (medical) time series generation with recurrent conditional gans. arXiv preprint arXiv:1706.02633, 2017.
[7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[8] H. Haddadi, H. Howard, A. Chaudhry, J. Crowcroft, A. Madhavapeddy, D. McAuley, and R. Mortier. Personal data: thinking inside the box. In Proceedings of The Fifth Decennial Aarhus Conference on Critical Alternatives, pages 29–32. Aarhus University Press, 2015.
[9] C. Huang, P. Kairouz, X. Chen, L. Sankar, and R. Rajagopal. Context-aware generative adversarial privacy. Entropy, 19(12):656, 2017.
[10] K. Katevas, H. Haddadi, and L. Tokarchuk. Poster: Sensingkit: A multi-platform mobile sensing framework for large-scale experiments. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking, pages 375–378. ACM, 2014.
[11] M. Malekzadeh, R. G. Clegg, and H. Haddadi. Replacement autoencoder: A privacy-preserving algorithm for sensory data analysis. The 3rd ACM/IEEE International Conference on Internet-of-Things Design and Implementation, 2018.
[12] N. Saleheen, S. Chakraborty, N. Ali, M. M. Rahman, S. M. Hossain, R. Bari, E. Buder, M. Srivastava, and S. Kumar. msieve: differential behavioral privacy in time series of mobile sensor data. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pages 706–717, 2016.
[13] S. Salvador and P. Chan. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580, 2007.
[14] A. Tripathy, Y. Wang, and P. Ishwar. Privacy-preserving adversarial networks. arXiv preprint arXiv:1712.07008, 2017.
[15] G. Vavoulas, C. Chatzaki, T. Malliotakis, M. Pediaditis, and M. Tsiknakis. The mobiact dataset: Recognition of activities of daily living using smartphones. In ICT4AgeingWell, pages 143–151, 2016.
[16] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096–1103, 2008.
[17] J. Yang, M. N. Nguyen, P. P. San, X. Li, and S. Krishnaswamy. Deep convolutional neural networks on multichannel time series for human activity recognition. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3995–4001, 2015.
