

mSieve: Differential Behavioral Privacy in Time Series of Mobile Sensor Data

Nazir Saleheen?, Supriyo Chakraborty∇, Nasir Ali?, Md Mahbubur Rahman◊, Syed Monowar Hossain?, Rummana Bari?, Eugene Buder?, Mani Srivastava†, Santosh Kumar?

University of Memphis?, IBM T. J. Watson Research Center∇, Nokia Research◊, University of California, Los Angeles†

{nsleheen, cnali, smhssain, rbari, ehbuder, skumar4}@memphis.edu?, [email protected]∇, [email protected]◊, [email protected]†

ABSTRACT
Differential privacy concepts have been successfully used to protect the anonymity of individuals in population-scale analysis. Sharing of mobile sensor data, especially physiological data, raises a different privacy challenge: that of protecting private behaviors that can be revealed from time series of sensor data. Existing privacy mechanisms rely on noise addition and data perturbation. But the accuracy requirements on inferences drawn from physiological data, together with well-established limits within which these data values occur, render traditional privacy mechanisms inapplicable. In this work, we define a new behavioral privacy metric based on differential privacy and propose a novel data substitution mechanism to protect behavioral privacy. We evaluate the efficacy of our scheme using 660 hours of ECG, respiration, and activity data collected from 43 participants and demonstrate that it is possible to retain meaningful utility, in terms of inference accuracy (90%), while simultaneously preserving the privacy of sensitive behaviors.

Author Keywords
Behavioral Privacy; Differential Privacy; Mobile Health

ACM Classification Keywords
K.4.1 Public Policy Issues: Privacy

INTRODUCTION
Smartphones, with their onboard sensors and their ability to interface with a wide variety of external body-worn sensors, provide an appealing mobile health (mHealth) platform that can be leveraged for continuous and unobtrusive monitoring of an individual in their daily life. The collected data can be shared with health care providers, who can use the data to better understand the influence of the environment on an individual and be proactive with their prognosis. On one hand, mHealth platforms have the potential to usher in affordable healthcare; on the other hand, their ability to continuously collect data about an individual raises serious privacy

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
UbiComp ’16, September 12-16, 2016, Heidelberg, Germany
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 978-1-4503-4461-6/16/09...$15.00.
http://dx.doi.org/10.1145/2971648.2971753

concerns and limits their adoption – concerns that are largely absent during traditional episodic treatments.

Motivating example: Consider a scientific study being conducted to assess the daily activities (e.g., sedentary versus active lifestyles) and behaviors of a user. To this end, the user participating in the study shares data from several body-worn sensors (e.g., respiration (RIP), electrocardiogram (ECG), and accelerometer sensors) with the study investigators. The collected data can be used to infer physical activities such as walking, running, or being stationary (from accelerometers), but can also be used to detect behaviors such as conversation episodes and stress episodes, when the user is eating or drinking water/coffee, and whether the user smokes or takes cocaine. Note that activity can be detected from accelerometer data [5], conversation episodes from respiration data [35], onset of stress from ECG data [34, 26], eating from wrist-worn sensors [40], smoking from respiration and wrist-worn sensors [32, 3, 38], and cocaine use from ECG data [25].

On one hand, some of the above inferences, such as walking, conversation, and eating, are extremely useful in the investigation of behavioral risk factors for health and wellness. On the other hand, inferences such as smoking, cocaine use, and stress may be sensitive to the user and need to be kept private. Thus, we have a conundrum, where the same time series data can be used for making both utility-providing inferences (that are desirable) and sensitive inferences (that need to be protected).

Challenges unique to physiological data: While there exists a large body of prior work on data privacy, several challenges are unique to maintaining the privacy of inferences drawn from time series of physiological data. First, the inferences themselves (e.g., detecting variation in heart rate, respiratory disorders) are extremely critical to proper diagnosis, and incorrect inferences can severely affect and even threaten human life. Second, there are well-defined limits for various physiological signals (e.g., the interbeat interval in ECG is typically between 300 ms and 2,000 ms [13]), and non-conformance to those limits can render the data unusable. Third, there is a high degree of correlation between an observed human behavior and the data recorded by these physiological sensors. For example, physical activities such as walking or running are associated with higher heart and respiration rates. Finally, physiological signals are high-dimensional, are extremely rich in information, and when


Figure 1: Illustration of the mSieve process. (The original figure contrasts actual sensor data (respiration, ECG, accelerometer) and the inferences drawn from it (smoking, walking, eating, conversation) with the released sensor data, in which candidate substitutions C1, C2, and C3, obtained via a DBN, replace sensitive segments.)

continuously collected, embed minute elements of an individual's lifestyle patterns. These patterns or inferences are often correlated, making it difficult to protect the privacy of one inference in isolation from the others.

These challenges place constraints on the mechanisms that can be used to protect the privacy of physiological data. The constraints are in terms of the magnitude of noise that can be added while retaining the utility of the inferences, and in the handling of the correlation between the data streams from the various sensors. Anonymization techniques such as k-anonymity [39], l-diversity [30], and t-closeness [27] propose data obfuscation aimed at protecting the identity of a user within a subpopulation. However, we consider a setting where a single user shares data with (possibly) many recipients (primary/secondary researchers), and the identity of the user is already known to the data recipients.

A principled mechanism for preserving privacy during analysis is differential privacy [17]. While several variants of differential privacy have been proposed [23, 37, 21], the central idea is to adequately obfuscate a query response computed on a multi-user statistical database (by adding noise, typically drawn from a Laplace distribution) such that the presence or absence of any user in the database is protected. However, this notion of differential privacy cannot be directly applied in our single-user setting to protect behavioral privacy. Recent model-based approaches for location privacy, such as [24, 22], focus on effective data suppression to protect sensitive inferences, but these cannot be applied directly to protect behavioral privacy from mobile sensor data, due to the unique challenges listed above.

Our approach: In this paper, we propose mSieve, a model-based data substitution approach to address the privacy challenges arising from sharing of personal physiological data. We group the inferences that can be drawn from shared data into two sets: a whitelist and a blacklist [11, 10]. Inferences

User model (DBN): Du
Raw sensor data: ~ri = {ri(t) | tstart ≤ t ≤ tend}
Actual sensor data: ~r = {~r1, ~r2, ..., ~rns}
Released sensor data: ~r̂ = {~r̂1, ~r̂2, ..., ~r̂ns}
Duration difference: di = ti − t̂i, for all 1 ≤ i ≤ n
State: x = (x1, ..., xn) ∈ {0, 1}^n
State interval: s = (x, ts, te)
Actual state sequence: ~s = <s1, s2, ..., sτ>
Obfuscated state sequence: ~ŝ = <ŝ1, ŝ2, ..., ŝτ>
Set of sensitive states: SB
Set of safe states: SW
Set of all states: S = SW ∪ SB
Hole: hk
Candidate states for the kth hole: Ck

Table 1: Summary of notations used.

that are desirable for the user, such as tracking activity, conversation episodes, and frequency of eating, all provide utility to the user and are part of the whitelist. Other inferences, such as smoking and onset of stress, are sensitive to the user and need to be kept private; these form the blacklist. Our goal is to prevent an adversary from making any of the inferences in the user-specified blacklist while still being able to accurately compute the whitelisted inferences.

Figure 1 illustrates the flow of data and the various components of mSieve. In summary, given various streams of sensor data from a user, mSieve identifies sensitive data segments and substitutes them with the most plausible non-sensitive data segments. To do so, it computes a Dynamic Bayesian Network (DBN) model over the user's data. The model maintains a distribution over the various behavioral states (e.g., smoking, running, conversation) of the user; these states are computed using the data collected from the body-worn sensors. To perform substitution, segments of data that reveal sensitive behavior are detected and removed. The DBN model is then used to identify candidate replacement segments from the same user's data that can be used in place of the deleted segments. We use several techniques (such as a dynamic programming approach and a greedy best-fit algorithm) to select the segments that preserve privacy while retaining the overall statistics of the physiological signal (providing utility). To assess the privacy guarantees of our scheme, inspired by the definition of differential privacy, we define the notion of differential behavioral privacy to protect sensitive inferences. The metric ensures that the information leaked about a sensitive inference from a substituted segment is always bounded.

We evaluate the efficacy of our substitution scheme using 660 hours of ECG, respiration, location, and accelerometer data collected over multiple user studies with over 43 participants. We demonstrate that sensitive behavioral inferences contributing to privacy loss, such as onset of stress, smoking, and cocaine use, can be protected while still retaining meaningful utility (≥ 85% accuracy when privacy sensitivity is high and ≥ 90% on average) of the shared physiological signals for tracking heart rate, detecting breathing irregularities, and detecting conversation episodes.

DEFINITIONS AND PROBLEM STATEMENT
We first introduce notations, define terms used throughout the paper, and formalize the problem statement.


Sensor Data: Let ri(t) denote the sensor data from the ith sensor at time t, where i = 1, ..., ns. We define ~ri = {ri(t) | ts ≤ t ≤ te} as the time series of measurements from the ith sensor, from starting time ts to ending time te. Finally, r(t) denotes the collection of time-series data from all the different sensors, i.e., r(t) = {r1(t), r2(t), ..., rns(t)}.

Inferences: An inference is a function computed (e.g., using a machine learning model) over a window of data values. Time-series data from different sensors (such as ECG, respiration, accelerometer) are used for computing inferences over data buffered during a chosen time interval. Let xi(t) be the value of the ith inference at time t. We assume that all our inferences are binary classifiers, which output true when the inference occurs within the time interval and false otherwise¹, i.e., xi(t) ∈ {0, 1}. We define ~xi = {xi(t) | ts ≤ t ≤ te} to be the time series for the ith inference within the time interval (ts, te). Again, x(t) = {x1(t), x2(t), ..., xn(t)} represents the collection of all possible inference time series.

Whitelist and Blacklist of inferences: As mentioned earlier, a key component of our privacy mechanism is the separation of the possible inferences into a whitelist (denoted by W) and a blacklist (denoted by B). In mSieve, the whitelist is the set of inferences that are essential for obtaining utility from the shared data, and the goal of the recipient is to accurately compute the distribution p(xi), where xi ∈ W. Conversely, the blacklist B is the list of inferences xi that the user would like to protect from the recipient.

State: A bit vector of length n is used to represent a user state x = (x1, x2, ..., xn) ∈ {0, 1}^n. The ith element of the bit vector represents the value of the ith inference. Without loss of generality, we assign the first nw bit values of state x to the whitelist, i.e., xi for 1 ≤ i ≤ nw, and the remaining nb bit values to the blacklist. We assume that the whitelist and blacklist form a disjoint partition of the inference set, i.e., n = nw + nb. A state is sensitive if one or more bits corresponding to inferences in set B are set to one. All sensitive states are included in set SB, and the non-sensitive states are in set SW.

State Interval: We define a state interval s = (x, ts, te) as a state x at which the user dwells during an interval (ts, te). Successive state intervals are indexed by sj, where j = 1, ..., τ. The value of each inference stays the same during an interval; an interval changes to the next one when any of the inference values changes. Unless otherwise specified, we use the shorthand notation s^j_i for sj.xi.

State Sequence: We define a state sequence ~s = (s1, s2, ..., sτ), where sj.x ≠ sj+1.x and sj.te = sj+1.ts for all j = 1, ..., τ − 1.
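The state-interval and state-sequence definitions above can be made concrete with a small sketch. The following Python snippet is our own illustration (the names `StateInterval` and `is_valid_sequence` are not from the paper): it encodes a state as a bit vector and checks the two sequence invariants, namely that consecutive intervals are contiguous and carry different states.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class StateInterval:
    """A user state (bit vector over n inferences) held during [ts, te)."""
    x: Tuple[int, ...]  # x[i] = 1 iff the ith inference is active
    ts: float
    te: float

def is_valid_sequence(seq: List[StateInterval]) -> bool:
    """Check the state-sequence invariants: s_j.te == s_{j+1}.ts
    (contiguity) and s_j.x != s_{j+1}.x (consecutive states differ)."""
    return all(a.te == b.ts and a.x != b.x for a, b in zip(seq, seq[1:]))
```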

User Model
Transition among different user states can be modeled using graphical models such as a Markov Chain (MC), a Hidden Markov Model (HMM), or a Dynamic Bayesian Network (DBN). Markov models, while suitable for modeling the temporal correlation among the states across time intervals, do not capture their conditional independence within a particular time interval. Therefore, we model the transition between states as a DBN, Du. In each time slice of the DBN, we maintain a uniform Bayesian Network described below:

¹Any inference that produces categories can be easily converted to a set of binary inferences, one for each category of output.

Figure 2: DBN showing user states over different time slices (intra-slice links connect inferences X1, ..., Xn within a slice; inter-slice links connect adjacent slices Sj−1, Sj, Sj+1).

• Nodes: Each node of the DBN is a random variable Sj representing a user state at time interval j. Denoting the ith inference within the state as S^j_i, we can write S^j = {S^j_1, S^j_2, ..., S^j_n}.

In addition to nodes, a DBN also contains two types of edges:

• Intra-slice links: For any time slice j, conditional independence between individual inferences X1 to Xn is maintained as a Bayesian Network (BN). Denoting the parents of node S^j_i by Pa(S^j_i), each node stores

P(S^j_i | Pa(S^j_i)), where Pa(S^j_i) = {S^j_k : P(S^j_i | S^j_k)}

• Inter-slice links: A DBN not only models conditional independence among states within a time slice but also captures their temporal correlations across time slices. These transition probabilities among nodes in different BNs are represented by the inter-slice links. We use a first-order model, so these links exist only between adjacent time slices:

P(S^j_i | S^1_i, ..., S^{j−1}_i) = P(S^j_i | S^{j−1}_i)

These conditional probabilities are stored in a ConditionalProbability table (CPT), which is associated with each nodeSj . An illustration of a DBN representing temporal behavioramong states is presented in Figure 2.
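As a rough illustration of how an inter-slice CPT might be represented and estimated, the sketch below builds a first-order transition table from an observed state sequence by counting, with add-one smoothing. The class name and the estimation procedure are our own assumptions; the paper does not prescribe how the CPTs are populated.

```python
from collections import defaultdict

class TransitionCPT:
    """Minimal first-order conditional probability table for inter-slice
    links: P(S^j = cur | S^{j-1} = prev), estimated from an observed
    state sequence by add-one-smoothed counting (illustrative only)."""

    def __init__(self, states):
        self.states = list(states)
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequence):
        """Count observed transitions between consecutive states."""
        for prev, cur in zip(sequence, sequence[1:]):
            self.counts[prev][cur] += 1
        return self

    def prob(self, cur, prev):
        """Smoothed estimate of P(S^j = cur | S^{j-1} = prev)."""
        row = self.counts[prev]
        total = sum(row.values()) + len(self.states)
        return (row[cur] + 1) / total
```

Each row of the table sums to one over the state space, so it behaves as a proper conditional distribution even for unseen transitions.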

Adversary Model
We use a DBN to capture adversarial knowledge. A DBN is a powerful graphical model that can effectively encode both the temporal and spatial correlation among inferences. It is also a generalization of the Markov models (including the HMM) that are typically used to encode this information. We consider two types of adversarial attacks.

• Data-based attack: In this setting, the attacker has accessto the raw sensor data.

• Model-based attack: The attacker has access to releasedinferences computed over raw data but not the raw data.

In addition, we assume that in both settings the attacker isaware of the blacklist inferences B, and the mSieve algorithmis publicly known.


The algorithms in mSieve are designed under the assumption that an adversary uses a DBN, or a less powerful model, for capturing the correlation among the user states. However, if an adversary uses a model that is more expressive in modeling state correlations, or has access to side-channel information that is not contained in the user model, then additional leakage may occur from the released data.

Privacy
The privacy guarantee we seek is that an adversary with access to data released by mSieve should not be able to suspect a sensitive behavior in the released data with significantly higher likelihood than when suspecting the same behavior in corresponding reference data.

Corresponding Reference Data: For a given sensor time series ~r, the corresponding reference data ~r̄ is data that releases no more information about the blacklisted inferences than a null time series, but is otherwise maximally close to ~r.

Differential Behavioral Privacy: A system Λ preserves ε-privacy if, for any input sensor time series ~r with start time ts and end time te, it produces an output ~r̂ with the same ts and te such that, for any query q(·;·) about any sensitive state b ∈ SB and for all K ∈ Range(q), the output of q on ~r̂ is bounded relative to the same query on the corresponding reference data ~r̄ with the same ts and te:

D(~r̂ || ~r̄) = P(q(~r̂; b) ∈ K) / P(q(~r̄; b) ∈ K) ≤ e^ε    (1)

The parameter ε denotes privacy sensitivity. A low value of ε implies a high privacy level, and vice versa.
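The bound in Equation (1) can be checked empirically for a discrete query by comparing outcome probabilities. The sketch below is our own construction (the function name and argument layout are assumptions, not from the paper): it returns true when every outcome's probability under the released data is within a factor e^ε of its probability under the reference data.

```python
import math

def satisfies_eps_privacy(p_released, p_reference, eps):
    """Empirical check of Equation (1) for a discrete query: for every
    outcome K, P(q(released) = K) / P(q(reference) = K) <= e^eps.
    Both arguments map query outcomes to probabilities."""
    for k, p in p_released.items():
        q = p_reference.get(k, 0.0)
        if q == 0.0:
            if p > 0.0:
                return False  # ratio is unbounded for this outcome
            continue
        if p / q > math.exp(eps):
            return False
    return True
```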

Utility
We define utility over an entire released episode. Released data should support the same whitelist of inferences as the original data; our utility metric therefore minimizes the distribution difference between the original and released data.

Let pi be the probability of inference xi occurring in the actual signal. Suppose inference xi occurs for a duration of

t_i = Σ_{j=1}^{τ} I(sj.xi = 1) · (sj.te − sj.ts)

out of a total of

T = Σ_{j=1}^{τ} (sj.te − sj.ts)

time units in the original data, where I is the indicator function. Then pi = ti / T.

Utility Loss: Let P = (p1, p2, ..., p_nw) be the probability vector of the whitelisted inferences in the actual data and P̂ = (p̂1, p̂2, ..., p̂_nw) be the probability vector of the whitelisted inferences in the released data, where p̂i = t̂i / T and t̂i is the duration of inference xi in the released data. Then, we define our utility loss metric as

Uloss = ‖P − P̂‖ = Σ_{i=1}^{nw} |pi − p̂i| = (1/T) Σ_{i=1}^{nw} |ti − t̂i|    (2)
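The utility-loss metric of Equation (2) reduces to an L1 distance over per-inference durations. A minimal sketch (the function name and argument layout are ours):

```python
def utility_loss(actual, released, whitelist, T):
    """U_loss from Equation (2): L1 distance between the whitelist
    occupancy distributions, computed from per-inference durations.
    `actual` and `released` map an inference name to its total
    duration t_i; T is the total episode length."""
    return sum(abs(actual.get(i, 0.0) - released.get(i, 0.0))
               for i in whitelist) / T
```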

Problem Definition
The goal of mSieve is to obfuscate time series data ~r ∈ DB and generate ~r̂ with the same start and end time such that it satisfies the privacy constraints in Equation (1) and minimizes the utility loss in Equation (2).

Figure 3: An overview of the mSieve framework. (The original figure shows the pipeline: a time series of sensor data enters the Context Engine, which segments it and infers a time series of states; a DBN lookup identifies sensitive segments; sensitive context is substituted with plausible context, constrained by the blacklist and whitelist; and a Context-to-Sensor-Data database lookup produces the safe sensor data that is released as privacy-preserving inferences.)

PROBLEM 1. For any given tolerable privacy loss ε > 0, the utility-plausibility tradeoff can be formulated as the following optimization problem:

minimize Uloss
subject to D(~r̂ || ~r̄) ≤ e^ε

SYSTEM OVERVIEW
We now present an overview of the components, as shown in Figure 3, that are required to implement the end-to-end system: from sensor data, to user states, to privacy-preserving safe states, and finally to the release of sensor data.

Context Engine
The Context Engine (CE) generates inferences from raw sensor data. Inferred signals are more suitable for data modeling than unprocessed raw sensor signals, as they simplify the modeling task. The CE transforms the raw sensor signals into task-specific feature signals; for example, the RR interval is more suitable for inferring heart rate than the raw ECG signal. The CE takes raw signals as input and produces time series of inference values as output. All whitelist and blacklist inferences are computed by the CE.

Substitution Mechanism
We begin by segmenting the time series of inferences. Let τ be the total number of segments, where each segment is represented by a state interval sj with 1 ≤ j ≤ τ. As mentioned earlier, we assume that each inference is represented by a single bit, and thus at a particular time interval the CE outputs a bit vector of size n, which constitutes the user state sj.x ∈ {0, 1}^n. Finally, the segmentation generates a sequence of such state intervals, ~s = <s1, s2, ..., sτ>.

Some of the states in ~s can be sensitive to the user. The first step in our substitution algorithm is to remove the sensitive states, and also all other states that might lead to the sensitive states. This introduces discontinuities in the state sequence; we call these gaps holes. The next step is to perform a DBN lookup to identify plausible candidates to fill all the holes. This lookup operation on the user DBN, Du, provides a set of candidate states for filling a hole in any time interval j. Plausible states are checked for continuity with the previous state. To select the best state from among the candidates, we use utility loss as our metric: we substitute a hole with the state or sequence of states that minimizes utility loss.


Context to Sensor Data
Another important consideration in the substitution process is maintaining signal continuity. The Context-to-Sensor-Data module accepts possible substitution candidates generated by the substitution mechanism and produces safe, privacy-preserving sensor data as output. This module communicates with the database to ensure that the released sensor data is safe, i.e., that it meets the privacy and utility criteria.

SOLUTION DETAILS
In this section, we discuss our proposed solution to mitigate privacy risks. Algorithm 1 takes raw sensor data ~r, sensitive states SB, and the user DBN Du, and outputs raw data ~r̂ that does not contain any signature of the sensitive inferences.

Algorithm 1 mSieve: Plausible Substitution Mechanism

Require: ~r, SB, Du, ε
 1: ~s = getStateSequence(~r)
 2: k ← 1
 3: ~ŝ ← ~s
 4: for each interval j ∈ {1, 2, ..., τ} do
 5:   if isHole(sj, Du, SB, ε) is true then
 6:     ŝj = hk = (∅, sj.ts, sj.te)
 7:     Ck = getPlausibleCandidateSet(Du, ~ŝ, hk, ε)
 8:     k ← k + 1
 9:   end if
10: end for
11: Selected candidates {c1, ..., ck} = FillHole({hk}, {Ck})
12: ~r̂ ← ~r
13: for each hole hk = (·, ts, te) and selected candidate ck do
14:   r̂(ts, te) = getSensorDataFromDB(ck, te − ts)
15: end for
16: return ~r̂

Step 1: Sensor Data to State Sequence
We first convert the time series of raw data samples from the various sensors, ~r, into time series of inferences using (classifier) models. Each classifier model is specific to an inference and detects the time intervals over which the inference occurs. Let ei = {(ts, te)} denote the set of time intervals over which the ith inference is detected by a model, where 1 ≤ i ≤ n = (nw + nb) (recall that nw and nb are the numbers of whitelisted and blacklisted inferences, respectively). We create the inference time series xi(t) for the ith inference as

xi(t) = 1 if ts ≤ t ≤ te for any (ts, te) ∈ ei, and 0 otherwise.

Collectively, we define, x(t) = (x1(t), x2(t), . . . , xn(t)) asthe n-dimensional state at time t.

Let t = ∪_{i=1}^{n} ∪_{(ts,te)∈ei} {ts, te} be the set of all time points (whether start or end) corresponding to the inference occurrences, arranged in ascending order, and let τ = |t| − 1. Note that a user stays in the same state during each interval (ti, ti+1) for i = 1, 2, ..., τ, where ti ∈ t. We denote by si = (x(t), ti, ti+1), where ti < t < ti+1, the ith state interval. By construction, two consecutive states are different. This forms the state sequence ~s = <s1, s2, ..., sτ>.
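Step 1 can be sketched as follows: collect all interval endpoints, and for each elementary segment between consecutive endpoints read off the n-bit state, merging adjacent segments that carry the same state. The helper below is our own illustration, not the paper's implementation.

```python
def build_state_sequence(intervals, n):
    """Segment the timeline into state intervals from per-inference
    detection intervals e_i = [(ts, te), ...] (sketch of Step 1).
    `intervals[i]` is the list of detection intervals for inference i.
    Returns a list of (x, ts, te) with x an n-bit tuple."""
    # All start/end points, in ascending order.
    points = sorted({t for ev in intervals for (ts, te) in ev for t in (ts, te)})
    seq = []
    for a, b in zip(points, points[1:]):
        mid = (a + b) / 2.0  # sample the state in the segment's interior
        x = tuple(int(any(ts <= mid <= te for (ts, te) in intervals[i]))
                  for i in range(n))
        if seq and seq[-1][0] == x:
            # State unchanged: extend the previous interval.
            seq[-1] = (x, seq[-1][1], b)
        else:
            seq.append((x, a, b))
    return seq
```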

Step 2: Locate and Delete Sensitive and Unsafe States
We consider three types of states: (a) sensitive states, (b) unsafe states, and (c) safe states. We define a state sj as sensitive if any of its last nb bits is set to one, i.e., if ∨_{i=nw+1}^{n} s^j_i = 1, where ∨ is the logical OR operation (recall that SB is the set of sensitive states).

We define a state sj as unsafe if it is not directly sensitive but may leak some information about the blacklist, e.g., the act of walking out of a building to smoke and returning after smoking. We use the following local privacy check, similar to [22, 9], to mark a state as unsafe:

P(S^{j+1} ∈ SB | S^j = sj) / P(S^{j+1} ∈ SB) ≤ e^δ    (3)

This condition provides us with a mechanism to stem privacyleakage from unsafe states. Here, δ is our local privacy sensi-tivity. Lemma 1 below describes how to select δ.

LEMMA 1. For any given ε > 0, there exists a δ < ε/(2τ) such that if

\frac{P(S_i \in S_B \mid S_{i-1} = s_{i-1})}{P(S_i \in S_B)} < e^{\delta}

then D(~r ∥ ~r′) ≤ e^ε.

A proof appears in the Appendix.

Deletion of the sensitive and unsafe states in ~s results in a sequence ~s′ that is punctuated with holes. Each hole consists of a start time t_s and an end time t_e. Thus, the kth hole is defined as h_k = (t_s, t_e). We denote the state occurring immediately before a hole h_k as pre(h_k), and the state immediately after h_k as next(h_k).
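A minimal sketch of the state-labeling step, assuming the DBN's one-step transition probabilities and the prior over sensitive states are available as inputs (`trans_prob`, `prior_b`, and the dictionary lookup are our illustrative stand-ins for the check in (3)):

```python
import math

def classify_states(states, n_w, n_b, trans_prob, prior_b, delta):
    """Label each state 'sensitive', 'unsafe', or 'safe'.

    states     : list of binary tuples of length n_w + n_b
    trans_prob : trans_prob[s] = probability that the *next* state is
                 sensitive given the current state s (from the DBN;
                 a stand-in for P(S_{j+1} in S_B | S_j = s))
    prior_b    : prior probability P(S_{j+1} in S_B)
    delta      : local privacy sensitivity from the check in (3)
    """
    labels = []
    for s in states:
        if any(s[n_w:]):                    # any blacklist bit set
            labels.append('sensitive')
        elif trans_prob.get(s, 0.0) / prior_b > math.exp(delta):
            labels.append('unsafe')         # violates the local check
        else:
            labels.append('safe')
    return labels

labels = classify_states([(0, 1), (1, 0)], n_w=1, n_b=1,
                         trans_prob={(1, 0): 0.05}, prior_b=0.1, delta=0.1)
# -> ['sensitive', 'safe']
```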

Step 3: Candidate Generation
A candidate is a sequence of states that can be substituted in place of a hole. A candidate should not contain any sensitive or unsafe state, and it must maintain continuity, i.e., transitions among the states in the filled-up time series should be plausible under criteria similar to (3). If no single candidate is long enough to fill the hole by itself, multiple state segments can be composed together to obtain the desired length. We use a recursive function, described in Algorithm 2, to generate candidates for each hole h_k in the time interval (t_s, t_e). For the kth hole, the GenerateCandidate function generates all possible candidates cand for the duration given by the interval length of the hole, i.e., dur = h_k.t_e − h_k.t_s, and stores the candidates in the set C_k. Prior to invoking GenerateCandidate, we initialize the set C_k = ∅ and the current candidate cand = <.> to an empty vector. Note that isConnect(s_j, s_{j+1}) returns true iff Equation (3) holds.

If GenerateCandidate does not generate any candidate state sequence, i.e., C_k = ∅, then we delete either the previous state pre(h_k) or the next state next(h_k) (whichever incurs a lower utility loss) to enlarge the existing hole. We then invoke GenerateCandidate again for this newly created hole. This iterative process of increasing the size of the

Page 6: mSieve: Differential Behavioral Privacy in Time Series of ...web0.cs.memphis.edu/~santosh/Papers/mSieve-UbiComp-2016.pdf · Sensor Data: Let r i(t) denote the sensor data from the

Algorithm 2 GenerateCandidate

Require: cand = <c_1, ..., c_l>, C_k, h_k, remDur
1: if remDur == 0 and isConnect(c_l, next(h_k)) then
2:   C_k = C_k ∪ {cand}
3: else if remDur > 0 then
4:   for each s_j ∈ S do
5:     if isConnect(c_l, s_j) then
6:       remDur′ = remDur − (s_j.t_e − s_j.t_s)
7:       cand′ = <c_1, ..., c_l, s_j>
8:       GenerateCandidate(cand′, C_k, h_k, remDur′)
9:     end if
10:   end for
11: end if

hole increases the chances of finding a candidate, and in our experiments we did not encounter any instance where the algorithm failed to find a suitable candidate. Theoretically, however, the algorithm may fail to find any candidate due to the continuity constraint. Further improving this algorithm and proving its convergence remains an open question that we leave for future work.
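Algorithm 2 can be rendered as the following recursive sketch (the segment representation and names are ours; `connects` stands in for isConnect, and a candidate is accepted only if it exactly fills the remaining duration and joins plausibly on both sides of the hole):

```python
def generate_candidates(segments, duration, connects, prev, nxt, results, cand=()):
    """Enumerate candidate sequences of safe segments that exactly fill
    `duration`, respecting pairwise continuity (in the spirit of Algorithm 2).

    segments : list of (state, length) pairs available for substitution
    connects : connects(a, b) -> True iff the transition a -> b is plausible
    prev     : state preceding the hole, pre(h_k)
    nxt      : state following the hole, next(h_k)
    results  : output list of accepted candidate sequences
    """
    if duration == 0:
        if connects(prev, nxt):          # must also join the far side of the hole
            results.append(cand)
        return
    if duration < 0:                     # overshot the hole; dead branch
        return
    for state, length in segments:
        if connects(prev, state):        # continuity with the last state so far
            generate_candidates(segments, duration - length, connects,
                                state, nxt, results, cand + ((state, length),))

results = []
generate_candidates([('A', 1), ('B', 2)], 2, lambda a, b: True, 'P', 'N', results)
# -> [(('A', 1), ('A', 1)), (('B', 2),)]
```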

Complexity analysis: Since the lengths of the holes and the states are dynamic, we use expected values to analyze the complexity of this step. Let l_h be the expected length of a hole, l_s the expected length of a state, and y the expected number of states reachable from any state. Here, y is the branching factor and ℓ = ⌈l_h/l_s⌉ is the expected depth of the search tree. The expected time complexity is then O(y^ℓ).

Step 4: Select Candidate and Fill Holes
After the hole creation and candidate generation steps, we obtain a series of holes h_1, ..., h_{n_h} and a set of candidates C_i for the ith hole. Let ~c = <c_1, c_2, ..., c_m>, where 1 ≤ j ≤ m is the index assigned to candidate c_j; it is the jth candidate encountered as we enumerate the candidates in the sets C_1, C_2, ..., C_{n_h}, in that order. We define an allocator matrix A[i][k], where A[i][k] = 1 implies that candidate state sequence c_k ∈ C is a candidate for the ith hole, i.e., c_k ∈ C_i. Let t_j be the total duration of the jth whitelisted inference in ~r, and t′_j its duration in ~s′ after hole creation. Thus, the initial duration difference is d_j = t_j − t′_j and the initial utility loss is

U^0_{Loss} = \sum_{j=1}^{n_w} |d_j|.

The next step is to select a candidate for each hole that minimizes the objective function specified in Equation 1. We formulate this as the Hole Filling Problem stated below.

PROBLEM 2. Hole Filling Problem: As stated earlier, for the jth whitelisted inference, t_j denotes its total duration in ~r and t′_j its duration in ~r′. Let d_j = t_j − t′_j. The goal is to fill all the holes with candidate state sequences such that

Objective function: \min \sum_{j=1}^{n_w} |t_j - t'_j| = \min \sum_{j=1}^{n_w} |d_j|

It can be shown that the above problem minimizes the utility loss U_loss (in Equation 2). By reducing the bin packing problem to the unconstrained version of the hole filling problem, i.e., by setting A[i][k] = 1 for all i and k and setting the privacy sensitivity parameter ε to a large number, it can be shown that the hole filling problem is NP-hard. Therefore, we first provide a dynamic programming based solution that gives an optimal result but requires exponential memory. We then provide a greedy solution that is not optimal but runs in polynomial time.

Dynamic Programming Solution
The main idea is to compute the solutions to smaller sub-problems and store them in a table so that they can be reused later to solve the overall problem. To do so, we decompose the hole filling problem into smaller sub-problems and relate the structure of the optimal solution for the original problem to the solutions of the smaller sub-problems. We begin by defining the problem recursively as follows:

Recurrence Relation: Let L be similar to A, except that the entry L[i][k] stores the minimum utility loss achieved if the kth candidate is used to fill the ith hole, where 1 ≤ i ≤ n_h and 1 ≤ k ≤ m.

To solve the problem, we use a bottom-up approach. We first compute the optimal result for the first hole; using this result, we compute the optimal result for the second hole, and so on. We begin with the following initialization:

L[i][k] = \begin{cases} U^0_{Loss} & \text{if } i = 0 \\ \infty & \text{otherwise.} \end{cases}

Let ~d^{i,k} = <d^{i,k}_1, ..., d^{i,k}_{n_w}> be the duration difference vector after assigning candidate c_k to the ith hole. Initialize d^{0,k}_j = d_j for all 1 ≤ k ≤ m and 1 ≤ j ≤ n_w. To understand the working of the algorithm, suppose that holes h_1 through h_{i−1} have all been assigned, and we are now ready to make an assignment for h_i, i.e., we are now in stage i. Let dur_j(c_k) be the duration of the jth whitelisted inference in candidate c_k. Then, we update L[i][k] with

L[i][k] = \min_{1 \le l \le m} \Big\{ L[i-1][l] + \sum_{j=1}^{n_w} \big( -|d^{i-1,l}_j| + |d^{i-1,l}_j + dur_j(c_k)| \big) \Big\} \quad \text{for } 1 \le i \le n_h \text{ and } 1 \le k \le m \qquad (4)

For the l that yields the minimum, we update d^{i,k}_j = d^{i-1,l}_j + dur_j(c_k) for 1 ≤ j ≤ n_w. We also maintain an additional data structure pre(i, k) that stores the index of the previous candidate from which we updated L[i][k], i.e., pre(i, k) = l. We continue this process until i = n_h and k = m. Finally, we select the c_k for the last hole h_{n_h} such that L[n_h][k] is minimum. Using the pre data structure, we determine the complete assignment of the holes by invoking AssignCandidate(n_h, k) (Algorithm 3).
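The recurrence in (4), together with the backtracking of Algorithm 3, can be sketched as follows. This is a simplified illustration (names are ours): each candidate is reduced to the tuple of per-whitelist durations it contributes, and each table cell carries its own duration-difference vector, mirroring the d^{i,k} update described in the text.

```python
def dp_fill(holes_candidates, d0):
    """Simplified dynamic program for the hole-filling problem.

    holes_candidates[i] : candidates for hole i; each candidate is a tuple
                          (dur_1, ..., dur_nw) of per-whitelist durations
    d0                  : initial duration-difference vector (d_1, ..., d_nw)

    Returns (utility loss, chosen candidate index per hole).
    """
    # Virtual stage -1: one cell holding the initial loss U^0_Loss and d.
    prev = [(sum(abs(x) for x in d0), tuple(d0), None)]
    table = []                           # table[i][k] = (loss, d_vec, backptr)
    for cands in holes_candidates:
        cur = []
        for dur in cands:
            best = None
            for l, (loss_l, d, _) in enumerate(prev):   # min over l, as in (4)
                nd = tuple(dj + uj for dj, uj in zip(d, dur))
                nloss = sum(abs(x) for x in nd)
                if best is None or nloss < best[0]:
                    best = (nloss, nd, l)
            cur.append(best)
        table.append(cur)
        prev = cur
    # Backtrack from the best final cell (mirrors AssignCandidate).
    best_k = min(range(len(prev)), key=lambda j: prev[j][0])
    total = prev[best_k][0]
    picks, k = [], best_k
    for cur in reversed(table):
        picks.append(k)
        k = cur[k][2]
    picks.reverse()
    return total, picks

# One whitelist, initial deficit of 2; two holes with two candidates each.
dp_fill([[(1,), (3,)], [(1,), (2,)]], (-2,))   # -> (0, [0, 0])
```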

Complexity analysis: The maximum number of candidates for each hole is O(y^ℓ) (since any combination of whitelisted inferences can form a candidate sequence) and the maximum number of holes is O(τ). Thus, the upper bound on the space


Algorithm 3 AssignCandidate

Require: pre(·, ·), i, k
1: if i > 0 then
2:   Assign c_k to the ith hole
3:   call AssignCandidate(pre, i − 1, pre(i, k))
4: end if

required for the dynamic programming solution is O(τ y^ℓ), and the upper bound on time is O(τ y^ℓ · y^ℓ).

Greedy Solution
We now provide a greedy strategy for the hole filling problem. For each hole, we choose the candidate that minimizes the utility loss U_loss among all its candidates. We repeat this process until all the holes are filled. The procedure is described in Algorithm 4.

Algorithm 4 GreedySolution

Require: {C_i}_{i=1}^{n_h}, {h_i}_{i=1}^{n_h}, ~d
1: ~d′ = ~d
2: for each i = 1, 2, ..., n_h do
3:   select c_k ∈ C_i minimizing Σ_{j=1}^{n_w} (−|d′_j| + |d′_j + dur_j(c_k)|)
4:   d′_j = d′_j + dur_j(c_k), for all 1 ≤ j ≤ n_w
5: end for

Since, in every iteration, we select the candidate that locally minimizes U_loss for that hole, this method does not always produce an optimal result. However, the space complexity reduces to O(τ) and the time complexity reduces to O(τ y^ℓ), which is a factor of y^ℓ smaller than the dynamic programming solution.
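Algorithm 4 admits a compact sketch under the same simplified candidate representation as before (candidates reduced to per-whitelist duration tuples; names are ours). Ties are broken by taking the first minimizing candidate:

```python
def greedy_fill(holes_candidates, d0):
    """Greedy hole filling (in the spirit of Algorithm 4): for each hole,
    pick the candidate that locally minimizes the utility loss."""
    d = list(d0)
    picks = []
    for cands in holes_candidates:
        # Cost of a candidate = change it induces in sum_j |d_j|.
        k = min(range(len(cands)),
                key=lambda kk: sum(abs(dj + uj) - abs(dj)
                                   for dj, uj in zip(d, cands[kk])))
        picks.append(k)
        d = [dj + uj for dj, uj in zip(d, cands[k])]
    return sum(abs(x) for x in d), picks

greedy_fill([[(1,), (3,)], [(1,), (2,)]], (-2,))   # -> (0, [0, 0])
```

On this tiny instance greedy matches the DP optimum; in general it need not, which is exactly the tradeoff discussed above.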

Step 5: Sensor Data Substitution
Now, for each hole h_k, we have to select a segment of sensor data that corresponds to the selected candidate state c_k ∈ C_k. For this, we maintain a mapping database M that stores sensor segments of different lengths for each possible state. However, when substituting the sensor data for a state within an interval, we have to maintain consistency at both boundaries of the hole. We use the case of ECG signals to illustrate feature value consistency. One can also consider morphological consistency or other relevant characteristics.

Feature value consistency
Let rr_i be the ith RR interval in an ECG trace ~r. The RR interval is used to calculate other features of the signal, defined as below.

1. Point of time error: Limit check on the current value

e(rr_i) = \begin{cases} 0 & \text{if } 300 < rr_i < 2000 \\ 1 & \text{otherwise.} \end{cases}

2. Continuity error: Limit check on the first order difference

e(rr_i \mid rr_{i-1}) = \begin{cases} 0 & \text{if } |rr_i - rr_{i-1}| < 200 \\ 1 & \text{otherwise.} \end{cases}

Sensor        | Avg. # samples per participant | Total # samples
              | D1 (1 day) | D2 (3 days)       | D1 (37×1 days) | D2 (6×3 days)
Respiration   | 0.78M      | 1.66M             | 28.80M         | 9.98M
ECG           | 2.15M      | 5.95M             | 79.76M         | 35.74M
Accelerometer | 2.04M      | 4.58M             | 75.81M         | 27.47M
Gyroscope     | 2.07M      | 4.58M             | 76.60M         | 27.48M

Table 2: Data statistics from both studies (M = million).

Let {rr_1, ..., rr_k} be the sequence of RR intervals calculated from an ECG trace ~r. We define the feature value error as

e_f(~r) = \frac{e(rr_1) + \sum_{i=2}^{k} e(rr_i) \vee e(rr_i \mid rr_{i-1})}{k}

mSieve maintains a database of sensor data corresponding to each candidate state, and while substituting, it selects the sensor data corresponding to a released state that minimizes the error at both boundaries of the hole.
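The two RR-interval limit checks and the feature value error can be combined in a short sketch (thresholds in milliseconds, as in the limit checks above; normalizing by the number of intervals in the sequence is our reading of the formula):

```python
def rr_feature_error(rr):
    """Feature-value error e_f for a sequence of RR intervals (ms),
    combining the point-of-time and continuity limit checks."""
    def point_err(x):          # limit check on the current value
        return 0 if 300 < x < 2000 else 1

    def cont_err(x, prev):     # limit check on the first-order difference
        return 0 if abs(x - prev) < 200 else 1

    k = len(rr)
    total = point_err(rr[0]) + sum(point_err(rr[i]) | cont_err(rr[i], rr[i - 1])
                                   for i in range(1, k))
    return total / k

rr_feature_error([800, 820, 810])    # all within limits -> 0.0
rr_feature_error([800, 1100, 810])   # a 300 ms jump flags two intervals
```

A substituted segment whose boundary RR intervals keep this error low joins the surrounding real data without a physiologically implausible discontinuity.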

Limits of the mSieve Algorithm:
We discuss three limits of mSieve. First, if all the inferences are part of the blacklist, then the released data will correspond to the null state x = ∅, i.e., x_i = 0 for i = 1, ..., n. No data release is possible in this case because every inference is sensitive.

Second, if either the number of data sources or the number of inferences increases, then the size of the DBN will grow. This increases the complexity of learning the adversary model, as well as the space and time required to obtain a solution (see the complexity analysis of the dynamic programming approach).

Finally, since any computational model that mSieve uses to detect data segments corresponding to a blacklisted inference is imperfect, there are lower bounds on the privacy level that mSieve can achieve. We formalize this with the following lemma.

LEMMA 2. Suppose the false negative rate of the computational model used in mSieve for detecting blacklisted inference b ∈ S_B is F_b. Define η = \max_{b \in S_B} \{F_b\}. Then the lower bound on ε is ln(η).

A proof appears in the Appendix. We note that the above limitcan be improved by using better inference models.

EVALUATION

Study Design and Data Collection
We use data collected in two different studies to evaluate mSieve. We first summarize the data collection process and provide statistics of the data from both studies, which were approved by the IRB. In the first study (D1), physiological data was collected from 37 participants. The goal of this study was to evaluate the value of wearable sensors in monitoring and reflecting upon daily stress, activity, and conversation. Each participant wore mobile sensors for one full day. In total, 37 days of data was collected (one participant per day).


In the second study (D2), data was collected from 6 daily smokers. Each participant wore mobile sensors for three days, for a total of 18 days of data. The goal of this study was to develop and validate a computational model to detect smoking. In both studies, each participant wore a physiological chest band, an inertial wristband, and a GPS-enabled smartphone each day. Each sensor transmitted data continuously to a mobile phone during the study period. At the end of each day, the data collected on the phone was transferred to a server.

In both studies, participants wore the sensors for 12.04 ± 2.16 hours per day during their awake period. Using the accelerometer sensor, we found that participants were physically active for 22.19% of the time, on average. Physiological data in the natural environment can be of poor quality for several reasons, such as physical activity, loose attachment, and wireless disconnection [36]. We note that the stress assessment model is applied to the data to obtain a per-minute stress estimate only when the collected data was of good quality and not affected by these confounders [26]. Table 2 summarizes the number of data points collected from participants in each study.

Inferences from Sensor Data: We implemented the cStress model [26] to infer stress. It uses a set of features from both ECG and respiratory waveforms. If the stress probability for a particular minute is above 0.33, that minute is labeled as stressed. We detected physical activity from wearable accelerometers using the model in [4]. We used the puffMarker model [38] to detect smoking episodes from wrist sensor data and respiration data. Finally, we used the mConverse model [35] to infer conversation episodes.

Model Learning
A key element of our scheme is the DBN model that we use for identifying the substitution segments. For model learning, we divided the day into state intervals. We then used the data from all the users (not a specific user) to study the convergence of the model learning, i.e., to find the time interval over which the transition probabilities converged to a stable value. Let D_con be the converged transition matrix, and D_d be the conditional dependency probability matrix on day d. We define our normalized distance as

\frac{\sum (D_{con} - D_d)}{\sum D_{con}}

where the sum is over all elements of the matrix. Figure 4 shows the convergence of the DBN for multiple users. In both cases, more than 80% convergence occurs within the first 9 days and 90% convergence occurs within 14 days.
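The normalized distance can be computed directly from the two matrices, represented here as nested lists (a sketch of the formula as written, with signed element-wise differences):

```python
def normalized_distance(d_con, d_day):
    """Normalized distance between the converged transition matrix D_con
    and the day-d matrix D_d: sum of element-wise differences divided by
    the sum of D_con's entries."""
    num = sum(a - b for row_a, row_b in zip(d_con, d_day)
                    for a, b in zip(row_a, row_b))
    den = sum(a for row in d_con for a in row)
    return num / den

normalized_distance([[0.5, 0.5], [0.5, 0.5]],
                    [[0.4, 0.5], [0.5, 0.5]])   # ~0.05
```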

Privacy-Utility Tradeoff
To understand the privacy-utility tradeoff of both the dynamic programming and greedy approaches, we vary the privacy parameter ε and observe the utility loss U_loss in each case. We conducted four experiments, two on each dataset, by changing the configuration of the blacklist. In the first experiment, we set Blacklist = {Stress}, and in the second experiment we set Blacklist = {Conversation}, and performed our evaluation on dataset D1. In the third and fourth experiments, we set Blacklist = {Smoking} and Blacklist = {Stress}, respectively, and conducted the evaluation on dataset D2. In all the experiments, we vary ε from zero to one in steps of size 0.05. Figure 5 shows the results

Figure 4: Convergence rate for DBN training (normalized distance from the converged DBN vs. number of instances, for D1 and D2). The model is trained using aggregate data from all the users; convergence of >80% occurs within ~9 days of data.

Figure 6: Variation in the percentage of node types (safe, unsafe, and sensitive states) with privacy sensitivity ε. Recall, a lower value of ε means higher privacy.

obtained for each of the four experiments. Note that in each plot we show the utility loss for both the dynamic programming and greedy approaches; the initial utility loss is the utility loss right after creating holes. When ε = 0 we get e^ε = 1, which means that the posterior belief about the blacklist should not exceed the prior expectation. At that point, U_loss is maximum, and we get an average of 11% utility loss for the greedy algorithm and 7% for the DP algorithm. As we increase the value of ε, U_loss reduces. On average, we get less than 10% utility loss for both algorithms.

Dynamic Program vs. Greedy Algorithm
Utility Loss: We computed the U_loss value for both the dynamic programming algorithm and the greedy algorithm. In Figure 8, we also show U_loss after substitution. DP always produces better results than the greedy algorithm, and both produce better results than the initial U_loss. For some users, the initial U_loss is less than our solution. This is because our approach always fills each hole, resulting in occasional overfilling of the whitelisted inferences.

Distribution of Safe, Unsafe, and Sensitive States: We investigate the distribution of the three state types in the data released by each algorithm. To do so, we vary the value of the privacy sensitivity ε and observe the percentage of nodes that are safe, unsafe, and sensitive. Figure 6 shows the results. Recall that, because of plausible substitution, in addition to safe and sensitive states we also have unsafe states that are not sensitive themselves, yet unsuitable for release as they are highly correlated with sensitive states. As we increase the value of ε, the number of safe states also increases. For a given privacy sensitivity of


Figure 5: Privacy-utility tradeoff for different blacklist configurations and varying privacy sensitivity ε, shown for both datasets. Each panel plots utility loss vs. ε for the initial (post-hole-creation), greedy, and DP solutions: (a) Blacklist = {Stress} on D1; (b) Blacklist = {Conversation} on D1; (c) Blacklist = {Smoking} on D2; (d) Blacklist = {Stress} on D2.

Figure 7: Percentage of node types (safe, unsafe, sensitive) and utility loss under suppression vs. mSieve for the different users. User IDs are sorted in ascending order of safe node percentage. Privacy sensitivity ε = 0.5.

ε = 0.5, the average distribution of the various state types in a user trajectory is shown in Figure 7.

RELATED WORK
Various transformation techniques have been proposed to protect data privacy. Below, we summarize some of the techniques relevant to our problem setting.

Anonymization metrics: A vast majority of the literature on privacy-preserving data publishing considers anonymization metrics, such as k-anonymity [39], l-diversity [30], and t-closeness [27]. These approaches operate under the threat model in which an adversary is aware of quasi-identifiers in a multi-user dataset and wants to perform a linkage attack using other auxiliary data sources to infer the sensitive attributes of individuals within the same dataset. We consider a different setting where data from a single user (no multi-user database is present) is being protected against sensitive inferences. We further assume that the identity of the user is known and hence

Figure 8: Comparison of utility loss (initial, Greedy, and DP) across users. The user IDs are sorted according to the utility loss using the DP algorithm.

the anonymization mechanisms are not useful. In addition, instead of static (time-independent) relational databases, we consider time series of physiological sensor data.

Data Randomization: Randomization techniques add noise to the data in order to protect the sensitive attributes of records [2, 1]. Evfimievski et al. [20] proposed a series of randomization operators to limit the confidence of inferring an item's presence in a dataset using association rule mining. Differential privacy offers a formal foundation for privacy-preserving data publishing [15, 18]. It can be classified as a data distortion mechanism that uses random noise (typically from a Laplace distribution) to ensure that an attacker with access to the noisy query response fails to guess the presence or absence of a particular record in the dataset. Compared to the general-purpose data publication offered by k-anonymity, differential privacy is a strictly verifiable method [28]. A survey of results on differential privacy can be found in [17]. While most of the research on differential privacy has focused on interactive settings [23, 37, 21], non-interactive settings as an alternative to partition-based privacy models have


also been considered in recent works [6, 19, 42]. Since we focus on physiological data, randomization approaches are often unsuitable: they distort the data, making it unusable for critical inferences.

Distributed privacy preservation: Results are aggregated from datasets that are partitioned across entities [33]. Partitioning can be horizontal [12, 14, 16, 29], i.e., records distributed across multiple entities, or vertical [12, 14, 41], i.e., attributes distributed across multiple entities. Individual entities may not allow sharing their entire dataset but consent to limited information sharing through a variety of protocols. The overall goal of such techniques is to maintain privacy for each individual entity while obtaining aggregated results. We consider a single-user setting for which the above techniques do not work.

Data Synthesis: Synthetic data generators exist for location data [8] and are used to protect the privacy of sensitive places. To obtain these synthetic traces, data from different users are pooled together into a population-scale model, which is then used to generate the data. The physiological data of each person is unique and is used for obtaining biometrics [7, 31]. Thus, population-scale models often result in significant degradation in the utility of the data. The well-defined temporal correlation between data segments in a physiological time series (i.e., continuity between preceding and succeeding contexts) makes it even harder to generate synthetic data.

In summary, protecting behavioral privacy when sharing time series of mobile sensor data (especially physiological data) poses new challenges and unique research opportunities that have not been considered in prior work.

LIMITATIONS AND DISCUSSION
We presented mSieve as a first step towards building systems that use substitution as an effective mechanism to protect the privacy of behaviors while sharing personal physiological data. Instead of random substitution of sensitive segments, which can degrade the utility of the overall dataset, mSieve performs model-based substitution. Our approach opens up new directions in privacy and differential privacy research, namely to define suitable metrics and mechanisms that can be used to protect private behaviors in a time series of mobile sensor data. Specific challenges are as follows:

• Algorithmic Scalability: Several aspects constitute the scalability of the system. Although we have applied our model to several data sources, its applicability to other, newer sources of mobile sensor data is yet to be established. Furthermore, an increase in the number of sensors would imply a significant increase in the number of possible inferences (obtained using different combinations of the sensors), and a corresponding increase in the whitelist and blacklist sizes. Improving the efficiency of our algorithms in computing candidate segments in such scenarios, which effectively depends on the size of the DBN model, is an interesting open question.

• Offline vs. Online: Our model is an offline model that assumes availability of all data. In practice, data may need to be shared in real time as they are produced. Significant

adaptations may be needed to develop an online version of mSieve that can run in real time on mobile phones.

• Adversarial Setting: Stronger adversarial models, where the adversary possesses more information about the user than the behavioral model captures, may lead to new challenges in ensuring privacy guarantees and represent interesting privacy research. In fact, this also represents a significant bottleneck for model-based privacy approaches: a model is essential for maintaining the utility of the shared data, but the very use of a specific model leads to assumptions about adversarial capabilities. While in principle such approaches can be modified to protect against a worst-case adversary, it is difficult to provide meaningful utility in such cases.

• Plausibility: The current formulation of mSieve only provides local plausibility by adjusting the boundaries of substituted segments. Incorporating plausibility into the privacy formulation itself will be an interesting extension of the current work.

• Privacy leakage due to Graylist: We consider inferences that are associated with the whitelist or blacklist, but other behaviors could be inferred from raw sensor data. We call these additional inferences the graylist. Information leakage due to the graylist needs to be investigated further.

• New and emerging inferences: Finally, rapid advances are being made in inferring human behaviors from seemingly innocuous sensor data, which may challenge the notion of whitelist and blacklist, especially when raw sensor data is being shared.

CONCLUSION
We presented mSieve, a system that uses substitution as a mechanism to protect the privacy of behaviors and facilitate sharing of personal physiological data collected continuously in the natural environment. Instead of random substitution of sensitive segments, which can degrade the utility of the overall dataset, we perform model-based substitution. We employ a Dynamic Bayesian Network model that allows us to search for plausible user-specific candidate segments that satisfy the statistics of the sensitive segment and thus preserve the overall consistency of the shared data. Through experimentation on real-life physiological datasets, we demonstrated that our substitution strategies can indeed be used to preserve the utility of inferences while achieving differential behavioral privacy. This work opens the door for follow-up research and real-life deployment as the adoption of physiological sensors in daily wearables such as smartwatches grows.

ACKNOWLEDGEMENTS
We thank Amin Ali from the University of Dhaka. We also thank the study coordinators at the University of Memphis. The authors acknowledge support by the National Science Foundation under award numbers CNS-1212901 and IIS-1231754 and by the National Institutes of Health under grants R01CA190329, R01MD010362, and R01DA035502 (by NIDA) through funds provided by the trans-NIH OppNet initiative and U54EB020404 (by NIBIB) through funds provided by the trans-NIH Big Data-to-Knowledge (BD2K) initiative.


REFERENCES
1. Agrawal, D., and Aggarwal, C. C. On the design and quantification of privacy preserving data mining algorithms. In Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM (2001), 247–255.
2. Agrawal, R., and Srikant, R. Privacy-preserving data mining. In ACM SIGMOD Record, vol. 29, ACM (2000), 439–450.
3. Ali, A. A., Hossain, S. M., Hovsepian, K., Rahman, M. M., Plarre, K., and Kumar, S. mPuff: automated detection of cigarette smoking puffs from respiration measurements. In Proceedings of the 11th International Conference on Information Processing in Sensor Networks, ACM (2012), 269–280.
4. Atallah, L., Lo, B., King, R., and Yang, G.-Z. Sensor placement for activity detection using wearable accelerometers. In 2010 International Conference on Body Sensor Networks, IEEE (2010), 24–29.
5. Bao, L., and Intille, S. S. Activity recognition from user-annotated acceleration data. In Pervasive Computing. Springer, 2004, 1–17.
6. Bhaskar, R., Laxman, S., Smith, A., and Thakurta, A. Discovering frequent patterns in sensitive data. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2010), 503–512.
7. Biel, L., Pettersson, O., Philipson, L., and Wide, P. ECG analysis: a new approach in human identification. IEEE Transactions on Instrumentation and Measurement 50, 3 (2001), 808–812.
8. Bindschaedler, V., and Shokri, R. Synthesizing plausible privacy-preserving location traces. In 2016 IEEE Symposium on Security and Privacy, IEEE (2016).
9. Chakraborty, S. Balancing Behavioral Privacy and Information Utility in Sensory Data Flows. PhD thesis, University of California, Los Angeles, 2014.
10. Chakraborty, S., Raghavan, K. R., Johnson, M. P., and Srivastava, M. B. A framework for context-aware privacy of sensor data on mobile systems. In Proceedings of the 14th Workshop on Mobile Computing Systems and Applications, ACM (2013), 11.
11. Chakraborty, S., Shen, C., Raghavan, K. R., Shoukry, Y., Millar, M., and Srivastava, M. ipShield: A framework for enforcing context-aware privacy. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14) (2014), 143–156.
12. Chen, R., Mohammed, N., Fung, B. C., Desai, B. C., and Xiong, L. Publishing set-valued data via differential privacy. Proceedings of the VLDB Endowment 4, 11 (2011), 1087–1098.
13. Clifford, G. D., Azuaje, F., and McSharry, P. Advanced Methods and Tools for ECG Data Analysis. Artech House, Inc., 2006.
14. Clifton, C., Kantarcioglu, M., Vaidya, J., Lin, X., and Zhu, M. Y. Tools for privacy preserving distributed data mining. ACM SIGKDD Explorations Newsletter 4, 2 (2002), 28–34.
15. Dinur, I., and Nissim, K. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, ACM (2003), 202–210.
16. Du, W., and Atallah, M. J. Secure multi-party computation problems and their applications: a review and open problems. In Proceedings of the 2001 Workshop on New Security Paradigms, ACM (2001), 13–22.
17. Dwork, C. Differential privacy: A survey of results. In Theory and Applications of Models of Computation. Springer, 2008, 1–19.
18. Dwork, C., McSherry, F., Nissim, K., and Smith, A. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography. Springer, 2006, 265–284.
19. Dwork, C., Naor, M., Reingold, O., Rothblum, G. N., and Vadhan, S. On the complexity of differentially private data release: efficient algorithms and hardness results. In Proceedings of the forty-first annual ACM Symposium on Theory of Computing, ACM (2009), 381–390.
20. Evfimievski, A., Srikant, R., Agrawal, R., and Gehrke, J. Privacy preserving mining of association rules. Information Systems 29, 4 (2004), 343–364.
21. Friedman, A., and Schuster, A. Data mining with differential privacy. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM (2010), 493–502.
22. Götz, M., Nath, S., and Gehrke, J. MaskIt: Privately releasing user context streams for personalized mobile applications. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12 (2012), 289–300.
23. Hay, M., Rastogi, V., Miklau, G., and Suciu, D. Boosting the accuracy of differentially private histograms through consistency. Proceedings of the VLDB Endowment 3, 1-2 (2010), 1021–1032.
24. He, Y., Barman, S., Wang, D., and Naughton, J. F. On the complexity of privacy-preserving complex event processing. In Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '11 (2011), 165–174.
25. Hossain, S. M., Ali, A. A., Rahman, M. M., Ertin, E., Epstein, D., Kennedy, A., Preston, K., Umbricht, A., Chen, Y., and Kumar, S. Identifying drug (cocaine) intake events from acute physiological response in the presence of free-living physical activity. In Proceedings of the 13th International Symposium on Information Processing in Sensor Networks, IEEE Press (2014), 71–82.
26. Hovsepian, K., al'Absi, M., Ertin, E., Kamarck, T., Nakajima, M., and Kumar, S. cStress: towards a gold standard for continuous stress assessment in the mobile environment. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM (2015), 493–504.

27. Li, N., Li, T., and Venkatasubramanian, S. t-closeness: Privacybeyond k-anonymity and l-diversity. In Data Engineering,2007. ICDE 2007. IEEE 23rd International Conference on,IEEE (2007), 106–115.

28. Li, N., Qardaji, W. H., and Su, D. Provably private dataanonymization: Or, k-anonymity meets differential privacy.Arxiv preprint (2011).

29. Lindell, Y., and Pinkas, B. Privacy preserving data mining. InAdvances in CryptologyCRYPTO 2000, Springer (2000),36–54.

30. Machanavajjhala, A., Kifer, D., Gehrke, J., andVenkitasubramaniam, M. l-diversity: Privacy beyondk-anonymity. ACM Transactions on Knowledge Discovery fromData (TKDD) 1, 1 (2007), 3.

31. Pagani, M., Montano, N., Porta, A., Malliani, A., Abboud,F. M., Birkett, C., and Somers, V. K. Relationship betweenspectral components of cardiovascular variabilities and directmeasures of muscle sympathetic nerve activity in humans.Circulation 95, 6 (1997), 1441–1448.

32. Parate, A., Chiu, M.-C., Chadowitz, C., Ganesan, D., andKalogerakis, E. Risq: Recognizing smoking gestures withinertial sensors on a wristband. In Proceedings of the 12thannual international conference on Mobile systems,applications, and services, ACM (2014), 149–161.


33. Pinkas, B. Cryptographic techniques for privacy-preserving data mining. ACM SIGKDD Explorations Newsletter 4, 2 (2002), 12–19.

34. Plarre, K., Raij, A., Hossain, S. M., Ali, A. A., Nakajima, M., al'Absi, M., Ertin, E., Kamarck, T., Kumar, S., Scott, M., et al. Continuous inference of psychological stress from sensory measurements collected in the natural environment. In Information Processing in Sensor Networks (IPSN), 2011 10th International Conference on, IEEE (2011), 97–108.

35. Rahman, M., Ali, A. A., Plarre, K., al'Absi, M., Ertin, E., and Kumar, S. mConverse: Inferring conversation episodes from respiratory measurements collected in the field. Wireless Health (2011).

36. Rahman, M. M., Bari, R., Ali, A. A., Sharmin, M., Raij, A., Hovsepian, K., Hossain, S. M., Ertin, E., Kennedy, A., Epstein, D. H., et al. Are we there yet?: Feasibility of continuous stress assessment via wireless physiological sensors. In Proceedings of the 5th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics, ACM (2014), 479–488.

37. Roth, A., and Roughgarden, T. Interactive privacy via the median mechanism. In Proceedings of the forty-second ACM symposium on Theory of computing, ACM (2010), 765–774.

38. Saleheen, N., Ali, A. A., Hossain, S. M., Sarker, H., Chatterjee, S., Marlin, B., Ertin, E., al'Absi, M., and Kumar, S. puffMarker: a multi-sensor approach for pinpointing the timing of first lapse in smoking cessation. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM (2015), 999–1010.

39. Sweeney, L. k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10, 05 (2002), 557–570.

40. Thomaz, E., Essa, I., and Abowd, G. D. A practical approach for recognizing eating moments with wrist-mounted inertial sensing. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, ACM (2015), 1029–1040.

41. Vaidya, J., and Clifton, C. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, ACM (2002), 639–644.

42. Xiao, Y., Xiong, L., and Yuan, C. Differentially private data release through multidimensional partitioning. In Secure Data Management. Springer, 2010, 150–168.

APPENDIX

PROOF OF LEMMAS 1 AND 2

PROOF. (Lemma 1): If we take a sample from the expected output $\vec{\bar{s}}$, then
$$\frac{P(S_i \in S_B \mid S_{i-1} = s_{i-1})}{P(\bar{S}_i \in S_B \mid \bar{S}_{i-1} = \bar{s}_{i-1})} < e^{\delta}
\quad\text{and}\quad
\prod_{i=1}^{\tau} \frac{P(S_i \in S_B \mid S_{i-1} = s_{i-1})}{P(\bar{S}_i \in S_B \mid \bar{S}_{i-1} = \bar{s}_{i-1})} < e^{\tau\delta}.$$
Recall that inferences are computed over time intervals. We assume that inference computations are independent across time intervals, i.e., $P(r_i \to S_i \in S_B \mid r_{i-1}) = P(r_i \to S_i \in S_B)$ ($r_i$ is independent of $r_{i-1}$). Thus,
$$\frac{P(r_i \to S_i \in S_B)}{P(\bar{r}_i \to S_i \in S_B)} < e^{\delta}
\quad\text{and, correspondingly,}\quad
\prod_{i=1}^{\tau} \frac{P(r_i \to S_i \in S_B)}{P(\bar{r}_i \to S_i \in S_B)} < e^{\tau\delta}.$$
Combining the above equations and denoting $P(S_i \in S_B \mid r_i)$ by $P(r_i \to S_i \in S_B)$, we get
$$\frac{P(S_1 \in S_B \mid r_1)}{P(S_1 \in S_B \mid \bar{r}_1)} \prod_{i=2}^{\tau} \frac{P(S_i \in S_B \mid r_i)\,P(S_i \in S_B \mid S_{i-1})}{P(S_i \in S_B \mid \bar{r}_i)\,P(S_i \in S_B \mid \bar{S}_{i-1})} < e^{2\tau\delta}.$$
This is the same as
$$\frac{P(S_1 \in S_B \mid r_1, \ldots, S_\tau \in S_B \mid r_\tau)}{P(S_1 \in S_B \mid \bar{r}_1, \ldots, S_\tau \in S_B \mid \bar{r}_\tau)} \le e^{2\tau\delta}.$$
Finally, if the ratio of the joint probabilities of the states is bounded, then a blacklist query corresponding to those states is bounded by the same value. Thus,
$$\frac{P(q(r_1, \ldots, r_\tau; b \in S_B) \in K)}{P(q(\bar{r}_1, \ldots, \bar{r}_\tau; b \in S_B) \in K)} \le e^{2\tau\delta}.$$
Since $\delta < \varepsilon/2\tau$,
$$\frac{P(q(\vec{r}; b) \in K)}{P(q(\vec{\bar{r}} \in D_B; b) \in K)} \le e^{\varepsilon}.$$
Thus, $D(\vec{r} \,\|\, \vec{\bar{r}}) \le e^{\varepsilon}$.

PROOF. (Lemma 2): The probability of detecting a blacklisted inference from the released data equals the false-negative rate of the computational model, i.e., $P(r \to X \in B) = F_b$. Let $\bar{r}$ be the data segment drawn from the reference database. The probability of detecting a blacklisted inference from $\bar{r}$ is then close to zero, i.e., $P(\bar{r} \to X \in B) \to 0$, and thus $\ln(P(\bar{r} \to X \in B)) \to 0$. Using our privacy definition,
$$\left|\ln\!\big(P(r \to X \in B)\big) - \ln\!\big(P(\bar{r} \to X \in B)\big)\right| \le \ln(\eta) - 0 = \ln(\eta).$$
Thus, we pick $\varepsilon$ such that $\ln(\eta) \le \varepsilon$.
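Lemma 2's condition can be read as a simple budget rule: for a given $\eta$, the smallest admissible $\varepsilon$ is $\ln(\eta)$. A minimal sketch (the helper name and sample $\eta$ values are hypothetical, not from the paper):

```python
import math

def min_epsilon(eta: float) -> float:
    """Smallest epsilon satisfying Lemma 2's condition ln(eta) <= epsilon.

    Assumes eta >= 1 so that the resulting budget is non-negative.
    """
    if eta < 1.0:
        raise ValueError("eta < 1 would yield a negative privacy budget")
    return math.log(eta)

# Check the condition for a few illustrative eta values.
for eta in (1.0, 2.0, 10.0):
    eps = min_epsilon(eta)
    assert math.log(eta) <= eps  # Lemma 2's condition holds
    print(f"eta = {eta:5.1f} -> epsilon >= {eps:.4f}")
```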