A Study of Usability-aware Network Trace Anonymization

Kato Mivule Los Alamos National Laboratory Los Alamos, New Mexico, USA

[email protected]

Blake Anderson Los Alamos National Laboratory Los Alamos, New Mexico, USA

[email protected]

Abstract—The publication and sharing of network trace data is critical to the advancement of collaborative research among various entities in government, the private sector, and academia. However, due to the sensitive and confidential nature of the data involved, entities have to employ various anonymization techniques to meet legal requirements in compliance with confidentiality policies. Nevertheless, the very composition of network trace data makes it a challenge to apply anonymization techniques. At the same time, a basic application of microdata anonymization techniques to network traces is problematic and does not deliver the necessary data usability. Therefore, as a contribution, we point out some of the ongoing challenges in network trace anonymization. We then suggest usability-aware anonymization heuristics that employ microdata privacy techniques while giving consideration to the usability of the anonymized data. Our preliminary results show that, with trade-offs, it might be possible to generate anonymized network traces with enhanced usability on a case-by-case basis using microdata anonymization techniques.

Keywords—Network Trace Anonymization; Usability; Differential Privacy; K-anonymity; Generalization

I. INTRODUCTION

While a number of network trace anonymization techniques have been presented in the literature, data utility remains problematic due to the unique usability requirements of the different consumers of the privatized network traces. Moreover, a number of microdata privacy techniques from the statistical and computational sciences are difficult to apply when anonymizing network traces because of the low usability of the results. Finding the right proportion between anonymization and data utility of network trace data is intractable and requires trade-offs on a case-by-case basis, after careful consideration of the privacy needs stipulated by policy makers and, likewise, the usability requirements of the researchers, who in this case are the consumers of the anonymized data. Furthermore, a generalized approach fails to deliver unique solutions, as each entity will have unique data privacy requirements. In this study, we take a look at the structure of network trace data. We vertically partition the network trace data into different attributes and apply microdata privatization techniques separately to each attribute. We then suggest usability-aware anonymization heuristics for the anonymization process. While a number of anonymization attacks have been presented in the literature, the main goal of this study is the generation of anonymized network traces with better data usability. Therefore, the focus of the suggested heuristics and preliminary results is the generation of anonymized, usability-aware network trace data using privacy techniques from the statistical disclosure control domain, including generalization, noise addition and multiplicative noise perturbation, differential privacy, and data swapping [38]. A measure of usability that quantifies descriptive and inference statistics of the anonymized data in comparison with those of the original data is also presented. Furthermore, we apply frequency distribution analysis and unsupervised learning techniques to measure the usability of the unlabeled data. The rest of the paper is organized as follows: In Section II, we present a review of related work and definitions of important terms pertaining to this paper. In Section III, we present the methodology and the usability-aware anonymization heuristics. In Section IV, the experiment and results are given. Finally, in Section V, the conclusion, recommendations, and future work are presented.

II. RELATED WORK

One of the challenges of anonymizing network traces is how to keep the structure and flow of the data intact so as to provide usability to the consumer of the anonymized data. In such efforts, Maltz et al. (2004) demonstrated that network trace data could be anonymized while preserving the structure of the original data [1]. Additionally, Maltz et al. (2004) noted that the challenges in anonymizing network traces include identifying the attributes in the network trace that could leak sensitive information and anonymizing the data in such a way that the original configurations are preserved [1]. These observations are still relevant today, especially when considering the intractability between privacy and usability. Slagell, Wang, and Yurcik (2004) proposed Crypto-Pan, a network trace anonymization tool that employs cryptographic techniques in the privatization of network trace data [2]. While anonymization using cryptographic means might be effective in concealing sensitive data, usability of the anonymized data remains a challenge. Bishop, Crawford, Bhumiratana, Clark, and Levitt (2006) observed that one of the problems in the anonymization of network traces is that, when handling IP addresses, the set of available addresses is finite, thus setting a limit on any anonymization prospects [3]. Each octet in an IP address covers a range of 0 to 255; for instance, it would not make much sense to have an anonymized IP address with an octet value of 345. This limitation makes the data vulnerable to de-anonymization attacks. On the issue of de-anonymization attacks, Coull, Wright, Monrose, Collins, and Reiter (2007) presented inference techniques for de-anonymizing and detecting network topologies in anonymized network trace data [4]. Coull et al. showed that topological data could be deduced as an artifact of functional network packet traces, since data on the activity of hosts can be exploited to defeat the obfuscation of the network traces [4]. Moreover, Coull et al. pointed out that obfuscating network trace data is not a trivial task, as publishers of the data need to be aware of the tension in balancing privacy and data utility needs for anonymized network traces [4]. Additionally, Ribeiro, Chen, Miklau, and Towsley (2008) showed that systematic attacks on prefix-preserving anonymized network traces could be carried out by adversaries using a modest amount of publicly available information about a network and employing attack techniques such as fingerprinting [5]. However, Ribeiro et al. anticipated that their proposed attack methodologies would be employed in evaluating worst-case vulnerabilities and finding trade-offs between privacy and utility in the prefix-preserving privatization of network traces [5]. Therefore, while researchers might have an interest in anonymized data sets that maintain the structure and flow of the original data, curators of that data have to contend with the fact that such prefix-preserving anonymization is subject to de-anonymization attacks.

A comprehensive reference model was presented by Gattani and Daniels (2008), in which they outlined how entities should formulate the problem of anonymizing network traces [6]. Gattani and Daniels (2008) noted that the anonymization procedure always aims at the following three goals [6]: (i) defending the confidentiality of users, (ii) obfuscating the inner structure of a network, and (iii) generating anonymized network traces with acceptable levels of usability. However, Gattani and Daniels (2008) observed that attaining these three anonymization goals is problematic, as removing too much sensitive information from a network data trace only reduces the usability of the anonymized network traces [6]. Additionally, Gattani and Daniels (2008) categorized attacks on anonymized data as (i) active data injection attacks, (ii) known mapping attacks, (iii) network topology inference attacks, and (iv) cryptographic attacks [6]. On the categorization of attacks, King, Lakkaraju, and Slagell (2009) presented a taxonomy of attacks on anonymization techniques with the aim of helping curators of the privatization process negotiate trade-offs between data utility and anonymization [7]. King et al. classified attacks on anonymization methods as (i) fingerprinting, (ii) structure recognition, (iii) known mapping, (iv) data injection, and (v) cryptographic attacks [7]. A combined categorization of attacks on anonymization techniques, from Gattani and Daniels and King et al., would then be listed as follows [7] [6]: (i) Fingerprinting attacks: in this category of attacks, attributes of anonymized data are compared with traits of known network structures to uncover a relationship between the anonymized and non-anonymized data. (ii) Data injection attacks: in this type of exploit, an attacker injects pseudo-traffic into a network trace before the anonymization process and uses the pseudo-traffic traces to de-anonymize the network traces and network structure. (iii) Structure recognition attacks: in this type of exploit, an attacker seeks to determine the structure between objects in the anonymized data to discover multiple relations between the anonymized and non-anonymized data. (iv) Network topology inference attacks: similar to known mapping attacks, this category of exploits seeks to retrieve the network topology map by de-anonymizing the nodes that make up the vertices of the network and the edges between the nodes that represent the connectivity and the routers. (v) Known mapping attacks: in this category of exploit, the attacker relies on external (auxiliary) data to find a mapping between the anonymized network trace data and the original network trace data in order to retrieve the original IP addresses. (vi) Cryptographic attacks: in this category of attacks, exploits are carried out to break the cryptographic algorithms used to encrypt the network traces.

A comparative analysis was done by Coull, Monrose, Reiter, and Bailey (2009), in which they pointed out the similarities and differences between network data anonymization and microdata privatization techniques, and how microdata obfuscation methods could be applied to anonymize network traces [8]. Coull et al. observed that uncertainties exist about the effectiveness of network data anonymization, from both a methodological and a policy view, with the research community in need of more study to understand the implications of publishing anonymized network data and the utility of such data to researchers [8]. Furthermore, Coull et al. suggested that the extensive work that exists in the statistical disclosure control discipline could be employed by the network research community towards the privatization of network flow data [8]. On network trace packet anonymization, Foukarakis, Antoniades, and Polychronakis (2009) proposed the anonymization of network traces at the packet level, in the payload of a packet, due to inadequacies found in various network trace anonymization techniques [9]. Foukarakis et al. suggested identifying revealing information contained in the shell-code of code injection attacks and anonymizing such packets to grant confidentiality in published network attack traces [9]. On the subject of IP-flow intrusion detection methods, Sperotto et al. (2010) presented an overview of the IP-flow intrusion detection approach, highlighted the classification of attacks and defense methods, and showed how flow-based methods can be used to discover scans, worms, botnets, and denial-of-service (DoS) attacks [10]. Furthermore, Sperotto et al. highlighted two types of sampling: packet sampling, whereby a packet is deterministically chosen for analysis based on a time interval; and flow sampling, in which a sample flow is chosen for analysis [10]. At the same time, Burkhart et al. (2010), in their review of anonymization techniques, showed that current anonymization techniques are vulnerable to a series of injection attacks, in which attacker packets are inserted into the network flow prior to anonymization and later retrieved, thus revealing vulnerabilities and patterns in the anonymized data [11]. As a mitigation to injection attacks, Burkhart et al. suggested that the anonymization of network flow data be done as part of a comprehensive approach that includes both legal and technical perspectives on data confidentiality [11].

Meanwhile, McSherry and Mahajan (2011) showed that differential privacy could be employed to anonymize network trace data; yet despite the privacy guarantees provided by differential privacy, the usability of the privatized data remains a challenge due to excessive noise from the anonymization [12]. In their study of applying differential privacy to network trace data, McSherry and Mahajan (2011) acknowledged the challenges of balancing usability and privacy, despite the confidentiality assurances accorded by differential privacy [13]. On real-time interactive anonymization, Paul, Valgenti, and Kim (2011) proposed the real-time Netshuffle anonymization technique, whereby distortion is applied to a complete graph to prevent inference attacks on network traffic [14]. Netshuffle employs the k-anonymity methodology on network traces, ensuring that each trace record appears at least k times (k > 1); shuffling is then applied to the k-anonymized records, making it difficult for an attacker to decipher the data due to the distortion [14]. A network trace obfuscation methodology, (k, j)-obfuscation, was proposed by Riboni, Villani, Vitali, Bettini, and Mancini (2012), in which a network flow is considered obfuscated if it cannot be linked, with strong assurance, to its source and destination IP addresses [15]. Riboni et al. observed from their implementation of (k, j)-obfuscation that the large set of network flows maintained the utility of the original network trace [15]. However, the context of data utility remains challenging, as each consumer of privatized data will have unique usability requirements and different levels of needed assurance; utility therefore becomes constrained to a case-by-case basis, depending on an entity's privacy and usability needs. On the issue of preserving IP consistency in anonymized data, Qardaji and Li (2012) observed that full prefix-preserving IP anonymization suffers from a number of attacks, yet from a usability perspective some level of consistency is required in anonymized IP addresses [16]. To mitigate this problem, Qardaji and Li (2012) proposed maintaining pseudonym consistency by dividing flow data into buckets based on temporal closeness and privatizing the flows in each bucket separately, thus maintaining consistency within each bucket but not globally across all buckets [16]. Mendonca, Seetharaman, and Obraczka (2012) proposed AnonyFlow, an interactive anonymization technique that provides endpoint privacy by preventing the tracking of source behavior and location in network data [17]. However, Mendonca et al. acknowledged that AnonyFlow does not address issues of complete anonymity, data security, steganography, and network trace anonymization in non-interactive settings [17].

On generating synthetic network traces, Jeon, Yun, and Kim (2013) proposed an anomaly-based intrusion detection system (A-IDS) to generate pseudo network traffic for the obfuscation of real, sensitive network traffic in supervisory control and data acquisition (SCADA) systems [18]. An overview of network data anonymization was presented by Nassar, al Bouna, and Malluhi (2013), in which they highlighted the need to find appropriate anonymization algorithms that grant privacy with an optimal risk-utility trade-off [19]. On using entropy and similarity distance measures, Xiaoyun, Yujie, Xiaosheng, Xiaohong, and Yan (2013) employed similarity distance and entropy techniques in the quantification of anonymized network trace data [20]. Xiaoyun et al. proposed two types of similarity measures: (i) external similarity, in which distance measurements are done to compute the probability that an adversary will obtain a one-to-one mapping relation between the anonymized and the original data, based on auxiliary knowledge; and (ii) internal similarity, in which distance measurements are done on the privatized and the original data to indicate how distinguishable or indistinguishable the data sets are [20]. On the extraction, classification, and anonymization of packet traces, Lin, Lin, Wang, Chen, and Lai (2014) observed that capturing and sharing real network traffic faces two challenges: first, various protocols are associated with the packet traces, and second, such packet traces tend not to be well classified before deep packet anonymization [21]. Therefore, Lin et al. proposed the PCAPLib methodology for the extraction, classification, and deep packet anonymization of packet traces [21]. In their work on the Session Initiation Protocol (SIP) used in multimedia communication sessions, Stanek, Kencl, and Kuthan (2014) pointed out that current network trace anonymization techniques are insufficient for SIP traces due to the data format of the SIP trace, which includes the IP address, the SIP URI, and the e-mail address [22]. To mitigate this problem, Stanek et al. proposed SiAnTo, an anonymization methodology that replaces SIP information with non-descriptive but matching labels [22]. Recently, Riboni, Villani, Vitali, Bettini, and Mancini (2014) cautioned that current network trace anonymization techniques are vulnerable to various attacks, while at the same time it is problematic to apply microdata privatization methods in obfuscating network traces [23]. Moreover, Riboni et al. noted that current obfuscation methods depend on assumptions about an adversary's intentions, which are challenging to model, and do not guarantee privacy against background knowledge attacks [23]. Table I summarizes some of the network trace anonymization challenges outlined in the literature over the past ten years.

A. Network trace anonymization techniques

In this section, a review of some of the common network trace anonymization techniques is presented [24] [25] [26] [27] [28] [16]: (i) Black marker technique: in this method, sensitive values are erased or substituted with fixed values.

TABLE I. SUMMARY OF NETWORK TRACE ANONYMIZATION CHALLENGES

Author(s) | Network Trace Anonymization Challenges
Maltz et al. (2004) | Challenge of identifying attributes to anonymize while conserving usability.
Slagell et al. (2004) | Crypto-Pan: cryptography to anonymize IP addresses; usability a challenge.
Bishop et al. (2006) | Anonymization of IP addresses problematic; the set of IP addresses is finite.
Coull et al. (2007) | Obfuscation not a trivial task due to the tension between privacy and usability.
Ribeiro et al. (2008) | Prefix-preserving anonymized data subject to fingerprinting attacks.
King et al. (2009) | Taxonomy of attacks on anonymization techniques; anonymization challenges.
Coull et al. (2009) | Comparison between network and microdata anonymization; significant differences.
Foukarakis et al. (2009) | Network trace anonymization at the packet level; a challenge.
Burkhart et al. (2010) | Injection attacks on anonymized network trace data.
McSherry and Mahajan (2011) | Differential privacy anonymization of network trace data.
Paul, Valgenti, and Kim (2011) | Real-time anonymization with k-anonymity.
Riboni et al. (2012) | (k, j)-obfuscation: a network flow is obfuscated if it cannot be linked to the original data with strong assurance.
Qardaji and Li (2012) | Global prefix consistency is subject to attacks.
Mendonca et al. (2012) | Interactive network trace anonymization.
Jeon, Yun, and Kim (2013) | Synthetic (anonymized) network trace data generation.
Nassar et al. (2013) | Balance between utility and privacy needed; still a problem.
Farah and Trajkovic (2013) | Network trace anonymization techniques; an overview.
Stanek et al. (2014) | Proposed Session Initiation Protocol (SIP) anonymization and challenges.
Riboni et al. (2014) | Caution with current network anonymization techniques; vulnerable to attacks.

(ii) Enumeration technique: in this scheme, sensitive values in a sequence are replaced with an ordered sequence of synthetic values. (iii) Hash technique: unique values are substituted with a fixed-size bit string. (iv) Partitioning technique: with the partitioning method, revealing values are partitioned into a subset of values and each value in the subset is replaced with a generalized value. For example, the IP address 141.121.10.12 could be partitioned into four octets and the last two octets replaced with zero values, giving 141.121.0.0. (v) Precision degradation technique: highly specific values of a timestamp attribute are removed when employing the precision degradation method. (vi) Permutation technique: a random permutation is applied to link non-anonymized IP and MAC addresses to a set of available addresses. (vii) Prefix-preserving anonymization technique: in this technique, values of an IP address are replaced with synthetic values in such a way that the original structure of the IP address is kept; that is, the prefix of the IP address structure is preserved. Prefix preservation can be applied fully or partially to the IP address: full prefix-preserving anonymization maps the full structure of the original IP address into the anonymized data, while partial prefix-preserving anonymization preserves a select part of the original IP address structure, for example the first two octets. (viii) Random time shift technique: this methodology works by applying a random value as an offset to each value in the field. (ix) Truncation technique: with this technique, part of the IP or MAC address is suppressed or deleted and the rest of the address remains intact. (x) Time unit annihilation: in this partitioning anonymization methodology, part of the timestamp is deleted and replaced with zeros. Although a number of network trace anonymization solutions have been proposed in the literature (see Table I), usability of the anonymized data remains problematic. While a number of challenges exist, this study focuses on the challenge of usability-aware anonymization of network traces.
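
To make a few of these techniques concrete, the following minimal Python sketch (our own illustration, not the implementation of any of the tools cited above) shows the black marker, partitioning/truncation, and precision degradation techniques for dotted-quad IPv4 addresses and Unix timestamps; the function names and parameter values are illustrative assumptions.

```python
# Illustrative sketches of three network trace anonymization techniques.
# These are hypothetical helpers, not the implementations cited above.

def black_marker(value, fixed="0.0.0.0"):
    """Black marker: erase a sensitive value by substituting a fixed value."""
    return fixed

def partition_ip(ip, keep_octets=2):
    """Partitioning/truncation: keep the leading octets, zero out the rest.
    E.g., partition_ip("141.121.10.12") -> "141.121.0.0"."""
    octets = ip.split(".")
    return ".".join(octets[:keep_octets] + ["0"] * (4 - keep_octets))

def degrade_precision(unix_ts, granularity=3600):
    """Precision degradation: drop sub-hour precision from a Unix timestamp."""
    return unix_ts - (unix_ts % granularity)

print(black_marker("141.121.10.12"))   # 0.0.0.0
print(partition_ip("141.121.10.12"))   # 141.121.0.0
print(degrade_precision(1123355142))   # 1123354800
```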

B. Statistical disclosure control techniques

The following are some of the main microdata privatization methods used. Suppression: in this technique, revealing and sensitive data values are deleted from a data set at the cell level [29]. Generalization: to achieve confidentiality for revealing values in an attribute, a single value is allocated to a group of sensitive values in the attribute [30]. K-anonymity: in this method, data privacy is enforced by requiring that all values in the quasi-identifier attributes be repeated k times, with k > 1, thus providing confidentiality and making it harder to uniquely distinguish individual values. K-anonymity employs both generalization and suppression to achieve k > 1 [31]. Data swapping: data swapping is a data privacy technique that exchanges sensitive cell values with other cell values in the same attribute while keeping intact the frequencies and statistical traits of the original data, making it difficult for an attacker to map the privatized values to the original records [32]. Noise addition: noise addition is a data privacy method that adds random values (noise) to revealing and sensitive numerical values in the original data to ensure confidentiality. The random values are usually chosen based on the mean and standard deviation of the original values [33]:

$X_i + \varepsilon_i = Z_i \quad (1)$

Multiplicative noise: similar to noise addition, random values generated from the mean and variance of the original data are multiplied with the original data, generating a privatized data set [34]:

$X_i \cdot \varepsilon_i = Z_i \quad (2)$

Where X is the original data, Z the privatized data, and ε the random values. Differential privacy: similar to noise addition, differential privacy imposes privacy by adding Laplace noise to query results from a database, such that it cannot be distinguished whether a particular value has been adjusted in that database or not, making it more difficult for an attacker to decode items in the database [35]. ε-differential privacy is satisfied if the results of a query run on databases D1 and D2 are probabilistically similar and meet the following condition [35]:

$\frac{P[q_n(D_1) \in R]}{P[q_n(D_2) \in R]} \leq e^{\varepsilon} \quad (3)$

Where D1 and D2 are the two databases; P is the probability of the perturbed query results from D1 and D2; $q_n()$ is the privacy-granting (perturbation) procedure; $q_n(D_1)$ is the privacy-granting procedure applied to query results from database D1; $q_n(D_2)$ is the privacy-granting procedure applied to query results from database D2; R is the set of perturbed query results from the databases D1 and D2 respectively; and $e^{\varepsilon}$ is the exponential of the privacy parameter epsilon. Differential privacy can be implemented as follows [36]:

(i) Run the query on the database, where $f(x)$ is the query function.

(ii) Calculate the most influential observation (the global sensitivity):

$\Delta f = \max \lVert f(D_1) - f(D_2) \rVert \quad (4)$

(iii) Calculate the Laplace noise scale:

$b = \Delta f / \varepsilon \quad (5)$

(iv) Add Laplace noise to the query results:

$DP = f(x) + \mathrm{Laplace}(0, b) \quad (6)$

(v) Publish the perturbed query results in interactive (query responses) or non-interactive (macrodata, microdata) mode.
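
As a minimal illustration of equations (1), (2), and (4)-(6), the following Python sketch (our own, using NumPy; the count query, sensitivity, and epsilon values are illustrative assumptions, not the configuration used in this study) shows additive noise, multiplicative noise, and the Laplace mechanism:

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_noise(x, mu, sigma):
    """Eq. (1): Z_i = X_i + eps_i, noise drawn around the data's statistics."""
    return x + rng.normal(mu, sigma, size=x.shape)

def multiplicative_noise(x, mu, sigma):
    """Eq. (2): Z_i = X_i * eps_i."""
    return x * rng.normal(mu, sigma, size=x.shape)

def laplace_mechanism(query_result, sensitivity, epsilon):
    """Eqs. (4)-(6): perturb a query result with Laplace(0, b)."""
    b = sensitivity / epsilon                   # Eq. (5)
    return query_result + rng.laplace(0.0, b)   # Eq. (6)

x = np.array([10.0, 12.0, 11.0, 15.0])
z_add = additive_noise(x, x.mean(), x.std())        # Eq. (1)
z_mul = multiplicative_noise(x, x.mean(), x.std())  # Eq. (2)
# A count query f(x) changes by at most 1 between neighboring databases,
# so its sensitivity (Eq. (4)) is 1.
private_count = laplace_mechanism(float(len(x)), sensitivity=1.0, epsilon=0.5)
```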

C. Metrics used to quantify usability in this study

The Shannon entropy: entropy is used essentially to measure the amount of randomness and uncertainty in a data set; if all values in a set of information fall into one category, then entropy is zero. Probability is used to quantify the randomness of elements in an information set; normalized entropy values range from 0 to 1, reaching the upper bound when all probabilities are equal [37] [36]. Entropy is formally described by the following formula [37]:

$\text{Entropy} = H(p_1, p_2, \ldots, p_n) = \sum_{i=1}^{n} p_i \cdot \log \frac{1}{p_i} \quad (7)$

where $p_i$ is the probability of value $i$ and $H(p_1, p_2, \ldots, p_n)$ is the entropy over the $p_i$. Correlation metric (between original data X and privatized data Z): the correlation $r_{xz}$ measures the inclination and tendency of an additive linear relation between two variables; the correlation is dimensionless, independent of the units in which the data points x and z are measured, and is expressed as follows [38]:

$\text{Correlation } r_{xz} = \frac{\mathrm{Cov}(x, z)}{\sigma_x \sigma_z} \quad (8)$

Where $\mathrm{Cov}(x, z)$ is the covariance of X and Z, and $\sigma_x \sigma_z$ is the product of their standard deviations. If $r_{xz} = -1$, a negative linear relation exists between X and Z; if $r_{xz} = 0$, no linear relation exists between X and Z; and when $r_{xz} = 1$, a strong positive linear relation exists between X and Z. Descriptive statistics metric: descriptive statistics (DS) such as the mean, standard deviation, and variance are used to quantify how much distortion there is between the anonymized and original data. The larger the difference, the more privacy, but also an indication of less usability; the closer the difference, the more usability but perhaps less privacy. The quantification always takes the form [36]:

$\text{Usability} = DS(Z) - DS(X) \quad (9)$

Where Z is the anonymized data, X is the original data, and DS the descriptive statistic. Distance measures metric (Euclidean distance): for distance measures, we employed clustering with k-means to evaluate how the clustering of the original data compares with that of the anonymized data. In this case, the Euclidean distance is used for k-means clustering and is expressed as follows [39]:

$\mathrm{distance}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} \quad (10)$

The Davies-Bouldin index was also used to evaluate how well the clustering performed. The Davies-Bouldin index (DBI) is expressed as follows [21]:

$DBI = \frac{1}{N} \sum_{i=1}^{N} D_i \quad (11)$

where

$D_i \equiv \max_{j: j \neq i} R_{i,j} \quad (12)$

and

$R_{i,j} \equiv \frac{S_i + S_j}{M_{i,j}} \quad (13)$

Where $R_{i,j}$ is a measure of how good the clustering is, $S_i$ and $S_j$ are the within-cluster distances of clusters i and j, and $M_{i,j}$ is the distance between clusters i and j. Classification error metric: with the classification error test, both the original and anonymized data are passed through a machine learning classifier and the classification error (or accuracy) is returned. The classification error (CE) of the anonymized data is subtracted from that of the original. The larger the difference, the more privacy (due to distortion); this might be an indication of low usability. However, a smaller difference might indicate better usability but then lower privacy, as the anonymized results might be closer to the original data in similarity. Depending on the machine learning algorithm used, the classification error metric takes the form [36]:

$\text{Usability Gauge} = CE(Z) - CE(X) \quad (14)$

Where Z is the anonymized data, X the original data, and CE is the classification error.
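
To illustrate how these usability metrics might be computed in practice, here is a short Python sketch of equations (7), (8), (9), and (14) under our own assumptions; it is not the evaluation code used in this study. For equations (10)-(13), off-the-shelf routines such as scikit-learn's sklearn.cluster.KMeans and sklearn.metrics.davies_bouldin_score could be used.

```python
import numpy as np
from collections import Counter

def normalized_entropy(values):
    """Eq. (7): Shannon entropy of the value distribution, normalized to [0, 1]."""
    counts = np.array(list(Counter(values).values()), dtype=float)
    p = counts / counts.sum()
    h = -(p * np.log2(p)).sum()
    return h / np.log2(len(p)) if len(p) > 1 else 0.0

def correlation(x, z):
    """Eq. (8): r_xz = Cov(x, z) / (sigma_x * sigma_z)."""
    return float(np.corrcoef(x, z)[0, 1])

def usability_gap(x, z, stat=np.mean):
    """Eqs. (9) and (14): difference of a descriptive statistic (or a
    classification error rate) between anonymized Z and original X."""
    return float(stat(z) - stat(x))
```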

III. METHODOLOGY

In this section, we describe the implemented methodology: the heuristics used in the anonymization of network trace data within the context of usability, while at the same time granting the privacy requirements. The goal of the heuristics is to provide anonymized data that could be used by researchers, with statistical traits close to those of the original data. The trade-off in this case is that we tilt towards more utility while making it harder for an attacker to recover the original data, assuming that the attacker has no prior knowledge. Because of the unique data structure of network traces, a single generalized approach is not applicable to anonymizing all the network trace attributes. In our approach, we apply a hybrid of anonymization heuristics for each group of related attributes. Combinations of microdata anonymization techniques were used in this study, as illustrated in Figure 1. The following attributes were anonymized in the network trace data: (i) start and end time (timestamp), (ii) source IP and destination IP, (iii) protocol, (iv) source port and destination port, (v) source packet size and destination packet size, (vi) source bytes and destination bytes, and (vii) TOS flags. However, due to space constraints, we only present results for the timestamp and IP address attributes.

Figure 1: An illustration of the proposed anonymization heuristics for the network trace data.
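
As a sketch of the vertical partitioning step in Figure 1 (our own illustration; the NetFlow field names are assumptions, not the exact schema used in this study), each attribute is pulled out into its own column so that a different heuristic can be applied to each:

```python
# Vertically partition NetFlow-like records into per-attribute columns so that
# each attribute can be anonymized with its own heuristic.
flows = [
    {"start": 1123355142, "end": 1123355214, "src_ip": "141.121.10.12",
     "dst_ip": "10.4.7.9", "proto": 6, "src_port": 49152, "dst_port": 443},
    {"start": 1123355150, "end": 1123355290, "src_ip": "141.121.10.33",
     "dst_ip": "10.4.7.9", "proto": 17, "src_port": 53, "dst_port": 53},
]

# One list per attribute, anonymized independently, then zipped back together
# into privatized records for publication.
columns = {field: [flow[field] for flow in flows] for field in flows[0]}
```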

A. Enumeration with multiplicative perturbation

To preserve the flow structure of the timestamp, we employed enumeration with multiplicative perturbation, a heuristic that combines the multiplicative noise technique from microdata privatization with enumeration from network trace anonymization. The enumeration with multiplicative perturbation heuristic is implemented as follows. Step (i): a small epsilon constant value is chosen between 0 and 1. Data curators could conceal this value, arbitrarily chosen between 0 and 1, as an additional layer of confidentiality. Step (ii): the small epsilon constant is multiplied with the original data (the timestamp, both the start and end time attributes), generating an enumerated set. Step (iii): the enumerated data is added to the original data, producing an anonymized data set. Step (iv): a test for usability is done, using descriptive statistical analysis, entropy, correlation, and unsupervised learning with clustering (k-means). Step (v): if the desired threshold is met, the anonymized data is published. The goal of this heuristic is to keep the time flow structure intact and similar to the original data while at the same time anonymizing the time series values. In this case, the anonymized time series data should generate usability results similar to the original.
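
A minimal Python sketch of steps (i)-(iii) of this heuristic, under our reading of it (the usability tests of steps (iv) and (v) are omitted, and the constant c and the sample timestamps are illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

def enumerate_multiplicative(timestamps, c=None):
    """Steps (i)-(iii): choose a small constant c in (0, 1), multiply it with
    the original timestamps to form the enumerated set, then add that set back
    to the original data, i.e. Z = X + c*X. Order (time flow) is preserved."""
    if c is None:
        c = rng.uniform(0.0, 1.0)    # Step (i): the curator may conceal c
    enumerated = c * timestamps      # Step (ii)
    return timestamps + enumerated   # Step (iii)

start_times = np.array([1123355142.0, 1123355150.0, 1123355214.0])
print(enumerate_multiplicative(start_times, c=0.73))
```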

B. Generalization and differential privacy

The IP address is one of the most challenging attributes to anonymize, since each octet of the IP address is limited to a finite set of numbers, from 0 to 255. This makes the IP address attribute vulnerable to attackers attempting to de-anonymize the privatized network trace [3]. With such restrictions, the curator of the data is left with the choice of completely anonymizing the IP address by employing full perturbation techniques, which in turn distorts the flow structure and prefix of the IP address, and thus yields poor data usability. One solution to this problem would be to employ heuristics that grant anonymization while keeping the prefix of the IP address intact. However, full IP address prefix-preserving anonymization has been shown to be prone to de-anonymization attacks, presenting yet another challenge [5]. Therefore, to deal with this problem, we suggest a partial prefix-preserving heuristic in which differential privacy and generalization are used, implemented as follows. Octet 1 anonymization: the IP address is split into four octets. Generalization is applied to the first octet to partially preserve the prefix of the anonymized IP address. The goal is to give the users of the anonymized data some level of usability by enabling them to see a synthetic flow of the IP address structure in the network trace. Step (i): a small epsilon constant value is chosen and applied (added or multiplied to the data) with noise addition or multiplicative noise on the first octet. The goal is to preserve the flow structure in the first octet. Step (ii): a frequency count analysis is done at this stage to check that none of the first-octet values from the original data reappear in the anonymized data. Step (iii): if first-octet values reappear in the anonymized data, generalization is done by replacing the reappearing values with the most frequent values in the anonymized first octet. Step (iv): finally, generalization and k-anonymity are applied to ensure that no unique values appear and that all values in the first octet appear at least k times, with k > 1. Step (v): a test for usability, comparing the original and anonymized first-octet values, is done. Octet 2, 3, and 4 anonymization: to make it difficult to de-anonymize the full IP address, randomization using differential privacy is applied to the remaining three octets. However, since each octet is limited to the finite set of numbers 0 to 255, the differential privacy perturbation process will generate some values that exceed 255; for instance, it would make no sense to have an octet value of 350. To mitigate this situation, a control statement is introduced at the end of the differential privacy process to exclude all values outside the valid octet range; in this case, any values greater than 255 are excluded from the end results of the perturbation process. Differential privacy is applied to each of the three octets vertically and separately. Step (i): a vertical split of octets 2, 3, and 4 into separate attributes is done. Step (ii): anonymization using differential privacy is done on each attribute (octet) separately. Step (iii): a test ensures that the anonymized values in each octet are in range, from 0 to 255. Step (iv): if an anonymized value in an octet exceeds the 0 to 255 range, a generalized value, the most frequent value within the 0 to 255 range, is returned instead. Step (v): a test for usability is done. Step (vi): all octets are combined into a full anonymized IP address.
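
The following Python sketch gives one possible reading of this heuristic (ours, not the authors' code): a constant shift with a generalization fallback for octet 1, and Laplace-perturbed octets 2 to 4 clamped to the valid range by generalization. The epsilon, sensitivity, and fallback rule are illustrative assumptions, and the k-anonymity check of step (iv) and the usability tests are omitted.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

def most_frequent(values):
    """Generalization helper: the most frequent value in a collection."""
    return Counter(values.tolist()).most_common(1)[0][0]

def anonymize_octet_1(octet, c=0.4):
    """Octet 1: shift with a small constant, then generalize any result that
    collides with an original first-octet value (steps (i)-(iii))."""
    shifted = np.rint(octet * (1.0 + c)).astype(int) % 256
    collisions = np.isin(shifted, octet)
    return np.where(collisions, most_frequent(shifted), shifted)

def anonymize_inner_octet(octet, epsilon=0.5, sensitivity=255.0):
    """Octets 2-4: Laplace (differential privacy) perturbation; out-of-range
    results are replaced with a generalized in-range value (steps (ii)-(iv))."""
    b = sensitivity / epsilon
    noisy = np.rint(octet + rng.laplace(0.0, b, size=octet.shape)).astype(int)
    in_range = (noisy >= 0) & (noisy <= 255)
    fallback = most_frequent(noisy[in_range]) if in_range.any() else 0
    return np.where(in_range, noisy, fallback)

octet1 = np.array([141, 141, 142, 40])
octet2 = np.array([121, 121, 200, 15])
print(anonymize_octet_1(octet1))
print(anonymize_inner_octet(octet2))
```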

IV. RESULTS

Preliminary results are presented in this section. However, due to space limitations in this publication, only results for the timestamp and IP address attributes are presented. Real 2014 network trace (NetFlow) data provided by Los Alamos National Laboratory were used in this experiment. A total of 500,000 network flow records were anonymized in this study. Microdata obfuscation techniques were applied in the anonymization process. Each attribute of the NetFlow trace was anonymized separately.

A. Timestamp anonymization and usability results

Descriptive statistical analysis was done on both the original and anonymized data sets, as shown in Table II. The aim was to study the statistical traits of both data sets and show any similarities. In this case, the statistical traits of the anonymized data show an augmentation of the original data, in effect generating a synthetic data set. For instance, the original means of the start time and end time were 1123355142 and 1123355214 respectively, while those of the anonymized data set were 1944808589 and 1944808714. The differences between the anonymized and original data were 821453447 and 821453500 respectively. A larger difference might indicate more privacy and less usability, while a smaller difference might indicate better usability but less privacy. The results presented in Table II indicate a middle ground, with both privacy and usability needs met after trade-offs (the difference).

TABLE II. STATISTICAL TRAITS OF ORIGINAL AND ANONYMIZED TIMESTAMP DATA

However, to meet the requirements of different users of the anonymized data, a fine-tuning of the parameters in the anonymization heuristics would need to be done. Additionally, the normalized Shannon entropy results, as shown in Table II, were similar for both the original and anonymized data, at approximately 0.77 and 0.76 for the start and end times respectively. The entropy results indicate that the distortion and uncertainty in the two data sets might be similar. While the entropy results might be good for usability, it could likewise be argued that the privacy levels might be inadequate, since the two data sets are similar in that regard. However, the correlation values between the anonymized and original data were 0.532 and 0.534 for the start and end time attributes respectively. These results could indicate that, while correlations exist between the two data sets, the relationship is not that strong since the values do not approach 1.

Figure 2: K-means clustering results for the original start and end time data.


The results might indicate that privacy is maintained in the anonymized data, with an acceptable level of usability. In Figure 2, results from clustering the original network trace data (timestamp attribute) are presented. The x-axis in Figure 2 represents the start time, while the y-axis represents the end time of the activity in the network trace. The value of k for the k-means was set to 5 in this experiment. From an anecdotal point of view, we can see that the clustering results in Figure 2 have their own skeletal structure. However, this is not the case in Figure 3. In Figure 3, data privacy using noise addition was applied idealistically, without much consideration given to the issue of usability.

Figure 3: Idealistic privacy application and clustering results.

An anecdotal view of the results in Figure 3 might point to better privacy, since the skeletal cluster structure of the original data was dismantled and replaced with a new skeletal cluster structure.

Figure 4: K-means clustering for the anonymized start and end time data.

However, usability remains a challenge, as the anonymized clustering results are far from being close to the original clustering. In the case of this study, the aim was to obtain clustering results with better usability. Therefore, a re-tuning of the parameters in the data privacy procedure is done to achieve better usability. On the other hand, the goal of using cluster analysis with k-means was to analyze how the unlabeled original network trace data would perform in comparison to the anonymized data. Furthermore, the Davies-Bouldin criterion shows a value of 0.522, as depicted in Table II, indicating how well the clustering algorithm (k-means) performed on the original timestamp (start and end times) data. In Figure 4, clustering results (with k=5 for the k-means) for the anonymized data are presented, with the x-axis showing the start time and the y-axis the end time.

Figure 5: K-means cluster performance showing the average within-centroid distance and the number of items in each cluster.

The Davies-Bouldin criterion for the clustering performance on the anonymized data was 0.393, as shown in Table II, a value lower than that of the original data and an indication of better clustering. However, while an anecdotal view of the plots shows that the cluster results look similar, the number of items in each cluster in the anonymized data differs from that of the original, as shown in Figure 5. For instance, in Figure 5, the number of items in cluster 0 for the original data is 310678, while that of the anonymized data is 291002. The trade-off would be the difference of 19676 items. The challenge remains how to effectively balance anonymity and usability requirements through trade-offs. In this case, if the usability threshold is not met, the curator can fine-tune the anonymization parameters. The average within-centroid distance returned a lower value for the anonymized data, at 77865, than for the original data, at 157093, with the lower value indicating better clustering, as shown in Figure 5.

B. Source and destination IP address anonymity results

The IP address remains a challenging attribute to anonymize due to the finite nature of IP addresses. Each octet is limited to a range of 0 to 255, and obfuscation is constricted to that range. As we hinted earlier, it would make no sense to have octet values ranging between 270 and 450, for instance. In this section, we present preliminary results on the anonymization and usability of the source and destination IP attribute values using the heuristics in Section III. Correlation: the correlation between the original and anonymized data, as shown in Table III, for the first octet of the source and destination IP shows values of 0.9 and 1 respectively. These strong correlation values are indicative of a strong linear relationship between the original and anonymized octet 1 data. The first octet of the IP address was anonymized using noise addition and generalization to keep the flow structure similar to the original. Since a partial prefix-preserving anonymization was used, it is noteworthy that there are strong correlation values between the original and anonymized data for the first-octet IP values.

TABLE III. STATISTICAL TRAITS OF ORIGINAL AND ANONYMIZED SOURCE AND DESTINATION IP ADDRESSES

Our view is that a researcher could still derive general network information from the flow structure presented by the first octet of the IP address without compromising the specifics of the other three inner octets. The correlation between the anonymized and original data for octets 2, 3, and 4 shows values of 0 for the destination IP addresses and minimal values of -0.081, 0.093, and 0.213 for the source IP addresses, indicating a very weak relationship between the anonymized and original data for those octets. However, the very low correlation values might be a good indicator of stronger privacy, since we employed differential privacy in the anonymization of octets 2, 3, and 4. Therefore, the correlation between the anonymized and original data would be nonexistent, or at least very minimal, due to the differential privacy randomization. Hence the partial prefix-preserving heuristic works in this case: the user of the anonymized data is only able to derive information from the first octet, while all other internal IP address information is kept confidential.

Entropy: the Shannon entropy test was done on the IP addresses of both the original and anonymized data to study the uncertainty and randomness in the data sets. The normalized Shannon entropy values range between 0 and 1, with 0 indicating certainty and 1 indicating uncertainty. As shown in Table III and Figure 6, the entropy value for octet 1 in both the original and anonymized data is approximately 0.1, indicating high certainty of values and thus maintenance of flow in the first octet. However, there is much less certainty in octets 3 and 4 of the original data and in octets 2, 3, and 4 of the anonymized data, though the anonymized entropy values are lower than the original ones. Nevertheless, octet 2 in the original data provides more certainty than octet 2 in the anonymized data. While the entropy levels in octets 3 and 4 of the original data seem higher than those of the anonymized data, overall, octets 2, 3, and 4 in the anonymized data provide more distributed uncertainty, better randomness, and thus better anonymity. Yet still, we constrained the random values generated for octets 2, 3, and 4 during the differential privacy procedure not to exceed 255. An octet value of 355 or 400 would affect the usability of the anonymized IP address data. However, it could be argued that the certainty levels are maintained in octet 1 for both the original and anonymized data, with distortion on octets 2, 3, and 4 in the anonymized data, indicating that the flow structure is kept, and thus partial prefix-preserving anonymity might be achieved.

Figure 6: Normalized Shannon entropy values for the original and anonymized IP addresses.

Frequency distribution histogram analysis: furthermore, we did a frequency analysis to compare the distribution of values in each octet of the IP address for both the original and anonymized IP addresses. For the original data, the number of items in octet 1 between 40 and 45, that is, source IP addresses that start with octet values 40 to 45, came to approximately 400,000 out of 500,000 records, as shown in Figure 7. Similar results were observed for the destination IP addresses, with about 300,000 items with octet 1 values of 40 to 45, as illustrated in Figure 8. With the exception of octet 2, the values in octets 3 and 4 are distributed across the range 0 to 85 in the original IP address data; this correlates with the results shown in Figure 6, with higher entropy values for octets 3 and 4 in the original data, indicating more uncertainty. The x-axis in each graph represents the IP octet values, and the y-axis shows the frequency of each of those octet values. However, a look at the anonymized IP address data shows that octet 1 had about 390,000 values beginning with 200, as shown in Figures 9 and 10 for the source and destination IP address data respectively. These results show the effect of the generalization used in the obfuscation of octet 1 of the original data. The values in octets 2, 3, and 4 were distributed across the 0 to 255 range, with the highest concentration around the octet value 190, due to the constraints placed on the differential privacy results to prevent the return of values greater than 255. As mentioned earlier, it would not make much sense to have differential privacy results that exceed 255. For octets 2, 3, and 4, the Laplace shape of the distribution is kept, owing to the noise distribution used in the differential privacy process.

Figure 7: Frequency distribution for the original source IP octet values.

Figure 8: Frequency distribution for the original destination IP octet values.

Our recommendation as a result of this study is that curators give strong consideration to a privacy engineering approach during the anonymization process.

V. CONCLUSION

Anonymizing network traces while maintaining an acceptable level of usability remains a challenge, especially when employing privatization techniques designed for microdata obfuscation. Moreover, obfuscating network traces remains problematic because IP addresses and octet values are finite. Furthermore, generalized anonymization approaches fail to deliver specific solutions, as each entity will have unique data privacy and usability requirements, and the data in most cases have varying characteristics to be considered during the obfuscation process. In this study, we have provided a review of the literature, pointing out some of the ongoing challenges in network trace anonymization over the last ten years. We have suggested usability-aware anonymization heuristics that employ microdata privacy techniques while taking into consideration the usability of the anonymized network trace data. Our preliminary results show that, with trade-offs, it might be possible to generate anonymized network traces on a case-by-case basis using microdata anonymization techniques such as differential privacy, k-anonymity, generalization, and multiplicative noise addition.

Figure 9: Frequency distribution for anonymized source IP octet values.

In the initial stage of the privacy engineering process, the curators could gather privacy and usability requirements from the stakeholders involved, including both the policy makers and the anticipated users (researchers) of the anonymized network trace data. The curators could then model the most applicable approach given the trade-offs, on a case-by-case basis. The generated anonymization model could then be implemented across the enterprise for uniformity and the prevention of information leakage attacks. On the limitations of this study, focus was placed on usability-aware anonymization of network trace data and not on the types of attacks on anonymized network traces. While some consideration and mention of anonymization attacks was given in this study, a focus on de-anonymization attacks was beyond the scope of this study and is left for future work.

Figure 10: Frequency distribution for anonymized destination IP octet values.

ACKNOWLEDGMENT

We would like to express our appreciation to the Los Alamos National Laboratory, and more specifically the Advanced Computing Solutions Group, for making this work possible.

REFERENCES

[1] D. A. Maltz, J. Zhan, G. Xie, H. Zhang, G. Hjálmtýsson, A. Greenberg, and J. Rexford, “Structure preserving anonymization of router configuration data”, In Proceedings of the 4th ACM SIGCOMM Conference on Internet Measurement (IMC '04), 2004, pp. 239-244.

[2] A. Slagell, J. Wang, and W. Yurcik, "Network log anonymization: Application of crypto-pan to cisco netflows.", In Proceedings of the Workshop on Secure Knowledge Management, 2004.

[3] M. Bishop, R. Crawford, B. Bhumiratana, L. Clark, and K. Levitt, "Some problems in sanitizing network data.", 15th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, 2006, pp. 307-312.

[4] S.E. Coull, C.V. Wright, F. Monrose, M.P. Collins, and M.K. Reiter, "Playing Devil's Advocate: Inferring Sensitive Information from Anonymized Network Traces." In NDSS, 2007, vol. 7, pp. 35-47.

[5] B.F. Ribeiro, W. Chen, G. Miklau, and D.F. Towsley, "Analyzing Privacy in Enterprise Packet Trace Anonymization." In NDSS, 2008.

[6] S. Gattani and T.E. Daniels, “Reference models for network data anonymization”, In Proceedings of the 1st ACM workshop on Network data anonymization (NDA '08), 2008, pp. 41-48.

[7] J. King, K. Lakkaraju, and A. Slagell. "A taxonomy and adversarial model for attacks against network log anonymization." In Proceedings of the 2009 ACM symposium on Applied Computing, 2009, pp. 1286-1293.

[8] S.E. Coull, F. Monrose, M.K. Reiter, M. Bailey, "The Challenges of Effectively Anonymizing Network Data," Conference For Homeland Security, CATCH 2009, pp.230-236.

[9] M. Foukarakis, D. Antoniades, and M. Polychronakis, “Deep packet anonymization”, In Proceedings of the Second European Workshop on System Security (EUROSEC '09). ACM, 2009, pp. 16-21.

[10] A. Sperotto, G. Schaffrath, R. Sadre, C. Morariu, A. Pras, and B. Stiller, "An overview of IP flow-based intrusion detection." Communications Surveys & Tutorials, IEEE 12, no. 3, 2010, pp. 343-356.

[11] M. Burkhart, D. Schatzmann, B. Trammell, E. Boschi, and B. Plattner. "The role of network trace anonymization under attack.", ACM SIGCOMM Computer Communication Review 40, no. 1, 2010, pp. 5-11.

[12] F. McSherry, and R. Mahajan, "Differentially-private network trace analysis.", ACM SIGCOMM Computer Communication Review 41.4, 2011, pp. 123-134.

[13] F. McSherry, and R. Mahajan., "Differentially-private network trace analysis.", ACM SIGCOMM Computer Communication Review 41, no. 4, 2011, pp. 123-134.

[14] R.R. Paul, V.C. Valgenti, and M. Kim, "Real-time Netshuffle: Graph distortion for on-line anonymization," Network Protocols (ICNP), 19th IEEE International Conference on, 2011, pp. 133-134.

[15] D. Riboni, A. Villani, D. Vitali, C. Bettini, L.V. Mancini, "Obfuscation of sensitive data in network flows," INFOCOM, 2012 Proceedings, IEEE, 2012, pp.2372-2380.

[16] W. Qardaji and N. Li, "Anonymizing Network Traces with Temporal Pseudonym Consistency." IEEE 32nd International Conference on Distributed Computing Systems Workshops (ICDCSW), 2012, pp. 622-633.

[17] M. Mendonca, S. Seetharaman, and K. Obraczka, "A flexible in-network ip anonymization service.", In Communications (ICC), 2012 IEEE International Conference, pp. 6651-6656.

[18] S. Jeon, J-H. Yun, and W-N. Kim, “Obfuscation of Critical Infrastructure Network Traffic using Fake Communication”, Annual Computer Security Applications Conference (ACSAC) 2013, Poster.

[19] M. Nassar, B. al Bouna, and Q. Malluhi, "Secure Outsourcing of Network Flow Data Analysis.", In Big Data (BigData Congress), 2013 IEEE International Congress, 2013, pp. 431-432.

[20] C. Xiaoyun, S. Yujie, T. Xiaosheng, H. Xiaohong, and M. Yan, "On measuring the privacy of anonymized data in multiparty network data sharing.", Communications, China 10, no. 5, 2013, pp. 120-127.

[21] Y.-D. Lin, P.-C. Lin, S.-H. Wang, I.-W. Chen, and Y.-C. Lai, "PCAPLib: A system of extracting, classifying, and anonymizing real packet traces.", IEEE Systems Journal, Issue 99, pp. 1-12.

[22] J. Stanek, L. Kencl, and J. Kuthan, "Analyzing anomalies in anonymized SIP traffic.", In 2014 IFIP Networking Conference, 2014, pp. 1-9.

[23] D. Riboni, A. Villani, D. Vitali, C. Bettini, and L.V. Mancini, "Obfuscation of Sensitive Data for Incremental Release of Network Flows," IEEE Transactions on Networking, Issue 99, 2014, pp. 1.

[24] T. Farah, and L. Trajkovic, "Anonym: A tool for anonymization of the Internet traffic." In IEEE 2013 International Conference on Cybernetics (CYBCONF), 2013, pp. 261-266.

[25] A.J. Slagell, K. Lakkaraju, and K. Luo, "FLAIM: A Multi-level Anonymization Framework for Computer and Network Logs." In LISA, vol. 6, 2006, pp. 3-8.

[26] J. Xu, J. Fan, M.H. Ammar, and Sue B. Moon, "Prefix-preserving ip address anonymization: Measurement-based security evaluation and a new cryptography-based scheme.", In 10th IEEE International Conference on Network Protocols, 2002, pp. 280-289.

[27] M. Burkhart, D. Brauckhoff, M. May, and E. Boschi, "The risk-utility tradeoff for IP address truncation." In Proceedings of the 1st ACM workshop on Network data anonymization, 2008, pp. 23-30.

[28] W. Yurcik, C. Woolam, G. Hellings, L. Khan, B. Thuraisingham, "Measuring anonymization privacy/analysis tradeoffs inherent to sharing network data", IEEE Network Operations and Management Symposium, 2008, pp.991-994.

[29] V. Ciriani, S.D.C. Vimercati, S. Foresti, and P. Samarati, “Theory of privacy and anonymity”, In M. J. Atallah & M. Blanton (Eds.), Algorithms and Theory of Computation Handbook, CRC Press, 2009, pp. 18-33.

[30] P. Samarati and L. Sweeney, “Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression”, Technical Report SRI-CSL-98-04, SRI Computer Science Laboratory, 1998.

[31] L. Sweeney, “Achieving k-anonymity privacy protection using generalization and suppression”, International Journal of Uncertainty Fuzziness and Knowledge-Based Systems, 10(5), 2002, pp.571–588.

[32] T. Dalenius and S.P. Reiss, “Data-swapping: A technique for disclosure control”, Journal of Statistical Planning and Inference, 6(1), 1982, pp. 73–85.

[33] J. Kim, “A Method For Limiting Disclosure in Microdata Based Random Noise and Transformation”, In Proceedings of the Survey Research Methods, American Statistical Association, Vol. A, 1986, pp. 370–374.

[34] J. Kim and W.E. Winkler, “Multiplicative Noise for Masking Continuous Data”, Research Report Series, Statistics #2003-01, Statistical Research Division. 2003, Washington, D.C. Retrieved from http://www.census.gov/srd/papers/pdf/rrs2003-01.pdf

[35] C. Dwork, “Differential Privacy”, In M. Bugliesi, B. Preneel, V. Sassone, & I. Wegener (Eds.), Automata languages and programming, Vol. 4052, 2006, pp. 1–12. Springer.

[36] K. Mivule, “An Investigation Of Data Privacy and utility using machine learning as a gauge”, Dissertation, Computer Science Department, Bowie State University, 2014, ProQuest No: 3619387.

[37] M.H. Dunham, “Data Mining Introductory and Advanced Topics”, 2003, pp. 58–60, 97–99. Upper Saddle River, New Jersey: Prentice Hall.

[38] K. Mivule, “Utilizing noise addition for data privacy, an overview”, In Proceedings of the International Conference on Information and Knowledge Engineering (IKE), 2012, pp. 65-71.

[39] S.E. Coull, C.V. Wright, A.D. Keromytis, F. Monrose, and M.K. Reiter, “Taming the devil: Techniques for evaluating anonymized network data”, In Network and Distributed System Security Symposium, 2008, pp. 125-135.